### Hierarchical Clustering

One limitation of K-means clustering is that the number of clusters must be chosen in advance. What if you don't know how many clusters are in your data? Hierarchical clustering instead builds a tree (a dendrogram) that links the data points according to their pairwise distances. The number and composition of clusters are decided afterwards, by cutting the tree at a chosen height or into a chosen number of groups.
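As a minimal sketch of that idea, the snippet below builds a tree on six made-up 2-D points and cuts it two ways. The points, the height threshold, and the variable names are illustrative assumptions and are not part of the news analysis.

```r
# Toy data: two well-separated groups of 2-D points (made up for illustration)
pts <- rbind(
  c(0.0, 0.0), c(0.1, 0.2), c(0.2, 0.1),   # group near the origin
  c(5.0, 5.0), c(5.1, 4.9), c(4.9, 5.2)    # group near (5, 5)
)
rownames(pts) <- paste0("p", 1:6)

# Build the tree from pairwise Euclidean distances
# (complete linkage is hclust's default; it is spelled out here for clarity)
tree <- hclust(dist(pts), method = "complete")

# Cut the same tree two ways: into a target number of clusters,
# or at a height threshold
by_k <- cutree(tree, k = 2)
by_h <- cutree(tree, h = 1)

print(by_k)
print(by_h)
```

Both cuts recover the same two groups here because the gap between the groups is much larger than the spread within them. On real data the two approaches can disagree, which is why looking at the dendrogram first is useful.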

For this analysis, I'm going to build hierarchical clusterings from the WorldNewsAPI TF-IDF datasets (article titles and article text), to see if I can cluster the articles by topic.

In [1]:
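library(lsa)        # cosine()
library(tidyverse)  # attaches ggplot2, dplyr, and tidyr (replace_na)

# TF-IDF tables for article titles and article text; the first CSV column becomes the row names
worldnews_title <- read.csv('../data/wdms/tfidf/worldnewsapi/lemmed/title.csv', row.names=1)
worldnews_text <- read.csv('../data/wdms/tfidf/worldnewsapi/lemmed/text.csv', row.names=1)

# Columns 6 onward are used as the term weights, with missing values set to 0.
# apply() over rows returns its result transposed, so wdm1 and wdm2 are
# term-by-document matrices. That is the orientation cosine() expects,
# because it compares columns.
wdm1 <- apply(as.matrix(worldnews_title[,6:ncol(worldnews_title)]), 1, replace_na, 0)
wdm2 <- apply(as.matrix(worldnews_text[,6:ncol(worldnews_text)]), 1, replace_na, 0)
head(wdm1)
head(wdm2)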

In [None]:
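# cosine() returns a document-by-document similarity matrix, but hclust() needs distances.
# Map each similarity s to 10^(1 - s), so s = 1 gives 1 and s = 0 gives 10.
# Undefined similarities (NA/NaN, e.g. from all-zero documents) are set to 10,
# the same value as zero similarity.
cosdist <- function(wdm){
    d <- 10^(1 - cosine(wdm))
    d[is.na(d)] <- 10
    as.dist(d)
}
dist1 <- cosdist(wdm1)
dist2 <- cosdist(wdm2)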
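To see what the similarity-to-distance conversion above does, here is a small sketch on a made-up term-by-document matrix. It uses a hypothetical base-R helper, `cosine_sim`, in place of `lsa::cosine` so that it runs without extra packages. The helper and the toy weights are assumptions for illustration only.

```r
# Hypothetical helper: column-wise cosine similarity in base R
# (same orientation as lsa::cosine, which compares columns)
cosine_sim <- function(m) {
  norms <- sqrt(colSums(m^2))
  crossprod(m) / outer(norms, norms)
}

# Toy term-by-document matrix: rows are terms, columns are documents
tdm <- cbind(
  doc_a = c(1, 1, 0, 0),
  doc_b = c(2, 2, 0, 0),   # same direction as doc_a
  doc_c = c(0, 0, 1, 1),   # shares no terms with doc_a
  doc_d = c(0, 0, 0, 0)    # empty document: cosine is undefined (NaN)
)

sim <- cosine_sim(tdm)

# Same transformation as cosdist: 10^(1 - similarity), undefined pairs set to 10
d <- 10^(1 - sim)
d[is.na(d)] <- 10
d <- as.dist(d)

print(round(sim, 2))
print(d)
```

Documents pointing in the same direction end up at the smallest distance (about 1), while documents with no shared terms, or with undefined similarity, end up at 10. Note that identical documents get a distance of 1 rather than 0 under this mapping. `as.dist` ignores the diagonal, so this only affects the merge heights in the dendrogram.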

In [None]:
# Dendrogram for the article titles (hclust() uses complete linkage by default)
clust1 <- hclust(dist1)
plot(clust1, labels=FALSE)

In [None]:
# Dendrogram for the article text
clust2 <- hclust(dist2)
plot(clust2, labels=FALSE)

In [None]:
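# Cut the text dendrogram into 10 clusters and attach the labels to the documents.
# (Named cluster_labels to avoid masking base R's labels() function.)
cluster_labels <- cutree(clust2, k=10)
worldnews_text$hclust <- cluster_labels
head(worldnews_text)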

In [None]:
# Corpus-wide tfidf total for each term (rows of wdm2 are terms)
total <- rowSums(wdm2)
for(l in 1:10){
    # Sum the tfidf scores within each document cluster and divide by the tfidf scores across the whole corpus.
    # This gives a metric showing which terms appear disproportionately often in each group compared to the corpus overall.
    in_cluster <- wdm2[, worldnews_text$hclust==l, drop=FALSE]
    # data.frame() rather than cbind() keeps tfidf numeric, so it sorts by value
    wdm_l <- data.frame(term=rownames(wdm2), tfidf=rowSums(in_cluster)/total) %>%
        arrange(desc(tfidf)) %>%
        head(20)  # display cutoff added here for readability; adjust as needed
    p <- ggplot(wdm_l, aes(x=reorder(term, tfidf), y=tfidf)) +
        geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
        coord_flip() +
        xlab("") +
        ggtitle(paste("Cluster", l)) +
        theme_bw()
    print(p)  # plots are not printed automatically inside a for loop
}
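To make the ratio used in the loop above concrete, here is a small sketch on a made-up term-by-document matrix with two hypothetical clusters. The weights, the labels, and the name `share_1` are illustrative assumptions, not values from the WorldNewsAPI data.

```r
# Toy term-by-document matrix (made-up TF-IDF weights):
# rows are terms, columns are documents
tdm <- rbind(
  election = c(0.9, 0.8, 0.0, 0.1),
  vote     = c(0.7, 0.6, 0.0, 0.0),
  goal     = c(0.0, 0.1, 0.9, 0.8),
  match    = c(0.1, 0.0, 0.7, 0.9)
)
colnames(tdm) <- paste0("doc", 1:4)

# Hypothetical cluster labels, one per document
cluster <- c(1, 1, 2, 2)

# Corpus-wide weight of each term
total <- rowSums(tdm)

# Share of each term's total weight that falls inside cluster 1
share_1 <- rowSums(tdm[, cluster == 1, drop = FALSE]) / total

print(round(sort(share_1, decreasing = TRUE), 2))
```

A share close to 1 means nearly all of a term's weight sits in that cluster, and a share close to 0 means the term mostly appears elsewhere. One caveat of this ratio is that a rare term appearing in only one document also scores 1, so it may be worth filtering on a minimum corpus-wide weight before ranking.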