## Using networks for text analysis

Graphs are a powerful tool for visualising relationships between things, and networks are a kind of graph. Social scientists probably think first of social networks, which represent relationships between people, but we can also use networks to represent relationships in texts. One common approach is to represent collocation relationships between words as a network; LADAL provides an excellent [tutorial](https://slcladal.github.io/coll.html#3_Visualizing_Collocations) on this topic. Here, we will use a network to visualise how words are distributed across several groups of related texts, helping us to answer the question: does the presence or absence of particular words characterise one group of texts compared to the other groups?

## Getting set up

We will use two packages in this session: **tm**, which is a package of text mining tools, and **igraph**, which is a graph manipulation and plotting package (also available for Python).

If you do not already have these packages installed, you will need to run the code in the next cell. If you already have the packages installed, you can skip to the following cell.

In [None]:
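## install packages (package names must be quoted)
install.packages("tm")
install.packages("igraph")
## stemDocument() in tm relies on the SnowballC package
install.packages("SnowballC")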

In [None]:
## activate packages
library(tm)
library(igraph)

## Loading the data

After loading the data file, we create five texts, one for each of the author groups represented (Federal Government, Higher Education, Individuals, Local Government and Non-Government Organisations), by joining all the submissions in each group. These five texts are then assembled into a single vector, from which a **tm** corpus object is created. As usual, it's good to check that the results are what we expect (messy text in this case!).

In [None]:
## load data (you will need a full path if this file is not in your working directory)
senate_data <- read.csv('submission_file_categorisation_content.csv')

## join text from each author group 
fed_gov_text <- paste(subset(senate_data, Category == "FedGov")[,7], collapse = ' ')
hed_text <- paste(subset(senate_data, Category == "HEd")[,7], collapse = ' ')
individ_text <- paste(subset(senate_data, Category == "Individ")[,7], collapse = ' ')
loc_gov_text <- paste(subset(senate_data, Category == "LocalGov")[,7], collapse = ' ')
ngo_text <- paste(subset(senate_data, Category == "NGO")[,7], collapse = ' ')

## create Corpus object
texts <- c(fed_gov_text, hed_text, individ_text, loc_gov_text, ngo_text)
docs <- Corpus(VectorSource(texts))

## inspect
str(as.character(docs[[2]]))

## Preprocessing

One reason we are using the **tm** package is that it makes it easy to carry out some cleaning. Most of the steps here make sure that we will count the various types and tokens appropriately. Warning messages may appear as these steps are executed, but you don't need to worry about them. Again, we inspect the results; stemmed text looks strange the first time you see it.

In [None]:
## preprocessing
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)

## inspect again
str(as.character(docs[[2]]))

## Shaping the data

The second reason for using the **tm** package is that it has functions to generate matrices of the occurrence of terms in documents. Here, we create a term-document matrix (TDM), with one row per term and one column per document. We could also create a document-term matrix, which is simply the transposed version of the TDM, but the TDM is easier for our purposes.

In [None]:
## create a term-document matrix, make it a data frame
TDM <- TermDocumentMatrix(docs)
dim(TDM)
TDM <- as.matrix(TDM)
TDM_df <- as.data.frame(TDM)
colnames(TDM_df) <- c('FedGov', 'HEd', 'Indiv', 'LocGov', 'NGO')
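
To see what the two layouts look like, here is a minimal sketch in base R on two invented mini-documents. It does not use **tm** (whose tokenisation and defaults differ), and the names `toy_docs`, `toy_tdm` and `toy_dtm` are hypothetical, used only for this illustration.

```r
## two invented mini-documents (illustration only, not the Senate data)
toy_docs <- c(d1 = "water policy water", d2 = "policy reform")

## split on spaces and record which document each token came from
toy_tokens <- strsplit(toy_docs, " ", fixed = TRUE)
toy_long <- data.frame(
  term = unlist(toy_tokens, use.names = FALSE),
  doc = rep(names(toy_tokens), lengths(toy_tokens))
)

## term-document layout: one row per term, one column per document
toy_tdm <- table(toy_long$term, toy_long$doc)

## document-term layout is simply the transpose
toy_dtm <- t(toy_tdm)

print(toy_tdm)
print(toy_dtm)
```

With terms in rows, each row can be read as the profile of one word across the groups, which is the view we need in the following steps.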

We are interested in how words are distributed across submissions from different sources. For this purpose, words which appear in all five grouped documents are not interesting. Words which appear in only one grouped document may be interesting, but visualising them as a network will not be. Therefore we need to include in our data table information about how many groups of documents each term appears in. We also want the total number of occurrences of each term so that we can order the data by frequency. Finally, we get a count of the number of tokens in each group of documents; later we will use this to calculate normalised frequencies of terms.

In [None]:
## function gets number of docs where term occurs
in_docs <- function(rowx) {
  number <- 5
  for (value in rowx) {
    if (value == 0) {number <- number - 1}
  }
  return(number)
}

## add columns to data frame: total number of occurrences in docs, number of docs in which a term appears
## (both are computed from the five count columns only)
TDM_df$freq <- apply(TDM_df[, 1:5], sum, MARGIN = 1)

in_doc_vector <- apply(TDM_df[, 1:5], in_docs, MARGIN = 1)
TDM_df$inDocs <- in_doc_vector
str(TDM_df)

## get token counts for each group of docs
token_counts <- apply(TDM_df[, 1:5], sum, MARGIN = 2)
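
The same three quantities can also be computed with vectorised base R functions. The sketch below uses an invented three-term table and is offered as an alternative to the loop-based `in_docs()` function, not as the method used in this session; `toy` and its values are hypothetical.

```r
## invented counts for three terms in three groups (illustration only)
toy <- data.frame(
  FedGov = c(4, 0, 2),
  HEd = c(1, 3, 0),
  NGO = c(0, 5, 0),
  row.names = c("fund", "research", "tax")
)
count_cols <- c("FedGov", "HEd", "NGO")

## total occurrences of each term across the groups
toy$freq <- rowSums(toy[, count_cols])

## number of groups in which each term occurs at least once
toy$inDocs <- rowSums(toy[, count_cols] > 0)

## number of tokens in each group
toy_token_counts <- colSums(toy[, count_cols])

print(toy)
print(toy_token_counts)
```

Naming the count columns explicitly keeps the derived `freq` and `inDocs` columns out of the sums.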


## Data for plotting a graph

The input for **igraph** to produce a graph object is a list of edges, that is, a list of pairs of nodes where each pair represents the link between two nodes. The code here is structured so that we can set three variables and easily produce plots for various combinations of them. The variables are how many document groups the terms appear in, how many words we want to include, and whether we will take words from the top or the bottom of the frequency distribution.

In [None]:
## set variables for getting plot data
## words in how many groups of documents? (2,3,4)
doc_groups <- 3
## number of words to include
word_count <- 20
## most frequent or least frequent? values 'most', 'least'
freq <- 'most'

First, we extract the relevant rows from our main data table, order them and then take the relevant number of words from one end of the subset.

In [None]:
## get subset of data, order by frequency of terms
plot_data <- subset(TDM_df, TDM_df$inDocs == doc_groups)
plot_data <- plot_data[order(plot_data$freq),]
if (freq == 'most') {
  ## tail() returns exactly word_count rows (or all rows if there are fewer)
  plot_data <- tail(plot_data, word_count)
} else if (freq == "least") {
  plot_data <- head(plot_data, word_count)
}
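
The ordering and selection step can be tried on a small invented table. This is only a sketch: `toy_freq` and its values are hypothetical, and `head()` and `tail()` are used here because they return exactly the requested number of rows, or all rows when fewer are available.

```r
## invented frequencies for four terms (illustration only)
toy_freq <- data.frame(
  freq = c(7, 2, 9, 4),
  row.names = c("fund", "tax", "research", "grant")
)

## order from least to most frequent; drop = FALSE keeps the data frame shape
toy_freq <- toy_freq[order(toy_freq$freq), , drop = FALSE]

## two most frequent and two least frequent terms
toy_most <- tail(toy_freq, 2)
toy_least <- head(toy_freq, 2)

print(rownames(toy_most))
print(rownames(toy_least))
```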

Then we work through the table of relevant data and make a pair for each term and document combination in which the term occurs. Skipping a document in which a term does not occur can leave NULL entries in the list for that term; we prune these before adding the term's edges to the cumulative list. We also calculate a normalised frequency of occurrence for each term, that is, its frequency per 10,000 tokens in the relevant (cleaned) group of documents. This information will allow us to show the strength of connections in our final plot.
Finally, we make the result a data frame.

In [None]:
## make edge table, includes edge weights (= normalised frequency)
edge_list <- list()
sources <- c('FedGov', 'HEd','Indiv','LocGov','NGO')

for (j in 1:nrow(plot_data)) {
  edges_y <- list()
  data_row <- plot_data[j,]
  for (i in 1:5) {
    if (as.numeric(data_row[i]) != 0) {
      edges_y[[i]] <- c(rownames(data_row), sources[i], data_row[i]/(token_counts[i]/10000))
    }
  }
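  ## drop NULL entries; logical indexing is safe even when there are none
  ## (x[-which(...)] would return an empty list in that case)
  edges_y <- edges_y[!sapply(edges_y, is.null)]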
  edges_y <- do.call(rbind, edges_y)
  edge_list <- rbind(edge_list, edges_y)
}

edge_list_df <- as.data.frame(edge_list)
colnames(edge_list_df) <- c('term','doc','weight')
str(edge_list_df)
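
As a worked example of the normalisation, a term which occurs 4 times in a group of 2,000 tokens has a normalised frequency of 4 × 10,000 / 2,000 = 20 per 10,000 tokens. The sketch below builds the same kind of edge table without a loop, assuming invented counts and invented token totals; `toy_counts`, `toy_totals` and `toy_edges` are hypothetical names, and this is an alternative formulation rather than the code used in this session.

```r
## invented counts: 2 terms x 3 groups (illustration only)
toy_counts <- matrix(
  c(4, 0, 1, 3, 0, 5),
  nrow = 2,
  dimnames = list(c("fund", "research"), c("FedGov", "HEd", "NGO"))
)

## invented token totals for each group
toy_totals <- c(FedGov = 2000, HEd = 500, NGO = 10000)

## frequency per 10,000 tokens: divide each column by its group total
toy_per_10k <- sweep(toy_counts * 10000, 2, toy_totals, "/")

## row and column positions of the non-zero cells, one per edge
hits <- which(toy_counts > 0, arr.ind = TRUE)

## one row per term-document pair, with the normalised frequency as weight
toy_edges <- data.frame(
  term = rownames(toy_counts)[hits[, "row"]],
  doc = colnames(toy_counts)[hits[, "col"]],
  weight = toy_per_10k[hits]
)
print(toy_edges)
```

Because `weight` is an ordinary numeric column here, it can be used in later calculations without unlisting or converting it first.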

## Visualising our results

**igraph** has various functions which generate graph objects; one of them takes a data frame as input. The first two columns in the data frame must contain the edge list.
A bipartite graph has the property that nodes divide into two groups with edges only linking nodes from one group with nodes from the other group (i.e. there are no edges linking nodes in the same group). We check that our graph is bipartite and then assign the type labels generated by the checking function to the nodes in the graph. Attributes of the nodes and edges of the graph can be accessed via the V([graph_name]) and E([graph_name]) objects.
Now we can take a first look at our results. If you like, you could also view the default layout **igraph** gives this data: just remove the *layout* argument from the **plot()** function.

In [None]:
## make the graph object (graph_from_data_frame() is the current name for graph.data.frame())
bi_plot <- graph_from_data_frame(edge_list_df, directed = FALSE)

## check that the graph object is bipartite and assign type values to the nodes
bipartite_mapping(bi_plot)
V(bi_plot)$type <- bipartite_mapping(bi_plot)$type

## view basic plot (layout_as_bipartite is the current name for layout.bipartite)
plot(bi_plot, layout = layout_as_bipartite)
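
To make the bipartite property concrete, the following sketch checks two tiny invented edge lists with **igraph**. In the first, every edge links a term to a document group; in the second, an extra edge links two terms that share a group, which creates a triangle and breaks the property. The names `toy_edges`, `toy_ok` and `toy_bad` are hypothetical.

```r
library(igraph)

## invented edge list: every edge links a term to a document group
toy_edges <- data.frame(
  term = c("fund", "fund", "research"),
  doc = c("FedGov", "HEd", "HEd")
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

## $res says whether the graph is bipartite, $type gives a group for each node
toy_ok <- bipartite_mapping(toy_graph)
print(toy_ok$res)
print(toy_ok$type)

## add an edge between two terms: fund, research and HEd now form a triangle
toy_edges_bad <- rbind(toy_edges, data.frame(term = "fund", doc = "research"))
toy_bad <- bipartite_mapping(graph_from_data_frame(toy_edges_bad, directed = FALSE))
print(toy_bad$res)
```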

## Tweaking the plot

We can add more information to our visualisation by showing the strength of connection of edges, which here means frequency of occurrence of terms. The third column of our plotting data frame has the normalised frequencies for each term in the relevant group of documents; this has become the *weight* attribute of the edges object (**E(bi_plot)$weight**). There are various ways we can use this; the two obvious ones are to scale the width of the edges or to scale the colour of the edges. Here we will scale the colour, or more precisely, the transparency of the edges.
Transparency or *alpha* has a value between 0 (completely transparent) and 1 (completely opaque). We therefore need to set a scaling factor which takes our weight values and returns values in that range. We want the maximum weight value to end up close to 1, so we look at that value, do a simple calculation and then reset the weight values. To set transparency in an **igraph** plot, we have to use the **rgb()** colour setting function; this function takes four arguments: three values for the red, green and blue components of the colour (here we use straight red), and the alpha value.

In [None]:
## plot with edge transparency reflecting weight
## check max value for weight
max(unlist(E(bi_plot)$weight))

## choose a scaling factor which will give $weight a maximum value close to 1
## (0.022 keeps alpha at or below 1 for weights up to about 45; rgb() fails for alpha above 1, so adjust for your data)
E(bi_plot)$weight <- as.numeric(E(bi_plot)$weight) * 0.022

## use $weight to set edge transparency
plot(bi_plot, layout = layout_as_bipartite, edge.color = rgb(1,0,0, E(bi_plot)$weight))
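
The scaling step can be tried separately from the graph. This sketch uses invented weights and, instead of a hand-picked factor, divides by the maximum so that the largest weight maps to an alpha of exactly 1. That is a swapped-in alternative to the fixed factor used above, and `toy_weights` is a hypothetical name.

```r
## invented normalised frequencies (illustration only)
toy_weights <- c(5, 20, 40)

## rescale so that the largest weight becomes 1 (fully opaque)
toy_alpha <- toy_weights / max(toy_weights)

## red with varying transparency; rgb() returns one colour string per weight
toy_colours <- rgb(1, 0, 0, toy_alpha)

print(toy_alpha)
print(toy_colours)
```

Dividing by the maximum keeps every alpha value within the range **rgb()** accepts, whatever subset of terms is being plotted.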



We can make the graph more readable by doing without circles around the term nodes and by making the label text smaller (so that terms overlap less). Again, we do this by setting attributes of the V([graph_name]) object.

In [None]:
## no circles for terms (document nodes have type TRUE and keep their circles)
V(bi_plot)$shape <- ifelse(V(bi_plot)$type, "circle", "none")

## make label text smaller (this applies to all node labels)
V(bi_plot)$label.cex <- 0.6

## plot again
plot(bi_plot, layout = layout_as_bipartite, edge.color = rgb(1,0,0, E(bi_plot)$weight))