# Lab 6 - Text Analysis and Social Network Analysis

## Text analysis

What we call "text analysis" in this class is often called *natural language processing* or *NLP* within computer science. NLP comprises methods which enable computers to derive meaning from human language.

A field that has a lot of overlap with NLP is *machine learning* or *ML*. ML includes statistical methods that automatically detect patterns in data and use those patterns to make predictions about other data.

The first part of this workshop, on string manipulation, covers NLP using basic Python functionality. The second part will focus on some ML examples of NLP.

### Exploratory Data Analysis in Textual Data

As in the modeling exercise we did last week, the first thing we want to do once we get a dataset full of text is to explore that text. The simplest thing we can do is to begin by reading the text and checking what is contained within it. We read to understand the meaning of the text and to perhaps find some common themes or words.

Let's revisit the Canadian Tire Twitter dataset which we used briefly a few labs ago. These tweets were gathered by asking the Twitter Streaming API for any tweets mentioning Canadian Tire.

In [None]:
import pandas as pd
import numpy as np
df_ct = pd.read_csv('data/canadian-tire-twitter-sample.csv')

For this lab, we will use the `text` column to retrieve the tweet. However, it's possible that we could also look at the `retweeted_status` object, the `quoted_status` object, or the `full_text` field for tweets over 140 characters. Refer to Lab 4 to see how we can access those fields.

In [None]:
df_ct['text']

By default, `pandas` restricts how much of a text column we can view at any one time. To see these fields in full, we can use the `.values` attribute to give a list (technically an `ndarray`) of all the values. Since this behaves like a list, we can use slicing to take the first 20 elements.

In [None]:
df_ct['text'].values[0:20]

There are a few different themes here: Canadian Tire money, tools, and furniture. One of the most basic things we can do with text analysis is to count the number of times particular words appear. Let's try to count three words: *money*, *tools*, and *centre* (referring to the Canadian Tire Centre in Ottawa). The way I will do that is to use the `pandas` string method `.str.contains` to find out if the word appears in the text. If it does, `.str.contains` will return `True` for that tweet. Then, I will sum up all the `True` values. This works because Python internally represents `True` as the number 1 and `False` as the number 0.

In [None]:
df_ct['text'].str.contains('money').sum()

In [None]:
df_ct['text'].str.contains('tools').sum()

In [None]:
df_ct['text'].str.contains('centre').sum()

Of these three, Canadian Tire money comes up most often: 443 of 5000 tweets, or about 9% of all Canadian Tire tweets. People aren't really talking about tools, and they don't seem to be talking about the Canadian Tire Centre much either.

One important thing to note about the `.contains` method is that it is *case-sensitive*, which means it will only match strings which match in case. If we want to do a *case-insensitive* search, we can set the `case` argument to `False`.

In [None]:
df_ct['text'].str.contains('money', case = False).sum()

In [None]:
df_ct['text'].str.contains('tools', case = False).sum()

In [None]:
df_ct['text'].str.contains('centre', case = False).sum()

*Wow*, that made a *huge* difference for mentions of `centre`. People are very much talking about the Canadian Tire Centre, more than they are even talking about the Canadian Tire money!
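The counting logic used above can be sketched in plain Python. The three tweets below are made up for illustration and are not taken from the Canadian Tire dataset; the point is only to show how summing booleans yields a count and how lowercasing changes the result.

```python
# Hypothetical tweets, invented for illustration only
tweets = [
    "Heading to the Canadian Tire Centre tonight",
    "Found old Canadian Tire money in a drawer",
    "the centre was packed, spent all my money",
]

# Case-sensitive: 'centre' does not match 'Centre'
sensitive = sum('centre' in t for t in tweets)

# Case-insensitive: lowercase each tweet before searching
insensitive = sum('centre' in t.lower() for t in tweets)

print(sensitive, insensitive)
```

Each `'centre' in t` test is either `True` or `False`, so `sum` simply counts the tweets where the test passed.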

Beyond `.contains`, there are many, many string methods in `pandas`. The complete list of these and a tutorial can be [found here](http://pandas.pydata.org/pandas-docs/stable/text.html).

### Preprocessing and creating word counts

We need to know how to handle text for large-scale datasets. For that, text needs to go through several *preprocessing* steps before it can be passed to a statistical model.

There are two processes which we will start with. The first process is converting all of the words to lowercase. We saw above how case seems to mess things up. But on the level of meaning, the lowercase and uppercase forms of a word generally mean the same thing (though not always: an SMS saying 'Thank you.' means something different from one saying 'THANK YOU.').

In any case, we can do this using `pandas` string methods, in particular, `.lower`.

In [None]:
df_ct['new_text'] = df_ct['text'].str.lower()

In [None]:
df_ct['new_text']

Second, we're going to *tokenize* the text, meaning we separate all the meaningful *tokens* from each other. When we say tokens, we usually mean words. But tokens can also include certain kinds of punctuation which may be helpful to include. For our purposes, we can do this using the `.split` method.

In [None]:
df_ct['all_words'] = df_ct['new_text'].str.split()

In [None]:
df_ct['all_words']

This gave us a new column which contains a list of all words in the tweet. Now, we're going to create a dictionary (remember dictionaries?) which will count how many times a word occurs in the text.

In [None]:
wordcounts = {}

## first we loop through the rows
for row in range(df_ct['all_words'].shape[0]):
    
    ## next we loop through the words
    for word in df_ct['all_words'].values[row]:
        
        ## we need to put the word in the dictionary first
        if word not in wordcounts:
            wordcounts[word] = 0
            
        ## add one to the count!
        wordcounts[word] += 1

Lastly, let's make a new `DataFrame` from the dictionary.

In [None]:
## This code creates a list of tuples in the form (word, count)
wordcounts_tuples = [(k, v) for k, v in wordcounts.items()]

df_wordcounts = pd.DataFrame(wordcounts_tuples, columns = ['word', 'count'])

Let's sort these by count and see what we get.

In [None]:
df_wordcounts.sort_values('count', ascending = False)

Unsurprisingly, "canadian" and "tire" are the most popular words in the dataset. The next most common words are articles like "the" and "to". "rt" denotes "retweet" and is very high up. "centre" appears about 10th in the order.
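As an aside, the standard library's `collections.Counter` can do the same bookkeeping as the dictionary loop above. This is a minimal sketch on hypothetical token lists standing in for the rows of `df_ct['all_words']`; it is an alternative, not the approach the lab itself uses.

```python
from collections import Counter

# Hypothetical token lists, standing in for the rows of df_ct['all_words']
all_words = [
    ["rt", "canadian", "tire", "money"],
    ["canadian", "tire", "centre"],
    ["the", "canadian", "tire", "centre"],
]

wordcounts = Counter()
for words in all_words:
    wordcounts.update(words)   # adds one per occurrence of each word

# The three most frequent words with their counts
top_three = wordcounts.most_common(3)
print(top_three)
```

A `Counter` starts every missing key at zero, so the `if word not in wordcounts` check is no longer needed.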

Something that we'll explore below is removing all the *stopwords* from the text. Stopwords are words which appear very frequently in text and end up not adding much to our own subjective understanding of a string. Computationally, they appear often, which can also gum up statistical models.
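To make the idea concrete, here is a minimal sketch of stopword removal on a single tokenized tweet. The stopword set is tiny and hand-picked for this example (an assumption, not a standard list); real stopword lists are much longer.

```python
# A tiny, hand-picked stopword set (hypothetical; real lists are much longer)
stopwords = {"the", "to", "a", "at", "rt"}

# A hypothetical tokenized, lowercased tweet
tokens = ["rt", "going", "to", "the", "canadian", "tire", "centre"]

# Keep only the tokens that are not stopwords
content_tokens = [t for t in tokens if t not in stopwords]
print(content_tokens)
```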

**Exercise 1**

This is a small dataset full of tweets about four companies: Apple, Microsoft, Facebook, and Google. 

1. Load these data with this command:

In [None]:
df_companies = pd.read_csv('data/companies.csv')

<dd>2. Identify word counts for all four companies.</dd>
<dd>3. Create a new column called `lowercase` which converts all the tweets to lowercase.</dd>
<dd>4. Create a new column called `all_words` which splits the words up.</dd>
<dd>5. The following code creates the `wordcounts_companies_tuple` list which you can use to create a new data frame. Run it and create the word count data frame from `wordcounts_companies_tuple`.</dd>
<dd>6. Sort your new data frame and identify the most popular words.</dd>

In [None]:
wordcounts_companies = {}

## first we loop through the rows
for row in range(df_companies['all_words'].shape[0]):
    
    ## next we loop through the words
    for word in df_companies['all_words'].values[row]:
        
        ## we need to put the word in the dictionary first
        if word not in wordcounts_companies:
            wordcounts_companies[word] = 0
            
        ## add one to the count!
        wordcounts_companies[word] += 1
wordcounts_companies_tuple = [(k, v) for k, v in wordcounts_companies.items()]

### Using `scikit-learn` for text analysis 

We can handle a lot of the preprocessing steps which we did manually above by using `scikit-learn`. `scikit-learn` is a powerful machine learning and data science library. It contains a lot of tools to do the preprocessing that we did above, including changing to lowercase, tokenizing, and also removing common stop words (words like "the", "we", etc.). That will help with making our categorization a bit clearer below.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(stop_words = 'english', lowercase = True)
X    = vect.fit_transform(df_ct['text'])

`X` here is what is called a *document-term matrix*. Each row is a document and each column is a term (or word). This is the foundational data structure which we use to do most NLP analysis. `X` is stored in a sparse format, so we use the `.toarray` method to convert it to a regular array; displaying that array shows a few values of the matrix.

In [None]:
X.toarray()
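To see what a document-term matrix contains, it can help to build a very small one by hand. The sketch below uses three hypothetical documents and plain Python lists; it illustrates the data structure only and is not how `CountVectorizer` is implemented.

```python
# Hypothetical documents, already lowercased
docs = ["canadian tire money", "tire centre", "money money centre"]

# The vocabulary: every distinct term, sorted alphabetically
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per term, each cell a count
dtm = [[doc.split().count(term) for term in vocab] for doc in docs]

print(vocab)
for row in dtm:
    print(row)
```

Most cells are zero even in this toy case, which is why a sparse format is a sensible way to store such a matrix for thousands of tweets.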

The numbers which are generated from this process are called *features*. *Features* are what are often called "independent variables" or "covariates" in more traditional statistical analysis. For those of you who may have taken a stats class, imagine that the features are the numbers which we're going to use to analyze some change in a dependent variable.

Let's look at the features which are generated by this process.

In [None]:
## get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_names = np.array(vect.get_feature_names_out())
feature_names

In [None]:
df_features = pd.DataFrame(X.toarray(), columns = feature_names)

In [None]:
df_features

To get a better idea of what this looks like, let's look at a tweet in this dataset, say one which uses the word "money". 

In [None]:
df_ct['text'][2]

Now, let's see the column for `money`.

In [None]:
df_features['money']

If we want to see which words are used most in the dataset, we can take the sum of each word's counts across all documents, then sort the words by those totals from largest to smallest. Lastly, we use that ordering as an index into the <code>feature_names</code> array.

In [None]:
totals = np.sum(X.toarray(), axis = 0)
order  = np.argsort(totals)[::-1]
feature_names[order]
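The `argsort` step can be easier to follow on a small example. The terms and totals below are hypothetical; `np.argsort` returns the positions that would sort the array in ascending order, and reversing that result puts the largest total first.

```python
import numpy as np

# Hypothetical terms and their total counts across all documents
feature_names = np.array(["centre", "money", "rt", "tire"])
totals = np.array([3, 7, 1, 5])

ascending = np.argsort(totals)    # positions from smallest to largest count
order = ascending[::-1]           # reversed: largest count first

ranked = feature_names[order]     # terms from most to least frequent
print(ranked)
```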

## Classification

A major task of lots of NLP is labeling the content of a document. Twitter or Facebook, for instance, wants to classify whether a post might be relevant to you. A researcher might want to assess whether a policy document is more liberal or conservative. A brand might want to see if posts about them are positive or negative. This is where classification comes into view.

The process of classifying text documents is depicted in the image below.
![](img/supervised-learning.png)

First, there are a set of documents which are labeled manually, i.e. by a human. The label is called a *class*. The dataset which is labeled manually is called the *training set*. It's called a training set because the machine learns from this set and then applies the knowledge it gets from the set to new, unseen data. The training is done on words or features which are part of documents. The particular statistical model which is trained is called a *classifier*. Then the body of documents which is to be classified by the classifier is called a *test set*. For the test set, the classes are hidden or unknown to the classifier. It is doing its best to guess the correct classes.

Now, how do we actually know if the classifier did its job correctly? Usually, we have a test set for which we actually know the real labels, and we compare those real labels against the predicted ones. We then compute two metrics, *precision* and *recall*, which assess two different things.

![](img/precision-recall.png)

Precision measures what percentage of the predicted items are relevant, while recall measures what percentage of the relevant items are predicted.

Imagine this: you have a jar of coins. You want to go through the jar and pick out all the loonies and twoonies. One way of making sure you have all of the coins you want is to dump all the coins into your coin purse. In this case, your recall would be perfect (i.e. equal to 1) but your precision would be lousy. In the other case, you could search through the coins quickly with your hands and pick out whichever ones seem to pop out the quickest. You'll have much better precision here, but you might not get all the coins, so you would not have as good a recall.
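The coin jar can be turned into numbers. The jar contents and the two picking strategies below are hypothetical, and `precision_recall` is a helper written for this example rather than a library function.

```python
# A hypothetical jar of ten coins
jar = ["loonie", "nickel", "dime", "twoonie", "quarter",
       "loonie", "penny", "dime", "twoonie", "nickel"]

# Positions of the coins we actually want
wanted = {i for i, coin in enumerate(jar) if coin in ("loonie", "twoonie")}

def precision_recall(picked, relevant):
    """Return (precision, recall) for a set of picked items."""
    hits = len(picked & relevant)
    return hits / len(picked), hits / len(relevant)

# Strategy 1: dump the whole jar into the purse
dump_all = set(range(len(jar)))

# Strategy 2: grab only the two coins that pop out first
quick_pick = {0, 3}

print(precision_recall(dump_all, wanted))
print(precision_recall(quick_pick, wanted))
```

Dumping everything finds all four wanted coins but most of what was picked is unwanted, while the quick pick contains only wanted coins but misses half of them.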

So let's get started. We're first going to load the modules needed for this. One is `LinearSVC`, a classifier which works particularly well with text data, and the other two are assessment tools.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

Now, let's load both the training and test sets from the Reuters dataset. The Reuters data is a collection of Reuters business news articles, each labeled with a topic. The classification task is to build a model which predicts the topic of each test article based on its text.

In [None]:
df_train = pd.read_csv('data/r8-train-all-terms.txt', sep = "\t", names = ['label', 'text'])
df_test = pd.read_csv('data/r8-test-all-terms.txt', sep = "\t", names = ['label', 'text'])

Now, what we do is create a vectorizer for the words in the documents. We will load all the words for the training set into <code>X_train</code> and all the labels for the training set into <code>y_train</code>.

In [None]:
vect_count = CountVectorizer(stop_words = 'english', lowercase = True)
X_train = vect_count.fit_transform(df_train['text'])
y_train = df_train['label']

In [None]:
X_train

We do a similar thing for the test set. Notice how we use the method <code>transform</code> rather than <code>fit_transform</code>. That's because the vectorizer should only use the vocabulary it learned from the training set; words which appear only in the test set are ignored.

In [None]:
X_test = vect_count.transform(df_test['text'])
y_test = df_test['label']
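The difference between `fit_transform` and `transform` can be seen on a toy example. The documents below are hypothetical, as are the names `toy_vect`, `toy_train`, and `toy_test`; the point is that a word appearing only in the test document (`surge`) gets no column, because the vocabulary was fixed when the vectorizer was fit on the training documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, invented for illustration
train_docs = ["oil prices rise", "wheat exports fall"]
test_docs = ["oil exports surge"]

toy_vect = CountVectorizer()
toy_train = toy_vect.fit_transform(train_docs)   # learns the vocabulary, then counts
toy_test = toy_vect.transform(test_docs)         # counts using that same vocabulary

vocab = sorted(toy_vect.vocabulary_)             # the learned terms, alphabetically
print(vocab)
print(toy_test.toarray())
```

Both matrices have the same number of columns, which is what lets a classifier trained on one be applied to the other.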

Now we define the classifier, and train it with the training data.

In [None]:
clf = LinearSVC()
clf.fit(X_train, y_train)

Lastly, we predict the new labels, based on the words in the test set.

In [None]:
y_pred = clf.predict(X_test)

In [None]:
y_pred

In [None]:
## the true labels go first, the predicted labels second
print(classification_report(y_test, y_pred))

In [None]:
## the true labels go first, the predicted labels second
print(confusion_matrix(y_test, y_pred))
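A confusion matrix is easier to read once we have built a small one by hand. When the true labels are passed first, `confusion_matrix` puts the true classes on the rows and the predicted classes on the columns. The sketch below follows that layout using hypothetical labels and plain Python; the names `y_true` and `y_hat` are made up for this example.

```python
from collections import Counter

# Hypothetical true and predicted labels for six documents
y_true = ["earn", "earn", "acq", "acq", "acq", "crude"]
y_hat  = ["earn", "acq",  "acq", "acq", "earn", "crude"]

labels = sorted(set(y_true))          # ['acq', 'crude', 'earn']
pairs = Counter(zip(y_true, y_hat))   # (true, predicted) -> count

# Rows are true classes, columns are predicted classes
matrix = [[pairs[(t, p)] for p in labels] for t in labels]
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correct predictions; every off-diagonal cell counts one kind of mistake, such as an `acq` article predicted as `earn`.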

## Social Network Analysis

Social network analysis is a type of analysis which interprets, analyzes, and visualises *relational* data. Instead of beginning from the person or tweet as the unit of analysis, with social network analysis (or SNA) we begin from the relationship between the two.

The building blocks of a network are *nodes* and *edges*. Nodes represent individuals in the network. They are people, tweets, firms, Twitter users, etc. They are the thing doing the interaction.

![](img/net-1-node.png)

The connections between nodes are called *edges*. They imply some kind of relationship between the nodes. This relationship could be friendship, mutual attendance of an event, dating, or having done business together.

![](img/net-1-edge.png)

Edges can be *directed* or *undirected*. For instance, on Facebook, friendships are mutual and both parties must agree to that friendship. Therefore, it is called *undirected* because it is by definition a two-way relationship. However, on Twitter, user A can follow user B, but user B does not have to follow user A. This is called a *directed* graph because it can be a one-way relationship. 

Lastly, edges can be *weighted*. Weights are usually numerical values which indicate the strength of a relationship. The weight of the edge between you and your best friend is probably higher than the weight of the edge between you and a classmate you do not speak to often.
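These three ideas (undirected, directed, and weighted edges) can be sketched with `networkx`, the module used later in this lab. The node names and weights below are hypothetical.

```python
import networkx as nx

# Undirected, weighted: a hypothetical friendship network
friends = nx.Graph()
friends.add_edge("you", "best friend", weight=10)
friends.add_edge("you", "classmate", weight=1)

# Directed: a hypothetical follow network; A follows B, but B does not follow A
follows = nx.DiGraph()
follows.add_edge("A", "B")

print(friends.has_edge("best friend", "you"))    # undirected edges work both ways
print(follows.has_edge("B", "A"))                # directed edges do not
print(friends["you"]["best friend"]["weight"])   # look up the weight of one edge
```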

![](http://evelinag.com/blog/2015/12-15-star-wars-social-network/star-wars-logo.png)

In this lab we will be using a small network that indicates [interactions in the movie Star Wars Episode IV](http://evelinag.com/blog/2015/12-15-star-wars-social-network/). Here, each node is a character and each edge indicates whether they appeared together in a scene of the movie. Edges here are thus undirected and they also have weights attached, since they can appear in multiple scenes together.

The first step is to read the list of edges in this network. For this exercise, we are going to use the <code>[networkx](https://networkx.github.io/)</code> module to read, analyse, and visualise the networks. 

In [None]:
import networkx as nx

We will also use the <code>matplotlib</code> module for visualisation.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

We can read in the network as a weighted edgelist. This is a CSV file in the format of <code>node1, node2, weight</code>. 

In [None]:
G = nx.read_weighted_edgelist('data/star-wars-network-edges.csv', delimiter = ",")

In [None]:
G

We can use a method to see all the edges in the network.

In [None]:
G.edges()

And we can use a similar one to see all the nodes.

In [None]:
G.nodes()

To see a specific attribute of an edge, we need to use <code>get_edge_attributes</code>. Who seems to have the highest weight in their interactions?

In [None]:
nx.get_edge_attributes(G, 'weight')

Now we're ready to draw. We'll use the basic <code>draw</code> method first to illustrate the graph. 

In [None]:
nx.draw(G)

This looks interesting, but we don't really know which node is which unless we add some labels. Before we add labels, we need to assign the labels particular positions on the graph.

We're going to play with two layouts. The first is a "circular" layout, which is useful because we can see all the nodes and the connections between them. However, with this layout, we have a harder time seeing what groups of nodes seem to cluster together.

The second layout is called "Fruchterman-Reingold". It is a "force-directed" layout, which implies that if subnetworks seem to be tied closer together, they squeeze together more in the graph. Let's play with both.

In [None]:
pos = nx.fruchterman_reingold_layout(G)

In [None]:
nx.draw_networkx_labels(G, pos)
nx.draw(G, pos)

So we're starting to see some patterns, even if we can't really see much of the text. Peripheral characters like Jabba and Greedo are only connected by one edge. However, in the center there seems to be a cluster of people like Luke, R2-D2, and Chewie.

Let's do the same thing with a circular layout.

In [None]:
pos = nx.circular_layout(G)
nx.draw_networkx_labels(G, pos)
nx.draw(G, pos)

This lets us see more of the links that exist between different nodes. It's actually not super useful, though, unless we have some more information about the edges. That's where the weights come into play.

We can display the weight of the edge. We can do this by setting some levels for line weights. We can have three: small, mid, and large.

In [None]:
## select edges by weight

## the long way: loop over the edges and keep the light ones
esmall = []
for (u,v,d) in G.edges(data = True):
    if d['weight'] < 5:
        esmall.append((u,v))

## the short way: list comprehensions which do the same selection
esmall = [(u,v) for (u,v,d) in G.edges(data = True) if d['weight']  < 5]
emid   = [(u,v) for (u,v,d) in G.edges(data = True) if d['weight'] >= 5 and d['weight'] < 10 ]
elarge = [(u,v) for (u,v,d) in G.edges(data = True) if d['weight'] >= 10]

## draw edges in varying edge widths
nx.draw_networkx_edges(G, pos, edgelist = elarge, width = 4, alpha = 0.5)
nx.draw_networkx_edges(G, pos, edgelist = emid,   width = 2, alpha = 0.5)
nx.draw_networkx_edges(G, pos, edgelist = esmall, width = 1, alpha = 0.5)
nx.draw_networkx_nodes(G, pos)

nx.draw_networkx_labels(G, pos)

plt.axis('off')

With this, we can see some stronger links between people like Chewie and Han, Luke and Obi-Wan.

Lastly, we can set the colours of nodes based on whether the person is on the light side, the dark side, or is other. Let's use the Fruchterman-Reingold layout because it allows us to see clusters a bit better.

In [None]:
## select nodes by light side / dark side / other
dark_side = ["DARTH VADER", "MOTTI", "TARKIN"]
light_side = ["R2-D2", "CHEWBACCA", "C-3PO", "LUKE", "CAMIE", "BIGGS",
                "LEIA", "BERU", "OWEN", "OBI-WAN", "HAN", "DODONNA",
                "GOLD LEADER", "WEDGE", "RED LEADER", "RED TEN"]
other = ["GREEDO", "JABBA"]

pos = nx.fruchterman_reingold_layout(G)

nx.draw_networkx_edges(G, pos, edgelist = elarge, width = 4, alpha = 0.5)
nx.draw_networkx_edges(G, pos, edgelist = emid,   width = 2, alpha = 0.5)
nx.draw_networkx_edges(G, pos, edgelist = esmall, width = 1, alpha = 0.5)

## draw the nodes
nx.draw_networkx_nodes(G, pos, node_color = 'red', nodelist = dark_side)
nx.draw_networkx_nodes(G, pos, node_color = 'yellow', nodelist = light_side)
nx.draw_networkx_nodes(G, pos, node_color = 'gray', nodelist = other)
nx.draw_networkx_labels(G, pos)

plt.axis('off')

We see some clear patterns here. The light side is very much clustered together, while the dark side has its own grouping. The outliers -- Jabba and Greedo -- aren't grouped at all.

In addition to graphing, we can create some network-level statistics which characterize the network. This includes *density*, which measures how many of the possible connections in this network have been made. If density equals 1, that would imply that everyone in the movie had a scene with everyone else.

In [None]:
nx.density(G)
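For an undirected network, density is the number of edges divided by the number of possible pairs of nodes. The sketch below works this out by hand on a hypothetical four-character network; it uses plain Python and is meant to show the arithmetic, not to replace `nx.density`.

```python
from itertools import combinations

# Hypothetical co-appearance network: four characters, three edges
nodes = ["LUKE", "HAN", "LEIA", "JABBA"]
edges = {("LUKE", "HAN"), ("LUKE", "LEIA"), ("HAN", "JABBA")}

# Every unordered pair of nodes is a possible edge: n * (n - 1) / 2 of them
possible = len(list(combinations(nodes, 2)))

density = len(edges) / possible
print(possible, density)
```

Here half of the six possible pairs are connected, so the density is one half.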

Lastly, there are node-level statistics which characterize individual nodes. One of the more important of these is *degree*, which is the number of edges connected to a particular node. Which nodes seem to have the highest degree?

In [None]:
nx.degree(G)
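Degree can also be counted by hand, which makes it easy to rank nodes from most to least connected. The edge list below is hypothetical, and the sketch assumes a simple undirected network with no self-loops.

```python
from collections import Counter

# Hypothetical undirected edges (character pairs sharing a scene)
edges = [("LUKE", "HAN"), ("LUKE", "LEIA"), ("HAN", "LEIA"), ("LUKE", "R2-D2")]

degree = Counter()
for u, v in edges:
    degree[u] += 1   # each edge adds one to the degree of both endpoints
    degree[v] += 1

# Nodes ordered from highest to lowest degree
ranked = degree.most_common()
print(ranked)
```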

In [None]:
import glob
import json

## build a retweet network: one edge per retweet, linking the user
## who retweeted to the user who wrote the original tweet
files = glob.glob('politics*.json')
edges = []

for f in files:
    ## each file holds one JSON-encoded tweet per line
    with open(f, 'r', encoding = 'utf-8') as fh:
        tweets_json = fh.read().split("\n")

    ## remove empty lines
    tweets_json = list(filter(len, tweets_json))

    ## parse each tweet
    for tweet in tweets_json:
        tweet_obj = json.loads(tweet)

        ## only retweets carry a retweeted_status object
        if 'retweeted_status' in tweet_obj:
            username = tweet_obj['user']['screen_name']
            rt_username = tweet_obj['retweeted_status']['user']['screen_name']
            edge = (username, rt_username)
            edges.append(edge)


In [None]:
G = nx.Graph()

In [None]:
G.add_edges_from(edges)

In [None]:
nx.draw(G)