# Lab 6 - Text Analysis



In this lab we are going to focus on text analysis techniques in Python. We will use some base Python functionality, and we will also work with more advanced machine learning methods from the <code>scikit-learn</code> package.

What we call "text analysis" in this class is often called *natural language processing* or *NLP* within computer science. NLP methods enable computers to derive meaning from human language.

A field that has a lot of overlap with NLP is *machine learning* or *ML*. ML includes statistical methods that automatically detect patterns in data and use those patterns to make predictions about other data.

The first part of this workshop, on string manipulation, covers NLP with basic Python functionality. The second part focuses on some ML examples of NLP.

## String manipulation

Recall that the basic text unit in Python is the string. There are basic methods we can use with a string to get its length, convert it to upper- or lowercase, replace one substring with another, and split it into a list.

In [None]:
my_string = "This is an example string. Strings are flexible."

In [None]:
len(my_string)

In [None]:
my_string.upper()

In [None]:
my_string.lower()

In [None]:
## note that strings are case-sensitive
my_string.replace("string", "piece of text")

In [None]:
my_string.split()

In [None]:
## string slice
my_string[0:10]

In [None]:
my_string.find("string")
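
To make the case-sensitivity note concrete, here is a small sketch using the same example string. It shows that <code>replace</code> leaves the capitalized "Strings" alone unless we lowercase first, and that <code>find</code> returns -1 when there is no match. The search term "tweet" is just an illustrative choice.

```python
my_string = "This is an example string. Strings are flexible."

# replace() is case-sensitive: only the lowercase "string" is swapped,
# while the capitalized "Strings" is left alone.
replaced = my_string.replace("string", "piece of text")
print(replaced)

# Lowercasing first makes the replacement catch both occurrences.
replaced_all = my_string.lower().replace("string", "piece of text")
print(replaced_all)

# find() returns the index of the first match, or -1 if there is none.
found_at = my_string.find("string")
missing = my_string.find("tweet")
print(found_at, missing)
```

Lowercasing before matching is the same trick the vectorizers use later in this lab.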

We typically want to do these operations across a number of strings, and these are typically stored in data frames. Let's look at the first 20 tweets in the Canadian Tire dataset from the final project.

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/ct-example.csv')

We can apply a lot of the same kinds of string methods by using the set of <code>.str</code> methods in pandas. The complete list of these and a tutorial can be [found here](http://pandas.pydata.org/pandas-docs/stable/text.html).

In [None]:
df['text'].str.len()

In [None]:
df['text'].str.upper()

In [None]:
df['text'].str.lower()

In [None]:
df['text'].str.replace('Canadian Tire', 'Canadian Fire')

In [None]:
df['created_at'].str.split(expand = True)

In [None]:
## Get the year-month pair for the Canadian Tire project
df['date'] = df['created_at'].str.split(expand = True)[0]
df['month'] = df['date'].str.slice(0, 7)

In [None]:
df['text'].str.find('Canadian Tire')
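
If you don't have the CSV handy, the sketch below repeats the same <code>.str</code> steps on a tiny data frame. The two rows and the timestamp format are made up for illustration and may not match the real file. It also uses <code>.str.contains</code>, which is not shown above, to build a boolean mask that ignores case.

```python
import pandas as pd

# Hypothetical rows; the real ct-example.csv may use a different timestamp format.
toy = pd.DataFrame({
    "created_at": ["2016-03-01 14:05:10", "2016-04-17 09:30:00"],
    "text": ["Love the new Canadian Tire flyer", "canadian tire was closed today"],
})

# Split on whitespace and keep the first piece (the date), then slice "YYYY-MM".
toy["date"] = toy["created_at"].str.split(expand=True)[0]
toy["month"] = toy["date"].str.slice(0, 7)

# find() is case-sensitive and returns -1 where the substring is absent.
positions = toy["text"].str.find("Canadian Tire")

# contains() with case=False gives a boolean mask that ignores case.
mask = toy["text"].str.contains("canadian tire", case=False)

print(toy[["date", "month"]])
print(positions.tolist())
print(mask.tolist())
```

A boolean mask like this can be passed back to the data frame, as in <code>toy[mask]</code>, to keep only the matching rows.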

**Exercise 1**

The Reuters-21578 dataset is a set of Reuters business articles which is used as an example for text classification. 
We are going to use a reduced version of this set drawn from [here](http://ana.cachopo.org/datasets-for-single-label-text-categorization).

1. Load these data with this command:
    <code>df_r8 = pd.read_csv('data/r8-train-all-terms.txt', sep = "\t", names = ['label', 'text'])</code>

2. Identify the unique values of the column 'label'
3. Take the length of all the text documents in the dataset. Store them in a column called 'length'.
4. Split the text into separate words. Take the first word in the text and store it in the new column called 'first_word'.
5. Identify all articles which mention the word 'trade'. Store them in a new data frame.

## Text Preprocessing

Now that we have used some basics of string handling, we need to know how to handle text for large-scale datasets. For that, text needs to go through several "preprocessing" steps before it can be passed to a statistical model.

To do that, we will start to work with some of the tools included in scikit-learn. Most notably, we are going to 'vectorize' the text, which means we will convert the text from words to a numerical representation.

There are three processes which we will perform in addition to vectorizing. One is removing all the *stopwords* from the text. Stopwords are words which appear very frequently in text and end up not adding much to our own subjective understanding of a string. Computationally, they appear often, which can also gum up statistical models.

In [None]:
## the stop word list lives in the text module in current scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ENGLISH_STOP_WORDS

The second process is converting all of the words to lowercase. On a technical level, computers treat words that differ only in case as different words, so we usually convert everything to lowercase. We already did that above, so we don't need another demonstration.

The third process is *tokenization*, meaning we separate all the meaningful *tokens* from each other. When we say tokens, we usually mean words. But tokens can also include certain kinds of punctuation which may be helpful to include.
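
To see all three preprocessing steps on a single sentence, here is a minimal sketch that asks a <code>CountVectorizer</code> for its analyzer, the function it applies to each document before counting. The sentence is made up. Note that scikit-learn's default token pattern keeps only runs of two or more letters or digits, so punctuation and single characters are dropped unless you supply your own tokenizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer bundles lowercasing, tokenization and stopword removal.
demo_vect = CountVectorizer(stop_words="english", lowercase=True)
analyze = demo_vect.build_analyzer()

# Hypothetical sentence, only for illustration.
tokens = analyze("The store in Toronto is OPEN until 9, and it's busy!")
print(tokens)
```

The result is a plain Python list of lowercase tokens with the stopwords removed.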

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(stop_words = 'english', lowercase = True)
X    = vect.fit_transform(df['text'])

The numbers which are generated from the vectorization process are called *features*. Features get used to do other machine learning tasks, which we will cover below.

Let's look at the features which are generated by this process.

In [None]:
## get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_names = np.array(vect.get_feature_names_out())
feature_names

We can see the output of the vectorization process. Each document is represented as a row in the matrix <code>X</code>, and each column corresponds to one feature. A matrix laid out this way is called a *document-term matrix*.

In [None]:
X[0].toarray()

If we want to see which words are used most often, we can sum the counts for each word across all documents, then sort the words by those totals in descending order. Lastly, we use that ordering as an index into the <code>feature_names</code> array.

In [None]:
totals = np.sum(X.toarray(), axis = 0)
order  = np.argsort(totals)[::-1]
feature_names[order]
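
The same counting and sorting steps are easier to follow on a corpus small enough to read. The sketch below uses three made-up documents and no stopword removal. It reads the column names from the vectorizer's <code>vocabulary_</code> attribute, and it sums the sparse matrix directly rather than calling <code>toarray()</code> first, which can matter for memory on large datasets.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, only for illustration.
toy_docs = ["tires on sale", "winter tires and summer tires", "sale ends sunday"]

toy_vect = CountVectorizer()
toy_X = toy_vect.fit_transform(toy_docs)

# Column names, ordered by column index (vocabulary_ maps word -> column).
toy_vocab = np.array(sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get))

# Summing the sparse matrix directly avoids building a dense copy first.
toy_totals = np.asarray(toy_X.sum(axis=0)).ravel()
toy_order = np.argsort(toy_totals)[::-1]

print(toy_vocab)
print(toy_X.toarray())       # one row per document, one column per word
print(toy_vocab[toy_order][:2])  # the two most frequent words
```

Each row of the printed matrix lines up with one document, and each column with one word in <code>toy_vocab</code>.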

We often don't want to use simple counts for vectorization. Sometimes that will inflate the importance of words which are particular to the dataset. For instance, this dataset is guaranteed to have the words 'canadian tire' in every tweet. So we can use a metric called *term frequency - inverse document frequency* or [*tf-idf*](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), which gives less weight to words that appear in many documents.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect_tfidf = TfidfVectorizer(stop_words = 'english', lowercase = True)
X_tfidf    = vect_tfidf.fit_transform(df['text'])

In [None]:
X_tfidf

In [None]:
X_tfidf[0].toarray()
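
To check that tf-idf really does downweight words shared by every document, here is a small sketch on three made-up documents that all contain "canadian tire". With the default settings, each row is also scaled to unit length, so the weights are best compared within a single row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "canadian" and "tire" appear in every document.
tfidf_docs = ["canadian tire sale", "canadian tire hours", "canadian tire returns"]

toy_tfidf = TfidfVectorizer()
weights = toy_tfidf.fit_transform(tfidf_docs).toarray()
col = toy_tfidf.vocabulary_   # maps each word to its column index

# In the first document every word occurs once, but "sale" is rarer
# across the corpus, so it ends up with the larger weight.
first = weights[0]
print("canadian:", first[col["canadian"]])
print("sale:    ", first[col["sale"]])
```

With plain counts all three words in the first document would score 1. Under tf-idf the word that distinguishes the document stands out.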

## Classification

A major task in NLP is labeling the content of a document. Twitter or Facebook, for instance, wants to classify whether a post might be relevant to you. A researcher might want to assess whether a policy document is more liberal or conservative. A brand might want to see if posts about them are positive or negative. This is where classification comes into view.

The process of classifying text documents is depicted in the image below.
![](img/supervised-learning.png)

First, there is a set of documents which are labeled manually, i.e. by a human. The label is called a *class*. The dataset which is labeled manually is called the *training set*. It's called a training set because the machine learns from this set and then applies the knowledge it gets from the set to new, unseen data. The training is done on words or features which are part of documents. The particular statistical model which is trained is called a *classifier*. The body of documents which is to be classified by the classifier is called a *test set*. For the test set, the classes are hidden or unknown to the classifier. It is doing its best to guess the correct classes.

There are a lot of different types of classifiers we can use for this task. But for this lab what we are going to use is a classifier called a *support vector machine* or SVM. In a very simplified manner, what an SVM tries to do is draw an optimal hyperplane between the classes of points in k-dimensional space, such that the hyperplane is as far as possible from the nearest points of each class. So basically when the classifier sees new data, if it falls on one side of the plane, it will assign it the label associated with that side.

![](img/hastie_etal-f12-1.png)
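
As a rough illustration of the "which side of the plane" idea, the sketch below fits a linear SVM on six made-up points in two dimensions. The coordinates and the labels "low" and "high" are invented for this example. For a two-class problem, the sign of <code>decision_function</code> tells us which side of the hyperplane a point falls on.

```python
from sklearn.svm import LinearSVC

# Hypothetical 2-D points: one cluster near the origin, one far from it.
points = [[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]]
labels = ["low", "low", "low", "high", "high", "high"]

toy_clf = LinearSVC()
toy_clf.fit(points, labels)

# Two new points, one near each cluster.
new_points = [[0.5, 0.5], [4.5, 4.5]]
predicted = toy_clf.predict(new_points)

# Signed distance-like score: opposite signs mean opposite sides of the plane.
scores = toy_clf.decision_function(new_points)

print(predicted)
print(scores)
```

Text classification works the same way, except that each document is a point with one dimension per feature rather than two.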

Now, how do we actually know if the classifier did its job correctly? Usually, we have a test set for which we actually know the real labels, and we compare those real labels against the predicted ones. From that comparison we compute a set of metrics, including *precision* and *recall*, which assess two different things.

![](img/precision-recall.png)

Precision measures what percentage of the predicted items are relevant, while recall measures what percentage of the relevant items are predicted.

Imagine this: you have a jar of coins. You want to go through the jar and pick out all the loonies and twoonies. One way of making sure you have all of the coins you want is to dump all the coins into your coin purse. In this case, your recall would be perfect (i.e. equal to 1) but your precision would be lousy. In the other case, you could search through the coins quickly with your hands and pick out whichever ones seem to pop out the quickest. You'll have much better precision here, but you might not get all the coins, so you would not have as good a recall.
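
The coin jar can be written out as a short sketch. The jar contents are made up: ten coins, three of which are the ones we want. The two strategies are encoded as predictions and scored with scikit-learn's <code>precision_score</code> and <code>recall_score</code>, which take the true labels first and the predictions second.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical jar: 1 = loonie or twoonie (wanted), 0 = any other coin.
jar = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Strategy 1: dump every coin into the purse (predict "wanted" for all).
dump_all = [1] * len(jar)

# Strategy 2: quickly grab two coins that stand out; both happen to be wanted.
quick_grab = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

dump_precision = precision_score(jar, dump_all)
dump_recall = recall_score(jar, dump_all)
grab_precision = precision_score(jar, quick_grab)
grab_recall = recall_score(jar, quick_grab)

print("dump all:   precision", dump_precision, "recall", dump_recall)
print("quick grab: precision", grab_precision, "recall", grab_recall)
```

Dumping everything finds all three wanted coins but only 3 of the 10 coins taken are wanted. The quick grab takes only wanted coins but misses one of the three.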

So let's get started. We're first going to load the modules needed for this. One is the SVM classifier, and the other two are assessment tools.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

Now, let's load both the training and test sets from the Reuters dataset.


In [None]:
df_train = pd.read_csv('data/r8-train-all-terms.txt', sep = "\t", names = ['label', 'text'])
df_test = pd.read_csv('data/r8-test-all-terms.txt', sep = "\t", names = ['label', 'text'])

Now, what we do is create a vectorizer for the words in the documents. We will load all the words for the training set into <code>X_train</code> and all the labels for the training set into <code>y_train</code>.

In [None]:
vect_tfidf = TfidfVectorizer(stop_words = 'english', lowercase = True)
X_train = vect_tfidf.fit_transform(df_train['text'])
y_train = df_train['label']

We do a similar thing for the test set. Notice how we use the method <code>transform</code> rather than <code>fit_transform</code>. That's because the vectorizer should reuse the vocabulary it learned from the training set. Words which appear only in the test set are ignored.

In [None]:
X_test = vect_tfidf.transform(df_test['text'])
y_test = df_test['label']
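
The difference between <code>fit_transform</code> and <code>transform</code> shows up clearly on a tiny example. In this sketch the documents are made up, and the word "surge" appears only in the test document, so it has no column and is silently dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, only for illustration.
train_docs = ["oil prices rise", "wheat exports fall"]
test_docs = ["oil exports surge"]

demo_vect = TfidfVectorizer()
demo_train = demo_vect.fit_transform(train_docs)  # learns the vocabulary
demo_test = demo_vect.transform(test_docs)        # reuses that vocabulary

# Both matrices have the same number of columns, one per training word.
print(demo_train.shape, demo_test.shape)

# "surge" was never seen during fitting, so it is not a feature.
print("surge" in demo_vect.vocabulary_)
```

Keeping the columns identical is what lets a classifier trained on one matrix make predictions on the other.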

Now we define the classifier, and train it with the training data.

In [None]:
clf = LinearSVC()
clf.fit(X_train, y_train)

Lastly, we predict the new labels, based on the words in the test set.

In [None]:
y_pred = clf.predict(X_test)

In [None]:
## true labels go first, then the predictions
print(classification_report(y_test, y_pred))

In [None]:
## true labels go first, so rows are true classes and columns are predicted classes
print(confusion_matrix(y_test, y_pred))
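
Reading a confusion matrix takes some practice, so here is a sketch on five made-up predictions. scikit-learn expects the true labels as the first argument and the predictions as the second. Rows then correspond to true classes and columns to predicted classes. The class names below are only illustrative, and the <code>labels</code> argument fixes the row and column order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for five documents.
toy_true = ["earn", "earn", "acq", "acq", "crude"]
toy_pred = ["earn", "acq", "acq", "acq", "crude"]

# Fix the row/column order so the matrix is easy to read.
toy_labels = ["acq", "crude", "earn"]
toy_cm = confusion_matrix(toy_true, toy_pred, labels=toy_labels)

print(toy_labels)
print(toy_cm)
```

The diagonal holds the correct predictions. The single off-diagonal entry, in the "earn" row and "acq" column, is the one "earn" document that was mislabeled as "acq".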