# Natural Language Processing: The Term-Document Matrix

In this course, we've already seen a few examples of working with text. We've used basic string operations and `pandas` `str` operations to manipulate text data. Now that we have some array programming and machine learning skills under our belts, we can take our exploration of text data much further. 

In this lecture, we'll introduce one of the most important constructs for analyzing text data: the [term-document matrix.](https://en.wikipedia.org/wiki/Document-term_matrix)

This might sound intimidating, but the idea is very simple. Consider the following three sentences. We regard each of them as a "document."

1.  This is the first one.
2.  This one is the second one.
3.  Is this the first one?

We can think of the term-document matrix as a data frame with one row for each document and one column for each possible word. Each entry counts how many times that column's word appears in that row's document, ignoring capitalization and punctuation. For example, using the three short "documents" above, the term-document matrix is: 

| document | This | is | the | first | one | second |
|----------|------|----|-----|-------|-----|--------|
| 1        | 1    | 1  | 1   | 1     | 1   | 0      |
| 2        | 1    | 1  | 1   | 0     | 2   | 1      |
| 3        | 1    | 1  | 1   | 1     | 1   | 0      |

This turns out to be an extremely convenient format for working with text data, and we'll see in coming lectures how to use it for both sentiment analysis (figuring out how "positive" a word or sentence is) and topic modeling (figuring out the main "ideas" in a set of documents). 

If you were very persistent, you could build a term-document matrix using a lot of `for`-loops and basic string operations. However, `scikit-learn` offers a much more convenient approach. In this lecture, we'll see an example of organizing our data and constructing the term-document matrix. In coming lectures, we'll start to use our construction for data analysis. 
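
To make the "persistent" route concrete, here is a minimal sketch that rebuilds the small table above with nothing but loops and string operations. The variable names are our own choices, and the tokenization rule (lowercase everything, keep only runs of letters) is an assumption made for this toy example.

```python
import re

documents = [
    "This is the first one.",
    "This one is the second one.",
    "Is this the first one?",
]

# Lowercase each document and pull out runs of letters, dropping punctuation.
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]

# Build the vocabulary in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One row per document, one column per vocabulary word.
matrix = []
for tokens in tokenized:
    row = [tokens.count(word) for word in vocabulary]
    matrix.append(row)

print(vocabulary)
for row in matrix:
    print(row)
```

This works for three sentences, but the nested loops and the repeated `count` calls get slow and clumsy on a whole book, which is one reason to prefer a library tool.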

In [1]:
#Before we get started - standard imports

## Data

Our data for this lecture is the complete text of the short book *Alice’s Adventures in Wonderland* by Lewis Carroll. The package `nltk` (Natural Language ToolKit) makes it wonderfully easy to obtain this data set. 

We can use the `raw` method of the `gutenberg` corpus in `nltk` to read in the book as a single string.
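
A minimal sketch of this step follows. The function name `load_alice` is hypothetical, and we assume the book is stored under the file id `carroll-alice.txt` in the corpus. The work is wrapped in a function so that nothing is downloaded until we actually call it.

```python
def load_alice():
    """Return the full text of the book as one string."""
    import nltk
    from nltk.corpus import gutenberg

    # Fetch the Gutenberg sample corpus on first use; later calls reuse the local copy.
    nltk.download("gutenberg", quiet=True)

    # raw() returns the whole file as a single string.
    return gutenberg.raw("carroll-alice.txt")
```

Calling `s = load_alice()` would then give us the complete text to work with.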

### Split into chapters
We observe that the chapters are demarcated by the all-caps word "CHAPTER". So, we can simply split on this word to break the book up into chapters. We need to exclude the very first part of the split, since this isn't a real chapter -- it just contains the title and author information. 
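
Here is a small sketch of the split. The string `s` is a tiny stand-in that only imitates the layout of the book, and the names `s` and `chapters` are our own.

```python
# Tiny stand-in for the book text; the real string is much longer.
s = (
    "[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\n"
    "CHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired...\n\n"
    "CHAPTER II. The Pool of Tears\n\nCuriouser and curiouser!...\n"
)

# Split on the marker word, then drop the first piece (title and author only).
chapters = s.split("CHAPTER")[1:]

print(len(chapters))
```

Each element of `chapters` is now the text of one chapter, starting with its Roman numeral and title.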

In [2]:
#split into chapters and get rid of text before chapter 1


There is a lot of punctuation in the text, along with special characters, but we don't have to worry about those this time -- there are built-in functions that will filter these out for us. 

It's helpful to keep ourselves organized by placing the text of each chapter into a data frame. Our data frame will have two columns: `chapter` and `text`.
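
One way to build such a data frame is sketched here. The list `chapters` is a short stand-in for the real chapter texts, and numbering the chapters from 1 is our own choice.

```python
import pandas as pd

# Stand-in chapter texts; in the lecture these come from splitting the book.
chapters = [
    " I. Down the Rabbit-Hole ...",
    " II. The Pool of Tears ...",
    " III. A Caucus-Race and a Long Tale ...",
]

# One row per chapter: a 1-based chapter number and the chapter's text.
df = pd.DataFrame({
    "chapter": range(1, len(chapters) + 1),
    "text": chapters,
})

print(df)
```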

Next, we are going to grab the `CountVectorizer` class from the `sklearn.feature_extraction.text` module. This module gives a whole range of tools for turning unstructured text into delicious, quantitative numbers that we can feed into algorithms. 

We now create a `CountVectorizer` object. This is an object which will construct the term-document matrix for us. As usual, this object accepts various parameters. In this case, I've only specified the use of common English-language "stop words." A stop word is a word that's considered uninteresting for the purposes of natural language processing. For example, "she," "can", and "the" are common stop words. 
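
To see what the stop-word option does, this sketch fits one vectorizer with it and one without on two made-up sentences (they are not from the book). The names `plain` and `filtered` are ours.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, not taken from the book.
docs = ["Alice follows the rabbit", "The queen shouts at Alice"]

plain = CountVectorizer()                          # keeps every word
filtered = CountVectorizer(stop_words="english")   # drops common English words

plain.fit(docs)
filtered.fit(docs)

# vocabulary_ maps each kept word to its column index.
print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

Words like "the" and "at" show up in the first vocabulary but not in the second, while "alice" survives in both. Note that the vectorizer also lowercases the text by default.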

To create a term-document matrix, we call the `fit_transform()` method of the vectorizer on the `text` column of the data frame `df`.

`counts` will tell us the number of times each word appears in each chapter, but...

In math, a matrix is said to be sparse if most of its entries are zero. This is usually the case for term-document matrices, so `fit_transform` smartly returns a sparse matrix to save memory.
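
The following sketch runs `fit_transform` on a toy data frame and inspects the result. The three texts are made up for illustration, and the names `vec` and `counts` are assumptions about how the objects are called.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the chapter data frame.
df = pd.DataFrame({
    "chapter": [1, 2, 3],
    "text": [
        "Alice follows the rabbit down the hole",
        "Alice swims in a pool of tears",
        "The queen shouts at Alice",
    ],
})

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(df["text"])   # learns the vocabulary and counts words

print(sparse.issparse(counts))   # the result is a SciPy sparse matrix
print(counts.shape)              # (number of documents, number of distinct words)

# Fraction of entries that are zero; nnz is the number of stored nonzero entries.
n_rows, n_cols = counts.shape
print(1 - counts.nnz / (n_rows * n_cols))
```

With only three short texts the matrix is not very sparse, but on a full book with thousands of distinct words the fraction of zeros is typically much higher.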

However, `counts` is easier for us to work with if we convert it to a `numpy` array using the `toarray` method.

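Even better, let's convert it to a data frame with labeled columns. Labeling the columns is easy thanks to the `get_feature_names_out` method of the vectorizer (older versions of `scikit-learn` called this `get_feature_names`, which has since been removed).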

Now, let's use the `concat` function from `pandas` to add this information to our original data frame.

## Interpreting the Term-Document Matrix

We can now use the term-document matrix to check how frequently a given term appears in each chapter of the novel. For example, the main character of the book is Alice, who shows up in our columns as the lowercase term `alice`.

Let's plot the counts for the names of four characters, `alice`, `dinah`, `queen`, and `hatter`, to see how their mentions change from chapter to chapter.

In [5]:
characters=['alice','dinah','queen','hatter']
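
A plot like this could be produced along the following lines. The counts in this sketch are made up purely so that it runs on its own; they are not the real counts from the book, and the real `df` would have one row per chapter.

```python
import matplotlib.pyplot as plt
import pandas as pd

characters = ["alice", "dinah", "queen", "hatter"]

# Made-up counts for three chapters, only to make the sketch runnable;
# they are NOT the real counts from the book.
df = pd.DataFrame({
    "chapter": [1, 2, 3],
    "alice": [5, 4, 6],
    "dinah": [2, 1, 0],
    "queen": [0, 0, 3],
    "hatter": [0, 1, 2],
})

fig, ax = plt.subplots()
for name in characters:
    ax.plot(df["chapter"], df[name], label=name)   # one line per character

ax.set(xlabel="chapter", ylabel="number of appearances")
ax.legend()
```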


## Sidebar: Normalization

In many applications, it is desirable not to use the raw number of times that a word appears. Instead, various normalizations are possible, each of which provides a quantification of how important a word is within a document. For example, one could compute what proportion of a document is allocated to each word. This approach automatically accounts for the fact that some documents are longer than others. 
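
As a quick illustration of the proportion idea, this sketch normalizes the small three-sentence matrix from the start of the lecture. The variable names are ours.

```python
import numpy as np

# Counts from the three-sentence example (columns: this, is, the, first, one, second).
counts = np.array([
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 2, 1],
    [1, 1, 1, 1, 1, 0],
])

# Divide each row by its total so every row sums to 1.
row_totals = counts.sum(axis=1, keepdims=True)
proportions = counts / row_totals

print(proportions.round(2))
```

The second sentence has six words rather than five, so a single appearance of a word there counts for slightly less than it does in the other two.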

The most popular way to normalize is slightly more mathematically complex: it is called [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). We can compute a tf-idf term-document matrix easily, replacing the `CountVectorizer` above with the `TfidfVectorizer`. 

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(stop_words = "english")

In [7]:
tfidf = vec.fit_transform(df['text'])
tfidf.toarray()

The entries of `tfidf` are no longer integers, but rather floats that give each word a weight within each document. A word is given a higher weight in row n if (a) it appears many times in document n, and (b) it is not very common across the documents overall.

So, say the words "cat" and "platypus" both appear three times in document 1, but across all documents "cat" is far more common than "platypus". Then "platypus" would have a higher tf-idf score in row 1, because three appearances of a rare word like "platypus" are more informative than three appearances of a common word like "cat".
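
We can check this intuition with a little arithmetic. The corpus size and document frequencies in this sketch are hypothetical. For "how common" a word is, we use the number of documents that contain it, together with a smoothed inverse document frequency, which we believe matches the default settings in `scikit-learn` before it rescales each row to unit length.

```python
import math

n_documents = 10   # hypothetical corpus size

# Hypothetical document frequencies: how many documents contain each word.
document_frequency = {"cat": 9, "platypus": 1}

# Both words appear three times in document 1.
term_count = {"cat": 3, "platypus": 3}

def smoothed_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

# tf-idf weight (before any row normalization) = term count * idf.
weights = {
    word: term_count[word] * smoothed_idf(document_frequency[word], n_documents)
    for word in term_count
}

print(weights)
```

Even though the raw counts are identical, "platypus" ends up with the larger weight because it appears in far fewer documents.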

We won't worry much about the difference between count vectorization and tf-idf vectorization in this course, but feel free to try both when working with models to see whether you can improve your results. 
