<center>
<h2>Text Analysis with Scikit-learn</h2>

Matt Pitlyk<br>
github.com/mattpitlyk/
</center>

<div style="margin-top:200px"/>


<h3>Text analysis is</h3>
<ul><li>the extraction of information from digital text</li>
<li>difficult because computers only understand numbers</li></ul>


<!--Text analysis is the extraction of information from text that is, in most cases, unstructured. This is difficult because computers only understand numbers, so in order to analyze text with computers we have to process it in some way. There are two basic approaches: look for patterns in the text (words that often appear together or near each other form multi-word terms; words that are used in the same context but never appear together may be synonyms; and so on), or convert the text to numbers and do math on them. -->



<div style="margin-top:200px"/>

<h3>Term Frequency-Inverse Document Frequency (TF-IDF)</h3>
<ul><li>Very common text analysis technique</li>
<li>Included in many programming libraries and text analysis tools</li>
<li>Vector space model</li>
<li>Easy to understand, implement, and modify</li>
<li>Intuitive reasoning</li></ul>

<!---The technique we're going to discuss today is called Term Frequency-Inverse Document Frequency (TF-IDF), and it is a fairly common method in the area of Information Retrieval (IR) for comparing and searching documents. We're going to take our big set of documents, called the corpus (these can be any type of document: news articles, emails, online comments, although the longer the documents the better the results), turn them into matrices of numbers, and then do some pretty basic math on them. That math is going to help us define what's called a distance metric between any two documents, which we can then use for comparisons and searches. Ok, but first, let's think about this problem intuitively. -->



<div style="margin-top:200px"/>

<h3>Intuitive reasoning</h3>
<br>
Let's say we have three documents. Our documents are just going to be sentences. 
<ul><li>The dog is jumping over the fence.</li>
<li>The dog is climbing up the fence.</li>
<li>The cat is sitting on the window.</li></ul>

<!--So how similar/dissimilar are these sentences? Well, that's a little hard to answer right now because it means quantifying something about the text. What about a different question: which two sentences are the most similar? The first two? Why? Because they have the same subject? What if you didn't understand English? You'd probably still say the first two. They share a lot of the same words, so we can think of them as being similar in their contents based on the inclusion of identical words.-->



<div style="margin-top:200px"/>

<!--So this is good. We could define a similarity metric as the number of words two sentences have in common. But is that good enough? What about these sentences?-->

<h3>Intuitive Reasoning</h3>
<ul><li>A dog was jumping over this fence and then ran over to the next yard so very quickly.</li>
<li>A cat was leaning over this edge and then jumped over to the next table so very quickly.</li></ul>

<!--They have many more words in common than our first two sentences, so are they more similar? Probably not. But this is a function of the length of the sentences, so let's try to normalize by dividing by the number of words in the sentence. This will give us a percentage of similarity. -->



<div style="margin-top:200px"/>

<h3>Simple Metric: Ratio of words in common</h3>

5 common / 7 total ~= 71% 
<ul><li>The dog is jumping over the fence.</li>
<li>The dog is climbing up the fence.</li></ul>
<br>

13 common / 18 total ~= 72% 
<ul><li>A dog was jumping over this fence and then ran over to the next yard so very quickly.</li>
<li>A cat was leaning over this edge and then jumped over to the next table so very quickly.</li></ul>

<!--Well, it turns out these longer sentences still get a higher similarity score. Intuitively we know the first two sentences are similar, but the numbers suggest the last two sentences are more similar. So what's the problem? The problem is that we're treating all the words as equal when, for the purpose of distinguishing sentences from one another, they are not. Why not? -->
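
The ratios on this slide can be reproduced with a few lines of plain Python. This is only a sketch under some assumptions: Python 3, lowercased word tokens with punctuation dropped, repeated words counted as often as both sentences use them, and the longer sentence's length as the denominator. The helper names `tokenize` and `common_word_ratio` are made up for this illustration.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())

def common_word_ratio(first, second):
    """Return (common words, total words, ratio) for a pair of sentences."""
    first_counts = Counter(tokenize(first))
    second_counts = Counter(tokenize(second))
    # Counter intersection keeps the minimum count of every shared word,
    # so a repeated word such as "the" counts once per shared occurrence.
    common = sum((first_counts & second_counts).values())
    # Assumption: normalize by the length of the longer sentence.
    total = max(sum(first_counts.values()), sum(second_counts.values()))
    return common, total, common / total

short_pair = ("The dog is jumping over the fence.",
              "The dog is climbing up the fence.")
long_pair = ("A dog was jumping over this fence and then ran over to the next yard so very quickly.",
             "A cat was leaning over this edge and then jumped over to the next table so very quickly.")

for pair in (short_pair, long_pair):
    common, total, ratio = common_word_ratio(*pair)
    print("{} common / {} total = {:.0%}".format(common, total, ratio))
```

The longer pair scores slightly higher even though it is about two different animals, which is the weakness this metric has.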



<div style="margin-top:200px"/>

<h3>Word Importance</h3>
How do we determine which words are important?
<ul><li>A dog was jumping over this fence and then ran over to the next yard so very quickly.</li>
<li>A cat was leaning over this fence and then jumped over to the next yard so very quickly.</li>
<li>A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.</li>
<li>A bird was peering over this branch and then flew over to the next tree.</li></ul>

<!--If we were going to distinguish between these sentences, any word that appears in all the sentences is worthless. It doesn't help us determine whether any pair of sentences is similar or dissimilar because it's common to all of them. It's almost a characteristic of sentences in general, as opposed to a characteristic of a particular sentence. So we don't want to use words that appear in all sentences. We either throw them out or weight them very low (more on that later). What about words that appear in a lot of sentences, but not all? "so very quickly" appears in all but one sentence....
Somewhat useful, but not super useful.
(Note: This is contextual to your corpus. There is no absolute weighting that applies to all situations.)-->



<div style="margin-top:200px"/>

<h3>Word Importance</h3>
These words appear in some, but not all or most, of the sentences:
<ul><li>tree (2)</li>
<li>branch (2)</li>
<li>fence (2)</li>
<li>yard (2)</li>
</ul>

<!--Tree, branch, fence, yard: these appear in some, but not all or most, of the sentences, so they are useful and will get weighted more heavily.
This is the Inverse Document Frequency part of TF-IDF. The more documents a word appears in, the less useful it is in distinguishing between the documents, so it gets weighted less. The Inverse essentially means we divide by the number of documents the word (or term) appears in.
For the first part, the Term Frequency, we're simply going to calculate the frequency of each term in each document by counting the number of times it appears in the document and dividing by the total number of words in the document. The assumption here is that the more often a word appears in a document, the more related it is to the topic of the document. -->
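
The counts in parentheses on this slide are document frequencies: the number of sentences each word appears in. A minimal sketch of how they could be computed by hand, assuming Python 3 and a simple lowercase word tokenizer (`tokenize` is a hypothetical helper, not part of scikit-learn):

```python
import re
from collections import Counter

sentences = [
    "A dog was jumping over this fence and then ran over to the next yard so very quickly.",
    "A cat was leaning over this fence and then jumped over to the next yard so very quickly.",
    "A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.",
    "A bird was peering over this branch and then flew over to the next tree.",
]

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())

# Count each word at most once per sentence, so the total is the number
# of sentences (documents) that contain the word.
document_frequency = Counter()
for sentence in sentences:
    document_frequency.update(set(tokenize(sentence)))

for word in ["over", "quickly", "fence", "yard", "branch", "tree", "dog"]:
    print(word, document_frequency[word])
```

Words such as "over" show up in every sentence, "quickly" in most of them, and "dog" in only one.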



<div style="margin-top:200px"/>

<h3>Formula</h3>
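For each document, for each term in the document vector:<br>
<h4>term_score = |term-frequency| * log(N/x)</h4>
where |term-frequency| is the term's normalized frequency in the document, N is the number of documents in the corpus, and x is the number of documents that contain the term.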



<!--Each document vector has a slot for each term in the vocabulary, even if that term doesn't appear in that document. The value for each term is calculated using the formula above. There are several options for term frequency normalization and several ways to adjust the Inverse Document Frequency. -->
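
To make the formula concrete, here is a small hand-rolled sketch that scores a few terms in the first of the four animal sentences. It assumes Python 3, a natural logarithm, and term frequency normalized by sentence length; `tokenize` and `term_score` are hypothetical names used only here, and this is the basic formula from the slide rather than scikit-learn's exact variant.

```python
import math
import re
from collections import Counter

sentences = [
    "A dog was jumping over this fence and then ran over to the next yard so very quickly.",
    "A cat was leaning over this fence and then jumped over to the next yard so very quickly.",
    "A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.",
    "A bird was peering over this branch and then flew over to the next tree.",
]

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())

tokenized = [tokenize(s) for s in sentences]
n_documents = len(tokenized)  # N in the formula
# x in the formula: the number of documents that contain each word.
document_frequency = Counter(word for tokens in tokenized for word in set(tokens))

def term_score(word, tokens):
    """Basic TF-IDF: normalized term frequency times log(N / x)."""
    term_frequency = tokens.count(word) / len(tokens)
    inverse_document_frequency = math.log(n_documents / document_frequency[word])
    return term_frequency * inverse_document_frequency

dog_sentence = tokenized[0]
for word in ["the", "quickly", "fence", "dog"]:
    print(word, round(term_score(word, dog_sentence), 4))
```

A word that appears in every sentence, like "the", gets log(4/4) = 0 and so scores zero, while rarer words score progressively higher.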



<div style="margin-top:200px"/>

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
#from sklearn.feature_extraction.text import TfidfVectorizer  <- Combines the other two objects for efficiency, 
                                                               # but misses out on intermediate results.
import pandas as pd

###### Example 1: Short Sentences

In [2]:
short_sentences = ['The dog is jumping over the fence.', 
                   'The dog is climbing up the fence.',
                   'The cat is sitting on the window.']

In [3]:
# Create an instance of the CountVectorizer.
cv = CountVectorizer()

# Create a count matrix from the list of documents.
count_matrix = cv.fit_transform(short_sentences)

In [4]:
# Create a dataframe out of the count matrix with row labels and column names for easier reading.
short_sent_df = pd.DataFrame(count_matrix.todense(), index=short_sentences, columns=sorted(cv.vocabulary_))
short_sent_df

Unnamed: 0,cat,climbing,dog,fence,is,jumping,on,over,sitting,the,up,window
The dog is jumping over the fence.,0,0,1,1,1,1,0,1,0,2,0,0
The dog is climbing up the fence.,0,1,1,1,1,0,0,0,0,2,1,0
The cat is sitting on the window.,1,0,0,0,1,0,1,0,1,2,0,1
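
For intuition, the same count matrix can be built without scikit-learn. The sketch below assumes Python 3 and a simple lowercase word tokenizer (`tokenize` is a hypothetical helper); it happens to agree with the table above because every word in these sentences has at least two characters.

```python
import re
from collections import Counter

short_sentences = ["The dog is jumping over the fence.",
                   "The dog is climbing up the fence.",
                   "The cat is sitting on the window."]

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())

tokenized = [tokenize(s) for s in short_sentences]

# The vocabulary is every distinct word in the corpus, in sorted order.
vocabulary = sorted(set(word for tokens in tokenized for word in tokens))

# One row per sentence, one column per vocabulary word.
count_rows = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in count_rows:
    print(row)
```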


<div style="margin-top:200px"/>

###### Example 2: Long sentences

In [5]:
long_sentences = ['A dog was jumping over this fence and then ran over to the next yard so very quickly.',
                  'A cat was leaning over this fence and then jumped over to the next yard so very quickly.',
                  'A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.',
                  'A bird was peering over this branch and then flew over to the next tree.']

In [62]:
# Create an instance of the CountVectorizer.
cv = CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')

# Create a count matrix from the list of documents.
long_count_matrix = cv.fit_transform(long_sentences)

Most of the time, you'll use the default value for token_pattern, which only keeps tokens of two or more characters and would therefore drop 'a'. I'm using a non-default pattern so that this example matches our basic explanation above.
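
The effect of the pattern can be checked with Python's `re` module alone. This sketch assumes Python 3 and compares scikit-learn's documented default pattern with the one passed above; the text is lowercased by hand because CountVectorizer lowercases by default.

```python
import re

sentence = "A dog was jumping over this fence and then ran over to the next yard so very quickly."

default_pattern = r"(?u)\b\w\w+\b"  # CountVectorizer's default: two or more word characters
custom_pattern = r"(?u)\b\w+\b"     # the pattern used above: one or more word characters

text = sentence.lower()
default_tokens = re.findall(default_pattern, text)
custom_tokens = re.findall(custom_pattern, text)

print(len(default_tokens), len(custom_tokens))
# Tokens that only the custom pattern keeps.
print(sorted(set(custom_tokens) - set(default_tokens)))
```

The only difference for this sentence is the single-character word "a".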

In [74]:
# Create a dataframe out of the count matrix with row labels and column names for easier reading.
long_sent_df = pd.DataFrame(long_count_matrix.todense(), index=[s.split()[1] for s in long_sentences], 
                            columns=sorted(cv.vocabulary_))
long_sent_df

Unnamed: 0,a,and,bird,branch,cat,dog,fence,flew,jumped,jumping,...,so,squirrel,the,then,this,to,tree,very,was,yard
dog,1,1,0,0,0,1,1,0,0,1,...,1,0,1,1,1,1,0,1,1,1
cat,1,1,0,0,1,0,1,0,1,0,...,1,0,1,1,1,1,0,1,1,1
squirrel,1,1,0,1,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
bird,1,1,1,1,0,0,0,1,0,0,...,0,0,1,1,1,1,1,0,1,0


In [63]:
# Create an instance of a TfidfTransformer.
tfidf = TfidfTransformer(norm='l1', use_idf=False, smooth_idf=False)

# Create a weighted matrix.
weighted_matrix = tfidf.fit_transform(long_count_matrix)

Most of the time, you'll use the default values for these three parameters (norm='l2', use_idf=True, smooth_idf=True), but I'm using non-defaults to make this example match our basic explanation above.

In [64]:
weighted_matrix

<4x28 sparse matrix of type '<type 'numpy.float64'>'
	with 65 stored elements in Compressed Sparse Row format>
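
The shape and the number of stored elements follow directly from the sentences: one row per sentence, one column per distinct word in the corpus, and one stored value per distinct word in each sentence. A quick plain-Python check, assuming Python 3 and a lowercase word tokenizer that keeps single-character words (`tokenize` is a hypothetical helper):

```python
import re

long_sentences = [
    "A dog was jumping over this fence and then ran over to the next yard so very quickly.",
    "A cat was leaning over this fence and then jumped over to the next yard so very quickly.",
    "A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.",
    "A bird was peering over this branch and then flew over to the next tree.",
]

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())

# Distinct words per sentence: each one is a nonzero cell in that row.
distinct_terms = [set(tokenize(s)) for s in long_sentences]
vocabulary = set().union(*distinct_terms)

print(len(long_sentences), len(vocabulary))      # rows, columns
print([len(terms) for terms in distinct_terms])  # nonzero cells per row
print(sum(len(terms) for terms in distinct_terms))
```

Each sentence uses "over" twice, which is why a row has one fewer stored value than the sentence has words.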

In [65]:
# Create a dataframe out of the weighted matrix with row labels and column names for easier reading.
long_sent_weighted_df = pd.DataFrame(weighted_matrix.todense(), index=[s.split()[1] for s in long_sentences], 
                                     columns=sorted(cv.vocabulary_))
long_sent_weighted_df

Unnamed: 0,a,and,bird,branch,cat,dog,fence,flew,jumped,jumping,...,so,squirrel,the,then,this,to,tree,very,was,yard
dog,0.055556,0.055556,0.0,0.0,0.0,0.055556,0.055556,0.0,0.0,0.055556,...,0.055556,0.0,0.055556,0.055556,0.055556,0.055556,0.0,0.055556,0.055556,0.055556
cat,0.055556,0.055556,0.0,0.0,0.055556,0.0,0.055556,0.0,0.055556,0.0,...,0.055556,0.0,0.055556,0.055556,0.055556,0.055556,0.0,0.055556,0.055556,0.055556
squirrel,0.055556,0.055556,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,...,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.0
bird,0.066667,0.066667,0.066667,0.066667,0.0,0.0,0.0,0.066667,0.0,0.0,...,0.0,0.0,0.066667,0.066667,0.066667,0.066667,0.066667,0.0,0.066667,0.0


Above we can see each word is scored with its normalized frequency in each sentence. This ignores the frequency of the word in the corpus as a whole, which would indicate its importance in distinguishing documents from each other.

'quickly' appears in the first three sentences but not in the last one, so its score for the last sentence is 0. The other three scores are equal because 'quickly' appears the same number of times (1) in each of those sentences, and they all have the same length.

In [66]:
long_sent_weighted_df['quickly']

dog         0.055556
cat         0.055556
squirrel    0.055556
bird        0.000000
Name: quickly, dtype: float64

In [67]:
# Print the number of words in each sentence.
for s in long_sentences:
    print(s.split()[1] + ' ' + str(len(s.split())))

dog 18
cat 18
squirrel 18
bird 15


In [68]:
# Score for 'a' in the last sentence.
1/15.0

0.06666666666666667

In [69]:
# Score for 'quickly' in the first three sentences.
1/18.0

0.05555555555555555
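
The same normalized frequencies can be computed for every word at once without scikit-learn. This is a sketch assuming Python 3; `tokenize` and `term_frequencies` are hypothetical helpers, and dividing each count by the sentence length is what norm='l1' amounts to when IDF is turned off.

```python
import re
from collections import Counter

long_sentences = [
    "A dog was jumping over this fence and then ran over to the next yard so very quickly.",
    "A cat was leaning over this fence and then jumped over to the next yard so very quickly.",
    "A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.",
    "A bird was peering over this branch and then flew over to the next tree.",
]

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())

def term_frequencies(sentence):
    """Map each word to its count divided by the sentence's total word count."""
    counts = Counter(tokenize(sentence))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Label each row by the animal, i.e. the second word of the sentence.
tf_by_animal = {s.split()[1]: term_frequencies(s) for s in long_sentences}

print(tf_by_animal["dog"]["quickly"])            # 1 of 18 words
print(tf_by_animal["bird"].get("quickly", 0.0))  # absent from the last sentence
print(tf_by_animal["bird"]["a"])                 # 1 of 15 words
print(tf_by_animal["dog"]["over"])               # 2 of 18 words
```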

Let's turn on IDF.

In [81]:
# Create an instance of a TfidfTransformer.
tfidf = TfidfTransformer(norm='l1', use_idf=True, smooth_idf=False)

# Create a weighted matrix.
weighted_matrix = tfidf.fit_transform(long_count_matrix)

In [82]:
# Create a dataframe out of the weighted matrix with row labels and column names for easier reading.
long_sent_weighted_df_with_idf = pd.DataFrame(weighted_matrix.todense(), index=[s.split()[1] for s in long_sentences], 
                                              columns=sorted(cv.vocabulary_))
long_sent_weighted_df_with_idf

Unnamed: 0,a,and,bird,branch,cat,dog,fence,flew,jumped,jumping,...,so,squirrel,the,then,this,to,tree,very,was,yard
dog,0.04097,0.04097,0.0,0.0,0.0,0.097766,0.069368,0.0,0.0,0.097766,...,0.052756,0.0,0.04097,0.04097,0.04097,0.04097,0.0,0.052756,0.04097,0.069368
cat,0.04097,0.04097,0.0,0.0,0.097766,0.0,0.069368,0.0,0.097766,0.0,...,0.052756,0.0,0.04097,0.04097,0.04097,0.04097,0.0,0.052756,0.04097,0.069368
squirrel,0.04097,0.04097,0.0,0.069368,0.0,0.0,0.0,0.0,0.0,0.0,...,0.052756,0.097766,0.04097,0.04097,0.04097,0.04097,0.069368,0.052756,0.04097,0.0
bird,0.048673,0.048673,0.116149,0.082411,0.0,0.0,0.0,0.116149,0.0,0.0,...,0.0,0.0,0.048673,0.048673,0.048673,0.048673,0.082411,0.0,0.048673,0.0


In [80]:
# Previous results for comparison, i.e. just Term Frequencies.
long_sent_weighted_df

Unnamed: 0,a,and,bird,branch,cat,dog,fence,flew,jumped,jumping,...,so,squirrel,the,then,this,to,tree,very,was,yard
dog,0.055556,0.055556,0.0,0.0,0.0,0.055556,0.055556,0.0,0.0,0.055556,...,0.055556,0.0,0.055556,0.055556,0.055556,0.055556,0.0,0.055556,0.055556,0.055556
cat,0.055556,0.055556,0.0,0.0,0.055556,0.0,0.055556,0.0,0.055556,0.0,...,0.055556,0.0,0.055556,0.055556,0.055556,0.055556,0.0,0.055556,0.055556,0.055556
squirrel,0.055556,0.055556,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,...,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.0
bird,0.066667,0.066667,0.066667,0.066667,0.0,0.0,0.0,0.066667,0.0,0.0,...,0.0,0.0,0.066667,0.066667,0.066667,0.066667,0.066667,0.0,0.066667,0.0


In [77]:
# Print out IDF scores for each term.
list(zip(sorted(cv.vocabulary_), tfidf.idf_))

[(u'a', 1.0),
 (u'and', 1.0),
 (u'bird', 2.3862943611198908),
 (u'branch', 1.6931471805599454),
 (u'cat', 2.3862943611198908),
 (u'dog', 2.3862943611198908),
 (u'fence', 1.6931471805599454),
 (u'flew', 2.3862943611198908),
 (u'jumped', 2.3862943611198908),
 (u'jumping', 2.3862943611198908),
 (u'leaning', 2.3862943611198908),
 (u'leaping', 2.3862943611198908),
 (u'next', 1.0),
 (u'over', 1.0),
 (u'peering', 2.3862943611198908),
 (u'quickly', 1.2876820724517808),
 (u'ran', 2.3862943611198908),
 (u'scampered', 2.3862943611198908),
 (u'so', 1.2876820724517808),
 (u'squirrel', 2.3862943611198908),
 (u'the', 1.0),
 (u'then', 1.0),
 (u'this', 1.0),
 (u'to', 1.0),
 (u'tree', 1.6931471805599454),
 (u'very', 1.2876820724517808),
 (u'was', 1.0),
 (u'yard', 1.6931471805599454)]

Why do terms that appear in every document have an IDF of 1.0 rather than log(N/x) = 0, and why are their scores not zero?<br>
From the scikit-learn docs:<br>
<i>The actual formula used for tf-idf is tf \* (idf + 1) = tf + tf \* idf, instead of tf \* idf. The effect of this is that terms with zero idf, i.e. that occur in all documents of a training set, will not be entirely ignored. </i>
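
As a cross-check, the IDF values and the weighted scores above can be reproduced by hand. The sketch below assumes Python 3 and that, with smooth_idf=False, the IDF is ln(N / x) + 1, followed by L1 normalization of each row; `tokenize` and `weighted_row` are hypothetical helpers, and the results should agree with the tables above up to rounding.

```python
import math
import re
from collections import Counter

long_sentences = [
    "A dog was jumping over this fence and then ran over to the next yard so very quickly.",
    "A cat was leaning over this fence and then jumped over to the next yard so very quickly.",
    "A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.",
    "A bird was peering over this branch and then flew over to the next tree.",
]

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())

tokenized = [tokenize(s) for s in long_sentences]
n_documents = len(tokenized)
document_frequency = Counter(word for tokens in tokenized for word in set(tokens))

# Assumed unsmoothed IDF: ln(N / x) + 1, so a word found in every
# document gets 1.0 instead of 0.
idf = {word: math.log(n_documents / df) + 1.0
       for word, df in document_frequency.items()}

def weighted_row(tokens):
    """Raw count times IDF for each word, scaled so the row sums to 1 (L1 norm)."""
    raw = {word: count * idf[word] for word, count in Counter(tokens).items()}
    total = sum(raw.values())
    return {word: value / total for word, value in raw.items()}

dog_row = weighted_row(tokenized[0])
bird_row = weighted_row(tokenized[3])

for word in ["the", "quickly", "fence", "dog"]:
    print(word, round(idf[word], 6), round(dog_row[word], 6))
print("bird", round(bird_row["bird"], 6))
```

Because of the added 1, a word like "the" keeps a small positive weight instead of vanishing, which is the behavior the quoted documentation describes.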