# Intro to TF-IDF and Document Comparisons


In [2]:

import itertools
import math
import numpy as np
import nltk
import string

import nlp_utilities as mytools  # our own file of utility functions

## Reading Multiple Files -- Setup

Before we move on to document comparisons, we need to be able to read in multiple files.  Use the new utility we introduced.

In [5]:
mytools.get_filenames("data/movie_reviews/positive/")

['data/movie_reviews/positive/cv670_tok-24009.txt',
 'data/movie_reviews/positive/cv671_tok-10077.txt',
 'data/movie_reviews/positive/cv672_tok-12350.txt',
 'data/movie_reviews/positive/cv673_tok-6552.txt',
 'data/movie_reviews/positive/cv674_tok-11591.txt',
 'data/movie_reviews/positive/cv675_tok-11864.txt',
 'data/movie_reviews/positive/cv676_tok-19999.txt',
 'data/movie_reviews/positive/cv677_tok-11867.txt',
 'data/movie_reviews/positive/cv678_tok-24352.txt',
 'data/movie_reviews/positive/cv679_tok-13972.txt',
 'data/movie_reviews/positive/cv680_tok-18142.txt',
 'data/movie_reviews/positive/cv681_tok-28559.txt',
 'data/movie_reviews/positive/cv682_tok-21593.txt',
 'data/movie_reviews/positive/cv683_tok-12295.txt',
 'data/movie_reviews/positive/cv684_tok-10367.txt',
 'data/movie_reviews/positive/cv685_tok-11187.txt',
 'data/movie_reviews/positive/cv686_tok-22284.txt',
 'data/movie_reviews/positive/cv687_tok-20347.txt',
 'data/movie_reviews/positive/cv688_tok-10047.txt',
 'data/movie_

In [6]:
filenames = mytools.get_filenames("data/movie_reviews/positive/")

## TF-IDF (Term Frequency, Inverse Document Frequency)

Most document analyses for text mining and machine learning do not treat word order as important. (Possible exceptions are sentiment analysis over "time" and parsing, which analyzes the tree structure of sentences for parts of speech and entity identification.)  Analyses that use only the tokens in a text, ignoring their order, are called **"bag-of-words"** analyses.
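
As a minimal sketch of the bag-of-words idea, the snippet below counts tokens with `collections.Counter`. The helper name `bag_of_words` and the two sentences are made up for illustration; they are not part of our utilities file.

```python
from collections import Counter

def bag_of_words(tokens):
    """Hypothetical helper: map each token to its count, discarding order."""
    return Counter(tokens)

# Two sentences with the same words in a different order
s1 = "the dog bit the man".split()
s2 = "the man bit the dog".split()

bag1 = bag_of_words(s1)
bag2 = bag_of_words(s2)

print(bag1)
print(bag1 == bag2)  # the bags are equal even though the sentences differ
```

The two sentences mean very different things, but a bag-of-words analysis cannot tell them apart. That is the trade-off we accept for simplicity.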

Some definitions for a smarter word-counting approach, using collections of texts:


**Term Frequency**: The number of times a word appears in a document (the token counts we saw already), usually expressed as a fraction of the document's total words.

**Document Frequency**: The number of documents in a collection that contain the word.

**TF-IDF** is **Term Frequency × Inverse Document Frequency**: roughly term frequency divided by document frequency, with a logarithm and some extra adjustments.

Example from [Manning, Raghavan, and Schuetze](http://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html) showing that the IDF of a rare term is high, given the document frequencies of these terms in a corpus:


<img src="assets/doc_freq.png">


TF-IDF for a word and document is usually calculated as:

**(Word t's frequency in the doc) * Log( Number of Docs / Number of docs that contain the word t)**

In practice, however, implementations usually add 1 in one or two places, for example to avoid dividing by zero.  You can consider TF-IDF an information measure for document words (or "features") in a bag-of-words style analysis, where only the set of words matters and not their order. It is a **"weight"** for a word. Some features of TF-IDF:

* If a term is very frequent across the whole document set (or corpus), it is less interesting overall and gets a low TF-IDF score. Note that this tends to suppress stopwords for you!  Stopwords are common in all texts, so they will have low scores. However, you need a lot of documents for this to work well.
* Beware of the effects of TF-IDF on small numbers of documents; it may not work as you would hope.
* A term (or token) that is frequent in a few documents, but absent from most, gets a higher score. That word helps "distinguish" or characterize (or describe) those documents.
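
To make the "+ 1" remark concrete, here is a small sketch comparing the plain IDF from the formula above with one common smoothed variant. The smoothed form shown is the one scikit-learn documents for its default `smooth_idf=True` setting; other libraries use different adjustments, so treat it as one example and not the definition. The function names are made up for illustration.

```python
import math

def idf_plain(n_docs, doc_freq):
    # Unsmoothed IDF, as in the formula above; fails if doc_freq is 0
    return math.log(n_docs / doc_freq)

def idf_smoothed(n_docs, doc_freq):
    # One common "+1" variant: add 1 to numerator and denominator, then add 1
    # to the result so terms found in every document do not get weight zero
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 30
for doc_freq in (1, 15, 30):
    print(doc_freq,
          round(idf_plain(n_docs, doc_freq), 3),
          round(idf_smoothed(n_docs, doc_freq), 3))
```

With the plain form, a term that appears in all 30 documents gets an IDF of exactly 0. With the smoothed form it gets 1, and a term that appears in no document at all no longer causes a division by zero.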

See the discussion in [Manning, Raghavan, and Schuetze](http://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html), and even [more math in Wikipedia](http://en.wikipedia.org/wiki/Tf%E2%80%93idf). 

**You can always check to see if the implementation you use cleans stopwords or not and decide if you like that.**

Some more python references:
* [Demo using TextBlob, another lib](http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/)
* [A version written on top of NLTK](https://github.com/yebrahim/TF-IDF-Generator)
* [TF-IDF in gensim](http://radimrehurek.com/gensim/tutorial.html)
* [TF-IDF in scikit-learn](http://scikit-learn.org/stable/modules/feature_extraction.html)


In [7]:
# code example from Building Machine Learning Systems with Python (Richert & Coelho) 
# - modified slightly by Lynn

import math

def tfidf(t, d, D):
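    # term frequency: the count of t in d as a fraction of the doc's tokens.
    # d.count counts how many times t occurs in d.
    tf = float(d.count(t)) / len(d)
    # Note this version doesn't use +1 in the denominator as many do,
    # so we return 0 if no document contains t (avoids dividing by zero).
    docs_with_t = len([doc for doc in D if t in doc])
    if docs_with_t == 0:
        return 0.0
    idf = math.log(float(len(D)) / docs_with_t)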
    return tf * idf

In [8]:
# How this works - if d is a document of tokens:
d = ["a", "b", "c"]
d.count("b")

1

Here's a simple example with letters instead of words as our tokens.

In [9]:

doc_a = ["a"]
doc_abb = ["a", "b", "b"]
doc_abc = ["a", "b", "c"]
# D is the collection of "documents"
D = [doc_a, doc_abb, doc_abc]

print("a in doc_a", tfidf("a", doc_a, D))   # a is in all of them
print("a in doc_abc", tfidf("a", doc_abc, D)) # a is in all of them
print("b in doc_abc", tfidf("b", doc_abc, D)) # b occurs only once here, but in 2 docs
print("b in doc_abb", tfidf("b", doc_abb, D)) # b occurs more frequently in this doc
print("c in doc_abc", tfidf("c", doc_abc, D)) # c is unique in the doc set

a in doc_a 0.0
a in doc_abc 0.0
b in doc_abc 0.13515503603605478
b in doc_abb 0.27031007207210955
c in doc_abc 0.3662040962227032
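
As a check on these numbers, the formula can be worked by hand (using the natural log, as `math.log` does). The token `b` appears in 2 of the 3 documents, `c` in 1 of 3, and `a` in all 3:

$$
\begin{aligned}
\text{tfidf}(b, \text{doc\_abc}) &= \tfrac{1}{3}\,\ln\tfrac{3}{2} \approx 0.1352 \\
\text{tfidf}(b, \text{doc\_abb}) &= \tfrac{2}{3}\,\ln\tfrac{3}{2} \approx 0.2703 \\
\text{tfidf}(c, \text{doc\_abc}) &= \tfrac{1}{3}\,\ln\tfrac{3}{1} \approx 0.3662 \\
\text{tfidf}(a, \text{doc\_a}) &= \tfrac{1}{1}\,\ln\tfrac{3}{3} = 0
\end{aligned}
$$

The score for `b` doubles in `doc_abb` only because its term frequency doubles; the IDF factor is the same for every document.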


Here's an example using the same code, but fake tweets.

In [10]:
doc_a = nltk.word_tokenize("This tweet is about a cute kitten.")
doc_b = nltk.word_tokenize("This tweet is about Donald Trump.")
doc_c = nltk.word_tokenize("This tweet is John Oliver talking about Trump.")
doc_d = nltk.word_tokenize("Donald Trump said something shocking in a tweet.")

In [11]:
D = [doc_a, doc_b, doc_c, doc_d]

In [12]:
D

[['This', 'tweet', 'is', 'about', 'a', 'cute', 'kitten', '.'],
 ['This', 'tweet', 'is', 'about', 'Donald', 'Trump', '.'],
 ['This', 'tweet', 'is', 'John', 'Oliver', 'talking', 'about', 'Trump', '.'],
 ['Donald', 'Trump', 'said', 'something', 'shocking', 'in', 'a', 'tweet', '.']]

In [13]:
doc_a

['This', 'tweet', 'is', 'about', 'a', 'cute', 'kitten', '.']

In [14]:
tfidf("Trump", doc_c, D)

0.03196467471686454

In [15]:
tfidf("Trump", doc_d, D)

0.03196467471686454

In [16]:
tfidf("kitten", doc_a, D)

0.17328679513998632

In [17]:
tfidf("tweet", doc_b, D) # try it in other docs!

0.0
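
Instead of checking one word at a time, we can rank every token in each tweet. The sketch below repeats the `tfidf` function so it runs on its own, and copies the token lists from the `D` output above so it does not need the NLTK tokenizer. The helper `rank_terms` is a made-up name for illustration.

```python
import math

def tfidf(t, d, D):
    tf = d.count(t) / len(d)
    idf = math.log(len(D) / len([doc for doc in D if t in doc]))
    return tf * idf

# Token lists copied from the output of D above
D = [
    ['This', 'tweet', 'is', 'about', 'a', 'cute', 'kitten', '.'],
    ['This', 'tweet', 'is', 'about', 'Donald', 'Trump', '.'],
    ['This', 'tweet', 'is', 'John', 'Oliver', 'talking', 'about', 'Trump', '.'],
    ['Donald', 'Trump', 'said', 'something', 'shocking', 'in', 'a', 'tweet', '.'],
]

def rank_terms(d, D, top_n=3):
    """Hypothetical helper: the top_n distinct tokens of d by tf-idf score."""
    # dict.fromkeys keeps first-seen order, so ties are broken predictably
    scores = {t: tfidf(t, d, D) for t in dict.fromkeys(d)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

for d in D:
    print(" ".join(d))
    for term, score in rank_terms(d, D):
        print(f"    {term:10s} {score:.4f}")
```

Note that the comparison is case-sensitive: `This` and `this` would count as different tokens unless we lowercase them first.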

It turns out there is a function in NLTK (the main Python text library) that will calculate TF-IDF for us.  It lives on the `TextCollection` object. There is also an implementation in scikit-learn, the machine learning library.  We will use that one shortly.

**Alert:  The functions in NLTK are much slower than those in scikit-learn. For anything as long as a book, use scikit-learn instead.**

In [19]:
# Here are some more functions we can use from NLTK.  

def makeText_from_tokens(tokens):
    return nltk.Text(tokens)

def makeTextCollection(tokenslist):
    # the input is a list of lists - the tokens for each doc read in
    texts = [nltk.Text(doc) for doc in tokenslist]
    collection = nltk.TextCollection(texts)
    # it's useful to return both the list of texts, and the collection object
    return collection, texts

def compute_tfidfs_by_doc(filenames):
    """ Takes a list of filenames, tokenizes each file, and returns a list of dicts
    (one per word per document) with tf-idf scores."""

    import nlp_utilities as nlp


    alltokens = []  # make a list of lists for the texts
    textslist = nlp.load_texts_as_string(filenames) # returns a dict
    for text in textslist.values():
        # I'm not cleaning them so you can see how tf-idf helps
        alltokens.append(nltk.word_tokenize(text))
    collection, textobjs = makeTextCollection(alltokens)
    
    # this is where we will store our results
    stats = [] # we are going to store data for each word and doc in a list of dictionaries
    
    for i, text in enumerate(textobjs):
        # we use enumerate to give us a counter for the text number we are on.
        for word in text.vocab().keys():  # just use the words in this text.
            # the function tf_idf is a feature of the TextCollection object in nltk.
            tfidfscore = collection.tf_idf(word, text)
            tf = collection.tf(word, text) # count / len(text), i.e. the word's share of the text
            count = text.count(word) # the raw number of times the word occurs in the doc
            if tfidfscore > 0: # i.e., the word is not in all the docs!
                stats.append({
                    "word": word,
                    "tfidf": tfidfscore,
                    "tf": tf,
                    "count": count,
                    "filename": filenames[i]
                })
    return stats

So... we need to read in a folder of files, tokenize them, and create a TextCollection from them.  Once we have a TextCollection, we can look at TF-IDF for words in the documents.

In [51]:
posfiles = mytools.get_filenames("data/movie_reviews/positive")

In [52]:
len(posfiles)

30

In [53]:
res = compute_tfidfs_by_doc(posfiles)

In [54]:
res[0:10]

[{'count': 3,
  'filename': 'data/movie_reviews/positive/cv670_tok-24009.txt',
  'tf': 0.010033444816053512,
  'tfidf': 0.01614820647927191,
  'word': 'keep'},
 {'count': 1,
  'filename': 'data/movie_reviews/positive/cv670_tok-24009.txt',
  'tf': 0.0033444816053511705,
  'tfidf': 0.006738806088770116,
  'word': 'boys'},
 {'count': 1,
  'filename': 'data/movie_reviews/positive/cv670_tok-24009.txt',
  'tf': 0.0033444816053511705,
  'tfidf': 0.0023182179951837635,
  'word': 'character'},
 {'count': 1,
  'filename': 'data/movie_reviews/positive/cv670_tok-24009.txt',
  'tf': 0.0033444816053511705,
  'tfidf': 0.0023182179951837635,
  'word': 'she'},
 {'count': 2,
  'filename': 'data/movie_reviews/positive/cv670_tok-24009.txt',
  'tf': 0.006688963210702341,
  'tfidf': 0.001492599005446219,
  'word': '``'},
 {'count': 2,
  'filename': 'data/movie_reviews/positive/cv670_tok-24009.txt',
  'tf': 0.006688963210702341,
  'tfidf': 0.0012195421859127397,
  'word': 'not'},
 {'count': 1,
  'filename': 

### Pandas makes it much easier to deal with this data quickly

In [18]:
import pandas as pd

We can load a list of dictionaries into a DataFrame easily.

In [55]:
data = pd.DataFrame.from_dict(res)

In [56]:
data.head()

   count                                         filename        tf     tfidf       word
0      3  data/movie_reviews/positive/cv670_tok-24009.txt  0.010033  0.016148       keep
1      1  data/movie_reviews/positive/cv670_tok-24009.txt  0.003344  0.006739       boys
2      1  data/movie_reviews/positive/cv670_tok-24009.txt  0.003344  0.002318  character
3      1  data/movie_reviews/positive/cv670_tok-24009.txt  0.003344  0.002318        she
4      2  data/movie_reviews/positive/cv670_tok-24009.txt  0.006689  0.001493         ``


In [57]:
# filename is in every row - each word has its own row and score.
byfile = data.groupby("filename")

### Now let's sort each subset by different values and see what we got

In [58]:
for group, rows in byfile:
    print(group)
    print(rows.sort_values(by="count", ascending=False).head(5)[["word", "count"]])

data/movie_reviews/positive/cv670_tok-24009.txt
      word  count
110  movie      7
104   they      7
40     you      6
29       !      6
73    this      6
data/movie_reviews/positive/cv671_tok-10077.txt
        word  count
143  douglas     10
311       he      9
186       ``      6
227      has      6
205     this      6
data/movie_reviews/positive/cv672_tok-12350.txt
        word  count
399       ``      6
479        i      6
545  hunting      5
366     film      4
454     more      4
data/movie_reviews/positive/cv673_tok-6552.txt
    word  count
791    (     12
676    )     12
593   on      9
601  for      7
788  are      5
data/movie_reviews/positive/cv674_tok-11591.txt
     word  count
1119  rob     13
1116   at     11
1147   he     11
1121  his     10
1162  you      8
data/movie_reviews/positive/cv675_tok-11864.txt
      word  count
1240    ``     39
1507    as     15
1211  film     12
1427    by     11
1451   was     11
data/movie_reviews/positive/cv676_tok-19999.txt
           

In [59]:
for group, rows in byfile:
    print(group)
    print(rows.sort_values(by="tfidf", ascending=False).head(5)[["word", "tfidf"]])

data/movie_reviews/positive/cv670_tok-24009.txt
          word     tfidf
78    butthead  0.045501
99      beavis  0.045501
21   searching  0.022750
100         tv  0.020216
26       wants  0.020216
data/movie_reviews/positive/cv671_tok-10077.txt
           word     tfidf
143     douglas  0.049201
168   detective  0.028932
150        kirk  0.021803
207       cases  0.014760
250  connection  0.014535
data/movie_reviews/positive/cv672_tok-12350.txt
         word     tfidf
545   hunting  0.029119
492     damon  0.019807
459  smartest  0.014629
525   titanic  0.014629
544  williams  0.010383
data/movie_reviews/positive/cv673_tok-6552.txt
        word     tfidf
656    carry  0.034217
625  caravan  0.027374
616    carol  0.020530
587  kenneth  0.020530
577     anna  0.020530
data/movie_reviews/positive/cv674_tok-11591.txt
          word     tfidf
1119       rob  0.052638
916   fidelity  0.028343
797     cusack  0.020245
1023    frears  0.016196
833       high  0.014931
data/movie_reviews/posi

For the most part, few stop words appear in the top TF-IDF scoring words, although there are some junk characters. Mainly we see the words that seem to be the most important distinguishing words for each document, such as movie name or job location (depending on the dataset we look at).  We did not clean the tokens at all!
How would we clean the tokens too?
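
One possible answer, as a rough sketch: lowercase each token, then drop stopwords and tokens made up only of punctuation. The `clean_tokens` helper and the tiny stopword set below are made up for illustration; this is not the implementation of `mytools.tokenize_clean`, and a real stopword list (such as NLTK's) is much longer.

```python
import string

# A tiny illustrative stopword set (an assumption, not a real stopword list)
STOPWORDS = {"a", "about", "in", "is", "the", "this"}

def clean_tokens(tokens, stopwords=STOPWORDS):
    """Hypothetical helper: lowercase, then drop stopwords and punctuation-only tokens."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower()
        if tok in stopwords:
            continue
        # Skip tokens such as '.', '!' or '``' that contain only punctuation
        if all(ch in string.punctuation for ch in tok):
            continue
        cleaned.append(tok)
    return cleaned

tokens = ['This', 'tweet', 'is', 'about', 'a', 'cute', 'kitten', '.', '``']
print(clean_tokens(tokens))  # ['tweet', 'cute', 'kitten']
```

Cleaning before computing TF-IDF removes the junk characters seen above. It also changes document lengths, so the scores will shift.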

### Using TF-IDF in Scikit-learn:  Use this one for any big documents or large collection of docs.

The helper functions in this section are adapted from [this TF-IDF analysis post](https://buhrmann.github.io/tfidf-analysis.html).

In [4]:
posfiles = mytools.get_filenames("data/movie_reviews/positive")
texts = mytools.load_texts_as_string(posfiles)

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
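# max_df=0.8 ignores terms found in more than 80% of the docs;
# min_df=0.2 ignores terms found in fewer than 20% of the docs.
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.2, tokenizer=mytools.tokenize_clean)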

# this applies the vectorizer we defined to the texts!
tfidf_matrix = tfidf_vectorizer.fit_transform(texts.values()) #fit the vectorizer to texts

print(tfidf_matrix.shape) # this is the size of the array of features, rows and columns.



(30, 183)
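
The matrix has only 183 columns because `max_df` and `min_df` prune the vocabulary by document frequency. The sketch below imitates that pruning in plain Python, assuming (as the scikit-learn documentation describes) that float values are treated as proportions of the number of documents. The function `filter_by_doc_freq` and the toy documents are made up for illustration; it is not how scikit-learn is implemented internally.

```python
from collections import Counter

def filter_by_doc_freq(docs, min_df=0.2, max_df=0.8):
    """Hypothetical helper: keep terms whose document frequency lies within
    [min_df * n_docs, max_df * n_docs]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

docs = [
    ["movie", "good", "plot"],
    ["movie", "bad", "plot"],
    ["movie", "good", "acting"],
    ["movie", "long"],
    ["movie", "good"],
]

# With 5 docs, min_df=0.4 means "at least 2 docs" and max_df=0.8 "at most 4 docs"
print(filter_by_doc_freq(docs, min_df=0.4, max_df=0.8))  # ['good', 'plot']
```

With our 30 reviews, `min_df=0.2` and `max_df=0.8` correspond to terms that appear in at least 6 and at most 24 documents. That is why rare names like "beavis" do not appear among the features here.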


In [19]:
tfidf_matrix[0]

<1x183 sparse matrix of type '<class 'numpy.float64'>'
	with 65 stored elements in Compressed Sparse Row format>

In [28]:
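# these are the words (features) kept by the vectorizer, in column order.
# (scikit-learn 1.0+; older versions used get_feature_names(), which has since been removed.)
features = list(tfidf_vectorizer.get_feature_names_out())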

In [9]:
features[0:20]

["'m",
 "'re",
 "'ve",
 'acting',
 'action',
 'actor',
 'actors',
 'actually',
 'almost',
 'also',
 'although',
 'another',
 'around',
 'audience',
 'away',
 'back',
 'based',
 'become',
 'begins',
 'best']

In [29]:
# Code from https://buhrmann.github.io/tfidf-analysis.html

def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    import pandas as pd
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

def top_feats_in_doc(Xtr, features, row_id, top_n=25):
    ''' Top tfidf features in specific document (matrix row) '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)
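
The `np.argsort(row)[::-1][:top_n]` idiom in `top_tfidf_feats` may look cryptic, so here is a small sketch of what it does. The feature names and scores are made-up values for illustration.

```python
import numpy as np

# Made-up scores for five features in one document row
features = ["also", "however", "like", "movie", "james"]
row = np.array([0.0, 0.39, 0.14, 0.0, 0.28])
top_n = 3

order = np.argsort(row)         # indices that sort the scores in ascending order
topn_ids = order[::-1][:top_n]  # reverse for descending order, then keep the first top_n

top = [(features[i], float(row[i])) for i in topn_ids]
print(top)  # [('however', 0.39), ('james', 0.28), ('like', 0.14)]
```

`argsort` returns positions and not values, which is what lets us look up the matching word in `features`.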

In [36]:
top_feats_in_doc(tfidf_matrix, features, 4, top_n=10)

   feature     tfidf
0  however  0.391266
1     seem  0.371693
2    james  0.275637
3     soon  0.275637
4    films  0.247795
5   comedy  0.225565
6    seems  0.207058
7     like  0.139136
8    break  0.137818
9     lead  0.137818


In [30]:
## A row in the matrix corresponds to a document.  We can figure out which document by looking 
## back at the texts dictionary.

In [35]:
list(texts.values())[4]

"krippendorf's tribe is a formula comedy . done poorly , formulaic comedies might seem to signify the downfall of american cinema . however , every now and then , one emerges , like krippendorf's tribe , that actually works . professor james krippendorf ( richard dreyfuss ) , the renowned anthropologist , is in trouble . his university gave him a hefty grant to discover a lost tribe in new guinea . however , he found . . . nothing . his wife has recently died , and he has spent the remainder of the grant money in raising his three kids : shelly ( natasha lyonne ) , mickey ( gregory smith ) and edmund ( carl michael lidner ) . tonight , he is expected to lecture on his newfound tribe . rather than break the news ( and face the consequences of misusing his funds ) , he invents a tribe : the shelmikedmu ( named after his kids ) . however , one lie begets another as he is not only required to deliver filmed proof of the shelmikedmu , but his research becomes a popular phenomenon . soon , p

In [26]:
def top_mean_feats(Xtr, features, grp_ids=None, min_tfidf=0.1, top_n=25):
    ''' Return the top n features that on average are most important amongst documents in rows
        identified by indices in grp_ids. '''
    if grp_ids is not None:  # "if grp_ids:" would fail for a numpy array of indices
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()

    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)

In [27]:
top_mean_feats(tfidf_matrix, features)

        feature     tfidf
0         movie  0.111948
1           n't  0.107049
2          like  0.064903
3           two  0.053460
4         first  0.050621
5    characters  0.049678
6          time  0.049567
7      williams  0.043302
8  performances  0.043216
9          also  0.041393
