## TFIDF Practice

Today we will be taking a closer look at TFIDF as a descriptive measure of rare terms, and practicing some basic plots with TFIDF.

In [89]:
# Import libraries and setup code
import pandas as pd
import seaborn as sns
import numpy as np

# sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer

%matplotlib inline

# sklearn's version of tfidf

Deep in the swamp of sklearn's documentation lies some useful information about its exact implementation.

```
    The formula that is used to compute the tf-idf of term t is
    tf-idf(d, t) = tf(t) * idf(d, t), and the idf is computed as
    idf(d, t) = log [ n / df(d, t) ] + 1 (if ``smooth_idf=False``),
    where n is the total number of documents and df(d, t) is the
    document frequency; the document frequency is the number of documents d
    that contain term t. The effect of adding "1" to the idf in the equation
    above is that terms with zero idf, i.e., terms  that occur in all documents
    in a training set, will not be entirely ignored.
    (Note that the idf formula above differs from the standard
    textbook notation that defines the idf as
    idf(d, t) = log [ n / (df(d, t) + 1) ]).
    
    If ``smooth_idf=True`` (the default), the constant "1" is added to the
    numerator and denominator of the idf as if an extra document was seen
    containing every term in the collection exactly once, which prevents
    zero divisions: idf(d, t) = log [ (1 + n) / 1 + df(d, t) ] + 1.
    Furthermore, the formulas used to compute tf and idf depend
    on parameter settings that correspond to the SMART notation used in IR
    as follows:
    
    Tf is "n" (natural) by default, "l" (logarithmic) when
    ``sublinear_tf=True``.
    Idf is "t" when use_idf is given, "n" (none) otherwise.
    Normalization is "c" (cosine) when ``norm='l2'``, "n" (none)
    when ``norm=None``.
```
> https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/feature_extraction/text.py#L941
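
As a rough sketch of the two idf variants quoted above, the formulas can be written out directly. The function names `idf_plain` and `idf_smooth` are hypothetical helpers (not part of sklearn), and the smoothed variant assumes the denominator is meant to be read as `(1 + df)`.

```python
import math

def idf_plain(n, df):
    """idf with smooth_idf=False: log(n / df) + 1."""
    return math.log(n / float(df)) + 1.0

def idf_smooth(n, df):
    """idf with smooth_idf=True: log((1 + n) / (1 + df)) + 1."""
    return math.log((1.0 + n) / (1.0 + df)) + 1.0

n_docs = 3
for doc_freq in (1, 3):
    # doc_freq is the number of documents containing the term
    print("df=%d  plain=%.4f  smooth=%.4f"
          % (doc_freq, idf_plain(n_docs, doc_freq), idf_smooth(n_docs, doc_freq)))
```

A term that appears in every document gets an idf of 1.0 under both variants, so it is down-weighted but not ignored. Smoothing pulls the idf of rare terms down slightly.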

# tfidf from scratch

Our version will differ slightly from sklearn's version, but the effect of rare words will be the same.  Mainly, sklearn's version normalizes the result and handles additional transformation features.  Our example will look at a specific term and calculate its tf-idf for a given document, relative to all documents in a corpus.

In [90]:
corpus = ["I am a cat", "I am a bat", "I am a rat"] # which animal are you?

In [106]:
# A function to "tokenize" a document (one sentence) into a list of lowercase words.
tokenize = lambda doc: doc.lower().split(" ")

# Calculates the frequency of a term in a document
def term_frequency(term, document):
    return len([word for word in document if word == term]) / float(len(document))

# log(total documents / documents containing the term) + 1
def inverse_document_frequency(term, corpus):
    
    total_documents = len(corpus)
    documents_with_term = len([doc for doc in corpus if term in doc])
    
    print("The term '%s' appears in %d document(s) of %d total document(s)" % (term, documents_with_term, total_documents))
    
    print("non logged IDF:  %s" % (float(total_documents) / float(documents_with_term)))
    
    return np.log(float(total_documents) / documents_with_term) + 1.0


## Now let's try our code out
- Common terms
- Rare terms
- Different documents / terms

In [107]:
corpus

['I am a cat', 'I am a bat', 'I am a rat']

In [110]:
# Each sentence becomes a list of words
tokenized_documents = [tokenize(doc) for doc in corpus]

# The term we are looking at
term = "i"

# The term in the SPECIFIC document we care about
tf = term_frequency(term, tokenized_documents[0])

# Basic TFIDF
idf = inverse_document_frequency(term, tokenized_documents)

print("")
print("Term Frequency:  %s" % tf)
print("Inverse Document Frequency:  %s" % idf)
print("TF*IDF:  %s" % (tf * idf))

The term 'i' appears in 3 document(s) of 3 total document(s)
non logged IDF:  1.0

Term Frequency:  0.25
Inverse Document Frequency:  1.0
TF*IDF:  0.25
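
For contrast, the same calculation can be repeated for a rare term. The sketch below restates the helpers without the print statements (the name `tf_idf` is an addition for illustration) and compares the common term "i" with the rare term "cat" in the first document.

```python
import math

corpus = ["I am a cat", "I am a bat", "I am a rat"]
tokenized = [doc.lower().split(" ") for doc in corpus]

def term_frequency(term, document):
    # Share of the document's tokens that equal the term
    return document.count(term) / float(len(document))

def inverse_document_frequency(term, documents):
    # log(total documents / documents containing the term) + 1
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / float(containing)) + 1.0

def tf_idf(term, document, documents):
    # Hypothetical convenience wrapper: tf in one document times idf over all documents
    return term_frequency(term, document) * inverse_document_frequency(term, documents)

common = tf_idf("i", tokenized[0], tokenized)
rare = tf_idf("cat", tokenized[0], tokenized)
print("tf-idf of 'i':   %.4f" % common)
print("tf-idf of 'cat': %.4f" % rare)
```

Both terms have the same term frequency in the first document, so the difference comes entirely from the idf. Note that this simple version divides by zero for a term that appears in no document.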


## A basic example, revisited.

The best way to understand TFIDF is to use it at a basic level, again.  For our mini-practice today, we will revisit the cat, the rat, and the bat.

In [77]:
corpus = ["I am a cat.", "I am a bat.", "I am a rat."] # which animal are you?

## 1. Vectorize your corpus using TfidfVectorizer and set to a variable called "X"
1. Initialize TfidfVectorizer to a new variable.
1. Set X to the return of fit_transform.

In [118]:
# Initialize TfidfVectorizer to a new variable.
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words=None)


# Set X to the return of fit_transform.
corpus = ["I am a cat cat.", "I am a bat that is not a cat.", "I am a rat that is also not a cat."] # which animal are you?
y = ["trump", "sanders", "sanders"]

X = tvec.fit_transform(corpus)

# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
df = pd.DataFrame(X.toarray(), columns=tvec.get_feature_names())
df['target'] = y
df

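|   | also | am | bat | cat | is | not | rat | that | target |
|---|------|----|-----|-----|----|-----|-----|------|--------|
| 0 | 0.0 | 0.447214 | 0.0 | 0.894427 | 0.0 | 0.0 | 0.0 | 0.0 | trump |
| 1 | 0.0 | 0.31877 | 0.539725 | 0.31877 | 0.410475 | 0.410475 | 0.0 | 0.410475 | sanders |
| 2 | 0.474961 | 0.28052 | 0.0 | 0.28052 | 0.36122 | 0.36122 | 0.474961 | 0.36122 | sanders |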


In [120]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
model = logreg.fit(X, df['target'])


In [123]:
model.predict_proba(X), model.predict(X)

(array([[ 0.49994553,  0.50005447],
        [ 0.640828  ,  0.359172  ],
        [ 0.64624573,  0.35375427]]),
 array(['trump', 'sanders', 'sanders'], dtype=object))

## 2. Look at your vectorized corpus, X, as an array or a dense matrix.
What does it look like?
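
One way to look at the vectorized corpus, sketched here on the three-sentence corpus with a default `TfidfVectorizer` (an assumption, since any settings would do): `fit_transform` returns a scipy sparse matrix, and `.toarray()` converts it to a plain numpy array that is easier to read.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I am a cat.", "I am a bat.", "I am a rat."]
X = TfidfVectorizer().fit_transform(corpus)

# X is sparse: one row per document, one column per term
print(sparse.issparse(X), X.shape)

# Dense view for inspection (fine for toy data, costly for large corpora)
dense = X.toarray()
print(np.round(dense, 3))
```

With the default `norm='l2'`, each row of the dense array has unit length, which is why the values are not raw counts.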

In [134]:
raw_data = [
    {"text": "There is a lot of cheese in that store.", "label": "@trump"},
    {"text": "Those damn kids better get off my lawn", "label": "@dyerrington"},
    {"text": "Gonna raise those taxes on those mofos in Canada", "label": "@trump"},
    {"text": "Always eat bacon, from Ireland.", "label": "@dyerrington"},
]

df = pd.DataFrame(raw_data)

vect = CountVectorizer()
X = vect.fit_transform(df['text'])

text_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
text_df['target'] = df['label']
text_df

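|   | always | bacon | better | canada | cheese | damn | eat | from | get | gonna | ... | of | off | on | raise | store | taxes | that | there | those | target |
|---|--------|-------|--------|--------|--------|------|-----|------|-----|-------|-----|----|-----|----|-------|-------|-------|------|-------|-------|--------|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | @trump |
| 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | @dyerrington |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | @trump |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | @dyerrington |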


## 3. Initialize a new dataframe with an array from X.
Also, set the columns to the output of the TF-IDF vectorizer object's `.get_feature_names()` method (`.get_feature_names_out()` in scikit-learn 1.0 and later).  Without it, you won't be able to tell which word feature corresponds to which matrix column.

Each row will correspond to a document from the original corpus object.  Verify that each row matches the original dataset with the word features.

In [85]:
# setup your dataframe here
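
A minimal sketch of one possible setup, assuming the three-sentence corpus and a default `TfidfVectorizer`. Using the sentences as the index is an illustrative choice that makes it easy to verify each row against its document.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I am a cat.", "I am a bat.", "I am a rat."]
tvec = TfidfVectorizer()
X = tvec.fit_transform(corpus)

# Newer scikit-learn exposes get_feature_names_out(); older releases only
# have get_feature_names(). Use whichever this installation provides.
if hasattr(tvec, "get_feature_names_out"):
    columns = list(tvec.get_feature_names_out())
else:
    columns = list(tvec.get_feature_names())

# One row per document, one labeled column per term
tfidf_df = pd.DataFrame(X.toarray(), columns=columns, index=corpus)
print(tfidf_df.round(3))
```

The single-character tokens "I" and "a" do not appear as columns because the vectorizer's default token pattern only keeps tokens of two or more characters.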


## 4. Aggregate your data with mean, median, min, max.  Plot each of your results with a "bar" or "barh" figure.
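
One possible approach to the aggregation, sketched under the assumption that the data is the three-sentence corpus: pandas' `DataFrame.agg` accepts a list of function names and returns one row per aggregate. The variable names here are illustrative, and the plotting call is left as a comment because it needs matplotlib.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I am a cat.", "I am a bat.", "I am a rat."]
tvec = TfidfVectorizer()
X = tvec.fit_transform(corpus)

# vocabulary_ maps each term to its column index; order terms by that index
terms = sorted(tvec.vocabulary_, key=tvec.vocabulary_.get)
tfidf_df = pd.DataFrame(X.toarray(), columns=terms)

# Rows become terms, columns become the four aggregates
summary = tfidf_df.agg(["mean", "median", "min", "max"]).T
print(summary.round(3))

# With matplotlib available, each aggregate can be plotted, for example:
# summary["max"].plot(kind="barh", title="max tf-idf per term")
```

In this corpus "am" has the same weight in every document, so its four aggregates coincide, while each animal has a minimum of zero.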

## 5. Refactor your existing code into a function.
- Your method should accept a corpus object, a vectorizer type (tfidf or countvectorizer), and aggregate function parameter (string or function -- your choice!).
- Your method should output a figure

An example use case of your code would be:
> ```python
>  corpus = ["I am a rat", "I am a cat", "I am a bat"]
>
>  # TFIDF plot with max aggregation
>  vectorize_and_plot(corpus, vectorizer="tfidf", agg_func="max")
>  [your plot here] 
>
>  # COUNT plot with max aggregation
>  vectorize_and_plot(corpus, vectorizer="count", agg_func="max")
>  [your plot here] 
>  ```
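
A hedged sketch of what such a function might look like. The `plot` flag and the `VECTORIZERS` lookup table are assumptions added for illustration, so that the aggregation can be inspected without drawing a figure. The function returns the aggregated scores.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical lookup from the string argument to a vectorizer class
VECTORIZERS = {"tfidf": TfidfVectorizer, "count": CountVectorizer}

def vectorize_and_plot(corpus, vectorizer="tfidf", agg_func="max", plot=True):
    """Vectorize the corpus, aggregate each term column, optionally draw a barh plot."""
    vec = VECTORIZERS[vectorizer]()
    X = vec.fit_transform(corpus)
    # vocabulary_ maps each term to its column index; order terms by that index
    terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
    frame = pd.DataFrame(X.toarray(), columns=terms)
    # agg_func may be a string such as "max" or a callable
    scores = frame.agg(agg_func).sort_values()
    if plot:
        # Requires matplotlib
        scores.plot(kind="barh", title="%s (%s)" % (vectorizer, agg_func))
    return scores

corpus = ["I am a rat", "I am a cat", "I am a bat"]
print(vectorize_and_plot(corpus, vectorizer="count", agg_func="max", plot=False))
print(vectorize_and_plot(corpus, vectorizer="tfidf", agg_func="max", plot=False))
```

Returning the scores as well as plotting them keeps the function easy to check: with counts every term has a maximum of 1, while with tf-idf the rare animal terms outrank "am".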

## 6. Use your function to compare CountVectorizer vs TfidfVectorizer 
- Use the original corpus object
- THEN try using the new corpus object below

In [43]:
## Original corpus
# corpus = ["I am a cat.", "I am a bat.", "I am a rat."] # which animal are you?

## New corpus
# corpus = ["I am a cat.", "I am a bat.", "I am a rat.", "the cat is not a rat", "There is not a cat that sat"] # which animal are you?


## 7. Check out this awesome pipeline.
- Fit the corpus object
- Try to run a basic prediction

_This is not a real-world problem, but hopefully you get a sense of the basics of looking at text data and its application in sklearn._

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups

# setup our data

""" We use this list to filter which categories we want from our sample newsgroups dataset """
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

training_data = fetch_20newsgroups(
    subset       =  'train', 
    categories   =  categories,
    shuffle      =  True, 
    random_state =  42,
    remove       = ('headers', 'footers', 'quotes'))

test_data = fetch_20newsgroups(
    subset       =  'test', 
    categories   =  categories,
    shuffle      =  True, 
    random_state =  42,
    remove       =  ('headers', 'footers', 'quotes')
)

""" Our training data """
X_train = training_data.data
y_train = training_data.target

""" Our testing data"""
X_test = test_data.data
y_test = test_data.target

# Remember all that code needed to vectorize our text data before modeling?  
# We still need to use it in order to do EDA and evaluate our dataset before we model. 
# DO NOT GET IN THE HABIT OF MODELING WITHOUT EDA!
pipeline = Pipeline([
    ('vect', CountVectorizer()),     # You will have questions about this
    ('tfidf', TfidfTransformer()),
    # ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])

# Fit our data to the pipeline AS IF it were any other model like we've previously done in sklearn
# Note:  X_train is literal text data, RAW format!
model = pipeline.fit(X_train, y_train)
model.score(X_test, y_test)

0.73318551367331852
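
On the question of why the pipeline chains `CountVectorizer` and `TfidfTransformer` instead of using `TfidfVectorizer`: with default settings the two routes should give the same weights. The sketch below checks this on a small made-up corpus (the corpus is an assumption, used only to keep the check fast).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

corpus = ["I am a cat.", "I am a bat.", "I am a rat.", "the cat is not a rat"]

# Two steps: raw text -> term counts -> tf-idf weights
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One step: raw text -> tf-idf weights
one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("Identical weights:", same)
```

Keeping the steps separate makes it easy to drop the tf-idf step later and model on raw counts alone.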

## 8. Reference the vectorized matrix from the "model" object returned by the Pipeline instance.
Plot the vectorized data with aggregation.  Your previous function will work well if you refactor it. 

_It's helpful to look at different subset classes to understand how each feature may contribute to prediction.  To choose a model, anticipate how it may perform, and confidently evaluate and report findings, it's essential to look at your data first._

In [35]:
## Here are some hints to jumpstart your work

# 1. Inspect the pipeline object.  All of the "steps" within the pipeline are contained inside.
pipeline.steps

# 2. Use the 2nd step object, "tfidf", to get a reference to the object that can transform data
# this will get "TfidfTransformer" object in the steps list ('tfidf', TfidfTransformer)
step2_transformer = pipeline.steps[1][1] 

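# 3. TfidfTransformer expects term counts, not raw text: transform the training text with the
#    first step (pipeline.steps[0][1], the fitted CountVectorizer), pass those counts to
#    step2_transformer.transform(), and examine your training dataset with proper EDA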
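
The hints can be sketched end to end on toy data. The documents, labels, and the name `toy_pipeline` below are stand-ins (assumptions) for `X_train`, `y_train`, and the real pipeline, chosen so the example runs without downloading the newsgroups dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for X_train / y_train
docs = ["I am a cat.", "I am a bat.", "the cat is not a rat", "the bat is not a rat"]
labels = [0, 1, 0, 1]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])
toy_pipeline.fit(docs, labels)

# named_steps gives access by name; steps[i][1] gives the same objects by position
vect = toy_pipeline.named_steps["vect"]
tfidf = toy_pipeline.named_steps["tfidf"]

counts = vect.transform(docs)       # raw text -> sparse count matrix
weights = tfidf.transform(counts)   # counts -> tf-idf weights
print(weights.shape)
```

The resulting `weights` matrix is what the classifier saw during fitting, so it is the right object to aggregate and plot.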

## 9. Update the pipeline to use only CountVectorizer.
Also experiment with different classification models:
- KNN
- Random Forest
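
As a starting point, here is a rough sketch of a count-only pipeline with the classifier swapped in a loop. The toy documents, labels, and hyperparameters (`n_neighbors=1`, `n_estimators=10`) are assumptions chosen to keep the example tiny. On the newsgroups data you would fit on `X_train` and score on `X_test` instead.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy stand-ins for the training data
docs = ["I am a cat.", "I am a bat.", "the cat is not a rat", "the bat is not a rat"]
labels = [0, 1, 0, 1]

classifiers = {
    "knn": KNeighborsClassifier(n_neighbors=1),
    "random_forest": RandomForestClassifier(n_estimators=10, random_state=42),
}

scores = {}
for name, clf in classifiers.items():
    # No TfidfTransformer step: the classifier sees raw term counts
    pipe = Pipeline([("vect", CountVectorizer()), ("clf", clf)])
    pipe.fit(docs, labels)
    scores[name] = pipe.score(docs, labels)  # training accuracy only
    print(name, scores[name])
```

Training accuracy on four sentences says nothing about generalization. The point is only that the pipeline interface stays the same when the final estimator changes.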