# Feynman Summary - TF IDF

- toc: true
- badges: False
- comments: true
- author: Sam Treacy
- categories: [sklearn, tf_idf, sentiment, nlp, classification, python]

## How it works

Term Frequency-Inverse Document Frequency (TF-IDF) measures how 'important' a word is to a given document relative to the other documents in a corpus.

The method is used for text analysis and as a way to convert text into numeric arrays for machine learning algorithms. Unlike a plain bag-of-words count, it gives the machine learning algorithm information about how important each word is in a given document relative to all the documents.

TF-IDF was originally developed for document search and information retrieval. A word's score increases in proportion to the number of times it appears in a document, but is offset by the number of documents in the corpus that contain it. So words that are common in every document, such as "this", "what", and "if", rank low even though they may appear many times, since they say little about any one document in particular.

However, if the word "bug" appears many times in one document while appearing rarely in the others, it is probably very relevant to that document. For example, if we are trying to find out which topics a set of NPS (Net Promoter Score) survey responses belong to, the word "bug" would probably end up tied to the topic "Reliability", since most responses containing that word would be about that topic.
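
The weighting described above can be written down compactly. As a hedged sketch, the formulas below follow the default behaviour of scikit-learn's `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), the class used later in this post. Other sources define the IDF term slightly differently, for example as log(N / df) without smoothing. The symbols are notation introduced here for illustration: `n` is the number of documents, `df(t)` the number of documents containing term `t`, `tf(t, d)` the count of `t` in document `d`, and `v_d` the vector of weights for document `d`.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
\qquad
\mathbf{v}_d \leftarrow \frac{\mathbf{v}_d}{\lVert \mathbf{v}_d \rVert_2}
```

Under this definition a term that appears in every document gets an IDF of ln(1) + 1 = 1, the smallest possible weight, while rarer terms get larger weights.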

In [313]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

## A simple example

Consider each sentence in `two_docs` below as a separate document; together they form our document corpus. We calculate the TF-IDF weights by creating a `TfidfVectorizer` and calling its `fit_transform` method, which fits on both documents and transforms them into vectorised arrays.

In [300]:
from sklearn.feature_extraction.text import TfidfVectorizer

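two_docs = ['the car is driven on the road',
            'the truck is driven on the highway']

# Create the vectorizer, then learn the vocabulary and IDF weights
# and transform both documents in one step
vectorize = TfidfVectorizer()
two_docs_tf_idf = vectorize.fit_transform(two_docs)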

In [301]:
print(two_docs_tf_idf)

  (0, 5)	0.42471718586982765
  (0, 4)	0.30218977576862155
  (0, 1)	0.30218977576862155
  (0, 3)	0.30218977576862155
  (0, 0)	0.42471718586982765
  (0, 6)	0.6043795515372431
  (1, 2)	0.42471718586982765
  (1, 7)	0.42471718586982765
  (1, 4)	0.30218977576862155
  (1, 1)	0.30218977576862155
  (1, 3)	0.30218977576862155
  (1, 6)	0.6043795515372431


In [297]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorize.get_feature_names_out().tolist()

['car', 'driven', 'highway', 'is', 'on', 'road', 'the', 'truck']
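
The printed sparse matrix lists one `(row, column)  weight` entry per non-zero value, and the column number is an index into the feature-name list. As a small sketch of how to read it, the snippet below copies the entries for document 0 from the output above and maps each column index back to its word. The variable names `entries` and `weights_by_word` are made up for this illustration.

```python
# (row, column, weight) triples copied from the printed sparse matrix, document 0
entries = [
    (0, 5, 0.42471718586982765),
    (0, 4, 0.30218977576862155),
    (0, 1, 0.30218977576862155),
    (0, 3, 0.30218977576862155),
    (0, 0, 0.42471718586982765),
    (0, 6, 0.6043795515372431),
]

# Vocabulary in column order, as returned by the vectorizer
feature_names = ['car', 'driven', 'highway', 'is', 'on', 'road', 'the', 'truck']

# Map each column index back to the word it stands for
weights_by_word = {feature_names[col]: weight for _, col, weight in entries}

# Print the words of document 0 from highest to lowest weight
for word, weight in sorted(weights_by_word.items(), key=lambda item: -item[1]):
    print(f'{word:8s}{weight:.3f}')
```

Words that do not occur in the document ("highway" and "truck" here) have no entry at all, which is why the sparse form is compact.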

In [315]:
Image(url= "https://cdn-media-1.freecodecamp.org/images/1*q3qYevXqQOjJf6Pwdlx8Mw.png", width = 800)

In [302]:
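# idf_ holds the learned inverse document frequency weight for each vocabulary word
idf = pd.Series(vectorize.idf_)
vocab = pd.Series(vectorize.get_feature_names_out())

table = {'words': vocab, 'idf': idf}
pd.DataFrame(table)

|   | words | idf |
|---|---|---|
| 0 | car | 1.405465 |
| 1 | driven | 1.0 |
| 2 | highway | 1.405465 |
| 3 | is | 1.0 |
| 4 | on | 1.0 |
| 5 | road | 1.405465 |
| 6 | the | 1.0 |
| 7 | truck | 1.405465 |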
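
The values in the table can be reproduced by hand. The sketch below assumes scikit-learn's default settings: a smoothed IDF of ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. It uses a plain `str.split` in place of the library's tokenizer, which yields the same tokens for these two sentences. The helper name `tfidf_row` and the other variable names are made up for this illustration.

```python
import math
from collections import Counter

docs = ['the car is driven on the road',
        'the truck is driven on the highway']

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each word
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

# Smoothed IDF (assumed scikit-learn default): ln((1 + n) / (1 + df)) + 1
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]

for word in vocab:
    print(f'{word:8s} idf={idf[word]:.6f}')
print([round(value, 3) for value in rows[1]])
```

Words found in both documents get an IDF of exactly 1, and words found in only one get ln(1.5) + 1. The rounded row for the second document should agree with the `toarray()` output that scikit-learn produces for the same sentence.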


In [304]:
sen_num = 1

print(two_docs_tf_idf[sen_num], '\n')
print(two_docs[sen_num], '\n' )
print(two_docs_tf_idf.toarray()[sen_num])

  (0, 2)	0.42471718586982765
  (0, 7)	0.42471718586982765
  (0, 4)	0.30218977576862155
  (0, 1)	0.30218977576862155
  (0, 3)	0.30218977576862155
  (0, 6)	0.6043795515372431 

the truck is driven on the highway 

[0.         0.30218978 0.42471719 0.30218978 0.30218978 0.
 0.60437955 0.42471719]




The two arrays below are a good example of the vectorised arrays that would be fed into a machine learning algorithm such as an SVC or a random forest: one row per document and one column per vocabulary word.

In [305]:
for row in two_docs_tf_idf.toarray():
    print(row.round(3), '\n')

[0.425 0.302 0.    0.302 0.302 0.425 0.604 0.   ] 

[0.    0.302 0.425 0.302 0.302 0.    0.604 0.425] 
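
To make the hand-off to a classifier concrete, here is a minimal sketch that chains a `TfidfVectorizer` and a linear `SVC` with scikit-learn's `make_pipeline`. The two extra sentences and all of the labels are hypothetical toy data invented for this illustration (1 = vehicle-related, 0 = food-related); four documents are far too few for a meaningful model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: the first two sentences come from the example above,
# the last two and all labels are made up (1 = vehicle-related, 0 = food-related)
texts = ['the car is driven on the road',
         'the truck is driven on the highway',
         'the soup is served in the bowl',
         'the bread is baked in the oven']
labels = [1, 1, 0, 0]

# The pipeline vectorises the text with TF-IDF and passes the arrays to the SVC
model = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))
model.fit(texts, labels)

# New text is transformed with the vocabulary and IDF weights learned during fit
prediction = model.predict(['the van is driven on the road'])
print(prediction)
```

Wrapping both steps in one pipeline means new text is always transformed with the vocabulary learned from the training documents, so the classifier sees columns in the same order at training and prediction time.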



## Real-world example

In [283]:
df = pd.read_csv('DATA/Amazon_Fine_Food_Reviews.csv')

In [291]:
df.head(3)

|   | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | good quality dog food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | not as advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | delight says it all | This is a confection that has been around a fe... |


### Clean data

In [285]:
df['Summary'] = df['Summary'].str.lower()
df = df.dropna(subset=['Summary'])

In [286]:
def remove_punctuation(text):
    """Return text with a fixed set of punctuation characters removed."""
    punctuation = ('?', '!', '-', '_', '.', '@', '#', '"', ',', "'")
    return ''.join(char for char in text if char not in punctuation)
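
The same character filter can also be written with Python's built-in `str.translate`, which deletes characters through a lookup table instead of testing each one in a generator. This is only a sketch of an alternative, not a required change; the names `PUNCTUATION`, `DELETE_TABLE`, and `remove_punctuation_translate` are made up for this illustration.

```python
# The same ten characters that remove_punctuation strips
PUNCTUATION = '?!-_.@#",\''

# A translation table whose third argument lists the characters to delete
DELETE_TABLE = str.maketrans('', '', PUNCTUATION)

def remove_punctuation_translate(text):
    """Return text with every character in PUNCTUATION deleted."""
    return text.translate(DELETE_TABLE)

print(remove_punctuation_translate("it's great, isn't it?"))
```

It can be applied to the column in the same way, with `df['Summary'].apply(remove_punctuation_translate)`.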

In [287]:
df['Summary'] = df['Summary'].apply(remove_punctuation)
df['Summary']

0                      good quality dog food
1                          not as advertised
2                        delight says it all
3                             cough medicine
4                                great taffy
                         ...                
568449                   will not do without
568450                          disappointed
568451              perfect for our maltipoo
568452    favorite training and reward treat
568453                           great honey
Name: Summary, Length: 568427, dtype: object

In [290]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a fresh vectorizer on the review summaries:
# one row per review, one column per vocabulary word
vectorize = TfidfVectorizer()

vector_text = vectorize.fit_transform(df['Summary'])
vector_text.shape

(568427, 41476)

## References

- https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
- https://monkeylearn.com/blog/what-is-tf-idf/