# Feynman Technique - TF IDF

- toc: true
- badges: False
- comments: true
- author: Sam Treacy
- categories: [sklearn, tf_idf, sentiment, nlp, classification, python]

## How it works

Term Frequency - Inverse Document Frequency (TF-IDF) measures how 'important' a word is to a given document relative to the other documents in a collection.

The method is used for text analysis and as a way to convert text into numeric arrays for machine learning algorithms. Unlike the bag-of-words method, which only counts words, TF-IDF gives the machine learning algorithm information about how important each word is in a given document relative to all the other documents.
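To make the comparison with bag-of-words concrete, here is a minimal sketch in plain Python (no scikit-learn) of what a bag-of-words vector holds. The two documents and the helper name `bag_of_words` are made up for illustration.

```python
from collections import Counter

# Hypothetical two-document corpus, for illustration only
docs = ["the dog chased the cat", "the cat slept"]

# Shared vocabulary, sorted so every document uses the same column order
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Return raw word counts for `doc`, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Each vector only records counts inside its own document. The word "the" gets the largest count in the first document even though it appears in both documents and does the least to tell them apart. That cross-document information is what TF-IDF adds.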

TF-IDF was originally developed for document search and information retrieval. A word's score increases in proportion to the number of times it appears in a document, but is offset by the number of documents in the collection that contain the word. So words that are common in every document, such as "this", "what", and "if", rank low even though they may appear many times, since they say little about any one document in particular.

However, if the word "bug" appears many times in one document while appearing rarely in the others, it is probably very relevant to that document. For example, if we are trying to find out which topics some NPS (Net Promoter Score) responses belong to, the word "bug" would probably end up tied to the topic "Reliability", since most responses containing that word would be about that topic.
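The idea can be sketched from scratch with the plain textbook weighting, score = (count in the document) × log(N / number of documents containing the word). This is an illustrative sketch, not scikit-learn's exact formula (`TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers differ). The three survey responses and the function name `tf_idf` are invented for the example.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of survey responses, for illustration only
docs = [
    "the bug is back and the bug breaks the login",
    "the price is fair",
    "the support team is friendly",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Score each word as (count in this document) * log(N / document frequency)."""
    counts = Counter(tokens)
    return {word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()}

scores = tf_idf(tokenized[0])

# Print words from most to least important for the first response
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:8s} {score:.3f}")
```

In the first response "bug" comes out on top, while "the" and "is" score exactly zero: they appear in every response, so log(3 / 3) = 0 no matter how often they are repeated.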

## A simple example

In [216]:
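from sklearn.feature_extraction.text import TfidfVectorizer

# Seven tiny "documents" to vectorize
simple_text = ['look at the angry dog',
               'look at the angry cat',
               'this is unknown',
               'this, this, this',
               'look at',
               'this',
               'hello']

# Learn the vocabulary and IDF weights, then return the sparse TF-IDF matrix
vectorize = TfidfVectorizer()
sample = vectorize.fit_transform(simple_text)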

In [217]:
print(sample)

  (0, 3)	0.543530401770053
  (0, 0)	0.45117691147795724
  (0, 7)	0.45117691147795724
  (0, 1)	0.38565106731999843
  (0, 6)	0.38565106731999843
  (1, 2)	0.543530401770053
  (1, 0)	0.45117691147795724
  (1, 7)	0.45117691147795724
  (1, 1)	0.38565106731999843
  (1, 6)	0.38565106731999843
  (2, 9)	0.6320217767184382
  (2, 5)	0.6320217767184382
  (2, 8)	0.4484383430387477
  (3, 8)	1.0
  (4, 1)	0.7071067811865476
  (4, 6)	0.7071067811865476
  (5, 8)	1.0
  (6, 4)	1.0


In [218]:
# Vocabulary in column order; on scikit-learn 1.0+ use get_feature_names_out() instead
vectorize.get_feature_names()

['angry', 'at', 'cat', 'dog', 'hello', 'is', 'look', 'the', 'this', 'unknown']
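Each `(row, column)` pair in the sparse output printed earlier indexes into this vocabulary list. A small sketch in plain Python maps the column indices for the first document back to words. The weights are copied (rounded) from that printed output, and the variable names are illustrative.

```python
# Vocabulary as listed above; list position = column index
feature_names = ['angry', 'at', 'cat', 'dog', 'hello',
                 'is', 'look', 'the', 'this', 'unknown']

# (row, column, weight) entries for document 0, rounded from the printed output
entries = [(0, 3, 0.5435), (0, 0, 0.4512), (0, 7, 0.4512),
           (0, 1, 0.3857), (0, 6, 0.3857)]

# Replace each column index with the word it stands for
decoded = [(feature_names[col], weight) for _, col, weight in entries]

for word, weight in decoded:
    print(f"{word:6s} {weight}")
```

In 'look at the angry dog', the word 'dog' carries the largest weight because it appears in only one document, while 'look' and 'at' each appear in three.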

In [236]:
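import pandas as pd

# IDF weight learned for each vocabulary term
vecs = pd.Series(vectorize.idf_)

In [237]:
vocab = pd.Series(vectorize.get_feature_names())

In [238]:
# Pair each word with its IDF weight
table = {'words': vocab, 'Vecs': vecs}

pd.DataFrame(table)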

| | words | Vecs |
|---|---|---|
| 0 | angry | 1.980829 |
| 1 | at | 1.693147 |
| 2 | cat | 2.386294 |
| 3 | dog | 2.386294 |
| 4 | hello | 2.386294 |
| 5 | is | 2.386294 |
| 6 | look | 1.693147 |
| 7 | the | 1.980829 |
| 8 | this | 1.693147 |
| 9 | unknown | 2.386294 |


In [243]:
sen_num = 2
print(sample[sen_num], '\n')
print(simple_text[sen_num], '\n' )
print(sample.toarray()[sen_num])

  (0, 9)	0.6320217767184382
  (0, 5)	0.6320217767184382
  (0, 8)	0.4484383430387477 

this is unknown 

[0.         0.         0.         0.         0.         0.63202178
 0.         0.         0.44843834 0.63202178]
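The numbers for 'this is unknown' can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: a smoothed IDF of ln((1 + N) / (1 + df)) + 1, followed by L2 normalization of each document vector. The tokenizer is a simplification that happens to be enough for this toy corpus, and the helper name `idf` is illustrative.

```python
import math

docs = ['look at the angry dog', 'look at the angry cat', 'this is unknown',
        'this, this, this', 'look at', 'this', 'hello']

# Simplified tokenizer for this toy corpus: strip commas, split on whitespace
tokenized = [doc.replace(',', ' ').split() for doc in docs]
n_docs = len(tokenized)

def idf(word):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1 (assumed scikit-learn default)."""
    df = sum(word in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

tokens = tokenized[2]  # 'this is unknown'

# Raw weight = count in this document * IDF
raw = {word: tokens.count(word) * idf(word) for word in set(tokens)}

# L2-normalize so the document vector has unit length
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
weights = {word: value / norm for word, value in raw.items()}

for word in sorted(weights):
    print(word, round(weights[word], 6))
```

'this' appears in three of the seven documents, so its IDF (about 1.693) is lower than that of 'is' and 'unknown' (about 2.386), which each appear in one. After normalization that gives roughly 0.448 versus 0.632, matching the array above.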


## Real world text

In [30]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [46]:
df = pd.read_csv('DATA/Amazon_Fine_Food_Reviews.csv')

In [47]:
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


## Clean data

In [48]:
df['Summary'] = df['Summary'].str.lower()
df = df.dropna(subset=['Summary'])

In [49]:
def remove_punctuation(text):
    cleaned = ''.join(char for char in text if char not in ('?', '!', '-','_','.',
                                                            '@','#', '"',',',"'",) )
    return cleaned
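As a possible alternative, the same filtering can be sketched with `str.translate`, which deletes the listed characters without a per-character Python loop. The name `remove_punctuation_fast` is hypothetical, and the character set is copied from the function above.

```python
# Same ten characters the original function filters out
PUNCTUATION = '?!-_.@#",\''

# Translation table that maps each listed character to None (i.e. deletes it)
_DELETE_TABLE = str.maketrans('', '', PUNCTUATION)

def remove_punctuation_fast(text):
    """Return `text` with the listed punctuation characters removed."""
    return text.translate(_DELETE_TABLE)

print(remove_punctuation_fast('"delight" says it all'))
```

On a column of several hundred thousand summaries this is likely to be faster, but the loop-based version is perfectly adequate if speed is not a concern.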

In [50]:
df['Summary'] = df['Summary'].apply(remove_punctuation)
df['Summary']

0                      good quality dog food
1                          not as advertised
2                        delight says it all
3                             cough medicine
4                                great taffy
                         ...                
568449                   will not do without
568450                          disappointed
568451              perfect for our maltipoo
568452    favorite training and reward treat
568453                           great honey
Name: Summary, Length: 568427, dtype: object

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Fit on the cleaned summaries: one row per review, one column per vocabulary term
vectorize = TfidfVectorizer()

vector_text = vectorize.fit_transform(df['Summary'])

In [58]:
vector_text.shape

(568427, 41476)