# How to process textual data using TF-IDF in Python

### An introduction to TF-IDF

TF-IDF stands for “Term Frequency-Inverse Document Frequency”. First, we will look at what this term means mathematically.

Term Frequency (tf): the frequency of a word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document, so it increases as the word occurs more often within the document. Each document has its own tf values.



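$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of times word $i$ appears in document $j$, and the denominator is the total number of words in document $j$.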
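As a quick illustration of the ratio above, here is a minimal sketch that computes tf for a single document. The helper name `term_frequency` and the plain whitespace tokenization are assumptions made for this example; they are not part of any library.

```python
from collections import Counter

def term_frequency(document):
    """Return {word: count / total_words} for one document (hypothetical helper)."""
    words = document.split()      # naive whitespace tokenization (assumption)
    counts = Counter(words)       # how often each word occurs
    total = len(words)            # total number of words in the document
    return {word: count / total for word, count in counts.items()}

tf = term_frequency("the cat chased the mouse")
print(tf["the"])   # "the" occurs 2 times out of 5 words
print(tf["cat"])   # "cat" occurs 1 time out of 5 words
```

Because every count is divided by the same total, the tf values of one document always add up to 1.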

Inverse Document Frequency (idf): weights words by how rare they are across all documents in the corpus. Words that occur in few documents have a high idf score. It is given by:

$$idf_i = \log\left(\frac{N}{df_i}\right)$$

where $N$ is the total number of documents in the corpus and $df_i$ is the number of documents that contain word $i$.
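To make the weighting concrete, the sketch below evaluates idf on a toy corpus. The function name `inverse_document_frequency` is hypothetical, and the base-10 logarithm is an assumption chosen to match the hand-written code later in this notebook.

```python
import math

def inverse_document_frequency(word, documents):
    """idf = log10(N / df), where df is the number of documents containing word."""
    n_docs = len(documents)
    # Count the documents in which the word occurs at least once
    df = sum(1 for doc in documents if word in doc.split())
    if df == 0:
        raise ValueError(f"{word!r} does not occur in any document")
    return math.log10(n_docs / df)

docs = ["the cat sat", "the dog sat", "the bird flew"]
idf_the = inverse_document_frequency("the", docs)    # occurs in all 3 documents
idf_bird = inverse_document_frequency("bird", docs)  # occurs in 1 of 3 documents
print(idf_the, idf_bird)
```

A word that appears in every document gets an idf of 0, so it contributes nothing to the final score, while the rarer word gets a positive weight.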

Combining these two gives the TF-IDF score (w) for a word in a document in the corpus. It is the product of tf and idf:

$$w_{i,j} = tf_{i,j} \times idf_i$$

### Using Python to calculate TF-IDF

In [39]:
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count/float(bowCount)
    return tfDict

In [6]:
docA = "The cat sat on my face"
docB = "The dog sat on my bed"

bowA = docA.split(" ")
bowB = docB.split(" ")

In [7]:
bowA

['The', 'cat', 'sat', 'on', 'my', 'face']

In [8]:
wordSet = set(bowA).union(set(bowB))
wordSet

{'The', 'bed', 'cat', 'dog', 'face', 'my', 'on', 'sat'}

In [14]:
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

print(wordDictA,"""
""", wordDictB)

{'The': 0, 'on': 0, 'my': 0, 'dog': 0, 'sat': 0, 'bed': 0, 'face': 0, 'cat': 0} 
 {'The': 0, 'on': 0, 'my': 0, 'dog': 0, 'sat': 0, 'bed': 0, 'face': 0, 'cat': 0}


In [15]:
for word in bowA:
    wordDictA[word] += 1
    
for word in bowB:
    wordDictB[word] += 1
    
print(wordDictA, """
""", wordDictB)

{'The': 1, 'on': 1, 'my': 1, 'dog': 0, 'sat': 1, 'bed': 0, 'face': 1, 'cat': 1} 
 {'The': 1, 'on': 1, 'my': 1, 'dog': 1, 'sat': 1, 'bed': 1, 'face': 0, 'cat': 0}


In [16]:
import pandas as pd
pd.DataFrame([wordDictA, wordDictB])

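|   | The | on | my | dog | sat | bed | face | cat |
|---|-----|----|----|-----|-----|-----|------|-----|
| 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |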


In [22]:
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = round(count/float(bowCount), 2)
    return tfDict

In [25]:
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

tfBowA

{'The': 0.17,
 'on': 0.17,
 'my': 0.17,
 'dog': 0.0,
 'sat': 0.17,
 'bed': 0.0,
 'face': 0.17,
 'cat': 0.17}

In [26]:
tfBowB

{'The': 0.17,
 'on': 0.17,
 'my': 0.17,
 'dog': 0.17,
 'sat': 0.17,
 'bed': 0.17,
 'face': 0.0,
 'cat': 0.0}

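In [29]:
import math

def computeIDF(docList):
    # docList: one word-count dict per document, all sharing the same keys
    N = len(docList)
    # Document frequency: number of documents that contain each word
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    # idf = log10(N / df); assumes every word occurs in at least one document
    for word, df in idfDict.items():
        idfDict[word] = math.log10(N / float(df))
    return idfDict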

In [30]:
idfs = computeIDF([wordDictA, wordDictB])

In [31]:
idfs

{'The': 0.0,
 'on': 0.0,
 'my': 0.0,
 'dog': 0.3010299956639812,
 'sat': 0.0,
 'bed': 0.3010299956639812,
 'face': 0.3010299956639812,
 'cat': 0.3010299956639812}

In [41]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    
    for word, val in tfBow.items():
        tfidf[word] = round(val*idfs[word], 2)
    return tfidf

In [43]:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

In [44]:
tfidfBowA

{'The': 0.0,
 'on': 0.0,
 'my': 0.0,
 'dog': 0.0,
 'sat': 0.0,
 'bed': 0.0,
 'face': 0.05,
 'cat': 0.05}

In [45]:
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

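|   | The | on | my | dog | sat | bed | face | cat |
|---|-----|----|----|-----|-----|-----|------|-----|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.05 | 0.05 |
| 1 | 0.0 | 0.0 | 0.0 | 0.05 | 0.0 | 0.05 | 0.0 | 0.0 |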
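The cells above handle exactly two documents. As a rough sketch of how the same steps could be generalized to any number of documents, the following uses `collections.Counter`. The function name `tfidf_corpus` is hypothetical. It also lowercases tokens (which the cells above do not) and skips the intermediate rounding of tf, so treat it as one possible variation and not as the notebook's method.

```python
import math
from collections import Counter

def tfidf_corpus(documents):
    """Return one {word: tf-idf} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # tf = count / document length, idf = log10(N / df)
        scores.append({
            word: (count / len(tokens)) * math.log10(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tfidf_corpus(["The cat sat on my face", "The dog sat on my bed"])
print(round(scores[0]["cat"], 2), round(scores[0]["the"], 2))
```

Unlike the dictionaries built above, each result here only contains the words that actually occur in that document, so absent words are simply missing and do not appear with a score of 0.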


### Using sklearn for TF-IDF

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
docA = "The car is driven on the road"
docB = "The truck is driven on the highway"

In [53]:
tfidf = TfidfVectorizer()

In [54]:
response = tfidf.fit_transform([docA, docB])

In [55]:
# get_feature_names_out() requires scikit-learn >= 1.0 (older releases used get_feature_names())
feature_names = tfidf.get_feature_names_out()

# Index each nonzero entry by its own row and column
for row, col in zip(*response.nonzero()):
    print(feature_names[col], ' - ', response[row, col])

road  -  0.42471718586982765
on  -  0.30218977576862155
driven  -  0.30218977576862155
is  -  0.30218977576862155
car  -  0.42471718586982765
the  -  0.6043795515372431
highway  -  0.42471718586982765
truck  -  0.42471718586982765
on  -  0.30218977576862155
driven  -  0.30218977576862155
is  -  0.30218977576862155
the  -  0.6043795515372431


In [56]:
print(response)

  (0, 5)	0.42471718586982765
  (0, 4)	0.30218977576862155
  (0, 1)	0.30218977576862155
  (0, 3)	0.30218977576862155
  (0, 0)	0.42471718586982765
  (0, 6)	0.6043795515372431
  (1, 2)	0.42471718586982765
  (1, 7)	0.42471718586982765
  (1, 4)	0.30218977576862155
  (1, 1)	0.30218977576862155
  (1, 3)	0.30218977576862155
  (1, 6)	0.6043795515372431
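These numbers differ from the hand-computed scores earlier because `TfidfVectorizer`, with its default settings, uses raw counts for tf, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and then scales each document vector to unit L2 length. The sketch below reproduces those values without scikit-learn. The function name `sklearn_style_tfidf` is hypothetical, and the simple lowercase-and-split tokenization only approximates the vectorizer's default tokenizer (which, for instance, ignores single-character tokens).

```python
import math
from collections import Counter

def sklearn_style_tfidf(documents):
    """Approximate TfidfVectorizer defaults by hand (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw count times smoothed idf: ln((1 + N) / (1 + df)) + 1
        raw = {
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
        # Scale the document vector to unit L2 length
        norm = math.sqrt(sum(value * value for value in raw.values()))
        rows.append({word: value / norm for word, value in raw.items()})
    return rows

rows = sklearn_style_tfidf([
    "The car is driven on the road",
    "The truck is driven on the highway",
])
print(rows[0]["the"], rows[0]["car"], rows[0]["is"])
```

Note that with this smoothing a word shared by every document (such as "the") keeps an idf of 1 and does not drop to 0, which is why it still receives the largest weight here.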


To learn more about sklearn TF-IDF, see the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">TfidfVectorizer documentation</a>.

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [60]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names_out() requires scikit-learn >= 1.0; tolist() yields plain Python strings
print(vectorizer.get_feature_names_out().tolist())
print(X.shape)

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)

