# TF-IDF (Term Frequency-Inverse Document Frequency)

Terminology:

    t: term (word)
    d: document (a sequence of words)
    corpus: the full set of documents
    N: number of documents in the corpus
    df: document frequency, the number of documents that contain t

Term Frequency (TF): tf(t,d) = count of t in d / number of words in d

Inverse Document Frequency (IDF): idf(t) = log(N/(df+1))

TF-IDF: tf-idf(t,d) = tf(t,d) * log(N/(df+1))
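As a quick worked example of the formulas above, the sketch below computes the three quantities for a single term. The helper names `tf` and `idf` and the three-document toy corpus are assumptions made for illustration, and a base-10 logarithm is used to match the rest of this notebook.

```python
import math

# Toy corpus (an assumption for illustration): three tokenized documents.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "dog", "barked"],
    ["birds", "sing"],
]

def tf(term, doc):
    # count of term in doc / number of words in doc
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / (df + 1)), where df is the number of documents containing term
    n_docs = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log10(n_docs / (df + 1))

tf_the = tf("the", docs[0])      # 2 occurrences out of 6 words
idf_the = idf("the", docs)       # N = 3, df = 1
tfidf_the = tf_the * idf_the
print(tf_the, idf_the, tfidf_the)
```

Here "the" appears twice among six words, so its TF is 2/6, and it occurs in one of three documents, so its IDF is log10(3/2).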

In [1]:
import pandas as pd
import sklearn as sk
import math 

Let's define our two sentences; later we will combine their words into a single set.

In [2]:
f = "Data Science is the demanding job of the 21st century"
s ="machine learning is the key for data science"

Split each sentence on spaces so that every word becomes its own string.

In [3]:
f1=f.split(" ")
s2=s.split(" ")
print(f1)
print(s2)

['Data', 'Science', 'is', 'the', 'demanding', 'job', 'of', 'the', '21st', 'century']
['machine', 'learning', 'is', 'the', 'key', 'for', 'data', 'science']


In [4]:
total= set(f1).union(set(s2))
print(total)

{'data', 'is', 'key', 'job', 'century', 'Science', 'machine', 'learning', '21st', 'Data', 'the', 'science', 'for', 'of', 'demanding'}
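The set above contains both 'Data' and 'data' (and 'Science' and 'science') because splitting is case-sensitive. If those should count as the same word, one option is to lowercase before splitting. This is a minimal sketch, not part of the original notebook; the names `f_tokens`, `s_tokens` and `vocab` are assumptions.

```python
f = "Data Science is the demanding job of the 21st century"
s = "machine learning is the key for data science"

# Lowercase before splitting so "Data"/"data" and "Science"/"science" merge.
f_tokens = f.lower().split()
s_tokens = s.lower().split()

# The vocabulary is the union of the words in both sentences.
vocab = set(f_tokens) | set(s_tokens)
print(sorted(vocab))
print(len(vocab))
```

With lowercasing, the vocabulary shrinks from 15 entries to 13. The rest of this notebook keeps the case-sensitive split.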


Now let's count the words in each sentence, using a dictionary that maps every word in the combined set to its count.

In [5]:
wordDictA = dict.fromkeys(total, 0) 
wordDictB = dict.fromkeys(total, 0)


for i in f1:
    wordDictA[i]+=1
    
for j in s2:
    wordDictB[j]+=1

In [6]:
wordDictA

{'data': 0,
 'is': 1,
 'key': 0,
 'job': 1,
 'century': 1,
 'Science': 1,
 'machine': 0,
 'learning': 0,
 '21st': 1,
 'Data': 1,
 'the': 2,
 'science': 0,
 'for': 0,
 'of': 1,
 'demanding': 1}

In [7]:
wordDictB

{'data': 1,
 'is': 1,
 'key': 1,
 'job': 0,
 'century': 0,
 'Science': 0,
 'machine': 1,
 'learning': 1,
 '21st': 0,
 'Data': 0,
 'the': 1,
 'science': 1,
 'for': 1,
 'of': 0,
 'demanding': 0}
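The same counts can be produced with `collections.Counter` from the standard library. The sketch below is an alternative to the loops above, not a replacement the notebook depends on; the helper name `count_words` is hypothetical.

```python
from collections import Counter

f1 = "Data Science is the demanding job of the 21st century".split(" ")
s2 = "machine learning is the key for data science".split(" ")
total = set(f1) | set(s2)

def count_words(tokens, vocab):
    counts = Counter(tokens)
    # A Counter returns 0 for missing keys, so words absent from this
    # sentence are filled in with a count of 0.
    return {word: counts[word] for word in vocab}

countsA = count_words(f1, total)
countsB = count_words(s2, total)
print(countsA["the"], countsB["the"])
```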

Now we put them in a dataframe and then view the result

In [8]:
pd.DataFrame([wordDictA, wordDictB])

   data  is  key  job  century  Science  machine  learning  21st  Data  the  science  for  of  demanding
0     0   1    0    1        1        1        0         0     1     1    2        0    0   1          1
1     1   1    1    0        0        0        1         1     0     0    1        1    1   0          0


Now let's write the TF function.

In [9]:
def computeTF(wordDict, doc):
    # Term frequency: each word's count divided by the number of words in doc
    tfDict = {}
    docLength = len(doc)
    for word, count in wordDict.items():
        tfDict[word] = count / docLength
    return tfDict


# Run both sentences through the TF function
tfFirst = computeTF(wordDictA, f1)
tfSecond = computeTF(wordDictB, s2)

# Convert to a DataFrame for viewing
tf = pd.DataFrame([tfFirst, tfSecond])

In [10]:
tfFirst

{'data': 0.0,
 'is': 0.1,
 'key': 0.0,
 'job': 0.1,
 'century': 0.1,
 'Science': 0.1,
 'machine': 0.0,
 'learning': 0.0,
 '21st': 0.1,
 'Data': 0.1,
 'the': 0.2,
 'science': 0.0,
 'for': 0.0,
 'of': 0.1,
 'demanding': 0.1}

In [11]:
tfSecond

{'data': 0.125,
 'is': 0.125,
 'key': 0.125,
 'job': 0.0,
 'century': 0.0,
 'Science': 0.0,
 'machine': 0.125,
 'learning': 0.125,
 '21st': 0.0,
 'Data': 0.0,
 'the': 0.125,
 'science': 0.125,
 'for': 0.125,
 'of': 0.0,
 'demanding': 0.0}

In [12]:
tf

    data     is    key    job  century  Science  machine  learning   21st   Data    the  science    for     of  demanding
0    0.0    0.1    0.0    0.1      0.1      0.1      0.0       0.0    0.1    0.1    0.2      0.0    0.0    0.1        0.1
1  0.125  0.125  0.125    0.0      0.0      0.0    0.125     0.125    0.0    0.0  0.125    0.125  0.125    0.0        0.0
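Because every count is divided by the length of its own sentence, the TF values of one sentence should sum to 1. The sketch below checks this; `compute_tf` is a hypothetical compact variant that only covers words actually present in the sentence.

```python
import math

f1 = "Data Science is the demanding job of the 21st century".split(" ")
s2 = "machine learning is the key for data science".split(" ")

def compute_tf(tokens):
    # count of each distinct word / number of words in the sentence
    n = len(tokens)
    return {word: tokens.count(word) / n for word in set(tokens)}

tf_first = compute_tf(f1)
tf_second = compute_tf(s2)

# Each sentence's term frequencies add up to 1 (within floating-point error).
print(math.isclose(sum(tf_first.values()), 1.0))
print(math.isclose(sum(tf_second.values()), 1.0))
```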


Removing the stop words from the vocabulary

In [13]:
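import nltk
# nltk.download('stopwords')  # uncomment on the first run to fetch the list
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# wordDictA's keys are the combined vocabulary of both sentences
filtered_sentence = [w for w in wordDictA if w not in stop_words]
print(filtered_sentence)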

['data', 'key', 'job', 'century', 'Science', 'machine', 'learning', '21st', 'Data', 'science', 'demanding']
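The filter above compares words exactly as written, and the NLTK English list is lowercase, so a capitalized stop word such as "The" would not be removed. A minimal sketch of a case-insensitive filter follows. To stay self-contained it uses a tiny hand-picked stop list, which is an assumption for illustration and much shorter than NLTK's list.

```python
# Hand-picked stop words (an assumption; NLTK's English list is far longer).
stop_words = {"is", "the", "of", "for"}

tokens = "The key for data science is the data".split(" ")

# Compare the lowercased token so that "The" is filtered just like "the".
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)
```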


Calculate IDF

In [14]:
def computeIDF(docList):
    N = len(docList)

    # NOTE: every count starts at 0 and is never incremented here, so val is
    # always 0 and each word receives the same value, log10(N / 1).
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / (float(val) + 1))

    return idfDict

# Compute the IDF over both word-count dictionaries
idfs = computeIDF([wordDictA, wordDictB])

In [15]:
idfs

{'data': 0.3010299956639812,
 'is': 0.3010299956639812,
 'key': 0.3010299956639812,
 'job': 0.3010299956639812,
 'century': 0.3010299956639812,
 'Science': 0.3010299956639812,
 'machine': 0.3010299956639812,
 'learning': 0.3010299956639812,
 '21st': 0.3010299956639812,
 'Data': 0.3010299956639812,
 'the': 0.3010299956639812,
 'science': 0.3010299956639812,
 'for': 0.3010299956639812,
 'of': 0.3010299956639812,
 'demanding': 0.3010299956639812}
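Every value above equals log10(2/1), about 0.301, because `computeIDF` initializes each count to 0 and never increments it. A minimal sketch of a version that counts document frequency first is shown below. The name `compute_idf_with_df` and the small stand-in dictionaries are hypothetical; the formula is the log10(N/(df+1)) form used in this notebook.

```python
import math

def compute_idf_with_df(doc_list):
    n_docs = len(doc_list)
    # df[word] = number of documents in which the word occurs at least once
    df = dict.fromkeys(doc_list[0].keys(), 0)
    for doc in doc_list:
        for word, count in doc.items():
            if count > 0:
                df[word] += 1
    return {word: math.log10(n_docs / (freq + 1)) for word, freq in df.items()}

# Stand-in count dictionaries (assumed; same shape as wordDictA / wordDictB)
countsA = {"the": 2, "job": 1, "key": 0}
countsB = {"the": 1, "job": 0, "key": 1}

idfs_df = compute_idf_with_df([countsA, countsB])
print(idfs_df)
```

With only two documents, the +1 in the denominator gives log10(2/2) = 0 for a word found in one document and a negative value, log10(2/3), for a word found in both. The formula becomes more informative on a larger corpus.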

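Another function for calculating an IDF-style weight. It takes a single word-count dictionary, so N is the number of keys in that dictionary (the vocabulary size, 15) and v is the word's count in that one sentence, not the number of documents containing the word. The result is therefore not a true inverse document frequency.

In [16]:
def idf2(doc):
    dict1 = {}
    N = len(doc)  # number of keys in the word-count dictionary
    for k, v in doc.items():
        dict1[k] = math.log10(N / (float(v) + 1))

    return dict1
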

In [17]:
idfA=idf2(wordDictA)
idfA

{'data': 1.1760912590556813,
 'is': 0.8750612633917001,
 'key': 1.1760912590556813,
 'job': 0.8750612633917001,
 'century': 0.8750612633917001,
 'Science': 0.8750612633917001,
 'machine': 1.1760912590556813,
 'learning': 1.1760912590556813,
 '21st': 0.8750612633917001,
 'Data': 0.8750612633917001,
 'the': 0.6989700043360189,
 'science': 1.1760912590556813,
 'for': 1.1760912590556813,
 'of': 0.8750612633917001,
 'demanding': 0.8750612633917001}

In [18]:
idfB=idf2(wordDictB)
idfB

{'data': 0.8750612633917001,
 'is': 0.8750612633917001,
 'key': 0.8750612633917001,
 'job': 1.1760912590556813,
 'century': 1.1760912590556813,
 'Science': 1.1760912590556813,
 'machine': 0.8750612633917001,
 'learning': 0.8750612633917001,
 '21st': 1.1760912590556813,
 'Data': 1.1760912590556813,
 'the': 0.8750612633917001,
 'science': 0.8750612633917001,
 'for': 0.8750612633917001,
 'of': 1.1760912590556813,
 'demanding': 1.1760912590556813}
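The values above follow from `idf2` using the size of the dictionary (15 keys) as N and the word's count in one sentence in place of df. The short arithmetic check below reproduces the three distinct values; `vocab_size` and `values` are names assumed for illustration.

```python
import math

vocab_size = 15  # len(wordDictA): distinct words across both sentences

# idf2 evaluates log10(vocab_size / (count + 1)) for each word's count
values = {count: math.log10(vocab_size / (count + 1)) for count in (0, 1, 2)}
for count, value in values.items():
    print(count, value)
```

A count of 0 gives log10(15/1), a count of 1 gives log10(15/2), and a count of 2 gives log10(15/3), which correspond to the 1.176, 0.875 and 0.699 entries in the outputs above.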

Calculating the TF-IDF

In [19]:
def computeTFIDF(tfBow, idfs):
    # Multiply each word's term frequency by its IDF value
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf

# Run both sentences through the TF-IDF function
tfidfFirst = computeTFIDF(tfFirst, idfs)
tfidfSecond = computeTFIDF(tfSecond, idfs)

# Put the results in a DataFrame
tfidfDf = pd.DataFrame([tfidfFirst, tfidfSecond])
print(tfidfDf)

       data        is       key       job   century   Science   machine  \
0  0.000000  0.030103  0.000000  0.030103  0.030103  0.030103  0.000000   
1  0.037629  0.037629  0.037629  0.000000  0.000000  0.000000  0.037629   

   learning      21st      Data       the   science       for        of  \
0  0.000000  0.030103  0.030103  0.060206  0.000000  0.000000  0.030103   
1  0.037629  0.000000  0.000000  0.037629  0.037629  0.037629  0.000000   

   demanding  
0   0.030103  
1   0.000000  
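Since every entry of `idfs` is the same constant, the table above is simply the TF table scaled by about 0.301. The sketch below runs the whole pipeline with document frequency actually counted, so that the IDF term varies per word. It is one possible arrangement under stated assumptions: lowercasing is added, and the names `tokenized`, `df` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

docs = [
    "Data Science is the demanding job of the 21st century",
    "machine learning is the key for data science",
]
tokenized = [d.lower().split() for d in docs]  # lowercasing is an added assumption
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    counts = Counter(tokens)
    # tf(t,d) * log10(N / (df + 1)) for every word present in the document
    return {
        word: (count / len(tokens)) * math.log10(n_docs / (df[word] + 1))
        for word, count in counts.items()
    }

scores = [tfidf(tokens) for tokens in tokenized]
for row in scores:
    print(row)
```

On this two-sentence corpus, words unique to one sentence score 0 and words shared by both score below 0, which shows that the log(N/(df+1)) form needs a larger corpus to separate terms usefully.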


Another way, using the second IDF function (idf2)

In [20]:
tfidfA = computeTFIDF(tfFirst, idfA)
tfidfB = computeTFIDF(tfSecond, idfB)

#putting it in a dataframe
pd1= pd.DataFrame([tfidfA, tfidfB])
print(pd1)

       data        is       key       job   century   Science   machine  \
0  0.000000  0.087506  0.000000  0.087506  0.087506  0.087506  0.000000   
1  0.109383  0.109383  0.109383  0.000000  0.000000  0.000000  0.109383   

   learning      21st      Data       the   science       for        of  \
0  0.000000  0.087506  0.087506  0.139794  0.000000  0.000000  0.087506   
1  0.109383  0.000000  0.000000  0.109383  0.109383  0.109383  0.000000   

   demanding  
0   0.087506  
1   0.000000  
