<a href="https://colab.research.google.com/github/Himm11/NLPbasic/blob/main/Tf_Idf_Vector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import sklearn as sk
import math


In [None]:
# Load the two sentences and combine their words into a single vocabulary set:
first_sentence = "Data Science is the sexiest job of the 21st century"
second_sentence = "machine learning is the key for data science"
# Split each sentence so that every word is its own string
first_sentence = first_sentence.split(" ")
second_sentence = second_sentence.split(" ")
# Take the union of both word sets to drop duplicate words
total = set(first_sentence).union(set(second_sentence))
print(total)

In [None]:
# Count how often each word occurs in each sentence, using one key-value dictionary per sentence:
wordDictA = dict.fromkeys(total, 0)
wordDictB = dict.fromkeys(total, 0)
for word in first_sentence:
    wordDictA[word]+=1

for word in second_sentence:
    wordDictB[word]+=1


In [None]:
# Put both dictionaries in a DataFrame and view the result:
pd.DataFrame([wordDictA, wordDictB])

,machine,learning,for,is,century,of,Data,science,the,job,21st,sexiest,Science,data,key
0,0,0,0,1,1,1,1,0,2,1,1,1,1,0,0
1,1,1,1,1,0,0,0,1,1,0,0,0,0,1,1
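
The table treats "Data" and "data" (and likewise "Science" and "science") as different words, because the split is case-sensitive. Below is a minimal sketch of the same counting step with lowercasing applied first, using `collections.Counter` from the standard library. The names `sentences`, `counts`, and `vocabulary` are illustrative and not part of the original notebook.

```python
from collections import Counter

sentences = [
    "Data Science is the sexiest job of the 21st century",
    "machine learning is the key for data science",
]

# Lowercase before splitting so "Data" and "data" are counted as one word
counts = [Counter(sentence.lower().split(" ")) for sentence in sentences]

# The vocabulary is the union of the words seen in either sentence
vocabulary = sorted(set(counts[0]) | set(counts[1]))

print(len(vocabulary))
print(counts[0]["the"], counts[1]["data"])
```

A `Counter` returns 0 for words it has not seen, so there is no need to pre-fill the dictionaries with zeros.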


In [None]:
# Now let's write the TF function:
def computeTF(wordDict, doc):
    tfDict = {}
    # Number of words in this document (not in the whole corpus)
    wordCount = len(doc)
    for word, count in wordDict.items():
        tfDict[word] = count / float(wordCount)
    return tfDict
# Run both sentences through the TF function:
tfFirst = computeTF(wordDictA, first_sentence)
tfSecond = computeTF(wordDictB, second_sentence)
# Convert to a DataFrame for visualization
tf = pd.DataFrame([tfFirst, tfSecond])

In [None]:
tf

,machine,learning,for,is,century,of,Data,science,the,job,21st,sexiest,Science,data,key
0,0.0,0.0,0.0,0.1,0.1,0.1,0.1,0.0,0.2,0.1,0.1,0.1,0.1,0.0,0.0
1,0.125,0.125,0.125,0.125,0.0,0.0,0.0,0.125,0.125,0.0,0.0,0.0,0.0,0.125,0.125
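
Each row of this table sums to 1, because every count is divided by the length of its own sentence. A quick self-contained check of that property is sketched below; the helper `term_frequency` is a hypothetical name and not part of the notebook.

```python
import math
from collections import Counter

def term_frequency(words):
    """Return {word: count / number of words} for one tokenized document."""
    total_words = len(words)
    return {word: count / total_words for word, count in Counter(words).items()}

first = "Data Science is the sexiest job of the 21st century".split(" ")
tf_first = term_frequency(first)

# "the" occurs twice in a ten-word sentence
print(tf_first["the"])
# The term frequencies of a single document add up to 1
print(math.isclose(sum(tf_first.values()), 1.0))
```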


In [None]:
'''
That's all for the TF formula. Before moving on, a word about stop words: these are the most commonly occurring words in a language, and they add little or no information to the document vector, so we usually eliminate them. Removing them also reduces computation time and memory use.
The nltk library provides a downloadable stop word list, so instead of listing all the stop words ourselves we can iterate over our words and drop the ones that appear in nltk's list. There are more efficient ways to do this, but I'll just show a simple one.
Here is how to filter with the English stop word list:
'''


In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
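# Filter the first sentence; nltk's stop word list is lowercase, so compare in lowercase
filtered_sentence = [w for w in first_sentence if w.lower() not in stop_words]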
print(filtered_sentence)
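
nltk's English stop word list is all lowercase, so capitalized words need to be lowercased before the comparison. The sketch below shows the filtering step on its own. `SAMPLE_STOP_WORDS` is a small hand-picked set, assumed here only so the example runs without downloading anything; it is not nltk's actual list, and `remove_stop_words` is a hypothetical helper name.

```python
# Hand-picked stand-in for nltk's list (an assumption, for illustration only)
SAMPLE_STOP_WORDS = {"is", "the", "of", "for", "a", "an", "and"}

def remove_stop_words(words, stop_words):
    """Keep only the words whose lowercase form is not a stop word."""
    return [w for w in words if w.lower() not in stop_words]

words = "Data Science is the sexiest job of the 21st century".split(" ")
filtered = remove_stop_words(words, SAMPLE_STOP_WORDS)
print(filtered)
```

Passing the stop word set as a parameter makes it easy to swap in `set(stopwords.words('english'))` once nltk's data has been downloaded.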

In [None]:
# Now that the TF section is finished, we move on to the IDF part:
def computeIDF(docList):
    N = len(docList)
    # Document frequency: the number of documents that contain each word
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, count in doc.items():
            if count > 0:
                idfDict[word] += 1
    # Every vocabulary word occurs in at least one document, so the frequency is never zero
    for word, docFreq in idfDict.items():
        idfDict[word] = math.log10(N / float(docFreq))
    return idfDict
# Compute the IDF values over both sentences
idfs = computeIDF([wordDictA, wordDictB])


In [None]:
# With the IDF values in hand, let's finish by calculating TF-IDF
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf
# Run both sentences through the TF-IDF function:
tfidfFirst = computeTFIDF(tfFirst, idfs)
tfidfSecond = computeTFIDF(tfSecond, idfs)
# Put the result in a DataFrame
tfidf = pd.DataFrame([tfidfFirst, tfidfSecond])
print(tfidf)

    machine  learning       for   is   century        of      Data   science  \
0  0.000000  0.000000  0.000000  0.0  0.030103  0.030103  0.030103  0.000000   
1  0.037629  0.037629  0.037629  0.0  0.000000  0.000000  0.000000  0.037629   

   the       job      21st   sexiest   Science      data       key  
0  0.0  0.030103  0.030103  0.030103  0.030103  0.000000  0.000000  
1  0.0  0.000000  0.000000  0.000000  0.000000  0.037629  0.037629  


In [None]:
# That was a lot of work, but it is handy to know if you are ever asked to code TF-IDF from scratch. The sklearn library makes this much simpler, though. Let's look at an example below:

In [None]:
# The first step is to import the library
from sklearn.feature_extraction.text import TfidfVectorizer
# No manual lowercasing is needed: TfidfVectorizer lowercases its input by default (lowercase=True),
# so "Data Science" and "data science" map to the same tokens
firstV = "Data Science is the sexiest job of the 21st century"
secondV = "machine learning is the key for data science"
# Create the TfidfVectorizer
vectorize = TfidfVectorizer()
# Fit the model and transform both sentences in one step:
response = vectorize.fit_transform([firstV, secondV])

In [None]:
print(response)

  (0, 1)	0.34211869506421816
  (0, 0)	0.34211869506421816
  (0, 9)	0.34211869506421816
  (0, 5)	0.34211869506421816
  (0, 11)	0.34211869506421816
  (0, 12)	0.48684053853849035
  (0, 4)	0.24342026926924518
  (0, 10)	0.24342026926924518
  (0, 2)	0.24342026926924518
  (1, 3)	0.40740123733358447
  (1, 6)	0.40740123733358447
  (1, 7)	0.40740123733358447
  (1, 8)	0.40740123733358447
  (1, 12)	0.28986933576883284
  (1, 4)	0.28986933576883284
  (1, 10)	0.28986933576883284
  (1, 2)	0.28986933576883284
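
These numbers differ from the from-scratch values because TfidfVectorizer, with its default settings, uses raw counts for TF, a smoothed natural-log IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. Below is a minimal pure-Python sketch of that default behaviour. The function name `sklearn_style_tfidf` and the simplified tokenizer are my own assumptions, written to mirror the defaults rather than taken from the library's source.

```python
import math
import re
from collections import Counter

def sklearn_style_tfidf(documents):
    """Approximate TfidfVectorizer defaults: smoothed IDF and L2-normalized rows."""
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        # Raw count times IDF, then divide by the row's Euclidean length
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = sklearn_style_tfidf([
    "Data Science is the sexiest job of the 21st century",
    "machine learning is the key for data science",
])
print(round(rows[0]["the"], 6), round(rows[1]["machine"], 6))
```

In the printed sparse output each pair is (row, column), and the columns follow the alphabetically sorted vocabulary, so column 12 is "the". With sklearn installed, `vectorize.get_feature_names_out()` returns that column-to-word mapping.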
