- Term Frequency-Inverse Document Frequency (TF-IDF) analysis is a simple and robust method for understanding the context of a text.

- TF-IDF is used to find related content and to identify the important words and phrases in a larger text. Implementing TF-IDF analysis in Python is straightforward.

- Computers cannot understand the meaning of a text, but they can work with numbers. Words can be converted to numbers so that the relationships between them can be measured.

**Term Frequency:**

- Term frequency (TF) measures how often a word w occurs in a document (text) d.

- It is equal to the number of instances of word w in document d divided by the total number of words in document d.

- Term frequency therefore expresses a word’s occurrences relative to the length of the document. The denominator is the same for every word in a given document.

<img decoding="async" loading="lazy" class="alignnone" src="https://editor.analyticsvidhya.com/uploads/27405Screenshot_1.jpg" alt="term frequency | Reviews Classifier Using TF-IDF" width="773" height="102">
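The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequency`, the lowercase-and-split tokenization, and the example sentence are assumptions, not part of the original walkthrough.

```python
from collections import Counter


def term_frequency(document):
    """Map each word to (occurrences in document) / (total words in document)."""
    # Assumption: lowercase the text and split on whitespace to get words.
    words = document.lower().split()
    total = len(words)  # the same denominator is used for every word
    return {word: count / total for word, count in Counter(words).items()}


# Hypothetical example sentence, not taken from the article.
tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # 2 occurrences out of 6 words
print(tf["cat"])  # 1 occurrence out of 6 words
```

Because every count is divided by the same total, the term frequencies of one document always sum to 1.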

**Inverse Document Frequency (IDF)**

- Inverse document frequency gives a numeric value for how informative a word is across the corpus: the rarer the word, the higher the value.

- The inverse document frequency of word w is defined as the logarithm of the total number of documents (N) in a text corpus D divided by the number of documents containing w.

<img decoding="async" loading="lazy" class="alignnone" src="https://editor.analyticsvidhya.com/uploads/80300Screenshot_2.jpg" alt="Inverse document frequency" width="760" height="86">
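As a rough sketch, IDF can be computed as below, using the natural logarithm as the worked tables later in this article do. The function name `inverse_document_frequency`, the tokenization, and the three-document corpus are hypothetical and chosen only for illustration.

```python
import math


def inverse_document_frequency(word, documents):
    """Return log_e(N / number of documents that contain `word`)."""
    # Assumption: lowercase the text and split on whitespace to get words.
    tokenized = [doc.lower().split() for doc in documents]
    containing = sum(1 for words in tokenized if word in words)
    if containing == 0:
        raise ValueError(f"{word!r} does not occur in any document")
    return math.log(len(documents) / containing)


# Hypothetical three-document corpus.
corpus = ["the cat sat", "the dog ran", "a bird flew"]
idf_the = inverse_document_frequency("the", corpus)    # found in 2 of 3 documents
idf_bird = inverse_document_frequency("bird", corpus)  # found in 1 of 3 documents
print(idf_the, idf_bird)
```

The rarer word ("bird") receives the larger IDF, which is the behaviour the definition is meant to capture.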

- The product of TF and IDF is the TF-IDF score. It is often a good metric for determining whether a term is significant to a text: it represents the importance of a word in a particular document relative to the rest of the corpus.

**Worked example: Term Frequency**

**Sentence-1**

- good movie

**Sentence-2**

- good snacks

**Sentence-3**

- movie snacks good


|vocab|Sentence-1|Sentence-2|Sentence-3|
|--------|-----|---|----|
|good|1/2|1/2|1/3|
|movie|1/2|0/2|1/3|
|snacks|0/2|1/2|1/3|

**Inverse document frequency**


|vocab|IDF|
|-----|---|
|good|loge(3/3) = 0|
|movie|loge(3/2) ≈ 0.405|
|snacks|loge(3/2) ≈ 0.405|

**Term Frequency × Inverse document frequency**



|Sentence|good|movie|snacks|
|--------|-----|---|----|
|Sentence-1|1/2*0=0|1/2*loge(3/2)|0|
|Sentence-2|1/2*0=0|0|1/2*loge(3/2)|
|Sentence-3|1/3*0=0|1/3*loge(3/2)|1/3*loge(3/2)|


|Sentence|good|movie|snacks|
|--------|-----|---|----|
|Sentence-1|0|1/2*loge(3/2)|0|
|Sentence-2|0|0|1/2*loge(3/2)|
|Sentence-3|0|1/3*loge(3/2)|1/3*loge(3/2)|



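> good is present in every sentence, so its IDF is loge(3/3) = 0 and its TF-IDF is zero everywhere: it carries little importance.

> movie is present in only two of the three sentences, so it keeps a non-zero weight in the sentences that contain it.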
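The hand-computed tables above can be reproduced with a short script. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import math
from collections import Counter

sentences = ["good movie", "good snacks", "movie snacks good"]
tokenized = [s.split() for s in sentences]
vocab = sorted({word for words in tokenized for word in words})

# IDF: log_e(number of sentences / number of sentences containing the word)
idf = {
    word: math.log(len(tokenized) / sum(word in words for words in tokenized))
    for word in vocab
}

# TF-IDF: (count of word in sentence / sentence length) * IDF
tfidf = []
for words in tokenized:
    counts = Counter(words)
    tfidf.append({word: counts[word] / len(words) * idf[word] for word in vocab})

for sentence, row in zip(sentences, tfidf):
    print(sentence, {word: round(value, 4) for word, value in row.items()})
```

Rounded to four decimals, 1/2*loge(3/2) is 0.2027 and 1/3*loge(3/2) is 0.1352, while every entry for good is 0.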


Let’s work through an example with 3 documents:

Document 1: It is going to rain today.

Document 2: Today I am not going outside.

Document 3: I am going to watch the season premiere.

In [14]:
d1="It is going to rain today"
d2="Today I am not going outside"
d3="I am going to watch the season premiere"

l1=d1.split()
l2=d2.split()
l3=d3.split()
val1=set(l1+l2+l3)

In [16]:
dict1={}
for i in val1:
    # test membership in the token list l1, not the raw string d1
    # (a substring test would match 'I' inside 'It')
    if i in l1:
        dict1[i]=round(l1.count(i)/len(l1),2)
    else:
        dict1[i]=0
        
dict1

{'to': 0.17,
 'today': 0.17,
 'am': 0,
 'outside': 0,
 'Today': 0,
 'rain': 0.17,
 'the': 0,
 'going': 0.17,
 'It': 0.17,
 'premiere': 0,
 'season': 0,
 'watch': 0,
 'is': 0.17,
 'not': 0,
 'I': 0}

In [18]:
dict2={}
for i in val1:
    # term frequencies for Document 2: check against its own token list l2
    if i in l2:
        dict2[i]=round(l2.count(i)/len(l2),2)
    else:
        dict2[i]=0
        
dict2

{'to': 0,
 'today': 0,
 'am': 0.17,
 'outside': 0.17,
 'Today': 0.17,
 'rain': 0,
 'the': 0,
 'going': 0.17,
 'It': 0,
 'premiere': 0,
 'season': 0,
 'watch': 0,
 'is': 0,
 'not': 0.17,
 'I': 0.17}

In [20]:
dict3={}
for i in val1:
    # term frequencies for Document 3: check against its own token list l3
    if i in l3:
        dict3[i]=round(l3.count(i)/len(l3),2)
    else:
        dict3[i]=0
        
dict3

{'to': 0.12,
 'today': 0,
 'am': 0.12,
 'outside': 0,
 'Today': 0,
 'rain': 0,
 'the': 0.12,
 'going': 0.12,
 'It': 0,
 'premiere': 0.12,
 'season': 0.12,
 'watch': 0.12,
 'is': 0,
 'not': 0,
 'I': 0.12}

In [22]:
import pandas as pd
df1=pd.DataFrame(dict1,index=['A'])
df2=pd.DataFrame(dict2,index=['B'])
df3=pd.DataFrame(dict3,index=['C'])

In [28]:
import numpy as np
np.log(3/2)

0.4054651081081644
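The value loge(3/2) computed above is the IDF of any word that occurs in two of the three documents. The sketch below derives the document frequency and IDF of every word in the three example documents. It lowercases the text first, which is an assumption that differs from the case-sensitive split used in the cells above (there, 'Today' and 'today' count as separate words).

```python
import math

docs = [
    "It is going to rain today",
    "Today I am not going outside",
    "I am going to watch the season premiere",
]

# Assumption: lowercase before splitting so 'Today' and 'today' are one word.
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({word for words in tokenized for word in words})

# Document frequency: in how many documents does each word occur?
doc_freq = {word: sum(word in words for words in tokenized) for word in vocab}

# IDF: log_e(N / document frequency)
idf = {word: math.log(len(docs) / doc_freq[word]) for word in vocab}

for word in vocab:
    print(f"{word:10s} df={doc_freq[word]}  idf={idf[word]:.4f}")
```

Here going occurs in all three documents and gets an IDF of 0, words such as to and today occur in two documents and get loge(3/2), and words such as rain occur in a single document and get loge(3).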

# TF-IDF Vectorizer

In [32]:
Document1= "going rain today."
Document2= "Today going outside."
Document3= "going watch season premiere."
Doc = [Document1 ,Document2 , Document3]
print(Doc)

['going rain today.', 'Today going outside.', 'going watch season premiere.']


In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(Doc)
X

<3x7 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [36]:
vectorizer.vocabulary_  # maps each term to its column index in X

{'going': 0,
 'rain': 3,
 'today': 5,
 'outside': 1,
 'watch': 6,
 'season': 4,
 'premiere': 2}

In [38]:
X.toarray()

array([[0.42544054, 0.        , 0.        , 0.72033345, 0.        ,
        0.54783215, 0.        ],
       [0.42544054, 0.72033345, 0.        , 0.        , 0.        ,
        0.54783215, 0.        ],
       [0.32274454, 0.        , 0.54645401, 0.        , 0.54645401,
        0.        , 0.54645401]])

In [40]:
analyze = vectorizer.build_analyzer()
print("Document 1",analyze(Document1)) # tokens the vectorizer extracts from each document
print("Document 2",analyze(Document2))
print("Document 3",analyze(Document3))
print("Document transform",X.toarray()) # TF-IDF matrix, one row per document

Document 1 ['going', 'rain', 'today']
Document 2 ['today', 'going', 'outside']
Document 3 ['going', 'watch', 'season', 'premiere']
Document transform [[0.42544054 0.         0.         0.72033345 0.         0.54783215
  0.        ]
 [0.42544054 0.72033345 0.         0.         0.         0.54783215
  0.        ]
 [0.32274454 0.         0.54645401 0.         0.54645401 0.
  0.54645401]]


In [42]:
vectorizer.get_feature_names_out()  # sorted order of vocab

array(['going', 'outside', 'premiere', 'rain', 'season', 'today', 'watch'],
      dtype=object)

In [44]:
import pandas as pd
pd.DataFrame(X.toarray(),
             index=Doc,
             # pass the array itself; wrapping it in a list creates MultiIndex columns
             columns=vectorizer.get_feature_names_out(),
            )

|Document|going|outside|premiere|rain|season|today|watch|
|--------|-----|-------|--------|----|------|-----|-----|
|going rain today.|0.425441|0.0|0.0|0.720333|0.0|0.547832|0.0|
|Today going outside.|0.425441|0.720333|0.0|0.0|0.0|0.547832|0.0|
|going watch season premiere.|0.322745|0.0|0.546454|0.0|0.546454|0.0|0.546454|


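- The output shows the TF-IDF weight of each word in each of the 3 sentences. A higher weight marks a word that is more distinctive for that sentence, while a word shared by every sentence (going) gets the lowest non-zero weight in each row. With the text represented as numbers, you can now query it programmatically, for example: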

In [47]:
len(X.toarray()[0])

7

In [49]:
X.toarray()[0]

array([0.42544054, 0.        , 0.        , 0.72033345, 0.        ,
       0.54783215, 0.        ])
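Another question you might ask is which word carries the most weight in each document. The sketch below answers it using the feature names and the TF-IDF rows copied from the outputs above; the variable names are illustrative, and with a fitted vectorizer you would read the same values from `X.toarray()` and `vectorizer.get_feature_names_out()`.

```python
# Values copied from the vectorizer outputs shown above.
feature_names = ["going", "outside", "premiere", "rain", "season", "today", "watch"]
documents = ["going rain today.", "Today going outside.", "going watch season premiere."]
tfidf_rows = [
    [0.42544054, 0.0, 0.0, 0.72033345, 0.0, 0.54783215, 0.0],
    [0.42544054, 0.72033345, 0.0, 0.0, 0.0, 0.54783215, 0.0],
    [0.32274454, 0.0, 0.54645401, 0.0, 0.54645401, 0.0, 0.54645401],
]

top_terms = []
for row in tfidf_rows:
    # index of the largest weight in this row (first one wins on ties)
    best = max(range(len(row)), key=row.__getitem__)
    top_terms.append(feature_names[best])

for doc, term in zip(documents, top_terms):
    print(f"{doc!r} -> {term}")
```

In the third document premiere, season and watch are tied, so the sketch simply reports the first of them.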

In [51]:
# Without normalization
Document1= "good movie."
Document2= "good snacks."
Document3= "movie snacks good."
Doc = [Document1 ,Document2 , Document3]
print(Doc)


from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
X = vectorizer.fit_transform(Doc)

analyze = vectorizer.build_analyzer()
print("Document 1",analyze(Document1))
print("Document 2",analyze(Document2))
print("Document 3",analyze(Document3))
print("Document transform",X.toarray())

vectorizer.get_feature_names_out()

['good movie.', 'good snacks.', 'movie snacks good.']
Document 1 ['good', 'movie']
Document 2 ['good', 'snacks']
Document 3 ['movie', 'snacks', 'good']
Document transform [[1.         1.28768207 0.        ]
 [1.         0.         1.28768207]
 [1.         1.28768207 1.28768207]]


array(['good', 'movie', 'snacks'], dtype=object)

In [53]:
import numpy as np
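# Sentence-1's vector as produced by TfidfVectorizer with its default L2 normalization; its norm should be 1
v1=[0.61335554, 0.78980693, 0]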
np.linalg.norm(v1)

1.0000000025623583

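The unnormalized scikit-learn values above differ from the hand-computed TF-IDF table, repeated here for comparison:

|Sentence|good|movie|snacks|
|--------|-----|---|----|
|Sentence-1|1/2*0=0|1/2*loge(3/2)|0|
|Sentence-2|1/2*0=0|0|1/2*loge(3/2)|
|Sentence-3|1/3*0=0|1/3*loge(3/2)|1/3*loge(3/2)|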



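By default, TfidfVectorizer uses the raw count of the term as TF and a smoothed IDF:

$$\text{tf-idf}(t, d) = \text{count}(t, d) \times \left( \log_e \frac{N + 1}{\text{df}(t) + 1} + 1 \right)$$

where N is the number of documents and df(t) is the number of documents in which t appears.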

In [59]:
import numpy as np
# good: count 1 in the sentence, appears in all 3 documents
1 * (np.log((3 + 1)/(3+1)) + 1)

1.0

In [61]:
# movie: count 1 in the sentence, appears in 2 of the 3 documents
1 * (np.log((3 + 1)/(2+1)) + 1)

1.2876820724517808
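The two single-value checks above can be extended to the whole matrix. The sketch below recomputes every unnormalized entry with the smoothed formula, using only the standard library; the variable names are illustrative, and the punctuation is left out of the sentences because the simple whitespace split used here does not strip it.

```python
import math

sentences = ["good movie", "good snacks", "movie snacks good"]
tokenized = [s.split() for s in sentences]
vocab = sorted({word for words in tokenized for word in words})
n_docs = len(tokenized)

# Smoothed IDF: log_e((N + 1) / (df + 1)) + 1
smooth_idf = {}
for word in vocab:
    df = sum(word in words for words in tokenized)
    smooth_idf[word] = math.log((n_docs + 1) / (df + 1)) + 1

# Unnormalized TF-IDF: raw count of the word in the sentence * smoothed IDF
raw_tfidf = [[words.count(word) * smooth_idf[word] for word in vocab]
             for words in tokenized]

for sentence, row in zip(sentences, raw_tfidf):
    print(sentence, [round(value, 8) for value in row])
```

The rows should agree with the "Document transform" output of the run with `norm=None`: 1.0 for good, and 1.28768207 for movie and snacks wherever they occur.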

In [63]:
# With normalization (the default, norm='l2')
Document1= "good movie."
Document2= "good snacks."
Document3= "movie snacks good."
Doc = [Document1 ,Document2 , Document3]
print(Doc)


from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(Doc)

analyze = vectorizer.build_analyzer()
print("Document 1",analyze(Document1))
print("Document 2",analyze(Document2))
print("Document 3",analyze(Document3))
print("Document transform",X.toarray())

vectorizer.get_feature_names_out()

['good movie.', 'good snacks.', 'movie snacks good.']
Document 1 ['good', 'movie']
Document 2 ['good', 'snacks']
Document 3 ['movie', 'snacks', 'good']
Document transform [[0.61335554 0.78980693 0.        ]
 [0.61335554 0.         0.78980693]
 [0.48133417 0.61980538 0.61980538]]


array(['good', 'movie', 'snacks'], dtype=object)

In [65]:
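# unnormalized TF-IDF vector of Sentence-1
tfidf_vector = np.array([1, 1.28768207, 0])

# divide by the L2 norm to reproduce the normalized row above
tfidf_vector = tfidf_vector / np.linalg.norm(tfidf_vector)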

print(tfidf_vector)

[0.61335554 0.78980693 0.        ]
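The same check can be applied to all three rows at once. The sketch below takes the unnormalized rows copied from the `norm=None` output, divides each by its Euclidean (L2) norm, and prints the result for comparison with the normalized output; the helper name `l2_normalize` is illustrative.

```python
import math

# Unnormalized TF-IDF rows, copied from the norm=None output above.
raw_rows = [
    [1.0, 1.28768207, 0.0],
    [1.0, 0.0, 1.28768207],
    [1.0, 1.28768207, 1.28768207],
]


def l2_normalize(row):
    """Divide every entry by the Euclidean length of the row."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row]


normalized = [l2_normalize(row) for row in raw_rows]
for row in normalized:
    print([round(value, 8) for value in row])
```

After normalization every row has length 1, which is why the weight of good drops in the third sentence: that row contains three non-zero terms rather than two.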
