<a href="https://colab.research.google.com/github/BenMeehan/Foundational_Machine_Learning/blob/main/TF_IDF_Vectorizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://www.appliedaicourse.com/
# **TF-IDF Vectorizer Implementation**

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF is widely used in automated text analysis, where it scores words so that machine learning algorithms for Natural Language Processing (NLP) can work with numeric features instead of raw text.
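As a compact summary, the variant computed in this notebook can be written as follows. The symbols are introduced here for illustration only: $t$ is a word, $d$ a document, $N$ the number of documents, and $\mathrm{df}(t)$ the number of documents that contain $t$. The resulting row for each document is afterwards scaled to unit Euclidean length.

```latex
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d},
\qquad
\mathrm{idf}(t) = 1 + \ln\frac{1 + N}{1 + \mathrm{df}(t)},
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```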

**Sample Text Corpus for testing**

In [1]:
corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

**Generating a dictionary with all the relevant words**

In [18]:
from collections import Counter

vocabulary=set()
for doc in corpus:
  for word in doc.split(" "):
    if len(word)<2:
      continue 
    vocabulary.add(word)

vocabulary=sorted(list(vocabulary))
vocabulary={j:i for i,j in enumerate(vocabulary)}
vocabulary   #Feature Names

{'and': 0,
 'document': 1,
 'first': 2,
 'is': 3,
 'one': 4,
 'second': 5,
 'the': 6,
 'third': 7,
 'this': 8}
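Skipping words shorter than two characters lines up with scikit-learn's documented default `token_pattern`, which also keeps only tokens of two or more word characters. Whether that was the reason for the length check here is an assumption. The sketch below applies that pattern with the standard `re` module; `TOKEN_PATTERN` and `tokenize` are illustrative names, not part of the notebook.

```python
import re

# scikit-learn's documented default token_pattern for its text vectorizers.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = tokenize("This is a test")
print(tokens)  # ['this', 'is', 'test']
```

The single-character word "a" is dropped, just as the `len(word)<2` check would drop it.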

**Calculating the term frequency for each word in a sentence**
 
The term frequency is the number of times a word occurs in a given sentence divided by the total number of words in that sentence.

*The result is stored in a sparse CSR matrix*
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix
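To make the definition concrete, here is a small worked example for the second sentence of the corpus, using only the standard library. The name `tf_doc` is illustrative and does not appear elsewhere in the notebook.

```python
from collections import Counter

doc = "this document is the second document"
tokens = doc.split(" ")
counts = Counter(tokens)

# Term frequency: occurrences of the word divided by the total token count.
tf_doc = {word: n / len(tokens) for word, n in counts.items()}

for word, value in sorted(tf_doc.items()):
    print(f"{word}: {value:.4f}")
```

The word "document" occurs twice among six words, so its term frequency is 2/6 ≈ 0.3333, while every other word gets 1/6 ≈ 0.1667.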

In [48]:
from scipy.sparse import csr_matrix
def tf(corpus):
  # Collect one (row, col, value) triple per distinct word in each document.
  row=[]
  col=[]
  values=[]
  for idx,doc in enumerate(corpus):
    word_freq=dict(Counter(doc.split(" ")))
    for word,freq in word_freq.items():
      if len(word)>=2:
        values.append(freq/len(doc.split(" ")))
        row.append(idx)
        col.append(vocabulary[word])
  return csr_matrix((values,(row,col)),shape=(len(corpus),len(vocabulary)))

term_freq=tf(corpus)

**Calculating the Inverse Document Frequency**

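The inverse document frequency of a word is based on the total number of sentences divided by the number of sentences that contain the word. The natural logarithm of this ratio is taken, so rare words score higher than common ones.

1 is added to both the numerator and the denominator to prevent a division-by-zero error (which would occur if no document contained the word), and 1 is added to the logarithm so that words appearing in every document still get a nonzero weight. This is the same smoothed formula that scikit-learn uses by default.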

The IDF is calculated for each term in the dictionary and the result is stored in a numpy array.
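The sketch below works through the smoothed IDF for a few words of the sample corpus, assuming the formula `1 + ln((1 + N) / (1 + df))`. The helper name `smoothed_idf` is hypothetical. It counts whole tokens, because a plain substring test such as `"is" in "this"` would also match inside longer words.

```python
import math

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

def smoothed_idf(word, docs):
    """Return 1 + ln((1 + N) / (1 + df)), counting whole-token matches only."""
    df = sum(1 for doc in docs if word in doc.split(" "))
    return 1 + math.log((1 + len(docs)) / (1 + df))

for word in ("and", "document", "first", "is"):
    print(word, round(smoothed_idf(word, corpus), 4))
```

For example, "document" appears in 3 of the 4 sentences, so its IDF is 1 + ln(5/4) ≈ 1.2231, while "is" appears in all 4 and gets exactly 1.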

In [49]:
import math
import numpy as np
def idf(corpus):
  idf_val=np.empty(len(vocabulary))
  for word,idx in vocabulary.items():
    counter=0 
    for doc in corpus:
      if word in doc.split(" "):  # match whole words, not substrings such as "is" inside "this"
        counter+=1
    x=1+math.log((1+len(corpus))/(1+counter))
    idf_val[idx]=x
  return idf_val 

inverse_doc=idf(corpus)

In [53]:
arr_tf=term_freq.toarray()

In [60]:
arr_tf

array([[0.        , 0.2       , 0.2       , 0.2       , 0.        ,
        0.        , 0.2       , 0.        , 0.2       ],
       [0.        , 0.33333333, 0.        , 0.16666667, 0.        ,
        0.16666667, 0.16666667, 0.        , 0.16666667],
       [0.16666667, 0.        , 0.        , 0.16666667, 0.16666667,
        0.        , 0.16666667, 0.16666667, 0.16666667],
       [0.        , 0.2       , 0.2       , 0.2       , 0.        ,
        0.        , 0.2       , 0.        , 0.2       ]])

**Calculating TF-IDF and Normalizing the values**

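Finally, the TF-IDF values are found by multiplying each column of the TF matrix element-wise by the IDF value of its term (NumPy broadcasting, not a matrix product). Each row is then L2-normalized so that every document vector has unit length.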

In [65]:
from sklearn.preprocessing import normalize
tf_idf=normalize(arr_tf*inverse_doc)
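The expression `arr_tf*inverse_doc` relies on NumPy broadcasting, and `normalize` uses the L2 norm per row by default. The sketch below reproduces both steps by hand for the first sentence only, with the TF and IDF values copied from the outputs shown in this notebook. The names `tf_row`, `idf_vec`, `weighted` and `unit` are illustrative.

```python
import numpy as np

# Row 0 of the TF matrix and the IDF vector, copied from the notebook outputs.
tf_row = np.array([[0.0, 0.2, 0.2, 0.2, 0.0, 0.0, 0.2, 0.0, 0.2]])
idf_vec = np.array([1.91629073, 1.22314355, 1.51082562, 1.0,
                    1.91629073, 1.91629073, 1.0, 1.91629073, 1.0])

# Broadcasting scales column j by idf_vec[j]: element-wise, not a matrix product.
weighted = tf_row * idf_vec

# L2 normalization: divide each row by its Euclidean length.
unit = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(unit, 4))
```

The nonzero entries come out at roughly 0.4698, 0.5803 and 0.3841, which should agree with row 0 of the printed TF-IDF matrix.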

In [69]:
print(csr_matrix(tf_idf))

  (0, 1)	0.4697913855799205
  (0, 2)	0.580285823684436
  (0, 3)	0.3840852409148149
  (0, 6)	0.3840852409148149
  (0, 8)	0.3840852409148149
  (1, 1)	0.6876235979836937
  (1, 3)	0.2810886740337529
  (1, 5)	0.5386476208856762
  (1, 6)	0.2810886740337529
  (1, 8)	0.2810886740337529
  (2, 0)	0.511848512707169
  (2, 3)	0.267103787642168
  (2, 4)	0.511848512707169
  (2, 6)	0.267103787642168
  (2, 7)	0.511848512707169
  (2, 8)	0.267103787642168
  (3, 1)	0.4697913855799205
  (3, 2)	0.580285823684436
  (3, 3)	0.3840852409148149
  (3, 6)	0.3840852409148149
  (3, 8)	0.3840852409148149


# **Conclusion**

**This result can be used to answer questions like:**

e.g. *How relevant is the word "first" to sentence 1?*

The word *first* is at index 2 in the dictionary, and sentence 1 is row 0, so its TF-IDF value is found at position (0, 2): 0.58. This is the largest weight in that row, which indicates that the word is **very relevant!**



---

# **Implementation using Scikit Learn**

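Everything after this verifies the correctness of the above implementation against a pre-built library, scikit-learn. scikit-learn uses raw counts for the term frequency instead of dividing by the sentence length, but the L2 normalization cancels that per-sentence factor, so the two outputs should match.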


In [70]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
skl_output = vectorizer.transform(corpus)

In [71]:
print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']




In [72]:
print(vectorizer.idf_)

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]


In [73]:
print(skl_output)

  (0, 8)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045
  (1, 8)	0.281088674033753
  (1, 6)	0.281088674033753
  (1, 5)	0.5386476208856763
  (1, 3)	0.281088674033753
  (1, 1)	0.6876235979836938
  (2, 8)	0.267103787642168
  (2, 7)	0.511848512707169
  (2, 6)	0.267103787642168
  (2, 4)	0.511848512707169
  (2, 3)	0.267103787642168
  (2, 0)	0.511848512707169
  (3, 8)	0.38408524091481483
  (3, 6)	0.38408524091481483
  (3, 3)	0.38408524091481483
  (3, 2)	0.5802858236844359
  (3, 1)	0.46979138557992045
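The two printed outputs can also be compared programmatically instead of by eye. The following is a minimal, self-contained sketch that rebuilds the manual pipeline with dense NumPy arrays (an illustrative simplification, not the sparse code used above) and checks it against `TfidfVectorizer` with `np.allclose`. The variable names are illustrative, and the column order is taken from scikit-learn's `vocabulary_` so both matrices line up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

vectorizer = TfidfVectorizer()
skl = vectorizer.fit_transform(corpus).toarray()
vocab = vectorizer.vocabulary_  # maps each term to its column index

# Manual pipeline: length-normalized TF, smoothed IDF, then L2-normalized rows.
n_docs = len(corpus)
manual = np.zeros((n_docs, len(vocab)))
for i, doc in enumerate(corpus):
    tokens = doc.split(" ")
    for word in tokens:
        if word in vocab:
            manual[i, vocab[word]] += 1 / len(tokens)

df = np.count_nonzero(manual, axis=0)  # documents containing each term
manual *= 1 + np.log((1 + n_docs) / (1 + df))
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

matches = np.allclose(manual, skl)
print(matches)
```

A numeric comparison with a tolerance is preferable to exact equality here, since the two implementations differ in the last few floating-point digits.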
