TF-IDF is the product of Term Frequency and Inverse Document Frequency. 

Here’s the formula for TF-IDF calculation:

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

Term Frequency measures how often a word appears in a document. It is the number of times the word appears in the document divided by the total number of words in that document.
tf(t,d) = count of t in d / number of words in d
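
As a quick illustration of this ratio, here is a minimal sketch on a hand-tokenized toy document. The token list and the helper name `tf` are made up for this example and are not part of the tutorial's corpus or code.

```python
# Toy document, already tokenized (hypothetical example data)
doc = ["the", "night", "changes", "the", "night"]

def tf(term, document):
    # occurrences of the term divided by the total number of tokens
    return document.count(term) / len(document)

print(tf("night", doc))    # 2 / 5 = 0.4
print(tf("changes", doc))  # 1 / 5 = 0.2
```

A word that never appears in the document simply gets a term frequency of 0.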

Inverse Document Frequency
The words that occur rarely in the corpus have a high IDF score. IDF is the log of the ratio of the number of documents to the number of documents containing the word.

We take the log of this ratio because, as the corpus grows, the raw ratio can become very large for rare words. Taking the log dampens this effect.

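1 is added to the denominator to smooth the value, which also avoids division by zero for a word that appears in no document. A side effect is that a word appearing in every document gets a negative IDF, because N/(df + 1) is then less than 1.

idf(t) = log(N/(df + 1))

Here N is the total number of documents in the corpus and df is the number of documents containing t.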
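
To see how this formula behaves, the sketch below evaluates it for a few document frequencies. The corpus size of 4 and the helper name `idf` are assumptions chosen only for illustration.

```python
import math

N = 4  # hypothetical corpus size

def idf(df, n_docs=N):
    # log of (number of documents) / (document frequency + 1)
    return math.log(n_docs / (df + 1))

for df in (0, 1, 3, 4):
    print(df, round(idf(df), 4))
# 0 1.3863
# 1 0.6931
# 3 0.0
# 4 -0.2231
```

Rarer words score higher, and with this particular smoothing a word found in every document (df = 4 here) ends up slightly below zero.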

In [19]:
#Data Preprocessing
#Build a vocabulary set from the words in our training data and assign a unique index to each word in the set.

#Importing required modules
import numpy as np
from nltk.tokenize import word_tokenize
 


In [20]:
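#Example text corpus for our tutorial
#Note: there is no comma after the first string literal, so Python joins it with the second one.
#The list therefore holds two documents, not three.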
text = ['We are only getting older, baby.\
        And I have been thinking about it lately.' \
        'Does it ever drive you crazy,\
         Just how fast the night changes', \
        'Everything that you have evr dreamed of,\
        disappearing when you wake up.']

In [21]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [24]:
import nltk
#Preprocessing the text data
sentences = []    #tokenized documents: one list of lowercase words per document
word_set = set()  #vocabulary: every distinct word in the corpus

for sent in text:
    #lowercase each token and keep only purely alphabetic ones (drops punctuation)
    x = [i.lower() for i in word_tokenize(sent) if i.isalpha()]
    sentences.append(x)
    word_set.update(x)

#Total documents in our corpus
total_documents = len(sentences)
 
#Creating an index for each word in our vocab.
#Note: set iteration order can differ between runs, so the index assigned to a word is not stable.
index_dict = {word: i for i, word in enumerate(word_set)}

In [25]:
#Create a dictionary for keeping count of the no. of docs containing the word
 
def count_dict(sentences):
    word_count = {}
    for word in word_set:
        word_count[word] = 0
        for sent in sentences:
            if word in sent:
                word_count[word] += 1
    return word_count
 
word_count = count_dict(sentences)
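
The same document-frequency count can also be done in a single pass over the corpus. The sketch below is an alternative, not the approach used in this tutorial. It uses `collections.Counter` from the standard library, and the small `docs` list is hypothetical data standing in for `sentences`.

```python
from collections import Counter

# Hypothetical tokenized corpus standing in for `sentences`
docs = [["we", "are", "you", "you"], ["you", "have"], ["have", "been"]]

# set(doc) removes repeats within a document, so each document
# contributes at most 1 to a word's count
doc_freq = Counter(word for doc in docs for word in set(doc))

print(doc_freq["you"])   # 2
print(doc_freq["have"])  # 2
print(doc_freq["we"])    # 1
```

A `Counter` returns 0 for a word it has never seen, so no special handling is needed for missing words.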

In [26]:
#Term Frequency
def termfreq(document, word):
    #document is a list of tokens; tf = occurrences of word / total tokens
    N = len(document)
    occurrence = document.count(word)
    return occurrence/N

In [27]:
#Inverse Document Frequency
def inverse_doc_freq(word):
    #df + 1 in the denominator; a word missing from word_count is treated as df = 0
    word_occurrence = word_count.get(word, 0) + 1
    return np.log(total_documents/word_occurrence)

In [28]:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence,word)
        idf = inverse_doc_freq(word)
         
        value = tf*idf
        tf_idf_vec[index_dict[word]] = value 
    return tf_idf_vec

In [18]:
vectors = []
for sent in sentences:
    vec = tf_idf(sent)
    vectors.append(vec)
 
print(vectors[0])

[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.         -0.01689438 -0.01689438  0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.        ]
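
The two non-zero entries are negative, which follows from the + 1 in the IDF denominator. As written, the corpus holds two documents, so a word found in both gets idf = log(2/3), while a word found in only one gets log(2/2) = 0. The sketch below checks the printed value by hand. The token count of 24 for the first document is an assumption inferred from the printed value, not recomputed from the corpus here.

```python
import numpy as np

# Values assumed from the run above, not recomputed from the corpus
total_documents = 2   # documents in the corpus as written
df = 2                # the word occurs in both documents
tokens_in_doc = 24    # alphabetic tokens in the first document (assumed)

tf = 1 / tokens_in_doc                    # the word occurs once in the document
idf = np.log(total_documents / (df + 1))  # log(2/3), which is negative
print(round(float(tf * idf), 8))          # -0.01689438
```

Under these assumptions, the two non-zero positions would correspond to the two words shared by both documents ("have" and "you").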
