Computers are known to be great number crunchers, but they are not that good at dealing with text.

So, we make text computer-friendly by turning it into numbers.

Generally, when we speak about text, we mean a corpus.

A corpus is nothing but a collection of documents.



In [6]:
# Two documents in the corpus

docA = "The dog sat on my lap"

docB = "The cat sat on my bed"

Generally, when we work on text analysis, we use the Bag of Words model to represent a document.

Here, each document can be thought of as a bag of words.

#### Tokenizing:

Here, we split each document into a bag of words. Splitting a document into its component words is known as tokenizing.


In [7]:
bowA = docA.split(' ')
bowB = docB.split(' ')

bowB

['The', 'cat', 'sat', 'on', 'my', 'bed']
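
Splitting on a single space works for these two sentences, but it keeps punctuation attached to words and treats `The` and `the` as different tokens. A minimal sketch of a slightly more forgiving tokenizer, assuming lowercase ASCII words are all we need (the `tokenize` helper is a hypothetical name, not part of the cells above):

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("The dog sat on my lap.")
print(tokens)  # ['the', 'dog', 'sat', 'on', 'my', 'lap']
```

The rest of this walkthrough keeps the plain `split(' ')` version, so the outputs below still contain the capitalised `The`.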

Now we have to convert these bags of words into numbers.

One strategy is to create a **vector** of all possible words and, for each document, count how many times each word occurs.

In [8]:
# We use Python's set data structure to eliminate duplicates and give us the unique words

word_set = set(bowA).union(set(bowB))

word_set

{'The', 'bed', 'cat', 'dog', 'lap', 'my', 'on', 'sat'}

In [9]:
# Create a dictionary for each document, keyed by the unique words found in the set above

dictA = dict.fromkeys(word_set, 0)
dictB = dict.fromkeys(word_set, 0)

dictA

{'The': 0, 'bed': 0, 'cat': 0, 'dog': 0, 'lap': 0, 'my': 0, 'on': 0, 'sat': 0}

In [10]:
# Count words present in each bag and update count in the dict

for word in bowA:
    dictA[word] += 1

for word in bowB:
    dictB[word] += 1

dictA

{'The': 1, 'bed': 0, 'cat': 0, 'dog': 1, 'lap': 1, 'my': 1, 'on': 1, 'sat': 1}
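
The same counts can be produced with `collections.Counter` from the standard library. This is only a sketch of an alternative; it redefines the bags and the vocabulary so that it runs on its own, and the `count_words` helper is a hypothetical name:

```python
from collections import Counter

def count_words(bow, vocabulary):
    """Return a dict with one entry per vocabulary word, 0 if the word is absent."""
    counts = Counter(bow)
    # Counter returns 0 for missing keys, so absent words need no special case
    return {word: counts[word] for word in sorted(vocabulary)}

bow_a = "The dog sat on my lap".split(' ')
bow_b = "The cat sat on my bed".split(' ')
vocabulary = set(bow_a) | set(bow_b)

counts_a = count_words(bow_a, vocabulary)
counts_b = count_words(bow_b, vocabulary)
print(counts_a)
```

Unlike the loop above, this version does not raise a `KeyError` if a bag contains a word that is missing from the vocabulary; such a word is simply left out of the result.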

In [10]:
import pandas as pd

pd.DataFrame([dictA, dictB])


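|   | The | bed | cat | dog | lap | my | on | sat |
|---|-----|-----|-----|-----|-----|----|----|-----|
| 0 | 1   | 0   | 0   | 1   | 1   | 1  | 1  | 1   |
| 1 | 1   | 1   | 1   | 0   | 0   | 1  | 1  | 1   |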


So, a word problem just got converted into a linear algebra problem.
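
To make that concrete: once each document is a row of numbers, we can compare documents with ordinary vector operations. The sketch below computes the cosine similarity of the two count vectors; cosine similarity is just one common choice picked here for illustration, and the `cosine_similarity` helper is a hypothetical name:

```python
import math

# Count vectors over the vocabulary (The, bed, cat, dog, lap, my, on, sat)
vec_a = [1, 0, 0, 1, 1, 1, 1, 1]
vec_b = [1, 1, 1, 0, 0, 1, 1, 1]

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

similarity = cosine_similarity(vec_a, vec_b)
print(round(similarity, 4))  # 0.6667
```

The two documents share four of their six words, which is why the similarity comes out at 4/6.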

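But when we build our document matrix out of raw counts, we end up with numbers that don't carry much information: a word that appears in every document counts just as much as a word that distinguishes one document from another.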

#### TF-IDF IS A BETTER STRATEGY:

Here, rather than counting words, we use the TF-IDF score of each word to rank its importance.

To calculate the TF-IDF score of a word, we use the formula below:

                term frequency of word * inverse document frequency of the word

                tf(w) * idf(w) 

Luckily for us, TF-IDF is built into many libraries, such as scikit-learn, so we don't have to write the implementation below every time we need TF-IDF.


#### Let's start with Calculating Term Frequency of the word. 

The formula to calculate term frequency is:

Number of times the word appears in the document / total number of words in the document.

In [11]:
def compute_term_frequency(word_dict, bow):
    tf_dict = {}
    total_no = len(bow)
    for word, count in word_dict.items():
        tf_dict[word] = count/total_no
    return tf_dict
    

In [12]:
tfBowA = compute_term_frequency(dictA, bowA)
tfBowB = compute_term_frequency(dictB, bowB)
tfBowA

{'The': 0.16666666666666666,
 'bed': 0.0,
 'cat': 0.0,
 'dog': 0.16666666666666666,
 'lap': 0.16666666666666666,
 'my': 0.16666666666666666,
 'on': 0.16666666666666666,
 'sat': 0.16666666666666666}

#### Next, let's calculate the Inverse Document Frequency of the word.

The formula to calculate inverse document frequency is:

log(Number of documents / Number of documents that contain the word w)
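
Written in symbols, with $N$ the number of documents in the corpus and $\mathrm{df}(w)$ the number of documents that contain the word $w$ (the symbol names are chosen here for illustration, and the natural logarithm is assumed because the code uses `math.log`):

```latex
\mathrm{tf}(w, d) = \frac{\text{count of } w \text{ in } d}{\text{number of words in } d},
\qquad
\mathrm{idf}(w) = \log\frac{N}{\mathrm{df}(w)},
\qquad
\text{tf-idf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
```

For example, `dog` appears once among the six words of docA and in one of the two documents, so its score in docA is (1/6) · ln(2/1) ≈ 0.1155, while `The` appears in both documents and gets an IDF of ln(2/2) = 0.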

In [13]:
import math

def compute_inverse_df(doc_list):
    # doc_list is a list of word-count dicts that share the same keys, e.g.
    # [{'The': 1, 'bed': 0, ...}, {'The': 1, 'bed': 1, ...}]
    no = len(doc_list)

    # Document frequency: the number of documents that contain each word
    idf_dict = dict.fromkeys(doc_list[0].keys(), 0)
    for doc in doc_list:
        for word, count in doc.items():
            if count > 0:
                idf_dict[word] += 1

    # Replace each document frequency with log(N / document frequency)
    for word, count in idf_dict.items():
        idf_dict[word] = math.log(no / count)
    return idf_dict
            
    

In [16]:
idfs = compute_inverse_df([dictA, dictB])

In [17]:
def compute_TFIDF(tf_bow, idfs):
    # tf_bow maps each word to its term frequency in one document
    tfidf = {}
    for word, tf in tf_bow.items():
        tfidf[word] = tf * idfs[word]
    return tfidf
        

In [18]:
tfidfbowA = compute_TFIDF(tfBowA, idfs)
tfidfbowB = compute_TFIDF(tfBowB, idfs)

In [19]:
import pandas as pd
pd.DataFrame([tfidfbowA, tfidfbowB])

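|   | The | bed      | cat      | dog      | lap      | my  | on  | sat |
|---|-----|----------|----------|----------|----------|-----|-----|-----|
| 0 | 0.0 | 0.0      | 0.0      | 0.115525 | 0.115525 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.115525 | 0.115525 | 0.0      | 0.0      | 0.0 | 0.0 | 0.0 |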
