What does TfidfVectorizer do?

Without going into the math, TF-IDF scores are word frequency scores that try to highlight the more interesting words, e.g. words that are frequent in one document but not across documents. The TfidfVectorizer tokenizes documents, learns the vocabulary and the inverse document frequency weightings, and lets you encode new documents.

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [44]:
docA = "The car is driven on the road"
docB = "The truck is driven on the highway"

In [46]:
tfidf = TfidfVectorizer()

In [47]:
response = tfidf.fit_transform([docA, docB])

In [None]:
# Computes the TF-IDF score for each word in the corpus, by document.

In [None]:
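# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf.get_feature_names_out()
# nonzero() returns (row indices, column indices); use both so that each score
# is read from the document it belongs to, not always from document 0
for row, col in zip(*response.nonzero()):
    print(row, feature_names[col], ' - ', response[row, col])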
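
The introduction mentions that a fitted vectorizer can encode new documents. A minimal sketch of that step follows, assuming the same two example sentences; `new_doc` is a made-up sentence used only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docA = "The car is driven on the road"
docB = "The truck is driven on the highway"

tfidf = TfidfVectorizer()
tfidf.fit([docA, docB])  # learn the vocabulary and the IDF weights

# Hypothetical new document; "bus" was never seen during fitting
new_doc = "The bus is driven on the highway"
new_vector = tfidf.transform([new_doc])

print(new_vector.shape)            # one row, one column per vocabulary word
print("bus" in tfidf.vocabulary_)  # unseen words are not added to the vocabulary
```

Words that were not in the fitted vocabulary are simply ignored by `transform`, so the new vector has the same number of columns as the training matrix.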

# Counting words

To start using TfidfTransformer, you first have to create a CountVectorizer to count the words (term frequency). CountVectorizer can also limit your vocabulary size, apply stop words, and so on. The code below creates one with the default settings and counts the words.

In [12]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer,CountVectorizer
docs = ['The sky is blue.','The sun is bright.']

In [13]:
cv=CountVectorizer()
# This step generates word counts for the words in your docs
word_count_vector=cv.fit_transform(docs)
word_count_vector.shape

(2, 6)

In [14]:
# 2 sentences, 6 distinct words (duplicates are counted once)
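
The section introduction mentions TfidfTransformer, but the cells above stop at the raw counts. The sketch below shows the missing second step under the same two-sentence corpus, and checks that it gives the same matrix as TfidfVectorizer, which scikit-learn documents as equivalent to CountVectorizer followed by TfidfTransformer. The variable names `two_step` and `one_step` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = ['The sky is blue.', 'The sun is bright.']

# Step 1: raw word counts
cv = CountVectorizer()
word_count_vector = cv.fit_transform(docs)

# Step 2: reweight the counts with IDF (default settings)
tfidf_transformer = TfidfTransformer()
two_step = tfidf_transformer.fit_transform(word_count_vector)

# The same thing in one step
one_step = TfidfVectorizer().fit_transform(docs)

print(two_step.shape)
print(np.allclose(two_step.toarray(), one_step.toarray()))
```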

In [15]:
text=['Perhaps one of the most significant advances made by Arabic  mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowed rational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as \"algebraic objects\". It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before.']
cv=CountVectorizer()
# This step generates word counts for the words in your docs
word_count_vector=cv.fit_transform(text)
word_count_vector.shape

(1, 83)

In [16]:
# 1 document, 83 distinct words
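
The counts above use the default settings. As a rough sketch of the two options mentioned at the start of this section, the example below applies scikit-learn's built-in English stop word list and caps the vocabulary size with `max_features`. The value 3 is an arbitrary choice for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['The sky is blue.', 'The sun is bright.']

# Drop common English words using scikit-learn's built-in stop word list
cv_stop = CountVectorizer(stop_words='english')
cv_stop.fit(docs)
print(sorted(cv_stop.vocabulary_))

# Keep only the 3 most frequent words across the corpus
cv_small = CountVectorizer(max_features=3)
counts_small = cv_small.fit_transform(docs)
print(counts_small.shape)
```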

# Tokenizing: bag of words

In [27]:
docs1="Perhaps one of the most significant advances made by Arabic  mathematics began at this time with the work of al-Khwarizmi"
docs2= "namely the beginnings of algebra. It is important to understand just how significant this new idea was"
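# Despite their names, tf and idf are just the token lists of docs1 and docs2.
# Splitting on a single space leaves an empty string '' where docs1 has a double space.
tf=docs1.split(" ")
idf=docs2.split(" ")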

In [28]:
tf

['Perhaps',
 'one',
 'of',
 'the',
 'most',
 'significant',
 'advances',
 'made',
 'by',
 'Arabic',
 '',
 'mathematics',
 'began',
 'at',
 'this',
 'time',
 'with',
 'the',
 'work',
 'of',
 'al-Khwarizmi']

In [29]:
wordSet=set(tf).union(set(idf))

In [30]:
# All distinct words in both documents
wordSet

{'',
 'Arabic',
 'It',
 'Perhaps',
 'advances',
 'al-Khwarizmi',
 'algebra.',
 'at',
 'began',
 'beginnings',
 'by',
 'how',
 'idea',
 'important',
 'is',
 'just',
 'made',
 'mathematics',
 'most',
 'namely',
 'new',
 'of',
 'one',
 'significant',
 'the',
 'this',
 'time',
 'to',
 'understand',
 'was',
 'with',
 'work'}
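
The set above contains an empty string and the token 'algebra.' with its trailing period, and it treats 'It' as different from a lowercase 'it'. One possible way to avoid this, sketched below, is a small regular-expression tokenizer. The helper `tokenize` is hypothetical; its pattern mirrors the default `token_pattern` of CountVectorizer (words of two or more characters), and the lowercasing mirrors its default `lowercase=True`.

```python
import re

docs1 = "Perhaps one of the most significant advances made by Arabic  mathematics began at this time with the work of al-Khwarizmi"
docs2 = "namely the beginnings of algebra. It is important to understand just how significant this new idea was"

def tokenize(text):
    """Lowercase the text and return runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokens1 = tokenize(docs1)
tokens2 = tokenize(docs2)
vocabulary = set(tokens1) | set(tokens2)

print('' in vocabulary)         # no empty token from the double space
print('algebra' in vocabulary)  # punctuation is stripped
```

Note that this pattern also splits 'al-Khwarizmi' into 'al' and 'khwarizmi', because the hyphen is not a word character.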

In [31]:
# Create one dictionary per document to hold the word counts, initialised to 0
worddictA=dict.fromkeys(wordSet,0)
worddictB=dict.fromkeys(wordSet,0)


In [32]:
 
worddictA

{'': 0,
 'mathematics': 0,
 'to': 0,
 'algebra.': 0,
 'at': 0,
 'It': 0,
 'the': 0,
 'al-Khwarizmi': 0,
 'most': 0,
 'by': 0,
 'is': 0,
 'idea': 0,
 'namely': 0,
 'work': 0,
 'understand': 0,
 'made': 0,
 'beginnings': 0,
 'significant': 0,
 'one': 0,
 'new': 0,
 'time': 0,
 'how': 0,
 'with': 0,
 'important': 0,
 'was': 0,
 'Perhaps': 0,
 'Arabic': 0,
 'of': 0,
 'just': 0,
 'began': 0,
 'this': 0,
 'advances': 0}

In [50]:
for word in tf:
    worddictA[word]+=1    
for word in idf:
    worddictB[word]+=1 

In [35]:
worddictA

{'': 1,
 'mathematics': 1,
 'to': 0,
 'algebra.': 0,
 'at': 1,
 'It': 0,
 'the': 2,
 'al-Khwarizmi': 1,
 'most': 1,
 'by': 1,
 'is': 0,
 'idea': 0,
 'namely': 0,
 'work': 1,
 'understand': 0,
 'made': 1,
 'beginnings': 0,
 'significant': 1,
 'one': 1,
 'new': 0,
 'time': 1,
 'how': 0,
 'with': 1,
 'important': 0,
 'was': 0,
 'Perhaps': 1,
 'Arabic': 1,
 'of': 2,
 'just': 0,
 'began': 1,
 'this': 1,
 'advances': 1}
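
The same counts can be obtained without pre-filling a dictionary. The sketch below uses `collections.Counter` from the standard library on the first document; calling `split()` with no argument is a small change from the cells above that skips the empty string caused by the double space.

```python
from collections import Counter

docs1 = "Perhaps one of the most significant advances made by Arabic  mathematics began at this time with the work of al-Khwarizmi"

# split() with no argument splits on any run of whitespace
tokens = docs1.split()
counts = Counter(tokens)

# A Counter returns 0 for words it has never seen
print(counts['the'], counts['of'], counts['to'])  # 2 2 0
```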

In [52]:
import pandas as pd
pd.DataFrame([worddictA, worddictB]) 

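|   | '' | mathematics | to | algebra. | at | It | the | al-Khwarizmi | most | by | ... | with | important | was | Perhaps | Arabic | of | just | began | this | advances |
|---|----|-------------|----|----------|----|----|-----|--------------|------|----|-----|------|-----------|-----|---------|--------|----|------|-------|------|----------|
| 0 | 3 | 3 | 0 | 0 | 3 | 0 | 6 | 3 | 3 | 3 | ... | 3 | 0 | 0 | 3 | 3 | 6 | 0 | 3 | 3 | 3 |
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |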


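A pandas DataFrame gives a tabular view of the counts: one row per document and one column per word in the vocabulary. Note that row 0 shows counts of 3 and 6 where worddictA above shows 1 and 2. The counting loop adds to the existing dictionary, so the first document's counts appear to have been accumulated more than once, for example by re-running the loop without resetting worddictA. The term frequencies computed below use the same inflated counts (3/21 ≈ 0.143 instead of 1/21 ≈ 0.048).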

# compute Term Frequency

In [58]:
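def compute_term_frequency(word_dictionary,bag_of_words_a):
    """Divide each word count by the number of tokens in the document."""
    term_frequency_dictionary = {}
    # Use the token list passed in as an argument, not the global tf
    length_of_bag_of_words = len(bag_of_words_a)

    for word, count in word_dictionary.items():
        term_frequency_dictionary[word] = count / float(length_of_bag_of_words)

    return term_frequency_dictionary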

# Implementation

print(compute_term_frequency(worddictA,tf))
 

{'': 0.14285714285714285, 'mathematics': 0.14285714285714285, 'to': 0.0, 'algebra.': 0.0, 'at': 0.14285714285714285, 'It': 0.0, 'the': 0.2857142857142857, 'al-Khwarizmi': 0.14285714285714285, 'most': 0.14285714285714285, 'by': 0.14285714285714285, 'is': 0.0, 'idea': 0.0, 'namely': 0.0, 'work': 0.14285714285714285, 'understand': 0.0, 'made': 0.14285714285714285, 'beginnings': 0.0, 'significant': 0.14285714285714285, 'one': 0.14285714285714285, 'new': 0.0, 'time': 0.14285714285714285, 'how': 0.0, 'with': 0.14285714285714285, 'important': 0.0, 'was': 0.0, 'Perhaps': 0.14285714285714285, 'Arabic': 0.14285714285714285, 'of': 0.2857142857142857, 'just': 0.0, 'began': 0.14285714285714285, 'this': 0.14285714285714285, 'advances': 0.14285714285714285}
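
As a compact cross-check of the definition used above (word count divided by the number of tokens in the document), here is a sketch that computes term frequency straight from a token list. The helper `term_frequency` is hypothetical, and the sentence is the short docA example from the first section, lowercased so that 'The' and 'the' count as one word.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {word: count / number of tokens} for one document."""
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}

docA = "The car is driven on the road"
tf_a = term_frequency(docA.lower().split())

print(tf_a['the'])  # 'the' is 2 of the 7 tokens
print(tf_a['car'])  # 'car' is 1 of the 7 tokens
```

Because every token is counted exactly once, the term frequencies of a document sum to 1.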


# Inverse document frequency

In [61]:
import math

def compute_inverse_document_frequency(full_doc_list):
    """IDF per word: log(N / (df + 1)), df = number of documents containing the word."""
    length_of_doc_list = len(full_doc_list)

    # Document frequency: in how many documents does each word occur at least once?
    idf_dict = dict.fromkeys(full_doc_list[0].keys(), 0)
    for doc in full_doc_list:
        for word, count in doc.items():
            if count > 0:
                idf_dict[word] += 1

    # The +1 avoids division by zero; with only two documents it also makes
    # the score 0 for words in one document and negative for words in both.
    for word, doc_freq in idf_dict.items():
        idf_dict[word] = math.log(length_of_doc_list / (float(doc_freq) + 1))

    return idf_dict

final_idf_dict = compute_inverse_document_frequency([worddictA, worddictB])
# Round for readability
print({word: round(value, 3) for word, value in final_idf_dict.items()})

{'': 0.0, 'mathematics': 0.0, 'to': 0.0, 'algebra.': 0.0, 'at': 0.0, 'It': 0.0, 'the': -0.405, 'al-Khwarizmi': 0.0, 'most': 0.0, 'by': 0.0, 'is': 0.0, 'idea': 0.0, 'namely': 0.0, 'work': 0.0, 'understand': 0.0, 'made': 0.0, 'beginnings': 0.0, 'significant': -0.405, 'one': 0.0, 'new': 0.0, 'time': 0.0, 'how': 0.0, 'with': 0.0, 'important': 0.0, 'was': 0.0, 'Perhaps': 0.0, 'Arabic': 0.0, 'of': -0.405, 'just': 0.0, 'began': 0.0, 'this': -0.405, 'advances': 0.0}
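
The last step, multiplying term frequency by inverse document frequency, is not shown above. The following is a minimal sketch of it, not the notebook's own method: the helper `tf_idf` is hypothetical, and it uses the plain, unsmoothed variant idf = log(N / df), which differs from the +1 variant above and from scikit-learn's default smoothing. It runs on the two short sentences from the first section.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {word: tf * idf} dict per document, with idf = log(N / df)."""
    n_docs = len(tokenized_docs)

    # Document frequency: number of documents that contain each word
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docA = "The car is driven on the road"
docB = "The truck is driven on the highway"
scores = tf_idf([docA.lower().split(), docB.lower().split()])

print(scores[0]['the'])  # 0.0, because 'the' occurs in both documents
print(scores[0]['car'])  # positive, because 'car' occurs only in docA
```

Words shared by every document get a score of 0 under this variant, while words unique to one document stand out, which is the behaviour described at the top of this notebook.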
