### TfidfVectorizer
Convert a collection of raw documents to a matrix of TF-IDF features.



- [Use Sklearn and Manual way](https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76)
- [TF-IDF Intuition| Text Preprocessing](https://www.youtube.com/watch?v=D2V1okCEsiE)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

- [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

- TF (Term Frequency) = number of times a word occurs in a sentence / total number of words in that sentence
- IDF (Inverse Document Frequency) = log(total number of sentences / number of sentences that contain the word)
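
The two definitions above can be sketched in plain Python. This is only an illustration of the textbook formulas: the helper names `term_frequency` and `inverse_document_frequency` are made up for this example, whitespace tokenization is a simplification, and the natural logarithm is an assumption because the definition does not fix a base. The IDF helper also assumes the word occurs in at least one document.

```python
import math

def term_frequency(term, tokens):
    """Share of the tokens in one document that equal `term`."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, tokenized_docs):
    """log(N / df), where df is the number of documents containing `term`."""
    containing = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(len(tokenized_docs) / containing)

# Two short documents, split on whitespace (a simplification)
docs = [
    "the man went out for a walk".split(),
    "the children sat around the fire".split(),
]

tf_the = term_frequency("the", docs[1])              # 2 of 6 tokens
idf_the = inverse_document_frequency("the", docs)    # log(2 / 2)
idf_fire = inverse_document_frequency("fire", docs)  # log(2 / 1)
print(tf_the, idf_the, idf_fire)
```

Under this textbook form, a word that appears in every document gets an IDF of zero. scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, so its weights will not match these numbers exactly.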

In [8]:
documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

In [9]:
vectorizer = TfidfVectorizer()

In [10]:
vectors = vectorizer.fit_transform([documentA, documentB])

In [17]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out().tolist()
feature_names

['around',
 'children',
 'fire',
 'for',
 'man',
 'out',
 'sat',
 'the',
 'walk',
 'went']

In [12]:
dense = vectors.todense()
denselist = dense.tolist()

In [15]:
dense

matrix([[0.        , 0.        , 0.        , 0.4261596 , 0.4261596 ,
         0.4261596 , 0.        , 0.30321606, 0.4261596 , 0.4261596 ],
        [0.40740124, 0.40740124, 0.40740124, 0.        , 0.        ,
         0.        , 0.40740124, 0.57973867, 0.        , 0.        ]])

In [16]:
denselist

[[0.0,
  0.0,
  0.0,
  0.42615959880289433,
  0.42615959880289433,
  0.42615959880289433,
  0.0,
  0.3032160644503863,
  0.42615959880289433,
  0.42615959880289433],
 [0.40740123733358447,
  0.40740123733358447,
  0.40740123733358447,
  0.0,
  0.0,
  0.0,
  0.40740123733358447,
  0.5797386715376657,
  0.0,
  0.0]]

In [13]:
df = pd.DataFrame(denselist, columns=feature_names)

In [14]:
df

     around  children      fire      for      man      out       sat       the     walk     went
0  0.000000  0.000000  0.000000  0.42616  0.42616  0.42616  0.000000  0.303216  0.42616  0.42616
1  0.407401  0.407401  0.407401  0.00000  0.00000  0.00000  0.407401  0.579739  0.00000  0.00000
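
The weights in this table can be reproduced by hand, which helps explain why "the" is not zero even though it occurs in both documents. The sketch below assumes the `TfidfVectorizer` defaults: lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names `smoothed_idf` and `tfidf_row` are made up for this illustration.

```python
import math
import re

docs = ["the man went out for a walk", "the children sat around the fire"]

# Keep tokens of two or more word characters, so the single-letter "a" is dropped
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    """Raw count times IDF for every vocabulary term, scaled to unit L2 norm."""
    raw = [tokens.count(t) * smoothed_idf(t) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]
print(vocab)
for row in rows:
    print([round(v, 6) for v in row])
```

Note that the term frequency here is the raw count rather than the count divided by the sentence length; the L2 normalization takes care of document length instead.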


### Comparing CountVectorizer and TfidfVectorizer

In [20]:
# Train Document Set:
d1 = 'The sky is blue'
d2 = 'The sun is bright'
# Test Document Set:
d3 = 'The sun in the sky is bright'
d4 = 'We can see the shining sun, the bright sun'

In [21]:
train = ['The sky is blue.','The sun is bright.']
test = ['The sun in the sky is bright', 'We can see the shining sun, the bright sun.']

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
countvectorizer = CountVectorizer()
tfidfvectorizer = TfidfVectorizer()

In [24]:
count_wm = countvectorizer.fit_transform(train)
tfidf_wm = tfidfvectorizer.fit_transform(train)

In [25]:
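# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
count_tokens = countvectorizer.get_feature_names_out().tolist()
tfidf_tokens = tfidfvectorizer.get_feature_names_out().tolist()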
print(count_tokens)
print(tfidf_tokens)

['blue', 'bright', 'is', 'sky', 'sun', 'the']
['blue', 'bright', 'is', 'sky', 'sun', 'the']


In [26]:
df_countvect = pd.DataFrame(data = count_wm.toarray(),index = ['Doc1','Doc2'],columns = count_tokens)
df_tfidfvect = pd.DataFrame(data = tfidf_wm.toarray(),index = ['Doc1','Doc2'],columns = tfidf_tokens)
print("Count Vectorizer\n")
print(df_countvect)
print("\nTF-IDF Vectorizer\n")
print(df_tfidfvect)

Count Vectorizer

      blue  bright  is  sky  sun  the
Doc1     1       0   1    1    0    1
Doc2     0       1   1    0    1    1

TF-IDF Vectorizer

          blue    bright        is       sky       sun       the
Doc1  0.576152  0.000000  0.409937  0.576152  0.000000  0.409937
Doc2  0.000000  0.576152  0.409937  0.000000  0.576152  0.409937
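
The test documents defined earlier are never vectorized in the cells above. A minimal sketch of how that step might look, assuming the goal is to score the test set with the vocabulary and IDF weights learned from the training set: call `transform` (not `fit_transform`) on the test documents. The variable names `test_wm` and `vocab` are chosen for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train = ['The sky is blue.', 'The sun is bright.']
test = ['The sun in the sky is bright', 'We can see the shining sun, the bright sun.']

tfidfvectorizer = TfidfVectorizer()
tfidfvectorizer.fit(train)                 # vocabulary and IDF come from train only
test_wm = tfidfvectorizer.transform(test)  # no refitting on the test documents

# vocabulary_ maps each term to its column index; sort terms by that index
vocab = sorted(tfidfvectorizer.vocabulary_, key=tfidfvectorizer.vocabulary_.get)
print(vocab)
print(test_wm.toarray().round(6))
```

Words that occur only in the test set, such as "in" or "shining", are not in the fitted vocabulary and are ignored, so the test matrix keeps the same six columns as the training matrix.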
