# <div class='alert alert-info'>TF-IDF Vectorizer</div>

**TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a document within a collection (corpus) of documents. The weight increases in proportion to the number of times the word appears in the document but is offset by how frequently the word appears across the corpus (data set).**

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
t=['We will study NLP today','NLP stands for Natural Language Processing','We are failing to understand NLP']

**Let's check the 3 items in our list**


In [7]:
for i in t:
    print(i)

We will study NLP today
NLP stands for Natural Language Processing
We are failing to understand NLP


In [9]:
vect=TfidfVectorizer()

In [10]:
vect.fit(t)

TfidfVectorizer()

<br>**<div class='alert alert-info'>Inverse Document Frequency:</div>**<br> Mainly, it measures how informative a word is. The key aim of a search is to locate the records that match the query. Since tf treats all terms as equally significant, term frequencies alone <br>cannot be used to measure the weight of a term in a document. First, find the document frequency<br> of a term t by counting the number of documents containing the term:<br>
**df(t) = N(t)**<br>
where<br>
df(t) = Document frequency of a term t<br>
N(t) = Number of documents containing the term t<br>
Term frequency is the number of occurrences of a term in a single document only, whereas document frequency is <br>the number of separate documents in which the term appears, so it depends on the entire corpus. Now let's look at the <br>definition of inverse document frequency. The IDF of a word is the number of documents in the corpus, N, <br>divided by the document frequency of the word.<br><br>

**idf(t) = N/ df(t) = N/N(t)**<br>
A more common word should be considered less significant, but this raw ratio seems too <br>harsh. We therefore take the logarithm (with base 2) of the inverse document frequency. So the idf of the term t becomes:<br><br>

**idf(t) = log(N/ df(t))**<br>
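
The definitions above can be tried by hand on the three example sentences. This is only a minimal sketch of the textbook formula with a base-2 logarithm; the helper names `document_frequency` and `textbook_idf` are made up for illustration, and splitting on whitespace is a simplification of real tokenization.

```python
import math

# Same three sentences as the list `t` used in this notebook.
docs = [
    "We will study NLP today",
    "NLP stands for Natural Language Processing",
    "We are failing to understand NLP",
]

# Lowercase and split on whitespace (a simplification of real tokenization).
tokenized = [set(doc.lower().split()) for doc in docs]
n_docs = len(docs)  # N in the formulas above


def document_frequency(term):
    """df(t): number of documents that contain the term."""
    return sum(term in tokens for tokens in tokenized)


def textbook_idf(term):
    """idf(t) = log2(N / df(t)), the plain formula without smoothing."""
    return math.log2(n_docs / document_frequency(term))


for term in ("nlp", "we", "study"):
    print(term, document_frequency(term), round(textbook_idf(term), 4))
```

With this plain formula a word that occurs in every document, such as "nlp", gets an idf of exactly 0, while rarer words get larger values.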

In [13]:
print(vect.vocabulary_)

{'we': 12, 'will': 13, 'study': 8, 'nlp': 5, 'today': 10, 'stands': 7, 'for': 2, 'natural': 4, 'language': 3, 'processing': 6, 'are': 0, 'failing': 1, 'to': 9, 'understand': 11}
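
The index numbers in this vocabulary are not the order in which the words were seen. As a rough sketch, the mapping can be rebuilt by lowercasing the text, extracting tokens of two or more word characters (the default `token_pattern` of `TfidfVectorizer`), and numbering the unique tokens in alphabetical order. The variable names below are illustrative only.

```python
import re

# Same three sentences as the list `t` used in this notebook.
docs = [
    "We will study NLP today",
    "NLP stands for Natural Language Processing",
    "We are failing to understand NLP",
]

# Tokens are runs of two or more word characters, so one-letter words are dropped.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Collect the unique lowercase tokens and sort them alphabetically.
terms = sorted({tok for doc in docs for tok in token_pattern.findall(doc.lower())})

# Each term's index is its position in the sorted list.
vocabulary = {term: index for index, term in enumerate(terms)}
print(vocabulary)
```

The dictionary is printed in alphabetical order here, but the term-to-index pairs should agree with `vect.vocabulary_` above, for example 'nlp' at index 5.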


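Now let's find the idf of each word. Note that `TfidfVectorizer` with its default settings (`smooth_idf=True`) computes idf(t) = ln((1 + N) / (1 + df(t))) + 1 using the natural logarithm, so the values differ from the plain formula above.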

In [14]:
vect.idf_

array([1.69314718, 1.69314718, 1.69314718, 1.69314718, 1.69314718,
       1.        , 1.69314718, 1.69314718, 1.69314718, 1.69314718,
       1.69314718, 1.69314718, 1.28768207, 1.69314718])

- **Words that are present in all the sentences or documents have the minimum idf value**
- **We can see that index 5 is NLP and it has the value 1** (which is the minimum) **since NLP is present in all 3 sentences in the list**
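
The values in `vect.idf_` can be checked by hand. The sketch below assumes the vectorizer's default smoothed formula, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and uses whitespace splitting, which is enough for these sentences; `smoothed_idf` is an illustrative helper, not a library function.

```python
import math

# Same three sentences as the list `t` used in this notebook.
docs = [
    "We will study NLP today",
    "NLP stands for Natural Language Processing",
    "We are failing to understand NLP",
]

tokenized = [set(doc.lower().split()) for doc in docs]
n_docs = len(docs)
terms = sorted(set().union(*tokenized))


def smoothed_idf(term):
    """Assumed default of TfidfVectorizer: ln((1 + N) / (1 + df)) + 1."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1


idf = {term: smoothed_idf(term) for term in terms}
for term in ("nlp", "we", "study"):
    print(term, round(idf[term], 8))
```

A word found in all three sentences gets ln(4/4) + 1 = 1, a word found in two gets ln(4/3) + 1, and a word found in one gets ln(4/2) + 1, which lines up with the three distinct values in the array above.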


**Let's take the first sentence, transform it with "vect", and see the output**

In [15]:
first=t[0]
first

'We will study NLP today'

In [16]:
v=vect.transform([first])

In [17]:
print(v.toarray())

[[0.         0.         0.         0.         0.         0.29803159
  0.         0.         0.50461134 0.         0.50461134 0.
  0.38376993 0.50461134]]


In [18]:
for i in t:
    print(i)

We will study NLP today
NLP stands for Natural Language Processing
We are failing to understand NLP


- **The indexes with the value 0 correspond to words that are not present in the first sentence at all<br>
   for example:<br>
   indexes 0 and 1 are the words {<font color='blue'>'are' and 'failing'</font>}, which are not present in the first sentence, so they have the value 0**<br>
<br>
- **Indexes 8, 10 and 13 are the words {<font color='blue'>study, today and will</font>}, which appear only in the first sentence, so they have the highest value** (which is 0.50461134)
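
To see where these numbers come from, the row for the first sentence can be rebuilt by hand. This sketch assumes the vectorizer's defaults: raw counts as term frequency, the smoothed idf ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The helper and variable names are illustrative only.

```python
import math

# Same three sentences as the list `t` used in this notebook.
docs = [
    "We will study NLP today",
    "NLP stands for Natural Language Processing",
    "We are failing to understand NLP",
]

tokenized = [doc.lower().split() for doc in docs]
n_docs = len(docs)
terms = sorted({tok for tokens in tokenized for tok in tokens})


def smoothed_idf(term):
    """Assumed default of TfidfVectorizer: ln((1 + N) / (1 + df)) + 1."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1


# Step 1: raw tf-idf weight = count of the term in the sentence * idf.
first = tokenized[0]
raw = [first.count(term) * smoothed_idf(term) for term in terms]

# Step 2: divide by the Euclidean (L2) length so the row has length 1.
norm = math.sqrt(sum(w * w for w in raw))
row = [w / norm for w in raw]

for index, weight in enumerate(row):
    if weight > 0:
        print(index, terms[index], round(weight, 8))
```

Because of the normalization step, the weights depend on the whole sentence: 'study', 'today' and 'will' share the largest value because they have the largest idf, while 'nlp' has the smallest nonzero value.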

## <div class='alert alert-danger'><font color='black'> Let's do the same thing in a different way</font></div>

In [19]:
from pandas import DataFrame

In [20]:
t2=['Eshant is studying from GFG','GFG is providing an NLP course']

In [21]:
tf= TfidfVectorizer()

In [22]:
matrix=tf.fit_transform(t2)

In [23]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
DataFrame(matrix.toarray(),columns=tf.get_feature_names_out())

|   | an | course | eshant | from | gfg | is | nlp | providing | studying |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.499221 | 0.499221 | 0.3552 | 0.3552 | 0.0 | 0.0 | 0.499221 |
| 1 | 0.446656 | 0.446656 | 0.0 | 0.0 | 0.3178 | 0.3178 | 0.446656 | 0.446656 | 0.0 |


**Happy Learning**