# <span style = "color:green;font-size:40px"> TF-IDF </span>

***

<span style = "color:coral"> TF-IDF stands for Term Frequency - Inverse Document Frequency </span>

TF-IDF is a technique that scores how important each word is to a document within a collection of documents. It addresses a key weakness of the Bag of Words technique, which is good for text classification and for turning words into numbers a machine can read, but which treats every word as equally important.

### So what is it? Let's understand it using an example

Let's say a machine is trying to understand the meaning of this sentence: "Today is a beautiful day"

What do you focus on here?

This sentence talks about <b>today</b>, and it also tells us that today is a <b>beautiful day</b>. The mood is <b>happy/positive</b>. Anything else?

'Beautiful' is clearly the adjective here. In a BoW (Bag of Words) approach, every word is reduced to a count, with no preference for any word in particular. All words have the same frequency here (1 in this case), so the machine places no emphasis on 'beautiful' or on the positive mood.

The words are simply counted, so as far as importance goes, 'a' is as important as 'day' or 'beautiful'.
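A minimal sketch of that flat counting, assuming a deliberately naive tokenizer (lower-case, split on whitespace) rather than any particular library:

```python
from collections import Counter

sentence = "Today is a beautiful day"

# Naive Bag of Words: lower-case the text, split on spaces, count each token.
bow_counts = Counter(sentence.lower().split())

for word, count in bow_counts.items():
    print(word, count)
```

Every word comes out with the same count, so nothing in these numbers separates 'a' from 'beautiful'.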

But does 'a' really tell you more about the context of a sentence than 'beautiful'?

No, and that's why Bag of Words needed an upgrade.

Another major drawback: say a document has 200 words, out of which 'a' appears 20 times, 'the' appears 15 times, and so on.

Words that are repeated again and again get more weight in the final features, and we miss the context carried by less frequent words like 'rain', 'beauty', 'subway', or names.

So a machine can easily miss what the writer meant. That is the problem TF-IDF solves, and now we know why we use it.

### Let's see how it works

TF-IDF addresses the major drawbacks of Bag of Words by introducing an important concept called inverse document frequency.

TF-IDF is a score the machine computes by measuring how often a word is used in one document and comparing that with how widely the word is used across all documents. In other words, it's a score that highlights each word's relevance to a document within the whole collection. It is built from two parts:

<center><b><i>IDF = Log[(Number of Documents)/(Number of Documents containing the word)] </i></b></center>

<center><b><i> TF = (Number of repetitions of word in a document)/(Number of words in a document) </i></b></center>

Okay, for now let's just say that TF answers questions like: how often is 'beauty' used in this document, as a proportion of all its words? IDF answers questions like: how distinctive is the word 'beauty' across the entire list of documents, or is it a common theme in all of them?

So using TF and IDF together, the machine can tell which words are important within a document and which words are common throughout all documents.
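The two formulas can be sketched as small Python functions. The function names and the toy corpus are made up for illustration, the natural logarithm is an assumption (the formula above does not name a base), and the IDF function assumes the word appears in at least one document:

```python
import math

def term_frequency(word, document_tokens):
    """TF: occurrences of the word divided by the number of words in the document."""
    return document_tokens.count(word) / len(document_tokens)

def inverse_document_frequency(word, corpus_tokens):
    """IDF: log of (number of documents / number of documents containing the word)."""
    containing = sum(1 for tokens in corpus_tokens if word in tokens)
    return math.log(len(corpus_tokens) / containing)

# Hypothetical toy corpus, already lower-cased and tokenized.
corpus = [
    ["today", "is", "a", "beautiful", "day"],
    ["a", "day", "at", "work"],
    ["a", "long", "walk"],
]

print(term_frequency("beautiful", corpus[0]))
print(inverse_document_frequency("beautiful", corpus))
print(inverse_document_frequency("a", corpus))
```

In this toy corpus 'a' appears in every document, so its IDF is log(3/3) = 0, while 'beautiful' appears in only one and gets the largest IDF.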

### What's the way of finding TF-IDF of a document?

The process for finding TF-IDF is very similar to Bag of Words:

1. Clean and preprocess the data: standardise it, normalise it (all lower case), and lemmatize it (reduce all words to root words)
2. Tokenize the words and count their frequency
3. Find TF for words
4. Find IDF for words
5. Vectorize the vocab
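As a rough sketch of the first step, the helper below lower-cases a sentence and pulls out word tokens with a regular expression. The function name `preprocess` and the pattern are assumptions for illustration; lemmatization is left out because it needs an external library such as NLTK or spaCy.

```python
import re

def preprocess(text):
    """Lower-case the text and return its word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(preprocess("It is going to RAIN today!"))
```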

## <span style = "color:coral"> Let's cover an example of 3 documents</span>

<b>Document 1:</b> It is going to rain today

<b>Document 2:</b> Today I am not going outside

<b>Document 3:</b> I am going to watch the season premiere

To find TF-IDF we need to perform the steps we laid out above. Let's get to it.

### Step 1: Clean data and Tokenize

In [None]:
document1 = "It is going to rain today"
document2 = "Today i am not going outside"
document3 = "I am going to watch the season premiere"

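| Word | Count |
| --- | --- |
| going | 3 |
| to | 2 |
| today | 2 |
| i | 2 |
| am | 2 |
| it | 1 |
| is | 1 |
| rain | 1 |

The remaining words ('not', 'outside', 'watch', 'the', 'season', 'premiere') each appear once and are left out to keep the tables short.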
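These counts can be reproduced with a few lines of Python, assuming a simple lower-case-and-split tokenizer:

```python
from collections import Counter

documents = [
    "It is going to rain today",
    "Today i am not going outside",
    "I am going to watch the season premiere",
]

# Normalise to lower case and split on whitespace.
tokenized = [text.lower().split() for text in documents]

# Count every token across the whole corpus.
corpus_counts = Counter(token for tokens in tokenized for token in tokens)

for word, count in corpus_counts.most_common():
    print(f"{word:10s} {count}")
```

The full count covers all 14 distinct words, including the ones that appear only once.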

### Step 2: Find TF

#####  Document 1 - It is going to rain today

Find its TF = (Number of repetitions of word in a document)/(Number of words in a document)

| Words/Documents | Document 1 |
| --- | --- | 
| going | 0.16 |
| to | 0.16 |
| today | 0.16 | 
| i | 0 |
| am | 0 |
| it | 0.16 |
| is | 0.16 |
| rain | 0.16 |

Continue for the rest of the documents (1/6 and 1/8 are shown truncated to two decimals):

| Words/Documents | Document 1 | Document 2 | Document 3 |
| --- | --- | --- | --- |
| going | 0.16 | 0.16 | 0.12 |
| to | 0.16 | 0 | 0.12 |
| today | 0.16 | 0.16 | 0 |
| i | 0 | 0.16 | 0.12 |
| am | 0 | 0.16 | 0.12 |
| it | 0.16 | 0 | 0 |
| is | 0.16 | 0 | 0 |
| rain | 0.16 | 0 | 0 |
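The TF table can be recomputed as a sketch like the one below, assuming whitespace tokenization and the same eight words. It prints three decimals, so 1/6 shows as 0.167 and 1/8 as 0.125.

```python
documents = {
    "Document 1": "It is going to rain today",
    "Document 2": "Today i am not going outside",
    "Document 3": "I am going to watch the season premiere",
}
vocab = ["going", "to", "today", "i", "am", "it", "is", "rain"]

tokenized = {name: text.lower().split() for name, text in documents.items()}

# TF = occurrences of the word in the document / number of words in the document.
tf = {
    name: {word: tokens.count(word) / len(tokens) for word in vocab}
    for name, tokens in tokenized.items()
}

for word in vocab:
    row = "  ".join(f"{tf[name][word]:.3f}" for name in documents)
    print(f"{word:6s} {row}")
```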

###  Step 3: Find IDF

Find the IDF of each word in the vocabulary. (A real pipeline would often remove stop words first; this small example keeps every word.)

IDF = log[(Number of documents) / (Number of documents containing the word)], where the numbers below use the natural logarithm.

| Words | IDF Value | Approx. value |
| --- | --- | --- |
| going | log(3/3) | 0 |
| to | log(3/2) | 0.41 |
| today | log(3/2) | 0.41 |
| i | log(3/2) | 0.41 |
| am | log(3/2) | 0.41 |
| it | log(3/1) | 1.10 |
| is | log(3/1) | 1.10 |
| rain | log(3/1) | 1.10 |
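A short sketch of the IDF calculation, assuming the natural logarithm (`math.log`) and whitespace tokenization:

```python
import math

documents = [
    "It is going to rain today",
    "Today i am not going outside",
    "I am going to watch the season premiere",
]
vocab = ["going", "to", "today", "i", "am", "it", "is", "rain"]

# One set of tokens per document is enough: IDF only asks whether a word occurs.
token_sets = [set(text.lower().split()) for text in documents]
n_documents = len(token_sets)

document_frequency = {
    word: sum(word in tokens for tokens in token_sets) for word in vocab
}
idf = {word: math.log(n_documents / document_frequency[word]) for word in vocab}

for word in vocab:
    print(f"{word:6s} log({n_documents}/{document_frequency[word]}) = {idf[word]:.2f}")
```

A word found in every document, such as 'going', gets an IDF of exactly 0.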

### Step 4: Build the model, i.e. stack IDF and TF next to each other

| Words | IDF Value | TF (Document 1) | TF (Document 2) | TF (Document 3) |
| --- | --- | --- | --- | --- |
| going | log(3/3) | 0.16 | 0.16 | 0.12 |
| to | log(3/2) | 0.16 | 0 | 0.12 |
| today | log(3/2) | 0.16 | 0.16 | 0 |
| i | log(3/2) | 0 | 0.16 | 0.12 |
| am | log(3/2) | 0 | 0.16 | 0.12 |
| it | log(3/1) | 0.16 | 0 | 0 |
| is | log(3/1) | 0.16 | 0 | 0 |
| rain | log(3/1) | 0.16 | 0 | 0 |

### Step 5: Compare results and use table to ask questions

Multiply each word's TF by its IDF to get its TF-IDF score in each document. For example, 'rain' in Document 1 scores (1/6) × log(3/1) ≈ 0.18.

| Documents/Words | going | to | today | i | am | it | is | rain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Document 1 | 0 | 0.07 | 0.07 | 0 | 0 | 0.18 | 0.18 | 0.18 |
| Document 2 | 0 | 0 | 0.07 | 0.07 | 0.07 | 0 | 0 | 0 |
| Document 3 | 0 | 0.05 | 0 | 0.05 | 0.05 | 0 | 0 | 0 |
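The whole calculation can be sketched end to end as below, assuming the natural logarithm and whitespace tokenization. It works with unrounded TF values and rounds only when printing, so hand-truncated figures may differ in the last digit.

```python
import math

documents = [
    "It is going to rain today",
    "Today i am not going outside",
    "I am going to watch the season premiere",
]
vocab = ["going", "to", "today", "i", "am", "it", "is", "rain"]

tokenized = [text.lower().split() for text in documents]
n_documents = len(tokenized)

# IDF per word: log(number of documents / documents containing the word).
idf = {
    word: math.log(n_documents / sum(word in tokens for tokens in tokenized))
    for word in vocab
}

# TF-IDF per document: (count / document length) * IDF.
tfidf = [
    {word: tokens.count(word) / len(tokens) * idf[word] for word in vocab}
    for tokens in tokenized
]

for number, scores in enumerate(tfidf, start=1):
    print(f"Document {number}", {word: round(score, 2) for word, score in scores.items()})
```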

Using this table you can easily see that words like 'it', 'is', and 'rain' are important for Document 1 but not for Documents 2 and 3, which means Document 1 differs from Documents 2 and 3 when it comes to talking about rain.

You can also say that Documents 1 and 2 both talk about something 'today', and Documents 2 and 3 both discuss something about the writer because of the word 'I'.

This table helps us find similarities and differences between documents and words much better than BoW does. Note that 'going', which appears in every document, scores 0 everywhere: a word shared by all documents does nothing to tell them apart.
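One common way to turn such a table into a similarity score is cosine similarity between the document rows. The sketch below is an illustration rather than part of the worked example: it assumes the natural logarithm, whitespace tokenization, and the full 14-word vocabulary, and the helper `cosine_similarity` is written by hand here and assumes neither vector is all zeros.

```python
import math

documents = [
    "It is going to rain today",
    "Today i am not going outside",
    "I am going to watch the season premiere",
]
tokenized = [text.lower().split() for text in documents]
vocab = sorted({token for tokens in tokenized for token in tokens})
n_documents = len(tokenized)

idf = {
    word: math.log(n_documents / sum(word in tokens for tokens in tokenized))
    for word in vocab
}
# One TF-IDF vector per document, with one entry per vocabulary word.
vectors = [
    [tokens.count(word) / len(tokens) * idf[word] for word in vocab]
    for tokens in tokenized
]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

for i in range(n_documents):
    for j in range(i + 1, n_documents):
        score = cosine_similarity(vectors[i], vectors[j])
        print(f"Document {i + 1} vs Document {j + 1}: {score:.3f}")
```

A score closer to 1 means two documents put their weight on the same words; a score of 0 means they share no weighted words at all.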

## Code TF-IDF with Python

### Step 1: Declaring all documents and collecting them in a list

In [1]:
document1 = "It is going to rain today"
document2 = "Today i am not going outside"
document3 = "I am going to watch the season premiere"
doc = [document1, document2, document3]
print(doc)

['It is going to rain today', 'Today i am not going outside', 'I am going to watch the season premiere']


### Step 2: Initializing TFIDF Vectorizer

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
vectorizer = TfidfVectorizer()

In [10]:
X = vectorizer.fit_transform(doc)

### Step 3: Getting feature names of final words that we will use to tag documents

In [11]:
analyze = vectorizer.build_analyzer()

print('Document 1', analyze(document1))
print('Document 2', analyze(document2))
print('Document 3', analyze(document3))
print('Document Transform', X.toarray())

Document 1 ['it', 'is', 'going', 'to', 'rain', 'today']
Document 2 ['today', 'am', 'not', 'going', 'outside']
Document 3 ['am', 'going', 'to', 'watch', 'the', 'season', 'premiere']
Document Transform [[0.         0.27824521 0.4711101  0.4711101  0.         0.
  0.         0.4711101  0.         0.         0.35829137 0.35829137
  0.        ]
 [0.40619178 0.31544415 0.         0.         0.53409337 0.53409337
  0.         0.         0.         0.         0.         0.40619178
  0.        ]
 [0.32412354 0.25171084 0.         0.         0.         0.
  0.4261835  0.         0.4261835  0.4261835  0.32412354 0.
  0.4261835 ]]


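See how each sentence is broken into words and each word is represented as a number for the machine. Two details differ from the hand-worked example: with its default settings, TfidfVectorizer drops single-character tokens such as 'i', and it uses a smoothed IDF and scales each row to unit length, so these numbers do not match the hand-computed table.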
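To see where these numbers come from, here is a hand-written sketch of what TfidfVectorizer does with its default settings, as I understand them: tokens of two or more characters, raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The helper names are made up for illustration.

```python
import math
import re

documents = [
    "It is going to rain today",
    "Today i am not going outside",
    "I am going to watch the season premiere",
]

# Lower-case, then keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", text.lower()) for text in documents]
vocab = sorted({token for tokens in tokenized for token in tokens})
n_documents = len(tokenized)

def smoothed_idf(word):
    """ln((1 + n) / (1 + df)) + 1, so a word in every document still scores 1."""
    df = sum(word in tokens for tokens in tokenized)
    return math.log((1 + n_documents) / (1 + df)) + 1

def tfidf_row(tokens):
    """Raw count times smoothed IDF, scaled so the row has unit length."""
    raw = [tokens.count(word) * smoothed_idf(word) for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]

print(vocab)
for row in rows:
    print([round(value, 4) for value in row])
```

Because of the added 1, a word like 'going' that appears in every document keeps a small non-zero weight here instead of dropping to 0.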

##### To get the feature names:

In [12]:
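# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())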

['am', 'going', 'is', 'it', 'not', 'outside', 'premiere', 'rain', 'season', 'the', 'to', 'today', 'watch']


The output is the vocabulary the vectorizer learned from the 3 sentences, listed in the same order as the columns of the matrix above. With every document turned into a row of scores over these words, you can now ask the machine questions of whatever nature you like, stuff like:

* What are similar documents?
* When will it rain?
* I am done, what to read next?

Because the machine has a score to help with these questions, TF-IDF is also a great tool for training a machine to answer back, as in chatbots.
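As a sketch of the first question, the documents can be compared pairwise with scikit-learn's `cosine_similarity` applied to the TF-IDF matrix. This is one reasonable way to do it, not the only one, and it assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc = [
    "It is going to rain today",
    "Today i am not going outside",
    "I am going to watch the season premiere",
]

# Fit the vectorizer and build the TF-IDF matrix (one row per document).
X = TfidfVectorizer().fit_transform(doc)

# Pairwise cosine similarity: entry [i, j] compares document i with document j.
similarity = cosine_similarity(X)
print(similarity.round(2))
```

Each entry lies between 0 and 1, the diagonal is 1 because every document matches itself, and the largest off-diagonal entry in a row points at that document's closest neighbour.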

***