# Projects - TF-IDF

### **Given a large collection of documents, TF-IDF helps us figure out what each document is about**

## Imports

In [None]:
import nltk
import math

In [None]:
%pwd

## Load Data

The dataset contains 10 text files. We hold them in a dictionary where each `key` is the name of a text file and each value is the text of that file.

In [None]:
# tfidf_1.txt ... tfidf_10.txt
file_names = ["tfidf_{}.txt".format(i) for i in range(1, 11)]

dataset = {}
for file_name in file_names:
    # `with` closes each file once it has been read
    with open(file_name) as f:
        dataset[file_name] = f.read()

In [None]:
# keys, i.e. the text file names
dataset.keys()

In [None]:
# Let's look at the first 200 characters of the first document;
# index the dictionary by file name to get the text
dataset['tfidf_1.txt'][:200]

The first part of `Term Frequency - Inverse Document Frequency`, or `TF-IDF`, is `Term Frequency` (`TF`): the number of times (`frequency`) a word (`term`) appears in a given document. This is very similar to the frequency distribution method that we have used before.

So we can write a function that takes one text file from the dataset and returns the frequency distribution of its words.
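As a rough illustration of what a term-frequency count looks like, the sketch below uses `collections.Counter` and `str.split` on a made-up sentence. Both are stand-ins chosen so the snippet runs without NLTK data; the notebook itself uses `nltk.word_tokenize` and `nltk.FreqDist`, which treat punctuation differently.

```python
from collections import Counter

# Made-up sample text (not taken from the dataset)
sample_text = "the war ended and the peace began after the war"

# Naive whitespace tokenization; nltk.word_tokenize would also split punctuation
tokens = sample_text.split()

# Count how many times each token appears
term_counts = Counter(tokens)

print(term_counts["the"])          # 3
print(term_counts["war"])          # 2
print(term_counts.most_common(2))  # the two most frequent tokens
```

Like `FreqDist`, a `Counter` returns 0 for a term it has never seen, so looking up a missing word does not raise an error.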

## Define Functions

## Step 1. Calculate term frequency (`TF`), i.e. the frequency of each term/word in a document

In [None]:
# Calculate term frequencies
def tf(dataset, file_name):
    
    text = dataset[file_name] 
    # select the text of specific text file
    
    tokens = nltk.word_tokenize(text) 
    # tokenize the text file
    
    fd = nltk.FreqDist(tokens) 
    # make freq distribution of the tokens
    # i.e. count no. of times we saw each word
    
    return fd

In [None]:
# Freq distribution of the first text file
tf(dataset, 'tfidf_1.txt')

The next part is `IDF`, or `Inverse Document Frequency`. It is based on the ***number of documents that contain a specific word***: we divide the total number of documents by that number and take the logarithm of the result.

Suppose the word **war** is specific to the first text file. It appears in only 1 out of 10 documents, i.e. 1/10. **Inverting** that gives 10 over 1 (10/1), and we take the logarithm of 10/1.

Example for war: `10/1 = 10, log10(10) = 1` (a word present in only 1 of the 10 documents). With the natural logarithm that `math.log` uses, the value is `ln(10) ≈ 2.30`.

Example: `10/10 = 1, log(1) = 0` (a word present in all 10 documents), whatever the base of the logarithm.
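The two cases above can be checked directly. This is only a small sketch of the arithmetic; note that `math.log` is the natural logarithm, so the rare-word value comes out near 2.30, while `math.log10` gives exactly 1.

```python
import math

total_docs = 10

# Word found in only 1 of the 10 documents: rare, so the IDF is high
idf_rare = math.log(total_docs / 1)           # natural log
idf_rare_base10 = math.log10(total_docs / 1)  # base-10 log

# Word found in all 10 documents: common, so the IDF is zero
idf_common = math.log(total_docs / 10)

print(round(idf_rare, 3))  # 2.303
print(idf_rare_base10)     # 1.0
print(idf_common)          # 0.0
```

The choice of base only rescales every score by the same constant, so it does not change which words rank highest.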


Let's write a function for that.

## Step 2. Find the inverse document frequency (`IDF`) of a term across all documents

In [None]:
# Calculate inverse document frequency
import math

def idf(dataset, term):
    # term: the specific word we are looking for

    # Go through the text files one by one. `term in dataset[file_name]`
    # is True if the term occurs in that file's text (a substring match),
    # so `count` is a list of True/False values, one per file.
    count = [term in dataset[file_name] for file_name in dataset]

    # Log of the total number of text files divided by
    # the number of text files that contain the term
    inv_df = math.log(len(count)/sum(count))

    return inv_df

In [None]:
term='War'
count = [term in dataset[file_name] 
             for file_name in dataset]

count
# list of booleans: which text files contain
# the term `War`

In [None]:
# raises ZeroDivisionError if the word is not in any text file
idf(dataset, 'War')
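One thing to keep in mind: `term in dataset[file_name]` is a substring test on the raw text, so `'war'` also matches inside `'toward'` or `'warm'`. The sketch below shows one possible variant that compares whole tokens and returns 0.0 instead of raising `ZeroDivisionError` when no document contains the term. The name `idf_tokens`, the whitespace tokenizer, the zero-return convention, and the toy documents are all assumptions made for illustration, not part of the notebook.

```python
import math

def idf_tokens(dataset, term):
    """IDF using whole-token matching (hypothetical variant of idf)."""
    # True/False per document: does the term appear as a whole token?
    # str.split is a simple stand-in for nltk.word_tokenize.
    contains = [term in set(text.split()) for text in dataset.values()]
    doc_count = sum(contains)
    if doc_count == 0:
        # Assumed convention: a term found nowhere gets an IDF of 0.0
        return 0.0
    return math.log(len(contains) / doc_count)

# Toy documents, made up for this example
toy = {
    "a.txt": "the war was long",
    "b.txt": "they walked toward the sea",
    "c.txt": "a warm day",
}

print("war" in toy["b.txt"])             # True: substring match inside "toward"
print(round(idf_tokens(toy, "war"), 3))  # 1.099: only a.txt has the token
print(idf_tokens(toy, "peace"))          # 0.0: term not found anywhere
```

Token matching costs more than a substring test because every document is split on each call; for a larger corpus one would probably tokenize each document once and reuse the result.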

## Step 3. Multiply `TF` and `IDF` values to get `TF-IDF` score

For the word `war`, `TF` tells us the number of times it appears in the 1st text file.

The inverse document frequency (`IDF`) is the logarithm of the total number of text files divided by the number of text files that contain the word `war`.

Multiply the `TF` and `IDF` of the word `war` to get its score. This way we can get a score for every word in every document.
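To make the multiplication concrete, here is a sketch with made-up numbers: suppose `war` appears 25 times in the first file and is found in 2 of the 10 files. These counts are hypothetical, chosen only to show the arithmetic.

```python
import math

tf_val = 25          # hypothetical: times "war" appears in the file
docs_with_term = 2   # hypothetical: number of files that contain "war"
total_docs = 10

idf_val = math.log(total_docs / docs_with_term)  # natural log of 5
score = round(tf_val * idf_val, 2)

print(score)  # 40.24
```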

## Step 4. Look for words with the highest score

Previously we filtered out `stop-words` using NLTK, but `TF-IDF` takes care of them automatically. The `TF` (frequency) of `stop-words` will be very high in every text file, but we can expect them to be present in all 10 text files, so their `IDF` is `log(10/10) = log(1) = 0`. A high `TF` multiplied by a zero (or near-zero) `IDF` gives a low `TF-IDF` value. So we do NOT need to worry about stop-words: they end up at the bottom of the ranking and are not picked as key words.
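A quick way to see this effect is a toy corpus in which `the` appears in every document. The sketch below uses whitespace tokens and made-up sentences; `toy_tfidf` is a hypothetical helper written for this illustration, not the notebook's `tfidf`.

```python
import math
from collections import Counter

# Toy corpus, made up for this example
toy = {
    "doc1": "the army marched to the front",
    "doc2": "the garden was quiet",
    "doc3": "the ship left the harbor",
}

def toy_tfidf(corpus, name):
    """TF-IDF per token for one document, using whitespace tokens."""
    tokenized = {k: v.split() for k, v in corpus.items()}
    counts = Counter(tokenized[name])
    scores = {}
    for term, tf_val in counts.items():
        # Number of documents whose token list contains the term
        df = sum(term in tokens for tokens in tokenized.values())
        scores[term] = round(tf_val * math.log(len(corpus) / df), 2)
    return scores

scores = toy_tfidf(toy, "doc1")
print(scores["the"])   # 0.0 even though "the" is the most frequent token
print(scores["army"])  # 1.1
```

Here `the` has the highest term frequency in `doc1` but scores zero because it occurs in all three documents, while `army` occurs once and only in `doc1`, so it ranks above it.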

In [None]:
def tfidf(dataset, file_name, n):
    # file_name: which text file in the dataset to score
    # n: how many key words we want back

    term_scores = {}  # empty dict: term -> score

    # term frequencies of all words in the chosen text file
    file_fd = tf(dataset, file_name)

    # for every term/word in the text file
    for term in file_fd:
        if term.isalpha():  # keep alphabetic tokens only
            # get the IDF value using the idf function
            idf_val = idf(dataset, term)
            # term frequency of the word (reuse file_fd, no need to recompute)
            tf_val = file_fd[term]
            # multiply TF and IDF to get the score of the term
            score = tf_val*idf_val
            # the term is the key, the rounded score is the value
            term_scores[term] = round(score, 2)

    # sort all scored terms by descending score and return the top n
    return sorted(term_scores.items(),
                  key=lambda x: -x[1])[:n]

In [None]:
file_fd = tf(dataset, 'tfidf_1.txt')
file_fd

In [None]:
for term in file_fd:
    if term.isalpha():
        print(term)
        #continue
        break

In [None]:
no = 0
for term in file_fd: 
    if term.isalpha():
        print(term)
        no+=1
        if no==6:
            break

In [None]:
co = 0
for term in file_fd:
    if term.isalpha():
        print(file_fd[term])
        co += 1
        if co==6:
            break

In [None]:
# Test
tfidf(dataset,"tfidf_1.txt",5)

Now we can look at the words that are specific to the 1st text file and get an idea of what the document is about. As we already know that this document is a story about World War II, we can see that the words make sense.

In [None]:
## Complete code
import math
import nltk

# Calculate term frequencies
def tf(dataset, file_name):
    
    text = dataset[file_name] 
    # select the specific text file
    tokens = nltk.word_tokenize(text) 
    # tokenize the text file
    fd = nltk.FreqDist(tokens) 
    # make freq distribution of the tokens
    # i.e. count of how many times we saw each word
    return fd

# Calculate inverse document frequency
def idf(dataset, term):
    count = [term in dataset[file_name] for file_name in dataset]
    inv_df = math.log(len(count)/sum(count))
    return inv_df

def tfidf(dataset, file_name, n):
    term_scores = {}
    file_fd = tf(dataset,file_name)
    for term in file_fd:
        if term.isalpha():
            idf_val = idf(dataset,term)
            tf_val = file_fd[term]
            score = tf_val*idf_val
            term_scores[term] = round(score,2)
    return sorted(term_scores.items(), key=lambda x:-x[1])[:n]

## Run Code

Let's run a for loop over every document.

In [None]:
for file_name in dataset:
    print("{0}:\n {1}\n".format(file_name,
                    tfidf(dataset,file_name,4)))
# This shows which words are key, i.e. most important, for each document