# TF-IDF

## What is TF-IDF?

TF-IDF stands for "Term Frequency, Inverse Document Frequency." It scores how important a word (or "term") is to a document by combining how often the word appears in that document with how rare it is across a collection of documents.

* If a word appears frequently in a document, it's important. Give the word a high score.
* But if a word appears in many documents, it's not a unique identifier. Give the word a low score.

Therefore, common words like "the" and "for," which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

## Term Frequency

Term frequency measures how often a word occurs in a document. The raw count depends heavily on the length of the document and on how general the word is: a very common word such as “was” can appear many times, and its count grows with the length of the document. To normalise the value, we divide the count by the total number of words in the document.

At one extreme, if the term does not occur in the document, the TF value is zero; at the other, if every word in the document is that term, it is one. The normalised TF value therefore lies in the range [0, 1], endpoints included.

                  tf(t,d)  =  (count of term t in document d) / (total number of words in document d)

In our lecture notes, term frequency is instead calculated on a log scale from the raw count:
$$ tf(t,d)  = \log{(1 + \text{raw counts that term t appears in document d})}$$
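
To make the two definitions concrete, here is a minimal sketch on a toy sentence. The helper names `tf_normalised` and `tf_log` and the example tokens are illustrative assumptions, not part of the lecture notes. The formula above does not state a log base, so the sketch assumes the natural logarithm.

```python
import math
from collections import Counter

def tf_normalised(term, tokens):
    """Raw count divided by document length; always lies in [0, 1]."""
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0

def tf_log(term, tokens):
    """Log-scaled raw count, log(1 + count), using the natural log (an assumption)."""
    return math.log(1 + Counter(tokens)[term])

tokens = "the pig built the house".split()

print(tf_normalised("the", tokens))   # 2 occurrences out of 5 tokens -> 0.4
print(tf_log("the", tokens))          # log(1 + 2) = log(3)
print(tf_normalised("wolf", tokens))  # absent term -> 0.0
print(tf_log("wolf", tokens))         # log(1 + 0) -> 0.0
```

Both variants give zero for an absent term; they differ in how quickly the score grows with repeated occurrences.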

## Document Frequency
Document frequency measures how widespread a term is across the whole corpus. It is very similar to TF; the only difference is that TF counts the occurrences of a term t within a single document d, whereas DF is the number of documents in the document set D that contain term t. A document counts once if the term appears in it at least once; we do not need to know how many times the term is present.

                     df(t) = no. of documents in D containing term t

## Inverse Document Frequency
IDF is the inverse of the document frequency and measures the informativeness of term t. It is very low for the most frequently occurring words, such as stop words, which gives us what we want: a relative weighting. In the formulas below, $D$ denotes the total number of documents in the corpus.

$$idf(t) = \frac{D}{df(t)}$$

This raw ratio has a few problems. For a large corpus, say 10,000 documents, the IDF value of a rare term explodes, so we take the logarithm to dampen the effect. In addition, a term may appear in no document at all, in which case df(t) is 0 and we cannot divide by 0. To smooth this out, we generally add 1 to the denominator.

$$idf(t) = \log\frac{D}{df(t) + 1}$$
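
The following sketch applies the smoothed formula to a tiny hand-made corpus. The documents, the helper names `df` and `idf`, and the use of the natural logarithm are assumptions made for illustration only.

```python
import math

# each document is represented by the set of terms it contains
docs = [
    {"three", "little", "pigs"},
    {"the", "pigs", "built", "houses"},
    {"the", "wolf", "blew"},
]

def df(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(term in doc for doc in docs)

def idf(term, docs):
    """Smoothed idf: log(D / (df + 1)), natural log."""
    return math.log(len(docs) / (df(term, docs) + 1))

for term in ("pigs", "wolf", "straw"):
    # "straw" occurs in no document; the +1 avoids a division by zero
    print(term, df(term, docs), idf(term, docs))
```

Note that with this smoothing a term occurring in every document gets $\log\frac{D}{D+1}$, which is slightly negative.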

## TF-IDF
Finally, multiplying TF and IDF gives the TF-IDF score. There are many variations of TF-IDF, but for now let us concentrate on this basic version.

$$ \operatorname{TF-IDF}(t, d) = tf(t, d) * idf(t) $$
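
Putting the pieces together, the sketch below scores a few terms in a toy corpus using the formulas from this section, $tf = \log(1 + \text{count})$ and $idf = \log\frac{D}{df + 1}$. The corpus, the function name `tf_idf`, and the natural-log base are illustrative assumptions.

```python
import math

corpus = [
    "the pig built a straw hut".split(),
    "the pig built a brick house".split(),
    "the wolf blew the straw hut down".split(),
]

def tf_idf(term, doc, corpus):
    """TF-IDF with tf = log(1 + count) and idf = log(D / (df + 1))."""
    tf = math.log(1 + doc.count(term))
    df = sum(term in d for d in corpus)      # documents containing the term
    idf = math.log(len(corpus) / (df + 1))
    return tf * idf

# score three terms with respect to the second document
for term in ("brick", "straw", "the"):
    print(term, round(tf_idf(term, corpus[1], corpus), 4))
```

Here "brick" gets a positive score because it occurs only in the second document, "straw" scores zero because it does not occur in that document, and "the" comes out negative because it occurs in all three documents.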

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import re # Regular expression operations 
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import math
import operator
import statistics
from string import punctuation
stop_words = set(stopwords.words('english') + list(punctuation))
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [None]:
def get_text_from_file(fname):
    """
    Read a text file and return its contents as a single string.
    """
    # the context manager closes the file even if reading fails
    with open(fname, 'r') as f:
        text = f.read()
    return text

### Regular Expression Syntax
A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
https://docs.python.org/3/library/re.html

```
The \s metacharacter matches any single whitespace character.

Whitespace characters can be:

A space character
A tab character
A carriage return character
A new line character
A vertical tab character
A form feed character

```
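
A brief illustration of `\s` with Python's `re` module; the sample strings are made up for this sketch.

```python
import re

# split on runs of whitespace (space, tab, newline)
parts = re.split(r"\s+", "a \t b\nc")
print(parts)      # ['a', 'b', 'c']

# collapse every run of whitespace into a single space
collapsed = re.sub(r"\s+", " ", "Once upon\ta\n\ntime")
print(collapsed)  # Once upon a time

# [^\w\s] matches anything that is neither a word character nor whitespace
no_punct = re.sub(r"[^\w\s]", "", "It's too fragile!")
print(no_punct)   # Its too fragile
```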

In [None]:
pip install pysnooper

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pysnooper
  Downloading PySnooper-1.1.1-py2.py3-none-any.whl (14 kB)
Installing collected packages: pysnooper
Successfully installed pysnooper-1.1.1


In [None]:
import pysnooper
# @pysnooper.snoop()
def remove_string_special_characters(s):
    """
    This function removes special characters from within a string.
    parameters: 
        s(str): single input string.
    return: 
        stripped(str): A string with special characters removed.
    """
    # Remove every character that is NOT (^) a word character
    # (\w: letters, digits, underscore) or whitespace (\s).
    # Hyphens and other punctuation are removed as well.
    stripped = re.sub(r'[^\w\s]', '', s) 
    # \w also matches the underscore, so remove it separately
    stripped = re.sub(r'_', '', stripped)
    # Change any run of whitespace to one space
    stripped = re.sub(r'\s+', ' ', stripped)
    # Remove start and end white spaces
    stripped = stripped.strip()
    return stripped

In [None]:
def count_words(text):
    """This function returns the 
    total number of words in the input text.
    """
    words = word_tokenize(text)
    return len(words)

In [None]:
def get_doc(text_sents_clean):
    """
    Treating each cleaned sentence as a document, this function
    records the total word count of each one.
    """
    doc_info = []
    for i, sent in enumerate(text_sents_clean):
        count = count_words(sent)
        temp = {'doc_id' : i, 'doc_length' : count}
        doc_info.append(temp)
    return doc_info

In [None]:
def create_freq_dict(sents):
    """
    This function creates a frequency dictionary
    of each document that contains words other than
    stop words.
    """
    freqDict_list = []
    # each sentence is considered as a document. 
    for i, sent in enumerate(sents):
        freq_dict = {}
        for word in word_tokenize(sent):
            word = word.lower()
            # A stemmer reduces inflected forms to a common stem, which need not
            # be a dictionary word (the Porter stemmer maps "little" to "littl").
            word = ps.stem(word)
            # Note: the stop-word check runs on the stem, so a stop word whose
            # stem differs from it (e.g. "once" -> "onc") is not filtered out.
            if word not in stop_words:
                freq_dict[word] = freq_dict.get(word, 0) + 1
        # Build the entry once per document, even when every word was a stop
        # word; otherwise the previous document's entry would be appended again.
        freqDict_list.append({'doc_id' : i, 'freq_dict': freq_dict})
    return freqDict_list


In [None]:
def global_frequency(text_sents_clean):
    """
    This function returns a dictionary with the frequency 
    count of every word in the text
    """
    freq_table = {}
    text = ' '.join(text_sents_clean) #join the cleaned sentences to get the text 
    words = word_tokenize(text)
    for word in words:
        word = word.lower()
        word = ps.stem(word)
        if word not in stop_words:
            if word in freq_table:
                freq_table[word] += 1
            else:
                freq_table[word] = 1
    return freq_table

In [None]:
def get_keywords(text_sents_clean):
    """
    This function gets the top 5 most
    frequently occurring words in the whole text
    and stores them as keywords
    """
    freq_table = global_frequency(text_sents_clean)
    #sort in descending order
    freq_table_sorted = sorted(freq_table.items(), key = operator.itemgetter(1), reverse = True) 
    # slicing avoids an IndexError when the text has fewer than 5 distinct words
    keywords = [word for word, _ in freq_table_sorted[:5]]
    return keywords

In [None]:
def computeTF(doc_info, freqDict_list):
    """
    tf = log10(1 + raw count of the term in the doc)
    """
    TF_scores = []
    
    for tempDict in freqDict_list:
        doc_id = tempDict['doc_id']
        for k in tempDict['freq_dict']:
            temp = {'doc_id' : doc_id,
                    'TF_score' : math.log(tempDict['freq_dict'][k] + 1, 10), 
                   'key' : k}
            # if we use frequency, 'TF_score' : tempDict['freq_dict'][k]/doc_info[doc_id]['doc_length'],

            TF_scores.append(temp)
    return TF_scores

In [None]:
def computeIDF(doc_info, freqDict_list):
    """
    idf = log10(total number of docs/(1 + number of docs with term in it))
    """
    IDF_scores = []
    for counter, doc in enumerate(freqDict_list):
        for k in doc['freq_dict'].keys():
            # number of documents whose frequency dictionary contains term k
            count = sum([k in tempDict['freq_dict'] for tempDict in freqDict_list])
            temp = {'doc_id' : counter, 'IDF_score' : math.log(len(doc_info)/(count+1), 10), 'key' : k}
    
            IDF_scores.append(temp)
       
    return IDF_scores

Before computing the word frequency, we shall clean the data by removing punctuation 
and special characters.


In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
cd gdrive/MyDrive/STAT4609_2021/'week8_NLP Basic NLP'

[Errno 2] No such file or directory: 'gdrive/MyDrive/STAT4609_2021/week8_NLP Basic NLP'
/content


In [None]:
nltk.download('punkt')
text = get_text_from_file('3lpigs.txt')
text

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


'                  THE THREE LITTLE PIGS\n\n   Once upon a time . . . there were three little pigs, who left their mummy\nand daddy to see the world.\n   All summer long, they roamed through the woods and over the plains,playing\ngames and having fun. None were happier than the three little pigs, and they\neasily made friends with everyone. Wherever they went, they were given a warm \nwelcome, but as summer drew to a close, they realized that folk were drifting\nback to their usual jobs, and preparing for winter. Autumn came and it began\nto rain. The three little pigs started to feel they needed a real home. Sadly\nthey knew that the fun was over now and they must set to work like the others,\nor they\'d be left in the cold and rain, with no roof over their heads. They\ntalked about what to do, but each decided for himself. The laziest little pig\nsaid he\'d build a straw hut.\n   "It wlll only take a day,\' he said. The others disagreed.\n   "It\'s too fragile," they said disapprovin

In [None]:
#Return a sentence-tokenized copy of *text*, 
# using NLTK's recommended sentence tokenizer
text_sents = sent_tokenize(text)
text_sents_clean = [remove_string_special_characters(s) for s in text_sents] 

In [None]:
text_sents[0:3]

['                  THE THREE LITTLE PIGS\n\n   Once upon a time .', '.', '.']

In [None]:
text_sents_clean[0:3]

['THE THREE LITTLE PIGS Once upon a time', '', '']

In [None]:
doc_info = get_doc(text_sents_clean)
doc_info[0:6]

[{'doc_id': 0, 'doc_length': 8},
 {'doc_id': 1, 'doc_length': 0},
 {'doc_id': 2, 'doc_length': 0},
 {'doc_id': 3, 'doc_length': 15},
 {'doc_id': 4, 'doc_length': 16},
 {'doc_id': 5, 'doc_length': 15}]

In [None]:
# Term counts in each document
freqDict_list = create_freq_dict(text_sents_clean)
freqDict_list[0]

{'doc_id': 0,
 'freq_dict': {'three': 1,
  'littl': 1,
  'pig': 1,
  'onc': 1,
  'upon': 1,
  'time': 1}}

In [None]:
TF_scores = computeTF(doc_info, freqDict_list)
IDF_scores = computeIDF(doc_info, freqDict_list)


In [None]:
TF_scores[54:56]

[{'doc_id': 8, 'TF_score': 0.30102999566398114, 'key': 'feel'},
 {'doc_id': 8, 'TF_score': 0.30102999566398114, 'key': 'need'}]

In [None]:
IDF_scores[54:56]

[{'doc_id': 8, 'IDF_score': 1.4866665726258925, 'key': 'feel'},
 {'doc_id': 8, 'IDF_score': 1.6627578316815739, 'key': 'need'}]

In [None]:
def computeTFIDF(TF_scores, IDF_scores):
    """
    TFIDF is computed by multiplying the corresponding
    TF and IDF values of each term. Both lists are built in the
    same (document, term) order, so entries are paired by position.
    """
    TFIDF_scores = [] 
    for tf_entry, idf_entry in zip(TF_scores, IDF_scores):
        temp = {'doc_id' : tf_entry['doc_id'],
                'TFIDF_score' : idf_entry['IDF_score']*tf_entry['TF_score'],
                'key' : idf_entry['key']}
        TFIDF_scores.append(temp)
    return TFIDF_scores
TFIDF_scores = computeTFIDF(TF_scores, IDF_scores)
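
Once the scores are in this list-of-dicts form, a natural next step is to rank the terms within each document. The sketch below is one way to do that; the helper `top_terms` and the toy scores are hypothetical and only mimic the structure returned by `computeTFIDF`.

```python
from collections import defaultdict

def top_terms(tfidf_scores, n=2):
    """Group entries by doc_id and keep the n highest-scoring terms of each."""
    by_doc = defaultdict(list)
    for entry in tfidf_scores:
        by_doc[entry['doc_id']].append((entry['key'], entry['TFIDF_score']))
    return {doc_id: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
            for doc_id, pairs in by_doc.items()}

# made-up scores in the same format as TFIDF_scores
toy_scores = [
    {'doc_id': 0, 'TFIDF_score': 0.10, 'key': 'pig'},
    {'doc_id': 0, 'TFIDF_score': 0.45, 'key': 'straw'},
    {'doc_id': 0, 'TFIDF_score': 0.30, 'key': 'hut'},
    {'doc_id': 1, 'TFIDF_score': 0.50, 'key': 'wolf'},
]

ranked = top_terms(toy_scores)
print(ranked)
```

With the real data, the same helper could be called as `top_terms(TFIDF_scores, n=5)`.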

## Using sklearn to avoid these computations

With the default settings of ```TfidfVectorizer()```, which correspond to ```TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)```, the term frequency (the number of times a term occurs in a given document) is multiplied by the idf component, which is computed as
$$ \text{idf}(t) = \log{\frac{1 + D}{1+\text{df}(t)}} + 1. $$

Here $D$ is the total number of documents and $\log$ is the natural logarithm. Each document's vector is then normalised to unit Euclidean length (```norm='l2'```).

With ```smooth_idf=False```, the “1” count is added to the idf instead of the idf’s denominator:

$$\text{idf}(t) = \log{\frac{D}{\text{df}(t)}} + 1. $$


If ```sublinear_tf=True``` is set, ```tf``` is replaced with ```1 + log(tf)```.



https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
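
As a quick check of the two idf formulas, the sketch below fits two vectorizers on a made-up two-document corpus and compares the fitted `idf_` values with the formulas evaluated by hand. The documents and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the car is driven", "the truck is parked"]  # D = 2

smooth = TfidfVectorizer(smooth_idf=True).fit(docs)
plain = TfidfVectorizer(smooth_idf=False).fit(docs)

car = smooth.vocabulary_["car"]  # column index of "car" (df = 1)
the = smooth.vocabulary_["the"]  # column index of "the" (df = 2)

# smoothed: log((1 + 2) / (1 + 1)) + 1
print(smooth.idf_[car], np.log(3 / 2) + 1)
# unsmoothed: log(2 / 1) + 1
print(plain.idf_[car], np.log(2 / 1) + 1)
# a term present in every document has idf 1 under both settings
print(smooth.idf_[the], plain.idf_[the])
```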

In [None]:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [None]:
docA = "The car is driven on the road"
docB = "The truck is driven on the highway"



In [None]:
tfidf = TfidfVectorizer()

In [None]:
feature_matrix = tfidf.fit_transform([docA, docB])


In [None]:
feature_matrix.toarray().shape

(2, 8)

In [None]:
tfidf.get_feature_names_out()

array(['car', 'driven', 'highway', 'is', 'on', 'road', 'the', 'truck'],
      dtype=object)

In [None]:
pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names_out())

|   | car | driven | highway | is | on | road | the | truck |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.424717 | 0.30219 | 0.0 | 0.30219 | 0.30219 | 0.424717 | 0.60438 | 0.0 |
| 1 | 0.0 | 0.30219 | 0.424717 | 0.30219 | 0.30219 | 0.0 | 0.60438 | 0.424717 |
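
To see where these numbers come from, the sketch below rebuilds the matrix by hand from raw counts, the smoothed idf, and L2 normalisation, and compares it with the output of `TfidfVectorizer`. Variable names such as `manual` and `reference` are illustrative; `CountVectorizer` is used here only to obtain the raw counts with the same tokenisation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The car is driven on the road",
        "The truck is driven on the highway"]

counts = CountVectorizer().fit_transform(docs).toarray()  # raw term counts
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                       # df(t) for every term

idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1           # smoothed idf
weighted = counts * idf                                   # tf * idf
# divide each row by its Euclidean length (norm='l2')
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))  # True
print(manual.round(6))
```

For example, "the" occurs twice in each document and has idf 1, so its unnormalised weight is 2; dividing by the row norm gives the 0.60438 shown in the table.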


## Another example

In [None]:
document1 = """Python is a 2000 made-for-TV horror movie directed by Richard Clabaugh. The film features several 
cult favorite actors, including William Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy, Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean
Whalen. The film concerns a genetically engineered snake, a python, that
escapes and unleashes itself on a small town. It includes the classic final
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles,
 California and Malibu, California. Python was followed by two sequels: Python
 II (2002) and Boa vs. Python (2004), both also made-for-TV films."""


document2 = """Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known."""

document3 = """The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced
in 1955, the same year as Smith &amp; Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy
Thompson, Renee Smeets and Martin Dougherty have described the Python as the
finest production revolver ever made."""


This time the vectorizer is created with ```smooth_idf=False```, so the idf component is computed as

$$\text{idf}(t) = \log{\frac{D}{\text{df}(t)}} + 1. $$


In [None]:
tfidf = TfidfVectorizer(smooth_idf=False)
feature_matrix = tfidf.fit_transform([document1, document2, document3])
feature_matrix.toarray().shape

(3, 168)

In [None]:
df_tf_idf = pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names_out())
df_tf_idf

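|   | 13th | 1955 | 2000 | 2002 | 2004 | 357 | 44 | actors | africa | also | ... | whalen | wheaton | wil | william | word | writers | year | zabka | πύθων | πύθωνας |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0849 | 0.0 | 0.0849 | 0.0849 | 0.0849 | 0.0 | 0.0 | 0.0849 | 0.0 | 0.0849 | ... | 0.0849 | 0.0849 | 0.0849 | 0.0849 | 0.0 | 0.0 | 0.0 | 0.0849 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.189476 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.189476 | 0.0 | 0.0 | 0.0 | 0.189476 | 0.189476 |
| 2 | 0.0 | 0.100098 | 0.0 | 0.0 | 0.0 | 0.100098 | 0.100098 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.100098 | 0.100098 | 0.0 | 0.0 | 0.0 |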
