Lab by Arpan Dasgupta

arpan.dasgupta@research.iiit.ac.in

## Working with Text 3 : TF-IDF

The issue with using the Bag of Words method for converting text to vectors is that highly frequent words start to dominate the document vector (they receive larger scores), even though they may carry less “informational content” for the model than rarer, possibly domain-specific words.

One approach is to rescale the frequency of words by how often they appear in all documents, so that the scores for frequent words like “the” that are also frequent across all documents are penalized.

TF-IDF stands for “Term Frequency-Inverse Document Frequency”. It is a technique for quantifying words in documents: we compute a weight for each word that signifies its importance in the document and in the corpus. The method is widely used in Information Retrieval and Text Mining.

**Term Frequency**: a score of how often the word occurs in the current document.

**Inverse Document Frequency**: a score of how rare the word is across documents.

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

TF is individual to each document and word, hence we can formulate TF as follows.

    tf(t,d) = count of t in d / number of words in d

DF is the number of documents in which the word is present. We count a document once if the term occurs in it at least once; we do not need to know how many times the term occurs.

    df(t) = number of documents containing t
    idf(t) = log(N/(df(t) + 1)) where N = number of documents in the corpus

Thus the formula for the basic TF-IDF is

    tf-idf(t, d) = tf(t, d) * log(N/(df + 1))

where

    t - term (word)
    d - document (set of words)
    N - number of documents in the corpus
    corpus - the total document set
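
As a quick check of the formulas above, here is a minimal sketch that applies them to a small made-up corpus. The corpus and the helper names (`tf`, `df`, `idf`, `tf_idf`) are chosen for illustration only.

```python
import math

# Toy corpus, invented for illustration; each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "a fish swam".split(),
]
N = len(docs)  # number of documents in the corpus

def tf(term, doc):
    # count of t in d / number of words in d
    return doc.count(term) / len(doc)

def df(term):
    # number of documents that contain the term at least once
    return sum(1 for doc in docs if term in doc)

def idf(term):
    # log(N / (df + 1)), as in the formula above
    return math.log(N / (df(term) + 1))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, docs[0]), 3))
# the 0.0
# cat 0.048
# mat 0.116
```

Although "the" is the most frequent word in the first document, it occurs in three of the four documents, so its weight drops to zero, while the rarer "mat" gets the highest weight. Note that with this formula a term occurring in every document would get a slightly negative IDF, since log(N/(N + 1)) < 0.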

### TF-IDF example code

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

In [None]:
bagOfWordsA = documentA.split(' ')
bagOfWordsB = documentB.split(' ')
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))

In [None]:
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
  numOfWordsA[word] += 1

numOfWordsB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
  numOfWordsB[word] += 1

In [None]:
def computeTF(wordDict, bagOfWords):
    tfDict = {}
    bagOfWordsCount = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount)
    return tfDict

In [None]:
tfA = computeTF(numOfWordsA, bagOfWordsA)
tfB = computeTF(numOfWordsB, bagOfWordsB)

In [None]:
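def computeIDF(documents):
    import math
    N = len(documents)

    # Document frequency: in how many documents does each word occur?
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1

    # Note: this uses the unsmoothed form log(N / df) rather than the
    # log(N / (df + 1)) given above. Every word in the vocabulary occurs
    # in at least one document here, so df is never zero.
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict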

In [None]:
idfs = computeIDF([numOfWordsA, numOfWordsB])

In [None]:
def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    return tfidf

In [None]:
tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)
df = pd.DataFrame([tfidfA, tfidfB])

print(df)

The following is the scikit-learn implementation. The values differ from the ones above because sklearn, by default, uses raw term counts for TF, applies a smoothed version of idf, L2-normalizes each document vector, and drops one-character tokens such as "a".

In [None]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([documentA, documentB])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)

print(df)
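
To see where the difference comes from, the sketch below tries to reproduce the scikit-learn numbers in plain Python. It assumes the `TfidfVectorizer` defaults as we understand them: raw term counts for TF, the smoothed `idf(t) = ln((1 + N) / (1 + df(t))) + 1`, L2 normalization of each row, and a tokenizer that lowercases and keeps only tokens of two or more characters. The helper names (`tokenize`, `smoothed_idf`, `tfidf_row`) are ours and not part of sklearn.

```python
import math
import re

documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'
docs = [documentA, documentB]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    # (assumed to match sklearn's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
vocab = sorted(set(t for doc in tokenized for t in doc))
n = len(docs)

def smoothed_idf(term):
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1
    df = sum(1 for doc in tokenized if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf_row(doc):
    # Raw count times idf, then L2-normalize the whole row.
    raw = [doc.count(t) * smoothed_idf(t) for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

rows = [tfidf_row(doc) for doc in tokenized]
for term, a, b in zip(vocab, *rows):
    print(f"{term:10s} {a:.3f} {b:.3f}")
```

The one-character token "a" does not appear in the vocabulary. In document A, "the" gets the lowest weight because it occurs in both documents; in document B its count of two outweighs the low idf, so it still ends up with the largest weight.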

## Working with Text 4 : Semantic Representation and Retrieval

A more advanced approach is to compare documents based on how similar their words are. For example, ‘apples’ and ‘oranges’ might be regarded as more similar than ‘apples’ and ‘Jupiter’. Judging word similarity at scale is difficult; one widely used approach is to analyse a large corpus of text and rank words that often appear together as being more similar.

This is the basis of the word embedding model GloVe: it maps words to numerical vectors, that is, points in a multi-dimensional space, so that words that often occur together are near each other in that space. It is an unsupervised learning algorithm developed at Stanford University.
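
To illustrate what "near each other in space" can mean, here is a small sketch that compares word vectors with cosine similarity. The three-dimensional vectors are invented for this example and are not real GloVe values; using cosine similarity as the measure of closeness is also our assumption here.

```python
import math

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
# Real GloVe vectors have 50 or more dimensions and are learned from text.
toy_vectors = {
    "apples":  [0.9, 0.8, 0.1],
    "oranges": [0.8, 0.9, 0.2],
    "jupiter": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    # Cosine of the angle between u and v: dot product over product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine(toy_vectors["apples"], toy_vectors["oranges"]), 3))
print(round(cosine(toy_vectors["apples"], toy_vectors["jupiter"]), 3))
```

With these made-up vectors, the first similarity is close to 1 and the second is much lower, which matches the intuition that ‘apples’ is closer to ‘oranges’ than to ‘Jupiter’.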

In [None]:
import numpy as np

In [None]:
!pip install gensim~=3.8
!pip install nltk~=3.4

We aim to find which of the documents is most similar to the query string:

```
query_string = 'fruit and vegetables'
documents = ['cars drive on the road', 'tomatoes are actually fruit']
```

We preprocess the text with gensim and also remove any HTML tags that may be present, for example if the data was scraped from the web:

In [None]:
from re import sub
from gensim.utils import simple_preprocess

query_string = 'fruit and vegetables'
documents = ['cars drive on the road', 'tomatoes are actually fruit']

stopwords = ['the', 'and', 'are', 'a']

def preprocess(doc):
    # Tokenize, clean up input document string
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

# Preprocess the documents, including the query string
corpus = [preprocess(document) for document in documents]
query = preprocess(query_string)
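
To see what the regular expressions in `preprocess` do, the sketch below applies the same substitutions to a made-up HTML snippet. To keep it free of dependencies, it replaces gensim's `simple_preprocess` with a plain lowercase-and-split, which is an assumption and may tokenize differently on other inputs; `preprocess_sketch` and `sample` are names introduced for this example.

```python
from re import sub

stopwords = ['the', 'and', 'are', 'a']

def preprocess_sketch(doc):
    # Same substitutions as in preprocess: images, other tags,
    # img_assist markers, and URLs are replaced by spaces or placeholder tokens.
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    # Assumption: lowercase + whitespace split stands in for simple_preprocess.
    return [tok for tok in doc.lower().split() if tok not in stopwords]

sample = 'Fresh <b>fruit</b> <img src="a.png"> at https://example.com/shop'
tokens = preprocess_sketch(sample)
print(tokens)
# ['fresh', 'fruit', 'image_token', 'at', 'url_token']
```

The markup is gone, while the image and the URL survive as the placeholder tokens `image_token` and `url_token`.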

Then we create a similarity matrix that contains the similarity between each pair of words, weighted using the term frequency:

In [None]:
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

# Load the model: this is a big file, can take a while to download and open
glove = api.load("glove-wiki-gigaword-50")    
similarity_index = WordEmbeddingSimilarityIndex(glove)

# Build the term dictionary, TF-idf model
dictionary = Dictionary(corpus+[query])
tfidf = TfidfModel(dictionary=dictionary)

# Create the term similarity matrix.  
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

Finally, we calculate the soft cosine similarity between the query and each of the documents. Unlike the regular cosine similarity (which would return zero for vectors with no overlapping terms), the soft cosine similarity considers word similarity as well.
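
To make the difference concrete, here is a toy sketch of the soft cosine measure that is independent of gensim. It assumes the usual definition, softcos(a, b) = aᵀSb / sqrt(aᵀSa · bᵀSb), where S holds term-to-term similarities. The vocabulary, the vectors, and the values in `S` are made up for illustration.

```python
import math

# Vocabulary order: [fruit, vegetables, tomatoes, cars]
query = [1.0, 1.0, 0.0, 0.0]     # contains "fruit" and "vegetables"
doc_food = [0.0, 0.0, 1.0, 0.0]  # contains only "tomatoes"
doc_cars = [0.0, 0.0, 0.0, 1.0]  # contains only "cars"

# Hypothetical term-similarity matrix (symmetric, ones on the diagonal).
S = [
    [1.0, 0.6, 0.5, 0.0],
    [0.6, 1.0, 0.7, 0.0],
    [0.5, 0.7, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# With the identity matrix, soft cosine reduces to the regular cosine.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def bilinear(a, b, sim):
    # Computes a^T * sim * b
    return sum(sim[i][j] * a[i] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def soft_cosine(a, b, sim):
    return bilinear(a, b, sim) / math.sqrt(bilinear(a, a, sim) * bilinear(b, b, sim))

print(round(soft_cosine(query, doc_food, identity), 3))  # regular cosine: 0.0
print(round(soft_cosine(query, doc_food, S), 3))         # soft cosine: 0.671
print(round(soft_cosine(query, doc_cars, S), 3))         # unrelated terms: 0.0
```

The query and `doc_food` share no terms, so the regular cosine is zero, yet the soft cosine is clearly positive because "tomatoes" is marked as similar to "fruit" and "vegetables" in `S`.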

In [None]:
# Compute Soft Cosine Measure between the query and the documents.
query_tf = tfidf[dictionary.doc2bow(query)]

index = SoftCosineSimilarity(
            tfidf[[dictionary.doc2bow(document) for document in corpus]],
            similarity_matrix)

doc_similarity_scores = index[query_tf]

# Output the sorted similarity scores and documents
sorted_indexes = np.argsort(doc_similarity_scores)[::-1]
for idx in sorted_indexes:
    print(f'{idx} \t {doc_similarity_scores[idx]:0.3f} \t {documents[idx]}')

# 1    0.688    tomatoes are actually fruit
# 0    0.000    cars drive on the road

As we can see, GloVe works! Semantic similarity is good for ranking content in order, rather than for making specific judgements about whether a document is or is not about a specific topic. There are other ways of obtaining semantic similarity, such as Word2Vec embeddings.



## Exercises

1. Why is IDF required? What happens if we use only TF?
2. Why are BoW embeddings not sufficient?

## References and Resources

1. https://monkeylearn.com/blog/what-is-tf-idf/
2. https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76
3. https://towardsdatascience.com/how-to-rank-text-content-by-semantic-similarity-4d2419a84c32