This document covers how to use the VECTOR SPACE MODEL in text data mining to determine the similarity between a given document and other documents in a corpus.
It introduces two approaches:

    Approach 1 - Using Counter module in Collections library in Python

    Approach 2 - Using Scikit-Learn library
    
At the end, you will also learn how to adjust for common words using the inverse document frequency (idf).

In [1]:
# For this study, we will use a corpus that contains three documents

corpus = ["a rose is still a rose", "there is no there there", "rose is a rose is a rose is a rose"]

Approach 1 >>>

In [2]:
from collections import Counter
import pandas as pd

# determine the term frequency tf for each word in the corpus
tf_df = pd.DataFrame([pd.Series(Counter(doc.split())) for doc in corpus]).fillna(0)
tf_df

     a  rose   is  still  there   no
0  2.0   2.0  1.0    1.0    0.0  0.0
1  0.0   0.0  1.0    0.0    3.0  1.0
2  3.0   4.0  3.0    0.0    0.0  0.0


In [3]:
# create function that calculates the cosine distance between two vectors
import numpy as np

# function to calculate the length of a vector v
def length(v):
    return np.sqrt((v**2).sum())

# function to calculate the cosine distance
def cos_dist(v, w):
    return 1 - (v*w).sum()/(length(v)*length(w))

In [4]:
# call the cos_dist function
cos_dist(tf_df.loc[0], tf_df.loc[1]), cos_dist(tf_df.loc[0], tf_df.loc[2])

(0.9046537410754407, 0.07804555427071147)

Interpretation: 

The angle between document 0 and document 2 is smaller (a cosine distance of about 0.08, versus about 0.90 for documents 0 and 1), so document 0 is more similar to document 2 than to document 1.
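To make the word "angle" concrete, the cosine distances can be converted back into angles. The following is a minimal sketch and not part of the original notebook: the helper name `angle_deg` is hypothetical, and the term-count vectors are typed in by hand from the tf_df table above (column order a, rose, is, still, there, no).

```python
import numpy as np

# Term-count vectors copied by hand from tf_df
# (columns: a, rose, is, still, there, no)
doc0 = np.array([2, 2, 1, 1, 0, 0], dtype=float)
doc1 = np.array([0, 0, 1, 0, 3, 1], dtype=float)
doc2 = np.array([3, 4, 3, 0, 0, 0], dtype=float)

def angle_deg(v, w):
    """Angle between vectors v and w in degrees (hypothetical helper)."""
    cos_sim = v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
    # Clip guards against floating-point values slightly outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))

angle_01 = angle_deg(doc0, doc1)
angle_02 = angle_deg(doc0, doc2)
print(angle_01, angle_02)
```

The first angle comes out close to a right angle, because documents 0 and 1 share only the word "is", while the second angle is much smaller.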

Approach 2 >>>

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances

vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b") # creates a countvectorizer object
tf_mat = vec.fit_transform(corpus) # to create a term frequency matrix
tf_mat.todense() # to view the matrix

matrix([[2, 1, 0, 2, 1, 0],
        [0, 1, 1, 0, 0, 3],
        [3, 3, 0, 4, 0, 0]], dtype=int64)

In [6]:
# use the pairwise_distances function to calculate the cosine distance from document 0 to documents 1 and 2
pairwise_distances(tf_mat[0], tf_mat[1:], metric="cosine")

array([[0.90465374, 0.07804555]])

Interpretation:

The distances match those from Approach 1: the angle between document 0 and document 2 is smaller, so document 0 is more similar to document 2 than to document 1.
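If all pairs of documents are of interest, not only those involving document 0, the whole matrix can be passed to `pairwise_distances`. The sketch below is an illustrative addition rather than part of the original notebook; it also uses `cosine_similarity` from `sklearn.metrics.pairwise` to show that similarity and distance sum to 1. The variable names `dist` and `sim` are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["a rose is still a rose",
          "there is no there there",
          "rose is a rose is a rose is a rose"]

# Same vectorizer settings as in the notebook (keeps one-letter tokens)
vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
tf_mat = vec.fit_transform(corpus)

# 3x3 matrix: entry [i, j] is the cosine distance between documents i and j
dist = pairwise_distances(tf_mat, metric="cosine")

# Cosine similarity is the complement of cosine distance
sim = cosine_similarity(tf_mat)

print(dist.round(4))
print(sim.round(4))
```

The diagonal of `dist` is (numerically) zero because every document is at distance 0 from itself, and row 0 reproduces the two distances computed above.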

You can also view the term frequency matrix in a dataframe:

In [7]:
terms = vec.get_feature_names_out()
tf_mat_df = pd.DataFrame(tf_mat.todense(), columns=terms)
tf_mat_df

   a  is  no  rose  still  there
0  2   1   0     2      1      0
1  0   1   1     0      0      3
2  3   3   0     4      0      0


Using Inverse Document Frequency IDF to adjust Term Frequency TF >>> 

Using raw term frequencies (tf) as inputs to the cosine distance has a drawback: common words such as "is" carry as much weight as rarer, more informative words, which can make documents look more similar than they really are.
We can correct for this by weighting each term by its inverse document frequency (idf).
                          
                        Adjusted term frequency (tf_idf) = tf * idf
                          
                        where:
                            idf(t, D) = 1 + log(1/df), with log the natural logarithm
                          
                            df = (number of documents containing term t) / (total number of documents in the corpus D)

We can use the calculated tf_idf to recalculate the cosine distances between two vectors (you can try it on your own).
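As one possible way to try this by hand, the sketch below applies the formula above to the Counter-based term frequencies from Approach 1. It is an illustrative addition under stated assumptions: it uses the natural logarithm, splits on whitespace (so the word "a" is kept), and the names `doc_freq`, `idf`, `tfidf`, `d01` and `d02` are hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd

corpus = ["a rose is still a rose",
          "there is no there there",
          "rose is a rose is a rose is a rose"]

# Term frequencies: one row per document, one column per term
tf = pd.DataFrame([Counter(doc.split()) for doc in corpus]).fillna(0)

# df: fraction of documents that contain each term
doc_freq = (tf > 0).sum(axis=0) / len(corpus)

# idf = 1 + ln(1/df)
idf = 1 + np.log(1 / doc_freq)

# tf_idf = tf * idf (the Series is aligned with the DataFrame columns)
tfidf = tf * idf

def cos_dist(v, w):
    """Cosine distance between two vectors."""
    return 1 - (v * w).sum() / (np.sqrt((v**2).sum()) * np.sqrt((w**2).sum()))

d01 = cos_dist(tfidf.loc[0], tfidf.loc[1])
d02 = cos_dist(tfidf.loc[0], tfidf.loc[2])
print(idf)
print(d01, d02)
```

A term that occurs in every document, such as "is", gets idf = 1 + ln(1) = 1, the smallest possible weight, while a term that occurs in only one of the three documents gets 1 + ln(3).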

Scikit-Learn provides a class, TfidfVectorizer, that computes tf_idf directly; see the code below:

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(smooth_idf=False, norm=None)
tfidf_mat = vec.fit_transform(corpus)
tfidf_mat.todense()

matrix([[1.        , 0.        , 2.81093022, 2.09861229, 0.        ],
        [1.        , 2.09861229, 0.        , 0.        , 6.29583687],
        [3.        , 0.        , 5.62186043, 0.        , 0.        ]])

In [9]:
# use the pairwise_distances function to calculate the cosine distance from document 0 to documents 1 and 2
pairwise_distances(tfidf_mat[0], tfidf_mat[1:], metric="cosine")

array([[0.95915143, 0.19106774]])

Let's see the adjusted tf-idf in a dataframe below:

In [10]:
terms = vec.get_feature_names_out()
tfidf_mat_df = pd.DataFrame(tfidf_mat.todense(), columns=terms)
tfidf_mat_df

    is        no     rose     still     there
0  1.0       0.0  2.81093  2.098612       0.0
1  1.0  2.098612      0.0       0.0  6.295837
2  3.0       0.0  5.62186       0.0       0.0


Final Notes:

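The TfidfVectorizer dropped the term 'a' from the tf_idf matrix. Of the possible explanations, only one applies here:

- It is treated as a stop word. This is not the cause: stop_words defaults to None, so no stop-word list is applied.
- It is a very short word. This is the actual cause: the default token_pattern, r"(?u)\b\w\w+\b", only matches tokens of two or more characters. The CountVectorizer in Approach 2 kept 'a' because it was given token_pattern=r"(?u)\b\w+\b".
- It appears in too many documents. This is not the cause: max_df defaults to 1.0, so no term is dropped for being too common.

A term that appears in every document is not removed. Its inverse document frequency (IDF) simply takes its minimum value (1 with smooth_idf=False), so it contributes little to the TF-IDF weights.
Because 'a' is missing from the tf_idf matrix, part of the change in the cosine distances comes from dropping that term and not only from the idf weighting.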