In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tabulate import tabulate

# Documents
doc1 = "Cat runs behind rat"
doc2 = "Dog runs behind cat"
query = "Cat runs"

# Represent documents in vector space
vectorizer = CountVectorizer()
corpus = [doc1, doc2, query]
X = vectorizer.fit_transform(corpus)

# Display feature names (terms)
features = vectorizer.get_feature_names_out()

# Convert sparse matrix to dense for better visualization
dense_matrix = X.toarray()

# Build one table row per term: the term followed by its count in each document
headers = ["Term", "Doc1", "Doc2", "Query"]
table_data = [[term] + dense_matrix[:, j].tolist() for j, term in enumerate(features)]

# Display table
print("Vector Space Representation:")
print(tabulate(table_data, headers=headers, tablefmt="grid"))

# Compute similarity between documents
similarity = cosine_similarity(X)

# Display similarity matrix
print("\nSimilarity Matrix:")
print(tabulate(similarity, headers=["Doc1", "Doc2", "Query"], tablefmt="grid"))

Vector Space Representation:
+--------+--------+--------+---------+
| Term   |   Doc1 |   Doc2 |   Query |
+========+========+========+=========+
| behind |      1 |      1 |       0 |
+--------+--------+--------+---------+
| cat    |      1 |      1 |       1 |
+--------+--------+--------+---------+
| dog    |      0 |      1 |       0 |
+--------+--------+--------+---------+
| rat    |      1 |      0 |       0 |
+--------+--------+--------+---------+
| runs   |      1 |      1 |       1 |
+--------+--------+--------+---------+

Similarity Matrix:
+----------+----------+----------+
|     Doc1 |     Doc2 |    Query |
+==========+==========+==========+
| 1        | 0.75     | 0.707107 |
+----------+----------+----------+
| 0.75     | 1        | 0.707107 |
+----------+----------+----------+
| 0.707107 | 0.707107 | 1        |
+----------+----------+----------+

In [None]:
This Python implementation uses the tabulate library to display data in table format.
It is a basic example of representing documents in vector space and computing their similarity with cosine similarity.

Key Points

Vector Space Representation:

`CountVectorizer` converts the text into a bag-of-words model.

Each term in the corpus becomes a feature, and documents are represented as vectors based on term frequency.
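
To make the bag-of-words idea concrete, here is a minimal sketch that builds the same term-frequency vectors by hand, using only the standard library. It assumes plain lowercase, whitespace-split tokenization, which happens to match `CountVectorizer`'s defaults for this tiny corpus but not in general (for example, `CountVectorizer` drops single-character tokens by default). The helper name `to_vector` is hypothetical.

```python
from collections import Counter

corpus = ["Cat runs behind rat", "Dog runs behind cat", "Cat runs"]

# Assumption: lowercasing and splitting on whitespace stands in for real tokenization
tokenized = [doc.lower().split() for doc in corpus]

# The vocabulary is the sorted set of all terms seen anywhere in the corpus
vocabulary = sorted({term for tokens in tokenized for term in tokens})


def to_vector(tokens, vocabulary):
    """Return the term counts of one document, ordered by the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vectors = [to_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary term, so all documents end up with vectors of the same length regardless of how many words they contain.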

Similarity Computation:

Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space, useful for comparing text documents.

Output Example

The script will output the following:

- Vector space representation, showing term frequencies for each document and the query.
- Similarity matrix, indicating pairwise similarities between the documents and the query.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example documents
doc1 = "Cat runs behind rat"
doc2 = "Dog runs behind cat"
query = "Cat runs"

# Vectorizing the documents
vectorizer = CountVectorizer()
corpus = [doc1, doc2, query]
X = vectorizer.fit_transform(corpus)

# Compute cosine similarity
similarity_matrix = cosine_similarity(X)

# Display similarity matrix
print("Cosine Similarity Matrix:")
print(similarity_matrix)


Cosine Similarity Matrix:
[[1.         0.75       0.70710678]
 [0.75       1.         0.70710678]
 [0.70710678 0.70710678 1.        ]]


In [None]:
What is Cosine Similarity?

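Cosine similarity is a metric that measures the similarity between two non-zero vectors in a multi-dimensional space. It calculates the cosine of the angle between the vectors, where:

- A cosine value of 1 means the vectors point in the same direction (maximally similar), even if their lengths differ.
- A cosine value of 0 means the vectors are orthogonal (no similarity); for term-count vectors, the documents share no terms.
- A cosine value between 0 and 1 represents partial similarity.

In general the cosine can range from -1 to 1, but term-count vectors have no negative entries, so their similarity always falls between 0 and 1.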

In the context of text documents, the vectors represent term frequencies or other numerical encodings of the documents, making cosine similarity a popular metric for text similarity.

Formula
The formula for cosine similarity is:

$$\text{Cosine Similarity}(\vec{A}, \vec{B}) = \cos\theta = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \, \|\vec{B}\|}$$
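
As a check on the formula, the sketch below computes the dot product divided by the product of the Euclidean norms directly, using only the standard library. The function name `cosine_similarity_manual` is hypothetical, and the vectors are the term counts of the example documents over the vocabulary `[behind, cat, dog, rat, runs]`.

```python
import math


def cosine_similarity_manual(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Term counts over the vocabulary [behind, cat, dog, rat, runs]
doc1 = [1, 1, 0, 1, 1]   # "Cat runs behind rat"
doc2 = [1, 1, 1, 0, 1]   # "Dog runs behind cat"
query = [0, 1, 0, 0, 1]  # "Cat runs"

print(round(cosine_similarity_manual(doc1, doc2), 6))   # 0.75
print(round(cosine_similarity_manual(doc1, query), 6))  # 0.707107
```

For `doc1` and `doc2` the dot product is 3 and both norms are 2, giving 3 / (2 * 2) = 0.75, which agrees with the matrix printed by `cosine_similarity`.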

Similarity Matrix Interpretation
- Each row and each column corresponds to a document or the query.
- Diagonal values are always `1`, as each document is identical to itself.
- Off-diagonal values represent the cosine similarity between two different documents.

Applications of Cosine Similarity
1. Information Retrieval: Measure the relevance of documents to a query.
2. Text Mining: Compare the similarity between articles, emails, or sentences.
3. Clustering: Group similar text documents together.
4. Recommendation Systems: Identify items or content similar to user preferences.
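
The information-retrieval use case can be sketched as ranking documents by their cosine similarity to a query. This is a minimal illustration using only the standard library, not a production retrieval pipeline. The third document (`"Dog barks"`) is hypothetical, added only so that one document shares no terms with the query, and the helper names `bag` and `cosine` are assumptions as well.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words term counts (assumes lowercase, whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(counts_a, counts_b):
    """Cosine similarity between two term-count mappings."""
    # A Counter returns 0 for terms it has not seen, so unshared terms add nothing
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


documents = {
    "doc1": "Cat runs behind rat",
    "doc2": "Dog runs behind cat",
    "doc3": "Dog barks",  # hypothetical extra document, not in the original corpus
}
query = "Cat runs"

query_counts = bag(query)
scores = {name: cosine(query_counts, bag(text)) for name, text in documents.items()}

# Highest similarity first; ties keep their original order
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Here `doc1` and `doc2` tie because each contains both query terms among four words, while `doc3` scores 0 because it shares no terms with the query.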

Read these articles for more information:
https://medium.com/@TheDataScience-ProF/understanding-cosine-similarity-applications-in-llms-and-beyond-882bcf1077dc

https://www.datastax.com/guides/what-is-cosine-similarity

https://databasecamp.de/en/data/cosine-similarity