### Sentence Similarity using TF-IDF and Matrix Operations

We now explore **how matrix operations appear in Natural Language Processing (NLP)**.

1. Suppose we have 3–4 short sentences from users.  
2. We can represent each sentence as a **vector of numbers** using **TF-IDF**:  
   - Each row = a sentence  
   - Each column = a word from the vocabulary  
   - Entry = TF-IDF weight of that word in the sentence  

**Briefly about TF-IDF:**  
- TF-IDF stands for **Term Frequency – Inverse Document Frequency**.  
- **Term Frequency (TF)** measures how often a word appears in a sentence.  
- **Inverse Document Frequency (IDF)** downweights words that appear in many sentences (common words like "the", "is").  
- **TF-IDF = TF × IDF**, giving higher weight to words that are **important in that sentence but rare across sentences**.  

3. Collectively, all sentence vectors form a **matrix** \(S\).  
4. To compute **similarity between sentences**, we can use **cosine similarity**, which involves:
\[
\text{cosine\_sim}(s_i, s_j) = \frac{s_i \cdot s_j}{\|s_i\| \|s_j\|}
\]  
- Here \(s_i \cdot s_j\) is a **dot product**, i.e. a **vector-vector multiplication**; computing it for every pair of rows at once is the **matrix product** \(S S^\top\).  
- This gives a simple, real-world example of **linear algebra operations applied to text data**.
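
As a minimal sketch of the formula above, the snippet below computes cosine similarity for two small vectors with NumPy. The vectors `s_i` and `s_j` are made-up numbers and `cosine_sim` is a hypothetical helper name; neither comes from the sentences used later in this section.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity of two 1-D vectors (hypothetical helper)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-dimensional vectors, for illustration only
s_i = np.array([1.0, 2.0, 0.0])
s_j = np.array([2.0, 1.0, 0.0])

# Dot product is 4, both norms are sqrt(5), so the result is 4 / 5
print(round(cosine_sim(s_i, s_j), 3))  # 0.8
```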


In [29]:
import re

# Take sentences from users
raw_sentences = []
num_sentences = 4
for i in range(num_sentences):
    s = input(f"Enter sentence {i+1}: ")
    raw_sentences.append(s)

# Clean sentences: lowercase and remove punctuation
sentences = [re.sub(r'[^\w\s]', '', s.lower()) for s in raw_sentences]

# Build vocabulary from cleaned sentences
vocab = set()
for sentence in sentences:
    words = sentence.split()
    vocab.update(words)
vocab = sorted(vocab)

print("Cleaned Sentences:\n", sentences)
print("\nVocabulary:\n", vocab)


Enter sentence 1:  Matrix multiplication is a fundamental operation in linear algebra.
Enter sentence 2:  Vectors and matrices are essential for data science computations.
Enter sentence 3:  Linear regression uses matrix operations to solve equations and generate predictions.
Enter sentence 4:  Eigenvalues and eigenvectors reveal important properties of a matrix.


Cleaned Sentences:
 ['matrix multiplication is a fundamental operation in linear algebra', 'vectors and matrices are essential for data science computations', 'linear regression uses matrix operations to solve equations and generate predictions', 'eigenvalues and eigenvectors reveal important properties of a matrix']

Vocabulary:
 ['a', 'algebra', 'and', 'are', 'computations', 'data', 'eigenvalues', 'eigenvectors', 'equations', 'essential', 'for', 'fundamental', 'generate', 'important', 'in', 'is', 'linear', 'matrices', 'matrix', 'multiplication', 'of', 'operation', 'operations', 'predictions', 'properties', 'regression', 'reveal', 'science', 'solve', 'to', 'uses', 'vectors']


### Vectorizing Sentences and Building the TF-IDF Matrix

1. **Vectorization concept:**  
   - Each sentence can be represented as a **vector** in a high-dimensional space, where each dimension corresponds to a word from the vocabulary.  
   - For example, if our vocabulary has 10 words, each sentence becomes a **10-dimensional vector**.  
   - The entry for a word in a sentence’s vector is its **TF-IDF weight** (importance in that sentence).

2. **Stacking vectors into a matrix:**  
   - Collect all sentence vectors together → form a **matrix** \(S\):  
     - Rows → sentences  
     - Columns → words (vocabulary)  
     - Entries → TF-IDF weights  

$$
S =
\begin{bmatrix}
\text{TF-IDF of sentence 1} \\
\text{TF-IDF of sentence 2} \\
\vdots \\
\text{TF-IDF of sentence n} \\
\end{bmatrix}
$$

- This **matrix view** makes linear algebra operations natural, e.g., computing **cosine similarity** as **dot products of row vectors**.

3. **TF-IDF matrix intuition:**  
   - Each row = one sentence  
   - Each column = one word from the vocabulary  
   - Entry = TF-IDF weight = TF × IDF  
   - Once vectorized, **all standard matrix operations** (dot products, norms, projections) can be applied to these sentence vectors.
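
Projections are listed above but not used later in this section, so here is a rough sketch of what a projection and a norm look like for two vectors. The vectors `a` and `b` are made-up numbers and `project` is a hypothetical helper name.

```python
import numpy as np

def project(a, b):
    """Project vector a onto the direction of vector b (hypothetical helper)."""
    return (a @ b) / (b @ b) * b

# Made-up 2-dimensional vectors, for illustration only
a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

p = project(a, b)
print(p)                   # component of a that lies along b
print(np.linalg.norm(a))   # Euclidean (L2) norm of a
```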


In [31]:
import numpy as np
import pandas as pd

# Tokenize the stored sentences
tokenized_sentences = [s.split() for s in sentences]  # already lowercased

# Map each word to a column index
vocab_index = {word: idx for idx, word in enumerate(vocab)}

# Initialize Term Frequency (TF) matrix
tf_matrix = np.zeros((len(sentences), len(vocab)))

# Fill TF matrix
for i, sent in enumerate(tokenized_sentences):
    for word in sent:
        tf_matrix[i, vocab_index[word]] += 1
    tf_matrix[i] /= len(sent)  # normalize by sentence length



# Compute Inverse Document Frequency (IDF) vector
num_sentences = len(sentences)
idf = np.zeros(len(vocab))
for j, word in enumerate(vocab):
    containing = sum(1 for sent in tokenized_sentences if word in sent)
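    # Smoothed IDF variant: the +1 further dampens the weight of frequent words
    # (0 for a word in all but one sentence, negative for a word in every sentence)
    idf[j] = np.log(num_sentences / (1 + containing))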



# Compute TF-IDF matrix
tfidf_matrix = tf_matrix * idf  # elementwise multiplication

# Create a DataFrame from the TF-IDF matrix
tfidf_df = pd.DataFrame(tfidf_matrix, columns=vocab)

# Label the rows "Sentence 1", "Sentence 2", ... for readability
row_labels = [f"Sentence {i+1}" for i in range(len(sentences))]
tfidf_df.index = row_labels

# Display nicely
pd.set_option('display.precision', 3)      # round numbers to 3 decimals
pd.set_option('display.max_columns', None) # show all columns
print("TF-IDF Matrix (DataFrame view):")
display(tfidf_df)



TF-IDF Matrix (DataFrame view):


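|  | a | algebra | and | are | computations | data | eigenvalues | eigenvectors | equations | essential | for | fundamental | generate | important | in | is | linear | matrices | matrix | multiplication | of | operation | operations | predictions | properties | regression | reveal | science | solve | to | uses | vectors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sentence 1 | 0.032 | 0.077 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.077 | 0.0 | 0.0 | 0.077 | 0.077 | 0.032 | 0.0 | 0.0 | 0.077 | 0.0 | 0.077 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Sentence 2 | 0.0 | 0.0 | 0.0 | 0.077 | 0.077 | 0.077 | 0.0 | 0.0 | 0.0 | 0.077 | 0.077 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.077 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.077 | 0.0 | 0.0 | 0.0 | 0.077 |
| Sentence 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.063 | 0.0 | 0.0 | 0.0 | 0.063 | 0.0 | 0.0 | 0.0 | 0.026 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.063 | 0.063 | 0.0 | 0.063 | 0.0 | 0.0 | 0.063 | 0.063 | 0.063 | 0.0 |
| Sentence 4 | 0.032 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.077 | 0.077 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.077 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.077 | 0.0 | 0.0 | 0.0 | 0.077 | 0.0 | 0.077 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |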
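
To see where these numbers come from, here is one entry worked by hand, assuming the definitions used in the code above: TF is the word count divided by the sentence length, and IDF is \(\ln\big(N / (1 + \text{df})\big)\) with \(N = 4\) sentences and \(\text{df}\) the number of sentences containing the word. The word "algebra" appears once in sentence 1 (9 words) and in 1 of the 4 sentences:

$$
\text{TF-IDF}(\text{algebra}, s_1) = \frac{1}{9} \times \ln\frac{4}{1 + 1} \approx 0.111 \times 0.693 \approx 0.077
$$

The same formula explains the zeros in the "matrix" and "and" columns: each of these words appears in 3 of the 4 sentences, so its IDF is \(\ln\frac{4}{1 + 3} = \ln 1 = 0\). Under this IDF variant, a word appearing in all 4 sentences would get a negative weight, since \(\ln\frac{4}{5} \approx -0.223\).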


### Cosine Similarity Between Sentences

Once we have the TF-IDF matrix, we can compute **similarity between sentences** using **cosine similarity**:

$$
\text{cosine\_sim}(s_i, s_j) = \frac{s_i \cdot s_j}{\|s_i\| \, \|s_j\|}
$$

- \(s_i\) and \(s_j\) are **row vectors** from the TF-IDF matrix.  
- The **dot product** \(s_i \cdot s_j\) measures alignment between two sentences.  
- Dividing by the norms removes the effect of vector length. For vectors with non-negative entries the similarity lies **between 0 and 1**; in general it lies between \(-1\) and \(1\).  
- This is a direct application of **matrix-vector operations** from linear algebra in NLP.

We can compute the **full similarity matrix** by multiplying the TF-IDF matrix with its transpose and normalizing by sentence norms.
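
Read literally, the sentence above says: form \(S S^\top\), then divide each entry by the product of the two row norms. The sketch below does that on a small made-up matrix `S` and checks that it agrees with normalizing the rows first. The matrix values and variable names are illustrative assumptions, not the TF-IDF matrix from this section.

```python
import numpy as np

# Made-up 3x4 matrix standing in for a TF-IDF matrix (rows = sentences)
S = np.array([
    [0.5, 0.0, 0.2, 0.0],
    [0.0, 0.3, 0.2, 0.1],
    [0.4, 0.0, 0.0, 0.3],
])

norms = np.linalg.norm(S, axis=1)   # one L2 norm per row

# Route 1: all pairwise dot products, then divide by norm_i * norm_j
sim_a = (S @ S.T) / np.outer(norms, norms)

# Route 2: scale each row to unit length, then take dot products
S_unit = S / norms[:, None]
sim_b = S_unit @ S_unit.T

print(np.allclose(sim_a, sim_b))    # both routes give the same matrix
print(np.round(sim_a, 3))
```

Both routes assume every row has a non-zero norm; a sentence whose weights are all zero would need separate handling.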


In [43]:
import numpy as np

# TF-IDF matrix: tfidf_matrix (rows=sentences, columns=words)

# Compute L2 norms of each sentence vector
norms = np.linalg.norm(tfidf_matrix, axis=1, keepdims=True)

# Normalize each row vector
normalized_tfidf = tfidf_matrix / norms

# Cosine similarity matrix: dot product of normalized vectors
cosine_sim_matrix = normalized_tfidf @ normalized_tfidf.T

# Display nicely using pandas
cosine_df = pd.DataFrame(cosine_sim_matrix, 
                         index=[f"Sentence {i+1}" for i in range(len(sentences))],
                         columns=[f"Sentence {i+1}" for i in range(len(sentences))])

pd.set_option('display.precision', 3)
print("Cosine Similarity Matrix:")
display(cosine_df)


Cosine Similarity Matrix:


|  | Sentence 1 | Sentence 2 | Sentence 3 | Sentence 4 |
|---|---|---|---|---|
| Sentence 1 | 1.0 | 0.0 | 0.024 | 0.028 |
| Sentence 2 | 0.0 | 1.0 | 0.0 | 0.0 |
| Sentence 3 | 0.024 | 0.0 | 1.0 | 0.0 |
| Sentence 4 | 0.028 | 0.0 | 0.0 | 1.0 |
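
One possible way to read this matrix is to look for the most similar pair of distinct sentences. The sketch below hard-codes the rounded values from the table above (an assumption made so the snippet runs on its own) and ignores the diagonal, since every sentence has similarity 1 with itself.

```python
import numpy as np

# Rounded values copied from the cosine similarity table above
cosine_sim = np.array([
    [1.0,   0.0, 0.024, 0.028],
    [0.0,   1.0, 0.0,   0.0],
    [0.024, 0.0, 1.0,   0.0],
    [0.028, 0.0, 0.0,   1.0],
])

off_diag = cosine_sim.copy()
np.fill_diagonal(off_diag, -np.inf)   # exclude self-similarity

# Position of the largest remaining entry (first occurrence in row-major order)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(f"Most similar pair: Sentence {i + 1} and Sentence {j + 1} "
      f"(cosine similarity {cosine_sim[i, j]:.3f})")
```

All off-diagonal values are small because the sentences share few words. Sentences 1 and 4 score highest only because both contain "a"; their other shared word, "matrix", has a TF-IDF weight of 0 here.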
