# Text Vectorization and Similarity Analysis

This notebook shows how to convert text into numeric vectors (Bag of Words and TF-IDF), measure similarity between documents with cosine similarity, and look up nearest neighbors.

## Setup

In [8]:
!pip install pandas scikit-learn nltk numpy



In [9]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

# Download required data
# Note: newer NLTK releases may also need nltk.download('punkt_tab') for word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Setup preprocessing tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


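def clean_text(text):
    """Lowercase, strip punctuation, remove stopwords, and lemmatize a string."""
    text = text.lower().strip()

    # Drop everything except letters, digits, and whitespace.
    # This also deletes apostrophes, so "don't" becomes "dont".
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize, then remove English stopwords
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]

    # Reduce each token to its WordNet lemma (treated as a noun by default)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return ' '.join(tokens)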

[nltk_data] Downloading package punkt to /home/sudarshan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/sudarshan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/sudarshan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Sample Documents

In [10]:
raw_documents = [
    "I love machine learning. Machine learning is cool",
    "Machine learning is amazing", 
    "I hate bad weather",
    "The weather is great today",
    "Python programming is relly fun",
    "I enjoy learning data science with Python, don't you?"
]

# Clean the documents
documents = [clean_text(doc) for doc in raw_documents]

print("Documents after cleaning:")

for doc in documents:
    print(doc)

Documents after cleaning:
love machine learning machine learning cool
machine learning amazing
hate bad weather
weather great today
python programming relly fun
enjoy learning data science python dont


## 1. Bag of Words (BoW)

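**What is it?** Represent each document as a vector of word counts, with one column per vocabulary word. Word order is discarded; only how many times each word appears is kept.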
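
Before reaching for scikit-learn, the counting step can be sketched by hand. The snippet below is a minimal illustration using only the standard library; the names `toy_docs`, `toy_vocab`, and `toy_vectors` are made up for this example and are not part of the notebook's pipeline. It uses the first two cleaned documents as input.

```python
from collections import Counter

# First two cleaned documents from this notebook (illustrative subset)
toy_docs = [
    "love machine learning machine learning cool",
    "machine learning amazing",
]

# Vocabulary: every distinct word across the documents, sorted alphabetically
toy_vocab = sorted({word for doc in toy_docs for word in doc.split()})

# One count vector per document, with columns in vocabulary order
toy_vectors = []
for doc in toy_docs:
    counts = Counter(doc.split())
    toy_vectors.append([counts[word] for word in toy_vocab])

print(toy_vocab)
for vector in toy_vectors:
    print(vector)
```

`CountVectorizer` does essentially this, plus its own tokenization, and returns the result as a sparse matrix.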

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Create BoW vectors
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(documents)

# Show the vocabulary
vocab = bow_vectorizer.get_feature_names_out()
print(f"Vocabulary: {list(vocab)}")

# Convert to DataFrame to see it clearly
bow_df = pd.DataFrame(
    bow_matrix.toarray(), 
    columns=vocab,
)

bow_df

Vocabulary: ['amazing', 'bad', 'cool', 'data', 'dont', 'enjoy', 'fun', 'great', 'hate', 'learning', 'love', 'machine', 'programming', 'python', 'relly', 'science', 'today', 'weather']


doc,amazing,bad,cool,data,dont,enjoy,fun,great,hate,learning,love,machine,programming,python,relly,science,today,weather
0,0,0,1,0,0,0,0,0,0,2,1,2,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1
4,0,0,0,0,0,0,1,0,0,0,0,0,1,1,1,0,0,0
5,0,0,0,1,1,1,0,0,0,1,0,0,0,1,0,1,0,0


## 2. TF-IDF

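**What is it?** A weighting scheme that gives a word a higher score when it is:
- Frequent within a given document (term frequency, TF)
- Rare across the collection as a whole (inverse document frequency, IDF)

**Why?** A word that appears often in one document but in few others helps tell that document apart, whereas a word that appears in nearly every document carries little information.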

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Convert to DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(), 
    columns=tfidf_vectorizer.get_feature_names_out(),
)

tfidf_df

doc,amazing,bad,cool,data,dont,enjoy,fun,great,hate,learning,love,machine,programming,python,relly,science,today,weather
0,0.0,0.0,0.389047,0.0,0.0,0.0,0.0,0.0,0.0,0.538684,0.389047,0.638048,0.0,0.0,0.0,0.0,0.0,0.0
1,0.681722,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.471964,0.0,0.559022,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.611713,0.0,0.0,0.0,0.0,0.0,0.0,0.611713,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.501613
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.611713,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.611713,0.501613
4,0.0,0.0,0.0,0.0,0.0,0.0,0.521823,0.0,0.0,0.0,0.0,0.0,0.521823,0.427903,0.521823,0.0,0.0,0.0
5,0.0,0.0,0.0,0.440579,0.440579,0.440579,0.0,0.0,0.0,0.305018,0.0,0.0,0.0,0.361281,0.0,0.440579,0.0,0.0
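
To see where these numbers come from, row 1 ("machine learning amazing") can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: smoothed IDF, computed as `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The document frequencies are read off the Bag of Words table; the names `smooth_idf` and `manual_tfidf` are illustrative only.

```python
import math

n_docs = 6  # number of documents in the collection

# Number of documents containing each word (from the BoW table)
doc_freq = {"amazing": 1, "learning": 3, "machine": 2}

# Raw counts of each word in document 1
term_counts = {"amazing": 1, "learning": 1, "machine": 1}


def smooth_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1


# Unnormalized weight: term count times IDF
raw = {t: term_counts[t] * smooth_idf(doc_freq[t], n_docs) for t in term_counts}

# L2-normalize so the document vector has length 1
norm = math.sqrt(sum(w * w for w in raw.values()))
manual_tfidf = {t: w / norm for t, w in raw.items()}

for term, weight in manual_tfidf.items():
    print(f"{term}: {weight:.6f}")
```

The printed weights should agree with row 1 of the table up to rounding. "amazing" gets the largest weight because it occurs in only one document, while "learning" occurs in three.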


## 3. Cosine Similarity

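**What is it?** A measure of how similar two documents are, computed as the cosine of the angle between their vectors. For non-negative vectors such as word counts or TF-IDF weights, it ranges from 0 (no words in common) to 1 (same direction).

**Why use it?** It depends only on the direction of the vectors, not their magnitude, so a long document and a short one that use words in the same proportions still score as highly similar.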

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tfidf_matrix)

# Make it into a nice table
similarity_df = pd.DataFrame(
    similarity_matrix,
)

similarity_df

doc,0,1,2,3,4,5
0,1.0,0.610922,0.0,0.0,0.0,0.164308
1,0.610922,1.0,0.0,0.0,0.0,0.143958
2,0.0,0.0,1.0,0.251616,0.0,0.0
3,0.0,0.0,0.251616,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.154593
5,0.164308,0.143958,0.0,0.0,0.154593,1.0
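
As a check on one entry, the similarity between documents 0 and 1 can be recomputed directly from the definition, cos(u, v) = (u · v) / (‖u‖ ‖v‖). This is a minimal sketch: the helper `cosine` is illustrative, and the vectors are the rounded TF-IDF values from the table above, restricted to the five words that occur in either document.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Columns: amazing, cool, learning, love, machine
doc0 = [0.0, 0.389047, 0.538684, 0.389047, 0.638048]
doc1 = [0.681722, 0.0, 0.471964, 0.0, 0.559022]

sim_01 = cosine(doc0, doc1)

# Scaling a vector changes its length but not its direction
sim_scaled = cosine([10 * x for x in doc0], doc1)

print(f"similarity(doc0, doc1): {sim_01:.4f}")
print(f"after scaling doc0 by 10: {sim_scaled:.4f}")
```

Both values should be about 0.61, matching the table. Only "learning" and "machine" contribute to the dot product, since they are the only words the two documents share.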


## 4. KNN (K-Nearest Neighbors)

K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms.

Unlike models that fit parameters during training, KNN is a lazy learner:
- It does not fit a model ahead of time.
- It stores all of the training data.
- When asked to make a prediction, it finds the k stored points closest to the new input and predicts from them (a majority vote for classification, an average for regression).
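
The steps above can be sketched as a tiny classifier in plain Python. This is an illustration, not a production implementation: the labels `"A"` and `"B"` are hypothetical (the notebook's points are unlabeled), and `knn_predict` is a made-up helper name. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

# "Training" is just storing the data
train_points = [(1, 2), (2, 3), (3, 1), (6, 5), (7, 7), (8, 6)]
train_labels = ["A", "A", "A", "B", "B", "B"]  # hypothetical labels for illustration


def knn_predict(query, points, labels, k=3):
    """Predict a label by majority vote among the k nearest stored points."""
    # Rank stored points by Euclidean distance to the query
    ranked = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    nearest = ranked[:k]
    # Majority vote over the labels of the k nearest points
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


predicted = knn_predict((5, 5), train_points, train_labels)
print(f"Predicted label for (5, 5): {predicted}")
```

All of the work happens at prediction time, which is why every query costs a pass over the stored data unless an index structure is used.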

In [14]:
import numpy as np
from sklearn.neighbors import NearestNeighbors

points = np.array([
    [1, 2],
    [2, 3],
    [3, 1],
    [6, 5],
    [7, 7],
    [8, 6]
])

# The new point
query_point = np.array([[5, 5]])

nn = NearestNeighbors(n_neighbors=3, metric='euclidean')
nn.fit(points)

distances, indices = nn.kneighbors(query_point)

# Print results
print("3 nearest neighbors:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. Point: {points[idx]}, Distance: {distances[0][i]:.2f}")

3 nearest neighbors:
1. Point: [6 5], Distance: 1.00
2. Point: [7 7], Distance: 2.83
3. Point: [8 6], Distance: 3.16
