In [5]:
# --- MODULE 2: VECTORIZATION & SIMILARITY ---
# Goal: Convert text to numerical vectors, then measure how similar they are.

# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer # For converting text to TF-IDF numerical vectors
from sklearn.metrics.pairwise import cosine_similarity       # For calculating similarity between vectors
import pandas as pd                                          # For data manipulation and displaying results

In [6]:
# 1. THE KNOWLEDGE BASE (Cleaned Corpus)
# These represent documents the AI "knows" and will search through.
corpus = [
    "ai future technology",
    "students learn artificial intelligence",
    "machine learning transforming education"
]
print("\n--- KNOWLEDGE BASE (CORPUS) ---")
print(f"Corpus: {corpus}")
print(f"Number of documents in corpus: {len(corpus)}")


--- KNOWLEDGE BASE (CORPUS) ---
Corpus: ['ai future technology', 'students learn artificial intelligence', 'machine learning transforming education']
Number of documents in corpus: 3


In [7]:
# 2. THE USER QUERY
# This is what a user asks the AI Agent.
query = ["learning about ai"]
print("\n--- USER QUERY ---")
print(f"Query: {query}")

# 3. VECTORIZATION (TF-IDF)
# We convert both the knowledge base AND the query into a numerical (math) format.
# Initialize the TF-IDF Vectorizer.
vectorizer = TfidfVectorizer()

# Learn the vocabulary from the corpus and transform the corpus into TF-IDF vectors.
tfidf_matrix = vectorizer.fit_transform(corpus)

# Transform the user query into a TF-IDF vector using the *same* vocabulary learned from the corpus.
query_vector = vectorizer.transform(query)

print("\n--- VECTORIZATION RESULTS ---")
print(f"Vectorizer vocabulary (feature names): {vectorizer.get_feature_names_out()}")
print(f"Shape of TF-IDF matrix for corpus (documents x features): {tfidf_matrix.shape}")
print(f"TF-IDF matrix for corpus (sparse format):\n{tfidf_matrix}")
print(f"Shape of TF-IDF vector for query (1 x features): {query_vector.shape}")
print(f"TF-IDF vector for query (sparse format):\n{query_vector}")


--- USER QUERY ---
Query: ['learning about ai']

--- VECTORIZATION RESULTS ---
Vectorizer vocabulary (feature names): ['ai' 'artificial' 'education' 'future' 'intelligence' 'learn' 'learning'
 'machine' 'students' 'technology' 'transforming']
Shape of TF-IDF matrix for corpus (documents x features): (3, 11)
TF-IDF matrix for corpus (sparse format):
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 11 stored elements and shape (3, 11)>
  Coords	Values
  (0, 0)	0.5773502691896257
  (0, 3)	0.5773502691896257
  (0, 9)	0.5773502691896257
  (1, 8)	0.5
  (1, 5)	0.5
  (1, 1)	0.5
  (1, 4)	0.5
  (2, 7)	0.5
  (2, 6)	0.5
  (2, 10)	0.5
  (2, 2)	0.5
Shape of TF-IDF vector for query (1 x features): (1, 11)
TF-IDF vector for query (sparse format):
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2 stored elements and shape (1, 11)>
  Coords	Values
  (0, 0)	0.7071067811865476
  (0, 6)	0.7071067811865476


In [8]:
# 4. SIMILARITY CALCULATION (The AI's Decision)
# We calculate how 'close' (similar) the query is to each sentence in our knowledge base using cosine similarity.
similarities = cosine_similarity(query_vector, tfidf_matrix)

print("\n--- SIMILARITY CALCULATION ---")
print(f"Cosine similarities between query and each corpus document: {similarities}")

# 5. DISPLAY RESULTS
# Create a pandas DataFrame to present the results clearly.
results = pd.DataFrame({
    'Knowledge Base Sentence': corpus,
    'Similarity Score': similarities[0] # similarities is a 2D array, so we take the first (and only) row.
})

print("\n✨ AI SEARCH RESULTS (Similarity Based):")
# Sort the results by similarity score in descending order to show the most relevant first.
print(results.sort_values(by='Similarity Score', ascending=False))


--- SIMILARITY CALCULATION ---
Cosine similarities between query and each corpus document: [[0.40824829 0.         0.35355339]]

✨ AI SEARCH RESULTS (Similarity Based):
                   Knowledge Base Sentence  Similarity Score
0                     ai future technology          0.408248
2  machine learning transforming education          0.353553
1   students learn artificial intelligence          0.000000


This code block imports the necessary libraries for our task:
- `TfidfVectorizer` from `sklearn.feature_extraction.text`: Used to convert text into numerical TF-IDF vectors.
- `cosine_similarity` from `sklearn.metrics.pairwise`: Used to calculate the similarity between two vectors.
- `pandas` as `pd`: Used for data manipulation and creating DataFrames to display results.

This block defines our `corpus`:
- `corpus`: A list of strings, each representing a document or sentence in our knowledge base. This is the information the AI will 'know' and search through.

This block handles the user query and vectorization:
- `query = ["learning about ai"]`: Defines the user's input. It is wrapped in a list because `transform` expects an iterable of documents, even when there is only one.
- `vectorizer = TfidfVectorizer()`: Initializes the TF-IDF vectorizer. This object will learn the vocabulary from our `corpus`.
- `tfidf_matrix = vectorizer.fit_transform(corpus)`: This is a crucial step. `fit` learns all unique words (vocabulary) and their inverse document frequencies from the `corpus`. `transform` then converts each sentence in the `corpus` into a numerical vector based on these learned statistics.
- `query_vector = vectorizer.transform(query)`: The `transform` method is called again, but this time only on the `query`. It uses the *same vocabulary and IDF values* learned from the `corpus` to convert the query into a numerical vector, ensuring consistency for comparison.

This block calculates and displays the similarity results:
- `similarities = cosine_similarity(query_vector, tfidf_matrix)`: This function calculates the cosine similarity between the `query_vector` and each row (sentence vector) in the `tfidf_matrix`. The result is a 2D array of shape `(1, 3)`: one row for the query and one column per corpus sentence, which is why the code later takes `similarities[0]`.
- `results = pd.DataFrame(...)`: A pandas DataFrame is created to organize and present the results. It pairs each original sentence from the `corpus` with its corresponding `Similarity Score`.
- `print("\n✨ AI SEARCH RESULTS (Similarity Based):")`: Prints a descriptive header, preceded by a blank line.
- `print(results.sort_values(by='Similarity Score', ascending=False))`: Prints the DataFrame, sorted by 'Similarity Score' in descending order, so the most relevant sentences appear at the top.
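The steps above can be bundled into a single function. This is only a sketch: the `search` helper, its `top_k` parameter, and the choice to drop zero-score rows are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def search(query, corpus, top_k=2):
    """Hypothetical helper: rank corpus sentences against one query string."""
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(corpus)

    # transform() expects an iterable of documents, so wrap the string.
    query_vector = vectorizer.transform([query])

    # cosine_similarity returns shape (1, n_documents); take the only row.
    scores = cosine_similarity(query_vector, tfidf_matrix)[0]

    results = pd.DataFrame({
        "Knowledge Base Sentence": corpus,
        "Similarity Score": scores,
    })
    # Assumption: sentences sharing no terms with the query are not useful hits.
    results = results[results["Similarity Score"] > 0]
    return results.sort_values("Similarity Score", ascending=False).head(top_k)


corpus = [
    "ai future technology",
    "students learn artificial intelligence",
    "machine learning transforming education",
]

hits = search("learning about ai", corpus)
print(hits)
```

Refitting the vectorizer on every call keeps the sketch short; with a larger corpus one would more likely fit once and reuse the fitted vectorizer and matrix.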

### 1. THE KNOWLEDGE BASE (Cleaned Corpus)
The `corpus` represents our "knowledge base" – a collection of documents or sentences that the AI system has access to. In a real-world scenario, this could be articles, books, or a vast database of information.

### 2. THE USER QUERY & 3. VECTORIZATION (TF-IDF)
The `query` is the user's input, the question they are asking the AI. To compare this text query with our knowledge base, we need to convert both into a numerical format that computers can understand and process. This process is called **vectorization**.

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a statistical measure of how relevant a word is to a document within a collection of documents (our corpus). A word's weight rises with how often it appears in the document (term frequency) and falls with how many documents in the corpus contain it (inverse document frequency), so words that appear almost everywhere contribute little.
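To make the weights in the output less of a black box, the sketch below recomputes them by hand with scikit-learn's default settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. Splitting on whitespace is an assumption that stands in for the library's tokenizer; it happens to give the same tokens for this small, already cleaned corpus. The `tfidf_vector` helper is hypothetical.

```python
import math

corpus = [
    "ai future technology",
    "students learn artificial intelligence",
    "machine learning transforming education",
]

# Assumption: plain whitespace tokenization is enough for this cleaned corpus.
docs = [sentence.split() for sentence in corpus]
n_docs = len(docs)
vocab = sorted({token for doc in docs for token in doc})

# Document frequency: in how many documents does each term occur?
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency (scikit-learn's default).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_vector(tokens):
    """Hypothetical helper: raw tf * idf, then scale to unit L2 length."""
    raw = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]


doc0 = tfidf_vector(docs[0])
print({term: round(value, 4) for term, value in zip(vocab, doc0) if value})
# → {'ai': 0.5774, 'future': 0.5774, 'technology': 0.5774}
```

Every term in this corpus occurs in exactly one document, so all IDF values are equal and the normalization alone determines the weights: `1/sqrt(3) ≈ 0.577` for the three-word sentence and `1/sqrt(4) = 0.5` for the four-word ones.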

The `TfidfVectorizer` object will learn the vocabulary from our `corpus` and then transform both the corpus and the query into numerical vectors.

After vectorization, we get:
- **`tfidf_matrix`**: This is a numerical representation of our entire `corpus`. Each row corresponds to a sentence in the corpus, and each column corresponds to a word in the vocabulary, with values representing the TF-IDF score for that word in that sentence.
- **`query_vector`**: This is the numerical representation of the user's `query`, transformed using the same vocabulary learned from the corpus.

### 4. SIMILARITY CALCULATION (The AI's Decision - Cosine Similarity)
Now that both the `query` and the `corpus` sentences are represented as numerical vectors, we can calculate how "similar" they are. **Cosine Similarity** measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. The closer the value is to 1, the smaller the angle and the higher the similarity. A value of 0 means the vectors are orthogonal (they share no terms), and in general -1 means they point in opposite directions. Because TF-IDF weights are never negative, the scores here always fall between 0 and 1.
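As a check on the numbers, the cosine can be computed directly as the dot product divided by the product of the vector lengths. The sketch below rebuilds the vectors from the coordinates and values printed in the vectorization output above; the `cosine` helper is hypothetical and written with NumPy rather than scikit-learn.

```python
import numpy as np


def cosine(a, b):
    """Hypothetical helper: dot(a, b) / (|a| * |b|), or 0.0 for a zero vector."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Column order follows the printed vocabulary:
# 0 ai, 1 artificial, 2 education, 3 future, 4 intelligence, 5 learn,
# 6 learning, 7 machine, 8 students, 9 technology, 10 transforming
query = np.zeros(11)
query[[0, 6]] = 1 / np.sqrt(2)       # "learning about ai"

doc0 = np.zeros(11)
doc0[[0, 3, 9]] = 1 / np.sqrt(3)     # "ai future technology"

doc1 = np.zeros(11)
doc1[[1, 4, 5, 8]] = 0.5             # "students learn artificial intelligence"

doc2 = np.zeros(11)
doc2[[2, 6, 7, 10]] = 0.5            # "machine learning transforming education"

for name, doc in [("doc0", doc0), ("doc1", doc1), ("doc2", doc2)]:
    print(name, round(cosine(query, doc), 6))
# → doc0 0.408248
# → doc1 0.0
# → doc2 0.353553
```

Each non-zero score comes from a single shared term: `ai` for the first sentence and `learning` for the third.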

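In essence, we are measuring the directional similarity between the query vector and each sentence vector in our knowledge base to find which sentences share the most heavily weighted terms with the query. The match is lexical rather than semantic, which is why "students learn artificial intelligence" scores 0 for this query.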
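To illustrate how term overlap drives the ranking, the sketch below queries the same corpus with `ai` and with `artificial intelligence`. A human reader treats these as the same topic, but the two queries retrieve different sentences because they share no vocabulary terms. The `best` dictionary is an illustrative name, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "ai future technology",
    "students learn artificial intelligence",
    "machine learning transforming education",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

best = {}
for q in ["ai", "artificial intelligence"]:
    scores = cosine_similarity(vectorizer.transform([q]), tfidf_matrix)[0]
    # Record the top-scoring sentence for each phrasing of the query.
    best[q] = corpus[scores.argmax()]
    print(f"{q!r}: scores={scores.round(3)} -> {best[q]!r}")
```

Closing this gap needs something beyond plain TF-IDF, such as normalizing terms before vectorization so that related word forms and abbreviations map to the same token.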