# 🧠 NLP Foundations Workshop: Vector Space Proximity

### 🔹 Introduction to Vector Space Proximity

The large majority of data on the Internet is **unstructured**: social media posts, emails, images, videos and audio files, for example.

If we want to **persist** all these media in a database, we can add **metadata** about them, such as file type or creation timestamp, or we can **tag** each file, or parts of it, so that it is easy to search for. We do this because it would be very difficult to identify media items from their low-level (byte) representations.

But what if we want to fully automate the process (i.e., remove the need to manually add features, like tags, to each media item)? We need another way to represent the semantics of digital media.

That is the reason why in **Information Retrieval (IR)** and **Natural Language Processing (NLP)**, we often represent documents and queries as **vectors** in a **high-dimensional space**, where:

* Each **dimension** corresponds to a **unique term** in the vocabulary.
* A **document** is represented by a **point** or a **vector** in the space.
* A **vector** is a list of weights (e.g., term frequencies, TF-IDF values) that describe the presence or importance of terms in a document or query.

---

#### 📘 Example 1: "Rich" and "Poor" Axes

![Vector Space Example: "Rich" and "Poor" Axes](./images/Fig1_CartesianVectorSpace.png)

Suppose our vocabulary only has two terms:

* `"rich"`
* `"poor"`

These two terms define a **2D Cartesian space**:

* The **x-axis** corresponds to the term **"rich"**.
* The **y-axis** corresponds to the term **"poor"**.

Each document is represented as a vector in this space:

* A document with many occurrences of “poor” and none of “rich” lies near the **y-axis**.
* A document that mentions both “rich” and “poor” lies in the **first quadrant**.
* A document with only “rich” is aligned along the **x-axis**.

The query "rich poor" contains each term once, so with raw term counts the **query vector** is $\vec{q} = (1, 1)$. It points in the direction of interest for the search engine.
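To make the mapping from text to coordinates concrete, here is a minimal sketch that counts the two vocabulary terms in a few documents. The document texts and the `to_vector` helper are invented for this illustration, and the tokenization is deliberately naive (lowercase, split on whitespace).

```python
from collections import Counter

VOCABULARY = ["rich", "poor"]  # axis order: x = "rich", y = "poor"


def to_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical documents, made up for this example
docs = {
    "only_poor": "poor poor poor families",
    "both": "rich and poor and rich",
    "only_rich": "rich rich investors",
}

vectors = {name: to_vector(text) for name, text in docs.items()}
query_vector = to_vector("rich poor")

for name, vec in vectors.items():
    print(name, vec)
print("query", query_vector)
```

Words outside the vocabulary ("families", "investors", "and") contribute nothing, which is why each document ends up as a point in the two-dimensional "rich"/"poor" plane.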

### 🔹 Euclidean Distance and Its Limitations

One might assume we can measure similarity using **Euclidean distance**:

$$
\text{Euclidean}( \vec{q}, \vec{d} ) = \sqrt{ \sum_{i=1}^{n} (q_i - d_i)^2 }
$$

However, this has problems in practice:

* If document $d_2$ contains more occurrences of both “rich” and “poor” than the query, its vector will have a **longer length**.
* As seen in the diagram, even though $d_2$ has strong content overlap with the query $q$, it may still be **farther away** in Euclidean terms than unrelated documents like $d_3$.
* This happens because Euclidean distance is dominated by **magnitude**, not direction.
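The effect can be checked with a few numbers. The sketch below uses hypothetical term counts (they are not read from the figure): a query that mentions each term once, a long document `d2` that points in the same direction as the query, and a short document `d3` that mentions only "rich".

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical (rich, poor) term counts, chosen for illustration
q = [1, 1]   # query: "rich poor"
d2 = [5, 5]  # long document, same direction as the query
d3 = [2, 0]  # short document that only mentions "rich"

dist_d2 = euclidean(q, d2)  # sqrt(32), about 5.66
dist_d3 = euclidean(q, d3)  # sqrt(2), about 1.41

print(f"distance(q, d2) = {dist_d2:.2f}")
print(f"distance(q, d3) = {dist_d3:.2f}")
```

Under these assumed counts, `d3` ends up closer to the query than `d2`, even though `d2` uses the query terms in exactly the same proportion.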

### 🔹 Angle as Similarity → Cosine Similarity

To solve this, we focus on **vector direction**, not length. We measure the **angle** between the document and query vectors using **Cosine Similarity**:

$$
\cos(\vec{q}, \vec{d}) = \frac{ \vec{q} \cdot \vec{d} }{ \|\vec{q}\| \cdot \|\vec{d}\| }
= \frac{ \sum_{i=1}^{n} q_i \cdot d_i }{ \sqrt{ \sum_{i=1}^{n} q_i^2 } \cdot \sqrt{ \sum_{i=1}^{n} d_i^2 } }
$$

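* For non-negative term weights (such as counts or TF-IDF values), this gives us a similarity score from **0 (orthogonal)** to **1 (identical direction)**. For vectors with negative components, the score ranges from **-1 (opposite direction)** to 1.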
* Longer documents that are semantically aligned still get **high similarity**.
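The formula translates almost line by line into code. This is a minimal, dependency-free sketch; the vectors are hypothetical (rich, poor) term counts chosen for illustration, and raising an error for zero vectors is a design choice made here, not something the formula prescribes.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction, so the angle is undefined
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical (rich, poor) term counts, chosen for illustration
q = [1, 1]   # query: "rich poor"
d2 = [5, 5]  # long document, same direction as the query
d3 = [2, 0]  # short document that only mentions "rich"

sim_d2 = cosine_similarity(q, d2)  # about 1.0: same direction
sim_d3 = cosine_similarity(q, d3)  # about 0.71: 45 degrees apart

print(f"cos(q, d2) = {sim_d2:.3f}")
print(f"cos(q, d3) = {sim_d3:.3f}")
```

With these numbers, the long document `d2` ranks first because only its direction matters, not its length.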

### 🔹 Why Cosine Similarity Works Better

* **Angle** captures **semantic alignment**.
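* It is **not affected** by vector length, so a longer document that uses terms in the same proportions scores the same.
* Example: appending a copy of document $d$ to itself to make $d'$ doubles every term count, so the Euclidean distance between $d$ and $d'$ grows from 0 to $\|\vec{d}\|$, but their **cosine similarity remains 1**.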
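The duplication example can be verified directly from the definitions, assuming that $d'$ is $d$ concatenated with itself, so that every term count doubles and $\vec{d'} = 2\vec{d}$:

$$
\cos(\vec{d}, \vec{d'}) = \frac{ \vec{d} \cdot 2\vec{d} }{ \|\vec{d}\| \cdot \|2\vec{d}\| }
= \frac{ 2\|\vec{d}\|^2 }{ 2\|\vec{d}\|^2 } = 1,
\qquad
\text{Euclidean}( \vec{d}, \vec{d'} ) = \| \vec{d} - 2\vec{d} \| = \|\vec{d}\|
$$

The same argument holds for any positive scaling factor, not only 2.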

Cosine similarity is at the core of:

* **Search ranking**
* **Embedding-based retrieval**
* **LLM scoring and attention mechanisms** (attention uses the closely related scaled dot product rather than cosine similarity itself)

Sample code:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Define the documents and the query
documents = [
    "Ranks of starving poets swell",       # d1
    "Rich poor gap grows",                 # d2
    "Record baseball salaries in 2010"     # d3
]

query = ["rich poor"]                     # q

# Create a CountVectorizer to convert text to term frequency vectors
vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents + query).toarray()

# Separate vectors
doc_matrix = doc_vectors[:3]  # d1, d2, d3
query_vector = doc_vectors[3].reshape(1, -1)  # q

# Compute cosine similarity
cosine_similarities = cosine_similarity(query_vector, doc_matrix).flatten()

# Create a DataFrame to show results
df = pd.DataFrame({
    'Document': ['Doc1', 'Doc2', 'Doc3'],
    'Cosine Similarity with Query': cosine_similarities
})

print("Query: ", query)

# Sort by similarity, highest first
df.sort_values(by='Cosine Similarity with Query', ascending=False, inplace=True)
df.reset_index(drop=True, inplace=True)

# Display the result
df
```


Output:

`Query:  ['rich poor']`

| | Document | Cosine Similarity with Query |
|---|---|---|
| 0 | Doc2 | 0.707107 |
| 1 | Doc1 | 0.0 |
| 2 | Doc3 | 0.0 |


### 📘 Example 2: Word Vectors in a Small Corpus

Let's start with a small corpus of just six words, each represented by a vector in 3D space:

```plaintext
CAT     → [ 0.2, -0.4,  0.7]
DOG     → [ 0.6,  0.1,  0.5]
APPLE   → [ 0.8, -0.2, -0.3]
ORANGE  → [ 0.7, -0.1, -0.6]
HAPPY   → [-0.5,  0.9,  0.2]
SAD     → [ 0.4, -0.7, -0.5]
```

Each term is represented by a **vector in 3D space**.

### 🔍 Observations

- Words with **similar meanings** tend to have **similar vector representations**.
  - For example, **APPLE** and **ORANGE** are close in vector space, reflecting their semantic similarity.

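- In this toy example, words with **opposite meanings** have **vectors pointing in nearly opposite directions**.
  - For instance, **HAPPY** and **SAD** have contrasting vectors, indicating their opposing emotional tones. Learned embeddings do not always behave this way, because antonyms often appear in similar contexts.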

![3D Visualization of Word Vectors](./images/Fig2_3DVisualizationWordVectors.png)
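These observations can be checked numerically by computing the cosine similarity between pairs of the toy vectors listed above. The sketch below is dependency-free; the `cosine_similarity` helper is written here for illustration.

```python
import math

# Toy 3D word vectors from the example above
word_vectors = {
    "CAT":    [0.2, -0.4, 0.7],
    "DOG":    [0.6, 0.1, 0.5],
    "APPLE":  [0.8, -0.2, -0.3],
    "ORANGE": [0.7, -0.1, -0.6],
    "HAPPY":  [-0.5, 0.9, 0.2],
    "SAD":    [0.4, -0.7, -0.5],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


apple_orange = cosine_similarity(word_vectors["APPLE"], word_vectors["ORANGE"])
happy_sad = cosine_similarity(word_vectors["HAPPY"], word_vectors["SAD"])
cat_dog = cosine_similarity(word_vectors["CAT"], word_vectors["DOG"])

print(f"APPLE vs ORANGE: {apple_orange:.3f}")  # about 0.93
print(f"HAPPY vs SAD:    {happy_sad:.3f}")     # about -0.93
print(f"CAT vs DOG:      {cat_dog:.3f}")       # about 0.66
```

A value near 1 means the vectors point the same way, a value near -1 means they point in opposite directions, and the negative score is only possible because these vectors have negative components.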



Vector representations are also called **Embeddings**.

There are several approaches that **word embedding methods** use to generate effective vector representations.

One of them is **frequency-based embeddings**: word representations derived from how often words occur in a corpus. They are based on the idea that the **importance** or **significance** of a word can be inferred from **how frequently it occurs in the text**. One such embedding is **Term Frequency - Inverse Document Frequency (TF-IDF)**.

TF-IDF highlights words that are frequent within a specific document but are rare across the entire corpus. For example, in a document about music, it would emphasize words such as **rap**, **disco**, **pop**, **rock**. On the other hand, pronouns would receive a low TF-IDF score.
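A small worked example may help. The sketch below implements one common textbook variant, $\text{tf} \times \log(N / \text{df})$, on a made-up three-document corpus. The corpus and the `tf_idf` helper are assumptions for illustration. Libraries such as scikit-learn apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, invented for this example
corpus = [
    "they love rap and they love disco",
    "they read the news and they sleep",
    "they cook and they eat",
]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(tokens):
    """TF-IDF per term: relative term frequency times log(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


scores = tf_idf(tokenized[0])
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term:6s} {score:.3f}")
```

In this toy corpus the pronoun "they" appears in every document, so its inverse document frequency is $\log(3/3) = 0$ and its score is zero, while "love", "rap" and "disco" occur only in the first document and receive positive scores.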

There are various models for generating word embeddings.

### 🔹 Curriculum Learning (9): Word Embeddings with Word2Vec

**Word2Vec** is one of the most influential models for learning **dense vector representations** of words, also known as **embeddings**.

Unlike frequency-based models like TF-IDF, Word2Vec uses a **neural network** to learn word vectors such that **similar words have similar embeddings**.

There are two main architectures:

* **CBOW (Continuous Bag of Words)**: Predicts a word from its context.
* **Skip-gram**: Predicts context words from a target word.

Both approaches rely on the **distributional hypothesis**: words that appear in similar contexts tend to have similar meanings.
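The difference between the two architectures is easiest to see in the training examples each one is built from. The sketch below only generates those examples from a sliding context window. It does not train a network, and the `training_pairs` helper, the sample sentence and the window size are assumptions made for illustration.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in a sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


sentence = "the quick brown fox jumps".split()

# CBOW: the input is the list of context words, the label is the target word
cbow_examples = list(training_pairs(sentence, window=1))

# Skip-gram: the input is the target word, the label is one context word
skipgram_examples = [
    (target, context_word)
    for context, target in cbow_examples
    for context_word in context
]

print("CBOW examples:")
for context, target in cbow_examples:
    print(f"  {context} -> {target}")

print("Skip-gram examples:")
for target, context_word in skipgram_examples:
    print(f"  {target} -> {context_word}")
```

Both example sets come from the same windows, which is how the distributional hypothesis enters: words that share many context words end up with similar training signals.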

### 💻 Code Challenge: Learn Word Embeddings Using Word2Vec

#### 🚀 Your Task:

Write Python code that:

1. Prepares a small corpus of tokenized sentences.
2. Trains a **Word2Vec** model on this corpus using Gensim.
3. Displays the vector representation for a few words.
4. Finds the most similar words to a chosen term.

#### 📚 Hints:

* Use `from gensim.models import Word2Vec`
* Tokenize your corpus as a list of word lists (sentences).
* Try: `model.wv['word']`, `model.wv.most_similar('word')`

**Example Questions to Explore:**

* What is the shape of a word vector?
* Which words are closest to "learning", "data", or "model"?
* Can Word2Vec capture analogies (e.g., "king" - "man" + "woman")?

Try it out and see what your model learns! 🎯


### 🔹 Use Cases

- Long-term memory for LLMs.
- Semantic search: based on meaning or context.
- Similarity search for text, images, audio or video data.
- Ranking and/or recommendation engines.


### 🔹 LangChain

LangChain is a framework for developing applications powered by Large Language Models (LLMs). It was designed and implemented to be:

- Data-aware: connecting a language model to other sources of data.
- Agentic: allowing a model to interact with its environment.