## What is Cosine Similarity?

Cosine similarity intended to formulate the similarity between two different text via the cosine function in trigonometry. In this approach, where the texts are considered as vectors the relationship of the two vectors to each other is expressed at an angle. For two vectors that are completely identical, the cosine similarity will be 1. For vectors that are completely unrelated, this value will be 0. If there is an opposite relationship between the two vectors, this time the cosine similarity value will be -1. (cos0 = 1, cos90 = 0, cos180 = -1)
<br>
![image.png](attachment:image.png)

Let's say A and B are text vectors then we will calculate the similarity by using formula below.
<br>
![image.png](attachment:image.png)

### Prepare Data

In [14]:
import numpy as np
import pandas as pd
import os

In [15]:
with open("glove.6B.50d.txt", "r", encoding="utf-8") as file:
    data = file.readlines()

In [16]:
len(data)

400000

In [17]:
for i in range(len(data)):
    data[i] = data[i][:-1]

In [18]:
data_dict = dict()

for i in range(len(data)):
    split_data = data[i].split()
    data_dict[split_data[0]] = np.array(split_data[1:]).astype('float64')

In [19]:
data_dict["the"]

array([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
       -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
        2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
        1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
       -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
       -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
        4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
        7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
       -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
        1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01])

### Cosine Similarity Example

In [20]:
def cosine_similarity(a, b):
    a = np.array(a, dtype = float)
    b = np.array(b, dtype = float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    a = np.array(a, dtype = float)
    b = np.array(b, dtype = float)
    return np.linalg.norm(a-b)

In [21]:
table = data_dict["table"]
desk = data_dict["desk"]
football = data_dict["football"]
baseball = data_dict["baseball"]
water = data_dict["water"]
fire = data_dict["fire"]
computer = data_dict["computer"]
calculator = data_dict["calculator"]
number = data_dict["number"]
math = data_dict["math"]
boy = data_dict["boy"]
girl = data_dict["girl"]
sad = data_dict["sad"]
happy = data_dict["happy"]
good = data_dict["good"]
bad = data_dict["bad"]
turkey = data_dict["turkey"]
television = data_dict["television"]
awesome = data_dict["awesome"]
great = data_dict["great"]
coffee = data_dict["coffee"]
giraffe = data_dict["giraffe"]
cat = data_dict["cat"]
barcelona = data_dict["barcelona"]
school = data_dict["school"]
disaster = data_dict["disaster"]

print(f"Similarity and Distance for pair (table, desk) = {cosine_similarity(table, desk), euclidean_distance(table, desk)}")
print(f"Similarity and Distance for pair (football, baseball) = {cosine_similarity(football, baseball), euclidean_distance(football, baseball)}")
print(f"Similarity and Distance for pair (water, fire) = {cosine_similarity(water, fire), euclidean_distance(water, fire)}")
print(f"Similarity and Distance for pair (computer, calculator) = {cosine_similarity(computer, calculator), euclidean_distance(computer, calculator)}")
print(f"Similarity and Distance for pair (number, math) = {cosine_similarity(number, math), euclidean_distance(number, math)}")
print(f"Similarity and Distance for pair (boy, girl) = {cosine_similarity(boy, girl), euclidean_distance(boy, girl)}")
print(f"Similarity and Distance for pair (sad, happy) = {cosine_similarity(sad, happy), euclidean_distance(sad, happy)}")
print(f"Similarity and Distance for pair (good, bad) = {cosine_similarity(good, bad), euclidean_distance(good, bad)}")
print(f"Similarity and Distance for pair (turkey, television) = {cosine_similarity(turkey, television), euclidean_distance(turkey, television)}")
print(f"Similarity and Distance for pair (awesome, great) = {cosine_similarity(awesome, great), euclidean_distance(awesome, great)}")
print(f"Similarity and Distance for pair (coffee, giraffe) = {cosine_similarity(coffee, giraffe), euclidean_distance(coffee, giraffe)}")
print(f"Similarity and Distance for pair (cat, barcelona) = {cosine_similarity(cat, barcelona), euclidean_distance(cat, barcelona)}")
print(f"Similarity and Distance for pair (school, disaster) = {cosine_similarity(school, disaster), euclidean_distance(school, disaster)}")

Similarity and Distance for pair (table, desk) = (np.float64(0.5631253246562199), np.float64(4.704135012081877))
Similarity and Distance for pair (football, baseball) = (np.float64(0.7990507471765449), np.float64(3.718578193888761))
Similarity and Distance for pair (water, fire) = (np.float64(0.6159761736263326), np.float64(4.91751761375178))
Similarity and Distance for pair (computer, calculator) = (np.float64(0.5805204352195885), np.float64(5.00531584801259))
Similarity and Distance for pair (number, math) = (np.float64(0.3923536921031839), np.float64(6.120560145939014))
Similarity and Distance for pair (boy, girl) = (np.float64(0.9327198629646994), np.float64(2.0426333096686737))
Similarity and Distance for pair (sad, happy) = (np.float64(0.6890632230848218), np.float64(3.8399498989360525))
Similarity and Distance for pair (good, bad) = (np.float64(0.7964893661716318), np.float64(3.318890407049109))
Similarity and Distance for pair (turkey, television) = (np.float64(0.34783907275810

In [22]:
# Calculate and store all similarities
pairs = [
    ("table", "desk"),
    ("football", "baseball"),
    ("water", "fire"),
    ("computer", "calculator"),
    ("number", "math"),
    ("boy", "girl"),
    ("sad", "happy"),
    ("good", "bad"),
    ("turkey", "television"),
    ("awesome", "great"),
    ("coffee", "giraffe"),
    ("cat", "barcelona"),
    ("school", "disaster")
]

similarities = []
for word1, word2 in pairs:
    sim = cosine_similarity(data_dict[word1], data_dict[word2])
    similarities.append((word1, word2, sim))

# Sort by similarity in descending order
similarities.sort(key=lambda x: x[2], reverse=True)

# Display top 3
print("Top 3 pairs with highest cosine similarity:")
for i in range(3):
    print(f"{i+1}. ({similarities[i][0]}, {similarities[i][1]}): {similarities[i][2]:.4f}")

Top 3 pairs with highest cosine similarity:
1. (boy, girl): 0.9327
2. (football, baseball): 0.7991
3. (good, bad): 0.7965


## Highest Similarity: Which 3 pairs have the highest cosine similarity?  Does this align with human intuition?
The pairs are:
- **boy, girl (0.9327)** - Yes, this aligns well with human intuition as both words represent gender-based human categories and share many semantic features (young humans, children).
- **football, baseball (0.7991)** - Yes, this makes intuitive sense as both are popular sports involving balls and teams, sharing similar contexts and semantic fields.
- **good, bad (0.7965)** - This is interesting! While these are antonyms, the high similarity makes sense because they appear in similar contexts (moral judgments, quality assessments) and are often used to describe the same types of entities. They occupy related semantic spaces despite having opposite meanings.

**Overall:** Yes, these results largely align with human intuition. The embeddings capture that words appearing in similar contexts have high cosine similarity, even if they're antonyms (like good/bad). This demonstrates that cosine similarity in word embeddings measures distributional similarity (co-occurrence patterns) rather than pure semantic similarity.

## Why do you think vector embeddings place 'good' and 'bad' close together?

Antonyms like **good** and **bad** often appear in very similar contexts (e.g., “good/bad idea,” “good/bad weather”), so their embeddings capture **distributional similarity**, not polarity. Word vectors reflect co-occurrence patterns and usage environments, which makes opposites end up close in the embedding space despite having opposite meanings.

### Word Analogies

Now we will try to find a word that associates the other three words according to their analogies. For example; 
<br>
* (a, b) ---> (c, _) we'll try to find 'd'

For finding it we'll look if similarities (a,b) is like (c,d) by calculating differences of their similarities.
<br>
![image.png](attachment:image.png)
<br>


In [23]:
def find_word_cos(a, b, c, data_dict):
    a, b, c = a.lower(), b.lower(), c.lower()
    a_vector, b_vector, c_vector = data_dict[a], data_dict[b], data_dict[c]
    
    all_words = data_dict.keys()
    max_cosine_similarity = -1000
    best_match_word = None
    
    for word in all_words:
        if word in [a, b, c]:
            continue
            
        cos_sim = cosine_similarity(np.subtract(b_vector, a_vector), np.subtract(data_dict[word], c_vector))
        
        if cos_sim > max_cosine_similarity:
            max_cosine_similarity = cos_sim
            best_match_word = word
            
    return best_match_word, cos_sim

def find_word_euc(a, b, c, data_dict):
    a, b, c = a.lower(), b.lower(), c.lower()
    a_vector, b_vector, c_vector = data_dict[a], data_dict[b], data_dict[c]
    
    all_words = data_dict.keys()
    min_euc_dis = 1000
    best_match_word = None
    
    for word in all_words:
        if word in [a, b, c]:
            continue
            
        dis = euclidean_distance(np.subtract(b_vector, a_vector), np.subtract(data_dict[word], c_vector))
        
        if dis < min_euc_dis:
            min_euc_dis = dis
            best_match_word = word
            
    return best_match_word, dis

In [24]:
words_bag = [
    ('boy', 'girl', 'man'),
    ('bat', 'baseball', 'ball'),
    ('book', 'library', 'coffee'),
    ('orange', 'juice', 'apple'),
    ('turkey', 'turkish', 'colombia')
]

for words in words_bag:
    d, cos_sim = find_word_cos(*words, data_dict)
    print("({}, {}) ----> ({}, {}) with {} difference".format(*words, d, cos_sim))
print()
for words in words_bag:
    d, dis = find_word_euc(*words, data_dict)
    print("({}, {}) ----> ({}, {}) with {} difference".format(*words, d, dis))

(boy, girl) ----> (man, woman) with -0.03407576778243833 difference
(bat, baseball) ----> (ball, basketball) with 0.09564220586831744 difference
(book, library) ----> (coffee, heliospheric) with 0.10581179044448524 difference
(orange, juice) ----> (apple, juices) with -0.23517282945825477 difference
(turkey, turkish) ----> (colombia, colombian) with 0.17003994085954482 difference

(boy, girl) ----> (man, woman) with 7.687464673656914 difference
(bat, baseball) ----> (ball, basketball) with 8.987137685070092 difference
(book, library) ----> (coffee, warehouse) with 8.557083205392084 difference
(orange, juice) ----> (apple, processor) with 9.34871480514969 difference
(turkey, turkish) ----> (colombia, colombian) with 7.258751896969685 difference


## Comparison of Cosine Similarity vs. Euclidean Distance for Word Analogies:

Both techniques successfully capture some analogical relationships, but with varying degrees of accuracy:

**Success Cases (Both Methods):**
- **(boy, girl) → (man, woman)**: Both methods correctly identify "woman" as the analogous word, demonstrating that the gender relationship is well-preserved in the embedding space.
- **(turkey, turkish) → (colombia, colombian)**: Both correctly find "colombian", showing they capture country-to-language/nationality relationships.
- **(bat, baseball) → (ball, basketball)**: Both identify "basketball", capturing sport-equipment analogies.

**Problem Cases:**
- **(book, library) → (coffee, ?)**: Cosine similarity finds "heliospheric" (clearly incorrect), while Euclidean distance finds "warehouse" (somewhat reasonable as both are storage/location places). This suggests the analogy itself may be weak.
- **(orange, juice) → (apple, ?)**: Cosine similarity finds "juices" (partially correct - at least related to the concept), while Euclidean distance finds "processor" (incorrect). The expected answer would be "cider" or "juice".

**Key Observations:**
1. **Cosine similarity** focuses on the direction/angle between vectors, making it better for capturing semantic relationships independent of magnitude.
2. **Euclidean distance** considers both direction and magnitude, which can sometimes introduce noise in high-dimensional spaces.
3. Both methods perform well on clear semantic relationships (gender, nationality) but struggle with more abstract or weaker analogies.
4. The difference values themselves aren't directly comparable between methods due to different scales (cosine: -1 to 1, Euclidean: 0 to ∞).

**Conclusion:** For word analogies, cosine similarity generally performs slightly better as it focuses purely on the directional relationship between word vectors, which is what matters most for capturing semantic analogies in embedding spaces.

In [26]:
# Test additional word pairs where outputs might differ
test_pairs = [
    ('france', 'paris', 'italy'),        # Royal family relationships
    ('hot', 'cold', 'warm'),            # Temperature antonyms/related
    ('doctor', 'hospital', 'teacher'),  # Profession-location analogy
    ('dog', 'puppy', 'cat'),            # Animal-young relationship
    ('swim', 'water', 'fly')            # Action-medium relationship
]

print("Testing additional word pairs:\n")
print("=== Cosine Similarity Results ===")
for words in test_pairs:
    try:
        d, cos_sim = find_word_cos(*words, data_dict)
        print("({}, {}) ----> ({}, {}) with {:.4f} difference".format(*words, d, cos_sim))
    except KeyError as e:
        print("({}, {}) ----> ({}, ?) - Word not in vocabulary: {}".format(*words, e))

print("\n=== Euclidean Distance Results ===")
for words in test_pairs:
    try:
        d, dis = find_word_euc(*words, data_dict)
        print("({}, {}) ----> ({}, {}) with {:.4f} difference".format(*words, d, dis))
    except KeyError as e:
        print("({}, {}) ----> ({}, ?) - Word not in vocabulary: {}".format(*words, e))

Testing additional word pairs:

=== Cosine Similarity Results ===
(france, paris) ----> (italy, soho) with 0.2427 difference
(hot, cold) ----> (warm, 1973-90) with -0.0092 difference
(doctor, hospital) ----> (teacher, clinic) with -0.0122 difference
(dog, puppy) ----> (cat, widgeon) with 0.4361 difference
(swim, water) ----> (fly, oil) with -0.2038 difference

=== Euclidean Distance Results ===
(france, paris) ----> (italy, rome) with 7.3026 difference
(hot, cold) ----> (warm, warmer) with 8.3065 difference
(doctor, hospital) ----> (teacher, school) with 8.2357 difference
(dog, puppy) ----> (cat, kitten) with 5.8811 difference
(swim, water) ----> (fly, supplies) with 9.6019 difference


## Does Higher Cosine Similarity Always Mean Lower Euclidean Distance?

**No, higher cosine similarity does not always result in lower Euclidean distance.** This is an important distinction between these two metrics:

**Cosine Similarity:**
- Measures the **angle** between two vectors
- Range: -1 to 1 (where 1 means vectors point in the same direction)
- **Normalized** - independent of vector magnitude
- Focuses purely on directional similarity

**Euclidean Distance:**
- Measures the **absolute distance** between two points in space
- Range: 0 to ∞ (where 0 means identical vectors)
- **Not normalized** - affected by both direction AND magnitude
- Considers the actual spatial separation

Two vectors can:
1. **Point in similar directions** (high cosine similarity) but have **very different magnitudes** (high Euclidean distance)
2. **Point in different directions** (low cosine similarity) but have **similar magnitudes and positions** (low Euclidean distance)


In [27]:
# Find top 5 nearest neighbors for 'computer'
computer = data_dict['computer']

# Calculate cosine similarity with all words
neighbors = []
for word in data_dict.keys():
    if word != 'computer':
        sim = cosine_similarity(computer, data_dict[word])
        neighbors.append((word, sim))

# Sort by similarity in descending order
neighbors.sort(key=lambda x: x[1], reverse=True)

# Display top 5
print("Top 5 Nearest Neighbors for 'computer':")
for i in range(5):
    print(f"{i+1}. {neighbors[i][0]}: {neighbors[i][1]:.4f}")

Top 5 Nearest Neighbors for 'computer':
1. computers: 0.9165
2. software: 0.8815
3. technology: 0.8526
4. electronic: 0.8126
5. internet: 0.8060
