<a href="https://colab.research.google.com/github/Hajar-lyoubi/AI54_TD_Tutorials/blob/main/TD3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **TD 03 - Exercises - Word Embeddings**


  # *0- Basics of neural networks with PyTorch*


In this section, we build a simple neural network to understand the basic workflow in PyTorch:
- loading a dataset,
- defining a model,
- training it, and
- evaluating its accuracy.

The dataset `height_weight_sex_training_set.csv` contains people’s height, weight, and sex.  
The goal is to predict whether a person is **male** or **female** based on height and weight.


### Step 1 — Importing libraries

---



In [None]:
import torch
from torch import nn
from torch.nn import functional as F
import torch.utils.data as tud
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

### Step 2 — Loading and inspecting the dataset  


---



In [None]:
df = pd.read_csv('height_weight_sex_training_set.csv')
df.head()

Unnamed: 0,Height,Weight,Sex
0,165.65,35.41,Female
1,148.53,74.45,Female
2,167.04,81.22,Male
3,161.54,71.47,Male
4,174.31,78.18,Male


The dataset looks as expected: each row corresponds to one individual, with their height, weight, and sex.  
Next, we will prepare the data to be used as input tensors for the neural network.


### Step 3 — Preparing the input data  


---

In this step, we prepare the **inputs** (features) for the neural network.  
We extract the columns `Height` and `Weight` from the dataset and convert them into PyTorch tensors of type `float32`.  
Each individual is represented by a vector of size 2 → `[height, weight]`.

We use the function `unsqueeze(1)` to add an extra dimension, making sure the data has the right shape for concatenation.  
Finally, we concatenate both tensors into a single input matrix.
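
To make the shapes concrete, here is a minimal framework-free sketch of what `unsqueeze(1)` and `torch.cat(..., dim=1)` achieve, using plain Python lists as stand-ins for tensors. The three sample rows come from the `df.head()` output above; the list manipulation is only an analogy for the tensor operations, not how PyTorch implements them.

```python
# First three rows of the dataset, as plain lists (stand-ins for 1-D tensors)
heights = [165.65, 148.53, 167.04]   # shape (3,)
weights = [35.41, 74.45, 81.22]      # shape (3,)

# Analogue of unsqueeze(1): wrap every scalar in its own list -> shape (3, 1)
heights_col = [[h] for h in heights]
weights_col = [[w] for w in weights]

# Analogue of torch.cat((heights, weights), dim=1):
# join the two columns row by row -> shape (3, 2)
inputs = [h_row + w_row for h_row, w_row in zip(heights_col, weights_col)]

for row in inputs:
    print(row)   # one [height, weight] vector per individual
```

Without the extra dimension, concatenating the two 1-D sequences would simply produce one long sequence of 6 values instead of 3 vectors of size 2.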


In [None]:
# Preparing the inputs (we want a list of vectors of size 2)
heights = torch.tensor(df['Height'], dtype=torch.float32).unsqueeze(1)
weights = torch.tensor(df['Weight'], dtype=torch.float32).unsqueeze(1)
inputs = torch.cat((heights, weights), dim=1)


### Step 4 — Preparing the output data  


---



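Next, we prepare the **outputs** (labels) for training.  
The column `Sex` contains categorical values (`'Male'` or `'Female'`),  
so we first replace these strings with numeric labels (0 for *Female*, 1 for *Male*) and then one-hot encode them as float vectors: `[1, 0]` for *Female* and `[0, 1]` for *Male*.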
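
A minimal sketch of the encoding performed by the next cell, written in plain Python so that each step is visible: class names become integer labels, and each label becomes a one-hot vector. The helper `one_hot` is a hypothetical stand-in for `F.one_hot`, and the three class names are the first three rows of the dataset.

```python
# Class names for the first three rows of the dataset
sexes = ["Female", "Female", "Male"]

# Same convention as in the text: 0 for Female, 1 for Male
label_of = {"Female": 0, "Male": 1}
num_classes = len(label_of)

def one_hot(label, num_classes):
    """Return a float vector with 1.0 at position `label` and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

labels = [label_of[s] for s in sexes]
outputs = [one_hot(label, num_classes) for label in labels]

print(labels)
print(outputs)
```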

In [None]:
# Preparing the outputs: map the class names to integer labels, then one-hot encode them
labels = torch.tensor(df['Sex'].map({'Female': 0, 'Male': 1}).values)
outputs = F.one_hot(labels, num_classes=2).float()



### Step 5 — Defining the neural network model  


---


In this step, we define a very simple neural network using PyTorch.  
The model is **sequential**, meaning each layer feeds its output to the next one.  

- The **input layer** receives 2 values (height and weight).  
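- The **hidden layer** has 16 neurons and applies a linear transformation. Note that the code below does not place an activation such as `nn.ReLU()` between the two linear layers, so the model as written is equivalent to a single linear (affine) classifier.  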
- The **output layer** has 2 neurons, one for each possible category: *Male* or *Female*.  


In [None]:
# Defining the model (i.e. the neural network)
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.Linear(16, 2)
)
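
The `nn.Sequential` above contains two `nn.Linear` layers and no activation module. The following framework-free sketch illustrates why that matters: two stacked linear layers collapse into one affine map, whereas inserting a ReLU between them breaks this equivalence. The weights, the hidden size of 3, and all helper names are hypothetical values chosen for illustration.

```python
def matvec(W, x):
    """Matrix-vector product with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product with plain lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical small network: 2 inputs -> 3 hidden units -> 2 outputs
W1, b1 = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]], [0.1, -0.2, 0.0]
W2, b2 = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]], [0.0, 1.0]

def two_layers(x):
    # Linear -> Linear, no activation in between
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same function expressed as ONE affine map: W = W2 @ W1, b = W2 @ b1 + b2
W, b = matmul(W2, W1), add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

def two_layers_relu(x):
    # Linear -> ReLU -> Linear: no longer expressible as a single affine map
    hidden = [max(0.0, h) for h in add(matvec(W1, x), b1)]
    return add(matvec(W2, hidden), b2)

x = [1.0, 2.0]
print(two_layers(x), one_layer(x), two_layers_relu(x))
```

In PyTorch, the nonlinear variant would be obtained by adding `nn.ReLU()` between the two `nn.Linear` modules.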

### Step 6 — Training the neural network  


---
Now that the model is defined, we can train it using the prepared data.  
We define:
- **Loss function:** `CrossEntropyLoss()`, which applies a log-softmax to the logits and penalizes low probability on the expected class.  
- **Optimizer:** `Adam`, which adjusts the weights to minimize the loss function.  

The training loop runs for several epochs (iterations).  
At each epoch:
1. The model predicts the outputs (`logits`),
2. The loss is computed,
3. The gradients are backpropagated,
4. The optimizer updates the model parameters.

We print the loss every 10 epochs to monitor convergence.
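
The four steps listed above can be sketched without any framework. The example below trains a one-feature logistic regression with hand-written gradients; it assumes plain gradient descent rather than Adam, and the toy data, learning rate, and variable names are hypothetical.

```python
import math

# Toy data: (feature, label) pairs, with label in {0, 1}
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, lr = 0.0, 0.0, 0.5   # parameters and learning rate

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

history = []
for epoch in range(1, 51):
    # 1. Forward pass: predicted probability of class 1 for every example
    probs = [sigmoid(w * x + b) for x, _ in data]

    # 2. Loss: mean binary cross-entropy
    loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for (_, y), p in zip(data, probs)) / len(data)

    # 3. Gradients (derived analytically here; autograd does this in PyTorch)
    grad_w = sum((p - y) * x for (x, y), p in zip(data, probs)) / len(data)
    grad_b = sum((p - y) for (_, y), p in zip(data, probs)) / len(data)

    # 4. Parameter update: plain gradient descent step
    w -= lr * grad_w
    b -= lr * grad_b

    history.append(loss)
    if epoch % 10 == 0:
        print(f"Epoch: {epoch}, Loss: {loss:.4f}")
```

With both parameters at zero the model outputs 0.5 for every example, so the first recorded loss is ln 2; it then decreases as `w` grows.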


In [None]:
# Training the model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epochs = 2000
for epoch in range(1, epochs + 1):
  logits = model(inputs)                # The model is applied on all the inputs
  loss = criterion(logits, outputs)     # The error is computed for all the predictions (logits) according to expected outputs

  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

  # Every 10 epochs, print the epoch and the loss to monitor training
  if epoch % 10 == 0:
    print(f"Epoch: {epoch}, Loss: {loss.item()}")

Epoch: 10, Loss: 0.8577123284339905
Epoch: 20, Loss: 1.0777045488357544
Epoch: 30, Loss: 0.4793497622013092
Epoch: 40, Loss: 0.5959108471870422
Epoch: 50, Loss: 0.5192997455596924
Epoch: 60, Loss: 0.4882560968399048
Epoch: 70, Loss: 0.47098231315612793
Epoch: 80, Loss: 0.46969014406204224
Epoch: 90, Loss: 0.4675212502479553
Epoch: 100, Loss: 0.46719974279403687
Epoch: 110, Loss: 0.466734915971756
Epoch: 120, Loss: 0.46631932258605957
Epoch: 130, Loss: 0.4659329950809479
Epoch: 140, Loss: 0.46551135182380676
Epoch: 150, Loss: 0.46510863304138184
Epoch: 160, Loss: 0.46470406651496887
Epoch: 170, Loss: 0.4643022119998932
Epoch: 180, Loss: 0.46390438079833984
Epoch: 190, Loss: 0.4635111391544342
Epoch: 200, Loss: 0.46312370896339417
Epoch: 210, Loss: 0.46274250745773315
Epoch: 220, Loss: 0.46236860752105713
Epoch: 230, Loss: 0.4620024859905243
Epoch: 240, Loss: 0.4616447389125824
Epoch: 250, Loss: 0.46129584312438965
Epoch: 260, Loss: 0.4609562158584595
Epoch: 270, Loss: 0.4606261253356933

### Step 7 — Observing the results


---



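During training, the loss drops quickly over the first 100 epochs or so (with some fluctuation early on) and then decreases only very slowly: in the log above it goes from about 0.86 at epoch 10 to about 0.46 at epoch 270.  
It does not approach 0, which is expected when the two classes overlap in height and weight. For reference, a classifier that always outputs 50/50 would have a loss of ln 2 ≈ 0.69.  

This suggests that the network has learned a useful, though imperfect, separation between male and female based on height and weight.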


### Step 8 — Manual evaluation (single examples)


---



In this step, we manually test the trained neural network on a few custom examples.
Each example represents a person defined by their height and weight.
The model predicts the probability of belonging to each class (Female / Male).
We use the softmax function to visualize the predicted probabilities.
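
As a reminder of what the softmax does to the two logits, here is a small standard-library sketch. The logit values are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something taken from this notebook.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    shift = max(logits)                       # avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

idx2label = {0: "Female", 1: "Male"}

logits = [0.3, 2.1]                           # hypothetical model output for one person
probs = softmax(logits)
predicted = idx2label[probs.index(max(probs))]

print(f"P(F)={probs[0]:.2f}, P(M)={probs[1]:.2f} -> {predicted}")
```

Softmax preserves the ordering of the logits, so the predicted class is the same whether `argmax` is taken on the logits or on the probabilities.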

In [None]:
# --- Manual checks ---
idx2label = {0: "Female", 1: "Male"}

def predict_one(h, w):
    model.eval()
    with torch.no_grad():
        x = torch.tensor([h, w], dtype=torch.float32)
        p = F.softmax(model(x), dim=0)
        return idx2label[int(p.argmax())], p.tolist()

samples = [(150,60), (170,75), (185,85), (160,50), (175,68)]
for h,w in samples:
    lab, probs = predict_one(h,w)
    print(f"H={h}, W={w} -> {lab}  (P(F)={probs[0]:.2f}, P(M)={probs[1]:.2f})")


H=150, W=60 -> Female  (P(F)=0.72, P(M)=0.28)
H=170, W=75 -> Male  (P(F)=0.14, P(M)=0.86)
H=185, W=85 -> Male  (P(F)=0.02, P(M)=0.98)
H=160, W=50 -> Female  (P(F)=0.61, P(M)=0.39)
H=175, W=68 -> Male  (P(F)=0.12, P(M)=0.88)


### Step 9 - Evaluate on the test dataset


---



Finally, we evaluate the model on a separate test set (`height_weight_sex_test_set.csv`).
This step measures generalization, i.e. how well the model performs on new, unseen data.
We compute:
- the accuracy (share of correct predictions), and
- the confusion matrix (errors broken down by class).

In [None]:
# --- Test set evaluation ---
df_test = pd.read_csv("height_weight_sex_test_set.csv")

X_test = torch.tensor(df_test[['Height','Weight']].values, dtype=torch.float32)
y_test = torch.tensor(df_test['Sex'].map({'Female': 0, 'Male': 1}).values, dtype=torch.long)

model.eval()
with torch.no_grad():
    preds = model(X_test).argmax(dim=1)

acc = accuracy_score(y_test.numpy(), preds.numpy())
cm  = confusion_matrix(y_test.numpy(), preds.numpy(), labels=[0,1])
print(f"Test accuracy: {acc*100:.2f}%")
print("Confusion matrix (rows=true, cols=pred):\n", cm)


Test accuracy: 84.88%
Confusion matrix (rows=true, cols=pred):
 [[76 23]
 [ 8 98]]




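The model reaches about 85 % accuracy on the test set (174 of 205 examples), which is a reasonable result for such a simple model using only two input features.

The confusion matrix shows that the errors are not symmetric: 23 of the 99 females are predicted as male, while only 8 of the 106 males are predicted as female. The matrix itself does not say where these errors occur, but they are most likely individuals with intermediate height and weight, where the two classes overlap.

Overall, the model captures most of the relationship between height, weight, and sex.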
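
The figures in the confusion matrix printed above can be checked by hand. The sketch below recomputes the accuracy and the per-class recall from the four counts; the variable names are chosen for illustration.

```python
# Confusion matrix from the output above (rows = true class, columns = predicted class)
# Class 0 = Female, class 1 = Male
cm = [[76, 23],
      [8, 98]]

total = sum(sum(row) for row in cm)                 # number of test examples
correct = sum(cm[i][i] for i in range(len(cm)))     # diagonal = correct predictions
accuracy = correct / total

# Recall per class: share of each true class that is predicted correctly
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

print(f"accuracy      = {correct}/{total} = {accuracy:.4f}")
print(f"recall Female = {recall[0]:.4f}")
print(f"recall Male   = {recall[1]:.4f}")
```

The lower recall for the Female class reflects the 23 females predicted as male.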

  # *1- Word2Vec CBOW*

### Step 1 - Importing libraries and selecting the device

---

Selecting the right device (a GPU when one is available) can speed up the computation.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import random

nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # Use the GPU if available
print(device)

cpu


### Step 2 - Loading and inspecting the dataset


---



In [None]:
with open('romeo_and_juliet.txt') as file:
  content = file.read()

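### Step 3 - Normalization, tokenization with NLTK, and vocabulary building

---

The text is lowercased, tokenized, and filtered to alphabetic tokens. Stopwords are not removed in the code below: the printed tokens still contain words such as `the`, `of`, and `to`.
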
In [None]:
# Pre-processing of the text
content = content.lower()
tokens = word_tokenize(content)

# Keep only purely alphabetic tokens (drops punctuation, numbers, and fragments containing apostrophes)
tokens = [word for word in tokens if word.isalpha()]

print(tokens[0:100])

['the', 'tragedy', 'of', 'romeo', 'and', 'juliet', 'by', 'william', 'shakespeare', 'dramatis', 'personae', 'chorus', 'escalus', 'prince', 'of', 'verona', 'paris', 'a', 'young', 'count', 'kinsman', 'to', 'the', 'prince', 'montague', 'heads', 'of', 'two', 'houses', 'at', 'variance', 'with', 'each', 'other', 'capulet', 'heads', 'of', 'two', 'houses', 'at', 'variance', 'with', 'each', 'other', 'an', 'old', 'man', 'of', 'the', 'capulet', 'family', 'romeo', 'son', 'to', 'montague', 'tybalt', 'nephew', 'to', 'lady', 'capulet', 'mercutio', 'kinsman', 'to', 'the', 'prince', 'and', 'friend', 'to', 'romeo', 'benvolio', 'nephew', 'to', 'montague', 'and', 'friend', 'to', 'romeo', 'tybalt', 'nephew', 'to', 'lady', 'capulet', 'friar', 'laurence', 'franciscan', 'friar', 'john', 'franciscan', 'balthasar', 'servant', 'to', 'romeo', 'abram', 'servant', 'to', 'montague', 'sampson', 'servant', 'to', 'capulet']


In [None]:
# Build a sorted vocabulary (sorting keeps the word indices reproducible across runs)
# and a dictionary mapping each word to its index
vocabulary = sorted(set(tokens))

word2idx = {word: i for i, word in enumerate(vocabulary)}

print(f"Vocabulary size: {len(vocabulary)}")

Vocabulary size: 3464


### Step 4 — Build CBOW dataset (window = 2)

For each position *t*, target = `tokens[t]`, context = tokens at `t-2, t-1, t+1, t+2`.
We store indices for efficiency.
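
A small sketch of this windowing on the first six tokens of the corpus, keeping the words instead of their indices so the pairs are easy to read. The loop over `offset` is an illustrative generalization of the four hard-coded positions.

```python
# First six tokens of the corpus (from the tokenization output above)
tokens = ["the", "tragedy", "of", "romeo", "and", "juliet"]
window = 2

examples = []
for t in range(window, len(tokens) - window):
    # Context = the `window` tokens on each side of position t, target excluded
    context = [tokens[t + offset]
               for offset in range(-window, window + 1) if offset != 0]
    examples.append((context, tokens[t]))

for context, target in examples:
    print(context, "->", target)
```

Only positions with a full window on both sides are used, so a sequence of `n` tokens yields `n - 2 * window` examples.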


In [None]:
# Build the dataset
target_word_ids = []
context_words_ids = []

for position in range(2, len(tokens) - 2):
  target_word_ids.append(word2idx[tokens[position]])
  context_words_ids.append([
      word2idx[tokens[position-2]],
      word2idx[tokens[position-1]],
      word2idx[tokens[position+1]],
      word2idx[tokens[position+2]]
    ])


### Step 5 — Model (Embedding → sum → Linear → logits)



---




In [None]:
# Build the Word2Vec CBOW module
class Word2VecCBOW(nn.Module):
  def __init__(self, vocabulary_size, embedding_dim):
    super(Word2VecCBOW, self).__init__()
    # An embedding layer mapping each word index to a dense vector of size embedding_dim
    self.embeddings = nn.Embedding(vocabulary_size, embedding_dim)
    # An output layer producing one logit (unnormalized score) per vocabulary word
    self.linear = nn.Linear(embedding_dim, vocabulary_size, bias=False)

  def forward(self, context):
    # Computing the embedding for the context words
    embed = self.embeddings(context)
    # Make an aggregation
    sum_embed = torch.sum(embed, dim=1)
    # Compute the output
    out = self.linear(sum_embed)

    return out

word2vec_cbow = Word2VecCBOW(len(vocabulary), 128).to(device)
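
To see what the `forward` method computes, here is a framework-free sketch of the same three steps (lookup, sum, linear projection) on a hypothetical vocabulary of 4 words with 2-dimensional embeddings. All numeric values are made up; the `linear` matrix is stored with one row per output word, which is the `(out_features, in_features)` layout that `nn.Linear` uses for its weight.

```python
# Hypothetical parameters: V = 4 words, D = 2 dimensions
embeddings = [[1.0, 0.0],    # embedding of word 0
              [0.0, 1.0],    # embedding of word 1
              [1.0, 1.0],    # embedding of word 2
              [2.0, -1.0]]   # embedding of word 3
linear = [[0.5, 0.0],        # output weights for word 0
          [0.0, 0.5],        # output weights for word 1
          [1.0, 1.0],        # output weights for word 2
          [-1.0, 0.0]]       # output weights for word 3

def cbow_forward(context_ids):
    # 1. Look up the embedding of every context word
    vectors = [embeddings[i] for i in context_ids]
    # 2. Aggregate: sum the context embeddings dimension by dimension
    summed = [sum(column) for column in zip(*vectors)]
    # 3. Project to one logit per vocabulary word (no bias, as in the model)
    return [sum(w * s for w, s in zip(row, summed)) for row in linear]

logits = cbow_forward([0, 1, 3, 3])   # four context word indices
print(logits)
```

The position of each context word is lost in the sum, which is why the model is called a bag of words.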

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(word2vec_cbow.parameters())

### Step 6 — Training


---


We iterate over the corpus in batches (here, simple contiguous slices).
We report the mean loss per epoch.
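
Contiguous slicing is simple, but the stop value of `range` decides whether the last, partial batch is used. The sketch below compares two stop values on a hypothetical dataset of 10 examples; `batch_slices` is an illustrative helper, not part of the notebook.

```python
def batch_slices(n_examples, batch_size, stop):
    """Return the (start, end) bounds of contiguous batches for range(0, stop, batch_size)."""
    return [(start, min(start + batch_size, n_examples))
            for start in range(0, stop, batch_size)]

n_examples, batch_size = 10, 4

# Stopping at the number of examples covers everything, including the partial batch
full = batch_slices(n_examples, batch_size, stop=n_examples)

# Stopping at n_examples - batch_size leaves the tail unused
truncated = batch_slices(n_examples, batch_size, stop=n_examples - batch_size)

print(full)
print(truncated)
```

Python slices beyond the end of a list are simply shortened, so the final batch can safely be smaller than `batch_size`.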


In [None]:

# Train the model (this can be slow: the only optimization here is training with batches)
batch_size = 2000 # The loss is computed on batches of 2000 examples, to speed up the training process
losses = []

for epoch in range(100):
  # Iterate over the examples (not the raw tokens) so that the final, partial batch is included
  for position in range(0, len(target_word_ids), batch_size):
    batch_input = context_words_ids[position:position + batch_size]
    batch_output = target_word_ids[position:position + batch_size]

    prediction = word2vec_cbow(torch.tensor(batch_input, device=device))
    loss = criterion(prediction, torch.tensor(batch_output, device=device))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    losses.append(loss.item())

  print(f'Epoch #{epoch}, avg loss {torch.mean(torch.tensor(losses)).item()}')
  losses.clear()

Epoch #0, avg loss 8.676487922668457
Epoch #1, avg loss 8.115889549255371
Epoch #2, avg loss 7.654500484466553
Epoch #3, avg loss 7.230898380279541
Epoch #4, avg loss 6.841434001922607
Epoch #5, avg loss 6.491261959075928
Epoch #6, avg loss 6.1837477684021
Epoch #7, avg loss 5.915843486785889
Epoch #8, avg loss 5.68001127243042
Epoch #9, avg loss 5.469067096710205
Epoch #10, avg loss 5.278133392333984
Epoch #11, avg loss 5.103716850280762
Epoch #12, avg loss 4.943240165710449
Epoch #13, avg loss 4.794861793518066
Epoch #14, avg loss 4.6572651863098145
Epoch #15, avg loss 4.529474258422852
Epoch #16, avg loss 4.410709857940674
Epoch #17, avg loss 4.300286293029785
Epoch #18, avg loss 4.197551727294922
Epoch #19, avg loss 4.101866245269775
Epoch #20, avg loss 4.012605667114258
Epoch #21, avg loss 3.9291696548461914
Epoch #22, avg loss 3.8509960174560547
Epoch #23, avg loss 3.7775652408599854
Epoch #24, avg loss 3.7084054946899414
Epoch #25, avg loss 3.643094301223755
Epoch #26, avg loss 

### Step 7 — Finding the most similar words in the embedding space


---



Once the model is trained, each word is represented by a vector (its embedding).  
Words that appear in **similar contexts** end up with **similar embeddings**.

To verify that, we:
1. extract the embedding matrix from the model,
2. normalize all vectors (so cosine similarity = dot product),
3. select a specific word (e.g. `"man"`),
4. compute its similarity with every other word,
5. print the **10 most similar words**.
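
The steps above can be sketched with the standard library on a few hypothetical 2-dimensional vectors. The words and values are made up for illustration; only the procedure (normalize, dot product, sort) mirrors what the next cell does with tensors.

```python
import math

# Hypothetical toy embeddings
embeddings = {
    "man":  [1.0, 2.0],
    "lord": [2.0, 4.1],
    "toad": [-1.0, 0.5],
    "sea":  [0.0, 1.0],
}

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# After normalization, the dot product of two vectors is their cosine similarity
normalized = {word: normalize(vec) for word, vec in embeddings.items()}

def top_k(word, k):
    query = normalized[word]
    sims = {other: sum(a * b for a, b in zip(query, vec))
            for other, vec in normalized.items()}
    return sorted(sims.items(), key=lambda item: item[1], reverse=True)[:k]

for word, sim in top_k("man", 3):
    print(f"{word:>5s}  {sim:.3f}")
```

The query word always comes first with a similarity of 1, so it is usually dropped from the list of neighbors.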


In [None]:
with torch.no_grad(): # Disable the computation of gradients (useful for evaluation)
  # The rows of the embedding matrix are the word embeddings (row 0 is the embedding of word 0, etc.)
  embeddings = word2vec_cbow.embeddings.weight.detach()
  # Normalize each embedding to unit length so that the dot product of two embeddings is their cosine similarity
  normalized_embeddings = embeddings / torch.norm(embeddings, p=2, dim=1, keepdim=True)

  embedding = normalized_embeddings[word2idx['man']]

  # Compute the dot product of every row of normalized_embeddings with embedding in one matrix-vector product
  similarities = torch.mv(normalized_embeddings, embedding)

  # Get the indices of the 10 highest similarities (the values are discarded); the query word itself ranks first
  _, top10 = torch.topk(similarities, 10, largest=True, sorted=True)

  print([vocabulary[idx.item()] for idx in top10])


['man', 'tainted', 'toad', 'unadvis', 'conjur', 'sparkling', 'obscur', 'departed', 'mood', 'measure']


  # *2- Word2Vec Skipgrams*

### Step 1 — Build the Skip-gram dataset


---



In [None]:
window = 2
centers, targets = [], []

# For each center position, emit one (center, context) pair per context word in the window
for t in range(window, len(tokens) - window):
    c = word2idx[tokens[t]]
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the center word itself
        centers.append(c)
        targets.append(word2idx[tokens[t + offset]])

len(centers), len(targets)


(101452, 101452)
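
The pair count printed above can be checked by hand: every center position with a full window emits `2 * window` pairs. The sketch below verifies the formula on a toy sentence and then inverts it to recover the number of tokens implied by 101452 pairs; the function name is hypothetical.

```python
def count_skipgram_pairs(n_tokens, window):
    """Number of (center, context) pairs when only full windows are used."""
    return (n_tokens - 2 * window) * 2 * window

# Check the formula against an explicit construction on a toy sentence
tokens = ["the", "tragedy", "of", "romeo", "and", "juliet"]
window = 2
pairs = [(tokens[t], tokens[t + offset])
         for t in range(window, len(tokens) - window)
         for offset in range(-window, window + 1) if offset != 0]
print(pairs)

# Invert the formula for the pair count reported above
n_pairs = 101452
implied_tokens = n_pairs // (2 * window) + 2 * window
print(f"{n_pairs} pairs -> {implied_tokens} tokens")
```

Compared with CBOW, which builds one example per center position, Skip-gram builds four times as many training examples for the same window.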

### Step 2 — Create a DataLoader


---



In [None]:
centers_t = torch.tensor(centers, dtype=torch.long)
targets_t = torch.tensor(targets, dtype=torch.long)

dataset = tud.TensorDataset(centers_t, targets_t)
loader  = tud.DataLoader(dataset, batch_size=4096, shuffle=True, drop_last=False)
len(dataset)


101452

### Step 3 — Define the Skip-gram model


---



The architecture is:
`Embedding(V, D) → Linear(D, V) → CrossEntropyLoss`, where `V` is the vocabulary size and `D` the embedding dimension.
It learns an embedding for each word (input layer) and output weights used to predict the context words.

In [None]:
import torch.nn as nn
import torch.optim as optim

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.emb  = nn.Embedding(vocab_size, embed_dim)
        self.out  = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, center_ids):      # (batch,)
        e = self.emb(center_ids)        # (batch, D)
        logits = self.out(e)            # (batch, V)
        return logits

vocab_size = len(word2idx)
embed_dim = 100
skipgram = SkipGram(vocab_size, embed_dim).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(skipgram.parameters(), lr=1e-3)



### Step 4 — Train the model


---


We loop through the dataset in batches.
Each epoch prints the average loss to monitor convergence.
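
One way to read these loss values is to compare them with the loss of a model that predicts every word with equal probability, which is ln(V) for a vocabulary of size V. The sketch below computes that baseline for the vocabulary size reported earlier (3464) and converts the final Skip-gram loss from the log below into a perplexity; this is only an interpretation aid under those assumptions.

```python
import math

vocab_size = 3464                       # vocabulary size printed earlier in the notebook

# Cross-entropy of a uniform prediction over the vocabulary
uniform_loss = math.log(vocab_size)

# Final average loss from the Skip-gram training log
final_loss = 5.0939

# Perplexity: the effective number of equally likely choices
perplexity = math.exp(final_loss)

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"final loss:            {final_loss:.3f}")
print(f"final perplexity:      {perplexity:.1f} (out of {vocab_size} words)")
```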

In [None]:
skipgram.train()
epochs = 20
for ep in range(1, epochs+1):
    running = 0.0
    for c_batch, t_batch in loader:
        c_batch = c_batch.to(device)
        t_batch = t_batch.to(device)

        logits = skipgram(c_batch)
        loss = criterion(logits, t_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running += loss.item()
    print(f"Epoch {ep}/{epochs} - avg loss: {running/len(loader):.4f}")


Epoch 1/20 - avg loss: 5.2914
Epoch 2/20 - avg loss: 5.2790
Epoch 3/20 - avg loss: 5.2660
Epoch 4/20 - avg loss: 5.2534
Epoch 5/20 - avg loss: 5.2414
Epoch 6/20 - avg loss: 5.2294
Epoch 7/20 - avg loss: 5.2176
Epoch 8/20 - avg loss: 5.2064
Epoch 9/20 - avg loss: 5.1962
Epoch 10/20 - avg loss: 5.1857
Epoch 11/20 - avg loss: 5.1751
Epoch 12/20 - avg loss: 5.1649
Epoch 13/20 - avg loss: 5.1561
Epoch 14/20 - avg loss: 5.1468
Epoch 15/20 - avg loss: 5.1371
Epoch 16/20 - avg loss: 5.1280
Epoch 17/20 - avg loss: 5.1197
Epoch 18/20 - avg loss: 5.1105
Epoch 19/20 - avg loss: 5.1021
Epoch 20/20 - avg loss: 5.0939


### Step 5 — Find the most similar words (cosine similarity)


---



We inspect the learned embedding space to find, for a given word, the words whose embeddings are closest in cosine similarity.

In [None]:

idx2word = {i: w for w, i in word2idx.items()}

def top_k_similar_skipgram(word, k=10):
    if word not in word2idx: return []
    with torch.no_grad():
        E = skipgram.emb.weight.detach()                 # (V, D)
        E = E / (E.norm(p=2, dim=1, keepdim=True) + 1e-9)
        v = E[word2idx[word]]                            # (D,)
        sims = torch.mv(E, v)                            # (V,)
        vals, idxs = torch.topk(sims, k+1)               # includes the word itself
        idxs = [i for i in idxs.tolist() if i != word2idx[word]][:k]
        return [(idx2word[i], float(sims[i])) for i in idxs]


print(top_k_similar_skipgram("man", 10))


[('carelessly', 0.41916146874427795), ('begot', 0.38257426023483276), ('pox', 0.3682824969291687), ('speedy', 0.3566665053367615), ('yea', 0.3491438031196594), ('dishclout', 0.3397133946418762), ('duellist', 0.33793193101882935), ('fairest', 0.3370053768157959), ('house', 0.33597180247306824), ('blessed', 0.3299347460269928)]


### Step 6 — Compare CBOW vs Skip-gram


---



In [None]:
def top_k_similar_cbow(word, k=10):
    if word not in word2idx: return []
    with torch.no_grad():
        E = word2vec_cbow.embeddings.weight.detach()
        E = E / (E.norm(p=2, dim=1, keepdim=True) + 1e-9)
        v = E[word2idx[word]]
        sims = torch.mv(E, v)
        vals, idxs = torch.topk(sims, k+1)
        idxs = [i for i in idxs.tolist() if i != word2idx[word]][:k]
        return [(idx2word[i], float(sims[i])) for i in idxs]


queries = ["man", "romeo", "juliet", "love"]
for q in queries:
    print(f"\n== {q.upper()} ==")
    print("CBOW     :", [w for w,_ in top_k_similar_cbow(q, 10)])
    print("Skip-gram:", [w for w,_ in top_k_similar_skipgram(q, 10)])



== MAN ==
CBOW     : ['tainted', 'toad', 'unadvis', 'conjur', 'sparkling', 'obscur', 'departed', 'mood', 'measure', 'purple']
Skip-gram: ['carelessly', 'begot', 'pox', 'speedy', 'yea', 'dishclout', 'duellist', 'fairest', 'house', 'blessed']

== ROMEO ==
CBOW     : ['unsavoury', 'us', 'canopy', 'fourteen', 'soles', 'familiar', 'bleeds', 'coz', 'woes', 'vial']
Skip-gram: ['midwife', 'pleasure', 'montague', 'usest', 'tiberio', 'brawling', 'heaviness', 'blazon', 'runagate', 'maskers']

== JULIET ==
CBOW     : ['lies', 'weraday', 'away', 'crow', 'protest', 'faithful', 'leaving', 'closely', 'couch', 'scars']
Skip-gram: ['goodman', 'daughters', 'woful', 'affections', 'prostrate', 'commend', 'empty', 'haviour', 'dull', 'grubs']

== LOVE ==
CBOW     : ['henceforth', 'sea', 'feeling', 'wind', 'elflocks', 'because', 'spill', 'riddling', 'ladyship', 'lo']
Skip-gram: ['direct', 'hoar', 'moderately', 'deceiv', 'jewel', 'ropery', 'o', 'intended', 'canopy', 'importun']


### Step 7 — Interpretation


---



### Comparing CBOW and Skip-gram

- **CBOW (Continuous Bag-of-Words)** predicts the *center word* given surrounding *context words*.  
- **Skip-gram** does the opposite: it predicts *context words* given a *center word*.  
- Both use embeddings, but their training objectives differ.

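**Observations:**
- In the run above, the CBOW and Skip-gram neighbor lists do not share a single word for any of the four queries, and most neighbors have no obvious semantic link to the query (e.g. `man` → `tainted`, `toad` for CBOW and `carelessly`, `begot` for Skip-gram).  
- This is not surprising: the corpus is small (about 25,000 tokens after filtering) and the Skip-gram loss is still decreasing after 20 epochs, so the embeddings remain noisy. One plausible exception is `romeo` → `montague` in the Skip-gram list.  
- In general, Skip-gram is reported to represent *rare words* and finer semantic relations better.  
- CBOW usually trains faster and tends to produce smoother embeddings for *frequent words*.  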
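
The neighbor lists printed in Step 6 can be compared directly. The sketch below measures their overlap for two of the queries with a Jaccard index; the lists are copied from the output above, and the helper name is hypothetical.

```python
# Neighbor lists copied from the Step 6 output
neighbors = {
    "man": {
        "cbow": ["tainted", "toad", "unadvis", "conjur", "sparkling",
                 "obscur", "departed", "mood", "measure", "purple"],
        "skipgram": ["carelessly", "begot", "pox", "speedy", "yea",
                     "dishclout", "duellist", "fairest", "house", "blessed"],
    },
    "romeo": {
        "cbow": ["unsavoury", "us", "canopy", "fourteen", "soles",
                 "familiar", "bleeds", "coz", "woes", "vial"],
        "skipgram": ["midwife", "pleasure", "montague", "usest", "tiberio",
                     "brawling", "heaviness", "blazon", "runagate", "maskers"],
    },
}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

overlap = {query: jaccard(lists["cbow"], lists["skipgram"])
           for query, lists in neighbors.items()}
print(overlap)
```

A value of 0 means the two models agree on none of the ten neighbors for that query.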

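**Conclusion:**  
On this small corpus, neither model yields clearly meaningful neighbors yet; a larger corpus or longer training would be needed to compare them fairly.  
As a rule of thumb, Skip-gram is considered stronger for detailed semantics, while CBOW is computationally lighter.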


  # *3- Using FastText*

### Step 1 — Load a pretrained FastText model


---



In [53]:
!pip -q install fasttext-wheel

import fasttext, fasttext.util
fasttext.util.download_model('en', if_exists='ignore')   # downloads cc.en.300.bin.gz
!gunzip -f cc.en.300.bin.gz

ft = fasttext.load_model('cc.en.300.bin')
print("FastText loaded. Vector size:", ft.get_dimension())


Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz





FastText loaded. Vector size: 300


### Step 2 — Quick sanity checks


---


We inspect a few raw vectors and norms to confirm the model is usable.


In [54]:
print("Vector dim:", ft.get_dimension())
vec_king = ft.get_word_vector("king")
print("First 10 dims of 'king':", vec_king[:10])
print("L2 norm of 'king':", float((vec_king**2).sum()**0.5))


Vector dim: 300
First 10 dims of 'king': [-0.02636429 -0.04383384 -0.05224613  0.02497659  0.15994655  0.0049898
  0.00251637 -0.01627121 -0.06621356 -0.00167889]
L2 norm of 'king': 1.5576002597808838


### Step 3 — Nearest neighbors


---


We list the top-k nearest neighbors (cosine similarity) for a set of probe words.


In [58]:
# Assumed implementation of the `show_neighbors` helper called in Step 5
# (its original definition is not shown in this notebook)
def show_neighbors(word: str, k: int = 10):
    # get_nearest_neighbors returns a list of (similarity, word) pairs
    print(f"\nTop-{k} neighbors (FastText) for '{word}':")
    for sim, w in ft.get_nearest_neighbors(word, k):
        print(f"{w}   {sim:.3f}")

show_neighbors("king", 10)


### Step 4 — Word analogy: *paris − france + germany*
We compute the composed vector **v = paris − france + germany** and retrieve the top-10 closest words.
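
The arithmetic behind the analogy can be illustrated with hand-made vectors. In the sketch below, the 3-dimensional vectors are hypothetical and chosen so that "capital of" corresponds to a shared offset; excluding the three query words from the candidates is a common convention assumed here.

```python
import math

# Hypothetical toy vectors: dims ~ (France-ness, Germany-ness, capital-ness)
vectors = {
    "france":  [1.0, 0.0, 0.0],
    "paris":   [1.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0],
    "berlin":  [0.0, 1.0, 1.0],
    "vienna":  [0.3, 0.6, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, k=2):
    """Rank words by cosine similarity to vectors[a] - vectors[b] + vectors[c]."""
    composed = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]   # skip the query words
    scored = [(cosine(composed, vectors[w]), w) for w in candidates]
    return sorted(scored, reverse=True)[:k]

results = analogy("paris", "france", "germany")
for sim, word in results:
    print(f"{word:>10s}   {sim:.3f}")
```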


In [59]:
def ft_analogy(a: str, b: str, c: str, k: int = 10):
    # Returns list of (similarity, word) for vector a - b + c
    return ft.get_analogies(a, b, c, k)

print("Analogy: paris - france + germany")
for sim, w in ft_analogy("paris", "france", "germany", 10):
    print(f"{w:>18s}   {sim:.3f}")


Analogy: paris - france + germany
            berlin   0.689
            munich   0.688
           austria   0.636
        dusseldorf   0.622
         frankfurt   0.621
            vienna   0.620
         amsterdam   0.618
           leipzig   0.617
          freiburg   0.610
          hannover   0.607


### Step 5 — OOV (out-of-vocabulary) robustness
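FastText composes word vectors from character n-grams, so even tokens that were never seen during training still get a vector. Whether that vector is meaningful depends on how much the token's n-grams resemble those of known words.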
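
A minimal sketch of the subword idea: the word is wrapped in boundary markers `<` and `>` and split into character n-grams, and an unseen word can share many of them with a known word. The n-gram range of 3 to 6 is an assumption for illustration, and the real library additionally hashes n-grams into a fixed number of buckets, which this sketch omits.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Trigrams of a short word, including the boundary markers
print(char_ngrams("where", 3, 3))

# N-grams shared between a rare derived form and a known word
unseen = set(char_ngrams("shakespearishness"))
known = set(char_ngrams("shakespeare"))
shared = unseen & known
print(f"{len(shared)} shared n-grams, e.g. {sorted(shared)[:5]}")
```

A token such as `unseenword123` shares far fewer informative n-grams with real words, which is one plausible reason why its neighbors can look arbitrary.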


In [60]:
show_neighbors("unseenword123", 5)
show_neighbors("shakespearishness", 5)



Top-5 neighbors (FastText) for 'unseenword123':
crescendosexibloguerobateyabsorbersexiindesignabledinerolatifundiosexibrezarcularsutesexirapoplinbrezarcorrentosoVd.lazadareflejoreglafeministabrezarchuzasexiouttiqueblogueroin   0.507
deblogueroreflejoantecedentesexitlacuachebateysuteindesignableabsorbersexilatifundiosexibrezarsutemultiétnicosexiplinrapobrezarcorrentosoVd.lazadafisiochillidomabrezarsico-chuzaoutcolodrablogueroin   0.500
QQFZAAEACwAAAAAGQASAAAIjgAJCBQIoGDBgQgTKiwooGHDgwshDgTgsOLDhAAGaAQwUYBBhx85EtS4cWLGjR5JSjxZkgDFkwwLohTJUqTLlANiwvQ4seVNjwwfBoVokKjFo0Jlksz506NFiklZtoQKFSjIoktLVv1YsahSn1WP0vzq02VYoAjJMsVYVKHZrDbdupW6Vq5cunHtRjQoMCAAIfkECRQABAAsCQADAAQABAAACAsABQgkILCgwYEBAQAh   0.455
CrônicasEsdrasNeemiasEsterJóSalmosProvérbiosEclesiastesCânticosIsaíasJeremiasLamentaçõesEzequielDanielOséiasJoelAmósObadiasJonasMiquéiasNaumHabacuqueSofoniasAgeuZacariasMalaquiasNovo   0.448
DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQ

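#### Example
We query two made-up or rare tokens:
- `"unseenword123"`, a fully invented token  
- `"shakespearishness"`, a creative morphological form that is unlikely to appear in the training corpus

FastText returns a vector, and therefore neighbors, for both. For `"unseenword123"`, however, the neighbors listed above are junk tokens (concatenated words and encoded strings) with similarities of about 0.5 or less, so an out-of-vocabulary vector is not automatically a meaningful one. Subword composition is expected to help most when the unseen form shares informative n-grams with known words, as `"shakespearishness"` does with `"shakespeare"`.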