## Task 1: Analysing Pre-trained Word Embeddings (6 Points)

In [1]:
from tutorial2_task1 import load_glove_model

model = load_glove_model("glove.6B.50d.txt")
vocab = list(model.keys())

print(f"{len(vocab)} words found with {len(model['the'])} vector size!")


### Finding Distance Between Words

Let's explore whether the pre-trained word embeddings accurately reflect the intuitive similarity between words. For instance, we would expect the word "dog" to be more similar to "cat" (as both are pet animals) than to a word like "ball." We can verify this assumption using the word embeddings.

In [2]:
from tutorial2_task1 import cosine_distance

d_dog_cat = cosine_distance(model["dog"], model["cat"])
d_dog_ball = cosine_distance(model["dog"], model["ball"])

print(f"Distance between dog and cat: {d_dog_cat}")
print(f"Distance between dog and ball: {d_dog_ball}")

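
As a rough illustration of what a distance function like `cosine_distance` might compute, here is a minimal sketch. The helper names and the three-dimensional toy vectors are made up for this example; they are not the implementation in `tutorial2_task1` and not real GloVe values.

```python
import numpy as np

def cosine_distance_sketch(u, v):
    """Return 1 - cos(angle between u and v); 0 means same direction."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance_sketch(u, v):
    """Return the straight-line distance between u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))

# Toy vectors, invented for illustration only (not GloVe values)
toy = {
    "dog": np.array([0.9, 0.8, 0.1]),
    "cat": np.array([0.8, 0.9, 0.2]),
    "ball": np.array([0.1, 0.2, 0.9]),
}

print("dog-cat :", cosine_distance_sketch(toy["dog"], toy["cat"]))
print("dog-ball:", cosine_distance_sketch(toy["dog"], toy["ball"]))
```

Cosine distance ignores vector length and only compares direction, whereas Euclidean distance is sensitive to both, which is one reason the two metrics can rank neighbours differently.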

Having established the ability to measure the distance between two words using their word embeddings, we can now extend this approach to compare a specific word against all words in our pre-trained model. For instance, which words are most similar to "awesome" in this context? Let's find out:

In [3]:
from tutorial2_task1 import most_similar
from tutorial2_task1 import cosine_distance, euclidean_distance

similar_words_cosine = most_similar(model["awesome"], model, cosine_distance)
similar_words_euclidean = most_similar(model["awesome"], model, euclidean_distance)

print("Cosine distance:")
print(similar_words_cosine)

print("Euclidean distance:")
print(similar_words_euclidean)

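
The nearest-neighbour search described above can be sketched as a sort over all stored vectors. The function name, its `top_n` parameter, and the toy embeddings below are assumptions for illustration; the actual signature of `most_similar` in `tutorial2_task1` may differ.

```python
import numpy as np

def most_similar_sketch(query_vec, embeddings, distance_fn, top_n=3):
    """Return the top_n (word, distance) pairs closest to query_vec."""
    scored = [
        (word, float(distance_fn(query_vec, vec)))
        for word, vec in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_n]

def euclidean(u, v):
    return np.linalg.norm(u - v)

# Toy 2-d embeddings, invented for illustration only
toy_embeddings = {
    "awesome": np.array([1.0, 1.0]),
    "great": np.array([0.9, 1.1]),
    "terrible": np.array([-1.0, -1.0]),
    "table": np.array([0.0, 3.0]),
}

nearest = most_similar_sketch(
    toy_embeddings["awesome"], toy_embeddings, euclidean, top_n=2
)
print(nearest)
```

Note that the query word itself always comes back first with distance 0, so a practical version would likely filter it out.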

### Compare Similarity Between Groups of Words

In [4]:
common_words = [
    # colors
    "red", "orange", "yellow", "green", "blue", "purple",
    "pink", "brown", "black", "grey", "white", "violet", 
    # months
    "january", "february", "march", "april", "may", "june", 
    "july", "august", "september", "october", "november", "december",
]

In [5]:
from tutorial2_task1 import show_similarities

show_similarities(common_words, model, cosine_distance)
show_similarities(common_words, model, euclidean_distance)


Do the word embeddings agree with your assumptions about how related these words are?

Your answer: ...

### Vector Arithmetic for Analogy Solving

What happens if we subtract 'man' from 'king'? This is like asking the model to take the concept of 'king', and remove 'maleness' from it. What you're left with (in theory) is the concept of royalty or a ruler without the gender association. Then, by adding the vector for 'woman', you're essentially asking the model to provide a word that is similar to a 'king', but with a feminine aspect.

In [None]:
import numpy as np

# king - man + woman: remove the 'male' direction, add the 'female' one
vec = model["king"] - model["man"] + model["woman"]
similar_words = most_similar(vec, model, euclidean_distance, ignore_vec=np.array([model["king"]]))

print(similar_words)
print(most_similar(model["king"], model, euclidean_distance))
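
To make the arithmetic concrete, here is a small sketch on hand-made two-dimensional vectors whose axes loosely stand for "royalty" and "maleness". The vectors and the `analogy_sketch` helper are hypothetical and chosen so the analogy works exactly; real embeddings are much noisier, and the `analogies` function in `tutorial2_task1` may behave differently.

```python
import numpy as np

# Hand-made vectors: [royalty, maleness]. Invented for illustration only.
toy = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
    "apple": np.array([-1.0, 0.5]),
}

def analogy_sketch(a, b, c, embeddings):
    """Solve 'a is to b as c is to ?' via the nearest neighbour of b - a + c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    # Exclude the three input words so they cannot be returned as the answer
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

print(analogy_sketch("man", "king", "woman", toy))
```

Excluding the input words matters in practice: with real embeddings, the nearest vector to `king - man + woman` is often `king` itself.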

More analogies:

In [None]:
from tutorial2_task1 import analogies

# add more analogies here

print(analogies("man", "king", "woman", model))
# print(analogies("brazil", "pele", "argentina", model))
# print(analogies("america", "washington", "germany", model))
# ...

## Task 2: Computing Word Embeddings - Word2Vec (12 Points)

### Preparing Training Set

In [None]:
from tutorial2_task2 import tokenize

tokens = tokenize("tutorial2.txt")

print(tokens[0:10])

In [None]:
from tutorial2_task2 import generate_training_data

context_window_size = 1

center_words, context_words = generate_training_data(tokens, context_window_size=context_window_size)

print(center_words[0], context_words[0])
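
For intuition, one way such (center, context) pairs could be built is sketched below. It assumes that positions without a full window on both sides are skipped, so every context has exactly `2 * context_window_size` words; the real `generate_training_data` in `tutorial2_task2` may handle edges differently. The function name and toy tokens are hypothetical.

```python
def generate_pairs_sketch(tokens, context_window_size=1):
    """Return (center_words, context_words) for positions with a full window."""
    center_words, context_words = [], []
    # Assumption: skip edge positions that lack a complete window
    for i in range(context_window_size, len(tokens) - context_window_size):
        left = tokens[i - context_window_size:i]
        right = tokens[i + 1:i + 1 + context_window_size]
        center_words.append(tokens[i])
        context_words.append(left + right)
    return center_words, context_words

toy_tokens = ["the", "movie", "was", "great"]
centers, contexts = generate_pairs_sketch(toy_tokens, context_window_size=1)
print(centers)
print(contexts)
```

CBOW uses each context list as input and the center word as target; SkipGram swaps the two roles.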

#### Train the following CBOW and SkipGram models by completing their 'forward' functions. Use appropriate activation functions wherever necessary.

In [None]:
import torch
import torch.optim as optim

from tutorial2_task2 import generate_mappings

vocabulary, word2id, id2word = generate_mappings(tokens)

# defining hyperparameters
embedding_dim = 50
num_epochs = 200
context_size = 2 * context_window_size
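
As a conceptual aid for the forward pass, the sketch below shows one common CBOW formulation in plain NumPy: average the context embeddings, project to vocabulary scores, and apply a log-softmax. It is not the `CBOW` class from `tutorial2_task2`, whose constructor takes a `context_size` and may combine context vectors differently (for example by concatenation). All names and sizes here are illustrative assumptions, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 4  # toy sizes, chosen for illustration

# Randomly initialised parameters (a trained model would learn these)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))
output_weights = rng.normal(size=(embedding_dim, vocab_size))

def log_softmax(scores):
    """Numerically stable log-softmax over a 1-d score vector."""
    shifted = scores - scores.max()
    return shifted - np.log(np.exp(shifted).sum())

def cbow_forward_sketch(context_ids):
    """Average the context embeddings, then score every vocabulary word."""
    hidden = embedding_matrix[context_ids].mean(axis=0)
    return log_softmax(hidden @ output_weights)

log_probs = cbow_forward_sketch(np.array([1, 3]))
print(log_probs.shape, np.exp(log_probs).sum())
```

A log-softmax output pairs naturally with a negative log-likelihood loss, which is one reason it is a frequent choice of final activation for this kind of model.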


##### Training CBOW

In [None]:
from tutorial2_task2 import CBOW, train_word2vec

cbow_inputs = torch.tensor([[word2id[w] for w in row] for row in context_words])
cbow_targets = torch.tensor([[word2id[w]] for w in center_words])

cbow_model = CBOW(len(vocabulary), context_size, embedding_dim)
cbow_loss = train_word2vec(cbow_model, cbow_inputs, cbow_targets, num_epochs)

##### Training SkipGram

In [None]:
from tutorial2_task2 import SkipGram, train_word2vec

skipgram_inputs = cbow_targets
skipgram_targets = cbow_inputs

skipgram_model = SkipGram(len(vocabulary), context_size, embedding_dim)
skipgram_loss = train_word2vec(skipgram_model, skipgram_inputs, skipgram_targets, num_epochs=num_epochs)

##### Comparing Training Performance

In [None]:
from tutorial2_task2 import plot_loss

plot_loss(
    'CBOW vs SkipGram Training Loss', 
    [
        (cbow_loss, "CBOW", "r"), 
        (skipgram_loss, "SkipGram", "b"),
    ]
)

#### Retrain Using a Larger Context Size

In [None]:
# TODO: choose one and comment out the other
model_cls = CBOW
# model_cls = SkipGram

context_window_size = 2
context_size = 2 * context_window_size

center_words, context_words = generate_training_data(tokens, context_window_size=context_window_size)

context_words_idx = torch.tensor([[word2id[w] for w in row] for row in context_words])
center_words_idx = torch.tensor([[word2id[w]] for w in center_words])

model = model_cls(vocab_size=len(vocabulary), context_size=context_size, embedding_dim=embedding_dim)

if isinstance(model, CBOW):
    inputs = context_words_idx
    targets = center_words_idx

if isinstance(model, SkipGram):
    inputs = center_words_idx
    targets = context_words_idx
    
loss = train_word2vec(model, inputs, targets, num_epochs=num_epochs)

In [None]:
from tutorial2_task2 import plot_loss

plot_loss(
    "Performance", 
    [
        (loss, "model w/ higher context", "k"), 
        (cbow_loss, "CBOW", "r"), 
        (skipgram_loss, "SkipGram", "b"),
    ]
)

Does the context window size affect the performance of the model? Explain in no more than 50 words.

Your response: ...

### Making Predictions

In [None]:
context_words = ["most", "romance"]

# Convert context words to their respective indices
context_indices = torch.tensor([word2id[word] for word in context_words])

# Get the model output
log_probs = cbow_model(context_indices)

# Find the word index with the highest probability
max_prob_idx = torch.argmax(log_probs).item()

# Find the word corresponding to the predicted index
predicted_word = id2word[max_prob_idx]

print(f"The predicted center word is '{predicted_word}'")

Now it's your turn to apply a similar approach with the SkipGram model. Complete the code below to predict the context words surrounding a given center word using the SkipGram model.

In [None]:
center_word = "movie"

# Convert the center word to its respective index
center_indices = torch.tensor([word2id[center_word]])

# Get the model output
log_probs = skipgram_model(center_indices)

# Find the word indices with the highest probability
max_prob_idx = [torch.argmax(probs).item() for probs in log_probs]

# Find the words corresponding to the predicted indices
predicted_words = [id2word[idx] for idx in max_prob_idx]

print(f"The predicted context words are {predicted_words}")

## Task 3: Implementing an RNN-LSTM Classifier (6 Points)

Create a movie review dataset containing all the positive and negative reviews from the file 'tutorial2.txt':

In [None]:
from tutorial2_task3 import load_training_data

words, targets = load_training_data("tutorial2.txt")

print(words[0], targets[0])

Encode and pad the input words.

In [None]:
from tutorial2_task1 import load_glove_model
from tutorial2_task3 import generate_mappings, encode_and_pad

embeddings = load_glove_model("glove.6B.50d.txt")
vocabulary, word2id, id2word = generate_mappings(embeddings)

features = encode_and_pad(words, word2id, 20)

print(features[0])
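
The encode-and-pad step can be pictured as follows. This is a minimal sketch under stated assumptions: index 0 serves as both the padding id and the fallback for unknown words, and long inputs are truncated on the right. The real `encode_and_pad` in `tutorial2_task3` may make different choices; the function name, `pad_id`, `unk_id`, and the toy vocabulary are hypothetical.

```python
def encode_and_pad_sketch(sentences, word2id, max_len, pad_id=0, unk_id=0):
    """Map words to ids, then truncate or right-pad each row to max_len."""
    encoded = []
    for sentence in sentences:
        # Assumption: unknown words fall back to unk_id
        ids = [word2id.get(word, unk_id) for word in sentence][:max_len]
        ids += [pad_id] * (max_len - len(ids))
        encoded.append(ids)
    return encoded

toy_word2id = {"<pad>": 0, "good": 1, "movie": 2, "bad": 3}
toy_features = encode_and_pad_sketch(
    [["good", "movie"], ["bad", "bad", "bad", "movie"]],
    toy_word2id,
    max_len=3,
)
print(toy_features)
```

Fixed-length rows are what allow the reviews to be stacked into a single tensor for batched training.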

Split the dataset into train (80%) and test (20%):

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    # TODO
)

Train a simple RNN-LSTM classifier:

In [None]:
from tutorial2_task3 import train

embeddings_tensor = torch.tensor(list(embeddings.values()))

train(x_train, y_train, num_epochs=50, embeddings=embeddings_tensor)

In [None]:
# TODO: Plot training performance (loss, accuracy)

In [None]:
# TODO: Get test accuracy

## Task 4: BERT Embedding


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.metrics import DistanceMetric
import matplotlib.pyplot as plt
import matplotlib as mpl

# 1. Load the resume dataset (resumes_train.csv)
# TODO: Read the CSV file and store it in a DataFrame named 'df_resume'
# HINT: Use pd.read_csv() to load the file

# Display the first few rows of the dataset
# TODO: Display the head of the dataframe to inspect the structure
# HINT: Use df_resume.head()



In [None]:
# 2. Encode the resumes using a pre-trained model
# TODO: Initialize the SentenceTransformer model (use 'all-MiniLM-L6-v2') and encode the 'resume' column
# HINT: Use model.encode() to encode the resumes and store them in 'embedding_arr'

# TODO: Print the shape of 'embedding_arr' to check the dimensionality of the embeddings



In [None]:
# 3. Apply PCA to reduce dimensionality
# TODO: Apply PCA to reduce the embedding dimension to 2 components
# HINT: Use PCA(n_components=2) and fit it to 'embedding_arr'

# TODO: Print the explained variance ratio of the first two components
# HINT: Use pca.explained_variance_ratio_

# 4. Visualize the resumes using PCA components
plt.figure(figsize=(8, 6))
plt.rcParams.update({'font.size': 16})
plt.grid()

c = 0
cmap = mpl.colormaps['jet']

# TODO: Loop through each unique role in df_resume['role'], extract the corresponding embeddings, and plot them
# HINT: Use plt.scatter() to plot the PCA-transformed embeddings for each role

plt.legend(bbox_to_anchor=(1.05, 0.9))
plt.xticks(rotation=45)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
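
To clarify what the explained variance ratio measures, here is a small NumPy-only sketch that performs PCA through a singular value decomposition on random toy data. It stands in for, and does not replace, `sklearn.decomposition.PCA`; the toy array and variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "embeddings": 20 samples, 5 dimensions, with unequal spread per axis
toy_embeddings = rng.normal(size=(20, 5)) * np.array([5.0, 2.0, 0.5, 0.5, 0.5])

# PCA operates on mean-centred data
centered = toy_embeddings - toy_embeddings.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Variance captured by each principal component, as a share of the total
explained_variance = singular_values**2 / (len(toy_embeddings) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project onto the first two principal components
projected = centered @ components[:2].T
print(projected.shape)
print(explained_variance_ratio[:2])
```

If the first two ratios sum to a small fraction, the 2D scatter plot discards most of the structure in the embeddings and should be read with caution.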


In [None]:

# 5. Define a job query
# TODO: Define a query as a string. This query represents the job you are searching for
# EXAMPLE: "Data Engineer with Apache Airflow experience"

# 6. Encode the job query using the same model
# TODO: Encode the job query using the SentenceTransformer model and store the result in 'query_embedding'




In [None]:
# 7. Compute the distances between the query and resumes
# TODO: Initialize the distance metric (Euclidean distance) and compute the distances between the query and each resume
# HINT: Use DistanceMetric.get_metric() and pairwise() to compute the distances

# TODO: Sort the resumes based on the distances and store the indices of the sorted array in 'idist_arr_sorted'
# HINT: Use np.argsort()

# 8. Display the top 10 most relevant roles
# TODO: Print the top 10 most relevant roles based on the sorted distances
# HINT: Use df_resume['role'].iloc[] with the sorted indices

# 9. Display the most relevant resume
# TODO: Print the resume text that is the closest match to the query
# HINT: Use df_resume['resume'].iloc[] with the sorted indices

### Resume Matching with Sentence Embeddings + PCA

This cell implements the task:
(a) Load `resumes_train.csv` with pandas.
(b) Encode resumes using `SentenceTransformer('all-MiniLM-L6-v2')` and print embedding shape.
(c) Apply PCA to 2 dimensions and produce a 2D scatter plot colored by `role`.
(d) Define a job query, encode it, compute Euclidean distances to each resume embedding, and show top matches.

In [None]:
# Code cell: implement (a)-(d)

# (a) Load dataset
import pandas as pd
import os
notebook_dir = os.path.dirname(r'd:\Neural Networks for NLP Assignments\Assignment 2\Exercise 2 Question\Tutorial 2\tutorial2.ipynb')
csv_path = os.path.join(notebook_dir, 'resumes_train.csv')

df = pd.read_csv(csv_path)
print('Loaded', len(df), 'rows from', csv_path)

# (b) Encode resumes using SentenceTransformer
try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    # Install the missing dependencies into the current interpreter, then retry
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sentence-transformers', 'scikit-learn', 'matplotlib', 'seaborn'])
    from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df['resume'].astype(str).tolist(), show_progress_bar=True, convert_to_numpy=True)
print('Embeddings shape:', embeddings.shape)

# (c) PCA to 2D and scatter plot
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.set(style='whitegrid')

pca = PCA(n_components=2)
components = pca.fit_transform(embeddings)

roles = df['role'].astype(str)
unique_roles = roles.unique()
palette = sns.color_palette('hsv', len(unique_roles))
role_to_color = {r: palette[i] for i, r in enumerate(unique_roles)}

plt.figure(figsize=(10, 8))
for r in unique_roles:
    mask = (roles == r).to_numpy()
    # Plot each role in its own palette colour
    plt.scatter(components[mask, 0], components[mask, 1], label=r, color=role_to_color[r], alpha=0.8)
plt.legend()
plt.title('PCA (2D) of Resume Embeddings')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

# (d) Job query encoding and Euclidean distances
query = "Data Engineer with Apache Airflow experience"
query_emb = model.encode([query], convert_to_numpy=True)[0]

dists = np.linalg.norm(embeddings - query_emb, axis=1)
df['distance_to_query'] = dists

# Show top 5 closest resumes
df_sorted = df.sort_values('distance_to_query').reset_index(drop=True)
print('\nTop 5 matches:')
for i in range(min(5, len(df_sorted))):
    print(f"\n#{i+1} Role: {df_sorted.loc[i,'role']}  Distance: {df_sorted.loc[i,'distance_to_query']:.4f}\n")
    snippet = df_sorted.loc[i,'resume']
    print(snippet[:500].replace('\n', ' '), '...')
