
## Overview
In this assignment, we will build a simple song recommender system inspired by the **Word2Vec** algorithm. The main idea is to treat playlists the same way language processing treats sentences.
- In a sentence, the words that occur near one another provide information about their meaning.
- In a playlist, the songs that co-occur provide information about how listeners associate them.

Word2Vec learns an **embedding** (vector representation) for each word by examining the contexts in which it appears. Words that often appear in similar contexts end up with similar embeddings. We will apply the same idea to music: train embeddings for songs so that songs appearing in similar playlists end up close to one another in the embedding space.

#### Word2Vec Variants: CBOW vs. Skip-Gram

Word2Vec has two main variants:
- **CBOW (Continuous Bag of Words):** Predict the target word from its surrounding context words.
- **Skip-Gram:** Predict the surrounding context words from a single target word.

**Example:** Consider the sentence “the quick brown fox jumps.”
- **CBOW:** If the target is “brown”, the model uses the context (“the,” “quick,” “fox,” “jumps”) to predict “brown.”
- **Skip-Gram:** If the target is “brown”, the model uses “brown” to predict each context word (“the,” “quick,” “fox,” “jumps”).

#### Extending to Playlists
Now, imagine a short playlist: “Song A, Song B, Song C” with a window size of 1.
- **CBOW:** context (Song A, Song C) $\rightarrow$ predict Song B.
- **Skip-Gram:** target Song B $\rightarrow$ predict (Song A, Song C).

In this assignment, we will use the **Skip-Gram** approach. It naturally produces many training pairs and performs well even when some songs appear infrequently.
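
To make the difference between the two variants concrete, here is a minimal sketch that builds both kinds of training examples from the three-song playlist above. The helper names (`context_window`, `cbow_examples`, `skipgram_examples`) are hypothetical and exist only for this illustration.

```python
def context_window(sequence, i, window_size):
    """Return the items within `window_size` positions of index i, excluding i itself."""
    start = max(0, i - window_size)
    return sequence[start:i] + sequence[i + 1:i + 1 + window_size]

def cbow_examples(sequence, window_size):
    """CBOW: each example is (all context items) -> target."""
    return [(tuple(context_window(sequence, i, window_size)), target)
            for i, target in enumerate(sequence)]

def skipgram_examples(sequence, window_size):
    """Skip-Gram: each example is one (target, context item) pair."""
    return [(target, ctx)
            for i, target in enumerate(sequence)
            for ctx in context_window(sequence, i, window_size)]

playlist = ["Song A", "Song B", "Song C"]
print("CBOW:     ", cbow_examples(playlist, window_size=1))
print("Skip-Gram:", skipgram_examples(playlist, window_size=1))
```

Skip-Gram yields four pairs from this playlist while CBOW yields three examples, which illustrates why Skip-Gram produces more training signal per playlist.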

#### Why This Works
Once trained, the embeddings can be used to recommend songs that are “close” to one another in the learned space. This works because co-occurrence reflects human preferences: if listeners often place two songs together in playlists, the model learns to embed them nearby.

#### Steps You Will Complete
1. **Preprocessing:** Prepare the playlists and map each song to a unique numeric ID.
2. **Model Training:** Train a Word2Vec-style model using the Skip-Gram method.
3. **Embedding Exploration:** Examine the learned embeddings and identify songs the model considers similar.

#### Learning Objectives
The aim of this assignment is to:
- Understand how ideas from language models (e.g., Word2Vec) can be applied to a different domain (music recommendation).
- Reflect on the strengths and limitations of this method.

## Importing Libraries and Data

Let's begin by installing and importing the required libraries.

In [None]:
# Install required packages (uncomment if running for the first time)
# %pip install pandas tensorflow

In [None]:
import pandas as pd
import pickle
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.sequence import skipgrams
import numpy as np
from collections import Counter

The required data for this assignment is provided in the file `training_data.pkl`.
This pickle file contains two components:  
- **Playlists:** stored as a list of lists, where each inner list represents a unique playlist and each element is a song ID.  
- **Metadata:** stored as a DataFrame, where each row corresponds to a song ID and contains information such as the song title and artist.  

In this assignment, playlists will serve as the only supervision signal: songs that appear close together in a playlist are considered related. The dataset was collected by Shuo Chen at Cornell University.  

**Example:**  
- **Playlists:** `[[0, 2, 3], [1, 4]]`  
- **Metadata (DataFrame):**  

| id | title                             | artist     |  
|----|-----------------------------------|------------|  
| 0  | Gucci Time (w/ Swizz Beatz)       | Gucci Mane |  
| 1  | Aston Martin Music (w/ Drake …)   | Rick Ross  |  
| 2  | Get Back Up (w/ Chris Brown)      | T.I.       |  
| 3  | Hot Toddy (w/ Jay-Z & Ester Dean) | Usher      |  
| 4  | Whip My Hair                      | Willow     |  

Here, the first playlist `[0, 2, 3]` corresponds to songs **Gucci Time $\rightarrow$ Get Back Up $\rightarrow$ Hot Toddy**, and the second `[1, 4]` corresponds to  **Aston Martin Music $\rightarrow$ Whip My Hair**.  

Let's now load the playlists and metadata, and inspect their shapes and a few examples to sanity-check the parsing.

In [None]:
# Load the playlists and metadata using pickle
# Make sure the file 'training_data.pkl' is in the same directory as this notebook
with open('training_data.pkl', 'rb') as f:
    data = pickle.load(f)

# Extract playlists (list of lists) and metadata (DataFrame)
playlists = data['train_playlists']
songs_df = data['songs_info']

# Basic dataset info
print("Data loaded successfully!")
print(f"Number of playlists: {len(playlists)}")

# Show a couple of example playlists
print("\nExample Playlists:")
print("Playlist 1:", playlists[0])
print("Playlist 2:", playlists[1])

# Preview a playlist with song titles
playlist_idx = 8
print(f"\nFirst 5 songs in playlist {playlist_idx+1}:",[songs_df.iloc[int(idx)].title for idx in playlists[playlist_idx]][:5])

# Show the first few rows of the metadata table
print("\nSong metadata sample:")
display(songs_df.head())

Data loaded successfully!
Number of playlists: 10738

Example Playlists:
Playlist 1: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75']
Playlist 2: ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62'

|   | title                                              | artist     |
|---|----------------------------------------------------|------------|
| 0 | Gucci Time (w\/ Swizz Beatz)                       | Gucci Mane |
| 1 | Aston Martin Music (w\/ Drake & Chrisette Mich...  | Rick Ross  |
| 2 | Get Back Up (w\/ Chris Brown)                      | T.I.       |
| 3 | Hot Toddy (w\/ Jay-Z & Ester Dean)                 | Usher      |
| 4 | Whip My Hair                                       | Willow     |


## Part 1. Data Preparation

Now that we have loaded the playlists and metadata and checked that everything looks correct, the next step is to **prepare the data for training**.  

An embedding can be thought of as a big lookup table, where each row stores the vector for one item (here, a song). This table has rows numbered from `0` up to `V`, where `V` is the number of unique songs. The first row (`0`) is reserved for a special placeholder called the padding token (`<pad>`). All actual songs start from row `1` onward.

**The issue:**  
- The raw song IDs in our playlists are arbitrary labels (e.g., 5, 20, 2172).  
- They don’t necessarily start at 1 or form a contiguous sequence.  
- If we used them directly, we could either run into errors (ID out of range) or map to the wrong row in the lookup table.  

**Solution:** build a mapping.  
- Collect all unique raw song IDs and assign each a new index in `1, 2, …, V`.  
- Store both mappings so we can translate between raw IDs and indices.  
- Rewrite each playlist to use these contiguous indices.  

**Example:**  
Suppose we have two playlists:  
- `[5, 20, 2172]`  
- `[20, 5]`  

The unique IDs are `{5, 20, 2172}`, so the vocabulary size is 3. We map:  
- `5` $\rightarrow$ `1`  
- `20` $\rightarrow$ `2`  
- `2172` $\rightarrow$ `3`  

With this mapping, the playlists become:
- `[1, 2, 3]`  
- `[2, 1]`  

Now every playlist uses contiguous indices in `1, 2, …, V`, exactly matching the rows of the lookup table.  
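
The worked example above can be reproduced in a few lines. This is only a sketch on the toy data; the `toy_` names are illustrative and are not the variables used later in the notebook.

```python
# Toy playlists with arbitrary, non-contiguous raw song IDs
toy_playlists = [[5, 20, 2172], [20, 5]]

# Sorted vocabulary of unique raw IDs
unique_ids = sorted({song for pl in toy_playlists for song in pl})

# Index 0 is left free for the padding token, so real songs start at 1
toy_raw_to_idx = {raw: idx for idx, raw in enumerate(unique_ids, start=1)}
toy_idx_to_raw = {idx: raw for raw, idx in toy_raw_to_idx.items()}

# Rewrite every playlist with the contiguous indices
remapped = [[toy_raw_to_idx[song] for song in pl] for pl in toy_playlists]

print("Vocabulary size:", len(unique_ids))
print("raw -> index:", toy_raw_to_idx)
print("Remapped playlists:", remapped)
```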

**Your task:** Complete the code below to:  
1. Build the vocabulary of unique song IDs.  
2. Create `raw_to_idx` and `idx_to_raw` mappings that map raw song IDs to indices and vice versa, respectively.  
3. Remap playlists into index form.

In [None]:
# Use only a subset of playlists to speed up training
num_playlists = 200

# Flatten the first `num_playlists` playlists into one long list of song IDs
flat = [int(x) for pl in playlists[:num_playlists] for x in pl]

# Count how often each raw song ID appears across these playlists. Output looks like: {song_id: count}
counts = Counter(flat)

# Sorted list of all unique raw song IDs (the vocabulary)
vocab_raw_ids = sorted(counts.keys())
# Vocabulary size = number of unique songs
vocab_size = len(vocab_raw_ids)
print("Vocabulary size:", vocab_size)
print("Total song occurrences:", len(flat))

Vocabulary size: 9148
Total song occurrences: 33813


In [None]:
### TODO ###
# Create mappings between raw song IDs (vocab_raw_ids) and contiguous indices (1, 2, ..., V)
# Hint: use dictionary comprehensions
# raw_to_idx: raw ID -> contiguous index, e.g. {5: 1, 20: 2, 2172: 3}
raw_to_idx = {raw_id: index for index, raw_id in enumerate(vocab_raw_ids, start=1)}
# idx_to_raw: contiguous index -> raw ID
idx_to_raw = {index: raw_id for index, raw_id in enumerate(vocab_raw_ids, start=1)}
### END OF TODO ###

# Remap each playlist from raw IDs into contiguous indices (1, 2, ..., V)
# Example: [5, 20, 2172] -> [1, 2, 3]
playlists_idx = [[raw_to_idx[int(x)] for x in pl if int(x) in raw_to_idx] for pl in playlists[:num_playlists]]

# Build lookup dictionaries for human-readable printing
# id_to_song: contiguous index (1, 2, ..., V) -> "Title by Artist"
id_to_song = {i: songs_df.iloc[idx_to_raw[i]].title + " by " + songs_df.iloc[idx_to_raw[i]].artist for i in range(1, vocab_size + 1)}
# song_to_id: "Title by Artist" -> contiguous index
song_to_id = {name: i for i, name in id_to_song.items()}

In [None]:
# Print a few samples to sanity-check the mappings
print("First 5 entries in id_to_song:")
for i, (id, song) in enumerate(id_to_song.items()):
    if i >= 5: break
    print(f"\tID {id}: {song}")

print("\nFirst 5 entries in song_to_id:")
for i, (song, id) in enumerate(song_to_id.items()):
    if i >= 5: break
    print(f"\tSong {song}: {id}")

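# `counts` is keyed by raw song ID, so convert to the contiguous index before looking up the name
print("\nTop 5 most frequent songs:")
for raw_id, count in counts.most_common(5):
    idx = raw_to_idx[raw_id]
    print(f"  Song {id_to_song[idx]} -> ID {idx} (appears {count} times)")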

First 5 entries in id_to_song:
	ID 1: Gucci Time (w\/ Swizz Beatz) by Gucci Mane
	ID 2: Aston Martin Music (w\/ Drake & Chrisette Michelle) by Rick Ross
	ID 3: Get Back Up (w\/ Chris Brown) by T.I.
	ID 4: Hot Toddy (w\/ Jay-Z & Ester Dean) by Usher
	ID 5: Whip My Hair by Willow

First 5 entries in song_to_id:
	Song Gucci Time (w\/ Swizz Beatz) by Gucci Mane: 1
	Song Aston Martin Music (w\/ Drake & Chrisette Michelle) by Rick Ross: 2
	Song Get Back Up (w\/ Chris Brown) by T.I.: 3
	Song Hot Toddy (w\/ Jay-Z & Ester Dean) by Usher: 4
	Song Whip My Hair by Willow: 5

Top 5 most frequent songs:
  Song Who's That Chick by Rihanna -> ID 13 (appears 204 times)
  Song Holding You Down (Goin' In Circles) by Jazmine Sullivan -> ID 70 (appears 166 times)
  Song Take A Chance by Micah G -> ID 3188 (appears 136 times)
  Song Be Without You by Mary J. Blige -> ID 50 (appears 125 times)
  Song Ants In Yuh Sugar Pan by Jamesy P -> ID 81 (appears 124 times)


## Part 2. Create skip-gram training pairs and labels from playlists

In the previous step, we prepared the playlists so that every song was mapped to a contiguous index between `1` and `V`. Now we want to turn those playlists into training examples that can teach our embedding model.

We first turn each playlist (a sequence of song indices) into many small training examples. For every position in a playlist we treat the song at that position as the **target** and the nearby songs (within a fixed window to the left and right) as its **context**. Each (target, context) pair is a **positive** example because those two songs actually appeared near each other. To train a model, we also create **negative** examples by pairing the same targets with randomly chosen other songs that did not appear next to them. Positives get label `1`, negatives get label `0`.

**Example (window size = 1):**  
Playlist: `[5, 7, 9]`  

- Target `5`, context `[7]` $\rightarrow$ pair `(5, 7)`  
- Target `7`, context `[5, 9]` $\rightarrow$ pairs `(7, 5)`, `(7, 9)`  
- Target `9`, context `[7]` $\rightarrow$ pair `(9, 7)`  

Positive set: `[(5, 7), (7, 5), (7, 9), (9, 7)]` (all labeled 1)  
If we add 2 random negatives per positive (e.g. `(5, 12)`, `(5, 3)`, …) those get label 0.

This manual process involves:
- Sliding a window around each song.
- Emitting all (target, context) positives.
- Sampling random contexts for negatives.
- Returning parallel arrays: `targets`, `contexts`, `labels`.
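
The manual process can be sketched on the `[5, 7, 9]` example as follows. The function name `toy_skipgram_arrays`, the vocabulary size of 12, and the seed are assumptions made for this illustration; negatives are drawn uniformly and are not checked against real co-occurrences.

```python
import random

def toy_skipgram_arrays(sequence, window_size, vocab_size, negatives_per_positive, seed=0):
    """Return parallel lists (targets, contexts, labels) for one playlist."""
    rng = random.Random(seed)
    targets, contexts, labels = [], [], []
    for i, target in enumerate(sequence):
        lo = max(0, i - window_size)
        hi = min(len(sequence), i + window_size + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a song is not its own context
            # Positive pair: the two songs really are neighbours
            targets.append(target)
            contexts.append(sequence[j])
            labels.append(1)
            # Negative pairs: same target, random context in 1..vocab_size (inclusive)
            for _ in range(negatives_per_positive):
                targets.append(target)
                contexts.append(rng.randint(1, vocab_size))
                labels.append(0)
    return targets, contexts, labels

targets, contexts, labels = toy_skipgram_arrays([5, 7, 9], window_size=1,
                                                vocab_size=12, negatives_per_positive=2)
positives = [(t, c) for t, c, y in zip(targets, contexts, labels) if y == 1]
print("Positives:", positives)
print("Total pairs:", len(labels), "| positives:", sum(labels))
```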

The steps above describe how to form (target, context) pairs by hand. Keras also provides a helper, `tf.keras.preprocessing.sequence.skipgrams`, that automates this and (optionally) adds negative samples. Here is what it does when called:

**Call pattern:**
```python
pairs, labels = skipgrams(
    sequence=seq,
    vocabulary_size=vocab_size + 1,  # +1 because sampled indices stop at vocabulary_size - 1
    window_size=WINDOW_SIZE,
    negative_samples=NEGATIVE_SAMPLES
)
```

**Steps performed internally:**
1. Positive window sampling  
   - For each position `i` in `sequence`, it picks context indices within `[-window_size, +window_size]` (excluding `i`).  
   - Each valid (target, context) becomes a positive pair.

2. Negative sampling  
   - For every positive pair, it draws `negative_samples` random context indices from `1` to `vocabulary_size - 1` (index `0` is never drawn). Because the upper bound stops one short of `vocabulary_size`, we pass `vocab_size + 1` as the argument so that the last song index `V` can also be sampled.
   - These synthetic (target, random_context) pairs are labeled as negatives.

3. Output format  
   - `pairs` is a Python list of `[target_index, context_index]`.
   - `labels` is a list of the same length where `1` = real (positive) and `0` = negative.
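
The need for the `+ 1` can be checked with a small experiment. This sketch assumes, as described above, that negatives are drawn from `1` up to but not including `vocabulary_size`; `draw_negative` is a hypothetical stand-in written for this illustration, not the Keras implementation.

```python
import random

def draw_negative(vocabulary_size, rng):
    """Assumed behaviour: uniform draw over 1 .. vocabulary_size - 1 (index 0 is never drawn)."""
    return rng.randint(1, vocabulary_size - 1)  # random.randint includes both endpoints

V = 3  # three real songs with indices 1, 2, 3
rng = random.Random(0)

drawn_with_v = {draw_negative(V, rng) for _ in range(1000)}
drawn_with_v_plus_1 = {draw_negative(V + 1, rng) for _ in range(1000)}

print("vocabulary_size = V     ->", sorted(drawn_with_v))         # song index V is missing
print("vocabulary_size = V + 1 ->", sorted(drawn_with_v_plus_1))  # every song can be drawn
```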

**Why we still post‑process:**
- We accumulate `pairs` and `labels` from every playlist into large arrays for training.

**Key parameters you can tune:**
- `window_size`: how far left/right to look for context.
- `negative_samples`: how many negatives per positive (higher → larger, more imbalanced dataset).

This utility saves you from manually:
- Sliding the window,
- Building positive lists,
- Generating negatives and labeling them.

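A single call therefore returns a ready-to-use, mixed list of (target, context) pairs and a matching list of labels for that sequence. The code cell below implements a similar procedure by hand in `generate_skipgrams`.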

In [None]:
import numpy as np

# -----------------------------
# Hyperparameters
# -----------------------------
WINDOW_SIZE = 10
NEGATIVE_SAMPLES = 5
MAX_PAIRS_PER_PLAYLIST = 10_000

# Custom Skip-Gram Generator

def generate_skipgrams(sequence, window_size, vocab_size, negative_samples=5):
    pairs = []
    labels = []

    seq = [int(s) for s in sequence]  # ensure pure python ints

    for i, target in enumerate(seq):
        start = max(0, i - window_size)
        end = min(len(seq), i + window_size + 1)

        for j in range(start, end):
            if i == j:
                continue

            context = seq[j]

            # Positive pair
            pairs.append([target, context])
            labels.append(1)

            # Negative samples
            for _ in range(negative_samples):
                negative = np.random.randint(1, vocab_size + 1)
                pairs.append([target, negative])
                labels.append(0)

    return pairs, labels

# Build All Training Pairs

pairs_all = []
labels_all = []

rng = np.random.default_rng(42)

for seq in playlists_idx:
    if len(seq) < 2:
        continue

    # Generate pairs
    pairs, labels = generate_skipgrams(
        sequence=seq,
        window_size=WINDOW_SIZE,
        vocab_size=vocab_size,
        negative_samples=NEGATIVE_SAMPLES
    )

    # Subsample if too large
    if len(pairs) > MAX_PAIRS_PER_PLAYLIST:
        choose = rng.choice(len(pairs), size=MAX_PAIRS_PER_PLAYLIST, replace=False)
        pairs = [pairs[i] for i in choose]
        labels = [labels[i] for i in choose]

    pairs_all.extend(pairs)
    labels_all.extend(labels)


# Convert to NumPy arrays

pairs_all = np.array(pairs_all, dtype=np.int32)
labels_all = np.array(labels_all, dtype=np.float32)

targets_np = pairs_all[:, 0]
contexts_np = pairs_all[:, 1]

print("✅ Skip-gram generation complete.")
print(f"Total training pairs: {len(pairs_all)}")
print(f"Positive samples: {int(labels_all.sum())}")
print(f"Negative samples: {len(labels_all) - int(labels_all.sum())}")


✅ Skip-gram generation complete.
Total training pairs: 1792020
Positive samples: 298664
Negative samples: 1493356


In [None]:
print(f"Generated: {len(pairs_all)} training pairs | {int(labels_all.sum())} positives | {len(labels_all)-int(labels_all.sum())} negatives")
print(f"Example pairs: {pairs_all[:5]}")
print(f"Example labels: {labels_all[:5]}\n")

print(f"Target shape: {targets_np.shape}")
print(f"Context shape: {contexts_np.shape}")

Generated: 1792020 training pairs | 298664 positives | 1493356 negatives
Example pairs: [[  16 5950]
 [  46    3]
 [  54 2477]
 [  34  945]
 [  58 8665]]
Example labels: [0. 1. 0. 0. 0.]

Target shape: (1792020,)
Context shape: (1792020,)


## Question 1.  
When we create negative samples, we are drawing random song IDs without checking whether that pair might actually appear together in a real playlist. This means that sometimes a "negative" pair could in fact be a true (target, context) pair. Would that be a problem? If so, how should we fix it? If not, why is that acceptable?

## Part 3. Build and compile the Keras Word2Vec model

Now that we have training pairs, we need a model that can learn embeddings for each song.  In Keras, we can use the **`Embedding`** layer to do this. An embedding is just a table:  
- Each row is a vector for one item (here, one song).  
- Instead of manually creating and updating this table, `layers.Embedding` handles it during training.  

The `Embedding` layer in Keras takes two key arguments:  
- `input_dim`: the number of unique items we have (the vocabulary size).  
- `output_dim`: the length of each embedding vector (the number of features per song).  

Example: if `input_dim = 1000` and `output_dim = 32`, the layer will learn a `1000 × 32` matrix,  
where each of the 1000 rows is a 32-dimensional embedding for one song.  
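
Conceptually, an embedding lookup is just row indexing into that matrix. The sketch below uses a randomly initialised NumPy array as a stand-in for the weights that Keras would learn; the seed and the initialisation scale are arbitrary choices for illustration.

```python
import numpy as np

input_dim, output_dim = 1000, 32

# Stand-in for the trainable weight matrix of an Embedding layer
rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.05, size=(input_dim, output_dim))

# A batch of song indices; looking them up is plain row selection
song_indices = np.array([3, 8, 3])
vectors = embedding_table[song_indices]

print("Table shape:", embedding_table.shape)
print("Batch of looked-up vectors:", vectors.shape)
print("Same index gives the same vector:", np.array_equal(vectors[0], vectors[2]))
```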

Once we have embeddings, we need to compare them:  
- **`layers.Dot`** computes the dot product between two vectors.  
  - If vectors point in a similar direction, the dot product is large.  
  - If vectors are very different, the dot product is small or negative.  
- **`layers.Activation`** maps this raw score into a probability-like output (e.g., between 0 and 1).  

This makes training possible as a **binary classification task**:  
- Positive pairs (real co-occurrences) should get outputs close to 1.  
- Negative pairs (randomly sampled) should get outputs close to 0.  

**Step-by-step flow:**  
1. Input two integers: one for the **target song**, one for the **context song**.  
2. `Embedding` layer looks up each song's vector.  
3. `Dot` computes similarity: target vector $\cdot$ context vector.  
4. `Activation` turns this score into a probability for classification.  

**Manual example:**  
- Suppose row 3 = `[0.2, -0.5]`, row 8 = `[0.7, 0.1]`.  
- Input: target `3`, context `8`.  
- Lookup: `[0.2, -0.5]` and `[0.7, 0.1]`.  
- Dot product = $0.2 \times 0.7 + (-0.5) \times 0.1 = 0.14 - 0.05 = 0.09$.  
- Activation transforms `0.09` into a probability-like output (e.g., `0.52`).  
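
The manual example can be verified with plain Python, assuming the activation is a sigmoid (the usual choice for a binary classification output):

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    """Squash a raw score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

row_3 = [0.2, -0.5]  # embedding of the target song
row_8 = [0.7, 0.1]   # embedding of the context song

score = dot(row_3, row_8)
prob = sigmoid(score)
print(f"score = {score:.2f}, probability = {prob:.4f}")
```

A score of exactly `0` would map to `0.5`, so a small positive score such as `0.09` lands just above one half.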

**Your task:** complete the code below to:  
1. Define the embedding model with `Embedding`, `Dot`, and `Activation` layers.  
2. Compile the model with an optimizer and loss function suitable for binary classification.

In [None]:
# Build the Keras Word2Vec model
def create_word2vec_model(vocab_size, embedding_dim=32):
    """
    Build a Skip-Gram with Negative Sampling (SGNS) model.

    Args:
        vocab_size: number of unique songs V. The embedding table gets V + 1 rows because index 0 is reserved for padding.
        embedding_dim: size of each song vector.

    Returns:
        model: Keras Model that takes (target_idx, context_idx) and outputs p(real_pair).
        embedding_layer: the shared Embedding layer whose weights are the learned song vectors.
    """

    # Input layers
    target_input = keras.Input(shape=(), name='target')
    context_input = keras.Input(shape=(), name='context')

    ### TODO ###
    # Embedding layer(s) for songs
    # Each lookup returns a vector of length `embedding_dim`
    embedding_layer = layers.Embedding(
        input_dim=vocab_size + 1,
        output_dim=embedding_dim,
        name='song_embeddings'
    )

    # Lookup vectors for target and context (shape: (batch, embedding_dim))
    target_embedding = embedding_layer(target_input)
    context_embedding = embedding_layer(context_input)

    # Similarity score: dot product of the two vectors (per example)
    dot_product = layers.Dot(axes=1)([target_embedding, context_embedding])

    # Probability that (target, context) is a real pair
    output = layers.Activation("sigmoid")(dot_product)
    ### END OF TODO ###

    # Build the model
    model = keras.Model(inputs=[target_input, context_input], outputs=output)

    return model, embedding_layer

In [None]:
### TODO ###
# Create the model
embedding_dim = 32
model, embedding_layer = create_word2vec_model(vocab_size, embedding_dim)

# Compile the model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"]
)
### END OF TODO ###

In [None]:
# Print model summary
model.summary()

## Question 2.  
In your Keras model did you use a *shared embedding table* for both the target and the context songs (the same layer handles both inputs) or *separate embedding tables*? Is one of these approaches "wrong"? What could be the benefits and drawbacks of each choice, and does the answer change depending on whether we use Skip-Gram or CBOW?

## Question 3.
What metric do you think would make sense to include in `model.compile()` to track the model’s learning? Do you think that metric would actually reflect the quality of the learned embeddings? Why? Similarly, does using a `validation_split` during training give us a meaningful way to evaluate this model?


## Question 4.
Is defining a `Dense` layer after the dot product different from simply passing the dot product directly into an activation function? If so, is this still a valid approach, or does it break the model? If it is valid, what are the concrete benefits and drawbacks?

## Part 4. Train the model

Now that the model is defined and compiled, the next step is to **train it**. Training means feeding the model batches of `(target, context)` pairs along with their labels:  
- Label `1` for real co-occurrences (positive pairs).  
- Label `0` for randomly generated negatives.  

The model adjusts the values in the embedding table so that:  
- Real pairs get higher scores.  
- Negative pairs get lower scores.  
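
To see why training pushes scores in these directions, consider the binary cross-entropy loss for a single pair. The sketch below is a plain-Python illustration of the formula $-[y \log p + (1 - y) \log(1 - p)]$; the probabilities are made-up values chosen for the example.

```python
import math

def binary_cross_entropy(label, prob):
    """Loss for one example: small when `prob` agrees with `label`."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

examples = [
    ("positive pair, unsure model   ", 1, 0.5),
    ("positive pair, confident model", 1, 0.9),
    ("negative pair, unsure model   ", 0, 0.5),
    ("negative pair, confident model", 0, 0.1),
]
for name, label, prob in examples:
    print(f"{name}: p = {prob:.1f}, loss = {binary_cross_entropy(label, prob):.4f}")
```

Raising the predicted probability for a real pair, or lowering it for a sampled negative, reduces the loss, and that is the signal the optimizer follows when it updates the embedding table.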

We use `model.fit()` to run training. Some important arguments are:  
- `x`: the inputs (here, the arrays for target songs and context songs).  
- `y`: the labels for each pair.  
- `batch_size`: how many examples are processed before updating the model.  
- `epochs`: how many full passes through the dataset.  
- `shuffle`: whether to shuffle data each epoch.  
- `validation_split`: reserve a fraction of the data for validation.  
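
One detail worth keeping in mind: according to the Keras documentation for `fit`, the `validation_split` fraction is taken from the end of the arrays before any shuffling. When pairs are stored playlist by playlist, the held-out portion then comes only from the last playlists. A minimal sketch of shuffling the parallel arrays beforehand, using toy arrays and a hypothetical helper `shuffle_in_unison`:

```python
import numpy as np

def shuffle_in_unison(targets, contexts, labels, seed=42):
    """Apply one random permutation to all three parallel arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    return targets[order], contexts[order], labels[order]

# Toy stand-ins for the training arrays
toy_targets = np.array([1, 1, 2, 2, 3, 3])
toy_contexts = np.array([2, 9, 1, 7, 2, 8])
toy_labels = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])

t, c, y = shuffle_in_unison(toy_targets, toy_contexts, toy_labels)
print(np.stack([t, c, y.astype(int)], axis=1))  # rows stay aligned: (target, context, label)
```

With the notebook's arrays this would be called as `shuffle_in_unison(targets_np, contexts_np, labels_all)` before `model.fit`.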

**Your Task:** Complete and run the code below to train the model.  
Set a reasonable number of epochs so that training runs in a reasonable amount of time but still shows progress.

In [None]:
# Train the model
print("Training the Keras Word2Vec model...")
### TODO ###
from tensorflow.keras.callbacks import ReduceLROnPlateau

lr_scheduler = ReduceLROnPlateau(
    monitor='val_accuracy',
    factor=0.5,
    patience=2,
    min_lr=1e-6
)
history = model.fit(
    x=[targets_np, contexts_np],
    y=labels_all,
    batch_size=2048,
    epochs=20,
    validation_split=0.1,
    callbacks=[lr_scheduler]  # without this, the scheduler defined above is never used
)

### END OF TODO ###
print("Training completed!")

Training the Keras Word2Vec model...
Epoch 1/20
788/788 ━━━━━━━━━━━━━━━━━━━━ 8s 9ms/step - accuracy: 0.5297 - loss: 0.6860 - val_accuracy: 0.4104 - val_loss: 0.7018
Epoch 2/20
788/788 ━━━━━━━━━━━━━━━━━━━━ 9s 7ms/step - accuracy: 0.6028 - loss: 0.6294 - val_accuracy: 0.4324 - val_loss: 0.7292
Epoch 3/20
788/788 ━━━━━━━━━━━━━━━━━━━━ 7s 9ms/step - accuracy: 0.6388 - loss: 0.6079 - val_accuracy: 0.4421 - val_loss: 0.7436
Epoch 4/20
788/788 ━━━━━━━━━━━━━━━━━━━━ 6s 7ms/step - accuracy: 0.6594 - loss: 0.5954 - val_accuracy: 0.4482 - val_loss: 0.7524
Epoch 5/20
788/788 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.6739 - loss: 0.5871 - val_accuracy: 0.4532 - val_loss: 0.7595
Epoch 6/20
788/788 ━━━━━━━━━━━━━━━━━━━━ 6s 7ms/step - accuracy: 0.6870 - loss: 0.5782 - val_accuracy: 0.4576 - val_loss

## Part 5. Extract embeddings and build a recommendation function

After training, the most valuable part of our model is the **embedding table**.  
- Each row in this table is a learned vector for one song.  
- Songs that appear in similar playlist contexts should have embeddings that are close together in this vector space.  

We can now build a simple recommendation function:  
1. Given a song, look up its embedding.  
2. Compare it to the embeddings of all other songs using **cosine similarity**.  
3. Rank the results by similarity and return the top matches.  

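Cosine similarity measures how aligned two vectors are, regardless of their lengths:  
- Value near `1` $\rightarrow$ songs are very similar.  
- Value near `0` $\rightarrow$ songs are unrelated.  
- Value near `-1` $\rightarrow$ the vectors point in opposite directions.  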
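
The three recommendation steps can be sketched on a toy embedding table. The song names and two-dimensional vectors below are invented for illustration and do not come from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

toy_embeddings = {
    "song_a": [1.0, 0.0],
    "song_b": [0.9, 0.1],   # nearly the same direction as song_a
    "song_c": [0.0, 1.0],   # orthogonal to song_a
    "song_d": [-1.0, 0.0],  # opposite to song_a
}

query = "song_a"
# Score every other song against the query, then rank from most to least similar
ranked = sorted(
    ((name, cosine_similarity(toy_embeddings[query], vec))
     for name, vec in toy_embeddings.items() if name != query),
    key=lambda item: item[1],
    reverse=True,
)
for name, sim in ranked:
    print(f"{name}: {sim:.3f}")
```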

**Your Task:** Complete and run the function below so that, when given a song name, it finds the most similar songs in the embedding space. Be sure to **exclude the song itself** from the results. This function will serve as our first recommender: a way to see whether the embeddings are capturing meaningful relationships.

In [None]:
# Extract embeddings and create recommendation function

def find_similar_songs_keras(song_name, top_n=10):
    """
    Find the top-N most similar songs to a given song, using cosine similarity
    over the embeddings learned by the Keras Word2Vec model.

    Args:
        song_name: string, song in "Title by Artist" format.
        top_n: how many similar songs to return.

    Returns:
        List of (song_name, similarity) tuples, ranked by similarity.
        If the input song is not in the vocabulary, return an error string.
    """
    if song_name not in song_to_id:
        return f"Song {song_name} not found in vocabulary"

    # Extract the learned embedding matrix from the model (row 0 is the padding row)
    embeddings = embedding_layer.get_weights()[0]
    ### TODO ###
    # Look up the embedding vector for the target song
    target_idx = song_to_id[song_name]
    target_embedding = embeddings[target_idx]

    # Compute cosine similarity between the target and every row of the table
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(target_embedding)
    similarities = np.dot(embeddings, target_embedding) / (norms + 1e-10)

    # Exclude the song itself and the padding row (index 0 has no entry in id_to_song)
    similarities[target_idx] = -np.inf
    similarities[0] = -np.inf

    # Get indices of top-N most similar songs, highest similarity first
    similar_indices = similarities.argsort()[-top_n:][::-1]
    ### END OF TODO ###

    # Convert back to song IDs and get similarities
    similar_songs = []
    for idx in similar_indices:
        similar_song_name = id_to_song[idx]
        similarity = similarities[idx]
        similar_songs.append((similar_song_name, similarity))
    return similar_songs

In [None]:
# Test the recommendation function
song_id = 2172  # Raw song ID, e.g. Fade To Black by Metallica
test_song = songs_df.iloc[song_id].title + ' by ' + songs_df.iloc[song_id].artist  # same "Title by Artist" format as id_to_song
print(f"\nKeras Word2Vec recommendations for song {test_song}:")
keras_recommendations = find_similar_songs_keras(test_song)

if isinstance(keras_recommendations, str):
    # If an error message is returned, print it directly
    print(keras_recommendations)
else:
    for song, similarity in keras_recommendations:
        print(f"  {song}: (similarity: {similarity:.3f})")


Keras Word2Vec recommendations for song Fade To Black by Metallica:
  Don't Cry by Guns N' Roses: (similarity: 0.630)
  Them Bones by Alice In Chains: (similarity: 0.585)
  I Stand Alone by Godsmack: (similarity: 0.566)
  Aenema by Tool: (similarity: 0.566)
  No One Like You by Scorpions: (similarity: 0.559)
  Prayin' For Daylight by Rascal Flatts: (similarity: 0.559)
  All My Friends Say by Luke Bryan: (similarity: 0.559)
  Beer!!! by Psychostick: (similarity: 0.558)
  Juke Box Hero by Foreigner: (similarity: 0.556)
  Fell On Black Days by Soundgarden: (similarity: 0.544)


## Question 5.  
Why do we bother training embeddings at all? Couldn't we simply build a co-occurrence matrix where each entry counts how often two songs appear together in playlists? If not, why not? If yes, what are the strengths and limitations of this baseline compared to learned embeddings, and in what cases might the co-occurrence approach be sufficient?

## Question 6.
We are building a recommender from playlists where no one explicitly rated or labeled the songs. Does that mean this approach is an unsupervised learning method, or should it be classified differently? Explain your reasoning by considering how the training pairs are created and whether labels are really absent or just implicit. Also, since we do not have explicit ratings, how can we evaluate whether the recommender is working well?