### Updated RAG with PyTorch Implementation of the Course Notebook
* Tokenizer and Model Loading: The original code uses torch.hub.load, which is outdated. Hugging Face’s transformers library now recommends AutoTokenizer and AutoModel for standardized loading.
* Device Handling: Device management is present but can be streamlined with a single device variable for consistency and flexibility (CPU/GPU support).
* Embedding Generation: The aggregate_embeddings function manually computes mean embeddings, which is inefficient. Mean pooling over the model’s last_hidden_state produces sentence embeddings in one step.
* Similarity Calculation: The dot product is used, but cosine similarity is more appropriate for embeddings as it focuses on direction rather than magnitude (as suggested by the exercise).
* Data Preprocessing: The process_song function is basic and can be enhanced with additional text cleaning steps.
* Visualization: The t-SNE plotting function uses suboptimal perplexity settings and could improve clarity with better labeling and color mapping.
* Code Structure: Some functions (e.g., text_to_emb) can be simplified by leveraging Hugging Face’s batch processing capabilities.
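
The point about similarity can be illustrated without any model. The sketch below uses hypothetical 2-D vectors (not real BERT embeddings) and two small helper functions written for this example. It shows a case where the dot product prefers a longer vector that points in a different direction, while cosine similarity prefers the vector that points the same way as the query.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product divided by the product of the two vector norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical toy vectors, chosen only to make the contrast visible
query = [1.0, 0.0]
same_direction_short = [2.0, 0.0]    # same direction as the query
other_direction_long = [10.0, 10.0]  # 45 degrees away, but much longer

# The dot product rewards magnitude: the long vector scores higher
print(dot(query, same_direction_short), dot(query, other_direction_long))

# Cosine similarity ignores magnitude: the aligned vector scores higher
print(cosine(query, same_direction_short), cosine(query, other_direction_long))
```

Whether magnitude carries useful signal depends on the embedding model, so this is an illustration of the difference rather than a claim that cosine similarity always ranks better.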


In [None]:
!pip install --user numpy torch transformers matplotlib scikit-learn tqdm --upgrade

In [1]:
from tqdm import tqdm
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

Explanation:

* Added AutoTokenizer and AutoModel from transformers to replace torch.hub usage.
* Introduced a device variable for consistent GPU/CPU handling across the notebook.
* Kept tqdm, numpy, torch, TSNE, and matplotlib as they’re still needed.
* Retained warning suppression for cleaner output but removed the custom warn function (unnecessary with filterwarnings).

In [2]:
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from tqdm import tqdm

# Set device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

Using device: cpu


### 3. Loading Tokenizer and Model

In [3]:
# Load tokenizer and model using Auto classes
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased").to(device)

Explanation:

* Replaced torch.hub.load with AutoTokenizer and AutoModel from transformers, which are the current standards for loading pretrained models.
* Used from_pretrained to fetch "bert-base-uncased" directly from Hugging Face’s model hub.
* Moved the model to device (defined earlier) for GPU support, improving performance on compatible hardware.

### 4. Generating Embeddings

In [17]:
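def text_to_emb(texts, max_length=512):
    """Convert a list of texts to mean-pooled embeddings of shape [batch_size, hidden_size]."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length).to(device)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    # Mean pooling over real tokens only: the attention mask zeroes out padding positions,
    # which would otherwise skew the average for the shorter texts in a batch
    mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)
    return embeddings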

# Example usage
input_text = ["This is an example sentence for BERT embeddings.", "How do you like it?"]
embeddings = text_to_emb(input_text)
print(f"Embeddings shape: {embeddings.shape}")

# Output: a tensor of shape [batch_size, hidden_size] (e.g., [2, 768] for two sentences with BERT’s 768-dimensional embeddings).

Embeddings shape: torch.Size([2, 768])
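
When a batch is padded to a common length, the sequence dimension contains padding positions as well as real tokens. The toy example below compares a plain mean over that dimension with a mask-weighted mean. The tensors are hypothetical stand-ins for `last_hidden_state` and `attention_mask`; no model is loaded, and the padding values are made up for illustration.

```python
import torch

# Hypothetical "last hidden state": batch of 2 sequences, 3 positions, hidden size 2.
# The second sequence has one real token; its other two positions are padding.
hidden = torch.tensor([
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
    [[2.0, 2.0], [9.0, 9.0], [9.0, 9.0]],
])
attention_mask = torch.tensor([[1, 1, 1], [1, 0, 0]])

# Plain mean over the sequence dimension: padding positions count as tokens
plain_mean = hidden.mean(dim=1)

# Mask-weighted mean: zero out padding, then divide by the number of real tokens
mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # shape [batch, seq_len, 1]
masked_mean = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

print(plain_mean)   # second row is pulled toward the padding values
print(masked_mean)  # second row equals its single real token
```

For texts of equal tokenized length the two results coincide; the difference only appears when shorter texts are padded.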

### 5. Preprocessing Song Lyrics

In [18]:
import re

def process_song(song):
    """Clean and preprocess song lyrics."""
    # Remove special characters and punctuation (optional)
    song = re.sub(r'[^a-zA-Z0-9\s]', '', song)
    # Collapse line breaks and repeated spaces into single spaces
    song = re.sub(r'\s+', ' ', song).strip()
    return [song]  # Return as a list for consistency with text_to_emb
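
Removing every non-alphanumeric character also removes apostrophes, so "Sweepin'" becomes "Sweepin" and "That's" becomes "Thats". If contractions should stay intact, one option is the variant sketched below. The name `process_song_keep_apostrophes` is hypothetical and not part of the notebook, and whether this helps the embeddings is untested here.

```python
import re

def process_song_keep_apostrophes(song):
    """Hypothetical variant of process_song that keeps apostrophes."""
    # Drop everything except letters, digits, whitespace, and apostrophes
    song = re.sub(r"[^a-zA-Z0-9\s']", "", song)
    # Collapse line breaks and repeated spaces into single spaces
    song = re.sub(r"\s+", " ", song).strip()
    return [song]  # one-element list, the input shape text_to_emb expects

sample = "Sweepin' the clouds away\nThat's where we meet!"
cleaned = process_song_keep_apostrophes(sample)
print(cleaned)
```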

### 6. Generating Embeddings for Questions and Songs

In [19]:
sesame_street = """
Sunny day
Sweepin' the clouds away
On my way to where the air is sweet
Can you tell me how to get
How to get to Sesame Street?

Come and play
Everything's A-okay
Friendly neighbors there
That's where we meet
Can you tell me how to get
How to get to Sesame Street?

It's a magic carpet ride
Every door will open wide
To happy people like you
Happy people like
What a beautiful

Sunny day
Sweepin' the clouds away
On my way to where the air is sweet
Can you tell me how to get
How to get to Sesame Street?
How to get to Sesame Street?
How to get to Sesame Street?
How to get to Sesame Street?
How to get to Sesame Street?
"""
# Predefined questions (unchanged from original)
song_questions = [
    "Does this song contain any violent themes, such as references to guns, killing, or physical aggression? ...",
    # ... (rest of the questions remain the same)
]

# Convert questions to embeddings
embeddings_questions = text_to_emb(song_questions)

# Process and embed songs (example with song_rage and sesame_street)
# song_rage = process_song(song_rage)  # song_rage lyrics should be provided
# embeddings_rage = text_to_emb(song_rage)

song_sesame_street = process_song(sesame_street)
embeddings_sesame_street = text_to_emb(song_sesame_street)


### 7. Similarity Calculation: Cosine Similarity

In [20]:
yes_responses = [
    "Yes, this song contains violent themes, including references to guns, killing, or physical aggression, and is not suitable for children.",
    "Yes, this song includes explicit lyrics or bad words that might be considered offensive or inappropriate for young audiences.",
    "No, the overall content of this song is not suitable for children as it includes themes, language, and messages that are too mature or unsuitable for young listeners.",
    "Yes, this song explicitly mentions weapons, such as guns and knives, which could be disturbing or inappropriate for children’s entertainment.",
    "Yes, the messages conveyed in this song are positive and uplifting, promoting values like kindness, friendship, and positivity, beneficial for children.",
    "Yes, this song includes sexual content and references to sexual behavior or suggestive language, which are inappropriate for a child-friendly environment.",
    "Yes, this song offers significant educational value, including segments that teach the alphabet, basic math, and other learning content, making it both fun and educational for children.",
    "Yes, this song promotes emotional resilience and social skills, incorporating themes about overcoming challenges and building friendships, which are essential for children's development."
]

from torch.nn.functional import cosine_similarity

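def RAG_QA(embeddings_questions, embeddings, n_responses=3):
    """Print the top responses ranked by cosine similarity to the song embedding."""
    # Cosine similarity between each question [n, d] and the song [1, d]; broadcasting gives shape [n]
    similarities = cosine_similarity(embeddings_questions, embeddings, dim=1)
    # Keep the result 1-D: squeeze() would collapse a single-question result to a 0-dim tensor
    similarities = similarities.reshape(-1)
    # Get top n indices, capped at the number of questions available
    k = min(n_responses, similarities.numel())
    top_indices = torch.topk(similarities, k).indices
    for idx in top_indices:
        print(yes_responses[idx.item()])

# Example usage
# print("Rage Against the Machine - Bullet in the Head:")
# RAG_QA(embeddings_questions, embeddings_rage)
print("\nSesame Street Theme:")
RAG_QA(embeddings_questions, embeddings_sesame_street)

Explanation:

* The recorded run of the earlier version of this cell printed the "Sesame Street Theme:" header and then failed with IndexError: slice() cannot be applied to a 0-dim tensor.
* That error is what squeeze() followed by argsort(...)[:n_responses] produces when song_questions holds a single question, as in the abbreviated list above: the similarity tensor has one element, squeeze() reduces it to a 0-dim tensor, and a 0-dim tensor cannot be sliced.
* reshape(-1) keeps the similarities 1-D, and torch.topk with k capped at the number of questions replaces the argsort-and-slice step.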
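
The ranking step can also be tried in isolation on toy vectors. The sketch below uses hypothetical 2-D embeddings and a hypothetical helper, `top_k_indices`, neither of which comes from the notebook. It relies on `torch.nn.functional.cosine_similarity` broadcasting a `[1, d]` song embedding against `[n, d]` question embeddings, and it caps `k` so that a single question still works.

```python
import torch
import torch.nn.functional as F

def top_k_indices(question_embs, song_emb, k=3):
    """Rank question embeddings [n, d] against one song embedding [1, d]."""
    sims = F.cosine_similarity(question_embs, song_emb, dim=1)  # shape [n]
    k = min(k, sims.numel())  # never request more results than there are questions
    return torch.topk(sims, k).indices.tolist()

# Hypothetical toy embeddings, chosen so the ranking is easy to check by hand
song = torch.tensor([[1.0, 0.0]])
questions = torch.tensor([
    [0.0, 1.0],   # orthogonal to the song
    [2.0, 0.1],   # almost the same direction
    [1.0, 1.0],   # 45 degrees away
    [-1.0, 0.0],  # opposite direction
])

ranked = top_k_indices(questions, song)
single = top_k_indices(questions[:1], song)
print(ranked)  # indices of the three most similar questions
print(single)  # a single question still yields a one-element list
```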

### 8. Visualization with t-SNE

In [16]:
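def tsne_plot(data, plot_title, min_points=4):
    """Visualize embeddings in 3D using t-SNE."""
    n_samples = data.shape[0]
    # perplexity must be > 0 and < n_samples; min_points=4 is a conservative threshold (an assumption)
    if n_samples < min_points:
        print(f"Skipping t-SNE for {plot_title}: need at least {min_points} embeddings, got {n_samples}")
        return
    tsne = TSNE(n_components=3, random_state=42, perplexity=min(30, n_samples - 1))
    data_3d = tsne.fit_transform(data.cpu().numpy())  # Move to CPU and convert to numpy
    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111, projection='3d')
    scatter = ax.scatter(data_3d[:, 0], data_3d[:, 1], data_3d[:, 2], c=range(len(data_3d)), cmap='viridis')
    ax.set_title(f'3D t-SNE Visualization of {plot_title} Embeddings')
    ax.set_xlabel('TSNE Component 1')
    ax.set_ylabel('TSNE Component 2')
    ax.set_zlabel('TSNE Component 3')
    plt.colorbar(scatter, label='Index')
    plt.show()

# Example usage
tsne_plot(embeddings_questions, "Question")

Explanation:

* The recorded run of the earlier version of this cell failed with InvalidParameterError: The 'perplexity' parameter of TSNE must be a float in the range (0.0, inf). Got 0 instead.
* With a single question embedding, min(30, data.shape[0] - 1) evaluates to 0, which TSNE rejects.
* The function now skips the plot when there are too few embeddings instead of passing an invalid perplexity.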