## Libraries

In [None]:
import pandas as pd
import numpy as np
import os
import json
from transformers import AutoTokenizer, AutoModel, BertTokenizer, BertModel
import torch
from dotenv import load_dotenv
import google.generativeai as genai
from sklearn.metrics.pairwise import cosine_similarity

In [44]:
load_dotenv()
GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')
genai.configure(api_key=GEMINI_API_KEY)

## Exercises

### Exercise 1

In [4]:
notebook_classic_1 = '../classic_nlp/01-regex.ipynb'
notebook_classic_1

'../classic_nlp/01-regex.ipynb'

In [9]:
with open(notebook_classic_1, 'r', encoding='utf-8') as file:
    notebook_classic_1_json = json.load(file)
# notebook_classic_1_json

1. How can we find a cell within the notebook?
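- In the editor, we can use the `find` command in the command palette (Cmd + Shift + P) to search for specific text within the notebook. Programmatically, once the notebook is loaded as JSON, every cell is an entry in the top-level `cells` list, so we can reach a cell by its index or iterate over the list and inspect each cell's `source`.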

2. How can we know if the cell contains Python code or Markdown annotations?

- `cell_type` is the attribute that tells us if the cell is a code cell or a markdown cell.
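
A minimal sketch of both answers, assuming the usual notebook JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`). The `sample_notebook` dictionary and the helper name `describe_cells` are made up for illustration and are not part of the course material.

```python
# Hypothetical in-memory notebook with the same layout as a .ipynb file.
sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n", "Some text"]},
        {"cell_type": "code", "metadata": {}, "source": ["import re\n", "re.findall('a', 'banana')"]},
    ]
}


def describe_cells(notebook_json):
    """Return (index, cell_type, first_line) for every cell in the notebook."""
    rows = []
    for index, cell in enumerate(notebook_json.get("cells", [])):
        # `source` is stored as a list of lines; join it into one string.
        source = "".join(cell.get("source", []))
        first_line = source.splitlines()[0] if source else ""
        rows.append((index, cell["cell_type"], first_line))
    return rows


for index, cell_type, first_line in describe_cells(sample_notebook):
    print(f"[{index}] {cell_type}: {first_line}")
```

The index returned here is the position of the cell in the `cells` list, which is what a user would need in order to locate the cell again.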

### Exercise 2

The simplest way to search for text is using keywords. Improve your code so that it:

1. Collects a keyword from the user, and
2. Indicates all notebooks/cells that contain that keyword.

Reflect: what is the best way to present these results to the user?

- The best way to present the results to the user is to display the notebook name and the cell number where the keyword was found. This way, the user can easily navigate to the specific cell in the notebook.
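
One possible way to produce that presentation is sketched below. It assumes the notebooks have already been parsed into a dictionary mapping file names to JSON; the function name `search_notebooks`, the `context` parameter, and the toy data are hypothetical. Unlike the cell that follows, this sketch is case-insensitive and looks at Markdown cells too.

```python
def search_notebooks(keyword, notebooks, context=30):
    """Return (notebook_name, cell_index, cell_type, snippet) for each matching cell.

    `notebooks` maps a notebook name to its parsed JSON (assumed structure).
    """
    hits = []
    needle = keyword.lower()
    for name, notebook_json in notebooks.items():
        for index, cell in enumerate(notebook_json.get("cells", [])):
            source = "".join(cell.get("source", []))
            position = source.lower().find(needle)
            if position == -1:
                continue
            # Keep a few characters around the match so the user sees it in context.
            start = max(0, position - context)
            end = position + len(keyword) + context
            snippet = source[start:end].replace("\n", " ")
            hits.append((name, index, cell["cell_type"], snippet))
    return hits


# Toy data standing in for real notebooks.
notebooks = {
    "01-regex.ipynb": {"cells": [
        {"cell_type": "markdown", "source": ["# Regular expressions"]},
        {"cell_type": "code", "source": ["text = 'Llamas are fascinating animals'"]},
    ]},
}

for name, index, cell_type, snippet in search_notebooks("fascinating", notebooks):
    print(f"{name} | cell {index} ({cell_type}) | ...{snippet}...")
```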

In [17]:
# 1. Collect a keyword from the user
keyword = input("Enter a keyword to search for: ")

# 2. Indicates all notebooks/cells that contain that keyword
def search_keyword_in_notebook(keyword, notebook_json):
    results = []
    # enumerate keeps each cell's position in the notebook, so matches can be located later
    for index, cell in enumerate(notebook_json['cells']):
        if cell['cell_type'] == 'code':
            source = ''.join(cell['source'])
            if keyword in source:
                results.append({
                    'index': index,
                    'cell_type': cell['cell_type'],
                    'source': source,
                    'metadata': cell.get('metadata', {})
                })
    return results


# 3. Search for the keyword in the notebook
results = search_keyword_in_notebook(keyword, notebook_classic_1_json)
print(f"Keyword '{keyword}' found in {len(results)} cell(s) in notebook '{notebook_classic_1}':")
for result in results:
    print("\n------------------------------")
    print(f"Cell {result['index']}:")  # position of the cell within the notebook
    print("Cell type:", result['cell_type'])
    print("Source:\n", result['source'])
    if result.get('metadata'):
        print("Metadata:", result['metadata'])
print("\n------------------------------")

Keyword 'fascinating' found in 2 cell(s) in notebook '{'cells': [{'cell_type': 'markdown', 'metadata': {}, 'source': ['# Regular expressions: finding text within text\n', '\n']}, {'cell_type': 'markdown', 'metadata': {}, 'source': ['## Exercise 1\n', '\n', 'There are many situations that call for detecting particular words in a text. For example, we could count how many times the word "are" appears in a text:']}, {'cell_type': 'code', 'execution_count': 2, 'metadata': {}, 'outputs': [], 'source': ['text = """\n', 'Llamas are fascinating animals that are found in the Andes Mountains.\n', 'Yoda says that cute animals they are.\n', 'Are llamas cute? Yes they are!\n', 'A llama never leaves its herd and will protect it with its life.\n', 'They are known for their long necks and thick fur.\n', 'Llamas are used as pack animals by indigenous people because they are strong and can carry heavy loads.\n', 'They are also very social animals and are often seen in groups.\n', 'Llamas ARE herbivores,

### Exercise 3

There is an inherent fragility in the previous system: it requires the user to guess the keyword correctly. Trivia fact: in the early 2000s, the ability to guess good keywords for Google was hyped as much as using AI chatbots is nowadays.

We can spare our user from guessing the exact keyword or phrase. After all, we have an estimator for phrase similarity: BERT!

Improve your code so that it:

1. Collects a phrase from the user,
1. Calculates the phrase embedding $q$ using the CLS token from BERT
1. Traverses the course material calculating the embedding $x_i$ for each cell
1. Finds the $k$ (try with $k=1$, then generalize to any $k$) cells with minimal cosine distance ($d = 1 - \frac{\langle q, x_i \rangle}{\|q\| \, \|x_i\|}$) to the phrase, which is the same as the $k$ cells with maximal cosine similarity.
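
The last step can be sketched independently of BERT. The snippet below assumes the cell embeddings are stacked as rows of a matrix `X` and the query is a vector `q`; the function name `top_k_by_cosine_distance` and the 2-D toy vectors are illustrative only.

```python
import numpy as np


def top_k_by_cosine_distance(q, X, k=1):
    """Return (indices, distances) of the k rows of X closest to q.

    q: query vector of shape (d,); X: matrix of shape (n, d), one row per cell.
    """
    q = np.asarray(q, dtype=float)
    X = np.asarray(X, dtype=float)
    # Cosine similarity of q against every row at once.
    similarities = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
    distances = 1.0 - similarities
    order = np.argsort(distances)[:k]  # smallest distance first
    return order, distances[order]


# Toy 2-D "embeddings" (made up for illustration, not BERT outputs).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q = np.array([1.0, 0.1])
indices, distances = top_k_by_cosine_distance(q, X, k=2)
print(indices, distances)
```

Sorting by ascending distance and sorting by descending similarity give the same ranking, so either convention works as long as it is applied consistently.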

Reflect: was this a better choice for retrieval? How can we measure this difference? (tip: research how information retrieval systems are evaluated!)
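
As a starting point for the reflection, here is a hedged sketch of three common ranking metrics: precision@k, recall@k, and reciprocal rank. It assumes someone has manually labelled which cell indices are relevant to a query; the rankings and labels below are invented purely to show the arithmetic and are not measurements of the systems in this notebook.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for item in top_k if item in relevant) / len(top_k)


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical judgments: cell indices a human marked as relevant to one query.
relevant = {2, 5}
keyword_ranking = [7, 2, 9]    # invented output of a keyword search
embedding_ranking = [2, 5, 7]  # invented output of an embedding search

for name, ranking in [("keyword", keyword_ranking), ("embedding", embedding_ranking)]:
    print(name,
          precision_at_k(ranking, relevant, 3),
          recall_at_k(ranking, relevant, 3),
          reciprocal_rank(ranking, relevant))
```

Averaging these values over a set of labelled queries gives a number that can be compared between the keyword system and the embedding system.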


In [24]:
# 1. Collects a phrase from the user
phrase = input("Enter a phrase to search for: ")

# 2. Calculates the phrase embedding q using the CLS token from BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


# 3. Traverses the course material calculating the embedding $x_i$ for each cell
def calculate_cell_embedding(cell_source):
    # Truncate to BERT's 512-token limit so that long cells do not raise an error
    inputs = tokenizer(cell_source, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
        x_i = outputs.last_hidden_state[:, 0, :].numpy()  # CLS token embedding
    return x_i.flatten()  # return a 1D vector


# The phrase is embedded with the same function that is used for the cells
q = calculate_cell_embedding(phrase)


# 4. Finds the $k$ (try with $k=1$, then generalize to any $k$) cells with maximal cosine similarity
#    ($s = \frac{\langle q, x_i \rangle}{\|q\| \, \|x_i\|}$), i.e. minimal cosine distance $1 - s$, to the phrase.
# Named cosine_sim so that it does not shadow sklearn's cosine_similarity imported above
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_most_similar_cells(q, notebook_json, k=1):
    similarities = []
    for cell in notebook_json['cells']:
        if cell['cell_type'] == 'code':
            source = ''.join(cell['source'])
            x_i = calculate_cell_embedding(source)
            similarity = cosine_sim(q, x_i)
            similarities.append((similarity, cell))
    # Sort by similarity (highest first) and get the top k
    similarities.sort(reverse=True, key=lambda x: x[0])
    return similarities[:k]

# 5. Prints the $k$ cells with the highest similarity
def print_most_similar_cells(q, notebook_json, k=1):
    most_similar_cells = find_most_similar_cells(q, notebook_json, k)
    print(f"Top {k} most similar cells to the phrase '{phrase}':")
    for i, (similarity, cell) in enumerate(most_similar_cells, start=1):
        print("\n------------------------------")
        print(f"Cell {i}:")
        print("Similarity:", similarity)
        print("Cell type:", cell['cell_type'])
        print("Source:\n", ''.join(cell['source']))
        if cell.get('metadata'):
            print("Metadata:", cell['metadata'])
    print("\n------------------------------")
print_most_similar_cells(q, notebook_classic_1_json, k=1)

Top 1 most similar cells to the phrase 'llams':

------------------------------
Cell 1:
Similarity: 0.8458878
Cell type: code
Source:
 text = """
Llamas are fascinating animals that are found in the Andes Mountains.
Yoda says that cute animals they are.
Are llamas cute? Yes they are!
A llama never leaves its herd and will protect it with its life.
They are known for their long necks and thick fur.
Llamas are used as pack animals by indigenous people because they are strong and can carry heavy loads.
They are also very social animals and are often seen in groups.
Llamas ARE herbivores, which means they are only eating plants.
They are also known for their gentle and calm nature, which makes them popular in petting zoos.
Overall, llamas are remarkable creatures that are a joy to observe and are important to the cultures where they are found.
"""

------------------------------


### Exercise 4

Now let's set our retrieval system aside for a while.

Make a small program that:

1. Collects a question from the user
1. Uses an API to redirect this question to an LLM, and immediately returns the answer.
1. Adds prompt instructions so that the answers can only concern NLP-related subjects (these are called "safeguards")
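
The three steps can be sketched without committing to a specific provider. In the snippet below, `llm` is any callable that takes a system instruction and a question and returns text; `fake_llm` is a stand-in so the sketch runs offline, and the wording of `SYSTEM_INSTRUCTION` is only one possible phrasing of a safeguard. None of these names come from a real library.

```python
from typing import Callable

# One possible safeguard: state the allowed topic and the exact refusal text.
SYSTEM_INSTRUCTION = (
    "You are an assistant for a Natural Language Processing (NLP) course. "
    "Answer only questions about NLP. "
    "If the question is about anything else, reply exactly with: "
    "'I can only answer NLP-related questions.'"
)


def ask(question: str, llm: Callable[[str, str], str]) -> str:
    """Send the question together with the safeguard text to `llm`.

    `llm` is assumed to take (system_instruction, question) and return text;
    in practice it would wrap a real API client.
    """
    question = question.strip()
    if not question:
        return "Please enter a question."
    return llm(SYSTEM_INSTRUCTION, question)


def fake_llm(system_instruction: str, question: str) -> str:
    """Stand-in for a real model so the sketch runs without network access."""
    return f"[system: {len(system_instruction)} chars] echo: {question}"


print(ask("What is tokenization?", fake_llm))
```

Spelling out the refusal text makes the safeguard easier to test: an off-topic question should produce exactly that sentence, which can be checked automatically.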

In [None]:
# 1. Collects a question from the user
question = input("Enter a question to search for: ")

# 2. Uses an API to redirect this question to an LLM, and immediately returns the answer.
generation_config = genai.GenerationConfig(
    max_output_tokens=1000,
    temperature=0.0,
    response_mime_type='application/json'
)

model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    system_instruction="You are an NLP expert. Please ensure that your response only pertains to topics related to Natural Language Processing (NLP).",
)

try:
    response = model.generate_content(question, generation_config=generation_config)
    print("Response from LLM:")
    print(response.text)
except Exception as e:
    print("An error occurred while generating content:", str(e))


Response from LLM:
{
  "color": "green"
}


### Exercise 5

Now, let's put everything together.

We are able to find specific information from our courseware. Also, we are able to use LLMs. Use both abilities to:

1. Collect a question from the user
1. Retrieve the $K$ most relevant cells from the course material
1. Use the content of these cells as part of a prompt. The prompt includes both the question and the content from the relevant cells.
1. Phrase your prompt so that the LLM can only return information that is contained in the course material.

Reflect: how does this compare to the system in Exercise 4? How can we measure the differences?
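
One crude way to start measuring the difference is to check how much of an answer is supported by the retrieved cells. The sketch below computes the fraction of distinct answer tokens that also occur in the context. This is a rough heuristic chosen for illustration, not an established metric, and the function names and example strings are hypothetical.

```python
import re


def token_set(text):
    """Lowercased set of word-like tokens in the text."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def support_ratio(answer, context):
    """Fraction of distinct answer tokens that also occur in the context."""
    answer_tokens = token_set(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & token_set(context)) / len(answer_tokens)


# Invented strings, only to show how the score behaves.
context = "regular expressions match patterns in text"
grounded_answer = "Regular expressions match patterns"
ungrounded_answer = "Llamas live in the Andes"

print(support_ratio(grounded_answer, context))
print(support_ratio(ungrounded_answer, context))
```

A low score suggests the model drew on knowledge outside the course material, which is exactly what the prompt in this exercise tries to prevent. Token overlap ignores paraphrases, so human judgment on a sample of answers remains necessary.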

In [None]:
# 1. Load embedding model (separate from your generative LLM)
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embed_tokenizer = AutoTokenizer.from_pretrained(embed_model_name)
embed_model = AutoModel.from_pretrained(embed_model_name)

# 2. Compute embedding for a piece of text (question or code cell)
def calculate_cell_embedding(text: str) -> torch.Tensor:
    """
    Tokenizes the input text and returns a mean-pooled embedding tensor.
    """
    inputs = embed_tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True
    )
    with torch.no_grad():
        outputs = embed_model(**inputs)
    # outputs.last_hidden_state shape: (batch_size, seq_len, hidden_size)
    token_embeddings = outputs.last_hidden_state  # (1, seq_len, d)
    attention_mask = inputs.attention_mask.unsqueeze(-1).float()  # (1, seq_len, 1), cast to float for the pooling arithmetic
    summed = torch.sum(token_embeddings * attention_mask, dim=1)  # (1, d)
    counts = torch.clamp(attention_mask.sum(dim=1), min=1e-9)  # (1, 1)
    mean_pooled = summed / counts  # (1, d)
    return mean_pooled[0]

# 3. Find the k most similar code cells to the question
def find_most_similar_cells(question: str, notebook_json: dict, k: int = 1):
    """
    Returns a list of (similarity_score, cell) for the top-k similar code cells.
    """
    q_embedding = calculate_cell_embedding(question).cpu().numpy().reshape(1, -1)

    similarities = []
    for cell in notebook_json.get("cells", []):
        if cell.get("cell_type") == "code":
            source = ''.join(cell.get("source", []))
            cell_emb = calculate_cell_embedding(source).cpu().numpy().reshape(1, -1)
            sim = cosine_similarity(q_embedding, cell_emb)[0][0]
            similarities.append((sim, cell))
    similarities.sort(key=lambda x: x[0], reverse=True)
    return similarities[:k]

# 4. Retrieve relevant cells
def retrieve_relevant_cells(question: str, notebook_json: dict, k: int = 1):
    """Wrapper that calls find_most_similar_cells."""
    return find_most_similar_cells(question, notebook_json, k)

def create_prompt(question: str, relevant_cells: list) -> str:
    """
    Formats a prompt with the question and the top relevant cells.
    """
    prompt = f"Question: {question}\n\nRelevant Cells:\n"
    for i, (sim, cell) in enumerate(relevant_cells, start=1):
        prompt += f"\nCell {i} (Similarity: {sim:.4f}):\n"
        prompt += ''.join(cell.get('source', [])) + "\n"
    return prompt

def create_final_prompt(question: str, relevant_cells: list) -> str:
    base = create_prompt(question, relevant_cells)
    base += "\nPlease answer the question based only on the provided course material."
    return base

def answer_question_with_llm(question: str, notebook_json: dict, k: int = 1):
    relevant_cells = retrieve_relevant_cells(question, notebook_json, k)
    prompt = create_final_prompt(question, relevant_cells)
    try:
        response = model.generate_content(
            prompt,
            generation_config=generation_config
        )
        print("Response from LLM:")
        print(response.text)
    except Exception as e:
        print("An error occurred while generating content:", str(e))


# Example usage:
# What are some special sets we should be aware of in regex?
question = input("Enter a question to search for: ")
answer_question_with_llm(question, notebook_classic_1_json, k=1)

Response from LLM:
[
  {
    "special_sets": [
      "EMAIL_REGEX",
      "HASHTAG_REGEX",
      "SOCIAL_NETWORKS_REGEX",
      "PHONE_NUMBERS_REGEX",
      "ZIP_CODE_REGEX"
    ],
    "description": "The provided code uses regular expressions to find patterns in strings. The variables EMAIL_REGEX, HASHTAG_REGEX, SOCIAL_NETWORKS_REGEX, PHONE_NUMBERS_REGEX, and ZIP_CODE_REGEX are likely regular expressions designed to match specific patterns for emails, hashtags, social network mentions, phone numbers, and zip codes, respectively. These are special sets of characters or patterns used to identify these specific data types within text."
  }
]


### Exercise 6

If you have made it this far, let's start optimizing our systems.

To do so:

1. Identify which step of your processing pipeline takes the longest
1. Study if there are techniques or data structures that can make this specific step faster
1. If possible, implement the optimization and test the results.
1. Iterate until you cannot optimize anymore.
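
A sketch of one such iteration, under the assumption that computing embeddings is the slow step: time each stage with `time.perf_counter`, embed every cell once, store the normalized vectors in a matrix, and answer each query with a single matrix-vector product. `slow_embed` is a deterministic stand-in for a model forward pass and does not produce meaningful embeddings; `EmbeddingIndex` is a hypothetical helper, not a library class.

```python
import time

import numpy as np


def slow_embed(text, dim=64):
    """Deterministic stand-in for a model forward pass (NOT a real embedding)."""
    time.sleep(0.001)  # pretend the model is expensive
    seed = int.from_bytes(text.encode("utf-8"), "big")  # unique seed per text
    return np.random.default_rng(seed).normal(size=dim)


class EmbeddingIndex:
    """Embeds every cell once, then answers queries with one matrix product."""

    def __init__(self, texts, embed):
        self.texts = list(texts)
        self.embed = embed
        matrix = np.stack([embed(t) for t in self.texts])
        # Store unit-length rows so cosine similarity is a plain dot product.
        self.matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    def search(self, query, k=1):
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        scores = self.matrix @ q
        top = np.argsort(-scores)[:k]  # highest similarity first
        return [(int(i), float(scores[i])) for i in top]


cells = [f"cell number {i}" for i in range(50)]

start = time.perf_counter()
index = EmbeddingIndex(cells, slow_embed)
build_seconds = time.perf_counter() - start

start = time.perf_counter()
hits = index.search("cell number 7", k=3)
query_seconds = time.perf_counter() - start

print(f"build: {build_seconds:.3f}s, query: {query_seconds:.3f}s, top hit: {hits[0]}")
```

With this layout the cost of embedding the course material is paid once at build time, and each query only needs one embedding plus one matrix product. The matrix could also be saved to disk so that it survives a kernel restart.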