# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

For this task, I created a custom dataset of Nobel Prize winners from 2022 to 2024. The dataset was generated by scraping information from a reliable source, Wikipedia.

The dataset contains information about each <b>laureate</b>, including:
-   The <b>year</b> of the award.
-   The <b>category</b> (e.g., Physics, Chemistry, Medicine, Literature, Economics).
-   The <b>laureate's name</b> and <b>nationality</b>.
-   The <b>citation</b>, a short statement of the contribution or discovery for which the prize was awarded.

This dataset will help demonstrate how custom query models can extract and summarize recent information on which the model was not pre-trained.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

### Preparing the dataset

In [1]:
import requests

# Wikipedia base url
base_url = "https://en.wikipedia.org/w/api.php"

years = {2022, 2023, 2024}
prefix = {2022: "", 2023: "The_Nobel_Prize_in_", 2024: ""}

fields = {"Physics", "Chemistry", "Physiology_or_Medicine", "Literature", "Economic_Sciences"}

headers = {"User-Agent": "GenerativeAIProject/2.0"}

years_response = []
for year in years:
    params = {
        "action": "parse", 
        "prop": "text",
        "page": str(year) + " Nobel Prizes",
        "formatversion": 2,
        "format": "json"
    }
    
    print(f"Scraping {year} data from Wikipedia")
    # Reuse the base URL defined above instead of repeating the literal
    res = requests.get(base_url, params=params, headers=headers)
    if res.status_code != 200:
        print(f"Could not fetch data for the year {year}")
        continue
    years_response.append({"year": year, "data": res.text})

Scraping 2024 data from Wikipedia
Scraping 2022 data from Wikipedia
Scraping 2023 data from Wikipedia


In [2]:
import json
import re
from bs4 import BeautifulSoup

records = []
for year_response in years_response:
    year = year_response['year']
    data = json.loads(year_response['data'])
    soup = BeautifulSoup(data['parse']['text'], 'html.parser')
    for field in fields:
        # Each section heading uses the field name as its id (with a prefix for some years)
        heading_id = prefix[year] + field
        heading = soup.find('h3', id=heading_id)
        if heading is None:
            print(f"No section found for {field} in {year}")
            continue
        table = heading.find_next('table')

        # Laureates who share a prize can share these cells, so carry the last value forward
        shared_nationality = None
        shared_citation = None

        rows = table.find_all("tr")[1:]  # skip the header row
        for row in rows:
            cols = row.find_all("td")
            if len(cols) < 2:
                continue

            name_tag = cols[1].find("a")
            name = name_tag.text.strip() if name_tag else None

            # Extract the nationality
            if len(cols) >= 3:
                nationality = cols[2].get_text(" ", strip=True)
                nationality = re.sub(r"\s+", " ", nationality).strip()
                shared_nationality = nationality
            else:
                nationality = shared_nationality

            # Extract the citation
            if len(cols) >= 4:
                citation = cols[3].get_text(" ", strip=True).strip('"')
                shared_citation = citation
            else:
                citation = shared_citation

            records.append({
                "name": name,
                "nationality": nationality,
                "citation": citation,
                "year": year,
                "field": field
            })
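
The parsing cell above relies on a carry-forward rule: when a row omits the nationality or citation cell, the value from the previous row is reused. The sketch below isolates that rule on plain Python lists rather than real HTML. The row layout (name, then nationality, then citation), the function name `fill_shared_cells`, and the placeholder values are assumptions made for illustration, not the actual Wikipedia table structure.

```python
def fill_shared_cells(rows):
    """Carry nationality and citation forward for rows that omit them."""
    records = []
    shared_nationality = None
    shared_citation = None
    for cells in rows:
        name = cells[0]
        # A row only updates a shared value when it actually has that cell
        if len(cells) >= 2:
            shared_nationality = cells[1]
        if len(cells) >= 3:
            shared_citation = cells[2]
        records.append({
            "name": name,
            "nationality": shared_nationality,
            "citation": shared_citation,
        })
    return records


# Hypothetical rows: only the first row of a shared prize has every cell
rows = [
    ["Laureate A", "Country X", "for work on topic T"],
    ["Laureate B", "Country Y"],
    ["Laureate C"],
]
filled = fill_shared_cells(rows)
for record in filled:
    print(record)
```

Resetting the two shared variables at the start of each table keeps a value from one prize from leaking into the next.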

In [3]:
import pandas as pd
df = pd.DataFrame(records)

In [4]:
df.to_csv("Nobel_Laureates_2022_to_2024.csv", index=False, encoding="utf-8")

In [5]:
# Read the saved CSV file; each row is summarized in natural language with the OpenAI API in the next cells
df = pd.read_csv("Nobel_Laureates_2022_to_2024.csv")
df.head()

|   | name | nationality | citation | year | field |
|---|------|-------------|----------|------|-------|
| 0 | John Hopfield | American | for foundational discoveries and inventions th... | 2024 | Physics |
| 1 | Geoffrey Hinton | British Canadian | for foundational discoveries and inventions th... | 2024 | Physics |
| 2 | David Baker | American | for computational protein design | 2024 | Chemistry |
| 3 | Demis Hassabis | British | “for protein structure prediction” | 2024 | Chemistry |
| 4 | John M. Jumper | American | “for protein structure prediction” | 2024 | Chemistry |


In [None]:
from openai import OpenAI

# Set the API key and the base url
API_KEY = "" # API KEY
API_BASE = "https://openai.vocareum.com/v1"

client = OpenAI(
    api_key=API_KEY,
    base_url=API_BASE
)

def summary(row):
    prompt = f"""
    Summarize the following Nobel Prize information in a natural, human-like sentence.
    
    Name: {row['name']}
    Nationality: {row['nationality']}
    Field: {row['field']}
    Year: {row['year']}
    Citation: {row['citation']}
    
    I want to make the Name, Nationality, Field, Year to be emphasized with double quotes.
    """
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role":"system", "content":"You are a helpful assistant that summarizes Nobel Prize data clearly and concisely without missing any data."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2
    )
    
    return response.choices[0].message.content.strip()

In [7]:
df['text'] = df.apply(summary, axis=1)
df = df[["text"]]

In [8]:
df.to_csv("Nobel_Laureates_2022_to_2024_summary.csv", index=False, encoding="utf-8")

In [9]:
df.head()

|   | text |
|---|------|
| 0 | John Hopfield, an "American" physicist, was aw... |
| 1 | Geoffrey Hinton, a "British Canadian" physicis... |
| 2 | David Baker, an "American" chemist, was awarde... |
| 3 | Demis Hassabis, a "British" chemist, was award... |
| 4 | John M. Jumper, an "American" chemist, was awa... |


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [10]:
# Embed the data
EMBEDDING_MODEL = "text-embedding-ada-002"
embedding_response = client.embeddings.create(
    input=df["text"].tolist(),
    model=EMBEDDING_MODEL
)

In [11]:
len(embedding_response.data[0].embedding)

1536

In [12]:
embeddings = [data.embedding for data in embedding_response.data]

In [13]:
df["embeddings"] = embeddings

In [14]:
df.head()

|   | text | embeddings |
|---|------|------------|
| 0 | John Hopfield, an "American" physicist, was aw... | [-0.017985869199037552, 0.010889881290495396, ... |
| 1 | Geoffrey Hinton, a "British Canadian" physicis... | [-0.024491336196660995, 0.002063378691673279, ... |
| 2 | David Baker, an "American" chemist, was awarde... | [-0.020524654537439346, -0.000664970139041543,... |
| 3 | Demis Hassabis, a "British" chemist, was award... | [-0.003563943784683943, 0.019138608127832413, ... |
| 4 | John M. Jumper, an "American" chemist, was awa... | [-0.009623452089726925, 0.00858378130942583, 0... |


In [15]:
df.to_csv("Nobel_Laureates_2022_to_2024_summary_with_Embeddings.csv", index=False, encoding="utf-8")
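
One thing worth keeping in mind when saving embeddings to CSV: each list is written as its string representation, so it comes back as text when the file is read again. The sketch below shows the round trip using only the standard library and an in-memory buffer. The three-number embedding and the example sentence are toy values assumed for illustration (the real vectors here have 1536 entries).

```python
import ast
import csv
import io

# Write one toy row to an in-memory CSV (stands in for the saved file)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["example sentence", str([0.1, -0.2, 0.3])])

# Read it back: the embedding is now a plain string, not a list
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["embeddings"]

# ast.literal_eval safely parses the string back into a list of floats
restored = ast.literal_eval(raw)
print(type(raw).__name__, type(restored).__name__)
```

If the file is later loaded with `pd.read_csv`, one possible way to restore the column is `df["embeddings"].apply(ast.literal_eval)` before computing any distances.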

In [86]:
from scipy.spatial.distance import cosine
import numpy as np
import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")

def get_sorted_embedding(prompt, df):
    query_vector = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=prompt
    )
    query_vector = np.array(query_vector.data[0].embedding)
    df_cpy = df.copy()
    df_cpy["distance"] = df_cpy["embeddings"].apply(lambda embedding: cosine(np.array(embedding), query_vector))
    df_cpy = df_cpy.sort_values(by="distance", ascending=True).reset_index(drop=True)
    return df_cpy


def get_filtered_prompt(user_prompt, max_token=1000):
    prompt_template = """
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context: 

    {}

    ---

    Question: {}
    Answer:"""
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(user_prompt))
    context = []
    for text in get_sorted_embedding(user_prompt, df)["text"].values:
        token_count = len(tokenizer.encode(text))
        if current_token_count + token_count <= max_token:
            context.append(text)
            current_token_count += token_count
        else:
            break
    
    return prompt_template.format("\n\n###\n\n".join(context), user_prompt)
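
The context-building loop in `get_filtered_prompt` is a greedy packing step: texts are taken in order of relevance until the next one would exceed the token budget. A minimal sketch of that idea follows, assuming whitespace-separated words as a stand-in for `tiktoken` tokens. The helper names `count_tokens` and `pack_context` and the sample strings are hypothetical.

```python
def count_tokens(text):
    # Assumption: one whitespace-separated word counts as one token
    return len(text.split())


def pack_context(ranked_texts, question, max_tokens):
    """Take texts in ranked order until the budget would be exceeded."""
    used = count_tokens(question)
    selected = []
    for text in ranked_texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # stop at the first text that does not fit
        selected.append(text)
        used += cost
    return selected


# Texts are assumed to be sorted from most to least relevant
ranked = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
context = pack_context(ranked, "who won", max_tokens=8)
print(context)
```

Stopping at the first text that does not fit keeps the selected context a strict prefix of the ranking, at the cost of possibly leaving some budget unused.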

In [112]:
# Querying a Completion model with context
def get_answer(user_prompt, use_RAG=False):
    prompt = user_prompt
    if (use_RAG):
        prompt = get_filtered_prompt(user_prompt)
        
    answer = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[
            {"role":"system", "content": "You are a chatbot"},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2
    )
    return answer.choices[0].message.content.strip()

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [None]:
question_1 = "Who won the Nobel Prize in Physics for Machine learning and Neural Network inventions?"

In [114]:
print("=============== WITHOUT RAG ===============")
print(f"Question: {question_1}")
print(f"Answer: {get_answer(question_1)}")

Question: Who won the Nobel Prize in Physics for Machine learning and Neural Network inventions?
Answer: The Nobel Prize in Physics has not been awarded specifically for machine learning and neural network inventions. The Nobel Prize in Physics is typically awarded for significant discoveries or advancements in the field of physics, and while machine learning and neural networks have made significant contributions to the field of artificial intelligence, they have not been the sole focus of a Nobel Prize in Physics.


In [115]:
print("=============== WITH RAG ===============")
print(f"Question: {question_1}")
print(f"Answer: {get_answer(question_1, use_RAG=True)}")

Question: Who won the Nobel Prize in Physics for Machine learning and Neural Network inventions?
Answer: Geoffrey Hinton, a "British Canadian" physicist, was awarded the Nobel Prize in "2024" for his foundational discoveries and inventions that enable machine learning with artificial neural networks.


### Question 2

In [118]:
question_2 = "Did Demis Hassabis got Nobel Prize for his work?"

In [None]:
print("=============== WITHOUT RAG ===============")
print(f"Question: {question_2}")
print(f"Answer: {get_answer(question_2)}")

Question: Did Demis Hassabis got Nobel Prize for his work?
Answer: As of my knowledge cutoff in September 2021, Demis Hassabis has not been awarded a Nobel Prize for his work. He is a prominent figure in the field of artificial intelligence and co-founder of DeepMind, a leading AI research company. If there have been any developments since then, I would not be aware of them.


In [None]:
print("=============== WITH RAG ===============")
print(f"Question: {question_2}")
print(f"Answer: {get_answer(question_2, use_RAG=True)}")

Question: Did Demis Hassabis got Nobel Prize for his work?
Answer: Yes, Demis Hassabis was awarded the Nobel Prize in 2024 for his work on protein structure prediction.


From the above responses, we can clearly see the difference between the two approaches:

1. The non-RAG response relies purely on the model's pre-trained knowledge. The model has no access to external or updated information, so its answers may be generic, outdated, or simply "I don't know".
2. In the RAG response, the model is augmented with a retrieval component that searches a custom dataset or knowledge base before answering the question. The retrieved context is passed to the model along with the query, allowing it to generate more accurate, grounded, and context-specific responses.
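
The retrieval step can be illustrated without any API calls. The sketch below ranks a few documents by cosine distance to a query vector, which is the same ordering rule used in `get_sorted_embedding`. The three-dimensional vectors and document labels are made up for illustration; real embeddings come from the embedding model and are much longer.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical embeddings for three dataset rows
documents = {
    "physics prize text": [0.9, 0.1, 0.0],
    "chemistry prize text": [0.1, 0.9, 0.0],
    "literature prize text": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of the user's question
query_vector = [0.8, 0.2, 0.0]

# Smaller distance means the document is closer in meaning to the query
ranked = sorted(
    documents,
    key=lambda name: cosine_distance(documents[name], query_vector),
)
print(ranked)
```

The texts at the front of this ranking are the ones placed into the prompt as context, which is what lets the model answer questions about events after its training cutoff.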