# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

* Dataset: 

    The dataset we used contains detailed information about study programs in the Netherlands. Each program's information is structured as natural language text and saved in a CSV file: **Program_info_netherlands_text.csv**.

* Motivation: 
    
    This dataset is well-suited for building a custom chatbot because the base completion model was trained several years ago. As a result, if we ask for specific and up-to-date details about study programs in the Netherlands, the model’s responses may be outdated or inaccurate. By leveraging Retrieval-Augmented Generation (RAG), the completion model can incorporate current information from the dataset as context, allowing it to generate more accurate and relevant answers to questions about study programs in the Netherlands.

In [None]:
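import os

import numpy as np
import pandas as pd
import openai
# openai.embeddings_utils is only available in openai<1.0
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import tiktoken  # OpenAI's tokenizer library

openai.api_base = "https://openai.vocareum.com/v1"
# Read the key from the environment instead of hardcoding it in the notebook
# (the variable name OPENAI_API_KEY is an assumption; use whatever name you export)
openai.api_key = os.environ["OPENAI_API_KEY"]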

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [None]:
# Import collected text data for the study program details in the Netherlands
df = pd.read_csv("Program_info_netherlands_text.csv",index_col = 0)
df.head()

| | text |
|---|---|
| 0 | Vrije Universiteit Amsterdam offers a Bachelor... |
| 1 | University of Amsterdam offers a Short or summ... |
| 2 | Vrije Universiteit Amsterdam offers a Master p... |
| 3 | Vrije Universiteit Amsterdam offers a Master p... |
| 4 | University of Amsterdam offers a Master progra... |


In [None]:
# Generate the embeddings for the text

# text-embedding-3-large returns 3072-dimensional vectors by default
# (the earlier text-embedding-ada-002 returns 1536-dimensional vectors)
EMBEDDING_MODEL_NAME = "text-embedding-3-large"
batch_size = 100
embeddings = []
# Send the text to the embedding model in batches of `batch_size` rows
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),  # a JSON-serializable list of strings
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

| | text | embeddings |
|---|---|---|
| 0 | Vrije Universiteit Amsterdam offers a Bachelor... | [-0.049999359995126724, 0.013729332014918327, ... |
| 1 | University of Amsterdam offers a Short or summ... | [-0.029914140701293945, 0.012806381098926067, ... |
| 2 | Vrije Universiteit Amsterdam offers a Master p... | [-0.01400075200945139, 0.03491421788930893, -0... |
| 3 | Vrije Universiteit Amsterdam offers a Master p... | [-0.030460083857178688, 0.03197935223579407, -... |
| 4 | University of Amsterdam offers a Master progra... | [-0.0378214493393898, 0.014933372847735882, -0... |
| ... | ... | ... |
| 1862 | SOMT offers a Bachelor program called 'Physiot... | [-0.0041148546151816845, 0.013687042519450188,... |
| 1863 | Breda University of Applied Sciences offers a ... | [-0.035528749227523804, 0.04575473442673683, -... |
| 1864 | THIM University of Applied Sciences in Physiot... | [-0.024073099717497826, 0.02298002317547798, -... |
| 1865 | Maastricht University offers a Master program ... | [0.0026634896639734507, 0.020387206226587296, ... |
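The batching loop above can be factored into a small helper that is easy to check without calling the API. This is a minimal sketch, not the notebook's own code: `embed_in_batches` and `fake_embed` are hypothetical names, and `embed_fn` stands in for whatever function actually sends one batch of strings to the embedding model.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed `texts` in order, sending at most `batch_size` texts per call."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_fn(batch)
        # Guard against a response that drops or duplicates rows
        if len(vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors than inputs")
        embeddings.extend(vectors)
    return embeddings


# Hypothetical stand-in for the API call: one 1-D "embedding" per text
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

Checking that each batch returns one vector per input keeps the final list aligned with the dataframe rows before it is assigned to a column.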


In [None]:
#Save the embeddings to csv file
df.to_csv("Program_info_netherlands_embeddings.csv")

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [None]:
# Optional: reload the dataframe (text plus embeddings) from the saved CSV
# instead of recomputing the embeddings.
#
# The CSV stores each embedding as a string such as "[0.1, 0.2, 0.3]".
# eval() parses that string as a Python expression and returns the list
# [0.1, 0.2, 0.3]; np.array then converts the list to a NumPy array.
# Only use eval() on files you created yourself, because it executes
# whatever expression the string contains.

# df = pd.read_csv("Program_info_netherlands_embeddings.csv", index_col=0)
# df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
# df
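As an alternative to `eval`, the standard library's `ast.literal_eval` parses only Python literals and refuses arbitrary expressions. The sketch below shows the round trip on a tiny in-memory CSV; the helper name `parse_embedding` and the toy dataframe are illustrative assumptions, not part of the notebook.

```python
import ast
import io

import numpy as np
import pandas as pd


def parse_embedding(cell: str) -> np.ndarray:
    """Convert a string such as "[0.1, 0.2]" into a NumPy array of floats."""
    return np.array(ast.literal_eval(cell), dtype=float)


# Toy data standing in for the real text/embeddings dataframe
original = pd.DataFrame(
    {"text": ["a", "b"], "embeddings": [[0.1, 0.2], [0.3, 0.4]]}
)

# Write to an in-memory CSV, then read it back as the notebook would from disk
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

# After read_csv the embeddings are plain strings; parse them back to arrays
reloaded["embeddings"] = reloaded["embeddings"].apply(parse_embedding)
print(type(reloaded["embeddings"].iloc[0]))
```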

##### Workflow for custom query completion

1. Create the custom text prompt:
    1. Embed the question, compute the distance between the question embedding and the embedding of every row in the dataset, and sort the rows by distance in ascending order.
    2. Build the context: add rows in order of relevance (shortest distance first) until the prompt reaches the maximum token limit, where prompt tokens = tokens in the prompt template + tokens in the question + tokens in the context.
    3. Insert the context and the question into the prompt template to produce the custom text prompt.

2. Send the custom text prompt to the (Chat)Completion model and return its response.


In [196]:
# Sort the dataframe rows by their relevance to the question
def get_rows_sorted_by_relevance(question, df):
    """
    Take a question string and a dataframe containing rows of text and
    their embeddings, and return a copy of that dataframe sorted by
    distance to the question, from smallest to largest.

    Input:
         question->str: the query text
         df->pd.DataFrame: a dataframe with "text" and "embeddings" columns
    Output:
         df_copy->pd.DataFrame: a copy of the original dataframe with an extra
         "distances" column (the cosine distance between the query text and each
         row of text), sorted by that column. The output dataframe has three
         columns: text, embeddings, distances
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,  # embedding of the query
        df_copy["embeddings"].values,  # embeddings of the dataset rows
        distance_metric="cosine"  # use cosine distance
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
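To make the ranking step concrete, the sketch below computes cosine distance (1 minus cosine similarity) directly with NumPy on toy 2-D vectors. It illustrates the metric only and is not the implementation inside `distances_from_embeddings`; the function name `cosine_distances` is chosen here for illustration.

```python
import numpy as np


def cosine_distances(query, matrix) -> np.ndarray:
    """Return 1 - cosine similarity between `query` and each row of `matrix`."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    similarities = (matrix @ query) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    )
    return 1.0 - similarities


# Toy "embeddings": same direction, orthogonal, and 45 degrees from the query
rows = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
distances = cosine_distances([1.0, 0.0], rows)

# Ascending order: the most relevant row (smallest distance) comes first
order = np.argsort(distances)
print(order.tolist())  # [0, 2, 1]
```

With a dataframe like the one above, the rows could be stacked first with `np.vstack(df["embeddings"].values)`, assuming every embedding has the same length.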

In [197]:
# Build the custom prompt: a prompt template filled with the question and the context retrieved from the dataset
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt with both the query text and the 
    context to send to a Completion model.

    To compose the context, add the most relevant rows (shortest distance
    first) until the prompt reaches the token limit, where the prompt token
    count = tokens in the prompt template + tokens in the question + tokens
    in the context.
    
    Input:
        question->str: the query text
        df->pd.DataFrame: the dataframe that contains the text and embeddings of the dataset
        max_token_count->int: the maximum number of tokens allowed in the input prompt
    Output:
        str: the prompt template filled with the retrieved context and the query text
    """

    # Create the tokenizer used to count tokens.
    # cl100k_base is the encoding used by text-embedding-3-large; gpt-4o uses
    # o200k_base, so the counts are approximate for the gpt-4o prompt.
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the input prompt: prompt template + query question + context

    #The prompt template
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    # Count the tokens in the prompt template and the question
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    # Add the most relevant rows (shortest distance first) to the context until the token limit is reached
    context = []
    # get_rows_sorted_by_relevance returns the rows sorted by distance to the question, in ascending order
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row to the context only if it fits within the token limit
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break
    # Join the rows with a visible separator so the model can tell them apart
    return prompt_template.format("\n\n###\n\n".join(context), question)
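One detail of the budgeting loop is that the `"\n\n###\n\n"` separator between rows also consumes tokens, which `create_prompt` does not count. The sketch below shows one way to include it. The names `select_context` and `word_count` are hypothetical, and a whitespace word counter stands in for the tokenizer so the example runs offline; in the notebook, `count_tokens` could be `lambda s: len(tokenizer.encode(s))`.

```python
from typing import Callable, Iterable, List

SEPARATOR = "\n\n###\n\n"


def select_context(
    ranked_texts: Iterable[str],
    fixed_token_count: int,
    max_token_count: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Pick texts in ranked order until the budget would be exceeded.

    `fixed_token_count` covers the prompt template and the question.
    """
    selected: List[str] = []
    used = fixed_token_count
    separator_cost = count_tokens(SEPARATOR)
    for text in ranked_texts:
        # Every row after the first is preceded by a separator
        cost = count_tokens(text) + (separator_cost if selected else 0)
        if used + cost > max_token_count:
            break
        selected.append(text)
        used += cost
    return selected


# Stand-in tokenizer (assumption): one token per whitespace-separated word
def word_count(text: str) -> int:
    return len(text.split())


ranked = ["alpha beta", "gamma delta epsilon", "zeta"]
print(select_context(ranked, fixed_token_count=2, max_token_count=6, count_tokens=word_count))
# ['alpha beta']
print(select_context(ranked, fixed_token_count=2, max_token_count=8, count_tokens=word_count))
# ['alpha beta', 'gamma delta epsilon']
```

Stopping at the first row that does not fit, rather than skipping it and trying shorter rows, matches the behaviour of the loop in `create_prompt`.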

In [None]:
# Get an answer from the (Chat)Completion model for the custom text prompt

# COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
COMPLETION_MODEL_NAME = "gpt-4o"

def answer_question(
    question, df, max_prompt_tokens=10000, max_answer_tokens=500
):
    """
    Given a question, a dataframe containing rows of text and their
    embeddings, and token limits for the prompt and the response, return
    the model's answer to the question.
    
    If the API call raises an error, print it and return an empty string.

    Input:
        question->str: the query text
        df->pd.DataFrame: the dataset dataframe with two columns: "text" and "embeddings"
        max_prompt_tokens->int: the maximum number of tokens in the input prompt (prompt template + question + context)
        max_answer_tokens->int: the maximum number of tokens in the model's response
    Output:
        str: the response text, read from response["choices"][0]["message"]["content"]
        for the ChatCompletion model (or response["choices"][0]["text"] for the
        Completion model in the commented-out code)

    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        # response = openai.Completion.create(
        #     model=COMPLETION_MODEL_NAME,
        #     prompt=prompt,
        #     max_tokens=max_answer_tokens,
        #     temperature=0.2,
        #     top_p=0.3
        # )

        # return response["choices"][0]["text"].strip()

        response = openai.ChatCompletion.create(
            model=COMPLETION_MODEL_NAME,
            messages=[
                {"role": "user", "content": prompt}
            ],
            max_tokens=max_answer_tokens,
            temperature=0.2,
            top_p=0.3
        )

        return response["choices"][0]["message"]["content"].strip()
    
    except Exception as e:
        print(e)
        return ""
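The calls in this notebook use the pre-1.0 `openai` interface (`openai.ChatCompletion.create`, `openai.embeddings_utils`). In `openai` 1.x the same request goes through a client object. The sketch below shows what the chat call might look like there; `ask_chat_model` is a hypothetical helper, and the client is passed in as a parameter so the example can be exercised with a stand-in object instead of a real API key.

```python
from types import SimpleNamespace


def ask_chat_model(client, prompt, model="gpt-4o", max_answer_tokens=500):
    """Send `prompt` as a single user message and return the stripped reply.

    `client` is expected to behave like `openai.OpenAI()` from openai>=1.0.
    Returns an empty string if the request raises an error.
    """
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_answer_tokens,
            temperature=0.2,
            top_p=0.3,
        )
    except Exception as error:
        print(error)
        return ""
    content = response.choices[0].message.content
    return (content or "").strip()


# Stand-in client (assumption) that mimics the shape of the 1.x response
def _fake_create(**kwargs):
    message = SimpleNamespace(content="  I don't know  ")
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
print(ask_chat_model(fake_client, "Question: ...\nAnswer:"))  # I don't know
```

With the real library installed, the client would be created with `from openai import OpenAI` and `client = OpenAI()`, and `create_prompt(question, df, max_prompt_tokens)` would supply the prompt.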

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [220]:
test_question1 = """
What are the available master programs in the field of Data Science 
in the 'University of Amsterdam' taught in English and their relevant information, including the link to the official website? 
If there is no, please answer there is no such program available.
"""

In [223]:
#The original response from Completion Model without using RAG

# gpt-3.5-turbo-instruct has a 4,096-token context window shared by the prompt and the response
initial_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=test_question1,
    max_tokens=1000,
    temperature=0.2,
    top_p=0.3
)
# Extract the response text
initial_answer = initial_answer["choices"][0]["text"].strip()
print(initial_answer)

The University of Amsterdam offers the following master programs in the field of Data Science taught in English:

1. Master of Science in Data Science
- Duration: 2 years
- Language of instruction: English
- Program website: https://www.uva.nl/en/programmes/masters/data-science/index.html

2. Master of Science in Artificial Intelligence
- Duration: 2 years
- Language of instruction: English
- Program website: https://www.uva.nl/en/programmes/masters/artificial-intelligence/index.html

3. Master of Science in Business Analytics
- Duration: 1 year
- Language of instruction: English
- Program website: https://www.uva.nl/en/programmes/masters/business-analytics/index.html

4. Master of Science in Computational Science
- Duration: 2 years
- Language of instruction: English
- Program website: https://www.uva.nl/en/programmes/masters/computational-science/index.html

5. Master of Science in Information Studies: Data Science track
- Duration: 1 year
- Language of instruction: English
- Program

In [222]:
#The original response from ChatCompletion Model without using RAG

# gpt-4o has a 128,000-token context window (prompt + response) and an output limit of 16,384 tokens
response = openai.ChatCompletion.create(
    model="gpt-4o",  # or "gpt-4o-mini"
    messages=[
        {"role": "user", "content": test_question1}
    ],
    temperature=0.2,
    max_tokens=1000,
    top_p=0.5
)

print(response["choices"][0]["message"]["content"].strip())

As of the latest information available, the University of Amsterdam offers a Master's program in the field of Data Science that is taught in English. Here are the details:

### Master's in Data Science and Business Analytics

- **Program Overview**: This program is designed to equip students with the skills needed to analyze and interpret complex data, and to use this information to make informed business decisions. It combines elements of computer science, statistics, and business.

- **Duration**: Typically 1-2 years, depending on the specific track and full-time or part-time enrollment.

- **Curriculum Highlights**:
  - Data Mining
  - Machine Learning
  - Statistical Methods
  - Big Data Technologies
  - Business Strategy and Analytics

- **Admission Requirements**:
  - A relevant bachelor's degree (e.g., in computer science, mathematics, statistics, or a related field)
  - Proficiency in English (e.g., TOEFL or IELTS scores)
  - Some programs may require GRE or GMAT scores

- **Ap

In [221]:
# The response from the ChatCompletion model (gpt-4o) using the custom RAG prompt
custom_answer = answer_question(test_question1, df, max_answer_tokens=1000)
print(custom_answer)

The University of Amsterdam offers several Master programs in the field of Data Science taught in English. Here are the available programs along with their relevant information and links to the official websites:

1. **Information Studies: Data Science**
   - Duration: 1 year
   - Degree: Master of Science
   - ECTS Credits: 120
   - Tuition Fees: EU/EEA students pay € 2,601; Non-EU/EEA students pay € 34,400; Institutional students pay € 31,300
   - Application Deadlines: 1 May '25 for EU/EEA applicants, 1 Feb '25 for non-EU/EEA applicants
   - Official Website: [Information Studies: Data Science](http://www.uva.nl/msc-ds)

2. **Data Science and Business Analytics: Data Science track**
   - Duration: 1 year
   - Degree: Master of Science
   - ECTS Credits: Information not available
   - Tuition Fees: EU/EEA students pay € 2,601; Non-EU/EEA and Institutional fees are not available
   - Application Deadlines: 1 May '25 for EU/EEA applicants, 1 Apr '25 for non-EU/EEA applicants
   - Offic

### Question 2

In [224]:
test_question2 = """
What are the available Master programs in the field of Statistics taught in English 
in the Netherlands, and what are these programs' types, language requirements, tuition 
fee, and these programs' links respectively? If there is no, please answer there is no 
suchprogram available.
"""

In [225]:
#The original response from Completion Model without using RAG

#gpt-3.5-turbo-instruct has a token limit of 4,096 tokens
initial_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=test_question2,
    max_tokens=1000,
    temperature=0.2,
    top_p=0.3
)
# Extract the response text
initial_answer = initial_answer["choices"][0]["text"].strip()
print(initial_answer)

1. Master of Science in Statistics - University of Amsterdam
Type: Research-based
Language Requirements: English proficiency (TOEFL or IELTS)
Tuition Fee: €2,143 for EU/EEA students, €17,500 for non-EU/EEA students
Link: https://www.uva.nl/en/programmes/masters/statistics/statistics.html

2. Master of Science in Statistical Science for the Life and Behavioural Sciences - University of Groningen
Type: Research-based
Language Requirements: English proficiency (TOEFL or IELTS)
Tuition Fee: €2,143 for EU/EEA students, €15,500 for non-EU/EEA students
Link: https://www.rug.nl/masters/statistical-science-for-the-life-and-behavioural-sciences/

3. Master of Science in Statistical Science - Leiden University
Type: Research-based
Language Requirements: English proficiency (TOEFL or IELTS)
Tuition Fee: €2,143 for EU/EEA students, €18,200 for non-EU/EEA students
Link: https://www.universiteitleiden.nl/en/education/study-programmes/master/statistical-science

4. Master of Science in Statistical Sci

In [229]:
#The original response from ChatCompletion Model without using RAG
response = openai.ChatCompletion.create(
    model="gpt-4o",  # or "gpt-4o-mini"
    messages=[
        {"role": "user", "content": test_question2}
    ],
    temperature=0.2,
    max_tokens=1000,
    top_p=0.5
)

print(response["choices"][0]["message"]["content"].strip())

As of my last update, several universities in the Netherlands offer Master's programs in Statistics or related fields, taught in English. Here are some of the options:

1. **University of Amsterdam (UvA)**
   - **Program**: MSc in Stochastics and Financial Mathematics
   - **Type**: Full-time
   - **Language Requirements**: IELTS (minimum 6.5), TOEFL (minimum 92), or equivalent
   - **Tuition Fee**: EU/EEA students approximately €2,314 per year; non-EU/EEA students approximately €16,000 per year
   - **Link**: [UvA MSc in Stochastics and Financial Mathematics](https://www.uva.nl/en/programmes/masters/stochastics-and-financial-mathematics/stochastics-and-financial-mathematics.html)

2. **Leiden University**
   - **Program**: MSc in Statistical Science for the Life and Behavioural Sciences
   - **Type**: Full-time
   - **Language Requirements**: IELTS (minimum 6.5), TOEFL (minimum 90), or equivalent
   - **Tuition Fee**: EU/EEA students approximately €2,314 per year; non-EU/EEA students 

In [230]:
# The response from the ChatCompletion model (gpt-4o) using the custom RAG prompt
custom_answer = answer_question(test_question2, df, max_answer_tokens=1000)
print(custom_answer)

The available Master programs in the field of Statistics taught in English in the Netherlands are:

1. **Leiden University - Statistics and Data Science**
   - **Type**: Master of Science
   - **Language Requirements**: IELTS overall band (minimum score: 6.5); TOEFL internet based (minimum score: 90); Cambridge Certificate in Advanced English (minimum score: 180); Cambridge Certificate of Proficiency in English (minimum score: 180)
   - **Tuition Fee**: EU/EEA students pay € 2,601; Non-EU/EEA students pay € 21,800
   - **Link**: [Leiden University - Statistics and Data Science](https://www.universiteitleiden.nl/en/education/study-programmes/master/statistical-science-for-the-life-and-behavioural-sciences)

2. **University of Amsterdam - Stochastics and Financial Mathematics**
   - **Type**: Master of Science
   - **Language Requirements**: TOEFL internet based (minimum score: 92); IELTS overall band (minimum score: 6.5); Cambridge Certificate in Advanced English (minimum score: 180)
  

### Question 3

In [231]:
test_question3 = """
What are the available Master programs in the field of Biomedical Science taught in English 
in the Netherlands, and what are these programs' types, language requirements, tuition 
fee, and these programs' links respectively? If there is no, please answer there is no 
suchprogram available.
"""

In [232]:
# The response from the ChatCompletion model (gpt-4o) using the custom RAG prompt
custom_answer = answer_question(test_question3, df, max_answer_tokens=1000)
print(custom_answer)

The available Master programs in the field of Biomedical Science taught in English in the Netherlands, along with their types, language requirements, tuition fees, and links, are as follows:

1. **University of Amsterdam - Biomedical Sciences**
   - **Type:** Master of Science
   - **Language Requirements:** TOEFL internet-based (minimum score: 92), IELTS overall band (minimum score: 6.5), or Cambridge Certificate in Advanced English (minimum score: 180)
   - **Tuition Fee:** EU/EEA students: € 2,601; Non-EU/EEA students: € 25,900; Institutional students: € 23,500
   - **Link:** [University of Amsterdam Biomedical Sciences](https://www.uva.nl/shared-content/programmas/en/masters/biomedical-sciences/biomedical-sciences.html?origin=znSrDUT%2BQ5uz6dso72fBmw)

2. **Radboud University - Biomedical Sciences**
   - **Type:** Master of Science
   - **Language Requirements:** Cambridge Certificate in Advanced English (C1), Cambridge Certificate of Proficiency in English (C2), IELTS overall band