# RAG-based Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task.

I chose the Wikipedia page on data science as my dataset. Although Wikipedia has probably been part of the model's training data, applying RAG to this specific page should let the model answer narrower questions on the topic and reduce hallucination.

In [40]:
import pandas as pd
import requests
import openai
openai.api_base = ""
openai.api_key = ""

In [48]:
Q1 = 'When was the Data Science term first coined?'
Q2 = 'What are the main topics in Data Science?'

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [65]:
# Get the Wikipedia page for "Data Science"
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "Data_science",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()
# Print the first two paragraphs of the loaded data (paragraphs are separated by line breaks)
response_dict["query"]["pages"][0]["extract"].split("\n")[:2]

['Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data. ',
 'Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.']

In [31]:
# Load page text into a dataframe by adding each paragraph into a separate row. Later we'll create embeddings at the paragraph level. 
# Identify paragraphs by line breaks ("\n")
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")

### Data Cleaning

In [33]:
# anything below the see also section in wikipedia is not useful for this use case
index_to_cut = df[df['text'].str.contains('See also', na=False)].index.min()
if not pd.isna(index_to_cut):
    df = df.iloc[:index_to_cut]
# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]
df.reset_index(inplace = True, drop = True)
df.head(1)

Unnamed: 0,text
0,Data science is an interdisciplinary academic ...
1,Data science also integrates domain knowledge ...
2,"Data science is ""a concept to unify statistics..."
3,A data scientist is a professional who creates...
4,Data science is an interdisciplinary field foc...
5,"Many statisticians, including Nate Silver, hav..."
6,Stanford professor David Donoho writes that da...
7,"In 1962, John Tukey described a field he calle..."
8,"The term ""data science"" has been traced back t..."
9,"During the 1990s, popular terms for the proces..."
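
The cleaning rules used above can also be checked on a small hand-made sample without calling the Wikipedia API. The following is a minimal plain-Python sketch, not the notebook's pandas code: the helper name `clean_paragraphs` and the sample lines are hypothetical, and it assumes the cut-off should happen only at a line that is itself a heading containing "See also".

```python
def clean_paragraphs(lines):
    """Keep body paragraphs: stop at the 'See also' heading, drop blanks and headings."""
    cleaned = []
    for line in lines:
        stripped = line.strip()
        # Stop at the "See also" section heading (assumed to look like "== See also ==")
        if stripped.startswith("==") and "see also" in stripped.lower():
            break
        # Skip empty lines and any other section headings
        if not stripped or stripped.startswith("=="):
            continue
        cleaned.append(line)
    return cleaned


# Hypothetical sample mimicking the plain-text extract format
sample = [
    "Data science is an interdisciplinary field.",
    "",
    "== History ==",
    "The term has been traced back to 1974.",
    "== See also ==",
    "Python (programming language)",
]
paragraphs = clean_paragraphs(sample)
print(paragraphs)
```

Matching only heading lines avoids cutting the page early if a body paragraph happens to contain the words "See also".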


## Generate Embeddings

In [42]:
import tiktoken
import pandas as pd
import openai

# Define model and tokenizer
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
TOKENIZER_ENCODING = "cl100k_base" 
# Create a tokenizer that is designed to align with our embeddings
tokenizer = tiktoken.get_encoding(TOKENIZER_ENCODING)

# Function to count tokens in a text
def count_tokens(text):
    return len(tokenizer.encode(text))

# Add a column with token counts
df['token_count'] = df['text'].apply(count_tokens)

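# Token budget for a single batch. 8191 is the maximum input length of
# text-embedding-ada-002; it is reused here as a conservative cap per request.
BATCH_TOKEN_LIMIT = 8191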

# Initialize variables for batching
current_batch = []
current_tokens = 0
embeddings = []

# Iterate over rows to form batches dynamically
for _, row in df.iterrows():
    text = row['text']
    tokens = row['token_count']

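    # Flush the batch if adding this text would exceed the token limit
    # (skip the flush while the batch is still empty, to avoid an empty API call)
    if current_batch and current_tokens + tokens > BATCH_TOKEN_LIMIT: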
        # Send current batch to the API
        response = openai.Embedding.create(
            input=current_batch,
            engine=EMBEDDING_MODEL_NAME
        )
        embeddings.extend([data["embedding"] for data in response["data"]])

        # Reset batch
        current_batch = []
        current_tokens = 0

    # Add the current text to the batch
    current_batch.append(text)
    current_tokens += tokens

# Process the last batch if any
if current_batch:
    response = openai.Embedding.create(
        input=current_batch,
        engine=EMBEDDING_MODEL_NAME
    )
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings to the DataFrame
df["embeddings"] = embeddings

# Display the DataFrame
df.head(1)

Unnamed: 0,text,token_count,embeddings
0,Data science is an interdisciplinary academic ...,45,"[-0.01429750770330429, 0.006413985043764114, 0..."
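
The batching logic in the cell above can be isolated from the API call and tested on its own. The sketch below is one possible way to do that, under stated assumptions: `batch_by_token_budget` is a hypothetical helper name, and the token counts are made-up numbers rather than real `tiktoken` counts.

```python
def batch_by_token_budget(texts, token_counts, limit):
    """Yield lists of texts whose summed token counts stay within `limit`."""
    batch, used = [], 0
    for text, count in zip(texts, token_counts):
        # Flush the current batch before it would overflow the budget
        if batch and used + count > limit:
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += count
    # Emit whatever is left over after the loop
    if batch:
        yield batch


# Made-up texts and token counts, with a budget of 10 tokens per batch
batches = list(batch_by_token_budget(["a", "b", "c", "d"], [5, 4, 3, 9], limit=10))
print(batches)  # [['a', 'b'], ['c'], ['d']]
```

Each yielded batch could then be passed as the `input` of one embedding request. Note that a single text longer than the budget still ends up in a batch of its own, so it would need to be split or truncated separately.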


In [43]:
df.to_csv("embeddings_wiki_ds.csv")

In [44]:
! ls

data  embeddings_wiki_ds.csv  project.ipynb


To stop the notebook here and come back, you can reload `df` using this code (adding your API key) rather than generating the embeddings again:

In [None]:
# import numpy as np
# import pandas as pd
# import openai
# openai.api_base = "https://openai.vocareum.com/v1"
# openai.api_key = "YOUR API KEY"
# df = pd.read_csv("embeddings_wiki_ds.csv", index_col=0)
# df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
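
When a dataframe is written to CSV, each embedding list is stored as a string, which is why the reload code has to parse it back. A possible alternative to `eval` is `ast.literal_eval`, which only accepts Python literals. The sketch below assumes a tiny in-memory CSV as a stand-in for the saved file; `csv_text` and `df_demo` are hypothetical names.

```python
import ast
import io

import numpy as np
import pandas as pd

# In-memory stand-in for the saved CSV; the real file would be read from disk
csv_text = 'text,embeddings\nhello,"[0.1, 0.2, 0.3]"\n'
df_demo = pd.read_csv(io.StringIO(csv_text))

# Parse each stringified list safely, then convert it to a NumPy array
df_demo["embeddings"] = (
    df_demo["embeddings"].apply(ast.literal_eval).apply(np.array)
)
print(type(df_demo["embeddings"][0]), df_demo["embeddings"][0].shape)
```

Unlike `eval`, `ast.literal_eval` raises an error on anything that is not a plain literal, so a tampered CSV cannot execute arbitrary code.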

## RAG-based Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

### Retrieval pipeline (semantic search) - Create a Function that Finds Related Pieces of Text for a Given Question

In [51]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question_embeddings, df):
    """
    Function that takes in the embedding of a question and a dataframe
    containing rows of text and associated embeddings, and returns that
    dataframe sorted from most to least relevant for that question
    """
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
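
For intuition about what the `"cosine"` distance metric measures, here is a minimal NumPy sketch. It is not the implementation used by `distances_from_embeddings`; the function `cosine_distance` and the two-dimensional toy vectors are assumptions made for illustration.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "embeddings": only the direction matters, not the length
query = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Sort ascending: the smallest distance is the most relevant row
ranked = sorted(rows, key=lambda name: cosine_distance(query, rows[name]))
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

A distance of 0 means the vectors point the same way, 1 means they are orthogonal, and 2 means they point in opposite directions, which is why sorting in ascending order puts the most relevant rows first.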

In [49]:
# Get embeddings for the question text
Q1_embedding = get_embedding(Q1, engine=EMBEDDING_MODEL_NAME)
Q2_embedding = get_embedding(Q2, engine=EMBEDDING_MODEL_NAME)

In [52]:
df_q1_similarity_sorted = get_rows_sorted_by_relevance(Q1_embedding, df)
df_q1_similarity_sorted.head(5)

Unnamed: 0,text,token_count,embeddings,distances
8,"The term ""data science"" has been traced back t...",156,"[-0.00794466957449913, -0.006920174229890108, ...",0.113138
11,The modern conception of data science as an in...,138,"[-0.014102050103247166, 0.007750300690531731, ...",0.123728
7,"In 1962, John Tukey described a field he calle...",113,"[-0.012454736977815628, 0.0004028491675853729,...",0.129511
12,"The professional title of ""data scientist"" has...",74,"[-0.030461788177490234, -0.002075143391266465,...",0.148623
9,"During the 1990s, popular terms for the proces...",33,"[-0.02157468907535076, -0.0010463348589837551,...",0.14973


In [54]:
df_q2_similarity_sorted = get_rows_sorted_by_relevance(Q2_embedding, df)
df_q2_similarity_sorted.head(5)

Unnamed: 0,text,token_count,embeddings,distances
0,Data science is an interdisciplinary academic ...,45,"[-0.01429750770330429, 0.006413985043764114, 0...",0.145862
4,Data science is an interdisciplinary field foc...,178,"[-0.01713685691356659, 0.0032929459121078253, ...",0.15982
23,"Data science involve collecting, processing, a...",35,"[-0.012967286631464958, -0.0069978199899196625...",0.160908
16,"Data science, on the other hand, is a more com...",117,"[-0.011719823814928532, 0.004833708051592112, ...",0.165301
19,"In summary, data analysis and data science are...",90,"[-0.01534644141793251, 0.006740248762071133, 0...",0.166335


In [62]:
# Total number of tokens in the document
df.token_count.sum()

2110

### Augmentation pipeline - Create a Function that Composes a Text Prompt with Retrieved Text Augmented

Building on that sorted list of rows, we're going to create a text prompt that provides context to a `Completion` model in order to help it answer a question. The outline of the prompt looks like this:

```
Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{context}

---

Question: {question}
Answer:
```

Normally, we want to fit as much of our dataset as possible into the "context" part of the prompt without exceeding the number of tokens allowed by the `Completion` model, which is currently about 4,000.

Since our whole document is only about 2,100 tokens, all of it would fit into the prompt. We therefore limit the inserted context to 500 tokens, so that the test actually shows whether our retrieval strategy is effective.

We'll loop over the dataset, counting the tokens as we go, and stop when we hit the limit. Then we'll join that list of text data into a single string and add it to the prompt.

In [60]:
import tiktoken

def create_prompt(question, question_embedding, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    # Build the context from the retrieved text, most similar first, until max_token_count is reached
    context = []
    for text in get_rows_sorted_by_relevance(question_embedding, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
    

Test the augmented prompt:

In [61]:
print(create_prompt(Q1, Q1_embedding, df, max_token_count = 500))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

The term "data science" has been traced back to 1974, when Peter Naur proposed it as an alternative name to computer science. In 1996, the International Federation of Classification Societies became the first conference to specifically feature data science as a topic. However, the definition was still in flux. After the 1985 lecture at the Chinese Academy of Sciences in Beijing, in 1997 C. F. Jeff Wu again suggested that statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being synonymous with accounting or limited to describing data. In 1998, Hayashi Chikio argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection, and analysis.

###

The modern conception of data science as an independent discipline is sometimes attributed 

In [66]:
print(create_prompt(Q2, Q2_embedding, df, max_token_count = 500))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data. 

###

Data science is an interdisciplinary field focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains. The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualizatio

### Generation pipeline - Create a Function that Answers a Question Using the Prompt (with the augmented text) Created

In [69]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question_with_RAG(
    question, question_embedding, df, max_prompt_tokens=500, max_answer_tokens=150
):
    """
    Given a question, its embedding, a dataframe containing rows of text
    and their embeddings, and a maximum number of desired tokens in the
    prompt and response, return the answer to the question according to
    an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, question_embedding, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
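
The function above uses the module-level `openai.Completion.create` call from the pre-1.0 `openai` package. On `openai>=1.0` the same request is made through a client object instead. The sketch below shows that shape under stated assumptions: `complete_with_client` is a hypothetical helper, and `fake_client` is an offline stand-in so the example runs without an API key. With the real library, the client would be created with `openai.OpenAI()`.

```python
from types import SimpleNamespace


def complete_with_client(client, prompt, model="gpt-3.5-turbo-instruct", max_tokens=150):
    """Send `prompt` to a completions endpoint via an openai>=1.0 style client."""
    try:
        response = client.completions.create(
            model=model, prompt=prompt, max_tokens=max_tokens
        )
        # Responses are objects in the 1.x SDK, so use attribute access
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""


# Offline stand-in that mimics only the attributes used above (not the real SDK)
fake_client = SimpleNamespace(
    completions=SimpleNamespace(
        create=lambda **kwargs: SimpleNamespace(
            choices=[SimpleNamespace(text="  1974  ")]
        )
    )
)
demo_answer = complete_with_client(fake_client, "Question: When?\nAnswer:")
print(demo_answer)
```

Passing the client in as an argument keeps the function easy to test, since any object with the same `completions.create` shape can stand in for the real service.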
        

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [67]:
Q1_initial_prompt = f"""
Question: {Q1}
Answer:
"""
initial_Q1_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=Q1_initial_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_Q1_answer)

The term "data science" was first coined in 2001 by William S. Cleveland, a statistician and professor at Purdue University.


In [71]:
RAG_based_Q1_answer = answer_question_with_RAG(Q1, Q1_embedding, df)
print(RAG_based_Q1_answer)

1974, when Peter Naur proposed it as an alternative name to computer science.


In [85]:
print(f"""
{Q1} \n
Original Answer: \n {initial_Q1_answer}  \n
RAG-Based Answer:   \n {RAG_based_Q1_answer}
""")


When was the Data Science term first coined? 

Original Answer: 
 The term "data science" was first coined in 2001 by William S. Cleveland, a statistician and professor at Purdue University.  

RAG-Based Answer:   
 1974, when Peter Naur proposed it as an alternative name to computer science.



### Question 2

In [73]:
Q2_initial_prompt = f"""
Question: {Q2}
Answer:
"""
initial_Q2_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=Q2_initial_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_Q2_answer)

1. Data Analysis: This topic involves understanding how to collect, clean, and transform data in order to extract useful insights and patterns.

2. Data Visualization: Data science also involves using visual tools such as charts, graphs, and maps to communicate the information and insights gained from data analysis.

3. Machine Learning: This topic focuses on building and training algorithms that can automatically learn from data and make predictions or decisions without explicit programming.

4. Natural Language Processing: This is a subfield of data science that deals with understanding and analyzing human language, both written and spoken.

5. Data Mining: This involves discovering patterns and relationships within large datasets using various statistical and computational methods.

6. Predictive Modeling: This topic involves using statistical and machine learning techniques


In [75]:
RAG_based_Q2_answer = answer_question_with_RAG(Q2, Q2_embedding, df)
print(RAG_based_Q2_answer)

The main topics in Data Science are statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms, systems, data integration, graphic design, complex systems, communication, business, human-computer interaction, database management, machine learning, and distributed/parallel systems.


In [86]:
print(f"""
{Q2} \n
Original Answer: \n {initial_Q2_answer}  \n
RAG-Based Answer:   \n {RAG_based_Q2_answer}
""")


What are the main topics in Data Science? 

Original Answer: 
 1. Data Analysis: This topic involves understanding how to collect, clean, and transform data in order to extract useful insights and patterns.

2. Data Visualization: Data science also involves using visual tools such as charts, graphs, and maps to communicate the information and insights gained from data analysis.

3. Machine Learning: This topic focuses on building and training algorithms that can automatically learn from data and make predictions or decisions without explicit programming.

4. Natural Language Processing: This is a subfield of data science that deals with understanding and analyzing human language, both written and spoken.

5. Data Mining: This involves discovering patterns and relationships within large datasets using various statistical and computational methods.

6. Predictive Modeling: This topic involves using statistical and machine learning techniques  

RAG-Based Answer:   
 The main topics in D