## Custom ChatBot using RAG

This project builds a chatbot with the Retrieval-Augmented Generation (RAG) method, using a custom dataset extracted from a PDF of my CV. The chatbot answers questions about my background, in particular my education, by retrieving the CV sentences most relevant to each question and passing them to a language model as context for the answer.

### Create the dataset

The dataset consists of personal information that appears only on my CV and is not part of the OpenAI model’s pretraining data. That makes it a suitable test case: the model cannot answer from what it already knows, so a correct answer has to come from the retrieved context.

In [79]:
import PyPDF2
import pandas as pd
import openai

# Path to your PDF file
pdf_file = r"C:\Users\alsot\Downloads\CV_TAUA.pdf"

# Open the PDF file
with open(pdf_file, "rb") as file:
    reader = PyPDF2.PdfReader(file)
    
    # Extract text from each page
    text = ""
    for page in reader.pages:
        text += page.extract_text()

# Split the text into sentence-level chunks on periods
# (newlines are removed first; each chunk keeps its leading whitespace)
lines = text.replace("\n", "").split(".")

# Create a pandas DataFrame with one sentence per row and drop empty rows
df = pd.DataFrame(lines, columns=["text"])
df = df[~df["text"].str.strip().eq("")].reset_index(drop=True)

In [80]:
df.shape

(41, 1)
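
The cell above turns the CV into one row per sentence by splitting on periods. A minimal sketch of that chunking step on a made-up string is shown here; the sample text and the helper name `split_into_sentences` are illustrative and not part of the notebook. Unlike the cell above, it joins wrapped lines with a space and strips each fragment.

```python
def split_into_sentences(text: str) -> list[str]:
    """Split text on periods and drop empty fragments (hypothetical helper)."""
    # Collapse newlines and repeated spaces so words at line breaks stay separated
    flat = " ".join(text.split())
    # Naive split: abbreviations and decimals such as "3.5" are split as well
    return [part.strip() for part in flat.split(".") if part.strip()]


sample = "Ana studied physics.\nShe works in data analytics. "
sentences = split_into_sentences(sample)
print(sentences)
```

Splitting on periods is simple but coarse; a CV with abbreviations or version numbers may need a more careful sentence splitter.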

### Load credentials and create embeddings from the dataset

In [81]:
import openai
import json

with open("config.json") as f:
    config = json.load(f)

openai.api_base = config["api_base"]
openai.api_key = config["api_key"]

In [82]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

In [83]:
df.to_csv("embeddings.csv")
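
One thing to keep in mind when saving embeddings to CSV: each embedding list is written as text, so it comes back as a string when the file is read again and has to be parsed before any distance can be computed. The sketch below shows the round trip using only the standard library and made-up three-dimensional vectors (the data and column values are assumptions for illustration). With pandas, the same parsing could be applied through the `converters` argument of `pd.read_csv`.

```python
import csv
import io
import json

# Simulate a CSV in which each embedding was written as the text of a Python list
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["first sentence", str([0.1, -0.2, 0.3])])
writer.writerow(["second sentence", str([0.0, 0.5, -0.5])])

# Read the file back: the embeddings column arrives as plain strings
buffer.seek(0)
rows = list(csv.DictReader(buffer))
loaded_as_text = isinstance(rows[0]["embeddings"], str)

# Parse each string into a list of floats before computing distances
for row in rows:
    row["embeddings"] = json.loads(row["embeddings"])

print(loaded_as_text, rows[0]["embeddings"])
```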

### Get the cosine distance between the CV embeddings and the question embedding

In [84]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [85]:
get_rows_sorted_by_relevance("Who is Tauã?", df).head(5)

| Unnamed: 0 | text | embeddings | distances |
|---|---|---|---|
| 0 | Tauã Santos is a dedicated Mechanical Engineer... | [-0.02590862289071083, -0.0033735898323357105,... | 0.150094 |
| 17 | Beyond his professional experience, Tauã has ... | [-0.002841450972482562, -0.017338581383228302,... | 0.15645 |
| 35 | Beyond his technical skills, Tauã believes in... | [-0.008455007337033749, -0.008434616960585117,... | 0.162055 |
| 29 | Tauã is passionate about using technology to ... | [-0.015827035531401634, -0.005707507487386465,... | 0.162765 |
| 25 | Throughout his career, Tauã has maintained a ... | [-0.02318953350186348, -0.011980372481048107, ... | 0.163264 |
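
To make the `distances` column concrete: cosine distance is one minus the cosine similarity of two vectors, so a smaller value means the two embeddings point in a more similar direction. The sketch below computes it by hand for toy two-dimensional vectors; the function name `cosine_distance` and the data are illustrative, and this is not necessarily how the library helper used above is implemented.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "question" vector and three toy "row" vectors
question = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Sort ascending by distance, so the most relevant row comes first
ranked = sorted(rows, key=lambda name: cosine_distance(question, rows[name]))
print(ranked)
```

Vector length does not matter here: `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]` because only the direction is compared.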


### Use the ranking function to build a prompt from the most relevant context

In [86]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:  :

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
    

In [87]:
print(create_prompt("who is Taua?", df, 400))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:  :

 Tauã is passionate about using technology to support business goals

###

 Tauã holds certifications in Analytics Engineering Associate, Generative AI Associate, and Data Science Associate from Itaú

###

 Tauã is fluent in English, which allows him to collaborate effectively with international teams and stakeholders

###

 Beyond his professional experience, Tauã has pursued additional education in business, covering financial analysis, valuation, marketing, and strategy

###

Tauã Santos is a dedicated Mechanical Engineer with experience in data analytics, process automation, and machine learning applications

###

 Through his expertise in automation, machine learning, and data analytics, Tauã strives to make meaningful contributions to his work

###

 Beyond his technical skills, Tauã believes in the importance of strategic decision-making s
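
The context selection in `create_prompt` is a greedy packing loop: rows are taken in order of relevance until the next one would exceed the token budget. Here is a simplified sketch of just that loop. The helper name `pack_context` is hypothetical, and the default tokenizer counts whitespace-separated words as a stand-in, which is not how `tiktoken` counts tokens.

```python
def pack_context(ranked_chunks, max_tokens, base_tokens=0, count_tokens=None):
    """Greedily keep chunks, in ranked order, until the token budget is spent."""
    if count_tokens is None:
        # Stand-in tokenizer: whitespace-separated words, NOT real model tokens
        count_tokens = lambda text: len(text.split())

    used = base_tokens  # tokens already taken by the template and the question
    kept = []
    for chunk in ranked_chunks:
        used += count_tokens(chunk)
        if used > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
    return kept


chunks = ["one two three", "four five", "six seven eight nine", "ten"]
packed = pack_context(chunks, max_tokens=8, base_tokens=2)
print(packed)
```

Because the loop stops at the first chunk that does not fit, the short chunk `"ten"` is never considered even though it would fit the remaining budget. That keeps the context strictly ordered by relevance at the cost of a little unused space.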

### Function to retrieve answers from the model

In [89]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=3000, max_answer_tokens=100
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens,
            temperature=0
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
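
The cells in this notebook use the pre-1.0 interface of the `openai` package (`openai.Completion.create`, `openai.Embedding.create`, `openai.embeddings_utils`). In version 1.0 and later, requests go through a client object instead. Below is a minimal sketch of the same completion call in that style, assuming a client created with `openai.OpenAI()`. The helper name `complete_with_client` is hypothetical, and the stub client only stands in for a real API call so the example runs offline.

```python
from types import SimpleNamespace


def complete_with_client(client, prompt, model="gpt-3.5-turbo-instruct", max_tokens=100):
    """Request a completion through an openai>=1.0 style client object."""
    response = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0,
    )
    return response.choices[0].text.strip()


class _FakeCompletions:
    """Offline stub that mimics the shape of a completions response."""

    def create(self, **kwargs):
        self.last_request = kwargs  # remember what was asked, for inspection
        choice = SimpleNamespace(text="  stub answer  ")
        return SimpleNamespace(choices=[choice])


fake_client = SimpleNamespace(completions=_FakeCompletions())
answer = complete_with_client(fake_client, "Question: who?\nAnswer:")
print(answer)
```

With a real client, the stub would be replaced by `openai.OpenAI(api_key=...)`; the rest of the helper stays the same under that assumption.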
        

In [90]:
while True:
    user_input = input("Type your question (or 'exit' to exit): ").strip()
    
    if user_input.lower() == "exit":
        print("Exiting.")
        break
    
    # Baseline: ask the model directly, without any retrieved context
    try:
        baseline_answer = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=f"Question: {user_input}\nAnswer:",
            max_tokens=150
        )["choices"][0]["text"].strip()
    except Exception as e:
        baseline_answer = f"Error: {e}"

    # RAG: answer using the most relevant CV sentences as context
    rag_answer = answer_question(user_input, df)

    print(f"""
    Question: {user_input}

    Original Answer: {baseline_answer}
    Custom Answer:   {rag_answer}
    """)


    Question: who is Taua?

    Original Answer: Taua does not appear to be a known person. It could be a name or a term used in a specific context or language. Without more information, it is impossible to determine who Taua is.
    Custom Answer:   Tauã is a dedicated Mechanical Engineer with experience in data analytics, process automation, and machine learning applications.
    

    Question: where Taua graduated?

    Original Answer: I'm sorry, I do not have information about Taua's educational background.
    Custom Answer:   Polytechnic School of the University of São Paulo
    
Exiting.


As shown above, with the retrieved CV context the model answers both test questions correctly, while the same model without context cannot answer either.