## Custom ChatBot using RAG

This project builds a chatbot with Retrieval-Augmented Generation (RAG), using a custom dataset extracted from a PDF of my CV. The chatbot is designed to answer questions about my educational background. For each question, it retrieves the most relevant sentences from the CV and passes them to the language model as context, so the answer is grounded in the document rather than in the model's training data alone.

### Create the dataset

In [62]:
import PyPDF2
import pandas as pd
import openai

# Path to your PDF file
pdf_file = r"C:\Users\alsot\Downloads\CV_TAUA.pdf"

# Open the PDF file
with open(pdf_file, "rb") as file:
    reader = PyPDF2.PdfReader(file)
    
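    # Extract text from each page, separating pages with a newline
    text = ""
    for page in reader.pages:
        text += (page.extract_text() or "") + "\n"

# Split the text into sentence-level chunks. Collapsing whitespace first
# keeps words from being fused across line and page breaks.
# Note: splitting on "." also breaks abbreviations and decimal numbers.
normalized_text = " ".join(text.split())
sentences = [chunk.strip() for chunk in normalized_text.split(".")]

# Create a pandas DataFrame with one chunk per row, dropping empty chunks
df = pd.DataFrame(sentences, columns=["Paragraph"])
df = df[df["Paragraph"].ne("")].reset_index(drop=True)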

In [None]:
df.shape
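
The chunking step can be isolated into a small helper that is easy to check without a PDF. The sketch below is a minimal illustration rather than the notebook's exact code: `split_into_sentences` is a hypothetical name, the sample text is made up, and whitespace is normalized before splitting on periods.

```python
def split_into_sentences(text):
    """Collapse whitespace, split on periods, and drop empty chunks."""
    # " ".join(text.split()) turns every run of whitespace (including
    # newlines left by PDF line wraps) into a single space.
    normalized = " ".join(text.split())
    chunks = (chunk.strip() for chunk in normalized.split("."))
    return [chunk for chunk in chunks if chunk]


# Made-up sample with a line wrap inside each sentence
sample = "Mechanical Engineering\ngraduate. Works with data\nanalytics.  "
sentences = split_into_sentences(sample)
print(sentences)
```

Splitting on `"."` is a coarse heuristic: it also cuts abbreviations and decimal numbers, so a dedicated sentence tokenizer may be worth considering for longer documents.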

### Load credentials and create embeddings from the dataset

In [66]:
import openai
import json

with open("config.json") as f:
    config = json.load(f)

openai.api_base = config["api_base"]
openai.api_key = config["api_key"]

In [67]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["Paragraph"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

In [68]:
df.to_csv("embeddings.csv")
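
One caveat when reusing `embeddings.csv` later: a CSV cell can only hold text, so each embedding list is written as its string form and has to be parsed back into a list of floats after loading. The sketch below shows that round trip with the standard library only (`csv`, `io`, `ast`) and made-up three-dimensional vectors. With pandas, the same parsing could be applied after `pd.read_csv`, for example with `df["embeddings"].apply(ast.literal_eval)`.

```python
import ast
import csv
import io

# Toy rows standing in for the DataFrame (real ada-002 embeddings have 1536 values)
rows = [
    {"Paragraph": "Mechanical Engineering graduate", "embeddings": [0.1, -0.2, 0.3]},
    {"Paragraph": "Product Analyst", "embeddings": [0.0, 0.5, -0.5]},
]

# Write to an in-memory CSV; each list is stored as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Paragraph", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every cell is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
was_string = isinstance(loaded[0]["embeddings"], str)

# Parse each string back into a list of floats
for row in loaded:
    row["embeddings"] = ast.literal_eval(row["embeddings"])

print(was_string, loaded[0]["embeddings"])
```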

### Compute the cosine distance between the CV embeddings and the question embedding

In [71]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [None]:
get_rows_sorted_by_relevance("Who is Tauã?", df).head(5)
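
For reference, cosine distance is one minus the cosine similarity of two vectors, so a smaller value means the two texts point in more similar directions. The sketch below is a plain-Python illustration of that formula on toy two-dimensional vectors. `cosine_distance` and `rank_by_distance` are hypothetical helper names, not part of the `openai` package.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, candidates):
    """Return (index, distance) pairs sorted from most to least relevant."""
    scored = [(i, cosine_distance(query, vec)) for i, vec in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1])


query = [1.0, 0.0]
candidates = [
    [0.0, 1.0],  # orthogonal to the query
    [1.0, 1.0],  # 45 degrees from the query
    [2.0, 0.0],  # same direction, different length
]
ranking = rank_by_distance(query, candidates)
print(ranking)
```

The third candidate ranks first even though it is twice as long as the query, because cosine distance depends only on direction, not magnitude.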

### Use the retrieval function to build a prompt from the most relevant context

In [56]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["Paragraph"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
    

In [None]:
print(create_prompt("who is Taua?", df, 400))
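
The context-selection loop in `create_prompt` is a greedy packing under a token budget: rows are taken in relevance order until the next one would exceed the limit. The sketch below isolates that logic. To stay dependency-free it counts whitespace-separated words instead of `tiktoken` tokens, which is an assumption made only for illustration, and `count_tokens` and `pack_context` are hypothetical names.

```python
def count_tokens(text):
    # Stand-in for len(tokenizer.encode(text)); real token counts differ
    return len(text.split())


def pack_context(ranked_texts, question, template_overhead, max_token_count):
    """Take texts in ranked order until the token budget would be exceeded."""
    used = template_overhead + count_tokens(question)
    selected = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_token_count:
            break  # stop at the first row that does not fit
        selected.append(text)
    return selected


# Rows assumed to be already sorted from most to least relevant
ranked = [
    "Product Analyst at Itaú",
    "Background in Mechanical Engineering",
    "Fluent in English",
]
context = pack_context(ranked, "Who is Tauã?", template_overhead=5, max_token_count=16)
print(context)
```

Because the loop stops at the first row that does not fit, a shorter but less relevant row further down the ranking is never considered. That keeps the context strictly ordered by relevance at the cost of sometimes leaving part of the budget unused.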

### Function to Retrieve Answers from the Model

In [77]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=3000, max_answer_tokens=100
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens,
            temperature=0
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

In [78]:
while True:
    user_input = input("Type your question (or 'exit' to exit): ").strip()
    
    if user_input.lower() == "exit":
        print("Exiting.")
        break
    
    try:
        initial_who_is_answer = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=f"Question: {user_input}\nAnswer:",
            max_tokens=150
        )["choices"][0]["text"].strip()
    except Exception as e:
        initial_who_is_answer = f"Error: {e}"

    custom_who_is_answer = answer_question(user_input, df)

    print(f"""
    Question: {user_input}

    Original Answer: {initial_who_is_answer}
    Custom Answer:   {custom_who_is_answer}
    """)


    Question: who is Taua

    Original Answer: As an AI, I do not have access to personal information such as a person's name. Can you provide more context or details about Taua?
    Custom Answer:   Tauã is a Product Analyst at Itaú with a background in Mechanical Engineering and expertise in data analytics, process automation, and machine learning applications. He is fluent in English and holds certifications in Analytics Engineering Associate, Generative AI Associate, and Data Science Associate from Itaú. He has experience with internal bank databases and has pursued additional education in business. He is currently pursuing the Gen IA Nanodegree by Udacity to deepen his knowledge in artificial intelligence applications. He is committed to refining his
    

    Question: where Taua graduated?

    Original Answer: I'm afraid I can't answer that question as there is no information given about Taua or when they graduated. Do you have any other information that could help me answer 

As shown above, the baseline model has no information about Tauã and cannot answer either question, whereas the RAG-based chatbot, given the retrieved CV context, answers both.