# Custom Chatbot Project

**TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task**

In this project, we use a dataset of 101 facts about the Premier League's 2022/23 season, sourced from the `theanalyst.com` website. We chose this dataset because we want to build a tool that acts as an expert and answers questions about this specific league and season.

To improve the tool's question-answering ability, we use Retrieval-Augmented Generation (RAG). This technique supplements the prompt with the facts from the dataset that are most similar to the question, so the model can give more accurate and relevant answers.
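
The retrieval step can be illustrated without calling any API. The following is a minimal sketch, assuming toy three-dimensional vectors in place of real embeddings; `cosine_distance`, `retrieve`, and the sample facts are hypothetical names and data used only for illustration, not part of the notebook's code.

```python
import math
from typing import List, Tuple


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def retrieve(question_vec: List[float],
             facts: List[Tuple[str, List[float]]],
             n: int = 2) -> List[str]:
    """Return the n facts whose vectors are closest to the question vector."""
    ranked = sorted(facts, key=lambda fact: cosine_distance(fact[1], question_vec))
    return [text for text, _ in ranked[:n]]


# Toy data: made-up vectors standing in for real embeddings.
toy_facts = [
    ("Fact about goals", [1.0, 0.0, 0.0]),
    ("Fact about managers", [0.0, 1.0, 0.0]),
    ("Fact about transfers", [0.0, 0.0, 1.0]),
]
toy_question_vec = [0.9, 0.1, 0.0]

# The closest facts become the context that is prepended to the question.
context = retrieve(toy_question_vec, toy_facts, n=2)
prompt = "Context:\n" + "\n\n".join(context) + "\n\nQuestion: <user question>"
print(prompt)
```

A smaller distance means a closer match, so sorting in ascending order puts the most relevant facts first.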

In [1]:
# Environment variables
OPENAI_API_KEY = 'YOUR API KEY'

# URLs and file paths
SOURCE_URL = 'https://theanalyst.com/eu/2023/05/101-best-premier-league-facts-2022-23'
HTML_PAGE_FILEPATH = './html_page.html'
CSV_FILEPATH_WITH_EMBEDDINGS = './wikipedia_with_embeddings.csv'

# OpenAI Models
EMBEDDING_MODEL = 'text-embedding-3-small'
COMPLETION_MODEL = 'gpt-3.5-turbo'

# Batch size for processing
BATCH_SIZE = 25

## Data Wrangling

**TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.**

In [2]:
# Importing libraries
import requests
import pandas as pd
from openai import OpenAI
from bs4 import BeautifulSoup
from typing import List, Union, Dict
from scipy.spatial.distance import cosine

In [3]:
# Helper function to fetch HTML page from a URL
def fetch_html_page(url: str) -> bytes:
    """
    Fetches HTML content from a given URL.
    
    Args:
        url (str): The URL of the webpage to fetch.
        
    Returns:
        bytes: The HTML content of the webpage.
        
    Raises:
        Exception: If the server does not respond with HTTP status 200.
    """
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
    }
    # A timeout prevents the request from hanging indefinitely
    response = requests.get(url, headers=headers, timeout=30)
    
    if response.status_code == 200:
        return response.content
    else:
        raise Exception(f'Connection error (HTTP status {response.status_code})')

# Save the HTML page to a file
with open(HTML_PAGE_FILEPATH, mode='wb') as html_file:
    html_page = fetch_html_page(SOURCE_URL)
    html_file.write(html_page)
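
The cell above downloads the page every time it runs. One possible way to avoid repeated requests is to reuse the saved file when it already exists. This is only a sketch: `load_or_fetch` and `fake_fetch` are hypothetical helpers introduced here, and the stand-in fetcher replaces the real HTTP call so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch(path: Path, fetch: Callable[[], bytes]) -> bytes:
    """Return cached bytes from `path`, calling `fetch` only when the file is missing."""
    if path.exists():
        return path.read_bytes()
    content = fetch()
    path.write_bytes(content)
    return content


# Stand-in fetcher that records how often it is called (no real HTTP request).
calls = []


def fake_fetch() -> bytes:
    calls.append(1)
    return b"<html>demo</html>"


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = Path(tmp_dir) / "html_page.html"
    first = load_or_fetch(cache_path, fake_fetch)   # file missing: fetches and writes
    second = load_or_fetch(cache_path, fake_fetch)  # file present: reads from disk

print(first == second, len(calls))
```

In the notebook, the `fetch` argument could be `lambda: fetch_html_page(SOURCE_URL)` and the path `Path(HTML_PAGE_FILEPATH)`.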

### Read HTML Page

In [3]:
# Function to extract data from HTML
def extract_data_from_html(html_file_path: str) -> pd.DataFrame:
    """
    Extracts data from an HTML file.
    
    Args:
        html_file_path (str): The file path to the HTML file.
        
    Returns:
        pd.DataFrame: A DataFrame containing the extracted data.
    """
    # Parsing the HTML file (opened in binary mode so BeautifulSoup detects the encoding)
    with open(html_file_path, mode='rb') as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    # Finding the root DOM node
    root_dom_node = soup.find('h2', {'class': 'has-text-align-center wp-block-heading'})

    # Extracting month headers
    month_headers = [month_header.find_next('strong') for month_header in soup.find_all('h2', {'class': 'has-text-align-center wp-block-heading'})]

    current_month = None
    data = []

    # Loop through DOM nodes to extract data
    for node in root_dom_node.find_all_next():
        if node in month_headers:
            current_month = node.text
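        elif node.name == 'ul':
            # NOTE: the year is hard-coded as 2024, although the facts cover
            # the 2022/23 season (August 2022 to May 2023).
            data.append(f"{current_month} 2024 -- {node.find_next('li').text.strip()}")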

    # Creating DataFrame from extracted data
    df = pd.DataFrame(data, columns=['text'])

    return df
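
The loop above labels every fact with the year 2024, while the facts themselves span August 2022 to May 2023. A possible way to derive the year from the month header is sketched below. It assumes the headers are plain English month names (as in "August") and that the season starts in August; `MONTHS` and `season_year` are hypothetical names, not part of the original notebook.

```python
# Assumption: the 2022/23 season runs from August 2022 to May 2023, so months
# from August onward belong to 2022 and earlier months belong to 2023.
MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']


def season_year(month_name: str, start_year: int = 2022) -> int:
    """Map a month name to its calendar year within a season starting in August."""
    month_number = MONTHS.index(month_name.strip().capitalize()) + 1
    return start_year if month_number >= 8 else start_year + 1


print(season_year('August'))
print(season_year('March'))
```

With such a helper, the label could be built as `f"{current_month} {season_year(current_month)} -- ..."`. The outputs recorded in this notebook were produced with the original hard-coded label.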

In [5]:
# Extracting data from HTML and displaying DataFrame
df = extract_data_from_html(HTML_PAGE_FILEPATH)

# Setting display options for DataFrame
pd.set_option('display.max_colwidth', None)  
pd.set_option('display.max_rows', None)

In [6]:
# Displaying DataFrame and its shape
print(df.head())
print(df.shape)

                                                                                                                                                                                                                                                         text
0                                                                           August 2024 -- On 13 August 2022, Manchester City ended the day top and Manchester United ended the day bottom of the top-flight table for the first time since 29 November 1929.
1                                                                                                August 2024 -- Erik ten Hag became the first manager to lose each of his first two games in charge of Manchester United since John Chapman in November 1921.
2                                                August 2024 -- Harry Kane netted his 185th Premier League goal for Tottenham Hotspur against Wolves, overtaking Sergio Aguero’s record for Premier League goals for a single club (184 for Ma

### Create Embedding Database

In [4]:
# Initialize OpenAI client
openai_client = OpenAI(api_key=OPENAI_API_KEY)

In [8]:
# Reset display options for pandas DataFrame
pd.reset_option('display.max_colwidth')
pd.reset_option('display.max_rows')

In [8]:
# Function to get embeddings from OpenAI API
def get_embeddings(prompt: Union[str, List[str]], embedding_model: str) -> List[List[float]]:
    """
    Retrieves embeddings from OpenAI API for the given prompt using the specified embedding model.

    Args:
        prompt (Union[str, List[str]]): Input prompt or list of prompts.
        embedding_model (str): Name of the embedding model to use.

    Returns:
        List[List[float]]: List of embeddings for the input prompt(s).
    """
    response = openai_client.embeddings.create(
        input=prompt if isinstance(prompt, list) else [prompt],
        model=embedding_model
    )
    return [row.embedding for row in response.data]

# Function to create embeddings for DataFrame
def create_embeddings(df: pd.DataFrame, embedding_model_name: str = EMBEDDING_MODEL, batch_size: int = BATCH_SIZE) -> List[List[float]]:
    """
    Creates embeddings for the text data in the DataFrame using the specified embedding model.

    Args:
        df (pd.DataFrame): DataFrame containing text data.
        embedding_model_name (str): Name of the embedding model to use.
        batch_size (int): Size of batches for processing.

    Returns:
        List[List[float]]: List of embeddings corresponding to the text data.
    """
    embeddings_output = []
    for idx in range(0, len(df), batch_size):
        batch = df.iloc[idx:idx+batch_size]['text'].tolist()
        embeddings = get_embeddings(batch, embedding_model_name)
        embeddings_output.extend(embeddings)
    return embeddings_output
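
The batching in `create_embeddings` relies on slicing with a fixed step. The sketch below isolates that logic with placeholder strings instead of API calls. `iter_batches` is a hypothetical helper, and the row count of 101 is an assumption based on the dataset description.

```python
from typing import Iterator, List


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 101 placeholder strings standing in for the 101 facts.
placeholder_texts = [f"fact {i}" for i in range(101)]
batch_sizes = [len(batch) for batch in iter_batches(placeholder_texts, 25)]
print(batch_sizes)  # [25, 25, 25, 25, 1]
```

Under these assumptions, a batch size of 25 results in five embedding requests, the last of which carries a single text.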

In [10]:
# Add embeddings to DataFrame and save to CSV
df['embedding'] = create_embeddings(df)
df.to_csv(CSV_FILEPATH_WITH_EMBEDDINGS, sep=',', index=False)

# Display DataFrame head
print(df.head())

                                                text  \
0  August 2024 -- On 13 August 2022, Manchester C...   
1  August 2024 -- Erik ten Hag became the first m...   
2  August 2024 -- Harry Kane netted his 185th Pre...   
3  August 2024 -- Brenden Aaronson’s opening goal...   
4  August 2024 -- Darwin Núñez came off the bench...   

                                           embedding  
0  [-0.01703871600329876, 0.007503733970224857, 0...  
1  [-0.054760657250881195, -0.011442207731306553,...  
2  [0.004939332604408264, -0.01658555120229721, 0...  
3  [0.005131910089403391, -0.006789827719330788, ...  
4  [-0.055598579347133636, 0.012705056928098202, ...  


## Custom Query Completion

**TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.**

In [5]:
def build_simple_prompt(question: str) -> List[Dict[str, str]]:
    """
    Builds a simple prompt for asking a question.

    Args:
        question (str): The question to include in the prompt.

    Returns:
        List[Dict[str, str]]: A list containing a single message with the user role and the provided question.
    """
    return [
        {
            'role': 'user',
            'content': question
        }
    ]

def build_custom_prompt(question: str, database_df: pd.DataFrame) -> List[Dict[str, str]]:
    """
    Builds a custom prompt including context for asking a question based on a database DataFrame.

    Args:
        question (str): The question to include in the prompt.
        database_df (pd.DataFrame): The DataFrame containing the database of facts.

    Returns:
        List[Dict[str, str]]: A list containing two messages: system message with context and user message with the question.
    """
    return [
        {
            'role': 'system',
            'content': """
            Answer the question based on the provided context below. If the question cannot be answered based on the provided context, say "I don't know the answer". The current year is 2024. The context contains facts from the 2022/2023 English Premier League season. Facts are annotated with a date and separated by blank lines.
            Context: 
                {}
            """.format('\n\n'.join(build_custom_context(question, database_df)))
        },
        {
            'role': 'user',
            'content': question
        }
    ]

def build_custom_context(question: str, database_df: pd.DataFrame, n: int = 5) -> List[str]:
    """
    Builds a custom context for a given question based on the closest facts from a database DataFrame.

    Args:
        question (str): The question for which the context is being built.
        database_df (pd.DataFrame): The DataFrame containing the database of facts.
        n (int): The number of closest facts to include in the context.

    Returns:
        List[str]: A list of closest facts to the question.
    """
    question_embedding = get_embeddings(question, EMBEDDING_MODEL)[0]
    
    df = database_df.copy()
    df["distances"] = df['embedding'].apply(lambda embedding: cosine(embedding, question_embedding))

    df.sort_values("distances", ascending=True, inplace=True)
    return df.iloc[:n]['text'].tolist()

def handle_question(prompt: List[Dict[str, str]], client: OpenAI, model_name: str = COMPLETION_MODEL) -> str:
    """
    Handles a question prompt by generating a response using the specified model.

    Args:
        prompt (List[Dict[str, str]]): The prompt messages to send to the model.
        client (OpenAI): The OpenAI client used to send the request.
        model_name (str): The name of the completion model to use.

    Returns:
        str: The response generated by the model.
    """
    response = client.chat.completions.create(
        model=model_name,
        messages=prompt,
        max_tokens=100
    )
    return response.choices[0].message.content

## Custom Performance Demonstration

**TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.**

In [6]:
# Read the DataFrame from CSV file
df = pd.read_csv(CSV_FILEPATH_WITH_EMBEDDINGS)

# Convert embedding values from string to list of floats
df['embedding'] = df['embedding'].apply(lambda value: [float(dim) for dim in value.replace('[', '').replace(']', '').split(',')])
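
The conversion above is needed because a list stored in a CSV cell is read back as text. An alternative, sketched here under the assumption that each cell holds the string form of a list of finite floats, is to parse it with the standard `json` module. `parse_embedding` is a hypothetical helper name.

```python
import json
from typing import List


def parse_embedding(value: str) -> List[float]:
    """Parse a stringified list such as '[0.1, -0.2]' back into floats."""
    return [float(x) for x in json.loads(value)]


stored = str([0.25, -0.5, 1e-05])  # how a list cell ends up as text in a CSV
print(parse_embedding(stored))
```

In the notebook this could be applied as `df['embedding'].apply(parse_embedding)`.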

### Question 1

In [9]:
# Define the question
question_1 = 'Who won the Premier League in the 2022/2023 season?'

# Print answer without context
print('Answer without Context: \n', handle_question(build_simple_prompt(question_1), openai_client))

# Print answer with context
print('\nAnswer with Context: \n', handle_question(build_custom_prompt(question_1, df), openai_client))

Answer without Context: 
 It is not possible to accurately answer this question as the 2022/2023 Premier League season has not yet taken place.

Answer with Context: 
 Man City won the Premier League in the 2022/2023 season.


### Question 2

In [11]:
# Define the question
question_2 = 'Which football team did Harry Kane play for in the 2022/2023 season?'

# Print answer without context
print('Answer without Context: \n', handle_question(build_simple_prompt(question_2), openai_client))

# Print answer with context
print('\nAnswer with Context: \n', handle_question(build_custom_prompt(question_2, df), openai_client))

Answer without Context: 
 Harry Kane played for Manchester City in the 2022/2023 season.

Answer with Context: 
 Harry Kane played for Tottenham Hotspur in the 2022/2023 season.


### Question 3

In [12]:
# Define the question
question_3 = 'Which team finished the match with the most competitive win? What was the result? Who was the opponent?'

# Print answer without context
print('Answer without Context: \n', handle_question(build_simple_prompt(question_3), openai_client))

# Print answer with context
print('\nAnswer with Context: \n', handle_question(build_custom_prompt(question_3, df), openai_client))

Answer without Context: 
 I'm sorry, but I would need more information in order to provide a specific answer to your question. Please provide the name of the sport and teams involved in the match.

Answer with Context: 
 Liverpool finished the match with the most competitive win, defeating Manchester United 7-0 in March 2024.


### Conclusions

- **Question 1:** Without context, the model declined to answer, claiming that the 2022/2023 season had not yet taken place. With context from the custom database, it correctly named Man City (Manchester City) as the champions.
- **Question 2:** Without context from the custom database, the model gave an incorrect answer (Manchester City). With the context, it gave the correct answer (Tottenham Hotspur), which underscores how much the retrieved context matters for accuracy.
- **Question 3:** Without context, the model asked for more information. With context, it returned Liverpool's 7-0 win over Manchester United, but the date "March 2024" is wrong: the match took place in March 2023, and the error comes from the "2024" label that `extract_data_from_html` hard-codes on every fact.