# Custom Chatbot Project

The dataset is a parsed version of the Italian-language Wikipedia page https://it.wikipedia.org/wiki/Castelnuovo_di_Garfagnana.

It is an interesting use case because the trained model is able to answer questions about the town, but it makes a lot of errors. The relevant data is probably only partially present, or under-represented, in the model's training dataset.

Using RAG (retrieval-augmented generation), it is possible to see how the correctness of the answers improves, although the answers tend to be more concise.

## Data Wrangling

Parse the Wikipedia page and create a pandas DataFrame with a single "text" column containing one sentence per row.

In [2]:
PAGE_TITLE = "Castelnuovo di Garfagnana"
OUTPUT_DATA_FILEPATH = "./data/wiki_it_castelnuovo_garfagnana_nb.csv"
WIKIPEDIA_LANG = "it"
SKIP_SECTIONS = ["Collegamenti_esterni", "Altri_progetti"]

In [5]:
import os
from collections import defaultdict
from typing import Optional

import bs4
import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_dict_key_from_headings(
    last_h2_level_paragraph: str,
    last_h3_level_paragraph: Optional[str] = None,
    last_h4_level_paragraph: Optional[str] = None,
) -> str:
    key = f"{last_h2_level_paragraph}"
    if last_h3_level_paragraph is not None:
        key += f" - {last_h3_level_paragraph}"
    if last_h4_level_paragraph is not None:
        key += f" - {last_h4_level_paragraph}"
    return key


def get_cleaned_text(element) -> str:
    """
    Return the element text stripped, with non-breaking spaces and newlines replaced by spaces.
    """
    return element.get_text().strip().replace("\xa0", " ").replace("\n", " ")

# "query" action documentation: https://en.wikipedia.org/w/api.php?action=help&modules=query
# "explaintext": 1 is deliberately not passed, so the extract is returned as HTML. HTML is a bit more
# complex to parse, but it keeps all the information needed to understand when a list is present
params = {
    "action": "query",
    "prop": "extracts",
    "exlimit": 1,
    "titles": PAGE_TITLE,
    "exsectionformat": "wiki",
    "format": "json",
}

resp = requests.get(
    f"https://{WIKIPEDIA_LANG}.wikipedia.org/w/api.php", params=params
)
response_dict = resp.json()

page_dict = next(iter(response_dict["query"]["pages"].values()))
title = page_dict["title"]
html_text = page_dict["extract"]
soup = BeautifulSoup(html_text, "html.parser")
# Enable for DEBUG
#print(soup.prettify())

# Use BeautifulSoup to go through the page element by element:
# headings -> new section level
# <p>...</p> -> sentences
# <ul><li>..</li><li>...</li>...</ul> -> <li> elements to merge
sentences_dict = defaultdict(list)
last_h2_level_paragraph = None
last_h3_level_paragraph = None
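last_h4_level_paragraph = None
# Name of the previously processed tag; initialized so a list at the very start does not raise NameError
last_element_type = None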
for element in soup:
    if isinstance(element, bs4.Tag):
        if element.name == "p" and last_h2_level_paragraph is None:
            # Intro paragraphs before the first heading
            sentences_dict[title].append(get_cleaned_text(element))
        elif element.name == "h2":
            # First level paragraph
            last_h2_level_paragraph = element.attrs["data-mw-anchor"]
            last_h3_level_paragraph = None
            last_h4_level_paragraph = None
            building_list = False
        elif element.name == "h3":
            # Second level paragraph
            last_h3_level_paragraph = element.attrs["data-mw-anchor"]
            last_h4_level_paragraph = None
            building_list = False
        elif element.name == "h4":
            # Third level paragraph
            last_h4_level_paragraph = element.attrs["data-mw-anchor"]
            building_list = False
        elif element.name == "p":
            # Sentence of a paragraph
            # Concatenate the headings to provide context
            key = get_dict_key_from_headings(
                last_h2_level_paragraph,
                last_h3_level_paragraph,
                last_h4_level_paragraph,
            )
            # Search for <ul> inside <p>
            for p_children in element.children:
                if isinstance(p_children, bs4.Tag) and p_children.name == "ul":
                    raise ValueError("List <ul> inside a <p> not supported")
            sentences_dict[key].append(get_cleaned_text(element))

        elif element.name == "ul" or element.name == "dl":
            # Get the list elements and merge them into the previous sentence when appropriate:
            # MERGE when the previous sentence ends with ":" or when the previous element was also a list
            # DO NOT MERGE otherwise: each list element becomes a sentence of its own

            # Logic to merge the list elements
            list_content_str = ""
            for list_element in element.children:
                if (
                    list_element.name == "li"
                    or list_element.name == "dd"
                    or list_element.name == "dt"
                ):
                    list_text = get_cleaned_text(list_element)
                    list_content_str += list_text + "\n"
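            # Join the elements with "; ", dropping the punctuation that closed each element,
            # then cut the trailing "; " separator
            list_content_str = (
                list_content_str.replace("\n", "; ")
                .replace(",;", ";")
                .replace(";;", ";")
                .replace(".;", ";")[: -len("; ")]
            )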
            key = get_dict_key_from_headings(
                last_h2_level_paragraph,
                last_h3_level_paragraph,
                last_h4_level_paragraph,
            )
            last_sentence_for_key = (
                sentences_dict[key][-1] if len(sentences_dict[key]) > 0 else ""
            )
            if last_sentence_for_key.endswith(":"):
                # Concatenate the list elements with the previous sentence, which introduces the list content
                sentences_dict[key][-1] += " " + list_content_str
            elif last_element_type == "ul" or last_element_type == "dl":
                # The list could have already been started by a different ul or dl element.
                # Nesting is not supported in this case, so the content is simply concatenated
                print(
                    f"WARNING: probably there is a nested list; it will be squashed into a single level, list element content: '{list_content_str}'"
                )
                sentences_dict[key][-1] += "; " + list_content_str
            else:
                # The list is probably the content of an entire section and is not introduced by ":",
                # so it is worth keeping its elements as separate sentences
                sentences_dict[key].extend(list_content_str.split("; "))
        else:
            raise ValueError(f"Tag {element.name} not supported")
        last_element_type = element.name

df_content = {"text": []}
# Keys of subsections nested under a skipped section start with "<section> - "
skip_keys_start = tuple(
    skip_section + " - " for skip_section in SKIP_SECTIONS
)
for key, key_sentences in sentences_dict.items():
    if key not in SKIP_SECTIONS and not key.startswith(skip_keys_start):
        for key_sentence in key_sentences:
            if key_sentence != "":
                df_content["text"].append(f"{key} - {key_sentence}")

df = pd.DataFrame.from_dict(df_content)
print(f"{len(df)} sentences obtained from the page '{PAGE_TITLE}'")

os.makedirs(os.path.dirname(OUTPUT_DATA_FILEPATH), exist_ok=True)
df.to_csv(OUTPUT_DATA_FILEPATH)
print(f"CSV file saved to '{OUTPUT_DATA_FILEPATH}'")

80 sentences obtained from the page 'Castelnuovo di Garfagnana'
CSV file saved to './data/wiki_it_castelnuovo_garfagnana_nb.csv'
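
The list-handling rule in the cell above can be isolated into a small pure function, which makes it easier to check on toy input. This is a minimal sketch, not the notebook's code: `merge_list_into_sentences` and its parameters are hypothetical names, it works on already-cleaned strings instead of BeautifulSoup elements, and it strips trailing punctuation with `rstrip`, which is slightly more aggressive than the chained `replace` calls used above.

```python
from typing import List, Optional


def merge_list_into_sentences(
    sentences: List[str],
    list_items: List[str],
    previous_tag: Optional[str] = None,
) -> List[str]:
    """Return a new sentence list with `list_items` merged in.

    Hypothetical helper mirroring the three cases of the parsing loop:
    1. the previous sentence ends with ":" -> append the joined items to it
    2. the previous element was a list     -> squash into the previous sentence
    3. otherwise                           -> keep each item as its own sentence
    """
    result = list(sentences)  # copy, so the caller's list is not mutated
    # Drop empty items and the punctuation that closes each item
    items = [item.rstrip(",;.") for item in list_items if item]
    if not items:
        return result
    joined = "; ".join(items)
    if result and result[-1].endswith(":"):
        result[-1] += " " + joined
    elif result and previous_tag in ("ul", "dl"):
        result[-1] += "; " + joined
    else:
        result.extend(items)
    return result


# Toy input, not taken from the Wikipedia page
print(merge_list_into_sentences(["The list contains:"], ["first item,", "second item."]))
print(merge_list_into_sentences([], ["first item", "second item"]))
```

Keeping the rule free of parsing details means the three cases can be checked without downloading the page.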


In [6]:
df.head()

Unnamed: 0,text
0,Castelnuovo di Garfagnana - Castelnuovo di Gar...
1,Geografia_fisica - Territorio - Sorge alla con...
2,Geografia_fisica - Clima - Classificazione sis...
3,Geografia_fisica - Clima - Classificazione cli...
4,Geografia_fisica - Clima - Diffusività atmosfe...


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.
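
One possible shape for the custom query is sketched below: rank the rows of the "text" column by cosine distance to the question, then fill a prompt with the closest rows until a size budget is reached. Everything here is an assumption made for illustration: the function names, the prompt wording, the toy two-dimensional vectors standing in for real embeddings, and the word count standing in for a real token count. The call to the `Completion` model is left out.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_rows(
    question_vec: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
) -> List[str]:
    """Return the row texts sorted from most to least relevant."""
    ordered = sorted(rows, key=lambda row: cosine_distance(question_vec, row[1]))
    return [text for text, _ in ordered]


def build_prompt(question: str, ranked_texts: List[str], max_words: int = 150) -> str:
    """Add the most relevant rows to the prompt until the word budget is used up."""
    header = (
        "Answer the question based on the context below. "
        'If the context does not contain the answer, say "I don\'t know".\n\n'
        "Context:\n\n"
    )
    footer = f"\n\n---\n\nQuestion: {question}\nAnswer:"
    # Words still available for context after the fixed parts of the prompt
    budget = max_words - len((header + footer).split())
    context = []
    for text in ranked_texts:
        cost = len(text.split())
        if cost > budget:
            break
        context.append(text)
        budget -= cost
    return header + "\n\n###\n\n".join(context) + footer


# Toy rows and vectors, not real embeddings of the dataset
toy_rows = [
    ("Clima - (toy text about the climate)", [0.0, 1.0]),
    ("Storia - (toy text about the history)", [1.0, 0.0]),
]
toy_question_vec = [0.9, 0.1]
ranked = rank_rows(toy_question_vec, toy_rows)
print(build_prompt("(your question here)", ranked))
```

In a real run, the vectors would come from an embedding model applied to both the question and each row of the DataFrame, and the budget would be measured in model tokens.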

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
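
A small harness can keep the two answers for each question side by side. The sketch below is only a suggestion: `compare_answers` is a hypothetical helper, and the two placeholder functions stand in for the basic `Completion` query and the custom query, which would have to be supplied.

```python
from typing import Callable, Dict, List


def compare_answers(
    questions: List[str],
    basic_answer: Callable[[str], str],
    custom_answer: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Ask every question to both answer functions and collect the results."""
    return [
        {
            "question": question,
            "basic": basic_answer(question).strip(),
            "custom": custom_answer(question).strip(),
        }
        for question in questions
    ]


# Placeholder answer functions: swap in the real model calls here
def placeholder_basic(question: str) -> str:
    return f"(basic answer to: {question})"


def placeholder_custom(question: str) -> str:
    return f"(custom answer to: {question})"


for row in compare_answers(
    ["(question 1)", "(question 2)"], placeholder_basic, placeholder_custom
):
    print(f"Q: {row['question']}\n  basic:  {row['basic']}\n  custom: {row['custom']}\n")
```

Passing the answer functions as arguments keeps the comparison independent of how each query is implemented.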

### Question 1

### Question 2