# Custom Chatbot Project

For this project, I decided to use the Wikipedia article on Martin Garrix, a Dutch DJ and music producer, as my dataset. Since I personally enjoy music (especially EDM), I thought building a chatbot around an artist I like would be both interesting and fun.

While regular chatbots usually do fine answering general questions about famous people, they often miss out on specific details like early career moments, exact release dates, or event-related facts. Even advanced models that use online searches don't always guarantee accuracy, since they rely heavily on external sources. By choosing a targeted dataset from Wikipedia, I hope my chatbot will deliver more precise and detailed answers about Martin Garrix, making it particularly useful for music fans.

## Data Wrangling

In [1]:
import requests
import pandas as pd

def get_wikipedia_article(title):
    """
    Fetches the plain text content of a Wikipedia article by title.
    """
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": True,
        "titles": title,
        "format": "json",
    }

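    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
    data = response.json()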

    page = next(iter(data["query"]["pages"].values()))
    return page["extract"]

# Fetch article text for Martin Garrix
article_text = get_wikipedia_article("Martin Garrix")

# Split the article into chunks, one per non-empty line (paragraphs and section headings)
paragraphs = [p.strip() for p in article_text.split("\n") if p.strip()]

# Create DataFrame with "text" column
df = pd.DataFrame(paragraphs, columns=["text"])

# Show first few rows
df.head()


|   | text |
|---|------|
| 0 | Martijn Gerard Garritsen (Dutch pronunciation:... |
| 1 | Garrix has performed at music festivals such a... |
| 2 | == Early life == |
| 3 | Garrix was born as Martijn Gerard Garritsen on... |
| 4 | In 2004, he expressed interest in becoming a D... |
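
The preview above shows that section headings such as `== Early life ==` end up as rows of their own. A minimal sketch of one way to handle them, assuming headings always follow the `== Title ==` pattern: drop the heading rows and record the section name next to each paragraph. The helper `attach_sections` and the `"Intro"` label are hypothetical names for illustration, not part of the original notebook.

```python
import re

# Matches plain-text section headings such as "== Early life ==".
HEADING_RE = re.compile(r"^=+\s*(.*?)\s*=+$")

def attach_sections(paragraphs):
    """Return (section, text) pairs, dropping heading-only lines.

    Lines before the first heading get the placeholder section "Intro"
    (a hypothetical label chosen for this sketch).
    """
    section = "Intro"
    rows = []
    for line in paragraphs:
        match = HEADING_RE.match(line)
        if match:
            # Remember the heading and skip the line itself.
            section = match.group(1)
        else:
            rows.append((section, line))
    return rows

# Placeholder lines standing in for real article text.
sample = [
    "Lead paragraph.",
    "== Early life ==",
    "First paragraph of the section.",
]
rows = attach_sections(sample)
```

The result could then be loaded with `pd.DataFrame(rows, columns=["section", "text"])`, which keeps the `text` column used elsewhere in the notebook.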


In [2]:
# Save the data in CSV format, creating the output folder if it is missing
import os
os.makedirs("data", exist_ok=True)
df.to_csv("data/martin_garrix_wiki.csv", index=False)

## Custom Query Completion

In this part of the project, I set up the logic for sending custom queries to the OpenAI Completion model using the Martin Garrix Wikipedia article as additional context. The idea is to help the chatbot give answers that are more accurate and relevant to specific questions about the artist.

I configured OpenAI to use Vocareum’s API endpoint, as instructed by Udacity. Then, I created two simple functions:

- One that sends basic queries without extra context.

- Another that prepends text from the article (the first 50 rows of the dataset) to the user's question to improve responses.

This approach lets me easily compare how the model performs on its own versus when it's supported by extra information from the dataset.

In [3]:
# Configure OpenAI for Vocareum environment
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "voc-xxxx"

In [4]:
def basic_query(question):
    # Sends the question directly to the model without extra context
    response = openai.Completion.create(
        engine="gpt-3.5-turbo-instruct",
        prompt=question,
        temperature=0.7,
        max_tokens=200
    )
    return response.choices[0].text.strip()

In [5]:
def custom_query(question, context_df):
    # Build one context string from the first 50 rows only,
    # a rough cap to stay within the model's token limit
    context = "\n".join(context_df["text"].head(50).tolist())
    
    # Insert context into the prompt
    prompt = f"Answer the question based on the following text:\n\n{context}\n\nQuestion: {question}\nAnswer:"
    
    response = openai.Completion.create(
        engine="gpt-3.5-turbo-instruct",
        prompt=prompt,
        temperature=0.7,
        max_tokens=200
    )
    return response.choices[0].text.strip()
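
Taking the first 50 rows means later sections of the article never reach the model. One possible alternative, sketched here under simple assumptions, is to rank paragraphs by how many words they share with the question and add them until a size budget is reached. The names `tokenize`, `select_context`, and `max_chars` are hypothetical, and the character budget is only a rough proxy for a token limit, not a real token count.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_context(question, paragraphs, max_chars=6000):
    """Pick the paragraphs sharing the most words with the question.

    max_chars is a hypothetical character budget used as a rough proxy
    for the model's token limit.
    """
    question_words = tokenize(question)
    # Score each paragraph by word overlap; keep the index for stable ordering.
    scored = [
        (len(question_words & tokenize(p)), i, p)
        for i, p in enumerate(paragraphs)
    ]
    # Highest overlap first; ties keep the original article order.
    scored.sort(key=lambda item: (-item[0], item[1]))

    chosen, used = [], 0
    for score, i, p in scored:
        # Skip paragraphs with no overlap or that would exceed the budget.
        if score == 0 or used + len(p) > max_chars:
            continue
        chosen.append((i, p))
        used += len(p)
    # Restore article order so the context reads naturally.
    return "\n".join(p for _, p in sorted(chosen))
```

Inside `custom_query`, the context line could then become `context = select_context(question, context_df["text"].tolist())`. Plain word overlap is a deliberately simple choice; embedding-based similarity would be a natural next step but needs an extra API call per paragraph.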

## Custom Performance Demonstration

Below are two sample questions about Martin Garrix. For each question, I show two different responses from the model:

- Basic Query: The model answers without any extra information.

- Custom Query: The same question, but with details from the Wikipedia article provided for context.

This helps illustrate how the responses improve when the model has access to more relevant background information.

In [6]:
q1 = "Which artists has Martin Garrix collaborated with?"
q2 = "Which is the latest award Martin Garrix has received?"

### Question 1

In [7]:
print("Q1:", q1)

print("\nBasic Answer:")
print(basic_query(q1))

print("\nCustom Answer:")
print(custom_query(q1, df))

Q1: Which artists has Martin Garrix collaborated with?

Basic Answer:
Martin Garrix has collaborated with a variety of artists, including Bebe Rexha, Dua Lipa, Khalid, David Guetta, Troye Sivan, and Usher. He has also collaborated with other EDM artists such as Tiësto, Dimitri Vegas & Like Mike, and Hardwell.

Custom Answer:
Martin Garrix has collaborated with a variety of artists, including Julian Jordan, TV Noise, Jay Hardway, Sander Van Doorn, Dimitri Vegas & Like Mike, Firebeatz, Dillon Francis, Hardwell, Afrojack, MOTi, Ed Sheeran, Bebe Rexha, Third Party, Linkin Park, Area21, Matisse and Sadko, Troye Sivan, David Guetta, Ellie Goulding, Romy Dya, Jamie Scott, Brooks, Loopers, Khalid, Bonn, CMC$, Icona Pop, Justin Mylo, Dewain Whitmore Jr., Dyro, Pierce Fulton, and Mike Shinoda.


The basic model gave a general list of popular artists Martin Garrix has worked with. Although correct, it was quite limited. However, when the model used the Wikipedia article as context, it returned a more detailed list of collaborations, including less-famous artists and specific mentions from Garrix’s career. This clearly shows how providing additional context can significantly improve the quality of responses.

### Question 2

In [8]:
print("Q2:", q2)

print("\nBasic Answer:")
print(basic_query(q2))

print("\nCustom Answer:")
print(custom_query(q2, df))

Q2: Which is the latest award Martin Garrix has received?

Basic Answer:
The latest award Martin Garrix has received is the 2021 DJ Mag Top 100 DJs award, for the fourth consecutive year.

Custom Answer:
The latest award Martin Garrix has received is being ranked number one on DJ Mag's annual Top 100 DJs list in October 2018.


The basic model gave the more recent answer, citing a 2021 DJ Mag Top 100 DJs award. The custom model instead named the October 2018 ranking as the latest award. When I checked, I realised the Wikipedia article I provided doesn't list all his awards; it links to a separate page for that. Because of this, the custom query performed worse, as the model could only answer from the partial information it was given.

This highlights an important limitation when using custom datasets: if critical details are incomplete or stored elsewhere, the model can't access them, reducing its accuracy. Effective retrieval-based answers depend heavily on having complete, easily accessible data.
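
One way to reduce this gap, sketched below, is to build the dataset from several related articles rather than one. The helper `build_corpus` is a hypothetical name; it takes the fetch function as an argument so it can be tried offline, and the article titles and text in the example are made up.

```python
def build_corpus(titles, fetch):
    """Collect non-empty lines from several articles.

    fetch is any callable mapping a title to plain text, for example the
    get_wikipedia_article function defined earlier in this notebook.
    Returns a list of {"source": title, "text": line} dicts.
    """
    rows = []
    for title in titles:
        for line in fetch(title).split("\n"):
            line = line.strip()
            if line:
                # Keep the source so answers can be traced back to a page.
                rows.append({"source": title, "text": line})
    return rows

# Stand-in fetcher so the sketch runs offline; titles and text are made up.
fake_articles = {
    "Main article": "Overview paragraph.\n\nCareer paragraph.",
    "Awards list": "Award entry one.\nAward entry two.",
}
corpus = build_corpus(fake_articles.keys(), fake_articles.get)
```

With real data, `pd.DataFrame(build_corpus(titles, get_wikipedia_article))` would give a DataFrame with `source` and `text` columns, where `titles` includes the main article plus the linked awards page (its exact title would need to be looked up).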

## Conclusion

Through this project, I explored how customising a chatbot with specific data, like the Wikipedia article about Martin Garrix, can lead to better and more relevant answers. In the first example, the custom model clearly outperformed the basic one by offering a richer and more detailed list of collaborators. But in the second example, it failed to give the most recent award because the dataset lacked those details.

These cases illustrate the trade-off involved when building specialised chatbots: they can greatly enhance performance in specific topics, but depend heavily on how complete, accurate, and current the data is. Ultimately, careful data selection and preparation are just as crucial as crafting effective prompts when developing reliable AI systems.