# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

<p><font color=green><h3> Explanation of Dataset </h3> <br>
For this task, I have chosen the <b>New York Times (NYT) News API</b> as the primary dataset to power our custom chatbot. The NYT covers current affairs in depth, and its API is an excellent source of <b>near-real-time, structured news</b> on topics such as <b>politics, business, the stock market, technology</b>, and more.

<br>    
<h3> Why NYT News data is appropriate for the task </h3>
    
- NYT provides a <b>free API</b> that gives access to articles across many categories, making it a suitable choice for our chatbot.
- The articles give the chatbot what it needs to return <b>accurate, relevant, and timely responses</b> to users asking about current events.
- Because the data is <b>near real time</b>, it is a good exercise for RAG: the retrieved articles supply context on current affairs that GPT's training data lacks, and GPT turns that context into an answer.
- I've used this API in the past, so I’m slightly leaning towards it (a bit of 'familiarity bias' in play &#x1F605;)!
</font></p>

## Get Data from NYT News API

### Importing Libraries

In [1]:
import requests
import pandas as pd
from datetime import datetime
import time


### NYT API Key

In [2]:
NYT_API_KEY = 'YOUR_API_KEY'
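
Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead, assuming an environment variable named `NYT_API_KEY` (the variable name and the `load_api_key` helper are illustrative choices, not part of the NYT API):

```python
import os


def load_api_key(var_name="NYT_API_KEY"):
    """Return the API key stored in the environment variable `var_name`.

    Raises a RuntimeError with a readable message if the variable is unset
    or empty, so a missing key fails early instead of as an HTTP error later.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

With this in place, the key cell could read `NYT_API_KEY = load_api_key()`.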


### Filtering News by topics and date

<div class="alert alert-block alert-info">
<b>API Limitation:</b> Because of rate limits on the free tier of the NYT News API, the number of topics and the date range are restricted. This limits how much context the chatbot has on current affairs.
</div>
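
One way to live within the free-tier rate limit is to retry with a growing delay when the API signals throttling. The sketch below assumes throttling shows up as HTTP status 429 (the standard "Too Many Requests" code); `fetch_with_backoff`, its parameters, and the injected `fetch` callable are hypothetical helpers, not part of `requests` or the NYT API.

```python
import time


def fetch_with_backoff(fetch, max_retries=3, base_delay=10, sleep=time.sleep):
    """Call `fetch()` until it returns a non-429 status or retries run out.

    `fetch` is any zero-argument callable returning an object with a
    `status_code` attribute (for example a `requests.Response`).
    The wait doubles after each throttled attempt: 10s, 20s, 40s, ...
    """
    response = fetch()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Assumption: 429 means "slow down"; wait longer each time
        sleep(base_delay * 2 ** attempt)
        response = fetch()
    return response
```

In this notebook, the request could be wrapped as `fetch_with_backoff(lambda: requests.get(url, params=params))`. Passing `sleep` as a parameter keeps the helper easy to test without real waiting.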

In [3]:
topics = [
    "DeepSeek", "Business Day", "Business", "Entrepreneurs", "Financial", "Magazine", "Personal Investing",
    "Personal Tech", "Politics", "Retail", "Small Business", "Sunday Business", "Technology", 
    "Washington", "Week", "Your Money", "World", "Stocks", "Share market", "President", "Policies"
]

begin_date = '2025-02-04'
end_date = '2025-02-04'


### API Call to get the News data

In [4]:
articles_list = []

def fetch_articles_for_topic(topic):
    page = 0
    total_pages = 1
    while page < total_pages:
        url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'
        
        params = {
            'q': topic,
            'begin_date': begin_date,
            'end_date': end_date,
            'api-key': NYT_API_KEY,
            'page': page
        }
        
        response = requests.get(url, params=params)

        if response.status_code == 200:
            data = response.json()
            articles = data['response']['docs']
            
            for article in articles:
                title = article['headline']['main']
                # Use a distinct name so the request URL above is not shadowed
                article_url = article['web_url']
                snippet = article['snippet']
                pub_date = article['pub_date']
                articles_list.append({
                    'Topic': topic,
                    'Title': title,
                    'URL': article_url,
                    'Snippet': snippet,
                    'Published Date': pub_date
                })
            
            total_hits = data['response']['meta']['hits']
            # 10 results per page; ceiling division avoids requesting an
            # extra, empty page when the hit count is a multiple of 10
            total_pages = max(1, (total_hits + 9) // 10)
            
            page += 1
            print(f"Fetched page {page} of {total_pages} for topic: {topic}")
        else:
            print(f"Error fetching data for topic '{topic}', page {page}: {response.status_code}")
            break
        
        # Pause between requests to avoid hitting rate limits
        time.sleep(10)
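
The number of pages to request follows from the hit count and the page size of 10 results that the loop above assumes. A small sketch of that arithmetic using ceiling division, so an exact multiple of 10 does not trigger one extra, empty request (`pages_needed` is an illustrative helper name):

```python
def pages_needed(total_hits, page_size=10):
    """Return how many pages are required to cover `total_hits` results."""
    if total_hits <= 0:
        return 0
    # Ceiling division without floats: 10 hits -> 1 page, 11 hits -> 2 pages
    return (total_hits + page_size - 1) // page_size


for hits in (0, 4, 10, 11, 37):
    print(hits, "hits ->", pages_needed(hits), "page(s)")
```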


In [5]:
for topic in topics[:3]:
    print("Fetching for topic: ", topic)
    fetch_articles_for_topic(topic)
    time.sleep(20)

Fetching for topic:  DeepSeek
Fetched page 1 of 1 for topic: DeepSeek
Fetching for topic:  Business Day
Fetched page 1 of 2 for topic: Business Day
Fetched page 2 of 2 for topic: Business Day
Fetching for topic:  Business
Fetched page 1 of 4 for topic: Business
Fetched page 2 of 4 for topic: Business
Fetched page 3 of 4 for topic: Business
Fetched page 4 of 4 for topic: Business


### Create a DataFrame from the articles list


In [6]:
df = pd.DataFrame(articles_list)
df.to_csv('nyt_articles_feb_2025_pagination.csv', index=False)
df.head(10)

| | Topic | Title | URL | Snippet | Published Date |
|---|---|---|---|---|---|
| 0 | DeepSeek | Alphabet Revenue Disappoints Investors on Weak... | https://www.nytimes.com/2025/02/04/technology/... | The internet giant reported cloud sales that n... | 2025-02-04T21:38:15+0000 |
| 1 | DeepSeek | Stop Worshiping the American Tech Giants | https://www.nytimes.com/2025/02/04/opinion/dee... | The arrival of DeepSeek shows us the competiti... | 2025-02-04T10:02:25+0000 |
| 2 | DeepSeek | Where Trump’s Trade Fight Could Go Next | https://www.nytimes.com/2025/02/04/business/de... | New tariffs on Chinese imports are on, even as... | 2025-02-04T12:40:52+0000 |
| 3 | DeepSeek | A Wealthy and Unhappy Nation | https://www.nytimes.com/2025/02/04/briefing/a-... | What a new study found about America. | 2025-02-04T11:49:58+0000 |
| 4 | Business Day | Small-Business Owners Say Tariffs Will Squeeze... | https://www.nytimes.com/2025/02/04/us/trump-ta... | Consumer electronics, electrical equipment, an... | 2025-02-04T20:13:49+0000 |
| 5 | Business Day | Want Eggs With Your Breakfast? Pay a Surcharge... | https://www.nytimes.com/2025/02/04/business/wa... | The restaurant chain, which serves breakfast a... | 2025-02-04T20:40:12+0000 |
| 6 | Business Day | As Trump’s Trade War Unfolds, American Compani... | https://www.nytimes.com/2025/02/04/business/tr... | American companies intent on making goods in t... | 2025-02-04T10:02:12+0000 |
| 7 | Business Day | The Lives Cut Short by the D.C. Plane Crash | https://www.nytimes.com/interactive/2025/02/03... | They were from all over — Kansas, Washington, ... | 2025-02-04T02:35:19+0000 |
| 8 | Business Day | S.E.C. Moves to Scale Back Its Crypto Enforcem... | https://www.nytimes.com/2025/02/04/business/se... | Some in a special unit of 50 lawyers and staff... | 2025-02-04T22:42:42+0000 |
| 9 | Business Day | Patriots Owner’s Son Challenges Michelle Wu fo... | https://www.nytimes.com/2025/02/04/us/boston-m... | Josh Kraft, a political newcomer who is runnin... | 2025-02-04T18:48:09+0000 |


## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

<font color=green><b>Note</b><br>
For this task, I combine the Title and Snippet columns into the "text" column.
</font>
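
Because the same article can match more than one topic query, the combined list may contain repeats, which would then take up several slots in the retrieved context. Whether that happens in this dataset is an assumption, not something checked here. A minimal sketch of dropping repeats by URL before building the text field, using plain dictionaries shaped like the rows above (`dedupe_by_url` is a hypothetical helper, and the sample rows are made up):

```python
def dedupe_by_url(articles):
    """Keep the first occurrence of each URL, preserving the original order."""
    seen = set()
    unique = []
    for article in articles:
        if article["URL"] not in seen:
            seen.add(article["URL"])
            unique.append(article)
    return unique


# Made-up rows for illustration only
rows = [
    {"Topic": "Business Day", "Title": "A", "URL": "https://example.com/a", "Snippet": "first"},
    {"Topic": "Business", "Title": "A", "URL": "https://example.com/a", "Snippet": "first"},
    {"Topic": "Business", "Title": "B", "URL": "https://example.com/b", "Snippet": "second"},
]
# Combine title and snippet into one text string per unique article
texts = [f"{row['Title']} {row['Snippet']}" for row in dedupe_by_url(rows)]
print(texts)
```

On the DataFrame itself, `df.drop_duplicates(subset="URL")` would have the same effect.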

In [7]:
news_df = pd.DataFrame()
news_df["text"] = df["Title"] + " " + df["Snippet"]
news_df

| | text |
|---|---|
| 0 | Alphabet Revenue Disappoints Investors on Weak... |
| 1 | Stop Worshiping the American Tech Giants The a... |
| 2 | Where Trump’s Trade Fight Could Go Next New ta... |
| 3 | A Wealthy and Unhappy Nation What a new study ... |
| 4 | Small-Business Owners Say Tariffs Will Squeeze... |
| 5 | Want Eggs With Your Breakfast? Pay a Surcharge... |
| 6 | As Trump’s Trade War Unfolds, American Compani... |
| 7 | The Lives Cut Short by the D.C. Plane Crash Th... |
| 8 | S.E.C. Moves to Scale Back Its Crypto Enforcem... |
| 9 | Patriots Owner’s Son Challenges Michelle Wu fo... |


In [8]:
len(news_df)

56

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [9]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR_API_KEY"

In [10]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(news_df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=news_df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
news_df["embeddings"] = embeddings
news_df

| | text | embeddings |
|---|---|---|
| 0 | Alphabet Revenue Disappoints Investors on Weak... | [-0.0029944465495646, -0.03382888808846474, 0.... |
| 1 | Stop Worshiping the American Tech Giants The a... | [-0.0019823501352220774, -0.01609061472117901,... |
| 2 | Where Trump’s Trade Fight Could Go Next New ta... | [0.0010795766720548272, -0.03745369613170624, ... |
| 3 | A Wealthy and Unhappy Nation What a new study ... | [-0.0059190234169363976, -0.002042063046246767... |
| 4 | Small-Business Owners Say Tariffs Will Squeeze... | [0.0037213843315839767, -0.02887372113764286, ... |
| 5 | Want Eggs With Your Breakfast? Pay a Surcharge... | [-0.013464485295116901, -0.017149778082966805,... |
| 6 | As Trump’s Trade War Unfolds, American Compani... | [-0.01746896654367447, -0.04411698877811432, 0... |
| 7 | The Lives Cut Short by the D.C. Plane Crash Th... | [-0.02510949596762657, -0.0043425895273685455,... |
| 8 | S.E.C. Moves to Scale Back Its Crypto Enforcem... | [0.01084992941468954, -0.013685019686818123, 0... |
| 9 | Patriots Owner’s Son Challenges Michelle Wu fo... | [-0.01628192514181137, -0.013969077728688717, ... |


In [11]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
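
For intuition about the `distances` column, cosine distance is one minus the cosine of the angle between two vectors, so vectors pointing the same way score 0 and unrelated (orthogonal) ones score 1. A pure-Python sketch of that definition follows; it is an illustration, not the library's implementation, and `cosine_distance` is a name chosen here:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings"; real ones have many more dimensions
question = [1.0, 0.0]
rows = {"same direction": [2.0, 0.0], "orthogonal": [0.0, 3.0], "opposite": [-1.0, 0.0]}
# Sort row labels from smallest distance (most relevant) to largest
ranked = sorted(rows, key=lambda label: cosine_distance(question, rows[label]))
print(ranked)
```

Because only the angle matters, a vector's length does not affect its distance, which is why `[2.0, 0.0]` is as close to the question as `[1.0, 0.0]` would be.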


In [12]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
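
The context-building loop is a greedy packing under a token budget: rows arrive in relevance order and are added until the next one would exceed the limit. The sketch below isolates that step with a stand-in token counter that splits on whitespace (an assumption for illustration; the notebook counts tokens with `tiktoken`). `pack_context` and `count_tokens` are hypothetical names.

```python
def count_tokens(text):
    """Stand-in token counter: one token per whitespace-separated word."""
    return len(text.split())


def pack_context(texts, budget, used=0):
    """Return the leading texts whose cumulative token count fits `budget`.

    `used` is the number of tokens already spent (template plus question).
    Packing stops at the first text that would overflow, mirroring the
    `break` in the loop above.
    """
    packed = []
    for text in texts:
        used += count_tokens(text)
        if used > budget:
            break
        packed.append(text)
    return packed


# Made-up texts, assumed to be already sorted from most to least relevant
ranked_texts = ["tariffs hit retailers", "chip exports slow", "egg prices rise again"]
print(pack_context(ranked_texts, budget=8, used=2))
```

Stopping at the first overflow, rather than skipping ahead to shorter rows, keeps the context strictly in relevance order at the cost of sometimes leaving part of the budget unused.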
    

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

In [13]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

### Question 1

### Initial Prompt

In [14]:
prompt_1 = """
Question: "What’s the latest political news about the US President?"
Answer:
"""
initial_answer_1 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_1,
    max_tokens=150
)["choices"][0]["text"].strip()


### Custom Prompt

In [15]:
custom_answer_1 = answer_question("What’s the latest political news about the US President?", news_df)


### Comparison between Original and Custom Prompt Answers

In [16]:
print(f"""
"What’s the latest political news about the US President?"

Original Answer: {initial_answer_1}
Custom Answer:   {custom_answer_1}

""")


"What’s the latest political news about the US President?"

Original Answer: As of October 2021, the latest political news about the US President is focused on his administration's handling of the COVID-19 pandemic, the economy and job market, climate change and environmental policies, and foreign relations with countries such as China, Russia, and Afghanistan. President Joe Biden's administration is also facing ongoing challenges with immigration and voting rights legislation, as well as efforts to pass a large infrastructure bill. Some recent headlines include the passing of the $1 trillion infrastructure bill, ongoing negotiations on a large social spending package, and ongoing debates on vaccine mandates and booster shots.
Custom Answer:   The latest news is about the ongoing trade war between the US and China and how it may potentially impact American companies and workers. There are also updates on potential tariffs on European goods and President Trump's promise to cap credit c

### Question 2

### Initial Prompt

In [17]:
prompt_2 = """
Question: "How is Police Department performing in terms of budget?"
Answer:
"""
initial_answer_2 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_2,
    max_tokens=150
)["choices"][0]["text"].strip()


### Custom Prompt

In [18]:
custom_answer_2 = answer_question("How is Police Department performing in terms of budget?", news_df)



### Comparison between Original and Custom Prompt Answers

In [19]:
print(f"""
"How is Police Department performing in terms of budget?"

Original Answer: {initial_answer_2}
Custom Answer:   {custom_answer_2}

""")


"How is Police Department performing in terms of budget?"

Original Answer: It is difficult to definitively answer this question without more specific information, as police departments' budgets can vary greatly depending on various factors such as location, size, and funding sources. However, some general trends and information about police department budgets can be provided.

Overall, police departments in the United States have seen an increase in funding over the past few decades. According to data from the Bureau of Justice Statistics, state and local government spending on police increased from $42.3 billion in 1992 to $115.5 billion in 2015. This represents a more than 170% increase in funding.

However, there has been a recent push for police departments to reduce budgets and reallocate funding to other social services and programs. This has
Custom Answer:   The Police Department is already halfway through its overtime budget for the fiscal year and is struggling to recruit ne

### Bonus Question 3

### Initial Prompt

In [20]:
prompt_3 = """
Question: "What is the impact of DeepSeek on USA's market?"
Answer:
"""
initial_answer_3 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_3,
    max_tokens=150
)["choices"][0]["text"].strip()


### Custom Prompt

In [21]:
custom_answer_3 = answer_question("What is the impact of DeepSeek on USA's market?", news_df)


### Comparison between Original and Custom Prompt Answers

In [22]:
print(f"""
"What is the impact of DeepSeek on USA's market?"

Original Answer: {initial_answer_3}
\n\n###\n\n
Custom Answer:   {custom_answer_3}

""")


"What is the impact of DeepSeek on USA's market?"

Original Answer: DeepSeek has had a significant impact on USA's market, particularly in the technology sector. This advanced AI technology has revolutionized the way businesses search and analyze data, resulting in improved efficiency and decision-making. This has led to an increase in productivity and profitability for companies using DeepSeek.

Furthermore, DeepSeek's ability to gather and process large amounts of data has made it a valuable tool for market research and analysis. This has helped businesses gain a better understanding of their target audience and market trends, allowing them to tailor their strategies and products accordingly.

The use of DeepSeek has also led to the creation of new jobs and industries within the USA's market, as more companies are adopting this technology and requiring skilled professionals to implement and manage it.

Additionally,


###


Custom Answer:   The arrival of DeepSeek shows us the competit

<p><font color=green><h3> Continuous Input Mode </h3> <br>
You can ask questions continuously and compare the answers with and without the custom prompt.
</font></p>

In [23]:
while True:
    # Take input from user
    user_question = input("Hi, I am Chatty, your Chat Assistant. How can I help you today? (Type 'exit' to end session): ")
    
    # Check if the user wants to stop
    if user_question.strip().lower() == 'exit':
        print("Have a good day!")
        break
    
    user_prompt = f"""
        Question: "{user_question}"
        Answer:
        """
    initial_answer = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=user_prompt,
        max_tokens=150
    )["choices"][0]["text"].strip()
    
    custom_answer = answer_question(user_question, news_df)
    
    print(f"""{user_question}

            Original Answer: {initial_answer}
            \n\n###\n\n
            Custom Answer:   {custom_answer}

            """)



Hi, I am Chatty, your Chat Assistant. How can I help you today? (Type 'exit' to end session): What is the impact of DeepSeek on USA's market?
What is the impact of DeepSeek on USA's market?

            Original Answer: There is no clear or definitive answer to this question as DeepSeek's impact on the US market would largely depend on the success and adoption of the technology. However, there are a few potential ways in which DeepSeek could impact the US market:

1. Improved Efficiency: DeepSeek's technology could lead to improved efficiency in various industries such as healthcare, manufacturing, and finance. This could lead to cost savings, increased productivity, and ultimately, a boost to the US economy.

2. Job Displacement: DeepSeek's advanced AI technology could potentially replace certain jobs that are currently performed by humans. This could lead to job displacement and unemployment in certain industries, particularly those that are heavily reliant on manual labor.

3. Disru