# Custom Chatbot Project

The goal of this project is to build a personal chatbot that helps find Machine Learning research papers. The dataset I have chosen is the arXiv dataset, which contains research papers from various fields. arXiv is well suited to this task, as it provides the public and research communities with open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics.
As the dataset is too large to use in full, it is filtered according to the following criteria:
* Only papers related to Machine Learning are included.
* Only papers from 2024 whose `update_date` falls within the last 5 days are included. (The last 5 years contain more than 422,538 ML-related papers, and the cost limit for this project is 2-3 USD.)
* Only the title, authors, categories, and the first 250 characters of each abstract are used, as this is enough for a recommendation.

The GPT-3.5 model's training data has a cutoff of September 2021. The chosen dataset is suitable because it contains recent papers (updated between 2024-12-10 and 2024-12-12) that the model cannot have seen during training. This makes it possible to verify whether the model generates relevant responses based on the new data.


## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [None]:
import pandas as pd
import json

# Mapping from arXiv category codes (e.g. 'cs.LG') to readable names
with open('arxiv_cats.json') as f:
    arxiv_cats = json.load(f)
df = pd.read_json('last_5_yr_data.jsonl', lines=True)

# Keep only papers tagged with at least one ML-related category
ml_cats = ['cs.CL', 'cs.LG', 'stat.ML']
df = df[df['categories'].apply(lambda x: any(cat in x for cat in ml_cats))]

# Keep only 2024 papers (arXiv ids start with 'YY') updated after 2024-12-10
df['update_date'] = pd.to_datetime(df['update_date'])
filter_date = pd.to_datetime('2024-12-10')
df = df[df['update_date'] > filter_date]
df = df[df.id.str.startswith('24')]

# Replace category codes with readable names
df['categories'] = df.categories.apply(lambda x: x.split()).apply(lambda x: [arxiv_cats[cat] for cat in x])
df = df.loc[:, ['id', 'title', 'authors' ,'categories', 'abstract']]
# Build one text field per paper; the abstract is truncated to 250 characters
df['text'] = df.apply(lambda x: f'Title: {x["title"]}\nAuthors: {x["authors"]}\nCategories:{",".join(x["categories"])}\nAbstract:\n {x["abstract"][:250]}', axis=1)
df.drop(columns=['title', 'authors', 'categories', 'abstract'], inplace=True)
df.to_csv('2024-12-10_ml_papers.csv', index=False)

df.head()

Unnamed: 0,id,text
1007102,2401.01987,Title: Representation Learning of Multivariate...
1008464,2401.03349,Title: Image Inpainting via Tractable Steering...
1014389,2401.09274,Title: Avoiding strict saddle points of noncon...
1014697,2401.09582,Title: eipy: An Open-Source Python Package for...
1016754,2401.11641,Title: Revolutionizing Finance with LLMs: An O...


In [47]:
df.shape

(983, 2)

In [None]:
# getting openai embeddings for the text
import openai
apikey = "YOUR_API_KEY"
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = apikey

EMBEDDING_MODEL_NAME = "text-embedding-3-small"
batch_size = 110

In [49]:
embeddings = []
for i in range(0, len(df), batch_size):
    try:
        response = openai.Embedding.create(
            input=df.iloc[i:i+batch_size]["text"].tolist(),
            engine=EMBEDDING_MODEL_NAME
        )
        embeddings.extend([data["embedding"] for data in response["data"]])
    except Exception as e:
        # Stop here: a skipped batch would leave fewer embeddings than rows
        print(f"Error in batch starting at row {i}: {e}")
        raise

# Add embeddings to DataFrame (one embedding per row)
df["embedding"] = embeddings
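
The batching loop above only works if every row ends up with exactly one embedding. A minimal sketch of that invariant follows, assuming a hypothetical `embed_fn` callable that stands in for the real API call (here replaced by `fake_embed`, which is not part of any library), so the logic can be checked without network access:

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 110,
) -> List[List[float]]:
    """Embed texts batch by batch and check that every text got a vector."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_fn(batch)
        # Fail loudly instead of silently producing a misaligned column
        if len(vectors) != len(batch):
            raise ValueError(
                f"batch at {start}: expected {len(batch)} vectors, got {len(vectors)}"
            )
        embeddings.extend(vectors)
    return embeddings


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for the embedding API: one 1-d vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

With the real API, `embed_fn` would wrap the `openai.Embedding.create` call used in the cell above.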

In [50]:
df.reset_index(drop=True, inplace=True)
df.loc[0, 'text']

'Title: Representation Learning of Multivariate Time Series using Attention and\n  Adversarial Training\nAuthors: Leon Scharw\\"achter and Sebastian Otte\nCategories:Machine Learning\nAbstract:\n   A critical factor in trustworthy machine learning is to develop robust\nrepresentations of the training data. Only under this guarantee methods are\nlegitimate to artificially generate data, for example, to counteract imbalanced\ndatasets or provide c'

In [51]:
df.to_csv('2024-12-10_ml_papers.csv', index=False)

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [53]:
import ast
import pandas as pd
import numpy as np
import openai
import tiktoken
from openai.embeddings_utils import distances_from_embeddings

# Load the dataframe
df = pd.read_csv('2024-12-10_ml_papers.csv')
# Embeddings are stored as strings in the CSV; parse them back into arrays
df.embedding = df.embedding.apply(ast.literal_eval).apply(np.array)
df.head()

Unnamed: 0,id,text,embedding
0,2401.01987,Title: Representation Learning of Multivariate...,"[0.018088510259985924, -0.0009138910681940615,..."
1,2401.03349,Title: Image Inpainting via Tractable Steering...,"[0.005470732226967812, 0.01446774136275053, -0..."
2,2401.09274,Title: Avoiding strict saddle points of noncon...,"[0.0009837907273322344, 0.008160199038684368, ..."
3,2401.09582,Title: eipy: An Open-Source Python Package for...,"[0.013271886855363846, -0.01609743945300579, 0..."
4,2401.11641,Title: Revolutionizing Finance with LLMs: An O...,"[0.004357148893177509, 0.018598299473524094, 0..."
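
The embedding column has to be parsed on load because a list written to CSV comes back as a plain string. A minimal sketch of that round trip on toy data, using an in-memory buffer instead of a file and `ast.literal_eval` as a safer alternative to `eval` (the toy ids and vectors are made up for illustration):

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame with list-valued embeddings, written to an in-memory CSV
toy = pd.DataFrame({"id": ["a", "b"], "embedding": [[0.1, 0.2], [0.3, 0.4]]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# After loading, each cell is a string such as "[0.1, 0.2]"
was_string = isinstance(loaded.loc[0, "embedding"], str)

# Parse each string into a list, then convert it to a NumPy array
loaded["embedding"] = loaded["embedding"].apply(
    lambda cell: np.array(ast.literal_eval(cell))
)
print(was_string, type(loaded.loc[0, "embedding"]).__name__)
```

`ast.literal_eval` only accepts Python literals, so a malformed or malicious cell raises an error instead of executing code.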


In [54]:

def get_prompt_embedding(prompt):
    em_model = "text-embedding-3-small"
    response = openai.Embedding.create(engine=em_model, input=prompt)
    return np.array(response["data"][0]["embedding"])

def calculate_prompt_similarity(prompt, embedding_list):
    prompt_embedding = get_prompt_embedding(prompt)
    distances = distances_from_embeddings(prompt_embedding, embedding_list, distance_metric='cosine')
    return distances

def get_df_with_distances(prompt, context_df):
    distances = calculate_prompt_similarity(prompt, context_df.embedding.tolist())
    df_with_distances = context_df.copy()
    df_with_distances['distance'] = distances
    df_with_distances.sort_values(by='distance', inplace=True, ascending=True)
    return df_with_distances

def get_custom_prompt_text(user_prompt, context):
    custom_prompt = f"""
    Answer the question based on the context below, and if the
    question can't be answered, say "I don't know"
    Context: 
    {context}

    ---
    Question: {user_prompt}
    Answer:
    """
    return custom_prompt
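
For reference, the cosine distance used for ranking is 1 minus the cosine similarity, so smaller values mean closer texts. A minimal NumPy sketch on toy 2-d vectors follows; `cosine_distances` is a hypothetical helper written for illustration, not the `openai.embeddings_utils` implementation:

```python
import numpy as np


def cosine_distances(query, embeddings):
    """Return 1 - cosine similarity between query and each row of embeddings."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return 1.0 - (matrix @ query) / norms


query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # same, orthogonal, opposite

distances = cosine_distances(query, docs)
ranking = np.argsort(distances)  # indices of docs, closest first
print(distances)
print(ranking)
```

The identical vector gets distance 0, the orthogonal one 1, and the opposite one 2, which is why sorting in ascending order puts the most relevant texts first.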

In [55]:
def populate_prompt_with_context(prompt, context_df, max_prompt_tokens=3700):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    # Token count of the prompt template without any context
    curr_prompt_length = len(tokenizer.encode(get_custom_prompt_text(prompt, "")))
    df_with_dist = get_df_with_distances(prompt, context_df)
    contexts = []
    # Add the closest texts first until the token budget is used up
    for text in df_with_dist.text.values:
        text_length = len(tokenizer.encode(text))
        if text_length + curr_prompt_length < max_prompt_tokens:
            contexts.append(text)
            curr_prompt_length += text_length
        else:
            break

    context = "\n\n########\n\n".join(contexts)
    return get_custom_prompt_text(prompt, context)
    
def get_answer(input_prompt):
    openai_response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=input_prompt,
        max_tokens=150
    )
    return openai_response
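
The context selection above is a greedy fill under a token budget. A minimal sketch of that step in isolation, assuming a hypothetical `count_words` function as a stand-in for the `tiktoken` tokenizer so it runs without downloading an encoding:

```python
from typing import Callable, List, Sequence


def pack_contexts(
    texts: Sequence[str],
    count_tokens: Callable[[str], int],
    budget: int,
    used: int = 0,
) -> List[str]:
    """Take texts in ranked order until the next one would reach the budget."""
    selected: List[str] = []
    for text in texts:
        cost = count_tokens(text)
        if used + cost >= budget:
            break  # stop at the first text that does not fit
        selected.append(text)
        used += cost
    return selected


def count_words(text: str) -> int:
    """Hypothetical token counter: whitespace-separated words."""
    return len(text.split())


ranked = ["one two three", "four five", "six seven eight nine", "ten"]
picked = pack_contexts(ranked, count_words, budget=8, used=1)
print(picked)  # ['one two three', 'four five']
```

Note that the loop stops at the first text that does not fit, so the short last entry is not used even though it would fit. This keeps the context strictly in relevance order at the cost of leaving some of the budget unused.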

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [56]:
# Before adding context
prompt = "Suggest a recent paper on using LLMs in Finance."
raw_response = get_answer(prompt)
print(raw_response['choices'][0]['text'])



"Deep Learning for Finance: Applications and Opportunities" by Arize Nwokoro and Huan Zhao (2020) discusses the potential of applying Large Language Models (LLMs) in various financial tasks such as sentiment analysis, market prediction, fraud detection, and portfolio management. The paper provides a comprehensive overview of LLMs, their capabilities, and challenges in the financial domain. It also explores various use cases and provides a roadmap for leveraging LLMs in finance. The authors highlight the potential benefits of using LLMs, including improved accuracy, cost reduction, and automation of tedious tasks. Additionally, the paper discusses the ethical implications of using LLMs in finance and suggests measures to address potential biases. Overall, this paper provides valuable


In [57]:
# After adding context
prompt = "Suggest a recent paper on using LLMs in Finance."
custom_prompt = populate_prompt_with_context(prompt, df)
response = get_answer(custom_prompt)
print(response['choices'][0]['text'])


    Revolutionizing Finance with LLMs: An Overview of Applications and Insights 
    by Huaqin Zhao, Zhengliang Liu, Zihao Wu, Yiwei Li, Tianze Yang, Peng Shu, Shaochen Xu, Haixing Dai, Lin Zhao, Hanqi Jiang, Yi Pan, Junhao Chen, Yifan Zhou, Gengchen Mai, Ninghao Liu, and Tianming Liu.


### Question 2

In [58]:
prompt = "Are there any recent papers on tokenizers?"
raw_response = get_answer(prompt)
print(raw_response['choices'][0]['text'])



Yes, there are many recent papers on tokenizers. Here are a few examples:

1. "Tokenization Strategies for Neural Machine Translation" (2020) by Jacob Devlin et al. This paper explores different tokenization strategies for neural machine translation and evaluates their effects on translation performance.

2. "Robust Tokenization via Pre-trained Word Piece Embeddings" (2019) by Mingda Chen et al. This paper introduces a novel tokenization approach that uses pre-trained word piece embeddings to improve the robustness of tokenization in natural language processing tasks.

3. "Unicode-aware Tokenization for Natural Language Processing" (2020) by Diederik P. Kingma et al. This paper proposes a novel Unicode-aware tokenization


In [59]:
prompt = "Are there any recent papers on tokenizers?"
custom_prompt = populate_prompt_with_context(prompt, df)
response = get_answer(custom_prompt)
print(response['choices'][0]['text'])


    Title: One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for
  Retrieval-Augmented Large Language Models
Authors: Yutao Zhu, Zhaoheng Huang, Zhicheng Dou, Ji-Rong Wen
Categories:Computation and Language
Abstract:
   Retrieval-augmented generation (RAG) is a promising way to improve large
language models (LLMs) for generating more factual, accurate, and up-to-date
content. Existing methods either optimize prompts to guide LLMs in leveraging
retrieved informatio
