


# Custom Chatbot Project

The chosen dataset consists of concise news items about advances in generative AI: breakthroughs in technologies such as GPT models, their applications in education and business communication, and industry trends such as competition among AI companies. Each record is a self-contained text field made up of the year, a brief description of the news, and the source. The dataset suits this task because it is focused on generative AI, covers a range of topics with cited sources, and is structured consistently enough to embed directly with a model such as `text-embedding-ada-002`. Because the items are factual and drawn from real-world news, they also make it straightforward to compare answers retrieved through embeddings with answers a completion model gives from general knowledge alone.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.
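As a minimal sketch of the shape this step aims for, the snippet below joins year, content, and source into a single `"text"` column and checks the 20-row minimum. The two rows and the `build_text_column` helper are hypothetical and only illustrate the format; they are not taken from the real CSV.

```python
import pandas as pd

def build_text_column(raw: pd.DataFrame) -> pd.DataFrame:
    """Join year, content, and source into a single "text" column."""
    text = raw["year"].astype(str) + " - " + raw["content"] + " " + raw["source"]
    return pd.DataFrame({"text": text})

# Hypothetical rows, for illustration only (not from the real dataset).
sample = pd.DataFrame({
    "year": [2023, 2024],
    "content": ["Example headline A.", "Example headline B."],
    "source": ["(Source A)", "(Source B)"],
})
sample_df = build_text_column(sample)

# The real dataset should pass this check; the two-row sample does not.
has_enough_rows = len(sample_df) >= 20
```

Running the same check on the full dataframe before generating embeddings is a cheap way to catch a failed or partial load.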

In [6]:
import pandas as pd
df = pd.read_csv('/genAI_advances.csv')
df['text'] = df['year'].astype(str) + ' - ' + df['content'] + ' ' + df['source']
df = df[['text']]

# Display the DataFrame
df.describe()

Unnamed: 0,text
count,54
unique,54
top,2023 - Generative Pre-trained Transformers (GP...
freq,1


In [5]:
!pip install openai==0.28
!pip install tiktoken

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.59.9
    Uninstalling openai-1.59.9:
      Successfully uninstalled openai-1.59.9
Successfully installed openai-0.28.0


Collecting tiktoken
  Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.8.0


In [1]:
import numpy as np
import openai
import os
from openai.embeddings_utils import distances_from_embeddings

In [3]:
openai.api_key = 'voc-.xxx'
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
MAX_TOKENS = 1000
openai.api_base = "https://openai.vocareum.com/v1"


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.
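One way to compose such a query is to pack the most relevant rows into the prompt until a token budget such as `MAX_TOKENS` is reached. The sketch below assumes the rows have already been sorted by relevance. The `build_prompt` helper, the prompt wording, and the default whitespace-based token count are all illustrative assumptions; a `tiktoken`-based counter would give more accurate counts.

```python
def build_prompt(question, ranked_texts, max_tokens,
                 count_tokens=lambda s: len(s.split())):
    """Pack the most relevant rows into a prompt without exceeding max_tokens.

    count_tokens defaults to a whitespace word count, which is only a rough
    stand-in (an assumption); pass a real tokenizer-based counter if needed.
    """
    template = (
        "Answer the question based on the context below.\n\n"
        "Context:\n{}\n\n---\n\nQuestion: {}\nAnswer:"
    )
    used = count_tokens(template) + count_tokens(question)
    context = []
    for text in ranked_texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # stop at the first row that would exceed the budget
        context.append(text)
        used += cost
    return template.format("\n\n###\n\n".join(context), question)
```

The returned string can be passed as the `prompt` argument of a completion call. Stopping at the first row that does not fit keeps the context in relevance order.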

In [7]:
def query_openai(query):
    # Pick one row at random as context. Note that this does not rank rows
    # by relevance to the query, so the context may be unrelated to it.
    context = df['text'].sample(1).iloc[0]

    # Format the query with context
    prompt = f"Context:\n{context}\n\nQuery:\n{query}\n\nAnswer:"

    # Call the OpenAI Completion API
    response = openai.Completion.create(
        engine="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=50,  # Adjust as needed
        n=1,
        stop=None,
        temperature=0.7,  # Adjust as needed
    )

    # Extract and return the answer
    answer = response.choices[0].text.strip()
    return answer

In [8]:
query = "What is LLaMA?"
answer = query_openai(query)
print(answer)

LLaMA (Language Learning with Machine Assistance) is a research project by OpenAI that aims to improve language learning by combining the power of generative pre-trained transformers (GPT) with human teaching. It uses AI to generate personalized practice exercises


In [9]:
embeddings = []
for index, row in df.iterrows():
  response = openai.Embedding.create(
      input=row["text"],
      engine=EMBEDDING_MODEL_NAME
  )
  embeddings.extend([data["embedding"] for data in response["data"]])
df["embeddings"] = embeddings

In [10]:
df[["text", "embeddings"]].to_csv("genAI_news_embeddings.csv")


In [11]:
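import ast

df = pd.read_csv('genAI_news_embeddings.csv', index_col=0)
# Embeddings are stored as stringified lists; literal_eval parses them
# without executing arbitrary code, unlike eval.
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
df.head()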


Unnamed: 0,text,embeddings
0,2023 - Generative Pre-trained Transformers (GP...,"[-0.015429225750267506, -0.030154043808579445,..."
1,2023 - Diffusion models emerge as a transforma...,"[-0.015209694392979145, -0.003470304422080517,..."
2,2024 - OpenAI's DALL·E 3 brings unprecedented ...,"[-0.019175244495272636, -0.020816493779420853,..."
3,"2023 - Meta introduces LLaMA, a large language...","[-0.014117571525275707, 0.01502192486077547, 0..."
4,"2024 - Google DeepMind unveils Gemini, a langu...","[-0.014261371456086636, 0.005189887247979641, ..."


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
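The retrieval half of this comparison rests on ranking rows by cosine similarity between the question embedding and each row embedding. The sketch below shows that ranking with plain NumPy on toy 2-D vectors, which stand in for the 1536-dimensional vectors that `text-embedding-ada-002` returns. The `rank_by_cosine_similarity` helper is a hypothetical name used only for illustration.

```python
import numpy as np

def rank_by_cosine_similarity(query_vec, row_vecs):
    """Return (row indices sorted most-similar first, similarity per row)."""
    q = np.asarray(query_vec, dtype=float)
    rows = np.asarray(row_vecs, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms.
    sims = rows @ q / (np.linalg.norm(rows, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims), sims

# Toy vectors: row 1 points the same way as the query, row 0 is orthogonal.
order, sims = rank_by_cosine_similarity(
    [1.0, 0.0],
    [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]],
)
```

Here `order[0]` is the index of the closest row, which plays the same role as taking `np.argmax` over the similarity scores.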

### Questions 1, 2, and 3

In [23]:
from sklearn.metrics.pairwise import cosine_similarity

questions = [
    "What advancements have GPT models made in natural language processing?",
    "What DeepSeek did?",
    "What is LLaMA?"
]

# Generate embeddings for questions
question_embeddings = []
for question in questions:
    response = openai.Embedding.create(
        input=question,
        engine=EMBEDDING_MODEL_NAME
    )
    question_embeddings.append(response["data"][0]["embedding"])

# Custom query: Find the closest match using cosine similarity
for i, question_embedding in enumerate(question_embeddings):
    similarities = cosine_similarity(
        [question_embedding], df["embeddings"].tolist()
    )[0]
    closest_index = np.argmax(similarities)
    custom_answer = df.iloc[closest_index]["text"]

    # Completion model query
    completion_response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=f"Answer the following question based on general knowledge: {questions[i]}",
        max_tokens=100
    )
    print(completion_response)

    completion_answer = completion_response["choices"][0]["text"].strip()

    # Print results
    print(f"Question {i + 1}: {questions[i]}")
    print(f"Custom Query Answer: {custom_answer}")
    print(f"Completion Model Answer: {completion_answer}")
    print()

{
  "id": "cmpl-AuqiItkJ0ofgm6JbZRH4ghsUrnjWD",
  "object": "text_completion",
  "created": 1738112362,
  "model": "gpt-3.5-turbo-instruct",
  "choices": [
    {
      "text": "\n\nGPT (Generative Pre-trained Transformer) models have made significant advancements in natural language processing (NLP) by improving language understanding and generation tasks. These models use deep learning algorithms and large training datasets to achieve state-of-the-art performance in various NLP tasks. Some of the advancements made by GPT models in NLP include:\n\n1. Improving language generation: GPT models can generate human-like text by predicting the next word or sentence based on the context. This has been applied in",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "completion_tokens": 100,
    "total_tokens": 121
  }
}
Question 1: What advancements have GPT models made in natural language processing?
Custom Query Answ