Welcome to the TrekBot Colab notebook! This notebook behaves exactly like a Jupyter notebook. Let's dive right in!

First step: download the JSON data file (all_scripts_raw.json) from the link below and place it in a folder called "data":
https://www.kaggle.com/datasets/gjbroughton/start-trek-scripts?resource=download

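Now we'll install a few necessary packages. The cells below use the pre-1.0 interface of the openai package (openai.Embedding and openai.Completion), so we pin that package to a version below 1.0.

In [None]:
!pip install numpy
!pip install pandas
!pip install scikit-learn
!pip install "openai<1"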

Next we import the packages we'll need.


**pandas** is a package that allows us to conveniently store and manipulate data in a data structure known as a DataFrame. (This is similar to a data frame in R, for those familiar with R.) It’s a very common tool for anyone doing data science in Python.

**sklearn** is the import name of the package formally called “scikit-learn”, which contains a wide range of statistical and machine learning methods. It’s another very common package for data scientists working in Python.

**numpy** is Python’s main numeric library, and allows us to work with arrays and matrices and to compute things like dot products.

**json** is a package for reading and writing JSON files. Our data is formatted as a single JSON file, so this is useful for us here.

**os** helps us with file management and other interactions with the operating system.

**openai** is a package containing functions that allow us to easily make API calls to OpenAI’s models in Python.

Finally, we import **cosine_similarity** from sklearn, since it’s a specialized function that we need today.

In [None]:
import pandas as pd
import numpy as np
import json
import openai
from sklearn.metrics.pairwise import cosine_similarity
import os

CHUNK_SIZE = 600
OVERLAP = 20

Remember the OpenAI API key you created? Copy and paste it in the cell below.

In [None]:
openai.api_key = input("Paste your OpenAI API key here and hit enter:")

Here's what the notebook is doing: we have a long piece of text that we want ChatGPT to be able to answer questions about. We first break that text up into chunks of 600 words, where each chunk shares its last 20 words with the start of the following chunk. (We split on whitespace, so these really are words; the models themselves count “tokens”, which are usually somewhat shorter than words.) We then send these chunks to OpenAI to obtain their embeddings. When we ask a question about our text, we compute the question’s embedding and use cosine similarity to find the chunk of text that is closest to our question. We then send a query to ChatGPT that includes our original question, along with that chunk of text as context.
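
To make the chunking step concrete, here is a minimal sketch on a toy input. The helper name chunk_words and the toy values (chunks of 4 words with an overlap of 1) are made up for illustration; the notebook itself uses CHUNK_SIZE = 600 and OVERLAP = 20 with the same slicing logic.

```python
def chunk_words(words, chunk_size, overlap):
    """Split a list of words into chunks of `chunk_size` words, where
    consecutive chunks share `overlap` words."""
    step = chunk_size - overlap  # how far the window advances each time
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]


# Toy input: ten one-letter "words", chunks of 4 with an overlap of 1.
toy_words = "a b c d e f g h i j".split()
toy_chunks = chunk_words(toy_words, chunk_size=4, overlap=1)

for c in toy_chunks:
    print(c)
```

Each chunk starts with the last word of the previous one. Note that the final chunk can be much shorter than the others, and may consist only of words already seen in the previous chunk.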

We loop over all the chunks, send each one to OpenAI, get back its embedding, and store the chunk, the raw response, and the embedding as one row of the DataFrame df. Note that we convert the embedding (which comes back as a list of floats) to a numpy array. We do this because we will be doing numerical operations on the embedding in just a moment.

In [None]:
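# Dataset: https://www.kaggle.com/datasets/gjbroughton/start-trek-scripts?resource=download
with open("data/all_scripts_raw.json", encoding="utf-8") as fh:
    scripts = json.load(fh)
text = scripts['TNG']['episode 99']

# Split on whitespace, then slide a CHUNK_SIZE-word window that advances by
# CHUNK_SIZE - OVERLAP words, so neighbouring chunks share OVERLAP words.
text_list = text.split()
chunks = [text_list[i:i+CHUNK_SIZE] for i in range(0, len(text_list), CHUNK_SIZE-OVERLAP)]

# Request one embedding per chunk and collect the rows before building the DataFrame.
rows = []
for chunk in chunks:
    f = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=" ".join(chunk),
    )
    rows.append({'chunk': chunk, 'gpt_raw': f, 'embedding': np.array(f['data'][0]['embedding'])})
df = pd.DataFrame(rows, columns=['chunk', 'gpt_raw', 'embedding'])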

In [None]:
df.head()

Now, let’s define our query and get its embedding. Our query is a simple question: who was the captain of the Excalibur? A bit of context: in this episode, one of the crew members is assigned to command the Excalibur for this one episode only, and it’s a minor detail in the plot. In fact, if you ask ChatGPT this question without giving it the script, it doesn’t know the answer. We’ll see that with the right chunk of text, identified by cosine similarity, ChatGPT can answer correctly.

We calculate the cosine similarity between our query and each chunk, and save the most similar chunk to a variable called context_chunk.
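
For readers who want to see what cosine similarity computes, here is a small sketch in plain Python. The function cosine_sim and the three-dimensional toy vectors are made up for illustration; real embeddings have many more dimensions, and the notebook uses sklearn's cosine_similarity rather than this helper.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings" standing in for the query and three chunks.
toy_query = [1.0, 0.0, 1.0]
toy_chunk_embeddings = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

scores = [cosine_sim(toy_query, e) for e in toy_chunk_embeddings]
best_index = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_index)
```

Because cosine similarity depends only on direction, the second toy vector scores highest even though it is twice as long as the query. Picking the index of the highest score is the same step the notebook performs with np.argmax.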

Finally, we assemble the full query, including the chunk we identified, and send it to ChatGPT via the API:

In [None]:
query = "Who was the captain of the Excalibur?"
f = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=query
)
query_embedding = np.array(f['data'][0]['embedding'])

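# Compare the query embedding with every chunk embedding in a single call;
# the result has shape (1, number_of_chunks), so we take its first row.
chunk_embeddings = np.vstack(df['embedding'].values)
similarity = cosine_similarity(query_embedding.reshape(1, -1), chunk_embeddings)[0]
context_chunk = chunks[np.argmax(similarity)]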

query_to_send = "CONTEXT: " + " ".join(context_chunk) + "\n\n" + query
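# text-davinci-003 has been retired by OpenAI; gpt-3.5-turbo-instruct is the
# suggested replacement for the completions endpoint.
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=query_to_send,
    max_tokens=100,
    temperature=0
)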

In [None]:
print(query_to_send)
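
The prompt printed above has a simple shape: the label "CONTEXT: ", the words of the chosen chunk joined by spaces, a blank line, and then the question. As a rough sketch, the same assembly can be wrapped in a small function. The name build_prompt and the toy words and question are hypothetical and are not taken from the script.

```python
def build_prompt(context_words, question):
    """Join the context words and append the question after a blank line."""
    return "CONTEXT: " + " ".join(context_words) + "\n\n" + question


# Made-up toy context and question, only to show the resulting layout.
toy_prompt = build_prompt(["the", "ship", "left", "orbit"], "Where did the ship go?")
print(toy_prompt)
```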

Let's test our bot. Did it get it right? Execute the cell below to find out!

In [None]:
print(response['choices'][0]['text'].strip())