<a href="https://colab.research.google.com/github/AkankshCaimi/METACRAFTERS-AI-Challenge/blob/main/StarTrek_QnA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Welcome to the StarTrek_QnA Colab notebook! Let's dive right in!

First, we'll install a few necessary packages.

In [38]:
!pip install numpy
!pip install pandas
!pip install scikit-learn
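# The openai.Embedding and openai.Completion calls used below were removed in openai 1.0, so pin an earlier release.
!pip install "openai<1"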

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## This code installs the following Python packages:

**numpy**: a package for scientific computing with Python. It provides functions for working with arrays, linear algebra, and other mathematical operations.

**pandas**: a package for data manipulation and analysis. It provides data structures for efficiently storing and querying large datasets, and functions for data cleaning, transformation, and aggregation.

**scikit-learn**: a package for machine learning with Python. It provides tools for supervised and unsupervised learning, including classification, regression, clustering, and dimensionality reduction.

**openai**: a package for accessing the OpenAI API. OpenAI is an artificial intelligence research laboratory that provides an API for accessing advanced language models.

We also import two modules from the Python standard library, which do not need to be installed:

**json**: a module for reading and writing JSON. Our data is formatted as a single JSON file, so this is useful for us here.

**os**: a module that helps us with file management and other operating-system tasks.

**pip** is Python's package installer, normally run from the command line or terminal. The **!** symbol at the beginning of each line tells the notebook to run that line as a shell command, rather than as Python code.

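Finally, we import these packages, including **cosine_similarity** from **sklearn** (the import name of scikit-learn), and set the chunk size and overlap used later.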

In [39]:
import pandas as pd
import numpy as np
import json
import openai
from sklearn.metrics.pairwise import cosine_similarity
import os

CHUNK_SIZE = 1000
OVERLAP = 5

Run the cell below and paste your OpenAI API key at the prompt.

In [40]:
openai.api_key = input("Paste your OpenAI API key here and hit enter:")

Paste your OpenAI API key here and hit enter:sk-[REDACTED]
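
Typing the key into `input` echoes it into the cell output, where it is saved along with the notebook. Below is a minimal sketch of one alternative, assuming the key may already be stored in an environment variable. The variable name `OPENAI_API_KEY` and the helper `load_api_key` are illustrative choices, not part of the original notebook. `getpass` reads the key without displaying it.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or prompt for it without echoing.

    `env_var` is an assumed variable name; change it to match your setup.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed text, so the key never appears in the cell output
        key = getpass("Paste your OpenAI API key here and hit enter: ")
    return key
```

In the notebook this would be used as `openai.api_key = load_api_key()`.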


Here's what the notebook is doing: we have a long piece of text that we want a language model to be able to answer questions about. We first break that text up into chunks of 1000 words (whitespace-separated words, not the model's "tokens", which are usually shorter than words), where each chunk overlaps the following chunk by 5 words. We then send these chunks to OpenAI to obtain their embeddings. When we ask a question about our text, we compute the question's embedding and use cosine similarity to find the chunk of text that is closest to our question. We then send the model a prompt that includes our original question, as well as that chunk of text as context.
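
To make the chunking step concrete, here is a small sketch that applies the same slicing logic to a ten-word toy text. The helper name `chunk_words` and the toy sizes are assumptions made for illustration; the notebook itself uses `CHUNK_SIZE = 1000` and `OVERLAP = 5`.

```python
def chunk_words(words, chunk_size, overlap):
    """Split a list of words into chunks of `chunk_size` words,
    where each chunk starts `chunk_size - overlap` words after the previous one."""
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]


words = "a b c d e f g h i j".split()
toy_chunks = chunk_words(words, chunk_size=4, overlap=1)
for c in toy_chunks:
    print(c)
# ['a', 'b', 'c', 'd']
# ['d', 'e', 'f', 'g']
# ['g', 'h', 'i', 'j']
# ['j']
```

Note that the final chunk can be very short and may contain only words that already appear in the previous chunk.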

We loop over all the chunks, send each one to OpenAI, get back its embedding, and then append a new row to the DataFrame **df**.

Note that we convert the embedding in the response (a list of floats) to a numpy array. We do this because we will be doing numerical operations on the embedding.
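
As a sketch of that conversion, the snippet below uses a hand-written dictionary in place of a real API response. The three values are made up for illustration, and only the `['data'][0]['embedding']` path is taken from the notebook.

```python
import numpy as np

# Hypothetical stand-in for an embedding response; a real one holds many more values.
fake_response = {"data": [{"embedding": [0.1, -0.2, 0.3]}]}

# Same indexing as in the notebook: first item of 'data', then its 'embedding' list.
vec = np.array(fake_response["data"][0]["embedding"])

print(type(vec).__name__, vec.shape, vec.dtype)  # ndarray (3,) float64
print(np.linalg.norm(vec))                       # vector length, needed for cosine similarity
```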

In [41]:
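url = 'https://raw.githubusercontent.com/GJBroughton/Star_Trek_Scripts/master/data/all_scripts_raw.json'
scripts = pd.read_json(url)
# Alternative: load a local copy of the Kaggle dataset instead of the URL above.
# scripts = json.load(open("data/all_scripts_raw.json", encoding='ascii')) # https://www.kaggle.com/datasets/gjbroughton/start-trek-scripts?resource=download
text = scripts['TNG']['episode 99']  # script of a single episode

# Split on whitespace, then build chunks of CHUNK_SIZE words that overlap by OVERLAP words.
text_list = text.split()
chunks = [text_list[i:i+CHUNK_SIZE] for i in range(0, len(text_list), CHUNK_SIZE-OVERLAP)]

df = pd.DataFrame(columns=['chunk', 'gpt_raw', 'embedding'])
for chunk in chunks:
    # Request the embedding of the chunk, joined back into a single string.
    f = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=" ".join(chunk),
    )
    # Store the word list, the raw API response, and the embedding as a numpy array.
    df.loc[len(df.index)] = (chunk, f, np.array(f['data'][0]['embedding']))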

In [42]:
df.head()

| | chunk | gpt_raw | embedding |
|---|---|---|---|
| 0 | [The, Next, Generation, Transcripts, -, Redemp... | {'object': 'list', 'data': [{'object': 'embedd... | [0.012480325065553188, -0.02131645195186138, 0... |
| 1 | [mission., I, believe, my, twenty, six, years,... | {'object': 'list', 'data': [{'object': 'embedd... | [0.014245089143514633, -0.023996854200959206, ... |
| 2 | [Those, anomalies, could, be, cloaked, ships.,... | {'object': 'list', 'data': [{'object': 'embedd... | [0.02086503803730011, -0.023771541193127632, 0... |
| 3 | [Klingon, territory,, how, would, you, stop, u... | {'object': 'list', 'data': [{'object': 'embedd... | [0.02652817592024803, -0.01923871785402298, 0.... |
| 4 | [what, we've, been, waiting, for., SELA:, Yes,... | {'object': 'list', 'data': [{'object': 'embedd... | [0.003148784628137946, -0.025920629501342773, ... |


Our query is a simple question: "What is Excalibur?"

A bit of context: in this episode, one of the crew members is assigned to command a ship, the Excalibur, for this one episode only, and it's a minor detail in the plot. In fact, if you ask ChatGPT who the captain of the Excalibur was without giving it the script, it doesn't know the answer. We'll see that with the right chunk of text, identified by cosine similarity, the model can name the captain correctly.

We calculate the cosine similarity between our query's embedding and each chunk's embedding, and save the most similar chunk to a variable called context_chunk.
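
The selection step can be sketched with toy two-dimensional vectors. This version uses plain numpy rather than `sklearn`'s `cosine_similarity`, and the helper name `cosine_sim` and the toy vectors are assumptions made for illustration; real embeddings have far more dimensions.

```python
import numpy as np


def cosine_sim(query_vec, matrix):
    """Cosine similarity between one vector and every row of a matrix."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    row_units = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return row_units @ query_unit


# Three toy "chunk embeddings" and one toy "query embedding".
toy_embeddings = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
toy_query = np.array([2.0, 2.0])

scores = cosine_sim(toy_query, toy_embeddings)
best = int(np.argmax(scores))  # index of the most similar row
print(scores.round(4), best)   # [0.7071 0.7071 1.    ] 2
```

The third row points in the same direction as the query, so its similarity is 1 even though the two vectors have different lengths; cosine similarity compares direction only.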

Finally, we assemble the full prompt, including the chunk we identified, and send it to OpenAI's text-davinci-003 completion model via the API:

In [43]:
query = "What is Excalibur?"
# Embed the query with the same model used for the chunks.
f = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=query
)
query_embedding = np.array(f['data'][0]['embedding'])

# Cosine similarity between the query and each chunk embedding.
similarity = []
for arr in df['embedding'].values:
    similarity.extend(cosine_similarity(query_embedding.reshape(1, -1), arr.reshape(1, -1)))
# Keep the chunk with the highest similarity.
context_chunk = chunks[np.argmax(similarity)]

# Prepend the chosen chunk as context, then ask the question.
query_to_send = "CONTEXT: " + " ".join(context_chunk) + "\n\n" + query
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=query_to_send,
    max_tokens=100,  # caps the length of the answer
    temperature=0
)

In [44]:
print(query_to_send)

CONTEXT: Those anomalies could be cloaked ships. O'BRIEN: Could be, sir. PICARD: Open a channel to the Excalibur. RIKER [OC]: Excalibur, Riker here. PICARD: Deploy the fleet, Will. It's time to spread our net. RIKER [OC]: Yes, sir. [Sutherland Bridge] HOBSON: We've arrived at the designated coordinates, Captain. DATA: All stop. Notify the flagship that we have assumed station. HOBSON: Starboard power coupling has overloaded. We've got a radiation leak on decks ten through twelve. DATA: Why are the backups not functioning? HOBSON: There wasn't enough time to test all the backups before we left the yard. Terry, I want you down in Engineering working on a new coupling. DATA: You have taken the phaser and torpedo control units offline. HOBSON: Keith, you and I will start bringing the radiation DATA: Mister Hobson, it is inappropriate for you to determine a course of action without consulting the commanding officer. HOBSON: I was trying to safeguard the lives of people on those decks, but y

In [45]:
print(response['choices'][0]['text'].strip())

Excalibur is a Federation starship, the flagship of the Starfleet Seventh Fleet. It is commanded by Captain William Riker and is one of the most advanced vessels in the fleet. The Excalibur is equipped with the latest in Starfleet technology, including a powerful tachyon detection grid, which can be used to detect cloaked ships. The Excalibur is also equipped with a powerful arsenal of weapons, including phasers, photon torpedoes, and quantum torpedoes. The Excalibur
