# Customized Elder Scrolls ChatBot using OpenAI Text Embedding

In this project, I use the text of the books from the Elder Scrolls series to get accurate, context-based responses from OpenAI's API.

The dataset is borrowed from https://github.com/jd7h/sentiment-lexicon-skyrim/


### Installation of necessary libraries and modules

This step installs all the necessary libraries if they are not already present.

In [None]:
!pip install numpy
!pip install pandas
!pip install scikit-learn
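# Pin the pre-1.0 SDK: openai.Embedding and openai.Completion, used below, were removed in openai 1.0
!pip install "openai<1"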

### Importing the libraries inside our python file

The following cell imports every library used in the rest of the notebook.

In [2]:
import pandas as pd
import numpy as np
import json
import openai
from sklearn.metrics.pairwise import cosine_similarity
import os
import time
import pickle


### Getting the OpenAI API key from the user

In [3]:
# Read the key from the environment, or prompt for it; never hard-code a secret key in the notebook
openai.api_key = os.environ.get("OPENAI_API_KEY") or input("Enter OpenAI API key: ")


### Loading and formatting the data

The JSON file is loaded and converted to a pandas dataframe, from which we then remove unwanted entries.

In [4]:
# Dataset source: https://github.com/jd7h/sentiment-lexicon-skyrim/
with open("TheElderScrollsBooksDataset.json", encoding='ascii') as dataset_file:
    scripts = json.load(dataset_file)
df = pd.DataFrame(scripts)


As the dataset includes books from several Elder Scrolls games, we keep only the books from Skyrim.

In [5]:
df = df.loc[(df['game'] == "Skyrim")]
len(df)

756

Optionally, we can also exclude every book whose author is "Anonymous" to reduce the size of the dataframe. The counts shown later in this notebook (756 books) correspond to a run without this filter.

In [12]:
df = df.loc[df['author'] != "Anonymous"]
len(df)

451

### Tokenizing the text from the books and converting it to chunks

Next, we define a chunk size and an overlap size, then tokenize every entry of the filtered dataframe. Tokens here are whitespace-separated words produced by `split()`, not model tokens.

In [6]:
CHUNK_SIZE = 1000
OVERLAP = 20

skyrimBooks = 0
skyrimBookTokens = []
for x in (df['text']):
    skyrimBookTokens += x.split()
    skyrimBooks += 1
print("Total Books from Elder Scrolls V Skyrim:",skyrimBooks)
print("Total Token Size of All the Books :",len(skyrimBookTokens))

Total Books from Elder Scrolls V Skyrim: 756
Total Token Size of All the Books : 436335


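Once that is done, we group these tokens into chunks, which we will send to OpenAI's API to be converted into text embeddings. The window advances by CHUNK_SIZE-OVERLAP tokens at each step, so consecutive chunks share OVERLAP tokens.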

In [7]:
chunks = [skyrimBookTokens[i:i+CHUNK_SIZE] for i in range(0, len(skyrimBookTokens), CHUNK_SIZE-OVERLAP)]
cdf = pd.DataFrame(columns=['chunk', 'gpt_raw', 'embedding'])
print("Total Number of Chunks : ",len(chunks))


Total Number of Chunks :  446
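
To make the overlap easier to see, here is a minimal sketch of the same slicing applied to a toy token list. The function name `make_chunks` and the small sizes are illustrative assumptions and not part of the notebook.

```python
def make_chunks(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]


# Toy data: 9 words, windows of 4 words, 1 word shared between neighbours.
toy_tokens = "the quick brown fox jumps over the lazy dog".split()
toy_chunks = make_chunks(toy_tokens, chunk_size=4, overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Each toy chunk starts with the last word of the previous one. With the notebook's values (436335 tokens, a step of 980), the same `range` yields 446 windows, which matches the count printed above.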


Optionally, we can pickle the chunks for later use.

In [18]:
pickledChunkPath = 'pickledSkyrimTextChunks.pkl'
base, ext = os.path.splitext(pickledChunkPath)
inc = 1
# Append an increasing number until the file name is unused
while os.path.exists(pickledChunkPath):
    pickledChunkPath = base + str(inc) + ext
    inc += 1
with open(pickledChunkPath, 'wb') as f:
    pickle.dump(chunks, f)
print("Chunks successfully exported to", pickledChunkPath)

Chunks successfully exported to pickledSkyrimTextChunks1.pkl


### Sending Chunks to OpenAI's API

Now that our chunks are finalized, we can send them to OpenAI's API, which returns one text embedding per chunk.

Without a paid subscription to OpenAI's services, you can send at most 60 requests per minute. We therefore initialize a counter called <b><i>chunksSent</i></b> that tracks how many requests we have made. Once it exceeds 50, we pause the program for 1 minute, reset the counter, and continue until every chunk has been sent and its embedding received.

In [230]:
chunksSent=0
for chunk in chunks:
    if(chunksSent>50):
        time.sleep(60)
        chunksSent = 0
    f = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=" ".join(chunk),
    )
    cdf.loc[len(cdf.index)] = (chunk, f, np.array(f['data'][0]['embedding']))
    chunksSent+=1
print(len(cdf)," chunks successfully embedded!")

447  chunks successfully embedded!
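
The counter-and-sleep pattern can be separated from the API call, which makes it easy to try out without spending requests. The sketch below is an assumption-laden illustration: `embed_with_pauses`, `embed_fn`, and the injected `sleep` parameter are hypothetical names, and the fake embedder stands in for the real API call.

```python
import time


def embed_with_pauses(chunks, embed_fn, batch_limit=50, pause_seconds=60, sleep=time.sleep):
    """Call embed_fn on each chunk, pausing after every batch_limit calls."""
    results = []
    sent_since_pause = 0
    for chunk in chunks:
        if sent_since_pause >= batch_limit:
            sleep(pause_seconds)  # wait for the per-minute quota to reset
            sent_since_pause = 0
        results.append(embed_fn(" ".join(chunk)))
        sent_since_pause += 1
    return results


# Demo with a fake embedder and a recorded (not real) sleep, so it runs instantly.
pauses = []
fake_chunks = [["word", str(i)] for i in range(7)]
fake_vectors = embed_with_pauses(
    fake_chunks,
    embed_fn=lambda text: [float(len(text))],  # hypothetical stand-in for the API call
    batch_limit=3,
    pause_seconds=60,
    sleep=pauses.append,
)
print(len(fake_vectors), "chunks embedded,", len(pauses), "pauses")
```

This sketch pauses after exactly `batch_limit` calls, whereas the notebook's `chunksSent>50` check lets 51 requests through before each pause. Both stay under a limit of 60 per minute.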


### Reviewing our new dataframe

We have successfully embedded all the chunks, so we can now check that the dataframe looks as expected.

In [254]:
cdf.head()

| | chunk | gpt_raw | embedding |
|---|---|---|---|
| 1 | [16, Accords, of, Madness,, v., VI, Hircine's,... | {'object': 'list', 'data': [{'object': 'embedd... | [0.00324949505738914, -0.008569796569645405, -... |
| 2 | [was, no, fear,, for, he, had, faith, that, hi... | {'object': 'list', 'data': [{'object': 'embedd... | [-0.0006086188950575888, -0.01804208941757679,... |
| 3 | [among, his, tribe., On, a, blustery, day,, he... | {'object': 'list', 'data': [{'object': 'embedd... | [-0.0032683517783880234, -0.0243468526750803, ... |
| 4 | [to, push, past, him,, Lord, Sheogorath, spoke... | {'object': 'list', 'data': [{'object': 'embedd... | [0.0012960234889760613, -0.012394634075462818,... |
| 5 | [it, was, not, until, well, after, noon, that,... | {'object': 'list', 'data': [{'object': 'embedd... | [0.009688128717243671, -0.023770440369844437, ... |


### Storing our dataframe as a pickle file for later use

Optionally, we can store our dataframe as a pickle file and load it later, so we don't have to repeat the text embedding every time we reload the notebook.

I've also included a short snippet that checks whether a pickle file with the same name already exists. If it does, a numbered file name is used instead, so we don't overwrite our previous data.

In [8]:
pickledFilePath = 'pickledSkyrimTextEmbedding.pkl'
base, ext = os.path.splitext(pickledFilePath)
inc = 1
# Append an increasing number until the file name is unused
while os.path.exists(pickledFilePath):
    pickledFilePath = base + str(inc) + ext
    inc += 1
cdf.to_pickle(pickledFilePath)
print("DataFrame successfully exported to",pickledFilePath)

DataFrame successfully exported to pickledSkyrimTextEmbedding2.pkl


### Loading previously pickled dataframe

Now we can load our dataframe from the pickle file and use it to calculate cosine similarity with our input.

We can change the file name passed to <i>pd.read_pickle(fileName)</i> to choose which pickle file is used.

In [40]:
pickled_df = pd.read_pickle('pickledSkyrimTextEmbedding1.pkl')
print(pickled_df)

                                                 chunk  \
1    [16, Accords, of, Madness,, v., VI, Hircine's,...   
2    [was, no, fear,, for, he, had, faith, that, hi...   
3    [among, his, tribe., On, a, blustery, day,, he...   
4    [to, push, past, him,, Lord, Sheogorath, spoke...   
5    [it, was, not, until, well, after, noon, that,...   
..                                                 ...   
442  [of, the, forest, dwelling, mages, in, Summers...   
443  [didn't, know, any, better., Incense, burned, ...   
444  [out, of, the, shop, and, down, the, road, to,...   
445  [forest, people, who, were, torn, between, man...   
446  [Two, fortnights, passed, without, relief,, un...   

                                               gpt_raw  \
1    {'object': 'list', 'data': [{'object': 'embedd...   
2    {'object': 'list', 'data': [{'object': 'embedd...   
3    {'object': 'list', 'data': [{'object': 'embedd...   
4    {'object': 'list', 'data': [{'object': 'embedd...   
5    {'object

### Getting Query from the user, calculating cosine similarity and getting a response

Next, we ask the user for the query that will be used to get a response.


In [41]:
query = input("Enter Query : ")

Enter Query : Which is the most hostile god


Once we have the query, we create a text embedding for it and calculate its cosine similarity with each chunk embedding in our previously stored dataframe. The chunk with the highest similarity is sent as context along with the query.

In [43]:
f = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=query
)
query_embedding = np.array(f['data'][0]['embedding'])

similarity = []
for arr in pickled_df['embedding'].values:
    similarity.extend(cosine_similarity(query_embedding.reshape(1, -1), arr.reshape(1, -1)))
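# Look the chunk up in the same dataframe as the embeddings so the two stay aligned
context_chunk = pickled_df['chunk'].iloc[np.argmax(similarity)]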

query_to_send = "CONTEXT: " + " ".join(context_chunk) + "\n\n" + query
response = openai.Completion.create(
  model="text-davinci-003",
  prompt= query_to_send,
  max_tokens=2000,
  temperature=0
)
print("Response Successfully Received!")

Response Successfully Received!
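
The retrieval step reduces to a cosine similarity against every stored embedding, followed by picking the best match. The following dependency-free sketch shows that idea on toy 3-dimensional vectors. The helper names `cosine` and `top_k` and the toy data are assumptions made for illustration; the notebook itself uses scikit-learn's `cosine_similarity` and keeps only the single best chunk.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def top_k(query_vec, chunk_vecs, k=1):
    """Indices of the k chunk vectors most similar to query_vec, best first."""
    scores = [cosine(query_vec, vec) for vec in chunk_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy "embeddings": index 0 points almost the same way as the query.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query = [0.9, 0.1, 0.0]
best = top_k(toy_query, toy_embeddings, k=2)
print(best)
```

Retrieving more than one chunk (`k` greater than 1) is an option this sketch adds. Whether it helps depends on how much context fits in the prompt.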


Finally, we print the response that we have received from the API.

In [44]:
print(response['choices'][0]['text'])

 in Tamrielic mythic tradition?

Molag Bal (God of Schemes, King of Rape): Daedric power of much importance in Morrowind. There, he is always the archenemy of Boethiah, the Prince of Plots. He is the main source of the obstacles to the Dunmer (and preceding Chimer) people. In the legends, Molag Bal always tries to upset the bloodlines of Houses or otherwise ruin Dunmeri 'purity'. A race of supermonsters, said to live in Molag Amur, are the result of his seduction of Vivec during the previous era.
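
The response above starts by finishing the question before answering it, which is what a completion model tends to do when the prompt ends mid-sentence. One possible mitigation, sketched here as an assumption and not something the notebook does, is to label the context and the question explicitly and end the prompt with an answer cue. The function `build_prompt`, its labels, and the instruction line are hypothetical.

```python
def build_prompt(context_tokens, question, max_context_words=1000):
    """Assemble a labelled prompt from a context chunk and a user question."""
    context = " ".join(context_tokens[:max_context_words])
    question = question.strip()
    if not question.endswith("?"):
        question += "?"  # close the question so the model does not continue it
    return (
        "Answer the question using only the context below.\n\n"
        f"CONTEXT: {context}\n\n"
        f"QUESTION: {question}\n"
        "ANSWER:"
    )


demo_prompt = build_prompt(
    ["Molag", "Bal", "is", "the", "archenemy", "of", "Boethiah."],
    "Which is the most hostile god",
)
print(demo_prompt)
```

The resulting string could be passed as the `prompt` argument in place of `query_to_send`. How much it changes the output would need to be checked against the real model.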
