# Custom SampleBot - T6
### Step 1 - Preparation of the Dataset

In [1]:
!pip install openai==0.28
!pip install tiktoken



In [2]:
#List of Required Libraries
import openai
import pandas as pd
import numpy as np
from dateutil.parser import parse
from openai.embeddings_utils import get_embedding
from openai.embeddings_utils import distances_from_embeddings
import tiktoken


In [3]:
#Importing the datasets

default = pd.read_json('Dataset/default.jsonl', lines=True)
quest= pd.read_json('Dataset/queries.jsonl', lines=True)
resp= pd.read_json('Dataset/corpus.jsonl', lines=True)

In [4]:
#Inspection of responses dataset

df_resp= pd.DataFrame(
    data = resp, 
    columns= ['text']
    
)

#Removal of empty strings and newline characters
df_resp = df_resp[df_resp.text.str.len()>0] 
df_resp = df_resp['text'].str.replace('\n', '')

df_resp.tail(20)

2028    Scientists have not yet found a way to prevent...
2029    Researchers have not found that eating, diet, ...
2030    - Alagille syndrome is a genetic condition tha...
2031    Too much glucose in the blood for a long time ...
2032    You can do a lot to prevent heart disease and ...
2033    You may have one or more of the following warn...
2034    Narrowed blood vessels leave a smaller opening...
2035    A stroke happens when part of your brain is no...
2036    - Don't smoke.  - Keep blood glucose and blood...
2037    Primary hyperparathyroidism is a disorder of t...
2038    The parathyroid glands are four pea-sized glan...
2039    High PTH levels trigger the bones to release i...
2040    In about 80 percent of people with primary hyp...
2041    Most people with primary hyperparathyroidism h...
2042    Health care providers diagnose primary hyperpa...
2043    Once the diagnosis of primary hyperparathyroid...
2044    Surgery                Surgery to remove the o...
2045    Eating

In [5]:
#Re-organization of responses dataset

df_resp= pd.DataFrame(
    data = df_resp, 
    columns= ['text']
)
df_resp.tail(20)


Unnamed: 0,text
2028,Scientists have not yet found a way to prevent...
2029,"Researchers have not found that eating, diet, ..."
2030,- Alagille syndrome is a genetic condition tha...
2031,Too much glucose in the blood for a long time ...
2032,You can do a lot to prevent heart disease and ...
2033,You may have one or more of the following warn...
2034,Narrowed blood vessels leave a smaller opening...
2035,A stroke happens when part of your brain is no...
2036,- Don't smoke. - Keep blood glucose and blood...
2037,Primary hyperparathyroidism is a disorder of t...


In [6]:
#Additional filtering: strip the leading '- ' from responses that start with a dash
            
index_filter = df_resp.apply(lambda x: x.str.startswith('-')).text
df_resp.loc[index_filter, 'text'] = df_resp[index_filter].text.apply(lambda x: x[2:])

df_resp.tail(20)


Unnamed: 0,text
2028,Scientists have not yet found a way to prevent...
2029,"Researchers have not found that eating, diet, ..."
2030,Alagille syndrome is a genetic condition that ...
2031,Too much glucose in the blood for a long time ...
2032,You can do a lot to prevent heart disease and ...
2033,You may have one or more of the following warn...
2034,Narrowed blood vessels leave a smaller opening...
2035,A stroke happens when part of your brain is no...
2036,Don't smoke. - Keep blood glucose and blood p...
2037,Primary hyperparathyroidism is a disorder of t...
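The three cleaning steps above (drop empty rows, remove newlines, strip a leading dash) can also be chained in one expression. This is only a sketch on a toy frame with made-up rows, not the notebook's actual corpus, and it uses a regular expression instead of the `x[2:]` slice.

```python
import pandas as pd

# Toy stand-in for the 'text' column of the corpus (hypothetical rows).
toy = pd.DataFrame({'text': ['- First item.\nSecond line.', '', 'Plain text.']})

cleaned = (
    toy[toy['text'].str.len() > 0]                     # drop empty strings
    .assign(text=lambda d: d['text']
            .str.replace('\n', '', regex=False)        # remove newline characters
            .str.replace(r'^-\s*', '', regex=True))    # strip a leading dash, if any
    .reset_index(drop=True)
)
print(cleaned['text'].tolist())
```

Keeping the result as a DataFrame throughout also avoids having to rebuild it after the `str.replace` call turns it into a Series.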


In [7]:
#Saving Responses as csv
df_resp.reset_index(drop=True).to_csv('responses.csv', index=False)

In [8]:
#Inspection of questions dataset

df_quest= pd.DataFrame(
    data = quest, 
    columns= ['text']
    
)

#Removal of empty strings and newline characters
df_quest = df_quest[df_quest.text.str.len()>0] 
df_quest = df_quest['text'].str.replace('\n', '')

df_quest.tail(20)

2028                   How to prevent Alagille Syndrome ?
2029                   What to do for Alagille Syndrome ?
2030                   What to do for Alagille Syndrome ?
2031    What is (are) Prevent diabetes problems: Keep ...
2032    How to prevent Prevent diabetes problems: Keep...
2033    What are the symptoms of Prevent diabetes prob...
2034    What causes Prevent diabetes problems: Keep yo...
2035    What are the symptoms of Prevent diabetes prob...
2036    How to prevent Prevent diabetes problems: Keep...
2037          What is (are) Primary Hyperparathyroidism ?
2038          What is (are) Primary Hyperparathyroidism ?
2039          What is (are) Primary Hyperparathyroidism ?
2040            What causes Primary Hyperparathyroidism ?
2041    What are the symptoms of Primary Hyperparathyr...
2042        How to diagnose Primary Hyperparathyroidism ?
2043        How to diagnose Primary Hyperparathyroidism ?
2044    What are the treatments for Primary Hyperparat...
2045         W

In [9]:
#Re-organization of questions dataset

df_quest= pd.DataFrame(
    data = df_quest, 
    columns= ['text']
)

df_quest.tail(20)

Unnamed: 0,text
2028,How to prevent Alagille Syndrome ?
2029,What to do for Alagille Syndrome ?
2030,What to do for Alagille Syndrome ?
2031,What is (are) Prevent diabetes problems: Keep ...
2032,How to prevent Prevent diabetes problems: Keep...
2033,What are the symptoms of Prevent diabetes prob...
2034,What causes Prevent diabetes problems: Keep yo...
2035,What are the symptoms of Prevent diabetes prob...
2036,How to prevent Prevent diabetes problems: Keep...
2037,What is (are) Primary Hyperparathyroidism ?


In [10]:
#Additional filtering: drop the trailing ' ?' (last two characters) from each question
            
index_filter = df_quest.apply(lambda x: x.str.endswith('?')).text
df_quest.loc[index_filter, 'text'] = df_quest[index_filter].text.apply(lambda x: x[0:-2])

df_quest.head(20)

Unnamed: 0,text
0,Who is at risk for Lymphocytic Choriomeningiti...
1,What are the symptoms of Lymphocytic Choriomen...
2,Who is at risk for Lymphocytic Choriomeningiti...
3,How to diagnose Lymphocytic Choriomeningitis (...
4,What are the treatments for Lymphocytic Chorio...
5,How to prevent Lymphocytic Choriomeningitis (LCM)
6,What is (are) Parasites - Cysticercosis
7,Who is at risk for Parasites - Cysticercosis?
8,How to diagnose Parasites - Cysticercosis
9,What are the treatments for Parasites - Cystic...
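The `x[0:-2]` slice assumes every question ends with a space followed by `?`. A possible alternative, sketched here on made-up questions, is a regular expression that removes one trailing `?` plus any whitespace in front of it, so a question without the space keeps its last letter.

```python
import pandas as pd

# Hypothetical questions covering the different endings.
toy_q = pd.Series([
    'How to prevent X ?',        # usual ' ?' ending
    'Who is at risk for Y? ?',   # doubled mark, as in row 7 above
    'Why?',                      # no space before the mark
    'No mark here',
])

# Remove a single trailing '?' together with the whitespace before it.
stripped = toy_q.str.replace(r'\s*\?$', '', regex=True)
print(stripped.tolist())
```

On the `' ?'` endings this gives the same result as the slice; on `'Why?'` the slice would return `'Wh'`, while the pattern returns `'Why'`.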


In [11]:
#Saving Questions as csv
df_quest.reset_index(drop=True).to_csv('questions.csv', index=False)

## Implementation of OpenAI
##### The struggle was real

In [12]:
#Creation of embeddings

openai.api_key = ''  #Insert your own OpenAI API key here


response = openai.Embedding.create(
    model='text-embedding-ada-002', 
    input=df_resp.text.tolist()
)

question = openai.Embedding.create(
    model='text-embedding-ada-002', 
    input=df_quest.text.tolist()
)
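Both calls above send the whole list in a single request. If the dataset grows, it may be safer to send the texts in smaller batches. The sketch below assumes a hypothetical `embed_fn` callable that takes a list of strings and returns one vector per string; `fake_embed` is a stand-in so the example runs without an API key.

```python
def embed_in_batches(texts, embed_fn, batch_size=500):
    """Call embed_fn on consecutive slices of texts and collect the vectors in order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in embedder (hypothetical): one-dimensional "vector" holding the text length.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(['a', 'bb', 'ccc', 'dddd', 'eeeee'], fake_embed, batch_size=2)
print(vectors)
```

With the `openai==0.28` client used in this notebook, `embed_fn` could wrap `openai.Embedding.create(model=..., input=batch)` and return the `'embedding'` entry of each item in `['data']`, as done in the cells below.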

In [13]:
#Verification of the embeddings
print('Responses \n')
response['data'][0]['embedding']

Responses 



[0.006323640234768391,
 -0.012436065822839737,
 0.0034754418302327394,
 -0.023348825052380562,
 -0.027342703193426132,
 0.025755392387509346,
 0.0013392932014539838,
 -0.020161403343081474,
 -0.021607903763651848,
 0.002033741446211934,
 0.011648810468614101,
 0.006115625612437725,
 -0.003760261693969369,
 -0.012276054359972477,
 -0.014516210183501244,
 -0.006784472148865461,
 0.023579241707921028,
 0.022555170580744743,
 0.03013329766690731,
 -0.027035482227802277,
 -0.022350355982780457,
 0.014119382947683334,
 -0.021262280642986298,
 -0.034562405198812485,
 -0.00987588707357645,
 0.011290385387837887,
 0.013914568349719048,
 -0.021070266142487526,
 -0.013504940085113049,
 -0.004115486517548561,
 -0.00037922640331089497,
 -0.0037314596120268106,
 -0.007898150011897087,
 0.0011552803916856647,
 0.006006818264722824,
 0.021441493183374405,
 0.01632113568484783,
 0.0023105607833713293,
 -0.013236121274530888,
 -0.002080144826322794,
 0.016884375363588333,
 0.015015444718301296,
 -0.0104

In [14]:
print('Questions \n')
question_embeddings = question['data'][0]['embedding']
question_embeddings

Questions 



[0.012244616635143757,
 0.015077278017997742,
 -0.0035975449718534946,
 -0.021520448848605156,
 -0.02415216714143753,
 0.01864241249859333,
 -0.033706728368997574,
 -0.0025620353408157825,
 0.002808353863656521,
 0.001579192583449185,
 0.01594587415456772,
 0.030750906094908714,
 -0.0395924411714077,
 0.024229951202869415,
 0.011965888552367687,
 0.010928758420050144,
 0.027924727648496628,
 0.00842020008713007,
 0.007331212982535362,
 -0.019562866538763046,
 -0.015103206969797611,
 0.0038017299957573414,
 0.013884578831493855,
 -0.020975954830646515,
 0.0014519820688292384,
 -0.005856543779373169,
 0.017073754221200943,
 -0.018629448488354683,
 0.012114975601434708,
 -0.00412259204313159,
 0.015427310019731522,
 -0.003087082412093878,
 -0.020288856700062752,
 -0.011751980520784855,
 -0.008562805131077766,
 -0.01097413245588541,
 0.02446330524981022,
 0.008122025057673454,
 0.008290558122098446,
 0.005107865668833256,
 0.022142726927995682,
 0.024320699274539948,
 -0.005010634660720825

In [15]:
#Creation of embedding lists and saving as csv

embeddings = list(map(lambda x: x['embedding'], response['data']))
embeddings_q = list(map(lambda x: x['embedding'], question['data']))



df_resp['embeddings'] = embeddings
df_resp.to_csv('embeddings_resp.csv', index=False)


df_quest['embeddings'] = embeddings_q
df_quest.to_csv('embeddings_quest.csv', index=False)

## Step 2: Finding Relevant Data

In [16]:
#Importing the response's CSV 

df_responses = pd.read_csv('embeddings_resp.csv')
df_responses['embeddings'] = df_responses['embeddings'].apply(eval).apply(np.array)
#df_questions = pd.read_csv('embeddings_quest.csv')
#df_questions['embeddings'] = df_questions['embeddings'].apply(eval).apply(np.array)
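`eval` works here because the CSV cells contain plain Python list literals, but it will execute whatever is in the file. A more cautious variant, sketched on a toy frame written to an in-memory buffer instead of the real `embeddings_resp.csv`, uses `ast.literal_eval`, which only accepts literals.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame (hypothetical values) saved to an in-memory CSV.
toy = pd.DataFrame({'text': ['a', 'b'], 'embeddings': [[0.1, -0.2], [0.3, 0.4]]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Each cell comes back as the string '[0.1, -0.2]'; parse it, then convert to an array.
loaded = pd.read_csv(buffer)
loaded['embeddings'] = loaded['embeddings'].apply(ast.literal_eval).apply(np.array)
print(loaded['embeddings'][0])
```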

In [17]:
df_responses

Unnamed: 0,text,embeddings
0,LCMV infections can occur after exposure to fr...,"[0.006323640234768391, -0.012436065822839737, ..."
1,LCMV is most commonly recognized as causing ne...,"[-0.026094242930412292, -0.0043663159012794495..."
2,Individuals of all ages who come into contact ...,"[0.010534269735217094, -0.012414691969752312, ..."
3,"During the first phase of the disease, the mos...","[-0.0021115741692483425, -0.004429312888532877..."
4,"Aseptic meningitis, encephalitis, or meningoen...","[-0.019981464371085167, 0.015375477261841297, ..."
...,...,...
2043,Once the diagnosis of primary hyperparathyroid...,"[0.0111300153657794, 0.013934831134974957, 0.0..."
2044,Surgery Surgery to remove the o...,"[0.0014901341637596488, 0.014967930503189564, ..."
2045,"Eating, diet, and nutrition have not been show...","[0.018720755353569984, -0.001158850616775453, ..."
2046,Primary hyperparathyroidism is a disorder of t...,"[0.01327613927423954, 0.0026520818937569857, 0..."


In [18]:
#Finding the cosine distances

distances = distances_from_embeddings(question_embeddings, df_responses['embeddings'].tolist(), distance_metric='cosine')
distances

[0.20485335243890346,
 0.1632813902248874,
 0.14904232459669187,
 0.20837164032961575,
 0.17447313650313412,
 0.17286002552992508,
 0.24992172075271835,
 0.2650906396746865,
 0.2432407113901216,
 0.27446097972431593,
 0.24208437923625115,
 0.26175125767181484,
 0.2648597562415137,
 0.2658947447305766,
 0.28268172923453894,
 0.2725638985702793,
 0.2819422836784017,
 0.28421697243383015,
 0.2798598276607078,
 0.2869240637599113,
 0.27085289709270133,
 0.2671562580938267,
 0.264076861621828,
 0.2641414159401454,
 0.2471866466281324,
 0.2244331142645204,
 0.2542821390646124,
 0.27584518145928516,
 0.25114177207537547,
 0.2738029179370586,
 0.25208121310683973,
 0.2654079762592867,
 0.26243153676035014,
 0.27445824278001596,
 0.26697811169628605,
 0.25443953560960564,
 0.2742009636418731,
 0.2850491635513285,
 0.2695164649266919,
 0.22327598249417835,
 0.2888645802545906,
 0.22647254835260278,
 0.22911839990759608,
 0.25711252181406863,
 0.25415297054909247,
 0.24227900163474092,
 0.2225154
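For reference, cosine distance is one minus the cosine similarity of two vectors. The NumPy sketch below computes it for one query against several candidates; it is an illustration of the formula, not the code inside `distances_from_embeddings`.

```python
import numpy as np

def cosine_distances(query, candidates):
    """Return 1 - cosine similarity between query and each row of candidates."""
    q = np.asarray(query, dtype=float)
    c = np.asarray(candidates, dtype=float)
    similarities = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    return 1.0 - similarities

# Same direction, orthogonal, and opposite direction.
toy_distances = cosine_distances([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(toy_distances)
```

A distance near 0 means the vectors point the same way, which is why the smallest values above correspond to the most relevant responses.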

In [19]:
#Verification and sorting of distances column and saving as csv

df_responses['distances'] = distances

sorted_distances = df_responses.sort_values(by='distances', ascending=True)
sorted_distances.to_csv('distances_sorted.csv', index=False)
sorted_distances

Unnamed: 0,text,embeddings,distances
2,Individuals of all ages who come into contact ...,"[0.010534269735217094, -0.012414691969752312, ...",0.149042
1,LCMV is most commonly recognized as causing ne...,"[-0.026094242930412292, -0.0043663159012794495...",0.163281
686,Cytomegalovirus (CMV) is a virus found through...,"[-0.005178028251975775, -0.0010677087120711803...",0.172591
5,LCMV infection can be prevented by avoiding co...,"[0.004860565531998873, -0.004707245621830225, ...",0.172860
688,For most people CMV infection is not a problem...,"[-0.0009898442076519132, -0.01749727874994278,...",0.174043
...,...,...,...
1913,Diabetes management and treatment is expensive...,"[0.01575271040201187, -0.020176930353045464, 0...",0.350004
1653,Purpose of Hemodialysis The pur...,"[0.016612136736512184, -0.000823556911200285, ...",0.350121
1839,"Gas is air in the digestive tractthe large, mu...","[0.00881050992757082, 0.024427972733974457, 0....",0.350294
1458,Your doctor can offer you a number of treatmen...,"[-0.01342537347227335, 0.0011852876050397754, ...",0.352493
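One thing to watch when saving this frame: the `embeddings` column now holds NumPy arrays, and the text form of a long array is abbreviated with `...`, so the full vectors appear not to survive the trip through `distances_sorted.csv`. Only the `text` column is read back later, so nothing breaks here, but if the vectors are needed again a sketch like this (with a made-up vector of 1536 values, the length of a `text-embedding-ada-002` embedding) shows one way to keep them intact.

```python
import json

import numpy as np

vec = np.linspace(0.0, 1.0, 1536)   # made-up vector with the same length as an embedding

abbreviated = str(vec)               # text form of the array, shortened with '...'
serialized = json.dumps(vec.tolist())            # full list of numbers as a JSON string
restored = np.array(json.loads(serialized))      # parse it back into an array

print('...' in abbreviated, restored.shape)
```

In the notebook this would mean converting the column before saving, for example with `.apply(lambda a: json.dumps(a.tolist()))`, and parsing it with `json.loads` after loading.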


## Steps 3 and 4: Model ft. Custom Text Prompt
#### First let's do an example

In [20]:
#Creation of a sample question 

question= df_quest['text'][0]

In [21]:
#Building a Custom Text Prompt - Example

tokenizer = tiktoken.get_encoding('cl100k_base')
tokenized = tokenizer.encode(question)
tokenized, len(tokenized)

prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

max_token_count = 3000

print(prompt_template.format('context', question))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

context

---

Question: Who is at risk for Lymphocytic Choriomeningitis (LCM)?
Answer:


In [22]:
#Let's check how many tokens we have used in this example

tokenized_question = tokenizer.encode(question)
tokenized_prompt = tokenizer.encode(prompt_template)
len(tokenized_question), len(tokenized_prompt)


current_token_count = len(tokenized_question) + len(tokenized_prompt)
current_token_count


59

In [23]:
#Here we are going to provide some context for the sample prompt: 

df = pd.read_csv('distances_sorted.csv')

context = []
current_token_count = len(tokenized_question) + len(tokenized_prompt)

for text in df.text.values:
    #Add the next closest text only while we stay within the token budget
    current_token_count += len(tokenizer.encode(text))
    if current_token_count > max_token_count:
        break
    context.append(text)
        
df

Unnamed: 0,text,embeddings,distances
0,Individuals of all ages who come into contact ...,[ 0.01053427 -0.01241469 -0.00591455 ... -0.03...,0.149042
1,LCMV is most commonly recognized as causing ne...,[-0.02609424 -0.00436632 -0.0020876 ... -0.02...,0.163281
2,Cytomegalovirus (CMV) is a virus found through...,[-0.00517803 -0.00106771 -0.00286532 ... -0.02...,0.172591
3,LCMV infection can be prevented by avoiding co...,[ 0.00486057 -0.00470725 0.00655361 ... -0.02...,0.172860
4,For most people CMV infection is not a problem...,[-0.00098984 -0.01749728 0.01601489 ... -0.02...,0.174043
...,...,...,...
2043,Diabetes management and treatment is expensive...,[ 0.01575271 -0.02017693 0.03359914 ... -0.01...,0.350004
2044,Purpose of Hemodialysis The pur...,[ 0.01661214 -0.00082356 0.02796889 ... -0.00...,0.350121
2045,"Gas is air in the digestive tractthe large, mu...",[ 0.00881051 0.02442797 0.01285615 ... 0.00...,0.350294
2046,Your doctor can offer you a number of treatmen...,[-0.01342537 0.00118529 0.03550658 ... 0.00...,0.352493
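The loop above can be wrapped in a small reusable function. This is a sketch with hypothetical names (`build_context`, `count_tokens`, `budget`, `used`); a whitespace word count stands in for `tokenizer.encode` so the example runs without downloading an encoding.

```python
def build_context(texts, count_tokens, budget, used=0):
    """Collect texts in order until adding the next one would exceed the token budget."""
    context = []
    for text in texts:
        used += count_tokens(text)
        if used > budget:
            break
        context.append(text)
    return context

# Stand-in token counter (assumption): number of whitespace-separated words.
def count_words(text):
    return len(text.split())

ranked = ['one two three', 'four five', 'six seven eight nine']
selected = build_context(ranked, count_words, budget=6, used=1)
print(selected)
```

Here `used` plays the role of the tokens already spent on the question and the prompt template, and `texts` is expected to be sorted by distance so the closest responses are kept first.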


In [24]:
context

['Individuals of all ages who come into contact with urine, feces, saliva, or blood of wild mice are potentially at risk for infection. Owners of pet mice or hamsters may be at risk for infection if these animals originate from colonies that were contaminated with LCMV, or if their animals are infected from other wild mice. Human fetuses are at risk of acquiring infection vertically from an infected mother.                 Laboratory workers who work with the virus or handle infected animals are also at risk. However, this risk can be minimized by utilizing animals from sources that regularly test for the virus, wearing proper protective laboratory gear, and following appropriate safety precautions.',
 'LCMV is most commonly recognized as causing neurological disease, as its name implies, though infection without symptoms or mild febrile illnesses are more common clinical manifestations.                 For infected persons who do become ill, onset of symptoms usually occurs 8-13 days 

In [25]:
#The same approach can provide context for any question in our dataset; here the context holds the texts closest to our sample question

print(prompt_template.format('\n\n###\n\n'.join(context), question))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Individuals of all ages who come into contact with urine, feces, saliva, or blood of wild mice are potentially at risk for infection. Owners of pet mice or hamsters may be at risk for infection if these animals originate from colonies that were contaminated with LCMV, or if their animals are infected from other wild mice. Human fetuses are at risk of acquiring infection vertically from an infected mother.                 Laboratory workers who work with the virus or handle infected animals are also at risk. However, this risk can be minimized by utilizing animals from sources that regularly test for the virus, wearing proper protective laboratory gear, and following appropriate safety precautions.

###

LCMV is most commonly recognized as causing neurological disease, as its name implies, though infection without symptoms or mild febrile illnesses 

In [26]:
#Finally, let's send this prompt to a Completion model. Since "text-davinci-003" has been deprecated, we will be using 'gpt-3.5-turbo-instruct':

answer = openai.Completion.create(
    model='gpt-3.5-turbo-instruct', 
    prompt=prompt_template.format('\n\n###\n\n'.join(context), question),
    max_tokens = 50
)

In [27]:
print(question, '\n\n', answer['choices'][0]['text'])

Who is at risk for Lymphocytic Choriomeningitis (LCM)? 

  
Individuals who come into contact with urine, feces, saliva, or blood of wild mice, owners of pet mice or hamsters if the animals are infected with LCM from wild mice, human fetuses, and laboratory workers who handle infected
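The answer above stops mid-sentence, which is what we would expect with `max_tokens = 50`. Completion responses include a `finish_reason` for each choice, and a value of `'length'` signals that the token limit cut the text short. The sketch below reads it from a hand-written dictionary that mimics the response shape; `extract_answer` and `stub` are hypothetical names, and no API call is made.

```python
def extract_answer(response):
    """Return the answer text and whether it was cut off by the token limit."""
    choice = response['choices'][0]
    text = choice['text'].strip()
    truncated = choice.get('finish_reason') == 'length'
    return text, truncated

# Hand-written stand-in for a completion response (not real API output).
stub = {'choices': [{'text': '\nLaboratory workers who handle infected',
                     'finish_reason': 'length'}]}

text, truncated = extract_answer(stub)
print(text, '| truncated:', truncated)
```

When `truncated` is true, raising `max_tokens` in the `openai.Completion.create` call is the usual first thing to try.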


#### Ok, now let's create an interface with all of these features combined. Check out the "ChattyBot - T6" file in this repository