# Lesson 2 - Retrieval Augmented Generation (RAG)

### Import the Needed Packages

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from datasets import load_dataset
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
from tqdm.auto import tqdm
from DLAIUtils import Utils

import ast
import os
import pandas as pd

In [3]:
# get api key
utils = Utils()
PINECONE_API_KEY = utils.get_pinecone_api_key()

### Setup Pinecone

In [4]:
pinecone = Pinecone(api_key=PINECONE_API_KEY)

# build a course-specific index name and start from a clean index
INDEX_NAME = utils.create_dlai_index_name('wiki-ai')
if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(INDEX_NAME)

# dimension=1536 matches the vectors returned by text-embedding-ada-002
pinecone.create_index(name=INDEX_NAME, dimension=1536, metric='cosine',
    spec=ServerlessSpec(cloud='aws', region='us-west-2'))

index = pinecone.Index(INDEX_NAME)
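
The index above is created with `metric='cosine'`, so matches are ranked by cosine similarity between the query vector and the stored vectors. As a minimal sketch of what that metric computes (plain Python on toy 3-dimensional vectors; the helper name `cosine_similarity` is illustrative and not part of the Pinecone client):

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors; the real index stores 1536-dimensional embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

Vectors pointing in the same direction score 1.0 regardless of their length, and orthogonal vectors score 0.0.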

### Load the Dataset

**Note:** To access the dataset outside of this course, copy the following two lines of code, uncomment them, and run them:

#!wget -q -O lesson2-wiki.csv.zip "https://www.dropbox.com/scl/fi/yxzmsrv2sgl249zcspeqb/lesson2-wiki.csv.zip?rlkey=paehnoxjl3s5x53d1bedt4pmc&dl=0"

#!unzip lesson2-wiki.csv.zip

<p style="background-color:#fff1d7; padding:15px; "> <b>(Note: <code>max_articles_num = 500</code>):</b> A larger number of articles generally gives the large language model (LLM) a more comprehensive context. In this lab, <code>max_articles_num</code> is initially set to 500 so that results come back quickly. After an initial run, consider increasing this value to 750 or 1,000; the context provided to the LLM will likely become richer. You can raise this variable gradually for different queries to observe the improvements in the LLM's answers.</p>
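
The cell that follows slices `data[0:10]` directly rather than reading a `max_articles_num` variable. As a rough sketch of how such a cap could be applied, the plain-Python helper below mirrors the deduplicate-then-truncate steps on a list of records. The function name `select_articles` and the sample records are hypothetical and used only for illustration.

```python
def select_articles(records, max_articles_num=500):
    """Drop records with duplicate page_text, then keep at most max_articles_num."""
    seen = set()
    selected = []
    for record in records:
        if len(selected) >= max_articles_num:
            break
        text = record["page_text"]
        if text in seen:
            continue  # keep only the first occurrence, like keep='first'
        seen.add(text)
        selected.append(record)
    return selected


# Hypothetical sample records standing in for the dataset rows.
sample = [
    {"page_title": "A", "page_text": "alpha"},
    {"page_title": "A (copy)", "page_text": "alpha"},
    {"page_title": "B", "page_text": "beta"},
    {"page_title": "C", "page_text": "gamma"},
]
print([r["page_title"] for r in select_articles(sample, max_articles_num=2)])
```

Raising `max_articles_num` only changes how many unique articles are indexed; the rest of the pipeline stays the same.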

In [5]:
from datasets import load_dataset

data = load_dataset("gamino/wiki_medical_terms", split='train')
data = data.to_pandas()
data.drop_duplicates(subset='page_text', keep='first', inplace=True)
data = data[0:10]  # keep only the first 10 articles for a quick run
data.head()

| | `page_title` | `page_text` | `__index_level_0__` |
|---|---|---|---|
| 0 | Paracetamol poisoning | Paracetamol poisoning, also known as acetamino... | 0 |
| 1 | Acromegaly | Acromegaly is a disorder that results from exc... | 1 |
| 2 | Actinic keratosis | Actinic keratosis (AK), sometimes called solar... | 2 |
| 3 | Congenital adrenal hyperplasia | Congenital adrenal hyperplasia (CAH) is a grou... | 3 |
| 4 | Adrenocortical carcinoma | Adrenocortical carcinoma (ACC) is an aggressi... | 4 |


In [6]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

### Connect to OpenAI

In [7]:
OPENAI_API_KEY = utils.get_openai_api_key()
openai_client = OpenAI(api_key=OPENAI_API_KEY)

def get_embeddings(articles, model="text-embedding-ada-002"):
   return openai_client.embeddings.create(input = articles, model=model)

In [8]:
# keep only the first 500 characters of each article as the text to embed
page_text = [record[0:500] for record in data['page_text']]
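
The cell above cuts each article at exactly 500 characters, which can split a word in half. One possible variation, shown here only as a sketch, is to back up to the last space inside the limit. The helper name `truncate_at_word` is hypothetical and not part of the notebook.

```python
def truncate_at_word(text, limit=500):
    """Truncate text to at most `limit` characters, preferring a word boundary."""
    if len(text) <= limit:
        return text
    # Look for the last space at or before position `limit`.
    cut = text.rfind(" ", 0, limit + 1)
    if cut <= 0:
        # No usable space found: fall back to a hard cut.
        return text[:limit]
    return text[:cut]


print(truncate_at_word("hello world foo", limit=8))
```

Whether this helps retrieval quality is not something the notebook measures; the hard cut used in the lesson is simpler and works for the examples shown.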


In [9]:
import numpy as np
batch_size = 10

for i in tqdm(range(0, len(data), batch_size)):
    # get end of batch
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    # build the metadata (title and full text) for each record in the batch
    metadatas = [{
        'title': record['page_title'],
        'text': record['page_text']
    } for _, record in batch.iterrows()]
    # use the row positions as string IDs
    ids = [str(x) for x in range(i, i_end)]
    # embed the truncated text of each record (one API call per record)
    embeds = [get_embeddings(x).data[0].embedding for x in page_text[i:i_end]]
    embeds = np.array(embeds)
    # add everything to pinecone
    index.upsert(vectors=zip(ids, embeds, metadatas))

In [10]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 10}},
 'total_vector_count': 10}
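
The indexing loop above walks the data in fixed-size batches and derives string IDs from the row positions. The batching arithmetic can be sketched on its own, without any API calls. The generator name `batch_ranges` is illustrative and does not appear in the notebook.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(total, start + batch_size)


for start, end in batch_ranges(25, 10):
    ids = [str(x) for x in range(start, end)]
    print(start, end, len(ids))
```

With 10 articles and `batch_size = 10`, as in this lesson, the loop runs exactly once, which is consistent with the 10 vectors reported by `describe_index_stats()`.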

### Run Your Query

In [11]:
data

| | `page_title` | `page_text` | `__index_level_0__` |
|---|---|---|---|
| 0 | Paracetamol poisoning | Paracetamol poisoning, also known as acetamino... | 0 |
| 1 | Acromegaly | Acromegaly is a disorder that results from exc... | 1 |
| 2 | Actinic keratosis | Actinic keratosis (AK), sometimes called solar... | 2 |
| 3 | Congenital adrenal hyperplasia | Congenital adrenal hyperplasia (CAH) is a grou... | 3 |
| 4 | Adrenocortical carcinoma | Adrenocortical carcinoma (ACC) is an aggressi... | 4 |
| 5 | Alcohol withdrawal syndrome | Alcohol withdrawal syndrome (AWS) is a set of ... | 5 |
| 6 | Alopecia areata | Alopecia areata, also known as spot baldness, ... | 6 |
| 7 | Altitude sickness | Altitude sickness, the mildest form being acut... | 7 |
| 8 | Amblyopia | Amblyopia, also called lazy eye, is a disorder... | 8 |
| 9 | Amoebiasis | Amoebiasis, or amoebic dysentery, is an infect... | 9 |


In [35]:
# `res` is created by the query cell that follows (In [31]); run that cell first
res['matches'][1]

{'id': '7',
 'metadata': {'text': 'Altitude sickness, the mildest form being acute '
                      'mountain sickness (AMS), is the harmful effect of high '
                      'altitude, caused by rapid exposure to low amounts of '
                      'oxygen at high elevation. People can respond to high '
                      'altitude in different ways. Symptoms may include '
                      'headaches, vomiting, tiredness, confusion, trouble '
                      'sleeping, and dizziness. Acute mountain sickness can '
                      'progress to high-altitude pulmonary edema (HAPE) with '
                      'associated shortness of breath or high-altitude '
                      'cerebral edema (HACE) with associated confusion. '
                      'Chronic mountain sickness may occur after long-term '
                      'exposure to high altitude.Altitude sickness typically '
                      'occurs only above 2,500 metres (8,000 ft), tho

In [31]:
query = "what is Amblyopia?"

# embed the query, fetch the 3 closest articles, and print their stored text
embed = get_embeddings([query])
res = index.query(vector=embed.data[0].embedding, top_k=3, include_metadata=True)
text = [r['metadata']['text'] for r in res['matches']]
print('\n'.join(text))


Amblyopia, also called lazy eye, is a disorder of sight in which the brain fails to fully process input from one eye and over time favors the other eye. It results in decreased vision in an eye that typically appears normal in other respects. Amblyopia is the most common cause of decreased vision in a single eye among children and younger adults.The cause of amblyopia can be any condition that interferes with focusing during early childhood. This can occur from poor alignment of the eyes (strabismic), an eye being irregularly shaped such that focusing is difficult, one eye being more nearsighted or farsighted than the other (refractive), or clouding of the lens of an eye (deprivational). After the underlying cause is addressed, vision is not restored right away, as the mechanism also involves the brain. Amblyopia can be difficult to detect, so vision testing is recommended for all children around the ages of four to five.Early detection improves treatment success. Glasses may be all th

### Build the Prompt

In [71]:
query = "Explain what Amblyopia is and its signs and symptoms, and also provide the references used."
embed = get_embeddings([query])
res = index.query(vector=embed.data[0].embedding, top_k=3, include_metadata=True)
# use only the best-matching article as context
contexts = res['matches'][0]['metadata']['text']

prompt_start = (
    "Answer the question based on the context below.\n\n"
    "Context:\n"
)

prompt_end = f"\n\nQuestion: {query}\nAnswer:"

prompt = prompt_start + contexts + prompt_end

print(prompt)

Answer the question based on the context below.

Context:
Amblyopia, also called lazy eye, is a disorder of sight in which the brain fails to fully process input from one eye and over time favors the other eye. It results in decreased vision in an eye that typically appears normal in other respects. Amblyopia is the most common cause of decreased vision in a single eye among children and younger adults.The cause of amblyopia can be any condition that interferes with focusing during early childhood. This can occur from poor alignment of the eyes (strabismic), an eye being irregularly shaped such that focusing is difficult, one eye being more nearsighted or farsighted than the other (refractive), or clouding of the lens of an eye (deprivational). After the underlying cause is addressed, vision is not restored right away, as the mechanism also involves the brain. Amblyopia can be difficult to detect, so vision testing is recommended for all children around the ages of four to five.Early d
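
The prompt above uses only the top match, even though the query asks for `top_k=3`. A possible extension, sketched below under stated assumptions, joins several retrieved texts into the context while respecting a character budget. The function name `build_prompt`, the `---` separator, and the `max_context_chars` default are all hypothetical choices, not part of the lesson.

```python
def build_prompt(query, contexts, max_context_chars=3000):
    """Assemble a RAG prompt from as many contexts as fit in the budget."""
    header = "Answer the question based on the context below.\n\nContext:\n"
    footer = f"\n\nQuestion: {query}\nAnswer:"
    separator = "\n\n---\n\n"

    kept = []
    used = 0
    for text in contexts:
        # Every context after the first also costs one separator.
        cost = len(text) + (len(separator) if kept else 0)
        if used + cost > max_context_chars:
            break
        kept.append(text)
        used += cost
    return header + separator.join(kept) + footer


# Placeholder strings stand in for the retrieved article texts.
print(build_prompt("What is amblyopia?", ["first passage", "second passage"]))
```

In the notebook, `contexts` could be the list `[m['metadata']['text'] for m in res['matches']]`, taken from the query response before it is narrowed to a single match. The stored article texts are long, so the budget decides how many of them reach the model.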

### Get the Summary

In [72]:
res = openai_client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    temperature=0,
    max_tokens=1000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)
print('-' * 80)
print(res.choices[0].text)

--------------------------------------------------------------------------------
 Amblyopia, also known as lazy eye, is a vision disorder in which the brain does not fully process input from one eye and instead favors the other eye. This results in decreased vision in the affected eye, even though it may appear normal in other respects. Amblyopia is the most common cause of decreased vision in one eye among children and young adults.

The signs and symptoms of amblyopia can vary, but may include poor stereo vision, difficulty with depth perception, poor pattern recognition, and reduced visual acuity. Other symptoms may include low sensitivity to contrast and motion, abnormal spatial interactions, and impaired contour detection. Amblyopia can also lead to problems with binocular vision, such as limited stereoscopic depth perception and difficulty seeing three-dimensional images.

The cause of amblyopia can be any condition that interferes with focusing during early childhood. This can i