# Retrieval Augmented Generation (RAG)

### Import the Needed Packages

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from datasets import load_dataset
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
from tqdm.auto import tqdm
from DLAIUtils import Utils

import ast
import os
import pandas as pd

In [3]:
# get api key
utils = Utils()
PINECONE_API_KEY = utils.get_pinecone_api_key()

### Setup Pinecone

In [4]:
pinecone = Pinecone(api_key=PINECONE_API_KEY)

utils = Utils()
INDEX_NAME = utils.create_dlai_index_name('dl-ai')
if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
  pinecone.delete_index(INDEX_NAME)

pinecone.create_index(name=INDEX_NAME, dimension=1536, metric='cosine',
  spec=ServerlessSpec(cloud='aws', region='us-west-2'))

index = pinecone.Index(INDEX_NAME)

### Load the Dataset

**Note:** To access the dataset outside of this notebook, copy the following two commands, uncomment them, and run them:

#!wget -q -O lesson2-wiki.csv.zip "https://www.dropbox.com/scl/fi/yxzmsrv2sgl249zcspeqb/lesson2-wiki.csv.zip?rlkey=paehnoxjl3s5x53d1bedt4pmc&dl=0"

#!unzip lesson2-wiki.csv.zip

<p style="background-color:#fff1d7; padding:15px; "> <b>(Note: <code>max_articles_num = 500</code>):</b> A larger number of articles generally gives the large language model (LLM) a more comprehensive context. In this lab, <code>max_articles_num</code> is initially set to 500 so that you can see results quickly. After an initial run, consider increasing this value to 750 or 1,000; you will likely notice that the context provided to the LLM becomes richer. You can experiment by gradually raising this value for different queries to observe the improvement in the context the LLM receives.</p>

In [5]:
max_articles_num = 500
df = pd.read_csv('./data/wiki.csv', nrows=max_articles_num)
df.head()


| Unnamed: 0 | id | metadata | values |
|---|---|---|---|
| 1 | 1-0 | {'chunk': 0, 'source': 'https://simple.wikiped... | [-0.011254455894231796, -0.01698738895356655, ... |
| 2 | 1-1 | {'chunk': 1, 'source': 'https://simple.wikiped... | [-0.0015197008615359664, -0.007858820259571075... |
| 3 | 1-2 | {'chunk': 2, 'source': 'https://simple.wikiped... | [-0.009930099360644817, -0.012211072258651257,... |
| 4 | 1-3 | {'chunk': 3, 'source': 'https://simple.wikiped... | [-0.011600767262279987, -0.012608098797500134,... |
| 5 | 1-4 | {'chunk': 4, 'source': 'https://simple.wikiped... | [-0.026462381705641747, -0.016362832859158516,... |
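
The `metadata` and `values` columns come out of the CSV as strings rather than Python objects, which is why the next cell passes them through `ast.literal_eval`. A minimal sketch of that conversion, using a made-up row (the URL, text, and numbers below are placeholders, not taken from the dataset):

```python
import ast

# Hypothetical row mimicking the CSV layout; the contents are invented.
row = {
    "id": "1-0",
    "metadata": "{'chunk': 0, 'source': 'https://example.org/page', 'text': 'hello'}",
    "values": "[0.1, -0.2, 0.3]",
}

# literal_eval parses Python literals only (dicts, lists, numbers, strings),
# so unlike eval() it cannot execute arbitrary code embedded in the file.
meta = ast.literal_eval(row["metadata"])
values = ast.literal_eval(row["values"])

print(type(meta).__name__, type(values).__name__)
print(meta["chunk"], len(values))
```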


### Prepare the Embeddings and Upsert to Pinecone

In [6]:
prepped = []

for i, row in tqdm(df.iterrows(), total=df.shape[0]):
    meta = ast.literal_eval(row['metadata'])
    prepped.append({'id':row['id'], 
                    'values':ast.literal_eval(row['values']), 
                    'metadata':meta})
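    if len(prepped) >= 250:
        index.upsert(prepped)
        prepped = []

# Upsert any remaining records that did not fill a complete batch
if prepped:
    index.upsert(prepped)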


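
The cell above sends vectors to Pinecone in groups of 250. One thing to watch with this pattern is the tail: if the number of rows is not a multiple of the batch size, the leftover records still need to be sent. The sketch below shows a small batching helper that covers that case; `batched` is a hypothetical name introduced here for illustration, and the records are stand-ins rather than real vectors.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items, including a final partial batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover records that did not fill a whole batch
        yield batch


# Stand-in records; in the notebook these would be the id/values/metadata dicts.
records = [{"id": str(n)} for n in range(600)]
sizes = [len(b) for b in batched(records, 250)]
print(sizes)  # [250, 250, 100]
```

In the notebook, each yielded batch would be passed to `index.upsert(batch)`.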

In [7]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 500}},
 'total_vector_count': 500}

### Connect to OpenAI

In [8]:
OPENAI_API_KEY = utils.get_openai_api_key()
openai_client = OpenAI(api_key=OPENAI_API_KEY)

def get_embeddings(articles, model="text-embedding-ada-002"):
    return openai_client.embeddings.create(input=articles, model=model)

### Run Your Query

In [13]:
query = "what is the India Tajmahal?"

embed = get_embeddings([query])
res = index.query(vector=embed.data[0].embedding, top_k=3, include_metadata=True)
text = [r['metadata']['text'] for r in res['matches']]
print('\n'.join(text))


Timur, the Turkic conqueror, took over in the end of the 14th century and began to rebuild cities in this region. Timur's successors, the Timurids (1405–1507), were great patrons of learning and the arts who enriched their capital city of Herat with fine buildings. Under their rule Afghanistan enjoyed peace and prosperity.

Between south of the Hindu Kush and the Indus River (today's Pakistan) was the native land of the Afghan tribes. They called this land "Afghanistan" (meaning "land of the Afghans"). The Afghans ruled the rich northern Indian subcontinent with their capital at Delhi. From the 16th to the early 18th century, Afghanistan was disputed between the Safavids of Isfahan and the Mughals of Agra who had replaced the Lodi and Suri Afghan rulers in India. The Safavids and Mughals occasionally oppressed the native Afghans but at the same time the Afghans used each empire to punish the other. In 1709, the Hotaki Afghans rose to power and completely defeated the Persian Empire. Th
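
The index was created with `metric='cosine'`, so `index.query` ranks the stored vectors by cosine similarity to the query embedding and returns the `top_k` best matches. The sketch below shows the idea on toy 3-dimensional vectors. It is a brute-force illustration only, not a description of how Pinecone searches internally, and the ids and numbers are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Return the k (id, score) pairs with the highest similarity to query."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "index": ids mapped to made-up vectors.
toy_vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
    "d": [-1.0, 0.0, 0.0],
}
matches = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print([vec_id for vec_id, _ in matches])  # ['a', 'b']
```

A high rank only means "closest among what is stored". With only 500 articles indexed, the closest passages can still be off-topic, as in the output above.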

### Build the Prompt

In [14]:
query = "write an article titled: what is the India Tajmahal?"
embed = get_embeddings([query])
res = index.query(vector=embed.data[0].embedding, top_k=3, include_metadata=True)

contexts = [
    x['metadata']['text'] for x in res['matches']
]

prompt_start = (
    "Answer the question based on the context below.\n\n"+
    "Context:\n"
)

prompt_end = (
    f"\n\nQuestion: {query}\nAnswer:"
)

prompt = (
    prompt_start + "\n\n---\n\n".join(contexts) + 
    prompt_end
)

print(prompt)

Answer the question based on the context below.

Context:
Now, 150 years later, it really is a big city.

In modern times many cities have grown bigger and bigger. The whole area is often called a  "metropolis"  and can sometimes include several small ancient towns and villages. The metropolis of London includes London, Westminster, and many old villages such as Notting Hill, Southwark, Richmond, Greenwich, etc. The part that is officially known as the " City of London " only takes up one square mile. The rest is known as "Greater London. " Many other cities have grown in the same way.

These giant cities can be exciting places to live, and many people can find good jobs there, but modern cities also have many problems. Many people cannot find jobs in the cities and have to get money by begging or by crime. Automobiles, factories, and waste create a lot of pollution that makes people sick.

Urban history 

Urban history is history of civilization. The first cities were made in ancient 
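
The prompt assembly above can be wrapped in a small function so that it can be reused across queries. The sketch below also stops adding contexts once they would exceed a character budget. `build_prompt` and `max_context_chars` are hypothetical names, and the default limit is an arbitrary illustration rather than a value the notebook uses.

```python
def build_prompt(query, contexts, max_context_chars=3000):
    """Join retrieved contexts into a question-answering prompt.

    Contexts are added in order until the next one would exceed
    max_context_chars (an illustrative budget, counted in characters).
    """
    prompt_start = "Answer the question based on the context below.\n\nContext:\n"
    prompt_end = f"\n\nQuestion: {query}\nAnswer:"
    separator = "\n\n---\n\n"

    kept = []
    used = 0
    for context in contexts:
        # Count the separator only when it will actually be inserted.
        extra = len(context) + (len(separator) if kept else 0)
        if used + extra > max_context_chars:
            break
        kept.append(context)
        used += extra
    return prompt_start + separator.join(kept) + prompt_end


example = build_prompt("What is a city?", ["First passage.", "Second passage."])
print(example)
```

In the notebook, `contexts` would be the list built from `res['matches']`, and the returned string would be passed as `prompt` to the completion call.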

### Get the Summary

In [15]:
res = openai_client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    temperature=0,
    max_tokens=636,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)
print('-' * 80)
print(res.choices[0].text)

--------------------------------------------------------------------------------


The Taj Mahal is a famous monument located in Agra, India. It is considered one of the most beautiful and iconic buildings in the world, and is recognized as a symbol of love and devotion.

The Taj Mahal was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. Construction of the monument began in 1632 and took over 20 years to complete. It is said that over 20,000 workers were involved in the construction of the Taj Mahal, which is made entirely of white marble.

The design of the Taj Mahal is a blend of Islamic, Persian, and Indian architectural styles. The main structure is a large white dome, surrounded by four smaller domes and four minarets. The intricate carvings and inlaid designs on the marble walls and floors are a testament to the skilled craftsmanship of the Mughal artisans.

The Taj Mahal is not only a beautiful architectural masterpiece, but it also holds grea