<a href="https://colab.research.google.com/github/Seif-R15/Generative_AI_Projects_-_UseCases/blob/main/Retrieval_Augmented_Generation(RAG)_with_OpenAI(gpt).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building Retrieval-Augmented Generation (RAG) with the Pinecone vector database, vector embeddings, and gpt-3.5-turbo

### Import the Needed Packages

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
from datasets import load_dataset
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
from tqdm.auto import tqdm
from DLAIUtils import Utils

import ast
import os
import pandas as pd

In [None]:
# Get the Pinecone API key
utils = Utils()
PINECONE_API_KEY = utils.get_pinecone_api_key()

### Setup Pinecone

In [None]:
pinecone = Pinecone(api_key=PINECONE_API_KEY)

utils = Utils()
INDEX_NAME = utils.create_dlai_index_name('dl-ai')
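# Delete any existing index with the same name so the notebook starts clean.
if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(INDEX_NAME)

# dimension=1536 matches the size of text-embedding-ada-002 vectors.
pinecone.create_index(name=INDEX_NAME, dimension=1536, metric='cosine',
                      spec=ServerlessSpec(cloud='aws', region='us-west-2'))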

index = pinecone.Index(INDEX_NAME)
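
The index above is created with `metric='cosine'`, so matches are ranked by cosine similarity between the query vector and the stored vectors. As a minimal sketch of what that metric computes (plain Python on toy vectors, not Pinecone's actual implementation; the function name `cosine_similarity` is an illustrative choice):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors; the real index stores 1536-dimensional embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0.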

### Load the Dataset

**Note:** To download the dataset, uncomment the following two lines and run them:

#!wget -q -O lesson2-wiki.csv.zip "https://www.dropbox.com/scl/fi/yxzmsrv2sgl249zcspeqb/lesson2-wiki.csv.zip?rlkey=paehnoxjl3s5x53d1bedt4pmc&dl=0"

#!unzip lesson2-wiki.csv.zip

<p style="background-color:#fff1d7; padding:15px; "> <b>(Note: <code>max_articles_num = 500</code>):</b> Loading more articles generally gives the large language model (LLM) richer context to draw on. This lab initially sets <code>max_articles_num</code> to 500 so that you can see results quickly. After an initial run, consider increasing the value to 750 or 1,000; the context provided to the LLM will likely become richer. Try raising the value gradually for different queries to observe how the retrieved context changes.</p>

In [None]:
max_articles_num = 500
df = pd.read_csv('./data/wiki.csv', nrows=max_articles_num)
df.head()


Unnamed: 0,id,metadata,values
1,1-0,"{'chunk': 0, 'source': 'https://simple.wikiped...","[-0.011254455894231796, -0.01698738895356655, ..."
2,1-1,"{'chunk': 1, 'source': 'https://simple.wikiped...","[-0.0015197008615359664, -0.007858820259571075..."
3,1-2,"{'chunk': 2, 'source': 'https://simple.wikiped...","[-0.009930099360644817, -0.012211072258651257,..."
4,1-3,"{'chunk': 3, 'source': 'https://simple.wikiped...","[-0.011600767262279987, -0.012608098797500134,..."
5,1-4,"{'chunk': 4, 'source': 'https://simple.wikiped...","[-0.026462381705641747, -0.016362832859158516,..."


### Prepare the Wikipedia Embeddings and Upsert Them to Pinecone

In [None]:
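prepped = []

# Parse each row and upsert to Pinecone in batches of 250 vectors.
for i, row in tqdm(df.iterrows(), total=df.shape[0]):
    meta = ast.literal_eval(row['metadata'])
    prepped.append({'id':row['id'],
                    'values':ast.literal_eval(row['values']),
                    'metadata':meta})
    if len(prepped) >= 250:
        index.upsert(prepped)
        prepped = []

# Upsert any remaining rows when the total is not a multiple of 250.
if prepped:
    index.upsert(prepped)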



In [None]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 500}},
 'total_vector_count': 500}
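
The upsert cell above sends vectors to Pinecone in groups of 250. A minimal sketch of the same batching idea as a reusable generator, assuming a hypothetical helper name `batched`; it also yields a final partial batch, which matters when the number of rows is not a multiple of the batch size:

```python
def batched(items, batch_size):
    """Yield lists of at most batch_size items, including a final partial list."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # leftover items that did not fill a whole batch
        yield batch

# Toy data: 7 items in batches of 3 leaves one item for the last batch.
batches = list(batched(range(7), 3))
print(batches)

# In the notebook this could be used roughly as (assumes `records` holds the
# prepared dicts and `index` is the Pinecone index created earlier):
# for batch in batched(records, 250):
#     index.upsert(batch)
```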

### Connect to OpenAI

In [None]:
OPENAI_API_KEY = utils.get_openai_api_key()
openai_client = OpenAI(api_key=OPENAI_API_KEY)

def get_embeddings(articles, model="text-embedding-ada-002"):
    # Return the OpenAI embeddings response for a list of input strings.
    return openai_client.embeddings.create(input=articles, model=model)

### Run Your Query

In [None]:
query = "How an Algorithm works and what is an Algorithm?"

embed = get_embeddings([query])
res = index.query(vector=embed.data[0].embedding, top_k=3, include_metadata=True)
text = [r['metadata']['text'] for r in res['matches']]
print('\n'.join(text))


How an ideal computer works 
 Algorithmic information theory (how easily can a computer answer a question?)
 Complexity theory (how much time and memory does a computer need to answer a question?)
 Computability theory (can a computer do something?)
 Information theory (math that looks at data and how to process data)
 Theory of computation (how to answer questions on a computer using algorithms)
 Graph theory (math that looks for directions from one point to another)
 Type theory (what kinds of data should computers work with?)
 Denotational semantics (math for computer languages)
 Algorithms (looks at how to answer a question)
 Compilers (turning words into computer programs)
 Lexical analysis (how to turn words into data)
 Microprogramming (how to control the most important part of a computer)
 Operating systems (big computer programs, e.g. Linux, Microsoft Windows, Mac OS)  to control the computer hardware and software.
 Cryptography (hiding data)
 Parallel computing (many instruct
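
Each match returned by the query carries a similarity `score` alongside its `metadata`. A minimal sketch of dropping weak matches before using their text, written against plain dictionaries shaped like the matches above; the helper name `texts_above_score`, the sample matches, and the 0.8 threshold are hypothetical and not part of the original notebook:

```python
def texts_above_score(matches, min_score):
    """Return the metadata text of every match scoring at least min_score."""
    return [m["metadata"]["text"] for m in matches if m["score"] >= min_score]

# Hypothetical matches with made-up scores, mimicking res['matches'].
sample_matches = [
    {"id": "a", "score": 0.91, "metadata": {"text": "Algorithms are step-by-step procedures."}},
    {"id": "b", "score": 0.84, "metadata": {"text": "Computer science studies computation."}},
    {"id": "c", "score": 0.52, "metadata": {"text": "April is the fourth month."}},
]

relevant = texts_above_score(sample_matches, min_score=0.8)
print("\n".join(relevant))
```

A suitable threshold depends on the embedding model and the data, so it is worth inspecting real scores before choosing one.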

### Build the Prompt

In [None]:
query = "write an article titled: How an Algorithm works and what is an Algorithm?"
embed = get_embeddings([query])
res = index.query(vector=embed.data[0].embedding, top_k=3, include_metadata=True)

contexts = [
    x['metadata']['text'] for x in res['matches']
]

prompt_start = (
    "Answer the question based on the context below.\n\n"+
    "Context:\n"
)

prompt_end = (
    f"\n\nQuestion: {query}\nAnswer:"
)

prompt = (
    prompt_start + "\n\n---\n\n".join(contexts) +
    prompt_end
)

print(prompt)

Answer the question based on the context below.

Context:
How an ideal computer works 
 Algorithmic information theory (how easily can a computer answer a question?)
 Complexity theory (how much time and memory does a computer need to answer a question?)
 Computability theory (can a computer do something?)
 Information theory (math that looks at data and how to process data)
 Theory of computation (how to answer questions on a computer using algorithms)
 Graph theory (math that looks for directions from one point to another)
 Type theory (what kinds of data should computers work with?)
 Denotational semantics (math for computer languages)
 Algorithms (looks at how to answer a question)
 Compilers (turning words into computer programs)
 Lexical analysis (how to turn words into data)
 Microprogramming (how to control the most important part of a computer)
 Operating systems (big computer programs, e.g. Linux, Microsoft Windows, Mac OS)  to control the computer hardware and software.
 Cry
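
The prompt above is assembled from a fixed header, the retrieved contexts joined by a separator, and the question. A minimal sketch of the same assembly wrapped in a function, with one addition that is not in the original notebook: a hypothetical `max_context_chars` budget that stops adding contexts once the limit would be exceeded, so that the prompt stays within a chosen size.

```python
def build_prompt(query, contexts, max_context_chars=3000):
    """Assemble the RAG prompt, keeping contexts only while they fit the budget."""
    prompt_start = "Answer the question based on the context below.\n\nContext:\n"
    prompt_end = f"\n\nQuestion: {query}\nAnswer:"
    separator = "\n\n---\n\n"

    kept = []
    used = 0
    for text in contexts:
        # A separator is only needed between contexts, not before the first.
        cost = len(text) + (len(separator) if kept else 0)
        if used + cost > max_context_chars:
            break
        kept.append(text)
        used += cost
    return prompt_start + separator.join(kept) + prompt_end

# Toy contexts: the third one is too long for the 20-character budget.
demo_prompt = build_prompt("What is X?", ["aaa", "bbb", "c" * 50], max_context_chars=20)
print(demo_prompt)
```

Counting characters is only a rough stand-in for counting tokens; it keeps the sketch free of extra dependencies.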

### Get the Summary

In [None]:
res = openai_client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    temperature=0,
    max_tokens=636,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)
print('-' * 80)
print(res.choices[0].text)

--------------------------------------------------------------------------------


An algorithm is a set of instructions or steps used to solve a problem or complete a task. It is like a recipe that tells a computer what to do in order to achieve a desired outcome. Algorithms are an essential part of computer science and are used in a wide range of applications, from simple calculations to complex problem-solving.

The concept of an algorithm has been around for centuries, with early examples dating back to ancient civilizations such as the Babylonians and Greeks. However, the term "algorithm" was first coined in the 9th century by a Persian mathematician, Muhammad ibn Mūsā al-Khwārizmī. He used the term in his book "Essay on the Computation of Casting and Equation" to describe a method for solving mathematical problems.

Today, algorithms are used in various fields, including computer science, mathematics, and engineering. They are also an integral part of everyday life, from the algo
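
The cells above run the three RAG steps one at a time: embed the query, retrieve matching text from the index, and pass a context-filled prompt to the model. A minimal sketch of those steps combined into one function, assuming a hypothetical name `answer_with_rag` and taking each step as a callable so the flow can be exercised without API keys; the `fake_*` functions are stand-ins, not real clients.

```python
def answer_with_rag(query, embed_fn, search_fn, generate_fn, top_k=3):
    """Embed the query, retrieve contexts, build the prompt, and generate."""
    vector = embed_fn(query)
    contexts = search_fn(vector, top_k)
    prompt = (
        "Answer the question based on the context below.\n\n"
        "Context:\n" + "\n\n---\n\n".join(contexts) +
        f"\n\nQuestion: {query}\nAnswer:"
    )
    return generate_fn(prompt)

# Stand-in components (all hypothetical) so the sketch runs offline.
def fake_embed(text):
    return [float(len(text))]

def fake_search(vector, top_k):
    corpus = ["An algorithm is a list of steps.", "Computers follow algorithms."]
    return corpus[:top_k]

def fake_generate(prompt):
    return f"[model would answer a {len(prompt)}-character prompt]"

demo_answer = answer_with_rag(
    "What is an algorithm?", fake_embed, fake_search, fake_generate
)
print(demo_answer)
```

With the notebook's objects, `embed_fn` could wrap `get_embeddings([q]).data[0].embedding`, `search_fn` could wrap `index.query(...)` and extract `metadata['text']` from each match, and `generate_fn` could wrap `openai_client.completions.create(...)` and return `choices[0].text`.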