# RAG implementation

This notebook tests a simple RAG (retrieval-augmented generation) implementation using Pinecone as the vector database and a Cohere LLM. \
We use the Simple English Wikipedia dataset and, for each article, keep only the first 4 paragraphs. \
Because Simple Wikipedia has relatively short paragraphs, we wanted to see how using only the opening paragraphs affects RAG, instead of chunking each article into paragraphs and embedding every chunk. \
In this toy example, we index only the first 400 articles.

In [42]:
import os
from tqdm import tqdm
import cohere
import numpy as np
import warnings
from IPython.display import display
warnings.filterwarnings("ignore")

In [44]:
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from pinecone import Pinecone, ServerlessSpec

In [45]:
with open("cohere_api_key.txt") as f:
    COHERE_API_KEY = f.read().strip()
with open("pinecone_api_key.txt") as f:
    PINECONE_API_KEY = f.read().strip()

In [46]:
# Load the encoder model
from sentence_transformers import SentenceTransformer

EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"
model = SentenceTransformer(EMBEDDING_MODEL)

# Loading the dataset
### We use the Wikipedia dataset and keep only the first k paragraphs of each article.


In [47]:
def load_and_embedd_dataset(
        dataset_name: str = 'cnn_dailymail',
        split: str = 'train',
        model: SentenceTransformer = None,
        text_field: str = 'highlights',
        rec_num: int = 400,
        subset: str = None
) -> tuple:
    """
    Load a dataset and embed the text field using a sentence-transformer model
    Args:
        dataset_name: The name of the dataset to load
        split: The split of the dataset to load
        model: The model to use for embedding. If None, 'all-MiniLM-L6-v2' is loaded
        text_field: The field in the dataset that contains the text
        rec_num: The number of records to embed
        subset: The subset of the dataset to load. Default is None
    Returns:
        tuple: A tuple containing the dataset and the embeddings of its first `rec_num` rows
    """
    print("Loading and embedding the dataset")

    # Load the default model lazily: a model created in the signature would be
    # instantiated once at function-definition time, even when it is never used
    if model is None:
        model = SentenceTransformer('all-MiniLM-L6-v2')

    # Load the dataset
    if subset is not None:
        dataset = load_dataset(dataset_name, subset, split=split)
    else:
        dataset = load_dataset(dataset_name, split=split)

    # Keep only the first k newline-separated pieces of the text.
    # Note: when paragraphs are separated by a blank line, the blank lines count
    # as pieces too, so k=4 keeps about two paragraphs
    def take_first_k_paragraphs(text, k=4):
        paragraphs = text.split('\n')
        return '\n'.join(paragraphs[:k])

    # Apply the truncation to the text field of every row
    dataset = dataset.map(lambda x: {text_field: take_first_k_paragraphs(x[text_field], k=4)})

    # Embed the first `rec_num` rows of the dataset
    embeddings = model.encode(dataset[text_field][:rec_num])

    print("Done!")
    return dataset, embeddings

In [48]:
# Load the dataset
DATASET_NAME = "graelo/wikipedia"
subset = "20230601.simple"

dataset, embeddings = load_and_embedd_dataset(
    dataset_name=DATASET_NAME,
    rec_num=400,
    split='train',
    model=model,
    text_field='text',
    subset=subset
)
shape = embeddings.shape

Loading and embedding the dataset
Done!


In [50]:
print(f"The embeddings shape: {embeddings.shape}")

The embeddings shape: (400, 768)


In [51]:
pd_dataset = dataset.to_pandas()
pd_dataset['text'].head(5)

0    April is the fourth month of the year in the J...
1    Art is a creative activity and technical skill...
2    Air is the Earth's atmosphere. Air is a mixtur...
3    Alan Mathison Turing OBE FRS (London, 23 June ...
4    Adobe Illustrator is a computer program for ma...
Name: text, dtype: object

In [38]:
print(pd_dataset['text'].head(7)[2])

Air is the Earth's atmosphere. Air is a mixture of many gases and tiny dust particles. It is the clear gas in which living things live and breathe. It has an indefinite shape and volume. It has mass and weight, because it is matter. The weight of air creates atmospheric pressure. There is no air in outer space.

Atmosphere is a mixture of about 78% nitrogen, 21% of oxygen, and 1% other gases, such as Carbon Dioxide.
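
The output above contains two paragraphs rather than four: splitting on `'\n'` counts the blank line between paragraphs as a piece of its own. Below is a minimal sketch of a paragraph-aware variant, assuming paragraphs are separated by blank lines. The name `first_k_paragraphs` is hypothetical and not part of the notebook, and lines separated by a single newline (such as a section title and its first paragraph) are treated as one paragraph.

```python
import re


def first_k_paragraphs(text: str, k: int = 4) -> str:
    """Keep the first k non-empty paragraphs, treating blank lines as separators."""
    # Assumption: paragraphs are delimited by one or more blank lines
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "\n\n".join(paragraphs[:k])


sample = "First.\n\nSecond.\n\nThird.\n\nFourth.\n\nFifth."

# Splitting on single newlines keeps blank separators as elements
print(sample.split("\n")[:4])

# Splitting on blank lines keeps four real paragraphs
print(first_k_paragraphs(sample, k=4))
```

Switching the notebook to a variant like this would change the indexed texts, so the outputs shown in the later cells would no longer match.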



## Creating index

In [105]:
def create_pinecone_index(
        index_name: str,
        dimension: int,
        metric: str = 'cosine',
):
    """
    Create a pinecone index if it does not exist
    Args:
        index_name: The name of the index
        dimension: The dimension of the index
        metric: The metric to use for the index
    Returns:
        Pinecone: A pinecone object which can later be used for upserting vectors and connecting to VectorDBs
    """
    from pinecone import Pinecone, ServerlessSpec
    print("Creating a Pinecone index...")
    pc = Pinecone(api_key=PINECONE_API_KEY)
    existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]
    if index_name not in existing_indexes:
        pc.create_index(
            name=index_name,
            dimension=dimension,
            # Remember! It is crucial that the metric you will use in your VectorDB will also be a metric your embedding
            # model works well with!
            metric=metric,
            spec=ServerlessSpec(
                cloud="aws",
                region="us-east-1"
            )
        )
    print("Done!")
    return pc
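
The comment inside `create_pinecone_index` stresses that the index metric should suit the embedding model. As a rough illustration of what the `cosine` metric measures, here is a small pure-Python sketch; the vectors are made up, and whether a given embedding model returns unit-length vectors is something to check in its documentation rather than assume.

```python
import math


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; independent of their lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


u = [3.0, 4.0]
v = [6.0, 8.0]   # same direction as u, twice as long
w = [4.0, -3.0]  # orthogonal to u

print(cosine_similarity(u, v))          # same direction
print(cosine_similarity(u, w))          # orthogonal
print(dot(normalize(u), normalize(v)))  # dot product of unit vectors
```

For unit-length vectors the dot product and the cosine similarity agree, which is why the two metrics rank results identically when the embeddings are normalized.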

In [106]:
INDEX_NAME = 'wiki-dataset'

# Create the vector database
# We are passing the index_name and the size of our embeddings
pc = create_pinecone_index(INDEX_NAME, shape[1])

Creating a Pinecone index...
Done!


In [107]:
def upsert_vectors(
        index,
        embeddings: np.ndarray,
        dataset: dict,
        text_field: str = 'highlights',
        batch_size: int = 128
):
    """
    Upsert vectors to a pinecone index
    Args:
        index: The pinecone index object (as returned by `pc.Index(...)`)
        embeddings: The embeddings to upsert
        dataset: The dataset containing the metadata
        text_field: The field in the dataset whose text is stored as metadata
        batch_size: The batch size to use for upserting
    Returns:
        An updated pinecone index
    """
    print("Upserting the embeddings to the Pinecone index...")
    num_vectors = embeddings.shape[0]

    ids = [str(i) for i in range(num_vectors)]
    # Only the embedded rows need metadata, so slice instead of reading the whole dataset
    meta = [{text_field: text} for text in dataset[text_field][:num_vectors]]

    # create list of (id, vector, metadata) tuples to be upserted;
    # the vectors are converted from NumPy rows to plain lists of floats
    to_upsert = list(zip(ids, embeddings.tolist(), meta))

    for i in tqdm(range(0, num_vectors, batch_size)):
        i_end = min(i + batch_size, num_vectors)
        index.upsert(vectors=to_upsert[i:i_end])
    return index
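
The batching loop in `upsert_vectors` can be looked at in isolation. The sketch below uses a hypothetical helper, `iter_batches`, and a list of integers as a stand-in for the `(id, vector, metadata)` tuples, so it runs without a Pinecone account.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


records = list(range(400))  # stand-in for 400 (id, vector, metadata) tuples
batches = list(iter_batches(records, 64))
print(len(batches), len(batches[-1]))  # prints: 7 16
```

With 400 records and a batch size of 64 this gives seven batches, the last one holding the remaining 16 records, which matches the 7 iterations reported by the progress bar of the upsert run.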


In [109]:
INDEX_NAME = 'wiki-dataset'
index = pc.Index(INDEX_NAME)
index_upserted = upsert_vectors(index, embeddings, dataset, text_field='text', batch_size=2**6)

Upserting the embeddings to the Pinecone index...


100%|█████████████████████████████████████████████| 7/7 [00:02<00:00,  2.90it/s]


## Making questions
### We want to evaluate the model systematically, with questions it can answer.
In this part we ask the LLM to generate questions that can be answered from the RAG data. We kept only the questions for which the plain LLM (without retrieval) gave an answer different from the one in the source text.
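
In the notebook, the answers are compared by reading them. A rough way to automate that filter is sketched below, assuming a simple token-overlap heuristic; the names `content_tokens` and `answers_differ`, the stop-word list, and the 0.5 threshold are all illustrative choices, not part of the original code.

```python
import re

# Very small stop-word list, chosen only for this illustration
STOP_WORDS = {"a", "an", "the", "of", "is", "was", "in", "on", "by", "to"}


def content_tokens(text: str) -> set:
    """Lower-case the text and return its word tokens without stop words."""
    return {t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS}


def answers_differ(reference: str, llm_answer: str, threshold: float = 0.5) -> bool:
    """Return True when the LLM answer covers less than `threshold` of the reference tokens."""
    ref = content_tokens(reference)
    if not ref:
        return False
    overlap = len(ref & content_tokens(llm_answer)) / len(ref)
    return overlap < threshold


reference = "A simile."
plain_llm = "A figure of speech that compares two different things is a metaphor."
print(answers_differ(reference, plain_llm))                                      # prints: True
print(answers_differ(reference, "That figure of speech is called a simile."))    # prints: False
```

Token overlap is crude: two answers that differ only in a date or a number share most of their tokens and can be judged equal, so a manual check remains useful.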

In [64]:
# Load the LLM
import cohere
co = cohere.Client(api_key=COHERE_API_KEY)

In [62]:
str(pd_dataset.sample(1)['text'].values[0])

'José María de la Torre Martín (9 September 1952 – 14 December 2020) was a Mexican Roman Catholic bishop. De la Torres Martín was born in Mexico City. He became a priest in 1980. He was titular bishop of Panatoria and as auxiliary bishop of the Roman Catholic Archdiocese of Guadalajara, Mexico from 2002 to 2008 and as bishop of the Roman Catholic Diocese of Aguascalientes, Mexico, from 2008 until his death in 2020.\n\nDe la Torre Martín died on 14 December 2020 from COVID-19 in Aguascalientes, Mexico at the age of 68.\n'

In [78]:
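def generate_question():
    """Ask the LLM to write a question, with its answer, about a randomly sampled article"""
    query = "Generate a simple question from this paragraph, that is not specific to the paragraph but the answer to it is in the paragraph and provide answer \n"
    # Note: the article is sampled from the whole dataset, not only from the indexed articles
    sel_text = pd_dataset.sample(1)['text'].values[0]
    query += sel_text

    response = co.chat(
        model='command-r-plus',
        message=query,
    )
    print("response: ", response.text)
    print("=====")
    print("Source:")
    print(sel_text)


def ask_question(query):
    """Send the question to the plain LLM, without any retrieved context"""
    response = co.chat(
        model='command-r-plus',
        message=query,
    )
    print(response.text)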

### Question 1

In [80]:
generate_question()

response:  Question: What is the name of the district that Rocourt used to be a part of?
Answer: The district of Porrentruy.
=====
Source:
Rocourt was a municipality of the district of Porrentruy in the canton of Jura in Switzerland. On 1 January 2018, the former municipality of Rocourt merged into the municipality of Haute-Ajoie.

References



In [81]:
ask_question("What is the name of the district that Rocourt used to be a part of?")

Liège


### Question 2

In [85]:
generate_question()

response:  Question: When was Leas Cliff Hall opened? 
Answer: Leas Cliff Hall was opened on July 13, 1927, by Prince Henry, Duke of Gloucester.
=====
Source:
Leas Cliff Hall is an entertainment and function venue in Folkestone, on the Kent coast of England.

History
The Leas Shelter was built in 1894. In 1924, it was decided that a larger hall was needed. 28 months later, the building was finished. It was opened on 13 July 1927 by Prince Henry, Duke of Gloucester.


In [86]:
ask_question("When was Leas Cliff Hall opened? ")

Leas Cliff Hall was opened on Thursday, July 6, 1927.


### Question 3

In [92]:
generate_question()

response:  Question: Who was known as the "Walking Bible"?
Answer: Jack Van Impe.
=====
Source:
Jack Leo Van Impe ( ; February 9, 1931 – January 18, 2020) was an American televangelist. He was known for his half-hour weekly television series Jack Van Impe Presents which was a commentary on the news of the week through with a twist of the Bible. He was known as the "Walking Bible", having memorized most of the King James Version of the Bible.

Van Impe died on January 18, 2020 in Royal Oak, Michigan at a hospital from problems caused by a fall at the age of 88.



In [95]:
ask_question("Who was known as the Walking Bible?")

George Müller was known as the "Walking Bible" due to his remarkable memorization and recall of large portions of the Bible. Müller, a Christian evangelist and director of an orphanage in Bristol, England, in the 19th century, had a deep devotion to Scripture and is known for his faith and dedication to serving the needy. He attributed his ability to recall Bible verses to his habit of regularly reading and meditating on the Bible.


### Question 4

In [117]:
generate_question()

response:  Question: What is a figure of speech that compares two different things?
Answer: A simile.
=====
Source:
A simile is a figure of speech that compares two different things, usually by using the words 'like' or 'as'. It is used to make a direct and clear comparison between two things .Similes may be confused with metaphors, which do the same kind of thing. Similes use comparisons, with the words 'like' or 'as'. Metaphors use indirect comparisons, without the words 'like' or 'as'.

Similes:
Like a hungry wolf, he ate the food.


In [118]:
ask_question(" What is a figure of speech that compares two different things?")

A figure of speech that compares two different things is a metaphor.


# Implementing RAG

In [110]:
def augment_prompt(
        query: str,
        model: SentenceTransformer = None,
        index=None,
) -> tuple:
    """
    Augment the prompt with the top 3 results from the knowledge base
    Args:
        query: The query to augment
        model: The model used to embed the query. It must be the model that
            produced the indexed embeddings. If None, 'all-MiniLM-L6-v2' is loaded
        index: The vectorstore object
    Returns:
        tuple: The augmented prompt and the retrieved source knowledge
    """
    # Load the default model lazily instead of instantiating it in the signature
    if model is None:
        model = SentenceTransformer('all-MiniLM-L6-v2')

    # Embed the query and convert it to a plain list of floats
    query_vector = model.encode(query).tolist()

    # get top 3 results from knowledge base (the stored vectors themselves are not needed)
    query_results = index.query(
        vector=query_vector,
        top_k=3,
        include_values=False,
        include_metadata=True
    )['matches']

    # get the text from the results
    text_matches = [match['metadata']['text'] for match in query_results]
    source_knowledge = "\n\n".join(text_matches)

    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.
    Contexts:
    {source_knowledge}
    If the answer is not included in the source knowledge - say that you don't know.
    Query: {query}"""
    return augmented_prompt, source_knowledge
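
Conceptually, the `top_k=3` query in `augment_prompt` asks for the stored vectors most similar to the query vector. The sketch below shows an exhaustive version of that idea on toy 2-dimensional vectors. It only illustrates the ranking and makes no claim about how Pinecone implements search; the names `cosine` and `top_k` are made up for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, stored, k=3):
    """Return the k (id, score) pairs with the highest cosine similarity to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "index": ids mapped to made-up vectors
stored = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print([doc_id for doc_id, _ in top_k([1.0, 0.1], stored, k=2)])  # prints: ['a', 'c']
```

Note that a top-k query always returns k results, even when none of them is actually relevant to the question.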

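## Note
Below we show one example where RAG did not help and one where it did.

In the first attempt RAG was not useful: the retrieved contexts are unrelated to the question. We attribute this to embedding the whole (truncated) article text as a single retrieval unit. Another possible cause is that the source article is not among the 400 indexed articles, since the questions were generated from the full dataset: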

In [111]:
query = "What is the name of the district that Rocourt used to be a part of?"
augmented_prompt, source_knowledge = augment_prompt(query, model=model, index=index)
response = co.chat(
    model='command-r-plus',
    message=augmented_prompt,
)
print("Response: ", response.text)
print("Source: \n" + source_knowledge)

Response:  I don't know.
Source: 
Coden is a small fishing village near Bayou la Batre, Alabama, USA. It is about 20 miles southwest of Mobile, near the Alabama/Mississippi border. The name of the town comes from Coq d'Inde, which is French for "Turkey".

Around 1900, the area was known as a resort, which is a place people go to on their vacations.  The Rolston Hotel brought visitors from all over the region. When it was destroyed by a hurricane, the community fell on hard times. The Rolston Hotel property now belongs to the City Of Bayou La Batre and is a park that  is  attracting people from other areas who want cool ocean breezes and peace that originally brought visitors. It is nice because it has the gentle sound of the water of Portersville Bay, fishing, and relaxation. Fresh seafood can be found on Shell Belt Road from fishing boats returning to Bayou Coden. Coden is on the southern shore of the mainland, across the Mississippi Sound from Dauphin Island and is one stop along Ala
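
In the failed example above, the retrieved context has nothing to do with the question. One possible mitigation is to drop matches whose similarity score is low before building the prompt. The sketch below assumes each match is a dict with a `score` and a `metadata` entry, as in the query results used by `augment_prompt`. The helper name `build_context` and the 0.3 threshold are arbitrary illustrative choices, and the scores in the example are invented.

```python
def build_context(matches, min_score=0.3):
    """Join the texts of matches whose similarity score reaches `min_score`.

    `matches` is assumed to be a list of dicts shaped like
    {"score": float, "metadata": {"text": str}}.
    """
    kept = [m["metadata"]["text"] for m in matches if m["score"] >= min_score]
    return "\n\n".join(kept)


# Hand-made matches for illustration only; the scores are invented
fake_matches = [
    {"score": 0.62, "metadata": {"text": "Relevant article text."}},
    {"score": 0.12, "metadata": {"text": "Unrelated article text."}},
]
print(build_context(fake_matches))  # prints: Relevant article text.
```

An empty context then signals that nothing suitable was retrieved, which the calling code could handle without sending unrelated text to the LLM. A sensible threshold depends on the embedding model and the metric, so it would need to be tuned on real queries.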

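### Implementing more complex RAG
#### This code first asks the LLM for the noun we need information on. The intent is to embed that noun for retrieval; note that the code below only prints the noun and still retrieves with the full query.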

In [112]:
def ask_rag(query, model=model, index=index):
    """Extract the key noun of the question with the LLM, then answer the question with RAG"""
    prompt = "Given the following question, return only the name of the noun we need information on in order to solve the question \n" + query
    response = co.chat(
        model='command-r-plus',
        message=prompt,
    )
    print("needed noun: ", response.text)
    # Note: the extracted noun is only printed; retrieval below still embeds the full query
    augmented_prompt, source_knowledge = augment_prompt(query, model=model, index=index)
    response = co.chat(
        model='command-r-plus',
        message=augmented_prompt,
    )
    print("Response: ", response.text)
    print("Source: \n" + source_knowledge)
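
As written, `ask_rag` passes the original query to `augment_prompt`, so the extracted noun does not influence retrieval. A minimal sketch of retrieving with the extracted noun follows. To keep it runnable without API keys, the LLM call and the vector search are passed in as functions; `retrieve_for_question`, `fake_extract_noun` and `fake_retrieve` are hypothetical names used only here.

```python
from typing import Callable, List


def retrieve_for_question(
        question: str,
        extract_noun: Callable[[str], str],
        retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Retrieve contexts with the extracted noun, falling back to the full question."""
    noun = extract_noun(question).strip()
    # An empty extraction would give a meaningless embedding, so fall back to the question
    search_text = noun if noun else question
    return retrieve(search_text)


# Stand-ins for the LLM call and the vector search, for illustration only
def fake_extract_noun(question: str) -> str:
    return "Rocourt" if "Rocourt" in question else ""


def fake_retrieve(search_text: str) -> List[str]:
    return [f"contexts retrieved for: {search_text}"]


print(retrieve_for_question("What district was Rocourt part of?", fake_extract_noun, fake_retrieve))
```

In the notebook, `extract_noun` would wrap the `co.chat` call and `retrieve` would wrap the embedding and index query, while the final prompt would still contain the full question.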


In [119]:
ask_rag("What is a figure of speech that compares two different things?")

needed noun:  comparison
Response:  A figure of speech that compares two different things is a simile.
Source: 
A simile is a figure of speech that compares two different things, usually by using the words 'like' or 'as'. It is used to make a direct and clear comparison between two things .Similes may be confused with metaphors, which do the same kind of thing. Similes use comparisons, with the words 'like' or 'as'. Metaphors use indirect comparisons, without the words 'like' or 'as'.

Similes:
Like a hungry wolf, he ate the food.

A conceptual metaphor or cognitive metaphor is a metaphor which refers to one domain (group of ideas) in terms of another. For example, treating quantity in terms of direction:
Prices are rising.
I attacked every weak point in his argument. (Argument as war rather than enquiry or search for truth).
Life is a journey.

Ad hominem is a Latin word for a type of argument. It is a word often used in rhetoric. Rhetoric is the science of speaking well, and convinci

### We also tried the other 3 questions: only this one worked, while the others did not.
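
With more than a handful of questions, checking each response by eye becomes tedious. A rough sketch of a hit-rate computation is given below, assuming a response counts as a hit when it contains the expected answer string. The helper `hit_rate` is hypothetical, and the two entries are the two RAG responses shown in this notebook.

```python
def hit_rate(results):
    """Fraction of (expected, response) pairs where the response contains the expected answer."""
    hits = sum(1 for expected, response in results if expected.lower() in response.lower())
    return hits / len(results)


results = [
    ("simile", "A figure of speech that compares two different things is a simile."),
    ("Porrentruy", "I don't know."),
]
print(hit_rate(results))  # prints: 0.5
```

Substring matching is strict about wording, so paraphrased but correct answers would be counted as misses.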