In [3]:
!pip install sentence_transformers
!pip install pinecone
!pip install cohere
!pip install datasets


In [4]:
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from pinecone import Pinecone, ServerlessSpec
import os
from tqdm import tqdm
import cohere
import numpy as np
import json
import warnings
from IPython.display import display
warnings.filterwarnings("ignore")



In [2]:
# Load SQuAD 2.0 dataset
with open('train-v2.0.json', 'r', encoding='utf-8') as f:
    squad_data = json.load(f)

# Collect the answerable questions (is_impossible == False) with their contexts
answerable_questions = []

# Iterate through the dataset: articles -> paragraphs -> question/answer pairs
for article in squad_data['data']:
    for paragraph in article['paragraphs']:
        for qa in paragraph['qas']:
            if not qa['is_impossible']:
                answerable_questions.append({
                    'question': qa['question'],
                    'context': paragraph['context']
                })

# Print the first 5 example questions with is_impossible = False
for idx, qa in enumerate(answerable_questions[:5]):
    print(f"Example {idx+1}:")
    print(f"Question: {qa['question']}")
    print(f"Context: {qa['context']}")
    print()

print(f"Total number of questions with is_impossible = False: {len(answerable_questions)}")

Example 1:
Question: When did Beyonce start becoming popular?
Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

Example 2:
Question: What areas did Beyonce compete in when she was growing up?
Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actres
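
The nested loops above walk the SQuAD 2.0 JSON layout (data -> paragraphs -> qas). As a minimal sketch, the same filter can be tried on a tiny hand-written structure. The `toy_squad` dictionary and its contents are invented for illustration and are not taken from the dataset.

```python
# Toy stand-in for the SQuAD 2.0 layout: data -> paragraphs -> qas.
# The article, context and questions are made up for illustration only.
toy_squad = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The sky is blue.",
                    "qas": [
                        {"question": "What colour is the sky?", "is_impossible": False},
                        {"question": "Who painted the sky?", "is_impossible": True},
                    ],
                }
            ],
        }
    ]
}

# Keep only answerable questions, paired with the paragraph they belong to.
answerable = [
    {"question": qa["question"], "context": paragraph["context"]}
    for article in toy_squad["data"]
    for paragraph in article["paragraphs"]
    for qa in paragraph["qas"]
    if not qa["is_impossible"]
]

print(answerable)
```

Only the first question survives the filter, because the second one is flagged `is_impossible`.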

In [5]:
with open("cohere_API_key.txt") as f:
    COHERE_API_KEY = f.read().strip()
with open("pinecone_API_key.txt") as f:
    PINECONE_API_KEY = f.read().strip()
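
Reading keys from text files works, but an environment variable is a common alternative that keeps keys out of the working directory. The sketch below is one possible approach, not part of the original notebook: the helper name `read_api_key` and the variable name `COHERE_API_KEY` used in the comment are assumptions.

```python
import os
from pathlib import Path


def read_api_key(env_var: str, fallback_file: str) -> str:
    """Return the key from the environment if set, else from a local text file."""
    value = os.environ.get(env_var)
    if value:
        return value.strip()
    return Path(fallback_file).read_text(encoding="utf-8").strip()


# Hypothetical usage, mirroring the file names used above:
# COHERE_API_KEY = read_api_key("COHERE_API_KEY", "cohere_API_key.txt")
```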

In [24]:
# the standard QA model fails to answer this question
query = "What is the name of the main character in 'the beginning after the end'?"
co = cohere.Client(api_key=COHERE_API_KEY)
response = co.chat(
        model='command-r-plus',
        message=query,
    )

print(response.text)

The name of the main character in the novel series "The Beginning After the End" is Arthur Leywin, also known as Art or Artie. Arthur is the protagonist of the story, and the narrative follows his journey as he is transported to a magical world and reincarnated as the son of a powerful wizard.


Only the first part of this answer is correct: the main character in the novel and web series *The Beginning After the End* is Arthur Leywin, also known as Grey. The rest is nonsense.

In [6]:
from sentence_transformers import SentenceTransformer

EMBEDDING_MODEL = 'all-MiniLM-L6-v2'
model = SentenceTransformer(EMBEDDING_MODEL)


In [7]:
def chunk_documents(documents, chunk_size=64, overlap=8):
    """
    Split documents into chunks of specified size with an overlap.

    Args:
        documents (list): List of preprocessed documents.
        chunk_size (int): Size of each chunk (number of words).
        overlap (int): Number of overlapping words between chunks.

    Returns:
        list: A list of document chunks.
    """
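    if overlap >= chunk_size:
        # A non-positive step would never advance the window.
        raise ValueError("overlap must be smaller than chunk_size")

    document_chunks = []
    step = chunk_size - overlap

    for document in documents:
        words = document.split()
        start = 0
        while start < len(words):
            end = start + chunk_size
            document_chunks.append(' '.join(words[start:end]))
            if end >= len(words):
                # Final window reached; stop so the tail is not emitted twice.
                break
            start += step

    return document_chunks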
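
To see the overlap at work, here is a small sketch of the same sliding-window idea on a nine-word toy input. `window_chunks` is a simplified stand-in written for this illustration, not the function above, and the tiny `chunk_size` and `overlap` values are chosen only to keep the output readable.

```python
def window_chunks(words, chunk_size, overlap):
    """Slide a window of chunk_size words, advancing by chunk_size - overlap."""
    step = chunk_size - overlap
    return [" ".join(words[s:s + chunk_size]) for s in range(0, len(words), step)]


words = "a b c d e f g h i".split()
chunks = window_chunks(words, chunk_size=4, overlap=1)
print(chunks)  # ['a b c d', 'd e f g', 'g h i']
```

Each chunk starts with the last word of the previous one, which is the overlap of one word.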

In [9]:
def load_and_embedd_dataset(
        file_path: str = 'train-v2.0.json',
        dataset_name: str = 'squadv2',
        split: str = 'train',
        model: SentenceTransformer = SentenceTransformer('all-MiniLM-L6-v2'),
        text_field: str = 'context',
        chunk: bool = False,
        chunk_size: int = 64,
        overlap: int = 8,

        # rec_num: int = 400
) :
    """
    Load the SQuAD contexts from a local JSON file and embed them using a sentence-transformer model
    Args:
        file_path: Path to the SQuAD-format JSON file
        dataset_name: The name of the dataset (currently unused)
        split: The split of the dataset (currently unused)
        model: The model to use for embedding
        text_field: The key under which the texts are stored in the returned dataset
        chunk: Whether to split the contexts into smaller chunks
        chunk_size: Size of each chunk in words, if chunk is True
        overlap: Number of words shared by consecutive chunks, if chunk is True
    Returns:
        tuple: A tuple containing the dataset and the embeddings
    """


    print("Loading and embedding the dataset")
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)

    documents = []
    for article in data['data']:
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            # preprocessed_context = preprocess_text(context)
            documents.append(context)
    if chunk:
        documents = chunk_documents(documents, chunk_size, overlap)

    dataset = {text_field:documents}
    # documents_tojson = [{"context": doc} for doc in documents]
    # with open("docs_context.json", "w", encoding='utf-8') as doc:
    #     json.dump(documents_tojson, doc, ensure_ascii=False, indent=4)

    # dataset = load_dataset("json", data_files="docs_context.json")

    # Embed
    embeddings = model.encode(dataset[text_field])
    embeddings = np.array(embeddings)

    print("Done!")
    return dataset, embeddings

DATASET_NAME = 'squadv2'

dataset, embeddings = load_and_embedd_dataset(
    dataset_name=DATASET_NAME,
    model=model,
)
shape = embeddings.shape

Loading and embedding the dataset
Done!


In [10]:
def create_pinecone_index(
        index_name: str,
        dimension: int,
        metric: str = 'cosine',
):
    """
    Create a pinecone index if it does not exist
    Args:
        index_name: The name of the index
        dimension: The dimension of the index
        metric: The metric to use for the index
    Returns:
        Pinecone: A pinecone object which can later be used for upserting vectors and connecting to VectorDBs
    """
    from pinecone import Pinecone, ServerlessSpec
    print("Creating a Pinecone index...")
    pc = Pinecone(api_key=PINECONE_API_KEY)
    existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]
    if index_name not in existing_indexes:
        pc.create_index(
            name=index_name,
            dimension=dimension,
            # Important: the index metric must be one that your embedding model works well with,
            # because stored vectors are compared with this metric at query time.
            metric=metric,
            spec=ServerlessSpec(
                cloud="aws",
                region="us-east-1"
            )
        )
    print("Done!")
    return pc
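
The default `metric='cosine'` scores two vectors by the angle between them and ignores their length. A minimal pure-Python sketch of that score follows. The 2-dimensional vectors are invented for illustration, and `cosine_similarity` is a hypothetical helper, not a Pinecone call (the index computes its scores internally).

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]
same_direction = [3.0, 0.0]  # longer vector, same direction
orthogonal = [0.0, 2.0]      # unrelated direction

print(cosine_similarity(query_vec, same_direction))  # 1.0
print(cosine_similarity(query_vec, orthogonal))      # 0.0
```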

In [11]:
INDEX_NAME = 'squadv2'

# Create the vector database
# We are passing the index_name and the size of our embeddings
pc = create_pinecone_index(INDEX_NAME, shape[1])

Creating a Pinecone index...
Done!


In [12]:
def upsert_vectors(
        index: Pinecone,
        embeddings: np.ndarray,
        dataset: dict,
        text_field: str = 'context',
        batch_size: int = 128
):
    """
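    Upsert vectors to a pinecone index
    Args:
        index: The pinecone index object (as returned by pc.Index)
        embeddings: The embeddings to upsert, one row per text
        dataset: The dataset containing the texts stored as metadata
        text_field: The key in dataset (and in the metadata) that holds the text
        batch_size: The batch size to use for upserting
    Returns:
        The same pinecone index object, after the upserts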
    """
    print("Upserting the embeddings to the Pinecone index...")
    shape = embeddings.shape

    # print(dataset)
    ids = [str(i) for i in range(shape[0])]
    meta = [{text_field: text} for text in dataset[text_field]]

    # create list of (id, vector, metadata) tuples to be upserted
    to_upsert = list(zip(ids, embeddings, meta))

    for i in tqdm(range(0, shape[0], batch_size)):
        i_end = min(i + batch_size, shape[0])
        index.upsert(vectors=to_upsert[i:i_end])
    return index

# Upsert the embeddings to the Pinecone index
index = pc.Index(INDEX_NAME)
index_upserted = upsert_vectors(index, embeddings, dataset)

Upserting the embeddings to the Pinecone index...


100%|██████████| 149/149 [01:54<00:00,  1.30it/s]
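
The batching loop in `upsert_vectors` slices the list of vectors into consecutive ranges of at most `batch_size` items. The sketch below reproduces only that slicing arithmetic. `batch_ranges` is a hypothetical helper and the item count of 300 is an arbitrary example.

```python
import math


def batch_ranges(n_items, batch_size):
    """Return the (start, end) slice bounds used for each batch."""
    return [(i, min(i + batch_size, n_items)) for i in range(0, n_items, batch_size)]


ranges = batch_ranges(n_items=300, batch_size=128)
print(ranges)  # [(0, 128), (128, 256), (256, 300)]
print(len(ranges) == math.ceil(300 / 128))  # True
```

By the same arithmetic, the 149 batches in the progress bar above, at a batch size of 128, correspond to between 18,945 and 19,072 upserted vectors.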


In [None]:
def augment_prompt(
        query: str,
        model: SentenceTransformer = SentenceTransformer('all-MiniLM-L6-v2'),
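        index=None,
) -> tuple[str, str]:
    """
    Augment the prompt with the top 3 results from the knowledge base
    Args:
        query: The query to augment
        model: The model used to embed the query (must match the model used to build the index)
        index: The vectorstore object
    Returns:
        tuple: The augmented prompt and the retrieved source knowledge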
    """
    results = [float(val) for val in list(model.encode(query))]

    # get top 3 results from knowledge base
    query_results = index.query(
        vector=results,
        top_k=3,
        include_values=True,
        include_metadata=True
    )['matches']
    text_matches = [match['metadata']['context'] for match in query_results]

    # get the text from the results
    source_knowledge = "\n\n".join(text_matches)

    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.
    Contexts:
    {source_knowledge}
    If the answer is not included in the source knowledge - say that you don't know.
    Query: {query}"""
    return augmented_prompt, source_knowledge
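
The prompt assembly can be tried without a live index. The sketch below uses a hand-written `fake_matches` list shaped like the `matches` entries read by `augment_prompt`. The ids, scores and texts are invented for illustration, and the prompt is built without the leading indentation that the triple-quoted string above carries.

```python
# Hypothetical matches, shaped like the 'matches' list used in augment_prompt.
fake_matches = [
    {"id": "0", "score": 0.91, "metadata": {"context": "Paris is the capital of France."}},
    {"id": "1", "score": 0.55, "metadata": {"context": "France is in Europe."}},
]

query = "What is the capital of France?"

# Join the retrieved texts with a blank line between them.
source_knowledge = "\n\n".join(m["metadata"]["context"] for m in fake_matches)

# Place the contexts before the query, with the same instructions as above.
prompt = (
    "Using the contexts below, answer the query.\n"
    f"Contexts:\n{source_knowledge}\n"
    "If the answer is not included in the source knowledge - say that you don't know.\n"
    f"Query: {query}"
)
print(prompt)
```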

In [14]:
co = cohere.Client(api_key=COHERE_API_KEY)

In [30]:
query = 'What is the estimate of tonnes of bombs an enemy bomber planes could drop per day on london before world war II began?'


response = co.chat(
        model='command-r-plus',
        message=query,
    )
print('Without RAG:\n')
print(response.text)
print('\n')
print('With RAG:\n')
augmented_prompt, source_knowledge = augment_prompt(query, model=model, index=index)
response = co.chat(
        model='command-r-plus',
        message=augmented_prompt,
    )
print(response.text)

Without RAG:

Before World War II, the estimated tonnage of bombs that enemy bomber planes could drop on London per day was approximately 4,000 tons. This estimate was based on the capabilities of the Luftwaffe, the German air force, and the assumptions about their strategies and tactics.

The Luftwaffe had developed a formidable bombing capability by the late 1930s, with modern aircraft such as the Junkers Ju 87 Stuka dive bomber and the Heinkel He 111 medium bomber. They had also refined their tactics, including the use of massed formations of bombers and precision bombing techniques.

British estimates of the Luftwaffe's capabilities played a crucial role in shaping their defense strategies. The development of the RAF Fighter Command and the construction of air raid shelters and other defenses in London were influenced by these estimates.

It's important to note that the actual bombing raids on London during the Blitz, which began in September 1940, did not always match the pre-war 

In [36]:
query = 'What were the main provisions of the Treaty of Lausanne regarding the populations of the two sides?'


response = co.chat(
        model='command-r-plus',
        message=query,
    )
print('Without RAG:\n')
print(response.text)
print('\n')
print('With RAG:\n')
augmented_prompt, source_knowledge = augment_prompt(query, model=model, index=index)
response = co.chat(
        model='command-r-plus',
        message=augmented_prompt,
    )
print(response.text)

Without RAG:

The Treaty of Lausanne, signed in 1923, was an agreement between Turkey and the Allied Powers, including Greece, that finalized the terms of their peace after the Turkish War of Independence and resolved issues left unsettled by the earlier Treaty of Sèvres. Here were the main provisions of the Treaty of Lausanne regarding the populations of Turkey and the Allied Powers:

1. Population Exchange: The treaty oversaw a massive population exchange between Greece and Turkey. Approximately 1.5 million Christians (mostly Greeks) living in Turkey were required to leave and settle in Greece, and around 500,000 Muslims (mostly Turks) living in Greece were forced to move to Turkey. This population exchange was intended to reduce tensions and create more homogeneous nation-states.

2. Protection of Minorities: The treaty included provisions for the protection of minority rights in both Turkey and Greece. Each country agreed to grant full civil and political rights to the remaining mi

In [35]:
query = "How many records did beyonce sell as a solo artist, and with Destiny\'s Child"


response = co.chat(
        model='command-r-plus',
        message=query,
    )
print('Without RAG:\n')
print(response.text)
print('\n')
print('With RAG:\n')
augmented_prompt, source_knowledge = augment_prompt(query, model=model, index=index)
response = co.chat(
        model='command-r-plus',
        message=augmented_prompt,
    )
print(response.text)

Without RAG:

Beyoncé has sold over 200 million records worldwide as a solo artist. With Destiny's Child, they sold over 60 million records worldwide.


With RAG:

As a solo artist, Beyoncé has sold over 118 million records, and with Destiny's Child, she has sold an additional 60 million. This makes her one of the best-selling music artists of all time.
