<a href="https://colab.research.google.com/github/MShiloni22/Data_Analysis_Presentation_Lab_HW01/blob/main/HW01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installations

In [1]:
!pip install transformers sentence-transformers datasets cohere pinecone



In [2]:
!pip install kaggle  # relevant for database upload



# Imports

In [3]:
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import pandas as pd
import os
from tqdm import tqdm
import cohere
import numpy as np
import warnings
from IPython.display import display
from pinecone import Pinecone, ServerlessSpec
from google.colab import files
warnings.filterwarnings("ignore")



# RAG Pipeline:

* An embedding model (for example, a sentence-transformer)

* A vector database: we will use the free-to-use Pinecone API (limited to 100k vectors)

* An LLM to chat with: we will use the Cohere Chat API (similar to OpenAI's ChatGPT, but free)

# APIs

In [4]:
# Read the API keys from the environment instead of hard-coding secrets in the notebook
COHERE_API_KEY = os.environ['COHERE_API_KEY']
PINECONE_API_KEY = os.environ['PINECONE_API_KEY']

# Embedding Model

In [5]:
EMBEDDING_MODEL = 'all-MiniLM-L6-v2'
model = SentenceTransformer(EMBEDDING_MODEL)

### Load Dataset Into A Pandas Dataframe

In [6]:
# Upload your Kaggle API token (kaggle.json); the commands below expect that file name
# Dataset reference: https://www.kaggle.com/datasets/tanishqdublish/text-classification-documentation
files.upload()

# Move the API Key to the right location
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download the Dataset
!kaggle datasets download -d sunilthite/text-document-classification-dataset

# Unzip the Dataset
!unzip text-document-classification-dataset.zip -d /content/dataset

# Load the Dataset
dataset_dir = '/content/dataset'
# Use a separate name so the imported google.colab `files` module is not shadowed
dataset_files = os.listdir(dataset_dir)
print(dataset_files)

# Load CSV file into a DataFrame
csv_file = os.path.join(dataset_dir, 'df_file.csv')
data = pd.read_csv(csv_file)

# Display the first few rows of the dataframe
data.head()

Saving df_file.csv.zip to df_file.csv.zip
mv: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
Dataset URL: https://www.kaggle.com/datasets/sunilthite/text-document-classification-dataset
License(s): other
Downloading text-document-classification-dataset.zip to /content
 54% 1.00M/1.85M [00:00<00:00, 2.08MB/s]
100% 1.85M/1.85M [00:00<00:00, 3.35MB/s]
Archive:  text-document-classification-dataset.zip
  inflating: /content/dataset/df_file.csv  
['df_file.csv']


| Unnamed: 0 | Text | Label |
| --- | --- | --- |
| 0 | Budget to set scene for election\n \n Gordon B... | 0 |
| 1 | Army chiefs in regiments decision\n \n Militar... | 0 |
| 2 | Howard denies split over ID cards\n \n Michael... | 0 |
| 3 | Observers to monitor UK election\n \n Minister... | 0 |
| 4 | Kilroy names election seat target\n \n Ex-chat... | 0 |


### Load and Embed

In [7]:
def load_and_embedd_dataset(
        file_path: str,
        model: SentenceTransformer = SentenceTransformer('all-MiniLM-L6-v2'),
        text_field: str = 'text'
) -> tuple:
    """
    Load a dataset from a CSV file and embed the text field using a sentence-transformer model
    Args:
        file_path: The path to the CSV file containing the dataset
        model: The model to use for embedding
        text_field: The field in the dataset that contains the text
    Returns:
        tuple: A tuple containing the dataset and the embeddings
    """
    print("Loading and embedding the dataset")

    # Load the dataset
    dataset = pd.read_csv(file_path)

    # Embed the text field of the dataset
    embeddings = model.encode(dataset[text_field].tolist())

    print("Done!")
    return dataset, embeddings

# Set the path to the CSV file
csv_file = '/content/dataset/df_file.csv'

# Load and embed the dataset
model = SentenceTransformer('all-MiniLM-L6-v2')
dataset, embeddings = load_and_embedd_dataset(
    file_path=csv_file,
    text_field='Text',
    model=model
)

# Display the shape of the embeddings
shape = embeddings.shape
print(f"Embeddings shape: {shape}")

# Display the first few rows of the dataset
dataset.head()

Loading and embedding the dataset
Done!
Embeddings shape: (2225, 384)


| Unnamed: 0 | Text | Label |
| --- | --- | --- |
| 0 | Budget to set scene for election\n \n Gordon B... | 0 |
| 1 | Army chiefs in regiments decision\n \n Militar... | 0 |
| 2 | Howard denies split over ID cards\n \n Michael... | 0 |
| 3 | Observers to monitor UK election\n \n Minister... | 0 |
| 4 | Kilroy names election seat target\n \n Ex-chat... | 0 |


In [8]:
# def load_and_embedd_dataset(
#         dataset_name: str = 'cnn_dailymail',
#         split: str = 'train',
#         model: SentenceTransformer = SentenceTransformer('all-MiniLM-L6-v2'),
#         text_field: str = 'highlights',
#         rec_num: int = 400
# ) -> tuple:
#     """
#     Load a dataset and embedd the text field using a sentence-transformer model
#     Args:
#         dataset_name: The name of the dataset to load
#         split: The split of the dataset to load
#         model: The model to use for embedding
#         text_field: The field in the dataset that contains the text
#         rec_num: The number of records to load and embedd
#     Returns:
#         tuple: A tuple containing the dataset and the embeddings
#     """
#     from datasets import load_dataset

#     print("Loading and embedding the dataset")

#     # Load the dataset
#     dataset = load_dataset(dataset_name, split=split)

#     # Embed the first `rec_num` rows of the dataset
#     embeddings = model.encode(dataset[text_field])

#     print("Done!")
#     return dataset, embeddings

In [9]:
# DATASET_NAME = "Teejeigh/raw_friends_series_transcript"

# dataset, embeddings = load_and_embedd_dataset(
#     dataset_name = DATASET_NAME,
#     text_field = 'text',
#     rec_num = 100,
#     model=model,
# )
# shape = embeddings.shape

# Vector Database

We will use Pinecone


## Creating Database

In [10]:
def create_pinecone_index(
        index_name: str,
        dimension: int,
        metric: str = 'cosine',
):
    """
    Create a pinecone index if it does not exist
    Args:
        index_name: The name of the index
        dimension: The dimension of the index
        metric: The metric to use for the index
    Returns:
        Pinecone: A pinecone object which can later be used for upserting vectors and connecting to VectorDBs
    """
    print("Creating a Pinecone index...")
    pc = Pinecone(api_key=PINECONE_API_KEY)
    existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]
    if index_name not in existing_indexes:
        pc.create_index(
            name=index_name,
            dimension=dimension,
            # Remember! It is crucial that the metric you will use in your VectorDB will also be a metric your embedding
            # model works well with!
            metric=metric,
            spec=ServerlessSpec(
                cloud="aws",
                region="us-east-1"
            )
        )
    print("Done!")
    return pc
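
The comment inside `create_pinecone_index` about matching the metric to the embedding model can be illustrated with a small NumPy sketch. The vectors below are made up for illustration (they are not output of `all-MiniLM-L6-v2`). The sketch only shows that cosine similarity ignores vector length while the dot product does not, so the two metrics can rank the same documents differently unless the vectors are L2-normalized.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (independent of their lengths)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def unit(v: np.ndarray) -> np.ndarray:
    """Scale a vector to length 1 (L2 normalization)."""
    return v / np.linalg.norm(v)


# Toy vectors, invented for this example only.
query = np.array([1.0, 0.0])
doc_a = np.array([2.0, 0.0])   # same direction as the query
doc_b = np.array([3.0, 3.0])   # different direction, but larger norm

cos_a = cosine_similarity(query, doc_a)   # 1.0
cos_b = cosine_similarity(query, doc_b)   # about 0.707
dot_a = float(np.dot(query, doc_a))       # 2.0
dot_b = float(np.dot(query, doc_b))       # 3.0

# After normalization, the dot product coincides with cosine similarity.
normalized_dot = float(np.dot(unit(query), unit(doc_b)))

print(f"cosine ranks doc_a first: {cos_a > cos_b}")
print(f"raw dot product ranks doc_b first: {dot_b > dot_a}")
```

With cosine, `doc_a` is the closer match. With the raw dot product, the longer `doc_b` wins, which is why the index metric should agree with how the embedding model is meant to be compared.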

In [11]:
INDEX_NAME = 'hw01part03'

# Create the vector database
# We are passing the index_name and the size of our embeddings
pc = create_pinecone_index(INDEX_NAME, shape[1])

Creating a Pinecone index...
Done!


## Upserting

In [12]:
def upsert_vectors(
        index,
        embeddings: np.ndarray,
        dataset: pd.DataFrame,
        text_field: str = 'Text',
        batch_size: int = 128
):
    """
    Upsert vectors to a pinecone index
    Args:
        index: The pinecone index object (as returned by `pc.Index(...)`)
        embeddings: The embeddings to upsert
        dataset: The DataFrame containing the text stored as metadata
        text_field: The column in the dataset that contains the text
        batch_size: The batch size to use for upserting
    Returns:
        The pinecone index object that was passed in, after upserting
    """
    print("Upserting the embeddings to the Pinecone index...")
    shape = embeddings.shape

    ids = [str(i) for i in range(shape[0])]
    meta = [{text_field: text} for text in dataset[text_field]]

    # create list of (id, vector, metadata) tuples to be upserted
    to_upsert = list(zip(ids, embeddings, meta))

    for i in tqdm(range(0, shape[0], batch_size)):
        i_end = min(i + batch_size, shape[0])
        index.upsert(vectors=to_upsert[i:i_end])
    return index
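
The batching loop in `upsert_vectors` can be looked at in isolation. The following is a minimal sketch that does not call Pinecone; `iter_batches` is a hypothetical helper name introduced here. It shows how slicing in steps of `batch_size` splits 2225 items into the 18 batches reported by the progress bar for this notebook's run.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe in Python, so the last batch is simply shorter.
        yield items[start:start + batch_size]


# Stand-in for the (id, vector, metadata) tuples; only the count matters here.
fake_records = list(range(2225))
batches = list(iter_batches(fake_records, 128))

print(len(batches))       # 18
print(len(batches[-1]))   # 49
```

Because slicing never runs past the end of the sequence, the explicit `min(i + batch_size, shape[0])` bound in the original loop is a safeguard rather than a requirement.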


In [13]:
# Upsert the embeddings to the Pinecone index
index = pc.Index(INDEX_NAME)
index_upserted = upsert_vectors(index, embeddings, dataset)

Upserting the embeddings to the Pinecone index...


100%|██████████| 18/18 [00:11<00:00,  1.60it/s]


In [14]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 2225}},
 'total_vector_count': 2225}

# LLM

We will use [Cohere's chat API](https://cohere.com/chat)

## LLM Setup and First Trial

Without additional context

In [15]:
co = cohere.Client(api_key=COHERE_API_KEY)

# List of questions
questions = [
    "Who did Nadal win to put Spain 2-0 in 2004?",
    "What the maximum potential punishment for Luis Aragones receive for his racist comments about Thierry Henry?",
    "How old was  Chelsea left-back Wayne Bridge when he was suspected for a broken ankle injury?",
    "Which teams were supposed to play in the event at Real Madrid's Bernabeu stadium that led to the evacuation of spectators due to a bomb scare?",
    "What did Steven Gerrard say that prompted Rafael Benitez to issue a warning in 2004?"
]

# Iterate over the list of questions and get responses
for question in questions:
    response = co.chat(
        model='command-r-plus',
        message=question,
    )
    print(f"Question: {question}")
    print(f"Answer: {response.text}\n")

Question: Who did Nadal win to put Spain 2-0 in 2004?
Answer: Nadal defeated Tomas Berdych of the Czech Republic in the 2004 Davis Cup final to give Spain a 2-0 lead and put them on the brink of their second Davis Cup title.

Question: What the maximum potential punishment for Luis Aragones receive for his racist comments about Thierry Henry?
Answer: The maximum potential punishment for Luis Aragones' racist comments about Thierry Henry would depend on the specific legal context and any regulatory guidelines in place at the time of the incident. However, I can provide some general information on potential consequences for racist remarks in a sports context.

In most countries, racist comments made in public, especially by high-profile individuals, can be considered a form of hate speech or incitement to racial hatred. Such actions are often prohibited by law and can carry legal consequences, including but not limited to:

1. Criminal charges: Depending on the jurisdiction, Aragones cou

## Connection with DB and Second Trial

In [20]:
def augment_prompt(
        query: str,
        model: SentenceTransformer = SentenceTransformer('all-MiniLM-L6-v2'),
        index=None,
) -> tuple:
    """
    Augment the prompt with the top 3 results from the knowledge base
    Args:
        query: The query to augment
        model: The model used to embed the query
        index: The vectorstore object
    Returns:
        tuple: The augmented prompt and the retrieved source knowledge
    """
    results = [float(val) for val in list(model.encode(query))]

    # get top 3 results from knowledge base
    query_results = index.query(
        vector=results,
        top_k=3,
        include_values=True,
        include_metadata=True
    )['matches']
    text_matches = [match['metadata']['Text'] for match in query_results]

    # get the text from the results
    source_knowledge = "\n\n".join(text_matches)

    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.
    Contexts:
    {source_knowledge}
    If the answer is not included in the source knowledge - say that you don't know.
    Query: {query}"""
    return augmented_prompt, source_knowledge

In [21]:
# Initialize a list to store results
results = []

# Iterate over the list of questions, augment each question, and get responses
for question in questions:
    augmented_prompt, source_knowledge = augment_prompt(question, model=model, index=index)  # Adjust model and index as needed
    response = co.chat(
        model='command-r-plus',
        message=augmented_prompt,
    )

    # Store the question, answer, and source_knowledge in results
    results.append({
        'question': question,
        'answer': response.text,
        'source_knowledge': source_knowledge
    })

    # Print the question, answer, and source_knowledge
    print(f"Question: {question}")
    print(f"Answer: {response.text}")
    print(f"Source Knowledge: {source_knowledge}\n")



Question: Who did Nadal win to put Spain 2-0 in 2004?
Answer: Nadal beat Andy Roddick to put Spain 2-0 up in the 2004 Davis Cup final.
Source Knowledge: Nadal puts Spain 2-0 up
 
 Result: Nadal 6-7 (6/8) 6-2 7-6 (8/6) 6-2 Roddick
 
 Spain's Rafael Nadal beats Andy Roddick of the USA in the second singles match rubber of the 2004 Davis Cup final in Seville. Spain lead 1-0 after Carlos Moya beat Mardy Fish in straight sets in the opening match of the tie.
 
 Nadal holds his nerve and the crowd goes wild as Spain go 2-0 up in the tie.
 
 Roddick holds serve to force Nadal to serve for the match but the American surely cannot turn things around now.
 
 Nadal works Roddick around the court on two consecutive points to earn two break points. One is enough, the Spaniard secures the double-break and Roddick is now teetering on the edge.
 
 Roddick is trying to gee himself up but the clay surface is taking its toll on his game and he is looking tired. Nadal wins the game to love.
 
 Nadal steps

# Part 3 - Report Setup

## Dataset Description

### Information From Dataset's Kaggle Page

[link](https://www.kaggle.com/datasets/tanishqdublish/text-classification-documentation)

**Dataset Name:** Text Document Classification Dataset

**Source:** Kaggle (sunilthite/text-document-classification-dataset)

**Content:** This dataset consists of 2225 text documents categorized into five distinct classes: politics, sport, tech, entertainment, and business.

**Features:**


*   Text: Contains the textual content of the documents across various categories.
*   Label: Represents the category labels assigned to each document, ranging from 0 to 4:

  *   0: Politics
  *   1: Sport
  *   2: Technology
  *   3: Entertainment
  *   4: Business



### Rationale for Dataset Selection

The chosen dataset is suitable because it contains diverse categories of text documents. QA models may struggle to accurately answer questions from such a dataset due to:



*   Varied vocabulary and context across different domains (politics, sport, tech, etc.).
*   Ambiguities and nuances in language specific to each category.
*   Potential for model hallucinations, where incorrect information is generated due to insufficient understanding of context or domain-specific knowledge.

## QA Failures vs RAG Successes

1. **Question: Who did Nadal win to put Spain 2-0 in 2004?**

First Answer (QA Model Only): Nadal defeated Tomas Berdych of the Czech Republic in the 2004 Davis Cup final to give Spain a 2-0 lead.

Second Answer (RAG Pipeline): Nadal beat Andy Roddick to put Spain 2-0 up in the 2004 Davis Cup final.

2. **Question: What the maximum potential punishment for Luis Aragones receive for his racist comments about Thierry Henry?**

First Answer (QA Model Only): Provided a detailed response including potential consequences such as public condemnation, fines, legal repercussions, suspension, and loss of endorsements.

Second Answer (RAG Pipeline): The maximum potential punishment for Luis Aragones' racist comments about Thierry Henry was a fine of about £22,000 or the suspension of his coaching license.

3. **Question: How old was Chelsea left-back Wayne Bridge when he was suspected for a broken ankle injury?**

First Answer (QA Model Only): Simply stated "28".

Second Answer (RAG Pipeline): Correctly answered "24".

4. **Question: Which teams were supposed to play in the event at Real Madrid's Bernabeu stadium that led to the evacuation of spectators due to a bomb scare?**

First Answer (QA Model Only): Incorrectly mentioned the 2014 UEFA Champions League Final.

Second Answer (RAG Pipeline): Real Madrid and Real Sociedad.

5. **Question: What did Steven Gerrard say that prompted Rafael Benitez to issue a warning in 2004?**

First Answer (QA Model Only): Provided details about Gerrard's dissatisfaction with Liverpool's season and his potential departure.

Second Answer (RAG Pipeline): Steven Gerrard expressed doubts about Liverpool winning the Champions League that year.

***Each answer given by the RAG pipeline is supported by a text from the dataset, stored in `source_knowledge`***
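
One rough way to spot-check this kind of support is to test whether the key names in an answer occur in the retrieved text. The sketch below is only a crude keyword check, not a real evaluation method. `mentions_all` is a hypothetical helper, the terms are chosen by hand, and the source sentence is copied from the retrieved text printed for the Nadal question.

```python
def mentions_all(text: str, terms: list[str]) -> bool:
    """Return True if every term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


# One sentence from the source knowledge retrieved for the Nadal question.
source_knowledge = (
    "Spain's Rafael Nadal beats Andy Roddick of the USA in the second singles "
    "match rubber of the 2004 Davis Cup final in Seville."
)

# Hand-picked key terms from the two answers (an assumption of this sketch).
rag_supported = mentions_all(source_knowledge, ["Nadal", "Roddick"])
baseline_supported = mentions_all(source_knowledge, ["Nadal", "Berdych"])

print(f"RAG answer supported: {rag_supported}")
print(f"QA-only answer supported: {baseline_supported}")
```

A check like this can flag answers that name entities absent from the context, but it cannot confirm that the answer states the right relationship between them.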

## Retrieval System

We used Pinecone as the vector index for retrieving documents by semantic similarity. Each query was encoded with the same SentenceTransformer model used for the documents (`all-MiniLM-L6-v2`), which produces 384-dimensional embeddings. The index, created with the cosine metric, returned the 3 documents whose embeddings were most similar to the query embedding.
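
The retrieval step can be sketched as a brute-force search in NumPy. This is an illustration of top-k retrieval by cosine similarity, not a description of how Pinecone searches internally. The function name `top_k_cosine` and the 2-dimensional toy vectors are assumptions made for the example.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> list:
    """Return (row index, cosine score) for the k rows of doc_vecs closest to query_vec."""
    # Normalize rows and query so that a dot product equals cosine similarity.
    doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ query_norm
    # Sort by descending score and keep the first k indices.
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


# Toy "document embeddings", invented for illustration.
docs = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
query = np.array([1.0, 0.05])

matches = top_k_cosine(query, docs, k=3)
for doc_id, score in matches:
    print(doc_id, round(score, 4))
```

In the notebook, `index.query(..., top_k=3)` plays the role of `top_k_cosine`, and the returned metadata supplies the document text.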

## Prompts Used in RAG Pipeline

Augmented prompts were constructed by inserting the source knowledge retrieved from the vector index into the message sent to the generative model (`command-r-plus`). Each prompt included:

*   Contexts: Concatenated text from the top 3 retrieved documents, providing background and supporting information related to the query.
*   Instruction: A request to answer using the contexts and to say that it does not know when the answer is not included in them.
*   Query: The original question for which the response was generated.
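
The prompt layout can be shown without calling any API. The sketch below mirrors the wording of the template in `augment_prompt`, but `build_prompt` is a hypothetical standalone helper and the contexts are placeholders, not retrieved documents.

```python
def build_prompt(query: str, contexts: list[str]) -> str:
    """Join the retrieved contexts and wrap them with the instruction and the query."""
    source_knowledge = "\n\n".join(contexts)
    return (
        "Using the contexts below, answer the query.\n"
        "Contexts:\n"
        f"{source_knowledge}\n"
        "If the answer is not included in the source knowledge - say that you don't know.\n"
        f"Query: {query}"
    )


# Placeholder contexts standing in for the top 3 retrieved documents.
example_contexts = [
    "first retrieved document",
    "second retrieved document",
    "third retrieved document",
]
prompt = build_prompt("example question?", example_contexts)
print(prompt)
```

Keeping the fallback instruction in the prompt is what lets the model decline to answer when retrieval returns nothing relevant.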

## Additional Insights

Comparing the two sets of answers suggests that the Retrieval-Augmented Generation (RAG) pipeline reduces hallucinations and improves answer accuracy. The evidence is anecdotal, since it rests on only five questions:

*  Contextual Accuracy: The RAG answers were grounded in text retrieved from the dataset. For instance, the pipeline identified that Rafael Nadal defeated Andy Roddick, not Tomas Berdych, in the 2004 Davis Cup final.

*  Precision in Legal Matters: When asked about the potential punishment for Luis Aragones' racist comments, the RAG pipeline named specific penalties (a fine of about £22,000 or the suspension of his coaching license), whereas the QA model listed only generic consequences.

*  Detail and Specificity: For Wayne Bridge's age at the time of his injury, the RAG pipeline correctly answered "24", while the QA model answered "28".

*  Event Specificity: Regarding the evacuation of Real Madrid's Bernabeu stadium, the RAG pipeline correctly identified the teams as Real Madrid and Real Sociedad, whereas the QA model erroneously mentioned the 2014 UEFA Champions League Final.

*  Understanding Historical Context: When asked about the Steven Gerrard statement that prompted Rafael Benitez's warning in 2004, the RAG pipeline reported that Gerrard had expressed doubts about Liverpool's Champions League prospects, while the QA model described his dissatisfaction with Liverpool's season and his potential departure.