# LLM Case Study 2 - RAG pipeline

### Shusrith S PES1UG22AM155
### Smera Arun Setty PES1UG22AM922

# Fetching the data

In [1]:
import kagglehub

path = kagglehub.dataset_download("shusrith/rag-dataset")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/shusrith/rag-dataset/versions/2


# Installing libraries

In [15]:
!pip install pypdf2 pdfplumber faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (30.7 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.10.0


# Extracting text from PDF

`pdfplumber` is used to extract all the text from the PDF, page by page.

In [2]:
import pdfplumber

def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() may return None for a page without text in some pdfplumber versions
            text += (page.extract_text() or "") + "\n"
    return text

pdf_path = f"{path}/dataset/combined_document_10.pdf"
text = extract_text_from_pdf(pdf_path)
print(text[:1000])

SHARE REPURCHASES AND DIVIDENDS
Share Repurchases
On September 16, 2013, our Board of Directors approved a share repurchase program authorizing up to $40.0 billion in
share repurchases. The share repurchase program became effective on October 1, 2013, has no expiration date, and may
be suspended or discontinued at any time without notice. This share repurchase program replaced the share repurchase
program that was announced on September 22, 2008 and expired on September 30, 2013. As of June 30, 2016, $7.1 billion
remained of our $40.0 billion share repurchase program. All repurchases were made using cash resources.
We repurchased the following shares of common stock under the above-described repurchase plans:
(In millions) Shares Amount Shares Amount Shares Amount
Year Ended June 30, 2016 2015 2014 (a)
First quarter 89 $ 4,000 43 $ 2,000 47 $ 1,500
Second quarter 66 3,600 43 2,000 53 2,000
Third quarter 69 3,600 116 5,000 47 1,791
Fourth quarter 70 3,600 93 4,209 28 1,118
Total $ 14,80

# Cleaning the data

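Python's `re` module is used to clean the text: runs of newlines are collapsed into one, and every character other than letters, digits, whitespace, and basic punctuation is removed. As the output below shows, this also strips currency symbols, percent signs, hyphens, and parentheses.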

In [None]:
import re

def clean_text(text):
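    # Collapse runs of newlines into a single newline
    text = re.sub(r'\n+', '\n', text)
    # Keep only letters, digits, whitespace, and basic punctuation
    text = re.sub(r'[^a-zA-Z0-9.,!?;:\'"\s]', '', text)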
    return text.strip()

cleaned_text = clean_text(text)
print(cleaned_text[:1000])


SHARE REPURCHASES AND DIVIDENDS
Share Repurchases
On September 16, 2013, our Board of Directors approved a share repurchase program authorizing up to 40.0 billion in
share repurchases. The share repurchase program became effective on October 1, 2013, has no expiration date, and may
be suspended or discontinued at any time without notice. This share repurchase program replaced the share repurchase
program that was announced on September 22, 2008 and expired on September 30, 2013. As of June 30, 2016, 7.1 billion
remained of our 40.0 billion share repurchase program. All repurchases were made using cash resources.
We repurchased the following shares of common stock under the abovedescribed repurchase plans:
In millions Shares Amount Shares Amount Shares Amount
Year Ended June 30, 2016 2015 2014 a
First quarter 89  4,000 43  2,000 47  1,500
Second quarter 66 3,600 43 2,000 53 2,000
Third quarter 69 3,600 116 5,000 47 1,791
Fourth quarter 70 3,600 93 4,209 28 1,118
Total  14,800
294 295  1
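
Because the source is a financial filing, stripping `$`, `%`, `-`, and parentheses can change how figures read (for example, `above-described` becomes `abovedescribed` and `$ 4,000` loses its currency symbol). The following is a minimal sketch of a less aggressive cleaner, assuming it is acceptable to keep those characters. `clean_text_keep_symbols` is a hypothetical name and is not part of the original notebook.

```python
import re

def clean_text_keep_symbols(text):
    """Hypothetical variant of clean_text that also keeps $ % - ( )."""
    # Collapse runs of newlines into a single newline
    text = re.sub(r"\n+", "\n", text)
    # Same whitelist as clean_text, extended with $, %, (, ) and -
    text = re.sub(r"[^a-zA-Z0-9.,!?;:'\"\s$%()\-]", "", text)
    return text.strip()

sample = "above-described plans: $ 4,000 (2)% #"
print(clean_text_keep_symbols(sample))
```

Whether the extra symbols help or hurt retrieval would need to be checked against the embedding model; this sketch only shows how the whitelist could be widened.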

# Chunking

The cleaned text is split into chunks of 128 words each, which are then indexed in the vector database.

In [4]:
def split_into_chunks(text, chunk_size=128):
    words = text.split()
    chunks = [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks

chunks = split_into_chunks(cleaned_text)
print(f"Total Chunks: {len(chunks)}")
print(chunks[:2])

Total Chunks: 31
['SHARE REPURCHASES AND DIVIDENDS Share Repurchases On September 16, 2013, our Board of Directors approved a share repurchase program authorizing up to 40.0 billion in share repurchases. The share repurchase program became effective on October 1, 2013, has no expiration date, and may be suspended or discontinued at any time without notice. This share repurchase program replaced the share repurchase program that was announced on September 22, 2008 and expired on September 30, 2013. As of June 30, 2016, 7.1 billion remained of our 40.0 billion share repurchase program. All repurchases were made using cash resources. We repurchased the following shares of common stock under the abovedescribed repurchase plans: In millions Shares Amount Shares Amount Shares Amount Year Ended June 30, 2016 2015 2014 a First quarter 89', '4,000 43 2,000 47 1,500 Second quarter 66 3,600 43 2,000 53 2,000 Third quarter 69 3,600 116 5,000 47 1,791 Fourth quarter 70 3,600 93 4,209 28 1,118 Total
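
Fixed-size chunks can cut a table in half: in the output above, the `First quarter 89` label ends the first chunk while its amounts start the second. One common mitigation is to let consecutive chunks overlap. The following is a minimal sketch, assuming a word-level overlap is acceptable. `split_with_overlap` and its `overlap` parameter are hypothetical and not part of the original notebook.

```python
def split_with_overlap(text, chunk_size=128, overlap=32):
    """Split text into word chunks where neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

demo = " ".join(f"w{i}" for i in range(10))
print(split_with_overlap(demo, chunk_size=4, overlap=2))
```

With `overlap=0` this reduces to the same non-overlapping split as `split_into_chunks`.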

# Embedding 

The `all-MiniLM-L6-v2` model from Hugging Face, loaded through `sentence-transformers`, produces the vector embeddings for the text chunks. It outputs vectors of dimension 384, which can be stored and indexed in a vector database.

In [5]:
from sentence_transformers import SentenceTransformer

encoding_model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = encoding_model.encode(chunks)
print(embeddings.shape)


(31, 384)


# Indexing in FAISS

The embeddings are indexed in FAISS with `IndexFlatL2`, a flat (exhaustive) index that compares the query against every stored vector using L2 (Euclidean) distance. A dictionary maps each index position back to its chunk text, so that search results can be converted to text and passed to an LLM as context. A sample query is shown below; the reported distance is the squared L2 distance, which is what `IndexFlatL2` returns.

In [9]:
import faiss
import numpy as np
d = embeddings.shape[1]
text_index = faiss.IndexFlatL2(d)
text_index.add(embeddings)
index_to_text = {i: chunks[i] for i in range(len(chunks))}

query = "What factors contributed to the increase in iPad net sales during 2018 compared to 2017?"
query_embedding = encoding_model.encode([query]).astype(np.float32)

k = 3
distances, indices = text_index.search(query_embedding, k)

best_match_idx = indices[0][0]
print(f"Best matching chunk (index {best_match_idx}):")
print(index_to_text[best_match_idx])
print(f"Distance: {distances[0][0]:.4f}")



Best matching chunk (index 8):
2017 and 2016 dollars in millions and units in thousands: 2018 Change 2017 Change 2016 Net sales 18,805 2 19,222 7 20,628 Percentage of total net sales 7 8 10 Unit sales 43,535 43,753 4 45,590 iPad net sales decreased during 2018 compared to 2017 due primarily to a different mix of iPads resulting in lower average selling prices. The strength in foreign currencies relative to the U.S. dollar had a favorable impact on iPad net sales during 2018. iPad net sales decreased during 2017 compared to 2016 due to lower iPad unit sales and a different mix of iPads with lower average selling prices. The weakness in foreign currencies relative to the U.S. dollar had an unfavorable impact on iPad net sales during 2017. Mac The following
Distance: 0.4620
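
To make the search step concrete, here is a pure-Python sketch of what a flat L2 index computes: the squared Euclidean distance from the query to every stored vector, followed by a top-k selection. `flat_l2_search` is a hypothetical helper written for illustration. It is not the FAISS API, and the toy 2-dimensional vectors are made up.

```python
def flat_l2_search(vectors, query, k):
    """Return the k nearest (index, squared_l2_distance) pairs, nearest first."""
    scored = []
    for i, vec in enumerate(vectors):
        sq_dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((sq_dist, i))
    scored.sort()  # ascending distance, ties broken by lower index
    return [(i, dist) for dist, i in scored[:k]]

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
toy_query = [0.75, 0.25]
print(flat_l2_search(toy_vectors, toy_query, k=2))
```

An exhaustive scan like this is exact but its cost grows linearly with the number of stored vectors, which is fine for the 31 chunks indexed here.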


# Table Embeddings

Cleaned and preprocessed tables from `table_embedding.ipynb` were stored as CSV files. They are loaded here, and each row of every table is converted to the form `Column Name: Value | Column Name: Value | ...` so that it can be turned into an embedding.

In [None]:
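import os

import pandas as pd


def table_to_text(row):
    # Serialize one row as "col1: val1 | col2: val2 | ..."
    return " | ".join(f"{col}: {row[col]}" for col in row.index)


tables = []
table_text = []
data_dir = os.path.join(path, "data", "data")
for filename in os.listdir(data_dir):
    df = pd.read_csv(os.path.join(data_dir, filename))
    # Give the unnamed first column a usable name
    if "Unnamed: 0" in df.columns:
        df.columns = df.columns.str.replace("Unnamed: 0", "Column1")
    tables.append(df)
    # One text string per row, as a plain list for the encoder
    table_text.append(df.apply(table_to_text, axis=1).tolist())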
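
As an illustration of the serialized form, the sketch below applies the same `column: value` joining to a plain dictionary instead of a pandas row. The values are copied from the iPad row shown in the reverse-mapping output later in this notebook. `row_dict_to_text` is a hypothetical helper, not part of the original code.

```python
def row_dict_to_text(row):
    """Join a mapping of column name -> value as 'col: value | col: value'."""
    return " | ".join(f"{col}: {value}" for col, value in row.items())

ipad_row = {
    "Region": "iPad (1)",
    "2018": "18,805",
    "Change2018": "(2)%",
    "2017": "19,222",
    "Change2017": "(7)%",
    "2016": "20,628",
}
print(row_dict_to_text(ipad_row))
```

Keeping the column name next to every value means each row can be embedded on its own, without the table header.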

# Embedding the tables


The text form of each table is encoded with the same model.

In [None]:
embd = []
for i in table_text:
    embd.append(encoding_model.encode(i, convert_to_numpy=True))

# Indexing the tables

A new FAISS FlatL2 index is created to store the table embeddings.

In [12]:
import faiss
import numpy as np


table_index = faiss.IndexFlatL2(384)
table_embeddings_np = np.vstack(embd)
table_index.add(table_embeddings_np)
faiss.write_index(table_index, "table_index.faiss")

# Demo query

A sample query is run against the table index, and the indices and distances of the top 3 matching rows are printed.

In [13]:
query = "What factors contributed to the increase in iPad net sales during 2018 compared to 2017?"
query_embedding = encoding_model.encode(query).reshape(1, -1).astype("float32")
k = 3
distances, indices = table_index.search(query_embedding, k)

print("Top matching row indices:", indices)
print("Corresponding distances:", distances)

Top matching row indices: [[ 1 14  5]]
Corresponding distances: [[1.072881  1.0728811 1.1398721]]


# Reverse mapping 

A `row_table_mapping` list is created to map each flat index position to a (table, row) pair. The matching values can then be found by indexing into the corresponding table.

In [14]:
row_table_mapping = []

# The i-th vector in the index corresponds to row_table_mapping[i]
for table_idx, table in enumerate(embd):
    num_rows = table.shape[0]
    for i in range(num_rows):
        row_table_mapping.append((table_idx, i))

for idx in indices[0]:
    table_idx, row_idx = row_table_mapping[idx]
    print(f"Match found in Table {table_idx}, Row {row_idx}")
    print(tables[table_idx].iloc[row_idx])
    print()

Match found in Table 0, Row 1
Region        iPad (1)
2018            18,805
Change2018        (2)%
2017            19,222
Change2017        (7)%
2016            20,628
Name: 1, dtype: object

Match found in Table 1, Row 8
Region        iPad (1)
2018            18,805
Change2018        (2)%
2017            19,222
Change2017        (7)%
2016            20,628
Name: 8, dtype: object

Match found in Table 0, Row 5
Region        Total net sales
2018                  265,595
Change2018               16 %
2017                  229,234
Change2017                6 %
2016                  215,639
Name: 5, dtype: object
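
An alternative to storing one tuple per row is to keep only the cumulative row counts and locate a flat index with a binary search. The following is a minimal sketch, assuming the tables were added to the index in order. The helper names and the second row count (10) are hypothetical, while the first (6) is consistent with index 14 mapping to Table 1, Row 8 above.

```python
from bisect import bisect_right
from itertools import accumulate

def build_offsets(row_counts):
    """Cumulative row counts, e.g. [6, 10] -> [6, 16]."""
    return list(accumulate(row_counts))

def locate(flat_idx, offsets):
    """Map a flat index position to (table_idx, row_idx)."""
    if not offsets or not 0 <= flat_idx < offsets[-1]:
        raise IndexError(f"flat index {flat_idx} is out of range")
    # First table whose cumulative count exceeds flat_idx
    table_idx = bisect_right(offsets, flat_idx)
    start = offsets[table_idx - 1] if table_idx > 0 else 0
    return table_idx, flat_idx - start

offsets = build_offsets([6, 10])  # hypothetical row counts
for idx in (1, 14, 5):
    print(idx, "->", locate(idx, offsets))
```

The range check also rejects `-1`, which FAISS uses as a placeholder index when fewer than `k` results are available.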



# Demonstrating both indices

Given a query, the pipeline first retrieves the top 3 most relevant chunks from the text index and then the top 3 most relevant rows from the table index. Both sets of results are then combined and given to the LLM as context.

In [42]:
query = "How many shares did Microsoft repurchase in fiscal year 2016, and what was the total amount spent?"
query_embedding = encoding_model.encode([query]).astype(np.float32)
distances, indices = table_index.search(query_embedding, k)
distances1, indices1 = text_index.search(query_embedding, k)
text = [index_to_text[i] for i in indices1[0]]
table = [row_table_mapping[i] for i in indices[0]]
table = [tables[t].iloc[r] for t, r in table]
print(text)
print(table)

['4,000 43 2,000 47 1,500 Second quarter 66 3,600 43 2,000 53 2,000 Third quarter 69 3,600 116 5,000 47 1,791 Fourth quarter 70 3,600 93 4,209 28 1,118 Total 14,800 294 295 13,209 175 6,409 a Of the 175 million shares repurchased in fiscal year 2014, 128 million shares were repurchased for 4.9 billion under the share repurchase program approved by our Board of Directors on September 16, 2013 and 47 million shares were repurchased for 1.5 billion under the share repurchase program that was announced on September 22, 2008 and expired on September 30, 2013. The above table excludes shares repurchased to settle statutory employee tax withholding related to the vesting of stock awards. Dividends In fiscal year 2016, our Board of Directors declared the following', 'SHARE REPURCHASES AND DIVIDENDS Share Repurchases On September 16, 2013, our Board of Directors approved a share repurchase program authorizing up to 40.0 billion in share repurchases. The share repurchase program became effective
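
The table index can return the same row more than once when tables overlap: in the earlier demo, Table 0, Row 1 and Table 1, Row 8 hold identical values. The following is a minimal sketch of order-preserving deduplication before the rows are passed on as context, assuming each row has already been serialized to a string. `dedupe_preserving_order` is a hypothetical helper, not part of the original notebook.

```python
def dedupe_preserving_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique

retrieved = [
    "Region: iPad (1) | 2018: 18,805",
    "Region: iPad (1) | 2018: 18,805",
    "Region: Total net sales | 2018: 265,595",
]
print(dedupe_preserving_order(retrieved))
```

Since results arrive sorted by distance, keeping the first occurrence keeps the best-ranked copy.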

# Building context

Given the indices of the top 3 results, the previously created mappings convert each index back to its text chunk or table row. These are then combined with a prompt and given to the LLM.

In [None]:
context_text = "\n".join(text)
context_table = "\n".join([str(row) for row in table])
full_context = f"Textual Data:\n{context_text}\n\nTabular Data:\n{context_table}"

prompt = f"""
You are an AI that answers questions based on provided text and table data.

### Context:
{full_context}

### Question:
{query}

Provide a concise, well-structured answer.
"""

# LLM inference

`Groq` is used through `langchain-groq` for LLM inference, with `Llama3-8b` as the generation model. The request contains a system message with a basic instruction and a human message with the context and the question.

In [None]:
from langchain_groq import ChatGroq
from langchain.schema import SystemMessage, HumanMessage
import os

os.environ["GROQ_API_KEY"] = "gsk_XXXXXXXXXXXXXXXXXXXX"

llm = ChatGroq(model_name="llama3-8b-8192", temperature=0)

messages = [
    SystemMessage(content="You are an AI that answers questions based on provided context."),
    HumanMessage(content=prompt)
]

response = llm.invoke(messages)
print(response.content)

According to the provided text, Microsoft repurchased 89 million shares of common stock under the share repurchase program in the first quarter of fiscal year 2016. The total amount spent on these repurchases is not explicitly stated in the provided text. However, we can infer that the total amount spent is not included in the table provided, as it only shows the shares and amounts repurchased in specific quarters (First, Second, Third, and Fourth) of fiscal year 2014, 2015, and 2016.
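
The inference cell above writes the API key directly into the notebook. The following is a minimal sketch of reading it from the environment instead, and prompting only when it is missing. `get_groq_api_key` is a hypothetical helper, and only the `GROQ_API_KEY` variable name comes from the original code.

```python
import os
from getpass import getpass

def get_groq_api_key(env_var="GROQ_API_KEY"):
    """Return the key from the environment, prompting once if it is unset."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so it does not end up in cell output
        key = getpass(f"Enter {env_var}: ")
        # Store it so code that reads the environment variable can find it
        os.environ[env_var] = key
    return key
```

Calling `get_groq_api_key()` before constructing the client would keep the key out of the saved notebook.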
