# 1. Data Loading, Splitting and Ingestion Pipeline

This notebook covers the first stage of the RAG pipeline: the data ingestion pipeline.
The data ingestion pipeline is responsible for:
- loading (parsing) the data from the source (locally available PDF files or data on S3)
- splitting the data into chunks of the desired size
- embedding the chunks
- saving the embeddings in a vector database.
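
The four responsibilities above can be sketched as a chain of small functions. This is only a toy illustration in plain Python: `load`, `split`, `embed` and `ingest` are hypothetical stand-ins, not the LangChain components used later in this notebook, and the "embedding" is a made-up two-number vector.

```python
def load(source: str) -> str:
    # Stand-in for PDF parsing: the "source" here is already plain text.
    return source


def split(text: str, size: int) -> list[str]:
    # Cut the text into consecutive pieces of at most `size` characters.
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(chunk: str) -> list[float]:
    # Toy embedding (an assumption for illustration): length and vowel count.
    vowels = sum(ch in "aeiou" for ch in chunk.lower())
    return [float(len(chunk)), float(vowels)]


def ingest(source: str, size: int) -> list[tuple[list[float], str]]:
    # Load -> split -> embed -> store; the "vector database" is a plain list.
    store = []
    for chunk in split(load(source), size):
        store.append((embed(chunk), chunk))
    return store


toy_store = ingest("attention is all you need", size=10)
for vector, chunk in toy_store:
    print(vector, repr(chunk))
```

Each stored entry pairs a vector with the text it was computed from, which is the same shape of data the real vector store keeps.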

<b>The whole pipeline (the Data Loading, Splitting and Ingestion Pipeline is circled in red and marked 1)</b>

![image-2.png](attachment:image-2.png)

We first need to download the PDF files that will serve as the data for the RAG. For our RAG we use the following papers:
- Attention Is All You Need, Vaswani et al. 2017
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al. 2018
- Improving Language Understanding by Generative Pre-Training, Radford et al. 2018

In [1]:
# Download the paper "Attention Is All You Need" (the Transformer paper) from the NeurIPS 2017 conference
!wget https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf -O attention_is_all_you_need.pdf

--2024-10-02 13:16:07--  https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Resolving proceedings.neurips.cc (proceedings.neurips.cc)... 198.202.70.94
Connecting to proceedings.neurips.cc (proceedings.neurips.cc)|198.202.70.94|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 569417 (556K) [application/pdf]
Saving to: ‘attention_is_all_you_need.pdf’


2024-10-02 13:16:09 (588 KB/s) - ‘attention_is_all_you_need.pdf’ saved [569417/569417]



In [2]:
# Download the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding from the NAACL 2019 conference
!wget https://www.aclweb.org/anthology/N19-1423.pdf -O bert.pdf

--2024-10-02 13:16:11--  https://www.aclweb.org/anthology/N19-1423.pdf
Resolving www.aclweb.org (www.aclweb.org)... 50.87.169.12
Connecting to www.aclweb.org (www.aclweb.org)|50.87.169.12|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://aclanthology.org/N19-1423.pdf [following]
--2024-10-02 13:16:12--  https://aclanthology.org/N19-1423.pdf
Resolving aclanthology.org (aclanthology.org)... 174.138.37.75
Connecting to aclanthology.org (aclanthology.org)|174.138.37.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 786279 (768K) [application/pdf]
Saving to: ‘bert.pdf’


2024-10-02 13:16:13 (791 KB/s) - ‘bert.pdf’ saved [786279/786279]



In [3]:
# Download the GPT paper
!wget https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf -O gpt.pdf

--2024-10-02 13:16:14--  https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Resolving cdn.openai.com (cdn.openai.com)... 2620:1ec:bdf::42, 13.107.246.42
Connecting to cdn.openai.com (cdn.openai.com)|2620:1ec:bdf::42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 541036 (528K) [application/pdf]
Saving to: ‘gpt.pdf’


2024-10-02 13:16:15 (1,41 MB/s) - ‘gpt.pdf’ saved [541036/541036]



In [4]:
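# Install LangChain, its community integrations, boto3 (AWS SDK),
# PyMuPDF (required by PyMuPDFLoader) and faiss-cpu (required by the FAISS vector store)
!pip install langchain langchain-community boto3 pymupdf faiss-cpu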



For our use case, we are going to work with **LangChain**.

**LangChain** is an orchestration framework for LLM applications and agents. It provides the components needed to build a whole RAG system, from the data ingestion pipeline to the inference pipeline.

In [5]:
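# Importing all the necessary modules/libraries
import os

# Document loaders live in langchain_community; the splitter lives in
# langchain_text_splitters (installed as a dependency of langchain)
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import FAISS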

In this step, we define the configuration for the first stage of the pipeline. The configuration contains:
- The region name and the credentials profile name of our AWS account
- The embedding model used to embed the data, and its configuration (the dimension of the vector embeddings and whether the embeddings are normalized)
- The size of the document chunks and the overlap between the chunks
- The paths of the PDF files from which we extract the data
- The path of the vector database where we save the embedded data

In [6]:
# Defining the configuration
REGION_NAME = "us-west-2"
#CREDENTIALS_PROFILE_NAME = "ML"
EMBEDDER_MODEL_ID = "amazon.titan-embed-text-v2:0"
EMBEDDER_MODEL_KWARGS = {
    "dimensions": 512,
    "normalize": True
}

CHUNK_SIZE = 2000
CHUNK_OVERLAP = 100

DATA_PATHS = [
    "attention_is_all_you_need.pdf",
    "bert.pdf",
    "gpt.pdf"
]

VECTOR_STORE_PATH = "./vector_database/"
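
Before running the rest of the pipeline, it can help to sanity-check these values. The sketch below is a hypothetical helper, not part of LangChain or Bedrock: `IngestionConfig` is an illustrative name, and the allowed dimensions (256, 512, 1024) are an assumption taken from the AWS documentation for Titan Text Embeddings V2 that should be verified against the current docs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionConfig:
    chunk_size: int
    chunk_overlap: int
    dimensions: int

    def __post_init__(self):
        # A chunk must contain at least one character.
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        # The overlap has to be smaller than the chunk, otherwise the
        # splitter could never advance through the text.
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")
        # Assumed set of output sizes supported by the embedding model.
        if self.dimensions not in (256, 512, 1024):
            raise ValueError("unsupported embedding dimension")


# Same values as in the configuration cell above.
config = IngestionConfig(chunk_size=2000, chunk_overlap=100, dimensions=512)
print(config)
```

Failing early on a bad configuration is cheaper than discovering it after the documents have been parsed and sent to the embedding service.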


We need to define the chunker.

**The chunker** is responsible for splitting the data into chunks of the desired size. In this case, we split the data into chunks of at most 2000 characters with an overlap of 100 characters. Note that `RecursiveCharacterTextSplitter` measures length in characters by default, not in tokens.

We use relatively large chunks so that related information (for example a full paragraph or subsection) stays in the same chunk, which gives each chunk a more complete representation of the content it covers.

The overlap repeats the end of one chunk at the start of the next, so context that spans a chunk boundary is not lost.
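
To make the role of the overlap concrete, here is a minimal sliding-window chunker with sizes measured in characters. It is a simplified sketch, not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to cut on separators such as paragraph breaks and newlines); `chunk_with_overlap` is a hypothetical helper written only for this illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new chunk starts `step` characters after the previous one,
    # so consecutive chunks share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, which is the same effect `CHUNK_OVERLAP` has on the real chunks, only at a smaller scale.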

In [7]:
# Defining the chunker
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)


We need to load the data from the PDF files by parsing them, and then split the parsed text into chunks using the chunker defined in the previous step.

For loading the data we use **PyMuPDFLoader**, a PDF parser that returns each page as a separate document together with metadata such as the source file and the page number.

In [8]:
# Creating chunks from the documents
global_chunks = []
for data_path in DATA_PATHS:
    loader = PyMuPDFLoader(os.path.join(os.getcwd(), data_path))
    docs = loader.load()
    chunks = splitter.split_documents(docs)
    global_chunks.extend(chunks)

We use the embedding model to embed the data.

We use **Amazon Titan Text Embeddings V2** (`amazon.titan-embed-text-v2:0`), which can produce embeddings of different dimensions depending on the configuration, so we can trade storage and search cost against representation quality.

We use 512-dimensional embeddings with normalization enabled. Normalized embeddings have unit length, which makes similarity scores between vectors directly comparable.
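
What normalization does can be shown with a small, self-contained example. The vectors below are made-up 2-dimensional stand-ins rather than real model outputs, and `l2_normalize`, `dot` and `cosine_similarity` are hypothetical helpers. The sketch illustrates that for unit-length vectors the plain dot product equals the cosine similarity.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    # Scale the vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

print(math.sqrt(dot(a_unit, a_unit)))  # length of a normalized vector
print(dot(a_unit, b_unit))             # dot product of the unit vectors
print(cosine_similarity(a, b))         # cosine similarity of the originals
```

The last two printed values agree up to floating-point rounding, so a store that holds normalized embeddings can use a cheap dot product in place of a full cosine computation.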

In [9]:
# Creating the embedder
embedder = BedrockEmbeddings(
    model_id=EMBEDDER_MODEL_ID,
    model_kwargs=EMBEDDER_MODEL_KWARGS,
    region_name=REGION_NAME,
    #credentials_profile_name=CREDENTIALS_PROFILE_NAME
)



Now we embed the chunks with the embedding model defined in the previous step and save the resulting vectors. We use **FAISS** for this. **FAISS** is a library for efficient similarity search over dense vectors; LangChain wraps it as a vector store that can be saved to and loaded from local disk.

In [10]:
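# Creating the vector store from the chunks of all documents
# (global_chunks, not chunks, which only holds the last document's chunks)
vector_store = FAISS.from_documents(documents=global_chunks, embedding=embedder)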
vector_store.save_local(VECTOR_STORE_PATH)
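
To give an intuition for what the vector store does at query time, the sketch below runs a brute-force nearest-neighbour search in plain Python. It is an illustration under stated assumptions, not FAISS itself: the vectors and texts are made up, `squared_l2` and `top_k` are hypothetical helpers, and squared Euclidean distance is used on the assumption that it mirrors the flat L2 index LangChain builds by default.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    # Squared Euclidean distance between two vectors of equal length.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: list[float], vectors: list[list[float]],
          texts: list[str], k: int) -> list[tuple[str, float]]:
    # Score every stored vector against the query and keep the k closest.
    scored = [(squared_l2(query, vector), text)
              for vector, text in zip(vectors, texts)]
    scored.sort(key=lambda pair: pair[0])
    return [(text, distance) for distance, text in scored[:k]]


# Made-up 2-dimensional "embeddings" standing in for real chunk vectors.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_texts = ["chunk about attention", "chunk about BERT", "chunk about GPT"]

results = top_k([0.0, 1.0], toy_vectors, toy_texts, k=2)
for text, distance in results:
    print(text, round(distance, 2))
```

A real index avoids scanning every vector one by one in Python, but the contract is the same: given a query embedding, return the stored chunks whose embeddings are closest to it.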