### Data Ingestion

In this notebook, we will learn how to build a data ingestion pipeline that adds data to a vector database. We will use `Pinecone` as the vector database, but other options such as `Chroma`, `Weaviate`, and `Faiss` are available too.

We will cover the following in this session:
- How to load documents.
- How to add metadata to each document.
- How to use a text splitter to split documents.
- How to generate embeddings for each text chunk.
- How to insert the embeddings into a vector database.

![Alt text](../images/data_ingestion.png)

### Prerequisites

- You will need a [Pinecone](https://www.pinecone.io/) API key. You can [sign up](https://app.pinecone.io/?sessionType=signup) for a free starter account and get the API key after signing up.

- You will need an [OpenAI](https://openai.com/) API key for this session. It will be provided by Analytics Vidhya.

### Importing libraries

In [2]:
# Import the 'os' module for interacting with the operating system
import os

# Import the 'time' module for handling time-related tasks
import time  

!pip install pinecone
# Import the 'Pinecone' class from the 'pinecone' package for vector database operations
from pinecone import Pinecone

# Import the 'ServerlessSpec' class from the 'pinecone' package for serverless deployment specifications
from pinecone import ServerlessSpec

# To create embeddings
from langchain_openai import OpenAIEmbeddings

# To connect with the Vectorstore
from langchain_pinecone import PineconeVectorStore

# To parse the PDFs
from langchain_community.document_loaders import PyPDFLoader 

# To load files in a directory
from langchain_community.document_loaders import DirectoryLoader 

# To split the text into smaller chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

Defaulting to user installation because normal site-packages is not writeable


### Setting up keys

In [37]:
# Set the OPENAI_API_KEY environment variable to your OpenAI API key
os.environ["OPENAI_API_KEY"] = ""
# Set the PINECONE_API_KEY environment variable to your Pinecone API key.
# Never leave a real key in a shared notebook; paste yours here only on your own machine.
os.environ["PINECONE_API_KEY"] = ""
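
Hard-coding keys in a cell makes them easy to leak when the notebook is shared. As a minimal sketch of one alternative, the hypothetical helper `get_api_key` below (not part of any library) reads a key from the environment and only prompts, with hidden input, when it is missing. The variable name `DEMO_API_KEY` and the fake prompt are assumptions used so the example runs without user interaction.

```python
import os
from getpass import getpass


def get_api_key(name, prompt=getpass):
    """Return the key stored in the environment variable `name`.

    If the variable is unset or empty, ask for it with `prompt` (hidden
    input by default) and cache it in os.environ for later cells.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        value = prompt(f"Enter {name}: ").strip()
        if not value:
            raise ValueError(f"{name} must not be empty")
        os.environ[name] = value
    return value


# Demo with a fake prompt so the example runs without user interaction.
os.environ.pop("DEMO_API_KEY", None)
demo_key = get_api_key("DEMO_API_KEY", prompt=lambda message: "not-a-real-key")
print(demo_key)
```

In this notebook the calls would be `get_api_key("OPENAI_API_KEY")` and `get_api_key("PINECONE_API_KEY")`, which leaves no key in the saved file.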

### Setting up a Pinecone Index

In [10]:
# Initialize a ServerlessSpec object for AWS with the specified region
spec = ServerlessSpec(cloud='aws', 
                      region='us-east-1')

# configure client  
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])  

INDEX_NAME = 'dhs2024blr'

In [11]:
# Check whether the index already exists in the Pinecone project that `pc` is connected to
if INDEX_NAME in pc.list_indexes().names():
    # If the index exists, print a message indicating its existence
    print(f"Index `{INDEX_NAME}` already exists")
    
    # Retrieve the existing index object
    index = pc.Index(INDEX_NAME)
    
    # Print detailed statistics about the existing index
    print(index.describe_index_stats())
    
# If the index does not exist, proceed to create a new one
else:
    # Create a new index with specific parameters
    pc.create_index(
            INDEX_NAME,
            dimension=1536,  # dimensionality of text-embedding-ada-002
            metric='cosine',
            spec=spec
        )
    
    # Wait for the index to be initialized before proceeding
    while not pc.describe_index(INDEX_NAME).status['ready']:
        # Sleep for 1 second to avoid overloading the system with requests
        time.sleep(1)
    
    # Once the index is ready, print a confirmation message
    print(f"Index with name `{INDEX_NAME}` is created")
    
    # Retrieve the newly created index object
    index = pc.Index(INDEX_NAME)
    
    # Print detailed statistics about the newly created index
    print(index.describe_index_stats())

Index with name `dhs2024blr` is created
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
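
The wait loop above polls once per second and never gives up, so it would spin forever if the index failed to become ready. A minimal sketch of a bounded version follows; `wait_until` is a hypothetical helper written for illustration, and `fake_ready` is a stand-in for the real readiness check.

```python
import time


def wait_until(is_ready, timeout=60.0, interval=1.0):
    """Poll `is_ready()` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout} seconds")
        time.sleep(interval)


# Demo: a stand-in readiness check that succeeds on the third call.
calls = {"count": 0}


def fake_ready():
    calls["count"] += 1
    return calls["count"] >= 3


wait_until(fake_ready, timeout=5.0, interval=0.01)
print(calls["count"])
```

With the client from this notebook, the call would look like `wait_until(lambda: pc.describe_index(INDEX_NAME).status['ready'])`.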


`Note:` To delete an existing index, call `pc.delete_index(index_name)`.

### Building an Ingestion Pipeline

In [12]:
# Define the directory path where the data files are stored
DATA_DIR_PATH = "../data/"

# Set the maximum chunk size, measured in characters (the splitter's default length function is len)
CHUNK_SIZE = 1024

# Number of characters shared by consecutive chunks, so context is not lost at chunk boundaries
CHUNK_OVERLAP = 204

# Name of the index used for storing and retrieving data; it must match the index created above
INDEX_NAME = 'dhs2024blr'
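
To make `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification for illustration, not the splitter used later in this notebook, and `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbours share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


demo_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)
```

With the values above, each window would hold up to 1024 characters and share 204 of them (about 20%) with the previous window.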

`Note:` Make sure to maintain the directory structure shown below, since we will use the Year and Quarter directory names in the metadata later.

![Alt text](../images/data_dir_tree.png)

### Loading Files

Initialize a `DirectoryLoader` object with three arguments: the path to the data, a glob pattern selecting which files to load from the directory, and the `loader_cls`, which in our case is `PyPDFLoader` since we are working with PDF files.

In [14]:
# Create a loader that finds every PDF under DATA_DIR_PATH (including subdirectories) and parses it with PyPDFLoader
loader = DirectoryLoader(path=DATA_DIR_PATH, glob="**/*.pdf", loader_cls=PyPDFLoader)

# Load documents using the loader object
docs = loader.load()

# Print the total number of documents loaded (PyPDFLoader returns one Document per PDF page)
print(f"Total documents loaded: {len(docs)}")

Total documents loaded: 29
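
To see which files a pattern like `**/*.pdf` selects, the sketch below builds a throwaway directory tree and lists the matches with the standard library. It assumes the loader interprets the pattern the way `pathlib.Path.glob` does; `find_pdfs` and the file names `Example.pdf` and `notes.txt` are hypothetical.

```python
import tempfile
from pathlib import Path


def find_pdfs(root):
    """Return POSIX-style relative paths of all PDFs under `root`, sorted."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.pdf"))


# Build a small tree that mirrors the Year/Quarter layout used in this notebook.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["FY23/Q4/HCLTech.pdf", "FY24/Q1/Example.pdf", "FY24/Q1/notes.txt"]:
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    found = find_pdfs(tmp)

print(found)
```

Only the two PDF files are returned; the text file is skipped because it does not match the pattern.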


In [15]:
# Looking into the first document
docs[0]

Document(metadata={'source': '..\\data\\FY23\\Q4\\HCLTech.pdf', 'page': 0}, page_content=' \n Page 1 of 18  \n \n“HCL  Tech nologies Limited ’s Q4FY23 & Annual FY23 \nEarnings Conference Call”  \n \nApril 20 , 2023 \n \n \n \n \n \n \n \n \n \n \n  \n \n \nMANAGEMENT : MR. C. VIJAYAKUMAR – CHIEF EXECUTIVE OFFICER & \nMANAGING DIRECTOR , HCL  TECH NOLOGIES LIMITED  \nMR. PRATEEK  AGGARWAL  – CHIEF  FINANCIAL \nOFFICER , HCL  TECH NOLOGIES LIMITED  \nMR. SRINIVASAN SESHADRI – GLOBAL HEAD, \nFINANCIAL SERVICES , HCL  TECH NOLOGIES LIMITED  \nMR. VIJAY GUNTUR  – PRESIDEN T, ENGINEERING AND \nR&D  SERVICES , HCL  TECH NOLOGIES LIMITED  \nMR. MANAN  BATRA – SENIOR MANAGER , INVESTOR \nRELATIONS , HCL  TECH NOLOGIES LIMITED  \n  \n')

In [16]:
# We can convert the Document object to a Python dict using the .dict() method.
print(f"keys associated with a Document: {docs[0].dict().keys()}")

keys associated with a Document: dict_keys(['id', 'metadata', 'page_content', 'type'])


In [22]:
print(f"{'-'*1}\nFirst 100 characters of the page content: {docs[0].page_content[:100]}\n{'-'*15}")
print(f"Metadata associated with the document: {docs[0].metadata}\n{'-'*15}")
fname = docs[0].metadata['source']
print(str(fname).split('\\')[-1])
print(f"Datatype of the document: {docs[0].type}\n{'-'*15}")

-
First 100 characters of the page content:  
 Page 1 of 18  
 
“HCL  Tech nologies Limited ’s Q4FY23 & Annual FY23 
Earnings Conference Call”  
---------------
Metadata associated with the document: {'source': '..\\data\\FY23\\Q4\\HCLTech.pdf', 'page': 0}
---------------
HCLTech.pdf
Datatype of the document: Document
---------------


In [23]:
# Loop through each document and add extra metadata: filename, quarter, and year
for doc in docs:
    source = doc.metadata['source']
    # Split on the OS path separator so this works on Windows and POSIX systems alike
    parts = os.path.normpath(source).split(os.sep)
    filename, quarter, year = parts[-1], parts[-2], parts[-3]
    doc.metadata = {"filename": filename, "quarter": quarter, "year": year, "source": source, "page": doc.metadata['page']}

In [24]:
# Verify that the metadata has been added to the documents
print(f"Metadata associated with the document: {docs[0].metadata}\n{'-'*15}")
print(f"Metadata associated with the document: {docs[1].metadata}\n{'-'*15}")

Metadata associated with the document: {'filename': 'HCLTech.pdf', 'quarter': 'Q4', 'year': 'FY23', 'source': '..\\data\\FY23\\Q4\\HCLTech.pdf', 'page': 0}
---------------
Metadata associated with the document: {'filename': 'HCLTech.pdf', 'quarter': 'Q4', 'year': 'FY23', 'source': '..\\data\\FY23\\Q4\\HCLTech.pdf', 'page': 1}
---------------
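
The metadata step relies on the `<year>/<quarter>/<file>` layout of the source path. As a small standalone sketch of that parsing, the hypothetical helper `path_metadata` below uses `PureWindowsPath`, which accepts both `\` and `/` as separators, so it handles paths recorded on either kind of system.

```python
from pathlib import PureWindowsPath


def path_metadata(source):
    """Derive filename, quarter and year from '<...>/<year>/<quarter>/<file>'."""
    # PureWindowsPath treats both '\\' and '/' as separators.
    parts = PureWindowsPath(source).parts
    if len(parts) < 3:
        raise ValueError(f"expected <year>/<quarter>/<file>, got {source!r}")
    year, quarter, filename = parts[-3], parts[-2], parts[-1]
    return {"filename": filename, "quarter": quarter, "year": year}


windows_meta = path_metadata("..\\data\\FY23\\Q4\\HCLTech.pdf")
posix_meta = path_metadata("../data/FY24/Q1/Adani Enterprises Ltd.pdf")
print(windows_meta)
print(posix_meta)
```

Raising an error for paths that are too short makes a misplaced file visible immediately instead of silently producing wrong `quarter` or `year` values.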


### Chunking Text

As the name suggests, chunking is the process of dividing a large text into several smaller parts, so that each part can be embedded, stored, and retrieved on its own.

There are several ways to perform chunking, including:
 - Character Chunking
 - Recursive Character Chunking
 - Document Specific Chunking

In this session we will use `Recursive Character Chunking`, for which LangChain provides an implementation that we can use directly. To read more about it, refer to the [docs](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/).

`Additional Resource:` To explore the different chunking strategies, refer to the following docs from LangChain - [Link to Docs](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/)
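
The idea behind recursive character chunking can be sketched in a few lines: try the coarsest separator first (paragraph breaks), and only fall back to finer ones (line breaks, spaces, single characters) for pieces that are still too long. The function `recursive_split` below is a simplified illustration, not LangChain's implementation, which additionally merges small pieces back together up to the chunk size and applies the overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` so that every piece has at most `chunk_size` characters.

    `separators` must end with "" so that a hard cut is always available.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(sep):
        # Parts that are still too long are split again with finer separators.
        pieces.extend(recursive_split(part, chunk_size, finer))
    return pieces


demo_pieces = recursive_split("aaa bbb\n\nccc", chunk_size=3)
print(demo_pieces)
```

Because paragraph and line breaks are tried before spaces, text that already fits inside a paragraph stays together, which tends to keep chunks semantically coherent.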

In [25]:
# Split text into chunks 
text_splitter = RecursiveCharacterTextSplitter(
     chunk_size=CHUNK_SIZE,
     chunk_overlap=CHUNK_OVERLAP
)

documents = text_splitter.split_documents(docs)

In [26]:
len(docs), len(documents)

(29, 112)

In [27]:
documents[0:4]

[Document(metadata={'filename': 'HCLTech.pdf', 'quarter': 'Q4', 'year': 'FY23', 'source': '..\\data\\FY23\\Q4\\HCLTech.pdf', 'page': 0}, page_content='Page 1 of 18  \n \n“HCL  Tech nologies Limited ’s Q4FY23 & Annual FY23 \nEarnings Conference Call”  \n \nApril 20 , 2023 \n \n \n \n \n \n \n \n \n \n \n  \n \n \nMANAGEMENT : MR. C. VIJAYAKUMAR – CHIEF EXECUTIVE OFFICER & \nMANAGING DIRECTOR , HCL  TECH NOLOGIES LIMITED  \nMR. PRATEEK  AGGARWAL  – CHIEF  FINANCIAL \nOFFICER , HCL  TECH NOLOGIES LIMITED  \nMR. SRINIVASAN SESHADRI – GLOBAL HEAD, \nFINANCIAL SERVICES , HCL  TECH NOLOGIES LIMITED  \nMR. VIJAY GUNTUR  – PRESIDEN T, ENGINEERING AND \nR&D  SERVICES , HCL  TECH NOLOGIES LIMITED  \nMR. MANAN  BATRA – SENIOR MANAGER , INVESTOR \nRELATIONS , HCL  TECH NOLOGIES LIMITED'),
 Document(metadata={'filename': 'HCLTech.pdf', 'quarter': 'Q4', 'year': 'FY23', 'source': '..\\data\\FY23\\Q4\\HCLTech.pdf', 'page': 1}, page_content="HCL Tech nologies Limited  \nApril 20 , 202 3 \n \n Page 2 of 

In [28]:
# Initialize the embedding model
embeddings = OpenAIEmbeddings(model = "text-embedding-ada-002")
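
Each chunk is turned into a vector (1536 numbers for the model above), and the index created earlier compares vectors with the `cosine` metric. As a minimal sketch of what that metric computes, here is a plain-Python version; `cosine_similarity` is a hypothetical helper and the two-dimensional vectors are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors `a` and `b`."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not length, so two texts with similar meaning score close to 1 regardless of how long their vectors are.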

In [32]:
# Prompt the user to confirm if the vectors are already added to the Pinecone database
docs_already_in_pinecone = input("Are the vectors already added in DB: (Type Y/N)")

# Check if the user has confirmed that the vectors are already in the database
if docs_already_in_pinecone == "Y" or docs_already_in_pinecone == "y":
    
    # Initialize a PineconeVectorStore object with the existing index and embeddings
    docsearch = PineconeVectorStore(index_name=INDEX_NAME, embedding=embeddings)
    
    print("Existing Vectorstore is loaded")
    
# If the user confirms that the vectors are not in the database, create a new PineconeVectorStore from the documents and embeddings
elif docs_already_in_pinecone == "N" or docs_already_in_pinecone == "n":
    
    # Create a PineconeVectorStore object from the documents and embeddings, specifying the index name
    docsearch = PineconeVectorStore.from_documents(documents, embeddings, index_name="dhs2024blr")
    
    print("New vectorstore is created and loaded")
    
# If the user input is neither 'Y' nor 'N', prompt them to enter a valid response
else:
    print("Please type Y - for yes and N - for no")

Are the vectors already added in DB: (Type Y/N)N
New vectorstore is created and loaded
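
Conceptually, a vector store keeps each chunk's vector together with its text and metadata, and answers a query by ranking stored vectors by similarity. The toy in-memory class below is a sketch of that idea only; `ToyVectorStore`, `Record`, and the hand-written two-dimensional vectors are assumptions for illustration and say nothing about how Pinecone works internally.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    vector: list
    text: str
    metadata: dict = field(default_factory=dict)


class ToyVectorStore:
    """In-memory store that ranks records by cosine similarity."""

    def __init__(self):
        self.records = []

    def add(self, vector, text, metadata=None):
        self.records.append(Record(list(vector), text, dict(metadata or {})))

    def query(self, vector, k=4):
        """Return the `k` records most similar to `vector` (non-zero vectors assumed)."""
        def score(record):
            dot = sum(x * y for x, y in zip(vector, record.vector))
            return dot / (math.hypot(*vector) * math.hypot(*record.vector))

        return sorted(self.records, key=score, reverse=True)[:k]


store = ToyVectorStore()
store.add([1.0, 0.0], "revenue grew", {"quarter": "Q4"})
store.add([0.0, 1.0], "new hires", {"quarter": "Q1"})
store.add([0.9, 0.1], "net income rose", {"quarter": "Q1"})

top_texts = [record.text for record in store.query([1.0, 0.0], k=2)]
print(top_texts)
```

A real vector database replaces the full sort with an approximate nearest-neighbour index so that queries stay fast with millions of vectors.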


In [33]:
# Here we define how to use the loaded vector store as a retriever
retriver = docsearch.as_retriever()

In [34]:
retriver.invoke("what is the income?")

[Document(metadata={'filename': 'HCLTech.pdf', 'page': 8.0, 'quarter': 'Q4', 'source': '..\\data\\FY23\\Q4\\HCLTech.pdf', 'year': 'FY23'}, page_content='months basis, which is at 121% of net income. So, that continues to be at that range of 120% -\nplus that we have been mentioning in both of our investor days in the last year and as per the \npromise  that we have made to you. And free cash flow as well came in at $2 billion plus ; $2,024 \nmillion to be precise , and that comes in at 110% of net income. So, both those metrics are \nextremely good to look at . \nOur balance sheet therefore continues to stren gthen with gross cash now at $2.8 billion and net \ncash at $2.5 billion. Remember, in this quarter, we actually retired $248 million worth of bonds \nthat we had issued two years back and that reduces the gross cash. Without that , the gross cash \nwould have bee n $248 million  higher and therefore is $3,058 million  to be precise. But net cash \nobviously remains the same becau

#### Using metadata with the retriever

In [35]:
# Create a retriever from the `docsearch` vector store that returns the top 2 chunks whose metadata has quarter == "Q1"
retriver = docsearch.as_retriever(search_kwargs={"filter": {"quarter": "Q1"}, "k": 2})
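
The effect of `filter` and `k` can be sketched in plain Python: keep only the candidates whose metadata matches, then return the `k` best by score. The helpers `matches` and `filtered_top_k` and the scores below are hypothetical, and the sketch covers exact-match filters only, whereas Pinecone's filter language supports further operators.

```python
def matches(metadata, metadata_filter):
    """True if every key in `metadata_filter` has the same value in `metadata`."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())


def filtered_top_k(scored_chunks, metadata_filter, k):
    """Return the `k` highest-scoring (score, metadata, text) tuples that match."""
    candidates = [chunk for chunk in scored_chunks if matches(chunk[1], metadata_filter)]
    candidates.sort(key=lambda chunk: chunk[0], reverse=True)
    return candidates[:k]


# Made-up similarity scores for illustration.
scored_chunks = [
    (0.91, {"quarter": "Q4", "year": "FY23"}, "net income"),
    (0.85, {"quarter": "Q1", "year": "FY24"}, "revenues and EBITDA"),
    (0.40, {"quarter": "Q1", "year": "FY24"}, "airport passengers"),
    (0.72, {"quarter": "Q1", "year": "FY24"}, "solar exports"),
]

q1_results = filtered_top_k(scored_chunks, {"quarter": "Q1"}, k=2)
print([text for _, _, text in q1_results])
```

Note that the best overall match (from Q4) is excluded: the filter narrows the candidate set before ranking, so results come only from the requested quarter.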

In [36]:
retriver.invoke("what is the income?")

[Document(metadata={'filename': 'Adani Enterprises Ltd.pdf', 'page': 3.0, 'quarter': 'Q1', 'source': '..\\data\\FY24\\Q1\\Adani Enterprises Ltd.pdf', 'year': 'FY24'}, page_content='passengers and non -passengers at the airport and secondly , the increase  in the actual gross spend \nrate of each of the passenger , so these two aspects contributing for the growth in AEL . \nMohit Kumar : My third question is Carmi chael, is it possible to share the revenues  and EBITDA for the quarter \nand the related question is that in the segmental which you have disclose d, commercial mining \nis one line item, I believe this primarily corresponds to Carmichael? I s my understanding \ncorrect?  \nRobbie Singh : Yes, the understanding  is correct.  \nMohit Kumar : My last question is on the Solar PV, we have done a very good job  in the sense the numbers are \nvery good  for the quarter,  and I believe the exporting of a large amount to the third countries . \nCan you please specify which are the co