# RAG Pipeline: Ingestion and Document Loaders

## Introduction
In the previous notebook, we gave an overview of the Retrieval-Augmented Generation (RAG) pipeline and its three main phases: ingestion, retrieval, and synthesis. This notebook looks more closely at the ingestion phase and at the document loaders used to bring data into a RAG system.

## 1. Ingestion Phase Overview
- **Purpose**: Collect and prepare data for further processing in the RAG pipeline.
- **Key Steps**: Data collection, document pre-processing, and transformation.
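
The three key steps can be sketched with the standard library alone. This is a minimal illustration, not how LangChain implements ingestion: the `Doc` class, the `collect`, `preprocess`, and `transform` functions, and the sample sources are all hypothetical names introduced here, and fixed-size character chunking is assumed as the transformation.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loader's document object."""
    text: str
    metadata: dict = field(default_factory=dict)


def collect(sources: dict[str, str]) -> list[Doc]:
    # Data collection: wrap each raw source in a Doc and record where it came from.
    return [Doc(text=raw, metadata={"source": name}) for name, raw in sources.items()]


def preprocess(docs: list[Doc]) -> list[Doc]:
    # Pre-processing: collapse runs of whitespace and drop documents that end up empty.
    cleaned = []
    for doc in docs:
        text = " ".join(doc.text.split())
        if text:
            cleaned.append(Doc(text=text, metadata=dict(doc.metadata)))
    return cleaned


def transform(docs: list[Doc], chunk_size: int = 40) -> list[Doc]:
    # Transformation: split each document into fixed-size character chunks.
    chunks = []
    for doc in docs:
        for i, start in enumerate(range(0, len(doc.text), chunk_size)):
            chunks.append(
                Doc(
                    text=doc.text[start:start + chunk_size],
                    metadata={**doc.metadata, "chunk": i},
                )
            )
    return chunks


# Hypothetical in-memory sources; the second one is whitespace only.
raw_sources = {
    "a.txt": "RAG   pipelines start\nwith ingestion.",
    "b.txt": "   ",
}
chunks = transform(preprocess(collect(raw_sources)), chunk_size=20)
for chunk in chunks:
    print(chunk.metadata, repr(chunk.text))
```

Keeping the source name in `metadata` at every step means each chunk can still be traced back to the file it came from during retrieval.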

## Document Loading

 ![image.png](attachment:image.png)

In [1]:
%pip install langchain


In [2]:
%pip install python-dotenv



In [1]:
%pip install langchain langchain-community



## 2. Types of Document Loaders


### 2.1 File-Based Loaders
- **Purpose**: Load data from local files or URLs.
- **Examples**: `TextLoader`, `PyPDFLoader`, `CSVLoader`, `UnstructuredHTMLLoader`.
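
To make the loader idea concrete, here is a minimal sketch of what a file-based loader does: read a file and return a list of documents, each carrying its text and a `source` entry in its metadata. `SimpleDocument` and `SimpleTextLoader` are hypothetical names written for this illustration; they only imitate the `load()` pattern and are not LangChain classes.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    # Hypothetical stand-in with the two fields seen on loaded documents.
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical loader: reads one text file into one document."""

    def __init__(self, file_path: str, encoding: str = "utf-8"):
        self.file_path = file_path
        self.encoding = encoding

    def load(self) -> list[SimpleDocument]:
        # Read the whole file and attach its path as the document's source.
        text = Path(self.file_path).read_text(encoding=self.encoding)
        return [SimpleDocument(page_content=text, metadata={"source": self.file_path})]


# Write a throwaway sample file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "notes.txt")
    Path(sample_path).write_text("Ingestion comes first.", encoding="utf-8")
    docs = SimpleTextLoader(sample_path).load()

print(docs[0].page_content)
print(docs[0].metadata["source"].endswith("notes.txt"))
```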

In [4]:
from langchain_community.document_loaders.csv_loader import CSVLoader


# 'utf-8-sig' strips the byte order mark (BOM) that would otherwise corrupt the first column name
loader = CSVLoader(file_path='data/keywords_output.csv', encoding='utf-8-sig')
data = loader.load()

In [5]:
data

[Document(metadata={'source': 'data/keywords_output.csv', 'row': 0}, page_content='No: 1\nKeyword: security guard licence\nRelevancy to supremesecurityservices.ca: 0.87\nVolume: 18100\nCPC: $2.21\nPaid Difficulty: 32\nSEO Difficulty: 5\nKeyword found on…: supremesecurityservices.ca'),
 Document(metadata={'source': 'data/keywords_output.csv', 'row': 1}, page_content='No: 2\nKeyword: licence security guard\nRelevancy to supremesecurityservices.ca: 0.87\nVolume: 18100\nCPC: $2.21\nPaid Difficulty: 32\nSEO Difficulty: 15\nKeyword found on…: supremesecurityservices.ca'),
 Document(metadata={'source': 'data/keywords_output.csv', 'row': 2}, page_content='No: 3\nKeyword: security officer license\nRelevancy to supremesecurityservices.ca: 0.86\nVolume: 18100\nCPC: $2.14\nPaid Difficulty: 20\nSEO Difficulty: 51\nKeyword found on…: supremesecurityservices.ca'),
 Document(metadata={'source': 'data/keywords_output.csv', 'row': 3}, page_content='No: 4\nKeyword: security guard licens

In [6]:
len(data)

858

In [7]:
data[20:30]

[Document(metadata={'source': 'data/keywords_output.csv', 'row': 20}, page_content='No: 21\nKeyword: security ontario license\nRelevancy to supremesecurityservices.ca: 0.91\nVolume: 14800\nCPC: $2.60\nPaid Difficulty: 17\nSEO Difficulty: 27\nKeyword found on…: supremesecurityservices.ca'),
 Document(metadata={'source': 'data/keywords_output.csv', 'row': 21}, page_content='No: 22\nKeyword: security licence ontario\nRelevancy to supremesecurityservices.ca: 0.89\nVolume: 14800\nCPC: $2.71\nPaid Difficulty: 28\nSEO Difficulty: 4\nKeyword found on…: supremesecurityservices.ca'),
 Document(metadata={'source': 'data/keywords_output.csv', 'row': 22}, page_content='No: 23\nKeyword: security license in ontario\nRelevancy to supremesecurityservices.ca: 0.89\nVolume: 14800\nCPC: $2.60\nPaid Difficulty: 17\nSEO Difficulty: 27\nKeyword found on…: supremesecurityservices.ca'),
 Document(metadata={'source': 'data/keywords_output.csv', 'row': 23}, page_content='No: 24\nKeyword: securi
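
A note on encoding: CSV files exported from spreadsheet tools often begin with a UTF-8 byte order mark (BOM). If such a file is decoded with a legacy code page such as cp1252 (a common Windows default), the BOM shows up as `ï»¿` in the first column name and `…` turns into `â€¦`. The sketch below reproduces that effect with only the standard library. `rows_as_text` and the sample file are hypothetical and only approximate the "column: value" layout of the loaded documents; they are not part of LangChain. `CSVLoader` accepts an `encoding` argument, so passing `encoding='utf-8-sig'` is one way to avoid the problem.

```python
import csv
import tempfile
from pathlib import Path


def rows_as_text(path: str, encoding: str) -> list[str]:
    # Format each CSV row as "column: value" lines joined by newlines.
    with open(path, newline="", encoding=encoding) as handle:
        return [
            "\n".join(f"{key}: {value}" for key, value in row.items())
            for row in csv.DictReader(handle)
        ]


with tempfile.TemporaryDirectory() as tmp:
    csv_path = str(Path(tmp) / "sample.csv")
    # Writing with "utf-8-sig" puts a BOM at the start, as many spreadsheet exports do.
    Path(csv_path).write_text(
        "No,Keyword found on\u2026\n1,example.com\n", encoding="utf-8-sig"
    )
    garbled = rows_as_text(csv_path, encoding="cp1252")    # wrong decoder
    clean = rows_as_text(csv_path, encoding="utf-8-sig")   # BOM-aware decoder

print(repr(garbled[0]))
print(repr(clean[0]))
```

Because the BOM attaches to the first header cell, the debris appears in every row's first field, not just in the first document.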