# RAG - Retrieval-Augmented Generation

### 1. Document Loading

* Use **Document Loaders** in LangChain.
* Examples: `PyPDFLoader`, `TextLoader`, `UnstructuredFileLoader`.
* Purpose: Load external data (PDFs, text files, web pages) into LangChain as `Document` objects, each holding the text in `page_content` plus a `metadata` dictionary.

---

### 2. Document Splitting

* Use **Text Splitters** like `CharacterTextSplitter` or `RecursiveCharacterTextSplitter`.
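* Purpose: Break large documents into **small chunks** (e.g., 500–1000 characters or tokens, depending on how the splitter measures length; both splitters above count characters by default).
* Why: Smaller chunks fit within the model's context window and make retrieval more precise, because each chunk's embedding represents one focused passage.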
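
The idea of splitting can be sketched in plain Python. This is only an illustration of fixed-size chunking with a small overlap, not how LangChain's splitters are implemented; `split_into_chunks` is a hypothetical helper, and the overlap between neighbouring chunks is an assumption added here (the parameter names merely echo the `chunk_size` / `chunk_overlap` arguments the LangChain splitters accept).

```python
def split_into_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_into_chunks(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
```

Each chunk here repeats the last two characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.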

---

### 3. Document Embedding

* Use an **embedding model** (e.g., `OpenAIEmbeddings`, `HuggingFaceEmbeddings`).
* Each chunk → vector (list of numbers).
* **Cosine similarity** measures how close two vectors (chunks) are by the angle between them: values near 1 mean very similar, values near 0 mean unrelated.

  * Example: Question vector vs stored chunk vectors → find the most similar ones.
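
As a rough sketch of the comparison described above, cosine similarity can be computed with the standard library alone. The three-number vectors are made up for illustration; real embedding vectors have hundreds or thousands of dimensions, and `cosine_similarity` is a hypothetical helper rather than a LangChain function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for embeddings
question = [1.0, 0.0, 1.0]
chunk_close = [2.0, 0.0, 2.0]   # same direction as the question, different length
chunk_far = [0.0, 3.0, 0.0]     # perpendicular to the question

score_close = cosine_similarity(question, chunk_close)
score_far = cosine_similarity(question, chunk_far)
print(score_close, score_far)
```

Because only the direction matters, `chunk_close` scores (approximately) 1 even though it is twice as long as the question vector.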

---

### 4. Document Storing

* Store embeddings in a **Vector Store** (like FAISS, Chroma, Pinecone).
* Purpose: Efficient search & retrieval of similar chunks later.

---

### 5. Retrieval + Generation (RAG)

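* When a user asks a question:

  * Convert the question → embedding.
  * Retrieve the top-k most **relevant** chunks from the vector store (ranked by cosine similarity). If the results should also be **diverse**, use a search type such as Maximal Marginal Relevance (MMR), since plain similarity ranking can return near-duplicate chunks.
  * Send the question + retrieved context to the LLM via a **RetrievalQA chain**.
  * The LLM generates the final answer.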
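
The retrieval steps above can be sketched without any library. Everything here is a stand-in: the in-memory list plays the role of the vector store, the vectors are invented numbers, and `retrieve_top_k` and `build_prompt` are hypothetical helpers, not the RetrievalQA chain itself. No LLM is called; the sketch stops at the prompt that would be sent.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve_top_k(question_vector, store, k=2):
    """Rank stored (text, vector) pairs by similarity to the question and keep the best k."""
    scored = [(cosine_similarity(question_vector, vector), text) for text, vector in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy "vector store": placeholder chunk texts with made-up embeddings
toy_store = [
    ("chunk about analysis", [0.9, 0.1, 0.0]),
    ("chunk about analytics", [0.7, 0.3, 0.1]),
    ("chunk about vector stores", [0.0, 0.1, 0.9]),
]
question_vector = [1.0, 0.0, 0.0]  # pretend embedding of the user's question

top_chunks = retrieve_top_k(question_vector, toy_store, k=2)
prompt = build_prompt("What is analysis?", top_chunks)
print(prompt)
```

This ranks purely by relevance. A diversity-aware strategy such as MMR would additionally penalise chunks that are too similar to ones already selected.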

---

So in LangChain terms:
**Loader → Splitter → Embeddings → Vector Store → RetrievalQA Chain**

---

# Indexing: Document Loading with `PyPDFLoader`

In [4]:
from langchain_community.document_loaders import PyPDFLoader
import copy

In [6]:
loader_pdf = PyPDFLoader(r"D:\AI-ML\Langchain\Files\Introduction_to_Data_and_Data_Science.pdf")

pages_pdf = loader_pdf.load()

pages_pdf

[Document(metadata={'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2023-11-09T10:16:34+02:00', 'author': 'Hristina  Hristova', 'moddate': '2023-11-09T10:16:34+02:00', 'source': 'D:\\AI-ML\\Langchain\\Files\\Introduction_to_Data_and_Data_Science.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}, page_content='Analysis vs Analytics \nAlright! So… \nLet’s discuss the not-so-obvious differences \nbetween the terms analysis and analytics. \nDue to the similarity of the words, some people \nbelieve they share the same meaning, and thus \nuse them interchangeably. Technically, this \nisn’t correct. There is, in fact, a distinct \ndifference between the two. And the reason \nfor one often being used instead of the other \nis the lack of a transparent understanding \nof both. \nSo, let’s clear this up, shall we? \nFirst, we will start with analysis. \nConsider the following… \nYou have a huge dataset containing data of \nvar

In [7]:
pages_pdf_cut = copy.deepcopy(pages_pdf)

### 1. `copy.deepcopy()`

* Comes from Python’s built-in `copy` module.
* `deepcopy()` creates a **new independent copy** of the object *and all objects inside it*.
* This means changes in the copy will **not affect** the original.

---

### 2. Why use `deepcopy` here?

* If you just did:

  ```python
  pages_pdf_cut = pages_pdf
  ```

  → Both variables point to the **same object** in memory. Editing one edits the other.

* With `deepcopy`:

  ```python
  pages_pdf_cut = copy.deepcopy(pages_pdf)
  ```

  → `pages_pdf_cut` is a **completely separate clone**. You can modify it (cut pages, clean text, etc.) without touching `pages_pdf`.
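
A small experiment makes the difference concrete. The `Page` class below is a hypothetical stand-in for a LangChain `Document` (it only has `page_content`), and the shallow `copy.copy()` case is added here for contrast; the notes above only compare plain assignment with `deepcopy`.

```python
import copy


class Page:
    """Hypothetical stand-in for a Document: just holds page_content."""

    def __init__(self, page_content: str):
        self.page_content = page_content


pages = [Page("line one\nline two")]

alias = pages                  # same list object, no copy at all
shallow = copy.copy(pages)     # new list, but it still holds the same Page objects
deep = copy.deepcopy(pages)    # new list holding brand-new Page objects

# Editing the deep copy leaves the original untouched
deep[0].page_content = "edited in deep copy"
after_deep_edit = pages[0].page_content

# Editing through the shallow copy changes the original Page as well
shallow[0].page_content = "edited via shallow copy"
after_shallow_edit = pages[0].page_content

print(after_deep_edit)
print(after_shallow_edit)
```

Since the cleaning step modifies `page_content` on each page object, a shallow copy of the list would not be enough to protect the originals.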

---

In short:
This line makes a **full independent duplicate** of `pages_pdf`, so you can safely work on `pages_pdf_cut` without messing up the original data.


In [10]:
for page in pages_pdf_cut:
    # Collapse every run of whitespace (newlines, tabs, repeated spaces) into a single space
    page.page_content = ' '.join(page.page_content.split())
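
To see what the `split()` / `join()` combination in the cell above does, here is a minimal standalone check on a made-up string (not the actual PDF text). With no arguments, `str.split()` splits on any run of whitespace and drops leading and trailing whitespace, so the result has no newlines, tabs, or double spaces.

```python
raw = "  Analysis vs Analytics \nAlright!\t So \n"

# split() with no arguments breaks on any whitespace run and discards empty pieces
words = raw.split()

# Joining with a single space rebuilds the text on one line
cleaned = " ".join(words)

print(words)
print(repr(cleaned))
```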

In [11]:
pages_pdf[0].page_content, pages_pdf_cut[0].page_content

('Analysis vs Analytics \nAlright! So… \nLet’s discuss the not-so-obvious differences \nbetween the terms analysis and analytics. \nDue to the similarity of the words, some people \nbelieve they share the same meaning, and thus \nuse them interchangeably. Technically, this \nisn’t correct. There is, in fact, a distinct \ndifference between the two. And the reason \nfor one often being used instead of the other \nis the lack of a transparent understanding \nof both. \nSo, let’s clear this up, shall we? \nFirst, we will start with analysis. \nConsider the following… \nYou have a huge dataset containing data of \nvarious types. Instead of tackling the entire \ndataset and running the risk of becoming overwhelmed, \nyou separate it into easier to digest chunks \nand study them individually and examine how \nthey relate to other parts. And that’s analysis \nin a nutshell. \nOne important thing to remember, however, \nis that you perform analyses on things that \nhave already happened in the