# RAG with LangChain

These are my notebooks from working through this [DataCamp](https://app.datacamp.com/learn/courses/retrieval-augmented-generation-rag-with-langchain) course.

I used the Microsoft [2024 Annual Report](https://www.microsoft.com/investor/reports/ar24/download-center/) for my analysis.


![RAG Indexing Diagram](https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png)


### Loading Documents

In [7]:
from langchain_community.document_loaders import PyPDFLoader, UnstructuredHTMLLoader
from langchain_core.documents import Document

# For PDFs
loader = PyPDFLoader('data\\rag_report.pdf')

# For HTML (the loader needs the path of the HTML file)
# html_loader = UnstructuredHTMLLoader('<path to HTML file>')
# data = html_loader.load()
# print(data[0].page_content)

# Loading Markdown files
# from langchain_community.document_loaders import UnstructuredMarkdownLoader

# markdown_loader = UnstructuredMarkdownLoader('README.md')
# markdown_content = markdown_loader.load()

data: list[Document] = loader.load()

data[0:5]

[Document(metadata={'source': 'data\\rag_report.pdf', 'page': 0, 'page_label': '1'}, page_content='  \n  \n \n'),
 Document(metadata={'source': 'data\\rag_report.pdf', 'page': 1, 'page_label': '2'}, page_content=' \n1 \nDear shareholders, colleagues, customers, and partners: \nFiscal year 2024 was a pivotal year for Microsoft. We entered our 50th year as a company and the second year of the AI \nplatform shift. With these milestones, I’ve found myself reflecting on how Microsoft has remained a consequential company \ndecade after decade in an industry with no franchise value. And I realize that it’s because—time and time again, when tech \nparadigms have shifted —we have seized the opportunity to reinvent ourselves to stay relevant to our customers, our \npartners, and our employees. And that’s what we are doing again today.  \nMicrosoft has been a platform and tools company from the start. We were founded in 1975 with a belief in creating \ntechnology that would enable others to creat

### Splitting the data into chunks for efficient retrieval

First, I split the text of a single page, then the whole document.

In [8]:
from langchain_text_splitters import CharacterTextSplitter
import random

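# Pick a random page and take its raw text
text: str = random.choice(data).page_content

# Split on line breaks, then merge the pieces into chunks of roughly 200 characters
text_splitter = CharacterTextSplitter(separator='\n', chunk_size=200, chunk_overlap=10)

chunks = text_splitter.split_text(text)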

print(chunks)
print([len(chunk) for chunk in chunks])

['11 \nThe Ambitions That Drive Us  \nTo achieve our vision, our research and development efforts focus on three interconnected ambitions:  \n• Reinvent productivity and business processes.', '• Build the intelligent cloud and intelligent edge platform.  \n• Create more personal computing.  \nReinvent Productivity and Business Processes', 'At Microsoft, we provide technology and resources to help our customers create a secure, productive work environment.', 'Our family of products plays a key role in the ways the world works, learns, and connects.', 'Our growth depends on securely delivering continuous innovation and advancing our leading productivity and collaboration', 'tools and services, including Microsoft 365, LinkedIn, and Dynamics 365. Microsoft 365 is an AI first platform that brings', 'together Off ice, Windows, Copilot, and Enterprise Mobility + Security to help organizations empower their employees.', 'Copilot for Microsoft 365 combines AI with business data in the Microsof
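
To see why the chunk lengths above stay close to `chunk_size`, here is a minimal sketch of the split-then-merge idea behind a separator-based splitter. It is not LangChain's implementation: `merge_pieces` is a hypothetical helper, it ignores `chunk_overlap`, and it assumes that a single piece longer than `chunk_size` is kept as one oversize chunk.

```python
def merge_pieces(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on the separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits, or this is the first piece of a new chunk
            current = candidate
        else:
            # Adding the piece would exceed the limit: close the current chunk
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\nbbbb\ncccc\ndddddddddddd"
for chunk in merge_pieces(sample, "\n", chunk_size=10):
    print(len(chunk), repr(chunk))
```

The last piece in `sample` is longer than the limit, so it ends up as its own oversize chunk instead of being cut in the middle.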

Now split the whole PDF into chunks.

In [13]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Separators are tried in the order given, so '\n' is tried before '\n\n' here
text_splitter = RecursiveCharacterTextSplitter(
    separators=['\n', '\n\n'], chunk_size=500, chunk_overlap=100
)

chunks = text_splitter.split_documents(data)

print([len(c.page_content) for c in chunks])

[425, 429, 480, 465, 437, 413, 376, 423, 419, 442, 116, 398, 447, 423, 427, 441, 399, 483, 391, 486, 417, 476, 409, 397, 376, 378, 458, 479, 465, 440, 312, 400, 381, 483, 424, 395, 476, 397, 450, 389, 465, 417, 445, 492, 383, 457, 455, 380, 458, 485, 176, 410, 392, 491, 451, 380, 426, 446, 396, 470, 445, 399, 392, 490, 450, 404, 381, 418, 451, 400, 444, 460, 453, 424, 455, 467, 463, 144, 406, 181, 408, 378, 385, 496, 408, 480, 446, 397, 498, 20, 450, 456, 459, 461, 493, 480, 406, 477, 400, 420, 429, 441, 451, 398, 488, 412, 485, 455, 481, 385, 408, 246, 388, 417, 456, 449, 496, 394, 453, 425, 397, 154, 437, 399, 417, 477, 485, 373, 377, 438, 443, 417, 414, 396, 484, 480, 487, 420, 455, 384, 496, 436, 426, 204, 388, 443, 480, 390, 493, 476, 490, 481, 468, 476, 469, 378, 456, 459, 493, 400, 398, 477, 479, 476, 358, 480, 491, 487, 447, 419, 424, 433, 494, 480, 474, 232, 454, 471, 464, 466, 455, 413, 495, 383, 187, 480, 480, 489, 395, 447, 430, 462, 430, 490, 388, 398, 219, 415, 420, 446, 
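
The recursive idea can be sketched in a few lines: try the first separator, and only fall back to the next one for pieces that are still too long. This is a simplified illustration, not the library's code: `recursive_split` is a hypothetical helper, and unlike the real splitter it does not merge small pieces back together or apply any overlap.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split with the first separator; recurse with the remaining ones on oversize pieces."""
    if len(text) <= chunk_size or not separators:
        # Small enough, or nothing left to split on
        return [text] if text else []
    head, *rest = separators
    chunks: list[str] = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


sample = "a\nb\n\nc"
print(recursive_split(sample, ["\n\n", "\n"], chunk_size=3))
print(recursive_split(sample, ["\n", "\n\n"], chunk_size=3))
```

The two calls differ only in separator order. With the paragraph break first, the short paragraph `a\nb` survives as one chunk. With the line break first, every line is separated before paragraphs are ever considered.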

### Creating the embeddings

In [14]:
import google.generativeai as genai
from langchain_chroma import Chroma
import os
from dotenv import load_dotenv

load_dotenv()
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

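# embed_content returns a dict; the vectors are stored under the "embedding" key
result = genai.embed_content(
    model="models/text-embedding-004", content=[chunk.page_content for chunk in chunks]
)
embeddings: list[list[float]] = result["embedding"]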
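
Once every chunk has an embedding, retrieval mostly comes down to comparing a query vector with the chunk vectors. Below is a minimal sketch using cosine similarity, which is one common choice. The function names and the toy 2D vectors are made up for illustration; real embeddings have far more dimensions, and a vector store such as Chroma would normally do this search for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (index, score) pairs of the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for chunk embeddings (hypothetical values)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], toy_vectors, k=2))
```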

Now split the chunks into tokens.

[Tokenizers](https://github.com/huggingface/tokenizers)

In [20]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

tokenized_data = [tokenizer.encode(chunk.page_content, add_special_tokens=True) for chunk in chunks]

print(tokenized_data[0])

[101, 1015, 6203, 15337, 1010, 8628, 1010, 6304, 1010, 1998, 5826, 1024, 10807, 2095, 16798, 2549, 2001, 1037, 20369, 2095, 2005, 7513, 1012, 2057, 3133, 2256, 12951, 2095, 2004, 1037, 2194, 1998, 1996, 2117, 2095, 1997, 1996, 9932, 4132, 5670, 1012, 2007, 2122, 19199, 2015, 1010, 1045, 1521, 2310, 2179, 2870, 10842, 2006, 2129, 7513, 2038, 2815, 1037, 9530, 3366, 15417, 4818, 2194, 5476, 2044, 5476, 1999, 2019, 3068, 2007, 2053, 6329, 3643, 1012, 1998, 1045, 5382, 2008, 2009, 1521, 1055, 2138, 1517, 2051, 1998, 2051, 2153, 1010, 2043, 6627, 102]
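
A list of token IDs like the one above can also be cut into fixed-size windows, which is useful when the limit that matters is a model's token budget and not a character count. This is only a sketch: `chunk_token_ids` is a hypothetical helper, and the window size and overlap are arbitrary values chosen for the demo.

```python
def chunk_token_ids(token_ids: list[int], max_tokens: int, overlap: int) -> list[list[int]]:
    """Cut a token ID list into windows of at most max_tokens, sharing `overlap` IDs."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    windows: list[list[int]] = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            # This window already reaches the end of the sequence
            break
    return windows


# Stand-in IDs; in the notebook this could be tokenized_data[0]
print(chunk_token_ids(list(range(10)), max_tokens=4, overlap=1))
```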


Possible sparse retrieval methods:

- TF-IDF: Encodes documents using the words that make each document unique
- BM25: Builds on TF-IDF and keeps high-frequency words from saturating the encoding
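
To make the saturation point concrete, here is a small sketch of BM25 scoring over pre-tokenized documents. It uses a common textbook form of the formula with typical defaults (`k1=1.5`, `b=0.75`). The function name and toy documents are made up, and real libraries may differ in details such as the exact IDF variant.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query terms with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term
    doc_freq = Counter(term for d in docs for term in set(d))
    scores: list[float] = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency is damped by k1 and normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    ["cloud", "revenue", "grew"],
    ["cloud", "cloud", "cloud"],
    ["gaming", "revenue", "fell"],
]
print(bm25_scores(["cloud"], toy_docs))
```

The second document repeats "cloud" three times but scores well under three times the first one: each extra occurrence adds less than the one before.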


#### Sparse

A sparse representation is a vector where most elements are zero. These are commonly used in older or simpler Natural Language Processing (NLP) techniques.

- More explainable
- Most vector elements are zero
- Uses term frequency
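
A tiny TF-IDF example shows where the zeros come from: each vector has one slot per vocabulary word, and a document only fills the slots of the words it contains. This is a sketch with a plain `tf * log(N / df)` weighting and made-up documents. `tfidf_vectors` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return the sorted vocabulary and one TF-IDF vector per tokenized document."""
    vocab = sorted({term for d in docs for term in d})
    n_docs = len(docs)
    # Number of documents that contain each term
    doc_freq = Counter(term for d in docs for term in set(d))
    vectors: list[list[float]] = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([
            (tf[t] / len(doc)) * math.log(n_docs / doc_freq[t]) if t in tf else 0.0
            for t in vocab
        ])
    return vocab, vectors


toy_docs = [["cloud", "revenue"], ["gaming", "revenue"], ["linkedin", "revenue"]]
vocab, vectors = tfidf_vectors(toy_docs)
print(vocab)
for vector in vectors:
    print(vector)
```

"revenue" appears in every document, so its weight is zero everywhere. Each vector keeps a single non-zero entry for the word that makes that document unique, which is also why these vectors are easy to interpret.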

#### Dense

A dense representation is a vector where most elements are non-zero. These are common in modern NLP methods, particularly with embeddings learned from neural networks.

- Most vector elements are non-zero
- Extracts semantic meaning

| Feature                | Sparse                            | Dense                         |
|------------------------|-----------------------------------|-------------------------------|
| **Explainability**     | High (easy to interpret)          | Low (abstract dimensions)     |
| **Vector Elements**    | Mostly zeros                      | Mostly non-zero               |
| **Feature Extraction** | Based on frequency                | Extracts semantic meaning     |
| **Use Cases**          | Simple models (e.g., BoW, TF-IDF) | Modern NLP (e.g., embeddings) |

