# RAG with LangChain

These are my notebooks for learning from this [DataCamp](https://app.datacamp.com/learn/courses/retrieval-augmented-generation-rag-with-langchain) course.

I used the Microsoft [2024 Annual Report](https://www.microsoft.com/investor/reports/ar24/download-center/) for my analysis.

### Loading Documents

In [1]:
from langchain_community.document_loaders import PyPDFLoader, UnstructuredHTMLLoader
from langchain.schema import Document

# For PDFs
loader = PyPDFLoader('data\\rag_report.pdf')

# For HTML
# htmlLoader = UnstructuredHTMLLoader()
# data = loader.load()
# print(data[0].page_content)

# Loading Markdown files
# from langchain_community.document_loaders import UnstructuredMarkdownLoader

# markdown_loader = UnstructuredMarkdownLoader('README.md')
# markdown_content = markdown_loader.load()

data: list[Document] = loader.load()

data[0:5]

[Document(metadata={'source': 'data\\rag_report.pdf', 'page': 0, 'page_label': '1'}, page_content='  \n  \n \n'),
 Document(metadata={'source': 'data\\rag_report.pdf', 'page': 1, 'page_label': '2'}, page_content=' \n1 \nDear shareholders, colleagues, customers, and partners: \nFiscal year 2024 was a pivotal year for Microsoft. We entered our 50th year as a company and the second year of the AI \nplatform shift. With these milestones, I’ve found myself reflecting on how Microsoft has remained a consequential company \ndecade after decade in an industry with no franchise value. And I realize that it’s because—time and time again, when tech \nparadigms have shifted —we have seized the opportunity to reinvent ourselves to stay relevant to our customers, our \npartners, and our employees. And that’s what we are doing again today.  \nMicrosoft has been a platform and tools company from the start. We were founded in 1975 with a belief in creating \ntechnology that would enable others to creat
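The first page in the output above contains only whitespace, so it adds nothing for retrieval. Below is a minimal sketch of filtering such pages before splitting. The `Page` class and `drop_blank_pages` helper are hypothetical stand-ins I made up for illustration; they are not part of LangChain, although the same list comprehension should work on any objects with a `page_content` attribute.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page (not LangChain's Document class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_blank_pages(pages: list[Page]) -> list[Page]:
    """Keep only pages whose text contains something other than whitespace."""
    return [page for page in pages if page.page_content.strip()]


pages = [
    Page("  \n  \n \n", {"page": 0}),  # whitespace only, like page 0 above
    Page(" \n1 \nDear shareholders, colleagues, customers, and partners: \n", {"page": 1}),
]

kept = drop_blank_pages(pages)
print([page.metadata["page"] for page in kept])  # → [1]
```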

### Splitting the data into chunks for efficient retrieval

First, I split the text of a single page, then the whole document.

In [2]:
from langchain_text_splitters import CharacterTextSplitter
import random

text: str = data[random.randint(0, len(data)-1)].page_content

text_splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=200,
    chunk_overlap=10
)

chunks = text_splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])

['23 \nreach small and medium organizations. ESAs are also typically authorized as LSPs and operate as resellers for our other', 'volume licensing programs. Microsoft Cloud Solution Provider is our main partner program for reselling cloud services.', 'We distribute our retail packaged products primarily through independent non-exclusive distributors, authorized replicators,', 'resellers, and retail outlets. Individual consumers obtain these products primarily through retail outlets. We distribute our', 'devices through third-party retailers. We have a network of field sales representatives and field support personnel that solicit', 'orders from distributors and resellers and provide product training and sales support.', 'Our Dynamics business solutions are also licensed to enterprises through a global network of channel partners providing \nvertical solutions and specialized services.  \nLICENSING OPTIONS', 'We offer options for organizations of varying sizes that want to purchase our 
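To make sense of the chunks above, here is a rough sketch of the idea behind splitting on one separator: cut the text at every `'\n'`, then greedily re-join neighbouring lines while the result still fits in `chunk_size`. This is a simplified illustration and not LangChain's actual implementation; `merge_lines` is a hypothetical helper, and it ignores `chunk_overlap` entirely.

```python
def merge_lines(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily re-join pieces while they fit in `chunk_size`."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece fits, or it is too long on its own and is kept whole.
            current = candidate
        else:
            # Adding the piece would overflow: close the current chunk, start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha\nbeta\ngamma\ndelta"
result = merge_lines(sample, separator="\n", chunk_size=11)
print(result)                           # → ['alpha\nbeta', 'gamma\ndelta']
print([len(chunk) for chunk in result])  # → [10, 11]
```

A single line longer than `chunk_size` is kept whole in this sketch, because with only one separator there is nothing smaller to cut on.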

Now split the whole PDF into chunks.

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Separators are tried in order, so '\n' is used before '\n\n' here
    separators=['\n', '\n\n'],
    chunk_size=1000,
    chunk_overlap=100
)

chunks = text_splitter.split_documents(data)

print([len(c.page_content) for c in chunks])

[943, 930, 973, 976, 685, 971, 985, 889, 880, 540, 887, 900, 953, 959, 435, 908, 910, 909, 982, 94, 884, 983, 964, 938, 663, 889, 932, 980, 971, 816, 884, 989, 966, 548, 978, 944, 961, 419, 519, 922, 910, 963, 978, 142, 908, 955, 961, 959, 897, 944, 879, 995, 883, 780, 974, 973, 974, 792, 915, 936, 996, 938, 907, 274, 966, 909, 960, 947, 528, 956, 995, 976, 961, 946, 230, 988, 913, 972, 957, 358, 973, 991, 958, 920, 936, 129, 947, 988, 953, 817, 962, 886, 913, 926, 908, 340, 944, 976, 877, 885, 882, 330, 884, 911, 890, 936, 962, 904, 950, 892, 347, 959, 918, 924, 948, 353, 970, 942, 929, 909, 927, 939, 857, 934, 933, 974, 948, 216, 945, 887, 691, 944, 969, 168, 923, 976, 997, 474, 909, 987, 940, 238, 993, 994, 913, 976, 930, 465, 903, 892, 927, 800, 933, 916, 919, 892, 243, 914, 968, 925, 911, 883, 369, 982, 916, 782, 885, 990, 922, 986, 616, 960, 956, 942, 963, 890, 978, 918, 947, 922, 993, 958, 139, 974, 967, 130, 971, 299, 560, 990, 982, 152, 955, 987, 504, 955, 302, 992, 977, 962, 
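The order of the `separators` list matters, because a recursive splitter only falls back to the next separator when a piece is still too long. Below is a small sketch of that fallback idea, written from scratch. It is not the library's code, `recursive_split` is a hypothetical name, and it leaves out overlap handling.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; recurse with the remaining ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece is too long on its own: try the finer separators on it.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"

# Coarse separator first: the paragraph break is respected.
coarse_first = recursive_split(sample, ["\n\n", " "], chunk_size=8)
print(coarse_first)  # → ['aaa bbb', 'ccc ddd', 'eee']

# Fine separator first: one chunk straddles the paragraph break.
fine_first = recursive_split(sample, [" ", "\n\n"], chunk_size=8)
print(fine_first)    # → ['aaa', 'bbb\n\nccc', 'ddd eee']
```

In this toy example, listing the coarser separator first keeps paragraphs together where possible, which is why separator lists usually go from coarse to fine.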

### Creating the embeddings

In [4]:
import google.generativeai as genai
from langchain_chroma import Chroma
import os
from dotenv import load_dotenv

load_dotenv()
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

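# embed_content returns a dict; the vectors are under the 'embedding' key
result = genai.embed_content(
        model="models/text-embedding-004",
        content=[chunk.page_content for chunk in chunks])
embeddings: list[list[float]] = result["embedding"]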

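Once every chunk has an embedding, retrieval comes down to comparing a query vector against the stored vectors. A vector store such as Chroma would normally handle this, but the underlying idea can be sketched in plain Python with cosine similarity. The vectors below are toy values I made up, not real model output, and `cosine_similarity` and `top_k` are hypothetical helper names. The sketch assumes no vector is all zeros.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query, vector), index)
              for index, vector in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]


# Toy 3-dimensional "embeddings" (assumed values for illustration only).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query = [0.9, 0.1, 0.0]

best = top_k(toy_query, toy_vectors, k=2)
print(best)  # → [0, 2]
```

The returned indices would map back to positions in `chunks`, so the matching text can be handed to the model as context.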


Now tokenize the text so it can be split into token-based chunks.

[Tokenizers](https://github.com/huggingface/tokenizers)

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Use a new name so the loaded PDF pages in `data` are not overwritten
sentences = ["Hello, how are you?", "I am fine, thank you!"]

tokenized_data = [tokenizer.encode(sentence, add_special_tokens=True) for sentence in sentences]

print(tokenized_data)


[[101, 7592, 1010, 2129, 2024, 2017, 1029, 102], [101, 1045, 2572, 2986, 1010, 4067, 2017, 999, 102]]
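The cell above only produces token IDs; it does not chunk them yet. One way to get token-based chunks is to slide a fixed-size window over the ID list. This is a sketch under my own assumptions: `window_tokens`, `max_tokens` and `overlap` are hypothetical names, and the input is the first ID list printed above.

```python
def window_tokens(token_ids: list[int], max_tokens: int, overlap: int = 0) -> list[list[int]]:
    """Cut a token ID list into windows of at most `max_tokens`, sharing `overlap` IDs."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    windows: list[list[int]] = []
    start = 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # this window already reaches the end of the list
        start += step
    return windows


# Token IDs for "Hello, how are you?" as printed above.
token_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

windows = window_tokens(token_ids, max_tokens=4, overlap=1)
print(windows)
# → [[101, 7592, 1010, 2129], [2129, 2024, 2017, 1029], [1029, 102]]
```

Because the IDs were encoded with `add_special_tokens=True`, the `101` and `102` markers end up only in the first and last windows. For real chunking it may be cleaner to encode without special tokens and decode each window back to text with the same tokenizer.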
