# RAG with LangChain

These are the notebooks I wrote while working through this [DataCamp](https://app.datacamp.com/learn/courses/retrieval-augmented-generation-rag-with-langchain) course.

I used the Microsoft [2024 Annual Report](https://www.microsoft.com/investor/reports/ar24/download-center/) for my analysis.

### Loading Documents

In [None]:
from langchain_community.document_loaders import PyPDFLoader, UnstructuredHTMLLoader
from langchain_core.documents import Document  # current home of Document; langchain.schema is a legacy alias

# For PDFs
loader = PyPDFLoader('data\\rag_report.pdf')

# For HTML (UnstructuredHTMLLoader needs the path to an HTML file as its first argument)
# htmlLoader = UnstructuredHTMLLoader()
# data = htmlLoader.load()
# print(data[0].page_content)

data: list[Document] = loader.load()

data[0:5]

Document(metadata={'source': 'data\\rag_report.pdf', 'page': 0, 'page_label': '1'}, page_content='  \n  \n \n')
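The first page shown above contains only whitespace, so it would add nothing useful to retrieval. Below is a minimal sketch of dropping such pages before splitting. The helper name `drop_blank_pages` is hypothetical, and `SimpleNamespace` objects stand in for LangChain `Document` instances (the second page's text and page numbers are made up for illustration); the only assumption is that each page has a `page_content` string attribute.

```python
from types import SimpleNamespace


def drop_blank_pages(pages):
    """Return only the pages whose page_content has non-whitespace text.

    Hypothetical helper: works on any object with a `page_content` attribute.
    """
    return [page for page in pages if page.page_content.strip()]


# Stand-ins for loaded pages (illustrative data, not taken from the report)
pages = [
    SimpleNamespace(page_content="  \n  \n \n", metadata={"page": 0}),
    SimpleNamespace(page_content="INVESTOR RELATIONS", metadata={"page": 1}),
]

kept = drop_blank_pages(pages)
print([page.metadata["page"] for page in kept])
```

With real data, the same call would be `drop_blank_pages(data)`, since the loaded `Document` objects expose `page_content` as well.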

In [12]:
# inspecting the object attributes
dir(data[0])

['__abstractmethods__',
 '__annotations__',
 '__class__',
 '__class_getitem__',
 '__class_vars__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__fields__',
 '__fields_set__',
 '__format__',
 '__ge__',
 '__get_pydantic_core_schema__',
 '__get_pydantic_json_schema__',
 '__getattr__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__pretty__',
 '__private_attributes__',
 '__pydantic_complete__',
 '__pydantic_computed_fields__',
 '__pydantic_core_schema__',
 '__pydantic_custom_init__',
 '__pydantic_decorators__',
 '__pydantic_extra__',
 '__pydantic_fields__',
 '__pydantic_fields_set__',
 '__pydantic_generic_metadata__',
 '__pydantic_init_subclass__',
 '__pydantic_parent_namespace__',
 '__pydantic_post_init__',
 '__pydantic_private__',
 '__pydantic_root_model__',
 '__pydantic_serializer__',
 '__pydantic_validator__',

### Splitting the data into chunks for efficient retrieval

First, I try splitting the text of a single page, then splitting the whole document.

In [24]:
from langchain_text_splitters import CharacterTextSplitter
import random

text: str = data[random.randint(0, len(data)-1)].page_content

text_splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=200,
    chunk_overlap=10
)

chunks = text_splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])

['89 \nINVESTOR RELATIONS  \nInvestor Relations  \nYou can contact Microsoft Investor Relations by calling \ntoll-free at (800) 285 -7772 or outside the United States,', 'call (425) 706 -4400. We can be contacted between the \nhours of 9:00 a.m. to 5:00 p.m. Pacific Time to answer \ninvestment-oriented questions about Microsoft.', 'For access to additional financial information, visit the \nInvestor Relations website online at:  \nwww.microsoft.com/investor  \nOur e-mail is msft@microsoft.com  \nOur mailing address is:', 'Investor Relations  \nMicrosoft Corporation  \nOne Microsoft Way  \nRedmond, Washington 98052-6399  \nAttending the Annual Meeting  \nThe 2024 Annual Shareholders Meeting will be held', 'as a virtual-only meeting. Any shareholder can join the \nAnnual Meeting, while shareholders of record as of \nSeptember 30 2024, will be able to vote and submit \nquestions during the meeting.', 'Date: Tuesday, December 10, 2024  \nTime: 8:30 a.m. Pacific Time  \nVirtual Shareholder 
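To build some intuition for what `separator`, `chunk_size`, and `chunk_overlap` do, here is a simplified pure-Python sketch of a separator-based splitter. It is not LangChain's actual implementation: the function name `split_with_overlap` is hypothetical, and the merging rule (greedily pack pieces up to `chunk_size`, then carry over trailing pieces of at most `chunk_overlap` characters) is an assumption about the general idea only.

```python
def split_with_overlap(text: str, separator: str = "\n",
                       chunk_size: int = 200, chunk_overlap: int = 10) -> list[str]:
    """Greedy separator-based splitter (simplified sketch, not LangChain's code)."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current: list[str] = []

    def joined_len(parts: list[str]) -> int:
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it
            chunks.append(separator.join(current))
            # Carry over trailing pieces totalling at most chunk_overlap characters
            tail: list[str] = []
            for p in reversed(current):
                if joined_len([p] + tail) > chunk_overlap:
                    break
                tail.insert(0, p)
            # Shrink the overlap if it would push the next chunk over chunk_size
            while tail and joined_len(tail + [piece]) > chunk_size:
                tail.pop(0)
            current = tail
        # A single piece longer than chunk_size is kept whole
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "\n".join(f"line {i:02d}" for i in range(10))
chunks = split_with_overlap(sample, separator="\n", chunk_size=25, chunk_overlap=8)
print(chunks)
print([len(chunk) for chunk in chunks])
```

In this toy input every line is 7 characters long, so an overlap budget of 8 carries exactly one line from each chunk into the next.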

Now split the whole PDF into chunks.

In [26]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

chunks = text_splitter.split_documents(data)

print([len(c.page_content) for c in chunks])

[943, 930, 973, 976, 685, 971, 985, 889, 880, 540, 887, 900, 953, 959, 435, 908, 910, 909, 982, 94, 884, 983, 964, 938, 663, 889, 932, 980, 971, 816, 884, 989, 966, 548, 978, 944, 961, 419, 519, 922, 910, 963, 978, 142, 908, 955, 961, 959, 897, 944, 879, 995, 883, 780, 974, 973, 974, 792, 915, 936, 996, 938, 907, 274, 966, 909, 960, 947, 528, 956, 995, 976, 961, 946, 230, 988, 913, 972, 957, 358, 973, 991, 958, 920, 936, 129, 947, 988, 953, 817, 962, 886, 913, 926, 908, 340, 944, 976, 877, 885, 882, 330, 884, 911, 890, 936, 962, 904, 950, 892, 347, 959, 918, 924, 948, 353, 970, 942, 929, 909, 927, 939, 857, 934, 933, 974, 948, 216, 945, 887, 691, 944, 969, 168, 923, 976, 997, 474, 909, 987, 940, 238, 993, 994, 913, 976, 930, 465, 903, 892, 927, 800, 933, 916, 919, 892, 243, 914, 968, 925, 911, 883, 369, 982, 916, 782, 885, 990, 922, 986, 616, 960, 956, 942, 963, 890, 978, 918, 947, 922, 993, 958, 139, 974, 967, 130, 971, 299, 560, 990, 982, 152, 955, 987, 504, 955, 302, 992, 977, 962, 
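A long list of lengths is hard to read at a glance. As a small sketch, the summary below checks that no chunk exceeds the configured `chunk_size` and reports a few basic statistics. It uses only the first ten lengths from the output above as sample data; with the real notebook variables, `lengths` would be `[len(c.page_content) for c in chunks]`.

```python
from statistics import mean

chunk_size = 1000  # same value passed to RecursiveCharacterTextSplitter above

# First ten chunk lengths copied from the output above (sample only)
lengths = [943, 930, 973, 976, 685, 971, 985, 889, 880, 540]

# Chunks longer than chunk_size would indicate text the splitter could not break up
oversized = [n for n in lengths if n > chunk_size]

print(f"chunks: {len(lengths)}, min: {min(lengths)}, "
      f"max: {max(lengths)}, mean: {mean(lengths):.1f}")
print(f"chunks over {chunk_size} characters: {len(oversized)}")
```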

### Creating the embeddings

In [None]:
%pip install langchain_openai
from langchain_openai import OpenAIEmbeddings
%pip install langchain_chroma
from langchain_chroma import Chroma

Note: you may need to restart the kernel to use updated packages.
Collecting langchain_openai
  Downloading langchain_openai-0.3.2-py3-none-any.whl.metadata (2.7 kB)
Collecting openai<2.0.0,>=1.58.1 (from langchain_openai)
  Downloading openai-1.60.1-py3-none-any.whl.metadata (27 kB)
Collecting tiktoken<1,>=0.7 (from langchain_openai)
  Downloading tiktoken-0.8.0-cp312-cp312-win_amd64.whl.metadata (6.8 kB)
Collecting distro<2,>=1.7.0 (from openai<2.0.0,>=1.58.1->langchain_openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting jiter<1,>=0.4.0 (from openai<2.0.0,>=1.58.1->langchain_openai)
  Downloading jiter-0.8.2-cp312-cp312-win_amd64.whl.metadata (5.3 kB)
Collecting tqdm>4 (from openai<2.0.0,>=1.58.1->langchain_openai)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting regex>=2022.1.18 (from tiktoken<1,>=0.7->langchain_openai)
  Downloading regex-2024.11.6-cp312-cp312-win_amd64.whl.metadata (41 kB)
Downloading langchain_openai-0.3.2-py3-n