## Webpage loaders
* Let's try loading the Wikipedia page for the 2025 Superman film
* [Superman (2025 film)](https://en.wikipedia.org/wiki/Superman_(2025_film))


In [1]:
from langchain_community.document_loaders import WebBaseLoader
url = "https://en.wikipedia.org/wiki/Superman_(2025_film)"
webpage_loader = WebBaseLoader(url)

USER_AGENT environment variable not set, consider setting it to identify your requests.
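
The warning above asks for a `USER_AGENT` environment variable. A minimal sketch of setting it, assuming the variable is read when the loader module is first imported (so this would have to run before that import). The agent string is a made-up placeholder, not a value from this notebook:

```python
import os

# Hypothetical identifier; replace it with something that describes your own client.
EXAMPLE_USER_AGENT = "rag-notebook-demo/0.1 (contact: you@example.com)"

# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("USER_AGENT", EXAMPLE_USER_AGENT)

user_agent = os.environ["USER_AGENT"]
print(f"USER_AGENT = {user_agent}")
```

Using `setdefault` rather than plain assignment means a value configured outside the notebook is not overwritten.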


In [2]:
type(webpage_loader)

langchain_community.document_loaders.web_base.WebBaseLoader

In [3]:
web_docs = webpage_loader.load()

In [4]:
len(web_docs)

1

In [5]:
type(web_docs[0])

langchain_core.documents.base.Document

# Types to consider

* BaseLoader
* [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html)

In [6]:
superman_webdoc = web_docs[0]

In [7]:
superman_webdoc.type

'Document'

In [8]:
superman_webdoc.metadata

{'source': 'https://en.wikipedia.org/wiki/Superman_(2025_film)',
 'title': 'Superman (2025 film) - Wikipedia',
 'language': 'en'}

In [9]:
import textwrap
wrapped_lines = textwrap.wrap(superman_webdoc.page_content, width=60)
for line in wrapped_lines:
    print(line)

    Superman (2025 film) - Wikipedia
Jump to content        Main menu      Main menu move to
sidebar hide                    Navigation            Main
pageContentsCurrent eventsRandom articleAbout
WikipediaContact us                      Contribute
HelpLearn to editCommunity portalRecent changesUpload
fileSpecial pages                    Search
Search                       Appearance
Donate  Create account  Log in         Personal tools
Donate Create account Log in                      Pages for
logged out editors learn more    ContributionsTalk
Contents move to sidebar hide     (Top)      1 Plot
2 Cast         3 Production     Toggle Production subsection
3.1 Background         3.2 Development         3.3 Pre-
production       3.3.1 Before and during the 2023 labor
strikes         3.3.2 Post-labor strikes           3.4
Filming         3.5 Post-production           4 Music
5 Marketing         6 Release     Toggle Release subsection
6.1 Theatrical         6.2 Home media           7 Rec

In [10]:
len(superman_webdoc.page_content)

146068
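
Much of that length is navigation text and runs of blank lines rather than article prose. A small sketch of one way to tidy the text before chunking; `collapse_whitespace` is a hypothetical helper written for this example, not part of LangChain, and the sample string only imitates the page content:

```python
import re


def collapse_whitespace(text: str) -> str:
    """Squeeze repeated spaces/tabs and drop lines that are empty after stripping."""
    # Replace each run of spaces or tabs with one space, then trim the line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Keep only lines that still contain text.
    return "\n".join(line for line in lines if line)


# Imitation of the raw page text: blank-line runs and tab-indented labels.
sample = "\n\n\nSuperman (2025 film) - Wikipedia\n\n\n\nJump to content\n\n\t\tNavigation\n\t\n"
cleaned = collapse_whitespace(sample)
print(cleaned)
```

Whether to apply this before splitting is a judgment call: it saves space in each chunk, but it also removes the blank lines that a splitter could otherwise use as paragraph boundaries.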

## Now let's try reading a PDF

In [11]:
from langchain_community.document_loaders import PyPDFLoader
story_pdf_file = "Panchatantra.pdf"
storybook_loader = PyPDFLoader(story_pdf_file)

In [12]:
type(storybook_loader)

langchain_community.document_loaders.pdf.PyPDFLoader

In [13]:
story_documents = storybook_loader.load()

In [14]:
len(story_documents)

271

In [15]:
type(story_documents[0])

langchain_core.documents.base.Document

In [16]:
story_documents[1].metadata

{'producer': 'Adobe Acrobat Pro 11.0.3 Paper Capture Plug-in with ClearScan',
 'creator': 'Adobe Acrobat Pro 10.1.4',
 'creationdate': '2012-12-21T16:23:36+02:00',
 'author': 'Pandit Vishnu Sharma',
 'keywords': 'Panchatantra\r\nPandit Vishnu Sharma\r\nTranslated by G. L. Chandiramani',
 'moddate': '2013-08-20T17:28:50+03:00',
 'title': 'Panchatantra',
 'source': 'Panchatantra.pdf',
 'total_pages': 271,
 'page': 1,
 'page_label': 'iii'}
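
In this metadata, `page` appears to be the zero-based position of the page in the PDF, while `page_label` is the label printed on the page itself (roman numerals for front matter). A sketch of looking up the index for a printed label, using plain dicts in place of real `Document` objects; `find_by_label` is a hypothetical helper, and the two entries copy values that appear in this notebook's outputs:

```python
def find_by_label(metadata_list, label):
    """Return the `page` index whose `page_label` matches, or None if absent."""
    for meta in metadata_list:
        if meta.get("page_label") == label:
            return meta["page"]
    return None


# Two (page, page_label) pairs taken from the outputs shown in this notebook.
sample_metadata = [
    {"page": 1, "page_label": "iii"},
    {"page": 17, "page_label": "6"},
]

index_for_six = find_by_label(sample_metadata, "6")
print(index_for_six)
```

With the real data, the same idea would be applied to `doc.metadata` for each item in `story_documents`.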

In [17]:
print(story_documents[14].page_content)

1 
CONFLICT AMONGST FRIENDS 
This is the beginning of the first tantra called, "Co nflict 
amongst friends". 
"A great friendship had developed in the jungle, 
Be tween the lion and the bullock, 
But it was destro yed 
By a very wicked and avaricious jackal." 
Thi's is how the story goes: 
In the south of India there was a city called 
Mahilaropyam. The · son of a very rich merchant lived 
there. His name was Vardhamanak a*. One night, as he 
lay awake in bed, his thoughts were troubled. This is 
what he was turning over in his �ind. "Even when a. 
man has plenty of money, it is still a good thing for 
him to try to make more. As they say: 
'There is nothing in life that money cannot achieve, 
And so a wise man should be bent on incr easing 
his wealth. 
If a man has money, he has friends, 
When he has money, 
He is recognisea by his relativc;s. 
In this world a stranger becomes kinsman to a 
moneyed man, 
Whilst. a poor man is avoided even by his family. 
A man with money will even be

In [18]:
from langchain_community.document_loaders import AsyncHtmlLoader

# List of URLs to scrape HTML from asynchronously
urls = [
    "https://en.wikipedia.org/wiki/Superman_(2025_film)"
]

# Create AsyncHtmlLoader with the URLs
loader = AsyncHtmlLoader(urls)

# If you need to use environment proxies (like http_proxy/https_proxy), set trust_env=True
# loader = AsyncHtmlLoader(urls, trust_env=True)

# Asynchronously load the data (usage within an async context/function)
documents = await loader.aload()  

USER_AGENT environment variable not set, consider setting it to identify your requests.


Fetching pages: 100%|##########| 1/1 [00:00<00:00,  1.18it/s]
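
Top-level `await` works here because the notebook already runs an event loop. In a plain Python script, the same call would typically be wrapped in `asyncio.run`. The sketch below shows that pattern with a stand-in coroutine, `fake_aload`, which is hypothetical and makes no network request; it only takes the place of `loader.aload()`:

```python
import asyncio


async def fake_aload(urls):
    """Hypothetical stand-in for loader.aload(); returns placeholder HTML strings."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return [f"<html><title>{url}</title></html>" for url in urls]


def main():
    urls = ["https://en.wikipedia.org/wiki/Superman_(2025_film)"]
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(fake_aload(urls))


pages = main()
print(len(pages))
```

Inside the notebook itself, keep using `await`: calling `asyncio.run` while a loop is already running raises a `RuntimeError`.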


In [19]:
documents[0].page_content

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Superman (2025 film) - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled

In [20]:
from langchain_community.document_transformers import BeautifulSoupTransformer
bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(
    documents, tags_to_extract=["p", "li", "div", "a"]
)

In [21]:
type(docs_transformed[0])

langchain_core.documents.base.Document

In [22]:
print(docs_transformed[0].page_content)

Jump to content (#bodyContent)        Main menu       Main menu  move to sidebar  hide    Navigation    Main page Contents Current events Random article About Wikipedia Contact us      Contribute    Help Learn to edit Community portal Recent changes Upload file Special pages           (/wiki/Main_Page)  (/wiki/Main_Page)     (/wiki/Main_Page)      (/wiki/Special:Search) Search  (/wiki/Special:Search)            Search                        Appearance                  Donate   Create account   Log in          Personal tools       Donate  (/w/index.php?title=Special:CreateAccount&returnto=Superman+%282025+film%29) Create account  (/w/index.php?title=Special:UserLogin&returnto=Superman+%282025+film%29) Log in      Pages for logged out editors learn more     Contributions Talk             CentralNotice                 Contents  move to sidebar  hide      (#) (Top)  (#)     (#Plot)  1  Plot   (#Plot)       (#Cast)  2  Cast   (#Cast)       (#Production)  3  Production   (#Production)    Tog
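
Because `div` and `a` are in `tags_to_extract`, the result above still contains the navigation links and menus. To illustrate what "keep only the text inside certain tags" means, here is a standard-library sketch; it is not how `BeautifulSoupTransformer` is implemented, and `TagTextExtractor` and the sample HTML are made up for this example:

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the wanted tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.depth = 0    # number of wanted tags currently open
        self.pieces = []  # text fragments found inside wanted tags

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.pieces.append(text)


sample_html = (
    "<html><head><title>Demo</title><script>var x = 1;</script></head>"
    "<body><p>First paragraph.</p><ul><li>Item one</li></ul>"
    "<span>ignored</span></body></html>"
)

extractor = TagTextExtractor(["p", "li"])
extractor.feed(sample_html)
extracted = " ".join(extractor.pieces)
print(extracted)
```

Narrowing the tag list in the same way (for example to `p` and `li` only) may reduce the navigation noise in the transformed document, at the cost of dropping any content that lives only in `div` or `a` elements.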

### Chunking

In [None]:
# Fixed-length splitting: first inspect the raw page text
print(superman_webdoc.page_content)






Superman (2025 film) - Wikipedia


































Jump to content







Main menu





Main menu
move to sidebar
hide



		Navigation
	


Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us





		Contribute
	


HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages



















Search











Search






















Appearance
















Donate

Create account

Log in








Personal tools





Donate Create account Log in





		Pages for logged out editors learn more



ContributionsTalk




























Contents
move to sidebar
hide




(Top)





1
Plot








2
Cast








3
Production




Toggle Production subsection





3.1
Background








3.2
Development








3.3
Pre-production






3.3.1
Before and during the 2023 labor strikes








3.3.2
Post-labor strikes










3.4
Filming








3.5
Post-production










4
Music








5
Marketing








6
Release




Toggle Rel


In [29]:
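from langchain_text_splitters import CharacterTextSplitter

# Split on single newlines. With the default chunk_size (4000) and chunk_overlap (200),
# the resulting lines are merged back into chunks of up to roughly 4000 characters.
text_splitter = CharacterTextSplitter(separator='\n')
chunks = text_splitter.split_text(superman_webdoc.page_content)
print(f"total chunks = {len(chunks)}")
for chunk in chunks:
    print(chunk)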

total chunks = 42
Superman (2025 film) - Wikipedia
Jump to content
Main menu
Main menu
move to sidebar
hide
		Navigation
	
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us
		Contribute
	
HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages
Search
Search
Appearance
Donate
Create account
Log in
Personal tools
Donate Create account Log in
		Pages for logged out editors learn more
ContributionsTalk
Contents
move to sidebar
hide
(Top)
1
Plot
2
Cast
3
Production
Toggle Production subsection
3.1
Background
3.2
Development
3.3
Pre-production
3.3.1
Before and during the 2023 labor strikes
3.3.2
Post-labor strikes
3.4
Filming
3.5
Post-production
4
Music
5
Marketing
6
Release
Toggle Release subsection
6.1
Theatrical
6.2
Home media
7
Reception
Toggle Reception subsection
7.1
Box office
7.2
Critical response
7.3
Accolades
8
Future
9
References
10
External links
Toggle the table of contents
Superman (2025 film)
40 languages
العربيةAzərbaycancaবাংলাБългарскиCa
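
The chunk above is long because the splitter first cuts the text at every separator and then packs the pieces back together until the size limit is reached. A simplified sketch of that split-then-merge idea, written for illustration only: `split_and_merge` is a hypothetical function, it ignores overlap, and it is not the library's implementation.

```python
def split_and_merge(text, separator="\n", chunk_size=40):
    """Split on `separator`, then greedily pack pieces into chunks.

    Chunks stay within `chunk_size` characters, except that a single piece
    longer than `chunk_size` is kept whole rather than cut.
    """
    pieces = [piece for piece in text.split(separator) if piece.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate      # piece still fits in the open chunk
        else:
            chunks.append(current)   # close the open chunk
            current = piece          # start a new one with this piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Plot\nCast\nProduction\nBackground\nDevelopment\n"
    "Pre-production\nFilming\nPost-production\nMusic\nMarketing"
)
demo_chunks = split_and_merge(sample, separator="\n", chunk_size=40)
for number, chunk in enumerate(demo_chunks, start=1):
    print(number, repr(chunk))
```

The sample uses section names from the page so the packing is easy to follow; with the real page text and a 4000-character limit, the same process yields a few dozen large chunks.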

In [30]:
for doc in text_splitter.split_documents([superman_webdoc]):
    print(doc.page_content)

Superman (2025 film) - Wikipedia
Jump to content
Main menu
Main menu
move to sidebar
hide
		Navigation
	
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us
		Contribute
	
HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages
Search
Search
Appearance
Donate
Create account
Log in
Personal tools
Donate Create account Log in
		Pages for logged out editors learn more
ContributionsTalk
Contents
move to sidebar
hide
(Top)
1
Plot
2
Cast
3
Production
Toggle Production subsection
3.1
Background
3.2
Development
3.3
Pre-production
3.3.1
Before and during the 2023 labor strikes
3.3.2
Post-labor strikes
3.4
Filming
3.5
Post-production
4
Music
5
Marketing
6
Release
Toggle Release subsection
6.1
Theatrical
6.2
Home media
7
Reception
Toggle Reception subsection
7.1
Box office
7.2
Critical response
7.3
Accolades
8
Future
9
References
10
External links
Toggle the table of contents
Superman (2025 film)
40 languages
العربيةAzərbaycancaবাংলাБългарскиCatalàČeštinaDeutsch

In [33]:
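from langchain_text_splitters import RecursiveCharacterTextSplitter

# The recursive splitter tries its separators in order (by default paragraph breaks,
# then newlines, then spaces, then single characters) until chunks fit within chunk_size.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents([superman_webdoc])
print(f"total docs = {len(docs)}")
for doc in docs:
    print(doc.page_content)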

total docs = 195
Superman (2025 film) - Wikipedia


































Jump to content







Main menu





Main menu
move to sidebar
hide



		Navigation
	


Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us





		Contribute
	


HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages



















Search











Search






















Appearance
















Donate

Create account

Log in








Personal tools





Donate Create account Log in





		Pages for logged out editors learn more



ContributionsTalk




























Contents
move to sidebar
hide




(Top)





1
Plot








2
Cast








3
Production




Toggle Production subsection





3.1
Background








3.2
Development








3.3
Pre-production






3.3.1
Before and during the 2023 labor strikes








3.3.2
Post-labor strikes










3.4
Filming








3.5
Post-production










4
Music








5
Marketing








6
Release
T

In [36]:
# Let's chunk the Panchatantra book
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
document_chunks = text_splitter.split_documents(story_documents)
print(f"total chunks = {len(document_chunks)}")
for doc in document_chunks:
    print(doc.page_content)

total chunks = 489
PANCHATANTRA 
PANDIT VISHNU SHARMA 
Translated by 
G. L. Chandiramani
Copyright© Sheila G. Chandiramani 
First in Rupa Paperback 1991 
Twentieth Impression 2011 
Published by 
Rupa Publications India Pvt. Ltd. 
7/16, Ansari Road, Daryaganj, 
New Delhi 110 002 
Sales Centres: 
Allahabad Bengaluru Chennai 
Hyderabad Jaipur Kathmandu 
Kolkata Mumbai 
All rights reserved. 
No part of this publication may be reproduced, stored in a 
retrieval system, or transmitted, in any form or by any means, 
'• 
electronic, mechanical, photocopying, recording or otherwise, 
without the prior permission of the publishers. 
Typeset by 
Mindways Design 
1410 Chiranjiv Tower 
43 Nehru Place 
New Delhi 110 019 
Printed in India by 
Gopsons Papers Ltd. 
A-14 Sector 60 
Noida 201 301
PREFACE 
The original text of the Panchataritra in Sanskrit was 
probably written about 200 B.C. by a great Hindu 
scholar,. Pandit Vishnu _Sharma. But some of the tales 
themse lves mcust-be much older, their o
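
The `chunk_overlap=50` setting means neighbouring chunks may share some text, so a sentence cut at a boundary still appears whole in at least one chunk. A character-window sketch of the idea follows; `sliding_chunks` is a hypothetical helper, and the real splitter prefers to cut at separators, so its overlap is approximate rather than exact as it is here.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_chunks(alphabet, chunk_size=10, chunk_overlap=3)
for window in windows:
    print(window)
```

Each window starts with the last three characters of the previous one, which is the behaviour the overlap parameter is meant to provide.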

# Embedding models

In [38]:
from langchain_google_vertexai import VertexAIEmbeddings

In [39]:
embeddings = VertexAIEmbeddings(model="text-embedding-005")



In [41]:
vector_embedding_query = embeddings.embed_query(text="how are you?")
vector_embedding_query

[-0.011548829264938831,
 -0.01269457582384348,
 -0.049091752618551254,
 -0.05168934538960457,
 -0.0409785695374012,
 -0.034238580614328384,
 0.020007850602269173,
 -0.04536178708076477,
 -0.02350171096622944,
 0.034224942326545715,
 -0.015385705046355724,
 -0.04004182666540146,
 -0.00989760272204876,
 -0.057200994342565536,
 0.05143439769744873,
 0.0242730975151062,
 0.06529755890369415,
 -0.09741871803998947,
 -0.02776661515235901,
 0.0368206724524498,
 0.051008231937885284,
 -0.050489939749240875,
 -0.017314953729510307,
 -0.05729701742529869,
 -0.03764816001057625,
 0.046771787106990814,
 0.016267718747258186,
 -0.02478523552417755,
 -0.06697049736976624,
 0.04353472590446472,
 0.0014090444892644882,
 -0.017848506569862366,
 0.10027343779802322,
 0.06057998165488243,
 -0.009515620768070221,
 -0.005507363472133875,
 0.049352362751960754,
 0.0062264613807201385,
 0.017378881573677063,
 -0.04067828506231308,
 -0.019675174728035927,
 -0.00914436113089323,
 -0.05006292834877968,
 -0.0215
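
An embedding is just a list of floats; it becomes useful when two such lists are compared. A common comparison is cosine similarity, sketched below with tiny made-up three-dimensional vectors (real embeddings from the model have many more dimensions, and these numbers are not model output):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]
far_vec = [-1.0, 0.5, -0.7]

sim_close = cosine_similarity(query_vec, close_vec)
sim_far = cosine_similarity(query_vec, far_vec)
print(f"close: {sim_close:.3f}, far: {sim_far:.3f}")
```

With real embeddings, texts of similar meaning are expected to score higher than unrelated ones, which is what similarity search relies on.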

In [42]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
docs = text_splitter.split_documents([superman_webdoc])
# Collect the text of every chunk, then embed them all in one call
texts = [doc.page_content for doc in docs]
vector_repr = embeddings.embed_documents(texts=texts)

In [43]:
vector_repr

[[-0.03961257264018059,
  -0.028536153957247734,
  0.02037365734577179,
  -0.007251320872455835,
  -0.024835577234625816,
  -0.0003270332526881248,
  0.00845731794834137,
  0.06264253705739975,
  0.01391521468758583,
  0.027165791019797325,
  -0.005020140670239925,
  0.015036875382065773,
  0.007056754548102617,
  -0.07289724797010422,
  -0.0071425368078053,
  0.0346880741417408,
  -0.026666689664125443,
  -0.007937656715512276,
  -0.02415270172059536,
  0.03183849900960922,
  0.037558794021606445,
  -0.047896549105644226,
  -0.037035975605249405,
  0.018447043374180794,
  0.05949109047651291,
  -0.030375858768820763,
  0.02903946116566658,
  -0.05314372852444649,
  -0.04342220723628998,
  -0.09113708138465881,
  0.05808420479297638,
  0.004468041937798262,
  -0.016898294910788536,
  -0.009318548254668713,
  0.009279350750148296,
  -0.06346628814935684,
  0.09406067430973053,
  -0.039498720318078995,
  -0.006033537443727255,
  0.005233412608504295,
  -0.020910030230879784,
  -0.0120534

In [45]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

In [46]:
vector_store.add_documents(docs)

['2c08366f-0ad3-4cc1-8d95-e4364b9db10b',
 'f1dbb5f1-9787-49e6-bc5f-8ebc3b5c729f',
 '934813d3-eb00-459a-ae1c-c82ce0fc12bc',
 '09201a5a-b6eb-458c-a704-1e6c4e316063',
 '74f3ec7a-12e6-4809-8cb2-e56d8f69825f',
 '2ae01c83-a700-4d19-a899-796a006e953c',
 '87c0ba39-da44-48a1-bf13-b6217adf6451',
 '407a3033-8a15-42b1-9015-73fb777ecbb6',
 '3d72b69f-999c-47b2-a102-7d3befa828e4',
 '710e4b3a-7a7d-4ebc-a9f5-e34abbecd0bf',
 '8518bc0c-acdd-4d52-a95f-3c8d045a7f25',
 '78f24062-55b1-439b-9e58-6c708668267c',
 'a4c0302f-1fb3-461b-a0fa-a0e92592e92d',
 '3c444b4c-72c5-4b4d-a251-17a4f9259397',
 'efbbd917-ae16-4024-9b5a-7f992f6dfe88',
 'f9fabf2a-141f-4939-8741-f0d6b735b318',
 '170879e6-df3d-494a-8353-ebe2abf996d4',
 '11a2109f-bf0a-4b81-b474-5a988ff80860',
 '28c1856c-9ae0-4988-9a70-bedd6e5423ba',
 '651ec8d7-e10a-4bfd-a221-60a95328765d',
 '1cd44c96-899f-4979-b057-643c2af7f471',
 '6e0fa5d0-e2ac-4ccc-8289-d3b7a2cc35d1',
 'bb6c495a-d7c0-4085-863a-0ada56ed7fa1',
 '41b3d56b-0bab-4db6-9881-ef9eaa696b3c',
 'b397e5d8-2639-

In [50]:
# Let's chunk the Panchatantra book
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
document_chunks = text_splitter.split_documents(story_documents)
print(f"total chunks = {len(document_chunks)}")


total chunks = 489


In [48]:
pc_vector_store = Chroma(
    collection_name="pc_collection",
    embedding_function=embeddings,
    persist_directory="./pc_chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

In [51]:
pc_vector_store.add_documents(document_chunks)

['ae22958f-d888-4edf-8bea-da15e8d2f93a',
 'f44628c1-4c62-4e25-975e-50d6275c970e',
 'a08a2051-95f6-45d2-b6f1-a8b7b39b7175',
 'f4354e7a-76b6-4bc8-87f1-947db306ceaf',
 'ab718405-3ab7-4990-98e1-be3fcd77d1e4',
 'd9b2cc63-e9bb-4228-8301-fce874507bd6',
 '8cade50e-ec97-4616-be82-edba198c9122',
 '3181118d-ce3d-4070-a864-05399c806a6f',
 'd1ddb620-e1c0-4aad-b060-556f123dcd15',
 '7fd7f594-4176-43de-8046-c914c87bf595',
 '33c4dd05-e817-4eb2-a0af-f6465550ed37',
 '76747f91-973c-4431-a162-c51da0e843e9',
 '9479b61f-e347-4e30-afe2-4c046c40d255',
 '7c1b4893-7834-4aae-990a-874181cd1338',
 '945fc43a-5818-4fc5-8b93-fd4a2fc2d391',
 'aba3613b-5b56-424b-8623-6289c1128722',
 'f5c02e09-7301-48d5-871e-d4560f243797',
 '33cc74a8-1ce2-4e31-bd5b-c027e9a9cdf3',
 '2dfc5396-41fa-4b37-9c9d-8f84267368d6',
 '5956839c-82fe-4157-8e67-c3e3273ac2f7',
 'd5bac029-aa9e-44ef-b728-c159e373ad98',
 'f756e8e1-9918-4fa4-be3d-c8fef7b2688e',
 '43d3006e-3cf1-4735-8ba9-221a7149e4ea',
 '2ffc6dc2-c4c5-489c-a6d0-ca9920dc88c4',
 '926eb22d-7055-

In [52]:
pc_vector_store.as_retriever().invoke("What did monkey do to log")

[Document(id='017a819a-9d28-41b6-b1bd-cdf95bedfdc1', metadata={'author': 'Pandit Vishnu Sharma', 'producer': 'Adobe Acrobat Pro 11.0.3 Paper Capture Plug-in with ClearScan', 'creator': 'Adobe Acrobat Pro 10.1.4', 'creationdate': '2012-12-21T16:23:36+02:00', 'page_label': '6', 'keywords': 'Panchatantra\r\nPandit Vishnu Sharma\r\nTranslated by G. L. Chandiramani', 'source': 'Panchatantra.pdf', 'title': 'Panchatantra', 'total_pages': 271, 'moddate': '2013-08-20T17:28:50+03:00', 'page': 17}, page_content='6 PANCH ATANTRA \nTHE STORY OF THE MONKEY AND THE LOG \n"A merchant had started building a temple beneath \nthe trees on the outskirts of a town. Every day the \ncarpenters and the workmen used to go into the town \nfor their midday meals. Now, one particular day, a troop \nof wandering monkeys arrived on the scene. One of \nthe carpen ters, who was in the middle of sawing a log, · \n1put a wedge in it, to prevent the log from closing up, and then went off. \n"The monkeys started playing 
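
Conceptually, the retriever embeds the query, compares it with the stored chunk vectors, and returns the closest chunks. The sketch below imitates that with an in-memory list and cosine similarity. The vectors are made up, `top_k` is a hypothetical helper, and the distance metric and index that Chroma actually uses may differ:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, indexed_chunks, k=2):
    """Return the texts of the k chunks whose vectors are most similar to the query."""
    scored = [(cosine(query_vector, vector), text) for text, vector in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# (text, vector) pairs; the vectors are invented stand-ins for real embeddings.
indexed_chunks = [
    ("THE STORY OF THE MONKEY AND THE LOG", [0.9, 0.1, 0.0]),
    ("CONFLICT AMONGST FRIENDS", [0.1, 0.9, 0.1]),
    ("PREFACE", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "What did monkey do to log"

results = top_k(query_vector, indexed_chunks, k=2)
print(results)
```

A real vector store adds persistence and an index so the search does not have to scan every chunk, but the ranking idea is the same.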