### This notebook demonstrates commonly used LangChain Loaders and Splitters

In LangChain, a Document is a simple structure with two fields:
- page_content (string): The raw text of the document.
- metadata (dictionary): Additional information about the text, such as the source URL, author, or any other relevant details.

In [5]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader("gold_wiki.txt")
document = loader.load()
print(document)

[Document(metadata={'source': 'gold_wiki.txt'}, page_content='Gold as an investment\n\nOf all the precious metals, gold is the most popular as an investment. Investors generally buy gold as a way of diversifying risk, especially through the use of futures contracts and derivatives. The gold market is subject to speculation and volatility as are other markets. Compared to other precious metals used for investment, gold has been the most effective safe haven across a number of countries.[1]\n\nGold price\n\nGold prices (US$ per troy ounce), in nominal US$ and inflation adjusted US$ from 1914 onward.\n\nPrice of gold 1915–2022\n\nGold price history in 1960–present\n\nGold price per gram between Jan 1971 and Jan 2012. The graph shows nominal price in US dollars, the price in 1971 and 2011 US dollars. The notable peak in 1980 followed the Soviet military involvement in Afghanistan, after a decade of inflation, oil shocks, and American military failures.\nGold has been used throughout histor

In [6]:
document[0].page_content

'Gold as an investment\n\nOf all the precious metals, gold is the most popular as an investment. Investors generally buy gold as a way of diversifying risk, especially through the use of futures contracts and derivatives. The gold market is subject to speculation and volatility as are other markets. Compared to other precious metals used for investment, gold has been the most effective safe haven across a number of countries.[1]\n\nGold price\n\nGold prices (US$ per troy ounce), in nominal US$ and inflation adjusted US$ from 1914 onward.\n\nPrice of gold 1915–2022\n\nGold price history in 1960–present\n\nGold price per gram between Jan 1971 and Jan 2012. The graph shows nominal price in US dollars, the price in 1971 and 2011 US dollars. The notable peak in 1980 followed the Soviet military involvement in Afghanistan, after a decade of inflation, oil shocks, and American military failures.\nGold has been used throughout history as money and has been a relative standard for currency equi

In [7]:
document[0].metadata

{'source': 'gold_wiki.txt'}

#### Types of Document Loaders in LangChain
LangChain offers three main types of Document Loaders:
- Transform Loaders: Handle different input formats and transform them into the Document format.
- Public Dataset or Service Loaders: Allow quick retrieval of content from public sources and creation of Documents from it.
- Proprietary Dataset or Service Loaders: Handle proprietary sources that may require additional authentication or setup. For instance, a loader could be written specifically to load data from an internal database or an API with proprietary access.
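
The idea of a loader for an internal database can be sketched as follows. This is only an illustration under stated assumptions: `load_from_database` is a hypothetical helper (not a LangChain API), an in-memory SQLite table stands in for the proprietary source, and a small stand-in `Document` class is used if LangChain is not installed.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Iterator

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, so the sketch runs without LangChain.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def load_from_database(conn: sqlite3.Connection, table: str) -> Iterator[Document]:
    """Yield one Document per row of a (hypothetical) internal table."""
    # The table name is assumed to be trusted; never interpolate user input here.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    for row_number, row in enumerate(cursor):
        text = "\n".join(f"{name}: {value}" for name, value in zip(columns, row))
        yield Document(page_content=text, metadata={"source": table, "row": row_number})


# Demo: an in-memory database stands in for the proprietary source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gold_price (year INTEGER, usd_per_ozt INTEGER)")
conn.executemany("INSERT INTO gold_price VALUES (?, ?)", [(1970, 37), (1975, 140)])

docs = list(load_from_database(conn, "gold_price"))
print(docs[0].page_content)
print(docs[0].metadata)
```

Whatever the source, the loader's job is the same: put the text in `page_content` and record where it came from in `metadata`.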

**Transform Loader example**

In [19]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("gold_price.csv")
documents = loader.load()

# CSVLoader produces one Document per CSV row.
for doc in documents:
    print(doc.page_content)
    print("-------")

Year: 1970
Gold Usd/ozt: 37
DJIA USD: 839
DJIA XAU: 22.7
World GDP USD: 3.3
World GDP XAU: 89.2
-------
Year: 1975
Gold Usd/ozt: 140
DJIA USD: 852
DJIA XAU: 6.1
World GDP USD: 6.4
World GDP XAU: 45.7
-------
Year: 1980
Gold Usd/ozt: 590
DJIA USD: 964
DJIA XAU: 1.6
World GDP USD: 11.8
World GDP XAU: 20
-------
Year: 1985
Gold Usd/ozt: 327
DJIA USD: 1,547
DJIA XAU: 4.7
World GDP USD: 13
World GDP XAU: 39.8
-------
Year: 1990
Gold Usd/ozt: 391
DJIA USD: 2,634
DJIA XAU: 6.7
World GDP USD: 22.2
World GDP XAU: 56.8
-------
Year: 1995
Gold Usd/ozt: 387
DJIA USD: 5,117
DJIA XAU: 13.2
World GDP USD: 29.8
World GDP XAU: 77
-------
Year: 2000
Gold Usd/ozt: 273
DJIA USD: 10,787
DJIA XAU: 39.5
World GDP USD: 31.9
World GDP XAU: 116.8
-------
Year: 2005
Gold Usd/ozt: 513
DJIA USD: 10,718
DJIA XAU: 20.9
World GDP USD: 45.1
World GDP XAU: 87.9
-------
Year: 2010
Gold Usd/ozt: 1,410
DJIA USD: 11,578
DJIA XAU: 8.2
World GDP USD: 63.2
World GDP XAU: 44.8
-------
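
Each row above becomes one Document whose `page_content` is a block of `column: value` lines, with every value kept as text (note `1,547`). If numeric values are needed downstream, one possible approach is to parse the text back into a dictionary. The helper name `parse_row` is hypothetical, and the sample string is copied from the 1985 row above.

```python
def parse_row(page_content: str) -> dict:
    """Turn 'column: value' lines back into a dict, converting numbers where possible."""
    record = {}
    for line in page_content.splitlines():
        key, _, raw = line.partition(": ")
        value = raw.replace(",", "")  # drop thousands separators such as "1,547"
        try:
            record[key] = float(value) if "." in value else int(value)
        except ValueError:
            record[key] = raw  # leave non-numeric values untouched
    return record


sample = (
    "Year: 1985\nGold Usd/ozt: 327\nDJIA USD: 1,547\n"
    "DJIA XAU: 4.7\nWorld GDP USD: 13\nWorld GDP XAU: 39.8"
)
row = parse_row(sample)
print(row)
```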


### PDF Loader
PyPDFLoader loads each page of the PDF as one Document.

In [25]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("Gold_as_an_investment.pdf")
pages = loader.load()


In [28]:
# Each page of the PDF is a separate Document; number them from 1.
for cnt, page in enumerate(pages, start=1):
    print("---- Document #", cnt)
    print(page.page_content.strip())

---- Document # 1
Gold
ISO 4217
Code XAU (numeric: 959)
Unit
Symbol XAU 
Demographics
User(s) Investors
Reserves of SDR, forex and gold in 2006
A Good Delivery bar, the standard for
trade in the major international gold
markets.
Gold as an investment
Of all the precious metals, gold is the most popular as
an investment. Investors generally buy gold as a way
of diversifying risk, especially through the use of
futures contracts and derivatives. The gold market is
subject to speculation and volatility as are other
markets. Compared to other precious metals used for
investment, gold has been the most effective safe
haven across a number of countries.[1]
Gold has been used throughout history as money and has
been a relative standard for currency equivalents specific to
economic regions or countries, until recent times. Many
European countries implemented gold standards in the latter
part of the 19th century until these were temporarily
suspended in the financial crises involving World War I

**Web Loader**

In [29]:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Gold")
data = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.
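
The warning above asks for a `USER_AGENT` environment variable. A minimal sketch of one way to set it from inside the notebook, assuming the variable is read when the loader module is imported (so this cell would need to run before the `WebBaseLoader` import). The identifier string is a made-up example.

```python
import os

# Hypothetical identifier; choose a string that describes your own project.
# setdefault keeps any value that is already configured in the environment.
os.environ.setdefault("USER_AGENT", "loaders-and-splitters-notebook/0.1")

print(os.environ["USER_AGENT"])
```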


In [30]:
data[0].page_content

'\n\n\n\nGold - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact us\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload fileSpecial pages\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAppearance\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDonate\n\nCreate account\n\nLog in\n\n\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\nDonate Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1\nCharacteristics\n\n\n\n\nToggle Characteristics subsection\n\n\n\n\n\n1.1\nColor\n\n\n\n\n\n\n\n\n1.2\nIsotopes\n\n\n\n\n\n\n1.2.1\nSynthesis

In [33]:
formatted_text = data[0].page_content.replace("\n\n", " \n")

print(formatted_text)

 
 
Gold - Wikipedia 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Jump to content 
 
 
 
Main menu 
 
 
Main menu
move to sidebar
hide 
 
		Navigation
	 

Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us 
 
 
		Contribute
	 

HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages 
 
 
 
 
 
 
 
 
 
Search 
 
 
 
 
 
Search 
 
 
 
 
 
 
 
 
 
 

Appearance 
 
 
 
 
 
 
 

Donate 
Create account 
Log in 
 
 
 

Personal tools 
 
 
Donate Create account Log in 
 
 
		Pages for logged out editors learn more 
 
ContributionsTalk 
 
 
 
 
 
 
 
 
 
 
 
 
 

Contents
move to sidebar
hide 
 

(Top) 
 
 
1
Characteristics 
 

Toggle Characteristics subsection 
 
 
1.1
Color 
 
 
 

1.2
Isotopes 
 
 

1.2.1
Synthesis 
 
 
 
 
 

2
Chemistry 
 

Toggle Chemistry subsection 
 
 
2.1
Rare oxidation states 
 
 
 
 

3
Origin 
 

Toggle Origin subsection 
 
 
3.1
Gold production in the universe 
 
 
 

3.2
Asteroid origin theories 
 
 
 

3.3
Mantle return theories 
 
 
 


In [34]:
import re

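# Collapse runs of newlines, then squeeze every run of whitespace into one space.
# Note: the second substitution also replaces the newlines left by the first,
# so the result is a single line of text.
cleaned_text = re.sub(r"\n+", "\n\n", formatted_text)
cleaned_text = re.sub(r"\s+", " ", cleaned_text)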

print(cleaned_text)

 Gold - Wikipedia Jump to content Main menu Main menu move to sidebar hide Navigation Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us Contribute HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages Search Search Appearance Donate Create account Log in Personal tools Donate Create account Log in Pages for logged out editors learn more ContributionsTalk Contents move to sidebar hide (Top) 1 Characteristics Toggle Characteristics subsection 1.1 Color 1.2 Isotopes 1.2.1 Synthesis 2 Chemistry Toggle Chemistry subsection 2.1 Rare oxidation states 3 Origin Toggle Origin subsection 3.1 Gold production in the universe 3.2 Asteroid origin theories 3.3 Mantle return theories 4 Occurrence Toggle Occurrence subsection 4.1 Seawater 5 History Toggle History subsection 5.1 Etymology 5.2 Culture 5.2.1 Religion 6 Production Toggle Production subsection 6.1 Mining and prospecting 6.2 Extraction and refining 6.3 Recycling 6.4 Consumption 6.5 Pollution 7 Monetary u
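
The cleanup above flattens the page into one line. If paragraph breaks should survive (for example, to split on `"\n\n"` later), a possible alternative is sketched below. `clean_page_text` is a hypothetical helper, and the sample string only imitates the shape of the scraped text.

```python
import re


def clean_page_text(text: str) -> str:
    """Collapse whitespace inside paragraphs but keep one blank line between them."""
    # A paragraph break is any whitespace run that contains at least two newlines.
    paragraphs = re.split(r"\s*\n\s*\n\s*", text)
    cleaned = (re.sub(r"\s+", " ", p).strip() for p in paragraphs)
    return "\n\n".join(p for p in cleaned if p)


sample = (
    "\n\n\nGold - Wikipedia\n\n\n\nJump to content\n\n\t\tNavigation\n\t\n\n"
    "Main menu\nmove to sidebar\nhide\n"
)
print(clean_page_text(sample))
```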

**JSON Loader**

In [37]:
from langchain_community.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint

# Preview the raw JSON file before passing it to a loader.
file = "gold.json"
data = json.loads(Path(file).read_text())
pprint(data)

[{'DJIA USD': '839',
  'DJIA XAU': 22.7,
  'Gold Usd': {'ozt': '37'},
  'World GDP USD': 3.3,
  'World GDP XAU': 89.2,
  'Year': 1970},
 {'DJIA USD': '852',
  'DJIA XAU': 6.1,
  'Gold Usd': {'ozt': '140'},
  'World GDP USD': 6.4,
  'World GDP XAU': 45.7,
  'Year': 1975},
 {'DJIA USD': '964',
  'DJIA XAU': 1.6,
  'Gold Usd': {'ozt': '590'},
  'World GDP USD': 11.8,
  'World GDP XAU': 20,
  'Year': 1980},
 {'DJIA USD': '1,547',
  'DJIA XAU': 4.7,
  'Gold Usd': {'ozt': '327'},
  'World GDP USD': 13,
  'World GDP XAU': 39.8,
  'Year': 1985},
 {'DJIA USD': '2,634',
  'DJIA XAU': 6.7,
  'Gold Usd': {'ozt': '391'},
  'World GDP USD': 22.2,
  'World GDP XAU': 56.8,
  'Year': 1990},
 {'DJIA USD': '5,117',
  'DJIA XAU': 13.2,
  'Gold Usd': {'ozt': '387'},
  'World GDP USD': 29.8,
  'World GDP XAU': 77,
  'Year': 1995},
 {'DJIA USD': '10,787',
  'DJIA XAU': 39.5,
  'Gold Usd': {'ozt': '273'},
  'World GDP USD': 31.9,
  'World GDP XAU': 116.8,
  'Year': 2000},
 {'DJIA USD': '10,718',
  'DJIA XAU':
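
The cell above imports `JSONLoader` but only previews the raw file. A hedged sketch of how the loader might be applied to the same file, assuming the `jq` package is installed (JSONLoader relies on it) and that the file is a top-level array as shown. `load_gold_records` and `add_year` are hypothetical names, and the import is done inside the function so the definition itself has no hard dependency.

```python
def load_gold_records(path: str):
    """Load one Document per record of a top-level JSON array."""
    # Imported lazily: requires langchain-community and the jq package.
    from langchain_community.document_loaders import JSONLoader

    def add_year(record: dict, metadata: dict) -> dict:
        # Copy the record's "Year" field into the Document metadata.
        metadata["year"] = record.get("Year")
        return metadata

    loader = JSONLoader(
        file_path=path,
        jq_schema=".[]",        # select each element of the array
        text_content=False,     # records are dicts, not plain strings
        metadata_func=add_year,
    )
    return loader.load()


# Usage (needs gold.json in the working directory):
# docs = load_gold_records("gold.json")
# print(docs[0].page_content, docs[0].metadata)
```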

**WikipediaLoader**

In [39]:
from langchain_community.document_loaders import WikipediaLoader
loader = WikipediaLoader("Gold as an investment")
data = loader.load()
print(data[0].page_content)

Of all the precious metals, gold is the most popular as an investment. Investors generally buy gold as a way of diversifying risk, especially through the use of futures contracts and derivatives. The gold market is subject to speculation and volatility as are other markets.  Compared to other precious metals used for investment, gold has been the most effective safe haven across a number of countries.


== Gold price ==

Gold has been used throughout history as money and has been a relative standard for currency equivalents specific to economic regions or countries, until recent times. Many European countries implemented gold standards in the latter part of the 19th century until these were temporarily suspended in the financial crises involving World War I. After World War II, the Bretton Woods system pegged the United States dollar to gold at a rate of US$35 per troy ounce. The system existed until the 1971 Nixon shock, when the US unilaterally suspended the direct convertibility of 

**IMSDb Loader**

In [45]:
from langchain_community.document_loaders import IMSDbLoader
loader = IMSDbLoader("https://imsdb.com/scripts/Elizabeth-The-Golden-Age.html")
data = loader.load()

In [46]:
formatted_text = data[0].page_content[:5000].strip()

print(formatted_text)

ELIZABETH: THE GOLDEN AGE




                        Written by

             William Nicholson & Michael Hirst
                  





                                                     5th July 2006
                          


    (Dialogue printed in brackets to be translated and spoken
    in Spanish or German as appropriate, and sub-titled.)


    EXT. TITLE SEQUENCE P                                         1
    Painted images of the Elizabethan age -

                         CAPTION

              A world divided by religious
              hatred.

              The new Protestant faith is
              spreading.

    Bodies burned on a pyre - men writhing under torture - a
    momentary half-recognisable face, gaunt and staring -
    FATHER ROBERT RESTON.

                        CAPTION
              The most powerful ruler in
              Christendom, Philip of Spain, has
              sworn to return all Europe to the
              Catholic faith.

    Images of riva

**YouTube Loader: video and language preference**

In [None]:
#!pip install --upgrade --quiet youtube-transcript-api

In [None]:
from langchain_community.document_loaders import YoutubeLoader
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=lc-puexGfts&pp=ygUJZ29sZCBuZXdz",
    add_video_info=False,
)
data = loader.load()
print(data[0].page_content[:5000].strip())

200 day moving average, which is a really bad sign. >> You said the last time that we spoke that I believe this is correct, that you had more cash than you've had in a long time, if not ever, at at double line, and that it was too early to deploy it. Now we were going through a pretty rough market period. >> When we. >> Last spoke in early April. Here we find ourselves in a different scenario today. So how does that sit? >> I feel like we're in a risk off market on an intermediate term basis. And I think what's really interesting is through all the volatility that you had in risk assets and some parts of the bond market as well. The one thing that hasn't had much volatility is gold. Gold broke out above 2000 and went to 3000. And now it's like 3400. It hit a new high yesterday. It's down today. But I think that's telling us something. I think that's telling us that we're in a regime where gold is no longer a speculative speculation for short term traders or for survivalists as a long t
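
The call above uses the default transcript language. A sketch of how a language preference might be expressed, using the `language` and `translation` arguments of `YoutubeLoader.from_youtube_url`. The wrapper `load_transcript` and the language codes in the usage comment are illustrative assumptions, and the call needs network access plus `youtube-transcript-api`.

```python
from typing import List, Optional


def load_transcript(url: str, languages: List[str], translate_to: Optional[str] = None):
    """Fetch a transcript, trying `languages` in order of preference."""
    # Imported lazily: requires langchain-community and youtube-transcript-api.
    from langchain_community.document_loaders import YoutubeLoader

    loader = YoutubeLoader.from_youtube_url(
        url,
        add_video_info=False,
        language=languages,        # preferred transcript languages, first match wins
        translation=translate_to,  # optionally translate the transcript
    )
    return loader.load()


# Usage (requires network access):
# docs = load_transcript(
#     "https://www.youtube.com/watch?v=lc-puexGfts&pp=ygUJZ29sZCBuZXdz",
#     languages=["en", "de"],
#     translate_to="en",
# )
```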

#### Text Splitters

In [68]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=2000,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

In [70]:
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Gold")
data = loader.load()
chunks = text_splitter.split_text(data[0].page_content)
len(chunks)

Created a chunk of size 2737, which is longer than the specified 2000
Created a chunk of size 2072, which is longer than the specified 2000
Created a chunk of size 2585, which is longer than the specified 2000
Created a chunk of size 2591, which is longer than the specified 2000
Created a chunk of size 2243, which is longer than the specified 2000
Created a chunk of size 4372, which is longer than the specified 2000
Created a chunk of size 2062, which is longer than the specified 2000
Created a chunk of size 3180, which is longer than the specified 2000
Created a chunk of size 2772, which is longer than the specified 2000
Created a chunk of size 2600, which is longer than the specified 2000
Created a chunk of size 2362, which is longer than the specified 2000
Created a chunk of size 4938, which is longer than the specified 2000


64
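
The warnings above appear because splitting happens only at the separator: a piece with no `"\n\n"` inside it is kept whole even when it exceeds `chunk_size`. The following is a simplified sketch of that split-then-merge behaviour, not LangChain's actual implementation. It ignores `chunk_overlap`, and `split_on_separator` is a hypothetical name.

```python
from typing import List


def split_on_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split on one separator, then merge neighbouring pieces up to chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits into the chunk being built
        else:
            if current:
                chunks.append(current)
            current = piece  # may itself be longer than chunk_size
    if current:
        chunks.append(current)
    return chunks


text = "short one\n\n" + "x" * 30 + "\n\nshort two\n\nshort three"
chunks = split_on_separator(text, "\n\n", chunk_size=25)
print([len(c) for c in chunks])
```

In this toy input the 30-character piece has no separator inside it, so it comes out as an oversized chunk, just like the chunks reported in the warnings.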

In [71]:
for chunk in chunks:
    print(chunk)
    print("-------")

Gold - Wikipedia


Jump to content

Main menu

Main menu
move to sidebar
hide

		Navigation
	


Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us

		Contribute
	


HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages

Search

Search


Appearance


Donate

Create account

Log in


Personal tools

Donate Create account Log in

		Pages for logged out editors learn more

ContributionsTalk


Contents
move to sidebar
hide


(Top)

1
Characteristics


Toggle Characteristics subsection

1.1
Color


1.2
Isotopes


1.2.1
Synthesis


2
Chemistry


Toggle Chemistry subsection

2.1
Rare oxidation states


3
Origin


Toggle Origin subsection

3.1
Gold production in the universe


3.2
Asteroid origin theories


3.3
Mantle return theories


4
Occurrence


Toggle Occurrence subsection

4.1
Seawater


5
History


Toggle History subsection

5.1
Etymology


5.2
Culture


5.2.1
Religion


6
Production


Toggle Production subsection

6.1
Mining and prospecting


6.2
E

In [72]:
documents = text_splitter.create_documents([data[0].page_content])
len(documents)

Created a chunk of size 2737, which is longer than the specified 2000
Created a chunk of size 2072, which is longer than the specified 2000
Created a chunk of size 2585, which is longer than the specified 2000
Created a chunk of size 2591, which is longer than the specified 2000
Created a chunk of size 2243, which is longer than the specified 2000
Created a chunk of size 4372, which is longer than the specified 2000
Created a chunk of size 2062, which is longer than the specified 2000
Created a chunk of size 3180, which is longer than the specified 2000
Created a chunk of size 2772, which is longer than the specified 2000
Created a chunk of size 2600, which is longer than the specified 2000
Created a chunk of size 2362, which is longer than the specified 2000
Created a chunk of size 4938, which is longer than the specified 2000


64

In [73]:
for doc in documents:
    print(doc)
    print("-------")

page_content='Gold - Wikipedia


Jump to content

Main menu

Main menu
move to sidebar
hide

		Navigation
	


Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us

		Contribute
	


HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages

Search

Search


Appearance


Donate

Create account

Log in


Personal tools

Donate Create account Log in

		Pages for logged out editors learn more

ContributionsTalk


Contents
move to sidebar
hide


(Top)

1
Characteristics


Toggle Characteristics subsection

1.1
Color


1.2
Isotopes


1.2.1
Synthesis


2
Chemistry


Toggle Chemistry subsection

2.1
Rare oxidation states


3
Origin


Toggle Origin subsection

3.1
Gold production in the universe


3.2
Asteroid origin theories


3.3
Mantle return theories


4
Occurrence


Toggle Occurrence subsection

4.1
Seawater


5
History


Toggle History subsection

5.1
Etymology


5.2
Culture


5.2.1
Religion


6
Production


Toggle Production subsection

6.1
Mining and prosp

**RecursiveCharacterTextSplitter**

In [74]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
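
RecursiveCharacterTextSplitter avoids oversized chunks by trying a list of separators from coarse to fine (by default `"\n\n"`, `"\n"`, `" "`, `""`). The sketch below is a simplified illustration of that idea rather than the library's code. It omits `chunk_overlap`, and it falls back to fixed-width cuts once the separator list is exhausted, in place of the final `""` separator. `recursive_split` is a hypothetical name.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Try coarse separators first and fall back to finer ones for long pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: cut at fixed width.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, *rest = separators
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current.strip():
        chunks.append(current)
    return chunks


text = (
    "Gold is a chemical element.\n\n"
    "It has the symbol Au and atomic number 79. It is a bright, dense, soft metal.\n\n"
    "End."
)
chunks = recursive_split(text, chunk_size=40, separators=["\n\n", "\n", " "])
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The long middle paragraph contains no `"\n\n"` or `"\n"`, so it is split on spaces instead, and every resulting chunk stays within the limit.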

In [75]:
texts = text_splitter.create_documents([data[0].page_content])

In [76]:
for text in texts:
    print(text)
    print("-------")

page_content='Gold - Wikipedia


































Jump to content







Main menu'
-------
page_content='Main menu





Main menu
move to sidebar
hide



		Navigation'
-------
page_content='Navigation
	


Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us'
-------
page_content='Contribute
	


HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages'
-------
page_content='Search











Search






















Appearance
















Donate'
-------
page_content='Donate

Create account

Log in








Personal tools





Donate Create account Log in'
-------
page_content='Pages for logged out editors learn more



ContributionsTalk'
-------
page_content='Contents
move to sidebar
hide




(Top)





1
Characteristics'
-------
page_content='Toggle Characteristics subsection





1.1
Color








1.2
Isotopes






1.2.1
Synthesis'
-------
page_content='2
Chemistry




Toggle Chemistry subsection





2.1
Rare oxidation states'