# Building Multi-Source Document Loaders with LangChain: PDF, YouTube, and Web Data Extraction

In [14]:
# !pip install langchain

In [15]:
# !pip install -U langchain-community

In [16]:
# !pip install pypdf

In [18]:
import openai
openai.api_key = "xxxx"  # placeholder; substitute your own key and keep it out of version control

## PDFs
### Let's load a sample PDF! Text extracted from a PDF follows the page layout, so words and sentences are sometimes split across lines unexpectedly.

In [5]:
import requests
from langchain_community.document_loaders import PyPDFLoader

# Step 1: Download the PDF file
url = "https://css4.pub/2015/textbook/somatosensory.pdf"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the download failed

# Save the file locally
with open("somatosensory.pdf", "wb") as file:
    file.write(response.content)

# Step 2: Load the local file using PyPDFLoader
loader = PyPDFLoader("somatosensory.pdf")
pages = loader.load()

# Check the content
print(pages[0].page_content)

This is a sample document to
showcase page-based formatting. It
contains a chapter from a Wikibook
called Sensory Systems. None of the
content has been changed in this
article, but some content has been
removed.
Anatomy of the Somatosensory System
FROM WIKIBOOKS1
Our somatosensory system consists of sensors in the skin
and sensors in our muscles, tendons, and joints. The re-
ceptors in the skin, the so called cutaneous receptors, tell
us about temperature (thermoreceptors), pressure and sur-
face texture (mechano receptors), and pain (nociceptors).
The receptors in muscles and joints provide information
about muscle length, muscle tension, and joint angles.
Cutaneous receptors
Sensory information from Meissner corpuscles and rapidly
adapting afferents leads to adjustment of grip force when
objects are lifted. These afferents respond with a brief
burst of action potentials when objects move a small dis-
tance during the early stages of lifting. In response to
Figure 1: Receptors in the 

- Each page is a Document.
- A Document contains text (page_content) and metadata.
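Because every loader returns objects with the same two attributes, a quick per-page summary is easy to write. The sketch below is illustrative only: `summarize_pages` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic `page_content` and `metadata` so the snippet runs without LangChain installed. With the real loader you would pass `pages` directly.

```python
from types import SimpleNamespace


def summarize_pages(pages, preview_chars=40):
    """Return one summary dict per page: page index, text length, short preview."""
    summaries = []
    for doc in pages:
        text = doc.page_content
        summaries.append({
            "page": doc.metadata.get("page"),
            "chars": len(text),
            # Collapse internal newlines so the preview fits on one line.
            "preview": " ".join(text.split())[:preview_chars],
        })
    return summaries


# Stand-in objects (hypothetical) that mimic the two Document attributes.
fake_pages = [
    SimpleNamespace(
        page_content="This is a sample document to\nshowcase page-based formatting.",
        metadata={"source": "somatosensory.pdf", "page": 0},
    ),
    SimpleNamespace(
        page_content="Cutaneous receptors\nSensory information",
        metadata={"source": "somatosensory.pdf", "page": 1},
    ),
]

for row in summarize_pages(fake_pages):
    print(row)
```

The helper only relies on duck typing, so it should work for documents from any of the loaders in this notebook.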

In [6]:
len(pages)  # not displayed: a cell only echoes its last expression
page = pages[0]
print(page.page_content[0:500])
page.metadata

This is a sample document to
showcase page-based formatting. It
contains a chapter from a Wikibook
called Sensory Systems. None of the
content has been changed in this
article, but some content has been
removed.
Anatomy of the Somatosensory System
FROM WIKIBOOKS1
Our somatosensory system consists of sensors in the skin
and sensors in our muscles, tendons, and joints. The re-
ceptors in the skin, the so called cutaneous receptors, tell
us about temperature (thermoreceptors), pressure and sur-
fac


{'producer': 'Prince 20150210 (www.princexml.com)',
 'creator': 'PyPDF',
 'creationdate': '',
 'title': 'Anatomy of the Somatosensory System',
 'source': 'somatosensory.pdf',
 'total_pages': 4,
 'page': 0,
 'page_label': '1'}
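The extracted text above breaks words at line ends (for example `re-` followed by `ceptors`). One possible cleanup, sketched here with the standard library only, is to join a hyphen at a line break with the lowercase continuation on the next line. `dehyphenate` is a hypothetical helper, not part of LangChain, and the rule is a heuristic: it would also merge a genuine compound such as `well-` / `known` that happens to wrap.

```python
import re

# A word character, a hyphen at the end of a line, then a lowercase letter
# at the start of the next line.
_LINE_BREAK_HYPHEN = re.compile(r"(\w)-\n([a-z])")


def dehyphenate(text: str) -> str:
    """Join words that were split by a hyphen at a line break."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)


sample = "tendons, and joints. The re-\nceptors in the skin"
print(dehyphenate(sample))
```

In the notebook this could be applied as `dehyphenate(page.page_content)` before any further processing.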

## YouTube

In [7]:
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders import YoutubeAudioLoader

In [19]:
import os
os.environ["OPENAI_API_KEY"] = "xxx"
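Writing the key into a notebook cell makes it easy to leak when the notebook is shared. As an alternative, the sketch below reads the key from the environment and only prompts when it is missing. `ensure_api_key` is a hypothetical helper name; it assumes the key is expected in the `OPENAI_API_KEY` environment variable, as in the cell above.

```python
import os
from getpass import getpass


def ensure_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting once if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        # getpass hides the typed value, so it never appears in cell output.
        key = getpass(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key


# Usage (uncomment in the notebook, before building the loader):
# ensure_api_key()
```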

In [17]:
# !pip install yt_dlp
# !pip install pydub

In [10]:
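url = "https://www.youtube.com/watch?v=PnI-nMUxFjo"
save_dir = "docs/youtube/"

# YoutubeAudioLoader downloads the audio track with yt_dlp;
# OpenAIWhisperParser sends the audio to OpenAI's Whisper API for transcription.
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser(),
)
docs = loader.load()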

[youtube] Extracting URL: https://www.youtube.com/watch?v=PnI-nMUxFjo
[youtube] PnI-nMUxFjo: Downloading webpage
[youtube] PnI-nMUxFjo: Downloading tv client config
[youtube] PnI-nMUxFjo: Downloading player 4fcd6e4a
[youtube] PnI-nMUxFjo: Downloading tv player API JSON
[youtube] PnI-nMUxFjo: Downloading ios player API JSON
[youtube] PnI-nMUxFjo: Downloading m3u8 information
[info] PnI-nMUxFjo: Downloading 1 format(s): 140
[download] Destination: docs/youtube//Design YouTube Shorts： System Design Interview with an L7 Senior Staff Google Engineer.m4a
[download] 100% of    7.78MiB in 00:00:02 at 2.68MiB/s   
[FixupM4a] Correcting container of "docs/youtube//Design YouTube Shorts： System Design Interview with an L7 Senior Staff Google Engineer.m4a"
[ExtractAudio] Not converting audio docs/youtube//Design YouTube Shorts： System Design Interview with an L7 Senior Staff Google Engineer.m4a; file is already in target format m4a
Transcribing part 1!


In [11]:
docs[0].page_content[0:500]

"When it comes to the staff interview, I do remember the MVP here would be we should allow the users to just upload the videos just in a very slow or probably unreliable network, as you have called out. But here, just from the high-level design here, I want you to stick to the topics. I'm a senior staff engineer at Google. So basically, that would be the level seven. And yeah, after joining Google, I've conducted more than 300 interviews, ranging from coding behavior all the way to the engineer m"

## URLs

In [12]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")



In [13]:
docs = loader.load()
print(docs[0].page_content[:500])

handbook/titles-for-programmers.md at master · basecamp/handbook · GitHub
Skip to content
Navigation Menu
Toggle navigation
            Sign in
        Product
GitHub Copilot
        Write better code with AI
Security
        Find and fix vulnerabilities
Actions
        Auto
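Text scraped from an HTML page usually carries layout whitespace and navigation text; the output above starts with GitHub's menu items rather than the handbook article. A minimal post-processing sketch, assuming that stripping each line and dropping empty ones is acceptable for your use case, could look like this. `collapse_blank_lines` is a hypothetical helper, and `raw` is a made-up sample in the style of the output above.

```python
def collapse_blank_lines(text: str) -> str:
    """Strip each line and drop the ones that end up empty."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample imitating whitespace-heavy page text.
raw = "\n\n\nhandbook/titles-for-programmers.md\n\n\n\nSkip to content\n\n   Sign in\n          \n"
print(collapse_blank_lines(raw))
```

In the notebook this could be applied as `collapse_blank_lines(docs[0].page_content)`. It removes whitespace only; the navigation text itself would still need separate filtering.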
