# Document Loading

## Note to students
During periods of high load, the notebook may become unresponsive. A cell may appear to execute and update the execution count in brackets [#] to its left, even though it has not actually run. This is most obvious with print statements that produce no output. If this happens, restart the kernel using the command under the Kernel tab.

## Retrieval augmented generation
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful when we want to ask questions about specific documents (e.g., our PDFs or a set of videos).
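
To make the retrieval step concrete, here is a minimal sketch in plain Python. It is not how LangChain or any production RAG system retrieves documents: those typically rank by embedding similarity, whereas this toy version simply counts shared words. The names `tokenize`, `retrieve`, `build_prompt`, and the two-sentence corpus are hypothetical and exist only for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most word occurrences with the question."""
    query_terms = set(tokenize(question))

    def overlap(doc: str) -> int:
        counts = Counter(tokenize(doc))
        return sum(counts[term] for term in query_terms)

    # sorted() is stable, so documents with equal scores keep their original order.
    return sorted(documents, key=overlap, reverse=True)[:k]


def build_prompt(question: str, documents: list[str], k: int = 1) -> str:
    """Place the retrieved context ahead of the question, as a RAG prompt would."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical toy corpus, for illustration only.
corpus = [
    "The lecture covers linear regression and gradient descent.",
    "Office hours are held on Fridays.",
]
print(build_prompt("When are office hours?", corpus))
```

The prompt produced this way would then be sent to the LLM, which answers using the retrieved context rather than only its training data.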

In [1]:
#! pip install langchain

Collecting langchain
  Using cached langchain-0.3.17-py3-none-any.whl.metadata (7.1 kB)
Collecting PyYAML>=5.3 (from langchain)
  Using cached PyYAML-6.0.2-cp312-cp312-win_amd64.whl.metadata (2.1 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Using cached SQLAlchemy-2.0.37-cp312-cp312-win_amd64.whl.metadata (9.9 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Using cached aiohttp-3.11.11-cp312-cp312-win_amd64.whl.metadata (8.0 kB)
Collecting langchain-core<0.4.0,>=0.3.33 (from langchain)
  Using cached langchain_core-0.3.33-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.3 (from langchain)
  Using cached langchain_text_splitters-0.3.5-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.4,>=0.1.17 (from langchain)
  Using cached langsmith-0.3.3-py3-none-any.whl.metadata (14 kB)
Collecting numpy<3,>=1.26.2 (from langchain)
  Using cached numpy-2.2.2-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting pydantic<3.0.0,>=2.7.4 (from la

In [3]:
#! pip install openai

Collecting openai
  Using cached openai-1.60.2-py3-none-any.whl.metadata (27 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Using cached jiter-0.8.2-cp312-cp312-win_amd64.whl.metadata (5.3 kB)
Collecting tqdm>4 (from openai)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Using cached openai-1.60.2-py3-none-any.whl (456 kB)
Using cached distro-1.9.0-py3-none-any.whl (20 kB)
Using cached jiter-0.8.2-cp312-cp312-win_amd64.whl (204 kB)
Using cached tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, jiter, distro, openai
Successfully installed distro-1.9.0 jiter-0.8.2 openai-1.60.2 tqdm-4.67.1


In [4]:
#! pip install python-dotenv

Collecting python-dotenv
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Using cached python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [3]:
import os
import openai
from dotenv import load_dotenv

# Load variables from a .env file into the environment.
# By default, load_dotenv() looks for a file named .env, starting from the current directory.
load_dotenv()

openai.api_key = os.getenv("OPENAI_API_KEY")
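
If the key is missing, `os.getenv` silently returns `None` and the failure only surfaces later, when the first API call is made. The sketch below shows one way to fail early and to confirm the key was loaded without printing it in full. The helpers `require_env` and `mask` are hypothetical names introduced here; they are not part of `openai` or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell."
        )
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the value is safer to print."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

A possible use in this notebook would be `print(mask(require_env("OPENAI_API_KEY")))`, run after `load_dotenv()`.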

## PDFs
Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents are the result of automated transcription so words and sentences are sometimes split unexpectedly.

In [7]:
# The course shows the pip commands you would need to install packages on your own machine.
# The packages are already installed on this platform, so these commands should not be run again.
# ! pip install pypdf

Collecting pypdf
  Using cached pypdf-5.2.0-py3-none-any.whl.metadata (7.2 kB)
Using cached pypdf-5.2.0-py3-none-any.whl (298 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.2.0


In [10]:
#! pip install -U langchain-community

Collecting langchain-community
  Using cached langchain_community-0.3.16-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Using cached dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Using cached httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Using cached pydantic_settings-2.7.1-py3-none-any.whl.metadata (3.5 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Using cached marshmallow-3.26.0-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Using cached typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain-community)
  Using cached mypy_extensions-1.0.0

In [6]:
from langchain_community.document_loaders import PyPDFLoader

# Load the lecture transcript; each PDF page becomes one Document.
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.

In [7]:
len(pages)

22

In [8]:
page = pages[0]

In [10]:
print(page.page_content[0:200])

MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of


In [11]:
page.metadata

{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf',
 'page': 0,
 'page_label': '1'}
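
Because every loader returns objects with the same two fields, downstream code can treat pages from any source uniformly. The sketch below builds a one-line summary per page. So that it runs without LangChain installed, it uses a hypothetical stand-in class `Doc` that has the same `page_content` and `metadata` attributes; the sample data is made up, and `summarize` is an illustrative helper, not a LangChain function.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the same two fields as a loaded Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize(pages: list) -> list[str]:
    """Build one line per page: source, page label, and character count."""
    lines = []
    for page in pages:
        source = page.metadata.get("source", "<unknown>")
        # Fall back to the zero-based page index when no printed label is present.
        label = page.metadata.get("page_label", page.metadata.get("page", "?"))
        lines.append(f"{source} p.{label}: {len(page.page_content)} chars")
    return lines


# Made-up sample data that mirrors the metadata keys shown above.
sample = [
    Doc("MachineLearning-Lecture01", {"source": "lecture01.pdf", "page": 0, "page_label": "1"}),
    Doc("More text", {"source": "lecture01.pdf", "page": 4}),
]
for line in summarize(sample):
    print(line)
```

Since `summarize` only reads the `page_content` and `metadata` attributes, passing it the `pages` list loaded above should work the same way.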

## YouTube

In [48]:
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [27]:
#! pip install yt_dlp
#! pip install pydub



In [31]:
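# Note: the PyPI package named `ffmpeg` does not provide the ffmpeg/ffprobe executables that yt-dlp needs.
#! pip install ffmpeg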

Collecting ffmpeg
  Downloading ffmpeg-1.4.tar.gz (5.1 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: ffmpeg
  Building wheel for ffmpeg (pyproject.toml): started
  Building wheel for ffmpeg (pyproject.toml): finished with status 'done'
  Created wheel for ffmpeg: filename=ffmpeg-1.4-py3-none-any.whl size=6138 sha256=60f676e26fc2a9adce7f9867e8bcfc4b488fedb483eacaaa62c92ebf441e4381
  Stored in directory: c:\users\owner\appdata\local\pip\cache\wheels\26\21\0c\c26e09dff860a9071683e279445262346e008a9a1d2142c4ad
Successfully built ffmpeg
Installing collected packages: ffmpeg
Successfully installed ffmpeg-1.4


In [39]:
#import yt_dlp
## Set yt-dlp options globally, including the ffmpeg location
#yt_dlp.utils.std_headers['ffmpeg_location'] = 'C:/ffmpeg/bin/ffmpeg.exe'

In [52]:
import yt_dlp

# yt-dlp options, including where to find the FFmpeg binaries.
# Note: YoutubeAudioLoader creates its own yt-dlp instance internally, so these
# options only apply when downloading with `ydl` directly.
ydl_opts = {
    'ffmpeg_location': 'C:/ffmpeg/bin/',  # Directory containing ffmpeg and ffprobe
    'outtmpl': 'docs/youtube/%(id)s.%(ext)s',  # Name the file after the YouTube video ID
    'quiet': True,  # Suppress most console output
    'format': 'bestaudio/best'  # Prefer the best available audio-only stream
}

# Initialize yt-dlp with the options
ydl = yt_dlp.YoutubeDL(ydl_opts)

**Note**: This can take several minutes to complete.

In [49]:
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"

# Download the audio with yt-dlp, then transcribe it with OpenAI Whisper.
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading tv client config
[youtube] jGwO_UgTS7I: Downloading player f3d47b5a
[youtube] jGwO_UgTS7I: Downloading tv player API JSON
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] docs\youtube\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of   69.71MiB


ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location


DownloadError: ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location

In [50]:
docs[0].page_content[0:500]

NameError: name 'docs' is not defined
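
The `DownloadError` above means the post-processing step could not find the `ffmpeg` and `ffprobe` executables, and the `NameError` follows because `docs` was never assigned. A quick pre-flight check, sketched below with only the standard library, can catch this before starting a long download. `missing_tools` is a hypothetical helper name. It only checks `PATH`, on the assumption that the loader relies on `PATH` to locate the binaries.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")) -> list[str]:
    """Return the names of the required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("ffmpeg and ffprobe are available.")
```

If either tool is reported missing, one common fix is to install FFmpeg for your operating system and add its `bin` directory to `PATH` before restarting the kernel.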

## URLs

In [18]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [19]:
docs = loader.load()

In [20]:
print(docs[0].page_content[:500])

handbook/37signals-is-you.md at master · basecamp/handbook · GitHub

Skip to content

Toggle navigation

            Sign up

        Product

Actions
        Automate any workflow

Packages
        Host and manage packages

Security
        Find and fix vulnerabilities

Cod
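
Text scraped from a rendered web page tends to carry navigation labels and long runs of whitespace, as with the GitHub page loaded above. Some post-processing is usually needed before the text is useful. The sketch below shows one simple approach using only the standard library; `collapse_whitespace` is a hypothetical helper and the `raw` string is made-up sample data. It does not remove navigation text such as "Skip to content", which would need page-specific handling.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Strip each line, drop empty lines, and squeeze runs of spaces and tabs."""
    cleaned = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)


# Made-up sample imitating whitespace-heavy page text.
raw = "\n\n\nSkip to content\n\n\n\n        Product\n        \n\nActions\n        Automate any   workflow\n"
print(collapse_whitespace(raw))
```

In this notebook, the same helper could be applied as `collapse_whitespace(docs[0].page_content)`.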


## Notion

Follow the steps [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion) to export an example Notion site, such as [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):

- Duplicate the page into your own Notion space and export as Markdown / CSV.
- Unzip the export and save it as a folder that contains the Markdown file for the Notion page.
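
The unzip step can also be scripted. The following is a minimal sketch using only the standard library; `extract_export` is a hypothetical helper. The default target folder `docs/Notion_DB` is the folder the loader cell in this section reads from, and the name of the downloaded zip file is something you would supply yourself.

```python
import zipfile
from pathlib import Path


def extract_export(zip_path: str, target_dir: str = "docs/Notion_DB") -> list[Path]:
    """Unzip a Notion export into target_dir and return the Markdown files found."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Exports can nest pages in subfolders, so search recursively.
    return sorted(target.rglob("*.md"))
```

Calling `extract_export("Export.zip")`, where `Export.zip` is a placeholder file name, would return the list of Markdown files, which is a quick way to confirm the export landed where the loader expects it.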

In [None]:
from langchain_community.document_loaders import NotionDirectoryLoader

# Point the loader at the folder that holds the unzipped Notion export.
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

In [None]:
print(docs[0].page_content[0:200])

In [None]:
docs[0].metadata