# Document Loading

## Retrieval augmented generation
 
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution. 

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).

In [1]:
#! pip install langchain

In [2]:
import os
import openai
import sys

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

## PDFs

Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents are the result of automated transcription, so words and sentences are sometimes split unexpectedly.

In [3]:
# The course shows the pip commands you would need to install packages on your own machine.
# The packages are already installed on this platform, so the commands should not be run again.
# ! pip install pypdf

In [4]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("./resources/docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.
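To make that shape concrete, here is a minimal sketch of an object with the same two fields. `SimpleDocument` is a hypothetical stand-in written for illustration, not LangChain's own class; the sample values are modelled on the PDF loaded in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Illustrative values only; a real loader fills these in for each page.
example_page = SimpleDocument(
    page_content="MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning.",
    metadata={
        "source": "./resources/docs/cs229_lectures/MachineLearning-Lecture01.pdf",
        "page": 0,
    },
)

print(example_page.page_content[:25])
print(example_page.metadata["page"])
```

Keeping the text and its provenance together is what later lets a retrieved chunk be traced back to its source file and page.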

In [5]:
len(pages)

22

In [6]:
page = pages[0]

In [7]:
print(page.page_content[0:500])

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i


In [8]:
page.metadata

{'source': './resources/docs/cs229_lectures/MachineLearning-Lecture01.pdf',
 'page': 0}
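The extracted text above contains doubled spaces and trailing whitespace. As a rough sketch, a small helper (the name `tidy_whitespace` is hypothetical) can normalise that spacing before further processing. It does not attempt to rejoin words that were split mid-word, such as "ju st"; doing that reliably would need a dictionary or language model.

```python
import re


def tidy_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and strip leading/trailing whitespace per line."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


# Sample taken from the first lines of the page printed above.
sample = (
    "MachineLearning-Lecture01  \n"
    "Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \n"
    "learning class."
)
print(tidy_whitespace(sample))
```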

## YouTube

In [13]:
# ! pip install yt_dlp
# ! pip install pydub

In [14]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [15]:
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "resources/docs/youtube/"
# Download the video's audio into save_dir, then transcribe it with OpenAI Whisper.
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] Destination: resources/docs/youtube//Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a
[download] 100% of   69.76MiB in 00:00:04 at 14.12MiB/s  
[FixupM4a] Correcting container of "resources/docs/youtube//Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a"
[ExtractAudio] Not converting audio resources/docs/youtube//Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a
Transcribing part 1!
Transcribing part 2!
Transcribing part 3!
Transcribing part 4!
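The four "Transcribing part" lines suggest that the audio is transcribed in pieces rather than in one request. The sketch below shows one way such splitting could work, assuming fixed-length chunks. The 20-minute chunk length, the 75-minute duration, and the helper name `chunk_bounds` are all assumptions for illustration and are not taken from the loader's source.

```python
def chunk_bounds(total_seconds: int, chunk_seconds: int) -> list[tuple[int, int]]:
    """Return (start, end) offsets in seconds that cover the audio in order."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    return [
        (start, min(start + chunk_seconds, total_seconds))
        for start in range(0, total_seconds, chunk_seconds)
    ]


# Hypothetical numbers: a 75-minute recording cut into 20-minute chunks.
parts = chunk_bounds(75 * 60, 20 * 60)
for number, (start, end) in enumerate(parts, start=1):
    print(f"part {number}: {start}s to {end}s")
```

With these assumed numbers the last chunk is shorter than the others, which is why the end offset is clamped to the total duration.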


In [16]:
docs[0].page_content[0:500]

"Welcome to CS229 Machine Learning. Uh, some of you know that this is a class that's taught at Stanford for a long time. And this is often the class that, um, I most look forward to teaching each year because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, got- built many of their products and services and startups that I'm sure, many of you or probably all of you are using, uh, uh, today. Um, so what I want to do today was spend s"

## URLs

In [17]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [18]:
docs = loader.load()

In [19]:
print(docs[0].page_content[:500])

File not found · GitHub

Skip to content

Toggle navigation

Sign in

Product

Actions
Automate any workflow

Packages
Host and manage packages

Security
Find and fix vulnerabilities

Codespaces
Instant dev environments

## Notion

Follow the steps [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion) for an example Notion site such as [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):

* Duplicate the page into your own Notion space and export it as `Markdown / CSV`.
* Unzip the export and save it as a folder that contains the Markdown file for the Notion page.

In [20]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("resources/docs/Notion_DB")
docs = loader.load()

In [21]:
print(docs[0].page_content[0:200])

# Blendle's Employee Handbook

This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that


In [22]:
docs[0].metadata

{'source': "resources/docs/Notion_DB/Blendle's Employee Handbook e367aa77e225482c849111687e114a56.md"}
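The metadata above points at a single `.md` file inside the export folder. As a rough sketch of what loading such a folder involves, the snippet below walks a directory for Markdown files and records each file's path as its source. The function `load_markdown_dir` and the temporary files are hypothetical and stand in for the real export; this is not the loader's actual implementation.

```python
import tempfile
from pathlib import Path


def load_markdown_dir(root: str) -> list[dict]:
    """Read every .md file under root into a dict with text and source metadata."""
    docs = []
    for path in sorted(Path(root).glob("**/*.md")):
        docs.append(
            {
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": str(path)},
            }
        )
    return docs


# Build a tiny stand-in export: one Markdown page and one CSV that is skipped.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Handbook.md").write_text("# Blendle's Employee Handbook\n", encoding="utf-8")
    (Path(tmp) / "notes.csv").write_text("a,b\n", encoding="utf-8")
    example_docs = load_markdown_dir(tmp)

print(len(example_docs))
print(example_docs[0]["page_content"])
```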

## Experiment on your own

We are going to test out a few of the many other document loaders available in LangChain.

In [23]:
!pip install --upgrade --quiet arxiv
!pip install --upgrade --quiet pymupdf

In [31]:
# csv
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    file_path="./resources/docs/addresses.csv",
)

docs = loader.load()

print(len(docs))

print(docs[0].page_content[0:200])

5
John: Jack
Doe: McGinnis
120 jefferson st.: 220 hobo Av.
Riverside: Phila
NJ: PA
08075: 09119
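The keys in that output ("John", "Doe", ...) look like data, not column names, which suggests the file has no header row and its first record was consumed as the header. The standard-library sketch below reproduces the effect with `csv.DictReader` and shows how supplying `fieldnames` avoids it. The column names are hypothetical. `CSVLoader` accepts a `csv_args` dictionary that is passed on to `csv.DictReader`, so the same option should apply there, but verify this against your installed version.

```python
import csv
import io

# The first two records, as they appear in the output above.
raw_csv = (
    "John,Doe,120 jefferson st.,Riverside,NJ,08075\n"
    "Jack,McGinnis,220 hobo Av.,Phila,PA,09119\n"
)

# Default behaviour: the first row is treated as the header.
default_rows = list(csv.DictReader(io.StringIO(raw_csv)))
print(default_rows[0])

# Hypothetical column names, supplied explicitly so no record is lost.
fieldnames = ["first_name", "last_name", "street", "city", "state", "zip"]
named_rows = list(csv.DictReader(io.StringIO(raw_csv), fieldnames=fieldnames))
print(named_rows[0]["first_name"], named_rows[0]["zip"])
```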


In [None]:
# !pip install jq

In [43]:
# json
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(file_path='./resources/docs/sample_users_with_id.json',
                    jq_schema='.',
                    text_content=False)

docs = loader.load()

print(len(docs))

print(docs[0].page_content[:200])

1
[{'user_id': '583c3ac3f38e84297c002546', 'email': 'test@test.com', 'name': 'test@test.com', 'given_name': 'Hello', 'family_name': 'Test', 'nickname': 'test', 'last_ip': '94.121.163.63', 'logins_count'
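With `jq_schema='.'` the whole file comes back as a single document, which is why the count above is 1 even though the content is a list of users. A jq expression such as `.[]` would instead iterate over the top-level array and yield one result per user. The sketch below mimics both selections with the standard `json` module; the second record is made up for illustration.

```python
import json

# A shortened imitation of the file: a top-level array of user records.
raw_json = json.dumps(
    [
        {"user_id": "583c3ac3f38e84297c002546", "email": "test@test.com"},
        {"user_id": "hypothetical-second-id", "email": "second@example.com"},
    ]
)
data = json.loads(raw_json)

whole_file = [data]      # analogous to jq '.': one item holding everything
per_record = list(data)  # analogous to jq '.[]': one item per array element

print(len(whole_file), len(per_record))
```

One document per record usually suits retrieval better, because each user can then be matched and returned on its own.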


In [49]:
# arXiv
from langchain_community.document_loaders import ArxivLoader

docs = ArxivLoader(query="1706.03762", load_max_docs=2).load()

print(len(docs))

print(docs[0].page_content[:200])

1
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need

