# **01.Document-loading**

## **Retrieval augmented generation**
 
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution. 

This is useful when we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
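To make the retrieve-then-generate idea concrete, here is a toy sketch using only the standard library. It is not how LangChain retrieves documents: the word-overlap scoring, the three-sentence corpus, and the prompt template are all illustrative assumptions, and a real pipeline would normally use embeddings and a vector store.

```python
import re

# Toy illustration of the retrieval step in RAG (standard library only).
# The corpus, scoring rule, and prompt template are illustrative assumptions.

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k documents that share the most words with the question."""
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved context in front of the question for the LLM."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

corpus = [
    "Students must complete all compulsory courses before proceeding to the next year.",
    "The course teaches the fundamentals of node.js and express.",
    "The dog is a domesticated descendant of the wolf.",
]
question = "Which courses must students complete?"
top = retrieve(question, corpus)
prompt = build_prompt(question, top)
print(prompt)
```

The prompt, not the bare question, is what would be sent to the model, so the answer can draw on the retrieved text.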

![Screenshot](Assets/Screenshot%202024-10-02%20at%2007.20.28.png)

In [12]:
! pip install langchain
! pip install openai
! pip install python-dotenv
! pip install langchain-community
! pip install pypdf



In [2]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

## **PDFs**

In [16]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("assets/1.Commerce-undergrad.pdf")
pages = loader.load()
len(pages)

197

In [21]:
page = pages[20]
print(page.page_content[0:500])

Rules for Undergraduate Degrees  20 
FBA22  A student will be required to complete all compulsory and optional courses prescribed for each year of study for a degree to proceed 
to courses prescribed for the following year (subject to the rules concerning transfer of other degree courses from this or  other 
approved Universities) . 
 
FBA23  A student who fails no more than four semester courses in any year, but whose overall performance in all courses is of a sati sfactory 
standard, may be pe


In [22]:
page.metadata

{'source': 'assets/1.Commerce-undergrad.pdf', 'page': 20}
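Each loaded page exposes its text as `page_content` and its provenance as `metadata`, as shown above. The sketch below searches pages for a keyword and reports the matching page numbers. To keep it runnable without the PDF, it uses a small stand-in `Page` dataclass and a made-up `example.pdf` source. Both are assumptions, as is the helper name `find_pages`, which should also work on the real `pages` list since it only relies on those two attributes.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """Stand-in for a loaded page: text plus a metadata dict (assumed shape)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def find_pages(pages, keyword):
    """Return the 'page' metadata value of every page containing the keyword.

    The match is case-insensitive and works on plain substrings.
    """
    needle = keyword.lower()
    return [p.metadata.get("page") for p in pages if needle in p.page_content.lower()]

# Hypothetical sample data mirroring the structure printed above.
sample = [
    Page("Rules for Undergraduate Degrees", {"source": "example.pdf", "page": 0}),
    Page("A student who fails no more than four semester courses",
         {"source": "example.pdf", "page": 1}),
]
print(find_pages(sample, "semester"))
```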

## **YouTube**

In [35]:
! pip install -U yt-dlp
! pip install pydub
! pip install youtube-transcript-api
! pip install pytube



In [39]:
from langchain_community.document_loaders import YoutubeLoader

In [40]:
url = "https://www.youtube.com/watch?v=Oe421EPjeBE&ab_channel=freeCodeCamp.org"

try:
    loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)
    docs = loader.load()
    print(f"Loaded {len(docs)} document(s)")
    print(f"Transcript preview: {docs[0].page_content[:200]}...")
except Exception as e:
    print(f"An error occurred: {e}")

Loaded 1 document(s)
Transcript preview: this eight hour course will teach you the fundamentals of node.js and express so you can start creating backend and full stack web apps using javascript this course was created by john smilga john has...
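The URL above carries an extra `ab_channel` query parameter alongside the video ID. As a rough sketch of how the ID can be separated from the rest of the URL, the helper below uses only `urllib.parse`. The function name `extract_video_id` and the list of accepted hosts are assumptions for illustration, not part of the loader's API.

```python
from urllib.parse import urlparse, parse_qs

# Hosts accepted by this sketch (an assumption, not an exhaustive list).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}

def extract_video_id(url: str):
    """Return the video ID from a YouTube watch or short URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links keep the ID in the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS:
        # Watch links keep the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None

url = "https://www.youtube.com/watch?v=Oe421EPjeBE&ab_channel=freeCodeCamp.org"
print(extract_video_id(url))  # Oe421EPjeBE
```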


## **URLs**

In [44]:
! pip install beautifulsoup4



In [47]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://en.wikipedia.org/wiki/Dog")

In [48]:
docs = loader.load()

In [49]:
print(docs[0].page_content[:500])

Dog - Wikipedia

Jump to content

Main menu

Main menu
move to sidebar
hide

Navigation

Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us

Contribute

HelpLearn to editCommunity portalRecent changesUpload file

Search

Search

Donate

Appearance

Create account

Log in

Personal tools

 Create account Log in

Pages for logged out e
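Text scraped from a web page usually contains long runs of blank lines and stray tabs left over from the HTML layout, so some cleanup is typically needed before further processing. A minimal sketch of one way to do that is below. The helper name `collapse_whitespace` and the sample string are assumptions. With the loader above it could be applied to `docs[0].page_content`.

```python
def collapse_whitespace(text: str) -> str:
    """Strip every line and drop the ones that end up empty."""
    # strip() removes leading/trailing spaces and tabs on each line.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

# Hypothetical snippet shaped like raw scraped page text.
raw = "\n\n\nDog - Wikipedia\n\n\n\nJump to content\n\n\t\tNavigation\n\t\n\nMain menu\n"
cleaned = collapse_whitespace(raw)
print(cleaned)
```

This only removes whitespace. Navigation text such as menu labels still remains and would need separate filtering.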


## **Notion**

Export the Notion page as Markdown & CSV.
Unzip the export and save it as a folder that contains the Markdown file for the Notion page.
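The unzip step can also be scripted. The sketch below extracts an export archive into a folder and lists the Markdown files it contains. The function name `extract_notion_export` and the file names in the demo archive are assumptions, and the demo builds a tiny fake export in a temporary directory so that it runs without a real Notion export.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_notion_export(zip_path, dest_dir):
    """Unzip an export archive into dest_dir and return its Markdown files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Search subfolders too, since nested pages may be exported into them.
    return sorted(dest.rglob("*.md"))

# Demo with a made-up archive; a real export would replace zip_path.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "export.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("Biz ideas.md", "# Biz ideas\n")
        archive.writestr("Biz ideas/Ideas.csv", "Name\n")
    found = extract_notion_export(zip_path, Path(tmp) / "notion")
    names = [p.name for p in found]
    print(names)
```

The destination folder is the path that would then be passed to the directory loader.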

In [24]:
from langchain_community.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("/Users/alexander/code/langchain-tutorial/notion")
docs = loader.load()

In [25]:
print(docs[0].page_content[0:200])

# Biz ideas

---

<aside>
💡

Got a bright idea? Let’s get rich!

</aside>

[Ideas](Biz%20ideas%20d7b85523d183402eb54e1cff7a1fc495/Ideas%2010a6cb5a5b0280acba91ec1a7f55e29d.csv)


In [26]:
docs[0].metadata

{'source': '/Users/alexander/code/langchain-tutorial/notion/Biz ideas d7b85523d183402eb54e1cff7a1fc495.md'}