# Retrieval augmented generation
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
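To make the retrieve-then-generate flow concrete, here is a minimal sketch that swaps a real retriever for plain word overlap and stops just before the LLM call. The toy documents, the `tokenize`, `retrieve`, and `build_prompt` helpers, and the scoring rule are all illustrative assumptions, not part of LangChain or OpenAI.

```python
import re

# Toy "external dataset" (made-up sentences); in practice these would be
# chunks loaded from PDFs, video transcripts, web pages, and so on.
DOCUMENTS = [
    "The first lecture of a machine learning course introduces supervised learning.",
    "LangChain is an open-source developer framework for building LLM applications.",
    "A company handbook describes how careers progress over time.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Return the k documents that share the most word tokens with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_tokens & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved text ahead of the question, ready to send to an LLM."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "What is LangChain?"
top_docs = retrieve(question, DOCUMENTS)
prompt = build_prompt(question, top_docs)
print(prompt)
```

A real pipeline would typically replace the word-overlap score with embedding similarity, but the shape stays the same: retrieve relevant text first, then hand it to the model together with the question.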

In [1]:
import os
import sys

import openai
from dotenv import load_dotenv, find_dotenv

sys.path.append('../..')

_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

## PDFs
Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents are the result of automated transcription, so words and sentences are sometimes split unexpectedly.

In [None]:
#! pip install pypdf 

In [2]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("../public/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.

In [None]:
len(pages)

In [None]:
page = pages[0]

In [None]:
# Print the first 500 characters of the page
print(page.page_content[:500])

In [None]:
# Check the metadata: `source` is the file path and `page` is the page number
page.metadata
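The text-plus-metadata shape is what lets an answer be traced back to its origin. The sketch below uses a hypothetical `SimpleDocument` dataclass as a stand-in (it is not LangChain's own `Document` class) and a hypothetical `cite` helper; the file name and page texts are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the two fields described above."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def cite(doc):
    """Build a short citation string from the `source` and `page` metadata."""
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page")
    # `page` is zero-based here, so add 1 for a human-readable page number
    return f"{source}, p. {page + 1}" if page is not None else source

pages_demo = [
    SimpleDocument("Welcome to the first lecture.", {"source": "example.pdf", "page": 0}),
    SimpleDocument("Today we cover logistics.", {"source": "example.pdf", "page": 1}),
]

for doc in pages_demo:
    print(cite(doc), "|", doc.page_content[:30])
```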

## YouTube

In [None]:
# YoutubeAudioLoader downloads the audio track of a YouTube video;
# OpenAIWhisperParser then transcribes it to text with OpenAI's Whisper
# speech-to-text model. GenericLoader combines the two steps.
from langchain_community.document_loaders.blob_loaders.youtube_audio import (
    YoutubeAudioLoader
)
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import (
    OpenAIWhisperParser
)

In [None]:
#! pip install yt_dlp
#! pip install pydub

In [None]:
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "../public/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

In [None]:
docs[0].page_content[:500]
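The loader above pairs a component that fetches raw data with a component that parses it into documents. A minimal sketch of that two-stage pattern follows; `fake_audio_blobs`, `fake_transcribe`, and `load_all` are hypothetical stand-ins that neither download nor transcribe anything, and the URL is a placeholder.

```python
from typing import Callable, Iterable, Iterator

def fake_audio_blobs(urls: Iterable[str]) -> Iterator[dict]:
    """Stand-in for the download step: yield one raw 'blob' per URL."""
    for url in urls:
        yield {"source": url, "data": b"<audio bytes>"}

def fake_transcribe(blob: dict) -> dict:
    """Stand-in for the speech-to-text step: return a document-like dict."""
    return {
        "page_content": f"transcript of {blob['source']}",
        "metadata": {"source": blob["source"]},
    }

def load_all(blobs: Iterable[dict], parse: Callable[[dict], dict]) -> list:
    """Run every blob through the parser and collect the resulting documents."""
    return [parse(blob) for blob in blobs]

docs_demo = load_all(fake_audio_blobs(["https://example.com/video"]), fake_transcribe)
print(docs_demo[0]["page_content"])
```

Keeping the two stages separate means the same parser can be reused with a different source of blobs (for example, local audio files), and vice versa.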

## URLs

In [3]:
from langchain_community.document_loaders import WebBaseLoader

In [4]:
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/making-a-career.md")

In [5]:
docs = loader.load()

In [6]:
print(docs[0].page_content[:500])

handbook/making-a-career.md at master · basecamp/handbook · GitHub

Skip to content

Toggle navigation

Sign in

Product

Actions
Automate any workflow

Packages
Host and manage packages

Security
Find and fix vulnerabilities

C
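Text loaded from a web page tends to carry navigation labels and long runs of blank lines, so some post-processing is usually needed before it is useful. As a minimal sketch, the hypothetical helper `collapse_blank_lines` below strips each line and drops the empty ones; the `raw` string is a made-up sample in the same style as the page text, and the helper does not remove navigation labels such as "Sign in".

```python
def collapse_blank_lines(text: str) -> str:
    """Strip each line and drop empty ones so only visible text remains."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)

# Made-up sample imitating whitespace-heavy page text
raw = "\n\n\nSkip to content\n\n\n\n          Sign in\n        \n\n Product\n"
cleaned = collapse_blank_lines(raw)
print(cleaned)
```

The same function could be applied to `docs[0].page_content` after loading, before the text is split or embedded.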


## Notion 

In [7]:
from langchain_community.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("../public/Notion_DB")
docs = loader.load()

In [8]:
print(docs[0].page_content[0:200])

# Learning LangChain

Created: March 6, 2024 8:32 PM
Tags: Personal

LangChain is an open-source developer framework for building LLM applications.
It has Python and TypeScript packages.
Focused on co


In [9]:
docs[0].metadata

{'source': '../public/Notion_DB/Learning LangChain 8bdb6f448cad46b3a344b7b14db17657.md'}
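The `source` path ends in what looks like a 32-character hexadecimal ID appended to the page title. Assuming that naming pattern holds for the other exported files, a readable title can be recovered with the hypothetical helper `notion_title` sketched here.

```python
import re
from pathlib import PurePosixPath

def notion_title(source: str) -> str:
    """Return the file stem with a trailing 32-character hex ID removed."""
    stem = PurePosixPath(source).stem  # file name without directory or extension
    # Assumption: the ID is exactly 32 lowercase hex characters after whitespace
    return re.sub(r"\s+[0-9a-f]{32}$", "", stem)

source = "../public/Notion_DB/Learning LangChain 8bdb6f448cad46b3a344b7b14db17657.md"
print(notion_title(source))
```

File names that do not match the pattern are returned unchanged (minus the extension), so the helper is safe to run over a mixed directory.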