In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

## Document Loaders
LangChain can load data from many sources:
* Websites.
* Databases.
* YouTube, Twitter.
* Excel, Pandas, Notion, Figma, Hugging Face, GitHub, etc.

LangChain can load data of many types:
* PDF.
* HTML.
* JSON.
* Word, Powerpoint, etc.

**Sometimes you will have to clean or prepare the data you load before you can use it.**
<br>
This is a task data scientists are used to doing.
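
As an illustration of that cleaning step, here is a minimal sketch that assumes the loaded text only needs whitespace normalization. The helper name `clean_text` and the sample string are hypothetical and are not part of LangChain; real data may need different rules.

```python
import re


def clean_text(text: str) -> str:
    """Collapse repeated spaces/tabs and drop empty lines (hypothetical helper)."""
    cleaned_lines = []
    for line in text.splitlines():
        # Collapse runs of spaces or tabs inside the line, then trim both ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # skip lines that are empty after trimming
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


# Made-up sample that imitates text extracted from a PDF page.
sample = "Page 1 of 4 PDF Files \n \n \nScanning   Documents \n"
print(clean_text(sample))
```

The same function could be applied to the `page_content` of each loaded document before passing it on to later steps.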

## Loading PDF documents

In [2]:
# !pip install pypdf

In [3]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("data/5pages.pdf")
pages = loader.load()

In [4]:
len(pages)

4

In [5]:
page = pages[0]

In [6]:
print(page.page_content[0:500])

Page 1 of 4 PDF Files 
Scan – Create – Reduce File Size  
 
 
It is recommended that you purchase an Adobe Acrobat product that 
allows you to read, create and manipulate PDF documents.  Go to http://www.adobe.com/products/acrobat/matrix.html
 to compare 
Adobe products and features –Adobe  Acrobat Standard is sufficient. 
 
 
Scanning Documents 
 
You should only have to scan docu ments that are not electronic, and 
when you are unable to create a PDF using PDFMaker or the Print 
Command from t


In [7]:
page.metadata

{'source': 'data/5pages.pdf', 'page': 0}
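
Each loaded page exposes `page_content` and `metadata`, as shown above. The sketch below shows one way to get a quick overview of every page. `FakePage` is a hypothetical stand-in with the same two attributes, used only so the example runs without a PDF, and `summarize_pages` is likewise a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Hypothetical stand-in with the same two attributes as a loaded page."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_pages(pages):
    """Return one (source, page number, character count) tuple per page."""
    return [
        (p.metadata.get("source"), p.metadata.get("page"), len(p.page_content))
        for p in pages
    ]


# Made-up pages; the metadata keys mirror the ones shown above.
pages_demo = [
    FakePage("Page 1 of 4 PDF Files", {"source": "data/5pages.pdf", "page": 0}),
    FakePage("Scanning Documents", {"source": "data/5pages.pdf", "page": 1}),
]
for row in summarize_pages(pages_demo):
    print(row)
```

Because the helper only reads `page_content` and `metadata`, it should also accept the `pages` list returned by `loader.load()`. Note that the `page` value in the metadata is zero-based.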

## Loading YouTube Audio

In [8]:
#from langchain.document_loaders.generic import GenericLoader
#from langchain.document_loaders.parsers import OpenAIWhisperParser
#from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

* Note the change in the previous code block: the loaders are now imported from the `langchain_community` package, and the commented-out lines show the older `langchain` import paths. Many thanks to Isabel González for the update; you are on the right track to become an Honor Student!

In [None]:
# !pip install yt_dlp
# !pip install pydub
# !pip install openai-whisper

In [15]:
url = "https://www.youtube.com/watch?v=Rb9Bpw8yvTg"
save_dir = "data/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),  # downloads the audio track into save_dir
    OpenAIWhisperParser()                 # transcribes the audio with OpenAI Whisper
)
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=Rb9Bpw8yvTg
[youtube] Rb9Bpw8yvTg: Downloading webpage
[youtube] Rb9Bpw8yvTg: Downloading ios player API JSON
[youtube] Rb9Bpw8yvTg: Downloading m3u8 information
[info] Rb9Bpw8yvTg: Downloading 1 format(s): 140
[download] Destination: data/youtube//LLM Apps： Overcoming the Context Window limits.m4a
[download] 100% of    1.86MiB in 00:00:01 at 1.74MiB/s   
[FixupM4a] Correcting container of "data/youtube//LLM Apps： Overcoming the Context Window limits.m4a"
[ExtractAudio] Not converting audio data/youtube//LLM Apps： Overcoming the Context Window limits.m4a; file is already in target format m4a
Transcribing part 1!
Transcribing part 1!


In [16]:
docs[0].page_content[0:500]

'How to overcome the context window limits? Remember that the context window is the maximum size of the context that we can give to an LLM. For example, an LLM like ChatGPT has the following context windows. The 3.5 model, the free model, supports a context window of up to 4096 tokens, approximately 3000 words or 6 pages. The model ChatGPT4 supports a context window of up to 8000 and a little bit tokens, approximately 6000 words or 12 pages. What limits does the context window impose on us? The c'

* **Update 06/01/2024**: We have tested the previous code and it works on our end. If you have trouble with it, try the following alternative, built by our soon-to-be Honor Student **Nicolás Oliveira**. It does not use the LangChain loader; it calls `yt_dlp` and the `whisper` package directly, and it does the job.

In [14]:
import os
from yt_dlp import YoutubeDL
import whisper

url = "https://www.youtube.com/watch?v=Rb9Bpw8yvTg"
save_dir = "data/youtube/"
audio_file = os.path.join(save_dir, "audio4.mp3")

os.makedirs(save_dir, exist_ok=True)

def download_audio(url, save_path):
    ydl_opts = {
        'format': 'bestaudio/best',
        'outtmpl': save_path,
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
    }

    with YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])

download_audio(url, audio_file)

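def transcribe_audio(audio_path):
    # Load the "base" Whisper model and return the transcribed text.
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return result['text']

# Reuse the path defined above instead of repeating the string literal.
transcription = transcribe_audio(audio_file)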
print(transcription)


[youtube] Extracting URL: https://www.youtube.com/watch?v=Rb9Bpw8yvTg
[youtube] Rb9Bpw8yvTg: Downloading webpage
[youtube] Rb9Bpw8yvTg: Downloading ios player API JSON
[youtube] Rb9Bpw8yvTg: Downloading m3u8 information
[info] Rb9Bpw8yvTg: Downloading 1 format(s): 251
[download] data/youtube/audio4.mp3 has already been downloaded
[download] 100% of    2.76MiB
[ExtractAudio] Not converting audio data/youtube/audio4.mp3; file is already in target format mp3




 How to overcome the context window limits? Remember that the context window is the maximum size of the context that we can give to an LLM. For example, an LLM like chat GPT has the following context windows. The 3.5 model, the free model, supports a context window of up to 4,096 tokens, approximately 3000 words or 6 pages. The model chat GPT 4 supports a context window of up to 8,000 and a little bit tokens, approximately 6000 words or 12 pages. What limits does the context window impose on us? The context window prevents us from doing things like asking chat GPT to summarize a 100 page report asking chat GPT to use a large database. If we try to do that, chat GPT is going to complete. I cannot do that. The context window I have is smaller than the size of the data you are trying to give me. For us, as LLM application developers, to overcome the limitation of the context window of the functional LLM is extremely important. How to overcome the context window limits is one of the import
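
Downloading and transcribing can take a while, so it may be worth saving the transcript to disk and reusing it on later runs. The following is a minimal sketch of that idea under stated assumptions: `load_or_transcribe` is a hypothetical helper, and the cache is assumed to live next to the audio file with a `.txt` extension.

```python
from pathlib import Path


def load_or_transcribe(audio_path, transcribe_fn, cache_suffix=".txt"):
    """Return a cached transcript if present; otherwise transcribe and cache it.

    Hypothetical helper: transcribe_fn is any callable that takes a file path
    (as a string) and returns the transcript text.
    """
    cache_path = Path(audio_path).with_suffix(cache_suffix)
    if cache_path.exists():
        # A transcript was saved by an earlier run, so skip transcription.
        return cache_path.read_text(encoding="utf-8")
    text = transcribe_fn(str(audio_path))
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding="utf-8")
    return text
```

With the names from the cell above, a call might look like `transcription = load_or_transcribe(audio_file, transcribe_audio)`. The first call would run Whisper and write `data/youtube/audio4.txt`; later calls would read that file instead.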

## Loading websites

**Option 1: Web Base Loader**

In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://aiaccelera.com/100-ai-startups-100-llm-apps-that-have-earned-500000-before-their-first-year-of-existence/")

In [None]:
docs = loader.load()

In [None]:
print(docs[0].page_content[:2000])

**Option 2: Unstructured HTML Loader**

In [None]:
# !pip install unstructured

In [None]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

In [None]:
loader = UnstructuredHTMLLoader("data/_100 AI Startups__ 100 LLM Apps that have earned $500,000 before their first year of existence.html")

In [None]:
data = loader.load()

In [None]:
data

**Option 3: Beautiful Soup (via WebBaseLoader)**

In [None]:
# !pip install beautifulsoup4

In [None]:
from langchain_community.document_loaders import WebBaseLoader

In [None]:
loader = WebBaseLoader("https://aiaccelera.com/ai-consulting-for-businesses/")
data = loader.load()
data