# Document Loading

## Note to students.
During periods of high load, the notebook may become unresponsive. A cell may appear to execute and update the completion number in brackets [#] to its left without actually having run. This is most obvious with print statements that produce no output. If this happens, restart the kernel using the command under the Kernel tab.

## Retrieval augmented generation
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful if we want to ask questions about specific documents (e.g., our PDFs or a set of videos).
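To make the retrieve-then-generate flow concrete, here is a minimal sketch. It assumes a toy word-overlap scorer in place of a real embedding search, and the names `tokenize`, `retrieve`, and `build_prompt` are hypothetical helpers, not part of LangChain or the OpenAI client.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the question.

    Word overlap is a stand-in (assumption) for a real similarity search.
    """
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved context ahead of the question for the LLM."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}"


corpus = [
    "CS229 covers supervised and unsupervised learning.",
    "The handbook describes vacation policy.",
]
question = "What does CS229 cover?"
prompt = build_prompt(question, retrieve(question, corpus))
print(prompt)
```

The prompt built this way is what would be sent to the LLM; the loaders in the rest of this notebook are about producing the documents that fill the context.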

In [1]:
#! pip install langchain

In [3]:
! pip install openai


In [4]:
! pip install python-dotenv



In [5]:
import os
import openai
from dotenv import load_dotenv

load_dotenv()  # Load variables from a .env file into the environment; by default it looks for a file named .env, starting from the current directory.

openai.api_key = os.getenv("OPENAI_API_KEY")
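If `.env` is missing or does not define the key, `os.getenv` returns `None`, and the problem typically only shows up later, on the first API call. A small guard makes that visible early. `require_env` is a hypothetical helper name, not part of python-dotenv or openai, and the demo uses a placeholder variable so it runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the environment variable's value, or raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Demo with a placeholder variable so the example runs without a real key.
os.environ["DEMO_API_KEY"] = "sk-placeholder"
print(require_env("DEMO_API_KEY"))
```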

## PDFs
Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents are the result of automated transcription so words and sentences are sometimes split unexpectedly.

In [6]:
# The course will show the pip installs you would need to install packages on your own machine.
# These packages are already installed on this platform and should not be run again.
! pip install pypdf 



In [2]:
#! pip install -U langchain-community

In [8]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

Each page is a Document.

A Document contains text (page_content) and metadata.
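The shape of a `Document` can be sketched with a plain dataclass. `SimpleDocument` below is a hypothetical stand-in, used so the example runs without LangChain installed; it only mirrors the two attributes named above, and `chars_per_page` is likewise an illustrative helper.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Stand-in (assumption) mirroring a Document's two attributes."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chars_per_page(pages: list[SimpleDocument]) -> dict[int, int]:
    """Map each page index (taken from metadata) to its text length."""
    return {p.metadata["page"]: len(p.page_content) for p in pages}


pages_demo = [
    SimpleDocument("MachineLearning-Lecture01", {"page": 0}),
    SimpleDocument("Instructor (Andrew Ng): Okay.", {"page": 1}),
]
print(chars_per_page(pages_demo))
```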

In [9]:
len(pages)

22

In [10]:
page = pages[0]

In [11]:
type(page)

langchain_core.documents.base.Document

In [12]:
type(page.page_content)

str

In [13]:
type(page.metadata)

dict

In [15]:
print(page.page_content[0:200])

MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of


In [14]:
page.metadata

{'producer': 'Acrobat Distiller 8.1.0 (Windows)',
 'creator': 'PScript5.dll Version 5.2.2',
 'creationdate': '2008-07-11T11:25:23-07:00',
 'author': '',
 'moddate': '2008-07-11T11:25:23-07:00',
 'title': '',
 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf',
 'total_pages': 22,
 'page': 0,
 'page_label': '1'}

## YouTube

In [17]:
from langchain_community.document_loaders import YoutubeLoader

In [None]:
#!pip install --upgrade --quiet  youtube-transcript-api

In [None]:
#!pip install --upgrade --quiet  pytube

In [None]:
#!pip install --upgrade --quiet yt-dlp

In [64]:
# Initialize the loader with the desired parameters
loader = YoutubeLoader.from_youtube_url(
    # "https://www.youtube.com/watch?v=jGwO_UgTS7I",  # Alternative video
    "https://www.youtube.com/watch?v=x8FASlLf5ls",
    add_video_info=False,   # Set to True to fetch video metadata
    language=["zh", "en"],  # Specify transcript language(s)
    translation="en",       # Translate the transcript if necessary
)

In [65]:
docs = loader.load()

In [66]:
docs

[]

In [37]:
docs[0].metadata

{'source': 'jGwO_UgTS7I'}

In [39]:
import yt_dlp

In [61]:
def fetch_youtube_metadata(url: str):
    ydl_opts = {
        'quiet': True,
        'skip_download': True,  # Don't download the video
        'extract_flat': False,  # Extract full metadata
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)
        return {
            "title": info.get("title"),
            "description": info.get("description"),
            "upload_date": info.get("upload_date"),
            "duration": info.get("duration"),
            "view_count": info.get("view_count"),
            "like_count": info.get("like_count"),
            "channel": info.get("uploader"),
            "channel_url": info.get("uploader_url"),
            "tags": info.get("tags"),
            "categories": info.get("categories"),
            "thumbnail": info.get("thumbnail"),
            "webpage_url": info.get("webpage_url"),
        }

In [62]:
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
#url = "https://www.youtube.com/watch?v=x8FASlLf5ls"
metadata = fetch_youtube_metadata(url)
print(metadata)

{'title': 'Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018)', 'description': "For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai\n\nListen to the first lecture in Andrew Ng's machine learning course. This course provides a broad introduction to machine learning and statistical pattern recognition. Learn about both supervised and unsupervised learning as well as learning theory, reinforcement learning and control. Explore recent applications of machine learning and design and develop algorithms for machines.\n\nAndrew Ng is an Adjunct Professor of Computer Science at Stanford University. View more about Andrew on his website: https://www.andrewng.org/\n \nTo follow along with the course schedule and syllabus, visit: \nhttp://cs229.stanford.edu/syllabus-autumn2018.html\n\n0:00 Introduction\n05:21 Teaching team introductions\n06:42 Goals for the course and the state of machine learning

In [17]:
list(metadata.keys())

['title',
 'description',
 'upload_date',
 'duration',
 'view_count',
 'like_count',
 'channel',
 'channel_url',
 'tags',
 'categories',
 'thumbnail',
 'webpage_url']
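Two of these fields come back in raw form: outputs elsewhere in this notebook show `upload_date` as a `YYYYMMDD` string such as `'20250509'` and `duration` as a number of seconds such as `712`. A small sketch for turning them into friendlier values; both helper names are hypothetical.

```python
from datetime import date, datetime


def parse_upload_date(raw: str) -> date:
    """Convert a 'YYYYMMDD' string into a date object."""
    return datetime.strptime(raw, "%Y%m%d").date()


def format_duration(seconds: int) -> str:
    """Render a duration in seconds as H:MM:SS."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


print(parse_upload_date("20250509"))  # 2025-05-09
print(format_duration(712))           # 0:11:52
```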

In [46]:
docs[0].metadata = metadata.copy()

In [47]:
docs[0].metadata 

{'title': 'Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018)',
 'description': "For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai\n\nListen to the first lecture in Andrew Ng's machine learning course. This course provides a broad introduction to machine learning and statistical pattern recognition. Learn about both supervised and unsupervised learning as well as learning theory, reinforcement learning and control. Explore recent applications of machine learning and design and develop algorithms for machines.\n\nAndrew Ng is an Adjunct Professor of Computer Science at Stanford University. View more about Andrew on his website: https://www.andrewng.org/\n \nTo follow along with the course schedule and syllabus, visit: \nhttp://cs229.stanford.edu/syllabus-autumn2018.html\n\n0:00 Introduction\n05:21 Teaching team introductions\n06:42 Goals for the course and the state of machine learnin

In [4]:
#!pip install --upgrade langchain langchain-community

In [9]:
from langchain_community.document_loaders.blob_loaders.youtube_audio import (
    YoutubeAudioLoader,
)
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers.audio import (
    OpenAIWhisperParser,
    OpenAIWhisperParserLocal,
)

In [None]:
#%pip install --upgrade --quiet  yt_dlp
#%pip install --upgrade --quiet  pydub
#%pip install --upgrade --quiet  librosa

Note: you may need to restart the kernel to use updated packages.


In [27]:
#!pip install --upgrade transformers

In [28]:
#!pip install torch

In [None]:
#!pip install ipywidgets --upgrade --quiet

In [25]:
import os
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

In [6]:
# Stanford CS229 Lecture 1 (Andrew Ng, Autumn 2018)
urls = ["https://www.youtube.com/watch?v=jGwO_UgTS7I"]
# Directory to save audio files
save_dir = "docs/youtube/"

In [7]:
# set a flag to switch between local and remote parsing
# change this to True if you want to use local parsing
local = True

In [None]:
# Transcribe the videos to text
if local:
    loader = GenericLoader(
        YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParserLocal(language="en")
    )
else:
    loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser(language="en"))
docs = loader.load()



Using the following model:  openai/whisper-base


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Device set to use cpu


[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading tv client config
[youtube] jGwO_UgTS7I: Downloading tv player API JSON
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] Destination: docs\youtube\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a
[download] 100% of   69.76MiB in 00:00:35 at 1.95MiB/s   
[FixupM4a] Correcting container of "docs\youtube\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a"
[ExtractAudio] Not converting audio docs\youtube\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a
Transcribing part docs\youtube\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a!


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [13]:
docs[0].page_content

" Welcome to CS229 Machine Learning. Some of you know that this class is for this Stanford for a long time. And this is often the class that I most look forward to teaching each year because this is where we've helped, I think several generations of Stanford students become experts in machine learning, got built many of their products and services and startups that I'm sure many of you, all of you are using today. So what I want to do today was spend some time talking over logistics and then spend some time giving you a beginning of an intro, talk a little bit about machine learning. So about two, three, nine. You know, all of you have been reading about AI in the news, about machine learning in the news. And you've pretty heard me or they say AI is a new electricity. Much less a rise of electricity about a hundred years ago, transform every major industry. I think AI, or really, we call it machine learning, but the rest of the world seems to call it AI. Machine learning and AI and dee

In [14]:
docs[0].metadata

{'source': 'docs\\youtube\\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a'}

In [19]:
metadata | docs[0].metadata

{'title': 'Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018)',
 'description': "For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai\n\nListen to the first lecture in Andrew Ng's machine learning course. This course provides a broad introduction to machine learning and statistical pattern recognition. Learn about both supervised and unsupervised learning as well as learning theory, reinforcement learning and control. Explore recent applications of machine learning and design and develop algorithms for machines.\n\nAndrew Ng is an Adjunct Professor of Computer Science at Stanford University. View more about Andrew on his website: https://www.andrewng.org/\n \nTo follow along with the course schedule and syllabus, visit: \nhttp://cs229.stanford.edu/syllabus-autumn2018.html\n\n0:00 Introduction\n05:21 Teaching team introductions\n06:42 Goals for the course and the state of machine learnin
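The cells in this section combine metadata in different ways, and the choice matters. Assigning `metadata.copy()` replaces the document's metadata outright, so its original `source` key is lost. The `|` operator (Python 3.9+) builds a new dict in which the right-hand operand wins on duplicate keys, and `update()` merges in place. A short sketch with made-up values:

```python
video_meta = {"title": "Lecture 1", "source": "from-video-metadata"}
doc_meta = {"source": "docs/youtube/lecture1.m4a"}

# Replacement: the document's own 'source' is discarded.
replaced = video_meta.copy()

# Merge into a new dict: right-hand values win on shared keys.
merged = video_meta | doc_meta

# In-place merge: update() mutates its target; incoming values win.
doc_meta_updated = dict(doc_meta)
doc_meta_updated.update(video_meta)

print(replaced["source"])          # from-video-metadata
print(merged["source"])            # docs/youtube/lecture1.m4a
print(doc_meta_updated["source"])  # from-video-metadata
```

The metadata returned by `fetch_youtube_metadata` has no `source` key, so in this notebook the merge order only decides which dict's keys come first; the conflict above is contrived to show the precedence rule.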

In [32]:
import os
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers.audio import (
    OpenAIWhisperParser,
    OpenAIWhisperParserLocal,
)
from langchain_community.document_loaders.blob_loaders import YoutubeAudioLoader

# Optional: Suppress Hugging Face symlink warning on Windows
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

def transcribe_youtube_videos(
    urls: list[str],
    save_dir: str = "docs/youtube",
    local: bool = False,
    language: str = "en"
):
    """
    Downloads and transcribes YouTube videos using OpenAI Whisper.

    Parameters:
        urls (list[str]): List of YouTube video URLs.
        save_dir (str): Directory to save downloaded audio.
        local (bool): Whether to use the local Whisper model. When True, `language` is not forwarded to the parser.
        language (str): Language code (e.g., "en", "zh") passed to the remote parser.

    Returns:
        list[Document]: LangChain documents containing the transcribed text.
    """
    parser = (
        OpenAIWhisperParserLocal()
        if local
        else OpenAIWhisperParser(language=language)
    )

    loader = GenericLoader(
        YoutubeAudioLoader(urls, save_dir),
        parser
    )
    return loader.load()

In [33]:
urls = ["https://www.youtube.com/watch?v=x8FASlLf5ls"]
docs = transcribe_youtube_videos(urls, save_dir="docs/youtube/", local=True, language="en")

print(docs[0].page_content[:300])  

Using the following model:  openai/whisper-base


Device set to use cpu


[youtube] Extracting URL: https://www.youtube.com/watch?v=x8FASlLf5ls
[youtube] x8FASlLf5ls: Downloading webpage
[youtube] x8FASlLf5ls: Downloading tv client config
[youtube] x8FASlLf5ls: Downloading tv player API JSON
[youtube] x8FASlLf5ls: Downloading ios player API JSON
[youtube] x8FASlLf5ls: Downloading m3u8 information
[info] x8FASlLf5ls: Downloading 1 format(s): 140
[download] Destination: docs\youtube\美股 下周小心，药企相关！UNH、LLY 走势糟糕！TSLA、AMZN表现不错！TSM、ARM！NVDA和老黄的心酸！.m4a
[download] 100% of   23.07MiB in 00:00:10 at 2.14MiB/s   
[FixupM4a] Correcting container of "docs\youtube\美股 下周小心，药企相关！UNH、LLY 走势糟糕！TSLA、AMZN表现不错！TSM、ARM！NVDA和老黄的心酸！.m4a"
[ExtractAudio] Not converting audio docs\youtube\美股 下周小心，药企相关！UNH、LLY 走势糟糕！TSLA、AMZN表现不错！TSM、ARM！NVDA和老黄的心酸！.m4a; file is already in target format m4a
Transcribing part docs\youtube\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a!




Transcribing part docs\youtube\美股 下周小心，药企相关！UNH、LLY 走势糟糕！TSLA、AMZN表现不错！TSM、ARM！NVDA和老黄的心酸！.m4a!
 Welcome to CS229 Machine Learning. Some of you know that this class is for this Stanford for a long time. And this is often the class that I most look forward to teaching each year because this is where we've helped, I think several generations of Stanford students become experts in machine learnin


In [54]:
docs

[Document(metadata={'source': 'docs\\youtube\\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a'}, page_content=" Welcome to CS229 Machine Learning. Some of you know that this class is for this Stanford for a long time. And this is often the class that I most look forward to teaching each year because this is where we've helped, I think several generations of Stanford students become experts in machine learning, got built many of their products and services and startups that I'm sure many of you, all of you are using today. So what I want to do today was spend some time talking over logistics and then spend some time giving you a beginning of an intro, talk a little bit about machine learning. So about two, three, nine. You know, all of you have been reading about AI in the news, about machine learning in the news. And you've pretty heard me or they say AI is a new electricity. Much less a rise of electricity about a hundred years ago, transform every maj

In [42]:
docs[1].page_content

'大家好 歡迎回到視野環球財經 我是Rino現在是美東世界 二五年5月9日週五晚上的8點25分今天是標普高開不過美國總統川普提出將中國的關稅從145降到80的想法但他們表示這將取決於美國財長貝森特其實如此之高的關税是大幅高于市场的预期所以股市和国债是双双下跌回徒了高开的账赋今天的收盘价和昨天是基本持平不过今天少数的个股市走出了独立行情比如特斯拉特斯拉这边今天是放量上行朝着突破之后就是2007这个突破上行突破之后的下一个波段目标位更上一层楼这个目标位是3 2 6到367但是我这里要提醒一下这个波段目标的压力是比较大的预计一次非常难过去因为这个地方是它上看前高之前最大的压力位置而且从下个星期的月期权结算的位置来看300块钱上下大概是21美元左右的波动这就意味着下个星期的月期全結算的位置來看300塊錢上下大概是21美元左右的波動這對意味著下個星期想去突破壓力是一個非常小的概率但是如果有一些新的關稅其他的利好消息出來走入到小概率區間出外但向上是有機會摸到壓力位的下延的如果下感性机是下跌那么回撤大概是在2008的上方所以特斯拉后面还是继续上看326到367的这个波段那么做波段的朋友自行做好交易计划好 接下来时间我会直接去跟踪讲解一些个国从亚马运开始是一直会讲到李莱首先说亚马运亚马运昨天其实收盘价就已经是占稳了191块钱这个地方咱们之前曾经讲过191占稳之后它就比较稳当了但是它现在在今天收盘第二天收到了191的上方总体上大家也不要期望他能跑得特别的快因为市场对他的波动压住其实并不高下个星期大概是七八块钱左右的上下波动也就是说能够维持在182的强支撰的上方然后逐渐的寻求低点抬高高点也逐渐抬高的这么一个过程因为沿往讯这只股票的特性它不是那种暴力拉升很快几下就拉上去的它是属于罗选市上升这种走势是比较明显的股票所以所以長期持有亞馬遜的朋友如果你是在191到下方的166區間在這個區間進場的朋友暫時你可以拿得比較穩當了沒有什麼太大的問題了如果想做波段的朋友呢讓他講過不要期待他跑得特別的快會是一個比較慢较慢的过程接下来说一下关于台积电首先说一下关于整个的台湾产品出口的一个背景整个的台湾4月份商品出口是同比增长29.9%几乎是市场预期的2倍值得注意的是对于美国的出口同比增长29.5%当然很大程度上是得益于对高科技产品的旺盛需求按照产品来细分信息通信和视频产品比如电脑笔记本显示器的出货量是同比增

In [50]:
docs[1].metadata

{'source': 'docs\\youtube\\美股 下周小心，药企相关！UNH、LLY 走势糟糕！TSLA、AMZN表现不错！TSM、ARM！NVDA和老黄的心酸！.m4a'}

In [63]:
url = "https://www.youtube.com/watch?v=x8FASlLf5ls"
metadata = fetch_youtube_metadata(url)

In [64]:
metadata

{'title': '美股 下周小心，药企相关！UNH、LLY 走势糟糕！TSLA、AMZN表现不错！TSM、ARM！NVDA和老黄的心酸！',
 'description': '加会员前必读！\n\n大家好，视野环球财经的会员功能开通了。\n\n【如何加入会员？】\n在你手机或电脑端，打开视野环球财经栏目，在订阅旁，有一个“加入/Join”按钮，点击加入即可。或点击链接：https://www.youtube.com/channel/UCFQsi7WaF5X41tcuOryDk8w/join\n\n【看不到“加入/Join”怎么办？】\n如果你所在的地区，没有任何限制，比如美、加、澳洲、欧洲都没有限制，但还是看不到“加入/Join”按钮怎么办？\n\n建议使用非苹果系统，用电脑端、手机端，浏览器或者App打开YouTube，就能看到“加入/Join”按钮。苹果系统，有时候看不到“加入/Join”按钮，你可以尝试在Ipad或Mac上，使用Safari 或者Chrome浏览器登陆YouTube（不要使用YouTube App），就能看到加入按钮。加入成功后，可以正常使用YouTube App观看节目。\n\n如果你的地区受到YouTube限制，不支持加会员，使用上述方法，依旧无法加入YouTube会员，可以选择替代方案，订阅我的Patreon会员，会同步更新YouTube会员视频。https://www.patreon.com/RhinoBull\n\n【免费视频的更新受影响吗？】 \n\n免费视频的更新不受任何影响，目前免费视频的更新频率为，每个交易日的盘后更新，周末的视频，根据行情的重要程度确定是否更新，未来免费视频也会维持这个更新节奏。\n\n【会员功能和免费视频的区别是什么呢？】\n\n首先强调，如果你觉得目前的免费视频，对你来说信息量足够用，你完全不必花钱购买会员功能！毕竟再小的钱也是钱，我一直倡导理性投资，理性消费。\n\n付费会员功能，是针对极少数观众提出的需求开通的，是免费节目的延伸和补充！会涉及更多的细节层面！\n\n【加入会员到底能看到什么？】\n\n会员有专属的视频和帖子，内容包括：我的个股、行业研究；潜在的短线、中线、长线投资机会、策略分享；投资心理、实用交易技巧；会员优先答疑。\n\n【三个级别会员级别有什么区别？】\n\n“每周一杯咖啡学投

In [65]:
docs[1].metadata.update(metadata)

In [67]:
docs[1].metadata

{'source': 'docs\\youtube\\美股 下周小心，药企相关！UNH、LLY 走势糟糕！TSLA、AMZN表现不错！TSM、ARM！NVDA和老黄的心酸！.m4a',
 'title': '美股 下周小心，药企相关！UNH、LLY 走势糟糕！TSLA、AMZN表现不错！TSM、ARM！NVDA和老黄的心酸！',
 'description': '加会员前必读！\n\n大家好，视野环球财经的会员功能开通了。\n\n【如何加入会员？】\n在你手机或电脑端，打开视野环球财经栏目，在订阅旁，有一个“加入/Join”按钮，点击加入即可。或点击链接：https://www.youtube.com/channel/UCFQsi7WaF5X41tcuOryDk8w/join\n\n【看不到“加入/Join”怎么办？】\n如果你所在的地区，没有任何限制，比如美、加、澳洲、欧洲都没有限制，但还是看不到“加入/Join”按钮怎么办？\n\n建议使用非苹果系统，用电脑端、手机端，浏览器或者App打开YouTube，就能看到“加入/Join”按钮。苹果系统，有时候看不到“加入/Join”按钮，你可以尝试在Ipad或Mac上，使用Safari 或者Chrome浏览器登陆YouTube（不要使用YouTube App），就能看到加入按钮。加入成功后，可以正常使用YouTube App观看节目。\n\n如果你的地区受到YouTube限制，不支持加会员，使用上述方法，依旧无法加入YouTube会员，可以选择替代方案，订阅我的Patreon会员，会同步更新YouTube会员视频。https://www.patreon.com/RhinoBull\n\n【免费视频的更新受影响吗？】 \n\n免费视频的更新不受任何影响，目前免费视频的更新频率为，每个交易日的盘后更新，周末的视频，根据行情的重要程度确定是否更新，未来免费视频也会维持这个更新节奏。\n\n【会员功能和免费视频的区别是什么呢？】\n\n首先强调，如果你觉得目前的免费视频，对你来说信息量足够用，你完全不必花钱购买会员功能！毕竟再小的钱也是钱，我一直倡导理性投资，理性消费。\n\n付费会员功能，是针对极少数观众提出的需求开通的，是免费节目的延伸和补充！会涉及更多的细节层面！\n\n【加入会员到底能看到什么？】\n\n会员有专属的视

In [None]:
# load_youtube_document is defined in a later cell; run that cell first.
docs = load_youtube_document(
    url="https://www.youtube.com/watch?v=x8FASlLf5ls",
    save_dir="docs/youtube/",
    local=True,
    language="zh"
)

In [69]:
!pip install langdetect


In [70]:
from langdetect import detect

def detect_lang_from_text(text: str) -> str:
    """Detect the language of the text, mapping Chinese variants to "zh"."""
    try:
        lang = detect(text)
        if lang in ("zh-cn", "zh-tw"):
            return "zh"
        return lang
    except Exception:
        return "en"  # Fall back to English if detection fails

In [71]:
url = "https://www.youtube.com/watch?v=bSKX_PPflsk"
# Detect the language from the title and description
metadata = fetch_youtube_metadata(url)
combined_text = f"{metadata.get('title', '')} {metadata.get('description', '')}"
detected_language = detect_lang_from_text(combined_text)

In [72]:
detected_language

'zh'
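The detector reports regional variants such as `zh-cn` and `zh-tw`, which the function above collapses to `zh`. The same idea generalizes to any region-tagged code by keeping only the part before the hyphen. `normalize_lang` is a hypothetical helper, and whether a downstream API expects the short or the regional form is an assumption to check per API.

```python
def normalize_lang(code: str, default: str = "en") -> str:
    """Reduce a code like 'zh-cn' to its base language, e.g. 'zh'."""
    base = code.strip().lower().split("-", 1)[0]
    return base or default


for code in ["zh-cn", "zh-tw", "en", "PT-BR", ""]:
    print(repr(code), "->", normalize_lang(code))
```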

In [74]:
from typing import List, Optional
from langchain_core.documents import Document
from langchain_community.document_loaders import YoutubeLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers.audio import (
    OpenAIWhisperParser,
    OpenAIWhisperParserLocal,
)
from langchain_community.document_loaders.blob_loaders import YoutubeAudioLoader
import yt_dlp
import os
from langdetect import detect

# Optional: Suppress Hugging Face symlink warning on Windows
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

def fetch_youtube_metadata(url: str) -> dict:
    ydl_opts = {
        'quiet': True,
        'skip_download': True,
        'extract_flat': False,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)
        return {
            "title": info.get("title"),
            "description": info.get("description"),
            "upload_date": info.get("upload_date"),
            "duration": info.get("duration"),
            "view_count": info.get("view_count"),
            "like_count": info.get("like_count"),
            "channel": info.get("uploader"),
            "channel_url": info.get("uploader_url"),
            "tags": info.get("tags"),
            "categories": info.get("categories"),
            "thumbnail": info.get("thumbnail"),
            "webpage_url": info.get("webpage_url"),
        }
    
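def detect_lang_from_text(text: str) -> str:
    """Detect the language of the text, mapping Chinese variants to "zh"."""
    try:
        lang = detect(text)
        if lang in ("zh-cn", "zh-tw"):
            return "zh"
        return lang
    except Exception:
        return "en"  # Fall back to English if detection fails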
    
def load_youtube_document(
    url: str,
    save_dir: str = "docs/youtube",
    local: bool = True,
    language: Optional[str] = None
) -> List[Document]:
    """
    Loads transcript from YouTube if available, otherwise transcribes from audio.
    Returns enriched LangChain Document(s).
    """
    docs = []

    # Fetch the video metadata and detect the language if none was given
    metadata = fetch_youtube_metadata(url)
    if language is None:  # No language specified: detect it from the title and description
        print("[Language Detection] Language not specified, detecting...")
        combined_text = f"{metadata.get('title', '')} {metadata.get('description', '')}"
        language = detect_lang_from_text(combined_text)

    # Try loading transcript
    try:
        loader = YoutubeLoader.from_youtube_url(
            url,
            add_video_info=False,
            language=[language],
            translation=language
        )
        docs = loader.load()
    except Exception as e:
        print(f"[Transcript Loader] Warning: {e}")

    # If no transcript, fall back to Whisper transcription
    if not docs:
        print("[Fallback] No transcript found. Using Whisper transcription.")
        parser = (
            OpenAIWhisperParserLocal()
            if local
            else OpenAIWhisperParser(language=language)
        )
        audio_loader = GenericLoader(YoutubeAudioLoader([url], save_dir), parser)
        docs = audio_loader.load()


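    # Note: this stamps the same metadata on every returned document,
    # including transcripts of audio files left in save_dir by earlier runs.
    for doc in docs:
        doc.metadata.update(metadata)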

    return docs

In [76]:
docs = load_youtube_document(
    url="https://www.youtube.com/watch?v=tkFDeadKz2I",
    save_dir="docs/youtube",
    local=True,
    language=None,
)

[Language Detection] Language not specified, detecting...
[Fallback] No transcript found. Using Whisper transcription.
Using the following model:  openai/whisper-base


Device set to use cpu


[youtube] Extracting URL: https://www.youtube.com/watch?v=tkFDeadKz2I
[youtube] tkFDeadKz2I: Downloading webpage
[youtube] tkFDeadKz2I: Downloading tv client config
[youtube] tkFDeadKz2I: Downloading tv player API JSON
[youtube] tkFDeadKz2I: Downloading ios player API JSON
[youtube] tkFDeadKz2I: Downloading m3u8 information
[info] tkFDeadKz2I: Downloading 1 format(s): 140
[download] docs\youtube\特斯拉为何突然上涨？【2025-05-09】.m4a has already been downloaded
[download] 100% of   10.98MiB
[ExtractAudio] Not converting audio docs\youtube\特斯拉为何突然上涨？【2025-05-09】.m4a; file is already in target format m4a
Transcribing part docs\youtube\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a!




Transcribing part docs\youtube\特斯拉为何突然上涨？【2025-05-09】.m4a!
Transcribing part docs\youtube\美股 下周小心，药企相关！UNH、LLY 走势糟糕！TSLA、AMZN表现不错！TSM、ARM！NVDA和老黄的心酸！.m4a!


In [77]:
docs

[Document(metadata={'source': 'docs\\youtube\\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a', 'title': '特斯拉为何突然上涨？【2025-05-09】', 'description': '加入阳光财经会员，查看及时成交记录，完整连续。每个月还会定期推出会员视频，解答会员提问。https://www.youtube.com/channel/UC2I5em6UyBpQiO-8ZW0nV3w/join\nIBKR不止是大券商，还免费提供分析师TIPRANKS一致数据。\nhttps://www.interactivebrokers.com/mkt/?src=sunnyfinance6&url=%2Fcn%2Fwhyib%2Foverview.php\n\n\n\n……………………………………………….……………………………………………….\n声明\n本人及团队不是Financial Advisor，本节目以过往历史视频均仅供信息参考，不构成投资建议。据本频道提供的信息买卖证券，风险自负，与本节目无关。本人节目中涉及的股票均不存在与任何第三方的利益相关联，但本人可能持有或将来可能买卖相关证券。谢谢您的支持！', 'upload_date': '20250509', 'duration': 712, 'view_count': 55442, 'like_count': 2514, 'channel': '阳光财经', 'channel_url': 'https://www.youtube.com/@SUNNYFINANCE', 'tags': ['美股', '美股分析', '技术分析', '阳光财经', 'Sunny', '中概股', '大盘', '美股大盘', '股息', '被动收入', '大盘分析', '特斯拉', 'saham AS', 'Kewangan Sunshine', '陽光財經'], 'categories': ['People & Blogs'], 'thumbnail': 'https://i.ytimg.com/vi/tkFDeadKz2I/maxresdefault.jpg', '

In [79]:
docs[1].metadata

{'source': 'docs\\youtube\\特斯拉为何突然上涨？【2025-05-09】.m4a',
 'title': '特斯拉为何突然上涨？【2025-05-09】',
 'description': '加入阳光财经会员，查看及时成交记录，完整连续。每个月还会定期推出会员视频，解答会员提问。https://www.youtube.com/channel/UC2I5em6UyBpQiO-8ZW0nV3w/join\nIBKR不止是大券商，还免费提供分析师TIPRANKS一致数据。\nhttps://www.interactivebrokers.com/mkt/?src=sunnyfinance6&url=%2Fcn%2Fwhyib%2Foverview.php\n\n\n\n……………………………………………….……………………………………………….\n声明\n本人及团队不是Financial Advisor，本节目以过往历史视频均仅供信息参考，不构成投资建议。据本频道提供的信息买卖证券，风险自负，与本节目无关。本人节目中涉及的股票均不存在与任何第三方的利益相关联，但本人可能持有或将来可能买卖相关证券。谢谢您的支持！',
 'upload_date': '20250509',
 'duration': 712,
 'view_count': 55442,
 'like_count': 2514,
 'channel': '阳光财经',
 'channel_url': 'https://www.youtube.com/@SUNNYFINANCE',
 'tags': ['美股',
  '美股分析',
  '技术分析',
  '阳光财经',
  'Sunny',
  '中概股',
  '大盘',
  '美股大盘',
  '股息',
  '被动收入',
  '大盘分析',
  '特斯拉',
  'saham AS',
  'Kewangan Sunshine',
  '陽光財經'],
 'categories': ['People & Blogs'],
 'thumbnail': 'https://i.ytimg.com/vi/tkFDeadKz2I/maxresdefault.jpg',
 'webpage_url': 'https://www.youtub

**Note**: This can take several minutes to complete.
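The transcription logs above suggest that every `.m4a` file already present in `save_dir` is transcribed again on each run. That would explain why the CS229 lecture reappears as `docs[0]` for unrelated URLs and ends up carrying another video's title. One way to avoid this, sketched under the assumption that a separate folder per video is acceptable, is to derive the save directory from the video ID. `video_id` and `save_dir_for` are hypothetical helpers, and only `watch?v=` and `youtu.be` style URLs are handled.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlparse


def video_id(url: str) -> str:
    """Extract the video ID from a watch?v=... or youtu.be/... URL."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    ids = parse_qs(parsed.query).get("v")
    if not ids:
        raise ValueError(f"no video id found in {url!r}")
    return ids[0]


def save_dir_for(url: str, root: str = "docs/youtube") -> str:
    """Return a per-video folder so stale audio files are never mixed in."""
    return str(PurePosixPath(root) / video_id(url))


print(save_dir_for("https://www.youtube.com/watch?v=tkFDeadKz2I"))
# docs/youtube/tkFDeadKz2I
```

Passing `save_dir_for(url)` as `save_dir` would keep each call's audio isolated, at the cost of one folder per video.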

## URLs

In [18]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [19]:
docs = loader.load()

In [20]:
print(docs[0].page_content[:500])

handbook/37signals-is-you.md at master · basecamp/handbook · GitHub

Skip to content

Toggle navigation

            Sign up

        Product

Actions
        Automate any workflow

Packages
        Host and manage packages

Security
        Find and fix vulnerabilities

Cod

## Notion

Follow steps [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion) for an example Notion site such as [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):

- Duplicate the page into your own Notion space and export as Markdown / CSV.
- Unzip it and save it as a folder that contains the markdown file for the Notion page.
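The unzip step can also be scripted. This sketch uses only the standard library. `extract_export` is a hypothetical helper, the archive name in the example call is a placeholder for whatever Notion names your export, and `docs/Notion_DB` matches the folder the loader reads from.

```python
import zipfile
from pathlib import Path


def extract_export(zip_path: str, target_dir: str = "docs/Notion_DB") -> list[str]:
    """Unzip an export archive into target_dir and list the Markdown files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return sorted(str(p.relative_to(target)) for p in target.rglob("*.md"))


# Example call (the archive name is a placeholder):
# extract_export("notion_export.zip")
```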

In [None]:
from langchain_community.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

In [None]:
print(docs[0].page_content[0:200])

In [None]:
docs[0].metadata