# **Document Loading**

## **Retrieval augmented generation**

In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful when we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
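
To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It assumes a toy word-overlap score in place of a real embedding search, and the `tokenize`, `retrieve`, and `build_prompt` helpers and the two-sentence corpus are hypothetical names for illustration, not part of LangChain.

```python
# Toy illustration of the retrieval step in RAG: score documents by word
# overlap with the question, then paste the best matches into the prompt.
# Assumption: real systems would use embeddings and a vector store instead.

def tokenize(text):
    """Lowercase, split on whitespace, and drop surrounding punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question, corpus, k=1):
    """Return the k documents that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(corpus, key=lambda doc: len(q_words & tokenize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Combine the retrieved context and the question into one prompt string."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

corpus = [
    "CS229 is the machine learning class taught at Stanford.",
    "Project groups may have up to three students.",
]
question = "How many students can be in a project group?"
top_docs = retrieve(question, corpus, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The prompt string is what would be sent to the LLM; the document loaders in the rest of this notebook are what fill a corpus like this with real text.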

In [None]:
! pip install langchain langchain_groq langchain_community openai



In [None]:
import os

os.environ["GROQ_API_KEY"] = "YOUR-API-KEY"
os.environ["OPENAI_API_KEY"] = "YOUR-API-KEY"

## **PDFs**

Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents are the result of automated transcription, so words and sentences are sometimes split unexpectedly.

In [None]:
! pip install pypdf



In [None]:
# Loaders live in the langchain_community package installed above
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('MachineLearning-Lecture01.pdf')
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.

In [None]:
len(pages)

22

`loader.load()` returns one `Document` per page, and this PDF is 22 pages long.

In [None]:
from IPython.display import display, Markdown

page = pages[0]
display(Markdown(page.page_content))

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning is th e most exciting field of all the computer 
sciences. So I'm actually always excited about  teaching this class. Sometimes I actually 
think that machine learning is not only the most exciting thin g in computer science, but 
the most exciting thing in all of human e ndeavor, so maybe a little bias there.  
I also want to introduce the TAs, who are all graduate students doing research in or 
related to the machine learni ng and all aspects of machin e learning. Paul Baumstarck 
works in machine learning and computer vision.  Catie Chang is actually a neuroscientist 
who applies machine learning algorithms to try to understand the human brain. Tom Do 
is another PhD student, works in computa tional biology and in sort of the basic 
fundamentals of human learning. Zico Kolter is  the head TA — he's head TA two years 
in a row now — works in machine learning a nd applies them to a bunch of robots. And 
Daniel Ramage is — I guess he's not here  — Daniel applies l earning algorithms to 
problems in natural language processing.  
So you'll get to know the TAs and me much be tter throughout this quarter, but just from 
the sorts of things the TA's do, I hope you can  already tell that machine learning is a 
highly interdisciplinary topic in which just the TAs find l earning algorithms to problems 
in computer vision and biology and robots a nd language. And machine learning is one of 
those things that has and is having a large impact on many applications.  
So just in my own daily work, I actually frequently end up talking to people like 
helicopter pilots to biologists to people in  computer systems or databases to economists 
and sort of also an unending stream of  people from industry coming to Stanford 
interested in applying machine learni ng methods to their own problems.  
So yeah, this is fun. A couple of weeks ago, a student actually forwar ded to me an article 
in "Computer World" about the 12 IT skills th at employers can't say no to. So it's about 
sort of the 12 most desirabl e skills in all of IT and all of information technology, and 
topping the list was actually machine lear ning. So I think this is a good time to be 
learning this stuff and learning algorithms and having a large impact on many segments 
of science and industry.  
I'm actually curious about something. Learni ng algorithms is one of the things that 
touches many areas of science and industrie s, and I'm just kind of curious. How many 
people here are computer science majors, are in the computer science department? Okay. 
About half of you. How many people are from  EE? Oh, okay, maybe about a fifth. How 

In [None]:
page.metadata

{'source': 'MachineLearning-Lecture01.pdf', 'page': 0}
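
The metadata is handy for pointing back to where a passage came from. As a small sketch, the hypothetical helper below (`format_citation` is not a LangChain function) turns a metadata dict like the one above into a readable citation. It assumes the `page` value is zero-based, as the output for the first page suggests.

```python
# Hypothetical helper: turn loader metadata into a human-readable citation.
# Assumption: 'page' is zero-based (the first page above reports 0), so we
# add 1 to get the page number a reader would expect.

def format_citation(metadata):
    """Build a 'source, p. N' string from a metadata dict."""
    source = metadata.get("source", "unknown source")
    page = metadata.get("page")
    if page is None:
        # Some loaders may not record a page number at all.
        return source
    return f"{source}, p. {page + 1}"

metadata = {"source": "MachineLearning-Lecture01.pdf", "page": 0}
print(format_citation(metadata))
```

With the loaded PDF, the same call would be `format_citation(page.metadata)`.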

## **YouTube**

Next, we load the audio of a **YouTube** video and transcribe it into text.


In [None]:
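# GenericLoader pairs a blob loader (downloads the audio) with a parser (transcribes it)
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader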

In [None]:
! pip install yt_dlp
! pip install pydub



**Note**: This can take several minutes to complete.

In [None]:
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "."

loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)

In [None]:
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading player b9ad8b0a
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] ./Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of   69.71MiB
[ExtractAudio] Not converting audio ./Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a
Transcribing part 1!
Transcribing part 2!
Transcribing part 3!
Transcribing part 4!


In [None]:
# One Document per transcribed audio part (4 here):
len(docs)

4
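
Because the transcript arrives as several `Document` objects, it can help to stitch them back together before further processing. The sketch below uses a small stand-in class (`FakeDoc`, hypothetical) that only mimics the `page_content` and `metadata` attributes, so it runs without downloading or transcribing anything. The sample strings are placeholders, not the real transcript.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Stand-in for a loaded Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def merge_transcript(chunks, separator="\n\n"):
    """Join chunk texts in order, trimming stray whitespace at chunk edges."""
    return separator.join(chunk.page_content.strip() for chunk in chunks)

# Placeholder chunks standing in for the transcribed parts
chunks = [
    FakeDoc("Welcome to CS229 Machine Learning. "),
    FakeDoc(" So what I want to do today is a quick intro."),
]
full_text = merge_transcript(chunks)
print(len(chunks), "chunks ->", len(full_text), "characters")
```

With the real loader output, `merge_transcript(docs)` should work the same way, since it only reads `page_content`.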

In [None]:
display(Markdown(docs[0].page_content))

Welcome to CS229 Machine Learning. Uh, some of you know that this is a class that's taught at Stanford for a long time. And this is often the class that, um, I most look forward to teaching each year because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, got- built many of their products and services and startups that I'm sure, many of you or probably all of you are using, uh, uh, today. Um, so what I want to do today was spend some time talking over, uh, logistics and then, uh, spend some time, you know, giving you a beginning of an intro, talk a little bit about machine learning. So about 229, um, you know, all of you have been reading about AI in the news, uh, about machine learning in the news. Um, and you've probably heard me or others say AI is the new electricity. Uh, much as the rise of electricity about 100 years ago transformed every major industry. I think AI or really we call it machine learning, but the rest of the world seems to call it AI. You know, um, machine learning and, and AI and deep learning will change the world. And I hope that through 229, uh, will give you the tools you need so that you can be many of these future titans of industries, that you can be one to go out and build, you know, help the large tech companies do the amazing things they do, or build your own startup, or go into some other industry, go, go transform healthcare, or go transform transportation, or go build a self-driving car, um, and do all of these things that, um, after this class, I think you'll be able to do. You know, um, the majority of students applying, the, the demand for AI skills, the demand for machine learning skills is so vast. I think you all know that. Uh, and I think it's because machine learning has advanced so rapidly in the last few years that there are so many opportunities, um, to apply learning algorithms, right? Both in industry as well as in academia. 
I think today, we have, um, the English department professors trying to apply learning algorithms to understand history better. Uh, we have lawyers trying to apply machine learning to process legal documents. Uh, and off-campus, every company, both the tech companies as well as a lot of, a lot of companies that you wouldn't consider tech companies. Everything from manufacturing companies, to healthcare companies, to logistics companies are also trying to apply machine learning. So I think that, um, uh, uh, if you look at it on a, on a factual basis, the number of people doing very valuable machine learning projects today, is much greater than it was six months ago. And six months ago, it was much greater than it was 12 months ago. And the amount of value, the amount of exciting and meaningful work being done in machine learning is, is, is very strongly going up. Um, and I think that given the rise of, you know, the, the, the amount of data we have, as well as the new machine learning tools that we have, um, it will be a long time before we run out of opportunities, you know, before, before society as a whole has enough people with a machine learning skill set. Um, so just as maybe, I don't know, 20 years ago was a good time to start working on this Internet thing. And a lot of people that started working on the Internet, like 20 years ago, had fantastic careers. I think today is a wonderful time to jump to machine learning, uh, and, and, and the number of, and the opportunities for you to do unique things that no one is, no one else is doing, right? The opportunity for you to go to a logistics company and find an exciting way to apply machine learning, uh, will be very high because chances are that logistics company has no one else even working on this because, you know, they probably can't, they, they may not be able to hire a fantastic Stanford student as a graduate of CS229, right? Because there just aren't a lot of CS229 graduates around. 
Um, so what I want to do today is, um, do a quick intro, talk a little about logistics, um, and then, uh, we'll, we'll spend the second half of the day, you know, giving an overview and, and talk a little bit more about machine learning, okay? And, uh, oh, and I apologize. I, I think that, uh, this room, according to that sign there, seats, what, 300 something students? Uh, I think, uh, we have, uh, uh, like, uh, not quite 800 people enrolled in this class. Um, so if there are people outside, and all, all of the classes are, uh, recorded, broadcast on the SCPD. Uh, they usually, the videos usually made available same day. So for those of you that can't get into the room, my apologies. Um, there, there were some years, um, where even I had trouble getting into the room, but I'm glad you let me in. Uh, but, but I'm- but, but hopefully you can watch- you, you'll be able to watch all of these things online shortly, obviously. Oh, I see. Yes. Yeah. I don't know. Uh, it's a bit complicated. Yeah. Uh, yeah, thank you. I think it's okay. Yeah, I, I, we could, yeah. Yeah, maybe, maybe for the next few classes, people can squeeze in and use up the NTC. So for now, it might be too complicated. Uh, okay. So quick intros. Um, oh, I'm sorry. I should have introduced myself. My name is Andrew. Uh, uh, uh, and I want to introduce some of the rest of the teaching team as well. There's a class coordinator. Um, she has been playing this role for many years now and helps, uh, keep the trains run on time and make sure that everything in class happens when it's supposed to. Uh, uh, so, so, so she'll be here. And then, uh, we're thrilled to have- you guys wanna stand up? Uh, be the co-head TAs. Our respective- the PhD students working with me, uh, and so bring a lot of, um, uh, technical experience, uh, technical experience in machine learning, as well as practical know-how on how to actually make these things work. And with the large class that we have, we have a large TA team. 
Um, I- maybe I won't introduce all of the TAs here today, but you meet many of them throughout this quarter. But the TAs' expertise span everything from computer vision, to natural language processing, to computer biology, to robotics. And so, um, through this quarter, as you work on your class projects, I hope that you get a lot of, uh, help and advice and mentoring from the TAs, uh, all of which- all of whom have deep expertise not just in machine learning, but often in a specific vertical application area, um, of machine learning. So depending on what your projects, we try to match you to a TA that can give you advice, uh, that most relevant to whatever project you end up working on. Um, so, you know, go with this class. I hope that after the next 10 weeks, uh, you will be an expert in machine learning. Um, it turns out that, uh, uh, you know, um, and- and I hope that after this class, you'll be able to go out and, uh, build very meaningful machine learning applications, uh, either in an academic setting where, uh, hopefully you can apply it to your problems in mechanical engineering, electrical engineering, and, uh, English, and law, and, um, uh, and- and- and education, and all of this wonderful work that happens on campus, uh, as well as after you graduate from Stanford to be able to apply it to whatever jobs you find. Um, one of the things I find very exciting about machine learning is that it's no longer a sort of pure tech company only kind of thing, right? I think that many years ago, um, machine learning, it was like a thing that, you know, the computer science department would do, and that the elite AI companies like Google, and Facebook, and Baidu, and Microsoft would do. Uh, but now, it is so pervasive that even companies that are not traditionally considered tech companies see a huge need to apply these tools, and I find a lot of the most exciting work, uh, these days. 
Um, and- and maybe some of you guys know my history, so I'm a little bit biased, right? I- I led the Google Brain team which helped Google transform from what was already a great company 10 years ago to today, which is, you know, a great AI company, and then I also led the AI group at Baidu, and, you know, led the company's technology and strategy to help Baidu also transform from what was already a great company many years ago to today, arguably China's greatest AI company. So having led the, you know, built the teams that led the AI transformations of two large tech companies, I- I- I feel like that's a great thing to do, uh, but even beyond tech, I think that, um, there's a lot of exciting work to do as well to help other industries, to help other sectors, uh, embrace machine learning and use these tools effectively. Um, but after this class, I hope that each one of you will be well-qualified to get a job at, uh, a shiny tech company and do machine learning there, or go into one of these other industries and do very valuable machine learning projects there. Um, and in addition, if any of you, um, are taking this class with the primary goal of, uh, being able to do research, uh, in machine learning. So- so actually some- some of you I know are PhD students. Um, I hope that this class will also leave you well-equipped to, um, be able to read and understand research papers, uh, as well as, uh, you know, be qualified to start pushing forward, um, the state of the art. Um, so let's see. Um, so today, uh, so- so just as machine learning is evolving rapidly, um, the whole teaching team would have been, uh, constantly updating CS229 as well. So, um, it is actually very interesting. I feel like the pace of progress in machine learning has accelerated. So it- it actually feels like that, uh, the amount we change the class year over year has been increasing over time. 
So- so if you're friends that took the class last year, you know, things are a little bit different this year because we're- we're constantly updating the class to keep up with what feels like still accelerating progress in the whole field of machine learning. Um, so- so- so- so there's some logistical changes. For example, uh, uh, we've gone from, uh, what we used to hand out paper copies of handouts, uh, that we're- we're trying to make this class digital only. Um, but let me talk a little bit about, uh, prerequisites as well as in case your friends have taken this class before, some of the differences for this year. All right. Um, so prerequisites. Um, we are going to assume that, um, all of you have a knowledge of basic computer skills and principles. Uh, so, you know, Big O notation, Q-stacks, binary trees. Hopefully, you understand what all of those concepts are. And, uh, assume that all of you have a basic familiarity with, um, uh, probability, right, that hopefully, you know, what's a random variable, what's the expected value of a random variable, what's the variance of a random variable. Um, and if- for some of you, maybe especially the SCPD students taking this remotely, if it's been, you know, some number of years since you last had a probability and statistics class, uh, we will have, uh, review sessions, uh, on- on- on Fridays, uh, where we'll go over some of this prerequisite material as well. But- but so hopefully, you know what a random variable is, what the expected value is. But if you're a little bit fuzzy on those concepts, we'll- we'll go over them again, um, at a- at a discussion section, uh, on Friday. Um, also assume that you're familiar with basic linear algebra. So hopefully, that you know what's a matrix, what's a vector, how to multiply two matrices, or multiply a matrix and a vector. Um, if you know what is an eigenvector, then that's even better. 
Uh, if you're not quite sure what an eigenvector is, we'll go over it, but- but you- you- you, but, uh, uh, yeah, we'll- we'll go over it, I guess. Um, and then, um, a large part of this class, uh, uh, is, um, having you practice these ideas, uh, through the homeworks, uh, as well as I'll mention later, a, uh, open-ended project. And so, um, one, uh, there- we- we've actually, uh, until now, we used to use, uh, MATLAB, uh, and Octave for their programming assignments. Uh, but this year, we're trying to shift the programming assignments to, uh, Python. Um, and so, um, I think for a long time, uh, even today, you know, I sometimes use Octave to prototype because the syntax for Octave is so nice, and just run, you know, very simple experiments very quickly. But I think the machine learning world, um, is, you know, really migrating, I think, from, um, MATLAB Python world to increasing- excuse me, MATLAB Octave world to increasingly a Python maybe, and- and then eventually for production, Java C++ kind of world. And so, uh, we're rewriting a lot of the assignments for this class this quarter, um, have been- have been driving that process, uh, so that- so that this quarter, you could do more of the assignments, uh, uh, maybe most- maybe all of the assignments in, um, Python, uh, NumPy instead. Um, now, a note on the honor codes, um, we asked that, you know, we- we actually encourage you to form study groups. Uh, so- so, you know, I've been, um, fascinated by education for a long time. It's been a long time studying education and pedagogy, and how instructors like us can help support you to learn more efficiently. And one of the lessons I've learned from the educational research literature is that, for highly technical classes like this, if you form study groups, uh, you will probably have an easier time, right? So- so CS509, we go for the highly technical material. 
There's a lot of math, some of the programs are hard, and if you have a group of friends to study with, uh, you probably have an easier time, uh, uh, because you can ask each other questions and work together to help each other. Um, where we ask you to draw the line, or what we ask you to- to- to do relative to the standard, uh, honor codes is, um, we ask that you do the homework problems by yourself, right? Uh, and- and- and more specifically, um, it's okay to discuss the homework problems with friends, but if you, um, but after discussing homework problems with friends, we ask you to go back and write up the solutions by yourself, uh, without referring to notes that, you know, you and your friends had developed together. Okay? Um, the classes honor code is written clearly on the class, um, handouts posted digitally on the website. So if you ever have any questions about what is allowed collaboration and what isn't allowed, uh, please refer to that written document on the course website where we describe this more clearly. But, um, all the respect for the Stanford honor code as well as for, uh, uh, you know, for- for- for students kind of doing their own work. We ask you to basically do your own work, uh, for the, um, it's okay to discuss it, but after discussing homework problems with friends, ultimately we ask you to write up your problems by yourself so that the homework submissions reflect your own work, right? Um, and I care about this because it turns out that, uh, having CS229, you know, CS229 is one of those classes that employers recognize. Uh, uh, I don't know if you guys know, but there have been, um, companies that have put up job ads that say stuff like, so long as you got- so long as you complete the CS229, we guarantee you get an interview, right? I've- I've seen stuff like that. 
And so I think, you know, in order to- to maintain that sanctity of what it means to be a CS229 completer, I think, um, I ask that all of you sort of really do your own work, um, or stay within the bounds of accepted- of acceptable collaboration relative to the honor codes. Um, let's see. And I think that, um, uh, if, uh, you know what? This is, um, yeah. And I think that, uh, one of the best parts of CS229, it turns out is, um, excuse me. So, sorry, I'm gonna try looking for my mouse cursor. Uh, all right. Sorry about that. My, my, my displays are not mirrorizing. So this is a little bit awkward. Um, so one of the best parts of the class is, oh, shoot, sorry about that. All right, never mind. I won't do this. Um, you could do that- you could do yourself online later. Um, yeah. I started using- I started using Firefox recently in addition to Chrome. Anyway, it's just a mix-up. Um, one of the best parts of, um, the class is, um, the class project. Um, and so, you know, one of the goals of the class is to leave you well-qualified to do a meaningful machine learning project. And so, uh, one of the best ways to make sure you have that skill set is through this class, and hopefully with the help of some of the TAs. Uh, we want to support you to work on a small group to complete a meaningful machine learning project. Um, and so one thing I hope you start doing, you know, later today, uh, is to start brainstorming maybe with your friends, um, some of the, some of the class projects you might work on. Uh, and the most common class project that, you know, people do in CSUSD9 is to pick an area, pick an application that excites you and to apply machine learning to it, and see if you can build a good machine learning system for some application area. 
And so, um, if you go to the course website, you know, cs229.stanford.edu and look at previous year's projects, you- you- you see machine learning projects applied to pretty much, you know, pretty much every imaginable application under the sun. Everything from, I don't know, diagnosing cancer to creating art to, uh, lots of, um, uh, projects applied to other areas of engineering, uh, applying to application areas in EE or mechanical engineering or civil engineering or earthquake engineering and so on, uh, to applying it to understand literature, to applying it to, um, uh, I don't know. And- and- and- and- and so, uh, if you look at the previous year's projects, many of which are posted on the course website, you can use that as inspiration to see the types of projects students complete- completing this class are able to do. And I also encourage you to, um, uh, you can look at that for inspiration, you know, to- to get a sense of what you'll be able to do at the conclusion of this class, and also see if, uh, looking at previous year's projects gives you inspiration for what, um, you might do yourself. Uh, so we ask you to- we- we invite you, I guess, to do class projects in small groups. And so, um, after class today, also encourage you to start making friends in the class, both for the purpose of forming study groups as well as for the purpose of maybe finding a small group to do a class project with. Um, uh, we ask you to form project groups of, um, up to size three. Uh, uh, most project groups end up being size two or three. Um, if you insist on doing it by yourself, right, without any partners, that's actually okay too. You're welcome to do that. But, uh, but- but I think often, you know, having one or two others to work with may give you an easier time. 
And, uh, for projects of exceptional scope, if you have a very, very large project that just cannot be done by three people, um, uh, sometimes, you know, let us know and we're open to, uh, uh, with- with- to some project groups of size four. But our expectation, but we do hold projects, you know, with a group of four to a higher standard than projects of size one to three. Right. So- so what that means is that if your project team size is, uh, one, two, or three persons, the grading is one criteria. If your project group is, uh, bigger than three persons, we use a stricter criteria when it comes to grading class projects.

## **URLs**

In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")
docs = loader.load()

In [None]:
len(docs)

1

In [None]:
display(Markdown(docs[0].page_content))

File not found · GitHub

Skip to content
Navigation Menu
Toggle navigation
Sign in

Product
Actions
Automate any workflow
Packages
Host and manage packages
Security
Find and fix vulnerabilities
Codespaces
Instant dev environments
Copilot
Write better code with AI
Code review
Manage code changes
Issues
Plan and track work
Discussions
Collaborate outside of code

Explore
All features
Documentation
GitHub Skills
Blog

Solutions
For
Enterprise
Teams
Startups
Education
By Solution
CI/CD & Automation
DevOps
DevSecOps
Resources
Learning Pathways
White papers, Ebooks, Webinars
Customer Stories
Partners

Open Source
GitHub Sponsors
Fund open source developers
The ReadME Project
GitHub community articles
Repositories
Topics
Trending
Collections

Pricing

Search or jump to...
Search code, repositories, users, issues, pull requests...
Search
Clear
Search syntax tips

Provide feedback
We read every piece of feedback, and take your input very seriously.
Include my email address so I can be contacted
Cancel
Submit feedback

Saved searches
Use saved searches to filter your results more quickly
Name
Query
To see all available qualifiers, see our documentation.
Cancel
Create saved search

Sign in
Sign in to GitHub
Username or email address
Password
Forgot password?
or sign in with a passkey
Sign up

You signed in with another tab or window. Reload to refresh your session.
You signed out in another tab or window. Reload to refresh your session.
You switched accounts on another tab or window. Reload to refresh your session.
Dismiss alert

basecamp
/
handbook
Public
Notifications
You must be signed in to change notification settings
Fork
754
Star
6.3k

Code
Issues
1
Pull requests
0
Actions
Security
Insights
Additional navigation options
Code
Issues
Pull requests
Actions
Security
Insights

Footer
© 2024 GitHub, Inc.
Footer navigation
Terms
Privacy
Security
Status
Docs
Contact
Manage cookies
Do not share my personal information

You can’t perform that action at this time.

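Instead of the handbook page, the loader captured GitHub's "File not found" page, and almost all of the returned text is site navigation rather than content. This is why documents loaded from the web need post-processing!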
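
As one possible first post-processing step, the sketch below tidies whitespace, since raw web text often carries long runs of blank lines and indentation. `clean_page_text` is a hypothetical helper written for this example, and the `raw` string is a short made-up sample in the style of the output above. It does not remove navigation text, which would need page-specific handling.

```python
import re

def clean_page_text(text):
    """Collapse runs of spaces/tabs, strip each line, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Short sample imitating scraped output: blank runs and indented fragments
raw = "\n\n\n\nSkip to content\n\n\n\nNavigation   Menu\n\n\n          Sign in\n        \n\n"
cleaned = clean_page_text(raw)
print(cleaned)
```

Applied to a loaded page, this would be `clean_page_text(docs[0].page_content)`.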