# **Loaders**

In this notebook we try out most of the loader types that LangChain offers for loading different kinds of files: PDFs, Word documents, text files, Markdown files, web pages, and more.

You can read about all the available loaders [here](https://python.langchain.com/api_reference/community/document_loaders.html#module-langchain_community.document_loaders).

> ### **Libraries to Download**

In [23]:
!pip install langchain langchain-community -q
!pip install youtube-transcript-api -q
!pip install unstructured pypdf -q
!pip install langchain-yt-dlp -q

In [6]:
import langchain
print(langchain.__version__)

0.3.12


Loading `.txt` files

In [3]:
## For a simple text file we use TextLoader from LangChain

from langchain_community.document_loaders import TextLoader

file_path = './Data/exercise.txt'

text_loader = TextLoader(file_path=file_path)

documents = text_loader.load()

documents

[Document(metadata={'source': './Data/exercise.txt'}, page_content="Exercise 1: Variable Assignment\nWrite a Python program to:\n\nCreate a variable to store your age and print it.\nUpdate the variable by adding 5 years to it and print the updated age.\n\n\nExercise2: Type Casting\n\nWrite a program that:\n\nAccepts a string input from the user and converts it into an integer.\nMultiply the integer by 10 and print the result.\n\nExercise 3: String Manipulation\nWrite a program to:\n\nAsk the user to input their full name.\nConvert it to uppercase and print it.\nCount and print the number of characters (excluding spaces).\n\n\nExercise 4: String Formatting and Indexing\nWrite a program that:\n\nTakes a user's favorite color as input.\nExtracts and prints the first and last character of the color.\nPrints the color in a sentence using string formatting.\n\nExercise 5: Type Casting Challenge\nWrite a program that:\n\nAsks the user for their age (as a string) and height (as a float).\nConv

Loading `.csv` files

To learn more about CSVLoader, see the API reference [here](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.csv_loader.CSVLoader.html).

In [24]:
## To load a CSV file we can use CSVLoader from LangChain
from langchain_community.document_loaders.csv_loader import CSVLoader

file_path = './Data/insurance_data.csv'
csv_loader = CSVLoader(file_path)

documents = csv_loader.load()

documents

[Document(metadata={'source': './Data/insurance_data.csv', 'row': 0}, page_content='index: 0\nPatientID: 1\nage: 39.0\ngender: male\nbmi: 23.2\nbloodpressure: 91\ndiabetic: Yes\nchildren: 0\nsmoker: No\nregion: southeast\nclaim: 1121.87'),
 Document(metadata={'source': './Data/insurance_data.csv', 'row': 1}, page_content='index: 1\nPatientID: 2\nage: 24.0\ngender: male\nbmi: 30.1\nbloodpressure: 87\ndiabetic: No\nchildren: 0\nsmoker: No\nregion: southeast\nclaim: 1131.51'),
 Document(metadata={'source': './Data/insurance_data.csv', 'row': 2}, page_content='index: 2\nPatientID: 3\nage: \ngender: male\nbmi: 33.3\nbloodpressure: 82\ndiabetic: Yes\nchildren: 0\nsmoker: No\nregion: southeast\nclaim: 1135.94'),
 Document(metadata={'source': './Data/insurance_data.csv', 'row': 3}, page_content='index: 3\nPatientID: 4\nage: \ngender: male\nbmi: 33.7\nbloodpressure: 80\ndiabetic: No\nchildren: 0\nsmoker: No\nregion: northwest\nclaim: 1136.4'),
 Document(metadata={'source': './Data/insurance_dat

CSVLoader has some parameters that we can change to get results that fit our needs.

- `file_path` (str | Path) – The path to the CSV file.

- `source_column` (str | None) – The name of the column in the CSV file to use as the source. Optional. Defaults to None.

- `metadata_columns` (Sequence[str]) – A sequence of column names to use as metadata. Optional.

- `csv_args` (Dict | None) – A dictionary of arguments to pass to the csv.DictReader. Optional. Defaults to None.
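To make `csv_args` and `metadata_columns` more concrete, here is a minimal sketch that uses only the standard library. It is not CSVLoader's implementation: it only imitates the `column: value` layout seen in the `page_content` output above, and it assumes that metadata columns are left out of the text. The semicolon-delimited sample data and the variable names are made up for illustration.

```python
import csv
import io

# Hypothetical semicolon-delimited data standing in for a real file.
raw = "PatientID;age;region\n1;39.0;southeast\n2;24.0;northwest\n"

# A dictionary like this is what would be passed as csv_args;
# its entries are keyword arguments for csv.DictReader.
csv_args = {"delimiter": ";", "quotechar": '"'}
reader = csv.DictReader(io.StringIO(raw), **csv_args)

# Columns we want in the metadata instead of the text (assumed behaviour).
metadata_columns = ("region",)

rows = []
for i, row in enumerate(reader):
    # Imitate the "column: value" lines of page_content.
    content = "\n".join(
        f"{key}: {value}" for key, value in row.items() if key not in metadata_columns
    )
    metadata = {"row": i, **{key: row[key] for key in metadata_columns}}
    rows.append((content, metadata))

for content, metadata in rows:
    print(metadata, repr(content))
```

With the real loader, the same dictionary would go in as `CSVLoader(file_path, csv_args=csv_args, metadata_columns=metadata_columns)`.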

In [11]:
!pip install pandas -q

In [17]:
import pandas as pd
file_path = './Data/insurance_data.csv'
df = pd.read_csv(file_path)

In [18]:
df.head()

Unnamed: 0,index,PatientID,age,gender,bmi,bloodpressure,diabetic,children,smoker,region,claim
0,0,1,39.0,male,23.2,91,Yes,0,No,southeast,1121.87
1,1,2,24.0,male,30.1,87,No,0,No,southeast,1131.51
2,2,3,,male,33.3,82,Yes,0,No,southeast,1135.94
3,3,4,,male,33.7,80,No,0,No,northwest,1136.4
4,4,5,,male,34.1,100,No,0,No,northwest,1137.01


In [7]:
csv_loader = CSVLoader(file_path=file_path, source_column='region')

documents = csv_loader.load()

documents

[Document(metadata={'source': 'southeast', 'row': 0}, page_content='index: 0\nPatientID: 1\nage: 39.0\ngender: male\nbmi: 23.2\nbloodpressure: 91\ndiabetic: Yes\nchildren: 0\nsmoker: No\nregion: southeast\nclaim: 1121.87'),
 Document(metadata={'source': 'southeast', 'row': 1}, page_content='index: 1\nPatientID: 2\nage: 24.0\ngender: male\nbmi: 30.1\nbloodpressure: 87\ndiabetic: No\nchildren: 0\nsmoker: No\nregion: southeast\nclaim: 1131.51'),
 Document(metadata={'source': 'southeast', 'row': 2}, page_content='index: 2\nPatientID: 3\nage: \ngender: male\nbmi: 33.3\nbloodpressure: 82\ndiabetic: Yes\nchildren: 0\nsmoker: No\nregion: southeast\nclaim: 1135.94'),
 Document(metadata={'source': 'northwest', 'row': 3}, page_content='index: 3\nPatientID: 4\nage: \ngender: male\nbmi: 33.7\nbloodpressure: 80\ndiabetic: No\nchildren: 0\nsmoker: No\nregion: northwest\nclaim: 1136.4'),
 Document(metadata={'source': 'northwest', 'row': 4}, page_content='index: 4\nPatientID: 5\nage: \ngender: male\nbm

Loading `complete directory` using langchain DirectoryLoader

DirectoryLoader has some parameters that we can change based on our needs.

- `glob`: Controls which files to load. Note that with the pattern used here it doesn't load the .rst or .html files.
- `show_progress`: By default a progress bar is not shown. To show one, install the tqdm library (e.g. `pip install tqdm`) and set `show_progress` to True.
- `use_multithreading`: By default the loading happens in one thread. To use several threads, set `use_multithreading` to True.
- `loader_cls`: By default this uses the UnstructuredLoader class. To customize the loader, specify the loader class in the `loader_cls` kwarg.
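To see which files a pattern such as `**/*.txt` would select, the sketch below builds a throwaway directory and applies the pattern with `pathlib`. It assumes the `glob` parameter follows `pathlib`-style matching; the file names are hypothetical and no LangChain code is involved.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes").mkdir()

    # Hypothetical files of several types, one of them in a subfolder.
    for name in ("exercise.txt", "readme.rst", "page.html", "notes/todo.txt"):
        (root / name).write_text("sample", encoding="utf-8")

    # "**" matches zero or more directories, so top-level and nested
    # .txt files are both selected; .rst and .html files are skipped.
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(matched)
```

Changing the pattern to `*.txt` would keep only the top-level `exercise.txt`.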

In [19]:
!pip install tqdm -q

In [21]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('./Data', glob='**/*.txt', show_progress=True, use_multithreading=True)
documents = loader.load()

  0%|          | 0/1 [00:00<?, ?it/s]libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
100%|██████████| 1/1 [00:04<00:00,  4.90s/it]


In [22]:
documents

[Document(metadata={'source': 'Data\\exercise.txt'}, page_content="Exercise 1: Variable Assignment Write a Python program to:\n\nCreate a variable to store your age and print it. Update the variable by adding 5 years to it and print the updated age.\n\nExercise2: Type Casting\n\nWrite a program that:\n\nAccepts a string input from the user and converts it into an integer. Multiply the integer by 10 and print the result.\n\nExercise 3: String Manipulation\n\nWrite a program to:\n\nAsk the user to input their full name. Convert it to uppercase and print it. Count and print the number of characters (excluding spaces).\n\nExercise 4: String Formatting and Indexing Write a program that:\n\nTakes a user's favorite color as input. Extracts and prints the first and last character of the color. Prints the color in a sentence using string formatting.\n\nExercise 5: Type Casting Challenge Write a program that:\n\nAsks the user for their age (as a string) and height (as a float). Converts the age 

Loading `YouTube video transcripts` using langchain YoutubeLoaderDL

In [28]:
from langchain_yt_dlp.youtube_loader import YoutubeLoaderDL

# Basic transcript loading
loader = YoutubeLoaderDL.from_youtube_url(
    "https://www.youtube.com/shorts/dgWow-zyDCY", add_video_info=True
)

In [29]:
documents = loader.load()

In [30]:
documents

[Document(metadata={'source': 'dgWow-zyDCY', 'title': 'NVIDIA CEO : How Pain and Suffering Lead to Success #motivation #nvdia #5090 #graphiccard #success', 'description': 'Jensen Huang, the visionary CEO of NVIDIA, shares how pain and suffering are essential ingredients in the recipe for success. In this inspiring video, he delves into the challenges he faced while building NVIDIA into a global tech giant and how he turned setbacks into stepping stones.\n\nHuang’s story teaches us that resilience, perseverance, and learning from failures are critical to achieving greatness. Whether you’re facing personal struggles or professional challenges, this motivational message will inspire you to embrace adversity and use it as fuel for growth.\n\nDiscover the mindset of a leader who revolutionized the tech industry and continues to inspire millions worldwide. Don’t miss this powerful lesson on turning pain into power.\n#NVIDIA #JensenHuang #Motivation #SuccessStory #OvercomingAdversity #Inspira

In [31]:
documents[0].metadata

{'source': 'dgWow-zyDCY',
 'title': 'NVIDIA CEO : How Pain and Suffering Lead to Success #motivation #nvdia #5090 #graphiccard #success',
 'description': 'Jensen Huang, the visionary CEO of NVIDIA, shares how pain and suffering are essential ingredients in the recipe for success. In this inspiring video, he delves into the challenges he faced while building NVIDIA into a global tech giant and how he turned setbacks into stepping stones.\n\nHuang’s story teaches us that resilience, perseverance, and learning from failures are critical to achieving greatness. Whether you’re facing personal struggles or professional challenges, this motivational message will inspire you to embrace adversity and use it as fuel for growth.\n\nDiscover the mindset of a leader who revolutionized the tech industry and continues to inspire millions worldwide. Don’t miss this powerful lesson on turning pain into power.\n#NVIDIA #JensenHuang #Motivation #SuccessStory #OvercomingAdversity #Inspiration #Leadership 

Loading `HTML` from the web using Langchain `BSHTMLLoader` AND `UnstructuredHTMLLoader`

In [37]:
!pip install bs4 -q

In [43]:
from langchain_community.document_loaders import BSHTMLLoader, UnstructuredHTMLLoader

bsloader = BSHTMLLoader('./Data/langchain.html')
html = UnstructuredHTMLLoader('./Data/langchain.html')

In [42]:
bsloader.load()

[Document(metadata={'source': './Data/langchain.html', 'title': 'BSHTMLLoader | 🦜️🔗 LangChain'}, page_content='\n\nBSHTMLLoader | 🦜️🔗 LangChain\n\n\n\n\n\n\nSkip to main contentIntegrationsAPI ReferenceMoreContributingPeopleError referenceLangSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.1💬SearchKProvidersAnthropicAWSGoogleHugging FaceMicrosoftOpenAIMoreProvidersAcreomActiveloop Deep LakeAerospikeAI21 LabsAimAINetworkAirbyteAirtableAlchemyAleph AlphaAlibaba CloudAnalyticDBAnnoyAnthropicAnyscaleApache Software FoundationApache DorisApifyAppleArangoDBArceeArcGISArgillaArizeArthurArxivAscendAskNewsAssemblyAIAstra DBAtlasAwaDBAWSAZLyricsBAAIBagelBagelDBBaichuanBaiduBananaBasetenBeamBeautiful SoupBibTeXBiliBiliBittensorBlackboardbookend.aiBoxBrave SearchBreebs (Open Knowledge)BrowserbaseBrowserlessByteDanceCassandraCerebrasCerebriumAIChaindeskChromaClarifaiClearMLClickHouseClickUpCloudflareClovaCnosDBCogniSwitchCohereCollege ConfidentialCometConfident AIConfluenceConneryContextCo

In [44]:
html.load()

[Document(metadata={'source': './Data/langchain.html'}, page_content='BSHTMLLoader\n\nThis notebook provides a quick overview for getting started with BeautifulSoup4 document loader. For detailed documentation of all __ModuleName__Loader features and configurations head to the API reference.\n\nOverview\u200b\n\nIntegration details\u200b\n\nClass Package Local Serializable JS support BSHTMLLoader langchain_community ✅ ❌ ❌\n\nLoader features\u200b\n\nSource Document Lazy Loading Native Async Support BSHTMLLoader ✅ ❌\n\nSetup\u200b\n\nTo access BSHTMLLoader document loader you\'ll need to install the langchain-community integration package and the bs4 python package.\n\nCredentials\u200b\n\nNo credentials are needed to use the BSHTMLLoader class.\n\nIf you want to get automated best in-class tracing of your model calls you can also set your LangSmith API key by uncommenting below:\n\n# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")\n# os.environ["LANG

Loading `PDFs` using langchain PyPDFLoader

In [46]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('./Data/Central_Limit_Theorem.pdf')

loader.load()

[Document(metadata={'source': './Data/Central_Limit_Theorem.pdf', 'page': 0}, page_content='Bernoulli distribution is a probability distribution that models a binary outcome, where the \noutcome can be either success (represented by the value 1) or failure (represented by the \nvalue 0). The Bernoulli distribution is named after the Swiss mathematician Jacob Bernoulli, \nwho first introduced it in the late 1600s.\nThe Bernoulli distribution is characterized by a single parameter, which is the probability of \nsuccess, denoted by p. The probability mass function (PMF) of the Bernoulli distribution is:\nThe Bernoulli distribution is commonly used in machine learning for modelling \nbinary outcomes, such as whether a customer will make a purchase or not, \nwhether an email is spam or not, or whether a patient will have a certain disease \nor not.\nBernoulli Distribution\n27 March 2023 16:06\n   Session on Central Limit Theorem Page 1    '),
 Document(metadata={'source': './Data/Central_Li

Loading `Markdown file` using langchain UnstructuredMarkdownLoader

In [52]:
!pip install markdown -q

In [53]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader =  UnstructuredMarkdownLoader('./Data/mark.md')

In [54]:

loader.load()

[Document(metadata={'source': './Data/mark.md'}, page_content='Hello\n\nHello\n\nHello\n\nHello\n\nThis is my book\n\nThis is my book\n\nThis is my book.\n\nThis is my book\n\nHi\n\n__Hello__\n\nhello\n\nhello\n\nHello\n\nHi\n\nhi\n\nHello\n\nHello\n\nhi\\\n\nBooks\n\nThis is my book')]

Loading `Docx file` using LangChain `Docx2txtLoader` AND `UnstructuredWordDocumentLoader`

In [60]:
# python-docx provides the `docx` module used for .docx files;
# the PyPI package named `docx` is an unrelated legacy package
!pip install docx2txt PyMuPDF python-docx -q


In [56]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
from langchain_community.document_loaders.word_document import Docx2txtLoader

loader1 = UnstructuredWordDocumentLoader('./Data/RAG_Types_Table.docx')
loader2 = Docx2txtLoader('./Data/RAG_Types_Table.docx')

In [61]:
loader2.load()

[Document(metadata={'source': './Data/RAG_Types_Table.docx'}, page_content='RAG Types: Advantages, Disadvantages, Use Cases, and Additional Information\n\nRAG Type\n\nAdvantages\n\nDisadvantages\n\nWhen to Use\n\nAdditional Information\n\nHybrid RAG\n\n- High accuracy by combining multiple information sources\n- Handles diverse types of data (structured, unstructured) well\n- Robust in challenging scenarios\n\n- Complexity in implementation\n- Higher computational resources required\n- Increased latency\n\n- When accuracy is paramount, and there are multiple data types\n\nCombines retrieval-based techniques (like search engines or databases) and generation-based techniques (like GPT-based models) to provide comprehensive responses.\n\nGenerative RAG\n\n- Provides flexible and creative responses\n- Can generate human-like content\n- Capable of handling open-domain questions\n\n- Risk of generating hallucinated information\n- Requires more extensive training data\n\n- For open-ended or c

Loading `Python Code` using langchain PythonLoader

In [62]:
from langchain_community.document_loaders.python import PythonLoader

loader = PythonLoader('./Data/sms.py')
loader.load()

[Document(metadata={'source': './Data/sms.py'}, page_content='# Step 1: Dictionary to store student data\nstudents = {}\n\n\n# Step 2: Function to add a new student\ndef add_student():\n    name = input("Enter student\'s name: ").title()\n    age = int(input(f"Enter {name}\'s age: "))\n    marks = float(input(f"Enter {name}\'s marks: "))\n\n    # Add student to the dictionary\n    students[name] = {"age": age, "marks": marks}\n    print(f"Student {name} added successfully!")\n\n\n# Step 3: Function to update student marks\ndef update_marks():\n    name = input("Enter the student\'s name to update marks: ").title()\n\n    if name in students:\n        new_marks = float(input(f"Enter new marks for {name}: "))\n        students[name]["marks"] = new_marks\n        print(f"{name}\'s marks updated to {new_marks}!")\n    else:\n        print(f"Student {name} not found!")\n\n\n# Step 4: Function to display a student\'s details\ndef display_student():\n    name = input("Enter the student\'s nam

Loading `Webpages` using langchain WebBaseLoader

In [63]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader('https://python.langchain.com/docs/integrations/document_loaders/web_base/')

In [64]:
loader.load()

[Document(metadata={'source': 'https://python.langchain.com/docs/integrations/document_loaders/web_base/', 'title': 'WebBaseLoader | 🦜️🔗 LangChain', 'description': 'This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.', 'language': 'en'}, page_content='\n\n\n\n\nWebBaseLoader | 🦜️🔗 LangChain\n\n\n\n\n\n\nSkip to main contentIntegrationsAPI ReferenceMoreContributingPeopleError referenceLangSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.1💬SearchProvidersAnthropicAWSGoogleHugging FaceMicrosoftOpenAIMoreProvidersAcreomActiveloop Deep LakeAerospikeAI21 LabsAimAINetworkAirbyteAirtableAlchemyAleph AlphaAlibaba CloudAnalyticDBAnnoyAnthropicAnyscaleApache Software FoundationApache DorisApifyAppleArangoDBArceeArcGISArgillaArizeArt

In [65]:
!pip install lxml



If a website has multiple pages and we don't want to load them separately, we can use `RecursiveUrlLoader` from LangChain.

In [66]:
from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    'https://python.langchain.com/docs/integrations/document_loaders/web_base/',
    # max_depth=2,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)

In [67]:
loader.load()

[Document(metadata={'source': 'https://python.langchain.com/docs/integrations/document_loaders/web_base/', 'content_type': 'text/html; charset=utf-8', 'title': 'WebBaseLoader | 🦜️🔗 LangChain', 'description': 'This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.', 'language': 'en'}, page_content='<!doctype html>\n<html lang="en" dir="ltr" class="docs-wrapper plugin-docs plugin-id-default docs-version-current docs-doc-page docs-doc-id-integrations/document_loaders/web_base" data-has-hydrated="false">\n<head>\n<meta charset="UTF-8">\n<meta name="generator" content="Docusaurus v3.5.2">\n<title data-rh="true">WebBaseLoader | 🦜️🔗 LangChain</title><meta data-rh="true" name="viewport" content="width=device-width,initial-scale=1"><meta data-rh="tr
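
The `page_content` above is still raw HTML because no `extractor` was passed to the loader. Below is a minimal sketch of a function that could serve as one, written with the standard library's `html.parser` so it has no extra dependencies. The names `_TextExtractor` and `html_to_text` are hypothetical, and the sample HTML is made up; a real page may need more careful handling.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Turn an HTML string into plain text, one text fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
text = html_to_text(sample)
print(text)
```

Since the function takes the page's HTML as a string and returns a string, it could be tried as `RecursiveUrlLoader(url, extractor=html_to_text)`.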