# Document loaders

---

Alejandro Ricciardi (Omegapy)  
created date: 01/23/2024   
[GitHub](https://github.com/Omegapy)  

Credit: [LangChain](https://python.langchain.com/docs/expression_language/)

<br>

--- 

 
Project Description:  
**LangChain** is a framework for developing applications powered by language models.  
**In this project:** This project is a series of Jupyter Notebook tutorials on LangChain document loaders for LLMs.  
The tutorials are a series of LangChain Python code examples from the https://python.langchain.com/ website.

Specifically from the section [Document loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/).

⚠️ **Info**: Head to [Integrations](https://python.langchain.com/docs/integrations/document_loaders/) for documentation on built-in document loader integrations with 3rd-party tools.

Use document loaders to load data from a source as ```Document``` objects. A ```Document``` is a piece of text and associated metadata. For example, there are document loaders for loading a simple ```.txt``` file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.

Document loaders provide a "load" method for loading data as documents from a configured source. They optionally implement a "lazy load" as well for lazily loading data into memory.
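
The relationship between the two methods can be sketched in plain Python. This is a minimal stand-in, not LangChain's implementation: the names ```SketchDocument``` and ```SketchTextLoader``` are hypothetical, and only the standard library is used. It shows one common pattern, where the lazy method yields documents one at a time and the eager method collects them into a list.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SketchTextLoader:
    """Hypothetical loader that yields one document per non-empty line."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def lazy_load(self) -> Iterator[SketchDocument]:
        # Read and yield one line at a time, so the whole file never
        # has to sit in memory at once.
        with open(self.file_path, encoding="utf-8") as handle:
            for line_number, line in enumerate(handle):
                text = line.strip()
                if text:
                    yield SketchDocument(
                        text, {"source": self.file_path, "line": line_number}
                    )

    def load(self) -> list[SketchDocument]:
        # Eager variant: exhaust the lazy iterator into a list.
        return list(self.lazy_load())


# Usage on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("first line\n\nsecond line\n", encoding="utf-8")
    sketch_docs = SketchTextLoader(str(sample)).load()

print([doc.page_content for doc in sketch_docs])
```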

<p></p>
<b style="font-size:15;">
⚠️ This project requires an OpenAI key.
</b>


##### Project Map  
- [API Keys](#api-keys)  
- [Getting started (.txt)](#getting-started)
- [CSV](#csv)
    - [Base Example](#base-example-csv)
    - [Customizing the CSV parsing and loading](#customizing-the-csv-parsing-and-loading)
    - [Specify a column to identify the document source](#specify-a-column-to-identify-the-document-source)
- [File Directory](#file-directory)
    - [Base Example](#base-example-file-directory)
    - [Show a progress bar](#show-a-progress-bar)
    - [Use multithreading](#use-multithreading)
    - [Change loader class](#change-loader-class)
        - [Base Example](#base-example-change-loader-class)
        - [Load Python Source Code](#load-python-source-code)
    - [Auto-detect file encodings with TextLoader](#auto-detect-file-encodings-with-textloader)
        - [A. Default Behavior](#a-default-behavior)
        - [B. Silent fail](#b-silent-fail)
        - [C. Auto detect encodings](#c-auto-detect-encodings)
- [HTML](#html)


<br>

---


#### API Keys

In [6]:
import os
from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv())
OPENAI_API_KEY = os.environ.get("OPEN_AI_KEY")

---
## Getting started


<br>

---

In [21]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./data/Portfolio Solutions for a Business Enterprise-Wide Upgrade.txt")
loader.load()

[Document(page_content="\n\n\n\nPortfolio: Solutions for a Business Enterprise-Wide Upgrade\nAlejandro Ricciardi\nColorado State University Global\nCSC300: Operating Systems and Architecture\nJoe Rangitsch\nAugust 6, 2023\n\n\n\n      Portfolio: Solutions for a Business Enterprise-Wide Upgrade\n      In a rapidly advancing technological landscape, to remain competitive, businesses need to adapt by modernizing their systems. In this portfolio essay, I take on the role of a consultant for a local business, that was asked to propose an enterprise-wide upgrade solution that includes operating systems, mass storage, virtualization, and security for a company. The company currently has a mix of operating systems, including several legacy machines. Additionally, the company does not currently use virtual machines but is strongly considering them. Furthermore, the company's core business is software testing, but it is considering offering a storage solution. In this paper, I outline the advant

[Project Map](#project-map)

---

---
## CSV
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

<br>

---

### Base Example (CSV)

In [19]:
from langchain_community.document_loaders.csv_loader import CSVLoader


loader = CSVLoader(file_path='./data/mlb_teams_2012.csv')
data = loader.load()
data

[Document(page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98', metadata={'source': './data/mlb_teams_2012.csv', 'row': 0}),
 Document(page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97', metadata={'source': './data/mlb_teams_2012.csv', 'row': 1}),
 Document(page_content='Team: Yankees\n"Payroll (millions)": 197.96\n"Wins": 95', metadata={'source': './data/mlb_teams_2012.csv', 'row': 2}),
 Document(page_content='Team: Giants\n"Payroll (millions)": 117.62\n"Wins": 94', metadata={'source': './data/mlb_teams_2012.csv', 'row': 3}),
 Document(page_content='Team: Braves\n"Payroll (millions)": 83.31\n"Wins": 94', metadata={'source': './data/mlb_teams_2012.csv', 'row': 4}),
 Document(page_content='Team: Athletics\n"Payroll (millions)": 55.37\n"Wins": 94', metadata={'source': './data/mlb_teams_2012.csv', 'row': 5}),
 Document(page_content='Team: Rangers\n"Payroll (millions)": 120.51\n"Wins": 93', metadata={'source': './data/mlb_teams_2012.csv', 'row': 6}),
 Doc

### Customizing the CSV parsing and loading

See the [csv module](https://docs.python.org/3/library/csv.html) documentation for more information on which csv args are supported.
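
As a rough illustration of what these arguments do, the sketch below passes the same kind of ```csv_args``` straight to the standard library's ```csv.DictReader```. It assumes the loader forwards them to the csv module unchanged, and the in-memory sample data is made up for the example. Once ```fieldnames``` is supplied, the reader treats the file's own header line as an ordinary data row.

```python
import csv
import io

# Made-up sample in the same shape as a small CSV file with a header line.
sample_csv = 'Team,"Payroll (millions)","Wins"\nNationals,81.34,98\nReds,82.20,97\n'

csv_args = {
    "delimiter": ",",
    "quotechar": '"',
    "fieldnames": ["MLB Team", "Payroll in millions", "Wins"],
}

# Because fieldnames is given, DictReader does not consume the first line
# as a header; it comes back as row 0.
rows = list(csv.DictReader(io.StringIO(sample_csv), **csv_args))

for index, row in enumerate(rows):
    print(index, dict(row))
```

If the original header should not appear as a document, one option is to drop the first row after loading.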

In [24]:
loader = CSVLoader(file_path='./data/mlb_teams_2012.csv', csv_args={
    'delimiter': ',',
    'quotechar': '"',
    'fieldnames': ['MLB Team', 'Payroll in millions', 'Wins']
})

data = loader.load()
data


[Document(page_content='MLB Team: Team\nPayroll in millions: "Payroll (millions)"\nWins: "Wins"', metadata={'source': './data/mlb_teams_2012.csv', 'row': 0}),
 Document(page_content='MLB Team: Nationals\nPayroll in millions: 81.34\nWins: 98', metadata={'source': './data/mlb_teams_2012.csv', 'row': 1}),
 Document(page_content='MLB Team: Reds\nPayroll in millions: 82.20\nWins: 97', metadata={'source': './data/mlb_teams_2012.csv', 'row': 2}),
 Document(page_content='MLB Team: Yankees\nPayroll in millions: 197.96\nWins: 95', metadata={'source': './data/mlb_teams_2012.csv', 'row': 3}),
 Document(page_content='MLB Team: Giants\nPayroll in millions: 117.62\nWins: 94', metadata={'source': './data/mlb_teams_2012.csv', 'row': 4}),
 Document(page_content='MLB Team: Braves\nPayroll in millions: 83.31\nWins: 94', metadata={'source': './data/mlb_teams_2012.csv', 'row': 5}),
 Document(page_content='MLB Team: Athletics\nPayroll in millions: 55.37\nWins: 94', metadata={'source': './data/mlb_teams_2012.

### Specify a column to identify the document source
Use the ```source_column``` argument to specify a source for the document created from each row. Otherwise, ```file_path``` is used as the source for all documents created from the CSV file.

This is useful when using documents loaded from CSV files for chains that answer questions using sources.
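
The effect on the metadata can be sketched with the standard library alone. The helper ```rows_to_records``` and the sample data are hypothetical and only mimic the behavior described above; this is not the loader's own code.

```python
import csv
import io

# Made-up sample data for the illustration.
sample_csv = "Team,Wins\nNationals,98\nReds,97\n"


def rows_to_records(text, file_path, source_column=None):
    """Hypothetical helper: build (content, metadata) pairs from CSV text."""
    records = []
    for index, row in enumerate(csv.DictReader(io.StringIO(text))):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # Use the chosen column as the source when given, else the file path.
        source = row[source_column] if source_column else file_path
        records.append((content, {"source": source, "row": index}))
    return records


default_records = rows_to_records(sample_csv, "teams.csv")
by_team_records = rows_to_records(sample_csv, "teams.csv", source_column="Team")

print(default_records[0][1])
print(by_team_records[0][1])
```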

In [26]:
loader = CSVLoader(file_path='./data/mlb_teams_2012.csv', source_column="Team")

data = loader.load()
data

[Document(page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98', metadata={'source': 'Nationals', 'row': 0}),
 Document(page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97', metadata={'source': 'Reds', 'row': 1}),
 Document(page_content='Team: Yankees\n"Payroll (millions)": 197.96\n"Wins": 95', metadata={'source': 'Yankees', 'row': 2}),
 Document(page_content='Team: Giants\n"Payroll (millions)": 117.62\n"Wins": 94', metadata={'source': 'Giants', 'row': 3}),
 Document(page_content='Team: Braves\n"Payroll (millions)": 83.31\n"Wins": 94', metadata={'source': 'Braves', 'row': 4}),
 Document(page_content='Team: Athletics\n"Payroll (millions)": 55.37\n"Wins": 94', metadata={'source': 'Athletics', 'row': 5}),
 Document(page_content='Team: Rangers\n"Payroll (millions)": 120.51\n"Wins": 93', metadata={'source': 'Rangers', 'row': 6}),
 Document(page_content='Team: Orioles\n"Payroll (millions)": 81.43\n"Wins": 93', metadata={'source': 'Orioles', 'row': 7}),
 Docume

[Project Map](#project-map)

---

---
## File Directory
This section covers how to load all documents in a directory.

Under the hood, by default this uses the [UnstructuredLoader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file).

<br>

---

### Base Example (File Directory)

In [33]:
from langchain_community.document_loaders import DirectoryLoader

We can use the ```glob``` parameter to control which files to load. 
Note that here it doesn't load the ```.rst``` file or the ```.html``` files.
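
To see which files a pattern such as ```**/*.txt``` selects, the sketch below applies it with ```pathlib``` to a throwaway directory. It assumes the loader's ```glob``` parameter follows the same pattern rules as ```pathlib.Path.glob```; the file names are made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    # Create a few empty files of different types.
    for name in ("notes.txt", "index.html", "readme.rst", "nested/deep.txt"):
        (root / name).touch()

    # "**/*.txt" matches .txt files in the root and in any subdirectory.
    matched = sorted(
        path.relative_to(root).as_posix() for path in root.glob("**/*.txt")
    )

print(matched)
```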

In [42]:
loader = DirectoryLoader('../', glob="**/*.txt") 

docs = loader.load() # pip install unstructured; note: pip install "unstructured[md]" was not available on a Windows Pro PC

len(docs)

3

[Project Map](#project-map)

---

### Show a progress bar

**By default a progress bar will not be shown.** 
To show a progress bar, install the tqdm library (e.g. ```pip install tqdm```), and set the ```show_progress``` parameter to ```True```.

In [45]:
loader = DirectoryLoader('../', glob="**/*.txt", show_progress=True)
docs = loader.load()


  0%|          | 0/3 [00:00<?, ?it/s]
100%|██████████| 3/3 [00:00<00:00, 15.07it/s]


[Project Map](#project-map)

---

### Use multithreading

By default, loading happens in one thread. To use several threads, set the ```use_multithreading``` flag to ```True```.
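
The general idea can be sketched with ```concurrent.futures``` from the standard library. This is an illustration under the assumption that each file is loaded independently; the helper ```read_text_file``` and the sample files are hypothetical, and the loader's internals may differ.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_text_file(path: Path) -> tuple[str, str]:
    """Return (file name, contents) for one file."""
    return path.name, path.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for index in range(4):
        (root / f"doc_{index}.txt").write_text(f"contents {index}", encoding="utf-8")

    paths = sorted(root.glob("*.txt"))

    # One thread: files are read one after another.
    sequential = [read_text_file(path) for path in paths]

    # Several threads: reads can overlap; map() keeps the input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        threaded = list(pool.map(read_text_file, paths))

print(sequential == threaded)
```

With only a handful of small files, thread start-up overhead can offset most of the gain, so a modest difference in timings is not surprising.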

Without multithreading:

In [48]:
%%timeit
loader = DirectoryLoader('../', glob="**/*.txt")
docs = loader.load()

197 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


With multithreading:

In [49]:
%%timeit
loader = DirectoryLoader('../', glob="**/*.txt", use_multithreading=True)
docs = loader.load()

183 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


[Project Map](#project-map)

---

### Change loader class
By default this uses the ```UnstructuredLoader``` class. However, you can switch to a different loader by passing it as the ```loader_cls``` argument.

##### Base Example (Change loader class)

In [52]:
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('../', glob="**/*.txt", loader_cls=TextLoader)

docs = loader.load()

len(docs)

3

##### Load Python Source Code
If you need to load Python source code files, use the ```PythonLoader```.

In [53]:
from langchain_community.document_loaders import PythonLoader

loader = DirectoryLoader('../../../../../', glob="**/*.py", loader_cls=PythonLoader)

docs = loader.load()

len(docs)

15

[Project Map](#project-map)

---

### Auto-detect file encodings with TextLoader

In this example we will see some strategies that can be useful when loading a large list of arbitrary files from a directory using the ```TextLoader``` class.

First, to illustrate the problem, let's try to load multiple texts with arbitrary encodings.

In [56]:
path = './data'
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader)

##### A. Default Behavior

In [59]:
loader.load()

RuntimeError: Error loading data\example-non-utf8.txt

The file example-non-utf8.txt uses a different encoding, so the load() function fails with a helpful message indicating which file failed decoding.

With the default behavior of ```TextLoader```, a failure to load any one document fails the whole loading process, and no documents are loaded.

##### B. Silent fail
We can pass the parameter ```silent_errors``` to the ```DirectoryLoader``` to skip the files which could not be loaded and continue the load process.

The file example-non-utf8.txt will not be loaded into ```docs```.
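
The difference between the default behavior and silent errors can be sketched in plain Python. The helper ```load_texts``` is hypothetical and only imitates the two modes described above; it is not the library's code.

```python
import tempfile
from pathlib import Path


def load_texts(paths, silent_errors=False):
    """Hypothetical helper: read UTF-8 files, optionally skipping failures."""
    loaded = {}
    for path in paths:
        try:
            loaded[path.name] = path.read_text(encoding="utf-8")
        except UnicodeDecodeError as error:
            if not silent_errors:
                # Default: one bad file aborts the whole load.
                raise RuntimeError(f"Error loading {path}") from error
            # Silent mode: report the failure and move on.
            print(f"Error loading file {path}: {error}")
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "good.txt").write_text("plain text", encoding="utf-8")
    # The byte 0xE9 is a valid Latin-1 character but not valid UTF-8 here.
    (root / "non-utf8.txt").write_bytes(b"caf\xe9")

    paths = sorted(root.glob("*.txt"))
    loaded = load_texts(paths, silent_errors=True)

    try:
        load_texts(paths)
        strict_failed = False
    except RuntimeError:
        strict_failed = True

print(sorted(loaded), strict_failed)
```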

In [61]:
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, silent_errors=True)
docs = loader.load()

# the example-non-utf8.txt will not be loaded in docs

Error loading file data\example-non-utf8.txt: Error loading data\example-non-utf8.txt


In [62]:
doc_sources = [doc.metadata['source']  for doc in docs]
doc_sources

['data\\An Overview of Thread Implementation in Multi-Threaded Computer Systems.txt',
 'data\\Portfolio Solutions for a Business Enterprise-Wide Upgrade.txt',
 'data\\whatsapp_chat.txt',
 'data\\fake_discord_data\\output.txt']

[Project Map](#project-map)

---

##### C. Auto detect encodings
We can also ask ```TextLoader``` to auto-detect the file encoding before failing, by passing ```autodetect_encoding``` to the loader class through ```loader_kwargs```.

This will load example-non-utf8.txt without generating an error.
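
A much simpler fallback strategy can be sketched with the standard library. The library's actual detection is not reproduced here: this stand-in just tries a short, hypothetical list of candidate encodings in order, and the helper name ```read_with_fallback``` is made up for the example.

```python
import tempfile
from pathlib import Path

# Hypothetical candidate list. latin-1 accepts any byte sequence, so it
# acts as a last resort and may mis-decode text from other encodings.
CANDIDATE_ENCODINGS = ("utf-8", "latin-1")


def read_with_fallback(path: Path) -> tuple[str, str]:
    """Return (text, encoding) using the first candidate that decodes."""
    raw = path.read_bytes()
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the candidate list has no catch-all encoding.
    raise RuntimeError(f"Error loading {path}")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "non-utf8.txt"
    # The byte 0xE9 is a valid Latin-1 character but not valid UTF-8 here.
    sample.write_bytes(b"caf\xe9")
    text, used_encoding = read_with_fallback(sample)

print(text, used_encoding)
```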

In [63]:
text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
docs = loader.load()

In [64]:
doc_sources = [doc.metadata['source']  for doc in docs]
doc_sources

['data\\An Overview of Thread Implementation in Multi-Threaded Computer Systems.txt',
 'data\\example-non-utf8.txt',
 'data\\Portfolio Solutions for a Business Enterprise-Wide Upgrade.txt',
 'data\\whatsapp_chat.txt',
 'data\\fake_discord_data\\output.txt']

[Project Map](#project-map)

---

---
## HTML
HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser.

This section covers how to load HTML documents into a document format that we can use downstream.

<br>

---