## Set Up the Environment

In [1]:
%run setup.ipynb

## Document Loaders

Document loaders are used to import data from various sources into LangChain as `Document` objects. A `Document` typically includes a piece of text along with its associated metadata.

### Examples of Document Loaders:

- **Text File Loader:** Loads data from a simple `.txt` file.
- **Web Page Loader:** Retrieves the text content from any web page.
- **YouTube Video Transcript Loader:** Loads transcripts from YouTube videos.

### Functionality:

- **Load Method:** Each document loader has a `load` method that reads from its pre-configured source and returns the data as a list of `Document` objects.
- **Lazy Load Option:** Some loaders also support a "lazy load" feature (`lazy_load`), which yields documents one at a time so that data is brought into memory gradually, as needed.

For more detailed information, visit [LangChain's document loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).
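
The difference between eager and lazy loading described above can be illustrated with a minimal sketch. This is not LangChain's implementation: `SimpleDocument` and `LineLoader` are hypothetical stand-ins written only with the standard library, and the sample file is created just for the example.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that turns each line of a text file into a document."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Yield one document at a time instead of building the full list.
        with open(self.file_path, encoding="utf-8") as handle:
            for line_number, line in enumerate(handle):
                yield SimpleDocument(
                    page_content=line.rstrip("\n"),
                    metadata={"source": self.file_path, "line": line_number},
                )

    def load(self) -> List[SimpleDocument]:
        # Eager variant: materialize every document into a list.
        return list(self.lazy_load())


# Write a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    sample_path = tmp.name

loader = LineLoader(sample_path)
all_docs = loader.load()      # every document, held in memory at once

lazy = loader.lazy_load()     # nothing is read until the iterator is advanced
first_doc = next(lazy)        # builds only the first document
lazy.close()                  # release the file handle held by the generator

os.remove(sample_path)
print(len(all_docs), first_doc.page_content)
```

Implementing `load` on top of `lazy_load` keeps the two code paths consistent; the eager version is simply the lazy one collected into a list.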


### CSV Loader

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

LangChain implements a CSV Loader that will load CSV files into a sequence of `Document` objects. Each row of the CSV file is converted to one document.
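
To make the row-to-document mapping concrete, the following sketch approximates the `page_content` format that appears in the outputs of this notebook (one `column: value` pair per line). It uses only the standard library; `rows_to_page_contents` is a hypothetical helper for illustration and is not taken from the `CSVLoader` source.

```python
import csv
import io

# A small in-memory CSV with a header row and two records.
csv_text = (
    "Property_ID,Address,City\n"
    "101,123 Elm St,Springfield\n"
    "102,456 Oak St,Rivertown\n"
)


def rows_to_page_contents(text: str) -> list:
    """Render each CSV record as newline-separated 'column: value' pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


contents = rows_to_page_contents(csv_text)
print(contents[0])
```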

In [2]:
import pandas as pd

# Create a DataFrame with some dummy real estate data
data = {
    'Property_ID': [101, 102, 103, 104, 105],
    'Address': ['123 Elm St', '456 Oak St', '789 Pine St', '321 Maple St', '654 Cedar St'],
    'City': ['Springfield', 'Rivertown', 'Laketown', 'Hillside', 'Sunnyvale'],
    'State': ['CA', 'TX', 'FL', 'NY', 'CO'],
    'Zip_Code': [98765, 87654, 76543, 65432, 54321],
    'Bedrooms': [3, 2, 4, 3, 5],
    'Bathrooms': [2, 1, 3, 2, 4],
    'Listing_Price': [500000, 350000, 600000, 475000, 750000]
}

df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('../../docs/data.csv', index=False)

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="../../docs/data.csv")
docs = loader.load()

In [4]:
docs

[Document(metadata={'source': '../../docs/data.csv', 'row': 0}, page_content='Property_ID: 101\nAddress: 123 Elm St\nCity: Springfield\nState: CA\nZip_Code: 98765\nBedrooms: 3\nBathrooms: 2\nListing_Price: 500000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 1}, page_content='Property_ID: 102\nAddress: 456 Oak St\nCity: Rivertown\nState: TX\nZip_Code: 87654\nBedrooms: 2\nBathrooms: 1\nListing_Price: 350000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 2}, page_content='Property_ID: 103\nAddress: 789 Pine St\nCity: Laketown\nState: FL\nZip_Code: 76543\nBedrooms: 4\nBathrooms: 3\nListing_Price: 600000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 3}, page_content='Property_ID: 104\nAddress: 321 Maple St\nCity: Hillside\nState: NY\nZip_Code: 65432\nBedrooms: 3\nBathrooms: 2\nListing_Price: 475000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 4}, page_content='Property_ID: 105\nAddress: 654 Cedar St\nCity: Sunnyvale\nState

In [5]:
print(docs[0].page_content)

Property_ID: 101
Address: 123 Elm St
City: Springfield
State: CA
Zip_Code: 98765
Bedrooms: 3
Bathrooms: 2
Listing_Price: 500000


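`CSVLoader` accepts a `csv_args` keyword argument for customizing the arguments passed to Python's `csv.DictReader`. See the [`csv` module](https://docs.python.org/3/library/csv.html) documentation for the supported arguments. Note that when `fieldnames` is supplied, `DictReader` treats the file's first line as data rather than as a header, so the original header row is loaded as the first document (row 0), as the output of the following cells shows.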

In [6]:
loader = CSVLoader(file_path="../../docs/data.csv",
                   csv_args={
                      "delimiter": ",",
                      "quotechar": '"',
                      "fieldnames": ["Property ID", "Address", "City", "State",
                                     "Zip Code", "Bedrooms", "Bathrooms", "Price"],
                   },
                  )
docs = loader.load()

In [7]:
docs

[Document(metadata={'source': '../../docs/data.csv', 'row': 0}, page_content='Property ID: Property_ID\nAddress: Address\nCity: City\nState: State\nZip Code: Zip_Code\nBedrooms: Bedrooms\nBathrooms: Bathrooms\nPrice: Listing_Price'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 1}, page_content='Property ID: 101\nAddress: 123 Elm St\nCity: Springfield\nState: CA\nZip Code: 98765\nBedrooms: 3\nBathrooms: 2\nPrice: 500000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 2}, page_content='Property ID: 102\nAddress: 456 Oak St\nCity: Rivertown\nState: TX\nZip Code: 87654\nBedrooms: 2\nBathrooms: 1\nPrice: 350000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 3}, page_content='Property ID: 103\nAddress: 789 Pine St\nCity: Laketown\nState: FL\nZip Code: 76543\nBedrooms: 4\nBathrooms: 3\nPrice: 600000'),
 Document(metadata={'source': '../../docs/data.csv', 'row': 4}, page_content='Property ID: 104\nAddress: 321 Maple St\nCity: Hillside\nState: NY\n

In [8]:
print(docs[0].page_content)

Property ID: Property_ID
Address: Address
City: City
State: State
Zip Code: Zip_Code
Bedrooms: Bedrooms
Bathrooms: Bathrooms
Price: Listing_Price


In [9]:
print(docs[1].page_content)

Property ID: 101
Address: 123 Elm St
City: Springfield
State: CA
Zip Code: 98765
Bedrooms: 3
Bathrooms: 2
Price: 500000
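
The first document above holds the original header row because `csv.DictReader` stops treating the first line as a header once `fieldnames` is given explicitly. The sketch below reproduces this with the standard library alone and shows one way to skip the header line before parsing. The in-memory data and variable names are assumptions for illustration; it does not show a `CSVLoader` option.

```python
import csv
import io

csv_text = "Property_ID,Listing_Price\n101,500000\n102,350000\n"
fieldnames = ["Property ID", "Price"]

# With explicit fieldnames, the header line is parsed as an ordinary record.
rows_with_header = list(
    csv.DictReader(io.StringIO(csv_text), fieldnames=fieldnames)
)

# Skipping the first line manually leaves only the real records.
buffer = io.StringIO(csv_text)
next(buffer)  # discard the original header line
rows_without_header = list(csv.DictReader(buffer, fieldnames=fieldnames))

print(rows_with_header[0])
print(rows_without_header[0])
```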


### Compare with unstructured.io

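`UnstructuredCSVLoader`, which relies on Unstructured.io, loads the entire CSV as a single table, so the whole file becomes one `Document` instead of one per row.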

In [1]:
from langchain_community.document_loaders import UnstructuredCSVLoader

loader = UnstructuredCSVLoader("../../docs/data.csv")
docs = loader.load()

In [2]:
len(docs)

1

In [None]:
type(docs[0])  # Still a LangChain Document: the loader is a wrapper that converts Unstructured's output

langchain_core.documents.base.Document

In [12]:
print(docs[0])

page_content='Property_ID Address City State Zip_Code Bedrooms Bathrooms Listing_Price 101 123 Elm St Springfield CA 98765 3 2 500000 102 456 Oak St Rivertown TX 87654 2 1 350000 103 789 Pine St Laketown FL 76543 4 3 600000 104 321 Maple St Hillside NY 65432 3 2 475000 105 654 Cedar St Sunnyvale CO 54321 5 4 750000' metadata={'source': '../../docs/data.csv'}


In [13]:
print(docs[0].page_content)


Property_ID Address City State Zip_Code Bedrooms Bathrooms Listing_Price 101 123 Elm St Springfield CA 98765 3 2 500000 102 456 Oak St Rivertown TX 87654 2 1 350000 103 789 Pine St Laketown FL 76543 4 3 600000 104 321 Maple St Hillside NY 65432 3 2 475000 105 654 Cedar St Sunnyvale CO 54321 5 4 750000


In [14]:
print(docs[0].metadata)

{'source': '../../docs/data.csv'}
