### 📖 Where We Are

**So far**, we have built a solid foundation for handling unstructured and semi-structured documents:
1.  **Notebook 1**: Covered `.txt` files and text splitting fundamentals.
2.  **Notebook 2**: Tackled the complexities of PDFs with specialized loaders and cleaning pipelines.
3.  **Notebook 3**: Explored `.docx` files, highlighting the difference between simple text extraction and element-aware parsing with `Unstructured`.

**In this notebook**, we pivot to a new and important category: **structured data**. We'll focus on two of the most common formats, **CSV and Excel files**. The challenge here is different: instead of just extracting text, we need to thoughtfully convert rows and columns of data into a format that a language model can understand and reason about.

### 1. CSV and Excel Files: Structured Data

Working with structured data like CSV and Excel for RAG requires a shift in mindset. Our goal is to convert tabular data into a descriptive, text-based format. 

**Analogy**: Imagine you have a spreadsheet of product inventory. If you feed the raw values `[Laptop, Electronics, 999.99, 50]` to a language model, it has no column names to tell it what each value means. A better approach is to translate each row into a human-readable sentence or paragraph, like: *"We have a Laptop from the Electronics category, which costs $999.99 and has 50 units in stock."*

This process of converting structured rows into unstructured text is fundamental to making tabular data useful in a RAG system. We'll start by creating some sample files to work with.
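
As a minimal sketch of the analogy, the row-to-sentence translation can be written as a small formatting function. The name `row_to_sentence` and the dictionary layout are illustrative assumptions, not part of any library.

```python
def row_to_sentence(row: dict) -> str:
    """Turn one product row into a self-describing sentence."""
    return (
        f"We have a {row['Product']} from the {row['Category']} category, "
        f"which costs ${row['Price']:.2f} and has {row['Stock']} units in stock."
    )


# Hypothetical row matching the analogy above.
example_row = {"Product": "Laptop", "Category": "Electronics", "Price": 999.99, "Stock": 50}
sentence = row_to_sentence(example_row)
print(sentence)
```

The column names travel with the values, so the text stays meaningful even when it is retrieved without the rest of the table.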

In [1]:
# pandas is the essential library for working with structured data in Python.
import pandas as pd
# os is used for interacting with the file system, like creating directories.
import os

In [2]:
# Create a directory to store our sample files.
os.makedirs("data/structured_files", exist_ok=True)

In [3]:
# Create a sample dataset of products as a Python dictionary.
data = {
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Electronics'],
    'Price': [999.99, 29.99, 79.99, 299.99, 89.99],
    'Stock': [50, 200, 150, 75, 100],
    'Description': [
        'High-performance laptop with 16GB RAM and 512GB SSD',
        'Wireless optical mouse with ergonomic design',
        'Mechanical keyboard with RGB backlighting',
        '27-inch 4K monitor with HDR support',
        '1080p webcam with noise cancellation'
    ]
}

# Use pandas to convert the dictionary into a DataFrame, which is a tabular data structure.
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file. `index=False` prevents pandas from writing row indices into the file.
df.to_csv('data/structured_files/products.csv', index=False)

In [4]:
# To create an Excel file with multiple sheets, we use the `ExcelWriter`.
with pd.ExcelWriter('data/structured_files/inventory.xlsx') as writer:
    # Write the main products DataFrame to a sheet named 'Products'.
    df.to_excel(writer, sheet_name='Products', index=False)
    
    # Create a second, summary DataFrame.
    summary_data = {
        'Category': ['Electronics', 'Accessories'],
        'Total_Items': [3, 2],
        'Total_Value': [1389.97, 109.98]
    }
    # Write the summary data to a different sheet named 'Summary'.
    pd.DataFrame(summary_data).to_excel(writer, sheet_name='Summary', index=False)
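
The summary values above are typed in by hand. `Total_Value` there equals the sum of the unit prices per category, not price times stock. Under that reading, a sketch that derives the same numbers from the product data with `groupby` might look like this; the `products` frame is a trimmed copy of the sample data, included only to keep the block self-contained.

```python
import pandas as pd

# Trimmed copy of the sample product data (only the columns the summary needs).
products = pd.DataFrame({
    "Category": ["Electronics", "Accessories", "Accessories", "Electronics", "Electronics"],
    "Price": [999.99, 29.99, 79.99, 299.99, 89.99],
})

# Count the items and sum the unit prices per category.
summary = (
    products.groupby("Category", sort=False)
    .agg(Total_Items=("Price", "count"), Total_Value=("Price", "sum"))
    .round({"Total_Value": 2})
    .reset_index()
)
print(summary)
```

Deriving the summary keeps the two sheets consistent if the product list changes.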

### 2. CSV Processing

In [5]:
# CSVLoader: The standard LangChain loader for CSVs, treating each row as a document.
# UnstructuredCSVLoader: An alternative that parses the CSV with the `unstructured` library
# and treats the file as a table. It is imported here for reference and is not used below.
from langchain_community.document_loaders import CSVLoader
from langchain_community.document_loaders import UnstructuredCSVLoader

### Method 1: `CSVLoader` (Row-Based)

This is the most straightforward way to load a CSV. It iterates through the file and creates **one `Document` for each row**. The `page_content` of each document is a simple string where each column and its value are listed, separated by newlines.

In [6]:
print("1️⃣ CSVLoader - Row-based Documents")
csv_loader = CSVLoader(
    file_path='data/structured_files/products.csv',
    encoding='utf-8',
    # csv_args lets you pass arguments through to Python's csv.DictReader.
    csv_args={
        'delimiter': ',',
        'quotechar': '"',
    }
)

csv_docs = csv_loader.load()

print(f"Loaded {len(csv_docs)} documents (one per row)")
print("\nFirst document:")
# Note the format: 'column_name: value'.
print(f"Content: {csv_docs[0].page_content}")
# The metadata includes the source and the zero-based row number.
print(f"Metadata: {csv_docs[0].metadata}")

1️⃣ CSVLoader - Row-based Documents
Loaded 5 documents (one per row)

First document:
Content: Product: Laptop
Category: Electronics
Price: 999.99
Stock: 50
Description: High-performance laptop with 16GB RAM and 512GB SSD
Metadata: {'source': 'data/structured_files/products.csv', 'row': 0}
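
To make the row-based format concrete, here is a rough sketch that reproduces the same `column: value` layout with only the standard library. It is an approximation of the observed output, not LangChain's actual implementation, and the in-memory CSV is a hypothetical stand-in for `products.csv`.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for products.csv.
sample_csv = io.StringIO(
    "Product,Category,Price,Stock\n"
    "Laptop,Electronics,999.99,50\n"
    "Mouse,Accessories,29.99,200\n"
)

rows_as_text = []
reader = csv.DictReader(sample_csv, delimiter=",", quotechar='"')
for row_number, row in enumerate(reader):
    # Join every column as "name: value", one pair per line.
    content = "\n".join(f"{column}: {value}" for column, value in row.items())
    rows_as_text.append({"page_content": content, "metadata": {"row": row_number}})

print(rows_as_text[0]["page_content"])
```

Every value stays a string, because `csv.DictReader` does no type inference.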


### Method 2: Custom CSV Processing (Intelligent Approach)

`CSVLoader` is easy to use, but its generic `column: value` output is not always the best fit for RAG. For more control, we can write a custom processing function with `pandas` that formats each row into a more descriptive, labeled string and attaches selected column values as metadata.

In [7]:
from typing import List
from langchain_core.documents import Document

print("\n2️⃣ Custom CSV Processing")
def process_csv_intelligently(filepath: str) -> List[Document]:
    """Reads a CSV and creates a well-formatted Document for each row with rich metadata."""
    # Read the CSV into a pandas DataFrame.
    df = pd.read_csv(filepath)
    documents = []
    
    # Iterate over each row in the DataFrame.
    for idx, row in df.iterrows():
        # Use an f-string to create a descriptive, human-readable format for the content.
        content = f"""Product Information:
        Name: {row['Product']}
        Category: {row['Category']}
        Price: ${row['Price']}
        Stock: {row['Stock']} units
        Description: {row['Description']}"""
        
        # Create a LangChain Document with this formatted content.
        doc = Document(
            page_content=content,
            # Create rich, structured metadata. This is incredibly useful for filtering during retrieval.
            metadata={
                'source': filepath,
                'row_index': idx,
                'product_name': row['Product'],
                'category': row['Category'],
                'price': row['Price'],
                'data_type': 'product_info' # Add a custom tag for the data type.
            }
        )
        documents.append(doc)
    return documents


2️⃣ Custom CSV Processing


In [8]:
# Run our custom function and inspect the output. Note the labeled content and richer metadata.
# The indentation inside the triple-quoted f-string is preserved in page_content.
intelligent_csv_docs = process_csv_intelligently('data/structured_files/products.csv')
intelligent_csv_docs[0]

Document(metadata={'source': 'data/structured_files/products.csv', 'row_index': 0, 'product_name': 'Laptop', 'category': 'Electronics', 'price': 999.99, 'data_type': 'product_info'}, page_content='Product Information:\n        Name: Laptop\n        Category: Electronics\n        Price: $999.99\n        Stock: 50 units\n        Description: High-performance laptop with 16GB RAM and 512GB SSD')
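
The repr above contains runs of spaces after each `\n`, because the indentation inside the triple-quoted string becomes part of the text. One way to avoid that, sketched here with a hypothetical helper named `format_product`, is to build the content from a list of lines and join them.

```python
def format_product(row: dict) -> str:
    """Build the product text line by line so no source indentation leaks in."""
    lines = [
        "Product Information:",
        f"Name: {row['Product']}",
        f"Category: {row['Category']}",
        f"Price: ${row['Price']}",
        f"Stock: {row['Stock']} units",
        f"Description: {row['Description']}",
    ]
    return "\n".join(lines)


# Hypothetical row with the same fields as the sample CSV.
laptop = {
    "Product": "Laptop",
    "Category": "Electronics",
    "Price": 999.99,
    "Stock": 50,
    "Description": "High-performance laptop with 16GB RAM and 512GB SSD",
}
content = format_product(laptop)
print(content)
```

`textwrap.dedent` from the standard library is another option when a multi-line literal is preferred.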

### 📊 CSV Processing Strategy Comparison

| Strategy | Page Content | Metadata | Best For |
| :--- | :--- | :---: | :--- |
| **`CSVLoader`** | `key: value` pairs | Basic (source, row) | Quick loading when you just need the raw row data. |
| **Custom Function** | Formatted, natural language | **Rich & Custom** | **RAG systems**. Creates context-rich documents and allows for powerful metadata filtering (e.g., "find products where `category` is 'Electronics' and `price` is less than $100"). |
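
The filtering example in the table can be sketched in plain Python. The `Doc` dataclass below is a hypothetical stand-in for a LangChain `Document`, and `filter_docs` is an illustrative helper; real vector stores expose their own filter syntax, which varies by backend.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Metadata mirrors the keys produced by the custom CSV function.
docs = [
    Doc("Laptop", {"product_name": "Laptop", "category": "Electronics", "price": 999.99}),
    Doc("Mouse", {"product_name": "Mouse", "category": "Accessories", "price": 29.99}),
    Doc("Monitor", {"product_name": "Monitor", "category": "Electronics", "price": 299.99}),
    Doc("Webcam", {"product_name": "Webcam", "category": "Electronics", "price": 89.99}),
]


def filter_docs(documents: list, category: str, max_price: float) -> list:
    """Keep documents in the given category that cost less than max_price."""
    return [
        d for d in documents
        if d.metadata.get("category") == category
        and d.metadata.get("price", float("inf")) < max_price
    ]


cheap_electronics = filter_docs(docs, "Electronics", 100)
print([d.metadata["product_name"] for d in cheap_electronics])
```

This only works because `price` was stored as a number in the metadata; a price kept as text inside `page_content` could not be compared this way.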

### 3. Excel Processing

Excel files are similar to CSVs but with the added complexity of potentially having **multiple sheets**. A robust Excel parser needs to be able to handle this. We will again compare a custom `pandas` approach with the `Unstructured` library.

### Method 1: `pandas` (One Document per Sheet)

In [9]:
print("1️⃣ Pandas-based Excel Processing")
def process_excel_with_pandas(filepath: str) -> List[Document]:
    """Processes an Excel file, creating one Document per sheet."""
    documents = []
    
    # Open the workbook once with pd.ExcelFile so the sheet names and the sheets come from the same handle.
    excel_file = pd.ExcelFile(filepath)
    
    # Loop through each sheet in the Excel file.
    for sheet_name in excel_file.sheet_names:
        # Read the specific sheet into a DataFrame, reusing the already-open workbook.
        df = pd.read_excel(excel_file, sheet_name=sheet_name)
        
        # Convert the entire DataFrame to a string, which will serve as the page_content.
        # This preserves the full table structure in a text format.
        sheet_content = df.to_string(index=False)
        
        doc = Document(
            page_content=sheet_content,
            metadata={
                'source': filepath,
                'sheet_name': sheet_name,
                'num_rows': len(df),
                'num_columns': len(df.columns),
                'data_type': 'excel_sheet' # Custom tag
            }
        )
        documents.append(doc)
    
    return documents

1️⃣ Pandas-based Excel Processing


In [10]:
# Run our custom Excel processing function.
excel_docs = process_excel_with_pandas('data/structured_files/inventory.xlsx')
print(f"Processed {len(excel_docs)} sheets")
excel_docs[0]

Processed 2 sheets


Document(metadata={'source': 'data/structured_files/inventory.xlsx', 'sheet_name': 'Products', 'num_rows': 5, 'num_columns': 5, 'data_type': 'excel_sheet'}, page_content=' Product    Category  Price  Stock                                         Description\n  Laptop Electronics 999.99     50 High-performance laptop with 16GB RAM and 512GB SSD\n   Mouse Accessories  29.99    200        Wireless optical mouse with ergonomic design\nKeyboard Accessories  79.99    150           Mechanical keyboard with RGB backlighting\n Monitor Electronics 299.99     75                 27-inch 4K monitor with HDR support\n  Webcam Electronics  89.99    100                1080p webcam with noise cancellation')

### Method 2: `UnstructuredExcelLoader`

Just as with Word documents, the `Unstructured` library provides a powerful loader for Excel. In `elements` mode, it will parse the file and identify tables on each sheet, returning each table as a separate `Document`. This is very effective for extracting the core data from each sheet.

In [12]:
from langchain_community.document_loaders import UnstructuredExcelLoader

print("\n2️⃣ UnstructuredExcelLoader")
try:
    # Initialize the loader in 'elements' mode to identify tables.
    excel_loader = UnstructuredExcelLoader(
        'data/structured_files/inventory.xlsx',
        mode="elements"
    )
    unstructured_excel_docs = excel_loader.load()
    print(f"  Loaded {len(unstructured_excel_docs)} elements (tables).")
    # The metadata is very rich, even including an HTML representation of the table.
    print(f"  First element's metadata: {unstructured_excel_docs[0].metadata}")
except Exception as e:
    print(f"  Error: {e}")


2️⃣ UnstructuredExcelLoader
  Loaded 2 elements (tables).
  First element's metadata: {'source': 'data/structured_files/inventory.xlsx', 'file_directory': 'data/structured_files', 'filename': 'inventory.xlsx', 'last_modified': '2025-08-19T23:03:26', 'page_name': 'Products', 'page_number': 1, 'text_as_html': '<table><tr><td>Product</td><td>Category</td><td>Price</td><td>Stock</td><td>Description</td></tr><tr><td>Laptop</td><td>Electronics</td><td>999.99</td><td>50</td><td>High-performance laptop with 16GB RAM and 512GB SSD</td></tr><tr><td>Mouse</td><td>Accessories</td><td>29.99</td><td>200</td><td>Wireless optical mouse with ergonomic design</td></tr><tr><td>Keyboard</td><td>Accessories</td><td>79.99</td><td>150</td><td>Mechanical keyboard with RGB backlighting</td></tr><tr><td>Monitor</td><td>Electronics</td><td>299.99</td><td>75</td><td>27-inch 4K monitor with HDR support</td></tr><tr><td>Webcam</td><td>Electronics</td><td>89.99</td><td>100</td><td>1080p webcam with noise cancellati
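
The `text_as_html` field keeps the table structure, so it can be turned back into row records. The sketch below uses only the standard library's `html.parser`; the `TableRows` class is an illustrative helper, and `sample_html` is a shortened, hypothetical table in the same shape as the metadata above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # Text parts of the cell being read, or None outside a cell.

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


# Shortened, hypothetical table in the same shape as text_as_html.
sample_html = (
    "<table><tr><td>Product</td><td>Price</td></tr>"
    "<tr><td>Laptop</td><td>999.99</td></tr>"
    "<tr><td>Mouse</td><td>29.99</td></tr></table>"
)

parser = TableRows()
parser.feed(sample_html)

# Treat the first row as the header and pair it with each following row.
header, *body = parser.rows
records = [dict(zip(header, values)) for values in body]
print(records)
```

From there, each record could be formatted into a row-level document in the same way as the CSV rows.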

### 🔑 Key Takeaways

* **Translate Structure to Text**: The core challenge with structured data (CSV/Excel) is to convert tabular rows and columns into a descriptive, natural language format that an LLM can effectively use.
* **Customization Pays Off for RAG**: For both CSV and Excel, a custom processing pipeline with `pandas` is often a better fit for RAG than the basic loaders. It lets you control the `page_content` format precisely and, more importantly, build rich, filterable metadata from the columns.
* **Handle Multiple Sheets**: Excel files often contain multiple sheets. Your processing strategy must be sheet-aware, treating each sheet as a separate context or document.
* **Row vs. Table Granularity**: You can choose your strategy based on your needs. For answering questions about a single item (e.g., "What is the price of a Laptop?"), row-based documents are effective. For questions about the entire dataset (e.g., "Summarize the products in the electronics category"), a document representing the whole table or sheet is better.
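
The granularity choice in the last point can be sketched with a single helper. `table_to_chunks` and its `rows_per_chunk` parameter are illustrative assumptions: a value of 1 yields row-level chunks, and the full row count yields one table-level chunk. Plain dictionaries stand in for `Document` objects to keep the block self-contained.

```python
import pandas as pd


def table_to_chunks(df: pd.DataFrame, rows_per_chunk: int) -> list:
    """Split a DataFrame into text chunks of at most rows_per_chunk rows each."""
    chunks = []
    for start in range(0, len(df), rows_per_chunk):
        block = df.iloc[start:start + rows_per_chunk]
        chunks.append({
            # to_string repeats the column header in every chunk.
            "page_content": block.to_string(index=False),
            "metadata": {"row_start": start, "row_end": start + len(block) - 1},
        })
    return chunks


# Trimmed copy of the sample product data.
products = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard", "Monitor", "Webcam"],
    "Price": [999.99, 29.99, 79.99, 299.99, 89.99],
})

row_level = table_to_chunks(products, rows_per_chunk=1)
table_level = table_to_chunks(products, rows_per_chunk=len(products))
print(len(row_level), len(table_level))
```

Values in between trade off the two: smaller chunks keep retrieval precise, while larger ones keep more of the table visible for aggregate questions.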