### 📖 Where We Are

**So far**, our journey has taken us through various data formats:
1.  **Notebooks 1-3**: Handled unstructured and semi-structured text, PDFs, and Word documents.
2.  **Notebook 4**: Dove into structured (tabular) data with CSV and Excel files, focusing on converting rows into meaningful text.

**In this notebook**, we'll tackle **JSON (JavaScript Object Notation)**, one of the most common formats for semi-structured data in APIs and web services. We will learn how to parse its nested, hierarchical structure to extract meaningful information, and we will also touch on the JSON Lines (`.jsonl`) format, which can be processed one record at a time.

## 1. JSON Parsing and Processing

JSON data is semi-structured, meaning it doesn't have a rigid schema like a database table but does have a clear hierarchical structure of keys and values. It can contain nested objects and lists, which makes it flexible but also harder to convert into self-contained text chunks for RAG.

**Analogy**: Think of a nested JSON file as a detailed report about a company. The main report (the top-level object) has sections for 'Employees' and 'Departments'. The 'Employees' section is a list, where each item is a detailed file on a single employee, complete with their own sub-sections for 'Skills' and 'Projects'. Our job is to act as a researcher who can either grab an entire section (like all employee files) or intelligently pull specific information from multiple sections to create a comprehensive summary (like a single employee's profile with their department info).

First, let's create our sample JSON data.

In [1]:
# `json` is part of the Python standard library and handles reading and writing JSON data.
import json
import os

# Create a directory for our sample files.
os.makedirs("data/json_files", exist_ok=True)

In [2]:
# This is a complex, nested Python dictionary that we will use as our sample data.
# It includes top-level keys, a list of objects ('employees'), and a dictionary of objects ('departments').
json_data = {
    "company": "TechCorp",
    "employees": [
        {
            "id": 1,
            "name": "John Doe",
            "role": "Software Engineer",
            "skills": ["Python", "JavaScript", "React"],
            "projects": [
                {"name": "RAG System", "status": "In Progress"},
                {"name": "Data Pipeline", "status": "Completed"}
            ]
        },
        {
            "id": 2,
            "name": "Jane Smith",
            "role": "Data Scientist",
            "skills": ["Python", "Machine Learning", "SQL"],
            "projects": [
                {"name": "ML Model", "status": "In Progress"},
                {"name": "Analytics Dashboard", "status": "Planning"}
            ]
        }
    ],
    "departments": {
        "engineering": {
            "head": "Mike Johnson",
            "budget": 1000000,
            "team_size": 25
        },
        "data_science": {
            "head": "Sarah Williams",
            "budget": 750000,
            "team_size": 15
        }
    }
}

In [3]:
# Serialize the Python dictionary and save it to a .json file.
# `json.dump` writes the object to a file.
# `indent=2` makes the file human-readable with nice formatting.
with open('data/json_files/company_data.json', 'w') as f:
    json.dump(json_data, f, indent=2)
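
As a quick sanity check on what `indent=2` does, the sketch below serializes a small dictionary both ways and parses it back. The `record` dictionary is a trimmed, illustrative stand-in for `json_data`, not the full sample. It uses `json.dumps` and `json.loads` (the string-based counterparts of `json.dump` and `json.load`) so that it runs without touching the file system.

```python
import json

# Illustrative stand-in for the larger `json_data` dictionary.
record = {"company": "TechCorp", "departments": {"engineering": {"team_size": 25}}}

# Compact form: a single line of text.
compact = json.dumps(record)

# Indented form: one key per line, easier for humans to read.
pretty = json.dumps(record, indent=2)

# Indentation changes only the text layout; parsing either form
# gives back an equal Python object.
restored = json.loads(pretty)
print(restored == record)
print(pretty)
```

The indented form is larger on disk, so it is mostly useful for files that people will open and read.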

### Understanding JSON Lines (`.jsonl`)

Besides standard JSON, another common format is **JSON Lines**. In a `.jsonl` file, each line is a completely separate, valid JSON object. This format is highly efficient for streaming data, like application logs or event data, because you can process the file line by line without loading the entire, potentially massive, file into memory.

In [4]:
# Create sample data for a JSON Lines file. It's a list of dictionaries.
jsonl_data = [
    {"timestamp": "2024-01-01", "event": "user_login", "user_id": 123},
    {"timestamp": "2024-01-01", "event": "page_view", "user_id": 123, "page": "/home"},
    {"timestamp": "2024-01-01", "event": "purchase", "user_id": 123, "amount": 99.99}
]

# To write a .jsonl file, we iterate through our list.
with open('data/json_files/events.jsonl', 'w') as f:
    for item in jsonl_data:
        # `json.dumps` converts a single Python object to a JSON string.
        # We write each JSON string followed by a newline character.
        f.write(json.dumps(item) + '\n')
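
Reading a `.jsonl` file back follows the same line-by-line idea. The sketch below is a minimal example under stated assumptions: `iter_jsonl` is a hypothetical helper name, and the sample file is written to a temporary directory so the block runs on its own. In the notebook you could point it at `data/json_files/events.jsonl` instead.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, such as a stray empty line at the end
                yield json.loads(line)


# Write a small illustrative file so the example is self-contained.
sample_events = [
    {"event": "user_login", "user_id": 123},
    {"event": "purchase", "user_id": 123, "amount": 99.99},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_events.jsonl")
with open(sample_path, "w", encoding="utf-8") as f:
    for item in sample_events:
        f.write(json.dumps(item) + "\n")

# The generator parses one line at a time, so the file itself is never
# loaded into memory as a whole.
event_names = [record["event"] for record in iter_jsonl(sample_path)]
print(event_names)
```

Because `iter_jsonl` is a generator, a caller can stop early or filter records without parsing the rest of the file.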

## 2. JSON Processing Strategies

### Method 1: `JSONLoader` with `jq` Schema

The `JSONLoader` in LangChain is powerful because it accepts queries written in the syntax of `jq`, a tool for processing JSON data. It relies on the `jq` Python package, which must be installed separately (for example with `pip install jq`). You provide a `jq` query (`jq_schema`) to specify exactly which parts of the JSON you want to extract into `Document` objects. This is perfect for pulling out all items from a list within a larger JSON file.

In [5]:
from langchain_community.document_loaders import JSONLoader
import json

print("1️⃣ JSONLoader - Extracting specific fields with a jq schema")

# Initialize the loader to extract employee information.
employee_loader = JSONLoader(
    file_path='data/json_files/company_data.json',
    # This jq schema says: 'Access the root object (.), find the key `employees`, and then iterate over each item in that array (`[]`).'
    jq_schema='.employees[]',
    # text_content=False allows the extracted value to be a non-string (here, a dict); the loader then
    # serializes it to a JSON string for page_content. With the default (True), a non-string result raises an error.
    text_content=False
)

employee_docs = employee_loader.load()
print(f"Loaded {len(employee_docs)} employee documents")
print(f"First employee's content: {employee_docs[0].page_content[:200]}...")

1️⃣ JSONLoader - Extracting specific fields with a jq schema
Loaded 2 employee documents
First employee's content: {"id": 1, "name": "John Doe", "role": "Software Engineer", "skills": ["Python", "JavaScript", "React"], "projects": [{"name": "RAG System", "status": "In Progress"}, {"name": "Data Pipeline", "status"...
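
To build intuition for what `.employees[]` selects, here is a rough plain-Python equivalent using only the standard library. It is a sketch, not the loader's implementation: the `raw` string is a trimmed, illustrative stand-in for `company_data.json`, and the `json.dumps` step is an assumption about how dictionary results become text, chosen because it is consistent with the output shown above.

```python
import json

# Illustrative stand-in for company_data.json (two fields per employee).
raw = (
    '{"company": "TechCorp", "employees": ['
    '{"id": 1, "name": "John Doe"}, {"id": 2, "name": "Jane Smith"}]}'
)
data = json.loads(raw)

# `.employees` picks the value under the "employees" key;
# the trailing `[]` emits each element of that array separately.
selected = data["employees"]

# Assumption: each selected dict is turned into a JSON string,
# one string per would-be Document.
page_contents = [json.dumps(emp) for emp in selected]

print(len(page_contents))
print(page_contents[0])
```

Seen this way, the `jq_schema` decides the granularity of your documents: one element of the selected array becomes one unit of text.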


### Method 2: Custom JSON Processing (Intelligent Approach)

While `JSONLoader` is great for extraction, a custom function is often needed when you want to:
1.  **Combine** data from different nested levels into a single document.
2.  **Format** the output into a more readable, natural language string.
3.  Create **highly specific metadata** based on the data's content.

Here, we'll create a function that builds a detailed, human-readable profile for each employee, including their projects.

In [6]:
from typing import List
from langchain_core.documents import Document

print("\n2️⃣ Custom JSON Processing")

def process_json_intelligently(filepath: str) -> List[Document]:
    """Processes a complex JSON file, creating a formatted Document for each employee."""
    with open(filepath, 'r') as f:
        data = json.load(f)
    
    documents = []
    
    # Iterate through each employee object in the 'employees' list.
    for emp in data.get('employees', []):
        # Use an f-string to build a clean, readable profile for the page_content.
        content = f"""Employee Profile:
        Name: {emp['name']}
        Role: {emp['role']}
        Skills: {', '.join(emp['skills'])}
\n        Projects:"""
        # Nest a loop to process the 'projects' list for the current employee.
        for proj in emp.get('projects', []):
            content += f"\n- {proj['name']} (Status: {proj['status']})"
        
        # Create a Document with the formatted content and rich metadata.
        doc = Document(
            page_content=content,
            metadata={
                'source': filepath,
                'data_type': 'employee_profile',
                'employee_id': emp['id'],
                'employee_name': emp['name'],
                'role': emp['role']
            }
        )
        documents.append(doc)

    return documents


2️⃣ Custom JSON Processing


In [7]:
# Run our custom function and inspect the first resulting Document.
intelligent_json_docs = process_json_intelligently("data/json_files/company_data.json")
intelligent_json_docs[0]

Document(metadata={'source': 'data/json_files/company_data.json', 'data_type': 'employee_profile', 'employee_id': 1, 'employee_name': 'John Doe', 'role': 'Software Engineer'}, page_content='Employee Profile:\n        Name: John Doe\n        Role: Software Engineer\n        Skills: Python, JavaScript, React\n\n        Projects:\n- RAG System (Status: In Progress)\n- Data Pipeline (Status: Completed)')
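
The same idea extends to the `departments` dictionary, which the function above does not touch. The sketch below is one possible way to do it, under stated assumptions: `summarize_departments` is a hypothetical helper, the `sample` data is a trimmed copy of the notebook's sample, and plain dictionaries stand in for `Document` objects so the block runs with the standard library alone. It also copies the top-level `company` value into each record's metadata, which is a small example of combining data from different nesting levels.

```python
from typing import Any, Dict, List


def summarize_departments(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one text summary per entry of the 'departments' dictionary."""
    records = []
    # 'departments' is a dict keyed by department name, so iterate over items().
    for dept_key, dept in data.get("departments", {}).items():
        text = (
            f"Department: {dept_key}\n"
            f"Head: {dept['head']}\n"
            f"Budget: {dept['budget']:,}\n"
            f"Team size: {dept['team_size']}"
        )
        records.append(
            {
                "page_content": text,
                "metadata": {
                    "data_type": "department_summary",
                    "department": dept_key,
                    # Pulled from the top level of the JSON tree.
                    "company": data.get("company"),
                },
            }
        )
    return records


# Trimmed copy of the sample data, for illustration only.
sample = {
    "company": "TechCorp",
    "departments": {
        "engineering": {"head": "Mike Johnson", "budget": 1000000, "team_size": 25},
        "data_science": {"head": "Sarah Williams", "budget": 750000, "team_size": 15},
    },
}

dept_records = summarize_departments(sample)
print(dept_records[0]["page_content"])
```

In the notebook, each record could be wrapped as `Document(page_content=..., metadata=...)` in the same way the employee profiles are.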

### 📊 JSON Processing Strategy Comparison

| Strategy | How it Works | Best For |
| :--- | :--- | :--- |
| **`JSONLoader`** | Uses a `jq` query to extract specific objects or values. | Quickly pulling out all items from a specific list within a large JSON file (e.g., all user comments from a product page JSON). |
| **Custom Function** | Manually parses the JSON, allowing for complex logic, formatting, and data combination. | **Complex RAG scenarios**. Ideal for creating context-rich documents that combine information from multiple nested levels into a single, coherent text. |

### 🔑 Key Takeaways

* **JSON is Hierarchical**: The main challenge with JSON is navigating its nested structure. Your goal is to extract meaningful, self-contained "sub-documents" from the larger file.
* **`jq` is Your Shortcut**: LangChain's `JSONLoader` with a `jq_schema` is a powerful and concise tool for extracting specific lists or objects. Learning basic `jq` syntax is a high-leverage skill for data processing.
* **Custom Functions Offer Control**: For maximum flexibility, a custom Python function is the best approach. It allows you to combine data from different parts of the JSON tree, format the content for better readability by an LLM, and create precise metadata.
* **Use JSON Lines for Streams**: The `.jsonl` format is highly efficient for large datasets of discrete records (like logs or events) because it can be processed one line at a time without loading the whole file into memory.