# 📊 11: Mini Project - Data Analysis Agent

Build a complete data analysis pipeline that loads data files, performs statistical analysis, identifies patterns, creates visualizations, and generates insights reports.

## 📋 Learning Objectives

By the end of this project, you will be able to:

- [ ] Build an end-to-end data analysis pipeline with agents
- [ ] Load and parse CSV/JSON data files
- [ ] Perform statistical analysis (mean, median, std dev, correlations)
- [ ] Identify patterns, outliers, and anomalies
- [ ] Create visualizations with matplotlib
- [ ] Generate comprehensive insights reports
- [ ] Structure a data science workflow with checkpoints

## 🎯 Prerequisites

- Completed notebooks 01-10
- Understanding of ReACT agents and tools
- Basic statistics knowledge
- Familiarity with data analysis concepts

## ⏱️ Estimated Time: 30 minutes

## 🎯 Project Goal

Build a **Data Analysis Agent** that:

1. **Loads** data from CSV or JSON files
2. **Generates** summary statistics:
   - Count, mean, median, mode
   - Standard deviation, variance
   - Min, max, quartiles
3. **Identifies** patterns and insights:
   - Correlations between variables
   - Outliers and anomalies
   - Trends and distributions
4. **Creates** visualizations:
   - Histograms
   - Scatter plots
   - Box plots
5. **Writes** a comprehensive insights report

**Approach:** Use a ReACT agent with the SDK's registered filesystem and code execution tools.

## 📦 Setup

Let's set up our environment and create sample data.

In [None]:
from local_llm_sdk import LocalLLMClient, tool
from dotenv import load_dotenv
import tempfile
import os
import json
import csv

# Load environment variables
load_dotenv()

# Create client
client = LocalLLMClient(
    base_url=os.getenv("LLM_BASE_URL"),
    model=os.getenv("LLM_MODEL"),
    timeout=300
)

# Register built-in tools
client.register_tools_from(None)

# Create temporary directory
project_dir = tempfile.mkdtemp()

print("✅ Setup Complete!")
print(f"Project directory: {project_dir}")
print(f"\nRegistered tools: {', '.join(client.tools.list_tools())}")

### Create Sample Dataset

Let's create a realistic sales dataset to analyze.

In [None]:
import random
import datetime

# Generate sample sales data
random.seed(42)

categories = ['Electronics', 'Clothing', 'Food', 'Books', 'Toys']
regions = ['North', 'South', 'East', 'West']

sales_data = []
for i in range(200):
    date = datetime.date(2024, 1, 1) + datetime.timedelta(days=random.randint(0, 364))
    category = random.choice(categories)
    region = random.choice(regions)
    
    # Electronics has higher prices, Food has lower
    if category == 'Electronics':
        price = random.uniform(100, 1000)
        quantity = random.randint(1, 5)
    elif category == 'Food':
        price = random.uniform(5, 50)
        quantity = random.randint(1, 20)
    else:
        price = random.uniform(10, 200)
        quantity = random.randint(1, 10)
    
    revenue = price * quantity
    
    sales_data.append({
        'date': str(date),
        'category': category,
        'region': region,
        'price': round(price, 2),
        'quantity': quantity,
        'revenue': round(revenue, 2)
    })

# Save as CSV
csv_file = os.path.join(project_dir, "sales_data.csv")
with open(csv_file, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['date', 'category', 'region', 'price', 'quantity', 'revenue'])
    writer.writeheader()
    writer.writerows(sales_data)

# Also save as JSON
json_file = os.path.join(project_dir, "sales_data.json")
with open(json_file, 'w') as f:
    json.dump(sales_data, f, indent=2)

print("📊 Created sample sales dataset:")
print(f"   Records: {len(sales_data)}")
print(f"   Categories: {', '.join(categories)}")
print(f"   Regions: {', '.join(regions)}")
print(f"   CSV file: {csv_file}")
print(f"   JSON file: {json_file}")
print("\n📝 Sample record:")
print(json.dumps(sales_data[0], indent=2))

## 🏗️ Project Structure

We'll build this project in **5 checkpoints**:

1. ✅ **Checkpoint 1**: Load data file
2. ✅ **Checkpoint 2**: Generate summary statistics
3. ✅ **Checkpoint 3**: Identify patterns and outliers
4. ✅ **Checkpoint 4**: Create visualizations
5. ✅ **Checkpoint 5**: Write insights report

Let's build it!

## Checkpoint 1: Load Data File

First, verify the agent can load and understand the data structure.

In [None]:
print("🎯 Checkpoint 1: Load Data\n")
print("="*70)

result = client.react(
    f"Load the CSV file at {csv_file}. "
    f"Tell me: "
    f"(1) how many records are in the file, "
    f"(2) what columns are present, "
    f"(3) show the first 3 records as an example.",
    max_iterations=8
)

print(f"\nStatus: {result.status}")
print(f"Steps: {result.steps_taken}")
print(f"\n📊 Data Overview:\n")
print(result.final_response)

if result.status == "success":
    print("\n✅ Checkpoint 1 Complete: Data loaded successfully!")
else:
    print("\n❌ Checkpoint 1 Failed: Could not load data")
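
The agent's answer can be cross-checked without the LLM. The following is a minimal sketch using only the standard library; `describe_csv` is a hypothetical helper name, not part of the SDK, and the demo file stands in for `csv_file`.

```python
import csv
import os
import tempfile


def describe_csv(path, preview=3):
    """Return (row_count, column_names, first `preview` rows) for a CSV file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    return len(rows), columns, rows[:preview]


# Demo on a tiny stand-in file; in the notebook, pass csv_file instead.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["category", "revenue"])
    writer.writeheader()
    writer.writerows([
        {"category": "Books", "revenue": 12.5},
        {"category": "Toys", "revenue": 40.0},
    ])

count, columns, head = describe_csv(demo_path)
print(count, columns)
```

Note that `csv.DictReader` returns every value as a string, so numeric columns need an explicit `float()` or `int()` conversion before any arithmetic.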

## Checkpoint 2: Generate Summary Statistics

Calculate comprehensive statistics for the dataset.

In [None]:
print("🎯 Checkpoint 2: Summary Statistics\n")
print("="*70)

result = client.react(
    f"Analyze the sales data in {csv_file}. "
    f"Calculate summary statistics for the numerical columns (price, quantity, revenue): "
    f"(1) mean, median, mode, "
    f"(2) standard deviation, variance, "
    f"(3) min, max, "
    f"(4) 25th, 50th, 75th percentiles. "
    f"Also provide counts for categorical columns (category, region).",
    max_iterations=15
)

print(f"\nStatus: {result.status}")
print(f"Steps: {result.steps_taken}")
print(f"\n📈 Statistics:\n")
print(result.final_response)

if result.status == "success":
    print("\n✅ Checkpoint 2 Complete: Statistics generated!")
else:
    print("\n❌ Checkpoint 2 Failed: Statistics incomplete")
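
To sanity-check the numbers the agent reports, the same statistics can be computed directly. This is a sketch with the standard `statistics` module; `summarize` is a hypothetical helper, and it assumes sample (n-1) standard deviation and the "inclusive" quartile method. The agent may pick other conventions, so small differences in variance or percentiles are not necessarily errors.

```python
import statistics


def summarize(values):
    """Summary statistics for one numeric column (a list of numbers)."""
    ordered = sorted(values)
    # method="inclusive" interpolates between the min and max of the data
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),          # first mode if there are ties
        "stdev": statistics.stdev(ordered),        # sample standard deviation
        "variance": statistics.variance(ordered),  # sample variance
        "min": ordered[0],
        "max": ordered[-1],
        "q1": q1,
        "q3": q3,
    }


stats = summarize([2, 4, 4, 4, 5, 5, 7, 9])
print(stats)
```

In the notebook this could be applied per column, for example `summarize([r["revenue"] for r in sales_data])`.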

## Checkpoint 3: Identify Patterns and Outliers

Find interesting patterns, correlations, and anomalies in the data.

In [None]:
print("🎯 Checkpoint 3: Pattern Detection\n")
print("="*70)

result = client.react(
    f"Analyze the sales data in {csv_file} for patterns and insights. "
    f"Specifically: "
    f"(1) Calculate correlation between price and quantity, "
    f"(2) Identify outliers in revenue (values > 2 standard deviations from mean), "
    f"(3) Compare average revenue across different categories, "
    f"(4) Compare average revenue across different regions, "
    f"(5) Identify any interesting trends or patterns you observe.",
    max_iterations=18
)

print(f"\nStatus: {result.status}")
print(f"Steps: {result.steps_taken}")
print(f"\n🔍 Patterns & Insights:\n")
print(result.final_response)

if result.status == "success":
    print("\n✅ Checkpoint 3 Complete: Patterns identified!")
else:
    print("\n❌ Checkpoint 3 Failed: Analysis incomplete")
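
The three quantitative requests in this prompt (correlation, 2-sigma outliers, group averages) are simple enough to verify by hand. The sketch below uses hypothetical helper names and assumes an outlier is any value whose absolute distance from the mean exceeds `k` sample standard deviations.

```python
import math
import statistics
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def find_outliers(values, k=2.0):
    """Values more than k sample standard deviations away from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]


def mean_by_group(rows, key, value):
    """Average of `value` for each distinct `key` in a list of dict rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row[value]))
    return {name: statistics.fmean(vals) for name, vals in groups.items()}


r = pearson([1, 2, 3], [2, 4, 6])
flagged = find_outliers([10] * 9 + [100])
averages = mean_by_group(
    [{"c": "a", "r": "1"}, {"c": "a", "r": "3"}, {"c": "b", "r": "5"}], "c", "r"
)
print(r, flagged, averages)
```

Because Electronics prices are drawn from a much wider range than the other categories, a global 2-sigma rule on revenue will mostly flag Electronics sales; computing outliers per category is a reasonable alternative.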

## Checkpoint 4: Create Visualizations

Generate charts to visualize the data insights.

In [None]:
print("🎯 Checkpoint 4: Create Visualizations\n")
print("="*70)

viz_dir = os.path.join(project_dir, "visualizations")
os.makedirs(viz_dir, exist_ok=True)

result = client.react(
    f"Create visualizations for the sales data in {csv_file}. "
    f"Generate these plots using matplotlib and save them to {viz_dir}: "
    f"(1) Histogram of revenue distribution (save as revenue_hist.png), "
    f"(2) Bar chart of average revenue by category (save as category_bar.png), "
    f"(3) Bar chart of average revenue by region (save as region_bar.png), "
    f"(4) Scatter plot of price vs quantity (save as price_quantity_scatter.png). "
    f"Make sure all plots have titles, axis labels, and are clearly readable.",
    max_iterations=20
)

print(f"\nStatus: {result.status}")
print(f"Steps: {result.steps_taken}")
print(f"\n📊 Visualization Generation:\n")
print(result.final_response)

# Check if visualizations were created
expected_files = [
    'revenue_hist.png',
    'category_bar.png',
    'region_bar.png',
    'price_quantity_scatter.png'
]

created_files = [f for f in expected_files if os.path.exists(os.path.join(viz_dir, f))]

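if len(created_files) == len(expected_files):
    print(f"\n✅ Checkpoint 4 Complete: Created {len(created_files)}/{len(expected_files)} visualizations")
elif created_files:
    print(f"\n⚠️ Checkpoint 4 Partial: Created {len(created_files)}/{len(expected_files)} visualizations")
else:
    print("\n❌ Checkpoint 4 Failed: No visualizations created")

if created_files:
    print("\n📁 Visualization files:")
    for f in created_files:
        print(f"   - {os.path.join(viz_dir, f)}")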
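
An existence check does not catch empty or corrupt image files. As a slightly stronger check, the sketch below tests that each file starts with the 8-byte PNG signature. The helper names are hypothetical, and the check confirms only the file header, not that the image renders correctly.

```python
import os
import tempfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def is_png(path):
    """True if the file can be read and begins with the PNG signature."""
    try:
        with open(path, "rb") as f:
            return f.read(8) == PNG_SIGNATURE
    except OSError:
        return False


def check_charts(directory, filenames):
    """Map each expected filename to whether it looks like a PNG."""
    return {name: is_png(os.path.join(directory, name)) for name in filenames}


# Demo with stand-in files; in the notebook, use viz_dir and expected_files.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "good.png"), "wb") as f:
    f.write(PNG_SIGNATURE + b"rest of file")
with open(os.path.join(demo_dir, "empty.png"), "wb") as f:
    pass  # zero-byte file, as left behind by a failed save

report = check_charts(demo_dir, ["good.png", "empty.png", "missing.png"])
print(report)
```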

## Checkpoint 5: Write Insights Report

Generate a comprehensive analysis report with all findings.

In [None]:
print("🎯 Checkpoint 5: Generate Insights Report\n")
print("="*70)

report_file = os.path.join(project_dir, "analysis_report.md")

result = client.react(
    f"Create a comprehensive data analysis report for the sales data in {csv_file}. "
    f"The report should be in Markdown format and include: "
    f"(1) Executive Summary (2-3 sentences), "
    f"(2) Dataset Overview (records, columns, date range), "
    f"(3) Key Statistics (summary stats for revenue, price, quantity), "
    f"(4) Key Findings (top 3-5 insights with data to support them), "
    f"(5) Category Analysis (which categories perform best), "
    f"(6) Regional Analysis (which regions perform best), "
    f"(7) Outliers and Anomalies (if any significant ones found), "
    f"(8) Recommendations (3-5 actionable recommendations based on the data), "
    f"(9) Visualizations (mention the charts created and what they show). "
    f"Save the report to {report_file}.",
    max_iterations=25
)

print(f"\nStatus: {result.status}")
print(f"Steps: {result.steps_taken}")
print(f"\n📄 Report Generation:\n")
print(result.final_response)

# Verify and display report
if os.path.exists(report_file):
    print(f"\n✅ Checkpoint 5 Complete: Report saved to {report_file}")
    
    print("\n" + "="*70)
    print("\n📋 Generated Analysis Report:\n")
    with open(report_file, 'r') as f:
        report_content = f.read()
        print(report_content)
    
    print("\n" + "="*70)
    print(f"\n📊 Report Statistics:")
    print(f"   Length: {len(report_content)} characters")
    print(f"   Lines: {len(report_content.splitlines())}")
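    # Count heading lines, not '#' characters (a '###' heading is one heading)
    heading_count = sum(1 for line in report_content.splitlines() if line.startswith('#'))
    print(f"   Sections: {heading_count} headings")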
else:
    print("\n❌ Checkpoint 5 Failed: Report file not created")
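
Beyond checking that the file exists, it can help to confirm that the report contains the sections the prompt asked for. The following sketch uses hypothetical helpers; it assumes the agent writes ATX-style (`#`) headings and that section titles contain the requested names, which a model may paraphrase.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def list_headings(markdown_text):
    """Return (level, title) pairs, skipping lines inside fenced code blocks."""
    headings = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


def missing_sections(markdown_text, required):
    """Names from `required` that appear in no heading (case-insensitive)."""
    titles = " | ".join(title.lower() for _, title in list_headings(markdown_text))
    return [name for name in required if name.lower() not in titles]


sample = (
    "# Report\n\n## Executive Summary\ntext with a #hashtag\n"
    "```\n# a code comment\n```\n### Key Findings\n"
)
found = list_headings(sample)
missing = missing_sections(sample, ["Executive Summary", "Recommendations"])
print(found, missing)
```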

## 🎉 Project Complete!

Let's verify all checkpoints and review what we built.

In [None]:
print("\n" + "="*70)
print("\n🎯 Project Summary: Data Analysis Agent\n")
print("="*70 + "\n")

# Check all checkpoints
checkpoints = [
    ("Load data file", os.path.exists(csv_file)),  # Confirms the file exists, not that the agent read it
    ("Generate summary statistics", True),  # Not verified here; review the Checkpoint 2 output
    ("Identify patterns and outliers", True),  # Not verified here; review the Checkpoint 3 output
    ("Create visualizations", len(created_files) == len(expected_files)),
    ("Write insights report", os.path.exists(report_file)),
]

for i, (checkpoint, status) in enumerate(checkpoints, 1):
    status_icon = "✅" if status else "❌"
    print(f"{status_icon} Checkpoint {i}: {checkpoint}")

all_complete = all(status for _, status in checkpoints)

print("\n" + "="*70)

if all_complete:
    print("\n🎉 SUCCESS: All checkpoints complete!")
    print("\n📁 Generated Files:")
    print(f"   - Data file: {csv_file}")
    print(f"   - Analysis report: {report_file}")
    print(f"   - Visualizations: {len(created_files)} charts in {viz_dir}")
    
    print("\n💡 What you built:")
    print("   A complete AI-powered data analysis pipeline that can:")
    print("   - Load and parse CSV/JSON data")
    print("   - Calculate comprehensive statistics")
    print("   - Identify patterns and outliers")
    print("   - Generate visualizations")
    print("   - Write actionable insights reports")
    
    print("\n📊 Analysis Coverage:")
    print(f"   - {len(sales_data)} sales records analyzed")
    print(f"   - {len(categories)} product categories")
    print(f"   - {len(regions)} geographic regions")
    print(f"   - {len(created_files)} visualizations created")
else:
    print("\n⚠️ Some checkpoints incomplete. Review the outputs above.")
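
Each checkpoint cell overwrites `result`, so the summary cannot see how Checkpoints 1 to 3 actually ended. One way around this, sketched below, is to record each outcome as it happens. `record_checkpoint` and `checkpoint_results` are hypothetical names; the only SDK detail assumed is the `"success"` status string used in the cells above.

```python
checkpoint_results = {}


def record_checkpoint(name, status, artifact_ok=True):
    """Store whether a checkpoint passed: agent succeeded and artifacts exist."""
    passed = (status == "success") and bool(artifact_ok)
    checkpoint_results[name] = passed
    return passed


# In the notebook, call this at the end of each checkpoint cell, e.g.
#   record_checkpoint("Load data file", result.status)
#   record_checkpoint("Write insights report", result.status,
#                     os.path.exists(report_file))
record_checkpoint("Load data file", "success")
record_checkpoint("Create visualizations", "success", artifact_ok=False)
record_checkpoint("Write insights report", "max_iterations")  # any non-success status

all_passed = all(checkpoint_results.values())
print(checkpoint_results, all_passed)
```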

## 🧹 Cleanup

Clean up temporary files when done.

In [None]:
import shutil

# Uncomment to clean up:
# shutil.rmtree(project_dir)
# print(f"✅ Cleaned up project directory: {project_dir}")

print("💡 Tip: Cleanup is commented out so you can inspect the files; uncomment it when done")
print(f"   Project files in: {project_dir}")
print(f"   - Data: {csv_file}")
print(f"   - Report: {report_file}")
print(f"   - Charts: {viz_dir}")

## 🚀 Extension Ideas

Want to enhance your Data Analysis Agent? Try these:

### 1. Time Series Analysis
```python
# Analyze trends over time
# Seasonal patterns
# Moving averages
# Forecast future values
```
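
As a starting point for the time series idea, here is a minimal sketch that aggregates revenue per day and smooths it with a simple moving average. The function names are hypothetical, and it assumes rows shaped like the sample dataset, with ISO-formatted `date` strings that sort chronologically.

```python
from collections import defaultdict


def daily_totals(rows):
    """Sum revenue per date and return (date, total) pairs in date order."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["date"]] += float(row["revenue"])
    return sorted(totals.items())  # ISO dates sort chronologically as strings


def moving_average(values, window):
    """Simple moving average; yields len(values) - window + 1 points."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


totals = daily_totals([
    {"date": "2024-01-02", "revenue": "5"},
    {"date": "2024-01-01", "revenue": "1.5"},
    {"date": "2024-01-02", "revenue": "2"},
])
smoothed = moving_average([1, 2, 3, 4, 5], window=3)
print(totals, smoothed)
```

Days with no sales are absent from `daily_totals`, so for a true calendar-based window those dates would need to be filled with zeros first.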

### 2. Advanced Statistics
```python
# Hypothesis testing
# A/B test analysis
# Regression analysis
# Clustering (K-means)
```
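
For the regression idea, a one-variable least-squares fit needs no external libraries. The sketch below uses a hypothetical `fit_line` helper and toy data; for multiple predictors or diagnostics, a dedicated library would be the better choice.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Toy data lying exactly on y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```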

### 3. Multi-Dataset Analysis
```python
# Join multiple datasets
# Compare datasets
# Cross-dataset correlations
```

### 4. Interactive Dashboards
```python
# Generate HTML dashboard
# Interactive Plotly charts
# Real-time data updates
```

### 5. Anomaly Detection
```python
# Machine learning for anomaly detection
# Isolation Forest
# Z-score analysis
# Alert on significant anomalies
```

### 6. Natural Language Queries
```python
# "What was the highest revenue day?"
# "Compare Electronics vs Clothing sales"
# "Show me outliers in the North region"
```

### 7. Export Formats
```python
# Excel reports with formatting
# PDF reports with charts
# PowerPoint presentations
# JSON API responses
```

## 💡 Key Takeaways

**What You Learned:**

✅ **Data Pipeline Design**: Building end-to-end analysis workflows

✅ **Statistical Analysis**: Computing comprehensive statistics with Python

✅ **Pattern Recognition**: Identifying trends, correlations, and outliers

✅ **Data Visualization**: Creating meaningful charts with matplotlib

✅ **Insight Generation**: Translating data into actionable recommendations

✅ **Report Writing**: Producing professional analysis reports

✅ **Agent Orchestration**: Using agents to coordinate multi-step analysis

**Production Considerations:**

- Handle missing or malformed data gracefully
- Validate data types and ranges
- Scale to larger datasets (chunking, streaming)
- Cache intermediate results
- Add progress tracking for long operations
- Support multiple data formats (Excel, Parquet, SQL)
- Implement data quality checks
- Add user configuration for analysis parameters
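
The first few items (missing data, type and range validation, data quality checks) can be sketched as a per-row validator. `validate_row` is a hypothetical helper written against the sample dataset's six columns; the revenue tolerance is an assumption that allows for the two-decimal rounding applied when the data was generated.

```python
import datetime

REQUIRED_FIELDS = ("date", "category", "region", "price", "quantity", "revenue")


def validate_row(row):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS
                if row.get(name) in (None, "")]
    if problems:
        return problems

    try:
        datetime.date.fromisoformat(str(row["date"]))
    except ValueError:
        problems.append("date is not in YYYY-MM-DD format")

    try:
        price = float(row["price"])
        quantity = int(row["quantity"])
        revenue = float(row["revenue"])
    except (TypeError, ValueError):
        problems.append("non-numeric price, quantity, or revenue")
        return problems

    if price < 0 or quantity < 1:
        problems.append("price or quantity out of range")
    # Allow for rounding of price and revenue to two decimals
    if abs(price * quantity - revenue) > 0.01 * quantity + 0.01:
        problems.append("revenue does not match price * quantity")
    return problems


good = {"date": "2024-03-01", "category": "Books", "region": "North",
        "price": "12.50", "quantity": "2", "revenue": "25.00"}
ok = validate_row(good)
no_price = validate_row({**good, "price": ""})
bad_qty = validate_row({**good, "quantity": "abc"})
bad_rev = validate_row({**good, "revenue": "99.00"})
print(ok, no_price, bad_qty, bad_rev)
```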

## 🎓 What You've Accomplished

**Congratulations! You've completed the entire Local LLM SDK tutorial series!**

### Journey Recap:

**Foundations (Notebooks 1-3):**
- ✅ Setup and basic chat
- ✅ Conversation history management
- ✅ Understanding LLM interactions

**Tool Integration (Notebooks 4-6):**
- ✅ Built-in tools (calculator, text transformer)
- ✅ Custom tool creation
- ✅ Filesystem and code execution

**Advanced Patterns (Notebooks 7-9):**
- ✅ ReACT agent pattern
- ✅ MLflow observability
- ✅ Production-ready patterns

**Capstone Projects (Notebooks 10-11):**
- ✅ Code Review Assistant
- ✅ Data Analysis Agent

### You Can Now:

- Build production-ready LLM applications
- Create custom tools for any domain
- Orchestrate multi-step agent workflows
- Debug and optimize with tracing
- Handle errors and edge cases
- Generate insights from data
- Automate complex analysis tasks

### Next Steps:

1. **Build Your Own Project**: Apply what you learned to solve a real problem
2. **Explore the SDK Source**: Dive into `local_llm_sdk/` to understand internals
3. **Contribute**: Found a bug or have an idea? Contribute to the project!
4. **Share**: Build something cool? Share it with the community!

### Resources:

- 📚 [SDK Documentation](../README.md)
- 💻 [Source Code](../local_llm_sdk/)
- 📊 [API Research](.documentation/)
- 🔧 [Example Scripts](../notebooks/)

**Happy building with Local LLM SDK!** 🚀