# 🔥 Hot Durham Data Management - Next Steps Guide

Your automated data management system is now **fully operational**! This notebook will guide you through the recommended next steps to get the most out of your system.

## 🎯 Current Status
✅ **Complete data management system** with organized folder structure  
✅ **Automated data collection** scripts for WU and TSI sensors  
✅ **Google Drive integration** for cloud backup and sync  
✅ **Smart scheduling** with cron job support  
✅ **Comprehensive monitoring** and health checks  

## 📋 What We'll Cover
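1. **Import Required Libraries** - Load the modules and project paths used below
2. **Load Configuration** - Review the current automation settings
3. **Check Automation Status** - Confirm the scheduled cron jobs
4. **Run Manual Data Pull** - Verify that the collection scripts work
5. **Verify Google Drive Sync** - Check credentials and sync logs
6. **Analyze Logs for Errors** - Scan recent logs for errors and warnings
7. **Inspect Data Inventory** - Review the data collected so far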

Let's get started! 🚀

## 1. Import Required Libraries

First, let's import the necessary libraries for configuration and data handling:

In [None]:
import json
import sys
import subprocess
from pathlib import Path

# Add src directories to path for new structure
sys.path.append('src/core')
sys.path.append('src/analysis')
sys.path.append('src/data_collection')
sys.path.append('src/automation')

# Import our custom data manager
try:
    from data_manager import DataManager
    print("✅ DataManager imported successfully")
except ImportError as e:
    print(f"❌ Could not import DataManager: {e}")
    print("Please ensure you're running this from the Hot Durham project directory")

# Set up paths
project_root = Path.cwd()
config_path = project_root / "config" / "automation_config.json"
logs_path = project_root / "logs"
data_path = project_root / "data"

print(f"📁 Project root: {project_root}")
print(f"📁 Config path: {config_path}")
print(f"📁 Data path: {data_path}")

✅ DataManager imported successfully
📁 Project root: /Users/alainsoto/IdeaProjects/Hot Durham
📁 Config path: /Users/alainsoto/IdeaProjects/Hot Durham/config/automation_config.json
📁 Data path: /Users/alainsoto/IdeaProjects/Hot Durham/data
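
The cell above takes `Path.cwd()` as the project root, so it only works when the notebook is launched from the Hot Durham directory. As a minimal sketch of a more forgiving lookup, the helper below walks up from a starting directory until it finds `config/automation_config.json`. The name `find_project_root` is hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(
    start: Path, marker: str = "config/automation_config.json"
) -> Optional[Path]:
    """Return the first directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    return None  # marker not found at or above `start`
```

A possible use is `project_root = find_project_root(Path.cwd()) or Path.cwd()`, which falls back to the original behavior when the marker file is missing.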


## 2. Load Configuration

Let's check your current automation configuration:

In [24]:
# Load automation configuration
try:
    if config_path.exists():
        with open(config_path, 'r') as f:
            config = json.load(f)
        print("✅ Configuration loaded successfully")
        print("\n📋 Current Configuration:")
        print(json.dumps(config, indent=2))
        
        # Check key settings
        automation_enabled = config.get('automation_enabled', False)
        default_pull_type = config.get('default_pull_type', 'weekly')
        google_drive_sync = config.get('google_drive_sync', False)
        share_email = config.get('share_email', 'Not set')
        
        print("\n🔧 Key Settings:")
        print(f"   📅 Automation enabled: {'✅' if automation_enabled else '❌'} {automation_enabled}")
        print(f"   📊 Default pull type: {default_pull_type}")
        print(f"   ☁️ Google Drive sync: {'✅' if google_drive_sync else '❌'} {google_drive_sync}")
        print(f"   📧 Share email: {share_email}")
        
    else:
        print("❌ Configuration file not found. Run './setup_automation.sh' to create it.")
        config = {}
except Exception as e:
    print(f"❌ Error loading configuration: {e}")
    config = {}

✅ Configuration loaded successfully

📋 Current Configuration:
{
  "share_email": "hotdurham@gmail.com",
  "automation_enabled": true,
  "default_pull_type": "weekly",
  "google_drive_sync": true,
  "notifications": {
    "email_on_success": false,
    "email_on_failure": true,
    "email_address": "hotdurham@gmail.com"
  },
  "data_retention": {
    "keep_raw_data_months": 24,
    "keep_processed_data_months": 12,
    "auto_cleanup": false
  },
  "schedules": {
    "weekly_pull": {
      "enabled": true,
      "day_of_week": "monday",
      "time": "06:00",
      "sources": [
        "wu",
        "tsi"
      ]
    },
    "monthly_summary": {
      "enabled": true,
      "day_of_month": 1,
      "time": "07:00",
      "generate_reports": true
    }
  }
}

🔧 Key Settings:
   📅 Automation enabled: ✅ True
   📊 Default pull type: weekly
   ☁️ Google Drive sync: ✅ True
   📧 Share email: hotdurham@gmail.com
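
The `schedules` block uses a day name and an `HH:MM` time, while cron expects five numeric fields. The sketch below shows one way to translate between the two, so the configuration can be compared with the installed crontab. The helper names are hypothetical, and this is not necessarily how `setup_automation.sh` builds its entries.

```python
from typing import Dict, Tuple

# Cron numbers weekdays 0-6 starting on Sunday.
DAY_TO_CRON: Dict[str, int] = {
    "sunday": 0, "monday": 1, "tuesday": 2, "wednesday": 3,
    "thursday": 4, "friday": 5, "saturday": 6,
}


def parse_time(value: str) -> Tuple[int, int]:
    """Split an 'HH:MM' string into (hour, minute) and validate the ranges."""
    hour_text, minute_text = value.split(":")
    hour, minute = int(hour_text), int(minute_text)
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"time out of range: {value!r}")
    return hour, minute


def weekly_to_cron(schedule: dict) -> str:
    """Build 'minute hour * * weekday' from a weekly schedule entry."""
    hour, minute = parse_time(schedule["time"])
    day = DAY_TO_CRON[schedule["day_of_week"].lower()]
    return f"{minute} {hour} * * {day}"


def monthly_to_cron(schedule: dict) -> str:
    """Build 'minute hour day-of-month * *' from a monthly schedule entry."""
    hour, minute = parse_time(schedule["time"])
    day = int(schedule["day_of_month"])
    if not 1 <= day <= 31:
        raise ValueError(f"day_of_month out of range: {day}")
    return f"{minute} {hour} {day} * *"


# Values copied from the configuration printed above.
schedules = {
    "weekly_pull": {"enabled": True, "day_of_week": "monday", "time": "06:00"},
    "monthly_summary": {"enabled": True, "day_of_month": 1, "time": "07:00"},
}
print(weekly_to_cron(schedules["weekly_pull"]))       # 0 6 * * 1
print(monthly_to_cron(schedules["monthly_summary"]))  # 0 7 1 * *
```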


## 3. Check Automation Status

Let's verify if your automated scheduling is properly configured:

In [25]:
# Check cron jobs
print("🔍 Checking cron job status...")
try:
    result = subprocess.run(['crontab', '-l'], capture_output=True, text=True)
    if result.returncode == 0:
        cron_output = result.stdout
        hot_durham_jobs = [line for line in cron_output.split('\n') if 'Hot Durham' in line or 'automated_data_pull' in line]
        
        if hot_durham_jobs:
            print(f"✅ Found {len(hot_durham_jobs)} Hot Durham cron job(s):")
            for job in hot_durham_jobs:
                print(f"   📅 {job}")
        else:
            print("⚠️ No Hot Durham cron jobs found")
            print("\n💡 To set up automation, run:")
            print("   ./setup_automation.sh")
    else:
        print("❌ Could not access crontab. You may need to set up cron permissions.")
except Exception as e:
    print(f"❌ Error checking cron jobs: {e}")

# Check if setup script exists
setup_script = project_root / "setup_automation.sh"
if setup_script.exists():
    print(f"\n✅ Setup script found at: {setup_script}")
    print("💡 Run './setup_automation.sh' to configure automated scheduling")
else:
    print("\n❌ Setup script not found")

🔍 Checking cron job status...
✅ Found 3 Hot Durham cron job(s):
   📅 # Hot Durham Automated Data Pulls
   📅 0 6 * * 1 cd /Users/alainsoto/IdeaProjects/Hot Durham && /Users/alainsoto/IdeaProjects/Hot Durham/.venv/bin/python3 /Users/alainsoto/IdeaProjects/Hot Durham/scripts/automated_data_pull.py --weekly >> /Users/alainsoto/IdeaProjects/Hot Durham/logs/weekly_pull.log 2>&1
   📅 0 7 1 * * cd /Users/alainsoto/IdeaProjects/Hot Durham && /Users/alainsoto/IdeaProjects/Hot Durham/.venv/bin/python3 /Users/alainsoto/IdeaProjects/Hot Durham/scripts/automated_data_pull.py --monthly >> /Users/alainsoto/IdeaProjects/Hot Durham/logs/monthly_pull.log 2>&1

✅ Setup script found at: /Users/alainsoto/IdeaProjects/Hot Durham/setup_automation.sh
💡 Run './setup_automation.sh' to configure automated scheduling
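
The filter above matches any crontab line that contains `Hot Durham`, so the `# Hot Durham Automated Data Pulls` comment is counted alongside the two scheduled entries. The sketch below skips blank lines and comments before matching. The function name and the shortened `/path/to/project` paths are illustrative only.

```python
from typing import List


def active_cron_jobs(crontab_text: str, keyword: str = "automated_data_pull") -> List[str]:
    """Return non-comment crontab lines that mention `keyword`."""
    jobs = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are not scheduled jobs
        if keyword in stripped:
            jobs.append(stripped)
    return jobs


# Illustrative crontab text with shortened, hypothetical paths.
sample = """# Hot Durham Automated Data Pulls
0 6 * * 1 cd /path/to/project && python3 scripts/automated_data_pull.py --weekly
0 7 1 * * cd /path/to/project && python3 scripts/automated_data_pull.py --monthly
"""
print(len(active_cron_jobs(sample)))  # 2
```

Passing `result.stdout` from the `crontab -l` call to `active_cron_jobs` would report only the scheduled entries.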


## 4. Run Manual Data Pull

Let's test your system with a manual data pull to ensure everything is working:

In [26]:
# Check available scripts
scripts_dir = project_root / "scripts"
automated_script = scripts_dir / "automated_data_pull.py"
main_script = scripts_dir / "faster_wu_tsi_to_sheets_async.py"

print("🔍 Available data collection scripts:")
if automated_script.exists():
    print(f"✅ Automated script: {automated_script}")
else:
    print(f"❌ Automated script not found: {automated_script}")
    
if main_script.exists():
    print(f"✅ Main script: {main_script}")
else:
    print(f"❌ Main script not found: {main_script}")

print("\n💡 To run a manual data pull, choose one of these commands:")
print("\n📊 **Option 1: Quick test (WU only, no sheets)**")
print("   python scripts/automated_data_pull.py --weekly --wu-only --no-sheets")
print("\n📊 **Option 2: Full weekly pull with Google Sheets**")
print("   python scripts/automated_data_pull.py --weekly")
print("\n📊 **Option 3: Main script (creates comprehensive sheets)**")
print("   python scripts/faster_wu_tsi_to_sheets_async.py")

print("\n🚀 **Ready to test? Uncomment and run one of the commands below:**")
print("# Uncomment the line below to run a test:")
print("# !python scripts/automated_data_pull.py --weekly --wu-only --no-sheets")

🔍 Available data collection scripts:
✅ Automated script: /Users/alainsoto/IdeaProjects/Hot Durham/scripts/automated_data_pull.py
✅ Main script: /Users/alainsoto/IdeaProjects/Hot Durham/scripts/faster_wu_tsi_to_sheets_async.py

💡 To run a manual data pull, choose one of these commands:

📊 **Option 1: Quick test (WU only, no sheets)**
   python scripts/automated_data_pull.py --weekly --wu-only --no-sheets

📊 **Option 2: Full weekly pull with Google Sheets**
   python scripts/automated_data_pull.py --weekly

📊 **Option 3: Main script (creates comprehensive sheets)**
   python scripts/faster_wu_tsi_to_sheets_async.py

🚀 **Ready to test? Uncomment and run one of the commands below:**
# Uncomment the line below to run a test:
# !python scripts/automated_data_pull.py --weekly --wu-only --no-sheets


In [27]:
# 🧪 MANUAL TEST EXECUTION
# Uncomment the line below to run a quick test of your data collection system:

# !python scripts/automated_data_pull.py --weekly --wu-only --no-sheets

# If you want to run a full test with both WU and TSI data plus Google Sheets:
# !python scripts/automated_data_pull.py --weekly

print("⏸️ Manual test execution is commented out.")
print("To run a test, uncomment one of the lines above and execute this cell.")
print("\n💡 Recommended first test: --weekly --wu-only --no-sheets (fastest)")

⏸️ Manual test execution is commented out.
To run a test, uncomment one of the lines above and execute this cell.

💡 Recommended first test: --weekly --wu-only --no-sheets (fastest)
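
As an alternative to uncommenting a `!python` line, the test pull can be launched with `subprocess`, which uses the same interpreter as the notebook kernel and returns the exit code. This is a sketch only: the flags are the ones listed above, while `build_pull_command` and `run_pull` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_pull_command(
    project_root: Path, wu_only: bool = True, no_sheets: bool = True
) -> List[str]:
    """Assemble the weekly pull command using the kernel's own interpreter."""
    cmd = [
        sys.executable,
        str(project_root / "scripts" / "automated_data_pull.py"),
        "--weekly",
    ]
    if wu_only:
        cmd.append("--wu-only")
    if no_sheets:
        cmd.append("--no-sheets")
    return cmd


def run_pull(project_root: Path, wu_only: bool = True, no_sheets: bool = True) -> int:
    """Run the pull from the project root and return its exit code."""
    result = subprocess.run(
        build_pull_command(project_root, wu_only, no_sheets),
        cwd=project_root,
        capture_output=True,
        text=True,
    )
    print(result.stdout[-2000:])      # tail of the script's output
    if result.returncode != 0:
        print(result.stderr[-2000:])  # tail of the error output on failure
    return result.returncode
```

Calling `run_pull(project_root)` runs the quick WU-only test. Passing `wu_only=False, no_sheets=False` corresponds to the full weekly pull.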


## 5. Verify Google Drive Sync

Let's check if your Google Drive integration is working properly:

In [28]:
# Check Google Drive credentials
creds_dir = project_root / "creds"
google_creds = creds_dir / "google_creds.json"

print("☁️ Google Drive Integration Status:")

if google_creds.exists():
    print("✅ Google credentials file found")
    try:
        # Try to initialize DataManager (which tests Google Drive)
        dm = DataManager(str(project_root))
        print("✅ DataManager initialized successfully")
        print("✅ Google Drive service appears to be working")
        
        # Check for Google Drive sync script
        sync_script = scripts_dir / "google_drive_sync.py"
        if sync_script.exists():
            print("✅ Google Drive sync script found")
        else:
            print("❌ Google Drive sync script not found")
            
    except Exception as e:
        print(f"❌ Error initializing Google Drive: {e}")
        print("💡 Check your Google Cloud Console settings and credentials")
else:
    print(f"❌ Google credentials not found at: {google_creds}")
    print("💡 Place your Google service account credentials in the creds/ directory")

# Check for recent sync logs
sync_log = logs_path / "google_drive_sync.log"
if sync_log.exists():
    print(f"\n📋 Recent Google Drive sync activity found in: {sync_log}")
    print("💡 You can check the log for sync details")
else:
    print("\n⚠️ No Google Drive sync logs found yet")
    print("💡 Logs will be created after your first sync operation")

2025-05-25 01:26:38,004 - INFO - file_cache is only supported with oauth2client<4.0.0
2025-05-25 01:26:38,006 - INFO - Google Drive service initialized successfully
2025-05-25 01:26:38,006 - INFO - Google Drive service initialized successfully


☁️ Google Drive Integration Status:
✅ Google credentials file found
✅ DataManager initialized successfully
✅ Google Drive service appears to be working
✅ Google Drive sync script found

⚠️ No Google Drive sync logs found yet
💡 Logs will be created after your first sync operation
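
The cell above only confirms that the credentials file exists and that `DataManager` initializes. As a sketch, and assuming `creds/google_creds.json` is a standard Google service account key file, the structure of the file can also be checked offline. The helper name `check_service_account_file` is hypothetical, and the check does not contact Google.

```python
import json
from pathlib import Path
from typing import List

# Fields found in a Google service account key file.
REQUIRED_KEYS = ("type", "project_id", "private_key", "client_email")


def check_service_account_file(path: Path) -> List[str]:
    """Return a list of problems; an empty list means the structure looks right."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:  # unreadable file or invalid JSON
        return [f"cannot read {path.name}: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if not data.get(key)]
    if data.get("type") and data["type"] != "service_account":
        problems.append(f"unexpected type: {data['type']}")
    return problems
```

For example, `check_service_account_file(google_creds)` returns an empty list when all expected fields are present.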


## 6. Analyze Logs for Errors

Let's check your system logs for any recent errors or warnings:

In [29]:
# Check log directory and files
print("📋 System Logs Analysis:")
print(f"\nLogs directory: {logs_path}")

if logs_path.exists():
    log_files = list(logs_path.glob("*.log")) + list(logs_path.glob("*.json"))
    
    if log_files:
        print(f"\n📁 Found {len(log_files)} log file(s):")
        for log_file in log_files:
            print(f"   📄 {log_file.name} ({log_file.stat().st_size} bytes)")
            
        # Check the most recent automation log
        json_logs = [f for f in log_files if f.suffix == '.json']
        if json_logs:
            latest_log = max(json_logs, key=lambda f: f.stat().st_mtime)
            print(f"\n📋 Latest automation log: {latest_log.name}")
            try:
                with open(latest_log, 'r') as f:
                    log_data = json.load(f)
                print(f"   📊 Contains {len(log_data)} log entries")
                if log_data:
                    latest_entry = log_data[-1]
                    print(f"   ⏰ Latest entry: {latest_entry.get('timestamp', 'Unknown time')}")
                    print(f"   📊 Status: {latest_entry.get('status', 'Unknown')}")
                    if 'error' in latest_entry:
                        print(f"   ❌ Error: {latest_entry['error']}")
            except Exception as e:
                print(f"   ❌ Error reading log: {e}")
    else:
        print("\n⚠️ No log files found yet")
        print("💡 Logs will be created after running data pulls or automation")
else:
    print("\n⚠️ Logs directory not found")
    print("💡 Directory will be created automatically when needed")

# Quick error check across all logs
if logs_path.exists():
    error_count = 0
    warning_count = 0
    for log_file in logs_path.glob("*.log"):
        try:
            with open(log_file, 'r') as f:
                content = f.read()
                error_count += content.lower().count('error')
                warning_count += content.lower().count('warning')
        except (OSError, UnicodeDecodeError):
            # Skip logs that cannot be read instead of aborting the scan
            pass
    
    if error_count > 0 or warning_count > 0:
        print(f"\n⚠️ Found {error_count} errors and {warning_count} warnings in logs")
        print("💡 Use 'grep -i error logs/*.log' to investigate specific issues")
    else:
        print("\n✅ No obvious errors or warnings found in logs")

📋 System Logs Analysis:

Logs directory: /Users/alainsoto/IdeaProjects/Hot Durham/logs

📁 Found 1 log file(s):
   📄 automation_log_202505.json (655 bytes)

📋 Latest automation log: automation_log_202505.json
   📊 Contains 3 log entries
   ⏰ Latest entry: 2025-05-25 01:22:16
   📊 Status: Unknown
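
The `Status: Unknown` line suggests that the latest entry has no `status` key. The sketch below lists which keys the entries actually contain and counts the status values, which may help find the right field name. The sample entries are hypothetical, since the real schema of `automation_log_202505.json` is not shown here.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_entries(entries: List[Any]) -> Dict[str, Counter]:
    """Count the keys and the status values across a list of log entries."""
    keys: Counter = Counter()
    statuses: Counter = Counter()
    for entry in entries:
        if not isinstance(entry, dict):
            continue  # ignore anything that is not a JSON object
        keys.update(entry.keys())
        statuses[str(entry.get("status", "Unknown"))] += 1
    return {"keys": keys, "statuses": statuses}


# Hypothetical entries for illustration only.
sample_entries = [
    {"timestamp": "2025-05-25 01:20:00", "pull_type": "weekly"},
    {"timestamp": "2025-05-25 01:22:16", "status": "success"},
]
summary = summarize_entries(sample_entries)
print(dict(summary["keys"]))
print(dict(summary["statuses"]))
```

Running `summarize_entries(log_data)` on the loaded log shows at a glance whether a `status` field exists or whether the outcome is recorded under a different key.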



## 7. Inspect Data Inventory

Let's review your current data collection and storage:

In [34]:
# Use DataManager to get data inventory
print("📊 Data Inventory Report:")

try:
    # Initialize DataManager
    dm = DataManager(str(project_root))
    
    # Get data inventory
    inventory = dm.get_data_inventory()
    
    print("\n📁 Data Storage Overview:")
    print(f"   📂 Total files: {inventory.get('total_files', 0)}")
    print(f"   💾 Total size: {inventory.get('total_size_mb', 0):.2f} MB")
    print(f"   📅 Date range: {inventory.get('date_range', 'No data')}")
    
    # Break down by source
    sources = inventory.get('sources', {})
    if sources:
        print("\n📊 Data by Source:")
        for source, data in sources.items():
            print(f"   🌡️ {source.upper()}: {data.get('files', 0)} files, {data.get('size_mb', 0):.2f} MB")
    
    # List recent files
    recent_files = inventory.get('recent_files', [])
    if recent_files:
        print("\n📋 Recent Data Files (last 5):")
        for file_info in recent_files[-5:]:
            print(f"   📄 {file_info.get('name', 'Unknown')} ({file_info.get('size_mb', 0):.2f} MB)")
    
except Exception as e:
    print(f"❌ Error getting data inventory: {e}")
    print("💡 Try running a manual data pull first to create some data")

# Manual file count (backup method)
print("\n🔍 Manual Directory Check:")
raw_pulls_dir = data_path / "raw_pulls"
if raw_pulls_dir.exists():
    wu_files = list((raw_pulls_dir / "wu").rglob("*.csv")) if (raw_pulls_dir / "wu").exists() else []
    tsi_files = list((raw_pulls_dir / "tsi").rglob("*.csv")) if (raw_pulls_dir / "tsi").exists() else []
    
    print(f"   🌤️ WU files: {len(wu_files)}")
    print(f"   🔬 TSI files: {len(tsi_files)}")
    
    if wu_files or tsi_files:
        print("\n📋 Sample files:")
        for f in (wu_files + tsi_files)[:3]:
            print(f"   📄 {f.name}")
else:
    print("   ⚠️ Raw data directory not found")
    print("   💡 Run a data pull to create the directory structure")

2025-05-25 01:26:43,842 - INFO - file_cache is only supported with oauth2client<4.0.0
2025-05-25 01:26:43,844 - INFO - Google Drive service initialized successfully
2025-05-25 01:26:43,844 - INFO - Google Drive service initialized successfully


📊 Data Inventory Report:
❌ Error getting data inventory: 'DataManager' object has no attribute 'get_data_inventory'
💡 Try running a manual data pull first to create some data

🔍 Manual Directory Check:
   🌤️ WU files: 0
   🔬 TSI files: 0
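
The error above comes from `DataManager` having no `get_data_inventory` method, not from missing data. Until the correct method is identified, a comparable summary can be built straight from the directory tree. This is a minimal sketch that assumes the `data/raw_pulls/<source>/` layout used by the manual check, and `directory_inventory` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Dict, Sequence


def directory_inventory(
    raw_pulls_dir: Path,
    sources: Sequence[str] = ("wu", "tsi"),
    pattern: str = "*.csv",
) -> Dict[str, Dict[str, float]]:
    """Count matching files and total size in MB for each source folder."""
    inventory: Dict[str, Dict[str, float]] = {}
    for source in sources:
        source_dir = raw_pulls_dir / source
        files = (
            [p for p in source_dir.rglob(pattern) if p.is_file()]
            if source_dir.is_dir()
            else []
        )
        total_bytes = sum(p.stat().st_size for p in files)
        inventory[source] = {
            "files": len(files),
            "size_mb": total_bytes / (1024 * 1024),
        }
    return inventory
```

For example, `directory_inventory(data_path / "raw_pulls")` returns one entry per source, with zero counts for folders that do not exist yet.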


## 🎯 Your Recommended Next Steps

Based on your system status, here are the recommended next steps:

### 🚀 **Immediate Actions (Do Today)**

1. **Set Up Automation** (if not already done)
   ```bash
   ./setup_automation.sh
   ```

2. **Run Your First Test**
   ```bash
   python scripts/automated_data_pull.py --weekly --wu-only --no-sheets
   ```

3. **Verify Everything Works**
   ```bash
   python scripts/status_check.py
   ```

### 📊 **Short Term (This Week)**

4. **Start Regular Data Collection**
   - Let the automation run weekly (every Monday at 6:00 AM)
   - Or run manual pulls as needed

5. **Explore Your Data**
   - Check the Google Sheets created by your pulls
   - Review data quality and completeness
   - Look for interesting patterns or anomalies

6. **Set Up Monitoring**
   - Check logs weekly for any errors
   - Monitor Google Drive sync status
   - Verify data integrity regularly

### 📈 **Medium Term (This Month)**

7. **Data Analysis & Insights**
   - Create custom analysis notebooks
   - Build dashboards or reports
   - Compare WU vs TSI sensor readings
   - Analyze seasonal or weekly patterns

8. **System Optimization**
   - Fine-tune automation schedules
   - Add custom data processing
   - Optimize Google Drive storage

### 🔬 **Long Term (Ongoing)**

9. **Environmental Research**
   - Correlate data with health outcomes
   - Study air quality trends
   - Share insights with Durham community
   - Publish findings or reports

10. **System Enhancement**
    - Add new data sources
    - Integrate with other platforms
    - Build public dashboards
    - Automate report generation

## ⚡ Quick Actions

Use these code cells to perform common tasks:

In [31]:
# 🔧 QUICK SETUP - Run automation setup
# Uncomment to run:

# !chmod +x setup_automation.sh
# !./setup_automation.sh

print("⏸️ Setup command is commented out.")
print("Uncomment the lines above to run the automation setup.")

⏸️ Setup command is commented out.
Uncomment the lines above to run the automation setup.


In [32]:
# 🧪 QUICK TEST - Run a fast data pull test
# Uncomment to run:

# !python scripts/automated_data_pull.py --weekly --wu-only --no-sheets

print("⏸️ Test command is commented out.")
print("Uncomment the line above to run a quick WU data test.")

⏸️ Test command is commented out.
Uncomment the line above to run a quick WU data test.


In [33]:
# 📊 QUICK STATUS - Check system health
# Uncomment to run:

# !python scripts/status_check.py

print("⏸️ Status check command is commented out.")
print("Uncomment the line above to run a system status check.")

⏸️ Status check command is commented out.
Uncomment the line above to run a system status check.


## 🎉 Congratulations!

Your **Hot Durham Data Management System** is ready for production! 🚀

### ✅ What You've Accomplished:
- **Complete automated data collection** system
- **Organized storage** with smart file management
- **Google Drive integration** for cloud backup
- **Scheduling system** for regular data pulls
- **Monitoring and logging** for system health
- **Comprehensive testing** and validation

### 🎯 What's Next:
1. **Run the setup** to enable automation
2. **Test your first data pull** to verify everything works
3. **Let the system collect data** automatically
4. **Explore and analyze** your environmental data
5. **Share insights** with the Durham community

---

**Need help?** 
- 📖 Check the `DATA_MANAGEMENT_README.md` for detailed documentation
- 🔍 Use `python scripts/status_check.py` for system health monitoring
- 📋 Review logs in the `logs/` directory for troubleshooting

**Happy data collecting!** 🌟📊🔬