# climaXtreme Getting Started

This notebook demonstrates the basic usage of climaXtreme for climate data analysis.

## Setup

Install climaXtreme and its dependencies in editable mode from the repository root:
```bash
pip install -e .
```

In [None]:
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from climaxtreme.data import DataIngestion, DataValidator
from climaxtreme.preprocessing import DataPreprocessor
from climaxtreme.analysis import HeatmapAnalyzer, TimeSeriesAnalyzer
from climaxtreme.ml import BaselineModel
from climaxtreme.utils import setup_logging

# Setup logging
setup_logging(level="INFO")

print("climaXtreme modules imported successfully!")

## 1. Data Ingestion

First, let's download some sample climate data from Berkeley Earth.

In [None]:
# Initialize data ingestion
ingestion = DataIngestion("../data/raw")

# Download Berkeley Earth data for a small time range
print("Downloading climate data...")
downloaded_files = ingestion.download_berkeley_earth_data(2020, 2022)

print(f"Downloaded {len(downloaded_files)} files:")
for file in downloaded_files:
    info = ingestion.get_file_info(file)
    if info["exists"]:
        print(f"  - {file}: {info['size_mb']} MB")

## 2. Data Validation

Let's validate the downloaded data to check its quality.

In [None]:
# Initialize validator
validator = DataValidator()

# Validate data files
validation_results = validator.validate_directory(Path("../data/raw"))

# Generate summary
summary = validator.generate_validation_summary()
print("Validation Summary:")
print(f"  Files validated: {summary['summary']['total_files_validated']}")
print(f"  Success rate: {summary['summary']['success_rate']}%")
print(f"  Total data rows: {summary['summary']['total_data_rows']:,}")
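
The validator's internal checks are not shown in this notebook. As a rough illustration of the kind of checks a quality report can contain, here is a minimal pandas sketch. The function name `basic_quality_report`, the column names (borrowed from the processed table used later in this notebook), and the plausibility window of -90 to 60 °C are all assumptions for illustration, not part of the climaXtreme API.

```python
import pandas as pd


def basic_quality_report(df: pd.DataFrame, value_col: str = "avg_temperature") -> dict:
    """Summarize simple quality indicators for a monthly temperature table."""
    required = {"year", "month", value_col}
    missing_cols = sorted(required - set(df.columns))
    if missing_cols:
        return {"valid": False, "missing_columns": missing_cols}

    # Count missing readings and calendar months outside 1..12.
    n_missing = int(df[value_col].isna().sum())
    bad_months = int((~df["month"].between(1, 12)).sum())
    # Assumed plausibility window for surface temperatures in °C.
    out_of_range = int((~df[value_col].dropna().between(-90, 60)).sum())

    return {
        "valid": n_missing == 0 and bad_months == 0 and out_of_range == 0,
        "missing_columns": [],
        "rows": len(df),
        "missing_values": n_missing,
        "invalid_months": bad_months,
        "out_of_range": out_of_range,
    }


# Hypothetical sample with one missing value and one invalid month.
sample = pd.DataFrame({
    "year": [2020, 2020, 2020],
    "month": [1, 2, 13],
    "avg_temperature": [3.2, None, 4.1],
})
report = basic_quality_report(sample)
print(report)
```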

## 3. Data Preprocessing

Now let's preprocess the raw data to make it ready for analysis.

In [None]:
# Initialize preprocessor
preprocessor = DataPreprocessor()

# Process the main data file
raw_files = list(Path("../data/raw").glob("*.txt"))
if raw_files:
    main_file = raw_files[0]  # Process the first file
    print(f"Processing {main_file.name}...")
    
    output_files = preprocessor.process_file(str(main_file), "../data/processed")
    
    print("Processed files created:")
    for key, path in output_files.items():
        if Path(path).exists():
            print(f"  - {key}: {path}")
else:
    print("No raw data files found. Please run data ingestion first.")
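
The exact steps inside `DataPreprocessor.process_file` are not documented here. The sketch below shows one plausible way to reduce raw records to a monthly table with the `year`, `month`, and `avg_temperature` columns that the later cells read. The raw layout (a `date` and a `temperature` column) is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical raw records: one row per observation.
raw = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-10", "2020-02-11"]),
    "temperature": [2.0, 4.0, 5.0, None],
})

# Drop missing readings, then average per calendar month.
clean = raw.dropna(subset=["temperature"])
monthly = (
    clean.assign(year=clean["date"].dt.year, month=clean["date"].dt.month)
    .groupby(["year", "month"], as_index=False)["temperature"]
    .mean()
    .rename(columns={"temperature": "avg_temperature"})
)
print(monthly)
```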

## 4. Data Analysis

Let's load the processed data and perform some basic analysis.

In [None]:
# Load processed monthly data
monthly_files = list(Path("../data/processed").glob("*_monthly.csv"))
if monthly_files:
    df = pd.read_csv(monthly_files[0])
    print(f"Loaded data shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    print(f"Year range: {df['year'].min()}-{df['year'].max()}")
    
    # Display first few rows
    display(df.head())
    
    # Basic statistics
    print("\nTemperature statistics:")
    display(df['avg_temperature'].describe())
else:
    print("No processed monthly data found. Please run preprocessing first.")

## 5. Visualization

Create some basic visualizations of the climate data.

In [None]:
if 'df' in locals() and not df.empty:
    # Time series plot
    plt.figure(figsize=(12, 6))
    # Convert (year, month) to a fractional year; January maps to the start of the year
    plt.plot(df['year'] + (df['month'] - 1) / 12, df['avg_temperature'], 'b-', alpha=0.7)
    plt.title('Temperature Time Series')
    plt.xlabel('Year')
    plt.ylabel('Temperature (°C)')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    # Seasonal pattern
    plt.figure(figsize=(10, 6))
    monthly_avg = df.groupby('month')['avg_temperature'].mean()
    plt.plot(monthly_avg.index, monthly_avg.values, 'ro-', linewidth=2, markersize=8)
    plt.title('Average Temperature by Month')
    plt.xlabel('Month')
    plt.ylabel('Temperature (°C)')
    plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
    plt.grid(True, alpha=0.3)
    plt.show()
else:
    print("No data available for visualization")
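
The seasonal cycle usually dominates a raw monthly series, so a common next step is to plot anomalies, meaning each value minus the long-term mean for its calendar month. The sketch below uses a small made-up table with the same column names as the processed data. It is an illustration, not a climaXtreme feature.

```python
import pandas as pd

# Toy monthly table using the same column names as the processed data.
toy = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "month": [1, 7, 1, 7],
    "avg_temperature": [2.0, 20.0, 4.0, 22.0],
})

# Climatology: mean temperature of each calendar month across all years.
climatology = toy.groupby("month")["avg_temperature"].transform("mean")

# Anomaly: departure of each observation from its month's climatology.
toy["anomaly"] = toy["avg_temperature"] - climatology
print(toy)
```

Plotting `anomaly` instead of `avg_temperature` removes the seasonal swing and makes year-to-year changes easier to see.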

## 6. Advanced Analysis

Use the built-in analysis modules for more sophisticated analysis.

In [None]:
# Time series analysis
if 'df' in locals() and not df.empty:
    ts_analyzer = TimeSeriesAnalyzer()
    
    try:
        # Analyze temperature trends
        trend_results = ts_analyzer.analyze_temperature_trends(
            "../data/processed", 
            "../data/output"
        )
        
        print("Temperature Trend Analysis:")
        linear_trend = trend_results['linear_trend']
        print(f"  Trend: {linear_trend['slope_per_decade']:.4f}°C per decade")
        print(f"  R²: {linear_trend['r_squared']:.4f}")
        print(f"  p-value: {linear_trend['p_value']:.6f}")
        print(f"  Significant: {linear_trend['significant']}")
        
    except Exception as e:
        print(f"Trend analysis failed: {e}")
        print("This might be due to insufficient data for trend analysis")
else:
    print("No data available for trend analysis")
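
To make the reported quantities concrete, here is a minimal NumPy sketch of a least-squares trend expressed per decade, together with R². It is not the implementation behind `analyze_temperature_trends`; the function name `linear_trend_per_decade` and the synthetic data are assumptions. The p-value is omitted here; `scipy.stats.linregress` is one standard way to obtain it.

```python
import numpy as np


def linear_trend_per_decade(years, temps):
    """Fit temps = slope * years + intercept by ordinary least squares."""
    years = np.asarray(years, dtype=float)
    temps = np.asarray(temps, dtype=float)

    # Center the years for numerical stability; the slope is unaffected.
    x = years - years.mean()
    slope, intercept = np.polyfit(x, temps, deg=1)

    fitted = slope * x + intercept
    ss_res = np.sum((temps - fitted) ** 2)
    ss_tot = np.sum((temps - temps.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot

    # Slope is in °C per year; multiply by 10 for °C per decade.
    return {"slope_per_decade": float(slope * 10.0), "r_squared": float(r_squared)}


# Synthetic annual means that warm by 0.02 °C per year.
years = np.arange(2000, 2021)
temps = 14.0 + 0.02 * (years - 2000)
trend = linear_trend_per_decade(years, temps)
print(f"{trend['slope_per_decade']:.3f} °C per decade, R² = {trend['r_squared']:.3f}")
```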

## 7. Machine Learning

Train a simple model to predict temperatures.

In [None]:
if 'df' in locals() and not df.empty and len(df) > 24:  # Need enough data for ML
    # Initialize and train a baseline model
    model = BaselineModel("random_forest")
    
    # Rename 'avg_temperature' to 'temperature' for model training
    df_ml = df.rename(columns={'avg_temperature': 'temperature'})
    
    print("Training machine learning model...")
    results = model.train(df_ml)
    
    print("Model Performance:")
    print(f"  Training R²: {results['train_r2']:.4f}")
    print(f"  Test R²: {results['test_r2']:.4f}")
    print(f"  Test RMSE: {results['test_rmse']:.4f}")
    
    # Feature importance (if available)
    if 'feature_importance' in results:
        print("\nTop 5 Most Important Features:")
        importance = results['feature_importance']
        for i, (feature, score) in enumerate(list(importance.items())[:5]):
            print(f"  {i+1}. {feature}: {score:.4f}")
    
    # Make future predictions
    print("\nMaking future predictions...")
    future_predictions = model.predict_future(2024, 2024)
    
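    print("Predicted temperatures for 2024 (first rows):")
    for _, row in future_predictions.head().iterrows():
        # iterrows() may upcast integer columns to float, so cast before using the 'd' format
        print(f"  {int(row['year'])}-{int(row['month']):02d}: {row['predicted_temperature']:.2f}°C")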
        
else:
    print("Insufficient data for machine learning (need >24 records)")
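
For reference, the sketch below shows how test R² and RMSE are commonly defined, using a chronological train/test split and a simple seasonal-mean baseline on synthetic data. It is not how `BaselineModel` is implemented; the helper names `rmse` and `r2`, the noise level, and the split are assumptions chosen for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic series: 4 years of monthly data, seasonal cycle plus noise.
rng = np.random.default_rng(0)
months = np.tile(np.arange(1, 13), 4)
temps = 10.0 + 8.0 * np.sin(2 * np.pi * (months - 4) / 12) + rng.normal(0.0, 0.5, months.size)

# Chronological split: first 3 years for training, final year for testing.
train_m, test_m = months[:36], months[36:]
train_t, test_t = temps[:36], temps[36:]

# Baseline: predict each calendar month's mean from the training years.
monthly_mean = {m: train_t[train_m == m].mean() for m in range(1, 13)}
pred = np.array([monthly_mean[int(m)] for m in test_m])

test_rmse = rmse(test_t, pred)
test_r2 = r2(test_t, pred)
print(f"Test RMSE: {test_rmse:.4f}, Test R²: {test_r2:.4f}")
```

Splitting by time rather than at random keeps future observations out of the training set, which gives a more honest estimate for forecasting tasks.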

## Next Steps

This notebook showed the basic usage of climaXtreme. For more advanced features:

1. **Use PySpark preprocessing** for larger datasets:
   ```python
   from climaxtreme.preprocessing import SparkPreprocessor
   with SparkPreprocessor() as processor:
       processor.process_directory("data/raw", "data/processed")
   ```

2. **Launch the interactive dashboard**:
   ```bash
   climaxtreme dashboard
   ```

3. **Use ensemble ML models**:
   ```python
   from climaxtreme.ml import ClimatePredictor
   predictor = ClimatePredictor(["linear", "ridge", "random_forest"])
   predictor.train_ensemble(df)
   ```

4. **Generate publication-quality heatmaps**:
   ```python
   from climaxtreme.analysis import HeatmapAnalyzer
   analyzer = HeatmapAnalyzer()
   analyzer.generate_global_heatmap("data/processed", "data/output")
   ```

See README.md for the full documentation and complete usage examples.