# 🌧️ RAINFALL ANALYSIS FOR INDIAN AGRICULTURE

## Project Objectives:
1. Study historical rainfall patterns
2. Identify monthly, yearly, and regional trends
3. Understand rainfall variability
4. Support agricultural decision-making

---

## Problem Statement:
Agriculture in India heavily depends on rainfall, which is:
- **Uneven** - varies significantly
- **Seasonal** - monsoon dependent
- **Regional** - differs state to state

This analysis helps farmers make data-driven decisions for crop planning and irrigation management.

## STEP 1: Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)
plt.rcParams['font.size'] = 11

print("✅ All libraries imported successfully!")

## STEP 2: Load and Inspect Rainfall Dataset

**Data Source:** Government portal (IMD-like format)

**Data Contains:**
- Year: Historical year data
- State: Indian state/region
- Monthly columns (January - December): Monthly rainfall in mm
- Annual_Rainfall: Total yearly rainfall
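
Before loading, it can help to confirm that a file actually has the columns listed above. The following is a minimal sketch using only the standard library; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here, and the toy header stands in for a real CSV header.

```python
# Hypothetical schema check; column names follow the "Data Contains" list above.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
REQUIRED_COLUMNS = ["Year", "State", *MONTHS, "Annual_Rainfall"]


def missing_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Toy header row with December left out on purpose (not real data).
toy_header = ["Year", "State", *MONTHS[:11], "Annual_Rainfall"]
print(missing_columns(toy_header))  # ['December']
```

With the notebook's DataFrame, the same check would be `missing_columns(df.columns)`; an empty list means every expected column is present.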

In [None]:
# Load the dataset
df = pd.read_csv('../data/rainfall_data.csv')

print("=" * 80)
print("📊 DATASET OVERVIEW")
print("=" * 80)
print(f"\n📌 Shape of Dataset: {df.shape}")
print(f"\n📌 Column Names and Data Types:\n")
print(df.dtypes)
print(f"\n📌 First 5 Rows of Dataset:\n")
print(df.head())
print(f"\n📌 Dataset Information:\n")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print(f"\n📌 Statistical Summary:\n")
print(df.describe())

## STEP 3: Data Preprocessing and Cleaning

**Cleaning Steps:**
1. Check for missing values
2. Remove duplicates
3. Fix incorrect entries
4. Verify data consistency

In [None]:
print("=" * 80)
print("🧹 DATA CLEANING PROCESS")
print("=" * 80)

# Check for missing values
print(f"\n1️⃣ Missing Values in Dataset:\n")
missing_values = df.isnull().sum()
print(missing_values)
print(f"\nTotal Missing Values: {missing_values.sum()}")

# Check for duplicates
print(f"\n2️⃣ Duplicate Rows: {df.duplicated().sum()}")

# Remove duplicates if any
df_clean = df.drop_duplicates().copy()
print(f"   After removing duplicates: {df_clean.shape[0]} rows")

# Check for negative rainfall values (physically impossible, so treated as data errors)
rainfall_columns = [col for col in df.columns if col not in ['Year', 'State', 'Annual_Rainfall']]
print(f"\n3️⃣ Checking for Negative Values in Monthly Data:")
negative_count = (df_clean[rainfall_columns] < 0).sum().sum()
print(f"   Negative values found: {negative_count}")

# Handle any negative values (convert to 0)
df_clean[rainfall_columns] = df_clean[rainfall_columns].clip(lower=0)

# Verify Annual_Rainfall sum matches monthly data
df_clean['Calculated_Annual'] = df_clean[rainfall_columns].sum(axis=1)
df_clean['Annual_Match'] = (df_clean['Annual_Rainfall'] - df_clean['Calculated_Annual']).abs() < 1

print(f"\n4️⃣ Annual Rainfall Verification:")
print(f"   Records where annual matches sum of months (within 1 mm): {df_clean['Annual_Match'].sum()} of {len(df_clean)}")

# Clean up temporary columns
df_clean = df_clean.drop(['Calculated_Annual', 'Annual_Match'], axis=1)

print(f"\n✅ Data Cleaning Complete!")
print(f"   Final dataset shape: {df_clean.shape}")
print(f"\nCleaned Data Sample:\n")
print(df_clean.head())

## STEP 4: Month-wise Rainfall Analysis

**Goal:** Understand which months have maximum and minimum rainfall

In [None]:
# Extract month columns
month_columns = ['January', 'February', 'March', 'April', 'May', 'June', 
                 'July', 'August', 'September', 'October', 'November', 'December']

# Calculate average rainfall for each month
avg_monthly_rainfall = df_clean[month_columns].mean()

print("=" * 80)
print("📅 MONTH-WISE RAINFALL ANALYSIS")
print("=" * 80)
print(f"\nAverage Rainfall by Month (in mm):\n")
print(avg_monthly_rainfall)

# Find peak and low rainfall months
peak_month = avg_monthly_rainfall.idxmax()
peak_rainfall = avg_monthly_rainfall.max()
low_month = avg_monthly_rainfall.idxmin()
low_rainfall = avg_monthly_rainfall.min()

print(f"\n⬆️ PEAK RAINFALL: {peak_month} ({peak_rainfall:.2f} mm)")
print(f"⬇️ LOWEST RAINFALL: {low_month} ({low_rainfall:.2f} mm)")
print(f"📊 Difference: {peak_rainfall - low_rainfall:.2f} mm")

# Visualization
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Bar chart
axes[0].bar(range(len(month_columns)), avg_monthly_rainfall, color='steelblue', alpha=0.7)
axes[0].set_xlabel('Month', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Average Rainfall (mm)', fontsize=12, fontweight='bold')
axes[0].set_title('Average Rainfall by Month Across All Years and States', fontsize=13, fontweight='bold')
axes[0].set_xticks(range(len(month_columns)))
axes[0].set_xticklabels(month_columns, rotation=45)
axes[0].grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(avg_monthly_rainfall):
    axes[0].text(i, v + 5, f'{v:.0f}', ha='center', va='bottom', fontweight='bold')

# Line chart
axes[1].plot(range(len(month_columns)), avg_monthly_rainfall, marker='o', linewidth=2.5, 
             markersize=8, color='darkgreen')
axes[1].fill_between(range(len(month_columns)), avg_monthly_rainfall, alpha=0.3, color='lightgreen')
axes[1].set_xlabel('Month', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Average Rainfall (mm)', fontsize=12, fontweight='bold')
axes[1].set_title('Monthly Rainfall Trend', fontsize=13, fontweight='bold')
axes[1].set_xticks(range(len(month_columns)))
axes[1].set_xticklabels(month_columns, rotation=45)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../outputs/01_Monthly_Rainfall_Analysis.png', dpi=300, bbox_inches='tight')
plt.show()

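print("\n✅ Visualization saved!")
print("\n📌 INSIGHTS:")
monsoon_share = avg_monthly_rainfall[['June', 'July', 'August', 'September']].sum() / avg_monthly_rainfall.sum() * 100
print(f"   • Monsoon season (Jun-Sep) contributes {monsoon_share:.1f}% of average annual rainfall")
print(f"   • {low_month} is the driest month with {low_rainfall:.2f} mm average rainfall")
print(f"   • {peak_month} is the wettest month with {peak_rainfall:.2f} mm average rainfall")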

## STEP 5: Year-wise Rainfall Trend Analysis

**Goal:** Identify if rainfall is increasing, decreasing, or stable over years
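
The sign of a fitted slope alone does not show whether a trend is distinguishable from year-to-year noise. As one hedged way to check this, the sketch below runs a simple permutation test on the slope using only the standard library. This is not the method used in this notebook; `ols_slope`, `permutation_p_value`, and the toy values are assumptions made for illustration.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Share of shuffled series whose |slope| is at least the observed |slope|."""
    rng = random.Random(seed)
    observed = abs(ols_slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(ols_slope(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_permutations


# Toy values for illustration only (not taken from the dataset).
toy_years = list(range(2011, 2021))
toy_rainfall = [1180, 1150, 1210, 1100, 1090, 1170, 1130, 1200, 1220, 1160]

toy_slope = ols_slope(toy_years, toy_rainfall)
toy_p_value = permutation_p_value(toy_years, toy_rainfall)
print(f"slope = {toy_slope:.2f} mm/year, permutation p-value = {toy_p_value:.3f}")
```

A large p-value suggests the slope could plausibly arise from shuffled data, in which case labelling the series "increasing" or "decreasing" would overstate the evidence.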

In [None]:
# Analyze rainfall trends by year
yearly_rainfall = df_clean.groupby('Year')['Annual_Rainfall'].mean()

print("\n" + "=" * 80)
print("📈 YEAR-WISE RAINFALL TREND ANALYSIS")
print("=" * 80)
print(f"\nAverage Annual Rainfall by Year (in mm):\n")
print(yearly_rainfall)

# Calculate trend
years = np.array(yearly_rainfall.index).reshape(-1, 1)
rainfall_values = np.array(yearly_rainfall.values).reshape(-1, 1)
model = LinearRegression()
model.fit(years, rainfall_values)
slope = model.coef_[0][0]
intercept = model.intercept_[0]

# Determine trend
if slope > 0:
    trend = "INCREASING ⬆️"
elif slope < 0:
    trend = "DECREASING ⬇️"
else:
    trend = "STABLE ➡️"

print(f"\n📊 TREND ANALYSIS:")
print(f"   Slope: {slope:.4f} mm/year")
print(f"   Trend: {trend}")

# Calculate min, max, and average
min_year = yearly_rainfall.idxmin()
max_year = yearly_rainfall.idxmax()
avg_rainfall = yearly_rainfall.mean()

print(f"\n📌 KEY STATISTICS:")
print(f"   Minimum rainfall year: {min_year} ({yearly_rainfall[min_year]:.2f} mm)")
print(f"   Maximum rainfall year: {max_year} ({yearly_rainfall[max_year]:.2f} mm)")
print(f"   Average annual rainfall: {avg_rainfall:.2f} mm")
print(f"   Variation: {yearly_rainfall.std():.2f} mm (std dev)")

# Visualizations
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Line plot with trend
x_years = yearly_rainfall.index
y_years = yearly_rainfall.values

axes[0].plot(x_years, y_years, marker='o', linewidth=2.5, markersize=10, 
             label='Actual Rainfall', color='darkblue')
# Trend line
trend_line = model.predict(years).flatten()
axes[0].plot(x_years, trend_line, '--', linewidth=2.5, label='Trend Line', color='red')
axes[0].fill_between(x_years, y_years, alpha=0.3, color='lightblue')
axes[0].set_xlabel('Year', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Average Annual Rainfall (mm)', fontsize=12, fontweight='bold')
axes[0].set_title('Year-wise Rainfall Trend Analysis', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Bar chart by year
colors = ['green' if y > avg_rainfall else 'orange' for y in y_years]
axes[1].bar(x_years, y_years, color=colors, alpha=0.7, label='Annual Rainfall')
axes[1].axhline(y=avg_rainfall, color='red', linestyle='--', linewidth=2, label=f'Average ({avg_rainfall:.0f}mm)')
axes[1].set_xlabel('Year', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Average Annual Rainfall (mm)', fontsize=12, fontweight='bold')
axes[1].set_title('Annual Rainfall Comparison (Green=Above Avg, Orange=Below Avg)', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../outputs/02_Yearly_Rainfall_Trend.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✅ Visualization saved!")

## STEP 6: State-wise Rainfall Comparison

**Goal:** Compare rainfall patterns across different states and identify drought-prone regions
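
The comparison in this step labels states by where their mean rainfall falls relative to the 25th and 75th percentiles. A minimal standard-library sketch of that idea follows; `classify_by_quartile` and the region names are hypothetical, and the `inclusive` quantile method is chosen on the assumption that it matches the linear interpolation pandas uses by default.

```python
from statistics import quantiles


def classify_by_quartile(mean_rainfall):
    """Label each region 'high', 'low', or 'mid' relative to the quartile cut-offs."""
    q1, _, q3 = quantiles(mean_rainfall.values(), n=4, method="inclusive")
    labels = {}
    for region, value in mean_rainfall.items():
        if value >= q3:
            labels[region] = "high"
        elif value <= q1:
            labels[region] = "low"
        else:
            labels[region] = "mid"
    return labels


# Hypothetical region names and mean rainfall in mm, for illustration only.
toy_means = {"A": 400.0, "B": 800.0, "C": 1200.0, "D": 1600.0, "E": 2800.0}
print(classify_by_quartile(toy_means))
```

Because the cut-offs are relative, roughly a quarter of the regions always land in the "low" group whatever their absolute rainfall, so the label is better read as "driest in this dataset" than as a fixed drought threshold.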

In [None]:
# Analyze rainfall by state
state_rainfall = df_clean.groupby('State')['Annual_Rainfall'].agg(['mean', 'min', 'max', 'std'])
state_rainfall = state_rainfall.sort_values('mean', ascending=False)

print("\n" + "=" * 80)
print("🗺️ STATE-WISE RAINFALL ANALYSIS")
print("=" * 80)
print(f"\nRainfall Statistics by State (in mm):\n")
print(state_rainfall)

# Classify states
high_rainfall_threshold = state_rainfall['mean'].quantile(0.75)
low_rainfall_threshold = state_rainfall['mean'].quantile(0.25)

high_rainfall_states = state_rainfall[state_rainfall['mean'] >= high_rainfall_threshold].index.tolist()
low_rainfall_states = state_rainfall[state_rainfall['mean'] <= low_rainfall_threshold].index.tolist()

print(f"\n☔ HIGH RAINFALL STATES (>=75th percentile):")
for state in high_rainfall_states:
    print(f"   {state}: {state_rainfall.loc[state, 'mean']:.2f} mm")

print(f"\n🏜️ LOW/DROUGHT-PRONE STATES (<=25th percentile):")
for state in low_rainfall_states:
    print(f"   {state}: {state_rainfall.loc[state, 'mean']:.2f} mm")

# Visualizations
fig, axes = plt.subplots(2, 1, figsize=(14, 11))

# Bar chart - sorted by rainfall
colors = ['darkblue' if state in high_rainfall_states else 'orange' if state in low_rainfall_states else 'steelblue' 
          for state in state_rainfall.index]
axes[0].barh(state_rainfall.index, state_rainfall['mean'], color=colors, alpha=0.7)
axes[0].set_xlabel('Average Annual Rainfall (mm)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('State', fontsize=12, fontweight='bold')
axes[0].set_title('Average Rainfall by State\n(Blue=High Rainfall, Orange=Low Rainfall)', fontsize=13, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# Add value labels
for i, v in enumerate(state_rainfall['mean']):
    axes[0].text(v + 20, i, f'{v:.0f}', va='center', fontweight='bold')

# Box plot - rainfall distribution
state_data = [df_clean[df_clean['State'] == state]['Annual_Rainfall'].values for state in state_rainfall.index]
bp = axes[1].boxplot(state_data, labels=state_rainfall.index, patch_artist=True, vert=True)

# Color box plot
for patch, state in zip(bp['boxes'], state_rainfall.index):
    if state in high_rainfall_states:
        patch.set_facecolor('darkblue')
    elif state in low_rainfall_states:
        patch.set_facecolor('orange')
    else:
        patch.set_facecolor('steelblue')
    patch.set_alpha(0.7)

axes[1].set_ylabel('Annual Rainfall (mm)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('State', fontsize=12, fontweight='bold')
axes[1].set_title('Rainfall Distribution Variability by State (Box Plot)', fontsize=13, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('../outputs/03_State_wise_Rainfall_Comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✅ Visualization saved!")

## STEP 7: Seasonal Pattern Analysis (Monsoon)

**Goal:** Analyze monsoon rainfall patterns which are crucial for agriculture

**Seasons Definition:**
- Winter: December, January, February
- Summer: March, April, May  
- Monsoon: June, July, August, September
- Post-Monsoon: October, November
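
The season definitions above can be sketched as a small mapping plus a summing function. This standard-library example works on a single record of monthly values; `SEASONS`, `seasonal_totals`, and the toy numbers are illustrative assumptions rather than part of the dataset.

```python
SEASONS = {
    "Winter": ["December", "January", "February"],
    "Summer": ["March", "April", "May"],
    "Monsoon": ["June", "July", "August", "September"],
    "Post-Monsoon": ["October", "November"],
}


def seasonal_totals(monthly_mm):
    """Sum one record's monthly rainfall (mm) into the four seasons defined above."""
    return {
        season: sum(monthly_mm[month] for month in months)
        for season, months in SEASONS.items()
    }


# Toy monthly values for a single record (illustrative, not from the dataset).
toy_year = {
    "January": 10, "February": 15, "March": 20, "April": 30,
    "May": 60, "June": 180, "July": 300, "August": 280,
    "September": 170, "October": 80, "November": 25, "December": 10,
}
totals = seasonal_totals(toy_year)
monsoon_share = 100 * totals["Monsoon"] / sum(totals.values())
print(totals)
print(f"Monsoon share: {monsoon_share:.1f}%")
```

Note that "Winter" here adds December to January and February of the same calendar year, so it does not represent one continuous winter; that matches how a row-wise sum over a yearly record behaves.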

In [None]:
# Define seasons
df_clean['Season_Monsoon'] = df_clean[['June', 'July', 'August', 'September']].sum(axis=1)
df_clean['Season_Winter'] = df_clean[['December', 'January', 'February']].sum(axis=1)
df_clean['Season_Summer'] = df_clean[['March', 'April', 'May']].sum(axis=1)
df_clean['Season_PostMonsoon'] = df_clean[['October', 'November']].sum(axis=1)

print("\n" + "=" * 80)
print("🌊 SEASONAL RAINFALL ANALYSIS")
print("=" * 80)

# Calculate seasonal averages
seasonal_stats = pd.DataFrame({
    'Monsoon': [df_clean['Season_Monsoon'].mean(), df_clean['Season_Monsoon'].std()],
    'Winter': [df_clean['Season_Winter'].mean(), df_clean['Season_Winter'].std()],
    'Summer': [df_clean['Season_Summer'].mean(), df_clean['Season_Summer'].std()],
    'Post-Monsoon': [df_clean['Season_PostMonsoon'].mean(), df_clean['Season_PostMonsoon'].std()]
}, index=['Mean (mm)', 'Std Dev (mm)'])

print(f"\nSeasonal Rainfall Statistics:\n")
print(seasonal_stats)

# Monsoon contribution
total_rainfall = df_clean[month_columns].sum(axis=1).mean()
monsoon_contribution = (df_clean['Season_Monsoon'].mean() / total_rainfall) * 100

print(f"\n🌧️ MONSOON ANALYSIS:")
print(f"   Average Monsoon Rainfall: {df_clean['Season_Monsoon'].mean():.2f} mm")
print(f"   Monsoon Contribution: {monsoon_contribution:.1f}% of annual rainfall")
print(f"   Total Annual Average: {total_rainfall:.2f} mm")

# Seasonal by state
print(f"\n🗺️ MONSOON RAINFALL BY STATE:")
state_monsoon = df_clean.groupby('State')['Season_Monsoon'].mean().sort_values(ascending=False)
for state, value in state_monsoon.items():
    print(f"   {state}: {value:.2f} mm")

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Seasonal comparison
seasons = ['Monsoon', 'Winter', 'Summer', 'Post-Monsoon']
seasonal_means = [df_clean['Season_Monsoon'].mean(), df_clean['Season_Winter'].mean(),
                  df_clean['Season_Summer'].mean(), df_clean['Season_PostMonsoon'].mean()]
colors_season = ['darkblue', 'lightblue', 'orange', 'green']
axes[0, 0].bar(seasons, seasonal_means, color=colors_season, alpha=0.7)
axes[0, 0].set_ylabel('Average Rainfall (mm)', fontsize=12, fontweight='bold')
axes[0, 0].set_title('Seasonal Rainfall Comparison', fontsize=13, fontweight='bold')
axes[0, 0].grid(axis='y', alpha=0.3)
for i, v in enumerate(seasonal_means):
    axes[0, 0].text(i, v + 10, f'{v:.0f}', ha='center', fontweight='bold')

# 2. Monsoon contribution pie chart
axes[0, 1].pie([monsoon_contribution, 100-monsoon_contribution], 
               labels=['Monsoon', 'Other Seasons'],
               colors=['darkblue', 'lightcoral'],
               autopct='%1.1f%%',
               startangle=90,
               explode=(0.05, 0))
axes[0, 1].set_title('Monsoon Contribution to Annual Rainfall', fontsize=13, fontweight='bold')

# 3. Heatmap - average seasonal rainfall by state
seasonal_data = pd.DataFrame({
    'Monsoon': df_clean.groupby('State')['Season_Monsoon'].mean(),
    'Winter': df_clean.groupby('State')['Season_Winter'].mean(),
    'Summer': df_clean.groupby('State')['Season_Summer'].mean(),
    'Post-Monsoon': df_clean.groupby('State')['Season_PostMonsoon'].mean()
})

sns.heatmap(seasonal_data, annot=True, fmt='.0f', cmap='YlGnBu', ax=axes[1, 0], cbar_kws={'label': 'Rainfall (mm)'})
axes[1, 0].set_title('Seasonal Rainfall Heatmap by State', fontsize=13, fontweight='bold')
axes[1, 0].set_ylabel('State', fontsize=12, fontweight='bold')

# 4. Box plot seasonal variability
seasonal_box_data = [df_clean['Season_Monsoon'], df_clean['Season_Winter'],
                     df_clean['Season_Summer'], df_clean['Season_PostMonsoon']]
bp = axes[1, 1].boxplot(seasonal_box_data, labels=seasons, patch_artist=True)
for patch, color in zip(bp['boxes'], colors_season):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
axes[1, 1].set_ylabel('Rainfall (mm)', fontsize=12, fontweight='bold')
axes[1, 1].set_title('Seasonal Rainfall Variability (Box Plot)', fontsize=13, fontweight='bold')
axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../outputs/04_Seasonal_Analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✅ Visualization saved!")

## STEP 8: Statistical Summary and Key Insights

**Goal:** Extract actionable insights from the data for agricultural planning
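
One of the summary statistics used in this step is the coefficient of variation (standard deviation as a percentage of the mean). A small worked sketch is given below, using the sample standard deviation (the same convention as pandas' default `std()`); the function names and toy values are assumptions for illustration, while the 15% and 25% cut-offs are the ones this notebook uses.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100


def variability_band(cv_percent):
    """Map a CV (%) to the three bands used in this notebook (15% and 25% cut-offs)."""
    if cv_percent < 15:
        return "LOW"
    if cv_percent < 25:
        return "MODERATE"
    return "HIGH"


# Toy annual totals in mm (illustrative only).
toy_annual = [900, 1100, 1000, 1200, 800]
toy_cv = coefficient_of_variation(toy_annual)
print(f"CV = {toy_cv:.1f}% -> {variability_band(toy_cv)}")
```

When the CV is computed over all state-year records at once, as it is here, it mixes differences between states with differences between years, so a high value does not by itself mean rainfall is erratic over time.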

In [None]:
print("\n" + "=" * 80)
print("📊 STATISTICAL SUMMARY AND INSIGHTS")
print("=" * 80)

# Overall statistics
print(f"\n1️⃣ OVERALL RAINFALL STATISTICS:")
print(f"   Mean Annual Rainfall: {df_clean['Annual_Rainfall'].mean():.2f} mm")
print(f"   Median Annual Rainfall: {df_clean['Annual_Rainfall'].median():.2f} mm")
print(f"   Std Deviation: {df_clean['Annual_Rainfall'].std():.2f} mm")
print(f"   Min Annual Rainfall: {df_clean['Annual_Rainfall'].min():.2f} mm")
print(f"   Max Annual Rainfall: {df_clean['Annual_Rainfall'].max():.2f} mm")
print(f"   Coefficient of Variation: {(df_clean['Annual_Rainfall'].std() / df_clean['Annual_Rainfall'].mean() * 100):.2f}%")

# Rainfall variability (coefficient of variation, in percent)
print(f"\n2️⃣ RAINFALL VARIABILITY:")
variability = (df_clean['Annual_Rainfall'].std() / df_clean['Annual_Rainfall'].mean() * 100)
if variability < 15:
    variability_level = "LOW (Stable)"
elif variability < 25:
    variability_level = "MODERATE (Predictable)"
else:
    variability_level = "HIGH (Unpredictable)"
print(f"   Variability Level: {variability_level}")
print(f"   This indicates {'stable' if variability < 15 else 'variable'} rainfall patterns")

# Correlation between monthly rainfall
print(f"\n3️⃣ CORRELATION ANALYSIS (Monthly Rainfall):")
correlation_matrix = df_clean[month_columns].corr()
print(f"   Average inter-month correlation: {correlation_matrix.values[np.triu_indices_from(correlation_matrix.values, k=1)].mean():.3f}")

# Years with above/below average rainfall (based on the all-state mean for each year)
yearly_mean = df_clean.groupby('Year')['Annual_Rainfall'].mean()
above_avg = (yearly_mean > yearly_mean.mean()).sum()
below_avg = (yearly_mean <= yearly_mean.mean()).sum()
print(f"\n4️⃣ YEAR DISTRIBUTION:")
print(f"   Years with above-average rainfall: {above_avg}")
print(f"   Years with average or below-average rainfall: {below_avg}")

# Key insights
print(f"\n" + "=" * 80)
print("💡 KEY INSIGHTS FOR AGRICULTURE")
print("=" * 80)

print(f"\n🌾 CROP PLANNING RECOMMENDATIONS:")
print(f"   • Peak rainfall: {peak_month} - Best for water-demanding crops")
print(f"   • Monsoon season ({df_clean['Season_Monsoon'].mean():.0f}mm): Most reliable for main season crops")
print(f"   • Summer season ({df_clean['Season_Summer'].mean():.0f}mm): Requires irrigation for crop cultivation")

print(f"\n💧 IRRIGATION MANAGEMENT:")
print(f"   • States needing most irrigation: {', '.join(low_rainfall_states)}")
print(f"   • States with sufficient rainfall: {', '.join(high_rainfall_states)}")
print(f"   • Critical month: {low_month} requires contingency planning")

print(f"\n⚠️ RISK ASSESSMENT:")
print(f"   • Confidence in rainfall: {'HIGH' if variability < 15 else 'MODERATE' if variability < 25 else 'LOW'}")
print(f"   • Drought risk: {'LOW' if variability < 15 else 'MODERATE' if variability < 25 else 'HIGH'}")
print(f"   • Insurance planning: Required for rainfall-dependent states")

print("\n✅ Analysis Complete!")

## STEP 9: Correlation Analysis and Heatmap

**Goal:** Understand relationships between monthly rainfall patterns
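
When ranking month-to-month correlations, each unordered pair should be counted once and ranked by absolute value so that strong negative relationships are not overlooked. The sketch below illustrates this with the standard library only; `pearson`, `ranked_pairs`, and the toy series are hypothetical and not taken from the dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranked_pairs(series_by_month):
    """Correlate each unordered month pair once; strongest |r| first."""
    pairs = [
        (a, b, pearson(series_by_month[a], series_by_month[b]))
        for a, b in combinations(series_by_month, 2)
    ]
    return sorted(pairs, key=lambda item: abs(item[2]), reverse=True)


# Toy series, one value per year (illustrative only).
toy_series = {
    "June": [150, 200, 180, 220, 170],
    "July": [300, 390, 350, 440, 330],
    "December": [12, 8, 10, 5, 11],
}
for first, second, r in ranked_pairs(toy_series):
    print(f"{first}-{second}: r = {r:+.3f}")
```

Using `combinations` avoids both self-pairs and mirrored duplicates, which is sturdier than de-duplicating on the correlation value itself.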

In [None]:
# Correlation analysis
correlation_matrix = df_clean[month_columns].corr()

print("\n" + "=" * 80)
print("🔗 CORRELATION ANALYSIS")
print("=" * 80)

# Find the strongest correlations, counting each unordered month pair once
print(f"\nTop 5 Strongest Month Correlations:")
corr_pairs = correlation_matrix.unstack().reset_index()
corr_pairs.columns = ['Month1', 'Month2', 'Correlation']
# Keep one row per pair (string comparison drops self-pairs and mirrored duplicates)
corr_pairs = corr_pairs[corr_pairs['Month1'] < corr_pairs['Month2']]
# Rank by absolute value so strong negative correlations are included
corr_pairs = corr_pairs.sort_values('Correlation', key=lambda s: s.abs(), ascending=False)
print(corr_pairs.head(5).to_string(index=False))

# Visualizations
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Heatmap
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='RdBu_r', center=0, 
            ax=axes[0], cbar_kws={'label': 'Correlation'}, vmin=-1, vmax=1)
axes[0].set_title('Monthly Rainfall Correlation Heatmap', fontsize=13, fontweight='bold')

# Rainfall distribution (KDE plot)
for month in month_columns[::3]:  # Plot every 3rd month for clarity
    df_clean[month].plot(kind='density', ax=axes[1], label=month, linewidth=2)
axes[1].set_xlabel('Rainfall (mm)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Density', fontsize=12, fontweight='bold')
axes[1].set_title('Rainfall Distribution by Selected Months', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../outputs/05_Correlation_Analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✅ Visualization saved!")

## STEP 10: Basic Trend Prediction Model

**Goal:** Build simple regression models to forecast future rainfall trends

**Purpose:**
- Future rainfall estimation
- Risk assessment for crop planning
- Insurance and disaster planning
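
Metrics computed on the same years used for fitting tend to flatter a forecast. One hedged alternative is to fit on the earlier years and measure the error on the most recent ones. The sketch below does this with plain least squares and the standard library; `fit_line`, `holdout_mae`, and the toy values are assumptions for illustration and are not how this notebook evaluates its model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_mae(xs, ys, n_test):
    """Fit on all but the last `n_test` points; return MAE on the held-out points."""
    slope, intercept = fit_line(xs[:-n_test], ys[:-n_test])
    errors = [abs(y - (slope * x + intercept))
              for x, y in zip(xs[-n_test:], ys[-n_test:])]
    return sum(errors) / n_test


# Toy yearly means in mm (illustrative only, not from the dataset).
holdout_years = list(range(2011, 2021))
holdout_rain = [1180, 1150, 1210, 1100, 1090, 1170, 1130, 1200, 1220, 1160]
toy_mae = holdout_mae(holdout_years, holdout_rain, 3)
print(f"Hold-out MAE on the last 3 years: {toy_mae:.1f} mm")
```

Comparing this hold-out error with the in-sample MAE gives a rough sense of how far a straight-line extrapolation can be trusted for the next few years.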

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

print("\n" + "=" * 80)
print("🔮 RAINFALL TREND PREDICTION MODEL")
print("=" * 80)

# Prepare data for modeling
yearly_data = df_clean.groupby('Year').agg({
    'Annual_Rainfall': 'mean'
}).reset_index()

X = yearly_data['Year'].values.reshape(-1, 1)
y = yearly_data['Annual_Rainfall'].values

# Build linear regression model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Metrics
r2 = r2_score(y, y_pred)
mae = mean_absolute_error(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))

print(f"\n📈 MODEL PERFORMANCE METRICS:")
print(f"   R² Score: {r2:.4f} (Model explains {r2*100:.2f}% of variance)")
print(f"   MAE: {mae:.2f} mm")
print(f"   RMSE: {rmse:.2f} mm")

# Future predictions
future_years = np.array([2021, 2022, 2023, 2024, 2025]).reshape(-1, 1)
future_predictions = model.predict(future_years)

print(f"\n🔮 PREDICTED ANNUAL RAINFALL (2021-2025):")
for year, pred in zip(future_years.flatten(), future_predictions):
    print(f"   {int(year)}: {pred:.2f} mm")

# State-wise models
print(f"\n🗺️ STATE-WISE TREND PREDICTIONS (2025 Forecast):")
state_predictions = {}
for state in df_clean['State'].unique():
    state_data = df_clean[df_clean['State'] == state].groupby('Year')['Annual_Rainfall'].mean()
    X_state = state_data.index.values.reshape(-1, 1)
    y_state = state_data.values
    
    model_state = LinearRegression()
    model_state.fit(X_state, y_state)
    
    pred_2025 = model_state.predict([[2025]])[0]
    state_predictions[state] = pred_2025
    print(f"   {state}: {pred_2025:.2f} mm")

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Historical data with trend line
axes[0, 0].scatter(X, y, s=100, alpha=0.6, color='darkblue', label='Actual Data')
axes[0, 0].plot(X, y_pred, color='red', linewidth=2.5, label='Trend Line')
axes[0, 0].set_xlabel('Year', fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('Annual Rainfall (mm)', fontsize=12, fontweight='bold')
axes[0, 0].set_title('Historical Rainfall Trend with Linear Regression', fontsize=13, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].text(0.05, 0.95, f'R² = {r2:.4f}', transform=axes[0, 0].transAxes,
                verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5),
                fontsize=11, fontweight='bold')

# 2. Historical data with linear forecast (no confidence interval is computed)
axes[0, 1].plot(X, y, 'o-', linewidth=2, markersize=8, label='Historical', color='darkblue')
axes[0, 1].plot(future_years, future_predictions, 's--', linewidth=2.5, markersize=8, 
                label='Forecast', color='darkgreen')
axes[0, 1].axvline(x=2020.5, color='gray', linestyle=':', linewidth=2, alpha=0.5)
axes[0, 1].set_xlabel('Year', fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('Annual Rainfall (mm)', fontsize=12, fontweight='bold')
axes[0, 1].set_title('Historical Data and Future Rainfall Forecast', fontsize=13, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Residuals plot
residuals = y - y_pred
axes[1, 0].scatter(X, residuals, s=100, alpha=0.6, color='purple')
axes[1, 0].axhline(y=0, color='red', linestyle='--', linewidth=2)
axes[1, 0].set_xlabel('Year', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('Residuals (mm)', fontsize=12, fontweight='bold')
axes[1, 0].set_title('Model Residuals Distribution', fontsize=13, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# 4. State predictions for 2025
state_pred_sorted = dict(sorted(state_predictions.items(), key=lambda x: x[1], reverse=True))
axes[1, 1].barh(list(state_pred_sorted.keys()), list(state_pred_sorted.values()), color='teal', alpha=0.7)
axes[1, 1].set_xlabel('Predicted Annual Rainfall (mm)', fontsize=12, fontweight='bold')
axes[1, 1].set_title('State-wise Rainfall Prediction for 2025', fontsize=13, fontweight='bold')
axes[1, 1].grid(axis='x', alpha=0.3)

for i, (state, value) in enumerate(state_pred_sorted.items()):
    axes[1, 1].text(value + 20, i, f'{value:.0f}', va='center', fontweight='bold')

plt.tight_layout()
plt.savefig('../outputs/06_Prediction_Model.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✅ Visualization saved!")

## STEP 11: Project Summary and Conclusions

**What We Discovered:**

In [None]:
print("\n" + "=" * 80)
print("📋 PROJECT SUMMARY AND CONCLUSIONS")
print("=" * 80)

print(f"""
🌧️ RAINFALL ANALYSIS FOR INDIAN AGRICULTURE - EXECUTIVE SUMMARY

✅ KEY FINDINGS:

1️⃣ TEMPORAL PATTERNS:
   • Monsoon season ({df_clean['Season_Monsoon'].mean():.0f}mm) accounts for {monsoon_contribution:.1f}% of annual rainfall
   • Peak rainfall: {peak_month} ({peak_rainfall:.0f}mm)
   • Lowest rainfall: {low_month} ({low_rainfall:.0f}mm)
   • Overall trend: {trend}

2️⃣ SPATIAL VARIATIONS:
   • High-rainfall states: {', '.join(high_rainfall_states[:2])}
   • Drought-prone states: {', '.join(low_rainfall_states)}
   • Maximum rainfall difference: {state_rainfall['mean'].max() - state_rainfall['mean'].min():.0f}mm

3️⃣ RAINFALL CHARACTERISTICS:
   • Average annual rainfall: {df_clean['Annual_Rainfall'].mean():.0f}mm
   • Variability: {'HIGH' if variability > 25 else 'MODERATE' if variability > 15 else 'LOW'}
   • Consistency: Year-to-year variation = {df_clean['Annual_Rainfall'].std():.0f}mm

4️⃣ PREDICTABILITY:
   • In-sample fit (R²): {r2*100:.2f}% of variance explained
   • Forecast for 2025: {future_predictions[-1]:.0f}mm
   • Confidence level: {'HIGH' if r2 > 0.7 else 'MODERATE' if r2 > 0.5 else 'LOW'}

🎯 RECOMMENDATIONS FOR AGRICULTURE:

1. CROP SELECTION:
   • Water-intensive crops: Plan for monsoon season ({peak_month})
   • Drought-tolerant crops: Focus on {low_month} season
   • Crop rotation: Monsoon → Winter → Summer (with irrigation)

2. IRRIGATION MANAGEMENT:
   • States requiring irrigation: {', '.join(low_rainfall_states)}
   • Critical month: {low_month} (driest month on average)
   • Reservoir planning: Build storage capacity for monsoon surplus

3. RISK MANAGEMENT:
   • Drought insurance: Essential for {', '.join(low_rainfall_states[:1])} region
   • Flood management: Critical during {peak_month} in {high_rainfall_states[0]}
   • Contingency planning: Maintain emergency water reserves

4. RESOURCE ALLOCATION:
   • High priority: High-rainfall states for export crops
   • Irrigation focus: Low-rainfall states for subsistence crops
   • Investment: Better infrastructure needed in drought-prone regions

📊 DATA QUALITY:
   • Records analyzed: {len(df_clean)}
   • States covered: {df_clean['State'].nunique()}
   • Time period: {df_clean['Year'].min()} - {df_clean['Year'].max()}
   • Data completeness: {df_clean[month_columns].notna().mean().mean() * 100:.1f}%

🔮 FUTURE SCOPE:
   ✓ Integrate real-time weather data
   ✓ Apply time-series forecasting (ARIMA, Prophet)
   ✓ Include soil moisture and temperature data
   ✓ Develop IoT-based farmer alerts app
   ✓ Climate change impact analysis

🎓 EDUCATIONAL VALUE:
   This analysis demonstrates:
   ✓ Data-driven agricultural planning
   ✓ Statistical analysis for environmental data
   ✓ Predictive modeling for resource management
   ✓ Real-world application of data science
""")

print("=" * 80)
print("✅ ANALYSIS COMPLETE - All visualizations saved to outputs folder")
print("=" * 80)