In [None]:
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# üåßÔ∏è Exploratory Data Analysis - Rainfall Prediction\n",
    "## Analyzing Historical Weather Data\n",
    "\n",
    "This notebook explores the weather dataset to understand:\n",
    "- Data distribution and patterns\n",
    "- Feature correlations\n",
    "- Missing value analysis\n",
    "- Weather trends\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## üì¶ 1. Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from scipy import stats\n",
    "import warnings\n",
    "\n",
    "# Configuration\n",
    "warnings.filterwarnings('ignore')\n",
    "plt.style.use('seaborn-v0_8-darkgrid')\n",
    "sns.set_palette('husl')\n",
    "%matplotlib inline\n",
    "\n",
    "print('‚úÖ Libraries imported successfully!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## üìÇ 2. Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the weather dataset\n",
    "df = pd.read_csv('../data/weather.csv')\n",
    "\n",
    "print(f'Dataset Shape: {df.shape}')\n",
    "print(f'Number of Rows: {df.shape[0]}')\n",
    "print(f'Number of Columns: {df.shape[1]}')\n",
    "print('\\n‚úÖ Data loaded successfully!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## üîç 3. Initial Data Inspection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display first few rows\n",
    "print('First 5 rows of the dataset:')\n",
    "display(df.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data types and info\n",
    "print('Dataset Information:')\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Statistical summary\n",
    "print('Statistical Summary:')\n",
    "display(df.describe())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## üî¢ 4. Missing Value Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate missing values\n",
    "missing_values = df.isnull().sum()\n",
    "missing_percentage = (missing_values / len(df)) * 100\n",
    "\n",
    "missing_df = pd.DataFrame({\n",
    "    'Missing Values': missing_values,\n",
    "    'Percentage': missing_percentage\n",
    "})\n",
    "\n",
    "print('Missing Value Analysis:')\n",
    "display(missing_df[missing_df['Missing Values'] > 0])\n",
    "\n",
    "# Visualize missing values\n",
    "if missing_values.sum() > 0:\n",
    "    plt.figure(figsize=(10, 5))\n",
    "    missing_values[missing_values > 0].plot(kind='bar', color='coral')\n",
    "    plt.title('Missing Values by Column', fontsize=14, fontweight='bold')\n",
    "    plt.xlabel('Columns')\n",
    "    plt.ylabel('Number of Missing Values')\n",
    "    plt.xticks(rotation=45)\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "else:\n",
    "    print('‚úÖ No missing values found!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## üìä 5. Target Variable Distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count of rain tomorrow\n",
    "rain_counts = df['RainTomorrow'].value_counts()\n",
    "print('Rain Tomorrow Distribution:')\n",
    "print(rain_counts)\n",
    "print(f'\\nPercentage of Rainy Days: {(rain_counts[\"Yes\"] / len(df)) * 100:.2f}%')\n",
    "print(f'Percentage of Non-Rainy Days: {(rain_counts[\"No\"] / len(df)) * 100:.2f}%')\n",
    "\n",
    "# Visualize\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Bar chart\n",
    "rain_counts.plot(kind='bar', ax=axes[0], color=['skyblue', 'coral'])\n",
    "axes[0].set_title('Count of Rainy vs Non-Rainy Days', fontsize=14, fontweight='bold')\n",
    "axes[0].set_xlabel('Rain Tomorrow')\n",
    "axes[0].set_ylabel('Count')\n",
    "axes[0].set_xticklabels(rain_counts.index, rotation=0)\n",
    "\n",
    "# Pie chart\n",
    "axes[1].pie(rain_counts, labels=rain_counts.index, autopct='%1.1f%%', \n",
    "            colors=['skyblue', 'coral'], startangle=90)\n",
    "axes[1].set_title('Proportion of Rain Tomorrow', fontsize=14, fontweight='bold')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## üå°Ô∏è 6. Feature Distribution Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution of numerical features\n",
    "numerical_cols = ['MinTemp', 'MaxTemp', 'Humidity', 'WindSpeed', 'Pressure']\n",
    "\n",
    "fig, axes = plt.subplots(2, 3, figsize=(15, 10))\n",
    "axes = axes.flatten()\n",
    "\n",
    "for idx, col in enumerate(numerical_cols):\n",
    "    axes[idx].hist(df[col].dropna(), bins=30, color='steelblue', edgecolor='black', alpha=0.7)\n",
    "    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')\n",
    "    axes[idx].set_xlabel(col)\n",
    "    axes[idx].set_ylabel('Frequency')\n",
    "    axes[idx].grid(axis='y', alpha=0.3)\n",
    "\n",
    "# Remove extra subplot\n",
    "fig.delaxes(axes[5])\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## üìà 7. Correlation Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a copy with encoded categorical variables\n",
    "df_encoded = df.copy()\n",
    "df_encoded['RainToday'] = df_encoded['RainToday'].map({'Yes': 1, 'No': 0})\n",
    "df_encoded['RainTomorrow'] = df_encoded['RainTomorrow'].map({'Yes': 1, 'No': 0})\n",
    "\n",
    "# Calculate correlation matrix\n",
    "correlation_matrix = df_encoded[numerical_cols + ['RainToday', 'RainTomorrow']].corr()\n",
    "\n",
    "# Plot correlation heatmap\n",
    "plt.figure(figsize=(12, 8))\n",
    "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,\n",
    "            fmt='.2f', square=True, linewidths=1, cbar_kws={'shrink': 0.8})\n",
    "plt.title('Feature Correlation Heatmap', fontsize=16, fontweight='bold', pad=20)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Show strongest correlations with target\n",
    "print('\\nCorrelations with RainTomorrow:')\n",
    "target_corr = correlation_matrix['RainTomorrow'].sort_values(ascending=False)\n",
    "print(target_corr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## üîç 8. Feature Relationships with Target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Box plots: Features vs Rain Tomorrow\n",
    "fig, axes = plt.subplots(2, 3, figsize=(15, 10))\n",
    "axes = axes.flatten()\n",
    "\n",
    "for idx, col in enumerate(numerical_cols):\n",
    "    df.boxplot(column=col, by='RainTomorrow', ax=axes[idx])\n",
    "    axes[idx].set_title(f'{col} vs Rain Tomorrow', fontsize=12, fontweight='bold')\n",
    "    axes[idx].set_xlabel('Rain Tomorrow')\n",
    "    axes[idx].set_ylabel(col)\n",
    "    plt.sca(axes[idx])\n",
    "    plt.xticks([1, 2], ['No', 'Yes'])\n",
    "\n",
    "# Remove extra subplot\n",
    "fig.delaxes(axes[5])\n",
    "plt.suptitle('')  # Remove automatic title\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## üå¶Ô∏è 9. Rain Today vs Rain Tomorrow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cross-tabulation\n",
    "crosstab = pd.crosstab(df['RainToday'], df['RainTomorrow'], margins=True)\n",
    "print('Rain Today vs Rain Tomorrow:')\n",
    "display(crosstab)\n",
    "\n",
    "# Visualize relationship\n",
    "plt.figure(figsize=(8, 6))\n",
    "crosstab_plot = pd.crosstab(df['RainToday'], df['RainTomorrow'], normalize='index') * 100\n",
    "crosstab_plot.plot(kind='bar', stacked=False, color=['skyblue', 'coral'])\n",
    "plt.title('Probability of Rain Tomorrow Given Rain Today', fontsize=14, fontweight='bold')\n",
    "plt.xlabel('Rain Today')\n",
    "plt.ylabel('Percentage (%)')\n",
    "plt.legend(title='Rain Tomorrow')\n",
    "plt.xticks(rotation=0)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## üìù 10. Key Insights Summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('=' * 70)\n",
    "print('KEY INSIGHTS FROM EXPLORATORY DATA ANALYSIS')\n",
    "print('=' * 70)\n",
    "\n",
    "# Dataset overview\n",
    "print(f'\\n1. DATASET OVERVIEW')\n",
    "print(f'   ‚Ä¢ Total records: {len(df):,}')\n",
    "print(f'   ‚Ä¢ Number of features: {df.shape[1]}')\n",
    "print(f'   ‚Ä¢ Missing values: {df.isnull().sum().sum()}')\n",
    "\n",
    "# Target distribution\n",
    "print(f'\\n2. TARGET VARIABLE (RainTomorrow)')\n",
    "rain_pct = (df['RainTomorrow'].value_counts()['Yes'] / len(df)) * 100\n",
    "print(f'   ‚Ä¢ Rainy days: {rain_pct:.1f}%')\n",
    "print(f'   ‚Ä¢ Non-rainy days: {100-rain_pct:.1f}%')\n",
    "\n",
    "# Feature importance (based on correlation)\n",
    "print(f'\\n3. MOST IMPORTANT FEATURES (Correlation with target)')\n",
    "top_features = correlation_matrix['RainTomorrow'].abs().sort_values(ascending=False)[1:4]\n",
    "for feature, corr in top_features.items():\n",
    "    print(f'   ‚Ä¢ {feature}: {corr:.3f}')\n",
    "\n",
    "# Weather patterns\n",
    "print(f'\\n4. WEATHER PATTERNS')\n",
    "print(f'   ‚Ä¢ Average Temperature: {df[\"MaxTemp\"].mean():.1f}¬∞C')\n",
    "print(f'   ‚Ä¢ Average Humidity: {df[\"Humidity\"].mean():.1f}%')\n",
    "print(f'   ‚Ä¢ Average Wind Speed: {df[\"WindSpeed\"].mean():.1f} km/h')\n",
    "\n",
    "print('\\n' + '=' * 70)\n",
    "print('‚úÖ Exploratory Data Analysis Complete!')\n",
    "print('=' * 70)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## üéØ Conclusion\n",
    "\n",
    "This analysis reveals:\n",
    "1. **Data Quality**: The dataset is clean with minimal missing values\n",
    "2. **Class Balance**: Check if rainy/non-rainy days are balanced\n",
    "3. **Feature Importance**: Humidity, RainToday, and Pressure show strongest correlations\n",
    "4. **Patterns**: Higher humidity and rain today increase tomorrow's rain probability\n",
    "\n",
    "These insights will guide our model selection and feature engineering!\n",
    "\n",
    "---\n",
    "**Next Steps**: Proceed to model training with `main.py`"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}