## **☁️ Weather Forecasting Project**
This project is about **predicting the weather** from real data with machine learning. I pulled historical weather records from the Open-Meteo API and applied data science techniques to build models that forecast temperature, humidity, and other weather metrics.

Whether it’s helping farmers plan their crops or just deciding whether to carry an umbrella, **weather forecasting** has real impact — and this project shows how data science can help.

---

### **🚀 What I Did**

* 📊 **Collected & cleaned** historical weather data
* 🔍 **Explored patterns** in temperature, humidity, wind, etc.
* 🧠 **Built models** using regression, classification, and time-series techniques
* 📈 **Visualized results** with clean and informative charts
* 🧰 **Used tools** like Python, Pandas, Matplotlib, Seaborn, Plotly, Scikit-learn, statsmodels, and PyTorch

> This project is designed to be clear, practical, and easy to follow for anyone interested in data science.

---

### **📌 Goals**

* Understand how to **work with real-world time-series data**
* Practice **data preprocessing, feature engineering, and modeling**
* Improve skills in **visualization and storytelling with data**
* Share a clear, well-documented project on GitHub

In [1]:
# ======================= Standard Libraries ==========================
import warnings
from datetime import datetime, timedelta
import requests
from tqdm import tqdm

# ======================= Core Data Science ===========================
import numpy as np
import pandas as pd
from scipy.stats import pointbiserialr

# ======================= Visualization ===============================
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# ======================= Scikit-learn ================================
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score

# ======================= Time Series Analysis ==========================
from statsmodels.tsa.seasonal import seasonal_decompose

# ======================= Deep Learning ===============================
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# ======================= Configuration ===============================
warnings.filterwarnings("ignore", category=UserWarning)
np.random.seed(42)

print("✅ All libraries imported successfully.")



✅ All libraries imported successfully.


<h1 align="center"> <strong>Data Collection</strong> </h1>

In [2]:
class WeatherAPI:
    def __init__(self):
        self.base_url      = "https://api.open-meteo.com/v1"
        self.geocoding_url = "https://geocoding-api.open-meteo.com/v1/search"
    ################################################################
    # Get latitude and longitude for a city name
    def get_coordinates(self, city_name, country=None):
        if not city_name or not isinstance(city_name, str):
            raise ValueError("City name must be a non-empty string")
        
        params = {'name': city_name, 'count': 5, 'language': 'en', 'format': 'json'}
        if country:
            params['country'] = country
        
        try:
            # A timeout keeps the call from hanging indefinitely on a stalled connection
            response = requests.get(self.geocoding_url, params=params, timeout=30)
            response.raise_for_status()
            data = response.json()
            
            if 'results' in data and len(data['results']) > 0:
                result = data['results'][0]
                return {
                    'name'      : result['name'],
                    'country'   : result.get('country', ''),
                    'admin1'    : result.get('admin1', ''),
                    'latitude'  : result['latitude'],
                    'longitude' : result['longitude']
                }
            else:
                raise ValueError(f"City '{city_name}' not found")
        except requests.exceptions.RequestException as e:
            raise Exception(f"Failed to fetch coordinates for {city_name}: {e}")
    ################################################################
    # Get historical weather data for a city
    def get_historical_weather(self, city_name, country=None, start_date=None, end_date=None, daily_parameters=None, hourly_parameters=None, save=None):
        # Default dates: one week ago to today
        if start_date is None:
            start_date = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')
        if end_date is None:
            end_date = datetime.now().strftime('%Y-%m-%d')
        
        # Validate dates
        try:
            start_dt = datetime.strptime(start_date, '%Y-%m-%d')
            end_dt = datetime.strptime(end_date, '%Y-%m-%d')
        except ValueError as e:
            raise ValueError(f"Invalid date format. Use 'YYYY-MM-DD'. Error: {e}") from e
        # Checked outside the try block so an ordering error is not reported as a format error
        if start_dt > end_dt:
            raise ValueError("start_date must not be after end_date")
        
        # Default parameters
        if daily_parameters is None:
            daily_parameters = [
                'temperature_2m_max', 'temperature_2m_min', 'temperature_2m_mean',
                'precipitation_sum', 'rain_sum', 'snowfall_sum',
                'precipitation_hours', 'windspeed_10m_max', 'winddirection_10m_dominant'
            ]
        if hourly_parameters is None:
            hourly_parameters = [
                'temperature_2m', 'relativehumidity_2m', 'dewpoint_2m',
                'apparent_temperature', 'precipitation', 'rain', 'snowfall',
                'cloudcover', 'windspeed_10m', 'winddirection_10m'
            ]
        
        # Validate parameters
        if not daily_parameters and not hourly_parameters:
            raise ValueError("At least one of daily_parameters or hourly_parameters must be provided")
        
        # Get coordinates
        coords = self.get_coordinates(city_name, country)
        
        # Determine endpoint based on date recency
        days_ago = (datetime.now() - start_dt).days
        endpoint = f"{self.base_url}/forecast" if days_ago <= 5 else "https://historical-forecast-api.open-meteo.com/v1/forecast"
        
        # Prepare API parameters
        params = {
            'latitude': coords['latitude'],
            'longitude': coords['longitude'],
            'start_date': start_date,
            'end_date': end_date,
            'timezone': 'auto'
        }
        if daily_parameters:
            params['daily'] = ','.join(daily_parameters)
        if hourly_parameters:
            params['hourly'] = ','.join(hourly_parameters)
        
        try:
            # A timeout keeps the call from hanging indefinitely on a stalled connection
            response = requests.get(endpoint, params=params, timeout=30)
            response.raise_for_status()
            data = response.json()
            
            result = {
                'city_info': coords,
                'timezone': data.get('timezone', ''),
                'elevation': data.get('elevation', 0)
            }
            
            # Process daily data
            if 'daily' in data:
                daily_data = data['daily']
                daily_df = pd.DataFrame({'date': pd.to_datetime(daily_data['time'])})
                
                for param in daily_parameters:
                    if param in daily_data:
                        daily_df[param] = daily_data[param]
                
                result['daily'] = daily_df
                
                if save:
                    daily_filename = f"{save}_{city_name.replace(' ', '_')}_daily_data.csv"
                    daily_df.to_csv(daily_filename, index=False)
                    print(f"Daily data saved to {daily_filename}")
            
            # Process hourly data
            if 'hourly' in data:
                hourly_data = data['hourly']
                hourly_df = pd.DataFrame({'datetime': pd.to_datetime(hourly_data['time'])})
                
                for param in hourly_parameters:
                    if param in hourly_data:
                        hourly_df[param] = hourly_data[param]
                
                result['hourly'] = hourly_df
                
                if save:
                    hourly_filename = f"{save}_{city_name.replace(' ', '_')}_hourly_data.csv"
                    hourly_df.to_csv(hourly_filename, index=False)
                    print(f"Hourly data saved to {hourly_filename}")
                    print("-" * 50)
            
            return result
        except requests.exceptions.RequestException as e:
            raise Exception(f"Failed to fetch weather data for {city_name}: {e}")
    ################################################################
    # Get weather data with basic statistics
    def get_weather_stats(self, city_name, country=None, start_date=None, end_date=None, save=None):
        data = self.get_historical_weather(city_name, country, start_date, end_date, save=save)
        
        if 'daily' in data:
            daily_df = data['daily']
            stats = {
                'temperature_stats': {
                    'max_temp': daily_df['temperature_2m_max'].max(),
                    'min_temp': daily_df['temperature_2m_min'].min(),
                    'mean_temp': daily_df['temperature_2m_mean'].mean(),
                    'temp_std': daily_df['temperature_2m_mean'].std()
                },
                'precipitation_stats': {
                    'total_precipitation': daily_df['precipitation_sum'].sum(),
                    'rainy_days': (daily_df['precipitation_sum'] > 0).sum(),
                    'max_daily_rain': daily_df['precipitation_sum'].max()
                }
            }
            data['statistics'] = stats
            
        return data

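The daily branch of `get_historical_weather` can be tried without network access. The sketch below is a minimal illustration, not part of the class: `daily_payload_to_frame` is a hypothetical helper, and `sample_payload` is a hand-written dictionary that only mimics the shape the class reads (a `daily` object holding a `time` list plus one list per parameter).

```python
import pandas as pd

# Hand-written payload for illustration only; it mimics the structure that
# get_historical_weather reads from the parsed JSON response.
sample_payload = {
    "daily": {
        "time": ["2024-06-01", "2024-06-02", "2024-06-03"],
        "temperature_2m_max": [38.3, 35.9, 34.1],
        "precipitation_sum": [0.0, 0.0, 0.0],
    },
}


def daily_payload_to_frame(payload, parameters):
    """Hypothetical helper: build a daily DataFrame from a parsed response."""
    daily = payload["daily"]
    frame = pd.DataFrame({"date": pd.to_datetime(daily["time"])})
    for name in parameters:
        # Parameters missing from the response are skipped, as in the class.
        if name in daily:
            frame[name] = daily[name]
    return frame


daily_frame = daily_payload_to_frame(
    sample_payload,
    ["temperature_2m_max", "precipitation_sum", "snowfall_sum"],
)
print(daily_frame)
```

Because `snowfall_sum` is requested but absent from the sample payload, the resulting frame simply has no such column, which is also how the class behaves when the API omits a parameter.
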
In [3]:
# Initialize the API
weather_api = WeatherAPI()

city_name  = "Kafr ash Shaykh"
country    = "Egypt"
start_date = "2024-06-01"
end_date   = "2025-06-30"

try:
    city_weather = weather_api.get_historical_weather(
        city_name=city_name,
        country=country,
        start_date=start_date,
        end_date=end_date,
        save="../data/",
    )
    
    print(f"====== Weather data for {city_weather['city_info']['name']} ========")
    print(f"Coordinates      : {city_weather['city_info']['latitude']}, {city_weather['city_info']['longitude']}")
    print(f"Daily data shape : {city_weather['daily'].shape}")
    print("\nFirst 5 days:")
    display(city_weather['daily'].head())
    
except Exception as e:
    print(f"Error: {e}")

Daily data saved to ../data/_Kafr_ash_Shaykh_daily_data.csv
Hourly data saved to ../data/_Kafr_ash_Shaykh_hourly_data.csv
--------------------------------------------------
Coordinates      : 31.11174, 30.93991
Daily data shape : (395, 10)

First 5 days:


| | date | temperature_2m_max | temperature_2m_min | temperature_2m_mean | precipitation_sum | rain_sum | snowfall_sum | precipitation_hours | windspeed_10m_max | winddirection_10m_dominant |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-06-01 | 38.3 | 19.2 | 28.6 | 0.0 | 0.0 | 0.0 | 0.0 | 20.9 | 102 |
| 1 | 2024-06-02 | 35.9 | 19.2 | 26.9 | 0.0 | 0.0 | 0.0 | 0.0 | 22.2 | 48 |
| 2 | 2024-06-03 | 34.1 | 19.3 | 26.3 | 0.0 | 0.0 | 0.0 | 0.0 | 23.1 | 353 |
| 3 | 2024-06-04 | 36.7 | 19.1 | 26.9 | 0.0 | 0.0 | 0.0 | 0.0 | 22.8 | 333 |
| 4 | 2024-06-05 | 38.0 | 20.5 | 28.3 | 0.0 | 0.0 | 0.0 | 0.0 | 20.1 | 325 |


### 🌤️ Weather Dataset Feature Overview

This dataset includes both **daily** and **hourly** weather features. Below is a comprehensive breakdown of each column, its meaning, and how it can be used in forecasting or analysis tasks.

---

#### 🗓️ Daily Features

| Feature Name | Description | Use Case |
|--------------|-------------|----------|
| `temperature_2m_max` | Daily maximum temperature at 2 meters | Identify heatwaves, extreme daytime conditions. |
| `temperature_2m_min` | Daily minimum temperature at 2 meters | Detect frost or cold nights. |
| `temperature_2m_mean` | Daily average temperature at 2 meters | General temperature trend — often used as target. |
| `precipitation_sum` | Total daily precipitation (rain + snow) | Important for overall moisture prediction. |
| `rain_sum` | Total daily rainfall only | More specific when snow isn’t relevant. |
| `snowfall_sum` | Total daily snowfall | Use in snow/climate-related forecasting. |
| `precipitation_hours` | Total hours with any precipitation | Useful for event duration estimation. |
| `windspeed_10m_max` | Maximum windspeed during the day at 10 meters | Helps predict storms or dangerous wind patterns. |
| `winddirection_10m_dominant` | Dominant wind direction during the day | Useful for regional wind pattern analysis. |

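As one illustration of the heatwave use case for `temperature_2m_max`, the sketch below flags days that belong to a run of consecutive hot days. The function name `flag_hot_spells`, the 35 °C threshold, the three-day minimum, and the sample temperatures are all assumptions chosen for the example; they are not a meteorological definition and not values from this project.

```python
import pandas as pd


def flag_hot_spells(max_temp, threshold_c=35.0, min_days=3):
    """Mark days inside runs of at least `min_days` consecutive hot days."""
    hot = max_temp > threshold_c
    # A new run id starts whenever the hot / not-hot state changes.
    run_id = (hot != hot.shift()).cumsum()
    # Length of the run each day belongs to.
    run_length = run_id.map(run_id.value_counts())
    return hot & (run_length >= min_days)


# Illustrative values only (not taken from the dataset).
dates = pd.date_range("2024-06-01", periods=7, freq="D")
temperature_2m_max = pd.Series(
    [36.0, 37.5, 38.0, 33.0, 36.2, 36.8, 34.0], index=dates
)

hot_spell = flag_hot_spells(temperature_2m_max)
print(hot_spell)
```
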
---

#### 🕐 Hourly Features

| Feature Name | Description | Use Case |
|--------------|-------------|----------|
| `datetime` | Timestamp for each observation (hourly) | Used for indexing, resampling, time-windowing. |
| `temperature_2m` | Instantaneous temperature at 2 meters | Fine-grained temperature patterns across the day. |
| `relativehumidity_2m` | Relative humidity at 2 meters | Used in modeling how moisture feels and behaves. |
| `dewpoint_2m` | Dew point temperature | Useful in fog, frost, or condensation forecasting. |
| `apparent_temperature` | "Feels like" temperature | Adjusted for humidity and wind — used for human-centric forecasting. |
| `precipitation` | Precipitation amount for that hour (rain + snow) | High-resolution input for rain event detection. |
| `rain` | Rainfall only for that hour | Helps isolate rain events without snow interference. |
| `snowfall` | Snowfall only for that hour | Relevant for snow alerts or time-based snow accumulation. |
| `cloudcover` | Cloud coverage percentage | Good for solar radiation modeling or sky condition tracking. |
| `windspeed_10m` | Hourly windspeed at 10 meters | Useful for wind energy, comfort metrics, or alerts. |
| `winddirection_10m` | Wind direction at 10 meters | Can be combined with windspeed for vector-based modeling. |


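The vector-based modeling mentioned for `winddirection_10m` can be sketched as follows. This is a minimal example under one stated assumption: the direction is the meteorological convention (degrees clockwise from north, indicating where the wind blows *from*). The helper name `wind_to_components` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def wind_to_components(speed, direction_deg):
    """Convert wind speed and direction to eastward (u) and northward (v) parts.

    Assumes `direction_deg` is where the wind comes FROM, clockwise from
    north, so a northerly wind (0 degrees) moves air southward (v < 0).
    """
    theta = np.deg2rad(direction_deg)
    u = -speed * np.sin(theta)  # eastward component
    v = -speed * np.cos(theta)  # northward component
    return u, v


# Illustrative values only.
hourly = pd.DataFrame({
    "windspeed_10m": [10.0, 10.0, 5.0],
    "winddirection_10m": [0.0, 90.0, 270.0],
})
hourly["wind_u"], hourly["wind_v"] = wind_to_components(
    hourly["windspeed_10m"], hourly["winddirection_10m"]
)
print(hourly.round(3))
```

Unlike the raw angle, the two components vary smoothly across the 359° to 0° wrap-around and can be averaged or fed to a model directly.
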
In [4]:
# Example 2: Custom parameters
try:
    custom_weather = weather_api.get_historical_weather(city_name="Giza",country=country,start_date=start_date,end_date=end_date, 
        daily_parameters=['temperature_2m_max', 'temperature_2m_min', 'precipitation_sum'],
        hourly_parameters=['temperature_2m', 'relativehumidity_2m', 'windspeed_10m'],
        save="../data/"
    )
    
    print(f"======== Custom weather data for {custom_weather['city_info']['name']} ========")
    print("Daily  data columns :", list(custom_weather['daily'].columns))
    print("Hourly data columns :", list(custom_weather['hourly'].columns))
except Exception as e:
    print(f"Error: {e}")

Daily data saved to ../data/_Giza_daily_data.csv
Hourly data saved to ../data/_Giza_hourly_data.csv
--------------------------------------------------
Daily  data columns : ['date', 'temperature_2m_max', 'temperature_2m_min', 'precipitation_sum']
Hourly data columns : ['datetime', 'temperature_2m', 'relativehumidity_2m', 'windspeed_10m']


<h1 align="center"> <strong>Data ingestion & preprocessing class</strong> </h1>

In [5]:
class WeatherTimeSeriesProcessor:
    def __init__(self):
        self.feature_names    = []
        self.validation_stats = {}
    ##############################################################
    def load_and_prepare_data(self, csv_path):
        """Load and preprocess weather data from CSV with validation and logging"""
        print(f"📂 Loading data from : {csv_path}")
        try:
            df = pd.read_csv(csv_path)
        except FileNotFoundError:
            raise Exception(f"File not found: {csv_path}")
        
        # Parse date and sort
        df['date'] = pd.to_datetime(df['date'])
        df = df.sort_values('date').reset_index(drop=True)
        df.set_index('date', inplace=True)
        
        # Rename columns for consistency
        rename_dict = {
            "temperature_2m_max"         : "max_temp_celsius",
            "temperature_2m_min"         : "min_temp_celsius", 
            "temperature_2m_mean"        : "avg_temp_celsius",
            "precipitation_sum"          : "total_precip_mm",
            "precipitation_hours"        : "precip_hours",
            "rain_sum"                   : "rain_total_mm",
            "snowfall_sum"               : "snow_total_mm",
            "windspeed_10m_max"          : "max_wind_speed_10m",
            "winddirection_10m_dominant" : "dominant_wind_dir_deg"
        }
        df.rename(columns=rename_dict, inplace=True)
        
        # Drop all-zero columns
        zero_cols = [col for col in df.columns if df[col].fillna(0).eq(0).all()]
        if zero_cols:
            print(f"🗑️ Removing zero-only columns: {zero_cols}")
            df.drop(columns=zero_cols, inplace=True)
        
        # Check for temporal gaps (consecutive daily rows should be exactly 1 day apart)
        date_diff = df.index.to_series().diff()
        if date_diff.max() > pd.Timedelta(days=1):
            print("⚠️ Warning: Found gaps > 1 day in time series")
        
        # Warn about missing values
        missing = df.isna().mean()
        missing_cols = missing[missing > 0]
        if not missing_cols.empty:
            print("⚠️ Columns with missing values:")
            print(missing_cols.sort_values(ascending=False).round(2))
        
        # Create rain label for classification
        if 'rain_total_mm' in df.columns:
            df['rain_label'] = (df['rain_total_mm'] > 0).astype(int)
            print("✅ Rain label created")
        
        # Save validation stats
        self.validation_stats = {
            'original_shape'    : df.shape,
            'date_range'        : (df.index.min(), df.index.max()),
            'missing_values'    : df.isnull().sum().sum(),
            'duplicate_dates'   : df.index.duplicated().sum(),
            'columns_present'   : df.columns.tolist()
        }
        
        # Optional: warn if expected columns are missing
        expected_cols = ['avg_temp_celsius', 'rain_total_mm', 'max_temp_celsius']
        missing_expected = [col for col in expected_cols if col not in df.columns]
        if missing_expected:
            print(f"⚠️ Warning: Missing important columns: {missing_expected}")
        
        # Output summary
        print(f"✅ Data preprocessed : {df.shape[0]} rows, {df.shape[1]} columns")
        print(f"📅 Date range        : {self.validation_stats['date_range'][0]} to {self.validation_stats['date_range'][1]}")
        print(f"❓ Missing values    : {self.validation_stats['missing_values']}")
        print(f"🧩 Duplicate dates   : {self.validation_stats['duplicate_dates']}")
        print("-" * 50)
        return df
    ##############################################################
    def create_lag_features(self, df, target_col, lags=[1, 2, 3, 7, 14]):
        """Create lag features for a time series target with logging and validation"""
        print(f"🔄 Creating lag features for '{target_col}'")
        print(f"   Lags: {lags}")
        if target_col not in df.columns:
            raise ValueError(f"❌ Target column '{target_col}' not found in DataFrame")
        
        df_with_lags = df.copy()
        
        for lag in lags:
            lag_col = f"{target_col}_lag_{lag}"
            df_with_lags[lag_col] = df_with_lags[target_col].shift(lag)
            print(f"   ✅ Created: {lag_col}")
        
        n_missing = df_with_lags.isnull().sum().sum()
        print(f"📊 Missing values after lag creation: {n_missing}")
        print("-" * 50)
        return df_with_lags
    ##############################################################
    def create_rolling_features(self, df, target_col, windows=[7, 14, 30]):
        """
        Create rolling statistical features (mean, std, min, max)
        for a time series target variable using different window sizes.
        """
        print(f"📊 Creating rolling features for '{target_col}'")
        print(f"   Windows: {windows}")
        
        if target_col not in df.columns:
            raise ValueError(f"❌ Target column '{target_col}' not found in DataFrame")
        
        df_with_rolling = df.copy()
        for window in windows:
            print(f"   🔄 Processing window: {window}")
            min_periods = max(1, window // 2)
            rolling = df_with_rolling[target_col].rolling(window=window, min_periods=min_periods)
            
            df_with_rolling[f"{target_col}_rolling_mean_{window}"] = rolling.mean()
            df_with_rolling[f"{target_col}_rolling_std_{window}"]  = rolling.std()
            df_with_rolling[f"{target_col}_rolling_min_{window}"]  = rolling.min()
            df_with_rolling[f"{target_col}_rolling_max_{window}"]  = rolling.max()
            
            print(f"   ✅ Created: mean, std, min, max for window {window}")
        print(f"📈 Total rolling features created: {len(windows) * 4}")
        print("-" * 50)
        return df_with_rolling
    ##############################################################
    def create_cyclical_features(self, df):
        """Create cyclical temporal features (month, day of year, day of week)"""
        print("🌀 Creating cyclical temporal features...")
        df_with_cyclical = df.copy()
        
        # Day of year
        df_with_cyclical['day_of_year'] = df_with_cyclical.index.dayofyear
        print("   ✅ Day of year added")
        
        # Month (cyclical)
        df_with_cyclical['month_sin'] = np.sin(2 * np.pi * df_with_cyclical.index.month / 12)
        df_with_cyclical['month_cos'] = np.cos(2 * np.pi * df_with_cyclical.index.month / 12)
        print("   ✅ Month sin/cos created")
        
        # Day of year (cyclical)
        df_with_cyclical['day_of_year_sin'] = np.sin(2 * np.pi * df_with_cyclical['day_of_year'] / 365.25)
        df_with_cyclical['day_of_year_cos'] = np.cos(2 * np.pi * df_with_cyclical['day_of_year'] / 365.25)
        print("   ✅ Day of year sin/cos created")
        
        # Day of week (cyclical)
        df_with_cyclical['day_of_week_sin'] = np.sin(2 * np.pi * df_with_cyclical.index.dayofweek / 7)
        df_with_cyclical['day_of_week_cos'] = np.cos(2 * np.pi * df_with_cyclical.index.dayofweek / 7)
        print("   ✅ Day of week sin/cos created")
        
        print("🌀 Total cyclical features created: 7")
        print("-" * 50)
        return df_with_cyclical
    ##############################################################
    def add_weather_features(self, df):
        """Add engineered weather-related features (temp range, wind encoding, weekend, etc.)"""
        print("🌤️ Adding engineered weather features...")
        
        df_with_weather = df.copy()
        feature_count = 0
        
        # Temp range
        if 'max_temp_celsius' in df.columns and 'min_temp_celsius' in df.columns:
            df_with_weather['temp_range'] = df_with_weather['max_temp_celsius'] - df_with_weather['min_temp_celsius']
            print("   ✅ Temperature range created")
            feature_count += 1
        
        # Wind direction cyclical
        if 'dominant_wind_dir_deg' in df.columns:
            df_with_weather['wind_dir_sin'] = np.sin(np.deg2rad(df_with_weather['dominant_wind_dir_deg']))
            df_with_weather['wind_dir_cos'] = np.cos(np.deg2rad(df_with_weather['dominant_wind_dir_deg']))
            print("   ✅ Wind direction cyclical features created")
            feature_count += 2
        
        # Weekend indicator
        df_with_weather['is_weekend'] = (df_with_weather.index.dayofweek >= 5).astype(int)
        print("   ✅ Weekend indicator created")
        feature_count += 1
        
        print(f"🌤️ Total engineered weather features added: {feature_count}")
        print("-" * 50)
        
        return df_with_weather
    ##############################################################
    def add_seasonal_features(self, df):
        """Add season-based features (season, season label, transition flag)"""
        print("🌱 Adding seasonal features...")
        df_seasonal = df.copy()
        
        # Helper: map month to season code (0=Winter, 1=Spring, 2=Summer, 3=Autumn)
        def get_season(month):
            if month in [12, 1, 2]:
                return 0  # Winter
            elif month in [3, 4, 5]:
                return 1  # Spring
            elif month in [6, 7, 8]:
                return 2  # Summer
            else:
                return 3  # Autumn
        
        df_seasonal['season'] = df_seasonal.index.month.map(get_season)
        print("   ✅ Season index created")
        # Season label
        season_labels = {0: 'Winter', 1: 'Spring', 2: 'Summer', 3: 'Autumn'}
        df_seasonal['season_label'] = df_seasonal['season'].map(season_labels)
        print("   ✅ Season label added")
        # Transition season flag
        df_seasonal['is_transition_season'] = df_seasonal['season'].isin([1, 3]).astype(int)
        print("   ✅ Transition season flag added")
        print("🌱 Seasonal features complete")
        return df_seasonal
    ##############################################################
    def create_all_features(self, csv_path, target_col=None, verbose=True):
        """
        Orchestrate full feature engineering pipeline.
        Includes preprocessing, lag, rolling, cyclical, weather, and seasonal features.
        """
        if target_col is None:
            target_col = 'avg_temp_celsius'
        
        if verbose:
            print("🚀 Starting comprehensive feature engineering...")
            print(f"   Target column : {target_col}\n")
        
        # Step 1: Load + Preprocess
        df_processed = self.load_and_prepare_data(csv_path)
        
        # Step 2: Lag
        df_lag = self.create_lag_features(df_processed, target_col)
        
        # Step 3: Rolling
        df_roll = self.create_rolling_features(df_lag, target_col)
        
        # Step 4: Cyclical
        df_cyc = self.create_cyclical_features(df_roll)
        
        # Step 5: Weather
        df_weather = self.add_weather_features(df_cyc)
        
        # Step 6: Season
        df_final = self.add_seasonal_features(df_weather)
        
        # Save feature names
        self.feature_names = df_final.columns.tolist()
        
        # Final check
        if verbose:
            missing = df_final.isnull().sum().sum()
            print("="*50)
            print("✅ Feature engineering complete!")
            print("="*50)
            print(f"   📊 Final shape    : {df_final.shape}")
            print(f"   🔢 Total features : {len(self.feature_names)}")
            print(f"   ❓ Missing values : {missing}")
            if missing > 0:
                print("   ⚠️ Consider handling missing values before modeling.")
        return df_final
    ##############################################################

# Initialize the processor
processor = WeatherTimeSeriesProcessor()
print("✅ WeatherTimeSeriesProcessor initialized!")

✅ WeatherTimeSeriesProcessor initialized!

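One thing to watch in `create_rolling_features`: each window ends on the current row, so when the rolled column is also the prediction target, the rolling statistics include the value being predicted. A common guard is to shift the series by one step before rolling. The sketch below shows that variant; `rolling_features_past_only` and the `_past_` column names are hypothetical, and the sample series is illustrative.

```python
import pandas as pd


def rolling_features_past_only(series, windows=(7, 14, 30)):
    """Rolling mean/std computed from values strictly before each row."""
    past = series.shift(1)  # row t now only sees values up to t-1
    features = {}
    for window in windows:
        min_periods = max(1, window // 2)  # same rule as the class above
        rolling = past.rolling(window=window, min_periods=min_periods)
        features[f"{series.name}_past_mean_{window}"] = rolling.mean()
        features[f"{series.name}_past_std_{window}"] = rolling.std()
    return pd.DataFrame(features, index=series.index)


# Illustrative values only.
dates = pd.date_range("2024-06-01", periods=6, freq="D")
avg_temp = pd.Series(
    [28.6, 26.9, 26.3, 26.9, 28.3, 27.0], index=dates, name="avg_temp_celsius"
)

past_features = rolling_features_past_only(avg_temp, windows=(3,))
print(past_features)
```

The first row is empty because no earlier observation exists, and the second row's mean equals the first day's temperature rather than an average that includes the second day itself.
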

<h1 align="center"> <strong>Data loading & feature engineering</strong> </h1>

In [6]:
# Create comprehensive feature set

target_col_reg = "avg_temp_celsius"
target_col_clf = "rain_label"

df = processor.create_all_features("../data/_Kafr_ash_Shaykh_daily_data.csv", target_col=target_col_reg)

🚀 Starting comprehensive feature engineering...
   Target column : avg_temp_celsius

📂 Loading data from : ../data/_Kafr_ash_Shaykh_daily_data.csv
🗑️ Removing zero-only columns: ['snow_total_mm']
✅ Rain label created
✅ Data preprocessed : 395 rows, 9 columns
📅 Date range        : 2024-06-01 00:00:00 to 2025-06-30 00:00:00
❓ Missing values    : 0
🧩 Duplicate dates   : 0
--------------------------------------------------
🔄 Creating lag features for 'avg_temp_celsius'
   Lags: [1, 2, 3, 7, 14]
   ✅ Created: avg_temp_celsius_lag_1
   ✅ Created: avg_temp_celsius_lag_2
   ✅ Created: avg_temp_celsius_lag_3
   ✅ Created: avg_temp_celsius_lag_7
   ✅ Created: avg_temp_celsius_lag_14
📊 Missing values after lag creation: 27
--------------------------------------------------
📊 Creating rolling features for 'avg_temp_celsius'
   Windows: [7, 14, 30]
   🔄 Processing window: 7
   ✅ Created: mean, std, min, max for window 7
   🔄 Processing window: 14
   ✅ Created: mean, std, min, max for window 14
   🔄

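The sin/cos pairs produced by `create_cyclical_features` exist so that values at the two ends of a cycle end up close together. A small check of that property, using a hypothetical `encode_month` helper that applies the same formula as the class:

```python
import numpy as np


def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])


def encoded_distance(month_a, month_b):
    """Euclidean distance between two months in (sin, cos) space."""
    return float(np.linalg.norm(encode_month(month_a) - encode_month(month_b)))


dec_to_jan = encoded_distance(12, 1)  # neighbours across the year boundary
jun_to_jul = encoded_distance(6, 7)   # neighbours mid-year
dec_to_jun = encoded_distance(12, 6)  # opposite sides of the year

print(f"Dec-Jan: {dec_to_jan:.3f}, Jun-Jul: {jun_to_jul:.3f}, Dec-Jun: {dec_to_jun:.3f}")
```

With the raw month number, December and January are 11 apart; in the encoded space they are as close as any other pair of neighbouring months.
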
In [7]:
def display_data_info(df, detailed=False):
    print("="*60)
    print("                 📊 PROCESSED DATA OVERVIEW                 ")
    print("="*60)
    print(f"🧾 Shape           : {df.shape}")
    print(f"💾 Memory usage    : {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    total_missing = df.isnull().sum().sum()
    print(f"❓ Missing values  : {total_missing}")
    
    if total_missing > 0:
        print("\n🔍 Top features with missing values:")
        print(df.isnull().sum().sort_values(ascending=False).head(5))
    
    print(f"\n📌 Feature names   : {df.columns.tolist()}")
    
    if detailed:
        print("\n📈 Stats Preview:")
        display(df.describe().T[['mean', 'std', 'min', 'max']].round(2))
    
    print("="*60)
    return df.head()

display_data_info(df)

                 📊 PROCESSED DATA OVERVIEW                 
🧾 Shape           : (395, 40)
💾 Memory usage    : 0.14 MB
❓ Missing values  : 115

🔍 Top features with missing values:
avg_temp_celsius_rolling_min_30     14
avg_temp_celsius_rolling_std_30     14
avg_temp_celsius_rolling_mean_30    14
avg_temp_celsius_rolling_max_30     14
avg_temp_celsius_lag_14             14
dtype: int64

📌 Feature names   : ['max_temp_celsius', 'min_temp_celsius', 'avg_temp_celsius', 'total_precip_mm', 'rain_total_mm', 'precip_hours', 'max_wind_speed_10m', 'dominant_wind_dir_deg', 'rain_label', 'avg_temp_celsius_lag_1', 'avg_temp_celsius_lag_2', 'avg_temp_celsius_lag_3', 'avg_temp_celsius_lag_7', 'avg_temp_celsius_lag_14', 'avg_temp_celsius_rolling_mean_7', 'avg_temp_celsius_rolling_std_7', 'avg_temp_celsius_rolling_min_7', 'avg_temp_celsius_rolling_max_7', 'avg_temp_celsius_rolling_mean_14', 'avg_temp_celsius_rolling_std_14', 'avg_temp_celsius_rolling_min_14', 'avg_temp_celsius_rolling_max_14', 'avg_temp

| date | max_temp_celsius | min_temp_celsius | avg_temp_celsius | total_precip_mm | rain_total_mm | precip_hours | max_wind_speed_10m | dominant_wind_dir_deg | rain_label | avg_temp_celsius_lag_1 | ... | day_of_year_cos | day_of_week_sin | day_of_week_cos | temp_range | wind_dir_sin | wind_dir_cos | is_weekend | season | season_label | is_transition_season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-06-01 | 38.3 | 19.2 | 28.6 | 0.0 | 0.0 | 0.0 | 20.9 | 102 | 0 | NaN | ... | -0.872929 | -0.974928 | -0.222521 | 19.1 | 0.978148 | -0.207912 | 1 | 2 | Summer | 0 |
| 2024-06-02 | 35.9 | 19.2 | 26.9 | 0.0 | 0.0 | 0.0 | 22.2 | 48 | 0 | 28.6 | ... | -0.881192 | -0.781831 | 0.62349 | 16.7 | 0.743145 | 0.669131 | 1 | 2 | Summer | 0 |
| 2024-06-03 | 34.1 | 19.3 | 26.3 | 0.0 | 0.0 | 0.0 | 23.1 | 353 | 0 | 26.9 | ... | -0.889193 | 0.0 | 1.0 | 14.8 | -0.121869 | 0.992546 | 0 | 2 | Summer | 0 |
| 2024-06-04 | 36.7 | 19.1 | 26.9 | 0.0 | 0.0 | 0.0 | 22.8 | 333 | 0 | 26.3 | ... | -0.896932 | 0.781831 | 0.62349 | 17.6 | -0.45399 | 0.891007 | 0 | 2 | Summer | 0 |
| 2024-06-05 | 38.0 | 20.5 | 28.3 | 0.0 | 0.0 | 0.0 | 20.1 | 325 | 0 | 26.9 | ... | -0.904405 | 0.974928 | -0.222521 | 17.5 | -0.573576 | 0.819152 | 0 | 2 | Summer | 0 |

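The overview above reports 115 missing values, with a largest per-column count of 14, which suggests they sit in the warm-up rows created by the lag and rolling features. One simple way to handle them before modeling is to drop those rows and then split by time order rather than at random. The sketch below is one possible approach, not the project's pipeline: `chronological_split`, the 20% test fraction, and the synthetic `demo` frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def chronological_split(frame, target_col, test_fraction=0.2):
    """Drop rows containing NaNs, then split by time order (no shuffling)."""
    clean = frame.dropna()
    split_at = int(len(clean) * (1 - test_fraction))
    train, test = clean.iloc[:split_at], clean.iloc[split_at:]
    features = [col for col in clean.columns if col != target_col]
    return train[features], train[target_col], test[features], test[target_col]


# Synthetic stand-in for the engineered frame (values are made up).
dates = pd.date_range("2024-06-01", periods=30, freq="D")
demo = pd.DataFrame({"avg_temp_celsius": np.linspace(25.0, 30.0, 30)}, index=dates)
demo["avg_temp_celsius_lag_7"] = demo["avg_temp_celsius"].shift(7)

X_train, y_train, X_test, y_test = chronological_split(demo, "avg_temp_celsius")
print(len(X_train), len(X_test), X_train.index.max(), X_test.index.min())
```

Every test date falls after the last training date, so the evaluation reflects forecasting forward in time. On the real frame, text columns such as `season_label` would still need to be encoded or dropped before fitting a numeric model.
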

In [8]:
print("="*80)
print("                 📊 INTERACTIVE PLOTLY VISUALIZATIONS                 ")
print("="*80)

# Create interactive plot showing temperature variations
fig = go.Figure()

# Add traces for different temperature measurements
fig.add_trace(go.Scatter(
    x=df.index,
    y=df['max_temp_celsius'],
    mode='lines',
    name='Max Temperature',
    line=dict(color='red', width=2),
    hovertemplate='<b>Max Temp</b><br>Date: %{x}<br>Temperature: %{y:.1f}°C<extra></extra>'
))

fig.add_trace(go.Scatter(
    x=df.index,
    y=df['min_temp_celsius'],
    mode='lines',
    name='Min Temperature',
    line=dict(color='blue', width=2),
    hovertemplate='<b>Min Temp</b><br>Date: %{x}<br>Temperature: %{y:.1f}°C<extra></extra>'
))

fig.add_trace(go.Scatter(
    x=df.index,
    y=df['avg_temp_celsius'],
    mode='lines',
    name='Average Temperature',
    line=dict(color='green', width=3),
    hovertemplate='<b>Avg Temp</b><br>Date: %{x}<br>Temperature: %{y:.1f}°C<extra></extra>'
))

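# Add temperature range as a filled band between max and min.
# The first trace is an invisible upper bound; it is excluded from the
# unified hover box so the max temperature is not listed twice.
fig.add_trace(go.Scatter(
    x=df.index,
    y=df['max_temp_celsius'],
    fill=None,
    mode='lines',
    line_color='rgba(0,0,0,0)',
    showlegend=False,
    hoverinfo='skip',
    name='Temperature Range'
))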

fig.add_trace(go.Scatter(
    x=df.index,
    y=df['min_temp_celsius'],
    fill='tonexty',
    mode='lines',
    line_color='rgba(0,0,0,0)',
    name='Temperature Range',
    fillcolor='rgba(255,255,0,0.2)',
    hovertemplate='<b>Temp Range</b><br>Date: %{x}<br>Range: %{customdata:.1f}°C<extra></extra>',
    customdata=df['temp_range']
))

# Update layout with responsive width and legend on right
fig.update_layout(
    title={
        'text': f'🌡️ Temperature Analysis for {city_weather["city_info"]["name"]}, {city_weather["city_info"]["country"]}',
        'x': 0.5,
        'xanchor': 'center',
        'font': {'size': 20, 'family': 'Arial Black'}
    },
    xaxis_title='Date',
    yaxis_title='Temperature (°C)',
    hovermode='x unified',
    template='plotly_white',
    width=None,  # Auto-fit to page width
    height=600,
    legend=dict(
        orientation="v",
        yanchor="top",
        y=1,
        xanchor="left",
        x=1.02
    ),
    margin=dict(r=150)  # Add right margin for legend
)

# Add annotations for key statistics
temp_stats = {
    'max_temp': df['max_temp_celsius'].max(),
    'min_temp': df['min_temp_celsius'].min(),
    'avg_temp': df['avg_temp_celsius'].mean(),
    'temp_range_avg': df['temp_range'].mean()
}

# Add annotations to the plot
fig.add_annotation(
    xref="paper", yref="paper",
    x=0.02, y=0.98,
    text=f"Max Temp: {temp_stats['max_temp']:.1f}°C<br>Min Temp: {temp_stats['min_temp']:.1f}°C<br>Avg Temp: {temp_stats['avg_temp']:.1f}°C<br>Avg Range: {temp_stats['temp_range_avg']:.1f}°C",
    showarrow=False,
    font=dict(size=12),
    align="left",
    bgcolor="rgba(255,255,255,0.8)",
    bordercolor="black",
    borderwidth=1
)

# Show the plot
fig.show()

print("✅ Interactive temperature visualization created!")
print(f"📈 Data period: {df.index.min().strftime('%Y-%m-%d')} to {df.index.max().strftime('%Y-%m-%d')}")
print(f"📊 Total days analyzed: {len(df)}")

                 📊 INTERACTIVE PLOTLY VISUALIZATIONS                 


✅ Interactive temperature visualization created!
📈 Data period: 2024-06-01 to 2025-06-30
📊 Total days analyzed: 395


In [9]:
# Prepare the series
target_series = df['avg_temp_celsius'].dropna()
decomposition = seasonal_decompose(target_series, model='additive', period=30)

# Subplot setup
titles = [
    'Original Time Series',
    'Trend Component',
    'Seasonal Component (30-day cycle)',
    'Residual Component',
]

fig = make_subplots(
    rows=4,
    cols=1,
    shared_xaxes=True,
    subplot_titles=[f"🔹 {title}" for title in titles],
    vertical_spacing=0.06,
    row_heights=[0.3, 0.25, 0.25, 0.2],
)

# Color palette
colors = {
    'original': '#1f77b4',
    'trend': '#ff7f0e',
    'seasonal': '#2ca02c',
    'residual': '#d62728',
}

# Add traces: name -> (series, line color, line width)
components = {
    'Original': (decomposition.observed, colors['original'], 2),
    'Trend': (decomposition.trend, colors['trend'], 2.5),
    'Seasonal': (decomposition.seasonal, colors['seasonal'], 2),
    'Residual': (decomposition.resid, colors['residual'], 1.5),
}

for i, (name, (series, color, width)) in enumerate(components.items(), start=1):
    trace = go.Scatter(
        x=series.index,
        y=series,
        mode='lines',
        name=name,
        line=dict(color=color, width=width),
        hovertemplate=f'<b>{name}</b><br>Date: %{{x}}<br>Value: %{{y:.2f}}°C<extra></extra>'
    )
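    if name == "Seasonal":
        # Shade between the seasonal curve and zero. 'tonexty' behaves the same
        # here because no earlier trace shares this subplot; 'tozeroy' says so explicitly.
        trace.fill = 'tozeroy'
        trace.fillcolor = 'rgba(44, 160, 44, 0.1)'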
    fig.add_trace(trace, row=i, col=1)

# Layout
fig.update_layout(
    title=dict(
        text=(
            '🌡️ Time Series Decomposition - Average Temperature<br>'
            f"<sub>📍 {city_weather['city_info']['name']}, {city_weather['city_info']['country']} | 30-Day Seasonality</sub>"
        ),
        x=0.5,
        xanchor='center',
        font=dict(size=20),
    ),
    height=850,
    template='plotly_white',
    showlegend=True,
    hovermode='x unified',
    legend=dict(orientation="v", yanchor="top", y=0.98, xanchor="right", x=1.05, font=dict(size=12)),
    margin=dict(t=120, b=60, l=70, r=120)
)

# Axes: shared styling for every y-axis and the bottom x-axis
axis_style = dict(
    title_font=dict(size=14),
    tickfont=dict(size=12),
    showgrid=True,
    gridcolor='rgba(200,200,200,0.2)',
    showline=True,
    linecolor='gray',
    linewidth=1.5,
    mirror=True,
)

y_titles = ["Temp (°C)", "Trend (°C)", "Seasonality (°C)", "Residuals (°C)"]
for i, y_title in enumerate(y_titles, start=1):
    fig.update_yaxes(title_text=y_title, row=i, col=1, **axis_style)

fig.update_xaxes(title_text="Date", row=4, col=1, **axis_style)
fig.show()

print("=" * 60)
print("📊 DECOMPOSITION SUMMARY STATISTICS")
print("=" * 60)
print(f"🔢 Original Data Range   : {target_series.min():.1f}°C to {target_series.max():.1f}°C")
print(f"📈 Trend Range           : {decomposition.trend.min():.1f}°C to {decomposition.trend.max():.1f}°C")
print(f"🌀 Seasonal Variation    : {decomposition.seasonal.min():.1f}°C to {decomposition.seasonal.max():.1f}°C")
print(f"🔀 Residual Std Dev      : {decomposition.resid.std():.2f}°C")
print(f"📅 Analysis Period       : {len(target_series)} days")
print("=" * 60)

📊 DECOMPOSITION SUMMARY STATISTICS
🔢 Original Data Range   : 10.0°C to 31.4°C
📈 Trend Range           : 12.8°C to 28.8°C
🌀 Seasonal Variation    : -0.9°C to 0.7°C
🔀 Residual Std Dev      : 1.47°C
📅 Analysis Period       : 395 days
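
To make the summary above easier to read, here is a minimal hand-rolled sketch of a classical additive decomposition. It is not the `statsmodels` implementation used in the cell, only an illustration of the same idea under stated assumptions: `additive_decompose` is a hypothetical helper, and the `synthetic` series is made up for the demonstration. It shows why the trend range is narrower than the raw range (the trend is a centered moving average) and why the trend and residual are undefined for the first and last 15 days when `period=30`.

```python
import numpy as np
import pandas as pd


def additive_decompose(series, period):
    """Classical additive decomposition: observed = trend + seasonal + resid."""
    values = series.to_numpy(dtype=float)
    n = len(values)

    # Centered moving average; an even period needs half weights at both ends
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2

    trend = np.full(n, np.nan)
    trend[half:n - half] = np.convolve(values, weights, mode="valid")

    # Average the detrended values at each position within the cycle,
    # then centre them so the seasonal component sums to zero over one period
    detrended = values - trend
    phase = np.arange(n) % period
    phase_means = np.array(
        [np.nanmean(detrended[phase == p]) for p in range(period)]
    )
    phase_means -= phase_means.mean()
    seasonal = phase_means[phase]

    resid = values - trend - seasonal
    return pd.DataFrame(
        {"observed": values, "trend": trend, "seasonal": seasonal, "resid": resid},
        index=series.index,
    )


# Synthetic stand-in for avg_temp_celsius: yearly swing + 30-day ripple + noise
rng = np.random.default_rng(0)
idx = pd.date_range("2024-06-01", periods=395, freq="D")
t = np.arange(len(idx))
synthetic = pd.Series(
    20 + 8 * np.cos(2 * np.pi * t / 365)
    + np.sin(2 * np.pi * t / 30)
    + rng.normal(0, 1, len(idx)),
    index=idx,
)

parts = additive_decompose(synthetic, period=30)
print("Days without a trend estimate:", int(parts["trend"].isna().sum()))
print(parts.dropna().describe().round(2))
```

Because the edges of the trend are `NaN`, the residual standard deviation reported above is effectively computed over the interior days only.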


In [10]:
# Save the processed DataFrame to a CSV file so the models can reuse it
df.to_csv("../data/saved/processed_weather_features.csv")
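
`DataFrame.to_csv` does not create missing folders, and a CSV round trip turns the `date` index back into plain strings unless it is parsed on load. The sketch below shows one way to handle both; `save_features` and `load_features` are hypothetical helper names, and a temporary directory stands in for `../data/saved/` so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_features(frame, path):
    """Write the frame to CSV, creating the parent folder if it is missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path)
    return path


def load_features(path):
    """Read the CSV back with `date` restored as a DatetimeIndex."""
    return pd.read_csv(path, index_col="date", parse_dates=["date"])


# Tiny stand-in for the processed DataFrame (two rows from the preview)
demo = pd.DataFrame(
    {"avg_temp_celsius": [28.6, 26.9], "season_label": ["Summer", "Summer"]},
    index=pd.to_datetime(["2024-06-01", "2024-06-02"]),
)
demo.index.name = "date"

with tempfile.TemporaryDirectory() as tmp:
    csv_path = save_features(demo, Path(tmp) / "saved" / "processed_weather_features.csv")
    loaded = load_features(csv_path)

print(loaded.index.dtype)
print(loaded)
```

In a modeling notebook, `load_features("../data/saved/processed_weather_features.csv")` would then return the frame with a date index ready for time-based splits.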