# Feature Engineering - Hanoi Weather Forecasting

## Step 4: Advanced Feature Engineering for Temperature Forecasting

This notebook transforms raw daily weather data into features suitable for **5-day-ahead temperature forecasting**.

### 🎯 Forecasting Definition:
**Goal**: Use daily weather data to predict Hanoi temperature for the **next 5 days**

### 🚀 Feature Engineering Objectives:
1. **Temporal Features** - Extract time-based patterns (seasonality, trends)
2. **Lag Features** - Use historical temperature data (1-7, 14, and 30 days back)
3. **Rolling Statistics** - Moving averages and variability measures
4. **Text Feature Processing** - Transform weather descriptions into numerical features
5. **Cyclical Encoding** - Handle annual, monthly, and weekly cycles
6. **Weather Pattern Features** - Derive meaningful weather indicators
7. **Target Engineering** - Create future temperature targets for training

### 📊 Input → Output:
Raw 33 features → Engineered 100+ features → Ready for ML forecasting models
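
Objective 7 (target engineering) can be illustrated with a minimal sketch. It assumes a `temp` column as in the dataset used here; the helper name `make_multi_horizon_targets` and the `target_temp_t+h` column names are hypothetical and not taken from this notebook.

```python
import pandas as pd

def make_multi_horizon_targets(df, target_col="temp", horizon=5):
    """Add one target column per forecast step (t+1 ... t+horizon).

    Hypothetical helper for illustration; the column names are assumptions.
    """
    out = df.copy()
    for h in range(1, horizon + 1):
        # shift(-h) pulls the value from h rows (days) into the future
        out[f"target_{target_col}_t+{h}"] = out[target_col].shift(-h)
    return out

toy = pd.DataFrame({"temp": [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0]})
targets = make_multi_horizon_targets(toy)

# The last `horizon` rows have incomplete targets and are dropped before training
trainable = targets.dropna()
print(trainable)
```

Negative shifts look forward in time, so these columns must only ever be used as targets, never as model inputs.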

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Feature engineering libraries
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from datetime import datetime, timedelta
import re
from textblob import TextBlob
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette('plasma')
plt.rcParams['figure.figsize'] = (12, 6)

print("⚙️ Feature Engineering Libraries loaded successfully!")
print("🎯 Ready to engineer features for 5-day temperature forecasting")

⚙️ Feature Engineering Libraries loaded successfully!
🎯 Ready to engineer features for 5-day temperature forecasting


## 1. Load and Prepare Base Data

In [2]:
# Load the raw daily dataset
data_path = '../data/raw/Hanoi-Daily-10-years.csv'
df = pd.read_csv(data_path)

# Convert datetime and sort by date
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.sort_values('datetime').reset_index(drop=True)

print(f"📋 Base Dataset Shape: {df.shape}")
print(f"📅 Date Range: {df['datetime'].min()} to {df['datetime'].max()}")
print(f"🌡️ Target (temp) Range: {df['temp'].min():.1f}°C to {df['temp'].max():.1f}°C")

# Create feature engineering dataset
df_features = df.copy()

print("\n✅ Data loaded and sorted chronologically for time series feature engineering")

📋 Base Dataset Shape: (3660, 33)
📅 Date Range: 2015-09-20 00:00:00 to 2025-09-26 00:00:00
🌡️ Target (temp) Range: 7.0°C to 35.5°C

✅ Data loaded and sorted chronologically for time series feature engineering
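
Lag and rolling features are computed from row offsets, so they implicitly assume one row per calendar day with no gaps. The sketch below shows one way to check this; `find_missing_dates` is a hypothetical helper, not part of the notebook. The reported shape (3660 rows) equals the inclusive number of days between 2015-09-20 and 2025-09-26, which is consistent with a gap-free series as long as there are no duplicate dates.

```python
import pandas as pd

def find_missing_dates(dates):
    """Return calendar days absent from a collection of daily timestamps."""
    dates = pd.to_datetime(pd.Series(dates))
    full_range = pd.date_range(dates.min(), dates.max(), freq="D")
    return full_range.difference(pd.DatetimeIndex(dates))

# Toy example with one missing day (2024-01-03)
toy_dates = ["2024-01-01", "2024-01-02", "2024-01-04"]
missing = find_missing_dates(toy_dates)
print(list(missing.strftime("%Y-%m-%d")))

# Inclusive number of days in the date range reported above
expected_rows = len(pd.date_range("2015-09-20", "2025-09-26", freq="D"))
print(expected_rows)
```

On the real data, `find_missing_dates(df['datetime'])` returning an empty index would confirm that row offsets correspond to day offsets.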


## 2. Temporal Feature Engineering

In [3]:
def create_temporal_features(df):
    """
    Create comprehensive temporal features from datetime
    """
    df = df.copy()
    
    # Basic time components
    df['year'] = df['datetime'].dt.year
    df['month'] = df['datetime'].dt.month
    df['day'] = df['datetime'].dt.day
    df['dayofweek'] = df['datetime'].dt.dayofweek  # 0=Monday
    df['dayofyear'] = df['datetime'].dt.dayofyear
    df['week'] = df['datetime'].dt.isocalendar().week.astype(int)  # isocalendar() returns UInt32; cast to a plain integer
    df['quarter'] = df['datetime'].dt.quarter
    
    # Cyclical encoding for periodic features
    # Month (12-month cycle)
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
    
    # Day of year (annual cycle; 365.25 accounts for leap years)
    df['dayofyear_sin'] = np.sin(2 * np.pi * df['dayofyear'] / 365.25)
    df['dayofyear_cos'] = np.cos(2 * np.pi * df['dayofyear'] / 365.25)
    
    # Day of week (7-day cycle)
    df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
    df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)
    
    # Season indicators
    def get_season(month):
        if month in [12, 1, 2]:
            return 'winter'
        elif month in [3, 4, 5]:
            return 'spring'
        elif month in [6, 7, 8]:
            return 'summer'
        else:
            return 'autumn'
    
    df['season'] = df['month'].apply(get_season)
    
    # Binary season indicators
    df['is_winter'] = (df['season'] == 'winter').astype(int)
    df['is_spring'] = (df['season'] == 'spring').astype(int)
    df['is_summer'] = (df['season'] == 'summer').astype(int)
    df['is_autumn'] = (df['season'] == 'autumn').astype(int)
    
    # Time trends
    df['days_since_start'] = (df['datetime'] - df['datetime'].min()).dt.days
    df['years_since_start'] = df['days_since_start'] / 365.25
    
    return df

# Apply temporal feature engineering
df_features = create_temporal_features(df_features)

print("📅 Temporal Features Created:")
temporal_cols = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'week', 'quarter',
                'month_sin', 'month_cos', 'dayofyear_sin', 'dayofyear_cos',
                'dayofweek_sin', 'dayofweek_cos', 'season', 'is_winter', 'is_spring',
                'is_summer', 'is_autumn', 'days_since_start', 'years_since_start']

for col in temporal_cols:
    if col in df_features.columns:
        print(f"• {col}")

print(f"\n✅ Added {len([c for c in temporal_cols if c in df_features.columns])} temporal features")

📅 Temporal Features Created:
• year
• month
• day
• dayofweek
• dayofyear
• week
• quarter
• month_sin
• month_cos
• dayofyear_sin
• dayofyear_cos
• dayofweek_sin
• dayofweek_cos
• season
• is_winter
• is_spring
• is_summer
• is_autumn
• days_since_start
• years_since_start

✅ Added 20 temporal features
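
To see why the sine/cosine pairs are useful, consider a small sketch (the helper `encode_cyclical` is hypothetical and used only for illustration). With raw month numbers, December (12) and January (1) look far apart; on the unit circle they are neighbours.

```python
import numpy as np

def encode_cyclical(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jul = (encode_cyclical(m, 12) for m in (12, 1, 7))

# Euclidean distance in the encoded space reflects calendar proximity
dist_dec_jan = np.linalg.norm(dec - jan)
dist_jan_jul = np.linalg.norm(jan - jul)
print(f"Dec-Jan: {dist_dec_jan:.3f}, Jan-Jul: {dist_jan_jul:.3f}")
```

December and January end up much closer than January and July, which sit on opposite sides of the circle.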


## 3. Lag Features for Time Series Forecasting

In [4]:
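def create_lag_features(df, target_col='temp', lag_days=(1, 2, 3, 4, 5, 6, 7, 14, 30)):
    """
    Create lag features for temperature forecasting.

    Each lag column holds the value observed `lag` rows (days) earlier, so
    the first `lag` rows are NaN. Assumes one row per day, sorted by date.
    """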
    df = df.copy()
    
    # Temperature lag features
    for lag in lag_days:
        df[f'{target_col}_lag_{lag}'] = df[target_col].shift(lag)
        
    # Additional weather feature lags (shorter lags)
    weather_features = ['tempmax', 'tempmin', 'humidity', 'precip', 'windspeed', 'sealevelpressure']
    short_lags = [1, 2, 3, 7]
    
    for feature in weather_features:
        if feature in df.columns:
            for lag in short_lags:
                df[f'{feature}_lag_{lag}'] = df[feature].shift(lag)
    
    # Temperature differences (day-to-day changes)
    df['temp_diff_1d'] = df[target_col] - df[target_col].shift(1)
    df['temp_diff_2d'] = df[target_col] - df[target_col].shift(2)
    df['temp_diff_7d'] = df[target_col] - df[target_col].shift(7)
    
    return df

# Apply lag feature engineering
df_features = create_lag_features(df_features)

# Count lag features
lag_features = [col for col in df_features.columns if '_lag_' in col or 'temp_diff_' in col]
print(f"🔄 Created {len(lag_features)} lag and difference features:")
print("\nTemperature lags:")
temp_lags = [col for col in lag_features if col.startswith('temp_lag_')]
print(temp_lags)

print("\nTemperature differences:")
temp_diffs = [col for col in lag_features if 'temp_diff_' in col]
print(temp_diffs)

print(f"\n✅ Total lag-based features: {len(lag_features)}")

🔄 Created 36 lag and difference features:

Temperature lags:
['temp_lag_1', 'temp_lag_2', 'temp_lag_3', 'temp_lag_4', 'temp_lag_5', 'temp_lag_6', 'temp_lag_7', 'temp_lag_14', 'temp_lag_30']

Temperature differences:
['temp_diff_1d', 'temp_diff_2d', 'temp_diff_7d']

✅ Total lag-based features: 36
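
The behaviour of `shift` is easiest to see on a toy series. This is a minimal sketch with made-up values, not data from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({"temp": [20.0, 22.0, 21.0, 25.0, 24.0]})

# Positive shifts move past values forward, so each row only sees earlier days
toy["temp_lag_1"] = toy["temp"].shift(1)
toy["temp_lag_2"] = toy["temp"].shift(2)
toy["temp_diff_1d"] = toy["temp"] - toy["temp"].shift(1)

print(toy)

# Rows without a full history contain NaN
n_incomplete = int(toy.isna().any(axis=1).sum())
print(n_incomplete)
```

In the real dataset the longest lag is 30 days, so the first 30 rows have a missing `temp_lag_30` and need to be dropped or imputed before training.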


## 4. Rolling Window Statistics

In [5]:
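def create_rolling_features(df, target_col='temp', windows=(3, 7, 14, 30)):
    """
    Create rolling window statistical features.

    Each window ends at the current row (inclusive), so the first
    `window - 1` rows of every rolling-window column are NaN.
    """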
    df = df.copy()
    
    # Rolling statistics for temperature
    for window in windows:
        # Basic statistics
        df[f'{target_col}_rolling_mean_{window}'] = df[target_col].rolling(window=window).mean()
        df[f'{target_col}_rolling_std_{window}'] = df[target_col].rolling(window=window).std()
        df[f'{target_col}_rolling_min_{window}'] = df[target_col].rolling(window=window).min()
        df[f'{target_col}_rolling_max_{window}'] = df[target_col].rolling(window=window).max()
        
        # Range and variability
        df[f'{target_col}_rolling_range_{window}'] = (df[f'{target_col}_rolling_max_{window}'] - 
                                                     df[f'{target_col}_rolling_min_{window}'])
        
        # Relative position in range
        df[f'{target_col}_position_in_range_{window}'] = ((df[target_col] - df[f'{target_col}_rolling_min_{window}']) /
                                                         (df[f'{target_col}_rolling_range_{window}'] + 1e-8))
    
    # Rolling statistics for other key features
    other_features = ['humidity', 'precip', 'windspeed', 'solarradiation']
    short_windows = [3, 7, 14]
    
    for feature in other_features:
        if feature in df.columns:
            for window in short_windows:
                df[f'{feature}_rolling_mean_{window}'] = df[feature].rolling(window=window).mean()
                df[f'{feature}_rolling_std_{window}'] = df[feature].rolling(window=window).std()
    
    # Exponential moving averages
    df[f'{target_col}_ema_3'] = df[target_col].ewm(span=3).mean()
    df[f'{target_col}_ema_7'] = df[target_col].ewm(span=7).mean()
    df[f'{target_col}_ema_30'] = df[target_col].ewm(span=30).mean()
    
    return df

# Apply rolling feature engineering
df_features = create_rolling_features(df_features)

# Count rolling features
rolling_features = [col for col in df_features.columns if ('_rolling_' in col or '_ema_' in col or '_position_' in col)]
print(f"📊 Created {len(rolling_features)} rolling statistical features")

# Display some examples
temp_rolling = [col for col in rolling_features if col.startswith('temp_')][:10]
print("\nExample temperature rolling features:")
for feat in temp_rolling:
    print(f"• {feat}")

print(f"\n✅ Total rolling features: {len(rolling_features)}")

📊 Created 51 rolling statistical features

Example temperature rolling features:
• temp_rolling_mean_3
• temp_rolling_std_3
• temp_rolling_min_3
• temp_rolling_max_3
• temp_rolling_range_3
• temp_position_in_range_3
• temp_rolling_mean_7
• temp_rolling_std_7
• temp_rolling_min_7
• temp_rolling_max_7

✅ Total rolling features: 51
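
The rolling windows above end at the current day, which is appropriate when day *t* is already observed at forecast time. If that assumption does not hold, shifting before rolling restricts the window to strictly past days. The sketch below contrasts the two on made-up values.

```python
import pandas as pd

temp = pd.Series([20.0, 22.0, 21.0, 25.0, 24.0], name="temp")

# Window ends at the current day (the convention used in this notebook)
mean_incl_today = temp.rolling(window=3).mean()

# Alternative: shift first so the window covers only the previous 3 days
mean_past_only = temp.shift(1).rolling(window=3).mean()

print(pd.DataFrame({
    "temp": temp,
    "incl_today": mean_incl_today,
    "past_only": mean_past_only,
}))
```

The shifted variant produces one additional NaN row at the start, in exchange for never using the current day's value.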


## 5. Text Feature Engineering

In [6]:
def create_text_features(df):
    """
    Transform text features (conditions, description) into numerical features
    """
    df = df.copy()
    
    # 1. Weather Conditions Processing
    if 'conditions' in df.columns:
        # Clean and standardize conditions
        df['conditions_clean'] = df['conditions'].str.lower().str.strip()
        
        # Extract key weather patterns
        df['has_rain'] = df['conditions_clean'].str.contains('rain', na=False).astype(int)
        df['has_cloud'] = df['conditions_clean'].str.contains('cloud|overcast', na=False).astype(int)
        df['has_clear'] = df['conditions_clean'].str.contains('clear|sunny', na=False).astype(int)
        df['has_fog'] = df['conditions_clean'].str.contains('fog|mist', na=False).astype(int)
        df['has_storm'] = df['conditions_clean'].str.contains('storm|thunder', na=False).astype(int)
        df['has_wind'] = df['conditions_clean'].str.contains('wind', na=False).astype(int)
        
        # Count weather condition words
        df['conditions_word_count'] = df['conditions'].str.split().str.len().fillna(0)
        
        # One-hot encode most common conditions
        top_conditions = df['conditions_clean'].value_counts().head(10).index
        for condition in top_conditions:
            df[f'condition_{condition.replace(" ", "_").replace(",", "")}'] = (
                df['conditions_clean'] == condition
            ).astype(int)
    
    # 2. Description Processing
    if 'description' in df.columns:
        # Clean descriptions
        df['description_clean'] = df['description'].str.lower().str.strip()
        
        # Description length and complexity
        df['description_length'] = df['description'].str.len().fillna(0)
        df['description_word_count'] = df['description'].str.split().str.len().fillna(0)
        
        # Extract sentiment (positive/negative weather descriptions)
        def get_weather_sentiment(text):
            if pd.isna(text):
                return 0
            positive_words = ['clear', 'sunny', 'bright', 'pleasant', 'mild']
            negative_words = ['storm', 'heavy', 'severe', 'harsh', 'extreme']
            
            text_lower = text.lower()
            pos_count = sum(1 for word in positive_words if word in text_lower)
            neg_count = sum(1 for word in negative_words if word in text_lower)
            
            return pos_count - neg_count
        
        df['description_sentiment'] = df['description'].apply(get_weather_sentiment)
        
        # Extract specific weather events from description
        df['desc_has_rain'] = df['description_clean'].str.contains('rain', na=False).astype(int)
        df['desc_has_cloud'] = df['description_clean'].str.contains('cloud', na=False).astype(int)
        df['desc_has_sun'] = df['description_clean'].str.contains('sun|bright', na=False).astype(int)
    
    # 3. Icon feature processing
    if 'icon' in df.columns:
        # Label encode icons
        le_icon = LabelEncoder()
        df['icon_encoded'] = le_icon.fit_transform(df['icon'].fillna('unknown'))
        
        # One-hot encode most common icons
        top_icons = df['icon'].value_counts().head(8).index
        for icon in top_icons:
            df[f'icon_{icon}'] = (df['icon'] == icon).astype(int)
    
    return df

# Apply text feature engineering
df_features = create_text_features(df_features)

# Count text-derived features
text_features = [col for col in df_features.columns if any(x in col for x in 
                ['has_', 'condition_', 'desc_', 'icon_', 'sentiment', 'word_count', 'length'])]

print(f"📝 Created {len(text_features)} text-derived features:")
print("\nWeather pattern indicators:")
pattern_features = [col for col in text_features if col.startswith('has_')]
print(pattern_features)

print("\nCondition categories:")
condition_features = [col for col in text_features if col.startswith('condition_')][:5]
print(condition_features)

print(f"\n✅ Total text-derived features: {len(text_features)}")

📝 Created 24 text-derived features:

Weather pattern indicators:
['has_rain', 'has_cloud', 'has_clear', 'has_fog', 'has_storm', 'has_wind']

Condition categories:
['condition_rain_partially_cloudy', 'condition_partially_cloudy', 'condition_rain_overcast', 'condition_clear', 'condition_overcast']

✅ Total text-derived features: 24
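
The keyword indicators rely on `str.contains` with a regular-expression alternation and `na=False`, so missing text maps to 0 rather than NaN. A minimal sketch on made-up condition strings:

```python
import pandas as pd

conditions = pd.Series(["Rain, Partially cloudy", "Clear", None, "Overcast"])
clean = conditions.str.lower().str.strip()

flags = pd.DataFrame({
    # "a|b" is a regex alternation: matches if either keyword is present
    "has_rain": clean.str.contains("rain", na=False).astype(int),
    "has_cloud": clean.str.contains("cloud|overcast", na=False).astype(int),
    "has_clear": clean.str.contains("clear|sunny", na=False).astype(int),
})
print(flags)
```

Matching is by substring, so a keyword also fires inside longer words; this is worth keeping in mind when adding new keywords.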


In [9]:
# Combine all engineered feature groups, keeping only columns that exist
# Note: 'season' is a string column and needs encoding (or exclusion) before model training
final_feature_cols = temporal_cols + lag_features + rolling_features + text_features
final_feature_cols = [col for col in final_feature_cols if col in df_features.columns]
print(f"✅ Total final features: {len(final_feature_cols)}")

# Print all feature names
print("\nFinal feature columns:")
for col in final_feature_cols:
    print(f"• {col}")

✅ Total final features: 131

Final feature columns:
• year
• month
• day
• dayofweek
• dayofyear
• week
• quarter
• month_sin
• month_cos
• dayofyear_sin
• dayofyear_cos
• dayofweek_sin
• dayofweek_cos
• season
• is_winter
• is_spring
• is_summer
• is_autumn
• days_since_start
• years_since_start
• temp_lag_1
• temp_lag_2
• temp_lag_3
• temp_lag_4
• temp_lag_5
• temp_lag_6
• temp_lag_7
• temp_lag_14
• temp_lag_30
• tempmax_lag_1
• tempmax_lag_2
• tempmax_lag_3
• tempmax_lag_7
• tempmin_lag_1
• tempmin_lag_2
• tempmin_lag_3
• tempmin_lag_7
• humidity_lag_1
• humidity_lag_2
• humidity_lag_3
• humidity_lag_7
• precip_lag_1
• precip_lag_2
• precip_lag_3
• precip_lag_7
• windspeed_lag_1
• windspeed_lag_2
• windspeed_lag_3
• windspeed_lag_7
• sealevelpressure_lag_1
• sealevelpressure_lag_2
• sealevelpressure_lag_3
• sealevelpressure_lag_7
• temp_diff_1d
• temp_diff_2d
• temp_diff_7d
• temp_rolling_mean_3
• temp_rolling_std_3
• temp_rolling_min_3
• temp_rolling_max