# 🌡️ Hanoi Hourly Weather Data Exploration - Comprehensive Analysis

## Step 8: Hourly Temperature Forecasting
**Objective**: Explore the hourly weather dataset (`hanoi_weather_data_hourly.csv`) to understand patterns, features, and prepare for 5-day ahead temperature forecasting with hourly resolution.

**Dataset Overview**:
- **Source**: Visual Crossing Weather API - Hanoi Hourly Data
- **Records**: 87,696 hourly observations
- **Time Range**: 2015-09-27 to 2025-09-27 (10 years)
- **Features**: 28 columns per hourly record (weather variables plus location metadata)
- **Target**: Temperature forecasting for next 120 hours (5 days)
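
A 120-hour target can be framed as one shifted copy of the temperature series per forecast horizon. The sketch below illustrates that idea on a synthetic series; the helper name `make_horizon_targets`, the chosen horizons, and the synthetic values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def make_horizon_targets(temp: pd.Series, horizons=(1, 24, 120)) -> pd.DataFrame:
    """Build one target column per forecast horizon (in hours)."""
    # shift(-h) places the temperature observed h hours later on the current row
    targets = {f"temp_t+{h}h": temp.shift(-h) for h in horizons}
    return pd.DataFrame(targets, index=temp.index)

# Synthetic stand-in for the hourly series (hypothetical values, not the real data)
idx = pd.date_range("2024-01-01", periods=24 * 10, freq=pd.Timedelta(hours=1))
temp = pd.Series(20 + 5 * np.sin(2 * np.pi * idx.hour.to_numpy() / 24),
                 index=idx, name="temp")

targets = make_horizon_targets(temp)
# The last h rows of each column are NaN because no future observation exists yet
print(targets.isna().sum())
```

Rows whose targets are NaN would normally be dropped before training, so the longest horizon determines how many trailing hours are lost.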

---

### 📋 Analysis Roadmap

1. **Data Loading & Initial Exploration**
2. **Temporal Pattern Analysis** (Hourly, Daily, Seasonal)
3. **Feature Understanding & Distribution**
4. **Missing Value Assessment**
5. **Correlation Analysis**
6. **Diurnal Temperature Patterns**
7. **Weather Condition Analysis**
8. **Preparation for Feature Engineering**

In [1]:
# Import required libraries for hourly data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from datetime import datetime, timedelta
import os
import sys

warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# File paths for hourly data
DATA_PATH = '../data/raw/hanoi_weather_data_hourly.csv'
OUTPUT_PATH = '../outputs/'

print("🌡️ Libraries imported successfully!")
print("📊 Ready for hourly weather data exploration!")
print(f"📂 Data source: {DATA_PATH}")
print(f"💾 Outputs will be saved to: {OUTPUT_PATH}")

🌡️ Libraries imported successfully!
📊 Ready for hourly weather data exploration!
📂 Data source: ../data/raw/hanoi_weather_data_hourly.csv
💾 Outputs will be saved to: ../outputs/


In [2]:
# Load and examine the hourly dataset
df_hourly = pd.read_csv(DATA_PATH)

print("🌡️ HANOI HOURLY WEATHER DATASET OVERVIEW")
print("=" * 60)
print(f"📊 Dataset Shape: {df_hourly.shape}")
print(f"📅 Records: {df_hourly.shape[0]:,} hourly observations")
print(f"🔢 Features: {df_hourly.shape[1]} columns")
print(f"💾 Memory Usage: {df_hourly.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Display basic info
print("\n📋 DATASET INFO:")
print("-" * 40)
df_hourly.info()

print("\n🕐 DATE RANGE ANALYSIS:")
print("-" * 40)
df_hourly['datetime'] = pd.to_datetime(df_hourly['datetime'])
start, end = df_hourly['datetime'].min(), df_hourly['datetime'].max()
# Inclusive count of hourly slots; using .days * 24 would drop the final partial day
expected_hours = int((end - start) / pd.Timedelta(hours=1)) + 1
print(f"Start Date: {start}")
print(f"End Date: {end}")
print(f"Time Span: {(end - start).days} days")
print(f"Expected Hours: {expected_hours}")
print(f"Actual Hours: {len(df_hourly)}")
print(f"Data Completeness: {len(df_hourly) / expected_hours * 100:.2f}%")

print(f"\n🌡️ TEMPERATURE ANALYSIS:")
print("-" * 40)
print(f"Min Temperature: {df_hourly['temp'].min():.1f}°C")
print(f"Max Temperature: {df_hourly['temp'].max():.1f}°C")
print(f"Mean Temperature: {df_hourly['temp'].mean():.1f}°C")
print(f"Temperature Range: {df_hourly['temp'].max() - df_hourly['temp'].min():.1f}°C")

🌡️ HANOI HOURLY WEATHER DATASET OVERVIEW
📊 Dataset Shape: (87696, 28)
📅 Records: 87,696 hourly observations
🔢 Features: 28 columns
💾 Memory Usage: 50.16 MB

📋 DATASET INFO:
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87696 entries, 0 to 87695
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              87696 non-null  object 
 1   address           87696 non-null  object 
 2   resolvedAddress   87696 non-null  object 
 3   latitude          87696 non-null  int64  
 4   longitude         87696 non-null  float64
 5   datetime          87696 non-null  object 
 6   temp              87696 non-null  float64
 7   feelslike         87696 non-null  float64
 8   dew               87696 non-null  float64
 9   humidity          87696 non-null  float64
 10  precip            87659 non-null  float64
 11  precipprob        87696 non-null  int64  
 12  preciptype    

In [3]:
# Display first few rows and column analysis
print("📋 FIRST 5 ROWS:")
print("-" * 40)
display(df_hourly.head())

print("\n🔍 COLUMN ANALYSIS:")
print("-" * 40)
print("Column Names and Types:")
for i, (col, dtype) in enumerate(zip(df_hourly.columns, df_hourly.dtypes)):
    print(f"{i+1:2d}. {col:<20} | {str(dtype):<12} | Non-null: {df_hourly[col].count():,}")

print(f"\n📊 STATISTICAL SUMMARY (Numeric Columns):")
print("-" * 40)
display(df_hourly.describe())

📋 FIRST 5 ROWS:
----------------------------------------


| | name | address | resolvedAddress | latitude | longitude | datetime | temp | feelslike | dew | humidity | precip | precipprob | preciptype | snow | snowdepth | windgust | windspeed | winddir | sealevelpressure | cloudcover | visibility | solarradiation | solarenergy | uvindex | severerisk | conditions | icon | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hanoi | Hanoi | Hanoi | 21 | 105.85 | 2015-09-27 00:00:00 | 27.0 | 31.2 | 26.0 | 94.27 | 0.1 | 100 | rain | 0.0 | 0.0 | 6.5 | 5.4 | 106.6 | 1007.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | NaN | Rain | rain | obs |
| 1 | Hanoi | Hanoi | Hanoi | 21 | 105.85 | 2015-09-27 01:00:00 | 26.5 | 26.5 | 25.9 | 96.32 | 0.0 | 0 | NaN | 0.0 | 0.0 | 5.8 | 1.5 | 344.0 | 1008.0 | 1.3 | 8.9 | 0.0 | 0.0 | 0.0 | NaN | Clear | clear-night | obs |
| 2 | Hanoi | Hanoi | Hanoi | 21 | 105.85 | 2015-09-27 02:00:00 | 27.0 | 31.2 | 26.0 | 94.27 | 0.9 | 100 | rain | 0.0 | 0.0 | 8.3 | 0.9 | 20.2 | 1007.0 | 44.0 | 4.5 | 0.0 | 0.0 | 0.0 | NaN | Rain, Partially cloudy | rain | obs |
| 3 | Hanoi | Hanoi | Hanoi | 21 | 105.85 | 2015-09-27 03:00:00 | 27.0 | 31.2 | 26.0 | 94.27 | 0.6 | 100 | rain | 0.0 | 0.0 | 8.6 | 0.9 | 360.0 | 1007.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | NaN | Rain | rain | obs |
| 4 | Hanoi | Hanoi | Hanoi | 21 | 105.85 | 2015-09-27 04:00:00 | 26.3 | 26.3 | 25.7 | 96.87 | 0.0 | 0 | NaN | 0.0 | 0.0 | 9.7 | 1.2 | 12.0 | 1007.7 | 0.5 | 8.6 | 0.0 | 0.0 | 0.0 | NaN | Clear | clear-night | obs |



🔍 COLUMN ANALYSIS:
----------------------------------------
Column Names and Types:
 1. name                 | object       | Non-null: 87,696
 2. address              | object       | Non-null: 87,696
 3. resolvedAddress      | object       | Non-null: 87,696
 4. latitude             | int64        | Non-null: 87,696
 5. longitude            | float64      | Non-null: 87,696
 6. datetime             | datetime64[ns] | Non-null: 87,696
 7. temp                 | float64      | Non-null: 87,696
 8. feelslike            | float64      | Non-null: 87,696
 9. dew                  | float64      | Non-null: 87,696
10. humidity             | float64      | Non-null: 87,696
11. precip               | float64      | Non-null: 87,659
12. precipprob           | int64        | Non-null: 87,696
13. preciptype           | object       | Non-null: 7,503
14. snow                 | float64      | Non-null: 87,655
15. snowdepth            | float64      | Non-null: 87,655
16. windgust             | fl

| | latitude | longitude | datetime | temp | feelslike | dew | humidity | precip | precipprob | snow | snowdepth | windgust | windspeed | winddir | sealevelpressure | cloudcover | visibility | solarradiation | solarenergy | uvindex | severerisk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 87696.0 | 87696.0 | 87696 | 87696.0 | 87696.0 | 87696.0 | 87696.0 | 87659.0 | 87696.0 | 87655.0 | 87655.0 | 87648.0 | 87695.0 | 87694.0 | 87696.0 | 87696.0 | 87480.0 | 87660.0 | 87660.0 | 87660.0 | 3840.0 |
| mean | 21.0 | 105.85 | 2020-09-26 23:30:00 | 24.837318 | 27.309565 | 20.421809 | 78.247121 | 0.218036 | 8.152025 | 0.0 | 0.0 | 17.341566 | 9.221488 | 142.935597 | 1011.197993 | 64.407258 | 8.603593 | 154.215529 | 0.554636 | 1.535467 | 10.844271 |
| min | 21.0 | 105.85 | 2015-09-27 00:00:00 | 5.9 | 2.1 | -8.0 | 17.59 | 0.0 | 0.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 985.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 25% | 21.0 | 105.85 | 2018-03-28 11:45:00 | 21.0 | 21.0 | 17.0 | 67.15 | 0.0 | 0.0 | 0.0 | 0.0 | 11.5 | 5.4 | 65.0 | 1005.0 | 34.5 | 7.0 | 0.0 | 0.0 | 0.0 | 10.0 |
| 50% | 21.0 | 105.85 | 2020-09-26 23:30:00 | 25.7 | 25.7 | 22.1 | 81.86 | 0.0 | 0.0 | 0.0 | 0.0 | 16.6 | 8.5 | 118.0 | 1011.0 | 81.2 | 10.0 | 5.5 | 0.0 | 0.0 | 10.0 |
| 75% | 21.0 | 105.85 | 2023-03-29 11:15:00 | 29.0 | 34.0 | 25.0 | 91.45 | 0.0 | 0.0 | 0.0 | 0.0 | 22.0 | 12.0 | 189.4 | 1017.0 | 90.0 | 10.0 | 246.7 | 0.9 | 2.0 | 10.0 |
| max | 21.0 | 105.85 | 2025-09-27 23:00:00 | 41.9 | 52.5 | 31.0 | 100.0 | 186.24 | 100.0 | 0.0 | 0.0 | 118.8 | 53.6 | 360.0 | 1041.0 | 100.0 | 24.1 | 1003.4 | 3.6 | 10.0 | 75.0 |
| std | 0.0 | 1.744819e-10 | NaN | 5.589906 | 8.482175 | 5.991473 | 15.862294 | 2.678485 | 27.363398 | 0.0 | 0.0 | 7.701515 | 4.787909 | 105.093954 | 7.317338 | 34.582003 | 2.802476 | 234.408601 | 0.844611 | 2.357551 | 6.015065 |
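
The summary shows several columns that never vary (`latitude`, `longitude`, `snow`, `snowdepth`, and the location strings), so they carry no signal for forecasting. One possible way to screen them out is sketched below; the helper name `drop_constant_columns` and the small sample frame are hypothetical, and the real notebook may prefer an explicit column list.

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Remove columns that hold at most one distinct value (NaNs ignored)."""
    constant = [col for col in df.columns if df[col].nunique(dropna=True) <= 1]
    return df.drop(columns=constant)

# Tiny synthetic frame (hypothetical values) that mimics the dataset layout
sample = pd.DataFrame({
    "name": ["Hanoi"] * 4,
    "latitude": [21, 21, 21, 21],
    "snow": [0.0, 0.0, None, 0.0],
    "temp": [27.0, 26.5, 27.0, 26.3],
})

print(drop_constant_columns(sample).columns.tolist())  # ['temp']
```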


## 🕐 Hourly vs Daily Data Comparison

Key differences between hourly and daily datasets:

### **Dataset Size**
- **Daily**: ~3,650 records (10 years × 365 days)
- **Hourly**: 87,696 records (24 observations per day over the same period)
- **Scale**: ~24x more data points

### **Unique Hourly Features**
1. **Diurnal Patterns**: Within-day temperature cycles
2. **Rush Hour Effects**: Human activity patterns
3. **Rapid Weather Changes**: Storm fronts, temperature drops
4. **Solar Radiation Cycles**: Hourly solar energy variations
5. **Wind Pattern Changes**: Hourly wind direction/speed variations

### **New Forecasting Opportunities**
- **Intraday Predictions**: Specific hour temperature forecasts
- **Peak Time Identification**: Daily max/min timing
- **Weather Event Timing**: Precise precipitation/storm timing
- **Energy Management**: Hour-by-hour HVAC optimization
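
To make the hourly-versus-daily relationship concrete, the sketch below collapses an hourly series into daily statistics and also recovers the hour at which each day peaks, which a daily dataset cannot provide. The data is synthetic (a sine curve set to peak at 15:00), and names such as `hour_of_max` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperatures for three days (hypothetical values)
idx = pd.date_range("2024-01-01", periods=24 * 3, freq=pd.Timedelta(hours=1))
hours = idx.hour.to_numpy()
hourly = pd.DataFrame({
    "datetime": idx,
    "temp": 20 + 5 * np.sin(2 * np.pi * (hours - 9) / 24),
})

temp = hourly.set_index("datetime")["temp"]

# Daily view: one row per calendar day, as in a daily dataset
daily = temp.resample("D").agg(["mean", "min", "max"])

# Hourly-only information: the hour of day at which each daily maximum occurs
daily["hour_of_max"] = temp.groupby(temp.index.normalize()).idxmax().dt.hour

print(daily)
```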

In [4]:
# Diurnal Pattern Analysis - The Heart of Hourly Data
print("🌅 DIURNAL TEMPERATURE PATTERNS")
print("=" * 50)

# Extract hour from datetime
df_hourly['hour'] = df_hourly['datetime'].dt.hour
df_hourly['month'] = df_hourly['datetime'].dt.month
df_hourly['day_of_week'] = df_hourly['datetime'].dt.dayofweek

# Calculate hourly averages
hourly_avg_temp = df_hourly.groupby('hour')['temp'].agg(['mean', 'std', 'min', 'max']).round(2)
print("Average Temperature by Hour of Day:")
print(hourly_avg_temp)

# Create diurnal pattern visualization
# Layout: two panels on top and one full-width heatmap below (three subplots in total)
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Hourly Temperature Pattern', 'Temperature Range by Hour',
                   'Monthly-Hourly Heatmap'),
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"colspan": 2}, None]],
    vertical_spacing=0.12
)

# Plot 1: Hourly average temperature
fig.add_trace(
    go.Scatter(x=hourly_avg_temp.index, y=hourly_avg_temp['mean'],
              mode='lines+markers', name='Average Temperature',
              line=dict(color='red', width=3)),
    row=1, col=1
)

# Plot 2: Temperature range by hour
fig.add_trace(
    go.Scatter(x=hourly_avg_temp.index, y=hourly_avg_temp['max'],
              mode='lines', name='Max Temperature', line=dict(color='orange')),
    row=1, col=2
)
fig.add_trace(
    go.Scatter(x=hourly_avg_temp.index, y=hourly_avg_temp['min'],
              mode='lines', name='Min Temperature', line=dict(color='blue')),
    row=1, col=2
)

# Plot 3: Monthly-Hourly heatmap
monthly_hourly = df_hourly.groupby(['month', 'hour'])['temp'].mean().unstack(level=0)
fig.add_trace(
    go.Heatmap(z=monthly_hourly.values, x=monthly_hourly.columns, y=monthly_hourly.index,
              colorscale='RdYlBu_r', name='Temperature (°C)'),
    row=2, col=1
)

fig.update_layout(height=800, showlegend=True, 
                 title_text="🌡️ Diurnal Temperature Patterns Analysis")
fig.show()

print(f"\n🔍 KEY INSIGHTS:")
print("-" * 30)
coolest_hour = hourly_avg_temp['mean'].idxmin()
warmest_hour = hourly_avg_temp['mean'].idxmax()
print(f"Coolest hour: {coolest_hour}:00 ({hourly_avg_temp.loc[coolest_hour, 'mean']:.1f}°C)")
print(f"Warmest hour: {warmest_hour}:00 ({hourly_avg_temp.loc[warmest_hour, 'mean']:.1f}°C)")
print(f"Daily temperature swing: {hourly_avg_temp['mean'].max() - hourly_avg_temp['mean'].min():.1f}°C")
print(f"Most variable hour: {hourly_avg_temp['std'].idxmax()}:00 (std: {hourly_avg_temp['std'].max():.1f}°C)")
print(f"Least variable hour: {hourly_avg_temp['std'].idxmin()}:00 (std: {hourly_avg_temp['std'].min():.1f}°C)")

🌅 DIURNAL TEMPERATURE PATTERNS
Average Temperature by Hour of Day:
       mean   std  min   max
hour                        
0     23.42  4.88  7.0  34.0
1     23.40  4.85  7.3  32.6
2     22.94  4.83  7.0  32.0
3     22.74  4.81  6.0  32.0
4     22.89  4.83  6.4  31.5
5     22.49  4.81  6.0  31.5
6     22.49  4.89  6.0  31.5
7     23.06  5.09  6.1  33.5
8     23.78  5.30  6.0  34.5
9     24.72  5.46  6.0  36.5
10    25.64  5.58  5.9  38.3
11    26.36  5.70  6.0  38.5
12    26.96  5.78  6.0  40.0
13    27.52  5.85  6.4  40.8
14    27.65  5.91  7.0  41.0
15    27.67  5.90  7.0  40.5
16    27.64  5.85  7.5  41.9
17    26.72  5.72  7.0  40.0
18    25.87  5.51  7.0  39.0
19    25.37  5.27  7.3  38.0
20    24.67  5.13  7.0  37.0
21    24.29  4.99  7.0  36.0
22    24.11  4.91  7.3  34.2
23    23.68  4.88  7.0  34.0



🔍 KEY INSIGHTS:
------------------------------
Coolest hour: 5:00 (22.5°C)
Warmest hour: 15:00 (27.7°C)
Daily temperature swing: 5.2°C
Most variable hour: 14:00 (std: 5.9°C)
Least variable hour: 3:00 (std: 4.8°C)
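
Because the diurnal cycle wraps around midnight, a raw `hour` value of 23 looks far from 0 even though the two are adjacent. A common way to expose the cycle to a model is a sine/cosine encoding, sketched here; the helper name `add_cyclical_hour` and the column names `hour_sin` and `hour_cos` are assumptions rather than something the notebook defines.

```python
import numpy as np
import pandas as pd

def add_cyclical_hour(df: pd.DataFrame, hour_col: str = "hour") -> pd.DataFrame:
    """Append sine/cosine encodings of the hour of day (period of 24 hours)."""
    out = df.copy()
    angle = 2 * np.pi * out[hour_col] / 24
    out["hour_sin"] = np.sin(angle)
    out["hour_cos"] = np.cos(angle)
    return out

# One row per hour of the day
demo = add_cyclical_hour(pd.DataFrame({"hour": range(24)}))

def encoded_distance(h1: int, h2: int) -> float:
    """Euclidean distance between two hours in the encoded (sin, cos) space."""
    a = demo.loc[h1, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    b = demo.loc[h2, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))

# In the encoded space, 23:00 -> 00:00 is as close as 00:00 -> 01:00
print(encoded_distance(23, 0), encoded_distance(0, 1))
```

The same transformation could be applied to `month` with a period of 12 if seasonal wrap-around matters.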


In [5]:
# Missing Values Analysis for Hourly Data
print("🕳️ MISSING VALUES ANALYSIS")
print("=" * 40)

missing_summary = pd.DataFrame({
    'Column': df_hourly.columns,
    'Missing_Count': df_hourly.isnull().sum(),
    'Missing_Percentage': (df_hourly.isnull().sum() / len(df_hourly) * 100).round(2),
    'Data_Type': df_hourly.dtypes
})

missing_summary = missing_summary[missing_summary['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)
print("Columns with Missing Values:")
print(missing_summary)

# Visualize missing values pattern
if len(missing_summary) > 0:
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=missing_summary['Column'],
        y=missing_summary['Missing_Percentage'],
        text=missing_summary['Missing_Count'],
        textposition='auto',
        name='Missing Values %'
    ))
    fig.update_layout(
        title='Missing Values by Column (Hourly Dataset)',
        xaxis_title='Columns',
        yaxis_title='Missing Percentage (%)',
        height=400
    )
    fig.show()
else:
    print("✅ No missing values found in the dataset!")

# Check for temporal gaps in hourly sequence
print(f"\n🕐 TEMPORAL SEQUENCE ANALYSIS:")
print("-" * 40)
df_sorted = df_hourly.sort_values('datetime')
time_diffs = df_sorted['datetime'].diff()
expected_diff = pd.Timedelta(hours=1)

gaps = time_diffs[time_diffs > expected_diff]
if len(gaps) > 0:
    print(f"Found {len(gaps)} gaps in hourly sequence:")
    gap_summary = df_sorted[time_diffs > expected_diff][['datetime']].head(10)
    print(gap_summary)
    print(f"Largest gap: {gaps.max()}")
else:
    print("✅ No gaps found in hourly sequence!")

print(f"\n📊 TEMPORAL DATA QUALITY METRICS:")
print("-" * 40)
# Inclusive count of hourly slots between the first and last timestamp (+1 for the endpoint)
total_expected_hours = (df_hourly['datetime'].max() - df_hourly['datetime'].min()).total_seconds() / 3600 + 1
completeness = len(df_hourly) / total_expected_hours * 100
print(f"Expected total hours: {total_expected_hours:.0f}")
print(f"Actual hours recorded: {len(df_hourly)}")
print(f"Data completeness: {completeness:.2f}%")
print(f"Missing hours: {total_expected_hours - len(df_hourly):.0f}")

🕳️ MISSING VALUES ANALYSIS
Columns with Missing Values:
                        Column  Missing_Count  Missing_Percentage Data_Type
severerisk          severerisk          83856               95.62   float64
preciptype          preciptype          80193               91.44    object
visibility          visibility            216                0.25   float64
snow                      snow             41                0.05   float64
windgust              windgust             48                0.05   float64
snowdepth            snowdepth             41                0.05   float64
precip                  precip             37                0.04   float64
solarradiation  solarradiation             36                0.04   float64
solarenergy        solarenergy             36                0.04   float64
uvindex                uvindex             36                0.04   float64
winddir                winddir              2                0.00   float64
windspeed            windspeed  


🕐 TEMPORAL SEQUENCE ANALYSIS:
----------------------------------------
✅ No gaps found in hourly sequence!

📊 TEMPORAL DATA QUALITY METRICS:
----------------------------------------
Expected total hours: 87696
Actual hours recorded: 87696
Data completeness: 100.00%
Missing hours: 0
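
Comparing a row count with the covered time span cannot tell a clean series apart from one where a duplicated timestamp masks a missing hour. A stricter audit compares the observed timestamps against a complete hourly range; the sketch below shows one way to do that, with the helper name `audit_hourly_index` and the sample timestamps being hypothetical.

```python
import pandas as pd

def audit_hourly_index(timestamps: pd.Series) -> dict:
    """Count expected, observed, duplicated, and missing hourly timestamps."""
    ts = pd.to_datetime(timestamps)
    # Complete hourly grid from the first to the last observation (inclusive)
    full = pd.date_range(ts.min(), ts.max(), freq=pd.Timedelta(hours=1))
    # Hours on the grid that never appear in the data
    missing = full.difference(pd.DatetimeIndex(ts.drop_duplicates()))
    return {
        "expected": len(full),
        "observed": len(ts),
        "duplicates": int(ts.duplicated().sum()),
        "missing": len(missing),
    }

# Hypothetical timestamps: 01:00 appears twice, 02:00 and 03:00 are absent
stamps = pd.Series(pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 01:00",
    "2024-01-01 01:00", "2024-01-01 04:00",
]))

print(audit_hourly_index(stamps))
```

On the real data this would be called as `audit_hourly_index(df_hourly['datetime'])`; a clean series reports zero duplicates and zero missing hours.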
