# ENSF 611 Final Project: Weather Temperature Prediction
## Improving Temperature Forecasts Using Machine Learning

**Group Members:** Cameron Dunn, Manuja Senanayake, Edmund Yu, Zohara Kamal

### Project Overview
This project aims to improve on standard 24-hour-ahead temperature forecasts for Calgary by modeling the discrepancy between forecast and observed temperatures. We'll train and compare three regression models:
- Linear Regression (baseline)
- Support Vector Regressor (SVR)
- Gradient Boosting Regressor
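
As a rough illustration of how the three models listed above could be compared on equal footing, the sketch below runs each one through the same 5-fold cross-validation. The data is synthetic and the hyperparameters (`C=10.0`, the RBF kernel, 5 folds) are assumptions chosen for the example, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)

# Synthetic stand-in features and target (NOT the project's data).
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - X[:, 1] + np.sin(X[:, 2]) + rng.normal(0.0, 0.3, size=300)

# SVR is sensitive to feature scale, so every model shares the same scaling step.
models = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
    "Gradient Boosting": make_pipeline(
        StandardScaler(), GradientBoostingRegressor(random_state=42)
    ),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    # scikit-learn reports negated MAE so that higher is always better.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    results[name] = -scores.mean()
    print(f"{name:<18} MAE = {results[name]:.3f}")
```

Putting the scaler inside each pipeline means it is refit on the training folds only, so no statistics leak from the validation fold. With hourly weather data, shuffled folds place neighbouring hours in both training and validation sets; a chronological scheme such as scikit-learn's `TimeSeriesSplit` may give a more conservative estimate.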

### Objective
Can a machine learning model learn systematic patterns in the discrepancy between the temperature forecast 24 hours in advance and the observed temperature, and use them to correct the forecast into a more accurate prediction?
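
One way to frame this objective, sketched below under stated assumptions, is to train a model on the forecast error (observed minus forecast) and add its prediction back to the raw forecast. The raw forecast's own MAE is then the baseline to beat. The data here is synthetic, with a made-up humidity-dependent warm bias, and the 80/20 chronological split is an assumption for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)

# Synthetic stand-in data (NOT the project's data): the "forecast" runs
# warm by a humidity-dependent amount, plus noise.
n = 2000
observed = rng.normal(6.0, 11.0, size=n)
humidity = rng.uniform(10.0, 100.0, size=n)
forecast = observed + 0.03 * humidity + rng.normal(0.0, 1.0, size=n)

# Target: the forecast error the model should learn to predict.
error = observed - forecast

# Chronological-style split: first 80% for training, last 20% for testing.
split = int(0.8 * n)
X = np.column_stack([forecast, humidity])
model = LinearRegression().fit(X[:split], error[:split])

# Corrected prediction = raw forecast + predicted error.
corrected = forecast[split:] + model.predict(X[split:])

baseline_mae = mean_absolute_error(observed[split:], forecast[split:])
corrected_mae = mean_absolute_error(observed[split:], corrected)
print(f"Raw forecast MAE:       {baseline_mae:.2f} °C")
print(f"Corrected forecast MAE: {corrected_mae:.2f} °C")
```

A model only answers the objective with "yes" if its corrected MAE on held-out data is lower than the raw forecast's MAE on the same rows.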

## 1. Import Libraries

In [11]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning - Preprocessing
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler

# Machine Learning - Models
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

# Machine Learning - Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load Data

In [12]:
# Load forecast data (24-hour advance predictions)
forecast_df = pd.read_csv('data/raw/forecast.csv', skiprows=2)

# Load observed temperature data
observed_df = pd.read_csv('data/raw/observed.csv', skiprows=2)

print(f"Forecast data shape: {forecast_df.shape}")
print(f"Observed data shape: {observed_df.shape}")
print("\nForecast data columns:")
print(forecast_df.columns.tolist())

Forecast data shape: (16080, 15)
Observed data shape: (16080, 2)

Forecast data columns:
['time', 'temperature_2m_previous_day1 (°C)', 'relative_humidity_2m_previous_day1 (%)', 'dew_point_2m_previous_day1 (°C)', 'apparent_temperature_previous_day1 (°C)', 'precipitation_previous_day1 (mm)', 'rain_previous_day1 (mm)', 'showers_previous_day1 (mm)', 'snowfall_previous_day1 (cm)', 'weather_code_previous_day1 (wmo code)', 'pressure_msl_previous_day1 (hPa)', 'surface_pressure_previous_day1 (hPa)', 'cloud_cover_previous_day1 (%)', 'wind_speed_10m_previous_day1 (km/h)', 'wind_direction_10m_previous_day1 (°)']
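
Both files share a `time` column and the same row count, so a natural next step is to align them on timestamp. The sketch below shows one possible approach on tiny in-memory stand-ins for `forecast_df` and `observed_df`; the forecast values and the `forecast_error` column name are illustrative assumptions.

```python
import pandas as pd

# Tiny in-memory stand-ins for forecast_df and observed_df (forecast values are illustrative).
forecast_demo = pd.DataFrame({
    "time": ["2024-01-01T00:00", "2024-01-01T01:00", "2024-01-01T02:00"],
    "temperature_2m_previous_day1 (°C)": [None, 2.5, 2.1],
})
observed_demo = pd.DataFrame({
    "time": ["2024-01-01T00:00", "2024-01-01T01:00", "2024-01-01T02:00"],
    "temperature_2m (°C)": [3.2, 2.8, 2.6],
})

# Parse timestamps so joins and later time-based features operate on datetimes.
for frame in (forecast_demo, observed_demo):
    frame["time"] = pd.to_datetime(frame["time"])

# validate="one_to_one" raises if either frame has duplicate timestamps.
merged = forecast_demo.merge(observed_demo, on="time", how="inner", validate="one_to_one")

# Forecast error: observed minus forecast (NaN where the forecast is missing).
merged["forecast_error"] = (
    merged["temperature_2m (°C)"] - merged["temperature_2m_previous_day1 (°C)"]
)
print(merged)
```

An inner join keeps only timestamps present in both files, which guards against silently misaligned rows if one export ever covers a different date range.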


## 3. Inspect the Data

In [13]:
# Display first few rows of forecast data
print("Forecast Data:")
display(forecast_df.head())

print("\nObserved Data:")
display(observed_df.head())

# Check data types (info() prints its report directly and returns None,
# so wrapping it in print() would add a stray "None" line)
print("\nForecast Data Info:")
forecast_df.info()

print("\nObserved Data Info:")
observed_df.info()

Forecast Data:


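|  | time | temperature_2m_previous_day1 (°C) | relative_humidity_2m_previous_day1 (%) | dew_point_2m_previous_day1 (°C) | apparent_temperature_previous_day1 (°C) | precipitation_previous_day1 (mm) | rain_previous_day1 (mm) | showers_previous_day1 (mm) | snowfall_previous_day1 (cm) | weather_code_previous_day1 (wmo code) | pressure_msl_previous_day1 (hPa) | surface_pressure_previous_day1 (hPa) | cloud_cover_previous_day1 (%) | wind_speed_10m_previous_day1 (km/h) | wind_direction_10m_previous_day1 (°) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-01-01T00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2024-01-01T01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2024-01-01T02:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2024-01-01T03:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2024-01-01T04:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |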



Observed Data:


|  | time | temperature_2m (°C) |
|---|---|---|
| 0 | 2024-01-01T00:00 | 3.2 |
| 1 | 2024-01-01T01:00 | 2.8 |
| 2 | 2024-01-01T02:00 | 2.6 |
| 3 | 2024-01-01T03:00 | 2.2 |
| 4 | 2024-01-01T04:00 | 1.9 |



Forecast Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16080 entries, 0 to 16079
Data columns (total 15 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   time                                     16080 non-null  object 
 1   temperature_2m_previous_day1 (°C)        15636 non-null  float64
 2   relative_humidity_2m_previous_day1 (%)   15636 non-null  float64
 3   dew_point_2m_previous_day1 (°C)          15636 non-null  float64
 4   apparent_temperature_previous_day1 (°C)  15636 non-null  float64
 5   precipitation_previous_day1 (mm)         15636 non-null  float64
 6   rain_previous_day1 (mm)                  7902 non-null   float64
 7   showers_previous_day1 (mm)               15636 non-null  float64
 8   snowfall_previous_day1 (cm)              15636 non-null  float64
 9   weather_code_previous_day1 (wmo code)    15636 non-null  float64
 10  pressure_msl_previous_day

In [14]:
# Checking for missing values
print("Missing values in forecast data:")
print(forecast_df.isnull().sum())

print("\nMissing values in observed data:")
print(observed_df.isnull().sum())

Missing values in forecast data:
time                                          0
temperature_2m_previous_day1 (°C)           444
relative_humidity_2m_previous_day1 (%)      444
dew_point_2m_previous_day1 (°C)             444
apparent_temperature_previous_day1 (°C)     444
precipitation_previous_day1 (mm)            444
rain_previous_day1 (mm)                    8178
showers_previous_day1 (mm)                  444
snowfall_previous_day1 (cm)                 444
weather_code_previous_day1 (wmo code)       444
pressure_msl_previous_day1 (hPa)            444
surface_pressure_previous_day1 (hPa)        444
cloud_cover_previous_day1 (%)               444
wind_speed_10m_previous_day1 (km/h)         444
wind_direction_10m_previous_day1 (°)        444
dtype: int64

Missing values in observed data:
time                   0
temperature_2m (°C)    0
dtype: int64
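
The counts above show two distinct patterns: `rain_previous_day1 (mm)` is missing in 8,178 of 16,080 rows (about 51%), while every other forecast column is missing in the same 444 rows. One possible cleaning strategy, sketched here on a small stand-in frame, is to drop columns that are mostly empty and then drop the rows without a forecast. The 50% cutoff is a hypothetical threshold chosen for the example.

```python
import numpy as np
import pandas as pd

# Small stand-in frame mimicking the missingness pattern above (values are illustrative).
demo = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=6, freq=pd.Timedelta(hours=1)),
    "temperature_2m_previous_day1 (°C)": [np.nan, 2.5, 2.1, 1.8, 1.5, 1.1],
    "rain_previous_day1 (mm)": [np.nan, np.nan, np.nan, np.nan, 0.0, 0.0],
})

# Hypothetical cutoff: drop any column missing in more than half of the rows.
MAX_MISSING_FRACTION = 0.5
missing_fraction = demo.isnull().mean()
sparse_columns = missing_fraction[missing_fraction > MAX_MISSING_FRACTION].index.tolist()
cleaned = demo.drop(columns=sparse_columns)

# Drop the remaining rows that lack a forecast value.
cleaned = cleaned.dropna().reset_index(drop=True)

print(f"Dropped columns: {sparse_columns}")
print(f"Rows kept: {len(cleaned)} of {len(demo)}")
```

Dropping the sparse column first matters: calling `dropna()` on the full frame would discard every row where rain is missing, which in the real data is roughly half of the dataset.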


In [15]:
# Basic statistics
print("Forecast Data Statistics:")
display(forecast_df.describe())

print("\nObserved Data Statistics:")
display(observed_df.describe())

Forecast Data Statistics:


|  | temperature_2m_previous_day1 (°C) | relative_humidity_2m_previous_day1 (%) | dew_point_2m_previous_day1 (°C) | apparent_temperature_previous_day1 (°C) | precipitation_previous_day1 (mm) | rain_previous_day1 (mm) | showers_previous_day1 (mm) | snowfall_previous_day1 (cm) | weather_code_previous_day1 (wmo code) | pressure_msl_previous_day1 (hPa) | surface_pressure_previous_day1 (hPa) | cloud_cover_previous_day1 (%) | wind_speed_10m_previous_day1 (km/h) | wind_direction_10m_previous_day1 (°) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 15636.0 | 15636.0 | 15636.0 | 15636.0 | 15636.0 | 7902.0 | 15636.0 | 15636.0 | 15636.0 | 15636.0 | 15636.0 | 15636.0 | 15636.0 | 15636.0 |
| mean | 6.847845 | 59.188923 | -1.743541 | 3.466289 | 0.066436 | 0.067957 | 0.0 | 0.006585 | 7.17012 | 1014.141954 | 894.137305 | 67.899463 | 10.074354 | 205.674405 |
| std | 11.225072 | 21.182587 | 9.164973 | 12.235256 | 0.653012 | 0.748325 | 0.0 | 0.052957 | 17.950868 | 8.230116 | 7.305635 | 42.550855 | 6.472498 | 109.799991 |
| min | -28.3 | 9.0 | -34.8 | -33.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 984.6 | 866.6 | 0.0 | 0.0 | 1.0 |
| 25% | -0.3 | 42.0 | -7.5 | -4.5 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1008.9 | 889.7 | 15.0 | 5.5 | 126.0 |
| 50% | 8.0 | 59.0 | -1.4 | 4.2 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1013.9 | 894.8 | 100.0 | 8.4 | 203.0 |
| 75% | 14.8 | 77.0 | 5.6 | 12.5 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1019.1 | 899.2 | 100.0 | 13.1 | 307.0 |
| max | 34.4 | 100.0 | 17.8 | 34.9 | 44.6 | 44.5 | 0.0 | 1.33 | 95.0 | 1049.9 | 914.4 | 100.0 | 41.6 | 360.0 |



Observed Data Statistics:


|  | temperature_2m (°C) |
|---|---|
| count | 16080.0 |
| mean | 5.884428 |
| std | 11.929266 |
| min | -39.1 |
| 25% | -1.2 |
| 50% | 7.0 |
| 75% | 14.6 |
| max | 33.6 |
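
Two details in the forecast statistics may be worth acting on. `showers_previous_day1 (mm)` is 0.0 in every summary row, so it has zero variance and cannot inform a model. Wind direction runs from 1° to 360°, so two nearly identical directions sit at opposite ends of the numeric scale. The sketch below shows one possible way to handle both on a stand-in frame; the `wind_dir_sin` and `wind_dir_cos` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in frame (values are illustrative, column names match the forecast data).
demo = pd.DataFrame({
    "showers_previous_day1 (mm)": [0.0, 0.0, 0.0, 0.0],
    "wind_speed_10m_previous_day1 (km/h)": [5.5, 8.4, 13.1, 41.6],
    "wind_direction_10m_previous_day1 (°)": [1.0, 126.0, 203.0, 360.0],
})

# Columns with a single unique value cannot help any regressor.
constant_columns = [col for col in demo.columns if demo[col].nunique(dropna=True) <= 1]
features = demo.drop(columns=constant_columns)

# Encode direction on the unit circle so 1° and 360° end up close together.
direction_col = "wind_direction_10m_previous_day1 (°)"
radians = np.deg2rad(features[direction_col])
features["wind_dir_sin"] = np.sin(radians)  # hypothetical column name
features["wind_dir_cos"] = np.cos(radians)  # hypothetical column name
features = features.drop(columns=[direction_col])

print(f"Constant columns dropped: {constant_columns}")
print(features.round(3))
```

The sine and cosine pair matters mostly for linear models and SVR, which treat the raw degree value as an ordinary magnitude; tree-based models are less affected but can still benefit.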


## 4. Data Cleaning and Preprocessing

## 5. Feature Engineering

## 6. Visualization and Exploratory Data Analysis

## 7. Prepare Data for Modeling / Pipeline

## 8. Model Training and Evaluation

## 9. Model Comparison and Results

## 10. Prediction Visualization

## 11. Feature Importance Analysis

## 12. Conclusions and Summary