# 2D Design Template

# Overview

The purpose of this project is for you to apply what you have learnt in this course. This includes working with and visualizing data, building a linear regression model, and using metrics to measure the accuracy of your model.

Please find the project handout description in the following [link](https://edimension.sutd.edu.sg/webapps/blackboard/content/listContent.jsp?course_id=_5582_1&content_id=_200537_1).


## Deliverables

You need to submit this Jupyter notebook together with the dataset into Vocareum. Use the template in this notebook to work on this project. You are free to edit or add more cells if needed.

## Students Submission
*Include a short sentence summarizing each member’s contribution.*

Student's Name:
- Chen Shixiong - 1009260: Cleaning dataset
- Yeo Owen - 1009253: Found dataset
- Low Wei Yang - 1008921: Build model
- Lee Yi Xiang - 1009335: Evaluated model
- Arman Parkash - 1009174: Perform feature and target preparation

### Problem Statement

- Background description of the problem

With energy demand rising and weather patterns growing more erratic, accurate forecasting of electricity usage has become a critical priority for cities like London. As temperature, humidity, and wind speed shift throughout the year, energy consumption patterns follow suit — affecting everything from heating systems in winter to cooling loads during the warmer months.

This project uses Multiple Linear Regression (MLR) to develop a predictive model of household energy consumption in kilowatt-hours (kWh), based on three key environmental factors: average temperature, humidity, and wind speed. These variables were selected for their measurable impact on energy demand and for the availability of reliable historical data.

By building and comparing the MLR model in Python, the project aims to understand how well these variables explain fluctuations in electricity use in London. The goal is to produce a model that is statistically robust, reproducible, and scalable — offering potential use cases for energy providers, city planners, and policy-makers seeking to optimise energy resources and improve sustainability.

- User Persona

Name: David Rahman

Age: 45

Role: Senior Urban Energy Planner

Organization: Greater London Authority (GLA)

Location: City Hall, London

Background:

David has over 15 years of experience in sustainable urban planning. His current focus is on integrating energy efficiency into city infrastructure to meet London’s climate goals and reduce strain on the electricity grid.

Goals & Responsibilities:

Ensure London's energy systems are resilient and prepared for seasonal demand

Support the city’s transition to net-zero emissions by 2030

Make data-driven decisions for urban development and housing retrofits

Collaborate with energy providers to predict and reduce peak load pressures

Allocate funding for sustainability projects based on real impact potential

Pain Points:

Energy demand predictions often rely on outdated or generalised models

Difficulty in aligning short-term weather shifts with long-term infrastructure planning

Lacks granular, season-specific data to guide targeted energy interventions

Needs a scalable and reproducible system to evaluate energy usage across boroughs

- Problem Statement using “how might we ...” statement

How might we predict household energy consumption in London using average temperature, humidity, and wind speed, so as to improve energy planning and protect vulnerable communities from the effects of climate variability?

### Dataset

- Describe your dataset.
- Put the link to the sources of your raw dataset.
- Put python codes for loading the data into pandas dataframe(s). The data should be the raw data downloaded from the source. No pre-processing using any software (excel, python, etc) yet. Include this dataset in your submission
- Explain each column of your dataset (can use comment or markdown)
- State which column is the dependent variable (target) and explain how it is related to your problem statement
- State which columns are the independent variables (features) and describe your hypothesis on why these features can predict the target variable

[`LCL.csv`](https://data.4tu.nl/datasets/fbbe775b-48d8-469f-a39b-b64488bfd6fd) : 

- This dataset contains half-hourly smart meter measurements from 4443 households, collected in 2013 during the Low Carbon London project.

- Columns:

    DateTime: The date and time of power usage record.

    MAC\d{6}: One column per household (labelled `MAC` followed by six digits). Each value is that household's energy consumption in kWh over the preceding half hour.


[`all_weather_data.csv`](https://www.kaggle.com/datasets/jakewright/2m-daily-weather-history-uk?resource=download): 

- This dataset contains historical weather data from various locations across the UK, spanning from 2009 to 2024. Each entry records the weather conditions for a specific day, providing insights into temperature, rain, humidity, cloud cover, wind speed, and wind direction. The data is useful for analyzing weather patterns and trends over time.

- Columns:

    location: The name of the location (e.g., Holywood, Ardkeen).

    date: The date of the weather record (format: YYYY-MM-DD).

    min_temp (°C): The minimum temperature recorded on that day (in degrees Celsius).

    max_temp (°C): The maximum temperature recorded on that day (in degrees Celsius).

    rain (mm): The amount of rainfall recorded (in millimeters).

    humidity (%): The percentage of humidity.

    cloud_cover (%): The percentage of cloud cover.

    wind_speed (km/h): The wind speed recorded (in kilometers per hour).

    wind_direction: The direction of the wind (e.g., N, SSE, WSW).

    wind_direction_numerical: The numerical representation of the wind direction (e.g., 90.0 for east)


Feature - X data (Independent): Average temperature, Humidity and Wind Speed

Target - Y data (Dependent): Power Usage
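
Written out, the multiple linear regression implied by this choice of features and target has the form below. The subscripted symbol names are labels chosen here for illustration; $\beta_0$ to $\beta_3$ are the coefficients to be fitted, $y$ is the daily household average in kWh, and $\varepsilon$ is the residual.

$$
y = \beta_0 + \beta_1\, x_{\text{temp}} + \beta_2\, x_{\text{hum}} + \beta_3\, x_{\text{wind}} + \varepsilon
$$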

#### Import Necessary Libraries

In [None]:
from typing import TypeAlias
from typing import Optional, Any

Number: TypeAlias = int | float

import warnings
import math

warnings.filterwarnings('ignore', category=FutureWarning, message='.*Dtype inference.*')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.axes as axes
import seaborn as sns
from IPython.display import display

#### Import necessary functions from cohort problem sets

In [None]:
def normalize_z(array: np.ndarray, columns_means: Optional[np.ndarray]=None,
                columns_stds: Optional[np.ndarray]=None) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    assert columns_means is None or columns_means.shape == (1, array.shape[1])
    assert columns_stds is None or columns_stds.shape == (1, array.shape[1])

    if columns_means is None:
        columns_means = array.mean(axis=0).reshape(1, -1)

    if columns_stds is None:
        columns_stds = array.std(axis=0).reshape(1, -1)

    out: np.ndarray = (array - columns_means) / columns_stds

    assert out.shape == array.shape
    assert columns_means.shape == (1, array.shape[1])
    assert columns_stds.shape == (1, array.shape[1])
    return out, columns_means, columns_stds


def get_features_targets(df: pd.DataFrame,
                         feature_names: list[str],
                         target_names: list[str]) -> tuple[pd.DataFrame, pd.DataFrame]:
    df_feature: pd.DataFrame = df[feature_names] # if feature_names is not list[str] type and just str, then we will get Series and not Dataframe
    df_target: pd.DataFrame = df[target_names]
    return df_feature, df_target


def prepare_feature(np_feature: np.ndarray) -> np.ndarray:
    # Prepend a column of ones so that beta[0] acts as the intercept term
    X: np.ndarray = np.concatenate((np.ones((np_feature.shape[0], 1)), np_feature), axis=1)  # axis=1 concatenates column-wise
    return X


def predict_linreg(array_feature: np.ndarray, beta: np.ndarray,
                   means: Optional[np.ndarray]=None,
                   stds: Optional[np.ndarray]=None) -> np.ndarray:
    assert means is None or means.shape == (1, array_feature.shape[1])
    assert stds is None or stds.shape == (1, array_feature.shape[1])
    norm_data, _, _ = normalize_z(array_feature, means, stds)
    X: np.ndarray = prepare_feature(norm_data)
    result = calc_linreg(X, beta)
    assert result.shape == (array_feature.shape[0], 1)
    return result


def calc_linreg(X: np.ndarray, beta: np.ndarray) -> np.ndarray:
    result = np.matmul(X, beta)
    assert result.shape == (X.shape[0], 1)
    return result


def compute_cost_linreg(X: np.ndarray, y: np.ndarray, beta: np.ndarray) -> np.ndarray:
    m = X.shape[0]
    predicted_y = calc_linreg(X, beta)
    error = predicted_y - y
    error_sq = np.matmul(error.T, error)
    J = (1/(2*m)) * error_sq
    assert J.shape == (1, 1)
    return np.squeeze(J)


def gradient_descent_linreg(X: np.ndarray, y: np.ndarray, beta: np.ndarray,
                            alpha: float, num_iters: int) -> tuple[np.ndarray, np.ndarray]:
    m = X.shape[0]
    J_storage = np.zeros ((num_iters, 1))
    for n in range(num_iters):
        deriv: np.ndarray = np.matmul(X.T, (calc_linreg(X, beta) - y))
        beta = beta - alpha * (1/m) * deriv
        J_storage[n] = compute_cost_linreg(X, y, beta)

    assert beta.shape == (X.shape[1], 1)
    assert J_storage.shape == (num_iters, 1)
    return beta, J_storage


def split_data(df_feature: pd.DataFrame, df_target: pd.DataFrame,
               random_state: Optional[int]=None,
               test_size: float=0.5) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    index = df_feature.index

    if random_state is not None:
        np.random.seed(random_state)

    rows_test_size = int(test_size * len(index))

    # Sample the test index dataset
    test_index = np.random.choice(index, rows_test_size, replace = False)
    train_index = index.drop(test_index)

    # Select data for the output
    df_feature_train = df_feature.loc[train_index, :]
    df_feature_test = df_feature.loc[test_index,:]

    df_target_train = df_target.loc[train_index,:]
    df_target_test = df_target.loc[test_index,:]

    return df_feature_train, df_feature_test, df_target_train, df_target_test


# just wrap all the code in CS4 in a function
def build_model_linreg(df_feature_train: pd.DataFrame,
                       df_target_train: pd.DataFrame,
                       beta: Optional[np.ndarray] = None,
                       alpha: float = 0.01,
                       iterations: int = 1500) -> tuple[dict[str, Any], np.ndarray]:
    # check if initial beta values are given
    if beta is None:
        beta = np.zeros((df_feature_train.shape[1]+1, 1)) # add one dimension to the feature_train array because of the b0 coefficient
    assert beta.shape == (df_feature_train.shape[1]+1, 1) # to make sure if beta argument is given, then it conforms to the shape of the feature train

    array_feature_train_z, means, stds = normalize_z(df_feature_train.to_numpy())

    # prepare the X matrix and the target vector as ndarray
    X: np.ndarray = prepare_feature(array_feature_train_z)
    target: np.ndarray = df_target_train.to_numpy()
    beta, J_storage = gradient_descent_linreg(X, target, beta, alpha, iterations)
    # store the output in model dictionary
    model = {"beta": beta, "means":means, "stds": stds}

    # assert the shapes
    assert model["beta"].shape == (df_feature_train.shape[1] + 1, 1) # beta is (n_features + 1) by 1, including the intercept
    assert model["means"].shape == (1, df_feature_train.shape[1]) # means is 1 by n_features (one mean per feature)
    assert model["stds"].shape == (1, df_feature_train.shape[1])  # stds is 1 by n_features (one standard deviation per feature)
    assert J_storage.shape == (iterations, 1) # make sure we have recorded #iterations of error
    return model, J_storage


def r2_score(y: np.ndarray, ypred: np.ndarray) -> float:
    res = np.sum((y - ypred)**2)
    tot = np.sum((y - y.mean())**2)
    return 1 - res / tot


def mean_squared_error(target: np.ndarray, pred: np.ndarray) -> float:
    return np.sum((target - pred)**2)/target.shape[0]
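
As a sanity check on the gradient-descent approach used by these functions, the coefficients it converges to can be compared against a closed-form least-squares solve. The sketch below is self-contained and runs on synthetic data: the feature scales, `true_beta`, and the noise level are made up for illustration, and `np.linalg.lstsq` is used only as a reference solver, not as part of the cohort functions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200

# Synthetic features on different scales (hypothetical values, not the project data)
features = rng.normal(loc=[10.0, 70.0, 15.0], scale=[5.0, 10.0, 4.0], size=(m, 3))

# z-normalise each column, then prepend a column of ones for the intercept
Z = (features - features.mean(axis=0)) / features.std(axis=0)
X = np.concatenate((np.ones((m, 1)), Z), axis=1)

# Hypothetical "true" coefficients plus a little noise
true_beta = np.array([[9.0], [-1.5], [0.3], [0.2]])
y = X @ true_beta + rng.normal(scale=0.1, size=(m, 1))

# Batch gradient descent with the same update rule as gradient_descent_linreg
beta_gd = np.zeros((4, 1))
alpha, num_iters = 0.01, 1500
for _ in range(num_iters):
    beta_gd = beta_gd - (alpha / m) * (X.T @ (X @ beta_gd - y))

# Closed-form least-squares reference
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

max_gap = float(np.abs(beta_gd - beta_ls).max())
print("largest coefficient difference:", max_gap)
```

If the largest difference is not small, the learning rate or the number of iterations is probably too low for the data at hand.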

#### Loading Raw Data

In [None]:
df1 = pd.read_csv('./LCL_2013.csv')

df1

In [None]:
df2 = pd.read_csv('./all_weather_data.csv')

df2

### Clean & Analyze your data
Use python code to:
- Clean your data
- Calculate Descriptive Statistics and other statistical analysis
- Visualization with meaningful analysis description

#### `LCL.csv` (Average Household Power Usage)

1. Compute average energy usage across all households.
2. Convert energy usage data into daily scale.

In [None]:
# Create a proper copy of the dataframe and ensure clean dtypes
df1_pp = df1.iloc[:-1].copy()

# Drop columns where there are households with readings that are NaN
df1_pp = df1_pp.dropna(axis='columns', how='any')

# Convert the DateTime column to datetime64; plain column assignment lets pandas replace the object dtype
df1_pp['DateTime'] = pd.to_datetime(df1_pp['DateTime'], format='mixed', errors='coerce')

# Set index and ensure it's properly typed
df1_pp = df1_pp.set_index('DateTime')

# Resample by day ('D') and sum each column (i.e. each household)
df_daily = df1_pp.resample('D').sum()

# Reset index so 'DateTime' becomes a column again
df_daily = df_daily.reset_index()

# Exclude non-household columns (e.g., 'DateTime')
mac_columns = [col for col in df_daily.columns if col.startswith('MAC')]

# Compute row-wise average across all MAC* columns
df_daily['Household Average'] = df_daily[mac_columns].mean(axis=1, skipna=True)

# Create final dataframe with just DateTime and average
df_daily_average = df_daily[['DateTime', 'Household Average']].copy()

In [None]:
display(df_daily_average)
display(df_daily_average.describe())
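
The two steps above (daily totals per household, then the average across households) can be illustrated on a toy frame. This is a minimal sketch with two made-up households and constant readings, so the expected daily figures are easy to verify by hand; the column names follow the `MAC` pattern but the values are not from the dataset.

```python
import pandas as pd

# Two full days of half-hourly timestamps (48 readings per day)
idx = pd.date_range("2013-01-01", periods=96, freq="30min")

# Hypothetical households with constant half-hourly consumption in kWh
toy = pd.DataFrame({"MAC000001": 0.5, "MAC000002": 0.25}, index=idx)

# Daily total per household: 48 * 0.5 = 24 kWh and 48 * 0.25 = 12 kWh
daily = toy.resample("D").sum()

# Average across households for each day: (24 + 12) / 2 = 18 kWh
daily["Household Average"] = daily.mean(axis=1)
print(daily)
```

Because each reading covers the preceding half hour, the value stamped 00:00 strictly belongs to the previous day; `resample('D')` assigns it to the new day, which is a small approximation at a daily scale.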

#### `all_weather_data.csv` (Weather stats in London)

1. Filter data to include only records from the year 2013.
2. Select London data only.

In [None]:
df2_pp = df2.copy()

# Filter for London AND year 2013 in one operation
df2_london_2013 = df2_pp[
    (df2_pp['location'] == 'London') &
    (pd.to_datetime(df2_pp['date']).dt.year == 2013)
].copy()

df2_london_2013['average_temp °c'] = (df2_london_2013['min_temp °c'] + df2_london_2013['max_temp °c']) / 2

# Ensure date column is datetime format
df2_london_2013['date'] = pd.to_datetime(df2_london_2013['date'])

column_order = [
    'location', 'date', 'min_temp °c', 'average_temp °c', 'max_temp °c',
    'rain mm', 'humidity %', 'cloud_cover %', 'wind_speed km/h',
    'wind_direction', 'wind_direction_numerical'
]

# Reorder the dataframe
df2_london_2013 = df2_london_2013[column_order]

In [None]:
display(df2_london_2013)
display(df2_london_2013.describe())

#### Seasonal Classification

1. Correlate both datasets based on date.
2. Segment dataset by season for further analysis.

In [None]:
# Add season classification
def get_season(date):
    """
    Classify date into seasons based on meteorological seasons:
    - Winter: December, January, February
    - Spring: March, April, May
    - Summer: June, July, August
    - Fall/Autumn: September, October, November
    """
    month = date.month
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:  # months 9, 10, 11
        return 'Fall'
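
An equivalent way to express the same classification, shown here only as an alternative sketch, is a month-to-season lookup applied to the whole column at once. The dictionary name `MONTH_TO_SEASON` and the sample dates are made up for illustration.

```python
import pandas as pd

# Meteorological seasons keyed by month number (same grouping as get_season)
MONTH_TO_SEASON = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

dates = pd.Series(pd.to_datetime(
    ["2013-01-15", "2013-04-01", "2013-07-31", "2013-11-30", "2013-12-25"]
))

# .dt.month extracts the month; .map looks each one up in the dictionary
seasons = dates.dt.month.map(MONTH_TO_SEASON)
print(seasons.tolist())
```

Note that with a single calendar year of data, "Winter" combines January and February with December of the same year, so the winter subset is not one contiguous period.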

In [None]:
# Apply season classification using .loc to avoid warning
df_daily_average.loc[:, 'Season'] = df_daily_average['DateTime'].apply(get_season)

# Split data into seasonal dataframes
winter_data = df_daily_average[df_daily_average['Season'] == 'Winter'].copy()
spring_data = df_daily_average[df_daily_average['Season'] == 'Spring'].copy()
summer_data = df_daily_average[df_daily_average['Season'] == 'Summer'].copy()
fall_data = df_daily_average[df_daily_average['Season'] == 'Fall'].copy()

In [None]:
# Add season column
df2_london_2013['season'] = df2_london_2013['date'].apply(get_season)

# Split data into seasonal dataframes
winter_weather = df2_london_2013[df2_london_2013['season'] == 'Winter'].copy()
spring_weather = df2_london_2013[df2_london_2013['season'] == 'Spring'].copy()
summer_weather = df2_london_2013[df2_london_2013['season'] == 'Summer'].copy()
fall_weather = df2_london_2013[df2_london_2013['season'] == 'Fall'].copy()

In [None]:
winter_data = pd.merge(
    winter_data,           # energy data with 'DateTime' and 'Household Average'
    winter_weather,        # weather data with 'date' and 'average_temp'
    left_on='DateTime',    # date column in energy data
    right_on='date',       # date column in weather data
    how='inner'            # only dates that exist in both datasets
)

spring_data = pd.merge(
    spring_data,           # energy data with 'DateTime' and 'Household Average'
    spring_weather,        # weather data with 'date' and 'average_temp'
    left_on='DateTime',    # date column in energy data
    right_on='date',       # date column in weather data
    how='inner'            # only dates that exist in both datasets
)

summer_data = pd.merge(
    summer_data,           # energy data with 'DateTime' and 'Household Average'
    summer_weather,        # weather data with 'date' and 'average_temp'
    left_on='DateTime',    # date column in energy data
    right_on='date',       # date column in weather data
    how='inner'            # only dates that exist in both datasets
)

fall_data = pd.merge(
    fall_data,           # energy data with 'DateTime' and 'Household Average'
    fall_weather,        # weather data with 'date' and 'average_temp'
    left_on='DateTime',    # date column in energy data
    right_on='date',       # date column in weather data
    how='inner'            # only dates that exist in both datasets
)
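
The four merges above follow the same pattern, so the behaviour of the inner join is easiest to see on a small example. The sketch below uses made-up rows; it merges once and then splits by season, which is one possible way to avoid repeating the call and is not how the notebook itself is organised.

```python
import pandas as pd

# Hypothetical daily energy averages, already labelled with a season
energy = pd.DataFrame({
    "DateTime": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-07-01"]),
    "Household Average": [12.0, 11.5, 8.0],
    "Season": ["Winter", "Winter", "Summer"],
})

# Hypothetical weather rows; 2013-01-02 is deliberately missing
weather = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-07-01", "2013-07-02"]),
    "average_temp °c": [4.0, 18.0, 19.5],
})

# Inner join keeps only the dates present in both frames
merged = pd.merge(energy, weather, left_on="DateTime", right_on="date", how="inner")

# One dataframe per season, keyed by season name
seasonal = {
    season: group.reset_index(drop=True)
    for season, group in merged.groupby("Season")
}
print({season: len(frame) for season, frame in seasonal.items()})
```

Here the energy row for 2013-01-02 and the weather row for 2013-07-02 are both dropped, because neither has a partner in the other frame.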

In [None]:
display(winter_data)
display(winter_data.describe())
display(spring_data)
display(spring_data.describe())
display(summer_data)
display(summer_data.describe())
display(fall_data)
display(fall_data.describe())

### Model Analysis

### Features and Target Preparation

Prepare features and target for model training.


In [None]:
FEATURE_NAMES = ["min_temp °c", "average_temp °c", "max_temp °c", "rain mm", "humidity %", "cloud_cover %", "wind_speed km/h", "wind_direction_numerical"]
TARGET_NAMES = ["Household Average"]

df_feature_winter, df_target_winter = get_features_targets(winter_data, FEATURE_NAMES, TARGET_NAMES)
df_feature_spring, df_target_spring = get_features_targets(spring_data, FEATURE_NAMES, TARGET_NAMES)
df_feature_summer, df_target_summer = get_features_targets(summer_data, FEATURE_NAMES, TARGET_NAMES)
df_feature_fall, df_target_fall = get_features_targets(fall_data, FEATURE_NAMES, TARGET_NAMES)

In [None]:
array_feature_winter = df_feature_winter.to_numpy()
array_feature_spring = df_feature_spring.to_numpy()
array_feature_summer = df_feature_summer.to_numpy()
array_feature_fall = df_feature_fall.to_numpy()

In [None]:
df_features_winter: pd.DataFrame = pd.DataFrame(array_feature_winter, columns=df_feature_winter.columns)
display(df_features_winter.describe())
display(df_target_winter.describe())

df_features_spring: pd.DataFrame = pd.DataFrame(array_feature_spring, columns=df_feature_spring.columns)
display(df_features_spring.describe())
display(df_target_spring.describe())

df_features_summer: pd.DataFrame = pd.DataFrame(array_feature_summer, columns=df_feature_summer.columns)
display(df_features_summer.describe())
display(df_target_summer.describe())

df_features_fall: pd.DataFrame = pd.DataFrame(array_feature_fall, columns=df_feature_fall.columns)
display(df_features_fall.describe())
display(df_target_fall.describe())

### Building Model

Use python code to build your model. Give explanation on this process.

#### Cost Function
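
For reference, the quantities computed by `compute_cost_linreg` and `gradient_descent_linreg` can be written in matrix form as below, where $X$ is the $m \times d$ design matrix including the column of ones, $y$ is the $m \times 1$ target vector, and $\alpha$ is the learning rate. The notation is chosen here to mirror the code rather than taken from the handout.

$$
J(\beta) = \frac{1}{2m}\,(X\beta - y)^{\top}(X\beta - y),
\qquad
\beta \leftarrow \beta - \frac{\alpha}{m}\,X^{\top}(X\beta - y)
$$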

In [None]:
# put Python code to build your model
X_winter: np.ndarray = prepare_feature(df_features_winter.to_numpy())
target_winter: np.ndarray = df_target_winter.to_numpy().reshape(-1,1)
beta_winter: np.ndarray = np.zeros((X_winter.shape[1],1))
J_winter: np.ndarray = compute_cost_linreg(X_winter, target_winter, beta_winter)
print(J_winter)


X_spring: np.ndarray = prepare_feature(df_features_spring.to_numpy())
target_spring: np.ndarray = df_target_spring.to_numpy().reshape(-1,1)
beta_spring: np.ndarray = np.zeros((X_spring.shape[1],1))
J_spring: np.ndarray = compute_cost_linreg(X_spring, target_spring, beta_spring)
print(J_spring)


X_summer: np.ndarray = prepare_feature(df_features_summer.to_numpy())
target_summer: np.ndarray = df_target_summer.to_numpy().reshape(-1,1)
beta_summer: np.ndarray = np.zeros((X_summer.shape[1],1))
J_summer: np.ndarray = compute_cost_linreg(X_summer, target_summer, beta_summer)
print(J_summer)


X_fall: np.ndarray = prepare_feature(df_features_fall.to_numpy())
target_fall: np.ndarray = df_target_fall.to_numpy().reshape(-1,1)
beta_fall: np.ndarray = np.zeros((X_fall.shape[1],1))
J_fall: np.ndarray = compute_cost_linreg(X_fall, target_fall, beta_fall)
print(J_fall)

#### Gradient descent

In [None]:
iterations: int = 1500
# We chose a very small learning rate (alpha = 0.00001) because the input features
# (temperature, humidity, etc.) are not scaled in this baseline model.
# Without scaling, some features are much larger in magnitude than others, so larger
# steps make gradient descent unstable: the coefficients overshoot further on every
# iteration until the cost becomes nan (not a number).
# A much smaller learning rate keeps the updates stable, at the price of slower convergence.
alpha: float = 0.00001

beta_winter: np.ndarray = np.zeros((X_winter.shape[1],1))
beta_spring: np.ndarray = np.zeros((X_spring.shape[1],1))
beta_summer: np.ndarray = np.zeros((X_summer.shape[1],1))
beta_fall: np.ndarray = np.zeros((X_fall.shape[1],1))


beta_winter, J_storage_winter = gradient_descent_linreg(X_winter, target_winter, beta_winter, alpha, iterations)
print('beta_winter: \n', beta_winter)
print('J_storage_winter: \n', J_storage_winter)

beta_spring, J_storage_spring = gradient_descent_linreg(X_spring, target_spring, beta_spring, alpha, iterations)
print('beta_spring: \n', beta_spring)
print('J_storage_spring: \n', J_storage_spring)

beta_summer, J_storage_summer = gradient_descent_linreg(X_summer, target_summer, beta_summer, alpha, iterations)
print('beta_summer: \n', beta_summer)
print('J_storage_summer: \n', J_storage_summer)

beta_fall, J_storage_fall = gradient_descent_linreg(X_fall, target_fall, beta_fall, alpha, iterations)
print('beta_fall: \n', beta_fall)
print('J_storage_fall: \n', J_storage_fall)

In [None]:
plt.plot(J_storage_winter)

In [None]:
plt.plot(J_storage_spring)

In [None]:
plt.plot(J_storage_summer)

In [None]:
plt.plot(J_storage_fall)
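
The interaction between feature scaling and the learning rate described in the gradient-descent cell can be reproduced on synthetic data. The sketch below is an illustration under made-up assumptions (three features with weather-like magnitudes, a hypothetical linear relationship, and a helper named `final_cost`), not a rerun of the project data.

```python
import numpy as np

def final_cost(X, y, alpha, num_iters=1500):
    """Run batch gradient descent from beta = 0 and return the final cost J."""
    m = X.shape[0]
    beta = np.zeros((X.shape[1], 1))
    # A diverging run overflows to inf/nan; silence the resulting NumPy warnings
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(num_iters):
            beta = beta - (alpha / m) * (X.T @ (X @ beta - y))
        err = X @ beta - y
        return float(np.sum(err ** 2) / (2 * m))

rng = np.random.default_rng(0)
m = 90

# Hypothetical unscaled features: temperature, humidity, wind speed
raw = np.column_stack([
    rng.normal(10.0, 5.0, m),
    rng.normal(80.0, 10.0, m),
    rng.normal(15.0, 5.0, m),
])
# Hypothetical linear target with a little noise
y = (10.0 - 0.2 * raw[:, [0]] + 0.02 * raw[:, [1]] + 0.05 * raw[:, [2]]
     + rng.normal(0.0, 0.3, (m, 1)))

ones = np.ones((m, 1))
X_raw = np.hstack([ones, raw])
X_scaled = np.hstack([ones, (raw - raw.mean(axis=0)) / raw.std(axis=0)])

cost_raw_fast = final_cost(X_raw, y, alpha=0.01)      # unscaled, large step
cost_raw_slow = final_cost(X_raw, y, alpha=0.00001)   # unscaled, tiny step
cost_scaled = final_cost(X_scaled, y, alpha=0.01)     # z-normalised, large step

print("unscaled, alpha=0.01   :", cost_raw_fast)
print("unscaled, alpha=0.00001:", cost_raw_slow)
print("scaled,   alpha=0.01   :", cost_scaled)
```

On this toy data the unscaled run with the larger step diverges, the unscaled run with the tiny step stays finite but has not fully converged after 1500 iterations, and the z-normalised run reaches a lower cost with the larger step.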

#### Predict Linear Regression

In [None]:
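# Predict with the same unscaled design matrix that beta_winter was fitted on
# (predict_linreg would z-normalise the features first, which would not match these coefficients)
pred_winter: np.ndarray = calc_linreg(X_winter, beta_winter)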

# Change target to numpy array
target_winter: np.ndarray = df_target_winter.to_numpy()

print("Winter")
print(pred_winter[0], pred_winter[1], pred_winter[2])
print(target_winter[0], target_winter[1], target_winter[2])
print(pred_winter.mean(), pred_winter.std())



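# Predict with the same unscaled design matrix that beta_spring was fitted on
# (predict_linreg would z-normalise the features first, which would not match these coefficients)
pred_spring: np.ndarray = calc_linreg(X_spring, beta_spring)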

# Change target to numpy array
target_spring: np.ndarray = df_target_spring.to_numpy()

print("Spring")
print(pred_spring[0], pred_spring[1], pred_spring[2])
print(target_spring[0], target_spring[1], target_spring[2])
print(pred_spring.mean(), pred_spring.std())



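# Predict with the same unscaled design matrix that beta_summer was fitted on
# (predict_linreg would z-normalise the features first, which would not match these coefficients)
pred_summer: np.ndarray = calc_linreg(X_summer, beta_summer)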

# Change target to numpy array
target_summer: np.ndarray = df_target_summer.to_numpy()

print("Summer")
print(pred_summer[0], pred_summer[1], pred_summer[2])
print(target_summer[0], target_summer[1], target_summer[2])
print(pred_summer.mean(), pred_summer.std())



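# Predict with the same unscaled design matrix that beta_fall was fitted on
# (predict_linreg would z-normalise the features first, which would not match these coefficients)
pred_fall: np.ndarray = calc_linreg(X_fall, beta_fall)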

# Change target to numpy array
target_fall: np.ndarray = df_target_fall.to_numpy()

print("Fall")
print(pred_fall[0], pred_fall[1], pred_fall[2])
print(target_fall[0], target_fall[1], target_fall[2])
print(pred_fall.mean(), pred_fall.std())


### Evaluating the Model

- Describe the metrics of your choice
- Evaluate your model performance

#### Splitting Data

In [None]:
# Split the data set into training and test
df_features_train_winter, df_features_test_winter, df_target_train_winter, df_target_test_winter = split_data(df_features_winter, df_target_winter, random_state=100, test_size=0.3)
# call build_model_linreg() function
model_winter, J_storage_winter = build_model_linreg(df_features_train_winter, df_target_train_winter)
# call the predict_linreg() method
pred_winter: np.ndarray = predict_linreg(df_features_test_winter.to_numpy(), model_winter['beta'], model_winter['means'], model_winter['stds'])


# Split the data set into training and test
df_features_train_spring, df_features_test_spring, df_target_train_spring, df_target_test_spring = split_data(df_features_spring, df_target_spring, random_state=100, test_size=0.3)
# call build_model_linreg() function
model_spring, J_storage_spring = build_model_linreg(df_features_train_spring, df_target_train_spring)
# call the predict_linreg() method
pred_spring: np.ndarray = predict_linreg(df_features_test_spring.to_numpy(), model_spring['beta'], model_spring['means'], model_spring['stds'])


# Split the data set into training and test
df_features_train_summer, df_features_test_summer, df_target_train_summer, df_target_test_summer = split_data(df_features_summer, df_target_summer, random_state=100, test_size=0.3)
# call build_model_linreg() function
model_summer, J_storage_summer = build_model_linreg(df_features_train_summer, df_target_train_summer)
# call the predict_linreg() method
pred_summer: np.ndarray = predict_linreg(df_features_test_summer.to_numpy(), model_summer['beta'], model_summer['means'], model_summer['stds'])


# Split the data set into training and test
df_features_train_fall, df_features_test_fall, df_target_train_fall, df_target_test_fall = split_data(df_features_fall, df_target_fall, random_state=100, test_size=0.3)
# call build_model_linreg() function
model_fall, J_storage_fall = build_model_linreg(df_features_train_fall, df_target_train_fall)
# call the predict_linreg() method
pred_fall: np.ndarray = predict_linreg(df_features_test_fall.to_numpy(), model_fall['beta'], model_fall['means'], model_fall['stds'])



#### R<sup>2</sup> Score (Coefficient of Determination)

In [None]:
target_winter: np.ndarray = df_target_test_winter.to_numpy()
r2_winter: float = r2_score(target_winter, pred_winter)

target_spring: np.ndarray = df_target_test_spring.to_numpy()
r2_spring: float = r2_score(target_spring, pred_spring)

target_summer: np.ndarray = df_target_test_summer.to_numpy()
r2_summer: float = r2_score(target_summer, pred_summer)

target_fall: np.ndarray = df_target_test_fall.to_numpy()
r2_fall: float = r2_score(target_fall, pred_fall)

print('r2_winter:', round(r2_winter,3))
print('r2_spring:', round(r2_spring,3))
print('r2_summer:', round(r2_summer,3))
print('r2_fall:', round(r2_fall,3))

#### Mean Squared Error & Root Mean Squared Error

In [None]:
mse_winter: float = mean_squared_error(target_winter, pred_winter)
rmse_winter: float = math.sqrt(mse_winter)

mse_spring: float = mean_squared_error(target_spring, pred_spring)
rmse_spring: float = math.sqrt(mse_spring)

mse_summer: float = mean_squared_error(target_summer, pred_summer)
rmse_summer: float = math.sqrt(mse_summer)

mse_fall: float = mean_squared_error(target_fall, pred_fall)
rmse_fall: float = math.sqrt(mse_fall)

print(f'mse_winter: {round(mse_winter,3)}, rmse_winter: {round(rmse_winter,3)}')
print(f'mse_spring: {round(mse_spring,3)}, rmse_spring: {round(rmse_spring,3)}')
print(f'mse_summer: {round(mse_summer,3)}, rmse_summer: {round(rmse_summer,3)}')
print(f'mse_fall: {round(mse_fall,3)}, rmse_fall: {round(rmse_fall,3)}')
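
To make the three metrics concrete, the sketch below evaluates them on four made-up target/prediction pairs that are small enough to check by hand. It repeats the same formulas as `r2_score` and `mean_squared_error` so that it runs on its own.

```python
import math
import numpy as np

# Hypothetical targets and predictions (kWh)
y = np.array([[10.0], [12.0], [14.0], [16.0]])
y_pred = np.array([[11.0], [12.0], [13.0], [16.0]])

# Residual sum of squares: 1 + 0 + 1 + 0 = 2
ss_res = float(np.sum((y - y_pred) ** 2))
# Total sum of squares around the mean (13): 9 + 1 + 1 + 9 = 20
ss_tot = float(np.sum((y - y.mean()) ** 2))

r2 = 1 - ss_res / ss_tot      # 1 - 2/20 = 0.9
mse = ss_res / y.shape[0]     # 2/4 = 0.5, in kWh squared
rmse = math.sqrt(mse)         # about 0.707, in kWh

print(r2, mse, rmse)
```

R<sup>2</sup> is unitless and compares the model against always predicting the mean (which scores 0, and a worse model scores below 0), while RMSE is in the same units as the target, which makes it easier to relate to daily consumption.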

### Improving the Model

- Improve the models by performing any data processing techniques or hyperparameter tuning.
- You can repeat the steps above to show the improvement as compared to the previous performance

Note:
- You should not change or add dataset at this step
- You are allowed to use library such as sklearn for data processing (NOT for building model)
- Make sure to have the same test dataset so the results are comparable with the previous model 
- If you perform hyperparameter tuning, it will require you to split your training data further into train and validation dataset
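
The last note can be sketched as a three-way index split: hold out the test rows first, then carve a validation set out of what remains. This is a minimal illustration; the function name `three_way_split`, the validation fraction, and the use of `np.random.default_rng` (rather than the `np.random.seed` approach in `split_data`) are assumptions made for this example.

```python
import numpy as np

def three_way_split(n_rows, test_size=0.3, val_size=0.2, seed=100):
    """Return disjoint (train, validation, test) arrays of row positions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)               # shuffled row positions
    n_test = int(test_size * n_rows)              # test rows taken from the full set
    n_val = int(val_size * (n_rows - n_test))     # validation rows taken from the remainder
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(90)
print(len(train_idx), len(val_idx), len(test_idx))
```

Hyperparameters such as the learning rate would then be compared on the validation rows only, leaving the test rows untouched until the final evaluation.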

### Features and Target Preparation

Prepare features and target for model training.

In [None]:
# put Python code to prepare your features and target
FEATURE_NAMES = ["average_temp °c", "humidity %", "wind_speed km/h"]
TARGET_NAMES = ["Household Average"]

df_feature_winter, df_target_winter = get_features_targets(winter_data, FEATURE_NAMES, TARGET_NAMES)
df_feature_spring, df_target_spring = get_features_targets(spring_data, FEATURE_NAMES, TARGET_NAMES)
df_feature_summer, df_target_summer = get_features_targets(summer_data, FEATURE_NAMES, TARGET_NAMES)
df_feature_fall, df_target_fall = get_features_targets(fall_data, FEATURE_NAMES, TARGET_NAMES)

In [None]:
array_feature_winter, _, _ = normalize_z(df_feature_winter.to_numpy())
array_feature_spring, _, _ = normalize_z(df_feature_spring.to_numpy())
array_feature_summer, _, _ = normalize_z(df_feature_summer.to_numpy())
array_feature_fall, _, _ = normalize_z(df_feature_fall.to_numpy())

In [None]:
df_features_winter: pd.DataFrame = pd.DataFrame(array_feature_winter, columns=df_feature_winter.columns)
display(df_features_winter.describe())
display(df_target_winter.describe())

df_features_spring: pd.DataFrame = pd.DataFrame(array_feature_spring, columns=df_feature_spring.columns)
display(df_features_spring.describe())
display(df_target_spring.describe())

df_features_summer: pd.DataFrame = pd.DataFrame(array_feature_summer, columns=df_feature_summer.columns)
display(df_features_summer.describe())
display(df_target_summer.describe())

df_features_fall: pd.DataFrame = pd.DataFrame(array_feature_fall, columns=df_feature_fall.columns)
display(df_features_fall.describe())
display(df_target_fall.describe())

In [None]:
sns.set()
plt.scatter(df_features_winter["average_temp °c"], df_target_winter)

In [None]:
sns.set()
plt.scatter(df_features_winter["humidity %"], df_target_winter)

In [None]:
sns.set()
plt.scatter(df_features_winter["wind_speed km/h"], df_target_winter)

In [None]:
sns.set()
plt.scatter(df_features_spring["average_temp °c"], df_target_spring)

In [None]:
sns.set()
plt.scatter(df_features_spring["humidity %"], df_target_spring)

In [None]:
sns.set()
plt.scatter(df_features_spring["wind_speed km/h"], df_target_spring)

In [None]:
sns.set()
plt.scatter(df_features_summer["average_temp °c"], df_target_summer)

In [None]:
sns.set()
plt.scatter(df_features_summer["humidity %"], df_target_summer)

In [None]:
sns.set()
plt.scatter(df_features_summer["wind_speed km/h"], df_target_summer)

In [None]:
sns.set()
plt.scatter(df_features_fall["average_temp °c"], df_target_fall)

In [None]:
sns.set()
plt.scatter(df_features_fall["humidity %"], df_target_fall)

In [None]:
sns.set()
plt.scatter(df_features_fall["wind_speed km/h"], df_target_fall)
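
The scatter plots can be complemented with a numeric summary of how strongly each feature moves with the target. The sketch below computes Pearson correlations with `DataFrame.corrwith` on synthetic data; the generated values and the assumed negative temperature effect are made up for illustration and say nothing about the actual results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 90

# Hypothetical winter-like features
temp = rng.normal(5.0, 3.0, n)
toy = pd.DataFrame({
    "average_temp °c": temp,
    "humidity %": rng.normal(80.0, 8.0, n),
    "wind_speed km/h": rng.normal(15.0, 5.0, n),
})
# Hypothetical target that depends on temperature only, plus noise
toy["Household Average"] = 14.0 - 0.4 * temp + rng.normal(0.0, 0.5, n)

# Pearson correlation of every feature column with the target
corr = toy.drop(columns="Household Average").corrwith(toy["Household Average"])
print(corr.round(3))
```

A correlation near zero for a feature suggests that, on its own, it adds little linear signal, which is worth noting alongside the corresponding scatter plot.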

### Building Model

Use python code to build your model. Give explanation on this process.

#### Cost Function

In [None]:
# put Python code to build your model
X_winter: np.ndarray = prepare_feature(df_features_winter.to_numpy())
target_winter: np.ndarray = df_target_winter.to_numpy()
beta_winter: np.ndarray = np.zeros((4,1))
J_winter: np.ndarray = compute_cost_linreg(X_winter, target_winter, beta_winter)
print(J_winter)


X_spring: np.ndarray = prepare_feature(df_features_spring.to_numpy())
target_spring: np.ndarray = df_target_spring.to_numpy()
beta_spring: np.ndarray = np.zeros((4,1))
J_spring: np.ndarray = compute_cost_linreg(X_spring, target_spring, beta_spring)
print(J_spring)


X_summer: np.ndarray = prepare_feature(df_features_summer.to_numpy())
target_summer: np.ndarray = df_target_summer.to_numpy()
beta_summer: np.ndarray = np.zeros((4,1))
J_summer: np.ndarray = compute_cost_linreg(X_summer, target_summer, beta_summer)
print(J_summer)


X_fall: np.ndarray = prepare_feature(df_features_fall.to_numpy())
target_fall: np.ndarray = df_target_fall.to_numpy()
beta_fall: np.ndarray = np.zeros((4,1))
J_fall: np.ndarray = compute_cost_linreg(X_fall, target_fall, beta_fall)
print(J_fall)

#### Gradient descent
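
The notebook's `gradient_descent_linreg` is defined outside this section. The sketch below assumes it performs standard batch gradient descent, repeating the update $\beta \leftarrow \beta - \frac{\alpha}{m} X^\top (X\beta - y)$ and recording the cost after each step. The function name `gradient_descent_sketch` and the synthetic data are hypothetical; only `alpha = 0.01` and `1500` iterations are taken from the cell in this section.

```python
import numpy as np


def gradient_descent_sketch(X, y, beta, alpha, num_iters):
    """Batch gradient descent on the half mean squared error cost."""
    m = X.shape[0]
    J_history = np.zeros(num_iters)
    for i in range(num_iters):
        residuals = X @ beta - y                       # shape (m, 1)
        beta = beta - (alpha / m) * (X.T @ residuals)  # update all betas at once
        J_history[i] = float(np.sum((X @ beta - y) ** 2) / (2 * m))
    return beta, J_history


# Synthetic data (made up): three standardised features, exact linear target.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 3))
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
X_demo = np.concatenate((np.ones((200, 1)), z), axis=1)
true_beta = np.array([[3.0], [2.0], [-1.0], [0.5]])
y_demo = X_demo @ true_beta

beta_hat, J_hist = gradient_descent_sketch(
    X_demo, y_demo, np.zeros((4, 1)), alpha=0.01, num_iters=1500
)
print(beta_hat.ravel())       # close to true_beta
print(J_hist[0], J_hist[-1])  # the cost falls as the iterations proceed
```

Standardising the features first is what lets a single learning rate work for all coefficients. A cost curve that is still falling steeply at the last iteration would suggest more iterations or a larger `alpha`.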

In [None]:
iterations: int = 1500
alpha: float = 0.01

beta_winter: np.ndarray = np.zeros((4,1))
beta_spring: np.ndarray = np.zeros((4,1))
beta_summer: np.ndarray = np.zeros((4,1))
beta_fall: np.ndarray = np.zeros((4,1))

beta_winter, J_storage_winter = gradient_descent_linreg(X_winter, target_winter, beta_winter, alpha, iterations)
print('beta_winter: \n', beta_winter)
print('J_storage_winter: \n', J_storage_winter)

beta_spring, J_storage_spring = gradient_descent_linreg(X_spring, target_spring, beta_spring, alpha, iterations)
print('beta_spring: \n', beta_spring)
print('J_storage_spring: \n', J_storage_spring)

beta_summer, J_storage_summer = gradient_descent_linreg(X_summer, target_summer, beta_summer, alpha, iterations)
print('beta_summer: \n', beta_summer)
print('J_storage_summer: \n', J_storage_summer)

beta_fall, J_storage_fall = gradient_descent_linreg(X_fall, target_fall, beta_fall, alpha, iterations)
print('beta_fall: \n', beta_fall)
print('J_storage_fall: \n', J_storage_fall)

In [None]:
plt.plot(J_storage_winter)

In [None]:
plt.plot(J_storage_spring)

In [None]:
plt.plot(J_storage_summer)

In [None]:
plt.plot(J_storage_fall)

#### Predict Linear Regression
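
`predict_linreg` is also defined outside this section. Later cells call it with a `beta`, `means` and `stds` argument, so one plausible reading is that it standardises the features, adds the intercept column, and multiplies by `beta`. The sketch below is written under that assumption. `predict_sketch` and its fallback behaviour when no statistics are passed are hypothetical.

```python
import numpy as np


def predict_sketch(features, beta, means=None, stds=None):
    """Standardise features, add the intercept column, and return X @ beta."""
    # Assumption: if no training statistics are supplied, use the data's own.
    if means is None:
        means = features.mean(axis=0)
    if stds is None:
        stds = features.std(axis=0)
    z = (features - means) / stds
    X = np.concatenate((np.ones((z.shape[0], 1)), z), axis=1)
    return X @ beta


# Toy example (made up): statistics come from the training rows only.
train = np.array([[10.0], [20.0], [30.0]])
train_means, train_stds = train.mean(axis=0), train.std(axis=0)
beta_demo = np.array([[5.0], [2.0]])

new_rows = np.array([[20.0], [30.0]])
preds = predict_sketch(new_rows, beta_demo, train_means, train_stds)
print(preds.ravel())
```

Passing the training means and standard deviations keeps new rows on the same scale the coefficients were fitted on. A row equal to the training mean therefore predicts exactly the intercept.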

In [None]:
# Call predict()
pred_winter: np.ndarray = predict_linreg(df_features_winter.to_numpy(), beta_winter)

# Change target to numpy array
target_winter: np.ndarray = df_target_winter.to_numpy()

print("Winter")
print(pred_winter[0], pred_winter[1], pred_winter[2])
print(target_winter[0], target_winter[1], target_winter[2])
print(pred_winter.mean(), pred_winter.std())



# Call predict()
pred_spring: np.ndarray = predict_linreg(df_features_spring.to_numpy(), beta_spring)

# Change target to numpy array
target_spring: np.ndarray = df_target_spring.to_numpy()

print("Spring")
print(pred_spring[0], pred_spring[1], pred_spring[2])
print(target_spring[0], target_spring[1], target_spring[2])
print(pred_spring.mean(), pred_spring.std())



# Call predict()
pred_summer: np.ndarray = predict_linreg(df_features_summer.to_numpy(), beta_summer)

# Change target to numpy array
target_summer: np.ndarray = df_target_summer.to_numpy()

print("Summer")
print(pred_summer[0], pred_summer[1], pred_summer[2])
print(target_summer[0], target_summer[1], target_summer[2])
print(pred_summer.mean(), pred_summer.std())



# Call predict()
pred_fall: np.ndarray = predict_linreg(df_features_fall.to_numpy(), beta_fall)

# Change target to numpy array
target_fall: np.ndarray = df_target_fall.to_numpy()

print("Fall")
print(pred_fall[0], pred_fall[1], pred_fall[2])
print(target_fall[0], target_fall[1], target_fall[2])
print(pred_fall.mean(), pred_fall.std())

In [None]:
plt.scatter(df_features_winter["average_temp °c"],target_winter)
plt.scatter(df_features_winter["average_temp °c"],pred_winter)

In [None]:
plt.scatter(df_features_winter["humidity %"],target_winter)
plt.scatter(df_features_winter["humidity %"],pred_winter)

In [None]:
plt.scatter(df_features_winter["wind_speed km/h"],target_winter)
plt.scatter(df_features_winter["wind_speed km/h"],pred_winter)

In [None]:
plt.scatter(df_features_spring["average_temp °c"],target_spring)
plt.scatter(df_features_spring["average_temp °c"],pred_spring)

In [None]:
plt.scatter(df_features_spring["humidity %"],target_spring)
plt.scatter(df_features_spring["humidity %"],pred_spring)

In [None]:
plt.scatter(df_features_spring["wind_speed km/h"],target_spring)
plt.scatter(df_features_spring["wind_speed km/h"],pred_spring)

In [None]:
plt.scatter(df_features_summer["average_temp °c"],target_summer)
plt.scatter(df_features_summer["average_temp °c"],pred_summer)

In [None]:
plt.scatter(df_features_summer["humidity %"],target_summer)
plt.scatter(df_features_summer["humidity %"],pred_summer)

In [None]:
plt.scatter(df_features_summer["wind_speed km/h"],target_summer)
plt.scatter(df_features_summer["wind_speed km/h"],pred_summer)

In [None]:
plt.scatter(df_features_fall["average_temp °c"],target_fall)
plt.scatter(df_features_fall["average_temp °c"],pred_fall)

In [None]:
plt.scatter(df_features_fall["humidity %"],target_fall)
plt.scatter(df_features_fall["humidity %"],pred_fall)

In [None]:
plt.scatter(df_features_fall["wind_speed km/h"],target_fall)
plt.scatter(df_features_fall["wind_speed km/h"],pred_fall)

### Evaluating the Model

- Describe the metrics of your choice
- Evaluate your model performance

#### Splitting Data
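
The `split_data` helper used in this section takes DataFrames, a `random_state` and a `test_size`. Its implementation is not shown here. As a rough sketch of the idea, assuming a seeded shuffle followed by a fixed-fraction cut, the same logic on NumPy arrays could look like this. `split_sketch` is a hypothetical name, and the rows it selects for a given seed will not match the notebook's.

```python
import numpy as np


def split_sketch(features, target, random_state=None, test_size=0.5):
    """Shuffle row indices with a seed, then carve off a test fraction."""
    rng = np.random.default_rng(random_state)
    m = features.shape[0]
    indices = rng.permutation(m)
    n_test = int(round(test_size * m))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return features[train_idx], features[test_idx], target[train_idx], target[test_idx]


# Toy data (made up): 10 rows, so a 0.3 test fraction gives a 7 / 3 split.
features_demo = np.arange(20.0).reshape(10, 2)
target_demo = np.arange(10.0).reshape(10, 1)

f_train, f_test, t_train, t_test = split_sketch(
    features_demo, target_demo, random_state=100, test_size=0.3
)
print(f_train.shape, f_test.shape)  # (7, 2) (3, 2)
```

Indexing features and target with the same shuffled indices keeps each row paired with its own target value. Fixing the seed makes the split reproducible between runs.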

In [None]:
# Split the data set into training and test
df_features_train_winter, df_features_test_winter, df_target_train_winter, df_target_test_winter = split_data(df_features_winter, df_target_winter, random_state=100, test_size=0.3)
# call build_model_linreg() function
model_winter, J_storage_winter = build_model_linreg(df_features_train_winter, df_target_train_winter)
# call the predict_linreg() method
pred_winter: np.ndarray = predict_linreg(df_features_test_winter.to_numpy(), model_winter['beta'], model_winter['means'], model_winter['stds'])


# Split the data set into training and test
df_features_train_spring, df_features_test_spring, df_target_train_spring, df_target_test_spring = split_data(df_features_spring, df_target_spring, random_state=100, test_size=0.3)
# call build_model_linreg() function
model_spring, J_storage_spring = build_model_linreg(df_features_train_spring, df_target_train_spring)
# call the predict_linreg() method
pred_spring: np.ndarray = predict_linreg(df_features_test_spring.to_numpy(), model_spring['beta'], model_spring['means'], model_spring['stds'])


# Split the data set into training and test
df_features_train_summer, df_features_test_summer, df_target_train_summer, df_target_test_summer = split_data(df_features_summer, df_target_summer, random_state=100, test_size=0.3)
# call build_model_linreg() function
model_summer, J_storage_summer = build_model_linreg(df_features_train_summer, df_target_train_summer)
# call the predict_linreg() method
pred_summer: np.ndarray = predict_linreg(df_features_test_summer.to_numpy(), model_summer['beta'], model_summer['means'], model_summer['stds'])


# Split the data set into training and test
df_features_train_fall, df_features_test_fall, df_target_train_fall, df_target_test_fall = split_data(df_features_fall, df_target_fall, random_state=100, test_size=0.3)
# call build_model_linreg() function
model_fall, J_storage_fall = build_model_linreg(df_features_train_fall, df_target_train_fall)
# call the predict_linreg() method
pred_fall: np.ndarray = predict_linreg(df_features_test_fall.to_numpy(), model_fall['beta'], model_fall['means'], model_fall['stds'])

In [None]:
plt.scatter(df_features_test_winter["average_temp °c"], df_target_test_winter)
plt.scatter(df_features_test_winter["average_temp °c"], pred_winter)

In [None]:
plt.scatter(df_features_test_winter["humidity %"], df_target_test_winter)
plt.scatter(df_features_test_winter["humidity %"], pred_winter)

In [None]:
plt.scatter(df_features_test_winter["wind_speed km/h"], df_target_test_winter)
plt.scatter(df_features_test_winter["wind_speed km/h"], pred_winter)

In [None]:
plt.scatter(df_features_test_spring["average_temp °c"], df_target_test_spring)
plt.scatter(df_features_test_spring["average_temp °c"], pred_spring)

In [None]:
plt.scatter(df_features_test_spring["humidity %"], df_target_test_spring)
plt.scatter(df_features_test_spring["humidity %"], pred_spring)

In [None]:
plt.scatter(df_features_test_spring["wind_speed km/h"], df_target_test_spring)
plt.scatter(df_features_test_spring["wind_speed km/h"], pred_spring)

In [None]:
plt.scatter(df_features_test_summer["average_temp °c"], df_target_test_summer)
plt.scatter(df_features_test_summer["average_temp °c"], pred_summer)

In [None]:
plt.scatter(df_features_test_summer["humidity %"], df_target_test_summer)
plt.scatter(df_features_test_summer["humidity %"], pred_summer)

In [None]:
plt.scatter(df_features_test_summer["wind_speed km/h"], df_target_test_summer)
plt.scatter(df_features_test_summer["wind_speed km/h"], pred_summer)

In [None]:
plt.scatter(df_features_test_fall["average_temp °c"], df_target_test_fall)
plt.scatter(df_features_test_fall["average_temp °c"], pred_fall)

In [None]:
plt.scatter(df_features_test_fall["humidity %"], df_target_test_fall)
plt.scatter(df_features_test_fall["humidity %"], pred_fall)

In [None]:
plt.scatter(df_features_test_fall["wind_speed km/h"], df_target_test_fall)
plt.scatter(df_features_test_fall["wind_speed km/h"], pred_fall)

#### R<sup>2</sup> Score (Coefficient of Determination)
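
For reference, the coefficient of determination is $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. The sketch below implements that formula directly. `r2_sketch` and the toy values are hypothetical, and the notebook's own `r2_score` may be implemented differently.

```python
import numpy as np


def r2_sketch(y, ypred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - ypred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1 - ss_res / ss_tot)


# Toy targets (made up) and three sets of predictions.
y_true = np.array([[1.0], [2.0], [3.0], [4.0]])
r2_perfect = r2_sketch(y_true, y_true)                            # 1.0
r2_mean = r2_sketch(y_true, np.full_like(y_true, y_true.mean()))  # 0.0
r2_partial = r2_sketch(y_true, np.array([[1.5], [2.0], [3.0], [3.5]]))
print(r2_perfect, r2_mean, r2_partial)
```

A score of 0 means the model does no better than always predicting the mean of the targets. That is a useful baseline when reading a low seasonal score.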

In [None]:
target_winter: np.ndarray = df_target_test_winter.to_numpy()
r2_winter: float = r2_score(target_winter, pred_winter)

target_spring: np.ndarray = df_target_test_spring.to_numpy()
r2_spring: float = r2_score(target_spring, pred_spring)

target_summer: np.ndarray = df_target_test_summer.to_numpy()
r2_summer: float = r2_score(target_summer, pred_summer)

target_fall: np.ndarray = df_target_test_fall.to_numpy()
r2_fall: float = r2_score(target_fall, pred_fall)

print('r2_winter:', round(r2_winter,3))
print('r2_spring:', round(r2_spring,3))
print('r2_summer:', round(r2_summer,3))
print('r2_fall:', round(r2_fall,3))

#### Mean Squared Error & Root Mean Squared Error
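
MSE is the mean of the squared prediction errors, $MSE = \frac{1}{m}\sum_{i}(y_i - \hat{y}_i)^2$, and RMSE is its square root. A minimal sketch follows, with `mse_sketch` and the toy values being hypothetical. The notebook's `mean_squared_error` is defined or imported outside this section.

```python
import math

import numpy as np


def mse_sketch(y, ypred):
    """Mean of the squared differences between targets and predictions."""
    return float(np.mean((y - ypred) ** 2))


# Toy values (made up): errors of -1, 0 and 2.
y_true = np.array([[3.0], [5.0], [7.0]])
y_hat = np.array([[2.0], [5.0], [9.0]])

mse_demo = mse_sketch(y_true, y_hat)  # (1 + 0 + 4) / 3
rmse_demo = math.sqrt(mse_demo)
print(mse_demo, rmse_demo)
```

Because MSE squares the errors, it is in squared units of the target. RMSE brings the value back to the target's own units, which makes it easier to interpret.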

In [None]:
mse_winter: float = mean_squared_error(target_winter, pred_winter)
rmse_winter: float = math.sqrt(mse_winter)

mse_spring: float = mean_squared_error(target_spring, pred_spring)
rmse_spring: float = math.sqrt(mse_spring)

mse_summer: float = mean_squared_error(target_summer, pred_summer)
rmse_summer: float = math.sqrt(mse_summer)

mse_fall: float = mean_squared_error(target_fall, pred_fall)
rmse_fall: float = math.sqrt(mse_fall)

print(f'mse_winter: {round(mse_winter,3)}, rmse_winter: {round(rmse_winter,3)}')
print(f'mse_spring: {round(mse_spring,3)}, rmse_spring: {round(rmse_spring,3)}')
print(f'mse_summer: {round(mse_summer,3)}, rmse_summer: {round(rmse_summer,3)}')
print(f'mse_fall: {round(mse_fall,3)}, rmse_fall: {round(rmse_fall,3)}')

### Discussion and Analysis

- Analyze the results of your metrics.

R² measures the proportion of the variance in energy usage that is explained by temperature, humidity, and wind speed.

MSE and RMSE measure how far our predictions are from the actual values (lower is better). RMSE is the square root of MSE, so it is in the same units as the target.

__For winter,__

R<sup>2</sup>: 0.529, MSE: 0.327, RMSE: 0.572

The model explains about 52.9% of the variance in energy use, which is a moderate fit.

The RMSE is about 0.572 kWh, so predictions are typically off by roughly that much.

This makes sense since people use more heating when it’s cold, and that’s affected by temperature and wind.

__For spring,__

R<sup>2</sup>: 0.764, MSE: 0.805, RMSE: 0.897

The model does even better here — it explains 76.4% of the variation.

But the RMSE is higher, at about 0.897 kWh, which is the largest of the four seasons.

Spring weather can be unpredictable, so while the model gets the trend right, it’s not always exact.

__For summer,__

R<sup>2</sup>: 0.077, MSE: 0.104, RMSE: 0.322

The model doesn’t perform well here — it only explains about 7.7% of energy use changes.

The error is small (0.322 kWh), but that’s likely because energy use is low and steady in summer.

This suggests that these three weather variables are not a major driver of energy use in summer.

__For fall,__

R<sup>2</sup>: 0.813, MSE: 0.251, RMSE: 0.501

This is the model’s best season by R<sup>2</sup>: it explains about 81.3% of the variance.

The RMSE is low at 0.501 kWh, so the predictions are fairly accurate.

Fall likely has more stable weather patterns, making energy use easier to predict.
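
As a quick sanity check on the figures quoted above, RMSE should equal the square root of MSE up to rounding. The sketch below uses only the values reported in this section. The tolerance of 0.001 is an assumption chosen to allow for both numbers having been rounded to three decimals.

```python
import math

# Values copied from the per-season results reported above.
reported = {
    "winter": {"mse": 0.327, "rmse": 0.572},
    "spring": {"mse": 0.805, "rmse": 0.897},
    "summer": {"mse": 0.104, "rmse": 0.322},
    "fall": {"mse": 0.251, "rmse": 0.501},
}

# Check that each reported RMSE matches sqrt(MSE) within rounding tolerance.
consistent = {
    season: abs(math.sqrt(values["mse"]) - values["rmse"]) < 0.001
    for season, values in reported.items()
}
print(consistent)
```

All four seasons pass this check, so the reported MSE and RMSE values are mutually consistent.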

- Explain how your analysis and machine learning help to solve your problem statement.

Our goal was to figure out how weather affects household electricity use, and whether we can predict that use ahead of time.

By using machine learning — specifically, Multiple Linear Regression — we were able to build a model that takes in temperature, humidity, and wind speed and gives an estimate of how much energy people are likely to use.

This helps in a few important ways:

It shows which seasons are easier to predict (like fall and spring) and which are harder (like summer).

It gives energy providers a better idea of what to expect, so they can prepare in advance — for example, by making sure there’s enough supply during colder months.

It supports better planning — city planners and policy-makers can use this kind of model to design smarter energy systems, reduce waste, and plan for future demand as the climate continues to change.

In short, our model helps turn weather data into useful insights about energy use — and that’s a key step toward building more sustainable and efficient cities.

- Conclusion

The MLR model shows that environmental factors influence household energy consumption, with performance varying by season. Temperature, humidity, and wind speed predict energy usage well in fall (R<sup>2</sup> = 0.813) and spring (R<sup>2</sup> = 0.764), moderately in winter (R<sup>2</sup> = 0.529), and poorly in summer (R<sup>2</sup> = 0.077). The weak summer result highlights the need for additional variables or different models for warmer months.