<a href="https://colab.research.google.com/github/A-Monaghan/LondonEnergy/blob/main/london_energy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# London Energy Consumption Prediction
### Project Overview
This project tackles the challenge of predicting daily total electrical consumption for customers within various London boroughs. Developed as a data science consulting engagement for a fictional London energy company, this solution provides crucial insights for operational planning, resource allocation, and strategic decision-making.

By integrating disparate datasets – historical London weather data and London home energy consumption – we've built a predictive model that can forecast energy demand, empowering the energy company to anticipate consumption patterns and optimize their services.

### Problem Statement
London's dynamic energy landscape necessitates accurate forecasting of electrical consumption to ensure grid stability, optimize energy procurement, and enhance customer satisfaction. Without a robust predictive model, energy companies face challenges like inefficient resource allocation, potential supply shortages, and missed opportunities for demand-side management.

### Project Goals
*   Data Integration: Combine and preprocess diverse datasets (weather and energy consumption) to create a unified, analyzable dataset.
*   Exploratory Data Analysis (EDA): Uncover key trends, patterns, and correlations within the combined data, particularly focusing on the relationship between weather variables and energy consumption.
*   Predictive Model Development: Engineer features and develop a machine learning model capable of accurately predicting daily total electrical consumption per borough.
*   Model Evaluation: Rigorously assess the model's performance using appropriate metrics and techniques.
*   Actionable Insights: Translate complex data analysis and model predictions into clear, concise, and actionable insights for a non-technical executive audience (the fictional CEO).
*   Communication Strategy: Develop and present a compelling narrative that highlights the business value and implications of the project findings.

## Data Sources
This project uses two primary datasets, adapted and transformed for this analysis:

*   London Daily Weather (1979-2021): Provides historical weather conditions including temperature, precipitation, and other meteorological variables. Original Source: Kaggle: London Daily Weather 1979 to 2021
*   London Hourly Energy Dataset (2011-2014): Contains hourly energy consumption data for various London homes, including borough information. Original Source: Kaggle: London Hourly Energy Dataset 2011 to 2014

## Technical Stack
Programming Language: Python

### Key Libraries:

*   pandas for data manipulation and analysis
*   numpy for numerical operations
*   scikit-learn for machine learning model development (e.g., regression models, preprocessing)
*   tensorflow (Keras) for the deep learning model used to impute cloud cover
*   matplotlib and seaborn for data visualization
*   folium for map-based visualization
*   Jupyter Notebook (Google Colab) for reproducible analysis and presentation

London daily weather 1979 to 2021:

* https://www.kaggle.com/datasets/emmanuelfwerr/london-weather-data



London hourly energy dataset 2011 to 2014:

* https://www.kaggle.com/datasets/emmanuelfwerr/london-homes-energy-data

# London Weather Dataset

#### Read the [London weather dataset](https://drive.google.com/file/d/1eT1YaXgNIjxFPjPfzpQHWLaWwK54s_uC/view?usp=sharing) into a Pandas DataFrame

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow import keras
import numpy as np

In [2]:
## Import Dataset - I'd downloaded and saved to google drive


File_ID = 'FILE_ID'
download_link = 'GOOGLEDRIVELINK'
Full_link = download_link.replace('FILE_ID', File_ID)


In [3]:
# Read from the resolved link rather than the bare file ID.
# 'FILE_ID' and 'GOOGLEDRIVELINK' above are placeholders and must be replaced with real values.
df_weather = pd.read_csv(Full_link)

# Statistical analysis

Explore the London weather dataset stats

In [None]:
# --- Initial Data Inspection ---

# Display the first 5 rows of the DataFrame
# This helps to quickly preview the data, column names, and initial data types.

df_weather.head()


In [None]:
# --- Feature Engineering: Date Transformation ---

# Original 'date' column is likely in a numerical format (e.g., YYYYMMDD).
# We extract year, month, and day into separate integer columns for easier analysis
# (e.g., grouping by year, analyzing seasonal patterns by month, or daily trends).

# Extract the first 4 characters for the year and convert to integer.
df_weather['year'] = df_weather['date'].astype(str).str[:4].astype(int)

# Extract characters from index 4 to 5 (6th character) for the month and convert to integer.
df_weather['month'] = df_weather['date'].astype(str).str[4:6].astype(int)

# Extract characters from index 6 onwards for the day and convert to integer.
df_weather['day'] = df_weather['date'].astype(str).str[6:].astype(int)

# --- (Commented Out Alternative Methods) ---
# These lines show alternative ways that were likely explored for date extraction.
# It's good practice to keep such exploratory code commented if not actively used,
# as it can serve as a reference for future work or understanding past attempts.

#df_weather['year'] = df_weather['date'].str[:5].astype(int)
#df_weather['month'] = df_weather['date'].str[5:7].astype(int)
#df_weather['day'] = df_weather['date'].str[7:].astype(int)
#df_weather['month'] = df_weather['date'].apply(lambda x: int(x[6:8]))
#df_weather['year'] = df_weather['date'].apply(lambda x: int(x[9:]))
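
As an alternative to string slicing, the same year, month, and day columns can be derived by parsing the dates once with `pd.to_datetime`. This is a minimal sketch, assuming the `date` column holds integers in `YYYYMMDD` form; the `sample` frame is hypothetical and only stands in for `df_weather`.

```python
import pandas as pd

# Hypothetical sample in the integer YYYYMMDD layout assumed above.
sample = pd.DataFrame({"date": [19790101, 20201231]})

# Parse once, then read the components from the datetime accessor.
parsed = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")
sample["year"] = parsed.dt.year
sample["month"] = parsed.dt.month
sample["day"] = parsed.dt.day

print(sample)
```

Parsing with an explicit `format` also raises an error on malformed dates, which plain slicing would silently accept.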

In [None]:
# --- Post-Transformation Inspection ---

# Display the first 5 rows again to confirm the new 'year', 'month', 'day' columns
# have been added correctly.
df_weather.head()


In [None]:
# Display the last 5 rows of the DataFrame.
# Useful for checking the end of the dataset, especially after transformations,
# and to see the data range.

df_weather.tail()


In [None]:
# Generate descriptive statistics for numerical columns.
# This provides summary metrics like count, mean, standard deviation, min, max,
# and quartiles, which are essential for understanding the distribution and
# central tendency of your data.
df_weather.describe()

In [None]:
# Print a concise summary of the DataFrame.
# This includes the index dtype and column dtypes, non-null values, and memory usage.
# It's vital for checking data types and identifying columns with missing values quickly.
df_weather.info()

In [None]:
# --- Data Quality and Exploration ---

# Display all unique values in the 'cloud_cover' column.
# This is useful for understanding the range of categories or discrete values
# within a specific column. It can help identify inconsistencies or unexpected values.
df_weather['cloud_cover'].unique()


In [None]:
# Calculate the pairwise correlation of all columns in the DataFrame.
# Correlation matrices show the linear relationship between variables, ranging from -1 to 1.
# - A value close to 1 indicates a strong positive linear relationship.
# - A value close to -1 indicates a strong negative linear relationship.
# - A value close to 0 indicates a weak or no linear relationship.
# This helps in identifying potential features for modeling and understanding multicollinearity.
df_weather.corr()

In [None]:
# Check for missing values in each column and sum them up.
# This provides a count of NaN (Not a Number) values for every column,
# indicating data completeness and where imputation or handling of missing data might be needed.
df_weather.isnull().sum()

In [None]:

# --- Annual Weather Summary Aggregation ---

# GroupBY the DataFrame by 'year' and calculate aggregate statistics for key weather metrics.
# This helps to summarize weather patterns on an annual basis, revealing trends over time.


result = df_weather.groupby('year')[['cloud_cover', 'global_radiation', 'sunshine', 'precipitation', 'mean_temp', 'max_temp', 'min_temp']].agg(
    {
        'global_radiation': ['min', 'max', 'mean'],
        'cloud_cover': ['min', 'max', 'mean'],
        'sunshine': ['min', 'max', 'mean'],
        'precipitation': ['min', 'max', 'mean'],
        'mean_temp': 'mean',
        'max_temp': 'max',
        'min_temp': 'min'
    }
)
result

# Visualisations


In [None]:
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np # Import numpy for select_dtypes


In [None]:

# --- Exploratory Data Analysis: Visualizations ---

# Create a scatter matrix for all numerical columns.
# This plot provides pairwise scatter plots, histograms for each variable,
# aiding in visualizing relationships and distributions quickly.

_, ax = plt.subplots(1, 1, figsize=(15, 15)) # Smaller for demonstration
scatter_matrix(df_weather.select_dtypes(include=np.number), ax=ax) # Only numerical columns
plt.suptitle('Scatter Matrix of Weather Data', y=1.02) # Add a suptitle
plt.show()

In [None]:
# Generate a heatmap of the correlation matrix.
# This visually represents the strength and direction of linear relationships
# between numerical variables, with annotations showing correlation coefficients.

plt.figure(figsize=(10, 10))
cm=df_weather.corr()
sns.heatmap(cm,annot=True,cmap="Blues")
plt.show()

In [None]:
# Plot line graphs for the mean annual precipitation, sunshine, and mean temperature.
# This visualizes trends of these key weather metrics over the years.

# Label each line so that plt.legend() has entries to show.
plt.plot(result.index, result[('precipitation', 'mean')], label='Mean precipitation')
plt.plot(result.index, result[('sunshine', 'mean')], label='Mean sunshine')
plt.plot(result.index, result[('mean_temp', 'mean')], label='Mean temperature')

plt.xlabel("Year")
plt.ylabel("Annual Mean Values") # The three metrics use different units, so the label is kept general
plt.title("Annual Trends: Mean Precipitation, Sunshine, and Temperature")

plt.legend() # Differentiate the three lines
plt.grid(True)
plt.show()


In [None]:
# --- Annual Aggregation with Flattened Column Names ---
# Recompute the per-year aggregates, then flatten the MultiIndex columns
# into single-level names such as 'precipitation_mean'.
result = df_weather.groupby('year')[['cloud_cover', 'global_radiation', 'sunshine', 'precipitation', 'mean_temp', 'max_temp', 'min_temp']].agg(
    {
        'global_radiation': ['min', 'max', 'mean'],
        'cloud_cover': ['min', 'max', 'mean'],
        'sunshine': ['min', 'max', 'mean'],
        'precipitation': ['min', 'max', 'mean'],
        'mean_temp': 'mean',
        'max_temp': 'max',
        'min_temp': 'min'
    }
)

# Join the two levels of each MultiIndex column into one name, e.g. ('sunshine', 'max') -> 'sunshine_max'
result.columns = ['_'.join(col).strip() for col in result.columns.values]
print("\nAggregated statistics by year (first and last 5 rows):")
print(result.head())
print(result.tail())
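
The flattening step above can be avoided by using pandas named aggregation, which produces single-level column names directly. This is a sketch on a hypothetical `toy` frame with only two of the weather columns; the output names are illustrative choices.

```python
import pandas as pd

# Hypothetical toy data standing in for df_weather.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "precipitation": [0.0, 4.0, 2.0, 6.0],
    "max_temp": [10.0, 20.0, 12.0, 25.0],
})

# Named aggregation: output_column=(input_column, aggregation).
summary = toy.groupby("year").agg(
    precipitation_mean=("precipitation", "mean"),
    precipitation_max=("precipitation", "max"),
    max_temp_max=("max_temp", "max"),
)

print(summary)
```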

In [None]:
# --- Specific Variable Visualizations ---

# Visualize the trend of mean cloud cover over years.
# A line plot helps to identify changes and patterns in cloudiness across the dataset's time span.

# Cloud Cover Visualization
plt.figure(figsize=(12, 6))
sns.lineplot(x='year', y='cloud_cover', data=df_weather.groupby('year')['cloud_cover'].mean().reset_index())
plt.xlabel('Year')
plt.ylabel('Mean Cloud Cover')
plt.title('Mean Cloud Cover Over Years')
plt.grid(True)
plt.show()

# Visualize the distribution of 'cloud_cover' using a histogram.
# This shows the frequency of different cloud cover values, providing insight into its typical range and variability.

plt.figure(figsize=(10, 6))
sns.histplot(df_weather['cloud_cover'], bins=10, kde=True)
plt.xlabel('Cloud Cover')
plt.ylabel('Frequency')
plt.title('Distribution of Cloud Cover')
plt.grid(True)
plt.show()

In [None]:
# Visualize the trend of mean snow depth over years.
# A line plot to observe annual variations and trends in snow depth.
# Snow Depth Visualization
plt.figure(figsize=(12, 6))
sns.lineplot(x='year', y='snow_depth', data=df_weather.groupby('year')['snow_depth'].mean().reset_index())
plt.xlabel('Year')
plt.ylabel('Mean Snow Depth')
plt.title('Mean Snow Depth Over Years')
plt.grid(True)
plt.show()


# Visualize the distribution of 'snow_depth' using a histogram.
# Shows the frequency distribution of snow depth values.
plt.figure(figsize=(10, 6))
sns.histplot(df_weather['snow_depth'], bins=10, kde=True)
plt.xlabel('snow_depth')
plt.ylabel('Frequency')
plt.title('Distribution of snow_depth')
plt.grid(True)
plt.show()

In [None]:
# --- Scatter Plots for Relationships Between Variables ---


# @title global_radiation vs max_temp
# Scatter plot of 'global_radiation' vs 'max_temp'.
# Helps visualize if higher global radiation is associated with higher maximum temperatures.


from matplotlib import pyplot as plt
df_weather.plot(kind='scatter', x='global_radiation', y='max_temp', s=32, alpha=.2)
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
# @title cloud_cover vs month

# Scatter plot of 'cloud_cover' vs 'month'.
# Explores the relationship between cloud cover and the month, potentially showing seasonal patterns.


from matplotlib import pyplot as plt
df_weather.plot(kind='scatter', x='cloud_cover', y='month', s=32, alpha=0.01, )
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
# Scatter plot of 'cloud_cover' vs 'sunshine'.
# Visualizes the inverse relationship expected between cloud cover and sunshine hours.

df_weather.plot(kind='scatter', x='cloud_cover', y='sunshine', s=32, alpha=.05)


In [None]:
# Scatter plot of 'cloud_cover' vs 'max_temp'.
# Explores how cloud cover might correlate with maximum daily temperatures.

df_weather.plot(kind='scatter', x='cloud_cover', y='max_temp', s=32, alpha=.03)


In [None]:
# Scatter plot of 'sunshine' vs 'global_radiation'.
# Visualizes the direct relationship between sunshine duration and global solar radiation.

# @title sunshine vs global_radiation

from matplotlib import pyplot as plt
df_weather.plot(kind='scatter', x='sunshine', y='global_radiation', s=32, alpha=.05)
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
# Scatter plot of 'cloud_cover' vs 'precipitation'.
# Explores the relationship between cloudiness and the amount of precipitation.

df_weather.plot(x='cloud_cover', y='precipitation', kind='scatter')

# Missing Data - Cloud Cover

## Data Imputation: Addressing Missing cloud_cover Data
A significant challenge was the missing data in the cloud_cover column, particularly before the 1990s. To preserve this historical information rather than discard it, a deep learning classification model was used for imputation. The aim is to predict the missing cloud_cover values from patterns in the other weather variables.

### Imputation Process:

*   Data Segregation: The dataset was split into two: one for training (complete cloud_cover data) and one for imputation (missing cloud_cover).
*   Data Preparation: Features and labels were prepared as NumPy arrays and split into training and testing sets, and the features were standardized using training-set statistics.
*   Model Training: A deep learning classification model was trained on the clean, prepared weather data.
*   Model Evaluation: The model's performance was tested on the held-out test set.
*   Data Filling: The trained model was then used to predict and fill the missing cloud_cover values in the original DataFrame, completing the dataset for subsequent analysis.
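
The imputation steps can be sketched end to end on a tiny example. This is not the notebook's model: the deep learning classifier is replaced by a 1-nearest-neighbour stand-in so the sketch stays small, and the `toy` frame with a single `sunshine` feature is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'cloud_cover' has gaps, 'sunshine' is a complete feature.
toy = pd.DataFrame({
    "sunshine": [0.0, 1.0, 9.0, 10.0, 0.2, 9.6],
    "cloud_cover": [8.0, 7.0, 1.0, 0.0, np.nan, np.nan],
})

# 1. Segregate rows with and without the target.
known = toy[toy["cloud_cover"].notna()]
missing = toy[toy["cloud_cover"].isna()]

# 2. Standardize the feature using statistics from the known rows only.
mean = known["sunshine"].mean()
std = known["sunshine"].std(ddof=0)
x_known = ((known["sunshine"] - mean) / std).to_numpy()
x_missing = ((missing["sunshine"] - mean) / std).to_numpy()
y_known = known["cloud_cover"].to_numpy(dtype=int)

# 3. Stand-in "model": copy the label of the nearest known sample.
nearest = np.abs(x_missing[:, None] - x_known[None, :]).argmin(axis=1)
predicted = y_known[nearest]

# 4. Fill the gaps in the original frame with the predictions.
toy.loc[toy["cloud_cover"].isna(), "cloud_cover"] = predicted

print(toy)
```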

In [None]:
# Initial Missing Value Check & Data Separation ---

# Display the count of missing values for each column in the original DataFrame.
# This confirms the extent of missing data before any manipulation.
df_weather.isnull().sum()


# Prepare NumPy Arrays for Deep Learning Model ---

### Explore unique values and their frequencies in 'cloud_cover'.
### This helps understand the distribution of the target variable for the classification model.


In [None]:
df_weather['cloud_cover'].unique()
df_weather['cloud_cover'].value_counts()



In [None]:
# Explore unique values and their frequencies in 'snow_depth'.
# This is a pre-emptive check, as 'snow_depth' also has many missing values that will need handling.

df_weather['snow_depth'].unique()
df_weather['snow_depth'].value_counts()



In [None]:
# Create two new DataFrames based on the presence/absence of 'cloud_cover' NaN values.
# 'df_weather_cloud_cover_nan': Contains rows where 'cloud_cover' is missing (to be imputed).
# 'df_weather_clean': Contains rows where 'cloud_cover' is not missing (to be used for training).

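# .copy() makes these independent DataFrames, so later column assignments
# (e.g. filling 'snow_depth') do not trigger pandas' SettingWithCopyWarning.
df_weather_cloud_cover_nan = df_weather[df_weather['cloud_cover'].isna()].copy()
df_weather_clean = df_weather[df_weather['cloud_cover'].notna()].copy()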

In [None]:
# Display the tail of 'df_weather_clean'.
# Verify that this DataFrame ends with complete 'cloud_cover' data (no NaNs at the end).

df_weather_clean.tail()

In [None]:
# Display the tail of 'df_weather_cloud_cover_nan'.
# Verify that this DataFrame correctly contains rows with missing 'cloud_cover'.

df_weather_cloud_cover_nan.tail()

In [None]:
# Re-check unique values and counts in the 'cloud_cover' column of the 'clean' DataFrame.
# Confirm that 'cloud_cover' column in this DataFrame has no NaN values and shows expected categories.

df_weather_clean['cloud_cover'].unique()
df_weather_clean['cloud_cover'].value_counts()



In [None]:
# Another check for unique values in 'cloud_cover' of the clean DataFrame. (Redundant, can be removed)

df_weather_clean['cloud_cover'].unique()


In [None]:
# --- Handle Missing 'snow_depth' (Imputation Strategy for another column) ---
# Replace NaN values in 'snow_depth' column with 0 in the 'df_weather_clean' DataFrame.
# The assumption is that NaN in 'snow_depth' indicates no snow, especially outside winter.

df_weather_clean['snow_depth'] = df_weather_clean['snow_depth'].fillna(0)


In [None]:
# Verify unique values in 'snow_depth' after filling NaNs.
# Expect to see 0 among the unique values, and no NaNs.

df_weather_clean['snow_depth'].unique()

In [None]:
# --- Final Missing Value Checks Before Model Data Prep ---

# Display missing values for the original DataFrame (before any drops).
# This is typically a reminder of the initial state of missing data.

df_weather.isnull().sum()

# Display missing values for the 'cleaned' DataFrame.
# This check should confirm that 'cloud_cover' NaNs are gone, and other NaNs might still exist.

df_weather_clean.isnull().sum()



In [None]:
# Remove any remaining rows with *any* NaN values from 'df_weather_clean'.
# This ensures that the dataset used for training the model is completely free of missing data.

df_weather_clean = df_weather_clean.dropna()

# Final check to confirm no missing values remain in the 'clean' DataFrame.

df_weather_clean.isnull().sum()

In [None]:
# --- Prepare Data for Deep Learning Model (Features & Labels) ---

# Create NumPy array for features (X) by dropping the 'cloud_cover' column from 'df_weather_clean'.
# The data type is set to float32, common for numerical deep learning inputs.

x = df_weather_clean.drop('cloud_cover', axis=1).to_numpy(dtype='float32')

# Create NumPy array for labels (y) using the 'cloud_cover' column from 'df_weather_clean'.
# The data type is set to int, as 'cloud_cover' is treated as a classification target.

y = df_weather_clean['cloud_cover'].to_numpy(dtype='int')

In [None]:
# --- Split Data into Training and Testing Sets ---

# Define the boundary for splitting the data (80% for training, 20% for testing).

boundary = int(x.shape[0] * 0.8)

# Split features and labels into training sets.
x_train = x[:boundary]
y_train = y[:boundary]

# Split features and labels into testing sets.

x_test = x[boundary:]
y_test = y[boundary:]

In [None]:
# --- Data Standardization ---

# Calculate the mean of each feature from the training data.
means = x_train.mean(axis=0)

# Calculate the standard deviation of each feature from the training data.
stds = x_train.std(axis=0)

In [None]:
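# Standardize the training features.

# (X - mean) / std ensures that features have a mean of 0 and standard deviation of 1.
# This is crucial for many deep learning models to converge faster and perform better.

x_train = (x_train - means) / stds

# Apply the same training-set statistics to the test features so that
# prediction and evaluation see inputs on the same scale as training.
x_test = (x_test - means) / stds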
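
The key point of this step is that the means and standard deviations come from the training rows only and are then reused for every other split. A minimal sketch on a hypothetical 5x2 array follows; the guard against zero standard deviation is an added assumption, not something the notebook does.

```python
import numpy as np

# Hypothetical feature matrix: 5 samples, 2 features.
x = np.array(
    [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [7.0, 70.0], [9.0, 90.0]],
    dtype="float32",
)

# Sequential 80/20 split, as in the notebook.
boundary = int(x.shape[0] * 0.8)
x_train, x_test = x[:boundary], x[boundary:]

# Statistics are computed on the training split only.
means = x_train.mean(axis=0)
stds = x_train.std(axis=0)
stds = np.where(stds == 0, 1.0, stds)  # assumption: avoid division by zero for constant features

# The same statistics are applied to both splits.
x_train_std = (x_train - means) / stds
x_test_std = (x_test - means) / stds

print(x_train_std.mean(axis=0), x_train_std.std(axis=0))
print(x_test_std)
```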

In [None]:
# --- (Commented Out: Class Weight Calculation) ---
# This block is commented out, but its purpose is to calculate class weights.
# Class weights are used in classification tasks, especially with imbalanced datasets,
# to give more importance to minority classes during model training, preventing the model
# from being biased towards the majority class.

"""from sklearn.utils.class_weight import compute_class_weight

# Get unique classes and their counts
unique_classes = np.unique(y_train)


class_weights_array = compute_class_weight(
    class_weight='balanced',
    classes=unique_classes,
    y=y_train # Your integer labels for the training data
)

# Convert the array to a dictionary, which model.fit() expects
class_weights_dict = dict(zip(unique_classes, class_weights_array))

print("Calculated Class Weights:")
print(class_weights_dict)
# Example output: {0: 0.5, 1: 2.5, 2: 0.8, ...} (where 0.5 means a majority class, 2.5 a minority)"""
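
For reference, the "balanced" weighting used by `compute_class_weight` is `n_samples / (n_classes * count_per_class)`. The sketch below reproduces that formula with NumPy on a hypothetical label array, to show why rare classes receive larger weights.

```python
import numpy as np

# Hypothetical imbalanced labels: class 0 appears three times, class 1 once.
y_toy = np.array([0, 0, 0, 1])

classes, counts = np.unique(y_toy, return_counts=True)

# Balanced weight per class: n_samples / (n_classes * count of that class).
weights = len(y_toy) / (len(classes) * counts)

# Dictionary form, keyed by class label.
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```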

# Cloud Cover Model

**Deep learning classification model using Tensorflow/Keras. This will:**
*   Use inverted bottlenecks
*   Use residual connections
*   Use gelu activations
*   Use a softmax for output layer
*   Use the sparse categorical crossentropy loss
*   Use the adam optimizer
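
To make the listed ingredients concrete, here is a NumPy sketch of the forward pass through one inverted-bottleneck block with a residual connection, followed by a softmax output. It is illustrative only: weights are random, biases and the learnable layer-norm parameters are omitted, the width is reduced to 8 instead of 256, and GELU uses the tanh approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    # Normalize each sample across its features (no learnable scale/offset here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def inverted_bottleneck_block(z, w_expand, w_project):
    s = z                      # keep the block input for the skip connection
    z = layer_norm(z)
    z = gelu(z @ w_expand)     # expand: width -> expansion * width
    z = z @ w_project          # project back: expansion * width -> width
    return s + z               # residual connection

width, expansion, num_classes = 8, 6, 10
z = rng.normal(size=(4, width))
w_expand = rng.normal(scale=0.1, size=(width, width * expansion))
w_project = rng.normal(scale=0.1, size=(width * expansion, width))
w_out = rng.normal(scale=0.1, size=(width, num_classes))

block_out = inverted_bottleneck_block(z, w_expand, w_project)
probs = softmax(layer_norm(block_out) @ w_out)

print(block_out.shape, probs.shape)
```

Because the block returns a tensor of the same width it received, any number of these blocks can be stacked before the output layer.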

In [None]:
# --- Model Training: Deep Learning for Cloud Cover Imputation ---


In [None]:
# Check the shapes of the prepared NumPy arrays.
# This is a crucial sanity check to ensure data dimensions are as expected
# before feeding them into the neural network.

print("Shape of training features (x_train):", x_train.shape) # Expected: (num_samples_train, num_features)
print("Shape of training labels (y_train):", y_train.shape)   # Expected: (num_samples_train,) for sparse categorical
print("Shape of testing features (x_test):", x_test.shape)     # Expected: (num_samples_test, num_features)
print("Shape of testing labels (y_test):", y_test.shape)       # Expected: (num_samples_test,)


In [None]:
# Determine the number of input features (columns) for the neural network.
# This should match the second dimension of x_train (number of features after dropping 'cloud_cover').
# Here, it's explicitly set to 12, implying 12 features are used to predict cloud cover.
# inputs match shape (column number)

inputs = keras.layers.Input(shape=(12,))
z = inputs


In [None]:
# --- Neural Network Architecture Definition (Custom Block Design) ---

# First Dense (fully connected) layer.
# Transforms the input features into a 256-dimensional representation.

z = keras.layers.Dense(256)(z)

# Loop to create a sequence of custom "residual" blocks.
# This architecture helps in building deeper networks by allowing information
# to skip layers, mitigating the vanishing gradient problem.

for i in range(4):
  s = z

  z = keras.layers.LayerNormalization()(z)

  # First Dense layer within the block, expanding dimensions.
  # The factor 6*256 suggests an expansion ratio within the block.

  z = keras.layers.Dense(6*256)(z)
  z = keras.activations.gelu(z)# GELU (Gaussian Error Linear Unit) activation function.
                                  # A popular alternative to ReLU, often used in modern architectures.
  # Second Dense layer within the block, projecting back to the original dimension (256).
  z = keras.layers.Dense(256)(z)

  # Residual connection: Add the input 's' to the output 'z' of the block.
  # This allows the network to learn residual functions and improves information flow.
  z = keras.layers.Add()([s,z])

# Final Layer Normalization before the output layer.
z = keras.layers.LayerNormalization()(z)
# Output Dense layer with 10 units.
# The number of units corresponds to the number of unique classes in 'cloud_cover'.
z = keras.layers.Dense(10)(z)
# Softmax activation function for the output layer.
# This converts the raw outputs (logits) into probability distributions over the 10 classes,
# suitable for multi-class classification where classes are mutually exclusive.
outputs = keras.activations.softmax(z)

Because cloud_cover takes the ten unique values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, the final Dense layer needs 10 units.

In [None]:
# Create the Keras Model.
# Defines the model by specifying its input and output layers.

model = keras.Model(inputs=[inputs], outputs=[outputs])

In [None]:
"""
# EarlyStopping callback
# monitor='val_loss': Monitors the validation loss
# patience=20: Waits for 20 epochs with no improvement in val_loss before stopping
# mode='min': Stops when the monitored quantity (val_loss) stops decreasing
# restore_best_weights=True: Restores the model weights from the epoch with the best monitored value

early_stopping_callback = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=20,
    mode='min',
    restore_best_weights=True,
    verbose=1 # To see messages when training stops or weights are restored
)
"""

In [None]:
# --- Model Compilation and Summary ---

# Set the number of training epochs.
epochs = 30

# Calculate steps per epoch, i.e. the number of gradient updates per epoch.
# model.fit below uses batch_size=64 and validation_split=0.2, so only 80% of
# x_train is used for gradient updates. np.ceil accounts for a final partial batch.
batch_size = 64
num_fit_samples = int(np.shape(x_train)[0] * (1 - 0.2))
steps_per_epoch = int(np.ceil(num_fit_samples / batch_size))
print("Steps per epoch =", steps_per_epoch)

# Compile the model.
# This configures the model for training by specifying:

model.compile(
    # Loss function for integer labels (0-9); suitable when labels are not one-hot encoded.
    loss=keras.losses.SparseCategoricalCrossentropy(),
    # Adam optimizer with a cosine-decay schedule: the learning rate starts at
    # initial_learning_rate and decays along a cosine curve to 0 over decay_steps.
    optimizer=keras.optimizers.Adam(
        learning_rate=keras.optimizers.schedules.CosineDecay(
            initial_learning_rate=0.01,
            decay_steps=epochs * steps_per_epoch,  # Total gradient updates over all epochs.
        )
    ),
    metrics=["accuracy"],  # Metric to monitor during training and evaluation.
)
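
The cosine-decay schedule can be written out as a plain function to see how the learning rate evolves. This sketch assumes the schedule's default final multiplier of 0 (the `alpha` parameter in Keras) and a hypothetical 100 steps per epoch; `cosine_decay_lr` is an illustrative helper, not a Keras function.

```python
import math

def cosine_decay_lr(step, initial_learning_rate, decay_steps):
    # Clamp so the rate stays at its final value after decay_steps.
    step = min(step, decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    return initial_learning_rate * cosine

total_steps = 30 * 100  # epochs * hypothetical steps per epoch

for step in (0, total_steps // 2, total_steps):
    print(step, cosine_decay_lr(step, 0.01, total_steps))
```

If `decay_steps` is larger than the number of updates actually performed, training ends before the learning rate reaches zero, which is why the step count should match the batch size and validation split used in `model.fit`.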


In [None]:
# Print a summary of the model's architecture.
# Shows layer types, output shapes, and the number of trainable parameters.

model.summary()

In [None]:
"""

model.fit(x_train, y_train,
          epochs=epochs,
          validation_data=(x_test, y_test),)
"""

In [None]:
# --- Model Training Execution ---

# Train the deep learning model.
# The `model.fit` method trains the model for a fixed number of epochs.

history = model.fit(
    x_train, # Training features.
    y_train, # Training labels.
    epochs=epochs, # Number of full passes through the training dataset.
    batch_size=64, # Number of samples per gradient update.
    validation_split=0.2, # Fraction of the training data to be used as validation data.
                          # This creates a validation set directly from x_train/y_train.
    # callbacks=[early_stopping_callback] # If early stopping was uncommented, it would be added here.
)

In [None]:
# --- Model Prediction ---

# Make predictions on the test set.
# The model outputs probabilities for each class (10 values for each sample in x_test).

y_pred = model.predict(x_test)

In [None]:
# Convert predicted probabilities to class labels (integers 0-9).
# np.argmax selects the index (class) with the highest predicted probability for each sample.

y_pred_labels = np.argmax(y_pred, axis=1)


In [None]:
# --- Model Evaluation on Test Set ---

# Evaluate the model's performance on the unseen test set.
# Returns the loss value followed by the metric values (here, accuracy).
# The results are stored under new names so the training `history` is not overwritten.

test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")

**Evaluate the performance of this deep learning model on the test set. This will include the use of:**
*   Sklearn confusion matrix
*   Sklearn classification report

In [None]:
# --- Classification Report and Confusion Matrix ---

# Import the confusion_matrix function from scikit-learn.

from sklearn.metrics import confusion_matrix
# Compute the confusion matrix.
# It's a table showing the number of correct and incorrect predictions for each class,
# essential for understanding where the model is performing well or struggling.
# Pass y_test (true labels) and y_pred_labels (predicted integer labels)
cm = confusion_matrix(y_test, y_pred_labels)
print("\n--- Confusion Matrix ---")
print(cm)

In [None]:
# Import the classification_report function from scikit-learn.
from sklearn.metrics import classification_report
# Generate a text report showing the main classification metrics (precision, recall, f1-score, support)
# for each class, and overall averages.
print(classification_report(y_test, y_pred_labels))
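
To show what the confusion matrix and the per-class recall in the report measure, the sketch below builds both by hand with NumPy. The three-class labels are hypothetical; rows are true classes and columns are predicted classes, matching the scikit-learn convention.

```python
import numpy as np

# Hypothetical true and predicted labels for three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
num_classes = 3

# Count each (true, predicted) pair.
cm = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

# Recall per class: correct predictions divided by the number of true samples of that class.
recall = cm.diagonal() / cm.sum(axis=1)

# Overall accuracy: correct predictions divided by all predictions.
accuracy = cm.diagonal().sum() / cm.sum()

print(cm)
print(recall, accuracy)
```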

**Evaluation of the model for filling in missing cloud_cover data**



The accuracy of this model is not very high: 27%.
With 10 classes to choose from, uniform random guessing would score about 10%, so the model has learned some signal, but it still gets fewer than 3 in 10 predictions correct. The model could improve if the distribution of the classes in the dataset were taken into consideration.

It does not appear to be overfitting.

A significant factor contributing to this low accuracy is likely the imbalanced distribution of classes within the dataset. Future improvements should focus on addressing this class imbalance, for example with class weights, so the model can learn more effectively from under-represented cloud cover categories.
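
With imbalanced classes, a more informative reference point than uniform random guessing is the majority-class baseline: always predicting the most frequent training label. The sketch below shows the calculation on hypothetical label arrays; the numbers are not results from this dataset.

```python
import numpy as np

# Hypothetical integer labels standing in for y_train and y_test.
y_train_toy = np.array([7, 7, 7, 6, 5, 7, 8, 6])
y_test_toy = np.array([7, 6, 7, 2])

# Most frequent class in the training labels.
majority_class = int(np.bincount(y_train_toy).argmax())

# Accuracy obtained by always predicting that class.
baseline_accuracy = float((y_test_toy == majority_class).mean())

print(majority_class, baseline_accuracy)
```

A model is only adding value if its test accuracy clearly exceeds this baseline.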



**Replace the missing cloud_cover data in the original dataset using this classification model**

In [None]:
# Prepare for inference; fill missing 'snow_depth' with 0 to mirror the training data preparation
x_pred = df_weather_cloud_cover_nan.drop(['cloud_cover',], axis=1).fillna({'snow_depth': 0}).to_numpy(dtype='float32')
print(x_pred.shape)

In [None]:
# Standardise the input with values from the training set (means and stds)

x_pred_standardized = (x_pred - means) / stds

print(f"Shape of x_pred_standardized: {x_pred_standardized.shape}")


In [None]:

# Make predictions on the *standardized inference data*
y_pred_probabilities = model.predict(x_pred_standardized)

print(f"Shape of y_pred_probabilities: {y_pred_probabilities.shape}")


In [None]:

# Convert the predicted probabilities to class labels (0-9)
predicted_labels = np.argmax(y_pred_probabilities, axis=1)
print(f"Shape of predicted_labels: {predicted_labels.shape}")


# Ensure the lengths match before imputation
if len(predicted_labels) == len(df_weather_cloud_cover_nan):
    # Replace the NaN values in the original df_weather with the model's predictions
    df_weather.loc[df_weather['cloud_cover'].isna(), 'cloud_cover'] = predicted_labels

df_weather.head()

# London Energy Dataset

**Read the [London energy dataset](https://drive.google.com/file/d/1elYGf3VwdDuMhkGGzFA9STGax6sq3iLT/view?usp=sharing) into a Pandas Dataframe**

In [None]:
# https://www.kaggle.com/datasets/emmanuelfwerr/london-homes-energy-data



In [None]:
File_ID = 'Your ID'
download_link = 'Googlelink'
London_link = download_link.replace('FILE_ID', File_ID)


In [None]:
# The cells below need this DataFrame; replace the placeholders above with real values first.
df_weather_london = pd.read_csv(London_link)

**London energy dataset stats**

In [None]:
df_weather_london.head()


In [None]:

df_weather_london.tail()

In [None]:
df_weather_london.describe()

In [None]:
df_weather_london.info()

In [None]:
df_weather_london['Borough'].unique()


In [None]:
df_weather_london['MWH'].unique()


In [None]:
df_weather_london.isnull().sum()

**London energy dataset visualisations**

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(df_weather_london['MWH'],alpha=0.5)
plt.show()



In [None]:

plt.figure(figsize=(12, 6))

# 'Date' column  converted to datetime
df_weather_london['Date'] = pd.to_datetime(df_weather_london['Date'])

sns.lineplot(x='Date', y='MWH', data=df_weather_london)
plt.xlabel('Date')
plt.ylabel('MWH (Megawatt Hours)')
plt.title('Total MWH Consumption Over Time')
plt.grid(True)
plt.show()
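
`sns.lineplot` aggregates repeated x values with their mean by default. If the energy table holds one row per borough per day, the line above therefore shows the mean across boroughs, not the total. A minimal sketch of computing daily totals first, assuming that row layout (the toy frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical energy table: one row per borough per day
toy_energy = pd.DataFrame({
    "Date": pd.to_datetime(["2012-01-01", "2012-01-01", "2012-01-02", "2012-01-02"]),
    "Borough": ["Camden", "Hackney", "Camden", "Hackney"],
    "MWH": [1.5, 2.5, 1.0, 3.0],
})

# Sum MWH across boroughs for each date
daily_total = toy_energy.groupby("Date", as_index=False)["MWH"].sum()
print(daily_total)
```

The resulting frame has one row per date and can be passed to the plotting call in place of the raw table.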

# Visualisations


In [None]:
from pandas.plotting import scatter_matrix


_, ax = plt.subplots(1, 1, figsize=(15, 15)) # Smaller for demonstration
scatter_matrix(df_weather_london.select_dtypes(include=np.number), ax=ax) # Only numerical columns
plt.suptitle('Scatter Matrix of Energy Data', y=1.02) # Add a suptitle
plt.show()

**Change the weather dataset's date to the format YYYY-MM-DD. This is in preparation for joining the weather and energy datasets by date.**

In [None]:
# check format of df_weather - it is in correct format
df_weather.head()
# use same column name
df_weather = df_weather.rename(columns={'date': 'Date'})


In [None]:
df_weather_london.head()

**Inner join the weather and energy datasets on the Date column using Pandas merge.**

In [None]:
# Merge the two dataframes on the date column using an 'inner join'
df_summed = df_weather_london.merge(df_weather, on='Date', how='inner')
df_summed.head()
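
An inner join keeps only the dates present in both tables, so it is worth confirming that both key columns share a dtype and checking how many rows survive. A minimal sketch on hypothetical toy frames, using the `validate` argument of `merge` to confirm that each energy row matches at most one weather row:

```python
import pandas as pd

# Hypothetical energy rows (several boroughs per date)
toy_energy = pd.DataFrame({
    "Date": ["2012-01-01", "2012-01-01", "2012-01-02", "2012-01-03"],
    "Borough": ["Camden", "Hackney", "Camden", "Camden"],
    "MWH": [1.5, 2.5, 1.0, 2.0],
})

# Hypothetical weather rows (one per date, 2012-01-03 missing)
toy_weather = pd.DataFrame({
    "Date": ["2012-01-01", "2012-01-02"],
    "mean_temp": [5.0, 6.5],
})

# Align both key columns to datetime before merging
toy_energy["Date"] = pd.to_datetime(toy_energy["Date"])
toy_weather["Date"] = pd.to_datetime(toy_weather["Date"])

# validate raises MergeError if a date appears more than once in the weather table
merged = toy_energy.merge(toy_weather, on="Date", how="inner", validate="many_to_one")

print(f"Energy rows: {len(toy_energy)}, merged rows: {len(merged)}")
```

Comparing the row counts before and after the merge shows how many energy records had no matching weather date.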

In [None]:
import folium
from folium.plugins import MarkerCluster
import matplotlib.pyplot as plt
import matplotlib.colors as colors

# Create a map centered around London
m = folium.Map(location=[51.5074, -0.1278], zoom_start=10, tiles='OpenStreetMap')

# Create a MarkerCluster to add multiple markers
marker_cluster = MarkerCluster().add_to(m)

norm = colors.Normalize(vmin=df_summed['MWH'].min(), vmax=df_summed['MWH'].max())

# Add markers with colour proportional to MWH values
for index, row in df_summed.iterrows():
    color = plt.cm.jet(norm(row['MWH']))
    color_hex = colors.rgb2hex(color)
    folium.CircleMarker(location=[row['Latitude'], row['Longitude']],
                        color=color_hex,
                        fill_color=color_hex,
                        fill_opacity=0.5).add_to(marker_cluster)

# Display the map
m

**One-hot encode the categorical Borough column in preparation for the deep learning model.**

In [None]:
# One hot encoding of categorical columns - try with, without or with pd.factorize (https://pandas.pydata.org/docs/reference/api/pandas.factorize.html)
df_summed = pd.get_dummies(
    df_summed, columns=['Borough']
)

In [None]:
# convert bool to int
for col in df_summed.select_dtypes(include='bool').columns:
    df_summed[col] = df_summed[col].astype(int)
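
The encoding and the bool-to-int conversion can be combined, since `pd.get_dummies` accepts a `dtype` argument. A small sketch on a hypothetical toy frame, with `pd.factorize` shown alongside as the integer-code alternative mentioned in the comment above:

```python
import pandas as pd

# Hypothetical frame with a categorical Borough column
toy = pd.DataFrame({
    "Borough": ["Camden", "Hackney", "Camden"],
    "MWH": [1.5, 2.5, 1.0],
})

# One-hot encode directly to integer columns
one_hot = pd.get_dummies(toy, columns=["Borough"], dtype=int)
print(one_hot.columns.tolist())

# Alternative: a single integer code per borough
codes, uniques = pd.factorize(toy["Borough"])
print(codes.tolist(), list(uniques))
```

Integer codes keep the input narrow but impose an arbitrary ordering on the boroughs, whereas one-hot columns do not.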

In [None]:
df_summed.head()


**Create NumPy arrays in preparation for training and testing the deep learning model. These need to be standardised and split into training and testing sets. The label is the MWH column.**

In [None]:
# Create numpy arrays
x = df_summed.drop('MWH', axis=1).to_numpy(dtype='float32')
y = df_summed['MWH'].to_numpy(dtype='float32')

In [None]:
# Create train and test sets
boundary = int(x.shape[0]*0.8)
x_train = x[:boundary]
y_train = y[:boundary]
x_test = x[boundary:]
y_test = y[boundary:]
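
The positional split above depends on row order. If the merged frame is grouped by borough instead of ordered by date, the last 20% of rows may cover only a few boroughs. One option, sketched here on hypothetical toy data, is to sort by date first so that the test set is the most recent period:

```python
import pandas as pd

# Hypothetical rows in arbitrary order
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2012-01-03", "2012-01-01", "2012-01-05", "2012-01-02", "2012-01-04"]
    ),
    "MWH": [3.0, 1.0, 5.0, 2.0, 4.0],
})

# Sort chronologically before taking the 80/20 boundary
toy_sorted = toy.sort_values("Date").reset_index(drop=True)

boundary = int(len(toy_sorted) * 0.8)
train = toy_sorted.iloc[:boundary]
test = toy_sorted.iloc[boundary:]

print(len(train), len(test))
```

A shuffled split is the other common choice. Which is appropriate depends on whether the model is meant to forecast future dates or to interpolate within the sampled period.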

In [None]:
# Standardise the input
means = x_train.mean(axis=0)
stds = x_train.std(axis=0)
x_train = (x_train - means) / stds



In [None]:

# And test set with values from training set
x_test = (x_test - means) / stds
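
A one-hot column that is constant in the training rows has a standard deviation of zero, and dividing by it produces NaN or infinity. A minimal sketch of a guard, using hypothetical toy arrays: zero standard deviations are replaced with 1 so those columns are only centred.

```python
import numpy as np

# Hypothetical training data: the second column is constant
toy_train = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]], dtype="float32")
toy_test = np.array([[3.0, 1.0]], dtype="float32")

toy_means = toy_train.mean(axis=0)
toy_stds = toy_train.std(axis=0)

# Replace zero standard deviations with 1 to avoid division by zero
safe_stds = np.where(toy_stds == 0, 1.0, toy_stds)

train_standardised = (toy_train - toy_means) / safe_stds
test_standardised = (toy_test - toy_means) / safe_stds

print(np.isfinite(train_standardised).all(), np.isfinite(test_standardised).all())
```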


In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

**Create and train a deep learning regression model using Tensorflow/Keras. This will:**
*   Use inverted bottlenecks
*   Use residual connections
*   Use gelu activations
*   Use a gelu activation for output layer
*   Use the mean squared error loss
*   Use the adam optimizer

In [None]:
# 54 inputs
inputs = keras.layers.Input(shape=(54,))
z = inputs

# Projection layer from 54 to 192
z = keras.layers.Dense(192)(z)

for i in range(3):
  # Shortcut connection
  s = z

  # Layer norm
  z = keras.layers.LayerNormalization()(z)

  # Expand this dimension up to 4*192 neurons
  z = keras.layers.Dense(4*192)(z)
  z = keras.activations.gelu(z)
  z = keras.layers.Dense(192)(z)

  # Add the shortcut connection with the output of the block
  z = keras.layers.Add()([s,z])

# Regression output: a single value (MWH) with gelu activation
z = keras.layers.Dense(1)(z)
outputs = keras.activations.gelu(z)

In [None]:
model = keras.Model(inputs=[inputs], outputs=[outputs])
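
To make the block structure concrete, here is a NumPy sketch of the forward pass of one residual inverted-bottleneck block: layer normalisation, expansion to four times the width, GELU, projection back, and the shortcut addition. It is an illustration, not the Keras implementation. The weights are random, biases and the learned layer-norm scale and offset are omitted, the width is 8 instead of 192, and GELU uses the tanh approximation.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (an assumption of this sketch)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-3):
    # Normalise each row to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def bottleneck_block(x, w_expand, w_project):
    shortcut = x                  # keep the block input for the residual path
    z = layer_norm(x)
    z = gelu(z @ w_expand)        # expand to 4 * width, then apply GELU
    z = z @ w_project             # project back to width
    return shortcut + z           # residual connection

rng = np.random.default_rng(0)
width = 8
x = rng.normal(size=(4, width))
w_expand = rng.normal(scale=0.1, size=(width, 4 * width))
w_project = rng.normal(scale=0.1, size=(4 * width, width))

out = bottleneck_block(x, w_expand, w_project)
print(out.shape)
```

With the projection weights set to zero the block returns its input unchanged, which illustrates how the shortcut lets each block start close to the identity.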

In [None]:
import numpy as np

epochs = 20
steps_per_epoch = int(np.ceil(np.shape(x_train)[0] / 32))  # 32 is the default batch size of model.fit
print(steps_per_epoch)

model.compile(
  loss=keras.losses.MeanSquaredError(),
  optimizer=keras.optimizers.Adam(learning_rate=keras.optimizers.schedules.CosineDecay(initial_learning_rate=0.01, decay_steps=epochs*steps_per_epoch)),
  metrics=["mse", 'mape']
)
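
For reference, a cosine decay schedule without warm-up and with a final multiplier of zero (the `alpha=0.0` default of `CosineDecay`) follows `lr(step) = initial_lr * 0.5 * (1 + cos(pi * step / decay_steps))`. A plain-Python sketch of that formula follows. `cosine_decay` is a hypothetical helper, not the Keras class, and the step count is illustrative.

```python
import math

def cosine_decay(step, initial_lr=0.01, decay_steps=1000):
    # Hold the final value once decay_steps has been reached
    step = min(step, decay_steps)
    return initial_lr * 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))

# Learning rate at the start, midpoint, and end of training
for step in (0, 500, 1000):
    print(step, cosine_decay(step))
```

The rate starts at the initial value, passes through half of it at the midpoint, and reaches zero at `decay_steps`. This is why `decay_steps` should match the total number of optimiser steps, `epochs * steps_per_epoch`.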

In [None]:
model.summary()

In [None]:
model.fit(x_train, y_train, epochs=epochs)

**Evaluate the performance of this deep learning model on both the training and test sets. This will include the use of:**
*   Sklearn root mean squared error
*   Sklearn mean absolute percentage error

In [None]:
mse = model.evaluate(x_train, y_train)[1]
print(mse)

In [None]:
# converts the multi dimensional array into 1D
y_train_pred = model.predict(x_train).flatten() # .flatten() to convert (N, 1) to (N,)
y_test_pred = model.predict(x_test).flatten()

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

# Formatted to 4 decimal places

# RMSE for Training Set
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
print(f"Training RMSE: {rmse_train:.4f}")

# RMSE for Test Set
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f"Test RMSE: {rmse_test:.4f}")


In [None]:
# Formatted to 2 decimal places as a percentage
# MAPE for Training Set

mape_train = mean_absolute_percentage_error(y_train, y_train_pred) * 100
print(f"Training MAPE: {mape_train:.2f}%") # Formatted to 2 decimal places as a percentage

# MAPE for Test Set
mape_test = mean_absolute_percentage_error(y_test, y_test_pred) * 100
print(f"Test MAPE: {mape_test:.2f}%")
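
As a sanity check on what the two metrics measure, both can be written in a few lines of NumPy. This sketch uses hypothetical toy values and omits the small-denominator guard that scikit-learn applies in its MAPE, so it assumes no true value is zero.

```python
import numpy as np

# Hypothetical true values and predictions
toy_true = np.array([2.0, 4.0])
toy_pred = np.array([1.0, 5.0])

# Root mean squared error: same units as the target (MWH)
toy_rmse = np.sqrt(np.mean((toy_true - toy_pred) ** 2))

# Mean absolute percentage error: relative error, in percent
toy_mape = np.mean(np.abs((toy_true - toy_pred) / toy_true)) * 100

print(f"RMSE: {toy_rmse:.4f}")
print(f"MAPE: {toy_mape:.2f}%")
```

Both predictions are off by 1, but the relative error is twice as large for the smaller true value. MAPE can therefore be dominated by rows with very low consumption.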



**Evaluate the use of this model for energy use prediction. Are there potential issues around bias?**

Yes, there are a couple of potential sources of bias.

Sampling bias: the collected dataset may not be representative of the population as a whole. If the sampled households are skewed towards high-income or low-income areas, the model's predictions will inherit that skew.

Temporal bias: energy usage changes over time. New technology can both reduce and increase consumption, and societal changes and one-off events (the London Olympics, a war) drive fluctuations in usage. The sampled timeframe therefore needs to be representative of general usage.
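
One way to look for this kind of skew is to break the error down by borough instead of reporting a single figure. A minimal sketch on hypothetical toy predictions (the borough names and values are illustrative only):

```python
import pandas as pd

# Hypothetical true values and predictions for two boroughs
toy_results = pd.DataFrame({
    "Borough": ["Camden", "Camden", "Hackney", "Hackney"],
    "y_true": [2.0, 4.0, 10.0, 20.0],
    "y_pred": [1.0, 5.0, 9.0, 21.0],
})

# Absolute percentage error per row
toy_results["ape"] = (
    (toy_results["y_true"] - toy_results["y_pred"]).abs() / toy_results["y_true"] * 100
)

# Mean absolute percentage error per borough
mape_by_borough = toy_results.groupby("Borough")["ape"].mean()
print(mape_by_borough)
```

A large gap between boroughs suggests the model serves some areas much better than others, which is worth investigating before the forecasts are used for planning.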




