<a href="https://colab.research.google.com/github/AkhilNam/EVChargingLoadsMLPredictor/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Residential EV Charging Loads using Neural Networks

- [View Solution Notebook](./solutions.html)
- [View Project Page](https://www.codecademy.com/)

In [1]:
# Setup - import basic data libraries
import numpy as np
import pandas as pd

## Task Group 1 - Load, Inspect, and Merge Datasets

### Task 1

The file `'datasets/EV charging reports.csv'` contains electric vehicle (EV) charging data from several residential apartment buildings in Norway. The data includes user and garage information, plug-in and plug-out times, charging loads, and the dates of the charging sessions.

Import this CSV file to a pandas DataFrame named `ev_charging_reports`.

Use the `.head()` method to preview the first five rows.

In [None]:
ev_charging_reports = pd.read_csv('datasets/EV charging reports.csv')
ev_charging_reports.head()

Unnamed: 0,Unnamed: 1,session_ID;Garage_ID;User_ID;User_type;Shared_ID;Start_plugin;Start_plugin_hour;End_plugout;End_plugout_hour;El_kWh;Duration_hours;month_plugin;weekdays_plugin;Plugin_category;Duration_category
1;AdO3;AdO3-4;Private;NA;21.12.2018 10:20;10;21.12.2018 10:23;10;0,3;0,05;Dec;Friday;late morning (9-12);Less than 3 ...
2;AdO3;AdO3-4;Private;NA;21.12.2018 10:24;10;21.12.2018 10:32;10;0,87;0,136666667;Dec;Friday;late morning (9-12);Less ...
3;AdO3;AdO3-4;Private;NA;21.12.2018 11:33;11;21.12.2018 19:46;19;29,87;8,216388889;Dec;Friday;late morning (9-12);Betwe...
4;AdO3;AdO3-2;Private;NA;22.12.2018 16:15;16;23.12.2018 16:40;16;15,56;24,41972222;Dec;Saturday;late afternoon (15-18);M...
5;AdO3;AdO3-2;Private;NA;24.12.2018 22:03;22;24.12.2018 23:02;23;3,62;0,970555556;Dec;Monday;late evening (21-midnight...
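
The preview above shows each record packed into semicolon-joined strings, which suggests the file is semicolon-delimited while `pd.read_csv` splits on commas by default. The sketch below is a minimal illustration, assuming that diagnosis is right. It parses a two-row in-memory sample copied from the preview (with the trailing category columns left out) rather than the real file, and the names `sample_csv` and `sample` are made up for this example.

```python
import io
import pandas as pd

# Two rows copied from the preview above (trailing category columns omitted)
sample_csv = (
    "session_ID;Garage_ID;User_ID;User_type;Shared_ID;Start_plugin;"
    "Start_plugin_hour;End_plugout;End_plugout_hour;El_kWh;Duration_hours\n"
    "1;AdO3;AdO3-4;Private;NA;21.12.2018 10:20;10;21.12.2018 10:23;10;0,3;0,05\n"
    "2;AdO3;AdO3-4;Private;NA;21.12.2018 10:24;10;21.12.2018 10:32;10;0,87;0,136666667\n"
)

# sep=';' tells pandas to split fields on semicolons instead of commas
sample = pd.read_csv(io.StringIO(sample_csv), sep=';')

print(sample.shape)               # (2, 11)
print(sample['El_kWh'].tolist())  # ['0,3', '0,87']
```

With this parse, the decimal commas stay inside their own fields as strings, which is the situation Task 6 later deals with.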


<details><summary style="display:list-item; font-size:16px; color:blue;">What is the structure of the dataset?</summary>

- **session_ID** - the unique id for each EV charging session
- **Garage_ID** - the unique id for the garage of the apartment
- **User_ID** - the unique id for each user
- **User_private** - 1.0 indicates private charge point spaces and 0.0 indicates shared charge point spaces
- **Shared_ID** - the unique id if shared charge point spaces are used
- **Start_plugin** - the plug-in date and time in the format (day.month.year hour:minute)
- **Start_plugin_hour** - the plug-in date and time rounded to the start of the hour
- **End_plugout** - the plug-out date and time in the format (day.month.year hour:minute)
- **End_plugout_hour** - the plug-out date and time rounded to the start of the hour
- **El_kWh** - the charged energy in kWh (charging loads)
- **Duration_hours** - the duration of the EV connection time per session
- **Plugin_category** - the plug-in time categorized by early/late night, morning, afternoon, and evening
- **Duration_category** - the plug-in duration categorized by 3 hour groups
- **month_plugin_{month}** - the month of the plug-in session
- **weekdays_plugin_{day}** - the day of the week of the plug-in session

</details>

### Task 2

Import the file `'datasets/Local traffic distribution.csv'` to a pandas DataFrame named `traffic_reports`. This dataset contains the hourly local traffic density counts at 5 nearby traffic locations.

Preview the first five rows.

In [None]:
traffic_reports = pd.read_csv('datasets/Local traffic distribution.csv')
traffic_reports.head()

Unnamed: 0,Date_from;Date_to;KROPPAN BRU;MOHOLTLIA;SELSBAKK;MOHOLT RAMPE 2;Jonsvannsveien vest for Steinanvegen
0,01.12.2018 00:00;01.12.2018 01:00;639;0;0;4;144
1,01.12.2018 01:00;01.12.2018 02:00;487;153;115;...
2,01.12.2018 02:00;01.12.2018 03:00;408;85;75;10;69
3,01.12.2018 03:00;01.12.2018 04:00;282;89;56;8;39
4,01.12.2018 04:00;01.12.2018 05:00;165;64;34;3;25


<details><summary style="display:list-item; font-size:16px; color:blue;">What is the structure of the dataset?</summary>

- **Date_from** - the starting time in the format (day.month.year hour:minute)
- **Date_to** - the ending time in the format (day.month.year hour:minute)
- **Location 1 to 5** - the number of vehicles counted each hour at the specified traffic location

</details>


### Task 3

We'd like to use the traffic data to help our model. The same charging location may charge at different rates depending on the number of cars being charged, so this traffic data might help the model out.

Merge the `ev_charging_reports` and `traffic_reports` datasets into a DataFrame named `ev_charging_traffic` using the columns:

- `Start_plugin_hour` in `ev_charging_reports`
- `Date_from` in `traffic_reports`

In [None]:
ev_charging_traffic = pd.merge(
    ev_charging_reports,
    traffic_reports,
    left_on='Start_plugin_hour',
    right_on='Date_from',
    how='left'
)
ev_charging_traffic.head()

KeyError: 'Date_from'
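
`pd.merge` raises a `KeyError` when the name passed to `right_on` is not a column of the right-hand DataFrame, which would be the case if the traffic file had been read as a single semicolon-joined column. The sketch below uses two tiny hand-made frames to show what a left merge on hourly keys looks like once both key columns exist. The frames `charging` and `traffic` and their values are illustrative, and the sketch assumes both keys hold timestamps in the same string format.

```python
import pandas as pd

# Hypothetical charging sessions keyed by the hour they started in
charging = pd.DataFrame({
    'session_ID': [1, 2, 3],
    'Start_plugin_hour': ['21.12.2018 10:00', '21.12.2018 11:00', '22.12.2018 16:00'],
})

# Hypothetical hourly traffic counts; no row exists for 22.12.2018 16:00
traffic = pd.DataFrame({
    'Date_from': ['21.12.2018 10:00', '21.12.2018 11:00'],
    'KROPPAN BRU': [3244, 3605],
})

# how='left' keeps every charging session, filling NaN where no traffic row matches
merged = pd.merge(
    charging,
    traffic,
    left_on='Start_plugin_hour',
    right_on='Date_from',
    how='left',
)
print(merged)
```

A left merge keeps the session count unchanged, so unmatched hours show up as missing traffic values rather than dropped rows.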

### Task 4

Use `.info()` to inspect the merged dataset. Specifically, pay attention to the data types and number of missing values in each column.

In [None]:
ev_charging_traffic.info()

NameError: name 'ev_charging_traffic' is not defined

<details><summary style="display:list-item; font-size:16px; color:blue;">What do we notice about the merged dataset under inspection?</summary>

We see that there are 39 columns and 6,833 rows in our merged dataset.

Some notable things we might have to address:

- We expected columns like `El_kWh` and `Duration_hours` to be floats but they are actually object data types.

- There are many identifying columns like `session_ID` and `User_ID` that might not be useful for training.

</details>

## Task Group 2 - Data Cleaning and Preparation

### Task 5

Let's start by reducing the size of our dataset by dropping columns that won't be used for training. These include
- ID columns
- columns with lots of missing data
- non-numeric columns (for now, since we haven't yet covered using non-numeric data in neural networks)

Drop the columns you don't want to use in training from `ev_charging_traffic` and save the result as `ev_charging_traffic_hourly`.

To match our solution, drop the columns

```py
['session_ID', 'Garage_ID', 'User_ID',
 'Shared_ID',
 'Plugin_category', 'Duration_category',
 'Start_plugin', 'Start_plugin_hour', 'End_plugout', 'End_plugout_hour',
 'Date_from', 'Date_to']
```

In [None]:
columns_to_drop = [
    'session_ID', 'Garage_ID', 'User_ID',
    'Shared_ID', 'Plugin_category', 'Duration_category',
    'Start_plugin', 'Start_plugin_hour', 'End_plugout', 'End_plugout_hour',
    'Date_from', 'Date_to'
]

# Drop columns from ev_charging_traffic
ev_charging_traffic_hourly = ev_charging_traffic.drop(columns=columns_to_drop)

# Display the first few rows of the updated DataFrame to verify
ev_charging_traffic_hourly.head()

Unnamed: 0,User_private,El_kWh,Duration_hours,month_plugin_Apr,month_plugin_Aug,month_plugin_Dec,month_plugin_Feb,month_plugin_Jan,month_plugin_Jul,month_plugin_Jun,...,weekdays_plugin_Saturday,weekdays_plugin_Sunday,weekdays_plugin_Thursday,weekdays_plugin_Tuesday,weekdays_plugin_Wednesday,Kroppan_bru_traffic,Moholtlia_traffic,Selsbakk_traffic,Moholt_rampe_2_traffic,Jonsvannsveien_vest_steinanvegen_traffic
0,1.0,"0,3","0,05",0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3244.0,1632.0,545.0,194.0,622.0
1,1.0,"0,87","0,136666667",0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3244.0,1632.0,545.0,194.0,622.0
2,1.0,"29,87","8,216388889",0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3605.0,1691.0,605.0,230.0,771.0
3,1.0,"15,56","24,41972222",0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,3052.0,1484.0,453.0,224.0,694.0
4,1.0,"3,62","0,970555556",0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1390.0,693.0,226.0,83.0,353.0


### Task 6

Earlier we saw that the `El_kWh` and `Duration_hours` columns were object data types. On closer inspection, the reason is that the data follows European notation, where a comma `,` is used as the decimal separator instead of a period.

Replace `,` with `.` in these two columns.

In [None]:
# Replace decimal commas with periods (the conversion to float happens in Task 7)
columns_to_clean = ['El_kWh', 'Duration_hours']

for col in columns_to_clean:
    # regex=False treats ',' as a literal character rather than a pattern
    ev_charging_traffic_hourly[col] = ev_charging_traffic_hourly[col].str.replace(',', '.', regex=False)

# Display the updated DataFrame to confirm the columns are cleaned
print(ev_charging_traffic_hourly.head())


   User_private El_kWh Duration_hours  month_plugin_Apr  month_plugin_Aug  \
0           1.0    0.3           0.05               0.0               0.0   
1           1.0   0.87    0.136666667               0.0               0.0   
2           1.0  29.87    8.216388889               0.0               0.0   
3           1.0  15.56    24.41972222               0.0               0.0   
4           1.0   3.62    0.970555556               0.0               0.0   

   month_plugin_Dec  month_plugin_Feb  month_plugin_Jan  month_plugin_Jul  \
0               1.0               0.0               0.0               0.0   
1               1.0               0.0               0.0               0.0   
2               1.0               0.0               0.0               0.0   
3               1.0               0.0               0.0               0.0   
4               1.0               0.0               0.0               0.0   

   month_plugin_Jun  ...  weekdays_plugin_Saturday  weekdays_plugin_Sunday

### Task 7

Next, convert the data types of all the columns of `ev_charging_traffic_hourly` to floats.

In [None]:
ev_charging_traffic_hourly = ev_charging_traffic_hourly.astype(float)
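
Tasks 6 and 7 together turn comma-decimal strings into floats. A minimal sketch of that two-step conversion on a small stand-in frame (`demo` is a made-up name; the values are taken from the first rows shown earlier):

```python
import pandas as pd

# Stand-in for the two object-typed columns, using values from the preview
demo = pd.DataFrame({
    'El_kWh': ['0,3', '0,87', '29,87'],
    'Duration_hours': ['0,05', '0,136666667', '8,216388889'],
})

# Step 1: swap the decimal comma for a period (still strings at this point)
for col in ['El_kWh', 'Duration_hours']:
    demo[col] = demo[col].str.replace(',', '.', regex=False)

# Step 2: convert every column to float
demo = demo.astype(float)

print(demo.dtypes)
print(demo['El_kWh'].tolist())  # [0.3, 0.87, 29.87]
```

Calling `.astype(float)` before the replacement would fail, because a string such as `'0,3'` cannot be parsed as a float.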

## Task Group 3 - Train Test Split

Next, let's split the dataset into training and testing datasets.

The training data will be used to train the model and the testing data will be used to evaluate the model.

### Task 8

First, create two datasets from `ev_charging_traffic_hourly`:

- `X` contains only the input numerical features
- `y` contains only the target column `El_kWh`

In [None]:
X = ev_charging_traffic_hourly.drop(columns=['El_kWh'])
y = ev_charging_traffic_hourly['El_kWh']

### Task 9

Use `sklearn` to split `X` and `y` into training and testing datasets. The training set should use 80% of the data. Set the `random_state` parameter to `2`.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
    train_size=0.80,
    test_size=0.20,
    random_state=2)
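
To see what these arguments do, here is a small sketch on made-up arrays (`X_demo` and `y_demo` are placeholders, not the project data). It checks the 80/20 sizes and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 placeholder samples with 2 features each
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=2)

# Splitting again with the same random_state gives the same rows
X_tr_again, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=2)

print(X_tr.shape, X_te.shape)            # (80, 2) (20, 2)
print(np.array_equal(X_te, X_te_again))  # True
```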

## Task Group 4 - Linear Regression Baseline

This section is optional, but useful. The idea is to compare our neural network to a basic linear regression. After all, if a basic linear regression works just as well, there's no need for the neural network!

If you haven't done linear regression with scikit-learn before, feel free to use [our solution code](./solutions.html) or to skip ahead.

### Task 10

Use Scikit-learn to train a Linear Regression model using the training data to predict EV charging loads.

The linear regression will be used as a baseline to compare against the neural network we will train later.

### Task 11

Evaluate the linear regression baseline by calculating the MSE on the testing data. Use `mean_squared_error` from `sklearn.metrics`.

Save the testing MSE to the variable `test_mse` and print it out.

Looks like our mean squared error is around `131.4` (if you used different columns in your model than we did, you might have a different value). Remember, this is squared error, so its units are kWh squared. Taking the square root gives about `11.5`, so one way of interpreting this is to say that the linear regression's predictions are typically off by about `11.5 kWh`.
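
The baseline workflow from Tasks 10 and 11 can be sketched as follows. This is not the project's solution code: it runs on synthetic data (`X_demo` and `y_demo` are made up, with a known linear relationship plus noise), so the printed numbers will not match the `131.4` quoted above. With the project data, `X_train`, `y_train`, `X_test`, and `y_test` would take the place of the demo splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data: linear signal plus Gaussian noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=2)

# Fit the baseline on the training split only
linear_model = LinearRegression()
linear_model.fit(X_tr, y_tr)

# Evaluate on the held-out split
test_mse = mean_squared_error(y_te, linear_model.predict(X_te))
test_rmse = test_mse ** 0.5  # same units as the target

print(test_mse, test_rmse)
print(round(131.4 ** 0.5, 1))  # 11.5, the square root quoted in the text
```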

## Task Group 5 - Train a Neural Network Using PyTorch

Let's now create a neural network using PyTorch to predict EV charging loads.

### Task 12

First, we'll need to import the PyTorch library and modules.

Import the PyTorch library `torch`.

From `torch`, import `nn` to access built-in code for constructing networks and defining loss functions.

From `torch`, import `optim` to access built-in optimizer algorithms.

In [None]:
import torch
from torch import nn
from torch import optim

### Task 13

Before training the neural network, convert the training and testing sets into PyTorch tensors and specify `float` as the data type for the values.

In [None]:
# Drop rows with NaN features and keep only the matching target rows
X_train_clean = X_train.dropna()
y_train_clean = y_train.loc[X_train_clean.index]
X_test_clean = X_test.dropna()
y_test_clean = y_test.loc[X_test_clean.index]

# Convert to tensors and ensure y_train_tensor is 2D
X_train_tensor = torch.tensor(X_train_clean.to_numpy(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_clean.to_numpy(), dtype=torch.float32).view(-1, 1)
X_test_tensor = torch.tensor(X_test_clean.to_numpy(), dtype=torch.float32)
y_test_tensor = torch.tensor(y_test_clean.to_numpy(), dtype=torch.float32).view(-1, 1)
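
The `.view(-1, 1)` calls matter because the network outputs a tensor of shape `(N, 1)`. A small sketch with made-up numbers shows what happens when the target is left one-dimensional: the subtraction broadcasts to an `(N, N)` matrix instead of comparing element by element.

```python
import torch
from torch import nn

preds = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1), like a model's output
target_1d = torch.tensor([1.0, 2.0, 3.0])    # shape (3,)
target_2d = target_1d.view(-1, 1)            # shape (3, 1)

mse = nn.MSELoss()

# Matching shapes: predictions equal targets, so the loss is zero
print(mse(preds, target_2d).item())  # 0.0

# Mismatched shapes broadcast to a 3x3 grid of pairwise differences
print((preds - target_1d).shape)     # torch.Size([3, 3])
```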


### Task 14

Next, let's use `nn.Sequential` to create a neural network.

First, set a random seed using `torch.manual_seed(42)`.

Then, create a sequential neural network with the following architecture:

- input layer with number of nodes equal to the number of training features
- a first hidden layer with `56` nodes and a ReLU activation
- a second hidden layer with `26` nodes and a ReLU activation
- an output layer with `1` node

Save the network to the variable `model`.

In [None]:
torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(X_train_tensor.shape[1], 56),  # input size = number of training features (26 here)
    nn.ReLU(),
    nn.Linear(56, 26),
    nn.ReLU(),
    nn.Linear(26, 1)
)
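
A quick sanity check on this architecture, as a standalone sketch: it rebuilds the same layers under the assumption of 26 input features, passes a random batch through, and counts the trainable parameters. The names `n_features`, `demo_model`, and `batch` are introduced here for illustration.

```python
import torch
from torch import nn

torch.manual_seed(42)

n_features = 26  # assumed input width, matching the first Linear layer above
demo_model = nn.Sequential(
    nn.Linear(n_features, 56),
    nn.ReLU(),
    nn.Linear(56, 26),
    nn.ReLU(),
    nn.Linear(26, 1),
)

# A batch of 4 random samples should produce 4 single-value predictions
batch = torch.randn(4, n_features)
out = demo_model(batch)

# Weights plus biases: (26*56 + 56) + (56*26 + 26) + (26*1 + 1)
n_params = sum(p.numel() for p in demo_model.parameters())

print(out.shape)  # torch.Size([4, 1])
print(n_params)   # 3021
```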

### Task 15

Next, let's define the loss function and optimizer used for training:

- set the MSE loss function to the variable `loss`
- set the Adam optimizer to the variable `optimizer` with a learning rate of `0.0007`

In [None]:
loss = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr = 0.0007)

### Task 16

Create a training loop to train our neural network for 3000 epochs.

Keep track of the training loss by printing out the MSE every 500 epochs.

In [None]:
num_epochs = 3000
for epoch in range(num_epochs):
    predictions = model(X_train_tensor)
    MSE = loss(predictions, y_train_tensor)
    MSE.backward()
    optimizer.step()
    optimizer.zero_grad()
    if (epoch + 1) % 500 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], MSE Loss: {MSE.item()}')

Epoch [500/3000], MSE Loss: 121.35372924804688
Epoch [1000/3000], MSE Loss: 113.1530990600586
Epoch [1500/3000], MSE Loss: 112.26484680175781
Epoch [2000/3000], MSE Loss: 106.9486083984375
Epoch [2500/3000], MSE Loss: 104.27964782714844
Epoch [3000/3000], MSE Loss: 103.15689086914062
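
Printing every 500 epochs gives a coarse view of training. A variation, sketched below on a small synthetic problem, records the loss at every epoch in a list so it can be inspected or plotted afterwards. Everything here (`X_demo`, `y_demo`, `demo_model`, `history`, the layer sizes, and the learning rate) is made up for the illustration. The loop calls `optimizer.zero_grad()` before the backward pass, a common ordering that is equivalent to clearing the gradients after `optimizer.step()` as the loop above does.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Synthetic regression data with a known linear relationship
X_demo = torch.randn(64, 3)
y_demo = X_demo @ torch.tensor([[2.0], [-1.0], [0.5]])  # shape (64, 1)

demo_model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = optim.Adam(demo_model.parameters(), lr=0.01)

history = []  # training MSE at every epoch
for epoch in range(200):
    optimizer.zero_grad()                        # clear old gradients
    loss_value = loss_fn(demo_model(X_demo), y_demo)
    loss_value.backward()                        # compute new gradients
    optimizer.step()                             # update the weights
    history.append(loss_value.item())

print(history[0], history[-1])
```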


### Task 17

Save the neural network in the `models` directory using the path `models/model.pth`.

In [None]:
torch.save(model, "models/model.pth")
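
`torch.save(model, ...)` pickles the whole model object. An alternative that is often used is to save only the model's `state_dict` and rebuild the architecture before loading. The sketch below illustrates that pattern with a temporary file. The helper `build_model` and the file name `model_state.pth` are hypothetical and not part of the project.

```python
import os
import tempfile

import torch
from torch import nn


def build_model():
    """Recreate the architecture so saved weights can be loaded into it."""
    return nn.Sequential(
        nn.Linear(26, 56),
        nn.ReLU(),
        nn.Linear(56, 26),
        nn.ReLU(),
        nn.Linear(26, 1),
    )


torch.manual_seed(42)
trained = build_model()  # stands in for a trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_state.pth')
    torch.save(trained.state_dict(), path)      # save weights only
    restored = build_model()
    restored.load_state_dict(torch.load(path))  # load weights into a fresh model

# Both models should give identical predictions on the same input
x = torch.randn(2, 26)
same = torch.equal(trained(x), restored(x))
print(same)  # True
```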

### Task 18

Evaluate the neural network on the testing set.

Save the testing data loss to the variable `test_loss` and use `.item()` to extract and print out the loss.

In [None]:
loaded_model = torch.load('models/model.pth')
loaded_model.eval()  # switch to evaluation mode
with torch.no_grad():  # gradients are not needed for evaluation
    predictions = loaded_model(X_test_tensor)
    test_loss = loss(predictions, y_test_tensor)
# Show the test MSE and its square root
print('Test MSE is ' + str(test_loss.item()))
print('Test Root MSE is ' + str(test_loss.item()**(1/2)))

Test MSE is 106.4807357788086
Test Root MSE is 10.318950323497473


### Task 19

We trained this same model for 4500 epochs locally. That model is saved as `models/model4500.pth`. Load this model using PyTorch and evaluate it. How well does the longer-trained model perform?

In [None]:
loaded_model = torch.load('models/model4500.pth')
loaded_model.eval()
with torch.no_grad():
    predictions = loaded_model(X_test_tensor)
    test_MSE = loss(predictions, y_test_tensor)
# show output
print('Test MSE is ' + str(test_MSE.item()))
print('Test Root MSE is ' + str(test_MSE.item()**(1/2)))

Test MSE is 104.2914047241211
Test Root MSE is 10.212316325110631


Pretty cool! The increased training improved our test loss from about `106.5` to about `104.3`, roughly a `21%` improvement on the linear regression baseline's MSE of `131.4`. So the nonlinearity introduced by the neural network actually helped us out.
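
The relative improvement over a baseline is the drop in MSE divided by the baseline MSE. The sketch below applies that formula to the test MSE values printed in Tasks 18 and 19 and to the baseline MSE of `131.4` quoted in Task 11. The helper `relative_improvement` is a name made up here, and the baseline figure may come from a different run than the neural network results.

```python
def relative_improvement(baseline_mse, model_mse):
    """Fraction by which model_mse is lower than baseline_mse."""
    return (baseline_mse - model_mse) / baseline_mse


baseline_mse = 131.4  # linear regression test MSE quoted in Task 11

results = {}
for label, mse in [('3000 epochs', 106.4807), ('4500 epochs', 104.2914)]:
    results[label] = round(100 * relative_improvement(baseline_mse, mse), 1)
    print(label, results[label])
# 3000 epochs 19.0
# 4500 epochs 20.6
```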

That's the end of our project on predicting EV charging loads! Feel free to continue experimenting with this neural network model.

Some things you might want to investigate further include:
- exploring different ways to clean and prepare the data
- testing different sets of input columns (we added traffic data, but there's no guarantee that more data leads to a better model)
- testing different numbers of nodes in the hidden layers, activation functions, and learning rates
- training for a larger number of epochs