# Leeds EDA

This notebook walks through the exploratory data analysis (EDA) of the Leeds City Council EV charging dataset, which covers 2014 to 2021 Q1.  

Contents:  
[Combine Datasets](#Combine-Datasets)  

In [1]:
# Import the necessary packages
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Combine Datasets

The datasets are first combined before being analysed.  

In [2]:
# Read in all .csv files in data
path = "../data/"
os.chdir(path)
csv_files = [f for f in os.listdir() if f.endswith('.csv')]

dfs = []
# Import all .csv files
for csv in csv_files:
    df = pd.read_csv(csv)
    dfs.append(df)

# Combine the dataframes
ev_df = pd.concat(dfs, ignore_index=True)

# Examine contents of dataframe
ev_df.head()

Unnamed: 0,Charging event,User ID,CP ID,Connector,Start Date,Start Time,End Date,End Time,Total kWh,Site,Model,Unique User ID,Unnamed: 11,Unnamed: 12,Unnamed: 13,Cost
0,4105202,User 690,70204,1,01/01/2018,16:11,01/01/2018,19:03,5.89,Woodhouse Lane Car Park,APT 7kW Dual Outlet,,,,,
1,4105706,User 406,70204,2,02/01/2018,07:05,02/01/2018,17:05,14.3,Woodhouse Lane Car Park,APT 7kW Dual Outlet,,,,,
2,4105757,User 716,70206,2,02/01/2018,08:01,02/01/2018,18:50,18.66,Elland Road Park and Ride,APT 7kW Dual Outlet,,,,,
3,4105869,User 706,70207,2,02/01/2018,09:09,02/01/2018,09:10,0.02,Elland Road Park and Ride,APT 7kW Dual Outlet,,,,,
4,4105870,User 706,70207,2,02/01/2018,09:11,02/01/2018,19:09,5.95,Elland Road Park and Ride,APT 7kW Dual Outlet,,,,,


In [3]:
ev_df.shape

(24224, 16)

In [4]:
ev_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24224 entries, 0 to 24223
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Charging event  24224 non-null  int64  
 1   User ID         23213 non-null  object 
 2   CP ID           24224 non-null  int64  
 3   Connector       24224 non-null  int64  
 4   Start Date      24224 non-null  object 
 5   Start Time      24224 non-null  object 
 6   End Date        23927 non-null  object 
 7   End Time        23927 non-null  object 
 8   Total kWh       23927 non-null  float64
 9   Site            24224 non-null  object 
 10  Model           24224 non-null  object 
 11  Unique User ID  1011 non-null   object 
 12  Unnamed: 11     0 non-null      float64
 13  Unnamed: 12     0 non-null      float64
 14  Unnamed: 13     0 non-null      float64
 15  Cost            2278 non-null   float64
dtypes: float64(5), int64(3), object(8)
memory usage: 3.0+ MB


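As the output shows, the columns `Unnamed: 11` to `Unnamed: 13` contain no data and should be removed completely.  
`Unique User ID` has only about 1k non-null values and `Cost` about 2.3k, roughly 4% and 9.4% of the rows, so both columns are mostly empty.  
`User ID` is missing for 1011 rows (about 4%), the same number as the non-null `Unique User ID` values, which may mean the two columns complement each other.  
`End Date`, `End Time` and `Total kWh` are each missing for 297 rows (about 1.2%).  
These gaps are significant, so it is worth checking whether the missing data can be salvaged before deciding to drop it.  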
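
The missing-value shares discussed above can be computed directly instead of being read off `info()`. The following is a minimal sketch on a small made-up frame; the helper name `missing_summary` and the toy values are assumptions for illustration, not part of the original notebook. In the notebook it would be called as `missing_summary(ev_df)`.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    n_missing = df.isna().sum()
    pct_missing = (n_missing / len(df) * 100).round(1)
    summary = pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})
    # Put the emptiest columns first
    return summary.sort_values("pct_missing", ascending=False)


# Hypothetical toy data that mimics the column names of the real dataset
toy = pd.DataFrame({
    "User ID": ["User 1", None, "User 3", "User 4"],
    "Cost": [None, None, None, 1.5],
    "Site": ["A", "B", "C", "D"],
})

summary = missing_summary(toy)
print(summary)
```

Sorting by the percentage makes the mostly empty columns stand out, which helps when deciding which ones to drop and which ones to try to fill.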

In [5]:
# Rename the columns to keep; any column not listed (the all-NA ones, and also End Time) is dropped
col_keep = {"Charging event":"charging_event_id", 
            "User ID":"user_id", 
            "CP ID":"cp_id", 
            "Connector":"con_num", 
            "Start Date":"start_date", 
            "Start Time":"start_time", 
            "End Date":"end_date", 
            "Total kWh":"total_kwh", 
            "Site":"site", 
            "Model":"charger_model", 
            "Unique User ID":"uid", 
            "Cost":"charging_cost"}
ev_df = ev_df.loc[:,[key for key in col_keep]].rename(columns=col_keep).copy(deep=True)
ev_df.to_csv("../data/combined_ev_charging_data.csv")
ev_df

Unnamed: 0,charging_event_id,user_id,cp_id,con_num,start_date,start_time,end_date,total_kwh,site,charger_model,uid,charging_cost
0,4105202,User 690,70204,1,01/01/2018,16:11,01/01/2018,5.89,Woodhouse Lane Car Park,APT 7kW Dual Outlet,,
1,4105706,User 406,70204,2,02/01/2018,07:05,02/01/2018,14.30,Woodhouse Lane Car Park,APT 7kW Dual Outlet,,
2,4105757,User 716,70206,2,02/01/2018,08:01,02/01/2018,18.66,Elland Road Park and Ride,APT 7kW Dual Outlet,,
3,4105869,User 706,70207,2,02/01/2018,09:09,02/01/2018,0.02,Elland Road Park and Ride,APT 7kW Dual Outlet,,
4,4105870,User 706,70207,2,02/01/2018,09:11,02/01/2018,5.95,Elland Road Park and Ride,APT 7kW Dual Outlet,,
...,...,...,...,...,...,...,...,...,...,...,...,...
24219,1078855,User 38,70206,1,6/29/2015,09:13,6/29/2015,0.00,Elland Road Park and Ride,APT 7kW Dual Outlet,,
24220,1078859,User 38,70206,2,6/29/2015,09:16,6/29/2015,0.00,Elland Road Park and Ride,APT 7kW Dual Outlet,,
24221,1078864,User 38,70206,1,6/29/2015,09:20,6/29/2015,10.53,Elland Road Park and Ride,APT 7kW Dual Outlet,,
24222,1080286,User 628,70206,2,6/30/2015,08:44,6/30/2015,7.16,Elland Road Park and Ride,APT 7kW Dual Outlet,,


In [6]:
ev_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24224 entries, 0 to 24223
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   charging_event_id  24224 non-null  int64  
 1   user_id            23213 non-null  object 
 2   cp_id              24224 non-null  int64  
 3   con_num            24224 non-null  int64  
 4   start_date         24224 non-null  object 
 5   start_time         24224 non-null  object 
 6   end_date           23927 non-null  object 
 7   total_kwh          23927 non-null  float64
 8   site               24224 non-null  object 
 9   charger_model      24224 non-null  object 
 10  uid                1011 non-null   object 
 11  charging_cost      2278 non-null   float64
dtypes: float64(2), int64(3), object(7)
memory usage: 2.2+ MB


In [7]:
# The column was renamed to user_id above; note that unique() also returns NaN as a value
unique_user = ev_df["user_id"].unique()

In [17]:
len(unique_user)

1039

In [None]:
# TODO: check whether the same user ID always refers to the same user.
# An ID that appears in different years may not belong to the same user.
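
One possible starting point for the check noted above is to list the user IDs whose charging events span more than one calendar year. This is only a sketch on made-up rows: it assumes every `start_date` is in `dd/mm/yyyy` format, whereas the sample rows shown earlier suggest the real data mixes `dd/mm/yyyy` and `m/d/yyyy`, so the dates would need normalizing first. It flags IDs that recur across years; it cannot prove that they belong to the same person.

```python
import pandas as pd

# Hypothetical toy rows using the renamed columns; values are illustrative only
toy = pd.DataFrame({
    "user_id": ["User 38", "User 38", "User 690", "User 690", "User 716"],
    "start_date": ["29/06/2015", "30/06/2015", "01/01/2018", "02/01/2019", "02/01/2018"],
})

# Assumption: a single, consistent dd/mm/yyyy date format
toy["year"] = pd.to_datetime(toy["start_date"], format="%d/%m/%Y").dt.year

# First year, last year and number of distinct years per user ID
years_per_user = toy.groupby("user_id")["year"].agg(["min", "max", "nunique"])

# IDs seen in more than one year are candidates for a closer look
multi_year_users = years_per_user[years_per_user["nunique"] > 1].index.tolist()
print(multi_year_users)
```

For the flagged IDs, comparing sites, charge point IDs and typical charging times between years could give a hint as to whether the behaviour is consistent with a single user.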