# Energy Use Data Cleaning

# Introduction
This project is meant to gather insights on my electricity usage.
The first steps needed are to get familiar with the data and clean it.
The data comes from my energy company (ComEd) and covers the past year, 2022-10-22 to 2023-10-22.

## Data Source
The data was downloaded from the [My Green Button](https://secure.comed.com/MyAccount/MyBillUsage/pages/secure/GreenButtonConnectDownloadMyData.aspx) page on the ComEd website.

# Goals
* become familiar with the columns in the dataset
* remove redundant data
* clean anomalous data

# Column / header info
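* **TYPE**: type of reading (always `Electric usage` in this export)
* **DATE**: day of the reading (YYYY-MM-DD)
* **START TIME**: start of the recording interval (HH:MM)
* **END TIME**: end of the recording interval (HH:MM)
* **USAGE**: electricity used during the interval, in kWh
* **UNITS**: unit of measurement for USAGE (always `kWh`)
* **COST**: amount charged for the interval's usage, stored as text with a leading `$`
* **NOTES**: empty in this export (no non-null values)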

In [92]:
import pandas as pd
import numpy as np

In [93]:
# Import the energy use spreadsheet from the 'data' directory
import glob

# Define the directory path and the regular expression pattern
directory_path = "./data"
file_pattern = "energy_use*.csv"

# Use glob.glob to match filenames based on the pattern
file_name = glob.glob(f"{directory_path}/{file_pattern}")[0]
energy_df = pd.read_csv(filepath_or_buffer=file_name, header=4)
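
`glob.glob` returns a list, so indexing with `[0]` raises an `IndexError` when nothing matches, and the order of matches is not guaranteed when several files do. A minimal sketch of a more defensive lookup follows. The helper name `find_energy_file` is hypothetical and not part of the original notebook.

```python
import glob
import os


def find_energy_file(directory_path="./data", file_pattern="energy_use*.csv"):
    """Return the first matching path in sorted order, or raise a clear error."""
    matches = sorted(glob.glob(os.path.join(directory_path, file_pattern)))
    if not matches:
        raise FileNotFoundError(
            f"No file matching {file_pattern!r} in {directory_path!r}"
        )
    return matches[0]
```

The returned path can then be passed to `pd.read_csv(..., header=4)` exactly as in the cell above.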

In [94]:
print(energy_df.columns)
energy_df.head()

Index(['TYPE', 'DATE', 'START TIME', 'END TIME', 'USAGE', 'UNITS', 'COST',
       'NOTES'],
      dtype='object')


|   | TYPE | DATE | START TIME | END TIME | USAGE | UNITS | COST | NOTES |
|---|---|---|---|---|---|---|---|---|
| 0 | Electric usage | 2022-10-22 | 00:00 | 00:29 | 0.11 | kWh | $0.01 | NaN |
| 1 | Electric usage | 2022-10-22 | 00:30 | 00:59 | 0.13 | kWh | $0.02 | NaN |
| 2 | Electric usage | 2022-10-22 | 01:00 | 01:29 | 0.09 | kWh | $0.01 | NaN |
| 3 | Electric usage | 2022-10-22 | 01:30 | 01:59 | 0.2 | kWh | $0.02 | NaN |
| 4 | Electric usage | 2022-10-22 | 02:00 | 02:29 | 0.1 | kWh | $0.01 | NaN |


In [95]:
energy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17520 entries, 0 to 17519
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   TYPE        17520 non-null  object 
 1   DATE        17520 non-null  object 
 2   START TIME  17520 non-null  object 
 3   END TIME    17520 non-null  object 
 4   USAGE       17520 non-null  float64
 5   UNITS       17520 non-null  object 
 6   COST        17520 non-null  object 
 7   NOTES       0 non-null      float64
dtypes: float64(2), object(6)
memory usage: 1.1+ MB
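
The `info()` output lists `COST` as `object` because each value carries a leading `$`. The notebook does not convert it in this section. If numeric costs are needed later, one way to do it is sketched below on a small stand-in frame. The `COST_USD` column name is an assumption made for illustration.

```python
import pandas as pd

# Stand-in for the COST column shown in the head() output
sample = pd.DataFrame({"COST": ["$0.01", "$0.02", "$0.01"]})

# Strip the literal dollar sign, then parse the remainder as a float
sample["COST_USD"] = pd.to_numeric(
    sample["COST"].str.replace("$", "", regex=False)
)
print(sample.dtypes)
```

Passing `regex=False` matters here because `$` is a regular-expression anchor and would otherwise match the end of the string rather than the character.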


# Initial Observations
* The DATE column can be combined with START TIME and END TIME to form datetime objects
* The start-to-end intervals all seem to be the same length
* The 'TYPE', 'UNITS', and 'NOTES' columns each seem to hold a single value
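
The last observation can also be checked for every column at once rather than naming columns by hand. This is a sketch on a small stand-in frame, and `constant_cols` is an assumed name.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "TYPE": ["Electric usage"] * 3,
    "USAGE": [0.11, 0.13, 0.09],
    "UNITS": ["kWh"] * 3,
    "NOTES": [np.nan] * 3,
})

# dropna=False counts NaN as a value, so an all-NaN column reports 1
unique_counts = sample.nunique(dropna=False)
constant_cols = unique_counts[unique_counts <= 1].index.tolist()
print(constant_cols)
```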

In [96]:
# Printing all the unique values of uninteresting columns
print([energy_df['TYPE'].unique(),
    energy_df['UNITS'].unique(),
    energy_df['NOTES'].unique()])

[array(['Electric usage'], dtype=object), array(['kWh'], dtype=object), array([nan])]


In [97]:
# drop the columns that hold only a single value
energy_df_clean = energy_df.drop(['TYPE', 'UNITS', 'NOTES'], axis='columns')
energy_df_clean.head()

|   | DATE | START TIME | END TIME | USAGE | COST |
|---|---|---|---|---|---|
| 0 | 2022-10-22 | 00:00 | 00:29 | 0.11 | $0.01 |
| 1 | 2022-10-22 | 00:30 | 00:59 | 0.13 | $0.02 |
| 2 | 2022-10-22 | 01:00 | 01:29 | 0.09 | $0.01 |
| 3 | 2022-10-22 | 01:30 | 01:59 | 0.2 | $0.02 |
| 4 | 2022-10-22 | 02:00 | 02:29 | 0.1 | $0.01 |


In [98]:
# replace the spaces in column titles with underscores
energy_df_clean.columns = energy_df_clean.columns.str.replace(' ', '_')

In [99]:
# convert START_TIME and END_TIME to datetimes by combining each with DATE;
# an explicit format avoids relying on format inference
datetime_format = '%Y-%m-%d %H:%M'
energy_df_clean['START_TIME'] = pd.to_datetime(
    energy_df_clean['DATE'] + ' ' + energy_df_clean['START_TIME'], format=datetime_format)
energy_df_clean['END_TIME'] = pd.to_datetime(
    energy_df_clean['DATE'] + ' ' + energy_df_clean['END_TIME'], format=datetime_format)
energy_df_clean.head()

|   | DATE | START_TIME | END_TIME | USAGE | COST |
|---|---|---|---|---|---|
| 0 | 2022-10-22 | 2022-10-22 00:00:00 | 2022-10-22 00:29:00 | 0.11 | $0.01 |
| 1 | 2022-10-22 | 2022-10-22 00:30:00 | 2022-10-22 00:59:00 | 0.13 | $0.02 |
| 2 | 2022-10-22 | 2022-10-22 01:00:00 | 2022-10-22 01:29:00 | 0.09 | $0.01 |
| 3 | 2022-10-22 | 2022-10-22 01:30:00 | 2022-10-22 01:59:00 | 0.2 | $0.02 |
| 4 | 2022-10-22 | 2022-10-22 02:00:00 | 2022-10-22 02:29:00 | 0.1 | $0.01 |


In [102]:
# create a USAGE_DUR column from the START_TIME and END_TIME columns
energy_df_clean['USAGE_DUR'] = energy_df_clean['END_TIME'] - energy_df_clean['START_TIME']

energy_df_clean.head()

|   | DATE | START_TIME | END_TIME | USAGE | COST | USAGE_DUR |
|---|---|---|---|---|---|---|
| 0 | 2022-10-22 | 2022-10-22 00:00:00 | 2022-10-22 00:29:00 | 0.11 | $0.01 | 0 days 00:29:00 |
| 1 | 2022-10-22 | 2022-10-22 00:30:00 | 2022-10-22 00:59:00 | 0.13 | $0.02 | 0 days 00:29:00 |
| 2 | 2022-10-22 | 2022-10-22 01:00:00 | 2022-10-22 01:29:00 | 0.09 | $0.01 | 0 days 00:29:00 |
| 3 | 2022-10-22 | 2022-10-22 01:30:00 | 2022-10-22 01:59:00 | 0.2 | $0.02 | 0 days 00:29:00 |
| 4 | 2022-10-22 | 2022-10-22 02:00:00 | 2022-10-22 02:29:00 | 0.1 | $0.01 | 0 days 00:29:00 |
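
Each duration reads 29 minutes because `END_TIME` appears to label the last minute of the interval rather than the start of the next one. Assuming the readings are half-hourly (48 rows per day, which is consistent with 17520 rows over 365 days), a sketch of how an exclusive end and a 30-minute interval could be derived is shown below. `END_EXCLUSIVE` and `INTERVAL` are assumed column names, and the frame is a small stand-in.

```python
import pandas as pd

sample = pd.DataFrame({
    "START_TIME": pd.to_datetime(["2022-10-22 00:00", "2022-10-22 00:30"]),
    "END_TIME": pd.to_datetime(["2022-10-22 00:29", "2022-10-22 00:59"]),
})

# Assumption: END_TIME is inclusive to the minute, so the interval
# really ends one minute later.
sample["END_EXCLUSIVE"] = sample["END_TIME"] + pd.Timedelta(minutes=1)
sample["INTERVAL"] = sample["END_EXCLUSIVE"] - sample["START_TIME"]
print(sample["INTERVAL"].unique())
```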


In [103]:
unique_durs = (energy_df_clean['USAGE_DUR']).unique()
print(unique_durs)

[ 1740000000000 80940000000000]
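
The two values print as raw nanosecond counts: 1740000000000 ns is 29 minutes and 80940000000000 ns is 22 hours 29 minutes. A sketch of a more readable summary uses `value_counts()`, which also shows how many rows share each duration. The series below is a small stand-in for `USAGE_DUR`.

```python
import pandas as pd

durations = pd.Series(
    pd.to_timedelta(["00:29:00", "00:29:00", "22:29:00"]), name="USAGE_DUR"
)

# Count how many rows share each duration; the index holds Timedelta values
counts = durations.value_counts()
print(counts)

# Convert a raw nanosecond count back to a readable Timedelta
print(pd.Timedelta(1740000000000, unit="ns"))
```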


In [104]:
weird_usage = energy_df_clean[energy_df_clean['USAGE_DUR']==unique_durs[1]]
weird_usage.head()

|   | DATE | START_TIME | END_TIME | USAGE | COST | USAGE_DUR |
|---|---|---|---|---|---|---|
| 723 | 2022-11-06 | 2022-11-06 01:30:00 | 2022-11-06 23:59:00 | 2.03 | $0.26 | 0 days 22:29:00 |
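
2022-11-06 is the date daylight saving time ended in the United States, which may explain this row, although nothing in the data itself confirms it. The sketch below counts the hours in a local calendar day. The `America/Chicago` time zone is an assumption based on ComEd's service area, and `hours_in_local_day` is a hypothetical helper.

```python
import pandas as pd


def hours_in_local_day(day, tz="America/Chicago"):
    """Return the elapsed hours between two consecutive local midnights."""
    start = pd.Timestamp(day).tz_localize(tz)
    end = (pd.Timestamp(day) + pd.Timedelta(days=1)).tz_localize(tz)
    return (end - start) / pd.Timedelta(hours=1)


print(hours_in_local_day("2022-11-06"))
print(hours_in_local_day("2022-10-22"))
```

A day that is not 24 hours long is a daylight saving transition, so rows on that date are worth inspecting separately.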


In [106]:
# drop the row with the anomalous start/end times and verify USAGE_DUR is homogeneous
wu_start = weird_usage['START_TIME']
energy_df_clean.set_index('START_TIME', inplace=True)
energy_df_clean.drop(wu_start, inplace=True)
print((energy_df_clean['USAGE_DUR']).unique())

[1740000000000]
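
Dropping by index label removes every row that shares the given `START_TIME`. An alternative sketch that leaves the index untouched keeps only the rows whose duration matches the expected value, using a boolean mask. `expected_dur` is an assumed name, and the frame is a small stand-in.

```python
import pandas as pd

sample = pd.DataFrame({
    "START_TIME": pd.to_datetime(
        ["2022-11-06 01:00", "2022-11-06 01:30", "2022-11-06 02:00"]
    ),
    "END_TIME": pd.to_datetime(
        ["2022-11-06 01:29", "2022-11-06 23:59", "2022-11-06 02:29"]
    ),
})
sample["USAGE_DUR"] = sample["END_TIME"] - sample["START_TIME"]

# Keep only the rows with the expected 29-minute duration
expected_dur = pd.Timedelta(minutes=29)
filtered = sample[sample["USAGE_DUR"] == expected_dur].reset_index(drop=True)
print(filtered["USAGE_DUR"].unique())
```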
