# Electricity & Weather Data Analysis

# Introduction
This project looks for insights into household electricity usage.

This notebook examines how electricity usage relates to weather data.

## Data
* Energy data cleaned with `green_button_data_cleaning.ipynb`: `clean_energy_use_*.csv`
* Weather data cleaned with `weather_data_cleaning.ipynb`: `clean_weather_*.csv`


## Original Energy Data Source
The data comes from my energy company (ComEd) and covers the past year, 2022-10-22 to 2023-10-22.
It was downloaded from the [My Green Button](https://secure.comed.com/MyAccount/MyBillUsage/pages/secure/GreenButtonConnectDownloadMyData.aspx) page on the ComEd website.

## Original Weather Data Source
This data was collected using [Meteostat](https://github.com/meteostat/meteostat-python). The Meteostat Python library provides a simple API for accessing open weather and climate data. The historical observations and statistics are collected by Meteostat from different public interfaces, most of which are governmental.

Among the data sources are national weather services like the National Oceanic and Atmospheric Administration (NOAA) and Germany's national meteorological service (DWD).

# Data Column Descriptions

## energy_df
* **DATE**: day of the recording
* **START_TIME**: start of the 30-minute recording interval (date and Hour:Minutes)
* **END_TIME**: end of the 30-minute recording interval (date and Hour:Minutes)
* **USAGE**: electricity usage in kWh
* **COST**: amount charged for the energy usage in USD

## weather_df
src: [Meteostat Documentation](https://dev.meteostat.net/python/hourly.html#data-structure)

|Column|Description|Type|
|-|-|-|
|**time**|datetime of the observation|Datetime64|
|**temp**|air temperature in *°C*|Float64|
|**dwpt**|dew point in *°C*|Float64|
|**rhum**|relative humidity in percent (*%*)|Float64|
|**prcp**|one hour precipitation total in *mm*|Float64|
|**wdir**|average wind direction in degrees (*°*)|Float64|
|**wspd**|average wind speed in *km/h*|Float64|
|**pres**|average sea-level air pressure in *hPa*|Float64|

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [26]:
# Import clean_energy_use spreadsheet from 'data' directory

# Define the directory path and the glob (wildcard) pattern
import glob
directory_path = "./data"
file_pattern = "clean_energy_use*.csv"

# Use glob.glob to match filenames based on the pattern
file_path = glob.glob(f"{directory_path}/{file_pattern}")[0]
energy_df = pd.read_csv(filepath_or_buffer=file_path)
energy_df.info()
energy_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17517 entries, 0 to 17516
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   START_TIME  17517 non-null  object 
 1   DATE        17517 non-null  object 
 2   END_TIME    17517 non-null  object 
 3   USAGE       17517 non-null  float64
 4   COST        17517 non-null  float64
dtypes: float64(2), object(3)
memory usage: 684.4+ KB


Unnamed: 0,START_TIME,DATE,END_TIME,USAGE,COST
0,2022-10-22 00:00:00,2022-10-22 00:00:00,2022-10-22 00:29:00,0.11,0.01
1,2022-10-22 00:30:00,2022-10-22 00:00:00,2022-10-22 00:59:00,0.13,0.02
2,2022-10-22 01:00:00,2022-10-22 00:00:00,2022-10-22 01:29:00,0.09,0.01
3,2022-10-22 01:30:00,2022-10-22 00:00:00,2022-10-22 01:59:00,0.2,0.02
4,2022-10-22 02:00:00,2022-10-22 00:00:00,2022-10-22 02:29:00,0.1,0.01


In [27]:
# Import clean_weather spreadsheet from 'data' directory

# Define the directory path and the glob (wildcard) pattern
import glob
directory_path = "./data"
file_pattern = "clean_weather*.csv"

# Use glob.glob to match filenames based on the pattern
file_path = glob.glob(f"{directory_path}/{file_pattern}")[0]
weather_df = pd.read_csv(filepath_or_buffer=file_path)
weather_df.info()
weather_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8785 entries, 0 to 8784
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    8785 non-null   object 
 1   temp    8785 non-null   float64
 2   dwpt    8785 non-null   float64
 3   rhum    8785 non-null   float64
 4   prcp    8785 non-null   float64
 5   wdir    8785 non-null   float64
 6   wspd    8785 non-null   float64
 7   pres    8785 non-null   float64
dtypes: float64(7), object(1)
memory usage: 549.2+ KB


Unnamed: 0,time,temp,dwpt,rhum,prcp,wdir,wspd,pres
0,2022-10-21 00:00:00,13.0,1.1,44.0,0.0,190.0,7.6,1008.0
1,2022-10-21 01:00:00,10.7,1.0,51.0,0.0,160.0,7.6,1008.0
2,2022-10-21 02:00:00,9.0,1.5,59.0,0.0,180.0,5.4,1008.0
3,2022-10-21 03:00:00,9.0,1.5,59.0,0.0,180.0,5.4,1008.0
4,2022-10-21 04:00:00,7.6,1.5,65.0,0.0,170.0,5.4,1008.0
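
Both loading cells above take the first glob match with `[0]`, which raises a bare `IndexError` when no cleaned file is present. A minimal sketch of a shared helper, assuming exactly one matching file is expected per pattern; `load_single_csv` is a hypothetical name and is not part of the original notebooks:

```python
from pathlib import Path

import pandas as pd


def load_single_csv(directory, pattern, **read_csv_kwargs):
    """Load the first CSV (in sorted order) in `directory` matching the glob `pattern`.

    Hypothetical helper: raises a descriptive FileNotFoundError instead of an
    IndexError when nothing matches.
    """
    matches = sorted(Path(directory).glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No file matching {pattern!r} in {directory!r}")
    # Extra keyword arguments are passed straight through to pandas.read_csv
    return pd.read_csv(matches[0], **read_csv_kwargs)
```

It could be called as `load_single_csv("./data", "clean_energy_use*.csv")`; passing `parse_dates=["START_TIME", "END_TIME"]` would also parse the timestamps at load time.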


## Observations & TODOs

- [ ] convert the time/START_TIME columns into datetime objects
- [ ] convert the half-hour granularity of energy_df into 1-hour rows
- [ ] merge the datasets on the start time of each row
- [ ] analyze correlations between weather variables and electricity usage
- [ ] visualize these correlations
- [ ] train a model to predict energy usage from weather data
- [ ] **MAYBE** find a way to create an intuitive random forest tree from the data

In [28]:
# Convert the time/START_TIME columns into datetime objects
# (START_TIME is parsed from its own column, not from END_TIME)
energy_df['START_TIME'] = pd.to_datetime(energy_df['START_TIME'])
weather_df['time'] = pd.to_datetime(weather_df['time'])
print(energy_df['START_TIME'].dtypes, weather_df['time'].dtypes)

datetime64[ns] datetime64[ns]


In [29]:
weather_df['time'].head()

0   2022-10-21 00:00:00
1   2022-10-21 01:00:00
2   2022-10-21 02:00:00
3   2022-10-21 03:00:00
4   2022-10-21 04:00:00
Name: time, dtype: datetime64[ns]

In [30]:
# Convert the half-hourly energy_df readings into hourly rows.
# Create an 'HOUR' column (START_TIME truncated to the hour) to group and sum
# the energy readings by, and to merge against weather_df['time'].
energy_df['HOUR'] = energy_df['START_TIME'].dt.strftime("%Y-%m-%d %H:00:00")
energy_df['HOUR'] = pd.to_datetime(energy_df['HOUR'])
energy_df['HOUR'].head()

0   2022-10-22 00:00:00
1   2022-10-22 00:00:00
2   2022-10-22 01:00:00
3   2022-10-22 01:00:00
4   2022-10-22 02:00:00
Name: HOUR, dtype: datetime64[ns]

In [31]:
# Sum the half-hour readings within each hour into a single hourly row.
# A list of columns (double brackets) is required; tuple-style selection is deprecated.
energy_df = energy_df.groupby('HOUR')[['USAGE', 'COST']].sum()
energy_df.head()


HOUR,USAGE,COST
2022-10-22 00:00:00,0.24,0.03
2022-10-22 01:00:00,0.29,0.03
2022-10-22 02:00:00,0.2,0.02
2022-10-22 03:00:00,0.09,0.02
2022-10-22 04:00:00,0.09,0.02
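
The hourly roll-up can also be written without the string round-trip. A minimal sketch on made-up half-hourly readings (the values are illustrative, not the ComEd data), using `Series.dt.floor` to truncate each timestamp to its hour before summing, plus a row count per hour to spot gaps:

```python
import pandas as pd

# Illustrative half-hourly readings shaped like energy_df (not real data)
toy_energy = pd.DataFrame({
    "START_TIME": pd.to_datetime([
        "2022-10-22 00:00", "2022-10-22 00:30",
        "2022-10-22 01:00", "2022-10-22 01:30",
    ]),
    "USAGE": [0.25, 0.5, 0.125, 0.125],
    "COST": [0.01, 0.02, 0.01, 0.02],
})

# Truncate each start time to the top of its hour, then sum the readings per hour
toy_energy["HOUR"] = toy_energy["START_TIME"].dt.floor("h")
toy_hourly = toy_energy.groupby("HOUR")[["USAGE", "COST"]].sum()

# A complete hour is built from exactly two half-hour rows; other counts flag gaps
rows_per_hour = toy_energy.groupby("HOUR").size()
```

The count check matters for the real data: 17517 half-hour rows is an odd number, so the readings cannot all pair up two per hour, and any hour with a single row will show an understated sum.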


In [33]:
# Merge the datasets on the hour: weather_df['time'] against the 'HOUR' index of energy_df
energy_weather_df = weather_df.merge(energy_df, how='inner', left_on='time', right_on='HOUR')
energy_weather_df.head()

Unnamed: 0,time,temp,dwpt,rhum,prcp,wdir,wspd,pres,USAGE,COST
0,2022-10-22 00:00:00,21.4,5.8,36.0,0.0,190.0,14.8,1007.0,0.24,0.03
1,2022-10-22 01:00:00,19.6,6.1,41.0,0.0,180.0,16.6,1007.0,0.29,0.03
2,2022-10-22 02:00:00,17.5,5.5,45.0,0.0,180.0,9.4,1008.0,0.2,0.02
3,2022-10-22 03:00:00,14.0,5.2,55.0,0.0,160.0,11.2,1008.0,0.09,0.02
4,2022-10-22 04:00:00,14.2,5.4,55.0,0.0,170.0,11.2,1009.0,0.09,0.02
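
An inner merge silently drops every hour that appears in only one table; here the weather data starts on 2022-10-21, a day before the first energy reading. A sketch, on made-up frames, of using an outer merge with `indicator=True` to count what the inner join discards (`toy_weather`, `toy_energy`, and `audit` are illustrative names):

```python
import pandas as pd

# Illustrative stand-ins for weather_df and the hourly energy_df (not real data)
toy_weather = pd.DataFrame({
    "time": pd.date_range("2022-10-21 22:00", periods=4, freq="h"),
    "temp": [10.0, 9.5, 9.0, 8.5],
})
toy_energy = pd.DataFrame({
    "HOUR": pd.to_datetime(
        ["2022-10-22 00:00", "2022-10-22 01:00", "2022-10-22 02:00"]
    ),
    "USAGE": [0.24, 0.29, 0.20],
})

# An outer merge keeps every hour; the `_merge` column records which side
# each row came from: 'both', 'left_only' (weather) or 'right_only' (energy)
audit = toy_weather.merge(
    toy_energy, how="outer", left_on="time", right_on="HOUR", indicator=True
)
merge_counts = audit["_merge"].value_counts()
```

Only the rows labeled `both` survive an inner merge, so comparing `merge_counts` against the row counts of the two inputs shows how many hours were lost on each side.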


In [40]:
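# Rank the numeric columns by the absolute value of their Pearson correlation with USAGE
hi_corr = energy_weather_df.corr(numeric_only=True)['USAGE'].abs().sort_values(ascending=False)
print(hi_corr)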

USAGE    1.000000
COST     0.997469
temp     0.809782
dwpt     0.795263
wspd     0.335373
wdir     0.286017
pres     0.191690
rhum     0.091558
prcp     0.055778
Name: USAGE, dtype: float64
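
Taking the absolute value ranks the columns by strength but hides whether usage rises or falls with each variable. A minimal sketch on synthetic data (the coefficients and noise levels below are invented for illustration and say nothing about the real dataset) that keeps the sign while still ordering by magnitude:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for energy_weather_df (not real data)
rng = np.random.default_rng(0)
temp = rng.uniform(-10, 35, size=500)
toy = pd.DataFrame({
    "temp": temp,
    "pres": 1015 - 0.3 * temp + rng.normal(0, 2, size=500),
    "USAGE": 0.5 + 0.03 * temp + rng.normal(0, 0.1, size=500),
})

# Signed Pearson correlations with USAGE keep the direction that abs() discards
signed_corr = toy.corr(numeric_only=True)["USAGE"].drop("USAGE")

# Order by magnitude, strongest first, while preserving each sign
order = signed_corr.abs().sort_values(ascending=False).index
ranked = signed_corr.reindex(order)
```

In this toy frame `temp` comes out positive and `pres` negative; applying the same two steps to `energy_weather_df` would show the direction of each relationship alongside its strength.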
