# Project Title: Time Series Analysis of Pennsylvania and Illinois Weather, Energy Consumption, and Flu Contagion (2013–2015)

## Introduction

In this project, we aim to identify and explore temporal patterns, correlations, and potential causal relationships between weather conditions, energy consumption, and flu contagion across Pennsylvania and Illinois from 2013 to 2015. By analyzing these interrelated datasets, we hope to derive insights that can inform stakeholders in public health and energy management.

### Objectives
- Analyze daily/hourly weather data to identify trends and seasonal patterns.
- Examine hourly energy consumption data to understand usage patterns in relation to weather.
- Investigate weekly flu contagion data to determine correlations with weather and energy consumption.
- Utilize statistical modeling techniques to assess interdependencies among the datasets.

### Key Datasets
1. **USA Weather Dataset (2013–2015)**:
   - Temporal granularity: Hourly
   - Variables of interest: Temperature.

2. **PJM Historic Energy Consumption**:
   - Temporal granularity: Hourly
   - Scope: Energy usage data for Pennsylvania (Duquesne Light) and Illinois (ComEd).

3. **Flu Contagion Dataset by State (2013–2015)**:
   - Temporal granularity: Weekly
   - Variables: Influenza-like illness (ILI) [activity](https://www.cdc.gov/mmwr/volumes/67/wr/mm6722a4.htm) in the USA.

This notebook will guide you through the data ingestion, preparation, exploratory data analysis (EDA), time series modeling, and visualization phases of the project.

In [32]:
# Data Manipulation and Analysis
import pandas as pd  # For data manipulation and analysis using DataFrames
import numpy as np   # For numerical operations and handling arrays

# Statistical Analysis
import scipy.stats as stats  # For statistical tests and distributions
from statsmodels.tsa.stattools import adfuller  # For stationarity tests

# Time Series Analysis
from statsmodels.tsa.arima.model import ARIMA  # For ARIMA modeling
from statsmodels.tsa.seasonal import seasonal_decompose  # For seasonal decomposition of time series

# Machine Learning Libraries
from sklearn.model_selection import train_test_split  # For splitting datasets into training and testing sets
from sklearn.ensemble import RandomForestRegressor  # For regression tasks
import xgboost as xgb  # For gradient boosting models

# Visualization Libraries
import matplotlib.pyplot as plt  # For creating static visualizations
import seaborn as sns            # For enhanced statistical visualizations
import plotly.express as px      # For interactive plots (optional)

# Date and Time Handling
from datetime import datetime     # For date/time manipulation


## Data Ingestion/Wrangling

We begin by loading the datasets needed for the analysis.

The weather dataset has hourly granularity, and we only need two states from it. This dataset sets the time frame of our study. Since we are interested in its influence on energy consumption and flu activity, we keep only the temperature (in kelvins).

Note: for this particular dataset, Pittsburgh and Chicago are the only representative cities available for the states of Pennsylvania and Illinois, respectively.

In [33]:
path = r'sources/historical_hourly_weather_data- 2012_to_2017/temperature.csv'
df = pd.read_csv(path)
display(df)

Unnamed: 0,datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
0,2012-10-01 12:00:00,,,,,,,,,,...,,,,,,,309.100000,,,
1,2012-10-01 13:00:00,284.630000,282.080000,289.480000,281.800000,291.870000,291.530000,293.410000,296.600000,285.120000,...,285.630000,288.220000,285.830000,287.170000,307.590000,305.470000,310.580000,304.4,304.4,303.5
2,2012-10-01 14:00:00,284.629041,282.083252,289.474993,281.797217,291.868186,291.533501,293.403141,296.608509,285.154558,...,285.663208,288.247676,285.834650,287.186092,307.590000,304.310000,310.495769,304.4,304.4,303.5
3,2012-10-01 15:00:00,284.626998,282.091866,289.460618,281.789833,291.862844,291.543355,293.392177,296.631487,285.233952,...,285.756824,288.326940,285.847790,287.231672,307.391513,304.281841,310.411538,304.4,304.4,303.5
4,2012-10-01 16:00:00,284.624955,282.100481,289.446243,281.782449,291.857503,291.553209,293.381213,296.654466,285.313345,...,285.850440,288.406203,285.860929,287.277251,307.145200,304.238015,310.327308,304.4,304.4,303.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45248,2017-11-29 20:00:00,,282.000000,,280.820000,293.550000,292.150000,289.540000,294.710000,285.720000,...,290.240000,,275.130000,288.080000,,,,,,
45249,2017-11-29 21:00:00,,282.890000,,281.650000,295.680000,292.740000,290.610000,295.590000,286.450000,...,289.240000,,274.130000,286.020000,,,,,,
45250,2017-11-29 22:00:00,,283.390000,,282.750000,295.960000,292.580000,291.340000,296.250000,286.440000,...,286.780000,,273.480000,283.940000,,,,,,
45251,2017-11-29 23:00:00,,283.020000,,282.960000,295.650000,292.610000,292.150000,297.150000,286.140000,...,284.570000,,272.480000,282.170000,,,,,,


In [34]:
df_temperature = df[['datetime', 'Pittsburgh', 'Chicago']].copy()
display(df_temperature)

# export it as CSV
df_temperature.to_csv('raw_temperature.csv', index=False)

# change 'datetime' format
df_temperature['datetime'] = pd.to_datetime(df_temperature['datetime'])

Unnamed: 0,datetime,Pittsburgh,Chicago
0,2012-10-01 12:00:00,,
1,2012-10-01 13:00:00,281.000000,284.010000
2,2012-10-01 14:00:00,281.024767,284.054691
3,2012-10-01 15:00:00,281.088319,284.177412
4,2012-10-01 16:00:00,281.151870,284.300133
...,...,...,...
45248,2017-11-29 20:00:00,285.300000,281.340000
45249,2017-11-29 21:00:00,285.330000,281.690000
45250,2017-11-29 22:00:00,282.910000,281.070000
45251,2017-11-29 23:00:00,280.140000,280.060000


In [35]:
# for completeness, count NaNs and duplicates
# the few NaNs are skipped later by the weekly mean/max/min aggregation
print(df_temperature.isna().sum())
print(df_temperature.duplicated().sum())

datetime      0
Pittsburgh    3
Chicago       3
dtype: int64
0


As mentioned before, the granularity is hourly. To smooth the curve we could take a moving average or simply group by week and take the mean. Grouping by week also makes the series compatible with the flu dataset, which is reported weekly.
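
As an illustration of the moving-average alternative, the sketch below smooths a synthetic hourly series with a 7-day time-based window. The data and the names `hourly` and `smoothed` are made up for the example; the notebook itself proceeds with the weekly grouping.

```python
import numpy as np
import pandas as pd

# synthetic hourly series covering two weeks (illustrative data only)
index = pd.date_range("2013-01-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.arange(len(index), dtype=float), index=index)

# time-based window: each point is the mean of the preceding 7 days of readings
smoothed = hourly.rolling("7D").mean()
```

Unlike the weekly grouping, the rolling mean keeps one value per hour, so it would still need resampling before it could be joined with weekly flu data.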

We add a new 'week' column holding the week number. To retain as much detail as possible, we also compute the 'max_temp' and 'min_temp' columns when grouping. Energy consumption models, such as those for heating and cooling, often assume a linear relationship with temperature, which makes the mean temperature the most applicable statistic in energy-related studies.

In [36]:
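# make a new column for the ISO week number (1-53)
# note: the week number repeats every year, so grouping on it alone
# pools the same week number across all years present in the data
df_temperature['week'] = df_temperature['datetime'].dt.isocalendar().week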


# group by week number, add 'avg_temp', 'max_temp', 'min_temp' columns
df_temperature_week = df_temperature.groupby('week').agg({
    'datetime': 'first',
    'Pittsburgh': ['mean', 'max', 'min'],
    'Chicago': ['mean', 'max', 'min']
    }).reset_index()

# flatten the column names
df_temperature_week.columns = ['week', 'datetime', 'pittsburgh_avg_temp', 
                               'pittsburgh_max_temp', 'pittsburgh_min_temp', 
                               'chicago_avg_temp', 'chicago_max_temp', 
                               'chicago_min_temp'
                               ]

# display
display(df_temperature_week.head())

Unnamed: 0,week,datetime,pittsburgh_avg_temp,pittsburgh_max_temp,pittsburgh_min_temp,chicago_avg_temp,chicago_max_temp,chicago_min_temp
0,1,2012-12-31,270.490219,287.401667,258.11,268.218978,285.193,255.52
1,2,2013-01-07,271.51065,290.66,253.57,268.742806,285.193,251.541
2,3,2013-01-14,273.441404,292.34,258.56,271.183082,288.35,253.53
3,4,2013-01-21,270.641184,289.72,254.15,269.780815,285.193,255.23
4,5,2013-01-28,271.648399,289.8,250.52,269.966624,287.15,248.89
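
The grouping above keys on the ISO week number only, and that number repeats every year, so each row summarizes the same week number across every year in the data. If one row per calendar week is wanted instead, a possible variant is to group on the ISO year and week together. The sketch below is a minimal illustration on synthetic data; `weekly_stats`, `demo`, and the column names it produces are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def weekly_stats(df, value_cols, time_col="datetime"):
    """Aggregate rows per (ISO year, ISO week) rather than per week number alone."""
    iso = df[time_col].dt.isocalendar()
    keyed = df.assign(iso_year=iso["year"].astype(int),
                      iso_week=iso["week"].astype(int))
    grouped = keyed.groupby(["iso_year", "iso_week"])
    stats = grouped[value_cols].agg(["mean", "max", "min"])
    # flatten the (column, statistic) MultiIndex into single names
    stats.columns = [f"{col}_{stat}" for col, stat in stats.columns]
    # earliest timestamp observed in each week
    stats.insert(0, "first_timestamp", grouped[time_col].min())
    return stats.reset_index()

# synthetic data: 280 K throughout 2013 and 290 K throughout 2014
hours = pd.date_range("2013-01-01", "2014-12-31 23:00", freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({
    "datetime": hours,
    "Pittsburgh": np.where(hours.year == 2013, 280.0, 290.0),
})

weekly = weekly_stats(demo, ["Pittsburgh"])

# for comparison: grouping on the week number alone mixes both years
pooled = demo.groupby(demo["datetime"].dt.isocalendar().week)["Pittsburgh"].mean()
```

In this toy example, week 2 of 2013 and week 2 of 2014 keep their own means in `weekly`, while `pooled` averages the two years together.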


In [37]:
# finally, convert all the temperature columns from kelvins to degrees Celsius
cols_to_convert = ['pittsburgh_avg_temp', 'pittsburgh_max_temp', 'pittsburgh_min_temp', 
                   'chicago_avg_temp', 'chicago_max_temp', 'chicago_min_temp']

# convert each column to °C
for col in cols_to_convert:
    df_temperature_week[col] = df_temperature_week[col] - 273.15

# display
display(df_temperature_week.head())

Unnamed: 0,week,datetime,pittsburgh_avg_temp,pittsburgh_max_temp,pittsburgh_min_temp,chicago_avg_temp,chicago_max_temp,chicago_min_temp
0,1,2012-12-31,-2.659781,14.251667,-15.04,-4.931022,12.043,-17.63
1,2,2013-01-07,-1.63935,17.51,-19.58,-4.407194,12.043,-21.609
2,3,2013-01-14,0.291404,19.19,-14.59,-1.966918,15.2,-19.62
3,4,2013-01-21,-2.508816,16.57,-19.0,-3.369185,12.043,-17.92
4,5,2013-01-28,-1.501601,16.65,-22.63,-3.183376,14.0,-24.26


Electrical energy consumption reported by the local utility companies. Here we have datasets for Duquesne Light Co. (DUQ) and Commonwealth Edison (ComEd), for Pennsylvania and Illinois respectively. Duquesne Light Co. supplies electricity mainly to Pennsylvania, although Pennsylvania may draw energy from other sources when needed. Likewise, Commonwealth Edison supplies mainly Illinois, although Illinois may draw energy from other sources when needed.

In [38]:
# import energy datasets
path1 = r'sources\hourly_energy_consumption\DUQ_hourly.csv'
path2 = r'sources\hourly_energy_consumption\COMED_hourly.csv'
df_pennsylvania = pd.read_csv(path1)
df_illinois = pd.read_csv(path2)

# check df structure
display(df_pennsylvania)

Unnamed: 0,Datetime,DUQ_MW
0,2005-12-31 01:00:00,1458.0
1,2005-12-31 02:00:00,1377.0
2,2005-12-31 03:00:00,1351.0
3,2005-12-31 04:00:00,1336.0
4,2005-12-31 05:00:00,1356.0
...,...,...
119063,2018-01-01 20:00:00,1962.0
119064,2018-01-01 21:00:00,1940.0
119065,2018-01-01 22:00:00,1891.0
119066,2018-01-01 23:00:00,1820.0


We need only the time period specified by the weather data.

In [39]:
# convert 'Datetime' to datetime format
df_pennsylvania['Datetime'] = pd.to_datetime(df_pennsylvania['Datetime'])
df_illinois['Datetime'] = pd.to_datetime(df_illinois['Datetime'])

# define period limits
start_date = '2012-09-30'
end_date = '2017-10-29'

# apply the period filter
df_pennsylvania = df_pennsylvania[(df_pennsylvania['Datetime'] >= start_date) & (df_pennsylvania['Datetime'] <= end_date)].copy()
df_illinois = df_illinois[(df_illinois['Datetime'] >= start_date) & (df_illinois['Datetime'] <= end_date)].copy()

# verify
display(df_pennsylvania)

# store it as csv
df_pennsylvania.to_csv('raw_energy_pennsylvania.csv', index=False)
df_illinois.to_csv('raw_energy_illinois.csv', index=False)

Unnamed: 0,Datetime,DUQ_MW
61329,2012-12-31 01:00:00,1556.0
61330,2012-12-31 02:00:00,1509.0
61331,2012-12-31 03:00:00,1479.0
61332,2012-12-31 04:00:00,1468.0
61333,2012-12-31 05:00:00,1488.0
...,...,...
113928,2017-01-01 20:00:00,1565.0
113929,2017-01-01 21:00:00,1551.0
113930,2017-01-01 22:00:00,1500.0
113931,2017-01-01 23:00:00,1444.0


We repeat the weekly grouping used for the temperature data.

In [47]:
# first we join both pennsylvania and illinois datasets on the date
df_energy = pd.merge(df_pennsylvania, df_illinois, on='Datetime', how='left')

# rename the columns according to the states
df_energy.rename(columns={'Datetime': 'datetime', 'DUQ_MW': 'Duquesne_MW', 'COMED_MW': 'ComEdison_MW'}, inplace=True)

# convert 'datetime' to datetime format
df_energy['datetime'] = pd.to_datetime(df_energy['datetime'])

# group by ISO week number, as done for temperature
df_energy['week'] = df_energy['datetime'].dt.isocalendar().week
df_energy_week = df_energy.groupby('week').agg({
    'datetime': 'first',
    'Duquesne_MW': ['mean', 'max', 'min'],
    'ComEdison_MW': ['mean', 'max', 'min']
    }).reset_index()

# flatten the column names
df_energy_week.columns = ['week', 'datetime', 'Duquesne_avg_MW', 
                          'Duquesne_max_MW', 'Duquesne_min_MW', 
                          'ComEdison_avg_MW', 'ComEdison_max_MW', 
                          'ComEdison_min_MW'
                          ]

# display
display(df_energy_week.head())

Unnamed: 0,week,datetime,Duquesne_avg_MW,Duquesne_max_MW,Duquesne_min_MW,ComEdison_avg_MW,ComEdison_max_MW,ComEdison_min_MW
0,1,2012-12-31 01:00:00,1678.89881,2124.0,1190.0,11761.661905,15100.0,8810.0
1,2,2013-01-13 01:00:00,1725.325,2367.0,1185.0,12370.408333,16514.0,8657.0
2,3,2013-01-20 01:00:00,1684.219048,2072.0,1154.0,11872.89881,14956.0,8218.0
3,4,2013-01-27 01:00:00,1737.75,2280.0,1144.0,11915.87619,15554.0,8443.0
4,5,2013-02-03 01:00:00,1709.934524,2324.0,1141.0,11875.859524,16064.0,8946.0
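
In the output above, the `datetime` column holds the first row encountered for each week, which is not necessarily the earliest timestamp: week 2 shows 2013-01-13, the last day of that ISO week. This suggests the hourly rows are not in chronological order. One way to guard against this, sketched below on toy data, is to sort each frame and collapse any repeated timestamps before merging. The helper `clean_hourly` and the averaging policy for repeats are assumptions made for illustration; whether the source files actually contain repeated timestamps is not established here.

```python
import pandas as pd

def clean_hourly(df, time_col="Datetime"):
    """Sort by timestamp and average any repeated timestamps (one possible policy)."""
    return (df.groupby(time_col, as_index=False)
              .mean(numeric_only=True)
              .sort_values(time_col)
              .reset_index(drop=True))

# toy frames: unsorted rows and one repeated timestamp (illustrative only)
duq = pd.DataFrame({
    "Datetime": pd.to_datetime(["2013-01-13 01:00", "2013-01-07 01:00", "2013-01-07 01:00"]),
    "DUQ_MW": [1700.0, 1500.0, 1600.0],
})
comed = pd.DataFrame({
    "Datetime": pd.to_datetime(["2013-01-07 01:00", "2013-01-13 01:00"]),
    "COMED_MW": [11000.0, 12000.0],
})

# validate="one_to_one" makes the merge raise if a timestamp is still repeated
merged = pd.merge(clean_hourly(duq), clean_hourly(comed),
                  on="Datetime", how="left", validate="one_to_one")
```

With sorted input, `'first'` in the weekly aggregation returns the earliest timestamp of each group; using `'min'` instead would give the same result without depending on row order.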


Although [Pennsylvania](https://www.macrotrends.net/global-metrics/states/pennsylvania/population) and [Illinois](https://www.macrotrends.net/global-metrics/states/illinois/population) have similar populations, the energy loads in the two datasets differ by nearly an order of magnitude.

| Year | Pennsylvania | Illinois    |
|------|--------------|-------------|
| 2012 | 12,763,536   | 12,875,280  |
| 2013 | 12,773,801   | 12,882,250  |
| 2014 | 12,787,209   | 12,880,552  |
| 2015 | 12,802,503   | 12,859,585  |
| 2016 | 12,784,227   | 12,821,709  |
| 2017 | 12,805,537   | 12,779,893  |

The problem (for our analysis) is the interconnected energy distribution network:

**Duquesne Light Company** primarily provides electricity to southwestern Pennsylvania, specifically serving the greater Pittsburgh area and parts of Allegheny and Beaver counties. The utility serves approximately 600,000 customers across an 817 square mile service area, with about 90% of its service area being residential [[1]](https://www.electricchoice.com/utilities/duquesne-light/), [[2]](https://www.chooseenergy.com/utilities/duquesne-light-co-pa/), [[3]](https://electricityplans.com/pennsylvania/utilities/duquesne-light-company/).

**Commonwealth Edison (ComEd)** is the largest electric utility in Illinois, serving more than 4,000,000 customers across northern Illinois, which represents approximately 70% of the state's population. The company's service territory covers about 11,400 square miles, stretching from the Wisconsin border to the north, Iowa border to the west, Indiana border to the east, and as far south as Iroquois County. ComEd, a unit of Exelon Corporation, manages over 90,000 miles of power lines and has been providing electric service to the region for more than 100 years [[4]](https://www.exeloncorp.com/companies/comed), [[5]](https://www.ilcma.org/friends-of-ilcma/comed/), [[6]](https://en.wikipedia.org/wiki/Commonwealth_Edison).

So, to normalize the electric load, we divide by the number of customers of each company:

In [54]:
# to normalize, divide by the number of customers expressed in thousands:
# MW / (thousands of customers) = kW per customer
# (~600,000 customers for Duquesne Light, ~4,000,000 for ComEd)
df_energy_week_normalized = df_energy_week.copy()
df_energy_week_normalized[['Duquesne_avg_kW', 'Duquesne_max_kW', 'Duquesne_min_kW']] = \
                            df_energy_week_normalized[['Duquesne_avg_MW', 'Duquesne_max_MW', 'Duquesne_min_MW']] / 600.
df_energy_week_normalized[['ComEdison_avg_kW', 'ComEdison_max_kW', 'ComEdison_min_kW']] = \
                            df_energy_week_normalized[['ComEdison_avg_MW', 'ComEdison_max_MW', 'ComEdison_min_MW']] / 4000.

# drop the MW columns by name
mw_cols = ['Duquesne_avg_MW', 'Duquesne_max_MW', 'Duquesne_min_MW',
           'ComEdison_avg_MW', 'ComEdison_max_MW', 'ComEdison_min_MW']
df_energy_week_normalized = df_energy_week_normalized.drop(columns=mw_cols)

# display
display(df_energy_week_normalized.head())

Unnamed: 0,week,datetime,Duquesne_avg_kW,Duquesne_max_kW,Duquesne_min_kW,ComEdison_avg_kW,ComEdison_max_kW,ComEdison_min_kW
0,1,2012-12-31 01:00:00,2.798165,3.54,1.983333,2.940415,3.775,2.2025
1,2,2013-01-13 01:00:00,2.875542,3.945,1.975,3.092602,4.1285,2.16425
2,3,2013-01-20 01:00:00,2.807032,3.453333,1.923333,2.968225,3.739,2.0545
3,4,2013-01-27 01:00:00,2.89625,3.8,1.906667,2.978969,3.8885,2.11075
4,5,2013-02-03 01:00:00,2.849891,3.873333,1.901667,2.968965,4.016,2.2365
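
The divisors 600 and 4000 used above combine two steps: converting MW to kW (a factor of 1000) and dividing by the approximate customer counts quoted earlier (600,000 and 4,000,000). The short sketch below spells the arithmetic out for the first row of the weekly table; `per_customer_kw` is a hypothetical helper, not part of the notebook.

```python
def per_customer_kw(load_mw, customers):
    """Convert a system load in MW to the average load in kW per customer."""
    return load_mw * 1000.0 / customers

# week 1 average loads from the weekly energy table, with approximate customer counts
duq_kw = per_customer_kw(1678.89881, 600_000)        # about 2.798 kW per customer
comed_kw = per_customer_kw(11761.661905, 4_000_000)  # about 2.940 kW per customer
```

Since the customer counts are rounded figures, the per-customer values are best read as a rough common scale for the two utilities rather than as precise household loads.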


Flu dataset. Note that its granularity is weekly.

In [40]:
path = r'sources\fluview\StateDatabySeason55_54,53,52,57,56.csv'
df = pd.read_csv(path)
display(df)

Unnamed: 0,STATENAME,URL,WEBSITE,ACTIVITY LEVEL,ACTIVITY LEVEL LABEL,WEEKEND,WEEK,SEASON
0,Alabama,http://adph.org/influenza/,Influenza Surveillance,Level 1,Minimal,Jun-10-2017,23,2016-17
1,Alabama,http://adph.org/influenza/,Influenza Surveillance,Level 10,High,Mar-25-2017,12,2016-17
2,Alabama,http://adph.org/influenza/,Influenza Surveillance,Level 9,High,Apr-01-2017,13,2016-17
3,Alabama,http://adph.org/influenza/,Influenza Surveillance,Level 4,Low,Apr-08-2017,14,2016-17
4,Alabama,http://adph.org/influenza/,Influenza Surveillance,Level 3,Minimal,Apr-15-2017,15,2016-17
...,...,...,...,...,...,...,...,...
16826,New York City,http://www1.nyc.gov/site/doh/providers/health-...,Surveillance Data,Level 8,High,Dec-22-2012,51,2012-13
16827,New York City,http://www1.nyc.gov/site/doh/providers/health-...,Surveillance Data,Level 1,Minimal,Oct-26-2013,43,2013-14
16828,New York City,http://www1.nyc.gov/site/doh/providers/health-...,Surveillance Data,Level 1,Minimal,Oct-19-2013,42,2013-14
16829,New York City,http://www1.nyc.gov/site/doh/providers/health-...,Surveillance Data,Level 1,Minimal,Oct-12-2013,41,2013-14


In [41]:
# keep only the two states of interest
df_flu = df[df['STATENAME'].isin(['Pennsylvania', 'Illinois'])].copy()
display(df_flu)

# store it as csv
df_flu.to_csv('raw_flu.csv', index=False)

Unnamed: 0,STATENAME,URL,WEBSITE,ACTIVITY LEVEL,ACTIVITY LEVEL LABEL,WEEKEND,WEEK,SEASON
4057,Illinois,http://www.dph.illinois.gov/topics-services/di...,Seasonal Influenza Surveillance Reports,Level 10,High,Dec-23-2017,51,2017-18
4058,Illinois,http://www.dph.illinois.gov/topics-services/di...,Seasonal Influenza Surveillance Reports,Level 6,Moderate,Dec-16-2017,50,2017-18
4059,Illinois,http://www.dph.illinois.gov/topics-services/di...,Seasonal Influenza Surveillance Reports,Level 3,Minimal,Dec-09-2017,49,2017-18
4060,Illinois,http://www.dph.illinois.gov/topics-services/di...,Seasonal Influenza Surveillance Reports,Level 3,Minimal,Dec-02-2017,48,2017-18
4061,Illinois,http://www.dph.illinois.gov/topics-services/di...,Seasonal Influenza Surveillance Reports,Level 1,Minimal,Nov-25-2017,47,2017-18
...,...,...,...,...,...,...,...,...
12188,Pennsylvania,https://www.health.pa.gov/topics/disease/Flu/P...,Influenza Weekly Report,Level 1,Minimal,Jul-15-2017,28,2016-17
12189,Pennsylvania,https://www.health.pa.gov/topics/disease/Flu/P...,Influenza Weekly Report,Level 1,Minimal,Jul-22-2017,29,2016-17
12190,Pennsylvania,https://www.health.pa.gov/topics/disease/Flu/P...,Influenza Weekly Report,Level 1,Minimal,May-06-2017,18,2016-17
12191,Pennsylvania,https://www.health.pa.gov/topics/disease/Flu/P...,Influenza Weekly Report,Level 1,Minimal,May-13-2017,19,2016-17
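
To combine the flu data with the weekly temperature and energy tables, the text columns shown above would need to become numeric and datetime values. The sketch below is one possible way to do that, using three rows copied from the output above; the new column names `level` and `week_end` and the frame `flu_tidy` are assumptions made for illustration.

```python
import pandas as pd

# three rows copied from the filtered flu output (subset of columns)
flu = pd.DataFrame({
    "STATENAME": ["Illinois", "Illinois", "Pennsylvania"],
    "ACTIVITY LEVEL": ["Level 10", "Level 6", "Level 1"],
    "WEEKEND": ["Dec-23-2017", "Dec-16-2017", "Jul-15-2017"],
    "WEEK": [51, 50, 28],
})

# "Level 10" -> 10
flu["level"] = flu["ACTIVITY LEVEL"].str.extract(r"(\d+)", expand=False).astype(int)

# "Dec-23-2017" -> Timestamp (assumes English month abbreviations)
flu["week_end"] = pd.to_datetime(flu["WEEKEND"], format="%b-%d-%Y")

# one row per state and week, in chronological order
flu_tidy = (flu[["STATENAME", "week_end", "WEEK", "level"]]
            .sort_values(["STATENAME", "week_end"])
            .reset_index(drop=True))
```

Sorting matters here because the rows in the source file are not in chronological order, as the displayed output shows.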
