# Power Outages
* **See the main project notebook for instructions to be sure you satisfy the rubric!**
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    * Predict the severity (number of customers, duration, or demand loss) of a major power outage.
    * Predict the cause of a major power outage.
    * Predict the number and/or severity of major power outages in the year 2020.
    * Predict the electricity consumption of an area.

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.

# Summary of Findings


### Introduction
For this assignment, the prediction problem we are trying to solve is determining the duration of a power outage from the data available in the outages dataset. To compute our predicted values we use a multiple-feature LinearRegression model. We use this model to predict the OUTAGE.DURATION variable in the dataset, and our goal is to predict OUTAGE.DURATION with an R^2 score above 0.5.

### Baseline Model
Our baseline model runs off of a slightly cleaned-up dataset. Essentially all we have done at this point is drop "OUTAGE.START", "OUTAGE.DURATION", and "OUTAGE.RESTORATION". We removed the start and restoration columns because including both would make the problem trivial, since the duration is simply the difference between them. The duration column was removed because it is the value we are trying to predict.

As for the data that remained, we simply left the numeric values as is and one-hot encoded all remaining categorical variables. However, we did not make any distinction between numeric columns that were technically categorical variables and numeric columns that were truly quantitative. For example, the month and day columns were treated as quantitative and were not one-hot encoded.
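
To illustrate the distinction above, here is a minimal sketch of what changes when a month column is one-hot encoded instead of being left as a number. The toy DataFrame is hypothetical and only stands in for the outages data; `pd.get_dummies` is used for brevity rather than the `OneHotEncoder` from the notebook.

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses the outages dataset.
toy = pd.DataFrame({"MONTH": [1, 7, 12, 7]})

# Left as a number, MONTH is a single column, so a linear model fits one
# slope and the contribution of December (12) is 12x that of January (1).
as_quantitative = toy[["MONTH"]]

# One-hot encoded, each observed month gets its own indicator column, so
# the model can fit a separate effect for each month.
as_categorical = pd.get_dummies(toy["MONTH"], prefix="MONTH")

print(as_quantitative.shape)         # (4, 1)
print(list(as_categorical.columns))  # ['MONTH_1', 'MONTH_7', 'MONTH_12']
```

Whether the extra columns help depends on the data; the sketch only shows the difference in representation.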

In the end we had:

- Nominal: 9
- Ordinal: 4
- Quantitative: 41

To assess the performance of our model we decided to use the score reported by LinearRegression, which is the coefficient of determination (R^2). We believe this is a good metric because it tells us how much of the variance in our data the model accounts for. Since we are doing regression and not classification, it would not make much sense to use the literal accuracy of our model to determine its performance.

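The score of our baseline model was very poor at -0.4929976470442452. A negative R^2 means the model did worse on the test set than always predicting the mean duration would have.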
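
To make the metric concrete, here is a minimal plain-Python sketch of the R^2 calculation on made-up numbers. The function name `r_squared` and the toy values are illustrative assumptions, not part of the notebook.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up target values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)                  # 1.0
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # 0.0
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])      # -3.0
```

The last case shows why the score has no lower bound: predictions that are further from the truth than the mean give a residual sum of squares larger than the total sum of squares.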


### Final Model
After lots of trial and error, we settled on two models, the better of the two being our "Attempt 2".

We began by determining all of the features we wanted to engineer that we thought would have a significant impact on our model. The next step was adding them to the model among the mountains of other features already present. We also drew scatter plots of every feature we could from our cleaned outages data to see if there was any obvious correlation between the available factors and OUTAGE.DURATION. However, because we did not see any obvious answers from this EDA, we continued to treat all of the preexisting features the same. We were not getting the performance we wanted out of our first attempt, and we decided that this might be due to the excess of quantitative features at the end of the outages dataset. These features are mostly state-level statistics, so they are essentially just nominal variables as far as the model is concerned.

In our second model, we selected specific features to narrow the scope of our model and put more emphasis on our engineered features. Removing the excess quantitative features did not have as significant an impact as we thought it would (probably because we were running PCA in each of these models), but our second model came out with a slightly higher score of 0.23495832991235685, compared with 0.22497857007570454 for the first attempt. Unfortunately, neither of our models came close to our goal of a score of at least 0.5, so both remain weak predictors.
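
As we understand scikit-learn's `PCA`, a fractional `n_components` such as 0.99 keeps the smallest number of components whose cumulative explained variance reaches that fraction, which may be why dropping low-variance features changed so little. The sketch below reproduces that selection rule with NumPy alone; the helper name `components_for_variance` and the random data are assumptions for illustration.

```python
import numpy as np

def components_for_variance(X, threshold=0.99):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `threshold`."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance captured by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained / explained.sum())
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(0)
# Hypothetical data: two informative directions and one nearly constant one.
X = np.column_stack([
    rng.normal(scale=10.0, size=500),
    rng.normal(scale=5.0, size=500),
    rng.normal(scale=0.001, size=500),
])
k = components_for_variance(X)
print(k)  # the nearly constant column is not needed to reach 99%
```

In this toy case the third column carries almost no variance, so it is discarded by the 99% rule whether or not it is removed beforehand.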

Engineered Features:

QUARTER: The quarter of the year that the outage occurred in
- There could be a relation between the period of the year an outage occurs in and how long it takes to resolve it

OUTAGE.START: The date that the outage started on represented as a string
- Outages that start at different periods of time might take longer to resolve

DAY.OF.WEEK: Numeric representation of M T W Th F Sat Sun
- Outages might take longer to resolve on weekends or specific weekdays

CUSTOMER.PROPORTION.AFFECTED: CUSTOMERS.AFFECTED / TOTAL.CUSTOMERS
- Outages that affect a larger portion of customers might be resolved faster as they are more impactful on business

CUSTOMER.VALUE: CUSTOMERS.AFFECTED * TOTAL.SALES
- Outages that affect higher paying customers might be resolved faster as they are more impactful on business

HOUR: Hour of the day that an outage occurs during, 24 hour time
- Outages that occur late at night or early in the morning might take more time to address
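
The calendar and proportion features listed above can be sketched in vectorized form with the pandas `.dt` accessor. The two toy rows are hypothetical (the column names follow the outages dataset), and this is an illustration rather than the code used to produce the reported scores.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the outages dataset.
toy = pd.DataFrame({
    "OUTAGE.START": pd.to_datetime(["2011-07-01 17:00:00", "2014-05-11 18:38:00"]),
    "CUSTOMERS.AFFECTED": [70000.0, 1000.0],
    "TOTAL.CUSTOMERS": [2595696.0, 2640737.0],
})

start = toy["OUTAGE.START"]
toy["QUARTER"] = start.dt.quarter          # 1 through 4
toy["DAY.OF.WEEK"] = start.dt.dayofweek    # Monday=0 ... Sunday=6
toy["HOUR"] = start.dt.hour                # 0 through 23
toy["CUSTOMER.PROPORTION.AFFECTED"] = (
    toy["CUSTOMERS.AFFECTED"] / toy["TOTAL.CUSTOMERS"]
)

print(toy[["QUARTER", "DAY.OF.WEEK", "HOUR"]])
```

Each derived column holds one value per row, so every outage keeps its own quarter, weekday, and hour.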


We chose a LinearRegression model because we wanted to predict a quantitative variable and believed that it would be the best model to do so with a large number of features.

### Fairness Evaluation

For our fairness test, since we were not using a classifier to predict a category, we decided to use a hypothesis test instead of a permutation test. To be completely honest, we did not really know how to set up a permutation test for this situation since our predicted variable was not a category. So, to the best of our ability, we developed a hypothesis test to see if our model performed better or worse for specific periods of the year (months).

Null: Our model performs fairly for all months, and any poor performance for a month is due to chance

Alternate: Our model performs unfairly for specific months 

For our hypothesis test we decided to use the average R^2 score across 12 months as a statistic to see if the incredibly poor performance for most of the dataset was just chance or if our model was not performing fairly for the majority of months in the year. For our test statistics we randomly sampled 1/12th of the dataset 12 times to simulate random values for each month.

In the end, our hypothesis test showed that 33.6% of the time (p-value 0.336), when months are composed of rows randomly sampled from the dataset, the average per-month performance of our model is as bad as or worse than what we observed. Conversely, the model does better on average the other 66.4% of the time. Since this p-value is well above 0.05, we cannot reject the null hypothesis. We believe this means our model is so inconsistent that, even though it performs unfairly for certain months, that unfairness is not consistent and is caused only by the poor quality of our underdeveloped model.
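
For reference, one way a permutation test could be set up for a regression model is to compare the model's absolute errors between two groups (for example, one month versus the rest) and shuffle the group labels. The sketch below is an assumption about how that might look, not the test that was run; the function name and the toy errors are hypothetical.

```python
import numpy as np

def permutation_test_abs_error(abs_errors, in_group, n_permutations=1000, seed=0):
    """Permutation test for a difference in mean absolute error between
    the rows where `in_group` is True and the rows where it is False."""
    abs_errors = np.asarray(abs_errors, dtype=float)
    in_group = np.asarray(in_group, dtype=bool)
    rng = np.random.default_rng(seed)

    def statistic(mask):
        # Absolute difference in mean absolute error between the two groups.
        return abs(abs_errors[mask].mean() - abs_errors[~mask].mean())

    observed = statistic(in_group)
    # Shuffling the labels keeps the group sizes but breaks any real link
    # between group membership and error size.
    simulated = np.array([
        statistic(rng.permutation(in_group)) for _ in range(n_permutations)
    ])
    p_value = (simulated >= observed).mean()
    return observed, p_value

# Hypothetical absolute errors: the first ten rows are predicted far worse.
errors = [10.0] * 10 + [1.0] * 10
group = [True] * 10 + [False] * 10
observed, p_value = permutation_test_abs_error(errors, group)
```

A small p-value here would suggest the model's errors differ between the two groups by more than chance alone would explain.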




# Code

In [7]:
pd.set_option('display.max_columns', 999)


In [5]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures


from sklearn.impute import SimpleImputer

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier




# Initializing Data

In [8]:
# Keep a permanent copy of the original data to avoid re-reading the Excel file
# A forward slash avoids the invalid "\o" escape sequence in the original path
DF_ORIGINAL = pd.read_excel("data/outage.xlsx")

In [9]:
outage = DF_ORIGINAL.copy()

# Drop excess descriptive rows
outage = (outage
          .drop(np.arange(0, 4), axis=0)
          .drop("Major power outage events in the continental U.S.", axis=1)
          .reset_index(drop=True)
         )

# Set columns from leftover row
cols = outage.iloc[0].reset_index(drop=True)
outage.columns = cols
outage = outage.drop([0, 1]).reset_index(drop=True)
outage.head()

Unnamed: 0,OBS,YEAR,MONTH,U.S._STATE,POSTAL.CODE,NERC.REGION,CLIMATE.REGION,ANOMALY.LEVEL,CLIMATE.CATEGORY,OUTAGE.START.DATE,OUTAGE.START.TIME,OUTAGE.RESTORATION.DATE,OUTAGE.RESTORATION.TIME,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,HURRICANE.NAMES,OUTAGE.DURATION,DEMAND.LOSS.MW,CUSTOMERS.AFFECTED,RES.PRICE,COM.PRICE,IND.PRICE,TOTAL.PRICE,RES.SALES,COM.SALES,IND.SALES,TOTAL.SALES,RES.PERCEN,COM.PERCEN,IND.PERCEN,RES.CUSTOMERS,COM.CUSTOMERS,IND.CUSTOMERS,TOTAL.CUSTOMERS,RES.CUST.PCT,COM.CUST.PCT,IND.CUST.PCT,PC.REALGSP.STATE,PC.REALGSP.USA,PC.REALGSP.REL,PC.REALGSP.CHANGE,UTIL.REALGSP,TOTAL.REALGSP,UTIL.CONTRI,PI.UTIL.OFUSA,POPULATION,POPPCT_URBAN,POPPCT_UC,POPDEN_URBAN,POPDEN_UC,POPDEN_RURAL,AREAPCT_URBAN,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND
0,1,2011,7,Minnesota,MN,MRO,East North Central,-0.3,normal,2011-07-01 00:00:00,17:00:00,2011-07-03 00:00:00,20:00:00,severe weather,,,3060,,70000.0,11.6,9.18,6.81,9.28,2332915,2114774,2113291,6562520,35.5491,32.225,32.2024,2308736,276286,10673,2595696,88.9448,10.644,0.411181,51268,47586,1.07738,1.6,4802,274182,1.75139,2.2,5348119,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874
1,2,2014,5,Minnesota,MN,MRO,East North Central,-0.1,normal,2014-05-11 00:00:00,18:38:00,2014-05-11 00:00:00,18:39:00,intentional attack,vandalism,,1,,,12.12,9.71,6.49,9.28,1586986,1807756,1887927,5284231,30.0325,34.2104,35.7276,2345860,284978,9898,2640737,88.8335,10.7916,0.37482,53499,49091,1.08979,1.9,5226,291955,1.79,2.2,5457125,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874
2,3,2010,10,Minnesota,MN,MRO,East North Central,-1.5,cold,2010-10-26 00:00:00,20:00:00,2010-10-28 00:00:00,22:00:00,severe weather,heavy wind,,3000,,70000.0,10.87,8.19,6.07,8.15,1467293,1801683,1951295,5222116,28.0977,34.501,37.366,2300291,276463,10150,2586905,88.9206,10.687,0.392361,50447,47287,1.06683,2.7,4571,267895,1.70627,2.1,5310903,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874
3,4,2012,6,Minnesota,MN,MRO,East North Central,-0.1,normal,2012-06-19 00:00:00,04:30:00,2012-06-20 00:00:00,23:00:00,severe weather,thunderstorm,,2550,,68200.0,11.79,9.25,6.71,9.19,1851519,1941174,1993026,5787064,31.9941,33.5433,34.4393,2317336,278466,11010,2606813,88.8954,10.6822,0.422355,51598,48156,1.07148,0.6,5364,277627,1.93209,2.2,5380443,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874
4,5,2015,7,Minnesota,MN,MRO,East North Central,1.2,warm,2015-07-18 00:00:00,02:00:00,2015-07-19 00:00:00,07:00:00,severe weather,,,1740,250.0,250000.0,13.07,10.16,7.74,10.43,2028875,2161612,1777937,5970339,33.9826,36.2059,29.7795,2374674,289044,9812,2673531,88.8216,10.8113,0.367005,54431,49844,1.09203,1.7,4873,292023,1.6687,2.2,5489594,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874


# Basic Cleaning

### Combine into one OUTAGE.START and one OUTAGE.RESTORATION

In [10]:
# Clean Data by combining date and time columns and converting to datetime objects

# Copy data to prevent altering original
dirty = outage[["OUTAGE.START.DATE", "OUTAGE.START.TIME", "OUTAGE.RESTORATION.DATE", "OUTAGE.RESTORATION.TIME"]].copy()

# Takes one datetime column and one time column and sets the datetime column dependent on
# the time column
def combineDateAndTime(row, dateCol, timeCol):
    
    # If the value is not present return NaN
    if pd.isnull(row[dateCol]):
        return row[dateCol]
    
    # Otherwise set the values of the datetime obj from the time obj
    return row[dateCol].replace(hour = row[timeCol].hour, 
                                minute = row[timeCol].minute, 
                                second = row[timeCol].second
                                )
    
# Convert both start and restoration times and add them to the table
outage["OUTAGE.START"] = dirty.apply(combineDateAndTime, axis=1, args=["OUTAGE.START.DATE", "OUTAGE.START.TIME"])
outage["OUTAGE.RESTORATION"] = dirty.apply(combineDateAndTime, axis=1, args=["OUTAGE.RESTORATION.DATE", "OUTAGE.RESTORATION.TIME"])
outageClean = pd.DataFrame(outage.drop(columns=["OUTAGE.START.DATE", "OUTAGE.START.TIME", "OUTAGE.RESTORATION.DATE", "OUTAGE.RESTORATION.TIME"]))

outageClean.head()


Unnamed: 0,OBS,YEAR,MONTH,U.S._STATE,POSTAL.CODE,NERC.REGION,CLIMATE.REGION,ANOMALY.LEVEL,CLIMATE.CATEGORY,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,HURRICANE.NAMES,OUTAGE.DURATION,DEMAND.LOSS.MW,CUSTOMERS.AFFECTED,RES.PRICE,COM.PRICE,IND.PRICE,TOTAL.PRICE,RES.SALES,COM.SALES,IND.SALES,TOTAL.SALES,RES.PERCEN,COM.PERCEN,IND.PERCEN,RES.CUSTOMERS,COM.CUSTOMERS,IND.CUSTOMERS,TOTAL.CUSTOMERS,RES.CUST.PCT,COM.CUST.PCT,IND.CUST.PCT,PC.REALGSP.STATE,PC.REALGSP.USA,PC.REALGSP.REL,PC.REALGSP.CHANGE,UTIL.REALGSP,TOTAL.REALGSP,UTIL.CONTRI,PI.UTIL.OFUSA,POPULATION,POPPCT_URBAN,POPPCT_UC,POPDEN_URBAN,POPDEN_UC,POPDEN_RURAL,AREAPCT_URBAN,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND,OUTAGE.START,OUTAGE.RESTORATION
0,1,2011,7,Minnesota,MN,MRO,East North Central,-0.3,normal,severe weather,,,3060,,70000.0,11.6,9.18,6.81,9.28,2332915,2114774,2113291,6562520,35.5491,32.225,32.2024,2308736,276286,10673,2595696,88.9448,10.644,0.411181,51268,47586,1.07738,1.6,4802,274182,1.75139,2.2,5348119,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874,2011-07-01 17:00:00,2011-07-03 20:00:00
1,2,2014,5,Minnesota,MN,MRO,East North Central,-0.1,normal,intentional attack,vandalism,,1,,,12.12,9.71,6.49,9.28,1586986,1807756,1887927,5284231,30.0325,34.2104,35.7276,2345860,284978,9898,2640737,88.8335,10.7916,0.37482,53499,49091,1.08979,1.9,5226,291955,1.79,2.2,5457125,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874,2014-05-11 18:38:00,2014-05-11 18:39:00
2,3,2010,10,Minnesota,MN,MRO,East North Central,-1.5,cold,severe weather,heavy wind,,3000,,70000.0,10.87,8.19,6.07,8.15,1467293,1801683,1951295,5222116,28.0977,34.501,37.366,2300291,276463,10150,2586905,88.9206,10.687,0.392361,50447,47287,1.06683,2.7,4571,267895,1.70627,2.1,5310903,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874,2010-10-26 20:00:00,2010-10-28 22:00:00
3,4,2012,6,Minnesota,MN,MRO,East North Central,-0.1,normal,severe weather,thunderstorm,,2550,,68200.0,11.79,9.25,6.71,9.19,1851519,1941174,1993026,5787064,31.9941,33.5433,34.4393,2317336,278466,11010,2606813,88.8954,10.6822,0.422355,51598,48156,1.07148,0.6,5364,277627,1.93209,2.2,5380443,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874,2012-06-19 04:30:00,2012-06-20 23:00:00
4,5,2015,7,Minnesota,MN,MRO,East North Central,1.2,warm,severe weather,,,1740,250.0,250000.0,13.07,10.16,7.74,10.43,2028875,2161612,1777937,5970339,33.9826,36.2059,29.7795,2374674,289044,9812,2673531,88.8216,10.8113,0.367005,54431,49844,1.09203,1.7,4873,292023,1.6687,2.2,5489594,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874,2015-07-18 02:00:00,2015-07-19 07:00:00


### Convert Remaining Columns to Appropriate Datatypes


Everything from the original dataset was given as a string; some columns need to remain strings while others need to be converted to floats. The simple solution is to try converting each column to a float and, if that fails, leave it as is.

In [11]:
# Change all values to proper datatypes to run statistics
for col in outageClean.columns:
    try:
        outageClean[col] = outageClean[col].astype(float)

    # If the column cannot be converted to a float (e.g. text or datetimes), keep it as is
    except (ValueError, TypeError):
        pass

# Dealing With Missingness

The principal purpose of missingness analysis in this situation is to understand what kinds of imputation must be done to make our data usable in a pipeline. Once we know which columns have missing data, we can inspect those columns to determine why the data is missing and how best to resolve the issue.

In [35]:
def proportionMissing(col):
    # The mean of a boolean mask is the fraction of True (missing) values
    return col.isna().mean()

In [44]:
proportions = outageClean.apply(proportionMissing)

proportions[proportions != 0].to_frame()

Unnamed: 0,0
MONTH,0.005867
CLIMATE.REGION,0.003911
ANOMALY.LEVEL,0.005867
CLIMATE.CATEGORY,0.005867
CAUSE.CATEGORY.DETAIL,0.30704
HURRICANE.NAMES,0.953064
OUTAGE.DURATION,0.03781
DEMAND.LOSS.MW,0.459583
CUSTOMERS.AFFECTED,0.288787
RES.PRICE,0.014342


#### Missing By Design (MD)

Both CAUSE.CATEGORY.DETAIL and HURRICANE.NAMES have null values by design. 

**HURRICANE.NAMES** is null for every entry whose cause is not a hurricane.

**CAUSE.CATEGORY.DETAIL** is null for every entry of CAUSE.CATEGORY that does not have any more specific detailed information to provide. However, there are null values for certain entries that do have the ability to provide more specific information, so the column is not completely missing by design.

In [53]:
outageClean[~outageClean["CAUSE.CATEGORY.DETAIL"].isna()]["CAUSE.CATEGORY"].value_counts()

severe weather                   576
intentional attack               370
equipment failure                 48
system operability disruption     37
fuel supply emergency             32
Name: CAUSE.CATEGORY, dtype: int64
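
A missing-by-design claim like the one for HURRICANE.NAMES can be checked by cross-tabulating missingness against the condition that is supposed to explain it. The sketch below uses hypothetical toy rows, and the `"hurricanes"` label is an assumed value for CAUSE.CATEGORY.DETAIL rather than one confirmed above.

```python
import pandas as pd

# Hypothetical toy rows that mimic the two columns of interest.
toy = pd.DataFrame({
    "CAUSE.CATEGORY.DETAIL": ["hurricanes", "vandalism", None, "hurricanes"],
    "HURRICANE.NAMES": ["Ike", None, None, None],
})

is_hurricane = toy["CAUSE.CATEGORY.DETAIL"].eq("hurricanes")
name_missing = toy["HURRICANE.NAMES"].isna()

# Rows: is the outage a hurricane? Columns: is the name missing?
table = pd.crosstab(is_hurricane, name_missing)

# A non-hurricane row with a name would contradict "missing by design";
# a hurricane row without a name is missing for some other reason.
non_hurricane_with_name = int((~is_hurricane & ~name_missing).sum())
hurricane_without_name = int((is_hurricane & name_missing).sum())
print(table)
```

If the first count is zero and the second is not, the column is missing by design for non-hurricanes but also has ordinary missing values among hurricanes.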

### Baseline Model

In [16]:
# Separate the features from the target (OUTAGE.DURATION)

X = outageClean.drop(["OUTAGE.DURATION", "OUTAGE.RESTORATION"], axis=1)

y = outageClean["OUTAGE.DURATION"].fillna(0)


In [17]:

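# Determine which columns are categorical (object dtype) and which are not.
# The builtin `object` replaces the `np.object` alias, which newer NumPy releases removed.
types = X.dtypes
catcols = types.loc[types == object].index
numcols = types.loc[types != object].index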


In [98]:
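# Categorical columns: fill missing values with a placeholder, then one-hot encode.
# Note: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`.
categoricals = Pipeline([
    ('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
])

# Numeric columns: fill missing values with 0.
numerics = Pipeline([
    ('imp', SimpleImputer(strategy='constant', fill_value=0)),
])

# Use the pipelines defined above (the original referenced an undefined name `cats`).
ct = ColumnTransformer([
    ('catcols', categoricals, catcols),
    ('fillNA', numerics, numcols),
])


pl = Pipeline([('feats', ct), ('reg', LinearRegression())])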

In [99]:
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.25, random_state = 2)

In [100]:
pl.fit(X_tr, y_tr)
pl.score(X_ts, y_ts)

-0.4929976470442452

### Final Model

#### Attempt One

In [101]:
# Make a quarter feature
def findQuarter(date):
    if pd.isnull(date):
        return 0
    return int(date.quarter)

# Make a date feature to onehot encode the date 
def dateToString(date):
    return str(date)

# Make a feature that corresponds to MTW...
def dayOfTheWeek(date):
    return str(date.dayofweek)

# Make a feature that corresponds to the proportion of the customers affected by the outage
def proportionAffected(row):
    return row["CUSTOMERS.AFFECTED"] / row["TOTAL.CUSTOMERS"]

# Make a feature that corresponds to the monetary value of the customers that were affected
def customerValue(row):
    return row["CUSTOMERS.AFFECTED"] * row["TOTAL.SALES"]

# Make a feature that corresponds to the hour of the day an outage occurs at
def hourOfDay(date):
    return date.hour

In [102]:
outageCleanFinal = outageClean.copy()

outageCleanFinal["QUARTER"] = outageClean["OUTAGE.START"].apply(findQuarter)
outageCleanFinal["OUTAGE.START"] = outageClean["OUTAGE.START"].apply(dateToString)
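# NOTE: .iloc[0] selects only the first row's weekday, so this assignment broadcasts
# that single value to every row; drop .iloc[0] to get a per-row weekday.
outageCleanFinal["DAY.OF.WEEK"] = outageClean["OUTAGE.START"].apply(dayOfTheWeek).iloc[0]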
outageCleanFinal["CUSTOMER.PROPORTION.AFFECTED"] = outageClean.apply(proportionAffected, axis=1)
outageCleanFinal["CUSTOMER.VALUE"] = outageClean.apply(customerValue, axis=1)
outageCleanFinal["HOUR"] = outageClean["OUTAGE.START"].apply(hourOfDay).fillna(0)


outageCleanFinal = outageCleanFinal.drop("POSTAL.CODE", axis=1)


# Set up X and Y data
X = outageCleanFinal.drop(["OUTAGE.DURATION", "OUTAGE.RESTORATION", "OBS"], axis=1)
X["MONTH"] = X["MONTH"].fillna(0)
y = outageCleanFinal["OUTAGE.DURATION"].fillna(0)

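# Determine which columns are categorical and which are not
types = X.dtypes
catcols = types.loc[types == object].index

# Also treat these calendar columns as categorical so they are one-hot encoded.
# pd.concat replaces Series.append, which was removed in pandas 2.0.
catcols = pd.concat([
    pd.Series(catcols),
    pd.Series(["QUARTER", "YEAR", "OUTAGE.START", "MONTH", "HOUR", "DAY.OF.WEEK"]),
]).reset_index(drop=True)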



In [103]:
cats = Pipeline([
    ('imp', SimpleImputer(strategy='constant', fill_value="NULL")),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
    ('pca', PCA(svd_solver='full', n_components=0.99))

])

ct = ColumnTransformer([
    ('catcols', cats, catcols),
])


pl = Pipeline([('feats', ct), ('reg', LinearRegression())])

In [104]:
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.25, random_state = 2)

In [105]:
pl.fit(X_tr, y_tr)
pl.score(X_ts, y_ts)

0.22497857007570454

#### Attempt 2

In [111]:
# Keep only a hand-picked subset of the original columns
outageCleanFinal = outageClean[[
    "YEAR", "MONTH", "U.S._STATE", "NERC.REGION", "CLIMATE.REGION", "ANOMALY.LEVEL",
    "CLIMATE.CATEGORY", "CAUSE.CATEGORY", "CUSTOMERS.AFFECTED", "TOTAL.PRICE",
    "TOTAL.SALES", "POPULATION", "OUTAGE.START", "OUTAGE.DURATION",
]].copy()

outageCleanFinal["QUARTER"] = outageClean["OUTAGE.START"].apply(findQuarter)
outageCleanFinal["OUTAGE.START"] = outageClean["OUTAGE.START"].apply(dateToString)
# NOTE: .iloc[0] selects only the first row's weekday, so this assignment broadcasts
# that single value to every row; drop .iloc[0] to get a per-row weekday.
outageCleanFinal["DAY.OF.WEEK"] = outageClean["OUTAGE.START"].apply(dayOfTheWeek).iloc[0]
outageCleanFinal["CUSTOMER.PROPORTION.AFFECTED"] = outageClean.apply(proportionAffected, axis=1)
outageCleanFinal["CUSTOMER.VALUE"] = outageClean.apply(customerValue, axis=1)
outageCleanFinal["HOUR"] = outageClean["OUTAGE.START"].apply(hourOfDay).fillna(0)
outageCleanFinal["MONTH"] = outageClean["MONTH"].fillna(0)



X = outageCleanFinal.drop("OUTAGE.DURATION", axis=1)
y = outageCleanFinal["OUTAGE.DURATION"].fillna(0)



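# Determine which columns are categorical and which are not
types = X.dtypes
catcols = pd.Series(types.loc[types == object].index)

# Also treat these calendar columns as categorical so they are one-hot encoded.
# pd.concat replaces Series.append, which was removed in pandas 2.0.
catcols = pd.concat([
    catcols,
    pd.Series(["QUARTER", "YEAR", "MONTH", "HOUR", "DAY.OF.WEEK", "OUTAGE.START"]),
]).reset_index(drop=True)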


In [112]:
cats = Pipeline([
    ('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
    ('pca', PCA(svd_solver='full', n_components=0.99))

])

ct = ColumnTransformer([
    ('catcols', cats, catcols),
])


pl = Pipeline([('feats', ct), ('reg', LinearRegression())])

In [113]:
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.25, random_state = 2)

In [114]:
pl.fit(X_tr, y_tr)
pl.score(X_ts, y_ts)



0.23495832991235685

### Fairness Evaluation

In [116]:
# Separate training and test data by month to see if there is any difference between model performance on a month by month basis

scoresPerMonth = []

# Find the score of our model on each separate month of the year
for i in np.arange(1, 13):
    
    monData = outageCleanFinal[outageCleanFinal["MONTH"] == i]
    
    X = monData.drop("OUTAGE.DURATION", axis=1)
    y = monData["OUTAGE.DURATION"].fillna(0)
    
    X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.25, random_state = 2)

    pl.fit(X_tr, y_tr)
    scoresPerMonth.append(pl.score(X_ts, y_ts))



In [120]:
# Put the results into a dataframe
monthList = ["January", "February", "March", "April", "May", "June", "July", 
             "August", "September", "October", "November", "December"]

dfFairness = pd.DataFrame(index=monthList, data={"SCORE": scoresPerMonth})

### Observed Fairness

In [131]:
display(dfFairness)
obsvMeanFairness = np.mean(scoresPerMonth)
obsvMeanFairness

Unnamed: 0,SCORE
January,-0.944511
February,-5.47473
March,-0.784987
April,0.156092
May,-2.372941
June,-0.066275
July,-0.65045
August,0.159271
September,-0.122489
October,0.403898


-0.98350324136128

### Fairness Hypothesis Test

In [140]:
results = []
for j in range(1000):
    scoresPerMonth = []

    # Find the score of our model on each separate month of the year
    for i in np.arange(1, 13):

        monData = outageCleanFinal.sample(frac=(1/12), replace=True)

        X = monData.drop("OUTAGE.DURATION", axis=1)
        y = monData["OUTAGE.DURATION"].fillna(0)

        X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.25, random_state = 2)

        pl.fit(X_tr, y_tr)
        scoresPerMonth.append(pl.score(X_ts, y_ts))
        
    results.append(np.mean(scoresPerMonth))

# p-value: proportion of simulated mean scores at or below the observed mean score
(np.array(results) <= obsvMeanFairness).mean()

0.336