# Outages
**Name(s)**: Bill Wang, Ethan Cao

**Website Link**: https://billwang04.github.io/us_state_power_outage/

## Code

### Question
1. Out of NERC regions / **states**, which are most likely to have the worst outages, and what are the causes?
2. Univariate: 
3. Bivariate: 

In [1]:
import pandas as pd
import numpy as np
import os


import plotly.express as px
pd.options.plotting.backend = 'plotly'

First, we load the data from the Excel file and combine each pair of date and time columns into a single datetime column.

### Cleaning and EDA

In [2]:
def combine_times(date_col_name, time_col_name, new_col_name, df):
    df = df.copy()
    df[new_col_name] = df[date_col_name] + pd.to_timedelta(df[time_col_name].astype(str))
    return df

In [3]:
data = pd.read_excel("outage.xlsx", skiprows=[0,1,2,3,4,6], index_col=1).iloc[:,1:]
data = combine_times("OUTAGE.START.DATE", 'OUTAGE.START.TIME', 'OUTAGE.START.DATETIME', data)
data = combine_times("OUTAGE.RESTORATION.DATE", "OUTAGE.RESTORATION.TIME", "OUTAGE.RESTORATION.DATETIME", data)
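
To see what `combine_times` does in isolation, here is a minimal sketch on a toy frame. The column names `DATE`, `TIME`, and `DATETIME` and the sample values are made up for illustration. It assumes the date column holds timestamps and the time column holds `datetime.time` values, which may not match every way the Excel file can be read.

```python
import datetime
import pandas as pd

# Toy data (hypothetical values): one date column and one time-of-day column
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2011-07-01", "2014-05-11"]),
    "TIME": [datetime.time(17, 0), datetime.time(18, 38)],
})

# str(datetime.time) gives "HH:MM:SS", which pd.to_timedelta can parse
toy["DATETIME"] = toy["DATE"] + pd.to_timedelta(toy["TIME"].astype(str))
print(toy["DATETIME"].tolist())
```

Adding a timedelta to the date keeps the result as a proper datetime column, so later steps can subtract start from restoration times directly.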

In [4]:
data.columns

Index(['YEAR', 'MONTH', 'U.S._STATE', 'POSTAL.CODE', 'NERC.REGION',
       'CLIMATE.REGION', 'ANOMALY.LEVEL', 'CLIMATE.CATEGORY',
       'OUTAGE.START.DATE', 'OUTAGE.START.TIME', 'OUTAGE.RESTORATION.DATE',
       'OUTAGE.RESTORATION.TIME', 'CAUSE.CATEGORY', 'CAUSE.CATEGORY.DETAIL',
       'HURRICANE.NAMES', 'OUTAGE.DURATION', 'DEMAND.LOSS.MW',
       'CUSTOMERS.AFFECTED', 'RES.PRICE', 'COM.PRICE', 'IND.PRICE',
       'TOTAL.PRICE', 'RES.SALES', 'COM.SALES', 'IND.SALES', 'TOTAL.SALES',
       'RES.PERCEN', 'COM.PERCEN', 'IND.PERCEN', 'RES.CUSTOMERS',
       'COM.CUSTOMERS', 'IND.CUSTOMERS', 'TOTAL.CUSTOMERS', 'RES.CUST.PCT',
       'COM.CUST.PCT', 'IND.CUST.PCT', 'PC.REALGSP.STATE', 'PC.REALGSP.USA',
       'PC.REALGSP.REL', 'PC.REALGSP.CHANGE', 'UTIL.REALGSP', 'TOTAL.REALGSP',
       'UTIL.CONTRI', 'PI.UTIL.OFUSA', 'POPULATION', 'POPPCT_URBAN',
       'POPPCT_UC', 'POPDEN_URBAN', 'POPDEN_UC', 'POPDEN_RURAL',
       'AREAPCT_URBAN', 'AREAPCT_UC', 'PCT_LAND', 'PCT_WATER_TOT',
       'PCT

In [5]:
# data = data[['U.S._STATE',"YEAR",'CLIMATE.REGION', 'OUTAGE.START.DATETIME', 'OUTAGE.RESTORATION.DATETIME','CAUSE.CATEGORY', 'CAUSE.CATEGORY.DETAIL', 'CUSTOMERS.AFFECTED', 'OUTAGE.DURATION','DEMAND.LOSS.MW']]

### UNIVARIATE GRAPH

In [6]:
univariant_plot = px.histogram(data['OUTAGE.DURATION'])
univariant_plot.update_layout(xaxis_title = 'OUTAGE.DURATION in Minutes', showlegend = False, title = 'Count of Duration of Outage')

In [7]:
# univariant_plot.write_html('../uni-plot.html', include_plotlyjs='cdn')

### Bivariate

In [47]:
bivariant = data.plot(kind = 'bar', x = 'U.S._STATE', y = 'OUTAGE.DURATION')

In [48]:
bivariant = data.groupby('U.S._STATE')['OUTAGE.DURATION'].sum().sort_values().reset_index().plot(kind = 'bar', x ='U.S._STATE' , y= 'OUTAGE.DURATION', title = 'Total Outage Duration per State')
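
A total per state mixes how often outages happen with how long each one lasts. The sketch below uses made-up data with the hypothetical column names `state` and `duration` to show how the total, the mean, and the count can rank states differently.

```python
import pandas as pd

# Toy data (hypothetical): state A has many short outages, state B one long outage
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B"],
    "duration": [10.0, 20.0, 30.0, 50.0],
})

# Named aggregation: one output column per summary statistic
summary = toy.groupby("state")["duration"].agg(total="sum", mean="mean", count="count")
print(summary)
```

Here A has the larger total (60 vs. 50) while B has the larger mean (50 vs. 20), so the choice of aggregate changes which state looks worse.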

In [49]:
bivariant.update_xaxes(dtick=1)

In [50]:
os.makedirs('../static', exist_ok=True)  # create the output directory if it does not exist yet
bivariant.write_html('../static/bi-plot.html', include_plotlyjs='cdn')

### Interesting Aggregates

In [12]:
data.groupby('U.S._STATE')['YEAR'].count().sort_values()

U.S._STATE
Alaska                    1
South Dakota              2
North Dakota              2
Montana                   3
Mississippi               4
West Virginia             4
Nebraska                  4
Hawaii                    5
Alabama                   6
Wyoming                   6
Nevada                    7
South Carolina            8
New Mexico                8
Iowa                      8
Vermont                   9
Idaho                     9
Kansas                    9
District of Columbia     10
Kentucky                 13
New Hampshire            14
Minnesota                15
Colorado                 15
Missouri                 17
Georgia                  17
Massachusetts            18
Connecticut              18
Maine                    19
Wisconsin                20
Oklahoma                 24
Arkansas                 25
Oregon                   26
Arizona                  28
Tennessee                34
New Jersey               35
Virginia                 37
Louisiana

### Assessment of Missingness

## NMAR

**CAUSE.CATEGORY.DETAIL**:
The NaN values in CAUSE.CATEGORY.DETAIL could be NMAR because the type of event may affect whether the detail could be recorded at all. For instance, heavy rain and hail could occur at the same time and obscure the real cause of the damage. Because of this uncertainty, the detail may not have been recorded for causes that coincided. The missingness would then depend on what kind of event occurred, that is, on the missing value itself, which makes it NMAR.

To make this column MAR, we could collect precise weather observations for each outage. If many different weather conditions were recorded for the same outage, that could indicate that the data collectors couldn't determine which one caused it, so the missingness would be explained by an observed column.

## MAR

### US STATE WITH CUSTOMERS
**NULL HYPOTHESIS**: There is no significant difference between 

In [13]:
#Flag the rows that have a recorded CUSTOMERS.AFFECTED value
distr_nan = data[['U.S._STATE', 'CUSTOMERS.AFFECTED']].assign(NotNA = data['CUSTOMERS.AFFECTED'].notna())

#Count of non-missing values for each state
dist_notNA = distr_nan[distr_nan['NotNA']][['U.S._STATE','NotNA']]
dist_notNA = dist_notNA.groupby('U.S._STATE').count()

#Count of missing values for each state
dist_NA = distr_nan[~distr_nan['NotNA']][['U.S._STATE','NotNA']].rename(columns = {'NotNA' : 'ISNA'})
dist_NA = dist_NA.groupby('U.S._STATE').count()

#Inner merge: only states with both missing and non-missing values are kept
plot_df = dist_NA.merge(dist_notNA, left_index=True, right_index=True)
#plot
plot_dist_na =  px.bar(plot_df.reset_index(), x='U.S._STATE', y=['ISNA', 'NotNA'], title='Count of NaN and Non-NaN Values for Each State')
plot_dist_na.update_layout(barmode='group', xaxis_tickangle=-45)
plot_dist_na

In [14]:
#FIND TVD
def find_tvd(df, depends_on, col_analyze):
    df = df.copy()
    df_needed = df.loc[:,[depends_on, col_analyze]]
    #flag missing values, then fill them so that 'count' includes those rows
    df_needed['ISNA'] = df_needed.loc[:,col_analyze].isna()
    df_needed[col_analyze] = df_needed.loc[:,col_analyze].fillna(0)
    #counts of missing (True) and non-missing (False) rows for each category
    find_prop = df_needed.pivot_table(index = depends_on, columns = 'ISNA', aggfunc = 'count', fill_value=0).loc[:,col_analyze]
    #turn each column into a distribution over the categories
    find_prop[False] = find_prop[False] / find_prop[False].sum()
    find_prop[True] = find_prop[True] / find_prop[True].sum()
    #TVD is half the sum of the absolute differences between the two distributions
    return find_prop.diff(axis=1)[True].abs().sum() / 2
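
The total variation distance (TVD) between two categorical distributions is half the sum of their absolute differences. As a small worked example, the sketch below computes it with `pd.crosstab` on made-up data. The function name `tvd_of_missingness` and the columns `state` and `customers` are hypothetical, and the sketch assumes the value column contains both missing and non-missing entries.

```python
import pandas as pd

def tvd_of_missingness(df, group_col, value_col):
    """TVD between the group distributions of missing and present rows."""
    status = df[value_col].isna().map({True: "missing", False: "present"})
    # Each column sums to 1: distribution of groups among missing / present rows
    dist = pd.crosstab(df[group_col], status, normalize="columns")
    return (dist["missing"] - dist["present"]).abs().sum() / 2

# Missingness spread evenly over both states
same = pd.DataFrame({"state": ["A", "A", "B", "B"],
                     "customers": [1.0, None, None, 5.0]})
# Missingness concentrated entirely in state A
split = pd.DataFrame({"state": ["A", "A", "B", "B"],
                      "customers": [None, None, 3.0, 5.0]})

print(tvd_of_missingness(same, "state", "customers"))   # 0.0
print(tvd_of_missingness(split, "state", "customers"))  # 1.0
```

A TVD of 0 means the missing and present rows are spread across the states in the same way, and a TVD of 1 means they never share a state.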

In [15]:
#MAR PERMUTATION TEST
def mar_permutation(df, depends_on, col_analyze, N = 1000):
    observed = find_tvd(df, depends_on, col_analyze)
    arr = []
    for _ in range(N):
        #shuffle the analyzed column to break any link with depends_on
        shuffled = df.assign(**{col_analyze: np.random.permutation(df[col_analyze])})
        arr.append(find_tvd(shuffled, depends_on, col_analyze))
    arr = np.array(arr)
    plot = px.histogram(arr)
    plot.add_vline(x=observed, line_color= "green", annotation_text="obs")
    plot.update_layout(xaxis_title = "TVD of Missing and Non-Missing Values", yaxis_title = "Frequency", title_text = "Simulated TVDs Under the Null",showlegend=False)
    #p-value: share of simulated TVDs at least as large as the observed one
    return plot, (arr >= observed).mean()
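
The same idea can be sketched without plotting and with a seeded generator so that the result is reproducible. This is an illustrative variant, not the notebook's implementation: the function name `missingness_p_value`, the `seed` parameter, and the toy columns `state` and `customers` are all assumptions, and it shuffles the missingness flags rather than the values (the two are equivalent for this statistic).

```python
import numpy as np
import pandas as pd

def missingness_p_value(df, group_col, value_col, n=500, seed=0):
    """Permutation p-value for 'missingness of value_col depends on group_col'."""
    rng = np.random.default_rng(seed)
    groups = df[group_col].to_numpy()
    missing = df[value_col].isna().to_numpy()

    def tvd(flags):
        # Distribution of groups among missing rows vs. among present rows
        miss = pd.Series(groups[flags]).value_counts(normalize=True)
        pres = pd.Series(groups[~flags]).value_counts(normalize=True)
        return miss.sub(pres, fill_value=0).abs().sum() / 2

    observed = tvd(missing)
    # Shuffling the flags breaks any link between missingness and group
    simulated = np.array([tvd(rng.permutation(missing)) for _ in range(n)])
    return (simulated >= observed).mean()

# Toy data (hypothetical): every missing value falls in state A
toy = pd.DataFrame({
    "state": ["A"] * 40 + ["B"] * 40,
    "customers": [None] * 40 + [1.0] * 40,
})
p = missingness_p_value(toy, "state", "customers")
print(p)
```

With missingness concentrated in one state, almost no shuffle reproduces the observed TVD, so the p-value is close to 0.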


In [16]:
graph, p_value = mar_permutation(data, 'CLIMATE.CATEGORY', 'CUSTOMERS.AFFECTED')
# graph.write_html('../03-topic/mar-hist.html', include_plotlyjs='cdn')
print(p_value)
graph

0.374


In [17]:
graph, p_value = mar_permutation(data, 'U.S._STATE', 'CUSTOMERS.AFFECTED')
print(p_value)
graph

0.0


### Hypothesis Testing

1. Out of NERC regions / **states**, which are most likely to have the worst outages?

In [18]:
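def calculate_outage_severity(data, state, var):
    #mean of var for the rows flagged True in the boolean column named by state
    #selecting var first keeps .mean() away from the non-numeric state-name column
    return data.groupby(state)[var].mean().loc[True]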

In [19]:
#create a permutation test of the column in mind for each state
def perm_test(data, state, var, n=1000):
    data = data.copy()[["U.S._STATE", var]]
    data[state] = data["U.S._STATE"] == state
    #find observation
    obs = calculate_outage_severity(data, state, var)
    test_stats = []
    for _ in range(n):
        #loop through and add shuffled to test stats
        value = np.random.permutation(data[var])
        shuffled = data.assign(**{var : value })
        trial = calculate_outage_severity(shuffled, state, var)
        test_stats.append(trial)
    #find p_value
    return  (np.array(test_stats) >= obs).mean()
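
Because the p-value counts simulated means that are at least as large as the observed one, this is a one-sided test: a small p-value suggests an unusually high mean for the state, while a p-value near 1 suggests an unusually low one. The sketch below shows the mechanics on made-up numbers. The function name `one_sided_mean_test` and the `seed` parameter are hypothetical and not part of the notebook.

```python
import numpy as np

def one_sided_mean_test(values, in_group, n=1000, seed=0):
    """Share of shuffles whose group mean is at least the observed group mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    in_group = np.asarray(in_group, dtype=bool)
    observed = np.nanmean(values[in_group])
    # Shuffle the values while keeping the group labels fixed
    simulated = np.array([
        np.nanmean(rng.permutation(values)[in_group]) for _ in range(n)
    ])
    return (simulated >= observed).mean()

# Toy data (hypothetical): the first three rows belong to the state under test
values = [100, 110, 120, 1, 2, 3, 4, 5, 6, 7]
in_group = [True, True, True] + [False] * 7
p = one_sided_mean_test(values, in_group)
print(p)
```

Only shuffles that place the three largest values in the group can match the observed mean, so the p-value here is small.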

As an example, **this** tests whether the mean outage duration in Texas is higher than we would expect by chance.

In [20]:
stat = perm_test(data, "Texas", 'OUTAGE.DURATION')
stat

0.431

In [38]:
all_p_values_duration = {}
all_p_values_customers = {}
for i in data['U.S._STATE'].unique():
    #loop through all the states 
    all_p_values_duration[i] = perm_test(data, i, "OUTAGE.DURATION")
    all_p_values_customers[i] = perm_test(data, i , "CUSTOMERS.AFFECTED")

Next, we create a DataFrame that pairs each state's p-values with its number of outages, to see both how often outages happen in a state and how severe they are.

In [39]:
#use the dictionaries to create a dataframe and keep the states with a p_value under 0.05 for either measure
initial = pd.DataFrame({'duration_p_value' : pd.Series(all_p_values_duration),
              'customer_p_value' : pd.Series(all_p_values_customers)})
resulting_df = data.groupby('U.S._STATE')[['YEAR']].count().merge(initial, left_index = True, right_index = True).rename(columns = {'YEAR': 'Count of Outages'})
graph_df = resulting_df[(resulting_df['duration_p_value'] < 0.05) | (resulting_df['customer_p_value'] < 0.05 )].sort_values('customer_p_value')

In [40]:
graph_df = graph_df[graph_df['Count of Outages'] > 3]

In [41]:
graph_df

| | Count of Outages | duration_p_value | customer_p_value |
|---|---|---|---|
| California | 210 | 1.0 | 0.006 |
| Florida | 45 | 0.069 | 0.007 |
| Texas | 127 | 0.382 | 0.007 |
| New York | 71 | 0.002 | 0.112 |
| Michigan | 95 | 0.0 | 0.368 |
| Wisconsin | 20 | 0.008 | 0.99 |


In [42]:
fig = px.bar(graph_df.reset_index().rename(columns = {'index': 'U.S._STATES'}), x='U.S._STATES', y=['duration_p_value', 'customer_p_value'], title='Grouped Bar Chart of Duration P-Value and Customer P-Value by U.S. States with a P-Value less than 0.05')
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig

In [25]:
# fig.write_html('../p_value_bar.html', include_plotlyjs='cdn')