### Tests relating to the distance to the nearest station:

In [None]:
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np

# Allow importing from the parent directory by temporarily adding it to sys.path
# Hacky, but there is no simpler way to do this in Jupyter
import sys
sys.path.append("..")
from common import get_dataframe_from_pipeline
outages = get_dataframe_from_pipeline("../pipeline/3.csv.gz")
# Remove the parent directory from sys.path again after the import
sys.path.pop()

### Checking if the distances are normally distributed:

In [None]:
stats.normaltest(outages['outageToSubstationDistance']).pvalue
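To make the p-value above easier to read, here is a minimal sketch on synthetic data (the samples, the 0.05 threshold, and the helper name `looks_normal` are assumptions for illustration, not part of the pipeline). The null hypothesis of `stats.normaltest` is that the sample comes from a normal distribution, so a small p-value is evidence *against* normality.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins (assumed data, not the real outages)
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=10.0, scale=2.0, size=2000)
skewed_sample = rng.exponential(scale=5.0, size=2000)

p_normal = stats.normaltest(normal_sample).pvalue
p_skewed = stats.normaltest(skewed_sample).pvalue


def looks_normal(p_value, alpha=0.05):
    """Hypothetical helper: True when we fail to reject normality at level alpha."""
    return p_value >= alpha


print("normal sample p-value:", p_normal, "looks normal:", looks_normal(p_normal))
print("skewed sample p-value:", p_skewed, "looks normal:", looks_normal(p_skewed))
```

Failing to reject normality is not proof that the data are normal; it only means this test found no strong evidence of a departure.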

### First Idea
1. Split the distances into **equal**-width **bins** and count how many outages fall into each distance interval. For example, the bin (0, 3] holds the number of outages whose distance to the nearest station is between 0 and 3.
2. **groupby** these intervals and aggregate by counting the outages in each one. This gives the number of outages per distance interval.
3. Run a statistical test such as a **T-test** to see whether there is a significant difference between the first half of the bins (closer) and the second half (farther).

In [None]:
#this cuts the data into 3 equal width bins.
data = pd.Series([2,19,1,20, 13, 19, 24, 30])
bins = pd.cut(data, bins=3)
print(bins)

In [None]:
n = 1000 #number of bins
distance_bins = pd.cut(outages['outageToSubstationDistance'], bins=n)
outages['distance_bin'] = distance_bins
outages_per_dist = outages.groupby(['distance_bin']).size().reset_index(name="# of outages")
outages_per_dist

### T-Test:
Doing a T-test comparing the outage counts in the first half of the bins to those in the second half.
<p> Checking for equal variances: the Levene test p-value is very small, so we reject the assumption of equal variances. That is why we pass "equal_var=False" (Welch's t-test).
<p> The alternative hypothesis of this one-sided test is that "the mean of the distribution underlying the first sample is greater than the mean of the distribution underlying the second sample", where the first sample is the farther bins. The p-value is large, so we fail to reject the null hypothesis: we cannot conclude that farther bins contain more outages than closer ones.

In [None]:
# Index of the middle bin: bins below it count as "closer", the rest as "farther"
median_bin = n//2

closer_outages = outages_per_dist[outages_per_dist['distance_bin'].cat.codes < median_bin]['# of outages'].reset_index(drop=True)
farther_outages = outages_per_dist[outages_per_dist['distance_bin'].cat.codes >= median_bin]['# of outages'].reset_index(drop=True)
closer_outages = closer_outages.to_frame()
farther_outages = farther_outages.to_frame()

t_stat, p_value = stats.ttest_ind(farther_outages['# of outages'], closer_outages['# of outages'], equal_var=False, alternative='greater')
print("Levene Test p-value:", stats.levene(farther_outages['# of outages'], closer_outages['# of outages']).pvalue)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
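As a sanity check on how to read a one-sided p-value, the sketch below runs the same Welch test on made-up counts where the closer bins clearly hold more outages (the Poisson rates and sample sizes are assumptions chosen only for illustration). A p-value near 1 with `alternative='greater'` does not "reject" the alternative; it usually means the effect, if any, points in the opposite direction.

```python
import numpy as np
from scipy import stats

# Made-up per-bin outage counts: closer bins are deliberately much busier
rng = np.random.default_rng(1)
closer = rng.poisson(lam=50, size=200)
farther = rng.poisson(lam=5, size=200)

# H1: mean(farther) > mean(closer) -> expect a p-value close to 1
res = stats.ttest_ind(farther, closer, equal_var=False, alternative='greater')

# H1: mean(closer) > mean(farther) -> expect a p-value close to 0
res_rev = stats.ttest_ind(closer, farther, equal_var=False, alternative='greater')

print("farther > closer p-value:", res.pvalue)
print("closer > farther p-value:", res_rev.pvalue)
```

For the t-test the two one-sided p-values sum to 1, so swapping the sample order is a quick way to see which direction the data favour.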

### Trying a Mann-Whitney U-test: 
The result agrees with the t-test: the p-value is large, so we again fail to reject the null hypothesis.
The alternative hypothesis here is: "the distribution underlying x is stochastically greater than the distribution underlying y, i.e. SX(u) > SY(u) for all u", with x being the farther bins. We therefore have no evidence that farther bins have more outages than closer ones.

In [None]:
print(stats.mannwhitneyu(farther_outages['# of outages'], closer_outages['# of outages'], alternative='greater').pvalue)

### Checking for correlations: **distance vs timeout**
Checking if there is any correlation between the distance of the outage to the station and the time it took for the outage to be resolved.

In [None]:
outages['timeOut'] = outages['dateOn'] - outages['dateOff']
outages['timeOut'] = outages['timeOut'].apply(lambda x: x.total_seconds()/3600)
outages['timeOut']
# timeOut is the outage duration in hours (total_seconds() / 3600)

In [None]:
stats.normaltest(outages['timeOut']).pvalue

In [None]:
fit = stats.linregress(outages['timeOut'], outages['outageToSubstationDistance'])
plt.xticks(rotation = 25)
plt.plot(outages["timeOut"], outages["outageToSubstationDistance"], 'b.', alpha = 0.5)
plt.plot(outages["timeOut"], outages["timeOut"]*fit.slope + fit.intercept, 'r-', linewidth = 3)
plt.title('Timeout vs Distance')
plt.ylabel('Distance (km)')
plt.xlabel('TimeOut (hour)')
plt.show()
# This plot doesn't look good because the distance CSV that I created was not very good.
# It would be nice to try it on our actual big dataset and the corresponding distances CSV.

In [None]:
outages["timeOut"].corr(outages["outageToSubstationDistance"])
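The value above is a Pearson correlation, which only measures *linear* association. Since neither variable looked normal, a rank-based (Spearman) correlation may be worth comparing. This is a sketch on synthetic data, not a result from the outages dataset; the variables `x` and `y` are invented to show the difference.

```python
import numpy as np
from scipy import stats

# Synthetic example: y is a monotone but strongly non-linear function of x
rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=1000)
y = x ** 3

pearson_r, _ = stats.pearsonr(x, y)   # linear association
rho, _ = stats.spearmanr(x, y)        # monotone (rank) association

print("Pearson r:", pearson_r)
print("Spearman rho:", rho)
```

On the real data the equivalent call would be `outages["timeOut"].corr(outages["outageToSubstationDistance"], method="spearman")`.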

### Trying transformations:

In [None]:
outages["timeOut"].apply(np.sqrt).corr(outages["outageToSubstationDistance"].apply(np.sqrt))

### Log transformation:


In [None]:
outages["timeOut"].apply(np.log).corr(outages["outageToSubstationDistance"].apply(np.log))
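One caveat with the log transform: `np.log(0)` is `-inf` and the log of a negative number is `NaN`, either of which would make the correlation `NaN`. If the real data contain such rows (an assumption, we have not checked), a sketch like the following filters them out first. The toy frame below is made up and only reuses the column names.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with a zero in each column
df = pd.DataFrame({
    "timeOut": [0.0, 0.5, 2.0, 8.0, 24.0],
    "outageToSubstationDistance": [1.2, 0.0, 3.5, 7.1, 15.0],
})

# Keep only rows where both values are strictly positive, so the log is finite
positive = (df["timeOut"] > 0) & (df["outageToSubstationDistance"] > 0)
subset = df[positive]

r = np.log(subset["timeOut"]).corr(np.log(subset["outageToSubstationDistance"]))
print("rows kept:", len(subset), "log-log correlation:", r)
```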

In [None]:
timeouts_transformed = outages["timeOut"].apply(np.log)
distance_transformed = outages["outageToSubstationDistance"].apply(np.log)
fit = stats.linregress(timeouts_transformed, distance_transformed)
plt.xticks(rotation = 25)
plt.plot(timeouts_transformed, distance_transformed, 'b.', alpha = 0.5)
plt.plot(timeouts_transformed, timeouts_transformed*fit.slope + fit.intercept, 'r-', linewidth = 3)
plt.title('log(Timeout) vs log(Distance)')
plt.ylabel('log(Distance (km))')
plt.xlabel('log(TimeOut (hour))')
plt.show()

Using the log of the distances gives better bins as well!
But are the bins even meaningful?

In [None]:
n = 10 #number of bins
distance_bins = pd.cut(outages['outageToSubstationDistance'].apply(np.log), bins=n)
outages['distance_bin'] = distance_bins
# Count outages per log-distance bin; use the same column name as in the earlier cell
outages_per_dist = outages.groupby(['distance_bin']).size().to_frame(name="# of outages")
outages_per_dist
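One rough way to probe whether equal-width bins are meaningful is to look at how unevenly the outages are spread across them. The sketch below uses synthetic log-normal "distances" (an assumption; the real distribution may differ) and compares per-bin counts with and without the log transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed distances (assumed shape, not the real data)
rng = np.random.default_rng(2)
distances = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=5000))

n = 10  # number of equal-width bins
raw_counts = pd.cut(distances, bins=n).value_counts(sort=False)
log_counts = pd.cut(np.log(distances), bins=n).value_counts(sort=False)

# A single bin holding most of the data suggests the binning is not very informative
print("largest raw bin share:", raw_counts.max() / raw_counts.sum())
print("largest log bin share:", log_counts.max() / log_counts.sum())
```

If one bin dominates on the real data too, quantile-based bins (`pd.qcut`) could be an alternative to consider, though that would change what the per-bin counts mean.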