In [1]:
import pandas as pd
df = pd.read_csv('./data/power_gen.csv')

**Q1** Use the dataset `./data/power_gen.csv`. This dataset contains information on the electrical power generated over time, from `09/01/2017` to `12/22/2022` ([dataset link](https://www.kaggle.com/datasets/arvindnagaonkar/power-generation-data?select=PowerGeneration.csv)). It is commonly believed that power production in the summer months (April-September) is higher than in the winter months (October-March). Now that actual data is available, can we run a hypothesis test to check this belief?

**Solution**:
- H0: The average electricity production in winter months and summer months is the same
- Ha: The average electricity production differs between winter months and summer months

In [2]:
df.head(2)

,Dates,Power Station,Monitored Cap.(MW),Total Cap. Under Maintenace (MW),Planned Maintanence (MW),Forced Maintanence(MW),Other Reasons (MW),Programme or Expected(MU),Actual(MU),Excess(+) / Shortfall (-),Deviation
0,2017-09-01,Delhi,2235.4,135.0,0.0,135.0,0,13,18,5.0,0.0
1,2017-09-01,Haryana,2720.0,2470.0,0.0,2470.0,0,28,7,-21.8,0.0


In [3]:
df['Dates'] = pd.to_datetime(df['Dates'])

In [4]:
df['Months'] = df['Dates'].dt.month

In [5]:
# Daily mean production across power stations, split by season
winters = df.query("Months in (10,11,12,1,2,3)").groupby('Dates')['Actual(MU)'].mean()
summers = df.query("Months in (4,5,6,7,8,9)").groupby('Dates')['Actual(MU)'].mean()

In [6]:
import scipy.stats as stats

In [7]:
# Two-sample t-test assuming equal variances (two-sided by default)
stats.ttest_ind(winters, summers, equal_var=True)

Ttest_indResult(statistic=-15.046418947767783, pvalue=2.3056142404452155e-48)

**Conclusion**

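Since the p-value (≈ 2.3e-48) is far below 0.05, we can reject H0. The negative t statistic indicates that mean production in winter months is lower than in summer months, which is consistent with the belief.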
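
The belief in Q1 is directional (summer production is higher), and the test above assumes equal variances. As a hedged alternative, the sketch below runs a one-sided Welch t-test with `equal_var=False` and `alternative='less'`. The arrays `winters_demo` and `summers_demo` are synthetic stand-ins with made-up values, not the notebook's data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the daily mean 'Actual(MU)' series
# (assumed values, not taken from power_gen.csv).
rng = np.random.default_rng(0)
winters_demo = rng.normal(loc=18.0, scale=1.5, size=300)
summers_demo = rng.normal(loc=19.5, scale=2.0, size=300)

# Welch's t-test (equal_var=False) does not assume equal variances.
# alternative='less' tests Ha: mean(winter) < mean(summer).
welch = stats.ttest_ind(winters_demo, summers_demo,
                        equal_var=False, alternative='less')
print(welch.statistic, welch.pvalue)
```

With the real data, `winters` and `summers` from the cells above would take the place of the synthetic arrays.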

**Q2** Use the dataset `./data/power_gen.csv`. This dataset contains information on the electrical power generated over time, from `09/01/2017` to `12/22/2022` ([dataset link](https://www.kaggle.com/datasets/arvindnagaonkar/power-generation-data?select=PowerGeneration.csv)). It is commonly believed that power production increases year on year and that the yearly differences are significant (consider data from 2018 to 2022). Now that actual data is available, can we run a hypothesis test to check this belief? You may want to read about post-hoc testing before you answer this question; it is explained in the pre-read videos.

**Solution**

- H0: Average daily power production is the same across 2018 to 2022
- Ha: Average daily power production differs for at least one year between 2018 and 2022

In [8]:
df['Years'] = df['Dates'].dt.year

In [9]:
from scipy.stats import f_oneway

In [10]:
# One array of daily mean production per year (2018-2022)
groups = []
for year in [2018, 2019, 2020, 2021, 2022]:
    groups.append(df.query(f"Years=={year}").groupby('Dates')['Actual(MU)'].mean().values)

In [11]:
f_oneway(*groups)

F_onewayResult(statistic=142.00049471260826, pvalue=4.467946695342016e-105)

**We can reject H0, but the ANOVA does not tell us which years differ. A post-hoc test can identify them.**

In [12]:
# Build one (Production, Year) frame per year; pairwise_tukeyhsd needs long-format data
dfs = []
for year in [2018, 2019, 2020, 2021, 2022]:
    vals = df.query(f"Years=={year}").groupby('Dates')['Actual(MU)'].mean().values
    years = [year] * len(vals)
    res = pd.DataFrame({'Production': vals, 'Year': years})
    dfs.append(res)

In [13]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [14]:
result = pd.concat(dfs,axis=0)

In [15]:
result

,Production,Year
0,17.545455,2018
1,18.022727,2018
2,18.278409,2018
3,18.443182,2018
4,18.568182,2018
...,...,...
348,19.631868,2022
349,20.461538,2022
350,20.631868,2022
351,20.686813,2022
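
The long-format table above can also be built without an explicit loop. The sketch below is one possible way, using a single `groupby` on year and date; the `demo` frame is a tiny synthetic example that only reuses the notebook's column names, and `long_format` is a hypothetical name.

```python
import pandas as pd

# Tiny synthetic frame with the notebook's column names (values are made up).
demo = pd.DataFrame({
    'Dates': pd.to_datetime(['2018-01-01', '2018-01-01', '2018-01-02',
                             '2019-01-01', '2019-01-01', '2019-01-02']),
    'Actual(MU)': [10, 20, 30, 40, 50, 60],
})
demo['Years'] = demo['Dates'].dt.year

# One row per (year, date) holding the mean production across power stations.
daily = (demo.groupby(['Years', 'Dates'])['Actual(MU)']
             .mean()
             .reset_index()
             .rename(columns={'Actual(MU)': 'Production', 'Years': 'Year'}))
long_format = daily[['Production', 'Year']]
print(long_format)
```

On the real data, a filter such as `df[df['Years'].between(2018, 2022)]` would presumably be applied first so that only the years under study are kept.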


In [16]:
tukey = pairwise_tukeyhsd(endog=result['Production'],
                          groups=result['Year'],
                          alpha=0.05)

In [17]:
print(tukey)

Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
  2018   2019    0.067  0.964 -0.2085  0.3425  False
  2018   2020  -0.3215 0.0226 -0.6136 -0.0293   True
  2018   2021   0.5777    0.0  0.3022  0.8532   True
  2018   2022    1.904   -0.0  1.6261  2.1818   True
  2019   2020  -0.3884 0.0027 -0.6803 -0.0966   True
  2019   2021   0.5107    0.0  0.2355  0.7858   True
  2019   2022    1.837   -0.0  1.5595  2.1144   True
  2020   2021   0.8991   -0.0  0.6073   1.191   True
  2020   2022   2.2254   -0.0  1.9314  2.5195   True
  2021   2022   1.3263   -0.0  1.0488  1.6038   True
----------------------------------------------------


**Conclusion**

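Every pair of years differs significantly except 2018 vs 2019 (p-adj = 0.964). Production has not increased every year, however: the mean difference from 2019 to 2020 is negative (-0.3884), while 2020 to 2021 and 2021 to 2022 show significant increases.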
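
Because the question is about year-on-year growth, only consecutive pairs and one direction matter. The sketch below is a different technique from the Tukey HSD used above: one-sided Welch t-tests on consecutive years with a Bonferroni-adjusted threshold. The `yearly` dictionary holds synthetic values chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic daily means per year (assumed values, not the notebook's data).
yearly = {
    2018: rng.normal(18.0, 1.0, 365),
    2019: rng.normal(18.0, 1.0, 365),
    2020: rng.normal(17.6, 1.0, 366),
    2021: rng.normal(18.6, 1.0, 365),
}

years = sorted(yearly)
pairs = list(zip(years[:-1], years[1:]))   # (2018, 2019), (2019, 2020), ...
alpha = 0.05 / len(pairs)                  # Bonferroni-adjusted threshold

pvalues = {}
for prev, curr in pairs:
    # Ha: mean(curr) > mean(prev), i.e. production increased year on year.
    res = stats.ttest_ind(yearly[curr], yearly[prev],
                          equal_var=False, alternative='greater')
    pvalues[(prev, curr)] = res.pvalue
    print(prev, curr, res.pvalue < alpha)
```

Bonferroni is more conservative than Tukey HSD, so this is a sketch of the idea rather than a replacement for the post-hoc test in the notebook.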

**Q3** Use the dataset `./data/power_gen.csv`. This dataset contains information on the electrical power generated over time, from `09/01/2017` to `12/22/2022` ([dataset link](https://www.kaggle.com/datasets/arvindnagaonkar/power-generation-data?select=PowerGeneration.csv)). It is commonly believed that the power shortfall is larger in summer months than in winter months. Now that actual data is available, can we run a hypothesis test to check this belief? You may want to read about post-hoc testing before you answer this question; it is explained in the pre-read videos.

**Solution**

- H0: The shortfall in winter and summer months is the same
- Ha: The shortfall in winter and summer months is different

In [18]:
# Flag rows with a shortfall (negative value in the excess/shortfall column)
df['Shortfall_Present'] = df['Excess(+) / Shortfall (-)'].map(lambda x: "Yes" if x < 0 else "No")

In [20]:
winters = df.query("Months in (10,11,12,1,2,3) and Shortfall_Present=='Yes'").groupby('Dates')['Actual(MU)'].mean()
summers = df.query("Months in (4,5,6,7,8,9) and Shortfall_Present=='Yes'").groupby('Dates')['Actual(MU)'].mean()

In [21]:
# Two-sample t-test assuming equal variances (two-sided by default)
stats.ttest_ind(winters, summers, equal_var=True)

Ttest_indResult(statistic=-13.305222939776751, pvalue=1.200162884952542e-38)

**Conclusion**

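Since the p-value (≈ 1.2e-38) is far below 0.05, we can reject H0. Note that this test compares the daily mean `Actual(MU)` on rows where a shortfall occurred, not the size of the shortfall itself.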
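
One way to test the shortfall more directly is to compare its magnitude between seasons. The sketch below assumes that negative values of `Excess(+) / Shortfall (-)` measure the shortfall and uses a one-sided Welch t-test. The `demo` frame is synthetic, and the `Shortfall` column is a hypothetical helper that does not exist in the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
dates = pd.date_range('2021-01-01', '2021-12-31', freq='D')

# Synthetic data: two stations per day; summer rows get a more negative value.
demo = pd.DataFrame({'Dates': dates.repeat(2)})
demo['Months'] = demo['Dates'].dt.month
is_summer = demo['Months'].between(4, 9)
demo['Excess(+) / Shortfall (-)'] = rng.normal(
    loc=np.where(is_summer, -3.0, -1.0), scale=2.0)

# Shortfall magnitude: positive when production fell short, 0 otherwise.
demo['Shortfall'] = (-demo['Excess(+) / Shortfall (-)']).clip(lower=0)

# Daily mean shortfall across stations, keeping the month for the split.
daily = demo.groupby('Dates').agg(Shortfall=('Shortfall', 'mean'),
                                  Months=('Months', 'first'))
summer_sf = daily.loc[daily['Months'].between(4, 9), 'Shortfall']
winter_sf = daily.loc[~daily['Months'].between(4, 9), 'Shortfall']

# Ha: mean summer shortfall > mean winter shortfall.
res = stats.ttest_ind(summer_sf, winter_sf,
                      equal_var=False, alternative='greater')
print(res.statistic, res.pvalue)
```

Whether rows without a shortfall should count as zero or be dropped is a modelling choice; this sketch counts them as zero.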

**Q4** Use the dataset `./data/attrition.csv`. The dataset comes from [here](https://www.kaggle.com/datasets/patelprashant/employee-attrition) and has data on employee attrition at a major Fortune 500 company. One thing we would like to understand from this dataset is the link between gender and attrition. Can you run an appropriate hypothesis test to answer this question?

**Solution**

- H0: Gender and attrition are independent
- Ha: Gender and attrition are related

In [22]:
df = pd.read_csv("./data/attrition.csv")

In [23]:
df.head(2)

,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7


In [24]:
pd.crosstab(df['Attrition'],df['Gender'])

Attrition \ Gender,Female,Male
No,501,732
Yes,87,150


In [25]:
contingency_table = pd.crosstab(df['Attrition'],df['Gender'])

In [26]:
from scipy.stats import chi2_contingency

In [27]:
stat, p, dof, expected = chi2_contingency(contingency_table)

In [28]:
p

0.29057244902890855

**Conclusion**:

The p-value (≈ 0.29) is above 0.05, so we cannot reject H0: the data show no significant association between gender and attrition.
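
To see why the test does not reject H0, it can help to compare the observed counts with the counts expected under independence. The sketch below re-enters the crosstab counts from above by hand; `attrition_rate` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the crosstab above:
# rows = Attrition (No, Yes), columns = Gender (Female, Male).
observed = np.array([[501, 732],
                     [87, 150]])

# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.
stat, p, dof, expected = chi2_contingency(observed)

# Attrition rate within each gender: Yes / (No + Yes).
attrition_rate = observed[1] / observed.sum(axis=0)

print(dof)
print(expected)
print(attrition_rate.round(3))
```

The expected counts are close to the observed ones, and the attrition rates for the two genders differ by only about two percentage points.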

**Q5** Use the dataset `./data/attrition.csv`. The dataset comes from [here](https://www.kaggle.com/datasets/patelprashant/employee-attrition) and has data on employee attrition at a major Fortune 500 company. One thing we would like to understand from this dataset is the link between marital status and attrition. Can you run an appropriate hypothesis test to answer this question?

**Solution**

- H0: Marital status and attrition are independent
- Ha: Marital status and attrition are related

In [29]:
pd.crosstab(df['Attrition'],df['MaritalStatus'])

Attrition \ MaritalStatus,Divorced,Married,Single
No,294,589,350
Yes,33,84,120


In [30]:
contingency_table = pd.crosstab(df['Attrition'],df['MaritalStatus'])

In [31]:
stat, p, dof, expected = chi2_contingency(contingency_table)

In [32]:
p

9.45551106034083e-11

**Conclusion**:

The p-value (≈ 9.5e-11) is far below 0.05, so we can reject H0: marital status and attrition are associated.
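
The chi-square test says that an association exists but not where it comes from or how strong it is. As a hedged follow-up, the sketch below computes the attrition rate per marital status and Cramér's V, an effect size that the original solution does not use. The counts are re-entered by hand from the crosstab above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the crosstab above:
# rows = Attrition (No, Yes), columns = MaritalStatus (Divorced, Married, Single).
observed = np.array([[294, 589, 350],
                     [33, 84, 120]])

stat, p, dof, expected = chi2_contingency(observed)

# Attrition rate within each marital status: Yes / (No + Yes).
attrition_rate = observed[1] / observed.sum(axis=0)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
n = observed.sum()
cramers_v = np.sqrt(stat / (n * (min(observed.shape) - 1)))

print(attrition_rate.round(3))
print(round(cramers_v, 3))
```

In these counts the single group has the highest attrition rate, which suggests it contributes most to the rejection of H0.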