In [1]:
import pandas as pd
df = pd.read_csv('./Data/salaries.csv')

In [2]:
df.head(2)

Unnamed: 0,index,Name,Base Pay,University or Office,Position
0,0,"Driscoll, Michael Allan",$275000,Indiana,University President
1,1,"Weisenstein, Greg R",$241935,West Chester,University President


**Q1.** A recent survey examined the state of compensation in academia. The detailed survey data is stored in the file `./Data/salaries.csv`. Before the survey was commissioned, the belief was that a Professor earns $101,000 on average. Examine the data in the file and decide whether this belief about compensation is credible.

**Solution**

- H0: The average salary of a professor is `$101,000`
- Ha: The average salary of a professor > `$101,000` (See the sample average below)

In [3]:
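# Convert strings such as "$275000" to floats
df['Base Pay'] = df['Base Pay'].map(lambda x: float(x.replace("$","")))
role = 'professor'
# Rows whose position title starts with the role (case-insensitive)
cond = df['Position'].str.lower().str.startswith(role)
df[(cond)]['Base Pay'].mean()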

101902.3792822186
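
The conversion in the cell above assumes every `Base Pay` value looks like `$275000`, with no thousands separators. A minimal sketch of a slightly more tolerant parser follows; the helper name `parse_pay` and the sample strings are hypothetical, and whether `salaries.csv` contains comma-separated amounts is not known from the rows shown.

```python
def parse_pay(text):
    """Convert a currency string such as "$275000" or "$241,935" to a float."""
    # Strip surrounding whitespace, the dollar sign, and thousands separators
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)

# Hypothetical inputs for illustration only
for raw in ["$275000", "$241,935", " $100000 "]:
    print(raw, "->", parse_pay(raw))
```

It could be applied the same way as the lambda, for example `df['Base Pay'].map(parse_pay)`.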

In [4]:
import scipy.stats
import math

In [5]:
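def p_value(pop_mean,sample_mean,sample_stdev,sample_size):
    """One-sided z-test p-value; the tail is chosen on the side of the sample mean."""
    # Standard error of the sample mean
    SE = sample_stdev/math.sqrt(sample_size)
    if pop_mean>sample_mean:
        # Sample mean below pop_mean: lower-tail probability under H0
        pval = scipy.stats.norm(pop_mean,SE).cdf(sample_mean)
    else:
        # Sample mean at or above pop_mean: upper-tail probability under H0
        pval = 1-scipy.stats.norm(pop_mean,SE).cdf(sample_mean)
    return pval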

In [6]:
pop_mean = 101000
sample_mean = df[(cond)]['Base Pay'].mean()
sample_stdev = df[(cond)]['Base Pay'].std()
sample_size = df[(cond)]['Base Pay'].shape[0]

In [7]:
p_value(pop_mean,sample_mean,sample_stdev,sample_size)

0.0001497070410003687

**Conclusion**
The p-value (about 0.00015) is very small, so we reject H0 and conclude that the average professor salary is more than $101,000.
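
The `p_value` function evaluates the normal CDF of the sampling distribution directly. The same upper-tail calculation can also be viewed through the z statistic. The sketch below is an illustration only: it uses the standard library's `statistics.NormalDist` instead of `scipy.stats`, the function name `upper_tail_p_value` is hypothetical, and the summary statistics are made up rather than taken from `salaries.csv`.

```python
import math
from statistics import NormalDist

def upper_tail_p_value(pop_mean, sample_mean, sample_stdev, sample_size):
    """Return (z, p) for a one-sided, upper-tail z-test of H0: mean == pop_mean."""
    standard_error = sample_stdev / math.sqrt(sample_size)
    # Number of standard errors the sample mean lies above the hypothesized mean
    z = (sample_mean - pop_mean) / standard_error
    # P(Z >= z) under the standard normal distribution
    p = 1 - NormalDist().cdf(z)
    return z, p

# Hypothetical summary statistics, not taken from the survey data
z, p = upper_tail_p_value(pop_mean=100.0, sample_mean=102.0,
                          sample_stdev=10.0, sample_size=100)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.00, p = 0.0228
```

For a sample mean above `pop_mean`, this should agree with `p_value` up to floating-point error, since standardizing the sample mean does not change the tail probability.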

**Q2.** A recent survey examined the state of compensation in academia. The detailed survey data is stored in the file `./Data/salaries.csv`. Before the survey was commissioned, the belief was that a Dean earns $101,000 on average. Examine the data in the file and decide whether this belief about compensation is credible.

**Solution**

- H0: The average salary of a dean is `$101,000`
- Ha: The average salary of a dean > `$101,000` (See the sample average below)

In [8]:
def check_if_dean(x):
    # True if "dean" appears as a separate space-delimited word in the title
    return "dean" in x.lower().split(" ")

In [11]:
cond = df['Position'].map(check_if_dean)
df[cond]['Base Pay'].mean()

101830.15
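
`check_if_dean` matches only tokens that equal `dean` exactly after splitting on spaces, so a title in which the word is followed by punctuation would be skipped. The sketch below contrasts that behavior with a hypothetical regex-based variant, `check_if_dean_regex`. The titles are invented for illustration; whether `salaries.csv` contains such titles is not known, and switching matchers could change the sample and therefore the result above.

```python
import re

def check_if_dean_split(title):
    # Same logic as check_if_dean: exact match on space-separated tokens
    return "dean" in title.lower().split(" ")

def check_if_dean_regex(title):
    # Hypothetical alternative: "dean" as a whole word, regardless of punctuation
    return re.search(r"\bdean\b", title, flags=re.IGNORECASE) is not None

# Invented position titles, not taken from the survey data
titles = ["Dean", "Associate Dean", "Dean, College of Arts", "Deanery Clerk"]
for title in titles:
    print(f"{title!r}: split={check_if_dean_split(title)}, "
          f"regex={check_if_dean_regex(title)}")
```

The two functions differ only on `"Dean, College of Arts"`, where the split-based check sees the token `dean,` and returns `False`.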

In [12]:
pop_mean = 101000
sample_mean = df[(cond)]['Base Pay'].mean()
sample_stdev = df[(cond)]['Base Pay'].std()
sample_size = df[(cond)]['Base Pay'].shape[0]

In [13]:
p_value(pop_mean,sample_mean,sample_stdev,sample_size)

0.40021995241851205

**Conclusion**
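The p-value (about 0.40) is large, so we cannot reject H0. The data do not provide evidence that the average dean salary is more than $101,000.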

**Q3.** Use the dataset `./Data/power_gen.csv`. This dataset contains information on the electrical power generated from `09/01/2017` to `12/22/2022` [dataset link](https://www.kaggle.com/datasets/arvindnagaonkar/power-generation-data?select=PowerGeneration.csv). The belief has been that the average daily power production is 19.1 MU. Now that actual data is available, can we run a hypothesis test to check this belief?

In [14]:
df2 = pd.read_csv('./Data/power_gen.csv')

In [15]:
df2.head(2)

Unnamed: 0,Dates,Power Station,Monitored Cap.(MW),Total Cap. Under Maintenace (MW),Planned Maintanence (MW),Forced Maintanence(MW),Other Reasons (MW),Programme or Expected(MU),Actual(MU),Excess(+) / Shortfall (-),Deviation
0,2017-09-01,Delhi,2235.4,135.0,0.0,135.0,0,13,18,5.0,0.0
1,2017-09-01,Haryana,2720.0,2470.0,0.0,2470.0,0,28,7,-21.8,0.0


**Solution**

- H0: The average daily power production is 19.1 MU
- Ha: The average daily power production is more than 19.1 MU


In [17]:
df2.groupby('Dates')['Actual(MU)'].mean().mean()

19.223266310725037

In [24]:
pop_mean = 19.1
# Mean of Actual(MU) across power stations for each day
daily_mean = df2.groupby('Dates')['Actual(MU)'].mean()
sample_mean = daily_mean.mean()
sample_stdev = daily_mean.std()
sample_size = daily_mean.shape[0]

In [25]:
p_value(pop_mean,sample_mean,sample_stdev,sample_size)

0.0005362100127317415

**Conclusion**

The p-value (about 0.0005) is very small, so we reject H0 and conclude that the average daily power production is more than 19.1 MU.
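
In this solution the unit of observation is a day: `groupby('Dates')['Actual(MU)'].mean()` reduces each day to the mean of `Actual(MU)` across power stations (not the total), and the test is then run on those daily values. The sketch below shows the same steps on a tiny invented frame; the station names and numbers are hypothetical and only mimic the column layout of `power_gen.csv`.

```python
import pandas as pd

# Invented miniature of the dataset: two stations on each of three days
toy = pd.DataFrame({
    "Dates": ["2017-09-01", "2017-09-01", "2017-09-02",
              "2017-09-02", "2017-09-03", "2017-09-03"],
    "Power Station": ["A", "B", "A", "B", "A", "B"],
    "Actual(MU)": [18, 7, 20, 10, 22, 12],
})

# One value per day: the mean of Actual(MU) across stations on that day
daily_mean = toy.groupby("Dates")["Actual(MU)"].mean()

sample_mean = daily_mean.mean()    # mean of the daily means
sample_stdev = daily_mean.std()    # sample standard deviation (ddof=1)
sample_size = daily_mean.shape[0]  # number of days, not number of rows

print(daily_mean.tolist())  # [12.5, 15.0, 17.0]
print(sample_size)          # 3
```

Note that `sample_size` counts days rather than station-level rows, which is what the standard error in `p_value` is based on.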

**Q4.** Use the dataset `./Data/power_gen.csv`. This dataset contains information on the electrical power generated from `09/01/2017` to `12/22/2022` [dataset link](https://www.kaggle.com/datasets/arvindnagaonkar/power-generation-data?select=PowerGeneration.csv). The belief has been that the average daily power production is 19.5 MU in the summer months (April-September). Now that actual data is available, can we run a hypothesis test to check this belief?

**Solution**

- H0: The average daily power production in the summer months is 19.5 MU
- Ha: The average daily power production in the summer months is more than 19.5 MU


In [27]:
df2['Dates'] = pd.to_datetime(df2['Dates'])
df2['Months'] = df2['Dates'].dt.month

In [28]:
df2.query("Months in (4,5,6,7,8,9)").groupby('Dates')['Actual(MU)'].mean().mean()

19.78686720234407

In [29]:
pop_mean = 19.5
# Daily mean of Actual(MU) across power stations, April through September only
summer_daily_mean = df2.query("Months in (4,5,6,7,8,9)").groupby('Dates')['Actual(MU)'].mean()
sample_mean = summer_daily_mean.mean()
sample_stdev = summer_daily_mean.std()
sample_size = summer_daily_mean.shape[0]

In [30]:
p_value(pop_mean,sample_mean,sample_stdev,sample_size)

9.340143111158383e-07

**Conclusion**

The p-value (about 9.3e-07) is very small, so we reject H0 and conclude that the average daily power production in the summer months is more than 19.5 MU.
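
The seasonal filter relies on a `Months` column derived from the parsed dates. As an alternative to the `query` string, the same split can be expressed with a boolean mask. This is a minimal sketch on invented data; the frame `toy` and the mask name `is_summer` are hypothetical.

```python
import pandas as pd

# Invented rows, one per day, only to illustrate the month-based split
toy = pd.DataFrame({
    "Dates": pd.to_datetime(["2018-01-15", "2018-04-10",
                             "2018-07-20", "2018-11-05"]),
    "Actual(MU)": [17.0, 20.0, 21.0, 18.0],
})
toy["Months"] = toy["Dates"].dt.month

# April through September; every other month falls in the complementary season
is_summer = toy["Months"].isin(list(range(4, 10)))
summer = toy[is_summer]
other = toy[~is_summer]

print(summer["Months"].tolist())  # [4, 7]
print(other["Months"].tolist())   # [1, 11]
```

Using one mask and its negation guarantees that the two seasons partition the data, with no month counted twice or left out.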

**Q5.** Use the dataset `./Data/power_gen.csv`. This dataset contains information on the electrical power generated from `09/01/2017` to `12/22/2022` [dataset link](https://www.kaggle.com/datasets/arvindnagaonkar/power-generation-data?select=PowerGeneration.csv). The belief has been that the average daily power production is 18.5 MU in the winter months (October-March). Now that actual data is available, can we run a hypothesis test to check this belief?

**Solution**

- H0: The average daily power production in the winter months is 18.5 MU
- Ha: The average daily power production in the winter months is more than 18.5 MU


In [31]:
df2.query("Months in (10,11,12,1,2,3)").groupby('Dates')['Actual(MU)'].mean().mean()

18.714524112450068

In [32]:
pop_mean = 18.5
# Daily mean of Actual(MU) across power stations, October through March only
winter_daily_mean = df2.query("Months in (10,11,12,1,2,3)").groupby('Dates')['Actual(MU)'].mean()
sample_mean = winter_daily_mean.mean()
sample_stdev = winter_daily_mean.std()
sample_size = winter_daily_mean.shape[0]

In [33]:
p_value(pop_mean,sample_mean,sample_stdev,sample_size)

5.597539975976673e-08

**Conclusion**

The p-value (about 5.6e-08) is very small, so we reject H0 and conclude that the average daily power production in the winter months is more than 18.5 MU.
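
Each solution picks the direction of Ha after inspecting the sample mean, and `p_value` selects its tail the same way. When the direction is not fixed before looking at the data, a two-sided test is a common alternative. The sketch below is not part of the original solutions: the function name `two_sided_p_value` is hypothetical, it uses the standard library's `statistics.NormalDist`, and the summary statistics are made up.

```python
import math
from statistics import NormalDist

def two_sided_p_value(pop_mean, sample_mean, sample_stdev, sample_size):
    """Two-sided z-test p-value for H0: mean == pop_mean."""
    standard_error = sample_stdev / math.sqrt(sample_size)
    z = (sample_mean - pop_mean) / standard_error
    # Probability of a deviation at least this large in either direction
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical summary statistics, not taken from either dataset
p = two_sided_p_value(pop_mean=100.0, sample_mean=102.0,
                      sample_stdev=10.0, sample_size=100)
print(f"two-sided p = {p:.4f}")  # two-sided p = 0.0455
```

By the symmetry of the normal distribution, the two-sided value is twice the one-sided value for the same summary statistics.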