# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [None]:
# Import numpy and pandas
import pandas as pd
import numpy as np

# Statistical helpers used throughout the lab
from scipy.stats import trim_mean, mode, skew, gaussian_kde, pearsonr, spearmanr, beta
from scipy.stats import ttest_ind, norm, t
from scipy.stats import f_oneway
from scipy.stats import sem
from statsmodels.stats.weightstats import ztest

# Challenge 1 - Exploring the Data

In this challenge, we will examine the salaries of all employees of the City of Chicago. We will start by loading the dataset and examining its contents.

In [6]:
# Run this code:
salaries = pd.read_csv('../Current_Employee_Names__Salaries__and_Position_Titles.csv')

Examine the `salaries` dataset using the `head` function below.

In [7]:
# Your code here
# Look at the first rows
salaries.head()

# Optional: basic info and column names
salaries.info()
salaries.columns


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33183 entries, 0 to 33182
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               33183 non-null  object 
 1   Job Titles         33183 non-null  object 
 2   Department         33183 non-null  object 
 3   Full or Part-Time  33183 non-null  object 
 4   Salary or Hourly   33183 non-null  object 
 5   Typical Hours      8022 non-null   float64
 6   Annual Salary      25161 non-null  float64
 7   Hourly Rate        8022 non-null   float64
dtypes: float64(3), object(5)
memory usage: 2.0+ MB


Index(['Name', 'Job Titles', 'Department', 'Full or Part-Time',
       'Salary or Hourly', 'Typical Hours', 'Annual Salary', 'Hourly Rate'],
      dtype='object')

# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the mean hourly wage of all hourly workers is significantly different from $30/hr. Import the correct one-sample test function from SciPy and perform a two-sided hypothesis test at the 95% confidence level (significance level of 0.05).

In [8]:
from scipy.stats import ttest_1samp

# Filter only hourly workers
hourly = salaries[salaries["Salary or Hourly"] == "Hourly"].copy()

# Extract hourly wages
wages = hourly["Hourly Rate"].dropna()

# One-sample t-test: H0: mean(wages) = 30
t_stat, p_value = ttest_1samp(wages, popmean=30)

print("t-statistic:", t_stat)
print("p-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject H0 → the mean hourly wage is significantly different from $30.")
else:
    print("Fail to reject H0 → no evidence that the mean hourly wage differs from $30.")


t-statistic: 20.6198057854942
p-value: 4.3230240486229894e-92
Reject H0 → the mean hourly wage is significantly different from $30.
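To make the test statistic less of a black box, the sketch below recomputes a one-sample t-test by hand and compares it with `ttest_1samp`. The data are synthetic (drawn from a normal distribution with an arbitrary mean and spread), so the numbers are illustrative only and are not the Chicago wages; the names `sample`, `t_manual` and `p_manual` are placeholders introduced here.

```python
import numpy as np
from scipy.stats import ttest_1samp, t

# Synthetic "wages" for illustration only (NOT the Chicago dataset)
rng = np.random.default_rng(0)
sample = rng.normal(loc=32.0, scale=12.0, size=500)

popmean = 30                                   # hypothesised mean under H0
n = sample.size
se = sample.std(ddof=1) / np.sqrt(n)           # standard error of the mean
t_manual = (sample.mean() - popmean) / se      # t = (x_bar - mu0) / SE
p_manual = 2 * t.sf(abs(t_manual), df=n - 1)   # two-sided p-value

# SciPy's built-in test should agree with the manual calculation
t_scipy, p_scipy = ttest_1samp(sample, popmean=popmean)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The two lines should print matching values, which shows that the test is just the distance between the sample mean and the hypothesised mean, measured in standard errors.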


# Challenge 3 - Constructing Confidence Intervals

While testing a hypothesis is a good way to gather empirical evidence for or against it, another way to gather evidence is to construct a confidence interval. A confidence interval gives a range of plausible values for the true mean of the population. For a 95% confidence interval, the procedure used to build the interval captures the true population mean in about 95% of repeated samples.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).
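One way to see what the 95% level means is a small simulation: draw many samples from a population whose mean is known, build an interval from each, and count how often the interval contains that mean. The sketch below assumes a normal population with made-up parameters (`true_mean`, the spread, the sample size and the number of repeats are all arbitrary choices for illustration).

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)
true_mean, n, n_repeats = 30.0, 50, 2000       # hypothetical population and design

# Each row is one independent sample of size n
samples = rng.normal(loc=true_mean, scale=10.0, size=(n_repeats, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)  # standard error per sample
t_crit = t.ppf(0.975, df=n - 1)                 # two-sided 95% critical value

# Check, for every sample, whether its interval contains the true mean
covered = (means - t_crit * ses <= true_mean) & (true_mean <= means + t_crit * ses)
coverage = covered.mean()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95, which is the sense in which the interval is "95%".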


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the mean hourly wage, use 0.95 for the confidence level, the number of rows minus 1 for the degrees of freedom, the sample mean for the location parameter, and the standard error for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.

In [9]:
from scipy.stats import t, sem
import numpy as np

n = len(wages)
df = n - 1
mean_wage = wages.mean()
se_wage = sem(wages)   # standard error of the mean

confidence_level = 0.95
alpha = 1 - confidence_level

# Critical t value for a 95% two-sided CI
# (t.interval(confidence_level, df, loc=mean_wage, scale=se_wage) gives the same bounds)
t_crit = t.ppf(1 - alpha/2, df)

ci_low  = mean_wage - t_crit * se_wage
ci_high = mean_wage + t_crit * se_wage

print(f"Sample mean hourly wage: {mean_wage:.2f}")
print(f"95% confidence interval: ({ci_low:.2f}, {ci_high:.2f})")


Sample mean hourly wage: 32.79
95% confidence interval: (32.52, 33.05)
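The text above mentions `t.interval`, while the cell builds the bounds from the critical value. As a rough check that the two routes agree, the sketch below computes both on synthetic data (not the Chicago wages). The confidence level is passed positionally because the name of that first argument has changed between SciPy versions.

```python
import numpy as np
from scipy.stats import t, sem

# Synthetic sample for illustration only
rng = np.random.default_rng(1)
sample = rng.normal(loc=32.0, scale=12.0, size=500)

df = sample.size - 1
mean = sample.mean()
se = sem(sample)                     # standard error of the mean (ddof=1 by default)

# Route 1: let SciPy build the interval
low, high = t.interval(0.95, df, loc=mean, scale=se)

# Route 2: critical value times standard error
t_crit = t.ppf(0.975, df)
manual = (mean - t_crit * se, mean + t_crit * se)

print("t.interval:", (low, high))
print("manual:    ", manual)
```

Both lines should print the same pair of bounds, so either approach can be used in the lab.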


# Challenge 4 - Hypothesis Tests of Proportions

Another type of one-sample test is a hypothesis test of proportions. In this test, we examine whether the proportion of a group in our sample is significantly different from a given fraction.

You can read more about one sample proportion tests [here](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS6-CategoricalData/SAS6-CategoricalData2.html).

In the cell below, use the `proportions_ztest` function from `statsmodels` to perform a hypothesis test that will determine whether the proportion of hourly workers in the City of Chicago is significantly different from 25% at the 95% confidence level.

In [10]:
from statsmodels.stats.proportion import proportions_ztest

# Total employees and hourly employees
n_total = len(salaries)
n_hourly = len(hourly)

count = n_hourly   # number of hourly workers
nobs = n_total     # total employees
p0 = 0.25          # hypothesised proportion

z_stat, p_value = proportions_ztest(count=count, nobs=nobs, value=p0)

print("z-statistic:", z_stat)
print("p-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject H0 → the proportion of hourly workers is significantly different from 25%.")
else:
    print("Fail to reject H0 → no evidence that the proportion differs from 25%.")


z-statistic: -3.5099964213703005
p-value: 0.0004481127249057967
Reject H0 → the proportion of hourly workers is significantly different from 25%.
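For reference, the z-statistic can be reproduced by hand. The sketch below assumes the counts shown in the `info()` output (33183 rows in total, 8022 of them with an hourly rate, taken here as the number of hourly workers) and uses the sample proportion in the standard error, which is what `proportions_ztest` does by default as far as we know.

```python
import math

# Counts read off the info() output; assumes every hourly row has an Hourly Rate
n_hourly = 8022
n_total = 33183
p0 = 0.25                                       # hypothesised proportion

p_hat = n_hourly / n_total                      # observed proportion
se = math.sqrt(p_hat * (1 - p_hat) / n_total)   # standard error from the sample proportion
z = (p_hat - p0) / se
p_value = math.erfc(abs(z) / math.sqrt(2))      # two-sided tail of the standard normal

print(f"p_hat = {p_hat:.4f}, z = {z:.3f}, p-value = {p_value:.6f}")
```

The observed proportion is a little under 25%, which is why the statistic is negative.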
