# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [2]:
# import numpy and pandas
import numpy as np
import pandas as pd
import scipy.stats as stats

# mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Challenge 1 - Exploring the Data

In this challenge, we will examine all salaries of employees of the City of Chicago. We will start by loading the dataset and examining its contents. Please, load the data using Ironhack's database (db: employees, table: employees_advanced).

In [3]:
# Your code here:
employees = pd.read_csv("/content/drive/MyDrive/[01] Data Analytics - IronHack/[06] Courses/Week 5/Day 21 - Monday/[LAB 7] - Hypothesis Testing/Employees_advanced.csv")

Examine the `employees` dataset using the `head` function below.

In [4]:
# Your code here:
employees.head()

| | Name | Job Titles | Department | Full or Part-Time | Salary or Hourly | Typical Hours | Annual Salary | Hourly Rate |
|---|---|---|---|---|---|---|---|---|
| 0 | AARON, JEFFERY M | SERGEANT | POLICE | F | Salary | NaN | 101442.0 | NaN |
| 1 | AARON, KARINA | POLICE OFFICER (ASSIGNED AS DETECTIVE) | POLICE | F | Salary | NaN | 94122.0 | NaN |
| 2 | AARON, KIMBERLEI R | CHIEF CONTRACT EXPEDITER | GENERAL SERVICES | F | Salary | NaN | 101592.0 | NaN |
| 3 | ABAD JR, VICENTE M | CIVIL ENGINEER IV | WATER MGMNT | F | Salary | NaN | 110064.0 | NaN |
| 4 | ABASCAL, REECE E | TRAFFIC CONTROL AIDE-HOURLY | OEMC | P | Hourly | 20.0 | NaN | 19.86 |


In [5]:
employees.shape

(33183, 8)

The output of `head` shows quite a bit of missing data. Let's examine how many values are missing in each column. Produce this output in the cell below.

In [6]:
# Your code here:
employees.isnull().sum().to_frame()

| | 0 |
|---|---|
| Name | 0 |
| Job Titles | 0 |
| Department | 0 |
| Full or Part-Time | 0 |
| Salary or Hourly | 0 |
| Typical Hours | 25161 |
| Annual Salary | 8022 |
| Hourly Rate | 25161 |


Let's also look at the count of hourly vs. salaried employees. Write the code in the cell below

In [7]:
# Your code here:
employees["Salary or Hourly"].value_counts().to_frame()

| | Salary or Hourly |
|---|---|
| Salary | 25161 |
| Hourly | 8022 |


What this information indicates is that the table contains information about two types of employees - salaried and hourly. Some columns apply only to one type of employee while other columns only apply to another kind. This is why there are so many missing values. Therefore, we will not do anything to handle the missing values.
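One way to check this reading is to count missing values separately for each employee type. The sketch below uses a small, made-up stand-in for the table (the `toy` DataFrame and its values are hypothetical); the same expression could be run on `employees`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the employees table (values made up for illustration)
toy = pd.DataFrame({
    "Salary or Hourly": ["Salary", "Salary", "Hourly", "Salary", "Hourly"],
    "Typical Hours": [np.nan, np.nan, 20.0, np.nan, 40.0],
    "Annual Salary": [101442.0, 94122.0, np.nan, 110064.0, np.nan],
    "Hourly Rate": [np.nan, np.nan, 19.86, np.nan, 35.60],
})

# Count missing values per column, split by employee type
missing_by_type = (
    toy.drop(columns="Salary or Hourly")
       .isnull()
       .groupby(toy["Salary or Hourly"])
       .sum()
)
print(missing_by_type)
```

If the missing values are purely structural, hourly rows should be missing only `Annual Salary`, and salaried rows should be missing only `Typical Hours` and `Hourly Rate`.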

There are different departments in the city. List all departments and the count of employees in each department.

In [33]:
# Your code here:
employees.groupby(["Department"], as_index=False)[["Name"]].count().sort_values(by="Name", ascending=False)

| | Department | Name |
|---|---|---|
| 27 | POLICE | 13414 |
| 17 | FIRE | 4641 |
| 31 | STREETS & SAN | 2198 |
| 26 | OEMC | 2102 |
| 34 | WATER MGMNT | 1879 |
| 2 | AVIATION | 1629 |
| 32 | TRANSPORTN | 1140 |
| 30 | PUBLIC LIBRARY | 1015 |
| 18 | GENERAL SERVICES | 980 |
| 15 | FAMILY & SUPPORT | 615 |


# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the mean hourly wage of all hourly workers is significantly different from \$30/hr. Import the correct one-sample test function from scipy and perform a two-sided hypothesis test at the 95% confidence level (alpha = 0.05).

**[MNEMONICS] :**

> If the p-value is lower than the significance level (alpha), you reject the null hypothesis. In other words, when the p-value is smaller than the chosen alpha, it suggests that the observed data is less likely to have occurred by chance alone, assuming the null hypothesis is true. This provides evidence against the null hypothesis and in favor of the alternative hypothesis.

---

**Null hypothesis (H₀):**

The average hourly wage of all hourly workers is equal to $30/hr. In other words, there is no significant difference between the average hourly wage and \$30/hr.

**Alternative hypothesis (H₁):**

The average hourly wage of all hourly workers is not equal to $30/hr. This means that there is a significant difference between the average hourly wage and \$30/hr, and the average hourly wage is either significantly higher or significantly lower than \$30/hr.

In [9]:
# Your code here:
from scipy.stats import ttest_1samp

# filter only hourly workers
hourly_workers = employees[employees["Salary or Hourly"] == "Hourly"]

# extract "Hourly Rate" column for hourly workers
hourly_rates = hourly_workers["Hourly Rate"]

# one-sample t-test
t_statistic, p_value = ttest_1samp(hourly_rates, 30)

# two-sided 95% confidence interval
alpha = 0.05  # Significance level

# can null hypothesis be rejected
if p_value < alpha:
    print(f"The hourly wage is significantly different from $30/hr (p-value: {p_value:.4f})")
    print("We reject the null hypothesis (H₀).")
else:
    print(f"The hourly wage is not significantly different from $30/hr (p-value: {p_value:.4f})")
    print("We fail to reject the null hypothesis (H₀).")

The hourly wage is significantly different from $30/hr (p-value: 0.0000)
We reject the null hypothesis (H₀).


The one-sample t-test indicates that the mean hourly wage of hourly workers in the `employees` DataFrame is significantly different from \$30/hr. The p-value prints as 0.0000 because it is rounded to four decimal places; it is smaller than 0.00005 rather than exactly zero, and it is well below the significance level of 0.05 (alpha).

We therefore reject the null hypothesis that the mean hourly wage equals \$30/hr in favor of the two-sided alternative. The test by itself says only that the mean differs from \$30/hr; the sign of `t_statistic` (positive when the sample mean is above \$30, negative when it is below) shows the direction of the difference.
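The t-statistic and two-sided p-value returned by `ttest_1samp` can be reproduced by hand, which makes it easier to see what the test measures. This is a minimal sketch on made-up hourly rates (the `sample` array is hypothetical and not taken from the dataset).

```python
import numpy as np
from scipy import stats

# Made-up hourly rates, for illustration only
sample = np.array([19.86, 35.60, 36.21, 28.48, 40.20, 33.10, 37.45, 31.00])
mu0 = 30.0  # hypothesized population mean

n = sample.size
# Standard error of the mean, using the sample standard deviation (ddof=1)
se = sample.std(ddof=1) / np.sqrt(n)
# t-statistic: distance between sample mean and mu0, in standard errors
t_manual = (sample.mean() - mu0) / se
# Two-sided p-value: probability of a |t| at least this large under H0
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

A positive t-statistic means the sample mean lies above the hypothesized value, and a negative one means it lies below.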

We are also curious about salaries in the police force. The chief of police in Chicago claimed in a press briefing that salaries this year are higher than last year's mean of \$86,000/year for all salaried employees. Test this one-sided hypothesis at the 95% confidence level.

Hint: A one tailed test has a p-value that is half of the two tailed p-value. If our hypothesis is greater than, then to reject, the test statistic must also be positive.
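The halving rule in the hint can be cross-checked against the `alternative` argument of `ttest_1samp`, which is available in SciPy 1.6 and later (an assumption about the installed version). The `salaries` array below is made up for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Made-up annual salaries, for illustration only
salaries = np.array([88000, 91500, 84200, 90100, 87300, 93800, 85900, 89400], dtype=float)
mu0 = 86000

# Two-sided test, then convert to a one-sided ("greater") p-value by hand
t_stat, p_two = ttest_1samp(salaries, mu0)
p_half = p_two / 2 if t_stat > 0 else 1 - p_two / 2

# Same one-sided test requested directly from SciPy
_, p_greater = ttest_1samp(salaries, mu0, alternative="greater")

print(f"t = {t_stat:.4f}, halved p = {p_half:.4f}, alternative='greater' p = {p_greater:.4f}")
```

Halving is only valid when the t-statistic points in the direction of the alternative; otherwise the one-sided p-value is `1 - p_two / 2`, which is why the notebook code also checks the sign of `t_statistic`.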

In [10]:
# Your code here:
from scipy.stats import ttest_1samp

# filter only salaried police employees
salaried_police = employees[(employees["Department"] == "POLICE") & (employees["Salary or Hourly"] == "Salary")]

# extract "Annual Salary" column for salaried police employees
annual_salaries = salaried_police["Annual Salary"]

# one-sample t-test
t_statistic, p_value_two_tailed = ttest_1samp(annual_salaries, 86000)

# one-tailed p-value
p_value_one_tailed = p_value_two_tailed / 2

# significance level and test the one-sided hypothesis
alpha = 0.05

if p_value_one_tailed < alpha and t_statistic > 0:
    print(f"The average salary is significantly greater than $86,000/year (p-value: {p_value_one_tailed:.4f})")
    print("We reject the null hypothesis (H₀).")
else:
    print(f"The average salary is not significantly greater than $86,000/year (p-value: {p_value_one_tailed:.4f})")
    print("We fail to reject the null hypothesis (H₀).")


The average salary is significantly greater than $86,000/year (p-value: 0.0010)
We reject the null hypothesis (H₀).


A one-tailed p-value of 0.0010 together with a positive t-statistic is strong evidence that the mean annual salary of the group analyzed (salaried police employees) is higher than \$86,000.

The p-value is the probability of observing a sample mean at least this far above \$86,000 if the null hypothesis (that the true mean is \$86,000/year) were true. Because 0.0010 is below the chosen significance level of 0.05, we reject the null hypothesis and conclude that the mean salary of salaried police employees is greater than \$86,000/year.

Using the `crosstab` function, find the department that has the most hourly workers. 

In [20]:
# Your code here:

# method 1
hourly = employees.loc[employees["Salary or Hourly"] == "Hourly"]
display(pd.crosstab(index=[hourly["Department"]], columns="count").sort_values(by="count", ascending=False).reset_index().head(1))

# method 2
department_hourly_workers = pd.crosstab(employees["Department"], employees["Salary or Hourly"])
# sort
sorted_departments = department_hourly_workers.sort_values("Hourly", ascending=False)
# department with the most hourly workers
department_with_most_hourly_workers = sorted_departments.index[0]
most_hourly_workers = sorted_departments.iloc[0]["Hourly"]
# results
print(f"The department with the most hourly workers is '{department_with_most_hourly_workers}' with {most_hourly_workers:,} hourly workers.")

| | Department | count |
|---|---|---|
| 0 | STREETS & SAN | 1862 |


The department with the most hourly workers is 'STREETS & SAN' with 1,862 hourly workers.


The workers from the department with the most hourly workers have complained that their hourly wage is less than $35/hour. Using a one sample t-test, test this one-sided hypothesis at the 95% confidence level.

In [23]:
import pandas as pd
from scipy.stats import ttest_1samp

# "department_with_most_hourly_workers" variable is defined from the previous code block

# filter only hourly workers from the department with the most hourly workers
hourly_workers = employees[(employees["Department"] == department_with_most_hourly_workers) & (employees["Salary or Hourly"] == "Hourly")]

# extract "Hourly Rate" column for these hourly workers
hourly_rates = hourly_workers["Hourly Rate"]

# one-sample t-test
t_statistic, p_value_two_tailed = ttest_1samp(hourly_rates, 35)

# one-tailed p-value
p_value_one_tailed = p_value_two_tailed / 2

# significance level and test the one-sided hypothesis
alpha = 0.05

if p_value_one_tailed < alpha and t_statistic < 0:
    print(f"The average hourly wage is significantly less than $35/hour (p-value: {p_value_one_tailed:.4f})")
    print("We reject the null hypothesis (H₀).")
else:
    print(f"The average hourly wage is not significantly less than $35/hour (p-value: {p_value_one_tailed:.4f})")
    print("We fail to reject the null hypothesis (H₀).")

The average hourly wage is significantly less than $35/hour (p-value: 0.0000)
We reject the null hypothesis (H₀).


# Challenge 3: To practice - Constructing Confidence Intervals

While a hypothesis test is a good way to gather empirical evidence for rejecting or failing to reject a hypothesis, another way to gather evidence is to construct a confidence interval. A confidence interval gives a range of plausible values for the true population mean. For a 95% confidence interval, the procedure is calibrated so that, over repeated sampling, about 95% of the intervals constructed this way contain the population mean.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).
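The repeated-sampling meaning of "95%" can be illustrated with a small simulation. This sketch assumes a normally distributed population with a known mean; the values of `true_mean`, `n`, and `trials` are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, n, trials = 30.0, 50, 2000

# Draw many independent samples from the same population
samples = rng.normal(loc=true_mean, scale=10.0, size=(trials, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Critical value for a two-sided 95% interval with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)

# Check, for each sample, whether its interval contains the true mean
covered = (means - t_crit * ses <= true_mean) & (true_mean <= means + t_crit * ses)
coverage = covered.mean()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should be close to 0.95, though it will vary slightly with the seed and the number of trials.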


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level, the number of rows minus 1 for the degrees of freedom, the sample mean for the location parameter, and the standard error for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.

In [26]:
# Your code here:
from scipy.stats import t, sem

# filter only hourly workers
hourly_workers = employees[employees["Salary or Hourly"] == "Hourly"]

# extract "Hourly Rate" column for these hourly workers
hourly_rates = hourly_workers["Hourly Rate"]

# sample mean and standard error of the mean
mean_hourly_wage = hourly_rates.mean()
standard_error = sem(hourly_rates)

# 95% confidence interval
confidence_level = 0.95
degrees_of_freedom = len(hourly_rates) - 1
confidence_interval = t.interval(confidence_level, degrees_of_freedom, mean_hourly_wage, standard_error)

print(f"The 95% confidence interval for the mean hourly wage of all hourly workers is {confidence_interval}")

The 95% confidence interval for the mean hourly wage of all hourly workers is (32.52345834488425, 33.05365708767623)
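The interval returned by `t.interval` is the sample mean plus or minus a critical t-value times the standard error. The sketch below checks this on made-up data (the `rates` array is hypothetical).

```python
import numpy as np
from scipy import stats

# Made-up hourly rates, for illustration only
rates = np.array([19.86, 35.60, 36.21, 28.48, 40.20, 33.10, 37.45, 31.00])

n = rates.size
mean = rates.mean()
se = stats.sem(rates)  # sample standard deviation (ddof=1) divided by sqrt(n)

# Critical value leaving 2.5% in each tail of the t distribution
t_crit = stats.t.ppf(0.975, df=n - 1)
manual = (mean - t_crit * se, mean + t_crit * se)

builtin = stats.t.interval(0.95, n - 1, loc=mean, scale=se)
print("manual: ", manual)
print("builtin:", builtin)
```

Because the standard error shrinks with the square root of the sample size, a sample as large as the full set of hourly workers gives a narrow interval.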


Now construct the 95% confidence interval for all salaried employees in the police department in the cell below.

In [28]:
# Your code here:

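# method 1
# note: this selects every POLICE row, not only salaried ones; rows without an
# "Annual Salary" remain in the frame as NaN
police = employees.loc[employees["Department"] == "POLICE"][["Annual Salary"]]
# mean (pandas skips NaN values)
sample_mean = police.mean()
# size (len() counts the NaN rows too, so n can be overstated)
sample_size = len(police)
# sample standard deviation (ddof=1)
sample_std_dev = np.std(police, ddof=1)
# standard error
std_error = sample_std_dev / np.sqrt(sample_size)
# 95% confidence interval using stats.t.interval()
conf_interval = stats.t.interval(0.95, df=sample_size - 1, loc=sample_mean, scale=std_error)
# an overstated n shrinks the standard error, which is why this interval comes out
# slightly narrower than Method 2's; Method 2 filters to salaried rows first

print(f"[METHOD 1] 95% confidence interval: {conf_interval}")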

# method 2
import pandas as pd
from scipy.stats import t, sem

# filter only salaried workers in the police department
salaried_police_workers = employees[(employees["Department"] == "POLICE") & (employees["Salary or Hourly"] == "Salary")]

# extract "Annual Salary" column for these salaried workers
annual_salaries = salaried_police_workers["Annual Salary"]

# sample mean and standard error of the mean
mean_annual_salary = annual_salaries.mean()
standard_error = sem(annual_salaries)

# 95% confidence interval
confidence_level = 0.95
degrees_of_freedom = len(annual_salaries) - 1
confidence_interval = t.interval(confidence_level, degrees_of_freedom, mean_annual_salary, standard_error)

print(f"[METHOD 2] The 95% confidence interval for the annual salary of all salaried employees in the police department is {confidence_interval}")

[METHOD 1] 95% confidence interval: (array([86177.17166932]), array([86795.65733694]))
[METHOD 2] The 95% confidence interval for the annual salary of all salaried employees in the police department is (86177.05631531784, 86795.77269094894)


# Bonus Challenge - Hypothesis Tests of Proportions

Another type of one-sample test is a hypothesis test of proportions. In this test, we examine whether the proportion of a group in our sample is significantly different from a given fraction.

You can read more about one sample proportion tests [here](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS6-CategoricalData/SAS6-CategoricalData2.html).

In the cell below, use the `proportions_ztest` function from `statsmodels` to perform a hypothesis test that will determine whether the proportion of hourly workers in the City of Chicago is significantly different from 25% at the 95% confidence level.

**Null hypothesis (H₀):**

The proportion of hourly workers in the City of Chicago is equal to 25% (p = 0.25)

**Alternative hypothesis (H₁):**

The proportion of hourly workers in the City of Chicago is not equal to 25% (p ≠ 0.25)

Here, H₀ assumes that the true proportion of hourly workers is 25%, while H₁ assumes that the true proportion is different from 25%. The hypothesis test aims to determine if there is enough evidence to reject H₀ in favor of H₁.

---

1. **`nobs`** stands for "number of observations." In this context, it refers to the total number of employees in the dataset. It is used in the proportions_ztest function to calculate the proportion of hourly workers in the entire dataset.

2. **`z_stat`** is the z-score or the test statistic in the hypothesis test of proportions. It measures how many standard errors the observed proportion lies from the hypothesized proportion (25% in this case). The larger the absolute value of the z-score, the less likely it is that a difference this large would arise from random sampling alone if the null hypothesis were true.

3. When we say "***the proportion of hourly workers in the City of Chicago is significantly different from 25% at the 95% confidence level***", it means that based on the data, we have enough evidence to conclude that the true proportion of hourly workers in the City of Chicago is not 25%. The 95% confidence level corresponds to a significance level of alpha = 0.05: if the null hypothesis were true (i.e., the true proportion is 25%) and we repeated the sampling many times, we would expect to reject it by mistake in only about 5% of the samples.
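The z-score described in point 2 can be computed by hand from the counts reported earlier in this notebook (8022 hourly workers out of 33183 rows). This sketch uses only the standard library; it assumes the standard error is based on the sample proportion, which to my understanding is also the default behavior of `proportions_ztest`.

```python
import math

count, nobs, p0 = 8022, 33183, 0.25  # hourly workers, total rows, hypothesized proportion

p_hat = count / nobs
# Standard error based on the sample proportion (assumed to match the statsmodels default)
se = math.sqrt(p_hat * (1 - p_hat) / nobs)
# z-score: distance between observed and hypothesized proportion, in standard errors
z = (p_hat - p0) / se
# Two-sided p-value under the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"p_hat = {p_hat:.4f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

The negative z-score reflects that the observed share of hourly workers (about 24.2%) is below the hypothesized 25%.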

In [30]:
# Your code here:
from statsmodels.stats.proportion import proportions_ztest

# filter only hourly workers
hourly_workers = employees[employees["Salary or Hourly"] == "Hourly"]

# number of hourly workers and the total number of employees
num_hourly_workers = len(hourly_workers)
total_employees = len(employees)

# hypothesized proportion (25%)
hypothesized_proportion = 0.25

# hypothesis test using proportions_ztest
count = num_hourly_workers
nobs = total_employees
value = hypothesized_proportion
z_stat, p_value = proportions_ztest(count, nobs, value)

# 95% confidence level
alpha = 0.05

# can null hypothesis be rejected
if p_value < alpha:
    print(f"The proportion of hourly workers is significantly different from 25% (p-value: {p_value:.4f})")
    print("We reject the null hypothesis (H₀).")
else:
    print(f"The proportion of hourly workers is not significantly different from 25% (p-value: {p_value:.4f})")
    print("We do not reject the null hypothesis (H₀).")

The proportion of hourly workers is significantly different from 25% (p-value: 0.0004)
We reject the null hypothesis (H₀).
