# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [1]:
# import numpy and pandas
import pandas as pd
import numpy as np
from scipy.stats import trim_mean, mode, skew, gaussian_kde, pearsonr, spearmanr, beta
from statsmodels.stats.weightstats import ztest as ztest

from scipy.stats import ttest_ind, norm, t
from scipy.stats import f_oneway
from scipy.stats import sem

# Challenge 1 - Exploring the Data

In this challenge, we will examine the salaries of all employees of the City of Chicago. We will start by loading the dataset and examining its contents.

In [2]:
# Run this code:
salaries = pd.read_csv('../data/Current_Employee_Names__Salaries__and_Position_Titles.csv')

Examine the `salaries` dataset using the `head` function below.

In [3]:
# Your code here
salaries.head()

Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
0,"AARON, JEFFERY M",SERGEANT,POLICE,F,Salary,,101442.0,
1,"AARON, KARINA",POLICE OFFICER (ASSIGNED AS DETECTIVE),POLICE,F,Salary,,94122.0,
2,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,F,Salary,,101592.0,
3,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,F,Salary,,110064.0,
4,"ABASCAL, REECE E",TRAFFIC CONTROL AIDE-HOURLY,OEMC,P,Hourly,20.0,,19.86


In [None]:
#Checking the data types, number of rows and columns
salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33183 entries, 0 to 33182
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               33183 non-null  object 
 1   Job Titles         33183 non-null  object 
 2   Department         33183 non-null  object 
 3   Full or Part-Time  33183 non-null  object 
 4   Salary or Hourly   33183 non-null  object 
 5   Typical Hours      8022 non-null   float64
 6   Annual Salary      25161 non-null  float64
 7   Hourly Rate        8022 non-null   float64
dtypes: float64(3), object(5)
memory usage: 2.0+ MB


In [None]:
#Summary of numerical values
salaries.describe()

Unnamed: 0,Typical Hours,Annual Salary,Hourly Rate
count,8022.0,25161.0,8022.0
mean,34.507604,86786.99979,32.788558
std,9.252077,21041.354602,12.112573
min,10.0,7200.0,2.65
25%,20.0,76266.0,21.2
50%,40.0,90024.0,35.6
75%,40.0,96060.0,40.2
max,40.0,300000.0,109.0


# Challenge 2
This is a placeholder so that the AI corrector can find the correct exercise for feedback.

# Challenge 3 - Constructing Confidence Intervals

We will test whether the hourly wage of all hourly workers is significantly different from $30/hr.

In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. Is $30/hr within that interval?

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level, the number of rows minus 1 for the degrees of freedom, the sample mean for the location parameter, and the standard error for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.
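
Written as a formula, the recipe above is the usual t-based interval for a mean. The symbols are introduced here for illustration: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, and $n$ the number of rows.

$$
\bar{x} \;\pm\; t_{0.975,\,n-1} \cdot \frac{s}{\sqrt{n}}
$$

Here $s/\sqrt{n}$ is the standard error (what `sem` returns with its default `ddof=1`), and $t_{0.975,\,n-1}$ is the 97.5th percentile of Student's t distribution with $n-1$ degrees of freedom.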

In [6]:
# Your code here

# Keep only the hourly workers
hourly_workers = salaries[salaries['Salary or Hourly'] == 'Hourly']

# Mean and standard error of the hourly rate
mean_wage = hourly_workers['Hourly Rate'].mean()
std_error = sem(hourly_workers['Hourly Rate'])

# Degrees of freedom (number of observations minus 1)
n = len(hourly_workers)
df = n - 1

# 95% confidence interval
confidence_interval = t.interval(0.95, df, loc=mean_wage, scale=std_error)

print("95% confidence interval for hourly wage:", confidence_interval)

95% confidence interval for hourly wage: (32.52345834488425, 33.05365708767623)


In [None]:
#Since $30/hr is NOT inside the confidence interval, this suggests that the average hourly wage is significantly different from $30/hr.
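
The link between a confidence interval and a significance test can be checked directly. The sketch below is a minimal illustration on synthetic data (the `rates` array and its parameters are assumptions, not the Chicago dataset). It runs SciPy's `ttest_1samp` against a hypothesized mean of 30 and compares the test decision at the 5% level with whether 30 falls outside the 95% interval.

```python
import numpy as np
from scipy.stats import sem, t, ttest_1samp

# Synthetic hourly rates: an assumption for illustration, not the Chicago data
rng = np.random.default_rng(0)
rates = rng.normal(loc=33.0, scale=12.0, size=500)

hypothesized_mean = 30.0

# 95% confidence interval for the mean, built with t.interval and sem
low, high = t.interval(0.95, len(rates) - 1, loc=rates.mean(), scale=sem(rates))

# Two-sided one-sample t-test of H0: mean == 30
t_stat, p_value = ttest_1samp(rates, popmean=hypothesized_mean)

outside_interval = not (low <= hypothesized_mean <= high)
rejected = bool(p_value < 0.05)

print(f"95% CI: ({low:.2f}, {high:.2f})")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("30 outside the interval:", outside_interval)
print("H0 rejected at the 5% level:", rejected)
```

The two decisions agree because both compare the same t statistic against the same critical value.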

This is fine when we have data on thousands of workers. But what if we only have data on 100 workers?

Sample 100 workers and reconstruct the 95% confidence interval. Is the interval wider or narrower, and why?
Does the interval contain the $30/hr mark in this case?

In [9]:
# Your code here

# Sample 100 workers with a fixed seed for reproducibility
sample_hourly_workers = hourly_workers.sample(100, random_state=42)

# random_state=42 ensures the same random selection each time we run the code.
# Without it, the sample would be different every time.

# Calculating mean and standard error for the new sample
sample_mean = sample_hourly_workers['Hourly Rate'].mean()
sample_std_error = sem(sample_hourly_workers['Hourly Rate'])

# Degrees of freedom for the new sample (number of observations minus 1)
n = len(sample_hourly_workers)
df_sample = n - 1

# New confidence interval
sample_conf_interval = t.interval(0.95, df_sample, loc=sample_mean, scale=sample_std_error)

print("95% confidence interval for 100 sampled workers:", sample_conf_interval)

95% confidence interval for 100 sampled workers: (30.82756066952274, 36.08303933047728)


In [None]:
# Compared to the full dataset's confidence interval (32.52, 33.05), this interval (30.83, 36.08) is much wider.

# Why is the interval wider?
# The standard error shrinks with the square root of the sample size, so 100 workers give a far less
# precise estimate of the true mean than the roughly 8,000 hourly workers in the full dataset.

# Does $30/hr fall within this interval?
# No. $30/hr lies just below the lower bound of 30.83, so even this smaller sample suggests that the mean
# hourly wage differs from $30/hr, although by a much narrower margin than the full dataset does.
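
The effect of sample size on interval width can be seen in a small simulation. This is a sketch on synthetic data (the `population` array, its parameters, and the helper `ci_width` are assumptions introduced for illustration, not part of the exercise). It compares the width of the 95% interval for 8,000 values with the width for a random subsample of 100.

```python
import numpy as np
from scipy.stats import sem, t


def ci_width(values, confidence=0.95):
    """Return the width of the t-based confidence interval for the mean."""
    low, high = t.interval(confidence, len(values) - 1,
                           loc=np.mean(values), scale=sem(values))
    return high - low


# Synthetic "population" of hourly rates: an assumption, not the Chicago data
rng = np.random.default_rng(42)
population = rng.normal(loc=33.0, scale=12.0, size=8000)

# Random subsample of 100 values drawn without replacement
small_sample = rng.choice(population, size=100, replace=False)

width_full = ci_width(population)
width_small = ci_width(small_sample)
ratio = width_small / width_full

print(f"width with n=8000: {width_full:.2f}")
print(f"width with n=100:  {width_small:.2f}")
print(f"ratio: {ratio:.1f} (sqrt(8000/100) is about {np.sqrt(80):.1f})")
```

The ratio of the two widths should land near the square root of the ratio of the sample sizes, because the standard error scales as one over the square root of n.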