# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [None]:
!conda install -y -c conda-forge scipy



In [None]:
import pandas as pd
import numpy as np
import scipy

# Challenge 1 - Exploring the Data

In this challenge, we will examine the salaries of all employees of the City of San Francisco. We will start by loading the dataset and examining its contents.

In [None]:
data = pd.read_csv('Salaries.csv')


Examine the salaries dataset (loaded as `data`) using the `head` function below.

In [None]:
data.head()

In [None]:
print(data.shape)

In [None]:
data.columns

In [None]:
# The output of `head` shows that there is quite a bit of missing data. Get the number of missing values in every column.

In [None]:
missing_data_count = data.isnull().sum()
display(missing_data_count)


In [None]:
missing_percentage = (data.isnull().sum() / len(data) * 100).round(2)
print(missing_percentage)



Get the shape of the dataframe

In [None]:
print(data.shape)

Given the output of the previous two cells, drop the corresponding column and compute the number of missing values again.

In [None]:
# Dropping columns with a high percentage of missing values
data_dropped = data.drop(['Notes', 'Status'], axis=1)


In [None]:
# Computing the amount of missing values in the modified dataframe
missing_data_after_drop = data_dropped.isnull().sum()

print("Missing data after dropping columns:")
print(missing_data_after_drop)

Check the possible values of the column "Status".

In [None]:
unique_status_values = data['Status'].unique()
print("Unique values in the 'Status' column:")
print(unique_status_values)


In [None]:
# PT stands for Part Time and FT for Full Time

Drop any row with missing values in the "Status" column and compute again the number of missing values.

In [None]:
# Dropping rows with missing values in the 'Status' column
data_dropped_status = data.dropna(subset=['Status'])

# Computing the amount of missing values in the modified dataframe
missing_data_after_drop_status = data_dropped_status.isnull().sum()

print("Missing data after dropping rows with missing 'Status':")
print(missing_data_after_drop_status)


In [None]:
# Your code here

Check out the types of each column and see if they make sense.

In [None]:
data.info()


In [None]:
# Several numerical variables are stored with dtype object, so we need to convert them to numeric values.

In [None]:
# Your code here

Do any type conversions and reset the index.

In [None]:
import numpy as np

In [None]:
# Convert 'Year' column to datetime type
data['Year'] = pd.to_datetime(data['Year'], format='%Y', errors='coerce')

# Convert 'Benefits' column to numeric, replacing non-convertible values with NaN
data['Benefits'] = pd.to_numeric(data['Benefits'], errors='coerce')

# Reset the index
data = data.reset_index(drop=True)


In [None]:
data.info()

In [None]:
# Converting relevant columns to numeric, replacing non-numeric values with NaN
numeric_columns = ['BasePay', 'OvertimePay', 'OtherPay', 'Benefits', 'TotalPayBenefits']
data[numeric_columns] = data[numeric_columns].apply(pd.to_numeric, errors='coerce')


Check out if "TotalPayBenefits" = "BasePay" + "OvertimePay" + "OtherPay" + "Benefits"

In [None]:
import numpy as np

# Sum the four pay components row by row (missing components count as 0)
data['CalculatedTotal'] = data[['BasePay', 'OvertimePay', 'OtherPay', 'Benefits']].sum(axis=1, skipna=True)

# Check if "TotalPayBenefits" is close to the calculated sum
data['CheckSum'] = np.isclose(data['TotalPayBenefits'], data['CalculatedTotal'])

# Select rows where the sum doesn't match "TotalPayBenefits"
# (a missing "TotalPayBenefits" also counts as a mismatch)
mismatched_rows = data[~data['CheckSum']]

# Display the result
print("Rows where 'TotalPayBenefits' does not match the calculated sum:")
print(mismatched_rows[['TotalPayBenefits', 'BasePay', 'OvertimePay', 'OtherPay', 'Benefits']])


What is the percentage of employees for which the previous assumption is not true?

In [None]:
#percentage of employees with a mismatch in the assumption
percentage_mismatch = (mismatched_rows.shape[0] / data.shape[0]) * 100

print(f"Percentage of employees for whom the assumption is not true: {percentage_mismatch:.2f}%")


There are different departments in the city. List all departments and the count of employees in each department.

In [None]:
# Extracting department information from JobTitle column using regular expression
data['Department'] = data['JobTitle'].str.extract(r'\((.*?)\)')

department_counts = data['Department'].value_counts()
print("Department-wise employee counts:")
print(department_counts)


# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the hourly wage of **all FT workers is significantly different from $75/hr**. First get the hourly wage by dividing "TotalPayBenefits" by 50 weeks (assuming 10 working days of holidays) and by 40 hours (assuming a 40-hour week).

$$\text{Hourly Wage} = \frac{\text{TotalPayBenefits}}{1\ \text{year}} \cdot \frac{1\ \text{year}}{50\ \text{weeks}} \cdot \frac{1\ \text{week}}{40\ \text{hr}}$$

Import the correct one-sample test function from scipy and perform a two-sided hypothesis test at the 95% confidence level.
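As a quick sanity check of the formula and of the test statistic, here is a minimal sketch on made-up numbers. The five annual totals are hypothetical and are not taken from `Salaries.csv`; the point is only that the manually computed t statistic should match the one returned by `ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical annual totals (pay plus benefits) for five employees
total_pay_benefits = np.array([120_000.0, 150_000.0, 180_000.0, 140_000.0, 170_000.0])

# 50 working weeks per year times 40 hours per week = 2000 hours per year
hourly_wage = total_pay_benefits / (50 * 40)

# Manual t statistic: (sample mean - hypothesized mean) / standard error
mu0 = 75
n = len(hourly_wage)
standard_error = hourly_wage.std(ddof=1) / np.sqrt(n)
t_manual = (hourly_wage.mean() - mu0) / standard_error

# SciPy's one-sample t-test should return the same statistic
result = stats.ttest_1samp(hourly_wage, mu0)

print("Hourly wages:", hourly_wage)
print("Manual t:", t_manual, "SciPy t:", result.statistic, "p-value:", result.pvalue)
```

Note the use of `ddof=1`: the data are treated as a sample of a larger population, so the sample standard deviation is the appropriate estimate.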

In [None]:
# Your code here: (compute the "Hourly_Wage")

In [None]:
from scipy.stats import ttest_1samp

# Hourly wage: annual 'TotalPayBenefits' divided by 50 weeks * 40 hours per week
data['Hourly_Wage'] = data['TotalPayBenefits'] / (50 * 40)

# Specify the hypothesized mean hourly wage ($75/hr)
hypothesized_mean = 75

# Perform one-sample t-test
t_stat, p_value = ttest_1samp(data['Hourly_Wage'].dropna(), hypothesized_mean)

# Display the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Checking if the null hypothesis is rejected at a 95% confidence level
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The hourly wage is significantly different from $75/hr.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in the hourly wage.")


In [None]:
# Your code here: (Compute the mean hourly wage for all the "FT" employees)

In [None]:
# Filtering data for full-time employees
full_time_data = data[data['Status'] == 'FT']

# Computing the mean hourly wage for full-time employees
mean_hourly_wage_ft = full_time_data['Hourly_Wage'].mean()

print(f"Mean Hourly Wage for Full-Time Employees: {mean_hourly_wage_ft:.2f}")


In [None]:
# Your code here: (compute the t_statistic). Take into account that this dataset is a sample of a real population.
# Remember that you only need to consider "FT" employees

In [None]:
from scipy.stats import ttest_1samp

# Filter data for full-time employees
full_time_data = data[data['Status'] == 'FT']

# Specify the hypothesized mean hourly wage ($75/hr)
hypothesized_mean = 75

# Perform one-sample t-test for full-time employees
t_stat_ft, p_value_ft = ttest_1samp(full_time_data['Hourly_Wage'].dropna(), hypothesized_mean)

print(f"T-statistic for Full-Time Employees: {t_stat_ft:.2f}")
print(f"P-value for Full-Time Employees: {p_value_ft:.4f}")


In [None]:
# Method 1: Critical value. Get the critical value and compare it against your statistic.
# Your code here: 

In [None]:
from scipy.stats import t

alpha = 0.05

# Degrees of freedom (sample size - 1)
degrees_of_freedom_ft = len(full_time_data['Hourly_Wage'].dropna()) - 1

# Getting the critical value for a two-sided test
critical_value = t.ppf(1 - alpha/2, degrees_of_freedom_ft)

print(f"Critical Value for Two-Sided Test: {critical_value:.2f}")

# Comparing with the t-statistic
if abs(t_stat_ft) > critical_value:
    print("Reject the null hypothesis: The mean hourly wage for Full-Time employees is significantly different from $75/hr.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in the mean hourly wage for Full-Time employees.")


In [None]:
# Method 2: Use the p-value method.
# Your code here:

In [None]:
alpha = 0.05

print(f"P-value for Full-Time Employees: {p_value_ft:.4f}")

# Comparing with the significance level
if p_value_ft <= alpha:
    print("Reject the null hypothesis: The mean hourly wage for Full-Time employees is significantly different from $75/hr.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in the mean hourly wage for Full-Time employees.")


In [None]:
# Method 3: Use the ttest_1samp function from scipy. 
# Check the documentation [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html)
# Make sure that you have a scipy version >=1.6.0. If that's not your case please upgrade your scipy version using
# !pip install -U scipy
# Your code here:

In [None]:
from scipy.stats import ttest_1samp

alpha = 0.05

# one-sample t-test for full-time employees
t_stat_ft, p_value_ft = ttest_1samp(full_time_data['Hourly_Wage'].dropna(), hypothesized_mean)

print(f"P-value for Full-Time Employees: {p_value_ft:.4f}")

# Comparing with the significance level
if p_value_ft <= alpha:
    print("Reject the null hypothesis: The mean hourly wage for Full-Time employees is significantly different from $75/hr.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in the mean hourly wage for Full-Time employees.")


Are all the methods in agreement?
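For a two-sided test the three methods are mathematically equivalent, so they should always reach the same decision. The sketch below illustrates this on synthetic data (the sample is randomly generated and only stands in for the real hourly wages; `mu0` and `alpha` mirror the values used above).

```python
import numpy as np
from scipy import stats

# Synthetic hourly wages, used only to illustrate the equivalence
rng = np.random.default_rng(0)
sample = rng.normal(loc=78, scale=20, size=200)
mu0, alpha = 75, 0.05

n = len(sample)
t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Method 1: compare |t| with the two-sided critical value
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
reject_critical = bool(abs(t_stat) > t_crit)

# Method 2: compute the two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 1)
reject_p_value = bool(p_manual < alpha)

# Method 3: let scipy compute the statistic and the p-value
res = stats.ttest_1samp(sample, mu0)
reject_scipy = bool(res.pvalue < alpha)

print(reject_critical, reject_p_value, reject_scipy)
```

The decision `abs(t_stat) > t_crit` and the decision `p_value < alpha` are two ways of asking whether the statistic falls in the rejection region, which is why they coincide.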

We are also curious about salaries in the police force. The chief of police in San Francisco claimed in a press briefing that salaries this year are **higher than last year's mean of $86000/year for all salaried employees** (use the column "TotalPayBenefits"). Test this hypothesis at the 95% confidence level.

Hint: Use apply and a lambda function to check if "Police" is in the "JobTitle" to get all the "Police" jobs.
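A minimal sketch of the filtering step in the hint, on a tiny hypothetical table (the job titles and amounts below are invented). It also shows `Series.str.contains` as an alternative that handles missing titles through `na=False`.

```python
import pandas as pd

# Hypothetical rows; one JobTitle is missing on purpose
jobs = pd.DataFrame({
    'JobTitle': ['POLICE OFFICER 3', 'Registered Nurse', 'Police Sergeant', None],
    'TotalPayBenefits': [150000.0, 130000.0, 170000.0, 90000.0],
})

# apply + lambda: case-insensitive substring check that skips non-string values
mask_apply = jobs['JobTitle'].apply(
    lambda title: isinstance(title, str) and 'police' in title.lower()
)

# Vectorized alternative: missing titles are treated as "no match"
mask_str = jobs['JobTitle'].str.contains('police', case=False, na=False)

police_jobs = jobs[mask_apply]
print(police_jobs)
```

Both masks select the same rows here; the `isinstance` guard is only needed if "JobTitle" can contain missing values.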

In [None]:
# Your code here: (compute the t_statistic). Take into account that this dataset is a sample of a real population.
# Remember that you only need to consider "Police" employees

In [None]:
from scipy.stats import ttest_1samp

alpha = 0.05

# Keep only rows whose JobTitle mentions "police" (case-insensitive)
police_data = data[data['JobTitle'].apply(lambda x: 'POLICE' in x.upper())]

# One-sided one-sample t-test: the alternative is that the mean is greater than $86,000/year
t_stat_police, p_value_police = ttest_1samp(police_data['TotalPayBenefits'].dropna(), 86000, alternative='greater')

print(f"P-value for Police Employees: {p_value_police:.4f}")

if p_value_police <= alpha:
    print("Reject the null hypothesis: Salaries in the police force are significantly higher than $86,000/year.")
else:
    print("Fail to reject the null hypothesis: There is no significant evidence that police salaries are higher than $86,000/year.")


In [None]:
# Method 1: Critical value. Get the critical value and compare it against your statistic.
# Your code here: 

In [None]:
from scipy.stats import t

alpha = 0.05

# Degrees of freedom (number of non-missing observations - 1)
df = len(police_data['TotalPayBenefits'].dropna()) - 1

# Critical value for a one-sided (right-tailed) test
critical_value = t.ppf(1 - alpha, df)

print(f"Critical Value: {critical_value:.4f}")

# Reject only if the t-statistic falls in the right tail
if t_stat_police > critical_value:
    print("Reject the null hypothesis: Salaries in the police force are significantly higher than $86,000/year.")
else:
    print("Fail to reject the null hypothesis: There is no significant evidence that police salaries are higher than $86,000/year.")


In [None]:
# Method 2: Use the p-value method.
# Your code here:

In [None]:
alpha = 0.05

print(f"P-value for Police Employees: {p_value_police:.4f}")

# Comparing with the significance level
if p_value_police <= alpha:
    print("Reject the null hypothesis: Salaries in the police force are significantly higher than $86,000/year.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in salaries in the police force.")


In [None]:
# Method 3: Use the ttest_1samp function from scipy. 
# Check the documentation [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html)
# Make sure that you have a scipy version >=1.6.0. If that's not your case please upgrade your scipy version using
# !pip install -U scipy
# Your code here:

In [None]:
from scipy.stats import ttest_1samp

alpha = 0.05

# One-sided one-sample t-test for police employees (alternative: mean greater than $86,000/year)
t_stat_police, p_value_police = ttest_1samp(police_data['TotalPayBenefits'].dropna(), 86000, alternative='greater')

print(f"P-value for Police Employees: {p_value_police:.4f}")

# Comparing with the significance level
if p_value_police <= alpha:
    print("Reject the null hypothesis: Salaries in the police force are significantly higher than $86,000/year.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in salaries in the police force.")


The workers from the "JobTitle" with the most employees have complained that their hourly wage is **less than $35/hour**. Using a one sample t-test, test this one-sided hypothesis at the 95% confidence level.
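For a one-sided claim such as "less than", `ttest_1samp` accepts an `alternative` argument (available in the scipy versions this lab asks for, 1.6.0 and later). The sketch below uses synthetic wages, not the real data, and checks that the one-sided p-value is simply the left-tail probability of the t statistic.

```python
import numpy as np
from scipy import stats

# Synthetic hourly wages for one job title (illustration only)
rng = np.random.default_rng(1)
wages = rng.normal(loc=33, scale=6, size=50)
mu0 = 35

# The test statistic does not depend on the alternative, only the p-value does
two_sided = stats.ttest_1samp(wages, mu0)
one_sided = stats.ttest_1samp(wages, mu0, alternative='less')

# Left-tail probability of the statistic under the null hypothesis
p_less_manual = stats.t.cdf(two_sided.statistic, df=len(wages) - 1)

print("t statistic:", one_sided.statistic)
print("two-sided p:", two_sided.pvalue, "one-sided p:", one_sided.pvalue)
```

With `alternative='less'` the null hypothesis is rejected only when the sample mean is sufficiently far below the hypothesized value; a large positive statistic gives a p-value close to 1.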

In [None]:
# Your code here: (Get the "JobTitle" which has most employees)

In [None]:
# Finding the JobTitle with the most employees
most_common_job = data['JobTitle'].mode().iloc[0]

print(f"JobTitle with the most employees: {most_common_job}")


In [None]:
# Your code here: (compute the t_statistic). Take into account that this dataset is a sample of a real population.
# Remember that you only need to consider the right "JobTitle" employees

In [None]:
from scipy.stats import ttest_1samp

alpha = 0.05

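# Hourly wages of the employees with the most common JobTitle
most_common_job_data = data[data['JobTitle'] == most_common_job]['Hourly_Wage'].dropna()

# One-sided one-sample t-test: the alternative is that the mean is less than $35/hour
t_stat_most_common_job, p_value_most_common_job = ttest_1samp(most_common_job_data, 35, alternative='less')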

print(f"P-value for {most_common_job} employees: {p_value_most_common_job:.4f}")

# Comparing with the significance level
if p_value_most_common_job <= alpha:
    print(f"Reject the null hypothesis: Hourly wage for {most_common_job} employees is significantly less than $35/hour.")
else:
    print(f"Fail to reject the null hypothesis: There is no significant difference in hourly wage for {most_common_job} employees.")


In [None]:
# Method 1: Critical value. Get the critical value and compare it against your statistic.
# Your code here: 

In [None]:
from scipy.stats import t

# Degrees of freedom
df_most_common_job = len(most_common_job_data) - 1

# Critical value for a one-sided (left-tailed) test
critical_value = t.ppf(alpha, df_most_common_job)

print(f"Critical value: {critical_value:.4f}")

# Reject only if the t-statistic falls in the left tail
if t_stat_most_common_job < critical_value:
    print(f"Reject the null hypothesis: Hourly wage for {most_common_job} employees is significantly less than $35/hour.")
else:
    print(f"Fail to reject the null hypothesis: There is no significant difference in hourly wage for {most_common_job} employees.")


In [None]:
# Method 2: Use the p-value method.
# Print the p-value
print(f"P-value for {most_common_job} employees: {p_value_most_common_job:.4f}")

# Comparing with the significance level
if p_value_most_common_job <= alpha:
    print(f"Reject the null hypothesis: Hourly wage for {most_common_job} employees is significantly less than $35/hour.")
else:
    print(f"Fail to reject the null hypothesis: There is no significant difference in hourly wage for {most_common_job} employees.")


In [None]:
# Method 3: Use the ttest_1samp function from scipy. 
# Check the documentation [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html)
# Make sure that you have a scipy version >=1.6.0. If that's not your case please upgrade your scipy version using
# !pip install -U scipy
# Your code here:

In [None]:
from scipy.stats import ttest_1samp

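# Hourly wages of the employees with the most common JobTitle
hourly_wage_most_common_job = data.loc[data['JobTitle'] == most_common_job, 'Hourly_Wage'].dropna()

# One-sided one-sample t-test: the alternative is that the mean is less than $35/hour
t_stat_most_common_job, p_value_most_common_job = ttest_1samp(hourly_wage_most_common_job, 35, alternative='less')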

print(f"t-statistic for {most_common_job} employees: {t_stat_most_common_job:.4f}")
print(f"P-value for {most_common_job} employees: {p_value_most_common_job:.4f}")

# Comparing with the significance level
if p_value_most_common_job <= alpha:
    print(f"Reject the null hypothesis: Hourly wage for {most_common_job} employees is significantly less than $35/hour.")
else:
    print(f"Fail to reject the null hypothesis: There is no significant difference in hourly wage for {most_common_job} employees.")


# Challenge 3: To practice - Constructing Confidence Intervals

While testing our hypothesis is a great way to gather empirical evidence for or against it, another way to gather evidence is by creating a confidence interval. A confidence interval gives us a range of plausible values for the true mean of the population. A 95% confidence interval is built with a procedure that captures the true population mean in 95% of repeated samples.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level.
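The interval is the sample mean plus or minus the critical t value times the standard error. A minimal sketch on synthetic wages (randomly generated, not the lab data) that builds the interval by hand and compares it with `scipy.stats.t.interval`:

```python
import numpy as np
from scipy import stats

# Synthetic hourly wages (illustration only)
rng = np.random.default_rng(2)
wages = rng.normal(loc=50, scale=15, size=120)

n = len(wages)
mean = wages.mean()
sem = stats.sem(wages)  # standard error of the mean, uses ddof=1 by default

# Manual interval: mean +/- t_critical * standard error
t_crit = stats.t.ppf(0.975, df=n - 1)
manual_ci = (mean - t_crit * sem, mean + t_crit * sem)

# Same interval from scipy; the confidence level is passed positionally
scipy_ci = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

print("Manual:", manual_ci)
print("SciPy: ", scipy_ci)
```

The degrees of freedom must come from the same sample whose mean is being estimated, since the critical value depends on the sample size.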

In [None]:
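# Method 1: Get the critical values which correspond to a 95% confidence.
from scipy.stats import t

# Hourly wages without missing values
# (assumption: "all hourly workers" means every row with a computed Hourly_Wage)
hourly_wage_data = data['Hourly_Wage'].dropna()

# Degrees of freedom and significance level
df = len(hourly_wage_data) - 1
alpha = 0.05

# Getting the critical values from the t-distribution
t_critical_left = t.ppf(alpha / 2, df)
t_critical_right = t.ppf(1 - alpha / 2, df)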


print(f"Critical value for the left tail: {t_critical_left:.4f}")
print(f"Critical value for the right tail: {t_critical_right:.4f}")


Now compute a 95% confidence interval for the hourly salary of all the Police employees.

In [None]:
import numpy as np
from scipy.stats import t

police_hourly_salary = data[data['JobTitle'].str.contains('Police', case=False, na=False)]['Hourly_Wage'].dropna()

mean_hourly_salary = police_hourly_salary.mean()
std_error_hourly_salary = np.std(police_hourly_salary, ddof=1) / np.sqrt(len(police_hourly_salary))

# Critical value using the degrees of freedom of the police sample
t_critical_police = t.ppf(1 - 0.05 / 2, len(police_hourly_salary) - 1)

# Calculate the margin of error
margin_of_error = t_critical_police * std_error_hourly_salary

# Calculate the confidence interval
confidence_interval = (mean_hourly_salary - margin_of_error, mean_hourly_salary + margin_of_error)

# Print the results
print(f"Mean hourly salary for Police employees: ${mean_hourly_salary:.2f}")
print(f"Margin of error: ${margin_of_error:.2f}")
print(f"95% Confidence Interval: (${confidence_interval[0]:.2f}, ${confidence_interval[1]:.2f})")


# Chi2 test

Now we want to know if the proportion of full time "FT" and part time "PT" employees is the same among lawyers, medical staff, police, firemen and other departments.

Considering every option within these groups of employees would be very time consuming. To simplify this process, first create a function that returns:

* "Policemen" if "Police" is found on "JobTitle"
* "Firemen" if "Fire" is found on "JobTitle"
* "Medical" if "Med" or "Nurse" is found on "JobTitle"
* "Lawyer" if "Attorney" is found on "JobTitle"
* "Other" in any other cases

Then, create a new column named "employee_group" that indicates which group each employee belongs to.

In [None]:
def categorize_employee_group(job_title):
    job_title = job_title.lower()

    if 'police' in job_title:
        return 'Policemen'
    elif 'fire' in job_title:
        return 'Firemen'
    elif 'med' in job_title or 'nurse' in job_title:
        return 'Medical'
    elif 'attorney' in job_title:
        return 'Lawyer'
    else:
        return 'Other'


In [None]:
# new column 'employee_group'
data['employee_group'] = data['JobTitle'].apply(categorize_employee_group)



Determine how many "PT" and "FT" employees each employee group has.

In [None]:
# Your code here: (Store the output dataframe into a new variable)

In [None]:
# Count the number of "PT" and "FT" employees for each group
employee_group_counts = data.groupby(['employee_group', 'Status']).size().unstack(fill_value=0)

# Display the resulting DataFrame
employee_group_counts


Now try to compute the expected frequencies from the individual probabilities. Remember that, under the null hypothesis, the Chi2 test assumes that both variables (employee_group and FT/PT) are unrelated, that is, independent. Therefore, to compute the expected frequencies you need to compute the probability of each cell and multiply it by the number of observations, i.e.:

$$\nu(x,y) = p(x,y) \cdot N = p(x) \cdot p(y) \cdot N$$

Bear in mind that in general $p(x,y)\neq p(x) \cdot p(y)$; the equality only holds if x and y are independent, which is exactly what the null hypothesis assumes.

where "x" is the "employee_group" and "y" the status (FT/PT).
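The formula can be applied to every cell at once by taking the outer product of the row and column marginal probabilities. Here is a small sketch with a hypothetical table of observed counts (the numbers are invented; `observed` plays the role of the table of actual "FT"/"PT" counts per group).

```python
import numpy as np
import pandas as pd

# Hypothetical observed counts per employee group and status
observed = pd.DataFrame(
    {'FT': [30, 10, 60], 'PT': [20, 30, 50]},
    index=['Firemen', 'Lawyer', 'Other'],
)

N = observed.to_numpy().sum()          # total number of observations
p_group = observed.sum(axis=1) / N     # p(x): marginal probability of each group
p_status = observed.sum(axis=0) / N    # p(y): marginal probability of FT and PT

# Expected count per cell under independence: p(x) * p(y) * N
expected = pd.DataFrame(
    np.outer(p_group, p_status) * N,
    index=observed.index,
    columns=observed.columns,
)

print(expected)
```

In this toy table half of all employees are "FT", so each group's expected "FT" count is half of its row total; for example, Firemen have 50 employees and an expected "FT" count of 25.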

In [None]:
# Create an empty dataframe named "frequencies" to store the data.
# Your code here:

In [None]:
import pandas as pd

# DataFrame to store expected frequencies
frequencies = pd.DataFrame(index=['Policemen', 'Firemen', 'Medical', 'Lawyer', 'Other'], columns=['FT', 'PT'])

# Display the empty DataFrame
frequencies


In [None]:
# Your code here: Compute Expected frequency of being "Firemen" and "FT". Store the solution in a variable named "firemen_ft"

In [None]:
# Total number of employees with a known status
total_observations = employee_group_counts.sum().sum()

# Marginal probabilities of being "Firemen" and of being "FT"
p_firemen = employee_group_counts.loc['Firemen'].sum() / total_observations
p_ft = employee_group_counts['FT'].sum() / total_observations

firemen_ft = p_firemen * p_ft * total_observations

firemen_ft


In [None]:
# Your code here: Compute Expected frequency of being "Firemen" and "PT". Store the solution in a variable named "firemen_pt"

In [None]:
# Marginal probability of being "PT"
p_pt = employee_group_counts['PT'].sum() / total_observations

firemen_pt = p_firemen * p_pt * total_observations

firemen_pt


In [None]:
# Your code here: Compute Expected frequency of being "Lawyers" and "FT". Store the solution in a variable named "lawyers_ft"

In [None]:
# Marginal probability of being a "Lawyer"
p_lawyers = employee_group_counts.loc['Lawyer'].sum() / total_observations

lawyers_ft = p_lawyers * p_ft * total_observations

lawyers_ft


In [None]:
# Your code here: Compute Expected frequency of being "Lawyers" and "PT". Store the solution in a variable named "lawyers_pt"

In [None]:
lawyers_pt = p_lawyers * p_pt * total_observations

lawyers_pt


In [None]:
# Your code here: Compute Expected frequency of being "Medical" and "FT". Store the solution in a variable named "medical_ft"

In [None]:
# Marginal probability of being "Medical"
p_medical = employee_group_counts.loc['Medical'].sum() / total_observations

medical_ft = p_medical * p_ft * total_observations

medical_ft


In [None]:
# Your code here: Compute Expected frequency of being "Medical" and "PT". Store the solution in a variable named "medical_pt"

In [None]:
# Your code here: Compute Expected frequency of being "Other" and "FT". Store the solution in a variable named "other_ft"

In [None]:
# Your code here: Compute Expected frequency of being "Other" and "PT". Store the solution in a variable named "other_pt"

In [None]:
# Your code here: Compute Expected frequency of being "Policemen" and "FT". Store the solution in a variable named "policemen_ft"

In [None]:
# Your code here: Compute Expected frequency of being "Policemen" and "PT". Store the solution in a variable named "policemen_pt"

* Store all the expected frequencies of "FT" employees in a list 
* Store all the "PT" employees into another list
* Create a dictionary with "FT" and "PT" as keys and as the values use the previous lists
* Create a dataframe with this dictionary using pd.DataFrame()

In [None]:
# My notebook kernel isn't loading properly and isn't running the functions above, so I can't add the expected-frequency calculations or complete the rest of the activity.

In [None]:
ft_list = [firemen_ft, lawyers_ft, medical_ft, other_ft, policemen_ft]
pt_list = [firemen_pt, lawyers_pt, medical_pt, other_pt, policemen_pt]

data_dict = {'FT': ft_list, 'PT': pt_list}

expected_df = pd.DataFrame(data_dict, index=['Firemen', 'Lawyers', 'Medical', 'Other', 'Policemen'])

print(expected_df)


Now use the "st.chi2_contingency()" function from scipy.stats [documentation here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) to conduct a Chi2 test to determine if the differences between employee groups are statistically significant at a 95% confidence level. Hint: pass the function a dataframe of actual frequencies.
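A minimal sketch of the call on a hypothetical table of observed counts (invented numbers, not the lab's results). `chi2_contingency` returns the statistic, the p-value, the degrees of freedom and the expected frequencies, which can be checked against the manual formula.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical observed counts per employee group and status
observed = pd.DataFrame(
    {'FT': [30, 10, 60], 'PT': [20, 30, 50]},
    index=['Firemen', 'Lawyer', 'Other'],
)

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Manual statistic: sum over cells of (observed - expected)^2 / expected
chi2_manual = ((observed.to_numpy() - expected) ** 2 / expected).sum()

print("Chi2:", chi2_stat, "manual:", chi2_manual)
print("p-value:", p_value, "dof:", dof)
print(expected)
```

The degrees of freedom are (rows - 1) * (columns - 1). The continuity correction that scipy applies by default only affects tables with one degree of freedom, so it plays no role in a table of this shape.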

In [None]:
# Your code here: (use the st.chi2_contingency() function from scipy.stats to compute:
# The Chi2 value
# The p-value
# The expected frequencies.)
# Ho: there is no relationship
# Ha: there is a relationship
# p_value = P(table | Ho) = P(table | no relationship) = 1.51e-6 < 0.05

In [None]:
import scipy.stats as st

# Run the test on the observed (actual) frequencies; empty cells count as 0
chi2_stat, p_value, dof, expected = st.chi2_contingency(employee_group_counts.fillna(0))

print(f"Chi2 Statistic: {chi2_stat}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)


Check if your expected frequencies agree with the ones obtained with the st.chi2_contingency() function

In [None]:
print("Manual Expected Frequencies:")
print(expected_df)

print("Expected Frequencies from chi2_contingency:")
print(expected)
