# In Depth A/B Testing - Lab

## Introduction

In this lab, you'll explore a survey from Kaggle regarding budding data scientists. With this, you'll form some initial hypotheses, and test them using the tools you've acquired to date. 

## Objectives

You will be able to:
* Conduct t-tests and an ANOVA on a real-world dataset and interpret the results

## Load the Dataset and Perform a Brief Exploration

The data is stored in a file called **multipleChoiceResponses_cleaned.csv**. Feel free to check out the original dataset referenced at the bottom of this lab, although this cleaned version will be easier to work with. Additionally, metadata about the survey questions is stored in a file named **schema.csv**. Load the data as a Pandas DataFrame, and take a moment to get acquainted with it.

> Note: If you can't get the file to load properly, try changing the encoding format as in `encoding='latin1'`
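
If the default encoding fails, one option is to try a short list of encodings in turn. The sketch below is illustrative only: `read_csv_with_fallback` is a hypothetical helper (not part of pandas or of this lab), and the demo file it writes is made up so the example runs on its own.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Hypothetical helper: try each encoding until pd.read_csv succeeds."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error if last_error else ValueError("no encodings given")


# Demo on a tiny made-up file containing a byte that is invalid as UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("Country,Age\nCura\u00e7ao,30\n".encode("latin1"))
demo_path = tmp.name

demo = read_csv_with_fallback(demo_path)
os.remove(demo_path)
print(demo)
```

Listing `latin1` last is deliberate: it can decode any byte sequence, so it never raises and would mask a more appropriate encoding if tried first.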

In [1]:
#Your code here
#Importing libraries and loading data
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv('multipleChoiceResponses_cleaned.csv', encoding='latin1')
data.head()

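| | GenderSelect | Country | Age | EmploymentStatus | StudentStatus | LearningDataScience | CodeWriter | CareerSwitcher | CurrentJobTitleSelect | TitleFit | ... | JobFactorTitle | JobFactorCompanyFunding | JobFactorImpact | JobFactorRemote | JobFactorIndustry | JobFactorLeaderReputation | JobFactorDiversity | JobFactorPublishingOpportunity | exchangeRate | AdjustedCompensation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Non-binary, genderqueer, or gender non-conforming | NaN | NaN | Employed full-time | NaN | NaN | Yes | NaN | DBA/Database Engineer | Fine | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Female | United States | 30.0 | Not employed, but looking for work | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | Somewhat important | NaN | NaN | NaN | NaN |
| 2 | Male | Canada | 28.0 | Not employed, but looking for work | NaN | NaN | NaN | NaN | NaN | NaN | ... | Very Important | Very Important | Very Important | Very Important | Very Important | Very Important | Very Important | Very Important | NaN | NaN |
| 3 | Male | United States | 56.0 | Independent contractor, freelancer, or self-em... | NaN | NaN | Yes | NaN | Operations Research Practitioner | Poorly | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 250000.0 |
| 4 | Male | Taiwan | 38.0 | Employed full-time | NaN | NaN | Yes | NaN | Computer Scientist | Fine | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |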


## Wages and Education

You've been asked to determine whether education affects salary. Develop a hypothesis test to compare the salaries of those with Master's degrees to those with Bachelor's degrees. Are the two statistically different according to your results?

> Note: The relevant data is stored in the 'FormalEducation' and 'AdjustedCompensation' columns.

You may import the functions stored in the `flatiron_stats.py` file to help perform your hypothesis tests. It contains the stats functions that you previously coded: `welch_t(a, b)`, `welch_df(a, b)`, and `p_value_welch_ttest(a, b, two_sided=False)`.
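
As a reminder of what such helpers typically compute, here is a minimal sketch of the Welch t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. It is not the contents of `flatiron_stats.py`; the `_sketch` function names and the sample values are made up for illustration.

```python
import math
import statistics


def welch_t_sketch(a, b):
    """Welch's t statistic: difference in means over the unpooled standard error."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def welch_df_sketch(a, b):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    return (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))


# Made-up samples, only to exercise the formulas.
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 4, 6, 8, 10]
t_stat = welch_t_sketch(sample_a, sample_b)
dof = welch_df_sketch(sample_a, sample_b)
print(round(t_stat, 4), round(dof, 4))
```

Note that this sketch keeps the sign of the statistic; an implementation may instead return its absolute value, which matters when interpreting a one-sided test.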

Note that `scipy.stats.ttest_ind(a, b, equal_var=False)` performs a two-sided Welch's t-test. A two-sided p-value is twice the corresponding one-sided p-value, provided the observed difference lies in the direction stated by the one-sided alternative. See the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.
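
The relationship between the two-sided and one-sided p-values can be sketched as follows. The samples are synthetic (not the survey data), and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic samples: group_a is drawn with the higher mean and unequal variance.
group_a = rng.normal(loc=60, scale=10, size=200)
group_b = rng.normal(loc=50, scale=20, size=150)

# equal_var=False selects Welch's t-test; the returned p-value is two-sided.
t_stat, p_two_sided = stats.ttest_ind(group_a, group_b, equal_var=False)

# For H1: mean(a) > mean(b), halve the two-sided p-value when t is positive.
# When t is negative the data point the other way and the one-sided p is 1 - p/2.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(t_stat, p_two_sided, p_one_sided)
```

Checking the sign of the statistic before halving avoids reporting a small one-sided p-value for a difference that runs against the stated alternative.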

In [2]:
#Your code here
import flatiron_stats as fs




In [3]:
#State the null and alternative hypotheses.
#Null hypothesis: Those with a Master's degree and those with a Bachelor's degree earn the same.
#Alternative hypothesis: Those with a Master's degree earn more than those with a Bachelor's degree.

In [4]:
#Subsetting the data into 2 sets

masters = data[data['FormalEducation']=="Master's degree"]['AdjustedCompensation']
bachelors = data[data['FormalEducation']=="Bachelor's degree"]['AdjustedCompensation']

In [5]:
data.isna().sum()

GenderSelect                        105
Country                            6635
Age                                6844
EmploymentStatus                   6532
StudentStatus                     25088
                                  ...  
JobFactorLeaderReputation         22993
JobFactorDiversity                22984
JobFactorPublishingOpportunity    22970
exchangeRate                      21895
AdjustedCompensation              22051
Length: 230, dtype: int64

In [6]:
#Dropping the missing values
masters_cleaned=masters.dropna()
bachelors_cleaned=bachelors.dropna()

In [8]:
#Carrying out the t test to examine the hypothesis.
#One sided test because we are testing whether people with masters earn more
p_value = fs.p_value_welch_ttest(masters_cleaned, bachelors_cleaned, two_sided=False)
p_value

0.33077639451272445

In [9]:
#The p-value (about 0.33) is greater than 0.05, so there is insufficient evidence to reject the null hypothesis.
#With the data as-is, we cannot conclude that those with a Master's degree earn more than those with a Bachelor's degree.

## Wages and Education II

Now perform a similar statistical test comparing the AdjustedCompensation of those with Bachelor's degrees and those with Doctorates. If you haven't already, be sure to explore the distribution of the AdjustedCompensation feature for any anomalies. 
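
One way to look for anomalies is to compare summary statistics and upper quantiles. The sketch below uses made-up values standing in for the real column, and the 1.5 × IQR fence is a common rule of thumb assumed here for illustration, not a method prescribed by this lab.

```python
import pandas as pd

# Made-up compensation values with one extreme entry, standing in for the real column.
comp = pd.Series(
    [42000, 55000, 61000, 70000, 88000, 95000, 120000, 28000000],
    name="AdjustedCompensation",
)

print(comp.describe())                  # a mean far above the median hints at skew
print(comp.quantile([0.5, 0.9, 0.99]))  # upper quantiles show where the tail starts

# Rule of thumb (an assumption): flag values more than 1.5 * IQR above the third quartile.
q1, q3 = comp.quantile(0.25), comp.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = comp[comp > upper_fence]
print(upper_fence, outliers.tolist())
```

On the real data, the same calls could be applied to `data['AdjustedCompensation']` to help choose a cutoff before rerunning the test.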

In [15]:
#Two groups are being compared, so a t-test is appropriate.
#Null hypothesis: Those with a doctoral degree and those with a bachelor's degree earn the same
#Alternative hypothesis: Those with a doctoral degree earn more compared to those with a bachelor's degree

In [11]:
#Subsetting and cleaning the data
doctoral = data[data.FormalEducation == "Doctoral degree"]['AdjustedCompensation']
doctoral_clean = doctoral.dropna()
doctoral_clean


22       100000.000
32       172144.440
34       133000.000
61        15000.000
72        43049.736
            ...    
25875     71749.560
25966     12000.000
26012    123553.200
26038    170000.000
26203    200000.000
Name: AdjustedCompensation, Length: 967, dtype: float64

In [14]:
#Performing a t test with outliers in the data
p_with_outliers=fs.p_value_welch_ttest(bachelors_cleaned,doctoral_clean,two_sided=False)
p_with_outliers

0.15682381994720251

In [None]:
#With outliers in the data, the p-value (about 0.157) is greater than 0.05, so we fail to reject the null hypothesis.
#With the data as-is, we cannot conclude that those with a doctoral degree earn more than those with a bachelor's degree.

In [16]:
#Conducting t-tests without outliers in the data
#First we set a threshold for the outliers.
#We will set it at 350000


In [18]:
#Subsetting data and removing outliers.
doctoral_no_outlier = data[(data['FormalEducation']=="Doctoral degree") & (data['AdjustedCompensation']<=350000)]['AdjustedCompensation']
bachelors_no_outlier = data[(data['FormalEducation']=="Bachelor's degree") & (data['AdjustedCompensation']<=350000)]['AdjustedCompensation']

In [20]:
p_no_outlier=fs.p_value_welch_ttest(doctoral_no_outlier,bachelors_no_outlier, two_sided=False )
p_no_outlier

0.0

In [21]:
#With values above 350000 removed, the p-value is reported as 0.0 (effectively zero at this precision), well below 0.05.
#We therefore reject the null hypothesis: in the trimmed data, those with a doctoral degree earn more than those with a bachelor's degree.

## Wages and Education III

Remember the multiple comparisons problem; rather than continuing on like this, perform an ANOVA test between the various 'FormalEducation' categories and their relation to 'AdjustedCompensation'.
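
To see why repeated pairwise tests are a concern, consider a rough calculation of the familywise error rate. It assumes the tests are independent, which pairwise comparisons sharing groups are not, so treat the figure as an approximation. The group count of 7 is inferred from the 6 degrees of freedom reported for `FormalEducation` in the ANOVA output later in this lab.

```python
from math import comb

alpha = 0.05
n_groups = 7                  # inferred from df = 6 in the ANOVA table
n_pairs = comb(n_groups, 2)   # number of distinct pairwise t-tests

# Chance of at least one false positive if every null is true and tests are independent.
familywise_error = 1 - (1 - alpha) ** n_pairs

# Bonferroni correction: a conservative per-test threshold.
bonferroni_alpha = alpha / n_pairs

print(n_pairs, round(familywise_error, 3), round(bonferroni_alpha, 5))
```

A single ANOVA sidesteps this inflation by testing all group means at once.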

In [22]:
#Your code here
from statsmodels.formula.api import ols
import statsmodels.api as sm
f= '{} ~ C({})'.format('AdjustedCompensation', 'FormalEducation')
#fitting a model
linear_model = ols(formula=f, data=data).fit()



In [23]:
#Anova table
anova_table = sm.stats.anova_lm(linear_model, typ=2)
anova_table

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(FormalEducation) | 6.540294e+17 | 6.0 | 0.590714 | 0.738044 |
| Residual | 7.999414e+20 | 4335.0 | NaN | NaN |


In [None]:
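#The p-value here (about 0.738) is greater than 0.05, so we fail to reject the null hypothesis.
#This ANOVA finds no statistically significant difference in mean AdjustedCompensation across FormalEducation categories.
#Note that the model was fit on the full data, which still includes the extreme compensation values.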
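
Like the t-tests earlier in this lab, ANOVA is sensitive to extreme values. The toy example below uses made-up groups and an arbitrary cutoff to show the effect. It uses `scipy.stats.f_oneway` rather than the statsmodels formula interface used above, purely to keep the sketch short.

```python
from scipy import stats

# Made-up groups with clearly different means; the last holds a single extreme value.
low = [50, 52, 54, 56, 58]
mid = [60, 62, 64, 66, 68]
high = [70, 72, 74, 76, 78, 100000]

f_raw, p_raw = stats.f_oneway(low, mid, high)

cutoff = 1000  # arbitrary cutoff for this toy example
trimmed = [[x for x in group if x <= cutoff] for group in (low, mid, high)]
f_trimmed, p_trimmed = stats.f_oneway(*trimmed)

print(f"raw:     F={f_raw:.2f}, p={p_raw:.3f}")
print(f"trimmed: F={f_trimmed:.2f}, p={p_trimmed:.2e}")
```

A single extreme value inflates the within-group variance enough to hide differences between group means, so it may be worth refitting the model above on data filtered by a compensation threshold and comparing the two tables.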

## Additional Resources

Here's the original source where the data was taken from:  
    [Kaggle Machine Learning & Data Science Survey 2017](https://www.kaggle.com/kaggle/kaggle-survey-2017)

## Summary

In this lab, you practiced conducting actual hypothesis tests on actual data. From this, you saw how dependent results can be on the initial problem formulation, including preprocessing!