# Analysis of StackOverflow Survey. Part 2

## Import packages and libraries

In [None]:
# general packages and libraries
import sys
import importlib

# data manipulation packages
import pandas as pd

# data visualizations packages
import matplotlib.pyplot as plt
# to render plots in the notebook
%matplotlib inline

import seaborn as sns
# set a theme for seaborn
sns.set_theme()

In [None]:
# create a string for the working directory
mypath = '/home/silvia/Documents/udacityND/ml_dsnd/proj1_dsnd/'

# add src folder to sys.path to use the local modules
sys.path.insert(1, mypath + 'src')

In [None]:
# import local module containing the necessary functions
import utils_functions as uf

# forces the interpreter to re-import the module
importlib.reload(uf);

## State the question 

We are addressing the second question in this notebook. How do the salaries of data coders compare with the salaries of the other coders? In particular, we shall determine if there is a statistically significant difference between the average converted compensation of data coders versus the average converted compensation of the other coders.

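We conduct a Z-test for independent means to compare the averages of the converted compensation, using the entries from the column `ImputedComp` obtained after imputing the missing values in `ConvertedComp`. In the code, the test is carried out with Welch's t-test from `scipy.stats`, which for large samples gives practically the same result as a Z-test. We will test the null hypothesis 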
$$H_0: \mu_c \leq  \mu_o $$
that the mean converted compensation of the data coders ($\mu_c$) is no greater than that of the other coders ($\mu_o$), against the alternative
$$H_a: \mu_c > \mu_o $$

We will work with a $1 \%$ significance level so that $\alpha = 0.01$.

### Upload the processed data

In [None]:
# upload the datafiles as pandas dataframes
df1 = pd.read_csv(mypath+'data/interim/survey20_updated.csv')

# check the uploaded data
df1.shape

### Impute data in ConvertedComp column

We impute the missing data in `ConvertedComp` with the median value from each country.
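
As a minimal sketch of this imputation step, the toy example below uses made-up values (the country labels `A`, `B`, `C` and the amounts are hypothetical, not taken from the survey). It shows how `groupby(...).transform('median')` returns one median per row, aligned with the original frame, so that `fillna` can fill each gap with the median of its own country.

```python
import pandas as pd

# toy data with hypothetical values, not taken from the survey
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'ConvertedComp': [10.0, 30.0, None, 50.0, None, None],
})

# per-country median, repeated for every row of that country
country_median = toy.groupby('Country')['ConvertedComp'].transform('median')

# keep the existing values and fill the gaps with the country median
toy['ImputedComp'] = toy['ConvertedComp'].fillna(country_median)

print(toy)
```

Country `C` has no reported compensation at all, so its median is undefined and its row stays missing after imputation. This is why some values can remain missing after this step and need to be dropped.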

In [None]:
# take a look at the records statistics
df1.ConvertedComp.describe().round(3)

In [None]:
# there are numerous extreme values and almost half of the values are missing
df1.ConvertedComp.sort_values(ascending=False)

In [None]:
# the extreme values
df1[df1.ConvertedComp == 2000000].shape

In [None]:
# impute the ConvertedComp missing values with the median of the corresponding country
# save the existing values and the imputed values in a new column
df1['ImputedComp'] = df1['ConvertedComp'].fillna(df1.groupby('Country')['ConvertedComp'].transform('median'))

In [None]:
# check how many values are still missing
df1.ImputedComp.isnull().sum()

In [None]:
# drop all rows that have NaN in ImputedComp column
df1.dropna(subset = ['ImputedComp'], inplace=True)
df1.shape

In [None]:
# how many values are 0
df1[df1.ImputedComp == 0.0].shape

In [None]:
# drop the rows where ImputedComp is 0
df1.drop(df1[df1.ImputedComp == 0.0].index, inplace = True)

# check for success
df1.shape

### Data Processing

In [None]:
# create a dataframe with the variables of interest only
# (column selection keeps the original dtypes; .copy() avoids modifying df1)
df2 = df1[['ImputedComp', 'DevClass']].copy()
df2.head(2)

In [None]:
# check the attributes of the dataframe
df2.info()

In [None]:
# make sure the ImputedComp column has a numeric Dtype
df2['ImputedComp'] = pd.to_numeric(df2["ImputedComp"])
df2.info()

In [None]:
# the records for the other coders
df2_other = df2[df2.DevClass == 'other_coder']
df2_other.shape

In [None]:
# the records for the data coders
df2_data = df2[df2.DevClass == 'data_coder']
df2_data.shape

In [None]:
# set the figure size
plt.figure(figsize = [20, 16])

# left upper plot: histogram
plt.subplot(2, 2, 1)
sns.histplot(data=df2_data, x='ImputedComp');
plt.title('Compensation data coders', fontsize=14);


# right upper plot: histogram
plt.subplot(2, 2, 2)
sns.histplot(data=df2_data, x='ImputedComp',log_scale = True);
plt.title('Compensation data coders - logscale', fontsize=14);

# left lower plot: histogram
plt.subplot(2, 2, 3)
sns.histplot(data=df2_other, x='ImputedComp');
plt.title('Compensation other coders', fontsize=14);

# right lower plot: histogram
plt.subplot(2, 2, 4)
sns.histplot(data=df2_other, x='ImputedComp',log_scale = True );
plt.title('Compensation other coders - logscale', fontsize=14);


In [None]:
# set the figure size
plt.figure(figsize = [20, 6])

# left plot: boxplot
plt.subplot(1, 2, 1)
sns.boxplot(data=df2_data, x='ImputedComp');
plt.title('Compensation data coders', fontsize=14);


# right plot: boxplot
plt.subplot(1, 2, 2)
sns.boxplot(data=df2_other, x='ImputedComp');
plt.title('Compensation other coders', fontsize=14);

### Evaluate statistics

In [None]:
# create the two data arrays for the z-test
x_data = df2_data.ImputedComp 
x_other = df2_other.ImputedComp  


In [None]:
# compute the difference of averages
x_data.mean() - x_other.mean()

In [None]:
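from scipy.stats import ttest_ind

# Welch's t-test (unequal variances), right-tailed: H_a is mean(x_data) > mean(x_other)
# the 'alternative' argument requires SciPy >= 1.6
statistic, pvalue = ttest_ind(x_data, x_other, equal_var=False, alternative='greater')
print('The test statistic is %.3f and the p-value is %.5f' % (statistic, pvalue))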

#### Comments regarding the Z-test:

We are interested in whether the converted compensation of the data coders is larger on average than the converted compensation of the other coders. We test the null hypothesis 
$$H_0: \mu_c - \mu_o  \leq 0$$
that the mean converted compensation of the data coders ($\mu_c$) is no greater than that of the other coders ($\mu_o$), against the alternative
$$H_a: \mu_c - \mu_o  > 0.$$

We will work with a $1 \%$ significance level so that $\alpha = 0.01$.

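We perform a right-tailed test of two independent means. The population standard deviations are not known, so they are estimated from the samples: the code uses Welch's t-test (`ttest_ind` with `equal_var=False`). For large samples the t distribution is practically indistinguishable from the standard normal, so we read the test statistic as a z-score.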

For the significance level of $\alpha = 0.01$ the critical value is $z_{\scriptsize critical} = 2.33$  with rejection region $R = \lbrace z: z > 2.33 \rbrace$. 
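
The critical value can be checked with a short sketch. It uses `statistics.NormalDist` from the Python standard library rather than any function from this notebook, and the variable names are only illustrative.

```python
from statistics import NormalDist

# significance level of the right-tailed test
alpha = 0.01

# critical value: the point with probability alpha to its right
# under the standard normal distribution
z_critical = NormalDist().inv_cdf(1 - alpha)

print(round(z_critical, 2))  # 2.33
```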

We compute a z-statistic $z = 7.9 > 2.33$, which falls in the rejection region, so we reject the null hypothesis. 
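
For reference, a statistic of this kind can be computed by hand as the difference of the sample means divided by its standard error, $\sqrt{s_c^2/n_c + s_o^2/n_o}$. The sketch below is a minimal illustration with the standard library only; the function name `two_sample_z` and the toy samples are hypothetical and are not the survey data.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z(sample_a, sample_b):
    """Return (z, right-tailed p-value) for H_a: mean(a) > mean(b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # standard error of the difference of means, from the sample variances
    se = sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    z = (mean(sample_a) - mean(sample_b)) / se
    # probability of a value at least this large under the standard normal
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# hypothetical toy samples, not taken from the survey
a = [12.0, 15.0, 14.0, 16.0, 13.0]
b = [10.0, 11.0, 9.0, 12.0, 8.0]

z, p = two_sample_z(a, b)
print('z = %.3f, p = %.5f' % (z, p))
```

Welch's t-test uses the same statistic and only changes the reference distribution from the standard normal to a t distribution, which matters for small samples like the toy ones here but very little for large ones.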

Using the p-value approach: the p-value is $0$ to the printed precision, so $p < 0.01$ and we again reject the null hypothesis.

We found a difference of $18696 \$$ in average converted compensation between data coders and other coders, with a p-value of $0.00$, consistent with the alternative hypothesis that data coders earn more on average than the other coders.

A few comments about these conclusions:
- about $46 \%$ of the data is missing in the `ConvertedComp` column;
- we imputed most of the values with the country medians and dropped the remaining zero and missing values;
- there are numerous outliers in the data;
- the compensation is converted, but we still have to keep in mind that the compensation levels differ substantially among countries;
- another aspect not taken into account (and this is only one of many) is that the developers have various levels of seniority, and we averaged over all experience levels.

With all these observations in mind, our result is statistically significant but not necessarily a strong one. To obtain a more reliable outcome, we need to take into account other factors, some of them mentioned above.