To support data-driven decision-making in education, it is important to estimate the effect of particular activities on their participants. One way to do this is to compare outcomes for those who took part in an activity with outcomes for those who didn't. These outcomes may include KPIs, year-end evaluations, the success of the adaptation (onboarding) process at a company, different types of engagement, and the ultimate metric of employee satisfaction: staff turnover.

The limitation of this approach comes from the fact that education is only one of the factors that influence these outcomes for employees. There are therefore two options: 1) build a model that includes every factor and assess the influence of each of them (including education); or 2) isolate the influence of one factor as far as possible, based on our subjective judgement.

To assess the effect of the introduction education on employees, we'll take the second path and use staff turnover as a proxy for the effectiveness of the education. Introduction education at the company includes an online or offline seminar ("webinar") and an e-learning component. It provides new employees with key information about the company's values, its past and future paths, and their place in the company's hierarchy. It allows employees to feel that they are an integral part of the company's life and to better understand its values.

The following analysis shows that there is a statistically significant connection between completing the introduction education (particularly the full program) and continuing to work at the company at least 3 months after being hired.

In [1]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import chisquare
%matplotlib inline


In [12]:
starting_program = pd.read_csv("starting_program1.csv", sep=";")

In [13]:
starting_program.head()

Unnamed: 0,state,webinar,program
0,working,1,2
1,working,1,2
2,working,0,0
3,working,0,0
4,working,1,2


Description of the variables:
1) state (working/left) - whether the employee is still working at the company at the time of analysis;
2) webinar (1/0) - whether the employee has completed the webinar as part of the introduction education;
3) program (0/1/2/3) - how much of the introduction education program the employee has completed.

The 4 levels of program completion: 0 - nothing, 1 - only e-learning, 2 - only webinar, 3 - full program.
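
Read literally, the coding scheme above implies a relationship between the two columns: `webinar` should be 1 exactly when `program` is 2 or 3. The following is a minimal sketch of that consistency check, assuming this reading is correct. The names `PROGRAM_LEVELS` and `is_consistent` are hypothetical and not part of the original notebook; the sample rows are the five shown by `head()` above.

```python
# Labels for the four program-completion levels described above.
PROGRAM_LEVELS = {
    0: "nothing",
    1: "only e-learning",
    2: "only webinar",
    3: "full program",
}


def is_consistent(webinar: int, program: int) -> bool:
    """Assumption: the webinar flag is 1 exactly for levels 2 and 3."""
    return (webinar == 1) == (program in (2, 3))


# (state, webinar, program) rows copied from the head() output above.
sample_rows = [
    ("working", 1, 2),
    ("working", 1, 2),
    ("working", 0, 0),
    ("working", 0, 0),
    ("working", 1, 2),
]

# Rows whose webinar flag contradicts their program level.
inconsistent = [row for row in sample_rows if not is_consistent(row[1], row[2])]
# Human-readable label for each row's program level.
labels = [PROGRAM_LEVELS[program] for _, _, program in sample_rows]

print(inconsistent)
print(labels)
```

On the full dataset the same check could be applied row by row before building the contingency table, so that miscoded records are spotted early.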

In [14]:
data_crosstab = pd.crosstab(starting_program['state'],
                           starting_program['program'],
                           margins=True, margins_name='Total')

In [15]:
data_crosstab

state \ program,0,1,2,3,Total
left,125,151,54,146,476
working,580,517,328,1214,2639
Total,705,668,382,1360,3115
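
To make the table easier to read, the share of employees who left can be computed for each program level. This is a minimal sketch with the counts copied by hand from the crosstab above; the variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the crosstab above, keyed by program level (0-3).
left = {0: 125, 1: 151, 2: 54, 3: 146}
total = {0: 705, 1: 668, 2: 382, 3: 1360}

# Share of employees at each program level who have left the company.
left_share = {level: left[level] / total[level] for level in left}

for level, share in left_share.items():
    print(f"program {level}: {share:.1%} left")
```

With these counts, the full program (level 3) has the lowest share of leavers and the e-learning-only group (level 1) has the highest.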


Both variables are nominal, and there is enough data to use Pearson's chi-square test.
Pearson's chi-square test is a test of independence between categorical variables.
It lets us decide whether there is a connection (dependence) between two variables.
H0: the 2 categorical variables are not connected (they are independent).
H1: there is a connection between the 2 categorical variables.
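
A common rule of thumb for "enough data" is that every expected cell count under H0 is at least 5. The sketch below checks this, assuming the observed counts from the crosstab above (typed in by hand, without the `Total` margins); the variable names are illustrative.

```python
# Observed counts copied from the crosstab above (columns: program 0, 1, 2, 3).
observed = {
    "left": [125, 151, 54, 146],
    "working": [580, 517, 328, 1214],
}

row_totals = {state: sum(counts) for state, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Expected count under H0: row total * column total / grand total.
expected = {
    state: [row_totals[state] * c / grand_total for c in col_totals]
    for state in observed
}

# The smallest expected count decides whether the rule of thumb holds.
min_expected = min(min(row) for row in expected.values())
print(min_expected >= 5)
```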

In [16]:
alpha = 0.05
chi_square = 0
rows = starting_program['state'].unique()
columns = starting_program['program'].unique()

In [17]:
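for i in columns:
    for j in rows:
        # Observed count for program level i and state j
        O = data_crosstab.loc[j, i]
        # Expected count under independence: column total * row total / grand total
        E = data_crosstab.loc['Total', i] * data_crosstab.loc[j, 'Total'] / data_crosstab.loc['Total', 'Total']
        # Add this cell's contribution to the chi-square statistic
        chi_square += (O - E)**2 / E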

In [18]:
chi_square

53.038757027178754

In [19]:
columns

array([2, 0, 3, 1], dtype=int64)

In [21]:
# The critical value approach
print("Approach 2: The critical value approach to hypothesis testing in the decision rule")
# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
critical_value = stats.chi2.ppf(1-alpha, (len(rows)-1)*(len(columns)-1))
conclusion = "Failed to reject the null hypothesis."
if chi_square > critical_value:
    conclusion = "Null hypothesis is rejected."

print("chisquare-score is:", chi_square, " and critical value is:", critical_value)
print(conclusion)

Approach 2: The critical value approach to hypothesis testing in the decision rule
chisquare-score is: 53.038757027178754  and critical value is: 7.814727903251179
Null hypothesis is rejected.
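
As a cross-check, SciPy provides `scipy.stats.chi2_contingency`, which computes the statistic, the p-value, the degrees of freedom and the expected counts in one call. The sketch below assumes the observed counts from the crosstab above, typed in by hand without the `Total` margins; it is not part of the original notebook.

```python
from scipy.stats import chi2_contingency

# Observed counts from the crosstab above, without the 'Total' margins.
observed = [
    [125, 151, 54, 146],    # left
    [580, 517, 328, 1214],  # working
]

# Returns the chi-square statistic, the p-value, the degrees of freedom
# and the table of expected counts.
statistic, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
print("chi-square statistic:", statistic)
print("degrees of freedom:", dof)
print("reject H0:", p_value < alpha)
```

The statistic should agree with the hand-computed value, and comparing `p_value` with `alpha` leads to the same decision as comparing the statistic with the critical value.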


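The connection found is consistent with common prerequisites for both variables: the importance of corporate culture, common values, and communication with management at all levels. The test itself shows an association rather than a causal effect, so this interpretation rests on our judgement.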

The result of this study can be used in communication with management to highlight the importance of education for retaining employees, and in communication with employees to remind them of the need to complete mandatory training.