This notebook analyzes the dependency between the age and sat_colleques columns of the dataframe.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

df = pd.read_excel('WorkPlaceSatisfactionSurveyData.xlsx')

df = df.drop(["number", "healtcare", "holidayCabin", "gym", "muscleCare"], axis=1)
df

Unnamed: 0,gender,age,family,education,years_of_service,salary,sat_management,sat_colleques,sat_workingEnvironment,sat_salary,sat_tasks
0,1,38,1,1.0,22.0,3587,3,3.0,3,3,3
1,1,29,2,2.0,10.0,2963,1,5.0,2,1,3
2,1,30,1,1.0,7.0,1989,3,4.0,1,1,3
3,1,36,2,1.0,14.0,2144,3,3.0,3,3,3
4,1,24,1,2.0,4.0,2183,2,3.0,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...
77,1,22,1,3.0,0.0,1598,4,4.0,4,3,4
78,1,33,1,1.0,2.0,1638,1,3.0,2,1,2
79,1,27,1,2.0,7.0,2612,3,4.0,3,3,3
80,1,35,2,2.0,16.0,2808,3,4.0,3,3,3


We have to define bins to group the ages into classes.

In [2]:
age_bins = [15, 40, 65]
df["age_class"] = pd.cut(df["age"], bins = age_bins)
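
How `pd.cut` treats the bin edges is easy to overlook, so here is a minimal sketch on made-up ages (the values are hypothetical, not taken from the survey). With the default settings the intervals are closed on the right, and ages outside the outer edges end up as NaN:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([15, 16, 40, 41, 65, 70], name="age")

# Same edges as in the cell above; the default intervals are (15, 40] and (40, 65]
age_class = pd.cut(ages, bins=[15, 40, 65])

# 15 (on the open left edge) and 70 (beyond the last edge) fall outside every bin
print(pd.DataFrame({"age": ages, "age_class": age_class}))
print(age_class.value_counts(dropna=False))
```

If the youngest respondent were exactly 15, passing `include_lowest=True` to `pd.cut` would keep that row in the first bin.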

We also have to assign string labels to the satisfaction values so that they are easier to read.

In [3]:
df_age_sat_colleques = pd.crosstab(df["sat_colleques"], df["age_class"], normalize="columns")
df_age_sat_colleques.columns.name = "Age brackets"
df_age_sat_colleques.index = ["unsatisfied", "neutral", "satisfied", "very satisfied"]
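
The positional assignment above works only while exactly four satisfaction codes occur in the data. A possible alternative, sketched here on made-up answers, is to rename the codes through a dictionary, so that a missing or additional code cannot shift the labels. The label for code 1 and the toy data are assumptions, not part of the survey:

```python
import pandas as pd

# Hypothetical labels for the codes 1-5; the wording for code 1 is an assumption
labels = {
    1: "very unsatisfied",
    2: "unsatisfied",
    3: "neutral",
    4: "satisfied",
    5: "very satisfied",
}

# Made-up answers in which code 1 never occurs
toy = pd.DataFrame({
    "sat_colleques": [2.0, 3.0, 3.0, 4.0, 4.0, 5.0],
    "age_class": ["(15, 40]", "(40, 65]", "(15, 40]", "(15, 40]", "(40, 65]", "(40, 65]"],
})

shares = pd.crosstab(toy["sat_colleques"], toy["age_class"], normalize="columns")

# Rename by value instead of by position; codes without a label stay unchanged
shares = shares.rename(index=labels)
print(shares)
```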



In [4]:
df_age_sat_colleques

Age brackets,"(15, 40]","(40, 65]"
unsatisfied,0.018182,0.076923
neutral,0.218182,0.153846
satisfied,0.454545,0.384615
very satisfied,0.309091,0.384615


**Defining the story**

 
We want to see whether there is a meaningful relationship between age groups and satisfaction with colleagues. Changing the way we look at the data may reveal interesting findings.

These ways could be:

- Altering the brackets of age-groups
- Percentages vs. counts
- Changing the angle from which we look at the data
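
The three variations listed above can be sketched with one small helper. The function name `satisfaction_by_age` and the toy data are hypothetical and serve only as an illustration:

```python
import pandas as pd

# Made-up survey answers
toy = pd.DataFrame({
    "age": [22, 27, 31, 38, 44, 52, 58, 61],
    "sat_colleques": [4, 5, 3, 4, 4, 2, 5, 5],
})

def satisfaction_by_age(data, bins, normalize=False):
    """Cross-tabulate satisfaction against age brackets built from `bins`."""
    brackets = pd.cut(data["age"], bins=bins)
    return pd.crosstab(data["sat_colleques"], brackets, normalize=normalize)

# Counts vs. percentages within each age bracket
counts = satisfaction_by_age(toy, bins=[15, 40, 65])
shares = satisfaction_by_age(toy, bins=[15, 40, 65], normalize="columns")

# Altered age brackets
narrow = satisfaction_by_age(toy, bins=[15, 30, 45, 65])

# A different angle: the age distribution within each satisfaction level
by_level = satisfaction_by_age(toy, bins=[15, 40, 65], normalize="index")

print(counts, shares, narrow, by_level, sep="\n\n")
```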


Right now we can see that the age bracket [15-20] has only one person who answered the questionnaire. While this is an interesting finding in itself, it is not relevant in the context of the bigger picture. Furthermore, if this were an actual study, it could jeopardize the anonymity of that one participant.

-   *Suggestion: We can showcase this phenomenon in order to prove a point. No further use here.*
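
One way to spot such brackets before publishing a table is to count the respondents per bracket and flag the small ones. This is only a sketch: the threshold `MIN_GROUP_SIZE` and the ages are made up.

```python
import pandas as pd

MIN_GROUP_SIZE = 5  # hypothetical threshold, not taken from the survey

# Made-up ages with a single respondent in the youngest bracket
toy_ages = pd.Series([19, 23, 24, 27, 28, 29, 33, 35, 36, 38, 39])

brackets = pd.cut(toy_ages, bins=[15, 20, 30, 40])
group_sizes = brackets.value_counts(sort=False)

# Brackets with too few respondents to report separately
too_small = group_sizes[group_sizes < MIN_GROUP_SIZE]
print(too_small)
```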

We can suggest that in the age bracket [20-30] over 50% of the employees are satisfied with their peers.

-   *Suggestion: What if the age-brackets were wider? How would that alter the results?*


In [5]:
from scipy.stats import chi2_contingency

df2 = pd.crosstab(df["sat_management"], df["age_class"])

chi2_contingency(df2)
#df2

Chi2ContingencyResult(statistic=2.4308386138820923, pvalue=0.6570615260014371, dof=4, expected_freq=array([[ 4.69512195,  2.30487805],
       [10.73170732,  5.26829268],
       [20.12195122,  9.87804878],
       [15.42682927,  7.57317073],
       [ 4.02439024,  1.97560976]]))

In [6]:
from scipy.stats import chi2_contingency

df2 = pd.crosstab(df["sat_colleques"], df["age_class"])
print(df2)
chi2_contingency(df2)


age_class      (15, 40]  (40, 65]
sat_colleques                    
2.0                   1         2
3.0                  12         4
4.0                  25        10
5.0                  17        10


Chi2ContingencyResult(statistic=2.516583416583417, pvalue=0.472301542245824, dof=3, expected_freq=array([[ 2.03703704,  0.96296296],
       [10.86419753,  5.13580247],
       [23.7654321 , 11.2345679 ],
       [18.33333333,  8.66666667]]))

In [7]:
from scipy.stats import chi2_contingency

df2 = pd.crosstab(df["sat_tasks"], df["age_class"])
chi2_contingency(df2)


Chi2ContingencyResult(statistic=4.535429234877511, pvalue=0.33836713640132454, dof=4, expected_freq=array([[ 3.35365854,  1.64634146],
       [10.06097561,  4.93902439],
       [19.45121951,  9.54878049],
       [16.76829268,  8.23170732],
       [ 5.36585366,  2.63414634]]))

In [8]:
from scipy.stats import chi2_contingency

df2 = pd.crosstab(df["gender"], df["sat_salary"])
chi2_contingency(df2)
#df2




Chi2ContingencyResult(statistic=15.086918785533744, pvalue=0.00452429558455757, dof=4, expected_freq=array([[25.35365854, 14.59756098, 14.59756098,  7.68292683,  0.76829268],
       [ 7.64634146,  4.40243902,  4.40243902,  2.31707317,  0.23170732]]))

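It appears that there is a significant dependency between gender and salary satisfaction (p ≈ 0.0045). Several of the expected frequencies are below 5, however, so the chi-square approximation should be read with some caution.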
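
A small sketch of how the test result can be unpacked and how the expected frequencies can be checked. The table is made up. The 20% limit is a common rule of thumb, not a hard requirement:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2 x 3 table of counts
table = np.array([[20, 15, 1],
                  [5, 6, 2]])

# The result unpacks into statistic, p-value, degrees of freedom and expected counts
stat, p, dof, expected = chi2_contingency(table)

# Share of cells whose expected count is below 5
small_share = (expected < 5).mean()

print(f"statistic={stat:.3f}, p={p:.3f}, dof={dof}")
print(f"{small_share:.0%} of the expected counts are below 5")
if small_share > 0.2:
    print("Many small expected counts: treat the p-value with caution")
```

With many small expected counts, merging sparse categories before testing is one possible remedy.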

In [9]:
from scipy.stats import chi2_contingency

df2 = pd.crosstab(df["gender"], df["sat_workingEnvironment"])
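df2.index = ["Male", "Female"]
# Note: the "Total" row added here is part of the table passed to the test below,
# which raises dof from 4 to 8; run the test before adding the row for a valid p-value.
df2.loc["Total"] = df2.sum()

df2

chi2_contingency(df2)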


Chi2ContingencyResult(statistic=13.466769837784332, pvalue=0.0967669427352795, dof=8, expected_freq=array([[ 6.91463415,  6.91463415, 23.04878049, 17.67073171,  8.45121951],
       [ 2.08536585,  2.08536585,  6.95121951,  5.32926829,  2.54878049],
       [ 9.        ,  9.        , 30.        , 23.        , 11.        ]]))

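It appears that there is a significant dependency between gender and satisfaction with the working environment, but the output above does not show this directly: the "Total" row was part of the tested table, which raises dof from 4 to 8 and yields p ≈ 0.097. The Total row contributes nothing to the statistic, because its observed counts equal its expected counts. With the correct dof of 4, the same statistic of about 13.47 corresponds to p ≈ 0.009.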
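
The effect of a totals row on the test can be checked on a small made-up table. In this sketch the statistic stays the same while the degrees of freedom, and therefore the p-value, change:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2 x 3 table of observed counts
observed = np.array([[10, 20, 30],
                     [20, 20, 10]])

# The same table with a row of column totals appended
with_total = np.vstack([observed, observed.sum(axis=0)])

stat, p, dof, _ = chi2_contingency(observed)
stat_t, p_t, dof_t, _ = chi2_contingency(with_total)

print(f"without totals: statistic={stat:.4f}, dof={dof}, p={p:.4f}")
print(f"with totals:    statistic={stat_t:.4f}, dof={dof_t}, p={p_t:.4f}")
```

Keeping the totals in a separate display table and passing only the observed counts to `chi2_contingency` avoids the problem.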

In [10]:
from scipy.stats import chi2_contingency

df2 = pd.crosstab(df["gender"], df["sat_tasks"])


chi2_contingency(df2)

Chi2ContingencyResult(statistic=3.5888454853609115, pvalue=0.464498388739787, dof=4, expected_freq=array([[ 3.84146341, 11.52439024, 22.2804878 , 19.20731707,  6.14634146],
       [ 1.15853659,  3.47560976,  6.7195122 ,  5.79268293,  1.85365854]]))