# Analysis of StackOverflow Survey. Part IV

This notebook addresses the third question and builds a model to predict job satisfaction for data coders.

The steps of the process are:
1. 
2. 

In [1]:
# general packages and libraries
import os
import sys
from collections import defaultdict
import importlib

In [2]:
# data manipulation packages
import numpy as np
import pandas as pd

In [3]:
# data visualization packages
import matplotlib.pyplot as plt
# to render plots in the notebook
%matplotlib inline

import seaborn as sns
# set a theme for seaborn
sns.set_theme()

In [4]:
# clean this up in the end

from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

from sklearn import (
    ensemble,
    preprocessing,
    tree,
)
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
)
from sklearn.metrics import (
    classification_report,
    r2_score, 
    mean_squared_error,
    auc,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve,
)


In [5]:
# import local module containing the necessary functions
import utils_functions as uf

# forces the interpreter to re-load the module
importlib.reload(uf);

# create a path string
mypath = os.getcwd()

## State the question
This notebook addresses the third question: what can we tell about the job satisfaction of a data coder, and which factors influence it? We also want to predict the job satisfaction of a developer who works with big data.

This is a classification question: we predict a satisfaction level for a data developer, a group that includes data scientists or machine learning specialists, data or business analysts, and data engineers.

## Performance metrics - to review at the end

The following performance measures will be used in this project:
1. Cross-validation via StratifiedKFold with 10 folds.
2. Confusion matrix, in particular precision, recall and F1 score.
3. The ROC curve and the related AUC score.
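
A minimal sketch of how these three measures could be computed together. It assumes a synthetic binary target and uses `KNeighborsClassifier` as a stand-in model; neither is the final model nor the survey data, and a multi-level satisfaction target would need a one-vs-rest variant of the ROC AUC.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in data (assumption: not the survey data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=42
)

# 10 stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
model = KNeighborsClassifier(n_neighbors=5)

# out-of-fold class predictions and out-of-fold probabilities for class 1
y_pred = cross_val_predict(model, X_demo, y_demo, cv=cv)
y_prob = cross_val_predict(
    model, X_demo, y_demo, cv=cv, method='predict_proba'
)[:, 1]

# precision, recall and F1 score per class
print(classification_report(y_demo, y_pred))

# area under the ROC curve
auc_score = roc_auc_score(y_demo, y_prob)
print('ROC AUC: %.3f' % auc_score)
```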

# Gather and prepare the data

Load the data and keep the subset of developers who work in data science related fields.


## Load the data

In [27]:
# load the data files as pandas dataframes
df = pd.read_csv(mypath+'/data/survey20_updated.csv', index_col=[0])
dfs = pd.read_csv(mypath+'/data/survey20_results_schema.csv')

In [28]:
# check for success
df.shape

(64461, 62)

## Remove unnecessary data

### Keep the developers that work with data

In [29]:
# the data frame that contains the data developers only
df1 = df[df.DevClass == 'data_coder']

# check for success
df1.shape

(8726, 62)

### Retain the developers that are employed

In [30]:
# check the employment types for data coders
df1.Employment.value_counts()

Employed full-time                                      6904
Independent contractor, freelancer, or self-employed    1080
Not employed, but looking for work                       388
Employed part-time                                       354
Name: Employment, dtype: int64

In [31]:
# retain only the employed data developers
df1 = df1[df1['Employment'] != 'Not employed, but looking for work']

# check for success
df1.Employment.value_counts()

Employed full-time                                      6904
Independent contractor, freelancer, or self-employed    1080
Employed part-time                                       354
Name: Employment, dtype: int64

### Remove unnecessary columns

In [32]:
cols_del = [
    # personal, demographics  information
    'Respondent', 'MainBranch', 'Hobbyist', 'Country',
    'Ethnicity', 'Gender', 'Sexuality', 'Trans',
    
    # related to ConvertedComp
    'CompFreq', 'CompTotal', 'CurrencyDesc', 'CurrencySymbol',
    
    # questions regarding future activities
    'DatabaseDesireNextYear', 'MiscTechDesireNextYear',
    'CollabToolsDesireNextYear', 'PlatformDesireNextYear',
    'LanguageDesireNextYear', 'WebframeDesireNextYear',
    
    # questions regarding this survey
    'SurveyEase', 'SurveyLength', 'WelcomeChange',
    
    # questions regarding participation in StackOverflow
    'SOSites', 'SOComm', 'SOPartFreq',
    'SOVisitFreq', 'SOAccount',

    # related to other columns
    'Age1stCode', 'YearsCodePro',

    # high cardinality, multiple choices
    'DatabaseWorkedWith','MiscTechWorkedWith',
    'WebframeWorkedWith',

    # questions not relevant to our goal
    'JobHunt', 'JobHuntResearch', 'Stuck',
    'PurchaseResearch', 'PurchaseWhat',
    'PurpleLink',
    'OffTopic', 'OtherComms',
    'JobFactors', 'JobSeek',

    # auxiliary columns
    'DevClass']

In [33]:
# drop all the columns in the list
# (reassign instead of modifying the filtered frame in place)
df1 = df1.drop(columns=cols_del)

# check the output
df1.shape

(8338, 20)

### Check and update data types

In [34]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8338 entries, 21 to 64446
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Age                    6479 non-null   float64
 1   ConvertedComp          5810 non-null   float64
 2   DevType                8338 non-null   object 
 3   EdLevel                8208 non-null   object 
 4   Employment             8338 non-null   object 
 5   JobSat                 7735 non-null   float64
 6   LanguageWorkedWith     7835 non-null   object 
 7   CollabToolsWorkedWith  7379 non-null   object 
 8   DevOps                 7323 non-null   object 
 9   DevOpsImpt             7130 non-null   object 
 10  EdImpt                 7959 non-null   object 
 11  Learn                  7640 non-null   object 
 12  OnboardGood            7317 non-null   object 
 13  Overtime               7424 non-null   object 
 14  OpSys                  7760 non-null   object 
 15  Or

In [41]:
# replace the string categories in YearsCode with numeric strings
replace_dict = {'Less than 1 year': '0', 'More than 50 years': '51'}
df1['YearsCode'] = df1['YearsCode'].replace(replace_dict)

In [42]:
# change dtype to numeric
df1['YearsCode'] = pd.to_numeric(df1['YearsCode'])
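
The two cells above can be checked on a toy column. The sketch below assumes a hand-made `years` series rather than the survey data, and it adds `errors='coerce'` (not used above) so that any unexpected string becomes `NaN` instead of raising an error.

```python
import pandas as pd

# toy stand-in for the YearsCode column (assumption: not the survey data)
years = pd.Series(
    ['Less than 1 year', '5', 'More than 50 years', None, '23'],
    name='YearsCode',
)

# map the two text categories to numeric strings, as in the cell above
replace_dict = {'Less than 1 year': '0', 'More than 50 years': '51'}

# convert to a numeric dtype; missing entries stay missing
years_num = pd.to_numeric(years.replace(replace_dict), errors='coerce')

print(years_num.dtype)
print(years_num.tolist())
```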

In [49]:
df1.DevType.iloc[3]

'Data or business analyst;Data scientist or machine learning specialist'

In [55]:
df1['DevType'] = df1['DevType'].str.split(';')


In [56]:
df1.head()

Unnamed: 0,Age,ConvertedComp,DevType,EdLevel,Employment,JobSat,LanguageWorkedWith,CollabToolsWorkedWith,DevOps,DevOpsImpt,EdImpt,Learn,OnboardGood,Overtime,OpSys,OrgSize,PlatformWorkedWith,UndergradMajor,WorkWeekHrs,YearsCode
21,,,"[Developer, full-stack, Data engineer]",Bachelor’s degree,Employed full-time,2.0,Java;Python,,Not sure,,Very important,Every few months,Yes,Often: 1-2 days per week or more,Windows,500 to 999 employees,,Computer science,50.0,10.0
24,,,"[Developer, back-end, Developer, full-stack, D...",Associate degree,Employed full-time,3.0,Bash/Shell/PowerShell;C,Jira;Github;Gitlab,No,Extremely important,Critically important,Once every few years,Yes,Often: 1-2 days per week or more,Windows,100 to 499 employees,Docker;Linux;Windows,Computer science,40.0,23.0
29,,38778.0,"[Data or business analyst, Database administra...",Bachelor’s degree,Employed full-time,2.0,Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;P...,Confluence;Jira;Github,No,Extremely important,Somewhat important,,Onboarding? What onboarding?,Occasionally: 1-2 days per quarter but less th...,Windows,,MacOS;Windows,Information system,37.0,4.0
35,34.0,77556.0,"[Data or business analyst, Data scientist or m...",College study/no degree,Employed full-time,4.0,C#;Go;HTML/CSS;Java;JavaScript;Python;R;SQL,Confluence;Jira;Github;Slack;Trello,Not sure,Neutral,Somewhat important,Every few months,Yes,Sometimes: 1-2 days per month but less than we...,Windows,"1,000 to 4,999 employees",MacOS;Windows,Computer science,40.0,4.0
43,32.0,55893.0,"[Data or business analyst, Developer, back-end...",Master’s degree,Employed full-time,3.0,HTML/CSS;Python;R;SQL;VBA,Github;Slack;Trello,No,Somewhat important,Very important,Once every few years,No,Often: 1-2 days per week or more,Windows,10 to 19 employees,Windows,Engineering other,45.0,10.0


In [57]:
df1=df1.explode('DevType')

In [59]:
df1.shape

(40172, 20)

In [58]:
df1.head()

Unnamed: 0,Age,ConvertedComp,DevType,EdLevel,Employment,JobSat,LanguageWorkedWith,CollabToolsWorkedWith,DevOps,DevOpsImpt,EdImpt,Learn,OnboardGood,Overtime,OpSys,OrgSize,PlatformWorkedWith,UndergradMajor,WorkWeekHrs,YearsCode
21,,,"Developer, full-stack",Bachelor’s degree,Employed full-time,2.0,Java;Python,,Not sure,,Very important,Every few months,Yes,Often: 1-2 days per week or more,Windows,500 to 999 employees,,Computer science,50.0,10.0
21,,,Data engineer,Bachelor’s degree,Employed full-time,2.0,Java;Python,,Not sure,,Very important,Every few months,Yes,Often: 1-2 days per week or more,Windows,500 to 999 employees,,Computer science,50.0,10.0
24,,,"Developer, back-end",Associate degree,Employed full-time,3.0,Bash/Shell/PowerShell;C,Jira;Github;Gitlab,No,Extremely important,Critically important,Once every few years,Yes,Often: 1-2 days per week or more,Windows,100 to 499 employees,Docker;Linux;Windows,Computer science,40.0,23.0
24,,,"Developer, full-stack",Associate degree,Employed full-time,3.0,Bash/Shell/PowerShell;C,Jira;Github;Gitlab,No,Extremely important,Critically important,Once every few years,Yes,Often: 1-2 days per week or more,Windows,100 to 499 employees,Docker;Linux;Windows,Computer science,40.0,23.0
24,,,DevOps specialist,Associate degree,Employed full-time,3.0,Bash/Shell/PowerShell;C,Jira;Github;Gitlab,No,Extremely important,Critically important,Once every few years,Yes,Often: 1-2 days per week or more,Windows,100 to 499 employees,Docker;Linux;Windows,Computer science,40.0,23.0
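
Exploding produces one row per respondent and role, which is why the row count grows from 8338 to 40172 and why the index values 21 and 24 repeat above. A minimal sketch, on a toy frame, of how the exploded rows could then be restricted to the data roles. The `data_roles` list is an assumption based on the three roles named in the question statement and the labels visible in the output above.

```python
import pandas as pd

# toy stand-in for df1 (assumption: not the survey data)
toy = pd.DataFrame(
    {
        'DevType': [
            'Developer, full-stack;Data engineer',
            'Data or business analyst;Data scientist or machine learning specialist',
        ],
        'JobSat': [2.0, 4.0],
    },
    index=[21, 35],
)

# assumed labels for the data roles
data_roles = [
    'Data engineer',
    'Data or business analyst',
    'Data scientist or machine learning specialist',
]

# one row per (respondent, role); the original index value repeats
exploded = toy.assign(DevType=toy['DevType'].str.split(';')).explode('DevType')

# keep only the rows that describe a data role
data_only = exploded[exploded['DevType'].isin(data_roles)]

print(exploded.shape, data_only.shape)
print(data_only.index.tolist())
```

Because one respondent now occupies several rows with the same `JobSat` value, a purely random row split could place copies of the same respondent in both the training and the test set; splitting by the original index may be worth considering.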


In [None]:
# TODO: explode DevType, LanguageWorkedWith, CollabToolsWorkedWith, PlatformWorkedWith
# for DevType: explode, then drop every row whose role is not a data role
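
For the other multi-choice columns in the note above, exploding each one would multiply the rows again. One possible alternative is to turn the semicolon-separated answers into indicator columns, which keeps one row per respondent. A minimal sketch using `Series.str.get_dummies` on a toy column; the `Lang_` prefix is an illustrative choice, not something defined in this notebook.

```python
import pandas as pd

# toy stand-in for one multi-choice column (assumption: not the survey data)
toy = pd.DataFrame(
    {
        'LanguageWorkedWith': [
            'Java;Python',
            'Bash/Shell/PowerShell;C',
            None,
            'Python;R;SQL',
        ]
    },
    index=[21, 24, 29, 35],
)

# one 0/1 column per distinct answer; a missing answer becomes an all-zero row
lang = toy['LanguageWorkedWith'].str.get_dummies(sep=';').add_prefix('Lang_')

# replace the text column with the indicator columns
encoded = toy.drop(columns='LanguageWorkedWith').join(lang)

print(encoded.shape)
print(encoded['Lang_Python'].tolist())
```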

## Sample the data

### Create features and target datasets

In [17]:
# create a copy of the pre-processed dataset
df2 = df1.copy()

In [18]:
# create the predictors dataframe
X = df2.drop(columns = 'JobSat')

# create the labels
y = df2['JobSat']

# check for success
X.shape, len(y)

((8726, 19), 8726)

### Isolate a test set

In [19]:
# split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# summarize the data
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)

Train (6108, 19) (6108,)
Test (2618, 19) (2618,)
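
The split above is purely random. Since the performance metrics rely on stratified folds, a stratified hold-out split may be a consistent choice. It needs a target without missing values, and `df1.info()` reports missing `JobSat` entries. A minimal sketch on synthetic data; the frame `demo` and the five satisfaction levels 1.0 to 5.0 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in data with some missing labels (assumption)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'WorkWeekHrs': rng.normal(40, 5, size=200),
    'JobSat': rng.choice([1.0, 2.0, 3.0, 4.0, 5.0, np.nan], size=200),
})

# rows without a label cannot be used for supervised training
labelled = demo.dropna(subset=['JobSat'])
X_demo = labelled.drop(columns='JobSat')
y_demo = labelled['JobSat']

# stratify=y_demo keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# compare the class proportions of the two parts
print(y_tr.value_counts(normalize=True).sort_index().round(2))
print(y_te.value_counts(normalize=True).sort_index().round(2))
```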


## Create a profiling report

Create a profiling report for the training data.

In [20]:
# run this once to generate a profiling report and save it as html file

#import pandas_profiling
#profile = pandas_profiling.ProfileReport(X_train, minimal=False)
#profile.to_file(output_file="data_training_report.html")

numerical_col = ['Age', 'ConvertedComp', 'WorkWeekHrs', 'YearsCode']