# Analysis of StackOverflow Survey. Part IV 

In this notebook we build a predictive model for job satisfaction.

In [2]:
# import necessary packages and libraries
import os
from collections import defaultdict

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
# to render plots in the notebook
%matplotlib inline

import seaborn as sns
# set a theme for seaborn
sns.set_theme()

from sklearn.linear_model import LinearRegression

from sklearn import (
    ensemble,
    preprocessing,
    tree,
)
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
)
from sklearn.metrics import (
    r2_score, 
    mean_squared_error,
    auc,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)

In [56]:
# import local module containing the necessary functions
import utils_functions as uf

# forces the interpreter to re-import the module
import importlib
importlib.reload(uf);

## State the question
I am addressing the third question in this notebook: what can we tell about the job satisfaction of a data coder, and which factors influence it? I also want to predict the job satisfaction of a developer who works with big data.

This is a classification problem: we predict a satisfaction level for a data developer, a group that includes data scientists or machine learning specialists, data or business analysts, and data engineers.
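To make the classification target concrete, here is a minimal sketch of how the satisfaction answers could be turned into ordered class labels. The five answer options listed below are my assumption about the 2020 survey schema, and the toy series only stands in for `df['JobSat']`; neither is taken from the cells in this notebook.

```python
import pandas as pd

# assumed JobSat answer options, ordered from least to most satisfied
jobsat_levels = [
    "Very dissatisfied",
    "Slightly dissatisfied",
    "Neither satisfied nor dissatisfied",
    "Slightly satisfied",
    "Very satisfied",
]

# toy responses standing in for df['JobSat'] (made up for illustration)
toy_jobsat = pd.Series(
    ["Very satisfied", "Slightly dissatisfied", None, "Very satisfied"]
)

# ordered categorical: the order of the levels is preserved
jobsat = pd.Categorical(toy_jobsat, categories=jobsat_levels, ordered=True)

# integer class labels 0..4; a missing answer is coded as -1
labels = pd.Series(jobsat.codes)
```

Rows coded `-1` have no answer and would need to be dropped or handled before fitting a classifier.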

## Performance metric

## Gather the data

Load the data and keep only the subset of developers who work in data science related fields.

In [72]:
# create a path string
mypath = os.getcwd()

# load the data file as a pandas dataframe
df = pd.read_csv(os.path.join(mypath, 'data', 'survey20_results_public.csv'))

# check the uploaded data
df.shape

(64461, 61)

## Data preprocessing 

In [73]:
# rename the data engineer string in the full dataset
df.DevType = df.DevType.str.replace('Engineer, data', 'Data engineer')

In [74]:
# create column DevClass: 'data_coder' if DevType contains 'Data ', otherwise 'other_coder'
df['DevClass'] = np.where(df['DevType'].str.contains("Data ", na = False),
                          'data_coder', 'other_coder')
df['DevClass'].value_counts()

other_coder    55735
data_coder      8726
Name: DevClass, dtype: int64
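The substring match above can be checked on a small example. This is only a sketch on made-up `DevType` values in the survey's semicolon-separated style. It suggests that the trailing space in `"Data "` keeps roles such as "Database administrator" out of the data group, and that missing values fall into `other_coder`.

```python
import numpy as np
import pandas as pd

# toy DevType values (made up for illustration, not taken from the survey file)
toy = pd.DataFrame({
    "DevType": [
        "Developer, back-end;Data engineer",
        "Developer, front-end",
        "Data scientist or machine learning specialist",
        "Database administrator",
        None,
    ]
})

# literal substring match; na=False sends missing DevType to the 'other' class
is_data = toy["DevType"].str.contains("Data ", regex=False, na=False)
toy["DevClass"] = np.where(is_data, "data_coder", "other_coder")
```

Note that the match is case sensitive, which is why the data engineer string is renamed in the previous cell.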

In [75]:
# the data frame that contains the data developers only
df = df[df.DevClass == 'data_coder']

# check the size of the data
df.shape

(8726, 62)

In [76]:
# create a list of columns to be used in this analysis
list_cols = ['MainBranch', 'ConvertedComp', 'Country',
       'EdLevel', 'Employment', 'JobFactors',
       'JobSat', 'NEWEdImpt',
       'NEWLearn', 'NEWOvertime', 'OpSys', 'OrgSize', 
       'UndergradMajor', 'WorkWeekHrs']

In [77]:
# the dataset that contains only the listed columns
df = df[list_cols]
df.shape

(8726, 14)

In [78]:
# drop the NEW prefix in some of the columns' names
df.columns = [col.replace('NEW', '') for col in df.columns]
df.columns

Index(['MainBranch', 'ConvertedComp', 'Country', 'EdLevel', 'Employment',
       'JobFactors', 'JobSat', 'EdImpt', 'Learn', 'Overtime', 'OpSys',
       'OrgSize', 'UndergradMajor', 'WorkWeekHrs'],
      dtype='object')

In [79]:
# reset the index 
df.reset_index(drop=True, inplace=True)

## Data profiling

In [80]:
# run this once to generate a profiling report and save it as html file

#import pandas_profiling
#profile = pandas_profiling.ProfileReport(df, minimal=False)
#profile.to_file(output_file="data_coders_report.html")

## Remove duplicates

In [81]:
# drop duplicate rows, if any
df.drop_duplicates(subset=None, keep='first', inplace=True)

## Transform categorical data 

In [82]:
# create a list of categorical features
# note: this selection also picks up the target column JobSat
cat_cols = df.select_dtypes(include=['object']).columns
non_cat = ['ConvertedComp', 'WorkWeekHrs']

In [83]:
# encode the categorical variables as dummies, dropping the first level of each feature
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

In [84]:
df.shape


(8722, 419)
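To see what `drop_first=True` does, here is a small sketch on a toy frame. The values are made up, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# toy frame with one categorical and one numerical feature
toy = pd.DataFrame({
    "OpSys": ["Windows", "MacOS", "Linux-based", "Windows"],
    "WorkWeekHrs": [40.0, 45.0, 38.0, 40.0],
})

# one dummy column per level, minus the first level in sorted order
encoded = pd.get_dummies(toy, columns=["OpSys"], drop_first=True)
```

The level `Linux-based` sorts first, so it becomes the baseline and gets no column of its own: a row with zeros in both `OpSys_MacOS` and `OpSys_Windows` stands for it. With the default `dummy_na=False`, a missing value is encoded the same way, as all zeros.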

In [85]:
df.columns

Index(['ConvertedComp', 'WorkWeekHrs',
       'MainBranch_I am a student who is learning to code',
       'MainBranch_I am not primarily a developer, but I write code sometimes as part of my work',
       'MainBranch_I code primarily as a hobby',
       'MainBranch_I used to be a developer by profession, but no longer am',
       'Country_Albania', 'Country_Algeria', 'Country_Andorra',
       'Country_Argentina',
       ...
       'UndergradMajor_A humanities discipline (such as literature, history, philosophy, etc.)',
       'UndergradMajor_A natural science (such as biology, chemistry, physics, etc.)',
       'UndergradMajor_A social science (such as anthropology, psychology, political science, etc.)',
       'UndergradMajor_Another engineering discipline (such as civil, electrical, mechanical, etc.)',
       'UndergradMajor_Computer science, computer engineering, or software engineering',
       'UndergradMajor_Fine arts or performing arts (such as graphic design, music, studio art,

In [86]:
# janitor is provided by the pyjanitor package, which must be installed first
import janitor as jn
X, y = jn.get_features_targets(df, target_columns='JobSat')

ModuleNotFoundError: No module named 'janitor'
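The same split can be done with plain pandas, without the extra dependency. The sketch below is one possible approach, not the cell that was run here. It uses a toy frame, and it separates the target before encoding so that `JobSat` is not turned into dummy columns, which assumes the encoding step is applied to the features only.

```python
import pandas as pd

# toy frame standing in for the data coder subset (values are made up)
toy = pd.DataFrame({
    "JobSat": ["Very satisfied", "Slightly satisfied", "Very satisfied"],
    "OpSys": ["Windows", "MacOS", "Windows"],
    "WorkWeekHrs": [40.0, 45.0, 38.0],
})

target = "JobSat"

# separate the target first, so it is left out of the dummy encoding
y = toy[target]
X = toy.drop(columns=[target])

# encode only the non-numerical feature columns
feature_cat_cols = list(X.select_dtypes(exclude="number").columns)
X = pd.get_dummies(X, columns=feature_cat_cols, drop_first=True)
```

After this step `X` holds only numerical and dummy features, and `y` keeps the original satisfaction labels.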

## Create a training set and a test set

In [None]:
# create the training and test sets
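The training and test sets for this section could be created along the following lines. This is a hedged sketch on synthetic data: the 30% test size, the `random_state` and the two class names are my assumptions, not choices made in this notebook. Stratifying on the target keeps the class proportions the same in both sets, which matters when the satisfaction levels are unbalanced.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic features and an unbalanced target (made up for illustration)
rng = np.random.default_rng(0)
X = pd.DataFrame({"WorkWeekHrs": rng.normal(40, 5, size=100)})
y = pd.Series(["satisfied"] * 70 + ["dissatisfied"] * 30)

# hold out 30% of the rows, preserving the class proportions of y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```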

### Comments: dropping columns

I will build a classification model to predict a developer's level of job satisfaction. Several of the columns have very high cardinality, and a few redundant columns can be dropped at this point. Here is a list of columns we can drop:

### Clean Data

Some preliminary processing was already performed on the data. 