# Analysis of StackOverflow Survey. Part IV 

In this notebook we build a predictive model for job satisfaction.

In [1]:
# import necessary packages and libraries
import os
from collections import defaultdict

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
# to render plots in the notebook
%matplotlib inline

import seaborn as sns
# set a theme for seaborn
sns.set_theme()

from sklearn.linear_model import LinearRegression

from sklearn import (
    ensemble,
    preprocessing,
    tree,
)
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
)
from sklearn.metrics import (
    r2_score, 
    mean_squared_error,
    auc,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)

In [2]:
# import local module containing the necessary functions
import utils_functions as uf

# forces the interpreter to re-import the module
import importlib
importlib.reload(uf);

## State the question
I am addressing the third question in this notebook: what can we tell about the job satisfaction of a data coder, and which factors influence it? I also predict the job satisfaction of a developer who works with big data.

This is a classification problem: we predict a satisfaction level for a data developer, that is, a data scientist or machine learning specialist, a data or business analyst, or a data engineer.
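
To make the classification target concrete, here is a minimal sketch of turning satisfaction answers into ordered class labels. The five level names and their order are an assumption about the `JobSat` column and should be checked against `df.JobSat.unique()`; the `answers` series is toy data.

```python
import pandas as pd

# Assumed ordering of the JobSat answers, from least to most satisfied.
# Verify these labels against df.JobSat.unique() before relying on them.
JOBSAT_LEVELS = [
    "Very dissatisfied",
    "Slightly dissatisfied",
    "Neither satisfied nor dissatisfied",
    "Slightly satisfied",
    "Very satisfied",
]

# toy answers standing in for the JobSat column
answers = pd.Series(["Very satisfied", "Slightly dissatisfied", None, "Very satisfied"])

# an ordered categorical keeps the ranking of the levels
jobsat = pd.Categorical(answers, categories=JOBSAT_LEVELS, ordered=True)

# integer class labels 0..4; missing answers are encoded as -1
codes = pd.Series(jobsat.codes, index=answers.index)
```

Keeping the order in the categorical makes it possible to treat the target either as plain classes or as an ordinal scale later on.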

## Gather the data

Load the data and keep only the subset of developers who work in data science related fields.

In [3]:
# create a path string
mypath = os.getcwd()

# load the data file as a pandas dataframe
df = pd.read_csv(os.path.join(mypath, 'data', 'survey20_results_public.csv'))

# check the uploaded data
df.shape

(64461, 61)

### Create a column to label the developers

In [20]:
# rename the data engineer string in the full dataset
df.DevType = df.DevType.str.replace('Engineer, data', 'Data engineer')

In [21]:
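# create column DevClass with entries data_coder or other_coder, depending on whether DevType contains "Data "
# note: a new column must be created with bracket notation, attribute assignment does not add a column
df['DevClass'] = np.where(df.DevType.str.contains("Data ", na = False), 'data_coder', 'other_coder')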
df.DevClass.value_counts()

other_coder    55735
data_coder      8726
Name: DevClass, dtype: int64
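
The labeling rule can be illustrated on a few toy rows. This is a sketch with made-up `DevType` answers, assuming respondents can list several roles separated by `;`. Note that the trailing space in `"Data "` keeps a role such as "Database administrator" out of the data group.

```python
import numpy as np
import pandas as pd

# toy DevType answers (hypothetical rows, not taken from the survey file)
toy = pd.DataFrame({
    "DevType": [
        "Data scientist or machine learning specialist;Developer, back-end",
        "Developer, front-end",
        "Engineer, data",
        None,
        "Database administrator",
    ]
})

# rename the data engineer role so that it starts with "Data "
toy["DevType"] = toy["DevType"].str.replace("Engineer, data", "Data engineer", regex=False)

# label each row; missing DevType values count as other_coder (na=False)
toy["DevClass"] = np.where(
    toy["DevType"].str.contains("Data ", na=False), "data_coder", "other_coder"
)
```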

In [38]:
# the data frame that contains the data developers only
df = df[df.DevClass == 'data_coder']

# check the size of the data
df.shape

(8726, 62)

In [39]:
# create a list of columns to be used in this analysis
list_cols = ['MainBranch', 'Hobbyist', 'Country',
       'EdLevel', 'Employment', 'Gender', 'JobFactors',
       'JobSat', 'JobSeek', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
       'NEWLearn', 'NEWOvertime',  'NEWPurpleLink', 'NEWStuck', 'OpSys', 'OrgSize', 
       'UndergradMajor', 'WorkWeekHrs']

In [41]:
# the dataset that contains only the listed columns
df = df[list_cols]
df.shape

(8726, 20)

In [42]:
# drop the NEW prefix from some of the column names
df.columns = [col.replace('NEW', '') for col in df.columns]
df.columns

Index(['MainBranch', 'Hobbyist', 'Country', 'EdLevel', 'Employment', 'Gender',
       'JobFactors', 'JobSat', 'JobSeek', 'EdImpt', 'JobHunt',
       'JobHuntResearch', 'Learn', 'Overtime', 'PurpleLink', 'Stuck', 'OpSys',
       'OrgSize', 'UndergradMajor', 'WorkWeekHrs'],
      dtype='object')
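
A small caveat on the renaming step: `str.replace` removes `NEW` wherever it occurs in a name, not only at the start. That is harmless for the columns listed here, but a prefix-only variant is a bit safer. The sketch below uses hypothetical column names, and `RENEWAL` is made up to show the difference.

```python
import pandas as pd

# hypothetical column names; 'RENEWAL' is invented for illustration
cols = pd.Index(["NEWJobHunt", "NEWLearn", "JobSat", "RENEWAL"])

# str.replace removes 'NEW' wherever it occurs in the name
replaced_anywhere = [col.replace("NEW", "") for col in cols]

# removing only a leading 'NEW' leaves the other names untouched
prefix_only = [col[len("NEW"):] if col.startswith("NEW") else col for col in cols]
```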

In [None]:
# create test and train sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42)
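
Since the target is a class label, it may be worth keeping the class proportions the same in the train and test sets. The following is a minimal sketch on toy data, not the split used in this notebook: `X_toy`, `y_toy` and the three class names are hypothetical, and `stratify` is the `train_test_split` parameter that does the balancing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy feature and an imbalanced three-class target (hypothetical data)
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"WorkWeekHrs": rng.normal(40, 5, size=100)})
y_toy = pd.Series(["satisfied"] * 60 + ["neutral"] * 10 + ["dissatisfied"] * 30)

# stratify=y_toy keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

# class counts in the test part, to compare with the full target
test_counts = y_te.value_counts().to_dict()
```

Without `stratify`, a rare class such as `neutral` could end up under-represented in the test set purely by chance.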


In [35]:
df4 = df3[list_cols]

### Comments: dropping columns

I will build a classification model to predict a developer's level of job satisfaction. Several columns have very high cardinality and a few others are redundant, so they can be dropped at this point. Here is the list of columns to drop:

In [20]:
# list of columns to drop
drop_cols = ['Unnamed: 0', 'Respondent', 'ConvertedComp','DevType', 'DevClass',
            'DatabaseWorkedWith', 'LanguageWorkedWith', 'PlatformWorkedWith',
            'WebframeWorkedWith', 'CollabToolsWorkedWith', 'DevOps', 'DevOpsImpt']

# drop columns in the list
df3 = df3.drop(drop_cols, axis = 1)

# reset the index also
df3.reset_index(drop=True, inplace=True)

# check the shape of the new dataset
df3.shape

(8706, 21)
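
The high-cardinality claim above can be checked by counting distinct values per column. This is a sketch on a toy frame with hypothetical values, and the cutoff of 3 is arbitrary and only serves the example.

```python
import pandas as pd

# toy frame standing in for the survey subset (hypothetical values)
toy = pd.DataFrame({
    "Hobbyist": ["Yes", "No", "Yes", "Yes"],
    "LanguageWorkedWith": ["Python;SQL", "R", "Python;R;SQL", "Scala;Python"],
    "Country": ["Germany", "India", "India", "Brazil"],
})

# number of distinct non-missing values per column, largest first
cardinality = toy.nunique().sort_values(ascending=False)

# hypothetical cutoff: flag columns with more than 3 distinct values
high_cardinality = cardinality[cardinality > 3].index.tolist()
```

Multi-select columns such as the `...WorkedWith` ones tend to score high here because every combination of answers counts as its own value.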

### Generate a profiling report

In [21]:
# run this once to generate the report and save it as html file

import pandas_profiling
profile = pandas_profiling.ProfileReport(df3, minimal=False)
profile.to_file(output_file="data_coders_report.html")

### Clean Data

Some preliminary processing was already performed on the data. 