# Analysis of StackOverflow Survey. Part IV 

In [1]:
# import necessary packages and libraries
import os
from collections import defaultdict

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
# to render plots in the notebook
%matplotlib inline

import seaborn as sns
# set a theme for seaborn
sns.set_theme()

from sklearn.linear_model import LinearRegression

from sklearn import (
    ensemble,
    preprocessing,
    tree,
)
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
)
from sklearn.metrics import (
    r2_score, 
    mean_squared_error,
    auc,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)

In [2]:
# import local module containing the necessary functions
import utils_functions as uf

# forces the interpreter to re-import the module
import importlib
importlib.reload(uf);

## 1.1  State the question and outline
I am addressing the third question in this notebook: what can we tell about the job satisfaction of a data coder? First, which factors influence it? Second, can we predict the job satisfaction of a developer who works with big data?

This is a classification problem: we are predicting the satisfaction level of a data developer.
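
Because the target is a set of discrete satisfaction levels, it can help to look at how the classes are distributed before modelling. The snippet below is a minimal sketch on a small made-up series; the labels and counts are illustrative assumptions, not values taken from the survey file.

```python
import pandas as pd

# Toy target column; labels and frequencies are hypothetical
job_sat = pd.Series(
    ["Very satisfied", "Slightly satisfied", "Very satisfied",
     "Very dissatisfied", "Very satisfied"],
    name="JobSat",
)

# Share of each satisfaction level (fractions that sum to 1)
class_share = job_sat.value_counts(normalize=True)
print(class_share)
```

A strongly unbalanced distribution would be one reason to prefer stratified splits, such as the `StratifiedKFold` imported above.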

## Gather the data

Load the processed data and keep only the subset of developers who work in data-science-related fields.

In [16]:
# path to the current working directory
mypath = os.getcwd()

# load the data file as a pandas dataframe
df1 = pd.read_csv(os.path.join(mypath, 'data', 'survey20_updated.csv'))

# check the shape of the loaded data
df1.shape

(60430, 33)

In [17]:
# the data frame that contains only the data developers
df3 = df1[df1.DevClass == 'data_coder']

# check the size of the data
df3.shape

(8706, 33)

In [18]:
# the columns in the dataset
df3.columns

Index(['Unnamed: 0', 'Respondent', 'MainBranch', 'Hobbyist', 'ConvertedComp',
       'Country', 'DatabaseWorkedWith', 'DevType', 'EdLevel', 'Employment',
       'Gender', 'JobFactors', 'JobSat', 'JobSeek', 'LanguageWorkedWith',
       'CollabToolsWorkedWith', 'DevOps', 'DevOpsImpt', 'EdImpt', 'JobHunt',
       'JobHuntResearch', 'Learn', 'Overtime', 'PurpleLink', 'Stuck', 'OpSys',
       'OrgSize', 'PlatformWorkedWith', 'UndergradMajor', 'WebframeWorkedWith',
       'WorkWeekHrs', 'DevClass', 'imputedComp'],
      dtype='object')

### Comments: dropping columns

I will build a classification model to predict a developer's level of job satisfaction. Several columns have very high cardinality, and a few others are redundant, so both groups can be dropped at this point. Here is the list of columns to drop:

In [20]:
# list of columns to drop
drop_cols = ['Unnamed: 0', 'Respondent', 'ConvertedComp','DevType', 'DevClass',
            'DatabaseWorkedWith', 'LanguageWorkedWith', 'PlatformWorkedWith',
            'WebframeWorkedWith', 'CollabToolsWorkedWith', 'DevOps', 'DevOpsImpt']

# drop columns in the list
df3 = df3.drop(drop_cols, axis = 1)

# reset the index also
df3.reset_index(drop=True, inplace=True)

# check the shape of the new dataset
df3.shape

(8706, 21)
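
One way to check which columns have high cardinality is to count the distinct values in each column. The sketch below uses a tiny made-up data frame rather than the survey data, and the threshold `max_levels` is a hypothetical choice for illustration only.

```python
import pandas as pd

# Toy stand-in for the survey data; values are illustrative only
toy = pd.DataFrame({
    "Hobbyist": ["Yes", "No", "Yes", "Yes"],
    "LanguageWorkedWith": ["Python;SQL", "R;SQL", "Python;R;SQL", "Scala"],
    "JobSat": ["Very satisfied", "Slightly satisfied",
               "Very satisfied", "Very dissatisfied"],
})

# Number of distinct values per column, highest first
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)

# Columns whose distinct-value count exceeds a hypothetical threshold
max_levels = 3
high_card_cols = cardinality[cardinality > max_levels].index.tolist()
print(high_card_cols)
```

Multi-select columns such as `LanguageWorkedWith` tend to score high on this count because each combination of answers is a separate string.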

### Generate a profiling report

In [21]:
# run this once to generate the report and save it as html file

import pandas_profiling
profile = pandas_profiling.ProfileReport(df3, minimal=False)
profile.to_file(output_file="data_coders_report.html")


### Clean Data

Some preliminary processing was already performed on the data. 