Question
========

What is the best developer job like (depending on your own definition
of "best")?

Dataset
=======

The dataset provided by [StackOverflow on
Kaggle](https://www.kaggle.com/stackoverflow/so-survey-2017/data) seems
to be a great start. It contains about fifty thousand responses from a
sample of the active StackOverflow population, covering 154 questions.
This should give us good insight into what makes a programmer unique,
and help us answer a lot of interesting questions.

Project
=======

Exploratory
-----------

Check the distributions of all useful features, outliers, quantiles.
Questions we could answer with the exploration:

-   Does salary equate to happiness/fulfillment in your job?

-   For users not satisfied with their job, what should they change to
    be more satisfied (use closest correlated neighbor)?

-   How much is job satisfaction linked to education?

-   Are "gif" people more satisfied with their job compared to "jif"
    people?
    
Metric
--------------
Derive a metric to measure happiness/fulfillment.
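
One possible shape for such a metric, as a minimal sketch only: average whichever of the two satisfaction answers a respondent gave, rescaled to [0, 1]. The function name `happiness_score` and the 0 to 10 answer scale are assumptions made for illustration, not something fixed by the project.

```python
import numpy as np

def happiness_score(job_sat, career_sat, scale_max=10.0):
    """Average the available satisfaction answers, rescaled to [0, 1].

    Assumption: both answers use a 0..scale_max scale and missing answers are NaN.
    """
    answers = np.array([job_sat, career_sat], dtype=float) / scale_max
    present = ~np.isnan(answers)
    counts = present.sum(axis=0)
    totals = np.where(present, answers, 0.0).sum(axis=0)
    # Respondents with no answer at all stay NaN instead of silently becoming 0.
    return np.where(counts > 0, totals / np.maximum(counts, 1), np.nan)

# Three toy respondents: both answers, career answer only, no answer.
scores = happiness_score([8, np.nan, np.nan], [6, 9, np.nan])
```

Averaging only the answers that are present keeps respondents who skipped one of the two questions, which matches the row filter's rule of dropping a professional only when both answers are missing.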

Pre-processing
--------------

Clean the data, categorize values, check their distributions, select
columns, and remove bad values if needed.

Feature Extraction
------------------

Use PCA to check which features explain the most variance.
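
As a rough illustration of what "explaining the most variance" means, the sketch below computes the explained-variance ratio of each principal component with a plain SVD. The toy matrix and the helper name `explained_variance_ratio` are hypothetical. Note that PCA ranks components (linear combinations of features), so mapping the result back to individual survey columns would still require inspecting the component loadings.

```python
import numpy as np

def explained_variance_ratio(X):
    """Return the share of total variance carried by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Singular values of the centered data give the component variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2 / (len(X) - 1)
    return variances / variances.sum()

rng = np.random.default_rng(0)
base = rng.normal(size=200)
# Toy data: column 2 is nearly a copy of column 1, column 3 is small independent noise.
X = np.column_stack([base,
                     base + 0.01 * rng.normal(size=200),
                     0.1 * rng.normal(size=200)])
ratios = explained_variance_ratio(X)
```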

Graph Analysis
--------------

The graph will be built the following way:

-   Users will be the nodes

-   Correlations (with a threshold) in-between users used as edges
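
The construction above could be sketched as follows, assuming each user is already encoded as a numeric feature row. The helper name `correlation_graph` and the threshold of 0.8 are placeholders chosen for the example.

```python
import numpy as np

def correlation_graph(features, threshold=0.8):
    """Adjacency matrix linking users whose feature rows correlate above threshold."""
    corr = np.corrcoef(np.asarray(features, dtype=float))  # each row is one user
    adjacency = np.where(corr >= threshold, corr, 0.0)
    np.fill_diagonal(adjacency, 0.0)  # no self-loops
    return adjacency

# Three toy users: the first two are similar, the third is the opposite of the first.
users = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]
adjacency = correlation_graph(users, threshold=0.8)
```

A dense user-by-user correlation matrix grows quadratically with the number of respondents, so on the full survey a sparse nearest-neighbour graph may be the more practical choice.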

Recommender System
------------------

The idea here would be to be able to recommend which of a set of users
best represents a set of given goals. To do so, we would simply check
which existing node is the closest to the artificial one that we create
for the chosen features a recruiter is looking for.
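
A minimal sketch of that lookup, assuming users are numeric feature rows and that Euclidean distance on standardized features is an acceptable notion of "closest". The helper `closest_users` and the three example columns are hypothetical.

```python
import numpy as np

def closest_users(user_features, target, k=1):
    """Indices of the k users closest (Euclidean) to an artificial target profile."""
    user_features = np.asarray(user_features, dtype=float)
    target = np.asarray(target, dtype=float)
    # Standardize columns so no single feature (e.g. salary) dominates the distance.
    mean = user_features.mean(axis=0)
    std = user_features.std(axis=0)
    std[std == 0] = 1.0
    scaled_users = (user_features - mean) / std
    scaled_target = (target - mean) / std
    distances = np.linalg.norm(scaled_users - scaled_target, axis=1)
    return np.argsort(distances)[:k]

# Hypothetical columns: years of experience, satisfaction (0-10), salary.
users = np.array([[1, 5, 30000], [5, 8, 60000], [10, 9, 90000]])
best = closest_users(users, target=[6, 8, 65000], k=1)
```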


In [None]:
%config InlineBackend.figure_format = 'retina'
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# plotly.plotly moved to the separate chart_studio package in plotly 4; it is not used below
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
from subprocess import check_output
from sklearn.neighbors import kneighbors_graph
from sklearn import preprocessing
import networkx as nx
pd.set_option('display.max_columns', None)

In [None]:
stack = pd.read_csv("data/survey_results_public.csv")
kept_columns = ['Respondent', 'Professional', 'ProgramHobby', 'Country', 'University', 'EmploymentStatus', 'FormalEducation', 'MajorUndergrad', 'CompanySize', 'CompanyType', 'YearsProgram', 'YearsCodedJob', 'DeveloperType', 'WebDeveloperType', 'NonDeveloperType', 'CareerSatisfaction', 'JobSatisfaction', 'PronounceGIF', 'ProblemSolving', 'BuildingThings', 'LearningNewTech', 'BoringDetails', 'JobSecurity', 'DiversityImportant', 'FriendsDevelopers', 'WorkPayCare', 'ChallengeMyself', 'ImportantBenefits', 'ClickyKeys', 'Overpaid', 'TabsSpaces', 'EducationImportant', 'EducationTypes', 'SelfTaughtTypes', 'WorkStart', 'HaveWorkedLanguage', 'WantWorkLanguage', 'IDE', 'AuditoryEnvironment', 'Methodology', 'EquipmentSatisfiedMonitors', 'StackOverflowSatisfaction', 'StackOverflowFoundAnswer', 'StackOverflowCopiedCode', 'StackOverflowWhatDo', 'Gender', 'HighestEducationParents', 'Race', 'Salary', "ExpectedSalary"]
stack = stack[kept_columns]
stack.set_index("Respondent", inplace=True)
stack.head()

In [None]:
# Keep only students and employed professional developers with usable salary data.
# The 5% quantiles are computed once, on the unfiltered data, rather than once per row.
salary_floor = stack.Salary.quantile(0.05)
expected_salary_floor = stack.ExpectedSalary.quantile(0.05)

def row_filter(row):
    if row.Professional not in ["Student",
                                "Professional developer"]:
        return False
    if row.Professional == "Professional developer":
        if row.EmploymentStatus not in ['Employed part-time',
                                        'Employed full-time',
                                        'Independent contractor, freelancer, or self-employed']:
            return False
        # After checking salary values, we decided to remove the first 5%
        # quantile as they were mostly outliers (values between 0 and 100)
        if pd.isnull(row.Salary) or row.Salary < salary_floor:
            return False
        # Drop professionals who answered neither satisfaction question
        if pd.isnull(row.JobSatisfaction) and pd.isnull(row.CareerSatisfaction):
            return False
    else:
        # Students: same 5% floor, applied to the expected salary
        if pd.isnull(row.ExpectedSalary) or row.ExpectedSalary < expected_salary_floor:
            return False
    return True
    
stack = stack[stack.apply(row_filter, axis=1)]
prof_stack = stack[stack.Professional == "Professional developer"]
stud_stack = stack[stack.Professional == "Student"]

In [None]:
metadata = pd.read_csv("data/survey_results_schema.csv")
metadata

# Exploratory Analysis

In this section we explore several columns of the dataframe to get an idea of what the surveyed population looks like.

In [None]:
profCount = prof_stack.count()
studCount = stud_stack.count()

## Professional

In [None]:
stack['Professional'].value_counts()[0:10].plot(kind='bar',figsize=(10,8))
plt.show()

TODO

In [None]:
def plot_stud_prof(prof=None, stud=None, column=None, title=""):
    """Bar-plot the 10 most common professional answers next to the student shares."""
    if column is not None:
        prof = prof_stack[column]
        stud = stud_stack[column]
    p = prof.value_counts(normalize=True)[:10]
    v_stud = stud.value_counts(normalize=True)
    # reindex (rather than .loc) so answers no student gave become NaN instead of raising
    s = v_stud.reindex(p.index)
    df = pd.DataFrame([p, s])
    df = df.T
    df.columns = ["Professional", "Student"]
    df.plot.bar(figsize=(7,7))
    plt.title(title + ' Distribution for Professionals/Students')
    plt.show()

## Country

In [None]:
plot_stud_prof(column='Country', title="Country")

TODO

## Company Size

In [None]:
stack.CompanySize.value_counts(normalize=True)[0:10].plot(kind='bar',figsize=(10,8))
plt.show()

## Developer Type

In [None]:
DevTypes = pd.Series([lang for sublist in [str(langs).replace(" ", "").split(";") for langs in stack['DeveloperType'].dropna()] for lang in sublist])

In [None]:
DevTypes.value_counts(normalize=True)[0:10].plot(kind='bar',figsize=(10,8))
plt.show()
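
The comprehension above removes every space before splitting, so multi-word answers lose their internal spaces in the plot labels. A more readable alternative, sketched here under the assumption that answers are separated by semicolons as the cell above implies, trims whitespace only around each option. The helper name `count_multiselect` and the toy answers are made up for illustration.

```python
from collections import Counter

def count_multiselect(answers, sep=";"):
    """Count each option in semicolon-separated multi-select answers."""
    counts = Counter()
    for answer in answers:
        if not isinstance(answer, str):  # skip NaN / missing answers
            continue
        counts.update(option.strip() for option in answer.split(sep) if option.strip())
    return counts

# Toy answers in the same "a; b" shape as a multi-select survey column.
toy = ["Web developer; Mobile developer", "Web developer", float("nan")]
counts = count_multiselect(toy)
```

The resulting counter can be wrapped in `pd.Series(counts)` to reuse the same bar-plot code.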

TODO

## Languages

In [None]:
prof_languages = pd.Series([lang for sublist in [str(langs).replace(" ", "").split(";") for langs in prof_stack['HaveWorkedLanguage'].dropna()] for lang in sublist])
stud_languages = pd.Series([lang for sublist in [str(langs).replace(" ", "").split(";") for langs in stud_stack['HaveWorkedLanguage'].dropna()] for lang in sublist])
plot_stud_prof(prof=prof_languages, stud=stud_languages, title="Languages")

## Career Satisfaction

In [None]:
carrSat = stack['CareerSatisfaction']/stack['CareerSatisfaction'].max()
carrSat.value_counts().sort_index().plot(kind='bar',figsize=(10,8))
plt.show()

## Diversity Important

In [None]:
plot_stud_prof(column='DiversityImportant', title="Diversity Importance")

# SEPARATION

In [None]:
plt.figure(figsize=(5,7))
sns.boxplot(y=prof_stack.Salary)  # passing y= explicitly gives a vertical box in any seaborn version
plt.title("Box-plot of the total salaries")
plt.show()

In [None]:
stud_stack.ExpectedSalary.plot(kind='kde', figsize=(10,8), color='r', legend=True)
prof_stack[(prof_stack.YearsProgram == "Less than a year")].Salary.plot(kind='kde', figsize=(7,7), legend=True)
plt.xlabel("Salary/Expected Salary")
plt.title("Students' expected salary vs. salary of professionals programming for less than a year")
plt.show()

In [None]:
stack.Race.value_counts(normalize=True)[0:10].plot(kind='bar',figsize=(7,7))
plt.show()

In [None]:
stack.Gender.value_counts(normalize=True)[0:10].plot(kind='bar',figsize=(7,7))
plt.show()

In [None]:
stack.EducationTypes.value_counts(normalize=True)[0:10].plot(kind='bar',figsize=(7,7))
plt.show()

## GIF vs JIF

In [None]:
# Compare salaries of respondents who pronounce "GIF" with a hard vs. a soft "g"
temp = stack[["PronounceGIF", "Salary"]].dropna(how='any')
gif = temp.loc[temp.PronounceGIF == 'With a hard "g," like "gift"', 'Salary']
jif = temp.loc[temp.PronounceGIF == 'With a soft "g," like "jiff"', 'Salary']
# Building the frame from two re-indexed Series pads the shorter one with NaN,
# so no hard-coded padding length is needed
tr = pd.DataFrame({'gif': gif.reset_index(drop=True),
                   'jif': jif.reset_index(drop=True)})
plot = sns.boxplot(data=tr, orient="v",)
plt.ylabel("Salary")
plt.title("Distribution of salary for the gif and jif populations")
plt.show()

print(tr.describe())

# WRITE HERE MATTHIAS

In [None]:
prof_stack.isnull().sum()

In [None]:
stud_stack.isnull().sum()

In [None]:
important_features_prof = ['Professional', 'ProgramHobby', 'Country', 'University', 'FormalEducation', 'MajorUndergrad', 'CompanyType',
                     'YearsCodedJob', 'YearsProgram', 'DeveloperType', 'CareerSatisfaction', 'JobSatisfaction', 'Overpaid',
                     'WorkStart', 'HaveWorkedLanguage', 'WantWorkLanguage', 'AuditoryEnvironment', 'Salary']
important_features_stud = ['Professional', 'ProgramHobby', 'Country', 'University', 'FormalEducation', 'YearsProgram','WorkStart',
                           'ClickyKeys', 'HaveWorkedLanguage', 'WantWorkLanguage', 'AuditoryEnvironment', 'ExpectedSalary']

final_prof_stack = prof_stack[important_features_prof].copy()
final_stud_stack = stud_stack[important_features_stud].copy()

In [None]:
final_prof_stack = final_prof_stack.dropna()
final_prof_stack.shape

In [None]:
final_stud_stack = final_stud_stack.dropna()
final_stud_stack.shape

In [None]:
label_prof_stack = final_prof_stack.copy()
label_stud_stack = final_stud_stack.copy()

# Encode every categorical column as integer labels; the last column (salary) stays numeric
for c in important_features_prof[:-1]:
    label_prof_stack[c] = preprocessing.LabelEncoder().fit_transform(label_prof_stack[c])

for c in important_features_stud[:-1]:
    label_stud_stack[c] = preprocessing.LabelEncoder().fit_transform(label_stud_stack[c])

In [None]:
prof_neighbors = kneighbors_graph(label_prof_stack, 98, mode='distance', include_self=True)
prof_rbf = prof_neighbors.copy()
prof_rbf.data = np.exp(- prof_rbf.data ** 2 / (2. * np.mean(prof_rbf.data) ** 2))
del prof_neighbors

stud_neighbors = kneighbors_graph(label_stud_stack, 43, mode='distance', include_self=True)
stud_rbf = stud_neighbors.copy()
stud_rbf.data = np.exp(- stud_rbf.data ** 2 / (2. * np.mean(stud_rbf.data) ** 2))
del stud_neighbors
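
The two blocks above turn k-nearest-neighbour distances into similarities with a Gaussian (RBF) kernel whose width is the mean stored distance: each weight is exp(-d^2 / (2 * sigma^2)) with sigma = mean(d). The same transformation on a small dense array, as an illustrative sketch (the helper name `rbf_weights` is not part of the notebook):

```python
import numpy as np

def rbf_weights(distances):
    """Turn distances into Gaussian similarities, with sigma set to the mean distance."""
    distances = np.asarray(distances, dtype=float)
    sigma = distances.mean()
    return np.exp(-distances ** 2 / (2.0 * sigma ** 2))

# Distance 0 maps to similarity 1; larger distances decay smoothly towards 0.
weights = rbf_weights([0.0, 1.0, 2.0, 3.0])
```

Because the distances are Euclidean on label-encoded columns plus the raw salary, the salary column is likely to dominate them unless the features are rescaled first.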


In [None]:
# Note: from_scipy_sparse_matrix was removed in networkx 3.0 (replaced by from_scipy_sparse_array)
G_prof = nx.from_scipy_sparse_matrix(prof_rbf,edge_attribute='similarity')
pos_prof = nx.spring_layout(G_prof)
nx.draw_networkx_nodes(G_prof, pos_prof, node_size=7)
plt.show()

In [None]:
G_stud = nx.from_scipy_sparse_matrix(stud_rbf,edge_attribute='similarity')
pos_stud = nx.spring_layout(G_stud)
nx.draw_networkx_nodes(G_stud, pos_stud, node_size=7)
plt.show()