Question
========

What does the best developer job look like (for your own definition of
"best")?

Dataset
=======

The dataset provided by [StackOverflow on
Kaggle](https://www.kaggle.com/stackoverflow/so-survey-2017/data) is a
good starting point. It contains about fifty thousand responses, from a
sample of the active StackOverflow population, to 154 questions. This
should give us detailed insight into what distinguishes one programmer
from another, and help us answer many interesting questions.

Project
=======

Exploratory
-----------

Check the distributions of all useful features, including outliers and
quantiles. Questions we could answer with the exploration:

-   Does salary equate to happiness/fulfilment in your job?

-   For users not satisfied with their job, what should they change to
    be more satisfied (use closest correlated neighbor)?

-   How strongly is job satisfaction linked to education?

-   Are "gif" people more satisfied with their job than "jif"
    people?
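
As a first pass at the salary question above, a minimal sketch could look at quantiles and a simple correlation. It assumes `Salary` and `JobSatisfaction` are both numeric; the `toy` frame and its values are invented for illustration and are not survey results.

```python
import pandas as pd

# Toy rows standing in for the survey; the values are invented for illustration.
toy = pd.DataFrame({
    "Salary": [30000.0, 52000.0, 75000.0, 98000.0, None, 120000.0],
    "JobSatisfaction": [4.0, 6.0, 7.0, 8.0, 5.0, None],
})

# Keep only respondents who answered both questions.
answered = toy[["Salary", "JobSatisfaction"]].dropna(how="any")

# Quantiles give a first look at the spread and at possible outliers.
salary_quantiles = answered["Salary"].quantile([0.25, 0.5, 0.75])

# Pearson correlation as a first, purely linear measure of association.
salary_vs_satisfaction = answered["Salary"].corr(answered["JobSatisfaction"])

print(salary_quantiles)
print(f"Pearson r = {salary_vs_satisfaction:.3f}")
```

A correlation close to zero on the real data would not rule out a non-linear link, so a plot of satisfaction per salary bracket is a sensible follow-up.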

Pre-processing
--------------

Clean the data, categorize values, check their distributions, select
columns, and remove bad values if needed.
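
One possible shape for these steps is sketched below. The `clean` helper, the 1.5 × IQR outlier rule, and the `"Unknown"` placeholder are assumptions chosen for illustration, not decisions taken in this project; the `toy` values are invented.

```python
import pandas as pd

# Toy rows; the values are invented for illustration.
toy = pd.DataFrame({
    "FormalEducation": ["Bachelor's degree", "Master's degree", None,
                        "Bachelor's degree", "Doctoral degree"],
    "Salary": [50000.0, 65000.0, 40000.0, 1.0e7, None],
})

def clean(df, numeric_col, categorical_col, iqr_factor=1.5):
    """Hypothetical helper: drop missing and outlying numeric values, encode a category."""
    # Remove rows where the numeric value is missing.
    out = df.dropna(subset=[numeric_col]).copy()
    # Remove outliers that fall outside the interquartile fences.
    q1, q3 = out[numeric_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = out[numeric_col].between(q1 - iqr_factor * iqr, q3 + iqr_factor * iqr)
    out = out[in_range].copy()
    # Turn the text answers into a categorical column plus integer codes.
    out[categorical_col] = out[categorical_col].fillna("Unknown").astype("category")
    out[categorical_col + "_code"] = out[categorical_col].cat.codes
    return out

cleaned = clean(toy, "Salary", "FormalEducation")
print(cleaned)
```

Ordered answers such as satisfaction scales would probably deserve an explicit category order rather than the alphabetical codes used here.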

Feature Extraction
------------------

Run PCA to check which features explain the most variance.
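
A minimal sketch of this step, using NumPy's SVD on standardized data rather than a dedicated library. The matrix `X` is synthetic and the helper name `pca_explained_variance` is made up for this example. Note that PCA ranks components, not original features; the rows of `components` show how much each feature contributes to each component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 "respondents", 3 numeric features, the first two strongly correlated.
base = rng.normal(size=200)
X = np.column_stack([
    base,
    2 * base + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

def pca_explained_variance(X):
    """Return the explained-variance ratio per component and the component loadings."""
    # Standardize so that features on large scales (e.g. salary) do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized matrix; rows of `components` are the principal axes.
    _, singular_values, components = np.linalg.svd(Z, full_matrices=False)
    variances = singular_values ** 2 / (len(Z) - 1)
    return variances / variances.sum(), components

ratios, components = pca_explained_variance(X)
print("explained variance ratio:", np.round(ratios, 3))
print("loadings of the first component:", np.round(components[0], 3))
```

On the survey itself, categorical answers would first need a numeric encoding before this can be applied.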

Graph Analysis
--------------

The graph will be built as follows:

-   Users will be the nodes

-   Correlations between users (above a threshold) will be the edges
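
The construction above can be sketched as a thresholded user-to-user correlation matrix. The function name `correlation_graph`, the threshold of 0.9, and the small `features` matrix are illustrative assumptions; each row stands for one user's numerically encoded answers.

```python
import numpy as np

def correlation_graph(features, threshold=0.9):
    """Boolean adjacency matrix: users i and j are linked if corr(i, j) >= threshold."""
    # np.corrcoef treats each row as one variable, i.e. one user here.
    corr = np.corrcoef(features)
    adjacency = corr >= threshold
    # A user is trivially correlated with themselves; drop self-loops.
    np.fill_diagonal(adjacency, False)
    return adjacency

# Four hypothetical users described by four numeric answers each.
features = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 4.0],
])

adjacency = correlation_graph(features)
edges = [(i, j) for i in range(len(adjacency))
         for j in range(i + 1, len(adjacency)) if adjacency[i, j]]
print(edges)
```

With about fifty thousand users the full correlation matrix is large (roughly 2.5 billion entries), so the real graph would likely need sampling or a sparse nearest-neighbor construction.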

Recommender System
------------------

The idea here is to recommend which users, out of a given set, best
match a set of given goals. To do so, we would create an artificial node
from the features a recruiter is looking for, then find the existing
node closest to it.
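
A minimal sketch of that lookup, assuming users are already encoded as numeric feature vectors and that "closest" means Euclidean distance on standardized features. Both are assumptions; the function `closest_users` and the two example columns are hypothetical.

```python
import numpy as np

def closest_users(user_features, target, k=1):
    """Return the indices of the k users nearest to the artificial target node."""
    mean = user_features.mean(axis=0)
    std = user_features.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    # Standardize users and target with the same statistics.
    z_users = (user_features - mean) / std
    z_target = (np.asarray(target, dtype=float) - mean) / std
    distances = np.linalg.norm(z_users - z_target, axis=1)
    return np.argsort(distances)[:k], distances

# Hypothetical columns: years of coding experience, job satisfaction score.
users = np.array([
    [1.0, 3.0],
    [5.0, 7.0],
    [10.0, 2.0],
    [6.0, 8.0],
])
target = [6.0, 7.5]  # the profile a recruiter is looking for

best, distances = closest_users(users, target, k=2)
print("closest users:", best.tolist())
```

Once the graph exists, the same query could instead attach the artificial node to the graph and read off its neighbors.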


In [None]:
%config InlineBackend.figure_format = 'retina'
import numpy as np
import pandas as pd
import seaborn as sns
# plotly.plotly only exists in plotly < 4; later releases moved it to chart_studio.plotly.
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
from subprocess import check_output
# Show every column when displaying a DataFrame.
pd.set_option('display.max_columns', None)

In [None]:
stack = pd.read_csv("data/survey_results_public.csv")
kept_columns = [
    'Respondent', 'Professional', 'ProgramHobby', 'Country', 'University', 'EmploymentStatus',
    'FormalEducation', 'MajorUndergrad', 'CompanySize', 'CompanyType', 'YearsProgram', 'YearsCodedJob',
    'DeveloperType', 'WebDeveloperType', 'NonDeveloperType', 'CareerSatisfaction', 'JobSatisfaction',
    'PronounceGIF', 'ProblemSolving', 'BuildingThings', 'LearningNewTech', 'BoringDetails', 'JobSecurity',
    'DiversityImportant', 'FriendsDevelopers', 'WorkPayCare', 'ChallengeMyself', 'ImportantBenefits',
    'ClickyKeys', 'Currency', 'Overpaid', 'TabsSpaces', 'EducationImportant', 'EducationTypes',
    'SelfTaughtTypes', 'WorkStart', 'HaveWorkedLanguage', 'WantWorkLanguage', 'IDE', 'AuditoryEnvironment',
    'Methodology', 'EquipmentSatisfiedMonitors', 'InfluenceTechStack', 'InfluenceCommunication',
    'StackOverflowSatisfaction', 'StackOverflowFoundAnswer', 'StackOverflowCopiedCode', 'StackOverflowWhatDo',
    'Gender', 'HighestEducationParents', 'Race', 'Salary', 'ExpectedSalary',
]
stack = stack[kept_columns]
stack.head()

In [None]:
metadata = pd.read_csv("data/survey_results_schema.csv")
metadata

In [None]:
import matplotlib.pyplot as plt
# Keep only respondents who answered both the GIF and the salary questions.
temp = stack[["PronounceGIF", "Salary"]].dropna(how='any')
temp = temp.set_index("PronounceGIF")
gif = temp.loc['With a hard "g," like "gift"'].Salary.values
jif = temp.loc['With a soft "g," like "jiff"'].Salary.values
# Building the frame from two Series pads the shorter column with NaN,
# so no hard-coded length difference between the groups is needed.
tr = pd.DataFrame({'gif': pd.Series(gif), 'jif': pd.Series(jif)})
plot = sns.boxplot(data=tr, orient="v")
plt.ylabel("Salary")
plt.title("Distribution of salary for the gif and jif populations")
plt.show()

print(tr.describe())