<a href="https://colab.research.google.com/github/NicoEssi/Data_Science_Portfolio/blob/master/Stackoverflow_2019.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Setup

In [0]:
import os
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
# Downloading the Stack Overflow Survey Results for 2019
!wget --no-check-certificate \
    "https://drive.google.com/uc?authuser=0&id=1QOmVDpd8hcVYqqUXDXf68UMDWQZP0wQV&export=download" \
    -O "/tmp/soi_2019.zip"
# Extract the archive; the context manager closes the file automatically
with zipfile.ZipFile("/tmp/soi_2019.zip", 'r') as zip_ref:
    zip_ref.extractall("/tmp/soi_2019")

--2019-09-10 18:26:39--  https://drive.google.com/uc?authuser=0&id=1QOmVDpd8hcVYqqUXDXf68UMDWQZP0wQV&export=download
Resolving drive.google.com (drive.google.com)... 74.125.20.113, 74.125.20.102, 74.125.20.101, ...
Connecting to drive.google.com (drive.google.com)|74.125.20.113|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://drive.google.com/uc?id=1QOmVDpd8hcVYqqUXDXf68UMDWQZP0wQV&export=download [following]
--2019-09-10 18:26:39--  https://drive.google.com/uc?id=1QOmVDpd8hcVYqqUXDXf68UMDWQZP0wQV&export=download
Reusing existing connection to drive.google.com:443.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-14-4c-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/rt09if7md7v9rba5kgoam556228ho95b/1568138400000/06716978924947585995/*/1QOmVDpd8hcVYqqUXDXf68UMDWQZP0wQV?e=download [following]
--2019-09-10 18:26:42--  https://doc-14-4c-docs.googleusercontent.com/docs/securesc/ha0

# 1. Business- & Data Understanding

In [0]:
data = pd.read_csv("/tmp/soi_2019/survey_results_public.csv")
schema = pd.read_csv("/tmp/soi_2019/survey_results_schema.csv")

In [0]:
#for i in range(len(schema)):
#    print(schema.iloc[i].Column + " : " + schema.iloc[i].QuestionText)

In [5]:
print("US Count: " + str(np.sum(data.Country == "United States")))
print("UK Count: " + str(np.sum(data.Country == "United Kingdom")))
print("EU Count: " + str(np.sum(data.Country == "France")
                        + np.sum(data.Country == "Germany")
                        + np.sum(data.Country == "Sweden")
                        + np.sum(data.Country == "Denmark")
                        + np.sum(data.Country == "Finland")
                        + np.sum(data.Country == "Ireland")
                        + np.sum(data.Country == "Netherlands")
                        + np.sum(data.Country == "Austria")
                        + np.sum(data.Country == "Belgium")
                        + np.sum(data.Country == "Switzerland")))

US Count: 20949
UK Count: 5737
EU Count: 15591


In [6]:
print("Before: " + str(len(data.CompTotal)))

data = data[pd.notnull(data['CompFreq'])]
data = data[pd.notnull(data['CompTotal'])].reset_index(drop = True)

print("After: " + str(len(data.CompTotal)))

Before: 88883
After: 55827
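
The two `pd.notnull` filters above can also be written as a single `dropna` call. The following is a minimal sketch on a made-up toy frame (the `toy` and `cleaned` names and their values are illustrative assumptions, not the survey data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    "CompFreq": ["Yearly", None, "Monthly", "Weekly"],
    "CompTotal": [50000.0, 4000.0, np.nan, 900.0],
})

# Drop rows where either compensation column is missing, then renumber the index
cleaned = toy.dropna(subset=["CompFreq", "CompTotal"]).reset_index(drop=True)

print("Before: " + str(len(toy)))
print("After: " + str(len(cleaned)))
```

Only rows with both a pay frequency and a total survive, which mirrors the intent of the cell above.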


In [7]:
print("US Count: " + str(np.sum(data.Country == "United States")))
print("UK Count: " + str(np.sum(data.Country == "United Kingdom")))
print("EU Count: " + str(np.sum(data.Country == "France")
                        + np.sum(data.Country == "Germany")
                        + np.sum(data.Country == "Sweden")
                        + np.sum(data.Country == "Denmark")
                        + np.sum(data.Country == "Finland")
                        + np.sum(data.Country == "Ireland")
                        + np.sum(data.Country == "Netherlands")
                        + np.sum(data.Country == "Austria")
                        + np.sum(data.Country == "Belgium")
                        + np.sum(data.Country == "Switzerland")
                        + np.sum(data.Country == "Norway")))

US Count: 14981
UK Count: 4036
EU Count: 10744
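
Note that the first count (before dropping missing compensation) omits Norway from the European sum, while the second includes it, so the two EU figures are not directly comparable. One way to keep the country list in a single place is sketched below; `EU_COUNTRIES`, `region_counts`, and the toy frame are hypothetical names introduced for illustration only.

```python
import pandas as pd

# Country list taken from the second count above (it includes Norway)
EU_COUNTRIES = ["France", "Germany", "Sweden", "Denmark", "Finland", "Ireland",
                "Netherlands", "Austria", "Belgium", "Switzerland", "Norway"]

def region_counts(frame):
    """Return respondent counts for the US, the UK, and the European group."""
    country = frame["Country"]
    return {
        "US": int((country == "United States").sum()),
        "UK": int((country == "United Kingdom").sum()),
        "EU": int(country.isin(EU_COUNTRIES).sum()),
    }

# Toy data, made up for illustration
toy = pd.DataFrame({"Country": ["United States", "Norway", "Germany",
                                "United Kingdom", "Canada", "Germany"]})
counts = region_counts(toy)
for region, n in counts.items():
    print(region + " Count: " + str(n))
```

Calling the same helper before and after the filtering step would guarantee that both counts use an identical definition of the European group.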


## We are interested in identifying the profiles of the top earners in each of three regions: the United States, the United Kingdom, and Europe. Thereafter, we'll build a model to predict salary.

### Why does our inquiry cover two individual countries and one larger region made up of numerous countries (Europe)?
Professionals residing within the European Union enjoy freedom of movement for workers under EU law, complemented by the Schengen agreement's removal of internal border checks, so they can work in whichever member country they wish.

### Why are Norway and Switzerland included in the European dataset?
While they are not part of the European Union, they have signed agreements in association with the Schengen agreement.

### Why is the United Kingdom not included in the European dataset despite having signed an agreement in association with the Schengen agreement?
Brexit. Although the data were gathered before the withdrawal deadline, it is still worthwhile to examine the UK data separately so that changes from 2020 onward can be compared. At the time of writing, nothing official has been announced about a future UK agreement associated with the Schengen agreement.

### Why are the countries in APT / RCEP not included?
Unfortunately, the samples gathered for China, South Korea, and Japan are too small to support reliable inferences. There are also currently significant restrictions on freedom of movement for professionals between these countries, which makes it unreasonable to analyze them as a collective.

# 2. Data Preparation

In [0]:
del data['Respondent']

In [0]:
data_us = data[data["Country"] == "United States"]

data_uk = data[data["Country"] == "United Kingdom"]

data_eu = data[data["Country"].isin(["France", "Germany", "Sweden",
                                     "Denmark", "Finland", "Norway",
                                     "Ireland", "Netherlands", "Austria",
                                     "Belgium", "Switzerland"])]

data_test = data[data["Country"] == "Canada"]

In [0]:
# Currency rates (USD per unit of currency) as of 09/09/2019
currencies = {
    "EUR": 1.10,
    "CHF": 1.00,
    "SEK": 0.10,
    "NOK": 0.11,
    "DKK": 0.15,
    "USD": 1.00,
    "GBP": 1.23,
    "INR": 0.014,
    "NZD": 0.64,
    "AUD": 0.69,
    "CAD": 0.76,
}

def currency_usd(data):
    # Return the USD rate for a currency code; unknown codes map to 0,
    # which zeroes the compensation so the "> 100" filter drops the row
    return currencies.get(data, 0)

def currency_annualize(data):
    # Return the number of pay periods per year for a CompFreq answer
    if data == 'Weekly':
        return 52
    elif data == 'Monthly':
        return 12
    elif data == 'Yearly':
        return 1
    # Any other value yields NaN instead of an implicit None
    return np.nan
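
As an alternative to calling the helper functions through `apply`, the same lookup can be expressed with `Series.map` and plain dictionaries. This is a sketch under stated assumptions: the `rates` and `periods` dictionaries are a subset of the values defined above, and the toy rows and the `CompUSD` column name are made up for illustration.

```python
import pandas as pd

# Subset of the rates and pay frequencies defined above
rates = {"EUR": 1.10, "USD": 1.00, "GBP": 1.23}
periods = {"Weekly": 52, "Monthly": 12, "Yearly": 1}

# Toy rows, made up for illustration; JPY is deliberately missing from `rates`
toy = pd.DataFrame({
    "CompTotal": [4000.0, 60000.0, 1000.0, 3000.0],
    "CompFreq": ["Monthly", "Yearly", "Weekly", "Monthly"],
    "CurrencySymbol": ["EUR", "GBP", "USD", "JPY"],
})

# map() looks each value up in the dict; unknown currencies become NaN,
# which fillna(0) turns into 0 to mirror currency_usd
to_usd = toy["CurrencySymbol"].map(rates).fillna(0)
per_year = toy["CompFreq"].map(periods)

# Annual compensation in USD: amount * periods per year * exchange rate
toy["CompUSD"] = toy["CompTotal"] * per_year * to_usd
print(toy)
```

The row with the unlisted currency ends up with a value of 0, so a threshold filter such as `CompTotal > 100` would remove it, just as in the original approach.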

In [122]:
data_us["CompTotal"] *= data_us["CompFreq"].apply(currency_annualize)
data_us["CompTotal"] *= data_us["CurrencySymbol"].apply(currency_usd)
data_us = data_us[data_us.CompTotal > 100]

data_uk["CompTotal"] *= data_uk["CompFreq"].apply(currency_annualize)
data_uk["CompTotal"] *= data_uk["CurrencySymbol"].apply(currency_usd)
data_uk = data_uk[data_uk.CompTotal > 100]

data_eu["CompTotal"] *= data_eu["CompFreq"].apply(currency_annualize)
data_eu["CompTotal"] *= data_eu["CurrencySymbol"].apply(currency_usd)
data_eu = data_eu[data_eu.CompTotal > 100]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
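
The warning above appears because `data_us`, `data_uk`, and `data_eu` were created by boolean indexing and then modified in place. One common way to avoid it, sketched here on a made-up toy frame (`toy` and `toy_us` are illustrative names), is to take an explicit `.copy()` of the subset before assigning to it:

```python
import pandas as pd

# Toy stand-in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    "Country": ["United States", "Canada", "United States"],
    "CompTotal": [1000.0, 2000.0, 3000.0],
    "CompFreq": ["Monthly", "Yearly", "Yearly"],
})

# .copy() makes the subset an independent DataFrame, so later assignments
# are unambiguous and do not touch the original frame
toy_us = toy[toy["Country"] == "United States"].copy()
toy_us["CompTotal"] *= toy_us["CompFreq"].map({"Weekly": 52, "Monthly": 12, "Yearly": 1})

print(toy_us)
```

The original `toy` frame keeps its raw values, while `toy_us` holds the annualized ones.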

In [155]:
#devtypes = list(dict.fromkeys(devtypes))

devtypes

[['Developer, desktop or enterprise applications'],
 ['DevOps specialist', 'Engineer, site reliability'],
 ['Developer, mobile', 'Student'],
 ['Developer, back-end', 'Developer, full-stack', 'Student'],
 ['Developer, back-end',
  'Developer, desktop or enterprise applications',
  'Developer, front-end',
  'Developer, full-stack',
  'DevOps specialist'],
 ['Developer, full-stack', 'DevOps specialist', 'System administrator'],
 ['nan'],
 ['Data or business analyst',
  'Data scientist or machine learning specialist',
  'Database administrator',
  'Designer',
  'Developer, back-end',
  'Developer, desktop or enterprise applications',
  'Developer, front-end',
  'Developer, full-stack',
  'Educator',
  'Marketing or sales professional',
  'Student',
  'System administrator'],
 ['Data scientist or machine learning specialist',
  'Developer, back-end',
  'Developer, full-stack',
  'Developer, QA or test',
  'DevOps specialist',
  'Engineer, data',
  'Engineer, site reliability',
  'Engineerin

In [0]:
def parsed_tuple(data):
    # Split a semicolon-separated answer into a list of stripped entries
    return [parsed.strip() for parsed in data.split(';')]

my_string = "blah;     lots  ,  of;  spaces; here "
result = parsed_tuple(my_string)


# Parse every DevType answer; str() turns missing values into 'nan'
devtypes = []
for i in range( len(data_eu.DevType) ):
    devtypes.append(parsed_tuple(str(data_eu.DevType.iloc[i])))

In [123]:
data_eu.DevType.unique()

array(['Developer, desktop or enterprise applications',
       'DevOps specialist;Engineer, site reliability',
       'Developer, mobile;Student', ...,
       'Developer, back-end;Developer, desktop or enterprise applications;Developer, embedded applications or devices;Developer, mobile;Developer, QA or test;DevOps specialist;Educator;Engineer, site reliability;System administrator',
       'Developer, desktop or enterprise applications;Developer, embedded applications or devices;Developer, game or graphics;Developer, QA or test',
       'Academic researcher;Database administrator;Developer, back-end;Developer, desktop or enterprise applications;Developer, embedded applications or devices;Developer, front-end;Developer, full-stack;Developer, game or graphics;Developer, mobile;Educator;Marketing or sales professional;Product manager;Scientist;System administrator'],
      dtype=object)
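
Because each `DevType` answer is a semicolon-separated string, the unique values above are combinations of roles rather than individual roles. A minimal sketch of counting individual roles follows; it assumes pandas 0.25 or later (for `Series.explode`), and the `dev_type` series is made-up toy data rather than the survey column.

```python
import pandas as pd

# Toy answers in the same semicolon-separated format (made up for illustration)
dev_type = pd.Series([
    "Developer, mobile;Student",
    "DevOps specialist;Engineer, site reliability",
    "Developer, mobile",
    None,
], name="DevType")

# Drop missing answers, split on ';', and give each role its own row
roles = dev_type.dropna().str.split(";").explode().str.strip()

# Count how often each individual role occurs
role_counts = roles.value_counts()
print(role_counts)
```

Dropping missing answers first avoids the literal `'nan'` entries that show up when missing values are converted with `str()`.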

In [115]:
for i in range( len(schema) ):
    print(schema.Column[i] + " : " + schema.QuestionText[i])

Respondent : Randomized respondent ID number (not in order of survey response time)
MainBranch : Which of the following options best describes you today? Here, by "developer" we mean "someone who writes code."
Hobbyist : Do you code as a hobby?
OpenSourcer : How often do you contribute to open source?
OpenSource : How do you feel about the quality of open source software (OSS)?
Employment : Which of the following best describes your current employment status?
Country : In which country do you currently reside?
Student : Are you currently enrolled in a formal, degree-granting college or university program?
EdLevel : Which of the following best describes the highest level of formal education that you’ve completed?
UndergradMajor : What was your main or most important field of study?
EduOther : Which of the following types of non-degree education have you used or participated in? Please select all that apply.
OrgSize : Approximately how many people are employed by the company or organizat