<a href="https://colab.research.google.com/github/NicoEssi/Data_Science_Portfolio/blob/master/StackOverflow_Survey_2019_and_2018.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Setting up

In [0]:
import os
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [0]:
#@title Download, Unzip, and Append Datasets { output-height: 100, display-mode: "form" }
# Downloading the Stack Overflow Survey Results for 2018
!wget --no-check-certificate \
    "https://drive.google.com/uc?export=download&id=1_9On2-nsBQIw3JiY43sWbrF8EjrqrR4U" \
    -O "/tmp/soi_2018.zip"
# The context manager closes the archive even if extraction fails.
with zipfile.ZipFile("/tmp/soi_2018.zip", 'r') as zip_ref:
    zip_ref.extractall("/tmp/soi_2018")

# Downloading the Stack Overflow Survey Results for 2019
!wget --no-check-certificate \
    "https://drive.google.com/uc?authuser=0&id=1QOmVDpd8hcVYqqUXDXf68UMDWQZP0wQV&export=download" \
    -O "/tmp/soi_2019.zip"
# The context manager closes the archive even if extraction fails.
with zipfile.ZipFile("/tmp/soi_2019.zip", 'r') as zip_ref:
    zip_ref.extractall("/tmp/soi_2019")

# Load the schema (the list of survey questions) for each year, newest first.
schemas = [
    pd.read_csv(f"/tmp/soi_{year}/survey_results_schema.csv")
    for year in (2019, 2018)
]

schema_2019 = schemas[0] # Schema from 2019
schema_2018 = schemas[1] # Schema from 2018

# 1. Data Preprocessing

In [80]:
# First we check which schema has the fewest questions,
# then we find out how many of those questions also appear in the other schema.

# schemas is ordered newest first: 2019, then 2018.
for year, s in zip((2019, 2018), schemas):
    print(f"{year}: {s.shape[0]}.")

2019: 85.
2018: 129.


In [81]:
# We compare how many questions from 2019 also appear in 2018.

# Let's first check whether Column or QuestionText yields more matches.
# We keep the one with the most matches, to reduce the manual comparison later.


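# Build the 2018 lookups once instead of re-creating a list on every iteration.
columns_2018 = set(schema_2018.Column)
questiontexts_2018 = set(schema_2018.QuestionText)

# One boolean per 2019 question: does it also appear in the 2018 schema?
schema_Column = [column in columns_2018 for column in schema_2019.Column]
schema_QuestionText = [text in questiontexts_2018 for text in schema_2019.QuestionText]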


print("Column matches: " + str(sum(schema_Column)) +
      ".\nQuestionText matches: " + str(sum(schema_QuestionText)) + ".")

Column matches: 17.
QuestionText matches: 31.
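
The same membership test can also be written without Python-level loops. The following is a minimal sketch on two toy schemas; the frames `toy_2019` and `toy_2018` and their contents are made up for illustration and are not taken from the survey. It uses `Series.isin` to build one boolean mask per matching strategy.

```python
import pandas as pd

# Hypothetical toy schemas, for illustration only.
toy_2019 = pd.DataFrame({
    "Column": ["Respondent", "Hobbyist", "Age"],
    "QuestionText": ["Respondent ID", "Do you code as a hobby?", "What is your age (in years)?"],
})
toy_2018 = pd.DataFrame({
    "Column": ["Respondent", "Hobby", "Age"],
    "QuestionText": ["Respondent ID", "Do you code as a hobby?", "What is your age?"],
})

# Boolean masks: True where the 2019 value also occurs somewhere in 2018.
column_match = toy_2019["Column"].isin(toy_2018["Column"])
text_match = toy_2019["QuestionText"].isin(toy_2018["QuestionText"])

print("Column matches:", int(column_match.sum()))
print("QuestionText matches:", int(text_match.sum()))

# The masks can select the matched rows and, negated, the unmatched ones.
matched = toy_2019[text_match]
unmatched = toy_2019[~text_match]
```

Because the masks are boolean Series, `~text_match` gives the unmatched rows directly.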


In [0]:
schema = schema_2019[schema_QuestionText] # preparing the final schema
schema_not = schema_2019[np.invert(schema_QuestionText)] # QuestionTexts that didn't match

In [83]:
schema.head()

       Column                                       QuestionText
0  Respondent  Randomized respondent ID number (not in order ...
2    Hobbyist                            Do you code as a hobby?
5  Employment  Which of the following best describes your cur...
6     Country          In which country do you currently reside?
7     Student  Are you currently enrolled in a formal, degree...


In [84]:
# Among the 2019 questions whose text did not match,
# keep those whose Column name also exists in the 2018 schema.
columns_2018 = set(schema_2018.Column)
leftover_mask = [column in columns_2018 for column in schema_not.Column]

schema_leftover = schema_not[leftover_mask].reset_index(drop=True)

schema_leftover

           Column                                       QuestionText
0      OpenSource  How do you feel about the quality of open sour...
1  UndergradMajor  What was your main or most important field of ...
2             Age  What is your age (in years)? If you prefer not...
3      Dependents  Do you have any dependents (e.g., children, el...


In [85]:
schema_leftover.head()

           Column                                       QuestionText
0      OpenSource  How do you feel about the quality of open sour...
1  UndergradMajor  What was your main or most important field of ...
2             Age  What is your age (in years)? If you prefer not...
3      Dependents  Do you have any dependents (e.g., children, el...

In [0]:
# 2019 question texts, in the order of schema_leftover.
questions19 = schema_leftover.QuestionText.reset_index(drop=True)

# 2018 question texts, looked up by Column so that they follow the same order
# (assumes Column names are unique within schema_2018).
questions18 = schema_2018.set_index("Column").loc[list(schema_leftover.Column), "QuestionText"].reset_index(drop=True)

questions = pd.concat([questions19, questions18], axis=1, sort=False)
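
Pairing question texts by position only works when both frames list the shared columns in the same order. A minimal sketch of an alternative that pairs rows by key instead, on made-up toy frames (`left`, `right`, and the suffixes `_2019` and `_2018` are illustrative names, not part of this notebook), could use `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical toy frames; the 2018 rows are deliberately in a different order.
left = pd.DataFrame({
    "Column": ["Age", "Dependents"],
    "QuestionText": ["What is your age (in years)?", "Do you have any dependents?"],
})
right = pd.DataFrame({
    "Column": ["Dependents", "Gender", "Age"],
    "QuestionText": ["Do you have any children?", "What is your gender?", "What is your age?"],
})

# An inner merge keeps only the columns present in both frames
# and pairs their question texts by key rather than by position.
paired = left.merge(right, on="Column", how="inner", suffixes=("_2019", "_2018"))

for row in paired.itertuples(index=False):
    print(row.Column)
    print("  2019:", row.QuestionText_2019)
    print("  2018:", row.QuestionText_2018)
```

The suffixes keep the two `QuestionText` columns distinguishable, which avoids having two columns with the same name in one frame.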

In [62]:
# Print each 2019 question directly above its 2018 counterpart.
for _, pair in questions.iterrows():
    for text in pair:
        print(text)
    print("-----------------------------------------------")

How do you feel about the quality of open source software (OSS)?
Do you contribute to open source projects?
-----------------------------------------------
What was your main or most important field of study?
You previously indicated that you went to a college or university. Which of the following best describes your main field of study (aka 'major')
-----------------------------------------------
What is your age (in years)? If you prefer not to answer, you may leave this question blank.
What is your age? If you prefer not to answer, you may leave this question blank.
-----------------------------------------------
Do you have any dependents (e.g., children, elders, or others) that you care for?
Do you have any children or other dependents that you care for? If you prefer not to answer, you may leave this question blank.
-----------------------------------------------
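
Reading the pairs by eye works for four questions. For a longer list, a similarity score could help rank which pairs deserve a closer look. The sketch below applies `difflib.SequenceMatcher` from the standard library to two of the pairs printed above. This is a swapped-in aid, not something this notebook relies on, and any cut-off for "same question" would be an assumption to tune by hand.

```python
from difflib import SequenceMatcher

# Two of the (2019, 2018) question pairs printed above.
pairs = {
    "OpenSource": (
        "How do you feel about the quality of open source software (OSS)?",
        "Do you contribute to open source projects?",
    ),
    "Age": (
        "What is your age (in years)? If you prefer not to answer, you may leave this question blank.",
        "What is your age? If you prefer not to answer, you may leave this question blank.",
    ),
}


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; higher means the two strings share more text."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


scores = {name: similarity(*texts) for name, texts in pairs.items()}

# Most similar pair first.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

A high score only suggests that the wording is close; whether two questions ask the same thing still needs a manual check.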


In [63]:
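# Manual comparison of the four pairs printed above (2019 text vs. 2018 text):
# 1. OpenSource: not the same question, so it is dropped below.
# 2. UndergradMajor: same question
# 3. Age: same question
# 4. Dependents: same question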

schema_leftover = schema_leftover.drop(schema_leftover.index[0])

schema_leftover

           Column                                       QuestionText
1  UndergradMajor  What was your main or most important field of ...
2             Age  What is your age (in years)? If you prefer not...
3      Dependents  Do you have any dependents (e.g., children, el...


In [0]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
schema = pd.concat([schema, schema_leftover]).reset_index(drop=True)

# 2. Exploration: Business- & Data Understanding

In [108]:
# List every 2019 column name next to its full question text.
for column, text in zip(schema_2019.Column, schema_2019.QuestionText):
    print(f"{column} : {text}")

Respondent : Randomized respondent ID number (not in order of survey response time)
MainBranch : Which of the following options best describes you today? Here, by "developer" we mean "someone who writes code."
Hobbyist : Do you code as a hobby?
OpenSourcer : How often do you contribute to open source?
OpenSource : How do you feel about the quality of open source software (OSS)?
Employment : Which of the following best describes your current employment status?
Country : In which country do you currently reside?
Student : Are you currently enrolled in a formal, degree-granting college or university program?
EdLevel : Which of the following best describes the highest level of formal education that you’ve completed?
UndergradMajor : What was your main or most important field of study?
EduOther : Which of the following types of non-degree education have you used or participated in? Please select all that apply.
OrgSize : Approximately how many people are employed by the company or organizat