# Case Study: User survey

In this case study we figure out how to analyse the responses from a user survey form kaggle

The case study is divided into several parts:
- Goals
- Parsing
- Preparation (cleaning)
- Processing
- Exploration
- Visualization
- Conclusion

## Goals

In this section we define questions that will be our guideline througout the case study

- What influences salary?
- Can we deduce common skills for job titles?
- Do higher paid jobs spend time differently?
- Important: education or experience?

We'll (try to) keep these question in mind when performing the case study.

## Parsing

we start out by importing all necessary libraries

In [1]:
import os
import json
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats
import matplotlib.pyplot as plt
from IPython.display import set_matplotlib_formats
%matplotlib inline
set_matplotlib_formats('svg')

in order to download datasets from kaggle, we need an API key to access their API, we'll make that here

In [2]:
if not os.path.exists("/root/.kaggle"):
    os.mkdir("/root/.kaggle")

with open('/root/.kaggle/kaggle.json', 'w') as f:
    json.dump(
        {
            "username":"lorenzf",
            "key":"7a44a9e99b27e796177d793a3d85b8cf"
        }
        , f)

now we can import kaggle too and download the datasets

In [3]:
import kaggle
kaggle.api.dataset_download_files(dataset='kaggle/kaggle-survey-2018', path='./data', unzip=True)



the csv files are now in the './data' folder, we can now read them using pandas, here is the list of all csv files in our folder

In [4]:
os.listdir('./data')

['freeFormResponses.csv', 'multipleChoiceResponses.csv', 'SurveySchema.csv']


The file of our interest is 'athlete_events.csv', it contains every contestant in every sport since 1896. Let's print out the top 5 events.

In [5]:
choice_df = pd.read_csv('./data/multipleChoiceResponses.csv')
print('shape: ' + str(choice_df.shape))
choice_df.head()

  interactivity=interactivity, compiler=compiler, result=result)


shape: (23860, 395)


Unnamed: 0,Time from Start to Finish (seconds),Q1,Q1_OTHER_TEXT,Q2,Q3,Q4,Q5,Q6,Q6_OTHER_TEXT,Q7,Q7_OTHER_TEXT,Q8,Q9,Q10,Q11_Part_1,Q11_Part_2,Q11_Part_3,Q11_Part_4,Q11_Part_5,Q11_Part_6,Q11_Part_7,Q11_OTHER_TEXT,Q12_MULTIPLE_CHOICE,Q12_Part_1_TEXT,Q12_Part_2_TEXT,Q12_Part_3_TEXT,Q12_Part_4_TEXT,Q12_Part_5_TEXT,Q12_OTHER_TEXT,Q13_Part_1,Q13_Part_2,Q13_Part_3,Q13_Part_4,Q13_Part_5,Q13_Part_6,Q13_Part_7,Q13_Part_8,Q13_Part_9,Q13_Part_10,Q13_Part_11,...,Q46,Q47_Part_1,Q47_Part_2,Q47_Part_3,Q47_Part_4,Q47_Part_5,Q47_Part_6,Q47_Part_7,Q47_Part_8,Q47_Part_9,Q47_Part_10,Q47_Part_11,Q47_Part_12,Q47_Part_13,Q47_Part_14,Q47_Part_15,Q47_Part_16,Q48,Q49_Part_1,Q49_Part_2,Q49_Part_3,Q49_Part_4,Q49_Part_5,Q49_Part_6,Q49_Part_7,Q49_Part_8,Q49_Part_9,Q49_Part_10,Q49_Part_11,Q49_Part_12,Q49_OTHER_TEXT,Q50_Part_1,Q50_Part_2,Q50_Part_3,Q50_Part_4,Q50_Part_5,Q50_Part_6,Q50_Part_7,Q50_Part_8,Q50_OTHER_TEXT
0,Duration (in seconds),What is your gender? - Selected Choice,What is your gender? - Prefer to self-describe...,What is your age (# years)?,In which country do you currently reside?,What is the highest level of formal education ...,Which best describes your undergraduate major?...,Select the title most similar to your current ...,Select the title most similar to your current ...,In what industry is your current employer/cont...,In what industry is your current employer/cont...,How many years of experience do you have in yo...,What is your current yearly compensation (appr...,Does your current employer incorporate machine...,Select any activities that make up an importan...,Select any activities that make up an importan...,Select any activities that make up an importan...,Select any activities that make up an importan...,Select any activities that make up an importan...,Select any activities that make up an importan...,Select any activities that make up an importan...,Select any activities that make up an importan...,What is the primary tool that you use at work ...,What is the primary tool that you use at work ...,What is the primary tool that you use at work ...,What is the primary tool that you use at work ...,What is the primary tool that you use at work ...,What is the primary tool that you use at work ...,What is the primary tool that you use at work ...,Which of the following integrated development ...,Which of the following integrated development ...,Which of the following integrated development ...,Which of the following integrated development ...,Which of the following integrated development ...,Which of the following integrated development ...,Which of the following integrated development ...,Which of the following integrated development ...,Which of the following integrated development ...,Which of the following integrated development ...,Which of the following integrated development ...,...,Approximately what percent of your data projec...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,What methods do you prefer for explaining and/...,"Do you consider ML models to be ""black boxes"" ...",What tools and methods do you use to make your...,What tools and methods do you use to make your...,What tools and methods do you use to make your...,What tools and methods do you use to make your...,What tools and methods do you use to make your...,What tools and methods do you use to make your...,What tools and methods do you use to make your...,What tools and methods do you use to make your...,What tools and methods do you use to make your...,What tools and methods do you use to make your...,What tools and methods do you use to make your...,What tools and methods do you use to make your...,What tools and methods do you use to make your...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...
1,710,Female,-1,45-49,United States of America,Doctoral degree,Other,Consultant,-1,Other,0,,,I do not know,Analyze and understand data to influence produ...,Build and/or run a machine learning service th...,Build and/or run the data infrastructure that ...,,Do research that advances the state of the art...,,,-1,"Cloud-based data software & APIs (AWS, GCP, Az...",-1,-1,-1,-1,0,-1,Jupyter/IPython,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
2,434,Male,-1,30-34,Indonesia,Bachelor’s degree,Engineering (non-computer focused),Other,0,Manufacturing/Fabrication,-1,5-10,"10-20,000",No (we do not use ML methods),,,,,,None of these activities are an important part...,,-1,"Basic statistical software (Microsoft Excel, G...",1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
3,718,Female,-1,30-34,United States of America,Master’s degree,"Computer science (software engineering, etc.)",Data Scientist,-1,I am a student,-1,0-1,"0-10,000",I do not know,Analyze and understand data to influence produ...,,,,,,,-1,Local or hosted development environments (RStu...,-1,-1,-1,0,-1,-1,,,,,,,MATLAB,,,,,...,10-20,,Examine feature correlations,Examine feature importances,,,,,Plot predicted vs. actual results,,,,,,,,,I am confident that I can explain the outputs ...,,,,,,,Make sure the code is human-readable,Define all random seeds,,Include a text file describing all dependencies,,,-1,,Too time-consuming,,,,,,,-1
4,621,Male,-1,35-39,United States of America,Master’s degree,"Social sciences (anthropology, psychology, soc...",Not employed,-1,,-1,,,,,,,,,,,-1,Local or hosted development environments (RStu...,-1,-1,-1,1,-1,-1,Jupyter/IPython,RStudio,PyCharm,,,,,Visual Studio,,,Vim,...,20-30,,Examine feature correlations,Examine feature importances,Plot decision boundaries,,,,Plot predicted vs. actual results,,Sensitivity analysis/perturbation importance,,,,,,,"Yes, most ML models are ""black boxes""",,,"Share data, code, and environment using a host...",,,,Make sure the code is human-readable,,Define relative rather than absolute file paths,,,,-1,,,Requires too much technical knowledge,,Not enough incentives to share my work,,,,-1


In [6]:
free_form_df = pd.read_csv('./data/freeFormResponses.csv')
print('shape: ' + str(free_form_df.shape))
free_form_df.head()

shape: (23860, 35)


  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,Q11_OTHER_TEXT,Q12_OTHER_TEXT,Q12_Part_1_TEXT,Q12_Part_2_TEXT,Q12_Part_3_TEXT,Q12_Part_4_TEXT,Q12_Part_5_TEXT,Q13_OTHER_TEXT,Q14_OTHER_TEXT,Q15_OTHER_TEXT,Q16_OTHER_TEXT,Q17_OTHER_TEXT,Q18_OTHER_TEXT,Q19_OTHER_TEXT,Q1_OTHER_TEXT,Q20_OTHER_TEXT,Q21_OTHER_TEXT,Q22_OTHER_TEXT,Q27_OTHER_TEXT,Q28_OTHER_TEXT,Q29_OTHER_TEXT,Q30_OTHER_TEXT,Q31_OTHER_TEXT,Q32_OTHER,Q33_OTHER_TEXT,Q34_OTHER_TEXT,Q35_OTHER_TEXT,Q36_OTHER_TEXT,Q37_OTHER_TEXT,Q38_OTHER_TEXT,Q42_OTHER_TEXT,Q49_OTHER_TEXT,Q50_OTHER_TEXT,Q6_OTHER_TEXT,Q7_OTHER_TEXT
0,Select any activities that make up an importan...,What is the primary tool that you use at work ...,What is the primary tool that you use at work ...,What is the primary tool that you use at work ...,What is the primary tool that you use at work ...,What is the primary tool that you use at work ...,What is the primary tool that you use at work ...,Which of the following integrated development ...,Which of the following hosted notebooks have y...,Which of the following cloud computing service...,What programming languages do you use on a reg...,What specific programming language do you use ...,What programming language would you recommend ...,What machine learning frameworks have you used...,What is your gender? - Prefer to self-describe...,Of the choices that you selected in the previo...,What data visualization libraries or tools hav...,Of the choices that you selected in the previo...,Which of the following cloud computing product...,Which of the following machine learning produc...,Which of the following relational database pro...,Which of the following big data and analytics ...,Which types of data do you currently interact ...,What is the type of data that you currently in...,Where do you find public datasets? (Select all...,During a typical data science project at work ...,What percentage of your current machine learni...,On which online platforms have you begun or co...,On which online platform have you spent the mo...,Who/what are your favorite media sources that ...,What metrics do you or your organization use t...,What tools and methods do you use to make your...,What barriers prevent you from making your wor...,Select the title most similar to your current ...,In what industry is your current employer/cont...
1,,,,,,"Jupyter Notebooks, Pycharm, Intelijidea",,,,,,,,,,,,,,,,,,,,0.0,,mlcourse.ai,,ods.ai,,,,,
2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,,anaconda,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,
4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


I saw that the first row of our choice dataframe contains the questions, to let's extract that.

In [7]:
questions = choice_df.iloc[0]
choice_df = choice_df.drop(0)

In [8]:
questions.head(20)

Time from Start to Finish (seconds)                                Duration (in seconds)
Q1                                                What is your gender? - Selected Choice
Q1_OTHER_TEXT                          What is your gender? - Prefer to self-describe...
Q2                                                           What is your age (# years)?
Q3                                             In which country do you currently reside?
Q4                                     What is the highest level of formal education ...
Q5                                     Which best describes your undergraduate major?...
Q6                                     Select the title most similar to your current ...
Q6_OTHER_TEXT                          Select the title most similar to your current ...
Q7                                     In what industry is your current employer/cont...
Q7_OTHER_TEXT                          In what industry is your current employer/cont...
Q8                   

## Preparation

here we perform tasks to prepare the data in a more pleasing format.

### Data Types

Before we do anything with our data, it is good to see if our data types are in order

In [9]:
choice_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23859 entries, 1 to 23859
Columns: 395 entries, Time from Start to Finish (seconds) to Q50_OTHER_TEXT
dtypes: object(395)
memory usage: 72.1+ MB


Seems there are to many too show, so we have to do some manual work, The first 10 questions seem to be about personal info, where the first one is about gender

In [10]:
print(questions.Q1)
choice_df.Q1.value_counts()

What is your gender? - Selected Choice


Male                       19430
Female                      4010
Prefer not to say            340
Prefer to self-describe       79
Name: Q1, dtype: int64

In [11]:
print(questions.Q1_OTHER_TEXT)
choice_df.Q1_OTHER_TEXT.unique()

What is your gender? - Prefer to self-describe - Text


array(['-1', '2', '3', '4', '5', '6', -1, 7, 8, 9, 10, 11, 12, 13, 14, 15,
       16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 4,
       32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
       49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
       66, 67], dtype=object)

Hmm the self-describe seems to already been encoded, as there are so many different answers I would opt to ignore those results as they only take up 79 answers of all 24k.
For the second question I am going to convert it to an ordinal value, this way we know the order of the categories.

In [12]:
choice_df.Q2 = choice_df.Q2.astype(pd.api.types.CategoricalDtype(categories=['18-21', '22-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-69', '70-79', '80+'], ordered=True))
print(questions.Q2)
choice_df.Q2

What is your age (# years)?


1        45-49
2        30-34
3        30-34
4        35-39
5        22-24
         ...  
23855    45-49
23856    25-29
23857    22-24
23858    25-29
23859    25-29
Name: Q2, Length: 23859, dtype: category
Categories (12, object): ['18-21' < '22-24' < '25-29' < '30-34' ... '55-59' < '60-69' < '70-79' < '80+']

Next we have a few very important questions that signify the situation of each user in our survey. I chose for nominal categories as I don't want to be biased.

In [13]:
print(questions.Q6)
choice_df.Q6.value_counts()

Select the title most similar to your current role (or most recent title if retired): - Selected Choice


Student                    5253
Data Scientist             4137
Software Engineer          3130
Data Analyst               1922
Other                      1322
Research Scientist         1189
Not employed                842
Consultant                  785
Business Analyst            772
Data Engineer               737
Research Assistant          600
Manager                     590
Product/Project Manager     428
Chief Officer               360
Statistician                237
DBA/Database Engineer       145
Developer Advocate          117
Marketing Analyst           115
Salesperson                 102
Principal Investigator       97
Data Journalist              20
Name: Q6, dtype: int64

In [14]:
print(questions[['Q3', 'Q4', 'Q5', 'Q6', 'Q7']])
choice_df[['Q3', 'Q4', 'Q5', 'Q6', 'Q7']] = choice_df[['Q3', 'Q4', 'Q5', 'Q6', 'Q7']].astype('category')

Q3            In which country do you currently reside?
Q4    What is the highest level of formal education ...
Q5    Which best describes your undergraduate major?...
Q6    Select the title most similar to your current ...
Q7    In what industry is your current employer/cont...
Name: 0, dtype: object


Question 8 is about experience, or as they call it tenure. Not as a numerical value but in categories, so again I create an ordinal category from it.

In [15]:
print(questions.Q8)
choice_df.Q8.value_counts()

How many years of experience do you have in your current role?


0-1      5898
1-2      3745
2-3      2577
5-10     2524
3-4      1751
10-15    1512
4-5      1488
15-20     854
20-25     384
30 +      197
25-30     171
Name: Q8, dtype: int64

In [16]:
choice_df.Q8 = choice_df.Q8.astype(pd.api.types.CategoricalDtype(categories=['0-1', '1-2', '2-3', '3-4', '4-50', '5-10', '10-15', '15-20', '20-25', '25-30', '30+'], ordered=True))
print(questions.Q8)
choice_df.Q8

How many years of experience do you have in your current role?


1         NaN
2        5-10
3         0-1
4         NaN
5         0-1
         ... 
23855    5-10
23856     NaN
23857     0-1
23858     NaN
23859     NaN
Name: Q8, Length: 23859, dtype: category
Categories (11, object): ['0-1' < '1-2' < '2-3' < '3-4' ... '15-20' < '20-25' < '25-30' < '30+']

And not to forget we have the salary, again as a category, which is unfortunate since we could have been able to create a more accurate prediction in the end.
Here I opt for an ordinal category.

In [17]:
choice_df.Q9.value_counts()

I do not wish to disclose my approximate yearly compensation    4756
0-10,000                                                        4398
10-20,000                                                       1937
20-30,000                                                       1395
30-40,000                                                       1119
40-50,000                                                        965
50-60,000                                                        919
100-125,000                                                      843
60-70,000                                                        729
70-80,000                                                        677
90-100,000                                                       566
125-150,000                                                      533
80-90,000                                                        506
150-200,000                                                      457
200-250,000                       

In [18]:
choice_df.Q9 = choice_df.Q9.astype(pd.api.types.CategoricalDtype(categories=['0-10,000', '10-20,000', '20-30,000', '30-40,000', '40-50,000', '50-60,000', '60-70,000', '70-80,000', '80-90,000', '90-100,000', '100-125,000', '125-150,000', '150-200,000', '200-250,000', '250-300,000', '300-400,000', '400-500,000', '500,000+',], ordered=True))
choice_df.Q9

1                NaN
2          10-20,000
3           0-10,000
4                NaN
5           0-10,000
            ...     
23855    250-300,000
23856            NaN
23857      10-20,000
23858            NaN
23859            NaN
Name: Q9, Length: 23859, dtype: category
Categories (18, object): ['0-10,000' < '10-20,000' < '20-30,000' < '30-40,000' ... '250-300,000' <
                          '300-400,000' < '400-500,000' < '500,000+']

### Missing values

for each dataframe we apply a few checks in order to see the quality of data

In [19]:
print(100*choice_df.isna().sum().head(20)/choice_df.shape[0])

Time from Start to Finish (seconds)     0.000000
Q1                                      0.000000
Q1_OTHER_TEXT                           0.000000
Q2                                      0.000000
Q3                                      0.000000
Q4                                      1.764533
Q5                                      3.822457
Q6                                      4.019448
Q6_OTHER_TEXT                           0.000000
Q7                                      9.111866
Q7_OTHER_TEXT                           0.000000
Q8                                     18.621904
Q9                                     35.332579
Q10                                    13.370217
Q11_Part_1                             60.048619
Q11_Part_2                             77.027537
Q11_Part_3                             78.066977
Q11_Part_4                             69.684396
Q11_Part_5                             79.320173
Q11_Part_6                             85.452031
dtype: float64


You can clearly see that there are a lot of missing values, for questions 11 and onwards this is just because they did not check that answer on a question, but for 1-10 this is a problem as these are 'mandatory' questions. I have no idea how to fill this in and salary is missing about 35%, pretty disastrous, but this is to be expected with user surveys.

Another problem we have here is trolls, there might have been persons that would just fill this in to mess with our data collection, I thought they might have been funny and answered a high salary.

In [20]:
choice_df[choice_df.Q9=='500,000+'].Q2.value_counts()

25-29    13
35-39    10
80+       7
30-34     7
50-54     6
45-49     5
55-59     4
22-24     4
18-21     4
60-69     3
70-79     0
40-44     0
Name: Q2, dtype: int64

you can see there are 13 persons between 25-29 that earn more than 500k annually, which i think is near impossible. Let us see what they are upto.

In [21]:
choice_df[(choice_df.Q9=='500,000+') & (choice_df.Q2=='25-29')]

Unnamed: 0,Time from Start to Finish (seconds),Q1,Q1_OTHER_TEXT,Q2,Q3,Q4,Q5,Q6,Q6_OTHER_TEXT,Q7,Q7_OTHER_TEXT,Q8,Q9,Q10,Q11_Part_1,Q11_Part_2,Q11_Part_3,Q11_Part_4,Q11_Part_5,Q11_Part_6,Q11_Part_7,Q11_OTHER_TEXT,Q12_MULTIPLE_CHOICE,Q12_Part_1_TEXT,Q12_Part_2_TEXT,Q12_Part_3_TEXT,Q12_Part_4_TEXT,Q12_Part_5_TEXT,Q12_OTHER_TEXT,Q13_Part_1,Q13_Part_2,Q13_Part_3,Q13_Part_4,Q13_Part_5,Q13_Part_6,Q13_Part_7,Q13_Part_8,Q13_Part_9,Q13_Part_10,Q13_Part_11,...,Q46,Q47_Part_1,Q47_Part_2,Q47_Part_3,Q47_Part_4,Q47_Part_5,Q47_Part_6,Q47_Part_7,Q47_Part_8,Q47_Part_9,Q47_Part_10,Q47_Part_11,Q47_Part_12,Q47_Part_13,Q47_Part_14,Q47_Part_15,Q47_Part_16,Q48,Q49_Part_1,Q49_Part_2,Q49_Part_3,Q49_Part_4,Q49_Part_5,Q49_Part_6,Q49_Part_7,Q49_Part_8,Q49_Part_9,Q49_Part_10,Q49_Part_11,Q49_Part_12,Q49_OTHER_TEXT,Q50_Part_1,Q50_Part_2,Q50_Part_3,Q50_Part_4,Q50_Part_5,Q50_Part_6,Q50_Part_7,Q50_Part_8,Q50_OTHER_TEXT
2322,561,Prefer to self-describe,7,25-29,France,I prefer not to answer,Other,Other,113,I am a student,-1,,"500,000+",I do not know,,,,,,,Other,56,Other,-1,-1,-1,-1,-1,158,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
8899,2116,Male,-1,25-29,Philippines,Bachelor’s degree,Engineering (non-computer focused),Data Analyst,-1,Accounting/Finance,-1,5-10,"500,000+",We are exploring ML methods (and may one day p...,Analyze and understand data to influence produ...,,,,,,,-1,"Advanced statistical software (SPSS, SAS, etc.)",-1,254,-1,-1,-1,-1,Jupyter/IPython,RStudio,,,,,MATLAB,,,,,...,20-30,Examine individual model coefficients,Examine feature correlations,Examine feature importances,,,,,Plot predicted vs. actual results,,,,,,,,,I am confident that I can understand and expla...,,,,,,Make sure the code is well documented,Make sure the code is human-readable,,,,,,-1,,,,,,,,Other,260
12092,1607,Male,-1,25-29,China,Doctoral degree,"Information technology, networking, or system ...",Data Scientist,-1,Computers/Technology,-1,2-3,"500,000+","We have well established ML methods (i.e., mod...",,,,,Do research that advances the state of the art...,,,-1,"Basic statistical software (Microsoft Excel, G...",736,-1,-1,-1,-1,-1,Jupyter/IPython,,PyCharm,Visual Studio Code,,,MATLAB,Visual Studio,Notepad++,Sublime Text,,...,40-50,,,,,,Dimensionality reduction techniques,,Plot predicted vs. actual results,,,,,,,,,"I view ML models as ""black boxes"" but I am con...",Share code on Github or a similar code-sharing...,,,"Share data, code, and environment using contai...",,,Make sure the code is human-readable,Define all random seeds,,Include a text file describing all dependencies,,,-1,,Too time-consuming,Requires too much technical knowledge,,,,,,-1
13468,5487,Male,-1,25-29,India,Bachelor’s degree,"Computer science (software engineering, etc.)",Data Scientist,-1,Computers/Technology,-1,,"500,000+","We have well established ML methods (i.e., mod...",Analyze and understand data to influence produ...,Build and/or run a machine learning service th...,Build and/or run the data infrastructure that ...,Build prototypes to explore applying machine l...,Do research that advances the state of the art...,,,-1,"Cloud-based data software & APIs (AWS, GCP, Az...",-1,-1,-1,-1,374,-1,Jupyter/IPython,RStudio,PyCharm,Visual Studio Code,nteract,Atom,MATLAB,Visual Studio,Notepad++,Sublime Text,Vim,...,90-100,Examine individual model coefficients,Examine feature correlations,Examine feature importances,Plot decision boundaries,Create partial dependence plots,Dimensionality reduction techniques,Attention mapping/saliency mapping,Plot predicted vs. actual results,Print out a decision tree,Sensitivity analysis/perturbation importance,LIME functions,ELI5 functions,SHAP functions,,,,I am confident that I can explain the outputs ...,Share code on Github or a similar code-sharing...,Share both data and code on Github or a simila...,"Share data, code, and environment using a host...","Share data, code, and environment using contai...","Share code, data, and environment using virtua...",Make sure the code is well documented,Make sure the code is human-readable,Define all random seeds,Define relative rather than absolute file paths,Include a text file describing all dependencies,,,-1,Too expensive,Too time-consuming,Requires too much technical knowledge,Afraid that others will use my work without gi...,Not enough incentives to share my work,I had never considered making my work easier f...,,,-1
14367,359331,Male,-1,25-29,Kenya,Bachelor’s degree,"Medical or life sciences (biology, chemistry, ...",Data Scientist,-1,Computers/Technology,-1,1-2,"500,000+",We are exploring ML methods (and may one day p...,Analyze and understand data to influence produ...,,Build and/or run the data infrastructure that ...,,,,,-1,Local or hosted development environments (RStu...,-1,-1,-1,2773,-1,-1,Jupyter/IPython,RStudio,,,,,,,,Sublime Text,,...,40-50,,,,Plot decision boundaries,,,,Plot predicted vs. actual results,Print out a decision tree,,,,,,,,I am confident that I can understand and expla...,,Share both data and code on Github or a simila...,"Share data, code, and environment using a host...",,,Make sure the code is well documented,Make sure the code is human-readable,,,Include a text file describing all dependencies,,Other,166,,,,,,,None of these reasons apply to me,,-1
15469,68,Male,-1,25-29,United States of America,Master’s degree,Mathematics or statistics,Research Scientist,-1,Accounting/Finance,-1,3-4,"500,000+",,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
15825,94,Female,-1,25-29,United States of America,Master’s degree,"Computer science (software engineering, etc.)",Business Analyst,-1,Computers/Technology,-1,0-1,"500,000+",We are exploring ML methods (and may one day p...,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
16404,78,Prefer not to say,-1,25-29,United States of America,No formal education past high school,,Consultant,-1,Hospitality/Entertainment/Sports,-1,5-10,"500,000+",We use ML methods for generating insights (but...,,Build and/or run a machine learning service th...,,,,,,-1,"Basic statistical software (Microsoft Excel, G...",921,-1,-1,-1,-1,-1,,,,,nteract,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
18120,281,Female,-1,25-29,Colombia,Doctoral degree,"Computer science (software engineering, etc.)",Data Scientist,-1,Computers/Technology,-1,2-3,"500,000+",We are exploring ML methods (and may one day p...,,,,Build prototypes to explore applying machine l...,,,,-1,Local or hosted development environments (RStu...,-1,-1,-1,5,-1,-1,Jupyter/IPython,,,,,,,,,,Vim,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
20576,197,Prefer to self-describe,65,25-29,Belgium,Master’s degree,"Computer science (software engineering, etc.)",Student,-1,I am a student,-1,,"500,000+",We are exploring ML methods (and may one day p...,,Build and/or run a machine learning service th...,,Build prototypes to explore applying machine l...,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1


No way they are this succesfull, i'm not yet going to remove them, but i'm definitely going to keep this in mind, this might break our predictions!

Later on I will remove the entries without salaries, but im going to keep them in a prediction dataframe, so we could perhaps predict their salary, we don't have a reference but still might be interesting. For the rest of the preparation im going to keep them in here so the final format of both train and prediction are the same.

### Duplicates

It is very highly unlikely but just to check if no one has entered the same survey twice, we check for duplicates

In [22]:
choice_df[choice_df.duplicated()]

Unnamed: 0,Time from Start to Finish (seconds),Q1,Q1_OTHER_TEXT,Q2,Q3,Q4,Q5,Q6,Q6_OTHER_TEXT,Q7,Q7_OTHER_TEXT,Q8,Q9,Q10,Q11_Part_1,Q11_Part_2,Q11_Part_3,Q11_Part_4,Q11_Part_5,Q11_Part_6,Q11_Part_7,Q11_OTHER_TEXT,Q12_MULTIPLE_CHOICE,Q12_Part_1_TEXT,Q12_Part_2_TEXT,Q12_Part_3_TEXT,Q12_Part_4_TEXT,Q12_Part_5_TEXT,Q12_OTHER_TEXT,Q13_Part_1,Q13_Part_2,Q13_Part_3,Q13_Part_4,Q13_Part_5,Q13_Part_6,Q13_Part_7,Q13_Part_8,Q13_Part_9,Q13_Part_10,Q13_Part_11,...,Q46,Q47_Part_1,Q47_Part_2,Q47_Part_3,Q47_Part_4,Q47_Part_5,Q47_Part_6,Q47_Part_7,Q47_Part_8,Q47_Part_9,Q47_Part_10,Q47_Part_11,Q47_Part_12,Q47_Part_13,Q47_Part_14,Q47_Part_15,Q47_Part_16,Q48,Q49_Part_1,Q49_Part_2,Q49_Part_3,Q49_Part_4,Q49_Part_5,Q49_Part_6,Q49_Part_7,Q49_Part_8,Q49_Part_9,Q49_Part_10,Q49_Part_11,Q49_Part_12,Q49_OTHER_TEXT,Q50_Part_1,Q50_Part_2,Q50_Part_3,Q50_Part_4,Q50_Part_5,Q50_Part_6,Q50_Part_7,Q50_Part_8,Q50_OTHER_TEXT
15278,36,Male,-1,18-21,China,,,,-1,,-1,,,,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
15865,23,Male,-1,18-21,United States of America,,,,-1,,-1,,,,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
17521,36,Male,-1,25-29,United States of America,Master’s degree,,,-1,,-1,,,,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
18257,27,Male,-1,25-29,Brazil,,,,-1,,-1,,,,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
18320,46,Male,-1,35-39,United States of America,,,,-1,,-1,,,,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
18966,43,Male,-1,18-21,India,Bachelor’s degree,,,-1,,-1,,,,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
21214,106,Male,-1,18-21,India,Bachelor’s degree,"Computer science (software engineering, etc.)",Student,-1,I am a student,-1,,,,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
21916,45,Male,-1,22-24,China,,,,-1,,-1,,,,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
22049,46,Male,-1,25-29,China,,,,-1,,-1,,,,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
22638,60,Male,-1,25-29,China,,,,-1,,-1,,,,,,,,,,,-1,,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1


I take back my words, seems there are some faulty entries, perhaps we should even improve our bad entry detection? For now im just going to remove duplicates

In [23]:
choice_df = choice_df.drop_duplicates()

At this point im going to seperate the non salary entries from our training dataframe. resulting in 2 partitions:
- train_df
- prediction_df

In [24]:
prediction_df = choice_df[(choice_df.Q9.isna()) | (choice_df.Q9=='I do not wish to disclose my approximate yearly compensation')]
train_df = choice_df.drop(prediction_df.index)
print('prediction shape:' + str(prediction_df.shape))
print('remaining shape:' + str(train_df.shape))

prediction shape:(8418, 395)
remaining shape:(15429, 395)


## Processing

For other questions I selected a few that caught my interest, here is the list that made it. Notice that I did not perform any preparation on these question as they mostly are checkmarks on a survey, yet in processing I am going to create a more convenient method to store them.

In [25]:
print(questions.Q11_Part_1)
#print(questions.Q12_Part_1_TEXT)
print(questions.Q13_Part_1)
print(questions.Q16_Part_1)
print(questions.Q17)
print(questions.Q19_Part_1)
print(questions.Q21_Part_1)
print(questions.Q31_Part_1)
print(questions.Q34_Part_1)
print(questions.Q42_Part_1)
print(questions.Q49_Part_1)

Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Analyze and understand data to influence product or business decisions
Which of the following integrated development environments (IDE's) have you used at work or school in the last 5 years? (Select all that apply) - Selected Choice - Jupyter/IPython
What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python
What specific programming language do you use most often? - Selected Choice
What machine learning frameworks have you used in the past 5 years? (Select all that apply) - Selected Choice - Scikit-Learn
What data visualization libraries or tools have you used in the past 5 years? (Select all that apply) - Selected Choice - ggplot2
Which types of data do you currently interact with most often at work or school? (Select all that apply) - Selected Choice - Audio Data
During a typical data science project at work or schoo

### One hot encoding questions
What I will do here is create a makeshift database, not in SQL as usually just to keep it simple, but in a dictionary of dataframes. For each question I will take the answers and create a one hot encoded table from them, for each user we will know which checkmarks they marked and which they didn't. This view makes it easier to apply statistics and machine learning to the data.

In [26]:
answer_dfs = {}
for question in ['Q11', 'Q13', 'Q16', 'Q19', 'Q21', 'Q31', 'Q34', 'Q42', 'Q49']:
  
  choices = train_df[train_df.columns[train_df.columns.str.contains(question)][:-1]].notnull().astype(int)
  choices.columns = questions[questions.index.str.contains(question)][:-1].str.split(' - ').apply(lambda x: x[-1]).values
  answer_dfs[question] = choices

an example of a question, Q13: Which IDE's have you used in the last 5 years?

In [27]:
answer_dfs['Q13']

Unnamed: 0,Jupyter/IPython,RStudio,PyCharm,Visual Studio Code,nteract,Atom,MATLAB,Visual Studio,Notepad++,Sublime Text,Vim,IntelliJ,Spyder,None,Other
2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
5,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
8,1,0,1,0,0,1,0,1,1,1,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23844,1,0,1,0,0,0,0,0,0,1,1,1,0,0,0
23845,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
23854,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
23855,1,1,1,0,0,0,1,0,0,1,0,1,0,0,0


for some reason they did Q17 differently, so we have to one hot encode it in another method.

In [28]:
answer_dfs['Q17'] = pd.get_dummies(train_df[train_df.columns[train_df.columns.str.contains('Q17')][:-1]])
answer_dfs['Q17']

Unnamed: 0,Q17_Bash,Q17_C#/.NET,Q17_C/C++,Q17_Go,Q17_Java,Q17_Javascript/Typescript,Q17_Julia,Q17_MATLAB,Q17_Other,Q17_PHP,Q17_Python,Q17_R,Q17_Ruby,Q17_SAS/STATA,Q17_SQL,Q17_Scala,Q17_Visual Basic/VBA
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23844,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
23845,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
23854,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
23855,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


That was for our choices data, where the questions are based on choices, for generic info we do it a bit different, we create a general dataframe containing all info.

In [29]:
info_df = train_df[['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10']]
info_df.columns = questions[['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10']]

In [30]:
info_df

Unnamed: 0,What is your gender? - Selected Choice,What is your age (# years)?,In which country do you currently reside?,What is the highest level of formal education that you have attained or plan to attain within the next 2 years?,Which best describes your undergraduate major? - Selected Choice,Select the title most similar to your current role (or most recent title if retired): - Selected Choice,In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice,How many years of experience do you have in your current role?,What is your current yearly compensation (approximate $USD)?,Does your current employer incorporate machine learning methods into their business?
2,Male,30-34,Indonesia,Bachelor’s degree,Engineering (non-computer focused),Other,Manufacturing/Fabrication,5-10,"10-20,000",No (we do not use ML methods)
3,Female,30-34,United States of America,Master’s degree,"Computer science (software engineering, etc.)",Data Scientist,I am a student,0-1,"0-10,000",I do not know
5,Male,22-24,India,Master’s degree,Mathematics or statistics,Data Analyst,I am a student,0-1,"0-10,000",I do not know
7,Male,35-39,Chile,Doctoral degree,"Information technology, networking, or system ...",Other,Academics/Education,10-15,"10-20,000",No (we do not use ML methods)
8,Male,18-21,India,Master’s degree,"Information technology, networking, or system ...",Other,Other,0-1,"0-10,000","We recently started using ML methods (i.e., mo..."
...,...,...,...,...,...,...,...,...,...,...
23844,Male,30-34,Netherlands,Master’s degree,"Computer science (software engineering, etc.)",Software Engineer,Computers/Technology,10-15,"90-100,000",We are exploring ML methods (and may one day p...
23845,Male,22-24,Romania,Master’s degree,Mathematics or statistics,Student,I am a student,0-1,"0-10,000",
23854,Male,30-34,Turkey,Doctoral degree,"Computer science (software engineering, etc.)",Research Assistant,Academics/Education,5-10,"10-20,000",
23855,Male,45-49,France,Doctoral degree,"Computer science (software engineering, etc.)",Chief Officer,Computers/Technology,5-10,"250-300,000","We recently started using ML methods (i.e., mo..."


### Mean choice Matrix
As we have so much information to process, I opted to keep it dynamic, the following function helps in that, it calculates for a question from our choice database the mean occurence for each group in a feature of the info dataframe.
Let's say we want to know the average amount of persons that know a specific language for each role/job title. We would have to match Q16 (known languages) with Q6 (job description). This is performed below, notice how it both performs a merge (join) and a groupby to get the result.

In [31]:
def mean_matrix(info, question):
  return info_df[[questions[info]]].join(answer_dfs[question]).groupby(questions[info]).mean()

In [32]:
mean_matrix('Q6','Q16')

Unnamed: 0_level_0,Python,R,SQL,Bash,Java,Javascript/Typescript,Visual Basic/VBA,C/C++,MATLAB,Scala,Julia,Go,C#/.NET,PHP,Ruby,SAS/STATA,None,Other
Select the title most similar to your current role (or most recent title if retired): - Selected Choice,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Business Analyst,0.605085,0.401695,0.547458,0.049153,0.079661,0.084746,0.189831,0.069492,0.030508,0.016949,0.00339,0.005085,0.042373,0.035593,0.00678,0.091525,0.054237,0.027119
Chief Officer,0.717131,0.2749,0.430279,0.191235,0.183267,0.306773,0.083665,0.139442,0.091633,0.067729,0.015936,0.059761,0.103586,0.103586,0.043825,0.035857,0.023904,0.055777
Consultant,0.692845,0.413613,0.472949,0.116928,0.136126,0.165794,0.111693,0.078534,0.062827,0.024433,0.010471,0.012216,0.082024,0.052356,0.012216,0.073298,0.031414,0.041885
DBA/Database Engineer,0.666667,0.307692,0.717949,0.188034,0.213675,0.128205,0.068376,0.102564,0.051282,0.025641,0.0,0.017094,0.179487,0.042735,0.008547,0.025641,0.008547,0.017094
Data Analyst,0.647059,0.484594,0.586134,0.079132,0.093137,0.098039,0.110644,0.079132,0.067227,0.028711,0.005602,0.006303,0.039916,0.032213,0.014006,0.105042,0.011905,0.018908
Data Engineer,0.801418,0.248227,0.510638,0.223404,0.246454,0.184397,0.053191,0.132979,0.08156,0.161348,0.008865,0.033688,0.086879,0.054965,0.015957,0.033688,0.003546,0.028369
Data Journalist,0.6,0.4,0.4,0.2,0.2,0.4,0.0,0.1,0.2,0.1,0.0,0.0,0.2,0.0,0.0,0.1,0.0,0.0
Data Scientist,0.860265,0.429671,0.511542,0.193906,0.100031,0.098492,0.043398,0.105571,0.076023,0.068637,0.014158,0.018159,0.04032,0.024007,0.008618,0.058172,0.001847,0.023084
Developer Advocate,0.611765,0.211765,0.482353,0.094118,0.341176,0.4,0.094118,0.141176,0.035294,0.035294,0.0,0.023529,0.211765,0.141176,0.023529,0.011765,0.0,0.058824
Manager,0.681416,0.384956,0.455752,0.117257,0.126106,0.137168,0.139381,0.097345,0.050885,0.024336,0.011062,0.011062,0.075221,0.044248,0.022124,0.077434,0.044248,0.028761


We can see that for each combination of job title and programming language an average between 0 and 1 persons have checked this option, e.g. the combination of data scientist and python equals 0.86, meaning that 86% of data scientists know python. 

Similarly we can also calculate correlation between choices from our choice database, here we did it again for Question 16.

In [33]:
answer_dfs['Q16'].corr()

Unnamed: 0,Python,R,SQL,Bash,Java,Javascript/Typescript,Visual Basic/VBA,C/C++,MATLAB,Scala,Julia,Go,C#/.NET,PHP,Ruby,SAS/STATA,None,Other
Python,1.0,0.077293,0.191304,0.188435,0.141813,0.125164,0.00755,0.183621,0.131117,0.095374,0.039096,0.05318,0.046182,0.048012,0.031308,-0.009036,-0.20652,0.009641
R,0.077293,1.0,0.223527,0.032511,-0.034205,-0.039002,0.098949,-0.049046,0.030446,0.033991,0.063129,-0.016114,-0.039415,-0.006318,0.014921,0.198183,-0.085536,-0.00124
SQL,0.191304,0.223527,1.0,0.161086,0.13589,0.192323,0.159062,-0.034188,-0.047761,0.11762,0.013813,0.048231,0.134615,0.157483,0.056242,0.120958,-0.101752,0.001674
Bash,0.188435,0.032511,0.161086,1.0,0.078031,0.146723,-0.026907,0.082853,0.010577,0.116862,0.058785,0.104544,0.000396,0.054351,0.10355,-0.032248,-0.049637,0.056457
Java,0.141813,-0.034205,0.13589,0.078031,1.0,0.254773,0.024432,0.227691,0.064536,0.16549,0.005821,0.073318,0.137888,0.177413,0.067257,-0.040728,-0.056888,0.026739
Javascript/Typescript,0.125164,-0.039002,0.192323,0.146723,0.254773,1.0,0.04929,0.095921,-0.004829,0.060073,0.025601,0.118897,0.222775,0.307413,0.127312,-0.048567,-0.052961,0.034497
Visual Basic/VBA,0.00755,0.098949,0.159062,-0.026907,0.024432,0.04929,1.0,0.020796,0.019424,0.005185,0.037063,0.002042,0.122287,0.077126,0.011487,0.093618,-0.032002,-0.004965
C/C++,0.183621,-0.049046,-0.034188,0.082853,0.227691,0.095921,0.020796,1.0,0.260311,0.004697,0.04998,0.048903,0.13485,0.111623,0.041745,-0.052866,-0.058636,0.019235
MATLAB,0.131117,0.030446,-0.047761,0.010577,0.064536,-0.004829,0.019424,0.260311,1.0,0.003116,0.056403,0.003529,0.029772,0.046232,0.014027,0.015261,-0.044376,-4.3e-05
Scala,0.095374,0.033991,0.11762,0.116862,0.16549,0.060073,0.005185,0.004697,0.003116,1.0,0.049863,0.07711,0.006505,0.021619,0.066129,0.012633,-0.02578,0.014499


Here we see thich answers are checked usually together or not, as an example we see that python and SQL have a correlation of 19% whilst Python and R have a correlation of 7.7% which is logical as Python and R have a similar purpose and SQL is complementary. Obviously None is always negatively correlated, a good example of obsolete information!

### Count matrix
to correlate information between 2 questions of the info dataframe, we create a function that counts the occurence of each combination. An example is given for question 2 (age) and Question 7 (industry). With this information we can find out if there is a correlation between information of our users in the survey, not specifically their choices on the multiple choice answers.

In [34]:
def count_matrix(q1, q2):
  return info_df[[questions[q1], questions[q2]]].groupby([questions[q1], questions[q2]]).size().unstack()

In [35]:
count_matrix('Q2', 'Q7')

In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice,Academics/Education,Accounting/Finance,Broadcasting/Communications,Computers/Technology,Energy/Mining,Government/Public Service,Hospitality/Entertainment/Sports,I am a student,Insurance/Risk Assessment,Manufacturing/Fabrication,Marketing/CRM,Medical/Pharmaceutical,Military/Security/Defense,Non-profit/Service,Online Business/Internet-based Sales,Online Service/Internet-based Services,Other,Retail/Sales,Shipping/Transportation
What is your age (# years)?,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
18-21,117,27,4,194,8,9,3,869,12,7,11,16,9,2,12,20,27,9,7
22-24,322,198,38,786,38,37,22,811,56,41,71,76,33,16,43,123,98,59,26
25-29,539,361,86,1250,97,118,39,424,131,111,124,179,31,50,107,227,227,110,67
30-34,364,252,66,795,81,125,34,89,109,96,83,127,25,40,62,162,169,81,59
35-39,243,141,56,454,44,74,25,32,73,62,50,76,22,17,23,86,79,44,36
40-44,146,82,33,279,40,48,14,12,30,40,22,37,14,13,14,51,69,25,20
45-49,91,47,24,195,13,40,5,5,15,32,12,27,4,9,5,24,34,14,6
50-54,71,38,5,123,8,34,3,2,16,21,4,21,6,5,4,10,18,5,9
55-59,35,18,4,64,7,17,5,3,7,12,6,17,2,1,1,4,10,6,3
60-69,40,14,3,32,8,20,1,0,4,12,1,14,3,0,1,8,13,2,4
