## How about being a self-taught Data Scientist after completing a Doctoral degree?

### Table of Contents

<ul>
    <li><a href='#intro'>Introduction</a></li>
    <li><a href='#wrangle'>Data Wrangling</a></li>
    <li><a href='#eda'>Exploratory Data Analysis</a></li>
    <li><a href='#conclusion'>Conclusion</a></li>
</ul>

<a id='intro'></a>
## Introduction

Hi there! I am a neuroscience graduate student teaching myself data science, and I plan to become a data scientist after graduation. Before that, I'd like to use this survey to explore my future career and my future colleagues.

[Data science](https://en.wikipedia.org/wiki/Data_science#:~:text=Data%20science%20is%20a%20%22concept,domain%20knowledge%20and%20information%20science.), an interdisciplinary field, is challenging yet attractive to people from different backgrounds. A Master's degree in computer science or statistics teaches enough knowledge and skills to step into the field. However, it's never too late to start anything! A considerable number of DS practitioners hold a Doctoral degree. It is not hard to understand why someone would pursue a higher degree in a related major, or switch careers after earning a Doctoral degree in an unrelated one. Whether holding a Doctoral degree affects data-handling ability, employment, and salary, however, requires further analysis.

I'll approach the analysis from the following aspects:

<ul>
    <li><a href='#prop'>Proportion of practitioners with Doctoral degree</a></li>
    <li><a href='#gender'>Gender</a></li>
    <li><a href='#age'>Age and experience</a></li>
    <li><a href='#profession'>Profession</a></li>
    <li><a href='#salary'>Salary</a></li>
    <li><a href='#q23'>Important roles at work</a></li>
</ul>


I'm constantly updating this notebook. If you have any questions, comments, or suggestions, please feel free to leave a comment. :)

<a id='wrangle'></a>
## Data Wrangling

In [1]:
# import packages
## DataFrame
import numpy as np
import pandas as pd

## Visualization
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#plt.rcParams.update(plt.rcParamsDefault)

In [3]:
sns.choose_cubehelix_palette()


[[0.9312692223325372, 0.8201921796082118, 0.7971480974663592],
 [0.8888663743660877, 0.7106793139856472, 0.7158661451411206],
 [0.8314793143949643, 0.5987041921652179, 0.6530062709235388],
 [0.7588951019517731, 0.49817117746394224, 0.6058723814510268],
 [0.6672565752652589, 0.40671838146419587, 0.5620016466433286],
 [0.5529215689527474, 0.3217924564263954, 0.5093718054521851],
 [0.43082755198027817, 0.24984535814964698, 0.44393960899639856],
 [0.29794615023641036, 0.18145907625614888, 0.3531778140503475],
 [0.1750865648952205, 0.11840023306916837, 0.24215989137836502]]

In [4]:
edu_color = sns.cubehelix_palette(n_colors=15, start=2.8, rot=-0.1, gamma=2.0, hue=0.8, light=0.8, dark=0.3)
edu_cmap = sns.cubehelix_palette(n_colors=10, start=2.8, rot=-0.1, gamma=1.0, hue=0.7, light=1.0, dark=0.1, as_cmap=True)
all_cmap = sns.cubehelix_palette(n_colors=10, start=2.8, rot=-0.1, gamma=1.0, hue=0.1, light=1.0, dark=0.1, as_cmap=True)
all_cmap1 = sns.cubehelix_palette(n_colors=10, start=1.6, rot=0.3, gamma=1.0, hue=0.7, light=1.0, dark=0.1, as_cmap=True)

In [8]:
survey_2020 = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False, skiprows=[1])
survey_2019 = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv', low_memory=False, skiprows=[1])
survey_2018 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv', low_memory=False, skiprows=[1])

In [9]:
# Modify the column names
columns = {'Q1': 'Age', 
           'Q2': 'Gender', 
           'Q3': 'Country', 
           'Q4': 'Education', 
           'Q5': 'Title', 
           'Q6': 'Coding_exp', 
           'Q8': 'Recmd_language', 
           'Q15': 'ML_exp',
           'Q20': 'Company_size', 
           'Q24': 'Salary'}

survey_2019.rename(columns={'Q4': 'Education'}, inplace=True)
survey_2018.rename(columns={'Q4': 'Education'}, inplace=True)
survey_2020.rename(columns=columns, inplace=True)

In [10]:
survey_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20036 entries, 0 to 20035
Columns: 355 entries, Time from Start to Finish (seconds) to Q35_B_OTHER
dtypes: int64(1), object(354)
memory usage: 54.3+ MB


In [11]:
survey_2020.dtypes

Time from Start to Finish (seconds)     int64
Age                                    object
Gender                                 object
Country                                object
Education                              object
                                        ...  
Q35_B_Part_7                           object
Q35_B_Part_8                           object
Q35_B_Part_9                           object
Q35_B_Part_10                          object
Q35_B_OTHER                            object
Length: 355, dtype: object

In [None]:
# Convert the datatype of 'Education' column to an ordered category
Education = ['I prefer not to answer',
             'No formal education past high school', 
             'Some college/university study without earning a bachelor’s degree', 
             'Bachelor’s degree', 
             'Master’s degree', 
             'Doctoral degree', 
             'Professional degree']
edu = pd.api.types.CategoricalDtype(categories=Education, ordered=True)
survey_2020['Education'] = survey_2020['Education'].astype(edu)
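
As a small illustration of what the ordered dtype gives us, here is a sketch on made-up values (the toy Series and the shortened level list are assumptions for the example, not survey data). An ordered categorical sorts and compares by the declared level order rather than alphabetically, and values outside the declared categories become missing.

```python
import pandas as pd

# Toy education levels in a declared order (illustrative, not survey data)
levels = ['Bachelor’s degree', 'Master’s degree', 'Doctoral degree']
dtype = pd.api.types.CategoricalDtype(categories=levels, ordered=True)

toy = pd.Series(['Doctoral degree', 'Bachelor’s degree',
                 'Master’s degree', 'Bachelor’s degree']).astype(dtype)

# Sorting follows the declared order, not the alphabet
sorted_levels = toy.sort_values().tolist()

# Comparisons against a level are allowed because the dtype is ordered
at_least_master = (toy >= 'Master’s degree').tolist()

# Values outside the declared categories become NaN after the cast
unknown = pd.Series(['PhD']).astype(dtype)

print(sorted_levels)
print(at_least_master)
print(unknown.isna().tolist())
```

One practical consequence: a label spelled differently from the declared categories (for example a straight apostrophe instead of a curly one) would silently turn into NaN, so it is worth checking the missing-value count after the cast.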

In [None]:
survey_2020_doct = survey_2020.query('Education == "Doctoral degree"').copy()

In [None]:
survey_2020_doct

<a id='eda'></a>
## Exploratory Data Analysis

<a id='prop'></a>
## Distribution of education levels

In [None]:
# Percentage of respondents at each education level, per survey year
def education_share(df):
    # astype(str) drops the categorical dtype so all three years share the same index;
    # missing answers become the string 'nan' and are discarded by reindex
    counts = df['Education'].astype(str).value_counts().reindex(Education, fill_value=0)
    return counts / counts.sum() * 100

edu = pd.DataFrame({'2018': education_share(survey_2018),
                    '2019': education_share(survey_2019),
                    '2020': education_share(survey_2020)})
edu.index.name = 'Education'

In [None]:

# initiate a figure
fig = plt.figure(figsize=[12, 7])
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0])

# background color
bg_color = '#f7f7f7'
fig.patch.set_facecolor(bg_color)
ax0.set_facecolor(bg_color)

# bar color
edu_color = sns.cubehelix_palette(n_colors=15, start=2.8, rot=-0.1, gamma=0.9, hue=0.8, light=0.8, dark=0.3)

# plot the 'edu' DataFrame one year (column) at a time
x = np.arange(start=0, stop=13, step=2)
ax0.bar(x, edu['2018'], width=0.4, color=edu_color[2], edgecolor=(0, 0, 0), zorder=3, label='2018')
ax0.bar(x+0.4+0.06, edu['2019'], width=0.4, color=edu_color[6], edgecolor=(0, 0, 0), zorder=3, label='2019')
ax0.bar(x+0.4*2+0.06*2, edu['2020'], width=0.4, color=edu_color[10], edgecolor=(0, 0, 0), zorder=3, label='2020')

# Polish the plot
ax0.grid(which='major', axis='y', color='lightgrey', zorder=0)

for s in ['top','right','left']:
    ax0.spines[s].set_visible(False)

ax0.set_xticks(x+0.4+0.03)
edu_ticks = ['I prefer not to answer',
             'No formal education \n past high school', 
             'Some college/university \n study without earning \n a bachelor’s degree', 
             'Bachelor’s degree', 
             'Master’s degree', 
             'Doctoral degree', 
             'Professional degree']
ax0.set_xticklabels(edu_ticks, rotation=90, fontsize=10)

ax0.text(-1, 52,
         'Distribution of Education Levels (%)',
         fontsize=14, 
         fontweight='bold')

ax0.legend(title='Year', 
           fontsize=9,
           title_fontsize=10,
           bbox_to_anchor=[1, 0.88], 
           loc='lower left',
           framealpha=0);

**Here is the distribution of education levels in 2018, 2019, and 2020.**

Since the number of respondents differs each year, I used relative frequency instead of absolute frequency.

Bachelor's and Master's degrees account for the largest shares of respondents in all three years, though the share of Bachelor's degrees is increasing while that of Master's degrees is decreasing.

The Doctoral degree has the third-highest share, though it decreased slightly in 2020.
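
To make the point about relative frequency concrete, here is a minimal sketch on two made-up "survey years" of different sizes (the data and the names `year_a` and `year_b` are invented for illustration). `value_counts(normalize=True)` returns shares that sum to 1, so the two years can be compared even though one has twice as many respondents.

```python
import pandas as pd

# Two toy "survey years" of different sizes (made-up data)
year_a = pd.Series(['Master', 'Master', 'Bachelor', 'Doctoral'])
year_b = pd.Series(['Master'] * 4 + ['Bachelor'] * 3 + ['Doctoral'] * 1)

# normalize=True gives shares that sum to 1; multiply by 100 for percent
share_a = year_a.value_counts(normalize=True) * 100
share_b = year_b.value_counts(normalize=True) * 100

# Rows are aligned on the education label
comparison = pd.DataFrame({'year_a': share_a, 'year_b': share_b})
print(comparison.round(1))
```

In absolute counts, the larger year has at least as many respondents at every level, which would hide the fact that the Doctoral share is lower in it.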

<a id='gender'></a>
## Gender

In [None]:
survey_male = survey_2020.query('Gender == "Man"').copy()
survey_female = survey_2020.query('Gender ==  "Woman"').copy()

male_edu = survey_male.groupby('Education').size()
male_edu = male_edu/male_edu.sum()*100
female_edu = survey_female.groupby('Education').size()
female_edu = -female_edu/female_edu.sum()*100

In [None]:
fig = plt.figure(figsize=[12,7])
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0], ylim=[-50, 50])

bg_color = '#f7f7f7'
fig.patch.set_facecolor(bg_color)
ax0.set_facecolor(bg_color)

ax0.bar(male_edu.index, male_edu, color=edu_color[1], edgecolor=(0, 0, 0), zorder=2, label='Male')
ax0.bar(female_edu.index, female_edu, color=edu_color[8], edgecolor=(0,0,0), zorder=2, label='Female')

for i in male_edu.index:
    ax0.annotate('{:.2f}%'.format(male_edu[i]),
                xy=(i, male_edu[i]+3),
                va='center', 
                ha='center',
                alpha=0.5,
                fontsize=10)
for j in female_edu.index:
    ax0.annotate('{:.2f}%'.format(-female_edu[j]),
                xy=(j, female_edu[j]-3),
                va='center',
                ha='center',
                alpha=0.5,
                fontsize=10)
    
ticks, labels = plt.xticks()
ax0.set_xticklabels(edu_ticks,
                    rotation=90,
                    fontsize=10)

yticks = [-40, -20, 0, 20, 40]
plt.yticks(ticks = yticks, labels = np.abs(yticks))

for s in ['top', 'left', 'right']:
    ax0.spines[s].set_visible(False)
    
ax0.grid(which='major', axis='y', color='lightgrey', zorder=0)
ax0.axhline(c='black', lw=1, ls=':')

ax0.legend(title='Gender', 
           fontsize=10,
           title_fontsize=10,
           bbox_to_anchor=[1, 0.88], 
           loc='lower left',
           framealpha=0)

ax0.text(-1, 50, 
         'Distribution of Education Levels in Male and Female (%)', 
         fontsize=14, 
         fontweight='bold');

**This is the distribution of education levels among male and female respondents.**

Relative to the size of their own group, a larger share of female than male respondents hold a Master's or Doctoral degree.
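
Because the bars show percentages within each gender, a group can have the larger share at a level while having fewer people in absolute terms. The sketch below shows the difference on a made-up table (the toy data is an assumption, not the survey), using `pd.crosstab` with `normalize='columns'`.

```python
import pandas as pd

# Made-up respondents: 6 men and 2 women (not survey data)
toy = pd.DataFrame({
    'Gender': ['Man'] * 6 + ['Woman'] * 2,
    'Education': ['Bachelor', 'Bachelor', 'Bachelor', 'Master', 'Master', 'Doctoral',
                  'Master', 'Doctoral'],
})

# Shares computed within each gender: every column sums to 100
within_gender = pd.crosstab(toy['Education'], toy['Gender'], normalize='columns') * 100

# Absolute counts tell a different story when group sizes differ
counts = pd.crosstab(toy['Education'], toy['Gender'])

print(within_gender.round(1))
print(counts)
```

In this toy table, women have the higher share of Master's degrees even though more men than women hold one.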

<a id='age'></a>
## Age and experience


In [None]:
# Age distribution
age_total = survey_2020.groupby('Age').size()
age_prop_total = -age_total/age_total.sum()*100
age_doct = survey_2020_doct.groupby('Age').size()
age_prop_doct = age_doct/age_doct.sum()*100

# coding distribution
coding = ['I have never written code', '< 1 years', '1-2 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']

code_total = survey_2020.groupby('Coding_exp').size()[coding]
code_prop_total = -code_total/code_total.sum()*100

code_doct = survey_2020_doct.groupby('Coding_exp').size()[coding]
code_prop_doct = code_doct/code_doct.sum()*100

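# age and coding relationship (empty age/experience pairs are filled with 0 so the heatmap's integer format works)
age_coding = survey_2020.groupby(['Coding_exp', 'Age']).size().unstack().reindex(coding[::-1]).fillna(0).astype(int)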

In [None]:
fig = plt.figure(figsize=[12, 7])
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0], ylim=[-30, 35])

bg_color = '#f7f7f7'
fig.patch.set_facecolor(bg_color)
ax0.set_facecolor(bg_color)

ax0.bar(age_doct.index, age_prop_doct, color=edu_color[7], edgecolor=(0, 0, 0), width=0.6, zorder=2, label='Doctoral degree')
ax0.bar(age_total.index, age_prop_total, color='silver', edgecolor=(0, 0, 0), width=0.6, zorder=2, label='All')

for j in age_doct.index:
    ax0.annotate('{:.2f}%'.format(age_prop_doct[j]),
                xy=(j, age_prop_doct[j]+2),
                va='center', 
                ha='center',
                fontsize=10,
                alpha=0.5)
for i in age_total.index:
    ax0.annotate('{:.2f}%'.format(-age_prop_total[i]),
                xy=(i, age_prop_total[i]-2),
                va='center',
                ha='center',
                fontsize=10,
                alpha=0.5)

for s in ['top', 'left', 'right']:
    ax0.spines[s].set_visible(False)
ax0.grid(which='major', axis='y', color='lightgrey', zorder=0)
ax0.axhline(c='black', lw=1, ls=':')

yticks=[-20, -10, 0, 10, 20, 30]
plt.yticks(yticks, np.abs(yticks))

ax0.set_xticklabels(['18-21', '22-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54',
       '55-59', '60-69', '70+'],
                    fontsize=10)
plt.xlabel('Age (yr)')

ax0.text(-1.5, 35,
         'Distribution of Age in Doctoral Degree and All Education Levels (%)', 
         fontsize=14, 
         fontweight='bold')

ax0.legend(title='Education Levels', 
           fontsize=10,
           title_fontsize=10,
           bbox_to_anchor=[1, 0.88], 
           loc='lower left',
           framealpha=0);

**This is the distribution of age among respondents with a Doctoral degree and in the whole population.**

Most respondents with a Doctoral degree are 30 or older, and the distribution peaks in the 30-34 age range.

Across all education levels, most respondents are younger than 30, and the distribution peaks in the 25-29 age range.

Considering how long it takes to complete a Doctoral degree (often five years or more), this rightward shift of the age distribution is not surprising.

Interestingly, there is a slight increase among respondents with a Doctoral degree in the 60-69 age range. Part of this may simply be bin width: 60-69 spans ten years, whereas the neighboring bins span five.
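
One caveat when reading this plot: the survey's age bins are not equally wide. Most cover five years, but 60-69 covers ten. A rough way to compare bins fairly is to divide each percentage by the bin width. The sketch below uses made-up percentages (not the survey results), and `bin_width` is a hypothetical helper written for this example.

```python
import pandas as pd

# Made-up percentages for three age bins (not survey results)
pct = pd.Series({'50-54': 4.0, '55-59': 3.0, '60-69': 4.0})

def bin_width(label):
    """Number of whole years covered by a label such as '60-69'."""
    low, high = label.split('-')
    return int(high) - int(low) + 1

widths = pd.Series({label: bin_width(label) for label in pct.index})

# Percent of respondents per year of age within each bin
per_year = pct / widths
print(per_year)
```

With these toy numbers, the 60-69 bar looks taller than the 55-59 bar, yet its per-year density is lower.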

In [None]:
fig = plt.figure(figsize=[12, 7])
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0], ylim=[-30, 30])

bg_color = '#f7f7f7'
fig.patch.set_facecolor(bg_color)
ax0.set_facecolor(bg_color)

ax0.bar(code_doct.index, code_prop_doct, color=edu_color[7], edgecolor=(0, 0, 0), width=0.6, zorder=2, label='Doctoral degree')
ax0.bar(code_total.index, code_prop_total, color='silver', edgecolor=(0, 0, 0), width=0.6, zorder=2, label='All')

for j in code_doct.index:
    ax0.annotate('{:.2f}%'.format(code_prop_doct[j]),
                xy=(j, code_prop_doct[j]+2),
                va='center', 
                ha='center',
                fontsize=10,
                alpha=0.5)
for i in code_total.index:
    ax0.annotate('{:.2f}%'.format(-code_prop_total[i]),
                xy=(i, code_prop_total[i]-2),
                va='center',
                ha='center', 
                fontsize=10,
                alpha=.5)

for s in ['top', 'left', 'right']:
    ax0.spines[s].set_visible(False)
ax0.grid(which='major', axis='y', color='lightgrey', zorder=0)
ax0.axhline(c='black', lw=1, ls=':')

yticks = [-20, -10, 0, 10, 20, 30]
plt.yticks(yticks, np.abs(yticks))

cod_label = ['I have never \n written code', '< 1 years', '1-2 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']
ticks, labels = plt.xticks()
plt.xticks(ticks=ticks, labels=cod_label, rotation=90)
plt.xlabel('Coding Experience')

ax0.text(-1, 35,
         'Distribution of Coding Experience in Doctoral Degree and All Education Levels (%)', 
          fontsize=14, 
          fontweight='bold',)

ax0.legend(title='Education Levels', 
           fontsize=9,
           title_fontsize=10,
           bbox_to_anchor=[1, 0.88], 
           loc='lower left',
           framealpha=0);

In [None]:
print('{:.2f}% of respondents with Doctoral degree have coding experience longer than 5 years.'.format(code_prop_doct['5-10 years']+code_prop_doct['10-20 years']+code_prop_doct['20+ years']))

**This is the distribution of coding experience among respondents with a Doctoral degree and across all education levels.**

Consistent with the age distribution, the majority of respondents with a Doctoral degree have more than 5 years of coding experience. Presumably, some of them earned their Ph.D. in computer science or data science, which gave them additional coding experience. Respondents with a Ph.D. in other majors often start coding before graduation to make the career switch smoother, which also adds to their experience.

Across all education levels, most respondents have less than 5 years of coding experience.

**Next, I want to briefly explore the relationship between coding experience and age.**

In [None]:
fig = plt.figure(figsize=[12, 7])
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0])

sns.heatmap(data=age_coding, 
            cmap=edu_cmap, 
            linewidths=0.2, 
            square=True, 
            annot=True, 
            fmt = 'd', 
            annot_kws={'alpha': 0.5, 'fontsize': 9},
            cbar_kws={'shrink': 0.6, 'label': 'Number of respondents'})

plt.xlabel('Age (yr)')


plt.ylabel('Coding experience')
plt.yticks(rotation=0)

plt.title('Relationship Between Coding Experience and Age', fontsize=14, fontweight='bold', loc='left', va='bottom');

**There is a positive correlation between age and coding experience in this dataset.**

As age increases, coding experience increases. Together with the age distribution above, this positive correlation partially explains the higher proportion of experienced respondents in the Doctoral degree group.

Most respondents aged 18-29 have 1-5 years of coding experience.
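
The heatmap shows the association visually. One way to put a number on it is a rank correlation between the two ordered bins. Below is a minimal sketch on made-up respondents (the toy data and the shortened bin lists are assumptions). It computes Spearman's rho as the Pearson correlation of tie-averaged ranks, which avoids any dependency beyond pandas.

```python
import pandas as pd

# Shortened, ordered bin lists for the toy example
age_order = ['18-21', '22-24', '25-29', '30-34']
exp_order = ['< 1 years', '1-2 years', '3-5 years', '5-10 years']

# Made-up respondents (not survey data)
toy = pd.DataFrame({
    'Age': ['18-21', '18-21', '22-24', '25-29', '25-29', '30-34', '30-34'],
    'Coding_exp': ['< 1 years', '1-2 years', '1-2 years', '3-5 years',
                   '1-2 years', '5-10 years', '3-5 years'],
})

# Map each ordered bin to its position (0, 1, 2, ...)
age_code = pd.Categorical(toy['Age'], categories=age_order, ordered=True).codes
exp_code = pd.Categorical(toy['Coding_exp'], categories=exp_order, ordered=True).codes

# Spearman's rho = Pearson correlation of the tie-averaged ranks
rho = pd.Series(age_code).rank().corr(pd.Series(exp_code).rank())
print(round(rho, 2))
```

Applied to the real columns, this would need the full bin lists and a decision on how to treat missing answers, which `Categorical.codes` marks as -1.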

<a id='profession'></a>
## Profession

In [None]:
company_size = ['0-49 employees', '50-249 employees', '250-999 employees', '1000-9,999 employees', '10,000 or more employees']

company_total = survey_2020.groupby('Company_size').size()[company_size]
company_prop_total = -company_total/company_total.sum()*100

company_doct = survey_2020_doct.groupby('Company_size').size()[company_size]
company_prop_doct = company_doct/company_doct.sum()*100

In [None]:
fig = plt.figure(figsize=[12, 7])
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0], ylim=[-45, 40])

bg_color = '#f7f7f7'
fig.patch.set_facecolor(bg_color)
ax0.set_facecolor(bg_color)

ax0.bar(company_doct.index, company_prop_doct, color=edu_color[7], edgecolor=(0, 0, 0), width=0.5, zorder=3, label='Doctoral degree')
ax0.bar(company_total.index, company_prop_total, color='silver', edgecolor=(0, 0, 0), width=0.5, zorder=3, label='All')

for j in company_doct.index:
    ax0.annotate('{:.2f}%'.format(company_prop_doct[j]),
                xy=(j, company_prop_doct[j]+2),
                va='center', 
                ha='center',
                fontsize=10,
                alpha=.5)
    
for i in company_total.index:
    ax0.annotate('{:.2f}%'.format(-company_prop_total[i]),
                xy=(i, company_prop_total[i]-2),
                va='center',
                ha='center',
                fontsize=10,
                alpha=.5)

for s in ['top', 'left', 'right']:
    ax0.spines[s].set_visible(False)
ax0.grid(which='major', axis='y', color='lightgrey', zorder=0)
ax0.axhline(c='black', lw=1, ls=':')

plt.xticks(rotation=90)

yticks=[-30, -20, -10, 0, 10, 20, 30]
plt.yticks(yticks, np.abs(yticks))


ax0.text(-1, 45,
         'Distribution of Company Size in Doctoral Degree and All Education Levels (%)', 
          fontsize=14, 
          fontweight='bold')

ax0.legend(title='Education Levels', 
           fontsize=9,
           title_fontsize=10,
           bbox_to_anchor=[1, 0.88], 
           loc='lower left',
           framealpha=0);

**This is the distribution of respondents by the size of the company that employs them.**

Over 30% of respondents work at companies with fewer than 50 employees, which are probably start-ups. This may reflect the rapid development of, and growing demand for, data science and machine learning.

In [None]:
title_total = survey_2020.groupby('Title').size()
title_prop_total = -title_total/title_total.sum()*100
title_doct = survey_2020_doct.groupby('Title').size()
title_prop_doct = title_doct/title_doct.sum()*100

In [None]:
fig = plt.figure(figsize=[12, 7])
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0], ylim=[-35, 40])

bg_color = '#f7f7f7'
fig.patch.set_facecolor(bg_color)
ax0.set_facecolor(bg_color)

ax0.bar(title_doct.index, title_prop_doct, color=edu_color[7], edgecolor=(0, 0, 0), width=0.6, zorder=2, label='Doctoral degree')
ax0.bar(title_total.index, title_prop_total, color='silver', edgecolor=(0, 0, 0), width=0.6, zorder=2, label='All')

for i in title_total.index:
    ax0.annotate('{:.2f}%'.format(-title_prop_total[i]),
                xy=(i, title_prop_total[i]-2),
                va='center',
                ha='center',
                fontsize=10,
                alpha=.5)
    
for j in title_doct.index:
    ax0.annotate('{:.2f}%'.format(title_prop_doct[j]),
                xy=(j, title_prop_doct[j]+2),
                va='center', 
                ha='center',
                fontsize=10,
                alpha=.5)

for s in ['top', 'left', 'right']:
    ax0.spines[s].set_visible(False)
ax0.grid(which='major', axis='y', color='lightgrey', zorder=0)
ax0.axhline(color='black', lw=1, ls=':')

plt.xticks(rotation=90)

yticks=[-20, -10, 0, 10, 20, 30]
plt.yticks(yticks, np.abs(yticks))


ax0.text(-1.5, 40,
         'Distribution of Profession in Doctoral Degree and All Education Levels (%)', 
         fontsize=14, 
         fontweight='bold')

ax0.legend(title='Education Levels', 
           fontsize=9,
           title_fontsize=10,
           bbox_to_anchor=[1, 0.88], 
           loc='lower left',
           framealpha=0);

**This is the distribution of professions among respondents with a Doctoral degree and in the whole population.**

Among respondents with a Doctoral degree, research scientist is the most prevalent profession at 30.08%, far higher than the 6.09% in the whole population.

Data scientist is the second most common profession in both groups.

Across all education levels, "student" is the most prevalent role, at 26.82%.

<a id='salary'></a>
## Salary

In [None]:
salary = ['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499', '7,500-9,999', 
          '10,000-14,999', '15,000-19,999', '20,000-24,999', '25,000-29,999', '30,000-39,999', '40,000-49,999',
          '50,000-59,999', '60,000-69,999', '70,000-79,999', '80,000-89,999', '90,000-99,999', '100,000-124,999',
          '125,000-149,999', '150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-500,000', '> $500,000']
salary_total = survey_2020.groupby('Salary').size()[salary]
salary_prop_total = -salary_total/salary_total.sum()*100

salary_doct = survey_2020_doct.groupby('Salary').size()[salary]
salary_prop_doct = salary_doct/salary_doct.sum()*100

In [None]:
fig = plt.figure(figsize=[14, 8])
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0], ylim=[-30, 20])

bg_color = '#f7f7f7'
fig.patch.set_facecolor(bg_color)
ax0.set_facecolor(bg_color)

ax0.bar(salary_doct.index, salary_prop_doct, color=edu_color[7], edgecolor=(0, 0, 0), width=0.6, zorder=2, label='Doctoral degree')
ax0.bar(salary_total.index, salary_prop_total, color='silver', edgecolor=(0, 0, 0), width=0.6, zorder=2, label='All')

for i in salary_total.index:
    ax0.annotate('{:.2f}%'.format(-salary_prop_total[i]),
                xy=(i, salary_prop_total[i]-1),
                va='center',
                ha='center',
                fontsize=7,
                alpha=.5)
    
for j in salary_doct.index:
    ax0.annotate('{:.2f}%'.format(salary_prop_doct[j]),
                xy=(j, salary_prop_doct[j]+0.8),
                va='center', 
                ha='center',
                fontsize=7,
                alpha=.5)
    
for s in ['top', 'left', 'right']:
    ax0.spines[s].set_visible(False)   
ax0.grid(which='major', axis='y', color='lightgrey', zorder=0)
ax0.axhline(c='black', lw=1, ls=':')

plt.xticks(rotation=90)


yticks=[-20, -10, 0, 10, 20]
plt.yticks(yticks, np.abs(yticks))


ax0.text(-2, 25,
         'Distribution of Salary in Doctoral Degree and All Education Levels (%)', 
          fontsize=14, 
          fontweight='bold')

ax0.legend(title='Education Levels', 
           fontsize=9,
           title_fontsize=10,
           bbox_to_anchor=[1, 0.88], 
           loc='lower left',
           framealpha=0);

**This is the distribution of salary of respondents with Doctoral degree and in the whole population.**

The "salary" here means annual compensation. However, the most common range is \\$0-999, which is implausible for the US. I have two hypotheses:

* Since this is a worldwide survey, it might reflect differences between regional economies.
* It is also possible that some respondents misread the question as monthly rather than annual salary.

If we compare the distributions above \\$1,000, the percentage of respondents earning over \\$100,000 per year is higher among those with a Doctoral degree than in the whole population. Given the higher proportion of data scientists and research scientists, as well as the longer coding experience, among respondents with a Doctoral degree, this difference might be attributed to relationships among profession, coding experience, and salary.

Next, I'll dig into this by plotting the joint distributions of title, coding experience, and salary with and without a Doctoral degree.
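
The salary answers are text labels, so a statement like "earning over 100,000 per year" means summing several bins. Here is a rough sketch of how that share could be computed. `salary_lower_bound` is a hypothetical helper written for this example, and the counts are made up rather than taken from the survey.

```python
def salary_lower_bound(label):
    """Lower bound in USD of a label such as '$0-999', '100,000-124,999' or '> $500,000'."""
    cleaned = label.replace('$', '').replace(',', '').replace('>', '').strip()
    return int(cleaned.split('-')[0])

# Made-up counts per salary bin (not survey results)
toy_counts = {'$0-999': 40, '50,000-59,999': 30, '100,000-124,999': 20, '> $500,000': 10}

total = sum(toy_counts.values())
over_100k = sum(n for label, n in toy_counts.items()
                if salary_lower_bound(label) >= 100_000)

# Percentage of respondents whose bin starts at 100,000 or above
share_over_100k = over_100k / total * 100
print(share_over_100k)
```

The same lower bounds could also be used to sort the bins numerically instead of maintaining the ordered list by hand.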

In [None]:
coding_title = survey_2020.groupby(['Coding_exp', 'Title']).size().unstack().reindex(coding[::-1]).fillna(0).astype(int)
coding_title_doct = survey_2020_doct.groupby(['Coding_exp', 'Title']).size().unstack().reindex(coding[::-1]).fillna(0).astype(int)

In [None]:
fig = plt.figure(figsize=[18, 10])
gs=fig.add_gridspec(2, 1)
ax0=fig.add_subplot(gs[0, 0])
ax1=fig.add_subplot(gs[1, 0])

bg_color='#f7f7f7'
fig.patch.set_facecolor(bg_color)
ax0.set_facecolor(bg_color)
ax1.set_facecolor(bg_color)

sns.heatmap(data=coding_title_doct,
                  cmap=edu_cmap, 
                  linewidths=0.2,
                  linecolor='#f7f7f7',
                  cbar=False,
                  square=True, 
                  annot=True, 
                  fmt = 'd', 
                  annot_kws={'alpha': 0.5},
                  ax=ax0,
                  zorder=2,
                  label='Doctoral degree')

sns.heatmap(data=coding_title,
                  cmap=all_cmap1, 
                  linewidths=0.2,
                  linecolor='#f7f7f7',
                  cbar=False,
                  square=True, 
                  annot=True, 
                  fmt = 'd', 
                  annot_kws={'alpha': 0.5},
                  ax=ax1,
                  zorder=2,
                  label='All')



ax0.set_xticks([])
ax0.set_xlabel(None)
ax1.set_xlabel(None)
ax0.set_ylabel(None)
ax1.set_ylabel(None)

ax0.text(13.7, 2.5, 'Doctoral Degree',
         fontsize=14,
         fontweight='bold',
         bbox=dict(facecolor=edu_color[2]))

ax1.text(13.7, 2.5, 'All Education Levels',
         fontsize=14,
         fontweight='bold',
         bbox=dict(facecolor='darkgreen', alpha=.5))

ax0.text(0, -1, 'Distribution of Coding Experience and Profession',
         fontsize=16,
         fontweight='bold');

**This is the distribution of coding experience and profession in respondents with Doctoral degree and in the whole population.**

Most students, data analysts, and business analysts have less than 5 years of coding experience.

Most data scientists, research scientists, and software engineers have more than 3 years of coding experience.

Most respondents with a Doctoral degree work as data scientists or research scientists and have more than 5 years of coding experience.

In [None]:
title_salary = survey_2020.groupby(['Title', 'Salary']).size().unstack()[salary].fillna(0).astype(int)
title_salary_doct = survey_2020_doct.groupby(['Title', 'Salary']).size().unstack()[salary].fillna(0).astype(int)

In [None]:
fig = plt.figure(figsize=[18, 10])
gs=fig.add_gridspec(2, 1)
ax0=fig.add_subplot(gs[0, 0])
ax1=fig.add_subplot(gs[1, 0])

bg_color='#f7f7f7'
fig.patch.set_facecolor(bg_color)
ax0.set_facecolor(bg_color)
ax1.set_facecolor(bg_color)

sns.heatmap(data=title_salary_doct,
                  cmap=edu_cmap, 
                  cbar=False,
                  linewidths=0.2,
                  linecolor='#f7f7f7',
                  square=True, 
                  annot=True, 
                  fmt = 'd', 
                  annot_kws={'alpha': 0.5, 'fontsize': 9},
                  zorder=2,
                  ax=ax0)

sns.heatmap(data=title_salary,
                  cmap=all_cmap1,
                  cbar=False,
                  linewidths=0.2, 
                  linecolor='#f7f7f7',
                  square=True, 
                  annot=True, 
                  fmt = 'd', 
                  annot_kws={'alpha': 0.5, 'fontsize': 9},
                  zorder=2,
                  ax=ax1)


ax0.set_xticks([])
ax0.set_xlabel(None)
ax1.set_xlabel(None)
ax0.set_ylabel(None)
ax1.set_ylabel(None)

ax0.text(25.8, 2.5, 'Doctoral Degree',
         fontsize=14,
         fontweight='bold',
         bbox=dict(facecolor=edu_color[2]))

ax1.text(25.8, 2.5, 'All Education Levels',
         fontsize=14,
         fontweight='bold',
         bbox=dict(facecolor='darkgreen', alpha=.5))

ax0.text(0, -1, 'Distribution of Profession and Salary',
         fontsize=16,
         fontweight='bold');

**This is the distribution of profession and salary in respondents with Doctoral degree and in the whole population**

In the whole population, a considerable proportion of data scientists earn more than \\$100,000 per year. This pattern is even more pronounced among respondents with a Doctoral degree.

To test my earlier assumption about regional differences, I'll plot the USA and India separately. They are the two countries with the most respondents, and they represent a developed and a developing economy, respectively.

In [None]:
def title_vs_salary(country):
    survey_2020_c = survey_2020[survey_2020['Country']==country].copy()
    title_salary_c = survey_2020_c.groupby(['Title', 'Salary']).size().unstack()[salary].fillna(0).astype(int)
    
    fig = plt.figure(figsize=[14, 8])
    gs=fig.add_gridspec(1, 1)
    ax0=fig.add_subplot(gs[0, 0])

    bg_color='#f7f7f7'
    fig.patch.set_facecolor(bg_color)
    ax0.set_facecolor(bg_color)

    sns.heatmap(data=title_salary_c,
                  cmap=edu_cmap, 
                  linewidths=0.2,
                  linecolor='#f7f7f7',
                  cbar=False,
                  square=True, 
                  annot=True, 
                  fmt = 'd', 
                  annot_kws={'alpha': 0.5, 'fontsize': 9},
                  ax=ax0)
    ax0.set_ylabel(None)
    ax0.set_xlabel(None)
    ax0.text(0, -1, 'Distribution of Profession and Salary in {}'.format(country),
             fontsize=16,
             fontweight='bold')

In [None]:
title_vs_salary('United States of America')

In [None]:
title_vs_salary('India')

There is a striking difference in salary distribution between the USA and India.

Across all education levels, respondents in the USA earn much more than those in India in the same profession.
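
A compact way to summarize such a comparison, alongside the heatmaps, is the share of respondents above a salary threshold per country and title. The sketch below uses made-up rows: the country names `A` and `B`, the numeric `Salary_low` column, and the 100,000 threshold are all assumptions for illustration, not survey columns or results.

```python
import pandas as pd

# Made-up respondents; 'Salary_low' is a hypothetical numeric lower bound of the salary bin
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Title': ['Data Scientist'] * 8,
    'Salary_low': [90_000, 120_000, 150_000, 60_000, 5_000, 20_000, 100_000, 10_000],
})

# Boolean flag per respondent, then the mean of the flag is the share
high = toy['Salary_low'] >= 100_000
share_high = high.groupby([toy['Country'], toy['Title']]).mean() * 100
print(share_high)
```

Comparing shares within the same title keeps the comparison from being driven by the different mix of professions in each country.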

Next, I'll explore the distribution of coding experience and salary:

In [None]:
coding_salary = survey_2020.groupby(['Coding_exp', 'Salary']).size().unstack()[salary].reindex(coding[::-1]).fillna(0).astype(int)
coding_salary_doct = survey_2020_doct.groupby(['Coding_exp', 'Salary']).size().unstack()[salary].reindex(coding[::-1]).fillna(0).astype(int)

In [None]:
fig = plt.figure(figsize=[18, 10])
gs=fig.add_gridspec(2, 1)
ax0=fig.add_subplot(gs[0, 0])
ax1=fig.add_subplot(gs[1, 0])

bg_color='#f7f7f7'
fig.patch.set_facecolor(bg_color)
ax0.set_facecolor(bg_color)
ax1.set_facecolor(bg_color)

sns.heatmap(data=coding_salary_doct,
                  cmap=edu_cmap, 
                  cbar=False,
                  linewidths=0.2,
                  linecolor='#f7f7f7',
                  square=True, 
                  annot=True, 
                  fmt = 'd', 
                  annot_kws={'alpha': 0.5, 'fontsize': 9},
                  zorder=2,
                  ax=ax0)

sns.heatmap(data=coding_salary,
                  cmap=all_cmap1,
                  cbar=False,
                  linewidths=0.2, 
                  linecolor='#f7f7f7',
                  square=True, 
                  annot=True, 
                  fmt = 'd', 
                  annot_kws={'alpha': 0.5, 'fontsize': 9},
                  zorder=2,
                  ax=ax1)


ax0.set_xticks([])
ax0.set_xlabel(None)
ax1.set_xlabel(None)
ax0.set_ylabel(None)
ax1.set_ylabel(None)

ax0.text(25.8, 2.5, 'Doctoral Degree',
         fontsize=14,
         fontweight='bold',
         bbox=dict(facecolor=edu_color[2]))

ax1.text(25.8, 2.5, 'All Education Levels',
         fontsize=14,
         fontweight='bold',
         bbox=dict(facecolor='darkgreen', alpha=.5))

ax0.text(0, -1, 'Distribution of Coding Experience and Salary',
         fontsize=16,
         fontweight='bold');

**This is the distribution of coding experience and salary among respondents with a Doctoral degree and in the whole population.**

Both plots show a positive association between coding experience and salary.

A higher proportion of respondents with a Doctoral degree have longer coding experience and higher salaries.
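The positive association above is read off the heatmaps by eye. As a rough numeric check, one option is a Spearman rank correlation on the ordered categories. The sketch below uses toy data and hypothetical bucket labels (`coding_order`, `salary_order`), not the survey's actual values, and computes Spearman's rho as a Pearson correlation on ranks.

```python
import pandas as pd

# Hypothetical ordered labels; the real survey buckets differ.
coding_order = ['< 1 years', '1-2 years', '3-5 years', '5-10 years']
salary_order = ['0-999', '1,000-9,999', '10,000-99,999', '100,000+']

# Toy respondents, for illustration only.
toy = pd.DataFrame({
    'Coding_exp': ['< 1 years', '1-2 years', '3-5 years', '5-10 years', '1-2 years', '5-10 years'],
    'Salary':     ['0-999', '1,000-9,999', '10,000-99,999', '100,000+', '0-999', '10,000-99,999'],
})

# Map each label to its position in the ordered list (0, 1, 2, ...).
exp_code = pd.Series(pd.Categorical(toy['Coding_exp'], categories=coding_order, ordered=True).codes)
sal_code = pd.Series(pd.Categorical(toy['Salary'], categories=salary_order, ordered=True).codes)

# Spearman's rho equals the Pearson correlation of the (tie-averaged) ranks.
rho = exp_code.rank().corr(sal_code.rank())
print('Spearman rho: {:.2f}'.format(rho))
```

A value close to 1 would support the pattern seen in the plots. The salary buckets are coarse, so this is a summary rather than a precise estimate.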

Similarly, I'll plot the distribution of coding experience and salary for the USA and India separately.

In [None]:
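def coding_vs_salary(country):
    # Heatmap of respondent counts by coding experience (rows) and salary bucket (columns) for one country
    survey_2020_c = survey_2020[survey_2020['Country']==country].copy()
    # reindex keeps both axes in order and tolerates buckets with no respondents in this country
    coding_salary_c = survey_2020_c.groupby(['Coding_exp', 'Salary']).size().unstack().reindex(columns=salary).reindex(coding[::-1]).fillna(0).astype(int)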
    
    fig = plt.figure(figsize=[14, 8])
    gs=fig.add_gridspec(1, 1)
    ax0=fig.add_subplot(gs[0, 0])

    bg_color='#f7f7f7'
    fig.patch.set_facecolor(bg_color)
    ax0.set_facecolor(bg_color)

    sns.heatmap(data=coding_salary_c,
                  cmap=edu_cmap, 
                  linewidths=0.2, 
                  linecolor='#f7f7f7',
                  cbar=False,
                  square=True, 
                  annot=True, 
                  fmt = 'd', 
                  annot_kws={'alpha': 0.5, 'fontsize': 9},
                  ax=ax0)
    ax0.set_xlabel(None)
    ax0.set_ylabel(None)

    ax0.text(0, -1, 'Distribution of Coding Experience and Salary in {}'.format(country),
             fontsize=16,
             fontweight='bold')

In [None]:
coding_vs_salary('United States of America')

In [None]:
coding_vs_salary('India')

**The positive relationship between coding experience and salary exists in both countries.**

The distribution of coding experience differs between the USA and India, which partially explains the different salary distributions in the two countries:
* Most respondents in the USA make more than \\$100,000 per year.
* The majority of respondents in India earn less than \\$1,000 per year.

However, among respondents with similar coding experience in the two countries (for example, 3-5 years), those in the USA have higher salaries.

**These observations explain the accumulation of respondents in the \\$0-999 salary range. However, it is not appropriate to compare salaries across countries directly, because differences in economic development, cost of living, and other factors play important roles. This topic would be interesting to discuss, but it requires additional data and is outside the scope of this notebook.**

<a id='q23'></a>
## Roles at work

In [None]:
def role(df):
    # Count how often each Q23 option (important roles at work) was selected.
    # Each Q23 column is one option: its text when selected, NaN otherwise.
    q23 = df[df.columns[df.columns.str.contains(pat='Q23')]].copy()
    role_list = []
    # Q23_Part_1 ... Q23_Part_7, followed by the "Other" option
    for col in ['Q23_Part_{}'.format(i) for i in range(1, 8)] + ['Q23_OTHER']:
        selected = q23[col].dropna()
        role_list.append({'role_name': selected.value_counts().index[0],
                          'role_count': selected.count()})
    return pd.DataFrame(role_list)

In [None]:
doct_role = role(survey_2020_doct).set_index('role_name')
all_role = role(survey_2020).set_index('role_name')
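# Convert counts to percentages of all selections; the whole population is negated so its bars point downward
doct_role['role_count'] = doct_role['role_count'] / doct_role['role_count'].sum() * 100
all_role['role_count'] = -all_role['role_count'] / all_role['role_count'].sum() * 100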
all_role = all_role.reset_index()
doct_role = doct_role.reset_index()

In [None]:
role_label = ['Analyze and understand \n data to influence product \n or business decisions',
       'Build and/or run the data infrastructure \n that my business uses for storing, \n analyzing, and operationalizing data',
       'Build prototypes to explore \n applying machine learning \n to new areas',
       'Build and/or run a machine learning \n service that operationally improves \n my product or workflows',
       'Experimentation and iteration \n to improve existing ML models',
       'Do research that \n advances the state of the art \n of machine learning',
       'None of these activities \n are an important part \n of my role at work',
       'Other']

In [None]:
fig = plt.figure(figsize=[14, 8])
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0], ylim=[-30, 30])

bg_color = '#f7f7f7'
fig.patch.set_facecolor(bg_color)
ax0.set_facecolor(bg_color)

x=np.arange(8)
ax0.bar(x, doct_role['role_count'], color=edu_color[7], edgecolor=(0, 0, 0), width=0.5, zorder=2, label='Doctoral degree')
ax0.bar(x, all_role['role_count'], color='silver', edgecolor=(0, 0, 0), width=0.5, zorder=2, label='All')

for i in all_role.index:
    ax0.annotate('{:.2f}%'.format(-all_role['role_count'][i]),
                xy=(i, all_role['role_count'][i]-1),
                va='center',
                ha='center',
                fontsize=10,
                alpha=.5)
for j in doct_role.index:
    ax0.annotate('{:.2f}%'.format(doct_role['role_count'][j]),
                xy=(j, doct_role['role_count'][j]+1),
                va='center',
                ha='center',
                fontsize=10,
                alpha=.5)   
    
for s in ['top', 'left', 'right']:
    ax0.spines[s].set_visible(False)   
ax0.grid(which='major', axis='y', color='lightgrey', zorder=0)
ax0.axhline(color='black', lw=1, ls=':')


ax0.set_xticks(x)
ax0.set_xticklabels(role_label, rotation=60, fontsize=10)


yticks=[-20, -10, 0, 10, 20, 30]
plt.yticks(yticks, np.abs(yticks))


ax0.text(-1, 35,
         'Distribution of Roles in Work in Doctoral Degree and All Education Levels (%)', 
          fontsize=16, 
          fontweight='bold')

ax0.legend(title='Education Levels', 
           fontsize=9,
           title_fontsize=10,
           bbox_to_anchor=[1, 0.88], 
           loc='lower left',
           framealpha=0);

**This is the distribution of important roles at work among respondents with a Doctoral degree and in the whole population.**

Compared with the whole population, respondents with a Doctoral degree do more machine learning-related work.

Whether or not respondents have a Doctoral degree, their most common activity is analyzing and understanding data to influence business decisions.
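One detail about the percentages: Q23 is a multiple-choice question, and the bars above divide each option's count by the total number of selections, so they sum to 100%. An alternative is to divide by the number of respondents who answered. The sketch below contrasts the two on toy data; the option texts and values are made up for illustration, and only the `Q23_Part_*` column naming follows the notebook.

```python
import numpy as np
import pandas as pd

# Toy multi-select answers: one column per option, option text if selected, NaN otherwise.
toy = pd.DataFrame({
    'Q23_Part_1': ['Analyze data', 'Analyze data', np.nan, 'Analyze data'],
    'Q23_Part_2': [np.nan, 'Do ML research', 'Do ML research', np.nan],
})

# Number of times each option was selected (non-NaN cells per column).
counts = toy.count()

# Normalization used for the bar chart: share of all selections (sums to 100).
share_of_selections = counts / counts.sum() * 100

# Alternative: share of respondents who picked at least one option (can sum to more than 100).
answered = toy.notna().any(axis=1).sum()
share_of_respondents = counts / answered * 100

print(share_of_selections)
print(share_of_respondents)
```

Both views keep the same ranking of roles within a group. The second one answers "what fraction of people do this at work", which may be easier to interpret when respondents pick several roles.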

<a id='conclusion'></a>
## Conclusion

After exploring this survey, I found that:

- Over the last three years, practitioners with a Doctoral degree have been the third-largest education group among data science and machine learning practitioners.
- The majority (61.92%) of practitioners with a Doctoral degree have more than 5 years of coding experience.
- Their most common professions are research scientist (30.08%) and data scientist (20.05%).
- Machine learning is an important part of their work.

For people planning to switch careers after completing a doctoral degree, machine learning could be an essential topic to learn and master.

Thank you for reading my notebook. I'd appreciate it if you could share any feedback and give me an upvote. :)