## How does experience affect answers to Kaggle's ML and data science survey?

Through this notebook I want to explore how participants with different levels of experience in coding/programming answered the different survey questions, and how their preferences of programming languages, tools and techniques differ.

# Setup

Let's import all the required libraries:

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

plt.rc('axes', titlesize=14)
plt.rc('xtick', labelsize=12) 
plt.rc('ytick', labelsize=12)

sns.set_style('whitegrid')
sns.set_palette(sns.color_palette("Set2"))
%matplotlib inline

print('Libraries ready!')

Let's load our dataset:

In [None]:
data = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', low_memory=False)
print('Dataset Loaded.')

We'll save all the details of the survey question in the dictionary `q_descriptions`:

In [None]:
q_descriptions = {}
for (c, q) in zip(data.columns.values, data.iloc[0, :].values):
    q_descriptions[c] = q
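
The same mapping can be built in one step with `dict(zip(...))`. Below is a minimal sketch on a toy frame; the `toy` data and its question texts are made up for illustration and only mimic the layout of the survey file, where the first row holds the question text:

```python
import pandas as pd

# Toy stand-in for the survey file (hypothetical values): row 0 is the question text.
toy = pd.DataFrame({
    'Q1': ['What is your age (# years)?', '25-29', '30-34'],
    'Q7_Part_1': ['What programming languages do you use? - Selected Choice - Python',
                  'Python', None],
})

# Map each column name to the question text stored in the first row.
toy_descriptions = dict(zip(toy.columns, toy.iloc[0]))

# For multi-select columns, the option label is the text after the last ' - '.
option_label = toy_descriptions['Q7_Part_1'].split(' - ')[-1]
print(option_label)
```

The `split(' - ')[-1]` step is the same trick `options_by_share` relies on later to recover option labels.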

Now let's modify our dataset to remove the row of question descriptions as well as shorten some of the values in the dataset for ease of plotting. Here's a sample from the modified dataset:

In [None]:
# remove description row
data = data.iloc[1:]

# shorten some column values
data = data.replace({
    'United States of America': 'USA',
    'United Kingdom of Great Britain and Northern Ireland': 'UK',
    'I have never written code': 'Never wrote code'
})

data.sample(5)

Let's define some constants and functions that will be useful later:

In [None]:
# CONSTANTS
EXP_ORDER = ['Never wrote code', '< 1 years', '1-3 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']
GENDER_ORDER = ['Man', 'Woman', 'Nonbinary', 'Prefer not to say', 'Prefer to self-describe']

# HELPER-FUNCTIONS
def options_by_share(df, q, n_options=None):
    """
    Given a question, say 'Q7', returns share(%) of the different options chosen by the users.
    """
    result_df = None
    nrows = df.shape[0]
    q_cols = [c for c in df.columns if (q + '_') in c] # cols related to question, if multiple
    
    if len(q_cols) == 0:
        result_df = df.groupby(q, as_index=False).size()
        result_df = result_df.rename({q: 'Option', 'size': 'Percentage'}, axis='columns')
        result_df['Percentage'] = round((result_df['Percentage'] / nrows) * 100, 2)
    else:
        options = [q_descriptions[c].split(' - ')[-1] for c in q_cols]
        percentages = []
        
        for i, c in enumerate(q_cols):
            option_freq = sum(df[c] == options[i])
            percentages.append(round((option_freq / nrows) * 100, 2))
        
        result_df = pd.DataFrame({
            'Option': options,
            'Percentage': percentages
        })
    
    result_df = result_df.sort_values('Percentage', ascending=False).iloc[:n_options]
    return result_df

def get_ques_cols(q):
    cols = []
    
    for i in q_descriptions:
        if q + '_' in i:
            cols.append(i)
            
    if not len(cols): cols.append(q) 
    return cols

print("We're all set!")
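
To make the single-column branch of `options_by_share` concrete, here is a minimal sketch of the same calculation on a toy series. The `toy` answers are hypothetical. Note that the denominator is the number of rows, so unanswered rows lower every share; this is presumably why rows with missing answers are filtered out before the helper is called later on.

```python
import pandas as pd

# Hypothetical single-choice answers; None marks a respondent who skipped the question.
toy = pd.Series(['Python', 'Python', 'R', None], name='Q8')

# Count each option and divide by the total number of rows (not just answered rows).
shares = (toy.value_counts() / len(toy) * 100).round(2)
print(shares)
```

Here the shares add up to 75% rather than 100% because one of the four rows is empty.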

# Survey Participants by Experience

- About `96%` of participants have experience in programming.
- The largest portion of participants have `1-3 years` of experience.

In [None]:
exp_data = data.groupby('Q6').size()

plt.figure(figsize=(12, 8))
plt.title('Share of Participants by Programming Experience')
# one explode offset per slice, so the call still works if a category is missing
plt.pie(x=exp_data, autopct="%.2f%%", explode=[0.03]*len(exp_data), labels=exp_data.index, pctdistance=0.5);
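
`groupby('Q6').size()` returns the groups in lexicographic order of their labels, not in order of experience. If a chronological order is preferred, one option is to reindex the counts against `EXP_ORDER`. A minimal sketch on made-up answers (the `toy` series is hypothetical, not survey data):

```python
import pandas as pd

EXP_ORDER = ['Never wrote code', '< 1 years', '1-3 years', '3-5 years',
             '5-10 years', '10-20 years', '20+ years']

# Hypothetical answers, for illustration only.
toy = pd.Series(['1-3 years', '< 1 years', '20+ years', '1-3 years'])

# Count each answer, then lay the counts out in experience order;
# levels that never occur get a count of 0 instead of being dropped.
counts = toy.value_counts().reindex(EXP_ORDER, fill_value=0)
print(counts)
```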

# Age and Experience

- The many participants with `< 1 years` of experience in every age group show that it's never too late to pick up programming :)

In [None]:
age_exp_data = data[['Q1', 'Q6']].rename({'Q1': 'Age Group', 'Q6': 'Experience'}, axis='columns').sort_values('Age Group')

plt.figure(figsize=(14, 8))
plt.title('Experience in Programming and Age Groups\n')
sns.histplot(x="Age Group", hue="Experience", data=age_exp_data
             , stat="count", multiple="stack"
             , hue_order=EXP_ORDER);

# Gender and Experience

- About `61%` of women participants have 1-3 years of experience in programming, while the same figure for men is around `51%`.
- It suggests a `positive trend` regarding the involvement of women in the field, with a large portion of women picking up programming in recent years. 

In [None]:
gender_exp_data = data.groupby('Q2').apply(lambda df: options_by_share(df, 'Q6').set_index('Option')).reset_index()
gender_exp_data = gender_exp_data.rename({
    'Q2': 'Gender',
    'Option': 'Experience'
}, axis='columns')


plt.figure(figsize=(14, 6))
plt.title('Experience by Gender\n')
sns.barplot(y='Percentage', x='Gender', hue='Experience'
            , data=gender_exp_data
            , order=GENDER_ORDER
            , hue_order=EXP_ORDER)
plt.legend(loc='right');

# Countries and Experience

In [None]:
part_by_country = data.Q3.value_counts()
top_countries = part_by_country[part_by_country.index != 'Other'].index[:10]

country_share_data = part_by_country[part_by_country.index.isin(top_countries)]
# Series.append was removed in pandas 2.0; pd.concat works on old and new versions
country_share_data = pd.concat([country_share_data, pd.Series(data.shape[0] - country_share_data.sum(), index=['Other'])])

plt.figure(figsize=(8, 8))
plt.title('Share of Participants by Country')
plt.pie(x=country_share_data, autopct="%.2f%%"
        , explode=[0.07]*country_share_data.count()
        , labels=country_share_data.keys(), pctdistance=0.85)
centre_circle = plt.Circle((0,0),0.75,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle);

- Participants from `Nigeria`, `Pakistan`, `China`, `Egypt` and `India` are relatively newer to programming with 85-90% of participants having none to five years of experience.
- Participants from `USA`, `UK` and `Brazil` have significantly more experience under their belt, with 40-48% of participants having programmed for more than 5 years.

In [None]:
country_exp_data = data[data['Q3'].isin(top_countries)].groupby('Q3').apply(lambda df: options_by_share(df, 'Q6').set_index('Option')).reset_index()
country_exp_data = country_exp_data.rename({
    'Q3': 'Country',
    'Option': 'Experience'
}, axis='columns')

plt.figure(figsize=(18, 6))
plt.title('Experience by Countries\n')
sns.barplot(y='Percentage', x='Country', hue='Experience', data=country_exp_data, alpha=0.9
            , hue_order=EXP_ORDER
            , palette=sns.color_palette("Set2")[:7])
plt.legend(loc='upper right');
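
Figures such as "none to five years" combine several experience bands. One way to compute such a combined share per country is to flag the relevant bands and average the flag within each group. A minimal sketch, assuming a hypothetical `toy` frame and a hypothetical `EARLY` list of bands:

```python
import pandas as pd

# Hypothetical respondents, for illustration only.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Experience': ['< 1 years', '1-3 years', '3-5 years', '10-20 years',
                   '5-10 years', '1-3 years'],
})

# Bands counted as "none to five years" (an assumption made for this sketch).
EARLY = ['Never wrote code', '< 1 years', '1-3 years', '3-5 years']

# Flag respondents in those bands, then average the flag per country.
is_early = toy['Experience'].isin(EARLY)
early_share = (is_early.groupby(toy['Country']).mean() * 100).round(1)
print(early_share)
```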

# Job Title and Experience

In [None]:
part_by_title = data.Q5.value_counts()
top_titles = part_by_title[part_by_title.index != 'Other'].index[:10]

title_share_data = part_by_title[part_by_title.index.isin(top_titles)]
# Series.append was removed in pandas 2.0; pd.concat works on old and new versions
title_share_data = pd.concat([title_share_data, pd.Series(data.shape[0] - title_share_data.sum(), index=['Other'])])

plt.figure(figsize=(8, 8))
plt.title('Share of Participants by Title/Role')
plt.pie(x=title_share_data, autopct="%.2f%%"
        , explode=[0.07]*title_share_data.count()
        , labels=title_share_data.keys(), pctdistance=0.85)
centre_circle = plt.Circle((0,0),0.75,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle);

- About 65% of participants with the job title `Business Analyst` or `Data Analyst` have none to three years of experience in programming.
- The titles `Research Scientist`, `DBA/Database Engineer`, `Software Engineer` and `Project/Product Manager` have the most experienced people, with ~50% of participants having 5+ years of experience in programming.

In [None]:
title_exp_data = data.groupby('Q5').apply(lambda df: (df.groupby('Q6').size() / df.shape[0])*100)
title_exp_data = title_exp_data[EXP_ORDER]

plt.figure(figsize=(16, 10))
plt.title('Distribution of Experience by Job Title\n')
sns.heatmap(title_exp_data, annot=True, cmap="Blues")
plt.xticks(rotation=45, horizontalalignment='right' )
plt.xlabel('Experience')
plt.ylabel('Title');

print('Note: All rows add up to 100%.')
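
A row-normalised table like the one behind this heatmap can also be produced with `pd.crosstab`. This is an alternative sketch rather than the approach used in the cell above; the `toy` frame is hypothetical:

```python
import pandas as pd

# Hypothetical respondents, for illustration only.
toy = pd.DataFrame({
    'Title': ['Data Analyst', 'Data Analyst', 'Software Engineer', 'Software Engineer'],
    'Experience': ['< 1 years', '1-3 years', '10-20 years', '10-20 years'],
})

# normalize='index' divides each row by its total, so every title sums to 1
# (100 after scaling). Combinations that never occur show up as 0, not NaN.
table = pd.crosstab(toy['Title'], toy['Experience'], normalize='index') * 100
print(table.round(1))
```

The resulting frame can be passed to `sns.heatmap` directly, after selecting the columns in the desired order.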

# Yearly Compensation and Experience

- Over `60%` of people with none to 3 years of experience in programming make up to 10,000 USD a year.
- Almost `30%` of the respondents with 20+ years of experience have a yearly compensation of 100,000-249,999 USD.

In [None]:
# merge compensation values
modif_comp_data = data.replace({
    '$0-999': '0-9,999', 
    '1,000-1,999': '0-9,999', 
    '2,000-2,999': '0-9,999', 
    '3,000-3,999': '0-9,999', 
    '4,000-4,999': '0-9,999', 
    '5,000-7,499': '0-9,999', 
    '7,500-9,999': '0-9,999', 
    '10,000-14,999': '10,000-24,999',
    '15,000-19,999': '10,000-24,999',
    '20,000-24,999': '10,000-24,999',
    '25,000-29,999': '25,000-49,999',
    '30,000-39,999': '25,000-49,999',
    '40,000-49,999': '25,000-49,999',
    '50,000-59,999': '50,000-99,999', 
    '60,000-69,999': '50,000-99,999', 
    '70,000-79,999': '50,000-99,999', 
    '80,000-89,999': '50,000-99,999', 
    '90,000-99,999': '50,000-99,999', 
    '100,000-124,999': '100,000-249,999', 
    '125,000-149,999': '100,000-249,999', 
    '150,000-199,999': '100,000-249,999', 
    '200,000-249,999': '100,000-249,999', 
    '250,000-299,999': '250,000-499,999', 
    '300,000-499,999': '250,000-499,999',
    '$500,000-999,999': '500,000-999,999',
    '>$1,000,000': '>1,000,000'})

# compensation order for plotting
COMP_ORDER = ['0-9,999', '10,000-24,999', '25,000-49,999', '50,000-99,999', '100,000-249,999', '250,000-499,999', '500,000-999,999', '>1,000,000']

# data to plot
q = 'Q25'
yrly_comp_filtered_data = modif_comp_data[~modif_comp_data[q].isnull()]
yrly_comp_exp_data = yrly_comp_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
yrly_comp_exp_data = yrly_comp_exp_data.rename({'Q6': 'Experience'}, axis='columns')
yrly_comp_exp_data = yrly_comp_exp_data.set_index('Option').loc[COMP_ORDER].reset_index()

# plot
plt.figure(figsize=(18, 6))
plt.title("What is your current yearly compensation (approximate $USD)?\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=yrly_comp_exp_data, alpha=0.9
            , hue_order=EXP_ORDER)
plt.xlabel('Compensation ($USD)')
plt.xticks(rotation=45, horizontalalignment='right' )
plt.legend(loc='upper right');
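
The hand-written mapping above could also be derived from the bracket labels themselves. The sketch below is one possible alternative, not the method used in this notebook: it parses the lower bound of each bracket and bins it with `pd.cut`. The `raw` series holds a few sample labels, and `edges` and `labels` are assumptions chosen to match `COMP_ORDER`.

```python
import pandas as pd

# A few sample bracket labels in the format used by the mapping above.
raw = pd.Series(['$0-999', '1,000-1,999', '25,000-29,999',
                 '100,000-124,999', '>$1,000,000'])

# Lower bound of each bracket: strip '$', '>' and ',', then keep the number before '-'.
lower = raw.str.replace(r'[$>,]', '', regex=True).str.split('-').str[0].astype(int)

edges = [0, 10_000, 25_000, 50_000, 100_000, 250_000, 500_000, 1_000_000, float('inf')]
labels = ['0-9,999', '10,000-24,999', '25,000-49,999', '50,000-99,999',
          '100,000-249,999', '250,000-499,999', '500,000-999,999', '>1,000,000']

# right=False makes each bin [low, high), so a lower bound of 10,000
# falls into '10,000-24,999' rather than '0-9,999'.
merged = pd.cut(lower, bins=edges, labels=labels, right=False)
print(merged.tolist())
```

Because `pd.cut` returns an ordered categorical, the merged brackets keep their natural order without a separate sorting step.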

# Preference and Experience

### Programming Languages in Use

- `Python` is the most popular language with >80% of participants with programming experience using it on a regular basis.
- The usage of languages such as `SQL`, `Javascript`, `Bash` and `R` increases with experience.

In [None]:
q = 'Q7'
cols = get_ques_cols(q)
proglang_filtered_data = data[~data[cols].isnull().all(1)]
proglang_exp_data = proglang_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
proglang_exp_data = proglang_exp_data.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(18, 6))
plt.title('What programming languages do you use on a regular basis? (Select all that apply)\n')
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=proglang_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Prog. Language');
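
The cells in this section all follow the same pattern: drop respondents who left the question blank, then compute the share of each option within every experience group. A compact sketch of that pattern is shown below. The `share_by_group` helper and the `toy` frame are hypothetical, and the sketch assumes that a multi-select column holds either its option label or a missing value.

```python
import pandas as pd

def share_by_group(df, group_col, option_cols):
    """Share (%) of each group's respondents who ticked each option column."""
    # Keep only respondents who ticked at least one option.
    answered = df[df[option_cols].notna().any(axis=1)]
    # A ticked option is any non-missing cell; the mean of the flags is the share.
    ticked = answered[option_cols].notna()
    return (ticked.groupby(answered[group_col]).mean() * 100).round(2)

# Hypothetical respondents, for illustration only; the last one skipped the question.
toy = pd.DataFrame({
    'Q6': ['< 1 years', '< 1 years', '5-10 years', '5-10 years', '5-10 years'],
    'Q7_Part_1': ['Python', 'Python', 'Python', None, None],
    'Q7_Part_3': [None, None, 'SQL', 'SQL', None],
})

result = share_by_group(toy, 'Q6', ['Q7_Part_1', 'Q7_Part_3'])
print(result)
```

Since respondents can tick several options, the shares within a group do not need to add up to 100%.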

### Recommended Programming Languages

- `Python` is recommended as the first language to learn for an aspiring data scientist by ~80% of all participants with experience.
- `R` and `SQL` are the next 2 most recommended languages.

In [None]:
q = 'Q8'
proglang_rec_filtered_data = data[~data[q].isnull()]
proglang_rec_exp_data = proglang_rec_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
proglang_rec_exp_data = proglang_rec_exp_data.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(18, 6))
plt.title('What programming language would you recommend an aspiring data scientist to learn first?\n')
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=proglang_rec_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Prog. Language');

### Integrated Development Environments (IDEs)

- `Jupyter Notebook` is the IDE of choice among all levels of experience, followed by `Visual Studio Code` and `PyCharm`.
- The usage of `Visual Studio`, `Notepad++` and `Vim/Emacs` is higher among more experienced programmers.

In [None]:
q = 'Q9'
cols = get_ques_cols(q)
ide_filtered_data = data[~data[cols].isnull().all(1)]
ide_exp_data = ide_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
ide_exp_data = ide_exp_data.rename({'Q6': 'Experience'}, axis='columns')
ide_exp_data = ide_exp_data.replace({
    'Jupyter (JupyterLab, Jupyter Notebooks, etc) ': 'Jupyter',
    ' Visual Studio Code (VSCode) ': 'VS Code'
})

plt.figure(figsize=(18, 6))
plt.title("Which of the following integrated development environments (IDE's) do you use on a regular basis?\n(Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=ide_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('IDE')
plt.xticks(rotation=45, horizontalalignment='right' );

### Hosted Notebook Products

- `Colab Notebooks` and `Kaggle notebooks` are the most popular hosted notebook products.
- The usage of both seems to decrease among higher experience groups.

In [None]:
q = 'Q10'
cols = get_ques_cols(q)
nb_filtered_data = data[~data[cols].isnull().all(1)]
nb_exp_data = nb_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
nb_exp_data = nb_exp_data.rename({'Q6': 'Experience'}, axis='columns')
nb_exp_data = nb_exp_data.replace({
    'Google Cloud Notebooks (AI Platform / Vertex AI) ': 'Google Cloud Notebooks',
    ' Amazon Sagemaker Studio Notebooks ': 'Amazon Sagemaker Studio NBs',
    ' Databricks Collaborative Notebooks ': 'Databricks Collaborative NBs'
})

plt.figure(figsize=(18, 6))
plt.title("Which of the following hosted notebook products do you use on a regular basis? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience', data=nb_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Notebook Product')
plt.xticks(rotation=45, horizontalalignment='right' );

### Computing Platform

- Portable `laptops` are the most widely used platform for data science.
- `Cloud Computing Platforms` and `Deep Learning workstations` are slightly more popular among higher experience levels.

In [None]:
q = 'Q11'
cp_filtered_data = data[~data[q].isnull()]
cp_exp_data = cp_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
cp_exp_data = cp_exp_data.rename({'Q6': 'Experience'}, axis='columns')
cp_exp_data = cp_exp_data.replace({
    'A cloud computing platform (AWS, Azure, GCP, hosted notebooks, etc)': 'A cloud computing \nplatform',
    'A deep learning workstation (NVIDIA GTX, LambdaLabs, etc)': 'A deep learning \nworkstation',
    'A personal computer / desktop': 'A PC/desktop'
})

plt.figure(figsize=(14, 6))
plt.title("What type of computing platform do you use most often for your data science projects?\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=cp_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Platform');

### Specialized Hardware

- `>50%` of participants with experience do not use any specialized hardware.
- `NVIDIA GPUs` are the most popular specialized hardware, followed by `Google Cloud TPUs`.

In [None]:
q = 'Q12'
cols = get_ques_cols(q)
hardware_filtered_data = data[~data[cols].isnull().all(1)]
hardware_exp_data = hardware_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
hardware_exp_data = hardware_exp_data.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(14, 6))
plt.title("Which types of specialized hardware do you use on a regular basis? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=hardware_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Specialized Hardware');

### Data Visualization Libraries

- `>70%` of participants with coding/programming experience use `Matplotlib` on a regular basis.
- Other popular data visualization libraries are `Seaborn`, `Plotly` and `ggplot`.

In [None]:
q = 'Q14'
cols = get_ques_cols(q)
dv_libs_filtered_data = data[~data[cols].isnull().all(1)]
dv_libs_exp_data = dv_libs_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
dv_libs_exp_data = dv_libs_exp_data.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(18, 6))
plt.title("What data visualization libraries or tools do you use on a regular basis? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=dv_libs_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Library')
plt.xticks(rotation=45, horizontalalignment='right')
plt.ylim(0, 80);

### Machine Learning Frameworks

- `Scikit-Learn` is the most popular machine learning framework, with ~71% of participants with programming experience using it on a regular basis.
- `Tensorflow`, `Keras`, `Xgboost` and `Pytorch` are other popular machine learning frameworks.

In [None]:
q = 'Q16'
cols = get_ques_cols(q)
ml_fram_filtered_data = data[~data[cols].isnull().all(1)]
ml_fram_exp_data = ml_fram_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
ml_fram_exp_data = ml_fram_exp_data.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(14, 24))
plt.title("Which of the following machine learning frameworks do you use on a regular basis?\n(Select all that apply)\n")
sns.barplot(x='Percentage', y='Option', hue='Experience'
            , data=ml_fram_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.ylabel('Framework')
plt.legend(loc='right');

### Machine Learning Algorithms

- `Linear or Logistic Regression` and `Decision Trees and Random Forests` are the most popular machine learning algorithms.
- The usage of different ML algorithms rises with experience.

In [None]:
q = 'Q17'
cols = get_ques_cols(q)
ml_algo_filtered_data = data[~data[cols].isnull().all(1)]
ml_algo_exp_data = ml_algo_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
ml_algo_exp_data = ml_algo_exp_data.rename({'Q6': 'Experience'}, axis='columns')
ml_algo_exp_data = ml_algo_exp_data.replace({
    'Gradient Boosting Machines (xgboost, lightgbm, etc)': 'Gradient Boosting Machines'  
})

plt.figure(figsize=(18, 6))
plt.title("Which of the following ML algorithms do you use on a regular basis? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=ml_algo_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Algorithm')
plt.xticks(rotation=45, horizontalalignment='right' );

### Cloud Computing Platforms

- `Amazon Web Services` is the most popular cloud computing platform in use.
- About `40%` of people with <1 year of experience in programming don't use a cloud computing platform.

In [None]:
q = 'Q27_A'
cols = get_ques_cols(q)
cc_plat_filtered_data = data[~data[cols].isnull().all(1)]
cc_plat_exp_data = cc_plat_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
cc_plat_exp_data = cc_plat_exp_data.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(18, 6))
plt.title("Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=cc_plat_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Platform')
plt.xticks(rotation=45, horizontalalignment='right' );

- `AWS`, `GCP` and `Microsoft Azure` are the 3 most popular cloud computing platforms on respondents' wishlist.
- A much higher proportion of people with no programming experience want to learn `SAP Cloud` and `Salesforce Cloud` than other experience levels.

In [None]:
q = 'Q27_B'
cols = get_ques_cols(q)
cc_plat_filtered_data_B = data[~data[cols].isnull().all(1)]
cc_plat_exp_data_B = cc_plat_filtered_data_B.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
cc_plat_exp_data_B = cc_plat_exp_data_B.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(18, 6))
plt.title("Which of the following cloud computing platforms do you hope to become more familiar with in the next 2 years?\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=cc_plat_exp_data_B, alpha=0.9
            , hue_order=EXP_ORDER)
plt.xlabel('Platform')
plt.xticks(rotation=45, horizontalalignment='right' );

### Managed Machine Learning Products

- `66%` of the respondents with programming experience do not use any managed ML products.
- `Azure ML studio`, `Amazon SageMaker` and `Google Cloud Vertex AI` are the most popular managed ML Products.

In [None]:
q = 'Q31_A'
cols = get_ques_cols(q)
ml_prod_filtered_data = data[~data[cols].isnull().all(1)]
ml_prod_exp_data = ml_prod_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
ml_prod_exp_data = ml_prod_exp_data.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(18, 6))
plt.title("Do you use any of the following managed machine learning products on a regular basis? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=ml_prod_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Managed ML products')
plt.xticks(rotation=45, horizontalalignment='right' );

- `Google Cloud Vertex AI` and `Azure ML Studio` are the 2 most popular managed ML products to learn on respondents' wishlist.
- Participants with lower experience are more keen on learning such ML products.

In [None]:
q = 'Q31_B'
cols = get_ques_cols(q)
ml_prod_filtered_data_B = data[~data[cols].isnull().all(1)]
ml_prod_exp_data_B = ml_prod_filtered_data_B.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
ml_prod_exp_data_B = ml_prod_exp_data_B.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(18, 6))
plt.title("In the next 2 years, do you hope to become more familiar with any of these managed machine learning products?\n(Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=ml_prod_exp_data_B, alpha=0.9
            , hue_order=EXP_ORDER)
plt.xlabel('Managed ML products')
plt.xticks(rotation=45, horizontalalignment='right' );

### Big Data Products

- `MySQL` is the most popular big data product across all experience levels, with ~37% of all respondents using it on a regular basis.
- Almost all other big data products like `PostgreSQL`, `SQLite` and `Microsoft SQL server` gain popularity with experience.

In [None]:
q = 'Q32_A'
cols = get_ques_cols(q)
bd_prod_filtered_data = data[~data[cols].isnull().all(1)]
bd_prod_exp_data = bd_prod_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
bd_prod_exp_data = bd_prod_exp_data.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(20, 6))
plt.title("Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?\n(Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=bd_prod_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Big Data Products')
plt.xticks(rotation=45, horizontalalignment='right' );

- `MySQL` and `MongoDB` are the 2 most popular big data products to learn on respondents' wishlist.
- Participants with lower experience are more keen on learning such big data products.

In [None]:
q = 'Q32_B'
cols = get_ques_cols(q)
bd_prod_filtered_data_B = data[~data[cols].isnull().all(1)]
bd_prod_exp_data_B = bd_prod_filtered_data_B.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
bd_prod_exp_data_B = bd_prod_exp_data_B.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(20, 6))
plt.title("Which of the following big data products do you hope to become more familiar with in the next 2 years? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=bd_prod_exp_data_B, alpha=0.9
            , hue_order=EXP_ORDER)
plt.xlabel('Big Data Products')
plt.xticks(rotation=45, horizontalalignment='right' );

### Business Intelligence Tools

- The two most popular such tools are `Tableau` and `Microsoft Power BI`.
- Usage of such business intelligence tools is higher among newer programmers.


In [None]:
q = 'Q34_A'
cols = get_ques_cols(q)
bi_tools_filtered_data = data[~data[cols].isnull().all(1)]
bi_tools_exp_data = bi_tools_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
bi_tools_exp_data = bi_tools_exp_data.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(20, 6))
plt.title("Which of the following business intelligence tools do you use on a regular basis? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=bi_tools_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Tools')
plt.xticks(rotation=45, horizontalalignment='right' );

- `Tableau`, `MS Power BI` and `Google Data Studio` are the 3 most popular business intelligence tools to learn on participants' wishlist.
- Participants with lower experience are more keen on learning such business intelligence tools.

In [None]:
q = 'Q34_B'
cols = get_ques_cols(q)
bi_tools_filtered_data_B = data[~data[cols].isnull().all(1)]
bi_tools_exp_data_B = bi_tools_filtered_data_B.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
bi_tools_exp_data_B = bi_tools_exp_data_B.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(20, 6))
plt.title("Which of the following business intelligence tools do you hope to become more familiar with in the next 2 years? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=bi_tools_exp_data_B, alpha=0.9
            , hue_order=EXP_ORDER)
plt.xlabel('Tools')
plt.xticks(rotation=45, horizontalalignment='right');

### Automated Machine Learning Tools

- `~32%` of respondents answered affirmatively to using AutoML tools on a regular basis.
- There aren't any major differences in the usage of such tools across the different experience levels.

In [None]:
q = 'Q36_A'
cols = get_ques_cols(q)
auto_ml_tools_filtered_data = data[~data[cols].isnull().all(1)]
auto_ml_tools_exp_data = auto_ml_tools_filtered_data.groupby('Q6').apply(lambda df: 1 - (df['Q36_A_Part_7'] == 'No / None').sum() / df.shape[0])
auto_ml_tools_exp_data = round(auto_ml_tools_exp_data.loc[EXP_ORDER[1:]]*100, 2)

plt.figure(figsize=(14, 6))
plt.title("Share of Participants that use AutoML or Partial AutoML Tools on a Regular Basis (in %)")
sns.barplot(y=auto_ml_tools_exp_data.values, x=auto_ml_tools_exp_data.index, alpha=0.9)
plt.xlabel('Experience')
plt.ylabel('Percentage')
plt.ylim(0, 100);
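
The share plotted above is one minus the fraction of respondents who ticked `No / None`. An equivalent way to express it is to flag everyone who did not tick that option and average the flag per experience level. A minimal sketch, assuming a hypothetical `toy` frame in which every row has answered the question (blank rows are filtered out beforehand in the cell above):

```python
import pandas as pd

# Hypothetical respondents, for illustration only.
toy = pd.DataFrame({
    'Q6': ['< 1 years', '< 1 years', '20+ years', '20+ years'],
    'Q36_A_Part_1': ['Automated data augmentation', None, None, None],
    'Q36_A_Part_7': [None, 'No / None', 'No / None', 'No / None'],
})

# Anyone who did not tick 'No / None' counts as an AutoML user;
# a missing value in that column compares as "not equal", i.e. True.
uses_automl = toy['Q36_A_Part_7'].ne('No / None')
share = (uses_automl.groupby(toy['Q6']).mean() * 100).round(2)
print(share)
```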

- Most participants hope to become familiar with ML tools that deal with `Automation of full ML pipelines` and `Automated model selection`.

In [None]:
q = 'Q36_B'
cols = get_ques_cols(q)
auto_ml_tools_filtered_data_B = data[~data[cols].isnull().all(1)]
auto_ml_tools_exp_data_B = auto_ml_tools_filtered_data_B.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
auto_ml_tools_exp_data_B = auto_ml_tools_exp_data_B.rename({'Q6': 'Experience'}, axis='columns')
auto_ml_tools_exp_data_B = auto_ml_tools_exp_data_B.replace({
    'Automation of full ML pipelines (e.g. Google Cloud AutoML, H2O Driverless AI)': 'Automation of full ML pipelines',
    'Automated model selection (e.g. auto-sklearn, xcessiv)': 'Automated model selection',
    'Automated feature engineering/selection (e.g. tpot, boruta_py)': 'Automated feature engineering/selection',
    'Automated data augmentation (e.g. imgaug, albumentations)': 'Automated data augmentation',
    'Automated hyperparameter tuning (e.g. hyperopt, ray.tune, Vizier)': 'Automated hyperparameter tuning',
    'None': 'None',
    'Automated model architecture searches (e.g. darts, enas)': 'Automated model architecture searches',
    'Other': 'Other'
})

plt.figure(figsize=(14, 6))
plt.title("Which categories of automated machine learning tools (or partial AutoML tools) do you hope to become more familiar with in the\nnext 2 years? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=auto_ml_tools_exp_data_B, alpha=0.9
            , hue_order=EXP_ORDER)
plt.xlabel('Tools')
plt.xticks(rotation=45, horizontalalignment='right');

### Machine Learning Experiment Tools

- `~67%` of respondents do not use any ML experiment tools.
- `TensorBoard` and `MLflow` are the two most popular such tools.

In [None]:
q = 'Q38_A'
cols = get_ques_cols(q)
ml_exp_tools_filtered_data = data[~data[cols].isnull().all(1)]
ml_exp_tools_exp_data = ml_exp_tools_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
ml_exp_tools_exp_data = ml_exp_tools_exp_data.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(18, 6))
plt.title("Do you use any tools to help manage machine learning experiments? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=ml_exp_tools_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('ML Experiment Tools')
plt.xticks(rotation=45, horizontalalignment='right' );

- `>58%` of total respondents hope to become familiar with ML experiment tools within the next 2 years.
- `TensorBoard` and `MLflow` are the two most popular such tools on people's wishlist.

In [None]:
q = 'Q38_B'
cols = get_ques_cols(q)
ml_exp_tools_filtered_data_B = data[~data[cols].isnull().all(1)]
ml_exp_tools_exp_data_B = ml_exp_tools_filtered_data_B.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
ml_exp_tools_exp_data_B = ml_exp_tools_exp_data_B.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(18, 6))
plt.title("In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=ml_exp_tools_exp_data_B, alpha=0.9
            , hue_order=EXP_ORDER)
plt.xlabel('ML Experiment Tools')
plt.xticks(rotation=45, horizontalalignment='right' );

### Project Deployment Platforms

- The most popular platforms for sharing or deploying data analysis and ML applications are `GitHub` and `Kaggle`, used by ~50% and ~33% of total respondents, respectively.
- As experience levels rise, participants tend not to share or publish their work publicly.

In [None]:
q = 'Q39'
cols = get_ques_cols(q)
ml_deploy_filtered_data = data[~data[cols].isnull().all(1)]
ml_deploy_exp_data = ml_deploy_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
ml_deploy_exp_data = ml_deploy_exp_data.rename({'Q6': 'Experience'}, axis='columns')

plt.figure(figsize=(18, 6))
plt.title("Where do you publicly share or deploy your data analysis or machine learning applications? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=ml_deploy_exp_data, alpha=0.9
            , hue_order=EXP_ORDER[1:])
plt.xlabel('Platform')
plt.xticks(rotation=45, horizontalalignment='right' );

### Platforms of Data Science Courses

- `Coursera` and `Kaggle` courses are the most popular, with 50% and 45% of overall respondents having begun or completed a DS course on these platforms, respectively.
- `Udemy` and `DataCamp` courses are more popular among newer programmers, while `edX` courses are taken by a higher proportion of experienced programmers.

In [None]:
q = 'Q40'
cols = get_ques_cols(q)
course_platform_filtered_data = data[~data[cols].isnull().all(axis=1)]
course_platform_exp_data = course_platform_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
course_platform_exp_data = course_platform_exp_data.rename({'Q6': 'Experience'}, axis='columns')
course_platform_exp_data = course_platform_exp_data.replace({
    'University Courses (resulting in a university degree)': 'University Courses',
    'Cloud-certification programs (direct from AWS, Azure, GCP, or similar)': 'Cloud-certification programs (direct\nfrom AWS, Azure, GCP, or similar)'
})

plt.figure(figsize=(18, 6))
plt.title("On which platforms have you begun or completed data science courses? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=course_platform_exp_data, alpha=0.9
            , hue_order=EXP_ORDER)
plt.xlabel('Platform')
plt.xticks(rotation=45, horizontalalignment='right' );

### Data Analysis Tools

- `Basic statistical software` such as MS Excel loses popularity as the primary data analysis tool among the more experienced groups.
- `Local development environments` like RStudio and JupyterLab are more popular among the more experienced groups.
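
Unlike most questions above, this one is single-select, so each respondent contributes exactly one answer. As a small sketch of what "share within each experience group" means here, `pd.crosstab` with `normalize='index'` gives row-wise percentages. The toy data is made up for illustration, with shortened option labels:

```python
import pandas as pd

# Toy data (hypothetical): one primary tool per respondent
toy = pd.DataFrame({
    'Q6': ['< 1 years'] * 4 + ['10-20 years'] * 4,
    'Q41': ['Basic statistical software'] * 3 + ['Local development environments']
           + ['Local development environments'] * 3 + ['Basic statistical software'],
})

# normalize='index' makes each row (experience group) sum to 1
table = pd.crosstab(toy['Q6'], toy['Q41'], normalize='index') * 100
print(table)
```

Each row of `table` sums to 100, so the groups can be compared directly even if they differ a lot in size.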

In [None]:
q = 'Q41'
da_tools_filtered_data = data[~data[q].isnull()]
da_tools_exp_data = da_tools_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
da_tools_exp_data = da_tools_exp_data.rename({'Q6': 'Experience'}, axis='columns')
da_tools_exp_data = da_tools_exp_data.replace({
    'Basic statistical software (Microsoft Excel, Google Sheets, etc.)': 'Basic statistical software\n(Microsoft Excel, Google Sheets, etc.)',
    'Local development environments (RStudio, JupyterLab, etc.)': 'Local development environments\n(RStudio, JupyterLab, etc.)',
    'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)': 'Business intelligence software\n(Salesforce, Tableau, Spotfire, etc.)',
    'Cloud-based data software & APIs (AWS, GCP, Azure, etc.)': 'Cloud-based data software & APIs\n(AWS, GCP, Azure, etc.)',
    'Advanced statistical software (SPSS, SAS, etc.)': 'Advanced statistical software\n(SPSS, SAS, etc.)'
})

plt.figure(figsize=(14, 6))
plt.title("What is the primary tool that you use at work or school to analyze data?\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=da_tools_exp_data, alpha=0.9
            , hue_order=EXP_ORDER)
plt.xlabel('Tools')
plt.xticks(rotation=45, horizontalalignment='right' );

### Data Science Media Sources

- `Kaggle` and `YouTube` are the two most popular media sources for data science, with 62% and 57% of respondents having listed them, respectively.
- The more experienced the programmers, the higher the share of them who list `Blogs`, `Email newsletters` and `Journal Publications` among their media sources.

In [None]:
q = 'Q42'
cols = get_ques_cols(q)
sources_filtered_data = data[~data[cols].isnull().all(axis=1)]
sources_exp_data = sources_filtered_data.groupby('Q6').apply(lambda df: options_by_share(df, q)).reset_index().drop(['level_1'], axis=1)
sources_exp_data = sources_exp_data.rename({'Q6': 'Experience'}, axis='columns')
sources_exp_data = sources_exp_data.replace({
    'Kaggle (notebooks, forums, etc)': 'Kaggle',
    'YouTube (Kaggle YouTube, Cloud AI Adventures, etc)': 'YouTube',
    'Blogs (Towards Data Science, Analytics Vidhya, etc)': 'Blogs',
    'Twitter (data science influencers)': 'Twitter',
    "Email newsletters (Data Elixir, O'Reilly Data & AI, etc)": 'Email newsletters',
    'Course Forums (forums.fast.ai, Coursera forums, etc)': 'Course Forums',
    'Reddit (r/machinelearning, etc)': 'Reddit',
    'Journal Publications (peer-reviewed journals, conference proceedings, etc)': 'Journal Publications',
    'Slack Communities (ods.ai, kagglenoobs, etc)': 'Slack Communities',
    'Podcasts (Chai Time Data Science, O’Reilly Data Show, etc)': 'Podcasts',
    'None': 'None',
    'Other': 'Other'
})

plt.figure(figsize=(18, 6))
plt.title("Who/what are your favorite media sources that report on data science topics? (Select all that apply)\n")
sns.barplot(y='Percentage', x='Option', hue='Experience'
            , data=sources_exp_data, alpha=0.9
            , hue_order=EXP_ORDER)
plt.xlabel('Sources')
plt.xticks(rotation=45, horizontalalignment='right' );

---
#### Thank you for sticking around. If you spot any errors or typos, or have any suggestions for improvement, please feel free to leave a message.