# Introduction

This notebook provides a comprehensive EDA and data-driven insights based on the data collected in Kaggle's 2020 survey, **'State of Data Science and Machine Learning 2020'**.

It contains several chapters dedicated to various topics covered by the survey.

**Chapter 1** provides a comprehensive outlook on the demographic and general experience profile of the survey respondents in terms of their

- Age
- Gender
- Level of Education
- Occupation
- Years of Programming Experience
- Years of ML Experience

**Chapter 2** conveys some insights on the technical skills of Kagglers. It is not extensive because a number of high-quality notebooks are already dedicated to this topic (see some links in the **References** section below). Instead of investing more time in analysis of this type, I spent it on analytics in the areas that have received less attention from other contest participants (see details in Chapters 3-6 below).

However, there are still interesting discoveries from the research on multivariate categorical feature associations (*spoiler*: there is a forecast about when and how a *Julia revolution* in Data Science could happen).

**Chapter 3** is concentrated on the organizational environment and job responsibilities of the survey participants. These are

- Size of the employer organization
- Size of the team working in Data Science-related areas
- Adoption of ML in the organization
- Individual job responsibilities of the survey participants
- Spending on ML and Cloud Computing in the last 5 years

From that standpoint, it describes the essential business landscape in which Kagglers operate on a daily basis.

**Chapter 4** is focused on the preferences of Cloud Computing Provider usage by Kagglers. It draws insights on the popularity of cloud computing platforms and products among the survey participants who are professionals (as opposed to non-professionals - see the note below). In particular, it covers

- Cloud Platforms usage
- Cloud Computing products usage
- Cloud ML products usage
- BigData platforms
- BI tools (mostly, the cloud-based ones)

The narrative in this chapter often highlights the good news and opportunities for the top three cloud service providers in the market, as follows

- Amazon Web Services (AWS)
- Google Cloud Platform (GCP)
- Microsoft Azure Cloud (MS Azure)

**Chapter 5** is dedicated to the analysis of usage of Data Science automation tools by the survey respondents. In particular, it conveys findings regarding

- the usage of automated machine learning tools
- the usage of tools to help manage machine learning experiments

**Chapter 6** delves into the favorite knowledge-sharing and knowledge-acquisition channels used by the survey participants. In this chapter, we are going to analyze the preferences of the survey participants regarding

- platforms and tools to publicly share or deploy their data analysis or machine learning applications
- platforms to take online data science courses
- primary tools they use to analyze data
- favorite media sources that report on data science topics

**Chapter 7** provides the geographical perspective on the Kaggle 2020 survey participants. It draws a collection of maps to display by-country distribution of the participants by their demographic data, experience and usage of Cloud Service platforms.

**Appendix** at the end of the notebook contains supplementary notes on

- Methodology and technical implementation details about this notebook
- Comments on why Rapid EDA tools are not applicable to the EDA in the scope of this project
- References to the high-quality notebooks of other contest participants that inspired the work done here as well as to the blog posts explaining advanced topics of visualizing relations between categorical features

In [None]:
!pip install pdpipe

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
from typing import Tuple

import pdpipe as pdp

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline


# Reading Data Into Memory

In [None]:
# read data
in_kaggle = True

def get_data_file_path(is_in_kaggle: bool) -> str:
    '''Returns the path to the survey responses CSV for the current environment.'''
    if is_in_kaggle:
        # running in Kaggle, inside the competition
        data_path = '../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv'
    else:
        # running locally
        data_path = 'data/kaggle_survey_2020_responses.csv'

    return data_path

# set the size of the geo bubble
def set_size(value):
    '''
    Takes the numeric value of a parameter to visualize on a map (Plotly Geo-Scatter plot).
    Returns a number indicating the size of the bubble for the country whose numeric attribute value
    was supplied as an input.
    '''
    result = np.log(1+value/1000)
    if result < 0:
        result = 0.001
    return result

# catscatter implementation - reused from https://github.com/myrthings/catscatter/blob/master/catscatter.py
# (c) Myr Barnés, 2020
# More info about this function is available at
# - https://towardsdatascience.com/visualize-categorical-relationships-with-catscatter-e60cdb164395
# - https://github.com/myrthings/catscatter/blob/master/README.md
def catscatter(df,colx,coly,cols,color=['grey','black'],ratio=10,font='Helvetica',save=False,save_name='Default'):
    '''
    Goal: This function creates a scatter plot for categorical variables. It's useful to compare two lists with elements in common.
    Input:
        - df: required. pandas DataFrame with at least two columns with categorical variables you want to relate, and the value of both (if it's just an adjacency matrix write 1)
        - colx: required. The name of the column to display horizontally
        - coly: required. The name of the column to display vertically
        - cols: required. The name of the column with the value between the two variables
        - color: optional. Colors to display in the visualization, the length can be two or three. The two first are the colors for the lines in the matrix, the last one the font color and markers color.
            default ['grey','black']
        - ratio: optional. A ratio for controlling the relative size of the markers.
            default 10
        - font: optional. The font for the ticks on the matrix.
            default 'Helvetica'
        - save: optional. True for saving as an image in the same path as the code.
            default False
        - save_name: optional. The name used for saving the image (then the code adds .png)
            default: "Default"
    Output:
        No output. Matplotlib object is not shown by default to be able to add more changes.
    '''
    # Create a dict to encode the categories into numbers (sorted)
    colx_codes=dict(zip(df[colx].sort_values().unique(),range(len(df[colx].unique()))))
    coly_codes=dict(zip(df[coly].sort_values(ascending=False).unique(),range(len(df[coly].unique()))))
    
    # Apply the encoding
    df[colx]=df[colx].apply(lambda x: colx_codes[x])
    df[coly]=df[coly].apply(lambda x: coly_codes[x])
    
    
    # Prepare the aspect of the plot
    plt.rcParams['xtick.bottom'] = plt.rcParams['xtick.labelbottom'] = False
    plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = True
    plt.rcParams['font.sans-serif']=font
    plt.rcParams['xtick.color']=color[-1]
    plt.rcParams['ytick.color']=color[-1]
    plt.box(False)

    
    # Plot all the lines for the background
    for num in range(len(coly_codes)):
        plt.hlines(num,-1,len(colx_codes)+1,linestyle='dashed',linewidth=1,color=color[num%2],alpha=0.5)
    for num in range(len(colx_codes)):
        plt.vlines(num,-1,len(coly_codes)+1,linestyle='dashed',linewidth=1,color=color[num%2],alpha=0.5)
        
    # Plot the scatter plot with the numbers
    plt.scatter(df[colx],
               df[coly],
               s=df[cols]*ratio,
               zorder=2,
               color=color[-1])
    
    # Change the ticks numbers to categories and limit them
    plt.xticks(ticks=list(colx_codes.values()),labels=colx_codes.keys(),rotation=90)
    plt.yticks(ticks=list(coly_codes.values()),labels=coly_codes.keys())
    plt.xlim(xmin=-1,xmax=len(colx_codes))
    plt.ylim(ymin=-1,ymax=len(coly_codes))
    
    # Save if wanted
    if save:
        plt.savefig(save_name+'.png')

In [None]:
# main flow
start_time = dt.datetime.now()
print("Started at ", start_time)

# get the survey response data
data_path = get_data_file_path(in_kaggle)

# read the raw survey data into a Pandas DF in memory
df = pd.read_csv(data_path, low_memory=False)

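# the first row holds the question text; the remaining rows hold the responses
questions = df.iloc[0, :]
# take an explicit copy so that the column assignments during preprocessing
# do not trigger pandas' SettingWithCopyWarning
data = df.iloc[1:, :].copy()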

In [None]:
data.head()
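
The first row of the CSV holds the question wording, which is why it is split off into `questions` above. Below is a minimal sketch of how that row can be used to look up what a column code means. The frame is a made-up two-column stand-in for the survey file, and the helper name `question_text` is an assumption for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the survey CSV: row 0 carries the question text,
# the remaining rows carry responses (all values here are made up).
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "18-21"],
    "Q2": ["What is your gender? - Selected Choice", "Man", "Woman"],
})

questions = raw.iloc[0, :]          # Series: column code -> question text
responses = raw.iloc[1:, :].copy()  # explicit copy, safe to modify later


def question_text(code: str) -> str:
    """Return the question wording for a column code such as 'Q1' (hypothetical helper)."""
    return questions[code]


print(question_text("Q2"))
```

Keeping the lookup next to the data makes it easier to check that a column such as `Q6` really means what a chart title claims.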

In [None]:
# data preprocessing 

# pre-process gender info to collapse rare gender categories to 'Etc.'
data['Q2'] = data['Q2'].apply(lambda x : 'Etc.' if x not in ['Man', 'Woman'] else x)

# fill NA in Q38 (data analysis tools) with 'None'
data['Q38'] = data['Q38'].fillna('None')

# Chapter 1: Demography and Generic Experience

This chapter focuses on insights about the basic demographic and general experience characteristics of the survey participants, as follows

- Age
- Gender
- Level of Education
- Occupation
- Years of Programming Experience
- Years of ML Experience

## Age-Gender Relationship

In [None]:
# aggregate respondent counts by age and gender
agg_data = data.groupby(['Q1','Q2']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q1': 'age', 
        'Q2': 'gender', 
    })
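
The groupby-and-rename pattern above is repeated for every pair of questions in this chapter. As a hedged sketch, it could be wrapped in a small helper. The name `count_pairs` is hypothetical and the toy responses are made up; the original notebook keeps the steps inline.

```python
import pandas as pd


def count_pairs(frame: pd.DataFrame, renames: dict) -> pd.DataFrame:
    """Count respondents per combination of the question columns in `renames`.

    `renames` maps question codes to readable names,
    e.g. {'Q1': 'age', 'Q2': 'gender'}.
    """
    cols = list(renames)
    counts = frame.groupby(cols).size().reset_index(name="respondent_count")
    return counts.rename(columns=renames)


# Made-up responses, for illustration only
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "25-29", "25-29", "25-29"],
    "Q2": ["Man", "Woman", "Man", "Man", "Woman"],
})
agg_toy = count_pairs(toy, {"Q1": "age", "Q2": "gender"})
print(agg_toy)
```

A call such as `count_pairs(data, {'Q1': 'age', 'Q4': 'education_level'})` would then reproduce the aggregation used in the next section.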

First of all, we will build a categorical scatter plot to display the respondent counts at the intersections of the age and gender categories

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'age', 'gender', 'respondent_count', font='Helvetica', color=colors, ratio=10)

plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.show()



In [None]:
fig = px.bar(
    agg_data, 
    x='age', 
    y='respondent_count', 
    color='gender', 
    title="Kaggle 2020 Survey Respondents by Age and Gender", 
    height=600)
fig.show()

The charts above reveal a number of interesting demographic insights

- the young generations (people aged 18-29) predominate among the survey respondents and, most likely, among the entire Kaggle population at the moment
- people aged 25-29 are the largest group within the Kaggle survey community
- males outnumber females and other genders in every age category
- the fraction of women among younger Kagglers is slightly higher than among older ones
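
The last point compares shares rather than raw counts, which are hard to read off a stacked bar chart. A minimal sketch of how the share of each gender within an age group could be computed from a frame shaped like `agg_data`; the counts below are made up, and the column name `share_in_age_group` is an assumption for illustration.

```python
import pandas as pd

# Made-up counts in the same shape as `agg_data` (age, gender, respondent_count)
toy_counts = pd.DataFrame({
    "age": ["18-21", "18-21", "40-44", "40-44"],
    "gender": ["Man", "Woman", "Man", "Woman"],
    "respondent_count": [300, 100, 170, 30],
})

# Divide each count by the total of its age group
group_total = toy_counts.groupby("age")["respondent_count"].transform("sum")
toy_counts["share_in_age_group"] = toy_counts["respondent_count"] / group_total

# Keep only the rows for women, indexed by age group
women_share = (
    toy_counts[toy_counts["gender"] == "Woman"]
    .set_index("age")["share_in_age_group"]
)
print(women_share)
```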

## Age-to-Level-of-Education Relationship

In [None]:
# aggregate respondent counts by age and education level
agg_data = data.groupby(['Q1','Q4']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q1': 'age', 
        'Q4': 'education_level', 
    })

First of all, we will build a categorical scatter plot to display the respondent counts at the intersections of the age and education level categories

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'age', 'education_level', 'respondent_count', font='Helvetica', color=colors, ratio=20)

plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='age', 
    y='respondent_count', 
    color='education_level', 
    title="Kaggle 2020 Survey Respondents by Age and Education Level", 
    height=600)
fig.show()

We can see that the four biggest Kaggle population clusters in the space of Education Level and Age are as follows

1. Bachelors aged 18-21 (the biggest cluster by far)
2. People aged 25-29 with a Master's degree
3. Bachelors aged 22-24
4. Masters aged 22-24
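
A ranking like the one above can also be read directly from the aggregated counts instead of the charts. A minimal sketch, assuming a frame shaped like `agg_data`; the counts and the shortened education labels are made up for illustration.

```python
import pandas as pd

# Made-up counts in the same shape as `agg_data`
toy_counts = pd.DataFrame({
    "age": ["18-21", "22-24", "22-24", "25-29", "25-29"],
    "education_level": ["Bachelor", "Bachelor", "Master", "Bachelor", "Master"],
    "respondent_count": [50, 30, 25, 10, 40],
})

# The four largest (age, education level) combinations, biggest first
top_clusters = toy_counts.nlargest(4, "respondent_count").reset_index(drop=True)
print(top_clusters)
```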

On top of the population cluster insight above, we can draw the following additional insights

- People with a Bachelor's degree predominate in the more junior age groups
- In the senior age groups, the percentage of Master's and Doctoral degrees grows
- People with a Master's degree predominate in every age group from 25-29 onwards
- From the age group of 35-39 onwards, the number of respondents with a Doctoral degree is only a little lower than the number of respondents with a Master's degree in the same age group

Based on the findings above, combined with my own (quite subjective) 'kaggling' experience, I could formulate a few suggestions regarding better Kaggle community engagement/participation practices.

1. Senior Kagglers with Master's and Doctoral degrees who have crossed the *acme age* line (that is, those aged 45-49 and older) constitute the 'golden pool' of experts on the Kaggle platform. It could make sense to engage with them better, to motivate them to do more knowledge sharing with, and possibly other training activities for, the more junior population of Kagglers.
2. Ethical standards of conduct across various activities on the platform (otherwise known as 'ethical kaggling') could be one of the pillars that keep senior-level experts willing to contribute more to the body of knowledge here at Kaggle. Conversely, the lack of 'ethical kaggling' may discourage them from being active on the platform.
3. Since the majority of Kagglers are quite junior, one of their motivations is to prove their competencies/expertise as well as to stand out from the crowd. Some of them are yet to get their heads around the concept of 'ethical kaggling' (see discussion threads like https://www.kaggle.com/discussion/206639 and similar).
4. Promoting 'ethical kaggling' practices, along with some form of oversight, can help the Kaggle community sustain itself in the long run. It will also help to better motivate the top senior-level experts to share their knowledge and wisdom with the junior generations.

## Age-Occupation Relationship

In [None]:
# aggregate respondent counts by age and occupation
agg_data = data.groupby(['Q1','Q5']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q1': 'age', 
        'Q5': 'occupation', 
    })

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(70,30))
catscatter(agg_data_copy , 'age', 'occupation', 'respondent_count', font='Helvetica', color=colors, ratio=10)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='age', 
    y='respondent_count', 
    color='occupation', 
    title="Kaggle 2020 Survey Respondents by Age and Occupation", 
    height=600)
fig.show()

As we can see, the top clusters of the Kaggle survey respondents in the space of Age and Occupation are as follows

- Students of the age of 18-21
- Students of the age of 22-24
- Students of the age of 25-29
- Data Scientists of the age of 25-29
- Data Scientists of the age of 30-34

The next tier of respondent clusters shows quite a significant number of young unemployed people (aged 22-24 and 25-29, respectively). It may imply that young unemployed data science professionals regard Kaggle as a platform to increase their chances of getting a prominent job in the respective Data Science-related area.

Additional insights within the Age-Occupation breakdown can be drawn as follows

- students predominate in the age groups of 18-21 and 22-24, and they stop being dominant from the age group of 25-29 onwards
- in the age groups of 25-29 and older, Data Scientist is the most popular occupation among the survey respondents
- the occupations of Software Engineer, Data Analyst, and Research Scientist are in the next tier in terms of popularity
- quite a significant number of respondents indicated 'Other' as their occupation; it can be assumed that these are professionals from outside the IT/Software Development/Science/Statistics-related fields who are interested in Data Science topics
- the share of product and project management professionals, although quite small in every age group, tends to grow from the age group of 25-29 to 50-54
- the share of unemployed respondents is still significant in the middle-aged and senior age groups

## Age-to-Years-of-Programming-Experience Relationship

In [None]:
# aggregate respondent counts by age and programming experience
agg_data = data.groupby(['Q1','Q6']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q1': 'age', 
        'Q6': 'sd_experience', 
    })


In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(70,30))
catscatter(agg_data_copy , 'age', 'sd_experience', 'respondent_count', font='Helvetica', color=colors, ratio=20)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='age', 
    y='respondent_count', 
    color='sd_experience', 
    title="Kaggle 2020 Survey Respondents by Age and Programming Experience", 
    height=600)
fig.show()

As we can see, the charts above support our naive intuition about the positive correlation between age and programming experience. More specifically, we find that

- the majority of young Kagglers are relatively inexperienced (having at most 3-5 years of programming experience)
- more senior people tend to report more years of programming experience
- in every age group (even in the senior ones), there is always a tiny fraction of people who indicated no programming experience at all
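
The positive relationship read off the charts could also be quantified with a rank correlation, since both features are ordered categories. The sketch below is not part of the original analysis: the category orderings are assumptions that should be checked against the actual answer options, the counts are made up, and Spearman's coefficient is computed as the Pearson correlation of the ranks.

```python
import pandas as pd

# Assumed orderings of the answer options (a subset, for illustration only)
AGE_ORDER = ["18-21", "22-24", "25-29", "30-34"]
EXP_ORDER = ["< 1 years", "1-2 years", "3-5 years", "5-10 years"]
age_code = {label: i for i, label in enumerate(AGE_ORDER)}
exp_code = {label: i for i, label in enumerate(EXP_ORDER)}

# Made-up counts in the same shape as `agg_data`
toy_counts = pd.DataFrame({
    "age": ["18-21", "18-21", "22-24", "25-29", "30-34", "30-34"],
    "sd_experience": ["< 1 years", "1-2 years", "1-2 years",
                      "3-5 years", "3-5 years", "5-10 years"],
    "respondent_count": [5, 3, 4, 4, 2, 3],
})

# Repeat each row `respondent_count` times to get one row per respondent
per_respondent = toy_counts.loc[
    toy_counts.index.repeat(toy_counts["respondent_count"])
].reset_index(drop=True)

# Map each category to its position in the assumed ordering
age_rank = per_respondent["age"].map(age_code)
exp_rank = per_respondent["sd_experience"].map(exp_code)

# Spearman correlation = Pearson correlation of the (average) ranks
spearman = age_rank.rank().corr(exp_rank.rank())
print(round(spearman, 3))
```

A value close to 1 would support the intuition above; a value near 0 would suggest the stacked bars are misleading.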

## Gender-to-Level of Education Relationship

In [None]:
# aggregate respondent counts by gender and education level
agg_data = data.groupby(['Q2', 'Q4']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q4': 'education', 
        'Q2': 'gender', 
    })

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'gender', 'education', 'respondent_count', font='Helvetica', color=colors, ratio=5)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='gender', 
    y='respondent_count', 
    color='education', 
    title="Kaggle 2020 Survey Respondents by Gender and Level of Education", 
    height=600)
fig.show()

As we can see, the top clusters of the survey respondents within the dimensions of Gender and Level of Education are as follows

- Males with Bachelor's degree
- Males with Master's degree
- Males with Doctoral degree
- Females with Bachelor's degree
- Females with Master's degree

Bachelor's and Master's degrees are the most common levels of education for every gender.

## Gender-to-Occupation Relationship

In [None]:
# aggregate respondent counts by gender and occupation
agg_data = data.groupby(['Q2', 'Q5']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation', 
        'Q2': 'gender', 
    })

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'gender', 'occupation', 'respondent_count', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='gender', 
    y='respondent_count', 
    color='occupation', 
    title="Kaggle 2020 Survey Respondents by Gender and Occupation", 
    height=600)
fig.show()

We find that 

- Student is the most popular occupation for every gender
- Top three occupations (after Student) for males are Data Scientist, Software Engineer, and Other
- Top three occupations (after Student) for females are Data Scientist, Data Analyst, and Unemployed

As we can see, the share of unemployed respondents is higher among females than among males.

## Gender-to-Programming Experience Relationship

In [None]:
# aggregate respondent counts by gender and programming experience
agg_data = data.groupby(['Q2', 'Q6']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q6': 'sd_experience', 
        'Q2': 'gender', 
    })

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'gender', 'sd_experience', 'respondent_count', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='gender', 
    y='respondent_count', 
    color='sd_experience', 
    title="Kaggle 2020 Survey Respondents by Gender and Programming Experience", 
    height=600)
fig.show()

We find that

- the majority of both males and females are relatively inexperienced, as they fall into three programming experience categories (3-5 years, 1-2 years, and <1 year)
- the share of males having 5-10 years of experience is higher than that of females
- the share of females having zero programming experience is higher than that of males

## Gender-to-ML Experience Relationship

In [None]:
# aggregate respondent counts by gender and ML experience
agg_data = data.groupby(['Q2', 'Q15']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q15': 'ml_experience', 
        'Q2': 'gender', 
    })

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'gender', 'ml_experience', 'respondent_count', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='gender', 
    y='respondent_count', 
    color='ml_experience', 
    title="Kaggle 2020 Survey Respondents by Gender and Machine Learning Experience", 
    height=600)
fig.show()

We find that

- the majority of the survey respondents of every gender are relatively inexperienced in ML (they fall into the Under 1 year, 1-2 years, 2-3 years, and 'I do not use ML ... ' categories)
- the share of females who indicated 'I do not use ML ... ' is higher than that of males

## Level of Education-to-Occupation Relationship

In [None]:
# aggregate respondent counts by education level and occupation
agg_data = data.groupby(['Q4', 'Q5']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q4': 'education', 
        'Q5': 'occupation', 
    })

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(60,20))
catscatter(agg_data_copy , 'occupation', 'education', 'respondent_count', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='occupation', 
    y='respondent_count', 
    color='education', 
    title="Kaggle 2020 Survey Respondents by Level of Education and Occupation", 
    height=600)
fig.show()

We can see the top three clusters of the survey respondents in terms of Level of Education and Occupation

- Students with a Bachelor's degree
- Students with a Master's degree
- Data Scientists with a Master's degree

On top of that, we find that

- the number of people with a Bachelor's degree exceeds the number of people with a Master's degree in the Student and Unemployed groups
- the numbers of people with Bachelor's and Master's degrees are on a par for Software Engineers
- for the rest of the occupations, the number of people with a Master's degree exceeds the number with a Bachelor's degree
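
Comparisons between two education levels within each occupation are easier to check in a wide table than in the long format produced by the aggregation. A minimal sketch with made-up counts; the shortened labels `Bachelor` and `Master` and the column name `bachelor_minus_master` are assumptions for illustration, and the real data uses the full answer wording.

```python
import pandas as pd

# Made-up counts in the same shape as `agg_data`
toy_counts = pd.DataFrame({
    "occupation": ["Student", "Student", "Data Scientist", "Data Scientist"],
    "education": ["Bachelor", "Master", "Bachelor", "Master"],
    "respondent_count": [120, 80, 40, 90],
})

# One row per occupation, one column per education level
wide = toy_counts.pivot(
    index="occupation", columns="education", values="respondent_count"
).fillna(0)

# Positive values: Bachelor's degrees outnumber Master's degrees
wide["bachelor_minus_master"] = wide["Bachelor"] - wide["Master"]
print(wide)
```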

## Level of Education-to-Programming Experience Relationship

In [None]:
# aggregate respondent counts by education level and programming experience
agg_data = data.groupby(['Q4', 'Q6']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q4': 'education', 
        'Q6': 'sd_experience', 
    })

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(60,20))
catscatter(agg_data_copy , 'sd_experience', 'education', 'respondent_count', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='sd_experience', 
    y='respondent_count', 
    color='education', 
    title="Kaggle 2020 Survey Respondents by Level of Education and Programming Experience", 
    height=600)
fig.show()

As we can see, there are four top clusters among the survey respondents within the Education and Programming Experience dimensions, as follows

- People with Bachelor's degree having 1-2 years of programming experience
- People with Master's degree having 3-5 years of programming experience
- People with Bachelor's degree having 3-5 years of programming experience
- People with Master's degree having 1-2 years of programming experience

For the more experienced categories (5+ years of programming experience), we find that the share of people with a Master's degree exceeds that of people with a Bachelor's degree.

The percentage of people with a Doctoral degree is higher in the most experienced categories (programming experience of 10-20 years and 20+ years).

## Level of Education-to-ML Experience Relationship

In [None]:
# aggregate respondent counts by education level and ML experience
agg_data = data.groupby(['Q4', 'Q15']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q4': 'education', 
        'Q15': 'ml_experience', 
    })

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(60,20))
catscatter(agg_data_copy , 'ml_experience', 'education', 'respondent_count', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='ml_experience', 
    y='respondent_count', 
    color='education', 
    title="Kaggle 2020 Survey Respondents by Level of Education and Machine Learning Experience", 
    height=500)
fig.show()

As we can see, the majority of the survey participants fall into the clusters indicating relatively inexperienced professionals. These are

- People with Bachelor's degree having less than 1 year of ML experience
- People with Master's degree having less than 1 year of ML experience
- People with Bachelor's degree having 1-2 years of ML experience
- People with Master's degree having 1-2 years of ML experience

## Occupation-to-Programming Experience Relationship

In [None]:
# aggregate respondent counts by programming experience and occupation
agg_data = data.groupby(['Q6', 'Q5']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q6': 'sd_experience', 
        'Q5': 'occupation', 
    })

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'sd_experience', 'occupation', 'respondent_count', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='sd_experience', 
    y='respondent_count', 
    color='occupation', 
    title="Kaggle 2020 Survey Respondents by Occupation and Programming Experience", 
    height=600)
fig.show()

As we can see, the population of the survey respondents is heavily dominated by students with little programming experience (between 0 and 5 years).

Among non-student participants, the top three clusters (by the number of respondents in the cluster) are as follows

- Data Scientists with 3-5 years of programming experience
- Data Scientists with 5-10 years of programming experience
- Unemployed with less than 1 year of programming experience

## Occupation-to-ML Experience Relationship

In [None]:
# aggregate respondent counts by occupation and ML experience
agg_data = data.groupby(['Q5', 'Q15']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q15': 'ml_experience', 
        'Q5': 'occupation', 
    })

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'ml_experience', 'occupation', 'respondent_count', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='ml_experience', 
    y='respondent_count', 
    color='occupation', 
    title="Kaggle 2020 Survey Respondents by Occupation and Machine Learning Experience", 
    height=600)
fig.show()

## Programming-to-ML Experience Relationship

In [None]:
# aggregate respondent counts by programming experience and ML experience
agg_data = data.groupby(['Q6', 'Q15']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q15': 'ml_experience', 
        'Q6': 'sd_experience', 
    })

In [None]:
# work on a copy: catscatter replaces the plotted category columns
# with integer codes in place, which would otherwise corrupt agg_data
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'ml_experience', 'sd_experience', 'respondent_count', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.bar(
    agg_data, 
    x='sd_experience', 
    y='respondent_count', 
    color='ml_experience', 
    title="Kaggle 2020 Survey Respondents by Programming and Machine Learning Experience", 
    height=600)
fig.show()

As we can see, the majority of the survey participants fall into relatively inexperienced clusters as follows

- professionals with less than 1 year of programming and less than 1 year of ML experience
- professionals with 1-2 years of programming and less than 1 year of ML experience
- professionals with 3-5 years of programming and less than 1 year of ML experience
- professionals with 1-2 years of programming and 1-2 years of ML experience
- professionals with 3-5 years of programming and 1-2 years of ML experience
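
The ranking of such clusters can also be read directly from the aggregated table rather than from the chart. The snippet below is a minimal sketch on hypothetical toy counts (the numbers and category labels are made up for illustration and are not the survey results); it only assumes the `sd_experience`, `ml_experience` and `respondent_count` columns built above.

```python
import pandas as pd

# Hypothetical toy counts; the real values come from the survey data.
agg_data = pd.DataFrame({
    "sd_experience": ["< 1 years", "1-2 years", "3-5 years", "1-2 years", "5-10 years"],
    "ml_experience": ["Under 1 year", "Under 1 year", "Under 1 year", "1-2 years", "2-3 years"],
    "respondent_count": [120, 200, 90, 80, 15],
})

# Largest (programming, ML) experience clusters first.
top_clusters = agg_data.nlargest(3, "respondent_count").reset_index(drop=True)

# Share of all respondents that falls into the top clusters.
top_share = top_clusters["respondent_count"].sum() / agg_data["respondent_count"].sum()

print(top_clusters)
print(f"Top clusters cover {top_share:.0%} of respondents")
```

Reporting the covered share next to the ranking makes it easier to judge how dominant the listed clusters really are.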

# Chapter 2: Skills Profile of Kaggle Survey 2020 Participants

In this chapter, we are going to get some insights into the technical skills of the survey participants.

*Note:* We are not going to exhaustively analyze every technical skill of the respondents, since a number of high-quality notebooks already cover this professionally (see https://www.kaggle.com/dwin183287/kagglers-seen-by-continents, for instance). We are only going to capture non-trivial insights via parallel and hierarchical multivariate associations between the categorical variables in the survey dataset.

## Programming Language Preferences

The basic findings about programming language preferences reported in https://www.kaggle.com/dwin183287/kagglers-seen-by-continents are as follows:

- Python is the clear winner: it is the most popular programming language among Kagglers on every continent.
- The second most popular programming language among Kagglers is SQL.
- R is not popular among Kagglers in Asia, where it is ranked 6th; this is very different from the rest of the world, where R is the 3rd most popular programming language.
- C and C++ are still more popular in Asia than on other continents.

In this section, we are going to look deeper at programming language preferences, via multivariate parallel interactions with multiple demographic features.

### Programming Languages by Occupation and Programming Experience

In [None]:
languange_lst = ["Q7_Part_1", "Q7_Part_2", "Q7_Part_3", "Q7_Part_4", "Q7_Part_5", "Q7_Part_6",
                "Q7_Part_7", "Q7_Part_8", "Q7_Part_9", "Q7_Part_10", "Q7_Part_11", "Q7_Part_12", "Q7_OTHER"] 

agg_data = data.groupby(["Q5", "Q6"])[languange_lst].count().reset_index()


agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        "Q7_Part_1": "Python", 
        "Q7_Part_2": "R", 
        "Q7_Part_3": "SQL", 
        "Q7_Part_4": "C", 
        "Q7_Part_5": "C++", 
        "Q7_Part_6": "Java",
        "Q7_Part_7": "Javascript", 
        "Q7_Part_8": "Julia", 
        "Q7_Part_9": "Swift", 
        "Q7_Part_10": "Bash", 
        "Q7_Part_11": "MATLAB", 
        "Q7_Part_12": "None", 
        "Q7_OTHER": "Other"
    })
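
The aggregation above relies on how multiple-choice answers are stored. The sketch below assumes that each `Q7_Part_n` column holds the option label when the respondent selected it and is empty (NaN) otherwise, so that `count()` yields the number of respondents who picked the option. The toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a selected option holds its label, an unselected one is NaN.
toy = pd.DataFrame({
    "Q5": ["Student", "Student", "Data Scientist"],
    "Q6": ["1-2 years", "1-2 years", "3-5 years"],
    "Q7_Part_1": ["Python", "Python", "Python"],
    "Q7_Part_2": [np.nan, "R", np.nan],
})

# count() ignores NaN, so each cell is the number of respondents who ticked the option.
toy_agg = toy.groupby(["Q5", "Q6"])[["Q7_Part_1", "Q7_Part_2"]].count().reset_index()
print(toy_agg)
```

Note that these are absolute counts, so larger respondent groups naturally produce larger numbers.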

#### Python Usage Patterns by Occupation and Programming Experience

In [None]:
# work on a copy to avoid a small catscatter bug:
# plotting casts the plotted categorical columns to int64 in place
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'sd_experience', 'occupation', 'Python', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.treemap(
    agg_data, 
    path=[ 'occupation', 'sd_experience'], 
    values='Python', 
    #color='Python', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

As we can see, the most numerous users of Python among Kaggle 2020 survey respondents are within the clusters below

- Students with 1-2 years of programming experience
- Students with 3-5 years of programming experience
- Students with less than 1 year of programming experience

Among non-student clusters, we see active Python usage by

- Data Scientists with 3-5 years of programming experience
- Data Scientists with 5-10 years of programming experience
- Data Scientists with 1-2 years of programming experience
- Data Analysts with 1-2 years of programming experience
- Data Scientists with 10-20 years of programming experience
- Software Engineers with 3-5 years of programming experience
- Software Engineers with 5-10 years of programming experience

We can draw the insights below

- Currently unemployed kagglers prefer Python
- ML Engineers use Python to some extent
- Data Analysts with more than 2 years of industry experience use Python less actively

#### R Usage Patterns by Occupation and Programming Experience

In [None]:
# work on a copy to avoid a small catscatter bug:
# plotting casts the plotted categorical columns to int64 in place
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'sd_experience', 'occupation', 'R', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.treemap(
    agg_data, 
    path=['occupation', 'sd_experience'], #
    values='R', 
    # color='R', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

As we can see, R is most actively used by the users in the following clusters

- Students with moderate programming experience (less than 1 year, 1-2 years, and 3-5 years of programming experience)
- Data Scientists with 3-5 years of programming experience
- Data Scientists with 5-10 years of programming experience

We also see the following R usage patterns

- R is fairly popular among Data Analysts of various programming experience levels
- Software Engineers and ML Engineers rarely use R

#### SQL Usage Patterns by Occupation and Programming Experience

In [None]:
# work on a copy to avoid a small catscatter bug:
# plotting casts the plotted categorical columns to int64 in place
agg_data_copy = agg_data.copy() 


#plot it

colors=['blue', 'grey', 'green']
# create the plot
plt.figure(figsize=(50,20))
catscatter(agg_data_copy , 'sd_experience', 'occupation', 'SQL', font='Helvetica', color=colors, ratio=3)

plt.xticks(fontsize=40)
plt.yticks(fontsize=40)
plt.show()

In [None]:
fig = px.treemap(
    agg_data, 
    path=['occupation', 'sd_experience'], 
    values='SQL', 
    # color='SQL', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

As we can see, SQL is most actively used by the users in the following clusters

- Students with moderate programming experience (less than 1 year, 1-2 years, and 3-5 years of programming experience)
- Data Scientists with 3-5 years of programming experience
- Data Scientists with 5-10 years of programming experience

The next tier of user clusters (in terms of the number of respondents using SQL) is listed below

- Data Scientists within other programming experience groups
- Data Analysts with 1-2 and 3-5 years of programming experience
- Software Engineers

#### Julia: the Rising Star

In [None]:
fig = px.treemap(
    agg_data, 
    path=['occupation', 'sd_experience'], 
    values='Julia', 
    #color='Julia', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

According to some industry analysts, Julia is predicted to be the next big thing in Machine Learning and Data Science, gradually replacing Python in the leading positions over the next 3-5 years. However, the results of the survey show that Julia is yet to find its way into mainstream ML and Data Science software development.

We can see that

- Julia is experimented with by relatively experienced professionals (Data Scientists, Research Scientists, and Software Engineers with 10+ years of experience) as well as by Students
- the rest of the respondents do not display a solid interest in using Julia yet

So the 'Julia revolution' is not likely to happen soon. It can only start once today's students who have become fans of Julia gain more industry experience and move into senior-level positions where they can influence strategic technical and business decisions in their organizations.

### Programming Languages by Occupation and Education Level

In [None]:
languange_lst = ["Q7_Part_1", "Q7_Part_2", "Q7_Part_3", "Q7_Part_4", "Q7_Part_5", "Q7_Part_6",
                "Q7_Part_7", "Q7_Part_8", "Q7_Part_9", "Q7_Part_10", "Q7_Part_11", "Q7_Part_12", "Q7_OTHER"] 

agg_data = data.groupby(["Q5", "Q4"])[languange_lst].count().reset_index()


agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q4': 'education',
        "Q7_Part_1": "Python", 
        "Q7_Part_2": "R", 
        "Q7_Part_3": "SQL", 
        "Q7_Part_4": "C", 
        "Q7_Part_5": "C++", 
        "Q7_Part_6": "Java",
        "Q7_Part_7": "Javascript", 
        "Q7_Part_8": "Julia", 
        "Q7_Part_9": "Swift", 
        "Q7_Part_10": "Bash", 
        "Q7_Part_11": "MATLAB", 
        "Q7_Part_12": "None", 
        "Q7_OTHER": "Other"
    })

#### Python Usage Patterns by Education and Occupation

In [None]:
fig = px.treemap(
    agg_data, 
    path=['occupation', 'education'], 
    values='Python', 
    color='Python', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

As we can see, the most active users of Python fall into the clusters below

- Students
- Data Scientists with Master's degree

#### R Usage Patterns by Education and Occupation

In [None]:
fig = px.treemap(
    agg_data, 
    path=['occupation', 'education'], 
    values='R', 
    #color='R', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

As we can see, the top user clusters who use R actively are as follows

- Data Scientists with Master's Degree
- Students with Master's Degree
- Students with Bachelor's Degree

We also see that the Data Analysts category comes next in terms of R usage (after Data Scientists and Students), as opposed to what has been observed for Python (see above).

#### SQL Usage Patterns by Education and Occupation

In [None]:
fig = px.treemap(
    agg_data, 
    path=['occupation', 'education'], 
    values='SQL', 
    # color='SQL', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

As we can see, the top SQL user clusters within the survey respondents are as follows

- Data Scientists with Master's degree
- Students with Bachelor's degree
- Students with Master's degree
- Data Analysts with Master's degree

## IDE Preferences

In https://www.kaggle.com/dwin183287/kagglers-seen-by-continents, the following basic findings about IDE preferences were reported:

- Jupyter is the most used IDE among Kagglers around the world.
- Though VSCode is the 2nd most used IDE in the world, there is still a very big gap between it and Jupyter.
- The third place varies among the continents: most Kagglers choose PyCharm, except in America and Australia, where RStudio is preferred over PyCharm.
- R is ranked as the 3rd most popular language on all continents (except Asia), but RStudio is not as popular as R itself in Europe, Others and Africa. PyCharm still dominates in those three regions.

In this section, we are going to look deeper at IDE preferences, via multivariate parallel interactions with multiple demographic features.

### IDE By Occupation, Education Level and Programming Experience

In [None]:
ide_lst = ["Q9_Part_1", "Q9_Part_2", "Q9_Part_3", "Q9_Part_4", "Q9_Part_5", "Q9_Part_6",
           "Q9_Part_7", "Q9_Part_8", "Q9_Part_9", "Q9_Part_10", "Q9_Part_11", "Q9_OTHER"] 

agg_data = data.groupby(["Q5", "Q4", "Q6"])[ide_lst].count().reset_index()


agg_data = agg_data.rename(
    columns={
        'Q4': 'education', 
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        "Q9_Part_1": "Jupyter", 
        "Q9_Part_2": "RStudio", 
        "Q9_Part_3": "Visual Studio", 
        "Q9_Part_4": "VSCode", 
        "Q9_Part_5": "PyCharm", 
        "Q9_Part_6": "Spyder",
        "Q9_Part_7": "Notepad++", 
        "Q9_Part_8": "Sublime Text", 
        "Q9_Part_9": "Vim, Emacs", 
        "Q9_Part_10": "MATLAB", 
        "Q9_Part_11": "None", 
        "Q9_OTHER": "Other"
    })

**Jupyter Popularity Chart**

In [None]:
dimension_cats = ['occupation', 'education', 'sd_experience']
fig = px.parallel_categories(agg_data, dimensions=dimension_cats, color='Jupyter')
fig.show()

We can see there are no special patterns in Jupyter usage. It is simply the most popular IDE at the moment. The fact that Students show the most active usage of Jupyter reflects Students being the largest group among the Kaggle 2020 survey participants.
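
One way to separate popularity from group size is to look at the share of users within each group instead of the raw count. The snippet below is only a sketch on hypothetical toy data (it is not part of the original analysis); it assumes that `Q9_Part_1` is non-null when a respondent selected Jupyter.

```python
import numpy as np
import pandas as pd

# Hypothetical responses: "Q9_Part_1" is non-null when the respondent uses Jupyter.
toy = pd.DataFrame({
    "occupation": ["Student"] * 4 + ["Data Scientist"] * 2,
    "Q9_Part_1": ["Jupyter", "Jupyter", np.nan, np.nan, "Jupyter", "Jupyter"],
})

# count = non-null answers (users), size = all respondents in the group.
usage = toy.groupby("occupation")["Q9_Part_1"].agg(["count", "size"])
usage = usage.rename(columns={"count": "users", "size": "respondents"})

# Share of each group that selected the option.
usage["share"] = usage["users"] / usage["respondents"]
print(usage)
```

In this toy example both groups have the same number of Jupyter users, yet the share among Students is half of that among Data Scientists, which is exactly the effect a large group can hide in a count-based chart.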

**VSCode Popularity**

In [None]:
dimension_cats = ['occupation', 'education', 'sd_experience']
fig = px.parallel_categories(agg_data, dimensions=dimension_cats, color='VSCode')
fig.show()

We find that VSCode has an interesting popularity pattern: it is most used by professionals with a Master's degree who hold Data Scientist or Software Engineer titles.

**PyCharm Popularity**

In [None]:
dimension_cats = ['occupation', 'education', 'sd_experience']
fig = px.parallel_categories(agg_data, dimensions=dimension_cats, color='PyCharm')
fig.show()

We can see that PyCharm is mostly preferred by Software Engineers, Data Scientists and Machine Learning Engineers holding Master's degrees.

**RStudio Popularity**

In [None]:
dimension_cats = ['occupation', 'education', 'sd_experience']
fig = px.parallel_categories(agg_data, dimensions=dimension_cats, color='RStudio')
fig.show()

As we can see, RStudio is quite popular among Data Scientists with a Master's degree and among Students. Data Scientists with 3-5 and 5-10 years of programming experience use it the most.

Conversely, other technical experts (Software Engineers, ML Engineers) do not use it often, which supports the intuition that Python is their programming language of choice at work.

# Chapter 3: Organization and Job Responsibility Profiles

In this chapter, we are going to analyze the key characteristics of the employers of the survey participants. These are

- Size of the organization
- Size of the team working in Data Science-related areas
- Adoption of ML in the organization
- Individual job responsibilities of the survey participants
- Spending on ML and Cloud Computing in the last 5 years

We are specifically going to analyze the responses to Q20-Q23 and Q25.

*Note:* Per the findings of other researchers (https://www.kaggle.com/dwin183287/kagglers-seen-by-continents), the responses to Q24, the question on the level of compensation, seem to be skewed or partly faked by the respondents. Therefore, the analysis in this notebook does not cover it.

## Organization Size, Data Science and ML Adoption Chart

In [None]:
# aggregate basis cat data
agg_data = data.groupby(['Q20','Q21', 'Q22']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q20': 'size', 
        'Q21': 'data_scientists', 
        'Q22': 'ml_adoption', 
    })

In [None]:
fig = px.treemap(
    agg_data, 
    path=['size', 'data_scientists', 'ml_adoption'], 
    values='respondent_count', 
    color='respondent_count', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

As we can see, the largest cluster of the survey respondents works for small organizations (0-49 employees). Within that bucket, most of the organizations have either 1-2, 0 or 3-4 employees responsible for data science workloads.

The majority of such organizations are still exploring ML methods. However, some fraction of such organizations

- are reported to have recently started using ML methods, or
- do not use ML currently

Small organizations with no workers dedicated to data science workloads mostly do not use ML in their daily operations (which is to be expected).

Interestingly, some (small) subclusters of small organizations with 5+ employees dedicated to data science workloads display mature usage of ML solutions in production.

The second and third largest clusters of survey respondents consist of those who work for

- organizations with 10000+ employees
- organizations with 1000-9999 employees

In these clusters, quite a lot of respondents indicate that their organizations have well-established ML methods.

Overall, irrespective of the organization size, a Data Science team of 5+ workers seems to indicate a certain level of ML maturity within an organization. In every size group, respondents from organizations with 5+ workers dedicated to Data Science activities report that ML has either recently been put into production or is already well established.
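
The relationship between team size and ML maturity can be checked numerically with a row-normalised cross-tabulation. The snippet below is a minimal sketch on hypothetical answers; the shortened category labels are placeholders rather than the exact survey wording.

```python
import pandas as pd

# Hypothetical answers; the category labels are shortened for readability.
toy = pd.DataFrame({
    "data_scientists": ["1-2", "1-2", "1-2", "5-9", "5-9", "20+", "20+"],
    "ml_adoption": ["exploring", "exploring", "no ML", "recent production",
                    "established", "established", "established"],
})

# Row-normalised crosstab: each row sums to 1, so team sizes are comparable.
shares = pd.crosstab(toy["data_scientists"], toy["ml_adoption"], normalize="index")

# Share of respondents per team size whose organisation runs ML in production.
in_production = shares[["recent production", "established"]].sum(axis=1)
print(in_production)
```

Normalising by row removes the effect of some team-size buckets simply having more respondents than others.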

## Individual Job Responsibilities of the Survey Participants

In [None]:
responsibilities_lst = ["Q23_Part_1", "Q23_Part_2", "Q23_Part_3", "Q23_Part_4", "Q23_Part_5", "Q23_Part_6",
                "Q23_Part_7", "Q23_OTHER"] 

agg_data = data.groupby(["Q20", "Q5", "Q6"])[responsibilities_lst].count().reset_index()


agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q20': 'org_size',
        "Q23_Part_1": "Analyze and understand data", 
        "Q23_Part_2": "Build and/or run the data infrastructure", 
        "Q23_Part_3": "Build prototypes", 
        "Q23_Part_4": "Build and/or run a machine learning service", 
        "Q23_Part_5": "Experimentation", 
        "Q23_Part_6": "Do research",
        "Q23_Part_7": "None", 
        "Q23_OTHER": "Other"
    })

### Analyzing and Understanding Data

In [None]:
fig = px.treemap(
    agg_data, 
    path=['org_size', 'occupation', 'sd_experience'], 
    values='Analyze and understand data', 
    # color='Analyze and understand data', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

As we can see, Data Scientists and Data Analysts are typically the go-to persons to analyze and understand data, regardless of the employer organization size.

Additional insights are as follows

- typically, larger organizations hire more experienced Data Scientists and Data Analysts, involving them in the workflows to analyze and understand the data
- other occupations are typically less involved in this type of job responsibility

### Building and/or Running the Data Infrastructure

In [None]:
fig = px.treemap(
    agg_data, 
    path=['org_size', 'occupation', 'sd_experience'], 
    values="Build and/or run the data infrastructure", 
    # color="Build and/or run the data infrastructure", 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We can find that

- In the majority of the organizations, Data Scientists and Software Engineers are often tasked with building and/or running the data infrastructure
- In large organizations (1000+), a dedicated Data Engineer role is often introduced to handle this type of work (with Data Scientists and Software Engineers freed up for other core activities)
- Data Analysts are sometimes involved in building and running the data infrastructure, especially in smaller organizations (however, the share of Data Analysts doing so in organizations with 1000-9999 employees looks surprisingly high)

### Building Prototypes

In [None]:
fig = px.treemap(
    agg_data, 
    path=['org_size', 'occupation', 'sd_experience'], 
    values="Build prototypes", 
    # color="Build prototypes", 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We find that, regardless of the size of the employer organization, the following roles are mostly involved in building prototypes

- Data Scientists
- Software Engineers
- ML Engineers
- Research Scientists

### Building and/or Running an ML Service

In [None]:
fig = px.treemap(
    agg_data, 
    path=['org_size', 'occupation', 'sd_experience'], 
    values="Build and/or run a machine learning service", 
    # color="Build and/or run a machine learning service", 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

As we can see, the top three roles involved in building and/or running ML Services (regardless of the size of the employer organization) are

- Data Scientist
- ML Engineer
- Software Engineer
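
A ranking like the one above can be derived from the aggregated table by summing a responsibility column over organization sizes. The snippet below is a sketch on hypothetical counts shaped like `agg_data` in this section (the numbers are made up for illustration).

```python
import pandas as pd

# Hypothetical aggregated counts in the same shape as agg_data above.
toy_agg = pd.DataFrame({
    "org_size": ["0-49", "0-49", "0-49", "10000+", "10000+", "10000+", "10000+"],
    "occupation": ["Data Scientist", "ML Engineer", "Data Analyst",
                   "Data Scientist", "ML Engineer", "Software Engineer", "Data Analyst"],
    "Build and/or run a machine learning service": [30, 20, 5, 40, 25, 22, 8],
})

col = "Build and/or run a machine learning service"

# Sum over organisation sizes, then keep the three largest occupations.
top_roles = toy_agg.groupby("occupation")[col].sum().nlargest(3)
print(top_roles)
```

The same pattern works for any other responsibility column by changing `col`.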

### Experimentation

In [None]:
fig = px.treemap(
    agg_data, 
    path=['org_size', 'occupation', 'sd_experience'], 
    values="Experimentation", 
    # color="Experimentation", 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

As we can see, the top roles involved in experimentation (regardless of the size of the employer organization) are

- Data Scientist
- Research Scientist

Sometimes ML Engineers and Software Engineers are also involved in this type of activity.

### Doing Research

In [None]:
fig = px.treemap(
    agg_data, 
    path=['org_size', 'occupation', 'sd_experience'], 
    values="Do research", 
    # color="Do research", 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

As we can see, the top roles involved in doing research (regardless of the size of the employer organization) are

- Data Scientist
- Research Scientist

## ML Spending

In [None]:
# aggregate basis cat data
agg_data = data.groupby(['Q20','Q21', 'Q25']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q20': 'size', 
        'Q21': 'data_scientists', 
        'Q25': 'ml_spendings', 
    })

In [None]:
fig = px.treemap(
    agg_data, 
    path=['size', 'data_scientists', 'ml_spendings'], 
    values='respondent_count', 
    color='respondent_count', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

Looking at the chart above, we can see that the most popular answer for ML spending is USD 0, regardless of the organization size. This may indicate inaccuracies in the answers to the respective survey question (or the respondents may have faked their answers for some reason).

Still, considering the meaningful responses only, it looks like

- large organizations tend to spend more on ML (100k+ USD in the last 5 years)
- smaller organizations spent 100-999 or 1000-9999 USD on ML in the last 5 years
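
To look at the meaningful responses only, the zero-spending answers can be filtered out before picking the most frequent bucket per organization size. The snippet below is a sketch under assumptions: the counts are hypothetical, and the bucket labels only mimic the survey wording.

```python
import pandas as pd

# Hypothetical counts; the spending labels mimic the survey buckets.
toy_agg = pd.DataFrame({
    "size": ["0-49 employees"] * 3 + ["10,000 or more employees"] * 3,
    "ml_spendings": ["$0 ($USD)", "$100-$999", "$100,000 or more ($USD)"] * 2,
    "respondent_count": [50, 20, 5, 40, 10, 30],
})

# Drop the zero-spending answers, which the text above suspects of being inaccurate.
nonzero = toy_agg[toy_agg["ml_spendings"] != "$0 ($USD)"]

# For each organisation size, keep the row with the highest respondent count.
idx = nonzero.groupby("size")["respondent_count"].idxmax()
modal_spending = nonzero.loc[idx].set_index("size")["ml_spendings"]
print(modal_spending)
```

Filtering first keeps the dominant zero bucket from masking the differences between organization sizes.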

# Chapter 4: Battle of Giants

In this chapter, we are going to draw insights on the popularity of cloud computing platforms and products among the survey participants who are professionals (as opposed to non-professionals - see the note below). In particular, it will cover

- Cloud Platforms usage
- Cloud Computing products usage
- Cloud ML products usage
- BigData platforms
- BI tools (mostly, the cloud-based ones)

The narrative in this chapter will often focus on the good news and opportunities for the top three cloud service providers in the market, namely

- Amazon Web Services (AWS)
- Google Cloud Platform (GCP)
- Microsoft Azure Cloud (MS Azure)

*Notes:* 

- We are specifically going to analyze the responses to Q26-Q31.
- The insights in this section refer to the Kaggle community only. It may not be a representative group for the entire community of Data Science professionals across the globe. However, some of the conclusions below can be interesting from the marketing standpoint.
- Survey organizers defined non-professionals as students, unemployed, and respondents that have never spent any money in the cloud.

## Usage of Cloud Service Providers: By Occupation/Programming Experience View

In [None]:

cloud_providers_lst = ['Q26_A_Part_1',
 'Q26_A_Part_2',
 'Q26_A_Part_3',
 'Q26_A_Part_4',
 'Q26_A_Part_5',
 'Q26_A_Part_6',
 'Q26_A_Part_7',
 'Q26_A_Part_8',
 'Q26_A_Part_9',
 'Q26_A_Part_10',
 'Q26_A_Part_11',
 'Q26_A_OTHER',] 

agg_data = data.groupby(["Q5", "Q6"])[cloud_providers_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q26_A_Part_1': 'AWS',
        'Q26_A_Part_2': 'Azure',
        'Q26_A_Part_3': 'GCP',
        'Q26_A_Part_4': 'IBM Cloud',
        'Q26_A_Part_5': 'Oracle Cloud',
        'Q26_A_Part_6': 'SAP Cloud',
        'Q26_A_Part_7': 'Salesforce Cloud',
        'Q26_A_Part_8': 'VMware Cloud',
        'Q26_A_Part_9': 'Alibaba Cloud',
        'Q26_A_Part_10': 'Tencent Cloud',
        'Q26_A_Part_11': 'None',
        'Q26_A_OTHER': "Other"
})

In [None]:
agg_melted_df = agg_data.melt(id_vars=['occupation', 'sd_experience',],
                     var_name='cloud_provider', value_name='response_count').copy()
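
The `melt` call above reshapes the table from wide format (one column per cloud provider) to long format (one row per group and provider), which is the layout `px.bar` needs for a stacked chart. A minimal sketch on a hypothetical two-provider table:

```python
import pandas as pd

# Hypothetical wide table: one column per cloud provider.
wide = pd.DataFrame({
    "occupation": ["Data Scientist", "Software Engineer"],
    "sd_experience": ["3-5 years", "5-10 years"],
    "AWS": [10, 7],
    "GCP": [8, 4],
})

# melt() turns each provider column into rows of (provider, count) pairs.
long_df = wide.melt(
    id_vars=["occupation", "sd_experience"],
    var_name="cloud_provider",
    value_name="response_count",
)
print(long_df)
```

Each original row appears once per provider column, so two rows and two providers give four rows in the long table.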

First of all, we are going to look at total usage of the cloud service providers by programming experience:

In [None]:
fig_df = agg_melted_df.groupby(['sd_experience', 'cloud_provider'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='cloud_provider', 
    y='response_count', 
    color='sd_experience', 
    title="Kaggle 2020 Survey: Usage of Cloud Service Providers by Programming Experience", 
    height=600)
fig.show()

We find that the top three cloud service providers among the Kaggle survey respondents are

- AWS
- GCP
- MS Azure

The rest of the cloud service providers currently seem to be losing the competitive edge to the top three listed above.

Also, it is notable that the 'None' category slightly exceeds the size of the MS Azure bar, which means the market may not yet be saturated with cloud service provider offerings.

We also see that professionals with 3-5 and 5-10 years of programming experience are the largest groups of users of the top 3 cloud service providers. More senior professionals (with 10+ years of experience) are less represented among the cloud service users on each of the top 3 platforms (depending on the marketing priorities, special actions to educate such seniors could help spread the cloud services further).

Now we are going to look at the cloud service provider users by their occupation:

In [None]:
fig_df = agg_melted_df.groupby(['occupation', 'cloud_provider'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='cloud_provider', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of Cloud Service Providers by Occupation", 
    height=600)
fig.show()

As we can see, the majority of the users of the top three cloud service providers fit into the roles below

- Data Scientists
- Software Engineers

The third position is held by

- ML Engineers (AWS, GCP)
- Data Analysts (MS Azure)

As noted above, the 'Other' occupation group is too big in itself, and it might be worth breaking it into more granular categories in future surveys. As we see, the 'Other' group makes up a significant fraction of the users of each cloud service platform (although it is never in the top three for any of the platforms).
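
The per-provider ranking of occupations can be extracted from the long-format table with a sort followed by a grouped `head`. The snippet below is a sketch on hypothetical counts shaped like `fig_df` above; the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format counts, shaped like fig_df above.
toy = pd.DataFrame({
    "cloud_provider": ["AWS"] * 4 + ["Azure"] * 4,
    "occupation": ["Data Scientist", "Software Engineer", "ML Engineer", "Data Analyst"] * 2,
    "response_count": [100, 80, 60, 40, 70, 50, 20, 30],
})

# Sort by count, then take the first three rows within each provider.
top3 = (
    toy.sort_values("response_count", ascending=False)
       .groupby("cloud_provider")
       .head(3)
       .sort_values(["cloud_provider", "response_count"], ascending=[True, False])
)
print(top3)
```

The final sort is only for readability: it lists each provider's top three occupations together, largest first.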

Now we are going to look at the top three cloud service providers in more detail.

### AWS Usage Patterns

In [None]:
fig = px.treemap(
    agg_data, 
    path=['occupation', 'sd_experience'], 
    values='AWS', 
    # color='AWS', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We find that

- Data Scientists with 3-5 and 5-10 years of programming experience are the top user groups for AWS within the survey respondents
- In Software Engineer, ML Engineer, and Data Analyst groups, professionals with 3-5 and 5-10 years of experience predominate
- In Research Scientist, Data Engineer, DBA, Statistician and Other groups, professionals with 10+ years of experience are the biggest fraction of the users
- In Product/Project Management group, professionals with 5+ years of experience are the biggest fraction of the users
- In Business Analyst group, users with 1-2 years of experience dominate

### GCP Usage Patterns

In [None]:
fig = px.treemap(
    agg_data, 
    path=['occupation', 'sd_experience'], 
    values='GCP', 
    # color='GCP', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We find that

- as for AWS, the top user groups for GCP are Data Scientists with 3-5 and 5-10 years of programming experience
- In Software Engineer, ML Engineer, and Data Analyst groups, professionals with 3-5 and 5-10 years of experience predominate (similar to AWS)
- In Research Scientist and DBA groups, professionals with 10+ years of experience are the biggest fraction of the users
- In Data Engineer and Product/Project Manager groups, professionals with 3-5 years of experience predominate (as opposed to AWS)
- In Statistician and Other groups, junior-level professionals with 1-2 years of programming experience take the lead

### MS Azure Usage Patterns

In [None]:
fig = px.treemap(
    agg_data, 
    path=['occupation', 'sd_experience'], 
    values='Azure', 
    # color='Azure', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We find that 

- Data Scientists with 5-10 and 3-5 years of programming experience are the top user clusters for MS Azure
- Software Engineers and junior-level Business Analysts are within the next tier of the users

## Usage of Cloud Service Providers: By Org Size/Data Science Team Size View

We are going to see how the top three cloud service providers (AWS, GCP, MS Azure) are used in the employer organizations where Kaggle survey responders work.

In [None]:
agg_data = data.groupby(["Q20", "Q21"])[cloud_providers_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q20': 'org_size',
        'Q21': 'ds_headcount',
        'Q26_A_Part_1': 'AWS',
        'Q26_A_Part_2': 'Azure',
        'Q26_A_Part_3': 'GCP',
        'Q26_A_Part_4': 'IBM Cloud',
        'Q26_A_Part_5': 'Oracle Cloud',
        'Q26_A_Part_6': 'SAP Cloud',
        'Q26_A_Part_7': 'Salesforce Cloud',
        'Q26_A_Part_8': 'VMware Cloud',
        'Q26_A_Part_9': 'Alibaba Cloud',
        'Q26_A_Part_10': 'Tencent Cloud',
        'Q26_A_Part_11': 'None',
        'Q26_A_OTHER': "Other"
})

### AWS Usage Patterns

In [None]:
fig = px.treemap(
    agg_data, 
    path=['org_size', 'ds_headcount'], 
    values='AWS', 
    color='AWS', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We find that AWS is mostly used by kagglers working in the organization types below

- Organizations with 0-49 employees, having 1-2 workers dedicated to Data Science workloads
- Organizations with 10000+ employees, having 20+ workers dedicated to Data Science workloads

### GCP Usage Patterns

In [None]:
fig = px.treemap(
    agg_data, 
    path=['org_size', 'ds_headcount'], 
    values='GCP', 
    color='GCP', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We find that GCP is mostly used by kagglers working in the organization types below

- Organizations with 0-49 employees, having 1-2 workers dedicated to Data Science workloads
- Organizations with 10000+ employees, having 20+ workers dedicated to Data Science workloads
- Organizations with 0-49 employees, having 3-4 workers dedicated to Data Science workloads

So we can conclude that AWS and GCP compete closely for the same types of organizations and users.

### MS Azure Usage Patterns

In [None]:
fig = px.treemap(
    agg_data, 
    path=['org_size', 'ds_headcount'], 
    values='Azure', 
    color='Azure', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We find that MS Azure is mostly used by kagglers working in the organization types below

- Organizations with 0-49 employees, having 1-2 workers dedicated to Data Science workloads
- Organizations with 10000+ employees, having 20+ workers dedicated to Data Science workloads

So we can conclude that MS Azure competes closely with AWS and GCP for the same types of organizations and users.
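
The overlap between the providers' strongest organization clusters can also be checked without reading three treemaps side by side. The snippet below is a sketch on hypothetical counts shaped like `agg_data` in this section; the values are invented and the two-cluster cut-off is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical counts, shaped like agg_data in this section.
toy_agg = pd.DataFrame({
    "org_size": ["0-49", "0-49", "10000+", "10000+"],
    "ds_headcount": ["1-2", "3-4", "1-2", "20+"],
    "AWS": [50, 30, 10, 45],
    "GCP": [40, 35, 8, 38],
    "Azure": [30, 12, 9, 28],
})

providers = ["AWS", "GCP", "Azure"]
clusters = toy_agg.set_index(["org_size", "ds_headcount"])[providers]

# For each provider, the two (org size, team size) clusters with the most users.
top_clusters = {p: clusters[p].nlargest(2).index.tolist() for p in providers}

# Clusters that appear in every provider's top list.
shared = set.intersection(*(set(v) for v in top_clusters.values()))
print(top_clusters)
print(shared)
```

If the shared set is as large as each provider's own top list, the providers are drawing their users from the same kinds of organizations.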

## Usage of Cloud Computing Products

In [None]:
cloud_computing_lst = [
    'Q27_A_Part_1',
    'Q27_A_Part_2',
    'Q27_A_Part_3',
    'Q27_A_Part_4',
    'Q27_A_Part_5',
    'Q27_A_Part_6',
    'Q27_A_Part_7',
    'Q27_A_Part_8',
    'Q27_A_Part_9',
    'Q27_A_Part_10',
    'Q27_A_Part_11',
    'Q27_A_OTHER'
]

agg_data = data.groupby(["Q5", "Q6"])[cloud_computing_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q27_A_Part_1': 'Amazon EC2',
        'Q27_A_Part_2': 'AWS Lambda',
        'Q27_A_Part_3': 'Amazon Elastic Container Service',
        'Q27_A_Part_4': 'Azure Cloud Services',
        'Q27_A_Part_5': 'MS Azure Container Instances',
        'Q27_A_Part_6': 'Azure Functions',
        'Q27_A_Part_7': 'Google Cloud Compute Engine',
        'Q27_A_Part_8': 'Google Cloud Functions',
        'Q27_A_Part_9': 'Google Cloud Run',
        'Q27_A_Part_10': 'Google Cloud App Engine',
        'Q27_A_Part_11': 'None',
        'Q27_A_OTHER': "Other"
})

In [None]:
agg_melted_df = agg_data.melt(id_vars=['occupation', 'sd_experience',],
                     var_name='cloud_product', value_name='response_count').copy()
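
The count, rename, and melt steps above are repeated for most questions in this notebook. As a hedged sketch, they could be wrapped into one helper. The function name `count_multiselect` and the toy frame are hypothetical; the sketch assumes that each option column holds the option text when it was selected and NaN otherwise, and it keeps the grouping columns under their raw names (`Q5`, `Q6`).

```python
import pandas as pd


def count_multiselect(df, group_cols, labels, var_name):
    """Count non-null answers per group and return them in long format.

    labels maps raw survey column names to readable option names.
    """
    counts = df.groupby(group_cols)[list(labels)].count().reset_index()
    counts = counts.rename(columns=labels)
    return counts.melt(id_vars=group_cols, var_name=var_name,
                       value_name="response_count")


# Toy frame (made up): NaN/None means the option was not selected
toy = pd.DataFrame({
    "Q5": ["Data Scientist", "Data Scientist", "Software Engineer"],
    "Q6": ["3-5 years", "3-5 years", "5-10 years"],
    "Q27_A_Part_1": ["Amazon EC2", None, "Amazon EC2"],
    "Q27_A_Part_2": [None, None, "AWS Lambda"],
})

long_df = count_multiselect(
    toy,
    group_cols=["Q5", "Q6"],
    labels={"Q27_A_Part_1": "Amazon EC2", "Q27_A_Part_2": "AWS Lambda"},
    var_name="cloud_product",
)
print(long_df)
```

Because `count()` only counts non-null cells, each `response_count` is the number of respondents in the group who ticked the given option.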

### Usage of Cloud Computing Products by Occupation

In [None]:
fig_df = agg_melted_df.groupby(['occupation', 'cloud_product'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='cloud_product', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of Cloud Computing Products by Occupation", 
    height=600)
fig.show()

We find that 

- in the segment of cloud computing engines, Amazon EC2 is more popular than its rivals from Google (Google Cloud Compute Engine) and MS Azure (Azure Cloud Services)
- in the segment of cloud functions, AWS Lambda is more popular than its rivals from Google (Google Cloud Functions) and MS Azure (Azure Functions)
- in the segment of cloud container runners, Amazon Elastic Container Service is more popular than its rivals from Google (Google Cloud Run) and MS Azure (MS Azure Container Instances)
- Google holds second place in the cloud computing engine and cloud function segments, and third place in the cloud container runner segment
- there is a huge pool of 'None' responses, which most likely indicates that the overall market for cloud computing applications is not saturated yet

In terms of user roles, most users of every cloud computing product above hold the roles below (top to bottom)
- Data Scientists
- Software Engineers
- ML Engineers
- Data Analysts

### Usage of Cloud Computing Products by Programming Experience

In [None]:
fig_df = agg_melted_df.groupby(['sd_experience', 'cloud_product'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='cloud_product', 
    y='response_count', 
    color='sd_experience', 
    title="Kaggle 2020 Survey: Usage of Cloud Computing Products by Programming Experience", 
    height=600)
fig.show()

In addition to the insights above, we see that the largest numbers of cloud computing product users fall into the following groups in terms of their programming experience

- 5-10 years of experience
- 3-5 years of experience
- 10-20 years of experience

Juniors and super-seniors (20+ years of programming experience) seem to have less exposure to the respective knowledge/skills.
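
Raw counts partly reflect how many respondents each experience group contains, so a per-group usage rate can be a useful cross-check. The snippet below is a minimal sketch on made-up data; it assumes option columns are NaN when the option was not selected, and the readable column names are used directly for brevity.

```python
import pandas as pd

# Toy responses (made up): None means the option was not selected
toy = pd.DataFrame({
    "Q6": ["< 1 years", "< 1 years", "< 1 years", "< 1 years",
           "5-10 years", "5-10 years"],
    "Amazon EC2": ["Amazon EC2", None, None, None, "Amazon EC2", None],
    "AWS Lambda": [None, None, None, "AWS Lambda", "AWS Lambda", "AWS Lambda"],
})

product_cols = ["Amazon EC2", "AWS Lambda"]

# Flag selections as 0/1, then average the flags within each experience group
selected = toy[product_cols].notna().astype(int)
usage_rate = selected.groupby(toy["Q6"]).mean()
print(usage_rate)
```

In this toy frame both groups contain one Amazon EC2 user, yet the usage rate is 0.25 in the larger group and 0.5 in the smaller one, which is the kind of difference that raw counts hide.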

## Usage of Cloud ML Products

In [None]:
cloud_ml_lst = [
    'Q28_A_Part_1',
    'Q28_A_Part_2',
    'Q28_A_Part_3',
    'Q28_A_Part_4',
    'Q28_A_Part_5',
    'Q28_A_Part_6',
    'Q28_A_Part_7',
    'Q28_A_Part_8',
    'Q28_A_Part_9',
    'Q28_A_Part_10',
    'Q28_A_OTHER'
]

agg_data = data.groupby(["Q5", "Q6"])[cloud_ml_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q28_A_Part_1': 'Amazon SageMaker',
        'Q28_A_Part_2': 'Amazon Forecast',
        'Q28_A_Part_3': 'Amazon Rekognition',
        'Q28_A_Part_4': 'Azure Machine Learning Studio',
        'Q28_A_Part_5': 'Azure Cognitive Services',
        'Q28_A_Part_6': 'Google Cloud AI Platform / Google Cloud ML Engine',
        'Q28_A_Part_7': 'Google Cloud Video AI',
        'Q28_A_Part_8': 'Google Cloud Natural Language',
        'Q28_A_Part_9': 'Google Cloud Vision AI',
        'Q28_A_Part_10': 'None',
        'Q28_A_OTHER': "Other"
})

In [None]:
agg_melted_df = agg_data.melt(id_vars=['occupation', 'sd_experience',],
                     var_name='cloud_product', value_name='response_count').copy()

## Usage of Cloud ML Products by Occupation

In [None]:
fig_df = agg_melted_df.groupby(['occupation', 'cloud_product'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='cloud_product', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of Cloud ML Products by Occupation", 
    height=600)
fig.show()

We find that 

- Google Cloud AI Platform / Google Cloud ML Engine leads in cloud ML product usage
- the second and third places are taken by Amazon SageMaker and Azure Machine Learning Studio, respectively
- Data Scientists are the top users of cloud ML products (for every product investigated)
- There is a huge chunk of respondents who indicated they do not use cloud ML products at all - this indicates the market is under-saturated, and there is good growth potential, subject to resolving the marketing and end-user barriers on the way

## Usage of Cloud ML Products by Programming Experience

In [None]:
fig_df = agg_melted_df.groupby(['sd_experience', 'cloud_product'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='cloud_product', 
    y='response_count', 
    color='sd_experience', 
    title="Kaggle 2020 Survey: Usage of Cloud ML Products by Programming Experience", 
    height=600)
fig.show()

In addition to the insights above, we can see that the cloud ML products are mostly used by the respondents who have programming experience of

- 3-5 years
- 5-10 years

## Usage of Cloud ML Products by Organization Size and DS Capacity

In [None]:
agg_data = data.groupby(["Q20", "Q21"])[cloud_ml_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q20': 'org_size',
        'Q21': 'ds_headcount',
        'Q28_A_Part_1': 'Amazon SageMaker',
        'Q28_A_Part_2': 'Amazon Forecast',
        'Q28_A_Part_3': 'Amazon Rekognition',
        'Q28_A_Part_4': 'Azure Machine Learning Studio',
        'Q28_A_Part_5': 'Azure Cognitive Services',
        'Q28_A_Part_6': 'Google Cloud AI Platform / Google Cloud ML Engine',
        'Q28_A_Part_7': 'Google Cloud Video AI',
        'Q28_A_Part_8': 'Google Cloud Natural Language',
        'Q28_A_Part_9': 'Google Cloud Vision AI',
        'Q28_A_Part_10': 'None',
        'Q28_A_OTHER': "Other"
})

In [None]:
agg_melted_df = agg_data.melt(id_vars=['org_size', 'ds_headcount',],
                     var_name='cloud_product', value_name='response_count').copy()

In [None]:
fig = px.treemap(
    agg_melted_df, 
    path=['org_size', 'ds_headcount', 'cloud_product'], 
    values='response_count', 
    #color='response_count', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We find that the majority of organizations in every size category do not use any Cloud ML Products at the moment.

For the small fraction of those who do use them, there are some interesting insights

- in small organizations (0-49 employees), Google Cloud AI Platform / Google Cloud ML Engine dominates
- in mid-sized organizations (50-249 employees), Google Cloud AI Platform / Google Cloud ML Engine and Amazon SageMaker compete closely
- for larger companies (250+ employees), the size of the Data Science team is often correlated with the preferred Cloud ML Product (smaller teams stick to Google Cloud AI Platform / Google Cloud ML Engine more, and Data Science teams with 20+ headcount are more inclined to use Amazon SageMaker)
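
The leading product per segment can also be read off a table instead of the treemap. The following is a hedged sketch on a toy long-format frame shaped like `agg_melted_df`; the numbers are made up, and dropping 'None' and 'Other' before ranking is an assumption about what "dominates" should mean here.

```python
import pandas as pd

# Toy long-format counts shaped like agg_melted_df (made-up numbers)
toy_long = pd.DataFrame({
    "org_size": ["0-49 employees"] * 3 + ["10,000 or more employees"] * 3,
    "ds_headcount": ["1-2"] * 3 + ["20+"] * 3,
    "cloud_product": ["Amazon SageMaker",
                      "Google Cloud AI Platform / Google Cloud ML Engine",
                      "None"] * 2,
    "response_count": [20, 35, 300, 90, 60, 250],
})

# Leave out 'None' and 'Other' so that only named products are compared
named = toy_long[~toy_long["cloud_product"].isin(["None", "Other"])]

# For each segment, keep the row with the highest response count
leader_idx = named.groupby(["org_size", "ds_headcount"])["response_count"].idxmax()
leaders = named.loc[leader_idx]
print(leaders[["org_size", "ds_headcount", "cloud_product", "response_count"]])
```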

## Usage of Big Data Products By Occupation

In [None]:
big_data_lst = [
        'Q29_A_Part_1',
        'Q29_A_Part_2',
        'Q29_A_Part_3',
        'Q29_A_Part_4',
        'Q29_A_Part_5',
        'Q29_A_Part_6',
        'Q29_A_Part_7',
        'Q29_A_Part_8',
        'Q29_A_Part_9',
        'Q29_A_Part_10',
        'Q29_A_Part_11',
        'Q29_A_Part_12',
        'Q29_A_Part_13',
        'Q29_A_Part_14',
        'Q29_A_Part_15',
        'Q29_A_Part_16',
        'Q29_A_Part_17',
        'Q29_A_OTHER'
]

agg_data = data.groupby(["Q5", "Q6"])[big_data_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q29_A_Part_1': 'MySQL',
        'Q29_A_Part_2': 'PostgreSQL',
        'Q29_A_Part_3': 'SQLite',
        'Q29_A_Part_4': 'Oracle Database',
        'Q29_A_Part_5': 'MongoDB',
        'Q29_A_Part_6': 'Snowflake',
        'Q29_A_Part_7': 'IBM DB2',
        'Q29_A_Part_8': 'Microsoft SQL Server',
        'Q29_A_Part_9': 'Microsoft Access',
        'Q29_A_Part_10': 'Microsoft Azure Data Lake Storage',
        'Q29_A_Part_11': 'Amazon Redshift',
        'Q29_A_Part_12': 'Amazon Athena',
        'Q29_A_Part_13': 'Amazon DynamoDB',
        'Q29_A_Part_14': 'Google Cloud BigQuery',
        'Q29_A_Part_15': 'Google Cloud SQL',
        'Q29_A_Part_16': 'Google Cloud Firestore',
        'Q29_A_Part_17': 'None',
        'Q29_A_OTHER': "Other"
})

agg_melted_df = agg_data.melt(id_vars=['occupation', 'sd_experience',],
                     var_name='big_data_product', value_name='response_count').copy()

In [None]:
fig_df = agg_melted_df.groupby(['occupation', 'big_data_product'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='big_data_product', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of Big Data Products by Occupation", 
    height=600)
fig.show()

We find that

- The overall top 3 consists of three relational DBMS platforms (MySQL, PostgreSQL, MS SQL Server)
- MongoDB, a non-relational database platform, takes position 4 in the list
- Other relational DBMS platforms in the list (Oracle, IBM DB2, SQLite) are behind MongoDB
- In the segment of truly cloud-based Big Data products, Google BigQuery outperforms its Amazon and MS Azure competitors (Amazon Redshift, Amazon Athena, Amazon DynamoDB, and Microsoft Azure Data Lake Storage)
- Google Cloud SQL instances are still less popular than 'native' relational database instances for MySQL and PostgreSQL
- MS Access is still in use in the industry
- Data Scientists are the top users of each product in this list
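
Rankings like the ones above are easier to verify when the products are sorted by their total response count. The snippet below is a small sketch on a made-up frame shaped like `fig_df`; the variable names `totals` and `product_order` are illustrative.

```python
import pandas as pd

# Toy per-occupation counts shaped like fig_df (made-up numbers)
toy_fig_df = pd.DataFrame({
    "occupation": ["Data Scientist", "Software Engineer"] * 3,
    "big_data_product": ["MySQL", "MySQL", "MongoDB", "MongoDB",
                         "PostgreSQL", "PostgreSQL"],
    "response_count": [50, 40, 20, 30, 45, 25],
})

# Total responses per product across occupations, largest first
totals = (
    toy_fig_df.groupby("big_data_product")["response_count"]
    .sum()
    .sort_values(ascending=False)
)
product_order = totals.index.tolist()
print(totals)
```

The resulting list can then be passed to `px.bar` through its `category_orders` argument, for example `category_orders={'big_data_product': product_order}`, so that the bars appear in ranked order.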

## Big Data Product Usage Patterns by User Occupation and Programming Experience

In [None]:
fig = px.treemap(
    agg_melted_df, 
    path=['occupation', 'sd_experience', 'big_data_product'], 
    values='response_count', 
    #color='response_count', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We find that

- MySQL and PostgreSQL are the most popular database management platforms within each occupation
- MongoDB is quite popular with Software Engineers (although less popular than MySQL and PostgreSQL)

## Big Data Product Usage Patterns by Organization Size and Data Science Capacity

In [None]:
agg_data = data.groupby(["Q20", "Q21"])[big_data_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q20': 'org_size',
        'Q21': 'ds_capacity',
        'Q29_A_Part_1': 'MySQL',
        'Q29_A_Part_2': 'PostgreSQL',
        'Q29_A_Part_3': 'SQLite',
        'Q29_A_Part_4': 'Oracle Database',
        'Q29_A_Part_5': 'MongoDB',
        'Q29_A_Part_6': 'Snowflake',
        'Q29_A_Part_7': 'IBM DB2',
        'Q29_A_Part_8': 'Microsoft SQL Server',
        'Q29_A_Part_9': 'Microsoft Access',
        'Q29_A_Part_10': 'Microsoft Azure Data Lake Storage',
        'Q29_A_Part_11': 'Amazon Redshift',
        'Q29_A_Part_12': 'Amazon Athena',
        'Q29_A_Part_13': 'Amazon DynamoDB',
        'Q29_A_Part_14': 'Google Cloud BigQuery',
        'Q29_A_Part_15': 'Google Cloud SQL',
        'Q29_A_Part_16': 'Google Cloud Firestore',
        'Q29_A_Part_17': 'None',
        'Q29_A_OTHER': "Other"
})

agg_melted_df = agg_data.melt(id_vars=['org_size', 'ds_capacity',],
                     var_name='big_data_product', value_name='response_count').copy()

In [None]:
fig = px.treemap(
    agg_melted_df, 
    path=['org_size', 'ds_capacity', 'big_data_product'], 
    values='response_count', 
    #color='response_count', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We find that 

- Almost all organizations, apart from the extra-large ones, rely mostly on MySQL, PostgreSQL, and MongoDB for their data management needs
- Extra-large organizations (with 10000+ employees) prefer to work with MySQL, MS SQL Server, Oracle, and PostgreSQL

## Usage of BI Tools

In [None]:
bi_lst = [
        'Q31_A_Part_1',
        'Q31_A_Part_2',
        'Q31_A_Part_3',
        'Q31_A_Part_4',
        'Q31_A_Part_5',
        'Q31_A_Part_6',
        'Q31_A_Part_7',
        'Q31_A_Part_8',
        'Q31_A_Part_9',
        'Q31_A_Part_10',
        'Q31_A_Part_11',
        'Q31_A_Part_12',
        'Q31_A_Part_13',
        'Q31_A_Part_14',
        'Q31_A_OTHER'
]

agg_data = data.groupby(["Q5", "Q6"])[bi_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q31_A_Part_1': 'Amazon QuickSight',
        'Q31_A_Part_2': 'Microsoft Power BI',
        'Q31_A_Part_3': 'Google Data Studio',
        'Q31_A_Part_4': 'Looker',
        'Q31_A_Part_5': 'Tableau',
        'Q31_A_Part_6': 'Salesforce',
        'Q31_A_Part_7': 'Einstein Analytics',
        'Q31_A_Part_8': 'Qlik',
        'Q31_A_Part_9': 'Domo',
        'Q31_A_Part_10': 'TIBCO Spotfire',
        'Q31_A_Part_11': 'Alteryx',
        'Q31_A_Part_12': 'Sisense',
        'Q31_A_Part_13': 'SAP Analytics Cloud',        
        'Q31_A_Part_14': 'None',
        'Q31_A_OTHER': "Other"
})

agg_melted_df = agg_data.melt(id_vars=['occupation', 'sd_experience',],
                     var_name='bi_product', value_name='response_count').copy()

In [None]:
fig_df = agg_melted_df.groupby(['occupation', 'bi_product'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='bi_product', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of BI Tools by Occupation", 
    height=600)
fig.show()

We find that

- Tableau and MS Power BI outperform their rivals significantly
- Google Data Studio is a challenger to the leading BI products above, occupying third place in the list
- Data Scientists, Data Analysts, Research Scientists, and ML Engineers are the most frequent users of BI tools
- A large fraction of the survey respondents indicated they do not use BI tools at all

# Chapter 5: Data Science Automation Tools

In this chapter, we are going to analyze the preferences of kagglers regarding

- using automated machine learning tools
- using tools to help manage machine learning experiments

*Note:* We are specifically going to analyze responses to Q33-35

## Using Auto ML Tools: By-Purpose View

In [None]:
auto_ml_purpose_lst = [
        'Q33_A_Part_1',
        'Q33_A_Part_2',
        'Q33_A_Part_3',
        'Q33_A_Part_4',
        'Q33_A_Part_5',
        'Q33_A_Part_6',
        'Q33_A_Part_7',
        'Q33_A_OTHER'
]

agg_data = data.groupby(["Q5", "Q6"])[auto_ml_purpose_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q33_A_Part_1': 'Automated data augmentation (e.g. imgaug, albumentations)',
        'Q33_A_Part_2': 'Automated feature engineering/selection (e.g. tpot, boruta_py)',
        'Q33_A_Part_3': 'Automated model selection (e.g. auto-sklearn, xcessiv)',
        'Q33_A_Part_4': 'Automated model architecture searches (e.g. darts, enas)',
        'Q33_A_Part_5': 'Automated hyperparameter tuning (e.g. hyperopt, ray.tune, Vizier)',
        'Q33_A_Part_6': 'Automation of full ML pipelines (e.g. Google AutoML, H20 Driverless AI)',
        'Q33_A_Part_7': 'None',
        'Q33_A_OTHER': "Other"
})

agg_melted_df = agg_data.melt(id_vars=['occupation', 'sd_experience',],
                     var_name='auto_ml_purpose', value_name='response_count').copy()

## Using Auto ML Tools By Purpose and User Occupation

In [None]:
fig_df = agg_melted_df.groupby(['occupation', 'auto_ml_purpose'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='auto_ml_purpose', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of Auto ML Tools By Purpose and User Occupation", 
    height=700)
fig.show()

We find that

- the vast majority of the survey participants do not use any Auto ML tools in their daily routines
- automated model selection tools are the most popular type of Auto ML tools used by the kagglers at the moment
- automation tools for selecting a neural network architecture are neither widely used nor well known
- Data Scientists are the primary users of Auto ML tools (if in use)

## Using Auto ML Tools By Purpose and Size of Employer Organization

In [None]:
agg_data = data.groupby(["Q20", "Q21"])[auto_ml_purpose_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q20': 'org_size',
        'Q21': 'ds_capacity',
        'Q33_A_Part_1': 'Automated data augmentation (e.g. imgaug, albumentations)',
        'Q33_A_Part_2': 'Automated feature engineering/selection (e.g. tpot, boruta_py)',
        'Q33_A_Part_3': 'Automated model selection (e.g. auto-sklearn, xcessiv)',
        'Q33_A_Part_4': 'Automated model architecture searches (e.g. darts, enas)',
        'Q33_A_Part_5': 'Automated hyperparameter tuning (e.g. hyperopt, ray.tune, Vizier)',
        'Q33_A_Part_6': 'Automation of full ML pipelines (e.g. Google AutoML, H20 Driverless AI)',
        'Q33_A_Part_7': 'None',
        'Q33_A_OTHER': "Other"
})

agg_melted_df = agg_data.melt(id_vars=['org_size', 'ds_capacity',],
                     var_name='auto_ml_purpose', value_name='response_count').copy()

In [None]:
fig_df = agg_melted_df.groupby(['org_size', 'auto_ml_purpose'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='auto_ml_purpose', 
    y='response_count', 
    color='org_size', 
    title="Kaggle 2020 Survey: Usage of Auto ML Tools By Purpose and Size of Employer Organization", 
    height=600)
fig.show()

We find that

- Both large and small organizations are equally represented in the cluster of organizations where auto ML tools are not used at the moment
- for each type of auto ML tool, the highest usage is observed in small organizations (which also reflects the fact that the majority of the survey respondents work for such organizations)
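
Since most respondents work for small organizations, absolute counts favor that group by construction. One way to account for this, sketched below on made-up numbers, is to express each answer as a share of all responses within its organization-size group. Note that the question is multi-select, so these are shares of responses rather than of respondents.

```python
import pandas as pd

# Toy counts shaped like fig_df in the cell above (made-up numbers)
toy_fig_df = pd.DataFrame({
    "org_size": ["0-49 employees", "0-49 employees",
                 "10,000 or more employees", "10,000 or more employees"],
    "auto_ml_purpose": ["Automated model selection", "None",
                        "Automated model selection", "None"],
    "response_count": [60, 240, 30, 70],
})

# Share of each answer within its organization-size group
group_total = toy_fig_df.groupby("org_size")["response_count"].transform("sum")
toy_fig_df["share"] = toy_fig_df["response_count"] / group_total
print(toy_fig_df)
```

In this toy example the small organizations have the higher count for automated model selection (60 vs. 30) but the lower share (0.2 vs. 0.3), which shows why counts and shares can tell different stories.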

## Using Auto ML Tools By Purpose and Size of Data Science Team in Organizations

In [None]:
fig_df = agg_melted_df.groupby(['ds_capacity', 'auto_ml_purpose'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='auto_ml_purpose', 
    y='response_count', 
    color='ds_capacity', 
    title="Kaggle 2020 Survey: Usage of Auto ML Tools By Purpose and Size of Data Science Team", 
    height=600)
fig.show()

We find that

- Organizations with both large and small data science teams are represented in the cluster of organizations where auto ML tools are not used at the moment
- if auto ML tools are used, the highest usage is observed in organizations where the Data Science team has a capacity of 1-2, 3-4, or 20+ workers

## Utilizing Auto ML Tools By Product and User Occupation

In [None]:
auto_ml_product_lst = [
    'Q34_A_Part_1',
    'Q34_A_Part_2',
    'Q34_A_Part_3',
    'Q34_A_Part_4',
    'Q34_A_Part_5',
    'Q34_A_Part_6',
    'Q34_A_Part_7',
    'Q34_A_Part_8',
    'Q34_A_Part_9',
    'Q34_A_Part_10',
    'Q34_A_Part_11',
    'Q34_A_OTHER'
]

agg_data = data.groupby(["Q5", "Q6"])[auto_ml_product_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q34_A_Part_1': 'Google Cloud AutoML',
        'Q34_A_Part_2': 'H20 Driverless AI',
        'Q34_A_Part_3': 'Databricks AutoML',
        'Q34_A_Part_4': 'DataRobot AutoML',
        'Q34_A_Part_5': 'Tpot',
        'Q34_A_Part_6': 'Auto-Keras',
        'Q34_A_Part_7': 'Auto-Sklearn',
        'Q34_A_Part_8': 'Auto_ml',
        'Q34_A_Part_9': 'Xcessiv',
        'Q34_A_Part_10': 'MLbox',
        'Q34_A_Part_11': 'None',
        'Q34_A_OTHER': "Other"
})

agg_melted_df = agg_data.melt(id_vars=['occupation', 'sd_experience',],
                     var_name='auto_ml_product', value_name='response_count').copy()

In [None]:
fig_df = agg_melted_df.groupby(['occupation', 'auto_ml_product'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='auto_ml_product', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of Auto ML Tools By Product and User Occupation", 
    height=600)
fig.show()

We find that

- a lot of respondents indicated they do not use any auto ML product at the moment
- the top three auto ML products (in terms of the number of respondents using them) are Auto-Sklearn, Auto-Keras, and Google Cloud AutoML
- Data Scientists are the primary users of auto ML products
- the next tier of roles played by the users of auto ML products contains Software Engineers, Data Analysts, and Research Scientists

## Utilizing Auto ML Tools By Product and User Programming Experience

In [None]:
fig_df = agg_melted_df.groupby(['sd_experience', 'auto_ml_product'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='auto_ml_product', 
    y='response_count', 
    color='sd_experience', 
    title="Kaggle 2020 Survey: Utilizing Auto ML Tools By Product and User Programming Experience", 
    height=600)
fig.show()

In addition to the insights above, we find that

- for every auto ML product surveyed, professionals with 1-2, 3-5, 5-10, and 10-20 years of programming experience are equally represented among the users
- professionals with 20+ years of programming experience are less attached to using auto ML products

## Utilizing Auto ML Tools By Product, Size of Organization, and Data Science Team Capacity

In [None]:
agg_data = data.groupby(["Q20", "Q21"])[auto_ml_product_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q20': 'org_size',
        'Q21': 'ds_capacity',
        'Q34_A_Part_1': 'Google Cloud AutoML',
        'Q34_A_Part_2': 'H20 Driverless AI',
        'Q34_A_Part_3': 'Databricks AutoML',
        'Q34_A_Part_4': 'DataRobot AutoML',
        'Q34_A_Part_5': 'Tpot',
        'Q34_A_Part_6': 'Auto-Keras',
        'Q34_A_Part_7': 'Auto-Sklearn',
        'Q34_A_Part_8': 'Auto_ml',
        'Q34_A_Part_9': 'Xcessiv',
        'Q34_A_Part_10': 'MLbox',
        'Q34_A_Part_11': 'None',
        'Q34_A_OTHER': "Other"
})

agg_melted_df = agg_data.melt(id_vars=['org_size', 'ds_capacity',],
                     var_name='auto_ml_product', value_name='response_count').copy()

In [None]:
fig = px.treemap(
    agg_melted_df, 
    path=['org_size', 'ds_capacity', 'auto_ml_product'], 
    values='response_count', 
    #color='response_count', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We find that

- small organizations (0-49 employees) are more inclined to experiment with Auto-Sklearn and Auto-Keras
- for larger organizations, the biggest cluster is always the subset of organizations that do not use any auto ML tools, regardless of the size of their Data Science team

## Utilization of Tools to Manage DS Experiments: By-Product and User Occupation View

In this section, we are going to investigate utilization patterns for tools to help manage machine learning experiments by user occupation.

In [None]:
ds_management_product_lst = [
    'Q35_A_Part_1',
    'Q35_A_Part_2',
    'Q35_A_Part_3',
    'Q35_A_Part_4',
    'Q35_A_Part_5',
    'Q35_A_Part_6',
    'Q35_A_Part_7',
    'Q35_A_Part_8',
    'Q35_A_Part_9',
    'Q35_A_Part_10',
    'Q35_A_OTHER'
]

agg_data = data.groupby(["Q5", "Q6"])[ds_management_product_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q35_A_Part_1': 'Neptune.ai',
        'Q35_A_Part_2': 'Weights & Biases',
        'Q35_A_Part_3': 'Comet.ml',
        'Q35_A_Part_4': 'Sacred + Omniboard',
        'Q35_A_Part_5': 'TensorBoard',
        'Q35_A_Part_6': 'Guild.ai',
        'Q35_A_Part_7': 'Polyaxon',
        'Q35_A_Part_8': 'Trains',
        'Q35_A_Part_9': 'Domino Model Monitor',
        'Q35_A_Part_10': 'None',
        'Q35_A_OTHER': "Other"
})

agg_melted_df = agg_data.melt(id_vars=['occupation', 'sd_experience',],
                     var_name='product', value_name='response_count').copy()

In [None]:
fig_df = agg_melted_df.groupby(['occupation', 'product'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='product', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of DS Helper Tools By Product and User Occupation", 
    height=600)
fig.show()

We find that

- the majority of the survey respondents do not use any tools to help manage data science experiments (so the market for such tools is far from saturated)
- TensorBoard is the leading tool in use at the moment
- Data Scientists, Software Engineers, and Data Analysts are the top occupations of the users of such tools

## Utilization of Tools to Manage DS Experiments: By-Product and User Programming Experience View

In [None]:
fig_df = agg_melted_df.groupby(['sd_experience', 'product'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='product', 
    y='response_count', 
    color='sd_experience', 
    title="Kaggle 2020 Survey: Usage of DS Helper Tools By Product and User Programming Experience", 
    height=600)
fig.show()

In addition to the insights above, we find that

- users with one or more years of programming experience are equally represented in the pool of users of each helper product investigated

# Chapter 6: Knowledge and Information Sharing Channels

Knowledge is power, as Sir Francis Bacon once said. In the modern post-industrial age, information has additionally become a source of power and the 'new oil'.

Therefore it is especially interesting to draw insights on how kagglers work with information and how they share and obtain professional knowledge.

In this chapter, we are going to analyze the preferences of the survey participants regarding

- platforms and tools to publicly share or deploy their data analysis or machine learning applications
- platforms to take online data science courses
- primary tools they use to analyze data
- favorite media sources that report on data science topics

*Note:* We are specifically going to analyze responses to Q36-39

## Public Sharing Platform Preferences by Occupation

In [None]:
public_sharing_lst = [
    'Q36_Part_1',
    'Q36_Part_2',
    'Q36_Part_3',
    'Q36_Part_4',
    'Q36_Part_5',
    'Q36_Part_6',
    'Q36_Part_7',
    'Q36_Part_8',
    'Q36_Part_9',
    'Q36_OTHER'
]

agg_data = data.groupby(["Q5", "Q6"])[public_sharing_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q36_Part_1': 'Plotly Dash',
        'Q36_Part_2': 'Streamlit',
        'Q36_Part_3': 'NBViewer',
        'Q36_Part_4': 'GitHub',
        'Q36_Part_5': 'Personal blog',
        'Q36_Part_6': 'Kaggle',
        'Q36_Part_7': 'Colab',
        'Q36_Part_8': 'Shiny',      
        'Q36_Part_9': 'None',
        'Q36_OTHER': "Other"
})

agg_melted_df = agg_data.melt(id_vars=['occupation', 'sd_experience',],
                     var_name='sharing_platform', value_name='response_count').copy()

In [None]:
fig_df = agg_melted_df.groupby(['occupation', 'sharing_platform'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='sharing_platform', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of Public Sharing Platforms by User Occupation", 
    height=600)
fig.show()

We find that

- GitHub is the primary choice for Kagglers to share their work publicly
- Kaggle itself as well as Colab take the next places in the top 3 list of platforms to publicly share the work assets
- Good work sharing instruments like Streamlit and Plotly Dash are under-utilized by the community
- Quite a big fraction of the survey respondents indicated they do not share any work publicly
- The percentage of Business Analysts and people with 'Other' occupation who do not share their work results publicly is higher than in the rest of the occupations

## Public Sharing Platform Preferences by Occupation and Programming Experience

In [None]:
fig = px.treemap(
    agg_melted_df, 
    path=['occupation', 'sd_experience', 'sharing_platform'], 
    values='response_count', 
    #color='response_count', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

In addition to the insights above, we find that

- GitHub and Kaggle are the preferred sharing platforms for every occupation and programming experience level group
- The percentage of professionals not sharing their work results publicly is higher in the more experienced groups (people with 10-20 and 20+ years of programming experience) - they replied with 'None' more frequently than members of the other experience level groups

In my subjective opinion, informed by the data insights above, Kaggle as a platform has an opportunity to engage the 'golden pool' of experts (people with 10-20 and 20+ years of programming experience) in sharing their knowledge and wisdom with younger generations more actively, provided it 'cracks the code' of motivating such professionals properly.

## Kaggle vs. Colab Usage

In [None]:
# subset of raw responses with the respective dimension categories as well as Kaggle and Colab usage
col_subset = [
    'Q5',
    'Q6',
    'Q2',
    'Q4',
    'Q36_Part_6',
    'Q36_Part_7',
]
kaggle_collab_df = data[col_subset]

kaggle_collab_df = kaggle_collab_df.rename(
    columns={
        'Q2': 'gender',
        'Q4': 'education',
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q36_Part_6': 'Kaggle',
        'Q36_Part_7': 'Colab',
})

In [None]:
# drop records with NA in dimensional values
kaggle_collab_df = kaggle_collab_df.dropna(subset=['education', 'occupation', 'sd_experience'])

# encode Kaggle and Colab usage as 0/1 flags: NaN means the platform was not selected,
# any non-NaN answer means it was (integer flags are needed for the alluvial diagrams down the road)
kaggle_collab_df['Kaggle'] = kaggle_collab_df['Kaggle'].notna().astype(int)
kaggle_collab_df['Colab'] = kaggle_collab_df['Colab'].notna().astype(int)

**Colab Usage**

In [None]:
dimension_cats = ['occupation', 'sd_experience', 'education', 'gender']
fig = px.parallel_categories(kaggle_collab_df, dimensions=dimension_cats, color='Colab')
fig.show()

**Kaggle Usage**

In [None]:
dimension_cats = ['occupation', 'sd_experience', 'education', 'gender']
fig = px.parallel_categories(kaggle_collab_df, dimensions=dimension_cats, color='Kaggle')
fig.show()

After reviewing both of the alluvial diagrams above, we can see that

- Kaggle is used more actively by male respondents in the majority of occupations, levels of education, and programming experience levels
- The most tangible difference is observed for male Data Scientists with Master's or Bachelor's degrees and between 1 and 10 years of programming experience
- In a sense, we can think of Kaggle notebooks and Colab as competing for the same target audience

Given the data-driven insights above, we can assume that Google could benefit from repositioning Colab, both functionally and in its marketing, to attract more paying users.
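
The visual comparison of the two alluvial diagrams can be cross-checked numerically. The sketch below is only an illustration on invented toy rows. It assumes a frame shaped like `kaggle_collab_df`, with 0/1 `Kaggle` and `Colab` columns, and uses the fact that the mean of a 0/1 flag is the share of respondents who selected it.

```python
import pandas as pd

# Toy rows standing in for kaggle_collab_df; all values are invented.
toy = pd.DataFrame({
    "gender": ["Man", "Man", "Woman", "Woman"],
    "Kaggle": [1, 0, 1, 0],
    "Colab": [0, 0, 1, 0],
})

# Mean of a 0/1 flag per group = share of the group that selected the platform.
usage_share = toy.groupby("gender")[["Kaggle", "Colab"]].mean()

# Positive values mean Kaggle is selected more often than Colab in that group.
usage_share["diff"] = usage_share["Kaggle"] - usage_share["Colab"]
print(usage_share)
```

The same pattern extends to several grouping columns (for example occupation and education) by passing a list to `groupby`.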

## Online Training Platform Preferences by Occupation

In [None]:
training_platform_lst = [
    'Q37_Part_1',
    'Q37_Part_2',
    'Q37_Part_3',
    'Q37_Part_4',
    'Q37_Part_5',
    'Q37_Part_6',
    'Q37_Part_7',
    'Q37_Part_8',
    'Q37_Part_9',
    'Q37_Part_10',
    'Q37_Part_11',
    'Q37_OTHER'
]

agg_data = data.groupby(["Q5", "Q6"])[training_platform_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q37_Part_1': 'Coursera',
        'Q37_Part_2': 'edX',
        'Q37_Part_3': 'Kaggle Learn Courses',
        'Q37_Part_4': 'DataCamp',
        'Q37_Part_5': 'Fast.ai',
        'Q37_Part_6': 'Udacity',
        'Q37_Part_7': 'Udemy',
        'Q37_Part_8': 'LinkedIn Learning',
        'Q37_Part_9': 'Cloud-certification programs (direct from AWS, Azure, GCP, or similar)',  
        'Q37_Part_10': 'University Courses (resulting in a university degree)',   
        'Q37_Part_11': 'None',
        'Q37_OTHER': "Other"
})

agg_melted_df = agg_data.melt(id_vars=['occupation', 'sd_experience',],
                     var_name='learning_platform', value_name='response_count').copy()

In [None]:
fig_df = agg_melted_df.groupby(['occupation', 'learning_platform'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='learning_platform', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of Online E-Learning Platforms by User Occupation", 
    height=700)
fig.show()

We find that

- Coursera, Kaggle Learn Courses, and Udemy are the top 3 preferred learning platforms (with Coursera far more popular than the other two)
- University courses leading to a formal university degree are also quite popular (they rank 4th in the list)
- edX is behind its primary e-learning platform rivals
- Only a minor fraction of respondents indicate that they did not begin or complete any data science courses (a healthy indicator that the data science community here at Kaggle has good learning and self-learning attitudes)
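
Because the stacked bars show raw counts, large occupation groups dominate the picture. The sketch below shows one way to convert counts into within-occupation shares. It works on an invented toy frame shaped like `fig_df`, and the `share` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Toy aggregate shaped like fig_df; the counts are invented.
toy = pd.DataFrame({
    "occupation": ["Student", "Student", "Data Scientist", "Data Scientist"],
    "learning_platform": ["Coursera", "edX", "Coursera", "edX"],
    "response_count": [30, 10, 60, 40],
})

# transform("sum") broadcasts each occupation's total back to its rows,
# so the division yields the share of each platform within the occupation.
totals = toy.groupby("occupation")["response_count"].transform("sum")
toy["share"] = toy["response_count"] / totals
print(toy)
```

Plotting `share` instead of `response_count` would make occupations of very different sizes directly comparable.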

## Online Learning Platform Preferences by Occupation and Programming Experience

In [None]:
fig = px.treemap(
    agg_melted_df, 
    path=['occupation', 'sd_experience', 'learning_platform'], 
    values='response_count', 
    #color='response_count', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

In addition to the insights above, we find that

- Coursera is the top platform of choice for every occupation and programming experience group
- Kaggle Learn Courses are more popular with inexperienced junior professionals (those with <1 or 1-2 years of experience)
- For more senior professionals (with 3+ years of experience), Udemy is the second platform of choice (after Coursera), and they do not actively use Kaggle Learn Courses
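
Statements about the top platform per group can also be read off the aggregated table rather than the treemap. The sketch below is an illustration on invented toy counts shaped like `agg_melted_df`, and it picks the row with the largest count in each occupation and experience group.

```python
import pandas as pd

# Toy long-format counts shaped like agg_melted_df; the numbers are invented.
toy = pd.DataFrame({
    "occupation": ["Student"] * 4,
    "sd_experience": ["< 1 years", "< 1 years", "3-5 years", "3-5 years"],
    "learning_platform": ["Coursera", "Udemy", "Coursera", "Udemy"],
    "response_count": [50, 20, 40, 45],
})

# idxmax returns, per group, the row label of the largest count;
# .loc then selects exactly those rows.
top_idx = toy.groupby(["occupation", "sd_experience"])["response_count"].idxmax()
top_rows = toy.loc[top_idx]
print(top_rows[["occupation", "sd_experience", "learning_platform"]])
```

Ties are resolved by the first occurrence, so a second-place ranking would need an explicit sort within each group.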

## Primary Data Analysis Tool Preferences by Occupation

In [None]:
agg_data = data.groupby(["Q5", "Q6", 'Q38']).size().reset_index(name='response_count')
agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q38': 'analysis_tool',
})

In [None]:
fig_df = agg_data.groupby(['occupation', 'analysis_tool'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='analysis_tool', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of Data Analysis Tools by User Occupation", 
    height=700)
fig.show()

We find that

- Local IDEs are the primary choice of the majority of the survey respondents (which points to fairly good technical skills among them)
- Basic Stats software is the second preference of the respondents
- Data Scientists do not often use Basic Stats software, in contrast to the other analysis tool options investigated
- The rest of the tool types are far behind Local IDEs and Basic Stats software

## Primary Data Analysis Tool Preferences by Occupation and Programming Experience

In [None]:
fig = px.treemap(
    agg_data, 
    path=['occupation', 'sd_experience', 'analysis_tool'], 
    values='response_count', 
    #color='response_count', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

We can find that

- technically savvy occupations are more inclined to use Local IDEs as their primary 'data analysis' tool of choice whereas less technical occupations (like Business Analysts) tend to use basic stat tools more
- it is also noted that professionals with little programming experience (<1 year or zero experience) tend to choose basic stats software, whereas professionals with 2+ years of programming experience are comfortable using local IDEs for their data analysis needs
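
The relation between experience and tool choice can be tabulated directly from raw responses. The sketch below is a minimal illustration on invented toy rows. The column names mirror the renamed `sd_experience` and `analysis_tool` fields, and `normalize="index"` makes each row sum to 1.

```python
import pandas as pd

# Toy single-choice responses; the rows are invented for illustration.
toy = pd.DataFrame({
    "sd_experience": ["< 1 years", "< 1 years", "5-10 years", "5-10 years"],
    "analysis_tool": ["Basic stats software", "Basic stats software",
                      "Local IDE", "Basic stats software"],
})

# Row-normalized contingency table: share of each tool within an experience level.
shares = pd.crosstab(toy["sd_experience"], toy["analysis_tool"], normalize="index")
print(shares)
```

Reading across a row then gives the tool mix for one experience level, independent of how many respondents fall into it.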

## Favorite Media Sources by Occupation

In [None]:
media_lst = ['Q39_Part_1',
 'Q39_Part_2',
 'Q39_Part_3',
 'Q39_Part_4',
 'Q39_Part_5',
 'Q39_Part_6',
 'Q39_Part_7',
 'Q39_Part_8',
 'Q39_Part_9',
 'Q39_Part_10',
 'Q39_Part_11',
 'Q39_OTHER',
]

agg_data = data.groupby(["Q5", "Q6"])[media_lst].count().reset_index()

agg_data = agg_data.rename(
    columns={
        'Q5': 'occupation',
        'Q6': 'sd_experience',
        'Q39_Part_1': 'Twitter (data science influencers)',
        'Q39_Part_2': "Email newsletters (Data Elixir, O'Reilly Data & AI, etc)",
        'Q39_Part_3': 'Reddit (r/machinelearning, etc)',
        'Q39_Part_4': 'Kaggle (notebooks, forums, etc)',
        'Q39_Part_5': 'Course Forums (forums.fast.ai, Coursera forums, etc)',
        'Q39_Part_6': 'YouTube (Kaggle YouTube, Cloud AI Adventures, etc)',
        'Q39_Part_7': 'Podcasts (Chai Time Data Science, O’Reilly Data Show, etc)',
        'Q39_Part_8': 'Blogs (Towards Data Science, Analytics Vidhya, etc)',
        'Q39_Part_9': 'Journal Publications (peer-reviewed journals, conference proceedings, etc)',  
        'Q39_Part_10': 'Slack Communities (ods.ai, kagglenoobs, etc)',   
        'Q39_Part_11': 'None',
        'Q39_OTHER': "Other"
})

agg_melted_df = agg_data.melt(id_vars=['occupation', 'sd_experience',],
                     var_name='medium', value_name='response_count').copy()

In [None]:
fig_df = agg_melted_df.groupby(['occupation', 'medium'])['response_count'].sum().reset_index() 
fig = px.bar(
    fig_df, 
    x='medium', 
    y='response_count', 
    color='occupation', 
    title="Kaggle 2020 Survey: Usage of Favorite Media Sources by User Occupation", 
    height=700)
fig.show()

We find that

- Kaggle, as a source of valuable Data Science-related information, is the top choice of the majority of the survey respondents, and it strongly outperforms the other media sources investigated
- YouTube content comes second
- Popular Data Science blogs rank third
- Other media sources are well below the top three listed above

## Favorite Media Sources by Occupation and Programming Experience

In [None]:
fig = px.treemap(
    agg_melted_df, 
    path=['occupation', 'sd_experience', 'medium'], 
    values='response_count', 
    #color='response_count', 
    color_continuous_midpoint=50, 
    color_continuous_scale=px.colors.diverging.Portland
    )
fig.show()

In addition to insights above, we find that

- In almost every occupation, people with less than 10 years of programming experience are fans of Kaggle as the primary information source
- People with 10+ years of experience tend to rank other information sources (Blogs among them) as their primary go-to medium
- Junior Business Analysts (with 2 or fewer years of programming experience) rank YouTube as their primary go-to medium, whereas more senior-level Business Analysts prefer Blog posts and Kaggle
- Statisticians with 20+ years of experience prefer Journal papers/publications, and Statisticians with fewer years of programming experience look information up in Blogs and Kaggle as their go-to choices
- Product/Project managers rank Kaggle as their go-to information source, regardless of their programming experience

# Chapter 7: Geographical Perspective On Kaggle Survey 2020 Participants

A set of geoscatter plots in the sections below draws some insights on the geography of the Kaggle 2020 survey participants.

## Kagglers Geography By Gender

In [None]:
# create the necessary aggregated dataframes
agg_data = data.groupby(['Q2', 'Q3']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q3': 'country', 
        'Q2': 'gender', 
    })

### Male Kagglers

In [None]:
agg_males = agg_data[agg_data["gender"] == 'Man']

In [None]:
country_size_pipeline = pdp.PdPipeline([
    pdp.ApplyByCols('respondent_count', 
                    set_size, 'size', drop=False)
])

agg_males = country_size_pipeline.apply(agg_males)

fig = px.scatter_geo(
    agg_males, locations="country", locationmode='country names', 
    color="respondent_count", 
    size='size', hover_name="country", 
    range_color= [0, 500], 
    projection="natural earth",
    title='Male Kagglers', 
    color_continuous_scale="portland")

fig.show()

As we can see, the top 5 countries by number of male kagglers among the survey respondents (500+ respondents each) are as follows

- India
- USA
- Brazil
- Japan
- Russia

The second tier of countries with the most active male survey participants (300-500 participants) comprises the countries below

- UK
- Germany
- Nigeria
- China

We also see a few countries slightly behind the tier 1 and tier 2 leaders in terms of participating kagglers (France, Spain, Turkey).

*Note:* China is notably under-represented in this survey relative to the country's population.

### Female Kagglers

In [None]:
# female Kagglers
agg_females = agg_data[agg_data["gender"] == 'Woman']

agg_females = country_size_pipeline.apply(agg_females)

fig = px.scatter_geo(
    agg_females, locations="country", locationmode='country names', 
    color="respondent_count", 
    size='size', hover_name="country", 
    range_color= [0, 500], 
    projection="natural earth",
    title='Female Kagglers', 
    color_continuous_scale="portland")

fig.show()

As we can see, only two countries have more than 400 female kagglers who participated in the 2020 survey. These are

- India
- USA

## Kagglers with Doctoral Degree

In [None]:
# create the necessary aggregated dataframes
agg_data = data.groupby(['Q3', 'Q4']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q3': 'country', 
        'Q4': 'education_level', 
    })

In [None]:
# Kagglers with doctoral degree
agg_doctors = agg_data[agg_data["education_level"] == 'Doctoral degree']

agg_doctors = country_size_pipeline.apply(agg_doctors)

fig = px.scatter_geo(
    agg_doctors, locations="country", locationmode='country names', 
    color="respondent_count", 
    size='size', hover_name="country", 
    range_color= [0, 200], 
    projection="natural earth",
    title='Kagglers with Doctoral Degree', 
    color_continuous_scale="portland")

fig.show()

As we can see, the number of Kaggle survey respondents with a Doctoral degree is the highest in the US - *407* (although the US-based population of survey participants is not the largest one, please see above). India holds the second place (its population of survey participants is the largest across the globe), with *275* kagglers with Doctoral degrees participating in the survey.

The rest of the countries have fewer than 200 Kagglers with a Doctoral degree who responded to the survey. Still, there are some countries with between 60 and 200 respondents holding a Doctoral degree. These are

- UK
- Germany
- Brazil
- France
- Spain
- Japan

As we can see, Nigeria, one of the countries with the largest Kaggler communities, is not in the top list of countries with Doctoral degree kagglers.

## Kagglers with Master's Degree

In [None]:
# Kagglers with Master's degree
agg_masters = agg_data[agg_data["education_level"] == 'Master’s degree']

agg_masters = country_size_pipeline.apply(agg_masters)

fig = px.scatter_geo(
    agg_masters, locations="country", locationmode='country names', 
    color="respondent_count", 
    size='size', hover_name="country", 
    range_color= [0, 200], 
    projection="natural earth",
    title='Kagglers with Master’s Degree', 
    color_continuous_scale="portland")

fig.show()

We can see that the following countries have more than 200 survey participants with a Master's degree

- India 
- USA
- Japan
- Russia
- Brazil
- China
- UK
- Germany
- France
- Spain

As we can see, Nigeria, one of the countries with the largest Kaggler communities, is not in the top list of countries with Master's degree kagglers.

## Kaggle Programming Veterans

This section displays the geographic distribution of the survey participants with 20+ years of programming experience across the globe.

In [None]:
# create the necessary aggregated dataframes
agg_data = data.groupby(['Q3', 'Q6']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q3': 'country', 
        'Q6': 'sd_experience', 
    })

In [None]:
# Kagglers with 20+ years of programming experience
agg_programming_veterans = agg_data[agg_data["sd_experience"] == '20+ years']

agg_programming_veterans = country_size_pipeline.apply(agg_programming_veterans)

fig = px.scatter_geo(
    agg_programming_veterans, locations="country", locationmode='country names', 
    color="respondent_count", 
    size='size', hover_name="country", 
    range_color= [0, 200], 
    projection="natural earth",
    title='Kagglers with 20+ Years of Programming Experience', 
    color_continuous_scale="portland")

fig.show()

As we can see, the largest number of programming veterans among the survey respondents reside in the US. The rest of the countries are well below the bar set by the US.

However, there are several countries with between 60 and 100 programming veterans who participated in the survey. These are

- Japan
- UK
- Brazil
- India

As we see, India, although represented by the largest population of Kagglers among the survey respondents, does not have a comparably large pool of experts with 20+ years of programming experience relative to the other leading countries.

## Kaggle ML Veterans

This section displays the geographic distribution of the survey participants with 10-20 and 20+ years of ML experience across the globe.

In [None]:
# create the necessary aggregated dataframes
agg_data = data.groupby(['Q3', 'Q15']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q3': 'country', 
        'Q15': 'ml_experience', 
    })

In [None]:
# Kagglers with 10-20 and 20+ years of ML experience
ml_veteran_list = ['10-20 years', '20 or more years']
agg_ml_veterans = agg_data[agg_data['ml_experience'].isin(ml_veteran_list)]

agg_ml_veterans = agg_ml_veterans.groupby(['country'], as_index=False)['respondent_count'].sum()

agg_ml_veterans = country_size_pipeline.apply(agg_ml_veterans)

fig = px.scatter_geo(
    agg_ml_veterans, locations="country", locationmode='country names', 
    color="respondent_count", 
    size='size', hover_name="country", 
    range_color= [0, 200], 
    projection="natural earth",
    title='Kagglers with 10+ Years of ML Experience', 
    color_continuous_scale="portland")

fig.show()

As we can see, the US clearly dominates the rest of the world in terms of the number of survey participants with 10+ years of ML experience.

## AWS Professional Users Across the Globe

In [None]:
# create the necessary aggregated dataframes
agg_data = data.groupby(['Q3', 'Q26_A_Part_1']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q3': 'country', 
        'Q26_A_Part_1': 'AWS', 
    })

agg_data = country_size_pipeline.apply(agg_data)

fig = px.scatter_geo(
    agg_data, locations="country", locationmode='country names', 
    color="respondent_count", 
    size='size', hover_name="country", 
    range_color= [0, 500], 
    projection="natural earth",
    title='Kaggle Professionals Utilizing AWS Across the Globe', 
    color_continuous_scale="portland")

fig.show()

We find that

- AWS is most popular among survey respondents in India and the USA
- Brazil, Japan, and the UK are in tier 2 in terms of the number of respondents from these countries who use AWS

## GCP Professional Users Across the Globe

In [None]:
# create the necessary aggregated dataframes
agg_data = data.groupby(['Q3', 'Q26_A_Part_3']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q3': 'country', 
        'Q26_A_Part_3': 'GCP', 
    })

agg_data = country_size_pipeline.apply(agg_data)

fig = px.scatter_geo(
    agg_data, locations="country", locationmode='country names', 
    color="respondent_count", 
    size='size', hover_name="country", 
    range_color= [0, 500], 
    projection="natural earth",
    title='Kaggle Professionals Utilizing GCP Across the Globe', 
    color_continuous_scale="portland")

fig.show()

We find that

- India is the top country where GCP is used
- The USA takes second place but is significantly below India (unlike AWS, where India and the USA were roughly on a par)
- Japan and Brazil are in tier 2 in terms of the number of respondents from these countries who use GCP
- GCP is less popular than AWS in the UK, Canada, and Australia
- GCP outperforms AWS in Turkey, Indonesia, and Russia
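
Comparisons between two providers are easier to verify when the per-country counts sit side by side instead of on two separate maps. The sketch below is an illustration only. The country labels and counts are invented placeholders, and the column names `aws_users`, `gcp_users`, and `gcp_minus_aws` are assumptions introduced here.

```python
import pandas as pd

# Placeholder per-country counts; labels and numbers are invented, not survey results.
aws = pd.DataFrame({"country": ["Country A", "Country B"], "aws_users": [120, 40]})
gcp = pd.DataFrame({"country": ["Country A", "Country B"], "gcp_users": [90, 55]})

# Outer merge keeps countries present in only one table; fill the gaps with 0.
both = aws.merge(gcp, on="country", how="outer").fillna(0)

# Positive values mean more GCP users than AWS users in that country.
both["gcp_minus_aws"] = both["gcp_users"] - both["aws_users"]
gcp_leads = both.loc[both["gcp_minus_aws"] > 0, "country"].tolist()
print(gcp_leads)
```

With the real aggregates, `gcp_leads` would list the countries where GCP is ahead of AWS, which is the kind of claim made in the last bullet above.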

## MS Azure Professional Users Across the Globe

In [None]:
# create the necessary aggregated dataframes
agg_data = data.groupby(['Q3', 'Q26_A_Part_2']).size().reset_index(name='respondent_count')

agg_data = agg_data.rename(
    columns={
        'Q3': 'country', 
        'Q26_A_Part_2': 'Azure', 
    })

agg_data = country_size_pipeline.apply(agg_data)

fig = px.scatter_geo(
    agg_data, locations="country", locationmode='country names', 
    color="respondent_count", 
    size='size', hover_name="country", 
    range_color= [0, 500], 
    projection="natural earth",
    title='Kaggle Professionals Utilizing MS Azure Across the Globe', 
    color_continuous_scale="portland")

fig.show()

We find that

- The top country in terms of the number of MS Azure users is India (although MS Azure is well behind AWS and GCP there)
- The USA holds second place, and the number of MS Azure users there is on a par with the number of GCP users
- Brazil belongs to tier 2 in terms of the number of MS Azure users
- In the majority of countries (except the US), the number of MS Azure users is smaller than the number of GCP and AWS users

# Appendix

Appendix contains supplementary notes on

- Methodology and technical implementation details about this notebook
- Comments on why Rapid EDA tools are not applicable to the EDA in the scope of this project
- References to the high-quality notebooks of other contest participants that inspired the work done here as well as to the blog posts explaining advanced topics of visualizing relations between categorical features

## Methodology and Technical Implementation Notes

Due to the nature of the Kaggle 2020 Survey responses dataset (a lot of categorical variables), the main focus of this project is to visualise and review categorical variables. Processing and visualising data with multiple categorical variables can be tricky.

Therefore, I used special visualization instruments that make it easy to

- visualize pairwise associations between two categorical features (we are not interested in correlations in this case; rather, we look for categories that are more strongly connected to or associated with each other)
- visualize multivariate parallel category relations where more than two categorical features are involved
- visualize tree-like hierarchical relations between an arbitrary number of categorical features (this is extremely powerful when interactions between more than two categorical features are investigated)

Such instruments are listed below as follows

- catscatter plots
- parallel category plots (aka alluvial plots)
- tree map plots
- stacked bar charts with color coding
- geoscatter plots to draw by-country data visualizations 

From the technical implementation standpoint, I used the catscatter plot implemented by Myr Barnés (see https://github.com/myrthings/catscatter/blob/master/catscatter.py for more details).

For the rest of the above-mentioned charts, I used the power, simplicity, and interactivity of Plotly (although other visualization libraries exist that can provide similar capabilities).

## Can Rapid EDA Tools Help in This Competition?

Due to the nature of the source data, the popular Rapid EDA tools (like *AutoViz*, *SweetViz*, or *Pandas Profiling*) are of little use for this problem.

More specifically, the following factors block any attempt to use Rapid EDA tools in a meaningful manner

- sparse data, sometimes with multiple columns related to the same domain-area attribute
- a lot of categorical features, which call for visualizations that reveal pairwise cat-to-cat relations as well as multivariate parallel and hierarchical relations among more than 2 categorical features
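
The sparsity mentioned above comes from multi-select questions, which are stored as one column per answer option. The sketch below is a minimal illustration on invented toy rows of collapsing such columns into a tidy long table before plotting. The `choice` column name is an assumption introduced here.

```python
import pandas as pd

# Toy multi-select block: one column per answer option, missing when not selected.
# The rows are invented for illustration.
toy = pd.DataFrame({
    "Q5": ["Student", "Data Scientist"],
    "Q37_Part_1": ["Coursera", None],
    "Q37_Part_2": [None, "edX"],
})

# melt stacks the option columns into rows; dropna removes unselected options.
long_df = toy.melt(id_vars="Q5", value_name="choice").dropna(subset=["choice"])

# Count selections per occupation and chosen option.
counts = long_df.groupby(["Q5", "choice"]).size().reset_index(name="response_count")
print(counts)
```

A long table of this shape is the kind of input the bar charts and treemaps in this notebook consume, and it is a reshaping step that generic profiling reports do not perform on their own.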

## References

This notebook has been inspired by ideas from several great survey EDA contributions listed below

- https://www.kaggle.com/subinium/kaggle-2020-visualization-analysis
- https://www.kaggle.com/dwin183287/kagglers-seen-by-continents

As for best practices in visualizing the associations / relations between categorical variables in your dataset, you may want to review the blog posts below

- https://towardsdatascience.com/processing-and-visualizing-multiple-categorical-variables-with-python-nbas-schedule-challenges-b48453bff813
- https://towardsdatascience.com/visualize-categorical-relationships-with-catscatter-e60cdb164395


In [None]:
finish_time = dt.datetime.now()
elapsed_time =  finish_time - start_time
print("Finished at ", finish_time)
print("Elapsed time: ", elapsed_time)