![](https://i.imgur.com/wcCJf9v.png)

# Imports 📚

Let's import the libraries we will be using in this notebook.

In [None]:
# Data Import on Kaggle
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Importing processing libraries
import numpy as np
import pandas as pd

# Importing Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import palettable.scientific.sequential as palette
from palettable.tableau import Tableau_20
import matplotlib.gridspec as gridspec

import warnings
warnings.filterwarnings("ignore")

sns.set_style('white')

In [None]:
df = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
df = df.iloc[1:] # Removing the questions from the data

df20 = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2020_responses.csv')
df20 = df20.iloc[1:] # Removing the questions from the data

df19 = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2019_responses.csv')
df19 = df19.iloc[1:] # Removing the questions from the data

# My Plan

The Kaggle Machine Learning and Data Science Survey has collected a huge amount of information on users across the world, and there are several burning questions we might want answered. One question I had was how Kaggle users differ across countries. This is a broad question in itself and comes with various caveats we need to look at.

How do user behaviours change when we look at a developing country (like India) as opposed to a developed country (like the USA)? What are the differences, not only in the gender split but also in age, education, earnings, etc.?

As we take a deeper dive into the comparisons of these countries, we also need to make sure our results are reproducible. Are the similarities/differences we see unique to India and the USA, or are they seen in other comparisons as well (say, Nigeria vs Japan, Brazil vs UK)?

Furthermore, while we're talking about similarities between one group of countries, do we see a trend here? Do countries with a similar population have unique trends? What about ones with similar GDP?

While looking at these macro comparisons across the world, there was a smaller question I wanted to answer. Different countries shouldn't differ much in the models they use (not significantly, at least). However, different industries would require custom models for their own data, so I take a quick look at the models each industry prefers. I also look at how model usage differs across company sizes.

Finally, we conclude with our results from the comparisons of the countries.

In [None]:
df.head(5)

## World Plots

Let's look at some distributions of Kaggle users across the world: 
- Gender distributions
- Distribution of users by their Education Level
- Distribution of users by their Job Title

In [None]:
def gender_split():
    male_df = df[df['Q2'] == 'Man']
    female_df = df[df['Q2'] == 'Woman']
    other_df = df[(df['Q2'] != 'Man') & (df['Q2'] != 'Woman')]
    return male_df, female_df, other_df
male_df, female_df, other_df = gender_split()

fig = plt.figure(figsize=(17, 8))
spec = gridspec.GridSpec(ncols=3, nrows=2, figure=fig)

# Gender Split
ax = fig.add_subplot(spec[0,0])
gender_idx = ['Man', 'Woman', 'Other']
gender_vals = df['Q2'].value_counts().values
gender_vals[2] = gender_vals[2:].sum()

circle = plt.Circle( (0,0), 0.7, color='white')
ax.pie(gender_vals[:3], explode = (0, 0.1, 0.2), labels = gender_idx, colors=palette.Acton_4.hex_colors)
p=plt.gcf()
p.gca().add_artist(circle)
for s in ['top', 'right', 'bottom', 'left']:
    ax.spines[s].set_visible(False)

# Education Split
edu_ser= df['Q4'].value_counts()
edu_ser = edu_ser.reindex(['No formal education past high school','Professional doctorate',
                           'Bachelor’s degree',"Master’s degree",'Doctoral degree',
                           'Some college/university study without earning a bachelor’s degree','I prefer not to answer'])
edu_idx = edu_ser.index
edu_vals = edu_ser.values
edu_idx = ['High School','Prof. Doctorate',
                       'Bachelor’s',"Master’s",'Doctoral',
                       'Some College','No Answer']
ax1 = fig.add_subplot(spec[0,2])
ax1.bar( edu_idx, edu_vals, color=palette.Acton_7.hex_colors)
ax1.set_xticklabels(edu_idx, rotation=40)

for s in ['top', 'right', 'bottom', 'left']:
    ax1.spines[s].set_visible(False)

# Current Role
ax1 = fig.add_subplot(spec[1,:])
role_ser= df['Q5'].value_counts()
role_ser = role_ser.reindex(['Developer Relations/Advocacy','Statistician',
                             'Data Engineer','Business Analyst','Research Scientist','Data Analyst','Software Engineer','Student',
              'Data Scientist','Other','Currently not employed','Machine Learning Engineer','Program/Project Manager','Product Manager','DBA/Database Engineer'])
role_idx = role_ser.index
role_vals = role_ser.values
role_idx = ['Developer Relations','Statistician',
                             'Data Engineer','Business Analyst','Research Scientist','Data Analyst','Software Engineer','Student',
              'Data Scientist','Other','Unemployed','ML Engineer','Project Manager','Product Manager','Database Engineer']

ax1.bar( role_idx, role_vals, color=palette.Acton_15.hex_colors)
ax1.set_xticklabels(role_idx, rotation=40)

for s in ['top', 'right', 'bottom', 'left']:
    ax1.spines[s].set_visible(False)


# Text

fig.text(0.28, 0.75, 'Distributions of the respondents', fontsize=17, fontweight='bold', fontfamily='sans-serif')
fig.text(0.28, 0.52, 
'''From these figures, the majority of the respondents are clearly male,
and India contributes more respondents than any other country. We also 
take a quick look at their education backgrounds and their current roles. 
Don't worry, we will dive into the correlations between these features 
soon. As mentioned earlier, our analysis here is geared towards comparisons 
of developed vs developing countries, so we're looking at a few variables 
that might be important later.
'''
, fontsize=14, fontweight='light', fontfamily='sans-serif')

fig.tight_layout() 
plt.subplots_adjust(wspace=1)
plt.show()
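
The donut chart above relies on `value_counts()` returning "Man" first and "Woman" second, so the labels only line up as long as that ordering holds. A minimal sketch of a version that does not depend on the ordering is shown here; `collapse_gender` is a hypothetical helper and the toy series stands in for `df['Q2']`.

```python
import pandas as pd

def collapse_gender(series):
    """Map every answer other than 'Man' or 'Woman' to 'Other' (hypothetical helper)."""
    collapsed = series.where(series.isin(['Man', 'Woman']), 'Other')
    # Fix the order explicitly so labels and values always line up
    return collapsed.value_counts().reindex(['Man', 'Woman', 'Other'], fill_value=0)

# Toy answers standing in for df['Q2']
toy_q2 = pd.Series(['Man', 'Woman', 'Man', 'Nonbinary', 'Prefer not to say', 'Man'])
gender_counts = collapse_gender(toy_q2)
print(gender_counts)
```

The resulting index can be passed to `ax.pie` as `labels` and the values as the wedge sizes, which keeps the two in sync even for a country where "Woman" is not the second most common answer.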






# Developed vs Developing


In [None]:
display(df.groupby(['Q3']).count()['Q1'].sort_values(ascending=False)[:10])

Our concern here is that, since such a large share of the data comes from India, our earlier plots could be misleading. So, we will compare the data from a developing country (India) against a developed country (USA) and see how the results hold up. Are Kaggle users still mostly "male undergraduate students"?
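
To judge how much of the data each country contributes, proportions are easier to read than raw counts. The sketch below uses a small made-up series in place of `df['Q3']`; the numbers are illustrative only and are not survey results.

```python
import pandas as pd

# Toy answers standing in for df['Q3'] (made-up data)
toy_q3 = pd.Series(['India', 'India', 'United States of America', 'Japan',
                    'India', 'Brazil', 'Other', 'Japan'])

# normalize=True returns each country's fraction of all responses
country_share = toy_q3.value_counts(normalize=True)
print(country_share.head(10).map('{:.1%}'.format))
```

On the real data the same call would be `df['Q3'].value_counts(normalize=True).head(10)`.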

In [None]:
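def create_country_plots(df, row_num, color1, color2, color3, name, fig, spec):
    """Draw one row of plots (gender donut, education bars, role bars) for a
    single country's responses `df` in row `row_num` of the grid `spec`."""
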
    # Gender Split
    # ------------------------
    ax = fig.add_subplot(spec[row_num,0])
    gender_idx = ['Man', 'Woman', 'Other']
    gender_vals = df['Q2'].value_counts().values
    gender_vals[2] = gender_vals[2:].sum()

    circle = plt.Circle( (0,0), 0.7, color='white')
    ax.pie(gender_vals[:3], explode = (0, 0.1, 0.2), labels = gender_idx, colors=color1)
    p=plt.gcf()
    p.gca().add_artist(circle)
    for s in ['top', 'right', 'bottom', 'left']:
        ax.spines[s].set_visible(False)
#     ax.set_xlabel('Gender Distribution')

    # Education Split
    # ------------------------
    edu_ser= df['Q4'].value_counts()
    edu_ser = edu_ser.reindex(['No formal education past high school','Professional doctorate',
                               'Bachelor’s degree',"Master’s degree",'Doctoral degree',
                               'Some college/university study without earning a bachelor’s degree','I prefer not to answer'])
    edu_idx = edu_ser.index
    edu_vals = edu_ser.values
    edu_idx = ['High School','Prof. Doctorate',
                           'Bachelor’s',"Master’s",'Doctoral',
                           'Some College','No Answer']
    ax1 = fig.add_subplot(spec[row_num,1])
    ax1.bar( edu_idx, edu_vals, color=color2)
    ax1.set_xticklabels(edu_idx, rotation=70)
    ax1.set_yticklabels([])
    ax1.set_title(name)
#     ax1.set_xlabel('Education Distribution')

    for s in ['top', 'right', 'bottom', 'left']:
        ax1.spines[s].set_visible(False)
        
    # Role Split    
    # ------------------------
    ax1 = fig.add_subplot(spec[row_num,2])
    role_ser_us= df['Q5'].value_counts()
    role_ser_us = role_ser_us.reindex(['Developer Relations/Advocacy','Statistician',
                                 'Data Engineer','Business Analyst','Research Scientist','Data Analyst','Software Engineer','Student',
                  'Data Scientist','Other','Currently not employed','Machine Learning Engineer','Program/Project Manager','Product Manager','DBA/Database Engineer'])
    role_idx_us = role_ser_us.index
    role_vals_us = role_ser_us.values
    role_idx_us = ['Dev Relations','Statistician',
                                 'Data Eng','Business Analyst','Research Scientist','Data Analyst','SWE','Student',
                  'Data Scientist','Other','Unemployed','ML Engineer','Project Manager','Product Manager','Database Engineer']

    ax1.bar( role_idx_us, role_vals_us, color=color3)
    ax1.set_xticklabels(role_idx_us, rotation=70)
    ax1.set_yticklabels([])
#     ax1.set_xlabel('Work Ex Distribution')

    for s in ['top', 'right', 'bottom', 'left']:
        ax1.spines[s].set_visible(False)
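
The plotting code keeps two parallel lists for each question: the full survey answers (used in `reindex`) and the short display labels. One possible way to keep them in sync is a single dictionary, sketched here. `ROLE_LABELS` and `labelled_counts` are hypothetical names, and the mapping is an abbreviated subset of the Q5 answers.

```python
import pandas as pd

# Survey answer -> short display label; the dict order sets the bar order.
# Abbreviated, illustrative subset of the Q5 answers.
ROLE_LABELS = {
    'Student': 'Student',
    'Data Scientist': 'Data Scientist',
    'Machine Learning Engineer': 'ML Engineer',
    'Currently not employed': 'Unemployed',
}

def labelled_counts(series, labels):
    """Count answers, order them like `labels`, and swap in the short names."""
    counts = series.value_counts().reindex(list(labels), fill_value=0)
    return counts.rename(index=labels)

# Toy answers standing in for df['Q5']
toy_q5 = pd.Series(['Student', 'Machine Learning Engineer', 'Student'])
role_counts = labelled_counts(toy_q5, ROLE_LABELS)
print(role_counts)
```

Answers missing from a country's data come back as 0 instead of NaN, so the bars stay aligned across countries.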

        

In [None]:
ind_df = df[df['Q3'] == 'India']
us_df = df[df['Q3'] == 'United States of America']

fig = plt.figure(figsize=(17, 12))
spec = gridspec.GridSpec(ncols=3, nrows=4, figure=fig)

# Gender Split
ax = fig.add_subplot(spec[0,0])
gender_idx = ['Man', 'Woman', 'Other']
gender_vals = ind_df['Q2'].value_counts().values
gender_vals[2] = gender_vals[2:].sum()
circle = plt.Circle( (0,0), 0.7, color='white')
ax.pie(gender_vals[:3], explode = (0, 0.1, 0.2), labels = gender_idx, colors=palette.Acton_4.hex_colors)
p=plt.gcf()
p.gca().add_artist(circle)
ax.set_title("India")

# ------------------------

ax2 = fig.add_subplot(spec[1,0])
gender_vals_us = us_df['Q2'].value_counts().values
gender_vals_us[2] = gender_vals_us[2:].sum()
circle = plt.Circle( (0,0), 0.7, color='white')
ax2.pie(gender_vals_us[:3], explode = (0, 0.1, 0.2), labels = gender_idx, colors=palette.Batlow_4.hex_colors)
p=plt.gcf()
p.gca().add_artist(circle)
ax2.set_title("USA")

# ------------------------
# ------------------------

# Education Split
edu_ser= ind_df['Q4'].value_counts()
edu_ser = edu_ser.reindex(['No formal education past high school','Professional doctorate',
                           'Bachelor’s degree',"Master’s degree",'Doctoral degree',
                           'Some college/university study without earning a bachelor’s degree','I prefer not to answer'])
edu_idx = edu_ser.index
edu_vals = edu_ser.values
edu_idx = ['High School','Prof. Doctorate',
                       'Bachelor’s',"Master’s",'Doctoral',
                       'Some College','No Answer']
ax1 = fig.add_subplot(spec[0,2])
ax1.bar( edu_idx, edu_vals, color=palette.Acton_7.hex_colors)
ax1.set_xticklabels(edu_idx, rotation=40)
ax1.set_yticklabels([])
ax1.set_title("India")

for s in ['top', 'right', 'bottom', 'left']:
    ax1.spines[s].set_visible(False)


# ------------------------

edu_ser_us= us_df['Q4'].value_counts()
edu_ser_us = edu_ser_us.reindex(['No formal education past high school','Professional doctorate',
                           'Bachelor’s degree',"Master’s degree",'Doctoral degree',
                           'Some college/university study without earning a bachelor’s degree','I prefer not to answer'])
edu_idx_us = edu_ser_us.index
edu_vals_us = edu_ser_us.values
edu_idx_us = ['High School','Prof. Doctorate',
                       'Bachelor’s',"Master’s",'Doctoral',
                       'Some College','No Answer']
ax1 = fig.add_subplot(spec[1,2])
ax1.bar( edu_idx_us, edu_vals_us, color=palette.Batlow_7.hex_colors)
ax1.set_xticklabels(edu_idx_us, rotation=40)
ax1.set_yticklabels([])
ax1.set_title("USA")

for s in ['top', 'right', 'bottom', 'left']:
    ax1.spines[s].set_visible(False)

# ------------------------
# ------------------------

# Current Role
ax1 = fig.add_subplot(spec[2,:])
role_ser= ind_df['Q5'].value_counts()
role_ser = role_ser.reindex(['Developer Relations/Advocacy','Statistician',
                             'Data Engineer','Business Analyst','Research Scientist','Data Analyst','Software Engineer','Student',
              'Data Scientist','Other','Currently not employed','Machine Learning Engineer','Program/Project Manager','Product Manager','DBA/Database Engineer'])
role_idx = role_ser.index
role_vals = role_ser.values
role_idx = ['Developer Relations','Statistician',
                             'Data Engineer','Business Analyst','Research Scientist','Data Analyst','Software Engineer','Student',
              'Data Scientist','Other','Unemployed','ML Engineer','Project Manager','Product Manager','Database Engineer']

ax1.bar( role_idx, role_vals, color=palette.Acton_15.hex_colors)
ax1.set_xticklabels(role_idx, rotation=40)
ax1.set_yticklabels([])
ax1.set_title("India")
for s in ['top', 'right', 'bottom', 'left']:
    ax1.spines[s].set_visible(False)
    
    
# ------------------------
ax1 = fig.add_subplot(spec[3,:])
role_ser_us= us_df['Q5'].value_counts()
role_ser_us = role_ser_us.reindex(['Developer Relations/Advocacy','Statistician',
                             'Data Engineer','Business Analyst','Research Scientist','Data Analyst','Software Engineer','Student',
              'Data Scientist','Other','Currently not employed','Machine Learning Engineer','Program/Project Manager','Product Manager','DBA/Database Engineer'])
role_idx_us = role_ser_us.index
role_vals_us = role_ser_us.values
role_idx_us = ['Developer Relations','Statistician',
                             'Data Engineer','Business Analyst','Research Scientist','Data Analyst','Software Engineer','Student',
              'Data Scientist','Other','Unemployed','ML Engineer','Project Manager','Product Manager','Database Engineer']

ax1.bar( role_idx_us, role_vals_us, color=palette.Batlow_15.hex_colors)
ax1.set_xticklabels(role_idx_us, rotation=40)
ax1.set_yticklabels([])
ax1.set_title("USA")

for s in ['top', 'right', 'bottom', 'left']:
    ax1.spines[s].set_visible(False)


# Text

fig.text(0.28, 0.89, 'Respondent Comparison - India vs USA', fontsize=17, fontweight='bold', fontfamily='sans-serif')
fig.text(0.28, 0.58, 
'''Well, this was a little unexpected. While I did predict differences in 
the distributions, I expected a much bigger difference in the gender gap 
between the two countries.
However, we see that while the gender split remains almost the same (with a
slightly higher share of women in the USA), Kaggle users from the United 
States seem to be more educated (or specialised). They tend to have a 
Master's or Doctoral degree, compared to the primarily undergraduate 
students from India.
This also falls in line with the role plots at the bottom: the USA has 
more Kaggle users who are working professionally.

However, will these results hold up in other comparisons? Are these
differences common to all developing vs developed countries?
'''
, fontsize=14, fontweight='light', fontfamily='sans-serif')

fig.tight_layout() 
plt.subplots_adjust(wspace=1)
plt.show()

Now that we've seen the differences in India vs USA, let's see if these results are consistent across other developing vs developed countries.

For this, I will take a look at Nigeria and Brazil (developing) alongside Japan and the UK (developed).

In [None]:
nig_df = df[df['Q3'] == 'Nigeria']
brazil_df = df[df['Q3'] == 'Brazil']
uk_df = df[df['Q3'] == 'United Kingdom of Great Britain and Northern Ireland']
japan_df = df[df['Q3'] == 'Japan']

fig = plt.figure(figsize=(18, 15))
spec = gridspec.GridSpec(ncols=3, nrows=4, figure=fig)


create_country_plots(nig_df, 0, palette.Acton_3.hex_colors, palette.Acton_7.hex_colors,palette.Acton_15.hex_colors, "Nigeria", fig, spec)
create_country_plots(brazil_df, 1, palette.Bamako_3.hex_colors, palette.Bamako_7.hex_colors,palette.Bamako_15.hex_colors, "Brazil", fig, spec)
create_country_plots(uk_df, 2, palette.Batlow_3.hex_colors, palette.Batlow_7.hex_colors,palette.Batlow_15.hex_colors, "United Kingdom", fig, spec)
create_country_plots(japan_df, 3, palette.Davos_3.hex_colors, palette.Davos_7.hex_colors,palette.Davos_15.hex_colors, "Japan", fig, spec)


fig.tight_layout() 
plt.subplots_adjust(wspace=0.3)
plt.show()

# Conclusions from Spread across Countries

Our comparisons of Nigeria, Brazil, Japan and the UK are in line with the conclusions drawn from the US-India distributions earlier. In developed countries, Kaggle users tend to be more educated/specialized, while in developing countries they are usually undergraduate students. Moreover, users are mostly students in developing countries and relatively more spread out across professional fields in developed countries (this difference, however, is marginal).

Another interesting aspect to note is that while Kaggle users in the other countries are around 75%-80% male, almost 90% of Kaggle users from Japan are male.
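
Since the plots above hide the y-axis values, shares like these are easier to read off a table. A minimal sketch using `pd.crosstab` with row-wise normalisation follows; the toy frame is made up for illustration and does not reproduce the survey figures.

```python
import pandas as pd

# Toy data standing in for df[['Q3', 'Q2']] (made-up, not survey results)
toy = pd.DataFrame({
    'Q3': ['Japan', 'Japan', 'Japan', 'Japan', 'Brazil', 'Brazil', 'Brazil', 'Brazil'],
    'Q2': ['Man', 'Man', 'Man', 'Woman', 'Man', 'Man', 'Woman', 'Woman'],
})

# normalize='index' makes each country's row sum to 1
gender_share = pd.crosstab(toy['Q3'], toy['Q2'], normalize='index')
print(gender_share)
```

On the real data, `pd.crosstab(df['Q3'], df['Q2'], normalize='index')` would give one row per country.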

# ML Models in Industries

Alright, you're probably bored of seeing the same visualisations for different countries. Let's take a look at the models people prefer. 

How does usage of ML models differ across industries and company sizes? While we aren't performing hypothesis testing in this notebook, my initial thought is that smaller companies, which don't usually have a dedicated data team, probably don't work with these models much.

Also, for industry, one would expect organisations in Computers/Technology to use neural network architectures, with the financial industry using basic regression and classification models.

In [None]:
# Code from Ruchi798's notebook

def visualize_relation(start_slice, end_slice, new_col_names, old_col, new_col, xlabel, title, p1,p2):
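    # df already has the question row removed, so no further row slicing is needed
    df_sliced = df.iloc[:,start_slice:end_slice]

    # Multi-select answers: a selected option holds text, an unselected one is NaN
    df_sliced = df_sliced.rename(columns=new_col_names).notna().astype(int)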
    df_sliced = df_sliced.join(df[old_col])

    df_sliced_stats = pd.DataFrame()
    for col in df_sliced.columns[:-1]:
        df_sliced_stats[col] = df_sliced.groupby(old_col)[col].mean().values

    df_sliced = df_sliced.rename(columns={old_col:new_col})
    df_sliced_stats.index = df_sliced.groupby(new_col)[list(new_col_names.items())[0][1]].mean().index

    cmap = sns.diverging_palette(p1, p2, as_cmap=True)
    display(df_sliced_stats.style.background_gradient(cmap, axis=0).format("{:.0%}"))

    df_sliced_stats[new_col] = df_sliced_stats.index
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111)
    for i in range(len(df_sliced_stats.columns[:-1])):
        color = Tableau_20.hex_colors[i]
        col = df_sliced_stats.columns[i]
        df_sliced_stats.plot(kind="scatter", x=col,y=new_col, color=color, label=col,ax=ax, s=100)
    ax.xaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y))) 
    ax.set_xlabel(xlabel)
    ax.legend(loc='upper right',bbox_to_anchor=(1.35, 1), frameon=False)
    ax.set_title(title,font="Serif")
    plt.show()

In [None]:
new_col_names ={'Q17_Part_1': 'Lin/Log Reg',
                'Q17_Part_2': 'Dec Trees/Random For',
                'Q17_Part_3': 'GBM',
                'Q17_Part_4': 'Bayesian Approaches',
                'Q17_Part_5': 'Evolutionary Approaches',
                'Q17_Part_6': 'Dense NN',
                'Q17_Part_7': 'CNN',
                'Q17_Part_8': 'GAN',
                'Q17_Part_9': 'RNN',
                'Q17_Part_10': 'Transformer Networks',
                'Q17_Part_11': 'None',
                'Q17_OTHER': 'Other'
                }

visualize_relation(90,102, new_col_names, 'Q20', 'Industry', "Usage of Models", "Industry vs Model Used", 240, 10)
visualize_relation(90,102, new_col_names, 'Q21', 'Company Size', "Usage of Models", "Company Size vs Model Used", 240, 10)

The correlations seem to be in line with our initial thoughts. Additionally, Security and Defense industries focus on GANs and CNNs, probably owing to their heavy involvement with facial recognition software.

Otherwise, the distributions of model usage seem broadly consistent across industries as well as company sizes.
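
The percentages in the tables above are, in essence, the share of respondents in each group who ticked a given model. A stripped-down sketch of that calculation is shown here on made-up data; the column names mirror the survey's `Q17_Part_*` and `Q20` columns, but the rows are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy multi-select data: a ticked option holds its text, an unticked one is NaN
toy = pd.DataFrame({
    'Q20': ['Finance', 'Finance', 'Tech', 'Tech'],
    'Q17_Part_1': ['Linear or Logistic Regression', np.nan,
                   'Linear or Logistic Regression', 'Linear or Logistic Regression'],
    'Q17_Part_7': [np.nan, np.nan, 'Convolutional Neural Networks', np.nan],
})
model_cols = ['Q17_Part_1', 'Q17_Part_7']

# 1 if the option was ticked, 0 otherwise
selected = toy[model_cols].notna().astype(int)

# The mean of a 0/1 column within a group is the share of that group that ticked it
usage_rate = selected.groupby(toy['Q20']).mean()
print(usage_rate)
```

Note that respondents who were never shown the question count as zeros here, so the rates are shares of all respondents in a group and not only of those who answered.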

# Next Steps

This has just been an initial dive into how usage differs between countries. There are still some insights we haven't looked at, and I will be adding them to this notebook soon, such as:

- We saw similarities within the groups of developing and developed countries respectively. But what makes each country stand out? In other words, how does an economy like China, with its large population, hold up against Nigeria? Similarly, how does the USA compare to the UK, Canada and other first world nations?
- How do salaries differ across countries, and how do they correlate with different job titles?
- Are there certain industries that do better in some countries and are relatively higher paying?
- Time Series -- How have all of these trends changed over the years (comparisons with Kaggle survey results from 2019 and 2020)?
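
For the time-series item, one possible starting point is to compute the same share for each survey year and line the results up. The sketch below is only an outline: `share_by_year` is a hypothetical helper, the toy frames are made up, and it assumes the chosen column means the same thing in every year, which needs checking because question numbering can change between surveys.

```python
import pandas as pd

def share_by_year(frames, column, value):
    """Return the fraction of rows where `column` equals `value`, per survey year."""
    return pd.Series({year: (frame[column] == value).mean()
                      for year, frame in frames.items()})

# Toy frames standing in for {2019: df19, 2020: df20, 2021: df} (made-up data)
toy_frames = {
    2019: pd.DataFrame({'Q3': ['India', 'Japan', 'Japan', 'Brazil']}),
    2020: pd.DataFrame({'Q3': ['India', 'India', 'Japan', 'Brazil']}),
    2021: pd.DataFrame({'Q3': ['India', 'India', 'India', 'Japan']}),
}

india_share = share_by_year(toy_frames, 'Q3', 'India')
print(india_share)
```

The returned Series is indexed by year, so `india_share.plot()` would give a quick trend line.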

# Concluding Remarks

A huge thank you to all the other amazing Kaggle contributors whose notebooks and stories inspired me to make this. I have referred to code from a couple of other contributors and I'm grateful for their input!

If reading this notebook helped spur an idea and there's something you'd like me to analyse, leave a comment and I'll be sure to include it here! Thanks for reading!