# "Who's not here?": analyzing Kaggle user survey 2020 

**TABLE OF CONTENTS**
1. Introduction <br>
2. Survey Data <br>
3. User Age <br>
4. User Gender <br>
5. User Location <br>
6. Users Per Capita <br>
7. User Education <br>
8. User Occupation <br>

## 1. Introduction

This notebook takes on one particular research question regarding the 2020 Kaggle user survey. The question posed to the data is "who is here?" Or, rather, "who is not here?"

Of course, survey data cannot answer the question directly, for reasons of accuracy. We can only analyze data from people who actually took part in the survey, and this group may well differ from the overall Kaggle population to some extent. However, since no validation data on the subject is available, this notebook presupposes that the 2020 survey answers reflect the actual Kaggle user base when it comes to details such as age, gender and education.

One could say that Kaggle has - right from the start - built the principle of inclusion into how the service works. True, the English language barrier may hinder some people from participating, but then again the data itself is not written in English, and all datasets are available to all users. There is no such thing as Kaggle+ or Kaggle Premium regarding data access (knock on digital wood on that one...)

This notebook concentrates on the first five survey answer columns (Q1-Q5). These columns contain data on age, gender, country of residence, level of formal education and occupation title. This data will be analyzed from the viewpoint of the aforementioned research question. Where are Kaggle users *not* from? How old are they *not*? What are they *not* doing regarding education and occupation?

No pre-analysis hypotheses will be made on the research question. Let's just jump in the data pool and start swimming.

*December 15th, 2020* <br>
*Jari Peltola*

## 2. Survey Data

In [None]:
# import modules
import math
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.graph_objs as go
from  matplotlib.ticker import PercentFormatter

import numpy as np
import plotly.express as px

In [None]:
# set column and row display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# load survey dataset
df = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv') 

df.head(10)

The survey questions are included in the zero row of the survey dataset. Since these questions are in detail available in the provided supplementary data, it is sensible to remove the questions from the dataset before analysis. Next we will check the dataframe shape in order to find out how many rows we are actually dealing with.

In [None]:
#get dataframe shape
shape = df.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

It seems that what we are looking for are the last 20036 rows of the dataset, so let's make a copy of the original dataframe that includes them, at the same time dropping the survey questions. 

In [None]:
df_copy = df.copy()
df_copy = df_copy.tail(20036)
df_copy.head()
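
The row count above is hardcoded, so it only fits this exact file. As a minimal sketch of an alternative, assuming the question text always sits in the first row, the answers can be selected by position instead. The small `toy` dataframe below is made up for illustration and stands in for the real survey file.

```python
import pandas as pd

# Made-up stand-in for the survey file: row 0 holds the question text.
toy = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "70+", "18-21"],
    "Q2": ["What is your gender?", "Man", "Woman", "Man"],
})

# Drop the first row by position instead of counting rows from the end.
answers = toy.iloc[1:].reset_index(drop=True)

print(answers.shape)
```

Selecting by position keeps working if the number of answers changes, which a fixed `tail(n)` call does not.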

As mentioned, we are only interested in the first five questions, so we may also exclude other columns from our selection.

In [None]:
# select preferred columns by name
df_copy = df_copy.loc[:,['Q1', 'Q2', 'Q3', 'Q4', 'Q5']]

In the original dataset, age groups are in the format xx-yy, which does not allow us to treat them as what they actually are: numeric entities. There are probably dozens of solutions to this, but I went for a simple one. By changing the "-" character to a decimal point, age groups become decimal figures in the format "xx.yy". Finally, by replacing the "+" character with ".00" in the "70+" category, all groups keep their order after the transformation.

In [None]:
# replace characters (regex=False treats '-' and '+' as plain text, not as regex operators)
df_copy.Q1 = df_copy.Q1.str.replace('-', '.', regex=False)
df_copy.Q1 = df_copy.Q1.str.replace('+', '.00', regex=False)

df_copy.head()

However, the age group column (**Q1**) is still in string format, so we must convert its values to floats.

In [None]:
# convert column datatype to float
df_copy["Q1"] = df_copy.Q1.astype(float)

# check datatypes
df_copy.dtypes
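
The float encoding is one way to make the age groups sortable. Another option, shown here only as a sketch on made-up values, is an ordered categorical that keeps the readable labels. The sketch assumes every label starts with a two-digit lower bound, which holds for labels such as "18-21" and "70+".

```python
import pandas as pd

ages = pd.Series(["18-21", "70+", "25-29", "60-69", "25-29"])  # made-up sample values

# Order the labels by their numeric lower bound (the two leading digits).
labels = sorted(ages.unique(), key=lambda label: int(label[:2]))

# An ordered categorical keeps the original labels and still sorts by age.
age_cat = pd.Series(pd.Categorical(ages, categories=labels, ordered=True))

print(labels)
print(age_cat.sort_values().tolist())
```

With this approach plots and tables show "25-29" rather than 25.29, at the cost of losing the plain numeric dtype.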

## 3. User Age

The first answer column (Q1) has age group data on survey answers. Let's first check out in detail what these age groups are.

In [None]:
# check unique age groups
df_copy['Q1'].unique()

If the age groups are put in order, one can see that there are five-year gaps between the younger groups, whereas at the other end the gap is ten years.

In [None]:
# sort values on age groups
df_copy = df_copy.sort_values(by = 'Q1')

df_copy['Q1'].unique()

We can also take a visual look at the same data, since this will quickly tell us a little something about the size of the age groups.

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(16.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Q1',
              data = df_copy,
              order = df_copy['Q1'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: age groups",fontsize=20)
plot.set_xlabel("Age group",fontsize=18)
plot.set_ylabel("Number of answers",fontsize=18)
plot.tick_params(labelsize=14)

# show plot
plt.show()

Let's have the same data in numbers. Now the age groups are sorted by the number of users in each group, from largest to smallest.

In [None]:
# calculate age value count
age_count = df_copy['Q1'].value_counts()

age_count

If all this is converted into percentages, we can see that some 80 percent of Kaggle users are between age 18-40. 

In [None]:
# calculate age value count percentage
age_perc = df_copy['Q1'].value_counts(normalize=True) * 100

age_perc
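
A claim such as "some 80 percent are in the youngest groups" is easiest to check with a running total. The following is a minimal sketch on made-up answers that use the notebook's float encoding of age groups; the figures are illustrative and are not the survey results.

```python
import pandas as pd

# Made-up answers in the float encoding used in this notebook.
q1 = pd.Series([18.21, 18.21, 22.24, 25.29, 25.29, 25.29, 30.34, 35.39, 40.44, 50.54])

# Share per group, ordered by age rather than by group size.
share = q1.value_counts(normalize=True).sort_index() * 100

# Running total: the share of respondents at or below each age group.
cumulative = share.cumsum()

print(cumulative.round(1))
```

Reading the cumulative column at a given age group gives the combined share of that group and all younger ones.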

Also, only about seven percent of users are 50 years or older. That is definitely the first significant answer regarding our research question. As an off-data personal note, this is quite remarkable since home computers have been a common item in many countries for the last 40 years or so. It seems that the kids who once saw the movie ***War Games*** never made it to Kaggle...

Age groups sorted by count will come in handy a bit later. Therefore a new temporary dataframe **df_age_temp** is created from the same value counts as **age_count**. In that dataframe, the age group column (sorted by count) is named **age_group_sorted**.

In [None]:
# count answers per age group; naming the index first keeps the
# resulting column names the same across pandas versions
df_age_temp = (df_copy['Q1'].value_counts()
               .rename_axis('age_group_sorted')
               .reset_index(name='age_count'))

df_age_temp.head()

Next the sorted age group column is converted to list form, and we name the list **list_1**. We will also flatten the list for later use.

In [None]:
# make the list
list_1 = df_age_temp['age_group_sorted'].values.tolist()

# flatten lists
list_1 = np.array(list_1).flatten()

After that we make two other lists (**list_2** and **list_3**) based on users' age and flatten them as well. These lists will include the age count as well as age percentage values we calculated earlier.

In [None]:
# make the lists
list_2 = age_count.tolist()
list_3 = age_perc.tolist()

# flatten lists
list_2 = np.array(list_2).flatten()
list_3 = np.array(list_3).flatten()

Next we will create a new dataframe with three empty columns. The column names correspond to the three lists we just made.

In [None]:
df_kaggle_age = pd.DataFrame(columns=['age_group', 'age_count', 'age_perc'])

df_kaggle_age.head()

We will use the age group data as the first column of our new dataframe, with the two other lists serving as the remaining columns.

For clarity, the column descriptions are:

**age_group** (the age group provided in the original dataset, in sorted order)<br>
**age_count** (the number of answers per age group)<br>
**age_perc** (the percentage of each age group by number of answers in it)

In [None]:
# new column
df_kaggle_age['age_group'] = np.array(list_1)

# two new columns
df_kaggle_age['age_count'] = np.array(list_2)
df_kaggle_age['age_perc'] = np.array(list_3)

# round percentage column to one decimal
df_kaggle_age['age_perc'] = df_kaggle_age['age_perc'].round(decimals=1)

df_kaggle_age.head(11)

The value "11" passed to `head` is not a ***Spinal Tap*** reference but, rather, the actual number of age groups available. The earlier plot already showed us what is going on, but it is rather interesting that there is a slight notch in user figures in age group 25-29. Who knows, maybe we should already be talking about different generations of modern-era data scientists...

Next we will sort the age group categories in ascending order to make the dataframe more readable. A quick comparison with the output above confirms that the values in the other columns stay attached to their age groups after sorting.

In [None]:
# sort values on age groups
df_kaggle_age = df_kaggle_age.sort_values(by = 'age_group', ascending = True) 

df_kaggle_age.head(11)
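
The same summary table can also be assembled in one step, without intermediate lists. This is only a sketch on a made-up sample; the column names mirror the ones used above, and the `summary` name is hypothetical.

```python
import pandas as pd

q1 = pd.Series([18.21, 25.29, 25.29, 30.34, 25.29, 18.21])  # made-up sample

# Counts per age group, ordered by age group.
counts = q1.value_counts().sort_index()

# Build all three columns from the same Series so they cannot drift apart.
summary = pd.DataFrame({
    "age_group": counts.index,
    "age_count": counts.to_numpy(),
    "age_perc": (counts / counts.sum() * 100).round(1).to_numpy(),
})

print(summary)
```

Because every column is derived from the same `counts` Series, counts and percentages always refer to the same age group in each row.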

## 4. User Gender

*But what about gender?* Here my layman hypothesis would definitely be that men are the dominant group. Luckily, we can rely on solid data instead of prejudice.

In [None]:
# calculate answers on Q2 column value count
gender_count = df_copy['Q2'].value_counts()

gender_count

In [None]:
# calculate gender percentage
gender_perc = df_copy['Q2'].value_counts(normalize=True) * 100

gender_perc

Men *do* dominate Kaggle when it comes to gender, with women making up less than 20 percent of total users. The other three gender categories account for about two percent of all answers. This is why this notebook for now concentrates mainly on the two gender categories with the most data available. We can also analyze gender and age at the same time.

In [None]:
df_age_man = df_copy[df_copy['Q2'] == 'Man']
#df_age_man = df_age_man.sort_values(by = 'Q1', ascending = True) 

men = df_age_man.groupby('Q1')['Q2'].value_counts()

men

As these figures represent men in Kaggle divided by age group, it is a good idea to store them into our new dataframe. Hence the new column **men_count**. 

In [None]:
# make new list
list_4 = men.tolist()

# flatten the list
list_4 = np.array(list_4).flatten()

# create new column "men_count"
df_kaggle_age['men_count'] = np.array(list_4)

# reset index
df_kaggle_age.reset_index(inplace = True) 

# select and drop selected column
col = ['index']
df_kaggle_age = df_kaggle_age.drop(col, axis=1)

df_kaggle_age.head(11)

In addition to mere value count, we add another column **men_perc**, which shows the same figures as percentages by age group.

In [None]:
# calculate percentage of men 
df_kaggle_age['men_perc'] = (df_kaggle_age['men_count'] / df_kaggle_age['men_count'].sum()) * 100

# round the result to one decimal
df_kaggle_age['men_perc'] = df_kaggle_age['men_perc'].round(decimals=1)

df_kaggle_age.head(11)

As we can see, over 50 percent of men on Kaggle are 29 years or younger. Conversely, only about a fifth of all men are 40 years of age or older. Next we will calculate the same figures for women on Kaggle and create two new columns (named **women_count** and **women_perc**, respectively).

In [None]:
df_age_woman = df_copy[df_copy['Q2'] == 'Woman']
df_age_woman = df_age_woman.sort_values(by = 'Q1', ascending = True) 

women = df_age_woman.groupby('Q1')['Q2'].value_counts()

women

In [None]:
# make new list
list_5 = women.tolist()

# flatten the list
list_5 = np.array(list_5).flatten()

# create new column "women_count"
df_kaggle_age['women_count'] = np.array(list_5)

# reset index
df_kaggle_age.reset_index(inplace = True) 

# select and drop selected column
col = ['index']
df_kaggle_age = df_kaggle_age.drop(col, axis=1)

In [None]:
# calculate percentage of women 
df_kaggle_age['women_perc'] = (df_kaggle_age['women_count'] / df_kaggle_age['women_count'].sum()) * 100

# round the result to one decimal
df_kaggle_age['women_perc'] = df_kaggle_age['women_perc'].round(decimals=1)

df_kaggle_age.head(11)
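
The per-gender columns above are filled by position, which relies on every age group being present for every gender. A cross-tabulation matches rows by label instead. The sketch below uses a made-up `toy` dataframe with the notebook's column names and is meant as an illustration, not as a replacement for the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Q1": [18.21, 18.21, 25.29, 25.29, 25.29, 30.34],   # made-up age groups
    "Q2": ["Man", "Woman", "Man", "Man", "Woman", "Man"],
})

# Rows are age groups, columns are genders; missing combinations become 0.
counts = pd.crosstab(toy["Q1"], toy["Q2"])

# normalize="columns" gives each gender's own age distribution, here in percent.
perc = (pd.crosstab(toy["Q1"], toy["Q2"], normalize="columns") * 100).round(1)

print(counts)
print(perc)
```

In this toy sample there is no woman in the oldest group, and the table simply shows a zero there instead of shifting the remaining values up by one row.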

It is interesting that on Kaggle, ***women under 30 years form a relatively larger group than men under 30***. With the percentages at hand, we may plot them to give the subject matter a more visual context.

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(
    x=df_kaggle_age['age_group'],
    y=df_kaggle_age['men_perc'],
    name='percentage of men',
    marker_color='indianred'
))
fig.add_trace(go.Bar(
    x=df_kaggle_age['age_group'],
    y=df_kaggle_age['women_perc'],
    name='percentage of women',
    marker_color='lightsalmon'
))

fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

annotations = []

annotations.append(dict(xref='paper', yref='paper', x=0.9, y=-0.10,
                              xanchor='center', yanchor='top',
                               text='data: Kaggle User Survey 2020',
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)
fig.update_layout(barmode='group')
fig.update_layout(title_text='<b>Kaggle User Survey 2020</b>:<br>percentage of users by age and gender',
 
                  
      font=dict(family='calibri',
        size=12,
        color='rgb(64,64,64)'),
      legend=dict(
        x=0.75,
        y=0.8,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    ),
    barmode='group',
    bargap=0.15,
    bargroupgap=0.1
)

fig.update_xaxes(showgrid=False, gridwidth=1, gridcolor='LightGrey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')
fig.update_yaxes(title_text='Percentage')
fig.update_xaxes(title_text='Age')
fig.update_yaxes(title_font=dict(size=14))
fig.update_xaxes(title_font=dict(size=14))
#fig.update_layout(xaxis_showgrid=False)
   
fig.show()

## 5. User Location

Now we know a bit more about age and gender, so let's find out something regarding location too. For that, we first create a new dataframe with columns Q1, Q2 and Q3 included.

In [None]:
df_nation = df_copy.loc[:,['Q1', 'Q2', 'Q3']]

df_nation.head(10)

Next we may check out the unique names of locations included in answers.

In [None]:
# check unique locations
df_nation['Q3'].unique()

There are some overlapping locations ('Republic of Korea' and 'South Korea' refer to the same country) as well as some lengthy names, so we will clean them up manually to make them easier to use later.

In [None]:
# replace selected strings
df_nation['Q3'] = df_nation['Q3'].replace(['United States of America'],'United States')
df_nation['Q3'] = df_nation['Q3'].replace(['Viet Nam'],'Vietnam')
df_nation['Q3'] = df_nation['Q3'].replace(['United Kingdom of Great Britain and Northern Ireland'],'United Kingdom')
df_nation['Q3'] = df_nation['Q3'].replace(['Iran, Islamic Republic of...'],'Iran')
df_nation['Q3'] = df_nation['Q3'].replace(['Republic of Korea'],'South Korea')

df_nation['Q3'].unique()
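
The renames can also be collected into a single mapping and applied in one call. The sketch below applies the same five replacements to a made-up Series; `rename_map` is a hypothetical name.

```python
import pandas as pd

countries = pd.Series(["Viet Nam", "Republic of Korea", "South Korea", "India"])  # made-up sample

# One dictionary holds every old-name -> new-name pair.
rename_map = {
    "United States of America": "United States",
    "Viet Nam": "Vietnam",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "Iran, Islamic Republic of...": "Iran",
    "Republic of Korea": "South Korea",
}

# Values not listed in the dictionary are left unchanged.
cleaned = countries.replace(rename_map)

print(cleaned.tolist())
```

Keeping the pairs in one place makes it easier to spot a missing or misspelled country name.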

Now we can inspect more closely where all Kaggle users are coming from.

In [None]:
# check unique location count
nation_count = df_nation['Q3'].value_counts()

nation_count

To get a more comprehensive outlook, we will add population data for each country. This data is retrieved from the ***Our World In Data*** project.

In [None]:
# get population dataset
url_two = "https://covid.ourworldindata.org/data/ecdc/locations.csv"

# load dataset as pandas dataframe
df_population = pd.read_csv(url_two)

# drop columns irrelevant to task at hand
cols = ['countriesAndTerritories', 'population_year']
df_population = df_population.drop(cols, axis=1)

# rename column for future merge
df_population.rename(columns = {'location':'Q3'}, inplace = True) 

df_population.head(10)

Renaming the 'location' column to 'Q3' was simply for compatibility reasons, since it would be somewhat confusing to rename all our Kaggle dataset columns. Now we take a look at its unique values.

In [None]:
df_population['Q3'].unique()

The population data includes 214 unique countries and regions, as we can see below.

In [None]:
#get dataframe shape
shape = df_population.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

At this stage, we don't actually need population as a numeric value, so we temporarily convert it into an object (string) column. *This is by no means necessary, but merging dataframes sometimes affects the format of numeric decimal values, which makes them hard to work with.*

In [None]:
df_population['population'] = df_population['population'].astype(str)

In [None]:
df_population.dtypes

Now a new dataframe **df_users** is created based on the Kaggle user nation data and our new population data. Left merge here means that all countries from the Kaggle user data (left) are included, but only the data for those Kaggle user countries is retrieved from the larger population dataframe (right).

In [None]:
# merge the two dataframes
df_users = pd.merge(df_nation, df_population, how='left')

# replace NaN values with string 'other'
df_users = df_users.fillna('Other')

df_users.head(10)
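
After a left merge it is worth checking which locations found no match on the right-hand side. The sketch below uses the `indicator` option of `pd.merge` on two made-up frames; the population values are placeholders, not real figures.

```python
import pandas as pd

# Made-up user answers and a made-up population table (placeholder numbers).
users = pd.DataFrame({"Q3": ["India", "Finland", "Other", "India"]})
population = pd.DataFrame({
    "Q3": ["India", "Finland"],
    "continent": ["Asia", "Europe"],
    "population": [1000, 10],
})

# indicator=True adds a "_merge" column telling where each row was found.
merged = pd.merge(users, population, on="Q3", how="left", indicator=True)

# Rows marked "left_only" had no counterpart in the population table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Q3"].unique().tolist()

print(unmatched)
```

Any name that shows up in `unmatched` either needs a rename on one side or has no population data at all.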

As a bonus, we now also have a continent column, which enables us to take a wider perspective than mere individual countries. *It is notable though that sometimes this perspective may be too wide: for example, both China and India fall into the Asia category.*

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(16.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'continent',
              data = df_users,
              order = df_users['continent'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: users by continent",fontsize=20)
plot.set_xlabel("Continent",fontsize=18)
plot.set_ylabel("Number of answers",fontsize=18)
plot.tick_params(labelsize=14)

# show plot
plt.show()

As suspected, Asia turns out to be a major user hub. Also, there is the 'Other' category, where no location data was included in the survey answers.

Following this, the next step will be to further inspect individual continents. For this, we will create separate dataframes for each continent.

In [None]:
# create Asia dataframe
df_asia = df_users[df_users['continent'] == 'Asia']

df_asia.head(10)

If age groups are sorted in ascending order, a quick peek at them shows us that users from Asia are - just like the overall age data suggested - mainly in the younger age categories. However, on this continent, the age phenomenon can be observed even more clearly than in other locations.

In [None]:
# sort dataframe
df_asia = df_asia.sort_values(by = 'Q1', ascending = True) 

# show value count
asia = df_asia.groupby('Q1')['continent'].value_counts()

asia

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(16.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Q1',
              data = df_asia,
              order = df_asia['Q1'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: users in Asia by age group",fontsize=20)
plot.set_xlabel("Age group",fontsize=18)
plot.set_ylabel("Number of answers",fontsize=18)
plot.tick_params(labelsize=14)

# show plot
plt.show()

We will next add this continent data to our previous age dataframe.

In [None]:
# make new list
list_6 = asia.tolist()

# flatten list
list_6 = np.array(list_6).flatten()

# new column
df_kaggle_age['asia_count'] = np.array(list_6)

# reset index
df_kaggle_age.reset_index(inplace = True) 

# drop column
col = ['index']
df_kaggle_age = df_kaggle_age.drop(col, axis=1)

# calculate percentage to a new column
df_kaggle_age['asia_perc'] = (df_kaggle_age['asia_count'] / df_kaggle_age['asia_count'].sum()) * 100
df_kaggle_age['asia_perc'] = df_kaggle_age['asia_perc'].round(decimals=1)

df_kaggle_age.head(11)

Next the same method will be applied to Europe as continent.

In [None]:
df_europe = df_users[df_users['continent'] == 'Europe']
df_europe = df_europe.sort_values(by = 'Q1', ascending = True) 

europe = df_europe.groupby('Q1')['continent'].value_counts()

europe

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(16.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Q1',
              data = df_europe,
              order = df_europe['Q1'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: users in Europe by age group",fontsize=20)
plot.set_xlabel("Age group",fontsize=18)
plot.set_ylabel("Number of answers",fontsize=18)
plot.tick_params(labelsize=14)

# show plot
plt.show()

In [None]:
list_7 = europe.tolist()
list_7 = np.array(list_7).flatten()

df_kaggle_age['europe_count'] = np.array(list_7)
df_kaggle_age.reset_index(inplace = True) 

col = ['index']
df_kaggle_age = df_kaggle_age.drop(col, axis=1)

df_kaggle_age['europe_perc'] = (df_kaggle_age['europe_count'] / df_kaggle_age['europe_count'].sum()) * 100
df_kaggle_age['europe_perc'] = df_kaggle_age['europe_perc'].round(decimals=1)

df_kaggle_age.head(11)

We will next continue our journey to North America.

In [None]:
df_north_america = df_users[df_users['continent'] == 'North America']
df_north_america = df_north_america.sort_values(by = 'Q1', ascending = True) 

north_america = df_north_america.groupby('Q1')['continent'].value_counts()

north_america

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(16.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Q1',
              data = df_north_america,
              order = df_north_america['Q1'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: users in North America by age group",fontsize=20)
plot.set_xlabel("Age group",fontsize=18)
plot.set_ylabel("Number of answers",fontsize=18)
plot.tick_params(labelsize=14)

# show plot
plt.show()

In [None]:
list_8 = north_america.tolist()
list_8 = np.array(list_8).flatten()

df_kaggle_age['north_america_count'] = np.array(list_8)
df_kaggle_age.reset_index(inplace = True) 

col = ['index']
df_kaggle_age = df_kaggle_age.drop(col, axis=1)

df_kaggle_age['north_america_perc'] = (df_kaggle_age['north_america_count'] / df_kaggle_age['north_america_count'].sum()) * 100
df_kaggle_age['north_america_perc'] = df_kaggle_age['north_america_perc'].round(decimals=1)

df_kaggle_age.head(11)

There is a slight increase in "older" (read: people about 30 years old) user groups in the North American user data. Coming up next is South America.

In [None]:
df_south_america = df_users[df_users['continent'] == 'South America']
df_south_america = df_south_america.sort_values(by = 'Q1', ascending = True) 

south_america = df_south_america.groupby('Q1')['continent'].value_counts()

south_america

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(16.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Q1',
              data = df_south_america,
              order = df_south_america['Q1'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: users in South America by age group",fontsize=20)
plot.set_xlabel("Age group",fontsize=18)
plot.set_ylabel("Number of answers",fontsize=18)
plot.tick_params(labelsize=14)

# show plot
plt.show()

In [None]:
list_9 = south_america.tolist()
list_9 = np.array(list_9).flatten()

df_kaggle_age['south_america_count'] = np.array(list_9)
df_kaggle_age.reset_index(inplace = True) 

col = ['index']
df_kaggle_age = df_kaggle_age.drop(col, axis=1)

df_kaggle_age['south_america_perc'] = (df_kaggle_age['south_america_count'] / df_kaggle_age['south_america_count'].sum()) * 100
df_kaggle_age['south_america_perc'] = df_kaggle_age['south_america_perc'].round(decimals=1)

df_kaggle_age.head(11)

Next we will take a closer look at Africa. As we can see, along with Asia, Africa is another continent where the very youngest Kaggle members form a relatively large portion of overall users.

In [None]:
df_africa = df_users[df_users['continent'] == 'Africa']
df_africa = df_africa.sort_values(by = 'Q1', ascending = True) 

africa = df_africa.groupby('Q1')['continent'].value_counts()

africa

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(16.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Q1',
              data = df_africa,
              order = df_africa['Q1'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: users in Africa by age group",fontsize=20)
plot.set_xlabel("Age group",fontsize=18)
plot.set_ylabel("Number of answers",fontsize=18)
plot.tick_params(labelsize=14)

# show plot
plt.show()

In [None]:
list_10 = africa.tolist()
list_10 = np.array(list_10).flatten()

df_kaggle_age['africa_count'] = np.array(list_10)
df_kaggle_age.reset_index(inplace = True) 

col = ['index']
df_kaggle_age = df_kaggle_age.drop(col, axis=1)

df_kaggle_age['africa_perc'] = (df_kaggle_age['africa_count'] / df_kaggle_age['africa_count'].sum()) * 100
df_kaggle_age['africa_perc'] = df_kaggle_age['africa_perc'].round(decimals=1)

df_kaggle_age.head(11)

Finally, we head to Oceania.

In [None]:
df_oceania = df_users[df_users['continent'] == 'Oceania']
df_oceania = df_oceania.sort_values(by = 'Q1', ascending = True) 

oceania = df_oceania.groupby('Q1')['continent'].value_counts()

oceania

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(16.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Q1',
              data = df_oceania,
              order = df_oceania['Q1'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: users in Oceania by age group",fontsize=20)
plot.set_xlabel("Age group",fontsize=18)
plot.set_ylabel("Number of answers",fontsize=18)
plot.tick_params(labelsize=14)

# show plot
plt.show()

In [None]:
list_11 = oceania.tolist()
list_11 = np.array(list_11).flatten()

df_kaggle_age['oceania_count'] = np.array(list_11)
df_kaggle_age.reset_index(inplace = True) 

col = ['index']
df_kaggle_age = df_kaggle_age.drop(col, axis=1)

df_kaggle_age['oceania_perc'] = (df_kaggle_age['oceania_count'] / df_kaggle_age['oceania_count'].sum()) * 100
df_kaggle_age['oceania_perc'] = df_kaggle_age['oceania_perc'].round(decimals=1)

df_kaggle_age.head(11)
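
The six continent blocks above repeat the same steps. As a sketch of a more compact route, a single `groupby` can produce all continent columns at once. The `toy` dataframe below is made up; it reuses the notebook's `Q1` and `continent` column names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Q1": [18.21, 18.21, 25.29, 25.29, 30.34, 30.34],   # made-up age groups
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
})

# One table instead of one dataframe per continent:
# rows are age groups, columns are continents, empty cells become 0.
counts = toy.groupby(["Q1", "continent"]).size().unstack(fill_value=0)

# Divide each column by its own total to get within-continent percentages.
perc = (counts / counts.sum() * 100).round(1)

print(counts)
print(perc)
```

Because rows are matched by age group label, a continent with no answers in some age group gets a zero there, whereas assigning a shorter list to a dataframe column would fail with a length mismatch.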

## 6. Users Per Capita

As far as population is concerned, Oceania is not a large continent. Still, the overall user figures in Oceania are quite low, taking into account the region's native language, infrastructure and overall level of education. Also, *people from Oceania under 25 years of age are almost nonexistent on Kaggle, so maybe there's something for the Kaggle regional recruitment team to work on in the future...* There is still the 'Other' location category left, but since it does not provide us with any relevant information on our research question, we will exclude it for now.

Instead we will turn our attention to the relative number of users i.e. per capita user figures on Kaggle. We will start by creating a copy of our original population dataframe, and we will call it **df_pop_cap**.

In [None]:
# new dataframe
df_pop_cap = df_population.copy()

df_pop_cap.head(10)

We will also make a copy of the earlier Kaggle user dataframe and name it **df_user_cap**.

In [None]:
# new dataframe
df_user_cap = df_nation.copy()

# reset index
df_user_cap.reset_index(inplace = True) 
col = ['index']
df_user_cap = df_user_cap.drop(col, axis=1)

df_user_cap.head()

It's always good to recall the datatypes of these two dataframes to avoid further issues.

In [None]:
# show datatypes
df_pop_cap.dtypes

In [None]:
# show datatypes
df_user_cap.dtypes

As noted earlier, not all survey answers included location, meaning there is still the 'Other' category looming around in our dataframe.

In [None]:
# show unique values
df_user_cap['Q3'].unique()

Next the rows with the value 'Other' in location will be removed by inverse selection, i.e. by keeping every row whose location value is something other than 'Other'.

In [None]:
# select preferred rows
df_user_cap = df_user_cap[df_user_cap.Q3 != 'Other']

Next a new dataframe **df_kaggle_people** will be created, including all Kaggle user countries (column: **Q3**) and their respective number of users (column: **Q3_count**).

In [None]:
# create variable for user count
value_counts = df_user_cap['Q3'].value_counts().to_frame()

# convert to df, reset index and assign names to columns
df_kaggle_people = pd.DataFrame(value_counts)
df_kaggle_people = df_kaggle_people.reset_index()
df_kaggle_people.columns = ['Q3', 'Q3_count']

df_kaggle_people.head (10)

Furthermore, this dataframe will be merged with the **df_pop_cap** dataframe. The result is a dataframe **df_kaggle_global**, which includes all individual user countries and their respective populations. Also, as noted, the population value was temporarily converted to object (string) format to avoid any unwanted changes to the numeric format. This is why we will now convert the population value back to numeric (float), and since no decimals are required, we will further convert the float values to integers (int).

In [None]:
# merge dataframes
df_kaggle_global = pd.merge(df_kaggle_people, df_pop_cap, on ="Q3", how='left')

# convert value to float and after that to integer
df_kaggle_global['population'] = df_kaggle_global['population'].astype(float) 
df_kaggle_global['population'] = df_kaggle_global['population'].astype(int) 

df_kaggle_global.head(10)

If we take a look at the dataframe shape, we can see that a total of 53 countries were included in the survey answers. From memory, compared to our larger population dataframe earlier, that is about 25 percent of all nations and sovereign regions included in it. From that perspective - ironically enough - being a Kaggle user means belonging to a global minority. On a more positive note, the Kaggle global marketing team still has plenty of frontier left to explore.

In [None]:
#get dataframe shape
shape = df_kaggle_global.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

Next we will make two lists from **Q3_count** and **population** columns. After that a **per_capita** value will be calculated for each country listed in survey answers.

In [None]:
# values to two lists
list_15 = df_kaggle_global['Q3_count'].values.tolist()
list_16 = df_kaggle_global['population'].values.tolist()

# function to calculate per capita value (as a percentage) from two lists
def per_capita(counts, populations):
    return [(count / pop * 100) for (count, pop) in zip(counts, populations)]

# execute function on list values
CapPerc = per_capita(list_15, list_16)

# round to six decimals
CapPerc = np.round(CapPerc, 6)

# create new column for results
df_kaggle_global['per_capita'] = np.array(CapPerc)

df_kaggle_global.head(10)
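
The same per capita figure can be computed directly on the dataframe columns, without lists or a helper function. The sketch below uses a made-up `toy` dataframe with hypothetical country labels and placeholder numbers.

```python
import pandas as pd

toy = pd.DataFrame({
    "Q3": ["A", "B"],                        # hypothetical country labels
    "Q3_count": [50, 200],                   # placeholder answer counts
    "population": [1_000_000, 40_000_000],   # placeholder populations
})

# Column arithmetic is element-wise: answers divided by population,
# expressed as a percentage and rounded to six decimals.
toy["per_capita"] = (toy["Q3_count"] / toy["population"] * 100).round(6)

print(toy)
```

Here country "A" gets 0.005 (50 answers per million inhabitants, as a percentage) and country "B" gets 0.0005.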

Now we can sort the dataframe on per capita values and take a look at the result. Let's start with the highest per capita values and the respective countries. For this purpose, we will create the dataframe **df_percapita_most** and look at its first ten countries.

In [None]:
# sort dataframe
df_percapita_most = df_kaggle_global.sort_values(by = 'per_capita', ascending = False)

df_percapita_most.head(10)

As we can see, Singapore is very much the clubhouse leader in the per capita category. Israel appears under Asia in the continent column, which matches its geographic location in Western Asia, and the nation's calculated per capita rate is unaffected by the continent label either way. Perhaps unexpectedly, all dataframe continents but South America are represented in the top 10, including Tunisia from Africa and Australia from Oceania.

Let's make a visual presentation of our top 10:

In [None]:
df_percapita_most_10 = df_percapita_most[:10]  

# values to ascending order
df_percapita_most_10 = df_percapita_most_10.sort_values(by ='per_capita', ascending=True)

# define parameters
fig = px.bar(df_percapita_most_10, x='Q3', y ='per_capita', text = 'per_capita', color= 'per_capita', height=600)

# set graphics
fig.data[0].marker.line.width = 0.5
fig.data[0].marker.line.color = "black"

fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(uniformtext_mode='hide')  

fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

# set annotations
annotations = []

# data source
annotations.append(dict(xref='paper', yref='paper', x=0.88, y=-0.12,
                              xanchor='center', yanchor='top',
                              text='data: Kaggle user survey 2020',
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

# set plot title
fig.update_layout(
    title='<b>Kaggle User Survey 2020</b>:<br>countries with most per capita users',
                font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

# set axis titles etc.
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')

fig.update_yaxes(title_text='Per capita users')
fig.update_xaxes(title_text='Country')

fig.update_yaxes(title_font=dict(size=14))
fig.update_xaxes(title_font=dict(size=14))

fig.update_layout(coloraxis_colorbar=dict(
    title="per capita"    
))

fig.update_layout(xaxis_showgrid=False)

# show figure
fig.show()

Let's take a look at the other end of the per capita user figures.

In [None]:
# select countries
df_percapita_least_10 = df_percapita_most[-10:]  

df_percapita_least_10.head(20)
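
As an alternative to sorting the whole dataframe and slicing it from both ends, pandas offers `nlargest` and `nsmallest`. The sketch below uses an invented `df_demo` with made-up per capita values; it only illustrates the two calls and is not the notebook's own selection.

```python
import pandas as pd

# hypothetical data with invented per capita values
df_demo = pd.DataFrame({
    "Q3": ["Country A", "Country B", "Country C", "Country D"],
    "per_capita": [0.004, 0.0002, 0.001, 0.00005],
})

# rows with the highest values, ordered from largest to smallest
top_2 = df_demo.nlargest(2, "per_capita")

# rows with the lowest values, ordered from smallest to largest
bottom_2 = df_demo.nsmallest(2, "per_capita")

print(top_2)
print(bottom_2)
```

Note that `nsmallest` returns rows in ascending order, whereas slicing the tail of a descending sort keeps them in descending order.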

In [None]:
# values to ascending order
df_percapita_least_10 = df_percapita_least_10.sort_values(by ='per_capita', ascending=True)

# define parameters
fig = px.bar(df_percapita_least_10, x='Q3', y ='per_capita', text = 'per_capita', color= 'per_capita', height=600)

# set graphics
fig.data[0].marker.line.width = 0.5
fig.data[0].marker.line.color = "black"

fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(uniformtext_mode='hide')  

fig.update_layout(
    xaxis=dict(
        showline=True,
        showgrid=False,
        showticklabels=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

# set annotations
annotations = []

# data source
annotations.append(dict(xref='paper', yref='paper', x=0.88, y=-0.12,
                              xanchor='center', yanchor='top',
                              text='data: Kaggle user survey 2020',
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

# set plot title
fig.update_layout(
    title='<b>Kaggle User Survey 2020</b>:<br>countries with least per capita users',
                font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

# set axis titles etc.

fig.update_layout(
    yaxis = dict (
            range = [0.00001, 0.0002 
                    ]
    ))

fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')

fig.update_yaxes(title_text='Per capita users')
fig.update_xaxes(title_text='Country')

fig.update_yaxes(title_font=dict(size=14))
fig.update_xaxes(title_font=dict(size=14))

fig.update_layout(coloraxis_colorbar=dict(
    title="per capita"    
))

fig.update_layout(xaxis_showgrid=False)

# show figure
fig.show()

Here we can see the effect of sheer population size, since China is listed as the very last country in per capita rate. A total population of roughly 1.4 billion people easily outweighs the country's Kaggle user base, no matter how large that base is.
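
To make the arithmetic concrete, here is a small worked example. The country names and all figures are invented for illustration; only the formula (answers divided by population, times 100) follows the per capita calculation used earlier.

```python
# invented figures for illustration only
countries = {
    "Large country": {"answers": 500, "population": 1_400_000_000},
    "Small country": {"answers": 50, "population": 5_000_000},
}

rates = {}
for name, row in countries.items():
    # answers as a percentage of population
    rates[name] = row["answers"] / row["population"] * 100
    # the same ratio expressed per million inhabitants, which is easier to read
    per_million = row["answers"] / row["population"] * 1_000_000
    print(f"{name}: {rates[name]:.6f} % of population, {per_million:.2f} answers per million")
```

Even with ten times as many answers, the large country ends up with a far lower per capita rate, because the denominator grows much faster than the numerator.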

As such, China can be considered an outlier in this category. However, even China reflects the basic idea of the 'digital divide' and how it shapes the world, since many Chinese people haven't even heard of Kaggle or don't know much about data science as a subject. The digital divide inside a country is just as real as the one between sovereign regions.

On a global scale, access to broadband internet, for example, is still very much a privilege rather than a basic commodity, and without proper infrastructure learning data science online easily becomes an unattainable task. In short, many people who would very much like to be on Kaggle simply cannot do so for now. Just think about an imaginary dataset on a refugee crisis. The very subjects, the refugees themselves, will not have a say in the analysis, since they are more concerned with how to feed their families than with learning data analysis.

## 7. User Education

Next our focus will turn to education as a factor in all this. To start with, we will create a copy of our earlier Kaggle dataframe and name it **df_edu**.

In [None]:
# new dataframe
df_edu = df_copy.copy()

# edit columns
df_edu["Q1"] = df_edu.Q1.astype(float)
df_edu['Q3'] = df_edu['Q3'].replace(['United States of America'],'United States')
df_edu['Q3'] = df_edu['Q3'].replace(['Viet Nam'],'Vietnam')
df_edu['Q3'] = df_edu['Q3'].replace(['United Kingdom of Great Britain and Northern Ireland'],'United Kingdom')
df_edu['Q3'] = df_edu['Q3'].replace(['Iran, Islamic Republic of...'],'Iran')
df_edu['Q3'] = df_edu['Q3'].replace(['Republic of Korea'],'South Korea')

df_edu.head(10)

The column **Q4** will tell us more about the level of education among Kaggle users. Let's first check out the unique values included in the column. After that we calculate value count as well as percentage.

In [None]:
# check unique values
edu_group = df_edu['Q4'].unique()

edu_group

In [None]:
edu_count = df_edu['Q4'].value_counts()

edu_count

In [None]:
edu_perc = df_edu['Q4'].value_counts(normalize=True) * 100

edu_perc

The primary observation here is the high level of formal education among Kaggle users. Answers of bachelor's degree and above cover almost 90 percent of all answers in the 2020 survey. This alone gives us one important answer to our research question: *if you have not attended at least college, you most likely are not on Kaggle*.
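
A combined share like this can be computed in one step with `isin`. The sketch below is built on invented answers; the answer wording (including the typographic apostrophe) is an assumption about how the survey stores its choices, and the resulting percentage says nothing about the real data.

```python
import pandas as pd

# invented answers, assuming the survey's answer wording
answers = pd.Series(
    ["Master’s degree"] * 4
    + ["Bachelor’s degree"] * 3
    + ["Doctoral degree"] * 1
    + ["Some college/university study without earning a bachelor’s degree"] * 1
    + ["No formal education past high school"] * 1
)

degree_levels = ["Bachelor’s degree", "Master’s degree", "Doctoral degree"]

# isin() marks matching rows; the mean of booleans is the matching share
degree_share = answers.isin(degree_levels).mean() * 100

print(f"Bachelor's degree or higher: {degree_share:.1f} %")
```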

The second observation is the plethora of special characters in the answer choices. Next we will concentrate on some rigorous data cleaning to make things a bit easier later on. Before that, however, we will merge the **df_edu** dataframe with **df_kaggle_global**, since this gives us more useful data to work with.
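
Before a left merge like this, it can help to check which answers have no match in the country table. The following is a minimal sketch with invented stand-ins for the two dataframes; it relies on the `indicator` parameter of `pd.merge`.

```python
import pandas as pd

# invented stand-ins for df_edu and df_kaggle_global
answers = pd.DataFrame({
    "Q3": ["Country A", "Country B", "Other"],
    "Q4": ["x", "y", "z"],
})
countries = pd.DataFrame({
    "Q3": ["Country A", "Country B"],
    "population": [5_000_000, 100_000_000],
})

# indicator=True adds a _merge column telling where each row was found
merged = pd.merge(answers, countries, on="Q3", how="left", indicator=True)

# rows present only in the survey answers have no population value
unmatched = merged.loc[merged["_merge"] == "left_only", "Q3"].tolist()

print(unmatched)
```

Unmatched rows carry `NaN` in the population column, and converting a column with `NaN` to integer raises an error, so such rows need to be filtered out before the conversion.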

In [None]:
# merge dataframes
df_edu_location = pd.merge(df_edu, df_kaggle_global, on ="Q3", how='left')

# select values
df_edu_location = df_edu_location[df_edu_location.Q3 != 'Other']

# population column to integer
df_edu_location['population'] = df_edu_location['population'].astype(int) 

# replace special characters (regex=True so the patterns are treated as character classes)
df_edu_location['Q4'] = df_edu_location['Q4'].str.replace(r"[’',]", '', regex=True)
df_edu_location['Q4'] = df_edu_location['Q4'].str.replace(r"[/',]", '_', regex=True)
df_edu_location['Q5'] = df_edu_location['Q5'].str.replace(r"[/',]", '_', regex=True)

# condense answers for easier use
df_edu_location['Q4'] = df_edu_location['Q4'].str.replace('Some college_university study without earning a bachelors degree','Some college_uni')
df_edu_location['Q4'] = df_edu_location['Q4'].str.replace('No formal education past high school','High school')

# drop NaN values
df_edu_location = df_edu_location.dropna()

# drop column
col = ['Q3_count']
df_edu_location = df_edu_location.drop(col, axis=1)

df_edu_location.head()
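
The effect of the character cleanup is easiest to see on a couple of sample strings. This is a simplified sketch: the two answers are assumed to match the survey's wording, and only the slash is replaced with an underscore here. Since pandas 2.0, `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed, so the flag is set explicitly.

```python
import pandas as pd

# sample answers, assumed to match the survey's wording
raw = pd.Series([
    "Master’s degree",
    "Some college/university study without earning a bachelor’s degree",
])

# remove apostrophes and commas; the pattern is a regex character class
cleaned = raw.str.replace(r"[’',]", "", regex=True)

# replace slashes with underscores (a literal match, so no regex is needed)
cleaned = cleaned.str.replace("/", "_", regex=False)

print(cleaned.tolist())
```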

After proper cleanup, let's check the education level value counts again. This time we will get a more visual presentation and literally add gender to the picture.

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(10.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(y = 'Q4',
              data = df_edu_location,
              hue = 'Q2',
              order = df_edu_location['Q4'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: users by gender and level of formal education",fontsize=20)
plot.set_xlabel("Number of answers",fontsize=18)
plot.set_ylabel("Level of education",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend(loc='lower right')

# show plot
plt.show()
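
Raw counts in a grouped bar chart make it hard to compare gender proportions between education levels of very different sizes. A normalized cross-tabulation shows the shares directly. The sketch below uses invented rows; the column names follow the notebook, and the gender labels are assumed to match the survey's answer choices.

```python
import pandas as pd

# invented rows for illustration
df_demo = pd.DataFrame({
    "Q4": ["Masters degree"] * 4 + ["Bachelors degree"] * 4,
    "Q2": ["Man", "Man", "Man", "Woman", "Man", "Man", "Woman", "Woman"],
})

# normalize="index" turns each row into shares that sum to 1
gender_share = pd.crosstab(df_demo["Q4"], df_demo["Q2"], normalize="index") * 100

print(gender_share)
```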

It is well known that definitions of formal education depend on the particular education system. For example, a master's degree in the United States differs from the same degree in Finland when it comes to actual degree structure and workload. Some universities also offer doctoral degrees that are closer to what students elsewhere complete as a master's degree. Therefore these degrees cannot be directly compared with each other.

A quick peek tells us that the proportion of men and women stays roughly the same regardless of education level. Next we will take a look at where the doctoral degree users on Kaggle come from.

In [None]:
# sort values
df_doctoral = df_edu_location[df_edu_location['Q4'] == 'Doctoral degree']
df_doctoral = df_doctoral.sort_values(by = 'Q4', ascending = True) 

# group values
doctoral = df_doctoral.groupby('Q4')['continent'].value_counts()

doctoral

At the continent level, Asia and Europe are the leaders in Kaggle users with a doctoral degree. Next the same comparison will be applied to master's and bachelor's degrees.

In [None]:
# sort values
df_masters = df_edu_location[df_edu_location['Q4'] == 'Masters degree']
df_masters = df_masters.sort_values(by = 'Q4', ascending = True) 

# group values
masters = df_masters.groupby('Q4')['continent'].value_counts()

masters

In [None]:
# sort values
df_bachelors = df_edu_location[df_edu_location['Q4'] == 'Bachelors degree']
df_bachelors = df_bachelors.sort_values(by = 'Q4', ascending = True) 

# group values
bachelors = df_bachelors.groupby('Q4')['continent'].value_counts()

bachelors
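
The three separate counts above can also be gathered into one table with `pd.crosstab`. This is a minimal sketch on invented rows, using the cleaned degree labels and the column names from the notebook.

```python
import pandas as pd

# invented rows for illustration
df_demo = pd.DataFrame({
    "Q4": [
        "Doctoral degree", "Doctoral degree",
        "Masters degree", "Masters degree", "Masters degree",
        "Bachelors degree",
    ],
    "continent": ["Asia", "Europe", "Asia", "Asia", "Europe", "Asia"],
})

# keep only the three degree levels of interest
degrees = ["Doctoral degree", "Masters degree", "Bachelors degree"]
subset = df_demo[df_demo["Q4"].isin(degrees)]

# one row per continent, one column per degree level
degree_by_continent = pd.crosstab(subset["continent"], subset["Q4"])

print(degree_by_continent)
```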

Let's put the same data in a more visual form and add gender to the mix.

In [None]:
# create dataframes
df_edu_doctoral = df_edu_location[(df_edu_location['Q4']=='Doctoral degree')]
df_edu_masters = df_edu_location[(df_edu_location['Q4']=='Masters degree')]
df_edu_bachelors = df_edu_location[(df_edu_location['Q4']=='Bachelors degree')]

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(10.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(y = 'continent',
              data = df_edu_doctoral,
              hue = 'Q2',
              order = df_edu_doctoral['continent'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: gender and doctoral degree by continent",fontsize=16)
plot.set_xlabel("Number of answers",fontsize=18)
plot.set_ylabel("Continent",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend(loc='lower right')

# show plot
plt.show()

It seems that men with a doctoral degree living in Europe are more involved in Kaggle than their peers in North America. Let's see what the master's degree looks like.

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(10.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(y = 'continent',
              data = df_edu_masters,
              hue = 'Q2',
              order = df_edu_masters['continent'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: gender and master's degree by continent",fontsize=20)
plot.set_xlabel("Number of answers",fontsize=18)
plot.set_ylabel("Continent",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend(loc='lower right')

# show plot
plt.show()

Again, women with a master's degree in North America and Europe are just about as active as each other, but men in North America fall short of their European peers. However, it may well be that men in North America simply did not take part in the 2020 survey in the first place.

Let's check if bachelor's degree gives a different outlook. 

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(10.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(y = 'continent',
              data = df_edu_bachelors,
              hue = 'Q2',
              order = df_edu_bachelors['continent'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: gender and bachelor's degree by continent",fontsize=20)
plot.set_xlabel("Number of answers",fontsize=18)
plot.set_ylabel("Continent",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend(loc='lower right')

# show plot
plt.show()

Here it is safe to assume that the aforementioned differences between regional education systems affect the result. A bachelor's degree may be a more popular choice in Asia, whereas the study structure elsewhere may not always even recognize such a degree.

To conclude our remarks on education, it looks like one of the key factors in participating in Kaggle activities. Creating access points for people from more varied backgrounds than formal college or university education might therefore be something to think about in the future, were I part of a Kaggle team working on regional equality, diversity and inclusion.

## 8. User Occupation

Finally, we take a deeper dive into the occupation data. First we will make a new copy of our dataframe and name it **df_occupation**.

In [None]:
df_occupation = df_edu_location.copy()

df_occupation.head(10)

Next we will get the unique values, value counts and percentages in Q5 column.

In [None]:
df_occupation['Q5'].unique()

In [None]:
occupation = df_occupation['Q5'].value_counts()

occupation

In [None]:
occupation_perc = df_occupation['Q5'].value_counts(normalize=True) * 100

occupation_perc

Given the strong representation of younger age groups, it comes as no surprise that roughly a quarter of Kaggle users are students. The categories 'Other' and 'Currently not employed' are also well represented. It is also interesting that less than 20 percent of Kaggle users said their occupation is either data scientist or data analyst. Of course there are caveats in all this: people may apply data analysis in their work regardless of their formal job title.
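
A combined share for several job titles can be read off the normalized value counts. The sketch below uses invented answers, so its percentages do not reflect the survey; the job titles are assumed to match the survey's answer choices.

```python
import pandas as pd

# invented answers for illustration
occupations = pd.Series(
    ["Student"] * 5
    + ["Data Scientist"] * 2
    + ["Data Analyst"] * 1
    + ["Software Engineer"] * 2
)

# share of each title as a percentage of all answers
shares = occupations.value_counts(normalize=True) * 100

# reindex with fill_value=0 so a title missing from the data counts as zero
data_roles_share = shares.reindex(["Data Scientist", "Data Analyst"], fill_value=0).sum()

print(f"Data scientists and data analysts: {data_roles_share:.1f} %")
```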

Next we will look at occupation by gender, much as we just did with education.

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(10.7,10.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(y = 'Q5',
              data = df_occupation,
              hue = 'Q2',
              order = df_occupation['Q5'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: users by gender and occupation",fontsize=20)
plot.set_xlabel("Number of answers",fontsize=18)
plot.set_ylabel("Occupation",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend(loc='lower right')

# show plot
plt.show()

Next we will select the three most common occupation categories (student, data scientist, software engineer) and see what they look like from the viewpoint of gender and continent.

In [None]:
# create dataframes
df_occupation_student = df_occupation[(df_occupation['Q5']=='Student')]
df_occupation_ds = df_occupation[(df_occupation['Q5']=='Data Scientist')]
df_occupation_se = df_occupation[(df_occupation['Q5']=='Software Engineer')]
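
Since the same filter is applied once per occupation, the numbers behind the following plots could also come from one small helper. The function `counts_by_continent_and_gender` is a hypothetical name introduced here, and the rows are invented; only the column names follow the notebook.

```python
import pandas as pd

def counts_by_continent_and_gender(df, occupation):
    """Count answers per continent and gender for one occupation title."""
    subset = df[df["Q5"] == occupation]
    return pd.crosstab(subset["continent"], subset["Q2"])

# invented rows for illustration
df_demo = pd.DataFrame({
    "Q5": ["Student", "Student", "Student", "Data Scientist"],
    "Q2": ["Man", "Woman", "Man", "Woman"],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
})

student_counts = counts_by_continent_and_gender(df_demo, "Student")

print(student_counts)
```

The resulting table holds the same counts that a count plot with continent on one axis and gender as hue would draw.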

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(10.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(y = 'continent',
              data = df_occupation_student,
              hue = 'Q2',
              order = df_occupation_student['continent'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: students by gender and continent",fontsize=16)
plot.set_xlabel("Number of answers",fontsize=18)
plot.set_ylabel("Continent",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend(loc='lower right')

# show plot
plt.show()

Women students in North America seem to be more involved in Kaggle than their European peer group. Otherwise the figures are in line with what we already know from our previous work. Let's see what the data scientist selection brings out.

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(10.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(y = 'continent',
              data = df_occupation_ds,
              hue = 'Q2',
              order = df_occupation_ds['continent'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: data scientists by gender and continent",fontsize=16)
plot.set_xlabel("Number of answers",fontsize=18)
plot.set_ylabel("Continent",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend(loc='lower right')

# show plot
plt.show()

Among Kaggle users, women data scientists in Asia are a relatively smaller group than Asian women students. Then again, it is logical to assume that this will change once those students, many of whom likely study data science, graduate.

Finally, let's see what the software engineer selection looks like.

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(10.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(y = 'continent',
              data = df_occupation_se,
              hue = 'Q2',
              order = df_occupation_se['continent'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle survey 2020: software engineers by gender and continent",fontsize=16)
plot.set_xlabel("Number of answers",fontsize=18)
plot.set_ylabel("Continent",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend(loc='lower right')

# show plot
plt.show()

Using software engineer as the selection criterion shifts the overall picture toward men, to put it simply. The relative number of women falls evenly in all categories, although women software engineers in Asia do defend their position with honor.

Again, it is good to keep in mind that data analytics will become an increasing part of a variety of job titles. Therefore the occupation approach will not tell us the whole truth about the people who occasionally use data analytics in their everyday work in one form or another.

All in all, the same principle applies to Kaggle users in general. From the viewpoint of data analysis, more diverse data is always better than a one-sided view. The same applies to Kaggle as a community. When data science and data analysis bring people from different backgrounds together, they have in a way fulfilled their primary task, and all this without a single line of code.