# 22% of Employees Earn 0-999$ Yearly!? An Investigation 🕵️‍♀️

![](https://image.freepik.com/free-vector/company-employees-use-web-search-find-ideas-doing-business-company_1150-43196.jpg)

Hey Kagglers! I'm thrilled to share with you my first ever notebook.
As a natural first step in understanding the data I'm working with, I started my EDA by visualizing the distribution of each question in the survey. This is when I stumbled upon the distribution of the "yearly compensation rate" among the respondents. I was extremely surprised to discover that the most chosen (and by a significant margin) answer here was the **yearly** salary range 0-999$. In this notebook, I aim to understand why this was the most chosen salary range, and discover the profile of the respondents that chose this option.

---
### A little about me 👩‍🔬👩‍💻

As mentioned earlier, this is my first ever notebook (yay!). As a biologist who recently developed an interest in data science, I decided to put my newly learned data science skills into action by exploring the 2021 Kaggle Machine Learning & Data Science Survey. I'm always looking for ways to improve and would really appreciate your comments and critiques.

---

So, are you ready? Put on your detective hats and let the investigation begin!

# 0. Setup

In [None]:
# Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import plotly.express as px
import plotly.graph_objects as go

# 1. Data Cleaning

In [None]:
# Reading the data
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")

# Dropping the first row
df.drop(index=0, inplace=True)

# Converting the type of the time column to int
df["Time from Start to Finish (seconds)"] = df["Time from Start to Finish (seconds)"].astype(int)

In [None]:
# Renaming some columns names
df.rename(columns={"Q1":"Age"}, inplace=True)
df.rename(columns={"Q2":"Gender"}, inplace=True)
df.rename(columns={"Q3":"Country"}, inplace=True)
df.rename(columns={"Q4":"Education"}, inplace=True)
df.rename(columns={"Q5":"Title"}, inplace=True)
df.rename(columns={"Q6":"Years of code/programming"}, inplace=True)
df.rename(columns={"Q7_Part_1":"Language used: Python", "Q7_Part_2":"Language used: R", 
                   "Q7_Part_3":"Language used: SQL", "Q7_Part_4":"Language used: C",
                  "Q7_Part_5":"Language used: C++", "Q7_Part_6":"Language used: Java",
                  "Q7_Part_7":"Language used: Javascript", "Q7_Part_8":"Language used: Julia", 
                   "Q7_Part_9":"Language used: Swift", "Q7_Part_10":"Language used: Bash",
                  "Q7_Part_11":"Language used: MATLAB", "Q7_Part_12":"Language used: None",
                  "Q7_OTHER":"Language used: Other"}, inplace=True)
df.rename(columns={"Q8":"Recommended language to learn first"}, inplace=True)
df.rename(columns={"Q11":"Computing platform for data science projects"}, inplace=True)
df.rename(columns={"Q20":"Type of employment industry"}, inplace=True)
df.rename(columns={"Q21":"Size of the employment company"}, inplace=True)
df.rename(columns={"Q25":"Yearly compensation ($USD)"}, inplace=True)
df.rename(columns={"Q40_Part_1":"Platform used to learn DS: Coursera", "Q40_Part_2":"Platform used to learn DS: edX", 
                   "Q40_Part_3":"Platform used to learn DS: Kaggle Learn Courses", "Q40_Part_4":"Platform used to learn DS: DataCamp",
                  "Q40_Part_5":"Platform used to learn DS: Fast.ai", "Q40_Part_6":"Platform used to learn DS: Udacity",
                  "Q40_Part_7":"Platform used to learn DS: Udemy", "Q40_Part_8":"Platform used to learn DS: LinkedIn Learning", 
                   "Q40_Part_9":"Platform used to learn DS: Cloud-certification programs", "Q40_Part_10":"Platform used to learn DS: University Courses",
                  "Q40_Part_11":"Platform used to learn DS: None","Q40_OTHER":"Platform used to learn DS: Other"}, inplace=True)


In [None]:
# Replacing some long strings with smaller representations
df.replace({"Iran, Islamic Republic of..." : "Iran",
            "United States of America" : "USA",
            "United Kingdom of Great Britain and Northern Ireland" : "UK",
            "Republic of Korea" : "Korea",
            "United Arab Emirates": "UAE",
            "I do not wish to disclose my location": "Unknown",
            "No formal education past high school" : "High school",
            "Some college/university study without earning a bachelor’s degree" : "College/university without degree",
            },
           
           inplace=True)

In [None]:
# Converting several columns to categorical with order

salary_order = ['$0-999', '1,000-1,999', '2,000-2,999','3,000-3,999','4,000-4,999', '5,000-7,499', '7,500-9,999', '10,000-14,999','15,000-19,999',
                '20,000-24,999', '25,000-29,999', "30,000-39,999", "40,000-49,999", "50,000-59,999", "60,000-69,999", "70,000-79,999", "80,000-89,999", 
                "90,000-99,999", "100,000-124,999", "125,000-149,999", "150,000-199,999", "200,000-249,999", "250,000-299,999", "300,000-499,999", 
                "$500,000-999,999", ">$1,000,000"]
cat_dtype = pd.api.types.CategoricalDtype(categories=salary_order, ordered=True)
df['Yearly compensation ($USD)'] = df['Yearly compensation ($USD)'].astype(cat_dtype)


education_level_order = ['High school', 'College/university without degree','Bachelor’s degree', 'Master’s degree', 'Doctoral degree',
                         'Professional doctorate', 'I prefer not to answer']
cat_dtype = pd.api.types.CategoricalDtype(categories=education_level_order, ordered=True)
df.Education = df.Education.astype(cat_dtype)


years_programming_order = ['I have never written code', '< 1 years', '1-3 years',
                         '3-5 years', '5-10 years', '10-20 years', '20+ years']
cat_dtype = pd.api.types.CategoricalDtype(categories=years_programming_order, ordered=True)
df["Years of code/programming"] = df["Years of code/programming"].astype(cat_dtype)


age_range_order = ['18-21', '22-24', '25-29','30-34','35-39', '40-44', '45-49', '50-54','55-59','60-69', '70+']
cat_dtype = pd.api.types.CategoricalDtype(categories=age_range_order, ordered=True)
df['Age'] = df['Age'].astype(cat_dtype)

# 2. Yearly Compensation ($USD)

Without wasting any time, let's take a look at the yearly salary distribution among all the respondents.

In [None]:
# Salary distribution amongst all the respondents

salary_dist = df.groupby("Yearly compensation ($USD)").size()

fig = px.bar(x = salary_dist.index,
             y = salary_dist.values,
             labels={"x": "Yearly salary (USD)", "y":"Nb respondents"},
             title= "<b>Yearly Salary distribution among all the Respondents</b>",
             color=salary_dist.values,
             color_continuous_scale="mint")

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20, 
                  title_x=0.5, 
                  coloraxis={"showscale":False}, 
                  xaxis=dict(titlefont=dict(size=18)),
                  yaxis=dict(titlefont=dict(size=18)))
fig.show()

It is surprising to find that the most chosen salary range is by far **0-999\\$** (*3369 respondents*), largely surpassing the counts of all other salary ranges. The rest of this notebook explores this group of respondents and tries to understand why so many of them chose this option. Moving forward, we will refer to the respondents who answered 0-999\\$ as respondents "with low income".
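To turn a count like this into a share of everyone who answered the salary question, `value_counts(normalize=True)` is handy. Here is a minimal sketch on a toy Series (the values below are made up for illustration; in this notebook the input would be `df["Yearly compensation ($USD)"]`).

```python
import pandas as pd

# Toy stand-in for the survey column (made-up answers, not survey data).
salaries = pd.Series(
    ["$0-999", "$0-999", "1,000-1,999", "$0-999", "10,000-14,999", None],
    name="Yearly compensation ($USD)",
)

# value_counts() ignores missing answers by default, so the shares are
# computed over respondents who actually answered the question.
counts = salaries.value_counts()
shares = salaries.value_counts(normalize=True) * 100

n_low_income = counts["$0-999"]
pct_low_income = shares["$0-999"]
print(n_low_income, round(pct_low_income, 1))
```

Computing the count this way also avoids hard-coding the number of low income respondents later on.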

# 3. Let the Investigation Begin! 🔍

In [None]:
# Grouping the dataframe by yearly compensation
salary_group = df.groupby("Yearly compensation ($USD)")

## 3.1. How much time (in seconds) did these respondents spend to complete the survey?

My initial intuition was that respondents may have chosen this answer out of laziness or an unwillingness to complete the survey carefully. So we will start by checking whether these respondents spent a reasonable amount of time completing the survey.

In [None]:
salary_group["Time from Start to Finish (seconds)"].mean()

As we can see, the mean time spent by the respondents with low income is close to that of the other respondents. So the time they spent on the survey raises no red flag (e.g. people rushing through the survey without any real interest in answering the questions correctly).
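One caveat: a mean can be pulled up by a few respondents who left the survey open for hours, so it is worth looking at the median next to it. Below is a small sketch on made-up numbers (the column names `salary` and `seconds` are placeholders, not the survey's columns).

```python
import pandas as pd

# Made-up completion times; one respondent left the tab open for 25 hours.
toy = pd.DataFrame({
    "salary": ["$0-999"] * 4 + ["1,000-1,999"] * 4,
    "seconds": [300, 420, 500, 90000, 310, 400, 450, 520],
})

# The mean is sensitive to the single outlier, the median is not.
summary = toy.groupby("salary")["seconds"].agg(["mean", "median"])
print(summary)
```

In this notebook, the same idea would be `salary_group["Time from Start to Finish (seconds)"].agg(["mean", "median"])`.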

*Notice the particularly low average time spent on the survey by people choosing the option \\$500,000-999,999. This definitely raises a question mark on this group, but we'll leave that to another discussion.*

## 3.2. What are Their Titles?

Next, let's take a look at the titles of respondents with low income.

In [None]:
low_income_titles = df[df["Yearly compensation ($USD)"] == "$0-999"].groupby("Title").size().sort_values(ascending=False)

fig = px.bar(x = low_income_titles.index, 
             y = low_income_titles.values,
             labels={"x": "Title", "y":"Nb respondents"},
             title= "<b>Titles distribution of respondents with low income</b>",
             height=600,
             color=low_income_titles.values,
             color_continuous_scale="mint")

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20, 
                  title_x=0.5, 
                  coloraxis={"showscale":False}, 
                  xaxis=dict(titlefont=dict(size=18)),
                  yaxis=dict(titlefont=dict(size=18)))
fig.show()

Now let's see the percentage each title occupies within the low income group.

In [None]:
title_percentages = pd.concat([low_income_titles.rename("Nb low income")], axis="columns")
title_percentages["Pct distribution"] = (title_percentages["Nb low income"]/3369)*100

In [None]:
trace = go.Pie(labels=title_percentages.index,
               values=title_percentages["Pct distribution"], 
               #title="<b>The percentage each title occupies within the low income group</b>",
               titleposition="top center",
               title_font_size=16,
               hovertemplate="<b>%{label}</b><br>Percentage: %{value}%<br>",
               #hoverinfo='percent+value+label', 
               textinfo='percent',
               textposition='inside',
               hole=0.6,
               showlegend=True,
               marker=dict(colors=plt.cm.viridis_r(np.linspace(0, 1, 28)),
                           line=dict(color='#000000',
                                     width=2),
                          ),
               name=""
              )

fig=go.Figure(data=[trace])
fig.update_layout(title_text="<b>The percentage each title occupies within the low income group</b>", title_xanchor="left", title_pad={"l":210},
                  legend={"title":"<b>Title</b><br>"}, uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()

As we can see, data scientists form the largest share of the low income group, whereas DBA/Database engineers form the smallest. This roughly follows the general distribution of all the respondents (i.e., there are many more data scientists in the survey as a whole than there are DBA/Database engineers, so it is expected that more data scientists chose the low income answer).

In order to normalize these values, and to make sure that this visualization is not just following the global distribution of titles, let's calculate the percentage of respondents within each title group who reported earning a yearly salary of 0-999$. From this point onwards, we will refer to this percentage as a **Local pct** (local percentage), and we will refer to the previously visualized numbers as **Global pct** (global percentage).

So, to recap:
* **Global pct** will refer to the percentage of respondents with low income that belong to a given category (in the above example the categories were the titles).
* **Local pct** will refer to the percentage of respondents within a given category that reported a low income (0-999\\$). 
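To make the two definitions concrete, here is a minimal sketch on toy data. The helper `low_income_breakdown` and the column names `title` and `salary` are hypothetical, introduced only for this illustration. Note that this sketch divides by the number of low income respondents counted inside the category column, whereas the notebook's cells divide by the overall low income count.

```python
import pandas as pd

def low_income_breakdown(data, category, salary_col="salary", low_label="$0-999"):
    """Return counts plus global and local percentages for one category column."""
    total = data[category].value_counts().rename("Total Respondents")
    low = (data.loc[data[salary_col] == low_label, category]
           .value_counts()
           .rename("Nb low income"))
    out = pd.concat([total, low], axis="columns").fillna({"Nb low income": 0})
    # Global pct: share of the low income group that falls in each category.
    out["Global Pct"] = out["Nb low income"] / out["Nb low income"].sum() * 100
    # Local pct: share of each category that reported a low income.
    out["Local Pct"] = out["Nb low income"] / out["Total Respondents"] * 100
    return out.sort_values("Local Pct", ascending=False)

# Made-up respondents: 6 data scientists (2 low income), 2 statisticians (1 low income).
toy = pd.DataFrame({
    "title": ["Data Scientist"] * 6 + ["Statistician"] * 2,
    "salary": ["$0-999", "$0-999", "1,000-1,999", "1,000-1,999",
               "1,000-1,999", "1,000-1,999", "$0-999", "1,000-1,999"],
})
result = low_income_breakdown(toy, "title")
print(result)
```

In the toy data, data scientists dominate the global percentage simply because there are more of them, while statisticians come first on the local percentage.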

In [None]:
total_resp = df["Title"].value_counts()
# Naming the Series directly works whether value_counts() labels it "Title" (pandas < 2.0) or "count" (pandas >= 2.0)
title_df = total_resp.rename("Total Respondents").to_frame()
title_df["Nb low income"] = low_income_titles
title_df["Global Pct"] = (title_df["Nb low income"]/3369)*100
title_df["Local Pct"] = (title_df["Nb low income"]/title_df["Total Respondents"])*100
title_df.sort_values(by="Local Pct", ascending=False, inplace=True)
title_df

Let's visualize the local percentages of the titles

In [None]:
# Dropping the last 2 rows of the title_df 
title_df.drop(["Student", "Currently not employed"], inplace=True)

In [None]:
fig = px.bar(x = title_df.index, 
             y = title_df["Local Pct"].values/100, 
             title= "<b>Percentage of respondents from each title having a low income</b>",
             labels={"x": "Title", "y":"Pct respondents"},
             color=title_df["Local Pct"].values,
             color_continuous_scale="purples",
             text=[str(x)[:5] + "%" for x in title_df["Local Pct"].values])

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20,
                  title_x=0.5,
                  coloraxis={"showscale":False},
                  xaxis={"titlefont":{"size":18}},
                  yaxis={"tickformat":".1%", "range":[0,1], "titlefont":{"size":18}})
                 
fig.show()

Now, these normalized values (which I'm referring to as **local pct**) are much easier to interpret. Using this metric, we can see how strongly each category is associated with low income. This is why, from this point onwards, most of the visualizations will show the local percentages rather than the global percentages.

As we can see in the graph above, the statisticians group has the highest proportion of low income respondents, whereas the product manager group has the lowest. Looking at the local percentages of the titles, nothing stands out as aberrant, as the percentages are fairly close to each other.

## 3.3. What is the level of their education?

In [None]:
low_income_education = df[df["Yearly compensation ($USD)"] == "$0-999"].groupby("Education").size()

In [None]:
fig = px.bar(x = low_income_education.index, 
             y = low_income_education.values,
             labels={"x": "Education level", "y":"Nb respondents"},
             title= "<b>Education level distribution of respondents with low income</b>",
             color=low_income_education.values,
             color_continuous_scale="mint")

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20, 
                  title_x=0.5, 
                  coloraxis={"showscale":False}, 
                  xaxis=dict(titlefont=dict(size=18)),
                  yaxis=dict(titlefont=dict(size=18)))
fig.show()

In [None]:
total_resp = df["Education"].value_counts()
# Name the counts directly; value_counts() labels them differently across pandas versions
education_df = total_resp.rename("Total Respondents").to_frame()
education_df["Nb low income"] = low_income_education
education_df["Global Pct"] = (education_df["Nb low income"]/3369)*100
education_df["Local Pct"] = (education_df["Nb low income"]/education_df["Total Respondents"])*100
education_df.sort_values(by="Local Pct", ascending=False, inplace=True)

In [None]:
fig = px.bar(x = education_df.index, 
             y = education_df["Local Pct"].values/100, 
             title= "<b>Percentage of respondents from each level of education <br> having a low income</b>",
             labels={"x": "Education level", "y":"Pct respondents"},
             color=education_df["Local Pct"].values,
             color_continuous_scale="purples",
             text=[str(x)[:5] + "%" for x in education_df["Local Pct"].values])

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20,
                  title_x=0.5,
                  coloraxis={"showscale":False},
                  xaxis={"titlefont":{"size":18}},
                  yaxis={"tickformat":".1%", "range":[0,1], "titlefont":{"size":18}})
                 
fig.show()

Similarly to the titles' local pct graph, nothing striking stands out from the local pct of the education levels.

## 3.4.  What are their ages?

In [None]:
low_income_age = df[df["Yearly compensation ($USD)"] == "$0-999"].groupby("Age").size()

In [None]:
fig = px.bar(x = low_income_age.index, 
             y = low_income_age.values,
             labels={"x": "Age", "y":"Nb respondents"},
             title= "<b>Age distribution of respondents with low income</b>",
             color=low_income_age.values,
             color_continuous_scale="mint")

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20, 
                  title_x=0.5, 
                  coloraxis={"showscale":False}, 
                  xaxis=dict(titlefont=dict(size=18)),
                  yaxis=dict(titlefont=dict(size=18)))
fig.show()

In [None]:
total_resp = df["Age"].value_counts()
# Name the counts directly; value_counts() labels them differently across pandas versions
age_df = total_resp.rename("Total Respondents").to_frame()
age_df["Nb low income"] = low_income_age
age_df["Global Pct"] = (age_df["Nb low income"]/3369)*100
age_df["Local Pct"] = (age_df["Nb low income"]/age_df["Total Respondents"])*100
age_df.sort_values(by="Local Pct", ascending=False, inplace=True)

In [None]:
fig = px.bar(x = age_df.index, 
             y = age_df["Local Pct"].values/100, 
             title= "<b>Percentage of respondents belonging to each age range <br> having a low income</b>",
             labels={"x": "Age", "y":"Pct respondents"},
             color=age_df["Local Pct"].values,
             color_continuous_scale="purples",
             text=[str(x)[:5] + "%" for x in age_df["Local Pct"].values])

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20,
                  title_x=0.5,
                  coloraxis={"showscale":False},
                  xaxis={"titlefont":{"size":18}},
                  yaxis={"tickformat":".1%", "range":[0,1], "titlefont":{"size":18}})
                 
fig.show()

By visualizing the bar chart, it's interesting to see that the 70+ age range has the highest proportion of respondents with low income. Could this refer to retired fellas? Or just random noise? We can't be sure, to be honest: the total number of participants in this category is only 24, so there's little point in digging further here. Otherwise, nothing to note.
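To get a feel for how noisy a percentage based on 24 people is, a confidence interval for a proportion helps. Below is a sketch using the Wilson score interval. The count of 8 low income respondents is hypothetical (the real count is in `age_df`), and `wilson_interval` is a helper written for this illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 24 respondents in the group; 8 is a HYPOTHETICAL low income count.
low, high = wilson_interval(8, 24)
print(f"{low:.1%} to {high:.1%}")
```

With a group this small the interval spans tens of percentage points, which supports not reading too much into the 70+ bar.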

## 3.5.  What is their gender?

In [None]:
low_income_gender = df[df["Yearly compensation ($USD)"] == "$0-999"].groupby("Gender").size().sort_values(ascending=False)

In [None]:
fig = px.bar(x = low_income_gender.index, 
             y = low_income_gender.values,
             labels={"x": "Gender", "y":"Nb respondents"},
             title= "<b>Gender distribution of respondents with low income</b>",
             color=low_income_gender.values,
             color_continuous_scale="mint")

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20, 
                  title_x=0.5, 
                  coloraxis={"showscale":False}, 
                  xaxis=dict(titlefont=dict(size=18)),
                  yaxis=dict(titlefont=dict(size=18)))
fig.show()

In [None]:
total_resp = df["Gender"].value_counts()
# Name the counts directly; value_counts() labels them differently across pandas versions
gender_df = total_resp.rename("Total Respondents").to_frame()
gender_df["Nb low income"] = low_income_gender
gender_df["Global Pct"] = (gender_df["Nb low income"]/3369)*100
gender_df["Local Pct"] = (gender_df["Nb low income"]/gender_df["Total Respondents"])*100
gender_df.sort_values(by="Local Pct", ascending=False, inplace=True)

In [None]:
fig = px.bar(x = gender_df.index, 
             y = gender_df["Local Pct"].values/100, 
             title= "<b>Percentage of respondents from each gender <br> having a low income</b>",
             labels={"x": "Gender", "y":"Pct respondents"},
             color=gender_df["Local Pct"].values,
             color_continuous_scale="purples",
             text=[str(x)[:5] + "%" for x in gender_df["Local Pct"].values])

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20,
                  title_x=0.5,
                  coloraxis={"showscale":False},
                  xaxis={"titlefont":{"size":18}},
                  yaxis={"tickformat":".1%", "range":[0,1], "titlefont":{"size":18}})
                 
fig.show()

Once again, there's nothing of value here.

Let's not lose hope and keep on investigating!

## 3.6. For how many years have they been coding/programming?

Let's take a look at the years of coding/programming distribution of respondents with low income.

In [None]:
low_income_years_coding = df[df["Yearly compensation ($USD)"] == "$0-999"].groupby("Years of code/programming").size()

In [None]:
fig = px.bar(x = low_income_years_coding.index, 
             y = low_income_years_coding.values,
             labels={"x": "Years of coding/programming", "y":"Nb respondents"},
             title= "<b>Years of coding/programming distribution <br> of respondents with low income</b>",
             color=low_income_years_coding.values,
             color_continuous_scale="mint")

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20, 
                  title_x=0.5, 
                  coloraxis={"showscale":False}, 
                  xaxis=dict(titlefont=dict(size=18)),
                  yaxis=dict(titlefont=dict(size=18)))
fig.show()

As we can see, the largest shares of the low income group have between 1 and 3 years of coding experience, or less than 1 year. Do the relatively few years of experience explain the low income? Let's not draw conclusions at this stage, and continue with our investigation. 

Now, let's take a look at the local percentages of the years of coding groups.

In [None]:
total_resp = df["Years of code/programming"].value_counts()
# Name the counts directly; value_counts() labels them differently across pandas versions
years_coding_df = total_resp.rename("Total Respondents").to_frame()
years_coding_df["Nb low income"] = low_income_years_coding
years_coding_df["Global Pct"] = (years_coding_df["Nb low income"]/3369)*100
years_coding_df["Local Pct"] = (years_coding_df["Nb low income"]/years_coding_df["Total Respondents"])*100
years_coding_df.sort_values(by="Local Pct", ascending=False, inplace=True)

In [None]:
fig = px.bar(x = years_coding_df.index, 
             y = years_coding_df["Local Pct"].values/100, 
             title= "<b>Percentage of respondents with each years of coding experience <br> having a low income</b>",
             labels={"x": "Years of code/programming", "y":"Pct respondents"},
             color=years_coding_df["Local Pct"].values,
             color_continuous_scale="purples",
             text=[str(x)[:5] + "%" for x in years_coding_df["Local Pct"].values])

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=18,
                  title_x=0.5,
                  coloraxis={"showscale":False},
                  xaxis={"titlefont":{"size":18}},
                  yaxis={"tickformat":".1%", "range":[0,1], "titlefont":{"size":18}})
                 
fig.show()

It is interesting to notice that the local pct decreases as the number of years of code/programming increases. Also, peeps who have never written code reported a low income at a considerably higher rate than all other groups.
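A trend like this can be checked in code instead of by eye. The sketch below uses hypothetical local percentages (not the survey's values) and tests whether they fall steadily as experience grows, plus a rank correlation computed by hand from ranks.

```python
import pandas as pd

order = ["I have never written code", "< 1 years", "1-3 years", "3-5 years",
         "5-10 years", "10-20 years", "20+ years"]

# HYPOTHETICAL local percentages, for illustration only.
local_pct = pd.Series([40.0, 30.0, 22.0, 15.0, 10.0, 6.0, 5.0], index=order)

# Reindexing by the ordered categories undoes any sort by "Local Pct",
# so the check follows experience rather than bar height.
by_experience = local_pct.reindex(order)
is_decreasing = by_experience.is_monotonic_decreasing

# Spearman-style rank correlation: Pearson correlation of the ranks.
experience_rank = pd.Series(range(len(order)), index=order, dtype=float)
spearman = by_experience.rank().corr(experience_rank)
print(is_decreasing, round(spearman, 2))
```

In this notebook, the input would be something like `years_coding_df["Local Pct"].reindex(years_programming_order)`.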

It seems like we're getting closer to an answer, so let's keep digging!

## 3.7. Where do they come from?

Next, let's visualize the country distribution of respondents with low income.

In [None]:
low_income_country = df[df["Yearly compensation ($USD)"] == "$0-999"].groupby("Country").size().sort_values(ascending=False)

In [None]:
trace = go.Pie(labels=low_income_country.index,
               values=low_income_country.values, 
               title="<b>Country Distribution</b>",
               title_font_size=16,
               hovertemplate="<b>%{label}</b><br>Respondents: %{value}<br><i>%{percent}</i>",
               #hoverinfo='percent+value+label', 
               textinfo='percent',
               textposition='inside',
               hole=0.6,
               showlegend=True,
               marker=dict(colors=plt.cm.viridis_r(np.linspace(0, 1, 28)),
                           line=dict(color='#000000',
                                     width=2),
                          ),
               name=""
              )
fig=go.Figure(data=[trace])
fig.update_layout(legend={"title":"<b>Country</b><br>"})
fig.show()

Now, let's take a look at the -more informative- local percentages of the countries.

In [None]:
total_resp = df["Country"].value_counts()
# Name the counts directly; value_counts() labels them differently across pandas versions
country_df = total_resp.rename("Total Respondents").to_frame()
country_df["Nb low income"] = low_income_country
country_df["Global Pct"] = (country_df["Nb low income"]/3369)*100
country_df["Local Pct"] = (country_df["Nb low income"]/country_df["Total Respondents"])*100
country_df.sort_values(by="Local Pct", ascending=False, inplace=True)

In [None]:
fig = px.bar(x = country_df.index, 
             y = country_df["Local Pct"].values/100, 
             title= "<b>Percentage of respondents from each country <br> having a low income</b>",
             labels={"x": "Country", "y":"Pct respondents"},
             color=country_df["Local Pct"].values,
             color_continuous_scale="purples",
             #text=[str(x)[:5] + "%" for x in local_pct_country_df.values]
            )

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20,
                  title_x=0.5,
                  coloraxis={"showscale":False},
                  xaxis={"titlefont":{"size":18}},
                  yaxis={"tickformat":".1%", "range":[0,1], "titlefont":{"size":18}})
                 
fig.show()

By looking at the bar chart above, we can see a remarkable difference in the local percentages across countries. Egypt, Iraq, Ethiopia, Nigeria, Pakistan and Indonesia take the lead, with local percentages ranging from 31.1% down to 26%. The local percentages for the rest of the countries drop from 25% (Algeria) all the way to 2.9% (Israel). It is worth stopping here to understand why the aforementioned countries have such high local percentages.

#### Printing the top 15 countries in terms of local pct

In [None]:
country_df.head(15)

We will focus on 4 countries: Egypt, Nigeria, Pakistan and Indonesia, since they have a relatively high number of respondents with low income (150, 196, 143 and 115 respectively).

#### GDP per capita of the countries having the highest Local percentage

**GDP** or Gross Domestic Product, as defined by the International Monetary Fund, "measures the monetary value of final goods and services—that is, those that are bought by the final user—produced in a country in a given period of time". 
**GDP per capita** is a country's GDP divided by its total population. It is a rough proxy for the average economic output (and, loosely, the average income) per person in that country, not the amount each individual actually earns.

To this end, we will gather the GDP per capita for all the countries in the year 2021, and visualize where the 4 mentioned countries stand relative to the rest of the world. Having not found such an up-to-date dataset on Kaggle, I downloaded the data available on the International Monetary Fund's website and uploaded the dataset publicly on Kaggle [here](https://www.kaggle.com/joannamoussa/international-monetary-fund-gdp-per-capita). 

So, my idea is the following: *if the GDP per capita of these 4 countries is 1) low relative to the rest of the data and 2) close to the 1000\\$ mark, this could explain why these countries lead the board of the local percentages.*
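One way to test this idea across all countries at once, rather than only by eye, is to join the local percentages with the GDP values and look at a rank correlation. The sketch below uses placeholder countries and made-up numbers, and the 5,000 USD cut-off is an arbitrary assumption for illustration.

```python
import pandas as pd

# PLACEHOLDER countries with made-up values (not survey or IMF data).
local_pct = pd.Series({"A": 30.0, "B": 25.0, "C": 8.0, "D": 3.0}, name="Local Pct")
gdp = pd.Series({"A": 1200.0, "B": 3500.0, "C": 20000.0, "D": 45000.0},
                name="GDP per capita")

# Inner join keeps only countries present in both sources.
merged = pd.concat([local_pct, gdp], axis="columns", join="inner")

# Spearman-style rank correlation: Pearson correlation of the ranks.
rank_corr = merged["Local Pct"].rank().corr(merged["GDP per capita"].rank())

# Countries under an arbitrary, assumed cut-off of 5,000 USD.
near_threshold = merged[merged["GDP per capita"] < 5000]
print(round(rank_corr, 2), list(near_threshold.index))
```

A strongly negative value would support the idea. With the real data, country names would first need to match between the two sources (the survey labels were shortened earlier, e.g. "USA").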

In [None]:
#Reading the dataset containing the GDP per capita of all the countries worldwide
gdp_df = pd.read_csv("../input/international-monetary-fund-gdp-per-capita/GDP per capita.csv")

In [None]:
#cleaning the dataset
gdp_df.dropna(inplace=True)
gdp_df.set_index("GDP per capita, current prices\n (U.S. dollars per capita)", inplace=True)
gdp_df.drop(index="World", inplace=True)

# As Pakistan has no data in 2021, we will assign its value from 2020.
gdp_df.loc["Pakistan", "2021"] = gdp_df.loc["Pakistan", "2020"]

#dropping the rows that contains "no data" in the 2021 column
filt = (gdp_df["2021"]== "no data")
gdp_df.drop(index=gdp_df[filt].index, inplace=True)

#casting the 2021 column values
gdp_df["2021"] = gdp_df["2021"].astype(float)

# Sorting the countries by the GDP values of the year 2021 
gdp_df.sort_values(by="2021", inplace=True)
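As a side note, the "no data" handling can also be sketched with `pd.to_numeric(..., errors="coerce")`, which turns anything non-numeric into `NaN`. The toy frame below is made up. Unlike the cell above, which backfills Pakistan only, this sketch fills every missing 2021 value from 2020, so it is a variation and not a drop-in replacement.

```python
import pandas as pd

# Made-up frame mimicking the "no data" strings in the IMF export.
toy_gdp = pd.DataFrame(
    {"2020": ["1200.5", "900.0", "no data"],
     "2021": ["no data", "950.2", "no data"]},
    index=["Country A", "Country B", "Country C"],
)

# Non-numeric strings such as "no data" become NaN.
numeric = toy_gdp.apply(pd.to_numeric, errors="coerce")

# Fall back to the 2020 value wherever 2021 is missing.
numeric["2021"] = numeric["2021"].fillna(numeric["2020"])

# Drop countries with no value in either year, then sort.
clean_2021 = numeric["2021"].dropna().sort_values()
print(clean_2021)
```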

In [None]:
#Visualization

gdp_2021 = gdp_df["2021"]

color_discrete_map = dict()
for country in gdp_2021.index:
    if country in ["Pakistan", "Egypt", "Indonesia", "Nigeria"]:
        color_discrete_map[country] =  'rgb(255,0,0)'
    elif country in list(country_df.head(10).index) + ["Vietnam"]:
        color_discrete_map[country] =  'rgb(255,165,0)'
    else:
        color_discrete_map[country] =  'rgb(150,150,150)'

fig = px.bar(x = gdp_2021.index, 
             y = gdp_2021.values,
             labels={"x": "Country", "y":"GDP per Capita (USD)"},
             title= "<b>GDP per capita of all the countries worldwide</b>",
             height=600,
             width=900,
             color= gdp_2021.index,
             color_discrete_map = color_discrete_map)

fig.update_traces(hovertemplate = "<b>%{x}</b><br>GDP per Capita: %{y}")

fig.update_layout(title_font_size=20,
                  title_x=0.5, 
                  showlegend=False,
                  coloraxis={"showscale":False}, 
                  xaxis=dict(titlefont=dict(size=18), showticklabels=False),
                  yaxis=dict(titlefont=dict(size=18)))


for country in ["Pakistan", "Egypt", "Indonesia", "Nigeria"]:
    if country == "Indonesia":
        fig.add_annotation(x=country, 
                           y=gdp_df.loc[country, "2021"],
                           text=country,
                           showarrow=True,
                           font=dict(
                                family="Arial Black",
                                size=14,
                                color="red"),
                           align="center",
                           arrowhead=2,
                           arrowsize=1,
                           arrowwidth=2,
                           arrowcolor="red",
                           ax=30,
                           ay=-80
                          )
        
    else: 
        fig.add_annotation(x=country, 
                           y=gdp_df.loc[country, "2021"],
                           text=country,
                           showarrow=True,
                           font=dict(
                                family="Arial Black",
                                size=14,
                                color="red"),
                           align="center",
                           arrowhead=2,
                           arrowsize=1,
                           arrowwidth=2,
                           arrowcolor="red",
                           ax=-20,
                           ay=-80
                          )
        
fig.add_shape(type='line',
              x0="Indonesia", y0=gdp_df.loc["Indonesia", "2021"],
              x1=0, y1=gdp_df.loc["Indonesia", "2021"],
              line={"color":'#ee634d', "width":2, "dash":"dot"},
              xref='x', yref='y'
             )

fig.add_annotation(xref="paper", x=0, y=gdp_df.loc["Indonesia", "2021"],
                   text="<i>" + str(gdp_df.loc["Indonesia", "2021"]).split(".")[0] + "</i>",
                   showarrow=False,
                   xanchor="right",
                   yanchor="bottom",
                   font={"color":"#ee634d", "size":10})

fig.add_shape(type='line',
              x0="Pakistan", y0=gdp_df.loc["Pakistan", "2021"],
              x1=0, y1=gdp_df.loc["Pakistan", "2021"],
              line={"color":'#ee634d', "width":2, "dash":"dot"},
              xref='x', yref='y'
             )

fig.add_annotation(xref="paper", x=0, y=gdp_df.loc["Pakistan", "2021"]-1000,
                   text="<i>" + str(gdp_df.loc["Pakistan", "2021"]).split(".")[0] + "</i>",
                   showarrow=False,
                   xanchor="right",
                   yanchor="bottom",
                   font={"color":"#ee634d", "size":10})

fig.show()

According to the GDP per capita dataset, the GDP per capita of our 4 countries of interest is amongst the lowest of all the countries, ranging from 1,254 USD (Pakistan) to 4,224 USD (Indonesia), as shown in the bar chart above. This could well explain why the number of low income respondents is relatively high in these countries.

---

Keep in mind that the GDP per capita is essentially an average over the whole population. So for the case of Pakistan, with a GDP per capita of 1,254\\$, and assuming a normal distribution of salaries, this value would also be the median, meaning 50% of the population earns less than this number. Salary distributions, however, are usually skewed to the right: a long tail of high earners pulls the mean above the median, so more than 50% of the population earns less than the mean. With that in mind, the local percentage of 27% makes complete sense for a country such as Pakistan.
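
To make the mean vs. median argument concrete, here is a minimal sketch on simulated data. The incomes are hypothetical: a log-normal distribution is used only as a common stand-in for a distribution with a long tail of high earners, and its parameters are arbitrary, not fitted to Pakistan or to the survey.

```python
import numpy as np

# Hypothetical, simulated incomes (NOT survey or GDP data).
# Log-normal parameters are arbitrary and chosen for illustration only.
rng = np.random.default_rng(seed=0)
incomes = rng.lognormal(mean=7.0, sigma=1.0, size=100_000)

mean_income = incomes.mean()
median_income = np.median(incomes)

# Fraction of the simulated population earning less than the average
share_below_mean = (incomes < mean_income).mean()

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
print(f"share earning less than the mean: {share_below_mean:.1%}")
```

In a simulation like this one, the median lands below the mean and well over half of the simulated population earns less than the average, which is the pattern the paragraph above relies on.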

---

Woohoo!! We now at least understand the reason behind the answer of a portion of respondents with low income. So:

**A portion of respondents with low income, coming from countries characterized by a low GDP per capita, are very likely to be actually earning 0-999\\$**.

If we focus only on the 4 countries with the highest local pct (Pakistan, Nigeria, Egypt and Indonesia), they constitute 18% of the respondents that reported a yearly compensation of 0-999\\$ (604 out of 3369 respondents with low income). But if we take into consideration the top 10 countries in terms of local pct (the countries highlighted in red plus the countries highlighted in orange in the bar chart above), **they constitute approximately 23% of the respondents with low income (779 out of 3369 respondents with low income)**.
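
As a quick sanity check, the shares quoted above can be recomputed from the counts given in the text. The variable names are only illustrative.

```python
# Counts quoted in the paragraph above
low_income_total = 3369   # all respondents who answered 0-999$
top4_low_income = 604     # Pakistan, Nigeria, Egypt and Indonesia
top10_low_income = 779    # top 10 countries in terms of local pct

top4_share = top4_low_income / low_income_total
top10_share = top10_low_income / low_income_total

print(f"top 4 countries:  {top4_share:.1%}")
print(f"top 10 countries: {top10_share:.1%}")
print(f"still unexplained: {1 - top10_share:.1%}")
```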

We just found one piece of the puzzle, as we understood the answer of just 23% of the respondents with low income. What about the remaining ~77%? Let's discover that in the next question!

## 3.8 -  In what type of industry do they work?

In [None]:
low_income_industry_type = df[df["Yearly compensation ($USD)"] == "$0-999"].groupby("Type of employment industry").size().sort_values(ascending=False)

In [None]:
fig = px.bar(x = low_income_industry_type.index,
             y = low_income_industry_type.values,
             labels={"x": "Type of employment industry", "y":"Nb respondents"},
             title="<b>Employment industry type distribution of respondents with low income</b>",
             color=low_income_industry_type.values,
             color_continuous_scale="mint")

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20, 
                  title_x=0.5, 
                  coloraxis={"showscale":False}, 
                  xaxis=dict(titlefont=dict(size=18)),
                  yaxis=dict(titlefont=dict(size=18)))
fig.show()

In [None]:
total_resp = df["Type of employment industry"].value_counts()
# Name the column explicitly: the default name of the value_counts() result depends on the pandas version
industry_type_df = total_resp.rename("Total Respondents").to_frame()
industry_type_df["Nb low income"] = low_income_industry_type
industry_type_df["Global Pct"] = (industry_type_df["Nb low income"]/3369)*100
industry_type_df["Local Pct"] = (industry_type_df["Nb low income"]/industry_type_df["Total Respondents"])*100
industry_type_df.sort_values(by="Local Pct", ascending=False, inplace=True)

In [None]:
fig = px.bar(x = industry_type_df.index, 
             y = industry_type_df["Local Pct"].values/100, 
             title= "<b>Percentage of respondents from each industry type <br> having a low income</b>",
             labels={"x": "Industry type", "y":"Pct respondents"},
             color=industry_type_df["Local Pct"].values,
             color_continuous_scale="purples",
             text=[str(x)[:5] + "%" for x in industry_type_df["Local Pct"].values])

# y holds a fraction here, not a count, so the hover label shows a percentage
fig.update_traces(hovertemplate = "<b>%{x}</b><br>Low income: %{y:.1%}")

fig.update_layout(title_font_size=20,
                  title_x=0.5,
                  coloraxis={"showscale":False},
                  xaxis={"titlefont":{"size":18}},
                  yaxis={"tickformat":".1%", "range":[0,1], "titlefont":{"size":18}})
                 
fig.show()

According to the bar chart above, the Non-profit/Service industry group has the highest local percentage, which is not surprising. More surprising is that the Academics/Education group comes second, with 35.1% of the respondents in this industry earning a yearly salary ranging from 0 to 999\\$. Let's zoom in on this specific group of respondents and discover their background.

### Zoom in on the respondents that work in Academics/Education industries and have a low income

### 3.8.1. Ages of respondents working in Academics/Education industries and earning a low salary

In [None]:
#filtering
academics_low_income = (df["Type of employment industry"] == "Academics/Education") & (df["Yearly compensation ($USD)"] == "$0-999")

In [None]:
ages = df[academics_low_income].groupby("Age").size()

In [None]:
fig = px.bar(x = ages.index,
             y = ages.values,
             labels={"x": "Age", "y":"Nb respondents"},
             title="<b>Age of respondents working in Academics/Education<br>industries and earning a low income</b>",
             color=ages.values,
             color_continuous_scale="mint")

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20, 
                  title_x=0.5, 
                  coloraxis={"showscale":False}, 
                  xaxis=dict(titlefont=dict(size=18)),
                  yaxis=dict(titlefont=dict(size=18)))
fig.show()

Interestingly, most of these respondents are young, with ages ranging from 18 to 29 years. Let's zoom in further on the young respondents aged **18-24** and discover their titles.

### 3.8.2. Titles of young respondents working in Academics/Education industries and earning a low salary

In [None]:
# filtering the respondents aged 18-24 (reusing the Academics/Education + low income mask)
academics_low_income_young = academics_low_income & df["Age"].isin(["18-21", "22-24"])

In [None]:
#18-24 years old
titles = df[academics_low_income_young].Title.value_counts()

In [None]:
fig = px.bar(x = titles.index,
             y = titles.values,
             labels={"x": "Title", "y":"Nb respondents"},
             title="<b>Titles of young respondents working in<br>Academics/Education industries and earning a low salary</b>",
             color=titles.values,
             color_continuous_scale="mint")

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20, 
                  title_x=0.5, 
                  coloraxis={"showscale":False}, 
                  xaxis=dict(titlefont=dict(size=18)),
                  yaxis=dict(titlefont=dict(size=18)))
fig.show()

Surprisingly, for these young age groups, there is a multitude of titles that *generally* require approximately 3 to 5 years of studies after high school (a Bachelor's or Master's degree). So how could respondents aged 18, 19 or 20 hold titles such as Machine Learning Engineer?

One hypothesis would be that these young respondents started coding/programming while they were in high school and have been gaining experience since then. Let's study this hypothesis next.

### 3.8.3. Years of coding/programming of the young respondents working in Academics/Education industries and having a low income

In [None]:
#18-24 years
years_coding = df[academics_low_income_young].groupby("Years of code/programming").size()

In [None]:
fig = px.bar(x = years_coding.index,
             y = years_coding.values,
             labels={"x": "Years of coding/programming", "y":"Nb respondents"},
             title="<b>Years of coding/programming of the young respondents working in<br>Academics/Education industries and having a low income</b>",
             color=years_coding.values,
             color_continuous_scale="mint")

fig.update_traces(hovertemplate = "<b>%{x}</b><br>Respondents: %{y}")

fig.update_layout(title_font_size=20, 
                  title_x=0.5, 
                  coloraxis={"showscale":False}, 
                  xaxis=dict(titlefont=dict(size=18)),
                  yaxis=dict(titlefont=dict(size=18)))
fig.show()

As we can see, the majority of these young respondents have too few years of coding/programming experience to hold titles like Data Scientist, Machine Learning Engineer, etc. So what could explain such results?

---

Looking back at the titles distribution of these young respondents, these titles look more like university majors than job titles for such ages. For instance, "Data Scientist" and "Machine Learning Engineer" map much more naturally onto majors at universities (Data Science, ML Engineering, etc.) than "DBA/Database Engineer" or "Product Manager" do.

Taking a step back and having a second look at the survey questions (specifically at the file: `kaggle_survey_2021_answer_choices.pdf`), we can notice that the 5th question asked in the survey was about the "Title". The titles are mainly alphabetically ordered with titles such as "Business Analyst" and "Data Analyst" coming first. The 3 last title options in the list are: "Student", "Currently not employed" and "Other".

Okay, so, hear me out on my analysis here:

Since the "Student" option is listed at the bottom of the list of options, a multitude of student respondents could have missed it. This may have led them to choose the first viable option that they came across, which could very well be the title of their studies (e.g. if they are studying Machine Learning Engineering, they could have chosen "Machine Learning Engineer" as their title, instead of choosing "Student"). Further down the road, these respondents will bump into 2 questions that aren't meant for "Student" respondents. These questions are:
- **Q20) In what industry is your current employer/contract (or your most recent employer if retired)?** Having neither an employer nor a contract, the students may have chosen the "Academics/Education" option, since they're enrolled in a university degree and this would be the only logical option to choose in this scenario.
- **Q25) What is your current yearly compensation (approximate \\$USD)?** Here the students may have chosen the range 0-999\\$, since they most likely don't have a job.

This hypothesis could in fact explain why in the data exploration we see that a big proportion of respondents working in Academics/Education seems to have a yearly salary of 0-999\\$.
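
One way to probe this hypothesis is to check how often the employment questions are left unanswered for each title. The sketch below runs on a hypothetical toy table (the rows are made up for illustration); only the column names follow the renamed columns used in this notebook. If the survey really skips these questions for "Student" respondents, the real data should show the same pattern for that title.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from the survey
toy = pd.DataFrame({
    "Title": ["Student", "Student", "Machine Learning Engineer", "Data Scientist"],
    "Age": ["18-21", "22-24", "18-21", "30-34"],
    "Type of employment industry": [None, None, "Academics/Education", "Computers/Technology"],
    "Yearly compensation ($USD)": [None, None, "$0-999", "50,000-59,999"],
})

employment_cols = ["Type of employment industry", "Yearly compensation ($USD)"]

# Share of missing answers to the employment questions, per title
missing_by_title = toy[employment_cols].isna().groupby(toy["Title"]).mean()
print(missing_by_title)
```

A share of 1.0 for "Student" means that these respondents never answered the question, while any young respondent with another title and a 0-999\\$ answer is a candidate for the mislabelled students described above.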

# 4. Conclusion

To sum it up, my investigation suggests 2 main points that may help us understand the large number of respondents that chose a yearly salary of 0-999\\$:
- First off, approximately 23% of respondents that chose a yearly salary of 0-999\\$ come from countries characterized by a low GDP per capita, such as Pakistan, Nigeria, Egypt and Indonesia. Here, this option could very well reflect the truth.
- Secondly, a considerable number of students could be behind the over-representation of the 0-999\\$ answer. These would be students mistakenly choosing their major as their title, rather than the "Student" title, and consequently being faced with questions about their employer industry and salary range, questions that aren't asked to respondents choosing the "Student" title.

There is nothing to address in the first point. The remainder of the respondents choosing the 0-999\\$ option is probably a mix of respondents actually earning that amount, respondents whose answer I couldn't find a viable explanation for, and some noise. This leaves us with the second point.

**In order to avoid such ambiguities, I would recommend that Kaggle place the "Student" option at the top of the choices list for the "Title" question. This would reduce the confusion and allow students to directly spot their appropriate title, without the risk of bumping into questions that aren't meant to be answered by students.**

**Given these results, and in order to have cleaner data, I would recommend that all Kagglers interested in working with the yearly salaries remove from the dataset the respondents that belong to the "18-21" or "22-24" age groups and work in the Academics/Education industries.**
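
A minimal sketch of this filter is shown below. The helper `drop_suspected_students` and the toy rows are hypothetical and only illustrate the idea; the column names follow the renamed columns used in this notebook.

```python
import pandas as pd

def drop_suspected_students(data: pd.DataFrame) -> pd.DataFrame:
    """Drop respondents aged 18-24 who report working in Academics/Education."""
    suspected = (
        data["Age"].isin(["18-21", "22-24"])
        & (data["Type of employment industry"] == "Academics/Education")
    )
    return data[~suspected]

# Hypothetical toy rows, NOT taken from the survey
toy = pd.DataFrame({
    "Age": ["18-21", "22-24", "18-21", "30-34"],
    "Type of employment industry": [
        "Academics/Education",
        "Academics/Education",
        "Computers/Technology",
        "Academics/Education",
    ],
    "Yearly compensation ($USD)": ["$0-999", "$0-999", "1,000-1,999", "$0-999"],
})

cleaned = drop_suspected_students(toy)
print(f"{len(toy) - len(cleaned)} rows removed, {len(cleaned)} rows kept")
```

Note that this filter is deliberately blunt: it also removes young respondents who genuinely work in Academics/Education, so it trades some real data for a cleaner salary distribution.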

---

That's it for this notebook. It was originally planned to be published much sooner so that my findings could help as many people as possible, but many circumstances slowed me down. In any case, my notebook is here, and I hope that you enjoyed it and benefited from it.

### Don't forget to give this notebook an upvote if you found it useful 😊

## Happy Kaggling!

# 5. Acknowledgements

- https://www.imf.org/external/datamapper/NGDPDPC@WEO/OEMDC/ADVEC/WEOWORLD
- https://www.worldometers.info/gdp/gdp-per-capita/
- http://www.differencebetween.net/
- https://www.freepik.com/