# Introduction

This notebook contains several visualizations which provide insights into the Kaggle 2021 survey data, on a global level. 🗺 🌎

This notebook looks at trends in Data Science across the continents, mainly through the colorful scatterplots that are its highlight.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import os # for interacting with the OS

The above code cell loads important libraries and packages.

In [None]:
df = pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
df['Q3'] = df['Q3'].replace(['Viet Nam'],'Vietnam')
df

The above code cell loads and cleans the dataset, before displaying its head and tail. The number of rows shows that 25,973 respondents from around the world were included in the dataset.
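
The survey file stores the question text in its first data row, which is why the later cells slice with `df.loc[1:]`. As a minimal sketch, that row could be separated once, right after loading. The small frame below is made up for illustration, and the names `raw`, `questions` and `responses` are assumptions rather than part of this notebook.

```python
import pandas as pd

# Tiny stand-in for the survey file: row 0 holds the question text,
# the remaining rows hold responses (all values here are made up).
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "18-21", "25-29", "18-21"],
    "Q3": ["In which country do you currently reside?", "India", "Viet Nam", "Japan"],
})

questions = raw.iloc[0]                           # question text, keyed by column name
responses = raw.iloc[1:].reset_index(drop=True)   # answers only
responses["Q3"] = responses["Q3"].replace({"Viet Nam": "Vietnam"})

print(len(responses))
print(responses["Q3"].tolist())
```

With the question row split off, counts computed on `responses` would not need the `loc[1:]` slice or an extra `drop` of the question text.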

# Age of the survey participants

In [None]:
df.loc[1:, ['Q1']].value_counts().plot(kind='bar', fontsize=10)
plt.show()

The above code cell plots a bar graph of the number of survey participants in each age group. The plot indicates that most Kagglers are young, falling in the age groups 18-21, 22-24, 25-29 and 30-34 (in years).

In [None]:
age_18_21 = df[df['Q1']=='18-21']['Q3'].value_counts().sort_index().drop(["Other", "I do not wish to disclose my location"]).tolist()
age_22_24 = df[df['Q1']=='22-24']['Q3'].value_counts().sort_index().drop(["Other", "I do not wish to disclose my location"]).tolist()
age_25_29 = df[df['Q1']=='25-29']['Q3'].value_counts().sort_index().drop(["Other", "I do not wish to disclose my location"]).tolist()

Total = df.loc[1:, ['Q3']].value_counts().drop(["Other", "I do not wish to disclose my location"]).sort_index().tolist() 
age_below_30 = (np.array(age_18_21)+np.array(age_22_24)+np.array(age_25_29)).tolist()
age_30_and_above = (np.array(Total)-np.array(age_below_30)).tolist()

plt.figure(figsize=(15, 11))
col=['blue', 'yellow', 'black', 'green', 'red', 'green', 'green', 'yellow', 'yellow', 'yellow', 'red', 'yellow', 'green', 'green', 'yellow', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'red', 'red', 'red', 'red', 'red', 'green', 'red', 'green', 'red', 'red', 'blue', 'red', 'yellow', 'blue', 'red', 'green', 'blue', 'green', 'red', 'yellow', 'red', 'green', 'green', 'green', 'green', 'red', 'red', 'blue', 'red', 'green', 'red', 'green', 'green', 'red', 'red', 'blue', 'red', 'blue', 'green', 'red', 'green', 'yellow', 'red']
plt.scatter(age_30_and_above, age_below_30, s=Total, c=col, alpha=0.5)

plt.xscale('log')
plt.yscale('log')
plt.xlabel('No. of respondents aged 30 and above', fontsize=20)
plt.ylabel('No. of respondents with age below 30', fontsize=20)
plt.title('below 30 vs 30 and above', fontsize=24)

tick_val = [100, 1000, 10000]
tick_lab = ['0.1k', '1k', '10k']

plt.xticks(tick_val, tick_lab, fontsize=24)
plt.yticks(tick_val, tick_lab, fontsize=24)

plt.text(1400, 5200, 'India', fontsize=17)
plt.text(1560, 750, 'USA', fontsize=17)

plt.show()
plt.clf()

**The above code cell plots a scatterplot which has the following description:-**
1. The circles represent the countries in the Kaggle survey data.
2. The size of the circle corresponds to the total number of respondents from the country.
3. The x coordinates of the centres of the circles are the number of respondents who are aged 30 and above from each of the countries.
4. The y coordinates of the centres of the circles are the number of respondents who are aged below 30 from each of the countries.
5. The colors of the circles indicate the continent to which the countries belong: Blue for African countries, Red for Asian countries, Green for European countries, Yellow for the Americas (North and South America) and Black for Oceania.
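
The color list in the cell above is positional: it relies on the countries being in alphabetical order. A possible alternative, sketched below, derives the colors from a country-to-continent lookup. The `continent_of` dictionary is a hypothetical helper that covers only five countries for illustration; a full version would need one entry per country in the survey.

```python
# Color per continent, as described in the list above.
continent_color = {
    "Africa": "blue",
    "Asia": "red",
    "Europe": "green",
    "Americas": "yellow",
    "Oceania": "black",
}

# Hypothetical lookup for a few countries only (assumption, not in the notebook).
continent_of = {
    "Algeria": "Africa",
    "Argentina": "Americas",
    "Australia": "Oceania",
    "Austria": "Europe",
    "Bangladesh": "Asia",
}

countries = sorted(continent_of)  # alphabetical, like sort_index() on the counts
col = [continent_color[continent_of[c]] for c in countries]
print(col)  # ['blue', 'yellow', 'black', 'green', 'red']
```

Built this way, the list stays in step with the data if a country is added or dropped.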

**The scatterplot helps us gain the following insights:-**
1. European, American, Oceanic and a few Asian countries have more respondents aged 30 and above than most Asian and African countries with the same total number of respondents.
2. African and most Asian countries have more respondents aged below 30 than European, American, Oceanic and a few Asian countries with the same total number of respondents.
3. India has the largest number of respondents and the largest number of respondents aged below 30.
4. USA has the second largest number of respondents and the largest number of respondents aged 30 and above.
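
The first two insights compare countries by the balance between the two age bands. One way to put a number on that balance is the share of respondents below 30 in each country. The sketch below uses `pd.crosstab` on a few made-up rows; the frame and the `age_band` labels are illustrative assumptions.

```python
import pandas as pd

# Made-up responses: one row per respondent.
responses = pd.DataFrame({
    "Q3": ["India", "India", "India", "USA", "USA", "Japan"],
    "Q1": ["18-21", "22-24", "35-39", "30-34", "25-29", "40-44"],
})

under_30_groups = ["18-21", "22-24", "25-29"]
age_band = responses["Q1"].isin(under_30_groups).map(
    {True: "below 30", False: "30 and above"}
)

# Rows: countries; columns: the two age bands.
counts = pd.crosstab(responses["Q3"], age_band)
share_under_30 = counts["below 30"] / counts.sum(axis=1)
print(share_under_30.round(2))
```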
    

# Sex of the survey participants

In [None]:
df.loc[1:, ['Q2']].value_counts().plot(kind='bar', fontsize=10)
plt.show()

The above code cell plots a bar graph of the number of survey participants of each sex. More than 75% of the respondents are male, about 20% are female, and around 4% are nonbinary, prefer to self-describe or prefer not to say. The survey is clearly male-dominated.
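
Percentages like these can be read off directly instead of estimated from bar heights. A minimal sketch using `value_counts(normalize=True)` on made-up answers (the sample values are assumptions for illustration):

```python
import pandas as pd

# Made-up answers to Q2, for illustration only.
q2 = pd.Series(["Man", "Man", "Man", "Woman", "Prefer not to say"])

# normalize=True returns fractions that sum to 1 instead of raw counts.
share = q2.value_counts(normalize=True) * 100
print(share.round(1))
```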

# Number of survey participants from each country

In [None]:
df.loc[1:, ['Q3']].value_counts().drop(["Other", "I do not wish to disclose my location"]).plot(kind='bar', figsize=(21,10), fontsize=10)
plt.show()

The above code cell plots a bar graph of the number of survey participants from each country. India clearly has the largest number of respondents, followed by the USA and Japan.

# Level of formal education past high school of the survey participants

In [None]:
df.loc[1:, ['Q4']].value_counts().plot(kind='pie', figsize=(15, 15), fontsize=10)
plt.show()

The above code cell plots a pie chart of the proportions of each level of formal education past high school. Clearly, more than 75% of the respondents are either bachelors or masters degree students.

In [None]:
Masters_degree = df[df["Q4"]=="Master’s degree"]["Q3"].value_counts().drop(["Other", "I do not wish to disclose my location"]).sort_index().tolist()
Bachelors_degree = df[df["Q4"]=="Bachelor’s degree"]["Q3"].value_counts().drop(["Other", "I do not wish to disclose my location"]).sort_index().tolist()
Masters_plus_Bachelors = np.array(Masters_degree)+np.array(Bachelors_degree)

plt.figure(figsize=(15, 11))
col=['blue', 'yellow', 'black', 'green', 'red', 'green', 'green', 'yellow', 'yellow', 'yellow', 'red', 'yellow', 'green', 'green', 'yellow', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'red', 'red', 'red', 'red', 'red', 'green', 'red', 'green', 'red', 'red', 'blue', 'red', 'yellow', 'blue', 'red', 'green', 'blue', 'green', 'red', 'yellow', 'red', 'green', 'green', 'green', 'green', 'red', 'red', 'blue', 'red', 'green', 'red', 'green', 'green', 'red', 'red', 'blue', 'red', 'blue', 'green', 'red', 'green', 'yellow', 'red']
plt.scatter(Masters_degree, Bachelors_degree, s=Masters_plus_Bachelors, c=col, alpha=0.5)

plt.xscale('log')
plt.yscale('log')
plt.xlabel('No. of Masters degree students', fontsize=24)
plt.ylabel('No. of Bachelors degree students', fontsize=24)
plt.title('Bachelors vs Masters ratio', fontsize=24)

tick_val = [100, 1000, 10000, 15000]
tick_lab = ['0.1k', '1k', '10k', '15k']

plt.xticks(tick_val, tick_lab, fontsize=24)
plt.yticks(tick_val, tick_lab, fontsize=24)

plt.text(2000, 3500, 'India', fontsize=17)
plt.text(1050, 650, 'USA', fontsize=17)

plt.show()
plt.clf()

**The above code cell plots a scatterplot which has the following description:-**
1. The circles represent the countries in the Kaggle survey data.
2. The size of a circle corresponds to the total number of masters and bachelors degree students from a country in the Kaggle survey data.
3. The x coordinates of the centres of the circles are the number of respondents from the countries who are masters degree students.
4. The y coordinates of the centres of the circles are the number of respondents from the countries who are bachelors degree students.
5. The colors of the circles indicate the continent to which the countries belong: Blue for African countries, Red for Asian countries, Green for European countries, Yellow for the Americas (North and South America) and Black for Oceania.

**The scatterplot helps us gain the following insights:-**
1. European, Oceanic and a few Asian and American countries have more respondents who are masters degree students than African and most Asian and American countries with the same total number of respondents.
2. African and most Asian and American countries have more respondents who are bachelors degree students than European, Oceanic and a few Asian and American countries with the same total number of respondents.
3. India has the largest total number of respondents and the largest number of respondents for both bachelors and masters degree students.
4. USA has the second largest total number of respondents and the second largest number of respondents for both masters and bachelors degree students.

# Professions of the survey participants

In [None]:
df.loc[1:, ['Q5']].value_counts().plot(kind='pie', figsize=(15, 15), fontsize=10)
plt.show()

Students comprise more than 25% of the Kaggle survey respondents. People in data-centered professions such as Data Scientist, Machine Learning Engineer and Data Analyst also make up a significant proportion of the above pie chart.

In [None]:
Students = df[df["Q5"]=="Student"]["Q3"].value_counts().drop(["Other", "I do not wish to disclose my location"]).sort_index().tolist()
Data_Scientists = df[df["Q5"]=="Data Scientist"]["Q3"].value_counts().drop(["Other", "I do not wish to disclose my location"]).sort_index().tolist()
Students_plus_Data_Scientists = np.array(Students)+np.array(Data_Scientists)

plt.figure(figsize=(15, 11))
col=['blue', 'yellow', 'black', 'green', 'red', 'green', 'green', 'yellow', 'yellow', 'yellow', 'red', 'yellow', 'green', 'green', 'yellow', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'red', 'red', 'red', 'red', 'red', 'green', 'red', 'green', 'red', 'red', 'blue', 'red', 'yellow', 'blue', 'red', 'green', 'blue', 'green', 'red', 'yellow', 'red', 'green', 'green', 'green', 'green', 'red', 'red', 'blue', 'red', 'green', 'red', 'green', 'green', 'red', 'red', 'blue', 'red', 'blue', 'green', 'red', 'green', 'yellow', 'red']
plt.scatter(Students, Data_Scientists, s=Students_plus_Data_Scientists, c=col, alpha=0.5)

plt.xscale('log')
plt.yscale('log')
plt.xlabel('No. of Students', fontsize=24)
plt.ylabel('No. of Data Scientists', fontsize=24)
plt.title('Data Scientists vs Students ratio', fontsize=24)

tick_val = [100, 1000, 10000, 15000]
tick_lab = ['0.1k', '1k', '10k', '15k']

plt.xticks(tick_val, tick_lab, fontsize=24)
plt.yticks(tick_val, tick_lab, fontsize=24)

plt.text(2250, 800, 'India', fontsize=17)
plt.text(400, 400, 'USA', fontsize=17)

plt.show()
plt.clf()

**The above code cell plots a scatterplot which has the following description:-**
1. The circles represent the countries in the Kaggle survey data.
2. The size of a circle corresponds to the total number of Data Scientists and Students from a country in the Kaggle survey data.
3. The x coordinates of the centres of the circles are the number of respondents from the countries who are Students.
4. The y coordinates of the centres of the circles are the number of respondents from the countries who are Data Scientists.
5. The colors of the circles indicate the continent to which the countries belong: Blue for African countries, Red for Asian countries, Green for European countries, Yellow for the Americas (North and South America) and Black for Oceania.

**The scatterplot helps us gain the following insights:-**
1. European, Oceanic, American and a few Asian and African countries have more respondents who are Data Scientists than most Asian and African countries with the same total number of Students and Data Scientists.
2. Most African and Asian countries have more respondents who are Students than European, Oceanic, American and a few Asian and African countries with the same total number of Students and Data Scientists.
3. India has the largest total number of Students and Data Scientists and the largest number of respondents for both Students and Data Scientists.
4. USA has the second largest total number of respondents and the second largest number of respondents for both Students and Data Scientists.

# Years of experience of the survey participants

In [None]:
df.loc[1:, ['Q6']].value_counts().plot(kind='pie', figsize=(15, 15), fontsize=10)
plt.show()

An interesting observation here is that survey participants with 1-3 years of coding experience comprise more than 25% of the Kaggle survey respondents and form the largest of all the experience groups.

In [None]:
Never = df[df["Q6"]=="I have never written code"]["Q3"].value_counts().reindex(df.Q3.unique(), fill_value=0).drop(["Other", "I do not wish to disclose my location", "In which country do you currently reside?"]).sort_index().tolist()
Less_than_one = df[df["Q6"]=="< 1 years"]["Q3"].value_counts().drop(["Other", "I do not wish to disclose my location"]).sort_index().tolist()
One_to_three = df[df["Q6"]=="1-3 years"]["Q3"].value_counts().drop(["Other", "I do not wish to disclose my location"]).sort_index().tolist()
Three_to_five = df[df["Q6"]=="3-5 years"]["Q3"].value_counts().drop(["Other", "I do not wish to disclose my location"]).sort_index().tolist()
Five_to_ten = df[df["Q6"]=="5-10 years"]["Q3"].value_counts().drop(["Other", "I do not wish to disclose my location"]).sort_index().tolist()
Ten_to_twenty = df[df["Q6"]=="10-20 years"]["Q3"].value_counts().reindex(df.Q3.unique(), fill_value=0).drop(["Other", "I do not wish to disclose my location", "In which country do you currently reside?"]).sort_index().tolist()
Twenty_plus = df[df["Q6"]=="20+ years"]["Q3"].value_counts().reindex(df.Q3.unique(), fill_value=0).drop(["Other", "I do not wish to disclose my location", "In which country do you currently reside?"]).sort_index().tolist()

Never_to_three = np.array(Never)+np.array(Less_than_one)+np.array(One_to_three)
More_than_three = np.array(Three_to_five)+np.array(Five_to_ten)+np.array(Ten_to_twenty)+np.array(Twenty_plus)

plt.figure(figsize=(15, 11))
col=['blue', 'yellow', 'black', 'green', 'red', 'green', 'green', 'yellow', 'yellow', 'yellow', 'red', 'yellow', 'green', 'green', 'yellow', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'red', 'red', 'red', 'red', 'red', 'green', 'red', 'green', 'red', 'red', 'blue', 'red', 'yellow', 'blue', 'red', 'green', 'blue', 'green', 'red', 'yellow', 'red', 'green', 'green', 'green', 'green', 'red', 'red', 'blue', 'red', 'green', 'red', 'green', 'green', 'red', 'red', 'blue', 'red', 'blue', 'green', 'red', 'green', 'yellow', 'red']
plt.scatter(Never_to_three, More_than_three, s=(Never_to_three+More_than_three).tolist(), c=col, alpha=0.5)

plt.xscale('log')
plt.yscale('log')
plt.xlabel('Beginners to 3 years experience in coding', fontsize=24)
plt.ylabel('More than 3 years experience in coding', fontsize=24)
plt.title('More than 3 years of experience vs up to 3 years of experience', fontsize=24)

tick_val = [100, 1000, 10000, 15000]
tick_lab = ['0.1k', '1k', '10k', '15k']

plt.xticks(tick_val, tick_lab, fontsize=24)
plt.yticks(tick_val, tick_lab, fontsize=24)

plt.text(4350, 2000, 'India', fontsize=17)
plt.text(900, 1500, 'USA', fontsize=17)

plt.show()
plt.clf()

**The above code cell plots a scatterplot which has the following description:-**
1. The circles represent the countries in the Kaggle survey data.
2. The size of a circle corresponds to the total number of respondents from a country in the Kaggle survey data.
3. The x coordinates of the centres of the circles are the number of respondents from the countries who have up to 3 years of experience in coding.
4. The y coordinates of the centres of the circles are the number of respondents from the countries who have more than 3 years of experience in coding.
5. The colors of the circles indicate the continent to which the countries belong: Blue for African countries, Red for Asian countries, Green for European countries, Yellow for the Americas (North and South America) and Black for Oceania.
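
Adding the per-group counts element by element only works when every list covers the same countries in the same order. The cell above handles groups that are absent from some countries with `reindex(..., fill_value=0)`. A minimal sketch of that pattern as a reusable helper follows; `counts_by_country` is a hypothetical name and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up responses; Japan has no "20+ years" respondent here.
responses = pd.DataFrame({
    "Q3": ["India", "India", "Japan", "USA", "USA"],
    "Q6": ["1-3 years", "20+ years", "1-3 years", "20+ years", "1-3 years"],
})
all_countries = sorted(responses["Q3"].unique())

def counts_by_country(frame, column, value, countries):
    """Count rows where frame[column] == value, one entry per country."""
    counts = frame.loc[frame[column] == value, "Q3"].value_counts()
    # reindex adds countries missing from this subset with a count of 0
    return counts.reindex(countries, fill_value=0).to_numpy()

one_to_three = counts_by_country(responses, "Q6", "1-3 years", all_countries)
twenty_plus = counts_by_country(responses, "Q6", "20+ years", all_countries)
print(all_countries)
print(one_to_three + twenty_plus)
```

Because every array is built against the same `all_countries` list, the sums and the color list line up country by country.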

**The scatterplot helps us gain the following insights:-**
1. European, Oceanic, American and a few Asian and African countries have more respondents with more than 3 years of coding experience than most Asian and African countries with the same total number of respondents.
2. Most African and Asian countries have more respondents with up to 3 years of coding experience than European, Oceanic, American and a few Asian and African countries with the same total number of respondents.
3. India has the largest total number of respondents, and the largest number of respondents both with more than 3 years of coding experience and with up to 3 years of coding experience.
4. USA has the second largest total number of respondents, and the second largest number of respondents both with more than 3 years of coding experience and with up to 3 years of coding experience.

# Programming languages of choice

In [None]:
fig, ax = plt.subplots()
Lang = ['Python', 'R', 'SQL', 'C', 'C++', 'Java', 'JavaScript', 'Julia', 'Swift', 'Bash', 'MATLAB', 'None', 'Other']
x = np.arange(len(Lang))
df.loc[1:, ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5', 'Q7_Part_6', 'Q7_Part_7', 'Q7_Part_8', 'Q7_Part_9', 'Q7_Part_10', 'Q7_Part_11', 'Q7_Part_12', 'Q7_OTHER']].notnull().astype('int').sum().plot(kind='bar', ax=ax)
ax.set_ylabel('No. of users')
ax.set_xlabel('Languages')
ax.set_title('Languages used on a regular basis by respondents')
ax.set_xticks(x)
ax.set_xticklabels(Lang)
plt.show()

The above code cell plots a bar graph corresponding to the programming languages the respondents use on a regular basis.
It is important to note that this graph is based on a multiple choice question and respondents may have chosen more than one language for their regular use.
Most respondents use Python regularly, followed by SQL, C++ and R. SQL has approximately half as many users as Python, while C++ and R each have approximately half as many users as SQL.
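
For a multiple choice question stored as one column per option, counting the non-null cells in each column gives the number of respondents who picked that option. The sketch below shows this on made-up answers and also takes the bar labels from the columns' own values, so they cannot drift out of order. The three-column frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Made-up multi-select answers: each column is one option, and a cell holds
# the option name only if the respondent ticked it.
answers = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", np.nan, "Python"],
    "Q7_Part_2": [np.nan, "R", np.nan, np.nan],
    "Q7_Part_3": ["SQL", np.nan, "SQL", np.nan],
})

# Non-null cells per column = number of respondents who chose that option.
users = answers.notnull().sum()

# Label each count with the option name found in its column.
users.index = [answers[c].dropna().iloc[0] for c in answers.columns]
print(users)
```

Since one respondent can tick several options, these counts can add up to more than the number of respondents.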

In [None]:
Python = df[df['Q7_Part_1']=='Python']['Q3'].value_counts().sort_index().drop(["Other", "I do not wish to disclose my location"]).tolist()
SQL = df[df['Q7_Part_3']=='SQL']['Q3'].value_counts().sort_index().drop(["Other", "I do not wish to disclose my location"]).tolist()

plt.figure(figsize=(15, 11))
col=['blue', 'yellow', 'black', 'green', 'red', 'green', 'green', 'yellow', 'yellow', 'yellow', 'red', 'yellow', 'green', 'green', 'yellow', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'red', 'red', 'red', 'red', 'red', 'green', 'red', 'green', 'red', 'red', 'blue', 'red', 'yellow', 'blue', 'red', 'green', 'blue', 'green', 'red', 'yellow', 'red', 'green', 'green', 'green', 'green', 'red', 'red', 'blue', 'red', 'green', 'red', 'green', 'green', 'red', 'red', 'blue', 'red', 'blue', 'green', 'red', 'green', 'yellow', 'red']
plt.scatter(Python, SQL, s=Total, c=col, alpha=0.5)

plt.xscale('log')
plt.yscale('log')
plt.xlabel('No. of respondents using Python regularly', fontsize=20)
plt.ylabel('No. of respondents using SQL regularly', fontsize=20)
plt.title('SQL vs Python', fontsize=24)

tick_val = [100, 1000, 10000]
tick_lab = ['0.1k', '1k', '10k']

plt.xticks(tick_val, tick_lab, fontsize=24)
plt.yticks(tick_val, tick_lab, fontsize=24)

plt.show()
plt.clf()

**The above code cell plots a scatterplot which has the following description:-**
1. The circles represent the countries in the Kaggle survey data.
2. The size of a circle corresponds to the total number of respondents from a country in the Kaggle survey data.
3. The x coordinates of the centres of the circles are the number of respondents from the countries who use Python regularly.
4. The y coordinates of the centres of the circles are the number of respondents from the countries who use SQL regularly.
5. The colors of the circles indicate the continent to which the countries belong: Blue for African countries, Red for Asian countries, Green for European countries, Yellow for the Americas (North and South America) and Black for Oceania.

**The scatterplot helps us gain the following insights:-**
1. The roughly linear pattern of the scatterplot suggests that the relative use of Python and SQL is more or less the same across the continents, with both counts growing with the total number of respondents from a country.

2. India has the largest number of respondents using Python regularly and the largest number of respondents using SQL regularly.

3. USA has the second largest number of respondents using Python and the second largest number of respondents using SQL regularly.
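
The "linear" impression from a log-log plot can be checked numerically. The sketch below computes the correlation of the log counts and the SQL-to-Python ratio per country. The counts are made up for illustration and are not taken from the survey.

```python
import numpy as np

# Made-up per-country counts of regular Python and SQL users.
python_users = np.array([100, 400, 1500, 6000])
sql_users = np.array([45, 190, 700, 2900])

# Pearson correlation of the log counts: a value close to 1 means the
# points lie near a straight line on the log-log plot.
r = np.corrcoef(np.log10(python_users), np.log10(sql_users))[0, 1]

# SQL users per Python user in each country.
ratio = sql_users / python_users
print(round(r, 3))
print(ratio.round(2))
```

A ratio that stays roughly constant across countries corresponds to a line of slope 1 on the log-log plot, which is what "the same relative use" means here.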

In [None]:
Python = df[df['Q7_Part_1']=='Python']['Q3'].value_counts().sort_index().drop(["Other", "I do not wish to disclose my location"]).tolist()
R = df[df['Q7_Part_2']=='R']['Q3'].value_counts().sort_index().drop(["Other", "I do not wish to disclose my location"]).tolist()

plt.figure(figsize=(15, 11))
col=['blue', 'yellow', 'black', 'green', 'red', 'green', 'green', 'yellow', 'yellow', 'yellow', 'red', 'yellow', 'green', 'green', 'yellow', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'red', 'red', 'red', 'red', 'red', 'green', 'red', 'green', 'red', 'red', 'blue', 'red', 'yellow', 'blue', 'red', 'green', 'blue', 'green', 'red', 'yellow', 'red', 'green', 'green', 'green', 'green', 'red', 'red', 'blue', 'red', 'green', 'red', 'green', 'green', 'red', 'red', 'blue', 'red', 'blue', 'green', 'red', 'green', 'yellow', 'red']
plt.scatter(Python, R, s=Total, c=col, alpha=0.5)

plt.xscale('log')
plt.yscale('log')
plt.xlabel('No. of respondents using Python regularly', fontsize=20)
plt.ylabel('No. of respondents using R regularly', fontsize=20)
plt.title('R vs Python', fontsize=24)

tick_val = [100, 1000, 10000]
tick_lab = ['0.1k', '1k', '10k']

plt.xticks(tick_val, tick_lab, fontsize=24)
plt.yticks(tick_val, tick_lab, fontsize=24)

plt.show()
plt.clf()

**The above code cell plots a scatterplot which has the following description:-**
1. The circles represent the countries in the Kaggle survey data.
2. The size of a circle corresponds to the total number of respondents from a country in the Kaggle survey data.
3. The x coordinates of the centres of the circles are the number of respondents from the countries who use Python regularly.
4. The y coordinates of the centres of the circles are the number of respondents from the countries who use R regularly.
5. The colors of the circles indicate the continent to which the countries belong: Blue for African countries, Red for Asian countries, Green for European countries, Yellow for the Americas (North and South America) and Black for Oceania.

**The scatterplot helps us gain the following insights:-**
1. Oceanic and American countries have more respondents using R regularly relative to most Asian, African and European countries having the same total number of respondents.

2. Asia, Africa and Europe have both countries with a greater relative use of R and countries with a greater relative use of Python.

3. India has the largest number of respondents using Python regularly and the largest number of respondents using R regularly.

4. USA has the second largest number of respondents using Python and the second largest number of respondents using R regularly.

In [None]:
df.loc[1:, ['Q8']].value_counts().plot(kind='bar')
plt.show()

The above code cell plots a bar graph of the languages which the respondents would recommend learning first. Python is clearly the most recommended first language and overshadows all the others.