# Introduction
#### Hello everyone, my name is Eshna! I am new to the field of data science and data analysis. This is my first time joining a competition and my first time using Plotly. :)
#### My goal is to learn more about the demographics of the participants from Kaggle's Machine Learning and Data Science 2021 Survey.

# Imports and Reading in CSV

In [None]:
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import numpy as np
import pandas as pd

In [None]:
# Skip the row right after the header (row index 1); the Q columns will be renamed later
df = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', skiprows=[1])
df.head()

# Part 1: Let's clean up some of the data!
#### I will rename several of the columns and include an additional "Kagglers" column to help with my data analysis.

In [None]:
# Rename columns with demographic data to better labels
new_names = {'Time from Start to Finish (seconds)': 'Time(s)', 'Q1': 'Age', 'Q2': 'Gender', 'Q3': 'Country',
             'Q4': 'Education', 'Q5': 'Job', 'Q6': 'Coding_Exp'}
df = df.rename(columns=new_names)

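# Add a Kagglers helper column (one non-null value per row) so that
# groupby(...).count() on it gives the number of respondents per group
df.insert(loc=0, column='Kagglers', value=range(df.shape[0]))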

# Part 2: What are the maximum and minimum times taken to complete the survey?

In [None]:
print("The maximum time is: {} seconds".format(df['Time(s)'].max()))

**The longest completion time was roughly a whole February (28-29 days)!**

In [None]:
print("The minimum time is: {} seconds".format(df['Time(s)'].min()))

**The shortest completion time was about the recommended time to brush your teeth (2 minutes). :)**
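To double-check comparisons like these, the raw seconds can be converted into more readable units. Here is a minimal sketch using `pd.to_timedelta`. The two inputs are illustrative stand-ins (exactly 28 days and exactly 2 minutes), not the actual survey values, and `describe_duration` is a hypothetical helper name that is not part of the notebook above.

```python
import pandas as pd

def describe_duration(seconds):
    """Split a duration given in seconds into (days, hours, minutes)."""
    td = pd.to_timedelta(seconds, unit='s')
    parts = td.components  # named tuple with days, hours, minutes, ...
    return parts.days, parts.hours, parts.minutes

# Illustrative values only: 28 * 86400 seconds and 2 * 60 seconds
print(describe_duration(2_419_200))
print(describe_duration(120))
```

In the notebook, the same helper could be called as `describe_duration(df['Time(s)'].max())`.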

# Part 3: Gender Distribution and Coding Experience
#### What is the gender distribution of the survey takers? What is the coding experience distribution per gender category?

In [None]:
# Map gender values into different categories for sunburst chart
gender_dict = {"Man": "Man", "Woman": "Woman", "Nonbinary": "Nonbinary/Other",
               "Prefer not to say": "Nonbinary/Other",
               "Prefer to self-describe": "Nonbinary/Other"}
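# Assign the result back: replace(..., inplace=True) on a selected column
# may not update df in newer pandas versions
df['Gender'] = df['Gender'].replace(gender_dict)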

# Update coding experience label(s) for sunburst chart
df['Coding_Exp'] = df['Coding_Exp'].replace('I have never written code', 'None')


# What is the gender distribution of the survey takers?
# What is the coding experience distribution per gender category?
_df_ = df.groupby(['Gender','Coding_Exp']).count().reset_index()

fig = px.sunburst(_df_, path=['Gender', 'Coding_Exp'], values='Kagglers',
                  title="Distribution of Gender and Coding Experience per Gender")

fig.update_traces(textinfo='label+percent parent')

fig.show()

Using the sunburst chart, we can see that almost **4 out of 5 survey takers** identified as men.

The **majority** of both men and women **(more than 50% each)** reported that they have been coding for 3 years or less!
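The percentages shown in the sunburst chart can also be checked numerically. Below is a minimal sketch on a small made-up DataFrame (toy rows, not survey data). It uses `value_counts(normalize=True)` for the overall gender share and `pd.crosstab(..., normalize='index')` for the coding-experience share within each gender, which corresponds to the "percent parent" labels in the chart.

```python
import pandas as pd

# Toy data for illustration only; the real notebook would use df instead
toy = pd.DataFrame({
    'Gender': ['Man', 'Man', 'Man', 'Man', 'Woman'],
    'Coding_Exp': ['< 1 years', '1-3 years', '1-3 years', '5-10 years', '< 1 years'],
})

# Share of each gender among all rows
gender_share = toy['Gender'].value_counts(normalize=True)

# Share of each coding-experience level within each gender (rows sum to 1)
exp_share = pd.crosstab(toy['Gender'], toy['Coding_Exp'], normalize='index')

print(gender_share)
print(exp_share)
```

Swapping `toy` for `df` should give the figures behind the two statements above, assuming the `Gender` and `Coding_Exp` columns have been renamed as in Part 1.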

# Part 4: Age Distribution and Education Distribution
#### What is the age distribution of survey takers? What is the education distribution of survey takers?

In [None]:
# Update education label(s) for pie chart
# Assign the result back instead of using inplace=True on a selected column
df['Education'] = df['Education'].replace({
    'No formal education past high school': 'High school',
    "Some college/university study without earning a bachelor’s degree": 'College (no degree)',
})


# What is the age distribution of survey takers?
# What is the education distribution of survey takers?
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=("Age Distribution", "Education Distribution"))

_df_ = df.groupby('Age').count().reset_index()
fig.add_trace(go.Pie(labels=_df_['Age'], name='Age',
                     values=_df_['Kagglers'], hole=0.3,
                     hoverinfo='label+name', textinfo='percent',
                     showlegend=False), row=1, col=1)

_df_ = df.groupby('Education').count().reset_index()
fig.add_trace(go.Pie(labels=_df_['Education'], name='Education',
                     values=_df_['Kagglers'], hole=0.3,
                     hoverinfo='label+name', textinfo='percent',
                     showlegend=False), row=1, col=2)

fig.show()

The Age Distribution pie chart shows us that the **majority** of these Kagglers are in early adulthood (18-29 years old).

The Education Distribution pie chart shows us that over **3 out of 4 survey takers** reported a Bachelor's or Master's degree as their education level.
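A combined share like "Bachelor's or Master's" is easy to misread from two separate pie slices, so it can help to compute it directly. Here is a minimal sketch with `Series.isin` on made-up values. The labels are assumptions for illustration, and the real survey labels may use a typographic apostrophe (’), as in the "Some college/university study..." label above.

```python
import pandas as pd

# Toy data for illustration only; label spellings are assumed
education = pd.Series([
    "Bachelor's degree",
    "Master's degree",
    "Master's degree",
    "Doctoral degree",
])

# isin() gives a boolean Series; its mean is the fraction of True values
share = education.isin(["Bachelor's degree", "Master's degree"]).mean()
print(f"Bachelor's or Master's: {share:.0%}")
```

In the notebook, `education` would be replaced by `df['Education']`, with the label strings copied from `df['Education'].unique()`.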

# Part 5: Where did the Kagglers come from?
#### Where did all the Kagglers come from? Which countries had the highest number of survey takers?

In [None]:
# Update country label(s) for choropleth and bar charts
# Assign the result back instead of using inplace=True on a selected column
df['Country'] = df['Country'].replace({
    'I do not wish to disclose my location': 'Undisclosed',
    'Iran, Islamic Republic of...': 'Iran',
})


# Where did all the Kagglers come from?
# Which countries had the highest number of survey takers?
_df_ = df.groupby('Country').count().reset_index()

fig = px.choropleth(_df_, locations='Country', locationmode='country names',
                    color='Kagglers', hover_name='Country',
                    color_continuous_scale='magenta',
                    title="Where did the Kagglers come from? (excludes Other and Undisclosed)")
fig.show()

In [None]:
fig = px.bar(_df_, x='Country', y='Kagglers', color='Kagglers',
             title="Where did the Kagglers come from?",
             color_continuous_scale='magenta', height=700)
fig.show()

**India is definitely the country with the highest count of survey takers! The United States is in second place with only about a third of the number of Kagglers from India.**
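The ranking and the "about a third" comparison can be verified without reading values off the bar chart. Below is a minimal sketch on made-up rows (not the survey counts). `value_counts()` sorts countries from most to least frequent, and dividing two counts gives the ratio. The country labels here are assumptions and should be matched to the spellings in `df['Country']`.

```python
import pandas as pd

# Toy data for illustration only
countries = pd.Series(
    ['India', 'India', 'India', 'United States of America', 'Japan'],
    name='Country',
)

# Counts per country, sorted in descending order by default
counts = countries.value_counts()

top_country = counts.index[0]
ratio = counts['United States of America'] / counts['India']

print(top_country)
print(round(ratio, 2))
```

With the real data, `df['Country'].value_counts().head(10)` would list the ten most common countries.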

#### Thanks for looking at my notebook! I'm still learning more about Plotly and will continue to improve my skills. :)