This is my first ever Kaggle notebook and competition entry, and it is still a work in progress. All comments and feedback are really appreciated, so please comment! Thank you for reading :)  

> **"To restore America's competitiveness, we must recruit a new generation of science and technology leaders by investing in diversity."** - Barack Obama

## 1. Introduction
For a long time, the technology industry has fallen behind other sectors in attracting and employing a diverse workforce ([link](https://www.cnet.com/news/when-it-comes-to-diversity-techs-idealism-keeps-falling-short/)). A 2018 Forbes article states: ["Currently, men hold 76% of technical jobs, and 95% of the tech workforce is white."](https://www.forbes.com/sites/lisawinning/2018/03/13/its-time-to-prioritize-diversity-across-tech/#3722945516f8). There have been many hopes and expectations that diversity will increase across all workplaces, and as a Software Developer myself, I would love this to hold true. 

Obviously, the Kaggle ML and DS Survey results alone are not enough for a full, in-depth investigation. However, that does not mean there is nothing to learn from the data alongside other sources.

In this notebook, using the 2019 DS and ML Kaggle Survey and other sources, I will investigate:
 - **Is diversity increasing or decreasing across the Kaggle community?**
 - **Does this change in diversity correlate with changes in technology trends?** (e.g. popular programming languages, changes in job roles)

## 2. A first look at the Kaggle data
First, let's import the libraries we will use to analyse and visualise the survey data, then read and store the data itself.

In [None]:
# default imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

# visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import missingno as msno
import networkx as nx
import plotly.graph_objects as go

sns.set_palette(sns.color_palette(['#20639B', '#ED553B', '#3CAEA3', '#F5D55C']))


# read input files
question = pd.read_csv('../input/kaggle-survey-2019/questions_only.csv')
schema = pd.read_csv('../input/kaggle-survey-2019/survey_schema.csv')
multiple_choice = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')
other_text =  pd.read_csv('../input/kaggle-survey-2019/other_text_responses.csv')

Let us check which questions were asked in the survey, and therefore which questions we can use to assess the diversity of the Kaggle community.

In [None]:
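from IPython.display import display, HTML

def q_list(s):
    # collect every column belonging to question s (e.g. 'Q18' -> 'Q18_Part_1', ...)
    # matching on 's' or 's_' also works for one-digit questions such as 'Q2'
    lst = [i for i in multiple_choice.columns if i == s or i.startswith(s + '_')]

    df = multiple_choice[lst]

    # row 0 holds the question text; keep only the part after the second '-'
    df_sub = df.iloc[0].apply(lambda x: ''.join(x.split('-')[2:]))
    q = ''.join([f'<li>{i}</li>' for i in df_sub.values])
    display(HTML(f'<h2 style="color:#20639B">{s} : {question.T[0][int(s[1:])]}</h2><ol>{q}</ol>'))
    return df, df_sub

# show the full list of survey questions
q = ''.join([f'<li>{i}</li>' for i in question.T[0][1:]])
display(HTML(f'<h2 style="color:#20639B">Question List</h2><ol>{q}</ol>'))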

The questions we will look at to determine the diversity of the participants are:
> Q1. What is your age (# years)?

> Q2. What is your gender? - Selected Choice

> Q3. In which country do you currently reside?

> Q4. What is the highest level of formal education that you have attained or plan to attain within the next 2 years?




In [None]:
dist = multiple_choice[['Q1', 'Q2', 'Q3', 'Q4']]
dist = dist.rename(columns={"Q1": "Age", "Q2": "Gender", "Q3":"Country", "Q4":"Education"})
dist.drop(0, axis=0, inplace=True)

The questions we will look at later on, focusing on technology trends, are as follows:
> Q5. Select the title most similar to your current role (or most recent title if retired): - Selected Choice

> Q9. Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice

> Q18. What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice

> Q23. For how many years have you used machine learning methods?


## 3. Diversity

### 3.1 2019 Survey Results

First of all, we will look at the gender distribution of the 2019 survey participants. I have chosen to use PyWaffle [link] because it shows the distribution quickly and is easy to read at a glance (each icon represents roughly 1% of survey respondents).

In [None]:
!pip install pywaffle

In [None]:
from pywaffle import Waffle

gender = dist['Gender'].value_counts()

fig = plt.figure(
    FigureClass=Waffle, 
    rows=10,
    columns=10,
    values=gender,
    colors = ('#20639B', '#ED553B', '#3CAEA3', '#F5D55C'),
    title={'label': 'Gender Distribution', 'loc': 'center'},
    labels=["{}({})".format(a, b) for a, b in zip(gender.index, gender) ],
    legend={'loc': 'lower left', 'bbox_to_anchor': (1.3, .5)},
    font_size=25,
    icons='user',
    figsize=(10, 10),  
    icon_legend=True
)
fig.set_facecolor('#EEEEEE')

We can clearly see that approximately 82% of respondents were male this year, 16% were female, 2% preferred not to say and <1% preferred to self-describe.
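To double-check percentages like these rather than reading them off the chart, a small helper along the following lines could be used. This is only a sketch: `percentage_shares` is a hypothetical name that is not part of the notebook above, and the answers in the example are made up purely to show the arithmetic.

```python
from collections import Counter

def percentage_shares(answers):
    """Return {answer: percentage of all non-missing answers}, largest first."""
    # Keep only real answers: missing values (NaN) in a pandas column are floats, not str
    counts = Counter(a for a in answers if isinstance(a, str))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {answer: round(100 * n / total, 1) for answer, n in counts.most_common()}

# Made-up answers for illustration only (not survey data)
toy_answers = ['A'] * 8 + ['B'] * 2
shares = percentage_shares(toy_answers)
print(shares)
```

In the notebook this could be called as `percentage_shares(dist['Gender'])`, since iterating over a pandas Series yields its values.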

Next we will look at the country the participants currently reside in:

In [None]:
dist['Country']=dist['Country'].replace({'United States of America':'USA',
                                     'United Kingdom of Great Britain and Northern Ireland':'UK'})
plt.figure(figsize=(10,8))
sns.countplot(x='Country',
              data=dist,
              order=dist['Country'].value_counts().index)
plt.xticks(rotation=90)
plt.ylabel('Number of participants')
plt.title('Country wise distribution in Survey')

plt.show()

The largest groups of respondents are from India and the USA.
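To see how much of the whole survey the top countries account for together, the combined share can be computed directly. The sketch below is an assumption of how this might be done: `top_n_share` is a hypothetical helper, and the example labels and counts are invented for illustration.

```python
from collections import Counter

def top_n_share(answers, n=2):
    """Return the n most common answers and their combined share (in %) of all answers."""
    # Ignore missing values (NaN), which are floats rather than strings
    counts = Counter(a for a in answers if isinstance(a, str))
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(n)
    combined = 100 * sum(count for _, count in top) / total
    return [name for name, _ in top], round(combined, 1)

# Made-up answers for illustration only (not survey data)
toy_answers = ['A'] * 5 + ['B'] * 3 + ['C'] * 2 + ['D'] * 2
names, share = top_n_share(toy_answers, n=2)
print(names, share)
```

With the notebook's data this would be `top_n_share(dist['Country'], n=2)`. A combined share above 50% would mean the top two countries really do make up a majority of respondents.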

And now the age distribution:

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 5))

sns.set_palette(sns.color_palette(['#20639B', '#ED553B', '#3CAEA3', '#F5D55C']))

sns.countplot(x='Age', hue='Gender', data=dist, 
              order = dist['Age'].value_counts().sort_index().index, 
              ax=ax )

plt.title('Age & Gender Distribution', size=15)
plt.show()

Looking at the age and gender distributions side by side, we see that the majority of the community is young (below 35), and this is true for all genders. To investigate this more fully, let's plot the ratio of men to women in each age bracket.

In [None]:
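fig, ax = plt.subplots(1, 1, figsize=(20, 5))

sns.set_palette(sns.color_palette(['#20639B', '#ED553B', '#3CAEA3', '#F5D55C']))

# respondents per age bracket, with one column per gender
ageRatio = dist.groupby(['Age', 'Gender']).size().unstack(fill_value=0)
# men per woman in each age bracket
ageRatio['Ratio'] = ageRatio['Male'] / ageRatio['Female']
sns.barplot(x='Age', y='Ratio', data=ageRatio.reset_index(),
            ax=ax )

ax.set_ylabel('Men per woman')
plt.title('Age & Gender Ratio', size=15)
plt.show()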

In [None]:
ageRatio = dist.groupby(['Age', 'Gender']).size()
ageRatio

# dist.groupby(['Age', 'Gender']).size().plot.barh()

# import matplotlib.pyplot as plt

# for title, group in ageRatio:
#     group.plot(x='Age', y='Gender', title=title)

### 3.2 2017/18 Survey Results

In [None]:
#Importing the 2017 Dataset
df_2017=pd.read_csv('../input/kaggle-survey-2017/multipleChoiceResponses.csv',encoding='ISO-8859-1')
df_2017

gender_count_2017 = df_2017['GenderSelect'].value_counts(sort=True)

fig = plt.figure(
    FigureClass=Waffle, 
    rows=10,
    columns=10,
    values=gender_count_2017,
    colors = ('#20639B', '#ED553B', '#3CAEA3', '#F5D55C'),
    title={'label': 'Gender Distribution 2017', 'loc': 'center'},
    labels=["{}({})".format(a, b) for a, b in zip(gender_count_2017.index, gender_count_2017) ],
    legend={'loc': 'lower left', 'bbox_to_anchor': (1.3, .5)},
    font_size=25,
    icons='user',
    figsize=(10, 10),  
    icon_legend=True
)
fig.set_facecolor('#EEEEEE')

In [None]:
#Importing the 2018 Dataset
df_2018 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv')
df_2018.columns = df_2018.iloc[0]
df_2018=df_2018.drop([0])

gender_count_2018 = df_2018['What is your gender? - Selected Choice'].value_counts(sort=True)

fig = plt.figure(
    FigureClass=Waffle, 
    rows=10,
    columns=10,
    values=gender_count_2018,
    colors = ('#20639B', '#ED553B', '#3CAEA3', '#F5D55C'),
    title={'label': 'Gender Distribution 2018', 'loc': 'center'},
    labels=["{}({})".format(a, b) for a, b in zip(gender_count_2018.index, gender_count_2018) ],
    legend={'loc': 'lower left', 'bbox_to_anchor': (1.3, .5)},
    font_size=25,
    icons='user',
    figsize=(10, 10),  
    icon_legend=True
)
fig.set_facecolor('#EEEEEE')

Unfortunately, it appears that the proportion of women in the Kaggle surveys has remained almost constant over the last three years. This is not what I was expecting or hoping to see in the data, but it is what it is. Are there any possible reasons for this? Can we find out anything else by looking more closely at the data?
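To put a number on "almost constant", the share of one answer can be computed per year and compared side by side. The following is a minimal sketch: `share_by_year` is a hypothetical helper, the example answers are invented, and it assumes the answer is spelled `'Female'` in all three survey files.

```python
def share_by_year(answers_by_year, label='Female'):
    """Return {year: percentage of non-missing answers equal to `label`}."""
    shares = {}
    for year, answers in answers_by_year.items():
        # Drop missing values (NaN), which are floats rather than strings
        valid = [a for a in answers if isinstance(a, str)]
        if valid:
            shares[year] = round(100 * valid.count(label) / len(valid), 1)
    return shares

# Made-up answers for illustration only (not survey data)
toy_years = {
    'year 1': ['Female'] * 1 + ['Male'] * 4,
    'year 2': ['Female'] * 2 + ['Male'] * 8,
}
yearly = share_by_year(toy_years)
print(yearly)
```

With the data loaded above, the input could be `{2017: df_2017['GenderSelect'], 2018: df_2018['What is your gender? - Selected Choice'], 2019: dist['Gender']}`.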

## 5. Tech Trends

What I love most about working in Technology is how fast everything changes, which means we all get to keep learning forever!! 😁 Recently the following amazing video went viral (in my circles anyway) and showed everyone how much and how fast the tech industry changes (please click the image for the video):


[![Most Popular Programming Languages 1965 - 2019](https://img.youtube.com/vi/Og847HVwRSI/0.jpg)](https://youtu.be/Og847HVwRSI "Most Popular Programming Languages 1965 - 2019")




Let's explore how programming languages have changed over the years in Kaggle's surveys.

As we mentioned earlier the questions from the survey regarding tech trends we will look into are:
> Q5. Select the title most similar to your current role (or most recent title if retired): - Selected Choice

> Q18. What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice

> Q23. For how many years have you used machine learning methods?

In [None]:
# row 0 holds the question text, so drop it
# (a non-inplace drop avoids pandas' SettingWithCopyWarning on the column slice)
role_2019 = multiple_choice[['Q5']].drop(0, axis=0)

# note: the question list above names Q18, but this cell reads the Q19 column
language_2019 = multiple_choice[['Q19']].drop(0, axis=0)

experience_2019 = multiple_choice[['Q23']].drop(0, axis=0)

In [None]:
plt.figure(figsize=(10,8))

sns.countplot(x='Q5',
              data=role_2019)
plt.xticks(rotation=90)
plt.ylabel('Number of participants')
plt.title('Role distribution in Survey')
plt.show()

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x='Q19',
              data=language_2019)
plt.xticks(rotation=90)
plt.ylabel('Number of participants')
plt.title('Programming language distribution in Survey (Q19)')

plt.show()

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x='Q23',
              data=experience_2019)
plt.xticks(rotation=90)
plt.ylabel('Number of participants')
plt.title('Machine learning experience distribution in Survey')

plt.show()

## 6. Do we think changes in tech trends and changes in diversity correlate?

## 7. Conclusions

Thank you very much for looking at my first Kaggle notebook. I hope you enjoyed it and learnt or saw something new! Any and all feedback would be hugely appreciated :)