# Data Visualisation



## Introduction


>**E**xploratory **D**ata **A**nalysis (EDA) is a critical precursor to model application. As the name implies, it is about exploring the data and checking that a dataset is clean and free of missing values. Perhaps most interesting, however, is the ability to apply visualisation techniques to our data to understand the underlying relationships between the variables provided.

Notably, _not all_ the problems you will come across require the use of a model. Perhaps the task at hand is 'simply' to provide visualisations and identify interesting facts that would be hard to spot through non-visual analyses.
When you do use a model, it is important to formulate a hypothesis first, because identifying the intended target determines which parts of the data you explore. Here, however, we will focus on visualising data. To see why this matters, consider the image below, which shows something known as *Anscombe's Quartet*:

<p align=center><img src="https://www.researchgate.net/profile/Arch_Woodside2/publication/285672900/figure/fig4/AS:305089983074309@1449750528742/Anscombes-quartet-of-different-XY-plots-of-four-data-sets-having-identical-averages.png">

([source](https://www.researchgate.net/publication/285672900_The_general_theory_of_culture_entrepreneurship_innovation_and_quality-of-life_Comparing_nurturing_versus_thwarting_enterprise_start-ups_in_BRIC_Denmark_Germany_and_the_United_States)) </p>


## A Case in Point
Anscombe's Quartet exemplifies the importance of data visualisation. The image shows that the summary statistics (e.g. mean and variance) are practically the same for all four datasets. However, as can be observed, the relationship between x and y is considerably different in each one. Had we not visualised our data, we would not have been able to identify these relationships so easily.
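
To make the point concrete, the sketch below computes the summary statistics of the four datasets using only the standard library. The values are Anscombe's published data; the helper `pearson` and the `summary` dictionary are names introduced here for illustration.

```python
from statistics import mean, variance

# Anscombe's four datasets; the first three share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


summary = {
    name: {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),  # sample variance
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four rows print almost the same numbers, yet the scatter plots look nothing alike, which is exactly why the plots are needed.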

Data cleaning and addressing the problem of missing data fall under the EDA umbrella, which is a cyclical process comprising data exploration, issue identification, data cleaning and data exploration once more.

## Data-Visualisation Demonstration
We will be working with the 'multiple_choice_responses.csv' file from the [2019 Kaggle ML & DS Survey](https://www.kaggle.com/c/kaggle-survey-2019/data?select=multiple_choice_responses.csv), which is a 35-question survey of Kaggle users on the state of data science and ML. As indicated in its abstract, the survey received 19,717 usable responses from 171 countries and territories. Countries or territories with fewer than 50 respondents were grouped under 'Other' for anonymity. Our task with this dataset is to identify the factors that significantly affect the annual salary of those working in data science and ML (DSML).

In [None]:
## Load the dataset, and return the first few rows.
import pandas as pd
pd.options.display.max_columns = None

df = pd.read_csv("https://aicore-files.s3.amazonaws.com/Data-Science/multiple_choice_responses.csv")
df

In [None]:
## There's a file in the DATA folder called 'questions_only.csv'. Load the dataset, and print all the questions.
q_df = pd.read_csv("https://aicore-files.s3.amazonaws.com/Data-Science/questions_only.csv")
for i, question in enumerate(q_df.iloc[0]):
    print(i, "\t", question)

From this preview, we can note the following:
- There are many questions (a considerable volume of data to analyse).
- Some of the questions allow for multiple inputs. For these questions, the header row/column names have `_` appended to them, followed by some text.
  - If the text is `OTHER_TEXT`, then it appears that the categorical question is followed by a free-text field that gives the respondent the option to elaborate. It appears as though -1 indicates that the respondent did not write anything.
  - If the text is `Part_N`, then it appears that it is a checkbox question (i.e. tick all that apply), with one column per option.
  - The two suffixes are not mutually exclusive: a question can have both.
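
The naming convention noted above can be checked programmatically. The following is a minimal sketch that classifies a column name by its suffix; `sample_columns`, `COLUMN_PATTERN` and `classify_column` are hypothetical names, and the sample column names are assumed to follow the `Q<number>_<suffix>` pattern seen in the preview.

```python
import re

# Hypothetical sample of column names following the convention described above.
sample_columns = ["Q1", "Q2", "Q2_OTHER_TEXT", "Q9_Part_1", "Q9_Part_2", "Q9_OTHER_TEXT", "Q10"]

# "Q", the question number, then an optional "_suffix".
COLUMN_PATTERN = re.compile(r"^Q(\d+)(?:_(.+))?$")


def classify_column(name):
    """Return (question_number, kind) for a survey column name."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None, "unknown"
    number, suffix = int(match.group(1)), match.group(2)
    if suffix is None:
        kind = "single choice"
    elif suffix.upper() == "OTHER_TEXT":
        kind = "free text"
    elif suffix.upper().startswith("PART_"):
        kind = "checkbox part"
    else:
        kind = "other"
    return number, kind


for column in sample_columns:
    print(column, classify_column(column))
```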
  
## Conducting Analyses
Conducting a column-by-column analysis of this data would be time-intensive. Therefore, we will decide which factors may influence salary and extract the questions that relate to them. This is partly why data science is considered an art: you may have a big dataset but be unsure of where to start the analysis. Based on the intended target for modelling, you need to make an informed guess (i.e. hypothesise) about the factors that may exert a significant influence, which is why domain expertise is important. The more you explore your data with these initial hypotheses, the more you will learn about the wider dataset. The factors we will consider are:
- Salary (target)
- Age
- Gender
- Residence
- Education
- Job role/experience
- Programming languages
- ML frameworks

From this list, we will extract these questions:
**Q:** 1, 2, 3, 4, 5, 9, **10**, 15, 18, 24, 28.

There are a couple of others that are relevant to the analysis; ideally, we would analyse them as well. However, we have limited time here, and we prioritise teaching you various visualisation techniques while building your intuition as to what to look for in the data.

Some of these questions encompass multiple columns in our dataframe. Extracting the desired, relevant columns is not the most straightforward task. To improve your understanding, attempt to implement something that returns a new dataframe containing the relevant columns. If you are unsure of how to proceed after a couple of minutes, follow the steps below.
<details>
    <summary><b>> Click here for guidance.</b></summary>
    <ul>
        <li>Define a function which loops over a list of integers of the questions we intend to keep.</li>
        <li>For every iteration, determine the number of columns from the current question to the next question in the dataframe (<b>not the next question we intend to extract</b>).</li>
        <li>Extract/concatenate from the current column position to the current column position + 'distance' (probably using the <code>range()</code> function from Python).</li> 
    </ul>
</details>


In [None]:
idx_to_keep = [1,2,3,4,5,9,10,15,18,24,28]

def extract_columns(df, idx_to_keep):
    
    new_df = pd.DataFrame() # empty dataframe
    df_col_list = df.columns.tolist()
    
    for i in idx_to_keep:
        column_name_base = "Q{}".format(i)
        column_index = [df_col_list.index(col_name) for col_name in df_col_list if col_name.startswith(column_name_base)][0]
               
        next_column_name_base = "Q{}".format(i+1)
        next_column_index = [df_col_list.index(col_name) for col_name in df_col_list if col_name.startswith(next_column_name_base)][0]
         
        col_idxs_to_extract = range(column_index, next_column_index)
        relevant_cols_df = df.iloc[:, col_idxs_to_extract]
        
        new_df = pd.concat([new_df, relevant_cols_df], axis=1)
        
    return new_df


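df_orig = df.copy(deep=True)  # keep an untouched copy of the full dataset
df = extract_columns(df_orig, idx_to_keep)
df = df[1:]  # drop the first row, which holds the question text rather than a response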
df
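
Note that `extract_columns` relies on two things: taking the first `startswith` match (a prefix such as `Q1` also matches `Q10`), and the existence of a `Q{i+1}` column. As a hedged alternative, the sketch below selects column names by parsing the question number instead. The column names are a made-up sample and `columns_for_questions` is a hypothetical helper, not part of the notebook.

```python
import re

# Hypothetical column names; the real dataframe has many more.
columns = ["Q1", "Q2", "Q2_OTHER_TEXT", "Q10", "Q18_Part_1", "Q18_Part_2", "Q19"]


def columns_for_questions(columns, question_numbers):
    """Return the column names that belong to the given question numbers, in their original order."""
    wanted = set(question_numbers)
    selected = []
    for name in columns:
        # The number must be followed by "_" or the end of the name, so "Q1" never matches "Q10".
        match = re.match(r"Q(\d+)(?:_|$)", name)
        if match and int(match.group(1)) in wanted:
            selected.append(name)
    return selected


print("Q10".startswith("Q1"))  # prefix matching alone cannot tell Q1 from Q10
print(columns_for_questions(columns, [1, 18]))
```

The returned list could then be used to index the dataframe directly, for example `df_orig[selected]`.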

### Gender analysis
As the first step, we make an arbitrary choice: we start with Gender (Q2). We see that the data here are meant to be categorical; thus, after ensuring that that is the case, we simply plot the frequency of each of the values.

In [None]:
df["Q2"] = df["Q2"].astype("category")
set(df["Q2"])

In [None]:
!pip install plotly

In [None]:
import plotly.express as px
px.histogram(df, "Q2", labels={"Q2": "Gender"}, title="Counts of Gender")  # the labels key must match the plotted column

### Country analysis
Next, we plot the respondents' countries of residence on a world map, shading each country by its number of respondents. This is known as a [choropleth map](https://plotly.com/python/choropleth-maps/) and will require us to change our country names to [three-letter ISO codes](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3).

As the first step, we examine the country column, i.e. Q3, and note that some values require updating to something more conventional.

Subsequently, we load a package to which we can pass a country and have it return the ISO code. Thereafter, we will utilise the new column to plot the choropleth.

In [None]:
set(df["Q3"])

Here are the values that we think require updating:
- Hong Kong (S.A.R.)
- Iran, Islamic Republic of...
- United Kingdom of Great Britain and Northern Ireland
- Viet Nam
- South Korea

Additionally, notice that there is an 'Other' value.

In [None]:
print("Percentage of 'Other':", df["Q3"].value_counts()["Other"]/len(df) * 100)

values_to_update = {"Q3": 
                    {"Hong Kong (S.A.R.)": "Hong Kong",
                     "Iran, Islamic Republic of...": "Iran",
                     "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
                     "South Korea": "Republic of Korea",
                     "Viet Nam": "Vietnam"}}

## Using the replace method, update the values in the relevant column.
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
df.replace(values_to_update, inplace=True)
set(df["Q3"])

In [None]:
!pip install pycountry

In [None]:
import pycountry

## Create a new dataframe which will hold only the unique countries, their country codes, and the number of instances of this country (WITHOUT 'Other').
countries = df["Q3"][df["Q3"]!= "Other"].unique()
countries_df = pd.DataFrame(countries, columns=["Country"])
countries_df["Count"] = countries_df["Country"].map(df["Q3"].value_counts())

## Create a new column in the dataframe containing the ISO country codes.
country_codes = []
for country in countries_df["Country"]:
    country_code = pycountry.countries.search_fuzzy(country)[0] # Take the first element returned from the search
    country_codes.append(country_code.alpha_3)

countries_df["Country Code"] = country_codes
countries_df

In [None]:
px.choropleth(countries_df, locations="Country Code", hover_name="Country", color="Count")

### Age by gender analysis
To consider age by gender, variable grouping is required as the first step.
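
As a rough illustration of what this grouping produces, the sketch below builds the same kind of age-by-gender count table with the standard library. The `responses` list is made-up data standing in for the `Q1` and `Q2` columns.

```python
from collections import Counter

# Made-up (age, gender) pairs standing in for the Q1 and Q2 columns.
responses = [
    ("18-21", "Male"), ("18-21", "Female"),
    ("22-24", "Male"), ("22-24", "Male"),
    ("25-29", "Female"), ("25-29", "Male"), ("25-29", "Male"),
]

# Count every (age, gender) pair: this plays the role of groupby(...).size().
pair_counts = Counter(responses)

ages = sorted({age for age, _ in responses})
genders = sorted({gender for _, gender in responses})

# "Unstack": one row per age group and one column per gender.
# Pairs that never occur get 0 here (pandas would leave NaN instead).
table = {age: {gender: pair_counts[(age, gender)] for gender in genders} for age in ages}

for age, row in table.items():
    print(age, row)
```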

In [None]:
age_gender_df = df[["Q1", "Q2"]]
age_gender_groups = age_gender_df.groupby(["Q1", "Q2"]).size().unstack()
fig = px.bar(age_gender_groups, title="Count of Age per Gender", labels={"Q1": "Age", "value": "Count"})
fig.update_layout(legend_title_text='Gender')
# fig.update_layout(barmode="group")
fig.show()

As we can observe, the most common age group among the respondents is 25-29. There are two probable reasons why the counts for the younger groups are considerably higher than the others:
1. Data science and ML are relatively new disciplines, and direct education paths into these fields now exist and are more accessible to young people.
2. Think about _where_ the data were collected from. Older people are perhaps less likely to use 'resource' sites such as Kaggle because 1) they do not feel the need for such a learning experience and 2) they do not frequent social sites as much as young people.

Thus far, our choice of plots has been fairly arbitrary. A better plan is to investigate the categories we outlined earlier more systematically. Let's start with education.

### Education analysis

Produce two plots:
1. The participants' formal education.
2. The count of formal education per gender. 

Display these as a grouped bar chart.

In [None]:
fig = px.histogram(df, "Q4", height=800, title="Count of Education", labels={"Q4": "Education level"})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

In [None]:
edu_gender_df = df[["Q2", "Q4"]]
edu_gender_groups = edu_gender_df.groupby(["Q4", "Q2"]).size().unstack()
fig = px.bar(edu_gender_groups, title="Education level count per Gender",
             labels={"Q4": "Education", "value": "Count"},  # Q4 (the index) is on the x axis
             height=800)
fig.update_layout(legend_title_text='Gender', xaxis={'categoryorder':'total descending'})
# fig.update_layout(barmode="group")
fig.show()

Next, we create another diagram showing the same information, but split into a separate plot (facet) for each gender.

In [None]:
fig = px.histogram(df, "Q4", 
                   facet_col="Q2", 
                   color="Q2",
                   title="Counts of Education level per Gender",
                   labels={"Q4": "Education Level"},
                   height=1000, 
                   facet_col_wrap=2, 
                   facet_col_spacing=0.1,
                   )
fig.update_layout(showlegend=False, xaxis={'categoryorder':'total descending'})
fig.update_yaxes(matches=None, showticklabels=True)
# fig.update_xaxes(showticklabels=True)
fig.show()

From the above, we can establish the following:
1. Those who choose to self-describe their gender are more likely to have a doctorate than a bachelor's degree. This is contrary to every other category, whose members are more likely to have a bachelor's degree than a doctorate. Note the counts, however: we are working with single-digit figures, which are too small to extrapolate from statistically.
2. Those who preferred not to give their gender were also more likely than the other categories to prefer not to give their education level.

### [Sankey Diagrams](https://en.wikipedia.org/wiki/Sankey_diagram)

Sankey diagrams visualise how quantities flow between categories: the width of each link is proportional to the quantity it carries. The easiest way to start building a Sankey diagram is to identify the terminating column. In this case, we will use the education level as the final column. Additionally, we will need 'counts', which refers to the total number of people in each education level. To save space on the diagram, we will generalise some of the levels.

In this Sankey diagram, we will visualise the paths of gender, age and country to the education level.

In [None]:
# We want five levels of education: Bachelor's, Master's, Doctoral, Professional and Other.
## Create a new dataframe with only the education level of the respondents, where their education information has been mapped to the level above.
education_df = pd.DataFrame(df["Q4"])
education_df.rename(columns={"Q4": "Education Level"}, inplace=True)

values_to_update = {"Education Level": 
                    {"Some college/university study without earning a bachelor’s degree": "Other",
                     "No formal education past high school": "Other",
                     "I prefer not to answer": "Other"}}

education_df = education_df.replace(values_to_update)
set(education_df["Education Level"])

In [None]:
# Let's drop na's from Education Level
education_df.isna().sum()
education_df = education_df.dropna(subset=["Education Level"])
education_df.isna().sum()

In [None]:
## Add the gender, age and region columns to the new dataframe. Name the columns appropriately
cols_to_join = ["Q1", "Q2", "Q3"]
desired_col_names = ["Age", "Gender", "Region"]
for col, name in zip(cols_to_join, desired_col_names):
    education_df[name] = df[col]
    
education_df


In [None]:
# For visualisation purposes let's create:
# 1. wider age bins as 18-29, 30-49, 50-69 and 70+
# 2. group genders as "Male", "Female", "Other"
# 3. Convert countries to continents - apart from 'India', 'The United States of America' and 'Other'

## Overwrite the age and gender columns so that ages are now: 18-29, 30-49, 50-69 and 70+ and genders are "Male", "Female" and "Other"
values_to_update = {
    "Age": {"18-21": "18-29", "22-24": "18-29", "25-29": "18-29",
            "30-34": "30-49", "35-39": "30-49", "40-44": "30-49", "45-49": "30-49",
            "50-54": "50-69", "55-59": "50-69", "60-69": "50-69"
           },
    "Gender": {"Prefer not to say": "Other", "Prefer to self-describe": "Other"}
}

education_df = education_df.replace(values_to_update)
education_df

In [None]:
!pip install pycountry_convert

In [None]:
import pycountry_convert as pc
## Map countries to their relevant continents, unless the country is India, The United States of America, or Other
countries_to_not_map = ["India", "United States of America", "Other"]
countries_to_map_to_continents = set(education_df["Region"])
for country in countries_to_not_map:
    countries_to_map_to_continents.discard(country)

countries_continent_dict = dict()
for country in countries_to_map_to_continents:
    country_alpha2 = pycountry.countries.search_fuzzy(country)[0].alpha_2
    continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
    continent_name = pc.convert_continent_code_to_continent_name(continent_code)
    countries_continent_dict[country] = continent_name

to_update = {"Region": countries_continent_dict}
education_df = education_df.replace(to_update)
education_df

In [None]:
# We re-index the columns in our desired order for the diagram for convenience.
education_df = education_df.reindex(["Gender", "Age", "Region", "Education Level"], axis=1)

col_names = education_df.columns.tolist()
node_labels = []
num_categorical_vals_per_col = []
for col in col_names:
    uniques = education_df[col].unique().tolist()
    node_labels.extend(uniques)
    num_categorical_vals_per_col.append(len(uniques))
    
node_labels, num_categorical_vals_per_col

The `num_categorical_vals_per_col` list records how many nodes each column contributes. Since `node_labels` stores the labels column by column, these counts let us work out which range of indexes in `node_labels` belongs to each column.

Now, we need to construct a `link` dictionary, which is slightly less straightforward. Our `link` dictionary will contain three lists: `source`, `target` and `value` (plus a `color` list for styling). The `source` and `target` lists hold the numerical indexes into `node_labels` of the nodes that we intend to connect, and `value` indicates the quantity with which we intend to 'fill' each connection.

We will link every category in a column (the source) to every category in the next column (the target). The `value` of each link is the number of respondents who fall into both categories, which we obtain from the `size` of the corresponding group.
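
Before applying this to the survey, here is a minimal sketch of the same construction for just two columns, using the standard library and made-up rows. The variable names mirror those used in this notebook, but `rows`, `labels_per_col` and `offsets` are introduced here for illustration only.

```python
from collections import Counter
from itertools import accumulate

# Made-up rows with two columns (Gender -> Education Level).
rows = [
    ("Female", "Master's"), ("Female", "Bachelor's"),
    ("Male", "Master's"), ("Male", "Master's"), ("Male", "Doctoral"),
]

# Unique labels per column, in order of first appearance.
columns = list(zip(*rows))
labels_per_col = [list(dict.fromkeys(col)) for col in columns]
node_labels = [label for labels in labels_per_col for label in labels]

# Index in node_labels at which each column's nodes start.
offsets = [0] + list(accumulate(len(labels) for labels in labels_per_col))[:-1]

# Number of rows for every (source label, target label) pair.
pair_counts = Counter(rows)

source, target, value = [], [], []
for s_pos, s_label in enumerate(labels_per_col[0]):
    for t_pos, t_label in enumerate(labels_per_col[1]):
        source.append(offsets[0] + s_pos)
        target.append(offsets[1] + t_pos)
        value.append(pair_counts[(s_label, t_label)])  # 0 if the pair never occurs

print(node_labels)
print(source, target, value)
```

Here `source[i]` and `target[i]` index into `node_labels`, so the link with `source=1`, `target=2` and `value=2` says that two 'Male' rows flow into 'Master's'.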

In [None]:
education_df.groupby(["Gender", "Age"]).size()["Female"]["18-29"]

In [None]:
import numpy as np
import random

source = []
target = []
value = []
colors = []
for i, num_categories in enumerate(num_categorical_vals_per_col):
    
    if i == len(num_categorical_vals_per_col)-1:
        break
    
    # index allows us to refer to the categories by index from the `node_labels` list
    start_index = sum(num_categorical_vals_per_col[:i])
    start_index_next = sum(num_categorical_vals_per_col[:i+1])
    end_index_next = sum(num_categorical_vals_per_col[:i+2])
#     print(start_index, start_index_next, end_index_next)
    
    # i can also provide the category column to refer to
    col_name = col_names[i]
    next_col_name = col_names[i+1]
    
    grouped_df = education_df.groupby([col_name, next_col_name]).size()
#     print(grouped_df)
    
    for source_i in range(start_index, start_index_next):
        for target_i in range(start_index_next, end_index_next):
            source.append(source_i)
            target.append(target_i)
            source_label = node_labels[source_i]
            target_label = node_labels[target_i]
            # if the (source, target) pair does not exist in grouped_df, then the value is 0
            try:
                value.append(grouped_df[source_label][target_label])
            except KeyError:
                value.append(0)
            
            random_color = list(np.random.randint(256, size=3)) + [random.random()]
            random_color_string = ','.join(map(str, random_color))
            colors.append('rgba({})'.format(random_color_string))

print(source)
print(target)
print(value)

link = dict(source=source, target=target, value=value, color=colors)

In [None]:
import plotly.graph_objects as go

fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color = "black", width = 0.5),
      label = node_labels,
      color = "blue"
    ),
    link = link)])

fig.update_layout(title_text="Sankey Diagram (Gender, Age, Region, Education)", font_size=10)
fig.show()

#### Age

Earlier, we examined age briefly. Here, we explore age more in depth. Using the original `df`, produce the following plots:
1. A facet plot of the count of education levels by age.
2. A facet plot of the different roles by age.

Next, we will cover the following:
1. A plot of the count of the different languages. 
2. Subplots/facet plots of the count of different languages per age.


In [None]:
# We will sort the df by Age so that our plot displays based on the age.
df = df.sort_values(by=["Q1"])

fig = px.histogram(df, "Q4", facet_col="Q1",
             color="Q1",
             title="Counts of Education level per Age",
             labels={"Q1": "Age", "Q4": "Education Level"},
             height=1000, 
             facet_col_wrap=4, 
             facet_col_spacing=0.1)

fig.update_layout(showlegend=False, xaxis={'categoryorder':'total descending'})
fig.update_yaxes(matches=None, showticklabels=True)

As expected, bachelor's degrees dominate the youngest group: 18-21 year olds are typically too young to have completed a master's degree. For almost every other age group, however, master's degrees are the most common. Curiously, those over 70 are more likely to have a doctorate.

In [None]:
set(df["Q5"])

In [None]:
fig = px.histogram(df, "Q5", facet_col="Q1",
             color="Q1",
             title="Counts of Job Role per Age",
             labels={"Q1": "Age", "Q5": "Job Role"},
             height=2000, 
             facet_col_wrap=2, 
             facet_col_spacing=0.1)

fig.update_layout(showlegend=False)
fig.update_xaxes(showticklabels=True, tickangle=45)
fig.update_yaxes(matches=None, showticklabels=True)

In [None]:
# The data are in columns Q18_Part_1 to Q18_Part_12.
# The first step is to create a new column called 'languages_known' that holds, per row, a comma-separated list of the programming languages the respondent knows (excluding NaNs).
programming_cols = ["Q18_Part_{}".format(str(i)) for i in range(1, 13)]
programming_df = df[programming_cols].copy()  # copy, so that adding a column later does not touch df
programming_df

In [None]:
programming_col = []
for row in programming_df.itertuples(index=False):
    languages_known = [language for language in row if isinstance(language, str)]
    programming_col.append(",".join(languages_known))
    
programming_df["languages_known"] = programming_col
programming_df

In [None]:
# Now, we trim the new df so it only has our new column.
programming_df.drop(labels=programming_cols, axis=1, inplace=True)
programming_df

In [None]:
# Assume blanks mean they don't know a language and replace both the blanks and "None" with "None/NA"
values_to_update = {"languages_known": {"": "None/NA", "None": "None/NA"}}
programming_df = programming_df.replace(values_to_update)
programming_df

In [None]:
# We use the get_dummies method to create new columns. 
language_dummies = programming_df['languages_known'].str.get_dummies(sep=',')
language_dummies
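
To see what this step computes, the sketch below reproduces the idea in plain Python on made-up answers: every distinct token in the comma-separated strings becomes an indicator column holding 0 or 1. The names `answers`, `tokens`, `dummies` and `column_totals` are illustrative.

```python
# Made-up answers in the same comma-separated form as the languages_known column.
answers = ["Python,SQL", "R", "None/NA", "Python,R,SQL"]

# Every distinct token becomes one indicator column (sorted for a stable column order).
tokens = sorted({token for answer in answers for token in answer.split(",")})

# One row per respondent: 1 if the token appears in their answer, else 0.
dummies = [
    {token: int(token in answer.split(",")) for token in tokens}
    for answer in answers
]

# Summing each column gives the number of respondents who selected that token.
column_totals = {token: sum(row[token] for row in dummies) for token in tokens}

for row in dummies:
    print(row)
print(column_totals)
```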

In [None]:
fig = px.bar(language_dummies.sum(), labels={"index": "Programming Language"}, title="Count of Programming Languages")
fig.update_layout(showlegend=False, xaxis={'categoryorder':'total descending'})
fig.show()

In [None]:
# For the per-age plots, we retrieve the ages from the original dataframe and join them to this new dataframe.
ages = df["Q1"]
language_dummies_with_age = language_dummies.join(ages).rename(columns={"Q1": "Age"})
language_dummies_with_age

In [None]:
programming_languages_by_age = language_dummies_with_age.groupby(["Age"]).sum()
px.bar(programming_languages_by_age)
# px.bar(programming_languages_by_age.T)
programming_languages_by_age = programming_languages_by_age.reindex(
    programming_languages_by_age.mean().sort_values().index, axis=1)
programming_languages_by_age = programming_languages_by_age.iloc[:, ::-1]
programming_languages_by_age

In [None]:
programming_languages_by_age_row_norm = programming_languages_by_age.div(programming_languages_by_age.sum(axis=1), axis=0)

In [None]:
from plotly.subplots import make_subplots

# programming_languages_by_age.index
programming_languages = programming_languages_by_age_row_norm.columns.tolist()
fig = make_subplots(4, 3, subplot_titles=programming_languages_by_age_row_norm.index)
for i, age_range in enumerate(programming_languages_by_age_row_norm.index):
    row = (i // 3) + 1
    col = (i % 3) + 1
    fig.add_trace(
        go.Bar(x=programming_languages, y=programming_languages_by_age_row_norm.iloc[i]),
        row=row, col=col
    )
fig.update_layout(showlegend=False, height=1000, title="Percent of Known Programming Languages by Age")
fig.update_yaxes(tickformat="%")
fig.show()

It appears as though everyone likes Python. Younger people (who are most likely to be doing a bachelor's degree) know more lower-level languages, such as C, C++ and Java, possibly because they are required to study these languages at university. As data scientists specialise more in their careers, they appear to move away from these languages towards more typical DSML-related ones. The 60-69 age group has a high percentage of R users relative to the other age groups, whereas it appears as though most people over 70 do not know any programming language.

#### Job roles
Perhaps more useful to us is the popularity (occurrence frequency) of programming languages relative to job roles (Q5). Create a plot that demonstrates this.

In [None]:
## Join the Job Titles question with the language_dummies dataframe.
language_dummies_with_job = language_dummies.join(df["Q5"]).rename(columns={"Q5": "Job Title"})
language_dummies_with_job

In [None]:
## Group by Job Title, aggregate the number, normalise each row, and sort the dataframe by the mean of the columns.
languages_job_title_grouped = language_dummies_with_job.groupby(["Job Title"]).sum()
languages_job_title_grouped = languages_job_title_grouped.div(languages_job_title_grouped.sum(axis=1), axis=0)
languages_job_title_grouped = languages_job_title_grouped.reindex(
    languages_job_title_grouped.mean().sort_values().index, axis=1)
languages_job_title_grouped = languages_job_title_grouped.iloc[:, ::-1]
languages_job_title_grouped
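
The normalise-then-sort step can be illustrated on a tiny made-up table. This is a sketch in plain Python rather than pandas; `counts`, `shares`, `mean_share` and `ordered` are hypothetical names.

```python
# Made-up counts: rows are job titles, columns are languages.
counts = {
    "Data Scientist": {"Python": 100, "R": 40, "SQL": 60},
    "Statistician": {"Python": 10, "R": 30, "SQL": 10},
}

# Row-normalise: each row now sums to 1, so groups of different sizes become comparable.
shares = {}
for role, row in counts.items():
    total = sum(row.values())
    shares[role] = {language: n / total for language, n in row.items()}

# Order the columns by their mean share across rows, largest first.
languages = list(next(iter(counts.values())))
mean_share = {
    language: sum(shares[role][language] for role in shares) / len(shares)
    for language in languages
}
ordered = sorted(languages, key=mean_share.get, reverse=True)

print(shares)
print(ordered)
```

Because respondents can select several languages, each share is a fraction of all selections made within a role, not a fraction of respondents. Note also that averaging the normalised rows weights every role equally, so the column order can differ from the order by raw totals.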

In [None]:
px.imshow(languages_job_title_grouped, title="Heatmap of Programming Languages and Job Title")

In [None]:
## Plot the data.
programming_languages = languages_job_title_grouped.columns.tolist()
fig = make_subplots(4, 3, subplot_titles=languages_job_title_grouped.index)
for i, role in enumerate(languages_job_title_grouped.index):
    row = (i // 3) + 1
    col = (i % 3) + 1
#     numbers = languages_job_title_grouped.iloc[i]
#     as_percent = [number / sum(numbers) for number in numbers]
    fig.add_trace(
        go.Bar(x=programming_languages, y=languages_job_title_grouped.iloc[i]),
        row=row, col=col
    )
fig.update_layout(showlegend=False, height=1000, title="Percent of Known Programming Languages per Job Role")
fig.update_yaxes(tickformat="%")
fig.show()

**Findings**
- On average, Python is the most popular language among all job roles.
- SQL is the most popular language among database engineers.
- MATLAB is relatively more popular among research scientists than other job roles.
- Statisticians prefer R over Python.
- Students and software engineers prefer C++ over R.

Now, we do one more plot: a heatmap of preferred frameworks relative to job roles.

In [None]:
## Plot the top 3 frameworks that each job role likes to use
# The data are in columns Q28_Part_1 to Q28_Part_12.
## Create a dataframe which contains just the framework columns
framework_cols = ["Q28_Part_{}".format(str(i)) for i in range(1, 13)]
framework_df = df[framework_cols].copy()  # copy, so that adding a column later does not touch df
framework_df

Create a general function, `get_df_for_dummies()`, which accepts a dataframe, a column name prefix string, and an upper range and returns a dataframe populated with the range of columns based on the prefix string.

In [None]:
def get_df_for_dummies(df, prefix_string, end_range, start_range=1):
    ## HINT: use a range function and iteration to generate the column names.
    dummies_cols = [prefix_string + str(i) for i in range(start_range, end_range)]
    ## HINT: extract and return the column names from the dataframe
    return df[dummies_cols]

In [None]:
## Create a column in this dataframe called 'frameworks used', and populate that column with comma-separated frameworks.
framework_col = []
for row in framework_df.itertuples(index=False):
    frameworks_used = [framework for framework in row if isinstance(framework, str)]
    framework_col.append(",".join(frameworks_used))
    
framework_df["frameworks_used"] = framework_col
framework_df

In [None]:
## Replace the blank and None columns with 'None/NA'.
values_to_update = {"frameworks_used": {"": "None/NA", "None": "None/NA"}}
framework_df = framework_df.replace(values_to_update)
framework_df

Create a general function, `get_dummies_col()`, which accepts a dataframe and returns it with a new column that holds, for each row, the comma-separated values of that row.

In [None]:
def get_dummies_col(df, sep=","):

    ## initialise an empty list to hold the column of strings which will be used to create a dummies dataframe.
    dummies_col = []
    
    ## iterate over each of the rows in the dataframe in a manner that enables access to the individual elements of the cells.
    ## get a list of the values of the cells over the row (e.g. a list of programming languages). Make sure that you do not add nan's. 
    ## join these as a comma-separated string, and append it to the empty list for the column.
    for row in df.itertuples(index=False):
        values = [item for item in row if isinstance(item, str)]
        dummies_col.append(sep.join(values))
        
    ## create a new column called 'dummies', which accepts the contents of the dummies column.
    ## work on a copy so that the dataframe passed in by the caller is left untouched.
    df = df.copy()
    df["dummies"] = dummies_col
    
    ## replace all the "" and "None"s from the dataframe with 'None/NA'.
    values_to_update = {"dummies": {"": "None/NA", "None": "None/NA"}}
    df = df.replace(values_to_update)
    
    ## return the new dataframe.
    return df

In [None]:
## Create a dummies dataframe for the frameworks.
framework_dummies = framework_df['frameworks_used'].str.get_dummies(sep=',')
framework_dummies

Create a general function, `dummies_from_series()`, which accepts the dataframe returned by `get_dummies_col()` and a separator argument, and returns the dummies for its `dummies` column.

In [None]:
def dummies_from_series(df, sep=","):
    ## return a dummies dataframe built from the 'dummies' column of the dataframe.
    # 'dummies' is the column that get_dummies_col() filled with the separated strings.
    return df["dummies"].str.get_dummies(sep=sep)

In [None]:
## Create a new dataframe which joins this dataframe with job roles.
frameworks_for_job_role = framework_dummies.join(df["Q5"]).rename(columns={"Q5": "Job Title"})
frameworks_for_job_role

In [None]:
## Group the dataframe by the job title, and aggregate over the frameworks.
frameworks_for_job_role_grouped = frameworks_for_job_role.groupby(["Job Title"]).sum()
frameworks_for_job_role_grouped

Create a general function, `group_dummies_by()`, which accepts a dummies dataframe and a Series, and returns the dummies grouped and aggregated by the Series.

In [None]:
def group_dummies_by(dummies_df, series):
    series_name = series.name
    to_group = dummies_df.join(series)
    grouped = to_group.groupby([series_name]).sum()
    
    return grouped

In [None]:
framework_df = get_df_for_dummies(df, "Q28_Part_", 13)
framework_df = get_dummies_col(framework_df)
framework_dummies = dummies_from_series(framework_df)
frameworks_for_job_role_grouped = group_dummies_by(framework_dummies, df["Q5"])
frameworks_for_job_role_grouped = frameworks_for_job_role_grouped.div(
    frameworks_for_job_role_grouped.sum(axis=1), axis=0)
frameworks_for_job_role_grouped.index.rename("Job Role", inplace=True)
frameworks_for_job_role_grouped

In [None]:
## Produce a heatmap of the above dataframe!
px.imshow(frameworks_for_job_role_grouped, title="Heatmap of preferred Frameworks per Job Role")

**Findings**
- None/NA may be slightly misleading. It does not always imply that the role has no need for frameworks because it also incorporates people who may not have answered the question since their tool of use was not provided as an option. For example, a statistician may be using an R framework whose option is not provided here.
- Scikit-learn is the most popular listed tool.
- Random forest appears to be used particularly often by statisticians.

#### Yearly compensation

Now, we begin considering our target variable. We will start at the basic level and work our way up. We plot a histogram of salary earnings and sort this plot based on salary ranges.

In [None]:
px.histogram(df, "Q10", labels={"Q10": "Salary"}, title="Count of salary ranges (Unsorted)")

Before proceeding, we sort the x axis in order of numerical values. In other words, the leftmost column is `$0-999`, and the rightmost is `> $500,000`. 

Spend some time pondering on how you would solve this problem. If after a few minutes you do not make headway, follow the steps below.
<details>
<summary><b>> Click here for guidance.</b></summary>
<ul>
<li>Create a new dataframe with only the salaries.</li>
<li>Retrieve the set of salaries.</li>
<li>Map the salary categories to an int, where the int is the first numerical part of the string (e.g. <code>{"$0-999": 0, "100,000-124,999": 100000}</code>). This part will require you to use Python's built-in <code>.replace()</code> and <code>.split()</code> string methods.</li>
<li>Replace the salaries in the dataframe with the integer value.</li>
<li>Sort the dataframe in ascending numerical order.</li>
<li>Reverse the mapping, and replace the ints with their string variants.</li>
<li>Plot the dataframe, and replace the x labels with the salary strings.</li>
</ul>
</details>

In [None]:
salary_df = pd.DataFrame(df["Q10"])
salary_df.rename(columns={"Q10": "Salary"}, inplace=True)
salary_set = set(salary_df["Salary"])
salary_string_int_dict = dict()

for string_salary in salary_set:
    
    if isinstance(string_salary, float): continue
        
    salary = string_salary.replace("$", "").replace("> ", "").replace(",", "")
    salary = salary.split("-")[0]
    salary_string_int_dict[string_salary] = int(salary)

values_to_update = {"Salary": salary_string_int_dict}
salary_df = salary_df.replace(values_to_update)
salary_df = salary_df.sort_values("Salary")

salary_int_string_dict = {v:k for k,v in salary_string_int_dict.items()}
values_to_update = {"Salary": salary_int_string_dict}
salary_df = salary_df.replace(values_to_update)

percent_na = np.round(100 * salary_df["Salary"].isna().sum()/len(salary_df), 2)
print("Percent of users who didn't answer the salary question:", percent_na)
px.histogram(salary_df, "Salary", title="Count of Salary ranges")
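
As an alternative to the replace-and-reverse round trip above, the bracket labels themselves can be sorted by their lower bound. This is only a sketch: the helper name `bracket_lower_bound` is hypothetical, and the list of labels is a small subset chosen for illustration.

```python
def bracket_lower_bound(label):
    """Return the lower bound of a salary bracket label as an int."""
    cleaned = label.replace("$", "").replace("> ", "").replace(",", "")
    return int(cleaned.split("-")[0])

# A few bracket labels in arbitrary order (a subset, for illustration).
labels = ["100,000-124,999", "> $500,000", "$0-999", "10,000-14,999"]

# Sort the strings by their numeric lower bound rather than alphabetically.
ordered_labels = sorted(labels, key=bracket_lower_bound)
print(ordered_labels)
```

A list such as `ordered_labels` could then be passed to Plotly Express through its `category_orders` argument (e.g. `category_orders={"Salary": ordered_labels}`), which orders the x axis without modifying the dataframe.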

Almost 37% of the survey participants did not answer this question. An oddly large number of people report earning between \\$0 and \\$999 per annum. Perhaps this count is inflated because many people who did not want to answer the question (and who did not realise that it was optional) ticked this box. We can spot some further interesting facts. There appear to be two peaks at vastly different salaries: at 10,000-14,999 and 100,000-124,999. Additionally, there appear to be some very rich Kagglers, with 83 of them earning over \\$500k per annum.

The three most common wage groups (excluding 0-999) appear to be 10,000-14,999, 100,000-124,999 and 30,000-39,999. Produce a choropleth plot of the median salary of each country to better discern the relationship between expected earnings and country of residence.

In [None]:
## Produce a choropleth plot of the median salary.
median_salaries_df = df[["Q3", "Q10"]].rename(columns={"Q3": "Country", "Q10": "Salary"})
values_to_update = {"Salary": salary_string_int_dict}
median_salaries_df = median_salaries_df.replace(values_to_update)
median_salaries_df = median_salaries_df.groupby(["Country"]).median()
median_salaries_df

In [None]:
country_codes = []
for country in median_salaries_df.index:
    country_code = pycountry.countries.search_fuzzy(country)[0] # Take the first element returned from the search.
    country_codes.append(country_code.alpha_3)

median_salaries_df["Country Code"] = country_codes
median_salaries_df

In [None]:
salaries_series = median_salaries_df["Salary"]
values_to_update = {"Salary": salary_int_string_dict}
median_salaries_df = median_salaries_df.replace(values_to_update)
median_salaries_df["Salary Values"] = salaries_series

In [None]:
px.choropleth(median_salaries_df, locations="Country Code", hover_name=median_salaries_df.index, color="Salary Values", hover_data=["Salary"], title="Median Salaries by Country")

Considering the number of participants from the USA and India that we discerned earlier, we can take the values for these two countries to be more representative of the underlying distribution than those for most other countries. Working under this assumption, in the US and Switzerland, a reasonable figure would be 100,000+, whereas in India, it would most probably be in the 7,500-9,999 range. It appears that Australia has some high-paying jobs as well. We would expect the median salary range in the UK to be higher than the 10,000-14,999 mark. What are some of the factors we could investigate to determine the reason for this salary variance?

We consider salaries based on gender.

In [None]:
salaries_by_gender_df = df[["Q2", "Q10"]].rename(columns={"Q2": "Gender", "Q10": "Salary"})
values_to_update = {"Salary": salary_string_int_dict}
salaries_by_gender_df = salaries_by_gender_df.replace(values_to_update)
salaries_by_gender_df

In [None]:
px.box(salaries_by_gender_df, "Gender", "Salary", labels={"Salary": "Salary (Lower Bound)"}, title="Boxplot of Salary per Gender")

Even though it appears that any of the categories can reach the highest salary bracket, females appear to have the lowest median salary. Those who prefer to self-describe appear to be more likely to have higher average salaries as well.

To determine the best-paying jobs, we plot the mean salary of each role.

In [None]:
salaries_job_df = df[["Q5", "Q10"]].rename(columns={"Q5": "Job Title", "Q10": "Salary"})

values_to_update = {"Salary": salary_string_int_dict}
salaries_job_df = salaries_job_df.replace(values_to_update)
salaries_job_df

# salary_series = salaries_job_df["Salary"]
# salaries_job_df["Salary Bracket"] = salary_series
# salaries_job_df

In [None]:
grouped_mean_salaries = salaries_job_df.groupby(["Job Title"]).mean().reset_index().sort_values(by="Salary", ascending=False)
grouped_mean_salaries.dropna(inplace=True)
px.bar(grouped_mean_salaries, "Job Title", "Salary", labels={"Salary": "Mean Salary"}, title="Mean Salary per Job Role")

As a more informative approach for representing these salaries, we employ a boxplot. We exclude the NAs and '0' salaries and visualise the following:

In [None]:
salaries_job_df.dropna(inplace=True)
salaries_job_df = salaries_job_df[(salaries_job_df["Salary"] != 0)]
fig = px.box(salaries_job_df, "Job Title", "Salary", labels={"Salary": "Salary (Lower Bound of Bracket)"})
fig.show()

**Findings**:
- The survey indicates that individuals in all job roles have the potential to gain jobs that pay \\$500k+, apart from database engineers.
- Globally, software engineers and data analysts have the lowest average salaries, with a software engineer having a lower median salary than a data analyst, but a higher mean (as indicated in the first chart).
- Product/project management and data science appear to be the most lucrative job roles, with the former slightly ahead.
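
A group can have a lower median yet a higher mean when a few very large values pull the mean upwards. The following sketch uses invented salaries (not survey data) purely to illustrate the effect described in the second finding.

```python
from statistics import mean, median

# Invented salaries, chosen so that one group contains a single very large value.
analyst_salaries = [40000, 50000, 60000]
engineer_salaries = [30000, 45000, 500000]

# The median ignores the size of the outlier; the mean does not.
print("Analyst  median:", median(analyst_salaries), "mean:", mean(analyst_salaries))
print("Engineer median:", median(engineer_salaries), "mean:", round(mean(engineer_salaries), 2))
```

This is one reason why the boxplot, which shows the median and quartiles, is more informative here than a bar chart of means.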

Next, we determine the percentage of applicable data scientists who earn above \\$500k relative to project managers:

In [None]:
# The top bracket ("> $500,000") is encoded by its lower bound, 500000.
is_data_scientist = salaries_job_df["Job Title"] == "Data Scientist"
is_project_manager = salaries_job_df["Job Title"] == "Product/Project Manager"
is_top_bracket = salaries_job_df["Salary"] == 500000

# Divide by the number of respondents in each role, not by the whole dataframe.
percent_ds_above_500 = 100 * (is_data_scientist & is_top_bracket).sum() / is_data_scientist.sum()
percent_pm_above_500 = 100 * (is_project_manager & is_top_bracket).sum() / is_project_manager.sum()

print("The percent of Data Scientists who earn above $500,000: {}%".format(np.round(percent_ds_above_500, 2)))
print("The percent of Project Managers who earn above $500,000: {}%".format(np.round(percent_pm_above_500, 2)))
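
Because the roles differ in size, the share of top-bracket earners is easiest to compare when each count is divided by the number of respondents in that role. A minimal sketch on invented data (the `toy` dataframe is an assumption for illustration) computes this share for every role at once by averaging a boolean mask per group.

```python
import pandas as pd

# Invented data: four data scientists and two project managers.
toy = pd.DataFrame({
    "Job Title": ["Data Scientist", "Data Scientist", "Data Scientist", "Data Scientist",
                  "Product/Project Manager", "Product/Project Manager"],
    "Salary": [500000, 100000, 50000, 10000, 500000, 80000],
})

# True where the respondent is in the top bracket (encoded by its lower bound).
in_top_bracket = toy["Salary"] == 500000

# The mean of a boolean mask within each role is the fraction of True values.
percent_top_by_role = 100 * in_top_bracket.groupby(toy["Job Title"]).mean()
print(percent_top_by_role)
```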

Subsequently, we determine the effect of years of programming experience (Q15) on salary.

In [None]:
programming_experience_salary_df = df[["Q10", "Q15"]].rename(columns={"Q10": "Salary", "Q15": "Programming Experience"})
values_to_update = {"Salary": salary_string_int_dict}
programming_experience_salary_df = programming_experience_salary_df.replace(values_to_update)
programming_experience_salary_df

In [None]:
category_array = ["I have never written code", "< 1 years", "1-2 years", "3-5 years", "5-10 years", "10-20 years", "20+ years"]
# fig = px.scatter(programming_experience_salary_df, "Programming Experience", "Salary", title="Density of Programming Experience vs Salary")
fig = px.scatter(programming_experience_salary_df, "Programming Experience", "Salary", facet_col=df["Q2"],title="Density of Programming Experience vs Salary")
fig.update_traces(marker=dict(
            opacity=0.05,
            size=20,
            line=dict(
                color='MediumPurple',
                width=0.5
            )))
fig.update_layout(xaxis={'categoryorder':'array', 'categoryarray':category_array})
fig.show()

**Findings**
- There are people earning \\$500k+ who have never written code, as well as people with under a year's worth of programming experience.
- Nevertheless, there is a trend: more experienced programmers tend to earn higher salaries.
- There are few experienced female respondents, i.e. few with 10-20 years of experience.
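
The trend in the second finding can also be checked numerically by treating experience as an ordered category and summarising salary per level. The sketch below uses invented rows and only three experience levels; it is an illustration under those assumptions, not a result from the survey.

```python
import pandas as pd

# Experience levels in their natural order (a subset, for illustration).
experience_order = ["< 1 years", "1-2 years", "3-5 years"]

# Invented rows: salary is the lower bound of the respondent's bracket.
toy = pd.DataFrame({
    "Programming Experience": ["3-5 years", "< 1 years", "1-2 years", "< 1 years", "3-5 years"],
    "Salary": [60000, 10000, 30000, 20000, 80000],
})

# An ordered categorical makes groupby results follow the stated order.
toy["Programming Experience"] = pd.Categorical(
    toy["Programming Experience"], categories=experience_order, ordered=True
)

median_by_experience = toy.groupby("Programming Experience", observed=True)["Salary"].median()
print(median_by_experience)
```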

Now, we will plot another Sankey Diagram, tracking the effects of gender, age, degree, role and country on salary. To achieve this, generalise the code we wrote earlier for the Sankey diagram into a function, `get_sankey_data`, which accepts a re-indexed dataframe as an argument and returns the node_labels and link dictionary. Subsequently, use these items to plot a Sankey diagram.

In [None]:
## Functionise the sankey diagram code I wrote earlier
def get_sankey_data(reindexed_df):
    col_names = reindexed_df.columns.tolist()
    node_labels = []
    num_categorical_vals_per_col = []
    
    for col in col_names:
        uniques = reindexed_df[col].unique().tolist()
        node_labels.extend(uniques)
        num_categorical_vals_per_col.append(len(uniques))


    source = []
    target = []
    value = []
    for i, num_categories in enumerate(num_categorical_vals_per_col):

        if i == len(num_categorical_vals_per_col)-1:
            break

        # index allows us to refer to the categories by index from the `node_labels` list
        start_index = sum(num_categorical_vals_per_col[:i])
        start_index_next = sum(num_categorical_vals_per_col[:i+1])
        end_index_next = sum(num_categorical_vals_per_col[:i+2])


        # i can also give us the category column to refer to
        col_name = col_names[i]
        next_col_name = col_names[i+1]

        grouped_df = reindexed_df.groupby([col_name, next_col_name]).size()

        for source_i in range(start_index, start_index_next):
            for target_i in range(start_index_next, end_index_next):
                source.append(source_i)
                target.append(target_i)
                source_label = node_labels[source_i]
                target_label = node_labels[target_i]
                # if the index doesn't exist in the grouped_df, then the value is 0
                try:
                    value.append(grouped_df[source_label][target_label])
                except KeyError:
                    value.append(0)

#                 random_color = list(np.random.randint(256, size=3)) + [random.random()]
#                 random_color_string = ','.join(map(str, random_color))
#                 colors.append('rgba({})'.format(random_color_string))

    link = dict(source=source, target=target, value=value)
    return node_labels, link
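
`get_sankey_data` emits a link for every possible pair of categories in adjacent columns, including pairs whose count is 0. A possible variant, sketched below on invented data, iterates over the `groupby(...).size()` result directly so that only observed pairs produce links. The function name `sankey_links_observed` and the toy dataframe are illustrative assumptions and are not part of the notebook.

```python
import pandas as pd

def sankey_links_observed(frame):
    """Return (labels, link) using only category pairs that occur in `frame`."""
    labels = []
    position = {}  # (column name, category) -> index into `labels`
    for col in frame.columns:
        for category in frame[col].dropna().unique():
            position[(col, category)] = len(labels)
            labels.append(category)

    source, target, value = [], [], []
    cols = list(frame.columns)
    # Walk over adjacent column pairs: (col0, col1), (col1, col2), ...
    for left, right in zip(cols, cols[1:]):
        counts = frame.groupby([left, right]).size()
        for (left_cat, right_cat), count in counts.items():
            source.append(position[(left, left_cat)])
            target.append(position[(right, right_cat)])
            value.append(int(count))

    return labels, dict(source=source, target=target, value=value)

# Invented example with two columns.
toy = pd.DataFrame({
    "Role": ["Analyst", "Analyst", "Engineer"],
    "Salary": ["Low", "High", "High"],
})
labels, link = sankey_links_observed(toy)
print(labels)
print(link)
```

Keying the lookup on the (column, category) pair also keeps identical labels in different columns apart.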

In [None]:
## Create a new dataframe with the relevant variables with which we intend to plot our Sankey diagram, re-index it, and pass it to the get_sankey_data function.
salaries_sankey_df = df[["Q1", "Q2", "Q3", "Q4", "Q5", "Q10"]].rename(columns={"Q1": "Age", "Q2": "Gender", "Q3": "Country", "Q4": "Education", "Q5": "Role", "Q10": "Salary"})
salaries_sankey_df = salaries_sankey_df.reindex(["Gender", "Age", "Education", "Role", "Country", "Salary"], axis=1)
node_labels, link = get_sankey_data(salaries_sankey_df)
node_labels

In [None]:
fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color = "black", width = 0.5),
      label = node_labels,
      color = "blue"
    ),
    link = link)])

fig.update_layout(title_text="Sankey Diagram (Gender, Age, Education, Role, Country, Salary)", font_size=10, height=1000)
fig.show()

As the next step, we clean the above diagram:
- Group countries by continent (again we will leave India and USA as is).
- Create wider bins for the salaries (maybe five bins), excluding \\$0-999.
- Add coding experience as a column in the above plot.

In [None]:
## Replace countries with continents
values_to_update = {"Country": countries_continent_dict}
salaries_sankey_df = salaries_sankey_df.replace(values_to_update)
salaries_sankey_df

In [None]:
## Drop rows with $0-999 and create wider bins for salary
set(salaries_sankey_df["Salary"])
salaries_sankey_df = salaries_sankey_df[salaries_sankey_df["Salary"] != "$0-999"]
under_30000, under_80000, under_150000, under_300000, under_500000 = "0 - 29,999", "30,000 - 79,999", "80,000 - 149,999", "150,000 - 299,999", "300,000 - 500,000" 
salary_wider_bins_dict = dict()
for salary_string in set(salaries_sankey_df["Salary"]):
    
    if salary_string == "> $500,000" or isinstance(salary_string, float):
        continue
    
    salary_upper_bound = salary_string.split("-")[-1]
    salary_upper_bound = salary_upper_bound.replace(",", "")
    salary_upper_bound = int(salary_upper_bound)
    
    if salary_upper_bound < 30000:
        salary_wider_bins_dict[salary_string] = under_30000
    elif salary_upper_bound < 80000:
        salary_wider_bins_dict[salary_string] = under_80000
    elif salary_upper_bound < 150000:
        salary_wider_bins_dict[salary_string] = under_150000
    elif salary_upper_bound < 300000:
        salary_wider_bins_dict[salary_string] = under_300000
    elif salary_upper_bound < 500000:
        salary_wider_bins_dict[salary_string] = under_500000

values_to_update = {"Salary": salary_wider_bins_dict}
salaries_sankey_df = salaries_sankey_df.replace(values_to_update)
salaries_sankey_df
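
When the salaries are already available as numbers, wider bins can also be produced with `pd.cut` instead of an `if`/`elif` chain. The sketch below is an assumption-laden illustration: it bins by the lower bound of each bracket (the cell above uses the upper bound), which gives the same grouping only as long as no original bracket straddles a bin edge, and the sample values are invented.

```python
import pandas as pd

# Lower bounds of a few brackets (invented sample).
lower_bounds = pd.Series([1000, 30000, 100000, 250000, 300000])

bin_edges = [0, 30000, 80000, 150000, 300000, 500000]
bin_labels = ["0 - 29,999", "30,000 - 79,999", "80,000 - 149,999",
              "150,000 - 299,999", "300,000 - 500,000"]

# right=False makes each bin include its left edge and exclude its right edge.
wider_bins = pd.cut(lower_bounds, bins=bin_edges, labels=bin_labels, right=False)
print(wider_bins.tolist())
```

With `right=False`, a lower bound of 500000 (the `> $500,000` bracket) falls outside every bin and becomes missing, so that bracket would need to be handled separately.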

In [None]:
## Create a new column in salaries_sankey_df for the programming experience length.
salaries_sankey_df["Programming Experience"] = df["Q15"]
salaries_sankey_df

In [None]:
## Re-index, and plot the Sankey diagram.
# Exclude the age and gender columns, and reindex the dataframe as
# Role, Programming Experience, Region, Education and Salary.
salaries_sankey_df = salaries_sankey_df.rename(columns={"Country": "Region"})
salaries_sankey_df = salaries_sankey_df.reindex(["Role", "Programming Experience", "Region", "Education", "Salary"], axis=1)
node_labels, link = get_sankey_data(salaries_sankey_df)
fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color = "black", width = 0.5),
      label = node_labels,
      color = "blue"
    ),
    link = link)])

fig.update_layout(title_text="Sankey Diagram (Role, Programming Experience, Region, Education, Salary)", font_size=10, height=800)
fig.show()

With these changes, the information and the paths to the different salary brackets become much easier to interpret. Interestingly, there are some software engineers and data scientists who have never written code. Since this is highly improbable and they are few, we could treat them as anomalies. The majority of people earning over 150,000 seem to hold either a master's degree or a doctorate. Experimenting with the order of the columns allows you to quickly draw insights from different combinations (e.g. by putting Education before Role, we can see which roles people are likely to go into based on their education).

One last plot! Question 9 asks about the skills and responsibilities that respondents' current jobs entail. Produce subplots of these skills, aggregated for each salary bracket. Use the six salary brackets we defined previously. Ensure that the columns of the skill dataframes you create are intact, and present your plot and any other Python output in an interpretable manner, whether through the strings on the axes or the ordering of the facets.

In [None]:
skills_df = get_df_for_dummies(df, "Q9_Part_", 9)
skills_df = get_dummies_col(skills_df, sep="::")
skills_dummies = dummies_from_series(skills_df, sep="::")
skills_grouped = group_dummies_by(skills_dummies, salaries_sankey_df["Salary"])
skills_grouped = skills_grouped.reindex(["0 - 29,999", "30,000 - 79,999", "80,000 - 149,999", "150,000 - 299,999", "300,000-500,000", "> $500,000"])
skills_grouped

In [None]:
skills_df_columns = skills_grouped.columns
skills_df_columns_mapping = {col: i for i, col in enumerate(skills_df_columns)}
skills_grouped = skills_grouped.rename(columns=skills_df_columns_mapping)
for col, i in skills_df_columns_mapping.items():
    print(i, "\t", col)
print()

In [None]:
## Produce plots! The x axis should be the skill indices printed above, while the y axis should be their count.
fig = make_subplots(2, 3, subplot_titles=skills_grouped.index)
for i, bracket in enumerate(skills_grouped.index):
    row = (i // 3) + 1
    col = (i % 3) + 1
    skills_values = skills_grouped.iloc[i]
    fig.add_trace(
        go.Bar(x=skills_values.index, y=skills_values),
        row=row, col=col
    )
    fig.update_xaxes(type="category")
fig.update_layout(showlegend=False, height=800, title="Skills/Responsibilities per Salary Bracket")
fig.show()

**Findings**
- Across all salary brackets, many individuals chose option 0 (analyze and understand data to influence product or business decisions).
- Apart from the lowest and highest brackets, it appears that many jobs involve option 3 (building prototypes that explore the application of ML to new areas).
- The top two salary brackets have a higher relative frequency of answer 4 (do research that advances the state of the art of ML).
- The last three brackets appear to have a higher proportion of employees who chose option 5 (experimentation and iteration to improve existing ML models) as a responsibility.
- Option 7, None/NA, appears to be more prevalent in the bottom two brackets and the uppermost bracket. This may be because none of the job descriptions apply to these individuals, whereas the other three brackets have job responsibilities focused more on data science and ML.

## Conclusion
At this point, you should have a good understanding of

- EDA and its importance.
- how to apply various visualisation techniques.
- how to determine the suitable plot for different situations.