# <div style=" text-align: center; font-weight: bold; font-size:50px">Exploratory data analysis</div>

## **Import necessary libraries:**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import re

## **Read data from file:**

In [2]:
ds_survey_df = pd.read_csv('../Dataset/final_data.csv', low_memory=False)

## **Ask some questions:**

### **Question 01: How do factors such as residential country, age, current role, education level, programming experience, current industry affect a person's salary or the distribution of the income across these factors?**

Income is something many of us prioritize, and most people tend to pursue a higher one. So what does the income of the survey participants look like, and how is it affected by different factors? In this analysis, we aim to gain more insight into a person's salary. Specifically, we focus on the factors that are likely to matter most: `Residential country`, `Age`, `Role`, `Education level`, `Programming experience`, and `Industry`.
But before that, let's take an overview of the income and perform any necessary preprocessing steps.

First, from the exploration above, we can see that over 50 percent of the values in the column `Current income` are missing. As discussed earlier, these missing values come from survey participants who are currently students. Therefore, in this analysis, we simply drop all NaN values.

In [3]:
# Drop rows with a missing value in the column `Current income`
income_df = ds_survey_df.dropna(subset=['Current income'])

print("Current income range of participants:")
ranges = income_df['Current income'].unique()
print(ranges)

Current income range of participants:
['25,000-29,999' '100,000-124,999' '200,000-249,999' '150,000-199,999'
 '90,000-99,999' '30,000-39,999' '3,000-3,999' '50,000-59,999'
 '125,000-149,999' '15,000-19,999' '5,000-7,499' '10,000-14,999'
 '20,000-24,999' '$0-999' '7,500-9,999' '4,000-4,999' '80,000-89,999'
 '2,000-2,999' '250,000-299,999' '1,000-1,999' '$500,000-999,999'
 '70,000-79,999' '60,000-69,999' '40,000-49,999' '>$1,000,000'
 '300,000-499,999']


From the result above, we can see that income is separated into too many ranges. We will group these ranges into a smaller set of new income ranges that is easier to analyze.

In [4]:
# Define the new income ranges as half-open intervals [lower, upper).
new_income_range = {
    '<$10,000': (0, 10000),
    '$10,000-50,000': (10000, 50000),
    '$50,000-100,000': (50000, 100000),
    '$100,000-300,000' : (100000, 300000),
    '$300,000-500,000' : (300000, 500000),
    '>$500,000' : (500000, float('inf'))
}

Now, replace the current range with the new range for every sample in our dataset.

In [5]:
processed_income_df = income_df.copy()  # Make a copy of income_df.

# Map an original survey income label to one of the new ranges
def map_income_range(income_range):
    # Parse the lower bound of the label, e.g. '25,000-29,999' -> 25000, '>$1,000,000' -> 1000000
    lower = income_range.split('-')[0]
    lower = int(lower.replace(',', '').replace('$', '').replace('>', ''))
    for new_range, (lower_bound, upper_bound) in new_income_range.items():
        if lower_bound <= lower < upper_bound:
            return new_range
    return None  # The label does not fall into any new range

# Apply the function to the 'Current income' column
processed_income_df['Current income'] = income_df['Current income'].apply(map_income_range)

We have converted every participant's current income to the new ranges. From here on, `processed_income_df` is the main dataframe that we use for our analysis.
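To make the mapping rule concrete, here is a minimal, self-contained sketch of how a few survey labels would be bucketed. The names `BUCKETS`, `lower_bound`, and `to_bucket` are illustrative and not part of the notebook; the sketch assumes non-overlapping half-open intervals and that each label is assigned by its lower bound only.

```python
# Illustrative buckets (assumed non-overlapping, half-open intervals [low, high)).
BUCKETS = {
    '<$10,000': (0, 10_000),
    '$10,000-50,000': (10_000, 50_000),
    '$50,000-100,000': (50_000, 100_000),
    '$100,000-300,000': (100_000, 300_000),
    '$300,000-500,000': (300_000, 500_000),
    '>$500,000': (500_000, float('inf')),
}

def lower_bound(label):
    """Return the numeric lower bound of a survey label such as '$0-999'."""
    first = label.split('-')[0]
    return int(first.replace(',', '').replace('$', '').replace('>', ''))

def to_bucket(label):
    """Assign a label to the first bucket whose interval contains its lower bound."""
    low = lower_bound(label)
    for name, (lo, hi) in BUCKETS.items():
        if lo <= low < hi:
            return name
    return None

samples = ['$0-999', '7,500-9,999', '10,000-14,999', '300,000-499,999', '>$1,000,000']
mapped = {label: to_bucket(label) for label in samples}
for label, bucket in mapped.items():
    print(f'{label:>16} -> {bucket}')
```

Because only the lower bound is used, a label such as `$500,000-999,999` and the open-ended `>$1,000,000` both end up in `>$500,000`.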

Now, let's look at the distribution of income across the new ranges:

In [6]:
# Find the frequency of each income range
income_distribution_df = processed_income_df['Current income'].value_counts().reset_index(name='Count')

fig = px.bar(income_distribution_df, 
            x='Count', 
            y='Current income',
            text='Count',
            labels={ 'Current income': '<b>Income range</b>', 'Count': '<b>Count</b>'},)

fig.update_layout(
    title_text='<b>Participants\' current income distribution</b>',
    title_font_size=25,
    width=1000, height=400,
    margin=dict(l=20, r=30, t=70, b=20),
    paper_bgcolor='#ffe6cc',
    xaxis=dict(tickfont=dict(size=14)),
    yaxis=dict(tickfont=dict(size=14), ticksuffix = " "),
)
fig.update_traces(marker_line_color='black', marker_line_width=1, hovertemplate='<b>%{text}</b> participants have the income in range <b>%{y}</b>')
fig.show()

Currently, we can see that a large share of participants' income falls into the ranges `<$10,000` and `$10,000-50,000`. Only a small share have an income above `$300,000`, so this very high income level is uncommon among participants.
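The bar chart shows raw counts, while the discussion is about shares. As a rough sketch, the shares could be computed with `value_counts(normalize=True)` and reordered from the lowest to the highest range. The series `toy_income` below is made-up stand-in data, not the survey, and `ORDER` is an illustrative name.

```python
import pandas as pd

# Income ranges from lowest to highest (illustrative constant).
ORDER = ['<$10,000', '$10,000-50,000', '$50,000-100,000',
         '$100,000-300,000', '$300,000-500,000', '>$500,000']

# Hypothetical stand-in for processed_income_df['Current income'].
toy_income = pd.Series(
    ['<$10,000'] * 5 + ['$10,000-50,000'] * 3 + ['$50,000-100,000'] + ['>$500,000'],
    name='Current income',
)

shares = (
    toy_income.value_counts(normalize=True)  # fraction of rows per range
    .reindex(ORDER, fill_value=0)            # lowest-to-highest order, empty ranges as 0
    .mul(100)
    .round(1)
)
print(shares)
```

Reindexing with `fill_value=0` keeps ranges that have no participants visible instead of silently dropping them.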

Now, we will carry out a deeper analysis of each factor.

### **1.1 Income distribution by residential country**:
Clearly, income differs a lot between countries, but can we get more insight into it? How is income distributed within each country? Which continents or countries offer more potential for a good income, and which offer less? Let's find the answers in this section.


Now we want to check how income is distributed by residential country through these steps:
- Step 01: Group the income data by residential country then count the value for each range of current income.
- Step 02: Find the percentage of each income range in each country.
- Step 03: Use a `scatter geo plot` to present the distribution

In [7]:
# Group the dataframe by country and find the percentage of each income range in each country.
income_by_countries_df = processed_income_df.groupby('Residential country')['Current income'].value_counts().reset_index(name='count')
income_by_countries_df['Percentage'] = income_by_countries_df.groupby('Residential country')['count'].transform(lambda x: x / x.sum() * 100)
income_by_countries_df['Percentage'] = income_by_countries_df['Percentage'].round(3)
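A quick sanity check for this kind of table is that the percentages within each country add up to 100. The sketch below repeats the same group-and-normalize pattern on a tiny made-up dataframe (`toy`, with placeholder countries `A` and `B`), so it can be run without the survey file.

```python
import pandas as pd

# Made-up data with two placeholder countries.
toy = pd.DataFrame({
    'Residential country': ['A', 'A', 'A', 'B', 'B'],
    'Current income': ['<$10,000', '<$10,000', '$10,000-50,000',
                       '<$10,000', '$50,000-100,000'],
})

# Count each income range per country.
counts = (
    toy.groupby('Residential country')['Current income']
    .value_counts()
    .reset_index(name='count')
)

# Divide each count by its country total to get a within-country percentage.
country_totals = counts.groupby('Residential country')['count'].transform('sum')
counts['Percentage'] = counts['count'] / country_totals * 100

# Each country's percentages should sum to 100.
totals = counts.groupby('Residential country')['Percentage'].sum()
print(counts)
print(totals)
```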

In [8]:
# Draw a scatter_geo plot
fig = px.scatter_geo(
                    income_by_countries_df, 
                    locations='Residential country', 
                    locationmode='country names',
                    color='Current income',
                    size='Percentage',
                    labels={ 'Current income': '<b>Income Range</b>'},
                    category_orders={'Current income': [ '<$10,000', '$10,000-50,000', '$50,000-100,000', '$100,000-300,000',  '$300,000-500,000', '>$500,000']},
                    opacity = 0.7
                    )

fig.update_geos(showcountries=True, countrycolor="black", showland=True, showocean=True, oceancolor="#E3F4F4", landcolor = '#A9B388' )

fig.update_layout(
                title='<b>Percentage Distribution of Income Ranges by Country</b>',
                title_font_size = 25,
                width = 1000,
                height = 600,
                margin=dict(l=20, r=20, t=70, b=0),
                paper_bgcolor='#ffe6cc',
                legend=dict(orientation="h", yanchor="bottom", y=0.97, xanchor="right", x=1 )
                )

fig.update_traces(marker=dict(sizemode='area', sizeref= 0.1),
                  marker_autocolorscale=True, selector=dict(type='scattergeo'),
                  marker_line_color='black',
                  marker_line_width=0.5,
                  hovertemplate='%{marker.size} (%) of participants in %{location} have the income %{fullData.name}'
                  )

config = {'scrollZoom': True}
fig.show(config = config)



- The countries with a high percentage of income in the range `<$10,000` are mostly located in `Asia`, `Africa`, and `South America` (from about 40% up to over 80% in `Iran` or 90% in `Ethiopia`). In the `Middle East` and `Africa` in particular, this is the most common income range among participants. `Australia`, `North America`, and almost all of `Europe` have only a small percentage of income in this range.

- Unlike the range `<$10,000`, the range `$10,000-50,000` is distributed differently across most countries, with a lower percentage in `Africa` and the `Middle East`, as expected. The percentage rises sharply for countries in `Europe`, while other countries stay at about the same percentage as in the previous range. The dense concentration of this range in `Europe` shows that it is a common income range on this continent. It is also a common income range in `Asia`, with a high percentage in `Japan`, `Korea`, `China`, and the countries of `South East Asia`.

- For the range `$50,000-100,000`, there are some notable changes across countries:

    - In `Africa`, most countries have no participants with income in this range, which shows that this is a high-level income for most African countries. Only `South Africa` has about 21.6% of participants who reach this income range.

    - In `Europe`, most countries have a high percentage of income in this range, alongside the range `$10,000-50,000`.

    - Some countries in `Asia` also have a high percentage in this range: `China`, `Japan`, `South Korea`, and `Taiwan`. They are all highly developed countries in `Asia`.

    - Other notable countries are `Canada`, `The USA`, and `Australia`, which have a very low percentage in the two lower ranges but a much higher percentage in this range.

- For the range `$100,000-300,000`, the percentage decreases slightly for most countries, except for `The USA`, `Australia`, and the `United Arab Emirates`. `Israel` stands out, with up to 65.9% of its participants having income in this range.

- The two remaining ranges are very high incomes, so most countries have a very small percentage in them. A notable country here is `Zimbabwe`, where 15% of participants have this income.

In conclusion, there is a strong relationship between the residential country and the current income of its people. The regions with more potential for a higher income are `Europe`, `North America`, `Australia`, and some developed countries in `Asia` such as `Japan`, `South Korea`, and `China`.

### **1.2 Income distribution in each age range:**

Age range is another factor to consider. Do younger people earn less than those who have worked for a long time? This section aims to find trends in the income distribution across age ranges.

In [9]:
# Function to draw a stacked bar plot; we will reuse it in the other parts of this question.
def stacked_bar_plot(df, x_col, y_col,color_col, text_col, color_range, col_ordered, order):
    
    # Create a bar plot
    fig = px.bar(df, 
                x = x_col, 
                y = y_col,
                color = color_col,
                text = text_col,
                labels = { x_col: f"<b>{x_col}</b>", y_col: f"<b>{y_col}</b>", color_col: f"<b>{color_col}</b>"},
                category_orders = {'Current income':  ['<$10,000', '$10,000-50,000', '$50,000-100,000', '$100,000-300,000','$300,000-500,000', '>$500,000'],
                                   col_ordered: order},
                color_discrete_sequence = color_range,
                barmode = 'stack'
                )

    # Define the layout
    fig.update_layout(
        title_font_size = 25,
        width = 1100,
        height = 600,
        paper_bgcolor = '#ffe6cc',
        xaxis = dict(tickfont=dict(size=14)),
        yaxis = dict(tickfont=dict(size=14), ticksuffix = " "),
        margin=dict(l=20, r=30, t=70, b=20),
    )

    return fig


In [10]:
# Get the frequency of each income range for each range of age
age_income_df = processed_income_df.groupby('Age')['Current income'].value_counts().reset_index(name = 'Count')

# Find the percentage of each income range
age_income_df.loc[:,'Percentage'] = ((age_income_df['Count'] / age_income_df.groupby('Age')['Count'].transform('sum'))*100)
age_income_df['Percentage'] = age_income_df['Percentage'].round(2)
age_income_df = age_income_df.sort_values(by = ['Age', 'Percentage'], ascending = [True, False])

In [11]:
# Draw the plot
fig = stacked_bar_plot(age_income_df, 'Age', 'Percentage', 'Current income' , 'Percentage', ['#beca5c', '#57c550', '#45bfa3','#3835b7','#9e2bb2' ,'#aa2b1d'], 0, 0)
fig.update_layout(
    title_text = '<b>Participants\' current income with the Age range</b>',
)
fig.update_traces(marker_line_color='black', marker_line_width=1, hovertemplate='<b>%{text} (%)</b> of participants that in age range <b>%{x}</b> have the income <b>%{fullData.name}</b>')

fig.show()

From the plot, we can easily see that overall income rises with each age range from `22-24`, peaks in the range `55-59`, and then changes only slightly in the following ranges. Now, let's take a closer look:
- The notable range here is the first one, `18-21`: while 66.27% of individuals have an income `<$10,000`, there is also a high percentage of people in the highest income ranges, `$300,000-500,000` and `>$500,000`. This percentage is even higher than in other age ranges such as `22-24` or `25-29`. However, because people aged `18-21` make up a very small share of the total, those who reach these income ranges at that age are best seen as outstanding individuals rather than as characteristic of this age range.
- We can also see that people tend to earn more in their middle age ranges (about `35-59`): the percentage of participants in the low income ranges decreases across age ranges, while the percentage in the higher income ranges increases.
- For the income ranges `$100,000-300,000` and `>$300,000`, the percentage increases across age ranges. In other words, the older the age range, the larger the percentage of participants in the higher income ranges.
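The caveat about the `18-21` group comes down to group size: a percentage computed from a handful of people can look large. One way to keep this visible is to carry the group size next to each percentage, sketched below on made-up data. The threshold `MIN_GROUP_SIZE` is an arbitrary illustrative value, not something taken from the survey.

```python
import pandas as pd

# Made-up data: a tiny '18-21' group and a larger '25-29' group.
toy = pd.DataFrame({
    'Age': ['18-21'] * 4 + ['25-29'] * 40,
    'Current income': (
        ['<$10,000'] * 3 + ['>$500,000']
        + ['<$10,000'] * 20 + ['$10,000-50,000'] * 19 + ['>$500,000']
    ),
})

summary = toy.groupby('Age')['Current income'].value_counts().reset_index(name='Count')

# Keep the number of respondents behind each percentage.
summary['Group size'] = summary.groupby('Age')['Count'].transform('sum')
summary['Percentage'] = (summary['Count'] / summary['Group size'] * 100).round(2)

MIN_GROUP_SIZE = 30  # arbitrary illustrative threshold
summary['Small group'] = summary['Group size'] < MIN_GROUP_SIZE
print(summary)
```

In this toy example a single person accounts for 25% of the small group but only 2.5% of the larger one, which is why the flagged rows deserve a more cautious reading.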


### **1.3 Income distribution by current role:**
In this part, we will see how participants' roles affect their income.  
First, in `Data exploration part 2`, we found that many roles are not related to data jobs, so before starting our analysis, we will eliminate them.

In [12]:
# Define necessary roles
data_roles = ['Manager (Program, Project, Operations, Executive-level, etc)',
            'Machine Learning/ MLops Engineer',
            'Research Scientist', 
            'Data Scientist',
            'Data Analyst (Business, Marketing, Financial, Quantitative, etc)',
            'Statistician',
            'Teacher / professor']

processed_income_df_copy = processed_income_df.copy()

role_income_df = processed_income_df_copy.groupby('Current role')['Current income'].value_counts().reset_index(name = 'Count')
filtered_role_income_df = role_income_df[role_income_df['Current role'].isin(data_roles)].copy()

# Shorten the name of roles for better visualization
filtered_role_income_df.loc[:, 'Current role'] = filtered_role_income_df['Current role'].replace(r'^Data Analyst.*', 'Data Analyst', regex=True)
filtered_role_income_df.loc[:, 'Current role'] = filtered_role_income_df['Current role'].replace(r'^Manager.*', 'Manager', regex=True)
filtered_role_income_df.loc[:, 'Current role'] = filtered_role_income_df['Current role'].replace(r'^Machine Learning.*', 'ML Engineer', regex=True)

Then, we will find the `Percentage` of each income range for each role.

In [13]:
filtered_role_income_df['Percentage'] = (filtered_role_income_df['Count'] / filtered_role_income_df.groupby('Current role')['Count'].transform('sum')) * 100
filtered_role_income_df.loc[:, 'Percentage'] = filtered_role_income_df['Percentage'].round(2)
filtered_role_income_df = filtered_role_income_df.sort_values(by = ['Percentage'], ascending=[True])

Finally, we visualize the distribution for a clearer view. In this case, I will use a stacked bar chart to show the distribution of income ranges for each data role.

In [14]:
fig = stacked_bar_plot(filtered_role_income_df, 'Percentage', 'Current role', 'Current income' , 'Percentage', ['#F2F7A1', '#F05941', '#BE3144', '#973089', '#541c7b','#141E46'], 0, 0)
fig.update_layout(
    title_text = '<b>The percentage of participants\' current income for each role</b>',
    width = 1100,
    height = 600,
    margin = dict(l=10, r=10, t=110, b=20),
    legend=dict( orientation="h", yanchor="bottom", y=1, xanchor="right", x = 1)
)

fig.update_traces(marker_line_color='black', marker_line_width=1.2, hovertemplate='<b>%{text} (%)</b> of <b>%{y}</b> have the income <b>%{fullData.name}</b>')
fig.show()

From the plot, we can draw these conclusions:

- For `Statistician`, the majority (59.26%) have an income `<$10,000`. No statistician has an income `>$500,000`, and only a small percentage (0.93%) earn `$300,000-500,000`. This indicates that the potential for a high income as a `Statistician` is quite limited.

- `Teacher` and `Data Analyst` have quite similar income distributions, with about 50% in the range `<$10,000`, followed by 25% in `$10,000-50,000` and 16% in `$100,000-300,000`, and a very low percentage in `>$500,000`. Compared with `Statistician`, overall income is slightly higher.

- For `Machine learning engineer`, `Data Scientist`, and `Research Scientist`, more participants earn a higher income: only about 30% have an income `<$10,000`, and the percentages in the ranges `$100,000-300,000` and `>$500,000` increase considerably, showing that these roles have more potential for a higher income.

- Among the roles, `Manager` stands out with the highest overall income, with over 30% in `$100,000-300,000` and 2.55% in `>$500,000`. This is reasonable because a `Manager` usually holds a higher position and has more skills and experience than others.

To sum up, there are large differences in how income is distributed across roles. `Manager` tends to have the highest income, followed by `Machine learning engineer`, `Data Scientist`, and `Research Scientist`.

###  **1.4 Income distribution by the educational level**
Educational level can easily be considered an important factor in determining a person's income. Most of us think that the higher the level of education, the greater the potential for a higher income. But does this hold for the participants of this survey? Let's check.

As in the previous section, we only consider the education of participants in data roles.

In [15]:
# Rename the levels for clearer meaning
processed_income_df.loc[:, 'Higher education?'] = processed_income_df['Higher education?'].replace(r'^No formal education.*', 'Under high school', regex=True)
processed_income_df.loc[:, 'Higher education?'] = processed_income_df['Higher education?'].replace(r'^Some college.*', 'College without Bachelor degree', regex=True)
level_to_drop = 'I prefer not to answer'

In [16]:
# Keep only data roles and find the percentage of each income range per education level.
filtered_role_df = processed_income_df[processed_income_df['Current role'].isin(data_roles)].copy()
education_income_df = filtered_role_df.groupby('Higher education?')['Current income'].value_counts().reset_index(name = 'Count')
filtered_education_income_df = education_income_df[education_income_df['Higher education?'] != level_to_drop].copy()

filtered_education_income_df.loc[:,'Percentage'] = ((filtered_education_income_df['Count'] / education_income_df.groupby('Higher education?')['Count'].transform('sum'))*100)
filtered_education_income_df['Percentage'] = filtered_education_income_df['Percentage'].round(2)
filtered_education_income_df = filtered_education_income_df.sort_values(by = ['Higher education?', 'Percentage'], ascending = [True, False])

In [27]:
fig = stacked_bar_plot(filtered_education_income_df, 'Percentage', 'Higher education?', 'Current income' , 'Percentage', ['#FCF5ED', '#F4BF96', '#CE5A67', '#973089', '#541c7b','#141E46'],
                        'Higher education?', ['Under high school', 'College without Bachelor degree', 'Bachelor’s degree', 'Master’s degree', 'Doctoral degree', 'Professional doctorate'])
fig.update_layout(
    title_text = '<b>Participants\' current income by level of education</b>',
    height = 500,
    margin = dict(l=10, r=10, t=110, b=20),
    legend = dict(orientation = "h", yanchor="bottom", y=1, xanchor="right", x = 1 ),
)

fig.update_traces(marker_line_color = 'black', marker_line_width = 1.2, hovertemplate='<b>%{text} (%)</b> of <b>%{y}</b> have the income <b>%{fullData.name}</b>')
fig.show()

With all data roles, we can see the following trends:
- The higher the education level, the higher participants' income, with `Doctoral degree` the highest overall.
- Although both are considered the highest levels in the field, there are differences between the incomes of `Doctoral degree` and `Professional doctorate`: `Doctoral degree` has a larger percentage with income in the range `<$300,000`, whereas `Professional doctorate` is higher in the range `>$300,000`. These differences may come from differences in the work itself or in the fields each group focuses on.

Overall, education level has an important impact on a person's salary: higher levels give more potential for a higher income.

###  **1.5 Income distribution by the programming experience**
In the field of Information Technology, programming skills are widely recognized as one of the most important parts of every job, and data-related jobs are no exception. So in this section, we explore how income is affected by programming experience. Do those with more programming experience have a higher potential for a higher income?

In [18]:
experience_income_df = processed_income_df.groupby('Programming experience')['Current income'].value_counts().reset_index(name = 'Count')
experience_income_df.loc[:,'Percentage'] = ((experience_income_df['Count'] / experience_income_df.groupby('Programming experience')['Count'].transform('sum'))*100)
experience_income_df['Percentage'] = experience_income_df['Percentage'].round(2)
experience_income_df = experience_income_df.sort_values(by = ['Programming experience', 'Percentage'], ascending=[True, False])

In [19]:
fig = stacked_bar_plot(experience_income_df, 'Programming experience', 'Percentage', 'Current income' , 'Percentage', ['#F2F7A1', '#84ce69', '#309771','#1c737b','#14526e' ,'#071952'],
                        'Programming experience',  ['I have never written code', '< 1 years', '1-3 years', '3-5 years', '5-10 years', '10-20 years', '20+ years'])
fig.update_layout(
    title_text = '<b>Participants\' current income by programming experience</b>',
    )

fig.update_traces(marker_line_color='black', marker_line_width = 1, hovertemplate = '<b>%{text} (%)</b> of participants that <b>%{x}</b> have the income <b>%{fullData.name}</b>')
fig.show()

From this plot, we can see the overall trend: as individuals gain more programming experience, their income potential tends to increase. We discuss this in detail below:
- With experience `<3 years`: This is the early stage, when participants are just getting started with a programming language, and they earn less, with over 50% of them in `<$10,000`.
- With experience `3-10 years`: In this range, participants' income increases, with a higher percentage in the higher income ranges.
- With experience `>10 years`: There are still individuals with low income, but at this level of experience most participants earn more, especially in the ranges `$100,000-300,000` and `$300,000-500,000`.

###  **1.6 Income distribution by the current industry**

The industry a person works in also affects income a lot.

In [20]:
industry_income_df = processed_income_df.groupby('Current industry')['Current income'].value_counts().reset_index(name = 'Count')
industry_income_df.loc[:,'Percentage'] = ((industry_income_df['Count'] / industry_income_df.groupby('Current industry')['Count'].transform('sum'))*100)
industry_income_df['Percentage'] = industry_income_df['Percentage'].round(2)
industry_income_df = industry_income_df.sort_values(by='Percentage', ascending=False)

In [21]:
fig = stacked_bar_plot( industry_income_df, 'Current industry', 'Percentage', 'Current income' , 'Percentage', ['#8ADAB2', '#B5CB99', '#E48F45','#CD6688','#7A316F' ,'#461959'], 0,0)
fig.update_layout(
    title_text = '<b>Percentage distribution of income ranges by Industry</b>',
)
fig.update_traces(marker_line_color='black', marker_line_width=0.5, hovertemplate='<b>%{text} (%)</b> of participants that in <b>%{x}</b> industry have the income <b>%{fullData.name}</b>')

fig.show()

From this plot, we can draw the following conclusions:

- With low income (in the range `<$50,000`), `Academics/ Education` and `Non-profit/ Service` stand out, so these industries tend to pay less. As discussed above, participants who work in `Academics/ Education` are generally `Teacher / professor`, so there is a lower potential for a higher income.

- In the medium income range (`$50,000 - 300,000`), the highest percentages are in `Medical/Pharmaceutical`, `Insurance/ Risk Assessment`, and `Online Service/ Internet-based Services`. This means that most individuals in these industries earn an income in this range. This is a good salary range, so these industries suit someone looking for a job with a stable income.

- With high income (`$300,000` or more), `Online Service`, `Accounting/Finance`, and `Computer/ Technology` have a high percentage in `$300,000-500,000` and `>$500,000`, showing that these industries have the most potential to reach the highest income ranges. Another notable industry is `Non-profit/Service`: despite its very high percentage of low income, its percentage in the high income range is also very high (up to 1.14%), which means individuals in this industry can also earn a very high income.



##### **Overall:**
Okay, with all the analysis above, let's take an overall view of what we have done so far:
<br>
Based on the overall exploration of factors such as geographical location, current role, programming experience, and qualifications, we can draw the following conclusions regarding their impact on a person's salary:

- `Residential country`: Different regions or countries often have different salary levels due to factors such as cost of living, economic development, and demand for specific skills. Therefore, income can differ greatly between countries.

- `Age`: The distribution of income also differs between age ranges. Income tends to be higher for older age ranges. One possible reason for this pattern is that, in many cases, higher age goes along with more experience, stronger skills, or even a higher position in the company. This is only a hypothesis, but we can easily check it in the dataset.

- `Current Role`: The specific role that a person holds in the workplace can significantly affect their salary. Roles that require specialized skills and expertise and are in high demand in the industry, like `ML Engineer`, `Research Scientist`, and `Data Scientist`, offer higher salaries. Additionally, managerial or leadership positions like `Manager` tend to have higher earning potential compared to entry-level or junior positions.

- `Education level`: Participants with different education levels have different incomes: the higher a person's educational level, the greater their potential for a good salary. The reason is that educational level, in many cases, reflects a person's skills, working capacity, and professional knowledge. With these abilities and competencies, people seek a position and a corresponding salary that fit their skills and qualifications.

- `Programming Experience`: Programming experience is a valuable asset in the field of IT and can positively impact a person's salary. Generally, individuals with more programming experience tend to have higher income potential. As individuals gain more experience and skills in programming, they often become better in their roles, handle more complex projects, and can get higher salaries from employers.

- `Industry`: Industry is also a very important factor in determining a person's income. Industries with a high growth rate, or in specific circumstances, tend to pay more to attract employees.

Besides that, it is important to note that these factors do not affect income independently, and their impact on salary can vary depending on various combinations and interactions. Other factors like `company size` and `skill set` can also influence a person's salary.
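As a rough illustration of how two factors can be examined together, for example whether older age ranges also tend to have more programming experience, a row-normalized cross-tabulation is one option. The dataframe `toy` below is made-up data that only reuses the column names of the survey; the resulting numbers carry no finding.

```python
import pandas as pd

# Made-up respondents; only the column names mirror the survey.
toy = pd.DataFrame({
    'Age': ['22-24', '22-24', '22-24', '40-44', '40-44', '40-44'],
    'Programming experience': ['< 1 years', '1-3 years', '< 1 years',
                               '10-20 years', '5-10 years', '10-20 years'],
})

# Share of each experience level within every age range (each row sums to 100).
table = (
    pd.crosstab(toy['Age'], toy['Programming experience'], normalize='index')
    .mul(100)
    .round(1)
)
print(table)
```

With `normalize='index'`, each row describes one age range, so the rows can be compared directly even when the age ranges have very different numbers of respondents.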

Overall, a combination of `Residential country`, `Current role`, `Programming experience`, and `Industry` make a significant contribution to a person's salary. By understanding their influences, we can make better decisions about our career paths, skill development, and potential earnings.

### **Question 02: Is there any difference between men's and women's salaries? If yes, explain the reason.**

For gender, in this analysis we only consider participants recorded as `Man` or `Woman`.

In [22]:
# Extract and find the percentage of each income range per gender.
gender_income_df = processed_income_df.groupby('Gender')['Current income'].value_counts().reset_index(name = 'Count')
filtered_gender_income_df = gender_income_df[gender_income_df['Gender'].isin(['Man', 'Woman'])].copy()
filtered_gender_income_df.loc[:,'Percentage'] = ((filtered_gender_income_df.loc[:,'Count'] / filtered_gender_income_df.groupby('Gender')['Count'].transform('sum'))*100)
filtered_gender_income_df['Percentage'] = filtered_gender_income_df['Percentage'].round(2)

In [23]:
fig = stacked_bar_plot( filtered_gender_income_df, 'Gender', 'Percentage', 'Current income' , 'Percentage', ['#D2D79F', '#86b395', '#75969b','#646684','#66536c' ,'#483838'], 0, 0)
fig.update_layout(
    title_text = '<b>The percentage of participants\' current income for each gender</b>',
)
fig.update_traces(marker_line_color='black', marker_line_width=0.5, hovertemplate='<b>%{text} (%)</b> of <b>%{x}</b> have the income <b>%{fullData.name}</b>')

fig.show()

With the plot above, we can compare the income of men and women and find the following differences:
- Men earn higher incomes: men have a higher percentage in most income ranges (from `$10,000` upwards). This suggests that men are more likely to earn higher incomes than women.
- Women show a higher percentage in the lowest income range, `<$10,000`. This indicates that a larger percentage of women fall into the category of lower-income earners.

Now, let's analyze the reasons for these differences.
In the previous question, we found that a person's income is strongly related to factors like `Age`, `Programming experience`, `Education level`, `Current role`, and `Industry`, so let's take a deeper look at the distribution of gender within each factor. I would like to go through all of these factors, but given the limited time, I only show the differences by `Age` and `Current role`.

#### **2.1 Gender by age:**

Now, we consider the age range of each gender. Given the conclusion that a higher age gives more potential for a higher salary, I want to see how gender is distributed across age ranges.

In [24]:
# Keep only participants whose gender is 'Man' or 'Woman'
gender_df = ds_survey_df[ds_survey_df['Gender'].isin(['Man', 'Woman'])].copy()

In [25]:
gender_age_df = gender_df.groupby('Age')['Gender'].value_counts().reset_index(name = 'Count')
gender_age_df['Percentage'] = (gender_age_df['Count'] / gender_age_df.groupby('Age')['Count'].transform('sum')) * 100
gender_age_df['Percentage'] = gender_age_df['Percentage'].round(2)

In [41]:
fig = stacked_bar_plot( gender_age_df, 'Age', 'Count', 'Gender' , 'Percentage',None, 0,0)
fig.update_layout(
    title_text='<b>The distribution of participants\' gender across age ranges</b>',
    width=1000,
    height=400,
    margin=dict(l=20, r=30, t=70, b=20),
)
fig.update_traces(marker_line_color='black', marker_line_width=1, hovertemplate='<b>%{text} (%)</b> of participants in age range <b>%{x}</b> are <b>%{fullData.name}</b>')

fig.show()

We can see that the higher the age range, the lower the percentage of women who participated in this survey. Since participants tend to earn less at a younger age, the decreasing percentage of `Woman` in the older ranges partly explains the income difference between `Man` and `Woman`.

#### **2.2 Gender by Current job:**
Now, let's move on to the percentage of each gender in the data-related jobs, so that we can see which gender makes up the majority of each job.

In [31]:

role_gender_df = gender_df.groupby('Current role')['Gender'].value_counts().reset_index(name='Count')
role_gender_df = role_gender_df[role_gender_df['Current role'].isin(data_roles)].copy()

# Shorten the name of roles for better visualization
role_gender_df.loc[:, 'Current role'] = role_gender_df['Current role'].replace(r'^Data Analyst.*', 'Data Analyst', regex=True)
role_gender_df.loc[:, 'Current role'] = role_gender_df['Current role'].replace(r'^Manager.*', 'Manager', regex=True)
role_gender_df.loc[:, 'Current role'] = role_gender_df['Current role'].replace(r'^Machine Learning.*', 'ML Engineer', regex=True)

role_gender_df['Percentage'] = (role_gender_df['Count'] / role_gender_df.groupby('Current role')['Count'].transform('sum')) * 100
role_gender_df['Percentage'] = role_gender_df['Percentage'].round(2)
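
The three separate `replace` calls used above to shorten role names could also be expressed as one mapping. The following is a minimal sketch under assumptions: the role labels are hypothetical examples shaped like the long survey answers, and `short_names` is a name introduced here only for illustration.

```python
import pandas as pd

# Hypothetical role labels, shaped like the long survey answers
roles = pd.Series([
    'Data Analyst (Business, Marketing, Financial, Quantitative, etc)',
    'Machine Learning/ MLops Engineer',
    'Manager (Program, Project, Operations, Executive-level, etc)',
    'Data Scientist',
])

# One mapping from regex pattern to short label, applied in a single call
short_names = {
    r'^Data Analyst.*': 'Data Analyst',
    r'^Manager.*': 'Manager',
    r'^Machine Learning.*': 'ML Engineer',
}
short_roles = roles.replace(short_names, regex=True)
print(short_roles.tolist())
```

Labels that match none of the patterns, such as `Data Scientist`, are left unchanged.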


In [40]:
fig = stacked_bar_plot( role_gender_df, 'Current role', 'Count', 'Gender' , 'Percentage', None, 0, 0)
fig.update_layout(
    title = '<b>The percentage of participants\' Gender for each role</b>',
    width=1000,
    height=400,
)
fig.update_traces(marker_line_color='black', marker_line_width=1.2, hovertemplate='<b>%{text} (%)</b> of <b>%{x}</b> are <b>%{fullData.name}</b>')

fig.show()

As we saw above, roles with a high potential for more income, like `Manager`, `ML Engineer` and `Research Scientist`, have only a small percentage of `Woman`, while in roles with less income, like `Data Analyst`, `Statistician` and `Teacher/ professor`, `Woman` makes up a higher percentage. This also helps explain the income difference between `Man` and `Woman`.

Overall, we only considered two basic factors that affect the income of `Man` and `Woman`. Besides those, there are other factors we could consider, like `Programming experience`, `Industry`, ... but we leave them for later.

Finally, from the comparisons above, we get an overview of the income difference between `Man` and `Woman` among the participants of this survey.

### **Question 03: What is the tool set for each data roles?**

Tools like `programming language`, `IDE`, `Machine learning frameworks`, etc. are very important for a person who works in the data industry. For each role, there are several specific tools that a person should learn and master. So, with this question, we will find the set of tools that each data-related role needs.

In this question, we only consider these roles: `Data Scientist`, `Data Analyst`, `Machine learning engineer`, `Research scientist`, `Statistician`.  
Also, the tools we will consider here are `Programming language`, `IDE`, `Data visualization libraries`, `ML frameworks`, `ML Algorithms`, `NLP methods`, `Computer visions method`, `Data products`, `Business Intelligent tools`.

In [33]:
data_roles = ['Machine Learning/ MLops Engineer',
            'Research Scientist', 
            'Data Scientist',
            'Data Analyst (Business, Marketing, Financial, Quantitative, etc)',
            'Statistician']

# Define the skills (tool types) that we consider significant.
skills = ['Programming language', 'IDE', 'Data visualization libraries', 'ML frameworks', 'ML Algorithms', 'NLP methods', 'Computer visions method', 'Data products', 'Business Intelligent tools']
role_df = processed_income_df[processed_income_df['Current role'].isin(data_roles)].copy()

role_df.loc[:,'Current role'] = role_df['Current role'].replace(r'^Data Analyst.*', 'Data Analyst', regex=True)
role_df.loc[:,'Current role'] = role_df['Current role'].replace(r'^Machine Learning.*', 'ML Engineer', regex=True)

filtered_role_df = role_df.loc[:,role_df.columns.str.startswith(tuple(skills))].copy()
filtered_role_df['Current role'] = role_df['Current role']

filtered_role_df = filtered_role_df.groupby('Current role').count()
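
To make the prefix selection and counting step above concrete, here is a small sketch on toy data. It assumes that each tool column holds a value when the participant selected the tool and `NaN` otherwise; the column names and answers below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy answers (assumption): a tool column is non-null only if the tool was selected
toy_df = pd.DataFrame({
    'Current role': ['Data Scientist', 'Data Scientist', 'Statistician'],
    'Programming language (Python)': ['Python', 'Python', np.nan],
    'Programming language (R)': [np.nan, 'R', 'R'],
    'IDE (RStudio)': [np.nan, 'RStudio', 'RStudio'],
    'Age': ['25-29', '30-34', '40-44'],
})

# str.startswith accepts a tuple, so several prefixes are matched at once
prefixes = ('Programming language', 'IDE')
tool_cols = toy_df.columns[toy_df.columns.str.startswith(prefixes)]

# count() ignores NaN, so each cell is the number of users of that tool per role
tool_counts = toy_df.groupby('Current role')[list(tool_cols)].count()
print(tool_counts)
```

Columns that do not start with one of the prefixes, such as `Age`, are dropped by the selection.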

After that, we extract the dataframe for each tool type and store them in a dictionary.

In [34]:

skill_df_dict = {}

# For each skill, select its tool columns (counts per role) and transpose so that roles become columns
for skill in skills:
    sub_df = filtered_role_df.loc[:, filtered_role_df.columns.str.startswith(skill)]
    sub_df.index = sub_df.index.rename(name= None)
    skill_df_dict[skill] = sub_df.T
     

Then, for each role, convert the counts into percentages within each tool type (each tool's count divided by the total count of that tool type for the role).

In [35]:
for key, value in skill_df_dict.items():

    for role in value.columns:
        value[role+'_total'] = value[role].sum()
        value[role] = (value[role]/ value[role+'_total'])* 100
        value.drop(columns= [role +'_total'], inplace= True)
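
The normalization in the loop above can also be written as a single broadcast division. This is a sketch on invented counts, not the notebook's actual data; `counts` and `shares` are hypothetical names.

```python
import pandas as pd

# Toy counts (invented): rows are tools, columns are roles
counts = pd.DataFrame(
    {'Data Scientist': [80, 15, 5], 'Statistician': [20, 70, 10]},
    index=['Python', 'R', 'SQL'],
)

# counts.sum() gives one total per role; the division is aligned on column names
shares = counts / counts.sum() * 100
print(shares)
```

Each column of `shares` adds up to 100, so a value is the tool's share among all selections of that tool type for the role, rather than the share of respondents who use the tool.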


We keep only the three most popular tools for each tool type.

In [36]:
top_3_skill_dict_for_roles = {}

for key, value in skill_df_dict.items():
    # Find the top 3 tools of this skill for each role
    for role in value.columns:
        if role not in top_3_skill_dict_for_roles:
            top_3_skill_dict_for_roles[role] = value.loc[:,role].nlargest(3).to_frame().T
        else:
            top_3_skill_dict_for_roles[role] = pd.concat([top_3_skill_dict_for_roles[role], value.loc[:, role].nlargest(3).to_frame().T], axis=1)
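
As an illustration of the top-3 selection above, the sketch below applies `nlargest` to a single role column. The percentages are invented, and `shares`, `top_3` and `top_3_row` are hypothetical names.

```python
import pandas as pd

# Toy shares of one tool type for one role (invented numbers)
shares = pd.Series(
    {'Python': 45.0, 'SQL': 25.0, 'R': 15.0, 'Bash': 10.0, 'MATLAB': 5.0},
    name='Data Scientist',
)

# nlargest keeps the three highest values, sorted in descending order
top_3 = shares.nlargest(3)

# to_frame().T turns them into a single row labelled by the role,
# which is the shape that is concatenated column-wise in the loop above
top_3_row = top_3.to_frame().T
print(top_3_row)
```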
        

In [37]:
all_role_skill_dict = {}

for key, value in top_3_skill_dict_for_roles.items():
    skill_dict = {}

    # Group all top tools and their percentages by skill.
    for col in value.columns:
        match = re.search(r'(.+)\s+\(([^)]+)\)', col)
        if match:
            skill_group = match.group(1)
            skill = match.group(2)
            if skill_group not in skill_dict:
                skill_dict[skill_group] = []
            skill_dict[skill_group].append((skill, value.loc[key,col]))
    all_role_skill_dict[key] = skill_dict
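
The regular expression above relies on the column names following the layout `<tool type> (<tool>)`. Assuming that layout, a quick sketch of what it extracts (the column names below are examples, not a full list):

```python
import re

# Assumed column-name layout: "<tool type> (<tool>)"
pattern = re.compile(r'(.+)\s+\(([^)]+)\)')

columns = [
    'Programming language (Python)',
    'ML frameworks (Scikit-learn)',
    'Current role',  # no parentheses, so it is skipped
]

parsed = {}
for col in columns:
    match = pattern.search(col)
    if match:
        # group(1) is the tool type, group(2) is the tool name
        parsed.setdefault(match.group(1), []).append(match.group(2))

print(parsed)
```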


Now, let's visualize the result for each role. Here, I use a `sunburst` plot. The inner layer shows the skills (tool types) that we consider. The outer layer shows the top 3 tools for each skill, sized by the percentage computed above, which also indicates how important each tool is.

In [38]:

fig = make_subplots(
    rows = 3, cols = 2,
    subplot_titles = [f'<b>Tool set for {role}</b>' for role in all_role_skill_dict.keys()],
    specs = [[{'type': 'sunburst'}] *2] * 3,
    horizontal_spacing = 0,
    vertical_spacing=0.05,
)

for i, (role, skill_dict) in enumerate(all_role_skill_dict.items()):

    df = pd.DataFrame(skill_dict)
    df = pd.melt(df)
    df['value1'] = df['value'].apply(lambda x: x[0])
    df['value2'] = df['value'].apply(lambda x: x[1])
    df.drop(columns='value', inplace=True)
    df.rename(columns={'value1': '<b>Tool</b>', 'variable': '<b>Tool type</b>', 'value2': 'Percentage'}, inplace=True)

    sunburst_fig = px.sunburst(df,
                               path=['<b>Tool type</b>', '<b>Tool</b>'],
                               values='Percentage',
                               branchvalues='total',
                               color_continuous_scale='Viridis'
                               )
    sunburst_data = sunburst_fig['data'][0]
    sunburst_data['meta'] = {'role': role}
    sunburst_data['hovertemplate'] = '<b>%{label}</b> is used by <b>%{value}</b> of <b>%{meta.role}</b>'
    row_idx = i // 2 + 1
    col_idx = i % 2 + 1

    fig.add_trace(sunburst_data, row=row_idx, col=col_idx)

# Update the layout
fig.update_layout(
    width=1100,
    height=1800,
    margin=dict(l=0, r=0, t=70, b=20),
    paper_bgcolor='#ffe6cc',
    showlegend=False,
)
fig.update_traces(marker_line_color='black', marker_line_width=0.5)

fig.show()
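
The reshaping done inside the loop above (from a dictionary of `(tool, percentage)` tuples to a long table) can be sketched on its own. The values are invented for illustration, and building the rows with a comprehension is an alternative to the `melt`-based approach used in the cell, not the method the notebook itself uses.

```python
import pandas as pd

# Toy input (invented): tool type -> list of (tool, percentage) tuples
skill_dict = {
    'Programming language': [('Python', 45.0), ('SQL', 25.0), ('R', 15.0)],
    'IDE': [('Jupyter Notebook', 30.0), ('VSCode', 20.0), ('RStudio', 10.0)],
}

# One row per (tool type, tool) pair
rows = [
    {'Tool type': tool_type, 'Tool': tool, 'Percentage': pct}
    for tool_type, tools in skill_dict.items()
    for tool, pct in tools
]
long_df = pd.DataFrame(rows)
print(long_df)
```

A table in this shape can be passed to `px.sunburst` with `path=['Tool type', 'Tool']` and `values='Percentage'`, as the cell above does with its own column names.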


From these plots, we can see the tool set for each role.
- There are lots of similarities between the tool sets of the roles:
    - `Python`, `SQL` and `R` are the three programming languages that are widely used by all roles. So if we want to build good skills for a data job, we should focus on practicing these languages.
    - About the `visualization libraries`: `Seaborn`, `Matplotlib` and `Plotly` are the most common libraries across all roles. This shows their leading position compared with other visualization libraries.
    - With `Machine Learning`, an important skill for all roles, `Scikit-learn` and `TensorFlow` are two frameworks that all roles use. Also, `Linear regression` and `Decision tree` are the two most popular `algorithms` for all roles.
    - The other tool sets also share a high similarity across roles.

- Besides that, some tools are more specific to certain roles. Below are some notable points from these plots:

    - With `Programming language`: `Bash` is used by `Machine Learning Engineer`, and `Matlab` by `Research Scientist`.

    - With `IDE`: for `Statistician` and `Research Scientist`, `RStudio` is the most necessary because `R` is a significant language for statistical analysis.

    - With `ML Algorithms`: `Convolutional Neural Network` is mostly used by `Machine Learning Engineer` and `Research Scientist`.


In conclusion, we have found the tools that are specific to each role. From that, we can start to learn and practice them to prepare for our future careers.