# Final Project - Data Visualization

## Chhort Chhorraseth
This dataset provides a comprehensive view of salary trends and workforce dynamics within a specific profession. Its columns describe employee roles and their corresponding compensation. The key columns are:
1. **work_year:** The year the salary was paid.
2. **experience_level:** The experience level in the job during the year.
3. **employment_type:** The type of employment for the role.
4. **job_title:** The role worked in during the year.
5. **salary:** The total gross salary amount paid.
6. **salary_currency:** The currency of the salary paid, as an ISO 4217 currency code.
7. **salary_in_usd:** The salary in USD.
8. **employee_residence:** The employee's primary country of residence during the work year, as an ISO 3166 country code.
9. **remote_ratio:** The overall amount of work done remotely.
10. **company_location:** The country of the employer's main office or contracting branch.
11. **company_size:** The median number of people that worked for the company during the year.

The dataset serves as a powerful resource for analyzing salary distribution across various roles, experience levels, and organizational contexts. It enables stakeholders to identify trends such as the demand for specific experience levels, the geographic influence on salary, and variations in compensation by company size or job title.



In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

In [2]:
df = pd.read_csv('ds_salaries.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          607 non-null    int64 
 1   work_year           607 non-null    int64 
 2   experience_level    607 non-null    object
 3   employment_type     607 non-null    object
 4   job_title           607 non-null    object
 5   salary              607 non-null    int64 
 6   salary_currency     607 non-null    object
 7   salary_in_usd       607 non-null    int64 
 8   employee_residence  607 non-null    object
 9   remote_ratio        607 non-null    int64 
 10  company_location    607 non-null    object
 11  company_size        607 non-null    object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB


In [5]:
# Drop the 'Unnamed: 0' and 'salary_currency' columns, which are not needed for the analysis
df.drop(['Unnamed: 0','salary_currency'], axis=1, inplace=True)

In [6]:
df.shape
print('Number of Rows', df.shape[0])
print('Number of Columns', df.shape[1])

Number of Rows 607
Number of Columns 10


In [7]:
# Check if there are any missing values
df.isna().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [8]:
# Check the number of unique values in each column
category = [ 'experience_level', 'employment_type', 'job_title',
       'salary', 'employee_residence', 'remote_ratio',
       'company_location', 'company_size']
print ('\nNumber of unique values in Categorical variables:\n')
print(df[category].nunique())


Number of unique values in Categorical variables:

experience_level        4
employment_type         4
job_title              50
salary                272
employee_residence     57
remote_ratio            3
company_location       50
company_size            3
dtype: int64


**1. What percentage of the workforce belongs to each experience level, and how evenly distributed are the experience levels?**

In [9]:
import plotly.express as px

In [10]:
ex_level = df['experience_level'].value_counts().reset_index()
ex_level.columns = ['Experience Level', 'Count']

# Replace short names with full descriptive names
ex_level['Experience Level'] = ex_level['Experience Level'].replace({
    'EN': 'Entry-level/Junior',
    'MI': 'Mid-level/Intermediate',
    'SE': 'Senior-level/Expert',
    'EX': 'Executive-level/Director'
})

# Calculate percentages and create labels
ex_level['Percentage'] = (ex_level['Count'] / ex_level['Count'].sum() * 100).round(2)
ex_level['Label'] = ex_level.apply(
    lambda row: f"{row['Experience Level']}<br>{row['Count']}<br>{row['Percentage']}%", axis=1
)

# Create a treemap
fig = px.treemap(
    ex_level,
    path=['Label'],  # Use the label with descriptive names
    values='Count',  
    color='Percentage',  
    color_continuous_scale=px.colors.sequential.PuBuGn,
    title='Experience Level Distribution',
    template='plotly_dark',
    width=1000,
    height=500
)

fig.show()



This treemap shows the distribution of experience levels in the dataset, grouped into 4 categories (Entry-level/Junior, Mid-level/Intermediate, Senior-level/Expert, Executive-level/Director) along with their respective proportions. It gives insight into which experience levels are most represented, and therefore likely in high demand, within the dataset.<br><br>
**Key Insights**
- The most prominent group is Senior-level/Expert, comprising 46.13% of the total, indicating a strong preference for individuals with significant expertise.

- Mid-level/Intermediate roles follow at 35.09%, suggesting a substantial need for individuals with moderate experience who can contribute reliably while growing into senior roles. Together, these 2 categories dominate the dataset, making up over 80% of the overall distribution.

- Entry-level/Junior roles, on the other hand, account for only 14.5%, indicating a relative scarcity of opportunities for fresh talent and early-career professionals, which may make entering the industry challenging.

- Lastly, Executive-level/Director represents a small proportion at just 4.28%, which is expected for leadership positions compared to operational roles.

Overall, the graph reflects a workforce structure that prioritizes expertise while offering limited entry points for junior professionals. It provides a useful perspective for organizations assessing their talent pipeline and for individuals trying to understand how competitive it is to enter the industry.
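The percentages quoted above can also be obtained directly from pandas without building the count table first. The following is a minimal sketch under stated assumptions: the `levels` series is a hypothetical toy sample, not the real `experience_level` column, so its numbers do not match the figures in this notebook.

```python
import pandas as pd

# Hypothetical toy sample; in the notebook this would be df['experience_level'].
levels = pd.Series(
    ['SE', 'SE', 'MI', 'EN', 'SE', 'MI', 'EX', 'SE'],
    name='experience_level',
)

# normalize=True returns shares that sum to 1; scale to percentages and round.
share = (levels.value_counts(normalize=True) * 100).round(2)
print(share)
```

Applied to the real column, the same call should yield the shares that the treemap labels display.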


**2. How does the salary distribution compare across different job experience levels?**

In [11]:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

In [12]:
df.columns

Index(['work_year', 'experience_level', 'employment_type', 'job_title',
       'salary', 'salary_in_usd', 'employee_residence', 'remote_ratio',
       'company_location', 'company_size'],
      dtype='object')

In [13]:
df['experience_level'] = df['experience_level'].replace({
    'EN': 'Entry-level/Junior',
    'MI': 'Mid-level/Intermediate',
    'SE': 'Senior-level/Expert',
    'EX': 'Executive-level/Director'
})

ex_level_colors = {
    'Entry-level/Junior': '#1f77b4',       # Blue
    'Mid-level/Intermediate': '#ff7f0e',  # Orange
    'Senior-level/Expert': '#2ca02c',     # Green
    'Executive-level/Director': '#d62728' # Red
}

# Create the boxplot with the custom color palette
fig = px.box(
    df,
    x='experience_level',
    y='salary_in_usd',
    color='experience_level',
    color_discrete_map=ex_level_colors,
    title='Salary Distribution by Experience Level',
    labels={
        'experience_level': 'Experience Level',
        'salary_in_usd': 'Salary (USD)'
    },
    template='plotly_dark'
)
fig.update_layout(
    xaxis_title='Experience Level',
    yaxis_title='Salary (USD)'
)

fig.show()

This box plot illustrates the distribution of salaries across the 4 experience levels (Entry-level/Junior, Mid-level/Intermediate, Senior-level/Expert, Executive-level/Director). A box plot shows each category's range, median, and outliers, revealing how salaries differ by experience level.<br><br>
**Key Insights**
- The graph shows a clear positive relationship between salary and experience level: salaries tend to increase significantly as individuals move up in experience level. This is a useful insight for those pursuing a career for its financial benefits. For example, from the Junior level to the Intermediate level the interquartile range broadens and shifts upward, from below $100k to roughly between $100k and $200k. In addition, a few outliers exceed the upper fence, indicating that some entry-level professionals are likely in high demand and earn well above their peers.

- The Executive level stands out the most, with the highest median and an outlier reaching approximately $600k at the top. This is particularly relevant for professionals aspiring to leadership roles, where the potential earnings are greatest.

- Decision-making for both employees and employers: for employers, this graph helps inform salary structuring to attract and retain talent at each experience level. Likewise, it helps employees understand their market value at different stages of their career.

Overall, this visualization provides a general understanding of how salaries are distributed across experience levels. It enables professionals and employers to make informed decisions about structuring compensation strategies to ensure fairness and competitiveness.
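To back up statements about interquartile ranges and outliers with numbers rather than reading them off the plot, the quartiles and upper fence can be computed per group. This is only a sketch: the `toy` frame is hypothetical data, and pandas' default quantile interpolation may differ slightly from the quartile method Plotly uses when drawing boxes.

```python
import pandas as pd

# Hypothetical toy data, not taken from ds_salaries.csv.
toy = pd.DataFrame({
    'experience_level': ['EN'] * 5 + ['SE'] * 5,
    'salary_in_usd': [40_000, 50_000, 60_000, 70_000, 200_000,
                      100_000, 120_000, 140_000, 160_000, 180_000],
})

# Quartiles per experience level (rows: level, columns: q1/median/q3).
grouped = toy.groupby('experience_level')['salary_in_usd']
stats = grouped.quantile([0.25, 0.5, 0.75]).unstack()
stats.columns = ['q1', 'median', 'q3']

# Upper fence = Q3 + 1.5 * IQR, the usual box-plot outlier threshold.
stats['upper_fence'] = stats['q3'] + 1.5 * (stats['q3'] - stats['q1'])

# Count the salaries above their own group's fence.
fence_per_row = toy['experience_level'].map(stats['upper_fence'])
toy['above_fence'] = toy['salary_in_usd'] > fence_per_row
stats['n_above_fence'] = toy.groupby('experience_level')['above_fence'].sum()

print(stats)
```

In the toy data, the single 200k entry-level salary is the only value flagged as an outlier.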


**3. How does the remote work ratio change across the years?**

In [14]:
df.columns

Index(['work_year', 'experience_level', 'employment_type', 'job_title',
       'salary', 'salary_in_usd', 'employee_residence', 'remote_ratio',
       'company_location', 'company_size'],
      dtype='object')

In [15]:
remote_ratio_year = df.groupby(['work_year','remote_ratio']).size()
ratio_2020 = np.round(remote_ratio_year[2020].values/remote_ratio_year[2020].values.sum(),2)
ratio_2021 = np.round(remote_ratio_year[2021].values/remote_ratio_year[2021].values.sum(),2)
ratio_2022 = np.round(remote_ratio_year[2022].values/remote_ratio_year[2022].values.sum(),2)
fig = go.Figure()
categories = ['No-Remote', 'Partially-Remote', 'Fully-Remote']
fig.add_trace(go.Scatterpolar(
    r = ratio_2020, 
    theta = categories,
    fill = 'toself',
    name = '2020 remote ratio',
    
))
fig.add_trace(go.Scatterpolar(
    r = ratio_2021, 
    theta = categories,
    fill = 'toself',
    name = '2021 remote ratio'
))
fig.add_trace(go.Scatterpolar(
    r = ratio_2022, 
    theta = categories,
    fill = 'toself',
    name = '2022 remote ratio'
))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 1] 
        )
    ),
    template='plotly_dark',  # Set the dark theme
    title='Remote Work Ratio by Year',
    legend_title="Work Year"
)
fig.show()


This radar chart shows the distribution of the remote work ratio across the years 2020-2022. It displays the share of work in 3 categories (No-Remote, Partially-Remote, Fully-Remote), where each data point along an axis indicates the proportion of work in that category for the respective year.<br><br>
**Key Insights**
- In 2020, the proportions of No-Remote and Partially-Remote work are almost equal, suggesting a balance between fully on-site roles and hybrid roles during that year.

- Fully-Remote work, however, has almost double the proportion of either of the other two categories, which might reflect an early shift toward working remotely, possibly influenced by the COVID-19 pandemic, which forced many organisations to adopt remote work practices.

- Adoption of Fully-Remote work: there is a continued shift toward Fully-Remote work in the following years, 2021 and 2022, reflecting the growing acceptance of remote work as a viable long-term option.

Overall, this radar plot provides insight into how workplace practices evolved. It shows how the pandemic accelerated the shift from traditional on-site work to hybrid and fully remote practices. Fully-Remote work appears to have held up consistently between 2020 and 2022, suggesting its acceptance as a sustainable approach to work in the long run. It also highlights the need for companies to establish remote-friendly practices and policies to ensure long-term success.
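The per-year proportions fed to the radar chart rely on the positional order of `.values`, which would silently misalign with the category labels if a remote ratio were absent in some year. A hedged alternative is `pd.crosstab` with row normalization, sketched here on a hypothetical `toy` frame in which the 50% category is deliberately missing for 2021. Whether this situation occurs in the real dataset is not checked here.

```python
import pandas as pd

# Hypothetical toy data; 2021 has no partially remote (50) rows on purpose.
toy = pd.DataFrame({
    'work_year':    [2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021],
    'remote_ratio': [0,    50,   100,  100,  0,    100,  100,  100],
})

# normalize='index' turns each row (year) into proportions that sum to 1.
shares = pd.crosstab(toy['work_year'], toy['remote_ratio'], normalize='index')

# Make sure all three categories exist as columns, in a fixed order.
shares = shares.reindex(columns=[0, 50, 100], fill_value=0.0).round(2)

print(shares)
```

Each row, for example `shares.loc[2020].values`, could then be passed as `r` to a `go.Scatterpolar` trace in the same category order.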

**4. Which company locations offer the highest and lowest average salaries?**

In [16]:
pip install country-converter

Note: you may need to restart the kernel to use updated packages.


In [17]:
import country_converter as coco

In [18]:
df['company_location'] = coco.convert(names=df['company_location'], to='ISO3')
df['company_location']

0      DEU
1      JPN
2      GBR
3      HND
4      USA
      ... 
602    USA
603    USA
604    USA
605    USA
606    USA
Name: company_location, Length: 607, dtype: object

In [19]:
location_salary = df.groupby('company_location')['salary_in_usd'].mean().reset_index()
location_salary.columns = ['Location', 'Average Salary']

In [20]:
highest_lowest_avg_salary = df.groupby('company_location')['salary_in_usd'].mean().sort_values()
highest_lowest_avg_salary

company_location
VNM      4000.000000
IRN      4000.000000
KEN      9272.000000
PAK     13333.333333
UKR     13400.000000
MDA     18000.000000
ASM     18053.000000
BRA     18602.666667
HND     20000.000000
TUR     20096.666667
COL     21844.000000
MLT     28369.000000
IND     28581.750000
NGA     30000.000000
MEX     32123.333333
EST     32974.000000
HUN     35735.000000
ITA     36366.500000
MYS     40000.000000
CHL     40038.000000
LUX     43942.666667
HRV     45618.000000
PRT     47793.750000
CZE     50937.000000
GRC     52293.090909
ESP     53060.142857
DNK     54386.333333
NLD     54945.750000
ROU     60000.000000
SVN     63831.000000
FRA     63970.666667
CHE     64114.000000
POL     66082.500000
IRL     71444.000000
CHN     71665.500000
AUT     72920.750000
GBR     81583.042553
DEU     81887.214286
BEL     85699.000000
SGP     89294.000000
CAN     99823.733333
IRQ    100000.000000
DZA    100000.000000
ARE    100000.000000
AUS    108042.666667
JPN    114127.333333
ISR    119059.000

In [21]:
fig = px.choropleth(
    location_salary,
    locations='Location',  
    locationmode='ISO-3',  
    color='Average Salary',  
    hover_name='Location',  
    title='Average Salary by Company Location',
    color_continuous_scale=px.colors.sequential.Plasma,
    template='plotly_dark'
)

fig.show()

This map visualizes the average salary by company location, using average salary as the color parameter. Brighter yellow signifies the highest average salaries (150k), whereas darker blue indicates lower average salaries (50k).<br><br>
**Key Insights**
This chart illustrates how the average salary varies across geographic regions.

- **High Salary Region:** Russia stands out with the brightest yellow, indicating the highest average salary. The United States and New Zealand also display relatively high salaries, which could indicate strong demand for skilled professionals in highly demanding roles.

- **Low Salary Region:** Vietnam and Iran show the darkest blue, suggesting lower average salaries compared to other regions. This could be due to various factors such as economic conditions, industry maturity, and cost of living.
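A country's average can rest on very few records, so it may help to look at the number of rows behind each mean before ranking locations. The sketch below assumes a hypothetical `toy` frame with made-up location codes, and the `min_records` threshold is an illustrative choice, not something used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with made-up location codes.
toy = pd.DataFrame({
    'company_location': ['AAA', 'AAA', 'AAA', 'BBB', 'CCC', 'CCC'],
    'salary_in_usd':    [100_000, 120_000, 140_000, 300_000, 20_000, 40_000],
})

# Average salary and number of records per location, sorted by average.
summary = (
    toy.groupby('company_location')['salary_in_usd']
       .agg(avg_salary='mean', n_records='count')
       .sort_values('avg_salary')
)

# Illustrative threshold: keep only locations with enough records.
min_records = 2
reliable = summary[summary['n_records'] >= min_records]

print(summary)
print(reliable)
```

In the toy data, the top-ranked location is backed by a single record and drops out once the threshold is applied.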

**5. What is the distribution of salaries across different job titles over the years?**

In [22]:
import plotly.express as px

avg_salary = df.groupby(['work_year', 'job_title'])['salary_in_usd'].mean().reset_index()


fig = px.line(
    avg_salary,
    x='work_year',
    y='salary_in_usd',
    color='job_title',
    markers=True,
    title='Salary Trends by Job Title Over the Years',
    labels={'work_year': 'Year', 'salary_in_usd': 'Average Salary (USD)', 'job_title': 'Job Title'},
    template='plotly_dark'
)

fig.show()



The line chart above highlights the average salary across various job titles from 2020 to 2022. It represents the trend more clearly than a scatter plot would, although many lines cross each other because there are many job titles.<br><br>
**Key insights:**
- The line chart shows each job title's overall trend in average salary over the years. Between 2021 and 2022, there is a significant increase for Applied Data Scientist (approx. 82k to 238k) and a significant decrease for Financial Data Analyst (450k to 100k), reflecting increasing demand for the Applied Data Scientist role and less demand for the Financial Data Analyst role.

- The chart also illustrates the salary stability of some roles: Data Scientist and Data Engineer display consistent salary growth with minimal fluctuations.

- There is also a newly emerging role, Data Analytics Lead, which potentially indicates a new career path with a significant salary (405k).

- Practically, this data can be valuable for job seekers who want a job title with a consistent salary for their future career. For example, Data Scientist has a relatively flat trend line, which indicates consistent demand and salary. For employers planning recruitment, it helps them make better decisions about adjusting salaries to attract top talent.

Overall, this chart supports career planning, recruitment strategies, and overall market trend analysis.
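Since the chart becomes crowded with many job titles, one option is to restrict it to the most frequent titles before averaging. This is a sketch under assumptions: the `toy` frame is hypothetical data, and `top_n` is an illustrative cutoff.

```python
import pandas as pd

# Hypothetical toy data, not taken from ds_salaries.csv.
toy = pd.DataFrame({
    'work_year': [2020, 2021, 2021, 2022, 2020, 2021, 2022, 2021],
    'job_title': ['Data Scientist', 'Data Scientist', 'Data Scientist',
                  'Data Scientist', 'Data Engineer', 'Data Engineer',
                  'Data Engineer', 'Rare Title'],
    'salary_in_usd': [80_000, 90_000, 110_000, 120_000,
                      70_000, 85_000, 100_000, 400_000],
})

# Keep only the top_n most frequent job titles (illustrative cutoff).
top_n = 2
top_titles = toy['job_title'].value_counts().nlargest(top_n).index

# Average salary per year and title, restricted to the frequent titles.
avg_salary_top = (
    toy[toy['job_title'].isin(top_titles)]
    .groupby(['work_year', 'job_title'])['salary_in_usd']
    .mean()
    .reset_index()
)

print(avg_salary_top)
```

The resulting frame has the same columns as `avg_salary`, so it could be passed to `px.line` with the same arguments.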