<a href="https://www.kaggle.com/code/pantanjali/data-science-job-analysis?scriptVersionId=124698723" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## 🚀 <span style="color:green">About Dataset</span>

**work_year** - The year in which the salary was paid.

**experience_level** - The experience level in the job during the year:

**EN** = Entry-level / Junior

**MI** = Mid-level / Intermediate

**SE** = Senior-level / Expert

**EX** = Executive-level / Director

**employment_type** - The type of employment for the role:

**PT** = Part-time

**FT** = Full-time

**CT** = Contract

**FL** = Freelance

**job_title** - The role worked in during the year.

**salary** - The total gross salary amount paid.

**salary_currency** - The currency of the salary paid as an ISO 4217 currency code.

**salary_in_usd** - The salary in USD (FX rate divided by avg. USD rate for the respective year via fxdata.foorilla.com).

**employee_residence** - Employee's primary country of residence during the work year, as an ISO 3166 country code (Alpha-2 code).

**remote_ratio** - The overall amount of work done remotely, possible values are as follows:

**0** = No remote work (less than 20%)

**50** = Partially remote

**100** = Fully remote (more than 80%)

**company_location** - The country of the employer's main office or contracting branch, as an ISO 3166 country code (Alpha-2 code).

**company_size** - The average number of people that worked for the company during the year:

**S** = less than 50 employees (small)

**M** = 50 to 250 employees (medium)

**L** = more than 250 employees (large)

## 📚 <span style="color:green">Importing libraries and dataset</span>


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

In [2]:
job_data_file_path= '../input/data-science-job-salaries/ds_salaries.csv'

In [3]:
job_data= pd.read_csv(job_data_file_path)

---
## 📚 <span style="color:green">Data wrangling</span>


In [4]:
job_data.info()
job_data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          607 non-null    int64 
 1   work_year           607 non-null    int64 
 2   experience_level    607 non-null    object
 3   employment_type     607 non-null    object
 4   job_title           607 non-null    object
 5   salary              607 non-null    int64 
 6   salary_currency     607 non-null    object
 7   salary_in_usd       607 non-null    int64 
 8   employee_residence  607 non-null    object
 9   remote_ratio        607 non-null    int64 
 10  company_location    607 non-null    object
 11  company_size        607 non-null    object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB


Unnamed: 0            0
work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [5]:
job_data['remote_ratio'] = job_data['remote_ratio'].replace([0,50,100],['Onsite','Hybrid','Remote'])  # assign back instead of inplace on a column
job_data.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,Onsite,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,Onsite,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,Hybrid,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,Onsite,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,Hybrid,US,L


In [6]:
job_data['experience_level'] = job_data['experience_level'].replace(['MI','SE','EN','EX'],['Intermediate','Expert','Junior','Director'])  # assign back instead of inplace on a column
job_data.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,Intermediate,FT,Data Scientist,70000,EUR,79833,DE,Onsite,DE,L
1,1,2020,Expert,FT,Machine Learning Scientist,260000,USD,260000,JP,Onsite,JP,S
2,2,2020,Expert,FT,Big Data Engineer,85000,GBP,109024,GB,Hybrid,GB,M
3,3,2020,Intermediate,FT,Product Data Analyst,20000,USD,20000,HN,Onsite,HN,S
4,4,2020,Expert,FT,Machine Learning Engineer,150000,USD,150000,US,Hybrid,US,L
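
The two relabelling steps above can also be written with explicit code-to-label dictionaries. The snippet below is a minimal sketch on a hypothetical toy frame (`toy`, `experience_labels`, and `remote_labels` are illustrative names, not part of the notebook); the labels follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, the rows are made up.
toy = pd.DataFrame({
    "experience_level": ["MI", "SE", "EN", "EX"],
    "remote_ratio": [0, 50, 100, 100],
})

# Code-to-label dictionaries (labels as used in this notebook).
experience_labels = {"EN": "Junior", "MI": "Intermediate", "SE": "Expert", "EX": "Director"}
remote_labels = {0: "Onsite", 50: "Hybrid", 100: "Remote"}

# map() returns a new Series, so the result is assigned back to the column.
toy["experience_level"] = toy["experience_level"].map(experience_labels)
toy["remote_ratio"] = toy["remote_ratio"].map(remote_labels)

print(toy)
```

With `map`, any code missing from the dictionary becomes `NaN`, which makes unexpected codes easy to spot.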


## 1. Checking the categorical data in the dataset

In [7]:
print("Categories in 'experience_level':  ", end=" " )
print(job_data['experience_level'].unique())

print("Categories in 'employment_type':  ",end=" ")
print(job_data['employment_type'].unique())

print("Categories in'job_title':",end=" " )
print(job_data['job_title'].unique())

print("Categories in 'salary_currency':     ",end=" " )
print(job_data['salary_currency'].unique())

print("Categories in 'remote_ratio':     ",end=" " )
print(job_data['remote_ratio'].unique())

print("Categories in 'company_size' :     ",end=" " )
print(job_data['company_size'].unique())


Categories in 'experience_level':   ['Intermediate' 'Expert' 'Junior' 'Director']
Categories in 'employment_type':   ['FT' 'CT' 'PT' 'FL']
Categories in'job_title': ['Data Scientist' 'Machine Learning Scientist' 'Big Data Engineer'
 'Product Data Analyst' 'Machine Learning Engineer' 'Data Analyst'
 'Lead Data Scientist' 'Business Data Analyst' 'Lead Data Engineer'
 'Lead Data Analyst' 'Data Engineer' 'Data Science Consultant'
 'BI Data Analyst' 'Director of Data Science' 'Research Scientist'
 'Machine Learning Manager' 'Data Engineering Manager'
 'Machine Learning Infrastructure Engineer' 'ML Engineer' 'AI Scientist'
 'Computer Vision Engineer' 'Principal Data Scientist'
 'Data Science Manager' 'Head of Data' '3D Computer Vision Researcher'
 'Data Analytics Engineer' 'Applied Data Scientist'
 'Marketing Data Analyst' 'Cloud Data Engineer' 'Financial Data Analyst'
 'Computer Vision Software Engineer' 'Director of Data Engineering'
 'Data Science Engineer' 'Principal Data Engineer'
 'Mac

## 2. Defining the numerical & categorical features

In [8]:
numeric_features = [feature for feature in job_data.columns if job_data[feature].dtype != 'O']
categorical_features = [feature for feature in job_data.columns if job_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 4 numerical features : ['Unnamed: 0', 'work_year', 'salary', 'salary_in_usd']

We have 8 categorical features : ['experience_level', 'employment_type', 'job_title', 'salary_currency', 'employee_residence', 'remote_ratio', 'company_location', 'company_size']
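
As an alternative to comparing each dtype with `'O'`, pandas can split the columns by dtype directly. This is a small sketch on a hypothetical toy frame (the rows are made up), not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy frame with two numeric and two text columns.
toy = pd.DataFrame({
    "work_year": [2020, 2021],
    "salary_in_usd": [79833, 150000],
    "job_title": ["Data Scientist", "Data Engineer"],
    "company_size": ["L", "M"],
})

# select_dtypes keeps the columns whose dtype matches the filter.
numeric_features = toy.select_dtypes(include="number").columns.tolist()
categorical_features = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```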


In [9]:
job_data.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,Intermediate,FT,Data Scientist,70000,EUR,79833,DE,Onsite,DE,L
1,1,2020,Expert,FT,Machine Learning Scientist,260000,USD,260000,JP,Onsite,JP,S
2,2,2020,Expert,FT,Big Data Engineer,85000,GBP,109024,GB,Hybrid,GB,M
3,3,2020,Intermediate,FT,Product Data Analyst,20000,USD,20000,HN,Onsite,HN,S
4,4,2020,Expert,FT,Machine Learning Engineer,150000,USD,150000,US,Hybrid,US,L


## 3. Dropping the 'Unnamed: 0' column

In [10]:
job_data = job_data.drop(['Unnamed: 0'], axis=1)  # assign the result: drop() returns a new DataFrame
job_data

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,Intermediate,FT,Data Scientist,70000,EUR,79833,DE,Onsite,DE,L
1,2020,Expert,FT,Machine Learning Scientist,260000,USD,260000,JP,Onsite,JP,S
2,2020,Expert,FT,Big Data Engineer,85000,GBP,109024,GB,Hybrid,GB,M
3,2020,Intermediate,FT,Product Data Analyst,20000,USD,20000,HN,Onsite,HN,S
4,2020,Expert,FT,Machine Learning Engineer,150000,USD,150000,US,Hybrid,US,L
...,...,...,...,...,...,...,...,...,...,...,...
602,2022,Expert,FT,Data Engineer,154000,USD,154000,US,Remote,US,M
603,2022,Expert,FT,Data Engineer,126000,USD,126000,US,Remote,US,M
604,2022,Expert,FT,Data Analyst,129000,USD,129000,US,Onsite,US,M
605,2022,Expert,FT,Data Analyst,150000,USD,150000,US,Remote,US,M
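
`DataFrame.drop` returns a new DataFrame by default and leaves the original unchanged, so the result has to be assigned for the column to stay dropped. A toy sketch of that behaviour (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical two-row frame with a leftover index column.
toy = pd.DataFrame({"Unnamed: 0": [0, 1], "work_year": [2020, 2021]})

# drop() builds a new frame; `toy` itself is not modified.
dropped = toy.drop(columns=["Unnamed: 0"])

print(list(toy.columns))      # the original still has the column
print(list(dropped.columns))  # the returned frame does not
```

Reading the file with `pd.read_csv(path, index_col=0)` is another option, since it treats the first CSV column as the index instead of creating an `Unnamed: 0` column.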


----
# 📊 <span style="color:green">Exploratory Data Analysis</span>



## 💡 <span style="color:red">Analysis-1</span>

In [11]:
px.histogram(job_data,x='employment_type',color='work_year',barmode='group',color_discrete_sequence=px.colors.sequential.thermal_r, template='plotly_dark',title='Count of employment type')

**📍 Observation: In 2022, most employees worked in full-time positions.**

--------------------------------------------------------------------------------------------------------------

## 💡 <span style="color:red">Analysis-2</span>


In [12]:
fig= px.histogram(job_data,x='employment_type',histnorm='percent',text_auto='.2f', template='plotly_dark',title='% of employment type')
fig.show()

**📍 Observation- 96.87% of employees work in full-time positions.**
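
A percentage like this can also be computed without a plot by normalising the value counts. The sketch below uses a hypothetical toy Series (the values are made up and do not reproduce the dataset's numbers).

```python
import pandas as pd

# Hypothetical sample: 8 full-time rows, 1 part-time, 1 contract.
employment_type = pd.Series(["FT"] * 8 + ["PT", "CT"], name="employment_type")

# normalize=True returns fractions that sum to 1; scale to percent and round.
share = (employment_type.value_counts(normalize=True) * 100).round(2)

print(share)
```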

-----

## 💡 <span style="color:red">Analysis-3</span>

In [13]:
exp_level= job_data['experience_level'].value_counts()
fig= px.treemap(exp_level,
               path=[exp_level.index],
               values=exp_level.values,
               title= 'Treemap of Experience level',
               color= exp_level.index,
               color_discrete_sequence=px.colors.sequential.GnBu_r,
               template='plotly_dark',
               width=1000,height=500)

percentage= (100*exp_level/exp_level.sum()).round(2)
fig.data[0].customdata=[percentage[label] for label in fig.data[0].labels]  # look up each tile's share by label instead of hard-coding
fig.data[0].texttemplate='%{label}<br>%{value}<br>%{customdata}%'
fig.show()

**📍Observation- The largest share of employees (about 46%) work at the senior/expert level.**

## 💡 <span style="color:red">Analysis-4</span>

In [14]:
fig= px.histogram(job_data,x='remote_ratio',color='work_year',barmode='group',color_discrete_sequence=px.colors.sequential.Magenta,template='plotly_dark',title='Variation of remote ratio with work year')
fig.show()

**📍Observation- Due to COVID-19, most employees worked from home (remote). Demand for data analysts and data scientists also increased for purposes such as medicine, start-ups, and environmental research, so the count of remote roles increased every year (2020, 2021, and 2022).**
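
Raw counts per year also grow when more rows are collected for that year, so it can help to look at the share of each work type within a year. A minimal sketch with `pd.crosstab` on hypothetical toy rows (not the dataset's values):

```python
import pandas as pd

# Hypothetical rows: work type per record for three years.
toy = pd.DataFrame({
    "work_year": [2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "remote_ratio": ["Onsite", "Remote", "Hybrid", "Remote", "Remote",
                     "Remote", "Remote", "Remote", "Onsite"],
})

# normalize="index" turns each year's row into fractions that sum to 1.
remote_share = pd.crosstab(toy["work_year"], toy["remote_ratio"], normalize="index") * 100

print(remote_share.round(2))
```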

---

## 💡 <span style="color:red">Analysis-5</span>

In [15]:
top10_job_title = job_data['job_title'].value_counts()[:10]
fig = px.bar(y=top10_job_title.values, 
             x=top10_job_title.index, 
             color = top10_job_title.index,
             color_discrete_sequence=px.colors.sequential.Magma_r,
             text=top10_job_title.values,
             title= 'Top 10 Job Titles',
             template= 'plotly_dark')
fig.update_layout(
    xaxis_title="Job Titles",
    yaxis_title="count",
    font = dict(size=14,family="Franklin Gothic"))
fig.show()

**📍Observation- The most common job titles are Data Scientist and Data Engineer.**

---

## 💡 <span style="color:red">Analysis-6</span>

In [16]:
top10_company_resd = job_data['company_location'].value_counts()[:10]
fig = px.bar(y=top10_company_resd.values, 
             x=top10_company_resd.index, 
             color = top10_company_resd.index,
             color_discrete_sequence=px.colors.sequential.amp,
             text=top10_company_resd.values,
             title= 'Top 10 company locations',
             template= 'plotly_dark')
fig.update_layout(
    xaxis_title="Company location",
    yaxis_title="count",
    font = dict(size=14,family="Franklin Gothic"))
fig.show()

**📍Observation- Most companies are located in the US (United States).**

---

## 💡 <span style="color:red">Analysis-7</span>

In [17]:
top10_employee_resd = job_data['employee_residence'].value_counts()[:10]
fig = px.bar(y=top10_employee_resd.values, 
             x=top10_employee_resd.index, 
             color = top10_employee_resd.index,
             color_discrete_sequence=px.colors.sequential.haline,
             text=top10_employee_resd.values,
             title= 'Top 10 employee residence',
             template= 'plotly_dark')
fig.update_layout(
    xaxis_title="Employee residence",
    yaxis_title="count",
    font = dict(size=14,family="Franklin Gothic"))
fig.show()

**📍Observation- Most employees reside in the US (United States).**

---

## 💡 <span style="color:red">Analysis-8</span>

In [18]:
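# Pass column names so every point keeps its own row's residence/location pair;
# sorting the two columns separately would pair values from different rows.
fig= px.scatter(job_data, x='employee_residence', y='company_location', color = 'remote_ratio',
           labels ={"employee_residence":'Employee Residence', "company_location":'Company Location', "remote_ratio":'Work Type'},
           color_discrete_sequence=px.colors.qualitative.Light24, template = 'plotly_dark',
           title = 'Company Location VS Employee Residence for type of work')
fig.update_xaxes(categoryorder='category ascending')  # keep the alphabetical axis order
fig.update_yaxes(categoryorder='category ascending')
fig.show()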


**📍Observation- Most of the remote employees work from different countries.**
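
One way to put a number on how residence and company location relate per work type is to flag rows where the two countries differ and average that flag within each work type. This is a sketch on hypothetical toy rows; `works_abroad` is an illustrative column name, not part of the dataset.

```python
import pandas as pd

# Hypothetical rows: residence, company location, and work type.
toy = pd.DataFrame({
    "employee_residence": ["US", "IN", "DE", "US", "GB", "PK"],
    "company_location":   ["US", "US", "DE", "US", "GB", "DE"],
    "remote_ratio": ["Remote", "Remote", "Onsite", "Hybrid", "Remote", "Remote"],
})

# True when the employee lives in a different country than the company.
toy["works_abroad"] = toy["employee_residence"] != toy["company_location"]

# The mean of a boolean column is the fraction of True values.
abroad_share = toy.groupby("remote_ratio")["works_abroad"].mean() * 100

print(abroad_share)
```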

---

## 💡 <span style="color:red">Analysis-9</span>

In [19]:
fig= px.histogram(job_data,x='job_title',color='company_size',barmode='group',color_discrete_sequence=px.colors.sequential.YlGnBu,template='plotly_dark',title='Variation of job title with company size')
fig.show()

**📍Observation- Medium-sized companies account for the most roles such as Data Scientist, Data Analyst, and Data Engineer.**

---

## 💡 <span style="color:red">Analysis-10</span>

In [20]:
# Pass column names so each salary stays paired with its own company size;
# sorting the two columns separately would pair values from different rows.
fig= px.scatter(job_data, x='salary_in_usd', y='company_size', color = 'remote_ratio',
           labels ={"salary_in_usd":'Salary', "company_size":'Company size', "remote_ratio":'Work Type'},
           category_orders={"company_size": ['S','M','L']},
           color_discrete_sequence=px.colors.qualitative.Vivid, template = 'plotly_dark',
           title = 'Company size vs salary with respect to work type')
fig.show()

#### **📍Observation-**

**1. Due to the COVID-19 pandemic, most employees started working from home (remote).**

**2. Salaries overlap widely across small, medium, and large companies, so company size alone does not separate higher-paid from lower-paid roles.**
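
The relationship between company size and salary can also be checked numerically with a grouped summary. The sketch below runs on hypothetical toy rows; the salaries are made up and imply nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows: two salaries per company size.
toy = pd.DataFrame({
    "company_size": ["S", "S", "M", "M", "L", "L"],
    "salary_in_usd": [40000, 60000, 90000, 110000, 100000, 140000],
})

# Count, median, and mean salary for each company size.
summary = toy.groupby("company_size")["salary_in_usd"].agg(["count", "median", "mean"])

# Order the rows from small to large instead of alphabetically.
summary = summary.reindex(["S", "M", "L"])

print(summary)
```

The median is less sensitive than the mean to a few very high salaries, which is why both are shown.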

# **Thank you! 💫**