## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Loading

In [4]:
salaries = pd.read_csv('data/ds_salaries.csv').drop(['Unnamed: 0'], axis=1)
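
The `Unnamed: 0` column dropped above typically appears when a CSV was saved together with its row index. As a minimal sketch of an alternative, assuming the first column of the file really is that saved index, `index_col=0` can be passed to `pd.read_csv` instead of dropping the column afterwards. The CSV text below is a made-up two-row stand-in, not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose first column is an unnamed saved index.
csv_text = ",work_year,salary_in_usd\n0,2020,79833\n1,2020,260000\n"

# Approach used in the notebook: read everything, then drop the index column.
dropped = pd.read_csv(io.StringIO(csv_text)).drop(columns=["Unnamed: 0"])

# Alternative: tell read_csv that the first column is the row index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(dropped.columns.tolist())
print(indexed.columns.tolist())
```

Both frames end up with the same data columns; the choice is mostly a matter of taste.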

In [5]:
salaries.head()

| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | MI | FT | Data Scientist | 70000 | EUR | 79833 | DE | 0 | DE | L |
| 1 | 2020 | SE | FT | Machine Learning Scientist | 260000 | USD | 260000 | JP | 0 | JP | S |
| 2 | 2020 | SE | FT | Big Data Engineer | 85000 | GBP | 109024 | GB | 50 | GB | M |
| 3 | 2020 | MI | FT | Product Data Analyst | 20000 | USD | 20000 | HN | 0 | HN | S |
| 4 | 2020 | SE | FT | Machine Learning Engineer | 150000 | USD | 150000 | US | 50 | US | L |


### Data Characteristics

In [7]:
print(f'Number of rows: {salaries.shape[0]}')
print(f'Number of columns: {salaries.shape[1]}')

Number of rows: 607
Number of columns: 11


In [8]:
salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           607 non-null    int64 
 1   experience_level    607 non-null    object
 2   employment_type     607 non-null    object
 3   job_title           607 non-null    object
 4   salary              607 non-null    int64 
 5   salary_currency     607 non-null    object
 6   salary_in_usd       607 non-null    int64 
 7   employee_residence  607 non-null    object
 8   remote_ratio        607 non-null    int64 
 9   company_location    607 non-null    object
 10  company_size        607 non-null    object
dtypes: int64(4), object(7)
memory usage: 52.3+ KB


From the above information, we can see that the data is already fairly clean and **doesn't have any missing values**: every column has 607 non-null entries. The data also has:
- 4 integer columns
- 7 object columns
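
The same two facts can also be checked directly instead of being read off the `info()` output. The snippet below is a small sketch on a made-up toy frame (not the real `salaries` data) that counts missing values per column and tallies columns by dtype.

```python
import pandas as pd

# Toy frame standing in for `salaries`; the values are invented for illustration.
toy = pd.DataFrame({
    "work_year": [2020, 2021, 2022],
    "job_title": ["Data Scientist", "Data Engineer", None],
    "salary_in_usd": [79833, 109024, 150000],
})

# Missing values per column; all zeros would mean nothing is missing.
missing_per_column = toy.isna().sum()

# Number of columns per dtype, the same tally info() prints near the end.
dtype_counts = toy.dtypes.value_counts()

print(missing_per_column)
print(dtype_counts)
```

On the real data, `salaries.isna().sum()` would be expected to show zero for every column, in line with the `info()` output above.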

In [9]:
salaries.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| work_year | 607.0 | 2021.405272 | 0.692133 | 2020.0 | 2021.0 | 2022.0 | 2022.0 | 2022.0 |
| salary | 607.0 | 324000.062603 | 1544357.0 | 4000.0 | 70000.0 | 115000.0 | 165000.0 | 30400000.0 |
| salary_in_usd | 607.0 | 112297.869852 | 70957.26 | 2859.0 | 62726.0 | 101570.0 | 150000.0 | 600000.0 |
| remote_ratio | 607.0 | 70.92257 | 40.70913 | 0.0 | 50.0 | 100.0 | 100.0 | 100.0 |


The dataset contains salary information from **2020** to **2022**. Based on the `salary_in_usd` column, the average salary is around **$112,298**, the lowest is **$2,859**, and the highest is **$600,000** per year.

## Data Cleaning

In this section, we will clean the data. Because the data doesn't have any missing values, we just need to check whether there are any duplicate rows.

In [10]:
salaries.duplicated().any()

True

Ohh, we found some duplicated rows. Let's check how many there are.

In [11]:
salaries.duplicated().sum()

42

That's quite a lot of duplicates. Let's look at the duplicated rows in our dataset, showing every copy with `keep=False`.

In [16]:
salaries[salaries.duplicated(keep=False)].sort_values(by='salary', ascending=False)

| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 592 | 2022 | SE | FT | Data Scientist | 230000 | USD | 230000 | US | 100 | US | M |
| 486 | 2022 | SE | FT | Data Scientist | 230000 | USD | 230000 | US | 100 | US | M |
| 596 | 2022 | SE | FT | Data Scientist | 210000 | USD | 210000 | US | 100 | US | M |
| 576 | 2022 | SE | FT | Data Scientist | 210000 | USD | 210000 | US | 100 | US | M |
| 574 | 2022 | SE | FT | Data Scientist | 210000 | USD | 210000 | US | 100 | US | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 443 | 2022 | MI | FT | Data Engineer | 60000 | GBP | 78526 | GB | 100 | GB | M |
| 367 | 2022 | MI | FT | Data Analyst | 58000 | USD | 58000 | US | 0 | US | S |
| 406 | 2022 | MI | FT | Data Analyst | 58000 | USD | 58000 | US | 0 | US | S |
| 373 | 2022 | MI | FT | ETL Developer | 50000 | EUR | 54957 | GR | 0 | GR | M |


Now that we have seen the duplicated rows, let's remove them and keep only the first occurrence of each.

In [17]:
salaries = salaries.drop_duplicates(keep='first')

In [18]:
salaries.shape

(565, 11)
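
The row counts above follow from how the `keep` argument works: with the default `keep='first'`, `duplicated()` flags only the extra copies, while `keep=False` flags every copy, including the first. The sketch below illustrates this on a made-up four-row frame (not the real data), which is also why 607 rows minus 42 flagged duplicates leaves 565.

```python
import pandas as pd

# Toy frame: one row appears three times, another appears once.
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Scientist", "Data Analyst"],
    "salary_in_usd": [210000, 210000, 210000, 58000],
})

# Default keep='first': only the 2nd and 3rd copies are flagged.
extra_copies = int(toy.duplicated().sum())

# keep=False: all three copies are flagged, which is handy for inspection.
all_copies = int(toy.duplicated(keep=False).sum())

# Dropping with keep='first' removes exactly the rows flagged by the default call.
deduplicated = toy.drop_duplicates(keep="first")

print(extra_copies, all_copies, len(deduplicated))
```

Here the frame shrinks from 4 rows to 2, which matches `len(toy) - extra_copies`.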

Now the dataset is pretty much clean and has 565 rows after removing the 42 duplicated rows.
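
One thing to keep in mind is that `drop_duplicates` keeps the original row labels, so the index has gaps where rows were removed. If a contiguous index is preferred, an optional follow-up (not part of the notebook itself) is `reset_index(drop=True)`. The sketch below uses a made-up toy frame to show the effect.

```python
import pandas as pd

# Toy frame with two repeated rows (labels 2 and 4 repeat labels 0 and 1).
toy = pd.DataFrame({
    "job_title": ["A", "B", "A", "C", "B"],
    "salary_in_usd": [1, 2, 1, 3, 2],
})

# After dropping duplicates the surviving labels are no longer contiguous.
cleaned = toy.drop_duplicates(keep="first")
labels_before_reset = cleaned.index.tolist()

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
cleaned = cleaned.reset_index(drop=True)
labels_after_reset = cleaned.index.tolist()

print(labels_before_reset)
print(labels_after_reset)
```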