# About Dataset
The Data Science Job Salaries dataset contains 11 columns:

- work_year: The year the salary was paid.
- experience_level: The experience level in the job during the year.
- employment_type: The type of employment for the role.
- job_title: The role worked in during the year.
- salary: The total gross salary amount paid.
- salary_currency: The currency of the salary paid as an ISO 4217 currency code.
- salary_in_usd: The salary in USD.
- employee_residence: Employee's primary country of residence during the work year, as an ISO 3166 country code.
- remote_ratio: The overall amount of work done remotely.
- company_location: The country of the employer's main office or contracting branch.
- company_size: The size category of the company (S, M, or L), based on the median number of people that worked for the company during the year.
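
As a quick sanity check, the column list above can be compared against a loaded frame. The sketch below is only illustrative: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the two sample rows are copied from the dataset preview rather than read from the CSV.

```python
import pandas as pd

# Column names as described in the list above
EXPECTED_COLUMNS = [
    "work_year", "experience_level", "employment_type", "job_title",
    "salary", "salary_currency", "salary_in_usd", "employee_residence",
    "remote_ratio", "company_location", "company_size",
]

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from frame."""
    return [name for name in expected if name not in frame.columns]

# Two rows copied from the dataset preview (not the full file)
rows = [
    [2023, "SE", "FT", "Principal Data Scientist", 80000, "EUR", 85847, "ES", 100, "ES", "L"],
    [2023, "MI", "CT", "ML Engineer", 30000, "USD", 30000, "US", 100, "US", "S"],
]
sample = pd.DataFrame(rows, columns=EXPECTED_COLUMNS)

print(missing_columns(sample))                          # []
print(missing_columns(sample.drop(columns=["salary"])))  # ['salary']
```

Running the same check on the real frame right after loading would flag a renamed or missing column before any analysis starts.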

# Setup and import

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [98]:
file_name = 'ds_salaries.csv'

try:
    df = pd.read_csv(file_name)  # when running locally
except FileNotFoundError:
    df = pd.read_csv('/kaggle/input/jobs-in-data/' + file_name)  # when running on Kaggle

salaries = df
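
An alternative to the try/except fallback is to decide on the path before reading. This is a minimal sketch, assuming the same Kaggle input directory as above; `resolve_csv_path` is a hypothetical helper, not part of pandas.

```python
import os

def resolve_csv_path(file_name, kaggle_dir="/kaggle/input/jobs-in-data/"):
    """Return file_name if it exists locally, otherwise the path under kaggle_dir."""
    if os.path.exists(file_name):
        return file_name
    return os.path.join(kaggle_dir, file_name)

# Prints the local name when the file is present, the Kaggle path otherwise
print(resolve_csv_path("ds_salaries.csv"))
```

The result can then be passed to `pd.read_csv`, which keeps unrelated read errors (for example a malformed file) from being silently retried against the second path.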

In [99]:
salaries

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M
...,...,...,...,...,...,...,...,...,...,...,...
3750,2020,SE,FT,Data Scientist,412000,USD,412000,US,100,US,L
3751,2021,MI,FT,Principal Data Scientist,151000,USD,151000,US,100,US,L
3752,2020,EN,FT,Data Scientist,105000,USD,105000,US,100,US,S
3753,2020,EN,CT,Business Data Analyst,100000,USD,100000,US,100,US,L


# Replacing acronyms

Personally, I prefer full names over acronyms in an EDA, so the coded values are replaced with readable labels.

In [100]:
# Replacing Acronyms for countries

country_iso3166 = pd.read_html("https://www.iban.com/country-codes")
country_iso3166 = country_iso3166[0].drop(['Alpha-3 code','Numeric'], axis = 'columns')

# The full names of the USA and Britain are long, so shorten them to a more readable form
country_iso3166['Country'] = country_iso3166['Country'].replace({
    'United States of America (the)': 'United States',
    'United Kingdom of Great Britain and Northern Ireland (the)': 'Britain',
})

# we create a dictionary to convert acronyms to country names
country_dict = country_iso3166.set_index('Alpha-2 code')['Country'].to_dict()

# finally replacing the values in the dataframe
salaries['company_location'] = salaries['company_location'].replace(country_dict)
salaries['employee_residence'] = salaries['employee_residence'].replace(country_dict)
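
The lookup step can also be tried offline. The sketch below assumes a small hand-written subset of the country table (`sample_country_dict` is hypothetical, as is the made-up code `XX`) and uses `map` with `fillna` so that codes missing from the dictionary stay visible instead of being lost.

```python
import pandas as pd

# Hypothetical hand-written subset of the ISO 3166 alpha-2 table
sample_country_dict = {"ES": "Spain", "US": "United States", "CA": "Canada"}

# "XX" is a made-up code used to show what happens to unknown values
codes = pd.Series(["ES", "US", "CA", "XX"], name="company_location")

# map() yields NaN for codes absent from the dictionary;
# fillna(codes) puts the original code back in those positions
names = codes.map(sample_country_dict).fillna(codes)

# Codes that had no entry in the dictionary
unmapped = sorted(set(codes) - set(sample_country_dict))

print(names.tolist())
print(unmapped)
```

Listing the unmapped codes is a cheap way to notice when the downloaded table and the dataset disagree.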

In [101]:
# Other replacements

# Experience level
salaries['experience_level'] = salaries['experience_level'].replace(
    {'EN': 'Entry-Level',
     'MI': 'Mid-Level'  ,
     'SE': 'Senior'     ,
     'EX': 'Executive'} 
)

# Company Size
salaries['company_size'] = salaries['company_size'].replace(
    {'S': 'Small'   ,
     'M': 'Medium'  ,
     'L': 'Large'   }
)
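
The employment_type column still holds codes such as FT and CT after this cell. A similar replacement could expand them; the labels below are an assumption about what the codes mean (only FT and CT appear in the preview), so they should be checked against the dataset description before use.

```python
import pandas as pd

# Assumed meanings of the employment_type codes (not confirmed by this notebook)
employment_labels = {
    "FT": "Full-Time",
    "PT": "Part-Time",
    "CT": "Contract",
    "FL": "Freelance",
}

# Values copied from the first rows of the dataset preview
employment_sample = pd.Series(["FT", "CT", "CT", "FT"], name="employment_type")

# Same replace() pattern as used for experience_level and company_size
readable = employment_sample.replace(employment_labels)
print(readable.tolist())
```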

In [103]:
salaries = salaries.drop(['salary', 'salary_currency'], axis = 'columns')
salaries

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,Senior,FT,Principal Data Scientist,85847,Spain,100,Spain,Large
1,2023,Mid-Level,CT,ML Engineer,30000,United States,100,United States,Small
2,2023,Mid-Level,CT,ML Engineer,25500,United States,100,United States,Small
3,2023,Senior,FT,Data Scientist,175000,Canada,100,Canada,Medium
4,2023,Senior,FT,Data Scientist,120000,Canada,100,Canada,Medium
...,...,...,...,...,...,...,...,...,...
3750,2020,Senior,FT,Data Scientist,412000,United States,100,United States,Large
3751,2021,Mid-Level,FT,Principal Data Scientist,151000,United States,100,United States,Large
3752,2020,Entry-Level,FT,Data Scientist,105000,United States,100,United States,Small
3753,2020,Entry-Level,CT,Business Data Analyst,100000,United States,100,United States,Large
