## Cleaning

In [1]:
# Data
import pandas as pd
import numpy as np
import country_converter as coco

In [2]:
df = pd.read_csv('Data/data_science_jobs_dataset.csv')

In [3]:
df.sample(3)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
320,2022,SE,FT,Data Scientist,123000,USD,123000,US,100,US,M
357,2022,SE,FT,Data Engineer,128875,USD,128875,US,100,US,M
537,2021,MI,FT,Data Scientist,82500,USD,82500,US,100,US,S


### Abbreviations

For clarity, we replace each abbreviation with its full meaning:
- Experience level:
    - EN: Junior
    - MI: Intermediate
    - SE: Senior
    - EX: Executive
- Employment:
    - FT: Full time
    - PT: Part time
    - CT: Contract
    - FL: Freelance
- Country:
    - Every ISO2 code is replaced with the short name of the country
- Remote:
    - 100: Fully remote
    - 50: Partially remote
    - 0: No remote work
- Company size:
    - S: Small (less than 50 employees)
    - M: Medium (between 50 and 250 employees)
    - L: Large (more than 250 employees)

In [4]:
experience_rename = {"EN": "Junior", "MI": "Intermediate", "SE": "Senior", "EX": "Executive"}
employment_rename = {"FT": "Full time", "PT": "Part time", "CT": "Contract", "FL": "Freelance"}
country_rename = coco.CountryConverter().get_correspondence_dict('ISO2', 'name_short')
# GB's conversion is missing from the premade dictionary, so we add it manually
country_rename["GB"] = "United Kingdom"
remote_rename = {100: "Fully remote", 50: "Partially remote", 0: "No remote work"}
size_rename = {"S": "Small", "M": "Medium", "L": "Large"}

In [5]:
df = df.replace({"experience_level": experience_rename, "employment_type": employment_rename,
                 "employee_residence": country_rename, "remote_ratio": remote_rename,
                 "company_location": country_rename, "company_size": size_rename})
df.sample(3)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
164,2022,Senior,Full time,Data Analyst,99450,USD,99450,United States,Fully remote,United States,Medium
388,2022,Intermediate,Full time,Data Analyst,75000,USD,75000,Canada,No remote work,Canada,Medium
222,2022,Senior,Full time,Data Engineer,155000,USD,155000,United States,Fully remote,United States,Medium
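
As a minimal sketch of how the per-column renaming above behaves, the example below applies a nested dictionary with `DataFrame.replace` to a small toy frame. The rows and the code "XX" are hypothetical and not taken from the dataset; they illustrate that values absent from a mapping pass through unchanged, so it can be worth checking for leftovers.

```python
import pandas as pd

# Hypothetical toy rows; "XX" is a deliberately unmapped code
toy = pd.DataFrame({
    "experience_level": ["EN", "SE", "XX"],
    "company_size": ["S", "L", "M"],
})

experience_rename = {"EN": "Junior", "MI": "Intermediate", "SE": "Senior", "EX": "Executive"}
size_rename = {"S": "Small", "M": "Medium", "L": "Large"}

# Outer keys select the column, inner dictionaries map old values to new ones
renamed = toy.replace({"experience_level": experience_rename, "company_size": size_rename})

# Anything not among the target labels was not covered by the mapping
unmapped = sorted(set(renamed["experience_level"]) - set(experience_rename.values()))

print(renamed)
print("Unmapped experience levels:", unmapped)
```

If the same check on the real frame returns an empty list for every renamed column, all codes were covered by the dictionaries.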


### Dropping columns

To make salaries comparable across currencies, we keep only the salary in USD (salary_in_usd) and drop the salary and salary_currency columns.

In [6]:
df = df.drop(["salary", "salary_currency"], axis=1)
df.sample(3)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
652,2021,Senior,Full time,Big Data Architect,99703,Canada,Partially remote,Canada,Medium
647,2021,Senior,Full time,Data Engineer,96282,United Kingdom,Partially remote,United Kingdom,Large
564,2020,Junior,Full time,Big Data Engineer,70000,United States,Fully remote,United States,Large


In [7]:
df.to_csv("Data/cleaned_data_science_jobs_dataset.csv", index=False)
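
The export can be sanity-checked with a round trip: write the frame, read it back, and compare. The sketch below does this on a hypothetical two-row frame in a temporary directory (the file name `toy_cleaned.csv` is an assumption for illustration), so it does not touch the real output file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Engineer"],
    "salary_in_usd": [123000, 128875],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "Data"
    out_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing folders
    out_path = out_dir / "toy_cleaned.csv"

    # index=False keeps the row index out of the file
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

columns_match = list(reloaded.columns) == list(toy.columns)
values_match = reloaded["salary_in_usd"].tolist() == toy["salary_in_usd"].tolist()
print(columns_match, values_match)
```

Writing with `index=False` means the reloaded file has exactly the columns of the original frame, with no extra index column.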