In [None]:
# Data
import pandas as pd
import numpy as np
import country_converter as coco
import regex as re

# Visualization and settings
from matplotlib import pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import seaborn as sns
sns.set_context('poster')
sns.set(rc={'figure.figsize': (16., 9.)})
sns.set_style('whitegrid')

import plotly.express as px

In [None]:
df = pd.read_csv('Data/data_science_jobs_dataset.csv')

In [None]:
df.sample(3)

## Cleaning

### Abbreviations

We'll replace the abbreviations with their full meanings for extra clarity:
- Experience level:
    - EN: Junior
    - MI: Intermediate
    - SE: Senior
    - EX: Executive
- Employment:
    - FT: Full time
    - PT: Part time
    - CT: Contract
    - FL: Freelance
- Country:
    - Every ISO2 code is replaced with the short name of the country
- Remote:
    - 100: Fully remote
    - 50: Partially remote
    - 0: No remote work
- Company size:
    - S: Small (less than 50 employees)
    - M: Medium (between 50 and 150 employees)
    - L: Large (more than 150 employees)

In [None]:
experience_rename = {"EN":"Junior", "MI":"Intermediate", "SE":"Senior", "EX":"Executive"}
employment_rename = {"FT":"Full time", "PT":"Part time", "CT":"Contract", "FL":"Freelance"}
country_rename = coco.CountryConverter().get_correspondence_dict('ISO2', 'name_short')
country_rename["GB"] = "United Kingdom"  # GB's conversion is missing from the premade dictionary, so we add it manually
remote_rename = {100:"Fully remote", 50:"Partially remote", 0:"No remote work"}
size_rename = {"S":"Small", "M":"Medium", "L":"Large"}

In [None]:
df = df.replace({"experience_level": experience_rename, "employment_type": employment_rename,
                 "employee_residence": country_rename, "remote_ratio": remote_rename,
                 "company_location": country_rename, "company_size": size_rename})
df.sample(3)
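
`DataFrame.replace` leaves any value without a matching dictionary key unchanged, so a code that is missing from a mapping would pass through silently. A minimal sketch of a check that could be run on the raw codes before the replacement is shown here. The helper `unmapped_values` and the toy frame `raw` (including the made-up code `XX`) are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy frame standing in for the raw dataset; the values are illustrative only
raw = pd.DataFrame({
    "experience_level": ["EN", "SE", "MI", "XX"],  # "XX" is a made-up unknown code
    "company_size": ["S", "M", "L", "M"],
})

experience_rename = {"EN": "Junior", "MI": "Intermediate", "SE": "Senior", "EX": "Executive"}
size_rename = {"S": "Small", "M": "Medium", "L": "Large"}


def unmapped_values(frame, column, mapping):
    """Return the sorted distinct values of `column` that have no key in `mapping`."""
    present = frame[column].dropna().unique()
    return sorted(set(present) - set(mapping))


# Collect the leftover codes per column; an empty list means every code is covered
leftover = {
    "experience_level": unmapped_values(raw, "experience_level", experience_rename),
    "company_size": unmapped_values(raw, "company_size", size_rename),
}
print(leftover)
```

An empty list for every column suggests the mappings cover all the codes present in the data.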

### Dropping columns

To make salaries comparable, we are only interested in the salary in USD, so we drop the salary and salary_currency columns.

In [None]:
df = df.drop(["salary", "salary_currency"], axis=1)
df.sample(3)

In [None]:
df.to_csv("Data/cleaned_data_science_jobs_dataset.csv", index=False)

# EDA

## Feature analysis univariate

### Work year

![](Images/Work_year.png)

Every year there is a noticeable increase in the number of records in our dataset. More available data-related jobs worldwide could explain a small difference, but the reason for an increase this big is that the data is entered voluntarily and the site that collects it has been gaining popularity over the last 2 years.

### Experience level

![](Images/Experience_level.png)

More than half of the positions are senior, approximately one in three are intermediate, one in seven are junior, and only one in 25 are executive-level positions.
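
Shares like these can be read directly from a normalized value count. The following is a minimal sketch on a toy frame. The four rows are illustrative only and do not reproduce the real proportions.

```python
import pandas as pd

# Illustrative values only, not the real dataset
toy = pd.DataFrame(
    {"experience_level": ["Senior", "Senior", "Intermediate", "Junior"]}
)

# normalize=True returns each category's fraction of the rows instead of raw counts
shares = toy["experience_level"].value_counts(normalize=True)
print(shares.round(2))
```

Applied to the cleaned dataframe, the same call would give the fraction of positions at each experience level.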

### Employment type

![](Images/Employment_type.png)

Almost every single job listed is a full time job.

### Job title

![](Images/Job_title.png)

### Salary (USD)

### Employee residence

### Remote ratio

### Company location

### Company size

## Feature analysis multivariate

### Job title by experience level

### Remote ratio by work year

### Remote ratio by experience level

Compared to remote ratio overall
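
One possible way to make this comparison is a row-normalized crosstab with the overall distribution subtracted from each row. This is a sketch on a toy frame with illustrative values, not the analysis itself. The names `by_level`, `overall` and `diff` are hypothetical.

```python
import pandas as pd

# Illustrative values only, not the real dataset
toy = pd.DataFrame({
    "experience_level": ["Senior", "Senior", "Junior", "Junior"],
    "remote_ratio": ["Fully remote", "No remote work", "Fully remote", "Fully remote"],
})

# Share of each remote category within every experience level (each row sums to 1)
by_level = pd.crosstab(toy["experience_level"], toy["remote_ratio"], normalize="index")

# Share of each remote category across the whole frame
overall = toy["remote_ratio"].value_counts(normalize=True)

# Positive values mean the category is over-represented at that level
diff = by_level.sub(overall, axis="columns")
print(diff)
```

The same pattern should carry over to the other "by" comparisons in this section by swapping the grouping column.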

### Remote ratio by employment type

Compared to remote ratio overall

### Remote ratio by job title

Compared to remote ratio overall

### Remote ratio by country

Compared to remote ratio overall

### Remote ratio by company size

Compared to remote ratio overall

### Remote ratio when employee residence = company location

### Employee residence by company location

This shows how many remote jobs are held by employees living in a different country from the company.

It is useful if you want to look for remote jobs, so you know where to look.
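
A minimal sketch of how cross-border jobs could be isolated is to compare the two country columns row by row and count the remaining pairs. The toy frame and its country values are illustrative only. The names `same_country`, `cross_border` and `pairs` are hypothetical.

```python
import pandas as pd

# Illustrative values only, not the real dataset
toy = pd.DataFrame({
    "employee_residence": ["Spain", "Spain", "India", "United States"],
    "company_location": ["Spain", "United States", "United States", "United States"],
    "remote_ratio": ["No remote work", "Fully remote", "Fully remote", "Partially remote"],
})

# True where the employee lives in the same country as the company
same_country = toy["employee_residence"] == toy["company_location"]

# Keep only the rows where residence and company location differ
cross_border = toy.loc[~same_country]

# Count how often each (company location, employee residence) pair occurs
pairs = (
    cross_border.groupby(["company_location", "employee_residence"])
    .size()
    .sort_values(ascending=False)
)
print(pairs)
```

Sorting the pair counts puts the company locations that most often hire from abroad at the top.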

### Experience level by company size

### Salary by work year

### Salary by experience level

### Salary by job title

### Salary by job title by experience level

### Salary by job title by company location

### Salary by remote ratio

### Salary by company location

### Salary by company size

# Prediction