# Statistical analysis of salaries in data science related jobs

The goal of this project is to identify the conditions under which a data analyst is better paid, based on the dataset available on [Kaggle](https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries), which covers a sample of data analysts in different parts of the world.
Some of the questions that guide this project are:

- What is the salary a data analyst can aspire to?
- In which countries are the best salaries offered?
- Have salaries increased over time?
- Does the level of experience influence the salary?
- Does the size of the company influence the salary it can offer a data analyst?
- What type of contract (part-time, full-time, etc.) offers the best salaries? What type of contract will be the most suitable?

Part of this notebook is based on the [Titanic Data Science Solutions notebook on Kaggle](https://www.kaggle.com/code/startupsci/titanic-data-science-solutions).

## Database exploration
### Accessing the database and setting up the required libraries

In [3]:
# data analysis and wrangling
import pandas as pd
# import numpy as np
# import random as rnd

# visualization
# import seaborn as sns
# import matplotlib.pyplot as plt
# %matplotlib inline

# statistics

In [4]:
df_salaries = pd.read_csv('ds_salaries.csv', index_col=0)

### Analyze dataset features
The next code block lists all the features available in the dataset, which are described on the [Kaggle page](https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries).

In [9]:
print(df_salaries.columns.values)

['work_year' 'experience_level' 'employment_type' 'job_title' 'salary'
 'salary_currency' 'salary_in_usd' 'employee_residence' 'remote_ratio'
 'company_location' 'company_size']


#### Which features are categorical?

Categorical values classify the samples into sets of similar samples. Within categorical features, are the values nominal, ordinal, ratio, or interval based? Among other things, this helps us select the appropriate plots for visualization.

- Nominal: employment_type, job_title, salary_currency, employee_residence and company_location.
- Ordinal: experience_level and company_size.
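As a minimal sketch of how the ordinal features could carry their order explicitly, the block below uses pandas ordered categoricals. The toy rows are copied from the first five rows of the dataset, and the orderings `EN < MI < SE < EX` and `S < M < L` are assumptions about what the codes mean, not something this notebook verifies.

```python
import pandas as pd

# Toy rows copied from the first five rows of the dataset (not the full data).
sample = pd.DataFrame({
    "experience_level": ["MI", "SE", "SE", "MI", "SE"],
    "company_size": ["L", "S", "M", "S", "L"],
})

# Assumed orderings (lowest to highest); adjust if the codes mean something else.
experience_order = ["EN", "MI", "SE", "EX"]
size_order = ["S", "M", "L"]

sample["experience_level"] = pd.Categorical(
    sample["experience_level"], categories=experience_order, ordered=True
)
sample["company_size"] = pd.Categorical(
    sample["company_size"], categories=size_order, ordered=True
)

# Ordered categoricals support comparisons and sort in the declared order.
is_senior_or_above = sample["experience_level"] >= "SE"
sorted_sizes = sample["company_size"].sort_values().tolist()

print(is_senior_or_above.tolist())
print(sorted_sizes)
```

With an explicit order, plots and group-by summaries list the levels from lowest to highest instead of alphabetically.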

#### Which features are numerical?

Numerical values change from sample to sample. Within numerical features, are the values discrete, continuous, or timeseries based? Among other things, this helps us select the appropriate plots for visualization.

- Continuous: salary, salary_in_usd and remote_ratio (all three are stored as integers, and remote_ratio can easily be converted to an ordinal categorical feature because it takes only 3 distinct values).
- Timeseries: work_year.
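The conversion of `remote_ratio` mentioned above could look like the following sketch. The series holds toy values for illustration only; in the notebook the real column would be `df_salaries["remote_ratio"]`, and treating 0, 50 and 100 as the complete set of levels is an assumption based on the summary statistics.

```python
import pandas as pd

# Toy values for illustration; not the real column.
remote_ratio = pd.Series([0, 0, 50, 0, 50, 100], name="remote_ratio")

# Inspect which distinct values actually occur.
distinct_values = sorted(remote_ratio.unique().tolist())

# Assumed complete set of levels, ordered from least to most remote.
remote_dtype = pd.CategoricalDtype(categories=[0, 50, 100], ordered=True)
remote_as_ordinal = remote_ratio.astype(remote_dtype)

print(distinct_values)
print(remote_as_ordinal.cat.codes.tolist())
```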

In [11]:
# preview of the data

df_salaries.head()

|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | MI | FT | Data Scientist | 70000 | EUR | 79833 | DE | 0 | DE | L |
| 1 | 2020 | SE | FT | Machine Learning Scientist | 260000 | USD | 260000 | JP | 0 | JP | S |
| 2 | 2020 | SE | FT | Big Data Engineer | 85000 | GBP | 109024 | GB | 50 | GB | M |
| 3 | 2020 | MI | FT | Product Data Analyst | 20000 | USD | 20000 | HN | 0 | HN | S |
| 4 | 2020 | SE | FT | Machine Learning Engineer | 150000 | USD | 150000 | US | 50 | US | L |


#### Extra observations
The dataset has no missing values in any feature, so we do not need to fill in empty values. There are 4 features of integer data type and 7 of string (object) data type.
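A quick way to confirm both observations is sketched below on a toy frame built from three of the previewed rows; on the real data the same calls would be applied to `df_salaries`.

```python
import pandas as pd

# Toy frame built from three previewed rows (not the full dataset).
sample = pd.DataFrame({
    "work_year": [2020, 2020, 2020],
    "job_title": ["Data Scientist", "Machine Learning Scientist", "Big Data Engineer"],
    "salary_in_usd": [79833, 260000, 109024],
})

# Number of missing values per column; all zeros means nothing to fill.
missing_per_column = sample.isna().sum()
has_missing = bool(missing_per_column.any())

# How many columns there are of each data type.
dtype_counts = sample.dtypes.value_counts()

print(missing_per_column)
print(has_missing)
print(dtype_counts)
```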

We might later need to convert the string features (categorical data) to numerical data types.
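One possible way to do that conversion is sketched here: an integer mapping for an ordinal feature and one-hot columns for a nominal one. The toy rows come from the first five rows of the dataset, the `S < M < L` mapping is an assumption, and the `currency` prefix is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Toy rows copied from the first five rows of the dataset.
sample = pd.DataFrame({
    "salary_currency": ["EUR", "USD", "GBP", "USD", "USD"],
    "company_size": ["L", "S", "M", "S", "L"],
})

# Ordinal feature: map each level to an integer that preserves the assumed order.
size_codes = sample["company_size"].map({"S": 0, "M": 1, "L": 2})

# Nominal feature: one indicator column per category, with no implied order.
currency_dummies = pd.get_dummies(
    sample["salary_currency"], prefix="currency", dtype=int
)

print(size_codes.tolist())
print(currency_dummies)
```

An integer mapping suits ordinal features because the numbers keep the ranking, whereas one-hot columns avoid suggesting a ranking among nominal categories.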

In [12]:
df_salaries.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 607 entries, 0 to 606
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           607 non-null    int64 
 1   experience_level    607 non-null    object
 2   employment_type     607 non-null    object
 3   job_title           607 non-null    object
 4   salary              607 non-null    int64 
 5   salary_currency     607 non-null    object
 6   salary_in_usd       607 non-null    int64 
 7   employee_residence  607 non-null    object
 8   remote_ratio        607 non-null    int64 
 9   company_location    607 non-null    object
 10  company_size        607 non-null    object
dtypes: int64(4), object(7)
memory usage: 56.9+ KB


### Summary statistics
#### Describe the numerical features

In [13]:
df_salaries.describe()

|   | work_year | salary | salary_in_usd | remote_ratio |
|---|---|---|---|---|
| count | 607.0 | 607.0 | 607.0 | 607.0 |
| mean | 2021.405272 | 324000.1 | 112297.869852 | 70.92257 |
| std | 0.692133 | 1544357.0 | 70957.259411 | 40.70913 |
| min | 2020.0 | 4000.0 | 2859.0 | 0.0 |
| 25% | 2021.0 | 70000.0 | 62726.0 | 50.0 |
| 50% | 2022.0 | 115000.0 | 101570.0 | 100.0 |
| 75% | 2022.0 | 165000.0 | 150000.0 | 100.0 |
| max | 2022.0 | 30400000.0 | 600000.0 | 100.0 |


#### Describe the categorical features

In [14]:
df_salaries.describe(include=['O'])

|   | experience_level | employment_type | job_title | salary_currency | employee_residence | company_location | company_size |
|---|---|---|---|---|---|---|---|
| count | 607 | 607 | 607 | 607 | 607 | 607 | 607 |
| unique | 4 | 4 | 50 | 17 | 57 | 50 | 3 |
| top | SE | FT | Data Scientist | USD | US | US | M |
| freq | 280 | 588 | 143 | 398 | 332 | 355 | 326 |
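The `top` and `freq` rows report the most frequent category and how often it occurs. For example, `FT` accounts for 588 of the 607 rows, about 97% of the sample. The sketch below shows how those two rows relate to `value_counts()`, using the five `experience_level` values from the data preview as toy input.

```python
import pandas as pd

# Toy input: experience_level for the five previewed rows.
experience_level = pd.Series(["MI", "SE", "SE", "MI", "SE"], name="experience_level")

# value_counts() gives the frequency of every category.
counts = experience_level.value_counts()
top = counts.idxmax()       # most frequent category
freq = int(counts.max())    # number of rows in that category
share = freq / len(experience_level)

# describe() on a string column reports the same top and freq.
summary = experience_level.describe()

print(top, freq, share)
print(summary)
```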
