# Exploratory Analytics

##### Introduction

> Introduction to the dataset

###### AI-ML_Salaries
###### A Data Science portfolio project: Data Science job salaries from 2020 to 2022

###### About Dataset
The salary data comes from [ai-jobs](https://salaries.ai-jobs.net/download/). Ai-jobs collects salary information anonymously from professionals all over the world in the AI/ML and Big Data space and makes it publicly available for anyone to use, share and play around with. The dataset is updated regularly as new submissions come in, usually on a weekly basis.
The primary goal is to provide better guidance on what is being paid globally, so that newcomers, experienced professionals, hiring managers, recruiters, startup founders and people planning a career switch can make better-informed decisions.

The dataset contains one table structured as follows:
- **work_year**: The year the salary was paid.
- **experience_level**: The experience level in the job during the year with the following possible values:
    - **EN**: Entry-level / Junior
    - **MI**: Mid-level / Intermediate
    - **SE**: Senior-level / Expert
    - **EX**: Executive-level / Director
- **employment_type**: The type of employment for the role:
    - **PT**: Part-time
    - **FT**: Full-time
    - **CT**: Contract
    - **FL**: Freelance
- **job_title**: The role worked in during the year.
- **salary**: The total gross salary amount paid.
- **salary_currency**: The currency of the salary paid as an ISO 4217 currency code.
- **salary_in_usd**: The salary in USD (FX rate divided by avg. USD rate for the respective year via fxdata.foorilla.com).
- **employee_residence**: Employee's primary country of residence during the work year as an ISO 3166 country code.
- **remote_ratio**: The overall amount of work done remotely. Possible values are as follows:
    - **0**: No remote work (less than 20%)
    - **50**: Partially remote
    - **100**: Fully remote (more than 80%)
- **company_location**: The country of the employer's main office or contracting branch as an ISO 3166 country code.
- **company_size**: The average number of people that worked for the company during the year:
    - **S**: less than 50 employees (small)
    - **M**: 50 to 250 employees (medium)
    - **L**: more than 250 employees (large)
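The coded columns described above can be checked and decoded with plain dictionaries. The following is a minimal sketch, not part of the original analysis: the toy rows, the helper name `find_unexpected_codes`, and the short company-size labels are assumptions made for illustration, and the deliberately invalid values (`XX`, `25`) are there only to show what the check reports.

```python
import pandas as pd

# Code-to-label lookups copied from the column descriptions above.
EXPERIENCE_LABELS = {
    "EN": "Entry-level / Junior",
    "MI": "Mid-level / Intermediate",
    "SE": "Senior-level / Expert",
    "EX": "Executive-level / Director",
}
EMPLOYMENT_LABELS = {"PT": "Part-time", "FT": "Full-time", "CT": "Contract", "FL": "Freelance"}
COMPANY_SIZE_LABELS = {"S": "small", "M": "medium", "L": "large"}
REMOTE_RATIOS = {0, 50, 100}

# Hypothetical rows for illustration only; not taken from the real dataset.
toy_df = pd.DataFrame({
    "experience_level": ["EN", "SE", "XX"],
    "employment_type": ["FT", "CT", "FT"],
    "remote_ratio": [0, 100, 25],
    "company_size": ["S", "L", "M"],
})


def find_unexpected_codes(df):
    """Return {column: sorted unexpected values} for the documented coded columns."""
    allowed = {
        "experience_level": set(EXPERIENCE_LABELS),
        "employment_type": set(EMPLOYMENT_LABELS),
        "company_size": set(COMPANY_SIZE_LABELS),
        "remote_ratio": REMOTE_RATIOS,
    }
    problems = {}
    for column, valid in allowed.items():
        # tolist() converts NumPy scalars to plain Python values before comparing.
        unexpected = set(df[column].unique().tolist()) - valid
        if unexpected:
            problems[column] = sorted(unexpected)
    return problems


unexpected = find_unexpected_codes(toy_df)
print(unexpected)

# Decode a coded column into a readable label; unknown codes become NaN.
toy_df["experience_label"] = toy_df["experience_level"].map(EXPERIENCE_LABELS)
print(toy_df[["experience_level", "experience_label"]])
```

An empty dictionary from the check would mean every value in those columns matches the documented codes.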

##### Preliminary Wrangling

In [1]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

> - Loading in dataset with pandas
> - Describing the dataset
> - Printing out some parts of the table

In [2]:
# Load the dataset with pandas; salaries_usable.csv is a working copy of salaries.csv
salaries_df = pd.read_csv('salaries_usable.csv')

# Shape of the dataset as (rows, columns)
shape = salaries_df.shape
rows, columns = shape

print("This dataset contains {} rows and {} columns, hence {}".format(rows, columns, shape))

# print out some part of dataset

# Head
print(salaries_df.head(5))

# 10 random samples
print(salaries_df.sample(n = 10))

# Tail
print(salaries_df.tail())

This dataset contains 1332 rows and 11 columns, hence (1332, 11)
   work_year experience_level employment_type                  job_title  \
0       2022               MI              FT  Machine Learning Engineer   
1       2022               MI              FT  Machine Learning Engineer   
2       2022               MI              FT             Data Scientist   
3       2022               MI              FT             Data Scientist   
4       2022               MI              FT             Data Scientist   

   salary salary_currency  salary_in_usd employee_residence  remote_ratio  \
0  130000             USD         130000                 US             0   
1   90000             USD          90000                 US             0   
2  120000             USD         120000                 US           100   
3  100000             USD         100000                 US           100   
4   85000             USD          85000                 US           100   

  company_locat

##### Assessment and Cleaning of the dataset

###### Assessment

In [3]:
# Info (info() prints its report directly and returns None, so no print() is needed)
salaries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1332 entries, 0 to 1331
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           1332 non-null   int64 
 1   experience_level    1332 non-null   object
 2   employment_type     1332 non-null   object
 3   job_title           1332 non-null   object
 4   salary              1332 non-null   int64 
 5   salary_currency     1332 non-null   object
 6   salary_in_usd       1332 non-null   int64 
 7   employee_residence  1332 non-null   object
 8   remote_ratio        1332 non-null   int64 
 9   company_location    1332 non-null   object
 10  company_size        1332 non-null   object
dtypes: int64(4), object(7)
memory usage: 114.6+ KB
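The `+` in `memory usage: 114.6+ KB` means pandas is reporting a lower bound, because the `object` columns hold Python strings whose size is not measured by default. The sketch below is illustrative only and is not a step from the original notebook: it uses a hypothetical toy column to show how `memory_usage(deep=True)` measures the full footprint, and how a column with only a few repeated codes shrinks when stored as `category`.

```python
import pandas as pd

# Hypothetical toy column for illustration only; not taken from the real dataset.
toy_df = pd.DataFrame({"experience_level": ["EN", "MI", "SE", "EX"] * 250})

# deep=True inspects the actual string data instead of only counting pointers.
bytes_as_loaded = int(toy_df.memory_usage(deep=True).sum())

# A column with a handful of repeated codes can be stored as a categorical.
toy_cat_df = toy_df.astype({"experience_level": "category"})
bytes_as_category = int(toy_cat_df.memory_usage(deep=True).sum())

print(f"as loaded: {bytes_as_loaded} bytes, as category: {bytes_as_category} bytes")
```

Whether such a conversion is worthwhile for `salaries_df` depends on the later analysis; at 1332 rows the saving is small in absolute terms.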


In [14]:
# Duplicated rows: duplicated() flags each repeat of an earlier row as True
n_duplicates = salaries_df.duplicated().sum()
print(n_duplicates)                         # rows that repeat an earlier row
print(salaries_df.shape[0] - n_duplicates)  # rows left once the repeats are excluded
# salaries_df.duplicated().value_counts()

263
1069
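The two counts above add up to the 1332 rows of the table because `duplicated()` defaults to `keep="first"`: the first occurrence of a row is not flagged, only its later copies. A minimal sketch on hypothetical toy rows (not the real data) shows how the `keep` argument changes the count and what `drop_duplicates()` leaves behind.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
toy_df = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Engineer", "Data Scientist"],
    "salary_in_usd": [100000, 100000, 90000, 100000],
})

# keep="first" (the default): the first occurrence is not flagged, only later copies.
n_repeats = int(toy_df.duplicated().sum())

# keep=False flags every row that has at least one identical twin.
n_rows_in_duplicate_groups = int(toy_df.duplicated(keep=False).sum())

# Rows remaining after the repeats are dropped.
deduped_df = toy_df.drop_duplicates().reset_index(drop=True)
n_unique_rows = len(deduped_df)

print(n_repeats, n_rows_in_duplicate_groups, n_unique_rows)
```

Using `keep=False` is handy for inspecting complete groups of identical rows before deciding whether to drop them.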
