## FEATURE ENGINEERING

#### In this notebook I perform feature engineering (FE for short) on the Data Science Salary dataset, available on Kaggle at this [link](https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries).

The transformations that I decided to apply to the data are:

- Unnamed: 0 -> Drop;
- work_year -> Ordinal encoding;
- experience_level -> Ordinal encoding;
- employment_type -> Drop;
- job_title -> One-hot encoding (bucketization if it does not perform well);
- salary_currency -> Drop;
- salary_in_usd -> Drop;
- employee_residence -> Group less-represented values into an 'Other' category + One-hot encoding;
- remote_ratio -> Ordinal encoding;
- company_location -> Group less-represented values into an 'Other' category + One-hot encoding;
- company_size -> Ordinal encoding.
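One detail worth keeping in mind for the ordinal-encoded columns: when no `categories` argument is given, scikit-learn's `OrdinalEncoder` assigns codes by sorted category value, which for string codes means alphabetical order. The sketch below shows how an explicit order could be passed instead. The orderings `EN < MI < SE < EX` and `S < M < L` are my assumption about what the codes mean, and the toy rows are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows using the category codes that appear in the dataset (made-up combinations)
toy = pd.DataFrame({
    "experience_level": ["MI", "SE", "EN", "EX"],
    "company_size": ["L", "S", "M", "L"],
})

# Assumed natural orderings (not stated in the notebook)
experience_order = ["EN", "MI", "SE", "EX"]
size_order = ["S", "M", "L"]

# One list of categories per column, in the same order as the columns
encoder = OrdinalEncoder(categories=[experience_order, size_order])
encoded = encoder.fit_transform(toy[["experience_level", "company_size"]])

toy_encoded = pd.DataFrame(encoded, columns=["experience_level", "company_size"])
print(toy_encoded)
```

With this setup the integer codes increase with seniority and company size, which is usually what a model expects from an ordinal feature.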


In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import pickle
from pathlib import Path

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

warnings.filterwarnings("ignore")

%matplotlib inline

In [10]:
# Define CSV path
csv_path = Path("../data") / "ds_salaries.csv" 

# Read data 
df = pd.read_csv(csv_path)

# visualize some examples
df.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [3]:
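# Group less-represented categories (used for `employee_residence` and `company_location`)
def group_less_frequent(df, column, threshold=0.01):
    """Replace values of `column` whose relative frequency is below `threshold` with 'Other'."""
    # Relative frequency of each category
    value_counts = df[column].value_counts(normalize=True)
    rare_values = value_counts[value_counts < threshold].index
    # Note: this modifies `df` in place and also returns it
    df[column] = df[column].apply(lambda x: 'Other' if x in rare_values else x)
    return df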
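To make the effect of `threshold` concrete, here is a minimal sketch on made-up data. The helper `group_rare` is a hypothetical variant that works on a single Series and returns a new one instead of modifying the DataFrame in place; the country counts are invented for the example.

```python
import pandas as pd

def group_rare(series, threshold=0.01, other_label="Other"):
    """Return a copy of `series` with categories rarer than `threshold` replaced."""
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < threshold].index
    # Keep values that are not rare, replace the rest with `other_label`
    return series.where(~series.isin(rare), other_label)

# Made-up country codes: US is 6/10, DE is 3/10, HN is 1/10 of the rows
toy = pd.Series(["US"] * 6 + ["DE"] * 3 + ["HN"])

# With a 20% threshold only HN (10%) falls below the cut
grouped = group_rare(toy, threshold=0.2)
print(grouped.value_counts().to_dict())
```

The original Series is left untouched, which makes it easier to compare the distribution before and after grouping.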

In [12]:
# Drop unnecessary columns 
df = df.drop(columns=['Unnamed: 0', 'employment_type', 'salary_currency', 'salary_in_usd'])

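# Ordinal Encoding
# Note: without an explicit `categories` argument, OrdinalEncoder assigns codes
# by sorted category value (alphabetical order for string columns)
ordinal_cols = ['work_year', 'experience_level', 'remote_ratio', 'company_size']
ordinal_encoder = OrdinalEncoder()
df[ordinal_cols] = ordinal_encoder.fit_transform(df[ordinal_cols])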

# One-Hot Encoding for job_title
# (`sparse_output` and `get_feature_names_out` assume scikit-learn >= 1.2;
# older releases used `sparse` and `get_feature_names`)
one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
job_title_encoded = pd.DataFrame(one_hot_encoder.fit_transform(df[['job_title']]), 
                                 columns=one_hot_encoder.get_feature_names_out(['job_title']))
df = df.drop(columns=['job_title']).join(job_title_encoded)

# Group rare countries into 'Other' before encoding
df = group_less_frequent(df, 'employee_residence')
df = group_less_frequent(df, 'company_location')

# One-Hot Encoding for `employee_residence` and `company_location`
one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_cols = ['employee_residence', 'company_location']
encoded_data = pd.DataFrame(one_hot_encoder.fit_transform(df[encoded_cols]), 
                            columns=one_hot_encoder.get_feature_names_out(encoded_cols))

df = df.drop(columns=encoded_cols).join(encoded_data)

df.head()


Unnamed: 0,work_year,experience_level,salary,remote_ratio,company_size,job_title_3D Computer Vision Researcher,job_title_AI Scientist,job_title_Analytics Engineer,job_title_Applied Data Scientist,job_title_Applied Machine Learning Scientist,...,employee_residence_US,company_location_CA,company_location_DE,company_location_ES,company_location_FR,company_location_GB,company_location_GR,company_location_IN,company_location_Other,company_location_US
0,0.0,2.0,70000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,3.0,260000,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,3.0,85000,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,2.0,20000,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,3.0,150000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [None]:
print(df.head())

   work_year  experience_level  salary  remote_ratio  company_size  \
0        0.0               2.0   70000           0.0           0.0   
1        0.0               3.0  260000           0.0           2.0   
2        0.0               3.0   85000           1.0           1.0   
3        0.0               2.0   20000           0.0           2.0   
4        0.0               3.0  150000           1.0           0.0   

   job_title_3D Computer Vision Researcher  job_title_AI Scientist  \
0                                      0.0                     0.0   
1                                      0.0                     0.0   
2                                      0.0                     0.0   
3                                      0.0                     0.0   
4                                      0.0                     0.0   

   job_title_Analytics Engineer  job_title_Applied Data Scientist  \
0                           0.0                               0.0   
1                   

Save the transformed data, ready for training the model.

In [19]:
with open(Path("../data/data_transformed.pkl"), "wb") as f:
    pickle.dump(df, f)
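A quick way to check that the saved file can be read back is a pickle round trip. The sketch below uses a small made-up DataFrame and a temporary directory rather than the notebook's `../data` folder, so the names and values are only for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the transformed DataFrame
toy = pd.DataFrame({"work_year": [0.0, 1.0], "salary": [70000, 85000]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_transformed.pkl"

    # Write the DataFrame, then read it back from the same file
    with open(path, "wb") as f:
        pickle.dump(toy, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# True when values, dtypes, and labels all match
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

The same `pickle.load` call, pointed at the saved file, is what a training notebook would use to pick up the transformed data.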