## Preprocessing based on EDA
Based on the evidence from the EDA, I will follow these steps:

    1️⃣ Missing value cross-checking
    2️⃣ Outlier handling
    3️⃣ Skewness correction
    4️⃣ Encoding categorical data
    5️⃣ Scaling numerical data
    6️⃣ Ready for model training


In [36]:

import warnings
warnings.filterwarnings("ignore")

# Core Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler


## ✈️ Import data from EDA

In [37]:
data = pd.read_csv("cleaned_EDA_data.csv")
data.head()

|   | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Asia | High School | N | N | 14513 | 2007 | West | 592.2029 | Hour | Y | Denied |
| 1 | Asia | Master's | Y | N | 2412 | 2002 | Northeast | 83425.65 | Year | Y | Certified |
| 2 | Asia | Bachelor's | N | Y | 44444 | 2008 | West | 122996.86 | Year | Y | Denied |
| 3 | Asia | Bachelor's | N | N | 98 | 1897 | West | 83434.03 | Year | Y | Denied |
| 4 | Africa | Master's | Y | N | 1082 | 2005 | South | 149907.39 | Year | Y | Certified |


In [38]:
# Make a copy of the data so the loaded frame stays untouched
df = data.copy()

## 1️⃣  Missing value cross-checking

In [39]:
df.isna().sum()

continent                0
education_of_employee    0
has_job_experience       0
requires_job_training    0
no_of_employees          0
yr_of_estab              0
region_of_employment     0
prevailing_wage          0
unit_of_wage             0
full_time_position       0
case_status              0
dtype: int64

**No missing values, as expected from the EDA.**

##  2️⃣ Outlier handling

In [40]:
def capping_outliers(df, features):
    """Cap values outside the 1.5 * IQR fences, modifying df in place."""
    for feature in features:
        q1 = df[feature].quantile(0.25)
        q3 = df[feature].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        # count the values outside the fences before capping them
        count_outlier = ((df[feature] < lower) | (df[feature] > upper)).sum()
        # values below `lower` become `lower`, values above `upper` become `upper`
        df[feature] = df[feature].clip(lower=lower, upper=upper)
        print(f"The count of Clipped data for {feature} is: {count_outlier}")

In [41]:
# Based on the EDA, all of our numerical columns may contain outliers.
# Get the names of the numerical columns
num_col = df.select_dtypes(include="number").columns.tolist()

# Now cap them
capping_outliers(df, num_col)

The count of Clipped data for no_of_employees is: 1556
The count of Clipped data for yr_of_estab is: 3260
The count of Clipped data for prevailing_wage is: 427
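
To make the capping rule concrete, here is a minimal sketch on a tiny made-up series (the values and the helper name `iqr_bounds` are illustrative assumptions, not part of the dataset or the notebook). It shows how the 1.5 × IQR fences are computed and that only the extreme value is pulled back to the upper fence.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values with one extreme observation (made up for illustration)
toy = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 100.0])

lower, upper = iqr_bounds(toy)
capped = toy.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}]")
print(capped.tolist())
```

Capping keeps every row, unlike dropping outliers, but it does flatten the tails, so it is worth checking whether a column such as `yr_of_estab` really benefits from it.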


## 3️⃣ Skewness correction

In [42]:
# From the EDA, prevailing_wage and no_of_employees are right-skewed, so a log transformation is suggested
def transform_feature(df, features):
    for feature in features:
        # if the feature is right-skewed, apply a log(1 + x) transformation
        if df[feature].skew() > 0:
            df[feature] = np.log1p(df[feature])
        # then apply robust scaling (centers on the median, scales by the IQR)
        scaler = RobustScaler()
        df[feature] = scaler.fit_transform(df[[feature]]).ravel()

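# call the function on the right-skewed columns identified in the EDA
transform_feature(df, ["prevailing_wage", "no_of_employees"])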
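
As a quick sanity check on why `np.log1p` is a reasonable choice for right-skewed columns, the sketch below compares skewness before and after the transformation. The data is synthetic log-normal noise generated for illustration, not the visa dataset, and the seed and distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal); NOT the visa dataset
rng = np.random.default_rng(0)
toy = pd.Series(rng.lognormal(mean=10, sigma=1, size=5000))

skew_before = toy.skew()
skew_after = np.log1p(toy).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

`np.log1p` computes log(1 + x), so it stays defined at zero, and `np.expm1` reverses it if the original scale is needed later.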

## 4️⃣ Encoding categorical data
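
One possible way to approach this step, sketched on a toy frame copied from the five preview rows shown earlier: map the `Y`/`N` flags and the target to 0/1, and one-hot encode the nominal columns. Treating `Certified` as the positive class and using `drop_first=True` are my assumptions here, not decisions taken from the EDA.

```python
import pandas as pd

# Toy frame copied from the five preview rows of the dataset
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Asia", "Africa"],
    "has_job_experience": ["N", "Y", "N", "N", "Y"],
    "unit_of_wage": ["Hour", "Year", "Year", "Year", "Year"],
    "case_status": ["Denied", "Certified", "Denied", "Denied", "Certified"],
})

# Binary Y/N flag -> 0/1
toy["has_job_experience"] = toy["has_job_experience"].map({"N": 0, "Y": 1})

# Target -> 0/1 (Certified as the positive class is an assumption)
toy["case_status"] = toy["case_status"].map({"Denied": 0, "Certified": 1})

# Nominal columns -> one-hot; drop_first removes one redundant level per column
encoded = pd.get_dummies(
    toy, columns=["continent", "unit_of_wage"], drop_first=True, dtype=int
)

print(encoded)
```

An ordered column such as `education_of_employee` could instead get an explicit ordinal mapping, which would need the full list of levels from the real data.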

## 5️⃣ Scaling numerical data
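
A minimal sketch of one common scaling pattern, using made-up numbers rather than the real columns: the scaler learns its statistics from the training rows only and then reuses them on the test rows, so no information from the test set leaks into preprocessing. `StandardScaler` is used here only as an example; `RobustScaler` or `MinMaxScaler` follow the same `fit_transform` / `transform` pattern.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test rows with two numeric features (made-up values)
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[4.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled)
print(X_test_scaled)
```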

## 6️⃣ Ready for model training
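
To close the loop, here is a minimal sketch of a stratified split with `train_test_split`, shown on a made-up ten-row frame. The 80/20 ratio, the `random_state`, and the already-encoded 0/1 target are assumptions for illustration; stratifying on `case_status` keeps the class balance the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, already-encoded toy data: one feature and a balanced 0/1 target
toy = pd.DataFrame({
    "prevailing_wage": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "case_status": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X = toy.drop(columns="case_status")
y = toy["case_status"]

# stratify=y preserves the class proportions in the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```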