# DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING


Objective:
This assignment aims to equip you with practical skills in data preprocessing, feature engineering, and feature selection techniques, which are crucial for building efficient machine learning models. You will work with a provided dataset to apply various techniques such as scaling, encoding, and feature selection methods including isolation forest and PPS score analysis.
Dataset:
The provided "Adult" dataset is used to predict whether a person's income exceeds $50K/yr based on census data.
Tasks:
1. Data Exploration and Preprocessing:
•	Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).
•	Handle missing values as per the best practices (imputation, removal, etc.).
•	Apply scaling techniques to numerical features:
	•	Standard Scaling
	•	Min-Max Scaling
•	Discuss the scenarios where each scaling technique is preferred and why.
2. Encoding Techniques:
•	Apply One-Hot Encoding to categorical variables with less than 5 categories.
•	Use Label Encoding for categorical variables with more than 5 categories.
•	Discuss the pros and cons of One-Hot Encoding and Label Encoding.
3. Feature Engineering:
•	Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.
•	Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.
4. Feature Selection:
•	Use the Isolation Forest algorithm to identify and remove outliers. Discuss how outliers can affect model performance.
•	Apply the PPS (Predictive Power Score) to find and discuss the relationships between features. Compare its findings with the correlation matrix.


### 1: Data Exploration and Preprocessing

In [4]:
import pandas as pd
df = pd.read_csv(r'C:/Users/DELL/Desktop/DATAsets/adult_with_headers.csv')  # load the dataset
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [5]:
# Display dataset information
df.info()     # df.info() lists each column's data type and non-null count.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [6]:
df.describe()    # summarizes numerical columns (count, mean, std, min, quartiles, max).

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [7]:
# Count missing values per column. It is important to handle missing values because they can reduce model accuracy.
# They can be handled by imputation (replacing missing values) or removal (deleting rows/columns).
df.isnull().sum()


age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
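
A zero count from `isnull()` only rules out true `NaN` entries. Copies of the Adult data commonly mark unknown values with a `?` string instead, and the leading space in the dummy column names shown later (`sex_ Male`) suggests the strings in this file carry a leading space. The sketch below uses a small hypothetical frame (`toy`, not the real file) and hypothetical helpers (`text_columns`, `count_placeholders`) to show how such markers could be counted and converted to `NaN` before imputation. Whether this particular file contains them is an assumption that should be checked.

```python
import pandas as pd

# Hypothetical sample that mimics the assumed " ?" placeholder style.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " ?"],
    "age": [39, 50, 38],
})

def text_columns(frame):
    """Return the names of the non-numeric columns."""
    return [c for c in frame.columns
            if not pd.api.types.is_numeric_dtype(frame[c])]

def count_placeholders(frame, marker="?"):
    """Count cells equal to `marker` (ignoring surrounding spaces) per text column."""
    return pd.Series({
        c: int((frame[c].str.strip() == marker).sum())
        for c in text_columns(frame)
    })

placeholder_counts = count_placeholders(toy)

# Strip whitespace and turn the placeholder into a real missing value.
cleaned = toy.copy()
for c in text_columns(cleaned):
    stripped = cleaned[c].str.strip()
    cleaned[c] = stripped.mask(stripped == "?")

nan_counts = cleaned.isnull().sum()
print(placeholder_counts)
print(nan_counts)
```

If the real file does contain such markers, running the same conversion on `df` before the `fillna` calls would let the median/mode imputation act on them.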

In [8]:
# Fill missing values in numerical columns with median
df.fillna(df.median(numeric_only=True), inplace=True)
# Fill missing values in categorical columns with mode
df.fillna(df.mode().iloc[0], inplace=True)

In [9]:
df.fillna(df.median(numeric_only=True)).head()   # call median(); passing the bare method df.median is a bug

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [10]:
df.fillna(df.mode().iloc[0]).head()   # take the first mode row so every row is filled, not only index 0

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
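
Because this dataset has no `NaN` values, the imputation cells above leave `df` unchanged, which hides a subtle difference between `df.mode()` and `df.mode().iloc[0]`. The sketch below uses a small hypothetical frame with gaps (`toy`, not the assignment data) to show that `fillna` aligns a DataFrame argument on the row index, so only rows whose index appears in the mode frame get filled.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column (row 1).
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 40.0],
    "workclass": ["Private", None, "Private", "State-gov"],
})

mode_frame = toy.mode()        # DataFrame with a single row (index 0) here
mode_row = toy.mode().iloc[0]  # Series keyed by column name

# Passing the DataFrame aligns on index: row 1 is not in mode_frame, so it stays empty.
filled_by_frame = toy.fillna(mode_frame)

# Passing the Series fills each column with its own mode on every row.
filled_by_row = toy.fillna(mode_row)

# Median imputation only touches numeric columns.
filled_median = toy.fillna(toy.median(numeric_only=True))

print(filled_by_frame.isnull().sum().sum())  # missing cells left after frame-based fill
print(filled_by_row.isnull().sum().sum())    # missing cells left after row-based fill
```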


In [11]:
df.nunique()      # unique values per column; low counts usually indicate categorical features.

age                  73
workclass             9
fnlwgt            21648
education            16
education_num        16
marital_status        7
occupation           15
relationship          6
race                  5
sex                   2
capital_gain        119
capital_loss         92
hours_per_week       94
native_country       42
income                2
dtype: int64

##### Scaling Numerical Features

In [13]:
# Models that rely on distances or gradient-based optimization (e.g., KNN, logistic regression) tend to perform better when numerical features are on a similar scale.
# Scaling prevents features with large values (such as fnlwgt) from dominating those models.

In [14]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Select numerical features
numerical_features = ['age','fnlwgt','education_num','capital_gain','capital_loss','hours_per_week']

# Apply Standard Scaling
scaler = StandardScaler()
df_standard_scaled = df.copy()
df_standard_scaled[numerical_features] = scaler.fit_transform(df[numerical_features])

In [15]:
df_standard_scaled[numerical_features]

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
0,0.030671,-1.063611,1.134739,0.148453,-0.21666,-0.035429
1,0.837109,-1.008707,1.134739,-0.145920,-0.21666,-2.222153
2,-0.042642,0.245079,-0.420060,-0.145920,-0.21666,-0.035429
3,1.057047,0.425801,-1.197459,-0.145920,-0.21666,-0.035429
4,-0.775768,1.408176,1.134739,-0.145920,-0.21666,-0.035429
...,...,...,...,...,...,...
32556,-0.849080,0.639741,0.746039,-0.145920,-0.21666,-0.197409
32557,0.103983,-0.335433,-0.420060,-0.145920,-0.21666,-0.035429
32558,1.423610,-0.358777,-0.420060,-0.145920,-0.21666,-0.035429
32559,-1.215643,0.110960,-0.420060,-0.145920,-0.21666,-1.655225


In [16]:
# Apply Min-Max Scaling
minmax_scaler = MinMaxScaler()
df_minmax_scaled = df.copy()
df_minmax_scaled[numerical_features] = minmax_scaler.fit_transform(df[numerical_features])

In [17]:
df_minmax_scaled[numerical_features]

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
0,0.301370,0.044302,0.800000,0.021740,0.0,0.397959
1,0.452055,0.048238,0.800000,0.000000,0.0,0.122449
2,0.287671,0.138113,0.533333,0.000000,0.0,0.397959
3,0.493151,0.151068,0.400000,0.000000,0.0,0.397959
4,0.150685,0.221488,0.800000,0.000000,0.0,0.397959
...,...,...,...,...,...,...
32556,0.136986,0.166404,0.733333,0.000000,0.0,0.377551
32557,0.315068,0.096500,0.533333,0.000000,0.0,0.397959
32558,0.561644,0.094827,0.533333,0.000000,0.0,0.397959
32559,0.068493,0.128499,0.533333,0.000000,0.0,0.193878
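
To make the two scalers concrete, the sketch below reimplements their formulas with NumPy on a few values taken from the first rows of the dataset, plus the `capital_gain` maximum of 99999 from `df.describe()`. The helper names `standard_scale` and `min_max_scale` are hypothetical; the formulas are z = (x - mean) / std with the population standard deviation, and (x - min) / (max - min).

```python
import numpy as np

def standard_scale(x):
    """Center to mean 0 and scale to unit variance (population std, ddof=0)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def min_max_scale(x):
    """Map the smallest value to 0 and the largest to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

ages = [39, 50, 38, 53, 28]       # ages from the first five rows
gains = [0, 0, 2174, 0, 99999]    # includes the extreme capital_gain value

ages_std = standard_scale(ages)
ages_mm = min_max_scale(ages)
gains_mm = min_max_scale(gains)

print(ages_std.round(3))
print(ages_mm.round(3))
print(gains_mm.round(5))  # the single extreme value pushes 2174 close to 0
```

This also hints at when each technique is preferred. Standard scaling is a reasonable default for models that benefit from centered features, while min-max scaling is useful when a bounded range is required. Min-max scaling is sensitive to extreme values, as the squeezed `capital_gain` entries show.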


### 2: Encoding Categorical Variables

In [19]:
# Most machine learning models cannot handle categorical (string) data directly, so we convert it to numbers using encoding techniques.

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
categorical_columns = df.select_dtypes(include=['object']).columns      # Identify categorical columns

In [20]:
categorical_columns

Index(['workclass', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country', 'income'],
      dtype='object')

In [21]:
# Apply One-Hot Encoding to categories with < 5 unique values
one_hot_cols= [col for col in categorical_columns if df[col].nunique() < 5]
df_one_hot = pd.get_dummies(df, columns=one_hot_cols, drop_first=True)

In [22]:
one_hot_cols

['sex', 'income']

In [23]:
df_one_hot

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country,sex_ Male,income_ >50K
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,2174,0,40,United-States,1,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,13,United-States,1,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,40,United-States,1,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,40,United-States,1,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,0,0,40,Cuba,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,0,0,38,United-States,0,0
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,0,0,40,United-States,1,1
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,0,0,40,United-States,0,0
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,0,0,20,United-States,1,0


In [24]:
# Apply Label Encoding to categories with 5 or more unique values (>= 5, so that 'race' with exactly 5 is covered)
label_cols = [col for col in categorical_columns if df[col].nunique() >= 5]
label_encoders = {}

for col in label_cols:
    le = LabelEncoder()
    df_one_hot[col] = le.fit_transform(df_one_hot[col])
    label_encoders[col] = le      ### Save encoders for future use

#pd.get_dummies() converts categorical variables into binary columns (One-Hot Encoding).
#LabelEncoder() assigns numeric labels to categories.

In [25]:
label_encoders['native_country']   # the last column processed by the loop above

In [26]:
df_one_hot['native_country']

0        39
1        39
2        39
3        39
4         5
         ..
32556    39
32557    39
32558    39
32559    39
32560    39
Name: native_country, Length: 32561, dtype: int32
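
The trade-off between the two encodings can be seen on a tiny example. The sketch below uses a hypothetical four-row column (`toy`, not the full dataset): `LabelEncoder` assigns integer codes in sorted order of the class names, which introduces an artificial ordering, while `pd.get_dummies` creates one column per category at the cost of more columns. It also shows why the fitted encoders are worth saving: `inverse_transform` recovers the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample column for illustration.
toy = pd.DataFrame({"race": ["White", "Black", "Other", "White"]})

# Label encoding: classes are sorted, so Black=0, Other=1, White=2.
le = LabelEncoder()
codes = le.fit_transform(toy["race"])
decoded = le.inverse_transform(codes)

# One-hot encoding: one indicator column per category, no implied order.
dummies = pd.get_dummies(toy["race"], prefix="race")

print(list(le.classes_))
print(list(codes))
print(dummies.columns.tolist())
```

With label encoding, a linear model may treat `White` (2) as "greater than" `Black` (0), which has no meaning. One-hot encoding avoids this but would add dozens of columns for a feature such as `native_country` with 42 categories.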

### 3: Feature Engineering

In [28]:
# Engineered features can help models capture additional patterns.
# They can improve accuracy by encoding domain-specific knowledge.

In [29]:
# Create Working Hours Category
df_one_hot['working_hours_category'] = pd.cut(df['hours_per_week'], bins=[0,20,40,60,100], labels=['Low','Medium', 'High', 'Very High'])

In [30]:
df_one_hot['working_hours_category']

0        Medium
1           Low
2        Medium
3        Medium
4        Medium
          ...  
32556    Medium
32557    Medium
32558    Medium
32559       Low
32560    Medium
Name: working_hours_category, Length: 32561, dtype: category
Categories (4, object): ['Low' < 'Medium' < 'High' < 'Very High']

In [31]:
## Create Age Group
df_one_hot['age_group'] = pd.cut(df['age'], bins=[0,18,35,50,100], labels=['Teen', 'Young Adult', 'Middle Aged', 'Senior'])

In [32]:
df_one_hot['age_group']

0        Middle Aged
1        Middle Aged
2        Middle Aged
3             Senior
4        Young Adult
            ...     
32556    Young Adult
32557    Middle Aged
32558         Senior
32559    Young Adult
32560         Senior
Name: age_group, Length: 32561, dtype: category
Categories (4, object): ['Teen' < 'Young Adult' < 'Middle Aged' < 'Senior']
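
One detail of `pd.cut` worth knowing when choosing bin edges is that intervals are closed on the right by default, so a value equal to an edge falls into the lower bin. The sketch below applies the same age bins to a few hypothetical boundary values and compares the default with `right=False`.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([17, 18, 19, 35, 36, 50, 51, 90])
bins = [0, 18, 35, 50, 100]
labels = ["Teen", "Young Adult", "Middle Aged", "Senior"]

# Default: intervals are (0, 18], (18, 35], (35, 50], (50, 100].
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 18), [18, 35), [35, 50), [50, 100).
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(right_closed.astype(str).tolist())
print(left_closed.astype(str).tolist())
```

With the default setting used in the notebook, an 18-year-old is labelled `Teen` and a 50-year-old `Middle Aged`.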

##### Transformation of Skewed Data

In [34]:
# capital_gain is heavily right-skewed (median 0, mean about 1078, max 99999 in df.describe()), so apply a log transformation:
import numpy as np
df_one_hot['capital_gain_log'] = np.log1p(df['capital_gain'])   # log(1+x) to avoid log(0)
df_one_hot['capital_gain_log']

0        7.684784
1        0.000000
2        0.000000
3        0.000000
4        0.000000
           ...   
32556    0.000000
32557    0.000000
32558    0.000000
32559    0.000000
32560    9.617471
Name: capital_gain_log, Length: 32561, dtype: float64
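
The choice of transformation can be justified by measuring skewness before and after. The sketch below does this on a small hypothetical sample (`gains`) shaped like `capital_gain`, mostly zeros with a few large values. It uses `Series.skew()` and also checks that `np.expm1` inverts `np.log1p`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: mostly zeros with two large values, as in capital_gain.
gains = pd.Series([0, 0, 0, 0, 0, 0, 2174, 0, 0, 99999], dtype=float)

gains_log = np.log1p(gains)     # log(1 + x), defined at x = 0
restored = np.expm1(gains_log)  # inverse transform

skew_before = gains.skew()
skew_after = gains_log.skew()

print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")
```

The transformed feature is still skewed because most entries remain zero, but the extreme values are compressed, which reduces their influence on scale-sensitive models.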

### 4: Feature Selection

#####  Outlier Detection Using Isolation Forest

In [37]:
# Outliers can bias model fitting and reduce accuracy. Isolation Forest flags likely outliers (label -1); the rows labelled 1 are kept.
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier = iso_forest.fit_predict(df_one_hot[numerical_features])
df_filtered = df_one_hot[outlier == 1]

In [38]:
outlier

array([1, 1, 1, ..., 1, 1, 1])

In [39]:
df_filtered

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country,sex_ Male,income_ >50K,working_hours_category,age_group,capital_gain_log
0,39,7,77516,9,13,4,1,1,4,2174,0,40,39,1,0,Medium,Middle Aged,7.684784
1,50,6,83311,9,13,2,4,0,4,0,0,13,39,1,0,Low,Middle Aged,0.000000
2,38,4,215646,11,9,0,6,1,4,0,0,40,39,1,0,Medium,Middle Aged,0.000000
3,53,4,234721,1,7,2,6,0,2,0,0,40,39,1,0,Medium,Senior,0.000000
4,28,4,338409,9,13,2,10,5,2,0,0,40,5,0,0,Medium,Young Adult,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,4,257302,7,12,2,13,5,4,0,0,38,39,0,0,Medium,Young Adult,0.000000
32557,40,4,154374,11,9,2,7,0,4,0,0,40,39,1,1,Medium,Middle Aged,0.000000
32558,58,4,151910,11,9,6,1,4,4,0,0,40,39,0,0,Medium,Senior,0.000000
32559,22,4,201490,11,9,4,1,3,4,0,0,20,39,1,0,Low,Young Adult,0.000000
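
To illustrate what the `contamination` parameter and the returned labels mean, the sketch below fits an Isolation Forest on hypothetical synthetic data: 200 points from a standard normal distribution plus one deliberately extreme point. `fit_predict` returns 1 for inliers and -1 for outliers, and roughly the `contamination` fraction of the training rows is flagged.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: a normal cloud plus one obvious outlier in the last row.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[15.0, 15.0]]])

iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)    # 1 = inlier, -1 = outlier

flagged_fraction = float(np.mean(labels == -1))
X_filtered = X[labels == 1]    # same filtering pattern as df_one_hot[outlier == 1]

print(f"flagged fraction: {flagged_fraction:.3f}")
print(f"rows kept: {len(X_filtered)} of {len(X)}")
```

Because `contamination` fixes the share of rows to flag, it should be treated as an assumption about the data. With 0.05, about 5% of rows are removed whether or not they are genuinely anomalous.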


##### Predictive Power Score (PPS)

In [41]:
# PPS scores how well one column predicts another. Unlike Pearson correlation, it is asymmetric, can detect non-linear relationships, and works with both numerical and categorical features.

In [42]:
import ppscore as pps

pps_matrix = pps.matrix(df_filtered)
print(pps_matrix[['x','y','ppscore']].sort_values(by='ppscore',ascending=False))

                 x              y  ppscore
0              age            age      1.0
171   capital_gain   capital_gain      1.0
58       education  education_num      1.0
75   education_num      education      1.0
76   education_num  education_num      1.0
..             ...            ...      ...
144           race            age      0.0
145           race      workclass      0.0
147           race      education      0.0
148           race  education_num      0.0
172   capital_gain   capital_loss      0.0

[324 rows x 3 columns]
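
The idea behind a predictive score can be sketched without the `ppscore` package. The block below is not the library's implementation. It is a simplified stand-in (the hypothetical function `rough_pps`) that scores a cross-validated decision tree against a naive median prediction, using mean absolute error. On hypothetical data with `y = x**2`, the Pearson correlation is near zero, yet `x` predicts `y` well while `y` predicts `x` poorly.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

def rough_pps(x, y, seed=0):
    """Simplified predictive score: 1 - MAE(tree) / MAE(median baseline), floored at 0."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float)
    model = DecisionTreeRegressor(random_state=seed)
    pred = cross_val_predict(model, x, y, cv=4)   # out-of-fold predictions
    mae_model = np.mean(np.abs(y - pred))
    mae_naive = np.mean(np.abs(y - np.median(y)))
    if mae_naive == 0:
        return 0.0
    return max(0.0, 1.0 - mae_model / mae_naive)

# Hypothetical non-linear, symmetric relationship.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=400)
y = x ** 2

pearson = np.corrcoef(x, y)[0, 1]
score_x_to_y = rough_pps(x, y)
score_y_to_x = rough_pps(y, x)

print(f"Pearson r:   {pearson:.3f}")
print(f"x -> y score: {score_x_to_y:.3f}")
print(f"y -> x score: {score_y_to_x:.3f}")
```

The asymmetry arises because each `x` determines one `y`, but each `y` is compatible with two values of `x`. A symmetric correlation coefficient cannot express this.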


##### Comparison with Correlation Matrix

In [44]:
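correlation_matrix = df_filtered.corr(numeric_only=True)   # restrict to numeric columns explicitly; the category columns cannot be correlated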
print(correlation_matrix)

                       age  workclass    fnlwgt  education  education_num  \
age               1.000000   0.009988 -0.079026  -0.002812       0.035963   
workclass         0.009988   1.000000 -0.017831   0.019409       0.046283   
fnlwgt           -0.079026  -0.017831  1.000000  -0.023298      -0.039355   
education        -0.002812   0.019409 -0.023298   1.000000       0.347690   
education_num     0.035963   0.046283 -0.039355   0.347690       1.000000   
marital_status   -0.284393  -0.060436  0.027795  -0.034145      -0.056905   
occupation       -0.017709   0.248579 -0.001048  -0.025483       0.106509   
relationship     -0.262209  -0.092290  0.008571  -0.010679      -0.088812   
race              0.027202   0.048704 -0.022577   0.014373       0.028561   
capital_gain      0.093410   0.020837 -0.012993   0.020707       0.130080   
capital_loss      0.029835  -0.003454 -0.011045   0.021202       0.045779   
hours_per_week    0.094233   0.129166 -0.021354   0.053627       0.134755   
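
The outputs above contain an instructive contrast: PPS reports 1.0 between `education` and `education_num`, while their correlation is only about 0.35. A likely explanation is that `LabelEncoder` numbers the education labels alphabetically, so the codes do not follow the schooling order even though each label maps to exactly one `education_num`. The sketch below reproduces the effect on the nine rows visible in the displayed tables, using `cat.codes` as a stand-in for `LabelEncoder` (both assign codes in sorted label order).

```python
import pandas as pd

# education / education_num pairs copied from the rows displayed in the notebook.
toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors",
                  "Assoc-acdm", "HS-grad", "HS-grad", "HS-grad"],
    "education_num": [13, 13, 9, 7, 13, 12, 9, 9, 9],
})

# Alphabetical integer codes: 11th=0, Assoc-acdm=1, Bachelors=2, HS-grad=3.
codes = toy["education"].astype("category").cat.codes

pearson = codes.corr(toy["education_num"])
levels_per_label = toy.groupby("education")["education_num"].nunique()

print(f"Pearson r between codes and education_num: {pearson:.3f}")
print(levels_per_label)
```

Each label maps to a single `education_num`, so one column fully predicts the other, yet the linear correlation of the arbitrary codes is weak. This is the kind of relationship PPS surfaces and a correlation matrix on label-encoded data can miss.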



In [45]:
#Preprocessing improves data quality.
#Feature engineering creates informative variables.
#Feature selection reduces noise and improves efficiency.