# EDA2 — Data Preprocessing & Feature Engineering

This notebook follows the assignment brief: load the Adult dataset, then work through preprocessing, encoding, feature engineering, outlier detection, and feature selection. Run the cells sequentially. The deliverable is this `.ipynb` file, since the instructor requested `.ipynb` only.

In [1]:
# Optional: install missing packages. Uncomment to run if packages missing.
# !pip install ppscore
# !pip install category_encoders

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.ensemble import IsolationForest

print('Imports OK')

Imports OK


## 1) Load dataset

The cell below reads `adult_with_headers.csv` from `DATA_PATH`. Point `DATA_PATH` at your copy of the file, for example in the working directory or in `/mnt/data/`.

In [5]:
DATA_PATH = r"D:\DATA-SCIENCE\ASSIGNMENTS\12 EDA2\adult_with_headers.csv"  # change if needed

try:
    df = pd.read_csv(DATA_PATH)
except FileNotFoundError:
    print(f'File not found at {DATA_PATH}. Please update DATA_PATH to the correct path.')
else:
    print('Loaded df shape:', df.shape)
    display(df.head())

Loaded df shape: (32561, 15)


index,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
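Since the file may live in more than one place, a small lookup can replace the hard-coded path. The sketch below is one possible approach, not part of the assignment code: `first_existing` and `CANDIDATES` are hypothetical names, and the two locations are simply the ones mentioned in the text above.

```python
from pathlib import Path

# Hypothetical helper: return the first candidate path that is a file, else None.
def first_existing(candidates):
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None

# Candidate locations (assumed from the text above); adjust for your machine.
CANDIDATES = [
    Path.cwd() / "adult_with_headers.csv",
    Path("/mnt/data") / "adult_with_headers.csv",
]

data_path = first_existing(CANDIDATES)
if data_path is None:
    print("adult_with_headers.csv not found; update CANDIDATES.")
else:
    print("Using", data_path)
```

If a path is found, `pd.read_csv(data_path)` can be used in place of `pd.read_csv(DATA_PATH)`.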


## 2) Basic exploration & missing values

In [6]:
# Quick overview
print('Columns:', df.columns.tolist())
print('\nDtypes:')
print(df.dtypes)

print('\nMissing values:')
print(df.isna().sum())

print('\nDescriptive statistics:')
df.describe(include='all').T

Columns: ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

Dtypes:
age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
income            object
dtype: object

Missing values:
age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

Descriptive statistics:


column,count,unique,top,freq,mean,std,min,25%,50%,75%,max
age,32561.0,,,,38.581647,13.640433,17.0,28.0,37.0,48.0,90.0
workclass,32561.0,9.0,Private,22696.0,,,,,,,
fnlwgt,32561.0,,,,189778.366512,105549.977697,12285.0,117827.0,178356.0,237051.0,1484705.0
education,32561.0,16.0,HS-grad,10501.0,,,,,,,
education_num,32561.0,,,,10.080679,2.57272,1.0,9.0,10.0,12.0,16.0
marital_status,32561.0,7.0,Married-civ-spouse,14976.0,,,,,,,
occupation,32561.0,15.0,Prof-specialty,4140.0,,,,,,,
relationship,32561.0,6.0,Husband,13193.0,,,,,,,
race,32561.0,5.0,White,27816.0,,,,,,,
sex,32561.0,2.0,Male,21790.0,,,,,,,


### Handling missing values

The strategy depends on the column type: drop rows where the target is missing; for categorical columns, consider mode imputation; for numeric columns, use the median or the mean. Document each choice.

In [8]:
# Example missing-value handling; customize as needed
# 1) Show columns with missing values (compute the counts once, then filter)
na_counts = df.isna().sum()
missing = na_counts[na_counts > 0].sort_values(ascending=False)
print(missing)

# 2) Example imputation (only run after reviewing missing):
# numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

# df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
# for c in cat_cols:
#     df[c] = df[c].fillna(df[c].mode()[0])

print('Adjust imputation code above as per dataset and run it when ready.')

Series([], dtype: int64)
Adjust imputation code above as per dataset and run it when ready.
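A zero count from `isna()` does not guarantee that nothing is missing. The commonly distributed Adult files mark unknown values with a `?` string, and the `sex_ Male` and `income_ >50K` column names produced later in this notebook suggest that the string values here carry a leading space. Whether this particular copy uses the `?` placeholder is an assumption to check with `df['workclass'].value_counts()`. The sketch below uses a made-up toy frame to show one way to convert such placeholders to real `NaN` values before applying the mode imputation described above.

```python
import pandas as pd

# Toy frame standing in for df. The " ?" placeholder and the leading spaces
# are assumptions about how this copy of the data is encoded.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov", " Private"],
    "occupation": [" Sales", " ?", " ?", " Tech-support"],
    "age": [39, 50, 38, 53],
})
cat_cols = ["workclass", "occupation"]

cleaned = toy.copy()
for col in cat_cols:
    stripped = cleaned[col].str.strip()              # drop surrounding whitespace
    cleaned[col] = stripped.mask(stripped == "?")    # placeholder becomes NaN

missing_counts = cleaned.isna().sum()
print(missing_counts)

# Mode imputation for the categorical columns.
for col in cat_cols:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode()[0])
```

When several values tie for the mode, `mode()[0]` picks the first in sorted order, so the imputed value can be somewhat arbitrary for columns with many missing entries.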


## 3) Scaling numerical features

We demonstrate StandardScaler and MinMaxScaler on numeric columns. Use the one appropriate for your model.

In [9]:
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print('Numeric columns:', num_cols)

# Example scaling (do not overwrite original df unless intended)
scaler_std = StandardScaler()
scaler_mm = MinMaxScaler()

if num_cols:
    X_num = df[num_cols].copy()
    X_std = pd.DataFrame(scaler_std.fit_transform(X_num), columns=num_cols, index=df.index)
    X_mm = pd.DataFrame(scaler_mm.fit_transform(X_num), columns=num_cols, index=df.index)
    display(X_std.describe().T)
    display(X_mm.describe().T)
else:
    print('No numeric columns found to scale.')

column,count,mean,std,min,25%,50%,75%,max
age,32561.0,-2.6622710000000004e-17,1.000015,-1.582206,-0.775768,-0.115955,0.690484,3.769612
fnlwgt,32561.0,-9.732565e-17,1.000015,-1.681631,-0.681691,-0.108219,0.447877,12.268563
education_num,32561.0,1.479525e-16,1.000015,-3.529656,-0.42006,-0.03136,0.746039,2.300838
capital_gain,32561.0,5.237255e-18,1.000015,-0.14592,-0.14592,-0.14592,-0.14592,13.394578
capital_loss,32561.0,-4.364379e-17,1.000015,-0.21666,-0.21666,-0.21666,-0.21666,10.593507
hours_per_week,32561.0,-2.5749840000000003e-17,1.000015,-3.19403,-0.035429,-0.035429,0.369519,4.742967


column,count,mean,std,min,25%,50%,75%,max
age,32561.0,0.295639,0.186855,0.0,0.150685,0.273973,0.424658,1.0
fnlwgt,32561.0,0.120545,0.071685,0.0,0.071679,0.112788,0.152651,1.0
education_num,32561.0,0.605379,0.171515,0.0,0.533333,0.6,0.733333,1.0
capital_gain,32561.0,0.010777,0.073854,0.0,0.0,0.0,0.0,1.0
capital_loss,32561.0,0.020042,0.092507,0.0,0.0,0.0,0.0,1.0
hours_per_week,32561.0,0.402423,0.125994,0.0,0.397959,0.397959,0.44898,1.0
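The cell above fits both scalers on the full dataset, which is fine for a demonstration. When the scaled features feed a model, a common practice is to fit the scaler on the training rows only and reuse those statistics on the test rows. The sketch below illustrates this on synthetic data; the column names mimic the Adult data but the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric columns (values are made up).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "age": rng.integers(17, 91, size=200),
    "hours_per_week": rng.integers(1, 100, size=200),
})

X_train, X_test = train_test_split(toy, test_size=0.25, random_state=42)

# Fit on the training rows only, then apply the same mean/std to the test rows.
scaler = StandardScaler()
X_train_std = pd.DataFrame(
    scaler.fit_transform(X_train), columns=toy.columns, index=X_train.index
)
X_test_std = pd.DataFrame(
    scaler.transform(X_test), columns=toy.columns, index=X_test.index
)

print(X_train_std.mean().round(3))  # zero up to floating-point error
print(X_test_std.mean().round(3))   # near zero, but generally not exactly zero
```

A side note on the output above: the `std` column reads 1.000015 rather than 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`).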


## 4) Encoding categorical variables

- One-Hot Encoding: for categorical columns with fewer than 5 unique categories
- Label Encoding: for categorical columns with 5 or more unique categories

Adjust thresholds as needed.

In [10]:
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Choose encoding strategy
onehot_cols = [c for c in cat_cols if df[c].nunique() < 5]
label_cols = [c for c in cat_cols if df[c].nunique() >= 5]

print('One-hot columns (example):', onehot_cols)
print('Label-encode columns (example):', label_cols)

# One-hot (pandas)
df_onehot = pd.get_dummies(df[onehot_cols], drop_first=True) if onehot_cols else pd.DataFrame(index=df.index)

# Label encode
df_label = df[label_cols].copy()
for c in label_cols:
    le = LabelEncoder()
    df_label[c] = le.fit_transform(df_label[c].astype(str))

print('\nShapes: original', df.shape, 'onehot', df_onehot.shape, 'label', df_label.shape)

# Combine: numeric + encoded categorical
df_processed = pd.concat([df.select_dtypes(include=[np.number]), df_onehot, df_label], axis=1)
print('\nProcessed shape:', df_processed.shape)

# Show sample
display(df_processed.head())

One-hot columns (example): ['sex', 'income']
Label-encode columns (example): ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'native_country']

Shapes: original (32561, 15) onehot (32561, 2) label (32561, 7)

Processed shape: (32561, 15)


index,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,sex_ Male,income_ >50K,workclass,education,marital_status,occupation,relationship,race,native_country
0,39,77516,13,2174,0,40,True,False,7,9,4,1,1,4,39
1,50,83311,13,0,0,13,True,False,6,9,2,4,0,4,39
2,38,215646,9,0,0,40,True,False,4,11,0,6,1,4,39
3,53,234721,7,0,0,40,True,False,4,1,2,6,0,2,39
4,28,338409,13,0,0,40,False,False,4,9,2,10,5,2,5
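The integer codes in the label-encoded columns follow the sorted order of the category strings, not any meaningful ranking. The sketch below, on a made-up toy column, shows how to recover the label-to-code mapping from `LabelEncoder.classes_` and how a one-hot alternative looks for models that might misread the codes as ordered.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column; category names mimic the Adult data but the rows are made up.
toy = pd.DataFrame({"race": ["White", "Black", "Other", "White", "Black"]})

le = LabelEncoder()
toy["race_code"] = le.fit_transform(toy["race"])

# classes_ holds the sorted labels; the label at position i gets code i.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# One-hot alternative; dtype=int yields 0/1 columns instead of booleans.
onehot = pd.get_dummies(toy["race"], prefix="race", dtype=int)
print(onehot.head())
```

Keeping the fitted encoder (or the mapping) for each column makes it possible to decode predictions or apply the same codes to new data.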


## 5) Feature engineering

Create at least 2 new features and transform skewed features. Examples provided below.

In [22]:
# Create engineered features using the actual column names (underscores)
# numpy (np) and pandas (pd) are already imported in the first cell.

# Work on df directly: the new columns are added to df in place.
src = df

created = []
skipped = []

# 1) capital_diff using capital_gain & capital_loss
if ('capital_gain' in src.columns) and ('capital_loss' in src.columns):
    src['capital_diff'] = src['capital_gain'].fillna(0) - src['capital_loss'].fillna(0)
    created.append('capital_diff')
else:
    skipped.append('capital_diff (requires capital_gain and capital_loss)')

# 2) hours_per_age using hours_per_week & age
if ('hours_per_week' in src.columns) and ('age' in src.columns):
    src['hours_per_age'] = src['hours_per_week'] / src['age'].replace({0: np.nan})
    created.append('hours_per_age')
else:
    skipped.append('hours_per_age (requires hours_per_week and age)')

# 3) log_capital_gain using capital_gain
if 'capital_gain' in src.columns:
    src['log_capital_gain'] = np.log1p(src['capital_gain'].clip(lower=0).fillna(0))
    created.append('log_capital_gain')
else:
    skipped.append('log_capital_gain (requires capital_gain)')

# Copy to df_processed (if exists) so modeling frame contains new features
if 'df_processed' in globals() and isinstance(df_processed, pd.DataFrame):
    for c in created:
        df_processed[c] = src[c]
    copied_to_processed = True
else:
    copied_to_processed = False

# Summary + preview
print("Created columns:", created if created else "None")
print("Skipped columns (missing inputs):", skipped if skipped else "None")
print("Copied into df_processed:", copied_to_processed)
if created:
    display(src[created].head())
else:
    print("\nNo new features created; check that df uses the underscore column names.")


Created columns: ['capital_diff', 'hours_per_age', 'log_capital_gain']
Skipped columns (missing inputs): None
Copied into df_processed: True


index,capital_diff,hours_per_age,log_capital_gain
0,2174,1.025641,7.684784
1,0,0.26,0.0
2,0,1.052632,0.0
3,0,0.754717,0.0
4,0,1.428571,0.0
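To check whether a log transform actually reduces skew, compare the skewness before and after. The sketch below does this on a synthetic, zero-inflated column that only loosely imitates `capital_gain`; the distribution parameters are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic column: roughly 90% zeros plus a heavy right tail (made-up values).
rng = np.random.default_rng(0)
is_zero = rng.random(1000) < 0.9
tail = rng.lognormal(mean=8, sigma=1, size=1000)
toy = pd.Series(np.where(is_zero, 0.0, tail), name="capital_gain")

skew_raw = toy.skew()
skew_log = np.log1p(toy).skew()

print(f"skew before log1p: {skew_raw:.2f}")
print(f"skew after  log1p: {skew_log:.2f}")
```

On the real data the same check is `df['capital_gain'].skew()` against `df['log_capital_gain'].skew()`. Note that `log1p` compresses the tail but cannot remove the spike at zero, so some skew usually remains.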


## 6) Outlier detection using Isolation Forest

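We run IsolationForest on the numeric features of `df_processed` and store the result as a boolean flag in `df['outlier_iforest']`; removing the flagged rows is optional. Note that `select_dtypes(include=[np.number])` skips the boolean one-hot columns, so `sex_ Male` and `income_ >50K` are not used here.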

In [23]:
num_cols = df_processed.select_dtypes(include=[np.number]).columns.tolist()
print('Numeric features used for isolation forest:', num_cols[:20])

if num_cols:
    iso = IsolationForest(contamination=0.01, random_state=42)
    preds = iso.fit_predict(df_processed[num_cols].fillna(0))
    df['outlier_iforest'] = (preds == -1)
    print('Outliers detected:', df['outlier_iforest'].sum())
else:
    print('No numeric features available for IsolationForest')

Numeric features used for isolation forest: ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week', 'workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'native_country', 'capital_diff', 'hours_per_age', 'log_capital_gain']
Outliers detected: 326
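The cell above only flags outliers. If removal is wanted, one option is to filter into a separate frame so the original stays intact. The sketch below shows the pattern on synthetic data; the feature names and values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic frame: 500 ordinary rows plus 5 extreme ones (made-up values).
rng = np.random.default_rng(42)
ordinary = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
extreme = rng.normal(loc=12.0, scale=0.5, size=(5, 2))
toy = pd.DataFrame(np.vstack([ordinary, extreme]), columns=["f1", "f2"])

iso = IsolationForest(contamination=0.01, random_state=42)
is_outlier = iso.fit_predict(toy) == -1   # -1 marks an outlier, 1 an inlier

# Filter into a new frame; the original toy frame is left unchanged.
toy_clean = toy.loc[~is_outlier].reset_index(drop=True)

print("Flagged:", int(is_outlier.sum()), "of", len(toy))
print("Shape before/after:", toy.shape, toy_clean.shape)
```

In this notebook the equivalent filter would be `df_processed.loc[~df['outlier_iforest']]`, since both frames share the same index. The `contamination` value fixes the share of rows that get flagged, so it is a modelling choice rather than something the algorithm discovers.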


## 7) PPS (Predictive Power Score)

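PPS scores how well one variable predicts another. Unlike Pearson correlation, it can pick up non-linear relationships and it is asymmetric (the score for x predicting y can differ from y predicting x). If `ppscore` isn't installed, uncomment the pip install line in the first cell.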

In [24]:
# Compute correlation matrix and PPS (if available)
print('Correlation matrix (pearson) sample:')
if not df_processed.empty:
    display(df_processed.corr().iloc[:10,:10])

try:
    import ppscore as pps
    pps_matrix = pps.matrix(df)[['x','y','ppscore']]
    # pivot for heatmap-like display for a subset
    pivot = pps_matrix.pivot(index='x', columns='y', values='ppscore')
    print('\nPPS matrix (sample):')
    display(pivot.iloc[:10,:10])
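except ImportError as e:
    print('\nPPS not available (install ppscore to compute).')
    print('Exception:', e)
except Exception as e:
    # ppscore imported but the computation failed; report the real cause
    print('\nPPS computation failed.')
    print('Exception:', e)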

print('\nNote: Compare PPS results with correlation matrix for linear vs non-linear relationships.')

Correlation matrix (pearson) sample:


feature,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,sex_ Male,income_ >50K,workclass,education
age,1.0,-0.076646,0.036527,0.077674,0.057775,0.068756,0.088832,0.234037,0.003787,-0.010508
fnlwgt,-0.076646,1.0,-0.043195,0.000432,-0.010252,-0.018768,0.026858,-0.009463,-0.016656,-0.028145
education_num,0.036527,-0.043195,1.0,0.12263,0.079923,0.148123,0.01228,0.335154,0.052085,0.359153
capital_gain,0.077674,0.000432,0.12263,1.0,-0.031615,0.078409,0.04848,0.223329,0.033835,0.030046
capital_loss,0.057775,-0.010252,0.079923,-0.031615,1.0,0.054256,0.045567,0.150526,0.012216,0.016746
hours_per_week,0.068756,-0.018768,0.148123,0.078409,0.054256,1.0,0.229309,0.229689,0.138962,0.05551
sex_ Male,0.088832,0.026858,0.01228,0.04848,0.045567,0.229309,1.0,0.21598,0.095981,-0.027356
income_ >50K,0.234037,-0.009463,0.335154,0.223329,0.150526,0.229689,0.21598,1.0,0.051604,0.079317
workclass,0.003787,-0.016656,0.052085,0.033835,0.012216,0.138962,0.095981,0.051604,1.0,0.023513
education,-0.010508,-0.028145,0.359153,0.030046,0.016746,0.05551,-0.027356,0.079317,0.023513,1.0



PPS not available (install ppscore to compute).
Exception: No module named 'ppscore'

Note: Compare PPS results with correlation matrix for linear vs non-linear relationships.
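Because `ppscore` was not installed for this run, the idea behind the score can still be illustrated with scikit-learn. The sketch below is a simplified, PPS-style score for a categorical target, not the `ppscore` library's implementation: `pps_like_score` is a hypothetical helper that cross-validates a decision tree on a single feature, measures weighted F1, and normalizes it against a naive baseline (the better of "always predict the most frequent class" and "predict a shuffled copy of the target").

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def pps_like_score(x, y, cv=4, random_state=42):
    """Hypothetical, simplified PPS-style score in [0, 1] for a categorical y."""
    X = pd.get_dummies(pd.DataFrame({"x": np.asarray(x)}))  # numeric x passes through
    y = np.asarray(y).astype(str)

    model = DecisionTreeClassifier(random_state=random_state)
    predicted = cross_val_predict(model, X, y, cv=cv)
    model_f1 = f1_score(y, predicted, average="weighted", zero_division=0)

    # Naive baselines: most frequent class, and a random shuffle of the target.
    labels, counts = np.unique(y, return_counts=True)
    most_common = np.full(len(y), labels[np.argmax(counts)])
    shuffled = np.random.default_rng(random_state).permutation(y)
    baseline_f1 = max(
        f1_score(y, most_common, average="weighted", zero_division=0),
        f1_score(y, shuffled, average="weighted", zero_division=0),
    )

    if model_f1 <= baseline_f1:
        return 0.0
    return (model_f1 - baseline_f1) / (1.0 - baseline_f1)

# Synthetic demo: one feature determines the target, the other is pure noise.
rng = np.random.default_rng(0)
informative = rng.integers(0, 2, size=400)
noise = rng.normal(size=400)
target = np.where(informative == 1, ">50K", "<=50K")

score_informative = pps_like_score(informative, target)
score_noise = pps_like_score(noise, target)
print("informative feature:", round(score_informative, 3))
print("noise feature:      ", round(score_noise, 3))
```

The normalization step is what makes the score comparable across targets with different class balance: a feature only scores above zero if it beats what could be achieved without looking at it.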


## 8) Feature selection guidance

- Use PPS and correlation to shortlist features.
- Consider variance threshold, mutual information, or model-based importances for final selection.
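The two filter-style options mentioned above can be sketched as follows. This is an illustration on synthetic data, not a prescribed selection procedure; the column names `signal`, `noise`, and `constant` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

rng = np.random.default_rng(42)
n = 500
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "signal": target * 2.0 + rng.normal(scale=0.5, size=n),  # depends on the target
    "noise": rng.normal(size=n),                             # independent of the target
    "constant": np.ones(n),                                  # zero variance
})

# Step 1: drop zero-variance columns.
vt = VarianceThreshold(threshold=0.0)
vt.fit(toy)
kept = toy.columns[vt.get_support()].tolist()

# Step 2: rank the remaining columns by mutual information with the target.
mi = mutual_info_classif(toy[kept], target, random_state=42)
ranking = pd.Series(mi, index=kept).sort_values(ascending=False)
print(ranking)
```

On the processed Adult frame, the target column (`income_ >50K`) would need to be separated from the feature columns before running either step.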

In [27]:
OUT = r"D:\DATA-SCIENCE\ASSIGNMENTS\12 EDA2\adult_processed_for_modeling.csv"

df_processed.to_csv(OUT, index=False)
print("✅ Saved processed dataset successfully to:", OUT)


✅ Saved processed dataset successfully to: D:\DATA-SCIENCE\ASSIGNMENTS\12 EDA2\adult_processed_for_modeling.csv


In [29]:
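from IPython.display import FileLink, display
# The relative path below is resolved against the notebook's working directory,
# so the link works only if that is the folder the file was saved to above.
display(FileLink("adult_processed_for_modeling.csv"))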
