In [9]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.ensemble import IsolationForest

In [3]:
# --- 1. Load Data, Clean Headers, and Define Columns ---
# Load the dataset
df = pd.read_csv("/content/adult_with_headers.csv")
df


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


## 1. Data Exploration and Preprocessing Analysis

### Handling Missing Values

Initial inspection of the categorical features (`workclass`, `occupation`, `native_country`) revealed missing values represented by the string **' ?'**.

* **Strategy:** These values were treated as `NaN` and imputed using the **mode** (most frequent value) for each respective column.
* **Rationale:** These columns are nominal categories, so a mean or median is undefined and the mode is the simplest valid fill value. It keeps every row, but it inflates the share of the most frequent category and implicitly assumes the values are missing at random.
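The replace-then-impute strategy can be sketched on a toy column. The values below are hypothetical and only mimic the `' ?'` placeholder; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) using the same ' ?' placeholder as the dataset.
toy = pd.DataFrame({"workclass": [" Private", " ?", " Private", " State-gov", " ?"]})

# Treat the placeholder as missing, then fill with the most frequent value.
toy["workclass"] = toy["workclass"].replace(" ?", np.nan)
n_missing = int(toy["workclass"].isna().sum())
mode_value = toy["workclass"].mode()[0]
toy["workclass"] = toy["workclass"].fillna(mode_value)

print(n_missing)
print(mode_value)
print(toy["workclass"].value_counts().to_dict())
```

In this toy case the filled rows all land in the modal category, so its count grows while the other categories stay unchanged.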

### Scaling Scenarios (Standard vs. Min-Max)

We applied both Standard Scaling and Min-Max Scaling to all numerical features.

| Scaling Technique | Formula | When to Use |
| :--- | :--- | :--- |
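| **Standard Scaling** | $z = \frac{x - \mu}{\sigma}$ (Z-score) | **Suited to distance-based and variance-based algorithms** (e.g., K-Means, SVM, PCA) and to regularized or gradient-trained linear models. It centers each feature at mean 0 with variance 1. The output range is not fixed, so outliers are not squeezed into a bounded interval, but they still shift $\mu$ and $\sigma$, so the method is not robust to them. |
| **Min-Max Scaling** | $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$ | **Suited to models that expect bounded inputs** (e.g., neural networks, image pixel data). It is a linear rescaling that maps all data to a fixed range (usually \[0, 1]) without changing the shape of the distribution. A single extreme value sets the minimum or maximum and compresses the remaining values into a narrow band. |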
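A minimal sketch of how the two scalers behave when one value is extreme. The array is a made-up example, not a column from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

z = StandardScaler().fit_transform(x)  # (x - mean) / std, using the population std
m = MinMaxScaler().fit_transform(x)    # (x - min) / (max - min)

print(z.ravel().round(3))
print(m.ravel().round(3))
```

Both outputs are linear rescalings of the input. With Min-Max scaling the four small values end up very close to 0 because the extreme value defines the top of the range.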

### Encoding Pros and Cons

We used OHE for columns with $<5$ categories and Label Encoding for columns with $\ge 5$ categories.

| Encoding Technique | Pros | Cons |
| :--- | :--- | :--- |
| **One-Hot Encoding (OHE)** | Creates binary features, preventing the model from inferring ordinality (rank) where none exists. | **Curse of Dimensionality:** Adds many new columns, increasing computation time and memory usage. |
| **Label Encoding (LE)** | Very efficient; adds only one column, saving memory and time. | **Implies False Ordinality:** Forces the model to assume a numerical relationship (e.g., 0 < 1 < 2), which is incorrect for non-ordinal features (like 'occupation'). |
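The trade-off in the table can be seen on a small, hypothetical `occupation` column: label encoding yields one integer column whose order is only alphabetical, while one-hot encoding yields one column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, not a sample of the real data.
toy = pd.DataFrame({"occupation": ["Sales", "Tech-support", "Sales", "Adm-clerical"]})

# Label encoding: a single integer column; codes follow the sorted class names.
le = LabelEncoder()
codes = le.fit_transform(toy["occupation"])

# One-hot encoding: one indicator column per category.
ohe = pd.get_dummies(toy, columns=["occupation"])

print(list(le.classes_))
print(list(codes))
print(list(ohe.columns))
```

The integer codes carry no meaning beyond identity, yet a linear or distance-based model would treat the gap between code 0 and code 2 as larger than the gap between 0 and 1.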

In [4]:
# Clean column headers by stripping whitespace
df.columns = df.columns.str.strip()
df


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [25]:
# Define feature lists
numerical_cols = ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
categorical_cols = df.select_dtypes(include='object').columns.tolist()
numerical_cols


['age',
 'fnlwgt',
 'education_num',
 'capital_gain',
 'capital_loss',
 'hours_per_week']

In [28]:
# --- 2. Handle Missing Values (Imputation) ---
# Missing values are represented by ' ?' in the categorical columns.
df[categorical_cols] = df[categorical_cols].replace(' ?', np.nan)


In [29]:
df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country,capital_net,work_intensity,capital_gain_log,sex_ Male,income_ >50K
0,39,7,77516,9,13,4,1,1,4,2174,0,40,39,2174,1.025641,7.684784,True,False
1,50,6,83311,9,13,2,4,0,4,0,0,13,39,0,0.260000,0.000000,True,False
2,38,4,215646,11,9,0,6,1,4,0,0,40,39,0,1.052632,0.000000,True,False
3,53,4,234721,1,7,2,6,0,2,0,0,40,39,0,0.754717,0.000000,True,False
4,28,4,338409,9,13,2,10,5,2,0,0,40,5,0,1.428571,0.000000,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,4,257302,7,12,2,13,5,4,0,0,38,39,0,1.407407,0.000000,False,False
32557,40,4,154374,11,9,2,7,0,4,0,0,40,39,0,1.000000,0.000000,True,True
32558,58,4,151910,11,9,6,1,4,4,0,0,40,39,0,0.689655,0.000000,False,False
32559,22,4,201490,11,9,4,1,3,4,0,0,20,39,0,0.909091,0.000000,True,False


In [30]:
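# Impute missing categorical values with the mode.
# Assign the result back instead of calling fillna(..., inplace=True) on df[col],
# which operates on an intermediate object and triggers a pandas FutureWarning.
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])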
# No missing values in numerical columns are expected based on dataset description.


In [33]:
# --- 3. Feature Engineering ---
# A. New Feature 1: Capital Net (Measures net financial activity)
df['capital_net'] = df['capital_gain'] - df['capital_loss']

## 2. Feature Engineering and Selection Analysis

### Rationale for Engineered Features

1.  **Capital Net** (`capital_gain - capital_loss`)
    * **Rationale:** This combines the two capital features into a single signed indicator of a person's net capital activity, which may be a more direct predictor of the income target than either column alone.
2.  **Work Intensity** (`hours_per_week / age`)
    * **Rationale:** This ratio expresses weekly working hours relative to age. A high value means long hours for a comparatively young person, which may highlight dedicated early-career professionals. The ratio falls with age even when hours stay constant, so it is best read alongside `age`.

### Log Transformation Justification

* **Feature:** `capital_gain`
* **Justification:** The `capital_gain` column is highly right-skewed (mostly zeros, with a few very large values). The $\ln(1+x)$ transformation is defined at zero and compresses the large values, which reduces the skew and the leverage of extreme observations. Because most entries remain zero, the result is still far from normal, but the reduced scale can help models that are sensitive to feature magnitude.
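The effect on skew can be checked with a small sketch. The series below is hypothetical and only imitates the "mostly zeros, a few large values" shape of `capital_gain`.

```python
import numpy as np
import pandas as pd

# Hypothetical values: 16 zeros and 4 increasingly large gains.
gains = pd.Series([0] * 16 + [2000, 5000, 15000, 99999], dtype="float64")
logged = np.log1p(gains)  # ln(1 + x), so zeros stay at zero

skew_before = gains.skew()
skew_after = logged.skew()

print(round(skew_before, 2), round(skew_after, 2))
```

The transform is invertible with `np.expm1`, so the original values can be recovered if a model's output needs to be interpreted on the raw scale.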

### Isolation Forest and Outliers

* **Outlier Impact:** Outliers (like extreme high `capital_gain` or low `hours_per_week` in relation to other features) disproportionately affect distance-based and variance-based models (e.g., K-Means, Linear Regression), skewing parameters and reducing model generalization.
* **Isolation Forest:** This algorithm is an efficient way to **identify and remove anomalies**. It works by randomly partitioning the data; outliers are typically separated from the majority of data by fewer splits, making them easy to isolate. We removed these identified outliers to create the final, robust dataset (`df_clean_final`).
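A minimal sketch of the labelling convention on synthetic data. The points and the random seed are assumptions chosen for illustration; `fit_predict` returns `1` for inliers and `-1` for outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 hypothetical inliers near the origin plus 3 obvious anomalies far away.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalies = np.array([[12.0, 12.0], [-11.0, 13.0], [14.0, -12.0]])
X = np.vstack([inliers, anomalies])

forest = IsolationForest(random_state=42)
labels = forest.fit_predict(X)  # 1 = inlier, -1 = outlier

# Keep only the rows labelled as inliers.
X_kept = X[labels == 1]

print(labels[-3:])
print(X.shape[0], X_kept.shape[0])
```

With the default `contamination='auto'`, some ordinary points near the edge of the cloud are usually flagged as well, so the number of removed rows is worth checking before discarding them.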

In [34]:
# B. New Feature 2: Work Intensity (Ratio of hours worked to age)
df['work_intensity'] = df['hours_per_week'] / df['age']
# Clean up potential division-by-zero results (an age of 0 would give inf).
# Assign back rather than using inplace=True on a column selection, which pandas warns about.
df['work_intensity'] = df['work_intensity'].replace([np.inf, -np.inf], 0)


In [35]:
# C. Transformation: Log Transformation on skewed 'capital_gain' (using log(1+x) due to zeros)
df['capital_gain_log'] = np.log1p(df['capital_gain'])


In [36]:
# Update numerical columns for final steps
final_numerical_cols = numerical_cols + ['capital_net', 'work_intensity', 'capital_gain_log']



In [37]:
# --- 4. Encoding Techniques (OHE and Label Encoding) ---
# Define encoding strategy based on category count (< 5 for OHE, >= 5 for LE)
ohe_cols = [col for col in categorical_cols if df[col].nunique() < 5]
le_cols = [col for col in categorical_cols if df[col].nunique() >= 5]


In [38]:
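# Label Encoding (LE) for high-cardinality features.
# A separate encoder is kept per column so each mapping can be inverted later.
label_encoders = {}
for col in le_cols:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])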


In [39]:
# One-Hot Encoding (OHE) for low-cardinality features
df = pd.get_dummies(df, columns=ohe_cols, drop_first=True)


In [40]:
# --- 5. Scaling Techniques (Standard and Min-Max) ---
# Scale all numerical features, including the new engineered ones
X_numeric = df[final_numerical_cols]


In [18]:
# A. Standard Scaling
scaler_s = StandardScaler()
df_standard_scaled = X_numeric.copy()
df_standard_scaled[final_numerical_cols] = scaler_s.fit_transform(X_numeric)


In [41]:
# B. Min-Max Scaling
scaler_mm = MinMaxScaler()
df_minmax_scaled = X_numeric.copy()
df_minmax_scaled[final_numerical_cols] = scaler_mm.fit_transform(X_numeric)



In [42]:
# --- 6. Feature Selection: Outlier Removal using Isolation Forest ---
# Apply Isolation Forest to the entire numerical feature set
iso_forest = IsolationForest(random_state=42)
iso_forest


In [21]:
# Fit and predict the outliers
outliers = iso_forest.fit_predict(X_numeric)


In [43]:
# Create the final, cleaned DataFrame by keeping inliers (prediction == 1); outliers are labelled -1
# Note: We use the original full dataframe 'df' to filter both numerical and encoded categorical features
df_clean_final = df[outliers == 1]
# df_clean_final now holds the complete pre-processed, encoded, engineered, and outlier-removed dataset.

In [None]:
# Install ppscore (if not available)
!pip install ppscore

import ppscore as pps

# --- Compute PPS Matrix on the final cleaned dataframe ---
# df_clean_final must be the final, processed DataFrame
pps_df = pps.matrix(df_clean_final)

# Select the top 10 feature relationships by PPS, excluding each column paired with itself.
# PPS is asymmetric (x -> y can differ from y -> x), so both directions are kept and
# pairs that happen to share the same score are not dropped.
pps_pairs = pps_df[pps_df['x'] != pps_df['y']]
pps_pivot_top = pps_pairs.set_index(['x', 'y'])['ppscore'].sort_values(ascending=False).head(10)

# The output of this cell should show the top 10 PPS scores
print("\n--- Top 10 Feature Relationships by Predictive Power Score (PPS) ---")
print(pps_pivot_top)



### Predictive Power Score (PPS) Comparison

* **What is PPS?** The Predictive Power Score measures the degree to which one column can predict another column, ranging from 0 (no predictive power) to 1 (perfect prediction). Unlike correlation, PPS is capable of detecting both **linear and non-linear** relationships, and works across both numerical and categorical data types.

* **Findings vs. Correlation:**
    * A standard **correlation matrix** captures only *linear* relationships between numerical features (e.g., between `education_num` and `income` once the target is mapped numerically), and it is symmetric.
    * **PPS** is directional and model-based, so it can also surface non-linear relationships and relationships involving **categorical** columns (such as predicting `income` from `marital_status` or `education_num`).
    * **Conclusion:** PPS complements the correlation matrix and gives a broader view of feature relevance for a predictive model, which makes it useful for initial feature selection. Each score comes from a fitted model, so it is a screening signal rather than a final ranking.
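The contrast with correlation can be illustrated without the `ppscore` package. The sketch below uses a simplified stand-in for the regression case: a cross-validated decision tree's mean absolute error compared with a median baseline. The helper `pps_like` is hypothetical and is not the `ppscore` implementation, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=500)
y = x ** 2 + rng.normal(scale=0.1, size=500)  # symmetric, non-linear relationship

# Pearson correlation only measures the linear component.
pearson = np.corrcoef(x, y)[0, 1]


def pps_like(feature, target):
    """Hypothetical PPS-style score: 1 - MAE(tree) / MAE(median baseline), floored at 0."""
    preds = cross_val_predict(
        DecisionTreeRegressor(random_state=0), feature.reshape(-1, 1), target, cv=4
    )
    mae_model = np.mean(np.abs(target - preds))
    mae_naive = np.mean(np.abs(target - np.median(target)))
    return max(0.0, 1.0 - mae_model / mae_naive)


score_xy = pps_like(x, y)  # predicting y from x
score_yx = pps_like(y, x)  # predicting x from y

print(round(pearson, 3), round(score_xy, 3), round(score_yx, 3))
```

Here `x` predicts `y` well even though their linear correlation is close to zero, while `y` says little about `x` because the sign of `x` cannot be recovered. That asymmetry is something a correlation matrix cannot express.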