## Income Classification Data Preprocessing Workflow

This notebook demonstrates a step-by-step data preprocessing pipeline for an income classification dataset. The workflow includes:

- Loading the raw dataset
- Dropping unnecessary columns
- Removing duplicate entries
- Handling missing and categorical data
- Outlier detection and transformation for numerical features
- Encoding categorical variables (Label Encoding and One-Hot Encoding)
- Saving the cleaned and transformed dataset for further analysis

Each step is documented and code is provided for reproducibility. This notebook prepares the data for machine learning modeling and ensures data quality for robust analysis.


In [23]:
# Import Required Libraries
import pandas as pd
import numpy as np

In [33]:
# Load the Dataset
data = pd.read_csv("../data/income.csv", na_values='?')
print("Initial shape:", data.shape)

Initial shape: (48842, 15)


## Data Preprocessing: Dropped Columns

During preprocessing we removed two columns: **`fnlwgt`** and **`educational-num`**. The reasoning for each decision is outlined below.

### 1. **`fnlwgt` Column (Final Weight)**

The `fnlwgt` column represents the final weight assigned to each individual in the Census. Although it is helpful for making population-level estimates, it does not capture any intrinsic characteristic of an individual that would help predict their income bracket.

**Why it was removed:**
- **Limited predictive value:** As a survey weight, `fnlwgt` does not provide direct information related to the income target.
- **Potential noise:** Keeping the column risks adding noise or misleading signals without improving accuracy.

### 2. **`educational-num` Column**

`educational-num` is a numeric representation of the categorical `education` feature. Because the two columns encode the same information, retaining both introduces redundancy.

**Why it was removed:**
- **Redundant with `education`:** The categorical `education` column already captures schooling level with clear labels.
- **Avoids multicollinearity:** Dropping the numeric duplicate reduces the risk of collinearity and keeps the feature set lean.

Removing these columns trims unnecessary dimensions and keeps the dataset focused on informative predictors.
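The redundancy claim for `educational-num` can be checked before dropping it. The following is a minimal sketch on a toy frame, assuming the column names used in this notebook; the rows and values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy rows using this notebook's column names; the values are illustrative only.
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "HS-grad", "Masters", "Bachelors"],
    "educational-num": [9, 13, 9, 14, 13],
})

# Count distinct numeric codes per label, and distinct labels per numeric code.
codes_per_label = toy.groupby("education")["educational-num"].nunique()
labels_per_code = toy.groupby("educational-num")["education"].nunique()

# True only when every label maps to exactly one code and vice versa.
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print("One-to-one mapping:", is_one_to_one)
```

Running the same two `groupby` checks on the real `data` frame before the drop would confirm whether the two columns carry identical information.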


In [25]:
# Drop Unnecessary Columns
data = data.drop(columns=["fnlwgt", "educational-num"])

### Removing Duplicate Entries

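Duplicate records can skew feature distributions and inflate counts. We remove exact duplicates so that each distinct record appears only once in the dataset. Rows are compared on the columns present at this point, so two different people with identical attribute values are treated as one record.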
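As a small illustration of how exact duplicates are detected, the sketch below counts them on a toy frame before dropping them. The frame and its values are made up for the example; `reset_index` is optional and is not used in the notebook cell.

```python
import pandas as pd

# Toy frame: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "age": [25, 38, 25, 44],
    "workclass": ["Private", "Private", "Private", "Local-gov"],
})

# duplicated() flags every repeat of an earlier row (keep="first" is the default).
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset_index renumbers the rows.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("duplicates found:", n_duplicates)
print("shape after removal:", deduplicated.shape)
```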


In [37]:
# Remove Duplicate Entries
data = data.drop_duplicates()
print("Shape after removing duplicates:", data.shape)

Shape after removing duplicates: (48790, 15)


### Missing Values Before Handling

In [38]:
# Check for missing values
print("\nMissing Values:\n", data.isnull().sum())


Missing Values:
 age                   0
workclass          2795
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2805
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      856
income                0
dtype: int64


### Handling Missing and Categorical Data

Placeholder values (`?`) are converted to `NaN`, categorical gaps are filled with `"Unknown"`, and the target label is stripped of whitespace before being mapped to binary classes.


In [27]:
# Handle Missing and Categorical Data
# "?" placeholders are already parsed as NaN by na_values='?' at load time;
# this replace is kept as a safeguard in case the data was loaded without it
data = data.replace("?", np.nan)

# Define categorical columns for processing
categorical_cols = ["workclass", "education", "marital-status", 
                    "occupation", "relationship", 
                    "gender","race", "native-country"]

# Fill missing categorical values with "Unknown" instead of dropping rows
for col in categorical_cols:
    data[col] = data[col].fillna("Unknown")

# Convert income target variable to binary format (1 for >50K, 0 for <=50K)
data["income"] = data["income"].astype(str).str.strip()
data["income"] = (data["income"] == ">50K").astype(int)

# Verify no missing values remain in the dataset
print("Missing values after handling:\n", data.isnull().sum())

Missing values after handling:
 age               0
workclass         0
education         0
marital-status    0
occupation        0
relationship      0
race              0
gender            0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
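The label conversion above sends every value other than `>50K` to 0, including any unexpected spelling. A stricter variant, sketched here on a toy series with an assumed pair of labels, uses an explicit dictionary so that unknown labels surface as `NaN` instead of silently becoming class 0.

```python
import pandas as pd

# Toy target column with stray whitespace, as handled in the cell above.
raw = pd.Series([" >50K", "<=50K ", ">50K", "<=50K"], name="income")

# Assumed label set for illustration: only these two spellings are accepted.
label_map = {">50K": 1, "<=50K": 0}
mapped = raw.str.strip().map(label_map)

# Any label missing from label_map becomes NaN, which makes it easy to detect.
unmapped = int(mapped.isna().sum())
if unmapped:
    raise ValueError(f"{unmapped} income labels were not recognised")

income = mapped.astype(int)
print(income.tolist())
```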


#### Numeric Features

We remove rows with extreme values in `age` and `hours-per-week` using the IQR rule, and compress the heavy-tailed distributions of `capital-gain` and `capital-loss` with log transforms instead of dropping rows.


##### Outlier Detection and Transformation

The helper below drops rows whose value lies more than 1.5×IQR below the first quartile or above the third quartile of the selected column. The skewed monetary fields are then transformed with `log1p`.
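As a small worked example of the 1.5×IQR rule, the sketch below computes the bounds on a toy series and compares filtering (which drops rows) with clipping (which keeps every row and caps the values). The numbers are illustrative and are not the quartiles of the real dataset.

```python
import pandas as pd

# Toy values, tightly concentrated around 40 with one low and one high extreme.
hours = pd.Series([10, 36, 38, 40, 40, 40, 42, 44, 90], name="hours-per-week")

q1 = hours.quantile(0.25)
q3 = hours.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Filtering removes the rows outside the bounds.
kept = hours[(hours >= lower) & (hours <= upper)]

# Clipping keeps all rows and caps values at the bounds instead.
clipped = hours.clip(lower=lower, upper=upper)

print(f"bounds: [{lower}, {upper}]")
print("rows after filtering:", len(kept))
print("rows after clipping:", len(clipped))
```

When a column is concentrated in a narrow range, the IQR is small and the bounds are tight, so filtering can discard a large share of rows. Clipping is one alternative if the row count matters more than the exact extreme values.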


In [28]:
# Handle Outliers in Numeric Features
numeric_cols_iqr = ["age", "hours-per-week"]   # Apply IQR here
skewed_cols = ["capital-gain", "capital-loss"] # Transform instead of remove

# Function: remove outliers with IQR
def remove_outliers_iqr(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
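    # .copy() returns an independent frame, so later column assignments
    # do not trigger pandas' SettingWithCopyWarning
    return df[(df[col] >= lower) & (df[col] <= upper)].copy()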

# Apply IQR only on selected numeric cols
for col in numeric_cols_iqr:
    before = data.shape[0]
    data = remove_outliers_iqr(data, col)
    after = data.shape[0]
    print(f"{col}: removed {before - after} outliers")

# Apply log transformation to skewed features
for col in skewed_cols:
    data[col] = np.log1p(data[col])   # log(1 + x) keeps 0 as 0
    print(f"{col}: applied log transformation")

age: removed 184 outliers
hours-per-week: removed 10321 outliers
capital-gain: applied log transformation
capital-loss: applied log transformation
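The effect of `log1p` on a heavy-tailed column can be seen on a few toy values. This sketch also shows that `np.expm1` inverts the transform, which is useful if the original monetary scale is needed later; the values are illustrative.

```python
import numpy as np

# Toy monetary values: mostly zero or small, with one very large entry.
gains = np.array([0.0, 9.0, 99.0, 99999.0])

# log1p computes log(1 + x), so zero stays zero and large values are compressed.
logged = np.log1p(gains)

# expm1 computes exp(x) - 1 and recovers the original scale.
restored = np.expm1(logged)

print(np.round(logged, 3))
print(np.round(restored, 3))
```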


#### Categorical Features

Categorical attributes require tailored encoders: ordinal fields keep their rank ordering, while nominal fields expand into binary indicators.


##### Encoding Strategy

We combine label encoding for the ordinal **`education`** feature with one-hot encoding for the remaining nominal columns.


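**Label encoding for the `education` column**: the categories have an intrinsic order, so an integer encoding is a natural fit. Note that `LabelEncoder` numbers the classes in sorted (alphabetical) label order, so the resulting codes do not by themselves follow schooling level. Preserving the rank requires supplying an explicit category order.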


In [30]:
import joblib
from sklearn.preprocessing import LabelEncoder


# Initialize the LabelEncoder
encoder = LabelEncoder()

# Fit the encoder to the 'education' column and transform it
data['education'] = encoder.fit_transform(data['education'])

# Save the fitted encoder to a file
joblib.dump(encoder, '../tools/education_encoder.pkl')

# Check the transformed dataset
print(data)

       age     workclass  education      marital-status         occupation  \
0       25       Private          1       Never-married  Machine-op-inspct   
1       38       Private         11  Married-civ-spouse    Farming-fishing   
2       28     Local-gov          7  Married-civ-spouse    Protective-serv   
3       44       Private         15  Married-civ-spouse  Machine-op-inspct   
4       18       Unknown         15       Never-married            Unknown   
...    ...           ...        ...                 ...                ...   
48835   53       Private         12  Married-civ-spouse    Exec-managerial   
48836   22       Private         15       Never-married    Protective-serv   
48837   27       Private          7  Married-civ-spouse       Tech-support   
48839   58       Private         11             Widowed       Adm-clerical   
48841   52  Self-emp-inc         11  Married-civ-spouse    Exec-managerial   

        relationship   race  gender  capital-gain  capital-loss
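`LabelEncoder` assigns codes according to the sorted order of the labels, which for strings is alphabetical. The sketch below contrasts that with an explicit ordinal mapping built with `pd.Categorical`. The four labels and the ranking in `assumed_order` are assumptions made for illustration and would need to be extended to every level present in the real column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with four education labels.
education = pd.Series(["HS-grad", "Doctorate", "Bachelors", "Masters", "HS-grad"])

# LabelEncoder sorts the labels first, so the codes are alphabetical:
# Bachelors=0, Doctorate=1, HS-grad=2, Masters=3.
alphabetical = LabelEncoder().fit_transform(education)

# Assumed ranking from lowest to highest schooling level (illustrative only).
assumed_order = ["HS-grad", "Bachelors", "Masters", "Doctorate"]

# An ordered Categorical assigns codes by position in assumed_order.
ordinal = pd.Categorical(education, categories=assumed_order, ordered=True).codes

print("LabelEncoder codes:", alphabetical.tolist())
print("Explicit ordinal codes:", ordinal.tolist())
```

With the explicit order, a larger code always means a higher level, which is the property an ordinal encoding is meant to provide.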

Apply **one-hot encoding** to the nominal columns (`workclass`, `marital-status`, `occupation`, `relationship`, `race`, `native-country`, `gender`) so each category becomes its own binary indicator.


In [31]:
# Apply one-hot encoding to the remaining nominal categorical columns
columns_to_encode = ['workclass', 'marital-status', 'occupation', 'relationship', 'race', 'native-country', 'gender']
data_encoded = pd.get_dummies(data, columns=columns_to_encode)

# Check the transformed dataset
print(data_encoded)


       age  education  capital-gain  capital-loss  hours-per-week  income  \
0       25          1      0.000000           0.0              40       0   
1       38         11      0.000000           0.0              50       0   
2       28          7      0.000000           0.0              40       1   
3       44         15      2.297326           0.0              40       1   
4       18         15      0.000000           0.0              30       0   
...    ...        ...           ...           ...             ...     ...   
48835   53         12      0.000000           0.0              40       1   
48836   22         15      0.000000           0.0              40       0   
48837   27          7      0.000000           0.0              38       0   
48839   58         11      0.000000           0.0              40       0   
48841   52         11      2.362501           0.0              40       1   

       workclass_Federal-gov  workclass_Local-gov  workclass_Never-worked  
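To make the column naming of `pd.get_dummies` concrete, here is a minimal sketch on a toy frame. The `dtype=int` argument is an optional choice made here so the indicators print as 0/1; the notebook cell uses the default dtype.

```python
import pandas as pd

# Toy frame with one numeric and one nominal column; values are illustrative.
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "gender": ["Male", "Female", "Male"],
})

# Each category becomes a column named <original column>_<category>.
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

For linear models, passing `drop_first=True` is a common way to avoid perfectly collinear indicator columns; tree-based models generally do not need it.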

In [32]:
# Save the preprocessed data to a new CSV file
data_encoded.to_csv("../data/cleaned.csv", index=False)

print("Dataset saved as 'cleaned.csv' in the 'data' folder.")

Dataset saved as 'cleaned.csv' in the 'data' folder.
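A downstream notebook would typically reload both the cleaned CSV and the saved encoder. The sketch below shows that round trip in a temporary directory with toy data, so it runs without the project's `../data` and `../tools` folders. The file names mirror the ones used above, and everything else is illustrative.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "cleaned.csv"
    encoder_path = Path(tmp) / "education_encoder.pkl"

    # Fit a toy encoder and write a tiny encoded frame, mimicking the save step.
    encoder = LabelEncoder().fit(["Bachelors", "HS-grad", "Masters"])
    frame = pd.DataFrame({
        "education": encoder.transform(["HS-grad", "Masters"]),
        "income": [0, 1],
    })
    frame.to_csv(csv_path, index=False)
    joblib.dump(encoder, encoder_path)

    # Reload both artifacts and map the integer codes back to their labels.
    reloaded = pd.read_csv(csv_path)
    restored_encoder = joblib.load(encoder_path)
    labels = restored_encoder.inverse_transform(reloaded["education"])

print(labels.tolist())
```

Keeping the fitted encoder next to the data means the same label-to-code mapping can be applied to new records at prediction time.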
