# Income Classification Data Preprocessing Workflow

This notebook demonstrates a step-by-step data preprocessing pipeline for an income classification dataset. The workflow includes:

- Loading the raw dataset
- Dropping unnecessary columns
- Removing duplicate entries
- Handling missing and categorical data
- Outlier detection and transformation for numerical features
- Encoding categorical variables (Label Encoding and One-Hot Encoding)
- Saving the cleaned and transformed dataset for further analysis

Each step is documented, with code provided for reproducibility. The result is a cleaned, encoded dataset ready for machine learning modeling.

In [1]:
# Import Required Libraries
import pandas as pd
import numpy as np

In [2]:
# Load the Dataset
data = pd.read_csv("../data/income.csv")
print("Initial shape:", data.shape)

Initial shape: (48842, 15)
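
The `?` placeholders in this dataset are converted to NaN in a later cell. As an alternative sketch, `pd.read_csv` can do that conversion at load time through its `na_values` parameter. The inline CSV text below is made-up sample data, not rows from `income.csv`.

```python
import io

import pandas as pd

# Made-up sample rows (assumption): one row uses "?" as a missing-value placeholder.
csv_text = "age,workclass,income\n25,Private,<=50K\n18,?,<=50K\n"

# na_values="?" turns the placeholder into NaN while parsing;
# skipinitialspace=True drops spaces that follow a delimiter.
sample = pd.read_csv(io.StringIO(csv_text), na_values="?", skipinitialspace=True)

missing_workclass = int(sample["workclass"].isnull().sum())
print(missing_workclass)  # 1
```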


In [3]:
# Drop Unnecessary Columns
data = data.drop(columns=["fnlwgt", "educational-num"])

In [4]:
# Remove Duplicate Entries
data = data.drop_duplicates()
print("Shape after removing duplicates:", data.shape)

Shape after removing duplicates: (42468, 13)
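
Because the two columns are dropped before deduplication, rows that differed only in a dropped column now count as duplicates. The sketch below illustrates the effect on toy data; the `weight` column is a hypothetical stand-in for a sampling-weight column such as `fnlwgt`, and the values are invented.

```python
import pandas as pd

# Toy data (assumption): the first two rows differ only in 'weight'.
toy = pd.DataFrame({
    "age": [25, 25, 38],
    "education": ["11th", "11th", "HS-grad"],
    "weight": [1001, 2002, 3003],
})

# Count duplicate rows with and without the weight column.
dupes_with_weight = int(toy.duplicated().sum())
dupes_without_weight = int(toy.drop(columns=["weight"]).duplicated().sum())

print(dupes_with_weight, dupes_without_weight)  # 0 1
```

Comparing the two counts on the real data would show how many of the removed rows are exact repeats and how many only look identical after the columns are dropped.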


In [5]:
# Handle Missing and Categorical Data
# Replace placeholder values (?) with proper NaN values 
data = data.replace("?", np.nan)

# Define categorical columns for processing
categorical_cols = ["workclass", "education", "marital-status",
                    "occupation", "relationship",
                    "gender", "race", "native-country"]

# Fill missing categorical values with "Unknown" instead of dropping rows
for col in categorical_cols:
    data[col] = data[col].fillna("Unknown")

# Convert income target variable to binary format (1 for >50K, 0 for <=50K).
# Any label other than ">50K" also maps to 0.
data["income"] = data["income"].astype(str).str.strip()
data["income"] = (data["income"] == ">50K").astype(int)

# Verify no missing values remain in the dataset
print("Missing values after handling:\n", data.isnull().sum())

Missing values after handling:
 age               0
workclass         0
education         0
marital-status    0
occupation        0
relationship      0
race              0
gender            0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
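
The income conversion above maps every label other than `>50K` to 0, so a misspelled or unexpected label would pass silently. A stricter variant, sketched here on invented labels, uses an explicit mapping and raises an error for anything it does not recognise. The `mapping` dictionary and the sample values are assumptions for illustration.

```python
import pandas as pd

# Invented labels (assumption), including stray whitespace.
labels = pd.Series([" >50K", "<=50K ", ">50K", "<=50K"])

mapping = {">50K": 1, "<=50K": 0}
cleaned = labels.astype(str).str.strip()

# Fail loudly if a label falls outside the expected set.
unexpected = set(cleaned.unique()) - set(mapping)
if unexpected:
    raise ValueError(f"Unexpected income labels: {sorted(unexpected)}")

income = cleaned.map(mapping).astype(int)
print(income.tolist())  # [1, 0, 1, 0]
```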


In [6]:
# Handle Outliers in Numeric Features
numeric_cols_iqr = ["age", "hours-per-week"]   # Apply IQR here
skewed_cols = ["capital-gain", "capital-loss"] # Transform instead of remove

# Function: remove outliers with IQR
def remove_outliers_iqr(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[col] >= lower) & (df[col] <= upper)]

# Apply IQR only on selected numeric cols
for col in numeric_cols_iqr:
    before = data.shape[0]
    data = remove_outliers_iqr(data, col)
    after = data.shape[0]
    print(f"{col}: removed {before - after} outliers")

# Apply log transformation to skewed features
for col in skewed_cols:
    data[col] = np.log1p(data[col])   # log(1 + x) maps 0 to 0
    print(f"{col}: applied log transformation")

age: removed 184 outliers
hours-per-week: removed 10321 outliers
capital-gain: applied log transformation
capital-loss: applied log transformation
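
The IQR filter removed 10321 rows for `hours-per-week`, which suggests the fences are narrow for that column. The sketch below computes the same 1.5 × IQR fences on a small invented series and contrasts filtering with clipping, an alternative that caps extreme values instead of dropping rows. The helper name `iqr_bounds` and the sample values are assumptions, and clipping is not what the notebook does.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented weekly hours (assumption): most values cluster around 40-50.
hours = pd.Series([40, 40, 40, 45, 45, 50, 20, 99])
lower, upper = iqr_bounds(hours)

# Filtering drops the rows outside the fences.
kept = hours[(hours >= lower) & (hours <= upper)]

# Clipping keeps every row and caps values at the fences.
clipped = hours.clip(lower=lower, upper=upper)

print(lower, upper)               # 30.625 55.625
print(len(kept), len(clipped))    # 6 8
```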


In [7]:
print(data.head())

   age  workclass     education      marital-status         occupation  \
0   25    Private          11th       Never-married  Machine-op-inspct   
1   38    Private       HS-grad  Married-civ-spouse    Farming-fishing   
2   28  Local-gov    Assoc-acdm  Married-civ-spouse    Protective-serv   
3   44    Private  Some-college  Married-civ-spouse  Machine-op-inspct   
4   18    Unknown  Some-college       Never-married            Unknown   

  relationship   race  gender  capital-gain  capital-loss  hours-per-week  \
0    Own-child  Black    Male      0.000000           0.0              40   
1      Husband  White    Male      0.000000           0.0              50   
2      Husband  White    Male      0.000000           0.0              40   
3      Husband  Black    Male      8.947546           0.0              40   
4    Own-child  White  Female      0.000000           0.0              30   

  native-country  income  
0  United-States       0  
1  United-States       0  
2  United-S

Numerical Columns

In [8]:
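# capital-gain and capital-loss are highly skewed, so a log transformation is
# used to reduce the skew instead of removing outliers.
# Caution: cell [6] has already applied np.log1p to both columns, so the calls
# below apply it a second time (the 8.947546 shown in the preview above becomes
# 2.297326 in the outputs that follow). Skip this cell if a single log
# transform is intended.

# Apply log transformation to capital-gain (log1p maps 0 to 0)
data['capital-gain'] = np.log1p(data['capital-gain'])

# Apply log transformation to capital-loss (log1p maps 0 to 0)
data['capital-loss'] = np.log1p(data['capital-loss'])

# Only the last expression in a cell is displayed
data['capital-loss'].head()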


0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: capital-loss, dtype: float64

hours-per-week and age are not highly skewed, so they are left untransformed. income has already been converted to binary 0/1 values.
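
`np.log1p` is not idempotent, so running it on a column that was already transformed compresses the values again. The sketch below shows the effect and how `np.expm1` inverts a single application. The raw value 7688 is an assumption, chosen because `log1p(7688)` matches the 8.947546 shown in the earlier preview.

```python
import numpy as np

# Assumed raw capital-gain values (not read from the dataset).
raw = np.array([0.0, 7688.0])

once = np.log1p(raw)        # single transform
twice = np.log1p(once)      # transform applied a second time
restored = np.expm1(once)   # expm1 undoes one log1p

print(np.round(once, 6))    # second entry is about 8.947546
print(np.round(twice, 6))   # second entry is about 2.297326
```

Zero stays at zero under any number of applications, which is why a repeated transform is easy to miss when most rows are zero.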

Categorical Columns

In [9]:
import joblib
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
encoder = LabelEncoder()

# Fit the encoder to the 'education' column and transform it.
# Codes follow the alphabetical order of the labels (e.g. '11th' -> 1,
# 'HS-grad' -> 11), not the level of education.
data['education'] = encoder.fit_transform(data['education'])

# Save the fitted encoder to a file
joblib.dump(encoder, '../tools/education_encoder.pkl')

# Check the transformed dataset
print(data)

       age     workclass  education      marital-status         occupation  \
0       25       Private          1       Never-married  Machine-op-inspct   
1       38       Private         11  Married-civ-spouse    Farming-fishing   
2       28     Local-gov          7  Married-civ-spouse    Protective-serv   
3       44       Private         15  Married-civ-spouse  Machine-op-inspct   
4       18       Unknown         15       Never-married            Unknown   
...    ...           ...        ...                 ...                ...   
48835   53       Private         12  Married-civ-spouse    Exec-managerial   
48836   22       Private         15       Never-married    Protective-serv   
48837   27       Private          7  Married-civ-spouse       Tech-support   
48839   58       Private         11             Widowed       Adm-clerical   
48841   52  Self-emp-inc         11  Married-civ-spouse    Exec-managerial   

        relationship   race  gender  capital-gain  capital-loss
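
`LabelEncoder` assigns codes by sorting the labels, so the integers for `education` do not increase with the level of schooling. If an ordered encoding is wanted, one option is scikit-learn's `OrdinalEncoder` with an explicit category list. The sketch below uses four labels only, and the ranking in `education_order` is an assumption for illustration, not the notebook's method.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Four example labels (assumption); the real column has more categories.
education = pd.DataFrame(
    {"education": ["11th", "HS-grad", "Some-college", "Bachelors"]}
)

# LabelEncoder: codes follow sorted label order.
alphabetical = LabelEncoder().fit_transform(education["education"])

# OrdinalEncoder: codes follow the order given in `categories`.
education_order = ["11th", "HS-grad", "Some-college", "Bachelors"]  # assumed ranking
ordered = OrdinalEncoder(categories=[education_order]).fit_transform(education)

print(alphabetical.tolist())     # [0, 2, 3, 1]
print(ordered.ravel().tolist())  # [0.0, 1.0, 2.0, 3.0]
```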

In [10]:
# One-hot encode the nominal categorical columns.
# This first pass leaves gender as text; the next cell redoes the encoding with gender included.
columns_to_encode = ['workclass', 'occupation', 'relationship', 'race', 'native-country', 'marital-status']
data_encoded = pd.get_dummies(data, columns=columns_to_encode)

# Check the transformed dataset
print(data_encoded)

       age  education  gender  capital-gain  capital-loss  hours-per-week  \
0       25          1    Male      0.000000           0.0              40   
1       38         11    Male      0.000000           0.0              50   
2       28          7    Male      0.000000           0.0              40   
3       44         15    Male      2.297326           0.0              40   
4       18         15  Female      0.000000           0.0              30   
...    ...        ...     ...           ...           ...             ...   
48835   53         12    Male      0.000000           0.0              40   
48836   22         15    Male      0.000000           0.0              40   
48837   27          7  Female      0.000000           0.0              38   
48839   58         11  Female      0.000000           0.0              40   
48841   52         11  Female      2.362501           0.0              40   

       income  workclass_Federal-gov  workclass_Local-gov  \
0           0 

In [11]:
# One-hot encode all nominal categorical columns, including gender
columns_to_encode = ['workclass', 'marital-status', 'occupation', 'relationship', 'race', 'native-country','gender']
data_encoded = pd.get_dummies(data, columns=columns_to_encode)

# Check the transformed dataset
print(data_encoded)


       age  education  capital-gain  capital-loss  hours-per-week  income  \
0       25          1      0.000000           0.0              40       0   
1       38         11      0.000000           0.0              50       0   
2       28          7      0.000000           0.0              40       1   
3       44         15      2.297326           0.0              40       1   
4       18         15      0.000000           0.0              30       0   
...    ...        ...           ...           ...             ...     ...   
48835   53         12      0.000000           0.0              40       1   
48836   22         15      0.000000           0.0              40       0   
48837   27          7      0.000000           0.0              38       0   
48839   58         11      0.000000           0.0              40       0   
48841   52         11      2.362501           0.0              40       1   

       workclass_Federal-gov  workclass_Local-gov  workclass_Never-worked  
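
`pd.get_dummies` creates one column per category that appears in the data it is given, so new data with fewer categories yields fewer columns. A minimal sketch, on invented rows, of aligning new data to the training columns with `reindex`; the frames `train` and `new_rows` are assumptions for illustration.

```python
import pandas as pd

# Invented training rows (assumption).
train = pd.DataFrame({"age": [25, 38, 44], "gender": ["Male", "Female", "Male"]})
train_encoded = pd.get_dummies(train, columns=["gender"], dtype=int)

# New data that contains only one of the two categories.
new_rows = pd.DataFrame({"age": [30], "gender": ["Female"]})
new_encoded = pd.get_dummies(new_rows, columns=["gender"], dtype=int)

# Add any missing dummy columns (filled with 0) and match the column order.
new_encoded = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))  # ['age', 'gender_Female', 'gender_Male']
print(new_encoded.iloc[0].tolist()) # [30, 1, 0]
```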

In [12]:
# Save the preprocessed data to a new CSV file
data_encoded.to_csv("../data/cleaned.csv", index=False)

print("Dataset saved as 'cleaned.csv' in the 'data' folder.")

Dataset saved as 'cleaned.csv' in the 'data' folder.
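
A quick way to confirm the export is to read the file back and compare it with the frame that was written. The sketch below does this round trip on a tiny invented frame in a temporary directory, so it does not touch `../data/cleaned.csv`. The frame contents are assumptions.

```python
import os
import tempfile

import pandas as pd

# Tiny invented frame (assumption) standing in for the encoded dataset.
frame = pd.DataFrame({"age": [25, 38], "income": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned.csv")
    frame.to_csv(path, index=False)   # same call shape as the notebook's export
    reloaded = pd.read_csv(path)

# Basic checks: same shape, same columns, no missing values.
same_shape = reloaded.shape == frame.shape
same_columns = list(reloaded.columns) == list(frame.columns)
no_missing = int(reloaded.isnull().sum().sum()) == 0

print(same_shape, same_columns, no_missing)  # True True True
```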
