# Task 4: Feature Encoding & Scaling
# Dataset: Adult Income Dataset


import pandas as pd
import numpy as np

# Load Dataset

df = pd.read_csv("adult.csv")

# Handle Missing Values

df.replace("?", np.nan, inplace=True)
df.dropna(inplace=True)

# Separate Features & Target

target = "income"
X = df.drop(columns=[target])
y = df[target]

# Identify Categorical & Numerical Columns

categorical_cols = X.select_dtypes(include="object").columns
numerical_cols = X.select_dtypes(include=["int64", "float64"]).columns

# Label Encoding (Target Variable)

y_encoded = y.map({"<=50K": 0, ">50K": 1})

# One-Hot Encoding (Categorical Features)

# dtype=int emits 0/1 indicator columns; recent pandas versions default to True/False.
X_encoded = pd.get_dummies(X, columns=categorical_cols, drop_first=True, dtype=int)

# Feature Scaling (Standardization)
# Formula: z = (x - mean) / std
# Note: pandas' .std() defaults to the sample standard deviation (ddof=1).

X_scaled = X_encoded.copy()

for col in numerical_cols:
    mean = X_scaled[col].mean()
    std = X_scaled[col].std()
    X_scaled[col] = (X_scaled[col] - mean) / std

# Combine Features & Target

processed_df = X_scaled.copy()
processed_df["income"] = y_encoded.values

# Save Preprocessed Dataset

processed_df.to_csv("adult_income_processed.csv", index=False)

# Verification Output

print("Preprocessing completed successfully!")
print("Processed dataset shape:", processed_df.shape)
print("File saved as: adult_income_processed.csv")


Preprocessing completed successfully!
Processed dataset shape: (44513, 97)
File saved as: adult_income_processed.csv


## Task 4: Feature Encoding & Scaling
Dataset: Adult Income Dataset

### Steps Performed
- Loaded the dataset, treated `?` entries as missing values, and dropped the affected rows.
- Identified categorical and numerical features.
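- Applied label encoding to the target variable (income), mapping `<=50K` to 0 and `>50K` to 1.
- Applied one-hot encoding to the categorical features, treating them as unordered and dropping the first level of each to avoid redundant columns.
- Standardized numerical features to z-scores using each column's mean and standard deviation.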
- Saved the preprocessed dataset for model training.
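
The steps above can be sketched end to end on a small in-memory frame, which makes them easy to check without `adult.csv`. This is a minimal sketch, not the notebook's exact code: the function name `preprocess` and the toy rows are hypothetical, and the column names only mimic the Adult data.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame, target: str = "income") -> pd.DataFrame:
    """Clean, encode, and standardize a frame shaped like the Adult data."""
    # Treat "?" as missing and drop incomplete rows.
    df = df.replace("?", np.nan).dropna()

    # Separate features and target, then label-encode the target.
    X = df.drop(columns=[target])
    y = df[target].map({"<=50K": 0, ">50K": 1})

    # Numeric columns are scaled; everything else is one-hot encoded.
    numerical_cols = X.select_dtypes(include="number").columns
    categorical_cols = X.select_dtypes(exclude="number").columns

    out = pd.get_dummies(X, columns=categorical_cols, drop_first=True, dtype=int)

    # z = (x - mean) / std, using pandas' sample standard deviation (ddof=1).
    for col in numerical_cols:
        out[col] = (out[col] - out[col].mean()) / out[col].std()

    out[target] = y.values
    return out


# Hypothetical toy rows; the second one has a missing workclass.
toy = pd.DataFrame({
    "age": [25, 38, 52, 46, 31],
    "workclass": ["Private", "?", "Self-emp", "Private", "Gov"],
    "hours-per-week": [40, 50, 60, 45, 35],
    "income": ["<=50K", ">50K", ">50K", ">50K", "<=50K"],
})

processed = preprocess(toy)
print(processed.columns.tolist())
print(processed.shape)
```

With `drop_first=True`, the alphabetically first level (`Gov` in the toy data) becomes the implicit baseline, so three workclass levels produce two indicator columns.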

### Why Encoding & Scaling?
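- Encoding converts categorical data into the numerical form that most ML models require.
- Scaling puts numerical features on a comparable range, so features with large magnitudes do not dominate distance-based or gradient-based models. This often improves convergence and accuracy for those models.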
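
To illustrate the scaling point, the sketch below measures how much of the squared Euclidean distance between two rows comes from the age column, before and after standardization. The three rows and the column meanings (age, capital gain) are made-up values chosen for illustration, not taken from the dataset.

```python
import numpy as np

# Hypothetical rows; columns are (age, capital_gain).
X = np.array([
    [25.0,    0.0],
    [60.0,  100.0],
    [26.0, 5000.0],
])


def age_share(a, b):
    """Fraction of the squared distance between a and b contributed by column 0."""
    squared_diff = (a - b) ** 2
    return float(squared_diff[0] / squared_diff.sum())


# Rows 0 and 1 differ by 35 years of age but only 100 units of capital gain.
raw_share = age_share(X[0], X[1])

# Standardize each column: z = (x - mean) / std.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_share = age_share(X_scaled[0], X_scaled[1])

print(f"age share of squared distance, raw:    {raw_share:.3f}")
print(f"age share of squared distance, scaled: {scaled_share:.3f}")
```

On the raw values, the small capital-gain difference outweighs a 35-year age gap simply because of its units. After standardization, the age gap accounts for almost all of the distance between the two rows.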

### Algorithms That Need Scaling
- Logistic Regression (especially when regularized or fit with gradient-based solvers)
- KNN
- SVM
- K-Means
- PCA
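
PCA is a convenient case to check numerically, because it works directly on feature variances. The sketch below uses synthetic data (the `age` and `gain` arrays are assumptions, not the Adult columns) and computes the first principal component from the covariance matrix with plain NumPy rather than a library PCA class.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic features: age in years and a correlated amount on a much larger scale.
age = rng.normal(40, 10, size=500)
gain = 100 * age + rng.normal(0, 2000, size=500)
X = np.column_stack([age, gain])


def first_component(data):
    """Return the unit eigenvector of the covariance matrix with the largest eigenvalue."""
    cov = np.cov(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    return eigenvectors[:, -1]


raw_pc = first_component(X)

# Standardize each column, then recompute the first component.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scaled_pc = first_component(X_scaled)

print("raw loadings (age, gain):   ", np.abs(raw_pc).round(3))
print("scaled loadings (age, gain):", np.abs(scaled_pc).round(3))
```

Without scaling, the first component points almost entirely along the large-magnitude feature. After standardization, both features load equally on it, which is the behaviour the list above is guarding against for variance-based and distance-based methods.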

### Deliverables
- Preprocessed dataset: `adult_income_processed.csv`
- Jupyter Notebook with pandas- and NumPy-based preprocessing code

### Outcome
- Dataset is fully numeric, with an encoded target and standardized numerical features, and is ready for model training.
- Understanding of feature engineering basics achieved.
