#### DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING
Dataset: Adult Income Dataset (Census Data)
Objective

The objective of this assignment is to apply essential data preprocessing, encoding, scaling, and feature engineering techniques to prepare the Adult dataset for machine learning models. Done carefully, these steps tend to improve model accuracy and training stability, and they make the resulting features easier to interpret.

#### 1. Data Loading and Exploration

In [1]:
import pandas as pd
import numpy as np

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv("C:/Users/harik/Data science Assignment/adult_with_headers (1).csv")
df.head()

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [5]:
df.describe()

,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [6]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

#### 2. Handling Missing Values

In [7]:
# Treat '?' placeholders as missing values (only cells exactly equal to '?' are matched).
df.replace('?', np.nan, inplace=True)
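
Because `replace('?', np.nan)` only matches cells that are exactly `'?'`, a placeholder stored with surrounding whitespace (for example `' ?'`) would survive the replacement. Whether this file contains such values is not visible from the outputs above, so the following is only a minimal sketch on a made-up frame (`demo` and its values are hypothetical, not taken from the dataset). It strips whitespace before replacing and compares the number of NaNs produced.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the ' ?' entry mimics a padded placeholder.
demo = pd.DataFrame({
    "workclass": ["Private", " ?", "State-gov", "?"],
    "age": [39, 50, 38, 53],
})

# Exact-match replacement misses the padded ' ?' entry.
naive = demo.replace("?", np.nan)
naive_missing = int(naive["workclass"].isna().sum())

# Strip whitespace in text columns first, then replace.
cleaned = demo.copy()
for col in cleaned.columns:
    if pd.api.types.is_string_dtype(cleaned[col]):
        cleaned[col] = cleaned[col].str.strip()
cleaned = cleaned.replace("?", np.nan)
cleaned_missing = int(cleaned["workclass"].isna().sum())

print(naive_missing, cleaned_missing)
```

If the two counts differ on the real data, the stripped version is the one that reflects the actual number of missing entries.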

In [8]:
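# Impute categorical columns with the most frequent value (mode).
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
for col in df.select_dtypes(include='object'):
    df[col] = df[col].fillna(df[col].mode()[0])

# Impute numeric columns with the median.
for col in df.select_dtypes(include=['int64', 'float64']):
    df[col] = df[col].fillna(df[col].median())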

#### 3. Feature Scaling

In [9]:
num_cols = df.select_dtypes(include=['int64', 'float64']).columns

In [10]:
#a) Standard Scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_standard = df.copy()
df_standard[num_cols] = scaler.fit_transform(df[num_cols])
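
To make explicit what `StandardScaler` computes, here is a small sketch that reproduces it by hand on a few illustrative age values (the array `ages` is made up for the example). Each value has the column mean subtracted and is divided by the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only; not the full age column.
ages = np.array([[39.0], [50.0], [38.0], [53.0], [28.0]])

scaled = StandardScaler().fit_transform(ages)

# Same computation by hand: subtract the mean, divide by the
# population standard deviation (ddof=0).
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and standard deviation 1, which is why this method suits models that are sensitive to feature magnitude.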

In [11]:
#b) Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler()
df_minmax = df.copy()
df_minmax[num_cols] = minmax.fit_transform(df[num_cols])
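
Both scalers above are fit on the entire frame. If the data is later split into training and test sets, that means the test rows have already influenced the scaling parameters. A common alternative, sketched here on synthetic data (`X`, the split ratio, and the variable names are assumptions for illustration), is to fit the scaler on the training rows only and reuse it on the held-out rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for two numeric columns.
rng = np.random.default_rng(0)
X = rng.integers(1, 100, size=(20, 2)).astype(float)

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

mm = MinMaxScaler()
X_train_scaled = mm.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = mm.transform(X_test)        # reuse those parameters on held-out rows

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

With this approach, test values can fall slightly outside the [0, 1] range, which is expected: the scaler has never seen them.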

#### 4. Encoding Techniques

In [12]:
cat_cols = df.select_dtypes(include='object').columns

In [13]:
low_cardinality = [col for col in cat_cols if df[col].nunique() < 5]
high_cardinality = [col for col in cat_cols if df[col].nunique() >= 5]

In [14]:
#a) One-Hot Encoding (Low Cardinality)
df = pd.get_dummies(df, columns=low_cardinality, drop_first=True)
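
As a small illustration of what `drop_first=True` does, the sketch below one-hot encodes a toy `sex` column (the frame `toy` is made up). The `dtype=int` argument is an addition for readability; without it, recent pandas versions return boolean indicator columns.

```python
import pandas as pd

# Hypothetical three-row frame for illustration.
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [39, 28, 50]})

# drop_first=True removes the first level (alphabetically) of each
# encoded column, so a two-level feature becomes a single indicator.
encoded = pd.get_dummies(toy, columns=["sex"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded["sex_Male"].tolist())
```

The `Female` level is dropped, leaving a single `sex_Male` indicator. Dropping one level avoids a redundant column, since its value is implied by the others.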

In [15]:
#b) Label Encoding (High Cardinality)
from sklearn.preprocessing import LabelEncoder

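# Fit a separate encoder per column so each mapping can be inverted later.
# Note: the integer codes follow sorted label order and carry no real ranking.
encoders = {}
for col in high_cardinality:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])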
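
scikit-learn documents `LabelEncoder` as a tool for target labels; for feature columns, `OrdinalEncoder` does the same integer mapping on several columns at once and keeps every mapping available. The sketch below is one possible alternative, not the approach used in this notebook, and the frame `toy` is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame with two categorical features.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "race": ["White", "Black", "White"],
})

enc = OrdinalEncoder()
codes = enc.fit_transform(toy[["workclass", "race"]])

# categories_ holds one sorted array of levels per column;
# inverse_transform maps the codes back to the original labels.
restored = enc.inverse_transform(codes)

print(codes.tolist())
print([list(c) for c in enc.categories_])
```

Either way, the resulting integers imply an ordering that the categories do not have, so tree-based models tend to cope with this encoding better than linear ones.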

#### 5. Feature Engineering

In [16]:
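# Net capital change: gains minus losses.
df['capital_diff'] = df['capital_gain'] - df['capital_loss']
# Weekly working hours relative to age (age is at least 17, so no division by zero).
df['hours_per_age'] = df['hours_per_week'] / df['age']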

#### 6. Feature Transformation

In [17]:
# log1p computes log(1 + x), so rows with zero capital gain map to 0 instead of -inf.
df['log_capital_gain'] = np.log1p(df['capital_gain'])

In [19]:
df['log_capital_gain']

0        7.684784
1        0.000000
2        0.000000
3        0.000000
4        0.000000
           ...   
32556    0.000000
32557    0.000000
32558    0.000000
32559    0.000000
32560    9.617471
Name: log_capital_gain, Length: 32561, dtype: float64
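
The first value in the output can be checked by hand: row 0 has `capital_gain = 2174`, and log(1 + 2174) ≈ 7.684784. The sketch below repeats that check on a few illustrative values (the array `gains` is chosen for the example) and shows that `np.expm1` inverts the transform.

```python
import numpy as np

# Illustrative values: zero, the row-0 gain, a mid-range gain, and the column maximum.
gains = np.array([0.0, 2174.0, 15024.0, 99999.0])

logged = np.log1p(gains)       # log(1 + x), defined at x = 0
recovered = np.expm1(logged)   # exp(y) - 1, the inverse of log1p

print(logged.round(6))
print(np.allclose(recovered, gains))
```

The transform compresses the long right tail of `capital_gain`: the raw values span five orders of magnitude, while the logged values stay below 12.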