## Feature Engineering


Feature engineering is the process of **creating or transforming features** to improve how well a machine learning model can learn from data.  

Raw datasets often don’t contain the most informative representations directly — they may need transformation, grouping, or encoding before they can be used effectively.  

In this notebook, we will demonstrate **how new features can be created from existing ones** in the Adult Income dataset to make it more suitable for predictive modeling.  


Before engineering new features, let’s take a quick look at the dataset. The cell below loads it, drops the `fnlwgt` column, and removes duplicate rows.

In [5]:
from sklearn.datasets import fetch_openml
import pandas as pd

# Load the Adult Income dataset from OpenML as a pandas DataFrame
adult = fetch_openml(name="adult", version=2, as_frame=True)

df_adult = adult.frame
# Drop the fnlwgt column, then remove duplicate rows
df_adult = df_adult.drop(columns=['fnlwgt'])
df_adult = df_adult.drop_duplicates()
print(df_adult.head())

    age  workclass     education  education-num      marital-status  \
0  25.0    Private          11th            7.0       Never-married   
1  38.0    Private       HS-grad            9.0  Married-civ-spouse   
2  28.0  Local-gov    Assoc-acdm           12.0  Married-civ-spouse   
3  44.0    Private  Some-college           10.0  Married-civ-spouse   
4  18.0        NaN  Some-college           10.0       Never-married   

          occupation relationship   race     sex  capital-gain  capital-loss  \
0  Machine-op-inspct    Own-child  Black    Male           0.0           0.0   
1    Farming-fishing      Husband  White    Male           0.0           0.0   
2    Protective-serv      Husband  White    Male           0.0           0.0   
3  Machine-op-inspct      Husband  Black    Male        7688.0           0.0   
4                NaN    Own-child  White  Female           0.0           0.0   

   hours-per-week native-country  class  
0            40.0  United-States  <=50K  
1       

Let’s create a feature called **age_group** that buckets ages into categories such as young, middle-aged, and old. First, check the range of ages in the data:

In [6]:
print(df_adult['age'].min(), df_adult['age'].max())

17.0 90.0


I'm going to create age divisions as follows:
- Teens: 17-19 
- Young Adults: 20-29 
- Adults: 30-49 
- Middle-Aged Adults: 50-64 
- Seniors: 65+

In [7]:
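import pandas as pd

# Bin edges: with right=True each bin includes its right edge,
# and include_lowest=True makes the first bin include 17 as well
bins = [17, 19, 29, 49, 64, float('inf')]
labels = ['Teens', 'Young Adults', 'Adults', 'Middle-Aged Adults', 'Seniors']

df_adult['age_group'] = pd.cut(df_adult['age'], bins=bins, labels=labels, right=True, include_lowest=True)
print(df_adult.head())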

    age  workclass     education  education-num      marital-status  \
0  25.0    Private          11th            7.0       Never-married   
1  38.0    Private       HS-grad            9.0  Married-civ-spouse   
2  28.0  Local-gov    Assoc-acdm           12.0  Married-civ-spouse   
3  44.0    Private  Some-college           10.0  Married-civ-spouse   
4  18.0        NaN  Some-college           10.0       Never-married   

          occupation relationship   race     sex  capital-gain  capital-loss  \
0  Machine-op-inspct    Own-child  Black    Male           0.0           0.0   
1    Farming-fishing      Husband  White    Male           0.0           0.0   
2    Protective-serv      Husband  White    Male           0.0           0.0   
3  Machine-op-inspct      Husband  Black    Male        7688.0           0.0   
4                NaN    Own-child  White  Female           0.0           0.0   

   hours-per-week native-country  class     age_group  
0            40.0  United-States  <=
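To confirm that the bin edges match the age divisions listed earlier, the same `bins` and `labels` can be applied to a few ages that sit on or next to each boundary. This is a minimal sketch: `boundary_ages` and `check` are hypothetical names, and the ages are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

# Same edges and labels as in the age_group cell.
bins = [17, 19, 29, 49, 64, float("inf")]
labels = ["Teens", "Young Adults", "Adults", "Middle-Aged Adults", "Seniors"]

# Hypothetical ages chosen to sit on or next to each bin edge.
boundary_ages = pd.Series([17, 19, 20, 29, 30, 49, 50, 64, 65, 90])

# right=True closes each bin on the right; include_lowest=True keeps 17 in the first bin.
groups = pd.cut(boundary_ages, bins=bins, labels=labels, right=True, include_lowest=True)

check = pd.DataFrame({"age": boundary_ages, "age_group": groups})
print(check)
```

Without `include_lowest=True`, the first interval would be open on the left, so an age of exactly 17 would fall outside every bin and come back as `NaN`.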

Next, let’s create a feature called `capital_flag`, which is 1 for people with capital-gain > 0 and 0 otherwise.

In [8]:
# 1 if the person reported any capital gain, 0 otherwise (vectorized comparison)
df_adult['capital_flag'] = (df_adult['capital-gain'] > 0).astype(int)
print(df_adult.head())

    age  workclass     education  education-num      marital-status  \
0  25.0    Private          11th            7.0       Never-married   
1  38.0    Private       HS-grad            9.0  Married-civ-spouse   
2  28.0  Local-gov    Assoc-acdm           12.0  Married-civ-spouse   
3  44.0    Private  Some-college           10.0  Married-civ-spouse   
4  18.0        NaN  Some-college           10.0       Never-married   

          occupation relationship   race     sex  capital-gain  capital-loss  \
0  Machine-op-inspct    Own-child  Black    Male           0.0           0.0   
1    Farming-fishing      Husband  White    Male           0.0           0.0   
2    Protective-serv      Husband  White    Male           0.0           0.0   
3  Machine-op-inspct      Husband  Black    Male        7688.0           0.0   
4                NaN    Own-child  White  Female           0.0           0.0   

   hours-per-week native-country  class     age_group  capital_flag  
0            40.0  Uni
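One detail worth checking is how a comparison-based flag treats zeros and missing values. The sketch below uses a small, made-up Series (`gains` is a hypothetical name, not part of the notebook) and compares an element-wise `apply` with a vectorized comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical capital-gain values, including zeros and one missing entry.
gains = pd.Series([0.0, 7688.0, 0.0, 3103.0, np.nan], name="capital-gain")

# Element-wise version: evaluates the condition once per value.
flag_apply = gains.apply(lambda x: 1 if x > 0 else 0)

# Vectorized version: one comparison over the whole column.
flag_vectorized = (gains > 0).astype(int)

# NaN > 0 evaluates to False, so a missing value is flagged as 0 in both versions.
print(flag_apply.tolist())
print(flag_vectorized.tolist())
```

Both versions agree on this input. If missing gains should stay missing rather than become 0, that would need separate handling.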

Let’s simplify marital-status into Married vs. Not-Married instead of multiple categories.

In [9]:
df_adult['marital-status'].value_counts()

marital-status
Married-civ-spouse       19215
Never-married            13360
Divorced                  6218
Separated                 1512
Widowed                   1499
Married-spouse-absent      627
Married-AF-spouse           37
Name: count, dtype: int64

In [10]:
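import numpy as np

# Labels that start with 'Married' become 'Married'; everything else becomes 'Not-Married'.
# A prefix match is used instead of str.contains so the cell can be re-run safely:
# 'Not-Married' also contains 'Married', so a substring match would relabel every row on a second run.
df_adult['marital-status'] = np.where(
    df_adult['marital-status'].str.startswith('Married'),
    'Married',
    'Not-Married'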
)
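The mapping can be checked on the seven category names from the `value_counts()` output. The sketch below is illustrative only: `statuses` and `simplified` are hypothetical names, and the prefix match with `str.startswith` is an alternative shown for comparison, on the assumption that the cell might be executed more than once.

```python
import numpy as np
import pandas as pd

# The seven category names listed in the value_counts() output.
statuses = pd.Series([
    "Married-civ-spouse", "Never-married", "Divorced", "Separated",
    "Widowed", "Married-spouse-absent", "Married-AF-spouse",
])

# Case-sensitive substring match: "Never-married" (lowercase m) is not matched.
# na=False treats missing values as non-matches (an added safeguard).
is_married = statuses.str.contains("Married", na=False)
simplified = pd.Series(np.where(is_married, "Married", "Not-Married"))
print(dict(zip(statuses, simplified)))

# Applying the substring rule to the already-simplified labels relabels everything
# as "Married", because "Not-Married" itself contains "Married".
rerun_contains = np.where(simplified.str.contains("Married"), "Married", "Not-Married")

# A prefix match leaves the simplified labels unchanged on a second pass.
rerun_prefix = np.where(simplified.str.startswith("Married"), "Married", "Not-Married")
print(set(rerun_contains), list(rerun_prefix) == list(simplified))
```

Note that `Married-spouse-absent` ends up in the Married group under either rule. Whether that grouping is appropriate depends on the modeling goal.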

In [11]:
print(df_adult.head())

    age  workclass     education  education-num marital-status  \
0  25.0    Private          11th            7.0    Not-Married   
1  38.0    Private       HS-grad            9.0        Married   
2  28.0  Local-gov    Assoc-acdm           12.0        Married   
3  44.0    Private  Some-college           10.0        Married   
4  18.0        NaN  Some-college           10.0    Not-Married   

          occupation relationship   race     sex  capital-gain  capital-loss  \
0  Machine-op-inspct    Own-child  Black    Male           0.0           0.0   
1    Farming-fishing      Husband  White    Male           0.0           0.0   
2    Protective-serv      Husband  White    Male           0.0           0.0   
3  Machine-op-inspct      Husband  Black    Male        7688.0           0.0   
4                NaN    Own-child  White  Female           0.0           0.0   

   hours-per-week native-country  class     age_group  capital_flag  
0            40.0  United-States  <=50K  Young Adult