# **Adult Income Dataset – Feature Analysis Notebook** 
This notebook explores the UCI Adult Income dataset, where the task is to predict whether a person earns more than $50K a year from demographic and work-related attributes.

In this notebook, we will:

* Load and clean the dataset (handle missing values, which are encoded as `?`)
* Identify constant features (columns with the same value in every row)
* Identify quasi-constant features (columns where a single value covers almost all rows; the check below uses a 98% threshold)
* Identify duplicate features (columns that are identical to another column)
* Perform correlation analysis (to see relationships between numeric features)
* Calculate mutual information (MI) (to find which features are most useful for predicting income)

**Step 1: Load the Data into the Kaggle Notebook**

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [1]:
import pandas as pd

columns = ["age", "workclass", "fnlwgt", "education", "education_num", "marital_status", 
           "occupation", "relationship", "race", "sex", "capital_gain", 
           "capital_loss", "hours_per_week", "native_country", "income"]

train_data = pd.read_csv("/kaggle/input/adult-income-practice/adult.data", names=columns, sep=",", skipinitialspace=True)
test_data = pd.read_csv("/kaggle/input/adult-income-practice/adult.test", names=columns, sep=",", skiprows=1, skipinitialspace=True)

data = pd.concat([train_data, test_data], ignore_index=True)
data.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


**Step 2: Clean the Data**
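The Adult dataset encodes missing values as `?`. Replace them with `pd.NA` and drop the affected rows:

In [6]:
# Note: this check runs before "?" is replaced, so it finds no missing values (empty list below)
print([col for col in data.columns if data[col].isnull().sum() > 0])
data = data.replace("?", pd.NA)  # turn the "?" marker into a real missing value
data = data.dropna()             # drop every row that has at least one missing value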

[]
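
Because the check in the cell above runs before `?` is replaced, it prints an empty list. A minimal sketch of running the same check after the replacement is shown here on a small made-up frame (the `toy` data is an assumption for illustration, not rows from the dataset):

```python
import pandas as pd

# Hypothetical toy frame; "?" marks a missing entry, as in the Adult files
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "occupation": ["Adm-clerical", "?", "Sales", "?"],
})

toy = toy.replace("?", pd.NA)        # convert the marker to a real missing value
missing_counts = toy.isna().sum()    # number of missing entries per column
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_missing)

toy_clean = toy.dropna()             # keep only complete rows
print(len(toy), "->", len(toy_clean))
```

Printing the row count before and after `dropna()` also makes it visible how much data the cleaning step discards.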


In [4]:
data.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
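
One thing worth checking on the combined frame: in the original UCI files, the labels in `adult.test` are written with a trailing period (`<=50K.`, `>50K.`). If the Kaggle copy matches those files, the `income` column can hold four distinct strings instead of two after `pd.concat`. A minimal sketch of normalizing them, using made-up labels rather than the real column:

```python
import pandas as pd

# Hypothetical labels mimicking a train/test concat with mixed formats
income = pd.Series(["<=50K", ">50K", "<=50K.", ">50K."], name="income")

# Strip the trailing period so both files share the same two classes
income_clean = income.str.rstrip(".")
print(income_clean.unique())
```

In this notebook that would correspond to something like `data["income"] = data["income"].str.rstrip(".")` before the split, assuming the trailing periods are actually present.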


In [10]:
# Split into train and test sets first, so the feature checks below only look at training data

x_train, x_test, y_train, y_test = train_test_split(
    data.drop(['income'], axis=1),
    data['income'],
    test_size = 0.3,
    random_state = 0
)
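
The split above is purely random. When the target classes are not equally frequent, one option is to pass `stratify` so that both splits keep the same class proportions. This is a sketch on synthetic labels; the 75/25 class ratio and the `feature` column are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 150 rows of one class, 50 of the other
X = pd.DataFrame({"feature": range(200)})
y = pd.Series(["<=50K"] * 150 + [">50K"] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Share of each class in the two splits
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```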

In [14]:
print(x_train)
print()
print()
print(y_train)

       age         workclass  fnlwgt     education  education_num  \
11140   18           Private  190325          11th              7   
42461   41           Private  107845     Bachelors             13   
2903    42           Private  200574  Some-college             10   
19409   20           Private  164219       HS-grad              9   
12432   40           Private  110028  Some-college             10   
...    ...               ...     ...           ...            ...   
32823   29           Private  100293  Some-college             10   
22924   39      Self-emp-inc  188069     Assoc-voc             11   
46005   58  Self-emp-not-inc  290670     Bachelors             13   
47040   56           Private   70857       HS-grad              9   
2976    34           Private  182177       HS-grad              9   

           marital_status         occupation   relationship   race     sex  \
11140       Never-married       Craft-repair      Own-child  White    Male   
42461  Married-... [output truncated]

**Step 3: Detect Constant & Quasi-Constant Features**


* Constant Feature → Same value in all rows.
* Quasi-Constant Feature → Same value in almost all rows (the check below flags columns where one value covers more than 98% of rows).


In [16]:
# Constant features: columns with exactly one unique value
constant_feature = [col for col in x_train.columns if x_train[col].nunique() == 1]
print("Constant Features: ", constant_feature)

Constant Features:  []


In [21]:
# Quasi-constant features
quasi_constant_feature = []
for col in x_train.columns:
    # Share of rows taken by the most frequent value in this column
    top_freq = x_train[col].value_counts(normalize=True).max()
    if top_freq > 0.98:
        quasi_constant_feature.append(col)

print("Quasi Constant Features: ", quasi_constant_feature)

Quasi Constant Features:  []
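
Both lists come back empty here, so it can help to confirm the logic on data where the answer is known. This is a small sketch with a made-up frame and a hypothetical helper, `dominant_share`, that is not part of the notebook:

```python
import pandas as pd

def dominant_share(frame: pd.DataFrame) -> pd.Series:
    """Share of rows taken by the most frequent value, per column."""
    return frame.apply(lambda col: col.value_counts(normalize=True).max())

# Made-up frame with one constant, one quasi-constant, and one varied column
toy = pd.DataFrame({
    "constant": ["a"] * 100,
    "quasi": ["a"] * 99 + ["b"],
    "varied": list(range(100)),
})

shares = dominant_share(toy)
constant = shares[shares == 1.0].index.tolist()
quasi_constant = shares[(shares > 0.98) & (shares < 1.0)].index.tolist()
print(constant)
print(quasi_constant)
```

Note that a plain `top_freq > 0.98` test, as used in the loop above, also matches fully constant columns; the sketch separates the two cases only to make the distinction visible.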


**Step 4: Detect Duplicate Features**

In [4]:
# Compare every pair of columns and record the later column of each identical pair
duplicate_feat = []
for i in range(len(x_train.columns)):
    col_1 = x_train.columns[i]
    for col_2 in x_train.columns[i+1:]:
        if x_train[col_1].equals(x_train[col_2]):
            duplicate_feat.append(col_2)

duplicate_features = set(duplicate_feat)
print(duplicate_features)

set()
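
The nested loop compares every pair of columns explicitly. A shorter alternative transposes the frame and uses `DataFrame.duplicated`. This is a sketch of a different technique, not the notebook's method: transposing can be slow or memory-hungry on large frames, and the result may differ in edge cases such as columns with equal values but different dtypes. On a made-up frame:

```python
import pandas as pd

# Made-up frame in which "a_copy" repeats column "a"
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],
})

# After transposing, each original column is a row; duplicated() marks rows
# that repeat an earlier row, i.e. columns that repeat an earlier column
dup_mask = toy.T.duplicated()
duplicate_features = dup_mask[dup_mask].index.tolist()
print(duplicate_features)
```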


**Step 5: Correlation Analysis**

In [None]:
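def correlation(dataset, threshold):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    col_corr = set()  # names of the correlated columns to drop
    # Pearson correlation is only defined for numeric columns; recent pandas
    # versions raise an error if string columns are passed to corr()
    corr_matrix = dataset.select_dtypes(include="number").corr()

    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                # Of each correlated pair, keep the earlier column (j) and drop the later one (i)
                col_corr.add(corr_matrix.columns[i])

    return col_corr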

# Usage
corr_features = correlation(x_train, 0.9)
print(f"Number of highly correlated features to remove: {len(corr_features)}")
print(corr_features)
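
To see what such a filter returns, here is a self-contained sketch on synthetic data where one column is nearly a linear copy of another. The helper `correlated_columns` is a hypothetical stand-in that restricts itself to numeric columns, and all column names are made up:

```python
import numpy as np
import pandas as pd

def correlated_columns(frame, threshold):
    """Columns whose absolute Pearson correlation with an earlier column exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    to_drop = set()
    for i in range(len(corr.columns)):
        for j in range(i):  # lower triangle only, so each pair is checked once
            if corr.iloc[i, j] > threshold:
                to_drop.add(corr.columns[i])
    return to_drop

rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + rng.normal(scale=0.01, size=500),  # almost a linear copy of x
    "noise": rng.normal(size=500),                          # independent of x
    "label": ["a", "b"] * 250,                              # non-numeric, ignored
})

result = correlated_columns(toy, 0.9)
print(result)
```

Only the later column of a correlated pair is flagged, so `x` is kept and `x_scaled` is the candidate for removal.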
