<< Problem Statement >>

Develop GA-based feature selection for the Adult dataset provided. Compare your results with at least three ML methods (with and without feature selection).
Draw your inferences and explain them with evidence. 

The experiments are based on the Adult dataset, which contains 48842 instances and 15 columns (14 attributes plus the income target). With this dataset, we test several classifiers with all of the features and with the subset of features selected by the genetic algorithm, for comparison.

Details of the file:

Fields
The dataset contains 15 columns
Target field: Income
-- The income is divided into two classes: <=50K and >50K
Number of attributes: 14
-- These are the demographics and other features that describe a person

In [1]:
# Data management libraries
import pandas as pd
import numpy as np
#from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Graph and plots
import seaborn as sns
import matplotlib.pyplot as plt

#Modeling - AIML
from genetic_selection import GeneticSelectionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import ComplementNB
from sklearn.naive_bayes import GaussianNB 

#Evaluation
from sklearn import metrics

In [2]:
# Load datafile adult.csv (command for google colab)
#from google.colab import files
#uploaded = files.upload()

In [3]:
# Importing the dataset
#/workspaces/codespaces-jupyter/data/adult.csv
data = pd.read_csv('../data/adult.csv')

#Data Exploration

In [4]:
# Display top 5 records 
data.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


# Data preprocessing

In [5]:
# Some columns use "?" as a placeholder for missing values, so we clean it up
# Columns workclass, occupation and native-country contain "?"
data.isin(["?"]).sum()

age                   0
workclass          2799
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2809
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      857
income                0
dtype: int64

In [6]:
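# Clean data: replace the "?" placeholder with NaN in the affected columns
# (assigning the result back avoids in-place replace on a column selection)
cols_with_placeholder = ["workclass", "occupation", "native-country"]
data[cols_with_placeholder] = data[cols_with_placeholder].replace("?", np.nan)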

In [7]:
data.isin(["?"]).sum()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64

In [8]:
# check if any null value in any column
data.isnull().sum()

age                   0
workclass          2799
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2809
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      857
income                0
dtype: int64
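
As a side note, the replace-then-check steps above could probably be folded into the load step. The sketch below assumes the placeholder is exactly "?" (as the `isin(["?"])` check suggests) and uses a tiny inline CSV with made-up rows instead of adult.csv; `csv_text`, `demo` and `demo_clean` are hypothetical names used only for illustration.

```python
import io

import pandas as pd

# Tiny inline stand-in for adult.csv: three made-up rows, not real records.
csv_text = """age,workclass,occupation,native-country,income
25,Private,Sales,United-States,<=50K
18,?,?,United-States,<=50K
40,Private,Tech-support,?,>50K
"""

# na_values="?" converts the placeholder to NaN while the file is parsed,
# so no separate replace step is needed afterwards.
demo = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Count missing values per column, as data.isnull().sum() does above.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
demo_clean = demo.dropna(how="any")
print(demo_clean.shape)
```

With the real file, the same idea would be `pd.read_csv('../data/adult.csv', na_values="?")`, assuming no column uses "?" as a legitimate value.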

In [9]:
# Total rows 48842, columns 15
data.shape

(48842, 15)

In [10]:
# Delete records for null values 
data.dropna(how = "any", inplace = True)
data.shape

# After drop null - Total rows 45222, columns 15

(45222, 15)

In [11]:
data.isnull().sum()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64

In [12]:
# check for duplicate data and drop 

data.duplicated().any()

True

In [13]:
# there is duplicate data so delete them
data = data.drop_duplicates()
data.shape

# After duplicate delete - Total rows 45175, columns 15

(45175, 15)

In [14]:
# Describe Data 
data.describe()

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week
count,45175.0,45175.0,45175.0,45175.0,45175.0,45175.0
mean,38.55617,189738.8,10.119314,1102.57627,88.687593,40.942512
std,13.215349,105652.4,2.55174,7510.249876,405.156611,12.00773
min,17.0,13492.0,1.0,0.0,0.0,1.0
25%,28.0,117392.5,9.0,0.0,0.0,40.0
50%,37.0,178312.0,10.0,0.0,0.0,40.0
75%,47.0,237903.0,13.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0


In [15]:
# Display columns type
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45175 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              45175 non-null  int64 
 1   workclass        45175 non-null  object
 2   fnlwgt           45175 non-null  int64 
 3   education        45175 non-null  object
 4   educational-num  45175 non-null  int64 
 5   marital-status   45175 non-null  object
 6   occupation       45175 non-null  object
 7   relationship     45175 non-null  object
 8   race             45175 non-null  object
 9   gender           45175 non-null  object
 10  capital-gain     45175 non-null  int64 
 11  capital-loss     45175 non-null  int64 
 12  hours-per-week   45175 non-null  int64 
 13  native-country   45175 non-null  object
 14  income           45175 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.5+ MB


In [16]:
# Fields
# The dataset contains 15 columns
# Target field: Income
# -- The income is divided into two classes: <=50K and >50K
# Number of attributes: 14
# -- These are the demographics and other features that describe a person

In [17]:
# target field is income (Given in Data file)
# Print all unique values  
# There are only 2 unique values <=50k, >50k
data.income.unique()

array(['<=50K', '>50K'], dtype=object)

In [18]:
# Now convert these 2 values to 0/1
# <=50K to 0, >50K to 1
data["income"] = data["income"].map({"<=50K": 0, ">50K": 1})

In [19]:
# Print after converting income column 
data.sample(2)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
30585,22,Private,156822,10th,6,Never-married,Sales,Not-in-family,White,Female,0,1762,25,United-States,0
15914,42,Private,112507,12th,8,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,30,United-States,0


In [20]:
# Divide data into x (independent variables) and y (target/dependent variable)
x = data.drop("income", axis= 1)
y = data["income"]

In [21]:
# Print x 
x.head(2)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States


In [22]:
# Print y (target)
y.head(2)

0    0
1    0
Name: income, dtype: int64

In [23]:
#Separate out objects and numeric columns in x  
cat_cols = x.columns[x.dtypes == 'object']
num_cols = x.columns[(x.dtypes == 'float64') | (x.dtypes == 'int64')]
# num_cols1=list(x.select_dtypes(exclude='object').columns) # can do this way also 

In [24]:
cat_cols

Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'gender', 'native-country'],
      dtype='object')

In [25]:
num_cols

Index(['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss',
       'hours-per-week'],
      dtype='object')

In [26]:
# Get dummies 
x=pd.get_dummies(x)

In [27]:
x.head(4)

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,25,226802,7,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,38,89814,9,0,0,50,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
2,28,336951,12,0,0,40,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
3,44,160323,10,7688,0,40,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0


In [28]:

# Apply the StandardScaler to all numeric columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x[num_cols]=scaler.fit_transform(x[num_cols])

In [29]:
x.sample(2)

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
24561,1.244312,-1.437463,2.304605,-0.146811,-0.218899,1.587119,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
44614,1.698334,-1.520851,-0.438652,-0.146811,6.578547,-1.327702,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0


In [30]:
#Sample for training and testing the model
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.4, random_state=1)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((27105, 104), (18070, 104), (27105,), (18070,))
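
In this notebook the scaler is fitted on the full dataset before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training part only. The sketch below illustrates that order on synthetic data; `X_demo`, `y_demo` and the other `*_demo` names are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data standing in for the numeric columns (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, using the same test_size and random_state as the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=1
)

# Fit the scaler on the training rows only ...
scaler_demo = StandardScaler().fit(X_tr)

# ... then apply the same training statistics to both parts.
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Training columns are centred at 0; test columns are only close to 0.
print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```

For a dataset of this size the effect on accuracy is likely small, but this order keeps the test set untouched until evaluation.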

#### 1. Logistic Regression classifier
#### 2. Random Forest classifier
#### 3. KNN
#### 4. SVC
#### 5. Naive Bayes
#### 6. Modeling without feature selection (with all columns)
#### 7. Compare accuracy

In [31]:
#Build base models
lr = LogisticRegression(random_state=0).fit(X_train, Y_train)
rf = RandomForestClassifier(n_estimators = 7).fit(X_train, Y_train)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, Y_train)
svc = SVC().fit(X_train, Y_train)
##nb = ComplementNB().fit(X_train, Y_train)
nb = GaussianNB().fit(X_train, Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [32]:
# Prediction on Train data set 
lr_pred = lr.predict(X_train)
rf_pred = rf.predict(X_train)
knn_pred = knn.predict(X_train)
svc_pred = svc.predict(X_train)
nb_pred = nb.predict(X_train)

lr_train_score_1 = 100*metrics.accuracy_score(Y_train, lr_pred)
rf_train_score_1 = 100*metrics.accuracy_score(Y_train, rf_pred)
knn_train_score_1 = 100*metrics.accuracy_score(Y_train, knn_pred)
svc_train_score_1 = 100*metrics.accuracy_score(Y_train, svc_pred)
nb_train_score_1 = 100*metrics.accuracy_score(Y_train, nb_pred)


print("Train Accuracy score of LR = ",lr_train_score_1)
print("Train Accuracy score of RF = ",rf_train_score_1)
print("Train Accuracy score of KNN = ",knn_train_score_1)
print("Train Accuracy score of SVC = ",svc_train_score_1)
print("Train Accuracy score of NB = ",nb_train_score_1)

Train Accuracy score of LR =  84.88839697472791
Train Accuracy score of RF =  98.62018077845416
Train Accuracy score of KNN =  89.9870872532743
Train Accuracy score of SVC =  86.24976941523704
Train Accuracy score of NB =  61.50156797638812


In [33]:
# Prediction on Test data set 
lr_pred_test = lr.predict(X_test)
rf_pred_test = rf.predict(X_test)
knn_pred_test = knn.predict(X_test)
svc_pred_test = svc.predict(X_test)
nb_pred_test = nb.predict(X_test)

lr_test_score_1 = 100*metrics.accuracy_score(Y_test, lr_pred_test)
rf_test_score_1 = 100*metrics.accuracy_score(Y_test, rf_pred_test)
knn_test_score_1 = 100*metrics.accuracy_score(Y_test, knn_pred_test)
svc_test_score_1 = 100*metrics.accuracy_score(Y_test, svc_pred_test)
nb_test_score_1 = 100*metrics.accuracy_score(Y_test, nb_pred_test)

print("Test Accuracy score of LR = ",lr_test_score_1)
print("Test Accuracy score of RF = ",rf_test_score_1)
print("Test Accuracy score of KNN = ",knn_test_score_1)
print("Test Accuracy score of SVC = ",svc_test_score_1)
print("Test Accuracy score of NB = ",nb_test_score_1)

Test Accuracy score of LR =  85.03043718871058
Test Accuracy score of RF =  83.43110127282789
Test Accuracy score of KNN =  82.05866076369674
Test Accuracy score of SVC =  85.45655783065855
Test Accuracy score of NB =  61.10127282789153


In [34]:
#<<<<<< Without feature selection on Train dataset - 
#Train Accuracy score of LR =  84.88839697472791
#Train Accuracy score of RF =  98.45784910533112
#Train Accuracy score of KNN =  89.9870872532743
#Train Accuracy score of SVC =  86.24976941523704

#<<<<<< Without feature selection on Test dataset - 
#Test Accuracy score of LR =  85.03043718871058
#Test Accuracy score of RF =  83.44216934144991
#Test Accuracy score of KNN =  82.05866076369674
#Test Accuracy score of SVC =  85.45655783065855

# RF accuracy on Train is 98%, which comes down to 83% on the Test data set; this indicates overfitting
# SVC accuracy on Train is 86% and about 85.5% on Test. This seems to be the best model by this criterion
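
The train/test comparison in the comments above can be wrapped in a small helper so that the gap is computed the same way for every model. This is only a sketch on synthetic data: `accuracy_gap`, `deep_tree` and `shallow_tree` are hypothetical names, and a decision tree is used here instead of the notebook's random forest because an unrestricted tree makes the overfitting pattern easy to reproduce.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def accuracy_gap(model, X_tr, X_te, y_tr, y_te):
    """Return (train %, test %, train minus test) for a fitted model."""
    train_acc = 100 * accuracy_score(y_tr, model.predict(X_tr))
    test_acc = 100 * accuracy_score(y_te, model.predict(X_te))
    return train_acc, test_acc, train_acc - test_acc


# Synthetic binary classification data (assumption, not the Adult dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=1
)

# An unrestricted tree memorises the training rows; a shallow one cannot.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("deep", deep_tree), ("shallow", shallow_tree)]:
    train_acc, test_acc, gap = accuracy_gap(model, X_tr, X_te, y_tr, y_te)
    print(f"{name}: train={train_acc:.1f} test={test_acc:.1f} gap={gap:.1f}")
```

A large positive gap, like the one RF shows above, is the usual sign of overfitting; a gap near zero, as for SVC, suggests the model generalizes.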


#### Modeling with fewer features: feature selection with GA (Genetic Algorithm)

In [35]:
# install GA library

!pip install sklearn-genetic



# GA-based feature selection 

In [36]:
estimator = DecisionTreeClassifier()
fs_model = GeneticSelectionCV(
    estimator, cv=5, verbose=0,
    scoring="accuracy", max_features=5,
    n_population=100, crossover_proba=0.5,
    mutation_proba=0.2, n_generations=50,
    crossover_independent_proba=0.5,
    mutation_independent_proba=0.04,
    tournament_size=3, n_gen_no_change=10,
    caching=True, n_jobs=-1)

In [37]:
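# Fit the GA selector and print the selected features
# Note: the selector is fitted on the full data (x, y), so the test rows also influence which features are chosen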
fs_model = fs_model.fit(x, y)
features =  x.columns[fs_model.support_]

print('Features:', features)

Features: Index(['capital-gain', 'capital-loss', 'workclass_Self-emp-inc',
       'education_Masters', 'marital-status_Married-civ-spouse'],
      dtype='object')
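
To make the GA parameters above more concrete, here is a toy genetic algorithm over feature masks in plain Python. It is not the implementation used by `GeneticSelectionCV`: the keyword names simply mirror the call above, and `toy_fitness` with its `informative` set is a hypothetical stand-in for the cross-validated accuracy that the real selector would compute with the estimator.

```python
import random


def toy_fitness(mask, informative, max_features):
    """Hypothetical score: reward informative features, reject invalid subsets."""
    n_selected = sum(mask)
    if n_selected == 0 or n_selected > max_features:
        return -1.0
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in informative)
    return hits - 0.01 * n_selected  # small penalty for larger subsets


def tournament(pop, scores, k, rng):
    """Pick k random individuals and return the fittest of them."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: scores[i])]


def crossover(a, b, indpb, rng):
    """Uniform crossover: swap each gene with probability indpb."""
    a, b = a[:], b[:]
    for i in range(len(a)):
        if rng.random() < indpb:
            a[i], b[i] = b[i], a[i]
    return a, b


def mutate(mask, indpb, rng):
    """Flip each bit with probability indpb."""
    return [1 - bit if rng.random() < indpb else bit for bit in mask]


def run_ga(n_features, informative, max_features=5, n_population=30,
           n_generations=40, crossover_proba=0.5, mutation_proba=0.2,
           crossover_independent_proba=0.5, mutation_independent_proba=0.04,
           tournament_size=3, seed=0):
    rng = random.Random(seed)

    def score(mask):
        return toy_fitness(mask, informative, max_features)

    # Start from random subsets with 1..max_features selected features.
    pop = []
    for _ in range(n_population):
        mask = [0] * n_features
        for i in rng.sample(range(n_features), rng.randint(1, max_features)):
            mask[i] = 1
        pop.append(mask)

    best = max(pop, key=score)
    for _ in range(n_generations):
        scores = [score(m) for m in pop]
        children = []
        while len(children) < n_population:
            a = tournament(pop, scores, tournament_size, rng)
            b = tournament(pop, scores, tournament_size, rng)
            if rng.random() < crossover_proba:
                a, b = crossover(a, b, crossover_independent_proba, rng)
            if rng.random() < mutation_proba:
                a = mutate(a, mutation_independent_proba, rng)
            if rng.random() < mutation_proba:
                b = mutate(b, mutation_independent_proba, rng)
            children.extend([a, b])
        pop = children[:n_population]
        generation_best = max(pop, key=score)
        if score(generation_best) > score(best):
            best = generation_best  # keep the best mask seen so far
    return best


informative = {2, 7, 11}  # hypothetical "useful" feature indices
best_mask = run_ga(n_features=20, informative=informative)
selected = [i for i, bit in enumerate(best_mask) if bit]
print("Selected feature indices:", selected)
```

Each individual is a 0/1 mask over the columns, which corresponds to `fs_model.support_` above. In the real selector the expensive part is the fitness evaluation, since every candidate mask requires a cross-validated fit of the estimator.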


In [38]:
# For learning purpose you can limit to 3 features with below code
#features = ['educational-num', 'capital-gain', 'capital-loss']
#X =  x[features]


#Build optimized models
# Doing with all 5 features 
x1 =  x[features]
y1 = y
print("Number of features - ", len(x1.columns))

Number of features -  5


In [39]:
X_train_ga, X_test_ga, Y_train_ga, Y_test_ga = train_test_split(x1, y1, test_size=0.4, random_state=1)
X_train_ga.shape, X_test_ga.shape, Y_train_ga.shape, Y_test_ga.shape

((27105, 5), (18070, 5), (27105,), (18070,))

In [40]:
lr = LogisticRegression(random_state=0).fit(X_train_ga, Y_train_ga)
rf = RandomForestClassifier(n_estimators = 7).fit(X_train_ga, Y_train_ga)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_ga, Y_train_ga)
svc = SVC().fit(X_train_ga, Y_train_ga)
#nb = ComplementNB().fit(X_train_ga, Y_train_ga)
nb = GaussianNB().fit(X_train_ga, Y_train_ga)

In [41]:
lr_pred_ga = lr.predict(X_train_ga)
rf_pred_ga = rf.predict(X_train_ga)
knn_pred_ga = knn.predict(X_train_ga)
svc_pred_ga = svc.predict(X_train_ga)
nb_pred_ga = nb.predict(X_train_ga)

lr_train_score_2 = 100*metrics.accuracy_score(Y_train_ga, lr_pred_ga)
rf_train_score_2 = 100*metrics.accuracy_score(Y_train_ga, rf_pred_ga)
knn_train_score_2 = 100*metrics.accuracy_score(Y_train_ga, knn_pred_ga)
svc_train_score_2 = 100*metrics.accuracy_score(Y_train_ga, svc_pred_ga)
nb_train_score_2 = 100*metrics.accuracy_score(Y_train_ga, nb_pred_ga)

print("<< TRAIN Accuracy after GA feature selection >>")
print("Train Accuracy score of LR = ",lr_train_score_2)
print("Train Accuracy score of RF = ",rf_train_score_2)
print("Train Accuracy score of KNN = ",knn_train_score_2)
print("Train Accuracy score of SVC = ",svc_train_score_2)
print("Train Accuracy score of NB = ",nb_train_score_2)

<< TRAIN Accuracy after GA feature selection >>
Train Accuracy score of LR =  81.37244050913117
Train Accuracy score of RF =  83.8996495111603
Train Accuracy score of KNN =  82.85187234827522
Train Accuracy score of SVC =  82.36856668511344
Train Accuracy score of NB =  78.62018077845417


In [42]:
lr_pred_ga_test = lr.predict(X_test_ga)
rf_pred_ga_test = rf.predict(X_test_ga)
knn_pred_ga_test = knn.predict(X_test_ga)
svc_pred_ga_test = svc.predict(X_test_ga)
nb_pred_ga_test = nb.predict(X_test_ga)

lr_test_score_2 = 100*metrics.accuracy_score(Y_test_ga, lr_pred_ga_test)
rf_test_score_2 = 100*metrics.accuracy_score(Y_test_ga, rf_pred_ga_test)
knn_test_score_2 = 100*metrics.accuracy_score(Y_test_ga, knn_pred_ga_test)
svc_test_score_2 = 100*metrics.accuracy_score(Y_test_ga, svc_pred_ga_test)
nb_test_score_2 = 100*metrics.accuracy_score(Y_test_ga, nb_pred_ga_test)

print("<< TEST Accuracy after GA feature selection >>")
print("Test Accuracy score of LR = ",lr_test_score_2)
print("Test Accuracy score of RF = ",rf_test_score_2)
print("Test Accuracy score of KNN = ",knn_test_score_2)
print("Test Accuracy score of SVC = ",svc_test_score_2)
print("Test Accuracy score of NB = ",nb_test_score_2)

<< TEST Accuracy after GA feature selection >>
Test Accuracy score of LR =  81.30049806308799
Test Accuracy score of RF =  83.74654122855561
Test Accuracy score of KNN =  82.72827891532928
Test Accuracy score of SVC =  82.35749861649143
Test Accuracy score of NB =  78.81018262313226


# Inference

In [46]:

d = {
    'Model_name':['LR', 'RF', 'KNN', 'SVC','NB'],
    'Score1_Train':[lr_train_score_1,rf_train_score_1,knn_train_score_1,svc_train_score_1,nb_train_score_1],
    'Score1_test':[lr_test_score_1,rf_test_score_1,knn_test_score_1,svc_test_score_1,nb_test_score_1],
   'Score2_GA_FS_Train':[lr_train_score_2,rf_train_score_2,knn_train_score_2,svc_train_score_2,nb_train_score_2],
   'Score2_GA_FS_test':[lr_test_score_2,rf_test_score_2,knn_test_score_2,svc_test_score_2,nb_test_score_2]}

result = pd.DataFrame(d) 
result.sort_values(by='Score2_GA_FS_test', ascending=False, inplace = True)
result

Unnamed: 0,Model_name,Score1_Train,Score1_test,Score2_GA_FS_Train,Score2_GA_FS_test
1,RF,98.620181,83.431101,83.89965,83.746541
2,KNN,89.987087,82.058661,82.851872,82.728279
3,SVC,86.249769,85.456558,82.368567,82.357499
0,LR,84.888397,85.030437,81.372441,81.300498
4,NB,61.501568,61.101273,78.620181,78.810183
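
One way to read the table is to compute, per model, the train/test gap and the change in test accuracy caused by feature selection. The sketch below copies the values from the table above into a plain dictionary; the key names (`train_all`, `test_all`, `train_ga`, `test_ga`) are hypothetical labels for the four score columns.

```python
# Accuracy values (percent) copied from the result table above.
scores = {
    "RF":  {"train_all": 98.620181, "test_all": 83.431101,
            "train_ga": 83.899650, "test_ga": 83.746541},
    "KNN": {"train_all": 89.987087, "test_all": 82.058661,
            "train_ga": 82.851872, "test_ga": 82.728279},
    "SVC": {"train_all": 86.249769, "test_all": 85.456558,
            "train_ga": 82.368567, "test_ga": 82.357499},
    "LR":  {"train_all": 84.888397, "test_all": 85.030437,
            "train_ga": 81.372441, "test_ga": 81.300498},
    "NB":  {"train_all": 61.501568, "test_all": 61.101273,
            "train_ga": 78.620181, "test_ga": 78.810183},
}

# Change in test accuracy when moving from all features to the GA subset.
test_change = {name: s["test_ga"] - s["test_all"] for name, s in scores.items()}

for name, s in scores.items():
    gap_all = s["train_all"] - s["test_all"]  # overfitting indicator, all features
    gap_ga = s["train_ga"] - s["test_ga"]     # overfitting indicator, GA features
    print(f"{name:>3}: gap_all={gap_all:6.2f}  gap_ga={gap_ga:6.2f}  "
          f"test_change={test_change[name]:6.2f}")

largest_gain = max(test_change, key=test_change.get)
largest_drop = min(test_change, key=test_change.get)
print("Largest test gain after GA selection:", largest_gain)
print("Largest test drop after GA selection:", largest_drop)
```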


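# 1. Without feature selection (all 14 attributes, 104 columns after one-hot encoding) -
  RF gives the best accuracy on train data (98.6%), but on test it drops to 83.4%, which indicates overfitting.
  SVC (85.5%) and LR (85.0%) give the best accuracy on the test data set; KNN reaches 82.1% and NB only 61.1%.

# 2. With GA feature selection (5 features) -
  RF gives the best accuracy: 83.9% on train data and 83.7% on test, so the gap between train and test disappears.
  LR and SVC lose about 3 to 4 points of test accuracy with fewer features, KNN stays at about the same level (82.7%), and NB improves from 61.1% to 78.8%.

# 3. In my opinion RF performs well: with only 5 GA-selected features it keeps its test accuracy and no longer overfits.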
