# Case Study 02 on Ensemble Learning

**Data:**

The data was extracted by Barry Becker from the 1994 Census database. The prediction task is to determine whether a person earns more than 50K a year.

Columns are:
* age: continuous.
* workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
* fnlwgt: continuous.
* education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
* education-num: continuous.
* marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
* occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
* relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
* race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
* sex: Female, Male.
* capital-gain: continuous.
* capital-loss: continuous.
* hours-per-week: continuous.
* native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
* salary: <=50K or >50K


**Q.** Create an ensemble learning based machine learning model to classify people based on their salary.

Try using the following methods:
* Decision Tree
* Random Forest
* Bagging Classifiers
* Boosting Classifiers

In [1]:
# importing libraries

import pandas as pd
import numpy as np

### Data Loading

In [2]:
df = pd.read_csv('salary.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


### Data Preprocessing

In [4]:
# Count the zero-valued entries in each column
for column_name in df.columns:
    count = (df[column_name] == 0).sum()
    print(column_name,count)

age 0
workclass 0
fnlwgt 0
education 0
education-num 0
marital-status 0
occupation 0
relationship 0
race 0
sex 0
capital-gain 29849
capital-loss 31042
hours-per-week 0
native-country 0
salary 0


In [5]:
print(df['workclass'].unique())
print(df['education'].unique())
print(df['marital-status'].unique())
print(df['occupation'].unique())
print(df['relationship'].unique())
print(df['race'].unique())
print(df['sex'].unique())
print(df['native-country'].unique())

[' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']
[' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']
[' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed']
[' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
 ' Farming-fishing' ' Machine-op-inspct' ' Tech-support' ' ?'
 ' Protective-serv' ' Armed-Forces' ' Priv-house-serv']
[' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
 ' Other-relative']
[' White' ' Black' ' Asian-Pac-Islander' ' Amer-Indian-Eskimo' ' Other']
[' Male' ' Female']
[' United-States' ' Cuba' ' Jamaica' ' India' ' ?' ' Mexico' ' South'
 ' Puerto-Rico' ' Honduras' ' England' ' Canada' ' Germany' 

In [6]:
# Missing values are recorded as ' ?' (with a leading space); count them per column
for column_name in df.columns:
    count = (df[column_name] == ' ?').sum()
    print(column_name,count)

age 0
workclass 1836
fnlwgt 0
education 0
education-num 0
marital-status 0
occupation 1843
relationship 0
race 0
sex 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 583
salary 0


In [7]:
df.loc[df['occupation'] == ' ?'].index

Int64Index([   27,    61,    69,    77,   106,   128,   149,   154,   160,
              187,
            ...
            32426, 32477, 32490, 32494, 32525, 32530, 32531, 32539, 32541,
            32542],
           dtype='int64', length=1843)

In [8]:
# Drop the rows whose occupation is missing (' ?')
df = df.drop(index=df.loc[df['occupation'] == ' ?'].index)

In [9]:
for column_name in df.columns:
    count = (df[column_name] == ' ?').sum()
    print(column_name,count)

age 0
workclass 0
fnlwgt 0
education 0
education-num 0
marital-status 0
occupation 0
relationship 0
race 0
sex 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 556
salary 0


In [10]:
# Drop the rows whose native-country is missing (' ?')
df = df.drop(index=df.loc[df['native-country'] == ' ?'].index)

In [11]:
for column_name in df.columns:
    count = (df[column_name] == ' ?').sum()
    print(column_name,count)

age 0
workclass 0
fnlwgt 0
education 0
education-num 0
marital-status 0
occupation 0
relationship 0
race 0
sex 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 0
salary 0


In [12]:
df.shape

(30162, 15)
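The leading space in values such as `' ?'` suggests that the file uses `", "` as its separator. As a hedged alternative to the manual drops above, the sketch below reads a tiny in-memory stand-in for the CSV (the rows are hypothetical, not taken from `salary.csv`) with `skipinitialspace=True` and `na_values="?"`, so that missing entries arrive as `NaN` and can be removed with `dropna()`.

```python
import io
import pandas as pd

# Hypothetical stand-in for salary.csv, using the same ", " separator style
sample_csv = io.StringIO(
    "age, workclass, occupation, native-country, salary\n"
    "39, State-gov, Adm-clerical, United-States, <=50K\n"
    "54, ?, ?, United-States, >50K\n"
    "32, Private, Sales, ?, <=50K\n"
)

# skipinitialspace strips the space after each comma; na_values turns "?" into NaN
demo = pd.read_csv(sample_csv, skipinitialspace=True, na_values="?")

# Number of missing entries per column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
cleaned = demo.dropna()
print(cleaned.shape)
```

Whether this matches the real file depends on how `salary.csv` was written, so the result should be checked against the counts printed above before relying on it.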

### Encoding

In [13]:
# One-hot encode race: one indicator column per category
df = pd.get_dummies(data=df, columns = ['race'])

In [14]:
from sklearn.preprocessing import LabelEncoder

# Integer-encode the remaining categorical features (the encoder is refit for each column)
categorical_cols = ['workclass','education','marital-status','occupation',
                    'relationship','sex','native-country']
df[categorical_cols] = df[categorical_cols].apply(LabelEncoder().fit_transform)
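`LabelEncoder` assigns integer codes in sorted order of the category strings, and applying `fit_transform` through `apply` discards each fitted encoder afterwards. The following is a minimal sketch, on a small hypothetical frame rather than the salary data, of keeping one encoder per column so the codes can be mapped back to labels later. The names `demo` and `encoders` are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-column frame that mimics the leading-space categories
demo = pd.DataFrame({
    "sex": [" Male", " Female", " Male"],
    "workclass": [" Private", " State-gov", " Private"],
})

# Keep one fitted encoder per column instead of throwing it away
encoders = {}
for col in ["sex", "workclass"]:
    enc = LabelEncoder()
    demo[col] = enc.fit_transform(demo[col])
    encoders[col] = enc

# classes_ lists the original labels; the position of a label is its code
print(encoders["sex"].classes_)

# Map the integer codes back to the original strings
decoded_sex = encoders["sex"].inverse_transform(demo["sex"])
print(decoded_sex)
```

Keeping the encoders around also makes it possible to apply the same mapping to new rows with `transform` instead of refitting.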

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   age                       30162 non-null  int64 
 1   workclass                 30162 non-null  int32 
 2   fnlwgt                    30162 non-null  int64 
 3   education                 30162 non-null  int32 
 4   education-num             30162 non-null  int64 
 5   marital-status            30162 non-null  int32 
 6   occupation                30162 non-null  int32 
 7   relationship              30162 non-null  int32 
 8   sex                       30162 non-null  int32 
 9   capital-gain              30162 non-null  int64 
 10  capital-loss              30162 non-null  int64 
 11  hours-per-week            30162 non-null  int64 
 12  native-country            30162 non-null  int32 
 13  salary                    30162 non-null  object
 14  race_ Amer-Indian-Eski

### Model Building

In [16]:
x = df.drop('salary',axis=1)
y = df.salary

In [17]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 1,test_size = 0.2)
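Because the two salary classes are not equally frequent, a plain random split can give the train and test sets slightly different class proportions. The sketch below uses synthetic labels (an assumed 75/25 mix, not the real data) to show how the `stratify` argument of `train_test_split` keeps the proportions aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with an assumed 75/25 class mix
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array(["<=50K"] * 750 + [">50K"] * 250)

# stratify=y_demo preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

train_share = (y_tr == ">50K").mean()
test_share = (y_te == ">50K").mean()
print(train_share, test_share)
```

On the notebook's data the equivalent call would pass `stratify=y`; doing so changes which rows land in each split, so the reported accuracies would shift slightly.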

### Decision Tree

In [18]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf = clf.fit(x_train,y_train)

In [19]:
y_pred = clf.predict(x_test)

In [20]:
from sklearn import metrics
metrics.accuracy_score(y_test,y_pred)

0.8047405934029505
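Accuracy alone does not show how well the minority `>50K` class is recognised. The following sketch, which uses a handful of made-up labels rather than the notebook's predictions, shows how a confusion matrix and per-class recall can be read alongside the accuracy.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels for illustration only
y_true_demo = ["<=50K"] * 6 + [">50K"] * 4
y_pred_demo = ["<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K",
               ">50K", ">50K", "<=50K", "<=50K"]

labels = ["<=50K", ">50K"]

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

acc = accuracy_score(y_true_demo, y_pred_demo)
# Recall for the >50K class: share of true >50K rows that were predicted as >50K
recall_high = recall_score(y_true_demo, y_pred_demo, pos_label=">50K")
print(acc, recall_high)
```

The same calls can be applied to `y_test` and `y_pred` from the cells above.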

### Random Forest

In [21]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(random_state=42, n_estimators = 100)
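# Fit on the training split; fitting on x_test would let the model see the test labels
rf_clf.fit(x_train,y_train)

In [22]:
y_pred = rf_clf.predict(x_test)

In [23]:
metrics.accuracy_score(y_test,y_pred)

(Output not shown: the value of 1.0 previously recorded here came from a run in which the forest was fitted on `x_test`, so it measured memorization of the test rows rather than held-out accuracy.)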
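Scoring a random forest on the same rows it was fitted on typically gives a near-perfect result, which is why the model must be fitted on the training split and scored on the test split. The sketch below illustrates the gap on synthetic data from `make_classification`; the data and variable names are assumptions for illustration, not the salary dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

# Accuracy on the rows used for fitting (optimistic) vs. unseen rows (honest)
in_sample = accuracy_score(y_tr, forest.predict(X_tr))
held_out = accuracy_score(y_te, forest.predict(X_te))
print(in_sample, held_out)
```

Only the held-out figure is comparable with the decision tree, bagging, and boosting accuracies in this notebook.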

### Bagging Classifiers

In [24]:
# Using Decision tree classifier

from sklearn.ensemble import BaggingClassifier

tree = DecisionTreeClassifier()

# The tree is passed positionally because scikit-learn 1.2 renamed the
# `base_estimator` keyword to `estimator`
bagging_clf = BaggingClassifier(tree, n_estimators=100,random_state=42)
bagging_clf.fit(x_train,y_train)

In [25]:
y_pred = bagging_clf.predict(x_test)

In [26]:
metrics.accuracy_score(y_test,y_pred)

0.8488314271506713

### Boosting Classifiers

In [29]:
from sklearn.ensemble import AdaBoostClassifier

ada_boost_clf = AdaBoostClassifier(n_estimators=100)
ada_boost_clf.fit(x_train, y_train)
y_pred = ada_boost_clf.predict(x_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.8606000331510029
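The accuracies above each come from a single train/test split, so small differences between models may partly reflect the split itself. As a hedged sketch, the code below compares the same four model families with 5-fold cross-validation on synthetic data; the dataset, the reduced `n_estimators=25`, and the dictionary names are assumptions chosen to keep the example fast, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded salary features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=25, random_state=0),
    # Base tree passed positionally so the call works across scikit-learn versions
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                                 random_state=0),
    "ada_boost": AdaBoostClassifier(n_estimators=25, random_state=0),
}

# Mean and standard deviation of accuracy over 5 folds for each model
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by `x` and `y`, and the standard deviation gives a rough sense of how much a single-split accuracy can vary.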


##### Jibin K Joy, ML & AI, KKEM August 2022 Batch