## Case Study 02 on Ensemble Learning

* Create an ensemble learning based machine learning model to classify people based on their salary.
* Try using the following methods:
     * Decision Tree
     * Random Forest
     * Bagging Classifiers
     * Boosting Classifiers
* Data:
* Extraction was done by Barry Becker from the 1994 Census database. The prediction task is to determine whether a person makes over 50K a year.
* Columns are:
    * age: continuous.
    * workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
    * fnlwgt: continuous.
    * education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
    * education-num: continuous.
    * marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
    * occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
    * relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
    * race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
    * sex: Female, Male.
    * capital-gain: continuous.
    * capital-loss: continuous.
    * hours-per-week: continuous.
    * native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
    * salary: <=50K or >50K


### Importing necessary libraries and the salary dataset

In [32]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('salary.csv')

In [3]:
df.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [5]:
df = df.drop('education',axis=1) # dropped as education-num column is the encoded version of education column
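
The comment above rests on `education` and `education-num` carrying the same information. A minimal sketch of how that could be checked before dropping the column, assuming a small hypothetical frame `toy` built from the rows printed by `df.head()` rather than the full dataset:

```python
import pandas as pd

# Hypothetical toy frame copied from the first rows of df.head(); not the full data.
toy = pd.DataFrame({
    'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th', 'Bachelors'],
    'education-num': [13, 13, 9, 7, 13],
})

# Each code should map to exactly one label, and each label to exactly one code.
labels_per_code = toy.groupby('education-num')['education'].nunique()
codes_per_label = toy.groupby('education')['education-num'].nunique()
is_one_to_one = bool((labels_per_code == 1).all() and (codes_per_label == 1).all())
print(is_one_to_one)
```

On the real data the same two `groupby` calls would have to run on `df` before the `drop`.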

### Encoding of features

In [6]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education-num', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 'capital-gain',
       'capital-loss', 'hours-per-week', 'native-country', 'salary'],
      dtype='object')

In [7]:
df['workclass'].unique()

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)

In [8]:
df['workclass'] = df['workclass'].replace(' ?', df['workclass'].mode()[0]) # assign back: inplace=True on a selected column is unreliable in recent pandas

In [9]:
df['workclass'].unique()

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' Self-emp-inc', ' Without-pay', ' Never-worked'],
      dtype=object)
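
The `unique()` output shows two quirks of the raw file: every string has a leading space, and missing values are written as `?`. A hedged alternative sketch that handles both while reading the file, using `pd.read_csv` with `skipinitialspace` and `na_values`. The inline CSV and the name `toy` are made up for illustration; this is not what the notebook does:

```python
import io
import pandas as pd

# Made-up CSV mimicking the raw file: a space after each comma, "?" for missing.
raw = io.StringIO(
    "age,workclass,salary\n"
    "39, State-gov, <=50K\n"
    "50, ?, <=50K\n"
    "38, Private, >50K\n"
    "53, Private, <=50K\n"
)

# skipinitialspace strips the space after each delimiter; na_values turns "?" into NaN.
toy = pd.read_csv(raw, skipinitialspace=True, na_values="?")
missing_per_column = toy.isna().sum()

# Same imputation idea as the notebook: fill with the most frequent value.
toy['workclass'] = toy['workclass'].fillna(toy['workclass'].mode()[0])
print(missing_per_column)
```

With this approach the category labels no longer carry a leading space, so any later `map` dictionaries would need keys such as `'Private'` instead of `' Private'`.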

In [10]:
df['workclass'] = df['workclass'].map({' Never-worked':0,' Without-pay':1,' Self-emp-not-inc':2,' Self-emp-inc':4,' Private':5, ' Local-gov':6,' Federal-gov':7,' State-gov':8})
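
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a misspelled or missing key silently creates missing data. A small sketch of a sanity check, using a hypothetical column that deliberately contains an unmapped label (`' Unknown'`):

```python
import pandas as pd

# Hypothetical toy column; ' Unknown' is deliberately absent from the mapping.
workclass = pd.Series([' Private', ' State-gov', ' Unknown'])
mapping = {' Private': 5, ' State-gov': 8}

encoded = workclass.map(mapping)

# Labels that fell through the mapping show up as NaN in the encoded column.
unmapped = workclass[encoded.isna()].unique()
print(list(unmapped))
```

Running the same `isna()` check on `df['workclass']` after the cell above would confirm that every category was covered.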

In [11]:
df['occupation'].unique() 

array([' Adm-clerical', ' Exec-managerial', ' Handlers-cleaners',
       ' Prof-specialty', ' Other-service', ' Sales', ' Craft-repair',
       ' Transport-moving', ' Farming-fishing', ' Machine-op-inspct',
       ' Tech-support', ' ?', ' Protective-serv', ' Armed-Forces',
       ' Priv-house-serv'], dtype=object)

In [12]:
df['occupation'] = df['occupation'].replace(' ?', df['occupation'].mode()[0])

In [13]:
df['race'].unique()

array([' White', ' Black', ' Asian-Pac-Islander', ' Amer-Indian-Eskimo',
       ' Other'], dtype=object)

In [14]:
df['race'] = df['race'].map({' Black':0,' Amer-Indian-Eskimo':1,' Asian-Pac-Islander':2,' Other':3,' White':4})

In [16]:
df['native-country'] = df['native-country'].replace(' ?', df['native-country'].mode()[0])

In [22]:
df['sex'] = df['sex'].map({' Male':1,' Female':0})

### Splitting of dataset

In [24]:
x = df.drop('salary',axis=1)
y = df.salary

### One hot encoding

In [25]:
x = pd.get_dummies(x)

In [26]:
x.head()

,age,workclass,fnlwgt,education-num,race,sex,capital-gain,capital-loss,hours-per-week,marital-status_ Divorced,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,39,8,77516,13,4,1,2174,0,40,0,...,0,0,0,0,0,0,0,1,0,0
1,50,2,83311,13,4,1,0,0,13,0,...,0,0,0,0,0,0,0,1,0,0
2,38,5,215646,9,4,1,0,0,40,1,...,0,0,0,0,0,0,0,1,0,0
3,53,5,234721,7,0,1,0,0,40,0,...,0,0,0,0,0,0,0,1,0,0
4,28,5,338409,13,0,0,0,0,40,0,...,0,0,0,0,0,0,0,0,0,0
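
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row frame (`toy`). Numeric columns pass through unchanged and each remaining string column becomes one indicator column per category. The `dtype=int` argument is an addition for illustration: newer pandas versions return boolean indicators by default, while the output above shows 0/1 integers.

```python
import pandas as pd

# Made-up frame: one numeric column and one string column.
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'occupation': [' Sales', ' Tech-support', ' Sales'],
})

# One indicator column per category, named "<column>_<category>".
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
```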


### Model fitting

In [35]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42,test_size=0.3)
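
When the target classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. The notebook does not do this; the sketch below only illustrates the option on synthetic labels with an assumed 3:1 ratio (`X_demo` and `y_demo` are hypothetical names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (750 vs 250); not the salary data.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array(['<=50K'] * 750 + ['>50K'] * 250)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Share of the minority class in each part.
train_share = np.mean(y_tr == '>50K')
test_share = np.mean(y_te == '>50K')
print(train_share, test_share)
```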

* Decision Tree
* Random Forest
* Bagging Classifiers
* Boosting Classifiers

### Decision Tree

In [36]:
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier()
dt.fit(x_train,y_train)
y_pred=dt.predict(x_test)

In [38]:
from sklearn.metrics import accuracy_score
acc=round(accuracy_score(y_test,y_pred),2)
acc

0.81
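
An accuracy value is easier to interpret next to a majority-class baseline, that is, the score obtained by always predicting the most common label. A minimal sketch with scikit-learn's `DummyClassifier` on synthetic labels; the 750/250 split is an assumption for illustration, since the class balance of the salary data is not shown in this notebook:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with an assumed 3:1 imbalance; not the salary data.
y_demo = np.array(['<=50K'] * 750 + ['>50K'] * 250)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_acc)
```

Fitting the same baseline on `x_train, y_train` and scoring it on `x_test` would give the reference point for the accuracies reported here.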

### Random Forest

In [39]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state = 42,n_estimators = 100)
rf = rf.fit(x_train,y_train)
y_pred = rf.predict(x_test)

In [40]:
acc=round(accuracy_score(y_test,y_pred),2)
acc

0.86

### Bagging Classifiers

#### 1. With Decision Tree

In [41]:
from sklearn.ensemble import BaggingClassifier
dt = DecisionTreeClassifier()
bagging_clf = BaggingClassifier(estimator = dt, n_estimators = 100, random_state = 42) # use base_estimator instead on scikit-learn < 1.2
bagging_clf.fit(x_train,y_train)
y_pred = bagging_clf.predict(x_test)

In [42]:
acc=round(accuracy_score(y_test,y_pred),2)
acc

0.85
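
As a rough illustration of what bagging does, the sketch below trains several decision trees on bootstrap samples (rows drawn with replacement) and combines them by majority vote. It uses synthetic data from `make_classification`, and the hard vote is a simplification: `BaggingClassifier` averages predicted class probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem; not the salary data.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# One row of 0/1 predictions per tree, then a majority vote per sample.
votes = np.stack([tree.predict(X_demo) for tree in trees])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(votes.shape, majority[:10])
```

An odd number of trees avoids tied votes in the binary case.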

#### 2. With Random Forest

In [43]:
from sklearn.ensemble import BaggingClassifier
bagging_clf = BaggingClassifier(estimator = rf, n_estimators = 100, random_state = 42) # use base_estimator instead on scikit-learn < 1.2
bagging_clf.fit(x_train,y_train)
y_pred = bagging_clf.predict(x_test)

In [44]:
acc=round(accuracy_score(y_test,y_pred),2)
acc

0.86

### AdaBoost Classifier

In [45]:
from sklearn.ensemble import AdaBoostClassifier
ad = AdaBoostClassifier(n_estimators = 100)
ad = ad.fit(x_train,y_train)
y_pred = ad.predict(x_test)

In [46]:
acc=round(accuracy_score(y_test,y_pred),2)
acc

0.87

### Conclusion

|Model|Accuracy|
|---|---|
|Decision Tree|0.81|
|Random Forest|0.86|
|Bagging with decision tree|0.85|
|Bagging with random forest|0.86|
|**AdaBoost classifier**|**0.87**|

* From the table above, the AdaBoost classifier achieves the highest test accuracy (0.87) on this particular dataset, narrowly ahead of random forest and bagging with random forest (both 0.86).
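
Because the top models differ by only about 0.01 on a single train/test split, a cross-validated comparison can show whether such a gap is larger than the fold-to-fold variation. A minimal sketch using `cross_val_score` on synthetic data; `X_demo`, `y_demo`, and the smaller `n_estimators` are assumptions chosen to keep the example fast, not the notebook's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary problem; not the salary data.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)

models = {
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=50, random_state=42),
}

# Five accuracy scores per model, one per fold.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The same loop could be run with `x` and `y` from this notebook to compare the models on the salary data.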