In ensemble methods, a voting classifier collects the predictions of several different classifiers and returns the class that receives the most votes.

A hard voting classifier takes each classifier's predicted class for an instance and returns the class predicted by the majority.

On the other hand, a soft voting classifier picks the class with the highest class probability averaged across all classifiers. This requires every classifier to be able to estimate class probabilities.
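The difference between the two voting schemes can be seen on a single sample. The sketch below uses made-up probabilities from three hypothetical classifiers (the numbers are assumptions for illustration, not results from this dataset) and shows a case where hard and soft voting disagree.

```python
import numpy as np

# Hypothetical class probabilities for ONE sample from three classifiers.
# Columns: [P(Not Placed), P(Placed)]
probas = np.array([
    [0.45, 0.55],   # classifier A leans "Placed"
    [0.40, 0.60],   # classifier B leans "Placed"
    [0.90, 0.10],   # classifier C is very sure of "Not Placed"
])
classes = np.array(["Not Placed", "Placed"])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = classes[np.bincount(votes, minlength=len(classes)).argmax()]

# Soft voting: average the probabilities, then pick the highest average.
mean_proba = probas.mean(axis=0)
soft_winner = classes[mean_proba.argmax()]

print("hard:", hard_winner)
print("soft:", soft_winner, mean_proba)
```

Two of the three classifiers vote "Placed", so hard voting returns "Placed", but the one very confident classifier pulls the averaged probability towards "Not Placed", which is what soft voting returns.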

*Refer to the voting classifier types image in the folder.*

In this notebook we'll work through a dataset I found on Kaggle.

### Dataset we'll be using: 

Placement Data Full Class by Ben Roshan

https://www.kaggle.com/benroshan/factors-affecting-campus-placement

We want to predict the "status" column, i.e. whether a student was placed or not.

In [1]:
import pandas as pd

## Dataset Overview: 

### Overview
Placement data for students on a campus

### Columns
**sl_no**
* Serial Number

**gender**
* Gender M = Male | F = Female

**ssc_p**
* Secondary Education Percentage (10th Grade)

**ssc_b**
* Board of Education - Central/Others

**hsc_p**
* Higher Secondary Education Percentage - 12th Grade

**hsc_b**
* Board of Education - Central/Others

**hsc_s**
* Specialization in Higher Secondary Education

**degree_p**
* Degree Percentage

**degree_t**
* Undergrad Degree Type (Field of education)

**workex**
* Work Experience

**etest_p**
* Employability test percentage (conducted by college)

**specialisation**
* Post Grad (MBA) - Specialization

**mba_p**
* MBA percentage

**status**
* Status of placement - Placed/ Not placed

**salary**
* Salary offered to corporate candidates

In [2]:
data_import = pd.read_csv("Placement_Data_Full_Class.csv")
data_import

Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
0,1,M,67.00,Others,91.00,Others,Commerce,58.00,Sci&Tech,No,55.0,Mkt&HR,58.80,Placed,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed,200000.0
2,3,M,65.00,Central,68.00,Central,Arts,64.00,Comm&Mgmt,No,75.0,Mkt&Fin,57.80,Placed,250000.0
3,4,M,56.00,Central,52.00,Central,Science,52.00,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed,
4,5,M,85.80,Central,73.60,Central,Commerce,73.30,Comm&Mgmt,No,96.8,Mkt&Fin,55.50,Placed,425000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,211,M,80.60,Others,82.00,Others,Commerce,77.60,Comm&Mgmt,No,91.0,Mkt&Fin,74.49,Placed,400000.0
211,212,M,58.00,Others,60.00,Others,Science,72.00,Sci&Tech,No,74.0,Mkt&Fin,53.62,Placed,275000.0
212,213,M,67.00,Others,67.00,Others,Commerce,73.00,Comm&Mgmt,Yes,59.0,Mkt&Fin,69.72,Placed,295000.0
213,214,F,74.00,Others,66.00,Others,Commerce,58.00,Comm&Mgmt,No,70.0,Mkt&HR,60.23,Placed,204000.0


In [None]:
data_import.describe() # No missing values (we don't care about salary)

Unnamed: 0,sl_no,ssc_p,hsc_p,degree_p,etest_p,mba_p,salary
count,215.0,215.0,215.0,215.0,215.0,215.0,148.0
mean,108.0,67.303395,66.333163,66.370186,72.100558,62.278186,288655.405405
std,62.209324,10.827205,10.897509,7.358743,13.275956,5.833385,93457.45242
min,1.0,40.89,37.0,50.0,50.0,51.21,200000.0
25%,54.5,60.6,60.9,61.0,60.0,57.945,240000.0
50%,108.0,67.0,65.0,66.0,71.0,62.0,265000.0
75%,161.5,75.7,73.0,72.0,83.5,66.255,300000.0
max,215.0,89.4,97.7,91.0,98.0,77.89,940000.0


In [None]:
# Checking Data Types

for column in data_import.columns:
    print(column, data_import[column].dtype, len(data_import[column].unique()))

sl_no int64 215
gender object 2
ssc_p float64 103
ssc_b object 2
hsc_p float64 97
hsc_b object 2
hsc_s object 3
degree_p float64 89
degree_t object 3
workex object 2
etest_p float64 100
specialisation object 2
mba_p float64 205
status object 2
salary float64 46


## Data Cleansing:

As per usual, we'll need to clean our data before we can use it in our Machine Learning algorithms

1. Drop the sl_no column as we can't predict with it and drop salary as we're not going to use it
2. Label Encode: "gender", "workex", "ssc_b", "hsc_b", "specialisation"
3. One Hot Encode: "hsc_s", "degree_t"



In [None]:
dropped_columns = data_import.copy()

dropped_columns = dropped_columns.drop(["sl_no", "salary"], axis=1)
dropped_columns

Unnamed: 0,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status
0,M,67.00,Others,91.00,Others,Commerce,58.00,Sci&Tech,No,55.0,Mkt&HR,58.80,Placed
1,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed
2,M,65.00,Central,68.00,Central,Arts,64.00,Comm&Mgmt,No,75.0,Mkt&Fin,57.80,Placed
3,M,56.00,Central,52.00,Central,Science,52.00,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed
4,M,85.80,Central,73.60,Central,Commerce,73.30,Comm&Mgmt,No,96.8,Mkt&Fin,55.50,Placed
...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,M,80.60,Others,82.00,Others,Commerce,77.60,Comm&Mgmt,No,91.0,Mkt&Fin,74.49,Placed
211,M,58.00,Others,60.00,Others,Science,72.00,Sci&Tech,No,74.0,Mkt&Fin,53.62,Placed
212,M,67.00,Others,67.00,Others,Commerce,73.00,Comm&Mgmt,Yes,59.0,Mkt&Fin,69.72,Placed
213,F,74.00,Others,66.00,Others,Commerce,58.00,Comm&Mgmt,No,70.0,Mkt&HR,60.23,Placed


In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

label_encoded_data = dropped_columns.copy()

columns_to_label_encode = ["gender", "workex", "ssc_b", "hsc_b", "specialisation"]

label_encoded_data[columns_to_label_encode] = label_encoded_data[columns_to_label_encode].apply(le.fit_transform)

label_encoded_data

Unnamed: 0,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status
0,1,67.00,1,91.00,1,Commerce,58.00,Sci&Tech,0,55.0,1,58.80,Placed
1,1,79.33,0,78.33,1,Science,77.48,Sci&Tech,1,86.5,0,66.28,Placed
2,1,65.00,0,68.00,0,Arts,64.00,Comm&Mgmt,0,75.0,0,57.80,Placed
3,1,56.00,0,52.00,0,Science,52.00,Sci&Tech,0,66.0,1,59.43,Not Placed
4,1,85.80,0,73.60,0,Commerce,73.30,Comm&Mgmt,0,96.8,0,55.50,Placed
...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,1,80.60,1,82.00,1,Commerce,77.60,Comm&Mgmt,0,91.0,0,74.49,Placed
211,1,58.00,1,60.00,1,Science,72.00,Sci&Tech,0,74.0,0,53.62,Placed
212,1,67.00,1,67.00,1,Commerce,73.00,Comm&Mgmt,1,59.0,0,69.72,Placed
213,0,74.00,1,66.00,1,Commerce,58.00,Comm&Mgmt,0,70.0,1,60.23,Placed


We label encoded the columns that have exactly two categories. Below we one-hot encode the columns with more than two categories, so that no artificial ordering is implied between them.
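To illustrate why the two encodings are treated differently, here is a small sketch on a toy frame (the four rows are made up for illustration). Label encoding turns a three-category column into the integers 0, 1 and 2, which a model may read as an ordering, while one-hot encoding gives each category its own column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the hsc_s column (values chosen for illustration).
toy = pd.DataFrame({"hsc_s": ["Commerce", "Science", "Arts", "Science"]})

# Label encoding maps categories to integers in sorted order,
# which implies Arts < Commerce < Science.
label_codes = LabelEncoder().fit_transform(toy["hsc_s"])
print(label_codes)

# One-hot encoding gives each category its own 0/1 column instead.
one_hot = pd.get_dummies(toy, columns=["hsc_s"], dtype=int)
print(one_hot)
```

For a column with only two categories the two approaches carry the same information, which is why a single 0/1 column is enough there.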

In [None]:
hot_encoded_data = label_encoded_data.copy()

hot_encoded_data_y_placeholder = hot_encoded_data["status"]
hot_encoded_data = hot_encoded_data.drop("status", axis=1) # Removes the prediction column so that we don't encode it

hot_encoded_data = pd.get_dummies(hot_encoded_data)  # One-hot encodes the remaining text columns (hsc_s and degree_t)

hot_encoded_data = pd.concat([hot_encoded_data, hot_encoded_data_y_placeholder], axis=1)  # Re-attaches the target column as the last column
hot_encoded_data

Unnamed: 0,gender,ssc_p,ssc_b,hsc_p,hsc_b,degree_p,workex,etest_p,specialisation,mba_p,hsc_s_Arts,hsc_s_Commerce,hsc_s_Science,degree_t_Comm&Mgmt,degree_t_Others,degree_t_Sci&Tech,status
0,1,67.00,1,91.00,1,58.00,0,55.0,1,58.80,0,1,0,0,0,1,Placed
1,1,79.33,0,78.33,1,77.48,1,86.5,0,66.28,0,0,1,0,0,1,Placed
2,1,65.00,0,68.00,0,64.00,0,75.0,0,57.80,1,0,0,1,0,0,Placed
3,1,56.00,0,52.00,0,52.00,0,66.0,1,59.43,0,0,1,0,0,1,Not Placed
4,1,85.80,0,73.60,0,73.30,0,96.8,0,55.50,0,1,0,1,0,0,Placed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,1,80.60,1,82.00,1,77.60,0,91.0,0,74.49,0,1,0,1,0,0,Placed
211,1,58.00,1,60.00,1,72.00,0,74.0,0,53.62,0,0,1,0,0,1,Placed
212,1,67.00,1,67.00,1,73.00,1,59.0,0,69.72,0,1,0,1,0,0,Placed
213,0,74.00,1,66.00,1,58.00,0,70.0,1,60.23,0,1,0,1,0,0,Placed


In [None]:
# Scaling our data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled_data = hot_encoded_data.copy()

scaled_data_temp = scaled_data.iloc[:, -1]      # All rows of the last column ("status"); kept aside as the target
scaled_data = scaled_data.iloc[:, :-1]  # All rows, every column except the last

scaled_data = scaler.fit_transform(scaled_data)

# scaled_data = pd.concat([scaled_data, scaled_data_temp], axis=1)
scaled_data

array([[ 0.73943397, -0.02808697,  1.08245885, ..., -1.43924583,
        -0.23221018,  1.62605898],
       [ 0.73943397,  1.11336869, -0.92382264, ..., -1.43924583,
        -0.23221018,  1.62605898],
       [ 0.73943397, -0.21323793, -0.92382264, ...,  0.69480833,
        -0.23221018, -0.61498384],
       ...,
       [ 0.73943397, -0.02808697,  1.08245885, ...,  0.69480833,
        -0.23221018, -0.61498384],
       [-1.35238581,  0.61994138,  1.08245885, ...,  0.69480833,
        -0.23221018, -0.61498384],
       [ 0.73943397, -0.49096436, -0.92382264, ...,  0.69480833,
        -0.23221018, -0.61498384]])

## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

X = scaled_data
y = scaled_data_temp


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
X_train

array([[-1.35238581, -0.37061624, -0.92382264, ...,  0.69480833,
        -0.23221018, -0.61498384],
       [-1.35238581, -0.7686908 , -0.92382264, ...,  0.69480833,
        -0.23221018, -0.61498384],
       [ 0.73943397,  0.05523096,  1.08245885, ..., -1.43924583,
        -0.23221018,  1.62605898],
       ...,
       [-1.35238581,  0.23112437,  1.08245885, ..., -1.43924583,
        -0.23221018,  1.62605898],
       [-1.35238581,  0.74954705,  1.08245885, ..., -1.43924583,
        -0.23221018,  1.62605898],
       [-1.35238581,  0.15706399, -0.92382264, ...,  0.69480833,
        -0.23221018, -0.61498384]])
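Note that the cells above fit the scaler on the whole dataset before splitting, so the test rows influence the scaling statistics. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below is one way to do that; it runs on synthetic data from `make_classification` as a stand-in for the encoded features, and the `random_state` and `stratify` settings are assumptions added for reproducibility, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded placement features (an assumption).
X_demo, y_demo = make_classification(n_samples=215, n_features=16, random_state=0)

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42)

# The pipeline fits the scaler on the training rows only,
# then applies the same scaling to whatever it predicts on.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

Fixing `random_state` also makes the accuracy figures repeatable from run to run, which helps when comparing classifiers.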

## Run ML Algorithms

### Build Ensemble Predictor

In [None]:
# Instantiating and configuring our Ensemble Classifier

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_for_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(estimators = [('lr', log_clf), ('rf', rnd_for_clf), ('sc', svm_clf)], voting="hard")


In [None]:
# Training the Ensemble classifier

voting_clf.fit(X_train, y_train)

### Measure the accuracy of our classifier

In [None]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_for_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.8450704225352113
RandomForestClassifier 0.8450704225352113
SVC 0.8591549295774648
VotingClassifier 0.8732394366197183


In [None]:
y_test

67         Placed
92         Placed
188    Not Placed
113        Placed
141    Not Placed
          ...    
163        Placed
56         Placed
214    Not Placed
29     Not Placed
116        Placed
Name: status, Length: 71, dtype: object

## Let us do it for Soft Voting Classifier

In [None]:
# Instantiating and configuring our Ensemble Classifier

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf_s = LogisticRegression()
rnd_for_clf_s = RandomForestClassifier()
svm_clf_s = SVC(probability=True)

voting_clf_s = VotingClassifier(estimators = [('lr', log_clf_s), ('rf', rnd_for_clf_s), ('sc', svm_clf_s)], voting="soft")


In [None]:
# Training the Ensemble classifier

voting_clf_s = voting_clf_s.fit(X_train, y_train)

In [None]:
class_probabilities = voting_clf_s.predict_proba(X_test)  # Averaged class probabilities, one row per test instance

In [None]:
from sklearn.metrics import accuracy_score

for clf in (log_clf_s, rnd_for_clf_s, svm_clf_s, voting_clf_s):  # Evaluate the soft voting ensemble and its members
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.8450704225352113
RandomForestClassifier 0.7887323943661971
SVC 0.8591549295774648
VotingClassifier 0.8873239436619719


In [None]:
y_test

67         Placed
92         Placed
188    Not Placed
113        Placed
141    Not Placed
          ...    
163        Placed
56         Placed
214    Not Placed
29     Not Placed
116        Placed
Name: status, Length: 71, dtype: object
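As a sanity check on what soft voting computes, the sketch below averages the member classifiers' probabilities by hand and compares the result with the ensemble's own `predict_proba`. It runs on synthetic data, and the `random_state` and `max_iter` values are assumptions added so the example is repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic binary classification data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

soft_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("sc", SVC(probability=True, random_state=0))],
    voting="soft")
soft_clf.fit(X_demo, y_demo)

# Average the fitted members' probabilities by hand ...
manual = np.mean([est.predict_proba(X_demo) for est in soft_clf.estimators_], axis=0)

# ... and compare with what the ensemble reports.
ensemble = soft_clf.predict_proba(X_demo)
print(np.allclose(manual, ensemble))
```

The predicted label is then simply the class with the largest averaged probability in each row.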

## Bagging and Pasting in skLearn

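In bagging and pasting, the same training algorithm is used for every predictor, but each predictor is trained on a different random subset of the training set. In bagging (short for bootstrap aggregating) the subsets are sampled with replacement, so the same instance can appear more than once in one predictor's subset. In pasting the subsets are sampled without replacement, so each instance appears at most once per subset. In both cases an instance can still be used by several predictors.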

Sampling with replacement makes the subsets more diverse, so bagging ends up with a slightly higher bias than pasting. The extra diversity also means the predictors are less correlated, which reduces the ensemble's variance. Overall, bagging often gives better models and is therefore generally preferred over pasting.
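The two sampling schemes can be sketched with NumPy alone. The sizes and the seed below are arbitrary choices for illustration; the point is only that sampling with replacement may repeat an index inside one subset, while sampling without replacement cannot.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, subset_size = 10, 6
indices = np.arange(n_train)

# Bagging: sample WITH replacement, so an index can repeat within a subset.
bagging_subset = rng.choice(indices, size=subset_size, replace=True)

# Pasting: sample WITHOUT replacement, so indices are unique within a subset.
pasting_subset = rng.choice(indices, size=subset_size, replace=False)

# Training instances that the bagging subset never sampled.
never_sampled = np.setdiff1d(indices, bagging_subset)

print("bagging:", bagging_subset)
print("pasting:", pasting_subset)
print("never sampled by the bagging subset:", never_sampled)
```

Drawing a fresh subset like this for every predictor is what lets the same instance show up in several predictors' training data under either scheme.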

Out-of-bag (OOB) instances are the training instances that a given predictor never sampled during bagging. Since the predictor has not seen them, they can be used as a validation set (a secondary test set) without holding out extra data.

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1) # Bootstrap determines whether this is bagging or pasting, if bootstrap=False then it's pasting classifier
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
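scikit-learn can compute the out-of-bag estimate automatically when `oob_score=True` is passed to `BaggingClassifier`. The sketch below shows this on synthetic data; the dataset, `n_estimators=200` and the `random_state` values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (an assumption).
X_demo, y_demo = make_classification(n_samples=215, n_features=16, random_state=0)

oob_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=200,
    bootstrap=True, oob_score=True, random_state=0)
oob_clf.fit(X_demo, y_demo)

# Accuracy estimated only from instances each tree did not train on.
print(oob_clf.oob_score_)
```

The per-instance class probabilities behind this score are available in `oob_clf.oob_decision_function_`. Out-of-bag scoring requires `bootstrap=True`, so it is not available for pasting.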


## Random Forests

A random forest is an ensemble of decision trees, usually trained through bagging.

Extra-Trees is a forest of extremely random trees (also called Extremely Randomized Trees): besides considering a random subset of features at each split, it uses random thresholds for each candidate feature rather than searching for the best possible threshold. This trades more bias for lower variance. (Remember, bias is error from overly simple or wrong assumptions in the model, whereas variance is error from sensitivity to small fluctuations in the training set.)
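In scikit-learn, Extra-Trees has the same interface as a random forest, so the two are easy to compare side by side. The sketch below does so with 5-fold cross-validation on synthetic data; the dataset and hyperparameters are assumptions, and which model scores higher will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the placement features (an assumption).
X_demo, y_demo = make_classification(n_samples=215, n_features=16, random_state=0)

results = {}
for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              ExtraTreesClassifier(n_estimators=100, random_state=0)):
    # Mean accuracy over 5 cross-validation folds.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[type(model).__name__] = scores.mean()
    print(type(model).__name__, scores.mean())
```

Because Extra-Trees skips the search for the best threshold, it is usually faster to train than a regular random forest of the same size.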

Feature importance is a very useful benefit of random forests: it tells us how much each feature contributed to the trees' decisions.

In [None]:
## Create a random forest classifier and see the importance of each variable

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rf_clf.fit(X_train, y_train)

for name, score in zip(hot_encoded_data.iloc[:, :-1].columns, rf_clf.feature_importances_):
    print(name, ": ", score)


gender :  0.02148670306804405
ssc_p :  0.29339936380912496
ssc_b :  0.018838976565373002
hsc_p :  0.2003652388382213
hsc_b :  0.013549334164651613
degree_p :  0.15616558806515396
workex :  0.04341173295748576
etest_p :  0.07740975999889994
specialisation :  0.02074928259078174
mba_p :  0.1040722832747201
hsc_s_Arts :  0.003526114627166769
hsc_s_Commerce :  0.009226020089733599
hsc_s_Science :  0.009483333294543605
degree_t_Comm&Mgmt :  0.012135309655479822
degree_t_Others :  0.006017083107547036
degree_t_Sci&Tech :  0.01016387589307281
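The importances are easier to read when sorted. One way to do that is to wrap them in a pandas Series, as sketched below on synthetic data; the placeholder feature names and the `random_state` values are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (assumptions).
X_demo, y_demo = make_classification(n_samples=215, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its feature name and sort, largest first.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances are normalised to sum to 1, so each value can be read as that feature's share of the total.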


## Boosting

Boosting methods combine several weak learners into a strong learner. The idea is to train the predictors sequentially, with each predictor trying to correct the errors of the previous one.

### AdaBoost (Adaptive boosting)

Adaptive boosting makes each new predictor pay more attention to the training instances that its predecessor underfit, i.e. misclassified (refer to the diagram in the folder for a better understanding).
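The reweighting idea can be sketched by hand for a single boosting round. The labels and predictions below are made up, and the update shown is a simplified two-class version of the SAMME-style rule, not scikit-learn's exact implementation.

```python
import numpy as np

# Hypothetical round: 5 training instances, the weak learner gets two wrong.
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1])
weights = np.full(len(y_true), 1 / len(y_true))   # start with equal weights
learning_rate = 0.5

miss = (y_true != y_pred)

# Weighted error rate of this predictor.
error = weights[miss].sum() / weights.sum()

# Predictor weight: larger when the predictor is more accurate.
alpha = learning_rate * np.log((1 - error) / error)

# Boost the weights of misclassified instances, then renormalise.
weights = weights * np.exp(alpha * miss)
weights /= weights.sum()

print(error, alpha)
print(weights)
```

After the update the two misclassified instances carry more weight than the three correct ones, so the next weak learner is pushed to get them right.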

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200, 
    algorithm="SAMME.R", learning_rate=0.5)  # Newer scikit-learn releases deprecate "SAMME.R"; omit the algorithm argument there

ada_clf.fit(X_train, y_train)

y_pred = ada_clf.predict(X_test)

accuracy_score(y_test, y_pred)

0.8169014084507042

### Gradient Boosting

Gradient Boosting fits each new predictor to the residual errors made by the previous predictor.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(max_depth=2, n_estimators=120)

gbrt.fit(X_train, y_train)

y_pred = gbrt.predict(X_test)

accuracy_score(y_test, y_pred)

0.8450704225352113
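The residual-fitting idea is easiest to see by hand on a regression problem. The sketch below chains three small regression trees, each trained on what the previous ones left unexplained. The noisy quadratic data is synthetic and the tree depth is an arbitrary choice; `GradientBoostingClassifier` applies the same principle to a classification loss rather than to plain residuals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy quadratic data (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-0.5, 0.5, size=(100, 1))
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.normal(size=100)

trees, residual, stage_mse = [], y_demo.copy(), []
for _ in range(3):
    # Each tree is trained on the residual left by the trees before it.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residual)
    trees.append(tree)
    residual = residual - tree.predict(X_demo)   # what is still unexplained
    stage_mse.append(np.mean(residual ** 2))

# The ensemble prediction is the sum of all the trees' predictions.
y_hat = sum(tree.predict(X_demo) for tree in trees)
print(stage_mse)
```

The training error shrinks with every added tree, which is also why too many trees can overfit and why `n_estimators` and `learning_rate` are usually tuned together.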