# Ensemble Modelling

You take the predictions from multiple ML models and combine them, by voting or averaging, to create a single final prediction.

### <b>4 Techniques: </b>
1. Max Voting
2. Averaging
3. Weighted Averaging
4. Rank Averaging
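
The three averaging techniques in the list combine predicted scores or probabilities rather than class labels. The following is a minimal sketch, assuming made-up probabilities from three models and made-up weights; none of these numbers come from the Titanic data used later.

```python
import numpy as np

# Hypothetical positive-class probabilities: rows = models, columns = samples
probs = np.array([
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.4, 0.3, 0.5],
    [0.7, 0.1, 0.9, 0.3],
])

# Averaging: every model counts equally
avg = probs.mean(axis=0)

# Weighted averaging: hypothetical weights (summing to 1) favour the first model
weights = np.array([0.5, 0.3, 0.2])
weighted_avg = weights @ probs

# Rank averaging: replace each model's scores by their ranks (1 = lowest),
# then average the ranks; the double argsort assumes no tied scores
ranks = probs.argsort(axis=1).argsort(axis=1) + 1
rank_avg = ranks.mean(axis=0)

print(avg, weighted_avg, rank_avg, sep="\n")
```

Rank averaging only uses the ordering of each model's scores, so it can help when the models produce scores on different scales.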


## <b>Max Voting:</b> 
Whatever most of my models predict is what I’ll go ahead with. I take the predicted class from each individual model and hold a vote; the class with the most votes becomes the final prediction. This works only for classification problems.
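
Before moving to real data, here is a small sketch of the voting idea on its own. The helper `max_vote` and the three prediction lists are hypothetical and exist only for illustration.

```python
from collections import Counter

def max_vote(*model_predictions):
    """Return the most common label per sample across several models."""
    voted = []
    for labels in zip(*model_predictions):
        # most_common(1) returns [(label, count)] for the top label
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Hypothetical predictions from three classifiers for five samples
preds_a = [0, 1, 1, 0, 1]
preds_b = [0, 0, 1, 1, 1]
preds_c = [1, 0, 1, 0, 0]

print(max_vote(preds_a, preds_b, preds_c))  # [0, 0, 1, 0, 1]
```

With an odd number of models and two classes there is always a strict majority, so ties cannot occur.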


In [3]:
# Import the libraries used to load and handle the Titanic dataset (a classification problem)
import pandas as pd
import numpy as np

In [2]:
# Load the cleaned Titanic data
data_mv = pd.read_csv('data_cleaned.csv')

In [6]:
data_mv.shape

(891, 25)

In [7]:
data_mv.head()

Unnamed: 0,Survived,Age,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,SibSp_0,SibSp_1,...,Parch_0,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Embarked_C,Embarked_Q,Embarked_S
0,0,22.0,7.25,0,0,1,0,1,0,1,...,1,0,0,0,0,0,0,0,0,1
1,1,38.0,71.2833,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,1,0,0
2,1,26.0,7.925,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,0,1
3,1,35.0,53.1,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,0,0,1
4,0,35.0,8.05,0,0,1,0,1,1,0,...,1,0,0,0,0,0,0,0,0,1


In [14]:
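# Separate the features from the target.
# Note: the saved index column 'Unnamed: 0' stays in X_mv as one of the 24 features.
X_mv = data_mv.drop(['Survived'], axis=1)
y_mv = data_mv['Survived']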

In [18]:
# Split into train and test sets, stratified on the target (default test size is 25%)

from sklearn.model_selection import train_test_split

X_mv_train, X_mv_test, y_mv_train, y_mv_test = train_test_split(X_mv, y_mv, random_state=101, stratify=y_mv)

In [20]:
# Checking if data looks correct
X_mv_test.shape, X_mv_train.shape, y_mv_test.shape, y_mv_train.shape

((223, 24), (668, 24), (223,), (668,))

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

In [29]:
# Logistic Regression Model
model1 = LogisticRegression(solver='liblinear')
model1.fit(X_mv_train, y_mv_train)

predict1 = model1.predict(X_mv_test)

predict1[:10], model1.score(X_mv_test, y_mv_test)

(array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0], dtype=int64), 0.7757847533632287)

In [36]:
# K-Nearest Neighbours Model
model2 = KNeighborsClassifier(n_neighbors=5)
model2.fit(X_mv_train, y_mv_train)

predict2 = model2.predict(X_mv_test)

predict2[:10], model2.score(X_mv_test, y_mv_test)

(array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0], dtype=int64), 0.7399103139013453)

In [40]:
# Decision Tree Model

model3 = DecisionTreeClassifier(max_depth=8)
model3.fit(X_mv_train, y_mv_train)
predict3 = model3.predict(X_mv_test)

predict3[:10], model3.score(X_mv_test, y_mv_test)

(array([1, 0, 0, 1, 1, 1, 0, 0, 0, 0], dtype=int64), 0.7847533632286996)

In [50]:
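# Max voting: for each test sample, take the most common prediction across the three models

from statistics import mode
final_pred = []

for i in range(len(y_mv_test)):
    # Three binary predictions always have a strict majority, so mode() is unambiguous.
    # np.append returns a float array, which is why final_pred prints as 1., 0., ...
    final_pred = np.append(final_pred, mode([predict1[i], predict2[i], predict3[i]]))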


In [51]:
# Manually checking the results
predict1[:10], predict2[:10], predict3[:10], final_pred[:10]

(array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0], dtype=int64),
 array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0], dtype=int64),
 array([1, 0, 0, 1, 1, 1, 0, 0, 0, 0], dtype=int64),
 array([1., 0., 0., 0., 1., 1., 0., 0., 0., 0.]))
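
The ten predictions printed above are enough to re-check the vote without the loop. The sketch below is an alternative vectorised form, not the notebook's own method, and it assumes binary 0/1 labels and exactly three models. The names `p1`, `p2` and `p3` are stand-ins for the first ten entries of `predict1`, `predict2` and `predict3`.

```python
import numpy as np

# First ten predictions of each model, copied from the output above
p1 = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
p2 = np.array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0])
p3 = np.array([1, 0, 0, 1, 1, 1, 0, 0, 0, 0])

# Count the votes for class 1 per sample
votes_for_one = np.vstack([p1, p2, p3]).sum(axis=0)

# Class 1 wins when at least 2 of the 3 models predict it
majority = (votes_for_one >= 2).astype(int)

print(majority)  # [1 0 0 0 1 1 0 0 0 0]
```

The result agrees with the first ten values of `final_pred`, apart from being integers rather than floats.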

In [53]:
from sklearn.metrics import accuracy_score 

accuracy_score(y_mv_test, final_pred)

0.8026905829596412

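Thus, on this test split, the max-voting ensemble is more accurate (about 0.803) than each of the three individual models (about 0.776 for logistic regression, 0.740 for KNN, and 0.785 for the decision tree).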
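
scikit-learn also provides `VotingClassifier`, which performs the same kind of hard (majority) vote without a manual loop. The following is a minimal sketch on synthetic data from `make_classification`. The data, sample sizes and random seeds are assumptions for illustration, so the score will not match the Titanic results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# voting="hard" means each model casts one vote for its predicted class
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(solver="liblinear")),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=8, random_state=0)),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)

ensemble_pred = voter.predict(X_test)
print(voter.score(X_test, y_test))
```

The same three estimators could be fitted on `X_mv_train` and `y_mv_train` instead, which should reproduce the manual vote as long as the individual models are trained identically.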