**In this notebook we will see how to implement different ensembling techniques, using the Titanic Survival dataset for classification (Logistic Regression) and the Bigmart Sales dataset for regression (Linear Regression).**

## MAX VOTING

In [2]:
import pandas as pd
import numpy as np

In [3]:
df=pd.read_csv('data_cleaned.csv')
df.shape

(891, 25)

In [4]:
df.head()

Unnamed: 0,Survived,Age,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,SibSp_0,SibSp_1,...,Parch_0,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Embarked_C,Embarked_Q,Embarked_S
0,0,22.0,7.25,0,0,1,0,1,0,1,...,1,0,0,0,0,0,0,0,0,1
1,1,38.0,71.2833,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,1,0,0
2,1,26.0,7.925,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,0,1
3,1,35.0,53.1,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,0,0,1
4,0,35.0,8.05,0,0,1,0,1,1,0,...,1,0,0,0,0,0,0,0,0,1


In [5]:
x=df.drop(['Survived'],axis=1)
y=df['Survived']

In [6]:
from sklearn.model_selection import train_test_split
train_x,valid_x,train_y,valid_y=train_test_split(x,y,random_state=42,stratify=y)

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

In [8]:
m1=LogisticRegression()
m1.fit(train_x,train_y)
p1=m1.predict(valid_x)
print('Accuracy for Logistic Regression is :',m1.score(valid_x,valid_y))


Accuracy for Logistic Regression is : 0.7892376681614349


*When we make predictions using logistic regression, we get an accuracy of about 79%.*

In [9]:
p1[:10]

array([0, 1, 0, 1, 0, 1, 1, 0, 0, 0], dtype=int64)

*Here we are looking at the first 10 predictions made by our logistic regression model*

In [10]:
m2=KNeighborsClassifier(n_neighbors=5)
m2.fit(train_x,train_y)
p2=m2.predict(valid_x)
print('Accuracy when using KNeighborsClassifier is :',m2.score(valid_x,valid_y))

Accuracy when using KNeighborsClassifier is : 0.6816143497757847


*Using KNeighborsClassifier, we get an accuracy of about 68%, which is worse than what we got for Logistic Regression.*

In [11]:
p2[:10]

array([0, 1, 1, 1, 0, 0, 1, 0, 1, 0], dtype=int64)

*The predictions made by the 2nd model are as above*

In [13]:
m3=DecisionTreeClassifier(max_depth=5)
m3.fit(train_x,train_y)
p3=m3.predict(valid_x)
print('Accuracy for Decision Tree is',m3.score(valid_x,valid_y))

Accuracy for Decision Tree is 0.7668161434977578


*Using the DecisionTreeClassifier, we get an accuracy of about 77%, better than KNN but slightly below Logistic Regression.*

In [14]:
p3[:10]

array([0, 1, 0, 1, 0, 1, 1, 0, 0, 0], dtype=int64)

*The predicted values by the decision tree classifier are as above*

**Now we have the predictions of all 3 models, let's combine them and see the results.**

In [15]:
from statistics import mode
# Majority vote: for each validation row, keep the most common label among the 3 models
final_prediction=np.array([])
for i in range(0,len(valid_x)):
    final_prediction=np.append(final_prediction,mode([p1[i],p2[i],p3[i]]))

In [16]:
from sklearn.metrics import accuracy_score

In [17]:
accuracy_score(valid_y,final_prediction)

0.7982062780269058

*So we can see that by combining the 3 models, we get an accuracy of nearly 80%, slightly better than the best single model (about 79% for Logistic Regression).*
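*The same majority vote can be computed without a Python loop. The sketch below is only an illustration: it hard-codes the first 10 predictions printed above for each model, and it assumes binary 0/1 labels and an odd number of models, so a tie cannot occur.*

```python
import numpy as np

# First 10 predictions of each model, copied from the outputs above
p1 = np.array([0, 1, 0, 1, 0, 1, 1, 0, 0, 0])  # Logistic Regression
p2 = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 0])  # KNN
p3 = np.array([0, 1, 0, 1, 0, 1, 1, 0, 0, 0])  # Decision Tree

# One row per model, one column per validation sample
votes = np.vstack([p1, p2, p3])

# With 0/1 labels, class 1 wins when more than half of the models vote for it
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)
print(majority)
```

*For these 10 rows KNN is outvoted wherever it disagrees with the other two models.*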

## AVERAGING

In [18]:
data=pd.read_csv('train_cleaned.csv')
data.shape

(8523, 46)

In [19]:
data.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales,Item_Fat_Content_LF,Item_Fat_Content_Low Fat,Item_Fat_Content_Regular,Item_Fat_Content_low fat,Item_Fat_Content_reg,...,Outlet_Size_High,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,9.3,0.016047,249.8092,1999,3735.138,0,1,0,0,0,...,0,1,0,1,0,0,0,1,0,0
1,5.92,0.019278,48.2692,2009,443.4228,0,0,1,0,0,...,0,1,0,0,0,1,0,0,1,0
2,17.5,0.01676,141.618,1999,2097.27,0,1,0,0,0,...,0,1,0,1,0,0,0,1,0,0
3,19.2,0.0,182.095,1998,732.38,0,0,1,0,0,...,0,0,0,0,0,1,1,0,0,0
4,8.93,0.0,53.8614,1987,994.7052,0,1,0,0,0,...,1,0,0,0,0,1,0,1,0,0


In [20]:
x=data.drop(['Item_Outlet_Sales'],axis=1)
y=data['Item_Outlet_Sales']

In [21]:
from sklearn.model_selection import train_test_split
train_x,valid_x,train_y,valid_y=train_test_split(x,y,random_state=42)

In [22]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

In [23]:
m4=LinearRegression()
m4.fit(train_x,train_y)
p4=m4.predict(valid_x)
print('R^2 for Linear Regression is:',m4.score(valid_x,valid_y))

R^2 for Linear Regression is: 0.5669780185279932


*Using Linear Regression gives us an R^2 of about 0.57 on the validation set. Note that R^2 is a goodness-of-fit score, not an accuracy.*
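*The score reported by a regressor's score method is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on toy numbers (the helper name r2 and the toy arrays are made up for illustration) to show that it is 1 for perfect predictions, 0 for always predicting the mean, and negative when predictions are worse than the mean.*

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean predictor
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])             # toy targets, not from the dataset

perfect = r2(y_true, y_true)                        # predictions equal to the targets
baseline = r2(y_true, np.full(4, y_true.mean()))    # always predict the mean
too_large = r2(y_true, 3 * y_true)                  # predictions three times too large
print(perfect, baseline, too_large)
```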

In [24]:
p4[:10]

array([1363.7362535 ,  721.68330566,  885.31491571, 4239.23414745,
       3345.97312369,  614.5804804 , 4761.29272429, 2070.21994734,
       1404.80127708, 2824.53236367])

*Here again we are looking at the first 10 predictions made by our Linear Regression model*

In [25]:
m5=KNeighborsRegressor(n_neighbors=11)
m5.fit(train_x,train_y)
p5=m5.predict(valid_x)
print('R^2 for Nearest_Neighbors is:',m5.score(valid_x,valid_y))

R^2 for Nearest_Neighbors is: 0.5021719204402499


In [26]:
p5[:10]

array([ 930.36470909,  767.90950909,  708.59278182, 4797.51269091,
       3203.64801818,  559.45358182, 5009.23709091, 1722.48512727,
       1556.82198182, 3104.14118182])

*Here we are looking at the first 10 predictions made by our KNN model*

In [27]:
m6=DecisionTreeRegressor(max_depth=11)
m6.fit(train_x,train_y)
p6=m6.predict(valid_x)
print('R^2 for DTRegressor is:',m6.score(valid_x,valid_y))

R^2 for DTRegressor is: 0.5188824229219824


In [28]:
p6[:10]

array([ 893.14043636,  673.5359619 ,  673.5359619 , 4880.49733623,
       3119.38500374,  448.57551304, 5224.8655    ,  351.8753    ,
       1486.37150811, 3119.38500374])

*Here we are looking at the first 10 predictions made by our DTRegressor model*

In [29]:
from statistics import mean
# Simple averaging: for each validation row, take the mean of the 3 models' predictions
final_prediction=np.array([])
for i in range(0,len(valid_x)):
    final_prediction=np.append(final_prediction,mean([p4[i],p5[i],p6[i]]))

In [30]:
from sklearn.metrics import r2_score
r2_score(valid_y,final_prediction)

0.5781647943125985

*Here the final R^2 (about 0.58) is only slightly better than the best single model (about 0.57 for Linear Regression). Let's see if adding weights improves the result.*

**Note: we can also tune the hyperparameters of the models to make better predictions.**
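*The row-by-row averaging loop can also be written as a single NumPy call. A minimal sketch, using only the first four predictions printed above for each model in this section:*

```python
import numpy as np

# First four validation predictions of each model, copied from the outputs above
p4 = np.array([1363.7362535, 721.68330566, 885.31491571, 4239.23414745])  # Linear Regression
p5 = np.array([930.36470909, 767.90950909, 708.59278182, 4797.51269091])  # KNN
p6 = np.array([893.14043636, 673.5359619, 673.5359619, 4880.49733623])    # Decision Tree

# Stack the models row-wise and average over the model axis
avg = np.mean([p4, p5, p6], axis=0)
print(avg)
```

*Each averaged value lies between the smallest and largest of the three individual predictions for that row.*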

## WEIGHTED AVERAGING

In [31]:
import numpy as np
import pandas as pd

In [32]:
df=pd.read_csv('train_cleaned.csv')
df.shape

(8523, 46)

In [33]:
df.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales,Item_Fat_Content_LF,Item_Fat_Content_Low Fat,Item_Fat_Content_Regular,Item_Fat_Content_low fat,Item_Fat_Content_reg,...,Outlet_Size_High,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,9.3,0.016047,249.8092,1999,3735.138,0,1,0,0,0,...,0,1,0,1,0,0,0,1,0,0
1,5.92,0.019278,48.2692,2009,443.4228,0,0,1,0,0,...,0,1,0,0,0,1,0,0,1,0
2,17.5,0.01676,141.618,1999,2097.27,0,1,0,0,0,...,0,1,0,1,0,0,0,1,0,0
3,19.2,0.0,182.095,1998,732.38,0,0,1,0,0,...,0,0,0,0,0,1,1,0,0,0
4,8.93,0.0,53.8614,1987,994.7052,0,1,0,0,0,...,1,0,0,0,0,1,0,1,0,0


In [34]:
x=df.drop(['Item_Outlet_Sales'],axis=1)
y=df['Item_Outlet_Sales']

In [35]:
from sklearn.model_selection import train_test_split
train_x,valid_x,train_y,valid_y=train_test_split(x,y,random_state=39)

In [36]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

In [37]:
m4=LinearRegression()
m4.fit(train_x,train_y)
p4=m4.predict(valid_x)
print('R^2 for Linear Regression is:',m4.score(valid_x,valid_y))

R^2 for Linear Regression is: 0.576185849994943


In [38]:
p4[:10]

array([1843.8024783 , 2584.38828838, 3467.50114756, 4010.52333799,
       2585.9557832 , 2257.15015965, 5423.19465537, 4198.68032933,
       -805.38698638, 1633.00403974])

In [39]:
m5=KNeighborsRegressor(n_neighbors=11)
m5.fit(train_x,train_y)
p5=m5.predict(valid_x)
print('R^2 for Nearest_Neighbors is:',m5.score(valid_x,valid_y))

R^2 for Nearest_Neighbors is: 0.5276537469898547


In [40]:
p5[:10]

array([2131.71001818, 2670.22116364, 3637.14434545, 3658.69205455,
       1718.30874545, 2247.61974545, 6397.30903636, 4711.68501818,
        471.99167273, 1875.5586    ])

In [41]:
m6=DecisionTreeRegressor(max_depth=7)
m6.fit(train_x,train_y)
p6=m6.predict(valid_x)
print('R^2 for DTRegressor is:',m6.score(valid_x,valid_y))

R^2 for DTRegressor is: 0.591204920699139


In [42]:
p6[:10]

array([2044.34834397, 3173.65758858, 3533.12139355, 4026.98617368,
       2559.94238306, 2044.34834397, 6376.74245484, 4176.49755165,
        181.34816989, 1646.470136  ])

In [43]:
from statistics import mean
# Weighted mean with weights 2:1:2: Linear Regression and the Decision Tree count twice, KNN once
final_prediction=np.array([])
for i in range(0,len(valid_x)):
    final_prediction=np.append(final_prediction,mean([p4[i],p4[i],p5[i],p6[i],p6[i]]))

In [44]:
from sklearn.metrics import r2_score
r2_score(valid_y,final_prediction)

0.6048666821403824

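*So by using weighted averaging, the ensemble reaches an R^2 of about 0.60, higher than each of the three individual models on this split (0.58, 0.53 and 0.59). Note that this section uses random_state=39 rather than 42, so its numbers are not directly comparable with the AVERAGING section.*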
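*Repeating a model's prediction inside the mean is the same as giving it an integer weight. The sketch below, which uses only the first four predictions printed above for each model, checks that np.average with weights 2, 1, 2 matches the repeated-entry mean.*

```python
import numpy as np

# First four validation predictions of each model, copied from the outputs above
p4 = np.array([1843.8024783, 2584.38828838, 3467.50114756, 4010.52333799])  # Linear Regression
p5 = np.array([2131.71001818, 2670.22116364, 3637.14434545, 3658.69205455])  # KNN
p6 = np.array([2044.34834397, 3173.65758858, 3533.12139355, 4026.98617368])  # Decision Tree

# Explicit weights: Linear Regression 2, KNN 1, Decision Tree 2
weighted = np.average([p4, p5, p6], axis=0, weights=[2, 1, 2])

# Same result obtained by repeating the more trusted models
repeated = np.mean([p4, p4, p5, p6, p6], axis=0)

print(weighted)
print(np.allclose(weighted, repeated))
```

*Explicit weights make it easier to try non-integer values, which the repetition trick cannot express.*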

## RANK AVERAGING

In [45]:
m1_score=m4.score(valid_x,valid_y)
m2_score=m5.score(valid_x,valid_y)
m3_score=m6.score(valid_x,valid_y)

In [48]:
index_=[1,2,3]
valid_r2=[m1_score,m2_score,m3_score]

rank_eval=pd.DataFrame({'Score':valid_r2},index=index_)
rank_eval

| | Score |
|---|---|
| 1 | 0.576186 |
| 2 | 0.527654 |
| 3 | 0.591205 |


In [51]:
Sort_rank=rank_eval.sort_values('Score')
Sort_rank

| | Score |
|---|---|
| 2 | 0.527654 |
| 1 | 0.576186 |
| 3 | 0.591205 |


In [55]:
Sort_rank['rank']=[i for i in range(1,4)]
Sort_rank

| | Score | rank |
|---|---|---|
| 2 | 0.527654 | 1 |
| 1 | 0.576186 | 2 |
| 3 | 0.591205 | 3 |


In [56]:
Sort_rank['weight']=Sort_rank['rank']/Sort_rank['rank'].sum()
Sort_rank

| | Score | rank | weight |
|---|---|---|---|
| 2 | 0.527654 | 1 | 0.166667 |
| 1 | 0.576186 | 2 | 0.333333 |
| 3 | 0.591205 | 3 | 0.5 |
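*Each weight is the model's rank divided by the sum of all ranks, so the best model gets the largest weight and the weights sum to 1. As a sketch, the same ranks and weights can be derived directly from the three validation scores with NumPy (this assumes there are no tied scores):*

```python
import numpy as np

# Validation R^2 of models 1, 2, 3 (Linear Regression, KNN, Decision Tree), from the table above
scores = np.array([0.576186, 0.527654, 0.591205])

# argsort of argsort gives each score's position in ascending order; +1 makes ranks start at 1
ranks = scores.argsort().argsort() + 1

# Normalise the ranks so that the weights sum to 1
weights = ranks / ranks.sum()

print(ranks)
print(weights)
```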


In [57]:
# Scale each model's predictions by its rank weight (multiply, not add), then sum
wt1=p4*Sort_rank.loc[1,'weight']
wt2=p5*Sort_rank.loc[2,'weight']
wt3=p6*Sort_rank.loc[3,'weight']
rank_pred=wt1+wt2+wt3
rank_pred

In [59]:
r2_score(valid_y,rank_pred)

*The weights must multiply the predictions. Adding them instead (p4+weight, and so on) makes rank_pred roughly the sum of the three models' predictions, about three times too large, and gives an R^2 of about -8.04 on this split.*
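*A rank-weighted average multiplies each model's predictions by its weight and then sums the results. The sketch below is an illustration only: it uses the first four predictions printed for each model in the WEIGHTED AVERAGING section and the rank weights 2/6, 1/6 and 3/6, and writes the weighted sum as a matrix product.*

```python
import numpy as np

# First four validation predictions of each model, copied from the outputs above
p4 = np.array([1843.8024783, 2584.38828838, 3467.50114756, 4010.52333799])  # Linear Regression
p5 = np.array([2131.71001818, 2670.22116364, 3637.14434545, 3658.69205455])  # KNN
p6 = np.array([2044.34834397, 3173.65758858, 3533.12139355, 4026.98617368])  # Decision Tree

# Rank weights for models 1, 2, 3 (ranks 2, 1, 3 out of a rank total of 6)
weights = np.array([2, 1, 3]) / 6

preds = np.vstack([p4, p5, p6])  # shape (3, 4): one row per model
rank_pred = weights @ preds      # shape (4,): weighted sum over the models

print(rank_pred)
```

*Because the weights are non-negative and sum to 1, every combined prediction stays within the range spanned by the individual models, which is a quick sanity check on the scale of rank_pred.*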