# Ensemble methods

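Ensemble methods combine several ML models, possibly of different types or with different settings, and apply a rule such as voting or averaging to reach a consensus prediction.

Diversity among the models helps: models that make different errors tend to compensate for one another.

Ensemble methods have stood out for winning many ML competitions.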

### Bagging (Bootstrap Aggregation)

What if, instead of relying on the opinion of a single "*expert*", we consulted several experts in parallel and tried to reach a consensus? In bagging, each expert is a model trained on a bootstrap sample of the training data, that is, a sample drawn with replacement.

The final answer is a combination (for example, by voting) of the individual answers.
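The idea can be sketched by hand before using a library class. The block below is a minimal illustration, not the implementation used by scikit-learn: the helper name `bagging_predict`, the choice of decision trees as experts, the Iris data, and the fixed seeds are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def bagging_predict(x_train, y_train, x_test, n_estimators=25, seed=0):
    """Hypothetical helper: one tree per bootstrap sample, combined by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(x_train)
    votes = []
    for _ in range(n_estimators):
        # Bootstrap: draw n row indices with replacement.
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeClassifier(random_state=0).fit(x_train[idx], y_train[idx])
        votes.append(tree.predict(x_test))
    votes = np.array(votes)  # shape: (n_estimators, n_test)
    # Majority vote per test sample (assumes non-negative integer labels).
    return np.array([np.bincount(col).argmax() for col in votes.T])


x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
pred = bagging_predict(x_train, y_train, x_test)
print('Accuracy:', np.mean(pred == y_test))
```

Because every tree sees a slightly different sample, the trees disagree on some points, and the vote smooths out those individual errors.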

#### Algorithms

1. Random Forest
2. Voting Classifiers / Regressors
3. Bagging itself can be applied to any family of ML models
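As a rough illustration of the first two entries, the sketch below fits a hard-voting ensemble of three different model families and a random forest on the Wine data. The choice of base models (KNN, decision tree, Gaussian naive Bayes), the dataset, and the seeds are assumptions made for the example only.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

x, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Hard voting: each model casts one vote and the most common class wins.
voting = VotingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier()),
        ('tree', DecisionTreeClassifier(random_state=0)),
        ('nb', GaussianNB()),
    ],
    voting='hard',
).fit(x_train, y_train)
voting_acc = accuracy_score(y_test, voting.predict(x_test))

# Random forest: many decision trees, each trained on a bootstrap sample
# and restricted to a random subset of features at every split.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_train, y_train)
forest_acc = accuracy_score(y_test, forest.predict(x_test))

print('Accuracy Voting:', voting_acc)
print('Accuracy Random Forest:', forest_acc)
```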

<img src="https://miro.medium.com/max/850/1*_pfQ7Xf-BAwfQXtaBbNTEg.png">

### Boosting 

We ask an expert for an opinion on a problem, measure the error of that opinion, and then use that error when we ask another expert for a judgment on the same problem.

Boosting is a sequential method: it gradually strengthens a learning model by always using the residual error of the previous stages. The final result is also reached by consensus among all the models.
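The residual-fitting loop can be sketched in a few lines for a regression problem. This is a simplified illustration of gradient boosting with squared error, not the code inside `GradientBoostingClassifier`; the synthetic sine data, the tree depth, `learning_rate`, and `n_stages` are assumptions chosen for the example. AdaBoost follows the same sequential idea but reweights the misclassified samples instead of fitting residuals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for the example): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
n_stages = 20

# Stage 0: start from a constant prediction (the mean of the target).
prediction = np.full_like(y, y.mean())
errors = [np.mean((y - prediction) ** 2)]

for _ in range(n_stages):
    # Each new expert is trained on what the previous stages got wrong.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x, residual)
    # The final model is the sum of all stages, each scaled by the learning rate.
    prediction += learning_rate * tree.predict(x)
    errors.append(np.mean((y - prediction) ** 2))

print('Training MSE after stage 0:', errors[0])
print('Training MSE after the last stage:', errors[-1])
```

The errors recorded here are training errors, so they only show that each stage corrects part of the remaining residual. They say nothing about performance on unseen data.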


#### Algorithms

1. AdaBoost
2. Gradient Tree Boosting
3. XGBoost


<img src="https://www.cs.us.es/~fsancho/images/2018-12/boosting.png">

In [1]:
import pandas as pd 
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('../Datasets/Week9/heart.csv')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [3]:
df['target'].describe()

count    1025.000000
mean        0.513171
std         0.500070
min         0.000000
25%         0.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: target, dtype: float64

In [4]:
# Use the keyword form: the positional axis argument was removed in pandas 2.0.
x = df.drop(columns=['target'])
y = df['target']

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.35)

In [5]:
# Baseline: a single KNN classifier.
knn = KNeighborsClassifier().fit(x_train, y_train)
knn_pred = knn.predict(x_test)
print('Accuracy:', accuracy_score(y_test, knn_pred))

Accuracy: 0.724233983286908


In [6]:
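# Bagging of 50 KNN classifiers. The base model is passed positionally because
# the keyword was renamed from base_estimator to estimator in scikit-learn 1.2.
bag = BaggingClassifier(KNeighborsClassifier(), n_estimators=50).fit(x_train, y_train)
bag_pred = bag.predict(x_test)
print('Accuracy:', accuracy_score(y_test, bag_pred))

Accuracy: 0.7604456824512534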


In [7]:
# Baseline: a single decision tree.
dtc = DecisionTreeClassifier().fit(x_train, y_train)
dtc_pred = dtc.predict(x_test)
print('Accuracy:', accuracy_score(y_test, dtc_pred))

Accuracy: 0.9888579387186629


In [8]:
# Bagging of 50 decision trees (base model passed positionally, see scikit-learn 1.2 rename).
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(x_train, y_train)
bag_pred = bag.predict(x_test)
print('Accuracy:', accuracy_score(y_test, bag_pred))

Accuracy: 0.9888579387186629


In [9]:
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier

In [10]:
# Gradient boosting with 50 sequential stages.
booster = GradientBoostingClassifier(n_estimators=50).fit(x_train, y_train)
boost_pred = booster.predict(x_test)
print('Accuracy:', accuracy_score(y_test, boost_pred))

Accuracy: 0.9275766016713092


In [11]:
# AdaBoost with 50 sequential stages.
booster = AdaBoostClassifier(n_estimators=50).fit(x_train, y_train)
boost_pred = booster.predict(x_test)
print('Accuracy:', accuracy_score(y_test, boost_pred))

Accuracy: 0.9108635097493036


## Wine dataset

In [16]:
wines = datasets.load_wine()
df = pd.DataFrame(wines.data, columns=wines.feature_names)
df.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


In [17]:
x = df 
y = wines.target

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.20)

In [19]:
# Single KNN vs. bagged KNN.
knn = KNeighborsClassifier().fit(x_train, y_train)
knn_pred = knn.predict(x_test)
print('Accuracy KNN:', accuracy_score(y_test, knn_pred))

# Base model passed positionally (keyword renamed in scikit-learn 1.2).
bag = BaggingClassifier(KNeighborsClassifier(), n_estimators=50).fit(x_train, y_train)
bag_pred = bag.predict(x_test)
print('Accuracy Bagging KNN:', accuracy_score(y_test, bag_pred))

# Single decision tree vs. bagged decision trees.
dtc = DecisionTreeClassifier().fit(x_train, y_train)
dtc_pred = dtc.predict(x_test)
print('Accuracy DT:', accuracy_score(y_test, dtc_pred))

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(x_train, y_train)
bag_pred = bag.predict(x_test)
print('Accuracy Bagging DT:', accuracy_score(y_test, bag_pred))

Accuracy KNN: 0.75
Accuracy Bagging KNN: 0.7777777777777778
Accuracy DT: 0.9722222222222222
Accuracy Bagging DT: 1.0


In [21]:
booster = GradientBoostingClassifier(n_estimators=50).fit(x_train, y_train)
boost_pred = booster.predict(x_test)
print('Accuracy GBC:', accuracy_score(y_test, boost_pred))

booster = AdaBoostClassifier(n_estimators=50).fit(x_train, y_train)
boost_pred = booster.predict(x_test)
print('Accuracy Ada:', accuracy_score(y_test, boost_pred))

Accuracy GBC: 0.9444444444444444
Accuracy Ada: 0.9444444444444444


## Breast cancer dataset

In [27]:
cancer = datasets.load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df.head()


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [28]:
x = df 
y = cancer.target

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.20)

In [29]:
# Single KNN vs. bagged KNN.
knn = KNeighborsClassifier().fit(x_train, y_train)
knn_pred = knn.predict(x_test)
print('Accuracy KNN:', accuracy_score(y_test, knn_pred))

# Base model passed positionally (keyword renamed in scikit-learn 1.2).
bag = BaggingClassifier(KNeighborsClassifier(), n_estimators=50).fit(x_train, y_train)
bag_pred = bag.predict(x_test)
print('Accuracy Bagging KNN:', accuracy_score(y_test, bag_pred))

# Single decision tree vs. bagged decision trees.
dtc = DecisionTreeClassifier().fit(x_train, y_train)
dtc_pred = dtc.predict(x_test)
print('Accuracy DT:', accuracy_score(y_test, dtc_pred))

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(x_train, y_train)
bag_pred = bag.predict(x_test)
print('Accuracy Bagging DT:', accuracy_score(y_test, bag_pred))

Accuracy KNN: 0.9210526315789473
Accuracy Bagging KNN: 0.9298245614035088
Accuracy DT: 0.9473684210526315
Accuracy Bagging DT: 0.9736842105263158


In [30]:
booster = GradientBoostingClassifier(n_estimators=50).fit(x_train, y_train)
boost_pred = booster.predict(x_test)
print('Accuracy GBC:', accuracy_score(y_test, boost_pred))

booster = AdaBoostClassifier(n_estimators=50).fit(x_train, y_train)
boost_pred = booster.predict(x_test)
print('Accuracy Ada:', accuracy_score(y_test, boost_pred))

Accuracy GBC: 0.9736842105263158
Accuracy Ada: 0.9736842105263158


## Iris dataset

In [31]:
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [33]:
x = df 
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.20)

In [34]:
# Single KNN vs. bagged KNN.
knn = KNeighborsClassifier().fit(x_train, y_train)
knn_pred = knn.predict(x_test)
print('Accuracy KNN:', accuracy_score(y_test, knn_pred))

# Base model passed positionally (keyword renamed in scikit-learn 1.2).
bag = BaggingClassifier(KNeighborsClassifier(), n_estimators=50).fit(x_train, y_train)
bag_pred = bag.predict(x_test)
print('Accuracy Bagging KNN:', accuracy_score(y_test, bag_pred))

# Single decision tree vs. bagged decision trees.
dtc = DecisionTreeClassifier().fit(x_train, y_train)
dtc_pred = dtc.predict(x_test)
print('Accuracy DT:', accuracy_score(y_test, dtc_pred))

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(x_train, y_train)
bag_pred = bag.predict(x_test)
print('Accuracy Bagging DT:', accuracy_score(y_test, bag_pred))

Accuracy KNN: 0.9333333333333333
Accuracy Bagging KNN: 0.9333333333333333
Accuracy DT: 0.9666666666666667
Accuracy Bagging DT: 0.9666666666666667


In [35]:
booster = GradientBoostingClassifier(n_estimators=50).fit(x_train, y_train)
boost_pred = booster.predict(x_test)
print('Accuracy GBC:', accuracy_score(y_test, boost_pred))

booster = AdaBoostClassifier(n_estimators=50).fit(x_train, y_train)
boost_pred = booster.predict(x_test)
print('Accuracy Ada:', accuracy_score(y_test, boost_pred))

Accuracy GBC: 0.9666666666666667
Accuracy Ada: 0.9666666666666667
