# Exercise 1 - AdaBoost
### Duarte Balata (46304) e Miguel Oliveira (55772)

We begin by importing some necessary packages.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

#Common Model Helpers
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn import metrics
from sklearn.metrics import classification_report, accuracy_score, make_scorer 
from sklearn.model_selection import train_test_split

## Data Pre-Processing 

We do some pre-processing work on the data file which includes loading the data with the correct headers, checking the data types and looking for errors in the data.

In [2]:
df = pd.read_csv('/Users/migueloliveira/Documents/Data Science/2/AAA/Ex.15 March/breast-cancer-wisconsin.data', 
                 names=['id', 
                        'Clump_Thickness', 
                        'Uniformity_of_Cell_Size', 
                        'Uniformity_of_Cell_Shape', 
                        'Marginal_Adhesion', 
                        'Single_Epithelial_Cell_Size',
                        'Bare_Nuclei',
                        'Bland_Chromatin',
                        'Normal_Nucleoli',
                        'Mitoses',
                        'Class'
                       ],  
                 sep=",")
print(df.shape)
df.head()

(699, 11)


Unnamed: 0,id,Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


We notice below that column Bare_Nuclei has a data type as object - we will briefly investigate why.

In [3]:
df.dtypes

id                              int64
Clump_Thickness                 int64
Uniformity_of_Cell_Size         int64
Uniformity_of_Cell_Shape        int64
Marginal_Adhesion               int64
Single_Epithelial_Cell_Size     int64
Bare_Nuclei                    object
Bland_Chromatin                 int64
Normal_Nucleoli                 int64
Mitoses                         int64
Class                           int64
dtype: object

In [4]:
df.Bare_Nuclei.value_counts()

1     402
10    132
5      30
2      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: Bare_Nuclei, dtype: int64

We noticed that there was an error with values as "?" - To continue with the exercise , we change ? to the most frequent value "1". 

In [5]:
df['Bare_Nuclei'].values[df.Bare_Nuclei == '?'] = 1

Below, we will change all values data types to floats in order to standardize the data further down the process

In [6]:
df=df.astype(float)

We will drop ID column.

In [7]:
data = df.drop(["id"],axis = 1)
data.head()

Unnamed: 0,Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,2.0
1,5.0,4.0,4.0,5.0,7.0,10.0,3.0,2.0,1.0,2.0
2,3.0,1.0,1.0,1.0,2.0,2.0,3.0,1.0,1.0,2.0
3,6.0,8.0,8.0,1.0,3.0,4.0,3.0,7.0,1.0,2.0
4,4.0,1.0,1.0,3.0,2.0,1.0,3.0,1.0,1.0,2.0


Moving on to Train Test Split - In this particular case, the data does not need to be standardized since the range seems to be the same for all the columns - nevertheless we will use Standard Scaler to ensure this. 

In [10]:
X = data.iloc[:,0:9].values
Y = data.iloc[:,9:].values

In [11]:
X = X.astype(float)

In [12]:
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.3,random_state=5)

In [13]:
std=StandardScaler()

In [14]:
X_train=std.fit_transform(X_train)
X_test=std.fit_transform(X_test)

## Logistic Regression

We will use Logistic Regression algorithm first to obtain results that will be later be compared with the AdaBoost method. 

Logistic regression can be used to predict whether something is True or False - which in this case will predicting the class. Similar to linear regression, however linear regression is used to predict a continuous values. In addition, linear regression uses a line to fit in the data while Logistic Regression fits an S-shape curve based of the logistic function. 

Below we train the Logistic Regression model and print out the accuracy of the model.

In [48]:
#Train the model
model = LogisticRegression()
model.fit(X_train,y_train.ravel())
model.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, model.predict(X_test)))

Accuracy: 0.9428571428571428


## AdaBoost

Adaboost takes the boosting approach by using a weak algorithm that in this case we will call base learner experts. Each weak learner takes on a sample of the data , and learns it - in the case below we will do use with a logistic Regression Base Learner (and later on with decision trees). 

As each learner is train it then filters the data selected to train the following one. The filter functions by adding all the incorrectly classified data plus new data into the next learner. AdaBoost works by combining all the learners hypothesis into a single prediction and adjusting the weights of the data that is harder to classify.

### Logistic Regression Base Learner Experts

In [19]:
from sklearn.ensemble import AdaBoostClassifier

To use a Logistic Regression Base learner , we will identify the Logistic Regression model as the base_estimator. First we will begin with 7 base learner experts.

In [47]:
AdaModel_LR= AdaBoostClassifier(n_estimators=7, base_estimator=model, learning_rate=0.1)
model_1 = AdaModel_LR.fit(X_train,y_train.ravel())
model_1.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, model_1.predict(X_test)))

Accuracy: 0.919047619047619


#### Experimenting with the number of learner experts (Logistic Regression)

1 Learner Expert

In [42]:
AdaModel_LR= AdaBoostClassifier(n_estimators=1, base_estimator=model, learning_rate=0.1)
model_1 = AdaModel_LR.fit(X_train,y_train.ravel())
model_1.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, model_1.predict(X_test)))

Accuracy: 0.919047619047619


10 Learner Experts

In [40]:
AdaModel_LR= AdaBoostClassifier(n_estimators=10, base_estimator=model, learning_rate=0.1)
model_1 = AdaModel_LR.fit(X_train,y_train.ravel())
model_1.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, model_1.predict(X_test)))

Accuracy: 0.9238095238095239


100 Learner Experts

In [41]:
AdaModel_LR= AdaBoostClassifier(n_estimators=100, base_estimator=model, learning_rate=0.1)
model_1 = AdaModel_LR.fit(X_train,y_train.ravel())
model_1.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, model_1.predict(X_test)))

Accuracy: 0.9476190476190476


### Decision Tree Base Learner Experts

This is the default of the AdaBoost model in skitlearn. Below, we will experiment with 7, 1 and 10 and 100 learner experts to compare results with the previous models.

When using AdaBoost with Decision Tree Base Learner Experts, what is usually a decision tree of "n" nodes and leaves which can vary in depth, becomes only a tree with one node and two leaves. This therefore is a weak learning algorithm, however it computes an outcome based on this small tree. As we increase the number of leaner experts, the final prediction will be an outcome of all hypothesis of each iteration. It is important to note that the errors that each small tree makes influence the prediction of the following tree by updating the weights.  

7 Learner Experts

In [43]:
AdaModel_DT = AdaBoostClassifier(n_estimators=7,
                              learning_rate=0.1)
model_2 = AdaModel_DT.fit(X_train,y_train.ravel())
model_2.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, model_2.predict(X_test)))

Accuracy: 0.9285714285714286


#### Experimenting with the number of learner experts (Decision Trees)

1 Learner Expert

In [44]:
AdaModel = AdaBoostClassifier(n_estimators=1,
                              learning_rate=0.1)
model_2 = AdaModel.fit(X_train,y_train.ravel()) 
model_2.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, model_2.predict(X_test)))

Accuracy: 0.9047619047619048


10 Learner Experts

In [45]:
AdaModel = AdaBoostClassifier(n_estimators=10,
                              learning_rate=0.1)
model_2 = AdaModel.fit(X_train,y_train.ravel()) 
model_2.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, model_2.predict(X_test)))

Accuracy: 0.9428571428571428


100 Learner Experts

In [46]:
AdaModel = AdaBoostClassifier(n_estimators=100,
                              learning_rate=0.1)
model_2 = AdaModel.fit(X_train,y_train.ravel()) 
model_2.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, model_2.predict(X_test)))

Accuracy: 0.9476190476190476


## Results

As expected - as the number of learner experts increased , the prediction results improved. It is also curious to note that when using AdaBoost with a logistic model , the best result (with 10 and above learning experts) was equal to that of a Logistic Regression model without AdaBoost. It also indicates that from a certain threeshold of learning experts the prediction accuracy reaches its best and does not improve.

Prediction Results using AdaBoost with over 10 Decision Tree Learner experts performed better than Logistic Regression - Indicating that perhaps AdaBoost is most effective with Decision Trees. 
