# Resampling Methods

## Cross-Validation

Cross-Validation (here, K-Fold Cross-Validation) randomly partitions the original dataset into K subsets, called folds. All folds are disjoint (two datasets are disjoint if they have no sample/observation in common). The method is most useful when the dataset has a small number of samples: every sample is used for testing exactly once, so we get a more reliable check of whether the model has overfitted than a single train/test split would give.

```
Step 1. Train on folds {2,3,4,5} and test on fold 1
...
Step 5. Train on folds {1,2,3,4} and test on fold 5
```

<br>
Usually, accuracy is computed as the average of the accuracy for each fold.
<br><br>

![](http://drive.google.com/uc?export=view&id=1vEpRsWfG6BVEeWTq0HWnB4rjYd80Aiwz)
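
The five steps above can be sketched with scikit-learn's `KFold`. This is only an illustration under assumptions: the data is synthetic (`make_classification`) and the model is a shallow decision tree, neither of which comes from the text above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []
for step, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    # Train on four folds, test on the remaining one.
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    acc = model.score(X[test_idx], y[test_idx])
    fold_accuracies.append(acc)
    print(f"Step {step}: accuracy on held-out fold = {acc:.3f}")

# The reported accuracy is the average over the folds.
mean_accuracy = float(np.mean(fold_accuracies))
print(f"Mean accuracy over folds: {mean_accuracy:.3f}")
```

Shuffling before splitting (`shuffle=True`) is what makes the partition random; without it `KFold` cuts the data into consecutive blocks.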

## Bootstrap

Bootstrap is a resampling technique used in machine learning that relies on random sampling with replacement (the same sample can be selected multiple times).

```
- Choose a number of bootstrap samples to perform
- Choose a bootstrap sample size
- For each bootstrap sample
  - Draw a sample with replacement with the chosen size
  - Fit a model on the data sample
  - Estimate the skill of the model on the out-of-bag sample
- Calculate the mean of the sample of model skill estimates
```

![](http://drive.google.com/uc?export=view&id=1UbMxQfvtt6HHWHQq3w10ns1QPJk-quRq)
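
A minimal sketch of the bootstrap procedure above, assuming synthetic data, a shallow decision tree as the model, and accuracy as the "skill" measure (all three are illustrative choices, not prescribed by the text):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)

n_bootstrap_samples = 20   # number of bootstrap samples (illustrative value)
sample_size = len(X)       # bootstrap sample size (illustrative value)

scores = []
for _ in range(n_bootstrap_samples):
    # Draw row indices with replacement: a row may appear several times.
    in_bag = rng.integers(0, len(X), size=sample_size)
    # Rows that were never drawn form the out-of-bag sample.
    out_of_bag = np.setdiff1d(np.arange(len(X)), in_bag)
    if out_of_bag.size == 0:
        continue  # nothing left to evaluate on
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[in_bag], y[in_bag])
    scores.append(model.score(X[out_of_bag], y[out_of_bag]))

# Mean of the per-sample skill estimates.
mean_score = float(np.mean(scores))
print(f"Mean out-of-bag accuracy: {mean_score:.3f}")
```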

# Bagging



Ensemble methods are machine learning techniques that combine several models in order to obtain a better predictive model. An ensemble requires much more computation than a single model, so there is a tradeoff between space and time cost on one side and predictive quality on the other. The base models can be created from different subsets of the same training dataset with the same algorithm, from the same dataset with different algorithms, or by any other method.

Bagging:
- Bagging (**B**ootstrap **AGG**regat**ING**) combines bootstrapping and aggregation to form an ensemble model.
- Bagging steps:
  - Starting with the original dataset, build $n$ training datasets by sampling with replacement (bootstrap samples).
  - For each training dataset, build a classifier using the same learning algorithm.
  - The final classifier is obtained by combining the results of the individual classifiers (by voting, for example).
- Bagging helps to improve the accuracy of unstable learning algorithms, such as decision trees and neural networks.
- It does not help for stable learners such as kNN, Naïve Bayesian classification or CARs.

![](http://drive.google.com/uc?export=view&id=1kUu7OwAuw-Cx0c3_GmOp-WXPdyu1QEci)
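
The bagging steps can be written out by hand in a few lines. The sketch below assumes synthetic binary data with labels 0/1 and unpruned decision trees as the base classifier; the number of models (25) is an arbitrary odd value chosen so that votes cannot tie.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
rng = np.random.default_rng(0)

n_models = 25  # odd, so a majority vote over 0/1 labels cannot tie
models = []
for _ in range(n_models):
    # Bootstrap sample: same size as the training set, drawn with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    models.append(tree.fit(X_train[idx], y_train[idx]))

# Row i holds the predictions of model i for every test sample.
all_predictions = np.array([m.predict(X_test) for m in models])
# Aggregation by voting: predict 1 when more than half of the models say 1.
bagged_prediction = (all_predictions.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = float((bagged_prediction == y_test).mean())
print(f"Bagged accuracy: {bagged_accuracy:.3f}")
```

In practice `sklearn.ensemble.BaggingClassifier` wraps the same idea; the manual loop is shown only to make the bootstrap and voting steps visible.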



## Bagging Forest vs Random Forest

- Final classifier output
  - classification: the modal value (most frequent value)
  - regression: the average over all predictions
- Bagging Forest computes the output by using an ensemble of Decision Tree classifiers.
- Random Forest also uses Decision Tree classifiers, but it modifies the greedy split method used by a classical Decision Tree. In CART, when selecting a split node, the learning algorithm is allowed to look through all attributes and all attribute values in order to select the best split. In Random Forest, each split has to be chosen from a random sample of the features/attributes.
- There is no pruning when building the trees.

![](http://drive.google.com/uc?export=view&id=1SDjlLuuELkcMI7cOTW3-Yg6Ke0BHPynw)

![](http://drive.google.com/uc?export=view&id=1x7B1Bv8K9O-VDUpJAJSO1-k1gCiOHH8W)
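
In scikit-learn the two variants differ mainly in how many features each split may inspect. The comparison below is a sketch on synthetic data; the number of trees (50) and the `"sqrt"` feature-sample size are illustrative settings, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Bagging forest: bootstrap samples, every split may use any feature.
bagging_forest = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Random forest: bootstrap samples, and each split only considers
# a random subset of sqrt(n_features) features.
random_forest = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", random_state=0)

bagging_score = bagging_forest.fit(X_train, y_train).score(X_test, y_test)
forest_score = random_forest.fit(X_train, y_train).score(X_test, y_test)
print(f"Bagging forest accuracy: {bagging_score:.3f}")
print(f"Random forest accuracy:  {forest_score:.3f}")
```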

## Extremely Randomized Trees

- Extremely Randomized Trees add another step of randomization to the way splits are computed.
- The same input training dataset is used to train all trees (no bootstrap).
- The method essentially consists of strongly randomizing both the choice of attribute and the choice of attribute value (cut-point) when splitting a tree node.
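
A short sketch using scikit-learn's `ExtraTreesClassifier`, assuming synthetic data and 50 trees (both illustrative). `bootstrap=False` is already the default and is spelled out only to mirror the "no bootstrap" point above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Every tree is trained on the full training set; the randomness comes
# from the random feature subsets and random cut-points at each split.
extra_trees = ExtraTreesClassifier(
    n_estimators=50, bootstrap=False, random_state=0)
extra_trees.fit(X_train, y_train)

extra_trees_score = extra_trees.score(X_test, y_test)
print(f"Extra-Trees accuracy: {extra_trees_score:.3f}")
```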

# Boosting



- Boosting consists of building a sequence of weak classifiers and adding them to the structure of the final strong classifier.
- Weak learner (classifier): a classification algorithm with a substantial error rate, but whose performance is not random.
- In other words, a weak learner has an accuracy only slightly better than random guessing.
- The weak classifiers are weighted according to their accuracy.
- The data is also reweighted after each weak classifier is built, so that examples that were incorrectly classified gain extra weight.
- As a result, the next weak classifiers in the sequence focus more on the examples that previous weak classifiers missed.

![](http://drive.google.com/uc?export=view&id=1nm222fahzc3NOUe0ieaWLgD5VoJ-dvkI)







## Ada Boost

- AdaBoost is a combination of **Weak Learners**, each slightly better than random guessing.
- For binary classification, if the weighted classification error of a new Weak Learner is larger than $\frac{1}{2}$, we stop looking for further Weak Learners.
- Below is the pseudocode for the original AdaBoost. Note that the scikit-learn [implementation](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_weight_boosting.py) is based on a slightly modified version, [SAMME](https://web.stanford.edu/~hastie/Papers/samme.pdf), designed for multiclass classification; it reduces to AdaBoost when the number of classes is 2.

![](http://drive.google.com/uc?export=view&id=1d8HPe8KkQIeGgPAJAbVdgqCPpG_RtcSG)
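
The original binary AdaBoost can be sketched from scratch as follows. This is a teaching sketch, not the scikit-learn implementation: it assumes labels recoded to {-1, +1}, decision stumps (`max_depth=1`) as the weak learners, synthetic data, and 20 boosting rounds, all of which are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; labels recoded from {0, 1} to {-1, +1} (assumption).
X, y01 = make_classification(n_samples=300, n_features=8, random_state=0)
y = 2 * y01 - 1

n_rounds = 20
weights = np.full(len(X), 1.0 / len(X))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error of this weak learner.
    err = float(np.sum(weights[pred != y]))
    if err >= 0.5:
        break  # no better than random guessing: stop adding learners
    err = max(err, 1e-10)  # avoid division by zero for a perfect stump

    # Weight of the weak learner in the final vote.
    alpha = 0.5 * np.log((1.0 - err) / err)
    stumps.append(stump)
    alphas.append(alpha)

    # Misclassified samples (y * pred == -1) gain weight, others lose it.
    weights = weights * np.exp(-alpha * y * pred)
    weights = weights / weights.sum()


def predict(X_new):
    """Sign of the alpha-weighted vote of all weak learners."""
    votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)


train_accuracy = float((predict(X) == y).mean())
print(f"Rounds used: {len(stumps)}, train accuracy: {train_accuracy:.3f}")
```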

# Exercises

## Ex0. Download Titanic Dataset
Columns:
- 0: Survived Indicator
- 1: Passenger Class
- 2: Name
- 3: Sex
- 4: Age
- 5: Siblings Aboard
- 6: Parents Aboard
- 7: Fare paid in £s

In [None]:
!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

## Ex1. Feature Engineering

- Process data before applying classifiers
  - check if there are null values in the dataset
  - bin continuous attributes for more efficiency
  - remove column "Name"
  - convert column "Sex" to 0/1
  - split dataset into train/test
  - for other preprocessing techniques you can access this [tutorial](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114) or this [book](https://github.com/yanshengjia/ml-road/blob/master/resources/Feature%20Engineering%20for%20Machine%20Learning.pdf)
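
As a small illustration of the binning step, the sketch below shows two ways to bin a continuous column by decade. The ages are made-up values, and the bin edges and labels are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical ages, not taken from the Titanic file.
ages = pd.Series([0.42, 5, 19, 20, 34.5, 61, 80], name="Age")

# Option 1: arithmetic binning, flooring each age to the start of its decade.
decade = (ages // 10 * 10).astype(int)

# Option 2: pd.cut with explicit edges. right=False gives half-open
# intervals [0, 10), [10, 20), ..., so an age of 20 falls into "20-29".
bins = list(range(0, 100, 10))                    # 0, 10, ..., 90
labels = [f"{lo}-{lo + 9}" for lo in bins[:-1]]   # one label per interval
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "decade": decade, "group": age_group}))
```

Note that `pd.cut` needs exactly one label per interval, that is, one fewer label than bin edges.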

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split


def go(title, df_result):
    """Print a pandas object under a dashed banner containing the title."""
    df_str = df_result.to_string().split('\n')
    max_len = max(map(len, df_str))
    # Number of dashes on each side of the title (at least 1).
    half_len = max((max_len - len(title) - 1) // 2, 1)
    print("-" * half_len, title, "-" * half_len)
    print(df_result.to_string())
    print("\n")


large_titanic_df = pd.read_csv("titanic.csv")

# Check for null values in every column.
go("Null Values", large_titanic_df.isnull().sum())

# Bin Age by decade: 0, 10, 20, ...
large_titanic_df['Age'] = large_titanic_df['Age'].apply(lambda x: int(x / 10) * 10)

# Remove the Name column and encode Sex as 0 (male) / 1 (female).
data = large_titanic_df.drop('Name', axis=1)
data['Sex'] = data['Sex'].apply(lambda x: 0 if x == 'male' else 1)

# Column 0 (Survived) is the target; the remaining columns are the features.
y = data.iloc[:, 0]
x = data.iloc[:, 1:]
print(x.head())
print(y.head())

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=20)



## Ex2. Decision Tree

- use a Decision Tree Classifier to predict whether passengers survived or not
- check if **max_depth** parameter improves accuracy
- compute confusion matrix and accuracy for train vs test

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

# Train a decision tree with a limited depth.
clf = DecisionTreeClassifier(max_depth=3)
clf = clf.fit(X_train, y_train)

# Confusion matrix and accuracy on the test set.
y_pred = clf.predict(X_test)
res = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(res)
print("Test accuracy:", metrics.accuracy_score(y_test, y_pred))

# Confusion matrix and accuracy on the training set.
y_pred_train = clf.predict(X_train)
res = confusion_matrix(y_train, y_pred_train)
plot_confusion_matrix(res)
print("Train accuracy:", metrics.accuracy_score(y_train, y_pred_train))
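
One possible way to check the effect of **max_depth** is to sweep over several values and compare train and test accuracy. The sketch below is self-contained and therefore uses synthetic data instead of the Titanic split; the list of depths is an arbitrary choice, and `None` means the tree grows without a depth limit.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=20)

depth_scores = {}
for depth in [1, 2, 3, 5, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for this depth.
    depth_scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    train_acc, test_acc = depth_scores[depth]
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A growing gap between train and test accuracy as the depth increases is the usual sign of overfitting.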

## Ex3. Random Forest

- use Random Forest to predict whether passengers survived or not
- compute confusion matrix and accuracy for train vs test

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

# Create a Random Forest classifier with default settings.
clf = RandomForestClassifier()

# Train the model on the training set.
clf.fit(X_train, y_train)

# Confusion matrix and accuracy on the test set.
y_pred = clf.predict(X_test)
res = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(res)
print("Test accuracy:", metrics.accuracy_score(y_test, y_pred))

# Confusion matrix and accuracy on the training set.
y_pred_train = clf.predict(X_train)
res = confusion_matrix(y_train, y_pred_train)
plot_confusion_matrix(res)
print("Train accuracy:", metrics.accuracy_score(y_train, y_pred_train))


## Ex4. AdaBoost

- use AdaBoost to predict whether passengers survived or not
- compute confusion matrix and accuracy for train vs test

In [None]:
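from sklearn.ensemble import AdaBoostClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

abc = AdaBoostClassifier(n_estimators=50,
                         learning_rate=1.0)
# Train the AdaBoost classifier.
model = abc.fit(X_train, y_train)

# Confusion matrix and accuracy on the test set.
y_pred = model.predict(X_test)
res = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(res)
print("Test accuracy:", metrics.accuracy_score(y_test, y_pred))

# Confusion matrix and accuracy on the training set.
# Use the AdaBoost model here, not the Random Forest `clf` from Ex3.
y_pred_train = model.predict(X_train)
res = confusion_matrix(y_train, y_pred_train)
plot_confusion_matrix(res)
print("Train accuracy:", metrics.accuracy_score(y_train, y_pred_train))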



---

[Copyright stuff](https://docs.google.com/document/d/1v7Rddbjfhb2D29vsalKZoE-WSqTwbnTZm9a-FaLYWEY/edit?usp=sharing)