# Ensemble Methods
The classifiers in this notebook make predictions based on the [Speed Dating dataset](https://www.openml.org/search?type=data&status=active&id=40536) from the [OpenML website](https://www.openml.org).

In [None]:
import pandas as pd
import time
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8-whitegrid") # Plot style

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

%load_ext autoreload
%autoreload 2

In [None]:
# Import the Speed Dating dataset from OpenML website
from sklearn.datasets import fetch_openml
sd_data = fetch_openml(name="SpeedDating")

In [None]:
sd_data.keys()

In [None]:
# Get the data description
print(sd_data.DESCR)

In [None]:
# Get the data and classes (labels, targets)
X = sd_data.data
y = sd_data.target

# Create a data frame
X = pd.DataFrame(X, columns=sd_data.feature_names)
y = pd.Series(y)

X.head()

In [None]:
# Get the (number of samples, number of features) tuple
X.shape

In [None]:
# Check if there are missing values
X.isna().sum()

In [None]:
# Delete all columns (axis=1) with missing values. This operation reduces the number of features.
# To delete all rows with missing values instead, set axis=0. That operation reduces the number of samples.
X.dropna(axis=1, inplace=True)

In [None]:
# Get the (number of samples, number of features) tuple after deleting all columns with missing values
X.shape

In [None]:
# Get the class distribution
y.value_counts() / len(y)

The class distribution is:
1. approximately 83% of the data has a negative label,
2. approximately 16% of the data has a positive label.

Therefore, the positive class is under-represented. This situation is called class imbalance. In such cases it is important to find the features that most influence a positive prediction.
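
With imbalanced classes, a purely random split can shift the class ratio between the training and testing parts. The following is a minimal sketch, on made-up toy labels rather than the Speed Dating data, of how the `stratify` argument of `train_test_split` keeps the ratio equal in both parts. The names `X_toy`, `y_toy` and the 75/25 split are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (made up): 84 negatives and 16 positives
y_toy = np.array([0] * 84 + [1] * 16)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class ratio in both parts of the split
X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, train_size=0.75, stratify=y_toy, random_state=0
)

# Fraction of positive labels in each part
frac_a, frac_b = y_a.mean(), y_b.mean()
print(frac_a, frac_b)
```

Both fractions should match the 16% of positives in the full toy label vector.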

In [None]:
# Split the data into training (80%) and testing (20%) datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

print(y_train.value_counts())
print(y_test.value_counts())

## 1. Bagging
Train and test sklearn's BaggingClassifier. The search for optimal hyperparameter values is skipped here. The ideal scenario is:
1. find the optimal number of estimators on the training dataset (e.g. using cross-validation),
2. train an ensemble model with the optimal number of estimators (i.e. individual models) on the training dataset,
3. test the trained ensemble model on the testing dataset.
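
The three steps above can be sketched with `GridSearchCV`. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, not the Speed Dating data), and the candidate values in `param_grid` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (an assumption, not the Speed Dating data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=0
)

# Step 1: choose n_estimators by 5-fold cross-validation on the training set only
param_grid = {"n_estimators": [5, 10, 20]}  # hypothetical candidate values
search = GridSearchCV(BaggingClassifier(random_state=0), param_grid, cv=5)

# Step 2: with refit=True (the default), the best setting is retrained on all training data
search.fit(X_tr, y_tr)

# Step 3: evaluate the selected model once on the held-out testing set
best_n = search.best_params_["n_estimators"]
test_acc = search.score(X_te, y_te)
print(best_n, test_acc)
```

The testing set is touched only in the last step, so the reported accuracy is not influenced by the hyperparameter search.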

In [None]:
start = time.perf_counter()
classifier = BaggingClassifier(random_state=0)
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_train), y_train))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

Here,
1. the train accuracy should be around 98%, far above the test accuracy, which indicates that the ensemble model overfits the training dataset,
2. the test accuracy should be around 84%.

In [None]:
# Get the class distribution in testing dataset
y_test.value_counts() / len(y_test)

The class distribution in the testing dataset is:
1. approximately 83% of the data has a negative label,
2. approximately 16% of the data has a positive label.

Since most of the data in the testing dataset has a negative label, a model that makes a negative prediction for every input would reach about 83% accuracy on the testing dataset. The obtained test accuracy of about 84% is therefore not a good result, because it barely improves on a model that always predicts the negative class. The cause is class imbalance: a model trained mostly on data with one label becomes biased toward that label and tends to predict the majority label of the training dataset.
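
The effect described above can be reproduced with a trivial baseline. The sketch below uses made-up labels with an assumed 84/16 split and sklearn's `DummyClassifier`. It also reports balanced accuracy and recall of the positive class, two metrics that are less forgiving of a model that ignores the minority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up labels with 84 negatives and 16 positives (an assumed split)
y_base = np.array([0] * 84 + [1] * 16)
X_base = np.zeros((len(y_base), 1))  # the baseline ignores the features

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_base, y_base)
y_pred = baseline.predict(X_base)

acc = accuracy_score(y_base, y_pred)               # share of correct predictions
bal_acc = balanced_accuracy_score(y_base, y_pred)  # mean of the per-class recalls
pos_recall = recall_score(y_base, y_pred)          # share of positives that are found
print(acc, bal_acc, pos_recall)  # 0.84 0.5 0.0
```

Accuracy alone makes the baseline look strong, while the other two metrics show that it never finds a positive sample.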

## 2. Random Forest
Train and test sklearn's RandomForestClassifier with default parameters. The results are slightly better than BaggingClassifier's, but the class imbalance problem remains.

In [None]:
start = time.perf_counter()
classifier = RandomForestClassifier(random_state=0)
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_train), y_train))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

## Balancing [Strategies](https://imbalanced-learn.readthedocs.io/en/stable/api.html#module-imblearn.over_sampling)

## 1. Over-sampling
Over-sampling duplicates samples of the minority class (in this case, the positive class), drawing them at random with replacement, until the classes are balanced.
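
To make the idea concrete, here is a minimal NumPy-only sketch of random over-sampling on made-up toy data. It illustrates the principle and is not the implementation used by imbalanced-learn. The names `X_toy`, `y_toy`, `X_over` and `y_over` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data (made up): 8 negatives (0) and 2 positives (1)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw minority rows with replacement until both classes have the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

# Append the duplicated minority rows to the original data
X_over = np.vstack([X_toy, X_toy[extra_idx]])
y_over = np.concatenate([y_toy, y_toy[extra_idx]])
print(np.bincount(y_over))  # [8 8]
```

No new information is created: every added row is an exact copy of an existing positive sample.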

In [None]:
from imblearn.over_sampling import RandomOverSampler

In [None]:
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

In [None]:
y_resampled = pd.Series(y_resampled)
y_resampled.value_counts()

### 1.1 Bagging

In [None]:
start = time.perf_counter()
classifier = BaggingClassifier(random_state = 0)
classifier.fit(X_resampled, y_resampled)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_resampled), y_resampled))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

### 1.2 Random Forest

In [None]:
start = time.perf_counter()
classifier = RandomForestClassifier(random_state = 0)
classifier.fit(X_resampled, y_resampled)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_resampled), y_resampled))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

## 2. Under-sampling
Under-sampling is the opposite of over-sampling: it keeps only a random subset of the majority class (in this case, the negative class) whose size equals the number of samples in the minority class (in this case, the positive class).
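
A minimal NumPy-only sketch of random under-sampling on made-up toy data follows. As with over-sampling, it only illustrates the principle and is not the imbalanced-learn implementation. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data (made up): 8 negatives (0) and 2 positives (1)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Keep a random subset of the majority class, as large as the minority class
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep_idx = np.sort(np.concatenate([kept_majority, minority_idx]))

X_under, y_under = X_toy[keep_idx], y_toy[keep_idx]
print(np.bincount(y_under))  # [2 2]
```

The price of the balance is discarded data: six of the eight negative rows are dropped here.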


In [None]:
from imblearn.under_sampling import RandomUnderSampler

In [None]:
rus = RandomUnderSampler(random_state=0)  # fixed seed for reproducibility, as in the over-sampling case
X_resampled2, y_resampled2 = rus.fit_resample(X_train, y_train)

In [None]:
y_resampled2 = pd.Series(y_resampled2)
y_resampled2.value_counts()

### 2.1 Bagging

In [None]:
start = time.perf_counter()
classifier = BaggingClassifier(random_state = 0)
classifier.fit(X_resampled2, y_resampled2)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_resampled2), y_resampled2))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

### 2.2 Random Forest

In [None]:
start = time.perf_counter()
classifier = RandomForestClassifier(random_state = 0)
classifier.fit(X_resampled2, y_resampled2)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_resampled2), y_resampled2))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

## Other strategies for handling missing values ([imputation](https://scikit-learn.org/stable/modules/impute.html))
If the data has missing values, deleting entire rows (or columns) is often not a good approach, because the deleted data instances (or features) might be important.

In [None]:
# Restore the original data
X = sd_data.data
y = sd_data.target

In [None]:
X = pd.DataFrame(X, columns=sd_data.feature_names)
y = pd.Series(y)

In [None]:
# Get the (number of data, number of features) tuple to make sure the original data has been restored
X.shape

## 1. [Simple Imputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer)
Instead of being deleted, missing values can be replaced, for example, with the most frequent value of the respective feature.
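
The behaviour is easy to check on a small made-up array. The sketch below also fits the imputer on the training rows only and reuses the learned values for the test rows, which is one way to keep information from the testing data out of preprocessing. The toy arrays and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up): NaN marks a missing value
X_train_toy = np.array([[1.0, 2.0], [np.nan, 2.0], [1.0, np.nan], [3.0, 4.0]])
X_test_toy = np.array([[np.nan, 5.0], [7.0, np.nan]])

imp_toy = SimpleImputer(strategy="most_frequent")
# Learn the most frequent value of each column from the training rows only
X_train_filled = imp_toy.fit_transform(X_train_toy)
# Reuse those learned values to fill the test rows
X_test_filled = imp_toy.transform(X_test_toy)

print(imp_toy.statistics_)  # [1. 2.]
print(X_test_filled)
```

The `statistics_` attribute holds the fill value learned for each column.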

In [None]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy="most_frequent")
X_imp = imp.fit_transform(X)
X_imp = pd.DataFrame(X_imp, columns=X.columns)
X_imp.isna().sum()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_imp, y, train_size=0.8, random_state=0)

### 1.1 Bagging

In [None]:
start = time.perf_counter()
classifier = BaggingClassifier(random_state = 0)
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_train), y_train))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

### 1.2 Random Forest

In [None]:
start = time.perf_counter()
classifier = RandomForestClassifier(random_state = 0)
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_train), y_train))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

## 2. [kNN Imputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer)
A missing value in a row can be replaced with the mean of that feature over the row's k nearest neighbours (by default, k = 5).
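
A small made-up example shows the mechanics with k = 2. The array and the names `X_knn_toy` and `X_knn_filled` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (made up): NaN marks a missing value
X_knn_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry becomes the mean of that column over the 2 nearest rows,
# where distance is measured on the features both rows have present
knn_imp = KNNImputer(n_neighbors=2)
X_knn_filled = knn_imp.fit_transform(X_knn_toy)
print(X_knn_filled)
```

For the first row, the two nearest rows with a known third feature are the second and third rows, so the missing value becomes (3 + 5) / 2 = 4. For the third row, the nearest rows are the second and fourth, giving (3 + 8) / 2 = 5.5.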

In [None]:
from sklearn.impute import KNNImputer
imp = KNNImputer()
X_imp = imp.fit_transform(X)
X_imp = pd.DataFrame(X_imp, columns=X.columns)
X_imp.isna().sum()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_imp, y, train_size=0.8, random_state=0)

### 2.1 Bagging

In [None]:
start = time.perf_counter()
classifier = BaggingClassifier(random_state = 0)
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_train), y_train))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

### 2.2 Random Forest

In [None]:
start = time.perf_counter()
classifier = RandomForestClassifier(random_state = 0)
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_train), y_train))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

## 3. [Iterative Imputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer)
A missing value in a feature can be predicted by a model that is trained on the rows where that feature is present, using the other features as inputs.
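
A minimal sketch on made-up data in which the second column is roughly twice the first: the imputer can pick up this relation and use it to fill the gaps. The arrays, `max_iter=10` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy data (made up): in the complete rows, the second column is twice the first
X_fit = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, 8.0], [np.nan, 3.0], [7.0, np.nan]])
X_new = np.array([[np.nan, 2.0], [6.0, np.nan], [np.nan, 6.0]])

# Each feature with missing values is modelled from the other features
imp_iter = IterativeImputer(max_iter=10, random_state=0)
imp_iter.fit(X_fit)

X_new_filled = imp_iter.transform(X_new)
print(np.round(X_new_filled))
```

The filled values should follow the learned relation approximately, for example a value near 12 next to the 6 in the second row.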

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp = IterativeImputer()
X_imp = imp.fit_transform(X)
X_imp = pd.DataFrame(X_imp, columns=X.columns)
X_imp.isna().sum()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_imp, y, train_size=0.8, random_state=0)

### 3.1 Bagging

In [None]:
start = time.perf_counter()
classifier = BaggingClassifier(random_state = 0)
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_train), y_train))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

### 3.2 Random Forest

In [None]:
start = time.perf_counter()
classifier = RandomForestClassifier(random_state = 0)
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_train), y_train))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

In [None]:
# Get the importance of each feature, sorted in descending order
df = pd.DataFrame({'feat': X_train.columns,
                   'importance': classifier.feature_importances_}).sort_values('importance', ascending=False)

# Show the features whose importance is below 0.01
df[df['importance'] < 0.01]

In [None]:
# Get the data description
print(sd_data.DESCR)