<a href="https://colab.research.google.com/github/RylieWeaver9/Machine-Learning/blob/main/Voting_and_Stacking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Overview

### I build a voting classifier and a stacking ensemble to predict people's income class from their gender and age.

## 8. Voting Classifier

### First, I downloaded the data from Kaggle. Next, I mount my Google Drive and read the data with pandas.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('/content/drive/My Drive/toy_dataset.csv')

Import necessary tools.

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.ensemble import StackingClassifier

Inspect the data. I will try to predict the income class from gender and age.

In [None]:
data

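| | Number | City | Gender | Age | Income | Illness |
|---|---|---|---|---|---|---|
| 0 | 1 | Dallas | Male | 41 | 40367.0 | No |
| 1 | 2 | Dallas | Male | 54 | 45084.0 | No |
| 2 | 3 | Dallas | Male | 42 | 52483.0 | No |
| 3 | 4 | Dallas | Male | 40 | 40941.0 | No |
| 4 | 5 | Dallas | Male | 46 | 50289.0 | No |
| ... | ... | ... | ... | ... | ... | ... |
| 149995 | 149996 | Austin | Male | 48 | 93669.0 | No |
| 149996 | 149997 | Austin | Male | 25 | 96748.0 | No |
| 149997 | 149998 | Austin | Male | 26 | 111885.0 | No |
| 149998 | 149999 | Austin | Male | 25 | 111878.0 | No |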


To use gender for classification, I need to convert its categories to numbers, which I do with one-hot encoding via `pd.get_dummies`.

In [None]:
gend = data['Gender']
gend_encode = pd.get_dummies(gend)
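
As a minimal sketch of what `pd.get_dummies` produces, here is the same call on a toy series (the values are made up for illustration and are not rows from the dataset). It also shows `drop_first=True`, an optional variant not used in this notebook, which keeps a single indicator column for a binary category.

```python
import pandas as pd

# Toy stand-in for data['Gender']; the values are illustrative only.
gender = pd.Series(["Male", "Female", "Female", "Male"], name="Gender")

# One column per category, with exactly one "hot" entry per row.
gender_onehot = pd.get_dummies(gender)
print(gender_onehot.columns.tolist())  # ['Female', 'Male']

# For a binary category one column carries all the information,
# so drop_first=True yields a single indicator column.
gender_single = pd.get_dummies(gender, drop_first=True)
print(gender_single.columns.tolist())  # ['Male']
```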

I also need to convert the income values into income bins so that this becomes a classification task rather than a regression task. The for loop below builds the bin edges and the labels. The first bin starts below 0 because one entry has a negative income, which was breaking my labelling. Since that data point belongs in the lowest income class anyway, I simply set the bottom edge below it.

In [None]:
import numpy as np
bin_array=[-5000]
label_array = []
num=25000
for i in range(6):
  y=[num]
  bin_array = np.concatenate([bin_array, y])
  num = num+25000
  label_array = np.concatenate([label_array, [i]])

bin_array = np.concatenate([bin_array, [np.max(data['Income'])]])
label_array = np.concatenate([label_array, [6]])
print(bin_array)
print(label_array)

[ -5000.  25000.  50000.  75000. 100000. 125000. 150000. 177157.]
[0. 1. 2. 3. 4. 5. 6.]


In [None]:
data_clean = data.assign(Income_lvl = pd.cut(data['Income'], bins = bin_array,
                          labels = label_array))
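
To make the binning behaviour concrete, here is a small sketch that applies `pd.cut` with the same bin edges to a handful of made-up incomes (illustrative values, not rows from the dataset). By default the intervals are closed on the right, and a value outside every bin comes back as `NaN`, which is why the lowest edge has to sit below the negative income.

```python
import numpy as np
import pandas as pd

bins = np.array([-5000, 25000, 50000, 75000, 100000, 125000, 150000, 177157])
labels = np.arange(7)

# Illustrative incomes only.
incomes = pd.Series([-654.0, 25000.0, 25000.01, 150000.0, 177157.0])

# Intervals are right-closed by default: (-5000, 25000], (25000, 50000], ...
levels = pd.cut(incomes, bins=bins, labels=labels)
print(levels.tolist())  # [0, 0, 1, 5, 6]

# A value below the lowest edge falls in no bin and becomes NaN.
out_of_range = pd.cut(pd.Series([-6000.0]), bins=bins, labels=labels)
print(out_of_range.isna().tolist())  # [True]
```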

Now I define the features and the target values.

In [None]:
features = pd.concat([gend_encode, data['Age']], axis=1)
target = data_clean['Income_lvl']

Split data.

In [None]:
X_train, y_train = features[:6_000], target[:6_000]
X_valid, y_valid = features[6_000:8_000], target[6_000:8_000]
X_test, y_test = features[8_000:10_000], target[8_000:10_000]
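
The slices above take the first 10,000 rows in file order. If the file is sorted in some way (the preview shows Dallas at the top and Austin at the bottom), a shuffled, stratified split might be more representative. The following is only a sketch of that alternative on toy arrays; it is not what this notebook does, and the reported scores come from the slices above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for `features` and `target` (100 rows, 4 balanced classes).
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 4

# Hold out 20% for testing, then 25% of the remainder (20% overall) for validation.
X_rest, X_te, y_rest, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

print(len(X_tr), len(X_va), len(X_te))  # 60 20 20
```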

Import and declare classifiers.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC

In [None]:
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=42)

Train estimators in an automated way.

In [None]:
estimators = [random_forest_clf, extra_trees_clf, svm_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(random_state=42)
Training the ExtraTreesClassifier(random_state=42)
Training the LinearSVC(max_iter=100, random_state=42, tol=20)


Evaluate estimators.

In [None]:
[estimator.score(X_valid, y_valid) for estimator in estimators]

[0.6105, 0.609, 0.6375]

### This may seem like poor performance, but keep in mind that there are 7 possible income classes, so uniform random guessing would be only ~14% accurate (1/7), much lower than the performance we're getting here! A stricter baseline would be to always predict the most common class.
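
A majority-class baseline is often a more demanding reference than uniform guessing when classes are imbalanced. Here is a minimal sketch with scikit-learn's `DummyClassifier` on made-up labels (the 60/30/10 class mix is an assumption for illustration, not the distribution of this dataset). In the notebook, the same estimator could be fitted on `X_train, y_train` and scored on `X_valid, y_valid`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up imbalanced labels: class 1 covers 60% of the rows.
y_toy = np.array([1] * 60 + [2] * 30 + [3] * 10)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)  # 0.6
```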

The SVC performs the best, followed by Random Forest and then Extra Trees.

In [None]:
from sklearn.ensemble import VotingClassifier

In [None]:
named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf)
]

In [None]:
voting_clf = VotingClassifier(named_estimators)

In [None]:
voting_clf.fit(X_train, y_train)

In [None]:
voting_clf.score(X_valid, y_valid)

0.609

### The voting classifier (0.609) only matches the weakest individual classifier (Extra Trees) and is worse than the other two.
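
By default `VotingClassifier` uses hard voting: each classifier casts one vote per sample and the most common class wins. The sketch below reproduces that idea with plain NumPy on made-up predictions. The tie-breaking shown (lowest class index wins) follows from `argmax` over the vote counts and is my understanding of how scikit-learn resolves ties as well.

```python
import numpy as np

# Each row holds the class indices predicted by three classifiers for one sample.
predictions = np.array([
    [1, 1, 1],   # unanimous
    [2, 2, 1],   # majority for class 2
    [3, 0, 2],   # three-way tie
])

def hard_vote(row):
    # bincount tallies the votes per class; argmax returns the first
    # (lowest) class index among those with the most votes.
    return np.argmax(np.bincount(row))

majority = np.apply_along_axis(hard_vote, axis=1, arr=predictions)
print(majority)  # [1 2 0]
```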

The `VotingClassifier` made a clone of each classifier, and it trained the clones using class indices as the labels, not the original class names. Therefore, to evaluate these clones we need to provide class indices as well. To convert the classes to class indices, we can use a `LabelEncoder`:

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_valid_encoded = encoder.fit_transform(y_valid)

Now let's evaluate the classifier clones:

In [None]:
[estimator.score(X_valid, y_valid_encoded)
 for estimator in voting_clf.estimators_]

[0.6105, 0.609, 0.6375]

I'll remove the worst classifier (Extra Trees) to see if performance improves. It is possible to remove an estimator by setting it to `"drop"` using `set_params()` like this:

In [None]:
voting_clf.set_params(extra_trees_clf="drop")

This updated the list of estimators:

In [None]:
voting_clf.estimators

[('random_forest_clf', RandomForestClassifier(random_state=42)),
 ('extra_trees_clf', 'drop'),
 ('svm_clf', LinearSVC(max_iter=100, random_state=42, tol=20))]

However, it did not update the list of _trained_ estimators:

In [None]:
voting_clf.estimators_

[RandomForestClassifier(random_state=42),
 ExtraTreesClassifier(random_state=42),
 LinearSVC(max_iter=100, random_state=42, tol=20)]

In [None]:
voting_clf.named_estimators_

{'random_forest_clf': RandomForestClassifier(random_state=42),
 'extra_trees_clf': ExtraTreesClassifier(random_state=42),
 'svm_clf': LinearSVC(max_iter=100, random_state=42, tol=20)}

In [None]:
extra_trees_clf_trained = voting_clf.named_estimators_.pop("extra_trees_clf")
voting_clf.estimators_.remove(extra_trees_clf_trained)
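
Editing the fitted attributes by hand avoids retraining. A slower but simpler alternative is to call `fit` again after `set_params`, which rebuilds `estimators_` without the dropped estimator. The sketch below shows this on synthetic data; the short estimator names and the dataset are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.svm import LinearSVC

# Synthetic data standing in for the real features and target.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=42)

toy_voting = VotingClassifier([
    ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
    ("et", ExtraTreesClassifier(n_estimators=10, random_state=42)),
    ("svm", LinearSVC(random_state=42)),
])
toy_voting.fit(X_toy, y_toy)
n_before = len(toy_voting.estimators_)
print(n_before)  # 3

# Mark the estimator as dropped, then refit so the fitted attributes follow.
toy_voting.set_params(et="drop")
toy_voting.fit(X_toy, y_toy)
n_after = len(toy_voting.estimators_)
print(n_after)  # 2
```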

Now let's evaluate the `VotingClassifier` again:

In [None]:
voting_clf.score(X_valid, y_valid)

0.6375

### With Extra Trees removed, the voting classifier improves to 0.6375, equal to the best individual classifier (the SVC).

In [None]:
voting_clf.score(X_test, y_test)

0.6235

In [None]:
[estimator.score(X_test, y_test.astype(np.int64))
 for estimator in voting_clf.estimators_]

[0.6285, 0.6235]

### On the test set, the voting classifier (0.6235) matches the SVC and is worse than the Random Forest (0.6285).

## 9. Stacking Ensemble

In [None]:
X_valid_predictions = np.empty((len(X_valid), len(estimators)), dtype=object)

for index, estimator in enumerate(estimators):
    X_valid_predictions[:, index] = estimator.predict(X_valid)

In [None]:
X_valid_predictions

array([[1.0, 1.0, 1.0],
       [2.0, 2.0, 1.0],
       [1.0, 1.0, 1.0],
       ...,
       [1.0, 1.0, 1.0],
       [2.0, 2.0, 1.0],
       [1.0, 1.0, 1.0]], dtype=object)

In [None]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True,
                                            random_state=42)
rnd_forest_blender.fit(X_valid_predictions, y_valid)

In [None]:
rnd_forest_blender.oob_score_

0.639

### The out-of-bag score (0.639) is slightly above the best individual score on the validation set (0.6375)... now let's see the test set.

In [None]:
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=object)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

In [None]:
y_pred = rnd_forest_blender.predict(X_test_predictions)

In [None]:
accuracy_score(y_test, y_pred)

0.6255

### The blender (0.6255) performs slightly better than voting (0.6235), but worse than the Random Forest on its own (0.6285).

## Now I'll make a stacking classifier

Since `StackingClassifier` uses K-Fold cross-validation, we don't need a separate validation set, so let's join the training set and the validation set into a bigger training set:

In [None]:
X_train_full, y_train_full = features[:8_000], target[:8_000]
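
The idea behind the cross-validation step is that every training row gets a prediction from a model that never saw it, so the final estimator is trained on honest inputs. Here is a minimal sketch of such out-of-fold predictions using `cross_val_predict` on synthetic data (the dataset and settings are assumptions for illustration). Note that `StackingClassifier` may feed probabilities or decision scores, rather than class labels, to the final estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic data standing in for the real features and target.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)
base = RandomForestClassifier(n_estimators=10, random_state=42)

# With cv=5, each row is predicted by a model trained on the other four folds.
oof_pred = cross_val_predict(base, X_toy, y_toy, cv=5)
print(oof_pred.shape)  # (200,)
```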

Now let's create and train the stacking classifier on the full training set:

**Warning**: the following cell may take a while to run, depending on your hardware, as it uses K-Fold cross-validation with 5 folds by default. It will train the 3 classifiers 5 times each on 80% of the full training set to make the predictions, plus one last time each on the full training set, and lastly it will train the final model on the predictions. That's a total of 19 models to train!
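
For the three base estimators in `named_estimators` and the default five folds, the number of fits can be tallied as follows (the variable names are just for illustration):

```python
n_base_estimators = 3   # random forest, extra trees, linear SVC
n_folds = 5             # default number of folds

cross_val_fits = n_base_estimators * n_folds   # out-of-fold predictions
full_fits = n_base_estimators                  # refit on the full training set
final_fits = 1                                 # the final estimator (blender)

total_fits = cross_val_fits + full_fits + final_fits
print(total_fits)  # 19
```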

In [None]:
stack_clf = StackingClassifier(named_estimators,
                               final_estimator=rnd_forest_blender)
stack_clf.fit(X_train_full, y_train_full)

In [None]:
stack_clf.score(X_test, y_test)

0.618

### The stacking classifier scores 0.618 on the test set, below both the blender (0.6255) and the voting classifier (0.6235).

### Overall, the SVC has the best validation accuracy (0.6375), but on the test set the Random Forest (0.6285) scores highest, ahead of the SVC (0.6235) and all of the ensembles.