<a href="https://colab.research.google.com/github/PanosRntgs/Machine-Learning/blob/main/Ensemble_Learning_with_MNIST_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this Python notebook, we explore ensemble learning techniques using the MNIST dataset.

We begin by splitting the 60,000 images of the Keras training split into training (6/7) and test (1/7) sets; the separate Keras test split is not used.

Next, we apply Principal Component Analysis (PCA) to reduce the dimensionality of the dataset while retaining 90% of the variance in the training set.

We then train various classifiers, including Decision Tree, Random Forest, AdaBoost, LinearSVC, and Logistic Regression, on the training data and assess their performance on the test set.

Afterwards, we combine these individual classifiers into a Stacking Ensemble Classifier, which uses 3-fold cross-validation to generate the base classifiers' predictions, with a Random Forest Classifier serving as the final estimator.

Finally, we evaluate the Stacking Classifier's performance on the test set and analyze its effectiveness in comparison to individual classifiers.

In [1]:
import numpy as np
from tensorflow.keras.datasets import mnist
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [3]:
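# Split the 60,000 Keras training images into training (6/7) and test (1/7) sets.
# Note: this overwrites the x_test/y_test returned by mnist.load_data(),
# so the official 10,000-image test split is not used below.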
x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=1/7, random_state=42)
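
As a quick check of what this split produces, the sketch below reproduces the resulting set sizes. The arrays `images` and `labels` are placeholders standing in for the 60,000 Keras training images (an assumption made so the example runs without downloading MNIST); only their length matters here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the Keras training split.
# Assumption: the pixel values are irrelevant for checking the split sizes.
images = np.zeros((60000, 1), dtype=np.uint8)
labels = np.zeros(60000, dtype=np.uint8)

train_images, test_images, train_labels, test_labels = train_test_split(
    images, labels, test_size=1/7, random_state=42
)

n_train, n_test = len(train_images), len(test_images)
print(n_train, n_test)  # prints: 51428 8572
```

The held-out set therefore contains 8,572 images, not the 10,000 of the official MNIST test split, which is worth keeping in mind when comparing the accuracies below with published MNIST results.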

In [4]:
# Flatten the images
x_train_flat = x_train.reshape(x_train.shape[0], -1)
x_test_flat = x_test.reshape(x_test.shape[0], -1)

In [5]:
# Normalize pixel values to be between 0 and 1
x_train_flat = x_train_flat / 255.0
x_test_flat = x_test_flat / 255.0

In [6]:
# Apply PCA to reduce dimensions and preserve 90% of variance
target_variance = 0.9
pca = PCA(n_components=target_variance)
x_train_pca = pca.fit_transform(x_train_flat)
x_test_pca = pca.transform(x_test_flat)
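
To illustrate what passing a fraction to `n_components` does, here is a minimal sketch. It assumes scikit-learn's small bundled digits dataset (`load_digits`, 8x8 images) as a stand-in for MNIST so that it runs quickly and offline; the number of components it keeps will differ from the MNIST case.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: the bundled 8x8 digits stand in for MNIST in this sketch.
X = load_digits().data / 16.0  # pixel values in this dataset range from 0 to 16

target_variance = 0.9
pca = PCA(n_components=target_variance)
X_reduced = pca.fit_transform(X)

# Fit a second PCA with all components to inspect the cumulative variance curve.
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
k = pca.n_components_  # number of components actually kept

print("components kept:", k)
print("variance retained:", cumulative[k - 1])
```

With a fractional `n_components`, PCA keeps the smallest number of leading components whose cumulative explained variance exceeds the target, which is why `cumulative[k - 1]` is at least 0.9 while one component fewer falls short.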

In [7]:
# Train Decision Tree
decision_tree = DecisionTreeClassifier(max_depth=10)
decision_tree.fit(x_train_pca, y_train)
decision_tree_score = decision_tree.score(x_test_pca, y_test)

In [8]:
# Train Random Forest
random_forest = RandomForestClassifier(n_estimators=50, random_state=42)
random_forest.fit(x_train_pca, y_train)
random_forest_score = random_forest.score(x_test_pca, y_test)

In [9]:
# Train AdaBoost
adaboost = AdaBoostClassifier(n_estimators=50, random_state=42)
adaboost.fit(x_train_pca, y_train)
adaboost_score = adaboost.score(x_test_pca, y_test)

In [10]:
# Train LinearSVC
linear_svc = LinearSVC(max_iter=500, dual=False, random_state=42)
linear_svc.fit(x_train_pca, y_train)
linear_svc_score = linear_svc.score(x_test_pca, y_test)

In [11]:
# Train Logistic Regression
logistic_regression = LogisticRegression(max_iter=500, random_state=42)
logistic_regression.fit(x_train_pca, y_train)
logistic_regression_score = logistic_regression.score(x_test_pca, y_test)

In [12]:
# Print the scores
print("Decision Tree Score:", decision_tree_score)
print("Random Forest Score:", random_forest_score)
print("AdaBoost Score:", adaboost_score)
print("LinearSVC Score:", linear_svc_score)
print("Logistic Regression Score:", logistic_regression_score)

Decision Tree Score: 0.7865142323845077
Random Forest Score: 0.947736817545497
AdaBoost Score: 0.679888007466169
LinearSVC Score: 0.905506299580028
Logistic Regression Score: 0.9213719085394307


In [13]:
# Define the base classifiers
base_classifiers = [
    ('decision_tree', decision_tree),
    ('random_forest', random_forest),
    ('adaboost', adaboost),
    ('linear_svc', linear_svc),
    ('logistic_regression', logistic_regression)]

In [14]:
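# Create the stacking ensemble with a Random Forest as the final classifier.
# StackingClassifier fits clones of the estimators it is given, so the
# already-fitted models above are left unchanged.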
stacking_classifier = StackingClassifier(estimators=base_classifiers,
                                         final_estimator=random_forest,
                                         cv=3)
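
To make the role of `cv=3` more concrete, the following sketch builds a stacking ensemble by hand. It is a simplified illustration, not scikit-learn's internal implementation: it assumes the bundled digits dataset instead of MNIST, uses only two base classifiers, and skips PCA. The names `meta_train`, `meta_test`, and `manual_stacking_score` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled 8x8 digits stand in for MNIST in this sketch.
X, y = load_digits(return_X_y=True)
X = X / 16.0
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/7, random_state=42)

base_models = [
    DecisionTreeClassifier(max_depth=10, random_state=42),
    LogisticRegression(max_iter=500, random_state=42),
]

# Step 1: out-of-fold class probabilities from 3-fold CV become the
# meta-features, so the final model never sees predictions that a base
# model made on its own training data.
meta_train = np.hstack([
    cross_val_predict(model, X_tr, y_tr, cv=3, method="predict_proba")
    for model in base_models
])

# Step 2: refit each base model on the full training set for use at test time.
for model in base_models:
    model.fit(X_tr, y_tr)
meta_test = np.hstack([model.predict_proba(X_te) for model in base_models])

# Step 3: the final estimator learns from the base models' predictions.
final_model = RandomForestClassifier(n_estimators=50, random_state=42)
final_model.fit(meta_train, y_tr)
manual_stacking_score = final_model.score(meta_test, y_te)
print("Manual stacking score:", manual_stacking_score)
```

Each base model contributes one column per class, so two base models on a 10-class problem give the final estimator 20 input features. `StackingClassifier` handles these steps automatically and, by default, falls back to `decision_function` for estimators such as `LinearSVC` that have no `predict_proba`.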

In [15]:
# Train the stacking ensemble on the training set and evaluate on the test set
stacking_classifier.fit(x_train_pca, y_train)
stacking_score = stacking_classifier.score(x_test_pca, y_test)

In [16]:
# Print the test set score
print("Stacking Ensemble Test Set Score:", stacking_score)

Stacking Ensemble Test Set Score: 0.9562529164722352


In [17]:
# Individual classifier scores
individual_scores = {
    'Decision Tree': decision_tree_score,
    'Random Forest': random_forest_score,
    'AdaBoost': adaboost_score,
    'LinearSVC': linear_svc_score,
    'Logistic Regression': logistic_regression_score
}
print(individual_scores)

{'Decision Tree': 0.7865142323845077, 'Random Forest': 0.947736817545497, 'AdaBoost': 0.679888007466169, 'LinearSVC': 0.905506299580028, 'Logistic Regression': 0.9213719085394307}


In [18]:
# Calculate the relative improvement (in %) of the stacking score over each classifier's score
improvements = {classifier: (stacking_score - score) / score * 100
                for classifier, score in individual_scores.items()}

In [19]:
# Display the improvement for each classifier
for classifier, improvement in improvements.items():
    print(f"Improvement of Stacking Classifier compared to {classifier}: {improvement:.2f}%")

Improvement of Stacking Classifier compared to Decision Tree: 21.58%
Improvement of Stacking Classifier compared to Random Forest: 0.90%
Improvement of Stacking Classifier compared to AdaBoost: 40.65%
Improvement of Stacking Classifier compared to LinearSVC: 5.60%
Improvement of Stacking Classifier compared to Logistic Regression: 3.79%
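
The percentages above are relative improvements in accuracy, which can look small when the baseline is already accurate. As a worked example, the sketch below restates the comparison with the Random Forest in two other ways, using only the two scores printed earlier; the variable names `point_gain` and `error_reduction` are introduced here for illustration.

```python
# Accuracies copied from the outputs above.
stacking_score = 0.9562529164722352
random_forest_score = 0.947736817545497

# Relative improvement in accuracy (the quantity printed above).
relative_gain = (stacking_score - random_forest_score) / random_forest_score * 100

# Absolute difference in accuracy, in percentage points.
point_gain = (stacking_score - random_forest_score) * 100

# Relative reduction in error rate, where error rate = 1 - accuracy.
rf_error = 1 - random_forest_score
stacking_error = 1 - stacking_score
error_reduction = (rf_error - stacking_error) / rf_error * 100

print(f"Relative accuracy gain: {relative_gain:.2f}%")
print(f"Absolute gain: {point_gain:.2f} percentage points")
print(f"Error reduction: {error_reduction:.1f}%")
```

Seen this way, the 0.90% relative gain over the Random Forest corresponds to roughly 0.85 percentage points of accuracy, or about 16% fewer misclassified test images.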
