#### 8. Load the MNIST dataset, and split it into a training set, a validation set, and a test set (e.g. using 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a random forest classifier, an extra-trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [5]:
# Import the MNIST dataset
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", as_frame=False)

In [6]:
X, y = mnist.data, mnist.target

In [7]:
print(f"Shape of the full dataset: {X.shape}")
print(f"Shape of the full label set: {y.shape}")

Shape of the full dataset: (70000, 784)
Shape of the full label set: (70000,)


In [8]:
from sklearn.model_selection import train_test_split

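# Two successive 15% splits give 50,575 training, 8,925 validation, and 10,500 test
# instances: close to, but not exactly, the 50,000 / 10,000 / 10,000 suggested in the exercise.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state=42)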

In [9]:
print(f"Shape of the training data: {X_train.shape}")
print(f"Shape of the training labels: {y_train.shape}")

print(f"Shape of the validation data: {X_val.shape}")
print(f"Shape of the validation labels: {y_val.shape}")

print(f"Shape of the test data: {X_test.shape}")
print(f"Shape of the test labels: {y_test.shape}")

Shape of the training data: (50575, 784)
Shape of the training labels: (50575,)
Shape of the validation data: (8925, 784)
Shape of the validation labels: (8925,)
Shape of the test data: (10500, 784)
Shape of the test labels: (10500,)
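
If the exact 50,000 / 10,000 / 10,000 split from the exercise statement is wanted, `train_test_split` also accepts an integer `test_size`, which it reads as an absolute number of samples. The sketch below uses placeholder arrays with MNIST's row count (the `*_demo` names are illustrative and not part of the notebook), so it does not reproduce any of the scores reported here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with MNIST's row count (the real features have 784 columns).
X_demo = np.zeros((70_000, 4))
y_demo = np.arange(70_000) % 10

# An integer test_size is interpreted as an absolute number of samples.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=10_000, random_state=42)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=10_000, random_state=42)

print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)
```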


In [10]:
# First, let us train a random forest classifier
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, y_train)

In [11]:
# Evaluating the random forest classifier on the training and validation sets
print(f"Training set score: {forest_clf.score(X_train, y_train)}")
print(f"Validation set score: {forest_clf.score(X_val, y_val)}")

Training set score: 1.0
Validation set score: 0.9687394957983193


In [12]:
# Second, we train an extra-trees classifier
from sklearn.ensemble import ExtraTreesClassifier

extra_clf = ExtraTreesClassifier(random_state=42)
extra_clf.fit(X_train, y_train)


In [13]:
# Evaluating the extra trees classifier on the training and validation sets
print(f"Training set score: {extra_clf.score(X_train, y_train)}")
print(f"Validation set score: {extra_clf.score(X_val, y_val)}")

Training set score: 1.0
Validation set score: 0.9718767507002801


In [14]:
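# Finally, let us train a linear classifier with stochastic gradient descent.
# Note: loss="log_loss" makes this a logistic regression model, not an SVM
# (a linear SVM would use loss="hinge"). Log loss provides predict_proba,
# which soft voting requires.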
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(loss="log_loss", random_state=42)
sgd_clf.fit(X_train, y_train)

In [15]:
# Evaluating the SGD classifier on the training and validation sets
print(f"Training set score: {sgd_clf.score(X_train, y_train)}")
print(f"Validation set score: {sgd_clf.score(X_val, y_val)}")

Training set score: 0.8851408798813644
Validation set score: 0.8677871148459384
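
The exercise asks for an SVM, while the cell above fits a logistic regression with SGD. As a rough sketch of what an actual SVM member could look like, the block below scales the features and fits an `SVC` with `probability=True` so that it could also take part in soft voting. It runs on scikit-learn's small bundled digits dataset rather than MNIST to stay fast; the variable names are illustrative, and a kernel `SVC` on 50,000 MNIST images would take considerably longer to train.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small 8x8 digits dataset as a stand-in for MNIST (assumption, for speed only).
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X_d, y_d, test_size=0.25, random_state=42)

# SVMs are sensitive to feature scale, so scale before fitting.
# probability=True enables predict_proba (needed for soft voting).
svm_clf = make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))
svm_clf.fit(X_tr, y_tr)

svm_val_score = svm_clf.score(X_va, y_va)
print(f"Validation set score: {svm_val_score:.3f}")
```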


In [16]:
# Now we combine them into one voting classifier
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[
        ('forest_clf', forest_clf),
        ('extra_clf', extra_clf),
        ('sgd_clf', sgd_clf)
    ], 
    voting="soft"
)

voting_clf.fit(X_train, y_train)

In [17]:
# Now let us evaluate the voting classifier on the validation set
print(f"Voting Classifier Validation Score: {voting_clf.score(X_val, y_val)}")

Voting Classifier Validation Score: 0.8771988795518207


  prob /= prob.sum(axis=1).reshape((prob.shape[0], -1))


In [18]:
voting_clf.voting = "hard"
voting_clf.score(X_val, y_val)

0.9695238095238096

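- Switching to hard voting raises the validation accuracy from about 0.877 to about 0.970. That beats the random forest (0.969) but still falls short of the extra-trees classifier (0.972).
- The weak soft-voting score is probably caused by the SGD classifier, whose predicted probabilities appear unreliable on the unscaled pixel values (see the warning above). Let us try adding another fairly powerful classifier.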
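
To see how hard and soft voting can disagree, here is a small NumPy sketch with made-up probabilities for a single image and three classes. Two classifiers mildly prefer class 0, while a third is fully (and possibly wrongly) confident in class 1. The numbers are invented for illustration and are not taken from the models above.

```python
import numpy as np

# One row per classifier, one column per class (made-up probabilities).
probas = np.array([
    [0.60, 0.40, 0.00],   # mildly prefers class 0
    [0.55, 0.45, 0.00],   # mildly prefers class 0
    [0.00, 1.00, 0.00],   # overconfident in class 1
])

# Soft voting: average the probabilities, then pick the largest.
soft_pred = probas.mean(axis=0).argmax()

# Hard voting: each classifier casts one vote for its top class.
votes = probas.argmax(axis=1)
hard_pred = np.bincount(votes, minlength=probas.shape[1]).argmax()

print(f"soft: {soft_pred}, hard: {hard_pred}")  # soft: 1, hard: 0
```

A single overconfident member can therefore outweigh two better-calibrated ones under soft voting, whereas hard voting caps its influence at one vote.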

In [20]:
from sklearn.neural_network import MLPClassifier
voting_clf_with_mlp = VotingClassifier(
    estimators=[
        ('forest_clf', forest_clf),
        ('extra_clf', extra_clf),
        ('sgd_clf', sgd_clf),
        ('mlp_clf', MLPClassifier(random_state=42))
    ], 
    voting="soft"
)

voting_clf_with_mlp.fit(X_train, y_train)

In [None]:
voting_clf_with_mlp.score(X_val, y_val)

  prob /= prob.sum(axis=1).reshape((prob.shape[0], -1))


0.9019607843137255

#### 9. Run the individual classifiers from the previous exercises to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set. Congratulations - you have just trained a blender, and together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier you trained earlier? Now try again using a `StackingClassifier` instead. Do you get better performance? If so, why?

In [None]:
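# Requires the fitted voting_clf from exercise 8 in the current session;
# after a kernel restart, re-run those cells first, otherwise this raises a NameError.
# Note: the fitted sub-estimators predict encoded class indices (0-9),
# not the original string labels.
val_preds = []
for name, clf in voting_clf.named_estimators_.items():
    val_preds.append(clf.predict(X_val))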

NameError: name 'voting_clf' is not defined
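
The blender described in the exercise can be sketched end to end as follows. This is a minimal illustration on scikit-learn's small digits dataset with two base classifiers, not the MNIST run itself; `stack_predictions`, `base_clfs`, and `blender` are hypothetical names, and the choice of a random forest as blender is an assumption.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small stand-in dataset, split into train / validation / test.
X_d, y_d = load_digits(return_X_y=True)
X_rest, X_te, y_rest, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Base classifiers are trained on the training set only.
base_clfs = [RandomForestClassifier(random_state=42), ExtraTreesClassifier(random_state=42)]
for clf in base_clfs:
    clf.fit(X_tr, y_tr)

def stack_predictions(estimators, X):
    """Return one column of predicted labels per estimator."""
    return np.column_stack([est.predict(X) for est in estimators])

# The blender learns from the base classifiers' validation-set predictions.
X_val_preds = stack_predictions(base_clfs, X_va)
blender = RandomForestClassifier(n_estimators=200, random_state=42)
blender.fit(X_val_preds, y_va)

# On the test set: predict with the base classifiers, then feed the blender.
X_test_preds = stack_predictions(base_clfs, X_te)
blender_test_score = blender.score(X_test_preds, y_te)
print(f"Blender test score: {blender_test_score:.3f}")
```

Training the blender on validation-set predictions matters: the base classifiers score 1.0 on their own training data, so predictions on that data would tell the blender nothing about their real error patterns.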

In [25]:
import os
import joblib

# Make sure the target directory exists before saving
os.makedirs("exercise_models", exist_ok=True)

# Save the fitted random forest (not a fresh, unfitted instance)
joblib.dump(forest_clf, "exercise_models/forest_clf.pkl")
joblib.dump(extra_clf, "exercise_models/extra_clf.pkl")
joblib.dump(sgd_clf, "exercise_models/sgd_clf.pkl")
joblib.dump(voting_clf, "exercise_models/voting_clf.pkl")
joblib.dump(voting_clf_with_mlp.named_estimators_["mlp_clf"], "exercise_models/mlp_clf.pkl")
joblib.dump(voting_clf_with_mlp, "exercise_models/voting_clf_with_mlp.pkl")

['exercise_models/voting_clf_with_mlp.pkl']
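
Saved models can be restored with `joblib.load`, which avoids retraining after a kernel restart. The sketch below shows a round trip with a small forest in a temporary directory; the path and variable names are illustrative and do not refer to the files saved above. Pickled models should be loaded with the same scikit-learn version that saved them, and only from trusted sources.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# Fit a small model on the bundled digits dataset (stand-in for MNIST).
X_d, y_d = load_digits(return_X_y=True)
fitted_clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_d, y_d)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "forest_clf.pkl")
    joblib.dump(fitted_clf, model_path)
    restored_clf = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(fitted_clf.predict(X_d), restored_clf.predict(X_d))
print(same_predictions)
```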