# Preamble
This problem set is an extension of Problem Set 6.  You will need the MNIST 784 dataset from OpenML, reduced with PCA so that it retains about 75\% of the original variance.
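
To make the 75\% target concrete, here is a minimal sketch of how a component count can be read off the cumulative explained-variance ratios. The ratios are made-up illustrative values, not MNIST results, and `components_for_variance` is a hypothetical helper; scikit-learn's `PCA(0.75)` makes a comparable choice internally.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, target=0.75):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `target`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= target, converted to a count.
    return int(np.searchsorted(cumulative, target, side="left") + 1)

# Illustrative ratios only (assumed, not taken from MNIST).
toy_ratios = np.array([0.40, 0.20, 0.10, 0.10, 0.10, 0.10])
n_components = components_for_variance(toy_ratios)
print(n_components)
```

With a fitted scikit-learn `PCA` object, the same helper could be applied to its `explained_variance_ratio_` attribute, provided the PCA was fit with all components retained.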

As with last week, the first 60,000 observations are available as training data and the remaining 10,000 images as test data.  You do not need to train the models on all 60,000 observations.  (It is suggested that you partition the training data into a training dataset and a holdout dataset rather than use cross-validation.  Training on as few as 5,000 observations is sufficient and keeps training time short.)

For purposes of this problem set, recode the target variable for both the test and training sets to classify whether a digit is less than 5 (i.e., $y \in \left\{0, 1, 2, 3, 4\right\}$).  That is, the target variable should take the value 0 where the corresponding observation depicts a 0, 1, 2, 3, or 4; and the value 1 where the corresponding observation depicts a 5, 6, 7, 8, or 9.
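
The recoding rule can be sketched on a handful of labels. This is an illustration only: `recode_digits` and `toy_labels` are hypothetical names, and the sketch assumes the labels arrive as digit strings (it also accepts integers).

```python
import numpy as np

def recode_digits(labels):
    """Map digit labels to 0 for digits 0-4 and 1 for digits 5-9."""
    digits = np.asarray(labels).astype(int)  # accepts '7' as well as 7
    return (digits >= 5).astype(int)

# Toy labels, stored as strings.
toy_labels = np.array(['0', '4', '5', '9', '3', '7'])
recoded = recode_digits(toy_labels)
print(recoded.tolist())
```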


In [None]:
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score


In [None]:
#Load the MNIST dataset
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
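#Perform dimensionality reduction, keeping enough components to explain 75% of the variance
#(note: the PCA is fit on all 70,000 images here, so the test images influence the projection)
pca = PCA(0.75)
X_reduced = pca.fit_transform(X)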
#Split the dataset into training and test sets
X_train, X_test = X_reduced[:60000], X_reduced[60000:]
y_train, y_test = y[:60000], y[60000:]
#Recode the target variable for training and test sets (0 for digits 0-4, 1 for digits 5-9)
recode = lambda y: np.where(np.isin(y, ['0','1','2','3','4']), 0, 1)
y_train_rcd = recode(y_train)
y_test_rcd = recode(y_test)
#Partition training data (features and recoded labels) into a training dataset and a holdout dataset
N_train = 5000
X_train_dataset, X_holdout_dataset = X_train[:N_train], X_train[N_train:]
y_train_dataset, y_holdout_dataset = y_train_rcd[:N_train], y_train_rcd[N_train:]

# Problem 1 -- Classifiers

Train three classifiers on the dataset, each using a different algorithm.  Each classifier must have an $F_1$ score of at least 0.9.  At least one classifier must use boosting (AdaBoost, gradient boosting, or XGBoost).  Show the $F_1$ score and classification report for each model.
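
For reference, the $F_1$ score is the harmonic mean of precision and recall for the positive class. With scikit-learn's default binary settings (`pos_label=1`), the positive class here is the recoded value 1, i.e., digits 5 through 9. The sketch below computes it by hand on toy labels; `f1_from_labels` is a hypothetical helper and the labels are invented for illustration.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute the F1 score of the `positive` class from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 3 true positives, 1 false positive, 1 false negative.
toy_true = [1, 1, 1, 0, 0, 1, 0, 0]
toy_pred = [1, 1, 0, 0, 1, 1, 0, 0]
toy_f1 = f1_from_labels(toy_true, toy_pred)
print(f"F1 Score: {toy_f1:.4f}")
```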

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, classification_report

# Classifier 1: Decision Tree Classifier (a single tree, trained on all 60,000 training observations)
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train_rcd)

dt_y_pred = dt_classifier.predict(X_test)

dt_f1_score = f1_score(y_test_rcd, dt_y_pred)
print("Decision Tree Classifier:")
print("F1 Score:", dt_f1_score)
print(classification_report(y_test_rcd, dt_y_pred))


Decision Tree Classifier:
F1 Score: 0.8973407544836116
              precision    recall  f1-score   support

           0       0.90      0.91      0.90      5139
           1       0.90      0.90      0.90      4861

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000



In [None]:
from xgboost import XGBClassifier
xgb_clf = XGBClassifier(n_estimators=150, learning_rate=0.1, eval_metric='logloss')
xgb_clf.fit(X_train_dataset, y_train_rcd[:N_train])
y_pred_xgb = xgb_clf.predict(X_test)
f1_xgb = f1_score(y_test_rcd, y_pred_xgb)
print("\nXGBoost Classifier:")
print(classification_report(y_test_rcd, y_pred_xgb))
print(f"F1 Score: {f1_xgb:.4f}")

XGBoost Classifier:
              precision    recall  f1-score   support

           0       0.95      0.93      0.94      5139
           1       0.92      0.95      0.94      4861

    accuracy                           0.94     10000
   macro avg       0.94      0.94      0.94     10000
weighted avg       0.94      0.94      0.94     10000

F1 Score: 0.9362


In [None]:
from sklearn.ensemble import RandomForestClassifier
# Classifier 3: Random Forest with 200 trees and a maximum depth of 15
rf_clf = RandomForestClassifier(n_estimators=200, max_depth=15, random_state=42)
rf_clf.fit(X_train_dataset, y_train_rcd[:N_train])
y_pred_rf = rf_clf.predict(X_test)
f1_rf = f1_score(y_test_rcd, y_pred_rf)
print("Random Forest Classifier:")
print(classification_report(y_test_rcd, y_pred_rf))
print(f"F1 Score: {f1_rf:.4f}")

Random Forest Classifier:
              precision    recall  f1-score   support

           0       0.96      0.92      0.94      5139
           1       0.92      0.95      0.94      4861

    accuracy                           0.94     10000
   macro avg       0.94      0.94      0.94     10000
weighted avg       0.94      0.94      0.94     10000

F1 Score: 0.9356


# Problem 2 -- Voting ensemble model

(20 pts) Build a voting ensemble model that combines the three classifiers from the previous problem, in addition to the SVM model developed last week.  What is the $F_1$ score of the ensemble model?
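
As a rough illustration of hard voting, the sketch below takes a majority vote over per-model predictions. The helper `hard_vote` and the toy predictions are hypothetical. With four voters a 2-2 tie is possible; this sketch resolves ties in favor of the smaller class label, which matches the ascending-order tie-breaking described in the scikit-learn documentation for `VotingClassifier`.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Majority vote over per-model label lists; ties go to the smallest label."""
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        counts = Counter(model_preds[i] for model_preds in predictions_per_model)
        best = max(counts.values())
        # Among the labels with the most votes, pick the smallest one.
        voted.append(min(label for label, c in counts.items() if c == best))
    return voted

# Toy predictions for four samples from four hypothetical models.
toy_predictions = [
    [0, 1, 1, 0],  # model A
    [0, 1, 0, 1],  # model B
    [1, 1, 0, 1],  # model C
    [0, 0, 0, 0],  # model D
]
ensemble_labels = hard_vote(toy_predictions)
print(ensemble_labels)
```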

In [None]:
from sklearn.svm import SVC
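# SVM with default hyperparameters (probability=True is only needed for soft voting)
svm_classifier = SVC(probability=True, random_state=42)
svm_classifier.fit(X_train, y_train_rcd)
# Create a hard-voting classifier; its fit() call refits clones of all four estimators on X_train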
voting_classifier = VotingClassifier(estimators=[
    ('dt', dt_classifier),
    ('xgb', xgb_clf),
    ('rf', rf_clf),
    ('svm', svm_classifier)
], voting='hard')

voting_classifier.fit(X_train, y_train_rcd)
voting_predictions = voting_classifier.predict(X_test)
ensemble_f1_score = f1_score(y_test_rcd, voting_predictions)
print("Ensemble Model F1 Score:", ensemble_f1_score)

Ensemble Model F1 Score: 0.9684471024953598


# Problem 3 -- Stacking ensemble model
Stacking uses a final classifier (often a logistic regression) that takes the base classifiers' predictions as its inputs and outputs the final prediction.  Repeat the previous problem using a StackingClassifier rather than voting to compute the final prediction.  What is the $F_1$ score of the stacking classifier?
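
Before running this on MNIST, the mechanics can be checked quickly on a small synthetic problem. The sketch below is illustrative only: the data come from `make_classification`, the two base models and all `toy_*` names are assumptions, and the resulting score says nothing about the MNIST task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic binary problem (a stand-in for the reduced MNIST features).
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

toy_stack = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=25, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,  # the final estimator is trained on out-of-fold base predictions
)
toy_stack.fit(X_fit, y_fit)
toy_stack_pred = toy_stack.predict(X_eval)
toy_stack_f1 = f1_score(y_eval, toy_stack_pred)
print(f"Toy stacking F1: {toy_stack_f1:.3f}")
```

The `cv` argument matters because the final estimator should learn from predictions the base models made on data they did not see during fitting; otherwise it would tend to overweight base models that overfit.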

In [None]:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Initialize fresh base classifiers (default hyperparameters, not the tuned settings from Problem 1)
dt_classifier = DecisionTreeClassifier(random_state=42)
xgb_clf = XGBClassifier(random_state=42)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_classifier = SVC(probability=True, random_state=42)

# Creating a StackingClassifier with the base classifiers and final estimator (logistic regression)
stacking_classifier = StackingClassifier(
    estimators=[
        ('dt', dt_classifier),
        ('xgb', xgb_clf),
        ('rf', rf_clf),
        ('svm', svm_classifier)
    ],
    final_estimator=LogisticRegression(),
    cv=5
)

# Train the stacking classifier
stacking_classifier.fit(X_train, y_train_rcd)

# Make predictions
stacking_predictions = stacking_classifier.predict(X_test)

# Compute F1 Score
stacking_f1_score = f1_score(y_test_rcd, stacking_predictions)

print("Stacking Classifier F1 Score:", stacking_f1_score)


Stacking Classifier F1 Score: 0.9825862957238537
