# Model Training and Evaluation
You should build a machine learning pipeline with a complete model training and evaluation step. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Conduct data exploration, data preprocessing, and feature engineering if necessary.
- Choose a few machine learning algorithms, such as [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [gradient boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).
- Define a grid of hyperparameters for every selected model.
- Conduct [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or [random search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) with k-fold cross-validation on the training set to find the best model (i.e., the best algorithm and its hyperparameters).
- Train the best model on the whole training set.
- Test the trained model on the test set and report various [evaluation metrics](https://scikit-learn.org/stable/modules/model_evaluation.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

In [None]:
url = "https://raw.githubusercontent.com/m-mahdavi/teaching/refs/heads/main/datasets/mnist.csv"
df = pd.read_csv(url)
print("Dataset downloaded. Shape:", df.shape)
df.head()
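The task list asks for data exploration before modelling. A minimal sketch of a few quick checks (missing values, class balance, pixel range, constant columns) is given below. To stay runnable offline it uses scikit-learn's bundled 8x8 digits data as a stand-in for `mnist.csv`; the names `demo_df`, `class_counts`, and `constant_cols` are illustrative assumptions, and the same checks should carry over to `df` as long as the label column is called `class`.

```python
import pandas as pd
from sklearn.datasets import load_digits

# Stand-in data (assumption): scikit-learn's bundled digits, arranged like a
# CSV with one column per pixel plus a `class` label column.
digits = load_digits(as_frame=True)
demo_df = digits.data.copy()
demo_df["class"] = digits.target

# 1. Missing values across the whole table
n_missing = int(demo_df.isna().sum().sum())

# 2. Class balance: number of samples per digit
class_counts = demo_df["class"].value_counts().sort_index()

# 3. Pixel value range (tells us whether scaling is needed)
pixels = demo_df.drop(columns=["class"])
pixel_min = pixels.min().min()
pixel_max = pixels.max().max()

# 4. Columns that never change carry no information for any model
constant_cols = [col for col in pixels.columns if pixels[col].nunique() == 1]

print("Missing values:", n_missing)
print("Samples per class:\n", class_counts)
print("Pixel range:", pixel_min, "to", pixel_max)
print("Constant pixel columns:", len(constant_cols))
```

If the class counts are roughly equal, plain accuracy is a reasonable headline metric; a strongly skewed distribution would argue for per-class metrics instead.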

In [None]:
# ----------------------------------------
# 2. Separate features & label
# ----------------------------------------
# The label column in this CSV is named 'class' (adjust if yours differs)
y = df['class']
X = df.drop(columns=['class'])

print("Features shape:", X.shape, "Labels shape:", y.shape)


In [None]:
# ----------------------------------------
# 3. Train-test split
# ----------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
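Passing `stratify=y` is meant to keep the class proportions the same in both splits. A small sketch of how that could be verified is shown here, again on scikit-learn's bundled digits data as an assumed stand-in; `proportions` and `max_gap` are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): bundled digits instead of mnist.csv
X_demo, y_demo = load_digits(return_X_y=True, as_frame=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of each class in the full data, the training split, and the test split
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()

# Largest difference in class share between train and test
max_gap = (proportions["train"] - proportions["test"]).abs().max()

print(proportions.round(3))
print("Largest train/test gap:", round(max_gap, 4))
```

With stratification the gap should be close to zero; without it, small classes can end up noticeably over- or under-represented in the test set.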

In [None]:
# ----------------------------------------
# 4. Build Pipelines for each model
# ----------------------------------------
pipelines = {
    'knn': Pipeline([
        ('scaler', StandardScaler()),
        ('model', KNeighborsClassifier())
    ]),
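    'decision_tree': Pipeline([
        ('model', DecisionTreeClassifier(random_state=42))  # fixed seed for reproducible results
    ]),
    'gradient_boosting': Pipeline([
        ('model', GradientBoostingClassifier(random_state=42))  # fixed seed for reproducible results
    ])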
}

In [None]:
# ----------------------------------------
# 5. Define Hyperparameter Grids
# ----------------------------------------
param_grids = {
    'knn': {
        'model__n_neighbors': [3, 5, 7],
        'model__weights': ['uniform', 'distance']
    },
    'decision_tree': {
        'model__max_depth': [10, 20, None],
        'model__criterion': ['gini', 'entropy']
    },
    'gradient_boosting': {
        'model__n_estimators': [50, 100],
        'model__learning_rate': [0.05, 0.1],
        'model__max_depth': [3, 5]
    }
}

In [None]:
# ----------------------------------------
# 6. Grid Search with Cross-Validation
# ----------------------------------------
best_models = {}
for key in pipelines:
    print(f"\nRunning GridSearchCV for: {key}")
    grid = GridSearchCV(
        estimator=pipelines[key],
        param_grid=param_grids[key],
        cv=3,
        n_jobs=-1
    )
    grid.fit(X_train, y_train)
    best_models[key] = grid
    print(f"Best params for {key}: {grid.best_params_}")
    print(f"Best CV score for {key}: {grid.best_score_:.4f}")



Running GridSearchCV for: knn
Best params for knn: {'model__n_neighbors': 3, 'model__weights': 'distance'}
Best CV score for knn: 0.8769

Running GridSearchCV for: decision_tree
Best params for decision_tree: {'model__criterion': 'entropy', 'model__max_depth': 10}
Best CV score for decision_tree: 0.7441

Running GridSearchCV for: gradient_boosting
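The task list also allows random search, which can be considerably cheaper than an exhaustive grid when a model is slow to fit. The following is a minimal sketch using `RandomizedSearchCV` on the KNN pipeline only. It runs on scikit-learn's bundled digits data as an assumed stand-in, and the sampling range `randint(1, 16)` and `n_iter=8` are arbitrary illustrative choices rather than tuned values.

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): bundled digits instead of mnist.csv
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

knn_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier()),
])

# Distributions (or lists) to sample from; randint(1, 16) draws integers 1..15
param_distributions = {
    "model__n_neighbors": randint(1, 16),
    "model__weights": ["uniform", "distance"],
}

search = RandomizedSearchCV(
    estimator=knn_pipe,
    param_distributions=param_distributions,
    n_iter=8,          # number of sampled configurations (illustrative)
    cv=3,
    random_state=42,   # makes the sampled configurations reproducible
)
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print(f"Best CV score: {search.best_score_:.4f}")
```

The cost is fixed by `n_iter` times the number of folds, regardless of how wide the search space is, which is the main reason to prefer it for the gradient boosting model.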


In [None]:
# ----------------------------------------
# 7. Select Best Model Overall
# ----------------------------------------
best_key = max(best_models, key=lambda k: best_models[k].best_score_)
best_model = best_models[best_key]

print("\n=======================================")
print(f"BEST MODEL SELECTED: {best_key}")
print("Hyperparameters:", best_model.best_params_)
print("=======================================\n")
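Beyond `best_params_` and `best_score_`, a fitted search object exposes `cv_results_`, which holds the score and timing of every configuration tried. The sketch below turns it into a ranked table. It fits a small decision tree grid on scikit-learn's bundled digits data as an assumed stand-in; `grid_demo` and `summary` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): bundled digits instead of mnist.csv
X_demo, y_demo = load_digits(return_X_y=True)

grid_demo = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [5, 10, None], "criterion": ["gini", "entropy"]},
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to sort and read
results = pd.DataFrame(grid_demo.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score", "mean_fit_time"]
].sort_values("rank_test_score")

print(summary.to_string(index=False))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the winning configuration is clearly better or only within fold-to-fold noise of the runner-up.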

In [None]:
# ----------------------------------------
# 8. Train Best Model on Full Training Set
# ----------------------------------------
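# GridSearchCV uses refit=True by default, so best_estimator_ has already been
# retrained on the whole training set with the best hyperparameters.
# Calling fit again would only repeat that work.
final_model = best_model.best_estimator_
final_model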

In [None]:
# ----------------------------------------
# 9. Evaluate on Test Set
# ----------------------------------------
y_pred = final_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
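Accuracy and the classification report can be complemented with a confusion matrix, which shows which digits get mistaken for which. A minimal sketch follows, assuming scikit-learn's bundled digits data as a stand-in and a plain 3-nearest-neighbours classifier in place of the selected model; `cm`, `macro_f1`, and `per_class_recall` are illustrative names.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data and model (assumptions): bundled digits and an untuned KNN
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)

# Macro F1 averages the per-class F1 scores with equal weight per class
macro_f1 = f1_score(y_te, y_hat, average="macro")

# Recall per class: correct predictions divided by true samples of that class
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print("Confusion matrix:\n", cm)
print(f"Macro F1: {macro_f1:.4f}")
print("Per-class recall:", per_class_recall.round(3))
```

Off-diagonal entries point to specific confusions that a single accuracy number hides. In the notebook itself the same calls would take `y_test` and `y_pred`.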