# General Flow for Training/Fitting Models

In [None]:
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/data_utils.py

In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

from data_utils import classification_error, display_confusion_matrix
from data_utils import object_from_json_url

### 3 Stages
- Data Prep: Encoding, Scaling, Clustering, sometimes Splitting into train/test datasets
- Modeling: `fit()` classifier
- Evaluation: `predict()` and measure error

#### Data Prep:
Do we need to split our data, or is it already split into train/test sets?

If it's already split we prepare the Encoding, Scaling, Clustering objects using the `train` data (usually with the `fit_transform()` function), and then we use those same objects to encode, scale, cluster the `test` data (usually with the `transform()` function).

If the data is not split into two datasets, we can split it first and then follow the steps above. Alternatively, we can run encoding, scaling, clustering with `fit_transform()` on the entire dataset and only then split the already encoded, scaled, clustered data. This is a bit easier to perform, but it lets information from the `test` rows leak into the encoder, scaler, cluster models, and in turn into the final model, so it adds a bit of bias.
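
As a minimal sketch of the first option, here is a scaler fitted on `train` rows only and then reused on `test` rows. The table below is made up for illustration (its column names and values are placeholders, not the penguin data), and `random_state` is only there to make the sketch repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up measurements, only for illustration
toy_df = pd.DataFrame({
    "length": [39.1, 46.5, 50.0, 38.2, 45.1, 49.3, 37.8, 47.6],
    "mass": [3750, 5200, 3900, 3600, 5000, 4050, 3400, 5400],
})

toy_train, toy_test = train_test_split(toy_df, train_size=0.75, random_state=0)

scaler = StandardScaler()
# fit_transform(): learn mean/std from the train rows, then scale them
train_scaled = scaler.fit_transform(toy_train)
# transform(): reuse the train mean/std on the test rows (no re-fitting)
test_scaled = scaler.transform(toy_test)
```

The scaled `train` columns end up with mean 0, while the scaled `test` columns generally do not, because they are scaled with statistics that came from the `train` rows.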

#### Modeling
Once we have `train` and `test` datasets that have been encoded, scaled, clustered, we can use the `train` dataset to fit a supervised model (classifier, regression, etc).

Here we will usually call a `fit()` function with the training dataset's features and, separately, its labels or outcome variable values. Something like `fit(features, labels)`.
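
A small sketch of that call, using made-up feature rows and labels (placeholders rather than the penguin data):

```python
from sklearn.ensemble import RandomForestClassifier

# Made-up, already-scaled features: one row per sample
features = [[-1.2, -0.8], [-0.9, -1.1], [1.0, 0.9], [1.3, 1.1]]
# One label per row, in the same order as the features
labels = [0, 0, 1, 1]

# random_state only makes the sketch repeatable
model = RandomForestClassifier(random_state=0).fit(features, labels)
```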

#### Evaluation
We have a model we trained/fitted with the `train` dataset. Now we can measure how well it actually performs once it's used without the correct labels.

Here we usually call `predict()` with a dataset's features to get label or regression predictions.

We want to call `predict()` for both the `train` and `test` datasets, and then measure how close those predictions are to the actual labels and values that we have in our dataset.

Evaluating with the `train` dataset will tell us if the model is capable of learning anything about the data. Evaluating with the `test` dataset will tell us if the model is capable of learning patterns and trends beyond the data that is fed to it.

It's common for the model to perform better with the `train` data since it was trained using that data and labels, but the `test` dataset error is what's more important because it will tell us what kind of error to expect from data that the model hasn't seen.
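
The comparison can be sketched on synthetic stand-in data from `make_classification()`. Here the error is simply the fraction of wrong predictions, which is an assumption and may differ from how `classification_error()` in `data_utils.py` computes it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 4 features, 2 classes
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# predict() only receives features; the labels are used afterwards for scoring
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

# Fraction of wrong predictions on each dataset
train_error = (train_pred != y_train).mean()
test_error = (test_pred != y_test).mean()
print(f"train error: {train_error:.3f}, test error: {test_error:.3f}")
```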

### Example

Classifying penguins based on measurements.

Let's load a dataset and look.

In [None]:
PENGUIN_URL = "https://raw.githubusercontent.com/PSAM-5020-2025F-A/5020-utils/refs/heads/main/datasets/json/penguins.json"
penguin_data = object_from_json_url(PENGUIN_URL)

display(penguin_data)

It doesn't have separate train and test data, so we will have to split our dataset before doing any encoding/scaling/pre-processing. 

<img src="./imgs/datasplit-00.jpg" width="720px"/>
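
One possible way to approach the two TODOs below, sketched on a few hypothetical records. The record layout and the `bill_length_mm` column name are assumptions about the JSON, not something confirmed by the dataset.

```python
import pandas as pd

# Hypothetical records, assuming the JSON is a list of objects like these
toy_records = [
    {"species": "Adelie", "bill_length_mm": 39.1, "sex": 0},
    {"species": "Gentoo", "bill_length_mm": 46.5, "sex": 1},
    {"species": "Chinstrap", "bill_length_mm": 50.0, "sex": 0},
    {"species": "Adelie", "bill_length_mm": 38.2, "sex": 1},
]

toy_df = pd.DataFrame.from_records(toy_records)

# factorize() assigns 0, 1, 2, ... in order of first appearance
codes, species_names = pd.factorize(toy_df["species"])
toy_df["label"] = codes
```

`species_names` keeps the species names in code order, which can be handy later when labeling plots.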

In [None]:
# TODO: Put in DataFrames
# TODO: Encode Species Label (we can do encoding before we split)

### Split the Data

Using `train_test_split()`

In [None]:
# Split with train_test_split()
penguin_train, penguin_test = train_test_split(penguin_df, train_size=0.8)

In [None]:
# TODO: scale training data
# Keep sex and label encoded as int values for classification

In [None]:
# TODO: scale test data
# Keep sex and label encoded as int values for classification

### Model/Fit

We can train our model now. We're going to use a `RandomForestClassifier` and `fit()` it with the training data.

In [None]:
# TODO: separate features and outcomes (outcome can be sex, species or both)

# TODO: fit RandomForestClassifier for sex classification

# TODO: fit RandomForestClassifier for species classification

### Evaluate

We can now run predictions for both `train` and `test` data and measure error.

In [None]:
# TODO: predict() for train and test data

### Measure Error

In [None]:
# Measure classification error with classification_error()
display(classification_error(train_scaled_df["sex"], train_sex_pred))
display(classification_error(test_scaled_df["sex"], test_sex_pred))

### Look at Confusion (Matrix)

`display_confusion_matrix(labels, predictions, display_labels=unique_labels)`
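
`display_confusion_matrix()` comes from the course's `data_utils.py`. To give a rough idea of the numbers behind such a plot, here is a sketch using `confusion_matrix()` from `sklearn.metrics` on made-up labels. Whether the helper uses this function internally is an assumption.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions
labels = [0, 0, 0, 1, 1, 1, 1, 0]
predictions = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(labels, predictions)
print(matrix)
```

The diagonal counts the correct predictions for each class, and everything off the diagonal is a misclassification.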

In [None]:
# Look at confusion matrices
display_confusion_matrix(train_scaled_df["sex"], train_sex_pred, display_labels=penguin_df["sex"].unique().tolist())
display_confusion_matrix(test_scaled_df["sex"], test_sex_pred, display_labels=penguin_df["sex"].unique().tolist())

### Both Sex and Species ?

We can fit a `RandomForestClassifier` to predict multiple outcome variables at the same time.
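
A self-contained sketch with random stand-in data shows what comes back from `predict()` when the model is fitted with two outcome columns. The arrays here are made up, not the penguin features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Made-up features and two outcome columns
features = rng.normal(size=(60, 4))
outcomes = np.column_stack([
    rng.integers(0, 3, size=60),  # stand-in for "label" (3 values)
    rng.integers(0, 2, size=60),  # stand-in for "sex" (2 values)
])

model = RandomForestClassifier(random_state=0).fit(features, outcomes)

# One prediction column per outcome column, in the same order as in fit()
pred = model.predict(features)
print(pred.shape)
```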

In [None]:
# something like this
penguin_model = RandomForestClassifier().fit(train_features, train_scaled_df[["label", "sex"]])

In [None]:
# TODO: run predict() on train and test data

### Looking at the results

We could've re-organized our dataset to have one output variable that combined `sex` and `label`, since there are $6$ possible combinations of those two variables. If we did that, we would be able to plot a single confusion matrix to show the results of our multi-class, multi-output classifier.
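
One possible way to build such a combined variable, sketched on made-up values. `label_sex` is a hypothetical column name, and the formula assumes `label` is encoded as 0 to 2 and `sex` as 0 or 1.

```python
import pandas as pd

# Made-up encoded outcomes: 3 species labels x 2 sex values
toy_df = pd.DataFrame({
    "label": [0, 0, 1, 1, 2, 2],
    "sex": [0, 1, 0, 1, 0, 1],
})

# Each (label, sex) pair gets its own integer between 0 and 5
toy_df["label_sex"] = toy_df["label"] * 2 + toy_df["sex"]
```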

Since we didn't combine the outcome variables, we'll have to plot these separately. The classifier model was the same and it is able to classify our penguins into $6$ classes, but we have to look at the confusion matrices for each output variable separately.

<h3 style="color:#f00;">Warning</h3>

Unlike the scalers, the `predict()` functions in `sklearn` return results without column names. We have to remember which column represents which outcome variable, or create a new `DataFrame` with the results.

In [None]:
train_pred_df = pd.DataFrame(train_pred, columns=["label", "sex"])
test_pred_df = pd.DataFrame(test_pred, columns=["label", "sex"])

In [None]:
# Look at confusion matrices for the sex predictions
display_confusion_matrix(train_scaled_df["sex"], train_pred_df["sex"], display_labels=penguin_df["sex"].unique().tolist())
display_confusion_matrix(test_scaled_df["sex"], test_pred_df["sex"], display_labels=penguin_df["sex"].unique().tolist())

In [None]:
# Look at confusion matrices for the species predictions
display_confusion_matrix(train_scaled_df["label"], train_pred_df["label"], display_labels=penguin_df["species"].unique().tolist())
display_confusion_matrix(test_scaled_df["label"], test_pred_df["label"], display_labels=penguin_df["species"].unique().tolist())