# TI3130: Classification Lab &mdash; Solutions (variant B)
**Julián Urbano &mdash; January 2022**

In [1]:
import sys
import numpy as np
import pandas as pd
from plotnine import *
from plotnine import __version__ as p9__version__
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import __version__ as sk__version__

print("python", sys.version,
      "\nnumpy", np.__version__,
      "\npandas", pd.__version__,
      "\nplotnine", p9__version__,
      "\nstatsmodels", sm.__version__,
      "\nsklearn", sk__version__)

python 3.8.12 (default, Oct 12 2021, 03:01:40) [MSC v.1916 64 bit (AMD64)] 
numpy 1.21.2 
pandas 1.3.4 
plotnine 0.8.0 
statsmodels 0.13.0 
sklearn 1.0.1


For these exercises we will use the _Amsterdam Lite_ dataset and the _Heart_ dataset. Please refer to their HTML files for a description of the variables.

In [2]:
ams = pd.read_csv('amsterdam_lite.csv')
for col in ams.select_dtypes('object').columns:
    ams[col] = pd.Categorical(ams[col])

heart = pd.read_csv('heart.csv')
for col in heart.select_dtypes('object').columns:
    heart[col] = pd.Categorical(heart[col])

We will also reuse the evaluation metrics and cross-validation code from the tutorial:

In [3]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import StratifiedKFold

def kfold_cv(X, y, k, H, cv_fun, random_state):
    """
    Do stratified k-fold cross-validation with a dataset, to check how a model behaves as a function
    of the values in H (e.g. a hyperparameter such as tree depth, or polynomial degree).

    :param X: feature matrix.
    :param y: response column.
    :param k: number of folds.
    :param H: values of the hyperparameter to cross-validate.
    :param cv_fun: function of the form (X_train, y_train, X_valid, y_valid, h) to evaluate the model in one split,
        as a function of h. It must return a dictionary with metric score values.
    :param random_state: controls the pseudo random number generation for splitting the data.
    :return: a Pandas dataframe with metric scores along values in H.
    """
    kf = StratifiedKFold(n_splits = k, shuffle = True, random_state = random_state)
    pr = []  # to store global results

    # for each value h in H, do CV
    for h in H:
        scores = []  # to store the k results for this h
        # for each fold 1..k
        for train_index, valid_index in kf.split(X, y):
            # partition the data in training and validation
            X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

            # call cv_fun to train the model and compute performance
            fold_scores = cv_fun(X_train, y_train, X_valid, y_valid, h)
            scores.append(fold_scores)

        fold_means = pd.DataFrame(scores).mean(axis = 0)  # average each metric across the k folds
        pr.append(fold_means)  # append to global results

    pr = pd.DataFrame(pr).assign(_h = H)
    return pr

**a) The first classifier in the tutorial notebook used a logistic model with a default threshold of 0.5. Function `predict` below implements a prediction function for the given `model`, test data `X_test` and response `vocabulary`. Implement this function so that it maximizes the recall of the second class, `2.high`.**

In [None]:
# sample data and model
X = ams.assign(y = ams['saf_catering'].cat.codes)
m = smf.glm('y ~ hou_value', X, family = sm.families.Binomial()).fit()

def predict(model, X_test, vocabulary):
    p = model.predict(X_test)
    # always predict 2.high
    
    return #TODO

# sample execution
predict(m, X, ams['saf_catering'].cat.categories)
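
One possible reading of the hint in the skeleton: the recall of a class is TP / (TP + FN), so predicting that class for every row leaves no false negatives and recall reaches 1, at the expense of precision. The sketch below illustrates this on toy data. `StubModel` and `predict_always_high` are hypothetical names, the stub merely stands in for the fitted GLM so the block runs without the dataset, and a two-level vocabulary is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

class StubModel:
    """Hypothetical stand-in for the fitted GLM: returns one probability per row."""
    def predict(self, X_test):
        return np.full(len(X_test), 0.3)

def predict_always_high(model, X_test, vocabulary):
    p = model.predict(X_test)  # only the number of predictions is used here
    # predicting the second class everywhere leaves no false negatives for it
    return np.repeat(vocabulary[1], len(p))

# toy data: the vocabulary mimics the categories of a two-level factor
vocabulary = pd.Index(['1.low', '2.high'])
X_toy = pd.DataFrame({'hou_value': [100, 200, 300, 400]})
y_toy = np.array(['1.low', '2.high', '2.high', '1.low'])

y_pred = predict_always_high(StubModel(), X_toy, vocabulary)
recall_high = recall_score(y_toy, y_pred, pos_label = '2.high')
print(recall_high)
```

Equivalently, one could threshold the predicted probabilities at 0, which also assigns every row to the second class.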

**b) Slide 38 shows an example of how complex models keep improving in the training set but not necessarily in a validation set, because they eventually overfit. Using the _Heart_ dataset, fit decision trees to predict `ahd` from _all_ other features and, via 10-fold cross-validation, produce a plot similar to the one in the slides. You only need to plot accuracy in the training and validation sets. What number of leaves would you choose based on the results? You will need to use Pandas' [`get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) to encode categorical features.**
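
A minimal sketch of the mechanics, run on synthetic data rather than the _Heart_ file so that it is self-contained. It assumes that the number of leaves is controlled through `max_leaf_nodes`; the toy feature names and the grid `H` are made up. The `cv_fun` here follows the signature expected by `kfold_cv`, but the loop is written out so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the Heart data (assumption: the real features differ)
X_num, y_arr = make_classification(n_samples = 300, n_features = 8, n_informative = 4,
                                   random_state = 0)
X_raw = pd.DataFrame(X_num, columns = [f'x{i}' for i in range(8)])
X_raw['kind'] = pd.Categorical(np.where(X_raw['x0'] > 0, 'a', 'b'))  # toy categorical feature
X = pd.get_dummies(X_raw, dtype = float)  # one-hot encode the categorical column
y = pd.Series(y_arr)

def cv_fun(X_train, y_train, X_valid, y_valid, h):
    """Fit a tree with at most h leaves; return training and validation accuracy."""
    m = DecisionTreeClassifier(max_leaf_nodes = h, random_state = 0).fit(X_train, y_train)
    return {'train': accuracy_score(y_train, m.predict(X_train)),
            'valid': accuracy_score(y_valid, m.predict(X_valid))}

H = list(range(2, 31, 4))  # candidate numbers of leaves
kf = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 0)
rows = []
for h in H:
    folds = [cv_fun(X.iloc[tr], y.iloc[tr], X.iloc[va], y.iloc[va], h)
             for tr, va in kf.split(X, y)]
    rows.append(pd.DataFrame(folds).mean(axis = 0))  # average across the 10 folds
res = pd.DataFrame(rows).assign(leaves = H)

# long format: one row per (leaves, dataset) pair, convenient for a line plot
res_long = res.melt(id_vars = 'leaves', var_name = 'dataset', value_name = 'accuracy')
print(res)
```

With the real data, the number of leaves to choose would be read off the validation curve, typically around the point where it stops improving while the training curve keeps rising.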

**c) Use Scikit-learn's [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to build a model to predict `saf_nonneighbors` using all integer features in the dataset (you may have to increase the `max_iter` argument to something like `1000`). Produce a full classification report.**
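
The steps could be sketched as follows, on a toy data frame with made-up column names (an assumption, so the block runs without the dataset). Integer columns are selected by dtype with `select_dtypes`, and the report is computed on the training data. The toy response has two levels; the real one may have more.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 200
# toy stand-in for the Amsterdam data (assumption: column names are made up)
toy = pd.DataFrame({'rooms': rng.integers(1, 6, n),
                    'age': rng.integers(18, 90, n),
                    'score': rng.normal(size = n)})  # float column, to be excluded
toy['target'] = pd.Categorical(np.where(toy['rooms'] + rng.normal(size = n) > 3,
                                        '2.high', '1.low'))

X = toy.select_dtypes(include = np.integer)  # keep integer-typed columns only
y = toy['target']

m = LogisticRegression(max_iter = 1000).fit(X, y)
report = classification_report(y, m.predict(X))
print(report)
```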

**d) Implement function `tree_ovo_fit` to fit a multi-class decision tree following a one-versus-one approach with individual binary decision trees.**

In [None]:
from sklearn import tree

# sample data
y = ams['spa_streets']
X = pd.get_dummies(ams.filter(regex = '^(fac_|inc_|saf_)'))

def tree_ovo_fit(X, y, random_state):
    models = []
    vocabulary = y.cat.categories    
    
    # for every pair of classes cl1-cl2
        # filter out rows where the response is neither cl1 nor cl2
        # fit a model cl1-vs-cl2
        # store the model in models
    #TODO
            
    return {'vocabulary': vocabulary,
            'models': models}

# sample execution
m = tree_ovo_fit(X, y, random_state = 123)
m
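
One way the commented steps in the skeleton could be filled in, shown on toy data so the block is self-contained. `tree_ovo_fit_sketch` is a hypothetical name, and using `itertools.combinations` to enumerate the pairs is a choice made here, not something the exercise prescribes.

```python
from itertools import combinations
import numpy as np
import pandas as pd
from sklearn import tree

def tree_ovo_fit_sketch(X, y, random_state):
    models = []
    vocabulary = y.cat.categories
    # every unordered pair of classes cl1-cl2
    for cl1, cl2 in combinations(vocabulary, 2):
        keep = y.isin([cl1, cl2])       # rows whose response is cl1 or cl2
        y_pair = y[keep].astype(str)    # plain strings, so the unused category disappears
        m = tree.DecisionTreeClassifier(random_state = random_state).fit(X[keep], y_pair)
        models.append(m)
    return {'vocabulary': vocabulary, 'models': models}

# toy data with three classes (assumption: stands in for spa_streets and its features)
rng = np.random.default_rng(1)
X_toy = pd.DataFrame(rng.normal(size = (90, 3)), columns = ['a', 'b', 'c'])
y_toy = pd.Series(pd.Categorical(np.repeat(['1.low', '2.average', '3.high'], 30)))

fit = tree_ovo_fit_sketch(X_toy, y_toy, random_state = 123)
print(len(fit['models']))
```

With K classes this fits K(K-1)/2 binary trees, each trained only on the rows of its two classes.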

**e) Implement function `tree_ovo_predict` to predict from a model trained with `tree_ovo_fit`. The prediction should not be probabilistic (i.e. it returns the predicted class). You will probably need a function to compute the mode (such as SciPy's [`mode`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html)), and a way to [flatten a list of lists](https://stackoverflow.com/a/953097/14674728).**

In [None]:
from scipy.stats import mode
from itertools import chain

# sample model
m = tree_ovo_fit(X, y, random_state = 123)

def tree_ovo_predict(model, X):
    # prepare an array with as many rows as X, and as many columns as individual models
    p = np.empty_like(model['vocabulary'], shape = [X.shape[0], len(model['models'])])
    
    # for every individual model
        # make a prediction and store it in p
    #TODO
    
    # select the classes that got predicted most often and return them
    return #TODO

p = tree_ovo_predict(m, X)
p
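
A sketch of the voting step under some assumptions: instead of SciPy's `mode`, it counts votes per class with NumPy, which sidesteps version-specific behaviour of `mode` on non-numeric arrays. `tree_ovo_predict_sketch` is a hypothetical name, and the pairwise models are fitted inline on synthetic data so the block runs on its own.

```python
from itertools import combinations
import numpy as np
import pandas as pd
from sklearn import tree

def tree_ovo_predict_sketch(model, X):
    vocabulary = model['vocabulary']
    # votes[i, j] counts how many pairwise models predicted class j for row i
    votes = np.zeros((X.shape[0], len(vocabulary)), dtype = int)
    for m in model['models']:
        codes = vocabulary.get_indexer(m.predict(X))  # class labels -> column positions
        votes[np.arange(X.shape[0]), codes] += 1
    # most voted class per row; ties go to the first class in the vocabulary
    return np.asarray(vocabulary)[votes.argmax(axis = 1)]

# toy three-class problem with pairwise trees (assumption: synthetic data)
rng = np.random.default_rng(2)
X_toy = pd.DataFrame(rng.normal(size = (90, 2)), columns = ['a', 'b'])
labels = np.repeat(['1.low', '2.average', '3.high'], 30)
X_toy['a'] += np.repeat([0, 4, 8], 30)  # separate the classes along 'a'
vocabulary = pd.Index(['1.low', '2.average', '3.high'])
models = [tree.DecisionTreeClassifier(random_state = 123)
              .fit(X_toy[np.isin(labels, pair)], labels[np.isin(labels, pair)])
          for pair in combinations(vocabulary, 2)]

p = tree_ovo_predict_sketch({'vocabulary': vocabulary, 'models': models}, X_toy)
print(pd.Series(p).value_counts())
```

Note the tie-breaking rule: `argmax` returns the first maximum, so a three-way tie resolves to the first class in the vocabulary. Other rules are equally defensible.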

**f) Fit an SVM model to predict `district` based on *all other* features in the dataset, and produce a full classification report with the training data. Is there any problem with this model, and if so, what is the cause? You don't need to tune hyperparameters, but you will need to use Pandas' [`get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) to encode categorical features.**
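
The sketch below shows only the mechanics (encoding, fitting, and a training-set report) on a toy data frame with made-up columns and district labels. It does not attempt to answer the question, which depends on what the report looks like for the real data; the per-class precision, recall and support rows are where a problem would show up.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n = 150
# toy stand-in for the Amsterdam data (assumption: made-up columns and districts)
toy = pd.DataFrame({'district': pd.Categorical(rng.choice(['A', 'B', 'C'], n)),
                    'value': rng.normal(300, 50, n),
                    'kind': pd.Categorical(rng.choice(['x', 'y'], n))})

y = toy['district']
# all other features, with categorical ones one-hot encoded
X = pd.get_dummies(toy.drop(columns = 'district'), dtype = float)

m = SVC().fit(X, y)  # default hyperparameters, no tuning
report = classification_report(y, m.predict(X), zero_division = 0)
print(report)
```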

**g) Use 5-fold cross-validation to plot how the F-score is affected by `max_features` when building random forests to predict `ahd` based on _all_ other features in the _Heart_ dataset. Compare with the performance of a decision tree and a bagging model, also doing 5-fold cross-validation. Use 50 trees for random forests and bagging.**
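
A possible outline of the comparison, on synthetic binary data so the block is self-contained. It uses Scikit-learn's `cross_val_score` with `scoring = 'f1'` for brevity, which is a choice made here; the same could be done through the `kfold_cv` helper from the tutorial. The helper name `mean_f1` is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic binary data (assumption: stands in for the encoded Heart features)
X, y = make_classification(n_samples = 200, n_features = 6, n_informative = 3,
                           random_state = 0)
kf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0)

def mean_f1(model):
    """Average F-score of the positive class over the 5 validation folds."""
    return cross_val_score(model, X, y, cv = kf, scoring = 'f1').mean()

# random forests: one row per value of max_features
rows = [{'model': 'forest', 'max_features': h,
         'f1': mean_f1(RandomForestClassifier(n_estimators = 50, max_features = h,
                                              random_state = 0))}
        for h in range(1, X.shape[1] + 1)]
# the baselines do not depend on max_features, so they would be horizontal lines in a plot
rows.append({'model': 'tree', 'max_features': None,
             'f1': mean_f1(DecisionTreeClassifier(random_state = 0))})
rows.append({'model': 'bagging', 'max_features': None,
             'f1': mean_f1(BaggingClassifier(n_estimators = 50, random_state = 0))})
res = pd.DataFrame(rows)
print(res)
```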

**h) The tutorial notebook built a multinomial model to predict `spa_streets` based on some features and their interactions, but the model had clear issues of class imbalance whereby the majority class `2.average` dominated the learning process. The resulting macro-averaged F-score was indeed low at 0.39. Try solving this issue by fitting the same model on a new dataset where the minority classes `1.low` and `3.high` are oversampled to contain the same number of instances as the majority class. Compare the model with the one in the tutorial using the original dataset. Does the new model improve in terms of imbalance? Why? You can use Pandas' [`sample`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html) to oversample instances.**
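
The oversampling step alone could look like the sketch below, on a toy imbalanced data frame (an assumption; the class sizes are made up). Each minority class keeps its original rows and is topped up with rows drawn with replacement until it matches the majority class.

```python
import numpy as np
import pandas as pd

# toy imbalanced data (assumption: stands in for the Amsterdam data frame)
toy = pd.DataFrame({'spa_streets': ['1.low'] * 5 + ['2.average'] * 20 + ['3.high'] * 8,
                    'x': np.arange(33)})

n_max = toy['spa_streets'].value_counts().max()  # size of the majority class

parts = []
for cl, group in toy.groupby('spa_streets'):
    # keep the original rows and top up with rows resampled with replacement
    extra = group.sample(n = n_max - len(group), replace = True, random_state = 123)
    parts.append(pd.concat([group, extra]))
balanced = pd.concat(parts, ignore_index = True)

counts = balanced['spa_streets'].value_counts()
print(counts)
```

If the comparison were done with cross-validation, the oversampling would normally be applied to the training folds only, so that duplicated rows do not leak into the validation data.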

**i) Create an _unambiguous_ and _nontrivial_ question, and its corresponding solution, as if you were writing the set of exercises for the _Classification_ lab. The question must cover at least 3 of the following aspects:**

- **Logistic or Multinomial regression for classification**
- **Decision trees, bagging or random forests**
- **Support Vector Machines**
- **Model evaluation**
- **Choice of hyperparameters via cross-validation**
- **An open-ended question to explain some behavior**

**Please make it explicit which 3 of these aspects your question covers. You can use any of the datasets available on Brightspace.**