# Tuning an SVC Model

Adapted from `xgb_fitting.ipynb`.

The goal of this notebook is to train and evaluate an SVC model and compare its performance on a holdout set against other model types (LR, LDA, XGBoost).

To ensure reproducibility and consistent evaluation across models, all datasets were **pre-split into cross-val data and holdout data** as below:

| Split type           | CV training file     | Holdout file              | Description                              |
| -------------------- | -------------------- | ------------------------- | ---------------------------------------- |
| **Random**           | `apps_cv_random.csv` | `apps_holdout_random.csv` | Simple random sampling                   |
| **Stratified**       | `apps_cv_strat.csv`  | `apps_holdout_strat.csv`  | Stratified by `TARGET`                   |
| **Multi-Stratified** | `apps_cv_multi.csv`  | `apps_holdout_multi.csv`  | Stratified by `TARGET` + `CODE_GENDER_M` |

Each cross-validation dataset (`apps_cv_*.csv`) also contains a `fold` column with pre-assigned folds from 1 to 5, created with the corresponding splitting method to ensure consistent evaluation. No additional splitting is therefore needed inside this notebook; we can simply loop through the assigned folds for cross-validation.
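
As a minimal sketch of how the pre-assigned folds can be consumed, the generator below yields one train/validation pair per fold. The helper name `iter_folds` and the toy frame are illustrative assumptions, not part of the project code; only the `fold` column comes from the description above.

```python
import pandas as pd

def iter_folds(data, fold_col="fold"):
    """Yield (fold, train, valid) using the pre-assigned fold labels."""
    for f in sorted(data[fold_col].unique()):
        valid = data[data[fold_col] == f]   # rows held out for this fold
        train = data[data[fold_col] != f]   # all remaining rows
        yield int(f), train, valid

# Toy stand-in for an apps_cv_*.csv file (values are made up)
toy = pd.DataFrame({
    "x": range(10),
    "TARGET": [0, 1] * 5,
    "fold": [1, 2, 3, 4, 5] * 2,
})

fold_sizes = {f: (len(tr), len(va)) for f, tr, va in iter_folds(toy)}
print(fold_sizes)
```

Because the fold labels live in the data rather than in the notebook, every model notebook that loops this way sees exactly the same splits.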


In [1]:
import pandas as pd 
import numpy as np 
from sklearn.svm import SVC
from sklearn.decomposition import PCA

The metric helpers below are copied from `XGB_fitting.ipynb`.

In [2]:
# METRICS 

def classification_metrics(y_true, y_pred):
    """
    Computes confusion matrix + accuracy, precision, recall, F1, and balanced accuracy.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Confusion matrix components
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    # Metrics
    acc  = (tp + tn) / max((tp + tn + fp + fn), 1)
    prec = tp / max((tp + fp), 1)
    rec  = tp / max((tp + fn), 1)
    f1   = (2 * prec * rec / max((prec + rec), 1e-12)) if (prec + rec) > 0 else 0.0

    # Specificity (True Negative Rate)
    spec = tn / max((tn + fp), 1)

    # Balanced accuracy
    bal_acc = 0.5 * (rec + spec)

    metrics = {
        "n": len(y_true),
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "acc": acc, "bal_acc": bal_acc, "prec": prec, "rec": rec, "spec": spec,
        "f1": f1
    }
    return metrics

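def roc_auc_from_probs(y_true, y_prob):
    """
    Computes ROC-AUC from scores (probabilities or decision-function values).
    Only the ranking of the scores matters. Tied scores are not grouped, so
    the result can depend slightly on how ties are ordered.
    """
    # Sort by predicted score, descending
    desc_sort_indices = np.argsort(-np.asarray(y_prob))
    y_true = np.array(y_true)[desc_sort_indices]
    y_prob = np.array(y_prob)[desc_sort_indices]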
    pos = np.sum(y_true == 1)
    neg = np.sum(y_true == 0)

    # running totals for TPR/FPR
    tpr = [0.0]
    fpr = [0.0]
    tp = fp = 0
    for i in range(len(y_true)):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / pos)
        fpr.append(fp / neg)

    # get auc
    auc = np.trapz(tpr, fpr)
    return auc

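def pr_auc_from_probs(y_true, y_prob):
    """
    Computes the area under the precision-recall curve from scores
    (probabilities or decision-function values), using the trapezoidal rule.
    """
    # Sort by predicted score, descending
    desc_sort_indices = np.argsort(-np.asarray(y_prob))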
    y_true = np.array(y_true)[desc_sort_indices]
    y_prob = np.array(y_prob)[desc_sort_indices]
    
    tp = 0
    fp = 0
    pos = np.sum(y_true == 1)
    
    precision = [1.0]  # starts at 1 when recall=0
    recall = [0.0]
    
    for i in range(len(y_true)):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        prec = tp / (tp + fp)
        rec = tp / pos
        precision.append(prec)
        recall.append(rec)
    
    # convert to arrays for integration (recall reaches 1 after the last sample)
    precision = np.array(precision)
    recall = np.array(recall)
    
    # integrate area under curve
    auc_pr = np.trapz(precision, recall)
    return auc_pr
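
The step-wise ROC-AUC above processes tied scores one at a time, so its value can depend on how ties happen to be ordered. As a hedged cross-check for small inputs, the pairwise definition of AUC (the probability that a random positive outranks a random negative, counting ties as one half) can be computed directly. `pairwise_auc` is a hypothetical helper name, and its cost grows with n_pos * n_neg, so it is only meant for spot checks on small samples.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Toy example with one tie between a positive and a negative (0.8 vs 0.8)
auc = pairwise_auc([1, 0, 1, 0], [0.9, 0.8, 0.8, 0.1])
print(auc)  # 0.875
```

Of the four positive/negative pairs, three are ranked correctly and one is tied, giving (3 + 0.5) / 4 = 0.875.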

In [3]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def cv_svc_pca(data, feature_cols, target_col, params=None, n_components=0.95):
    
    if params is None:
        params = {}
    params = {
        "dual": True,
        "class_weight": "balanced",
        **params
    }

    fold_metrics = []

    for f in sorted(data.fold.unique()):
        train = data[data.fold != f]
        test  = data[data.fold == f]
        X_train, y_train = train[feature_cols], train[target_col]
        X_test,  y_test  = test[feature_cols],  test[target_col]

        pipe = make_pipeline(
            StandardScaler(),
            PCA(n_components=n_components),
            LinearSVC(**params)
        )

        pipe.fit(X_train, y_train)

        y_pred = pipe.predict(X_test)
        y_scores = pipe.decision_function(X_test)
        y_train_scores = pipe.decision_function(X_train)
        # LinearSVC has no predict_proba, so decision_function scores are used for ROC-AUC

        m = classification_metrics(y_test, y_pred)
        m["roc_auc"]       = roc_auc_from_probs(y_test, y_scores)
        m["train_roc_auc"] = roc_auc_from_probs(y_train, y_train_scores)
        m["fold"] = int(f)
        fold_metrics.append(m)

    return pd.DataFrame(fold_metrics).sort_values("fold").reset_index(drop=True)
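
`LinearSVC` only exposes `decision_function`, which is enough for ranking metrics such as ROC-AUC. If calibrated probabilities were needed later, one option is to wrap the classifier in `CalibratedClassifierCV`, which is already imported above. The following is a sketch on synthetic data under that assumption, not the configuration used in this notebook; the dataset, `C` value, and `max_iter` are illustrative.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in data (not the application dataset)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Wrap the linear SVC so the pipeline exposes predict_proba
calibrated_svc = CalibratedClassifierCV(
    LinearSVC(C=0.1, class_weight="balanced", dual=True, max_iter=5000),
    method="sigmoid",  # Platt scaling on the decision scores
    cv=3,
)
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), calibrated_svc)
pipe.fit(X, y)

y_prob = pipe.predict_proba(X)[:, 1]  # estimated probability of class 1
print(y_prob[:5].round(3))
```

Calibration refits the SVC once per internal fold, so it multiplies training time; that cost matters when each cross-validation run already takes minutes.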

# Model Development

**Notes:** 
- All evaluation will focus on stratified cross-validation, but we will test the other methods as well.
- Recall that folds have been pre-assigned to ensure consistency across different model development processes.
- We standardize the features and apply PCA (keeping 95% of the variance) before fitting the SVC.
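
In the pipeline above, `PCA(n_components=0.95)` asks scikit-learn to keep the smallest number of components whose cumulative explained variance reaches 95%. The NumPy sketch below reproduces that selection rule on toy data so the effect is easy to inspect; `n_components_for_variance` is a hypothetical helper, and the toy matrix is made up.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.95):
    """Smallest number of principal components reaching `threshold` of variance."""
    X = np.asarray(X, dtype=float)
    # Standardize each column (same role as StandardScaler)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Squared singular values of the standardized data are proportional
    # to the variance carried by each principal component
    s = np.linalg.svd(Z, compute_uv=False)
    ratio = s**2 / np.sum(s**2)
    cumulative = np.cumsum(ratio)
    return int(np.searchsorted(cumulative, threshold) + 1), cumulative

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Third column duplicates the first, so two components carry all the variance
X_toy = np.column_stack([a, b, a])

k, cumulative = n_components_for_variance(X_toy)
print(k, cumulative.round(3))
```

With a duplicated column the third component carries no variance, so two components suffice. On the real feature set, the same rule decides how many of the scaled features' directions reach the SVC.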

**Process:**
1. Setting a baseline
    - evaluate an SVC model (`LinearSVC`) with default parameters to build on
2. Hyperparameter tuning
    - evaluate many different combinations of parameters
    - choose the best set based on average ROC-AUC across all folds
3. Holdout evaluation
    - evaluate on the corresponding holdout table; the performance here is what we will compare with the other models (LR, LDA, XGBoost)
4. Threshold tuning
    - tweak the threshold on the best model to maximize another chosen metric (recall, precision, F1, balanced accuracy, etc.)
        - note that ROC-AUC is not affected by the threshold, hence the need for a different optimizing metric
    - which metric we optimize with the threshold depends on business needs
        - what is the cost of mislabeling someone as high risk, or of trusting an applicant we shouldn't? will there be human review?
        - something we can include in the write-up as an option moving forward, not something we have to decide now on our own
        - "our model is very solid at ranking applicants from low-risk to high-risk, but in terms of actual classification, we can move the threshold based on what matters most to the business"
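
The threshold-tuning step can be sketched as a simple scan over candidate cut-offs on the decision scores. For `LinearSVC`, the default `predict` corresponds to a threshold of 0 on `decision_function`; moving that cut-off trades precision against recall without changing ROC-AUC. The helper `best_threshold` and the toy scores below are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def best_threshold(y_true, scores, metric="f1"):
    """Scan every distinct score as a threshold and return the best one."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best_t, best_value = None, -1.0
    for t in np.unique(scores):
        y_pred = (scores >= t).astype(int)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        prec = tp / max(tp + fp, 1)
        rec = tp / max(tp + fn, 1)
        spec = tn / max(tn + fp, 1)
        if metric == "f1":
            value = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
        elif metric == "bal_acc":
            value = 0.5 * (rec + spec)
        else:
            raise ValueError(f"unknown metric: {metric}")
        if value > best_value:
            best_t, best_value = float(t), float(value)
    return best_t, best_value

# Toy decision scores (made up): positives tend to score higher
y_toy = [0, 0, 0, 0, 1, 0, 1, 1]
s_toy = [-2.0, -1.5, -1.0, -0.2, 0.1, 0.3, 0.8, 1.5]

t_f1, f1 = best_threshold(y_toy, s_toy, metric="f1")
print(t_f1, round(f1, 3))  # 0.1 0.857
```

To avoid an optimistic estimate, the threshold would normally be chosen on cross-validation predictions and only then applied to the holdout set.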

# Setup

In [4]:
apps_cv_strat = pd.read_csv("data/apps_cv_strat.csv")
apps_holdout_strat = pd.read_csv("data/apps_holdout_strat.csv")
target_col = 'TARGET'
feature_cols = [col for col in apps_cv_strat.columns if col not in 
                [target_col, 'SK_ID_CURR', 'fold', 'neighbors_target_mean_500' , 'AGE_INT', 'DAYS_BIRTH', 'CODE_GENDER_M', 'CODE_GENDER_XNA', 'NAME_FAMILY_STATUS_Previously Married', 'NAME_FAMILY_STATUS_Single']]
print(sorted(feature_cols))

['AMT_ANNUITY', 'AMT_CREDIT', 'AMT_GOODS_PRICE', 'AMT_INCOME_TOTAL', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'BUREAU_REQ_MISSING', 'BUREAU_REQ_TOTAL', 'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'DAYS_CREDIT_mean_external', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'DAYS_LAST_PHONE_CHANGE', 'DAYS_REGISTRATION', 'DEF_30_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'EMERGENCYSTATE_MODE_UNKNOWN', 'EMERGENCYSTATE_MODE_Yes', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'EXT_SOURCE_MEAN', 'FLAG_CONT_MOBILE', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCU

In [10]:
params = {}
results = cv_svc_pca(apps_cv_strat, feature_cols, target_col, params=params)



In [11]:
results

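|     | n     | tp   | tn    | fp   | fn   | acc      | bal_acc  | prec     | rec      | spec     | f1       | roc_auc  | train_roc_auc | fold |
| --- | ----- | ---- | ----- | ---- | ---- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | ------------- | ---- |
| 0   | 49156 | 1662 | 39535 | 5652 | 2307 | 0.838087 | 0.646833 | 0.227235 | 0.418745 | 0.87492  | 0.294602 | 0.734349 | 0.734456      | 1    |
| 1   | 49156 | 1987 | 37169 | 8018 | 1982 | 0.796566 | 0.661595 | 0.198601 | 0.50063  | 0.82256  | 0.284385 | 0.739607 | 0.735101      | 2    |
| 2   | 49156 | 1839 | 37662 | 7525 | 2130 | 0.803585 | 0.648405 | 0.19639  | 0.463341 | 0.83347  | 0.275857 | 0.725445 | 0.740305      | 3    |
| 3   | 49155 | 1994 | 37395 | 7791 | 1975 | 0.801322 | 0.664986 | 0.203781 | 0.502394 | 0.827579 | 0.289952 | 0.743978 | 0.741105      | 4    |
| 4   | 49154 | 1935 | 37294 | 7892 | 2033 | 0.798084 | 0.656498 | 0.196906 | 0.487651 | 0.825344 | 0.280536 | 0.73469  | 0.736279      | 5    |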
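
One quick read of the baseline results is the gap between training and validation ROC-AUC per fold, which hints at how much the model overfits. The sketch below recomputes that gap from the numbers shown above; the `gap` column and the summary layout are illustrative choices.

```python
import pandas as pd

# Per-fold ROC-AUC values copied from the baseline results above
baseline = pd.DataFrame({
    "fold": [1, 2, 3, 4, 5],
    "roc_auc": [0.734349, 0.739607, 0.725445, 0.743978, 0.734690],
    "train_roc_auc": [0.734456, 0.735101, 0.740305, 0.741105, 0.736279],
})

# Positive gap means the model ranks training rows better than validation rows
baseline["gap"] = baseline["train_roc_auc"] - baseline["roc_auc"]
summary = baseline[["roc_auc", "train_roc_auc", "gap"]].agg(["mean", "std"])
print(summary.round(4))
```

The mean gap is roughly 0.002, so the baseline linear model shows little sign of overfitting; the fold-to-fold variation in validation ROC-AUC is larger than the gap itself.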


In [12]:
from itertools import product
import time

def grid_search_SVC(data, feature_cols, target_col, param_grid):
    
    # get all possible combinations of parameters
    keys = list(param_grid.keys())
    combos = [dict(zip(keys, v)) for v in product(*param_grid.values())]

    # initialize stuff for tracking and results
    results = []
    total = len(combos)
    start = time.time()
    next_checkpoint = 5 
    best_roc_auc = 0
    best_params = None

    # evaluate every possible combo
    for i, params in enumerate(combos, 1):

        # run cross validation and store results
        fold_results = cv_svc_pca(data, feature_cols, target_col, params)
        mean_roc_auc = fold_results["roc_auc"].mean()

        results.append({
            'params': params,
            'mean_roc_auc': mean_roc_auc,
            'mean_f1': fold_results['f1'].mean(),
            'mean_acc': fold_results['acc'].mean(),
            'mean_bal_acc': fold_results['bal_acc'].mean(),
            'mean_prec': fold_results['prec'].mean(),
            'mean_rec': fold_results['rec'].mean(),
        })

        # tracker for updates
        if  mean_roc_auc > best_roc_auc:
            best_roc_auc =  mean_roc_auc
            best_params = params

        # print progress checkpoints
        pct_done = (i/total)*100
        elapsed = time.time() - start
        if pct_done >= next_checkpoint or i == total:
            print(f"{i}/{total} ({pct_done:5.1f}% in {elapsed/60:.1f} mins) | Best ROC-AUC: {best_roc_auc:.4f} | Best Params: {best_params}")
            next_checkpoint += 5


    
    return pd.DataFrame(results).sort_values("mean_roc_auc", ascending=False).reset_index(drop=True)
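
The combination step inside `grid_search_SVC` can be looked at in isolation. The sketch below expands a small grid the same way (via `itertools.product`) to show the order in which parameter sets are evaluated; the grid values here are illustrative.

```python
from itertools import product

# Same grid shape as a two-parameter search (values are illustrative)
param_grid = {
    "C": [0.01, 0.1, 1],
    "loss": ["hinge", "squared_hinge"],
}

keys = list(param_grid.keys())
combos = [dict(zip(keys, values)) for values in product(*param_grid.values())]

print(len(combos))  # 6
print(combos[0])    # {'C': 0.01, 'loss': 'hinge'}
print(combos[1])    # {'C': 0.01, 'loss': 'squared_hinge'}
```

The last key varies fastest, and each combination triggers one full cross-validation run, so total cost scales with the number of combinations times the number of folds.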

In [16]:
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "loss": ["hinge", "squared_hinge"]
}

wide_search_results = grid_search_SVC(apps_cv_strat, feature_cols, target_col, param_grid)



1/12 (  8.3% in 1.0 mins) | Best ROC-AUC: 0.7533 | Best Params: {'C': 0.001, 'loss': 'hinge'}
2/12 ( 16.7% in 8.4 mins) | Best ROC-AUC: 0.7533 | Best Params: {'C': 0.001, 'loss': 'hinge'}
3/12 ( 25.0% in 11.0 mins) | Best ROC-AUC: 0.7534 | Best Params: {'C': 0.01, 'loss': 'hinge'}
4/12 ( 33.3% in 18.3 mins) | Best ROC-AUC: 0.7534 | Best Params: {'C': 0.01, 'loss': 'hinge'}
5/12 ( 41.7% in 23.8 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.1, 'loss': 'hinge'}
6/12 ( 50.0% in 31.0 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.1, 'loss': 'hinge'}
7/12 ( 58.3% in 37.4 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.1, 'loss': 'hinge'}
8/12 ( 66.7% in 44.7 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.1, 'loss': 'hinge'}
9/12 ( 75.0% in 51.9 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.1, 'loss': 'hinge'}
10/12 ( 83.3% in 59.1 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.1, 'loss': 'hinge'}
11/12 ( 91.7% in 66.5 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.1, 'loss': 'hinge'}
12/12 (100.0% in 73.8 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.1, 'loss': 'hinge'}


In [17]:
wide_search_results

|     | params                                  | mean_roc_auc | mean_f1  | mean_acc | mean_bal_acc | mean_prec | mean_rec |
| --- | --------------------------------------- | ------------ | -------- | -------- | ------------ | --------- | -------- |
| 0   | `{'C': 0.1, 'loss': 'hinge'}`           | 0.753527     | 0.265728 | 0.698556 | 0.688072     | 0.165395  | 0.67557  |
| 1   | `{'C': 0.1, 'loss': 'squared_hinge'}`   | 0.75348      | 0.266677 | 0.701555 | 0.68814      | 0.166339  | 0.672143 |
| 2   | `{'C': 0.01, 'loss': 'hinge'}`          | 0.753447     | 0.26569  | 0.698723 | 0.687933     | 0.165396  | 0.675066 |
| 3   | `{'C': 0.01, 'loss': 'squared_hinge'}`  | 0.753357     | 0.265717 | 0.697917 | 0.688368     | 0.165301  | 0.676981 |
| 4   | `{'C': 0.001, 'loss': 'hinge'}`         | 0.753259     | 0.265232 | 0.697226 | 0.687923     | 0.164936  | 0.67683  |
| 5   | `{'C': 0.001, 'loss': 'squared_hinge'}` | 0.753234     | 0.265612 | 0.697551 | 0.688376     | 0.165193  | 0.677434 |
| 6   | `{'C': 1, 'loss': 'hinge'}`             | 0.745827     | 0.266079 | 0.707475 | 0.684305     | 0.166877  | 0.656673 |
| 7   | `{'C': 1, 'loss': 'squared_hinge'}`     | 0.731704     | 0.280578 | 0.801096 | 0.653681     | 0.199121  | 0.477878 |
| 8   | `{'C': 10, 'loss': 'squared_hinge'}`    | 0.606105     | 0.124352 | 0.867762 | 0.531312     | 0.138993  | 0.130068 |
| 9   | `{'C': 100, 'loss': 'squared_hinge'}`   | 0.589194     | 0.108899 | 0.87688  | 0.521904     | 0.141792  | 0.098568 |
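
The six best rows of the wide search all have `C` at or below 0.1, so a finer grid on a log scale between 10^-3 and 10^-1 is a natural next step. As a hedged alternative to stepping the exponent with `np.arange` (whose float step accumulates rounding error, giving values such as `0.010000000000000021`), `np.logspace` builds the same grid from its endpoints. The variable names below are illustrative.

```python
import numpy as np

# 20 values of C from 10^-3 up to 10^-1.1, evenly spaced in the exponent
c_grid = np.logspace(-3, -1.1, num=20)

# Equivalent construction from explicit exponents with a float step
c_grid_arange = np.array([10**x for x in np.arange(-3, -1, 0.1)])

print(len(c_grid), c_grid[0], c_grid[-1])
print(np.allclose(c_grid, c_grid_arange))
```

Both constructions give the same 20 candidates up to floating-point noise; the choice only affects how tidy the printed parameter values look.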


In [22]:
powers = np.arange(-3, -1, 0.1)
grid_values = [10**x for x in powers]

param_grid = {
    "C": grid_values,
    "loss": ["hinge"]
}

search_results2 = grid_search_SVC(apps_cv_strat, feature_cols, target_col, param_grid)



1/20 (  5.0% in 1.0 mins) | Best ROC-AUC: 0.7533 | Best Params: {'C': 0.001, 'loss': 'hinge'}
2/20 ( 10.0% in 2.0 mins) | Best ROC-AUC: 0.7533 | Best Params: {'C': 0.0012589254117941675, 'loss': 'hinge'}
3/20 ( 15.0% in 3.0 mins) | Best ROC-AUC: 0.7533 | Best Params: {'C': 0.001584893192461114, 'loss': 'hinge'}
4/20 ( 20.0% in 4.2 mins) | Best ROC-AUC: 0.7533 | Best Params: {'C': 0.0019952623149688807, 'loss': 'hinge'}
5/20 ( 25.0% in 5.4 mins) | Best ROC-AUC: 0.7533 | Best Params: {'C': 0.002511886431509582, 'loss': 'hinge'}
6/20 ( 30.0% in 6.8 mins) | Best ROC-AUC: 0.7534 | Best Params: {'C': 0.0031622776601683824, 'loss': 'hinge'}
7/20 ( 35.0% in 8.4 mins) | Best ROC-AUC: 0.7534 | Best Params: {'C': 0.003981071705534978, 'loss': 'hinge'}
8/20 ( 40.0% in 10.1 mins) | Best ROC-AUC: 0.7534 | Best Params: {'C': 0.00501187233627273, 'loss': 'hinge'}
9/20 ( 45.0% in 12.1 mins) | Best ROC-AUC: 0.7534 | Best Params: {'C': 0.006309573444801942, 'loss': 'hinge'}
10/20 ( 50.0% in 14.3 mins) | Best ROC-AUC: 0.7534 | Best Params: {'C': 0.00794328234724283, 'loss': 'hinge'}
11/20 ( 55.0% in 16.8 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.010000000000000021, 'loss': 'hinge'}
12/20 ( 60.0% in 19.7 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.012589254117941701, 'loss': 'hinge'}
13/20 ( 65.0% in 22.8 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.015848931924611172, 'loss': 'hinge'}
14/20 ( 70.0% in 26.3 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.01995262314968885, 'loss': 'hinge'}
15/20 ( 75.0% in 30.2 mins) | Best ROC-AUC: 0.7535 | Best Params: {'C': 0.01995262314968885, 'loss': 'hinge'}
16/20 ( 80.0% in 34.4 mins) | Best ROC-AUC: 0.7536 | Best Params: {'C': 0.03162277660168389, 'loss': 'hinge'}
17/20 ( 85.0% in 39.2 mins) | Best ROC-AUC: 0.7536 | Best Params: {'C': 0.03162277660168389, 'loss': 'hinge'}
18/20 ( 90.0% in 44.2 mins) | Best ROC-AUC: 0.7536 | Best Params: {'C': 0.03162277660168389, 'loss': 'hinge'}
19/20 ( 95.0% in 49.4 mins) | Best ROC-AUC: 0.7536 | Best Params: {'C': 0.03162277660168389, 'loss': 'hinge'}
20/20 (100.0% in 54.9 mins) | Best ROC-AUC: 0.7536 | Best Params: {'C': 0.03162277660168389, 'loss': 'hinge'}


In [23]:
search_results2

|     | params                                         | mean_roc_auc | mean_f1  | mean_acc | mean_bal_acc | mean_prec | mean_rec |
| --- | ---------------------------------------------- | ------------ | -------- | -------- | ------------ | --------- | -------- |
| 0   | `{'C': 0.03162277660168389, 'loss': 'hinge'}`  | 0.753587     | 0.265601 | 0.698759 | 0.687769     | 0.165351  | 0.674663 |
| 1   | `{'C': 0.01995262314968885, 'loss': 'hinge'}`  | 0.753518     | 0.265682 | 0.698731 | 0.687915     | 0.165393  | 0.675015 |
| 2   | `{'C': 0.03981071705534985, 'loss': 'hinge'}`  | 0.753513     | 0.265819 | 0.698943 | 0.68803      | 0.165499  | 0.675015 |
| 3   | `{'C': 0.015848931924611172, 'loss': 'hinge'}` | 0.753502     | 0.265716 | 0.698808 | 0.687934     | 0.165422  | 0.674965 |
| 4   | `{'C': 0.025118864315095874, 'loss': 'hinge'}` | 0.753499     | 0.265712 | 0.698869 | 0.687898     | 0.165428  | 0.674814 |
| 5   | `{'C': 0.06309573444801955, 'loss': 'hinge'}`  | 0.753489     | 0.265758 | 0.698654 | 0.688079     | 0.165424  | 0.675469 |
| 6   | `{'C': 0.012589254117941701, 'loss': 'hinge'}` | 0.753484     | 0.265838 | 0.698861 | 0.6881       | 0.165499  | 0.675267 |
| 7   | `{'C': 0.07943282347242846, 'loss': 'hinge'}`  | 0.75348      | 0.265624 | 0.698918 | 0.687741     | 0.165386  | 0.674411 |
| 8   | `{'C': 0.010000000000000021, 'loss': 'hinge'}` | 0.753473     | 0.265706 | 0.698678 | 0.687978     | 0.165399  | 0.675217 |
| 9   | `{'C': 0.0501187233627274, 'loss': 'hinge'}`   | 0.753434     | 0.26555  | 0.698711 | 0.68772      | 0.165315  | 0.674612 |


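We found the best value of `C` to be 10^-1.5 ≈ 0.0316 (with `loss="hinge"`), at a mean ROC-AUC of 0.7536. The ten best configurations differ by less than 0.0002 in mean ROC-AUC, so performance is not very sensitive to `C` in this range.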
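
To put the size of that improvement in context, the spread between the top configurations can be compared with the fold-to-fold variation of a single run. The sketch below uses the mean ROC-AUC values from the narrow search and, as a stand-in, the per-fold ROC-AUC of the baseline run (per-fold values for the tuned model are not shown in this notebook, so this comparison is an assumption-laden approximation).

```python
import numpy as np

# Mean ROC-AUC of the ten best configurations in the narrow search (copied from above)
top_means = np.array([
    0.753587, 0.753518, 0.753513, 0.753502, 0.753499,
    0.753489, 0.753484, 0.753480, 0.753473, 0.753434,
])
# Per-fold ROC-AUC of the baseline run (copied from the baseline results)
baseline_folds = np.array([0.734349, 0.739607, 0.725445, 0.743978, 0.734690])

spread = top_means.max() - top_means.min()
fold_std = baseline_folds.std(ddof=1)  # sample standard deviation across folds

print(f"spread across top configurations: {spread:.6f}")
print(f"fold-to-fold std (baseline run):  {fold_std:.6f}")
```

The spread between configurations is far smaller than the variation between folds, which suggests the exact choice of `C` within this range matters little.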

# Holdout Evaluation

In [6]:
# make train and test sets
train = apps_cv_strat
test = apps_holdout_strat
X_train, y_train = train[feature_cols], train[target_col]
X_test,  y_test  = test[feature_cols],  test[target_col]

# set up pipeline with scaling, PCA, and SVC
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LinearSVC(C=10**-1.5, loss="hinge", dual=True, class_weight="balanced")
)

# fit and predict
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
y_scores = pipe.decision_function(X_test)
y_train_scores = pipe.decision_function(X_train)

# get metrics
m = classification_metrics(y_test, y_pred)
m["roc_auc"]       = roc_auc_from_probs(y_test, y_scores)
m["train_roc_auc"] = roc_auc_from_probs(y_train, y_train_scores)
m["pr_roc_auc"] = pr_auc_from_probs(y_test, y_scores)
m



{'n': 61443,
 'tp': 3443,
 'tn': 39794,
 'fp': 16689,
 'fn': 1517,
 'acc': 0.7036928535390524,
 'bal_acc': 0.6993418962628208,
 'prec': 0.17102125968607193,
 'rec': 0.6941532258064517,
 'spec': 0.7045305667191899,
 'f1': 0.2744300972421489,
 'roc_auc': 0.7628665569086446,
 'train_roc_auc': 0.7557983686938431,
 'pr_roc_auc': 0.23600742119957863}
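
As a sanity check on the holdout output, the headline metrics can be recomputed from the confusion-matrix counts, and the PR-AUC can be compared with the share of positives in the holdout set (a random ranking has a PR-AUC close to that share). This is a worked check using only the numbers printed above; the variable names are illustrative.

```python
# Confusion-matrix counts and PR-AUC copied from the holdout output above
tp, tn, fp, fn = 3443, 39794, 16689, 1517
pr_auc = 0.23600742119957863

n = tp + tn + fp + fn
prec = tp / (tp + fp)
rec = tp / (tp + fn)
spec = tn / (tn + fp)
bal_acc = 0.5 * (rec + spec)

# Share of positives in the holdout set; a random ranking has PR-AUC near this value
prevalence = (tp + fn) / n
lift = pr_auc / prevalence

print(n, round(prec, 4), round(rec, 4), round(bal_acc, 4))
print(round(prevalence, 4), round(lift, 2))
```

The recomputed values match the reported ones, and the PR-AUC is roughly three times the positive rate of about 8%, which is a more informative reading of 0.236 than the raw number alone.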