# Tuning an SVC Model

copied from `xgb_fitting.ipynb`

The goal of this notebook is to train and evaluate an SVC model, comparing its performance on a holdout set against other types of models (LR, LDA, XGBoost).

To ensure reproducibility and consistent evaluation across models, all datasets were **pre-split into cross-validation data and holdout data**, as shown below:

| Split type           | CV training file     | Holdout file              | Description                              |
| -------------------- | -------------------- | ------------------------- | ---------------------------------------- |
| **Random**           | `apps_cv_random.csv` | `apps_holdout_random.csv` | Simple random sampling                   |
| **Stratified**       | `apps_cv_strat.csv`  | `apps_holdout_strat.csv`  | Stratified by `TARGET`                   |
| **Multi-Stratified** | `apps_cv_multi.csv`  | `apps_holdout_multi.csv`  | Stratified by `TARGET` + `CODE_GENDER_M` |

Each cross-validation dataset (`apps_cv_*.csv`) also contains a `fold` column with pre-assigned folds from 1 to 5, generated with the corresponding splitting method to ensure consistent evaluation. No additional splitting is therefore needed inside this notebook; we can simply loop through the assigned folds for cross-validation.
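
As a minimal sketch of what looping through the assigned folds looks like, the snippet below uses a tiny toy table in place of `apps_cv_strat.csv`. The `iter_folds` helper and the toy values are hypothetical and only illustrate the pattern: each fold serves once as the test set while the remaining folds form the training set.

```python
import pandas as pd

# Toy stand-in for a CV table with a pre-assigned fold column
# (hypothetical values, not the real application data)
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    "fold":   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})

def iter_folds(data, fold_col="fold"):
    """Yield (fold_id, train, test) using the pre-assigned fold column."""
    for f in sorted(data[fold_col].unique()):
        yield int(f), data[data[fold_col] != f], data[data[fold_col] == f]

# Record the train/test sizes for each fold
fold_sizes = {f: (len(train), len(test)) for f, train, test in iter_folds(toy)}
print(fold_sizes)
```

Because the folds live in the data itself, every model notebook that loops this way sees exactly the same train/test partitions.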


In [1]:
import pandas as pd 
import numpy as np 
from sklearn.svm import SVC
from sklearn.decomposition import PCA

In [4]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

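# classification_metrics and roc_auc_from_probs are assumed to be defined or
# imported elsewhere; they are not defined in this cell.
def cv_svc_pca(data, feature_cols, target_col, params=None, n_components=0.95):
    """Cross-validate an SVC over the pre-assigned folds and return per-fold metrics."""
    if params is None:
        params = {}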
        
    fold_metrics = []
    for f in sorted(data.fold.unique()):

        # split into train and test based on folds
        train = data[data.fold != f]
        test  = data[data.fold == f]
        X_train, y_train = train[feature_cols], train[target_col]
        X_test,  y_test  = test[feature_cols],  test[target_col]

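        # standardize features using statistics from the training folds only
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test  = scaler.transform(X_test)

        # fit PCA on the training folds and project both sets onto the components
        pca = PCA(n_components=n_components)
        X_train = pca.fit_transform(X_train)
        X_test  = pca.transform(X_test)

        # predict_proba below requires params to include probability=True
        model = SVC(**params)
        model.fit(X_train, y_train)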

        y_pred       = model.predict(X_test)
        y_prob       = model.predict_proba(X_test)[:, 1]
        y_train_prob = model.predict_proba(X_train)[:, 1]

        metrics = classification_metrics(y_test, y_pred)
        metrics["roc_auc"]       = roc_auc_from_probs(y_test, y_prob)
        metrics["train_roc_auc"] = roc_auc_from_probs(y_train, y_train_prob)
        metrics["fold"] = int(f)

        # add to list of all fold metrics
        fold_metrics.append(metrics)

    return pd.DataFrame(fold_metrics).sort_values("fold").reset_index(drop=True)
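
The function above calls `classification_metrics` and `roc_auc_from_probs`, which are not shown in this notebook. The sketch below is one plausible way to write them, assuming they are thin wrappers around `sklearn.metrics`. The bodies and the metric keys are assumptions, not the project's actual helpers.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

def classification_metrics(y_true, y_pred):
    """Hypothetical helper: threshold-dependent metrics from hard predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

def roc_auc_from_probs(y_true, y_prob):
    """Hypothetical helper: ROC-AUC from positive-class probabilities."""
    return roc_auc_score(y_true, y_prob)

# Tiny hand-made example (illustrative values only)
example = classification_metrics([0, 1, 1, 0], [0, 1, 0, 0])
example_auc = roc_auc_from_probs([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.2])
```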

# Model Development

**Notes:** 
- All evaluation will focus on stratified cross-validation, but we will test the other splitting methods as well.
- Recall that folds have been pre-assigned to ensure consistency across different model development processes.
- We scale the features and then apply PCA before fitting the SVC.

**Process:**
1. Setting a baseline
    - evaluating an SVC model with all default parameters to build off of
2. Hyperparameter tuning
    - evaluate many different combinations of parameters
    - choose the best set based on average ROC-AUC across all folds
3. Holdout evaluation
    - evaluate on the corresponding holdout table. the performance here is what we will compare with the other models (LR, LDA, XGBoost)
4. Threshold tuning
    - tweak the threshold on the best model to maximize another chosen metric (recall, precision, f1, balanced accuracy, etc.) 
        - note that ROC-AUC is not affected by the threshold, hence the need for a different metric to optimize
    - which metric we optimize with the threshold depends on business needs
        - consider: what is the cost of mislabeling someone as high risk, or of trusting an applicant that you shouldn't? will there be human review?
        - something we can include in the write-up as an option moving forward, not something we have to decide on our own now
        - "our model is very solid at ranking applicants from low-risk to high-risk, but in terms of actual classification, we can move the threshold based on what matters most to the business"
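
Threshold tuning (step 4) can be sketched as a simple sweep over candidate cut-offs, keeping the one that maximizes the chosen metric. The `best_threshold` function, the grid, and the toy probabilities below are all hypothetical. F1 is used only as a placeholder metric, since the choice depends on business needs.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob, metric=None, grid=None):
    """Return (threshold, score) maximizing `metric` over a grid of cut-offs."""
    if metric is None:
        # placeholder choice; swap in recall, precision, balanced accuracy, etc.
        metric = lambda yt, yp: f1_score(yt, yp, zero_division=0)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # score the hard predictions produced by each candidate threshold
    scores = [metric(y_true, (y_prob >= t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])

# Toy labels and probabilities (hypothetical, for illustration only)
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.03, 0.12, 0.22, 0.31, 0.37, 0.42, 0.63, 0.81]
threshold, score = best_threshold(y_true, y_prob)
print(f"best threshold: {threshold:.2f}, F1: {score:.3f}")
```

To keep the holdout comparison honest, the threshold would be chosen on cross-validation predictions and then applied unchanged to the holdout set.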

# Setup

In [None]:
apps_cv_strat = pd.read_csv("data/apps_cv_strat.csv")
apps_holdout_strat = pd.read_csv("data/apps_holdout_strat.csv")
target_col = 'TARGET'
feature_cols = [col for col in apps_cv_strat.columns if col not in 
                [target_col, 'SK_ID_CURR', 'fold', 'neighbors_target_mean_500']]
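
With the CV and holdout tables loaded, the holdout evaluation (step 3) amounts to fitting the full preprocessing and model chain on the CV data and scoring the untouched holdout rows. The sketch below runs on synthetic data from `make_classification`, not the real tables. The column names and SVC settings are placeholders rather than the tuned configuration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the CV and holdout tables (not the real application data)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0, random_state=0)
cols = [f"x{i}" for i in range(X.shape[1])]
frame = pd.DataFrame(X, columns=cols).assign(TARGET=y)
cv_data, holdout = frame.iloc[:240], frame.iloc[240:]

# Scaler and PCA are fit on the CV data only, then applied unchanged to the holdout
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("svc", SVC(kernel="linear", class_weight="balanced",
                probability=True, random_state=0)),
])
pipe.fit(cv_data[cols], cv_data["TARGET"])

# Score the holdout rows with positive-class probabilities
holdout_prob = pipe.predict_proba(holdout[cols])[:, 1]
holdout_auc = roc_auc_score(holdout["TARGET"], holdout_prob)
print(f"holdout ROC-AUC: {holdout_auc:.3f}")
```

Wrapping the steps in a `Pipeline` keeps the scaler and PCA from ever seeing holdout rows during fitting.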

In [None]:
corr = apps_cv_strat[feature_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

print(f"Dropping {len(to_drop)} highly correlated features")

feature_cols_pruned = [f for f in feature_cols if f not in to_drop]

In [None]:
params = {
    "C": 1.0,
    "kernel": "linear",
    "class_weight": "balanced",
    "probability": True,
}
results = cv_svc_pca(apps_cv_strat, feature_cols, target_col, params=params)
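
The call above evaluates a single parameter set. Hyperparameter tuning (step 2) could be sketched as a loop over a grid, ranking combinations by mean ROC-AUC across folds. The `tune` function is hypothetical, and `fake_cv` is a stand-in with made-up scores so the sketch runs on its own. In the notebook, `cv_svc_pca` would be passed instead.

```python
import pandas as pd
from sklearn.model_selection import ParameterGrid

def tune(cv_fn, data, feature_cols, target_col, grid, fixed=None):
    """Evaluate every combination in `grid` with `cv_fn`; rank by mean ROC-AUC."""
    rows = []
    for combo in ParameterGrid(grid):
        params = {**(fixed or {}), **combo}
        fold_df = cv_fn(data, feature_cols, target_col, params=params)
        rows.append({
            **combo,
            "mean_roc_auc": fold_df["roc_auc"].mean(),
            "mean_train_roc_auc": fold_df["train_roc_auc"].mean(),
        })
    ranked = pd.DataFrame(rows).sort_values("mean_roc_auc", ascending=False)
    return ranked.reset_index(drop=True)

# Stand-in for cv_svc_pca (made-up scores, for illustration only)
def fake_cv(data, feature_cols, target_col, params=None):
    score = 0.70 if params["C"] == 1.0 else 0.65
    return pd.DataFrame({
        "fold": [1, 2],
        "roc_auc": [score, score],
        "train_roc_auc": [score + 0.05, score + 0.05],
    })

ranking = tune(
    fake_cv, None, [], "TARGET",
    grid={"C": [0.1, 1.0], "kernel": ["linear", "rbf"]},
    fixed={"class_weight": "balanced", "probability": True},
)
print(ranking)
```

Keeping the mean training ROC-AUC next to the validation score makes overfit combinations easy to spot.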

