# **Week 5 - Capstone Development**
## **Support Vector Machines**

For this notebook, we take the best-performing dataset from the previous weeks and see how a support vector classifier performs on it.


Source: https://scikit-learn.org/stable/modules/svm.html

### **Imports**

In [2]:
# Standard Libraries
import os
import time
import math
import io
import zipfile
import requests
from urllib.parse import urlparse
from itertools import chain, combinations

# Data Science Libraries
import numpy as np
import pandas as pd
import seaborn as sns

# Visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.ticker as mticker  # Optional: Format y-axis labels as dollars

# Scikit-learn (Machine Learning)
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV,
    RepeatedStratifiedKFold,
    RepeatedKFold
)

from sklearn.svm import LinearSVC, SVC, NuSVC
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, root_mean_squared_error, accuracy_score, f1_score, roc_auc_score, balanced_accuracy_score
from sklearn.feature_selection import SequentialFeatureSelector, f_regression, SelectKBest
from sklearn.linear_model import LogisticRegression, Lasso, RidgeClassifier, ElasticNet
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_predict, KFold

# Progress Tracking
from tqdm import tqdm

# =============================
# Global Variables
# =============================
random_state = 42

### **Dataset Imports**

### **Useful Functions**

Here we will import functions used in previous weeks to handle our data modeling. 

#### **Train Test Split**

In [None]:
# ===========================================================================================
# Function taken from Module 3 Final Project
# https://github.com/LeeMcFarling/Final_Project_Writeup/blob/main/Final_Project_Report.ipynb
# ===========================================================================================

def train_test_split_data(df, target_col):
    X = df.drop(columns=target_col)
    y = df[target_col]

    # Hold out 20% of the rows for testing; fixed seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    return X_train, X_test, y_train, y_test
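
The split above is a plain random split. Since the models later in this notebook use `class_weight='balanced'`, the classes are presumably imbalanced, so a stratified split may be worth considering. Below is a minimal sketch of that idea; `train_test_split_stratified` and the toy DataFrame are hypothetical and not part of the original project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def train_test_split_stratified(df, target_col, test_size=0.2, random_state=42):
    """Hypothetical variant of train_test_split_data that stratifies on the target."""
    X = df.drop(columns=target_col)
    y = df[target_col]
    # stratify=y keeps the class proportions roughly equal in both splits
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

# Made-up toy data: 100 rows, 20% positive class
toy_df = pd.DataFrame({
    "feature_a": range(100),
    "feature_b": [i % 7 for i in range(100)],
    "target": [1 if i % 5 == 0 else 0 for i in range(100)],
})

X_tr, X_te, y_tr, y_te = train_test_split_stratified(toy_df, "target")
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # positive-class share in each split
```

With stratification, the positive-class share in the train and test splits should stay close to the share in the full dataset, which makes the test metrics less dependent on a lucky or unlucky split.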

#### **Run Model Classifier**

In [None]:
# =============================================================================================
# Taken from Mod 3 Week 8:
# https://github.com/waysnyder/Module-3-Assignments/blob/main/Homework_08.ipynb
# 
# Global dataframe logic taken from mod 3 final project: 
# https://github.com/LeeMcFarling/Final_Project_Writeup/blob/main/Final_Project_Report.ipynb
# 
# Final Function was developed in Week 2 of this Module
# =============================================================================================

def run_model_classifier(model, X_train, y_train, X_test, y_test, n_repeats=10, n_jobs=-1, run_comment=None, return_model=False, concat_results=False, **model_params):

    global combined_results
    # Remove extra key used to store error metric, if it was added to the parameter dictionary
    if 'accuracy_found' in model_params:
        model_params = model_params.copy()
        model_params.pop('accuracy_found', None)  
        
    # Instantiate the model if a class is provided
    if isinstance(model, type):
        model = model(**model_params)
    else:                                    
        model.set_params(**model_params)    

    model_name = model.__class__.__name__  # model is always an instance at this point


    # Use RepeatedStratifiedKFold for classification to preserve class distribution
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats, random_state=42)
    
    # Perform repeated stratified 5-fold cross-validation using accuracy as the scoring metric
    cv_scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=n_jobs)
    
    mean_cv_accuracy = np.mean(cv_scores)
    std_cv_accuracy  = np.std(cv_scores)
    
    # Fit the model on the full training set
    model.fit(X_train, y_train)
    
    # Compute training and testing accuracy
    train_preds    = model.predict(X_train)
    test_preds     = model.predict(X_test)

    # Normal Accuracy 
    train_accuracy = accuracy_score(y_train, train_preds)
    test_accuracy  = accuracy_score(y_test, test_preds)

    # Balanced Accuracy Metrics
    balanced_train_accuracy = balanced_accuracy_score(y_train, train_preds)
    balanced_test_accuracy = balanced_accuracy_score(y_test, test_preds)

    results_df = pd.DataFrame([{
        'model': model_name, 
        'model_params': model.get_params(),
        'mean_cv_accuracy': mean_cv_accuracy,
        'std_cv_accuracy': std_cv_accuracy,
        'train_accuracy': train_accuracy, 
        'test_accuracy': test_accuracy,
        'balanced_train_accuracy' : balanced_train_accuracy,
        'balanced_test_accuracy': balanced_test_accuracy,
        'run_comment': run_comment
    }])
    
    if concat_results:
        try:
            combined_results = pd.concat([combined_results, results_df], ignore_index=True)
        except NameError:
            combined_results = results_df

    return (results_df, model) if return_model else results_df
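
To make the cross-validation step in `run_model_classifier` a bit more concrete, here is a small sketch on synthetic data. The dataset comes from `make_classification` and is purely illustrative; the names `X_cv`, `y_cv`, and `demo_scores` are assumptions for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic, imbalanced toy data (roughly 80/20 class split)
X_cv, y_cv = make_classification(
    n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=42
)

demo_repeats = 3
demo_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=demo_repeats, random_state=42)

# One accuracy score per fold per repeat
demo_scores = cross_val_score(
    LinearSVC(max_iter=10000, random_state=42),
    X_cv, y_cv, scoring="accuracy", cv=demo_cv,
)

print(len(demo_scores))  # n_splits * n_repeats scores in total
print(np.mean(demo_scores), np.std(demo_scores))
```

The mean and standard deviation reported by the function are taken over all `n_splits * n_repeats` fold scores, so raising `n_repeats` smooths the estimate at the cost of more fits.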

___

### **Data Pre-Processing**

Train-test splitting our data: 


___

## **Modeling**

Let's start with a Linear Support Vector Classifier with parameters set to the same general params as our original Logistic Regression. 
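
Not every `LogisticRegression` parameter carries over to `LinearSVC`. A quick way to check which ones do is to compare the two estimators' parameter names, as in the sketch below. It only inspects the default estimators and does not fit anything.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Parameter names accepted by each estimator's constructor
logreg_params = set(LogisticRegression().get_params())
svc_params = set(LinearSVC().get_params())

shared = sorted(logreg_params & svc_params)
logreg_only = sorted(logreg_params - svc_params)

print("shared:", shared)
print("LogisticRegression only:", logreg_only)
```

In particular, `solver` is specific to `LogisticRegression`, so passing it to `LinearSVC` raises an error rather than being silently ignored.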

First up is the Big Data Bowl dataset:

In [None]:
params_ = {
    'class_weight' : 'balanced',        # attempt to balance dataset
    # Note: LinearSVC has no 'solver' parameter (unlike LogisticRegression)
    'penalty': 'l2',                    # default
    'fit_intercept': True,              # must be a boolean, not a string
    'max_iter' : 50000,                 # Iteratively increased this until Convergence Warnings went away
    'tol': 1e-2,                        # Another convergence warning measure
    'random_state' : 42
}

BDB_Linear_SVC_results_df = run_model_classifier(
    LinearSVC,
    # BDB_PCA_X_train, 
    # BDB_PCA_y_train, 
    # BDB_PCA_X_test,
    # BDB_PCA_y_test,
    n_repeats=5, 
    n_jobs=-1, 
    run_comment='BDB - Linear SVC', 
    return_model=False,
    concat_results=True,
    **params_
    )

BDB_Linear_SVC_results_df

Then we try the same setup on the First and Future dataset:

In [None]:
params_ = {
    'class_weight' : 'balanced',        # attempt to balance dataset
    # Note: LinearSVC has no 'solver' parameter (unlike LogisticRegression)
    'penalty': 'l2',                    # default
    'fit_intercept': True,              # must be a boolean, not a string
    'max_iter' : 50000,                 # Iteratively increased this until Convergence Warnings went away
    'tol': 1e-2,                        # Another convergence warning measure
    'random_state' : 42
}

FNF_Linear_SVC_results_df = run_model_classifier(
    LinearSVC,
    # BDB_PCA_X_train, 
    # BDB_PCA_y_train, 
    # BDB_PCA_X_test,
    # BDB_PCA_y_test,
    n_repeats=5, 
    n_jobs=-1, 
    run_comment='FNF - Linear SVC - Baseline', 
    return_model=False,
    concat_results=True,
    **params_
    )

FNF_Linear_SVC_results_df
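
Both runs above raise `max_iter` and loosen `tol` to quiet convergence warnings. Linear SVMs are sensitive to feature scale, so standardizing the features first may help as well. The sketch below shows one way to do that with a pipeline; it uses synthetic data, and the names `X_scale_demo`, `y_scale_demo`, and `scaled_svc` are assumptions for illustration, not part of the project code.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic toy data with one feature deliberately put on a much larger scale
X_scale_demo, y_scale_demo = make_classification(
    n_samples=300, n_features=8, random_state=42
)
X_scale_demo[:, 0] *= 1000.0

# Scaling happens inside the pipeline, so it is re-fit on each training fold
# whenever the pipeline is passed to cross-validation
scaled_svc = make_pipeline(
    StandardScaler(),
    LinearSVC(class_weight="balanced", max_iter=10000, random_state=42),
)
scaled_svc.fit(X_scale_demo, y_scale_demo)

print(scaled_svc.score(X_scale_demo, y_scale_demo))  # training accuracy
```

Because a pipeline is itself an estimator, an instance like `scaled_svc` could be passed to `run_model_classifier` in place of the bare `LinearSVC` class, although the logged model name would then be `Pipeline`.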

###