# Capstone Project: Credit Default Risk
## Sai Nerusu (Dataset Aggregation, Joins, & SVM)

### Table of contents:
- Loading data
- Joining Data
- SVC Modeling
   
    


## Load Data

In [1]:
# Package Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import os
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import test and train data
app_test = pd.read_csv(r'home-credit-default-risk/application_test.csv')
app_train = pd.read_csv(r'home-credit-default-risk/application_train.csv')
bureau = pd.read_csv(r'home-credit-default-risk/bureau.csv')
# bureau_bal = pd.read_csv(r'home-credit-default-risk/bureau_balance.csv')
# cc_bal = pd.read_csv(r'home-credit-default-risk/credit_card_balance.csv')
# inst_pymt = pd.read_csv(r'home-credit-default-risk/installments_payments.csv')
# pos_bal = pd.read_csv(r'home-credit-default-risk/POS_CASH_balance.csv')
prv_app = pd.read_csv(r'home-credit-default-risk/previous_application.csv')
smpl_sub = pd.read_csv(r'home-credit-default-risk/sample_submission.csv')

In [3]:
print(app_test.shape)
app_train.shape

(48744, 121)


(307511, 122)

In [4]:
# Show training data target response totals
app_train['TARGET'].value_counts() # Only 8% of the training data has a target value of 1

0    282686
1     24825
Name: TARGET, dtype: int64
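
The 8% figure in the comment above can be checked directly from the counts. This is a small sketch that re-enters the two totals by hand (copied from the output above) rather than reloading `app_train`; the variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
target_counts = pd.Series({0: 282686, 1: 24825}, name="TARGET")

# Share of each class among all training rows.
class_share = target_counts / target_counts.sum()
default_rate = class_share.loc[1]

print(class_share.round(4))
```

On the real data the same shares should come from `app_train['TARGET'].value_counts(normalize=True)`.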

In [5]:
MajorityClass_sub = smpl_sub.copy()
MajorityClass_sub.iloc[:,1] = 1

print(MajorityClass_sub.head())
MajorityClass_sub.to_csv("Majority_sub.csv", index = False, header = True)

   SK_ID_CURR  TARGET
0      100001       1
1      100005       1
2      100013       1
3      100028       1
4      100038       1


In [1]:
def percentage_na_values_table(df):
    # Sum of NA values in df
    na_val = df.isna().sum()
    
    # Percentage of NA values
    na_val_perc = 100 * na_val / len(df)
    
    # Create Table
    na_col_table = pd.concat([na_val, na_val_perc], axis = 1)
    
    # Sort by %NA descending
    na_col_table = na_col_table[
        na_col_table.iloc[:,1] != 0].sort_values(1, ascending = False).round(1) 
    
    # Add column names
    na_col_table = na_col_table.rename(columns = {0: 'Total NA\'s in Column', 1: "Percentage NA"})
    
    print("DF has " + str(df.shape[1]) + " columns.\nThere are " + str(na_col_table.shape[0]) + " columns that have missing values.")
    
    # Return Table
    return na_col_table
    

In [7]:
percentage_na_values_table(app_train).tail(30)

DF has 122 columns.
There are 67 columns that have missing values.


Unnamed: 0,Total NA's in Column,Percentage NA
LIVINGAREA_MEDI,154350,50.2
LIVINGAREA_MODE,154350,50.2
LIVINGAREA_AVG,154350,50.2
HOUSETYPE_MODE,154297,50.2
FLOORSMAX_MEDI,153020,49.8
FLOORSMAX_AVG,153020,49.8
FLOORSMAX_MODE,153020,49.8
YEARS_BEGINEXPLUATATION_AVG,150007,48.8
YEARS_BEGINEXPLUATATION_MEDI,150007,48.8
YEARS_BEGINEXPLUATATION_MODE,150007,48.8


Over half of the columns in the training data (67 of 122) contain missing values, and many of those are missing around 50% of their rows. This is a major issue, so let's see what the other datasets look like.
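
To put a number on how many columns cross a given missing-value threshold, a helper along these lines could be used. It is a sketch on a toy frame; `columns_above_na_threshold` and the toy column names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def columns_above_na_threshold(df, threshold=0.5):
    """Return the missing-value share of every column whose share exceeds `threshold`."""
    na_share = df.isna().mean()  # fraction of missing rows per column
    return na_share[na_share > threshold].sort_values(ascending=False)

# Toy data standing in for app_train.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "half_missing": [np.nan, np.nan, 1.0, 2.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

high_na = columns_above_na_threshold(toy, threshold=0.5)
print(high_na)
```

Calling `len(columns_above_na_threshold(app_train))` would give the exact count of columns with more than half of their values missing.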

In [8]:
percentage_na_values_table(app_test).tail(25)

DF has 121 columns.
There are 64 columns that have missing values.


Unnamed: 0,Total NA's in Column,Percentage NA
LIVINGAREA_MODE,23552,48.3
FLOORSMAX_MEDI,23321,47.8
FLOORSMAX_MODE,23321,47.8
FLOORSMAX_AVG,23321,47.8
YEARS_BEGINEXPLUATATION_MEDI,22856,46.9
YEARS_BEGINEXPLUATATION_MODE,22856,46.9
YEARS_BEGINEXPLUATATION_AVG,22856,46.9
TOTALAREA_MODE,22624,46.4
EMERGENCYSTATE_MODE,22209,45.6
EXT_SOURCE_1,20532,42.1


In [9]:
percentage_na_values_table(bureau).tail(10)

DF has 17 columns.
There are 7 columns that have missing values.


Unnamed: 0,Total NA's in Column,Percentage NA
AMT_ANNUITY,1226791,71.5
AMT_CREDIT_MAX_OVERDUE,1124488,65.5
DAYS_ENDDATE_FACT,633653,36.9
AMT_CREDIT_SUM_LIMIT,591780,34.5
AMT_CREDIT_SUM_DEBT,257669,15.0
DAYS_CREDIT_ENDDATE,105553,6.1
AMT_CREDIT_SUM,13,0.0


## Joining Datasets
For further analysis, we join the datasets together to get a fuller picture of each applicant. The datasets have different levels of granularity, which we will need to address for each dataset we want to combine.

In [15]:
app_test.shape

(48744, 121)

In [16]:
# Before joining other tables, first combine the application data. This keeps the same set of columns in both parts when we separate them later.
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
data_j = pd.concat([app_test, app_train], ignore_index = True)
data_j.shape

(356255, 122)
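
One way the combined table could be separated again later is by checking where `TARGET` is missing, since only the training rows carry a label. The sketch below uses toy frames and assumes `TARGET` has no missing values within the training data, which matches the counts shown earlier (282686 + 24825 = 307511 rows).

```python
import pandas as pd

# Toy stand-ins for app_test (no TARGET) and app_train (with TARGET).
toy_test = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [100.0, 200.0]})
toy_train = pd.DataFrame({
    "SK_ID_CURR": [3, 4, 5],
    "AMT_CREDIT": [50.0, 60.0, 70.0],
    "TARGET": [0, 1, 0],
})

# Stack both parts; test rows get NaN in TARGET.
combined = pd.concat([toy_test, toy_train], ignore_index=True)

# Split back on the missing label.
is_test = combined["TARGET"].isna()
test_part = combined.loc[is_test].drop(columns="TARGET")
train_part = combined.loc[~is_test].copy()
train_part["TARGET"] = train_part["TARGET"].astype(int)  # NaN forced float dtype

print(combined.shape, test_part.shape, train_part.shape)
```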

### Bureau Dataset
We first look at the bureau dataset because it is directly related to the application datasets through the SK_ID_CURR variable.

In [17]:
print(bureau.shape)
bureau.head()

(1716428, 17)


Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


The bureau data has multiple rows per applicant, so joining it directly to the application data would duplicate application rows. Instead, we create a smaller table of aggregated variables that matches the granularity of the application data (one row per SK_ID_CURR).
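
The granularity problem can be seen on a toy example. The frames below are made up for illustration; only the column names `SK_ID_CURR` and `AMT_CREDIT_SUM` come from the data.

```python
import pandas as pd

# Two applicants; the first has three bureau records, the second has one.
apps = pd.DataFrame({"SK_ID_CURR": [100001, 100002]})
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100001, 100002],
    "AMT_CREDIT_SUM": [10.0, 20.0, 30.0, 5.0],
})

# Direct join: applicant 100001 is repeated once per bureau record.
direct = apps.merge(bureau_toy, on="SK_ID_CURR", how="left")

# Aggregate to one row per applicant first, then join.
per_applicant = bureau_toy.groupby("SK_ID_CURR", as_index=False)["AMT_CREDIT_SUM"].sum()
aggregated = apps.merge(per_applicant, on="SK_ID_CURR", how="left", validate="one_to_one")

print(len(direct), len(aggregated))
```

Passing `validate="one_to_one"` makes pandas raise an error if the aggregated table ever contains more than one row per applicant.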

In [18]:
num_loans = bureau.groupby('SK_ID_CURR', as_index = False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'NUM_PREV_LOANS'})
num_loans.head()

Unnamed: 0,SK_ID_CURR,NUM_PREV_LOANS
0,100001,7
1,100002,8
2,100003,4
3,100004,2
4,100005,3


In [19]:
bureau['NUM_ACTIVE_ACNT'] = np.where(bureau['CREDIT_ACTIVE'] == 'Closed', 0, 1)
num_actv_lns = bureau.groupby('SK_ID_CURR', as_index = False)['NUM_ACTIVE_ACNT'].sum()
num_actv_lns.head()

Unnamed: 0,SK_ID_CURR,NUM_ACTIVE_ACNT
0,100001,3
1,100002,2
2,100003,1
3,100004,0
4,100005,2


In [20]:
AMT_CREDIT_T = bureau.groupby('SK_ID_CURR', as_index = False)['AMT_CREDIT_SUM'].sum().round(2).rename(columns = {'AMT_CREDIT_SUM': 'TOTAL_CREDIT_AMT_PREV'})
AMT_CREDIT_T.head()

Unnamed: 0,SK_ID_CURR,TOTAL_CREDIT_AMT_PREV
0,100001,1453365.0
1,100002,865055.56
2,100003,1017400.5
3,100004,189037.8
4,100005,657126.0


In [21]:
# This function aggregates the numeric variables of a table so that its granularity matches the application data
# (one row per group_var value). Be sure the columns it is used on are continuous and not categorical.

def agg_numeric_categorical(df, group_var, df_name):
    # Drop every ID column except the grouping key; counting or summing IDs is meaningless.
    for xcol in df:
        if xcol != group_var and 'SK_ID' in xcol:
            df = df.drop(columns=xcol)            
    grp_id = df[group_var]
    numeric_df = df.select_dtypes(include='number')
#     categorical_df = df.select_dtypes(include='object')
#     categorical_df[group_var] = grp_id
    # Group and aggregate; other options include ['count', 'sum', 'max', 'min', 'mean'].
    numeric_agg = numeric_df.groupby(group_var).agg(['count', 'sum']).reset_index()
#     categorical_agg = categorical_df.groupby(group_var).agg(lambda x: x.value_counts().index[0]).reset_index()
    # Flatten the (column, statistic) MultiIndex into names like '<df_name>_<column>_<statistic>'.
    numeric_agg.columns = [group_var] + [f'{df_name}_{col[0]}_{col[1]}' for col in numeric_agg.columns[1:]]
#     categorical_agg.columns = [group_var] + [f'{df_name}_{col[0]}' for col in categorical_agg.columns[1:]]
#     agg = numeric_agg.merge(categorical_agg, on=group_var, how='left')
    return numeric_agg

In [22]:
bureau_agg = agg_numeric_categorical(bureau, 'SK_ID_CURR', 'data_bureau')

bureau_agg.head()

Unnamed: 0,SK_ID_CURR,data_bureau_DAYS_CREDIT_count,data_bureau_DAYS_CREDIT_sum,data_bureau_CREDIT_DAY_OVERDUE_count,data_bureau_CREDIT_DAY_OVERDUE_sum,data_bureau_DAYS_CREDIT_ENDDATE_count,data_bureau_DAYS_CREDIT_ENDDATE_sum,data_bureau_DAYS_ENDDATE_FACT_count,data_bureau_DAYS_ENDDATE_FACT_sum,data_bureau_AMT_CREDIT_MAX_OVERDUE_count,...,data_bureau_AMT_CREDIT_SUM_LIMIT_count,data_bureau_AMT_CREDIT_SUM_LIMIT_sum,data_bureau_AMT_CREDIT_SUM_OVERDUE_count,data_bureau_AMT_CREDIT_SUM_OVERDUE_sum,data_bureau_DAYS_CREDIT_UPDATE_count,data_bureau_DAYS_CREDIT_UPDATE_sum,data_bureau_AMT_ANNUITY_count,data_bureau_AMT_ANNUITY_sum,data_bureau_NUM_ACTIVE_ACNT_count,data_bureau_NUM_ACTIVE_ACNT_sum
0,100001,7,-5145,7,0,7,577.0,4,-3302.0,0,...,6,0.0,7,0.0,7,-652,7,24817.5,7,3
1,100002,8,-6992,8,0,6,-2094.0,6,-4185.0,5,...,4,31988.565,8,0.0,8,-3999,7,0.0,8,2
2,100003,4,-5603,4,0,4,-2178.0,3,-3292.0,4,...,4,810000.0,4,0.0,4,-3264,0,0.0,4,1
3,100004,2,-1734,2,0,2,-977.0,2,-1065.0,1,...,2,0.0,2,0.0,2,-1064,0,0.0,2,0
4,100005,3,-572,3,0,3,1318.0,1,-123.0,1,...,3,0.0,3,0.0,3,-163,3,4261.5,3,2


### Previous Application Dataset
Let's take a quick look at the previous application dataset.

In [23]:
print(prv_app.shape)

print(prv_app['SK_ID_CURR'].nunique())
prv_app.head()

(1670214, 37)
338857


Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,...,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,2030495,271877,Consumer loans,1730.43,17145.0,17145.0,0.0,17145.0,SATURDAY,15,...,Connectivity,12.0,middle,POS mobile with interest,365243.0,-42.0,300.0,-42.0,-37.0,0.0
1,2802425,108129,Cash loans,25188.615,607500.0,679671.0,,607500.0,THURSDAY,11,...,XNA,36.0,low_action,Cash X-Sell: low,365243.0,-134.0,916.0,365243.0,365243.0,1.0
2,2523466,122040,Cash loans,15060.735,112500.0,136444.5,,112500.0,TUESDAY,11,...,XNA,12.0,high,Cash X-Sell: high,365243.0,-271.0,59.0,365243.0,365243.0,1.0
3,2819243,176158,Cash loans,47041.335,450000.0,470790.0,,450000.0,MONDAY,7,...,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-482.0,-152.0,-182.0,-177.0,1.0
4,1784265,202054,Cash loans,31924.395,337500.0,404055.0,,337500.0,THURSDAY,9,...,XNA,24.0,high,Cash Street: high,,,,,,


In [24]:
num_prev_apps = prv_app.groupby('SK_ID_CURR', as_index = False)['SK_ID_PREV'].count().rename(columns = {'SK_ID_PREV': 'NUM_PREV_APPS'})
num_prev_apps.head()

Unnamed: 0,SK_ID_CURR,NUM_PREV_APPS
0,100001,1
1,100002,1
2,100003,3
3,100004,1
4,100005,2


In [25]:
prv_app.dtypes.value_counts()

object     16
float64    15
int64       6
dtype: int64

In [26]:
prv_app_numeric = prv_app.select_dtypes(exclude = 'object')

prv_app_num = prv_app_numeric.copy()
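# Drop binary flag columns (exactly two distinct non-null values) so that only continuous columns are aggregated.
for i in prv_app_numeric.columns:
    if prv_app_numeric[i].nunique(dropna = True) == 2:
        prv_app_num.drop(i, inplace = True, axis = 1)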
prv_app_num.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,HOUR_APPR_PROCESS_START,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,DAYS_DECISION,SELLERPLACE_AREA,CNT_PAYMENT,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION
0,2030495,271877,1730.43,17145.0,17145.0,0.0,17145.0,15,0.0,0.182832,0.867336,-73,35,12.0,365243.0,-42.0,300.0,-42.0,-37.0
1,2802425,108129,25188.615,607500.0,679671.0,,607500.0,11,,,,-164,-1,36.0,365243.0,-134.0,916.0,365243.0,365243.0
2,2523466,122040,15060.735,112500.0,136444.5,,112500.0,11,,,,-301,-1,12.0,365243.0,-271.0,59.0,365243.0,365243.0
3,2819243,176158,47041.335,450000.0,470790.0,,450000.0,7,,,,-512,-1,12.0,365243.0,-482.0,-152.0,-182.0,-177.0
4,1784265,202054,31924.395,337500.0,404055.0,,337500.0,9,,,,-781,-1,24.0,,,,,


In [27]:
prv_app_agg = agg_numeric_categorical(prv_app_num, 'SK_ID_CURR', 'prv_app')
prv_app_agg

Unnamed: 0,SK_ID_CURR,prv_app_AMT_ANNUITY_count,prv_app_AMT_ANNUITY_sum,prv_app_AMT_APPLICATION_count,prv_app_AMT_APPLICATION_sum,prv_app_AMT_CREDIT_count,prv_app_AMT_CREDIT_sum,prv_app_AMT_DOWN_PAYMENT_count,prv_app_AMT_DOWN_PAYMENT_sum,prv_app_AMT_GOODS_PRICE_count,...,prv_app_DAYS_FIRST_DRAWING_count,prv_app_DAYS_FIRST_DRAWING_sum,prv_app_DAYS_FIRST_DUE_count,prv_app_DAYS_FIRST_DUE_sum,prv_app_DAYS_LAST_DUE_1ST_VERSION_count,prv_app_DAYS_LAST_DUE_1ST_VERSION_sum,prv_app_DAYS_LAST_DUE_count,prv_app_DAYS_LAST_DUE_sum,prv_app_DAYS_TERMINATION_count,prv_app_DAYS_TERMINATION_sum
0,100001,1,3951.000,1,24835.5,1,23787.0,1,2520.0,1,...,1,365243.0,1,-1709.0,1,-1499.0,1,-1619.0,1,-1612.0
1,100002,1,9251.775,1,179055.0,1,179055.0,1,0.0,1,...,1,365243.0,1,-565.0,1,125.0,1,-25.0,1,-17.0
2,100003,3,169661.970,3,1306309.5,3,1452573.0,2,6885.0,3,...,3,1095729.0,3,-3823.0,3,-3013.0,3,-3163.0,3,-3142.0
3,100004,1,5357.250,1,24282.0,1,20106.0,1,4860.0,1,...,1,365243.0,1,-784.0,1,-694.0,1,-724.0,1,-714.0
4,100005,1,4813.200,2,44617.5,2,40153.5,1,4464.0,1,...,1,365243.0,1,-706.0,1,-376.0,1,-466.0,1,-460.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
338852,456251,1,6605.910,1,40455.0,1,40455.0,1,0.0,1,...,1,365243.0,1,-210.0,1,0.0,1,-30.0,1,-25.0
338853,456252,1,10074.465,1,57595.5,1,56821.5,1,3456.0,1,...,1,365243.0,1,-2466.0,1,-2316.0,1,-2316.0,1,-2311.0
338854,456253,2,9540.810,2,48325.5,2,41251.5,2,8806.5,2,...,2,730486.0,2,-4678.0,2,-4438.0,2,-4438.0,2,-4425.0
338855,456254,2,21362.265,2,242635.5,2,268879.5,2,0.0,2,...,2,730486.0,2,-538.0,2,302.0,2,730486.0,2,730486.0
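
The cells that attach these per-applicant tables to the combined application data are not shown in this section. A minimal sketch of one way to do it is a chain of left joins on `SK_ID_CURR`, shown here on toy frames. Filling missing counts with 0 is an assumption (no bureau or previous-application record is read as zero prior loans or applications), and the frame names are illustrative.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins: base plays the role of the combined application table.
base = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
loans = pd.DataFrame({"SK_ID_CURR": [1, 2], "NUM_PREV_LOANS": [7, 8]})
apps_prev = pd.DataFrame({"SK_ID_CURR": [1, 3], "NUM_PREV_APPS": [1, 2]})

# Left-join each per-applicant table in turn so no application row is lost.
joined = reduce(
    lambda left, right: left.merge(right, on="SK_ID_CURR", how="left"),
    [loans, apps_prev],
    base,
)

# Assumption: an applicant absent from a table has a count of zero.
count_cols = ["NUM_PREV_LOANS", "NUM_PREV_APPS"]
joined[count_cols] = joined[count_cols].fillna(0).astype(int)

print(joined)
```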


# Modeling

In [52]:
import imblearn
np.random.seed(123)

In [53]:
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.linear_model import LogisticRegressionCV, LassoCV, RidgeCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import xgboost as xgb

In [54]:
y = data['TARGET']
x = data.drop('TARGET', axis = 1)

# Create train and test sets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 123)

In [55]:
# Create sampling datasets to train the models with.
o_sam = imblearn.over_sampling.RandomOverSampler(sampling_strategy = 0.5, random_state= 123)
u_sam = imblearn.under_sampling.RandomUnderSampler(sampling_strategy = 0.5, random_state= 123)
smote = imblearn.over_sampling.SMOTE(random_state = 123)

In [56]:
x_over , y_over = o_sam.fit_resample(x_train, y_train)
x_under, y_under = u_sam.fit_resample(x_train, y_train)
x_smote, y_smote = smote.fit_resample(x_train, y_train)
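
For the random samplers, a float `sampling_strategy` of 0.5 asks for a minority-to-majority ratio of 0.5 after resampling, while SMOTE with its default setting balances the classes 1:1. The sketch below reproduces only the ratio arithmetic by hand with NumPy on toy labels; it is not imblearn's implementation, and the sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(123)
y = np.array([0] * 100 + [1] * 10)  # toy labels: 100 majority, 10 minority
ratio = 0.5

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Oversampling: grow the minority class until minority / majority == ratio.
n_minority_target = int(ratio * len(majority_idx))
extra = rng.choice(minority_idx, size=n_minority_target - len(minority_idx), replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra])

# Undersampling: shrink the majority class until minority / majority == ratio.
n_majority_target = int(len(minority_idx) / ratio)
kept_majority = rng.choice(majority_idx, size=n_majority_target, replace=False)
under_idx = np.concatenate([kept_majority, minority_idx])

print(np.bincount(y[over_idx]), np.bincount(y[under_idx]))
```

Oversampling keeps every original row and adds duplicates, while undersampling discards most of the majority class, so the two training sets differ greatly in size even though the class ratio is the same.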

In [58]:

# SVC is a very resource-intensive model. These models were limited due to time restrictions.
SVC_mod1 = SVC(random_state = 123)
SVC_mod2 = SVC(kernel = 'linear', random_state = 123)
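
Kernel SVC training time grows at least quadratically with the number of rows, so one common way to keep it tractable is to fit on a random subsample. SVMs are also sensitive to feature scale. The sketch below shows both ideas on synthetic data; the subsample size, the `class_weight="balanced"` setting, and the scaling step are assumptions for illustration and were not part of the models above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the application features.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=123
)

# Fit on a random subsample to limit training cost.
rng = np.random.default_rng(123)
subset = rng.choice(len(X), size=500, replace=False)

# Standardize features before the SVC; reweight classes instead of resampling.
svc_scaled = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", class_weight="balanced", random_state=123),
)
svc_scaled.fit(X[subset], y[subset])
pred = svc_scaled.predict(X)

print(np.bincount(pred))
```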


In [59]:
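def modelperformance(model, xtrain, ytrain, xtest, ytest):
    # Fit on the (possibly resampled) training data and evaluate on the untouched test split.
    model.fit(xtrain, ytrain)
    pred = np.asarray(model.predict(xtest))
    # Some models return continuous scores; threshold those at 0.5 to get binary labels.
    if not np.isin(pred, [0, 1]).all():
        pred = (pred > 0.5).astype(int)
    
    print('Model')
    print(metrics.confusion_matrix(ytest, pred))
    print()
    # This AUC is computed from hard 0/1 predictions, not from scores or probabilities.
    print('AUC:', round(metrics.roc_auc_score(ytest, pred), 4))
    print(metrics.classification_report(ytest, pred, digits = 4))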
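
Because the AUC in this function is computed from hard class labels, a model that predicts a single class for every row always scores exactly 0.5, even if its underlying scores rank the applicants well. The toy numbers below are made up to show the difference. For an `SVC`, continuous scores would come from `decision_function`.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4]  # ranks both positives above both negatives

# Thresholding at 0.5 turns every prediction into class 0.
hard_labels = [int(s > 0.5) for s in scores]

auc_from_labels = roc_auc_score(y_true, hard_labels)
auc_from_scores = roc_auc_score(y_true, scores)

print(auc_from_labels, auc_from_scores)
```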

## Imbalanced Performance

In [60]:
# Run model on imb dataset.

modelperformance(SVC_mod1, x_train, y_train, x_test, y_test)
modelperformance(SVC_mod2, x_train, y_train, x_test, y_test)


Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[56475    13]
 [ 4999    16]]

AUC: 0.5015
              precision    recall  f1-score   support

         0.0     0.9187    0.9998    0.9575     56488
         1.0     0.5517    0.0032    0.0063      5015

    accuracy                         0

The first three models predicted every application as non-default. These models are no better than predicting the majority class, which is to be expected with the imbalanced dataset. Some of the more advanced models did a better job, but we can improve our results with a different sampling technique. The highest AUC value was 0.5135, using the XGBoost model.

## Oversampling Performance

In [63]:
# Training models on Oversampled data

modelperformance(SVC_mod1, x_over, y_over, x_test, y_test)
modelperformance(SVC_mod2, x_over, y_over, x_test, y_test)


Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[50194  6294]
 [ 3016  1999]]

AUC: 0.6436
              precision    recall  f1-score   support

         0.0     0.9433    0.8886    0.9151     56488
         1.0     0.2410    0.3986    0.3004      5015

    accuracy                         0

Oversampling has made an impact on model performance. The highest AUC value was 0.6477, using the XGBoost model. Changing the parameters of the XGBoost model made no significant impact on the results.

## Undersampling Performance

In [64]:
# Training models on Undersampled data
modelperformance(SVC_mod1, x_under, y_under, x_test, y_test)
modelperformance(SVC_mod2, x_under, y_under, x_test, y_test)


Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[50169  6319]
 [ 3033  1982]]

AUC: 0.6417
              precision    recall  f1-score   support

         0.0     0.9430    0.8881    0.9147     56488
         1.0     0.2388    0.3952    0.2977      5015

    accuracy                         0


## SMOTE Performance

In [65]:
# Training models on SMOTE sampled data
modelperformance(SVC_mod1, x_smote, y_smote, x_test, y_test)
modelperformance(SVC_mod2, x_smote, y_smote, x_test, y_test)


Model
[[46231 10257]
 [ 2773  2242]]

AUC: 0.6327
              precision    recall  f1-score   support

         0.0     0.9434    0.8184    0.8765     56488
         1.0     0.1794    0.4471    0.2560      5015

    accuracy                         0.7881     61503
   macro avg     0.5614    0.6327    0.5663     61503
weighted avg     0.8811    0.7881    0.8259     61503

Model
[[25093 31395]
 [ 1681  3334]]

AUC: 0.5545
              precision    recall  f1-score   support

         0.0     0.9372    0.4442    0.6027     56488
         1.0     0.0960    0.6648    0.1678      5015

    accuracy                         0.4622     61503
   macro avg     0.5166    0.5545    0.3853     61503
weighted avg     0.8686    0.4622    0.5673     61503

Model
[[39233 17255]
 [ 1635  3380]]

AUC: 0.6843
              precision    recall  f1-score   support

         0.0     0.9600    0.6945    0.8060     56488
         1.0     0.1638    0.6740    0.2635      5015

    accuracy                    