# Problem description

You are to predict whether a company will go bankrupt in the following year, based on financial attributes of the company.

Perhaps you are contemplating lending money to a company, and need to know whether the company
is in near-term danger of not being able to repay.


## Goal

## Learning objectives

- Demonstrate mastery in solving a classification problem and presenting
the entire Recipe for Machine Learning process in a notebook.
- We will make suggestions for ways to approach the problem
    - But there will be little explicit direction for this task.
- It is meant to be analogous to a pre-interview task that a potential employer might assign
to verify your skill

# Import modules

In [1]:
## Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
import imblearn

import os
import math

%matplotlib inline


# API for students

In [2]:
## Load the bankruptcy_helper module

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

# Import bankruptcy_helper module
import bankruptcy_helper
%aimport bankruptcy_helper

helper = bankruptcy_helper.Helper()

Packages I used in the project

In [3]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, recall_score, precision_score, balanced_accuracy_score
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.utils import class_weight

# Get the data

The first step in our Recipe is Get the Data.

- Each example is a row of data corresponding to a single company
- There are 64 attributes, described in the section below
- The column `Bankrupt` is 1 if the company subsequently went bankrupt; 0 if it did not go bankrupt
- The column `Id` is a Company Identifier

In [4]:
# Data directory
DATA_DIR = "./Data"

if not os.path.isdir(DATA_DIR):
    DATA_DIR = "../resource/asnlib/publicdata/bankruptcy/data"

data_file = "5th_yr.csv"
data = pd.read_csv( os.path.join(DATA_DIR, "train", data_file) )

target_attr = "Bankrupt"

n_samples, n_attrs = data.shape
print("Data shape: ", data.shape)

Data shape:  (4818, 66)


## Have a look at the data

We will not go through all steps in the Recipe, nor in depth.

But here's a peek

In [5]:
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X57,X58,X59,X60,X61,X62,X63,X64,Bankrupt,Id
0,0.025417,0.41769,0.0568,1.1605,-126.39,0.41355,0.025417,1.2395,1.165,0.51773,...,0.049094,0.85835,0.12322,5.6167,7.4042,164.31,2.2214,1.334,0,4510
1,-0.023834,0.2101,0.50839,4.2374,22.034,0.058412,-0.027621,3.6579,0.98183,0.76855,...,-0.031011,1.0185,0.069047,5.7996,7.7529,26.446,13.802,6.4782,0,3537
2,0.030515,0.44606,0.19569,1.565,35.766,0.28196,0.039264,0.88456,1.0526,0.39457,...,0.077337,0.95006,0.25266,15.049,2.8179,104.73,3.4852,2.6361,0,3920
3,0.052318,0.056366,0.54562,10.68,438.2,0.13649,0.058164,10.853,1.0279,0.61173,...,0.085524,0.97282,0.0,6.0157,7.4626,48.756,7.4863,1.0602,0,1806
4,0.000992,0.49712,0.12316,1.3036,-71.398,0.0,0.001007,1.0116,1.2921,0.50288,...,0.001974,0.99925,0.019736,3.4819,8.582,114.58,3.1854,2.742,0,1529


Pretty *unhelpful* !

What are these mysteriously named features ?

## Description of attributes

This may still be somewhat unhelpful for those of you not used to reading Financial Statements.

But that's partially the point of the exercise
- You can *still* perform Machine Learning *even if* you are not an expert in the problem domain
    - That's what makes this a good interview exercise: you can demonstrate your thought process even if you don't know the exact meaning of the terms
- Of course: becoming an expert in the domain *will improve* your ability to create better models
    - Feature engineering is easier if you understand the features, their inter-relationships, and the relationship to the target

Let's get a feel for the data
- What is the type of each attribute ?


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4818 entries, 0 to 4817
Data columns (total 66 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   X1        4818 non-null   object 
 1   X2        4818 non-null   object 
 2   X3        4818 non-null   object 
 3   X4        4818 non-null   object 
 4   X5        4818 non-null   object 
 5   X6        4818 non-null   object 
 6   X7        4818 non-null   object 
 7   X8        4818 non-null   object 
 8   X9        4818 non-null   float64
 9   X10       4818 non-null   object 
 10  X11       4818 non-null   object 
 11  X12       4818 non-null   object 
 12  X13       4818 non-null   float64
 13  X14       4818 non-null   object 
 14  X15       4818 non-null   object 
 15  X16       4818 non-null   object 
 16  X17       4818 non-null   object 
 17  X18       4818 non-null   object 
 18  X19       4818 non-null   float64
 19  X20       4818 non-null   float64
 20  X21       4818 non-null   obje

You may be puzzled:
- Most attributes are `object` and *not* numeric (`float64`)
- But looking at the data via `data.head()` certainly gives the impression that all attributes are numeric

Welcome to the world of messy data !  The dataset has represented numbers as strings.
- These little unexpected challenges are common in the real world
- Data is rarely perfect and clean

So you might want to first convert all attributes to numeric

**Hint**
- Look up the Pandas method `to_numeric`
    - We suggest you use the option `errors='coerce'`
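
As a minimal sketch of what the hint means, the toy example below converts a column of number-like strings with `pd.to_numeric`. The `"?"` placeholder is an assumption made for illustration; the actual non-numeric tokens in the dataset may differ.

```python
import pandas as pd

# Toy column: numbers stored as strings, plus one non-numeric placeholder.
# The "?" token is hypothetical; it only stands in for "something unparseable".
raw = pd.Series(["0.025", "-126.39", "?", "1.5"])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```

The coerced `NaN` values then show up as missing entries in `data.info()`, which is why an imputation step is needed afterwards.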
    

# Evaluating your project

We will evaluate your submission on a test dataset that we provide
- It has no labels, so **you** can't use it to evaluate your model, but **we** have the labels
- We will call this evaluation dataset the "holdout" data

Let's get it

In [7]:
holdout_data = pd.read_csv( os.path.join(DATA_DIR, "holdout", '5th_yr.csv') )

print("Data shape: ", holdout_data.shape)

Data shape:  (1092, 65)


We will evaluate your model on the holdout examples using metrics
- Accuracy
- Recall
- Precision

From our lecture: we may have to make a trade-off between Recall and Precision.

Our evaluation of your submission will be partially based on how you made (and described) the trade-off.

You may assume that it is 5 times worse to *fail to identify a company that will go bankrupt*
than it is to fail to identify a company that won't go bankrupt.
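
One way to make this trade-off concrete is to score predictions with an asymmetric cost. The sketch below assumes that a missed bankruptcy (false negative) costs 5 units and a false alarm (false positive) costs 1 unit; the function name `misclassification_cost` and the toy labels are hypothetical and not part of the assignment.

```python
import numpy as np

def misclassification_cost(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Total cost of a set of predictions under asymmetric error costs."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # False negative: company went bankrupt (1) but we predicted 0.
    false_negatives = int(np.sum((y_true == 1) & (y_pred == 0)))
    # False positive: company survived (0) but we predicted 1.
    false_positives = int(np.sum((y_true == 0) & (y_pred == 1)))
    return fn_cost * false_negatives + fp_cost * false_positives

y_true   = [1, 1, 0, 0, 0, 0]
cautious = [1, 1, 1, 1, 0, 0]  # no missed bankruptcies, two false alarms
lenient  = [1, 0, 0, 0, 0, 0]  # one missed bankruptcy, no false alarms

print(misclassification_cost(y_true, cautious))  # 2.0
print(misclassification_cost(y_true, lenient))   # 5.0
```

Under these assumed costs, the cautious predictions are cheaper even though they have lower precision, which is the kind of argument the evaluation asks you to make explicit.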

# Your model

Time for you to continue the Recipe for Machine Learning on your own.



First convert data type into numbers!

In [8]:
data = data.apply(pd.to_numeric, errors='coerce')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4818 entries, 0 to 4817
Data columns (total 66 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   X1        4816 non-null   float64
 1   X2        4816 non-null   float64
 2   X3        4816 non-null   float64
 3   X4        4803 non-null   float64
 4   X5        4808 non-null   float64
 5   X6        4816 non-null   float64
 6   X7        4816 non-null   float64
 7   X8        4804 non-null   float64
 8   X9        4818 non-null   float64
 9   X10       4816 non-null   float64
 10  X11       4816 non-null   float64
 11  X12       4803 non-null   float64
 12  X13       4818 non-null   float64
 13  X14       4816 non-null   float64
 14  X15       4812 non-null   float64
 15  X16       4804 non-null   float64
 16  X17       4804 non-null   float64
 17  X18       4816 non-null   float64
 18  X19       4818 non-null   float64
 19  X20       4818 non-null   float64
 20  X21       4744 non-null   floa

After converting to numeric data, we notice that many columns contain missing values, so we need to deal with them before training the model. One option is to drop the features with the most missing values, but the code below keeps all 64 features and fills each missing entry with the median of its feature, because the median is more robust to outliers than the mean.

In [9]:
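def trans_data(data):
    # Fill missing values with the column median, then standardize the column
    imp = SimpleImputer(missing_values=np.nan, strategy='median')
    pipeline = Pipeline([("SimpleImputer", imp),("Scale", StandardScaler())])
    # Work on a copy so the caller's DataFrame is not modified in place
    data_t = data.copy()
    # Only the first 64 columns (X1..X64) are features; any later columns are left untouched
    for i in data_t.columns[0:64]:
        data_t[[i]] = pipeline.fit_transform(data_t[[i]])
    return data_t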

In [44]:
data = trans_data(data)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4818 entries, 0 to 4817
Data columns (total 66 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   X1        4818 non-null   float64
 1   X2        4818 non-null   float64
 2   X3        4818 non-null   float64
 3   X4        4818 non-null   float64
 4   X5        4818 non-null   float64
 5   X6        4818 non-null   float64
 6   X7        4818 non-null   float64
 7   X8        4818 non-null   float64
 8   X9        4818 non-null   float64
 9   X10       4818 non-null   float64
 10  X11       4818 non-null   float64
 11  X12       4818 non-null   float64
 12  X13       4818 non-null   float64
 13  X14       4818 non-null   float64
 14  X15       4818 non-null   float64
 15  X16       4818 non-null   float64
 16  X17       4818 non-null   float64
 17  X18       4818 non-null   float64
 18  X19       4818 non-null   float64
 19  X20       4818 non-null   float64
 20  X21       4818 non-null   floa

Now every column holds numeric data, so we can move on to the next step.

# Let's create train and test set!

X should be the 64 columns that contain the financial attributes of each company, and y should be the label that says whether the company went bankrupt or not.

In [11]:
X = data.loc[:,'X1':'X64']
y = data['Bankrupt']
X.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X55,X56,X57,X58,X59,X60,X61,X62,X63,X64
0,0.012021,-0.096087,-0.103121,-0.038036,-0.006126,0.062188,0.010124,-0.041039,-0.308567,0.003492,...,-0.085603,0.113242,0.003624,-0.108583,-0.023624,-0.015169,-0.082377,-0.005784,-0.062392,-0.062768
1,0.004676,-0.268714,0.249124,-0.0073,0.000105,0.016518,0.002215,-0.018921,-0.444998,0.065533,...,0.339007,-0.098772,-0.00743,0.063191,-0.032173,-0.015166,-0.074409,-0.066265,0.040057,-0.053953
2,0.012782,-0.072495,0.005215,-0.033996,0.000682,0.045266,0.012189,-0.044286,-0.392286,-0.026973,...,-0.0504,-0.008166,0.007522,-0.010217,-0.003196,-0.015033,-0.187176,-0.031922,-0.051212,-0.060536
3,0.016033,-0.396555,0.278164,0.057057,0.017578,0.026558,0.015008,0.046886,-0.410683,0.026743,...,0.228298,-0.038299,0.008651,0.014195,-0.043069,-0.015163,-0.081043,-0.056477,-0.015816,-0.063237
4,0.008379,-0.030035,-0.05136,-0.036607,-0.003817,0.009006,0.006484,-0.043124,-0.213899,-0.000182,...,-0.114672,-0.100887,-0.002878,0.042544,-0.039955,-0.015199,-0.055464,-0.0276,-0.053864,-0.060355


I choose a test size of 0.1 to split the test and train data set.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
print("X_train shape: ", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

X_train shape:  (4336, 64)
X_test shape:  (482, 64)
y_train shape:  (4336,)
y_test shape:  (482,)
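
Because bankrupt companies are rare, a purely random split can leave the small test set with an unrepresentative share of positives. A possible refinement, shown here only as a sketch on synthetic data (the arrays `X_toy` and `y_toy` are made up for illustration), is to pass `stratify` to `train_test_split` so that both parts keep the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 60 positives out of 1000 examples.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 60 + [0] * 940)

# stratify=y_toy keeps the share of positives equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=42, stratify=y_toy
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```

This was not done in the split above, so the recall and precision figures in this notebook rest on however many bankrupt companies happened to land in the test set.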


# Prepare the model
Let's use logistic regression as a simple first model, with "liblinear" as the solver.

In [13]:
scaler = StandardScaler()
clf = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear", max_iter=10000))

Train the model and get the in-sample accuracy score. Use cross validation to check the performance, since we only have one test set. Then evaluate the model on the test set.

In [14]:
def get_score(model, name, X_train, X_test, y_train, y_test):
    # Fit on the training set and measure in-sample accuracy
    _ = model.fit(X_train, y_train)
    score_in_sample = accuracy_score(y_train, model.predict(X_train))
    # k-fold cross validation on the training set (each fold refits a clone of the model)
    k = 5
    cross_val_scores = cross_val_score(model, X_train, y_train, cv=k)
    # Out-of-sample metrics on the test set; class 1 (bankrupt) is the positive class
    y_pred = model.predict(X_test)
    accuracy_test = accuracy_score(y_test, y_pred)
    balance_test = balanced_accuracy_score(y_test, y_pred)
    recall_test = recall_score(y_test, y_pred, pos_label=1, average="binary")
    precision_test = precision_score(y_test,   y_pred, pos_label=1, average="binary")
    print("Model: {m:s} class balanced Acccuracy={s:3.4f}\n".format(m=name, s=balance_test))
    print("Model: {m:s} in sample score={s:3.4f}\n".format(m=name, s=score_in_sample))
    print("Model: {m:s} avg cross validation score={s:3.4f}\n".format(m=name, s=cross_val_scores.mean()))
    print("Model: {m:s} Accuracy: {a:3.2%}, Recall {r:3.2%}, Precision {p:3.2%}".format(m=name,
                                                                            a=accuracy_test,r=recall_test,p=precision_test))

We get an in-sample accuracy of 0.939, a cross-validation score of 0.933, and an out-of-sample accuracy of 0.934 with the logistic regression model. Recall (12.12%) and precision (57.14%) are poor, however, so next I will try some other models to see if the scores improve. The balanced accuracy is only 0.557, barely above the 0.5 of a trivial classifier, which indicates that the data is imbalanced.

In [15]:
get_score(clf,"Logistic Regression", X_train, X_test, y_train, y_test)

Model: Logistic Regression class balanced Acccuracy=0.5573

Model: Logistic Regression in sample score=0.9393

Model: Logistic Regression avg cross validation score=0.9334

Model: Logistic Regression Accuracy: 93.36%, Recall 12.12%, Precision 57.14%


Let's try a different regularization method: the L1 penalty instead of the default L2.

In [16]:
clf = make_pipeline(StandardScaler(), LogisticRegression(penalty="l1", solver="liblinear", max_iter=10000))

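L2 works better than L1 here: with the L1 penalty, test recall falls from 12.12% to 6.06% and precision from 57.14% to 28.57%, so the cell below switches back to the default L2 penalty afterwards.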

In [17]:
get_score(clf,"Logistic Regression", X_train, X_test, y_train, y_test)
clf = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear", max_iter=10000))

Model: Logistic Regression class balanced Acccuracy=0.5247

Model: Logistic Regression in sample score=0.9410

Model: Logistic Regression avg cross validation score=0.9347

Model: Logistic Regression Accuracy: 92.53%, Recall 6.06%, Precision 28.57%



# Support Vector Machine
Let's try an SVM to see if it gives a better prediction on the data.

In [18]:
svc = make_pipeline(StandardScaler(), SVC(kernel='linear'))

Balanced accuracy decreased to 0.499. The accuracy scores stay at the same level, but recall and precision are both 0: the linear SVM does not correctly identify a single bankrupt company in the test set. Judged by accuracy alone, the SVM is no different from logistic regression.

In [19]:
get_score(svc, "Support Vector Classification", X_train, X_test, y_train, y_test)

Model: Support Vector Classification class balanced Acccuracy=0.4989

Model: Support Vector Classification in sample score=0.9384

Model: Support Vector Classification avg cross validation score=0.9354

Model: Support Vector Classification Accuracy: 92.95%, Recall 0.00%, Precision 0.00%


In [20]:
svc = make_pipeline(StandardScaler(), SVC(C=0.1, kernel='linear'))
get_score(svc, "Support Vector Classification", X_train, X_test, y_train, y_test)

Model: Support Vector Classification class balanced Acccuracy=0.5140

Model: Support Vector Classification in sample score=0.9384

Model: Support Vector Classification avg cross validation score=0.9359

Model: Support Vector Classification Accuracy: 93.15%, Recall 3.03%, Precision 50.00%


After lowering the penalty parameter C to 0.1 (10 times smaller than the default of 1.0), precision rises to 50%, but recall is still very low at 3.03%.

# Gradient Descent
Since tuning the SVM gave only very small improvements, I will try a linear classifier trained with stochastic gradient descent (`SGDClassifier`) to see if it can increase the scores.

In [21]:
sgd = make_pipeline(StandardScaler(), SGDClassifier())

In the run shown below, balanced accuracy rises to 0.579, which is better than logistic regression, while the accuracy scores stay at about the same level. Recall rises to 18.18%, but precision falls to 35.29%, about 20 points below the logistic model. Recall and precision change each time I run the code, because `SGDClassifier` shuffles the training data randomly and no `random_state` is set here.

In [22]:
get_score(sgd, "Gradient Descent", X_train, X_test, y_train, y_test)

Model: Gradient Descent class balanced Acccuracy=0.5787

Model: Gradient Descent in sample score=0.9269

Model: Gradient Descent avg cross validation score=0.9225

Model: Gradient Descent Accuracy: 92.12%, Recall 18.18%, Precision 35.29%
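
The run-to-run variation can be removed by fixing the seed. The following is a minimal sketch on synthetic data (`X_toy`, `y_toy` and the helper `fit_predict` are hypothetical names introduced for illustration), showing that two fits with the same `random_state` give identical predictions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem, only used to demonstrate reproducibility.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(300, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 1.0).astype(int)

def fit_predict(seed):
    # random_state controls the shuffling of the training data in SGD.
    model = make_pipeline(StandardScaler(), SGDClassifier(random_state=seed))
    return model.fit(X_toy, y_toy).predict(X_toy)

same_seed_agrees = np.array_equal(fit_predict(42), fit_predict(42))
print(same_seed_agrees)  # True
```

With a fixed seed, differences between experiments can be attributed to the change being tested rather than to shuffling.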


# Gradient Boosting Classifier
Because the balanced accuracy is low, I decided to try a gradient boosting classifier, hoping that a tree ensemble is less sensitive to imbalanced data.

In [23]:
gbc = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=42))

The balanced accuracy score increased to 0.738, and all accuracy scores are the highest so far. Recall is still low at 48.48% but has increased, and precision looks good at 80.00%.

In [24]:
get_score(gbc, "Gradient Boosting Classifier", X_train, X_test, y_train, y_test)

Model: Gradient Boosting Classifier class balanced Acccuracy=0.7380

Model: Gradient Boosting Classifier in sample score=0.9818

Model: Gradient Boosting Classifier avg cross validation score=0.9571

Model: Gradient Boosting Classifier Accuracy: 95.64%, Recall 48.48%, Precision 80.00%


# RandomForestClassifier
This is another tree-based classifier that may be more robust to imbalanced data.

In [25]:
rf = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42, n_jobs=2))
get_score(rf, "Random Forest Classifier", X_train, X_test, y_train, y_test)

Model: Random Forest Classifier class balanced Acccuracy=0.5876

Model: Random Forest Classifier in sample score=1.0000

Model: Random Forest Classifier avg cross validation score=0.9442

Model: Random Forest Classifier Accuracy: 93.78%, Recall 18.18%, Precision 66.67%


# Imbalanced Data
One way of handling imbalanced data is to use class weights.

In [26]:
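def get_class_weight(y):
    # Pass the arguments by keyword: newer scikit-learn versions reject positional arguments here
    class_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
    weight = {0:class_weights[0],1:class_weights[1]}
    return (weight)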

In [27]:
class_w = get_class_weight(y_train)
print(class_w)

{0: 0.5338586555035706, 1: 7.883636363636364}


So class 0 gets a weight of 0.534 and class 1 a weight of 7.884, which means the data is highly imbalanced: non-bankrupt companies (class 0) outnumber bankrupt ones (class 1) by roughly 15 to 1.
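
For reference, the `'balanced'` option computes each weight as `n_samples / (n_classes * count_of_class)`. The sketch below checks that formula against scikit-learn on a synthetic label vector. The class counts (4061 zeros and 275 ones) are not printed anywhere in this notebook; they are inferred from the weights above and should be treated as an assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the class counts inferred from the printed weights.
y_toy = np.array([0] * 4061 + [1] * 275)

classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Manual version of the 'balanced' heuristic.
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
print(np.allclose(weights, manual))  # True
```

The rarer a class is, the larger its weight, so errors on bankrupt companies count more heavily in the loss of any model that accepts `class_weight`.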

In [28]:
model = rf.set_params(randomforestclassifier__class_weight=class_w)
get_score(model,"randomforestclassifier", X_train, X_test, y_train, y_test)

Model: randomforestclassifier class balanced Acccuracy=0.5691

Model: randomforestclassifier in sample score=1.0000

Model: randomforestclassifier avg cross validation score=0.9426

Model: randomforestclassifier Accuracy: 92.95%, Recall 15.15%, Precision 45.45%


With class weights, logistic regression improves on balanced accuracy (0.766) and recall (72.73%), at the cost of precision (21.43%) and accuracy (79.88%). I hope the same works for the other models.

In [29]:
model = clf.set_params(logisticregression__class_weight=class_w)
get_score(model,"Logistic Regression", X_train, X_test, y_train, y_test)

Model: Logistic Regression class balanced Acccuracy=0.7656

Model: Logistic Regression in sample score=0.8222

Model: Logistic Regression avg cross validation score=0.8109

Model: Logistic Regression Accuracy: 79.88%, Recall 72.73%, Precision 21.43%


For SVM, balanced accuracy increased to 0.755 and recall to 69.70%, while out-of-sample accuracy decreased by about 13 percentage points, to 80.50%.

In [30]:
model = svc.set_params(svc__class_weight=class_w)
get_score(model,"Support Vector Classification", X_train, X_test, y_train, y_test)

Model: Support Vector Classification class balanced Acccuracy=0.7549

Model: Support Vector Classification in sample score=0.8406

Model: Support Vector Classification avg cross validation score=0.8367

Model: Support Vector Classification Accuracy: 80.50%, Recall 69.70%, Precision 21.50%


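For gradient descent, accuracy and precision decreased after adding class weights (to 70.95% and 15.48%), while balanced accuracy and recall increased (to 0.718 and 72.73%). These numbers vary between runs, since no `random_state` is set.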

In [31]:
model = sgd.set_params(sgdclassifier__class_weight=class_w)
get_score(model, "Gradient Descent", X_train, X_test, y_train, y_test)

Model: Gradient Descent class balanced Acccuracy=0.7178

Model: Gradient Descent in sample score=0.7399

Model: Gradient Descent avg cross validation score=0.7179

Model: Gradient Descent Accuracy: 70.95%, Recall 72.73%, Precision 15.48%


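Since Gradient Boosting Classifier does not have a `class_weight` parameter, I am going to skip it.
The random forest, rerun below with class weights, performs worse than before (balanced accuracy 0.569 versus 0.588 without weights). I guess that's because the model already handles imbalanced data to some extent, so the class weights do not help it.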

In [32]:
model = rf.set_params(randomforestclassifier__class_weight=class_w)
get_score(model,"randomforestclassifier", X_train, X_test, y_train, y_test)

Model: randomforestclassifier class balanced Acccuracy=0.5691

Model: randomforestclassifier in sample score=1.0000

Model: randomforestclassifier avg cross validation score=0.9426

Model: randomforestclassifier Accuracy: 92.95%, Recall 15.15%, Precision 45.45%
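
Although `GradientBoostingClassifier` has no `class_weight` parameter, its `fit` method accepts per-example weights through `sample_weight`. The sketch below, on synthetic data (`X_toy` and `y_toy` are made up for illustration), shows one possible way to get a comparable effect with `compute_sample_weight`; it was not part of the experiments above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data: roughly 10% positives.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(500, 4))
y_toy = (X_toy[:, 0] > 1.3).astype(int)

# One weight per example, using the same 'balanced' heuristic as class weights.
sample_w = compute_sample_weight(class_weight="balanced", y=y_toy)

gbc_weighted = GradientBoostingClassifier(random_state=42)
gbc_weighted.fit(X_toy, y_toy, sample_weight=sample_w)

print(gbc_weighted.predict(X_toy[:5]))
```

When the classifier sits inside a pipeline built with `make_pipeline`, the weights would be passed as `pipeline.fit(X, y, gradientboostingclassifier__sample_weight=sample_w)`.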


# Feature Importance
Since the data has 64 different features, some of them may not be relevant to bankruptcy. By deleting those features I hope to increase the accuracy score of the model.

In [33]:
result = permutation_importance(gbc, X_train, y_train, n_repeats=30, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
         print(f"{data.columns[i]:<8}"f"{result.importances_mean[i]:.3f}"f" +/- {result.importances_std[i]:.3f}")

X27     0.026 +/- 0.002
X46     0.015 +/- 0.001
X35     0.012 +/- 0.001
X21     0.007 +/- 0.001
X41     0.007 +/- 0.001
X56     0.006 +/- 0.001
X34     0.006 +/- 0.001
X24     0.006 +/- 0.001
X39     0.006 +/- 0.001
X5      0.005 +/- 0.001
X13     0.004 +/- 0.001
X6      0.003 +/- 0.000
X38     0.002 +/- 0.001
X20     0.001 +/- 0.000
X25     0.001 +/- 0.000
X15     0.001 +/- 0.000
X28     0.001 +/- 0.000
X40     0.001 +/- 0.000
X42     0.001 +/- 0.000
X9      0.001 +/- 0.000
X44     0.001 +/- 0.000
X62     0.000 +/- 0.000
X32     0.000 +/- 0.000
X53     0.000 +/- 0.000
X7      0.000 +/- 0.000
X22     0.000 +/- 0.000
X45     0.000 +/- 0.000
X12     0.000 +/- 0.000
X33     0.000 +/- 0.000


Some importance values are zero or negative, and I think removing those features can improve model performance.

In [34]:
def remove_feature(result, X_train, X_test):
    neg_imp = []
    for i in range(X_train.shape[1]):
        if result.importances_mean[i] <= 0:
            neg_imp.append(i)
        
    neg_imp = sorted(neg_imp, reverse=True)
    for i in neg_imp:
        X_train = X_train.drop(X_train.columns[i], axis=1)
        X_test = X_test.drop(X_test.columns[i], axis=1)
    return X_train, X_test

In [35]:
print("X_train shape: ", X_train.shape)
print("X_test shape: ", X_test.shape)
X_train, X_test = remove_feature(result, X_train, X_test)
print("X_train shape: ", X_train.shape)
print("X_test shape: ", X_test.shape)

X_train shape:  (4336, 64)
X_test shape:  (482, 64)
X_train shape:  (4336, 50)
X_test shape:  (482, 50)


After removing features with zero or negative importance, we are left with 50 of the original 64 features.

In [36]:
gbc = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=42))
get_score(gbc, "Gradient Boosting Classifier", X_train, X_test, y_train, y_test)

Model: Gradient Boosting Classifier class balanced Acccuracy=0.7228

Model: Gradient Boosting Classifier in sample score=0.9811

Model: Gradient Boosting Classifier avg cross validation score=0.9573

Model: Gradient Boosting Classifier Accuracy: 95.44%, Recall 45.45%, Precision 78.95%


I refit my best model, the gradient boosting classifier, on the reduced data. In the run shown above, the scores stay at about the same level as before: precision is 78.95% (80.00% with all features), recall is 45.45% (48.48%), and balanced accuracy is 0.723 (0.738). Removing features therefore did not improve the model in this run, although it does make the model simpler.

In [37]:
model = svc.set_params(svc__class_weight=class_w)
get_score(model,"Support Vector Classification", X_train, X_test, y_train, y_test)

Model: Support Vector Classification class balanced Acccuracy=0.7516

Model: Support Vector Classification in sample score=0.8367

Model: Support Vector Classification avg cross validation score=0.8353

Model: Support Vector Classification Accuracy: 79.88%, Recall 69.70%, Precision 20.91%


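After removing features, the class-weighted SVM is also essentially unchanged, with about 80% accuracy (79.88% versus 80.50% before), 69.70% recall, and 20.91% precision.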

## Submission guidelines

Although your notebook may contain many models (e.g., due to your iterative development)
we will only evaluate a single model.
So choose one (explain why !) and do the following.

- You will implement the body of a subroutine `MyModel`
    - That takes as argument a Pandas DataFrame 
        - Each row is an example on which to predict
        - The features of the example are elements of the row
    - Performs predictions on each example
    - Returns an array of predictions with a one-to-one correspondence with the examples in the test set
    

We will evaluate your model against the holdout data
- By reading the holdout examples `X_hold` (as above)
- Calling `y_hold_pred = MyModel(X_hold)` to get the predictions
- Comparing the predicted values `y_hold_pred` against the true labels `y_hold` which are known only to the instructors

See the following cell as an illustration

**Remember**

The holdout data is in the same format as the one we used for training
- Except that it has no attribute for the target
- So you will need to perform all the transformations on the holdout data
    - As you did on the training data
    - Including turning the string representation of numbers into actual numeric data types

All of this work *must* be performed within the body of the `MyModel` routine you will write

We will grade you by comparing the predictions array you create to the answers known to us.

In [71]:
import pandas as pd
import os

def MyModel(X):    
    X = X.loc[:, 'X1':'X64']
    X = X.apply(pd.to_numeric, errors='coerce')
    X = trans_data(X)
    
    svc = make_pipeline(StandardScaler(), SVC(C=0.1, kernel='linear'))
    class_w = get_class_weight(y_train)
    model = svc.set_params(svc__class_weight=class_w)

    _ = model.fit(X_train, y_train)
    predictions = model.predict(X)
    return predictions



# Prepare data for the final run
Use the full 5th-year data as training data, and apply the same transformations to it.

In [79]:
data = pd.read_csv( os.path.join(DATA_DIR, "train", data_file) )

X_train = data.loc[:,'X1':'X64']
y_train = data.loc[:,'Bankrupt']
X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_train = trans_data(X_train)

# Reason to choose SVM as final model
Finally, I decided to use SVC with class weights as my model because it has one of the highest recall scores (about 70% on my test set) while accuracy stays around 80%. The precision of the model is only about 21%, but recall is more important here. If recall is low, I fail to flag companies that will go bankrupt (false negatives), which the problem statement treats as 5 times worse than flagging a healthy company by mistake (a false positive). For that reason I did not choose a model such as the gradient boosting classifier, which gives higher precision and accuracy but lower recall.

# Check your work: predict and evaluate metrics on *your* test examples

Although only the instructors have the correct labels for the holdout dataset, you may want
to create your own test dataset on which to evaluate your out of sample metrics.

If you choose to do so, you can evaluate your models using the same metrics that the instructors will use.

- Test whether your implementation of `MyModel` works
- See the metrics  your model produces

In [25]:
X_hold = pd.read_csv( os.path.join(DATA_DIR, "holdout", '5th_yr.csv') )
name = "Support Vector Classifier with class weight"

# The holdout file has no labels, so we can only produce predictions for it
y_hold_pred = MyModel(X_hold)
print("Holdout predictions shape: ", y_hold_pred.shape)

# Metrics need labeled examples, so reuse the labeled training file.
# Caution: MyModel is fit on these same rows, so these are in-sample metrics
# and will look better than true out-of-sample performance.
labeled = pd.read_csv( os.path.join(DATA_DIR, "train", data_file) )
y_test = labeled["Bankrupt"]
y_test_pred = MyModel(labeled)

accuracy_test = accuracy_score(y_test, y_test_pred)
recall_test = recall_score(y_test, y_test_pred, pos_label=1, average="binary")
precision_test = precision_score(y_test,   y_test_pred, pos_label=1, average="binary")

print("\t{m:s} Accuracy: {a:3.1%}, Recall {r:3.1%}, Precision {p:3.1%}".format(m=name,
                                                                            a=accuracy_test,
                                                                            r=recall_test,
                                                                            p=precision_test
                                                                            )
         )