In [1]:
import numpy as np
import pandas as pd
import FirthRegression as flr
from sklearn.metrics import confusion_matrix

# Training the Model

In [2]:
train = pd.read_csv("modified_train_set.csv")
X_train = train.drop(columns="Y_train")
Y_train = train["Y_train"]

In [3]:
(intercept, beta, bse, fitll, summary) = flr.fit_firth(Y_train, X_train)

   Variables       Coeff  Wald_p        bse
0      const   -8.104687  0.0000   0.657659
1         X5    6.189979  0.0000   0.875002
2         X7   27.399041  0.0000   3.903756
3         X8   17.091045  0.0000   3.805615
4        X12   -1.544147  0.0005   0.446138
5        X13   -8.540442  0.0258   3.830482
6        X14   30.573958  0.0042  10.688401
7        X16    8.150696  0.0000   1.306940
8        X17   14.695503  0.0000   2.817914
9        X19    0.619672  0.0001   0.156021
10       X24   20.681735  0.0001   5.165991
11       X25  -13.991555  0.0000   1.523230
12       X27  -29.586910  0.0000   6.222494
13       X28   19.367041  0.0015   6.088874
14       X35  -21.507714  0.0258   9.650368
15       X36   12.481175  0.0326   5.839085
16       X37  -11.675334  0.0006   3.403846
17       X38  239.491342  0.0032  81.334682
18       X42  -33.687428  0.0000   7.355563
19       X44  -22.353485  0.0351  10.605757
20       X45   -6.981967  0.0000   1.298074
21       X46  -19.217139  0.0000

In [4]:
# Storing all the coefficients as a list
coeffs = [intercept] + beta

# Testing Model Performance
To make predictions on the validation set, we first define a function that applies the logistic (sigmoid) function to the linear predictor and classifies an observation as 1 when the predicted probability exceeds a threshold (0.5 by default).

In [5]:
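def logit_predictions(weight, df, threshold=0.5):
    # Rows of X are features, columns are observations
    X = df.to_numpy().T
    m, n = X.shape
    # Coefficients as a (1, m) row vector
    wt_1 = np.reshape(np.array(weight), (1, m))
    # Linear predictor, then the logistic (sigmoid) function
    y_pred = wt_1 @ X
    y_1 = 1. / (1. + np.exp(-y_pred))
    # Classify as 1 when the probability exceeds the threshold
    y_hat = np.where(y_1 > threshold, 1., 0.)
    return y_hat[0]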
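
As a quick sanity check of the thresholding rule, the sketch below applies the same computation to a small made-up design matrix. The helper name `sigmoid_predict`, the weights, and the rows are hypothetical values chosen for illustration; they are not taken from the fitted model.

```python
import numpy as np
import pandas as pd

def sigmoid_predict(weights, df, threshold=0.5):
    # Linear predictor for each row, then the logistic (sigmoid) function
    z = df.to_numpy() @ np.asarray(weights, dtype=float)
    prob = 1.0 / (1.0 + np.exp(-z))
    # Label 1 only when the probability strictly exceeds the threshold
    return prob, (prob > threshold).astype(float)

# Hypothetical toy data: an intercept column plus one feature
toy = pd.DataFrame({"const": [1.0, 1.0, 1.0], "x": [-2.0, 0.0, 3.0]})
toy_weights = [0.0, 1.0]  # illustrative values only

prob, label = sigmoid_predict(toy_weights, toy)
print(prob)
print(label)
```

Note that a probability of exactly 0.5 is classified as 0, because the comparison is strict.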

We now test the performance of this model on the Validation set.

**NOTE:** The Yeo-Johnson transformation has already been applied to the validation set, a column of 1s has been added for the intercept, and the Unnamed column has been removed.

In [6]:
val_set = pd.read_csv("Validation_set.csv")
X_val = val_set[X_train.columns.tolist()] # Selecting only the required features
y_val = val_set["y_actual"]

In [7]:
y_pred = logit_predictions(coeffs,X_val)

# Model Performance Metrics

We make use of various metrics to determine the performance of the model on the validation set.

We consider the following:
* TP (True Positive): Actual value = 1, predicted value = 1
* TN (True Negative): Actual value = 0, predicted value = 0
* FP (False Positive): Actual value = 0, predicted value = 1
* FN (False Negative): Actual value = 1, predicted value = 0

We use the following metrics:
1. Accuracy: The proportion of correct predictions among all predictions.
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
2. Sensitivity: The ability of the model to correctly identify a positive. Also known as Recall.
$$ \text{Sensitivity} = \frac{TP}{TP + FN} $$
3. Specificity: The ability of the model to correctly identify a negative.
$$ \text{Specificity} = \frac{TN}{TN + FP} $$
4. Precision: The proportion of true positives among all predicted positives.
$$ \text{Precision} = \frac{TP}{TP + FP} $$
5. F1-Score: The harmonic mean of Precision and Recall (Sensitivity). The F1-score is high only when both Precision and Recall are high.
$$ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$
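
These formulas can be checked by hand on a small example. The sketch below uses made-up labels (not the validation data) and compares the manual formulas with scikit-learn's built-in scorers. Recall that `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

manual = {
    "accuracy": (tp + tn) / (tp + tn + fp + fn),
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "precision": tp / (tp + fp),
}
manual["f1"] = (2 * manual["precision"] * manual["sensitivity"]
                / (manual["precision"] + manual["sensitivity"]))

# Cross-check against scikit-learn (it has no dedicated specificity scorer;
# specificity is the recall of the negative class)
print(np.isclose(manual["accuracy"], accuracy_score(y_true, y_hat)))
print(np.isclose(manual["sensitivity"], recall_score(y_true, y_hat)))
print(np.isclose(manual["specificity"], recall_score(y_true, y_hat, pos_label=0)))
print(np.isclose(manual["precision"], precision_score(y_true, y_hat)))
print(np.isclose(manual["f1"], f1_score(y_true, y_hat)))
```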


In [8]:
def prediction_results(Y_test, y_pred):
    # Rows of cm are the actual classes, columns are the predicted classes
    cm = confusion_matrix(Y_test, y_pred)
    print('Confusion Matrix : \n')
    print(pd.crosstab(Y_test, y_pred, rownames = ['Actual'], colnames =['Predicted'], margins = True),'\n')
    total = cm.sum()
    TN = cm[0,0]
    FP = cm[0,1]  # actual 0, predicted 1
    FN = cm[1,0]  # actual 1, predicted 0
    TP = cm[1,1]
    # Calculate the metrics from the confusion matrix
    accuracy = (TP + TN)/total
    print('Accuracy : ', accuracy)
    sensitivity = TP/(TP + FN)
    print('Sensitivity : ', sensitivity)
    specificity = TN/(TN + FP)
    print('Specificity : ', specificity)
    precision = TP/(TP + FP)
    print('Precision : ', precision)
    f1_score = 2*precision*sensitivity/(precision + sensitivity)
    print('F1 - Score : ', f1_score)

In [9]:
prediction_results(y_val, y_pred)

Confusion Matrix : 

Predicted  0.0  1.0  All
Actual                  
0          457   16  473
1           28  281  309
All        485  297  782 

Accuracy :  0.9437340153452686
Sensitivity :  0.9093851132686084
Specificity :  0.9661733615221987
Precision :  0.9461279461279462
F1 - Score :  0.9273927392739274


The model performs well overall on the validation set. In general, there is a trade-off between a model's ability to detect positives (recall) and its ability to avoid false alarms (precision and specificity). It is therefore important to understand the use case before choosing an algorithm: the business requirements determine whether precision matters more than recall, or the other way round. Other models can also be fitted and compared on these same metrics.
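
To make the trade-off concrete, the sketch below sweeps the decision threshold over a set of made-up predicted probabilities and reports precision and recall at each value. The scores, labels, and the helper `precision_recall_at` are hypothetical; with the real model one would use the sigmoid outputs before thresholding.

```python
import numpy as np

# Hypothetical probabilities and true labels, for illustration only
probs = np.array([0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.05])
truth = np.array([1, 1, 0, 1, 1, 0, 0, 0])

def precision_recall_at(threshold):
    pred = probs > threshold
    tp = np.sum(pred & (truth == 1))
    fp = np.sum(pred & (truth == 0))
    fn = np.sum(~pred & (truth == 1))
    # Precision is undefined when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn)
    return precision, recall

for t in (0.25, 0.5, 0.75):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, raising the threshold increases precision and lowers recall, which is the pattern to weigh against the business requirements.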

# Making Predictions on the Test set

In [10]:
test = pd.read_csv("test_set.csv")

Before making predictions, we need to apply the Yeo-Johnson transformation to all the variables, add the intercept column, and select the required features.

In [11]:
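# Applying the Yeo-Johnson (power) transformation to each column
from scipy.stats import yeojohnson
df_1 = pd.DataFrame()
for col in test.columns:
    # yeojohnson returns (transformed data, fitted lambda); keep the data
    df_1[col] = yeojohnson(test[col])[0]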
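
One caveat: calling `yeojohnson` without `lmbda` estimates a new lambda from whatever data it is given, so the loop above fits the transformation to the test set itself. If the lambda values estimated on the training data are available, the usual practice is to reuse them. The sketch below shows the idea on synthetic data; `train_raw`, `test_raw`, and `lambdas` are hypothetical names, and this notebook does not show how (or whether) the training lambdas were stored.

```python
import numpy as np
import pandas as pd
from scipy.stats import yeojohnson

rng = np.random.default_rng(0)
# Synthetic stand-ins for the raw (untransformed) training and test features
train_raw = pd.DataFrame({"a": rng.exponential(2.0, 200),
                          "b": rng.normal(0.0, 3.0, 200)})
test_raw = pd.DataFrame({"a": rng.exponential(2.0, 50),
                         "b": rng.normal(0.0, 3.0, 50)})

# Estimate one lambda per column on the training data only
lambdas = {col: yeojohnson(train_raw[col].to_numpy())[1]
           for col in train_raw.columns}

# Apply the stored training lambdas to the test data
test_transformed = pd.DataFrame(
    {col: yeojohnson(test_raw[col].to_numpy(), lmbda=lambdas[col])
     for col in test_raw.columns}
)
print(lambdas)
print(test_transformed.head())
```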

In [12]:
# Adding an intercept column, selecting the required features, and making predictions
import statsmodels.api as sm
X = sm.add_constant(df_1)
y_hat = logit_predictions(coeffs, X[X_train.columns.tolist()])


In [13]:
# Storing the predictions 
test['Y_preds'] = y_hat
test.to_csv("Predictions.csv",index = False)
