## Agenda
   
    ♦ Problem Description
    ♦ Data Understanding and exploration
    ♦ Split the data into Train and Validation sets
    ♦ Model Building
       - Logistic Regression
       - ROC curve to fix the threshold values
    ♦ Construct a confusion matrix
    ♦ Evaluation of the error metrics
    ♦ How do we implement Regularization techniques
    ♦ Build model using Naive Bayes classifier
    ♦ Compute Evaluation metrics
    ♦ Principal Component Analysis
    

## Problem Description

A regional bank, XYZ, with 40,000+ customers would like to expand its business by predicting customer behavior so that it can cross-sell products more effectively (e.g., selling term deposits to retail customers). The bank has approached us to assess this and has provided access to its customer campaign data.

The data relates to direct marketing campaigns that were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

Goal: predict whether an existing customer will subscribe to a term deposit.

#### Attribute information:



Input variables:

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                   "blue-collar","self-employed","retired","technician","services") 

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric) 

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no") 

##### Related to the last contact of the current campaign:

9 - contact: contact communication type (categorical: "unknown","telephone","cellular") 

10 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

11 - duration: last contact duration, in seconds (numeric)

##### Other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

##### Output variable (desired target):

16 - y: has the client subscribed to a term deposit? (binary: "yes","no")

 

In [19]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import accuracy_score,recall_score,precision_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import MultinomialNB 
from sklearn.naive_bayes import GaussianNB 
%matplotlib inline

### Loading the data

In [20]:
df=pd.read_csv("./Data/Bank_Data.csv")

### Understanding the data

In [21]:
df.shape

(4521, 16)

In [22]:
df.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In [23]:
df.head()

,age,job,marital,education,default,balance,housing,loan,contact,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,may,226,1,-1,0,unknown,no


In [24]:
df.tail()

,age,job,marital,education,default,balance,housing,loan,contact,month,duration,campaign,pdays,previous,poutcome,y
4516,33,services,married,secondary,no,-333,yes,no,cellular,jul,329,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,may,153,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,aug,151,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,feb,129,4,211,3,other,no
4520,44,entrepreneur,single,tertiary,no,1136,yes,yes,cellular,apr,345,2,249,7,other,no


### Summary statistics

In [25]:
df.describe()

,age,balance,duration,campaign,pdays,previous
count,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0
mean,41.170095,1422.657819,263.961292,2.79363,39.766645,0.542579
std,10.576211,3009.638142,259.856633,3.109807,100.121124,1.693562
min,19.0,-3313.0,4.0,1.0,-1.0,0.0
25%,33.0,69.0,104.0,1.0,-1.0,0.0
50%,39.0,444.0,185.0,2.0,-1.0,0.0
75%,49.0,1480.0,329.0,3.0,-1.0,0.0
max,87.0,71188.0,3025.0,50.0,871.0,25.0


In [26]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'month', 'duration', 'campaign', 'pdays', 'previous',
       'poutcome', 'y'],
      dtype='object')

In [29]:
df.y.value_counts(normalize=True) # frequency for each level within the target variable

no     0.88476
yes    0.11524
Name: y, dtype: float64

In [27]:
df.y.value_counts() # frequency for each level

no     4000
yes     521
Name: y, dtype: int64

In [31]:
' jagan    '.strip()

'jagan'

### Recode the levels of the target: yes=1 and no=0


In [10]:
df['y'] = df['y'].apply(lambda x: 0 if x.strip()=='no' else 1)


In [32]:
cat_attr

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome'],
      dtype='object')

In [11]:
cat_attr=df.select_dtypes(include ='object').columns 

df[cat_attr]= df[cat_attr].astype('category')

In [12]:
data = pd.get_dummies(columns=cat_attr, data = df, prefix=cat_attr, prefix_sep="_", drop_first=True)
data.head()

,age,balance,duration,campaign,pdays,previous,y,job_blue-collar,job_entrepreneur,job_housemaid,...,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,79,1,-1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
1,33,4789,220,1,339,4,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,35,1350,185,1,330,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,30,1476,199,4,-1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
4,59,0,226,1,-1,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1


In [33]:
data.columns

Index(['age', 'balance', 'duration', 'campaign', 'pdays', 'previous', 'y',
       'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_married', 'marital_single', 'education_secondary',
       'education_tertiary', 'education_unknown', 'default_yes', 'housing_yes',
       'loan_yes', 'contact_telephone', 'contact_unknown', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'poutcome_other', 'poutcome_success', 'poutcome_unknown'],
      dtype='object')

### Splitting the data into train and Validation sets

In [36]:
X = data.loc[:,data.columns.difference(['y'])] # taking all the independent columns
y = data.y # separating the target variable

In [37]:
X.head()

,age,balance,campaign,contact_telephone,contact_unknown,default_yes,duration,education_secondary,education_tertiary,education_unknown,...,month_mar,month_may,month_nov,month_oct,month_sep,pdays,poutcome_other,poutcome_success,poutcome_unknown,previous
0,30,1787,1,0,0,0,79,0,0,0,...,0,0,0,1,0,-1,0,0,1,0
1,33,4789,1,0,0,0,220,1,0,0,...,0,1,0,0,0,339,0,0,0,4
2,35,1350,1,0,0,0,185,0,1,0,...,0,0,0,0,0,330,0,0,0,1
3,30,1476,4,0,1,0,199,0,1,0,...,0,0,0,0,0,-1,0,0,1,0
4,59,0,1,0,1,0,226,1,0,0,...,0,1,0,0,0,-1,0,0,1,0


In [38]:
y[:6]

0    0
1    0
2    0
3    0
4    0
5    0
Name: y, dtype: int64

In [40]:
data['y'].value_counts(normalize=True)

0    0.88476
1    0.11524
Name: y, dtype: float64

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify= y, test_size = 0.3, random_state=2323)

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=124)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3164, 41)
(1357, 41)
(3164,)
(1357,)


In [56]:
print('train - Stratification',y_train.value_counts())
print('test - Stratification',y_test.value_counts())

train - Stratification 0    2799
1     365
Name: y, dtype: int64
test - Stratification 0    1201
1     156
Name: y, dtype: int64


In [57]:
#X_train, X_test, y_train, y_test = train_test_split(X, y, stratify= y, test_size = 0.3, random_state=124)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=2323)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3164, 41)
(1357, 41)
(3164,)
(1357,)


In [58]:
print('train - rand',y_train.value_counts())
print('test - rand',y_test.value_counts())

train - rand 0    2782
1     382
Name: y, dtype: int64
test - rand 0    1218
1     139
Name: y, dtype: int64


### Scaling the numeric attributes in the train and test data (min-max)

In [59]:
num_cols = ['age','balance','duration','pdays','previous','campaign']
scaler = MinMaxScaler()
# Fit the scaler on the training data only, then reuse it on the test data
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
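
The scaler is fitted on the training rows and only applied to the test rows, so the test set does not influence the minimum and maximum used for scaling. A side effect, sketched below with made-up numbers, is that scaled test values can fall outside the [0, 1] range when a test value lies beyond the training range. The tiny frames `train_demo` and `test_demo` are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data, not taken from the bank dataset.
train_demo = pd.DataFrame({"age": [20, 40, 60], "balance": [0, 500, 1000]})
test_demo = pd.DataFrame({"age": [30, 80], "balance": [-250, 750]})

demo_scaler = MinMaxScaler()

# Learn min/max from the training frame only.
train_scaled = pd.DataFrame(
    demo_scaler.fit_transform(train_demo), columns=train_demo.columns
)
# Reuse the training min/max on the test frame.
test_scaled = pd.DataFrame(
    demo_scaler.transform(test_demo), columns=test_demo.columns
)

print(train_scaled)
print(test_scaled)  # age 80 and balance -250 map outside [0, 1]
```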

In [60]:
import statsmodels.api as sm


In [62]:
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

### Model Building

In [68]:
X_train.columns

Index(['const', 'age', 'balance', 'campaign', 'contact_telephone',
       'contact_unknown', 'default_yes', 'duration', 'education_secondary',
       'education_tertiary', 'education_unknown', 'housing_yes',
       'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'loan_yes', 'marital_married', 'marital_single', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'pdays', 'poutcome_other', 'poutcome_success', 'poutcome_unknown',
       'previous'],
      dtype='object')

In [84]:
logit_model=sm.Logit(y_train,X_train)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.238201
         Iterations 8
                          Results: Logit
Model:               Logit            Pseudo R-squared: 0.353      
Dependent Variable:  y                AIC:              1591.3336  
Date:                2022-03-31 19:52 BIC:              1845.8365  
No. Observations:    3164             Log-Likelihood:   -753.67    
Df Model:            41               LL-Null:          -1165.6    
Df Residuals:        3122             LLR p-value:      2.4625e-146
Converged:           1.0000           Scale:            1.0000     
No. Iterations:      8.0000                                        
-------------------------------------------------------------------
                     Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-------------------------------------------------------------------
const               -2.5251   0.6442 -3.9195 0.0001 -3.7877 -1.2624
age                 -0.2827   0.5815 -0.4

In [81]:
X_train[['age','const']][:5]

,age,const
4209,0.279412,1.0
3748,0.235294,1.0
1516,0.147059,1.0
1364,0.220588,1.0
1735,0.235294,1.0


In [82]:
train_preds[:5]

4209    0.114379
3748    0.109471
1516    0.100207
1364    0.107877
1735    0.109471
dtype: float64

In [83]:
train_preds = result.predict(X_train[['age','const']])
#train_preds_prob=result.predict_proba(X_train[['age','const']])[:,1]
#test_preds = result.predict(X_test[['age','const']])
#test_preds_prob=result.predict_proba(X_test[['age','const']])[:,1]

### Logistic Regression

In [85]:
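# Unregularized fit. penalty='none' is accepted by older scikit-learn releases; newer ones expect penalty=None.
logistic_model = LogisticRegression(penalty='none')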

logistic_model.fit(X_train,y_train)

LogisticRegression(penalty='none')

### Generating predictions

In [86]:
train_preds = logistic_model.predict(X_train)
train_preds_prob=logistic_model.predict_proba(X_train)[:,1]
test_preds = logistic_model.predict(X_test)
test_preds_prob=logistic_model.predict_proba(X_test)[:,1]

In [87]:
train_preds[:5]

array([1, 0, 0, 0, 0])

In [88]:
train_preds_prob[:5]

array([0.85664702, 0.08079027, 0.01945116, 0.02104693, 0.02291449])

In [68]:
logistic_model.coef_

array([[-0.98477239, -0.04692803, -0.98378516, -2.24583127, -0.01390572,
        -1.36255178,  0.12843002, 12.7384954 , -0.12102335,  0.15318758,
        -0.46624448, -0.2917254 , -0.65034421, -0.05779099, -0.41895847,
        -0.34290137,  0.36646608, -0.57082786, -0.48286498,  0.25167875,
        -0.33033404, -0.799167  ,  0.39104086, -0.91648376, -0.56062518,
        -0.27926841, -0.16053139,  1.00712411,  0.17178137, -0.57051554,
        -0.45640724,  0.67984217,  1.60504932, -0.37830692, -0.83617182,
         1.36833687,  1.01843484, -0.06270185,  0.22911892,  2.31712389,
        -0.2874725 , -0.05620251]])

### Confusion Matrix

In [69]:
confusion_matrix(y_train,train_preds)

array([[2736,   63],
       [ 236,  129]])
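
As a quick check, accuracy, recall and precision can be read directly off this matrix. scikit-learn lays it out as `[[TN, FP], [FN, TP]]` for labels 0 and 1, so the sketch below recomputes the three metrics by hand from the counts printed above; the variable names are chosen for this illustration only.

```python
import numpy as np

# Training confusion matrix copied from the output above: rows = actual, columns = predicted.
cm = np.array([[2736, 63],
               [236, 129]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
recall = tp / (tp + fn)           # share of actual subscribers that were found
precision = tp / (tp + fp)        # share of predicted subscribers that are real

print(accuracy, recall, precision)
```

These values agree with what `accuracy_score`, `recall_score` and `precision_score` report for the training set, and they show why accuracy alone is misleading here: about 90% accuracy coexists with a recall of roughly 35% on the minority class.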

In [70]:
train_accuracy_1= accuracy_score(y_train,train_preds)
train_recall_1= recall_score(y_train,train_preds)
train_precision_1= precision_score(y_train,train_preds)

test_accuracy_1= accuracy_score(y_test,test_preds)
test_recall_1= recall_score(y_test,test_preds)
test_precision_1= precision_score(y_test,test_preds)

In [71]:
print(train_accuracy_1)
print(train_recall_1)
print(train_precision_1)

print(test_accuracy_1)
print(test_recall_1)
print(test_precision_1)

0.9054993678887484
0.35342465753424657
0.671875
0.9005158437730287
0.33974358974358976
0.6235294117647059


In [72]:
#Classification report
print(classification_report(y_train,train_preds))

              precision    recall  f1-score   support

           0       0.92      0.98      0.95      2799
           1       0.67      0.35      0.46       365

    accuracy                           0.91      3164
   macro avg       0.80      0.67      0.71      3164
weighted avg       0.89      0.91      0.89      3164



In [73]:
print(classification_report(y_test,test_preds))

              precision    recall  f1-score   support

           0       0.92      0.97      0.95      1201
           1       0.62      0.34      0.44       156

    accuracy                           0.90      1357
   macro avg       0.77      0.66      0.69      1357
weighted avg       0.89      0.90      0.89      1357



### ROC and AUC

In [74]:
fpr, tpr, threshold = roc_curve(y_train, train_preds_prob)
roc_auc = auc(fpr, tpr)

In [75]:
%matplotlib notebook
# plt.figure()
plt.plot([0,1],[0,1],color='navy', lw=2, linestyle='--')
plt.plot(fpr,tpr,color='orange', lw=3, label='ROC curve (area = %0.2f)' % roc_auc)

plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend(loc="lower right")

[Figure: ROC curve on the training data (TPR vs. FPR) with the diagonal chance line; the legend shows the AUC.]

### Manual inspection of threshold value

In [77]:
roc_df = pd.DataFrame({'FPR':fpr, 'TPR':tpr, 'Threshold':threshold})

roc_df

,FPR,TPR,Threshold
0,0.000000,0.000000,1.999788
1,0.000357,0.000000,0.999788
2,0.000357,0.016438,0.990268
3,0.000715,0.016438,0.987232
4,0.000715,0.019178,0.983515
...,...,...,...
437,0.595570,0.994521,0.029973
438,0.595570,0.997260,0.029952
439,0.866738,0.997260,0.010434
440,0.866738,1.000000,0.010360


In [78]:
roc_df.sort_values('TPR',ascending=False,inplace=True)


In [79]:
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = threshold[optimal_idx]

In [80]:
optimal_threshold

0.09936133439145774
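
The expression `np.argmax(tpr - fpr)` picks the threshold that maximizes Youden's J statistic (TPR minus FPR), i.e. the point on the ROC curve farthest above the diagonal. The sketch below applies the same rule to a small hand-made example; the labels and scores are invented for illustration and are not taken from the bank data.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Invented labels and predicted probabilities (4 negatives, 5 positives).
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
scores_demo = np.array([0.1, 0.2, 0.3, 0.6, 0.5, 0.55, 0.7, 0.8, 0.9])

fpr_demo, tpr_demo, thr_demo = roc_curve(y_true_demo, scores_demo)

# Youden's J for every candidate threshold; the best one maximizes TPR - FPR.
j_scores = tpr_demo - fpr_demo
best_idx = int(np.argmax(j_scores))
best_threshold = thr_demo[best_idx]
best_j = j_scores[best_idx]

print("best threshold:", best_threshold, "J:", best_j)
```

Youden's J weighs false positives and false negatives equally. If missing a potential subscriber costs more than an unnecessary call, a lower threshold than the J-optimal one may be preferable.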

In [81]:
custom_threshold = 0.099


## To get a 0/1 prediction vector (pandas Series): 1 when the probability exceeds the threshold
final_pred_array = pd.Series([1 if x>custom_threshold else 0 for x in train_preds_prob])
final_pred_array.value_counts()

final_test_pred_array = pd.Series([1 if x>custom_threshold else 0 for x in test_preds_prob])
final_test_pred_array.value_counts()

0    973
1    384
dtype: int64

In [82]:
## To get True-False format vector (pandas Series)
final_pred = pd.Series(train_preds_prob > custom_threshold)
final_pred.value_counts()
final_test_pred=pd.Series(test_preds_prob > custom_threshold)

In [83]:
print(classification_report(y_train,final_pred))

              precision    recall  f1-score   support

           0       0.98      0.81      0.89      2799
           1       0.38      0.86      0.52       365

    accuracy                           0.82      3164
   macro avg       0.68      0.84      0.71      3164
weighted avg       0.91      0.82      0.85      3164



In [84]:
print(classification_report(y_test,final_test_pred))

              precision    recall  f1-score   support

           0       0.97      0.79      0.87      1201
           1       0.34      0.83      0.48       156

    accuracy                           0.79      1357
   macro avg       0.66      0.81      0.68      1357
weighted avg       0.90      0.79      0.83      1357



In [85]:
train_accuracy= accuracy_score(y_train,final_pred)
train_recall= recall_score(y_train,final_pred)
print(train_accuracy)
print(train_recall)

test_accuracy= accuracy_score(y_test,final_test_pred)
test_recall= recall_score(y_test,final_test_pred)
print(test_accuracy)
print(test_recall)

0.81826801517067
0.863013698630137
0.793662490788504
0.8333333333333334


### Logistic model 2 - Using penalty 'L1'

In [36]:
l1_model = LogisticRegression(penalty='l1', solver='saga')

l1_model.fit(X_train,y_train)

l1_train_pred = l1_model.predict(X_train)
l1_test_pred = l1_model.predict(X_test)

#### Confusion Matrix for model2

In [37]:
confusion_matrix(y_train,l1_train_pred)

array([[2745,   54],
       [ 242,  123]])

In [38]:
train_pred_prob = l1_model.predict_proba(X_train)[:,1]
train_pred_classes=l1_model.predict(X_train)
print(train_pred_classes)

test_pred_classes=l1_model.predict(X_test)
test_pred_prob=l1_model.predict_proba(X_test)[:,1]

[0 0 0 ... 0 0 0]


In [39]:
l1_model.coef_

array([[ 0.00000000e+00,  0.00000000e+00, -3.69266739e-01,
         0.00000000e+00, -1.23923987e+00,  0.00000000e+00,
         1.19327000e+01, -2.26992033e-02,  1.65830050e-01,
        -1.88516625e-01, -3.05989010e-01, -4.53375411e-01,
         0.00000000e+00, -2.02321504e-03, -1.33990640e-01,
         4.82409486e-01, -2.57743732e-01, -2.71768934e-01,
         2.81811248e-01, -1.49607771e-01, -3.87165990e-01,
         3.52304275e-02, -8.32040544e-01, -4.82977369e-01,
        -1.74287845e-01, -2.47281506e-01,  3.53863578e-01,
         2.81770127e-03, -4.73620442e-01, -5.05590337e-01,
         4.59196820e-01,  1.35404982e+00, -4.31087702e-01,
        -8.11009285e-01,  1.15517286e+00,  7.81936121e-01,
         0.00000000e+00,  1.33748373e-01,  2.20012821e+00,
        -3.28928209e-01,  0.00000000e+00]])
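
Several entries in the L1 coefficient array above are exactly zero, which is the characteristic effect of an L1 penalty: it can remove features from the model, whereas an L2 penalty only shrinks coefficients toward zero. The sketch below illustrates this on synthetic data; the dataset from `make_classification`, the `liblinear` solver and `C=0.1` are assumptions made for the illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the bank data: 20 features, only 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, n_redundant=0, random_state=0
)
X_demo = MinMaxScaler().fit_transform(X_demo)

# In scikit-learn, a smaller C means stronger regularization.
l1_demo = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_demo, y_demo)
l2_demo = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X_demo, y_demo)

# Count coefficients that were driven exactly to zero by each penalty.
n_zero_l1 = int(np.sum(l1_demo.coef_ == 0))
n_zero_l2 = int(np.sum(l2_demo.coef_ == 0))
print("coefficients set exactly to zero -> L1:", n_zero_l1, " L2:", n_zero_l2)
```

The regularization strength `C` is left at its default of 1.0 in the notebook's models; tuning it, for example with cross-validation, changes how many coefficients the L1 penalty removes.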

In [40]:
train_accuracy_2 = accuracy_score(y_train,train_pred_classes)
train_recall_2 = recall_score(y_train,train_pred_classes)
train_precision_2 = precision_score(y_train,train_pred_classes)

test_accuracy_2= accuracy_score(y_test,test_pred_classes)
test_recall_2= recall_score(y_test,test_pred_classes)
test_precision_2= precision_score(y_test,test_pred_classes)

In [41]:
print(train_accuracy_2)
print(train_recall_2)
print(train_precision_2)

print(test_accuracy_2)
print(test_recall_2)
print(test_precision_2)

0.9064475347661188
0.336986301369863
0.6949152542372882
0.8975681650700074
0.30128205128205127
0.6103896103896104


### Model using penalty 'L2'

In [42]:
l2_model = LogisticRegression(penalty='l2')

l2_model.fit(X_train,y_train)

l2_train_pred = l2_model.predict(X_train)
l2_test_pred = l2_model.predict(X_test)

In [43]:
confusion_matrix(y_train,l2_train_pred)

array([[2757,   42],
       [ 268,   97]])

In [44]:
train_l2 = l2_model.predict_proba(X_train)[:,1]
train_l2c=l2_model.predict(X_train)
print(train_l2c)

test_l2c=l2_model.predict(X_test)
test_l2=l2_model.predict_proba(X_test)[:,1]

[0 0 0 ... 0 0 0]


In [45]:
l2_model.coef_

array([[ 5.94269264e-02, -3.75055946e-01, -6.99259042e-01,
        -8.52791693e-03, -1.12652146e+00,  9.84086844e-02,
         8.63887252e+00, -5.30686781e-02,  1.47230666e-01,
        -2.57817199e-01, -2.56435030e-01, -4.72280281e-01,
         7.06174690e-03, -1.67146554e-01, -2.22213454e-01,
         4.34002528e-01, -3.44302381e-01, -3.81971174e-01,
         2.80608779e-01, -2.69727651e-01, -4.43810561e-01,
         2.66030128e-01, -7.80406653e-01, -5.42165240e-01,
        -2.09555461e-01, -2.74412125e-01,  5.92009710e-01,
         1.32501309e-02, -5.87422826e-01, -4.52928441e-01,
         4.20096152e-01,  1.12576693e+00, -4.38105543e-01,
        -7.73401562e-01,  1.06707863e+00,  7.21166089e-01,
        -4.77856236e-02,  1.71093019e-01,  2.08232642e+00,
        -3.12932304e-01,  1.01885606e-01]])

In [46]:
train_accuracy_3 = accuracy_score(y_train,train_l2c)
train_recall_3 = recall_score(y_train,train_l2c)
train_precision_3 = precision_score(y_train,train_l2c)

test_accuracy_3= accuracy_score(y_test,test_l2c)
test_recall_3= recall_score(y_test,test_l2c)
test_precision_3= precision_score(y_test,test_l2c)

In [47]:
print(train_accuracy_3)
print(train_recall_3)
print(train_precision_3)

print(test_accuracy_3)
print(test_recall_3)
print(test_precision_3)

0.9020227560050569
0.26575342465753427
0.697841726618705
0.8975681650700074
0.24358974358974358
0.6440677966101694


### Using penalty 'elasticnet'

In [48]:
reg_model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)

reg_model.fit(X_train,y_train)

train_pred = reg_model.predict(X_train)
test_pred = reg_model.predict(X_test)

In [49]:
reg_model.coef_

array([[ 0.00000000e+00, -2.44027748e-02, -6.07989144e-01,
         0.00000000e+00, -1.16263190e+00,  0.00000000e+00,
         9.91105389e+00, -4.04109371e-02,  1.50598057e-01,
        -2.18237993e-01, -2.75474305e-01, -4.59266885e-01,
         0.00000000e+00, -9.16767112e-02, -1.81981090e-01,
         4.58290449e-01, -3.01114240e-01, -3.35004181e-01,
         2.72900518e-01, -2.19261751e-01, -4.11545625e-01,
         1.76708049e-01, -7.92769086e-01, -5.17034190e-01,
        -1.98208360e-01, -2.58748913e-01,  5.02812015e-01,
         9.14648881e-03, -5.39637101e-01, -4.66393663e-01,
         4.28802856e-01,  1.21232738e+00, -4.30905974e-01,
        -7.86249937e-01,  1.09723012e+00,  7.40371064e-01,
         0.00000000e+00,  1.58593274e-01,  2.13368554e+00,
        -3.15090649e-01,  0.00000000e+00]])

### Naive Bayes Classifier

In [86]:
model = GaussianNB().fit(X_train,y_train) 

pred_train = model.predict(X_train)  
pred_test = model.predict(X_test) #predict on test data 

print(accuracy_score(y_train,pred_train)) 
print(recall_score(y_train,pred_train)) 
print(accuracy_score(y_test,pred_test))
print(recall_score(y_test,pred_test))

0.8460809102402023
0.4520547945205479
0.8400884303610906
0.42948717948717946


In [87]:
confusion_matrix(y_train,pred_train)
confusion_matrix(y_test,pred_test)

array([[1073,  128],
       [  89,   67]])

### Balancing the class weights

In [80]:
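# class_weight='balanced' reweights each class inversely to its frequency in y_train.
# penalty='none' is accepted by older scikit-learn releases; newer ones expect penalty=None.
logistic_model = LogisticRegression(penalty='none', class_weight='balanced')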

logistic_model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None,
                   penalty='none', random_state=None, solver='lbfgs',
                   tol=0.0001, verbose=0, warm_start=False)
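
With `class_weight='balanced'`, scikit-learn assigns each class the weight `n_samples / (n_classes * count_of_that_class)`. The sketch below computes those weights for the training class counts reported earlier (2799 negatives, 365 positives) using `compute_class_weight`; the label vector `y_demo` is rebuilt from those counts for illustration rather than taken from `y_train`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label vector with the training class counts shown in the classification report.
y_demo = np.array([0] * 2799 + [1] * 365)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

# Same formula written out by hand: n_samples / (n_classes * class_count).
manual = len(y_demo) / (2 * np.bincount(y_demo))

print(dict(zip([0, 1], weights)))
```

Each positive example therefore counts roughly 7.7 times as much as a negative one in the loss (2799/365), which is why recall on class 1 rises sharply while precision drops.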

In [81]:
train_preds = logistic_model.predict(X_train)
train_preds_prob=logistic_model.predict_proba(X_train)[:,1]
test_preds = logistic_model.predict(X_test)
test_preds_prob=logistic_model.predict_proba(X_test)[:,1]

In [82]:
logistic_model.coef_

array([[-0.05719819, -1.06264921, -5.03299116,  0.20300295, -1.43736724,
        -0.05157771, 18.20054463, -0.03471874,  0.09623885, -0.32352011,
        -0.30760532, -0.99091598, -0.22284338, -0.38091642, -0.43370438,
         0.36264382, -0.74533016, -0.68609119,  0.27885408, -0.4081228 ,
        -0.77432992,  0.47738301, -1.15927732, -0.43651184, -0.13080537,
        -0.37814643,  0.70945383,  0.13410137, -0.9437042 , -0.96423591,
         0.48298593,  1.73356476, -0.86112252, -0.78285516,  1.53783985,
         1.16969852,  0.10283404,  0.26856747,  2.41095531, -0.45540592,
        -0.26988001]])

In [83]:
confusion_matrix(y_train,train_preds)

array([[2370,  429],
       [  64,  301]])

In [84]:
train_accuracy_1= accuracy_score(y_train,train_preds)
train_recall_1= recall_score(y_train,train_preds)
train_precision_1= precision_score(y_train,train_preds)

test_accuracy_1= accuracy_score(y_test,test_preds)
test_recall_1= recall_score(y_test,test_preds)
test_precision_1= precision_score(y_test,test_preds)

In [85]:
print(train_accuracy_1)
print(train_recall_1)
print(train_precision_1)

print(test_accuracy_1)
print(test_recall_1)
print(test_precision_1)

0.8441845764854614
0.8246575342465754
0.4123287671232877
0.8282977155490051
0.8012820512820513
0.382262996941896


In [86]:
print(classification_report(y_train,train_preds))

              precision    recall  f1-score   support

           0       0.97      0.85      0.91      2799
           1       0.41      0.82      0.55       365

    accuracy                           0.84      3164
   macro avg       0.69      0.84      0.73      3164
weighted avg       0.91      0.84      0.86      3164



In [87]:
print(classification_report(y_test,test_preds))

              precision    recall  f1-score   support

           0       0.97      0.83      0.90      1201
           1       0.38      0.80      0.52       156

    accuracy                           0.83      1357
   macro avg       0.68      0.82      0.71      1357
weighted avg       0.90      0.83      0.85      1357

