**Run the following two cells before you begin.**

In [1]:
%autosave 10

Autosaving every 10 seconds


In [2]:
import pandas as pd
import numpy as np

______________________________________________________________________
**First, import your data set and define the sigmoid function.**
<details>
    <summary>Hint:</summary>
    The definition of the sigmoid is $f(X) = \frac{1}{1 + e^{-X}}$.
</details>

In [3]:
# Import the data set
df = pd.read_csv('cleaned_data.csv')

In [4]:
# Define the sigmoid function
def sigmoid(X):
    Y = 1 / (1 + np.exp(-X))
    return Y
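
As a quick sanity check on the definition above, the sketch below evaluates the same sigmoid on a few hand-picked inputs (the `values` array is purely illustrative and not part of the exercise data). It checks two standard properties: the output at 0 is 0.5, and `sigmoid(-x)` equals `1 - sigmoid(x)`.

```python
import numpy as np

def sigmoid(X):
    # Same definition as in the cell above
    return 1 / (1 + np.exp(-X))

# Illustrative inputs, not taken from the data set
values = np.array([-5.0, 0.0, 5.0])
probs = sigmoid(values)
print(probs)

# Symmetry check: sigmoid(-x) should equal 1 - sigmoid(x)
symmetric = np.allclose(sigmoid(-values), 1 - probs)
print(symmetric)
```

Because the output always lies strictly between 0 and 1, it can be read as a probability, which is how it is used in the manual calculation later in this notebook.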

**Now, create a train/test split (80/20) with `PAY_1` and `LIMIT_BAL` as features and `default payment next month` as the response variable. Use a random state of 24.**

In [5]:
# Create a train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[['PAY_1', 'LIMIT_BAL']],
    df['default payment next month'].values,
    test_size=0.2, random_state=24)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(21331, 2)
(5333, 2)
(21331,)
(5333,)
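
To illustrate what `test_size=0.2` and `random_state=24` do, here is a minimal sketch on a small, made-up data frame (the `toy` values are hypothetical and only stand in for `cleaned_data.csv`). It shows the 80/20 row counts and that repeating the split with the same random state selects the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same column names as the exercise
toy = pd.DataFrame({
    'PAY_1': [0, 1, 2, 0, -1, 0, 1, 0, 2, 0],
    'LIMIT_BAL': [20000, 50000, 30000, 120000, 90000,
                  50000, 200000, 80000, 10000, 150000],
    'default payment next month': [0, 1, 1, 0, 0, 0, 0, 0, 1, 0],
})

features = toy[['PAY_1', 'LIMIT_BAL']]
response = toy['default payment next month'].values

# Split twice with the same random state
split_a = train_test_split(features, response, test_size=0.2, random_state=24)
split_b = train_test_split(features, response, test_size=0.2, random_state=24)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# The same rows end up in the test set both times
same_rows = X_te.index.equals(split_b[1].index)
print(same_rows)
```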


______________________________________________________________________
**Next, import LogisticRegression, with the default options, but set the solver to `'liblinear'`.**

In [6]:
from sklearn.linear_model import LogisticRegression
my_new_lr = LogisticRegression(solver = 'liblinear')
#my_new_lr.solver = 'liblinear'
my_new_lr

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

______________________________________________________________________
**Now, train on the training data and obtain predicted classes, as well as class probabilities, using the testing data.**

In [7]:
# Fit the logistic regression model on training data
#X_train = np.array([X1_train,X2_train])
my_new_lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [8]:
# Make predictions using `.predict()`
y_pred = my_new_lr.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [9]:
# Find class probabilities using `.predict_proba()`
from sklearn import metrics
y_pred_proba = my_new_lr.predict_proba(X_test)
print('Y prediction probability is\n',y_pred_proba)

Y prediction probability is
 [[0.74826924 0.25173076]
 [0.584297   0.415703  ]
 [0.79604453 0.20395547]
 ...
 [0.584297   0.415703  ]
 [0.82721498 0.17278502]
 [0.66393435 0.33606565]]
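
The two columns printed above are the probabilities of each class, ordered as in the model's `classes_` attribute, so each row sums to 1. The sketch below illustrates this on synthetic data (the `X_toy` and `y_toy` arrays are invented for the example and are not the exercise data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, assumed only for illustration
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

toy_lr = LogisticRegression(solver='liblinear')
toy_lr.fit(X_toy, y_toy)
proba = toy_lr.predict_proba(X_toy)

# Column order of predict_proba follows classes_
print(toy_lr.classes_)

# Each row holds P(class 0) and P(class 1), which sum to 1
row_sums = proba.sum(axis=1)
print(np.allclose(row_sums, 1.0))

# Probability of the positive class is the second column
pos_proba = proba[:, 1]
```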


______________________________________________________________________
**Then, pull out the coefficients and intercept from the trained model and manually calculate predicted probabilities. You'll need to add a column of 1s to your features, to multiply by the intercept.**

In [18]:
# Add column of 1s to features
add_ones=np.hstack([np.ones((X_test.shape[0],1)),X_test])
#X_train['1s'] = pd.Series(np.ones(21331),dtype=int, index=X_train.index)
#X_train

In [38]:
# Get coefficients and intercepts from trained model
coefs = my_new_lr.coef_
intercepts = my_new_lr.intercept_
intercepts_and_coefs = np.concatenate([intercepts.reshape(1,1), coefs], axis=1)
print(intercepts_and_coefs)
#print(intercept)

[[-6.57647457e-11  8.27451187e-11 -6.80876727e-06]]


In [39]:
# Manually calculate predicted probabilities
X=np.dot(intercepts_and_coefs,np.transpose(add_ones))
manual_prediction_prob=sigmoid(X)
manual_prediction_prob
#X_total = intercept*X_train['1s'] + coef[0,0]*X_train['PAY_1'] + coef[0,1]*X_train['LIMIT_BAL']
#P = sigmoid(X_total)
#P

array([[0.25173076, 0.415703  , 0.20395547, ..., 0.415703  , 0.17278502,
        0.33606565]])
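
The manual calculation can also be checked numerically rather than by eye. The following is a minimal sketch on synthetic data (`X_toy`, `y_toy`, and `toy_lr` are hypothetical names, not part of the exercise) that computes the linear combination of intercept and coefficients, applies the sigmoid, and compares the result to the second column of `predict_proba` with `np.allclose`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(X):
    return 1 / (1 + np.exp(-X))

# Synthetic two-feature data, assumed only for illustration
rng = np.random.default_rng(24)
X_toy = rng.normal(size=(300, 2))
y_toy = (X_toy[:, 0] - X_toy[:, 1] + rng.normal(size=300) > 0).astype(int)

toy_lr = LogisticRegression(solver='liblinear').fit(X_toy, y_toy)

# Linear combination: intercept + features times coefficients
linear_comb = toy_lr.intercept_[0] + X_toy @ toy_lr.coef_[0]
manual_proba = sigmoid(linear_comb)

# Compare against scikit-learn's positive-class probabilities
sk_proba = toy_lr.predict_proba(X_toy)[:, 1]
matches = np.allclose(manual_proba, sk_proba)
print(matches)
```

Adding the intercept directly, instead of prepending a column of 1s, gives the same linear combination and returns a one-dimensional array, which avoids the `(1, n)` shape seen in the output above.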

______________________________________________________________________
**Next, using a threshold of `0.5`, manually calculate predicted classes. Compare this to the class predictions output by scikit-learn.**

In [40]:
# Manually calculate predicted classes
manual_predictions=manual_prediction_prob>=0.5
#Y_tr= P[P>0.5]
#Y_fls = P[P<0.5]
#Y_fls


In [41]:
# Compare to scikit-learn's predicted classes
np.array_equal(y_pred.reshape(1,-1),manual_predictions)

True
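
The thresholding step can be seen on a handful of values. The sketch below reuses the five positive-class probabilities visible in the output earlier in this notebook, kept in the same `(1, n)` shape, and converts them to integer class labels. Note that `>= 0.5` and a strict `> 0.5` only differ for a probability of exactly 0.5, which does not occur here.

```python
import numpy as np

# Positive-class probabilities copied from the printed output above,
# in the same (1, n) shape as manual_prediction_prob
probs = np.array([[0.25173076, 0.415703, 0.20395547, 0.17278502, 0.33606565]])

# Thresholding yields a boolean array
bool_classes = probs >= 0.5

# Convert to integer labels and flatten to one dimension
int_classes = bool_classes.astype(int)
flat_classes = int_classes.ravel()
print(flat_classes)  # [0 0 0 0 0]
```

`np.array_equal` treats `False` as equal to 0 and `True` as equal to 1, which is why the boolean array in the cell above compares equal to scikit-learn's integer predictions without an explicit conversion.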

______________________________________________________________________
**Finally, calculate ROC AUC using both scikit-learn's predicted probabilities, and your manually predicted probabilities, and compare.**

In [42]:
# Use scikit-learn's predicted probabilities to calculate ROC AUC
from sklearn.metrics import roc_auc_score
#roc_auc_score(y_test,manual_prediction_prob.reshape(manual_prediction_prob.shape[1],))
pos_proba = y_pred_proba[:,1]
roc_auc_score(y_test, pos_proba)

0.627207450280691

In [43]:
# Use manually calculated predicted probabilities to calculate ROC AUC
roc_auc_score(y_test,manual_prediction_prob[0])

0.627207450280691

We can see that the ROC AUC is exactly the same, approximately 0.63, whether it is calculated from scikit-learn's predicted probabilities or from the manually calculated ones.
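
The agreement is expected: the ROC AUC depends only on how the scores rank the positive samples relative to the negative ones, so identical probabilities must give identical values. As a rough illustration, the sketch below computes the AUC by counting positive/negative pairs on a four-sample toy example (`pairwise_auc` is a hypothetical helper written for this note, not a scikit-learn function, and the toy labels and scores are made up).

```python
import numpy as np

def pairwise_auc(y_true, scores):
    # Fraction of (positive, negative) pairs in which the positive
    # sample has the higher score; ties count as half
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for illustration
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_toy, scores_toy)
print(auc)  # 0.75
```

Three of the four positive/negative pairs are ranked correctly, giving 0.75. This pairwise version builds a full positive-by-negative matrix, so it is only practical for small examples; `roc_auc_score` is the appropriate choice for the full test set.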