**Run the following two cells before you begin.**

In [1]:
%autosave 10

Autosaving every 10 seconds


In [2]:
import pandas as pd
import numpy as np

______________________________________________________________________
**First, import your data set and define the sigmoid function.**
<details>
    <summary>Hint:</summary>
    The sigmoid is defined as $f(X) = \frac{1}{1 + e^{-X}}$.
</details>

In [3]:
# Import the data set
df = pd.read_csv('cleaned_data.csv')

In [4]:
# Define the sigmoid function
def sigmoid(X):
    Y = 1 / (1 + np.exp(-X))
    return Y
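A quick sanity check of the sigmoid can catch sign errors early. The sketch below redefines the function so that it is self-contained; the input array `z` is a made-up example, not part of the data set.

```python
import numpy as np

def sigmoid(X):
    # Logistic function: maps any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-X))

# Hypothetical inputs chosen only to illustrate the shape of the curve
z = np.array([-5.0, 0.0, 5.0])
p = sigmoid(z)

# sigmoid(0) is exactly 0.5, and sigmoid(-x) + sigmoid(x) equals 1
print(p)
```

Because the function is built from NumPy operations, it works element-wise on scalars, arrays, and pandas Series alike.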

**Now, create a train/test split (80/20) with `PAY_1` and `LIMIT_BAL` as features and `default payment next month` as the response variable. Use a random state of 24.**

In [5]:
# Create a train/test split
from sklearn.model_selection import train_test_split

features = ['PAY_1', 'LIMIT_BAL']
x_df = df[features].copy()  # copy so later column additions do not touch a slice of df
X_train, X_test, y_train, y_test = train_test_split(
    x_df.values, df['default payment next month'].values,
    test_size=0.2, random_state=24)
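To see what `test_size=0.2` and a fixed `random_state` do, here is a minimal sketch on a tiny synthetic array. The names `X_toy` and `y_toy` are hypothetical stand-ins for the real features and response.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# 80/20 split; fixing random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=24)

# Repeating the call with the same random_state returns the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=24)

print(X_tr.shape, X_te.shape)
```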

______________________________________________________________________
**Next, import `LogisticRegression` and instantiate it with the default options, but set the solver to `'liblinear'`.**

In [6]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

______________________________________________________________________
**Now, train on the training data and obtain predicted classes, as well as class probabilities, using the testing data.**

In [7]:
# Fit the logistic regression model on training data
lr.fit(X_train,y_train)

LogisticRegression(multi_class='ovr', solver='liblinear')

In [8]:
# Make predictions using `.predict()`
pred = lr.predict(X_test)

In [9]:
# Find class probabilities using `.predict_proba()`
prob = lr.predict_proba(X_test)
prob

array([[0.74826924, 0.25173076],
       [0.584297  , 0.415703  ],
       [0.79604453, 0.20395547],
       ...,
       [0.584297  , 0.415703  ],
       [0.82721498, 0.17278502],
       [0.66393435, 0.33606565]])

______________________________________________________________________
**Then, pull out the coefficients and intercept from the trained model and manually calculate predicted probabilities. You'll need to add a column of 1s to your features, to multiply by the intercept.**

In [10]:
# Add column of 1s to features
x_df = x_df.assign(Ones=1)  # returns a new DataFrame, which avoids SettingWithCopyWarning
x_df


       PAY_1  LIMIT_BAL  Ones
0          2      20000     1
1         -1     120000     1
2          0      90000     1
3          0      50000     1
4         -1      50000     1
...      ...        ...   ...
26659      0     220000     1
26660     -1     150000     1
26661      4      30000     1
26662      1      80000     1


In [11]:
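# Get coefficients and intercept from the trained model
intercept = lr.intercept_[0]   # intercept_ is a length-1 array; take its single element
coefficient1 = lr.coef_[0][0]  # coefficient for PAY_1
coefficient2 = lr.coef_[0][1]  # coefficient for LIMIT_BAL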

In [12]:
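# Wrap the test features in a DataFrame so columns can be selected by name;
# X_test itself stays a NumPy array, matching what the model was fit on
X_test_df = pd.DataFrame(data=X_test, columns=['PAY_1', 'LIMIT_BAL'])

In [13]:
# Manually calculate predicted probabilities
pred_prob = sigmoid(coefficient1*X_test_df['PAY_1'] + coefficient2*X_test_df['LIMIT_BAL'] + intercept)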
pred_prob

0       0.251731
1       0.415703
2       0.203955
3       0.203955
4       0.415703
          ...   
5328    0.278236
5329    0.415703
5330    0.415703
5331    0.172785
5332    0.336066
Length: 5333, dtype: float64
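The same calculation can be written as a single matrix product, which is where the column of 1s comes in: it lets the intercept be treated as one more coefficient. The sketch below is an illustration on synthetic data, not the notebook's data set; `X_demo`, `y_demo`, `lr_demo`, and `theta` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data with two features and a binary response
rng = np.random.default_rng(24)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=200) > 0).astype(int)

lr_demo = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

# Prepend a column of 1s so the intercept multiplies a constant feature
ones_and_features = np.hstack([np.ones((X_demo.shape[0], 1)), X_demo])

# Stack the intercept and coefficients in the same column order
theta = np.concatenate([lr_demo.intercept_, lr_demo.coef_[0]])

# Sigmoid of the linear combination gives the positive-class probability
manual_proba = 1 / (1 + np.exp(-(ones_and_features @ theta)))
sklearn_proba = lr_demo.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

The second column of `predict_proba` is the positive-class probability, so that is the one to compare against the manual result.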

______________________________________________________________________
**Next, using a threshold of `0.5`, manually calculate predicted classes. Compare this to the class predictions output by scikit-learn.**

In [14]:
# Manually calculate predicted classes
pred_class = (pred_prob > 0.5).astype(int)
pred_class

0       0
1       0
2       0
3       0
4       0
       ..
5328    0
5329    0
5330    0
5331    0
5332    0
Length: 5333, dtype: int64

In [15]:
# Compare to scikit-learn's predicted classes
np.array_equal(pred, pred_class)

True
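Thresholding the probability at 0.5 is equivalent to checking the sign of the linear combination, because the sigmoid equals 0.5 exactly at zero. A minimal sketch, using made-up `scores` in place of the model's linear output:

```python
import numpy as np

# Hypothetical linear scores (coefficients * features + intercept)
scores = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])
proba = 1 / (1 + np.exp(-scores))

# Class 1 when the probability exceeds 0.5 ...
classes_from_proba = (proba > 0.5).astype(int)
# ... which is the same as the score being strictly positive
classes_from_scores = (scores > 0).astype(int)

print(classes_from_proba, classes_from_scores)
```

With a strict inequality, a score of exactly zero is assigned to class 0 under both rules.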

______________________________________________________________________
**Finally, calculate the ROC AUC using both scikit-learn's predicted probabilities and your manually calculated probabilities, and compare the two.**

In [18]:
# Use scikit-learn's predicted probabilities to calculate ROC AUC
from sklearn import metrics
y_pred_proba = lr.predict_proba(X_test)
pos_proba = y_pred_proba[:,1]
metrics.roc_auc_score(y_test, pos_proba)

0.627207450280691

In [100]:
# Use manually calculated predicted probabilities to calculate ROC AUC
metrics.roc_auc_score(y_test, pred_prob)

0.627207450280691
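The two values agree because the probabilities themselves agree. More generally, ROC AUC depends only on how the samples are ranked, so any strictly increasing transform of the scores, such as the sigmoid, leaves it unchanged. The sketch below illustrates this with hypothetical labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and linear scores, for illustration only
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([-1.5, -0.2, 0.1, 0.4, 0.9, 2.0])

# The sigmoid is strictly increasing, so it preserves the ranking
proba = 1 / (1 + np.exp(-scores))

auc_scores = roc_auc_score(y_true, scores)
auc_proba = roc_auc_score(y_true, proba)

# 8 of the 9 positive/negative pairs are ranked correctly
print(auc_scores, auc_proba)
```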