# CMPS 320
## Lab 7: Logistic Regression

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn.linear_model as skl_lm
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from sklearn import preprocessing


import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
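# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in matplotlib 3.6; use whichever is available
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')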

### Data Description

A simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt.

#### Format

A data frame with 10000 observations on the following 4 variables.

***default***

A factor with levels No and Yes indicating whether the customer defaulted on their debt

***student***

A factor with levels No and Yes indicating whether the customer is a student

***balance***

The average balance that the customer has remaining on their credit card after making their monthly payment

***income***

Income of customer

In [None]:
# Load data

Default = pd.read_excel('Default.xlsx', index_col=0)

In [None]:
# Obtain summary of the dataframe
Default.info()

In [None]:
# View the first five rows of the dataframe
Default.head()

In [None]:
# Generate descriptive statistics
Default.describe(include='all')

In [None]:
# Check the target variable distribution
Default.default.value_counts()

The `default` column has 10,000 entries and two unique values, 'No' and 'Yes'.

Only 333 customers defaulted ('Yes'), which is about 3.3% of the 10,000 customers.

Cases of 'No' far outnumber cases of 'Yes'.

In classification, when one class has far more (or far fewer) samples than the other, the data set is said to have class imbalance.
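
To see why imbalance matters, consider a trivial classifier that always predicts 'No'. The short sketch below uses only the counts quoted above (333 defaults out of 10,000) and shows that such a classifier already looks very accurate, even though it never identifies a single defaulter.

```python
# Counts taken from the value_counts() discussion above
n_total = 10_000
n_yes = 333
n_no = n_total - n_yes

default_rate = n_yes / n_total          # share of customers who defaulted
baseline_accuracy = n_no / n_total      # accuracy of always predicting 'No'

print(f"default rate: {default_rate:.2%}")
print(f"accuracy of always predicting 'No': {baseline_accuracy:.2%}")
```

Accuracy alone is therefore a weak yardstick on this data set; class-specific measures such as recall for 'Yes' are more informative.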

In [None]:
Default.student.value_counts()              

About 30% of the customers are students.

In [None]:
pd.crosstab(Default.student, Default.default)

### Logistic Regression

Since scikit-learn models accept only `numeric` features, categorical variables must be encoded as dummy variables.

In [None]:
pd.get_dummies(Default).head()

In [None]:
Default_enc = pd.get_dummies(Default, drop_first=True)
Default_enc.head()

**Note**: `default_Yes` is 1 when 'default' is Yes, and `student_Yes` is 1 when 'student' is Yes.
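
The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses a made-up three-row frame (`toy` is hypothetical, not the Default data) and compares the full encoding with the reduced one.

```python
import pandas as pd

# Hypothetical three-row frame with the same kind of columns as Default
toy = pd.DataFrame({
    "default": ["No", "Yes", "No"],
    "student": ["Yes", "No", "No"],
    "balance": [500.0, 1800.0, 900.0],
})

full = pd.get_dummies(toy)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # drops the first level ('No') of each factor

print(list(full.columns))
print(list(reduced.columns))
```

With two levels per factor, the dropped 'No' column carries no extra information: it is always one minus the 'Yes' column.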

### Logistic Regression Using scikit-learn

In [None]:
# import scikit-learn LogisticRegression estimator
from sklearn.linear_model import LogisticRegression

#### Quantitative variable 'balance' as predictor

In [None]:
# Instantiate the estimator with the solver 'newton-cg' 
logistic_reg = LogisticRegression(solver='newton-cg')

# LogisticRegression expects a 2D feature array, so reshape balance into an n x 1 matrix
X = Default_enc.balance.values.reshape(-1, 1)

# Default_Yes as Response
y = Default_enc.default_Yes

In [None]:
# Fit the model 
logistic_reg.fit(X, y)

In [None]:
print('classes: ',logistic_reg.classes_)
print('intercept :', logistic_reg.intercept_)
print('coefficient: ',logistic_reg.coef_)       

The output above shows the estimated intercept and coefficient of the logistic regression model that predicts the probability of default from balance on the Default data.

**Interpretation**: A one-unit increase in balance is associated with an increase in the log odds of default by 0.0055 units.
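
Log odds are hard to read directly, so it can help to exponentiate the coefficient into an odds ratio. The sketch below uses only the rounded coefficient 0.0055 quoted above; the 100-dollar step is an arbitrary choice for illustration.

```python
import math

beta_balance = 0.0055  # rounded coefficient from the text: change in log odds per dollar of balance

odds_ratio_per_dollar = math.exp(beta_balance)      # multiplicative change in odds per 1 dollar
odds_ratio_per_100 = math.exp(100 * beta_balance)   # multiplicative change in odds per 100 dollars

print(f"odds ratio per dollar:      {odds_ratio_per_dollar:.4f}")
print(f"odds ratio per 100 dollars: {odds_ratio_per_100:.2f}")
```

An extra 100 dollars of balance multiplies the odds of default by roughly 1.73, which is a more tangible statement than a change of 0.55 in log odds.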

#### Making Predictions

In [None]:
X_new = np.array([1000, 2000, 1700]).reshape(-1,1)
logistic_reg.predict_proba(X_new) # estimated class probabilities: column 0 is P(no default), column 1 is P(default)

We predict that the default probability for an individual with a balance of $1,000 is 0.00575.

We predict that the default probability for an individual with a balance of $2,000 is 0.5857 which is much higher.
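
As a consistency check, the two probabilities quoted above determine a straight line on the log-odds scale, so the slope and intercept of the fitted model can be recovered from them. This is a sketch using the rounded values from the text; `logit` and `predict_default` are helper names introduced here, not scikit-learn functions.

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))

# Rounded probabilities quoted above for balances of 1,000 and 2,000
p_1000, p_2000 = 0.00575, 0.5857

# Two points on the log-odds line give its slope and intercept
slope = (logit(p_2000) - logit(p_1000)) / (2000 - 1000)
intercept = logit(p_1000) - slope * 1000

def predict_default(balance):
    """Logistic function applied to intercept + slope * balance."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * balance)))

print(f"slope: {slope:.4f}, intercept: {intercept:.2f}")
print(f"P(default | balance=2000): {predict_default(2000):.4f}")
```

The recovered slope rounds to 0.0055, matching the coefficient reported for balance.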

In [None]:
# Request the estimated response as a class. 
y_pred = logistic_reg.predict(X_new) # default threshold is 0.5
y_pred

In [None]:
threshold = 0.5 # Setting your own threshold
# Predict class 1 ('Yes') when the estimated probability of default exceeds the threshold
y_pred = (logistic_reg.predict_proba(X_new)[:,1] > threshold).astype(int)
y_pred
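
The thresholding step can be tried in isolation, without a fitted model. The sketch below uses a hypothetical array of default probabilities (`proba_yes`) and a helper `classify` introduced here for illustration; lowering the threshold flags more customers as likely defaulters.

```python
import numpy as np

# Hypothetical probabilities of default, standing in for predict_proba(...)[:, 1]
proba_yes = np.array([0.01, 0.30, 0.60, 0.90])

def classify(proba, threshold=0.5):
    """Return 1 where the probability exceeds the threshold, else 0."""
    return (proba > threshold).astype(int)

print(classify(proba_yes))        # default threshold of 0.5
print(classify(proba_yes, 0.2))   # a lower threshold flags more cases
```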

#### Categorical variable 'student_Yes' as predictor


In [None]:
logistic_reg = LogisticRegression(solver='newton-cg')  
X = Default_enc.student_Yes.values.reshape(-1, 1)
y = Default_enc.default_Yes
logistic_reg.fit(X, y)
print('classes: ',logistic_reg.classes_)
print('intercept :', logistic_reg.intercept_)
print('coefficient: ',logistic_reg.coef_)

The coefficient for `student_Yes` is 0.39, which is positive. That is, when student status is the only predictor, students have a higher estimated probability of default than non-students.

***Using the previous model to predict default for a student and for a non-student***

In [None]:
X_new = np.array([1, 0]).reshape(-1,1)
logistic_reg.predict_proba(X_new)

In [None]:
y_pred = logistic_reg.predict(X_new) # default threshold is 0.5
y_pred

In both cases the model predicts 'default' = No (0) when student status is the only predictor.

This is expected: whether or not a customer is a student, the estimated default probability is far below the 0.5 threshold, because most customers in either group do not default.

The default probability for students is 0.043, which is slightly higher than 0.029 for non-students.
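
In a model with a single 0/1 predictor, the coefficient is the difference in log odds between the two groups. The sketch below checks this with the rounded probabilities quoted above; because those values are rounded to three decimals, the result should agree with the reported 0.39 only approximately.

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))

p_student = 0.043       # rounded default probability for students (from the text)
p_non_student = 0.029   # rounded default probability for non-students (from the text)

# Difference in log odds, which corresponds to the student_Yes coefficient
implied_coef = logit(p_student) - logit(p_non_student)

print(f"implied coefficient: {implied_coef:.2f}")
```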

### Multiple Logistic Regression

In [None]:
Default_enc.head(3)

In [None]:
# Predictors: 'balance', 'income', and 'student_Yes'
X = Default_enc.loc[:, ['balance', 'income', 'student_Yes']].copy()  # explicit copy of the predictor columns
X['income'] = X['income']*0.001  # rescale income from dollars to thousands of dollars

# Default_Yes as Response
y = Default_enc.default_Yes

In [None]:
X.head()

In [None]:
# Fit the model 
logistic_reg.fit(X, y)
print('classes: ',logistic_reg.classes_)
print('intercept :', logistic_reg.intercept_)
print('coefficient: ')
list(zip(X.columns, logistic_reg.coef_[0]) )

The negative coefficient for 'student_Yes' in the multiple logistic regression indicates that for a fixed value of balance and income, a student is less likely to default than a non-student.
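
One way a sign flip like this can arise is confounding: if students tend to carry higher balances, they can default more often overall while defaulting less often at any given balance. The simulation below is a sketch on synthetic data, not the Default data set. Every number in it (the 30% student share, the balance distributions, the logit coefficients) is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed setup: 30% students, and students carry higher balances (in thousands)
student = rng.random(n) < 0.3
balance_k = rng.normal(0.8 + 0.4 * student, 0.3)

# Assumed true model: balance raises the risk, being a student lowers it
log_odds = -8.0 + 6.0 * balance_k - 0.7 * student
default = rng.random(n) < 1.0 / (1.0 + np.exp(-log_odds))

# Overall default rates, ignoring balance
overall_student = default[student].mean()
overall_non = default[~student].mean()

# Default rates among customers with similar balances (0.9 to 1.1 thousand)
band = (balance_k > 0.9) & (balance_k < 1.1)
band_student = default[band & student].mean()
band_non = default[band & ~student].mean()

print(f"overall:     student {overall_student:.3f}, non-student {overall_non:.3f}")
print(f"same balance: student {band_student:.3f}, non-student {band_non:.3f}")
```

In this synthetic setting students default more often overall but less often within the balance band, which mirrors the positive coefficient in the single-predictor model and the negative one in the multiple regression.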

In [None]:
X_new = pd.DataFrame([[1500, 40, 1],    # balance 1500, income 40 (thousands), student
                      [1500, 40, 0]],   # balance 1500, income 40 (thousands), non-student
                     columns=X.columns) # reuse the training column names
logistic_reg.predict_proba(X_new)

For a student with a credit card balance of \$1,500 and an income of \$40,000, the estimated probability of default is 0.0595. For a non-student with the same balance and income it is higher, at 0.105.

## Alternative Method: Multiple Logistic Regression Using statsmodels

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
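# Encode the response as 0/1 so the model estimates P(default = 'Yes').
# A string response is expanded into two columns and the first level ('No') can end up as the modeled class.
logreg_stats = smf.glm(formula = 'default_yes ~ student + balance + income',
                       data=Default.assign(default_yes=(Default['default'] == 'Yes').astype(int)),
                       family=sm.families.Binomial()).fit()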
logreg_stats.summary()

In [None]:
print(Default.default.value_counts())

In [None]:
# Called without new data, predict() returns the fitted probabilities for the training set.
logreg_stats_pred_prob = logreg_stats.predict()
logreg_stats_pred_prob[:10] # Estimated probability of default = 'Yes'

In [None]:
logreg_stats_pred_class = [('No' if prob < 0.5 else 'Yes') for prob in logreg_stats_pred_prob ]
logreg_stats_pred_class[:10]

In [None]:
# import estimator metrics 
from sklearn import metrics

# confusion matrix 
conf_mat = metrics.confusion_matrix(Default.default.astype(str), logreg_stats_pred_class)
print(conf_mat)
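
With labels sorted as 'No', 'Yes', the rows of the confusion matrix are the true classes and the columns are the predicted classes. The sketch below uses ten made-up labels (not the Default data) to show how accuracy, recall, and precision for the 'Yes' class follow from the four cells; all names and values are hypothetical.

```python
# Hypothetical true and predicted labels for ten customers
y_true = ["No", "No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes"]
y_pred = ["No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes"]

pairs = list(zip(y_true, y_pred))
tn = sum(t == "No" and p == "No" for t, p in pairs)    # true negatives
fp = sum(t == "No" and p == "Yes" for t, p in pairs)   # false positives
fn = sum(t == "Yes" and p == "No" for t, p in pairs)   # false negatives
tp = sum(t == "Yes" and p == "Yes" for t, p in pairs)  # true positives

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)        # share of actual defaulters that were caught
precision = tp / (tp + fp)     # share of predicted defaulters that really defaulted

print([[tn, fp], [fn, tp]])    # same layout as the confusion matrix: rows true, columns predicted
print(f"accuracy {accuracy:.2f}, recall {recall:.2f}, precision {precision:.2f}")
```

On imbalanced data, recall for 'Yes' is often the number to watch, since a model can reach high accuracy while missing most defaulters.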