In [2]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [46]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
file_path = Path("Resources/lending_data.csv")

lending_df = pd.read_csv(file_path)

# Review the DataFrame
lending_df.head()


Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [4]:
# Separate the data into labels and features
# Separate the y variable, the labels
y = lending_df["loan_status"]

# Separate the X variable, the features
X = lending_df.drop(columns="loan_status")



In [5]:
# Review the y variable Series
y[:5]

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [6]:
# Review the X variable DataFrame
# Check that X contains only features
X.head(5)

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [31]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 1,
                                                    stratify = y
                                                    )
X_train.shape



(58152, 7)
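
As a rough sanity check on the split, the arithmetic below reuses the support counts printed in the two classification reports further down this notebook. The variable names are illustrative and not part of the original code. It assumes `train_test_split` was left at its default test size of 25%, and checks that `stratify = y` kept the share of high-risk loans the same in both subsets.

```python
# Support counts copied from the classification reports in this notebook
train_counts = {0: 56277, 1: 1875}   # training rows per label
test_counts = {0: 18759, 1: 625}     # testing rows per label

train_total = sum(train_counts.values())
test_total = sum(test_counts.values())

# Share of all rows that ended up in the testing set
test_share = test_total / (train_total + test_total)

# Share of high-risk loans (label 1) within each subset
train_risk_share = train_counts[1] / train_total
test_risk_share = test_counts[1] / test_total

print(f"Testing share of all rows: {test_share:.4f}")
print(f"High-risk share, training: {train_risk_share:.4f}")
print(f"High-risk share, testing:  {test_risk_share:.4f}")
```

If the two high-risk shares match, the stratified split has preserved the class balance, so the imbalance seen in the test set mirrors the imbalance the model was trained on.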

---

## Create a Logistic Regression Model with the Original Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [8]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
log_regression_model = LogisticRegression(solver = 'lbfgs',
                                #max_iter = 100,
                                random_state = 1
                                )

# Fit the model using training data
log_regression_model.fit(X_train, y_train)

In [9]:
# Score the model (added for additional information on the accuracy of the model)
print(f"Training Data Score: {log_regression_model.score(X_train, y_train)}")
print(f"Testing Data Score: {log_regression_model.score(X_test, y_test)}")

Training Data Score: 0.9914878250103177
Testing Data Score: 0.9924164259182832


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [10]:
# Make a prediction using the testing data
test_predictions = log_regression_model.predict(X_test)
results = pd.DataFrame({"prediction": test_predictions, "Actual": y_test})

results.head()

Unnamed: 0,prediction,Actual
36831,0,0
75818,0,1
36563,0,0
13237,0,0
43292,0,0


### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [53]:
# Generate a confusion matrix for the model
predictions_matrix = confusion_matrix(y_test, test_predictions)

# Print the confusion matrix
print("Test Set:")
print(predictions_matrix)

Test Set:
[[18679    80]
 [   67   558]]
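
The per-class figures in the classification report can be cross-checked by hand from this matrix. The sketch below is only an illustration (the variable names are not part of the original notebook) and assumes scikit-learn's layout, where rows are actual labels and columns are predicted labels.

```python
# Confusion matrix copied from the test-set output above.
# Rows are actual labels, columns are predicted labels.
matrix = [[18679, 80],
          [67, 558]]

tn, fp = matrix[0]   # actual healthy: predicted healthy, predicted high-risk
fn, tp = matrix[1]   # actual high-risk: predicted healthy, predicted high-risk

# Label 1 (high-risk loan) treated as the positive class
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Label 0 (healthy loan) treated as the positive class
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Label 0: precision={precision_0:.4f}, recall={recall_0:.4f}")
print(f"Label 1: precision={precision_1:.4f}, recall={recall_1:.4f}")
print(f"Accuracy: {accuracy:.4f}")
```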


In [54]:
# Print the classification report for the model
predictions_report = classification_report(y_test, test_predictions)

# Print the classification report
print("Test Set:")
print(predictions_report)


Test Set:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18759
           1       0.87      0.89      0.88       625

    accuracy                           0.99     19384
   macro avg       0.94      0.94      0.94     19384
weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 

The purpose of this model is to distinguish high-risk loans from healthy loans. The risk to the credit company is that a borrower defaults and the amount owed cannot be recovered, so the main objective is to identify potential bad loans. It is also important not to misclassify good loans as bad, because doing so means that opportunities are missed and business is lost. Given this, the measures of greatest interest are precision and recall.


Precision is the number of true positive predictions divided by the total number of positive predictions made by the model. It measures how often a predicted label turns out to be correct. For healthy loans, the share of loans predicted healthy that really are healthy is almost 100% (about 99.6%, or 18,679 of 18,746), so a healthy prediction is extremely reliable. For high-risk loans the precision is 87%, which means that roughly one in eight loans flagged as high-risk is in fact healthy. This is still a positive indication that the model is good at predicting the instances where a loan is likely to be bad.


Recall is the number of true positive predictions divided by the total number of actual positive instances in the data. It measures the model's ability to capture the positive instances that exist. For healthy loans, the recall is again approximately 100%. For high-risk loans the recall is 89%, meaning that the model incorrectly classified 11% of the bad loans as good. This is a good outcome: the model could potentially screen out 89% of future loans with a high risk of default.


Although the recall and precision scores indicate that the model performs well for both good and bad loan predictions, the sample data is highly imbalanced between classes. Bad loans make up only about 3% of the dataset (see the support column in the classification report). This imbalance may mean that the performance metrics are misleading or skewed. An imbalanced dataset carries several risks. In particular, the model might focus on predicting the majority class (the healthy loans), resulting in a higher number of false negatives and a lower recall for the bad loans. There is also a risk of overfitting the majority class: having learned the patterns in the healthy loans, the model may be less able to predict the bad loans. This is also why the accuracy score (the number of correct predictions for both good and bad loans divided by the total number of loans) is a poor measure for evaluating this particular model.
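
To illustrate why accuracy is a weak measure here, the sketch below compares the model with a hypothetical baseline that labels every loan as healthy. The baseline is an assumption introduced purely for illustration; the counts come from the test-set confusion matrix and classification report.

```python
# Test-set support taken from the classification report
healthy, high_risk = 18759, 625
total = healthy + high_risk

# Hypothetical baseline: predict label 0 (healthy) for every loan
baseline_accuracy = healthy / total       # every healthy loan is "correct"
baseline_recall_1 = 0 / high_risk         # no high-risk loan is ever caught

# The fitted model: correct predictions from the confusion matrix diagonal
model_accuracy = (18679 + 558) / total
model_recall_1 = 558 / high_risk

print(f"Baseline accuracy: {baseline_accuracy:.4f}, recall for label 1: {baseline_recall_1:.2f}")
print(f"Model accuracy:    {model_accuracy:.4f}, recall for label 1: {model_recall_1:.2f}")
```

The baseline reaches an accuracy within a few percentage points of the model while catching no high-risk loans at all, which is why precision and recall for label `1` are the more informative figures.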


In conclusion, the recall and precision scores indicate that the model may provide an accurate screening tool for credit applications. However, given the large imbalance between the majority and minority classes, the model should be trained and evaluated further on sample sets with a greater number of defaulting loans. Defaulting loans can be expected to form a small minority of any sample, so consideration should be given to techniques that address the risks of an imbalanced dataset: we can resample the dataset, oversampling the bad loans relative to the majority class, or employ class weighting, which forces the model to attach greater importance to occurrences of the minority class.
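
As a sketch of what class weighting could look like, the code below computes weights with the heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight="balanced"`. This was not part of the original analysis; the counts are the training-set support figures reported in this notebook and the variable names are illustrative.

```python
# Training-set support per label, from the training classification report
class_counts = {0: 56277, 1: 1875}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

# After weighting, each class contributes the same total weight
weighted_totals = {
    label: class_weights[label] * class_counts[label]
    for label in class_counts
}

print(class_weights)
print(weighted_totals)
```

Under this assumption each high-risk loan would count about thirty times as much as a healthy loan during fitting. The same effect could be requested by passing `class_weight="balanced"` to `LogisticRegression`, although whether it improves recall for label `1` on this dataset would need to be tested.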

 

In [56]:
# -----------------------------------------------------------------------------
# Additional check: evaluate the model on the training data to see whether
# it is overfitted.
# -----------------------------------------------------------------------------

# Predict labels for the training data with the fitted model
train_predictions = log_regression_model.predict(X_train)

# Produce a confusion matrix and classification report for the training predictions
train_predictions_matrix = confusion_matrix(y_train, train_predictions)
train_predictions_report = classification_report(y_train, train_predictions)

# Print the results for the training set
print("Training Set - confusion matrix")
print(train_predictions_matrix)
print("")
print("Training Set - classification report")
print(train_predictions_report)


Training Set - confusion matrix
[[55980   297]
 [  198  1677]]

Training Set - classification report
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     56277
           1       0.85      0.89      0.87      1875

    accuracy                           0.99     58152
   macro avg       0.92      0.94      0.93     58152
weighted avg       0.99      0.99      0.99     58152
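
One way to read the overfitting check is to compare the label `1` metrics on the training and testing sets directly. The helper below is a hypothetical illustration, not part of the original notebook; it takes the two confusion matrices printed above and reports the difference between training and testing scores.

```python
def class1_metrics(matrix):
    """Return precision and recall for label 1 from a 2x2 confusion matrix.

    Rows are actual labels and columns are predicted labels.
    """
    (tn, fp), (fn, tp) = matrix
    return {"precision": tp / (tp + fp), "recall": tp / (tp + fn)}


# Confusion matrices copied from the notebook output
train = class1_metrics([[55980, 297], [198, 1677]])
test = class1_metrics([[18679, 80], [67, 558]])

# A large positive gap (training much better than testing) would hint at overfitting
gaps = {name: train[name] - test[name] for name in train}

for name in train:
    print(f"{name}: train={train[name]:.4f}, test={test[name]:.4f}, gap={gaps[name]:+.4f}")
```

Both gaps are within a few percentage points, and the testing precision is slightly higher than the training precision, so these figures do not by themselves suggest that the model is overfitted.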



---