In [41]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [42]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
file_path = Path("Resources/lending_data.csv")
df_lending = pd.read_csv(file_path)


# Review the DataFrame
df_lending.head()

|   | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.


We separate the data into labels (y) and features (X) because this is a common data preparation step in machine learning. In a supervised learning task, we aim to train a model to predict the value of the target variable (in this case, "loan_status") based on the values of the input variables (in this case, all other columns except "loan_status").

To do this, we first need to split the data into two parts: the features, which are the input variables used to predict the target variable, and the labels, which are the target variable we are trying to predict.

By separating the data into labels and features, we can then use the features to train a machine learning model to predict the labels. The model can learn patterns and relationships in the features and their corresponding labels, and then use this knowledge to make predictions on new, unseen data.

Separating the data into labels and features is also useful for other data analysis tasks, such as exploratory data analysis and data visualization. It allows us to analyze and visualize the relationships between the input variables and the target variable separately, which can help us to better understand the data and identify potential patterns or trends.

In [43]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df_lending["loan_status"]

# Separate the X variable, the features
X = df_lending.drop(columns="loan_status")


In [44]:
# Review the y variable Series
print("y shape", y.shape)

y shape (77536,)


In [45]:
# Review the X variable DataFrame
print("X shape",X.shape)

X shape (77536, 7)


### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [46]:
# Check the balance of our target values
label_counts = y.value_counts()
label_counts

0    75036
1     2500
Name: loan_status, dtype: int64

The labels are heavily imbalanced: 75,036 loans are healthy (`0`) and only 2,500 are high-risk (`1`), so the minority class makes up about 3.2% of the data. A model that always predicted `0` would still reach about 96.8% accuracy, so accuracy alone can be misleading here. The imbalance should inform the choice of algorithm and evaluation metric, and whether to apply sampling or weighting strategies.
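One common weighting strategy can be sketched directly from the counts above. The snippet below is a minimal illustration and not part of the original notebook: it rebuilds a label vector with the same class counts and computes per-class weights as `n_samples / (n_classes * class_count)`, the heuristic scikit-learn documents for `class_weight="balanced"`. The names `y_demo`, `class_share`, and `class_weights` are assumptions chosen for the example.

```python
import numpy as np

# Rebuild a label vector with the class counts reported above
# (75,036 healthy loans, 2,500 high-risk loans).
y_demo = np.concatenate([np.zeros(75036, dtype=int), np.ones(2500, dtype=int)])

# Share of each class in the data
class_counts = np.bincount(y_demo)
class_share = class_counts / class_counts.sum()

# "Balanced" class weights: n_samples / (n_classes * class_count)
class_weights = len(y_demo) / (len(class_counts) * class_counts)

print("Class share:  ", class_share.round(4))
print("Class weights:", class_weights.round(4))
```

The minority class receives a weight roughly 30 times larger than the majority class, which is how a weighted model would compensate for the imbalance without resampling.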

### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [47]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split; by default it splits 75/25
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [48]:
X_train.shape

(58152, 7)
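Because the labels are imbalanced, it can help to keep the class ratio the same in both splits. The following sketch uses synthetic labels (an assumption, not the lending data) to show the `stratify` parameter of `train_test_split`; the split above does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data for illustration only: 5% positives
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 50 + [0] * 950)

# stratify=y_demo keeps the class proportions approximately equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print("Train shape:", X_tr.shape, "positive share:", y_tr.mean())
print("Test shape: ", X_te.shape, "positive share:", y_te.mean())
```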

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

The logistic regression model uses a logistic function (also known as a sigmoid function) to transform the linear combination of the independent variables into a probability value between 0 and 1. This probability represents the likelihood that the input sample belongs to the positive class. The logistic function has an S-shaped curve that approaches 0 for very large negative inputs and 1 for very large positive inputs, and it equals 0.5 when the input is 0.
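As a quick illustration of the curve described above, here is a minimal NumPy sketch of the logistic function. The function name `sigmoid` and the sample inputs are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs map close to 0, large positive inputs close to 1,
# and an input of 0 maps to exactly 0.5.
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probabilities = sigmoid(z)
print(probabilities.round(4))
```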

The logistic regression model can be trained using a variety of optimization algorithms, such as gradient descent or Newton's method, to find the coefficients that minimize the log loss (cross-entropy) between the predicted probabilities and the actual labels in the training data. Once the model is trained, it can be used to make predictions on new data by applying the logistic function to the linear combination of the independent variables and comparing the resulting probability to a decision threshold, usually set to 0.5.
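The thresholding step can be sketched as follows. This example uses synthetic data from `make_classification` (an assumption, not the lending data) and shows that applying a 0.5 cutoff to `predict_proba` reproduces `predict`, while a stricter cutoff flags fewer samples as positive.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=1)
clf = LogisticRegression(random_state=1).fit(X_demo, y_demo)

# Probability of the positive class (second column of predict_proba)
positive_proba = clf.predict_proba(X_demo)[:, 1]

# Apply the default 0.5 threshold by hand
manual_pred = (positive_proba > 0.5).astype(int)

# A stricter threshold can only flag the same number of positives or fewer
strict_pred = (positive_proba > 0.8).astype(int)

print("Matches clf.predict:", np.array_equal(manual_pred, clf.predict(X_demo)))
print("Positives at 0.5:", manual_pred.sum(), "| at 0.8:", strict_pred.sum())
```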

Logistic regression is a widely used technique in many fields, such as medicine, economics, and social sciences, for tasks such as predicting the likelihood of a patient having a certain disease, classifying email messages as spam or not spam, or predicting the outcome of a political election.

In [49]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression


The `max_iter` parameter caps the number of iterations the solver may run. If the solver stops before converging, scikit-learn emits a `ConvergenceWarning`, and raising `max_iter` can help. Once the solver has converged, a larger value does not change the result. Here, the models with `max_iter=100` and `max_iter=200` produce identical training and testing scores, which suggests that 100 iterations were already enough for this dataset.
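One way to check whether the iteration cap matters is to inspect the fitted model's `n_iter_` attribute, which reports how many iterations the solver actually used. The sketch below runs on synthetic data (an assumption, not the lending data), so the iteration counts it prints say nothing about the lending models.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data with seven features, for illustration only
X_demo, y_demo = make_classification(n_samples=1000, n_features=7, random_state=1)

iterations_used = {}
for cap in (100, 200):
    clf = LogisticRegression(solver="lbfgs", max_iter=cap, random_state=1)
    with warnings.catch_warnings():
        # Silence the warning here; n_iter_ is what tells us whether the cap was hit
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        clf.fit(X_demo, y_demo)
    iterations_used[cap] = int(clf.n_iter_[0])
    print(f"max_iter={cap}: solver used {iterations_used[cap]} iterations")
```

If the reported count is below the cap, the solver converged on its own and increasing `max_iter` further would not change the fitted coefficients.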

In [50]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
model_100 = LogisticRegression(solver='lbfgs', max_iter=100, random_state=1)


In [60]:
# Instantiate a second model with a higher iteration cap for comparison
model_200 = LogisticRegression(solver='lbfgs', max_iter=200, random_state=1)


In [51]:
# Fit the model using training data
model_100.fit(X_train, y_train)

In [61]:
model_200.fit(X_train, y_train)

In [52]:
# Score the model
print(f"Training Data Score: {model_100.score(X_train, y_train)}")
print(f"Testing Data Score: {model_100.score(X_test, y_test)}")

Training Data Score: 0.9921240885954051
Testing Data Score: 0.9918489475856377


In [62]:
# Score the model
print(f"Training Data Score: {model_200.score(X_train, y_train)}")
print(f"Testing Data Score: {model_200.score(X_test, y_test)}")

Training Data Score: 0.9921240885954051
Testing Data Score: 0.9918489475856377


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [53]:
# Make a prediction using the testing data
pred = model_100.predict(X_test)


In [57]:
predictions_df = pd.DataFrame({"Prediction": pred, "Actual": y_test})
predictions_df.head()

|   | Prediction | Actual |
|---|---|---|
| 60914 | 0 | 0 |
| 36843 | 0 | 0 |
| 1966 | 0 | 0 |
| 70137 | 0 | 0 |
| 27237 | 0 | 0 |


In [58]:
predictions_df.describe()

|   | Prediction | Actual |
|---|---|---|
| count | 19384.0 | 19384.0 |
| mean | 0.034307 | 0.031934 |
| std | 0.18202 | 0.175828 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 |
| max | 1.0 | 1.0 |


### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [59]:
# Print the accuracy score of the model
from sklearn.metrics import accuracy_score

accuracy_score(y_test, pred)

0.9918489475856377

A confusion matrix is a table used to evaluate the performance of a supervised classification model. It shows the number of correct and incorrect predictions made by the model compared to the actual outcomes. Metrics such as accuracy, precision, recall, and F1-score can be derived from it.

In [64]:
# Generate a confusion matrix for the model
from sklearn.metrics import confusion_matrix

# Use a separate name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, pred)



In [65]:
# Print the confusion matrix for the model
print(cm)


[[18663   102]
 [   56   563]]


In [67]:
# The overall accuracy of the model can be computed as (TP+TN)/(TP+TN+FP+FN)

# Calculate the overall accuracy of the model
accuracy = (cm[0][0] + cm[1][1]) / cm.sum()

print("Overall accuracy: {:.2f}%".format(accuracy*100))


Overall accuracy: 99.18%
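The per-class metrics can also be derived by hand from the four counts in the confusion matrix printed above. This is a minimal sketch using only NumPy; the name `cm_demo` and the metric variable names are assumptions, and the matrix values are copied from the output above rather than recomputed from the model.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are actual labels, columns are predicted labels
cm_demo = np.array([[18663, 102],
                    [56, 563]])
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
precision_1 = tp / (tp + fp)   # of loans predicted high-risk, share truly high-risk
recall_1 = tp / (tp + fn)      # of truly high-risk loans, share detected
recall_0 = tn / (tn + fp)      # of truly healthy loans, share detected
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
balanced_accuracy = (recall_0 + recall_1) / 2

print(f"Accuracy:             {accuracy:.4f}")
print(f"Precision (class 1):  {precision_1:.4f}")
print(f"Recall (class 1):     {recall_1:.4f}")
print(f"F1-score (class 1):   {f1_1:.4f}")
print(f"Balanced accuracy:    {balanced_accuracy:.4f}")
```

Balanced accuracy averages the recall of both classes, so it is noticeably lower than plain accuracy when the minority class is predicted less reliably than the majority class.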


### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

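**Answer:** The model predicts healthy loans (`0`) very well: 18,663 of 18,765 were classified correctly, and 102 were wrongly flagged as high-risk (false positives). For high-risk loans (`1`), 563 of 619 were identified correctly (recall of about 91%), and 56 were missed (false negatives). Of the 665 loans predicted as high-risk, 563 actually were (precision of about 85%), so the model is noticeably weaker on the minority class than the 99% overall accuracy suggests.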


---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points. 

In [13]:
# Import the RandomOverSampler module from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
# YOUR CODE HERE!

# Fit the original training data to the random_oversampler model
# YOUR CODE HERE!

ModuleNotFoundError: No module named 'imblearn'

In [None]:
# Count the distinct values of the resampled labels data
# YOUR CODE HERE!
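Since the `imblearn` import failed in this environment, the idea behind random oversampling can still be sketched with pandas and NumPy alone. The block below is a stand-in under stated assumptions, not the imbalanced-learn implementation: it uses a small synthetic training set and hypothetical names (`X_train_demo`, `y_train_demo`, `X_resampled`, `y_resampled`). With imbalanced-learn installed, the intended call would be `RandomOverSampler(random_state=1).fit_resample(X_train, y_train)`.

```python
import numpy as np
import pandas as pd

# Small synthetic training set for illustration: 95 healthy, 5 high-risk
rng = np.random.default_rng(1)
X_train_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_train_demo = pd.Series([0] * 95 + [1] * 5, name="loan_status")

# Find the minority class and how many extra rows it needs
counts = y_train_demo.value_counts()
minority_label = counts.idxmin()
n_extra = counts.max() - counts.min()

# Draw minority-class rows with replacement until both classes have equal counts
minority_index = y_train_demo[y_train_demo == minority_label].index.to_numpy()
extra_index = rng.choice(minority_index, size=n_extra, replace=True)

X_resampled = pd.concat([X_train_demo, X_train_demo.loc[extra_index]], ignore_index=True)
y_resampled = pd.concat([y_train_demo, y_train_demo.loc[extra_index]], ignore_index=True)

# Count the distinct values of the resampled labels
print(y_resampled.value_counts())
```

Only the training data is resampled; the testing data keeps its original class balance so that the evaluation reflects real-world proportions.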

### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [None]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
# YOUR CODE HERE!

# Fit the model using the resampled training data
# YOUR CODE HERE!

# Make a prediction using the testing data
# YOUR CODE HERE!

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [None]:
# Print the balanced_accuracy score of the model 
# YOUR CODE HERE!

In [None]:
# Generate a confusion matrix for the model
# YOUR CODE HERE!

In [None]:
# Print the classification report for the model
# YOUR CODE HERE!
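The three evaluation calls listed in this step can be sketched on a tiny hand-made example. The labels and predictions below are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the lending model.

```python
from sklearn.metrics import (
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
)

# Hypothetical labels and predictions, for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 0]

# Balanced accuracy is the mean of the per-class recalls: (3/4 + 1/2) / 2
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)

# Rows are actual labels, columns are predicted labels
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Per-class precision, recall, and F1-score in a text table
report = classification_report(
    y_true_demo, y_pred_demo, target_names=["healthy loan", "high-risk loan"]
)

print("Balanced accuracy:", bal_acc)
print(cm_demo)
print(report)
```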

### Step 4: Answer the following question

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** YOUR ANSWER HERE!