# Effectiveness of Random Oversampling

In this activity, you’ll fit logistic regression models to both imbalanced data and resampled data. You’ll then compare the results by using the metrics that you’ve learned.

## Instructions

1. Read in the CSV file from the `Resources` folder into a Pandas DataFrame.  

2. Create a Series named `y` that contains the data from the "Default" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named `X` that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

3. Split the features and labels into training and testing sets.

4. Check the magnitude of imbalance in the dataset by viewing the count of each distinct value (`value_counts`) for the labels.

5. Resample the training data by using `RandomOverSampler`.

6. Check the count of each distinct value (`value_counts`) for the resampled labels.

7. Fit two logistic regression models: one for the resampled data and another for the original data.

8. Using the two logistic regression models, predict the values for the original and resampled sets.

9. Print the confusion matrices, accuracy scores, and classification reports for the original and resampled datasets.

10. Evaluate the effectiveness of random oversampling for predicting the minority class. Answer the following question: Does the model accurately flag all the loans that eventually defaulted?


## References

Following are links to the scikit-learn classes and functions used in this activity:

[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

[confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

[balanced_accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html)

Following are links to the imbalanced-learn classes and functions used in this activity:

[RandomOverSampler](https://imbalanced-learn.org/stable/generated/imblearn.over_sampling.RandomOverSampler.html)

[classification_report_imbalanced](https://imbalanced-learn.org/stable/generated/imblearn.metrics.classification_report_imbalanced.html)

In [1]:
# Import the required modules
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import classification_report_imbalanced



## Step 1: Read in the CSV file from the `Resources` folder into a Pandas DataFrame. 

In [None]:
# Read the sba_loans.csv file from the Resources folder into a Pandas DataFrame
loans_df = # YOUR CODE HERE

# Review the DataFrame
# YOUR CODE HERE
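The read-and-review pattern for this step can be sketched on a stand-in file. The sketch below writes a tiny hypothetical CSV to a temporary folder so that it runs anywhere; the file name `toy_loans.csv` and the `Amount` and `Term` columns are invented for illustration. In the activity, the path would instead point at `sba_loans.csv` in the `Resources` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the activity's CSV file; the column names
# other than "Default" are assumptions made for this sketch.
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / "toy_loans.csv"
csv_path.write_text("Amount,Term,Default\n1000,12,0\n2500,24,1\n800,6,0\n")

# A Path object can be passed directly to read_csv
toy_df = pd.read_csv(csv_path)

# Review the first rows and the overall shape
print(toy_df.head())
print(toy_df.shape)
```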


## Step 2: Create a Series named `y` that contains the data from the "Default" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named `X` that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

In [None]:
# Split the data into X (features) and y (labels)

# The y variable should focus on the Default column
# YOUR CODE HERE

# The X variable should include all features except the Default column
# YOUR CODE HERE
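One common way to separate labels from features is to select the label column for `y` and drop it for `X`. The sketch below uses a small hypothetical DataFrame; only the `Default` column name comes from the activity.

```python
import pandas as pd

# Hypothetical toy data; "Amount" and "Term" are invented feature names
toy_df = pd.DataFrame(
    {
        "Amount": [1000, 2500, 800, 1200],
        "Term": [12, 24, 6, 18],
        "Default": [0, 1, 0, 0],
    }
)

# Labels: a Series holding only the "Default" column
y = toy_df["Default"]

# Features: every column except "Default" (drop returns a new DataFrame)
X = toy_df.drop(columns="Default")

print(X.columns.tolist())
print(y.name)
```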


## Step 3: Split the features and labels into training and testing sets.

In [None]:
# Split data
X_train, X_test, y_train, y_test = # YOUR CODE HERE
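A minimal sketch of the split on hypothetical toy data follows. The `random_state=1` value and the `stratify` argument are assumptions of this sketch, not requirements of the activity; stratifying simply keeps the class proportions similar in both sets, which can help when the minority class is small.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10 of them in the minority class
toy_X = pd.DataFrame({"Amount": range(100), "Term": range(100, 200)})
toy_y = pd.Series([1] * 10 + [0] * 90, name="Default")

# By default, 25% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, random_state=1, stratify=toy_y
)

print(X_train.shape, X_test.shape)
```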

## Step 4: Check the magnitude of imbalance in the dataset by viewing the count of each distinct value (`value_counts`) for the labels.

In [None]:
# Count the occurrences of each distinct value in the original labels data
# YOUR CODE HERE
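As an illustration of how `value_counts` exposes imbalance, the sketch below counts a hypothetical label Series with a 90/10 split. Passing `normalize=True` reports proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical labels: 90 non-defaults (0) and 10 defaults (1)
toy_y = pd.Series([0] * 90 + [1] * 10, name="Default")

# Raw count of each distinct label value
counts = toy_y.value_counts()
print(counts)

# The same information expressed as proportions
ratio = toy_y.value_counts(normalize=True)
print(ratio)
```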


## Step 5: Resample the training data by using `RandomOverSampler`.

In [None]:
# Resample the data using RandomOverSampler

# Use RandomOverSampler to create a sampler instance
# Set the random_state parameter to 1
random_oversampler = # YOUR CODE HERE

# Fit the random_oversampler to the original training data and resample it
X_resampled, y_resampled = # YOUR CODE HERE


## Step 6: Check the count of each distinct value (`value_counts`) for the resampled labels.

In [None]:
# Count the occurrences of each distinct value in the resampled labels data
# YOUR CODE HERE


## Step 7: Fit two logistic regression models: one for the resampled data and another for the original data.

In [None]:
# Declare a logistic regression model
# Set the random_state parameter to 1
model = LogisticRegression(random_state=1)

In [None]:
# Fit a logistic regression for the original data.
lr_original_model = # YOUR CODE HERE


In [None]:
# Declare a logistic regression model
# Set the random_state parameter to 1
model = LogisticRegression(random_state=1)

In [None]:
# Fit a logistic regression for the resampled data
lr_resampled_model = # YOUR CODE HERE


## Step 8: Using the two logistic regression models, predict the values for the original and resampled sets.

In [None]:
# Predict labels for testing features using the original logistic regression model
y_original_pred = # YOUR CODE HERE

In [None]:
# Predict the labels for the testing features using the resampled logistic regression model
y_resampled_pred = # YOUR CODE HERE

## Step 9: Print the confusion matrices, accuracy scores, and classification reports for the original and resampled datasets.

In [None]:
# Print the confusion matrix for the original data
# YOUR CODE HERE


In [None]:
# Print the confusion matrix for the resampled data
# YOUR CODE HERE
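To make the layout of a confusion matrix concrete, the sketch below uses small hand-written label lists (hypothetical values, not results from the loan data). In scikit-learn, rows are actual classes and columns are predicted classes, so for labels `0` and `1` the matrix reads `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 8 actual non-defaults (0) and 2 actual defaults (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Flatten the 2x2 matrix into its four cells
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Here one of the two actual defaults is missed (`FN=1`), which is the kind of error Step 10 asks about.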


In [None]:
# Print the balanced accuracy score for the original data
# YOUR CODE HERE


In [None]:
# Print the balanced accuracy score for the resampled data
# YOUR CODE HERE
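The sketch below contrasts plain accuracy with balanced accuracy on hypothetical hand-written labels. Balanced accuracy is the average of the per-class recall values, so a model that misses much of the minority class scores lower than its plain accuracy suggests. The use of `accuracy_score` here is an addition for comparison only.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 8 actual non-defaults (0) and 2 actual defaults (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Plain accuracy: 8 of 10 predictions are correct
acc = accuracy_score(y_true, y_pred)

# Balanced accuracy: mean of recall for class 0 (7/8) and class 1 (1/2)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(acc)      # 0.8
print(bal_acc)  # 0.6875
```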


In [None]:
# Print the classification report for the original data
# YOUR CODE HERE


In [None]:
# Print the classification report for the resampled data
# YOUR CODE HERE


## Step 10: Evaluate the effectiveness of random oversampling for predicting the minority class. Answer the following question.

**Question:** Does the model generated using the resampled data more accurately flag all the loans that eventually defaulted?
    
**Answer:** # YOUR ANSWER HERE 