# Student Loan Risk with Deep Learning

In [1]:
# Imports
import pandas as pd
from pathlib import Path
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

---

## Prepare the data to be used on a neural network model

### Step 1: Read the `student_loans.csv` file into a Pandas DataFrame. Review the DataFrame, looking for columns that could eventually define your features and target variables.   

In [4]:
# Read the csv into a Pandas DataFrame
file_path = "https://static.bc-edx.com/mbc/ai/m6/datasets/student_loans.csv"
df = pd.read_csv(file_path)

# Review the DataFrame
df.sample(7)

Unnamed: 0,payment_history,location_parameter,stem_degree_score,gpa_ranking,alumni_success,study_major_code,time_to_completion,finance_workshop_score,cohort_ranking,total_loan_score,financial_aid_score,credit_ranking
145,8.1,0.67,0.55,1.8,0.117,32.0,141.0,0.9968,3.17,0.62,9.4,5
591,6.6,0.39,0.49,1.7,0.07,23.0,149.0,0.9922,3.12,0.5,11.5,6
456,8.9,0.59,0.39,2.3,0.095,5.0,22.0,0.9986,3.37,0.58,10.3,5
872,7.3,0.35,0.24,2.0,0.067,28.0,48.0,0.99576,3.43,0.54,10.0,4
992,6.5,0.4,0.1,2.0,0.076,30.0,47.0,0.99554,3.36,0.48,9.4,6
983,9.1,0.5,0.3,1.9,0.065,8.0,17.0,0.99774,3.32,0.71,10.5,6
1141,8.2,0.38,0.32,2.5,0.08,24.0,71.0,0.99624,3.27,0.85,11.0,6


In [5]:
# Check for missing values and duplicated rows, then review the column data types
if df.isna().values.any():
  print('There are missing values in the dataset.')
else:
  print('There are no missing (null) values.')
duplicated_rows = df[df.duplicated()]
if len(duplicated_rows) > 0:
  print(f'There are {len(duplicated_rows)} duplicated rows in the dataset.')
# Confirm that every column has a numeric dtype before reporting it
if df.select_dtypes(exclude='number').empty:
  print('All values are numerical.')
df.info()

There are no missing (null) values.
There are 240 duplicated rows in the dataset.
All values are numerical.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   payment_history         1599 non-null   float64
 1   location_parameter      1599 non-null   float64
 2   stem_degree_score       1599 non-null   float64
 3   gpa_ranking             1599 non-null   float64
 4   alumni_success          1599 non-null   float64
 5   study_major_code        1599 non-null   float64
 6   time_to_completion      1599 non-null   float64
 7   finance_workshop_score  1599 non-null   float64
 8   cohort_ranking          1599 non-null   float64
 9   total_loan_score        1599 non-null   float64
 10  financial_aid_score     1599 non-null   float64
 11  credit_ranking          1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage:

#### Inspect the features' values and target distribution

In [6]:
print(f"Balance of target variables?\n\n{df['credit_ranking'].value_counts()}.\n")
print('Statistical spread of values in columns:')
df.describe()

Balance of target variables?

5    681
6    638
7    199
4     53
8     18
3     10
Name: credit_ranking, dtype: int64.

Statistical spread of values in columns:


Unnamed: 0,payment_history,location_parameter,stem_degree_score,gpa_ranking,alumni_success,study_major_code,time_to_completion,finance_workshop_score,cohort_ranking,total_loan_score,financial_aid_score,credit_ranking
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


* The `credit_ranking` (target) values are imbalanced: rankings 5 and 6 account for 1,319 of the 1,599 rows, while rankings 3 and 8 have only 10 and 18 rows.
Rebalancing the training data may be needed.
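
The size of the imbalance can be quantified directly from the value counts printed above. This is a minimal sketch, assuming the counts are copied by hand into a dictionary; the names `counts`, `majority_share` and `imbalance_ratio` are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10}

total = sum(counts.values())

# Share of all rows held by the two most common rankings (5 and 6)
majority_share = (counts[5] + counts[6]) / total

# Ratio between the most and the least frequent ranking
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"total rows: {total}")
print(f"share of rankings 5 and 6: {majority_share:.1%}")
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```

A ratio this large is the usual motivation for resampling the training data before fitting.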

#### Testing credit significance of a subset of data
Check the rows where `stem_degree_score == 0`, since it is the only column that contains zero values (i.e. min = 0).

In [7]:
df['credit_ranking'].loc[df['stem_degree_score'] == 0].value_counts()

5    57
6    54
4    10
7     8
3     3
Name: credit_ranking, dtype: int64

The subset appears to be distributed broadly in line with the full dataset: rankings 5 and 6 still dominate.
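
To make the comparison less of an eyeball judgement, the two sets of counts can be turned into proportions and printed side by side. This is a sketch under the assumption that the counts are copied from the two outputs above; the helper `shares` is illustrative.

```python
# Counts copied from the two value_counts() outputs above
overall = {5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10}
subset = {5: 57, 6: 54, 4: 10, 7: 8, 3: 3}  # rows with stem_degree_score == 0

def shares(counts):
    """Convert raw counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

overall_shares = shares(overall)
subset_shares = shares(subset)

print("ranking  overall  subset")
for label in sorted(overall):
    # Rankings missing from the subset (here, 8) are shown with a share of 0
    print(f"{label:>7}  {overall_shares[label]:7.1%}  {subset_shares.get(label, 0.0):6.1%}")
```

Rankings 5 and 6 hold similar shares in both groups, while the smaller classes differ more, which is expected with so few rows.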

### Step 2: Using the preprocessed data, create the features (`X`) and target (`y`) datasets. The target dataset should be defined by the preprocessed DataFrame column “credit_ranking”. The remaining columns should define the features dataset.

In [8]:
# Define the target set y using the credit_ranking column
y = df['credit_ranking'].to_numpy()
# Display a sample of y
y[:5]

array([5, 5, 5, 6, 5])

In [9]:
# Define features set X by selecting all columns but credit_ranking
X = df.drop(columns='credit_ranking')

# Review the features DataFrame
X.head()

Unnamed: 0,payment_history,location_parameter,stem_degree_score,gpa_ranking,alumni_success,study_major_code,time_to_completion,finance_workshop_score,cohort_ranking,total_loan_score,financial_aid_score
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4


### Step 3: Split the features and target sets into training and testing datasets.


In [10]:
# Split the preprocessed data into a training and testing dataset
# Assign the function a random_state equal to 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
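
`train_test_split` is called here without a `test_size`, so scikit-learn's default of 0.25 applies. The sketch below works out the resulting split sizes, assuming the test count is rounded up as scikit-learn does for a fractional `test_size`; the variable names are illustrative.

```python
import math

n_samples = 1599   # rows in the DataFrame, from df.info()
test_size = 0.25   # scikit-learn's default when no size is given

n_test = math.ceil(test_size * n_samples)  # the test count is rounded up
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```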

### Step 4: Use scikit-learn's `StandardScaler` to scale the features data.

In [11]:
# Create a StandardScaler instance
X_scaler = StandardScaler()

# Fit the scaler to the features training dataset
X_scaler.fit(X_train)

# Scale the features training and testing datasets
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
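
Fitting the scaler on the training data only keeps test-set statistics out of the model. Per column, `StandardScaler` applies a z-score using the training mean and the population standard deviation. The following pure-Python sketch shows that calculation on made-up numbers; the values and helper names are illustrative only.

```python
import statistics

def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Apply the z-score learned from the training column."""
    return [(value - mean) / std for value in column]

train_col = [7.4, 7.8, 7.8, 11.2]  # illustrative training values
test_col = [6.6, 8.9]              # illustrative test values

mean, std = fit_scaler(train_col)             # statistics come from training data only
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)  # the test data reuses those statistics

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1; the scaled test column generally does not, because it is measured against the training statistics.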

### Additional Step: Since the target data contains far fewer of the lower credit rankings, generate a resampled set of training data.

In [12]:
# Import SMOTE from imblearn
from imblearn.over_sampling import SMOTE

# Instantiate the SMOTE instance
# Set the sampling_strategy parameter equal to auto
smote_sampler = SMOTE(random_state=1, sampling_strategy='auto')

# Fit the training data to the smote_sampler model
X_resampled, y_resampled = smote_sampler.fit_resample(X_train_scaled, y_train)

# Compare distinct value counts for the original and resampled target training data
print(f'Original Value counts:\n{pd.DataFrame(y_train).value_counts()}')
print(f'Resampled value counts:\n{pd.DataFrame(y_resampled).value_counts()}')

Original Value counts:
5    510
6    471
7    157
4     37
8     15
3      9
dtype: int64
Resampled value counts:
3    510
4    510
5    510
6    510
7    510
8    510
dtype: int64
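
SMOTE builds each synthetic sample by interpolating between a minority-class point and one of its nearest neighbours from the same class. The sketch below shows only that interpolation step: it skips the neighbour search, and the function name and sample points are illustrative rather than imblearn's API.

```python
import random

def interpolate(sample, neighbour, rng):
    """Return a synthetic point on the segment between sample and neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(1)
sample = [0.2, -1.0, 0.5]     # illustrative scaled minority-class point
neighbour = [0.6, -0.4, 0.1]  # illustrative same-class neighbour

synthetic = interpolate(sample, neighbour, rng)
print(synthetic)
```

With `sampling_strategy='auto'`, every class except the majority is raised to the majority count, which gives 6 × 510 = 3,060 training rows here.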


---

## First establish a baseline accuracy with hyperparameter-tuned RandomForest and XGBoost models
a) RandomForest

b) XGBoost

In [13]:
# import the additional libraries
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [14]:
rf_regressor = RandomForestRegressor(random_state=1)
# set up rf search grid
rf_param_grid = {"max_depth": np.arange(1,7,1),
              "n_estimators": np.arange(60,200,20)}

# try out every combination of the above values
rf_search = GridSearchCV(rf_regressor, rf_param_grid, cv=5).fit(X_train_scaled, y_train)
# rf_search = GridSearchCV(rf_regressor, rf_param_grid, cv=5).fit(X_resampled, y_resampled)
# rf_search = RandomizedSearchCV(rf_regressor, rf_param_grid, cv=5).fit(X_train_scaled, y_train)
# rf_search = RandomizedSearchCV(rf_regressor, rf_param_grid, cv=5).fit(X_resampled, y_resampled)
print("The best hyperparameters are ", rf_search.best_params_)

The best hyperparameters are  {'max_depth': 6, 'n_estimators': 180}


In [15]:
# apply best parameters
tuned_rf_regressor = RandomForestRegressor(
    n_estimators=rf_search.best_params_["n_estimators"],
    max_depth=rf_search.best_params_["max_depth"],
    random_state=1)
# fit (train) the model
tuned_rf_regressor.fit(X_train_scaled, y_train)
# evaluate
y_pred_rf = tuned_rf_regressor.predict(X_test_scaled).round().astype("int32").ravel()

rf_accuracy = accuracy_score(y_test, y_pred_rf)
print(f'Random Forest Regressor model accuracy score: {rf_accuracy * 100:.3f}%')

Random Forest Regressor model accuracy score: 62.250%


In [16]:
xg_regressor=xgb.XGBRegressor(random_state=1)
# set up xg search grid
xg_param_grid = {"max_depth": np.arange(1,7,1),
              "n_estimators": np.arange(60,200,20),
              "learning_rate": np.arange(0.01, 0.95, 0.05)}

# try out every combination of the above values
xg_search = GridSearchCV(xg_regressor, xg_param_grid, cv=5).fit(X_train_scaled, y_train)
# xg_search = GridSearchCV(xg_regressor, xg_param_grid, cv=5).fit(X_resampled, y_resampled)
# xg_search = RandomizedSearchCV(xg_regressor, xg_param_grid, cv=5).fit(X_train_scaled, y_train)
# xg_search = RandomizedSearchCV(xg_regressor, xg_param_grid, cv=5).fit(X_resampled, y_resampled)
print("The best hyperparameters are ", xg_search.best_params_)

The best hyperparameters are  {'learning_rate': 0.11, 'max_depth': 5, 'n_estimators': 100}
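
An exhaustive grid search fits one model per parameter combination per fold, so its cost grows multiplicatively with each added parameter. The sketch below counts the fits implied by the two grids above, assuming the same ranges and `cv=5`; `arange_len` is an illustrative stand-in for the length of `np.arange`.

```python
import math

def arange_len(start, stop, step):
    """Number of values np.arange(start, stop, step) would produce."""
    return max(0, math.ceil((stop - start) / step))

cv_folds = 5

# max_depth has 6 values and n_estimators has 7 in both grids
rf_combos = arange_len(1, 7, 1) * arange_len(60, 200, 20)
# The XGBoost grid adds a learning_rate axis on top of those
xg_combos = rf_combos * arange_len(0.01, 0.95, 0.05)

print(f"RandomForest: {rf_combos} combinations, {rf_combos * cv_folds} fits")
print(f"XGBoost:      {xg_combos} combinations, {xg_combos * cv_folds} fits")
```

`GridSearchCV` also refits the best combination once on the full training set. `RandomizedSearchCV`, shown in the commented-out lines, samples only `n_iter` combinations (10 by default), which is much cheaper but less repeatable.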


In [17]:
# Make predictions and evaluate based on the search outcome
xg_srch_pred = xg_search.predict(X_test_scaled).round().astype("int32").ravel()
xg_srch_acc = accuracy_score(y_test, xg_srch_pred)
print(f'XG Boost model accuracy score: {xg_srch_acc * 100:.3f}%')

XG Boost model accuracy score: 68.750%


---

## Compile and Evaluate a Model Using a Neural Network

### Step 1: Create a deep neural network by assigning the number of input features, the number of layers, and the number of neurons on each layer using Tensorflow’s Keras.

> **Hint** You can start with a two-layer deep neural network model that uses the `relu` activation function for both layers.


In [18]:
# Define the number of inputs (features) to the model
num_inputs = 11
print(f'Num Inputs: {num_inputs}')
# Review the number of features
print(f'Num features: {X.shape[1]}')

Num Inputs: 11
Num features: 11


In [19]:
# Define the number of neurons in the output layer
output_neurons = 1

In [20]:
# Define the number of hidden nodes for the first hidden layer
hidden_nodes = []
hidden_nodes.append(6)
# Review the number of hidden nodes in the first layer
print(f'Hidden nodes in the first layer: {hidden_nodes[0]}')

Hidden nodes in the first layer: 6


In [21]:
# Define the number of hidden nodes for the second hidden layer
hidden_nodes.append(3)
# Review the number of hidden nodes in the second layer
print(f'Hidden nodes in the second layer: {hidden_nodes[1]}')

Hidden nodes in the second layer: 3


In [22]:
# Create the Sequential model instance
nn = Sequential()

In [23]:
# Add the first hidden layer
nn.add(Dense(input_dim=num_inputs,
             units=hidden_nodes[0],
             activation='relu'))


In [24]:
# Add the second hidden layer
nn.add(Dense(units=hidden_nodes[1],
             activation='relu'))

In [25]:
# Add the output layer to the model specifying the number of output neurons and activation function
nn.add(Dense(units=output_neurons,
             activation='linear'))

In [26]:
# Display the Sequential model summary
nn.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 6)                 72        
                                                                 
 dense_1 (Dense)             (None, 3)                 21        
                                                                 
 dense_2 (Dense)             (None, 1)                 4         
                                                                 
Total params: 97 (388.00 Byte)
Trainable params: 97 (388.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
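
The parameter counts in the summary follow from the layer sizes: each `Dense` layer has one weight per input per unit, plus one bias per unit. A short sketch that reproduces the numbers above (the helper name is illustrative):

```python
def dense_param_counts(num_inputs, layer_units):
    """Return the parameter count of each Dense layer: (inputs + 1 bias) * units."""
    counts = []
    previous = num_inputs
    for units in layer_units:
        counts.append((previous + 1) * units)
        previous = units  # this layer's outputs feed the next layer
    return counts

# 11 input features, hidden layers of 6 and 3 units, 1 output unit
layer_counts = dense_param_counts(11, [6, 3, 1])
print(layer_counts, sum(layer_counts))
```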


### Step 2: Compile and fit the model using the `mse` loss function, the `adam` optimizer, and the `mse` evaluation metric.


In [27]:
# Compile the Sequential model
nn.compile(loss="mean_squared_error", optimizer="adam", metrics=["mse"])

In [28]:
# Fit the model using 50 epochs and the training data
seq_model = nn.fit(X_train_scaled, y_train, epochs=50)

Epoch 1/50
...
Epoch 50/50


### Step 3: Evaluate the model using the test data to determine the model’s loss and accuracy.


In [29]:
# Evaluate the model using the evaluate method and the test data.
# The model was compiled with metrics=["mse"], so the second value returned
# is the test MSE (lower is better), not a classification accuracy.
seq_model_loss, seq_model_accuracy = nn.evaluate(X_test_scaled, y_test, verbose=2)

# Display the model loss and MSE results
print(f"Loss: {seq_model_loss}, MSE: {seq_model_accuracy:.5f}")

13/13 - 0s - loss: 0.8079 - mse: 0.8079 - 169ms/epoch - 13ms/step
Loss: 0.8079410791397095, MSE: 0.80794


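The test MSE is about 0.81. This is an error measure (lower is better) rather than an accuracy, so it is not directly comparable with the RandomForest baseline. Save the model, but it is definitely worth seeking a better result for the required purpose.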
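
Since the network is a regressor compiled with `metrics=["mse"]`, a classification-style accuracy has to be computed separately, for example by rounding the predictions as was done for the tree models. The pure-Python sketch below uses made-up values to show the two measures side by side; the helper names are illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and raw predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rounded_accuracy(y_true, y_pred):
    """Share of predictions that equal the target after rounding to an integer."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if round(p) == t)
    return hits / len(y_true)

y_true = [5, 6, 5, 7]          # illustrative rankings
y_pred = [5.2, 5.4, 6.1, 6.8]  # illustrative raw regressor outputs

mse = mean_squared_error(y_true, y_pred)
accuracy = rounded_accuracy(y_true, y_pred)

print(f"MSE: {mse:.4f}")
print(f"Accuracy after rounding: {accuracy:.2%}")
```

In the notebook, the same idea would be `accuracy_score(y_test, nn.predict(X_test_scaled).round().astype("int32").ravel())`, mirroring the RandomForest cell.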

In [30]:
# Set and save the model's file path - allowing for multiple notebook runs
file_path = Path("saved_models/student_loans.h5")
# Create the target folder if it does not exist yet
file_path.parent.mkdir(parents=True, exist_ok=True)
file_paths = [file_path.name]

# Export your model to an HDF5 file
nn.save(file_path)



#### Apply the following additional steps:

1) apply SMOTE resampled training data to the Sequential model and evaluate

2) add additional hidden layers and try with both non-sampled and SMOTE resampled training data

In [31]:
# Create a separate instance for training with sampled data
nn_smote = Sequential()
nn_smote.add(Dense(input_dim=num_inputs,
                   units=hidden_nodes[0],
                   activation='relu'))
nn_smote.add(Dense(units=hidden_nodes[1],
                   activation='relu'))
nn_smote.add(Dense(units=output_neurons,
                   activation='linear'))
nn_smote.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 6)                 72        
                                                                 
 dense_4 (Dense)             (None, 3)                 21        
                                                                 
 dense_5 (Dense)             (None, 1)                 4         
                                                                 
Total params: 97 (388.00 Byte)
Trainable params: 97 (388.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [32]:
# Compile the Sequential model
nn_smote.compile(loss="mean_squared_error", optimizer="adam", metrics=["mse"])

In [33]:
# Fit using 100 epochs and the SMOTE sampled training data
smote_seq_model = nn_smote.fit(X_resampled, y_resampled, epochs=100)

Epoch 1/100
...
Epoch 77/100
Epoch 78

In [34]:
# Evaluate the SMOTE model using the evaluate method and the test data.
# The second value returned is the compiled "mse" metric, not an accuracy.
smote_model_loss, smote_model_accuracy = nn_smote.evaluate(X_test_scaled, y_test, verbose=2)

# Display the model loss and MSE results
print(f"Loss: {smote_model_loss}, MSE: {smote_model_accuracy:.5f}")

13/13 - 0s - loss: 0.9130 - mse: 0.9130 - 142ms/epoch - 11ms/step
Loss: 0.9129549264907837, MSE: 0.91295


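The test MSE rose from about 0.81 to about 0.91, so training on the SMOTE resampled data did not reduce the test error. Save the model again and now try restructuring the model.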

In [35]:
# Set and save the model's file path - allowing for training run instance
file_path = Path("saved_models/student_loans_smote_new.h5")
file_paths.append(file_path.name)

# Export your model to an HDF5 file
nn_smote.save(file_path)

## 2) Additional Hidden Layers

In [36]:
hidden_nodes = [9, 7, 5, 3]
# redefine new nn model with additional layers
nn_2 = Sequential()

# Add the first hidden layer
nn_2.add(Dense(input_dim=num_inputs,
               units=hidden_nodes[0],
               activation='relu'))

for layer in range(1, len(hidden_nodes)):
  # Add the other hidden layer/s
  nn_2.add(Dense(units=hidden_nodes[layer],
                 activation='relu'))

# add output layer
nn_2.add(Dense(units=output_neurons,
               activation='linear'))

# print the summary configuration
nn_2.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 9)                 108       
                                                                 
 dense_7 (Dense)             (None, 7)                 70        
                                                                 
 dense_8 (Dense)             (None, 5)                 40        
                                                                 
 dense_9 (Dense)             (None, 3)                 18        
                                                                 
 dense_10 (Dense)            (None, 1)                 4         
                                                                 
Total params: 240 (960.00 Byte)
Trainable params: 240 (960.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [37]:
# Compile the new model
nn_2.compile(loss="mean_squared_error", optimizer="adam", metrics=["mse"])

# Fit the new model using 100 epochs and the training data
seq_model_2 = nn_2.fit(X_train_scaled, y_train, epochs=100)

Epoch 1/100
...
Epoch 77/100
Epoch 78

In [38]:
# Evaluate the new model; the second value returned is the compiled "mse" metric, not an accuracy
seq_model_2_loss, seq_model_2_accuracy = nn_2.evaluate(X_test_scaled, y_test, verbose=2)

# Display the model loss and MSE results
print(f"Loss: {seq_model_2_loss}, MSE: {seq_model_2_accuracy:.5f}")

13/13 - 0s - loss: 0.4223 - mse: 0.4223 - 145ms/epoch - 11ms/step
Loss: 0.42234283685684204, MSE: 0.42234


In [39]:
# Set and save the model's file path - allowing for training run instance
file_path = Path("saved_models/student_loans_2.h5")
file_paths.append(file_path.name)

# Export your model to an HDF5 file
nn_2.save(file_path)

In [40]:
# Create another instance of the new model for training with sampled data
nn_2_smote = Sequential()

# Add the first hidden layer
nn_2_smote.add(Dense(input_dim=num_inputs,
                     units=hidden_nodes[0],
                     activation='relu'))

for layer in range(1, len(hidden_nodes)):
  # Add the other hidden layer/s
  nn_2_smote.add(Dense(units=hidden_nodes[layer],
                       activation='relu'))

# add output layer
nn_2_smote.add(Dense(units=output_neurons,
                     activation='linear'))

# print the summary configuration
nn_2_smote.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_11 (Dense)            (None, 9)                 108       
                                                                 
 dense_12 (Dense)            (None, 7)                 70        
                                                                 
 dense_13 (Dense)            (None, 5)                 40        
                                                                 
 dense_14 (Dense)            (None, 3)                 18        
                                                                 
 dense_15 (Dense)            (None, 1)                 4         
                                                                 
Total params: 240 (960.00 Byte)
Trainable params: 240 (960.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [41]:
# Compile the new model
nn_2_smote.compile(loss="mean_squared_error", optimizer="adam", metrics=["mse"])

# Fit the new model using 100 epochs and the SMOTE sampled training data
smote_seq_model_2 = nn_2_smote.fit(X_resampled, y_resampled, epochs=100)

Epoch 1/100
...
Epoch 77/100
Epoch 78

In [42]:
# Evaluate the new model; the second value returned is the compiled "mse" metric, not an accuracy
smote_model_2_loss, smote_model_2_accuracy = nn_2_smote.evaluate(X_test_scaled, y_test, verbose=2)

# Display the model loss and MSE results
print(f"Loss: {smote_model_2_loss}, MSE: {smote_model_2_accuracy:.5f}")

13/13 - 0s - loss: 0.8840 - mse: 0.8840 - 215ms/epoch - 17ms/step
Loss: 0.8840392231941223, MSE: 0.88404


In [43]:
# Set and save the model's file path - allowing for training run instances
file_path = Path("saved_models/student_loans_2_smote.h5")
file_paths.append(file_path.name)

# Export your model to an HDF5 file
nn_2_smote.save(file_path)

### Step 4: Save and export your model to an HDF5 file, and name the file `student_loans.h5`.


In [44]:
# Each model saved above. Filenames stored as:
file_paths

['student_loans.h5',
 'student_loans_smote.h5',
 'student_loans_2.h5',
 'student_loans_2_smote.h5']

---
## Predict Loan Repayment Success by Using your Neural Network Model

### Step 1: Reload your saved model.

In [50]:
# Set the (best) model's file path
file_path = Path(f"saved_models/{file_paths[1]}")           # loads the new version of the standard Sequential structure fit with SMOTE and 100 epochs
# file_path = Path("saved_models/student_loans_smote.h5")    # this file was uploaded to Canvas; add it to Colab before running this line of code
# Load the model to a new object
nn2_imported = tf.keras.models.load_model(file_path)

### Step 2: Make predictions on the testing data.

In [51]:
# Make predictions on the testing data for the loaded model (lm)
lm_y_pred = nn2_imported.predict(X_test_scaled).round().astype("int32").ravel()
print(lm_y_pred[:5])
# The second value returned by evaluate is the compiled "mse" metric, not an accuracy
lm_loss, lm_accuracy = nn2_imported.evaluate(X_test_scaled, y_test, verbose=2)
print(f"Loss: {lm_loss}, MSE: {lm_accuracy:.5f}")

[6 5 6 5 7]
13/13 - 1s - loss: 0.9130 - mse: 0.9130 - 658ms/epoch - 51ms/step
Loss: 0.9129549264907837, MSE: 0.91295


### Step 3: Create a DataFrame to compare the predictions with the actual values.

In [52]:
# Create a DataFrame to compare the predictions with the actual values
results = pd.DataFrame({'prediction': lm_y_pred, 'actual': y_test})

### Step 4: Display a sample of the DataFrame you created in step 3.

In [53]:
# Display sample data
results.sample(10)

Unnamed: 0,prediction,actual
230,4,6
143,4,5
161,7,6
307,7,5
94,7,5
301,5,6
244,6,6
386,5,5
225,5,5
5,7,6


In [49]:
# Score comparisons: accuracy for the tree models, test MSE (lower is better) for the Keras models
print(f'   - -                     Random Forest Regressor model           accuracy: {rf_accuracy * 100:.3f}%')
print(f'   - -                     XG Boost model                          accuracy: {xg_srch_acc * 100:.3f}%')
print(f"({file_paths[0]})         Sequential model                        test MSE: {seq_model_accuracy:.5f}")
print(f"({file_paths[1]})   Sequential SMOTE sampled model          test MSE: {smote_model_accuracy:.5f}")
print(f"({file_paths[2]})       Expanded sequential model               test MSE: {seq_model_2_accuracy:.5f}")
print(f"({file_paths[3]}) Expanded sequential SMOTE sampled model test MSE: {smote_model_2_accuracy:.5f}")

# Compare distinct value counts for the original target and predicted target data
print('\nTarget Value Comparisons')
print(f'Original test value counts:\n{pd.DataFrame(y_test).value_counts()}')
print(f'Loaded predictions (Sequential SMOTE) value counts:\n{pd.DataFrame(lm_y_pred).value_counts()}')

   - -                     Random Forest Regressor model           accuracy: 62.250%
   - -                     XG Boost model                          accuracy: 68.750%
(student_loans.h5)         Sequential model                        test MSE: 0.80794
(student_loans_smote.h5)   Sequential SMOTE sampled model          test MSE: 0.91295
(student_loans_2.h5)       Expanded sequential model               test MSE: 0.42234
(student_loans_2_smote.h5) Expanded sequential SMOTE sampled model test MSE: 0.88404

Target Value Comparisons
Original test value counts:
5    171
6    167
7     42
4     16
8      3
3      1
dtype: int64
Loaded predictions (Sequential SMOTE) value counts:
5    147
6    120
7     74
4     32
8     21
3      4
9      2
dtype: int64


## Evaluation
### Approach
* Observed an imbalance in the target data. Given the limited data, focussed on oversampling to increase the under-represented rankings. Experimented with both SMOTE and SMOTEENN. The former yielded better results when applied to the Sequential model.
* Attempted to establish a baseline accuracy with RandomForest and XGBoost.
  * Used hyperparameter tuning to improve the chance of a strong baseline, with multiple attempts, including with SMOTE data.
  * Results were consistently in the 60% range (62.250% and 68.750% in this run).
  * GridSearch was favoured over RandomizedSearch because its best result was more consistent.
* The Keras models were compiled with `metrics=["mse"]`, so the second value returned by `evaluate` is the test MSE (lower is better), not a classification accuracy. These values are not directly comparable with the accuracy scores of the tree models.
* The highest metric value seen during model fitting was 0.91507667, with the specified Sequential model structure (2 hidden layers) and scaled data. This was lost through further experimental model training and I was unable to reproduce it. Lesson learnt: I should have saved the model before moving on (save steps added).
* The next highest value, 0.91295, came from SMOTE sampled training data with the specified Sequential model structure and 100 epochs. Because the metric is MSE, a higher value means a larger error; the predictions also included two target values (9) outside the original range of 3 to 8.
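
One way to keep rounded regression outputs inside the observed ranking range is to clip them after rounding. This is a minimal sketch, assuming the valid range is 3 to 8 as in the original target column; the helper name and sample values are illustrative.

```python
def clip_rankings(predictions, low=3, high=8):
    """Round raw predictions and clamp them to the valid ranking range."""
    return [min(max(round(p), low), high) for p in predictions]

raw = [2.2, 5.6, 8.7, 9.3]  # illustrative raw outputs, some outside the range
clipped = clip_rankings(raw)
print(clipped)
```

With NumPy arrays, as used for the notebook's predictions, `np.clip(predictions.round(), 3, 8)` would do the same thing.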

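Best result achieved in 4th training run ("saved_models/3/"). Given that results varied significantly across notebook and training re-runs, I would suggest sourcing more data overall (including a higher original proportion of lower credit rankings) to achieve higher accuracy with less risk of over-fitting.

The final recommended model (based on the data provided and prepared and on multiple training runs) was 'student_loans_smote.h5'. However, the Keras models were compiled with `metrics=["mse"]`, so their reported figures are test MSE rather than accuracy. On that measure the lowest error in this run belongs to the expanded model trained on the non-resampled data ('student_loans_2.h5', test MSE 0.42234), and the distribution of predictions from 'student_loans_smote.h5' suggests there may be some over-fitting.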