
# Student Loan Risk with Deep Learning

In [1]:
# Imports
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from pathlib import Path

---

## Prepare the data to be used on a neural network model

### Step 1: Read the `student-loans.csv` file into a Pandas DataFrame. Review the DataFrame, looking for columns that could eventually define your features and target variables.   

In [2]:
# Read the csv into a Pandas DataFrame
file_path = "https://static.bc-edx.com/ai/ail-v-1-0/m18/lms/datasets/student-loans.csv"
loans_df = pd.read_csv(file_path)

# Review the DataFrame
loans_df.head()

| | payment_history | location_parameter | stem_degree_score | gpa_ranking | alumni_success | study_major_code | time_to_completion | finance_workshop_score | cohort_ranking | total_loan_score | financial_aid_score | credit_ranking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 1 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 0 |


In [3]:
# Review the data types associated with the columns
loans_df.dtypes

payment_history,float64
location_parameter,float64
stem_degree_score,float64
gpa_ranking,float64
alumni_success,float64
study_major_code,float64
time_to_completion,float64
finance_workshop_score,float64
cohort_ranking,float64
total_loan_score,float64


In [4]:
# Check the credit_ranking value counts
loans_df["credit_ranking"].value_counts()

credit_ranking,count
1,855
0,744
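
The counts above show that the two classes are roughly balanced. As a quick sanity check before modeling, the sketch below derives the class shares and the accuracy of a naive "always predict the majority class" baseline. It uses only the two counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
class_counts = {1: 855, 0: 744}

total_rows = sum(class_counts.values())

# Share of each class in the dataset
class_shares = {label: count / total_rows for label, count in class_counts.items()}

# Accuracy of a model that always predicts the most common class
baseline_accuracy = max(class_shares.values())

print(f"Total rows: {total_rows}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should beat this baseline by a clear margin to be considered useful.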


### Step 2: Using the preprocessed data, create the features (`X`) and target (`y`) datasets. The target dataset should be defined by the preprocessed DataFrame column “credit_ranking”. The remaining columns should define the features dataset.

In [5]:
# Define the target set y using the credit_ranking column
y = loans_df['credit_ranking'].values

# Display a sample of y
print(y[:5])  # Display the first 5 samples

[0 0 0 1 0]


In [6]:
# Define features set X by selecting all columns but credit_ranking
X = loans_df.drop(columns=['credit_ranking']).values

# Review the features array (X is a NumPy array because of .values)
print(X[:5])  # Display the first 5 rows of features


[[7.400e+00 7.000e-01 0.000e+00 1.900e+00 7.600e-02 1.100e+01 3.400e+01
  9.978e-01 3.510e+00 5.600e-01 9.400e+00]
 [7.800e+00 8.800e-01 0.000e+00 2.600e+00 9.800e-02 2.500e+01 6.700e+01
  9.968e-01 3.200e+00 6.800e-01 9.800e+00]
 [7.800e+00 7.600e-01 4.000e-02 2.300e+00 9.200e-02 1.500e+01 5.400e+01
  9.970e-01 3.260e+00 6.500e-01 9.800e+00]
 [1.120e+01 2.800e-01 5.600e-01 1.900e+00 7.500e-02 1.700e+01 6.000e+01
  9.980e-01 3.160e+00 5.800e-01 9.800e+00]
 [7.400e+00 7.000e-01 0.000e+00 1.900e+00 7.600e-02 1.100e+01 3.400e+01
  9.978e-01 3.510e+00 5.600e-01 9.400e+00]]


### Step 3: Split the features and target sets into training and testing datasets.


In [7]:
# Split the preprocessed data into training and testing datasets
# Hold out 20% of the rows for testing and assign a random_state equal to 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Review the shapes of the resulting datasets
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")

X_train shape: (1279, 11), X_test shape: (320, 11)
y_train shape: (1279,), y_test shape: (320,)
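
To make the shapes above concrete, here is a minimal pure-Python sketch of a seeded shuffle split. It is not scikit-learn's implementation and will not reproduce the same row assignment; the helper name `shuffle_split_indices` is hypothetical, and rounding the test share up is an assumption chosen because it matches the 1279/320 split printed above.

```python
import math
import random


def shuffle_split_indices(n_samples, test_size, seed):
    """Return (train_indices, test_indices) for a seeded random split."""
    n_test = math.ceil(test_size * n_samples)  # test share rounded up (assumption)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the shuffle repeatable
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffle_split_indices(n_samples=1599, test_size=0.2, seed=1)
print(len(train_idx), len(test_idx))
```

Because the seed is fixed, rerunning the cell yields the same partition, which is the purpose of `random_state` in the notebook.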


### Step 4: Use scikit-learn's `StandardScaler` to scale the features data.

In [9]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the scaler to the features training dataset and scale it
X_train_scaled = scaler.fit_transform(X_train)

# Scale the features testing dataset with the scaler fitted on the training data
X_test_scaled = scaler.transform(X_test)

# Display the first 5 rows of the scaled training data
print(X_train_scaled[:5])

[[-0.73307913  0.6648928  -1.25704443 -0.3204585  -0.45362151 -0.74240736
  -0.6455073   0.24000129  0.98846046  0.0630946  -0.87223395]
 [ 1.06774091 -0.62346154  1.52314768  0.60886277 -0.36954631 -1.12518952
  -1.11200285  0.18789883 -1.7535127  -0.17390392 -0.77978452]
 [-1.74604041 -1.07158479 -1.35814232 -0.53491726 -0.78992229  1.07580793
   0.53628144 -2.67773653  2.32756363  0.77409018  3.28799021]
 [-0.62052788  0.49684658 -1.05484864 -0.0345135  -0.20139592  0.11885252
   1.18937522  0.37546769  1.24352773 -0.76640023 -0.6873351 ]
 [-0.50797663  0.60887739 -1.00429969 -0.53491726  0.26101766 -0.74240736
  -0.7077067  -0.33312578 -0.09557544 -0.47015208 -0.77978452]]
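
The scaling step applies z = (x - mean) / std per feature, with the mean and standard deviation learned from the training data only and then reused for the test data. The sketch below shows that idea for a single column in plain Python. The helper names are hypothetical, `train_column` reuses the first four `payment_history` values from the `head()` output, and `test_column` is made up for illustration.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training values."""
    return fmean(train_values), pstdev(train_values)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in values]


train_column = [7.4, 7.8, 7.8, 11.2]  # from the head() output above
test_column = [7.4, 9.0]              # hypothetical test values

mean, std = fit_standardizer(train_column)
train_scaled = standardize(train_column, mean, std)
test_scaled = standardize(test_column, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Fitting on the training data alone keeps information about the test set from leaking into the preprocessing.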


---

## Compile and Evaluate a Model Using a Neural Network

### Step 1: Create a deep neural network by assigning the number of input features, the number of layers, and the number of neurons on each layer using Tensorflow’s Keras.

> **Hint** You can start with a two-layer deep neural network model that uses the `relu` activation function for both layers.


In [10]:
# Define the number of inputs (features) to the model
number_of_inputs = X_train_scaled.shape[1]

# Review the number of features
print(f"Number of features: {number_of_inputs}")

Number of features: 11


In [11]:
# Define the number of hidden nodes for the first hidden layer
hidden_nodes_layer1 = 16  # Adjust based on model performance

# Define the number of hidden nodes for the second hidden layer
hidden_nodes_layer2 = 8   # Adjust based on model performance

# Define the number of neurons in the output layer
output_neurons = 1  # Single neuron for binary classification

In [12]:
# Create the Sequential model instance
model = Sequential()

# Add the first hidden layer
model.add(Dense(units=hidden_nodes_layer1, activation='relu', input_shape=(number_of_inputs,)))

# Add the second hidden layer
model.add(Dense(units=hidden_nodes_layer2, activation='relu'))

# Add the output layer to the model specifying the number of output neurons and activation function
model.add(Dense(units=output_neurons, activation='sigmoid'))  # Sigmoid for binary classification




In [13]:
# Display the Sequential model summary
model.summary()
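
The summary output was not captured here, but the trainable parameter count follows from the layer sizes defined above: each dense layer has one weight per input-unit pair plus one bias per unit. The sketch below works through that arithmetic; the helper `dense_param_count` is a hypothetical name, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Layer widths taken from the cells above: 11 inputs -> 16 -> 8 -> 1
layer_sizes = [11, 16, 8, 1]

per_layer = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(per_layer)

print(per_layer, total_params)  # [192, 136, 9] 337
```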

### Step 2: Compile and fit the model using the `binary_crossentropy` loss function, the `adam` optimizer, and the `accuracy` evaluation metric.


In [14]:
# Compile the Sequential model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
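
For reference, the loss and metric named in the compile step can be written out by hand. This is a plain-Python sketch of the formulas, not Keras's implementation; the clipping value `eps` and the toy labels and probabilities are assumptions for illustration.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0); eps is an assumed value
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Hypothetical labels and predicted probabilities
toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.1, 0.4, 0.6]

print(binary_crossentropy(toy_labels, toy_probs))
print(accuracy(toy_labels, toy_probs))
```

Confident wrong predictions are penalized heavily by the loss, while accuracy only counts whether each prediction lands on the correct side of the threshold.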


In [15]:
# Fit the model using 50 epochs and the training data
history = model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_split=0.2)


Epoch 1/50
32/32 ━━━━━━━━━━━━━━━━━━━━ 3s 11ms/step - accuracy: 0.4624 - loss: 0.8751 - val_accuracy: 0.4805 - val_loss: 0.7578
Epoch 2/50
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.4707 - loss: 0.7452 - val_accuracy: 0.6211 - val_loss: 0.6746
Epoch 3/50
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6338 - loss: 0.6570 - val_accuracy: 0.7266 - val_loss: 0.6278
Epoch 4/50
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7339 - loss: 0.6283 - val_accuracy: 0.7500 - val_loss: 0.5964
Epoch 5/50
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7481 - loss: 0.5890 - val_accuracy: 0.7383 - val_loss: 0.5725
Epoch 6/50
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7455 - loss: 0.5554 - val_accuracy: 0.7422 - val_loss: 0.5576
Epoch 7/50
32/32 ━━━━━━━━━

### Step 3: Evaluate the model using the test data to determine the model’s loss and accuracy.


In [16]:
# Evaluate the model loss and accuracy metrics using the evaluate method and the test data
loss, accuracy = model.evaluate(X_test_scaled, y_test)

# Display the model loss and accuracy results
print(f'Test Loss: {loss}, Test Accuracy: {accuracy}')

10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7968 - loss: 0.4854
Test Loss: 0.5212269425392151, Test Accuracy: 0.762499988079071


### Step 4: Save and export your model to a keras file, and name the file `student_loans.keras`.


In [17]:
# Set the model's file path
model_file_path = 'student_loans.keras'

# Export your model to a keras file
model.save(model_file_path)

---
## Predict Loan Repayment Success by Using your Neural Network Model

### Step 1: Reload your saved model.

In [25]:
from tensorflow.keras.models import load_model

# Set the model's file path
model_file_path = 'student_loans.keras'

# Load the model to a new object
loaded_model = load_model(model_file_path)
print("Model reloaded successfully!")

Model reloaded successfully!


### Step 2: Make predictions on the testing data and save the predictions to a DataFrame.

In [26]:
# Make predictions with the test data
predictions = loaded_model.predict(X_test_scaled)

# Display a sample of the predictions
print(predictions[:5])

10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
[[0.3525709 ]
 [0.35630912]
 [0.8414205 ]
 [0.8170289 ]
 [0.9700535 ]]


In [27]:
# Save the predictions to a DataFrame and round the predictions to binary results
predictions_df = pd.DataFrame(predictions, columns=['Prediction'])
predictions_df['Prediction'] = (predictions_df['Prediction'] > 0.5).astype(int)
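
To see what the thresholding step does, the sketch below applies the same `> 0.5` rule to the five sample probabilities printed earlier, using a plain list instead of a DataFrame. The variable names are illustrative.

```python
# Probabilities copied from the sample predictions printed above
sample_probabilities = [0.3525709, 0.35630912, 0.8414205, 0.8170289, 0.9700535]

threshold = 0.5

# A probability above the threshold becomes class 1, otherwise class 0
sample_labels = [int(p > threshold) for p in sample_probabilities]

print(sample_labels)  # [0, 0, 1, 1, 1]
```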

### Step 3: Display a classification report with the y test data and predictions.

In [28]:
from sklearn.metrics import classification_report

# Print the classification report with the y test data and predictions
print(classification_report(y_test, predictions_df['Prediction']))

              precision    recall  f1-score   support

           0       0.75      0.76      0.75       154
           1       0.77      0.77      0.77       166

    accuracy                           0.76       320
   macro avg       0.76      0.76      0.76       320
weighted avg       0.76      0.76      0.76       320
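
The precision, recall, and F1 columns in the report are derived from counts of true positives, false positives, and false negatives for each class. The sketch below computes them by hand for one class. The helper name and the toy label lists are hypothetical and are not the notebook's test data.

```python
def precision_recall_f1(y_true, y_pred, positive_label):
    """Compute precision, recall, and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical true and predicted labels
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 1, 0, 1, 0]

print(precision_recall_f1(toy_true, toy_pred, positive_label=1))  # (0.75, 0.75, 0.75)
```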



---
## Discuss creating a recommendation system for student loans

Briefly answer the following questions in the space provided:

1. Describe the data that you would need to collect to build a recommendation system to recommend student loan options for students. Explain why this data would be relevant and appropriate.

2. Based on the data you chose to use in this recommendation system, would your model be using collaborative filtering, content-based filtering, or context-based filtering? Justify why the data you selected would be suitable for your choice of filtering method.

3. Describe two real-world challenges that you would take into consideration while building a recommendation system for student loans. Explain why these challenges would be of concern for a student loan recommendation system.

**1. Describe the data that you would need to collect to build a recommendation system to recommend student loan options for students. Explain why this data would be relevant and appropriate.**

To build a recommendation system for student loans, it would be important to collect a range of data. This includes student demographics such as age, gender, family income, employment status, and location, as these factors help understand the student's financial needs and capacity. Academic performance data like GPA, degree level (undergraduate or graduate), major, and institution type (public or private) would also be useful, as they often correlate with loan eligibility and repayment ability. Financial information, such as credit score, current debt, household income, and past loan history, would provide insights into a student's creditworthiness and ability to manage debt. Additionally, understanding loan preferences, including desired interest rate type (fixed or variable), loan term length, and whether a co-signer is available, would help tailor recommendations to the student's specific needs.

**2. Based on the data you chose to use in this recommendation system, would your model be using collaborative filtering, content-based filtering, or context-based filtering? Justify why the data you selected would be suitable for your choice of filtering method.**

Given this data, the most suitable filtering method would be context-based filtering. Collaborative filtering relies on the loan choices of similar users, so it may perform poorly when students have little or no loan history (the cold-start problem) or unusual financial situations. Content-based filtering matches loan attributes to a student's profile and stated preferences, but on its own it does not account for the student's current situation. Context-based filtering adds that situational information, such as enrollment status, income, and existing debt at the time of the request, on top of personal attributes and any available history, which makes the recommendations more adaptable and personalized.

**3. Describe two real-world challenges that you would take into consideration while building a recommendation system for student loans. Explain why these challenges would be of concern for a student loan recommendation system.**

Two significant real-world challenges in building a student loan recommendation system include data privacy and security, as well as bias in recommendations. Handling sensitive personal and financial data requires strict measures to ensure confidentiality and compliance with laws like GDPR or FERPA, as any breach could severely affect students and the organization. Additionally, bias can arise from the training data, potentially leading to unfair recommendations, such as favoring certain demographic groups or suggesting high-interest loans to lower-income students. This could worsen their financial situation. To address these challenges, it’s crucial to implement secure data handling practices and regularly audit the model to ensure fair, unbiased recommendations for all users.
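
As one possible form of the audit described above, the sketch below compares how often high-interest loans are recommended to different income groups. The records, group labels, and helper name are all hypothetical, and a real audit would need a carefully chosen fairness metric and far more data.

```python
from collections import defaultdict

# Hypothetical audit log: (income_group, recommended_loan_is_high_interest)
recommendations = [
    ("low", True), ("low", True), ("low", False), ("low", True),
    ("high", False), ("high", True), ("high", False), ("high", False),
]


def high_interest_rate_by_group(records):
    """Return the share of high-interest recommendations per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [high_interest_count, total]
    for group, is_high_interest in records:
        counts[group][0] += int(is_high_interest)
        counts[group][1] += 1
    return {group: high / total for group, (high, total) in counts.items()}


rates = high_interest_rate_by_group(recommendations)
disparity = max(rates.values()) - min(rates.values())

print(rates)
print(f"Disparity between groups: {disparity:.2f}")
```

A large gap between groups would be a signal to investigate the training data and features before deploying the recommender.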