# Block 48: Unit 5 Career Simulation

## Home Loan Default Prediction Project

## Scenario:

EasyCredit Financial Services is a company that specializes in providing home loans to a wide demographic of customers. In recent years, EasyCredit has experienced a higher-than-average default rate on its home loans, which has affected its profitability and market reputation. To mitigate this issue, the company has decided to initiate a project to develop a deep learning model that can predict the probability of a loan applicant defaulting on a home loan. The company aims to integrate this predictive model into its loan approval process to make more informed decisions and reduce the risk of defaults.

## Problem Statement:

Construct a deep learning model to predict the likelihood of default for future loan applications using historical loan data.

## Objective:

Develop a predictive model that determines whether loan applicants will be able to repay a loan, based on historical data.

## Detailed Directions:

1. Import the Required Libraries:

   - Essential libraries for data manipulation (NumPy, Pandas)

   - PyMySQL for database connectivity (if needed)

   - LabelEncoder for data preprocessing

   - Matplotlib for visualization

2. Load the Data:

   - Load the data from the loan_data.csv file into a Pandas DataFrame.

   - Examine the first few rows to understand the dataset's structure.

3. Check for Null Values:

   - Identify and handle any missing values in the dataset.

4. Analyze Data Column Distribution:

   - Use descriptive statistics and visualizations to understand the data distribution.

5. Balance the Data:

   - Address the imbalance in the dataset to ensure the model doesn't become biased.

6. Encode the Full Dataset:

   - Convert categorical variables into a numeric form that machine learning algorithms can use for prediction.

7. Split and Train the Dataset:

   - Divide the data into training and testing sets.

   - Build and train the deep learning model on the training set.

8. Perform Feature Scaling:

   - Scale the features to standardize the range of independent variables.

9. Compile the Model:

   - Set up the deep learning model with the appropriate loss function and optimizer.

10. Fit the Model:

    - Train the model with the training data and validate it using the validation set.

11. Print the Model:

    - Output the details of the model architecture.

12. Evaluate the Metrics:

    - Calculate and interpret performance metrics, such as the confusion matrix, sensitivity, and area under the ROC curve (AUC).
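
The metrics named in step 12 can be sketched in a few lines. This is a minimal, library-free illustration on made-up labels and scores (the helper names `confusion_counts` and `roc_auc` are hypothetical, not part of the assignment); in the notebook itself, scikit-learn's metric functions would be the more usual choice.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = defaulter."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def roc_auc(y_true, y_score):
    """AUC as the chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: true classes and predicted default probabilities (made up for illustration)
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # threshold at 0.5

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)  # share of actual defaulters that were caught
auc = roc_auc(y_true, y_score)

print(f"Confusion matrix: [[{tn}, {fp}], [{fn}, {tp}]]")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"AUC: {auc:.3f}")
```

Sensitivity depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.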


### Installation of PyMySQL and Sklearn-Pandas

- Execute the command **!pip install pymysql** to install the PyMySQL package with Python's pip package manager.
- Use the command **!pip install sklearn-pandas** to install the Sklearn-Pandas package with Python's pip package manager.


In [46]:
# !pip install pymysql

In [47]:
# !pip install sklearn-pandas

### Step 1: Importing the Required Libraries

- Import the essential libraries for data manipulation and analysis, such as NumPy and Pandas, and use PyMySQL for database connectivity, LabelEncoder for data preprocessing, and Matplotlib for visualization.


In [48]:
import numpy as np
import pandas as pd
import pymysql
from sklearn.preprocessing import LabelEncoder
import plotly.express as px
from tqdm import tqdm

In [49]:
import tensorflow as tf

# Get the list of available devices
gpus = tf.config.list_physical_devices("GPU")
print("Available GPUs:", gpus)

# Disable GPU
if gpus:
    tf.config.set_visible_devices([], "GPU")
    print("GPU disabled. Running on CPU.")

# # Enable GPU
# if gpus:
#     tf.config.set_visible_devices(gpus[0], "GPU")
#     print("GPU enabled.")

Available GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
GPU disabled. Running on CPU.


### Observations:

I import the libraries that I will be using. For visualization I use Plotly Express.

I also chose to disable the GPU: for this dataset and the way I run the model, I get better performance on the CPU than on the GPU.


### Step 2: Loading the Data

- Load a CSV file named **loan_data.csv** into a Pandas DataFrame named **app_train**.
- Display the initial few rows of the **app_train** DataFrame.


In [50]:
app_train = pd.read_csv("loan_data.csv")
print(f"Shape: {app_train.shape}")
print("Cells (missing included):", app_train.shape[0] * app_train.shape[1])
print(
    "Cells (missing NOT included):",
    app_train.shape[0] * app_train.shape[1] - app_train.isna().sum().sum(),
)

Shape: (307511, 122)
Cells (missing included): 37516342
Cells (missing NOT included): 28363877


In [51]:
app_train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


### Observations:

I load the data and print out some stats including the first few rows of the data.
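
The two cell counts printed above also tell us how much of the table is missing. A short worked calculation, using only the figures from the output (the variable names are illustrative):

```python
# Figures copied from the output above
n_rows, n_cols = 307511, 122
total_cells = n_rows * n_cols      # every cell, missing included
non_missing_cells = 28363877       # cells that actually hold a value

missing_cells = total_cells - non_missing_cells
missing_share = missing_cells / total_cells

print(f"Missing cells: {missing_cells}")
print(f"Share missing: {missing_share:.1%}")
```

Roughly a quarter of all cells are empty, which is why the imputation step later in the notebook matters.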


### Step 3: Checking for Null Values

- Determine if there are any null values in the **TARGET** column of the **app_train** DataFrame.
- The command **app_train['TARGET'].value_counts()** gives the count of each unique value in the **TARGET** column of the **app_train** DataFrame.


In [52]:
# Check if there are any missing values in the 'TARGET' column of the app_train DataFrame
# // TODO
app_train["TARGET"].isna().sum()

0

In [53]:
# Count the occurrences of each unique value in the 'TARGET' column of the app_train DataFrame
app_train["TARGET"].value_counts()

TARGET
0    282686
1     24825
Name: count, dtype: int64

### Observations:

I check whether there are any missing values in the target column, and there are none. I've also checked how balanced my data is: it is not balanced (282,686 payers versus 24,825 defaulters), so I will need to balance it in a later step.


### Step 4: Analyzing the Data Column Distribution

- Utilize the Seaborn library to generate a **countplot**, a bar plot that illustrates the frequency of each category in the TARGET column.
- This plot displays two categories: Payer and Defaulter, signifying the target variable. The height of the bars indicates the count of each category, denoting the number of occurrences of payers and defaulters in the dataset.


In [54]:
# Visualize the distribution of values in the 'TARGET' column using a Plotly Express histogram
# (used here in place of a seaborn countplot)
# Set custom x-axis labels for better interpretation ('Payer' for 0, 'Defaulter' for 1)
# Adjust x-axis label size for better readability
# // TODO
fig = px.histogram(app_train, x="TARGET")

fig.update_layout(
    title="Distribution of TARGET",
    xaxis_title="Category",
    yaxis_title="Num of Instances",
    xaxis=dict(tickvals=[0, 1], ticktext=["Payer", "Defaulter"]),
)

fig.show()

### Percentage of Defaulters to Payers

- Calculate the percentage of defaulters in the dataset by dividing the number of data points with **TARGET** value of 1 (defaulters) by the number of data points with **TARGET** value of 0 (non-defaulters) and then multiplying by 100


In [55]:
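# Calculate the percentage of defaulters (TARGET=1) among all rows of the app_train DataFrame
# Note: this divides by the total row count, not by the number of non-defaulters (TARGET=0)
# TODO

# Index the counts by class label instead of relying on the order returned by value_counts()
counts = app_train["TARGET"].value_counts()
num_of_payer, num_of_defaulters = counts[0], counts[1]
per_of_defaulters = num_of_defaulters / app_train["TARGET"].count() * 100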
print(f"Percentage of Defaulters: {round(per_of_defaulters, 2)}%")

Percentage of Defaulters: 8.07%


### Observations:

I make some visualizations to see how my data is balanced.
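
A "percentage of defaulters" can be taken against two different denominators, and the two give different numbers. The sketch below uses the class counts from the `value_counts()` output in Step 3 (the variable names are illustrative) to show both:

```python
# Class counts copied from the value_counts() output in Step 3
n_payers, n_defaulters = 282686, 24825

# Defaulters as a share of all applicants
share_of_all = n_defaulters / (n_payers + n_defaulters) * 100

# Defaulters relative to payers (a ratio between the two classes)
ratio_to_payers = n_defaulters / n_payers * 100

print(f"Share of all applicants: {share_of_all:.2f}%")
print(f"Relative to payers:      {ratio_to_payers:.2f}%")
```

The 8.07% printed in the cell above is the first quantity, the share of all applicants; dividing by the number of payers instead gives 8.78%.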


### Step 5: Balancing the Data

- Upsample the fraud instances in the dataset by randomly duplicating them to increase their quantity. This step balances the data since fraud is a minority class compared to the **not_fraud** class.
- By setting **replace=True**, the resampling allows for the same fraud instances to be selected multiple times, ensuring that the number of fraud instances matches the number of **not_fraud** instances.
- Set the **random_state** parameter to ensure consistent and reproducible results.
- Use resampling with replacement to create an upsampled version of the fraud data, matching the number of instances in the majority class.
- Combine the original majority class instances (**not_fraud**) with the upsampled fraud instances (**fraud_upsampled**) to create a balanced dataset.
- Display the count of each target class (0 and 1) in the upsampled dataset.


In [56]:
# Import the resample function from sklearn.utils for handling class imbalance
# Extract the subset of the DataFrame where the target variable 'TARGET' is 0 (not fraud)
# Extract the subset of the DataFrame where the target variable 'TARGET' is 1 (fraud)
# Use the resample function to upsample the minority class (fraud) by generating additional samples
# Sample with replacement to create additional instances
# Match the number of samples in the majority class (not fraud)
# Set a random seed for reproducibility of results
# // TODO

# @ Janet told us that this cell is a duplicate and should not be filled out.

In [57]:
# Use the resample function from sklearn.utils to upsample the minority class (fraud)
# 'fraud' is the DataFrame containing instances where the target variable 'TARGET' is 1
# replace=True specifies that sampling should be done with replacement
# This means that the same instance of fraud may be selected multiple times during upsampling
# n_samples sets the number of samples to generate for the minority class
# It is set to the length of the majority class (not fraud) to balance the class distribution
# random_state is used to ensure reproducibility of results
# Setting a specific random seed (e.g., 27) ensures the same results are obtained if the code is run again
# // TODO
from sklearn.utils import resample

not_fraud = app_train[app_train["TARGET"] == 0]
fraud = app_train[app_train["TARGET"] == 1]

fraud_upsampled = resample(
    fraud, n_samples=len(not_fraud), replace=True, random_state=8080
)

print(f"not_fraud: {len(not_fraud)}")
print(f"fraud: {len(fraud)}")
print(f"fraud_upsampled: {len(fraud_upsampled)}")

not_fraud: 282686
fraud: 24825
fraud_upsampled: 282686


In [58]:
upsampled = pd.concat([not_fraud, fraud_upsampled])

In [59]:
upsampled.TARGET.value_counts()

TARGET
0    282686
1    282686
Name: count, dtype: int64

In [60]:
upsampled.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
5,100008,0,Cash loans,M,N,Y,0,99000.0,490495.5,27517.5,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0


### Plotting the Balanced Data

- Create a countplot using the upsampled dataset to display the distribution of the TARGET variable. The x-axis represents the TARGET categories, labeled as Payer and Defaulter, while the y-axis represents the count of occurrences.


In [61]:
# Build a bar plot of the 'TARGET' counts in the upsampled DataFrame with Plotly Express
# (used here in place of matplotlib's subplots and seaborn's countplot)
# 'upsampled' is assumed to be the DataFrame containing the upsampled data
# x="TARGET" puts the class on the x-axis and y="count" puts its frequency on the y-axis
# The Viridis color scale plays the role of seaborn's palette='viridis'
# Set custom x-axis labels for better interpretation of the plot
# Customize x-axis tick parameters, specifically setting the label size to 15 for better readability
# Display the plot
# TODO

target_counts = upsampled["TARGET"].value_counts().reset_index()
print(target_counts)


fig = px.bar(
    data_frame=target_counts,
    x="TARGET",
    y="count",
    color="TARGET",
    # TARGET is numeric, so Plotly colors it on a continuous scale
    color_continuous_scale=px.colors.sequential.Viridis,
)

fig.update_layout(
    title="Distribution of TARGET",
    xaxis_title="Category",
    yaxis_title="Num of Instances",
    xaxis=dict(
        tickvals=[0, 1], ticktext=["Payer", "Defaulter"], tickfont=dict(size=15)
    ),
)

fig.show()

   TARGET   count
0       0  282686
1       1  282686


### Observations:

In this section, we balance our data so that we have equal numbers of Payers and Defaulters.
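
One caveat with upsampling by duplication: if the data is split into train and test sets afterwards, copies of the same defaulter row can end up on both sides of the split, which tends to make test scores look better than they would on unseen applicants. A common alternative is to split first and upsample only the training portion. The sketch below illustrates that order on a toy list of `(row_id, label)` pairs in plain Python; the data, the `split` helper, and the 80/20 ratio are assumptions for illustration, not the assignment's prescribed method.

```python
import random

rng = random.Random(8080)

# Toy data: 90 payers (label 0) and 10 defaulters (label 1); ids stand in for full rows
payers = [(i, 0) for i in range(90)]
defaulters = [(i, 1) for i in range(90, 100)]
rng.shuffle(payers)
rng.shuffle(defaulters)


def split(rows):
    """Hold out one fifth of the rows for testing; return (train, test)."""
    n_test = len(rows) // 5
    return rows[n_test:], rows[:n_test]


# 1) Split each class separately so both sets keep the original class ratio
train_payers, test_payers = split(payers)
train_defaulters, test_defaulters = split(defaulters)
test_rows = test_payers + test_defaulters  # left imbalanced, like real applicants

# 2) Upsample the minority class with replacement, inside the training set only
defaulters_upsampled = rng.choices(train_defaulters, k=len(train_payers))
train_balanced = train_payers + defaulters_upsampled

train_ids = {row_id for row_id, _ in train_balanced}
test_ids = {row_id for row_id, _ in test_rows}
print(f"Train size: {len(train_balanced)}, test size: {len(test_rows)}")
print(f"Rows shared between train and test: {len(train_ids & test_ids)}")
```

With this order the test set keeps the original class imbalance, so metrics such as sensitivity and AUC are more informative there than plain accuracy.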


### Step 6: Encoding (Label and One-Hot) of the Full Dataset

- Use a **LabelEncoder** object from the scikit-learn library to perform label encoding on categorical columns.
- Iterate through each column in the dataset, check if it is of object data type and if it has two or fewer unique categories. If these conditions are met, transform the column values using the label encoder, and assign the transformed values back to the column in the dataset.
- Keep track of the number of columns that were label encoded and print the count at the end.
- Perform one-hot encoding on the upsampled dataset and print its shape.


In [62]:
# Create a label encoder object using scikit-learn's LabelEncoder
# Initialize a counter variable to keep track of the number of columns that are label encoded
# Iterate through the columns of the 'upsampled' DataFrame
# Check if the data type of the column is 'object'
# Check if the column has 2 or fewer unique categories
# Fit the label encoder on the training data in the current column
# Transform both the training and testing data in the current column
# Increment the label encoder count
# Print the total number of columns that were label encoded
# // TODO

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

l = []
counter = 0

for col in upsampled.columns:
    if upsampled[col].dtype == "object":
        if len(upsampled[col].unique()) <= 2:

            l.append(col)
            upsampled[col] = label_encoder.fit_transform(upsampled[col])
            counter += 1

print(f"counter: {counter}")
print(f"l: {l}")

upsampled.head(5)

counter: 3
l: ['NAME_CONTRACT_TYPE', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
1,100003,0,0,F,0,0,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,1,M,1,1,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,0,F,0,1,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,0,M,0,1,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
5,100008,0,0,M,0,1,0,99000.0,490495.5,27517.5,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0


In [63]:
# Use pandas get_dummies function to one-hot encode categorical variables in the 'upsampled' DataFrame
# One-hot encoding is a technique to represent categorical variables as binary vectors
# Each category is transformed into a new binary column, and each observation is marked with a 1 or 0 for the presence or absence of the category
# Print the shape of the 'upsampled' DataFrame to show the number of columns after one-hot encoding
# // TODO

upsampled = pd.get_dummies(upsampled, dtype="int")

# Print the shape of the DataFrame after one-hot encoding
print(f"Shape after one-hot encoding: {upsampled.shape}")

upsampled.head(5)

Shape after one-hot encoding: (565372, 243)


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,HOUSETYPE_MODE_terraced house,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
1,100003,0,0,0,0,0,270000.0,1293502.5,35698.5,1129500.0,...,0,1,0,0,0,0,0,0,1,0
2,100004,0,1,1,1,0,67500.0,135000.0,6750.0,135000.0,...,0,0,0,0,0,0,0,0,0,0
3,100006,0,0,0,1,0,135000.0,312682.5,29686.5,297000.0,...,0,0,0,0,0,0,0,0,0,0
4,100007,0,0,0,1,0,121500.0,513000.0,21865.5,513000.0,...,0,0,0,0,0,0,0,0,0,0
5,100008,0,0,0,1,0,99000.0,490495.5,27517.5,454500.0,...,0,0,0,0,0,0,0,0,0,0


### Observations:

Ideally we would handle missing values before encoding, but this is the order in which the homework is constructed. We still encode our data here so that it can run through the neural network.
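
To make the two encodings concrete, here is a small library-free sketch of what label encoding and one-hot encoding do to a column. The helper functions and the toy values are illustrative assumptions; in the notebook the work is done by `LabelEncoder` and `pd.get_dummies`.

```python
def label_encode(values):
    """Map each category to an integer; categories are sorted first, as LabelEncoder does."""
    classes = sorted(set(values))
    mapping = {category: index for index, category in enumerate(classes)}
    return [mapping[v] for v in values]


def one_hot(column_name, values):
    """Create one 0/1 column per category, named '<column>_<category>' like pd.get_dummies."""
    return {
        f"{column_name}_{category}": [int(v == category) for v in values]
        for category in sorted(set(values))
    }


# A two-category column becomes a single 0/1 column
own_car = ["N", "Y", "N"]
own_car_encoded = label_encode(own_car)

# A multi-category column becomes one indicator column per category
wall_material = ["Panel", "Block", "Panel"]
wall_material_encoded = one_hot("WALLSMATERIAL_MODE", wall_material)

print(own_car_encoded)
print(wall_material_encoded)
```

Label encoding keeps the column count unchanged, while one-hot encoding adds a column per category, which is why the DataFrame grows from 122 to 243 columns above.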


### Step 7: Splitting and Training the Dataset

- Import the **train_test_split** function from the **sklearn.model_selection** module, used to split the dataset into training and testing sets.
- Split the features and target data into training and testing sets using the **train_test_split** function, with a test size of 20% and a random state of 10.
- Assign the resulting training and testing sets to **features_train**, **features_test**, **target_train**, and **target_test**, respectively.


In [64]:
# Import the train_test_split function from scikit-learn to split the data into training and testing sets
# Extract features from the upsampled DataFrame, excluding the 'TARGET' column
# Extract the target variable 'TARGET' from the upsampled DataFrame
# Create a variable 'class_names' that holds the 'TARGET' column of the upsampled DataFrame
# This is often used in classification tasks for labeling classes
# Use train_test_split to split the data into training and testing sets
# features_train and target_train represent the training set, while features_test and target_test represent the testing set
# test_size=0.20 specifies that 20% of the data will be used for testing, and random_state=10 ensures reproducibility
# // TODO
from sklearn.model_selection import train_test_split

features = upsampled.drop("TARGET", axis=1)
target = upsampled["TARGET"]

class_names = upsampled["TARGET"]

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=10
)

In [65]:
features_train.shape, features_test.shape, target_train.shape, target_test.shape

((452297, 242), (113075, 242), (452297,), (113075,))

### Handling Missing Values and Imputing Data

- Fill the missing values in the **features_train** DataFrame using the median of each column.
- Identify the categorical columns that have missing values and store them in the **list_categorical** list.
- For each categorical column in **features_train** with missing values, apply the **SimpleImputer** with the strategy of filling the missing values with the most frequent value.
- Repeat the same process for the **features_test** DataFrame.
- Fill the upsampled DataFrame with the median values from the **app_train** DataFrame using the **fillna()** method.
- Lastly, check if there are any remaining missing values in the upsampled DataFrame and display the first few rows of the DataFrame.
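
A point worth keeping in mind when repeating the imputation for the test set: a common practice is to compute the fill value on the training data only and reuse it for the test data, so that no information from the test set influences preprocessing. The sketch below shows the idea on one toy column in plain Python (the values are made up); in pandas terms this would correspond to filling `features_test` with `features_train.median()`.

```python
from statistics import median

# One toy feature column per split; None marks a missing value
train_column = [10.0, None, 30.0, 50.0]
test_column = [None, 1000.0]

# Learn the fill value from the training data only
train_median = median(v for v in train_column if v is not None)

# Apply the same fill value to both splits
train_filled = [train_median if v is None else v for v in train_column]
test_filled = [train_median if v is None else v for v in test_column]

print(f"Training median: {train_median}")
print(f"Train filled:    {train_filled}")
print(f"Test filled:     {test_filled}")
```

Had the test column been filled with its own median, the missing entry would have become 1000.0 instead of 30.0, a value the model never saw described that way during training.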


In [66]:
features_train = features_train.fillna(features_train.median(numeric_only=True))
features_train.isna().sum().sum()

0

In [67]:
#
# @ This loop finds nothing, even if the fillna() call above is turned off, because none of the columns have the object dtype anymore.
# @ That is because of LabelEncoder() and pd.get_dummies().

# Import SimpleImputer from scikit-learn to handle missing values
from sklearn.impute import SimpleImputer

# Initialize an empty list to store column names of categorical variables with missing values
list_categorical = []
# Iterate through the columns of the training features
for col in features_train:
    # Check if the column is of type 'object' and contains any missing values
    if features_train[col].dtype == "object":
        if features_train[col].isnull().values.any():
            # Append the column name to the list
            list_categorical.append(col)


# // TODO

# Iterate through the list of categorical columns with missing values
for col in list_categorical:

    # Create a SimpleImputer object with the strategy of replacing missing values with the most frequent value
    imputer = SimpleImputer(strategy="most_frequent")

    # Apply the imputer to fill missing values in the current column
    # The reshape(-1, 1) is used to convert the 1D array into a 2D array, as SimpleImputer expects a 2D input
    prepped_column = features_train[col].to_numpy().reshape(-1, 1)

    features_train[col] = imputer.fit_transform(prepped_column).reshape(-1)


print(f"list_categorical: {list_categorical}")
features_train

list_categorical: []


Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,...,HOUSETYPE_MODE_terraced house,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
278077,422205,0,1,0,0,135000.0,415224.0,11083.5,328500.0,0.024610,...,0,0,0,0,0,1,0,0,1,0
65290,175723,0,0,0,0,76500.0,277969.5,10476.0,229500.0,0.025164,...,0,0,0,0,0,0,1,0,1,0
262632,404075,0,0,1,0,112500.0,545040.0,26640.0,450000.0,0.006671,...,0,0,0,0,0,0,1,0,1,0
22620,126320,0,0,1,0,270000.0,636138.0,20650.5,531000.0,0.009630,...,0,0,0,0,0,1,0,0,1,0
181436,310289,0,1,1,0,135000.0,900000.0,43299.0,900000.0,0.028663,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137881,259904,0,0,0,0,157500.0,986346.0,35559.0,832500.0,0.019101,...,0,0,0,0,0,0,0,0,0,0
128272,248793,0,1,0,0,108000.0,648000.0,21033.0,648000.0,0.020246,...,0,0,0,0,0,0,1,0,1,0
114960,233298,1,0,1,2,90000.0,202500.0,10125.0,202500.0,0.018850,...,0,0,0,0,0,0,0,0,0,0
303790,451974,0,0,1,2,180000.0,417024.0,22752.0,360000.0,0.009334,...,0,0,0,0,0,1,0,0,1,0


In [68]:
features_test = features_test.fillna(features_test.median())

In [69]:
# Import SimpleImputer from scikit-learn to handle missing values
from sklearn.impute import SimpleImputer

# Initialize an empty list to store column names of categorical variables with missing values
list_categorical = []

# Iterate through the columns of the testing features
for col in features_test:

    # Check if the column is of type 'object' and contains any missing values
    if (
        features_test[col].dtype == "object"
        and features_test[col].isnull().values.any()
    ):
        # Append the column name to the list
        list_categorical.append(col)


# Iterate through the list of categorical columns with missing values
for col in list_categorical:

    # Create a SimpleImputer object with the strategy of replacing missing values with the most frequent value
    imputer = SimpleImputer(strategy="most_frequent")

    # // TODO

    # Apply the imputer to fill missing values in the current column
    # The double square brackets ([[col]]) are used to pass a DataFrame with a single column to SimpleImputer,
    # as it expects a 2D input

    features_test[col] = imputer.fit_transform(features_test[[col]])


print(f"list_categorical: {list_categorical}")
features_test

list_categorical: []


Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,...,HOUSETYPE_MODE_terraced house,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
142015,264662,0,0,0,0,135000.0,450000.0,16164.0,450000.0,0.018209,...,0,0,0,0,0,0,0,0,0,0
242269,380469,0,1,1,0,157500.0,387000.0,28174.5,387000.0,0.028663,...,0,0,0,0,0,0,0,0,0,0
88490,202742,0,0,0,0,67500.0,646920.0,20997.0,540000.0,0.006852,...,0,0,0,0,0,0,0,0,0,0
197389,328857,0,0,1,2,225000.0,738567.0,27261.0,616500.0,0.020713,...,0,0,0,0,0,0,0,0,0,0
227063,363004,0,0,0,0,110250.0,442062.0,22729.5,369000.0,0.004960,...,0,0,0,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
300309,447914,0,0,0,1,135000.0,545040.0,26640.0,450000.0,0.018029,...,0,0,0,0,0,0,0,0,0,0
291996,438272,0,0,1,0,427500.0,521280.0,28408.5,450000.0,0.030755,...,0,0,0,0,0,0,1,0,1,0
253977,393885,1,0,1,3,225000.0,180000.0,9000.0,180000.0,0.010006,...,0,0,0,0,0,0,0,0,0,0
69714,180870,0,0,1,1,225000.0,301464.0,20277.0,238500.0,0.002042,...,0,0,0,0,0,0,0,0,0,0


In [70]:
upsampled.fillna(app_train.median(numeric_only=True), inplace=True)

In [71]:
print(upsampled.isnull().values.any())
upsampled.isna().sum().sum()

False


0

In [72]:
upsampled.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,HOUSETYPE_MODE_terraced house,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
1,100003,0,0,0,0,0,270000.0,1293502.5,35698.5,1129500.0,...,0,1,0,0,0,0,0,0,1,0
2,100004,0,1,1,1,0,67500.0,135000.0,6750.0,135000.0,...,0,0,0,0,0,0,0,0,0,0
3,100006,0,0,0,1,0,135000.0,312682.5,29686.5,297000.0,...,0,0,0,0,0,0,0,0,0,0
4,100007,0,0,0,1,0,121500.0,513000.0,21865.5,513000.0,...,0,0,0,0,0,0,0,0,0,0
5,100008,0,0,0,1,0,99000.0,490495.5,27517.5,454500.0,...,0,0,0,0,0,0,0,0,0,0


- Split the features and target variables into training and testing datasets using the **train_test_split** function from **sklearn.model_selection** module.


In [73]:
# Import the train_test_split function from scikit-learn to split the data into training and testing sets
from sklearn.model_selection import train_test_split

# Extract features from the 'upsampled' DataFrame, excluding the 'TARGET' column
features = upsampled.drop(["TARGET"], axis=1)

# Extract the target variable 'TARGET' from the 'upsampled' DataFrame
target = upsampled.TARGET

# Create a variable 'class_names' that holds the 'TARGET' column of the 'upsampled' DataFrame
# This is often used in classification tasks for labeling classes
class_names = upsampled.TARGET

# Use train_test_split to split the data into training and testing sets
# features_train and target_train represent the training set, while features_test and target_test will represent the testing set
# test_size=0.20 will specify that 20% of the data will be used for testing, and random_state=10 ensures reproducibility
# // TODO

# ! Do not uncomment this code!!!
# @ I have this code disabled because we are already doing this above!
# features_train, features_test, target_train, target_test = train_test_split(
#     features, target, test_size=0.2, random_state=10
# )

In [74]:
features_train.shape, features_test.shape, target_train.shape, target_test.shape

((452297, 242), (113075, 242), (452297,), (113075,))

### Observations:

In this section, we split the data into train and test splits. Then we handle the missing values, which we should've handled previously but again that's how the homework is laid out. Then we attempt to make another train test split, but I have commented that out because that would override the work we've done previously.


### Step 8: Performing Feature Scaling on the Dataset

- Perform feature scaling using **MinMaxScaler** on the training features, and store the scaled features in **features_train_scaled**.
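
Min-max scaling maps each feature to the range [0, 1] using the minimum and maximum observed in the training data: `x_scaled = (x - min) / (max - min)`. The sketch below implements that formula for a single column in plain Python; the helper names and values are illustrative, and the handling of a constant column is a simplification rather than a copy of `MinMaxScaler`'s internals.

```python
def fit_min_max(column):
    """Learn the scaling parameters (minimum and maximum) from the training column."""
    return min(column), max(column)


def transform_min_max(column, col_min, col_max):
    """Apply x_scaled = (x - min) / (max - min); a constant column maps to 0.0."""
    span = col_max - col_min
    return [(x - col_min) / span if span else 0.0 for x in column]


train_column = [100.0, 200.0, 300.0]
test_column = [150.0, 400.0]

col_min, col_max = fit_min_max(train_column)  # fit on training data only
train_scaled = transform_min_max(train_column, col_min, col_max)
test_scaled = transform_min_max(test_column, col_min, col_max)

print(train_scaled)
print(test_scaled)
```

Because the parameters come from the training data, a test value above the training maximum scales to more than 1, as the 400.0 entry does here. That is expected behavior, not an error.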


In [75]:
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import BatchNormalization

In [76]:
# Convert the training features to a NumPy array
# Convert the training target variable to a NumPy array
# Convert the testing features to a NumPy array
# Convert the testing target variable to a NumPy array
# // TODO
features_train = features_train.to_numpy()
target_train = target_train.to_numpy()
features_test = features_test.to_numpy()
target_test = target_test.to_numpy()

In [None]:
# Import the MinMaxScaler from scikit-learn to scale the features to a specified range (default is [0, 1])
# Create an instance of the MinMaxScaler
# Use the fit_transform method of the scaler to fit the scaling parameters on the training features and transform them
# Use the transform method of the scaler to apply the same scaling parameters to the testing features
# // TODO

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# ! The kernel resets when I try to use the scaler twice. The code does in fact work in VS Code. If you run it there, it should work fine.

features_train_scaled = scaler.fit_transform(features_train)
features_test_scaled = scaler.transform(features_test)
features_train_scaled

array([[0.90442186, 0.        , 1.        , ..., 0.        , 1.        ,
        0.        ],
       [0.21254839, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.85353106, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       ...,
       [0.37416106, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.98798326, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.43487353, 0.        , 1.        , ..., 0.        , 1.        ,
        0.        ]])

In [78]:
target_train.shape

(452297,)

### Deep Neural Network with 4 Layers

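- The network includes 4 layers: an input layer with 242 features, 2 hidden dense layers with 80 neurons each (each followed by a dropout layer with a rate of 0.2), and an output layer with a single sigmoid neuron.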
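
The size of this network can be checked by hand. A dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers add no trainable parameters. A short worked calculation, assuming the 242 input features reported by `features_train.shape` (the helper name `dense_params` is illustrative):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input features, two hidden layers of 80 neurons, one output neuron
layer_sizes = [242, 80, 80, 1]

per_layer = [
    dense_params(n_in, n_out) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(per_layer)

print(f"Parameters per dense layer: {per_layer}")
print(f"Total trainable parameters: {total_params}")
```

The total of 26,001 should match the trainable parameter count reported by `model.summary()`.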


In [79]:
# // TODO
# Import the Sequential class from the Keras library to create a sequential model
# @ Already done above
from keras import Input

# Create an instance of the Sequential model
model = Sequential(
    [
        # Declare the input shape (242 features) with an Input object, as Keras recommends for Sequential models
        Input(shape=(242,)),
        # Add a dense (fully connected) layer with 80 neurons and ReLU activation function
        Dense(80, activation="relu"),
        # Add a dropout layer with a dropout rate of 0.2 to reduce overfitting during training
        Dropout(0.2),
        # Add another dense layer with 80 neurons and ReLU activation function
        Dense(80, activation="relu"),
        # Add another dropout layer with a dropout rate of 0.2
        Dropout(0.2),
        # Add a final dense layer with 1 neuron and a sigmoid activation function for binary classification
        Dense(1, activation="sigmoid"),
    ]
)
model.summary()



### Observations:

Here we scale our features for the model. We also create the architecture for the model.


### Step 9: Compiling the Model

- Compile the model with the Adam optimizer, binary cross-entropy loss function, and accuracy as the metric.
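
Binary cross-entropy averages `-(y * log(p) + (1 - y) * log(1 - p))` over all samples, where `y` is the true label and `p` is the predicted probability of default. The sketch below computes it in plain Python for a few made-up predictions; the function name and the clipping constant `eps` are illustrative assumptions, added only to avoid `log(0)`.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0]
loss_confident_right = binary_cross_entropy(labels, [0.9, 0.1])
loss_uninformed = binary_cross_entropy(labels, [0.5, 0.5])
loss_confident_wrong = binary_cross_entropy(labels, [0.1, 0.9])

print(f"Confident and right: {loss_confident_right:.4f}")
print(f"Always 0.5:          {loss_uninformed:.4f}")
print(f"Confident and wrong: {loss_confident_wrong:.4f}")
```

A model that always predicts 0.5 scores ln 2, about 0.693, so on a balanced dataset a loss below that value indicates the model has learned something beyond guessing.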


In [80]:
# Compile the model with specific configuration for training
# The 'Adam' optimizer, 'binary_crossentropy' loss function, and 'accuracy' metric are commonly used for binary classification tasks
# Set the optimizer to 'Adam', which is a popular optimization algorithm
# Set the loss function to 'binary_crossentropy'
# This is the appropriate loss function for binary classification problems
# Specify the metric(s) to be used for evaluation during training
# 'accuracy' is a common metric for classification problems
# // TODO

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

### Observations:

Here we compile the model with an optimizer, a loss function, and the metrics we want to calculate.
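To make the loss function concrete, here is a minimal sketch of binary cross-entropy computed by hand. The labels and probabilities are hypothetical, and the `eps` clipping value is an assumption chosen for illustration, not necessarily what Keras uses internally.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; clipping by eps avoids log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # mostly right and fairly sure
uncertain = [0.5, 0.5, 0.5, 0.5]   # always predicts 0.5

loss_confident = binary_cross_entropy(labels, confident)
loss_uncertain = binary_cross_entropy(labels, uncertain)

print(round(loss_confident, 4))
print(round(loss_uncertain, 4))  # equals ln(2), about 0.6931
```

A model that always outputs 0.5 scores ln(2), about 0.693, so a loss near that value on balanced data means the model has learned very little.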


### Step 10: Fitting the Model

- Fit the model to the training data, with a specified number of epochs (50), to train the model and optimize its parameters based on the provided input features and target values.


In [36]:
# Train the neural network model using the training features and target variables
# The model will be trained for 50 epochs
# Use the fit method to train the model
# Provide the scaled training features as input and the corresponding target values for supervised learning
# The 'epochs' parameter specifies the number of times the entire training dataset will be passed forward and backward through the neural network
# // TODO

history = model.fit(
    features_train_scaled,
    target_train,
    epochs=50,
    validation_data=(features_test_scaled, target_test),
)

Epoch 1/50
14135/14135 ━━━━━━━━━━━━━━━━━━━━ 8s 543us/step - accuracy: 0.6624 - loss: 0.6151 - val_accuracy: 0.6828 - val_loss: 0.5948
Epoch 2/50
14135/14135 ━━━━━━━━━━━━━━━━━━━━ 7s 528us/step - accuracy: 0.6824 - loss: 0.5934 - val_accuracy: 0.6791 - val_loss: 0.5967
Epoch 3/50
14135/14135 ━━━━━━━━━━━━━━━━━━━━ 8s 553us/step - accuracy: 0.6899 - loss: 0.5845 - val_accuracy: 0.6886 - val_loss: 0.5861
Epoch 4/50
14135/14135 ━━━━━━━━━━━━━━━━━━━━ 8s 551us/step - accuracy: 0.6953 - loss: 0.5765 - val_accuracy: 0.6916 - val_loss: 0.5836
Epoch 5/50
14135/14135 ━━━━━━━━━━━━━━━━━━━━ 8s 564us/step - accuracy: 0.6984 - loss: 0.5700 - val_accuracy: 0.6875 - val_loss: 0.6072
Epoch 6/50
14135/14135 ━━━━━━━━━━━━━━━━━━━━ 8s 553us/step - accuracy: 0.7014 - loss: 0.5646 - val_accuracy: 0.6943 - val_loss:

### Observations:

Here we train (fit) the model for 50 epochs. The test set is passed as validation data, so validation loss and accuracy are reported after every epoch.
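The 14135 steps shown per epoch in the log can be reproduced with simple arithmetic. This sketch assumes a batch size of 32, which is the documented Keras default when `fit()` is called without `batch_size`, as it is here.

```python
import math

n_train = 452_297   # from target_train.shape above
batch_size = 32     # assumption: Keras default, since fit() sets no batch_size
epochs = 50

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 14135
print(total_updates)    # 706750
```

The same arithmetic applied to the 113,075 test rows (the sum of the confusion matrix cells) gives the 3534 steps shown by `evaluate()` and `predict()`.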


### Step 11: Printing the Model Summary

- Use the model.summary() function to display a summary of the model architecture, including the number of parameters and the shape of each layer.


In [37]:
model.summary()

### Observations:

Here we print a model summary. All that means is we print out the architecture of the model and the parameters.
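The parameter counts that a summary reports for dense layers can be checked by hand. The sketch below uses the layer widths from the model definition (242 inputs, then 80, 80, and 1 neurons); the helper name `dense_params` is made up for this example. Dropout layers have no trainable parameters, so they are left out.

```python
# Layer widths taken from the model definition: 242 inputs -> 80 -> 80 -> 1.
layer_sizes = [242, 80, 80, 1]


def dense_params(n_in: int, n_out: int) -> int:
    """One weight per input-output pair plus one bias per output neuron."""
    return n_in * n_out + n_out


# Pair each layer's input width with its output width.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # [19440, 6480, 81]
print(total_params)  # 26001
```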


### Step 12: Evaluating the Metrics

- Use the evaluate function to calculate the loss and accuracy of the trained model on the test data.
  The **test_loss** and **test_acc** variables store the values of the calculated loss and accuracy, respectively. Then, print the test accuracy value.


In [38]:
# Evaluate the trained model on the testing dataset and print the test accuracy

# Use the evaluate method to assess the model's performance on the testing dataset
# Provide the scaled testing features and their corresponding target values as input
# The method returns the test loss and test accuracy
# // TODO
_, test_acc = model.evaluate(features_test_scaled, target_test)

# Print the test accuracy obtained from the evaluation
print("Test accuracy:", test_acc)

3534/3534 ━━━━━━━━━━━━━━━━━━━━ 1s 316us/step - accuracy: 0.6379 - loss: 2.5416
Test accuracy: 0.6375149488449097


- Use the **predictions = model.predict(features_test_scaled)** code to make predictions on the test data using the trained model.
- Apply the trained model to the scaled features of the test data and obtain the predicted values for the target variable.


In [None]:
# Obtain predictions from the trained model on the scaled testing features
# Use the predict method to generate predictions from the trained model
# Provide the scaled testing features as input, and the method returns the predicted probabilities for each instance
# // TODO

predictions = model.predict(features_test_scaled).reshape(-1)
predictions

3534/3534 ━━━━━━━━━━━━━━━━━━━━ 1s 241us/step


array([[0.36764783],
       [0.43805364],
       [0.4346639 ],
       ...,
       [0.43436703],
       [0.36544433],
       [0.40498444]], dtype=float32)

- Use the **accuracy_score** function from the **sklearn.metrics** module to calculate the accuracy of the predicted values compared to the actual target values.


In [40]:
# Import the accuracy_score function from scikit-learn to calculate classification accuracy
# Calculate the classification accuracy by comparing the true target values (target_test) with the rounded predictions
# The predictions are rounded to convert probabilities to binary predictions (0 or 1)
# This is suitable for binary classification tasks
# // TODO

from sklearn.metrics import accuracy_score

# Round each probability to the nearest class label (0 or 1) in one vectorized step
binary_predictions = predictions.round().astype(int)

accuracy_score(target_test, binary_predictions)

0.6375149237231925
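Rounding probabilities amounts to applying a 0.5 threshold, except for how exact ties are handled. The sketch below spells the threshold out so it can be changed. The helper names and sample values are hypothetical and only illustrate the idea.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Hypothetical probabilities and true labels, for illustration only.
probs = [0.37, 0.44, 0.50, 0.62, 0.91]
truth = [0, 1, 1, 1, 1]

rounded = [round(p) for p in probs]   # built-in round sends the 0.50 tie to 0
explicit = to_labels(probs)           # explicit rule sends 0.50 to 1

print(rounded)                                            # [0, 0, 0, 1, 1]
print(explicit)                                           # [0, 0, 1, 1, 1]
print(accuracy(truth, explicit))                          # 0.8
print(accuracy(truth, to_labels(probs, threshold=0.4)))   # 1.0
```

Lowering the threshold flags more applicants as positive, which trades false negatives for false positives. Whether that trade is worthwhile depends on the relative cost of each error.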

In [41]:
# Import the confusion_matrix function from scikit-learn to compute the confusion matrix
from sklearn.metrics import confusion_matrix

# Compute the confusion matrix by comparing the true target values (target_test) with the rounded predictions
# The predictions are rounded to convert probabilities to binary predictions (0 or 1)
# The confusion matrix provides information on true positive, true negative, false positive, and false negative counts
# // TODO
cnf_matrix = confusion_matrix(target_test, binary_predictions)

# Print the computed confusion matrix
cnf_matrix

array([[49485,  7154],
       [33834, 22602]])

In [42]:
# // TODO
# Extract the elements of the confusion matrix to obtain true negative (TN), false positive (FP), false negative (FN), and true positive (TP) counts

# Use the confusion_matrix function to compute the confusion matrix by comparing the true target values (target_test)
# with the rounded predictions. The predictions are rounded to convert probabilities to binary predictions (0 or 1).

# Use the ravel() method to flatten the confusion matrix into a 1D array
# The resulting array has the order [TN, FP, FN, TP], where TN: true negative, FP: false positive, FN: false negative, TP: true positive
TN, FP, FN, TP = cnf_matrix.ravel()

# Print or view the extracted counts of true negative, false positive, false negative, and true positive
print(f"TN: {TN}")
print(f"FP: {FP}")
print(f"FN: {FN}")
print(f"TP: {TP}")

TN: 49485
FP: 7154
FN: 33834
TP: 22602


### Visualizing the Confusion Matrix

- Generate a heatmap representation of the confusion matrix, which is a visual way to analyze the performance of a classification model.
- Create the heatmap using the **sns.heatmap()** function from the seaborn library, with additional customization for labels, title, and tick marks to enhance the readability of the plot.


In [43]:
# Import Plotly Express for the heatmap
import plotly.express as px

# // TODO
# Visualize the confusion matrix as a heatmap with Plotly Express
# text_auto=True annotates each cell with its count, and the color scale is set to 'Blues'
fig = px.imshow(cnf_matrix, text_auto=True, color_continuous_scale="Blues")


# Set the labels, title, and ticks for better interpretation
# Set x-axis and y-axis tick labels to represent the classes ('0' and '1')

fig.update_layout(
    title="Confusion Matrix",
    xaxis_title="Predicted",
    yaxis_title="Actual",
    xaxis=dict(tickmode="array", tickvals=[0, 1], ticktext=["0", "1"]),
    yaxis=dict(tickmode="array", tickvals=[0, 1], ticktext=["0", "1"]),
)


fig.show()

### Sensitivity, Recall, Hit rate, or True Positive Rate (TPR)


In [44]:
# Calculate Sensitivity (True Positive Rate or Recall) using the formula Sensitivity = TP / (TP + FN)
# // TODO
sensitivity = TP / (TP + FN)

# Calculate Sensitivity by dividing the count of true positives (TP) by the sum of true positives and false negatives (FN)
# // TODO
# @ Done! ^^^

# Print or view the calculated Sensitivity
sensitivity

0.400489049542845
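Sensitivity is only one of several ratios that can be read off the confusion matrix. As a sketch, the same four counts also give specificity, precision, and the F1 score. The counts are copied from the output above; the variable names are chosen for this example.

```python
# Counts copied from the confusion matrix output above.
TN, FP, FN, TP = 49485, 7154, 33834, 22602

sensitivity = TP / (TP + FN)   # share of actual positives that were caught
specificity = TN / (TN + FP)   # share of actual negatives correctly cleared
precision = TP / (TP + FP)     # share of predicted positives that were right
accuracy = (TP + TN) / (TP + TN + FP + FN)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

for name, value in [
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("precision", precision),
    ("accuracy", accuracy),
    ("f1", f1),
]:
    print(f"{name}: {value:.4f}")
```

The gap between specificity (about 0.87) and sensitivity (about 0.40) shows that the model predicts class 0 much more readily than class 1 at the 0.5 threshold.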

### Area Under the Receiver Operating Characteristic Curve (ROC AUC)


In [45]:
# Import the roc_auc_score function from scikit-learn to calculate the area under the Receiver Operating Characteristic (ROC) curve
from sklearn.metrics import roc_auc_score

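# Calculate the area under the ROC curve (AUC) from the true target values (target_test)
# Note: this call passes the thresholded 0/1 labels (binary_predictions), so the result equals
# the average of sensitivity and specificity at the 0.5 threshold
# Passing the probabilities in `predictions` instead would give the threshold-independent AUC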
# // TODO
roc_auc_score(target_test, binary_predictions)

0.6370901611703702
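ROC AUC is normally computed from predicted probabilities, because it measures how well the scores rank positives above negatives across all thresholds. When hard 0/1 labels are passed instead, the value collapses to the average of sensitivity and specificity, which is what the 0.637 above corresponds to. The sketch below illustrates the difference with a small pairwise AUC written from scratch. The function and data are hypothetical and are not a replacement for `roc_auc_score`.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win. This pairwise form costs n_pos * n_neg
    comparisons, so it is only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and probabilities, for illustration only.
truth = [0, 0, 0, 1, 1, 1]
probs = [0.10, 0.30, 0.45, 0.40, 0.48, 0.90]
hard = [1 if p >= 0.5 else 0 for p in probs]   # thresholded labels

auc_from_probs = roc_auc(truth, probs)
auc_from_labels = roc_auc(truth, hard)

print(round(auc_from_probs, 4))    # 0.8889
print(round(auc_from_labels, 4))   # 0.6667
```

In this toy data the probabilities rank most positives above most negatives, but only one of them crosses 0.5, so thresholding first hides much of the ranking quality.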

### Observations:

Here we evaluate the model with several metrics: evaluate(), accuracy_score(), and a confusion matrix. We also calculate the sensitivity (recall) and the ROC AUC score. Overall test accuracy is about 63.8%, and sensitivity is about 0.40, so the model identifies only about 40% of the actual positive cases (label 1).
