# Introduction

Welcome to our class on deep learning and neural networks.
In this session, we will introduce basic concepts of neural networks, key parameters and concepts in deep learning, and how these models are used in financial tasks like creditworthiness prediction.

## Agenda:
1. Useful concepts
2. Data preprocessing
3. Building a neural network model
4. Assignment overview




# 1. Useful concepts
An **activation function** determines the output of a neuron given an input or set of inputs. Activation functions introduce non-linearity into the network, allowing it to learn complex, non-linear relationships in the data.

**Common activation functions:**
| **Activation Function**    | **Description**                                                                                       |
|----------------------------|-------------------------------------------------------------------------------------------------------|
| **Linear**                 | Not so useful for deep learning: its derivative is constant, so backpropagation learns nothing about the input, and stacked linear layers collapse into a single linear layer. |
| **Binary step**            | On/off activation; limited for complex classification; its gradient is zero everywhere except at the threshold (where it is undefined), so gradient-based training cannot work. |
| **Non-linear** (key to DL) | Essential for complex mappings, backpropagation, and multiple layers. Examples below.  |


**Common non-linear activation functions:**
| **Activation Function**    | **Description**                                                                                       |
|----------------------------|-------------------------------------------------------------------------------------------------------|
| **Sigmoid (Logistic)**      | Scales outputs between 0 and 1; common for binary classification but suffers from vanishing gradients. |
| **Tanh (Hyperbolic)**       | Scales between -1 and +1; better for RNNs, but faces vanishing gradient issues and is computationally expensive. |
| **ReLU**                    | Popular for simplicity and speed; outputs 0 for inputs at or below 0, so neurons stuck there receive no gradient and stop learning ("dying ReLU"). |
| **Leaky ReLU**              | Introduces a negative slope for values below 0, solving the zeroing-out problem.                      |
| **Parametric ReLU (PReLU)** | Similar to Leaky ReLU, but slope is learned via backpropagation; computationally intensive.            |
| **Other ReLU Variants**     | Includes ELU (Exponential), Google’s Swish (deep networks), Maxout (powerful but impractical).         |
| **Softmax**                 | Converts outputs to probabilities; used for single-label classification; Sigmoid handles multi-label tasks. |
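
The functions in the table above can be sketched in a few lines of plain Python. This is only an illustrative sketch (the helper names and the `negative_slope=0.01` default are assumptions for this example, not the Keras implementations), meant to show how each function reshapes its input.

```python
import math


def sigmoid(x):
    """Squash a real number into the range 0 to 1; written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relu(x):
    """Pass positive inputs through unchanged; output 0 otherwise."""
    return max(0.0, x)


def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but keep a small slope for negative inputs (slope value is an assumption)."""
    return x if x > 0 else negative_slope * x


def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  tanh={math.tanh(x):.3f}  "
          f"relu={relu(x):.1f}  leaky_relu={leaky_relu(x):.2f}")

print("softmax:", [round(p, 3) for p in softmax([1.0, 2.0, 3.0])])
```

Note how ReLU returns exactly 0 for the negative input, while Leaky ReLU keeps a small negative value and therefore a non-zero gradient.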


**Loss function:** Measures the difference between the predicted output and the actual target. The goal is to minimize this loss during training.
  - Examples: Mean Squared Error (MSE), Cross-Entropy Loss.
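
To make the two example losses concrete, here is a minimal plain-Python sketch. The function names and the clipping constant `eps` are assumptions for illustration; Keras computes these losses on tensors, but the arithmetic is the same idea.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of squared differences; typical for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Average negative log-likelihood of the true labels; typical for binary classification."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]                     # hypothetical true labels
confident_right = [0.9, 0.1, 0.8, 0.2]    # hypothetical predicted probabilities
confident_wrong = [0.1, 0.9, 0.2, 0.8]

print("MSE:", mean_squared_error([1.0, 2.0], [1.0, 4.0]))
print("BCE (good predictions):", round(binary_cross_entropy(labels, confident_right), 3))
print("BCE (bad predictions): ", round(binary_cross_entropy(labels, confident_wrong), 3))
```

Cross-entropy punishes confident wrong predictions much more heavily than hesitant ones, which is why it pairs well with a sigmoid output.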
  
**Optimization algorithms:** Methods used to minimize the loss function by adjusting the network's weights.
  - **Gradient Descent:** The most common optimization algorithm; it computes the gradient of the loss function with respect to the model's parameters and moves them a small step in the opposite direction.
  - **Adam** is a popular stochastic gradient descent method that is computationally efficient, has low memory requirements, and is well suited to problems that are large in terms of data/parameters.
  - **Learning Rate:** A hyperparameter that controls how much the model's parameters are adjusted with each step of the optimization.
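
The interplay between gradient descent and the learning rate can be illustrated on a toy problem. The sketch below minimizes the made-up loss `(w - 3)**2`, whose minimum is at `w = 3`; the function name and the chosen learning rates are assumptions for this example only.

```python
def gradient_descent(learning_rate, n_steps=50, w_start=0.0):
    """Minimize the toy loss (w - 3)**2 with plain gradient descent."""
    w = w_start
    for _ in range(n_steps):
        grad = 2.0 * (w - 3.0)      # derivative of (w - 3)**2 with respect to w
        w -= learning_rate * grad   # step against the gradient
    return w


for lr in (0.01, 0.1, 1.1):
    print(f"learning_rate={lr}: w after 50 steps = {gradient_descent(lr):.4g}")
```

A small learning rate converges slowly, a moderate one gets close to 3, and a rate that is too large overshoots further on every step and diverges.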

**Number of layers and neurons:** More layers and neurons can capture more complex patterns but may lead to overfitting if not managed properly.

**Regularization:** Techniques like dropout or L2 regularization help prevent overfitting by adding constraints to the model.

---

## Popular Python Frameworks for Neural Networks

### Overview of Popular Frameworks
- **TensorFlow and Keras:** Developed by Google, Keras is a high-level API that runs on top of TensorFlow. It's user-friendly and great for beginners.
- **PyTorch:** Developed by Facebook, PyTorch is popular in academia for its flexibility and ease of debugging.

**For This Course:**  
We’ll focus on TensorFlow/Keras due to its simplicity and wide adoption in the industry.

---

Demo: Implementation in Python
------------------------------

### LendingClub Use Case


---


### Set up

#### User-specified parameters

In [None]:
python_material_folder_name = "python-material"

#### Import libraries

In [None]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Check if in Google Colab environment
try:
    from google.colab import drive
    # Mount drive
    drive.mount('/content/drive')
    # Set up path to Python material parent folder
    path_python_material = rf"drive/MyDrive/{python_material_folder_name}"
        # If unsure, print current directory path by executing the following in a new cell:
        # !pwd
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    # If working locally on Jupyter Notebook, parent folder is one folder up (assuming you are using the folder structure shared at the beginning of the course)
    path_python_material = ".."

#### Import data

In [None]:
# Read data that was exported from previous session
df = pd.read_csv(f"{path_python_material}/data/2-intermediate/df_out_dsif6.csv").sample(10000, random_state=42)  # fixed seed for reproducibility
df.head()

### <span style="color:RED"> **>>> NOTE:**  </span>    
> **Make sure to have plenty of data!**

In [None]:
df.shape

# 2. Data preprocessing

Neural networks expect data in a specific format and often require transformations to perform effectively. Here's a brief overview.

-   **Numerical Input Data:**
    -   **Transformation:** Convert categorical data to numerical using techniques like one-hot encoding or label encoding. Convert categorical target labels to integers or one-hot encoded vectors.
    -   **Why:** Neural networks can't process strings or categorical data directly.

-   **Feature Scaling:**
    -   **Transformation:** Apply normalization (e.g., min-max scaling) or standardization (e.g., z-score normalization).
    -   **Why:** Neural networks converge faster and perform better when input features are on a similar scale, preventing dominance of features with larger magnitudes.
    
-   **Handling Missing Data:**
    -   **Transformation:** Impute missing values (e.g., using mean/mode imputation) or remove rows/columns with missing data.
    -   **Why:** Missing data can lead to inaccurate training or model errors if not addressed.
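
The three transformations above can be sketched in plain Python as follows. This is a simplified illustration with made-up values and hypothetical helper names (in practice you would use pandas and scikit-learn); the key point it shows is that imputation and scaling statistics are learned from the training rows only and then reused on the test rows.

```python
def fit_min_max(train_column):
    """Learn the fill value and the scaling range from TRAINING values only."""
    observed = [v for v in train_column if v is not None]
    return {
        "fill": sum(observed) / len(observed),  # mean imputation value
        "min": min(observed),
        "max": max(observed),
    }


def transform_min_max(column, params):
    """Impute missing values, then apply min-max scaling with the learned parameters."""
    span = (params["max"] - params["min"]) or 1.0  # guard against constant columns
    filled = [params["fill"] if v is None else v for v in column]
    return [(v - params["min"]) / span for v in filled]


def one_hot(values, categories):
    """Encode each categorical value as a 0/1 vector, one position per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical example data (None marks a missing value)
train_income = [20_000, None, 60_000, 100_000]
test_income = [40_000, 120_000]

params = fit_min_max(train_income)
train_scaled = transform_min_max(train_income, params)
test_scaled = transform_min_max(test_income, params)

print("train:", train_scaled)
print("test: ", test_scaled)
print("grades:", one_hot(["A", "C"], ["A", "B", "C"]))
```

Test values can fall outside the 0 to 1 range because the range was learned from the training data; that is expected and preferable to letting test-set information influence the transformation.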


In [None]:
# Let's use the same features as the previously built model, plus the categorical ones
features = ['installment', 'revol_bal', 'recoveries', 'collection_recovery_fee',
       'last_fico_range_high', 'last_fico_range_low', 'tot_cur_bal',
       'open_acc_6m', 'open_il_24m', 'total_bal_il', 'inq_fi',
       'acc_open_past_24mths', 'bc_util', 'mo_sin_old_il_acct',
       'percent_bc_gt_75', 'total_il_high_credit_limit', 'last_pymnt_amnt_log',
       'last_pymnt_amnt_capped', 'grade_encoded', 'annual_inc_std']

X = df[features]
y = df['loan_default']

print(f"Number of features: {len(features)}")

In [None]:
X.head()

In [None]:
from sklearn.preprocessing import MinMaxScaler

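# Apply Min-Max Scaling (alternative: StandardScaler for z-score normalization)
# Note: for simplicity the scaler is fitted on the full dataset here; to avoid leaking
# test-set information, fit it on the training split only and reuse it to transform the test split.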
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Convert back to DataFrame with same column names
X_scaled_df = pd.DataFrame(X_scaled, columns=features)
X_scaled_df.head()

What can you observe now?

# 3. Building a neural network model

We'll use TensorFlow/Keras, and refer to a simple parallel with the brain in the commentary to explain what is going on.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (returns pandas dfs)
X_train, X_test, y_train, y_test = train_test_split(X_scaled_df, y, test_size=0.3, random_state=42)

## 3.1 Starting with a simple architecture

### Dense relu -> Dense relu -> Dense sigmoid



In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Initialize the model (empty initially)
model = Sequential()

# Adding layers to our model, which are like different parts of the brain.
# Each layer has "neurons" (like tiny decision-makers):

# Add the first part of our brain that looks at the data
model.add(Dense(16, input_dim=X_scaled_df.shape[1], activation='relu'))

# smaller brain part that processes what the first layer has figured out.
model.add(Dense(8, activation='relu'))

# This is like the brain's decision-making part, where it makes a yes/no decision (like "Will this person default on their loan?").
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(
    optimizer='adam', # like the brain’s coach, guiding it to get better and better at its task
    loss='binary_crossentropy', # how the brain measures its mistakes,  for optimisation
    metrics=['accuracy', 'Precision', 'Recall', 'AUC']
    )

In [None]:
# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)


#### What is going on here?
-   **Epochs (10):** The brain goes through all the data 10 times, learning a bit more with each pass.  
-   **Batch Size (32):** The brain practices on 32 examples at a time.
-   **Validation Split (0.2):** We hold out 20% of the training data as a validation set to check how well the brain is learning on examples it has not practised on. This is separate from the test set.
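
To see what these three settings mean in numbers, the sketch below counts the weight updates performed during training. It assumes the 10,000 sampled rows and the 70/30 train/test split used above, which leaves 7,000 training rows; the helper name is hypothetical, and the rounding only approximates how Keras splits the data (Keras takes the validation rows from the end of the training data, before shuffling).

```python
import math


def training_schedule(n_rows, validation_split, batch_size, epochs):
    """Return (rows trained on, rows held out, batches per epoch, total weight updates)."""
    n_train = int(n_rows * (1 - validation_split))    # rows actually used for fitting
    n_val = n_rows - n_train                          # rows held out for validation
    steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be smaller
    return n_train, n_val, steps_per_epoch, steps_per_epoch * epochs


n_train, n_val, steps, updates = training_schedule(
    n_rows=7000, validation_split=0.2, batch_size=32, epochs=10
)
print(f"{n_train} training rows, {n_val} validation rows")
print(f"{steps} batches per epoch, {updates} weight updates in total")
```

A smaller batch size means more (but noisier) weight updates per epoch; a larger one means fewer, smoother updates.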

In [None]:
# How does this compare to model built in previous sessions?

# Make predictions
y_prob = model.predict(X_test)
y_pred = (y_prob > 0.5)


In [None]:
from dsif6utility import model_evaluation_report
model_evaluation_report(X_test, y_test, y_pred, y_prob)

### Interpreting outputs produced
Let's look at the probability scores produced:

In [None]:
y_prob = pd.DataFrame(y_prob, columns=["prob"])

y_prob.prob.describe(percentiles = [i / 100 for i in [0, 1, 10, 25, 50, 75, 90, 95, 99, 100]])

### <span style="color:BLUE"> **>>> DISCUSSION:**  </span>    
> What do you notice?

To achieve a wider range of probabilities in your neural network output, you can consider the following modifications:

**Change the output layer activation function**
For binary classification, `sigmoid` is the appropriate output activation; `softmax` is only needed when predicting probabilities across more than two classes. If you're observing outputs mostly at the extremes (0 or 1), the activation is unlikely to be the cause, so consider adjusting the model complexity instead.

**Increase model complexity**
You can add more layers or increase the number of neurons in the existing layers to allow the model to learn more complex patterns.

**Regularization techniques**
Consider adding dropout layers to prevent overfitting, which can sometimes lead to extreme outputs.

**Adjust the loss function (class weights)**
If your target is binary but the predicted probabilities cluster at one extreme, check whether the classes are imbalanced. Balancing the dataset or adjusting class weights during training changes how much each class contributes to the loss.
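
As an illustration of class weights, the sketch below computes "balanced" weights with the common heuristic `n_samples / (n_classes * class_count)`. The label list is made up for this example and the helper name is hypothetical; a dictionary of this shape can then be passed to Keras through the `class_weight` argument of `model.fit`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, heavily imbalanced target: 90 repaid loans (0) and 10 defaults (1)
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)

# Possible usage with the model defined earlier (not executed here):
# model.fit(X_train, y_train, epochs=10, batch_size=32,
#           validation_split=0.2, class_weight=weights)
```

With these weights, each error on the rare class counts several times more in the loss than an error on the common class, which discourages the model from simply predicting the majority class.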

## 3.2 Some pointers/examples for NNs architectures


The following table includes a brief description of each architecture's structure and its rationale, providing a clearer understanding of why each architecture is suited for its respective use case.

| Neural Network Type | Structure | Use Case | Links | Description |
| --- | --- | --- | --- | --- |
| **Basic Feedforward Neural Network** | Dense (ReLU) → Dense (ReLU) → Dense (Sigmoid) | Binary Classification | [Sonar, Mines vs. Rocks example](https://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/) | Simple architecture for binary classification; fully connected layers enable learning complex relationships. |
| **Simple Neural Network for Classification** | Dense (ReLU) → Dense (Softmax) | Multi-Class Classification | [Iris Dataset example](https://lnwatson.co.uk/posts/intro_to_nn/) | Designed for multi-class classification; softmax activation allows for probability distribution across multiple classes. |
| **Convolutional Neural Network (CNN)** | Conv2D (ReLU) → Flatten → Dense (Softmax) | Image Classification | [MNIST image recognition example](https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5) | Effective for image data; convolutional layers capture spatial hierarchies, and pooling layers (not shown in this minimal structure) are often added to reduce dimensionality. |
| **Recurrent Neural Network (RNN)** | LSTM → Dense (Sigmoid) | Time Series Prediction | [lstm for time series example](https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/) | Suitable for sequential data; LSTMs manage long-term dependencies in time series or text inputs. |
| **Autoencoder** | Dense (ReLU) → Dense (ReLU) → Dense (Bottleneck) → Dense (ReLU) → Dense (Sigmoid) | Unsupervised Learning, Feature Extraction, Fraud detection | [Autoencoders example](https://blog.keras.io/building-autoencoders-in-keras.html) | Unsupervised learning for feature extraction; compresses input data into a lower-dimensional representation. |
| **Generative Adversarial Network (GAN)** | Dense (ReLU) → Dense (Output) (Generator) and Dense (ReLU) → Dense (Output) (Discriminator) | Synthetic data generation, Fraud detection | [GANs example](https://medium.com/@marcodelpra/generative-adversarial-networks-dba10e1b4424) | Two competing networks create new data; the generator learns to produce realistic samples while the discriminator assesses them. |


## 3.3 Calculating sample size required - Rule of thumb

**Rule of 10**: the amount of training data you need for a well-performing model is roughly 10x the number of parameters in the model. ([source](https://malay-haldar.medium.com/how-much-training-data-do-you-need-da8ec091e956#:~:text=This%20leads%20us%20to%20the,of%20parameters%20in%20the%20model.))

To determine the appropriate sample size based on the rule of 10, we first need to calculate the total number of parameters in your model. Here's how to do that step-by-step:

### i. **Calculate Parameters for Each Layer**

For a `Dense` layer, the number of parameters can be calculated as:

Parameters = (number of inputs + 1) × (number of neurons), where the +1 accounts for each neuron's bias term.

-   **Input Layer:** Has `input_dim` inputs (the number of features in your dataset) and no trainable parameters of its own.
-   **Hidden Layer 1:** Has 16 neurons.
-   **Hidden Layer 2:** Has 8 neurons.
-   **Output Layer:** Has 1 neuron.

### ii. **Calculate the Total Parameters**

Assuming your input has n features:

-   **First Layer:** (n+1)×16  
-   **Second Layer:** (16+1)×8= 136  
-   **Output Layer:** (8+1)×1= 9  

### iii. **Putting It Together**

#### Total Parameters:  
Total Parameters = (n+1)×16 + 136 + 9

To ensure a well-performing model, your sample size should be at least:  

Sample Size = 10 * [(n+1)×16 + 136 + 9]



### Example Calculation

If you have, for example, 20 features:

Sample Size = 10 * [(20+1)×16 + 136 + 9] = 4810

So, if your input has 20 features, you should aim for at least **4810 samples** for training your model. Adjust the calculation based on the actual number of features in your dataset.
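
The calculation above can be automated with a small helper. This is a sketch with a hypothetical function name; it only covers fully connected (`Dense`) layers with a bias term, as in the architecture used in this session. For a real Keras model, `model.count_params()` reports the parameter count directly.

```python
def count_dense_parameters(n_features, layer_sizes):
    """Total weights + biases for a stack of Dense layers."""
    total = 0
    n_inputs = n_features
    for n_neurons in layer_sizes:
        total += (n_inputs + 1) * n_neurons  # +1 for each neuron's bias
        n_inputs = n_neurons                 # this layer's outputs feed the next layer
    return total


n_params = count_dense_parameters(n_features=20, layer_sizes=[16, 8, 1])
print(f"Parameters: {n_params}")
print(f"Rule-of-10 sample size: {10 * n_params}")
```

Changing `layer_sizes` shows quickly how adding neurons or layers increases the amount of training data the rule of thumb asks for.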

## 3.4 Detecting and dealing with overfitting in NNs

In [None]:
def plot_training_vs_overfitting(history):
    """Plot training vs. validation accuracy and loss; a widening gap between the two curves suggests overfitting."""
    import matplotlib.pyplot as plt
    
    # Plot accuracy
    plt.plot(history.history['accuracy'], label='Train Accuracy')
    plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
    plt.legend()
    plt.show()

    # Plot loss
    plt.plot(history.history['loss'], label='Train Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.legend()
    plt.show()
    
plot_training_vs_overfitting(history)

### <span style="color:BLUE"> **>>> DISCUSSION:**  </span>    
> How can we see if there is any suspected overfitting? 

### Common techniques to deal with overfitting:

#### 1\. **Add a Dropout Layer**

Dropout is a regularization technique that randomly "drops out" a fraction of the neurons during training, forcing the model to learn more robust features.
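
What dropout does to a layer's outputs can be sketched in plain Python. This is an illustrative version (the function name and the seeded random generator are assumptions for the example): during training each activation is zeroed with probability `rate` and the survivors are scaled up by `1 / (1 - rate)` so the expected total stays the same; at prediction time nothing is dropped.

```python
import random


def dropout(activations, rate, training=True, rng=None):
    """Randomly zero a fraction of activations during training and rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is inactive at prediction time
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() >= rate else 0.0 for a in activations]


layer_output = [0.5, 1.2, 0.0, 2.0, 0.7, 1.1, 0.3, 0.9]  # hypothetical activations
print("training:  ", dropout(layer_output, rate=0.5, rng=random.Random(42)))
print("prediction:", dropout(layer_output, rate=0.5, training=False))
```

Because a different random subset of neurons is silenced on every batch, no single neuron can be relied upon, which pushes the network towards more redundant, robust features.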


In [None]:
from tensorflow.keras.layers import Dropout

# Initialize the model
model2 = Sequential()

# First hidden layer with Dropout
model2.add(Dense(16, input_dim=X_scaled_df.shape[1], activation='relu'))
model2.add(Dropout(0.5))  # Drop 50% of neurons

# Second hidden layer with Dropout
model2.add(Dense(8, activation='relu'))
model2.add(Dropout(0.5))  # Drop 50% of neurons

# Output layer
model2.add(Dense(1, activation='sigmoid'))

# Compile the model
model2.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', 'Precision', 'Recall', 'AUC']
)

In [None]:
# Train the model
history2 = model2.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
plot_training_vs_overfitting(history2)

#### 2\. **Early Stopping**

Early stopping halts training once the performance on the validation data stops improving. This prevents the model from overfitting after a certain point.
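
The "patience" logic behind early stopping can be sketched without any deep learning library. The function name and the validation-loss values below are made up for illustration; the Keras callback used in the next cell applies the same idea during real training.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss), both 0-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # give up: no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch  # ran through all epochs


# Hypothetical validation losses per epoch
val_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.44]

for patience in (3, 5):
    stop, best = early_stopping_epoch(val_losses, patience)
    print(f"patience={patience}: stopped at epoch {stop}, best epoch was {best}")
```

A low patience stops quickly but can miss a later improvement; a higher patience waits longer at the cost of extra training time.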


In [None]:
from tensorflow.keras.callbacks import EarlyStopping

# Early stopping to prevent overfitting
early_stopping = EarlyStopping(
    monitor='val_loss',  # Watch the validation loss (lower is better)
    patience=5,  # Stop if no improvement after 5 epochs
    restore_best_weights=True  # Restore the model weights at the best epoch
)

# Train with early stopping.
# Note: `model` was already fitted above, so this continues training from its current weights.
# Validation uses a split of the training data so the test set stays untouched for final evaluation.
history3 = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    callbacks=[early_stopping]
)
plot_training_vs_overfitting(history3)

#### 3\. **Reduce Model Complexity**
Sometimes, your model may be too complex (too many layers or neurons). You can simplify it by reducing the number of hidden layers or neurons per layer.

#### 4\. **Data Augmentation**
If possible, increase the amount of training data by augmenting it. More data can help the model generalize better, especially in tasks like image recognition (though it's less applicable for tabular data).

#### 5\. **Regularization (L2)**
L2 regularization (closely related to weight decay) adds a penalty proportional to the sum of squared weights to the loss, which discourages large weights and helps prevent overfitting.
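
The size of the L2 penalty is easy to compute by hand. The sketch below uses made-up weights and a hypothetical data loss of 0.40 to show how the penalty, computed as the regularization factor times the sum of squared weights, changes the total loss that the optimizer sees.

```python
def l2_penalty(weights, l2_factor):
    """L2 penalty: factor times the sum of squared weights."""
    return l2_factor * sum(w ** 2 for w in weights)


data_loss = 0.40                    # hypothetical cross-entropy loss on the data
small_weights = [0.1, -0.2, 0.05]   # hypothetical layer weights
large_weights = [3.0, -4.0, 2.0]

for name, weights in (("small", small_weights), ("large", large_weights)):
    penalty = l2_penalty(weights, l2_factor=0.01)
    print(f"{name} weights: penalty={penalty:.4f}, total loss={data_loss + penalty:.4f}")
```

With the same data loss, the model with large weights ends up with a noticeably higher total loss, so training is nudged towards smaller weights.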


In [None]:
from tensorflow.keras.regularizers import l2

# Add L2 regularization to the model
model4 = Sequential()
model4.add(Dense(16, input_dim=X_scaled_df.shape[1], activation='relu', kernel_regularizer=l2(0.01)))
model4.add(Dense(8, activation='relu', kernel_regularizer=l2(0.01)))
model4.add(Dense(1, activation='sigmoid'))

model4.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', 'Precision', 'Recall', 'AUC'])


In [None]:
# Train the model
history4 = model4.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
plot_training_vs_overfitting(history4)

# 4. Assignment overview

### Part 1 - Mandatory
**Develop a challenger model:** \
Using the concepts from **Session 7** and **Session 8**, create a challenger model to compare against the `model` and `model_2` built in **Session 5**. Name your new model `model_3`. Explore various architectures, models, activation functions, and other hyperparameters to improve performance.

> Experiment with different combinations of optimizers, loss functions, and metrics to evaluate how they affect training speed, overall performance, and the final model's accuracy. This will help you fine-tune your model for the specific problem you're addressing.

> Add at least one additional hidden layer. Observe how this affects the model's performance and overfitting and make note of the differences.

# End of session

In [None]:
from IPython.display import Image
Image(filename=f"{path_python_material}/images/the-end.jpg", width=500,)


# Appendix

## Best practices for choosing activation functions:

-   **ReLU (Rectified Linear Unit)**:

    -   **Default choice for hidden layers** in most neural networks.
    -   **Fast and simple**, helps avoid the vanishing gradient problem.
    -   Watch for "dying ReLUs": neurons whose inputs stay negative output 0, receive no gradient, and stop learning. Consider Leaky ReLU if your model isn't learning well.
-   **Leaky ReLU**:

    -   **Use when ReLU is causing dead neurons** (outputs stuck at 0).
    -   Allows a small, non-zero gradient when the input is negative.
-   **Sigmoid**:

    -   Good for **binary classification output layers** (produces a probability between 0 and 1).
    -   Avoid in hidden layers due to the **vanishing gradient problem**.
-   **Tanh (Hyperbolic Tangent)**:

    -   Use when outputs need to range between **-1 and 1**.
    -   **Better than Sigmoid** for hidden layers but still prone to vanishing gradients.
-   **Softmax**:

    -   Ideal for **multi-class classification output layers**.
    -   Outputs a probability distribution over multiple classes (sums to 1).

-   **Linear**:

    -   Use in the **output layer for regression tasks** (predicting continuous values).
    -   No non-linearity is applied, so the output is a linear combination of inputs.


## Changing Optimizer, Loss, and Metrics: What to Expect

Changing the **optimizer**, **loss**, and **metrics** in a neural network can have a significant impact on model performance, convergence speed, and the accuracy of predictions.

More info here: https://www.tensorflow.org/api_docs/python/tf/keras
