# Deep Learning

# Neural Network

**Theory:** Neural networks are a class of machine learning models inspired by biological neural networks. They are composed of multiple layers of interconnected artificial neurons. Unlike biological neurons, which communicate through electrical and chemical signals, artificial neurons pass numerical values (activations) from one layer to the next.

**How It Works:**
- Each neuron receives input signals from the previous layer and computes an output signal using an activation function.
- The output signal is then passed on to the next layer, which performs the same computation until the output layer is reached.
- The output layer produces the final output signal, which is used to make a prediction.

**Support Functions:** Neural Networks rely on the calculation of outputs from each neuron and the backpropagation of errors to update the model parameters (weights).

**Pros:**
- Can learn non-linear relationships in data.
- Can be used for both classification and regression tasks.
- Can be used for both supervised and unsupervised learning.
- Can be used for both structured and unstructured data.
- Can be used for both numerical and categorical data.

**Cons:**
- Computationally expensive, especially with large datasets.
- Requires a large amount of training data.
- Requires careful preprocessing of data.
- Prone to overfitting.
- Difficult to interpret and explain.

**Formula:**
- Activation Function: The activation function of a neuron defines the output of that neuron given a set of inputs. It introduces non-linearity to the model, which allows it to learn non-linear relationships in data.

$$ a = f(z) $$

Where:
- $a$ is the output of the neuron.
- $z$ is the weighted sum of inputs to the neuron.
- $f$ is the activation function.

- Loss Function: The loss function of a neural network defines the difference between the predicted and actual values. It is used to measure the model's performance and guide the optimization process.
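
As a minimal sketch of the two formulas above, the NumPy snippet below computes $a = f(z)$ for a single neuron and evaluates two common loss functions. The function names, weights, and inputs are illustrative assumptions and are not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """ReLU activation: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def mse_loss(y_pred, y_true):
    """Mean squared error, a common loss for regression."""
    return np.mean((y_pred - y_true) ** 2)

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Binary cross-entropy, a common loss for two-class classification."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# One neuron: z is the weighted sum of the inputs, a = f(z)
x = np.array([1.0, 2.0])     # hypothetical inputs
w = np.array([0.5, -0.25])   # hypothetical weights
b = 0.0                      # hypothetical bias
z = np.dot(w, x) + b         # 0.5 * 1 + (-0.25) * 2 = 0.0
a = sigmoid(z)               # sigmoid(0) = 0.5
```

Swapping `sigmoid` for `relu` changes only the non-linearity; the weighted sum $z$ is computed the same way.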

**When to Use:**
- Use Neural Networks when you want to learn non-linear relationships in your data and have a large amount of training data.
- They are commonly used in image classification, natural language processing, and speech recognition.

# Forward propagation:

**Theory:** Forward propagation is the process of transforming an input tensor to an output tensor. It's the core of neural networks and deep learning.

**How It Works:**
- The input tensor is passed through the network layer by layer.
- Each layer computes a weighted sum of its inputs, adds a bias, and applies an activation function; the result becomes the input to the next layer.
- During training, the output tensor is compared with the targets to calculate the loss, which is then used to update the model parameters.

**Code:**
- In TensorFlow, you can perform forward propagation using the `tf.keras` API:
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Define a Sequential model
model = Sequential([
    Dense(64, input_dim=num_features, activation='relu'),
    Dense(32, activation='relu'),
    Dense(num_classes, activation='softmax')
])

# Perform forward propagation
outputs = model(X)

```

- In PyTorch, you can perform forward propagation as follows:
```python
import torch
import torch.nn as nn

# Define a custom model class
class CustomModel(nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, num_classes):
        super(CustomModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim1)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.fc3 = nn.Linear(hidden_dim2, num_classes)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return self.softmax(x)

# Create an instance of the model
model = CustomModel(input_dim=num_features, hidden_dim1=64, hidden_dim2=32, num_classes=num_classes)

# Perform forward propagation
outputs = model(X)
```

```python
# Coding forward prop in NumPy (Advanced Learning Algorithms - Andrew Ng)
import numpy as np

def g(z):
    # Activation function (sigmoid is used here as an example)
    return 1 / (1 + np.exp(-z))

# Loop version: W has shape (n_inputs, units), so column j holds the weights of unit j
def dense_loop(a_in, W, b):
    units = W.shape[1]
    a_out = np.zeros(units)
    for j in range(units):
        w = W[:, j]
        z = np.dot(w, a_in) + b[j]
        a_out[j] = g(z)
    return a_out

# Vectorized version: one matrix multiplication replaces the loop
# (np.dot(a_in, W) gives the same result here)
def dense(a_in, W, b):
    z = np.matmul(a_in, W) + b
    a_out = g(z)
    return a_out

# Forward propagation passes the input into the first layer, then the output
# of the first layer into the second layer, and so on.
# W1..W4 and b1..b4 are the layer parameters and must be defined beforehand.
def sequential(x):
    a1 = dense(x, W1, b1)
    a2 = dense(a1, W2, b2)
    a3 = dense(a2, W3, b3)
    a4 = dense(a3, W4, b4)
    return a4
```

# Backpropagation:

Backpropagation is a fundamental concept in neural network training. It's a key part of the training process that allows neural networks to update their weights and biases in order to minimize the error (or loss) between the predicted output and the actual target values. I'll explain the theory behind backpropagation and provide code examples for implementing it in TensorFlow and PyTorch.

**Theory:** Backpropagation, short for "backward propagation of errors," is the algorithm used to compute the gradients needed to train neural networks. It is based on the chain rule of calculus: the gradients tell the optimizer how to adjust the weights and biases in order to reduce the error between predictions and actual target values.

**How It Works:**
1. Forward Pass: During the forward pass, the input data is fed into the neural network, and the network computes the predicted output. This involves a series of weighted sum calculations followed by activation functions in each layer.

2. Compute Loss: The loss function (e.g., Mean Squared Error for regression or Cross-Entropy for classification) quantifies how far off the predicted output is from the actual target values.

3. Backward Pass (Backpropagation): In the backward pass, the algorithm calculates the gradients of the loss function with respect to the network's weights and biases. This is done by applying the chain rule to compute the gradients layer by layer, starting from the output layer and moving backward through the network.

4. Update Weights and Biases: The gradients obtained in the backward pass are used to update the weights and biases of the neural network's layers. Typically, gradient descent or its variants are used to adjust these parameters.

5. Repeat: Steps 1 to 4 are repeated for multiple iterations or epochs until the loss converges to a minimum or reaches a stopping criterion.
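
The five steps above can be sketched by hand in NumPy for a tiny network. This is an illustrative sketch under stated assumptions, not how TensorFlow or PyTorch implement it: the toy data, the single hidden layer of 4 tanh units, the mean squared error loss, and the learning rate are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical): 8 samples, 3 features, 1 target
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

# Parameters: one hidden layer with 4 tanh units, one linear output unit
W1 = rng.normal(scale=0.5, size=(3, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros(1)

def forward(X, W1, b1, W2, b2):
    """Step 1: forward pass. Returns hidden activations and predictions."""
    a1 = np.tanh(X @ W1 + b1)
    y_hat = a1 @ W2 + b2
    return a1, y_hat

def loss_fn(y_hat, y):
    """Step 2: mean squared error."""
    return np.mean((y_hat - y) ** 2)

def backward(X, y, a1, y_hat, W2):
    """Step 3: chain rule, from the output layer back to the first layer."""
    n = X.shape[0]
    d_yhat = 2.0 * (y_hat - y) / n      # dL/dy_hat
    dW2 = a1.T @ d_yhat                 # dL/dW2
    db2 = d_yhat.sum(axis=0)            # dL/db2
    d_a1 = d_yhat @ W2.T                # propagate the error to the hidden layer
    d_z1 = d_a1 * (1.0 - a1 ** 2)       # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_z1                    # dL/dW1
    db1 = d_z1.sum(axis=0)              # dL/db1
    return dW1, db1, dW2, db2

lr = 0.05
losses = []
for step in range(200):                 # Step 5: repeat
    a1, y_hat = forward(X, W1, b1, W2, b2)
    losses.append(loss_fn(y_hat, y))
    dW1, db1, dW2, db2 = backward(X, y, a1, y_hat, W2)
    # Step 4: plain gradient descent update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
```

Frameworks derive the `backward` step automatically (automatic differentiation), which is why the library examples never spell it out.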

**Code Example (Backpropagation in TensorFlow):**

```python
import tensorflow as tf

# Define a simple neural network model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_dim=num_features),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

# Compile the model with a loss function and optimizer
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model using backpropagation
model.fit(X_train, y_train, epochs=num_epochs, batch_size=batch_size)
```

**Code Example (Backpropagation in PyTorch):**

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a custom neural network model
class NeuralNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, num_classes)
        # No softmax here: nn.CrossEntropyLoss expects raw logits and applies
        # log-softmax internally

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        return self.fc2(x)  # raw logits

# Create an instance of the model
model = NeuralNetwork(input_dim=num_features, hidden_dim=64, num_classes=num_classes)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop with backpropagation
for epoch in range(num_epochs):
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
```

These code examples demonstrate how backpropagation is used to train neural networks in TensorFlow and PyTorch: `model.fit` runs the backward pass internally, while in PyTorch `loss.backward()` computes the gradients and `optimizer.step()` applies them. Replace placeholders with your actual data and parameters.

# Multi-Layer Perceptron (MLP):

**Theory:** A Multi-Layer Perceptron (MLP) is a feedforward artificial neural network that consists of multiple layers of interconnected neurons. It's used for both regression and classification tasks. MLPs are capable of capturing complex patterns in data.

**How It Works:**
- An MLP typically consists of an input layer, one or more hidden layers, and an output layer.
- Each neuron in a layer is connected to every neuron in the subsequent layer, forming a dense, fully connected network.
- Neurons in hidden layers apply an activation function to a weighted sum of their inputs, allowing the network to model non-linear relationships.
- During training, MLPs use techniques like backpropagation and gradient descent to learn the optimal weights that minimize the loss function.

**Support Functions:** MLPs rely on activation functions (e.g., ReLU, Sigmoid) and backpropagation for training. Libraries like TensorFlow and PyTorch provide built-in support for creating and training MLPs.

Now, let's see code examples for implementing a Multi-Layer Perceptron using TensorFlow, PyTorch, and scikit-learn:

**Using scikit-learn (sklearn):**

Scikit-learn provides an easy way to create and train MLPs for classification and regression tasks:

```python
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Create an MLP Classifier for classification or MLP Regressor for regression
model = MLPClassifier(hidden_layer_sizes=(64, 32), activation='relu', max_iter=num_epochs)

# Fit the model to your data (X_train and y_train are your feature and target data)
model.fit(X_train, y_train)

# Make predictions for classification or regression
predictions = model.predict(X_test)
```

**Using TensorFlow:**

In TensorFlow, you can use the Keras API to create and train an MLP:

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Define a Sequential model for an MLP
model = Sequential([
    Dense(64, input_dim=num_features, activation='relu'),
    Dense(32, activation='relu'),
    Dense(num_classes, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the model to your training data (X_train and y_train are feature and label data)
model.fit(X_train, y_train, epochs=num_epochs, batch_size=batch_size)

# Make predictions
predictions = model.predict(X_test)
```

**Using PyTorch:**

In PyTorch, you can define a custom MLP model and train it as follows:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a custom MLP model class
class MLPModel(nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, num_classes):
        super(MLPModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim1)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.fc3 = nn.Linear(hidden_dim2, num_classes)
        self.relu = nn.ReLU()
        # No softmax here: nn.CrossEntropyLoss expects raw logits and applies
        # log-softmax internally

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)  # raw logits

# Create an instance of the model
model = MLPModel(input_dim=num_features, hidden_dim1=64, hidden_dim2=32, num_classes=num_classes)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop (X_train is your feature data, y_train is class labels)
for epoch in range(num_epochs):
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Make predictions (argmax over the class dimension gives the predicted labels)
with torch.no_grad():
    predictions = model(X_test).argmax(dim=1)
```

In these code examples, I've shown how to create, train, and use a Multi-Layer Perceptron (MLP) for classification tasks in each of the mentioned libraries. Replace placeholders with your actual data and parameters.

# Dropout

**Theory:** Dropout is a regularization technique used to prevent neural networks from overfitting. It works by randomly dropping (setting to zero) a proportion of the neurons in each layer during training.

**How It Works:**
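- During each training step, every neuron's output is set to zero independently with probability $p$ (the dropout rate, 0.2 in the examples below).
- The surviving activations are typically scaled by $1/(1-p)$ so that their expected value stays the same.
- Dropout is disabled at inference time, so the full network is used to make predictions.
- Because no neuron can rely on specific other neurons being present, the network is forced to learn redundant representations, which improves its generalization ability.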

**Support Functions:** Dropout relies on randomly dropping neurons during training. Most machine learning libraries provide built-in support for dropout.
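
To make the mechanism concrete, here is a minimal NumPy sketch of "inverted" dropout, the variant in which surviving activations are rescaled during training. The function name `dropout_forward` and the example activations are assumptions made for illustration; the library layers shown in this section handle all of this internally.

```python
import numpy as np

def dropout_forward(a, p, training, rng):
    """Inverted dropout: zero each activation with probability p during training."""
    if not training or p == 0.0:
        return a                       # dropout is disabled at inference time
    mask = rng.random(a.shape) >= p    # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)        # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 64))                # hypothetical layer activations
out_train = dropout_forward(a, p=0.2, training=True, rng=rng)
out_eval = dropout_forward(a, p=0.2, training=False, rng=rng)
```

In training mode roughly 20% of the entries of `out_train` are zero and the rest equal 1.25, so the mean stays close to 1; in evaluation mode the input is returned unchanged.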

**Code Example (Dropout in TensorFlow):** (We will use it in MLP to see the difference)

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Define a Sequential model with Dropout
model = Sequential([
    Dense(64, activation='relu', input_dim=num_features),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(num_classes, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the model to your training data (X_train and y_train are feature and label data)
model.fit(X_train, y_train, epochs=num_epochs, batch_size=batch_size)

# Make predictions
predictions = model.predict(X_test)
```

**Code Example (Dropout in PyTorch):**

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a custom MLP model class with Dropout
class MLPModel(nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, num_classes):
        super(MLPModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim1)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.fc3 = nn.Linear(hidden_dim2, num_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)
        # No softmax here: nn.CrossEntropyLoss expects raw logits and applies
        # log-softmax internally

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        return self.fc3(x)  # raw logits

# Create an instance of the model
model = MLPModel(input_dim=num_features, hidden_dim1=64, hidden_dim2=32, num_classes=num_classes)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop (X_train is your feature data, y_train is class labels)
for epoch in range(num_epochs):
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Make predictions (eval mode disables dropout)
model.eval()
with torch.no_grad():
    predictions = model(X_test)
```

# CNN

## Introduction to CNN
**From MLP to CNN** 
- The models discussed so far suit data whose features carry patterns, but they assume nothing about how those features relate to one another. When we lack such prior knowledge, a multi-layer perceptron is often the best we can do. However, these structure-less networks become unwieldy on high-dimensional perceptual data. For example, for a carefully labeled set of one-megapixel images, each input has $10^6$ dimensions, so even a hidden layer of only 1,000 units would need a dense layer with $10^6 \times 10^3 = 10^9$ parameters, which makes learning the parameters impractical.
- One might object that a one-megapixel resolution is unnecessary, but even at lower resolutions the number of hidden units needed to learn good hidden representations of images remains large. Learning a binary classifier with that many parameters would likely require an enormous dataset, perhaps comparable to the number of dogs and cats on Earth. Yet both humans and computers can distinguish cats from dogs well, because images have rich structure that humans and machine learning models alike can exploit.
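
The parameter count behind this argument is easy to check. The sketch below assumes a hidden layer of 1,000 units on a flattened one-megapixel image, and contrasts it with a small convolutional layer (32 filters of size 3×3 on one input channel, the configuration used in the `Conv2D` example later in these notes). The helper names are hypothetical.

```python
def dense_param_count(num_inputs, num_units):
    """Weights (num_inputs * num_units) plus one bias per unit."""
    return num_inputs * num_units + num_units

def conv_param_count(kernel_h, kernel_w, in_channels, filters):
    """One kernel per (input channel, filter) pair plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

pixels = 1_000_000       # one-megapixel image, flattened
hidden_units = 1_000     # assumed hidden layer size

dense_params = dense_param_count(pixels, hidden_units)
conv_params = conv_param_count(3, 3, in_channels=1, filters=32)

print(f"{dense_params:,}")   # 1,000,001,000 (about 10^9)
print(f"{conv_params:,}")    # 320
```

The convolutional layer's parameter count does not depend on the image size at all, which is the structural advantage the following sections build on.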

**Invariant features**
- Imagine that we want to identify an object in a picture. It seems reasonable that whatever method we use should not depend too much on the exact location of the object in the picture. Ideally, we could learn a system that leverages this knowledge.
- Returning to images, these intuitions can be made more specific to obtain two key principles for building neural networks for computer vision:
    - Visual systems should respond similarly to the same object regardless of where it appears in the image (translation invariance).
    - Visual systems should focus on local regions and not depend on what appears far away in the image (locality).
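
Both principles can be illustrated with a small sliding-window filter, which computes a local weighted sum over each 2×2 window (the cross-correlation operation). The sketch below uses a hypothetical 6×6 image and an arbitrary kernel; it suggests that moving the object moves the response by the same amount, and that regions far from the object produce no response.

```python
import numpy as np

def corr2d(x, k):
    """Slide kernel k over x and take a weighted sum at each position."""
    h, w = k.shape
    y = np.zeros((x.shape[0] - h + 1, x.shape[1] - w + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return y

k = np.array([[1.0, -1.0], [2.0, 0.5]])    # arbitrary 2x2 kernel

img = np.zeros((6, 6))
img[1:3, 1:3] = [[1.0, 2.0], [3.0, 4.0]]    # small "object" near the top-left

# The same object moved 2 pixels down and 2 pixels right
shifted = np.roll(img, shift=(2, 2), axis=(0, 1))

out = corr2d(img, k)
out_shifted = corr2d(shifted, k)

# Same response pattern, only shifted by (2, 2)
same_pattern = np.array_equal(out_shifted[2:, 2:], out[:3, :3])
# Windows that do not touch the object give zero response (locality)
far_is_zero = np.allclose(out[3:, :], 0.0)
```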

## Convolutional Neural Network (CNN)
**Cross-correlation operation**
- The convolutional layer is the core building block of a convolutional neural network. Like a fully connected layer, it is a linear operation that multiplies the input by a set of weights. However, the weights are arranged in a small grid (the kernel) that slides over the input, rather than in a single row. Strictly speaking, the operation computed is a cross-correlation.
    - Example, we have input:
    $$ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} $$
    - And we have kernel:
    $$ \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix} $$
    - We take a 2×2 window from the input, multiply it element-wise with the kernel, and sum the products:
    $$ \text{sum}\left( \begin{bmatrix} 1 & 2 \\ 4 & 5 \end{bmatrix} \odot \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix} \right) = 1 * 0 + 2 * 1 + 4 * 2 + 5 * 3 = 25 $$
    - The other values will be:
    $$ 2 * 0 + 3 * 1 + 5 * 2 + 6 * 3 = 31 $$
    $$ 4 * 0 + 5 * 1 + 7 * 2 + 8 * 3 = 37 $$
    $$ 5 * 0 + 6 * 1 + 8 * 2 + 9 * 3 = 43 $$
    - The result will be:
    $$ \begin{bmatrix} 25 & 31 \\ 37 & 43 \end{bmatrix} $$

- I will provide sample code to calculate:
```python
import numpy as np

# Sample data
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
kernel = np.array([[0, 1], [2, 3]])

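# Calculate the 2D cross-correlation of input x with kernel k
def cross_corr_2d(x, k):
    h, w = k.shape
    # The output shrinks by (kernel size - 1) along each axis
    y = np.zeros((x.shape[0] - h + 1, x.shape[1] - w + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            # Element-wise product of the current window and the kernel, then sum
            y[i, j] = np.sum(x[i:i+h, j:j+w] * k)
    return y

# Print the result of cross-correlation
print(cross_corr_2d(X, kernel))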
```

**Convolutional layer**
- The convolutional layer performs a cross-correlation between the input and the kernel, then adds a bias to produce the output. The two parameters of a convolutional layer are therefore the kernel and the bias. When we train models that contain convolutional layers, we usually initialize the kernels randomly, just as we do for a fully connected layer.

- Some ways to create this layer with a library:
```python
# Tensorflow
import tensorflow as tf
from tensorflow.keras.layers import Conv2D

# Create a Convolutional layer
layer = Conv2D(
    filters=32,            # Number of output channels (filters)
    kernel_size=3,         # Size of the kernel
    activation='relu',     # Activation function
    input_shape=(28, 28, 1),  # Input shape (height, width, channels)
    strides=(2, 2),        # Stride for the convolution (2, 2) for example
    padding='same'         # 'valid' for no padding, 'same' for zero-padding
)  # strides and padding are explained below

# Apply the Convolutional layer to the input data
# (X must have shape (batch, 28, 28, 1) for this layer)
output = layer(X)

# PyTorch
import torch
import torch.nn as nn

# Create a Convolutional layer
layer = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=1, padding=0)

# Apply the Convolutional layer to the input data
# (PyTorch expects X with shape (batch, channels, height, width))
output = layer(X)
```

**Detect edges**
- Let's look at a simple application of the convolutional layer: detecting the edges of an object in an image by finding where the pixel values change. First, we create an 'image' of 6×8 pixels. The four columns in the middle are black (value 0) and the rest are white (value 1).
```python
import numpy as np

# Create a 6x8 image: the four middle columns are black (0), the rest are white (1)
X = np.ones((6, 8))
X[:, 2:6] = 0
print(X)
```

- Then we create a kernel K with a height of 1 and a width of 2. When we cross-correlate it with the input, the output is 0 wherever two horizontally adjacent elements have the same value, and nonzero otherwise.

```python
# Create a 1x2 kernel
kernel = np.array([[1, -1]])
```

- We're ready to cross-correlate the input X with the kernel K. The white-to-black edge positions have a value of 1, whereas the black-to-white positions have a value of -1. All other output positions are 0.

```python
# Apply cross-correlation and print the result
# cross_corr_2d: the cross-correlation function from above (Python names cannot start with a digit)
print(cross_corr_2d(X, kernel))
```

- Now, let's apply the kernel to the transposed image. As expected, every cross-correlation value is zero: the kernel K detects only vertical edges.

```python
print(cross_corr_2d(X.T, kernel))
```

**Padding and strides**
- In the previous PyTorch code, I used `padding=0` and `stride=1`. But what are padding and strides? Let's go into it.
    - In the previous example, the input has a height and width of 3 (a 3×3 matrix) and the kernel window has a height and width of 2, so we get a 2×2 output. In general, if the input size is $n_h \times n_w$ and the kernel window size is $k_h \times k_w$, the output size will be:
    $$ (n_{h} - k_{h} + 1) \times (n_{w} - k_{w} + 1) $$
    
    - Where:
        - $n_{h}$ and $n_{w}$ are the height and width of the input.
        - $k_{h}$ and $k_{w}$ are the height and width of the kernel.
    - In some cases, we combine this with other techniques that also affect the output size, such as padding and strides. Because kernels are generally larger than 1 in height and width, the output after several consecutive convolutions is significantly smaller than the input. If we start with a 240×240 pixel image and apply 10 layers of 5×5 convolutions, the image shrinks to 200×200 pixels: about 30% of the image is cut off, and any useful information near the border of the original image is lost. Padding is the most common tool for dealing with this problem.

    **Padding**
    - We will demonstrate padding with an example:
        - We have an input that has already been padded with one row or column of zeros on each side:
        $$ \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 2 & 0 \\ 0 & 3 & 4 & 5 & 0 \\ 0 & 6 & 7 & 8 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} $$
        - We have kernel:
        $$ \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix} $$
        - We take a 2×2 window from the input, multiply it element-wise with the kernel, and sum the products:
        $$ \text{sum}\left( \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} \odot \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix} \right) = 0 * 0 + 0 * 1 + 0 * 2 + 0 * 3 = 0 $$
        - The other values will be:
        $$ 0 * 0 + 0 * 1 + 0 * 2 + 1 * 3 = 3 $$
        $$ 0 * 0 + 0 * 1 + 1 * 2 + 2 * 3 = 8 $$
        $$ 0 * 0 + 0 * 1 + 2 * 2 + 0 * 3 = 4 $$
        and so on for the remaining windows.
        - The result will be:
        $$ \begin{bmatrix} 0 & 3 & 8 & 4 \\ 9 & 19 & 25 & 10 \\ 21 & 37 & 43 & 16 \\ 6 & 7 & 8 & 0 \end{bmatrix} $$
    - In general, if we add a total of $p_h$ rows of padding (about half on top and half on the bottom) and a total of $p_w$ columns of padding (about half on the left and half on the right), the output size will be:
    $$ (n_{h} - k_{h} + p_{h} + 1) \times (n_{w} - k_{w} + p_{w} + 1) $$

    - Where:
        - $p_{h}$ and $p_{w}$ are the total numbers of padding rows and columns.
    - We usually choose the kernel height and width to be odd numbers such as 1, 3, 5, or 7. With an odd kernel size, setting $p_h = k_h - 1$ and $p_w = k_w - 1$ preserves the spatial dimensions while adding the same number of padding rows on the top and bottom, and the same number of padding columns on the left and right.
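
The output-size formulas and the padding example above can be checked with a short NumPy sketch. The helper names `conv_output_shape` and `corr2d` are assumptions made for illustration; `np.pad` is used to add one row or column of zeros on every side, which corresponds to $p_h = p_w = 2$ in the formula.

```python
import numpy as np

def conv_output_shape(n_h, n_w, k_h, k_w, p_h=0, p_w=0):
    """Output shape for stride 1; p_h and p_w are the TOTAL padding rows/columns."""
    return (n_h - k_h + p_h + 1, n_w - k_w + p_w + 1)

def corr2d(x, k):
    """2D cross-correlation of input x with kernel k (stride 1)."""
    h, w = k.shape
    y = np.zeros((x.shape[0] - h + 1, x.shape[1] - w + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return y

# The padding example: a 3x3 input padded with one ring of zeros
x = np.arange(9.0).reshape(3, 3)               # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
k = np.array([[0.0, 1.0], [2.0, 3.0]])
x_padded = np.pad(x, pad_width=1, mode="constant")
y = corr2d(x_padded, k)                        # 4x4 output

# Without padding, ten stacked 5x5 convolutions shrink a 240x240 image
size = 240
for _ in range(10):
    size, _unused = conv_output_shape(size, size, 5, 5)
kept_fraction = size ** 2 / 240 ** 2           # share of the original area that remains
```

With a 5×5 kernel, choosing `p_h = p_w = 4` (two rows or columns on each side) would keep the 240×240 size unchanged at every layer.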

# RNN

# Method Deep Learning

# Parameter Management

Training begins with choosing the network architecture and hyperparameter values; the training loop then searches for the parameter values that minimize the objective function. After training, these parameters are needed to make predictions, to reuse the model in other contexts, to save it to disk, or to examine it for scientific insight. The parameters are typically saved in a file called a checkpoint. Working with parameters involves:
- Accessing parameters for debugging, model diagnostics, and visualization.
- Initializing parameters (typically handled automatically by the framework).
- Saving the parameters together with the model architecture and configuration.
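
As a framework-agnostic sketch of what a checkpoint is, the snippet below saves a dictionary of named parameter arrays to a single file with NumPy and loads it back. The parameter names, shapes, and file name are hypothetical; deep learning frameworks provide their own checkpoint formats, so treat this only as an illustration of the idea.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters of a small two-layer network
params = {
    "W1": rng.normal(size=(3, 4)),
    "b1": np.zeros(4),
    "W2": rng.normal(size=(4, 1)),
    "b2": np.zeros(1),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.npz")
    np.savez(path, **params)                 # write all arrays to one file
    with np.load(path) as data:              # read them back by name
        restored = {name: data[name] for name in data.files}
```

In PyTorch the analogous step is typically `torch.save(model.state_dict(), path)`, followed later by `model.load_state_dict(torch.load(path))`.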

**Access the Parameters**
After defining a model, you can access its parameters. In PyTorch, `model.state_dict()` returns an ordered dictionary that maps each parameter name (for example `fc1.weight`) to its tensor.

```python
# PyTorch
model.state_dict()

# TensorFlow
model.get_weights()  # Returns a list of NumPy arrays containing all the weights and biases of the model
```

**Targeted Parameters**
Often you need one specific parameter rather than the whole collection. In PyTorch, `model.named_parameters()` yields `(name, parameter)` pairs for every trainable parameter, that is, every parameter updated during training. The names identify the layer and the parameter (for example `fc1.weight`), and each parameter's `.data` attribute holds its value as a tensor.

```python
# PyTorch
for name, param in model.named_parameters():
    print(name, param.shape)
```
**Initialize Parameters**
- In deep learning, you can customize how the parameters are initialized. Common choices include zero initialization, random initialization, and Xavier (Glorot) initialization. The choice of initialization can significantly impact the model's performance.
 
```python
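# PyTorch
import torch.nn as nn

def init_weights(m):
    # Apply Xavier (Glorot) uniform initialization to every linear layer
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.constant_(m.bias, 0.01)

model.apply(init_weights)  # model is an nn.Module here

# TensorFlow
import tensorflow as tf
from tensorflow.keras.layers import Dense

initializer = tf.keras.initializers.GlorotUniform()
model.add(Dense(64, kernel_initializer=initializer))  # model is a Keras Sequential model here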
```
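
For intuition about what Xavier (Glorot) uniform initialization does, here is a minimal NumPy sketch. It draws weights uniformly from $[-a, a]$ with $a = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$, which gives a variance of $2 / (\text{fan\_in} + \text{fan\_out})$. The function name `xavier_uniform` and the layer sizes are assumptions for illustration; the library calls above should be preferred in practice.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = xavier_uniform(64, 32, rng)        # hypothetical layer: 64 inputs, 32 units
limit = np.sqrt(6.0 / (64 + 32))       # sqrt(0.0625) = 0.25
target_variance = 2.0 / (64 + 32)      # variance of U(-a, a) is a^2 / 3
```

Scaling the range by the layer sizes is meant to keep the variance of activations and gradients roughly constant from layer to layer.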