In [1]:
from keras.datasets import mnist

from keras.models import Sequential
from keras.layers import *
from keras.optimizers import Adam
import matplotlib.pyplot as plt
import numpy as np
import math

(x_train, y_train), (x_test, y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [3]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)


In [4]:
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)

In [5]:
x_train = (x_train.astype('float32')- 127.5)/127.5
x_test = (x_test.astype('float32')- 127.5)/127.5

In [6]:
print(np.max(x_train))
print(np.min(x_train))

1.0
-1.0
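The scaling in the cell above maps pixel values from [0, 255] to [-1, 1], which matches the range of the `tanh` activation on the generator's last layer. The following is a minimal sketch of that mapping and its inverse (useful when plotting generated images); the four sample pixel values are made up for illustration.

```python
import numpy as np

# Hypothetical pixel values covering the extremes and the midpoint
pixels = np.array([0, 127, 128, 255], dtype=np.uint8)

# Same transform as in the notebook: [0, 255] -> [-1, 1]
scaled = (pixels.astype('float32') - 127.5) / 127.5

# Inverse transform, e.g. to display generator output as an image
restored = np.round(scaled * 127.5 + 127.5).astype(np.uint8)

print(scaled)
print(restored)
```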


In [8]:
num_epochs = 50
batch_size = 256
no_of_batches = math.ceil(x_train.shape[0]/batch_size)
half_batch = math.ceil(batch_size/2)
noise_dim = 100
# Use these Adam params for GANs
adam = Adam(learning_rate=0.0002, beta_1=0.5)
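As a quick check of the batch arithmetic, the sketch below recomputes the derived values for the 60000 training images reported earlier. The `last_batch_size` name is introduced here for illustration only; it shows that the final batch of an epoch is smaller than `batch_size` when batches are taken as consecutive slices.

```python
import math

num_samples = 60000   # MNIST training set size, from x_train.shape
batch_size = 256

no_of_batches = math.ceil(num_samples / batch_size)
half_batch = math.ceil(batch_size / 2)

# Size of the final, partial batch if batches are consecutive slices (illustrative)
last_batch_size = num_samples - (no_of_batches - 1) * batch_size

print(no_of_batches, half_batch, last_batch_size)
```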

In [11]:
# define the generator
generator = Sequential()
generator.add(Dense(7*7*128, input_dim=noise_dim))  # step 1: project the noise vector to 7*7*128 values
generator.add(Reshape((7,7,128)))  # reshape to a 7*7 feature map with 128 channels
generator.add(LeakyReLU(0.2))  # add non-linearity
generator.add(BatchNormalization())
# step 2: upsample to 14*14*64
generator.add(UpSampling2D())
generator.add(Conv2D(64, kernel_size=(5,5), padding='same'))
generator.add(LeakyReLU(0.2))
generator.add(BatchNormalization())
# step 3: upsample to 28*28*1
generator.add(UpSampling2D())
generator.add(Conv2D(1, kernel_size=(5,5), padding='same', activation='tanh'))
generator.compile(loss='binary_crossentropy', optimizer=adam)
generator.summary()

# Define the Discriminator Model
discriminator = Sequential()
discriminator.add(Conv2D(64, kernel_size=(5,5), strides=(2,2), padding='same', input_shape=(28,28,1)))
discriminator.add(LeakyReLU(0.2))

# Next Conv layer (14*14*64) to 7*7*128
discriminator.add(Conv2D(128, kernel_size=(5,5), strides=(2,2), padding='same'))
discriminator.add(LeakyReLU(0.2))

# Flatten the output
discriminator.add(Flatten())
discriminator.add(Dense(1, activation='sigmoid'))
discriminator.compile(loss='binary_crossentropy', optimizer=adam)
discriminator.summary()



The line `generator.add(BatchNormalization())` adds a Batch Normalization layer to the generator of the Generative Adversarial Network (GAN) defined above. Batch Normalization is a technique used to improve the training stability and performance of deep neural networks.

## Purpose of Batch Normalization

Batch Normalization serves several important purposes in neural networks:

1. **Mitigating Internal Covariate Shift**: It normalizes the inputs to each layer, reducing the internal covariate shift problem where the distribution of each layer's inputs changes during training[1].

2. **Faster Training**: By stabilizing the input distribution, it allows for faster convergence during the training process[5].

3. **Higher Learning Rates**: Batch Normalization enables the use of higher learning rates without the risk of divergence, further accelerating training[5].

4. **Regularization Effect**: It introduces a slight regularization effect, potentially reducing the need for other regularization techniques like dropout[6].

5. **Improved Gradient Flow**: It helps in maintaining a more stable gradient flow through the network, mitigating issues like vanishing or exploding gradients[1].

## Implementation in GANs

In the context of a GAN's generator:

1. **Stabilizing Training**: GANs are notoriously difficult to train, and Batch Normalization can help stabilize the training process[7].

2. **Normalizing Feature Distributions**: It ensures that the generator's intermediate layers produce features with consistent distributions, which can be crucial for generating high-quality outputs[3].

3. **Improved Learning**: By normalizing the inputs to each layer, it allows the generator to learn more effectively across its entire depth[2].

## How It Works

When you add `BatchNormalization()` to your generator:

1. It normalizes the output of the previous layer by subtracting the batch mean and dividing by the batch standard deviation[1].

2. It then scales and shifts the normalized values using learned parameters (gamma and beta)[1].

3. This process is applied to each mini-batch during training, helping to maintain a consistent distribution of activations throughout the network[3].

By incorporating Batch Normalization, you're essentially giving your generator model a tool to self-regulate its internal representations, leading to more stable and efficient training, especially in the complex and often unstable environment of GAN training[7].
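For the generator's convolutional feature maps, the steps described above are applied per channel: Keras' `BatchNormalization` defaults to `axis=-1`, so each channel gets one mean and one variance pooled over the batch, height, and width. The NumPy sketch below mimics that behaviour on random data shaped like the generator's first block. The input values are made up, and `gamma` and `beta` are set to their initial values rather than learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations shaped like the generator's first block: (batch, 7, 7, 128)
x = rng.normal(loc=3.0, scale=2.0, size=(16, 7, 7, 128)).astype("float32")

eps = 1e-3                               # Keras' default epsilon for BatchNormalization
gamma = np.ones(128, dtype="float32")    # scale, initialised to 1
beta = np.zeros(128, dtype="float32")    # shift, initialised to 0

# One mean and one variance per channel, pooled over batch, height and width
mean = x.mean(axis=(0, 1, 2))
var = x.var(axis=(0, 1, 2))

y = gamma * (x - mean) / np.sqrt(var + eps) + beta

# Each channel of y now has roughly zero mean and unit standard deviation
print(y.mean(axis=(0, 1, 2))[:3])
print(y.std(axis=(0, 1, 2))[:3])
```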

Citations:
[1] https://en.wikipedia.org/wiki/Batch_normalization
[2] https://towardsdatascience.com/batch-norm-explained-visually-how-it-works-and-why-neural-networks-need-it-b18919692739?gi=f4d3c2a40966
[3] https://viso.ai/deep-learning/batch-normalization/
[4] https://github.com/christianversloot/machine-learning-articles/blob/main/batch-normalization-with-pytorch.md
[5] https://towardsdatascience.com/batch-normalization-8a2e585775c9
[6] https://www.geeksforgeeks.org/what-is-batch-normalization-in-deep-learning/
[7] https://stackoverflow.com/questions/58315023/gan-with-batch-norm-acting-very-weird-both-discriminator-and-generator-get-zero
[8] https://www.reddit.com/r/MachineLearning/comments/ql5hdb/d_why_do_we_apply_batch_normalization_between/

The mathematics behind batch normalization (BN) is designed to normalize the inputs to each layer in a neural network, reducing internal covariate shift and improving training stability. Let's break down the key mathematical components of batch normalization:

## Normalization Step

For a given mini-batch B of size m, and a particular feature k, the normalization is performed as follows:

1. **Mini-batch mean:**
   $$ \mu_B^{(k)} = \frac{1}{m} \sum_{i=1}^m x_i^{(k)} $$

2. **Mini-batch variance:**
   $$ (\sigma_B^{(k)})^2 = \frac{1}{m} \sum_{i=1}^m (x_i^{(k)} - \mu_B^{(k)})^2 $$

3. **Normalized value:**
   $$ \hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{(\sigma_B^{(k)})^2 + \epsilon}} $$

   Where ε is a small constant added for numerical stability[1][3].

## Scaling and Shifting

After normalization, BN applies a learnable scale and shift:

$$ y_i^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)} $$

Where γ^(k) and β^(k) are learnable parameters[2][3].
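The normalization and scale-and-shift equations can be written out directly in NumPy. This is a minimal sketch for a 2D input of shape (m, number of features); the function name `batch_norm_train` and the sample values are assumptions for illustration, not part of Keras. The last lines show that choosing γ = sqrt(σ² + ε) and β = μ recovers the original input.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for x of shape (m, num_features)."""
    mu = x.mean(axis=0)                     # mini-batch mean, one per feature k
    var = x.var(axis=0)                     # mini-batch variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized value
    y = gamma * x_hat + beta                # learnable scale and shift
    return y, mu, var

# Made-up mini-batch: m = 4 samples, 2 features on very different scales
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

y, mu, var = batch_norm_train(x, gamma=np.ones(2), beta=np.zeros(2))
print(mu)
print(var)
print(y)

# With gamma = sqrt(var + eps) and beta = mu, the layer acts as the identity
y_id, _, _ = batch_norm_train(x, gamma=np.sqrt(var + 1e-5), beta=mu)
```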

## Inference Phase

During inference, BN uses running statistics accumulated over the training mini-batches $B_1, \dots, B_j$:

1. **Running mean:**
   $$ E[x^{(k)}] = \frac{1}{j} \sum_{i=1}^j \mu_{B_i}^{(k)} $$

2. **Running variance:**
   $$ Var[x^{(k)}] = \frac{1}{j} \sum_{i=1}^j (\sigma_{B_i}^{(k)})^2 $$

Where j is the number of mini-batches[4].

The inference transformation becomes:

$$ y^{(k)} = \gamma^{(k)} \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}] + \epsilon}} + \beta^{(k)} $$
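The sketch below computes the running statistics as a plain average over mini-batches, as in the formulas above, and then applies the inference transformation to a new sample. For comparison it also tracks an exponential moving average, which is the update rule Keras' `BatchNormalization` applies through its `momentum` argument (default 0.99). The data, the number of batches, and the helper name `batch_norm_infer` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training stream: 200 mini-batches of 64 samples with 3 features
batches = [rng.normal(5.0, 2.0, size=(64, 3)) for _ in range(200)]

# Plain average of the per-batch statistics
avg_mean = np.mean([b.mean(axis=0) for b in batches], axis=0)
avg_var = np.mean([b.var(axis=0) for b in batches], axis=0)

# Exponential moving average, starting from Keras' initial values (mean 0, variance 1)
momentum = 0.99
ema_mean, ema_var = np.zeros(3), np.ones(3)
for b in batches:
    ema_mean = momentum * ema_mean + (1 - momentum) * b.mean(axis=0)
    ema_var = momentum * ema_var + (1 - momentum) * b.var(axis=0)

def batch_norm_infer(x, mean, var, gamma=1.0, beta=0.0, eps=1e-3):
    """Inference-mode batch norm using fixed statistics."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x_new = np.array([[5.0, 7.0, 3.0]])
y_new = batch_norm_infer(x_new, avg_mean, avg_var)
print(y_new)
print(ema_mean)
```

After only 200 updates the moving average has not yet caught up with the plain average, because each update moves it just 1% of the way toward the current batch statistic.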

## Gradient Properties

An important property of BN is that it bounds the magnitude of the gradients:

$$ \|\nabla_{y_i} \hat{L}\| \leq C \cdot \|\nabla_z \hat{L}\| $$

Where C is a constant, and z is the layer output after BN[3].

## Linear Transformation

With fixed statistics, as at inference time, BN reduces to an affine transformation of its input (a per-feature scale plus a shift):

$$ y = \frac{\gamma}{\sqrt{Var_x + \epsilon}} x + \beta - \frac{\gamma E_x}{\sqrt{Var_x + \epsilon}} $$

This form shows how BN scales and shifts the input[4].
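The equivalence can be checked numerically by folding the four BN parameters into one scale and one offset per feature. In this sketch all parameter values are invented for illustration, and `scale` and `offset` are names introduced here.

```python
import numpy as np

# Made-up per-feature parameters and fixed statistics
gamma, beta = np.array([1.5, 0.5]), np.array([0.1, -0.2])
mean, var, eps = np.array([2.0, -1.0]), np.array([4.0, 0.25]), 1e-3

# Fold everything into y = scale * x + offset
scale = gamma / np.sqrt(var + eps)
offset = beta - scale * mean

x = np.array([[0.0, 0.0],
              [2.0, -1.0],
              [5.0, 3.0]])

direct = gamma * (x - mean) / np.sqrt(var + eps) + beta   # inference formula
folded = scale * x + offset                               # affine form

print(np.allclose(direct, folded))
```

The second row of `x` equals the stored mean, so its output is exactly `beta` in both forms.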

By applying these mathematical operations, batch normalization helps stabilize the distribution of layer inputs throughout training, allowing for faster convergence and the use of higher learning rates. The learnable parameters γ and β give the network the flexibility to represent the identity transform if necessary, ensuring that BN doesn't limit the network's expressive power.

Citations:
[1] https://towardsdatascience.com/the-math-behind-batch-normalization-90ebbc0b1b0b
[2] https://www.datacamp.com/tutorial/batch-normalization-tensorflow
[3] https://en.wikipedia.org/wiki/Batch_normalization
[4] https://datascience.stackexchange.com/questions/105152/equations-in-batch-normalization-theory-and-how-to-use-it-with-tensorflow
[5] https://pub.towardsai.net/demystifying-batch-normalization-theory-mathematics-and-implementation-f04077298807?gi=7c7b6d77f6c2
[6] https://www.reddit.com/r/MachineLearning/comments/ql5hdb/d_why_do_we_apply_batch_normalization_between/
[7] https://towardsdatascience.com/batch-norm-explained-visually-how-it-works-and-why-neural-networks-need-it-b18919692739?gi=f4d3c2a40966

In [18]:

!ls

 images   images_new   ls   models  'models!'   models_new   sample_data


In [None]:
# define the generator
generator = Sequential()
generator.add(Dense(7*7*128, input_dim=noise_dim))  # step 1: project the noise vector to 7*7*128 values
generator.add(Reshape((7,7,128)))  # reshape to a 7*7 feature map with 128 channels
generator.add(LeakyReLU(0.2))  # add non-linearity
generator.add(BatchNormalization())
# step 2: upsample to 14*14*64 (strides=(2,2) is what doubles the spatial size)
generator.add(Conv2DTranspose(64, kernel_size=(5,5), strides=(2,2), padding='same'))
generator.add(LeakyReLU(0.2))
generator.add(BatchNormalization())
# step 3: upsample to 28*28*1

generator.add(Conv2DTranspose(1, kernel_size=(5,5), strides=(2,2), padding='same', activation='tanh'))
generator.compile(loss='binary_crossentropy', optimizer=adam)
generator.summary()

# Define the Discriminator Model
discriminator = Sequential()
discriminator.add(Conv2D(64, kernel_size=(5,5), strides=(2,2), padding='same', input_shape=(28,28,1)))
discriminator.add(LeakyReLU(0.2))

# Next Conv layer (14*14*64) to 7*7*128
discriminator.add(Conv2D(128, kernel_size=(5,5), strides=(2,2), padding='same'))
discriminator.add(LeakyReLU(0.2))

# Flatten the output
discriminator.add(Flatten())
discriminator.add(Dense(1, activation='sigmoid'))
discriminator.compile(loss='binary_crossentropy', optimizer=adam)
discriminator.summary()
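The size comments in the model definitions can be checked by hand without building the models. The sketch below traces only the spatial size (height or width) through each layer, using the Keras output-size rules for `padding='same'`: a `Conv2D` gives ceil(n / stride), a `Conv2DTranspose` gives n * stride, and `UpSampling2D` with its default size doubles n. The helper names are introduced here for illustration.

```python
import math

def conv_same(n, stride=1):
    """Spatial size after Conv2D with padding='same'."""
    return math.ceil(n / stride)

def conv_transpose_same(n, stride=1):
    """Spatial size after Conv2DTranspose with padding='same'."""
    return n * stride

def upsample(n, size=2):
    """Spatial size after UpSampling2D (default size is 2)."""
    return n * size

# Generator built from UpSampling2D followed by Conv2D: 7 -> 14 -> 28
g = conv_same(upsample(7))
g = conv_same(upsample(g))

# Generator built from Conv2DTranspose, with and without a stride of 2
t_strided = conv_transpose_same(conv_transpose_same(7, 2), 2)
t_unstrided = conv_transpose_same(conv_transpose_same(7, 1), 1)

# Discriminator: two stride-2 convolutions, 28 -> 14 -> 7
d = conv_same(conv_same(28, 2), 2)

print(g, t_strided, t_unstrided, d)
```

The trace shows that a `Conv2DTranspose` only upsamples when its stride is greater than 1; with the default stride the feature map stays at 7*7 and would not match the discriminator's 28*28*1 input.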