# Week 3
Here we look at some problems related to GAN training, such as mode collapse, and then at how to resolve them.

### Probability Distribution - Modes
Usually, the goal of the **generator** is to learn the *probability distribution* of the data, while the goal of the
discriminator is to learn to distinguish real data from fake.

The probability distribution can have one mode or multiple modes:

<img src="images/modes.png" width=420 height=200 />

##### Generator Problem
The problem for the generator is that when the training data contain multiple modes, it can end up learning only one
of them and then keep generating the same class.

## Problems with BCE Loss

Binary Cross-Entropy loss, or BCE loss, is traditionally used for training GANs, but it isn't the best way to do it.
With BCE loss, GANs are prone to mode collapse and other problems. In this video, you'll see why GANs trained with BCE
loss are susceptible to vanishing gradient problems. To that end, you'll review the BCE loss function and what it means
for the generator's and the discriminator's objectives. Then you'll see when and why GANs with this BCE loss are likely
to have those vanishing gradient problems.

<img src="images/bce.png" width=420 height=200 />

Remember the form of the BCE loss function: it's just an average of the cost
for the discriminator for misclassifying real and fake observations, where the first term is for the reals and the second
term is for the fakes. The higher this cost value is, the worse the discriminator is doing. The generator
wants to maximize this cost because that means the discriminator is doing poorly and is classifying its fake values
as reals, whereas the discriminator wants to minimize this cost function because that means it's classifying
things correctly. Of course, the generator only sees the fake side of things, so it doesn't actually see anything
about the reals. This maximization and minimization is often called a minimax game, and that's how you might hear
it referred to.
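
As a rough illustration of the cost described above, here is a minimal pure-Python sketch of BCE averaged over a batch. The function name `bce_loss`, the small `eps` guard, and the sample predictions are assumptions made for this example, not part of the course material.

```python
import math

def bce_loss(predictions, labels):
    """Average binary cross-entropy over a batch.

    predictions: discriminator outputs in (0, 1)
    labels: 1 for real, 0 for fake
    """
    eps = 1e-12  # guards against log(0); an illustrative choice
    total = 0.0
    for p, y in zip(predictions, labels):
        # First term is active for reals (y = 1), second for fakes (y = 0).
        total += y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return -total / len(predictions)

# One real (label 1) and one fake (label 0) in each toy batch.
confident = bce_loss([0.9, 0.1], [1, 0])  # discriminator is right
confused = bce_loss([0.5, 0.5], [1, 0])   # discriminator cannot tell
fooled = bce_loss([0.1, 0.9], [1, 0])     # discriminator is wrong
print(confident, confused, fooled)
```

The cost rises as the discriminator does worse, which is the direction the generator pushes in and the discriminator pushes against.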

<img src="images/bce2.png" width=420 height=200 />

At the end of this minimax game, the generator and discriminator interaction translates to
a more general objective for the whole GAN architecture. That is to make the real and generated data distributions
of features very similar, trying to get the generated distribution to be as close as possible to the reals.
This minimax of the Binary Cross-Entropy loss function is somewhat approximating the minimization of another
complex cost function that's trying to make this happen. Of course, during this whole training process, the
discriminator naturally is trying to delineate the real and fake distributions as much as possible, whereas
the generator is trying to make the generated distribution look more like the reals.

<img src="images/gan.png" width=420 height=200 />

However, let's take a
step back again to the generator's and discriminator's roles. The discriminator in a GAN needs to output just
a single value prediction between zero and one, whereas the generator actually needs to produce a pretty complex
output composed of multiple features to try and fool the discriminator, for example, an image. As a result,
the discriminator's job tends to be a little bit easier. To put it another way, it's more straightforward
to look at images in a museum than it is to paint those masterpieces, right? During training it's possible
for the discriminator to outperform the generator; in fact, it's quite common.

<img src="images/lossgan.png" width=420 height=200 />



But at the beginning
of training, this isn't such a big problem because the discriminator isn't that good. It has trouble
distinguishing the generated and real distributions. There's some overlap and it's not quite sure. As a result, it's able
to give useful feedback in the form of a non-zero gradient back to the generator.

<img src="images/begintrain.png" width=420 height=200 />




However, as the discriminator gets better during training, it starts to delineate the generated and real
distributions a little bit more, such that it can distinguish them much more easily. The real distribution will
be centered around one and the generated distribution will start to approach zero. As a result, as the discriminator
gets better, it'll start giving less informative feedback. In fact, it might
give gradients closer to zero, and that becomes unhelpful for the generator because then the generator doesn't know how
to improve. This is how the vanishing gradient problem will arise.
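
The saturation described above can be sketched numerically. This example assumes the discriminator ends in a sigmoid applied to a raw score (the logit), and looks at the derivative of the fake term `log(1 - sigmoid(s))` with respect to that score, which works out to `-sigmoid(s)`. The function names and the chosen logits are illustrative only.

```python
import math

def sigmoid(s):
    """Squash a raw score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def generator_grad(s):
    """Derivative of log(1 - sigmoid(s)) with respect to the raw score s."""
    return -sigmoid(s)

# As the discriminator grows more certain that a sample is fake, its raw
# score becomes more negative and the gradient magnitude shrinks toward zero.
for s in [0.0, -2.0, -5.0, -10.0]:
    print(f"logit={s:6.1f}  D(fake)={sigmoid(s):.5f}  grad={generator_grad(s):.5f}")
```

The more confidently the discriminator labels a fake as fake, the smaller the signal that flows back to the generator.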

<img src="images/aftertrain.png" width=420 height=200 />

In summary, GANs try to make the generated
distribution look similar to the real one by minimizing the underlying cost function that measures how different the
distributions are. As the discriminator improves during training, and it sometimes improves more easily than the
generator, that underlying cost function will have flat regions when the distributions are very different from one
another, where the discriminator is able to distinguish between the reals and the fakes much more easily and say,
"Reals look really real, a label of one, and fakes look really fake, a label of zero." All of this will cause vanishing
gradient problems.

<img src="images/end.png" width=420 height=200 />

### Earth Mover's Distance

When using BCE loss to train a GAN, you often encounter mode collapse, and vanishing gradient problems due to the underlying
cost function of the whole architecture. Even though there is an infinite number of decimal values between zero and one,
the discriminator, as it improves, will be pushing towards those ends. In this video, you'll see a different underlying
cost function called Earth mover's distance, that measures the distance between two distributions and generally outperforms
the one associated with BCE loss for training GANs. At the end I'll show you why this helps with the vanishing gradient
problem.

<img src="images/emd.png" width=420 height=200 />

So take these generated and real distributions with the same variance but different means, and assume they are
normal distributions. What the Earth mover's distance does is measure how different these two distributions
are by estimating the amount of effort it takes to make the generated distribution equal to the real one.

<img src="images/emd1.png" width=420 height=200 />


So intuitively, if the generated distribution were a pile of dirt, how difficult would it be to move that pile of dirt
and mold it into the shape and location of the real distribution? That's what this Earth mover's distance means.
The function depends on both the distance and the amount that the generated distribution needs to be moved.
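
For intuition, the one-dimensional case can be computed directly: with two equal-sized samples, the Earth mover's distance is the average gap between the sorted values. The sketch below assumes that simplified setting; the function name and the toy samples are made up for illustration.

```python
def earth_movers_distance_1d(a, b):
    """Earth mover's distance between two equal-sized 1-D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-sized samples")
    # Match the i-th smallest point of a with the i-th smallest point of b.
    pairs = zip(sorted(a), sorted(b))
    return sum(abs(x - y) for x, y in pairs) / len(a)

real = [4.0, 5.0, 6.0]
near_fake = [3.0, 4.0, 5.0]    # same shape, shifted by 1
far_fake = [-6.0, -5.0, -4.0]  # same shape, shifted by 10

print(earth_movers_distance_1d(real, near_fake))
print(earth_movers_distance_1d(real, far_fake))
```

Moving the same pile ten times farther costs ten times as much, so the measure keeps growing with distance instead of flattening out.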


So the problem with BCE loss is that as the discriminator
improves, it starts giving more extreme values between zero and one, that is, values closer to one and closer to zero.
As a result, its feedback to the generator becomes less helpful, and the generator stops learning due to
vanishing gradient problems.

<img src="images/emd4.png" width=420 height=200 />

With Earth mover's distance, however, there's no such ceiling at zero and one,
so the cost function continues to grow regardless of how far apart these distributions are. The gradient of this measure
won't approach zero and, as a result, GANs are less prone to vanishing gradient problems and, by extension, to
mode collapse.

<img src="images/emd4.png" width=420 height=200 />

So wrapping up, Earth mover's distance is a function of the effort to make a distribution
equal to another. So it depends on both distance and amount. Unlike BCE, it doesn't have flat regions when the
distributions start to get very different, and the discriminator starts to improve a lot. So approximating this
measure eliminates the vanishing gradient problem, and reduces the likelihood of mode collapse in GANs.
In the next few videos, I'll show you a loss function that uses Earth mover's distance for training GANs.

## Wasserstein Loss - Approximation of EMD

As you've seen previously, BCE Loss is used traditionally to train GANs. However, it has many problems due to the form of
the function it's approximated by. So in this video I'll introduce you to an alternative loss function called
Wasserstein Loss, or W-Loss for short, that approximates the Earth Mover's Distance that you saw in the previous video.
To that end, first you'll see an alternative way to look at the BCE Loss function that's simpler and more compact.
Then I'll show you how W-Loss is calculated and compare this loss with BCE Loss. So, BCE Loss is computed by a
long equation that essentially measures how badly, on average, some observations are being classified by the
discriminator as fake and real.


<img src="images/bce_simple.png" width=420 height=200 />

So, the generator in GANs wants to maximize this cost, because that means the discriminator is saying that its fake
values seem really real, while the discriminator wants to minimize that cost.
And so, this is often referred to as a Minimax game. And this very long equation for BCE Loss can be simplified as follows.
The sum and division over the m examples is nothing but a mean, or expected value. The first part inside the sum measures
how badly the discriminator classifies real observations, where y equals 1 and 1 means real. The second part measures
how badly it classifies fake observations produced by the generator, where the factor 1 minus y is active because
y equals 0, and 0 means fake.
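
Written out, the simplification described above might look like the following. The symbols are assumptions chosen for this sketch: `h` is the discriminator's prediction with parameters `\theta` in the long form, and `d`, `g`, and `z` are the discriminator, the generator, and the noise vector in the compact form.

```latex
% Full BCE cost over m examples; y is the label (1 = real, 0 = fake)
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h\left(x^{(i)}, \theta\right)
          + \left(1 - y^{(i)}\right) \log\left(1 - h\left(x^{(i)}, \theta\right)\right) \right]

% Compact minimax form: the sum and division by m become expected values
\min_{d} \max_{g} \; -\left[ \mathbb{E}\left(\log d(x)\right)
          + \mathbb{E}\left(\log\left(1 - d(g(z))\right)\right) \right]
```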

### W-Loss

W-Loss, on the other hand, approximates the Earth Mover's Distance between the real and generated distributions,
but it has nicer properties than BCE. However, it does look very similar to the simplified form for the BCE Loss, and in
this case the function calculates the difference between the expected values of the predictions of the discriminator.
Here it's called the critic, and I'll go over that later, so I'm going to represent it with a c here. This is c of a
real example x, versus c of a fake example g of z, where the generator takes in a noise vector z to produce a fake
image g of z, or perhaps you can call it x-hat.

<img src="images/wloss.png" width=420 height=200 />

So the discriminator looks at these two things, and it wants to maximize the distance between its thoughts on the reals
versus its thoughts on the fakes. So it's trying to push away these two distributions to be as far apart as possible.
Meanwhile, the generator wants to minimize this difference, because it wants the discriminator to think that its fake
images are as close as possible to the reals. Note that, in contrast with BCE, there are no logs in this function,
since the critic's outputs are no longer bounded to be between 0 and 1.
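
A minimal sketch of the two objectives, assuming the critic's scores for a batch of reals and a batch of fakes are already available as plain lists. Both are written as losses to minimize, which is a common convention rather than something the notes prescribe, and the function names and scores are hypothetical.

```python
def mean(values):
    """Batch average, standing in for the expected value."""
    return sum(values) / len(values)

def critic_loss(critic_real, critic_fake):
    """Negated W-Loss: minimizing this maximizes E[c(x)] - E[c(g(z))]."""
    return -(mean(critic_real) - mean(critic_fake))

def generator_loss(critic_fake):
    """Minimizing this raises the critic's scores on the fakes."""
    return -mean(critic_fake)

# Hypothetical critic scores; any real numbers are allowed, no logs needed.
scores_on_reals = [2.0, 3.0]
scores_on_fakes = [-1.0, 0.0]

print(critic_loss(scores_on_reals, scores_on_fakes))
print(generator_loss(scores_on_fakes))
```

Only the fake term appears in the generator's loss because the generator has no influence on the critic's scores for the reals.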

<img src="images/wloss1.png" width=420 height=200 />

##### Discriminator output
So, for the **BCE Loss** to make sense, the output of the discriminator needs to be a prediction between 0 and 1. And so the
discriminator's neural network for GANs trained with BCE Loss has a sigmoid activation function in the output layer
to squash the values between 0 and 1.

<img src="images/disc-bce.png" width=420 height=200 />

**W-Loss**, however, doesn't have that requirement at all, so you can actually have a linear layer at the end of the
discriminator's neural network, and that could produce any real value output. And you can interpret that output as,
how real an image is considered by the critic, which, by the way, is now what we're calling the discriminator instead,
because it's no longer bounded between 0 and 1, where 0 means fake, and 1 means real. It's no longer classifying into
these two, or discriminating between these two classes. And so, as a result, it wouldn't make that much sense to call
that neural network a discriminator, because it doesn't discriminate between the classes. And so, for W-Loss, the
equivalent to a discriminator is called a critic, and what it tries to do is, maximize the distance between its
evaluation on a fake, and its evaluation on a real.
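
The difference between the two output layers can be sketched with plain functions standing in for the final layer of each network. The names `discriminator_output` and `critic_output` and the raw scores are assumptions for illustration.

```python
import math

def discriminator_output(raw_score):
    """BCE-style head: a sigmoid squashes the raw score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))

def critic_output(raw_score):
    """W-Loss-style head: a linear output leaves the score unbounded."""
    return raw_score

# Large raw scores become nearly indistinguishable after the sigmoid,
# while the critic keeps them apart.
for raw in [2.0, 8.0, 12.0]:
    print(raw, discriminator_output(raw), critic_output(raw))
```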

<img src="images/disc-wloss.png" width=420 height=200 />

##### W-Loss vs BCE Loss

So, one of the main differences between W-Loss and BCE Loss is that the discriminator under BCE Loss outputs a value
between 0 and 1, while the critic in W-Loss will output any number. Additionally, the forms of the cost functions are
very similar, but W-Loss doesn't have any logarithms within it, and that's because it's a measure of how far the
prediction of the critic for the real is from its prediction on the fake. Meanwhile, BCE Loss does measure a distance
for a fake or a real, but to a ground truth of 1 or 0. And so what's important to take away here is largely that the
discriminator is bounded between 0 and 1, whereas the critic is no longer bounded and is just trying to separate the two
distributions as much as possible.

<img src="images/bce-vs-wloss.png" width=420 height=200 />

And as a result, because it's not bounded, the critic is allowed to improve without degrading its feedback back to the
generator. And this is because, it doesn't have a vanishing gradient problem, and this will mitigate against mode
collapse, because the generator will always get useful feedback back.

##### W-Loss Summary

<img src="images/wloss-summary.png" width=420 height=200 />


So, in summary, W-Loss looks very similar to BCE Loss, but it isn't as complex a mathematical expression.
Under the hood, what it does is approximate the Earth Mover's Distance, so it helps prevent mode collapse and vanishing
gradient problems. However, there is an additional condition on this cost function for it to work well and for it to be
valid.

### Wasserstein Loss or W-Loss

Wasserstein Loss, or W-Loss, solves some problems faced by GANs, like mode collapse and vanishing gradients. But for it to
work well, there is a special condition that needs to be met by the critic. In this video, you'll see what the continuity
condition on the critic neural network means and why that condition is important when using W-Loss for training GANs,
and trust me, it's worth it, so stay tuned.


#### Condition to use W-Loss -> 1-Lipschitz Continuous (1-L)

W-Loss is a simple expression that computes the difference between the expected values of the critic's output for the
real examples x and its predictions on the fake examples g(z). The generator tries to minimize this expression, trying
to get the generated examples to be as close as possible to the real examples, while the critic wants to maximize this
expression because it wants to differentiate between the reals and the fakes; it wants the distance to be as large as
possible. However, for training GANs using W-Loss, the critic has a special condition. It needs to be something called
1-Lipschitz Continuous, or 1-L Continuous for short.

<img src="images/wloss-1L.png" width=420 height=200 />

This condition sounds more sophisticated than it really is. For a function like the critic's neural network to be
1-Lipschitz Continuous, the norm of its gradient needs to be at most one. What that means is that the slope can't be
greater than one at any point; its gradient can't be greater than one. To check if a function, for example the
function you see here, f(x) equals x squared, is 1-Lipschitz Continuous,

<img src="images/wloss-1L-1.png" width=420 height=200 />




you want to go along every point in this function and make sure its slope is less than or equal to one, or its gradient
is less than or equal to one. What you can do is draw two lines through the point that you're evaluating, one where
the slope is exactly one and one where the slope is exactly negative one. You want to make sure that the growth of this
function never goes out of bounds from these lines, because staying within these lines means that the function is
growing at most linearly.

<img src="images/wloss-1L-2.png" width=420 height=200 />

Here this function is not 1-Lipschitz Continuous because it's coming out of bounds in all these sections.
It's not staying within this green area, which shows that it's growing more than linearly.

<img src="images/wloss-1L-3.png" width=420 height=200 />


Look at another example here. This is a smooth curve function. You want to again check every single point on this
function before you can determine whether or not it is 1-Lipschitz Continuous.

<img src="images/wloss-1L-4.png" width=420 height=200 />

Here it looks fine, the function looks good. Here it also looks good, here it looks good. Let's say you take every single
value and the function never grows more than linearly. Then this function is 1-Lipschitz Continuous.
This condition on the critic's neural network is important for W-Loss because it assures that the W-Loss function is not
only continuous and differentiable, but also that it doesn't grow too much and maintains some stability during training.
This is what makes the underlying Earth Mover's Distance valid, which is what W-Loss is founded on. This is required for
training both the critic's and the generator's neural networks, and it also increases stability because the variation
as the GAN learns will be bounded.
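
The point-by-point check described above can be approximated numerically for a one-dimensional function. This is only a sketch under stated assumptions: it inspects finite-difference slopes on a grid over a chosen interval, so it can suggest but not prove 1-Lipschitz continuity, and `math.sin` is used as an assumed stand-in for a well-behaved smooth curve.

```python
import math

def max_abs_slope(f, lo, hi, steps=10_000):
    """Largest finite-difference slope of f on a grid over [lo, hi]."""
    width = (hi - lo) / steps
    worst = 0.0
    for i in range(steps):
        x0 = lo + i * width
        x1 = x0 + width
        worst = max(worst, abs(f(x1) - f(x0)) / width)
    return worst

def looks_one_lipschitz(f, lo, hi, tol=1e-6):
    """True if no sampled slope exceeds one (up to a small tolerance)."""
    return max_abs_slope(f, lo, hi) <= 1.0 + tol

def square(x):
    return x * x

print(looks_one_lipschitz(square, -5.0, 5.0))    # slope reaches about 10
print(looks_one_lipschitz(math.sin, -5.0, 5.0))  # slope never exceeds 1
```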

<img src="images/wloss-1L-5.png" width=420 height=200 />

To recap, the critic in a GAN that uses W-Loss for training needs to be 1-Lipschitz Continuous in order for its
underlying Earth Mover's Distance comparison between the reals and the fakes to be a valid comparison. In order to
satisfy or try to satisfy this condition during training, there are multiple different methods. Next, we'll learn about
a couple of these methods.

<img src="images/wloss-1L-6.png" width=420 height=200 />

## Enforcing 1-Lipschitz Continuity

1-Lipschitz continuity, or 1-L continuity, of the critic neural network in your Wasserstein loss GAN ensures that
Wasserstein loss is valid. You already saw what this means, and in this video I'll show you how to enforce this condition
when training your critic. First, I'll introduce you to two different methods to enforce 1-L continuity on the critic,
namely weight clipping and gradient penalty.

#### 1-Lipschitz

<img src="images/wloss-1L-7.png" width=420 height=200 />

Then I'll discuss the advantages of gradient penalty over weight clipping. First recall that the critic being 1-L
continuous means that the norm of its gradient is at most one at every single point of this function. This upside-down
triangle is the sign for the gradient, f is the function, here the critic, and x is the image.
The expression just represents the norm of that gradient being less than or equal to one. Using the L2 norm is very
common here, which just means the Euclidean distance, often thought of as the length of a triangle's hypotenuse. It is
the direct distance between two points, not the distance going along the two legs. Intuitively, in two dimensions,
it means that the slope is less than or equal to one. At every single point of this function, it'll remain within these
green triangles.

##### Weight Clipping

<img src="images/wloss-1L-8.png" width=420 height=200 />

With weight clipping, the weights of the critic's neural network are forced to take values within a fixed interval.
After you update the weights during gradient descent, you clip any weights outside of the desired interval.
Basically, what that means is that weights outside that interval, either too high or too low, will be set to the maximum
or the minimum amount allowed. That's clipping the weights. This is one way of enforcing 1-L continuity, but it
has a downside. Forcing the weights of the critic to a limited range of values could limit the critic's ability to
learn and ultimately the GAN's ability to perform, because if the critic can't take on many different parameter values,
that is, if its weights can't take on many different values, it might not be able to improve easily or find a good
optimum to be in. So while this tries to enforce 1-L continuity, it might also limit the critic too much.
Or, on the other hand, it might actually limit the critic too little if you don't clip the weights enough. There's a lot
of hyperparameter tuning involved.
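
The clipping step itself is small. The sketch below applies it to a flat list of weights; the function name and the bound `c=0.01` are illustrative assumptions, and the bound is exactly the hyperparameter that needs tuning.

```python
def clip_weights(weights, c=0.01):
    """Clamp every weight into the interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]

# Hypothetical critic weights right after a gradient descent update.
updated = [-0.5, 0.003, 0.2]
clipped = clip_weights(updated, c=0.01)
print(clipped)
```

Weights already inside the interval are left alone, while anything outside is set to the nearest bound.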

##### Gradient Penalty

<img src="images/wloss-1L-9.png" width=420 height=200 />

The gradient penalty, which is another method, is a much softer way to enforce the critic to be 1-Lipschitz continuous.
With the gradient penalty, all you need to do is add a regularization term to your loss function. What this regularization
term does to your W-Loss function is penalize the critic when its gradient norm is higher than one.
I'll dive into what that means. The regularization term is shown as reg here, which I'll unfold shortly.
Lambda is just a hyperparameter value for how much to weigh this regularization term against the main loss function.



###### Impractical to check gradient at every possible point of the feature space
###### Check on the Interpolation between real and fake

<img src="images/wloss-1L-10.png" width=420 height=200 />

Checking the critic's gradient at every possible point of the feature space is virtually impossible, or at
least not practical. Instead, when implementing gradient penalty, all you do is sample some points by
interpolating between real and fake examples. For instance, you could sample an image from the set of reals and an image
from the set of fakes, grab one of each, and get an intermediate image by interpolating those two images using
a random number epsilon. Epsilon here could be a weight of 0.3, in which case one minus epsilon would be 0.7.
That gets you a random interpolated image that's in between these two images. I'll call this random interpolated
image X hat. It's on X hat that you want the norm of the critic's gradient to be less than or equal to one.
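
The interpolation can be sketched on flattened toy images stored as lists. The function name, the four-pixel images, and the use of `random.random()` to draw epsilon are assumptions for illustration.

```python
import random

def interpolate(real, fake, epsilon):
    """Mix a real and a fake example: epsilon * real + (1 - epsilon) * fake."""
    return [epsilon * r + (1 - epsilon) * f for r, f in zip(real, fake)]

real_image = [1.0, 1.0, 1.0, 1.0]  # flattened toy "image"
fake_image = [0.0, 0.0, 0.0, 0.0]

# Fixed epsilon of 0.3, as in the example in the text.
x_hat = interpolate(real_image, fake_image, 0.3)

# In training, epsilon would be drawn at random for each pair.
random_x_hat = interpolate(real_image, fake_image, random.random())
print(x_hat, random_x_hat)
```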



<img src="images/wloss-1L-11.png" width=420 height=200 />

The two here is just saying, "I want the squared distance, as opposed to perhaps the absolute value, penalizing
values much more when they're further away from one." Specifically, X hat is an intermediate image, weighted
between the real and the fake using epsilon.
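
To make the squared term concrete, here is a sketch that computes the penalty for a single interpolated point. It assumes a toy critic on two-dimensional inputs and estimates the gradient with central finite differences; a real implementation would rely on automatic differentiation instead. All names here are hypothetical.

```python
import math

def numerical_gradient(critic, x, h=1e-6):
    """Central finite-difference gradient of a scalar critic at point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((critic(up) - critic(down)) / (2 * h))
    return grad

def gradient_penalty(critic, x_hat):
    """Squared distance of the gradient's L2 norm from one."""
    grad = numerical_gradient(critic, x_hat)
    norm = math.sqrt(sum(g * g for g in grad))
    return (norm - 1.0) ** 2

def toy_critic(x):
    # Hypothetical linear critic whose gradient is (3, 4), so the norm is 5.
    return 3.0 * x[0] + 4.0 * x[1]

penalty = gradient_penalty(toy_critic, [0.3, 0.3])
print(penalty)  # close to (5 - 1) ** 2
```

Because the distance is squared, a gradient norm of 5 is penalized far more heavily than a norm of 2.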


<img src="images/wloss-1L-12.png" width=420 height=200 />

With this method, you're not strictly enforcing 1-L continuity, but you're just encouraging it. This has proven to work
well and much better than weight clipping. The complete expression, the loss function that you use for training a GAN with
W-Loss and gradient penalty, now has these two components. First, you approximate Earth Mover's Distance with the main
W-Loss component. This makes the GAN less prone to mode collapse and vanishing gradients. The second part of this loss
function is a regularization term that encourages the critic to meet the condition required to make the main term
valid. Of course, this is a soft constraint on making the critic 1-Lipschitz continuous, but it has been shown to be
very effective. It keeps the norm of the critic's gradient close to one almost everywhere, which is the actual
technical term.
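
Putting the two components together, one way to write the critic's objective is shown below. It is written as a quantity the critic minimizes, which flips the sign of the main W-Loss term; that convention and the symbol names are assumptions of this sketch.

```latex
% x is a real example, g(z) a fake one, \epsilon a random number in [0, 1]
\hat{x} = \epsilon x + (1 - \epsilon) g(z)

% Main W-Loss term (negated) plus the gradient penalty weighted by \lambda
L_{\text{critic}} = \mathbb{E}\left[c(g(z))\right] - \mathbb{E}\left[c(x)\right]
                  + \lambda \, \mathbb{E}\left[\left(\lVert \nabla c(\hat{x}) \rVert_2 - 1\right)^2\right]
```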



<img src="images/wloss-1L-13.png" width=420 height=200 />

Wrapping up, in this video I presented you with two ways of enforcing the critic to be 1-Lipschitz continuous, or 1-L
continuous: weight clipping as one and gradient penalty as the other. Weight clipping might be problematic because
you're either strongly limiting the way the critic learns during training or being too soft, so there's a bit of
hyperparameter tuning. Gradient penalty, on the other hand, is a softer way to enforce 1-Lipschitz continuity.
While it doesn't strictly enforce the critic's gradient norm to be at most one at every point, it works better in
practice than weight clipping.

<img src="images/wloss-1L-14.png" width=420 height=200 />
