
Latent layers importance #6

Open
tals opened this issue Mar 2, 2019 · 19 comments


tals commented Mar 2, 2019

Hey, great analysis! :)
I've learned a lot from reading it.

Just a quick comment: the W matrix produced by the mapping network contains a single vector w_v tiled across the layers (so w[0] = w[1] = ... = w[n_layers - 1]).
The layer-wise affine transformation happens in the synthesis network.

The notebook, however, operates on this tiled w, which is why you saw the surprising behavior (all the layers are identical).
I imagine that running it again on the transformed Ws would show something quite different.

P.S.: This bug also affects the results of the non-linear model.
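For anyone who wants to verify this, here is a minimal sketch (assuming the official StyleGAN API, with the pretrained Gs network already loaded from the pickle):

```python
import numpy as np

# z sampled from a standard normal, as the generator expects
qlatent = np.random.randn(1, 512)

# The mapping network returns shape (1, 18, 512), but every one of the
# 18 rows is the same 512-d vector, broadcast once per synthesis layer.
dlatent = Gs.components.mapping.run(qlatent, None)

assert all(np.allclose(dlatent[0, 0], dlatent[0, i]) for i in range(dlatent.shape[1]))
```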

@jcpeterson

@tals does this mean the encoder code also incorrectly optimizes the same repeated tiled vectors?


pender commented Mar 18, 2019

@jcpeterson The encoder in this repo produces an 18x512 dlatent vector in which the layers are different, unlike the mapping network, which, as noted above, produces a 1x512 dlatent vector that is tiled up to 18x512. I haven't found an easy way to get a 1x512 dlatent vector out of the encoder in this repo that can be tiled to reproduce the encoded image.

I've tried:

  • tiling each of the resulting 18 layers individually (each of which produced very distorted faces at a variety of sharp angles),
  • averaging the 18 layers (which produced slight variations on a single female face for all of the attempts that I tried), and
  • adapting the encoder script to optimize a single 1x512 vector that is tiled back to 18x512 before synthesizing (which didn't do a very good job of reproducing the input image).

I am going to try that last method again with the Adam optimizer (an implementation is discussed by others in another thread in this repo; UPDATE: results were still terrible), and if that fails, I'm going to start from scratch with a new encoder using the method described in this ICLR paper, which dispenses with perceptual loss altogether (I found another repo that applies that method to ProGAN here).

I also need to run a control to see if the existing encoder can reconstruct a face produced by the GAN in the first place, in case the face images I've provided just aren't well represented in the dlatent space.
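For reference, the first two experiments above boil down to something like this (just a sketch; `dlatent` is the encoder's (18, 512) output and `Gs` is the loaded generator, both assumed to be available):

```python
import numpy as np

# 1) Tile a single layer k of the encoded dlatent up to all 18 layers.
k = 0
tiled = np.tile(dlatent[k:k + 1], (18, 1))                          # (18, 512)

# 2) Average the 18 layers and tile the mean back up.
averaged = np.tile(dlatent.mean(axis=0, keepdims=True), (18, 1))    # (18, 512)

# Render either candidate with the synthesis network (add a batch dimension).
images = Gs.components.synthesis.run(tiled[np.newaxis], randomize_noise=False)
```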

@jcpeterson

@pender Thanks for the info. That seems at odds with this repo author's post here: https://www.reddit.com/r/MachineLearning/comments/aq6jxf/p_stylegan_encoder_from_real_images_to_latent/egg4rkl

He states:
qlatent = normally distributed noise which have shape=(512)
dlatent = mapping_network(qlatent) = shape=(18, 512)

I tried Adam; while it's much faster and reaches a lower loss, I get more artifacts.

The paper you linked just looks like a pixel loss version of the current repo, or perhaps you meant the clipping, which I think is a great idea. Please ping me if you implement that. Do you have a public fork?


pender commented Mar 19, 2019

@jcpeterson

He states:

dlatent = mapping_network(qlatent) = shape=(18, 512)

That's technically correct, but I'm not sure he realized at the time that the (18, 512) tensor the mapping network outputs is in fact a single (512,) tensor tiled to (18, 512). But it is!

I don't have a public fork, I've just been noodling with a copy of the source files on my PC.


jcpeterson commented Mar 20, 2019

@pender That means the encoder is simply set up incorrectly. If the network was not trained to use 18 different vectors, the dlatents currently learned are out-of-distribution. It's extremely odd, then, that your third strategy above didn't work. Have you tried learning the qlatent instead?


pender commented Mar 20, 2019

@jcpeterson Yes, the dlatents obtained by the encoding script are clearly out-of-distribution -- that is easily enough demonstrated by tiling a single layer of the dlatent output up to a new 18x512 dlatent and observing the fleshy horrors that emerge. (On the other hand, they are achieving their intended purposes, since they do reconstruct the image, and image transformations learned on native images seem to work on encoded dlatents.)

I don't think trying to learn qlatents directly is likely to be fruitful... that would be the same challenge we are already working with, PLUS trying to reverse the complex transformation of the 8-layer fully-connected mapping network. Qlatents are useful for randomly generating images, but for everything else I think we're better off working on dlatents directly (whether pre or post tiling).


jcpeterson commented Mar 20, 2019

@pender I don't see the difference. The encoder already backprops through the generator to the dlatents. Why not just backprop through the fixed mapping network too and find a qlatent such that, when mapped and then rendered, a good match is attained via the VGG loss?
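Something along these lines, in TF1 terms against the official networks (a rough sketch only; `Gs`, `target_image`, and the `perceptual_distance` VGG-style loss are assumed to already exist in the encoder script):

```python
import tensorflow as tf

# Optimize a free variable in z-space instead of w-space.
qlatent = tf.get_variable('qlatent', shape=[1, 512],
                          initializer=tf.initializers.random_normal())
labels = tf.zeros([1, 0])  # FFHQ model is unconditional, so labels are empty

# Backprop through the (fixed) mapping network and the synthesis network.
dlatent = Gs.components.mapping.get_output_for(qlatent, labels)             # (1, 18, 512)
generated = Gs.components.synthesis.get_output_for(dlatent, randomize_noise=False)

loss = perceptual_distance(generated, target_image)
train_op = tf.train.AdamOptimizer(1e-2).minimize(loss, var_list=[qlatent])
```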

@jcpeterson

@pender Can you please share your code for: "adapting the encoder script to optimize a single 1x512 vector that is tiled back to 18x512 before synthesizing (which didn't do a very good job of reproducing the input image)."


pbaylies commented May 6, 2019

@jcpeterson My guess as to why the encoder works better as-is is that it has the full latent space to search through: while the mapping network is designed to find specific points on (or near) the manifold, the encoder can find the spaces in between that look the most like the target image, even if the mapping network would be unlikely to reach that same region given its training.

@jcpeterson

@pbaylies Yes, I see how it works now. I just don't like the fine-grained texture all of the encodings get. At full size, none of them have the clear quality of the actual samples. I attribute this to over-optimization using the style layers. If I can get hold of @pender's code, I can search for the subset of the first N dlatents (higher-level, less style-focused dimensions) that suffice.


tals commented May 21, 2019

@jcpeterson My guess as to why the encoder works better as-is is that it has the full latent space to search through: while the mapping network is designed to find specific points on (or near) the manifold, the encoder can find the spaces in between that look the most like the target image, even if the mapping network would be unlikely to reach that same region given its training.

When experimenting with the net, I've noticed StyleGAN behaves much better when it comes to interpolation & mixing if you "play by the rules", e.g. use a single 1x512 dlatent vector to represent your target image.
With 18x512, we're kind of cheating. In fact, Image2Stylegan shows that you can encode images this way on a completely randomly initialized net! (although interpolation is pretty meaningless in that instance)

A test that I've tried was to apply this to a subset of W. I tried W{3}: the first two layers captured the pose and some color, while the 3rd one looked similar to my target face, but at a different angle and with some broken elements.

Note that doing W{1} is possible, but it's harder for an encoder to accurately hit that goal. Some tricks, like masking out the background when calculating the loss, seem to help a lot.

Additionally, the distribution of the optimized latents appears different from that of the ones generated by the mapping network.
To counter this, you can "push" it toward the mean (via something like L2). This helps StyleGAN work better, but it's a hack, since the natural distribution of latents in W doesn't quite look that way.
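In code, that regularizer is roughly the following (a sketch; `dlatent_var` is the optimized latent variable, `perceptual_loss` is the existing loss, and the 0.01 weight is just an illustrative value):

```python
import tensorflow as tf

# dlatent_avg is the running average of w that the Gs pickle stores for the truncation trick.
dlatent_avg = Gs.get_var('dlatent_avg')                        # shape (512,)

# L2 penalty toward the mean of W (not toward zero).
mean_reg = tf.reduce_sum(tf.square(dlatent_var - dlatent_avg))
total_loss = perceptual_loss + 0.01 * mean_reg
```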


tals commented May 21, 2019

To illustrate the last point, here's a simple ablation test (this is all done in W{1}):

target_image:
[image]

mask:
[image]

Target = target_image * mask (so only the white pixels are optimized for).

Without any regularization, it looks kind of broken (I affectionately call this "blobama"):
[image]

By pulling it back toward the w latent mean with L2, it looks a lot better (but still not quite right; notice the strange asymmetry and the scar-like artifact):
[image]
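The masked objective above amounts to something like this (a minimal numpy sketch with a plain pixel L2 standing in for the actual loss; `mask` is a binary HxW array):

```python
import numpy as np

def masked_loss(generated_image, target_image, mask):
    # Zero out everything outside the mask so only the white pixels contribute.
    diff = (generated_image - target_image) * mask[..., np.newaxis]
    # Normalize by the number of optimized pixels.
    return np.sum(diff ** 2) / (np.sum(mask) + 1e-8)
```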

@pbaylies

@tals I've also been playing around with encoders and training a ResNet to encode; I had some similar ideas as far as using the mapping network to generate more dlatent examples for training. I haven't compared the distributions, though; that'd be good to know. I've been using L1 regularization to pull back toward the mean while encoding.


tals commented May 21, 2019

@tals I've also been playing around with encoders and training a ResNet to encode; I had some similar ideas as far as using the mapping network to generate more dlatent examples for training. I haven't compared the distributions, though; that'd be good to know. I've been using L1 regularization to pull back toward the mean while encoding.

Yeah, the distribution doesn't look quite right even with L1/L2 (why did you choose L1, btw?).
This is easily visible if you take the histogram of the optimized latent vs a "natural" one obtained by the mapping network.

The trick of finding a better starting position with the ResNet encoder is interesting.
It would probably work even better if it got to see the gradient from StyleGAN, instead of just working on <rgb, latent> pairs.
Maybe something like: loss = alpha * distance(predicted_latent, target_latent) + beta * perceptual_distance(synthesis_net(predicted_latent), source_rgb)
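Spelled out, that objective would look roughly like this (a sketch; `synthesis_net` and `perceptual_distance` stand for the StyleGAN synthesis network and a VGG-style feature loss, and alpha/beta are arbitrary weights):

```python
import tensorflow as tf

def encoder_loss(predicted_latent, target_latent, source_rgb, alpha=1.0, beta=1.0):
    # Term 1: stay close to the dlatent the mapping network would produce.
    latent_term = tf.reduce_mean(tf.square(predicted_latent - target_latent))
    # Term 2: the rendered image should still resemble the source photo.
    perceptual_term = perceptual_distance(synthesis_net(predicted_latent), source_rgb)
    return alpha * latent_term + beta * perceptual_term
```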

P.S.: I think you have a subtle bug: you use L1 to pull it back toward zero instead of toward dlatent_avg.


pbaylies commented May 21, 2019

@tals L1 was just the first thing I tried as a regularization penalty, but it cut down on artifacts a lot; before, I'd get more blurry results from my loss function. And yes, a good starting prediction helps.

P.S.: Thanks for the tip, I think you're right!


pender commented May 22, 2019

@jcpeterson

@pbaylies Yes, I see how it works now. I just don't like the fine-grained texture all of the encodings get. At full size, none of them have the clear quality of the actual samples. I attribute this to over-optimization using the style layers. If I can get hold of @pender's code, I can search for the subset of the first N dlatents (higher-level, less style-focused dimensions) that suffice.

Here you go, I uploaded it to a forked repo today. Ultimately had some success. Interestingly, I was able to use a perceptual loss taken from the discriminator net instead of from a separately trained image-classifier net. I'm not sure if that has been done before.

@pbaylies

Thanks @pender, very nice results! I merged your changes for learning rate decay, stochastic gradient clipping, and tiled dlatents into my fork as well.


pender commented Jun 9, 2019

@pbaylies thanks for your repo, I'm particularly enjoying playing with the efficientnet inverse network. Would you be willing to turn on issues for your repo?

Summarizing the issue I'm looking to report to you: since the effnet is trained to match dlatents in which all 18 layers have the same values (the training targets are outputs of the StyleGAN mapping network, which tiles a single [1, 512] vector up to [18, 512]), I think the effnet's output layer should be a single [1, 512] vector instead of an [18, 512] vector, even if you prefer to tile that [1, 512] vector up to 18 layers. Right now all 18 layers of its output are different, unlike the outputs of the mapping network.
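One way to structure that output head (a Keras-style sketch; `features` is whatever the EfficientNet backbone produces, and the layer names are purely illustrative):

```python
from tensorflow.keras import layers

# Predict a single 512-d w vector...
w = layers.Dense(512, name='w_single')(features)
# ...and tile it up to the (18, 512) shape the synthesis network expects.
dlatents = layers.RepeatVector(18, name='w_tiled')(w)
```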

@pbaylies

@pender Thanks, I didn't realize that issues were off in my repo! They are on now. I'd be amenable to adding a flag to enable using a single [1, 512] vector (or similar options, maybe a [3, 512] vector to tile up coarse, medium, and fine attributes); I'm sure it'd train faster, but I think having it wider in general ultimately yields better results, as it covers more of the latent space. I already have a flag on the encoder to support the tiling-up behavior, so you can compare results on that end.
