
about the results #1

Open
ChenyangWang95 opened this issue Jun 26, 2024 · 28 comments

Comments
@ChenyangWang95

Hi, very excited to see this work.

I wonder if there are any results on disentanglement?

I tried and revised the repo https://github.com/johndpope/MegaPortrait-hack

but the mouth stays fixed and the expression seems to be an average emotion.

output_video4.mp4
@johndpope

can you re-upload video?

@ChenyangWang95
Author

sorry, the video cannot be uploaded correctly.

I chose some inference pics instead.

(inference images 1-3)

The emotion and mouth seem fixed.
I used 3000+ videos at 256x256 resolution to train the model, on 8x3090 GPUs with a batch size of 2. These are the results from the checkpoint at 80,000 iterations.

@johndpope

i think we can switch out the resnet18 for resnet50 -

have a look at this - IRFD
https://arxiv.org/pdf/2405.07257

class Emtn(nn.Module):
    def __init__(self):
        super().__init__()
        # johndpope/MegaPortrait-hack#19
        # replace this with off-the-shelf SixDRepNet
        self.head_pose_net = resnet18(pretrained=True).to(device)
        self.head_pose_net.fc = nn.Linear(self.head_pose_net.fc.in_features, 6).to(device)  # 6 corresponds to rotation and translation parameters
        self.rotation_net = SixDRepNet_Detector()

        model = resnet18(pretrained=False, num_classes=512).to(device)  # feature_maps = resnet18(input_image) -> should print torch.Size([1, 512, 7, 7])
        # Remove the fully connected layer and the adaptive average pooling layer
        self.expression_net = nn.Sequential(*list(model.children())[:-1])
        self.expression_net.adaptive_pool = nn.AdaptiveAvgPool2d(FEATURE_SIZE)  # https://github.com/neeek2303/MegaPortraits/issues/3
        # self.expression_net.adaptive_pool = nn.AdaptiveAvgPool2d((7, 7))  # OPTIONAL 🤷 - 16x16 is better?

        ## TODO 2
        outputs = COMPRESS_DIM  # 512, convenient for the later WarpS2C operation (512 -> 2048 channels)
        self.fc = torch.nn.Linear(2048, outputs)
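
fwiw the swap itself is pretty mechanical - roughly like this (just a sketch assuming torchvision's resnet50; the FEATURE_SIZE / COMPRESS_DIM values here are placeholders, not the repo's actual code). resnet50's last conv stage gives 2048 channels, which already lines up with the Linear(2048, COMPRESS_DIM) head above:

# sketch only: resnet50 as the expression encoder (values below are assumptions)
import torch
import torch.nn as nn
from torchvision.models import resnet50

FEATURE_SIZE = (16, 16)   # placeholder, mirrors the adaptive_pool comment above
COMPRESS_DIM = 512        # placeholder

model = resnet50(pretrained=True)
expression_net = nn.Sequential(*list(model.children())[:-2])  # drop avgpool + fc
adaptive_pool = nn.AdaptiveAvgPool2d(FEATURE_SIZE)
fc = nn.Linear(2048, COMPRESS_DIM)

x = torch.randn(1, 3, 224, 224)
feat = adaptive_pool(expression_net(x))   # torch.Size([1, 2048, 16, 16])
emb = fc(feat.mean(dim=(2, 3)))           # torch.Size([1, 512])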

@ChenyangWang95
Author

Thanks, I will try it.
Additionally, does the resolution affect the disentanglement? Maybe 512 is better?

@johndpope

johndpope commented Jun 26, 2024

https://github.com/johndpope/SPEAK-hack

it needs some data wired up. PRs welcome.

UPDATE: some progress - but i need a generator.
does it mean we have to go back to g2d / g3d?

@JaLnYn
Owner

JaLnYn commented Jun 26, 2024

I don't have results yet. Still working on bug fixing.

You can check out my prune branch

@JaLnYn
Owner

JaLnYn commented Jun 26, 2024

https://github.com/johndpope/SPEAK-hack

it needs some data wired up. PRs welcome.

UPDATE: some progress - but i need a generator.
does it mean we have to go back to g2d / g3d?

Are you currently not using g2d g3d and warping generators?

@johndpope

this is a new paper from 2024 - IRFD
https://arxiv.org/pdf/2405.07257
It seems like a MUCH smarter / faster / cleaner way to disentangle expression / pose / identity.

@ChenyangWang95
Author

It appears to be simpler than MegaPortraits, and it requires less computation in the training phase. However, I doubt it can maintain 3D consistency when the head turns if we drop the 3D volume used in MegaPortraits.

this is a new paper from 2024 - IRFD https://arxiv.org/pdf/2405.07257 It seems like a MUCH smarter / faster / cleaner way to disentangle expression / pose / identity.

@johndpope

good news - it's training.

@JaLnYn
Owner

JaLnYn commented Jul 1, 2024

I've changed to the SPEAK architecture and I'm getting some sort of results with a subset of the VoxCeleb2 dataset. This is just a progressive GAN though.

10 epochs on my new_model branch

image

It's kinda ugly, so maybe more epochs will help. I will also try to disentangle now.

@johndpope

check my branch -
i put in a CIPSgenerator - should be compatible with your code.
https://github.com/johndpope/SPEAK-hack/pull/3/files#diff-6e36d44ad4ad4dca37f0d92c23c19c28f88ab0d9398914be63ea6ba94ac942f8R28

johndpope/SPEAK-hack#3

when you run images smaller than 224x224 through resnet50 - the feature map collapses from 7x7 -> 1x1.

share your code - mine gets stuck in a local minimum.
i went to some effort to force the loss function to move in the pose direction - but no joy.

Screenshot 2024-07-02 at 9 30 25 AM

@JaLnYn
Owner

JaLnYn commented Jul 1, 2024

My code is located at

https://github.com/JaLnYn/talkinghead/tree/new_model

I'm not sure what you mean by

when you run images smaller than 224x224 through resnet50 - the feature map collapses from 7x7 -> 1x1.

Are you talking about the lpips loss? Or the encoder?

@johndpope

i want to train the net with 64x64 - my branch handles that - but the resnet encoder produces different size features for images smaller than 224. there's a test file in my branch you can see.
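
for anyone curious, a quick standalone sanity check shows the shrinkage (assuming torchvision's resnet50 - this is not the test file from the branch):

# how the resnet50 feature map shrinks with input size
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = nn.Sequential(*list(resnet50(pretrained=False).children())[:-2])  # drop avgpool + fc

for size in (224, 112, 64, 32):
    with torch.no_grad():
        feat = backbone(torch.randn(1, 3, size, size))
    print(size, tuple(feat.shape))
# 224 -> (1, 2048, 7, 7), 112 -> (1, 2048, 4, 4), 64 -> (1, 2048, 2, 2), 32 -> (1, 2048, 1, 1)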

@johndpope

UPDATE
i added some logic to do background removal -
and added variable progressive upscaling of the resolution

going back to the SPEAK paper - they base the gan architecture on stylegan - wondering whether to
a) use stylegan (unofficial pytorch/tensorflow port)
b) use stylegan2-ada (official - complex - adaptive discriminator augmentation - https://github.com/NVlabs/stylegan2-ada-pytorch/blob/main/training/augment.py (going to cook my gpu))
c) upgrade my own and cherry-pick core functionality.

Screenshot from 2024-07-06 05-02-15

@JaLnYn
Owner

JaLnYn commented Jul 5, 2024

I decided to take StyleGAN code from a blog. It seems to work. You can check my new_model branch linked above.

@johndpope

johndpope commented Jul 5, 2024

a couple of lines to help -

# Define transformations
from torchvision.transforms import Compose, Lambda, Resize

transform = Compose([
    Lambda(lambda x: x.permute(0, 3, 1, 2).float()),
    Lambda(lambda x: (x / 255.0)),  # Normalize to [0, 1]
    Resize((256, 256))  # my videos were 512 - caused me some headaches
])

parser.add_argument('--config_path', type=str, default='./config/local_train.yaml', help='Path to the config')

the trick with SPEAK should be in the forward pass -

.1 Training of Disentanglement Module. To train our disentanglement module, we employ datasets of spontaneous facial emotions [7, 15]. As illustrated in Figure 2, we first use three encoders (i.e., E_e, E_i, E_p) to extract the embeddings (E_e(S), E_i(S), E_p(S), E_e(T), E_i(T), E_p(T)) from the reference video clips S, T.

Subsequently, we randomly swap one type of facial feature code and concatenate the two sets of facial feature information extracted from S and T, which are then fed into the IRFD generator G_d [18] to generate two fake facial images, denoted as I_d. To independently extract the emotion, pose and identity that lie in a pair of face images S_{i,m,p} ...

in forward pass

# Randomly swap one type of feature (keeping this functionality)
swap_type = torch.randint(0, 3, (1,)).item()
if swap_type == 0:
    fi_s, fi_t = fi_t, fi_s
elif swap_type == 1:
    fe_s, fe_t = fe_t, fe_s
else:
    fp_s, fp_t = fp_t, fp_s

that should keep the happy guy's identity - but replace his emotion with the sad lady's emotion, etc.
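
the two mixed code sets then get concatenated and pushed through the IRFD generator, per the quote above - roughly like this (toy sizes only; Gd is a stand-in linear layer and fi/fe/fp = identity / emotion / pose codes - all assumptions, not the paper's actual shapes):

# toy sketch of the recombination step after the swap (sizes / Gd are placeholders)
import torch
import torch.nn as nn

code_dim = 64
Gd = nn.Linear(3 * code_dim, 3 * 64 * 64)  # stand-in for the real IRFD generator

fi_s, fe_s, fp_s = (torch.randn(1, code_dim) for _ in range(3))  # source codes
fi_t, fe_t, fp_t = (torch.randn(1, code_dim) for _ in range(3))  # target codes

# (after the random swap above) rebuild both faces from the mixed codes
x_s_fake = Gd(torch.cat([fi_s, fe_s, fp_s], dim=1)).view(1, 3, 64, 64)
x_t_fake = Gd(torch.cat([fi_t, fe_t, fp_t], dim=1)).view(1, 3, 64, 64)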

in my repo there's a dataset using a 2 GB AffectNet image set (8 emotions) https://www.kaggle.com/datasets/thienkhonghoc/affectnet
maybe useful.
i got to 77 iterations - but decord blew up.
maybe there's a corrupted video in my set.

Epoch 1/5, Total Loss: 3.3473:   0%|| 76/17833 [00:15<55:35,  5.32it/s]x.shape: torch.Size([2, 3, 256, 256])
Epoch 1/5, Total Loss: 3.7954:   0%|| 77/17833 [00:16<53:18,  5.55it/s]x.shape: torch.Size([2, 3, 256, 256])
Epoch 1/5, Total Loss: 2.1625:   0%|| 78/17833 [00:16<52:48,  5.60it/s]x.shape: torch.Size([2, 3, 256, 256])
Epoch 1/5, Total Loss: 3.1936:   0%|| 79/17833 [00:16<51:22,  5.76it/s]Epoch 1/5, Total Loss: 3.1936:   0%|| 79/17833 [00:16<1:01:58,  4.77it/s]
Traceback (most recent call last):
  File "/media/oem/12TB/talkinghead/src/trainer.py", line 239, in <module>
    main()
  File "/media/oem/12TB/talkinghead/src/trainer.py", line 235, in main
    train_model(config, p, video_dataset)
  File "/media/oem/12TB/talkinghead/src/trainer.py", line 137, in train_model
    for idx, (Xs, Xd, Xsp, Xdp) in enumerate(train_iterator):
  File "/home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 629, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 672, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/media/oem/12TB/talkinghead/src/dataloader.py", line 29, in __getitem__
    video_data = vr.get_batch(frame_indices).asnumpy()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oem/.local/lib/python3.11/site-packages/decord-0.6.0-py3.11-linux-x86_64.egg/decord/video_reader.py", line 175, in get_batch
    arr = _CAPI_VideoReaderGetBatch(self._handle, indices)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oem/.local/lib/python3.11/site-packages/decord-0.6.0-py3.11-linux-x86_64.egg/decord/_ffi/_ctypes/function.py", line 173, in __call__
    check_call(_LIB.DECORDFuncCall(
               ^^^^^^^^^^^^^^^^^^^^
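
a quick band-aid i might try is wrapping the decord read in a try/except and skipping unreadable clips - illustrative only (RobustVideoDataset and its fields are made-up names, not the repo's dataloader):

# illustrative only: skip videos decord can't decode instead of killing the epoch
import torch
from decord import VideoReader, DECORDError

class RobustVideoDataset(torch.utils.data.Dataset):
    def __init__(self, video_paths, frames_per_clip=8):
        self.video_paths = video_paths
        self.frames_per_clip = frames_per_clip

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        for _ in range(len(self.video_paths)):          # give up after one full pass
            path = self.video_paths[idx]
            try:
                vr = VideoReader(path)
                frame_indices = list(range(min(self.frames_per_clip, len(vr))))
                return vr.get_batch(frame_indices).asnumpy()
            except DECORDError:
                print(f"skipping unreadable video: {path}")
                idx = (idx + 1) % len(self.video_paths)
        raise RuntimeError("no readable videos found")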


@JaLnYn
Owner

JaLnYn commented Jul 6, 2024

hmmm I've never seen this bug before. I've trained over 10 epochs... I'm using a small VoxCeleb2 dataset at 224x224.

I've been re-reading the paper and I'm confused about whether they use pretrained encoders or train the encoders from scratch. I used to think they were not pretrained, but now I think they are. I'm going to rewrite some code tomorrow to reflect this.

I'm planning to use vggface2 for the face encoder, but I haven't found encoders for the others yet. Please let me know if you've found anything.

Also I found this project which may interest you. It uses warpings and works pretty well. I've tried it: https://github.com/KwaiVGI/LivePortrait

@johndpope

johndpope commented Jul 6, 2024

from my work with megaportraits - i'd say they're resnet50 -
categorically - pretrained.

i'd like to work more closely with you - at least in a similar direction (model / loss / generator / discriminator)
https://github.com/johndpope/SPEAK-hack
i ended up throwing out some code for the loss - and using the 6 degrees from megaportraits
maybe inception is better.
self.face_recognition = InceptionResnetV1(pretrained='vggface2').to(device)
i didn't get this working....
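
one common way to wire this up is as an identity loss on the embeddings - just a sketch (assuming the facenet-pytorch package and placeholder sizes, not what's actually in my repo):

# sketch only: vggface2 embeddings as a cosine identity loss (facenet-pytorch assumed)
import torch
import torch.nn.functional as F
from facenet_pytorch import InceptionResnetV1

face_recognition = InceptionResnetV1(pretrained='vggface2').eval()

def identity_loss(img_a, img_b):
    # facenet-pytorch expects roughly 160x160 face crops
    emb_a = face_recognition(F.interpolate(img_a, size=(160, 160), mode='bilinear'))
    emb_b = face_recognition(F.interpolate(img_b, size=(160, 160), mode='bilinear'))
    return 1.0 - F.cosine_similarity(emb_a, emb_b).mean()

print(identity_loss(torch.randn(2, 3, 256, 256), torch.randn(2, 3, 256, 256)))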

i did play with your code - but the noise in stylegan - it seems erroneous to pass it around everywhere -
this implementation https://github.com/johndpope/SPEAK-hack/blob/main/styleganv1.py - seems superior.
claude upgraded some parts with spectral_norm
it just injects noise behind the scenes.
i'm still wiring it up. been at it all day.

current status
Screenshot 2024-07-06 at 5 04 40 pm

@johndpope

@JaLnYn
Owner

JaLnYn commented Jul 6, 2024

i'd like to work more closely with you - at least in a similar direction (model / loss / generator / discriminator)

I'd be happy to work more closely with you :). We can get connected off github if you'd like

self.face_recognition = InceptionResnetV1(pretrained='vggface2').to(device)

What is the issue you're having with this? I did some testing and it seems to work? I'll do more testing later

i did play with your code - but the noise in stylegan - it seems erroneous to pass it around everywhere - this implementation https://github.com/johndpope/SPEAK-hack/blob/main/styleganv1.py - seems superior.

From what I've read, I think the implementation is correct? Can you point to the specific location in the code that you think is wrong and explain why?
Since it seems to be somewhat working for me right now, I would be averse to changing it.

@johndpope

came across this new paper the other day - it has code, and a noteworthy class.

@ChenyangWang95 - did you get anywhere with anything?
quite a few good models dropping - liveportrait / hallo.

@ChenyangWang95
Author

@johndpope I think the pretrained LivePortrait model may be a good choice for achieving the disentanglement for VASA-1.
I am trying to embed the pose and implicit kps from the paper into the DiT to verify the idea. What do you think about it?

@JaLnYn
Owner

JaLnYn commented Jul 11, 2024

I'm currently struggling to generate faces from an input face even without disentanglement. Maybe my StyleGAN just isn't training long enough.

I've tried ArcFace, but VGGFace seems similar if not better.

@johndpope

i was hoping the emoportrait code would drop - that would effectively let me archive the Megaportrait codebase - and reveal the answers to those questions....but it hasn't yet. ....
i added comments here that are relevant - in terms of squishing the feature vector down - i guess we can go as low as 128 features instead of 512.
df2b03a
i started looking at the disentanglement of emotions from the faces - and in an effort to speed things up - considered using a mask to concentrate on certain areas - https://github.com/johndpope/lazycipsgenerator

in coding the generator - i thought of using CIPS from https://openaccess.thecvf.com/content/CVPR2021/papers/Anokhin_Image_Generators_With_Conditionally-Independent_Pixel_Synthesis_CVPR_2021_paper.pdf - which has no convolution layers - just a fourier backbone -
but mostly a distraction -
i upgraded this branch to include AffectNet - which has a tagged emotions / faces dataset.
AffectNet - use this https://www.kaggle.com/datasets/thienkhonghoc/affectnet
https://github.com/johndpope/talkinghead/tree/new_model

i had some problems with this noise - and this tug of war between using @JaLnYn Alan's stylegan v1 or the stylegan from the pytorch implementation (that has noise baked in)

then switched back to speak-hack
https://github.com/johndpope/SPEAK-hack/blob/main/train.py
and every time i train this thing - i get gradient explosions.....
it's a bit of a headache....

for the vasa-1 stuff i used the Megaportrait code - to come up with this branch - it uses DPE as per the whitepaper - that has the most hope to disentangle stuff johndpope/VASA-1-hack#13
I haven't really looked at the diffusion stuff in ages - sighhh.....
i want to get somewhere with the resnet stuff first - as this transcends the densemotion - (which hallo / liveportrait are using - and could potentially unlock the faster frame rates as per vasa..... )

@JaLnYn
Owner

JaLnYn commented Jul 12, 2024

Adding some batch norms usually helps with my gradient explosions.

I'm mostly waiting for my current project to train.
While waiting, I'm thinking about building a student model for LivePortrait as specified in MegaPortraits. I'm 90% sure they use that for VASA to make it fast. (also why they only have a selection of faces)
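
For the explosions, clipping the gradient norm before the optimizer step is the other cheap fix - a tiny sketch (placeholder block and numbers, not my actual model):

# tiny sketch: BatchNorm in the conv blocks + gradient-norm clipping before the step
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),      # the batch norm suggested above
    nn.LeakyReLU(0.2),
)
opt = torch.optim.Adam(block.parameters(), lr=2e-4)

x = torch.randn(4, 3, 64, 64)
loss = block(x).pow(2).mean()    # dummy loss just for the sketch
loss.backward()
torch.nn.utils.clip_grad_norm_(block.parameters(), max_norm=1.0)  # cap the gradient norm
opt.step()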

@JaLnYn
Owner

JaLnYn commented Jul 15, 2024

I have the following results to share. I am unable to disentangle using the techniques specified in the paper.

Attempting the "listen, disentangle and control" (LDC) disentanglement technique.
output2

Without disentanglement
output

Has anyone gotten this disentanglement technique to work yet? If not, I might try the loss functions from MegaPortraits. I'm not confident that the ones from LDC work.

@johndpope

i got distracted again wanting to improve compute efficiency - spent all weekend on these - one Apple has a patent on - recently renewed - I did a basic proof of concept / and extended it further to cuda - but i think it's specific to hardware / asics - won't work with a gpu.
https://github.com/johndpope/LCNN-pytorch (apple patent) https://patents.google.com/patent/US20200364499A1
https://github.com/johndpope/faster-cnn (most recent research)
i wondered if resnet could be retrained with this technique - there's some obsolete Lua code https://github.com/hessamb/lcnn (claims are it can speed up inference 37x - though the fast code was never released)
imagine high fps - and dropping the complexity of convolution dynamically to speed up frames.
These are dead ends till the author comes back to me.

this liveportrait implementation is rather amazing - https://x.com/purzbeats/status/1812287664240107969?s=46&t=-tkSIrsyNobBjIvQ2IAQwQ - the obscuring of the face is awesome.
