
about the results #1

Open
ChenyangWang95 opened this issue Jun 26, 2024 · 28 comments

Comments
@ChenyangWang95

Hi, very excited to see this work.

I wonder if there are any results on disentanglement?

I tried and revised the repo https://github.com/johndpope/MegaPortrait-hack

but the mouth stays fixed and the expression seems to be an average emotion.

output_video4.mp4
@johndpope

can you re-upload video?

@ChenyangWang95
Author

sorry, the video cannot be uploaded correctly.

I chose some inference pics instead.

(inference images 1-3)

The emotion and mouth seem fixed.
I used 3000+ videos at 256x256 resolution to train the model, on 8x3090 GPUs with a batch size of 2. These are the results from the checkpoint at 80,000 iterations.

@johndpope

i think we can switch out the resnet18 for resnet50 -

have a look at this - IRFD
https://arxiv.org/pdf/2405.07257

class Emtn(nn.Module):
    def __init__(self):
        super().__init__()
        # johndpope/MegaPortrait-hack#19
        # replace this with off-the-shelf SixDRepNet
        self.head_pose_net = resnet18(pretrained=True).to(device)
        self.head_pose_net.fc = nn.Linear(self.head_pose_net.fc.in_features, 6).to(device)  # 6 corresponds to rotation and translation parameters
        self.rotation_net = SixDRepNet_Detector()

        model = resnet18(pretrained=False, num_classes=512).to(device)  # feature_maps = resnet18(input_image) -> should print torch.Size([1, 512, 7, 7])
        # Remove the fully connected layer and the adaptive average pooling layer
        self.expression_net = nn.Sequential(*list(model.children())[:-1])
        self.expression_net.adaptive_pool = nn.AdaptiveAvgPool2d(FEATURE_SIZE)  # https://github.com/neeek2303/MegaPortraits/issues/3
        # self.expression_net.adaptive_pool = nn.AdaptiveAvgPool2d((7, 7))  # OPTIONAL 🤷 - 16x16 is better?

        ## TODO 2
        outputs = COMPRESS_DIM  # 512, convenient for the later WarpS2C operation (512 -> 2048 channels)
        self.fc = torch.nn.Linear(2048, outputs)
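
fwiw the swap itself is pretty mechanical - roughly like this (just a sketch assuming torchvision's resnet50; the FEATURE_SIZE / COMPRESS_DIM values here are placeholders, not the repo's actual code). resnet50's last conv stage gives 2048 channels, which already lines up with the Linear(2048, COMPRESS_DIM) head above:

# sketch only: resnet50 as the expression encoder (values below are assumptions)
import torch
import torch.nn as nn
from torchvision.models import resnet50

FEATURE_SIZE = (16, 16)   # placeholder, mirrors the adaptive_pool comment above
COMPRESS_DIM = 512        # placeholder

model = resnet50(pretrained=True)
expression_net = nn.Sequential(*list(model.children())[:-2])  # drop avgpool + fc
adaptive_pool = nn.AdaptiveAvgPool2d(FEATURE_SIZE)
fc = nn.Linear(2048, COMPRESS_DIM)

x = torch.randn(1, 3, 224, 224)
feat = adaptive_pool(expression_net(x))   # torch.Size([1, 2048, 16, 16])
emb = fc(feat.mean(dim=(2, 3)))           # torch.Size([1, 512])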

@ChenyangWang95
Author

Thanks, I will try it.
Additionally, does the resolution affect the disentanglement? Maybe 512 is better?

@johndpope

johndpope commented Jun 26, 2024

https://github.com/johndpope/SPEAK-hack

it needs some data wired up. PRs welcome.

UPDATE: some progress - but i need a generator.
does it mean we have to go back to g2d / g3d?

@JaLnYn
Owner

JaLnYn commented Jun 26, 2024

I don't have results yet. Still working on bug fixing.

You can check out my prune branch

@JaLnYn
Owner

JaLnYn commented Jun 26, 2024

https://github.com/johndpope/SPEAK-hack

it needs some data wired up. PRs welcome.

UPDATE: some progress - but i need a generator.
does it mean we have to go back to g2d / g3d?

Are you currently not using g2d g3d and warping generators?

@johndpope

this is a new paper from 2024 - IRFD
https://arxiv.org/pdf/2405.07257
It seems like a MUCH smarter / faster / cleaner way to disentangle expression / pose / identity.

@ChenyangWang95
Author

It appears to be simpler than MegaPortraits, and it requires less computation in the training phase. However, I doubt it can maintain 3D consistency when the head turns if we drop the 3D volume used in MegaPortraits.

this is a new paper from 2024 - IRFD https://arxiv.org/pdf/2405.07257 It seems like a MUCH smarter / faster / cleaner way to disentangle expression / pose / identity.

@johndpope

good news - it's training.

@JaLnYn
Owner

JaLnYn commented Jul 1, 2024

I've changed to the SPEAK architecture and I'm getting some sort of results with a subset of the VoxCeleb2 dataset. This is just a progressive GAN though.

10 epochs on my new_model branch

image

It's kinda ugly, so maybe more epochs will help. I will also try to disentangle now.

@johndpope

check my branch -
i put in a CIPSgenerator - should be compatible with your code.
https://github.com/johndpope/SPEAK-hack/pull/3/files#diff-6e36d44ad4ad4dca37f0d92c23c19c28f88ab0d9398914be63ea6ba94ac942f8R28

johndpope/SPEAK-hack#3

when you run images smaller than 224x224 through resnet50 - the feature map collapses from 7x7 -> 1x1.

share your code - mine gets stuck in a local minimum.
i went to some effort to force the loss function to move in the pose direction - but no joy.

Screenshot 2024-07-02 at 9 30 25 AM

@JaLnYn
Owner

JaLnYn commented Jul 1, 2024

My code is located at

https://github.com/JaLnYn/talkinghead/tree/new_model

I'm not sure what you mean by

when you run images smaller than 224x224 through resnet50 - the feature map collapses from 7x7 -> 1x1.

Are you talking about the lpips loss? Or the encoder?

@johndpope

i want to train the net with 64x64 - my branch handles that - but the resnet encoder produces different size features for images smaller than 224. there's a test file in my branch you can see.
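
for anyone curious, a quick standalone sanity check shows the shrinkage (assuming torchvision's resnet50 - this is not the test file from the branch):

# how the resnet50 feature map shrinks with input size
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = nn.Sequential(*list(resnet50(pretrained=False).children())[:-2])  # drop avgpool + fc

for size in (224, 112, 64, 32):
    with torch.no_grad():
        feat = backbone(torch.randn(1, 3, size, size))
    print(size, tuple(feat.shape))
# 224 -> (1, 2048, 7, 7), 112 -> (1, 2048, 4, 4), 64 -> (1, 2048, 2, 2), 32 -> (1, 2048, 1, 1)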

@johndpope

UPDATE
i added some logic to do background removal -
and added variable progressive upscaling of the resolution

going back to the SPEAK paper - they base the gan architecture on stylegan - wondering whether to
a) use stylegan (unofficial pytorch/tensorflow port)
b) use stylegan2-ada (official - complex - adaptive discriminator augmentation - https://github.com/NVlabs/stylegan2-ada-pytorch/blob/main/training/augment.py (going to cook my gpu))
c) upgrade my own and cherry-pick core functionality.

Screenshot from 2024-07-06 05-02-15

@JaLnYn
Owner

JaLnYn commented Jul 5, 2024

I decided to take StyleGAN code from a blog. It seems to work. You can check my new_model branch linked above.

@johndpope

johndpope commented Jul 5, 2024

a couple of lines to help -

# Define transformations
from torchvision.transforms import Compose, Lambda, Resize

transform = Compose([
    Lambda(lambda x: x.permute(0, 3, 1, 2).float()),
    Lambda(lambda x: (x / 255.0)),  # Normalize to [0, 1]
    Resize((256, 256))  # my videos were 512 - caused me some headaches
])

parser.add_argument('--config_path', type=str, default='./config/local_train.yaml', help='Path to the config')

the trick with SPEAK should be in the forward pass -

.1 Training of Disentanglement Module. To train our disentanglement module, we employ datasets of spontaneous facial emotions [7, 15]. As illustrated in Figure 2, we first use three encoders (i.e., E_e, E_i, E_p) to extract the embeddings (E_e(S), E_i(S), E_p(S), E_e(T), E_i(T), E_p(T)) from the reference video clips S, T.

Subsequently, we randomly swap one type of facial feature code and concatenate the two sets of facial feature information extracted from S and T, which are then fed into the IRFD generator G_d [18] to generate two fake facial images, denoted as I_d. To independently extract the emotion, pose and identity that lie in a pair of face images S_{i,m,p} ...

in forward pass

# Randomly swap one type of feature (keeping this functionality)
swap_type = torch.randint(0, 3, (1,)).item()
if swap_type == 0:
    fi_s, fi_t = fi_t, fi_s
elif swap_type == 1:
    fe_s, fe_t = fe_t, fe_s
else:
    fp_s, fp_t = fp_t, fp_s

that should keep the happy guy's identity - but replace his emotion with the sad lady's emotion, etc.
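
the two mixed code sets then get concatenated and pushed through the IRFD generator, per the quote above - roughly like this (toy sizes only; Gd is a stand-in linear layer and fi/fe/fp = identity / emotion / pose codes - all assumptions, not the paper's actual shapes):

# toy sketch of the recombination step after the swap (sizes / Gd are placeholders)
import torch
import torch.nn as nn

code_dim = 64
Gd = nn.Linear(3 * code_dim, 3 * 64 * 64)  # stand-in for the real IRFD generator

fi_s, fe_s, fp_s = (torch.randn(1, code_dim) for _ in range(3))  # source codes
fi_t, fe_t, fp_t = (torch.randn(1, code_dim) for _ in range(3))  # target codes

# (after the random swap above) rebuild both faces from the mixed codes
x_s_fake = Gd(torch.cat([fi_s, fe_s, fp_s], dim=1)).view(1, 3, 64, 64)
x_t_fake = Gd(torch.cat([fi_t, fe_t, fp_t], dim=1)).view(1, 3, 64, 64)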

in my repo there's a dataset using a 2 GB AffectNet image set (8 emotions) https://www.kaggle.com/datasets/thienkhonghoc/affectnet
maybe useful.
i got to 77 iterations - but decord blew up.
maybe there's a corrupted video in my set.

Epoch 1/5, Total Loss: 3.3473:   0%|| 76/17833 [00:15<55:35,  5.32it/s]x.shape: torch.Size([2, 3, 256, 256])
Epoch 1/5, Total Loss: 3.7954:   0%|| 77/17833 [00:16<53:18,  5.55it/s]x.shape: torch.Size([2, 3, 256, 256])
Epoch 1/5, Total Loss: 2.1625:   0%|| 78/17833 [00:16<52:48,  5.60it/s]x.shape: torch.Size([2, 3, 256, 256])
Epoch 1/5, Total Loss: 3.1936:   0%|| 79/17833 [00:16<51:22,  5.76it/s]Epoch 1/5, Total Loss: 3.1936:   0%|| 79/17833 [00:16<1:01:58,  4.77it/s]
Traceback (most recent call last):
  File "/media/oem/12TB/talkinghead/src/trainer.py", line 239, in <module>
    main()
  File "/media/oem/12TB/talkinghead/src/trainer.py", line 235, in main
    train_model(config, p, video_dataset)
  File "/media/oem/12TB/talkinghead/src/trainer.py", line 137, in train_model
    for idx, (Xs, Xd, Xsp, Xdp) in enumerate(train_iterator):
  File "/home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 629, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 672, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/media/oem/12TB/talkinghead/src/dataloader.py", line 29, in __getitem__
    video_data = vr.get_batch(frame_indices).asnumpy()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oem/.local/lib/python3.11/site-packages/decord-0.6.0-py3.11-linux-x86_64.egg/decord/video_reader.py", line 175, in get_batch
    arr = _CAPI_VideoReaderGetBatch(self._handle, indices)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oem/.local/lib/python3.11/site-packages/decord-0.6.0-py3.11-linux-x86_64.egg/decord/_ffi/_ctypes/function.py", line 173, in __call__
    check_call(_LIB.DECORDFuncCall(
               ^^^^^^^^^^^^^^^^^^^^
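
a quick band-aid i might try is wrapping the decord read in a try/except and skipping unreadable clips - illustrative only (RobustVideoDataset and its fields are made-up names, not the repo's dataloader):

# illustrative only: skip videos decord can't decode instead of killing the epoch
import torch
from decord import VideoReader, DECORDError

class RobustVideoDataset(torch.utils.data.Dataset):
    def __init__(self, video_paths, frames_per_clip=8):
        self.video_paths = video_paths
        self.frames_per_clip = frames_per_clip

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        for _ in range(len(self.video_paths)):          # give up after one full pass
            path = self.video_paths[idx]
            try:
                vr = VideoReader(path)
                frame_indices = list(range(min(self.frames_per_clip, len(vr))))
                return vr.get_batch(frame_indices).asnumpy()
            except DECORDError:
                print(f"skipping unreadable video: {path}")
                idx = (idx + 1) % len(self.video_paths)
        raise RuntimeError("no readable videos found")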


@JaLnYn
Owner

JaLnYn commented Jul 6, 2024

hmmm I've never seen this bug before. I've trained over 10 epochs... I'm using a small VoxCeleb2 dataset at 224x224.

I've been re-reading the paper and I'm confused about whether they use pretrained encoders or train the encoders from scratch. I used to think they were not pretrained, but now I think they are. I'm going to rewrite some code tomorrow to reflect this.

I'm planning to use vggface2 for the face encoder, but I haven't found encoders for the others yet. Please let me know if you've found anything.

Also I found this project which may interest you. It uses warpings and works pretty well. I've tried it: https://github.com/KwaiVGI/LivePortrait

@johndpope

johndpope commented Jul 6, 2024

from my work with megaportraits - i'd say they're resnet50 -
categorically - pretrained.

i'd like to work more closely with you - at least in a similar direction (model / loss / generator / discriminator)
https://github.com/johndpope/SPEAK-hack
i ended up throwing out some code for the loss - and using the 6 degrees from megaportraits
maybe inception is better.
self.face_recognition = InceptionResnetV1(pretrained='vggface2').to(device)
i didn't get this working....
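
one common way to wire this up is as an identity loss on the embeddings - just a sketch (assuming the facenet-pytorch package and placeholder sizes, not what's actually in my repo):

# sketch only: vggface2 embeddings as a cosine identity loss (facenet-pytorch assumed)
import torch
import torch.nn.functional as F
from facenet_pytorch import InceptionResnetV1

face_recognition = InceptionResnetV1(pretrained='vggface2').eval()

def identity_loss(img_a, img_b):
    # facenet-pytorch expects roughly 160x160 face crops
    emb_a = face_recognition(F.interpolate(img_a, size=(160, 160), mode='bilinear'))
    emb_b = face_recognition(F.interpolate(img_b, size=(160, 160), mode='bilinear'))
    return 1.0 - F.cosine_similarity(emb_a, emb_b).mean()

print(identity_loss(torch.randn(2, 3, 256, 256), torch.randn(2, 3, 256, 256)))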

i did play with your code - but the noise in stylegan - it seems erroneous to pass it around everywhere -
this implementation https://github.com/johndpope/SPEAK-hack/blob/main/styleganv1.py - seems superior.
claude upgraded some parts with spectral_norm
it just injects noise behind the scenes.
i'm still wiring it up. been at it all day.

current status
Screenshot 2024-07-06 at 5 04 40 pm

@johndpope

@JaLnYn
Owner

JaLnYn commented Jul 6, 2024

i'd like to work more closely with you - at least in a similar direction (model / loss / generator / discriminator)

I'd be happy to work more closely with you :). We can get connected off github if you'd like

self.face_recognition = InceptionResnetV1(pretrained='vggface2').to(device)

What is the issue you're having with this? I did some testing and it seems to work? I'll do more testing later

i did play with your code - but the noise in stylegan - it seems erroneous to pass it around everywhere - this implementation https://github.com/johndpope/SPEAK-hack/blob/main/styleganv1.py - seems superior.

From what I've read, I think the implementation is correct? Can you point to the specific location in the code that you think is wrong and explain why?
Since it seems to be somewhat working for me right now, I would be averse to changing it.

@johndpope

came across this new paper the other day - it has code, and a noteworthy class.

@ChenyangWang95 - did you get anywhere with anything?
quite a few good models dropping - liveportrait / hallo.

@ChenyangWang95
Author

@johndpope I think the pretrained LivePortrait model may be a good choice for achieving the disentanglement for VASA-1.
I am trying to embed the pose and implicit kps from the paper into the DiT to verify the idea. What do you think about it?

@JaLnYn
Owner

JaLnYn commented Jul 11, 2024

I'm currently struggling to generate faces from an input face even without disentanglement. Maybe my StyleGAN just isn't training long enough.

I've tried ArcFace, but VGGFace seems similar if not better.

@johndpope

i was hoping the emoportrait code would drop - that would effectively let me archive the Megaportrait codebase - and reveal the answers to those questions....but it hasn't yet. ....
i added comments here that are relevant - in terms of squishing the feature vector down - i guess we can go as low as 128 features instead of 512.
df2b03a
i started looking at the disentanglement of emotions from the faces - and in an effort to speed things up - considered using a mask to concentrate on certain areas - https://github.com/johndpope/lazycipsgenerator

in coding the generator - i thought of using CIPS from https://openaccess.thecvf.com/content/CVPR2021/papers/Anokhin_Image_Generators_With_Conditionally-Independent_Pixel_Synthesis_CVPR_2021_paper.pdf - which has no convolution layers - just a fourier backbone -
but mostly a distraction -
i upgraded this branch to include AffectNet - which has a tagged emotions / faces dataset.
AffectNet - use this https://www.kaggle.com/datasets/thienkhonghoc/affectnet
https://github.com/johndpope/talkinghead/tree/new_model

i had some problems with this noise - and this tug of war between using @JaLnYn Alan's stylegan v1 or the stylegan from the pytorch implementation (that has noise baked in)

then switched back to speak-hack
https://github.com/johndpope/SPEAK-hack/blob/main/train.py
and every time i train this thing - i get gradient explosions.....
it's a bit of a headache....

for the vasa-1 stuff i used the Megaportrait code - to come up with this branch - it uses DPE as per the whitepaper - that has the most hope to disentangle stuff johndpope/VASA-1-hack#13
I haven't really looked at the diffusion stuff in ages - sighhh.....
i want to get somewhere with the resnet stuff first - as this transcends the densemotion - (which hallo / liveportrait are using - and could potentially unlock the faster frame rates as per vasa..... )

@JaLnYn
Owner

JaLnYn commented Jul 12, 2024

Adding some batch norms usually helps with my gradient explosions.

I'm mostly waiting for my current project to train.
While waiting, I'm thinking about building a student model for LivePortrait as specified in MegaPortraits. I'm 90% sure they use that for VASA to make it fast. (also why they only have a selection of faces)
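
For the explosions, clipping the gradient norm before the optimizer step is the other cheap fix - a tiny sketch (placeholder block and numbers, not my actual model):

# tiny sketch: BatchNorm in the conv blocks + gradient-norm clipping before the step
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),      # the batch norm suggested above
    nn.LeakyReLU(0.2),
)
opt = torch.optim.Adam(block.parameters(), lr=2e-4)

x = torch.randn(4, 3, 64, 64)
loss = block(x).pow(2).mean()    # dummy loss just for the sketch
loss.backward()
torch.nn.utils.clip_grad_norm_(block.parameters(), max_norm=1.0)  # cap the gradient norm
opt.step()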

@JaLnYn
Owner

JaLnYn commented Jul 15, 2024

I have the following results to share. I am unable to disentangle using the techniques specified in the paper.

Attempting the "listen, disentangle and control" (LDC) disentanglement technique.
output2

Without disentanglement
output

Has anyone gotten this disentanglement technique to work yet? If not, I might try the loss functions from MegaPortraits. I'm not confident that the ones from LDC work.

@johndpope

i got distracted again wanting to improve compute efficiency - spent all weekend on these - one Apple has a patent on - recently renewed - I did a basic proof of concept / and extended it further to cuda - but i think it's specific to hardware / asics - won't work with a gpu.
https://github.com/johndpope/LCNN-pytorch (apple patent) https://patents.google.com/patent/US20200364499A1
https://github.com/johndpope/faster-cnn (most recent research)
i wondered if resnet could be retrained with this technique - there's some obsolete Lua code https://github.com/hessamb/lcnn (claims are it can speed up inference 37x - though the fast code was never released)
imagine high fps - and dropping the complexity of convolution dynamically to speed up frames.
These are dead ends till the author comes back to me.

this liveportrait implementation is rather amazing - https://x.com/purzbeats/status/1812287664240107969?s=46&t=-tkSIrsyNobBjIvQ2IAQwQ - the obscuring of the face is awesome.
