
Is there a way to only use the text encoder? #113

Closed · ranran9991 opened this issue Jun 9, 2021 · 11 comments
@ranran9991

Hey!
I'd like to use only one part of the model, specifically the text encoder, in my work. I don't want to keep the whole model in GPU memory just to use the text encoding part. Is there a simple way to do that, or will I have to dive into the code myself?

Thanks for the help! :)

@vinson2233

What I have done is set model.visual = None to remove the whole visual part. But this raises an error, since the dtype property depends on .visual:

CLIP/clip/model.py, lines 332 to 334 in cfcffb9:

@property
def dtype(self):
    return self.visual.conv1.weight.dtype

I think that if we set everything in the visual part to None except model.visual.conv1.weight, then encode_text will work fine without having to store the rest of the visual tower.
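A minimal sketch of that idea, assuming the published clip package and a ViT checkpoint (whose visual tower has a conv1 layer):

import torch
import clip

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Keep only conv1 from the visual tower so the dtype property
# (which reads self.visual.conv1.weight.dtype) keeps working,
# and drop everything else on the image side.
visual_stub = torch.nn.Module()
visual_stub.conv1 = model.visual.conv1
model.visual = visual_stub

text = clip.tokenize(["a diagram", "a dog"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text)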

@ranran9991 (Author)

(in reply to @vinson2233's suggestion above)

Thank you for the help; hopefully this feature will be added soon.

@lonngxiang

Same need here.

@jongwook (Collaborator)

You can replace self.visual.conv1.weight.dtype with next(self.parameters()).dtype and similar, which will avoid the error. I plan to make that change in the next round of updates.
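Until that lands, a small monkey-patch along these lines (a sketch, assuming the published clip package) should avoid the error:

import torch
import clip
from clip.model import CLIP

# Make the dtype property independent of the visual tower,
# then drop the tower entirely; encode_text never touches it.
CLIP.dtype = property(lambda self: next(self.parameters()).dtype)

model, _ = clip.load("ViT-B/32", device="cpu")
model.visual = None  # setting a registered submodule to None is allowed

text = clip.tokenize(["a photo of a cat"])
with torch.no_grad():
    text_features = model.encode_text(text)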

@lonngxiang

(in reply to @jongwook's suggestion above)

Thumbs up!

@lonngxiang

lonngxiang commented Jul 5, 2021

Finally, I succeeded! @jongwook @vinson2233 @ranran9991

model.py

class CLIP(nn.Module):
    def __init__(self,
                 embed_dim: int,
                 context_length: int,
                 vocab_size: int,
                 transformer_width: int,
                 transformer_heads: int,
                 transformer_layers: int
                 ):
        super().__init__()
        # ... the text-side modules (token_embedding, positional_embedding,
        # transformer, ln_final, text_projection, ...) are built exactly as in
        # the original __init__; only the image-related arguments are dropped.
        self.visual = None

    @property
    def dtype(self):
        return next(self.parameters()).dtype

    def forward(self, text):
        # image_features = self.encode_image(image)
        text_features = self.encode_text(text)

        # normalized features
        # image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        # the cosine-similarity logits of the original forward() are dropped;
        # only the normalized text features are returned
        return text_features

save.py

import torch
import clip
from model import CLIP  # the trimmed, text-only CLIP class above, assuming it is saved as model.py

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

dict_trained = model.state_dict()    # weights of the full pretrained model
trained_lst = list(dict_trained.keys())

model_txt = CLIP(embed_dim=512, context_length=77, vocab_size=49408,
                 transformer_width=512, transformer_heads=16, transformer_layers=6)
dict_txt = model_txt.state_dict()
print(dict_txt.keys())  # which weights the text-only model expects

# copy the matching text-encoder weights from the full model
for key in dict_txt:
    dict_txt[key] = dict_trained[key]
model_txt.load_state_dict(dict_txt)
torch.save(model_txt, "./single_model_text1.pkl")
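For completeness, a short usage sketch for the saved file (a sketch, assuming the paths above):

import torch
import clip

# Note: the module that defines the trimmed CLIP class must be importable here,
# since torch.save() pickled the whole model object, not just its weights.
device = "cpu"
model_txt = torch.load("./single_model_text1.pkl", map_location=device)
model_txt.eval()

text = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)
with torch.no_grad():
    text_features = model_txt(text)  # forward() above already L2-normalizes
print(text_features.shape)  # expected: torch.Size([2, 512])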

@vinson2233

@lonngxiang To format the code, you can use ``` instead of `; put it on the lines above and below the code and everything in between will be rendered as a code block.
But I get the idea. Thanks.

@lonngxiang

(in reply to @vinson2233's formatting tip above)

OK, thanks.

@laurenspriem

laurenspriem commented Feb 28, 2023

(quoting @lonngxiang's model.py and save.py snippets above)

I followed this approach and got a text encoder. However, the embeddings that the model gives are completely wrong. @lonngxiang have you verified that the resulting embeddings match those of the original model?

@jongwook is the above method still the recommended approach, or is there a better way by now?
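One thing worth checking (an observation from the snippets above, not something confirmed in this thread): clip/model.py builds the ViT-B/32 text transformer with transformer_heads=8 and transformer_layers=12, so constructing the trimmed model with 16 heads and 6 layers would still load without errors but give different embeddings. A quick way to compare a standalone encoder against the full model:

import torch
import clip

device = "cpu"
full_model, _ = clip.load("ViT-B/32", device=device)
text_model = torch.load("./single_model_text1.pkl", map_location=device)

text = clip.tokenize(["a photo of a dog"]).to(device)
with torch.no_grad():
    reference = full_model.encode_text(text)
    reference = reference / reference.norm(dim=-1, keepdim=True)
    candidate = text_model(text)  # forward() already normalizes

print(torch.allclose(reference, candidate, atol=1e-4))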

@laurenspriem

laurenspriem commented Feb 28, 2023

Since the above solution didn't work for me, I've used another workaround that was kind of suggested earlier in this thread.

Basically, you can strip away most of the image encoder by setting model.visual.transformer = None. Unlike model.visual = None, this doesn't raise an error.

Downsides are that you're still left with some useless weights from the image encoder, and that you have to use model.encode_text(text_input) instead of just model(text_input).
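A minimal sketch of this workaround, assuming the published clip package:

import torch
import clip

model, _ = clip.load("ViT-B/32", device="cpu")

# Drop the bulk of the image tower; conv1 and a few small layers remain,
# so the dtype property still works and encode_text is unaffected.
model.visual.transformer = None

text = clip.tokenize(["a photo of a cat"])
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)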

@ranran9991 (Author)

(in reply to @laurenspriem's workaround above)

This is the workaround that I use as well.
