What exactly is clip skip? #5674
-
Can someone explain in plain English what the CLIP skip option does? As far as I understand, CLIP is an embedding layer that analyses text, and changing this option gives different results. It also seems pointless on models that weren't trained with it. But what exactly does it do, and what is it for?
Replies: 2 comments 11 replies
-
The CLIP model (the text embedding used in 1.x models) has a structure that is composed of layers. Each layer is more specific than the last. For example, if layer 1 is "Person", then layer 2 could be "male" and "female"; then, if you go down the path of "male", layer 3 could be man, boy, lad, father, grandpa, etc. Note that this is not exactly how the CLIP model is structured; it is only for the sake of example.
The 1.5 model, for example, is 12 layers deep, where the 12th layer is the last layer of the text embedding. Each layer is a matrix of some size, and each layer has additional matrices under it. So a 4x4 first layer has four 4x4 matrices under it, and so on and so forth. So the text space is dimensionally huge.
Now why would you want to stop earlier in the CLIP layers? Well, if you want a picture of "a cow", you might not care about the subcategories of "cow" the text model might have, especially since these can vary in quality. So if you want "a cow", you might not want "an Aberdeen Angus bull".
You can think of CLIP skip as basically a setting for "how accurate you want the text model to be". You can test it out with the XY plot script, for example. You can see that each CLIP stage adds more definition to the description. So if you have a detailed prompt about a young man standing in a field, with lower CLIP stages you'd get a picture of "a man standing", then deeper "a young man standing", "a young man standing in a field", etc.
CLIP skip really becomes good when you use models that are structured in a special way, like Booru models, where the "1girl" tag can break down into many sub-tags that connect to that one major tag. Whether you get any use out of CLIP skip is really just trial and error.
Now keep in mind that CLIP skip only works in models that use CLIP or are based on models that use CLIP, as in 1.x models and their derivatives. 2.0 models and their derivatives do not interact with CLIP because they use OpenCLIP instead.
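For anyone who finds code easier to follow, here is a rough sketch of the mechanics. It assumes the usual numbering where CLIP skip 1 means "use the last layer" and CLIP skip 2 means "use the second-to-last layer". The layers here are toy stand-ins that only record which layer ran, and the names `make_layer`, `encode`, and `layers_used` are made up for illustration; this is not the real CLIP code.

```python
# Toy illustration only: these "layers" are stand-ins, not real CLIP transformer blocks.
NUM_LAYERS = 12  # the SD 1.x CLIP text encoder is 12 layers deep


def make_layer(index):
    """Return a stand-in layer that just tags its input with the layer index."""
    def layer(hidden):
        return hidden + [index]  # builds a new list, so earlier outputs stay intact
    return layer


LAYERS = [make_layer(i) for i in range(1, NUM_LAYERS + 1)]


def encode(tokens, clip_skip=1):
    """Run tokens through every layer and return the output picked by clip_skip.

    Assumed convention: clip_skip=1 is the last layer's output,
    clip_skip=2 the second-to-last, and so on.
    """
    if not 1 <= clip_skip <= NUM_LAYERS:
        raise ValueError(f"clip_skip must be between 1 and {NUM_LAYERS}")
    hidden = list(tokens)
    hidden_states = []
    for layer in LAYERS:
        hidden = layer(hidden)
        hidden_states.append(hidden)  # keep every intermediate output
    return hidden_states[-clip_skip]


def layers_used(tokens, clip_skip=1):
    """Count how many layers contributed to the returned output."""
    return len(encode(tokens, clip_skip)) - len(tokens)
```

So with this sketch, `layers_used(["a", "cow"], clip_skip=2)` stops one layer short of the end. A real encoder returns a tensor per token rather than a list, but the "pick an earlier layer's output" idea is the same.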
-
nice question