B model does not converge. #36

Closed · cheer37 opened this issue Mar 17, 2016 · 16 comments

@cheer37 commented Mar 17, 2016

I am trying to train the lightened_B model on CASIA.
I followed the training methodology in the paper, but it does not converge.
At the beginning, the accuracy is 0 and the loss is 87.3365.
What could be the problem?
I am using the caffe-windows-master of happynear (Feng Wang).
Thanks.

@AlfredXiangWu (Owner)

As far as I can tell, that loss value is incorrect. The training loss is about 9.2~9.3 at the beginning for the CASIA-WebFace dataset (roughly ln(10575) ≈ 9.27, i.e. chance-level softmax over the 10,575 identities).

@cheer37 (Author) commented Mar 17, 2016

Yes, when I train my own network, what you say is correct.
But I ran into this situation with your net.
Would you check my prototxt and solver?
Thanks.

@cheer37 (Author) commented Mar 17, 2016

// train_val.prototxt

name: "Alfred_B"
layer {
  name: "ourface"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  image_data_param {
    root_folder: "D:/DB/Aligned/"
    source: "../../examples/clean_casia/train_323441.txt"
    is_color: false
    shuffle: true
    new_height: 128
    new_width: 128
    batch_size: 3
  }
  transform_param {
    scale: 0.00390625
  }
}
layer {
  name: "ourface"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  image_data_param {
    root_folder: "D:/DB/Aligned/"
    source: "../../examples/clean_casia/valid_36564.txt"
    is_color: false
    new_height: 128
    new_width: 128
    batch_size: 3
  }
  transform_param {
    scale: 0.00390625
  }
}
layer{
  name: "conv1"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 96
    kernel_size: 5
    stride: 1
    pad: 2
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "data"
  top: "conv1"
}
layer{
  name: "slice1"
  type: "Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv1"
  top: "slice1_1"
  top: "slice1_2"
}
layer{
  name: "etlwise1"
  type: "Eltwise"
  bottom: "slice1_1"
  bottom: "slice1_2"
  top: "eltwise1"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "pool1"
  type: "Pooling"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
  bottom: "eltwise1"
  top: "pool1"
}
layer{
  name: "conv2a"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 96
    kernel_size: 1
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "pool1"
  top: "conv2a"
}
layer{
  name: "slice2a"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv2a"
  top: "slice2a_1"
  top: "slice2a_2"
}
layer{
  name: "etlwise2a"
  type: "Eltwise"
  bottom: "slice2a_1"
  bottom: "slice2a_2"
  top: "eltwise2a"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "conv2"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 192
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "eltwise2a"
  top: "conv2"
}
layer{
  name: "slice2"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv2"
  top: "slice2_1"
  top: "slice2_2"
}
layer{
  name: "etlwise2"
  type: "Eltwise"
  bottom: "slice2_1"
  bottom: "slice2_2"
  top: "eltwise2"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "pool2"
  type: "Pooling"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
  bottom: "eltwise2"
  top: "pool2"
}
layer{
  name: "conv3a"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 192
    kernel_size: 1
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "pool2"
  top: "conv3a"
}
layer{
  name: "slice3a"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv3a"
  top: "slice3a_1"
  top: "slice3a_2"
}
layer{
  name: "etlwise3a"
  type: "Eltwise"
  bottom: "slice3a_1"
  bottom: "slice3a_2"
  top: "eltwise3a"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "conv3"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 384
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "eltwise3a"
  top: "conv3"
}
layer{
  name: "slice3"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv3"
  top: "slice3_1"
  top: "slice3_2"
}
layer{
  name: "etlwise3"
  type: "Eltwise"
  bottom: "slice3_1"
  bottom: "slice3_2"
  top: "eltwise3"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "pool3"
  type: "Pooling"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
  bottom: "eltwise3"
  top: "pool3"
}
layer{
  name: "conv4a"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param{
    num_output: 384
    kernel_size: 1
    stride: 1
    weight_filler{
      type:"xavier"
    }
    bias_filler{
      type: "constant"
      value: 0.1    
    }
  }
  bottom: "pool3"
  top: "conv4a"
}
layer{
  name: "slice4a"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv4a"
  top: "slice4a_1"
  top: "slice4a_2"
}
layer{
  name: "etlwise4a"
  type: "Eltwise"
  bottom: "slice4a_1"
  bottom: "slice4a_2"
  top: "eltwise4a"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "conv4"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param{
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler{
      type:"xavier"
    }
    bias_filler{
      type: "constant"
      value: 0.1    
    }
  }
  bottom: "eltwise4a"
  top: "conv4"
}
layer{
  name: "slice4"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv4"
  top: "slice4_1"
  top: "slice4_2"
}
layer{
  name: "etlwise4"
  type: "Eltwise"
  bottom: "slice4_1"
  bottom: "slice4_2"
  top: "eltwise4"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "conv5a"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param{
    num_output: 256
    kernel_size: 1
    stride: 1
    weight_filler{
      type:"xavier"
    }
    bias_filler{
      type: "constant"
      value: 0.1    
    }
  }
  bottom: "eltwise4"
  top: "conv5a"
}
layer{
  name: "slice5a"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv5a"
  top: "slice5a_1"
  top: "slice5a_2"
}
layer{
  name: "etlwise5a"
  type: "Eltwise"
  bottom: "slice5a_1"
  bottom: "slice5a_2"
  top: "eltwise5a"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "conv5"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param{
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler{
      type:"xavier"
    }
    bias_filler{
      type: "constant"
      value: 0.1    
    }
  }
  bottom: "eltwise5a"
  top: "conv5"
}
layer{
  name: "slice5"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv5"
  top: "slice5_1"
  top: "slice5_2"
}
layer{
  name: "etlwise5"
  type: "Eltwise"
  bottom: "slice5_1"
  bottom: "slice5_2"
  top: "eltwise5"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "pool4"
  type: "Pooling"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
  bottom: "eltwise5"
  top: "pool4"
}
layer{
  name: "fc1"
  type: "InnerProduct"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 512
    weight_filler {
      type: "gaussian"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }   
  }  
  bottom: "pool4"
  top: "fc1"
}
layer{
  name: "slice_fc1"
  type: "Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "fc1"
  top: "slice_fc1_1"
  top: "slice_fc1_2"
}
layer{
  name: "etlwise_fc1"
  type: "Eltwise"
  bottom: "slice_fc1_1"
  bottom: "slice_fc1_2"
  top: "eltwise_fc1"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "drop1"
  type: "Dropout"
  dropout_param{
    dropout_ratio: 0.75
  }
  bottom: "eltwise_fc1"
  top: "eltwise_fc1"
}
layer{
  name: "fc2"
  type: "InnerProduct"
  inner_product_param{
    num_output: 10575
    weight_filler {
      type: "gaussian"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }   
  }
  bottom: "eltwise_fc1"
  top: "fc2"
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "fc2"
  bottom: "label"
  top: "accuracy"
  include: { phase: TEST }
}
layer {
  name: "softmaxloss"
  type: "SoftmaxWithLoss"
  bottom: "fc2"
  bottom: "label"
  top: "loss"
}

// solver

net: "../../examples/clean_casia/LightenedCNN_B_train_val.prototxt"
test_iter: 1000
test_interval: 10000
iter_size: 60
base_lr: 0.001
lr_policy: "step"
gamma: 0.1
stepsize: 500000
display: 100
max_iter: 5000000
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "../../examples/clean_casia/AlfredB"
solver_mode: GPU

@AlfredXiangWu (Owner)

It seems that your training data preparation is incorrect.
I normalize the images to 144x144 and then use "crop_size" to crop them from 144x144 down to 128x128. But I don't think that is the most important reason in your case; you should check your alignment method. There is an example of a face image after alignment in #4.
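For example, the TRAIN data layer would look roughly like this (paths reused from the prototxt above; the exact new_height/new_width and crop settings here are an illustration, not my exact files):

layer {
  name: "ourface"
  type: "ImageData"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  image_data_param {
    root_folder: "D:/DB/Aligned/"
    source: "../../examples/clean_casia/train_323441.txt"
    is_color: false
    shuffle: true
    new_height: 144     # resize to 144x144 first
    new_width: 144
    batch_size: 3
  }
  transform_param {
    scale: 0.00390625
    crop_size: 128      # random 128x128 crop during TRAIN
    mirror: true        # random horizontal flip
  }
}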

@AlfredXiangWu (Owner)

Besides, your training batch size is too small, which can lead to slow convergence of the CNN.

@cheer37 (Author) commented Mar 17, 2016

Yes, data augmentation mainly affects accuracy, not whether the net converges.
I cropped a 160x160 region from the center and resized it to 128x128.
I aligned the faces with an in-plane affine transform of my own, based on the two eye points and the distance between them.
I don't think the alignment method matters for convergence either.

You are right that a small batch size tends to cause divergence, but I have no way to increase it because I am using a 1 GB GPU.
Caffe provides a feature that accumulates the gradient and loss for cases like mine (a small GPU).
It is controlled by iter_size; I set it to 60, which is equivalent to a batch size of 180 (3 x 60).
So I don't think it is due to the batch size either.
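The relevant lines are (batch_size from my data layer, iter_size from my solver above):

# train_val.prototxt, TRAIN data layer
batch_size: 3     # limited by a 1 GB GPU

# solver
iter_size: 60     # accumulate gradients over 60 forward/backward passes
                  # effective batch size = 3 * 60 = 180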
If you can't figure out what the problem is, would you test it for me with my prototxt and solver?
Thanks.

@AlfredXiangWu (Owner)

A batch size that is too small may hurt my network in particular, because its non-linear activation function is more complex than ReLU. I set the batch size to about 30~100 and the network converges.
Moreover, your loss is 87 while my training loss is 9.2 at the beginning, so I think your data preparation is wrong. Please check it carefully.

I am sorry that I am away on business now, so I don't have a GPU to check your network configuration.

@cheer37 (Author) commented Mar 17, 2016

Thanks, Alfred.
Until when are you away on business?
I think the main culprit might be the batch size, but I can't increase it because of my small GPU, which is why I asked you to try my configuration: a small batch size combined with iter_size in the solver.
It is easy to reproduce my situation with my prototxt and solver, because it happens as soon as training starts.
Since I am in a hurry, would you check it for me as soon as you can? I can wait for you.
Thanks in advance.

@cheer37 (Author) commented Mar 19, 2016

I solved my problem.
Following the paper, I had set the weight filler of the two fc layers to gaussian.
That was the culprit.
I changed it to xavier, and the network converges now.
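For anyone hitting the same issue, the change is just the filler type in fc1 and fc2 (shown here for fc1, matching the prototxt above):

layer {
  name: "fc1"
  type: "InnerProduct"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 512
    weight_filler {
      type: "xavier"     # was "gaussian"; with xavier the net converges
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "pool4"
  top: "fc1"
}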

cheer37 closed this as completed Mar 19, 2016
cheer37 reopened this Mar 20, 2016
@cheer37 (Author) commented Mar 20, 2016

  1. How does the performance change with the batch size?
  2. Did you test the case without random cropping, i.e. just cropping the CASIA images to 128x128 and feeding them to the net directly? How is the performance in that case?
  3. What about mirror augmentation? How is the performance without mirrored images in training?
  4. Lastly, CASIA barely achieved 97.2%, yet you beat it. What do you think is the most important factor that pushed your performance past CASIA's?

I am waiting for your reply. Thanks.

@AlfredXiangWu (Owner)

  1. I think a batch size that is too small leads to slow convergence, especially for my network, because its activation function is harder to train at the beginning than ReLU.
  2. Data augmentation such as "crop" and "mirror" is an effective way to increase the diversity of the training data. Without those methods, the performance drops in my experiments.
  3. That is a difficult question to answer. I think everything I mentioned in my paper contributes to the results.

BTW, the MSRA paper [1] analyzes the disadvantages of Gaussian initialization.

[1] He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).
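In Caffe that initialization is available as the "msra" weight filler, so a sketch of an alternative to xavier in any of the layers above would be:

weight_filler {
  type: "msra"     # He et al. initialization, variance scaled by fan-in
}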

@cheer37 (Author) commented Mar 21, 2016

Thank you.
Have you made any progress beyond 98.13% so far?
If so, what is your performance now? And what about the triplet and Siamese versions?
I would like to know the current state of your research.
Also, I saw your training log for the B model somewhere, but I can't find it now. Please share it here.
Thanks.

@afshindn

@cheer37 I'm having the same issue. The loss value is 87.33 and it doesn't converge, and changing the filler from Gaussian to Xavier doesn't make a difference. Could you or @AlfredXiangWu share the train_val.prototxt as well as the solver.prototxt with me?

Thanks

@cheer37 (Author) commented Mar 27, 2016

@afshindn
I already shared my train_val and solver above, but don't set the batch size to 3 like I did; it's too small, so set it to more than 12.
Changing the two gaussian fillers in the last two fc layers to xavier definitely works.
There may be one of two reasons your net diverges.
First, did you shuffle your training dataset?
Second, Xiang Wu's model is sensitive to the learning rate, so unlike other nets you should set it small; at the beginning it should be below 0.001.
Hope this helps.
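In other words, check these two settings (the values are an illustration of my own setup, not a recommendation from the author):

# data layer: the training list must be shuffled
image_data_param {
  shuffle: true
}

# solver: start with a small learning rate
base_lr: 0.001     # or lower at the beginning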

@afshindn

@cheer37
You're right, it is super sensitive to the learning rate. If I start from 0.005 instead of 0.001, it blows up.

@iscas-lee

@cheer37
I am confused about feature extraction.
Could you share C++ or MATLAB feature-extraction code that uses Wu's model?
I would appreciate your help. My email is 2759755502@qq.com
