B model does not converge. #36

Closed · cheer37 opened this issue Mar 17, 2016 · 16 comments

@cheer37 commented Mar 17, 2016

I am trying to train the lightened_B model on CASIA.
I followed the training methodology in the paper, but it does not converge.
At the beginning, the accuracy is 0 and the loss is 87.3365.
What could be the problem?
I am using the caffe-windows-master of happynear (Feng Wang).
Thanks.

@AlfredXiangWu (Owner)

As far as I can tell, that loss value is incorrect. The training loss is about 9.2~9.3 at the beginning for the CASIA-WebFace dataset (roughly ln(10575) ≈ 9.27, i.e. chance-level softmax over the 10,575 identities).

@cheer37 (Author) commented Mar 17, 2016

Yes, when I train my own network, what you say is correct.
But I ran into this situation with your net.
Would you check my prototxt and solver?
Thanks.

@cheer37 (Author) commented Mar 17, 2016

// train_val.prototxt

name: "Alfred_B"
layer {
  name: "ourface"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  image_data_param {
    root_folder: "D:/DB/Aligned/"
    source: "../../examples/clean_casia/train_323441.txt"
    is_color: false
    shuffle: true
    new_height: 128
    new_width: 128
    batch_size: 3
  }
  transform_param {
    scale: 0.00390625
  }
}
layer {
  name: "ourface"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  image_data_param {
    root_folder: "D:/DB/Aligned/"
    source: "../../examples/clean_casia/valid_36564.txt"
    is_color: false
    new_height: 128
    new_width: 128
    batch_size: 3
  }
  transform_param {
    scale: 0.00390625
  }
}
layer{
  name: "conv1"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 96
    kernel_size: 5
    stride: 1
    pad: 2
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "data"
  top: "conv1"
}
layer{
  name: "slice1"
  type: "Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv1"
  top: "slice1_1"
  top: "slice1_2"
}
layer{
  name: "etlwise1"
  type: "Eltwise"
  bottom: "slice1_1"
  bottom: "slice1_2"
  top: "eltwise1"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "pool1"
  type: "Pooling"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
  bottom: "eltwise1"
  top: "pool1"
}
layer{
  name: "conv2a"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 96
    kernel_size: 1
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "pool1"
  top: "conv2a"
}
layer{
  name: "slice2a"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv2a"
  top: "slice2a_1"
  top: "slice2a_2"
}
layer{
  name: "etlwise2a"
  type: "Eltwise"
  bottom: "slice2a_1"
  bottom: "slice2a_2"
  top: "eltwise2a"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "conv2"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 192
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "eltwise2a"
  top: "conv2"
}
layer{
  name: "slice2"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv2"
  top: "slice2_1"
  top: "slice2_2"
}
layer{
  name: "etlwise2"
  type: "Eltwise"
  bottom: "slice2_1"
  bottom: "slice2_2"
  top: "eltwise2"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "pool2"
  type: "Pooling"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
  bottom: "eltwise2"
  top: "pool2"
}
layer{
  name: "conv3a"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 192
    kernel_size: 1
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "pool2"
  top: "conv3a"
}
layer{
  name: "slice3a"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv3a"
  top: "slice3a_1"
  top: "slice3a_2"
}
layer{
  name: "etlwise3a"
  type: "Eltwise"
  bottom: "slice3a_1"
  bottom: "slice3a_2"
  top: "eltwise3a"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "conv3"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 384
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "eltwise3a"
  top: "conv3"
}
layer{
  name: "slice3"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv3"
  top: "slice3_1"
  top: "slice3_2"
}
layer{
  name: "etlwise3"
  type: "Eltwise"
  bottom: "slice3_1"
  bottom: "slice3_2"
  top: "eltwise3"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "pool3"
  type: "Pooling"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
  bottom: "eltwise3"
  top: "pool3"
}
layer{
  name: "conv4a"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param{
    num_output: 384
    kernel_size: 1
    stride: 1
    weight_filler{
      type:"xavier"
    }
    bias_filler{
      type: "constant"
      value: 0.1    
    }
  }
  bottom: "pool3"
  top: "conv4a"
}
layer{
  name: "slice4a"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv4a"
  top: "slice4a_1"
  top: "slice4a_2"
}
layer{
  name: "etlwise4a"
  type: "Eltwise"
  bottom: "slice4a_1"
  bottom: "slice4a_2"
  top: "eltwise4a"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "conv4"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param{
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler{
      type:"xavier"
    }
    bias_filler{
      type: "constant"
      value: 0.1    
    }
  }
  bottom: "eltwise4a"
  top: "conv4"
}
layer{
  name: "slice4"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv4"
  top: "slice4_1"
  top: "slice4_2"
}
layer{
  name: "etlwise4"
  type: "Eltwise"
  bottom: "slice4_1"
  bottom: "slice4_2"
  top: "eltwise4"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "conv5a"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param{
    num_output: 256
    kernel_size: 1
    stride: 1
    weight_filler{
      type:"xavier"
    }
    bias_filler{
      type: "constant"
      value: 0.1    
    }
  }
  bottom: "eltwise4"
  top: "conv5a"
}
layer{
  name: "slice5a"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv5a"
  top: "slice5a_1"
  top: "slice5a_2"
}
layer{
  name: "etlwise5a"
  type: "Eltwise"
  bottom: "slice5a_1"
  bottom: "slice5a_2"
  top: "eltwise5a"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "conv5"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param{
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler{
      type:"xavier"
    }
    bias_filler{
      type: "constant"
      value: 0.1    
    }
  }
  bottom: "eltwise5a"
  top: "conv5"
}
layer{
  name: "slice5"
  type:"Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "conv5"
  top: "slice5_1"
  top: "slice5_2"
}
layer{
  name: "etlwise5"
  type: "Eltwise"
  bottom: "slice5_1"
  bottom: "slice5_2"
  top: "eltwise5"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "pool4"
  type: "Pooling"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
  bottom: "eltwise5"
  top: "pool4"
}
layer{
  name: "fc1"
  type: "InnerProduct"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 512
    weight_filler {
      type: "gaussian"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }   
  }  
  bottom: "pool4"
  top: "fc1"
}
layer{
  name: "slice_fc1"
  type: "Slice"
  slice_param {
    slice_dim: 1
  }
  bottom: "fc1"
  top: "slice_fc1_1"
  top: "slice_fc1_2"
}
layer{
  name: "etlwise_fc1"
  type: "Eltwise"
  bottom: "slice_fc1_1"
  bottom: "slice_fc1_2"
  top: "eltwise_fc1"
  eltwise_param {
    operation: MAX
  }
}
layer{
  name: "drop1"
  type: "Dropout"
  dropout_param{
    dropout_ratio: 0.75
  }
  bottom: "eltwise_fc1"
  top: "eltwise_fc1"
}
layer{
  name: "fc2"
  type: "InnerProduct"
  inner_product_param{
    num_output: 10575
    weight_filler {
      type: "gaussian"
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }   
  }
  bottom: "eltwise_fc1"
  top: "fc2"
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "fc2"
  bottom: "label"
  top: "accuracy"
  include: { phase: TEST }
}
layer {
  name: "softmaxloss"
  type: "SoftmaxWithLoss"
  bottom: "fc2"
  bottom: "label"
  top: "loss"
}

// solver

net: "../../examples/clean_casia/LightenedCNN_B_train_val.prototxt"
test_iter: 1000
test_interval: 10000
iter_size: 60
base_lr: 0.001
lr_policy: "step"
gamma: 0.1
stepsize: 500000
display: 100
max_iter: 5000000
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "../../examples/clean_casia/AlfredB"
solver_mode: GPU

@AlfredXiangWu (Owner)

It seems that your training data preparation is incorrect.
I normalize the images to 144x144 and then use "crop_size" to crop them from 144x144 down to 128x128. But I don't think that is the most important reason in your case; you should check your alignment method. There is an example of a face image after alignment in #4.
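For example, the TRAIN data layer would look roughly like this (paths reused from the prototxt above; the exact new_height/new_width and crop settings here are an illustration, not my exact files):

layer {
  name: "ourface"
  type: "ImageData"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  image_data_param {
    root_folder: "D:/DB/Aligned/"
    source: "../../examples/clean_casia/train_323441.txt"
    is_color: false
    shuffle: true
    new_height: 144     # resize to 144x144 first
    new_width: 144
    batch_size: 3
  }
  transform_param {
    scale: 0.00390625
    crop_size: 128      # random 128x128 crop during TRAIN
    mirror: true        # random horizontal flip
  }
}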

@AlfredXiangWu (Owner)

Besides, your training batch size is too small, which can lead to slow convergence of the CNN.

@cheer37 (Author) commented Mar 17, 2016

Yes, data augmentation mainly affects accuracy, not whether the net converges.
I cropped a 160x160 region from the center and resized it to 128x128.
I aligned the faces with an in-plane affine transform of my own, based on the two eye points and the distance between them.
I don't think the alignment method matters for convergence either.

You are right that a small batch size tends to cause divergence, but I have no way to increase it because I am using a 1 GB GPU.
Caffe provides a feature that accumulates the gradient and loss for cases like mine (a small GPU).
It is controlled by iter_size; I set it to 60, which is equivalent to a batch size of 180 (3 x 60).
So I don't think it is due to the batch size either.
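The relevant lines are (batch_size from my data layer, iter_size from my solver above):

# train_val.prototxt, TRAIN data layer
batch_size: 3     # limited by a 1 GB GPU

# solver
iter_size: 60     # accumulate gradients over 60 forward/backward passes
                  # effective batch size = 3 * 60 = 180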
If you can't figure out what the problem is, would you test it for me with my prototxt and solver?
Thanks.

@AlfredXiangWu (Owner)

A batch size that is too small may hurt my network in particular, because its non-linear activation function is more complex than ReLU. I set the batch size to about 30~100 and the network converges.
Moreover, your loss is 87 while my training loss is 9.2 at the beginning, so I think your data preparation is wrong. Please check it carefully.

I am sorry that I am away on business now, so I don't have a GPU to check your network configuration.

@cheer37 (Author) commented Mar 17, 2016

Thanks, Alfred.
Until when are you away on business?
I think the main culprit might be the batch size, but I can't increase it because of my small GPU, which is why I asked you to try my configuration: a small batch size combined with iter_size in the solver.
It is easy to reproduce my situation with my prototxt and solver, because it happens as soon as training starts.
Since I am in a hurry, would you check it for me as soon as you can? I can wait for you.
Thanks in advance.

@cheer37 (Author) commented Mar 19, 2016

I solved my problem.
Following the paper, I had set the weight filler of the two fc layers to gaussian.
That was the culprit.
I changed it to xavier, and the network converges now.
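For anyone hitting the same issue, the change is just the filler type in fc1 and fc2 (shown here for fc1, matching the prototxt above):

layer {
  name: "fc1"
  type: "InnerProduct"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 512
    weight_filler {
      type: "xavier"     # was "gaussian"; with xavier the net converges
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
  bottom: "pool4"
  top: "fc1"
}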

cheer37 closed this as completed Mar 19, 2016
cheer37 reopened this Mar 20, 2016
@cheer37 (Author) commented Mar 20, 2016

  1. How does the performance change with the batch size?
  2. Did you test the case without random cropping, i.e. just cropping the CASIA images to 128x128 and feeding them to the net directly? How is the performance in that case?
  3. What about mirror augmentation? How is the performance without mirrored images in training?
  4. Lastly, CASIA barely achieved 97.2%, yet you beat it. What do you think is the most important factor that pushed your performance past CASIA's?

I am waiting for your reply. Thanks.

@AlfredXiangWu (Owner)

  1. I think a batch size that is too small leads to slow convergence, especially for my network, because its activation function is harder to train at the beginning than ReLU.
  2. Data augmentation such as "crop" and "mirror" is an effective way to increase the diversity of the training data. Without those methods, the performance drops in my experiments.
  3. That is a difficult question to answer. I think everything I mentioned in my paper contributes to the results.

BTW, the MSRA paper [1] analyzes the disadvantages of Gaussian initialization.

[1] He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).
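In Caffe that initialization is available as the "msra" weight filler, so a sketch of an alternative to xavier in any of the layers above would be:

weight_filler {
  type: "msra"     # He et al. initialization, variance scaled by fan-in
}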

@cheer37 (Author) commented Mar 21, 2016

Thank you.
Have you made any progress beyond 98.13% so far?
If so, what is your performance now? And what about the triplet and Siamese versions?
I would like to know the current state of your research.
Also, I saw your training log for the B model somewhere, but I can't find it now. Please share it here.
Thanks.

@afshindn

@cheer37 I'm having the same issue. The loss value is 87.33 and it doesn't converge, and changing the filler from Gaussian to Xavier doesn't make a difference. Could you or @AlfredXiangWu share the train_val.prototxt as well as the solver.prototxt with me?

Thanks

@cheer37 (Author) commented Mar 27, 2016

@afshindn
I already shared my train_val and solver above, but don't set the batch size to 3 like I did; it's too small, so set it to more than 12.
Changing the two gaussian fillers in the last two fc layers to xavier definitely works.
There may be one of two reasons your net diverges.
First, did you shuffle your training dataset?
Second, Xiang Wu's model is sensitive to the learning rate, so unlike other nets you should set it small; at the beginning it should be below 0.001.
Hope this helps.
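In other words, check these two settings (the values are an illustration of my own setup, not a recommendation from the author):

# data layer: the training list must be shuffled
image_data_param {
  shuffle: true
}

# solver: start with a small learning rate
base_lr: 0.001     # or lower at the beginning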

@afshindn

@cheer37
You're right, it is super sensitive to the learning rate. If I start from 0.005 instead of 0.001, it blows up.

@iscas-lee

@cheer37
I am confused about feature extraction.
Could you share C++ or MATLAB feature-extraction code that uses Wu's model?
I would appreciate your help. My email is 2759755502@qq.com
