
Slow DBNet++ Training #10204

Closed
prashantkh19 opened this issue Jun 19, 2023 · 2 comments

@prashantkh19

I've been training a DBNet++ model from scratch on around 3.5M training images and 17k validation images.
My training config looks like this:

Global:
  debug: false
  use_gpu: true
  epoch_num: 500
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/dbnet_plus_all/
  save_epoch_step: 1
  eval_batch_step: 
   - 0
   - 21000
  cal_metric_during_train: false
  pretrained_model: ../pretrained_models/ResNet50_dcn_asf_synthtext_pretrained
  checkpoints: ./output/dbnet_plus_all/latest
  save_inference_dir: ./inference_dir/
  use_visualdl: true
  infer_img: 
  save_res_path: ../vis_out/dbnet_plus_all/out.txt
Architecture:
  model_type: det
  algorithm: DB++
  Transform: null
  Backbone:
    name: ResNet
    layers: 50
    dcn_stage: [False, True, True, True]
  Neck:
    name: DBFPN
    out_channels: 256
    use_asf: True
  Head:
    name: DBHead
    k: 50
Loss:
  name: DBLoss
  balance_loss: true
  main_loss_type: BCELoss
  alpha: 5
  beta: 10
  ohem_ratio: 3
Optimizer:
  name: Momentum
  momentum: 0.9
  lr:
    name: DecayLearningRate
    learning_rate: 0.007
    epochs: 1000
    factor: 0.9
    end_lr: 0
  weight_decay: 0.0001
PostProcess:
  name: DBPostProcess
  thresh: 0.2
  box_thresh: 0.3
  max_candidates: 1000
  unclip_ratio: 1.8
  det_box_type: 'quad' # 'quad' or 'poly'
Metric:
  name: DetMetric
  main_indicator: hmean
Train:
  dataset:
    name: SimpleDataSet
    data_dir: ../
    label_file_list:
      - ../data_collection_aws/label_files/care_train.txt
      - ../data_collection_aws/label_files/listening_train.txt
      - ../data_collection_aws/label_files/marketing_train.txt
      - ../data_collection_aws/label_files/samsung_train.txt                                             
    ratio_list: [1.0, 1.0, 1.0, 1.0]                                                                                              
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - DetLabelEncode: null
    - IaaAugment:
        augmenter_args:
        - type: Fliplr
          args:
            p: 0.5
        - type: Affine
          args:
            rotate:
            - -10
            - 10
        - type: Resize
          args:
            size:
            - 0.5
            - 3
    - EastRandomCropData:
        size:
        - 960
        - 960
        max_tries: 10
        keep_ratio: true
    - MakeShrinkMap:
        shrink_ratio: 0.6
        min_text_size: 8
    - MakeBorderMap:
        shrink_ratio: 0.6
        thresh_min: 0.3
        thresh_max: 0.7
    - NormalizeImage:
        scale: 1./255.
        mean:
        - 0.48109378172549
        - 0.45752457890196
        - 0.40787054090196
        std:
        - 1.0
        - 1.0
        - 1.0
        order: hwc
    - ToCHWImage: null
    - KeepKeys:
        keep_keys:
        - image
        - threshold_map
        - threshold_mask
        - shrink_map
        - shrink_mask
  loader:
    shuffle: true
    drop_last: false
    batch_size_per_card: 16
    num_workers: 4
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ../
    label_file_list:
      - ../data_collection_aws/label_files/test_all.txt 
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - DetLabelEncode: null
    - DetResizeForTest:
        limit_side_len: 1080
        limit_type: max
    - NormalizeImage:
        scale: 1./255.
        mean:
        - 0.48109378172549
        - 0.45752457890196
        - 0.40787054090196
        std:
        - 1.0
        - 1.0
        - 1.0
        order: hwc
    - ToCHWImage: null
    - KeepKeys:
        keep_keys:
        - image
        - shape
        - polys
        - ignore_tags
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 1
    num_workers: 1
profiler_options: null

I'm using 8 A100 40GB GPUs, and my training ETA is still 17-18 days. Is there anything wrong with the config? How can I speed up training?
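As a sanity check on that ETA, here is a back-of-the-envelope sketch. The image count, batch size, GPU count, and epoch count come from the config above; the ~0.11 s per optimizer step is an assumption, roughly what a 17-18 day estimate implies for this setup:

```python
# Rough training-time estimate for the config above.
# Known from the config: 3.5M images, batch_size_per_card=16, 8 GPUs,
# epoch_num=500. Assumed (not measured): ~0.11 s per optimizer step.

def steps_per_epoch(num_images, batch_per_card, num_cards):
    """Optimizer steps needed to see every image once."""
    global_batch = batch_per_card * num_cards
    return -(-num_images // global_batch)  # ceiling division

def eta_days(num_images, batch_per_card, num_cards, epochs, sec_per_step):
    """Total wall-clock days for the whole run."""
    total = steps_per_epoch(num_images, batch_per_card, num_cards) * epochs
    return total * sec_per_step / 86400

print(steps_per_epoch(3_500_000, 16, 8))   # 27344 steps per epoch
print(round(eta_days(3_500_000, 16, 8, 500, 0.11), 1))  # ~17.4 days
# Doubling the per-card batch roughly halves the step count:
print(steps_per_epoch(3_500_000, 32, 8))   # 13672
```

The point: with a fixed per-step time, total time scales with step count, so either a larger global batch or fewer epochs is what actually moves the ETA.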

@ToddBear ToddBear added the good first issue Good for newcomers label Jun 30, 2023
@livingbody
Contributor

Turn up your 'batch_size_per_card' so you use the full GPU memory.
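A hedged sketch of that suggestion as a config diff. The value 32 is an assumed starting point for 960x960 crops on a 40 GB A100, not a tested setting, and the learning-rate scaling is a common heuristic (scale the base LR with the global batch) that should be verified empirically:

```
Train:
  loader:
    shuffle: true
    drop_last: false
    batch_size_per_card: 32   # was 16; raise until GPU memory is nearly full
    num_workers: 8            # was 4; more workers to keep the GPUs fed
Optimizer:
  lr:
    learning_rate: 0.014      # assumption: 0.007 * 2, matching the doubled global batch
```

If GPU utilization is low even after this, the data pipeline (decode + IaaAugment on 3.5M images) may be the bottleneck rather than the model itself.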

@shiyutang
Collaborator

The comment above answers the question; feel free to reopen it if there are any follow-up issues. We are also holding a PaddleSeg and PaddleOCR contribution activity that you are welcome to join: #10223
