<a href="https://colab.research.google.com/github/Erickrus/llm/blob/main/deepspeed_cifar_10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
 <img src="https://raw.githubusercontent.com/microsoft/DeepSpeed/master/docs/assets/images/DeepSpeed_light.svg#gh-light-mode-only" width="400px">
</div>

Extreme Speed and Scale for DL Training and Inference

[DeepSpeed](https://www.deepspeed.ai/) is an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for Deep Learning Training and Inference. With DeepSpeed you can:

* Train and run inference on dense or sparse models with billions or trillions of parameters
* Achieve excellent system throughput and scale efficiently to thousands of GPUs
* Train and run inference on resource-constrained GPU systems
* Achieve unprecedentedly low latency and high throughput for inference
* Achieve extreme compression for unparalleled reductions in inference latency and model size at low cost

In [None]:
#@title download cifar10 files

#@markdown
!wget https://raw.githubusercontent.com/microsoft/DeepSpeedExamples/master/training/cifar/cifar10_tutorial.py
!wget https://raw.githubusercontent.com/microsoft/DeepSpeedExamples/master/training/cifar/cifar10_deepspeed.py
!wget https://raw.githubusercontent.com/microsoft/DeepSpeedExamples/master/training/cifar/ds_config.json


--2023-04-04 06:08:32--  https://raw.githubusercontent.com/microsoft/DeepSpeedExamples/master/training/cifar/cifar10_tutorial.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12658 (12K) [text/plain]
Saving to: ‘cifar10_tutorial.py’


2023-04-04 06:08:32 (85.2 MB/s) - ‘cifar10_tutorial.py’ saved [12658/12658]



In [None]:
#@title install packages

!pip3 install deepspeed mpi4py

In [None]:
#@title define FabUtil

#@markdown FabUtil
import os
import json

import tarfile
from zipfile import ZipFile
from google.colab._system_commands import _shell_line_magic as shell_line_magic


class FabUtil:
  def cust_code(self, codeFilename, content):
    self._ensure_dir(codeFilename)
    with open(codeFilename, 'w') as f:
      f.write(content)

  def fabricate(self, fabs):
    # accept either a path to a JSON file or an already-loaded fabs dict
    if isinstance(fabs, str):
      with open(fabs, "r") as f:
        # parse the file contents, not the filename string
        fabs = json.load(f)

    for fab in fabs["fabs"]:
      if "cmd" in fab:
        print("%s" % fab["cmd"])
        shell_line_magic("%s" % fab["cmd"])
        #os.system("%s" % fab["cmd"])
        continue

      if "patches" in fab:
        self._patch(fab["srcFilename"], fab["patches"])
        continue

      entryFilename = ""
      srcFilename = fab["srcFilename"]
      if srcFilename.find("::") > 0:
        srcFilename, entryFilename = srcFilename.split("::")
      tgtFilename = fab["tgtFilename"]
      srcFilename = os.path.join(fabs["baseDir"], srcFilename)

      if entryFilename != "":
        self._process_zip_file(srcFilename, entryFilename, tgtFilename)
      else:
        self._ensure_dir(tgtFilename)
        os.system("cp %s %s" % (srcFilename, tgtFilename))
        print("fabricated %s ==> %s" % (srcFilename, tgtFilename))

  def _patch(self, filename, patches):
    changed = False
    with open(filename, 'r') as f:
      lines = f.read().split('\n')
    for patchItem in patches:
      lineNum = patchItem['lineNum']
      fromText = patchItem['fromText']
      toText = patchItem['toText']
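      # patch only when the target line matches exactly (including indentation)
      if lines[lineNum-1] == fromText:
        lines[lineNum-1] = toText
        changed = True
      else:
        print("patch skipped at line %d of %s: text does not match" % (lineNum, filename))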
    if changed:
      with open(filename, 'w') as f:
        f.write('\n'.join(lines))

  def _ensure_dir(self, tgtFilename):
    dirName = os.path.dirname(tgtFilename)
    # dirName is "" for a bare filename; there is nothing to create in that case
    if dirName:
      os.makedirs(dirName, exist_ok=True)

  def _process_zip_file(self, srcFilename, entryFilename, tgtFilename):
    try:
      if srcFilename.lower().find(".tar") > 0 or srcFilename.lower().find(".tgz") > 0:
        fileOp = 'r'
        if srcFilename.lower().endswith('.tar.gz') or srcFilename.lower().endswith('.tgz'): # gzip
            fileOp = 'r:gz'
        elif srcFilename.lower().endswith('.tar.bz2'): # bzip2
            fileOp = 'r:bz2'
        elif srcFilename.lower().endswith('.tar.xz'): # lzma
            fileOp = 'r:xz'
        with tarfile.open(srcFilename, fileOp) as tar:
          self._ensure_dir(tgtFilename)
          with open(tgtFilename, "wb") as f:
            f.write(tar.extractfile(entryFilename).read())
        print("fabricated %s::%s ==> %s" % (srcFilename, entryFilename, tgtFilename))
        return
      if srcFilename.lower().find(".zip") >0:
        with ZipFile(srcFilename, 'r') as z:
          self._ensure_dir(tgtFilename)
          with open(tgtFilename, "wb") as f:
            f.write(z.read(entryFilename))
        print("fabricated %s::%s ==> %s" % (srcFilename, entryFilename, tgtFilename))
        return
    except Exception as e:  # report the cause instead of hiding it
      print("failed %s::%s ==> %s (%s)" % (srcFilename, entryFilename, tgtFilename, e))
      return
    print("not found %s::%s ==> %s" % (srcFilename, entryFilename, tgtFilename))

fb = FabUtil()
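
The line-patching idea behind `FabUtil._patch` can be tried outside Colab with a minimal standalone sketch. The helper name `apply_line_patches` and the demo file are hypothetical and not part of the notebook; the sketch only mirrors the "replace line N if it equals `fromText`" behaviour shown above.

```python
import os
import tempfile


def apply_line_patches(path, patches):
    """Replace whole lines by 1-based line number, only when the current text matches.

    Hypothetical helper mirroring FabUtil._patch; returns the number of patches applied.
    """
    with open(path, "r") as f:
        lines = f.read().split("\n")
    applied = 0
    for p in patches:
        idx = p["lineNum"] - 1
        # Exact comparison: leading whitespace in the file must match fromText too.
        if 0 <= idx < len(lines) and lines[idx] == p["fromText"]:
            lines[idx] = p["toText"]
            applied += 1
    if applied:
        with open(path, "w") as f:
            f.write("\n".join(lines))
    return applied


# Demo on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.py")
    with open(demo_path, "w") as f:
        f.write("a = 1\nb = it.next()\nc = 3")
    n_applied = apply_line_patches(demo_path, [
        {"lineNum": 2, "fromText": "b = it.next()", "toText": "b = next(it)"},
        {"lineNum": 3, "fromText": "no match", "toText": "ignored"},
    ])
    with open(demo_path) as f:
        patched = f.read()

print(n_applied)
print(patched)
```

Because the match is on the exact line number and text, a patch silently stops applying if the upstream file shifts by even one line, so it is worth checking the returned count.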


In [None]:
#@title modify cifar10_deepspeed.py

fb.fabricate({
  "baseDir": "/content/",
  "fabs": [
  {
        #@markdown Resolve: AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute 'next'
        #@markdown
        #@markdown https://stackoverflow.com/questions/74289077/attributeerror-multiprocessingdataloaderiter-object-has-no-attribute-next
        "srcFilename": "cifar10_deepspeed.py",
        "patches": [{
        "lineNum": 162,
        "fromText": 'images, labels = dataiter.next()',
        "toText":   'images, labels = next(dataiter)',
        },
        {
        "lineNum": 312,
        "fromText": 'images, labels = dataiter.next()',
        "toText":   'images, labels = next(dataiter)',
        }
        ]
    }
  ]
})
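
The patch above swaps `dataiter.next()` for the built-in `next(dataiter)`. Python 3 iterators implement `__next__`, which the built-in calls; a `.next()` method exists only if a class chooses to define one. A minimal sketch, using a hypothetical `Batches` class as a stand-in for a DataLoader iterator:

```python
class Batches:
    """Tiny stand-in for a DataLoader iterator (hypothetical, for illustration only)."""

    def __init__(self, items):
        self._it = iter(items)

    def __iter__(self):
        return self

    def __next__(self):
        # The built-in next() dispatches to this method.
        return next(self._it)


dataiter = iter(Batches([("images0", "labels0"), ("images1", "labels1")]))

# Works on any Python 3 iterator
images, labels = next(dataiter)

# No .next() attribute unless the class defines one explicitly
has_next_method = hasattr(dataiter, "next")

print(images, labels, has_next_method)
```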

In [None]:
!cat ds_config.json

{
  "train_batch_size": 16,
  "steps_per_print": 2000,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "betas": [
        0.8,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 1000
    }
  },
  "gradient_clipping": 1.0,
  "prescale_gradients": false,
  "fp16": {
      "enabled": true,
      "fp16_master_weights_and_grads": false,
      "loss_scale": 0,
      "loss_scale_window": 500,
      "hysteresis": 2,
      "min_loss_scale": 1,
      "initial_scale_power": 15
  },
  "wall_clock_breakdown": false,
  "zero_optimization": {
      "stage": 0,
      "allgather_partitions": true,
      "reduce_scatter": true,
      "allgather_bucket_size": 50000000,
      "reduce_bucket_size": 50000000,
      "overlap_comm": true,
      "contiguous_gradients": true,
      "cpu_offload": false
  }
}
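
A few values implied by this config can be worked out by hand. The sketch below uses an excerpt of the JSON above and assumes a single GPU (`world_size = 1`) and the default of one gradient accumulation step, since the file does not set `gradient_accumulation_steps`. It relies on the documented DeepSpeed conventions that `train_batch_size` equals micro-batch per GPU × accumulation steps × number of GPUs, that `"loss_scale": 0` selects dynamic loss scaling, and that the initial dynamic scale is 2 raised to `initial_scale_power`.

```python
import json

# Excerpt of the ds_config.json printed above
config = json.loads("""
{
  "train_batch_size": 16,
  "fp16": {"enabled": true, "loss_scale": 0, "initial_scale_power": 15}
}
""")

world_size = 1  # assumption: a single GPU
# assumption: accumulation defaults to 1 when the key is absent
grad_accum_steps = config.get("gradient_accumulation_steps", 1)

# train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
micro_batch_per_gpu = config["train_batch_size"] // (grad_accum_steps * world_size)

fp16 = config["fp16"]
dynamic_loss_scaling = fp16["enabled"] and fp16["loss_scale"] == 0
initial_loss_scale = 2 ** fp16["initial_scale_power"]

print("micro batch per GPU:", micro_batch_per_gpu)
print("dynamic loss scaling:", dynamic_loss_scaling)
print("initial loss scale:", initial_loss_scale)
```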


In [None]:
#@title run cifar10_deepspeed.py
#@markdown ```shell
#@markdown deepspeed cifar10_deepspeed.py \
#@markdown --deepspeed \
#@markdown --deepspeed_config ds_config.json
#@markdown
#@markdown ```
!deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json

[2023-04-04 06:02:17,191] [INFO] [runner.py:550:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
[2023-04-04 06:02:19,346] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.16.2-1+cuda11.8
[2023-04-04 06:02:19,346] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.16.2-1
[2023-04-04 06:02:19,346] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.16.2-1
[2023-04-04 06:02:19,346] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-04 06:02:19,346] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.16.2-1+cuda11.8
[2023-04-04 06:02:19,346] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-04 06:02:19,346] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.16.2-1
[2023-04-04 06:02:19,347] [INFO] [launch.py:142:main
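
The `--world_info` value in the first log line appears to be base64-encoded JSON describing which GPUs the launcher will use on each host. A minimal sketch that decodes the string copied from the log; reading the keys as hostnames and the values as GPU indices is an assumption about the launcher's format:

```python
import base64
import json

# Value copied from the --world_info argument in the log above
world_info_b64 = "eyJsb2NhbGhvc3QiOiBbMF19"

# Decode base64, then parse the resulting JSON bytes
world_info = json.loads(base64.urlsafe_b64decode(world_info_b64))

# Assumption: keys are hostnames, values are lists of GPU indices on that host
num_local_gpus = len(world_info["localhost"])

print(world_info)
print(num_local_gpus)
```

For this run the decoded value lists one device on `localhost`, which matches a single-GPU Colab runtime.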

In [None]:
#@title diff 2 files

#@markdown For an easier side-by-side comparison, try https://www.diffchecker.com/text-compare/
!diff cifar10_tutorial.py cifar10_deepspeed.py

1,33c1,5
< # -*- coding: utf-8 -*-
< """
< Training a Classifier
< 
< This is it. You have seen how to define neural networks, compute loss and make
< updates to the weights of the network.
< 
< Now you might be thinking,
< 
< What about data?
< ----------------
< 
< Generally, when you have to deal with image, text, audio or video data,
< you can use standard python packages that load data into a numpy array.
< Then you can convert this array into a ``torch.*Tensor``.
< 
< -  For images, packages such as Pillow, OpenCV are useful
< -  For audio, packages such as scipy and librosa
< -  For text, either raw Python or Cython based loading, or NLTK and
<    SpaCy are useful
< 
< Specifically for vision, we have created a package called
< ``torchvision``, that has data loaders for common datasets such as
< Imagenet, CIFAR10, MNIST, etc. and data transformers for images, viz.,
< ``torchvision.datasets`` and ``torch.utils.data.DataLoader``.
< 
< This provides a huge convenience and avoids wr