Copyright (c) 2023 Habana Labs, Ltd. an Intel Company.

# Summarization using T5 model from Hugging Face

### Summarization with T5-3B model
We will use the Hugging Face summarization example to fine-tune the T5-3B model on the cnn_dailymail dataset.

run_summarization.py is a lightweight example of how to download and preprocess a dataset from the 🤗 Datasets library 

#### Initial Setup
We start with a Habana PyTorch Docker image and run this notebook inside it.

#### Install Habana's DeepSpeed Fork
Habana's DeepSpeed fork contains implementations specific to Gaudi and must be used instead of upstream DeepSpeed.

In [1]:
!pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.10.0  

Collecting git+https://github.com/HabanaAI/DeepSpeed.git@1.10.0
  Cloning https://github.com/HabanaAI/DeepSpeed.git (to revision 1.10.0) to /tmp/pip-req-build-83mxv2lp
  Running command git clone --filter=blob:none --quiet https://github.com/HabanaAI/DeepSpeed.git /tmp/pip-req-build-83mxv2lp
  Running command git checkout -b 1.10.0 --track origin/1.10.0
  Switched to a new branch '1.10.0'
  Branch '1.10.0' set up to track remote branch '1.10.0' from 'origin'.
  Resolved https://github.com/HabanaAI/DeepSpeed.git to commit 141faf783dac331bc3852590d05190dc0e883e51
  Preparing metadata (setup.py) ... done
Collecting hjson
  Downloading hjson-3.1.0-py3-none-any.whl (54 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.0/54.0 kB 4.0 MB/s eta 0:00:00
Collecting py-cpuinfo
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl (22 kB)
Collecting pydantic
  Downloading pydantic-1.10.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 M

#### Install the Optimum Habana Library

In [2]:
!python -m pip install optimum[habana]

Collecting optimum[habana]
  Downloading optimum-1.8.7.tar.gz (243 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 243.9/243.9 kB 3.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting coloredlogs
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.0/46.0 kB 4.9 MB/s eta 0:00:00
Collecting huggingface-hub>=0.8.0
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 236.8/236.8 kB 5.0 MB/s eta 0:00:00
Collecting transformers[sentencepiece]<4.30.0,>=4.26.0
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)

#### Clone the Hugging Face Optimum Habana Repository

In [3]:
!git clone  https://github.com/huggingface/optimum-habana

Cloning into 'optimum-habana'...
remote: Enumerating objects: 4007, done.
remote: Counting objects: 100% (1160/1160), done.
remote: Compressing objects: 100% (512/512), done.
remote: Total 4007 (delta 791), reused 870 (delta 612), pack-reused 2847
Receiving objects: 100% (4007/4007), 2.08 MiB | 5.25 MiB/s, done.
Resolving deltas: 100% (2593/2593), done.


#### Go to the summarization example and install the requirements

In [4]:
%cd optimum-habana/examples/summarization

/root/Gaudi2-Workshop/LLM-Training/optimum-habana/examples/summarization


In [5]:
!pip install -q -r requirements.txt


### Setup for DeepSpeed
Since we are using DeepSpeed, we have to confirm that the model has been configured properly.  We look for the following:

* `model, optimizer, ... = deepspeed.initialize(args=args, model=model, optimizer=optimizer, ...)`
* `deepspeed.init_distributed(dist_backend="hccl", init_method=init_method)`
* A `ds_config.json` file that sets the DeepSpeed training parameters.
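
One way to check the first two items is to search the library source for the calls. The following is a minimal sketch using only the Python standard library; the helper name `find_deepspeed_hooks` and the directory you pass to it are assumptions for illustration, not part of Optimum Habana.

```python
import re
from pathlib import Path

# Regular expressions for the two DeepSpeed calls listed above.
PATTERNS = {
    "deepspeed.initialize": re.compile(r"\bdeepspeed\.initialize\s*\("),
    "deepspeed.init_distributed": re.compile(r"\bdeepspeed\.init_distributed\s*\("),
}


def find_deepspeed_hooks(root):
    """Return {call name: [(file, line number), ...]} for all .py files under root.

    Hypothetical helper: it only greps the source text, it does not import
    or execute anything.
    """
    hits = {name: [] for name in PATTERNS}
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    hits[name].append((str(path), lineno))
    return hits
```

Calling `find_deepspeed_hooks(...)` on your local checkout of the library (the exact path depends on where the repository was cloned) should report at least one location for each call if DeepSpeed is wired in as described.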

#### DeepSpeed Initialization
In `deepspeed.py`, the model is passed to the DeepSpeed engine:

```
    import deepspeed
    from deepspeed.utils import logger as ds_logger

    model = trainer.model
    args = trainer.args
    ...

    kwargs = {
        "args": habana_args,
        "model": model,
        "model_parameters": model_parameters,
        "config_params": config,
        "optimizer": optimizer,
        "lr_scheduler": lr_scheduler,
    }

    deepspeedengine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)

```

#### DeepSpeed Distributed
In `training_args.py`, the DeepSpeed distributed backend is initialized:

```
    from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu
    world_size, rank, self.local_rank = initialize_distributed_hpu()

    import deepspeed
    deepspeed.init_distributed(dist_backend="hccl", timeout=timedelta(seconds=self.ddp_timeout))
    logger.info("DeepSpeed is enabled.")
```

#### Create DeepSpeed Config file with ZeRO preferences
The `ds_config.json` file configures the parameters used to run DeepSpeed.

In this case, we will use ZeRO stage 3 with BF16 mixed precision.

In [6]:
%pwd

'/root/Gaudi2-Workshop/LLM-Training/optimum-habana/examples/summarization'

In [7]:
%%sh
tee ./ds_config.json <<EOF
{
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_scatter": false,
        "contiguous_gradients": false,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
EOF

{
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_scatter": false,
        "contiguous_gradients": false,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
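
Before launching a long run, it can help to confirm that the file on disk contains the intended settings. The sketch below is an illustration only: `check_ds_config` and `effective_batch_size` are hypothetical helpers, not DeepSpeed or Optimum Habana APIs. It also spells out the relation DeepSpeed expects between the three batch-related values; the `"auto"` entries are normally filled in from the training arguments by the Hugging Face DeepSpeed integration.

```python
import json


def check_ds_config(path="ds_config.json"):
    """Load the config and verify the two settings this tutorial relies on."""
    with open(path) as f:
        cfg = json.load(f)
    if cfg.get("zero_optimization", {}).get("stage") != 3:
        raise ValueError("expected ZeRO stage 3")
    if cfg.get("bf16", {}).get("enabled") is not True:
        raise ValueError("expected bf16 to be enabled")
    return cfg


def effective_batch_size(per_device_batch, grad_accum_steps, world_size):
    """train_batch_size = micro batch per device * accumulation steps * devices."""
    return per_device_batch * grad_accum_steps * world_size


# Assumed values: batch size 4 per device on 8 devices, with the
# Transformers default of 1 gradient accumulation step.
print(effective_batch_size(4, 1, 8))  # 32
```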


#### Fine Tuning T5-3b with the cnn_dailymail dataset
The T5-3b model is a large language model that was originally trained on the C4 dataset. In this case, it will be fine-tuned on the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset, an English-language dataset containing just over 300k unique news articles written by journalists at CNN and the Daily Mail.

To run this example on first-generation Gaudi, change the model to "t5-large".

The training is launched by `gaudi_spawn.py`, a simple launcher script that collects the arguments and sends them to `distributed_runner.py` for training on multiple HPUs, which in turn calls the `run_summarization.py` script.

Notice the Habana-specific arguments used here:

`--use_habana` - allows training to run on Habana Gaudi  
`--use_hpu_graphs` - reduces recompilation by replaying the graph  
`--gaudi_config_name Habana/t5` - selects the Gaudi configuration that maps to the Hugging Face T5 model  

**Although the multi-billion-parameter T5-3b model can be fine-tuned, the fine-tuning takes many hours to complete.  
Users who want to run the example fine-tuning should change `model_name_or_path` to "t5-small", which takes only a few hours to complete.**


In [8]:
%pwd

'/root/Gaudi2-Workshop/LLM-Training/optimum-habana/examples/summarization'

In [9]:
!mkdir ft-summarization

**Run the command below in a terminal window** from the optimum-habana/examples/summarization folder. This is most likely here:
/root/Gaudi2-Workshop/LLM-Training/optimum-habana/examples/summarization/

```
python ../gaudi_spawn.py \
--world_size 8 --use_deepspeed run_summarization.py \
--model_name_or_path t5-3b \
--do_train \
--dataset_name cnn_dailymail \
--dataset_config '"3.0.0"' \
--source_prefix '"summarize: "' \
--output_dir ./ft-summarization \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--overwrite_output_dir \
--predict_with_generate \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs \
--gaudi_config_name Habana/t5 \
--ignore_pad_token_for_loss False \
--pad_to_max_length \
--save_strategy epoch \
--throughput_warmup_steps 3 \
--deepspeed ./ds_config.json
```
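
Because the text above suggests swapping in "t5-small" or "t5-large", it can be convenient to generate the command string instead of editing it by hand. This is a minimal sketch using only the standard library; `build_finetune_command` is a hypothetical helper, and the flags are copied from the command above rather than from any API. `shlex.join` reproduces the nested quoting that the original command uses for `--dataset_config` and `--source_prefix`.

```python
import shlex


def build_finetune_command(model_name="t5-3b", world_size=8,
                           output_dir="./ft-summarization"):
    """Return the fine-tuning command from this section as one shell string."""
    args = [
        "python", "../gaudi_spawn.py",
        "--world_size", str(world_size), "--use_deepspeed", "run_summarization.py",
        "--model_name_or_path", model_name,
        "--do_train",
        "--dataset_name", "cnn_dailymail",
        # The inner double quotes are part of the argument value,
        # exactly as in the original command.
        "--dataset_config", '"3.0.0"',
        "--source_prefix", '"summarize: "',
        "--output_dir", output_dir,
        "--per_device_train_batch_size", "4",
        "--per_device_eval_batch_size", "4",
        "--overwrite_output_dir",
        "--predict_with_generate",
        "--use_habana",
        "--use_lazy_mode",
        "--use_hpu_graphs",
        "--gaudi_config_name", "Habana/t5",
        "--ignore_pad_token_for_loss", "False",
        "--pad_to_max_length",
        "--save_strategy", "epoch",
        "--throughput_warmup_steps", "3",
        "--deepspeed", "./ds_config.json",
    ]
    return shlex.join(args)


# Print the variant for the smaller model; paste the result into a terminal.
print(build_finetune_command("t5-small"))
```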


### After fine tuning, let's look at the results
Fine-tuning has created a new `pytorch_model.bin`, and the global_step.. folder contains the checkpoints that will be used for inference in the next section.


In [10]:
%cd ./ft-summarization

/root/Gaudi2-Workshop/LLM-Training/optimum-habana/examples/summarization/ft-summarization


In [26]:
%ls -al

total 217716
drwxr-xr-x 6 root root      4096 Jun 14 18:44 ./
drwxr-xr-x 5 root root      4096 Jun 14 06:58 ../
-rw-r--r-- 1 root root       312 Jun 14 18:44 all_results.json
drwxr-xr-x 3 root root      4096 Jun 14 17:48 checkpoint-17946/
drwxr-xr-x 3 root root      4096 Jun 14 18:44 checkpoint-26919/
drwxr-xr-x 3 root root      4096 Jun 14 16:53 checkpoint-8973/
-rw-r--r-- 1 root root      1474 Jun 14 18:44 config.json
-rw-r--r-- 1 root root       506 Jun 14 18:44 gaudi_config.json
-rw-r--r-- 1 root root       142 Jun 14 18:44 generation_config.json
-rw-r--r-- 1 root root 219639653 Jun 14 18:44 pytorch_model.bin
drwxr-xr-x 3 root root      4096 Jun 14 15:56 runs/
-rw-r--r-- 1 root root      2201 Jun 14 18:44 special_tokens_map.json
-rw-r--r-- 1 root root    791656 Jun 14 18:44 spiece.model
-rw-r--r-- 1 root root      2324 Jun 14 18:44 tokenizer_config.json
-rw-r--r-- 1 root root   2422095 Jun 14 18:44 tokenize
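
The checkpoint folder names are the optimizer step counts at the end of each epoch, since the run used `--save_strategy epoch`. The sketch below shows how those numbers can be reproduced. It rests on assumptions that are not stated in this notebook: a training split of 287,113 examples (the figure from the cnn_dailymail 3.0.0 dataset card), the batch settings from the command above, and the Transformers default of 3 training epochs. `steps_per_epoch` is a hypothetical helper.

```python
import math

TRAIN_EXAMPLES = 287_113  # assumption: cnn_dailymail 3.0.0 train split size


def steps_per_epoch(num_examples, per_device_batch, world_size, grad_accum_steps=1):
    """Optimizer steps per epoch when the last partial batch is kept."""
    batches = math.ceil(num_examples / (per_device_batch * world_size))
    return max(batches // grad_accum_steps, 1)


per_epoch = steps_per_epoch(TRAIN_EXAMPLES, per_device_batch=4, world_size=8)
for epoch in (1, 2, 3):
    print(f"checkpoint-{per_epoch * epoch}")
# checkpoint-8973
# checkpoint-17946
# checkpoint-26919
```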

#### Summarization using the Pipeline
Now we can run summarization with the Hugging Face `pipeline` call and the fine-tuned model. In this case, the pipeline points to the model that we fine-tuned. If you used t5-small for the fine-tuning, be sure to change `model_to_finetune` to "t5-small".

In [11]:
import torch
import habana_frameworks.torch

from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer

# Load model to fine-tune and its tokenizer
model_to_finetune = "t5-3b"
model = AutoModelForSeq2SeqLM.from_pretrained(model_to_finetune)
tokenizer = AutoTokenizer.from_pretrained(model_to_finetune)

# Point to the ft-summarization folder with the fine-tuned model
path_to_local_model = "/root/Gaudi2-Workshop/LLM-Training/optimum-habana/examples/summarization/ft-summarization"

# Instantiate pipeline from local repo, if you did not run the fine tuning step above, you can change: model=model_to_finetune
pipe = pipeline(task="summarization", model=path_to_local_model, device="hpu", torch_dtype=torch.bfloat16)


text_to_summarize = "summarize: Photosynthesis involves a series of complex reactions that take place within specialized organelles called chloroplasts in plant cells. It can be broadly divided into two stages: the light-dependent reactions and the light-independent reactions, also known as the Calvin cycle.  Light-Dependent Reactions: During the light-dependent reactions, chlorophyll pigments within the thylakoid membranes of the chloroplasts absorb light energy. This energy is utilized to split water molecules into oxygen, protons (H+), and electrons. Oxygen is released as a byproduct, while protons and electrons are transported through an electron transport chain, generating ATP (adenosine triphosphate) and NADPH (nicotinamide adenine dinucleotide phosphate).  Light-Independent Reactions (Calvin Cycle):  The ATP and NADPH produced in the light-dependent reactions are utilized in the Calvin cycle, which takes place in the stroma of the chloroplasts. In this cycle, carbon dioxide from the atmosphere combines with the stored energy in the form of ATP and NADPH to produce glucose. This glucose serves as a building block for other carbohydrates and organic compounds. Photosynthesis is a complex process that enables plants, algae, and some bacteria to convert light energy into chemical energy, facilitating the sustenance of life on Earth. It involves the interplay of light-dependent reactions, which generate ATP and NADPH, and the light-independent reactions or the Calvin cycle, which utilize the produced energy to fix carbon dioxide and produce glucose. Enhancing our understanding of photosynthesis and its underlying mechanisms holds the key to various applications, including improving crop yields, developing sustainable bioenergy sources, and addressing environmental challenges."
#text_to_summarize = "summarize: Introduction: The Strategic Arms Limitation Talks II (SALT II) treaty, signed on June 18, 1979, between the United States and the Soviet Union, marked a significant milestone in nuclear arms control efforts during the Cold War era. Building upon its predecessor, SALT I, the treaty aimed to curb the arms race and reduce the risk of nuclear conflict between the superpowers. Key Provisions: SALT II encompassed several crucial provisions. It placed limits on strategic offensive arms, including intercontinental ballistic missiles (ICBMs), submarine-launched ballistic missiles (SLBMs), and heavy bombers. The agreement specified the maximum number of deployed warheads and launchers each party could possess. Verification and Compliance: To ensure compliance, the treaty established comprehensive verification measures. This involved regular exchanges of data, on-site inspections, and monitoring activities by both nations. These measures sought to enhance transparency, foster trust, and prevent either side from gaining a significant advantage in terms of strategic nuclear capabilities. Ratification and Challenges: Although both the United States and the Soviet Union signed the treaty, its ratification faced considerable challenges. The political landscape changed when the Soviet Union invaded Afghanistan in 1979, leading to a deterioration of U.S.-Soviet relations. As a result, the United States never ratified the treaty formally, rendering it non-binding. However, both nations pledged to adhere to its principles, effectively implementing its provisions on a voluntary basis. Legacy and Impact: Despite the treaty's non-ratification, SALT II's legacy and impact were significant. It set the stage for subsequent arms control negotiations, providing a framework for future agreements such as the Intermediate-Range Nuclear Forces (INF) Treaty and the Strategic Arms Reduction Treaty (START). SALT II demonstrated the potential for cooperation between the superpowers and laid the groundwork for continued dialogue aimed at reducing the nuclear threat globally."
print("------------------------------------------------------------")
print("Input:", text_to_summarize)
print()

result = pipe(text_to_summarize)
print("------------------------------------------------------------")
print("Result:", result)



Downloading (…)lve/main/config.json:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/11.4G [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-3b automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
 PT_HPU_LAZY_MODE = 1
 PT_HPU_LAZY_EAGER_OPTIM_CACHE = 1
 PT_HPU_ENABLE_COMPILE_THREAD = 0
 PT_HPU_ENABLE_EXECUTION_THREAD = 1
 PT_HPU_ENABLE_LAZY_EAGER_EXECUTION_THREAD = 1
 PT_ENABLE_INTER_HOST_CACHING = 0
 PT_ENABLE_INFERENCE_MODE = 1
 PT_ENABLE_HABANA_CACHING = 1
 PT_HPU_MAX_RECIPE_SUBMISSION_LIMIT = 0
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_MAX_COMPOUND_OP_SIZE_SS = 10
 PT_HPU_ENABLE_STAGE_SUBMISSION = 1
 PT_HPU_STAGE_SUBMISSION_MODE = 2
 PT_HPU_PGM_ENABLE_CACHE = 1
 PT_HPU_ENABLE_LAZY_COLLECTIVES = 0
 PT_HCCL_SLICE_SIZE_MB = 16
 PT_HCCL_MEMORY_ALLOWANCE_MB = 0
 PT_HPU_

------------------------------------------------------------
Input: Photosynthesis involves a series of complex reactions that take place within specialized organelles called chloroplasts in plant cells. It can be broadly divided into two stages: the light-dependent reactions and the light-independent reactions, also known as the Calvin cycle.  Light-Dependent Reactions: During the light-dependent reactions, chlorophyll pigments within the thylakoid membranes of the chloroplasts absorb light energy. This energy is utilized to split water molecules into oxygen, protons (H+), and electrons. Oxygen is released as a byproduct, while protons and electrons are transported through an electron transport chain, generating ATP (adenosine triphosphate) and NADPH (nicotinamide adenine dinucleotide phosphate).  Light-Independent Reactions (Calvin Cycle):  The ATP and NADPH produced in the light-dependent reactions are utilized in the Calvin cycle, which takes place in the stroma of the chloroplasts



------------------------------------------------------------
Result: [{'summary_text': 'Photosynthesis is a complex process that enables plants, algae, and some bacteria to convert light energy into chemical energy . It involves the interplay of light-dependent reactions, which generate ATP and NADPH, and the light-independent reactions or the Calvin cycle . Enhancing our understanding of photosynthesis holds the key to various applications, including improving crop yields and developing sustainable bioenergy sources .'}]
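
The pipeline returns a list with one dictionary per input, and the summary is stored under the `summary_text` key. The output above also shows a space before each sentence-ending period. The following is a small sketch of how the text could be extracted and tidied; `tidy_summary` is a hypothetical helper, and the spacing fix is purely cosmetic.

```python
import re


def tidy_summary(result):
    """Extract the summary string and remove spaces before punctuation."""
    text = result[0]["summary_text"]
    return re.sub(r"\s+([.,;:!?])", r"\1", text).strip()


# One sentence copied from the result above, used as sample input.
sample = [{"summary_text": "It involves the interplay of light-dependent reactions, "
                           "which generate ATP and NADPH, and the light-independent "
                           "reactions or the Calvin cycle ."}]
print(tidy_summary(sample))
```

In the notebook, `tidy_summary(result)` could be called directly on the `result` variable from the cell above.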


In [12]:
# To run additional inference examples, the Jupyter notebook kernel must be restarted. This `exit()` command restarts the kernel and allows another inference run.
exit()