# DistilBART-CNN-12-6 - Text Summarization with Amazon SageMaker and Hugging Face

- [Text summarization with Amazon SageMaker and Hugging Face](https://aws.amazon.com/blogs/machine-learning/text-summarization-with-amazon-sagemaker-and-hugging-face/)

**Settings:**
- PyTorch `1.10`, Python `3.8` CPU Optimized

**Versions:**
- transformers `4.24.0`
- torch `1.10.2+cpu`
- Python `3.8.10`

In [None]:
!pip install transformers

In [None]:
!pip install -U sagemaker

In [None]:
!pip install -U boto3 awswrangler

In [5]:
import pandas as pd
import sagemaker
import transformers
from transformers import pipeline
import awswrangler as wr

In [6]:
print(transformers.__version__)

4.24.0


In [7]:
!pip3 show torch

Name: torch
Version: 1.10.2+cpu
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /opt/conda/lib/python3.8/site-packages
Requires: typing-extensions
Required-by: awsio, torchvision


In [8]:
!python3 -V

Python 3.8.10


___
## Import data

In [None]:
role = sagemaker.get_execution_role()
data_location = 's3://datasets/text_summarization/corpus/corpus.csv'

# pd.read_csv already returns a DataFrame
df = pd.read_csv(data_location)

In [None]:
print(df.shape)
df.head()

## Hugging Face summarization pipeline

In [None]:
summarizer = pipeline("summarization")
summarizer("An apple a day, keeps the doctor away", min_length=5, max_length=20)

## SageMaker endpoint with pre-trained model

In [11]:
from sagemaker.huggingface import HuggingFaceModel
from sagemaker import get_execution_role

In [12]:
role = get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {
  'HF_MODEL_ID':'sshleifer/distilbart-cnn-12-6',
  'HF_TASK':'summarization'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(initial_instance_count=1,instance_type="ml.m5.xlarge")

Exception ignored in: <function tqdm.__del__ at 0x7f1e69c10670>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1147, in __del__
    self.close()
  File "/opt/conda/lib/python3.8/site-packages/tqdm/notebook.py", line 286, in close
    self.disp(bar_style='danger', check_delay=False)
AttributeError: 'tqdm' object has no attribute 'disp'


------!

In [None]:
%%time

# example request, you need to define "inputs"
data = {
   "inputs": "Hawaii comprises nearly the entire Hawaiian archipelago, 137 volcanic islands spanning 1,500 miles (2,400 km) that are physiographically and ethnologically part of the Polynesian subregion of Oceania. The state's ocean coastline is consequently the fourth-longest in the U.S., at about 750 miles (1,210 km).[b] The eight main islands, from northwest to southeast, are Niʻihau, Kauaʻi, Oʻahu, Molokaʻi, Lānaʻi, Kahoʻolawe, Maui, and Hawaiʻi—the last of these, after which the state is named, is often called the \"Big Island\" or \"Hawaii Island\" to avoid confusion with the state or archipelago. The uninhabited Northwestern Hawaiian Islands make up most of the Papahānaumokuākea Marine National Monument, the United States' largest protected area and the fourth-largest in the world. Of the 50 U.S. states, Hawaii is the eighth-smallest in land area and the 11th-least populous, but with 1.4 million residents ranks 13th in population density. Two-thirds of the population lives on O'ahu, home to the state's capital and largest city, Honolulu. Hawaii is among the country's most diverse states, owing to its central location in the Pacific and over two centuries of migration. As one of only six majority-minority states, it has the country's only Asian American plurality, its largest Buddhist community, and the largest proportion of multiracial people. Consequently, it is a unique melting pot of North American and East Asian cultures, in addition to its indigenous Hawaiian heritage. Settled by Polynesians some time between 1000 and 1200 CE, Hawaii was home to numerous independent chiefdoms. In 1778, British explorer James Cook was the first known non-Polynesian to arrive at the archipelago; early British influence is reflected in the state flag, which bears a Union Jack. An influx of European and American explorers, traders, and whalers arrived shortly after leading to the decimation of the once isolated Indigenous community by introducing diseases such as syphilis, gonorrhea, tuberculosis, smallpox, measles, leprosy, and typhoid fever, reducing the native Hawaiian population from between 300,000 and one million to less than 40,000 by 1890. Hawaii became a unified, internationally recognized kingdom in 1810, remaining independent until American and European businessmen overthrew the monarchy in 1893; this led to annexation by the U.S. in 1898. As a strategically valuable U.S. territory, Hawaii was attacked by Japan on December 7, 1941, which brought it global and historical significance, and contributed to America's decisive entry into World War II. Hawaii is the most recent state to join the union, on August 21, 1959. In 1993, the U.S. government formally apologized for its role in the overthrow of Hawaii's government, which spurred the Hawaiian sovereignty movement.",
    "parameters": {
        'padding': 'max_length',
        'truncation': True}
}


# request
predictor.predict(data)
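
The request above is a plain dictionary with an `inputs` string and an optional `parameters` mapping. A minimal sketch of a helper that assembles such a payload could look like the following. `build_payload` is a hypothetical name, not part of the SageMaker SDK, and `min_length`/`max_length` are the same arguments passed to the local pipeline earlier; whether the endpoint honors every parameter depends on the container version, so that part is an assumption to verify.

```python
from typing import Any, Dict, Optional


def build_payload(text: str,
                  truncation: bool = True,
                  min_length: Optional[int] = None,
                  max_length: Optional[int] = None) -> Dict[str, Any]:
    """Build a request dict of the form {"inputs": ..., "parameters": {...}}.

    Hypothetical helper: it only assembles the dictionary and does not
    call the endpoint.
    """
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    if min_length is not None and max_length is not None and min_length > max_length:
        raise ValueError("min_length must not exceed max_length")
    parameters: Dict[str, Any] = {"truncation": truncation}
    # Only include the length bounds that were explicitly requested.
    if min_length is not None:
        parameters["min_length"] = min_length
    if max_length is not None:
        parameters["max_length"] = max_length
    return {"inputs": text, "parameters": parameters}


payload = build_payload("An apple a day, keeps the doctor away", min_length=5, max_length=20)
print(payload)
```

The resulting dictionary can be passed to `predictor.predict(...)` in place of the hand-written `data`.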

In [None]:
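%%time

summarized = pd.DataFrame()
output_path = 's3://datasets/text_summarization/summarized/DistilBART-CNN-12-6/DistilBART-CNN-12-6(pre-trained).csv'

# iterate over every row instead of hard-coding the row count (7518 in the original run)
for i in range(len(df)):
    data = {
        "inputs": df["TEXT"][i],
        "parameters": {'truncation': True}
    }
    summarized.at[i, "TEXT"] = df["TEXT"][i]
    summarized.at[i, "SUMMARIZED"] = predictor.predict(data)[0]['summary_text']
    # checkpoint the partial results to S3 every 10 rows
    if i % 10 == 0:
        wr.s3.to_csv(df=summarized, path=output_path)
        print(i)

# final write so the rows after the last checkpoint are saved as well
wr.s3.to_csv(df=summarized, path=output_path)

summarized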
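
The loop above calls the endpoint once per row and writes partial results to S3 at intervals. The same pattern can be sketched with the endpoint call and the save step injected as functions, which allows the logic to be tried without a deployed endpoint. `summarize_all`, `predict_fn`, and `save_fn` are hypothetical names; the stub predictor only mimics the `[{'summary_text': ...}]` response shape seen in this notebook.

```python
from typing import Callable, Dict, List, Sequence


def summarize_all(texts: Sequence[str],
                  predict_fn: Callable[[dict], List[Dict[str, str]]],
                  save_fn: Callable[[List[Dict[str, str]]], None],
                  checkpoint_every: int = 10) -> List[Dict[str, str]]:
    """Summarize each text, saving partial results every `checkpoint_every` rows."""
    rows: List[Dict[str, str]] = []
    for i, text in enumerate(texts):
        response = predict_fn({"inputs": text, "parameters": {"truncation": True}})
        rows.append({"TEXT": text, "SUMMARIZED": response[0]["summary_text"]})
        if (i + 1) % checkpoint_every == 0:
            save_fn(rows)
    # Final save so rows after the last checkpoint are not lost.
    save_fn(rows)
    return rows


# Stub standing in for predictor.predict: returns the first five words.
def fake_predict(payload: dict) -> List[Dict[str, str]]:
    return [{"summary_text": " ".join(payload["inputs"].split()[:5])}]


saved_sizes: List[int] = []
results = summarize_all(
    ["one two three four five six seven"] * 25,
    predict_fn=fake_predict,
    save_fn=lambda rows: saved_sizes.append(len(rows)),
)
print(len(results), saved_sizes)
```

With a real endpoint, `predict_fn` would be `predictor.predict` and `save_fn` would wrap a call such as `wr.s3.to_csv(df=pd.DataFrame(rows), path=...)`.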

## SageMaker endpoint with a trained model

In [19]:
from sagemaker.huggingface import HuggingFaceModel
from sagemaker import get_execution_role

role = get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {
  'HF_MODEL_ID':'sshleifer/distilbart-cnn-12-6',
  'HF_TASK':'summarization'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    model_data='s3://datasets/text_summarization/trained/',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor_trained = huggingface_model.deploy(initial_instance_count=1,instance_type="ml.m5.xlarge")

-----!

In [None]:
%%time

# example request, you need to define "inputs"
data = {
   "inputs": "Hawaii comprises nearly the entire Hawaiian archipelago, 137 volcanic islands spanning 1,500 miles (2,400 km) that are physiographically and ethnologically part of the Polynesian subregion of Oceania. The state's ocean coastline is consequently the fourth-longest in the U.S., at about 750 miles (1,210 km).[b] The eight main islands, from northwest to southeast, are Niʻihau, Kauaʻi, Oʻahu, Molokaʻi, Lānaʻi, Kahoʻolawe, Maui, and Hawaiʻi—the last of these, after which the state is named, is often called the \"Big Island\" or \"Hawaii Island\" to avoid confusion with the state or archipelago. The uninhabited Northwestern Hawaiian Islands make up most of the Papahānaumokuākea Marine National Monument, the United States' largest protected area and the fourth-largest in the world. Of the 50 U.S. states, Hawaii is the eighth-smallest in land area and the 11th-least populous, but with 1.4 million residents ranks 13th in population density. Two-thirds of the population lives on O'ahu, home to the state's capital and largest city, Honolulu. Hawaii is among the country's most diverse states, owing to its central location in the Pacific and over two centuries of migration. As one of only six majority-minority states, it has the country's only Asian American plurality, its largest Buddhist community, and the largest proportion of multiracial people. Consequently, it is a unique melting pot of North American and East Asian cultures, in addition to its indigenous Hawaiian heritage. Settled by Polynesians some time between 1000 and 1200 CE, Hawaii was home to numerous independent chiefdoms. In 1778, British explorer James Cook was the first known non-Polynesian to arrive at the archipelago; early British influence is reflected in the state flag, which bears a Union Jack. An influx of European and American explorers, traders, and whalers arrived shortly after leading to the decimation of the once isolated Indigenous community by introducing diseases such as syphilis, gonorrhea, tuberculosis, smallpox, measles, leprosy, and typhoid fever, reducing the native Hawaiian population from between 300,000 and one million to less than 40,000 by 1890. Hawaii became a unified, internationally recognized kingdom in 1810, remaining independent until American and European businessmen overthrew the monarchy in 1893; this led to annexation by the U.S. in 1898. As a strategically valuable U.S. territory, Hawaii was attacked by Japan on December 7, 1941, which brought it global and historical significance, and contributed to America's decisive entry into World War II. Hawaii is the most recent state to join the union, on August 21, 1959. In 1993, the U.S. government formally apologized for its role in the overthrow of Hawaii's government, which spurred the Hawaiian sovereignty movement."
}


# request
predictor_trained.predict(data)

## Load the Hugging Face model to SageMaker for text summarization inference

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

PRE_TRAINED_MODEL_NAME='sshleifer/distilbart-cnn-12-6'

# download the pre-trained weights from the Hugging Face Hub
model = BartForConditionalGeneration.from_pretrained(PRE_TRAINED_MODEL_NAME)
model.save_pretrained('./models/bart_model/')

tokenizer = BartTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
tokenizer.save_pretrained('./models/bart_tokenizer/')

In [13]:
from sagemaker.s3 import S3Uploader

# package the saved model and tokenizer; no code/ directory was created above, so it is not listed
! tar -C models/ -czf model.tar.gz bart_tokenizer/ bart_model/

file_key = 'model.tar.gz'
model_artifact = S3Uploader.upload(file_key,'s3://datasets/text_summarization/artifacts')
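
A `tar` command exits with an error if any listed directory is missing, such as a `code/` directory that was never created. A minimal sketch of the same packaging step in Python, which archives only the directories that actually exist, might look like this. `make_model_archive` is a hypothetical helper, and the demo uses a temporary directory with placeholder files instead of real model weights.

```python
import os
import tarfile
import tempfile
from typing import Iterable, List


def make_model_archive(base_dir: str, subdirs: Iterable[str], out_path: str) -> List[str]:
    """Create a gzip tar of the existing subdirs of base_dir; return those added."""
    added: List[str] = []
    with tarfile.open(out_path, "w:gz") as tar:
        for name in subdirs:
            full = os.path.join(base_dir, name)
            if os.path.isdir(full):
                # arcname keeps paths relative to base_dir, like `tar -C base_dir`.
                tar.add(full, arcname=name)
                added.append(name)
    return added


# Demo with placeholder files (not real model weights).
with tempfile.TemporaryDirectory() as tmp:
    models = os.path.join(tmp, "models")
    for d in ("bart_model", "bart_tokenizer"):
        os.makedirs(os.path.join(models, d))
        with open(os.path.join(models, d, "placeholder.txt"), "w") as f:
            f.write("demo")
    archive = os.path.join(tmp, "model.tar.gz")
    added = make_model_archive(models, ["code", "bart_tokenizer", "bart_model"], archive)
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())

print(added)
print(members)
```

Skipping a missing directory silently is a design choice for this sketch; raising an error instead may be preferable when a custom inference script is required.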


In [14]:
from sagemaker.image_uris import retrieve

deploy_instance_type = 'ml.m5.xlarge'

pytorch_inference_image_uri = retrieve('huggingface',
                                       region='us-east-1',
                                       version='4.6.1',
                                       instance_type=deploy_instance_type,
                                       base_framework_version='pytorch1.8.1',
                                       image_scope='inference')

In [15]:
from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker import get_execution_role

role = get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {
  'HF_MODEL_ID':'sshleifer/distilbart-cnn-12-6',
  'HF_TASK':'summarization'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data="s3://datasets/text_summarization/artifacts/model.tar.gz", # path to your trained sagemaker model
    image_uri=pytorch_inference_image_uri,
    env=hub,
    role=role, # iam role with permissions to create an Endpoint
    transformers_version="4.6.1", # transformers version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1, 
   instance_type="ml.m5.xlarge"
)

Exception ignored in: <function tqdm.__del__ at 0x7febc810a0d0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1147, in __del__
    self.close()
  File "/opt/conda/lib/python3.8/site-packages/tqdm/notebook.py", line 286, in close
    self.disp(bar_style='danger', check_delay=False)
AttributeError: 'tqdm' object has no attribute 'disp'


-----!

In [31]:
%%time

# example request, you need to define "inputs"

data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 from landline. Delivery within 28 days."
}

# request
predictor.predict(data)

CPU times: user 270 µs, sys: 3.57 ms, total: 3.84 ms
Wall time: 6.49 s


[{'summary_text': ' SiPix Digital Camera! Call 09061221066 from landline. Delivery within 28 days. Camera - You are awarded a SiPIX Digital Camera. call 0906121066. Call 0800 555 555 111 or visit www.siPix.com/SiPix .'}]
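
In the response above, the returned text is longer than the input and repeats parts of it. A rough length check can flag such cases for review. The sketch below is one hypothetical way to do this; `compression_ratio` and the `1.0` threshold are illustrative choices, not something the endpoint provides.

```python
def compression_ratio(source: str, summary: str) -> float:
    """Return len(summary) / len(source), measured in whitespace-separated words."""
    source_words = source.split()
    if not source_words:
        raise ValueError("source must contain at least one word")
    return len(summary.split()) / len(source_words)


source = ("Camera - You are awarded a SiPix Digital Camera! call 09061221066 "
          "from landline. Delivery within 28 days.")
response = [{'summary_text': ' SiPix Digital Camera! Call 09061221066 from landline. '
                             'Delivery within 28 days. Camera - You are awarded a SiPIX '
                             'Digital Camera. call 0906121066. Call 0800 555 555 111 or '
                             'visit www.siPix.com/SiPix .'}]

ratio = compression_ratio(source, response[0]["summary_text"])
# A ratio above 1.0 means the "summary" is longer than its input.
needs_review = ratio > 1.0
print(needs_review)
```

For very short inputs like this one, bounding the output with generation parameters such as `max_length` in the request may be worth trying.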