Sections of this code have been adapted from a Kaggle notebook. As of January 06, 2024, efforts are underway to obtain the necessary permissions and provide proper attribution for the referenced notebook.

# Imports and general setup


In [2]:
! pip install simplet5 datasets transformers rouge_score nltk -q


In [3]:
import os
import random
import re
import string

import datasets
import nltk
import numpy as np
import pandas as pd
import torch
from simplet5 import SimpleT5
from sklearn.model_selection import train_test_split

# ROUGE metric, used by Train.compute_metrics
metric = datasets.load_metric("rouge")

INFO:pytorch_lightning.utilities.seed:Global seed set to 42
  metric = datasets.load_metric("rouge")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



# Model setup

## Dataset


In [4]:
class Settings:
    MODEL_TYPE = "t5"
    MODEL_NAME = "t5-base"

    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # path to the training data file (tab-separated, see Preprocess.preprocess_data)
    TRAIN_DATA = "/content/bbc-news-data-tech-summarized.csv"

    Columns = ['summary', 'fulltext']

    # True when a CUDA device is available
    USE_GPU = DEVICE.type == "cuda"

    EPOCHS = 6

    encoding = 'latin-1'
    columns_dict = {"summary": "target_text", "fulltext": "source_text"}
    df_column_list = ['source_text', 'target_text']
    SUMMARIZE_KEY = "summarize: "
    SOURCE_TEXT_KEY = 'source_text'
    TEST_SIZE = 0.2
    BATCH_SIZE = 8
    source_max_token_len = 512
    target_max_token_len = 128
    train_df_len = 5000
    test_df_len = 100
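
To see how `TEST_SIZE`, `train_df_len`, and `test_df_len` interact, the sketch below computes how many rows would actually reach training and evaluation. It assumes scikit-learn's behaviour of rounding the test share up when `test_size` is a float, followed by the `[:train_df_len]` and `[:test_df_len]` slices used in `Train.train`. The helper name `effective_split_sizes` and the row counts are hypothetical and not part of the notebook.

```python
import math


def effective_split_sizes(n_rows, test_size=0.2, train_cap=5000, eval_cap=100):
    """Return (train_rows, eval_rows) after the split and the length caps.

    Hypothetical helper: mirrors a float test_size split (test share rounded
    up) followed by slicing with [:train_cap] and [:eval_cap].
    """
    n_test = math.ceil(n_rows * test_size)
    n_train = n_rows - n_test
    return min(n_train, train_cap), min(n_test, eval_cap)


# Illustrative row counts only, not the size of the actual dataset.
print(effective_split_sizes(401))
print(effective_split_sizes(26004))
```

For a small dataset the caps have no effect, so every row of the split is used; the caps only matter once the split produces more than 5000 training rows or 100 evaluation rows.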


In [5]:
class Preprocess:
    def __init__(self):
        self.settings = Settings

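    def clean_text(self, text):
        """Lowercase text and strip brackets, URLs, HTML tags, punctuation, newlines, and digit-bearing words."""
        text = text.lower()
        text = re.sub(r'\[.*?\]', '', text)  # bracketed spans
        text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
        text = re.sub(r'<.*?>+', '', text)  # HTML tags
        text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
        text = re.sub(r'\n', '', text)  # newlines
        text = re.sub(r'\w*\d\w*', '', text)  # words containing digits
        return text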

    def preprocess_data(self, data_path):
        # The file is tab-separated despite its .csv extension, so pass sep='\t'.
        # For a comma-separated file, drop the sep argument.
        df = pd.read_csv(data_path, sep='\t', encoding=self.settings.encoding, usecols=self.settings.Columns)

        # Explicitly set data types for text columns
        df['summary'] = df['summary'].astype(str)
        df['fulltext'] = df['fulltext'].astype(str)

        # simpleT5 expects dataframe to have 2 columns: "source_text" and "target_text"
        df = df.rename(columns=self.settings.columns_dict)
        df = df[self.settings.df_column_list]
        # T5 model expects a task related prefix: since it is a summarization task, we will add a prefix "summarize: "
        df[self.settings.SOURCE_TEXT_KEY] = self.settings.SUMMARIZE_KEY + df[self.settings.SOURCE_TEXT_KEY]

        return df
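
`clean_text` is defined above but never called by `preprocess_data`. The standalone sketch below copies the same regex steps into a free function so their combined effect can be seen on a made-up sentence. The sample string is an assumption for illustration only. Note that removing `\n` without a replacement space fuses the words on either side of the line break.

```python
import re
import string


def clean_text(text: str) -> str:
    """Free-function copy of Preprocess.clean_text, for illustration."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                # bracketed spans
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)                     # newlines (no space inserted)
    text = re.sub(r'\w*\d\w*', '', text)               # words containing digits
    return text


# Made-up sample text, not taken from the dataset.
sample = "Google [GOOGL.O] said <b>1%</b> of users\nsee https://example.com now!"
cleaned = clean_text(sample)
print(cleaned.split())
```

If the function were applied to the training text, replacing newlines with a space instead of an empty string would probably be the safer choice.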

In [6]:
class T5Model:
    def __init__(self, model_type, model_name):
        self.model = SimpleT5()
        self.model.from_pretrained(model_type=model_type,
                                   model_name=model_name)

    def load_model(self, model_type, model_path, use_gpu: bool):
        try:
            self.model.load_model(
                model_type=model_type,
                model_dir=model_path,
                use_gpu=use_gpu
            )

        except Exception as ex:
            print("error occurred while loading model ", str(ex))

In [7]:
class Train:
    def __init__(self):
        # initialize required class
        self.settings = Settings
        self.preprocess = Preprocess()

        # initialize required variables
        self.t5_model = None

    def __initialize(self):
        try:
            self.t5_model = T5Model(model_name=self.settings.MODEL_NAME,
                                    model_type=self.settings.MODEL_TYPE)

        except Exception as ex:
            print("error occurred while loading model ", str(ex))

    def set_seed(self, seed_value=42):
        random.seed(seed_value)
        np.random.seed(seed_value)
        torch.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)

    def train(self, df):
        try:
            train_df, test_df = train_test_split(df, test_size=self.settings.TEST_SIZE)

            self.t5_model.model.train(train_df=train_df[:self.settings.train_df_len],
                                      eval_df=test_df[:self.settings.test_df_len],
                                      source_max_token_len=self.settings.source_max_token_len,
                                      target_max_token_len=self.settings.target_max_token_len,
                                      batch_size=self.settings.BATCH_SIZE, max_epochs=self.settings.EPOCHS,
                                      use_gpu=self.settings.USE_GPU)

        except Exception as ex:
            print("error occurred while training model ", str(ex))

    def run(self):
        try:
            print("Loading and Preparing the Dataset-----!! ")
            df = self.preprocess.preprocess_data(self.settings.TRAIN_DATA)
            print(df.head())
            print("Dataset Successfully Loaded and Prepared-----!! ")
            print("Loading and Initializing the T5 Model -----!! ")
            self.__initialize()
            print("Model Successfully Loaded and Initialized-----!! ")

            print("------------------Starting Training-----------!!")
            self.set_seed()
            self.train(df)
            print("Training complete-----!!!")

        except Exception as ex:
            print("Following Exception Occurred---!! ", str(ex))


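    def compute_metrics(self, eval_pred):
        # eval_pred is a (predictions, labels) pair of token-id arrays.
        # Assumes the model has been initialized so SimpleT5's tokenizer is available.
        tokenizer = self.t5_model.model.tokenizer
        predictions, labels = eval_pred
        decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)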

        # Replace -100 in the labels as we can't decode them.
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        # Rouge expects a newline after each sentence
        decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip()))
                          for pred in decoded_preds]
        decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip()))
                          for label in decoded_labels]

        # Compute ROUGE scores
        result = metric.compute(predictions=decoded_preds, references=decoded_labels,
                                use_stemmer=True)

        # Extract ROUGE f1 scores
        result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

        # Add mean generated length to metrics
        prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id)
                          for pred in predictions]
        result["gen_len"] = np.mean(prediction_lens)

        return {k: round(v, 4) for k, v in result.items()}
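
To make the ROUGE F-measure extracted in `compute_metrics` more concrete, here is a minimal sketch of a ROUGE-1 style score based on unigram overlap. It is a simplification and not the `rouge_score` implementation: it splits on whitespace, applies no stemming, and skips ROUGE-2, ROUGE-L, and the aggregation that produces the `mid` value. The function name `rouge1_f1` and the sample sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up sentences for illustration.
score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(round(score * 100, 4))
```

Here precision is 3/3 and recall is 3/6, so the F1 is 2/3. `compute_metrics` multiplies the corresponding library scores by 100 in the same way.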

In [8]:
print(Settings.USE_GPU)
print(Settings.DEVICE)

True
cuda


In [9]:
t = Train()
t.run()

Loading and Preparing the Dataset-----!! 
                                         source_text  \
0  summarize:  The Kyrgyz Republic, a small, moun...   
1  summarize:  Chinese authorities closed 12,575 ...   
2  summarize:  Microsoft is investigating a troja...   
3  summarize:  Nicholas Negroponte, chairman and ...   
4  summarize:  The hi-tech and the arts worlds ha...   

                                         target_text  
0  Kyrgyz Republic uses invisible ink and UV read...  
1  Chinese government closes 12,575 net cafes for...  
2  Microsoft is investigating a trojan program ca...  
3  MIT's Media Labs founder, Nicholas Negroponte,...  
4  UK telco BT has launched its Connected World i...  
Dataset Successfully Loaded and Prepared-----!! 
Loading and Initializing the T5 Model -----!! 



INFO:pytorch_lightning.utilities.distributed:GPU available: True, used: True
INFO:pytorch_lightning.utilities.distributed:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.distributed:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Model Successfully Loaded and Initialized-----!! 
------------------Starting Training-----------!!


INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)


INFO:pytorch_lightning.utilities.seed:Global seed set to 42

Training complete-----!!!


# Evaluation

In [10]:
t5_model = T5Model(model_name=Settings.MODEL_NAME,model_type=Settings.MODEL_TYPE)

In [11]:
## Load the trained T5 checkpoint. The glob below picks the directory for epoch
## index 5 (the last of EPOCHS = 6), taken here as the lowest-loss checkpoint.
## Reloading from disk is a workaround for how HuggingFace currently behaves.
## For context, see https://github.com/Shivanandroy/simpleT5/issues/7

from pathlib import Path
t5_model.load_model("t5", next(Path("outputs").glob("*epoch-5*")), use_gpu=False)
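
The glob above hard-codes the epoch index. As an alternative, the sketch below picks the checkpoint with the lowest validation loss by parsing directory names. It assumes the names contain a `val-loss-<number>` segment, as simpleT5 checkpoint directories typically do; the helper name `pick_lowest_val_loss` and the example names are hypothetical.

```python
import re
from typing import Iterable, Optional

# Assumed naming pattern: "...-val-loss-<float>" somewhere in the directory name.
VAL_LOSS_PATTERN = re.compile(r"val-loss-(\d+(?:\.\d+)?)")


def pick_lowest_val_loss(checkpoint_names: Iterable[str]) -> Optional[str]:
    """Return the name with the smallest parsed validation loss, or None."""
    best_name, best_loss = None, float("inf")
    for name in checkpoint_names:
        match = VAL_LOSS_PATTERN.search(name)
        if match is None:
            continue  # skip entries that do not follow the assumed pattern
        loss = float(match.group(1))
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name


# Hypothetical directory names, not actual training output.
names = [
    "simplet5-epoch-3-train-loss-0.9-val-loss-1.25",
    "simplet5-epoch-4-train-loss-0.8-val-loss-1.19",
    "simplet5-epoch-5-train-loss-0.7-val-loss-1.22",
]
print(pick_lowest_val_loss(names))
```

With real checkpoints, the names could come from `[p.name for p in Path("outputs").iterdir()]`, and the selected directory would then be passed to `t5_model.load_model`.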

In [12]:
# Note: the model was trained with the "summarize: " prefix, so prepend it to any text you want to summarize.

#https://www.reuters.com/technology/google-test-new-feature-limiting-advertisers-use-browser-tracking-cookies-2023-12-14/
text_to_summarize="""summarize: Alphabet's Google (GOOGL.O) said on Thursday it will begin testing a new feature on its Chrome browser as part of a plan to ban third-party cookies that advertisers use to track consumers.

The search giant is set to roll out the feature, called Tracking Protection, on Jan. 4 to 1% of Chrome users globally, that will restrict cross-site tracking by default.

Google plans to completely phase out the use of third-party cookies for users in the second half of 2024.

The timeline, however, is subject to addressing antitrust concerns raised by UK's Competition and Markets Authority (CMA), Google said.

The CMA has been investigating Google's plan to cut support for some cookies in Chrome, because the watchdog is worried it will impede competition in digital advertising, as well as keeping an eye on the company's biggest moneymaking segment, advertising.

Cookies are special files that allow websites and advertisers to identify individual web surfers and track their browsing habits.

The European Union antitrust chief Margrethe Vestager also said in June that the agency's investigations into Google's introduction of tools to block third-party cookies - part of the company's "Privacy Sandbox" initiative - would continue.

Advertisers have said the loss of cookies in the world's most popular browser will limit their ability to collect information for personalizing ads and make them dependent on Google's user databases.

Brokerage BofA Global Research said in a note on Thursday that phasing out of cookies will give more power to media agencies, especially those that are capable of providing proprietary insights at scale to advertisers.
"""
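
Since the prefix has to be typed by hand in the cell above, a small helper can add it only when it is missing. This is a sketch, not part of the original notebook: the name `with_summarize_prefix` is hypothetical, and collapsing whitespace and blank lines into single spaces is an optional assumption rather than something the model requires.

```python
SUMMARIZE_KEY = "summarize: "  # same value as Settings.SUMMARIZE_KEY


def with_summarize_prefix(text: str, prefix: str = SUMMARIZE_KEY) -> str:
    """Collapse runs of whitespace and prepend the task prefix if it is absent."""
    text = " ".join(text.split())
    if text.startswith(prefix):
        return text
    return prefix + text


# Made-up inputs for illustration.
print(with_summarize_prefix("Google will test\n\na new Chrome feature."))
print(with_summarize_prefix("summarize: Already prefixed text."))
```

The result can be passed to `t5_model.model.predict` in place of a manually prefixed string.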

In [13]:
t5_model.model.predict(text_to_summarize)

["Google will begin testing a new feature called Tracking Protection on its Chrome browser on Jan. 4 to 1% of Chrome users globally, restricting cross-site tracking by default. Google plans to completely phase out the use of third-party cookies for users in the second half of 2024. The timeline is subject to antitrust concerns raised by UK's Competition and Markets Authority (CMA). The CMA has been investigating Google's plan to cut support for some cookies in Chrome, as it fears it will hinder competition in digital advertising and keep eye on the company's biggest money segment, advertising."]