<a href="https://colab.research.google.com/github/Myusuf2/Final_MIS-64061_myusuf2/blob/main/Mukhtar_Simple_hp_tune_BART_BASE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Using the transformer architecture for text summarization tasks: BART**

Begin by installing the transformers library (from Hugging Face), which we will use with TensorFlow/Keras, and then load the libraries and their dependencies.

In [None]:
%%capture
!pip install -q transformers

In [None]:
%%capture
!pip install datasets

In [None]:
%%capture
!pip install rouge_score

In [None]:
%%capture
!pip install "ray[tune]"

In [None]:
import pandas as pd
import numpy as np
import os
import re
import math
import tensorflow as tf
from tensorflow import keras



import transformers
from datasets import list_datasets, load_dataset, load_metric
from pprint import pprint
import rouge_score


# seed
seed = 42
from tensorflow.random import set_seed
from numpy import random
set_seed(seed)
random.seed(seed)

*Data: National Debate Topic for High Schools, 2013-2019.*

Check whether the DebateSum data is available through the datasets library.

In [None]:
# datasets = list_datasets()
# pprint(datasets[100:200] + [f"{len(datasets) - 100} more..."], compact = True) # sadly no

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


DebateSum could not be loaded directly through the datasets library, so we download the data from https://huggingface.co/datasets/Hellisotherpeople/DebateSum, save it to Google Drive, and load it from there.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/debate2019.csv', encoding = 'latin-1')
df.head()

Unnamed: 0,Full-Document,Citation,Extract,Abstract,#CharsDocument,#CharsAbstract,#CharsExtract,#WordsDocument,#WordsAbstract,#WordsExtract,...,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98,Unnamed: 99,Unnamed: 100,Unnamed: 101,Unnamed: 102,Unnamed: 103
0,The Trump Administration has been quietly fund...,Hunt 18 Edward Hunt writes about war and empir...,The Trump Administration has been quietly fund...,This file was produced by the following studen...,5041,147,2106,788,25,326,...,,,,,,,,,,
1,The border between the United States and Mexic...,"Monzo et al 17. Lilia D. MonzoÌ, associate pr...",The border between the United States and Mexic...,Imperialism in Mexico is not just a one-off in...,9849,431,4481,1559,71,693,...,,,,,,,,,,
2,Today we face a planetary crisis. Environmenta...,Helland and Lindgren 16 Leonardo E. Figueroa H...,Today we face a planetary crisis. Environmenta...,The will of dominion over Mexico is supplanted...,20340,698,8956,2660,109,1200,...,,,,,,,,,,
3,"âThey talk to me about progress, about âac...","Lystrup 15. Lauren; University of California, ...","They talk about progress, achievements,â dis...",Death is not a symptom or consequence of moder...,8774,425,4818,1272,63,701,...,,,,,,,,,,
4,The Zapatista movement has garnered much atten...,"Lystrup 15. Lauren, University of California, ...",The Zapatista movement garnered attention in t...,Plan: The United States federal government sho...,6955,161,3522,1023,23,510,...,,,,,,,,,,


Only the core text (Extract) and the abstract summaries (Abstract) are needed, so we drop the other named columns to save memory.

In [None]:
list(df.columns)

df = df.drop(['Citation','Full-Document','#CharsDocument','#CharsAbstract','#CharsExtract',
              '#WordsDocument','#WordsAbstract','#WordsExtract','AbsCompressionRatio',
              'ExtCompressionRatio','OriginalDebateFileName'], axis = 1)


The purpose of this project is to train a model that summarizes the core debate texts with respectable accuracy, in this case relative to each abstract summary.

Hugging Face hosts many pretrained transformers that a text summarizer can be built on; the full list is at https://huggingface.co/models

This notebook uses BartTokenizerFast and TFBartForConditionalGeneration.



In [None]:
from transformers import BartTokenizerFast, TFBartForConditionalGeneration

tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-base') # case sensitive
tfbartmodel = TFBartForConditionalGeneration.from_pretrained('facebook/bart-base') 

All model checkpoint layers were used when initializing TFBartForConditionalGeneration.

All the layers of TFBartForConditionalGeneration were initialized from the model checkpoint at facebook/bart-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.


I began with the following configuration:

In [None]:
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})


config = {'pretrained' : 'facebook/bart-base',
          'batch_size' : 4,
          'max_lr': 2e-5,
          'epochs' : 1,
          'tok_input' : {'padding' : 'max_length',
                         'truncation' : True,
                         'max_length' : 1024,
                         'add_special_tokens' : True,
                         'return_tensors' : 'tf',
                         'is_split_into_words' : False,
                         'return_offsets_mapping' : False},
          
          'tok_output' : {'padding' : 'max_length',
                          'truncation' : True,
                          'max_length' : 512,
                          'add_special_tokens' : True,
                          'return_tensors' : 'tf',
                          'is_split_into_words' : False,
                          'return_offsets_mapping' : False},
         }


Now that the data and tokenizer are ready, we must encode the text into two arrays: input IDs and attention masks.
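
As a rough illustration of what `padding = 'max_length'` and `truncation = True` produce, the sketch below pads or cuts one list of token IDs and builds the matching attention mask. It is a simplification, not the tokenizer's actual code: `pad_and_mask` is a hypothetical helper, the example IDs are made up, the pad ID of 1 is an assumption about the BART vocabulary, and the real tokenizer also keeps the end-of-sequence token when it truncates.

```python
def pad_and_mask(token_ids, max_length, pad_id=1):
    """Truncate or right-pad one sequence and build its attention mask.

    pad_id=1 is assumed here; check tokenizer.pad_token_id in practice.
    """
    kept = token_ids[:max_length]            # plain slice standing in for truncation
    n_pad = max_length - len(kept)           # positions to fill up to max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token IDs, padded to a length of 6
ids, mask = pad_and_mask([0, 713, 16, 2], max_length=6)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore the padded positions, which is why both arrays are passed together.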

*Object-oriented programming (OOP) pipeline*

In [None]:
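# scikit-learn provides the shuffled train/validation split
from sklearn.model_selection import ShuffleSplit

class Dataset:

      def __init__(self) :
        self.data = df
        self.tokenizer = tokenizer

      def _split(self):
        # fold 1 = training rows (90%), fold 0 = validation rows (10%)
        self.data['fold'] = 0
        ss = ShuffleSplit(n_splits = 1, test_size = 0.1, random_state = seed)
        fold_col = self.data.columns.get_loc('fold')
        for t_, v_ in ss.split(self.data): 
            # positional assignment on the frame itself avoids chained indexing
            self.data.iloc[t_, fold_col] = 1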
      
      def _tokenize(self, i):
        x = self.tokenizer(self.data[self.data['fold'] == i]['Extract'].values.tolist(), 
                           **config['tok_input'])  
        
        y = self.tokenizer(self.data[self.data['fold'] == i]['Abstract'].values.tolist(), 
                           **config['tok_output'])
        
        return ({'input_ids': x['input_ids'],
                'attention_mask' : x['attention_mask'],
                'decoder_attention_mask' : y['attention_mask']}, y['input_ids'])
      
      def _to_tf(self, ds):
        return tf.data.Dataset.from_tensor_slices(ds).batch(config['batch_size']) \
                                                     .prefetch(1)
                                                     
      def get(self) :
        self._split()
        trainset = self._tokenize(1)
        valset = self._tokenize(0)
        
        return (self._to_tf(trainset),
                self._to_tf(valset))
      


In [None]:
trainset, valset = Dataset().get()


*Alternative pipeline: functional programming*

In [None]:
# input_len = 512 # choose this value

# output_len = 256

# abs = df['Abstract'].values  #
# labels = abs  # initialize label array


# def tokenize(sentence, in_out_len):
#   # pass the variable itself, not the literal string 'sentence'
#   tokens = tokenizer.encode_plus(sentence, max_length = in_out_len, truncation = True,
#                                   padding = 'max_length', add_special_tokens = True,
#                                   return_attention_mask = True, return_token_type_ids = False, is_split_into_words = False,
#                                   return_tensors='tf', return_offsets_mapping = False)
#   return tokens['input_ids'], tokens['attention_mask']



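# ids = np.zeros((len(df), input_len), dtype = 'int32')   # input IDs, one row per document
# mask = np.zeros((len(df), input_len), dtype = 'int32')  # attention masks, same shape
# for ele, sentence in enumerate(df['Extract']):
#   ids[ele,:], mask[ele,:] = tokenize(sentence, input_len)


# dftf = tf.data.Dataset.from_tensor_slices((ids, mask, labels))

# # restructure dataset format for BART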
# def map_func(input_ids, masks, labels):
#     return {'input_ids': input_ids, 'attention_mask': masks}, labels
  
# dftf = dftf.map(map_func)  # apply the mapping function


# dftf = dftf.shuffle(1000).batch(4)

# df_len = len(list(dftf))

# split = .90

# train = dftf.take(round(df_len*split))
# val = dftf.skip(round(df_len*split))


*Model development*

In [None]:
keras.backend.clear_session()

class Bart(keras.Model):
    def __init__(self):
        super(Bart, self).__init__()
        self.model = TFBartForConditionalGeneration.from_pretrained(config['pretrained'], 
                                                                   return_dict = True)
        
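    def call(self, inputs, training = False):
        # inputs is the (features, labels) pair produced by the Dataset pipeline
        x, y = inputs
        outputs = self.model(input_ids = x['input_ids'], 
                             attention_mask = x['attention_mask'], 
                             labels = y, 
                             decoder_attention_mask = x['decoder_attention_mask'],
                             training = training)  # forward the flag so dropout is active only during training
        return outputs.loss, outputs.logits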
    
    
    def train_step(self, data):
        with tf.GradientTape() as tape:
            loss, logits = self(data, training=True)

        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        _lr = self.optimizer.lr
        return {"loss": tf.reduce_mean(loss), 'lr' : _lr}
    
    
    def test_step(self, data):
        # evaluation only needs a forward pass, so no GradientTape is required
        loss, logits = self(data, training=False)
        
        return {"loss": tf.reduce_mean(loss)}

In [None]:
bart = Bart()
bart.compile(optimizer = keras.optimizers.Adam(1e-5))



In [None]:
K = keras.backend

class OneCycleLr(keras.callbacks.Callback):
    def __init__(self,
                 max_lr: float,
                 total_steps: int = None,
                 epochs: int = None,
                 steps_per_epoch: int = None,
                 pct_start: float = 0.3,
                 anneal_strategy: str = "cos",
                 cycle_momentum: bool = True,
                 base_momentum: float = 0.85,
                 max_momentum: float = 0.95,
                 div_factor: float = 25.0,
                 final_div_factor: float = 1e4):

        super(OneCycleLr, self).__init__()

        # Resolve the total number of steps
        if total_steps :
            self.total_steps = total_steps
        elif epochs and steps_per_epoch:
            self.total_steps = epochs * steps_per_epoch
        else:
            raise ValueError("Provide total_steps, or both epochs and steps_per_epoch.")

        self.step_num = 0
        self.step_size_up = float(pct_start * self.total_steps) - 1
        self.step_size_down = float(self.total_steps - self.step_size_up) - 1

        # Select the annealing function
        if anneal_strategy == "cos":
            self.anneal_func = self._annealing_cos
        elif anneal_strategy == "linear":
            self.anneal_func = self._annealing_linear
        else:
            raise ValueError("anneal_strategy must be 'cos' or 'linear'.")

        # Initialize learning rate variables
        self.initial_lr = max_lr / div_factor
        self.max_lr = max_lr
        self.min_lr = self.initial_lr / final_div_factor

        # Initial momentum variables
        self.cycle_momentum = cycle_momentum
        if self.cycle_momentum:
            self.m_momentum = max_momentum
            self.momentum = max_momentum
            self.b_momentum = base_momentum

        # Lists that record the learning rate and momentum at each step
        self.track_lr = []
        self.track_mom = []

    def _annealing_cos(self, start, end, pct):
        cos_out = math.cos(math.pi * pct) + 1
        return end + (start - end) / 2.0 * cos_out

    def _annealing_linear(self, start, end, pct):
        return (end - start) * pct + start

    def set_lr_mom(self):
        if self.step_num <= self.step_size_up:
            # update learning rate
            computed_lr = self.anneal_func(self.initial_lr, self.max_lr, self.step_num / self.step_size_up)
            K.set_value(self.model.optimizer.lr, computed_lr)
            # update momentum if cycle_momentum
            if self.cycle_momentum:
                computed_momentum = self.anneal_func(self.m_momentum, self.b_momentum, self.step_num / self.step_size_up)
                try:
                    K.set_value(self.model.optimizer.momentum,
                                computed_momentum)
                except:
                    K.set_value(self.model.optimizer.beta_1, computed_momentum)
        else:
            down_step_num = self.step_num - self.step_size_up
            # update learning rate
            computed_lr = self.anneal_func(self.max_lr, self.min_lr, down_step_num / self.step_size_down)
            K.set_value(self.model.optimizer.lr, computed_lr)
            # update momentum if cycle_momentum
            if self.cycle_momentum:
                computed_momentum = self.anneal_func(self.b_momentum, self.m_momentum, down_step_num / self.step_size_down)
                try:
                    K.set_value(self.model.optimizer.momentum,
                                computed_momentum)
                except:
                    K.set_value(self.model.optimizer.beta_1, computed_momentum)

    def on_train_begin(self, logs=None):
        # Set initial learning rate & momentum values
        K.set_value(self.model.optimizer.lr, self.initial_lr)
        if self.cycle_momentum:
            try:
                K.set_value(self.model.optimizer.momentum, self.momentum)
            except:
                K.set_value(self.model.optimizer.beta_1, self.momentum)

    def on_train_batch_end(self, batch, logs=None):
        # Grab the current learning rate & momentum
        lr = float(K.get_value(self.model.optimizer.lr))
        try:
            mom = float(K.get_value(self.model.optimizer.momentum))
        except:
            mom = float(K.get_value(self.model.optimizer.beta_1))
        # Append to the list
        self.track_lr.append(lr)
        self.track_mom.append(mom)
        # Update learning rate & momentum
        self.set_lr_mom()
        # increment step_num
        self.step_num += 1

In [None]:
scheduler = OneCycleLr(max_lr=config['max_lr'], 
                       steps_per_epoch=trainset.cardinality().numpy(), 
                       epochs=config['epochs'])

checkpoint = keras.callbacks.ModelCheckpoint(filepath = '/content/drive/MyDrive/ckpt_bart',
                                             save_best_only = True,
                                             save_weights_only = True)

early = keras.callbacks.EarlyStopping(patience = 2, restore_best_weights=True)
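
To make the schedule easier to inspect, here is a minimal closed-form sketch of the learning rate that the cosine branch of the `OneCycleLr` callback above would set at a given step. The function `one_cycle_lr` is a hypothetical stand-alone helper written for illustration, it ignores momentum, and the value `total_steps = 100` is an arbitrary assumption rather than the size of this dataset.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` under a cosine one-cycle policy (illustrative)."""
    initial_lr = max_lr / div_factor              # starting rate
    min_lr = initial_lr / final_div_factor        # rate at the very end
    step_size_up = pct_start * total_steps - 1    # length of the warm-up phase
    step_size_down = total_steps - step_size_up - 1

    def cos_anneal(start, end, pct):
        # moves smoothly from `start` (pct = 0) to `end` (pct = 1)
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1)

    if step <= step_size_up:
        return cos_anneal(initial_lr, max_lr, step / step_size_up)
    return cos_anneal(max_lr, min_lr, (step - step_size_up) / step_size_down)


# Inspect a few points of a 100-step schedule with the notebook's max_lr
for s in (0, 15, 29, 60, 99):
    print(s, one_cycle_lr(s, total_steps=100, max_lr=2e-5))
```

With the defaults, the rate starts at `max_lr / 25`, peaks at `max_lr` after roughly 30% of the steps, and then decays to a value far below the starting rate.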

*Training*

In [None]:
bart.fit(trainset,
         validation_data = valset,
         epochs = config['epochs'],
         callbacks = [early,scheduler]) 



<keras.callbacks.History at 0x7f94fe9f4430>

*Testing*

In [None]:
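# Workaround for a loading bug: config.json and tf_model.h5 were downloaded from the Hugging Face repo
# and saved in a Google Drive folder (Stat Consulting > RTE), which is passed below as `pretrained`.

class TextSummarization():
  def __init__(self, pretrained, tok, beam, temperature):

    # from_pretrained is a classmethod, so no Bart() wrapper needs to be built first;
    # the weights come from the `pretrained` folder, not from the `bart` model trained above
    self.model = TFBartForConditionalGeneration.from_pretrained(pretrained)
    self.tokenizer = BartTokenizerFast.from_pretrained(tok)
    self.beam = beam
    self.temperature = temperature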

  def generate(self, text):
    # tokenize once and pass the attention mask so padded positions are ignored
    enc = self.tokenizer(text, **config['tok_input'])
    tokens = self.model.generate(enc['input_ids'],
                                 attention_mask = enc['attention_mask'],
                                 min_length = 0,
                                 max_length = 128,
                                 num_beams = self.beam,
                                 temperature = self.temperature,
                                 do_sample = True,
                                 repetition_penalty = 2.5,
                                 length_penalty = 1,
                                 early_stopping = True).numpy()[0]
    return self.tokenizer.decode(tokens, skip_special_tokens = True)

  def post_processing(self, text):
    # keep everything up to the last period, then drop bracketed or parenthesized spans
    text = text[:text.rfind('.')+1]
    return re.sub(r"[\(\[].*?[\)\]]", "", text)

  def summarize(self, text):
    text = self.generate(text)
    text = self.post_processing(text)
    return text

ts = TextSummarization(pretrained = '/content/drive/MyDrive/RTE', tok = 'facebook/bart-base',beam = 5, temperature = 1.2)



All model checkpoint layers were used when initializing TFBartForConditionalGeneration.

All the layers of TFBartForConditionalGeneration were initialized from the model checkpoint at /content/drive/MyDrive/Statistical Consulting/RTE.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.
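
The post-processing step in the class above is easiest to understand on a small input. The sketch below repeats the same two operations as a stand-alone function; `post_process` and the sample sentences are hypothetical and only meant to show the behavior.

```python
import re


def post_process(text):
    """Keep text up to the last period, then drop (...) and [...] spans."""
    text = text[:text.rfind('.') + 1]            # cut the trailing, unfinished sentence
    return re.sub(r"[\(\[].*?[\)\]]", "", text)  # remove citations and bracketed notes


print(repr(post_process("The plan solves (Smith 19). It also [sic] works. And then it")))
# → 'The plan solves . It also  works.'

print(repr(post_process("no sentence end here")))
# → ''
```

Two side effects are worth knowing: removing a bracketed span leaves the surrounding spaces behind, and a generated text with no period at all is reduced to an empty string because `rfind` returns -1.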


In [None]:

data = df
id_doc = 14272

summary = ts.summarize(data['Extract'].values.tolist()[id_doc])


In [None]:
def compute_rouge(id_doc, prediction):
    # ROUGE-L of the prediction against the human abstract and against the source extract
    rouge = load_metric("rouge")
    rouge.add(reference = df['Abstract'].values.tolist()[id_doc], prediction = prediction)
    score_abs = rouge.compute()['rougeL']
    rouge = load_metric("rouge")
    rouge.add(reference = df['Extract'].values.tolist()[id_doc], prediction = prediction)
    score_ext = rouge.compute()['rougeL']
    return score_abs.mid, score_ext.mid

def compute_bleu(id_doc, prediction):
    # unigram BLEU: predictions is a list of token lists,
    # references is a list (one entry per prediction) of lists of token lists
    bleu = load_metric("bleu")
    score_abs = bleu.compute(predictions=[prediction.split()], 
                             references=[[df['Abstract'].values.tolist()[id_doc].split()]], max_order = 1)['bleu']
    bleu = load_metric("bleu")
    score_ext = bleu.compute(predictions=[prediction.split()], 
                             references=[[df['Extract'].values.tolist()[id_doc].split()]], max_order = 1)['bleu']
    return score_abs, score_ext

In [None]:
score_abs, score_ext = compute_rouge(id_doc, summary)  # names chosen to avoid shadowing the built-in abs()
print('Score on Abstract :', score_abs)
print('Score on Extract :', score_ext)

Score on Abstract : Score(precision=0.024390243902439025, recall=0.07692307692307693, fmeasure=0.03703703703703704)
Score on Extract : Score(precision=0.7804878048780488, recall=0.14479638009049775, fmeasure=0.24427480916030533)
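
To help read the precision, recall, and F-measure values above, here is a simplified sketch of how ROUGE-L can be computed from the longest common subsequence (LCS) of tokens. It is an approximation under stated assumptions: it lowercases and splits on whitespace, whereas the rouge_score package applies its own tokenization, so its numbers may differ. The functions `lcs_length` and `rouge_l` and the example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)       # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, prediction):
    """Return (precision, recall, fmeasure) based on the token-level LCS."""
    ref = reference.lower().split()
    pred = prediction.lower().split()
    lcs = lcs_length(ref, pred)
    precision = lcs / len(pred) if pred else 0.0   # share of the prediction that matches
    recall = lcs / len(ref) if ref else 0.0        # share of the reference that is covered
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


print(rouge_l("the cat sat on the mat", "the cat lay on a mat"))
```

Because precision divides by the length of the prediction and recall by the length of the reference, a short summary drawn from a long extract scores high precision but low recall, which matches the pattern in the Extract score above.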


We will use the ray[tune] library to tune hyperparameters. The cell below is an example of the pipeline; it still needs to be integrated into the model workflow to work properly. The objective is to maximize our metrics of interest (e.g., score_abs and score_ext). For each parameter of interest we define a range of values to choose from, in this case batch size and max_length for the input and output configurations (the example below covers only max_length).

In [None]:



from ray import tune

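# 1. Define an objective function.
def objective(config):
    # Placeholder: the score below does not yet depend on `config`, so every
    # trial reports the same value until tokenization and training are re-run here.
    score = compute_rouge(id_doc, summary)
    return {"score": score}


# 2. Define a search space.
# Note: facebook/bart-base supports at most 1024 positions, so the larger
# lengths listed here would need a checkpoint with a longer context.
search_space = {
    'tok_input' : {'max_length': tune.choice([4096, 2048, 1024, 512])},
    'tok_output': {'max_length': tune.choice([2048, 1024, 512, 256])}

}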

# 3. Start a Tune run and print the best result.
tuner = tune.Tuner(objective, param_space=search_space)
results = tuner.fit()
print(results.get_best_result(metric="score", mode="max").config)

Current time:,2022-12-07 01:33:43
Running for:,00:00:04.73
Memory:,5.8/83.5 GiB

Trial name,status,loc,tok_input/max_length,tok_output/max_length,iter,total time (s)
objective_2c7a7_00000,TERMINATED,172.28.0.12:3063,1024,256,1,1.30949




Trial name,date,done,episodes_total,experiment_id,experiment_tag,hostname,iterations_since_restore,node_ip,pid,score,time_since_restore,time_this_iter_s,time_total_s,timestamp,timesteps_since_restore,timesteps_total,training_iteration,trial_id,warmup_time
objective_2c7a7_00000,2022-12-07_01-33-43,True,,2269501859a34d918b2af7ec5210b505,"0_max_length=1024,max_length=256",bfef3e998bf9,1,172.28.0.12,3063,"(Score(precision=0.024390243902439025, recall=0.07692307692307693, fmeasure=0.03703703703703704), Score(precision=0.7804878048780488, recall=0.14479638009049775, fmeasure=0.24427480916030533))",1.30949,1.30949,1.30949,1670376823,0,,1,2c7a7_00000,0.00369239


2022-12-07 01:33:43,385	INFO tune.py:777 -- Total run time: 5.37 seconds (4.73 seconds for the tuning loop).


{'tok_input': {'max_length': 1024}, 'tok_output': {'max_length': 256}}
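
For the tuning run to be informative, the objective has to read the sampled configuration and return a single number that can be ranked across trials. The sketch below shows that shape with a plain exhaustive search in standard Python instead of Ray Tune, so it is a stand-in for the idea rather than the Tune API. The names `grid_search` and `toy_evaluate` are hypothetical, and `toy_evaluate` is an arbitrary placeholder that simply prefers shorter sequences; in practice it would tokenize with the given lengths, train, and return something like a ROUGE-L F-measure.

```python
from itertools import product

# Same candidate lengths as the search space above, flattened for simplicity
search_space = {
    "tok_input_max_length": [4096, 2048, 1024, 512],
    "tok_output_max_length": [2048, 1024, 512, 256],
}


def objective(config, evaluate):
    """Score one configuration; `evaluate` must return a scalar metric."""
    return {"score": float(evaluate(config))}


def grid_search(space, evaluate):
    """Try every combination and keep the configuration with the highest score."""
    keys = list(space)
    best = None
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        result = objective(config, evaluate)
        if best is None or result["score"] > best[1]:
            best = (config, result["score"])
    return best


def toy_evaluate(config):
    # Placeholder metric, NOT a real measurement: shorter sequences score higher
    return -(config["tok_input_max_length"] + config["tok_output_max_length"])


best_config, best_score = grid_search(search_space, toy_evaluate)
print(best_config, best_score)
```

The key design point is that `objective` receives the configuration and passes it on to the evaluation, so different trials can produce different scores.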



**References**

Transformer for Text Summarization on Long Documents. Submitted as Final Project for Advanced Machine Learning.

Lewis, M. et al. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Available via https://arxiv.org/abs/1910.13461