<a href="https://colab.research.google.com/github/RochanaChaturvedi/laysumm20/blob/master/fine_tune_BART.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### This notebook is an adaptation of: Text Generation with blurr by Wayde Gilliam
https://ohmeow.com/posts/2020/05/23/text-generation-with-blurr.html

In [None]:
# only run this cell if you are in Colab
# !pip install ohmeow-blurr
!pip install torch==1.6.0 
# !pip install nlp

In [None]:
import nlp
import pandas as pd
from fastai.text.all import *
from transformers import *

from blurr.data.all import *
from blurr.modeling.all import *

## Data Preparation

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
# models=['wMVC_exp2']
# # Folder hierarchy:
# #summaries\\batch3 or summaries\\laysumm2 or summaries\\test
# path="/content/drive/My Drive/Laysumm/"
# system_path=path+"system_summaries/Exp2_clipped_text/ensemble_summaries/batch3/"
# model_path=path+"gold_summaries/test/"
# # output_path=path+"system_summaries/Exp2_clipped_text/Evaluation/output/"
# # input_data=path+"system_summaries/Exp2_clipped_text/ensemble_summaries/laysumm2/mvc/en-wMVC-ClippedExp-L/"
# # input_data=path+"Preprocessed_data/3_Final_input_data/laysumm2/Laysum2-Abstract-ClippedText/" #merged summaries
# # val_data=path+"system_summaries/Exp2_clipped_text/ensemble_summaries/batch3/mvc/en-wMVC-ClippedExp-B/"
# test_data=path+"system_summaries/Exp2_clipped_text/ensemble_summaries/test/mvc/en-wMVC-ClippedExp-test/"

# data={}
# i=0
# for doc in os.listdir(test_data):
#   with open(test_data+doc,'r') as f:
#           text = f.read()
#           # print(doc,text)
#   with open(model_path+doc.replace(".txt","_LAYSUMM.TXT"),'r') as f:
#           summ=f.read()
#           summ=summ.split("PARAGRAPH")[-1]
#           # print(summ)
    
#   data[f]=(text,summ)
#   # break
# # print(data)
# df_test=pd.DataFrame.from_dict(data, orient='index',columns=["article","highlights"])
# # df.head()

In [None]:
df=pd.read_csv("/content/drive/My Drive/laysumm_2/laysumm.csv")
df_val=pd.read_csv("/content/drive/My Drive/laysumm_2/laysumm-v.csv")
df_test=pd.read_csv("/content/drive/My Drive/laysumm_2/laysumm-t.csv")
df_test.head()

In [None]:
df=pd.concat([df,df_val])

We begin by getting the huggingface objects needed for this task (e.g., the architecture, tokenizer, config, and model). We'll use blurr's `get_hf_objects` helper method here.

In [None]:
pretrained_model_name = "facebook/bart-large-cnn"

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, model_cls=BartForConditionalGeneration)
# hf_model.to('cuda')
hf_arch, type(hf_tokenizer), type(hf_config), type(hf_model)


Next we need to build out our DataBlock. Remember that a DataBlock is a blueprint describing how to move your raw data into something modelable. That blueprint is executed when we pass it a data source, which in our case will be the DataFrame we created above.

Notice we're passing `max_length=400` to the target `HF_TextBlock` to constrain our decoder inputs, so that our input/predicted summaries are capped at 400 tokens rather than the default, which is whatever you are using for your encoder inputs (e.g., the text you want summarized).
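
As a rough illustration of what a target length cap means for a single summary, here is a minimal pure-Python sketch. It is not blurr's or the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and `pad_id=1` is assumed to be BART's padding id.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Clip a token-id sequence to max_length, then right-pad it to exactly max_length."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))

# Made-up token ids, for illustration only
short_summary = [0, 713, 16, 2]
long_summary = list(range(500))

print(len(pad_or_truncate(short_summary, 400)))  # short sequences are padded up
print(len(pad_or_truncate(long_summary, 400)))   # long sequences are clipped
```

In practice padding is usually applied per batch, so the padded length may be shorter than the cap when every summary in the batch is short.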

In [None]:
hf_batch_tfm = HF_SummarizationBatchTransform(hf_arch, hf_tokenizer)

blocks = ( 
    HF_TextBlock(hf_arch, hf_tokenizer), 
    HF_TextBlock(hf_arch, hf_tokenizer, hf_batch_tfm=hf_batch_tfm, max_length=400)
)

dblock = DataBlock(blocks=blocks, 
                   get_x=ColReader('article'), 
                   get_y=ColReader('highlights'), 
                   splitter=RandomSplitter())

In [None]:
dls = dblock.dataloaders(df, bs=1)
dls.to('cuda')

In [None]:
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')#check GPU

It's always a good idea to check out a batch of data and make sure the shapes look right.

In [None]:
b = dls.one_batch()
len(b),b[0]['input_ids'].shape, b[1].shape

Even better, we can take advantage of blurr's TypeDispatched version of `show_batch` to look at things a bit more intuitively.

In [None]:
dls.show_batch(hf_tokenizer=hf_tokenizer, max_n=2)

# ROUGE

In [None]:
import re
from rouge_score import rouge_scorer  # pip install rouge-score

def impose_max_length(summary_text, max_tokens=150):
    # generate_text returns a list of summaries; accept either that list or a plain string
    if isinstance(summary_text, (list, tuple)):
        summary_text = summary_text[0]
    # lowercase, turn every non-alphanumeric run into a space, keep the first max_tokens tokens
    text = re.sub(r"[^a-z0-9]+", " ", summary_text.lower())
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

metrics = ["rouge1", "rouge2", "rougeL"]

def get_rouge(dframe):
    """Add <metric>_f (F-measure) and <metric>_r (recall) columns scoring 'system' against 'highlights'."""
    scorer = rouge_scorer.RougeScorer(metrics, use_stemmer=True)
    # float columns; rows that fail to score keep the default of 0.0
    for metric in metrics:
        dframe[metric + "_f"] = 0.0
        dframe[metric + "_r"] = 0.0
    for index, row in dframe.iterrows():
        try:
            reference_summary = row['highlights']
            submitted_summary = impose_max_length(row['system'])
            scores = scorer.score(reference_summary.strip(), submitted_summary.strip())
            for metric in metrics:
                dframe.loc[index, metric + "_f"] = scores[metric].fmeasure
                dframe.loc[index, metric + "_r"] = scores[metric].recall
        except Exception as e:
            print(f"Could not score row {index}: {e}")

    print("evaluation finished")
    return dframe
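
To make the scores easier to interpret, here is a minimal hand-rolled sketch of ROUGE-1 (clipped unigram overlap). It is only an illustration: `rouge1` is a hypothetical helper, the sentences are invented, and unlike the `rouge_score` scorer used in this notebook it does no stemming, so its numbers can differ.

```python
from collections import Counter

def rouge1(reference, candidate):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented sentences, for illustration only
p, r, f = rouge1("the cat sat on the mat", "the cat lay on a mat")
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Recall is the share of reference words recovered by the system summary, which is why the notebook stores both the `_r` and `_f` columns.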

# Training

We'll prepare our BART model for training by wrapping it in blurr's `HF_BaseModelWrapper` model object and attaching an `HF_SummarizationModelCallback`. Together they ensure all our inputs get translated into the proper arguments needed by a huggingface conditional generation model. We'll also use a custom model splitter that will allow us to apply discriminative learning rates over the various layers in our huggingface model.

Once we have everything in place, we'll freeze our model so that only the last layer group's parameters are trainable. See [here](https://docs.fast.ai/basic_train.html#Discriminative-layer-training) for how discriminative learning rates work in fastai.

**Note:** This has been tested with BART only thus far (if you try any other conditional generation transformer models they may or may not work ... if you do, lmk either way)
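
To give a feel for discriminative learning rates, here is a minimal pure-Python sketch that spreads learning rates geometrically across layer groups, which is my understanding of what fastai does when given `slice(lr_min, lr_max)`. The helper `discriminative_lrs` and the group names are hypothetical; the groups actually produced by `summarization_splitter` may differ.

```python
def discriminative_lrs(lr_min, lr_max, n_groups):
    """Geometrically spaced learning rates: lowest for the earliest group, highest for the last."""
    if n_groups == 1:
        return [lr_max]
    step = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * step ** i for i in range(n_groups)]

# Hypothetical layer groups, for illustration only
group_names = ["embeddings", "encoder", "decoder"]
lrs = discriminative_lrs(1e-5, 1e-3, len(group_names))

for i, (name, lr) in enumerate(zip(group_names, lrs)):
    # After freezing, only the last group would receive updates
    trainable = i == len(group_names) - 1
    print(f"{name}: lr={lr:.0e}, trainable when frozen={trainable}")
```

Earlier groups hold more general features, so they get smaller updates than the groups closest to the output.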

In [None]:
text_gen_kwargs = { **hf_config.task_specific_params['summarization'], **{'max_length': 220, 'min_length': 90, 'length_penalty': 1.5, 'no_repeat_ngram_size': 3} }
text_gen_kwargs

In [None]:
model = HF_BaseModelWrapper(hf_model)
model_cb = HF_SummarizationModelCallback(text_gen_kwargs=text_gen_kwargs)

learn = Learner(dls, 
                model,
                opt_func=ranger,
                loss_func=HF_MaskedLMLoss(),
                cbs=[model_cb],
                splitter=partial(summarization_splitter, arch=hf_arch))#.to_fp16()

learn.create_opt() 
learn.freeze()

It's also not a bad idea to run a batch through your model and make sure the shape of what goes in, and comes out, looks right.

In [None]:
b = dls.one_batch()
preds = learn.model(b[0])

len(b),b[0]['input_ids'].shape, b[1].shape, len(preds), preds[0].shape

In [None]:
# print(len(learn.opt.param_groups))

Still experimenting with how to use fastai's learning rate finder for these kinds of models. If you all have any suggestions or interesting insights to share, please let me know. We're only going to train the frozen model for three epochs for this demo, but feel free to progressively unfreeze the model and train the other layers to see if you can best my results below.
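
Since the learning rate finder is not used here, it may help to see roughly how the learning rate evolves under the one-cycle policy that `fit_one_cycle` applies. The following is a pure-Python sketch, not fastai's code: `cos_anneal` and `one_cycle_lr` are hypothetical helpers, and the defaults (`div=25`, `div_final=1e5`, `pct_start=0.25`) are what I understand fastai's defaults to be, so check them against the fastai docs.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pct, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress pct in [0, 1]."""
    if pct < pct_start:
        # warm-up: rise from lr_max/div to lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # annealing: fall from lr_max to lr_max/div_final
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))

for pct in (0.0, 0.1, 0.25, 0.5, 1.0):
    print(f"progress {pct:.2f} -> lr {one_cycle_lr(pct, lr_max=1e-3):.2e}")
```

The peak value corresponds to the `lr_max` argument passed in the training cell, so that single number controls the whole schedule.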

In [None]:
# learn.lr_find(suggestions=True)#oom

In [None]:
learn.fit_one_cycle(3, lr_max=1e-3)

In [None]:
learn.show_results(learner=learn, max_n=2)

Even better though, blurr augments the fastai Learner with a `generate_text` method that allows you to use huggingface's `PreTrainedModel.generate` method to create something more human-like.

In [None]:
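# Writing to `row` inside iterrows() is not guaranteed to update the DataFrame, so build the column in one go.
# Each cell holds the list returned by generate_text; get_rouge reads its first element.
df_val['system'] = [
    learn.generate_text(article, early_stopping=True, num_beams=4, num_return_sequences=1, max_length=150, min_length=60)
    for article in df_val['article']
]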

df_val=get_rouge(df_val)
df_val.head()

In [None]:
df_val.to_csv("/content/drive/My Drive/laysumm-v.csv")

In [None]:
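# Writing to `row` inside iterrows() is not guaranteed to update the DataFrame, so build the column in one go.
# generate_text returns a list; keep its first (and only) summary.
df_test['system'] = [
    learn.generate_text(article, early_stopping=True, num_beams=4, num_return_sequences=1, max_length=150, min_length=60)[0]
    for article in df_test['article']
]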
df_test.head()

In [None]:
df_test.to_csv("/content/drive/My Drive/laysumm-t.csv")