**Installations**

In [21]:
! pip install datasets transformers rouge-score nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [22]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**Libraries**

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from transformers import AutoTokenizer, AutoModel
import pyarrow as pa
import pyarrow.dataset as ds
from datasets import Dataset
from datasets import load_metric
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
import nltk

In [24]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**Dataset Loading**

In [25]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [26]:
# INPUT_PATH1 = "/content/drive/MyDrive/Semester 3 IIITD/NLP/NLP_Project/Dataset/preprocessed_data/divided_dataset"
# INPUT_PATH2 = "/content/drive/MyDrive/Semester 3 IIITD/NLP/NLP_Project/Dataset/preprocessed_data/whole_dataset"
# RESULT_PATH = "/content/drive/MyDrive/Semester 3 IIITD/NLP/NLP_Project/Results"
# MODEL_PATH = "/content/drive/MyDrive/Semester 3 IIITD/NLP/NLP_Project/Models_pickled_file"
INPUT_PATH1 = "/content/drive/MyDrive/NLP_Project/Dataset/preprocessed_data/divided_dataset"
INPUT_PATH2 = "/content/drive/MyDrive/NLP_Project/Dataset/preprocessed_data/whole_dataset"
RESULT_PATH = "/content/drive/MyDrive/NLP_Project/Results"
MODEL_PATH = "/content/drive/MyDrive/NLP_Project/Models_pickled_file"

In [27]:
train = pd.read_csv(os.path.join(INPUT_PATH1,"train.csv"))
val = pd.read_csv(os.path.join(INPUT_PATH1,"test.csv"))

In [28]:
test = pd.read_csv(os.path.join(INPUT_PATH2,"test.csv"))

In [29]:
train

Unnamed: 0,Heading,Summary,Article,id
0,"un urges for maximum restraint, invokes simla ...","pakistan termed the indian action as ""unilater...","un chief invokes shimla agreement, calls for '...",1
1,"china, pak to finalise deal to develop sez und...","""the agreement will be finalised between khybe...","china, pak to finalise deal to develop sez und...",2
2,"covaxin effectively neutralises both alpha, de...",the top health research institute said that an...,"covaxin effectively neutralises both alpha, de...",3
3,man gets coronavirus twice with more severe sy...,a 25-year-old man in the us has caught coronav...,man gets coronavirus twice with more severe sy...,5
4,afghanistan president ghani flees to tajikista...,reports say that afghanistan president ashraf ...,ghani's close aides have also left the country...,6
...,...,...,...,...
9041,covid-19 vaccine not likely to be available by...,it was not likely for a coronavirus vaccine to...,covid-19 vaccine not likely to be available by...,10047
9042,"jill biden visits europe, will meet with ukrai...","after flying overnight from washington, the fi...",us first lady jill biden meets u. s. troops du...,10048
9043,coronaviurus: 29 foreigners infected by in chi...,twenty-nine foreign nationals in china were in...,coronaviurus: 29 foreigners infected by in chi...,10049
9044,pakistan defence’s twitter account suspended f...,"on saturday, numerous indian twitter users com...",pakistan was yet again embarrassed on saturday...,10050


In [30]:
val

Unnamed: 0,Heading,Summary,Article,id
0,india opposes china's belt and road initiative...,the name of all member countries except india ...,"at sco, india refuses to back china's belt and...",0
1,"top white house officials buried cdc report, r...",the decision to shelve detailed advice from th...,"in this april 22, 2020, file photo president d...",4
2,us and china clash at un over south china sea ...,as india holds the council presidency this mon...,the united states and china clashed over beiji...,11
3,"us allows extra covid vaccine doses for some, ...",the food and drug administration ruled that tr...,vials for the moderna and pfizer covid-19 vacc...,13
4,pak minister claims threatening email was sent...,pakistan's information minister fawad chaudhry...,pakistan's information minister fawad chaudhry...,30
...,...,...,...,...
1001,'liberate!’: donald trump pushes states to lif...,president donald trump urged supporters to “li...,president donald trump listens as agriculture ...,10008
1002,7 dead after 2 small airplanes collide in mid-...,"seven people, including an alaska state lawmak...",a plane rests in brush and trees after a midai...,10010
1003,russia-ukraine war: european union likely to s...,"in april, the wall street journal reported tha...","kabaeva, who was born in 1983, was first linke...",10015
1004,uk lawmaker stabbing a 'terrorist act'? potent...,"amess, 69, was attacked around midday friday a...","conservative mp david amess with his pugs, lil...",10017


In [31]:
test

Unnamed: 0,Heading,Article,id
0,explainer: how worrying is the variant first s...,how worrying is the variant first seen in indi...,0
1,pakistan parliament to elect new prime ministe...,pakistans national assembly will elect a new p...,1
2,indian-origin pathologist accused of botching ...,dr. khalid ahmedan indian-origin pathologist h...,2
3,china begins world's biggest census drive to c...,china begins world's biggest census drive to c...,3
4,"indonesia prison fire kills 41 drug inmates, i...","indonesia prison fire kills 41 drug inmates, i...",4
...,...,...,...
2508,"arab league calls for israel boycott, terms it...",the arab league (al) called on arab states on ...,2508
2509,beirut explosion among most powerful non-nucle...,beirut explosion among most powerful non-nucle...,2509
2510,anti-aircraft gun bullets found near pak pm im...,imran khanpolice in pakistan have seized 18 li...,2510
2511,air-launched ballistic missile will realise ch...,representational imagethe usdepartment of defe...,2511


**Combining Heading and Article**

In [32]:
train['Source'] = train['Heading'] + train['Article']
train.drop(columns=['Article','Heading','id'],inplace=True)
train.head()

Unnamed: 0,Summary,Source
0,"pakistan termed the indian action as ""unilater...","un urges for maximum restraint, invokes simla ..."
1,"""the agreement will be finalised between khybe...","china, pak to finalise deal to develop sez und..."
2,the top health research institute said that an...,"covaxin effectively neutralises both alpha, de..."
3,a 25-year-old man in the us has caught coronav...,man gets coronavirus twice with more severe sy...
4,reports say that afghanistan president ashraf ...,afghanistan president ghani flees to tajikista...


In [33]:
val['Source'] = val['Heading'] + val['Article']
val.drop(columns=['Article','Heading','id'],inplace=True)
val.head()

Unnamed: 0,Summary,Source
0,the name of all member countries except india ...,india opposes china's belt and road initiative...
1,the decision to shelve detailed advice from th...,"top white house officials buried cdc report, r..."
2,as india holds the council presidency this mon...,us and china clash at un over south china sea ...
3,the food and drug administration ruled that tr...,"us allows extra covid vaccine doses for some, ..."
4,pakistan's information minister fawad chaudhry...,pak minister claims threatening email was sent...


In [34]:
test['Source'] = test['Heading'] + test['Article']
test.drop(columns=['Article','Heading'],inplace=True)
test.head()

Unnamed: 0,id,Source
0,0,explainer: how worrying is the variant first s...
1,1,pakistan parliament to elect new prime ministe...
2,2,indian-origin pathologist accused of botching ...
3,3,china begins world's biggest census drive to c...
4,4,"indonesia prison fire kills 41 drug inmates, i..."
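
The cells above join the two columns with a plain `+`, so the last word of the heading runs directly into the first word of the article, and in pandas a missing value in either column makes the combined value missing as well. A possible alternative is sketched below; `combine_fields` is a hypothetical helper that is not part of this notebook, and the single-space separator is an assumption.

```python
def combine_fields(heading, article, sep=" "):
    """Join heading and article with a separator, skipping missing or empty parts."""
    parts = []
    for value in (heading, article):
        # Anything that is not a non-empty string (None, float NaN, "") counts as missing
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return sep.join(parts)


print(combine_fields("pak minister claims threatening email", "pakistan's information minister said"))
```

If adopted, it could be applied row by row, for example `train.apply(lambda r: combine_fields(r['Heading'], r['Article']), axis=1)`. Doing so would change the model inputs, so the scores logged further down would no longer correspond to the code.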


**Converting to pyarrow datasets**

In [35]:
# Build a pyarrow in-memory dataset from the DataFrame (not used later in this notebook)
dataset = ds.dataset(pa.Table.from_pandas(train).to_batches())

# Convert to a Hugging Face dataset
train_dataset = Dataset(pa.Table.from_pandas(train))

In [36]:
train_dataset

Dataset({
    features: ['Summary', 'Source'],
    num_rows: 9046
})

In [37]:
# Build a pyarrow in-memory dataset from the DataFrame (not used later in this notebook)
dataset = ds.dataset(pa.Table.from_pandas(val).to_batches())

# Convert to a Hugging Face dataset
val_dataset = Dataset(pa.Table.from_pandas(val))

In [38]:
val_dataset

Dataset({
    features: ['Summary', 'Source'],
    num_rows: 1006
})

In [39]:
# Build a pyarrow in-memory dataset from the DataFrame (not used later in this notebook)
dataset = ds.dataset(pa.Table.from_pandas(test).to_batches())

# Convert to a Hugging Face dataset
test_dataset = Dataset(pa.Table.from_pandas(test))

In [40]:
test_dataset

Dataset({
    features: ['id', 'Source'],
    num_rows: 2513
})

**Hyperparameters**

In [41]:
model_checkpoint = "mrm8488/t5-base-finetuned-summarize-news"
# model_checkpoint = "facebook/bart-large-cnn"
max_input_length = 1520
max_target_length = 56
batch_size = 1
NUM_EPOCHS = 5
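
With a batch size of 1, every training example costs one optimizer step, so the length of the run follows from simple arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `total_optimization_steps` is a hypothetical helper, and 9046 is the training row count reported by `train_dataset`.

```python
import math


def total_optimization_steps(num_examples, num_epochs, batch_size):
    """Estimate optimizer steps, assuming one device and no gradient accumulation."""
    # The last, possibly smaller, batch of each epoch still counts as a step
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


print(total_optimization_steps(9046, 5, 1))
```

For these settings the estimate is 45230 steps, which matches the "Total optimization steps" line in the training log further down.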

**Load metric**

In [42]:
metric = load_metric("rouge")


**Preprocess**

In [43]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [44]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

In [45]:
def preprocess_function_test(examples):
    # Tokenize the inputs only: the test set has no "Summary" column, so no labels are built
    inputs = [prefix + doc for doc in examples["Source"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    return model_inputs

In [46]:
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["Source"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["Summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [47]:
tokenized_dataset_train = train_dataset.map(preprocess_function, batched=True)
tokenized_dataset_val = val_dataset.map(preprocess_function, batched=True)


In [48]:
tokenized_dataset_test = test_dataset.map(preprocess_function_test, batched=True)


**Fine-tuning the model**

In [49]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


In [50]:
args = Seq2SeqTrainingArguments(
    "results",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=NUM_EPOCHS,
    predict_with_generate=True,
    # fp16=True,
)

In [51]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
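
`DataCollatorForSeq2Seq` pads each batch dynamically, and by default it pads the label sequences with -100, a value the loss function ignores. The following is a simplified pure-Python sketch of that label padding only, not the library's implementation; `pad_labels` is a hypothetical helper and the token ids are made up.

```python
def pad_labels(batch_labels, label_pad_id=-100):
    """Pad every label sequence to the length of the longest one in the batch."""
    longest = max(len(labels) for labels in batch_labels)
    return [labels + [label_pad_id] * (longest - len(labels)) for labels in batch_labels]


# Two label sequences of different lengths (illustrative token ids)
padded = pad_labels([[37, 1712, 1], [8, 1]])
print(padded)
```

This is also why `compute_metrics` replaces -100 with the tokenizer's pad token id before decoding the labels.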

In [52]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Keep the mid F-measure of each ROUGE score, scaled to a percentage
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
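
To make the reported numbers easier to read, here is a minimal sketch of what a ROUGE-1 F-measure captures: unigram overlap between a prediction and a reference. It is a simplification under stated assumptions (lowercasing and whitespace tokenization, no stemming, no aggregation over the dataset), so its values will not exactly match the `rouge` metric used above; `rouge1_f` is a hypothetical name.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure between two strings (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f("the cat", "the cat sat"))
```

In this example the precision is 1 and the recall is 2/3, so a summary that is shorter than the reference is penalized through recall even when every word it contains is correct.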

In [53]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_dataset_train,
    eval_dataset=tokenized_dataset_val,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [54]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: Source, Summary. If Source, Summary are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 9046
  Num Epochs = 5
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 45230
  Number of trainable parameters = 222903552
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,0.2982,0.70595,27.0021,16.3469,24.3122,24.4905,18.996
2,0.2439,0.688985,27.1857,16.5727,24.5751,24.7347,18.996
3,0.2282,0.69056,27.819,17.3796,25.2478,25.3875,18.995


Saving model checkpoint to results/checkpoint-500
Configuration saved in results/checkpoint-500/config.json
Model weights saved in results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-500/tokenizer_config.json
Special tokens file saved in results/checkpoint-500/special_tokens_map.json
Copy vocab file to results/checkpoint-500/spiece.model
Saving model checkpoint to results/checkpoint-1000
Configuration saved in results/checkpoint-1000/config.json
Model weights saved in results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in results/checkpoint-1000/special_tokens_map.json
Copy vocab file to results/checkpoint-1000/spiece.model
Deleting older checkpoint [results/checkpoint-500] due to args.save_total_limit
Saving model checkpoint to results/checkpoint-1500
Configuration saved in results/checkpoint-1500/config.json
Model weights saved in results/checkpoint-1500

KeyboardInterrupt: ignored

In [55]:
def generate_summary(test_samples, model):
    inputs = tokenizer(
        test_samples["Source"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
        return_tensors="pt",
    )
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return outputs, output_str
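
`generate_summary` tokenizes everything it receives in a single call and pads every input to `max_input_length`, so passing the whole test set at once would build one very large batch. One way to limit memory use, sketched here as an assumption rather than as something this notebook does, is to feed the sources in small chunks; `iter_batches` is a hypothetical helper.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Five placeholder documents split into chunks of two
for chunk in iter_batches(["doc0", "doc1", "doc2", "doc3", "doc4"], 2):
    print(chunk)
```

Each chunk could then be wrapped as `{"Source": chunk}` and passed to `generate_summary`, collecting the decoded strings across calls.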

In [56]:
!rm -r "/content/results"

In [57]:
trainer.save_model(os.path.join(MODEL_PATH,model_checkpoint))

Saving model checkpoint to /content/drive/MyDrive/NLP_Project/Models_pickled_file/mrm8488/t5-base-finetuned-summarize-news
Configuration saved in /content/drive/MyDrive/NLP_Project/Models_pickled_file/mrm8488/t5-base-finetuned-summarize-news/config.json
Model weights saved in /content/drive/MyDrive/NLP_Project/Models_pickled_file/mrm8488/t5-base-finetuned-summarize-news/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/NLP_Project/Models_pickled_file/mrm8488/t5-base-finetuned-summarize-news/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/NLP_Project/Models_pickled_file/mrm8488/t5-base-finetuned-summarize-news/special_tokens_map.json
Copy vocab file to /content/drive/MyDrive/NLP_Project/Models_pickled_file/mrm8488/t5-base-finetuned-summarize-news/spiece.model


In [None]:
# model = AutoModelForSeq2SeqLM.from_pretrained(os.path.join(MODEL_PATH,model_checkpoint))

In [None]:
# summaries_after_tuning = generate_summary(test_dataset, model)[1]

In [None]:
# df = pd.DataFrame(zip(summaries_after_tuning,test_dataset['id']),
#                   columns=["Summary","id"])

In [None]:
# df.head()

**Saving the predictions**

In [None]:
# df.to_csv(os.path.join(RESULT_PATH,model_checkpoint+".csv"),index=False)