# Abstractive summary
## Method 2 - Model evaluation (src/rouge.ipynb)
### Evaluation for the model trained in src/bart.ipynb 
Performance metrics – ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
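
To make the metric concrete, the sketch below computes ROUGE-N precision, recall, and F-measure from clipped n-gram overlap. It is a simplified illustration, not the `rouge` metric used in this notebook: the function names `ngram_counts` and `rouge_n` are hypothetical, and tokenization is assumed to be plain lowercase whitespace splitting, so its scores will not match the library exactly.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, fmeasure) for ROUGE-N.

    Assumption: lowercase whitespace tokenization, which is simpler than
    the tokenizer used by the real metric.
    """
    pred = ngram_counts(prediction.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped overlap)
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    target = "the cat lay on the mat"
    print("rouge1:", rouge_n(generated, target, n=1))
    print("rouge2:", rouge_n(generated, target, n=2))
```

Precision is the share of generated n-grams that also appear in the target, recall is the share of target n-grams that were recovered, and the F-measure is their harmonic mean.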

Implemented work:
- Load the validation data.
- Prepare features: load, tokenize, and convert each text to tensors.
- Generate summary IDs with the specified generation parameters.
- Decode the summary IDs to text, skipping special tokens.
- Generate summaries for the whole validation set.
- Compute ROUGE metrics from the generated summaries and the original (target) summaries.
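
The metrics step reports `rougeL`, which is based on the longest common subsequence (LCS) between a generated summary and its target rather than on fixed-size n-grams. A minimal sketch of that idea follows, assuming lowercase whitespace tokenization; `lcs_length` and `rouge_l` are hypothetical helper names, and the library's `rougeLsum` variant (which works sentence by sentence) is not covered.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """Return (precision, recall, fmeasure) based on the LCS length.

    Assumption: lowercase whitespace tokenization, simpler than the
    tokenizer used by the real metric.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    precision = lcs / len(pred) if pred else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


if __name__ == "__main__":
    print(rouge_l("the cat sat on the mat", "the cat lay on the mat"))
```

Because a subsequence does not need to be contiguous, this score rewards summaries that keep the target's word order even when words are inserted or dropped in between.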

### Results:
After fine-tuning:
- rouge1: Score(precision=0.6592064440736602, recall=0.6324733143712006, fmeasure=0.6132238861334389)
- rouge2: Score(precision=0.5080411090198691, recall=0.4968806580602876, fmeasure=0.47866874404111104)
- rougeL: Score(precision=0.5735140296805632, recall=0.560047355602015, fmeasure=0.5396991258438945)
- rougeLsum: Score(precision=0.6167078297893005, recall=0.5946188830525441, fmeasure=0.5762430613700162)

Before fine-tuning:
- rouge1: Score(precision=0.30968324775739353, recall=0.4396760585008399, fmeasure=0.29054592962775116)
- rouge2: Score(precision=0.12535111393152845, recall=0.17781582343023616, fmeasure=0.11609005370534525)
- rougeL: Score(precision=0.20093013240942154, recall=0.2907883810753654, fmeasure=0.1878558128893604)
- rougeLsum: Score(precision=0.24092968768212347, recall=0.34497937380932003, fmeasure=0.22708400391795613)
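
The size of the improvement is easier to read as a per-metric difference. The short sketch below takes the F-measure values listed above and computes the absolute gain from fine-tuning; the dictionary names `before`, `after`, and `gains` are chosen here for illustration only.

```python
# F-measure (mid) values copied from the results listed above
after = {
    "rouge1": 0.6132238861334389,
    "rouge2": 0.47866874404111104,
    "rougeL": 0.5396991258438945,
    "rougeLsum": 0.5762430613700162,
}
before = {
    "rouge1": 0.29054592962775116,
    "rouge2": 0.11609005370534525,
    "rougeL": 0.1878558128893604,
    "rougeLsum": 0.22708400391795613,
}

# Absolute F-measure gain per metric
gains = {name: after[name] - before[name] for name in after}

for name in after:
    print(f"{name}: {before[name]:.4f} -> {after[name]:.4f} (+{gains[name]:.4f})")
```

Every metric improves by more than 0.30 in absolute F-measure.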

### Observations:

The trained model from method 1 was not used for deployment; the trained model from method 2 was deployed instead.

Reasons:
- Although the method 1 model reached a very low training loss, it performed inconsistently in the validation and testing phases.
- A suspected tensor error during method 1 training may explain the inconsistency of that model's output.
- The method 2 results outperformed those of method 1.
    - ROUGE-1 F-measure = 61.32 -> benchmark grade
    - For comparison, GPT-4 performance for text summarization: ROUGE-1 63.22

In [1]:
import pandas as pd
import torch
from transformers import BartForConditionalGeneration, BartTokenizer
from datasets import load_metric

In [2]:
#load the validation dataset
validation_data = pd.read_csv('/home/mohan/infy/data/fined/validation.csv')

input_texts = validation_data['text'].tolist() # DF to list
target_texts = validation_data['summary'].tolist() # DF to list

In [3]:
validation_data

Unnamed: 0,text,summary
0,Vladimir Putin is 'alive' but 'neutralised' as...,Vladimir Putin is supposed to hold public meet...
1,#Person1#: How old is Keith?\n#Person2#: He's ...,#Person1# and #Person2# talk about the age of ...
2,(CNN)The United States has seemingly erupted t...,A Native American from a tribe not recognized ...
3,#Person1#: When do you want to have the open h...,#Person1# and #Person2# are planning an open h...
4,neutral pion photoproduction on the proton at ...,we investigate the neutral pion photoproductio...
...,...,...
5011,The New York Police Department is searching fo...,Driver was at intersection of Broadway and Wes...
5012,SECTION 1. SHORT TITLE.\n\n This Act may be...,"Preservation of Localism, Program Diversity, a..."
5013,vortex instabilities are observed throughout t...,vortices have been postulated at a range of si...
5014,Daniel Levy reportedly told the Tottenham Hots...,Tottenham Hotspur chairman Daniel Levy has tol...


In [4]:
# Load fine-tuned model and tokenizer
model_path =  '/home/mohan/infy/models/fine_tuned_Text_Summ/saved' 
model = BartForConditionalGeneration.from_pretrained(model_path) # model
tokenizer = BartTokenizer.from_pretrained(model_path) # tokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device) # move the model to the GPU if available, otherwise the CPU

In [6]:
# Load the ROUGE metric
rouge = load_metric('rouge')

# Function to generate summaries
def generate_summary(text):
    # Tokenize the text, truncate it to 512 tokens, and move the tensors to the target device
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)
    # Generate summary IDs with beam search (4 beams, 40 to 150 tokens)
    summary_ids = model.generate(inputs.input_ids, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    # Decode the summary IDs to text, skipping special tokens
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Generate summaries for the validation set 
generated_summaries = [generate_summary(text) for text in input_texts]

# save the generated summary to the dataframe
validation_data['generated_summary'] = generated_summaries 

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
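
The cell above summarizes one text per `generate` call, which can be slow on a validation set of this size. A common alternative is to process the texts in batches. The sketch below shows only the batching logic and makes several assumptions: `chunks` and `summarize_in_batches` are hypothetical helpers, and `summarize_batch` stands for a user-supplied function that would tokenize a list of texts with padding, call `generate` once, and decode all outputs. A stand-in is used here so the example runs without a model.

```python
from typing import Callable, Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def summarize_in_batches(
    texts: Sequence[str],
    summarize_batch: Callable[[Sequence[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Apply `summarize_batch` to each chunk and keep the input order."""
    summaries: List[str] = []
    for batch in chunks(texts, batch_size):
        summaries.extend(summarize_batch(batch))
    return summaries


if __name__ == "__main__":
    # Stand-in for a real model call (hypothetical): keep the first three words
    def fake_summarize_batch(batch):
        return [" ".join(text.split()[:3]) for text in batch]

    docs = [f"document number {i} has some longer text" for i in range(5)]
    print(summarize_in_batches(docs, fake_summarize_batch, batch_size=2))
```

With a real model, the batch size would be limited by GPU memory, and padded batches need the attention mask passed to `generate` alongside the input IDs.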


In [7]:
#compute rouge metrics based on generated summary and original summary (target) 
rouge_scores = rouge.compute(predictions=generated_summaries, references=target_texts) 

In [8]:
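# Each ROUGE result is an AggregateScore with low, mid, and high fields:
# low and high bound a bootstrap confidence interval, mid is the central estimate.
# Print the mid value as the headline score.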
for key, value in rouge_scores.items():
    print(f"{key}: {value.mid}")

rouge1: Score(precision=0.6592064440736602, recall=0.6324733143712006, fmeasure=0.6132238861334389)
rouge2: Score(precision=0.5080411090198691, recall=0.4968806580602876, fmeasure=0.47866874404111104)
rougeL: Score(precision=0.5735140296805632, recall=0.560047355602015, fmeasure=0.5396991258438945)
rougeLsum: Score(precision=0.6167078297893005, recall=0.5946188830525441, fmeasure=0.5762430613700162)
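
The `low`, `mid`, and `high` fields come from bootstrap resampling of the per-example scores. The sketch below illustrates that idea in plain Python. It is an approximation under stated assumptions, not the library's implementation: `percentile` and `bootstrap_aggregate` are hypothetical names, and the defaults of 1000 resamples and a 95% interval are assumed here.

```python
import random


def percentile(sorted_values, q):
    """Percentile q (0 to 100) of a sorted list, with linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * weight


def bootstrap_aggregate(scores, n_samples=1000, confidence=0.95, seed=0):
    """Resample per-example scores with replacement and summarize the means.

    Returns the lower bound, the median, and the upper bound of the
    resampled means as low, mid, and high.
    """
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_samples)
    )
    tail = (1 - confidence) / 2 * 100
    return {
        "low": percentile(means, tail),
        "mid": percentile(means, 50),
        "high": percentile(means, 100 - tail),
    }


if __name__ == "__main__":
    # Hypothetical per-example F-measures, for illustration only
    example_scores = [0.2, 0.4, 0.6, 0.8, 0.5, 0.7, 0.3, 0.9]
    print(bootstrap_aggregate(example_scores))
```

A narrow gap between `low` and `high` indicates that the mean score is stable across resamples of the validation set.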


In [9]:
# save the dataframe to csv file
validation_data.to_csv('/home/mohan/infy/data/gen/with_gen_summ.csv', index=False)
validation_data

Unnamed: 0,text,summary,generated_summary
0,Vladimir Putin is 'alive' but 'neutralised' as...,Vladimir Putin is supposed to hold public meet...,Vladimir Putin is supposed to hold public meet...
1,#Person1#: How old is Keith?\n#Person2#: He's ...,#Person1# and #Person2# talk about the age of ...,#Person1# and #Person2# are talking about the ...
2,(CNN)The United States has seemingly erupted t...,A Native American from a tribe not recognized ...,"In Indiana, there were only 16 states with rel..."
3,#Person1#: When do you want to have the open h...,#Person1# and #Person2# are planning an open h...,#Person1# and #Person2# are planning an open h...
4,neutral pion photoproduction on the proton at ...,we investigate the neutral pion photoproductio...,neutral pion photoproduction on the proton is ...
...,...,...,...
5011,The New York Police Department is searching fo...,Driver was at intersection of Broadway and Wes...,Driver was at intersection of Broadway and Wes...
5012,SECTION 1. SHORT TITLE.\n\n This Act may be...,"Preservation of Localism, Program Diversity, a...","Preservation of Localism, Program Diversity, a..."
5013,vortex instabilities are observed throughout t...,vortices have been postulated at a range of si...,vortices are observed at the smallest scales i...
5014,Daniel Levy reportedly told the Tottenham Hots...,Tottenham Hotspur chairman Daniel Levy has tol...,Daniel Levy reportedly told the Tottenham supp...


In [4]:
# Get the ROUGE score for the foundation model to compare performance before and after fine-tuning

# Load foundational model and tokenizer
mdl =  'facebook/bart-large' 
pt_model = BartForConditionalGeneration.from_pretrained(mdl)
pt_tokenizer = BartTokenizer.from_pretrained(mdl)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pt_model = pt_model.to(device) # move the model to the GPU if available, otherwise the CPU

In [10]:
# Baseline: the pre-trained facebook/bart-large model without fine-tuning
rouge = load_metric('rouge')

# Function to generate summaries
def pt_generate_summary(text):
    # Tokenize the text, truncate it to 512 tokens, and move the tensors to the target device
    inputs = pt_tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)
    # Generate summary IDs with beam search (4 beams, 40 to 150 tokens)
    summary_ids = pt_model.generate(inputs.input_ids, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    # Decode the summary IDs to text, skipping special tokens
    summary = pt_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [None]:
# Generate summaries
pt_generated_summaries = [pt_generate_summary(text) for text in input_texts] 

In [7]:
#compute rouge metrics 
rouge_scores = rouge.compute(predictions=pt_generated_summaries, references=target_texts)

In [9]:
print(rouge_scores) # metrics before fine-tuning

{'rouge1': AggregateScore(low=Score(precision=0.3048137427438569, recall=0.4335034041266581, fmeasure=0.28711691368110964), mid=Score(precision=0.30968324775739353, recall=0.4396760585008399, fmeasure=0.29054592962775116), high=Score(precision=0.3144532671635308, recall=0.44555749618641843, fmeasure=0.2936845125812506)), 'rouge2': AggregateScore(low=Score(precision=0.12263017171630873, recall=0.17323261189957923, fmeasure=0.11361270604137685), mid=Score(precision=0.12535111393152845, recall=0.17781582343023616, fmeasure=0.11609005370534525), high=Score(precision=0.12848855775106713, recall=0.1819071163626353, fmeasure=0.11859315788564989)), 'rougeL': AggregateScore(low=Score(precision=0.19789785525661754, recall=0.28564791207596146, fmeasure=0.1856701768667198), mid=Score(precision=0.20093013240942154, recall=0.2907883810753654, fmeasure=0.1878558128893604), high=Score(precision=0.2041755129713334, recall=0.29555510735902146, fmeasure=0.18999094096418412)), 'rougeLsum': AggregateScore(

In [8]:
for key, value in rouge_scores.items(): # mid values before fine-tuning
    print(f"{key}: {value.mid}")

rouge1: Score(precision=0.30968324775739353, recall=0.4396760585008399, fmeasure=0.29054592962775116)
rouge2: Score(precision=0.12535111393152845, recall=0.17781582343023616, fmeasure=0.11609005370534525)
rougeL: Score(precision=0.20093013240942154, recall=0.2907883810753654, fmeasure=0.1878558128893604)
rougeLsum: Score(precision=0.24092968768212347, recall=0.34497937380932003, fmeasure=0.22708400391795613)
