## Machine Learning for Topic News Title Summarization 

- In the era of digital information, the volume of news content available to readers has grown exponentially, making it increasingly hard for individuals to stay informed without becoming overwhelmed. This project applies machine learning (ML) techniques to news summarization, aiming to improve the efficiency, accuracy, and readability of summaries so that readers can grasp the essence of a story without reading the full article.

- Stakeholders include individual readers, news organizations, the education sector, and potentially government bodies that rely on swift and accurate information dissemination. Improved news summarization models can transform media consumption by providing accessible, succinct summaries of complex news stories, thereby enhancing public knowledge and engagement. More broadly, better news summarization techniques could pave the way for similar advances in summarizing other forms of text, such as academic literature, legal documents, and social media feeds.

- Potential models we will explore:
    - BERT extractive summarization
    - Fine-tuned T5-small
    - Mamba

In [1]:
from bert_score import score
import pandas as pd
from tqdm import tqdm
import re
from rouge import Rouge
import os

#### 1. Extractive BERT Summarization without Fine-Tuning

In [2]:
from summarizer import Summarizer

# Initialize the extractive summarizer with its default BERT model
model = Summarizer()

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
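# Load the test set and collapse line breaks inside each article into single spaces
test_set = pd.read_csv('data/test_set.csv')
test_set['content'] = test_set['content'].apply(lambda x: re.sub(r'[\r\n]+', ' ', x))
# Keep a five-row copy for quick experiments
test = test_set.head().copy()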
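
To make the cleaning step concrete, here is a minimal sketch of what the regular expression does to a single string. The sample text is invented for illustration and does not come from `data/test_set.csv`.

```python
import re

# Hypothetical sample text, not taken from the dataset
raw = "Line one\r\nLine two\n\nLine three"

# Each run of carriage returns and/or newlines collapses into a single space
cleaned = re.sub(r'[\r\n]+', ' ', raw)
print(cleaned)
```

Because the character class is followed by `+`, a Windows-style `\r\n` or a blank line (`\n\n`) becomes one space rather than two.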

In [2]:
# Use the model to generate a one-sentence extractive summary per article
def generate_summary(texts, model):
    summaries = []
    for context in tqdm(texts):
        # num_sentences=1 asks the summarizer for a single sentence
        summaries.append(model(context, num_sentences=1))
    return summaries

# Skip generation if test_set_summary_bert.csv already exists
if os.path.exists('data/test_set_summary_bert.csv'):
    print('data/test_set_summary_bert.csv already exists, skipping the generation of summaries')
else:
    # Assume test_set['content'] holds the article texts
    summaries = generate_summary(test_set['content'].tolist(), model)
    test_set['summary_bert'] = summaries
    test_set = test_set[['data_id', 'title', 'content_length', 'category_level_1', 'category_level_2', 'summary_bert']]
    # Save the result
    test_set.to_csv('data/test_set_summary_bert.csv', index=False)

data/test_set_summary_bert.csv already exists, skipping the generation of summaries
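
For intuition about what an extractive summarizer does when asked for one sentence, the sketch below picks the sentence whose vector lies closest to the mean of all sentence vectors. This is a toy under stated assumptions: it uses bag-of-words counts instead of BERT embeddings, a naive punctuation-based sentence splitter, and a hypothetical helper name (`pick_central_sentence`). The `summarizer` library's actual selection procedure may differ.

```python
import math
import re
from collections import Counter


def pick_central_sentence(text):
    """Return the sentence closest (by cosine) to the centroid of all sentences."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return ''
    # Bag-of-words vector per sentence (a stand-in for BERT embeddings)
    bags = [Counter(re.findall(r'\w+', s.lower())) for s in sentences]
    vocab = sorted(set().union(*bags))
    vectors = [[bag[w] for w in vocab] for bag in bags]
    # Centroid = element-wise mean of the sentence vectors
    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    best = max(range(len(sentences)), key=lambda i: cosine(vectors[i], centroid))
    return sentences[best]


# Invented article text for illustration
article = ("The council approved the budget. "
           "The budget funds new schools. "
           "Rain is expected tomorrow.")
print(pick_central_sentence(article))
```

The selected sentence is always copied verbatim from the input, which is the defining property of extractive (as opposed to abstractive) summarization.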


#### 1.1. Calculate the ROUGE Score:

In [3]:
filename = 'data/test_set_summary_bert.csv'
metric_df = pd.read_csv(filename)

In [14]:
# Average the ROUGE F-scores of 'summary_bert' (hypothesis) against 'title' (reference)
def calculate_average_rouge(df, summary_col='summary_bert', reference_col='title'):
    rouge = Rouge()
    scores = {'rouge-1': {'f': []}, 'rouge-2': {'f': []}, 'rouge-l': {'f': []}}

    for _, row in df.iterrows():
        # get_scores(hypothesis, reference) returns a one-element list for a single pair;
        # the result is named row_scores to avoid confusion with bert_score's score()
        row_scores = rouge.get_scores(row[summary_col], row[reference_col])[0]
        for key in row_scores:
            scores[key]['f'].append(row_scores[key]['f'])

    # Average the per-row F-scores for each metric
    avg_scores = {key: {'f': sum(values['f']) / len(values['f'])} for key, values in scores.items()}

    return avg_scores

In [10]:
avg_rouge_scores = calculate_average_rouge(metric_df)
print("Average ROUGE scores:", avg_rouge_scores)

Average ROUGE scores: {'rouge-1': {'f': 0.20443960614422044}, 'rouge-2': {'f': 0.07319228343811812}, 'rouge-l': {'f': 0.17942339480933162}}
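
To show what a ROUGE-1 F-score measures, here is a minimal hand-rolled sketch based on clipped unigram overlap. The function name `rouge1_f` and the example sentences are assumptions for illustration. The `rouge` package's tokenization and n-gram counting may differ, so its values need not match this sketch.

```python
from collections import Counter


def rouge1_f(hypothesis, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())  # share of hypothesis words matched
    recall = overlap / sum(ref_counts.values())     # share of reference words matched
    return 2 * precision * recall / (precision + recall)


# Invented sentences for illustration
example_f = rouge1_f("the cat sat on the mat", "the cat is on the mat")
print(round(example_f, 4))
```

Here five of the six words in each sentence overlap, so precision and recall are both 5/6. A one-sentence extract compared with a short title typically shares far fewer words, which is consistent with the low averages reported above.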


#### 1.2. Calculate the BERTScore:

In [8]:
# Calculate BERTScore
P, R, F1 = score(metric_df['summary_bert'].to_list(), metric_df['title'].to_list(), lang='en', verbose=True)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


computing greedy matching.


done in 80.73 seconds, 15.08 sentences/sec


In [9]:
print(f"Precision: {P.mean():.4f}, Recall: {R.mean():.4f}, F1 score: {F1.mean():.4f}")

Precision: 0.8582, Recall: 0.8844, F1 score: 0.8708
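
The "greedy matching" step in the log can be illustrated with a small sketch: each candidate token is paired with its most similar reference token (precision), and each reference token with its most similar candidate token (recall). The helper names and the 2-d vectors below are hypothetical. The real `bert_score` package works on contextual embeddings from a pretrained model (`roberta-large` in the log above) and offers options such as IDF weighting that this sketch omits.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy cosine matching of token vectors."""
    sims = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: each candidate token takes its best-matching reference token
    precision = sum(max(row) for row in sims) / len(cand_vecs)
    # Recall: each reference token takes its best-matching candidate token
    recall = sum(max(col) for col in zip(*sims)) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy 2-d "embeddings" (hypothetical values, not real model outputs)
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
p, r, f1 = greedy_match_scores(candidate, reference)
print(f"Precision: {p:.4f}, Recall: {r:.4f}, F1 score: {f1:.4f}")
```

In this toy case the reference token is fully covered (recall 1.0), while only one of the two candidate tokens finds a match (precision 0.5). The same asymmetry, with summaries longer than titles, is one plausible reading of recall exceeding precision in the results above.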
