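### Using Transformers for Summarization
The core idea is to condense a long input text into a concise, coherent summary. Summarization is a generative (sequence-to-sequence) task.
1. Load a pretrained model
* Choose a pretrained generative model such as BART, T5, or Pegasus. These models have been pretrained on large text corpora and already generate fluent text.
2. Preprocess the data
* Tokenization: use the pretrained model's tokenizer (e.g. `BartTokenizer`, `T5Tokenizer`) to tokenize both the input text and the summary. Truncate the input so that it does not exceed the model's maximum input length (e.g. 1024 tokens); the summary usually has a length limit as well (e.g. 150 tokens).
* Labels: the summary serves as the target label. It goes through the same tokenization and is used to compute the training loss.
3. Set the training arguments
* Use `TrainingArguments` to set the hyperparameters needed for training, including:
    * Batch size: the number of examples processed in each training step.
    * Learning rate: the step size of each model update; the optimizer is typically AdamW.
    * Number of epochs: how many full passes are made over the dataset.
    * Gradient clipping: guards against exploding gradients, which is especially useful in generation tasks.
4. Train the model
* The Trainer API provided by Hugging Face simplifies training. It manages the training loop, validation, and model saving automatically.
    * Pass the preprocessed data to the model and fine-tune it with `Trainer`.
    * During training, the model learns to turn long texts into short summaries that resemble the reference summaries.
5. Evaluate the model
* After fine-tuning, evaluate the model's performance. Evaluation usually relies on the standard metrics for generation tasks:
    * ROUGE: measures the lexical (n-gram) overlap between the generated summary and the reference summary.
    * BLEU: another common metric for generation tasks, better suited to shorter generated texts.
6. Generate summaries (inference)
* Once fine-tuning is complete, the model can be used to generate summaries. Inference consists of:
    * Input processing: tokenize the long text and feed it to the model.
    * Summary generation: call the `generate` function to produce the summary. Different decoding strategies are available, such as greedy search or beam search, and the choice affects the quality of the generated summary.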
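
To make the ROUGE metric from step 5 concrete, here is a minimal sketch of ROUGE-1 (unigram overlap) in plain Python. It assumes lowercase whitespace tokenization and skips stemming, so its scores will differ from those of a full ROUGE implementation; the function name `rouge1` is illustrative and not part of any library.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: clipped unigram overlap between two texts."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Here five of the six candidate tokens also occur in the reference, so precision, recall, and F1 all equal 5/6.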
    


### 1. Imports

In [14]:
import torch
from transformers import BartTokenizer, BartForConditionalGeneration
from datasets import load_dataset
import pandas as pd
from transformers import Trainer, TrainingArguments

### 2. Load the model and tokenizer

In [2]:
# Load the BART model and tokenizer
model_name = 'facebook/bart-large-cnn'
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

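### 3. Load the dataset

cnn_dailymail is a public dataset widely used for text summarization. It was collected from two news websites, CNN and the Daily Mail.

Dataset contents:
1. News article: the long news article to be summarized, usually several paragraphs of detailed reporting.
2. Summary (Highlights): each article comes with a short human-written summary, usually one paragraph or a few sentences covering the core of the story. This part is called the "Highlights".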

In [3]:
# Load the CNN/DailyMail dataset
dataset = load_dataset('cnn_dailymail', '3.0.0', split='train')

Generating train split: 100%|██████████| 287113/287113 [00:01<00:00, 235189.88 examples/s]
Generating validation split: 100%|██████████| 13368/13368 [00:00<00:00, 274507.14 examples/s]
Generating test split: 100%|██████████| 11490/11490 [00:00<00:00, 239698.36 examples/s]


In [4]:
print(dataset)

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 287113
})


In [6]:
dataset_df=pd.DataFrame(dataset)

In [7]:
dataset_df.head()

| | article | highlights | id |
|---|---|---|---|
| 0 | LONDON, England (Reuters) -- Harry Potter star... | Harry Potter star Daniel Radcliffe gets £20M f... | 42c027e4ff9730fbb3de84c1af0d2c506e41c3e4 |
| 1 | Editor's note: In our Behind the Scenes series... | Mentally ill inmates in Miami are housed on th... | ee8871b15c50d0db17b0179a6d2beab35065f1e9 |
| 2 | MINNEAPOLIS, Minnesota (CNN) -- Drivers who we... | NEW: "I thought I was going to die," driver sa... | 06352019a19ae31e527f37f7571c6dd7f0c5da37 |
| 3 | WASHINGTON (CNN) -- Doctors removed five small... | Five small polyps found during procedure; "non... | 24521a2abb2e1f5e34e6824e0f9e56904a2b0e88 |
| 4 | (CNN) -- The National Football League has ind... | NEW: NFL chief, Atlanta Falcons owner critical... | 7fe70cc8b12fab2d0a258fababf7d9c6b5e1262a |


### 4. Data preprocessing

In [30]:
# Preprocessing function that prepares the training data
def preprocess_function(examples):
    inputs = examples['article']

    # Tokenize the articles with padding and truncation enabled.
    # Plain Python lists are returned (no return_tensors): dataset.map stores
    # lists, and the Trainer converts each batch to tensors later.
    model_inputs = tokenizer(inputs, max_length=1024, padding='max_length', truncation=True)

    # Tokenize the summaries as targets, also with padding and truncation.
    # text_target replaces the deprecated as_target_tokenizer() context manager.
    labels = tokenizer(text_target=examples['highlights'], max_length=128, padding='max_length', truncation=True)

    # Attach the label input_ids to model_inputs
    model_inputs['labels'] = labels['input_ids']
    return model_inputs
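
One caveat with `padding='max_length'` on the labels: padded positions count toward the loss unless they are masked. A common convention (an assumption here, not something this notebook does) is to replace pad token ids in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. The helper `mask_label_padding` and the pad id of 1 below are illustrative; in practice the value would come from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_label_padding(label_batch, pad_token_id):
    """Replace pad token ids with IGNORE_INDEX in a batch of label id lists."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in label_ids]
        for label_ids in label_batch
    ]


# Toy batch: two already-padded label sequences, assuming pad_token_id == 1.
labels = [[0, 8774, 2, 1, 1], [0, 713, 16, 10, 2]]
masked = mask_label_padding(labels, pad_token_id=1)
print(masked)
```

The helper returns new lists and leaves the input batch unchanged, so it could be applied to `labels['input_ids']` inside a preprocessing function before the labels are stored.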


In [31]:
# Preprocess the dataset
train_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["article", "highlights", "id"])

Map: 100%|██████████| 287113/287113 [1:31:59<00:00, 52.02 examples/s] 


### 5. Model training

In [40]:
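# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',           # where checkpoints are saved
    num_train_epochs=1,               # number of training epochs
    per_device_train_batch_size=8,    # training batch size per device
    per_device_eval_batch_size=16,    # evaluation batch size per device
    warmup_steps=500,                 # learning-rate warmup steps
    weight_decay=0.01,                # weight decay
    logging_dir='./logs',             # where logs are written
    logging_steps=10,                 # log every 10 steps
    save_steps=500,                   # save a checkpoint every 500 steps
    save_total_limit=3,               # maximum number of checkpoints to keep
)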

In [37]:
print(train_dataset)

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 287113
})


In [47]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

Training is far too slow with this many samples, so the run below was interrupted early.
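
A back-of-the-envelope estimate shows why a full run is impractical on this hardware. The sketch below assumes a single device, no gradient accumulation, and roughly 80 seconds per step (the rate shown in the progress output of this run); `steps_per_epoch` is a helper name introduced for illustration.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps per epoch on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size)


num_examples = 287113      # rows in the train split
batch_size = 8             # per_device_train_batch_size
seconds_per_step = 80      # assumed, read off the progress bar

total_steps = steps_per_epoch(num_examples, batch_size)
estimated_hours = total_steps * seconds_per_step / 3600

print(total_steps)               # 35890
print(round(estimated_hours))    # 798
```

The step count matches the total of 35890 in the progress bar. At that rate one epoch would take about a month, which is why fine-tuning on a small subset first is a common way to check the pipeline end to end.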

In [48]:
# Start training
trainer.train()

# Error analysis: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
# 1. The input (label) sequences have inconsistent lengths
# 2. padding or truncation was not enabled
# Either cause prevents the batch from being converted to a tensor

  0%|          | 0/1 [01:02<?, ?it/s]
  0%|          | 10/35890 [12:55<807:45:37, 81.05s/it]

{'loss': 0.8208, 'grad_norm': 3.5324137210845947, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}


  0%|          | 20/35890 [25:53<822:48:56, 82.58s/it]

{'loss': 0.7271, 'grad_norm': 2.907181739807129, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0}


  0%|          | 30/35890 [37:07<553:57:21, 55.61s/it]

{'loss': 0.5066, 'grad_norm': 2.4088127613067627, 'learning_rate': 3e-06, 'epoch': 0.0}


  0%|          | 40/35890 [50:47<831:59:08, 83.55s/it]  

{'loss': 0.3936, 'grad_norm': 2.5340144634246826, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.0}


  0%|          | 45/35890 [56:13<638:28:09, 64.12s/it]

KeyboardInterrupt: 
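
The `learning_rate` values in the log rise by 1e-6 every 10 steps, which is consistent with linear warmup over `warmup_steps=500` toward a peak of 5e-5. The peak is an assumption: the notebook does not set `learning_rate`, and 5e-5 is the `TrainingArguments` default. A minimal sketch of the warmup phase only (the decay that follows warmup is left out):

```python
def warmup_lr(step: int, peak_lr: float = 5e-5, warmup_steps: int = 500) -> float:
    """Learning rate during linear warmup; returns peak_lr once warmup is over.

    The real schedule decays after warmup; that part is not modeled here.
    """
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


for step in (10, 20, 30, 40):
    print(step, warmup_lr(step))
```

With these assumptions, steps 10 through 40 give approximately 1e-6, 2e-6, 3e-6, and 4e-6, in line with the logged values.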

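### 6. Save the model

In [None]:
# Save the fine-tuned model, plus the tokenizer so the directory can be reloaded on its own
model.save_pretrained('./fine_tuned_bart')
tokenizer.save_pretrained('./fine_tuned_bart')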

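### 7. Test inference

In [None]:
# Test inference
def generate_summary(text):
    # Tokenize and move the tensors to the model's device (e.g. the GPU after training)
    inputs = tokenizer(text, max_length=1024, return_tensors='pt', truncation=True).to(model.device)
    # Pass the attention mask along with the input ids; num_beams=4 selects beam search
    summary_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)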
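
The `num_beams=4` argument above selects beam search. To illustrate how it differs from greedy search, here is a toy sketch that does not use the BART model at all: the probability table `NEXT_TOKEN_PROBS` is made up, and the real `generate` additionally applies `length_penalty`, `min_length`, and other controls that are left out here.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}


def beam_search(num_beams: int, max_steps: int = 5):
    """Return (tokens, log_prob) of the best sequence; num_beams=1 is greedy search."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams[0]


greedy_tokens, greedy_score = beam_search(num_beams=1)
beam_tokens, beam_score = beam_search(num_beams=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

Greedy search commits to `a` because it is the most likely first token and ends with a sequence of probability 0.33. With two beams, the less likely first token `b` survives long enough to reach the sequence with probability 0.36.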

In [None]:
# Example
test_article = """
The future of artificial intelligence (AI) is an exciting and ever-evolving field. In recent years, AI has made remarkable strides in a variety of industries, including healthcare, finance, and education. The ability of machines to learn from data and make decisions autonomously is transforming the way businesses operate and individuals live their lives.

One of the key advancements in AI is deep learning, a subset of machine learning that uses neural networks with many layers. Deep learning has led to breakthroughs in image recognition, natural language processing, and autonomous vehicles. As AI continues to evolve, experts predict even more innovative applications, such as personalized medicine, advanced robotics, and smarter cities.

However, with these advancements come challenges. Ethical concerns, such as the potential for AI to replace jobs and the risks of biased algorithms, are at the forefront of discussions about the future of AI. Governments and organizations are now focusing on developing frameworks to ensure that AI is used responsibly and transparently.

In conclusion, AI has the potential to revolutionize industries and improve quality of life, but it is essential to address the ethical implications and ensure that its development is guided by principles of fairness and accountability.
"""

summary = generate_summary(test_article)
print("Generated Summary:", summary)