1. Find a suitable dataset containing article texts and titles.
2. Choose a suitable metric for our task.
3. Fine-tune a pre-trained model for title generation on Colab, monitoring the chosen metric on the validation set using TensorBoard, and saving the model’s checkpoints on Google Drive (so that we can resume training in case Colab shuts down the connection).
4. Upload the model to the Hugging Face Hub for everyone to use.
5. Build an interactive demo with Streamlit and deploy it to Hugging Face Spaces.
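
Step 3 depends on being able to pick training back up from the most recent saved checkpoint. The following is a minimal sketch of how that lookup could work, assuming checkpoints are stored as `checkpoint-<step>` folders inside the output directory (the naming seen in the checkpoint paths later in these notes). The helper name `find_latest_checkpoint` is hypothetical, not part of any library.

```python
import re
from pathlib import Path
from typing import Optional

# Matches folder names such as "checkpoint-200" and captures the step number.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the checkpoint-N folder with the largest N, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return str(best_path) if best_path is not None else None
```

The result can be passed to `trainer.train(resume_from_checkpoint=...)`; when it is `None`, training starts from scratch. `transformers.trainer_utils.get_last_checkpoint` offers a similar lookup, so the helper above is mainly useful for seeing what happens under the hood.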

## 1. Scraping the dataset
https://github.com/codelucas/newspaper

Skipped for now. The dataset is downloaded directly from

https://www.kaggle.com/datasets/fabiochiusano/medium-articles

to /datasets/medium_articles.csv

## 2. Explore the dataset and create training and test sets

In [1]:
# load raw set
import pandas
raw_dataset_path = "/data/agent_h/datasets/medium_articles.csv"
tmp_df = pandas.read_csv(raw_dataset_path)
dicts = tmp_df.to_dict('records')

In [2]:
# explore data format
from pprint import pprint
print(len(dicts))
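# build a copy with the text shortened to 200 characters for a compact preview
sample_dict = dicts[0].copy()
sample_dict['text'] = sample_dict['text'][:200]
# note: this prints the full record; pass sample_dict instead to print the preview
pprint(dicts[0])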

192368
{'authors': "['Ryan Fan']",
 'tags': "['Mental Health', 'Health', 'Psychology', 'Science', 'Neuroscience']",
 'text': 'Photo by Josh Riemer on Unsplash\n'
         '\n'
         'Merry Christmas and Happy Holidays, everyone!\n'
         '\n'
         'We just wanted everyone to know how much we appreciate everyone and '
         'how thankful we are for all our readers and writers here. We '
         'wouldn’t be anywhere without you, so thank you all for bringing '
         'informative, vulnerable, and important pieces that destigmatize '
         'mental illness and mental health.\n'
         '\n'
         'Without further ado, here are ten of our top stories from last week, '
         'all of which were curated:\n'
         '\n'
         '“Just as the capacity to love and inspire is universal so is the '
         'capacity to hate and discourage. Irrespective of gender, race, age '
         'or religion none of us are exempt from aggressive proclivities. '
         'Those wh

In [3]:
# use dataset lib, https://huggingface.co/docs/datasets/en/loading
# best way would be raw_data -> process into train -> save as csv or json chunks -> load as dataset
import transformers
from datasets import load_dataset, load_metric, Dataset

# medium_datasets = load_dataset("csv",
#                                data_files=raw_dataset_path)

# Dataset.from_list needs every key to hold the same value type in all rows,
# so first check which types actually occur
timestamp_types = set(type(x['timestamp']) for x in dicts)
print(timestamp_types)
# clean up: keep only rows whose values are all strings
filtered_data = [
    data_line for data_line in dicts
    if all(isinstance(item, str) for item in data_line.values())
]
print(len(filtered_data))
# use only the first 5000 rows
medium_dataset = Dataset.from_list(filtered_data[:5000])





{<class 'str'>, <class 'float'>}
192361
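
The `float` in the type set above most likely comes from missing values: pandas represents an empty CSV cell as `float('nan')`, so a row with a missing timestamp or title fails the all-strings check. The sketch below reproduces the filter in plain Python on made-up records (they are illustrative and not taken from the dataset).

```python
# Made-up records imitating what DataFrame.to_dict('records') returns.
# The second and third rows each contain a missing value stored as float NaN.
records = [
    {"title": "A title", "text": "Some text.", "timestamp": "2020-12-26"},
    {"title": "Another title", "text": "More text.", "timestamp": float("nan")},
    {"title": float("nan"), "text": "Text without a title.", "timestamp": "2021-01-02"},
]


def all_values_are_str(record: dict) -> bool:
    """True when every value in the record is a string."""
    return all(isinstance(value, str) for value in record.values())


kept = [r for r in records if all_values_are_str(r)]
value_types = {type(r["timestamp"]).__name__ for r in records}

print(sorted(value_types))  # ['float', 'str']
print(len(kept))            # 1
```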


In [4]:
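# process data for training https://huggingface.co/docs/datasets/en/process
# drop very short articles and titles, then split before any further processing
# so that nothing leaks from the test set into training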
medium_dataset = medium_dataset.filter(
    lambda example: (len(example['text']) >= 500) and
    (len(example['title']) >= 20)
)
medium_dataset = medium_dataset.train_test_split(test_size=1000)


Filter: 100%|███████████████████████████████████████████████████████| 5000/5000 [00:00<00:00, 90732.38 examples/s]
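
The `train_test_split` call above does not fix a seed, so the held-out rows can change between runs. `Dataset.train_test_split` accepts a `seed` argument for this. The sketch below shows the same filter-then-split idea in plain Python on made-up rows; the function name `filter_and_split` is hypothetical and the shuffling is not the algorithm the `datasets` library uses.

```python
import random


def filter_and_split(rows, test_size, seed):
    """Drop short rows, then shuffle with a fixed seed and hold out test_size rows."""
    kept = [r for r in rows if len(r["text"]) >= 500 and len(r["title"]) >= 20]
    if test_size >= len(kept):
        raise ValueError("test_size must be smaller than the number of kept rows")
    shuffled = kept[:]
    random.Random(seed).shuffle(shuffled)
    return {"train": shuffled[test_size:], "test": shuffled[:test_size]}


# Made-up rows: every fifth one has a title that is too short.
rows = [
    {"title": "short" if i % 5 == 0 else f"A sufficiently long title {i}",
     "text": "x" * 600}
    for i in range(50)
]
split_a = filter_and_split(rows, test_size=10, seed=42)
split_b = filter_and_split(rows, test_size=10, seed=42)

print(len(split_a["train"]), len(split_a["test"]))  # 30 10
print(split_a["test"] == split_b["test"])           # True
```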


In [5]:
# process for training
import nltk
import string
nltk.download('punkt')
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("/data/agent_h/llms/umt5-small")
tokenizer = AutoTokenizer.from_pretrained("/data/agent_h/llms/umt5-small")

prefix = "summarize: "
max_input_length = 512
max_target_length = 64


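def clean_text(text):
    """
    Split the text into sentences and lines, drop lines that do not end in
    punctuation (mostly headings and captions), and join the rest with newlines.
    """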
    sentences = nltk.sent_tokenize(text.strip())
    sentences_cleaned = [s for sent in sentences for s in sent.split("\n")]
    sentences_cleaned_no_titles = [sent for sent in sentences_cleaned
                                 if len(sent) > 0 and
                                 sent[-1] in string.punctuation]
    text_cleaned = "\n".join(sentences_cleaned_no_titles)
    return text_cleaned

#pprint(clean_text(medium_dataset['train'][0]['text']))

def preprocess_data(examples):
    """Tokenize the cleaned article texts as inputs and the titles as labels."""
    texts_cleaned = [clean_text(text) for text in examples["text"]]
    inputs = [prefix + text for text in texts_cleaned]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the titles as targets; text_target replaces the deprecated
    # tokenizer.as_target_tokenizer() context manager in recent transformers releases
    labels = tokenizer(text_target=examples["title"], max_length=max_target_length,
                       truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip()))
                      for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) 
                      for label in decoded_labels]
    
    # Compute ROUGE scores (metric is a global that is loaded later, in the training cell)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels,
                            use_stemmer=True)

    # Extract ROUGE f1 scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length to metrics
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id)
                      for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

# examples = medium_dataset['test'][:100]
# tmp_data = preprocess_data(examples)
# print(examples['text'][8])
# tokenizer.decode(tmp_data['input_ids'][8])

[nltk_data] Downloading package punkt to /home/agent_h/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
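
To see what the line filter in `clean_text` does, here is a simplified sketch that skips `nltk.sent_tokenize` and only applies the "ends in punctuation" rule per line. The function name is hypothetical, the per-line `strip()` is an addition, and the sample text is partly made up.

```python
import string


def keep_sentence_like_lines(text: str) -> str:
    """Keep only non-empty lines whose last character is ASCII punctuation."""
    lines = [line.strip() for line in text.strip().split("\n")]
    kept = [line for line in lines if line and line[-1] in string.punctuation]
    return "\n".join(kept)


sample = (
    "Photo by Josh Riemer on Unsplash\n"
    "\n"
    "Merry Christmas and Happy Holidays, everyone!\n"
    "\n"
    "Top stories\n"
    "Here are ten of our top stories from last week."
)
print(keep_sentence_like_lines(sample))
```

The photo credit and the heading are dropped, and the two real sentences remain. One consequence worth keeping in mind: `string.punctuation` contains only ASCII characters, so a line ending in a typographic quote (`”`) or a full-width period (`。`) would also be dropped by this rule.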


In [6]:
tokenized_datasets = medium_dataset.map(preprocess_data,
                                        batched=True)

Map: 100%|████████████████████████████████████████████████████████████| 3627/3627 [00:05<00:00, 685.43 examples/s]
Map: 100%|████████████████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 703.02 examples/s]


In [8]:
# start training

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

batch_size = 8
base_model = "/data/agent_h/llms/umt5-small"
model_name = "umt5-small-medium-title-generation"
model_dir = f"/data/agent_h/checkpoints/{model_name}"

args = Seq2SeqTrainingArguments(
    model_dir,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="steps",
    save_steps=200,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
    report_to="tensorboard"
)
data_collator = DataCollatorForSeq2Seq(tokenizer)
metric = load_metric("rouge")

def model_init():
    return AutoModelForSeq2SeqLM.from_pretrained(base_model)

trainer = Seq2SeqTrainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)




You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
Using the latest cached version of the module from /home/agent_h/.cache/huggingface/modules/datasets_modules/metrics/rouge/457c405cab0bd19db749b46bf15a1a3cff4d54f50e7ab868c293e5ece288425e (last modified on Tue May 21 16:40:41 2024) since it couldn't be found locally at rouge, or remotely on the Hugging Face Hub.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
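
`metric_for_best_model="rouge1"` selects checkpoints by the ROUGE-1 F1 score that `compute_metrics` extracts. As a rough illustration of what that number measures, the sketch below computes unigram-overlap F1 in plain Python. It is a simplification and not the loaded metric: it uses lowercase whitespace tokens, no stemming, and no aggregation over a batch. The example strings are made up.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps, per token, the smaller of the two counts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("how to train a title generator",
                  "train a title generator with t5")
print(round(score * 100, 4))  # 66.6667
```

Four of the six tokens overlap on each side, so precision and recall are both 4/6 and the F1 is the same value.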


In [None]:
%load_ext tensorboard
%tensorboard --logdir '{model_dir}'/runs

In [None]:
trainer.train()
# reference notebook: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb
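
A quick back-of-the-envelope check of how often this run evaluates and saves, given the training arguments and the 3627 training examples reported by the `Map` output. The sketch assumes a single device and the default `gradient_accumulation_steps=1`, and it only counts step-triggered events.

```python
import math

train_examples = 3627   # size of the train split, from the Map output above
batch_size = 8          # per_device_train_batch_size
num_epochs = 1          # num_train_epochs
eval_steps, save_steps = 100, 200

# Assumption: one device and no gradient accumulation, so one batch = one step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_at = list(range(eval_steps, total_steps + 1, eval_steps))
save_at = list(range(save_steps, total_steps + 1, save_steps))

print(total_steps)  # 454
print(eval_at)      # [100, 200, 300, 400]
print(save_at)      # [200, 400]
```

Because `load_best_model_at_end=True` requires `save_steps` to be a multiple of `eval_steps`, every saved checkpoint has a matching evaluation, which is the case here (200 is a multiple of 100).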

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("/data/agent_h/llms/umt5-small")
tokenizer = AutoTokenizer.from_pretrained("/data/agent_h/llms/umt5-small")

inputs = tokenizer(
    "国家",  # Chinese for "country"; a quick sanity check of the base model
    return_tensors="pt",
)
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs))

In [21]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

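# two fine-tuned checkpoints; the second assignment overrides the first,
# so the "-zh" checkpoint is the one that gets loaded
model_ckpt = "/data/agent_h/checkpoints/umt5-small-medium-title-generation/checkpoint-32000"
model_ckpt = "/data/agent_h/checkpoints/umt5-small-medium-title-generation-zh/checkpoint-54000"
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# three candidate inputs; each assignment overwrites the previous one, so only
# the last (a Chinese news item about withdrawn IPO applications) is used
# the Chinese prefix "生成标题：" means "generate title:"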
text = """summarize: Combining both modern and traditional style architectures, with one side of the city being modernized and renovated to fit the times, and the other half still offering traditional hutong districts.[18] Beijing is one of the oldest cities in the world, with a rich history dating back over three millennia. As the last of the Four Great Ancient Capitals of China, Beijing has been the political center of the country for most of the past eight centuries,[19] and was the largest city in the world by population for much of the second millennium CE.[20] With mountains surrounding the inland city on three sides, in addition to the old inner and outer city walls, Beijing was strategically poised and developed to be the residence of the emperor and thus was the perfect location for the imperial capital. The city is renowned for its opulent palaces, temples, parks, gardens, tombs, walls and gates.[21] Beijing is one of the most important tourist destinations of the world. In 2018, Beijing was the second highest earning tourist city in the world after Shanghai.[22] Beijing is home to many national monuments and museums and has seven UNESCO World Heritage Sites—the Forbidden City, Temple of Heaven, Summer Palace, Ming Tombs, Zhoukoudian Peking Man Site, and parts of the Great Wall and the Grand Canal—all of which are popular tourist locations.[23] Siheyuans, the city's traditional housing style, and hutongs, the narrow alleys between siheyuans, are major tourist attractions and are common in urban Beijing."""
text = """生成标题：近日，世界贸易组织(WTO)对中国第六次贸易政策审议在日内瓦顺利结束。此次审议过程中，中国经贸体制、贸易投资领域取得的新进展等多方面得到积极评价，各成员对中国成为其重要经贸合作伙伴十分重视。对此，专家指出，虽然上半年中国进出口双下降，但中国在全球贸易经济中的地位仍不断上升，尤其是中国外贸新旧动能转换释放出的强劲动力，将推动对外贸易继续回稳、向好，也将为全球贸易增长作出重要贡献。新动能持续积累优势上半年，我国进出口同比下降3.3%，进口、出口分别下降4.7%和2.1%。虽然进出口双下降，但我国外贸新旧动能转换正加快进程，贸易结构不断优化。海关总署数据显示，1-6月，我国一般贸易进出口占进出口总值的56.4%，比去年同期提升1.2个百分点;民企出口增长3.6%，占出口总值的46.6%，占比继续保持首位。“一般贸易占比的持续上升体现出中国自主产品的比重在上升、自主创新能力在增强，我国对外贸易正向高附加值端发展，如高新技术产业等新业态在我国外贸发展中的势头已越来越强劲。而民企出口的快速发展带来了更多活力，外贸中的国内资本和投入品的增加，为我国外贸健康发展及结构优化提供了新动能。”国家发改委对外经济研究所国际合作室主任张建平在接受本报记者采访时说。多边、双边经贸合作不断拓展则为我国外贸提供了更大发展空间。海关总署新闻发言人黄颂平指出，上半年，我国对部分“一带一路”沿线国家出口增长。另外，已有22个国家或地区与我国签署并实施自贸协定，上半年，与上述国家或地区的进出口表现好于同期我国进出口总体降幅。商务部研究院国际市场研究部副主任白明表示，上半年，大型成套产品出口保持正增长，这个领域的商品技术含量高，附加值也比较高，跟一般的传统商品相比，它更是我们发展的一个方向。跨境电商贸易、市场采购贸易等新型外贸商业模式正成为新的外贸增长点。中国贸促会副会长尹宗华指出，去年我国的跨境电子商务规模为5.4万亿元人民币，预计今年可能会达到6.5万亿元人民币，对于促进外贸稳增长、调结构发挥了重要作用。“机器换人”降低成本随着新旧动能转换的持续推进，传统动能这一曾经的外贸主要贡献者正面临困境。海关总署数据显示，截至今年6月份，我国的加工贸易进口、出口已经分别连续18个月和16个月下降。今年上半年，加工贸易进出口下降9.8%，拖累我国外贸进出口整体下降约3个百分点。“加工贸易等传统动能对于当前的中国外贸而言仍很重要，它既可以推动贸易均衡发展，也能够为就业提供保障。对于加工贸易的下降，我们在顺应市场规律的前提下，还要充分发掘其潜力，这包括了结构的改善与量的增长。”张建平说，而要充分发掘这一潜力，则要在保留优势的基础上提高其在价值链中的地位，并不断提高贸易便利化水平。 """
text = """生成标题：据深圳证券交易所近日公告，安徽晶奇网络科技股份有限公司在中国证监会审阅其IPO并在创业板上市申请文件的过程中，该公司与其保荐机构主动要求撤回注册申请文件。值得注意的是，该企业早在2021年就已“过会”。此外，浙江控阀2022年12月创业板IPO过会，一年多未提交注册，今年3月撤回IPO；博菱电器2022年11月创业板IPO过会，过会逾一年未提交注册，最终今年3月撤回IPO。中国人民大学中国资本市场研究院联席院长赵锡军在接受中新社直通车记者采访时表示，从公开资料来看，上述企业之所以主动撤回IPO申请，主要是在IPO自查过程中，发现公司在合规、板块定位、信息披露、会计处理等方面存在问题，及时纠正。企业之所以在IPO问题上如此积极自查自纠，这与当前中国资本市场“严监管”的风气密切相关。"""
inputs = tokenizer(
    text,
    return_tensors="pt",
)
outputs = model.generate(**inputs)
print(text)
print("output: ")
print(tokenizer.batch_decode(outputs))

生成标题：据深圳证券交易所近日公告，安徽晶奇网络科技股份有限公司在中国证监会审阅其IPO并在创业板上市申请文件的过程中，该公司与其保荐机构主动要求撤回注册申请文件。值得注意的是，该企业早在2021年就已“过会”。此外，浙江控阀2022年12月创业板IPO过会，一年多未提交注册，今年3月撤回IPO；博菱电器2022年11月创业板IPO过会，过会逾一年未提交注册，最终今年3月撤回IPO。中国人民大学中国资本市场研究院联席院长赵锡军在接受中新社直通车记者采访时表示，从公开资料来看，上述企业之所以主动撤回IPO申请，主要是在IPO自查过程中，发现公司在合规、板块定位、信息披露、会计处理等方面存在问题，及时纠正。企业之所以在IPO问题上如此积极自查自纠，这与当前中国资本市场“严监管”的风气密切相关。
output: 
['<pad> 晶奇网络科技公司主动撤回IPO申请</s>']
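
The generated title reads roughly "Jingqi Network Technology voluntarily withdraws its IPO application". It still carries `<pad>` and `</s>` because `batch_decode` was called without `skip_special_tokens=True`, the option that `compute_metrics` already uses. Passing that flag is the simpler fix. For strings that have already been decoded, a small post-processing helper along these lines would also work; the function name and the token list are assumptions for illustration.

```python
def strip_special_tokens(decoded: str, special_tokens=("<pad>", "</s>")) -> str:
    """Remove the given special-token markers and surrounding whitespace."""
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    return decoded.strip()


print(strip_special_tokens("<pad> 晶奇网络科技公司主动撤回IPO申请</s>"))
# 晶奇网络科技公司主动撤回IPO申请
```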
