# Generating News Headlines using GPT2
## GPT2
GPT-2 was released in 2019 as part of the paper “Language Models are Unsupervised Multitask Learners”. The largest GPT-2 variant is a 1.5B-parameter transformer-based model that performs remarkably well on a variety of NLP tasks. The most striking aspect of this work is that the authors show how a model trained in an unsupervised fashion (language modeling) achieves state-of-the-art results on several benchmarks in a zero-shot setting, that is, without any task-specific fine-tuning.
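
For reference, the language-modeling objective mentioned above is usually written in the standard autoregressive form below. The notation is chosen here for illustration and is not taken from the notebook: $x_1, \dots, x_T$ are the tokens of a sequence and $\theta$ are the model parameters.

```latex
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1}),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})
```

Fine-tuning on headlines, as done later in this notebook, minimizes the same kind of next-token loss on the headline text.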

## HuggingFace Transformers
Hugging Face Transformers is one of the most popular Python packages for working with transformer-based NLP models. It provides a high-level API to easily load, fine-tune, and re-train models such as GPT-2, BERT, and T5.

## Fake Headlines
The ABC News dataset contains more than a million headlines collected over a period of 17 years (the download link is given in the dataset cell below). We will use this dataset to fine-tune the GPT-2 model. Once fine-tuned, we will use the model to generate some fake headlines.

In [4]:
# !pip3 install scikit-learn==1.5.1
# !pip3 install transformers==4.42.4

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

import torch
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

## Prepare Dataset

In [2]:
# download from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL
# !unzip abcnews.zip

In [2]:
news = pd.read_csv('abcnews-date-text.csv')
news.shape

(1244184, 2)

In [3]:
news.head()

| | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |


In [4]:
X_train, X_test = train_test_split(news.headline_text.tolist(), test_size=0.33, random_state=42)
len(X_train), len(X_test)

(833603, 410581)
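
The split sizes above can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of samples and that the training set receives the remainder, which is consistent with the numbers printed above. `split_sizes` is a helper name made up for this illustration.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the training
    set gets whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 1,244,184 headlines with a 33% test share
print(split_sizes(1_244_184, 0.33))  # (833603, 410581)
```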

In [5]:
# write one headline per line so each split can be read back as plain text
with open('train_dataset.txt', 'w') as f:
  for line in X_train:
    f.write(line)
    f.write("\n")

with open('test_dataset.txt', 'w') as f:
  for line in X_test:
    f.write(line)
    f.write("\n")

In [6]:
# GPT-2 ships without a padding token, so one is registered here
tokenizer = AutoTokenizer.from_pretrained("gpt2", pad_token='<pad>')

train_path = 'train_dataset.txt'
test_path = 'test_dataset.txt'

In [7]:
def load_dataset(train_path, test_path, tokenizer):
    # block_size is the number of tokens per training example
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=4)

    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=4)

    # mlm=False selects causal (next-token) language modeling instead of masked LM
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator

In [8]:
train_dataset, test_dataset, data_collator = load_dataset(train_path, test_path, tokenizer)
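
To make the effect of `block_size` more concrete, here is a minimal sketch of fixed-size chunking. It assumes the file is tokenized as one long stream, cut into consecutive blocks of `block_size` tokens, with a trailing incomplete block dropped. This is an illustration of the idea, not the library's actual code. `chunk_into_blocks` and the integer ids are made up for this example.

```python
from typing import List


def chunk_into_blocks(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Cut a token stream into consecutive blocks of exactly block_size tokens."""
    blocks = []
    # stop early enough that every block is complete; leftover tokens are dropped
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


token_ids = list(range(10))  # stand-in for real token ids
print(chunk_into_blocks(token_ids, block_size=4))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Under this assumption, a block can start or end in the middle of a headline, and with `block_size=4` the model only ever sees four tokens of context per training example. A larger block size would give it more context at the cost of memory.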



## Prepare Model for Training

In [9]:
if torch.cuda.is_available():
    DEVICE = 'cuda'
    Tensor = torch.cuda.FloatTensor
    LongTensor = torch.cuda.LongTensor
    DEVICE_ID = 0
# MPS/Apple Silicon does not work as intended for this pipeline    
# elif torch.backends.mps.is_available():
#     DEVICE = 'mps'
#     Tensor = torch.FloatTensor
#     LongTensor = torch.LongTensor
#     DEVICE_ID = 0
else:
    DEVICE = 'cpu'
    Tensor = torch.FloatTensor
    LongTensor = torch.LongTensor
    DEVICE_ID = -1
print(f"Backend Accelerator Device={DEVICE}")

Backend Accelerator Device=cpu


In [10]:
model = AutoModelForCausalLM.from_pretrained("gpt2")

In [11]:
training_args = TrainingArguments(
    "gpt2-finetuned-headliner",  # output directory
    overwrite_output_dir=True,  # overwrite the contents of the output directory
    num_train_epochs=2,  # number of training epochs
    per_device_train_batch_size=512,  # batch size for training
    per_device_eval_batch_size=256,  # batch size for evaluation
    eval_steps=400,  # update steps between evaluations; only takes effect with eval_strategy="steps"
    save_steps=800,  # save a checkpoint every 800 update steps
    warmup_steps=500,  # number of warmup steps for the learning rate scheduler
    push_to_hub=True,  # requires being logged in to the Hugging Face Hub
    use_cpu=True,  # remove this line if a GPU is available
)
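
As a rough illustration of what `warmup_steps=500` does, the sketch below computes a learning rate that rises linearly from zero during warmup and then decays linearly to zero. It assumes the Trainer defaults of a linear schedule and a base learning rate of `5e-5`. `total_steps=3000` is a made-up value, since the real number depends on the dataset size and batch size, and `linear_warmup_lr` is a name chosen for this example.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=3000):
    """Learning rate at a given update step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # ramp up from 0 to base_lr over the warmup period
        return base_lr * step / max(1, warmup_steps)
    # decay from base_lr down to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 1750, 3000):
    print(step, linear_warmup_lr(step))
```

The rate peaks at the end of warmup (step 500 here) and reaches zero at the final step.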

In [12]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    #prediction_loss_only=True,
)

In [21]:
trainer.train()

In [None]:
trainer.save_model()

## Let us Generate Some Headlines!

In [13]:
# load the fine-tuned model from the output directory used by TrainingArguments above
ft_gpt2_headliner = AutoModelForCausalLM.from_pretrained("./gpt2-finetuned-headliner")

# setup the generation pipeline
headliner = pipeline('text-generation',
                     model=ft_gpt2_headliner, 
                     tokenizer='gpt2',
                     pad_token_id=0,
                     eos_token_id=50256,
                     config={
                         'max_length':8,
                     },
                     device=DEVICE_ID
                    )

In [14]:
def get_headline(headliner_pipeline, seed_text="News"):
  """Generate text from seed_text and keep only the first line, i.e. a single headline."""
  return headliner_pipeline(seed_text)[0]['generated_text'].split('\n')[0]

In [15]:
get_headline(headliner, seed_text="City Council of Sydney")

'City Council of Sydney announces newcastle cup plans'
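
The post-processing in `get_headline` can be tried without a trained model. The sketch below repeats the function and calls it with `fake_headliner`, a hypothetical stand-in that returns the same shape as a text-generation pipeline: a list of dicts with a `generated_text` key. The generated text in the stub is invented for illustration.

```python
def get_headline(headliner_pipeline, seed_text="News"):
    # keep only the text up to the first newline, i.e. a single headline
    return headliner_pipeline(seed_text)[0]['generated_text'].split('\n')[0]


def fake_headliner(seed_text):
    """Hypothetical stand-in for the real pipeline; the output text is made up."""
    return [{'generated_text': seed_text + " opens new library\nanother headline follows"}]


print(get_headline(fake_headliner, seed_text="City Council"))
# City Council opens new library
```

Because the training files contain one headline per line, the model tends to emit a newline and then start a new headline. Splitting on `'\n'` keeps only the first one.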