# Generating News Headlines using GPT2

## GPT2
GPT-2 was released in 2019 as part of the work titled “Language Models are Unsupervised Multitask Learners”. The largest GPT-2 variant is a 1.5B-parameter transformer-based model that performs remarkably well on a variety of NLP tasks. The most striking aspect of this work is that the authors show how a model trained in an unsupervised fashion (language modeling) achieves state-of-the-art performance on several language modeling benchmarks in a zero-shot setting.

## HuggingFace Transformers
Hugging Face Transformers is one of the most popular Python packages for working with transformer-based NLP models. It provides a high-level API to easily load, fine-tune and re-train models such as GPT-2, BERT, T5 and so on.

## Fake Headlines 
The ABC News dataset is a collection of more than a million headlines gathered over a period of 17 years, available [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL). We will use this dataset to fine-tune the GPT-2 model. Once it is fine-tuned, we will use it to generate some fake headlines.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/PacktPublishing/Hands-On-Generative-AI-with-Python-and-TensorFlow-2/blob/master/Chapter_9/transformer_gpt2_finetune_pt.ipynb)

## Install Transformers

In [1]:
!pip install transformers


## Prepare Dataset
+ Unzip the dataset
+ Split into train and test files

In [1]:
# get dataset abc news
!unzip abcnews.zip

Archive:  abcnews.zip
replace abcnews-date-text.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
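
The prompt above appears because the CSV was already extracted by an earlier run. As a minimal sketch of a non-interactive alternative, the archive can be unpacked with Python's standard `zipfile` module, which overwrites existing files without asking. The helper name `extract_archive` is our own and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir, overwriting existing files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Only runs when the archive is present in the working directory.
if Path("abcnews.zip").exists():
    print(extract_archive("abcnews.zip"))
```

Because `extractall` does not prompt, this variant can be re-run safely when the notebook is executed from top to bottom.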


In [2]:
import pandas as pd

In [3]:
news = pd.read_csv('abcnews-date-text.csv')
news.shape

(1186018, 2)

In [4]:
news.head()

| | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |


In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test= train_test_split(news.headline_text.tolist(),test_size=0.33, random_state=42)
len(X_train), len(X_test)

(794632, 391386)
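
The split sizes above can be checked by hand. The sketch below assumes that, for a fractional `test_size`, scikit-learn rounds the test portion up and assigns the remainder to the training set. That assumption is consistent with the numbers printed by the notebook.

```python
import math

n_samples = 1186018  # number of rows reported by news.shape
test_size = 0.33     # fraction passed to train_test_split

# Assumed rule: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 794632 391386
```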

In [6]:
with open('train_dataset.txt','w') as f:
  for line in X_train:
    f.write(line)
    f.write("\n")

In [7]:
with open('test_dataset.txt','w') as f:
  for line in X_test:
    f.write(line)
    f.write("\n")

## Prepare Tokenizer

In [8]:
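from transformers import AutoTokenizer

# GPT-2 ships without a padding token, so we register '<pad>' as one.
# This adds a new entry to the vocabulary, which triggers the warning below.
tokenizer = AutoTokenizer.from_pretrained("gpt2", pad_token='<pad>')

train_path = 'train_dataset.txt'
test_path = 'test_dataset.txt'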

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.


In [9]:
from transformers import TextDataset, DataCollatorForLanguageModeling

def load_dataset(train_path, test_path, tokenizer):
    # block_size is the number of tokens in each training example
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=4)

    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=4)

    # mlm=False selects causal (next-token) language modeling rather than masked LM
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator

train_dataset, test_dataset, data_collator = load_dataset(train_path, test_path, tokenizer)
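
To make the effect of `block_size` concrete, here is a small pure-Python sketch of the chunking we understand `TextDataset` to perform: the whole file is tokenized into one long stream, which is then cut into consecutive, non-overlapping blocks. The function `chunk_token_ids` and the toy ids are illustrative only and are not part of the library.

```python
def chunk_token_ids(token_ids, block_size):
    """Cut a flat token stream into consecutive, non-overlapping blocks.

    Trailing tokens that do not fill a complete block are dropped.
    """
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


toy_ids = list(range(10))  # stand-in for real token ids
print(chunk_token_ids(toy_ids, block_size=4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With `block_size=4`, each training example carries only four tokens of context, and a block can straddle the boundary between two headlines. A larger value would let the model see whole headlines at once, at the cost of more memory per example.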

In [13]:
!nvidia-smi

Sun Oct 18 06:27:03 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    28W /  70W |   2071MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               

## Prepare Model for Training

In [12]:
from transformers import Trainer, TrainingArguments, AutoModelWithLMHead

model = AutoModelWithLMHead.from_pretrained("gpt2")


training_args = TrainingArguments(
    output_dir="./headliner",       # output directory for checkpoints and the final model
    overwrite_output_dir=True,      # overwrite the contents of the output directory
    num_train_epochs=1,             # number of training epochs
    per_device_train_batch_size=4,  # batch size for training
    per_device_eval_batch_size=2,   # batch size for evaluation
    eval_steps=400,                 # number of update steps between two evaluations
    save_steps=800,                 # save a checkpoint every 800 update steps
    warmup_steps=500,               # number of warmup steps for the learning rate scheduler
    )


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    prediction_loss_only=True,
)



In [14]:
trainer.train()

[Progress bars: Epoch (max 1), Iteration (max 472411)]

{'loss': 6.99887060546875, 'learning_rate': 5e-05, 'epoch': 0.0010584004182798454, 'total_flos': 5973110784000, 'step': 500}
{'loss': 6.54750146484375, 'learning_rate': 4.994702390916932e-05, 'epoch': 0.0021168008365596907, 'total_flos': 11946221568000, 'step': 1000}
{'loss': 6.5059072265625, 'learning_rate': 4.989404781833863e-05, 'epoch': 0.003175201254839536, 'total_flos': 17919332352000, 'step': 1500}
{'loss': 6.46778125, 'learning_rate': 4.9841071727507945e-05, 'epoch': 0.0042336016731193814, 'total_flos': 23892443136000, 'step': 2000}
{'loss': 6.339587890625, 'learning_rate': 4.978809563667726e-05, 'epoch': 0.005292002091399226, 'total_flos': 29865553920000, 'step': 2500}
{'loss': 6.3247421875, 'learning_rate': 4.973511954584657e-05, 'epoch': 0.006350402509679072, 'total_flos': 35838664704000, 'step': 3000}
{'loss': 6.21076953125, 'learning_rate': 4.968214345501588e-05, 'epoch': 0.007408802927958917, 'total_flos': 41811775488000, 'step': 3500}
{'loss': 6.309671875, 'learning_rate

KeyboardInterrupt: ignored
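
The `learning_rate` values in the log follow a simple pattern. The sketch below assumes a linear warmup over `warmup_steps` followed by a linear decay to zero at the end of training. The base rate of 5e-05 is read off the log at step 500, and the total number of steps is inferred from the logged epoch fraction. The function name `linear_schedule_lr` is our own.

```python
BASE_LR = 5e-5       # rate logged at step 500, i.e. at the end of warmup
WARMUP_STEPS = 500   # warmup_steps from TrainingArguments

# Step 500 is logged as epoch 0.0010584004182798454,
# which implies the number of update steps in the single epoch.
TOTAL_STEPS = round(500 / 0.0010584004182798454)


def linear_schedule_lr(step):
    """Assumed schedule: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)


for step in (500, 1000, 1500, 2000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the computed rates agree closely with the logged ones, which shows that the run was interrupted while the learning rate was still near its peak.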

## Save Model

In [15]:
trainer.save_model()

## Generate Headlines

In [16]:
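from transformers import pipeline

# Load the fine-tuned weights saved in ./headliner together with the stock GPT-2 tokenizer.
# Note: `config` is documented as a model identifier or a PretrainedConfig, so this dict
# may not take effect; a length limit can also be passed when calling the pipeline.
headliner = pipeline('text-generation',
                model='./headliner',
                tokenizer='gpt2',
                config={'max_length':8})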

In [33]:
def get_headline(headliner_pipeline, seed_text="News"):
  return headliner_pipeline(seed_text)[0]['generated_text'].split('\n')[0]
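
To see what `get_headline` does without loading a model, the sketch below calls it with a stand-in for the pipeline. The stub `fake_pipeline` is hypothetical: it only mimics the list-of-dicts shape that the function indexes into and returns a fixed continuation. Since the training file holds one headline per line, the generated text may run on into a second headline after a newline, so only the first line is kept.

```python
def get_headline(headliner_pipeline, seed_text="News"):
    # Keep only the text up to the first newline, i.e. the first headline.
    return headliner_pipeline(seed_text)[0]['generated_text'].split('\n')[0]


def fake_pipeline(prompt):
    """Hypothetical stand-in that mimics the pipeline's output shape."""
    return [{"generated_text": prompt + " council meets\nsecond headline starts here"}]


print(get_headline(fake_pipeline, seed_text="City"))  # City council meets
```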

In [35]:
get_headline(headliner, seed_text="News")

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


'News over peter satelidott court'

In [37]:
get_headline(headliner, seed_text="China decides")

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


'China decides to help indigenous population in the process of drought'

In [38]:
get_headline(headliner, seed_text="Wildfire")

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence




In [39]:
get_headline(headliner, seed_text="City Council")

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


'City Council prepares against development crisis'