# A model that generates satire news articles from user input

I have gathered articles from The Beaverton (a Canadian satire news site; check them out) and will train OpenAI's GPT-2 (Generative Pretrained Transformer 2) NLP model on them to produce satire articles based on the user's input. My end goal is a model that can generate a satire article on the spot for any topic the user gives.


# Data gathering notes

The Beaverton (https://www.thebeaverton.com/) does not offer its articles as a ready-made dataset, so I gathered them with my own custom web crawler, which went through every article on the site. (Data gathering was done on July 10, 2019, so articles published after that date are not included.)

The web crawler was built separately in C#. It also formatted all the articles and wrote them to a plain .txt file. I uploaded that file to my Google Drive so that Google Colab can read it from there.

In [0]:
# mount my Google Drive so the data can be read from it
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: 
Enter your authorization code:
··········
Mounted at /content/drive


# Data formatting notes

The data is formatted in a way that GPT-2 can understand.

Each article follows this layout:

*   Title
*   Newline
*   Article content
*   `<|endoftext|>`

`<|endoftext|>` marks the end of an article, so that GPT-2 can understand the distinction between articles.
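The layout above can be sketched in Python as follows. This is only an illustration, not the crawler's actual code (which was written in C#): the function name `format_corpus`, the `(title, body)` input shape, and the choice to put the end marker on its own line are all assumptions made for the example.

```python
END_OF_TEXT = "<|endoftext|>"  # marker that separates articles for GPT-2


def format_corpus(articles):
    """Join (title, body) pairs into a single training text.

    Each article becomes: title, newline, body, newline, end marker.
    Placing the marker on its own line is an assumption of this sketch.
    """
    chunks = []
    for title, body in articles:
        chunks.append(f"{title.strip()}\n{body.strip()}\n{END_OF_TEXT}\n")
    return "".join(chunks)


# Hypothetical sample data, just to show the output shape.
sample = [
    ("Headline one", "Body of the first article."),
    ("Headline two", "Body of the second article."),
]
print(format_corpus(sample))
```

Keeping the marker as a single constant makes it harder for the writer and any later reader of the file to disagree on the separator.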

I stripped all HTML hyperlinks, fixed the encoding, and manually reviewed about 5% of the articles at random to make sure my web crawler had gathered them correctly.
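As a rough illustration of the clean-up and spot check described above, here is a Python sketch. The original crawler was written in C#, so this is not its code: `strip_links` and `sample_for_review` are hypothetical helper names, and the regular expression only handles simple `<a>` tags.

```python
import random
import re

# Matches <a ...>visible text</a>; an assumption that links are simple anchor tags.
ANCHOR_TAG = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def strip_links(html_text):
    """Replace each <a ...>text</a> with just its visible text."""
    return ANCHOR_TAG.sub(r"\1", html_text)


def sample_for_review(articles, fraction=0.05, seed=0):
    """Pick a random subset (about `fraction` of the items) to check by hand."""
    if not articles:
        return []
    count = max(1, round(len(articles) * fraction))
    return random.Random(seed).sample(articles, count)


print(strip_links('Read <a href="https://example.com">this story</a> today'))
print(len(sample_for_review(list(range(100)))))
```

A fixed seed makes the review sample reproducible, so the same articles can be re-checked after the crawler changes.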

There were some out-of-the-ordinary articles (as you can expect from a satire news site), including ones with Unicode emoji and quiz-like interactive pieces, but I decided not to exclude them, as I didn't want to tamper with what goes into training based on my own judgment.

There are about 5,500 unique articles in total, dating back to 2010.

There are a total of 1,528,615 words and 65,984 lines.
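Counts like these can be reproduced with a few lines of Python. The sketch below is an assumption about how one might count, not necessarily how the figures above were produced: `corpus_stats` is a hypothetical helper, words are split on whitespace, and the end markers are not counted as words.

```python
def corpus_stats(text, separator="<|endoftext|>"):
    """Count articles, whitespace-separated words, and lines in a corpus."""
    articles = [a for a in text.split(separator) if a.strip()]
    words = sum(len(a.split()) for a in articles)
    lines = len(text.splitlines())
    return {"articles": len(articles), "words": words, "lines": lines}


# Small made-up corpus; for the real file, read its contents and pass them in.
demo = "Title A\nSome body text.\n<|endoftext|>\nTitle B\nMore text.\n<|endoftext|>\n"
print(corpus_stats(demo))
```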


In [0]:
from itertools import islice

# path to the data on my Google Drive
filepath = '/content/drive/My Drive/datas/satirenews/output-2.txt'

# see if it was read correctly (first 100 lines)
with open(filepath, encoding='utf-8') as f:
    for line in islice(f, 100):
        print(line)

Anti-abortion politician supports exceptions for rape, incest, and being his mistress



WASHINGTON D.C. — United States politician Herb Phillman has unveiled a bold new stance on the controversial issue of abortion bans by proposing exceptions in instances of rape, incest, and being a woman with whom he is having an extramarital affair.

“Abortion has been a stain on our nation since liberal judges forced the catastrophic decision of Roe v. Wade on us,” Phillman told reporters. “However, we believe that some very narrow exceptions are appropriate; if the woman has been raped, is a victim of incest, or has become pregnant as a result of having sex with me, personally.”

The new proposal is finding tentative support from GOP leaders seeking a less-stringent alternative to abortion bans recently adopted in states such as Alabama and Georgia, which include no exceptions, even for women who wish to terminate due to the fact that they are not the wife of the married republican lawmaker who 


# OpenAI GPT-2 notes

I picked GPT-2 for the following reasons:

1. It is one of the newest cutting-edge transformer-based NLP models

2. It uses TensorFlow

3. It provides a pre-trained model and lets me train it further on my own data with various hyperparameter options

4. It is a generative model that specializes in producing text

Here are the articles I followed and referenced:

https://openai.com/blog/better-language-models/

https://github.com/nshepperd/gpt-2


In [0]:
# I am going to use OpenAI's GPT-2 and its pre-trained model (345M) 
# I will try to train this pre-trained model to be more sarcastic 

# this is a version of GPT-2 with training support, from nshepperd's GitHub repository
!git clone https://github.com/nshepperd/gpt-2.git

Cloning into 'gpt-2'...
remote: Enumerating objects: 297, done.
remote: Total 297 (delta 0), reused 0 (delta 0), pack-reused 297
Receiving objects: 100% (297/297), 4.40 MiB | 3.22 MiB/s, done.
Resolving deltas: 100% (161/161), done.


In [0]:
%cd gpt-2

/content/gpt-2


In [0]:
# install the required packages
!pip3 install -r requirements.txt

Collecting fire>=0.1.3 (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/5a/b7/205702f348aab198baecd1d8344a90748cb68f53bdcd1cc30cbc08e47d3e/fire-0.1.3.tar.gz
Collecting regex==2017.4.5 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/36/62/c0c0d762ffd4ffaf39f372eb8561b8d491a11ace5a7884610424a8b40f95/regex-2017.04.05.tar.gz (601kB)
     |████████████████████████████████| 604kB 7.5MB/s
Collecting tqdm==4.31.1 (from -r requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/6c/4b/c38b5144cf167c4f52288517436ccafefe9dc01b8d1c190e18a6b154cd4a/tqdm-4.31.1-py2.py3-none-any.whl (48kB)
     |████████████████████████████████| 51kB 18.0MB/s
Collecting toposort==1.5 (from -r requirements.txt (line 5))
  Downloading https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Building wheels

In [0]:
# download the pre-trained 345M model
!python3 download_model.py 345M

Fetching checkpoint: 1.00kit [00:00, 607kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:00, 35.2Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 659kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:30, 46.2Mit/s]                                 
Fetching model.ckpt.index: 11.0kit [00:00, 5.75Mit/s]                                               
Fetching model.ckpt.meta: 927kit [00:00, 34.1Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 33.3Mit/s]                                                       


In [0]:
# `!export` only affects a throwaway subshell, so set the variable with %env to make it persist
%env PYTHONIOENCODING=UTF-8

In [0]:
# Let's check that everything was downloaded correctly
!ls /content/gpt-2/

CONTRIBUTORS.md  Dockerfile.gpu     LICENSE    requirements.txt  train.py
DEVELOPERS.md	 download_model.py  models     src
Dockerfile.cpu	 encode.py	    README.md  train-horovod.py



# Training notes

GPT-2 provides a pre-trained model (345M) that was trained on WebText, a corpus of web pages collected from links shared on Reddit.

I will train it further using my own data. I will train for 500+ steps to see how it performs, and then re-evaluate. Every 100 steps, it prints a random sample so I can see how well the training is going.

In [0]:
# start training using the data we have on pre-trained 345M Model
!PYTHONPATH=src ./train.py --dataset /content/drive/My\ Drive/datas/satirenews/output-2.txt --model_name '345M' --top_k 60

W0722 04:39:48.698159 140506426201984 deprecation_wrapper.py:119] From /content/gpt-2/src/model.py:147: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

W0722 04:39:48.713966 140506426201984 deprecation_wrapper.py:119] From /content/gpt-2/src/memory_saving_gradients.py:13: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

W0722 04:39:48.810900 140506426201984 deprecation_wrapper.py:119] From ./train.py:87: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0722 04:39:48.811231 140506426201984 deprecation_wrapper.py:119] From ./train.py:90: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-07-22 04:39:48.833336: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-07-22 04:39:48.837491: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3222f40 executing computations on platform Host. Devices:
2019-07-22 0

# Trained model to be permanently saved on my Google Drive for further training & use

From the samples above, the output looks a lot like an actual article now (with a title, a new line, then the article itself!)

I will save this model on my drive, and have it generate some random articles as well as articles based on provided input.



In [0]:
!cp -r /content/gpt-2/ /content/drive/My\ Drive/datas/satirenews

In [0]:
# make a new model called satirenewsMK1 for my first complete model
!mkdir /content/drive/My\ Drive/datas/satirenews/gpt-2/models/satirenewsMK1

In [0]:
# copy config files from original 345M to my model
!cp -r /content/drive/My\ Drive/datas/satirenews/gpt-2/models/345M/* /content/drive/My\ Drive/datas/satirenews/gpt-2/models/satirenewsMK1

In [0]:
# copy my trained checkpoint files into my model
!cp -r /content/drive/My\ Drive/datas/satirenews/gpt-2/checkpoint/run1/* /content/drive/My\ Drive/datas/satirenews/gpt-2/models/satirenewsMK1

In [0]:
# copy my model back to where GPT-2 is installed so it can be used
!cp -r /content/drive/My\ Drive/datas/satirenews/gpt-2/models/satirenewsMK1 /content/gpt-2/models
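The copy steps above amount to starting from the base model's files and then overlaying the fine-tuned checkpoint on top of them. Here is a hedged Python sketch of the same idea; `package_model` is a hypothetical helper, the demo uses placeholder files rather than real model data, and `dirs_exist_ok` assumes Python 3.8 or newer.

```python
import shutil
import tempfile
from pathlib import Path


def package_model(base_model_dir, checkpoint_dir, target_dir):
    """Build a model folder from base model files plus fine-tuned weights."""
    target = Path(target_dir)
    # 1) start from the base model's files (config, vocabulary, and so on)
    shutil.copytree(base_model_dir, target, dirs_exist_ok=True)
    # 2) overlay the fine-tuned checkpoint; same-named files are overwritten
    shutil.copytree(checkpoint_dir, target, dirs_exist_ok=True)
    return sorted(p.name for p in target.iterdir())


# Tiny demo with placeholder files in a temporary folder (not real model data).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp, "345M")
    run = Path(tmp, "run1")
    base.mkdir()
    run.mkdir()
    (base / "hparams.json").write_text("{}")
    (base / "checkpoint").write_text("base")
    (run / "checkpoint").write_text("fine-tuned")
    files = package_model(base, run, Path(tmp, "satirenewsMK1"))
    print(files)
```

The order matters: copying the checkpoint second is what lets the fine-tuned files replace the base model's files of the same name.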

# Let's generate random articles

I will use my model to generate random articles.

In [0]:
!python3 src/generate_unconditional_samples.py --top_k 40 --model_name "satirenewsMK1"

W0722 05:35:23.303468 140488715700096 deprecation_wrapper.py:119] From /content/gpt-2/src/model.py:147: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

W0722 05:35:23.713639 140488715700096 deprecation_wrapper.py:119] From src/generate_unconditional_samples.py:52: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-07-22 05:35:23.741340: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-07-22 05:35:23.791455: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-22 05:35:23.791946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
2019-07-22 05:35:23.799970: I tensorflow/stream_executor/

# Model Improvements

Here is the list of improvements I made during training to improve the model:

*   Refined data: I removed some articles that didn't fit well. Some weren't really satire news, such as quiz articles and ads.
*   Tuning hyperparameters: Given the nature of satire news, I could let the model generate more nonsensical sentences than normal. I increased the top-k hyperparameter (which limits sampling to the k most likely next tokens, so a larger value allows more varied word choices) from 40 to 60 to make the articles more exciting.
*   The model seems to converge around a loss of 2.75, so I decided to train until that point instead of arbitrarily stopping at around 500 iterations.
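To make the top-k setting from the list above concrete, here is a minimal pure-Python sketch of top-k sampling. It illustrates the general technique only and is not the code GPT-2 itself runs; `top_k_sample` and the example scores are made up for the illustration.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token index from only the k highest-scoring logits."""
    if k <= 0 or k > len(logits):
        k = len(logits)  # treat out-of-range k as "no restriction"
    # indices of the k largest logits, best first
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # softmax weights over the survivors (subtract the max for stability)
    peak = logits[top[0]]
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


scores = [0.1, 2.0, -1.0, 1.5]  # made-up scores for a 4-token vocabulary
print(top_k_sample(scores, k=1))  # with k=1 only the best token can be picked
print(top_k_sample(scores, k=3))  # with k=3 the lowest-scoring token is excluded
```

Raising k lets more low-probability tokens into the draw, which is why going from 40 to 60 tends to give more surprising text.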





# Make this available for everyone to use

This is a fun side project I did in my spare time: I integrated the model above with my home server to provide public access to it. You can visit my home server at https://jaderain.app/satire to have it generate random articles.

I wrote a C# backend to integrate with the model shown above.