# GPT2 Exercise - Limor Nunu

###**Imports and Installations:**

In [1]:
! pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/98/87/ef312eef26f5cecd8b17ae9654cdd8d1fae1eb6dbd87257d6d73c128a4d0/transformers-4.3.2-py3-none-any.whl (1.8MB)
[K     |████████████████████████████████| 1.8MB 17.6MB/s 
[?25hCollecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
[K     |████████████████████████████████| 890kB 49.7MB/s 
Collecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
[K     |████████████████████████████████| 3.2MB 43.7MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=2034

In [2]:
from sklearn.model_selection import train_test_split
from transformers import pipeline
import tensorflow as tf
from transformers import GPT2Tokenizer, GPT2Config, TFGPT2LMHeadModel
import pandas as pd
import regex as re
import math
import ast

Cloning the HuggingFace repository:

In [3]:
!git clone https://github.com/huggingface/transformers.git

Cloning into 'transformers'...
remote: Enumerating objects: 45, done.[K
remote: Counting objects: 100% (45/45), done.[K
remote: Compressing objects: 100% (40/40), done.[K
remote: Total 63607 (delta 11), reused 21 (delta 0), pack-reused 63562[K
Receiving objects: 100% (63607/63607), 48.29 MiB | 29.14 MiB/s, done.
Resolving deltas: 100% (45054/45054), done.


### **Loading the data and cleaning it:**

In [4]:
df = pd.read_csv('fake_news_df.csv')

In [5]:
df.columns

Index(['Article Number', 'URL of article', 'Fake or Satire?',
       'URL of rebutting article', 'Fake or Satire?.1', 'content_1',
       'content_2'],
      dtype='object')

The content columns are columns that I scrape using the "article_extractor.py" program I wrote.

Converting the "content" columns into lists:

In [6]:
df.content_1 = [ast.literal_eval(i) for i in df.content_1]
df.content_2 = [ast.literal_eval(i) for i in df.content_2]

The lists made of elements with the content (not all the content comes as one piece)

Joining the lists into a long string of the content:

In [7]:
df['joined_content_1'] = [" ".join(i).strip() for i in df.content_1]
df['joined_content_2'] = [" ".join(i).strip() for i in df.content_2]

Calculating the length of the strings:

In [8]:
df['len_content_1'] = [len(i) for i in df.joined_content_1]
df['len_content_2'] = [len(i) for i in df.joined_content_2]

Finding the maximum length between the two columns:

In [9]:
df['max_len'] = df[['len_content_1', 'len_content_2']].max(axis=1)

Looking at the data:

In [10]:
df.head()

Unnamed: 0,Article Number,URL of article,Fake or Satire?,URL of rebutting article,Fake or Satire?.1,content_1,content_2,joined_content_1,joined_content_2,len_content_1,len_content_2,max_len
0,375.0,http://www.redflagnews.com/headlines-2016/cdc-...,Fake,http://www.snopes.com/cdc-forced-vaccinations/,Fake,[\n Please switch to a supported browser ...,[],Please switch to a supported browser to contin...,,237,0,237
1,376.0,http://www.redflagnews.com/headlines-2016/-out...,Fake,http://www.snopes.com/white-house-logo-change/,Fake,[\n Please switch to a supported browser ...,[],Please switch to a supported browser to contin...,,237,0,237
2,377.0,http://www.redflagnews.com/headlines-2016/whit...,Fake,http://www.snopes.com/obama-veterans-money-to-...,Fake,[\n Please switch to a supported browser ...,[],Please switch to a supported browser to contin...,,237,0,237
3,378.0,http://www.redflagnews.com/headlines-2016/obam...,Fake,http://www.snopes.com/obama-veterans-money-to-...,Fake,[\n Please switch to a supported browser ...,[],Please switch to a supported browser to contin...,,237,0,237
4,379.0,http://www.redflagnews.com/headlines-2016/cali...,Fake,http://www.snopes.com/california-to-jail-clima...,Fake,[\n Please switch to a supported browser ...,[],Please switch to a supported browser to contin...,,237,0,237


In [11]:
df.describe()

Unnamed: 0,Article Number,len_content_1,len_content_2,max_len
count,291.0,291.0,291.0,291.0
mean,262.841924,1203.797251,64.560137,1213.865979
std,166.078219,3225.583054,375.413288,3228.229278
min,8.0,0.0,0.0,0.0
25%,109.5,0.0,0.0,0.0
50%,250.0,12.0,0.0,12.0
75%,395.5,776.0,0.0,776.0
max,587.0,34310.0,3519.0,34310.0


We can see that 75% of the rows is content with length less than 776.

Probably not relevant content.

Let's check this:

In [12]:
df[df.max_len == 776].head()

Unnamed: 0,Article Number,URL of article,Fake or Satire?,URL of rebutting article,Fake or Satire?.1,content_1,content_2,joined_content_1,joined_content_2,len_content_1,len_content_2,max_len
27,43.0,http://yournewswire.com/cia-hitler-argentina-ww2/,Fake,http://www.snopes.com/fbi-files-prove-adolf-hi...,Fake,[It seems we can’t find what you’re looking fo...,[],It seems we can’t find what you’re looking for...,,776,0,776
28,44.0,http://yournewswire.com/planned-parenthood-ext...,Fake,http://www.politifact.com/new-hampshire/statem...,Fake,[It seems we can’t find what you’re looking fo...,[],It seems we can’t find what you’re looking for...,,776,0,776
29,45.0,http://yournewswire.com/charlottesville-hillar...,Fake,http://www.snopes.com/charlottesville-killer-r...,Fake,[It seems we can’t find what you’re looking fo...,[],It seems we can’t find what you’re looking for...,,776,0,776
30,46.0,http://yournewswire.com/fbi-seth-rich-dnc/,Fake,http://www.snopes.com/seth-rich-dnc-wikileaks-...,Fake,[It seems we can’t find what you’re looking fo...,[],It seems we can’t find what you’re looking for...,,776,0,776
31,47.0,http://yournewswire.com/mit-global-warming-dat...,Fake,http://www.snopes.com/climatology-fraud-global...,Fake,[It seems we can’t find what you’re looking fo...,[],It seems we can’t find what you’re looking for...,,776,0,776


It looks like those cases are errors and not articles, so let's drop them and look again at the description of the max_len column.

In [13]:
new_df = df[df['max_len'] > 776].copy()

In [14]:
new_df.describe()

Unnamed: 0,Article Number,len_content_1,len_content_2,max_len
count,67.0,67.0,67.0,67.0
mean,289.761194,4860.313433,279.746269,4904.044776
std,149.121985,5285.676715,747.117092,5261.86947
min,70.0,589.0,0.0,831.0
25%,143.5,1740.5,0.0,1776.0
50%,303.0,3456.0,0.0,3519.0
75%,404.5,4764.0,0.0,4764.0
max,587.0,34310.0,3519.0,34310.0


Now it looks more reasonable

Last check to make sure it is okay:

In [15]:
df[df.max_len == 831]

Unnamed: 0,Article Number,URL of article,Fake or Satire?,URL of rebutting article,Fake or Satire?.1,content_1,content_2,joined_content_1,joined_content_2,len_content_1,len_content_2,max_len
197,381.0,http://www.react365.com/59c06b7b050bf/no-more-...,Fake,http://www.snopes.com/no-child-support-2017/,Fake,"[ Sunday 21 February 379046 Shares, Donald ...",[ Report Abuse],Sunday 21 February 379046 Shares Donald tru...,Report Abuse,831,12,831


Good!

Let's assume the longest text contains the relevent text:

In [16]:
new_df['content'] = [new_df.joined_content_1.iloc[i] if (len(new_df.joined_content_1.iloc[i]) > len(new_df.joined_content_2.iloc[i])) else new_df.joined_content_2.iloc[i] for i in range(len(new_df))]

Printing some examples of the contents:

In [17]:
new_df.content[:5]

47    You are here:  \n\n\nby\n\n\nHow Africa\n\n\nA...
48    Johnston Wilson McGill, 34, was pronounced dea...
66    The whispers are growing louder that President...
67    Of all the voices crusading against the so-cal...
76    INI World Report > Uncategorized > Proof that ...
Name: content, dtype: object

We can see the text is dirty.

We need to clean it - remove spaces as "\n", "\t" and unnecessary characters.

Otherwise, it may worsen the results of the model -> garbage in garbage out! 

For now, I'll leave it aside because I don't have time, but it is **necessary** step to do!

### **Preparing the content for using it in the model**

First, let's split the data into train set and test set (80/20):

In [18]:
train, test = train_test_split(new_df.content, train_size = 0.8, random_state = 1)

In [19]:
print(train.shape)
print(test.shape)

(53,)
(14,)


Our dataset is very small and not enough for good results, but let's move on and continue with what we've got.

The reason for that is most of the articles with the fake news were deleted.

Let's add tags at the beginning and at the ending of the content and replace many spaces with one space.

And in addition combine all of the articles into one long string, as the model needs to receive.

In [20]:
def preparing_data(dataset):
  data = ""
  if type(dataset) == str:
    dataset = [dataset]
  for c in dataset:
      c = re.sub(r"\s", " ", c) 
      bos_token = '<s>'
      eos_token = '</s>'
      data += bos_token + ' ' + c + ' ' + eos_token + '\n'
  return data

"s" notifies the model about the start and "/s" about the end of the article.

Let's prepare the train and test datasets:

In [21]:
test_data = preparing_data(test)
train_data = preparing_data(train)

### **Fine Tuning GPT2 model**

Adding special tokens:



We need to introduce to the model our "s" tags so it won't consider them as words.

Moreover, we need to add tokens as "unk" for unknown words - words (or sub-words) the model doesn't recognize, "pad" for padding, and "mask" for masking.

In [22]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({
  "eos_token": "</s>",
  "bos_token": "<s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "mask_token": "<mask>"
})

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1042301.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=456318.0, style=ProgressStyle(descripti…




4

Configuration - setting vocabulary size (as defined in the pre-trained tokenizer) and setting the id of the "s" tags.

In [23]:
config = GPT2Config(
  vocab_size=tokenizer.vocab_size,
  bos_token_id=tokenizer.bos_token_id,
  eos_token_id=tokenizer.eos_token_id,
  )

Creating the model:

In [24]:
model = TFGPT2LMHeadModel(config)

Adjusting the token embeddings to the length of the tokenizer.

In [25]:
model.resize_token_embeddings(len(tokenizer))

<transformers.modeling_tf_utils.TFSharedEmbeddings at 0x7fc861792c50>

Passing our texts to the tokenizer (converting the words into id numbers, so we can send them to the model)

In [26]:
train_encodings = tokenizer.encode(train_data, truncation=True, padding=True)
test_encodings = tokenizer.encode(test_data, truncation=True, padding=True) 

In [27]:
print(len(train_encodings))

1024


Here we are creating the input to the model, we give the model a vector of ids as input and another vector of ids as output, but with a slide of 1 element.

 *Example:*

input = [12, 100, 150, 16]

output = [100, 150, 16, 785]

In that way we train the model to predict (or generate) the next word (or token id)

In [28]:
examples = []
block_size = 85
BATCH_SIZE = 12
BUFFER_SIZE = 1000
for i in range(0, len(train_encodings) - block_size + 1, block_size):
  examples.append(train_encodings[i:i + block_size])
inputs, labels = [], []
for ex in examples:
  inputs.append(ex[:-1])
  labels.append(ex[1:])
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

In [29]:
print(inputs)
print(labels)

[[50258, 4723, 1649, 7183, 717, 1908, 428, 284, 514, 356, 1807, 11, 645, 835, 11, 407, 772, 2486, 561, 307, 428, 1099, 1203, 13, 887, 356, 547, 2642, 13, 8732, 2486, 11764, 1444, 319, 5293, 16269, 284, 3015, 287, 3431, 447, 247, 82, 3071, 13, 770, 2187, 3662, 318, 1099, 1203, 0, 1119, 6486, 379, 790, 1210, 13, 1119, 19837, 284, 651, 12439, 3804, 13, 1119, 19837, 546, 22497, 13, 1119, 19837, 546, 5073, 447, 247, 82, 2839, 4382, 290, 7237, 13, 39711, 532], [4477, 2174, 843, 783, 484, 389, 4585, 319, 5293, 16269, 284, 3015, 13, 383, 2008, 2058, 422, 19931, 9390, 13, 220, 220, 921, 1276, 307, 18832, 287, 284, 1281, 257, 2912, 13, 39711, 220, 39099, 25, 34108, 30646, 309, 2228, 284, 3497, 257, 22244, 2080, 1301, 851, 887, 679, 13590, 2332, 5588, 220, 7232, 557, 3442, 3961, 5926, 1874, 570, 82, 2293, 1004, 4335, 7623, 286, 833, 2140, 28231, 1114, 16168, 278, 20357, 284, 4946, 220, 18321, 3101, 6288, 25, 5617, 3078, 5345, 284], [10127, 1301, 46226, 39826, 35536, 287, 9589, 11, 7859, 290, 7055

Defining optimizer, loss function, and metric to evaluate the model and eventually compiling the model with what we've defined:

In [30]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01, epsilon=0.0001, clipnorm=1.0)

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])

Fitting the model with the trainset:

In [31]:
num_epoch = 10
history = model.fit(dataset, epochs=num_epoch)

Epoch 1/10
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported
Cause: while/else statement not yet supported
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Evaluating the model:

In [32]:
model.evaluate(test_encodings)



[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

The model needs to be trained on more examples, for improvement of the performances.

Here we can see that the accuracy of the model equal to zero, which means, the model guesses the next word to generate (bad model).

I think accuracy isn't a good metric to evaluate the model, I think we should calculate the perplexity of the model (exp of the loss).
perplexity is a measurement of how well a probability distribution or probability model predicts a sample. A low perplexity indicates the probability distribution is good at predicting the sample.


**Using the model:**

In the deployment the user will submit a sentence:

In [33]:
sentence = 'Hi how are you?'

In the background the sentence will be tokenized and then the model will generate text:

In [34]:
input_ids = tokenizer.encode(sentence, truncation=True, padding=True, return_tensors='tf')
beam_output = model.generate(
  input_ids,
  max_length = 50,
  num_beams = 5,
  temperature = 0.7,
  no_repeat_ngram_size=2,
  num_return_sequences=5
)

Setting `pad_token_id` to 50257 (first `eos_token_id`) to generate sequence


Translating back the predicted tokens into words:

In [35]:
print(tokenizer.decode(beam_output[0]))

Hi how are you?�� �� to”� that“�, to to � � to� to that to� to,� the�.’� and� Trump�s� transgender� with� Donald� one


As we can see, the results don't make sense respectively to the low performances of the model.