# Outline

* Overview

* Pre-training the language model on the WikiText-103 dataset

  1. Preprocess WikiText-103
  2. Build the encoder-decoder architecture
  3. Add the custom layers for AWD-LSTM
  4. Add an optimizer
  5. Write the training loop
  6. Calculate perplexity

* Fine-tuning it on new data

  1. Preprocess the new data
  2. Train the previously saved model again with varying learning rates

* Using the language model for classification

  1. Take the encoder from the language model, add a classifier head on top of it, and use it for classification

In [44]:
!pip install -q tensorflow-gpu==2.0.0-beta1
!pip install tensorflow-text

Collecting tensorflow-text
  Downloading https://files.pythonhosted.org/packages/d6/c7/50d7bb8f66212a63180cfb48f0dfb1c51dd4e8f7e2b48e96b75d7f61e164/tensorflow_text-1.0.0b2-cp36-cp36m-manylinux1_x86_64.whl (6.2MB)
     |████████████████████████████████| 6.2MB 2.6MB/s 
Collecting tensorflow<2.1,>=2.0.0b1 (from tensorflow-text)
  Downloading https://files.pythonhosted.org/packages/29/6c/2c9a5c4d095c63c2fb37d20def0e4f92685f7aee9243d6aae25862694fd1/tensorflow-2.0.0b1-cp36-cp36m-manylinux1_x86_64.whl (87.9MB)
     |████████████████████████████████| 87.9MB 44.8MB/s 
Installing collected packages: tensorflow, tensorflow-text
  Found existing installation: tensorflow 1.14.0
    Uninstalling tensorflow-1.14.0:
      Successfully uninstalled tensorflow-1.14.0
Successfully installed tensorflow-2.0.0b1 tensorflow-text-1.0.0b2


In [0]:
import tensorflow as tf

import re
import html 

# Get Data for Language Model

In [3]:
# Get Wikitext 103
!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip

--2019-08-07 12:58:56--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.96.149
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.96.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190229076 (181M) [application/zip]
Saving to: ‘wikitext-103-v1.zip’


2019-08-07 12:58:58 (100 MB/s) - ‘wikitext-103-v1.zip’ saved [190229076/190229076]



In [4]:
!unzip wikitext-103-v1.zip

Archive:  wikitext-103-v1.zip
   creating: wikitext-103/
  inflating: wikitext-103/wiki.test.tokens  
  inflating: wikitext-103/wiki.valid.tokens  
  inflating: wikitext-103/wiki.train.tokens  


# Preprocess Data

Preprocessing has two steps:

1. Apply a list of rules to the text.
2. Tokenize the text.

Note: fastai applies the following rules before tokenization:

defaults.text_pre_rules = [fix_html, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces]

and these rules after tokenization:

defaults.text_post_rules = [replace_all_caps, deal_caps]

The `preprocess` function below implements all of the pre-rules and is intended to be applied before tokenization. It also lowercases the text and does not implement the post-rules.


In [0]:
BOS,EOS,FLD,UNK,PAD = 'xxbos','xxeos','xxfld','xxunk','xxpad'
TK_MAJ,TK_UP,TK_REP,TK_WREP = 'xxmaj','xxup','xxrep','xxwrep'

In [0]:
def preprocess(x):
  "Apply the fastai-style pre-tokenization rules to the string `x`."
  x = x.strip().lower()

  def replace_rep(t):
    "Replace repetitions at the character level in text with the specified token."
    def _replace_rep(m):
      c,cc = m.groups()
      return f' {TK_REP} {len(cc)+1} {c} '
    re_rep = re.compile(r'(\S)(\1{3,})')
    return re_rep.sub(_replace_rep, t)

  # Replace any character repeated 4 or more times in a row with the xxrep token.
  x = replace_rep(x)

  def replace_wrep(t):
    "Replace word repetitions in text with the specified token."
    def _replace_wrep(m):
      c,cc = m.groups()
      return f' {TK_WREP} {len(cc.split())+1} {c} '
    re_wrep = re.compile(r'(\b\w+\W+)(\1{3,})')
    return re_wrep.sub(_replace_wrep, t)

  # Replace any word repeated 4 or more times in a row with the xxwrep token.
  x = replace_wrep(x)

  # Fix HTML entities and escapes, and undo the WikiText markers (@.@, @-@, @,@).
  re1 = re.compile(r'  +')
  x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>',UNK).replace(' @.@ ','.').replace(
        ' @-@ ','-').replace(' @,@ ',',').replace('\\', ' \\ ')
  x = re1.sub(' ', html.unescape(x))

  # Add spaces around /, # and newline characters.
  x = re.sub(r'([/#\n])', r' \1 ', x)

  # Collapse runs of multiple spaces into a single space.
  x = re.sub(' {2,}', ' ', x)

  return x
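
To see what these rules do, the small self-contained sketch below reapplies three of them to short strings. The helper name `demo_rules` and the sample sentences are illustrative assumptions and not part of the notebook; the regular expressions and token names are copied from `preprocess`.

```python
import re

TK_REP, TK_WREP = 'xxrep', 'xxwrep'

def demo_rules(text):
    """Hypothetical helper: apply a subset of the pre-rules to `text`."""
    # Character repetition: a character followed by 3+ copies of itself.
    text = re.sub(r'(\S)(\1{3,})',
                  lambda m: f' {TK_REP} {len(m.group(2)) + 1} {m.group(1)} ', text)
    # Word repetition: a word (plus separator) followed by 3+ copies of itself.
    text = re.sub(r'(\b\w+\W+)(\1{3,})',
                  lambda m: f' {TK_WREP} {len(m.group(2).split()) + 1} {m.group(1)} ', text)
    # Undo the WikiText escapes for '.', '-' and ','.
    text = text.replace(' @.@ ', '.').replace(' @-@ ', '-').replace(' @,@ ', ',')
    # Collapse the extra spaces introduced above; strip only to keep the demo output tidy.
    return re.sub(' {2,}', ' ', text).strip()

print(demo_rules('this is sooooo cool'))                # this is s xxrep 5 o cool
print(demo_rules('very very very very good'))           # xxwrep 4 very good
print(demo_rules('a well @-@ known 3 @.@ 5 km route'))  # a well-known 3.5 km route
```

The count emitted after `xxrep` or `xxwrep` is the total number of repetitions, so the original text can in principle be reconstructed from the token, the count, and the repeated unit.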

# Create TensorFlow Dataset

In [0]:
train_path = "wikitext-103/wiki.train.tokens"
valid_path = "wikitext-103/wiki.valid.tokens"
test_path = "wikitext-103/wiki.test.tokens"

In [0]:
def split_input_target(chunk):
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text,target_text
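
As a quick illustration of the one-token shift, the sketch below runs the same slicing on a plain Python list. The token ids are made up for the example.

```python
def split_input_target(chunk):
    # The input drops the last token; the target drops the first.
    return chunk[:-1], chunk[1:]

chunk = [11, 12, 13, 14, 15]  # made-up token ids
inputs, targets = split_input_target(chunk)
print(inputs)   # [11, 12, 13, 14]
print(targets)  # [12, 13, 14, 15]
```

At every position the target is the token that follows the input token, which is what a next-word language model is trained to predict.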

In [0]:
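def tokenize_data(path,num_words=None):
  # Only the first 10,000 characters of the file are used here.
  with open(path,'r',encoding='utf-8') as f:
    data = f.read()[0:10000]
  # Fit a word-level tokenizer, then convert the text to a list of integer ids.
  tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=num_words)
  tokenizer.fit_on_texts([data])
  data = tokenizer.texts_to_sequences([data])[0]
  return tokenizer,data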
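
For intuition, the sketch below mimics in plain Python what fitting a word-level tokenizer amounts to: count the words, then assign integer ids starting at 1 in order of decreasing frequency. It is a simplified stand-in and not the Keras implementation; the helper names `build_word_index` and `to_sequence` are hypothetical, and punctuation filtering is omitted.

```python
from collections import Counter

def build_word_index(text):
    """Hypothetical helper: map each word to an id, most frequent word first."""
    counts = Counter(text.lower().split())
    # most_common() orders by decreasing count; ids start at 1 so that 0 stays free (e.g. for padding).
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Hypothetical helper: convert text to ids, skipping words not in the index."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

word_index = build_word_index('the cat sat on the mat')
print(word_index)  # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(to_sequence('the cat sat on the mat', word_index))  # [1, 2, 3, 4, 1, 5]
```

Inverting such a dictionary gives the id-to-word lookup needed to turn a sequence of ids back into text.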

In [0]:
#word_to_index = tokenizer.word_index
#index_to_word = {v:k for k,v in tokenizer.word_index.items()}

#def convert_to_text(line):
#  return ' '.join([index_to_word[i] for i in line])

In [0]:
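def make_lm_dataset(data,seq_length,batch_size,buffer_size):
  # Cut the token stream into chunks of seq_length+1 tokens, dropping the incomplete tail.
  dataset = tf.data.Dataset.from_tensor_slices(data)
  batch_set = dataset.batch(seq_length+1,drop_remainder = True)
  # Turn each chunk into an (input, target) pair, with the target shifted by one token.
  batch_set = batch_set.map(split_input_target)
  # Shuffle the sequences, then group them into batches of batch_size sequences.
  return batch_set.shuffle(buffer_size).batch(batch_size,drop_remainder=True)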
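
Because both batching steps use `drop_remainder=True`, the number of batches follows from two integer divisions. The sketch below works this out in plain Python; `count_batches` is a hypothetical helper, and the token counts are made-up examples, not measurements of WikiText-103.

```python
def count_batches(num_tokens, seq_length, batch_size):
    """Hypothetical helper: sizes produced by the two drop_remainder batching steps."""
    # Step 1: chunks of seq_length + 1 tokens (the extra token supplies the shifted target).
    num_sequences = num_tokens // (seq_length + 1)
    # Step 2: full batches of batch_size sequences; leftover sequences are dropped.
    num_batches = num_sequences // batch_size
    return num_sequences, num_batches

print(count_batches(1_000_000, seq_length=70, batch_size=64))  # (14084, 220)
print(count_batches(2_000, seq_length=70, batch_size=64))      # (28, 0)
```

The second call shows that a corpus shorter than `batch_size * (seq_length + 1)` tokens produces no batches at all, so iterating over such a dataset would yield nothing.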

In [0]:
tokenizer, tokenized_data = tokenize_data(train_path)
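# Each batch holds 64 sequences of 70 tokens. Because drop_remainder=True, no batches are produced
# when the corpus has fewer than 64 * 71 tokens, which is likely with the 10,000-character slice in tokenize_data.
dataset = make_lm_dataset(tokenized_data,seq_length=70,batch_size=64,buffer_size=10000)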