# Outline

* Overview

* Pre-training the language model on the WikiText-103 dataset

  1. Preprocess WikiText-103
  2. Build the encoder-decoder architecture
  3. Add the custom layers for AWD-LSTM
  4. Add an optimizer
  5. Write the training loop
  6. Calculate perplexity
* Fine-tuning it on new data
  1. Preprocess the new data
  2. Train the previously saved model again with varying learning rates
* Using the language model for classification
  1. Take the encoder from the language model, add a classifier head on top of it, and use it for classification

In [2]:
!pip install -q tensorflow-gpu==2.0.0-beta1



In [0]:
import tensorflow as tf

import re
import html 

# Get Data for Language Model

In [0]:
# Get Wikitext 103
!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip

--2019-08-05 06:50:35--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.239.13
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.239.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190229076 (181M) [application/zip]
Saving to: ‘wikitext-103-v1.zip’


2019-08-05 06:50:41 (33.4 MB/s) - ‘wikitext-103-v1.zip’ saved [190229076/190229076]



In [0]:
!unzip wikitext-103-v1.zip

Archive:  wikitext-103-v1.zip
   creating: wikitext-103/
  inflating: wikitext-103/wiki.test.tokens  
  inflating: wikitext-103/wiki.valid.tokens  
  inflating: wikitext-103/wiki.train.tokens  


# Preprocess Data

Preprocessing has two steps:

1. Apply a list of rules to text 
2. Then tokenize the text.

In [0]:
train_path = "wikitext-103/wiki.train.tokens"
valid_path = "wikitext-103/wiki.valid.tokens"
test_path = "wikitext-103/wiki.test.tokens"

In [5]:
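# Read the training split line by line; the context manager closes the file afterwards
with open(train_path, "r", encoding="utf-8") as f:
  data = f.readlines()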


Note: fastai applies the following rules before tokenization:

defaults.text_pre_rules = [fix_html, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces]

and these rules after tokenization:

defaults.text_post_rules = [replace_all_caps, deal_caps]

The `preprocess` function in this notebook implements all of the pre-rules and is applied before tokenization.
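
As a rough illustration of what two of these pre-rules do, the sketch below applies the `replace_rep` and `replace_wrep` patterns to two made-up strings. The regular expressions mirror the ones used in this notebook's preprocessing function; the sample inputs and the helper names `demo_rep` and `demo_wrep` are assumptions made only for this example.

```python
import re

TK_REP, TK_WREP = 'xxrep', 'xxwrep'

# A non-space character followed by at least three more copies of itself
re_rep = re.compile(r'(\S)(\1{3,})')
# A word (plus trailing non-word characters) followed by at least three more copies
re_wrep = re.compile(r'(\b\w+\W+)(\1{3,})')

def demo_rep(text):
    # Replace the run with: token, run length, the repeated character
    return re_rep.sub(
        lambda m: f' {TK_REP} {len(m.group(2)) + 1} {m.group(1)} ', text)

def demo_wrep(text):
    # Replace the run with: token, number of repeats, the repeated word
    return re_wrep.sub(
        lambda m: f' {TK_WREP} {len(m.group(2).split()) + 1} {m.group(1)} ', text)

# Hypothetical inputs: a character repeated 6 times and a word repeated 4 times
rep_out = demo_rep("c" + "o" * 6 + "l")
wrep_out = demo_wrep("no no no no way")

print(rep_out)           # c xxrep 6 o l
print(wrep_out.split())  # ['xxwrep', '4', 'no', 'way']
```

The run length is kept next to the token, so the model still sees how long the repetition was without needing a separate vocabulary entry for every elongated spelling.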


In [0]:
BOS,EOS,FLD,UNK,PAD = 'xxbos','xxeos','xxfld','xxunk','xxpad'
TK_MAJ,TK_UP,TK_REP,TK_WREP = 'xxmaj','xxup','xxrep','xxwrep'

In [0]:
def preprocess(x):
  x = x.strip().lower()
  
  def replace_rep(t):
    "Replace repetitions at the character level in text with the specified token"
    def _replace_rep(m):
        c,cc = m.groups()
        return f' {TK_REP} {len(cc)+1} {c} '
    re_rep = re.compile(r'(\S)(\1{3,})')
    return re_rep.sub(_replace_rep, t)
  
  # Replace any character repeated more than 3 times in a row with the xxrep token
  x = replace_rep(x)
  
  def replace_wrep(t):
    "Replace word repetitions in text with the specified token."
    def _replace_wrep(m):
        c,cc = m.groups()
        return f' {TK_WREP} {len(cc.split())+1} {c} '
    re_wrep = re.compile(r'(\b\w+\W+)(\1{3,})')
    return re_wrep.sub(_replace_wrep, t)
  
  # Replace any word repeated more than 3 times in a row with the xxwrep token
  x = replace_wrep(x)
  # Fix HTML entities and the WikiText markers (<unk>, @.@, @-@, @,@)
  re1 = re.compile(r'  +')
  x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>',UNK).replace(' @.@ ','.').replace(
        ' @-@ ','-').replace(' @,@ ',',').replace('\\', ' \\ ')
  x = re1.sub(' ', html.unescape(x))

  # Add spaces around /, # and newlines
  x = re.sub(r'([/#\n])', r' \1 ', x)

  # Collapse runs of spaces into a single space
  x = re.sub(' {2,}', ' ', x)
  
  
  return x
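
To see the WikiText-specific part of the cleanup in isolation, here is a minimal sketch that applies only the marker replacements and the two spacing rules to a made-up line. The helper name `fix_wikitext_markers` and the sample sentence are assumptions for illustration; the replacements themselves are the ones used in `preprocess` above.

```python
import re

UNK = 'xxunk'

def fix_wikitext_markers(text):
    # WikiText writes "1,000" as "1 @,@ 000", "3.5" as "3 @.@ 5", etc.
    text = (text.replace('<unk>', UNK)
                .replace(' @.@ ', '.')
                .replace(' @-@ ', '-')
                .replace(' @,@ ', ','))
    # Add spaces around /, # and newlines
    text = re.sub(r'([/#\n])', r' \1 ', text)
    # Collapse runs of spaces into a single space
    return re.sub(' {2,}', ' ', text)

# Hypothetical input line in WikiText style
sample = "the 1 @,@ 000 @-@ year <unk> of 3 @.@ 5 km/h"
cleaned = fix_wikitext_markers(sample)
print(cleaned)  # the 1,000-year xxunk of 3.5 km / h
```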

In [0]:
def tokenize(data, **kwargs):
  "Fit a Keras Tokenizer on `data` and return it with the texts encoded as integer sequences."
  tokenizer = tf.keras.preprocessing.text.Tokenizer(**kwargs)
  tokenizer.fit_on_texts(data)
  return tokenizer, tokenizer.texts_to_sequences(data)
  

In [0]:
data = [preprocess(x) for x in data[0:500]]

In [0]:
tokenizer, result = tokenize(data,oov_token='xxunk',num_words=50)
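
To give a feel for what `num_words` and `oov_token` do in the call above, here is a simplified pure-Python sketch of frequency-ranked indexing. It is not the Keras implementation: it splits on whitespace only and skips Keras's character filtering, and `build_word_index` and `encode` are hypothetical helper names. It does follow the same convention of reserving index 1 for the out-of-vocabulary token and keeping only words whose index is below `num_words`.

```python
from collections import Counter

def build_word_index(texts, oov_token='xxunk'):
    # Count whitespace-separated words across all texts
    counts = Counter(word for text in texts for word in text.split())
    # Index 1 is reserved for the OOV token; more frequent words get lower indices
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        if word not in word_index:
            word_index[word] = len(word_index) + 1
    return word_index

def encode(text, word_index, num_words, oov_token='xxunk'):
    oov_id = word_index[oov_token]
    ids = []
    for word in text.split():
        idx = word_index.get(word)
        # Unknown words and words ranked at or beyond num_words map to the OOV id
        ids.append(idx if idx is not None and idx < num_words else oov_id)
    return ids

# Hypothetical toy corpus
texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(texts)
encoded = encode("the dog sat", word_index, num_words=4)
print(word_index)
print(encoded)  # [2, 1, 1]
```

With `num_words=50`, as in the cell above, most tokens of a Wikipedia-sized vocabulary would map to `xxunk`, so that value is presumably a placeholder for quick experiments and not a setting for real training.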

# Create TensorFlow Dataset