<style>
/* change background settings */
div.slide-background {
	border-bottom: 30px crimson solid;
}
</style>

# Universal Language Model Fine-Tuning (ULMFiT)
## State-of-the-Art in Text Analysis
<br> <br>
Sandra Faltl, Michael Schimpke & Constantin Hackober

## Table of Contents

**1. Introduction**

**2. General-Domain Language Model Pretraining**
    1. AWD-LSTM
    2. Implementation
**3. Target Task Language Model Fine-Tuning**
    1. Model Overview
    2. Fine-Tuning Methods
    3. Implementation
**4. Target Task Classifier**
    1. Classifier
    2. Implementation
    3. Results
**5. Our Model Extension**


In [3]:
dfTrn=pd.read_csv('C:/Sandra/Dokumente_Sandra/ULMFiT/lm/train.csv', header=None)
tokTrn = np.load('C:/Sandra/Dokumente_Sandra/ULMFiT/lm/tokTrn.npy')
trnLm = np.load('C:/Sandra/Dokumente_Sandra/ULMFiT/lm/trnIds.npy')
valLm = np.load('C:/Sandra/Dokumente_Sandra/ULMFiT/lm/valIds.npy')
with open('C:/Sandra/Dokumente_Sandra/ULMFiT/lm/itos.pkl', 'rb') as pickle_file:
    itos = pickle.load(pickle_file)
stoi = collections.defaultdict(lambda:0, 
                               {v:k for k,v in enumerate(itos)})

In [4]:
dfTrn.iloc[40,1]

'@USAirways has the WORST customer service and their flights are always delayed'

In [5]:
tokTrn[40]

['\n',
 'xbos',
 '@usairways',
 'has',
 'the',
 't_up',
 'worst',
 'customer',
 'service',
 'and',
 'their',
 'flights',
 'are',
 'always',
 'delayed']

In [6]:
trnLm[40]

[2, 3, 20, 103, 7, 8, 187, 66, 52, 18, 217, 80, 48, 352, 92]

In [7]:
itos[100:105]

['back', 'or', 'had', 'has', 'as']

In [8]:
{k: stoi[k] for k in list(stoi)[100:105]}

{'back': 100, 'or': 101, 'had': 102, 'has': 103, 'as': 104}

In [9]:
# Set parameters
em_sz, nh, nl = 400, 1150, 3

# Load WikiText103 itos
with open('C:/Sandra/Dokumente_Sandra/ULMFiT/wt103/itos_wt103.pkl', 'rb') as pickle_file:
    itosWiki = pickle.load(pickle_file)

# Calculate WikiText103 stoi    
stoiWiki = collections.defaultdict(lambda:-1, {v:k for k,v in enumerate(itosWiki)})

In [10]:
itosWiki[500:505]

['college', '17', 'construction', 'should', 'award']

In [11]:
{k: stoiWiki[k] for k in list(stoiWiki)[500:505]}

{'college': 500, '17': 501, 'construction': 502, 'should': 503, 'award': 504}

In [12]:
# Load WikiText103 weights
wgts = torch.load('C:/Sandra/Dokumente_Sandra/ULMFiT/wt103/fwd_wt103.h5', 
                  map_location=lambda storage, loc: storage)
encWgts = wgts['0.encoder.weight'].numpy()
encWgts.shape

(238462, 400)

In [13]:
len(itos)

4409

In [14]:
# Calculate mean of weights
rowM = np.mean(encWgts, axis = 0)

# Create new embedding matrix
newWm = np.zeros((len(itos), em_sz), dtype=np.float32)

for i,w in enumerate(itos):
    r = stoiWiki[w]
    newWm[i] = encWgts[r] if r>=0 else rowM

In [15]:
# Save new weights
wgts['0.encoder.weight'] = T(newWm)
wgts['0.encoder.weight'].shape

torch.Size([4409, 400])

In [16]:
# Set parameters
bptt = 70
bs = 64

# Create data loader
trnDl = LanguageModelLoader(np.concatenate(trnLm), bs, bptt)
valDl = LanguageModelLoader(np.concatenate(valLm), bs, bptt)

In [17]:
# Initialize LanguageModelData class
md = LanguageModelData('C:/Sandra/Dokumente_Sandra/ULMFiT', 1, len(itos), trnDl, valDl, 
                       bs=bs, bptt=bptt)

In [18]:
# Set regularization and optimization parameters and build architecture
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])*0.7
learner = md.get_model(opt_fn, em_sz, nh, nl, 
                       dropouti = drops[0], dropout=drops[1], wdrop=drops[2], 
                       dropoute = drops[3], dropouth=drops[4])

## 3.1 Language Model Overview

Our classification model is built in three steps. First, we pretrain a language model on the WikiText-103 dataset so that it can learn the basics of language from this large general-domain corpus. This lets us profit from the proper grammar and style of Wikipedia, which our Twitter dataset lacks.
...

After matching the Wikipedia vocabulary to our own (<i><b>see Sandra's matching with row means for the Twitter vocabulary</b></i>), we train the model on our target-task Twitter dataset. At this point, we face one of the key problems of transfer learning: catastrophic forgetting of the pretrained knowledge. To prevent it, we apply several methods that allow us to actually transfer the language knowledge from the WikiText model.

1. Freezing

We start training our model with the three pretrained LSTM layers. But because we shortened the vocabulary and also added new words, part of our embedding and softmax layer is untrained: the rows for new words are initialized with the row means of the pretrained embedding matrix instead of learned weights. If we trained the entire model right away, we would risk catastrophic forgetting in the three LSTM layers because of the untrained embedding and softmax layer. So in a first step, we freeze the weights of the LSTM layers and train the model for one epoch, during which only the embedding and softmax parameters are updated.
In the second step, all layers are unfrozen so that the LSTM layers can also adapt to the new dataset.

2. Discriminative Fine-Tuning

As already stated, we use the stacked architecture suggested by Howard and Ruder because of the complexity of text data. <i><b>Yosinski et al. 2014</b></i> found that such an architecture captures different types of information at each level, starting with general information in the first layer and growing more specific with every further layer. In the context of text data, the first layer might capture information such as basic sentence structure, while the later ones dig deeper into word meanings.

We address this by using a different learning rate for each layer. The learning rate starts small in the first LSTM layer and rises through the layers, reflecting the growing specificity and complexity of the information each layer holds.

3. Learning Rate Schedule

In 2015, Leslie Smith introduced cyclical triangular learning rates (<i><b>Smith 2015</b></i>). He tested them on the CIFAR and ImageNet datasets with various models and compared them to standard learning rate schedules. The approach improved accuracy in some cases and did not hurt the results in the others. For the ULMFiT model, Howard and Ruder adapt this approach to text data by using a one-cycle learning rate schedule with a short, steep increase and a long decay period, which they call the <b>slanted triangular learning rate</b>.

## 3.1 Language Model Overview

<img src="images/Modell_Overview.png" width="1600" style="float:center;"/>

We will explain the model by channeling a tweet through it step by step. The example tweet begins with "My flight was on" and will be the basis for our language model prediction.

Each input word is matched with its token ID in the vocabulary of our dataset. The IDs are then one-hot encoded into an $n \times 1 \times vs$ tensor, where $n = 4$ is the length of the input sentence and $vs = 4409$ is the size of the vocabulary of our target dataset.

Because the words are fed in as an ordered sequence, our model knows each word's position in the sentence.

<b>Embedding Layer</b>
In the first layer, the embedding layer, words are translated into latent features. Because Howard and Ruder provide the weights of the Wikipedia language model, which is time-intensive to pretrain, we have to stick to their model architecture: an embedding size of 400 and three stacked AWD-LSTM layers with 1150 hidden activations each.

In the embedding layer, each vector (representing one word) of our input tensor is multiplied with the pretrained embedding matrix of size $4409 \times 400$, yielding a $4 \times 1 \times 400$ tensor. This tensor is then the input to our first LSTM.
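
As a minimal sketch of this step (toy sizes and made-up numbers, not the pretrained weights), the snippet below multiplies one-hot vectors with a small embedding matrix and shows that the result is simply the matching rows of that matrix. The names `one_hot`, `vec_times_matrix`, `vocab_size` and `emb_size` are illustrative assumptions; in our model the sizes are 4409 and 400.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_times_matrix(row, matrix):
    """Multiply a row vector (length vs) with a (vs x emb) matrix."""
    n_cols = len(matrix[0])
    return [
        sum(row[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(n_cols)
    ]


# Toy sizes (assumption); the slides use vocab_size = 4409, emb_size = 400.
vocab_size, emb_size = 6, 3

# Made-up embedding matrix: row i holds the latent features of token i.
embedding = [[float(10 * i + j) for j in range(emb_size)]
             for i in range(vocab_size)]

# Four token IDs standing in for a four-word input.
token_ids = [2, 5, 0, 3]

# One-hot encode each token and multiply with the embedding matrix.
embedded = [vec_times_matrix(one_hot(t, vocab_size), embedding)
            for t in token_ids]

# The product just selects the row that belongs to each token.
lookup = [embedding[t] for t in token_ids]
print(embedded == lookup, len(embedded), len(embedded[0]))
```

Because the product only selects rows, frameworks usually implement the embedding layer as an index lookup rather than an actual matrix multiplication.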





## 3.1 Language Model Overview

<img src="images/Modell_Detail_0.png" width="800" style="float:center;"/>

## 3.1 Language Model Overview

<img src="images/Modell_Detail_1.png" width="1600" style="float:center;"/>

## 3.1 Language Model Overview

<img src="images/Modell_Detail_2.png" width="1600" style="float:center;"/>

## 3.1 Language Model Overview

<img src="images/Modell_Detail_3.png" width="1600" style="float:center;"/>

## 3.1 Language Model Overview


<img src="images/Modell_Detail_4.png" width="1600" style="float:center;"/>

<img src="images/Modell_Overview_LM_Finetuning.png" width="900" align="left">

## 3.2 Fine-Tuning Methods

<img src="images/Overview_LM_FT.png" width="900" align="left">

## 3.2 Fine-Tuning Methods: Freezing
<html>
<img src="images/Freezing_LM.png" width="350" style="float: right;"/>

<p style="margin-top: 22%"><font size=4>
Training all layers at once risks catastrophic forgetting</font></p>
<p style="margin-top: 3%"><font size=4>
First epoch only updates the weights in the last layer</font></p>
<p style="margin-top: 3%"><font size=4>
From the second epoch on, all layers are unfrozen</font></p>
</html>
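
The freezing schedule can be sketched in plain Python. The helper below is a hypothetical stand-in, not the fastai implementation: it assumes that `freeze_to(n)` marks every layer group before index `n` as frozen and the remaining groups as trainable, which is how the calls are used in this notebook. The group names are made up for illustration.

```python
def trainable_groups(group_names, n):
    """Return {group name: trainable?} after freezing all groups before index n."""
    flags = {name: False for name in group_names}   # freeze everything first
    for name in group_names[n:]:                    # then re-enable from index n on
        flags[name] = True
    return flags


# Hypothetical layer groups of the language model, lowest layer first.
lm_groups = ["lstm_1", "lstm_2", "lstm_3", "embedding_and_softmax"]

# First epoch: freeze_to(-1) leaves only the last group trainable.
first_epoch = trainable_groups(lm_groups, -1)

# Afterwards: unfreezing corresponds to freeze_to(0), everything is trainable.
later_epochs = trainable_groups(lm_groups, 0)

for name in lm_groups:
    print(name, first_epoch[name], later_epochs[name])
```

The same slicing logic also covers the gradual unfreezing of the classifier, where `-1`, `-2` and finally a full unfreeze are applied one after another.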

## 3.2 Fine-Tuning Methods: Discriminative Fine-Tuning

<html>
<img src="images/Discriminative_FT.png" width="350" style="float: right;"/>  

<p style="margin-top: 22%"><font size=4>
    <b>Stochastic gradient descent:</b><br>
$
\begin{align}    
\theta_{t} = \theta_{t-1} - \eta \times \nabla_{\theta} J(\theta)
\end{align}
$
    </font></p>   
<p style="margin-top: 3%"><font size=4>
<b> Discriminative fine-tuning:</b><br>
$
\begin{align}    
\theta_{t}^{l} = \theta_{t-1}^{l} - \eta^{l} \times \nabla_{\theta^{l}} J(\theta)
\end{align}
$
</font></p>  
</html>
 
We use a stacked model structure containing three LSTM hidden layers. We adopt this form, suggested by Howard and Ruder, because of the complexity of text data. As <i><b>Yosinski et al. 2014</b></i> found, such an architecture captures different types of information at each level, starting with general information in the first layer and growing more specific with every further layer.
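
To make the update rule concrete, here is a small sketch in plain Python. It assumes, as in the learning-rate arrays of this notebook, that each lower layer group gets the learning rate of the group above it divided by 2.6; the helper names `discriminative_lrs` and `sgd_step` and the toy parameters and gradients are illustrative only.

```python
def discriminative_lrs(lr_top, n_groups, factor=2.6):
    """Learning rate per layer group, lowest group first, top group gets lr_top."""
    return [lr_top / factor ** (n_groups - 1 - i) for i in range(n_groups)]


def sgd_step(params, grads, lrs):
    """One plain SGD update per group: theta^l <- theta^l - eta^l * grad^l."""
    return [
        [p - lr * g for p, g in zip(group_params, group_grads)]
        for group_params, group_grads, lr in zip(params, grads, lrs)
    ]


# Four layer groups, top learning rate 1e-2 (values as in the LM fine-tuning cell).
lrs = discriminative_lrs(lr_top=1e-2, n_groups=4)

# Toy parameters and identical toy gradients for every group.
params = [[1.0], [1.0], [1.0], [1.0]]
grads = [[0.5], [0.5], [0.5], [0.5]]
updated = sgd_step(params, grads, lrs)

for lr, group in zip(lrs, updated):
    print(f"lr={lr:.6f} -> parameter {group[0]:.6f}")
```

With identical gradients, the lowest group moves the least and the top group the most, which is exactly the effect discriminative fine-tuning is meant to have.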

## 3.2 Fine-Tuning Methods: Learning Rate Schedule

<html>
<img src="images/Slanted_Triangular_Learning_Rates.png" width="350" style="float: right;"/>  

<p style="margin-top: 22%"><font size=4>
<b>Cyclical Learning Rates</b> introduced in 2015</font></p>
<p style="margin-top: 3%"><font size=4>
Application of slanted one-cycle learning rate (<b>Slanted triangular learning rate</b>)</font></p>
</html>
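
The shape of the slanted triangular learning rate can be reproduced with a few lines of Python. The sketch follows the schedule as defined by Howard and Ruder; the defaults `cut_frac=0.1`, `ratio=32` and `lr_max=0.01` are the values suggested in their paper, not values tuned for our Twitter data, and the function name `stlr` is our own.

```python
import math


def stlr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at training step t.

    Assumes total_steps * cut_frac >= 1 so that the warm-up has at least one step.
    """
    cut = math.floor(total_steps * cut_frac)       # step at which the peak is reached
    if t < cut:
        p = t / cut                                # short linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio


total_steps = 1000
schedule = [stlr(t, total_steps) for t in range(total_steps)]
peak_step = max(range(total_steps), key=lambda t: schedule[t])

print("start:", schedule[0])
print("peak at step", peak_step, "with lr", schedule[peak_step])
print("end:", schedule[-1])
```

The learning rate starts at `lr_max / ratio`, climbs to `lr_max` during the first tenth of the steps and then decays back towards the starting value over the remaining steps.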


In [None]:
## Fine-Tuning the Language Model

#Freeze all the layers but the last
learner.freeze_to(-1)

#Load model weights, set initial learning rate and weight decay
learner.model.load_state_dict(wgts)
lr=1e-3
wd = 1e-7

learner.fit(lr, 1, wds=wd, use_clr=(32,2), cycle_len=1) 

<table style="width:100%" align="center">
  <tr>
    <th>LM Fine-Tuning</th>
    <th></th> 
    <th></th>
    <th></th>
  </tr>
  <tr>
    <th>epoch</th>
    <th>training loss</th>
    <th>validation loss</th> 
    <th>accuracy</th>
  </tr>
  <tr>
    <td>1</td>
    <td>6.240857</td> 
    <td>5.909655</td>
    <td>0.118566</td>
  </tr>
</table>

In [None]:
#Unfreeze all layers
learner.unfreeze()

#Find good learning rate
learner.lr_find(lr/1000)

#Plot Loss depending on learning rates
learner.sched.plot()


<img src="images/LR_find_LM.png" width="400" style="float: left"/>  


In [None]:
#Set new learning rate according to plot
lr = 1e-2

#Setting array for discriminative fine-tuning
lrm = 2.6
lrs = np.array([lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])

learner.fit(lrs, 1, wds=wd, use_clr=(20,10), cycle_len=3)

<table style="width:100%" align="center">
  <tr>
    <th>LM Fine-Tuning</th>
    <th></th> 
    <th></th>
    <th></th>
  </tr>
  <tr>
    <th>epoch</th>
    <th>training loss</th>
    <th>validation loss</th> 
    <th>accuracy</th>
  </tr>
  <tr>
    <td>1</td>
    <td>4.351183</td> 
    <td>3.914363</td>
    <td>0.274198</td>
  </tr>
  <tr>
    <td>2</td>
    <td>3.938936</td> 
    <td>3.795009</td>
    <td>0.288417</td>
  </tr>
    <tr>
    <td>3</td>
    <td>3.681302</td> 
    <td>3.757219</td>
    <td>0.292195</td>
  </tr>
</table>

<img src="images/Encoder_Cutout.png" width="800" style="float: center;"/>  

In [None]:
#Save encoder for classifier model (loaded later as 'lm2_enc')
learner.save_encoder('lm2_enc')

#Plot Losses over Time
learner.sched.plot_loss()

<img src="images/Plot_loss_LM.png" width="400" style="float: left"/>  

## 4.1 Classifier

<img src="images/Inhalt_Classifier.png" width="900" align="left">

## 4.1 Classifier

<img src="images/Modell_Overview_Classifier.png" width="800" align="mid" style="margin-top:50px">

## 4.1 Classifier 
<img src="images/Modell_Detail_Concat_Pooling.png" width="800" align="mid" style="margin-top:50px">

## 4.1 Classifier: Concat Pooling

<img src="images/Modell_Overview_Concat_Pooling.png" width="800" align="mid" style="margin-top:50px">



## 4.1 Classifier: Concat Pooling 

<ul>
    <p>
    <font size="3"><b>H</b> = {<b>h</b><sub>1</sub>,...,<b>h</b><sub>T</sub>} :</font><br></p>
    <p><font size="3" style="margin-left:50px; margin-bottom:200px"><b>h</b><sub>c</sub> = [<b>h</b><sub>T</sub>, maxpool(<b>H</b>), meanpool(<b>H</b>)]</font></p>
    <img src="images/Concat_Pooling.png" width="800" align="mid" style="margin-top:50px">
</ul>
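
A worked example of concat pooling on toy numbers, written in plain Python. The function name `concat_pool` and the hidden states are made up for illustration; the real model pools over the outputs of the last LSTM layer.

```python
def concat_pool(hidden_states):
    """h_c = [h_T, maxpool(H), meanpool(H)] for H = [h_1, ..., h_T]."""
    last = hidden_states[-1]                                   # h_T
    columns = list(zip(*hidden_states))                        # one tuple per feature
    max_pool = [max(col) for col in columns]                   # max over time
    mean_pool = [sum(col) / len(col) for col in columns]       # mean over time
    return list(last) + max_pool + mean_pool


# Three time steps (T = 3) with two features each.
H = [
    [1.0, -2.0],
    [3.0, 0.0],
    [2.0, 4.0],
]

h_c = concat_pool(H)
print(h_c)        # last state, then max per feature, then mean per feature
print(len(h_c))   # three times the size of a single hidden state
```

The pooled vector is three times as long as a single hidden state, which is why the first classifier layer is given an input size of `em_sz*3`.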

## 4.1 Classifier 

<img src="images/Modell_Overview_Classifier.png" width="800" align="mid" style="margin-top:50px">

In [None]:
#Get encoded Tweets
itos = pickle.load((lmPath/'itos.pkl').open('rb'))
stoi = collections.defaultdict(lambda:0, {v:k for k,v in 
                                          enumerate(itos)})

trn_clas = np.array([[stoi[o] for o in p] for p in tok_trn])
val_clas = np.array([[stoi[o] for o in p] for p in tok_val])

trn_ds = TextDataset(trn_clas, trnLabels)
val_ds = TextDataset(val_clas, valLabels)

md = ModelData('gdrive/My Drive/Colab neu', trn_clas, val_clas)

In [None]:
#Set Hyperparameters 
bptt,em_sz,nh,nl = 70,400,1150,3
vs = len(itos)
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
bs = 64
c = 3

In [None]:
#Set Dropouts 
dps = np.array([0.4,0.5,0.05,0.3,0.4])*0.7

In [None]:
#Define our Classifier Model 
def get_rnn_classifier(bptt, max_seq, n_class, n_tok, emb_sz, n_hid,
                       n_layers, pad_token, layers, drops, bidir=False,
                       dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5):
    
    rnn_enc = MultiBatchRNN(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, 
                            pad_token=pad_token, bidir=bidir,
                            dropouth=dropouth, dropouti=dropouti, 
                            dropoute=dropoute, wdrop=wdrop)
    
    return SequentialRNN(rnn_enc, PoolingLinearClassifier(layers, drops))



In [None]:
#Classifier Model 
m = get_rnn_classifier(bptt, 1000, c, vs, emb_sz=em_sz, n_hid=nh, 
                      n_layers=nl, pad_token=1,
                      layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
                      dropouti=dps[0], wdrop=dps[1],        
                      dropoute=dps[2], dropouth=dps[3])

## 4.1 Classifier

<img src="images/Modell_Overview_Classifier.png" width="800" align="mid" style="margin-top:50px">

In [None]:
#Create a Learner 
learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.clip = .25
learn.metrics = [accuracy]

In [None]:
#Set learning rate schedule
lr = 3e-3
lrm = 2.6
lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])

In [None]:
#Set weight decay
wd = 1e-7

## 4.1 Classifier: Gradual Unfreezing 

<img src="images/Gradual_Unfreezing.png" width="900" align="left">

In [None]:
#Load encoder from language model
learn.load_encoder('lm2_enc')

#Unfreeze last layer
learn.freeze_to(-1)
learn.lr_find(lrs/1000)

#Plot learning schedule
learn.sched.plot()


<img src="images/learn.sched(-1).png" width="400" align="left" style="margin-top:50px">

In [None]:
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))


<table style="width:100%" align="center">
  <tr>
    <th>Classifier Fine-Tuning</th>
    <th></th> 
    <th></th>
    <th></th>
  </tr>
  <tr>
    <th>epoch</th>
    <th>training loss</th>
    <th>validation loss</th> 
    <th>accuracy</th>
  </tr>
  <tr>
    <td>1</td>
    <td>0.686093</td> 
    <td>0.521651</td>
    <td>0.792782 </td>
  </tr>
</table>

In [None]:
#Unfreeze both Classifier Layers
learn.freeze_to(-2)
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))





<table style="width:100%" align="center">
  <tr>
    <th>Classifier Fine-Tuning</th>
    <th></th> 
    <th></th>
    <th></th>
  </tr>
  <tr>
    <th>epoch</th>
    <th>training loss</th>
    <th>validation loss</th> 
    <th>accuracy</th>
  </tr>
  <tr>
    <td>1</td>
    <td>0.659148</td> 
    <td>0.490926</td>
    <td>0.806697</td>
  </tr>
</table>

In [None]:
#Unfreeze all Layers
learn.unfreeze()
learn.fit(lrs, 1, wds=wd, cycle_len=5, use_clr=(32,10))




<table style="width:100%" align="center">
  <tr>
    <th>Classifier Fine-Tuning</th>
    <th></th> 
    <th></th>
    <th></th>
  </tr>
  <tr>
    <th>epoch</th>
    <th>training loss</th>
    <th>validation loss</th> 
    <th>accuracy</th>
  </tr>
  <tr>
    <td>1</td>
    <td>0.657155</td> 
    <td>0.4779 </td>
    <td>0.818361 </td>
  </tr>
  <tr>
    <td>2</td>
    <td>0.602397</td> 
    <td>0.483012</td>
    <td>0.825041</td>
  </tr>
    <tr>
    <td>3</td>
    <td>0.556655</td> 
    <td>0.456581</td>
    <td>0.829689 </td>
  </tr>
    <tr>
    <td>4</td>
    <td>0.504563</td> 
    <td>0.462918</td>
    <td>0.831911</td>
  </tr>
    <tr>
    <td>5</td>
    <td>0.468433</td> 
    <td>0.456508</td>
    <td>0.830113</td>
  </tr>
    <tr>
    <td>6</td>
    <td>0.45835</td> 
    <td>0.454914</td>
    <td>0.833168</td>
  </tr>
</table>

In [None]:
#Plot losses over time
learn.sched.plot_loss()

#Save classifier
learn.save('clas_2')


<img src="images/learn.sched_final.png" width="400" style="float: left"/>  


## 4.3 Results: Twitter Airline Sentiment Classification    
<img src="images/kaggle_logo.png" width="250" style="float: right"> 

<ul>
    <br><br>
    <li>Support Vector Machine (SVM) - 78.5%</li>
    <li>bag-of-words SVM - 78.5%</li>
    <li>Deep Learning Model with Dropouts in Keras - 77.9%</li>
    <li>Our result - <b>83.9%</b></li>
</ul>


## 5 Our Model Extension: Overview
<img src="images/addition.png" width="900" align="left">


## 5 Our Model Extension: Results - Classifier

<table style="width:45%" align="left">
  <tr>
    <th>ULMFiT - Original</th>
    <th></th> 
    <th></th>
  </tr>
  <tr>
    <th>training loss</th>
    <th>validation loss</th> 
    <th>accuracy</th>
  </tr>
  <tr>
    <td>0.458</td>
    <td>0.455</td> 
    <td>0.833</td>
  </tr>
</table>

<table style="width:45%; float: right">
  <tr>
    <th>ULMFiT - Adapted</th>
    <th></th> 
    <th></th>
  </tr>
  <tr>
    <th>training loss</th>
    <th>validation loss</th> 
    <th>accuracy</th>
  </tr>
  <tr>
    <td>0.522 </td>
    <td>0.449</td> 
    <td>0.835</td>
  </tr>
</table>

#### Interesting Observation in Respective LMs

<table style="width:45%" align="left">
  <tr>
    <th>ULMFiT - Original</th>
    <th></th> 
    <th></th>
  </tr>
  <tr>
    <th>training loss</th>
    <th>validation loss</th> 
    <th>accuracy</th>
  </tr>
  <tr>
    <td>6.241</td>
    <td>5.910</td> 
    <td>0.119</td>
  </tr>
</table>

<table style="width:45%; float: right">
  <tr>
    <th>ULMFiT - Adapted</th>
    <th></th> 
    <th></th>
  </tr>
  <tr>
    <th>training loss</th>
    <th>validation loss</th> 
    <th>accuracy</th>
  </tr>
  <tr>
    <td>4.977</td>
    <td>4.628</td> 
    <td>0.246</td>
  </tr>
</table>