#### Transfer Learning in NLP - ULMFiT
Authors: Vikas Kumar (vikkumar@deloitte.com) | Abhishek Aditya Kashyap (abhikashyap@deloitte.com)

**References:**
* https://github.com/fastai/fastai/blob/master/examples/ULMFit.ipynb
* https://github.com/fastai/course-nlp/blob/master/5-nn-imdb.ipynb
* http://nlp.fast.ai/classification/2018/05/15/introducing-ulmfit.html

### Language Modeling & Sentiment Analysis of IMDB movie reviews

We will be looking at IMDB movie reviews.  We want to determine if a review is negative or positive, based on the text.  In order to do this, we will be using **transfer learning**.

Transfer learning has been widely used with great success in computer vision for several years, but only in the last year or so has it been successfully applied to NLP, beginning with ULMFiT (which we will use here) and later built upon by BERT and GPT-2.

As Sebastian Ruder wrote in [The Gradient](https://thegradient.pub/) last summer, [NLP's ImageNet moment has arrived](https://thegradient.pub/nlp-imagenet/).

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

Set `sample=True` to run on a small random subset of the data, then import the fastai libraries.

In [2]:
sample=False

In [3]:
from fastai import *
from fastai.text import *

**SELECTING DEVICE: CPU/CUDA**

In [4]:
defaults.device = torch.device('cuda',0) if torch.cuda.is_available() else torch.device('cpu')
defaults.device

device(type='cuda', index=0)

In [5]:
DATA_PATH = Path('/kaggle/input/usinlppracticum/')
DATA_PATH.ls()

[PosixPath('/kaggle/input/usinlppracticum/imdb_test.csv'),
 PosixPath('/kaggle/input/usinlppracticum/imdb_train.csv'),
 PosixPath('/kaggle/input/usinlppracticum/sample_submission.csv')]

#### Data preparation for the language model

In [6]:
lm_data=pd.read_csv(DATA_PATH/'imdb_train.csv')
lm_data.head()

Unnamed: 0,review,sentiment
0,We had STARZ free weekend and I switched on th...,negative
1,I'll admit that this isn't a great film. It pr...,negative
2,I finally found a version of Persuasion that I...,positive
3,The BBC surpassed themselves with the boundari...,positive
4,"Much praise has been lavished upon Farscape, b...",negative


In [7]:
lm_data1=pd.read_csv(DATA_PATH/'imdb_train.csv')
lm_data1['sentiment']=0
lm_data2=pd.read_csv(DATA_PATH/'imdb_test.csv')
lm_data2['sentiment']=0
lm_data= pd.concat([lm_data1, lm_data2], ignore_index=True, sort=False)
lm_data=lm_data[['review','sentiment']]
lm_data.to_csv('lm_data.csv',index=False)
lm_data.shape

(50000, 2)

In [8]:
if sample:
    lm_data=pd.read_csv('lm_data.csv').sample(10000).reset_index(drop=True)
else:
    lm_data=pd.read_csv('lm_data.csv')
#------------
lm_data.head()

Unnamed: 0,review,sentiment
0,We had STARZ free weekend and I switched on th...,0
1,I'll admit that this isn't a great film. It pr...,0
2,I finally found a version of Persuasion that I...,0
3,The BBC surpassed themselves with the boundari...,0
4,"Much praise has been lavished upon Farscape, b...",0


### Splitting language model data

In [9]:
from sklearn.model_selection import train_test_split

train_lm, val_lm = train_test_split(lm_data,test_size=0.10)
train_lm.shape,val_lm.shape

((45000, 2), (5000, 2))

### Creating the TextLMDataBunch
This is where the unlabelled data is going to be useful to us, as we can use it to fine-tune our model. Let's create our data object with the data block API (next line takes a few minutes).

We first have to convert words to numbers. This is done in two different steps:
* tokenization
* numericalization

A `TextDataBunch` does all of that behind the scenes for you.

In [10]:
data_lm = TextLMDataBunch.from_df(DATA_PATH, train_lm,val_lm,text_cols='review', label_cols='sentiment')
data_lm.save('/kaggle/working/data_lm_export.pkl')

### Tokenization
The first step of processing we make texts go through is to split the raw sentences into words, or more exactly tokens. The easiest way to do this would be to split the string on spaces, but we can be smarter:

- we need to take care of punctuation
- some words are contractions of two different words, like isn't or don't
- we may need to clean some parts of our texts, if there's HTML code for instance

To see what the tokenizer has done behind the scenes, let's have a look at a few texts in a batch.

The texts are truncated at 100 tokens for readability. We can see that it did more than just split on spaces and punctuation symbols:
- the "'s" are grouped together in one token
- the contractions are separated like this: "did", "n't"
- content has been cleaned of any HTML symbols and lower-cased
- there are several special tokens (all those that begin with xx), to replace unknown tokens (see below) or to introduce different text fields (here we only have one).
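The rules listed above can be imitated with a few regular expressions. This is only a rough sketch to make the behaviour concrete: `toy_tokenize` is a hypothetical helper written for this note, not fastai's tokenizer, and it covers only the cases named in the list (HTML tags, `n't` and `'s`, punctuation, and the `xxbos`, `xxmaj` and `xxup` markers that appear in the vocabulary).

```python
import re

def toy_tokenize(text):
    """Rough imitation of the tokenization rules described above (not fastai's code)."""
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML tags such as <br />
    text = re.sub(r"n't\b", " n't", text)        # "didn't" -> "did n't"
    text = re.sub(r"'s\b", " 's", text)          # "Adam's" -> "Adam 's"
    text = re.sub(r"([.,!?()])", r" \1 ", text)  # put spaces around punctuation

    tokens = ["xxbos"]                           # marks the beginning of a text
    for word in text.split():
        if len(word) > 1 and word.isupper():
            tokens.append("xxup")                # next token was written in all caps
        elif len(word) > 1 and word[0].isupper() and word[1:].islower():
            tokens.append("xxmaj")               # next token was capitalized
        tokens.append(word.lower())
    return tokens

print(toy_tokenize("This movie isn't GOOD.<br />I didn't like Adam's acting!"))
```

Keeping capitalization as a separate marker token lets the vocabulary store a single lower-cased entry per word while still preserving the information that the word was capitalized.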

### Numericalization
Once we have extracted tokens from our texts, we convert them to integers by creating a list of all the words used. We only keep the ones that appear at least twice, with a maximum vocabulary size of 60,000 (by default), and replace the ones that don't make the cut with the unknown token `xxunk`.

The correspondence from ids to tokens is stored in the `vocab` attribute of our datasets, in a list called `itos` (for int to string).
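As a minimal sketch of the numericalization step, the snippet below builds `itos` and `stoi` from a tiny tokenized corpus. `build_vocab` and `numericalize` are hypothetical helpers, and only two special tokens are reserved here for brevity; the frequency threshold and vocabulary cap follow the defaults quoted in the text.

```python
from collections import Counter, defaultdict

def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2, specials=("xxunk", "xxpad")):
    """Return (itos, stoi) for a list of token lists."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    special_set = set(specials)
    # Most frequent first; drop rare tokens and anything already reserved as special
    frequent = [tok for tok, freq in counts.most_common()
                if freq >= min_freq and tok not in special_set]
    itos = list(specials) + frequent[: max_vocab - len(specials)]
    # Unknown tokens fall back to index 0, which is where "xxunk" lives
    stoi = defaultdict(int, {tok: idx for idx, tok in enumerate(itos)})
    return itos, stoi

def numericalize(tokens, stoi):
    """Map each token to its integer id."""
    return [stoi[tok] for tok in tokens]

texts = [["the", "movie", "was", "great"], ["the", "movie", "was", "bad"]]
itos, stoi = build_vocab(texts)
ids = numericalize(texts[0], stoi)
print(itos)
print(ids)
```

Because "great" and "bad" each occur only once, both are mapped to the id of `xxunk`, which is the same thing that happens to a rare word such as "mobula" in the real vocabulary.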

In [11]:
data_lm.vocab.itos[:10]

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the']

#### If we look at what's in our datasets, we'll see the tokenized text as a representation:

In [12]:
data_lm.train_ds[0][0]

Text xxbos xxmaj this is possibly the worst movie i have ever seen . xxmaj can somebody please explain the plot of this movie to me ? xxmaj yes , i know the bus ran out of gas in the middle of the desert , after the driver never noticed that his compass was n't functioning , but what then ? xxmaj and how did it end ? xxmaj maybe i 'm to stupid to understand this movie , but to me it was an absolute waste of time . 
 
  xxmaj my recommendation ? xxmaj do not bother , there are far better movies to be seen . xxmaj this movie ranks with my other all time low - low 's ( xxmaj going overboard - xxmaj adam xxmaj sandler and xxmaj fire on the xxmaj amazon - xxmaj sandra xxmaj bullock )

### But the underlying data is all numbers

In [13]:
data_lm.train_ds[0][0].data[:100]

array([   2,    5,   20,   16, ...,   38, 1283,   11,   55])

In [14]:
len(data_lm.vocab.itos),len(data_lm.train_ds)

(60000, 45000)

In [15]:
data_lm.train_ds[0][0].data.shape

(143,)

In [16]:
data_lm.show_batch()

idx,text
0,"'m to stupid to understand this movie , but to me it was an absolute waste of time . \n \n xxmaj my recommendation ? xxmaj do not bother , there are far better movies to be seen . xxmaj this movie ranks with my other all time low - low 's ( xxmaj going overboard - xxmaj adam xxmaj sandler and xxmaj fire on the xxmaj amazon -"
1,were signed at this studio we might have seen a whole slew of xxmaj bradford films . xxbos xxmaj the xxmaj brain ( or head ) that xxmaj would n't xxmaj die is one of the more thoughtful low budget exploitation films of the early 1960s . xxmaj it is very difficult to imagine how a script this repulsively sexist could have been written without the intention of self -
2,"n't tell , this one relies on the viewer to work with it a little and put aside some petty ( see : major and blatant ) details . \n \n xxmaj overall though : xxmaj watch - able with mild bits of enjoyment . xxmaj note : xxmaj the xxmaj outpost is commonly known under the title ' xxmaj mind xxmaj ripper ' xxbos xxmaj this movie is"
3,". xxmaj hopefully we 'll have not just the intelligence , but the sense of shared responsibility to keep that from happening . xxbos i thought xxmaj harvey xxmaj keitel , a young , fresh from the xxmaj sex xxmaj pistols xxmaj john xxmaj lydon , then as a bonus , the music by xxmaj ennio xxmaj morricone . i expected an old - school , edgy , xxmaj italian"
4,"still manages to sell his songs to the audience , and that , after all , is what it is all about . xxmaj this is a faithful adaptation of the excellent book by xxmaj james xxmaj hilton , and deserves to be treasured for generations to come . i recommend this film for family viewing , though most men will consider this a ' chick ' flick . xxmaj"
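The rows in the batch above start mid-review and run across `xxbos` boundaries. One plausible explanation, sketched below, is that a language model loader joins all reviews into one long stream of token ids, cuts that stream into `bs` parallel rows, and reads the rows in windows of `bptt` tokens, with the targets being the inputs shifted by one position. `lm_batches` is a hypothetical helper written for illustration; fastai's own loader is more elaborate.

```python
def lm_batches(token_ids, bs, bptt):
    """Yield (inputs, targets); each is a list of `bs` rows of at most `bptt` ids."""
    n = (len(token_ids) - 1) // bs  # tokens per row, leaving one id for the last target
    # Each row keeps one extra id so the final window still has a target
    rows = [token_ids[i * n:(i + 1) * n + 1] for i in range(bs)]
    for start in range(0, n, bptt):
        end = min(start + bptt, n)
        inputs = [row[start:end] for row in rows]
        targets = [row[start + 1:end + 1] for row in rows]  # next-token targets
        yield inputs, targets

stream = list(range(21))  # stand-in for the concatenated token ids of all reviews
batches = list(lm_batches(stream, bs=2, bptt=5))
for inputs, targets in batches:
    print(inputs, targets)
```

Each target row is the input row moved one token to the right, which is exactly the "predict the next word" objective the language model is trained on.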


In [17]:
learn_lm = language_model_learner(data_lm, AWD_LSTM)

### Loading the wikitext vocab

In [18]:
import pickle
with open('/kaggle/input/wiki-vocab/itos_wt103.pkl', 'rb') as f:
    wiki_itos = pickle.load(f)

In [19]:
wiki_itos[:10]

['xxunk', 'xxpad', 'the', ',', '.', '\n', 'of', 'and', 'in', 'to']

In [20]:
vocab = data_lm.vocab

In [21]:
vocab.stoi["stingray"]

28746

In [22]:
vocab.itos[vocab.stoi["stingray"]]

'stingray'

In [23]:
vocab.itos[vocab.stoi["mobula"]]

'xxunk'

In [24]:
awd = learn_lm.model[0]
print(awd)

AWD_LSTM(
  (encoder): Embedding(60000, 400, padding_idx=1)
  (encoder_dp): EmbeddingDropout(
    (emb): Embedding(60000, 400, padding_idx=1)
  )
  (rnns): ModuleList(
    (0): WeightDropout(
      (module): LSTM(400, 1152, batch_first=True)
    )
    (1): WeightDropout(
      (module): LSTM(1152, 1152, batch_first=True)
    )
    (2): WeightDropout(
      (module): LSTM(1152, 400, batch_first=True)
    )
  )
  (input_dp): RNNDropout()
  (hidden_dps): ModuleList(
    (0): RNNDropout()
    (1): RNNDropout()
    (2): RNNDropout()
  )
)


In [25]:
enc = learn_lm.model[0].encoder

In [26]:
enc.weight.size()

torch.Size([60000, 400])

### Difference in vocabulary between IMDB and Wikipedia
We will compare the vocabulary from wikitext with the vocabulary in IMDB. It is to be expected that the two sets have some different vocabulary words, and that is no problem for transfer learning!

In [27]:
len(wiki_itos)

60002

In [28]:
len(vocab.itos)

60000

In [29]:
i, unks = 0, []
while len(unks) < 50:
    if data_lm.vocab.itos[i] not in wiki_itos: unks.append((i,data_lm.vocab.itos[i]))
    i += 1

In [30]:
wiki_words = set(wiki_itos)
imdb_words = set(vocab.itos)

In [31]:
wiki_not_imdb = wiki_words.difference(imdb_words)
imdb_not_wiki = imdb_words.difference(wiki_words)

In [32]:
wiki_not_imdb_list = []

# Take 100 arbitrary words; each popped word is added back so the set is left unchanged
for i in range(100):
    word = wiki_not_imdb.pop()
    wiki_not_imdb_list.append(word)
    wiki_not_imdb.add(word)

In [33]:
wiki_not_imdb_list[:15]

['stratovolcano',
 'sacrament',
 'e1',
 'obtainable',
 'wetland',
 'patriarchs',
 '973',
 'v12',
 'birchmeier',
 'kirkcaldy',
 'jas',
 'burchill',
 'westland',
 'lanthanides',
 'wintour']

In [34]:
imdb_not_wiki_list = []

for i in range(100):
    word = imdb_not_wiki.pop()
    imdb_not_wiki_list.append(word)
    imdb_not_wiki.add(word)

In [35]:
imdb_not_wiki_list[:15]

['stapled',
 'gein',
 'tewksbury',
 'gunsels',
 'lusts',
 'sono',
 'hanka',
 'story(s',
 'shootist',
 'friend"',
 'waw',
 'dirks',
 'defaults',
 'crocs',
 'dickensian']

### All words that appear in the IMDB vocab, but not the wikitext-103 vocab, will be initialized to the same random vector in a model. As the model trains, we will learn these weights.
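To make the idea concrete, here is a small sketch of carrying pretrained embedding rows over to a new vocabulary. `remap_embeddings` is a hypothetical helper, the vectors are toy values, and the choice of shared starting vector is an assumption of this sketch: it can be passed in explicitly, and defaults to the mean of the pretrained rows purely for illustration.

```python
def remap_embeddings(pretrained_rows, pretrained_itos, new_itos, fallback=None):
    """Build an embedding table for `new_itos` from rows indexed by `pretrained_itos`."""
    dim = len(pretrained_rows[0])
    if fallback is None:
        # Assumption for this sketch: unseen words all start from the mean pretrained row
        fallback = [sum(row[d] for row in pretrained_rows) / len(pretrained_rows)
                    for d in range(dim)]
    old_stoi = {tok: idx for idx, tok in enumerate(pretrained_itos)}
    return [list(pretrained_rows[old_stoi[tok]]) if tok in old_stoi else list(fallback)
            for tok in new_itos]

wiki_vocab = ["xxunk", "the", "sacrament"]
wiki_rows = [[0.0, 0.0], [1.0, 2.0], [5.0, 4.0]]    # toy 2-dimensional embeddings
imdb_vocab = ["xxunk", "the", "shootist", "crocs"]  # last two are not in wiki_vocab

imdb_rows = remap_embeddings(wiki_rows, wiki_vocab, imdb_vocab)
print(imdb_rows)
```

Words shared by both vocabularies keep their pretrained vectors, while "shootist" and "crocs" start from the same vector and only diverge once fine-tuning updates them.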

### Generating fake movie reviews (using wiki-text model)

In [36]:
TEXT = "The color of the sky is"
N_WORDS = 40
N_SENTENCES = 2

print("\n".join(learn_lm.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))

The color of the sky is blue ( the color of the sky ) is based on the green color of black , which is the visual light and the colour of the sky . The sky above is a green , a shade of
The color of the sky is a matter of controversy due to the fact that it was not the first to be identified as an animal . This is the point of view of Bob Bails , the National Parks


In [37]:
# doc(LanguageLearner.predict)

In [38]:
print("\n".join(learn_lm.predict(TEXT, N_WORDS, temperature=0.20) for _ in range(N_SENTENCES)))

The color of the sky is a matter of controversy , as the United States Congress has stated that the United States Congress is not a " National Assembly " , and that the United States
The color of the sky is a reference to the American Civil War . The American Civil War , the Civil War , and the Civil War were both a major part of the
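The two generations above differ only in `temperature`. A common way to apply a temperature, sketched here in plain Python, is to divide the model's scores by it before the softmax: values below 1 sharpen the distribution so the most likely word is picked almost every time, which is consistent with the more repetitive text at 0.20. The function names and the toy scores are made up for illustration, and fastai's internals may differ in detail.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by `temperature`."""
    scaled = [score / temperature for score in logits]
    top = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    """Draw one token index according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.0]                       # hypothetical scores for three words
probs_075 = softmax_with_temperature(toy_logits, 0.75)
probs_020 = softmax_with_temperature(toy_logits, 0.20)
print([round(p, 3) for p in probs_075])
print([round(p, 3) for p in probs_020])
print(sample_token(toy_logits, 0.75))
```

At the lower temperature nearly all of the probability mass lands on the top-scoring word, so sampling becomes close to greedy decoding.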


### Training the Language Model

In [39]:
learn_lm.fit_one_cycle(1, 2e-2, moms=(0.8,0.7), wd=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,4.47191,3.781139,0.257143,09:57


In [40]:
learn_lm.unfreeze()
learn_lm.fit_one_cycle(10, 2e-3, moms=(0.8,0.7), wd=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,4.17372,3.561193,0.285714,11:19
1,4.139696,3.606076,0.257143,11:19
2,4.106132,3.54963,0.314286,11:19
3,4.092676,3.535756,0.257143,11:20
4,4.032753,3.464699,0.314286,11:19
5,3.993665,3.50917,0.285714,11:19
6,3.924422,3.475172,0.271429,11:20
7,3.857162,3.468473,0.271429,11:19
8,3.800717,3.38581,0.285714,11:20
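Language model losses are easier to interpret as perplexity. Assuming `valid_loss` is the mean per-token cross-entropy in nats (the usual convention, but an assumption here), perplexity is simply its exponential. The two loss values below are copied from the tables above.

```python
import math

loss_after_frozen_epoch = 3.781139  # valid_loss after the single frozen epoch
loss_last_shown_epoch = 3.38581     # valid_loss of the last epoch shown after unfreezing

ppl_before = math.exp(loss_after_frozen_epoch)
ppl_after = math.exp(loss_last_shown_epoch)

print(f"perplexity after frozen epoch: {ppl_before:.1f}")
print(f"perplexity at last shown epoch: {ppl_after:.1f}")
```

Under that assumption, fine-tuning moves the model from being about as uncertain as a uniform choice among roughly 44 words to a choice among roughly 30.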


In [41]:
learn_lm.path = Path('/kaggle/working') 
learn_lm.model_dir= Path('.')

In [42]:
learn_lm.save_encoder('fine_tuned_enc')

### More generated movie reviews
How good is our model? Well let's try to see what it predicts after a few given words.

In [43]:
TEXT = "i liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
print("\n".join(learn_lm.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))

i liked this movie because it was place among the best movies i 've ever seen . First of all , there are Thrills and Thrills . When Shadows and Aliens have advanced , they 're always good .
i liked this movie because it 's so true . It 's a movie about a man who lives in a small town in the middle of nowhere . He is a little boy who is heartbreak boy and runs away from home


### Training the classifier on the fine-tuned language model

In [44]:
if sample:
    data_cls=pd.read_csv(DATA_PATH/'imdb_train.csv').sample(1000).reset_index(drop=True)
else:
    data_cls=pd.read_csv(DATA_PATH/'imdb_train.csv')
#----------
data_cls.head()

Unnamed: 0,review,sentiment
0,We had STARZ free weekend and I switched on th...,negative
1,I'll admit that this isn't a great film. It pr...,negative
2,I finally found a version of Persuasion that I...,positive
3,The BBC surpassed themselves with the boundari...,positive
4,"Much praise has been lavished upon Farscape, b...",negative


In [45]:
# Classifier model data
from sklearn.model_selection import train_test_split
train, val = train_test_split(data_cls,test_size=0.10, random_state=42)
# Work on explicit copies so the label mapping below does not trigger SettingWithCopyWarning
train, val = train.copy(), val.copy()
label_col= 'sentiment'

label_mapping= {'negative':0,'positive':1}
train[label_col]=train[label_col].map(label_mapping)
val[label_col]=val[label_col].map(label_mapping)
train.head()


Unnamed: 0,review,sentiment
20038,I had never heard of this film prior to seeing...,1
23937,Director Edward Montagne does in a little more...,1
6046,There were a lot of dumb teenage getting sex m...,1
23187,There is absolutely no doubt that this version...,1
25421,I watched this movie for its two hours and hav...,0


In [46]:
data_clas = TextDataBunch.from_df(DATA_PATH, train, val,
                  vocab=data_lm.train_ds.vocab,
                  text_cols="review",
                  label_cols='sentiment',
                  bs=64,device = defaults.device)

In [47]:
learn_c = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.3) #.to_fp16()
learn_c.path = Path('/kaggle/working') 
learn_c.model_dir= Path('.')
learn_c.load_encoder('fine_tuned_enc')
learn_c.freeze()

In [48]:
learn_c.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.207458,0.181391,0.92925,02:34


In [49]:
learn_c.freeze_to(-2)
learn_c.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.185459,0.156897,0.938,02:35


In [50]:
learn_c.freeze_to(-3)
learn_c.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.152384,0.13931,0.9425,04:33
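The sequence `freeze()`, `freeze_to(-2)`, `freeze_to(-3)` and finally `unfreeze()` is the gradual unfreezing schedule from ULMFiT: training starts with the classifier head only and opens up one more layer group at each stage. The sketch below mimics that with list slicing. It assumes five layer groups with the hypothetical labels shown, and treats `freeze()` as `freeze_to(-1)` and `unfreeze()` as `freeze_to(0)`.

```python
GROUPS = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier_head"]  # assumed labels

def trainable_mask(n_groups, n):
    """True for groups that are trained after freezing everything before index `n`."""
    indices = list(range(n_groups))
    trainable = set(indices[n:])  # same slicing semantics as a Python list
    return [idx in trainable for idx in indices]

schedule = [("freeze()", -1), ("freeze_to(-2)", -2), ("freeze_to(-3)", -3), ("unfreeze()", 0)]
for call, n in schedule:
    mask = trainable_mask(len(GROUPS), n)
    trained = [name for name, is_trained in zip(GROUPS, mask) if is_trained]
    print(f"{call:14s} trains: {trained}")
```

Unfreezing from the top down lets the randomly initialized head settle before the pretrained layers underneath it start to change.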


In [51]:
learn_c.unfreeze()
learn_c.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.107007,0.144723,0.9475,05:03
1,0.063816,0.154175,0.94775,05:09
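The `slice(lr/(2.6**4), lr)` arguments used in the last three stages give each layer group its own learning rate. Assuming five layer groups and learning rates spread geometrically between the two ends of the slice, each group trains 2.6 times faster than the one below it. `geometric_lrs` is a hypothetical helper written to show the arithmetic.

```python
def geometric_lrs(start, stop, n):
    """Return n learning rates from start to stop, each a constant multiple of the previous."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

max_lr = 1e-3   # upper end of the slice used after unfreeze()
n_groups = 5    # assumption: embedding, three LSTM layers, classifier head
lrs = geometric_lrs(max_lr / 2.6 ** 4, max_lr, n_groups)

for group_index, lr in enumerate(lrs):
    print(f"group {group_index}: lr = {lr:.2e}")
```

The lowest layers, which hold the most general language knowledge, therefore move the least, while the task-specific head gets the full learning rate.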
