# In Progress - NLP - Language Model Basics
> Building a Language Model

- toc: true 
- hide: true
- badges: true
- comments: true
- author: Isaac Flath

# Intro

In this post we are going to dive into NLP, specifically a language model.  Language models are the foundation of most modern NLP.  You will usually want to start with a language model and then use transfer learning to tune that model to your particular goal (e.g. classification).

So what is a language model?  In short, it is a model that uses the preceding words to predict the next word.  We do not need separate labels, because they are in the text.  This trains the model on the nuances of the language you will be working with.  If you want to know if a tweet is toxic or not, you will need to be able to read and understand the tweet in order to do that.  The language model helps with understanding the tweet. You can then take that model, with those weights, and tune it for the final task (determining whether the tweet is toxic or not).

For this post, I will be using news articles to show how to create a language model from scratch.

# The Data

I will be using the "All-the-news" dataset from this site.  https://components.one/datasets/all-the-news-2-news-articles-dataset/

I downloaded the CSV and loaded it into a SQLite database for convenience.
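If you want to do the same, here is a minimal sketch of the CSV-to-SQLite step using only the standard library. The table name `news` and the columns `publication`, `date` and `article` match the queries in this post, but the toy CSV rows, the in-memory database and the name `toy_con` are made up for illustration; the real file has more columns and lives on disk.

```python
import csv
import io
import sqlite3

# Toy stand-in for the downloaded CSV (assumption: the real file has more columns).
toy_csv = io.StringIO(
    "publication,date,article\n"
    "Example Times,2020-01-01,First toy article.\n"
    "Example Times,2020-01-02,Second toy article.\n"
)

# In-memory database for the sketch; pass a file path to persist it instead.
toy_con = sqlite3.connect(":memory:")
toy_con.execute("CREATE TABLE news (publication TEXT, date TEXT, article TEXT)")

rows = [(r["publication"], r["date"], r["article"]) for r in csv.DictReader(toy_csv)]
toy_con.executemany("INSERT INTO news VALUES (?, ?, ?)", rows)
toy_con.commit()

# Same shape of summary query as the one used in this post.
summary = toy_con.execute(
    "SELECT publication, min(date), max(date), count(*) FROM news GROUP BY publication"
).fetchall()
```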

In [1]:
import pandas as pd
import sqlite3
from pathlib import Path
path = Path('../../../data/all-the-news')
con = sqlite3.connect(path/'all_the_news.db')

pd.read_sql_query('SELECT publication, min(date),max(date), count(*) from news group by publication order by max(date) desc limit 5', con)

| publication | min(date) | max(date) | count(*) |
|---|---|---|---|
| Buzzfeed News | 2016-02-19 00:00:00 | 2020-04-02 00:00:00 | 32819 |
| The New York Times | 2016-01-01 00:00:00 | 2020-04-01 13:42:08 | 252259 |
| Business Insider | 2016-01-01 03:08:00 | 2020-04-01 01:48:46 | 57953 |
| Washington Post | 2016-06-10 00:00:00 | 2020-04-01 00:00:00 | 40882 |
| TMZ | 2016-01-01 00:00:00 | 2020-04-01 00:00:00 | 49595 |


I am going to sample 500 New York Times articles at random.  For the final model I will use all of the data, but a smaller sample keeps the tokenization demonstration simple and fast.

In [2]:
df = pd.read_sql_query("SELECT article from news where publication = 'The New York Times' and length(article) > 10 order by random() limit 500", con)

# Tokenization and Numericalization

First, I need to tokenize my data.  The fastai library adds some extra tokens, such as xxbos, which marks the beginning of a text, or xxup, which indicates that the next word was written in all capital letters.

>Note:  In a previous post I showed how you can do a basic tokenization from scratch.  Please check out that post for a foundation on tokenization and numericalization.
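To give a feel for what those special tokens are doing, here is a rough sketch of the capitalization rules in plain Python. This is a simplified approximation, not fastai's actual implementation; the function name `mark_case` is made up, and it assumes the text has already been split into words.

```python
def mark_case(words):
    """Lowercase words and record their original casing with marker tokens."""
    out = ['xxbos']                      # marks the beginning of the text
    for w in words:
        if w.isupper() and len(w) > 1:   # e.g. "NYC" -> xxup nyc
            out += ['xxup', w.lower()]
        elif w[0].isupper():             # e.g. "America" -> xxmaj america
            out += ['xxmaj', w.lower()]
        else:
            out.append(w)
    return out

tokens = mark_case("Angels in America will be revived in NYC".split())
```

The point of the markers is that "America" and "america" share one vocabulary entry, while the casing information is still available to the model as its own token.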

In [3]:
from fastai.text.all import *


In [4]:

txts = L(o for o in df.article)


In [5]:
spacy = WordTokenizer()
tkn = Tokenizer(spacy)

toks = txts.map(tkn);

In [6]:
# for i in range(0,len(toks)):
#     toks[i] = L(filter(lambda a: a != 'xxmaj', toks[i]))
# toks

Next we need to numericalize our data.  By that, I mean assign numbers to each unique token and replace the tokens with those numbers.  We can do that very easily using Numericalize.

In [7]:
num = Numericalize()
num.setup(toks)
coll_repr(num.vocab,20)

"(#12400) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj',',','the','.','a','of','to','and','in','that','“','”'...]"

We can see below that we can look at our numericalized tokens, and convert those back to tokens if we need to.

In [8]:
nums = toks.map(num); 
nums[0][:20]

tensor([   2,   18, 5229,   16,    8,  422,    9,   19,    8, 2401,    8, 4428,
          20, 3219,    0,    9,   70,   44, 8293,   32])

In [9]:
np.array(toks[0][:20])

array(['xxbos', '“', 'angels', 'in', 'xxmaj', 'america', ',', '”',
       'xxmaj', 'tony', 'xxmaj', 'kushner', '’s', 'sweeping',
       'masterwork', ',', 'will', 'be', 'revived', 'at'], dtype='<U10')

In [10]:
' '.join(num.vocab[o] for o in nums[0][:20])

'xxbos “ angels in xxmaj america , ” xxmaj tony xxmaj kushner ’s sweeping xxunk , will be revived at'
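Notice that 'masterwork' came back as xxunk.  fastai's Numericalize only keeps tokens that appear at least `min_freq` times (3 by default), and everything else maps to the unknown token at index 0.  Here is a minimal from-scratch sketch of that idea in plain Python.  The names `build_vocab`, `docs` and `tok2id` are made up for illustration, and I use `min_freq=2` so the tiny example works.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, specials=('xxunk', 'xxpad')):
    """Special tokens first, then tokens seen at least min_freq times, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = list(specials)
    vocab += [tok for tok, c in counts.most_common()
              if c >= min_freq and tok not in specials]
    return vocab

docs = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['the', 'masterwork']]
vocab = build_vocab(docs)
tok2id = {tok: i for i, tok in enumerate(vocab)}

# Unknown or rare tokens fall back to index 0, which is 'xxunk'.
ids = [tok2id.get(tok, 0) for tok in docs[2]]
decoded = [vocab[i] for i in ids]
```

The round trip is lossy by design: rare words are collapsed into one token so the vocabulary (and the embedding matrix later) stays a manageable size.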

# Language Model

A language model is a form of self-supervised learning.  It is different from classification or regression because the labels are not separate from the training data.  We will use previous words (or tokens more specifically) to predict the next word.  For this post, I will be creating this from scratch to demonstrate exactly how it works.

Let's start by creating our training set.  We will create tuples where the first element is a series of tokens, and the second element is the following word.  Let's see what that looks like for 1 article in both tokens and numbers.  We will start by using 3 tokens to predict the next token in 1 article.  We will almost certainly need to use more articles as well as more tokens for the prediction, but we can increase those numbers later.

### Packaging the Data

In [11]:
n_words = 3

In [12]:
L((toks[0][i:i+n_words], toks[0][i+n_words]) for i in range(0,len(toks[0])-(n_words+1),n_words))

(#267) [((#3) ['xxbos','“','angels'], 'in'),((#3) ['in','xxmaj','america'], ','),((#3) [',','”','xxmaj'], 'tony'),((#3) ['tony','xxmaj','kushner'], '’s'),((#3) ['’s','sweeping','masterwork'], ','),((#3) [',','will','be'], 'revived'),((#3) ['revived','at','the'], 'xxmaj'),((#3) ['xxmaj','neil','xxmaj'], 'simon'),((#3) ['simon','xxmaj','theater'], 'next'),((#3) ['next','spring',','], 'a')...]

In [13]:
seqs = L((nums[0][i:i+n_words], nums[0][i+n_words]) for i in range(0,len(nums[0])-(n_words+1),n_words))

seqs

(#267) [(tensor([   2,   18, 5229]), tensor(16)),(tensor([ 16,   8, 422]), tensor(9)),(tensor([ 9, 19,  8]), tensor(2401)),(tensor([2401,    8, 4428]), tensor(20)),(tensor([  20, 3219,    0]), tensor(9)),(tensor([ 9, 70, 44]), tensor(8293)),(tensor([8293,   32,   10]), tensor(8)),(tensor([   8, 4429,    8]), tensor(4803)),(tensor([4803,    8,  716]), tensor(190)),(tensor([ 190, 1185,    9]), tensor(12))...]

In [14]:
seqs = L()
for article_num in range(0,len(nums)):
    seq = L((nums[article_num][i:i+n_words], nums[article_num][i+n_words]) for i in range(0,len(nums[article_num])-(n_words+1),n_words))
    seqs.append(seq)
    
seqs = L(item for sublist in seqs for item in sublist)

seqs

(#194853) [(tensor([   2,   18, 5229]), tensor(16)),(tensor([ 16,   8, 422]), tensor(9)),(tensor([ 9, 19,  8]), tensor(2401)),(tensor([2401,    8, 4428]), tensor(20)),(tensor([  20, 3219,    0]), tensor(9)),(tensor([ 9, 70, 44]), tensor(8293)),(tensor([8293,   32,   10]), tensor(8)),(tensor([   8, 4429,    8]), tensor(4803)),(tensor([4803,    8,  716]), tensor(190)),(tensor([ 190, 1185,    9]), tensor(12))...]
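If the indexing above is hard to follow, the same windowing can be sketched with plain Python lists.  The helper name `make_seqs` is made up for this example, and the "article" is just the numbers 0 to 9 so it is easy to see which ids end up where.

```python
def make_seqs(ids, n_words=3):
    """Non-overlapping windows of n_words ids, each paired with the id that follows."""
    return [(ids[i:i + n_words], ids[i + n_words])
            for i in range(0, len(ids) - (n_words + 1), n_words)]

pairs = make_seqs(list(range(10)))
# pairs == [([0, 1, 2], 3), ([3, 4, 5], 6)]
```

Two things to notice.  The windows step forward by `n_words`, so the target of one sample becomes the first input of the next.  Also, with this stop value the last full window (`[6, 7, 8]` predicting `9`) is left out; using `len(ids) - n_words` as the stop would keep it.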

Now we can easily package this into a dataloader so that we can feed this into a model.

In [15]:
bs = 64
cut = int(len(seqs) * 0.9)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)

### The Model

So now we need to create an RNN.  We will build some basic models from scratch to illustrate how it works, then compare against an out-of-the-box fastai model at the end.

In [16]:
# n,counts = 0,torch.zeros(len(num.vocab))
# for x,y in dls.valid:
#     n += y.shape[0]
#     for i in range_of(num.vocab): counts[i] += (y==i).long().sum()
# idx = torch.argmax(counts)

# top10 = torch.topk(counts,15)
# for idx in top10[1]:
#     print(idx, num.vocab[idx.item()], round(counts[idx].item()/n*100,1))

Now we are ready for an RNN.  We will start with an RNN that is as simple as it gets.

```for i in range(3):```
Because we are feeding in 3 tokens to predict the fourth, we loop 3 times and update the hidden state once per token.

```h = h + self.i_h(x[:,i])```
For each input token we run our input-to-hidden function (i_h).  It is an embedding layer, so indexing into it grabs the row of the embedding matrix that corresponds to the token.  All this line does is add that token's embedding to the hidden state.

```h = F.relu(self.h_h(h))```
We then run our hidden-to-hidden function (h_h), which is a linear layer (y = wx + b), and apply a ReLU, which just replaces any negative values with 0.

```return self.h_o(h)```
Finally we run our hidden-to-output function (h_o), which is another linear layer.  It outputs one score per token in the vocabulary, so its output size is the size of our vocabulary.

Wrap all that in a class and it looks like the below:


In [17]:
class LanguageModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        
    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
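To see the arithmetic without PyTorch in the way, here is the same loop for a single example using plain Python lists.  This is only a sketch: the helper names are made up, and the weights are picked by hand (identity hidden-to-hidden weights, zero biases) so the numbers are easy to follow.  A real model learns them.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, b, v):
    """y = Wx + b for a weight matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def forward(token_ids, emb, W_hh, b_hh, W_ho, b_ho):
    h = [0.0] * len(b_hh)                            # hidden state starts at zero
    for t in token_ids:
        h = [hi + ei for hi, ei in zip(h, emb[t])]   # add the token's embedding row
        h = relu(linear(W_hh, b_hh, h))              # hidden-to-hidden + ReLU
    return linear(W_ho, b_ho, h)                     # one score per vocabulary token

# Toy setup: vocabulary of 3 tokens, hidden size 2 (all values hand-picked).
emb  = [[1, 0], [0, 1], [1, 1]]
W_hh = [[1, 0], [0, 1]]
b_hh = [0, 0]
W_ho = [[1, 0], [0, 1], [1, 1]]
b_ho = [0, 0, 0]

scores = forward([0, 1, 2], emb, W_hh, b_hh, W_ho, b_ho)
predicted = scores.index(max(scores))
```

The hidden state goes `[1, 0]`, then `[1, 1]`, then `[2, 2]`, and the output layer turns that into one score for each of the 3 tokens.  The highest score is the predicted next token.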

I then threw it in a learner for 5 epochs and we see about 18% accuracy.  Much better than just predicting the most common words!

In [18]:
learn = Learner(dls, LanguageModel1(len(num.vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 5.848427 | 5.828712 | 0.160628 | 00:47 |
| 1 | 5.645291 | 5.786608 | 0.170789 | 00:50 |
| 2 | 5.477659 | 5.696674 | 0.176742 | 00:55 |
| 3 | 5.308584 | 5.60835 | 0.181002 | 01:23 |
| 4 | 5.19839 | 5.549744 | 0.182952 | 02:01 |
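For context on the "most common word" comparison, that kind of baseline is easy to sketch: find the most frequent target in the training set and check how often it would be right on the validation set.  The function name `majority_baseline` and the toy target ids below are made up for illustration; they are not results from the news data.

```python
from collections import Counter

def majority_baseline(train_targets, valid_targets):
    """Accuracy of always predicting the most frequent training target."""
    most_common, _ = Counter(train_targets).most_common(1)[0]
    hits = sum(1 for y in valid_targets if y == most_common)
    return most_common, hits / len(valid_targets)

# Toy ids only, to show the mechanics.
tok, acc = majority_baseline([8, 9, 8, 10, 8], [8, 9, 8, 11])
```

Any model worth keeping should beat this number, because the baseline never looks at the input at all.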


One problem with the previous model is that it only uses the previous 3 words to predict the next one.  In reality, the context that matters usually stretches back further than 3 words, so we don't want to throw the hidden state away by resetting h to 0 on every batch.  Instead we set it to 0 once when the model is created and let it carry over between batches.

Unfortunately, this means the computation graph keeps growing as we train, so backpropagation would have to calculate gradients through every previous batch.  That would get slower and use more memory with each step, so instead we only keep the recent gradient history by using "detach".  The hidden state keeps its values, but the gradient history before that point is dropped.

In [19]:

class LanguageModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()
        return out
    
    def reset(self): self.h = 0
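If detach is new to you, here is a tiny PyTorch sketch of what it changes, separate from the model above.  The variable names are made up for this example.  The value of `h` is the same in both runs; the only difference is whether gradients are allowed to flow back through it.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# Without detach: the gradient flows through both uses of w.
h = w * 3
out = h * w                    # out = 3 * w**2, so d(out)/dw = 6 * w
out.backward()
grad_full = w.grad.item()

w.grad = None                  # clear the accumulated gradient before the second run

# With detach: h keeps its value (6) but is treated as a constant.
h = (w * 3).detach()
out = h * w                    # out = 6 * w, so d(out)/dw = 6
out.backward()
grad_truncated = w.grad.item()
```

In the language model, the same thing happens at the end of every batch: the hidden state carries its values forward, but the backward pass stops at the batch boundary.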

For this to work, our data needs to be in a logical order.  So let's put our data in our dataloader in the order it was in the text.

In [20]:
def group_chunks(ds, bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds
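The reordering is easier to see on a small example.  Here is the same logic with plain lists (the name `group_chunks_plain` is made up), applied to the numbers 0 to 11 with a batch size of 3.

```python
def group_chunks_plain(ds, bs):
    """Same reordering as group_chunks, but with plain Python lists."""
    m = len(ds) // bs                       # length of each of the bs chunks
    return [ds[i + m * j] for i in range(m) for j in range(bs)]

ordered = group_chunks_plain(list(range(12)), bs=3)
batches = [ordered[k:k + 3] for k in range(0, len(ordered), 3)]
# batches == [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```

The data is split into 3 chunks (0-3, 4-7, 8-11), one per row of the batch.  Row 0 sees 0, 1, 2, 3 across consecutive batches, row 1 sees 4, 5, 6, 7, and so on.  That is what lets the hidden state for each row carry over meaningfully from one batch to the next.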


In [21]:
cut = int(len(seqs) * 0.9)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs), 
    group_chunks(seqs[cut:], bs), 
    bs=bs, drop_last=True, shuffle=False)

Throw it in a learner for 5 epochs and we see our accuracy is slightly better, at about 19%.  It predicts the next word correctly almost 1 out of every 5 times.

In [22]:
learn = Learner(dls, LanguageModel2(len(num.vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(5, 1e-2)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 5.783739 | 5.850947 | 0.165604 | 00:50 |
| 1 | 5.594864 | 5.712231 | 0.172492 | 00:51 |
| 2 | 5.44863 | 5.660336 | 0.175576 | 00:57 |
| 3 | 5.281786 | 5.611745 | 0.186472 | 01:16 |
| 4 | 5.201961 | 5.553745 | 0.189916 | 01:44 |


There are many more steps in this iterative process before we get to a really cutting-edge model.  But for now, we have a great start and a good foundation in what an RNN is in its simplest form.  Future blog posts will pick up where this one left off and continue to expand on it.

Other areas that more cutting edge architectures improve upon:
+ Rather than predicting 1 token for each group of 4 tokens (3 inputs -> 1 output), predict every word.
+ Stack the RNNs together for more layers
+ Use LSTMs
+ Regularization (e.g. dropout, AR, TAR)
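As a preview of the first item, predicting every word mostly changes how the data is packaged: the target becomes the input sequence shifted one token to the right.  Here is a sketch with plain lists.  The helper name `make_shifted_seqs` and the sequence length `sl` are made up for this example, and the model would also need to return a prediction at every step rather than only after the last one.

```python
def make_shifted_seqs(ids, sl=3):
    """Inputs are windows of sl ids; targets are the same windows shifted by one."""
    return [(ids[i:i + sl], ids[i + 1:i + sl + 1])
            for i in range(0, len(ids) - sl - 1, sl)]

shifted = make_shifted_seqs(list(range(10)))
# shifted == [([0, 1, 2], [1, 2, 3]), ([3, 4, 5], [4, 5, 6])]
```

Each window now gives the model 3 training signals instead of 1, from the same amount of text.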

We will continue to build on this language model until we reach close to the performance we would get using the fastai library.  See below for the out of the box language model using fastai.

### Fastai Language Model

In [23]:
df.columns

Index(['article'], dtype='object')

In [24]:
dls = TextDataLoaders.from_df(df, text_col='article', is_lm=True, bs=256)
learn = language_model_learner(dls, AWD_LSTM, metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 4.638541 | 4.003233 | 0.310014 | 00:14 |
| 1 | 4.383406 | 3.840407 | 0.315526 | 00:14 |
| 2 | 4.181806 | 3.784667 | 0.320918 | 00:14 |
| 3 | 4.021165 | 3.772173 | 0.321951 | 00:14 |
| 4 | 3.921133 | 3.770104 | 0.322276 | 00:14 |


'donald trump is now in that third country , in Income ,'