# In Progress - NLP - Language Model Basics
> Building a Language Model

- toc: true 
- hide: true
- badges: true
- comments: true
- author: Isaac Flath

# Intro

In this post we are going to dive into NLP, specifically language models.  Language models are the foundation of most modern NLP.  You will usually want to start with a language model, then use transfer learning to tune that model to your particular goal (e.g., classification).

So what is a language model?  In short, it is a model that uses the preceding words to predict the next word.  We do not need separate labels, because they are in the text.  This trains the model on the nuances of the language you will be working with.  If you want to know whether a tweet is toxic or not, you need to be able to read and understand the tweet first.  The language model helps with understanding the tweet; you can then take that model, with its weights, and tune it for the final task (determining whether the tweet is toxic or not).
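
To make "the labels are in the text" concrete, here is a minimal sketch. It is not the model built in this post, just a toy count-based predictor; the sentence and the `predict_next` helper are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus (made up for illustration); a real model would train on many articles.
text = "the cat sat on the mat and the cat slept"
words = text.split()

# Every adjacent pair (previous word, next word) is a free training example.
pairs = list(zip(words[:-1], words[1:]))

# Count which word follows each word.
next_counts = defaultdict(Counter)
for prev, nxt in pairs:
    next_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word` seen in the toy corpus."""
    return next_counts[word].most_common(1)[0][0]

print(pairs[:3])            # first three (input, label) pairs
print(predict_next("the"))  # "cat" follows "the" twice, "mat" only once
```

No one had to label anything by hand: the "answer" for each word is simply the word that comes after it.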

For this post, I will be using news articles to show how to create a language model from scratch.

# The Data

I will be using the "All-the-news" dataset from this site.  https://components.one/datasets/all-the-news-2-news-articles-dataset/

I downloaded the CSV and loaded it into a SQLite database for convenience.

In [8]:
import pandas as pd
import sqlite3
from pathlib import Path
from fastai.text.all import *

path = Path('../../../data/all-the-news')
con = sqlite3.connect(path/'all_the_news.db')

pd.read_sql_query('SELECT publication, min(date),max(date), count(*) from news group by publication order by max(date) desc', con)

| | publication | min(date) | max(date) | count(*) |
|---|---|---|---|---|
| 0 | Buzzfeed News | 2016-02-19 00:00:00 | 2020-04-02 00:00:00 | 32819 |
| 1 | The New York Times | 2016-01-01 00:00:00 | 2020-04-01 13:42:08 | 252259 |
| 2 | Business Insider | 2016-01-01 03:08:00 | 2020-04-01 01:48:46 | 57953 |
| 3 | Washington Post | 2016-06-10 00:00:00 | 2020-04-01 00:00:00 | 40882 |
| 4 | TMZ | 2016-01-01 00:00:00 | 2020-04-01 00:00:00 | 49595 |
| 5 | Refinery 29 | 2016-01-01 07:00:00 | 2020-04-01 00:00:00 | 111433 |
| 6 | Vox | 2016-01-01 01:41:26 | 2020-03-31 23:50:00 | 47272 |
| 7 | The Verge | 2016-01-01 00:00:00 | 2020-03-31 00:00:00 | 52424 |
| 8 | Hyperallergic | 2016-01-01 00:00:00 | 2020-03-31 00:00:00 | 13551 |
| 9 | CNN | 2016-01-01 | 2020-03-31 00:00:00 | 127602 |


I am going to pull a random sample of CNBC articles.  For the final model I will use all of the data, but a sample keeps the demonstration manageable.  The query below keeps only articles longer than 10 characters and caps the sample at 500,000 rows.

In [9]:
df = pd.read_sql_query("SELECT article from news where publication = 'CNBC' and length(article) > 10 order by random() limit 500000",con)

### Fastai Language Model

In [10]:
df.columns

Index(['article'], dtype='object')

In [11]:
bs=128
dls = TextDataLoaders.from_df(df, text_col='article', is_lm=True,bs = bs)
learn = language_model_learner(dls, AWD_LSTM, metrics=accuracy)
learn.fine_tune(5)

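| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | | 3.649046 | 0.34252 | 22:36 |

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | | 3.519593 | 0.358146 | 24:01 |
| 1 | | 3.412565 | 0.369808 | 23:58 |
| 2 | | 3.356618 | 0.376045 | 24:06 |
| 3 | | 3.332409 | 0.378788 | 24:09 |
| 4 | | 3.327978 | 0.379292 | 24:18 |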


In [12]:
learn.save('lmlearner_n50000'+str(bs))

Path('models/lmlearner_n50000128.pth')

In [15]:
for x in range(0,10):
    print(learn.predict('The reason Trump cares about immigration is',n_words=50))

The reason Trump cares about immigration is Wednesday 's resignation of former National Security Advisor Paul Langer , who is led by the Whites National Turkish President Jack Dorsey , according to a report from the White House . The White


The reason Trump cares about immigration is his official slogan solanezumab undersea exchange that wonders the only things that happened in 1994 , as the Great Wall taner amazon underground and has a top 5 , 000 fallerfallerfaller ... He consider saying there are issues of values and threat the economy will


The reason Trump cares about immigration is why the economic growth has slowed . Taxes are potentially free from tax increases that could have profound consequences . Tax reform — which requires management — powers the tax code to inflate earnings ratios , estimates being a signiﬁcant scenario of 19.4 percent and 47 percent of


The reason Trump cares about immigration is because at least some claims that made direct trade deals optimistically are just necessary . For example , the firm plans to hire people so they can shift their spending , but that has not allowed individual companies to take benefits ; they are doing initiatives like business technology


The reason Trump cares about immigration is not the titanic market emergency , but since Britain paid for Canada the rights to funding certain undocumented rights , Trump said that there was how too many people could using immigration duty . Gspc has largely invested in the local investment machinery carrier Dunkin


The reason Trump cares about immigration is that Trump has a much more aggressive sleepy lunch harder when he comes to the house for it . He 's become the top Republican in the presidency . He 's driving on from a growing base . He has already started such a drive


The reason Trump cares about immigration is his weekend photo assistant . He said " Since December the Axios issue does n't send you in a living room . You 'll do something about yup . I would well see that what happens in next year after you die . "


The reason Trump cares about immigration is the 50 percent tax policy on both sides of the aisle and tragic implications for his Steve Bannon campaign . Trump gave his name on Monday as his sniffer timer . The ties between Sweep and Goldman accounts can be evident in their


The reason Trump cares about immigration is as Federal Reserve officials and private equity executives , including Wilbur Ross , Transportation Secretary Wilbur Ross , and Southwest Airlines , have experienced less oversight and employee spending and screws said . " these shorts were ineffective in recent


The reason Trump cares about immigration is , he says , that retiring , and working on a solid choice of white house with Republicans around China — in powerful U.S . view is the sort of way Trump obstructed a Democratic presidential nominee by attacking voters . When Trump


In [16]:
for x in range(0,10):
    print(learn.predict('The reason Biden cares about immigration is',n_words=50))

The reason Biden cares about immigration is . He 's drove the airline away from a sweeping public boundaries car sale , which helps shareholders . Bombast , or policing , has caused parallels for Tesla 's history , immigration and our trade relationship . Denmark 's Bounty Force said it


The reason Biden cares about immigration is because he explains not to say why he should be doing everything that could be done and he contains even the most important things . At the beginning of August , the Federal Reserve quickly stopped speed of on ronna and trips to Chile .


The reason Biden cares about immigration is no one plan to completely support refugees . But President Barack Obama on Tuesday laid out Obama 's potential base for immigration , saying that " who i am not going to make to wear with a lawyer like every other 34-year - old


The reason Biden cares about immigration is because there are closed corporate districts , and candidates paying attention to lost faith did n't run a good plan as what was to be done . And Democrats kept there for a third of the time he took office , parish council s - u - lago


The reason Biden cares about immigration is " to book immigration . " In the fourth quarter , however , Biden needed to take off a brief tie for the election . " i know that , I 've got , i really ca n't even think esiner the special counsel nancy Sanders


The reason Biden cares about immigration is , he says , . The Trump administration sat down with the U.S . Defense Department not only in Washington , but also in Washington . Lawmakers also contended with his polling , election ambitions and the ongoing discussions by financial experts


The reason Biden cares about immigration is with regulations that ask for race and bowe — one of the biggest exceptions he has ever found — on his section of the vehicle to ladder across bridges and bridges in the country . But diversity among medals was not nominated by most American organizations , but


The reason Biden cares about immigration is a background band that includes rights experts including Fedexcup Michelle Trump and Jay Inslee . He said Tuesday that an illegal incursion into the Northeast could put Virginia in the middle in a turn to enter a slowdown . Trump


The reason Biden cares about immigration is about getting back to Europe at international trade . " that 's because of our trade war with the European Union and so it 's a political measure , not a scenario that is eating in the U.S . " Peter Navarro , director


The reason Biden cares about immigration is because people could move forward , known as Advanced America Mrs . Trump , to address Qatar 's warming to history for thousands of decades . On travel from safe - haven China to France and the United States ,


# Tokenization and Numericalization

First, I need to tokenize my data.  The fastai library adds some extra tokens, such as xxbos, which marks the beginning of a text, and xxup, which indicates that the next word was written in all capital letters.

>Note:  In a previous post I showed how you can do a basic tokenization from scratch.  Please check out that post for a foundation on tokenization and numericalization.
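
As a rough picture of what those special tokens do, here is a simplified imitation in plain Python. It is not fastai's implementation (the real tokenizer applies more rules and uses spaCy to split words); the `mark_special` helper is hypothetical and only mimics the xxbos, xxup, and xxmaj markers.

```python
def mark_special(text):
    """Simplified imitation of fastai-style special tokens (not fastai's own code)."""
    out = ["xxbos"]  # marks the beginning of a text
    for word in text.split():
        if word.isupper() and len(word) > 1:
            out += ["xxup", word.lower()]   # the whole word was in capitals
        elif word[0].isupper():
            out += ["xxmaj", word.lower()]  # the word started with a capital
        else:
            out.append(word)
    return out

print(mark_special("The FBI said so"))
# ['xxbos', 'xxmaj', 'the', 'xxup', 'fbi', 'said', 'so']
```

The point of the markers is that "The" and "the" can share one vocabulary entry while the capitalization information is still kept as its own token.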

In [None]:
from fastai.text.all import *


In [None]:

txts = L(o for o in df.article)


In [None]:
spacy = WordTokenizer()
tkn = Tokenizer(spacy)

toks = txts.map(tkn);

In [None]:
# for i in range(0,len(toks)):
#     toks[i] = L(filter(lambda a: a != 'xxmaj', toks[i]))
# toks

Next we need to numericalize our data.  By that, I mean assign numbers to each unique token and replace the tokens with those numbers.  We can do that very easily using Numericalize.

In [None]:
num = Numericalize()
num.setup(toks)
coll_repr(num.vocab,20)

We can see below that we can look at our numericalized tokens, and convert those back to tokens if we need to.

In [None]:
nums = toks.map(num); 
nums[0][:20]

In [None]:
np.array(toks[0][:20])

In [None]:
' '.join(num.vocab[o] for o in nums[0][:20])
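
The round trip above can also be sketched without fastai. This is a minimal stand-in under simplifying assumptions: the token list is made up, and unlike fastai's `Numericalize` it does not reserve special tokens or drop rare words.

```python
from collections import Counter

# Made-up token list standing in for one tokenized article.
tokens = ["xxbos", "the", "cat", "sat", "on", "the", "mat"]

# Most frequent tokens get the smallest ids; ties keep first-seen order.
vocab = [tok for tok, _ in Counter(tokens).most_common()]
tok2id = {tok: i for i, tok in enumerate(vocab)}

ids = [tok2id[t] for t in tokens]          # tokens -> numbers
decoded = " ".join(vocab[i] for i in ids)  # numbers -> tokens

print(vocab)    # ['the', 'xxbos', 'cat', 'sat', 'on', 'mat']
print(ids)      # [1, 0, 2, 3, 4, 0, 5]
print(decoded)  # xxbos the cat sat on the mat
```

The vocabulary is just a list, so looking up `vocab[i]` converts a number back to its token, which is exactly what the `' '.join(...)` cell does.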

# Language Model

A language model is trained with self-supervised learning.  It is different from ordinary classification or regression because the labels are not separate from the training data: they come from the text itself.  We will use previous words (or, more specifically, tokens) to predict the next word.  For this post, I will be creating this from scratch to demonstrate exactly how it works.

Let's start by creating our training set.  We will create tuples where the first element is a series of tokens, and the second element is the token that follows.  Let's see what that looks like for 1 article in both tokens and numbers.  We will start by using 3 tokens to predict the next token in 1 article.  We will almost certainly need to use more articles as well as more tokens for the prediction, but we can increase those numbers later.

### Packaging the Data

In [None]:
n_words = 3

In [None]:
L((toks[0][i:i+n_words], toks[0][i+n_words]) for i in range(0,len(toks[0])-(n_words+1),n_words))

In [None]:
seqs = L((nums[0][i:i+n_words], nums[0][i+n_words]) for i in range(0,len(nums[0])-(n_words+1),n_words))

seqs

In [None]:
seqs = L()
for article_num in range(0,len(nums)):
    seq = L((nums[article_num][i:i+n_words], nums[article_num][i+n_words]) for i in range(0,len(nums[article_num])-(n_words+1),n_words))
    seqs.append(seq)
    
seqs = L(item for sublist in seqs for item in sublist)

seqs
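
To see exactly which windows that comprehension produces, here it is on a toy list of ids. The list `toy` is a stand-in for one article, and a plain Python list is used instead of a fastai `L`.

```python
n_words = 3
toy = list(range(10))  # stand-in for one article's token ids: 0..9

# Same comprehension as above, on a plain list instead of a fastai L.
toy_seqs = [(toy[i:i+n_words], toy[i+n_words])
            for i in range(0, len(toy) - (n_words + 1), n_words)]

for x, y in toy_seqs:
    print(x, "->", y)
# [0, 1, 2] -> 3
# [3, 4, 5] -> 6
```

Two things are worth noticing: the windows step forward by `n_words`, so the target of one window is the first input of the next, and the end of the range means a final window such as `[6, 7, 8] -> 9` is left out.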

Now we can easily package this into a dataloader so that we can feed this into a model.

In [None]:
bs = 64
cut = int(len(seqs) * 0.9)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)

### The Model

So now we need to create an RNN.  Let's start with a fastai model, then go back and create some basic models to illustrate how it works.

In [None]:
# n,counts = 0,torch.zeros(len(num.vocab))
# for x,y in dls.valid:
#     n += y.shape[0]
#     for i in range_of(num.vocab): counts[i] += (y==i).long().sum()
# idx = torch.argmax(counts)

# top10 = torch.topk(counts,15)
# for idx in top10[1]:
#     print(idx, num.vocab[idx.item()], round(counts[idx].item()/n*100,1))

Now we are ready for an RNN.  We will start with an RNN that is as simple as it gets.

```for i in range(3):```
Because we are feeding in 3 tokens to predict the fourth, the loop runs 3 times, once per token.  You can think of this as 3 hidden layers that share the same weights.

```h = h + self.i_h(x[:,i])```
For each input token we run our input-to-hidden function (i_h).  This is an embedding lookup: we index into the embedding matrix to grab the row that corresponds to the token, and add that row to the hidden state.
    
```h = F.relu(self.h_h(h))```
We then run our hidden-to-hidden function (h_h), which is a linear layer (y = wx + b).  We apply a ReLU to the result, which just replaces any negative values with 0.
    
```return self.h_o(h)```
We then run our hidden-to-output function (h_o), which is another linear layer, but this one outputs a score for each candidate next word.  Naturally, its output is the size of our vocabulary.

Wrap all that in a class and it looks like the below:


In [None]:
class LanguageModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        
    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
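
As a quick sanity check on shapes, here is a plain-PyTorch copy of the same model run on random token ids. This is a sketch that assumes only `torch` is installed; the class name `TinyLM` and the sizes are made up, and the one real difference from the fastai version is that `nn.Module` needs the `super().__init__()` call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Plain-PyTorch copy of the model above (nn.Module needs super().__init__())."""
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # token id -> vector
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden -> hidden
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden -> vocab scores

    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:, i])   # add the embedding of token i
            h = F.relu(self.h_h(h))     # mix and clip negatives to 0
        return self.h_o(h)

vocab_sz, n_hidden, batch = 50, 8, 4           # made-up sizes
x = torch.randint(0, vocab_sz, (batch, 3))     # 4 rows of 3 token ids
logits = TinyLM(vocab_sz, n_hidden)(x)
pred = logits.argmax(dim=1)                    # most likely next token id per row

print(logits.shape)  # one score per vocabulary entry, for each row in the batch
print(pred.shape)    # one predicted token id per row
```

The model is untrained here, so the predictions are meaningless; the point is only that 3 token ids per row go in and one score per vocabulary entry comes out.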

I then threw it in a learner for 5 epochs and we see about 16% accuracy.  Much better than just predicting the most common word!

In [None]:
learn = Learner(dls, LanguageModel1(len(num.vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)

One problem with the previous model is that it only uses the previous 3 words to predict the next one.  In reality, text has structure that spans far more than 3 words, so we really don't want to throw away what the model has seen by resetting h to 0 on every call.  Instead we set it to 0 when we first initialize the model, and keep it from one batch to the next.

Unfortunately, this means the hidden state depends on every token seen so far, so backpropagation would have to work through a longer and longer chain of steps, which means more and more gradients to calculate.  Memory and training time would blow up, so instead we keep only the recent gradient history by using "detach".

In [None]:

class LanguageModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()
        return out
    
    def reset(self): self.h = 0
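
What "detach" does to the gradients can be seen on a single weight. This is a toy sketch, not part of the model: it assumes `torch` is installed, uses a made-up scalar `w`, and detaches after every step, whereas the model above detaches once per batch.

```python
import torch

w = torch.tensor(0.5, requires_grad=True)

# Without detach: the state carries its whole history, so h = w**3.
h = torch.tensor(1.0)
for _ in range(3):
    h = h * w
h.backward()
grad_full = w.grad.item()       # d(w**3)/dw = 3 * w**2 = 0.75
value_full = h.item()

w.grad = None                   # clear the gradient before the second run

# With detach: earlier steps are kept as plain numbers, not as history.
h = torch.tensor(1.0)
for _ in range(3):
    h = h.detach() * w
h.backward()
grad_truncated = w.grad.item()  # only the last step counts: 0.25 * 1 = 0.25
value_truncated = h.item()

print(value_full, value_truncated)  # same value either way
print(grad_full, grad_truncated)    # different gradients
```

The hidden state keeps its value, so the model still remembers earlier text, but the gradient calculation stops at the point where the state was detached.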

For this to work, our data needs to be in a logical order.  So let's put our data in our dataloader in the order it was in the text.

In [None]:
def group_chunks(ds, bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds
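
To see why this reordering matters, here is the same logic on a plain list of 12 numbered samples with a batch size of 3. The helper name `group_chunks_list` and the toy data are made up; the indexing is the same as in `group_chunks` above.

```python
def group_chunks_list(ds, bs):
    """Same reordering as group_chunks above, on a plain list."""
    m = len(ds) // bs
    new_ds = []
    for i in range(m):
        new_ds += [ds[i + m*j] for j in range(bs)]
    return new_ds

ds = list(range(12))   # stand-in for 12 consecutive samples
bs = 3
ordered = group_chunks_list(ds, bs)

# Cut the reordered data into batches, the way an unshuffled dataloader would.
batches = [ordered[k:k+bs] for k in range(0, len(ordered), bs)]
print(batches)
# [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```

Reading down any one position across the batches gives 0, 1, 2, 3 (or 4, 5, 6, 7, and so on), so the hidden state kept for that position always continues with the text that actually comes next.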


In [None]:
cut = int(len(seqs) * 0.9)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs), 
    group_chunks(seqs[cut:], bs), 
    bs=bs, drop_last=True, shuffle=False)

We throw it in a learner for 5 epochs and see that our accuracy is much better.  It can predict the next word correctly almost 1 out of every 5 times!

In [None]:
learn = Learner(dls, LanguageModel2(len(num.vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(5, 1e-2)

There are many more steps in this iterative process to get to a really cutting-edge model, and future posts will pick up where this one leaves off.  But for now, we have a great start and a good foundation in what an RNN is in its simplest form.

Other areas that more cutting edge architectures improve upon:
+ Rather than predicting 1 token for each group of 4 tokens (3 inputs -> 1 output), predict every word.
+ Stack the RNNs together for more layers
+ Use LSTMs
+ Regularization (e.g., dropout, AR, TAR)

We will continue to build on this language model until we reach close to the performance we would get using the fastai library.  See below for the out of the box language model using fastai.

### Fastai Language Model

In [None]:
df.columns

In [None]:
dls = TextDataLoaders.from_df(df, text_col='article', is_lm=True,bs = 256)
learn = language_model_learner(dls, AWD_LSTM, metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)