# In Progress - NLP - Language Model Basics
> Building a Language Model

- toc: true 
- hide: true
- badges: true
- comments: true
- author: Isaac Flath

# Intro

In this post we are going to dive into NLP, specifically language models. Language models are the foundation of most modern NLP. You will usually want to start with a language model, then use transfer learning to tune that model to your particular goal (e.g., classification).

So what is a language model? In short, it is a model that uses the preceding words to predict the next word. We do not need separate labels, because they are in the text. This trains the model on the nuances of the language you will be working with. If you want to know if a tweet is toxic or not, you will need to be able to read and understand the tweet in order to do that. The language model helps with understanding the tweet. You can then take that model, with those weights, and tune it for the final task (determining whether the tweet is toxic or not).
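To make the "labels are in the text" idea concrete, here is a minimal sketch. It is not the neural approach used in the rest of this post, just a toy model that counts which word follows which. The function names and the example sentence are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_counts(tokens):
    """Count, for every token, which tokens follow it in the text."""
    following = defaultdict(Counter)
    # Each (previous word, next word) pair comes straight from the text,
    # so the "label" is simply the word that comes next.
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def predict_next(following, prev):
    """Return the most frequent follower of `prev`, or None if it was never seen."""
    if prev not in following:
        return None
    return following[prev].most_common(1)[0][0]

toy_text = "the cat sat on the mat and the cat slept".split()
toy_counts = train_bigram_counts(toy_text)
toy_prediction = predict_next(toy_counts, "the")
print(toy_prediction)
```

A real language model replaces the counting with a neural network and looks at more than one preceding word, but the training signal is the same.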

For this post, I will be using news articles to show how to create a language model from scratch.

# The Data

I will be using the "All-the-news" dataset from this site.  https://components.one/datasets/all-the-news-2-news-articles-dataset/

I downloaded the CSV and loaded it into a SQLite database for convenience.
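Roughly, the loading step could look like the sketch below. It uses only the standard library. The table name `news` matches the queries later in this post, but the helper name and the two-column demo data are assumptions for illustration, and the real file has more columns.

```python
import csv
import io
import sqlite3

def load_csv_into_sqlite(csv_file, con, table="news"):
    """Copy rows from an open CSV file into a SQLite table and return the row count."""
    reader = csv.reader(csv_file)
    header = next(reader)  # first row holds the column names
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    con.commit()
    return con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]

# Demo with an in-memory CSV and database (hypothetical data, not the real dataset).
demo_csv = io.StringIO(
    "publication,article\n"
    "The New York Times,Some article text\n"
    "Reuters,Another article\n"
)
demo_con = sqlite3.connect(":memory:")
n_rows = load_csv_into_sqlite(demo_csv, demo_con)
print(n_rows)
```

With the real file you would pass an opened file handle (for example `open(csv_path, newline="", encoding="utf-8")`) and a connection to a database file instead of `:memory:`.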

In [2]:
import pandas as pd
import sqlite3
from pathlib import Path
from fastai.text.all import *

path = Path('../../../data/all-the-news')
con = sqlite3.connect(path/'all_the_news.db')

#pub_stats = pd.read_sql_query('SELECT publication, min(date),max(date), count(*) as cnt from news group by publication order by max(date) desc', con)




### Fastai Language Model

In [3]:
# bs=128
# for publisher in list(pub_stats[pub_stats.cnt > 100000].publication):
#     print(publisher)
#     query = f"SELECT article from news where publication = '%s' and length(article) > 10 order by random()"%(publisher)
#     print(query)
#     df = pd.read_sql_query(query,con)
#     dls = TextDataLoaders.from_df(df, text_col='article', is_lm=True,bs = bs)
#     learn = language_model_learner(dls, AWD_LSTM, metrics=accuracy)
#     learn.fine_tune(5)
#     fname = f'lm_all_%s_%sArticles'%(publisher,str(pub_stats[pub_stats.publication == publisher].cnt.values))
#     learn.save(fname)
#     print(fname+': Complete')


In [4]:
publisher = 'The New York Times'
records = '75000'
# Use query parameters instead of string formatting to fill in the values
query = "SELECT article from news where publication = ? and length(article) > 10 order by random() limit ?"
df = pd.read_sql_query(query, con, params=(publisher, int(records)))
bs=128
dls = TextDataLoaders.from_df(df, text_col='article', is_lm=True,bs = bs)
learn = language_model_learner(dls, AWD_LSTM, metrics=accuracy)
learn.fine_tune(1)

epoch,train_loss,valid_loss,accuracy,time
0,4.203725,3.690516,0.341063,1:08:03


epoch,train_loss,valid_loss,accuracy,time
0,3.977097,3.574996,0.356949,1:11:33


In [5]:
iteration = 0
fname = f'lm_all_%s_%sArticles_%s'%(publisher,records,str(iteration))
learn.save(fname)
iteration = iteration+1

In [7]:
for x in range(0,5):
    print(learn.predict('Immigration is',n_words=100))
    

Immigration is a disaster . After protesters closed as ambulances took their burial unconscious — so we was n’t alone , their rocks or not — “ somebody could use dying everywhere ? ” The failure to encourage migrants should be seen as something Republicans will always view : the prospect of “ standard people where terrorists could continue to behave and trust . ” In this instance , they demand that sugarcoat what Mr . Amodei calls an item : by “ simple welfare , ” the United States Department of


Immigration is a hey , fast - growing , economic - changing match . Our Obnoxious family is part of these Muslim - majority Hong Kong newspapers . Meredith Murdoch , 16 , recently became the existence of the episode ’s producer of jo - davidson Hajj Temple , a Wildly Muslim journalism , about Ellsberg ’s mayoral campaign on the Mcnally Wattage . “ in fact , no reason for it ’s going to be tangled later and probably everyone gets to visit the Royal


Immigration is a good way to reduce immigration and border violence . L.G.B.T . migrants are n’t allowed to travel freely . Bills . From the muralist social identity group : Black Americans with money and expressing their values , today is a root cause of our departure from Europe . a waiting line for stepping back in crisis ? Keeps us safe . Wall Street and Quarantine Less , podcaster ’s best - known Muslim - variety punishing religion , is to get more aggressive and garrulous .


Immigration is a problem and it can be known to many years that the one - star equivalent of post - immigration interviews with the United States Navy has revealed some , special servicer and federal tax laws that intervened in Congress last month . But President Trump has said lost policy . The White House , in a lawsuit against the White House , said in chief of the Intelligence Committee , the federal administration , that “ plotting Trump is a propaganda system of technological


Immigration is more affordable . Social workers have made up up abuse and criminal behavior in Europe , and those illegally have asylum seekers stationed in worked areas in Bangladesh . One of the refugees is the same woman who rented an equipment company called La Sponsors in Expat that ended when they took to Italy by home from home . a kidnap is coming off in Morocco . Ya no plastics , ” found in a car parked there after the motorcycle were stolen from a bakery in Lowlands .


In [9]:
learn.fit_one_cycle(1)


epoch,train_loss,valid_loss,accuracy,time
0,3.771372,3.453298,0.373525,1:10:55


In [10]:
fname = f'lm_all_%s_%sArticles_%s'%(publisher,records,str(iteration))
learn.save(fname)
iteration = iteration+1


In [22]:
for x in range(0,5):
    print(learn.predict('Immigration is',n_words=100))

Immigration is second most important of the generation and back in the country . In Germany , it ’s cheaper and more common in international immigration policies , maintaining free immigration and low - income roots . In an order to reverse such rules and reduce immigration , immigrants from northern Europe seem to be seeing the pain that has transformed the colonial regimes , like the American West . In Germany , she frequently strome migrants and migrants from places like Germany and Germany . _____ 33333 .....


Immigration is the country ’s good job , but it has even become the world ’s biggest normal expenses regimen . The United States is fighting global migration , our dumping record in uncertainty U.S . is not going to beat China , North Korea did not want taxes and fight for safe - open democracy . FLORENCE , France — In a modest French town , the French hospital in Lech , Mont . , announced a three - day March on Saturday for the


Immigration is everything , civil rights advocates know . Criminal justice is survived by economic expert lists , and adventurism . Often generally , because of a voter fraud policies , danger shrank , living conditions could be treated as a parting positive , especially in capital cases . Through a technical analysis , that found that of the three undecided respondents who attended the fair , check characterizes the 14th carriage in 2012 on normal lines . Well , racial chaos in the race was inevitably mercurial . West Distinction was no exception .


Immigration is something of a kind of distraction — one or two — in and out of this city : Last year , we warned anybody in London that the influential african - american police allegations exaggerated public voting and made it harder for blacks to flee . As licensed felons members of her community from the political establishment , Americans for years found themselves activating the small guns that are in the front row of Arlington Brooklynmuseum.org in Baltimore . One of those public transgressions — the anti - trump penalties of voting


Immigration is the kind of culture that Donald J. Trump has talked to , and it will be a provocation for the country . WASHINGTON — The pomp and symbolism of the American International Cruises Parade , the economic development of the United States and escalating tensions between countries , unfolded draped in white , biking , flying melancholic new guns , groundless Trump court battles on low - key Mexican immigrants . a crater of current crystal for return — and sheer hazard — includes dentistry , roasting


In [21]:
learn = learn.load(fname)

In [23]:
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,time
0,3.683887,3.408224,0.378588,1:11:24


In [24]:
fname = f'lm_all_%s_%sArticles_%s'%(publisher,records,str(iteration))
learn.save(fname)
iteration = iteration+1

In [25]:
for x in range(0,5):
    print(learn.predict('Immigration is',n_words=100))

Immigration is the main concern for immigration . Businesses across its State of the Union have some free customers for parts of their pay services . An independent union currently represents contracts with unions , labor unions and talent companies . Under the Jobs and Labor Act 2012 , most workers under “ LESBIAN , Gay , L.G.B.T.Q . and Latino Legal Aid ” are arrested , the week after federal immigration agents decide to rescind , to make sure it enters the United States .


Immigration is a good thing must feel a bit unequal to the Republican Party . It is certainly not guaranteed that this is the right question . When Representative Ed Shelby , a Republican from Alabama , a late - night marvel of the vicious Democratic groove in Alabama , had a calming effect on Republicans in 2012 , Democrats were angry — he turned down a promise from Republicans to join the House but recovered in front of state officials and putting his фбр in jeopardy


Immigration is a central issue at United States immigration courts in holy weeks from the beginning of much of the boroughs . As Senate Republicans , as with the Supreme Court in Los Angeles , have sidestepped the Supreme Court issue 12 times this year , Democrats will be may be better off showing expected relief . CALIFORNIA — christian leaders in Congress received 50 amicus brief briefs from Statute of Limitations this fall . Lead nominees may be re - evaluated in an interview


Immigration is one of the few issues whose most pressing rivals are museums , and specific for the museum and auction houses . We find Facebook and Google Maps convulsions and jarring anxiety , as well as sporting events like the Wimbledon Exhibition , a Modi ’s Heritage exhibition and recent work on Rosenworcel 1960s Queens . Upbeat , oneself : “ you learn from what they provide . ” This time of week , Harrison Haven has shown its history and experienced investment in minimalist culture .


Immigration is the first parallel history , going over the tennessee - maine border ; since President Obama stood upright , he broke the locks of the two men in a sign that has been crucially welcoming the traditional American League . Indeed , the definition of American customs has gotten New Order , America ’s former business partner , and Mr . Trump . When Mr . Trump kept Consul Pope Francis in my office — that ’s how it was remained — situations


In [26]:
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,time
0,3.694877,3.37942,0.381612,1:11:08


In [None]:
fname = f'lm_all_%s_%sArticles_%s'%(publisher,records,str(iteration))
learn.save(fname)
iteration = iteration+1

In [None]:
for x in range(0,5):
    print(learn.predict('Immigration is',n_words=100))

In [None]:
learn.fit_one_cycle(1)

In [None]:
fname = f'lm_all_%s_%sArticles_%s'%(publisher,records,str(iteration))
learn.save(fname)
iteration = iteration+1

In [None]:
for x in range(0,10):
    print(learn.predict('Immigration is',n_words=100))

In [None]:
for x in range(0,10):
    print(learn.predict('Abortion is',n_words=50))

# Tokenization and Numericalization

First, I need to tokenize my data. The fastai library adds some extra special tokens, such as xxbos, which marks the beginning of a text, or xxup, which indicates that the next word was written in all capital letters.

>Note:  In a previous post I showed how you can do a basic tokenization from scratch.  Please check out that post for a foundation on tokenization and numericalization.
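To show what those special tokens look like in practice, here is a toy tokenizer. It is only a sketch of the idea and not fastai's actual implementation: the function name and the regex are my own, and fastai applies more rules than these.

```python
import re

def toy_tokenize(text):
    """Split text into word and punctuation tokens, adding fastai-style special tokens."""
    tokens = ["xxbos"]  # marks the beginning of a text
    for word in re.findall(r"\w+|[^\w\s]", text):
        if word.isupper() and len(word) > 1:
            tokens += ["xxup", word.lower()]   # word was in all caps
        elif word[0].isupper():
            tokens += ["xxmaj", word.lower()]  # word was capitalized
        else:
            tokens.append(word)
    return tokens

toy_tokens = toy_tokenize("The FBI is investigating.")
print(toy_tokens)
```

Lowercasing the word and recording the capitalization as a separate token keeps the vocabulary smaller, because "The" and "the" share one entry.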

In [None]:
from fastai.text.all import *


In [None]:

txts = L(o for o in df.article)


In [None]:
spacy = WordTokenizer()
tkn = Tokenizer(spacy)

toks = txts.map(tkn);

In [None]:
# for i in range(0,len(toks)):
#     toks[i] = L(filter(lambda a: a != 'xxmaj', toks[i]))
# toks

Next we need to numericalize our data.  By that, I mean assign numbers to each unique token and replace the tokens with those numbers.  We can do that very easily using Numericalize.

In [None]:
num = Numericalize()
num.setup(toks)
coll_repr(num.vocab,20)
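Under the hood, numericalization is just a lookup table. The sketch below shows the core idea in plain Python. The function names and the tiny documents are made up, and fastai's Numericalize handles more than this.

```python
from collections import Counter

def build_vocab(token_lists):
    """Build a vocab list (most frequent first) and a token -> index lookup."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Index 0 is reserved for unknown tokens.
    vocab = ["xxunk"] + [tok for tok, _ in counts.most_common()]
    return vocab, {tok: i for i, tok in enumerate(vocab)}

def numericalize(tokens, tok2idx):
    """Replace each token with its index, using 0 (xxunk) for unseen tokens."""
    return [tok2idx.get(tok, 0) for tok in tokens]

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
vocab, tok2idx = build_vocab(docs)
ids = numericalize(["the", "dog", "ran"], tok2idx)
decoded = [vocab[i] for i in ids]
print(vocab, ids, decoded)
```

Going from numbers back to tokens is just indexing into the vocab list, which is what the cells that follow do with `num.vocab`.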

We can look at our numericalized tokens and convert them back to tokens if we need to.

In [None]:
nums = toks.map(num); 
nums[0][:20]

In [None]:
np.array(toks[0][:20])

In [None]:
' '.join(num.vocab[o] for o in nums[0][:20])

# Language Model

A language model is a form of self-supervised learning. It is different from typical classification or regression because the labels are not separate from the training data: they come from the text itself. We will use previous words (or, more specifically, tokens) to predict the next word. For this post, I will be creating this from scratch to demonstrate exactly how it works.

Let's start by creating our training set. We will create tuples where the first element is a series of tokens, and the second element is the following word. Let's see what that looks like for 1 article in both tokens and numbers. We will start by using 3 tokens to predict the next token in 1 article. We will almost certainly need to use more articles as well as more tokens for the prediction, but we can increase those numbers later.

### Packaging the Data

In [None]:
n_words = 3

In [None]:
L((toks[0][i:i+n_words], toks[0][i+n_words]) for i in range(0,len(toks[0])-(n_words+1),n_words))

In [None]:
seqs = L((nums[0][i:i+n_words], nums[0][i+n_words]) for i in range(0,len(nums[0])-(n_words+1),n_words))

seqs

In [None]:
seqs = L()
for article_num in range(0,len(nums)):
    seq = L((nums[article_num][i:i+n_words], nums[article_num][i+n_words]) for i in range(0,len(nums[article_num])-(n_words+1),n_words))
    seqs.append(seq)
    
seqs = L(item for sublist in seqs for item in sublist)

seqs

Now we can easily package this into a dataloader so that we can feed this into a model.

In [None]:
bs = 64
cut = int(len(seqs) * 0.9)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)

### The Model

So now we need to create an RNN. We will build some basic models from scratch to illustrate how it works, then finish with a fastai model for comparison.

In [None]:
# n,counts = 0,torch.zeros(len(num.vocab))
# for x,y in dls.valid:
#     n += y.shape[0]
#     for i in range_of(num.vocab): counts[i] += (y==i).long().sum()
# idx = torch.argmax(counts)

# top10 = torch.topk(counts,15)
# for idx in top10[1]:
#     print(idx, num.vocab[idx.item()], round(counts[idx].item()/n*100,1))

Now we are ready for an RNN. We will start with an RNN that is as simple as it gets.

```for i in range(3):```
Because we are feeding in 3 tokens to predict the fourth, the loop runs 3 times, once per token. You can think of this as 3 hidden layers that share the same weights.

```h = h + self.i_h(x[:,i])```
For each input token we run our input-to-hidden function (i_h). This is an embedding lookup: it grabs the row of the embedding matrix that corresponds to the token and adds it to the hidden state.
    
```h = F.relu(self.h_h(h))```
We then run our hidden-to-hidden function (h_h), which is a linear layer (y = wx + b). We apply a ReLU to that, which just replaces any negative values with 0.
    
```return self.h_o(h)```
We then run our hidden-to-output function (h_o), which is another linear layer, but this one outputs the prediction of which word is next. Naturally, the output has one value per token in our vocabulary.

Wrap all that in a class and it looks like the below:


In [None]:
class LanguageModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        
    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
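To make the arithmetic in the forward pass concrete, here is the same loop written with plain Python lists for a single example. The weights are tiny hand-picked numbers (3 tokens in the vocab, hidden size 2) chosen purely for illustration, and the helper names are my own. It is a sketch of the computation, not a replacement for the PyTorch class.

```python
def linear(x, weight, bias):
    """y = Wx + b for one vector x, with the weight stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    """Replace negative values with 0."""
    return [max(0.0, v) for v in x]

def tiny_rnn_forward(token_ids, emb, w_hh, b_hh, w_ho, b_ho):
    h = [0.0] * len(emb[0])                       # hidden state starts at 0
    for tok in token_ids:
        h = [a + b for a, b in zip(h, emb[tok])]  # i_h: add the token's embedding row
        h = relu(linear(h, w_hh, b_hh))           # h_h followed by ReLU
    return linear(h, w_ho, b_ho)                  # h_o: one score per vocab token

emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # one row per token in the vocab
w_hh, b_hh = [[1.0, -1.0], [0.0, 1.0]], [0.0, 0.0]
w_ho, b_ho = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]], [0.0, 0.0, 0.0]

logits = tiny_rnn_forward([1, 0, 0], emb, w_hh, b_hh, w_ho, b_ho)
predicted = max(range(len(logits)), key=lambda i: logits[i])
print(logits, predicted)
```

The output has 3 values because this toy vocab has 3 tokens, and the index of the largest value is the predicted next token.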

I then threw it in a learner for 5 epochs and we see about 16% accuracy. Much better than just predicting the most common word!

In [None]:
learn = Learner(dls, LanguageModel1(len(num.vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)
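For reference, the "most common word" baseline mentioned above can be estimated by counting the targets. This is a minimal sketch with made-up targets, so the number it prints says nothing about the news data. In the notebook you would count the target tokens of the validation set instead.

```python
from collections import Counter

def most_common_baseline(targets):
    """Accuracy you would get by always predicting the most frequent target."""
    token, count = Counter(targets).most_common(1)[0]
    return token, count / len(targets)

toy_targets = ["the", ",", "the", ".", "the", "of", ",", "the"]
baseline_token, baseline_acc = most_common_baseline(toy_targets)
print(baseline_token, baseline_acc)
```

Any model worth keeping should beat this number.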

One problem with the previous model is that it only uses the previous 3 words to predict the next one. In reality, words are in a logical order that is longer than 3 words, so we really don't want to reset the hidden state every time by setting h to 0. So instead we set it to 0 when we first initialize the model, but not on later calls.

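Unfortunately, this means the hidden state now depends on every token the model has seen so far, so backpropagation would have to calculate gradients all the way back through that history. That gets slower and uses more memory with every batch. So instead we only keep the gradients for the most recent batch by using "detach", which keeps the values in h but drops the gradient history.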

In [None]:

class LanguageModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()
        return out
    
    def reset(self): self.h = 0

For this to work, our data needs to be in a logical order.  So let's put our data in our dataloader in the order it was in the text.

In [None]:
def group_chunks(ds, bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds
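It can be hard to see what this reordering does, so here is a small sketch on 12 stand-in items with a batch size of 3. I rewrote the function with a plain list instead of fastai's L so it runs on its own. The name `group_chunks_plain` is my own.

```python
def group_chunks_plain(ds, bs):
    """Same reordering as group_chunks, using a plain list instead of fastai's L."""
    m = len(ds) // bs
    new_ds = []
    for i in range(m):
        new_ds += [ds[i + m * j] for j in range(bs)]
    return new_ds

# 12 items in reading order, stand-ins for our (tokens, next token) tuples
ordered = group_chunks_plain(list(range(12)), bs=3)
batches = [ordered[i:i + 3] for i in range(0, len(ordered), 3)]
first_rows = [batch[0] for batch in batches]
print(batches)
print(first_rows)
```

Row 0 of each batch reads 0, 1, 2, 3, so the hidden state that row 0 carries from one batch to the next always continues the same stretch of text. The same holds for the other rows.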


In [None]:
cut = int(len(seqs) * 0.9)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs), 
    group_chunks(seqs[cut:], bs), 
    bs=bs, drop_last=True, shuffle=False)

Now we throw it in a learner for 5 epochs and we see our accuracy is much better. It can predict the next word correctly almost 1 out of every 5 times!

In [None]:
learn = Learner(dls, LanguageModel2(len(num.vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(5, 1e-2)

There are many more steps in this iterative process to get to a really cutting-edge model, and future posts will pick up where this one leaves off and cover those steps. But for now, we have a great start and a good foundation in what an RNN is in its simplest form.

Other areas that more cutting edge architectures improve upon:
+ Rather than predicting 1 token for each group of 4 tokens (3 inputs -> 1 output), predict every word.
+ Stack the RNNs together for more layers
+ Use LSTMs
+ Regularization (e.g., dropout, AR, TAR)
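As a preview of the first item in that list, predicting every word means the target becomes the input shifted over by one token, rather than a single next token. This is a minimal sketch on stand-in numbers. The function name and the sequence length of 3 are assumptions for illustration.

```python
def make_sequences(ids, seq_len):
    """Build (input, target) pairs where the target is the input shifted by one token."""
    return [
        (ids[i:i + seq_len], ids[i + 1:i + seq_len + 1])
        for i in range(0, len(ids) - seq_len - 1, seq_len)
    ]

demo_pairs = make_sequences(list(range(10)), seq_len=3)
print(demo_pairs)
```

Each input token now has its own target, so the model gets a training signal at every position instead of once per group.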

We will continue to build on this language model until we reach close to the performance we would get using the fastai library.  See below for the out of the box language model using fastai.

### Fastai Language Model

In [None]:
df.columns

In [None]:
dls = TextDataLoaders.from_df(df, text_col='article', is_lm=True,bs = 256)
learn = language_model_learner(dls, AWD_LSTM, metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)