# In Progress - NLP - Language Model Basics
> Building a Language Model

- toc: true 
- hide: true
- badges: true
- comments: true
- author: Isaac Flath

# Intro

In this post we are going to dive into NLP, specifically a Language Model.  Language models are the foundation of most modern NLP work.  You will generally want to start with a language model, then use transfer learning to tune that model to your particular goal (e.g. classification).

So what is a language model?  In short, it is a model that uses the preceding words to predict the next word.  We do not need separate labels, because they are in the text: the next word is the label.  This trains the model on the nuances of the language you will be working with.  If you want to know if a tweet is toxic or not, you need to be able to read and understand the tweet.  The language model helps with understanding the tweet; then you can take that model and its weights and tune it for the final task (determining whether the tweet is toxic or not).
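To make "the labels are in the text" concrete, here is a minimal sketch in plain Python. The sentence and the variable names are made up for illustration; this is not how fastai builds its batches internally, just the idea of pairing each context with the word that follows it.

```python
# Toy token sequence (illustrative only, not from the dataset).
tokens = ['the', 'senate', 'voted', 'on', 'the', 'bill', '.']

# Each training example is (all preceding words, the next word).
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(' '.join(context), '->', target)

# Equivalent view: the targets are just the inputs shifted by one position.
inputs, targets = tokens[:-1], tokens[1:]
```

No one had to label anything by hand: a sequence of 7 tokens yields 6 training examples for free.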

For this post, I will be using news articles to show how to create a language model using fastai's high-level interface.  In other posts, I dive into the details of how NLP models work.  This one is focused only on the high-level API fastai offers.

# The Data

I will be using the "All-the-news" dataset from this site.  https://components.one/datasets/all-the-news-2-news-articles-dataset/

I downloaded the CSV and loaded it into a SQLite database for convenience.

In [1]:
import pandas as pd
import sqlite3
from pathlib import Path
from fastai.text.all import *

path = Path('../../../data/all-the-news')
con = sqlite3.connect(path/'all_the_news.db')

pub_stats = pd.read_sql_query('SELECT publication, min(date),max(date), count(*) as cnt from news group by publication having count(*) > 125000 order by max(date) desc', con)
pub_stats

Unnamed: 0,publication,min(date),max(date),cnt
0,The New York Times,2016-01-01 00:00:00,2020-04-01 13:42:08,252259
1,CNN,2016-01-01,2020-03-31 00:00:00,127602
2,CNBC,2016-01-01,2020-03-31 00:00:00,238096
3,Reuters,2016-01-01,2020-03-30 00:00:00,840094
4,The Hill,2016-01-01,2020-03-26 00:00:00,208411
5,People,2016-01-01 00:05:00,2019-12-15 22:40:00,136488
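The load into SQLite is not shown in this post. As a rough sketch of the idea, the snippet below uses only the standard library and a tiny in-memory stand-in for the CSV. The table name `news` and the columns `date`, `publication` and `article` match the queries in this post; the sample rows are made up, and the real file may contain other columns.

```python
import csv
import io
import sqlite3

# Tiny stand-in for the downloaded CSV (made-up rows for illustration).
raw_csv = io.StringIO(
    "date,publication,article\n"
    "2020-03-30,Reuters,Example article one\n"
    "2020-03-31,CNN,Example article two\n"
    "2020-03-29,Reuters,Example article three\n"
)

con = sqlite3.connect(":memory:")  # the post uses a database file on disk instead
con.execute("CREATE TABLE news (date TEXT, publication TEXT, article TEXT)")
rows = [(r["date"], r["publication"], r["article"]) for r in csv.DictReader(raw_csv)]
con.executemany("INSERT INTO news VALUES (?, ?, ?)", rows)

# Same shape of query as above, with the threshold passed as a parameter.
stats = con.execute(
    "SELECT publication, min(date), max(date), count(*) AS cnt "
    "FROM news GROUP BY publication HAVING count(*) > ? ORDER BY max(date) DESC",
    (1,),
).fetchall()
print(stats)
```

Passing values with `?` placeholders instead of formatting them into the SQL string avoids quoting problems with publication names.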


# Tokenization and Numericalization

First, I need to tokenize my data.  The fastai library adds some extra tokens, such as xxbos, which marks the beginning of a text, or xxup, which indicates that the next word was written in all capital letters.

>Note:  In a previous post I showed how you can do basic tokenization from scratch.  Please check out that post for a foundation on tokenization and numericalization.
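To give a feel for what these special tokens do, here is a simplified imitation in plain Python. It is not fastai's implementation (the function name `mark_case` is made up, and real tokenization also splits punctuation and handles repeats); it only mimics the xxbos, xxmaj and xxup behaviour described above.

```python
def mark_case(text):
    """Simplified imitation of fastai-style case tokens (illustrative only)."""
    tokens = ['xxbos']  # marks the beginning of a text
    for word in text.split():
        if word.isupper() and len(word) > 1:
            tokens += ['xxup', word.lower()]   # whole word was in capitals
        elif word[0].isupper():
            tokens += ['xxmaj', word.lower()]  # only the first letter was capitalized
        else:
            tokens.append(word)
    return tokens

print(mark_case("President Trump said the U.S. economy"))
```

The point of the design is that "President" and "president" share one vocabulary entry, while the capitalization is still preserved as a separate token the model can learn from.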

First, we will get a bunch of articles to tokenize.

In [2]:
publisher = 'The New York Times'
records = '1000'
query = f"SELECT article FROM news WHERE publication = '{publisher}' and length(article) > 500 order by date desc limit {records}"
df = pd.read_sql_query(query,con)

Next, let's see what tokenization looks like.  Fastai will do this for us, but it's good to know what's going into our model.  Building models without understanding what goes into them leads to lots of problems when trying to get actual value out of machine learning.

In [3]:
txts = L(o for o in df.article)

In [4]:
spacy = WordTokenizer()
tkn = Tokenizer(spacy)

toks = txts.map(tkn);

Next we need to numericalize our data.  By that, I mean assign numbers to each unique token and replace the tokens with those numbers.  We can do that very easily using Numericalize.

In [5]:
num = Numericalize()
num.setup(toks)
coll_repr(num.vocab,20)

"(#20568) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj',',','.','the','to','and','of','a','in','that','“','”'...]"

We can see below that we can look at our numericalized tokens and convert them back to tokens if we need to.  We see tokens such as xxmaj, which indicates that the next word starts with a capital letter, or xxup, which indicates that the next word is in all caps.  These are not words, but they carry meaning in the English language, so we want a way to feed them into our model.

In [6]:
nums = toks.map(num); 
nums[1][:200]

tensor([    2,     8,   140,     8,   110,   225,    14,    18,   360,   162,
           17,  4237,   890,    19,    32,    46,   391,  1966,  2661,   705,
         2694,  6129,    17,    11,     7,   302,    10,   327,   828,    66,
           39,   650,     9,     5,   109,   129,    12,  5068,     9,     5,
          109,   129,    10,     8,   963,  4354,    55,  2796,    16, 10492,
          709,   465,    10,     8,   169,     8,    87,     8,    11,   242,
           14,   456,   234,     8,    50,     8,    95,     9,     8,    50,
            8,   810,    13,     8,  1712,    74,  6425,   342,     9,     5,
          109,   129,   382,     9,    26,    57,    81,   650,     9,     5,
          109,   129,  6715,   695,    10,     0,     8,    11,     8,   149,
            8,  1455,  1313,    23,     8,   567,    17,    11,  5655,  1847,
          260,    11,    59,    66,   977,    12,    18,  6710, 14416,     9,
         6710,  8960,     9,    13,  6710,  3518,    10,    19, 

In [7]:
np.array(toks[1][:200])

array(['xxbos', 'xxmaj', 'president', 'xxmaj', 'trump', 'told', 'of', '“',
       'hard', 'days', 'that', 'lie', 'ahead', '”', 'as', 'his', 'top',
       'scientific', 'advisers', 'released', 'models', 'predicting',
       'that', 'the', 'xxup', 'u.s', '.', 'death', 'toll', 'would', 'be',
       '100', ',', 'xxrep', '3', '0', 'to', '240', ',', 'xxrep', '3', '0',
       '.', 'xxmaj', 'governors', 'complained', 'about', 'chaos', 'in',
       'obtaining', 'critical', 'supplies', '.', 'xxmaj', 'right',
       'xxmaj', 'now', 'xxmaj', 'the', 'number', 'of', 'deaths', 'across',
       'xxmaj', 'new', 'xxmaj', 'york', ',', 'xxmaj', 'new', 'xxmaj',
       'jersey', 'and', 'xxmaj', 'connecticut', 'will', 'exceed', '2',
       ',', 'xxrep', '3', '0', 'today', ',', 'with', 'more', 'than',
       '100', ',', 'xxrep', '3', '0', 'detected', 'infections', '.',
       '新冠病毒疫情最新消息', 'xxmaj', 'the', 'xxmaj', 'united', 'xxmaj',
       'nations', 'warned', 'on', 'xxmaj', 'wednesday', 'that', 'the',
      

In [8]:
' '.join(num.vocab[o] for o in nums[1][:200])

'xxbos xxmaj president xxmaj trump told of “ hard days that lie ahead ” as his top scientific advisers released models predicting that the xxup u.s . death toll would be 100 , xxrep 3 0 to 240 , xxrep 3 0 . xxmaj governors complained about chaos in obtaining critical supplies . xxmaj right xxmaj now xxmaj the number of deaths across xxmaj new xxmaj york , xxmaj new xxmaj jersey and xxmaj connecticut will exceed 2 , xxrep 3 0 today , with more than 100 , xxrep 3 0 detected infections . xxunk xxmaj the xxmaj united xxmaj nations warned on xxmaj wednesday that the unfolding battle against the coronavirus would lead to “ enhanced instability , enhanced unrest , and enhanced conflict . ” xxmaj as xxmaj americans xxunk themselves for what xxmaj president xxmaj trump said would be a “ very , very painful two weeks , ” the scale of the economic , political and societal fallout around the world came into ever greater focus . “ we are facing a global health crisis unlike any in the xxunk history
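Notice that the token '新冠病毒疫情最新消息' came back as xxunk after the round trip. Tokens that are too rare do not get their own entry in the vocab and are mapped to the unknown token instead. The sketch below shows the idea in plain Python; the function names and the `min_freq=2` cutoff are chosen for illustration and are not fastai's code or defaults.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, specials=('xxunk', 'xxpad')):
    """Special tokens first, then tokens seen at least min_freq times, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, n in counts.most_common()
                if n >= min_freq and tok not in specials]
    return list(specials) + frequent

def numericalize(tokens, vocab):
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]  # index 0 is xxunk

docs = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['the', 'bird', 'flew']]
vocab = build_vocab(docs)
ids = numericalize(['the', 'cat', 'sat'], vocab)
decoded = [vocab[i] for i in ids]
print(vocab, ids, decoded)
```

Here 'cat' appears only once, so it is encoded as 0 and decodes to xxunk, the same thing that happened to the rare token in the article above.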


# Fastai Language Model

In [9]:
publisher = 'The New York Times'
records = '10000'
query = f"SELECT article from news where publication = '{publisher}' and length(article) > 500 order by date desc limit {records}"
df = pd.read_sql_query(query,con)

In [10]:
bs=128
dls = TextDataLoaders.from_df(df, text_col='article', is_lm=True,bs = bs)
learn = language_model_learner(dls, AWD_LSTM, metrics=accuracy)



In [None]:
learn.fine_tune(10,freeze_epochs = 3)

epoch,train_loss,valid_loss,accuracy,time
0,4.57026,4.004526,0.312939,5:06:33
1,4.266904,3.756997,0.334835,5:07:47
2,4.076308,3.660905,0.343656,5:07:29


epoch,train_loss,valid_loss,accuracy,time
0,3.976979,3.619511,0.349825,5:50:52
1,3.910682,3.57626,0.355482,5:49:01
2,3.870065,3.532961,0.360485,5:48:52
3,3.801512,3.495675,0.365133,5:45:37
4,3.783073,3.466334,0.368626,5:46:00
5,3.721595,3.445397,0.371505,5:46:21
6,3.698827,3.431377,0.373355,5:47:12
7,3.657815,3.422835,0.374445,5:46:41
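One way to read the validation loss is to convert it to perplexity. This assumes the loss reported here is the average cross-entropy per token measured in nats, which I believe is the default for a fastai language model learner, in which case perplexity is simply exp(loss). The numbers below are copied from the second table above.

```python
import math

# valid_loss per epoch, copied from the unfrozen training log above
valid_losses = [3.619511, 3.57626, 3.532961, 3.495675,
                3.466334, 3.445397, 3.431377, 3.422835]

# Assumption: loss is mean cross-entropy in nats, so perplexity = exp(loss).
perplexities = [math.exp(loss) for loss in valid_losses]

for epoch, (loss, ppl) in enumerate(zip(valid_losses, perplexities)):
    print(f"epoch {epoch}: valid_loss={loss:.3f}  perplexity={ppl:.1f}")
```

Roughly speaking, a perplexity of N means the model is about as uncertain as if it were choosing uniformly among N tokens at each step, so lower is better.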


In [None]:
learn.save('lm_checkpoint')  # save() must be called with a file name; 'lm_checkpoint' is a placeholder

In [None]:
learn.lr_find()

# Results

Let's take a look at a few prompts and see what the model spits out about a few controversial topics.  What did it learn from reading 10,000 New York Times articles?

In [None]:
# `iteration` is a run counter and must already exist (e.g. set iteration = 0 once before the first save)
fname = f'lm_all_{publisher}_{records}Articles_{iteration}'
learn.save(fname)
iteration += 1

In [None]:
for x in range(0,2):
    print(learn.predict('Immigration is',n_words = 10))

In [None]:
for x in range(0,2):
    print(learn.predict('Minorities are',n_words = 10))
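Each prompt is run twice because generation is random: as far as I know, `learn.predict` samples each next word from the model's predicted distribution rather than always taking the most likely one, so two calls can give different continuations. The toy sketch below shows that loop with a simple bigram counter standing in for the neural network. The corpus and function names are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count which word follows which (a stand-in for a real language model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, n_words, rng):
    out = prompt.split()
    for _ in range(n_words):
        options = counts.get(out[-1])
        if not options:  # no known continuation for this word
            break
        words, weights = zip(*options.items())
        out.append(rng.choices(words, weights=weights)[0])  # sample the next word
    return ' '.join(out)

corpus = "the market rose today . the market fell today . the senate voted today .".split()
counts = train_bigrams(corpus)
rng = random.Random(0)
for _ in range(2):
    print(generate(counts, 'the market', n_words=5, rng=rng))
```

The real model conditions on the whole preceding text instead of just the last word, but the generate-one-word-then-repeat loop is the same idea.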
