# Training a language model on a standalone dataset with fastai
- This notebook ingests the Kaggle Covid-related tweets dataset (https://www.kaggle.com/datatattle/covid-19-nlp-text-classification).
- It then trains a language model by taking the pre-trained AWD_LSTM model as a starting point and fine-tuning it on the Covid-related tweets.


In [24]:
#hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [25]:
#hide
from fastbook import *
from fastai.text.all import *
import pickle 

In [26]:
modifier = 'standalone_mar20'

# Ingest the dataset
- define the source of the dataset
- create a dataframe for the training dataset

In [27]:
%%time
# resolve the local directory that holds the dataset and list its contents
path = URLs.path('covid_tweets')
path.ls()

CPU times: user 1.76 ms, sys: 3.52 ms, total: 5.28 ms
Wall time: 4.68 ms


(#2) [Path('/storage/archive/covid_tweets/train'),Path('/storage/archive/covid_tweets/test')]

In [28]:
# read the training CSV into a dataframe - the encoding parameter is needed
# because reading the file with the default encoding fails with a decode error
df_train = pd.read_csv(path/'train/Corona_NLP_train.csv', encoding="ISO-8859-1")
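The `encoding` argument matters because a byte such as 0xE9 is a complete character in ISO-8859-1 but only the start of a multi-byte sequence in UTF-8. The sketch below uses made-up strings (not taken from the dataset) to show the failure and the fix. It also shows the flip side: ISO-8859-1 accepts every byte value, so text that was really UTF-8 decodes without an error but comes out garbled, which is one plausible source of stray characters such as `â` in decoded tweets.

```python
# Hypothetical sample text, not taken from the dataset.
latin1_bytes = "café prices".encode("iso-8859-1")    # b'caf\xe9 prices'

try:
    latin1_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE9 announces a multi-byte UTF-8 sequence that never completes
    utf8_ok = False

# ISO-8859-1 maps every byte to a character, so this always succeeds
decoded = latin1_bytes.decode("iso-8859-1")

# The reverse mismatch raises nothing, it silently garbles the text:
curly = "\u201cblack".encode("utf-8")                # left curly quote + 'black'
garbled = curly.decode("iso-8859-1")                 # 'â', two control characters, then 'black'

print(utf8_ok, decoded, repr(garbled))
```

If the garbling matters for a downstream task, it may be worth checking the file's real encoding before settling on ISO-8859-1.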

# Create language model

In [29]:
%%time
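# create TextDataLoaders object from the tweet text
# is_lm=True builds language-model batches: the target is the input shifted by one token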
dls = TextDataLoaders.from_df(df_train, path=path, 
                              text_col='OriginalTweet',
                              is_lm=True)
dls.show_batch(max_n=3)

Unnamed: 0,text,text_
0,xxbos xxmaj share - prices of listed mining companies are in a downward spiral . xxmaj commodity prices across the industry have been tumbling as the industry considers the devastating xxunk of this âblack xxmaj xxunk event . https : / / t.co / xxunk # xxmaj covid_19 # xxmaj africa # xxunk # mining # economy xxbos xxmaj online xxmaj food xxmaj orders checklist place your order in advance order only,xxmaj share - prices of listed mining companies are in a downward spiral . xxmaj commodity prices across the industry have been tumbling as the industry considers the devastating xxunk of this âblack xxmaj xxunk event . https : / / t.co / xxunk # xxmaj covid_19 # xxmaj africa # xxunk # mining # economy xxbos xxmaj online xxmaj food xxmaj orders checklist place your order in advance order only what
1,"https : / / t.co / xxunk xxbos xxmaj this sign posted at my local grocery store . xxmaj crazy times . # fridaythoughts # socialdistanacing # flattenthecurve # coronavirus # coronavirus2020 https : / / t.co / xxunk xxbos xxmaj dear young , healthy xxunk . xxmaj please explain why supermarket aisles nationwide seem to have been emptied of sanitary products . xxmaj how many periods do you expect to have",": / / t.co / xxunk xxbos xxmaj this sign posted at my local grocery store . xxmaj crazy times . # fridaythoughts # socialdistanacing # flattenthecurve # coronavirus # coronavirus2020 https : / / t.co / xxunk xxbos xxmaj dear young , healthy xxunk . xxmaj please explain why supermarket aisles nationwide seem to have been emptied of sanitary products . xxmaj how many periods do you expect to have during"
2,"of this # coronavirus has been a run on # toiletpaper . xxmaj if you find yourself resorting to facial tissues or paper towels , do n't flush them down the toilet . xxmaj flushing xxup anything other than toilet paper can lead to xxunk . # xxmaj sarasota https : / / t.co / xxunk xxbos xxmaj is consumer # privacy dead and can it be revived ? \r\r\n xxmaj governments","this # coronavirus has been a run on # toiletpaper . xxmaj if you find yourself resorting to facial tissues or paper towels , do n't flush them down the toilet . xxmaj flushing xxup anything other than toilet paper can lead to xxunk . # xxmaj sarasota https : / / t.co / xxunk xxbos xxmaj is consumer # privacy dead and can it be revived ? \r\r\n xxmaj governments expanded"


CPU times: user 22.3 s, sys: 1.88 s, total: 24.2 s
Wall time: 26.7 s
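In the batch shown above, each `text_` entry is the matching `text` entry shifted left by one token: it drops the first token and gains one at the end. That is the next-token objective that `is_lm=True` sets up. A minimal sketch in plain Python, with a hand-written token list standing in for the tokenized tweets (fastai itself works on integer ids and fixed-length windows, which this toy version ignores):

```python
# Toy token stream copied from the start of the first sample row.
tokens = ["xxbos", "xxmaj", "share", "-", "prices", "of", "listed", "mining"]

# Input: everything except the last token.
x = tokens[:-1]
# Target: the same stream shifted by one, so y[i] is the token that follows x[i].
y = tokens[1:]

pairs = list(zip(x, y))
for current_token, next_token in pairs:
    print(f"{current_token!r:10} -> {next_token!r}")
```

The `accuracy` metric reported during training is then the fraction of positions where the model's most likely next token equals the target.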


In [30]:
%%time
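# define the learner and fine-tune the pre-trained model
# fine_tune(1, 1e-2) trains one epoch with the pre-trained layers frozen,
# then one epoch with all layers unfrozen (hence the two result tables)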
learn = language_model_learner(dls,AWD_LSTM,
                               metrics=accuracy).to_fp16()
learn.fine_tune(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,4.448665,3.95891,0.322896,02:00


epoch,train_loss,valid_loss,accuracy,time
0,4.008188,3.735647,0.343352,02:19


CPU times: user 3min 37s, sys: 49 s, total: 4min 26s
Wall time: 4min 20s
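Assuming the learner uses its default cross-entropy loss, `valid_loss` is the average negative log-likelihood per token, and exponentiating it gives perplexity, which is often easier to interpret. The sketch below does that arithmetic for the two validation losses in the tables above; the variable names are illustrative. On these numbers, perplexity falls from about 52 after the frozen epoch to about 42 after the unfrozen one.

```python
import math

# Validation losses copied from the two result tables above.
valid_loss_frozen = 3.95891      # epoch trained with the pre-trained layers frozen
valid_loss_unfrozen = 3.735647   # epoch trained with all layers unfrozen

# Perplexity is exp(average cross-entropy per token).
ppl_frozen = math.exp(valid_loss_frozen)
ppl_unfrozen = math.exp(valid_loss_unfrozen)

print(f"perplexity: {ppl_frozen:.1f} -> {ppl_unfrozen:.1f}")
```

fastai also provides a `Perplexity` metric that can be passed in `metrics` alongside `accuracy` to have this reported during training.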


# Exercise and save language model
- try out the language model with a few examples
- save the language model and the encoder

In [31]:
# generate 20 words of text that continue the prompt
learn.predict("what comes next", n_words=20)

'what comes next to know who would get mondaymood ! Got anywhere - there d have to mean was going to go'

In [32]:
# export the full learner; the file name is an absolute path, so it does not depend on learn.path
learn.export('/notebooks/temp/models/lm_model_standalone'+modifier)

In [33]:
# remember the original path so it can be restored after saving
keep_path = learn.path

In [34]:
# workaround: point learn.path at a writeable directory so that save() and save_encoder() can write under it
learn.path = Path('/notebooks/temp')

In [35]:
learn.path

Path('/notebooks/temp')

In [36]:
learn.model_dir

'models'

In [37]:
learn.save('lm_standalone'+modifier)

Path('/notebooks/temp/models/lm_standalonestandalone_mar20.pth')
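The returned path shows how the file name is assembled: `learn.path`, then `learn.model_dir`, then the name with `.pth` appended. It also shows that "standalone" appears twice, because both the prefix and `modifier` contain it. The sketch below reproduces that composition with a hypothetical helper (`weights_path` is not a fastai function) and shows what a shorter prefix would give. The doubled name is harmless, but any later notebook that loads the weights has to use exactly the same string.

```python
from pathlib import PurePosixPath

modifier = "standalone_mar20"
learn_path = PurePosixPath("/notebooks/temp")   # value assigned to learn.path above
model_dir = "models"                            # value of learn.model_dir above

def weights_path(name: str) -> PurePosixPath:
    """Hypothetical helper mirroring where the saved weights file landed."""
    return learn_path / model_dir / f"{name}.pth"

current = weights_path("lm_standalone" + modifier)   # name used in this notebook
shorter = weights_path("lm_" + modifier)             # illustrative alternative prefix

print(current)
print(shorter)
```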

In [38]:
# save the encoder on its own (also under the temporary learn.path) so that a classifier can load it later
learn.save_encoder('ft_standalone'+modifier)

In [39]:
# restore the original path now that the model and encoder are saved
learn.path = keep_path
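The keep, swap, save, restore sequence above can also be written as a context manager, which restores the original path even if a save fails partway. This is a sketch, not part of fastai: `temporary_path` is a hypothetical helper, and a `SimpleNamespace` stands in for the learner so that the example runs without a trained model.

```python
from contextlib import contextmanager
from pathlib import Path
from types import SimpleNamespace

@contextmanager
def temporary_path(obj, new_path):
    """Temporarily replace obj.path, restoring the old value on exit."""
    old_path = obj.path
    obj.path = Path(new_path)
    try:
        yield obj
    finally:
        # runs on normal exit and on exceptions alike
        obj.path = old_path

# Stand-in for the learner; only the .path attribute matters here.
fake_learn = SimpleNamespace(path=Path("/storage/archive/covid_tweets"))

with temporary_path(fake_learn, "/notebooks/temp"):
    inside = fake_learn.path      # save calls would go here
after = fake_learn.path

print(inside, after)
```

With a real learner, the calls to `learn.save(...)` and `learn.save_encoder(...)` would sit inside the `with` block.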