# Training a language model on a standalone dataset with fastai
- This notebook ingests the Kaggle Covid tweets dataset (https://www.kaggle.com/datatattle/covid-19-nlp-text-classification).
- It then fine-tunes a language model on the tweets, starting from the pretrained AWD_LSTM model.


In [1]:
#hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [14]:
#hide
from fastbook import *
from fastai.text.all import *
import pickle 

In [3]:
modifier = 'standalone_mar17'

# Training a language model
- Take a pretrained model and fine-tune it on a standalone dataset.

In [4]:
%%time
# get the path to the Covid tweets dataset and list its contents
path = URLs.path('covid_tweets')
path.ls()

CPU times: user 2.1 ms, sys: 618 µs, total: 2.72 ms
Wall time: 7.69 ms


(#2) [Path('/storage/archive/covid_tweets/train'),Path('/storage/archive/covid_tweets/test')]

# Tokenize the dataset
- To prepare for transfer learning on the Covid tweets dataset, we first need to tokenize the text.

In [9]:
# read the training CSV into a dataframe - note that the encoding parameter is needed to avoid a decode error
df_train = pd.read_csv(path/'train/Corona_NLP_train.csv',encoding = "ISO-8859-1")
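
The note about the `encoding` parameter can be illustrated without the dataset. The following is a minimal sketch, assuming the CSV contains at least some bytes that are not valid UTF-8 (the notebook does not say which bytes caused the decode error). ISO-8859-1 maps every byte value to a character, so decoding never fails, but any genuine UTF-8 text read this way comes out garbled. The sample byte strings are made up for illustration.

```python
# Sketch only: the sample byte strings are invented for illustration.
raw = b"shouldn\x92t"  # 0x92 is a curly apostrophe in Windows-1252, invalid as UTF-8

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 assigns a character to every byte value, so this never raises
latin1_text = raw.decode("ISO-8859-1")

# The price of never failing: genuine UTF-8 text is decoded to the wrong characters
pound_misread = "£7".encode("utf-8").decode("ISO-8859-1")

print(utf8_ok, repr(latin1_text), repr(pound_misread))
```

Sequences such as `â£7` in the batch output further down are consistent with this kind of misreading (after lowercasing), although the notebook does not confirm the file's true encoding.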

In [13]:
df_tok, count = tokenize_df(df_train,['OriginalTweet'])

In [23]:
# save the count object to a pickle file; the with block closes (and flushes) the file
with open('/storage/archive/covid_tweets_tok/counter.pkl', 'wb') as filehandler:
    pickle.dump(count, filehandler)

In [26]:
# reload the count object from the pickle file to check the round trip
with open('/storage/archive/covid_tweets_tok/counter.pkl', 'rb') as filehandler:
    count2 = pickle.load(filehandler)

In [None]:
# validate the count object is OK
count2
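
One way to make the validation above explicit is to compare the reloaded object with the original rather than inspecting it by eye. This is a minimal sketch, assuming `count` is a `collections.Counter` (the notebook does not show its type) and using a temporary directory in place of `/storage/archive/covid_tweets_tok`. The counts are made up.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path

# Stand-in for the token counts returned by tokenize_df (values are made up)
count = Counter({"xxbos": 5, "coronavirus": 3, "panic": 2})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "counter.pkl"

    # Write, then read back; each with block closes its file handle
    with open(pkl_path, "wb") as f:
        pickle.dump(count, f)
    with open(pkl_path, "rb") as f:
        count2 = pickle.load(f)

# The round trip should preserve both the type and the contents
assert isinstance(count2, Counter)
assert count2 == count
print(count2.most_common(2))
```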

In [10]:
df_train.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order,Positive
2,3801,48753,Vagabonds,16-03-2020,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",Positive
3,3802,48754,,16-03-2020,"My food stock is not the only one which is empty...\r\r\n\r\r\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \r\r\nStay calm, stay safe.\r\r\n\r\r\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j",Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COVID19 outbreak.\r\r\n\r\r\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...\r\r\n\r\r\n#CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n",Extremely Negative


In [29]:
df_tok.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,Sentiment,text,text_length
0,3799,48751,London,16-03-2020,Neutral,"[xxbos, @menyrbie, @phil_gahan, @chrisitv, https, :, /, /, t.co, /, ifz9fan2pa, and, https, :, /, /, t.co, /, xx6ghgfzcc, and, https, :, /, /, t.co, /, i2nlzdxno8]",27
1,3800,48752,UK,16-03-2020,Positive,"[xxbos, advice, xxmaj, talk, to, your, neighbours, family, to, exchange, phone, numbers, create, contact, list, with, phone, numbers, of, neighbours, schools, employer, chemist, xxup, gp, set, up, online, shopping, accounts, if, poss, adequate, supplies, of, regular, meds, but, not, over, order]",41
2,3801,48753,Vagabonds,16-03-2020,Positive,"[xxbos, xxmaj, coronavirus, xxmaj, australia, :, xxmaj, woolworths, to, give, elderly, ,, disabled, dedicated, shopping, hours, amid, xxup, covid-19, outbreak, https, :, /, /, t.co, /, binca9vp8p]",27
3,3802,48754,,16-03-2020,Positive,"[xxbos, xxmaj, my, food, stock, is, not, the, only, one, which, is, empty, …, \r\r\n\r\r\n, xxup, please, ,, do, n't, panic, ,, xxup, there, xxup, will, xxup, be, xxup, enough, xxup, food, xxup, for, xxup, everyone, if, you, do, not, take, more, than, you, need, ., \r\r\n, xxmaj, stay, calm, ,, stay, safe, ., \r\r\n\r\r\n▁, #, covid19france, #, xxup, covid_19, #, xxup, covid19, #, coronavirus, #, confinement, #, xxmaj, confinementotal, #, confinementgeneral, https, :, /, /, t.co, /, zrlg0z520j]",79
4,3803,48755,,16-03-2020,Extremely Negative,"[xxbos, xxmaj, me, ,, ready, to, go, at, supermarket, during, the, #, xxup, covid19, outbreak, ., \r\r\n\r\r\n, xxmaj, not, because, xxmaj, i, 'm, paranoid, ,, but, because, my, food, stock, is, litteraly, empty, ., xxmaj, the, #, coronavirus, is, a, serious, thing, ,, but, please, ,, do, n't, panic, ., xxmaj, it, causes, shortage, …, \r\r\n\r\r\n▁, #, coronavirusfrance, #, restezchezvous, #, stayathome, #, confinement, https, :, /, /, t.co, /, usmualq72n]",71
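
The tokenized output above contains special tokens: `xxbos` marks the start of a text, `xxmaj` precedes a word that was capitalized, and `xxup` precedes a word that was entirely upper case. The sketch below imitates just those three rules on whitespace-separated words. It is a toy illustration, not fastai's tokenizer: `toy_tokenize` is a hypothetical helper, and it ignores punctuation splitting, URLs, and the other rules fastai applies.

```python
def toy_tokenize(text):
    """Hypothetical helper: mimic three of the special tokens seen above."""
    tokens = ["xxbos"]  # beginning-of-text marker
    for word in text.split():
        if len(word) > 1 and word.isupper():
            tokens += ["xxup", word.lower()]   # whole word was upper case
        elif word[0].isupper():
            tokens += ["xxmaj", word.lower()]  # word was capitalized
        else:
            tokens.append(word.lower())
    return tokens

tokens = toy_tokenize("advice Talk to your GP")
print(tokens)
```

Lowercasing the words and recording the original casing in separate tokens keeps the vocabulary smaller while still letting the model see that a word was capitalized.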


# Create language model

In [34]:
%%time
# create TextDataLoaders object
dls = TextDataLoaders.from_df(df_train, path=path, text_col='OriginalTweet',
                              is_lm=True)
dls.show_batch(max_n=3)

Unnamed: 0,text,text_
0,"xxbos xxmaj work is still getting done in xxmaj xxunk while the fight against # coronavirus goes on , this time impacting xxmaj washington policies around insulin prices , the xxunk tax , and more . \r\r\n\r\r\n https : / / t.co / xxunk xxbos xxmaj the young lady from the supermarket has a good point \r\r\n\r\r\n xxmaj shopping for supplies shouldnât be a family xxunk # xxmaj lockdown # xxmaj coronavirus","xxmaj work is still getting done in xxmaj xxunk while the fight against # coronavirus goes on , this time impacting xxmaj washington policies around insulin prices , the xxunk tax , and more . \r\r\n\r\r\n https : / / t.co / xxunk xxbos xxmaj the young lady from the supermarket has a good point \r\r\n\r\r\n xxmaj shopping for supplies shouldnât be a family xxunk # xxmaj lockdown # xxmaj coronavirus #"
1,"we want fish and chips ( reasonable xxunk ) . xxmaj for the record , prices range from â£7 to xxunk . xxmaj most at the â£10 price point . # foodie # xxunk xxbos xxmaj so xxunk prices has gotten up since covid-19 while xxunk is been much cheaper . xxmaj shout out to real ones taking care of us xxbos xxunk @httweets xxmaj do n't you understand that presently fighting","want fish and chips ( reasonable xxunk ) . xxmaj for the record , prices range from â£7 to xxunk . xxmaj most at the â£10 price point . # foodie # xxunk xxbos xxmaj so xxunk prices has gotten up since covid-19 while xxunk is been much cheaper . xxmaj shout out to real ones taking care of us xxbos xxunk @httweets xxmaj do n't you understand that presently fighting with"
2,"xxup due xxup to xxup covid -19 xxup pandemic , more people are staying home ! xxbos xxmaj only to the grocery store and doctor appointments . xxmaj otherwise , take caution and sit your arses home . # xxunk # coronavirus # chinesecoronavirus # gloves https : / / t.co / xxunk xxbos xxmaj this shit is all becoming just too much . xxmaj the # xxmaj primary is a bunch","due xxup to xxup covid -19 xxup pandemic , more people are staying home ! xxbos xxmaj only to the grocery store and doctor appointments . xxmaj otherwise , take caution and sit your arses home . # xxunk # coronavirus # chinesecoronavirus # gloves https : / / t.co / xxunk xxbos xxmaj this shit is all becoming just too much . xxmaj the # xxmaj primary is a bunch of"


CPU times: user 22.1 s, sys: 1.9 s, total: 24 s
Wall time: 26.8 s
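
In the batch shown above, each `text_` entry is the matching `text` entry moved one token forward: the model sees `text` and is trained to predict the next token at every position. Here is a minimal sketch of that relationship on a made-up token list. The window length `seq_len` is an illustrative value, not fastai's setting.

```python
tokens = ["xxbos", "xxmaj", "stay", "calm", ",", "stay", "safe", "."]
seq_len = 5  # hypothetical window length, chosen small for readability

# Input is a window of tokens; target is the same window shifted one token ahead
text = tokens[0:seq_len]
text_ = tokens[1:seq_len + 1]

for x, y in zip(text, text_):
    print(f"{x!r:>10} -> {y!r}")
```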


In [35]:
%%time
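# define the learner and fine-tune it: one epoch with the pretrained layers frozen,
# then one epoch with all layers unfrozen (hence the two result tables below)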
learn = language_model_learner(dls,AWD_LSTM,metrics=accuracy).to_fp16()
learn.fine_tune(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,4.451701,3.967148,0.320308,02:01


epoch,train_loss,valid_loss,accuracy,time
0,3.988934,3.739557,0.342275,02:21


CPU times: user 3min 41s, sys: 48.5 s, total: 4min 29s
Wall time: 4min 23s
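
The validation losses in the two tables above can be easier to interpret as perplexity, which is the exponential of the cross-entropy loss. With its default settings, `fine_tune(1, 1e-2)` trains one epoch with the pretrained layers frozen and then one epoch unfrozen, which is why there are two tables. The sketch below is a worked calculation only, assuming the reported loss is an average cross-entropy in nats.

```python
import math

# Validation losses copied from the two result tables above
valid_loss_frozen = 3.967148
valid_loss_unfrozen = 3.739557

# Perplexity = exp(cross-entropy loss), assuming the loss is measured in nats
ppl_frozen = math.exp(valid_loss_frozen)
ppl_unfrozen = math.exp(valid_loss_unfrozen)

print(f"frozen epoch:   perplexity ~ {ppl_frozen:.1f}")
print(f"unfrozen epoch: perplexity ~ {ppl_unfrozen:.1f}")
```

Roughly speaking, a lower perplexity means the model is choosing among fewer equally likely candidates for each next token.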


In [37]:
# generate 20 words of text that continue the prompt
learn.predict("what comes next", n_words=20)

'what comes next to the message of panic for customers , who at this time constantly more hardship for businesses , they must'

In [38]:
# export the trained Learner to a writeable location for later inference
learn.export('/notebooks/temp/models/lm_model_'+modifier)

In [39]:
# remember the original path so it can be restored after saving
keep_path = learn.path

In [40]:
# workaround: point learn.path at a writeable directory so save() can write there
learn.path = Path('/notebooks/temp')

In [41]:
learn.path

Path('/notebooks/temp')

In [42]:
learn.model_dir

'models'

In [43]:
learn.save('lm_'+modifier)

Path('/notebooks/temp/models/lm_standalone_mar17.pth')

In [44]:
# save the encoder (relies on the writeable learn.path set above) so that a classifier can load it later
learn.save_encoder('ft_'+modifier)

In [17]:
# restore the original path
learn.path = keep_path
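
The save-and-restore steps above (keep the old path, switch to a writeable one, save, switch back) could also be wrapped in a context manager so the original path is restored even if saving fails. This is a sketch under stated assumptions: `temporary_path` is a hypothetical helper, not part of fastai, and a plain `SimpleNamespace` stands in for the `Learner` so the example runs on its own.

```python
from contextlib import contextmanager
from pathlib import Path
from types import SimpleNamespace

@contextmanager
def temporary_path(obj, new_path):
    """Hypothetical helper: set obj.path for the duration of a with block."""
    old_path = obj.path
    obj.path = Path(new_path)
    try:
        yield obj
    finally:
        obj.path = old_path  # restored even if saving raises

# Stand-in object; with fastai this would be the Learner
learn_stub = SimpleNamespace(path=Path("/storage/archive/covid_tweets"))

with temporary_path(learn_stub, "/notebooks/temp"):
    path_inside = learn_stub.path  # save()/save_encoder() calls would go here

print(path_inside, learn_stub.path)
```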