# Training a model with a larger text dataset in fast.ai

This notebook walks through training a language model in fast.ai on the larger DBpedia text dataset (560,000 training rows and 70,000 test rows).

The example is adapted from the paper by Howard and Gugger: https://arxiv.org/pdf/2002.04688.pdf

In [1]:
# imports for notebook boilerplate
!pip install -Uqq fastbook
import fastbook
from fastbook import *
from fastai.text.all import *


In [2]:
# set up the notebook for fast.ai
fastbook.setup_book()

In [3]:
# ingest the curated text dataset DBPEDIA
path = untar_data(URLs.DBPEDIA)

In [4]:
path.ls()

(#4) [Path('/storage/data/dbpedia_csv/readme.txt'),Path('/storage/data/dbpedia_csv/classes.txt'),Path('/storage/data/dbpedia_csv/test.csv'),Path('/storage/data/dbpedia_csv/train.csv')]

In [12]:
# read the train dataset into a Pandas dataframe with no column headers
df_train = pd.read_csv(path/'train.csv',header=None)
df_train.head(2)

Unnamed: 0,0,1,2
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a British coachbuilding business based in Farnham Surrey trading under that name from 1929. A major part of their output was under sub-contract to motor vehicle manufacturers. Their business closed in 1972.
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for writing colouring and cosmetics as well as markers and highlighters for office use. It is the world's largest manufacturer of highlighter pens Stabilo Boss.


In [13]:
# get the dimensions of the dataframe
print("df_train: ",df_train.shape)

df_train:  (560000, 3)


In [14]:
# read the test dataset into a Pandas dataframe with no column headers
df_test = pd.read_csv(path/'test.csv',header=None)
df_test.head(2)

Unnamed: 0,0,1,2
0,1,TY KU,TY KU /taɪkuː/ is an American alcoholic beverage company that specializes in sake and other spirits. The privately-held company was founded in 2004 and is headquartered in New York City New York. While based in New York TY KU's beverages are made in Japan through a joint venture with two sake breweries. Since 2011 TY KU's growth has extended its products into all 50 states.
1,1,Odd Lot Entertainment,OddLot Entertainment founded in 2001 by longtime producers Gigi Pritzker and Deborah Del Prete (The Wedding Planner) is a film production and financing company based in Culver City California.OddLot produced the film version of Orson Scott Card's sci-fi novel Ender's Game. A film version of this novel had been in the works in one form or another for more than a decade by the time of its release.


In [15]:
# create a combined dataframe for tokenization
df_combined = pd.concat([df_train,df_test])
df_combined.head(2)

Unnamed: 0,0,1,2
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a British coachbuilding business based in Farnham Surrey trading under that name from 1929. A major part of their output was under sub-contract to motor vehicle manufacturers. Their business closed in 1972.
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for writing colouring and cosmetics as well as markers and highlighters for office use. It is the world's largest manufacturer of highlighter pens Stabilo Boss.


In [16]:
# get the dimensions of all the dataframes
print("df_train: ",df_train.shape)
print("df_test: ",df_test.shape)
print("df_combined: ",df_combined.shape)

df_train:  (560000, 3)
df_test:  (70000, 3)
df_combined:  (630000, 3)


In [17]:
# tokenize the combined dataframe and get the token counts
# the column at position 2 (the third column) contains the text
df_tok, count = tokenize_df(df_combined,[df_combined.columns[2]])


In [18]:
df_tok.head(3)

Unnamed: 0,0,1,text,text_length
0,1,E. D. Abbott Ltd,"[xxbos, xxmaj, abbott, of, xxmaj, farnham, e, d, xxmaj, abbott, xxmaj, limited, was, a, xxmaj, british, coachbuilding, business, based, in, xxmaj, farnham, xxmaj, surrey, trading, under, that, name, from, 1929, ., a, major, part, of, their, output, was, under, sub, -, contract, to, motor, vehicle, manufacturers, ., xxmaj, their, business, closed, in, 1972, .]",54
1,1,Schwan-Stabilo,"[xxbos, schwan, -, stabilo, is, a, xxmaj, german, maker, of, pens, for, writing, colouring, and, cosmetics, as, well, as, markers, and, highlighters, for, office, use, ., xxmaj, it, is, the, world, 's, largest, manufacturer, of, highlighter, pens, xxmaj, stabilo, xxmaj, boss, .]",42
2,1,Q-workshop,"[xxbos, xxmaj, q, -, workshop, is, a, xxmaj, polish, company, located, in, xxmaj, poznań, that, specializes, in, designand, production, of, polyhedral, dice, and, dice, accessories, for, use, in, various, games, (, role, -, playing, gamesboard, games, and, tabletop, wargames, ), ., xxmaj, they, also, run, an, online, retail, store, and, maintainan, active, forum, community.q, -, workshop, was, established, in, 2001, by, xxmaj, patryk, xxmaj, strzelewicz, –, a, student, from, xxmaj, poznań, ., xxmaj, initiallythe, company, sold, its, products, via, online, auction, services, but, in, 2005, ...",92
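The token lists above contain markers such as `xxbos` (start of a text) and `xxmaj` (the next word was capitalized). The following is a minimal sketch of that idea using only the standard library. It is not fastai's tokenizer, which applies more rules and a real word tokenizer, so its output will differ on some inputs (for example hyphenated names). The function name `toy_tokenize` is hypothetical.

```python
import re

BOS, MAJ = "xxbos", "xxmaj"  # marker strings as they appear in the output above

def toy_tokenize(text):
    """Rough imitation of the markers; NOT fastai's actual tokenizer."""
    tokens = [BOS]
    # split into runs of word characters and single punctuation marks
    for word in re.findall(r"\w+|[^\w\s]", text):
        # capitalized word of 2+ letters: emit the marker, then the lowercased word
        if word[0].isupper() and word[1:].islower():
            tokens += [MAJ, word.lower()]
        else:
            tokens.append(word.lower())
    return tokens

print(toy_tokenize("Their business closed in 1972."))
```

Lowercasing every word and recording capitalization in a separate marker keeps the vocabulary smaller, because `Their` and `their` share one entry.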


In [19]:
# get the count value for a very common word, a moderately common
# word and a rare word (a count of 0 means the token never occurs)
print("very common word (count['the']):", count['the'])
print("moderately common word (count['prepared']):", count['prepared'])
print("rare word (count['ticky']):", count['ticky'])

very common word (count['the']): 1825444
moderately common word (count['prepared']): 177
rare word (count['ticky']): 0
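The lookup for `'ticky'` returns 0 instead of raising an error. That behavior matches `collections.Counter`, which is assumed here to be the type of `count`. A small sketch with a hypothetical token list:

```python
from collections import Counter

# hypothetical tokens, only for illustration
tokens = ["xxbos", "the", "company", "is", "the", "largest", "."]
count = Counter(tokens)

print(count["the"])        # token seen twice
print(count["ticky"])      # missing token: Counter returns 0, no KeyError
print("ticky" in count)    # the lookup did not add the key
```

A count of 0 therefore means the token never occurred in the tokenized text, which is different from a rare token with a small positive count.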


In [None]:
dls = TextDataLoaders.from_df(df_tok, path=path, 
    vocab = make_vocab(count),text_col = 'text', is_lm=True)
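`make_vocab(count)` turns the token counts into the vocabulary used by the dataloaders. As a hedged sketch of the general technique (frequency thresholding with special tokens placed first), and not the library's code: the function `toy_make_vocab`, the `min_freq` and `max_vocab` values, the special-token list and the count for `'zzyzx'` are all assumptions made for illustration.

```python
from collections import Counter

def toy_make_vocab(count, min_freq=3, max_vocab=60000,
                   special_toks=("xxunk", "xxpad", "xxbos")):
    """Simplified vocabulary builder; parameter values are assumed."""
    # keep the most frequent tokens that meet the frequency threshold
    kept = [tok for tok, n in count.most_common(max_vocab) if n >= min_freq]
    # special tokens go first and must not appear twice
    kept = [tok for tok in kept if tok not in special_toks]
    return list(special_toks) + kept

# counts for 'the' and 'prepared' come from the output above; 'zzyzx' is made up
count = Counter({"the": 1825444, "prepared": 177, "zzyzx": 1})
vocab = toy_make_vocab(count)
print(vocab)

# tokens outside the vocabulary map to the index of 'xxunk'
stoi = {tok: i for i, tok in enumerate(vocab)}
ids = [stoi.get(tok, stoi["xxunk"]) for tok in ["the", "zzyzx", "prepared"]]
print(ids)
```

Dropping very rare tokens keeps the embedding matrix at a manageable size on a corpus of this scale.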

In [11]:
# create the language model learner (AWD_LSTM architecture); training happens in the next cell
learn = language_model_learner(dls,AWD_LSTM,metrics=accuracy)


In [12]:
# fit the model for one epoch with max LR = 0.02; moms is not passed,
# so the learner default applies (0.8, 0.7, 0.8 for fastai text learners)
learn.fit_one_cycle( 1 , 0.02)

epoch,train_loss,valid_loss,accuracy,time
0,1.639356,1.457713,0.771113,01:38
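`fit_one_cycle` varies the learning rate and momentum over the course of training instead of holding them fixed. The sketch below shows the shape of such a schedule with cosine annealing. It is an illustration, not fastai's implementation: `div`, `div_final` and `pct_start` are assumed values, and `one_cycle` is a hypothetical helper.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle(pct, lr_max=0.02, div=25.0, div_final=1e5,
              pct_start=0.25, moms=(0.8, 0.7, 0.8)):
    """Return (learning rate, momentum) at training progress pct in [0, 1]."""
    if pct < pct_start:
        # warm-up: LR rises to lr_max while momentum falls
        pos = pct / pct_start
        return (cos_anneal(lr_max / div, lr_max, pos),
                cos_anneal(moms[0], moms[1], pos))
    # annealing: LR decays far below its start while momentum recovers
    pos = (pct - pct_start) / (1 - pct_start)
    return (cos_anneal(lr_max, lr_max / div_final, pos),
            cos_anneal(moms[1], moms[2], pos))

for pct in (0.0, 0.25, 0.5, 1.0):
    lr, mom = one_cycle(pct)
    print(f"progress={pct:.2f}  lr={lr:.6f}  momentum={mom:.3f}")
```

Under these assumptions the learning rate peaks at 0.02 a quarter of the way through the epoch, and momentum is at its lowest at that same point.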


In [13]:
preds = learn.predict('The subject is', n_words=20)

In [14]:
preds

'The subject is words , to , and sharing that , bourne , credits , bugs , " good \' , \' the'

In [15]:
# remove the stray comma and apostrophe tokens from the generated text
preds2 = preds.replace(', ','').replace('\' ','')

In [16]:
preds2

'The subject is words to and sharing that bourne credits bugs " good the'
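The chained `replace` calls work on this string but depend on the exact spacing around each symbol. One alternative, sketched here with a hypothetical helper `strip_tokens`, is to split on whitespace and drop unwanted tokens by exact match:

```python
def strip_tokens(text, unwanted=(",", "'")):
    """Drop whitespace-separated tokens that exactly match an unwanted symbol."""
    return " ".join(tok for tok in text.split() if tok not in unwanted)

# the generated text from the prediction cell above
preds = 'The subject is words , to , and sharing that , bourne , credits , bugs , " good \' , \' the'
cleaned = strip_tokens(preds)
print(cleaned)
```

Adding `'"'` to `unwanted` would also remove the leftover double quote.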

In [17]:
# export the trained learner to imdb_sample.pkl under learn.path
learn.export('imdb_sample.pkl')