**Initialization**
* Setting up the Fastai Environment. I am using Colab for this Project, so the setup steps for the Fastai Environment might differ on other platforms.

In [2]:
# Setting up the Fastai Environment.
# !pip install -Uqq fastbook
# import fastbook
# fastbook.setup_book()

* I prefer to put these 3 lines of code at the top of my Notebook. The first two reload imported modules automatically whenever their source changes, and the third renders plots inline within the Notebook.

In [3]:
# Initialization
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Libraries and Dependencies**

In [4]:
# Downloading and Importing the Libraries and Dependencies.
from fastbook import *                                        # Importing all the Libraries and Dependencies.
from fastai.text.all import *
from IPython.display import display                           # Assist in Displaying.
from IPython.display import HTML

**Getting the Data**
* Fastai provides a number of [Datasets](https://course.fast.ai/datasets) that are easy to download and use. I will be using the [IMDB Dataset](https://course.fast.ai/datasets) for this Project.

In [5]:
# Downloading and accessing the IMDB Dataset.
path = untar_data(URLs.IMDB)                                 # Downloads the IMDB Dataset.

* Now, I will use the get_text_files function to grab all the text files under the path obtained above. In Fastai, the optional parameter folders can be passed to restrict the search to a particular list of subfolders.

In [6]:
# Getting all the Text Files.
files = get_text_files(path, folders=["train", "test", "unsup"])

# Inspecting the files.
text = files[0].open().read()                                       # Reading the full text of the first file only.
text[:100]                                                          # Showing the first 100 characters of the text.

'Fuckland is an interesting film. I personally love the Dogma movement. I wish it had lasted longer. '

**Word Tokenization**
* I will use the Fastai Tokenizer for Word Tokenization. Then, I will use the Fastai coll_repr function to display the results; it displays the first n items of a collection. The word tokenizer expects a collection of text documents, so a single text should be wrapped in a list when it is passed to it directly. The tokens starting with xx are special tokens; xx is used because it is not a common word prefix in English.

In [7]:
# Word Tokenization
spacy = WordTokenizer()                                             # Instantiating the Tokenizer.
# tokens = first(spacy([text]))                                     # first returns the first item, here the tokens of the only document.
tokens = Tokenizer(spacy)                                           # Fastai Tokenizer, which adds the special xx tokens.
display(coll_repr(tokens(text), 30))                                # Printing the first 30 items from tokens.

"(#424) ['xxbos','xxmaj','fuckland','is','an','interesting','film','.','i','personally','love','the','xxmaj','dogma','movement','.','i','wish','it','had','lasted','longer','.','xxmaj','it','seems','to','have','already','died'...]"
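
* To make the special tokens less mysterious, here is a minimal pure-Python sketch of the idea behind xxbos and xxmaj. The function mark_capitals is a hypothetical helper of my own, not part of Fastai, and it deliberately ignores the other rules (such as xxup for words in all caps), so treat it as an illustration only.

```python
# Simplified illustration only: Fastai's real tokenizer rules cover many more cases.
def mark_capitals(words):
    """Lowercase each word and put 'xxmaj' before words that started with a capital."""
    out = ["xxbos"]                                  # Marks the beginning of a document.
    for w in words:
        # Single letters (like "I") and ALL-CAPS words are only lowercased here.
        if len(w) > 1 and w[0].isupper() and not w.isupper():
            out.append("xxmaj")
        out.append(w.lower())
    return out

sample = "I wish it had lasted longer . It seems".split()
marked = mark_capitals(sample)
print(marked)
```

* The capital letter is not thrown away: it is moved into its own token, so the vocabulary needs only one entry for "it" and "It".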

**Subword Tokenization**
* Chinese and Japanese sentences are written without spaces between words. Similarly, languages such as Turkish join many subwords together without spaces, creating very long words. For such languages, Subword Tokenization plays a key role.

In [8]:
# Subword Tokenization.
texts = L(x.open().read() for x in files[:2000])                    # First 2000 movie reviews.

def subword(sz):
  sp = SubwordTokenizer(vocab_sz=sz)
  sp.setup(texts)
  return " ".join(first(sp([text]))[:40])

# Implementing the Subword.
subword(1000)

'▁F uck land ▁is ▁an ▁interesting ▁film . ▁I ▁person ally ▁love ▁the ▁Do g ma ▁mo ve ment . ▁I ▁w ish ▁it ▁had ▁last ed ▁long er . ▁It ▁seem s ▁to ▁have ▁already ▁di ed . ▁Ma'

**Numericalization**
* Numericalization is the process of mapping tokens to integers. 

In [9]:
# Numericalization
token = tokens(text)
token200 = texts[:200].map(tokens)
display(token200[0])                                                             # Inspecting the tokens of the first review.

num = Numericalize()                                                             # Instantiating Numericalization.
num.setup(token200)                                                              # Building the vocab from the 200 tokenized reviews.
print(coll_repr(num.vocab, 30))

(#424) ['xxbos','xxmaj','fuckland','is','an','interesting','film','.','i','personally'...]

(#2208) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the',',','.','and','a','of','to','is','it','in','i','"','that',"'s",'this','-','as','\n\n','with','was','for'...]
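
* The mapping itself can be sketched in a few lines of plain Python. The helpers build_vocab and numericalize below are hypothetical names of my own and the two tiny documents are made up; the sketch only assumes what the output above suggests, namely that special tokens come first and the remaining tokens are ordered by frequency.

```python
from collections import Counter

def build_vocab(token_lists, specials=("xxunk", "xxpad")):
    """Special tokens first, then every other token from most to least frequent."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = [w for w, _ in counts.most_common() if w not in specials]
    return list(specials) + words

def numericalize(tokens, vocab):
    """Replace each token by its index in the vocab; unknown tokens map to 'xxunk' (0)."""
    index = {w: i for i, w in enumerate(vocab)}
    return [index.get(t, 0) for t in tokens]

# Two made-up tokenized documents.
docs = [["the", "film", "is", "good"], ["the", "film", "is", "bad", "."]]
vocab = build_vocab(docs)
ids = numericalize(["the", "plot", "is", "good"], vocab)    # "plot" is not in the vocab.
print(vocab)
print(ids)
```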


In [10]:
# Preparing LMDataLoader.
nums200 = token200.map(num)                                                       # Applying Numericalization.
dl = LMDataLoader(nums200)                                                        # Preparing LMDataLoader.

# Inspecting the LMDataLoader.
X, y = first(dl)                                                                  # first returns the first batch.
display(f"Shape of X is {X.shape}")
display(f"Shape of y is {y.shape}")

'Shape of X is torch.Size([64, 72])'

'Shape of y is torch.Size([64, 72])'
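
* X and y have the same shape because, in a Language Model, the target is the input shifted by one token. Below is a minimal sketch of that idea in plain Python; lm_pairs is a hypothetical helper and the token ids are made up. It is not how LMDataLoader is implemented, which also handles batching and shuffling.

```python
def lm_pairs(ids, seq_len):
    """Cut one long stream of token ids into (input, target) pairs offset by one token."""
    pairs = []
    for start in range(0, len(ids) - 1, seq_len):
        x = ids[start:start + seq_len]
        y = ids[start + 1:start + seq_len + 1]      # Same window, moved one token ahead.
        if len(x) == len(y):                        # Dropping an incomplete last chunk.
            pairs.append((x, y))
    return pairs

stream = list(range(10, 20))                        # Ten made-up token ids.
pairs = lm_pairs(stream, seq_len=3)
for x, y in pairs:
    print(x, "->", y)
```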

### **Training the Text Classifier**
* Assembling the Data for Training. There are two steps for training a state-of-the-art Text Classifier using Transfer Learning. First, the Language Model pretrained on Wikipedia should be fine tuned on the IMDB reviews corpus. Then, that Model can be used to train the Classifier.

**Language Model using DataBlock**
* Fastai handles Tokenization and Numericalization automatically when TextBlock is passed to the DataBlock. All the arguments that can be passed to Tokenize and Numericalize can also be passed to the TextBlock.

In [11]:
# Preparing the Language Model using DataBlock.
get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])

# Preparing DataBlock.
dls_lm = DataBlock(
    blocks = TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=129, seq_len=80)

# Inspecting the DataBlock.
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj how hard is it to write a watchable film with xxmaj vince xxmaj vaughn , xxmaj paul xxmaj giamatti and xxmaj kevin xxmaj spacey ? xxmaj apparently xxup very difficult for the writers here . \n\n i still have no idea how xxmaj santa is younger and looks 20 years older than xxmaj vince ( who plays the xxup big brother ) . i must have missed that part of the story but in reality , it really","xxmaj how hard is it to write a watchable film with xxmaj vince xxmaj vaughn , xxmaj paul xxmaj giamatti and xxmaj kevin xxmaj spacey ? xxmaj apparently xxup very difficult for the writers here . \n\n i still have no idea how xxmaj santa is younger and looks 20 years older than xxmaj vince ( who plays the xxup big brother ) . i must have missed that part of the story but in reality , it really did"
1,"is xxmaj shelley xxmaj duvall , her scene of finding xxmaj jack 's rant xxmaj all xxmaj work ▁ is incredible , that 's a look of horror and you can see that fear in her face after realizing her husband is mad . xxmaj also another incredible scene is when xxmaj jack sees a ghost woman in the bathtub , it 's honestly one of the most terrifying scenes in horror cinema . xxmaj the reason this film is","xxmaj shelley xxmaj duvall , her scene of finding xxmaj jack 's rant xxmaj all xxmaj work ▁ is incredible , that 's a look of horror and you can see that fear in her face after realizing her husband is mad . xxmaj also another incredible scene is when xxmaj jack sees a ghost woman in the bathtub , it 's honestly one of the most terrifying scenes in horror cinema . xxmaj the reason this film is so"


In [12]:
# Preparing the Language Model.
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()

# Training the Model
learn.fit_one_cycle(1, 2e-2)                                  # Training the Model for one Epoch.

| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|---|---|---|---|---|---|
| 0 | 4.128379 | 3.920033 | 0.299411 | 50.402081 | 21:41 |


* Perplexity is a metric often used in Natural Language Processing for Language Models. It is the exponential of the loss, which here is cross entropy loss. I have also included accuracy as a metric, to see how often the Model predicts the next word correctly.
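
* As a quick check of the relation between the two, the sketch below takes the exponential of the validation loss reported for the first Epoch. The result should land very close to the reported perplexity of about 50.40; a tiny difference is expected because the logged loss is rounded.

```python
import math

valid_loss = 3.920033                     # Validation loss copied from the training log.
perplexity = math.exp(valid_loss)         # Perplexity is the exponential of cross entropy loss.
print(round(perplexity, 2))
```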

**Saving and Loading Models**

In [13]:
# Saving the Model trained above.
learn.save("firstmodel")

# Loading the Model saved by the line of code above.
learn.load("firstmodel")

<fastai.text.learner.LMLearner at 0x7fd353fb26d8>

**Preparing the Model**
* Tuning the Final Model after unfreezing.

In [14]:
# Preparing the Final Model.
learn.unfreeze()                                                    # Unfreezing the Model.
learn.fit_one_cycle(6, 2e-3)                                        # Training the Model for 6 Epochs.

| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|---|---|---|---|---|---|
| 0 | 3.866808 | 3.774143 | 0.317484 | 43.560181 | 25:10 |
| 1 | 3.783684 | 3.693521 | 0.327096 | 40.186096 | 25:07 |
| 2 | 3.705884 | 3.639193 | 0.333255 | 38.061111 | 25:08 |
| 3 | 3.620303 | 3.601449 | 0.337785 | 36.651314 | 25:00 |
| 4 | 3.516824 | 3.582811 | 0.340341 | 35.97451 | 25:16 |
| 5 | 3.499033 | 3.582374 | 0.340723 | 35.958801 | 25:24 |


* Now, I will save the Model except for the final layer, which converts activations to probabilities of picking each token in the vocabulary. The Model without the final layer is called the Encoder, and I will save it with save_encoder. The Model obtained above is the Fine Tuned Language Model.

In [15]:
# Saving the Final Model.
learn.save_encoder("finetuned")

**Text Generation**
* Before moving on to fine tuning the Classifier, I will use the Model to generate random reviews. Since it is trained to guess the next word of a sentence, I can use the Model to write new reviews.

In [16]:
# Text Generation with Final Model.
TEXT = "I am bored with the movie because"                                          # Prompt with a Negative sentiment.
N_words = 50                                                                        # Number of words to generate for each review.
N_sents = 3                                                                         # Number of reviews to generate.

# Making predictions of the Next word:
preds = [learn.predict(TEXT, N_words, temperature=0.75)
         for _ in range(N_sents)]

# Inspecting the result.
print("\n".join(preds))

i am bored with the movie because I 'm a movie goer and i have never seen a movie like this as I 've ever seen . However , i was delighted with the quality of the photography , especially the sound and the overall quality of the movie . The movie was
i am bored with the movie because you will be sour at times . The plot is very thin . The acting is also very bad . The story is not that bad . It is based on true events and is not worth the time to spend . The movie was
i am bored with the movie because I 'm a Executive Producer . Even with performances of Brad Pitt and Natasha Henstridge , this movie lacks all the dramatic , suspense and emotional power . The movie is basically about Karen ( karen Sillas ) ,
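
* The temperature argument controls how random the generated words are. The sketch below shows one common way to apply a temperature to next-word probabilities: raise each probability to the power 1/temperature and renormalize. The helper apply_temperature and the three probabilities are made up for illustration, and I am not claiming this is exactly what the predict method does internally.

```python
def apply_temperature(probs, temperature):
    """Sharpen (temperature < 1) or flatten (temperature > 1) a probability distribution."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.5, 0.3, 0.2]                             # Made-up next-word probabilities.
sharper = apply_temperature(probs, 0.75)            # Favours the most likely word more.
flatter = apply_temperature(probs, 1.5)             # Spreads probability more evenly.
print(sharper)
print(flatter)
```

* With a temperature below 1, such as the 0.75 used above, sampling sticks closer to the most likely words, which usually gives more coherent but less varied text.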


**Creating the Classifier Data Loaders**

* The Language Model prepared earlier predicts the next word of the Document, so it doesn't need any external labels. The Classifier, however, predicts an external label. In the case of IMDB, it's the sentiment of the Document.

In [17]:
# Preparing the TextBlock and DataBlock of the Classifiers.
dls_clas = DataBlock(
    blocks = (TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y = parent_label, 
    get_items = partial(get_text_files, folders=["train", "test"]),
    splitter = GrandparentSplitter(valid_name="test")
).dataloaders(path, path=path, bs=128, seq_len=72)

# Inspecting the DataBlock.
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad,pos
2,xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad,neg


In [18]:
# Creating the Model to classify Texts.
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()

# Loading the Encoder.
learn.load_encoder("finetuned")

<fastai.text.learner.TextLearner at 0x7fd08153c400>

**Fine Tuning the Classifier**
* The last step is to train with Discriminative Learning Rates and gradual unfreezing. In Natural Language Processing, unfreezing a few layers at a time makes a real difference.
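
* The slice(low, high) learning rates used in the cells of this section give the earliest layers the lowest rate and the last layers the highest. My understanding is that Fastai spaces the rates in between by a constant multiplier; the sketch below reproduces that spacing in plain Python. The helper spread_learning_rates is hypothetical, and the choice of 5 parameter groups is an assumption, picked because dividing by 2.6**4 then makes each group's rate 2.6 times lower than the next.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Learning rates from lowest to highest, each a constant multiple of the previous one."""
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1.0 / (n_groups - 1))
    return [lowest * step ** i for i in range(n_groups)]

# Assumption: the Model is split into 5 parameter groups.
lrs = spread_learning_rates(1e-2 / (2.6 ** 4), 1e-2, n_groups=5)
for i, lr in enumerate(lrs):
    print(f"group {i}: {lr:.6f}")
```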

In [19]:
# Training only one Epoch.
learn.fit_one_cycle(1, 2e-2)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.350891 | 0.192791 | 0.92472 | 01:09 |


In [20]:
# Unfreezing a bit more and training for one Epoch.
learn.freeze_to(-2)                                             # Freezing all but the last two parameter groups.
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))              # Training one Epoch.

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.25989 | 0.172957 | 0.93364 | 01:15 |


In [21]:
# Training the Model after unfreezing a bit more.
learn.freeze_to(-3)                                             # Freezing all but the last three parameter groups.
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))              # Training one Epoch.

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.210548 | 0.15987 | 0.93988 | 01:34 |


In [23]:
# Training the Model after unfreezing the whole Model.
learn.unfreeze()                                                # Unfreezing the Model.
learn.fit_one_cycle(3, slice(1e-3/(2.6**4), 1e-3))              # Training three Epochs.

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.182021 | 0.153751 | 0.94284 | 01:54 |
| 1 | 0.164785 | 0.154277 | 0.94332 | 01:55 |
| 2 | 0.150096 | 0.156386 | 0.94428 | 01:55 |
