In this notebook, we run some experiments using the fasttext library for text classification, without directly designing any neural network. As discussed in the README file, fasttext can serve as a strong baseline for many text classification tasks.

In [1]:
import fasttext
import torchtext
import torch
from torchtext.data.utils import get_tokenizer

Defining a function that writes the train/test data in raw text format, so that fasttext can use it for training/validation in the next stages:

In [2]:
def training_text_generator(sample_train_iterator, output_file_name):
    # Write one example per line in fasttext's supervised format:
    # "__label__<label> <token> <token> ..."
    with open(output_file_name, "w", encoding="utf-8") as f:
        for example in sample_train_iterator:
            text = " ".join(example.text)  # tokens joined by single spaces
            f.write("__label__" + example.label + " " + text + "\n")
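To make the target file format concrete, here is a minimal sketch of how a single line is assembled. The helper name `to_fasttext_line` and the sample tokens are hypothetical and only serve as an illustration; the `__label__` prefix is the one used by the function above.

```python
def to_fasttext_line(label, tokens, prefix="__label__"):
    """Build one line in fasttext's supervised format: '<prefix><label> <tokens...>'."""
    return prefix + label + " " + " ".join(tokens) + "\n"


# Hypothetical tokenized review with a positive label.
line = to_fasttext_line("pos", ["a", "great", "movie"])
print(repr(line))  # '__label__pos a great movie\n'
```

Each review ends up on its own line, with the label first and the space-separated tokens after it.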
                    

Let's get started by loading the IMDB sentiment analysis dataset with its train/test split, and then generating the corresponding text files using the previously defined function.

In [3]:
tokenize = get_tokenizer("basic_english")
# Note: Field/LabelField are part of the legacy torchtext API, so this cell needs an older torchtext release.
TEXT = torchtext.data.Field(sequential=True, tokenize=tokenize, lower=True)
LABEL = torchtext.data.LabelField()
train_data, test_data = torchtext.datasets.IMDB.splits(TEXT, LABEL)



training_text_generator( train_data , "train_file.txt" )
training_text_generator( test_data , "test_file.txt" )


Positive and negative labels are equally frequent in both the training and test sets. Let's check the size of each dataset using the generated text files:

In [4]:
!wc train_file.txt #25K training samples
!wc test_file.txt  # 25K test samples

   25000  6790783 33812043 train_file.txt
   25000  6639743 33019346 test_file.txt
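The label balance can also be checked directly from the generated files. The following is a minimal sketch; the helper name `count_labels` and the sample lines are hypothetical, and it assumes each line starts with a `__label__` token as written by `training_text_generator`.

```python
from collections import Counter


def count_labels(lines):
    """Count how often each '__label__...' token appears as the first token of a line."""
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=1)
        if parts and parts[0].startswith("__label__"):
            counts[parts[0]] += 1
    return counts


# Hypothetical sample lines in the same format as train_file.txt.
sample = [
    "__label__pos a great movie\n",
    "__label__neg dull and slow\n",
    "__label__pos loved it\n",
]
print(count_labels(sample))
```

On the real data, the same function can be applied to an open file handle, e.g. `with open("train_file.txt") as f: print(count_labels(f))`.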


Shuffling the training dataset and splitting it into train/validation sets (22,000 and 3,000 examples, respectively):

In [5]:
!shuf -o train_file.txt  train_file.txt
!head -n 22000 train_file.txt > imdb.train
!tail -n 3000 train_file.txt > imdb.valid
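The shell commands above produce a different split on every run. If a reproducible split is preferred, the same idea can be sketched in Python with a fixed seed. The function name `shuffle_and_split` and the seed value are assumptions made for this illustration, not part of the original notebook.

```python
import random


def shuffle_and_split(lines, n_train, seed=0):
    """Shuffle a copy of `lines` with a fixed seed and split it into (train, valid)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical stand-in for the 25,000 lines of train_file.txt.
lines = [f"__label__pos example {i}\n" for i in range(10)]
train, valid = shuffle_and_split(lines, n_train=8, seed=42)
print(len(train), len(valid))  # 8 2
```

With the real file, `lines` would come from `open("train_file.txt").readlines()`, `n_train` would be 22000, and the two lists would be written to `imdb.train` and `imdb.valid`.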

Let's train our first classifier using fasttext's default supervised-mode parameters and see how it performs on the validation set.

In [6]:
model = fasttext.train_supervised( input="imdb.train")
model.test_label("imdb.valid")

{'__label__neg': {'precision': 0.8749163879598663,
  'recall': 0.8708388814913449,
  'f1score': 0.8728728728728729},
 '__label__pos': {'precision': 0.8710963455149502,
  'recall': 0.8751668891855807,
  'f1score': 0.8731268731268731}}
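As a reminder of how the reported `f1score` relates to the other two numbers, it is the harmonic mean of precision and recall. The small sketch below recomputes it from the values printed above; the helper name `f1_score` is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall values copied from the validation output above.
f1_neg = f1_score(0.8749163879598663, 0.8708388814913449)
f1_pos = f1_score(0.8710963455149502, 0.8751668891855807)
print(round(f1_neg, 4), round(f1_pos, 4))  # 0.8729 0.8731
```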

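Since the dataset is relatively small, it may help to increase the number of epochs (the default is 5) and to tune the learning rate to improve performance on the validation set. Note that the cell below keeps lr=0.1, which matches fasttext's default for supervised training, so only the number of epochs actually changes here.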

In [7]:
model = fasttext.train_supervised( input="imdb.train" , epoch=20 , lr=0.1)
model.test_label("imdb.valid")

{'__label__neg': {'precision': 0.8946322067594433,
  'recall': 0.8988015978695073,
  'f1score': 0.8967120557954168},
 '__label__pos': {'precision': 0.898054996646546,
  'recall': 0.8938584779706275,
  'f1score': 0.8959518233522917}}

What if we also consider word bigrams during training? Since this introduces more parameters to train, the number of epochs should be adjusted accordingly; a grid search can help find the best hyperparameters for each configuration.

In [8]:
model = fasttext.train_supervised( input="imdb.train" , epoch=30 , lr=0.1 , wordNgrams=2)
model.test_label("imdb.valid")

{'__label__neg': {'precision': 0.8961892247043364,
  'recall': 0.9081225033288948,
  'f1score': 0.9021164021164021},
 '__label__pos': {'precision': 0.9066305818673883,
  'recall': 0.8945260347129506,
  'f1score': 0.9005376344086021}}
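The grid search mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `iter_param_grid`, `macro_f1`, and `grid_search` are hypothetical, the candidate values are arbitrary, and the models are ranked by the average of the per-label `f1score` values returned by `test_label`, which is one possible selection criterion among several.

```python
import itertools


def iter_param_grid(grid):
    """Yield one dict per combination of the candidate values in `grid`."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def macro_f1(per_label_metrics):
    """Average the 'f1score' entries of a test_label-style result dict."""
    scores = [m["f1score"] for m in per_label_metrics.values()]
    return sum(scores) / len(scores)


def grid_search(train_path, valid_path, grid):
    """Train one model per combination and return (best_score, best_params)."""
    import fasttext  # imported lazily so the helpers above work without it

    best = None
    for params in iter_param_grid(grid):
        model = fasttext.train_supervised(input=train_path, **params)
        score = macro_f1(model.test_label(valid_path))
        if best is None or score > best[0]:
            best = (score, params)
    return best


# Illustrative candidate values only; list the combinations without training.
grid = {"epoch": [20, 30], "lr": [0.1, 0.5], "wordNgrams": [1, 2]}
for params in iter_param_grid(grid):
    print(params)
```

A call such as `grid_search("imdb.train", "imdb.valid", grid)` would then train eight models, selecting on the validation set only, so the test set stays untouched until the end.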

As we can see, in only a few steps we achieved reasonable performance on the validation set. Now let's see how our best model performs on the held-out test dataset.

In [9]:
model.test_label("test_file.txt")

{'__label__neg': {'precision': 0.8934042384316215,
  'recall': 0.90048,
  'f1score': 0.8969281644687039},
 '__label__pos': {'precision': 0.8996855092331264,
  'recall': 0.89256,
  'f1score': 0.8961085900164653}}

We ended up with good performance using a simple architecture that averages the embeddings of the words in the input text. Performance could likely be improved further with a more extensive grid search, and with ideas such as using pretrained embeddings instead of training them from scratch (which can, for example, help mitigate overfitting).
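To illustrate what "embedding averaging" means, here is a deliberately simplified pure-Python sketch: word vectors are averaged, passed through a linear layer, and normalized with a softmax. It is not fasttext's actual implementation (which, among other things, also handles n-gram features); the function names, toy embeddings, and weights are all made up for this example.

```python
import math


def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(tokens, embeddings, weights):
    """Average the known token embeddings, apply a linear layer, then softmax."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [1.0 / len(weights)] * len(weights)  # no known words: uniform guess
    hidden = average_vectors(vectors)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(scores)


# Toy 2-dimensional embeddings and one weight row per class (neg, pos).
embeddings = {"great": [1.0, 0.0], "boring": [0.0, 1.0], "movie": [0.5, 0.5]}
weights = [[-1.0, 1.0], [1.0, -1.0]]
probs = predict_proba(["great", "movie"], embeddings, weights)
print(probs)  # the second entry (pos) is the larger one
```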