# IMDB

$\textbf{Data Source}$: 
Large Movie Review Dataset v1.0 (IMDB movie reviews).
https://ai.stanford.edu/~amaas/data/sentiment/

$\textbf{Data Preparation}$: To run this Jupyter Notebook, download the dataset from the above website. Then place the 'train' and 'test' folders into the 'IMDB' directory of your cloned version of the NeuralNetworkLibrary repository.

$\textbf{Objective}$:
Text Classification - Predict label 'pos' or 'neg' for each review, based on text of the review.

In [1]:
# Automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Imports
import sys
sys.path.append("../")
from Applications.Text import *

### Look at Sample Reviews

In [2]:
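import os

# List the review files in each training folder
pos_reviews, neg_reviews = os.listdir('IMDB/train/pos'), os.listdir('IMDB/train/neg')
num_pos, num_neg = len(pos_reviews), len(neg_reviews)

# Read one sample review per class; the context managers close the files
with open('IMDB/train/pos/'+pos_reviews[0],'r') as fpos:
    pos = fpos.read()

with open('IMDB/train/neg/'+neg_reviews[0],'r') as fneg:
    neg = fneg.read()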

In [3]:
print('Number of Positive Reviews:', num_pos, '\n')
print('Number of Negative Reviews:', num_neg, '\n')

print('Sample Positive Review: \n', pos, '\n')
print('Sample Negative Review: \n', neg)

Number of Positive Reviews: 12500 

Number of Negative Reviews: 12500 

Sample Positive Review: 
 Fantastic documentary of 1924. This early 20th century geography of today's Iraq was powerful. Watch this and tell me if Cecil B. DeMille didn't take notes before making his The Ten Commandments. Merian C. Cooper, the photographer, later created Cinerama, an idea that probably hatched while filming the remarkable landscapes in this film. Fans of Werner Herzog will find this film to be a treasure, with heartbreaking tales of struggle, complimented by the land around them, never has the human capacity to endure been so evident. The fact that this was made when it was shows not only the will of the subjects, but of the filmmakers themselves. 

Sample Negative Review: 
 Basically, Cruel Intentions 2 is Cruel Intentions 1, again, only poorly done. The story is exactly the same as the first one (even some of the lines), with only a few exceptions. The cast is more unknown, and definitely less ta

### General Strategy

$\textbf{Step 1:}$
Build a language model for the IMDB review corpus, using transfer learning from a model pretrained on the WikiText-103 corpus.

$\textbf{Step 2:}$
Build a text classifier for the IMDB reviews. This uses the same LSTM encoder architecture as the language model, along with an attention-based decoder to combine the encoder outputs. The classifier's encoder is initialized with the weights of the trained language model and then fine-tuned during classifier training.
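
To make the idea of "combining encoder outputs with attention" concrete, here is a minimal NumPy sketch. It is not the decoder used by `TextClassificationNet`, whose internals are not shown in this notebook; the function name `attention_pool` and the single query vector are assumptions chosen for illustration.

```python
import numpy as np

def attention_pool(encoder_outputs, query):
    """Combine per-token encoder outputs into a single vector.

    encoder_outputs: array of shape (seq_len, hidden_size)
    query: array of shape (hidden_size,); a learned parameter in a real model
    """
    scores = encoder_outputs @ query                  # one score per token
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over tokens
    pooled = weights @ encoder_outputs                # weighted average, (hidden_size,)
    return pooled, weights

# Toy data standing in for LSTM outputs (hypothetical sizes)
rng = np.random.default_rng(0)
outputs = rng.normal(size=(6, 4))
query = rng.normal(size=4)
pooled, weights = attention_pool(outputs, query)
print(weights.round(3), pooled.shape)
```

The pooled vector has a fixed size regardless of review length, which is what lets a small classification head sit on top of a variable-length encoder output.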

NOTE: Additionally, we train on the texts in both the forward and backward directions and average the predictions to improve accuracy.

### Part (1) - Language Model

### Forward Texts

#### DataObj

NOTE: We use the unlabeled reviews (the 'unsup' folder), as well as the labeled 'pos' and 'neg' reviews, to create the language model. We expect the text of the unlabeled reviews to be similar to that of the labeled ones, and the additional data (50,000 additional reviews) should help prevent overfitting.

In [4]:
data = LanguageModelDataObj.from_folders(bs=64, bptt=75, labels=['pos','neg','unsup'], 
                                         train='IMDB/train', reverse=False)

# save string-to-int dictionary from data
with open('IMDB/data_stoi_fwd','wb') as f:
    pickle.dump(data.stoi, f)

Tokenizing ...
      

Numericalizing ...
min freq = 6 , max_vocab = 60000
Done, vocab_size =  47343
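
The numericalization output above reports a minimum token frequency and a maximum vocabulary size. The sketch below shows one common way such a vocabulary can be built; it is an assumption about the general technique, not the library's code, and the helper names and the special tokens `<unk>` and `<pad>` are hypothetical.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=6, max_vocab=60000, specials=("<unk>", "<pad>")):
    """Keep the most frequent tokens that occur at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    itos = list(specials) + kept                     # int-to-string list
    stoi = {tok: i for i, tok in enumerate(itos)}    # string-to-int dictionary
    return stoi, itos

def numericalize(tokens, stoi, unk="<unk>"):
    """Map tokens to ids, sending out-of-vocabulary tokens to the unknown id."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

# Tiny toy corpus with a lowered threshold so the effect is visible
docs = [["the", "movie", "was", "great"],
        ["the", "movie", "was", "dull"],
        ["the", "plot"]]
stoi, itos = build_vocab(docs, min_freq=2, max_vocab=10)
print(itos)
print(numericalize(["the", "plot", "was", "great"], stoi))
```

Saving `stoi` matters because the classifier must map tokens to the same ids the language model was trained with.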


#### Model and Learner

In [5]:
PATH = 'IMDB'
model = LanguageModelNet(data, pretrained='fwd')
model.clear_non_raw()
opt_func = partial(optim.Adam, betas=(0.8, 0.99))
optimizer = Optimizer(opt_func, model)
loss_func = RegSeqCrossEntropyLoss()
learner = Learner(PATH,data,model,optimizer,loss_func)
learner.freeze()

# metrics
acc = LanguageModelAccuracy()
ce = SeqCrossEntropyLoss(loss_func)

NOTE: The loss function RegSeqCrossEntropyLoss is the standard cross entropy loss between predicted and actual tokens, plus a regularization term applied to the output of the LSTM encoder. The metric SeqCrossEntropyLoss gives the unregularized cross entropy loss. The metric LanguageModelAccuracy is the fraction of correctly predicted tokens.
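
The three quantities described in this note can be sketched in NumPy as follows. The function names are hypothetical, and the form of the regularizer (a penalty on the mean squared encoder activation, weighted by `alpha`) is an assumption; the notebook only states that some regularization term is applied to the encoder output.

```python
import numpy as np

def seq_cross_entropy(logits, targets):
    """Mean cross entropy; logits is (n_tokens, vocab), targets is (n_tokens,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def token_accuracy(logits, targets):
    """Fraction of positions where the highest-scoring token is the target."""
    return (logits.argmax(axis=1) == targets).mean()

def regularized_loss(logits, targets, encoder_out, alpha=2.0):
    # Assumed regularizer: alpha times the mean squared encoder activation
    return seq_cross_entropy(logits, targets) + alpha * np.mean(encoder_out ** 2)

logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0],
                   [1.0, 0.0, 0.5]])
targets = np.array([0, 1, 2])
encoder_out = np.full((3, 4), 0.1)

ce_value = seq_cross_entropy(logits, targets)
acc_value = token_accuracy(logits, targets)
reg_value = regularized_loss(logits, targets, encoder_out)
print(ce_value, acc_value, reg_value)
```

This matches the pattern in the training tables, where the `val_loss` column is slightly larger than the first metric (the unregularized cross entropy).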

#### Training

In [6]:
learner.evaluate('val', metrics=[ce, acc])

[5.389707051585074, array([5.29454212, 0.20876536])]

In [7]:
lr = 3e-3
learner.fit(lr,1,metrics=[ce, acc])

epoch   train_loss  val_loss    metrics     

0       4.74847     4.43156     4.34480     0.27416       epoch run time: 21 min, 6.63 sec


In [8]:
learner.unfreeze()

lr_max = [1e-3,5e-3]
learner.fit_one_cycle(lr_max, 15, beta_min=0.8, beta_max=0.8, metrics=[ce,acc], 
                      save_name='lang_model_fwd', save_method='all')

epoch   train_loss  val_loss    metrics     

0       4.43827     4.21211     4.14686     0.29376       epoch run time: 22 min, 43.63 sec
1       4.37775     4.13284     4.07652     0.30114       epoch run time: 22 min, 42.65 sec
2       4.29097     4.08455     4.03664     0.30495       epoch run time: 22 min, 44.96 sec
3       4.27232     4.06161     4.02322     0.30654       epoch run time: 22 min, 43.93 sec
4       4.24789     4.04641     4.01397     0.30772       epoch run time: 22 min, 43.62 sec
5       4.19398     4.02324     3.99348     0.31020       epoch run time: 22 min, 44.09 sec
6       4.18061     4.00554     3.97723     0.31169       epoch run time: 22 min, 44.12 sec
7       4.15095     3.98286     3.95542     0.31379       epoch run time: 22 min, 44.73 sec
8       4.13074     3.96671     3.93895     0.31566       epoch run time: 22 min, 44.30 sec
9       4.08665     3.95195     3.92297     0.31750       epoch run time: 22 min, 46.74 sec
10      4.03330     3.94103     3.
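
For readers unfamiliar with one-cycle training, the sketch below shows a generic schedule: the learning rate warms up linearly to a peak and then decays along a cosine curve. This is an illustration under assumptions, not the schedule implemented by `fit_one_cycle`; the function name, `pct_warmup`, and `div` are hypothetical. It also assumes that a list such as `lr_max = [1e-3, 5e-3]` gives one peak rate per layer group.

```python
import math

def one_cycle_lr(step, total_steps, lr_max, pct_warmup=0.3, div=25.0):
    """Learning rate at `step`: linear warm-up to lr_max, then cosine decay."""
    warmup_steps = int(total_steps * pct_warmup)
    lr_start = lr_max / div
    if step < warmup_steps:
        return lr_start + (lr_max - lr_start) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_max * 0.5 * (1.0 + math.cos(math.pi * progress))

# One schedule per (assumed) layer group, each with its own peak rate
total = 100
peaks = (1e-3, 5e-3)
schedule = [[one_cycle_lr(s, total, lr) for lr in peaks] for s in range(total)]
print(schedule[0], schedule[30], schedule[-1])
```

Giving earlier layers a smaller peak rate than later ones is a common way to fine-tune pretrained weights gently while letting the newer layers adapt faster.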

### Backward Texts

#### DataObj

In [9]:
data = LanguageModelDataObj.from_folders(bs=64, bptt=75, labels=['pos','neg','unsup'], 
                                         train='IMDB/train', reverse=True)

# save string-to-int dictionary from data
with open('IMDB/data_stoi_bwd','wb') as f:
    pickle.dump(data.stoi, f)

Tokenizing ...
      

Numericalizing ...
min freq = 6 , max_vocab = 60000
Done, vocab_size =  47343


#### Model and Learner

In [10]:
PATH = 'IMDB'
model = LanguageModelNet(data, pretrained='bwd')
model.clear_non_raw()
opt_func = partial(optim.Adam, betas=(0.8, 0.99))
optimizer = Optimizer(opt_func, model)
loss_func = RegSeqCrossEntropyLoss()
learner = Learner(PATH,data,model,optimizer,loss_func)
learner.freeze()

# metrics
acc = LanguageModelAccuracy()
ce = SeqCrossEntropyLoss(loss_func)

#### Training

In [11]:
learner.evaluate('val', metrics=[ce, acc])

[5.454193824883404, array([5.36219153, 0.22140861])]

In [12]:
lr = 3e-3
learner.fit(lr,1,metrics=[ce, acc])

epoch   train_loss  val_loss    metrics     

0       4.77748     4.48984     4.40807     0.28296       epoch run time: 21 min, 4.89 sec


In [13]:
learner.unfreeze()

lr_max = [1e-3,5e-3]
learner.fit_one_cycle(lr_max, 15, beta_min=0.8, beta_max=0.8, metrics=[ce,acc],
                      save_name='lang_model_bwd', save_method='all')

epoch   train_loss  val_loss    metrics     

0       4.51578     4.27562     4.21574     0.30106       epoch run time: 22 min, 44.00 sec
1       4.42960     4.18793     4.13646     0.30878       epoch run time: 22 min, 44.06 sec
2       4.35127     4.13435     4.09161     0.31369       epoch run time: 22 min, 43.78 sec
3       4.32624     4.09959     4.06321     0.31648       epoch run time: 22 min, 45.15 sec
4       4.27456     4.07049     4.04048     0.31913       epoch run time: 22 min, 47.07 sec
5       4.25433     4.04543     4.01872     0.32155       epoch run time: 22 min, 46.67 sec
6       4.20313     4.01992     3.99372     0.32363       epoch run time: 22 min, 42.98 sec
7       4.16608     4.00435     3.97848     0.32569       epoch run time: 22 min, 45.43 sec
8       4.12145     3.98746     3.96094     0.32749       epoch run time: 22 min, 45.22 sec
9       4.11600     3.97300     3.94531     0.32917       epoch run time: 22 min, 43.70 sec
10      4.08827     3.96281     3.

### Part (2) - Text Classification

#### NOTE:
This notebook was saved, shut down, and reopened later at this point, so we need to run the imports again.

In [1]:
# Automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Imports
import sys
sys.path.append("../")
from Applications.Text import *

### Using Forward Texts

#### Load pre-trained language model

In [2]:
data = LanguageModelDataObj.from_folders(bs=64, bptt=75, labels=['pos','neg','unsup'], 
                                         train='IMDB/train', reverse=False)

PATH = 'IMDB'
model = LanguageModelNet(data, pretrained='fwd')
model.clear_non_raw()
loss_func = RegSeqCrossEntropyLoss()
learner = Learner(PATH,data,model,loss_func=loss_func)

Tokenizing ...
      

Numericalizing ...
min freq = 6 , max_vocab = 60000
Done, vocab_size =  47343


In [3]:
learner.evaluate('val')

[5.388274078455446]

In [4]:
learner.load('lang_model_fwd_14')
learner.evaluate('val')

[3.649482626720791]

In [5]:
language_model = learner.model

#### Define Text Classification DataObj

In [6]:
with open('IMDB/data_stoi_fwd','rb') as f:
    stoi = pickle.load(f)

In [7]:
data = TextClassificationDataObj.from_folders(bs=64, labels=['pos','neg'], train='IMDB/train', 
                                              test='IMDB/test', reverse=False, stoi=stoi)

Tokenizing ...
      

Numericalizing ...
min freq = 6 , max_vocab = 60000
Done, vocab_size =  47343


#### Train 5 text classifiers, each starting from the pretrained language model
We will store each classifier's predictions on the test set and ensemble them at the end.

In [8]:
PATH = 'IMDB'
loss_func = RegSeqCrossEntropyLoss()
acc = TextClassificationAccuracy()
ce = SeqCrossEntropyLoss(loss_func)
PredProbsFwd, LossFwd, AccuracyFwd = [],[],[]

for i in range(5):
    
    print('training model ' + str(i))   
    model = TextClassificationNet(PATH, language_model, num_classes=2, drop_scaling = 1.5)
    model.clear_non_raw()
    opt_func = partial(optim.Adam, betas=(0.7, 0.99))
    optimizer = Optimizer(opt_func, model)
    learner = Learner(PATH,data,model,optimizer,loss_func)
    
    learner.freeze()
    learner.fit(lr=1e-3, num_epochs=1, metrics=[ce,acc])
    
    learner.unfreeze()
    learner.fit_one_cycle(lr_max=[2e-4,1e-3,5e-3], num_epochs=10, beta_min=0.7, beta_max=0.7, clip=1.0,
                          metrics=[ce,acc], save_name='text_classifier_fwd_'+str(i), save_method='best')
    
    learner.load('text_classifier_fwd_'+str(i))
    pred_probs, pred_labels = learner.predict('test')
    labels = np.array(learner.data.test_ds.labels)
    
    PredProbsFwd.append(pred_probs)
    LossFwd.append( skm.log_loss(labels,pred_probs) )
    AccuracyFwd.append( skm.accuracy_score(labels,pred_labels) )

epoch   train_loss  val_loss    metrics     

0       0.33171     0.22379     0.19196     0.92640       epoch run time: 2 min, 13.93 sec
1       0.31970     0.21417     0.18720     0.92880       epoch run time: 2 min, 14.56 sec
2       0.29380     0.19714     0.17507     0.93520       epoch run time: 2 min, 13.61 sec
3       0.24482     0.20004     0.18194     0.93060       epoch run time: 2 min, 13.63 sec
4       0.23394     0.17767     0.16192     0.93860       epoch run time: 2 min, 14.32 sec
5       0.21873     0.17981     0.16534     0.93840       epoch run time: 2 min, 14.76 sec
6       0.20520     0.17550     0.16175     0.94120       epoch run time: 2 min, 15.10 sec
7       0.18456     0.17531     0.16186     0.94240       epoch run time: 2 min, 15.33 sec
8       0.18255     0.18150     0.16820     0.94200       epoch run time: 2 min, 14.29 sec
9       0.17305     0.17584     0.16256     0.94260       epoch run time: 2 min, 13.19 sec


In [9]:
Results = pd.DataFrame({'loss':LossFwd, 'accuracy':AccuracyFwd})
Results

| trial | loss | accuracy |
|---|---|---|
| 0 | 0.148289 | 0.94472 |
| 1 | 0.14875 | 0.9462 |
| 2 | 0.148021 | 0.94624 |
| 3 | 0.148969 | 0.94484 |
| 4 | 0.150276 | 0.94628 |


### Using Backward Texts

#### Load pre-trained language model

In [10]:
data = LanguageModelDataObj.from_folders(bs=64, bptt=75, labels=['pos','neg','unsup'], 
                                         train='IMDB/train', reverse=True)

PATH = 'IMDB'
model = LanguageModelNet(data, pretrained='bwd')
model.clear_non_raw()
loss_func = RegSeqCrossEntropyLoss()
learner = Learner(PATH,data,model,loss_func=loss_func)

Tokenizing ...
      

Numericalizing ...
min freq = 6 , max_vocab = 60000
Done, vocab_size =  47343


In [11]:
learner.evaluate('val')

[5.454925678712311]

In [12]:
learner.load('lang_model_bwd_14')
learner.evaluate('val')

[3.7176763767332544]

In [13]:
language_model = learner.model

#### Define Text Classification DataObj

In [14]:
with open('IMDB/data_stoi_bwd','rb') as f:
    stoi = pickle.load(f)

In [15]:
data = TextClassificationDataObj.from_folders(bs=64, labels=['pos','neg'], train='IMDB/train', 
                                              val=None, test='IMDB/test', reverse=True, stoi=stoi)

Tokenizing ...
      

Numericalizing ...
min freq = 6 , max_vocab = 60000
Done, vocab_size =  47343


#### Train 5 text classifiers, each starting from the pretrained language model
We will store each classifier's predictions on the test set and ensemble them at the end.

In [16]:
PATH = 'IMDB'
loss_func = RegSeqCrossEntropyLoss()
acc = TextClassificationAccuracy()
ce = SeqCrossEntropyLoss(loss_func)
PredProbsBwd, LossBwd, AccuracyBwd = [],[],[]

for i in range(5):
    
    print('training model ' + str(i))   
    model = TextClassificationNet(PATH, language_model, num_classes=2, drop_scaling = 1.0)
    model.clear_non_raw()
    opt_func = partial(optim.Adam, betas=(0.7, 0.99))
    optimizer = Optimizer(opt_func, model)
    learner = Learner(PATH,data,model,optimizer,loss_func)
    
    learner.freeze()
    learner.fit(lr=1e-3, num_epochs=1, metrics=[ce,acc])
    
    learner.unfreeze()
    learner.fit_one_cycle(lr_max=[2e-4,1e-3,5e-3], num_epochs=10, beta_min=0.7, beta_max=0.7, clip=1.0, 
                          metrics=[ce,acc], save_name='text_classifier_bwd_'+str(i), save_method='best')
    
    learner.load('text_classifier_bwd_'+str(i))
    pred_probs, pred_labels = learner.predict('test')
    labels = np.array(learner.data.test_ds.labels)
    
    PredProbsBwd.append(pred_probs)
    LossBwd.append( skm.log_loss(labels,pred_probs) )
    AccuracyBwd.append( skm.accuracy_score(labels,pred_labels) )

epoch   train_loss  val_loss    metrics     

0       0.29904     0.21705     0.18665     0.92640       epoch run time: 2 min, 14.77 sec
1       0.27228     0.20736     0.18059     0.93040       epoch run time: 2 min, 14.23 sec
2       0.25611     0.19646     0.17434     0.93100       epoch run time: 2 min, 14.43 sec
3       0.22424     0.20495     0.18600     0.93480       epoch run time: 2 min, 13.74 sec
4       0.21011     0.18275     0.16595     0.93520       epoch run time: 2 min, 14.49 sec
5       0.18539     0.18172     0.16626     0.93760       epoch run time: 2 min, 14.29 sec
6       0.17514     0.18317     0.16838     0.93820       epoch run time: 2 min, 14.91 sec
7       0.16173     0.17566     0.16123     0.94080       epoch run time: 2 min, 14.40 sec
8       0.13536     0.18572     0.17138     0.94060       epoch run time: 2 min, 14.06 sec
9       0.13861     0.18716     0.17284     0.93660       epoch run time: 2 min, 14.41 sec


In [17]:
Results = pd.DataFrame({'loss':LossBwd, 'accuracy':AccuracyBwd})
Results

| trial | loss | accuracy |
|---|---|---|
| 0 | 0.154592 | 0.94228 |
| 1 | 0.156131 | 0.94232 |
| 2 | 0.181118 | 0.93044 |
| 3 | 0.155513 | 0.94216 |
| 4 | 0.154111 | 0.94248 |


### Combining Predictions

Averaging All Predictions

In [22]:
predprobs_fwd_avg = np.array(PredProbsFwd).mean(axis=0)
predprobs_bwd_avg = np.array(PredProbsBwd).mean(axis=0)
predprobs_avg = (predprobs_fwd_avg + predprobs_bwd_avg)/2
predlabels_avg = np.argmax(predprobs_avg,axis=1)

loss = skm.log_loss(labels,predprobs_avg)
accuracy = skm.accuracy_score(labels,predlabels_avg)
print('loss:',loss)
print('accuracy:',accuracy)

loss: 0.1371316749236456
accuracy: 0.94968


Averaging Only Predictions for Forward Texts

In [23]:
predlabels_fwd = np.argmax(predprobs_fwd_avg,axis=1)
loss = skm.log_loss(labels,predprobs_fwd_avg)
accuracy = skm.accuracy_score(labels,predlabels_fwd)
print('loss:',loss)
print('accuracy:',accuracy)

loss: 0.14445905760379318
accuracy: 0.94656


Averaging Only Predictions for Backward Texts

In [24]:
predlabels_bwd = np.argmax(predprobs_bwd_avg,axis=1)
loss = skm.log_loss(labels,predprobs_bwd_avg)
accuracy = skm.accuracy_score(labels,predlabels_bwd)
print('loss:',loss)
print('accuracy:',accuracy)

loss: 0.1517001601814566
accuracy: 0.94372


So, using both forward-text and backward-text predictions in the ensemble gives a clear improvement: accuracy is 0.94968 for the combined ensemble, versus 0.94656 for forward texts only and 0.94372 for backward texts only.

To improve further, we will try one small tweak. Trial number 2 of training on the backward texts had noticeably worse accuracy than the others (about 0.930 versus about 0.942 for all others). I'm not sure why the model did not train as well that time, but we will try omitting this trial's predictions from the backward ensemble.

In [25]:
predprobs_fwd_avg = np.array(PredProbsFwd).mean(axis=0)
predprobs_bwd_avg = np.array(PredProbsBwd[:2] + PredProbsBwd[3:]).mean(axis=0)
predprobs_avg = (predprobs_fwd_avg + predprobs_bwd_avg)/2
predlabels_avg = np.argmax(predprobs_avg,axis=1)

loss = skm.log_loss(labels,predprobs_avg)
accuracy = skm.accuracy_score(labels,predlabels_avg)
print('loss:',loss)
print('accuracy:',accuracy)

loss: 0.13573877105064341
accuracy: 0.9504


So, yes, omitting the "bad" trial number 2 for the backward texts led to a small further improvement: accuracy rose from 0.94968 to 0.9504, and log loss fell from 0.13713 to 0.13574.

Final Accuracy = 0.9504