<center><h1>Bag of Words Text Classification</h1></center>

In this tutorial we will show how to build a simple Bag of Words (BoW) text classifier using PyTorch. The classifier is trained on IMDB movie reviews dataset. 


<h4>
The concepts covered in this tutorial are: 
<br>
<br> 1. NLP text <i><b>pre-processing</b></i>
<br>
<br> 2. Split of <i><b>training, validation and testing datasets</b></i>
<br>
<br> 3. How to build a simple <i><b>feed-forward neural network</b></i> using PyTorch 
<br>
<br> 4. How different <i><b>optimizer</b></i> affects learning rate and convergence to global minimum 
<br>
<br> 5. <i><b>Under-fitting v.s. Over-fitting</b></i> 
<br>
<br> 6. <i><b>BoW</b></i> text classifier 
</h4>

In [1]:
!pip install nltk
!pip install tqdm



In [1]:
import re # regular expression
from collections import Counter 
from functools import partial

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# from google_drive_downloader import GoogleDriveDownloader as gdd
from IPython.core.display import display, HTML
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report



# PyTorch modules
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataset import random_split

# nltk text processors
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer

from pypeln import process as pr # multi-processing
from tqdm import tqdm, tqdm_notebook # show progress bar

%matplotlib inline
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jeffrey/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jeffrey/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
# DATA_PATH = 'data/imdb_reviews.csv'
# if not Path(DATA_PATH).is_file():
#     gdd.download_file_from_google_drive(
#         file_id='1zfM5E6HvKIe7f3rEt1V2gBpw5QOSSKQz',
#         dest_path=DATA_PATH,
#     )

In [4]:
# Run locally
DATA_PATH = '/Users/jeffrey/Downloads/imdb_reviews.csv'
df = pd.read_csv(
    DATA_PATH,
    encoding='ISO-8859-1',
)

**Take a look at a few examples**

In [5]:
df.loc[[55, 12361], :]

Unnamed: 0,review,label
55,Seeing this film for the first time twenty yea...,0
12361,I went and saw this movie last night after bei...,1


In [6]:
print('Number of records:', len(df), '\n')
print('Negative review:')
print(df.loc[55,].review, '\n')
print('Positive review:')
print(df.loc[12361,].review, '\n')

Number of records: 62155 

Negative review:
Seeing this film for the first time twenty years after its release I don't quite get it. Why has this been such a huge hit in 1986? Its amateurishness drips from every scene. The jokes are lame and predictable. The sex scenes are exploitative and over the top (that is not to say that Miss Rudnik does not have nice boobs!). The singing is "schrecklich". The only genuinely funny scene is the big shoot out when the gangsters die break dancing, a trait that dates the movie firmly to the mid-eighties. It's really quite puzzling to me how incapable I am to grasp what evoked the enthusiasm of the cheering audiences in 1986 (and apparently still today, reading my fellow IMDBers comments). 

Positive review:
I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of J

## Preprocess Text

* Replace weird characters
* Lowercase
* Tokenize 
* Stemming & Lemmatize
* Remove stopwords

**Let's see how to pre-process these steps one by one. Below I constructed a test corpus which composed of 3 reviews. Each review is a paragraph.**

In [8]:
test_corpus = "Yesterday I gathered 184 datasets for training, but I just found out they're not useful at all!!"
test_corpus

"Yesterday I gathered 184 datasets for training, but I just found out they're not useful at all!!"

In [9]:
# remove special characters & lowercase
clean_corpus = re.sub(r'[^\w\s]', '', test_corpus)
clean_corpus = clean_corpus.lower()
clean_corpus

'yesterday i gathered 184 datasets for training but i just found out theyre not useful at all'

In [10]:
# tokenize
clean_tokens = wordpunct_tokenize(clean_corpus)
print(clean_tokens)

['yesterday', 'i', 'gathered', '184', 'datasets', 'for', 'training', 'but', 'i', 'just', 'found', 'out', 'theyre', 'not', 'useful', 'at', 'all']


In [11]:
lemmatizer = WordNetLemmatizer()
clean_tokens = [lemmatizer.lemmatize(token) for token in clean_tokens]
clean_tokens = [lemmatizer.lemmatize(token, "v") for token in clean_tokens]
print(clean_tokens)

['yesterday', 'i', 'gather', '184', 'datasets', 'for', 'train', 'but', 'i', 'just', 'find', 'out', 'theyre', 'not', 'useful', 'at', 'all']


In [12]:
clean_tokens = [re.sub(r'[0-9]+', '<NUM>', token) for token in clean_tokens]
print(clean_tokens)

['yesterday', 'i', 'gather', '<NUM>', 'datasets', 'for', 'train', 'but', 'i', 'just', 'find', 'out', 'theyre', 'not', 'useful', 'at', 'all']


In [13]:
stop_words = set(stopwords.words('english'))
clean_tokens = [token for token in clean_tokens if token not in stop_words]
print(clean_tokens)

['yesterday', 'gather', '<NUM>', 'datasets', 'train', 'find', 'theyre', 'useful']


In [14]:
def build_vocab(corpus):
    vocab = {}
    for doc in corpus:
        for token in doc:
            if token not in vocab.keys():
                vocab[token] = len(vocab)
    return vocab

build_vocab([clean_tokens])

{'yesterday': 0,
 'gather': 1,
 '<NUM>': 2,
 'datasets': 3,
 'train': 4,
 'find': 5,
 'theyre': 6,
 'useful': 7}

In [15]:
def build_index2token(vocab):
    index2token = {}
    for token in vocab.keys():
        index2token[vocab[token]] = token
    return index2token

build_index2token(build_vocab([clean_tokens]))

{0: 'yesterday',
 1: 'gather',
 2: '<NUM>',
 3: 'datasets',
 4: 'train',
 5: 'find',
 6: 'theyre',
 7: 'useful'}

**Let's pacakage the pre-processing steps together into functions and apply on our dataset**

In [7]:
def remove_rare_words(tokens, common_tokens, max_len):
    return [token if token in common_tokens else '<UNK>' for token in tokens][-max_len:]

def replace_numbers(tokens):
    return [re.sub(r'[0-9]+', '<NUM>', token) for token in tokens]

def tokenize(text, stop_words, lemmatizer):
    text = re.sub(r'[^\w\s]', '', text) # remove special characters
    text = text.lower() # lowercase
    tokens = wordpunct_tokenize(text) # tokenize
    tokens = [lemmatizer.lemmatize(token) for token in tokens] # noun lemmatizer
    tokens = [lemmatizer.lemmatize(token, "v") for token in tokens] # verb lemmatizer
    tokens = [token for token in tokens if token not in stop_words] # remove stopwords
    return tokens

def build_bow_vector(sequence, idx2token):
    vector = [0] * len(idx2token)
    for token_idx in sequence:
        if token_idx not in idx2token:
            raise ValueError('Wrong sequence index found!')
        else:
            vector[token_idx] += 1
    return vector

In [8]:
# Set parameters
N_WORKERS = 10
MAX_LEN = 128
MAX_VOCAB = 8000

class ImdbDataset(Dataset):
    def __init__(self, data_path, max_vocab=5000, max_len=128):
        df = pd.read_csv(data_path)
        
        # Clean and tokenize
        stop_words = set(stopwords.words('english'))
        lemmatizer = WordNetLemmatizer()
        stage = pr.map(
            partial(
                tokenize,
                stop_words=stop_words,
                lemmatizer=lemmatizer,
            ),
            df.review.tolist(), 
            workers=N_WORKERS,
        )
        df['tokens'] = list(x for x in tqdm(stage, total=len(df)))    
        
        all_tokens = [token for doc in list(df.tokens) for token in doc]
        
        # Build most common tokens bound by max vocab size
        common_tokens = set( 
            list(
                zip(*Counter(all_tokens).most_common(max_vocab))
            )[0] 
        )
        
        # Replace rare words with <UNK>
        stage = pr.map(
            partial(
                remove_rare_words,
                common_tokens=common_tokens,
                max_len=max_len
            ), 
            df.tokens.tolist(), 
            workers=N_WORKERS,
        )
        df.loc[:, 'tokens'] = list(x for x in tqdm(stage, total=len(df.tokens)))
        
        # Replace numbers with <NUM>
        stage = pr.map(
            partial(
                replace_numbers,
            ), 
            df.tokens.tolist(), 
            workers=N_WORKERS,
        )
        df.loc[:, 'tokens'] = list(x for x in tqdm(stage, total=len(df.tokens)))
        
        # Remove sequences with only <UNK>
        stage = pr.map(
            lambda tokens: any(token != '<UNK>' for token in tokens), 
            df.tokens.tolist(), 
            workers=N_WORKERS,
        )
        df = df[list(x for x in tqdm(stage, total=len(df.tokens)))]
        
        # Build vocab
        vocab = sorted(set(
            token for doc in list(df.tokens) for token in doc
        ))
        self.token2idx = {token: idx for idx, token in enumerate(vocab)}
        self.idx2token = {idx: token for token, idx in self.token2idx.items()}
        
        # Convert tokens to indexes
        df['indexed_tokens'] = df.tokens.apply(
            lambda doc: [self.token2idx[token] for token in doc],
        )
        
        # Build BoW vector
        df['bow_vector'] = df.indexed_tokens.apply(
            build_bow_vector, args=(self.idx2token,)
        )
        
        # Build TF-IDF vector
        vectorizer = TfidfVectorizer(
            analyzer='word',
            tokenizer=lambda doc: doc,
            preprocessor=lambda doc: doc,
            token_pattern=None,
        )
        vectors = vectorizer.fit_transform(df.tokens).toarray()
        df['tfidf_vector'] = [vector.tolist() for vector in vectors]
        
        self.text = df.review.tolist()
        self.sequences = df.indexed_tokens.tolist()
        self.bow_vector = df.bow_vector.tolist()
        self.tfidf_vector = df.tfidf_vector.tolist()
        self.targets = df.label.tolist()
    
    def __getitem__(self, i):
        return (
            self.sequences[i],
            self.bow_vector[i],
            self.tfidf_vector[i],
            self.targets[i],
            self.text[i],
        )
    
    def __len__(self):
        return len(self.targets)

In [9]:
dataset = ImdbDataset(DATA_PATH, max_vocab=MAX_VOCAB, max_len=MAX_LEN)

100%|██████████| 62155/62155 [01:37<00:00, 637.14it/s] 
100%|██████████| 62155/62155 [00:29<00:00, 2096.85it/s]
100%|██████████| 62155/62155 [00:29<00:00, 2142.84it/s]
100%|██████████| 62155/62155 [00:22<00:00, 2763.91it/s]


See a random sample out of the dataset processed

In [15]:
print('Number of records:', len(dataset), '\n')

import random
random_idx = random.randint(0,len(dataset)-1)
print('index:', random_idx, '\n')
sample_seq, bow_vector, tfidf_vector, sample_target, sample_text = dataset[random_idx]
print(sample_text, '\n')
print(sample_seq, '\n')
print('BoW vector size:', len(bow_vector), '\n')
print('TF-IDF vector size:', len(tfidf_vector), '\n')
print('Sentiment:', sample_target, '\n')

Number of records: 62155 

index: 21322 

(spoilers?)<br /><br />I've heard some gripe about the special effects. But that should detract from the movie. THe movie is a suspense film. And it's very good at that. So from that stand point, this movie rocks. Franke rocks. Enjoy to one's plastic hearts content. So no complaints for this movie. Unless you watch the english dub, which is a total farce. It creates the illusion it's a B movie. <br /><br />One complaint I do have is the music video on the dvd. It doesn't say who sings it. I'd love to know. <br /><br />8/10<br /><br />Quality: 10/10 Entertainment : 10/10 Replayable: 5/10 

[1119, 1561, 4926, 2162, 2581, 2331, 4695, 6961, 8, 3162, 1829, 7560, 2451, 6148, 8, 7038, 3726, 816, 7560, 2442, 624, 3773, 4446, 6261, 8, 2218, 8, 8, 8, 1933, 5240, 5134, 991, 7501, 3570, 8, 3397, 3386, 3921, 8, 8, 7818, 291, 4897, 2710, 8, 8, 8, 3294, 1990, 1119, 3055, 6085, 6085, 6208, 8, 2325, 8, 7259, 3162, 7818, 1119, 4731, 5254, 1746, 7594, 7259, 2044,

## Split into training, validation, and test sets

- **Training**: data the model learns from
- **Validation**: data to evaluate with for hyperparameter tuning (make sure the model doesn't overfit!)
- **Testing**: data to evaluate the final performance of the model

In [16]:
def split_train_valid_test(corpus, valid_ratio=0.1, test_ratio=0.1):
    """Split dataset into train, validation, and test."""
    test_length = int(len(corpus) * test_ratio)
    valid_length = int(len(corpus) * valid_ratio)
    train_length = len(corpus) - valid_length - test_length
    return random_split(
        corpus, lengths=[train_length, valid_length, test_length],
    )

In [17]:
train_dataset, valid_dataset, test_dataset = split_train_valid_test(
    dataset, valid_ratio=0.05, test_ratio=0.05)
len(train_dataset), len(valid_dataset), len(test_dataset)

(55941, 3107, 3107)

In [18]:
BATCH_SIZE = 528

def collate(batch):
    seq = [item[0] for item in batch]
    bow = [item[1] for item in batch]
    tfidf = [item[2] for item in batch]
    target = torch.LongTensor([item[3] for item in batch])
    text = [item[4] for item in batch]
    return seq, bow, tfidf, target, text

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, collate_fn=collate)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=collate)

In [19]:
print('number of training batches:', len(train_loader), '\n')
batch_idx = random.randint(0, len(train_loader)-1)
example_idx = random.randint(0, BATCH_SIZE-1)

for i, fields in enumerate(train_loader):
    seq, bow, tfidf, target, text = fields
    if i == batch_idx:
        print('Training input sequence:', seq[example_idx], '\n')
        print('BoW vector size:', len(bow[example_idx]), '\n')
        print('TF-IDF vector size:', len(tfidf[example_idx]), '\n')
        print('Label: ', target[example_idx], '\n')
        print('Review text:', text[example_idx], '\n')
    

number of training batches: 106 

Training input sequence: [7605, 4607, 6148, 1377, 3482, 2408, 1522, 7704, 7560, 7360, 3386, 4607, 7582, 4087, 4607, 7560, 7289, 7116, 816, 4187, 6712, 7560, 2996, 3386, 406, 7560, 5555, 526, 6712, 7560, 7825, 4278, 2612, 4187, 2953, 6809, 8, 7771, 668, 4608, 816, 1009, 1092, 6738, 8, 7647, 5074, 4096, 2530, 7560, 5555, 5875, 2044, 3175, 8, 8, 4005, 8, 4824, 2978, 5947, 5967, 2608, 1722, 5082, 7737, 6144, 2379, 2125, 8, 5986, 5947, 6113, 328, 5645, 8, 5082, 2649, 7510, 7088, 6637, 5317, 2843, 1298, 329, 7808, 4773, 4269, 4616, 6178, 816, 4607, 2408, 75, 3735, 4087, 7045, 1900, 6692, 986, 5666, 4607] 

BoW vector size: 7851 

TF-IDF vector size: 7851 

Label:  tensor(1) 

Review text: CAT SOUP is a short anime based on the legendary manga Nekojiru. It won the award \Best Short Film\" at The 6th Fantasia Film Festival and also won the \"Excellence Prize\" at Japan's Media Arts Festival.<br /><br />When little kitten Nyaako's soul is stolen by Death, she a

## Build BoW Model

- Input: BoW Vector
- Model: 
    - feed-forward fully connected network
    - 2 hidden layers
- Output: 
    - vector size of 2 (2 possible outcome: positive v.s. negative)
    - probability of input document classified as the label

In [50]:
class BoWLogisticClassifier(nn.Module):
    def __init__(self, device, vocab_size, output_size):
        self.device = device
        self.Linear = nn.Linear(vocab_size, output_size)

tensor([2., 3., 4., 5.])

In [33]:
class BoWClassifier(nn.Module):
    def __init__(self, device, vocab_size, hidden1, hidden2, num_labels, batch_size):
        super(BoWClassifier, self).__init__()
        self.device = device
        self.batch_size = batch_size
        self.fc1 = nn.Linear(vocab_size, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.fc3 = nn.Linear(hidden2, num_labels)
    
    def forward(self, x):
        batch_size = len(x)
        if batch_size != self.batch_size:
            self.batch_size = batch_size
        x = torch.FloatTensor(x)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x))


In [52]:
HIDDEN1 = 100
HIDDEN2 = 50

bow_model = BoWClassifier(
    vocab_size=len(dataset.token2idx),
    hidden1=HIDDEN1,
    hidden2=HIDDEN2,
    num_labels=2,
    device=device,
    batch_size=BATCH_SIZE,
)
bow_model

BoWClassifier(
  (fc1): Linear(in_features=7851, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=50, bias=True)
  (fc3): Linear(in_features=50, out_features=2, bias=True)
)

In [53]:
for param in bow_model.parameters():
    print(param.size())

torch.Size([100, 7851])
torch.Size([100])
torch.Size([50, 100])
torch.Size([50])
torch.Size([2, 50])
torch.Size([2])


## Train the model




Layer 1 affine: $$x_1 = W_1 X + b_1$$
Layer 1 activation: $$h_1 = Relu(x_1)$$
Layer 2 affine: $$x_2 = W_2 h_1 + b_2$$
output: $$p = softmax(x_2)$$
Loss: $$L = −(ylog(p)+(1−y)log(1−p))$$
Gradient: 
$$\frac{\partial }{\partial W_1}L(W_1, b_1, W_2, b_2) = \frac{\partial L}{\partial p}\frac{\partial p}{\partial x_2}\frac{\partial x_2}{\partial h_1}\frac{\partial h_1}{\partial x_1}\frac{\partial x_1}{\partial W_1}$$

Parameter update:
$$W_1 = W_1 - \alpha \frac{\partial L}{\partial W_1}$$

In [45]:
LEARNING_RATE = 1e-3

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=LEARNING_RATE,
)
scheduler = CosineAnnealingLR(optimizer, 1)


def train_epoch(model, optimizer, train_loader, input_type='bow'):
    model.train()
    total_loss, total = 0, 0
    for seq, bow, tfidf, target, text in train_loader:
        inputs = bow
        if input_type == 'tfidf':
            inputs = tfidf
        
        # Reset gradient
        optimizer.zero_grad()
        
        # Forward pass
        output = model(inputs)
        
        # Compute loss
        loss = criterion(output, target)
        
        # Perform gradient descent, backwards pass
        loss.backward()

        # Take a step in the right direction
        optimizer.step()
        scheduler.step()

        # Record metrics
        total_loss += loss.item()
        total += len(target)

    return total_loss / total


def validate_epoch(model, valid_loader, input_type='bow'):
    model.eval()
    total_loss, total = 0, 0
    with torch.no_grad():
        for seq, bow, tfidf, target, text in valid_loader:
            inputs = bow
            if input_type == 'tfidf':
                inputs = tfidf

            # Forward pass
            output = model(inputs)

            # Calculate how wrong the model is
            loss = criterion(output, target)

            # Record metrics
            total_loss += loss.item()
            total += len(target)

    return total_loss / total

#### BoW

In [54]:
n_epochs = 0
train_losses, valid_losses = [], []
while True:
    train_loss = train_epoch(bow_model, optimizer, train_loader, input_type='bow')
    valid_loss = validate_epoch(bow_model, valid_loader, input_type='bow')
    
    tqdm.write(
        f'epoch #{n_epochs + 1:3d}\ttrain_loss: {train_loss:.2e}\tvalid_loss: {valid_loss:.2e}\n',
    )
    
    # Early stopping if the current valid_loss is greater than the last three valid losses
    if len(valid_losses) > 2 and all(valid_loss >= loss for loss in valid_losses[-3:]):
        print('Stopping early')
        break
    
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    
    n_epochs += 1



epoch #  1	train_loss: 1.31e-03	valid_loss: 1.34e-03

epoch #  2	train_loss: 1.31e-03	valid_loss: 1.34e-03

epoch #  3	train_loss: 1.31e-03	valid_loss: 1.34e-03

epoch #  4	train_loss: 1.31e-03	valid_loss: 1.34e-03

Stopping early


#### TF-IDF

In [48]:
HIDDEN1 = 100
HIDDEN2 = 50

tfidf_model = BoWClassifier(
    vocab_size=len(dataset.token2idx),
    hidden1=HIDDEN1,
    hidden2=HIDDEN2,
    num_labels=2,
    device=device,
    batch_size=BATCH_SIZE,
)
tfidf_model

BoWClassifier(
  (fc1): Linear(in_features=7851, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=50, bias=True)
  (fc3): Linear(in_features=50, out_features=2, bias=True)
)

In [49]:
n_epochs = 0
train_losses, valid_losses = [], []
while True:
    train_loss = train_epoch(tfidf_model, optimizer, train_loader, input_type='tdidf')
    valid_loss = validate_epoch(tfidf_model, valid_loader, input_type='tdidf')
    
    tqdm.write(
        f'epoch #{n_epochs + 1:3d}\ttrain_loss: {train_loss:.2e}\tvalid_loss: {valid_loss:.2e}\n',
    )
    
    # Early stopping if the current valid_loss is greater than the last three valid losses
    if len(valid_losses) > 2 and all(valid_loss >= loss for loss in valid_losses[-3:]):
        print('Stopping early')
        break
    
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    
    n_epochs += 1



epoch #  1	train_loss: 1.32e-03	valid_loss: 1.34e-03

epoch #  2	train_loss: 1.32e-03	valid_loss: 1.34e-03

epoch #  3	train_loss: 1.32e-03	valid_loss: 1.34e-03

epoch #  4	train_loss: 1.32e-03	valid_loss: 1.34e-03

Stopping early


## Predictions

In [57]:
bow_model.eval()
test_accuracy, n_examples = 0, 0
y_true, y_pred = [], []
input_type = 'tfidf'

with torch.no_grad():
    for seq, bow, tfidf, target, text in test_loader:
        inputs = bow
        probs = bow_model(inputs)
        if input_type == 'tdidf':
            inputs = tfidf
            probs = tfidf_model(inputs)
        
        probs = probs.detach().cpu().numpy()
        predictions = np.argmax(probs, axis=1)
        target = target.cpu().numpy()
        
        y_true.extend(predictions)
        y_pred.extend(target)
        
print(classification_report(y_true, y_pred))



              precision    recall  f1-score   support

           0       0.93      0.49      0.64      2893
           1       0.07      0.49      0.12       214

   micro avg       0.49      0.49      0.49      3107
   macro avg       0.50      0.49      0.38      3107
weighted avg       0.87      0.49      0.61      3107



In [58]:
flatten = lambda x: [sublst for lst in x for sublst in lst]
seq_lst, bow_lst, tfidf_lst, target_lst, text_lst = zip(*test_loader)
seq_lst, bow_lst, tfidf_lst, target_lst, text_lst = map(flatten, [seq_lst, bow_lst, tfidf_lst, target_lst, text_lst])
test_examples = list(zip(seq_lst, bow_lst, tfidf_lst, target_lst, text_lst))

input_type = 'bow'

def print_random_prediction(n=10):
    to_emoji = lambda x: '😄' if x else '😡'
    model.eval()
    rows = []
    for i in range(n):
        with torch.no_grad():
            seq, bow, tdidf, target, text = random.choice(test_examples)
            target = target.item()
            
            inputs = bow
            probs = bow_model([inputs])
            if input_type == 'tdidf':
                inputs = tfidf
                probs = tfidf_model([inputs])
            
            probs = probs.detach().cpu().numpy()
            prediction = np.argmax(probs, axis=1)[0]

            predicted = to_emoji(prediction)
            actual = to_emoji(target)
            
            row = f"""
            <tr>
            <td>{i+1}&nbsp;</td>
            <td>{text}&nbsp;</td>
            <td>{predicted}&nbsp;</td>
            <td>{actual}&nbsp;</td>
            </tr>
            """
            rows.append(row)
            
    rows_joined = '\n'.join(rows)
    table = f"""
<table>
<tbody>
<tr>
<td><b>Number</b>&nbsp;</td>
<td><b>Review</b>&nbsp;</td>
<td><b>Predicted</b>&nbsp;</td>
<td><b>Actual</b>&nbsp;</td>
</tr>
{rows_joined}
</tbody>
</table>
"""
    display(HTML(table))

In [59]:
print_random_prediction(n=5)



0,1,2,3
Number,Review,Predicted,Actual
1,"great historical movie, will not allow a viewer to leave once you begin to watch. View is presented differently than displayed by most school books on this subject. My only fault for this movie is it was photographed in black and white; wished it had been in color ... wow !",😡,😄
2,"Zeoy101?? Really, this has to be one of the most stupidest attempts to get people in my age group's attention. It's about some preppy girl named Zeoy and her friends that attends boarding school. BORING!!! All she ever does is whine and complain and acts like a spoiled idiot. I remember this show came out in 2005, I was 13 going on 14, and even then I thought it was pointless. The only episode I EVER liked was when the boys hid a camera in the girls dorm. THAT'S IT. Anyway, I just don't understand why Nickel-Oh my bad-Nick feels the need to syndicate this sorry poor excuse for ""entertainment"". serious this decade is becoming a joke every year and it gets worst and worst. What's with this generation?? Anyway, R.I.P. Nickelodeon 1979-1998?/2005?",😡,😡
3,"As a cinema fan White Noise was an utter disappointment, as a filmmaker the cinematography was pretty good, nicely lit, good camera work, reasonable direction. But as a film it just seamed as predictable as all the other 'so called' horror movies that the market has recently been flooded with. Although it did have a little bit of the 'chill factor' the whole concept of the E.V.O (Electronic Voice Phenomena) did'not seem believable. This movie did not explain the reasonings for certain occurrences but went ahead with them. The acting was far from mind blowing the main character portrayed no emotion, like many recent thriller/horror movies. Definitely not a movie I will be buying on DVD and would not recommend anyone rushes out to see it.",😡,😡
4,"This film is about a woman falling in love with a friend of her boyfriend. From then on, she has to divide her time for the two boyfriends: Jack during the day and Joseph during the night. This film feels like as if it was made with minimum budget. The majority of the film is set in a flat with minimal furniture. There are only three main actors, all the other actors listed in the credits make only momentary appearances. The wardrobe designer doesn't seem to have much to do, as the actors wear very down to earth clothes, and actually most of the time they are naked anyway. The film is very dialog heavy, which should have made up for the shortcomings described above. However, the dialogs sound too composed and awkward. In the beginning of the film, most of the dialog is a person saying a very long sentence, and then the person says 'Me too'. After the frenzy of agreement, the dialog descends into a mess of disjointed and confused word salad. The only merit of this film I can think of is that it serves as a feminist outlet which conveys that it is not just men who can be unfaithful. This film is a great disappointment.",😡,😡
5,"The film starts with a voice over telling the audience where they are, and who the characters are. And that is the moment i started to dislike the movie. With all the endless possibilities any film director have in hand, i really find it a very easy and cheap solution to express the situation with a voice over telling everything. I actually believe voice overs are betrayals to the film making concept. I hate to hear from a voice over saying where we are, which date we are at, and especially what the characters feel and think. I believe that a director has to find a visual way to transmit the feelings and the thoughts of the characters to the audience. But after the bad influencing intro, a very striking movie begins and keeps going for a fairly long enough time. The lives of a middle class family and all the members individually are depicted in a perfect realistic way. I think the director has a talent for capturing real life situations. For example, a father who has to make his private calls from the bathroom might seem abnormal at first, but life itself leads us some situations which might seem abnormal but also very normal as well. I think the director is a very good observer about real life. But that is it. After a while the realism in the movie begins to sacrifice the story-telling. I really felt like I'm having a big headache because of the non-stop talking characters. It was as if the actors and actresses were given the subject and were allowed to improvise the dialogs. It is realistic really, but characters always asking ""really, is that so"" etc. to each other, or characters saying ""no"" or ""are you listening to me,"" ten times when saying it only once is just enough causes me to have a headache. I also think the play practicing and book reading scenes are more then they should be. I understand that the play and the book in the movie are very much related to the plot, but i think the director has missed the point where he should stop showing these scenes.",😡,😡
