**Initialization**
* I use these 3 lines of code on top of my each Notebooks because it will help to prevent any problems while reloading and reworking on a same Project or Problem. And the third line of code helps to make visualization within the Notebook.

In [1]:
#@ Initialization:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Downloading the Libraries and Dependencies**
* I have downloaded all the Libraries and Dependencies required for this Project in one particular cell.

In [2]:
#@ Downloading the Libraries and Dependencies:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import re, json, string, os
import collections

from argparse import Namespace
from IPython.display import display 
from collections import Counter
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook

**Getting the Data**
* I have used Google Colab for this Project so the process of downloading and reading the Data might be different in other platforms. I have used [**Yelp Reviews Dataset**](https://www.kaggle.com/yelp-dataset/yelp-dataset) for this Project. In 2015, Yelp held a contest to predict the Rating of the Restaurants given it's Reviews. Zhang, Zhao, and Lecun simplified the Dataset by converting the Ratings into Sentiments viz. Positive Sentiment for 3 to 4 star Ratings and Negative Sentiment for 1 to 2 star Ratings. The Dataset is splitted into 560,000 Training Samples and 38,000 Testing Samples. 

In [3]:
#@ Getting the Dataset:
args = Namespace(
    raw_train_dataset = "/content/drive/My Drive/Colab Notebooks/YELP Dataset/raw_train.csv",
    raw_test_dataset = "/content/drive/My Drive/Colab Notebooks/YELP Dataset/raw_test.csv",
    proportion_subset_of_train = 0.1,
    train_proportion = 0.7,
    val_proportion = 0.15,
    test_proportion = 0.15,   
    output_munged = "/content/drive/My Drive/Colab Notebooks/YELP Dataset/reviews_with_splits_lite.csv",
    seed = 1337
)

#@ Reading the Raw Dataset:
train_reviews = pd.read_csv(args.raw_train_dataset, header=None, names=["rating", "review"])
train_reviews = train_reviews[~pd.isnull(train_reviews.review)]
test_reviews = pd.read_csv(args.raw_test_dataset, header=None, names=["rating", "review"])
test_reviews = test_reviews[~pd.isnull(test_reviews.review)]

#@ Inspecting the DataFrame:
display(train_reviews.head())
print(" ")
display(test_reviews.head())

Unnamed: 0,rating,review
0,1,"Unfortunately, the frustration of being Dr. Go..."
1,2,Been going to Dr. Goldberg for over 10 years. ...
2,1,I don't know what Dr. Goldberg was like before...
3,1,I'm writing this review to give you a heads up...
4,2,All the food is great here. But the best thing...


 


Unnamed: 0,rating,review
0,1,Ordered a large Mango-Pineapple smoothie. Stay...
1,2,Quite a surprise! \n\nMy wife and I loved thi...
2,1,"First I will say, this is a nice atmosphere an..."
3,2,I was overall pretty impressed by this hotel. ...
4,1,Video link at bottom review. Worst service I h...


**Processing the Dataset**

In [4]:
#@ Creating the Subset of the Reviews Dataset:
by_rating = collections.defaultdict(list)                     # Collections stores the collection of Data.
for _, row in train_reviews.iterrows():
  by_rating[row.rating].append(row.to_dict())

review_subset = []
for _, item_list in sorted(by_rating.items()):
  n_total = len(item_list)
  n_subset = int(args.proportion_subset_of_train * n_total)
  review_subset.extend(item_list[:n_subset])

#@ Creating the DataFrame:
review_subset = pd.DataFrame(review_subset)

#@ Inspecting the DataFrame:
review_subset.head()

Unnamed: 0,rating,review
0,1,"Unfortunately, the frustration of being Dr. Go..."
1,1,I don't know what Dr. Goldberg was like before...
2,1,I'm writing this review to give you a heads up...
3,1,Wing sauce is like water. Pretty much a lot of...
4,1,Owning a driving range inside the city limits ...


In [5]:
#@ Performing the Basic EDA:
display(train_reviews.rating.value_counts())                   # Inspecting the Number of Ratings.
print(" ")
display(review_subset.rating.value_counts())                   # Inspecting the Number of Ratings.
print(" ")
display(set(review_subset.rating))                             # Unique Ratings in the DataFrame.

2    280000
1    280000
Name: rating, dtype: int64

 


2    28000
1    28000
Name: rating, dtype: int64

 


{1, 2}

**Processing the DataFrame**
* Creating Training, Validation and Testing Splits in the DataFrame.

In [6]:
#@ Splitting the Subset by Rating to create New Training, Validation and Testing Splits:
by_rating = collections.defaultdict(list)
for _, row in review_subset.iterrows():
  by_rating[row.rating].append(row.to_dict())

#@ Creating the Split Data:
final_list = []
np.random.seed(args.seed)
for _, item_list in sorted(by_rating.items()):
  np.random.shuffle(item_list)                                     # Shuffling the Data randomly.
  n_total = len(item_list)
  n_train = int(args.train_proportion * n_total)
  n_val = int(args.val_proportion * n_total)
  n_test = int(args.test_proportion * n_total)
  #@ Giving the Data point a split Attribute:
  for item in item_list[:n_train]:
    item["split"] = "train"
  for item in item_list[n_train:n_train+n_val]:
    item["split"] = "val"
  for item in item_list[n_train+n_val:n_train+n_val+n_test]:
    item["split"] = "test" 
  #@ Adding to the Final List:
  final_list.extend(item_list)

#@ Creating the Final DataFrame:
final_reviews = pd.DataFrame(final_list)

#@ Inspecting the Final Result:
display(final_reviews.head())                                     # Inspecting the DataFrame.
print(" ")
display(final_reviews.split.value_counts())                       # Inspecting the Training, Validation and Testing Data.

Unnamed: 0,rating,review,split
0,1,Terrible place to work for I just heard a stor...,train
1,1,"3 hours, 15 minutes-- total time for an extrem...",train
2,1,My less than stellar review is for service. ...,train
3,1,I'm granting one star because there's no way t...,train
4,1,The food here is mediocre at best. I went afte...,train


 


train    39200
val       8400
test      8400
Name: split, dtype: int64

**Cleaning the Data**
* I will clean the Data minimally by adding whitespace around Punctuation symbols and Removing Extraneous symbols which are not Punctuations for all the Splits.

In [7]:
#@ Cleaning the Data:
def preprocess_text(text):
  text = text.lower()                                      # Converting into Lowercase.
  text = re.sub(r"([.,!?])", r" \1 ", text)
  text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
  return text

#@ Processing the Review Column:
final_reviews["review"] = final_reviews.review.apply(preprocess_text)

#@ Processing the Rating Column:
final_reviews["rating"] = final_reviews.rating.apply({1:"negative", 2:"positive"}.get)

#@ Inspecting the DataFrame:
final_reviews.head(7)

Unnamed: 0,rating,review,split
0,negative,terrible place to work for i just heard a stor...,train
1,negative,"hours , minutes total time for an extremely s...",train
2,negative,my less than stellar review is for service . w...,train
3,negative,i m granting one star because there s no way t...,train
4,negative,the food here is mediocre at best . i went aft...,train
5,negative,n n nwe looked at our entertainment book for ...,train
6,negative,i had an appointment that was made months in a...,train


In [8]:
#@ Preparing the Data:
final_reviews.to_csv(args.output_munged, index=False)

**PyTorch Dataset Class**
* PyTorch provides an abstraction for the Dataset by providing a Dataset Class. The Dataset Class is an abstract Operator. When using PyTorch with a new Dataset it is necessary to sub class the Dataset Class and Implement the __getitem__ and __len__ methods.

In [9]:
#@ PyTorch Dataset Class:
class ReviewDataset(Dataset):
  def __init__(self, review_df, vectorizer):
    """
    Args: review_df(pandas.DataFrame): The dataset.
        : vectorizer(ReviewVectorizer): Vector instantiated from dataset.
    """
    self.review_df = review_df
    self._vectorizer = vectorizer

    self.train_df = self.review_df[self.review_df.split == "train"]
    self.train_size = len(self.train_df)

    self.val_df = self.review_df[self.review_df.split == "val"]
    self.validation_size = len(self.val_df)

    self.test_df = self.review_df[self.review_df.split == "test"]
    self.test_size = len(self.test_df)

    self._lookup_dict = {"train": (self.train_df, self.train_size),
                           "val": (self.val_df, self.validation_size),
                           "test": (self.test_df, self.test_size)}
    self.set_split("train")
  
  @classmethod
  def load_dataset_and_make_vectorizer(cls, review_csv):
    """Load dataset and make new vectorizer from scratch.
    Args: review_csv: Location of the dataset.
    Returns: An instance of ReviewDataset.
    """
    review_df = pd.read_csv(review_csv)
    train_review_df = review_df[review_df.split == "train"]
    return cls(review_df, ReviewVectorizer.from_dataframe(train_review_df))
  
  @classmethod
  def load_dataset_and_load_vectorizer(cls, review_csv, vectorizer_filepath):
    """Load dataset and the corresponding vectorizer.
    Args: review_csv: Location of the dataset.
        : vectorizer_filepath: Location of the saved vectorizer.
    Returns: An instance of the ReviewDataset.
    """
    review_df = pd.read_csv(review_csv)
    vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
    return cls(review_df, vectorizer)

  @staticmethod
  def load_vectorizer_only(vectorizer_filepath):
    """A static method for loading the vectorizer from file.
    Args: vectorizer_filepath: Location of serialized vectorizer.
    Returns: An instance of ReviewVectorizer.
    """
    with open(vectorizer_filepath, "w") as fp:
      json.dump(self._vectorizer.to_serializable(), fp)
  
  def save_vectorizer(self, vectorizer_filepath):
    with open(vectorizer_filepath, "w") as fp:
      json.dump(self._vectorizer.to_serializable(), fp)
  
  def get_vectorizer(self):
    """Returns the Vectorizer.
    """
    return self._vectorizer
  
  def set_split(self, split="train"):
    """Splits the dataset using a column in the DataFrame.
    Args: split(str): One of "train", "val" or "test"
    """
    self._target_split = split
    self._target_df, self._target_size = self._lookup_dict[split]
  
  def __len__(self):
    return self._target_size
  
  def __getitem__(self, index):
    """Primary entry point of PyTorch Datasets.
    Args: index: Index of the Datapoint.
    Returns: A dictionary holding the Data point features and labels.
    """
    row = self._target_df.iloc[index]
    review_vector = self._vectorizer.vectorize(row.review)
    rating_index = self._vectorizer.rating_vocab.lookup_token(row.rating)
    return {"x_data": review_vector,
            "y_target": rating_index}
  
  def get_num_batches(self, batch_size):
    """Given a batch size, return the number of batches in the Dataset.
    Args: batch_size(int)
    Returns: Number of batches in the Dataset.
    """
    return len(self) // batch_size

**The Vocabulary Class**
* The Vocabulary Class not only manages the Bijection i.e Allowing user to add new Tokens and have the Index auto increment but also handles the special token called UNK which stands for Unknown. By using the UNK Token, It will be easy to handle Tokens at Test time that were never seen in Training Instance.

In [10]:
#@ The Vocabulary Class:
class Vocabulary(object):
  """Class to process Text and and extract Vocabulary for mapping.
  """
  def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
    """
    Args: token_to_idx(dict): Pre existing map of Tokens to Index.
        : add_unk(bool): A flag indicating whether to add UNK Token.
        : unk_token(string): The UNK Token to add in Vocabulary.
    """
    if token_to_idx is None:
      token_to_idx = {}
    self._token_to_idx = token_to_idx
    self._idx_to_token = {idx:token for token,idx in self._token_to_idx.items()}
    self._add_unk = add_unk
    self._unk_token = unk_token
    self.unk_index = -1
    if add_unk:
      self.unk_index = self.add_token(unk_token)
  
  def to_serializable(self):
    """Returns a dictionary that can be serialized.
    """
    return {"token_to_idx": self._token_to_idx,
            "add_unk": self._add_unk,
            "unk_token": self._unk_token}
  
  @classmethod
  def from_serializable(cls, contents):
    """Instantiate the Vocabulary from the serialized Dictionary.
    """
    return cls(**contents)
  
  def add_token(self, token):
    """Update the mapping dictionary based on the Tokens.
    Args: token: The item to add into the Vocabulary.
    Returns: index: Integer corresponding to the Token.
    """
    if token in self._token_to_idx:
      index = self._token_to_idx[token]
    else:
      index = len(self._token_to_idx)
      self._token_to_idx[token] = index
      self._idx_to_token[index] = token
    return index
  
  def add_many(self, tokens):
    """Add a list of Tokens into Vocabulary.
    Args: tokens(list): A list of string Tokens.
    Returns: indices(list): A list of indices correspoinding to the Tokens.
    """
    return [self.add_token(token) for token in tokens]
  
  def lookup_token(self, token):
    """Retrieve the Index associated with the Token.
    Args: token(str): The Token to lookup.
    Returns: index(int): The Index correspoinding to the Token.
    """
    if self.unk_index >= 0:
      return self._token_to_idx.get(token, self.unk_index)
    else:
      return self._token_to_idx[token]
  
  def lookup_index(self, index):
    """Return the Token associated with the Index.
    Args: index(int): The Index to lookup.
    Returns: token(str): The Token correspoinding to the Index.
    """
    if index not in self._idx_to_token:
      raise KeyError("the index (%d) is not in the Vocabulary" % index)
    return self._idx_to_token[index]
  
  def __str__(self):
    return "<Vocabulary(size=%d)>" % len(self)
  
  def __len__(self):
    return len(self._token_to_idx)

**Vectorizer Class**
* The second stage of going from a Text Dataset to a vectorized minibatch is to iterate through the Tokens of an Input Data Point and convert each Token to its Integer form. The result of this iteration should be a Vector. Because this Vector will be combined with Vectors from other Data points, there is Constraint that the Vectors produced by the Vectorizer should always have the same length.

In [11]:
#@ Vectorizer Class:
class ReviewVectorizer(object):
  """The Vectorizer coordinates the Vocabularies and puts them to use.
  """
  def __init__(self, review_vocab, rating_vocab):
    """
    Args: review_vocab: Maps words to Integers.
        : rating_vocab: Maps class labels to Integers.
    """
    self.review_vocab = review_vocab
    self.rating_vocab = rating_vocab

  def vectorize(self, review):
    """Create a collasped Onehot Vector for the review.
    Args: review: The review
    Returns: one_hot: The collapsed one hot Encoding.
    """
    one_hot = np.zeros(len(self.review_vocab), dtype=np.float32)
    for token in review.split(" "):
      if token not in string.punctuation:
        one_hot[self.review_vocab.lookup_token(token)] = 1
    return one_hot

  @classmethod
  def from_dataframe(cls, review_df, cutoff=25):
    """Instantiate the Vectorizer from DataFrame.
    Args: review_df(DataFrame): The review Dataset.
        :  cufoff(int): Parameter for frequency based Filtering.
    Returns: An instance of the ReviewVectorizer.
    """
    review_vocab = Vocabulary(add_unk=True)
    rating_vocab = Vocabulary(add_unk=False)
    #@ Adding Ratings:
    for rating in sorted(set(review_df.rating)):
      rating_vocab.add_token(rating)
    #@ Adding Topwords if count > provided count:
    word_counts = Counter()
    for review in review_df.review:
      for word in review.split(" "):
        if word not in string.punctuation:
          word_counts[word] += 1
    for word, count in word_counts.items():
      if count > cutoff:
        review_vocab.add_token(word)
    return cls(review_vocab, rating_vocab)

  @classmethod
  def from_serializable(cls, contents):
    """Instantiating the ReviewVectorizer from a serializable dictionary.
    Args: contents: The serializable dictionary.
    Returns: An instance of ReviewVectorizer Class.
    """
    review_vocab = Vocabulary.from_serializable(contents["review_vocab"])
    rating_vocab = Vocabulary.from_serializable(contents["rating_vocab"])
    return cls(review_vocab=review_vocab, rating_vocab=rating_vocab)
  
  def to_serializable(self):
    """Create serializable dictionary for Caching.
    Returns: contents(dict): The Serializable Dictionary.
    """
    return {"review_vocab": self.review_vocab.to_serializable(),
            "rating_vocab": self.rating_vocab.to_serializable()}

**DataLoader**
* The Final step of Text to Vectorized minibatch pipeline is to actually group the Vectorized Datapoints. Because grouping into mini batches is a viatal part of Training the Neural Networks, PyTorch provides a built in class called DataLoader for coordinating the Process.

In [12]:
#@ DataLoader:
def generate_batches(dataset, batch_size, shuffle=True, 
                     drop_last=True, device="gpu"):
  """A generator function which wraps the PyTorch DataLoader. 
  """
  dataloader = DataLoader(dataset=dataset, batch_size=batch_size, 
                          shuffle=shuffle, drop_last=drop_last)
  for data_dict in dataloader:
    out_data_dict = {}
    for name, tensor in data_dict.items():
      out_data_dict[name] = data_dict[name].to(device)
    yield out_data_dict

**A Perceptron Classifier**

In [13]:
#@ A Perceptron Classifier:
class ReviewClassifier(nn.Module):
  """A simple Perceptron based Classifier.
  """
  def __init__(self, num_features):
    """
    Args: num_features(init): The size of input feature vector.
    """
    super(ReviewClassifier, self).__init__()
    self.fc1 = nn.Linear(in_features=num_features,
                         out_features=1)
  
  def forward(self, x_in, apply_sigmoid=False):
    """Forward pass of the Classifier.
    Args: x_in(Tensor): An input data Tensor.
        : apply_sigmoid(bool): A flag for the sigmoid activation.
    """
    y_out = self.fc1(x_in).squeeze()
    if apply_sigmoid:
      y_out = torch.sigmoid(y_out)
    return y_out

**The Training Routine**
* The Training Routine is responsible for instantiating the Model, iterating over the Dataset, computing the output of the Model when the given data as Input, computing the Loss and updating the Model proportional to the Loss.

In [19]:
#@ Hyperparameters and Program Options:
args = Namespace(
    #@ Data and path information:
    frequency_cutoff=25,
    model_state_file="model.pth",
    review_csv="/content/drive/My Drive/Colab Notebooks/YELP Dataset/reviews_with_splits_lite.csv",
    save_dir="model_storage",
    vectorizer_file="vectorizer.json",
    batch_size=128,
    early_stopping_criteria=5, 
    learning_rate=0.001,
    num_epochs=100,
    seed=42,
    #@ Runtime Options:
    catch_keyword_interrupt=True,
    cuda=True,
    expand_filepaths_to_sava_dir=True,
    reload_from_files=False
)

if args.expand_filepaths_to_sava_dir:
  args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
  args.model_state_file = os.path.join(args.save_dir, args.model_state_file)
  print("Expanded Filepaths: ")
  print("\t{}".format(args.vectorizer_file))
  print("\t{}".format(args.model_state_file))

#@ Checking the CUDA:
if not torch.cuda.is_available():
  args.cuda=False
print("Using CUDA: {}".format(args.cuda))
args.device = torch.device("cuda" if args.cuda else "cpu")

#@ Seed for Reproducibility:
def set_seed_everywhere(seed, cuda):
  np.random.seed(seed)
  torch.manual_seed(seed)
  if cuda:
    torch.cuda.manual_seed_all(seed)
set_seed_everywhere(args.seed, args.cuda)

#@ Handle Dirs:
def handle_dirs(dirpath):
  if not os.path.exists(dirpath):
    os.makedirs(dirpath)
handle_dirs(args.save_dir)

Expanded Filepaths: 
	model_storage/vectorizer.json
	model_storage/model.pth
Using CUDA: True


In [15]:
#@ Instantiating the Dataset, Model, Loss, Optimizer and Training State:
def make_train_state(args):
  return {
      "stop_early": False,
      "early_stopping_step":0,
      "early_stopping_best_val":1e8,
      "learning_rate":args.learning_rate,
      "epoch_index":0,
      "train_loss":[],
      "train_acc":[],
      "val_loss":[],
      "val_acc":[],
      "test_loss":-1,
      "test_acc":-1,
      "model_filename":args.model_state_file
  }

def update_train_state(args, model, train_state):
  #@ Saving atleast One Model:
  if train_state["epoch_index"] == 0:
    torch.save(model.state_dict(), train_state["model_filename"])
    train_state["stop_early"] = False
  #@ Saving the Model if performance improved:
  elif train_state["epoch_index"] >= 1:
    loss_tm1, loss_t = train_state["val_loss"][-2:]
    #@ If Loss is worsened:
    if loss_t >= train_state["early_stopping_best_val"]:
      train_state["early_stopping_step"] += 1
    else:
      if loss_t < train_state["early_stopping_best_val"]:
        torch.save(model.state_dict(), train_state["model_filename"])
      train_state["early_stopping_step"] = 0
    train_state["stop_early"] = train_state["early_stopping_step"] >= args.early_stopping_criteria
  return train_state

def compute_accuracy(y_pred, y_target):
  y_pred_indices = (torch.sigmoid(y_pred)>0.5).long()
  n_correct = torch.eq(y_pred_indices, y_target).sum().item()
  return n_correct / len(y_pred_indices) * 100

In [16]:
#@ Initializing the Data Training:
if args.reload_from_files:
  print("Loading the Dataset and Vectorizer")
  dataset = ReviewDataset.load_dataset_and_load_vectorizer(args.review_csv, args.vectorizer_file)
else:
  print("Loading the Dataset and Creating the Vectorizer")
  dataset = ReviewDataset.load_dataset_and_make_vectorizer(args.review_csv)
  dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.get_vectorizer()
classifier = ReviewClassifier(num_features=len(vectorizer.review_vocab))

Loading the Dataset and Creating the Vectorizer


In [20]:
#@ The Training Loop:
classifier = classifier.to(args.device)
loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
schedular = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode="min", factor=0.5,
                                                 patience=1)

train_state = make_train_state(args)

epoch_bar = tqdm_notebook(
    desc="training routine",
    total=args.num_epochs,
    position=0
)

dataset.set_split("train")
train_bar = tqdm_notebook(
    desc="split=train",
    total=dataset.get_num_batches(args.batch_size),
    position=1,
    leave=True
)

dataset.set_split("val")
val_bar = tqdm_notebook(
    desc="split=val",
    total=dataset.get_num_batches(args.batch_size),
    position=1,
    leave=True
)

try:
  for epoch_index in range(args.num_epochs):
    train_state["epoch_index"] = epoch_index
    #@ Iterating over Training Dataset:
    dataset.set_split("train")
    batch_generator = generate_batches(dataset,
                                       batch_size=args.batch_size,
                                       device=args.device)
    running_loss = 0.0
    running_acc = 0.0
    classifier.train()

    for batch_index, batch_dict in enumerate(batch_generator):
      #@ Step1: Zero Gradients.
      optimizer.zero_grad()
      #@ Step2: Computing the Output:
      y_pred = classifier(x_in=batch_dict["x_data"].float())
      #@ Step3: Computing the Loss:
      loss = loss_func(y_pred, batch_dict["y_target"].float())
      loss_t = loss.item()
      running_loss += (loss_t - running_loss) / (batch_index + 1)
      #@ Step4: Using loss to produce gradients:
      loss.backward()
      #@ Step5: Using optimizer to take gradient step:
      optimizer.step()
      #@ Computing the accuracy:
      acc_t = compute_accuracy(y_pred, batch_dict["y_target"])
      running_acc += (acc_t - running_acc) / (batch_index + 1)
      #@ Updating:
      train_bar.set_postfix(loss=running_loss,
                            acc=running_acc,
                            epoch=epoch_index)
      train_bar.update()
    train_state["train_loss"].append(running_loss)
    train_state["train_acc"].append(running_acc)

    #@ Iterating over Validation Dataset:
    dataset.set_split("val")
    batch_generator = generate_batches(dataset,
                                       batch_size=args.batch_size,
                                       device=args.device)
    running_loss = 0.
    running_acc = 0.
    classifier.eval()
    for batch_index, batch_dict in enumerate(batch_generator):
      #@ Computing the Ouput:
      y_pred = classifier(x_in=batch_dict["x_data"].float())
      #@ Computing the Loss:
      loss = loss_func(y_pred, batch_dict["y_target"].float())
      loss_t = loss.item()
      running_loss += (loss_t - running_loss) / (batch_index + 1)
      #@ Computing the accuracy:
      acc_t = compute_accuracy(y_pred, batch_dict["y_target"])
      running_acc += (acc_t - running_acc) / (batch_index + 1)
      #@ Updating:
      val_bar.set_postfix(loss=running_loss,
                          acc=running_acc,
                          epoch=epoch_index)
      val_bar.update()
    train_state["val_loss"].append(running_loss)
    train_state["val_acc"].append(running_acc)

    train_state = update_train_state(args=args, model=classifier, 
                                     train_state=train_state)
    schedular.step(train_state["val_loss"][-1])

    train_bar.n = 0
    val_bar.n = 0
    epoch_bar.update()

    if train_state["stop_early"]:
      break
    train_bar.n = 0
    val_bar.n = 0
    epoch_bar.update()
except KeyboardInterrupt:
  print("Exiting Loop")

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  


HBox(children=(FloatProgress(value=0.0, description='training routine', style=ProgressStyle(description_width=…

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, description='split=train', max=306.0, style=ProgressStyle(description_…

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, description='split=val', max=65.0, style=ProgressStyle(description_wid…

**Evaluation, Inference and Inspection**
* After the Model is trained, the next steps are to either evaluate how it did against some held out portion of the Data, use it to do Inference on new Data and Inpects the Model weights to see what it has learned.

In [22]:
#@ Computing the Loss and Accuracy on the Test Dataset:
classifier.load_state_dict(torch.load(train_state["model_filename"]))
classifier = classifier.to(args.device)
dataset.set_split("test")
batch_generator = generate_batches(dataset,
                                   batch_size=args.batch_size,
                                   device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()

for batch_index, batch_dict in enumerate(batch_generator):
  #@ Computing the Output:
  y_pred = classifier(x_in = batch_dict["x_data"].float())
  #@ Computing the Loss:
  loss = loss_func(y_pred, batch_dict["y_target"].float())
  loss_t = loss.item()
  running_loss += (loss_t - running_loss) / (batch_index + 1)
  #@ Computing the Accuracy:
  acc_t = compute_accuracy(y_pred, batch_dict["y_target"])
  running_acc += (acc_t - running_acc) / (batch_index + 1)

train_state["test_loss"] = running_loss
train_state["test_acc"] = running_acc

print("Test loss: {:.3f}".format(train_state["test_loss"]))
print("Test accuracy: {:.3f}".format(train_state["test_acc"]))

Test loss: 0.215
Test accuracy: 91.983


In [None]:
#@ Predicting the Rating:
def predict_rating(review, classifier, vectorizer, decision_threshold=0.5):
  review = preprocess_text(review)
  vectorized_review = torch.tensor(vectorizer.vectorize(review))
  result = classifier(vectorized_review.view(1, -1))
  probability_value = F.sigmoid(result).item()
  index=1
  if probability_value < decision_threshold:
    index=0
  return vectorizer.rating_vocab.lookup_index(index)

#@ Evaluation of the Model:
review = "I am annoyed with the Movie."
classifier = classifier.to(args.device)
prediction = predict_rating(review, classifier, vectorizer, decision_threshold=0.5)
print("{}: {}".format(review, prediction))