# Solutions - Sentiment Analysis - Part 1

## Packages and Imports

First we'll install any packages needed for Colab or our local environments, then import them. We're using the `idlmam` library from the textbook author to help train the networks and to extract the hidden state vectors from the trained networks.

In [1]:
# Colab may also need torchtext and nltk installed; uncomment the lines below if the imports fail
# !pip install torchtext
# !pip install nltk
!pip install portalocker

Collecting portalocker
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Installing collected packages: portalocker
Successfully installed portalocker-2.8.2


In [3]:
# Imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import math
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import IMDB
import unicodedata
import re
import string
import nltk
from tqdm.autonotebook import tqdm
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import pandas as pd
from sklearn.metrics import accuracy_score
import time
# from utils import train_model
from idlmam import set_seed, LastTimeStep, train_network, Flatten, weight_reset, View, LambdaLayer

In [4]:
# configuration

# Set the theme to a dark grid with a specific palette
sns.set_theme(style="darkgrid", palette="pastel")

# Set the context for seaborn plots
sns.set_context("notebook")

torch.backends.cudnn.deterministic=True
set_seed(42)

# RNNs are SLOW on Apple Silicon, defaults to CPU for Mac
# Will choose a GPU in Colab if you're on a GPU Runtime

def get_pytorch_device(use_MPS=False):
    if torch.cuda.is_available():
        return torch.device('cuda')
    elif torch.backends.mps.is_available() and use_MPS:
        return torch.device('mps')  # MPS is available on Apple Silicon Macs
    else:
        return torch.device('cpu')

device = get_pytorch_device()
print(device)

cuda


### Instantiate Datasets

The `root=./data` argument must be removed from the first line of the textbook code; otherwise only half the training data gets loaded. We also split the 25,000 training examples into a training set (20,000) and a validation set (5,000).

In [5]:
train_iter, test_iter = IMDB(split=('train', 'test'))
train_dataset = list(train_iter)
test_dataset = list(test_iter)

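# set_seed is called for its side effect of seeding the global RNG,
# which random_split then uses to shuffle the indices
set_seed(42)
train_dataset, valid_dataset = torch.utils.data.random_split(train_dataset, [20000, 5000])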
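
The split above depends on the state of the global random number generator. As an alternative, `random_split` accepts an explicit `generator` argument, which keeps the split reproducible regardless of what else has consumed random numbers. The following is a minimal sketch on a toy list standing in for the IMDB training data; `toy_dataset` and `split_toy` are made-up names for illustration, and using a dedicated generator would produce a different split from the one used in this notebook.

```python
import torch
from torch.utils.data import random_split

# Toy stand-in for the IMDB training list: (label, text) pairs (hypothetical data)
toy_dataset = [(i % 2 + 1, f"review {i}") for i in range(25)]

def split_toy(seed):
    # A dedicated generator keeps the split independent of the global RNG state
    generator = torch.Generator().manual_seed(seed)
    train_part, valid_part = random_split(toy_dataset, [20, 5], generator=generator)
    return train_part, valid_part

train_part, valid_part = split_toy(42)
train_again, valid_again = split_toy(42)

print(len(train_part), len(valid_part))
# The same seed yields the same shuffled indices
print(train_part.indices == train_again.indices)
```

The two subsets never overlap, so no validation example leaks into training.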

## Preprocess Text and Build Vocabulary

It's common to clean text data before training a model, but the exact process varies from model to model.
Later in this notebook we'll compare the sentiment analysis results from an LSTM trained with and without cleaned data. In this section we'll demonstrate the cleaning process.

Some common things to do with text data:
* remove stop words: words like 'a' and 'the' that usually don't change the meaning of the text
* convert unicode characters to ascii
* remove punctuation and html tags
* convert all letters to lower case

In the next few cells we'll show how this can be done.

First we'll download a commonly used list of English stopwords from NLTK:


In [6]:
# Download NLTK stop words
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) # we're going to be lazy and use this globally

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
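
One caveat worth keeping in mind for sentiment analysis: the NLTK English list includes negation words such as 'not' and 'no', so removing stop words can erase the difference between a positive and a negative sentence. The sketch below illustrates this with a tiny hand-written stop list rather than the NLTK one; `toy_stop_words` and `remove_stop_words` are hypothetical names used only for this example.

```python
# Hypothetical miniature stop-word list; NLTK's English list is much longer
toy_stop_words = {"the", "was", "not", "a", "and", "it", "i"}

def remove_stop_words(text, stop_words):
    # Lower-case, split on whitespace, and drop tokens found in the stop list
    return [token for token in text.lower().split() if token not in stop_words]

negative = remove_stop_words("The movie was not good", toy_stop_words)
positive = remove_stop_words("The movie was good", toy_stop_words)

print(negative)  # ['movie', 'good']
print(positive)  # ['movie', 'good']
```

Both sentences reduce to the same tokens, which is one reason to compare models trained with and without cleaned data.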


Now we'll define two functions to convert unicode to ascii and to preprocess the text:

In [7]:
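def unicode_to_ascii(s):
    # Decompose accented characters (NFD), drop the combining marks ('Mn'),
    # and keep only ASCII letters and spaces; digits and punctuation are removed too
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and (c in string.ascii_letters or c == ' ')
    )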

def preprocess_text(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)
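    # Lower-case, then normalize Unicode to ASCII letters and spaces
    # (this step also strips digits and punctuation)
    text = unicode_to_ascii(text.lower())
    # Safety net: replace anything other than a-z, spaces, and basic punctuation;
    # after the step above only letters and spaces should remain
    text = re.sub(r'[^a-z .,?!]+', ' ', text)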
    # Tokenize text
    #tokens = tokenizer(text)
    tokens = text.split()
    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]
    return tokens
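
To see why `unicode_to_ascii` normalizes to NFD before filtering, it helps to look at what the decomposition does to a single accented word. This is a small standalone sketch using only the standard library; the word "café" is an arbitrary example, not taken from the dataset.

```python
import unicodedata

word = "caf\u00e9"  # "café" with a single precomposed "é" character
decomposed = unicodedata.normalize("NFD", word)

# NFD splits "é" into a base "e" plus a combining acute accent (category "Mn")
print(len(word), len(decomposed))  # 4 5
print([unicodedata.category(c) for c in decomposed])

# Dropping the combining marks leaves the plain ASCII base letters
stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
print(stripped)  # cafe
```

Without the decomposition step, the precomposed "é" would fail the ASCII-letter check and be dropped entirely, turning "café" into "caf".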

Here is the first review in the dataset:

In [8]:
first_review = train_dataset[0][1]
print(f'The number of characters in the first review is {len(first_review)} \n')
print(f'Here is the first review: \n\n {first_review}')

The number of characters in the first review is 1079 

Here is the first review: 

 I saW this film while at Birmingham Southern College in 1975, when it was shown in combination with the Red Balloon. Both films are similar in their dream-like quality. The bulk of the film entails a fish swimming happily in his bowl while his new owner, a little boy, is away at school. A cat enters the room where the fish and his bowl are, and begins to warily stalk his "prey." The boy begins his walk home from school, and the viewer wonders whether he will arrive in time to save his fish friend. The fish becomes agitated by the cat's presence, and finally jumps out of the bowl! The cat quickly walks over to the fish, gently picks him up with his paws, and returns him to his bowl. The boy returns happily to his fish, none the wiser.<br /><br />The ending is amazing in both its irony and its technical complexity. It is hard to imagine how the director could've pulled the technical feat back in 1959 -- i

In [9]:
# Tokenizers break strings like "this is a string" into lists of tokens like ['this', 'is', 'a', 'string'].
# get_tokenizer was imported above; note that preprocess_text splits on whitespace and does not use this tokenizer.
tokenizer = get_tokenizer('basic_english')  # the default English-style tokenizer is sufficient
first_review_preproc = preprocess_text(first_review)
# Join the tokens back into a single string for display
first_review_preproc = ' '.join(first_review_preproc)
print(f'The number of characters in the first preprocessed review is {len(first_review_preproc)} \n')
print(f'Here is the first preprocessed review: \n\n {first_review_preproc}')

The number of characters in the first preprocessed review is 615 

Here is the first preprocessed review: 

 saw film birmingham southern college shown combination red balloon films similar dreamlike quality bulk film entails fish swimming happily bowl new owner little boy away school cat enters room fish bowl begins warily stalk prey boy begins walk home school viewer wonders whether arrive time save fish friend fish becomes agitated cats presence finally jumps bowl cat quickly walks fish gently picks paws returns bowl boy returns happily fish none wiser ending amazing irony technical complexity hard imagine director couldve pulled technical feat back seems trick find watch wont disappointed find let know get copy
