# Text Classification with Word Embeddings and Dense Neural Network Models

Predicting the sentiment of a review from its text is a supervised machine learning problem; more specifically, it is a classification problem. In the following sections we will build an automated sentiment classification system. The major steps are as follows.

+ Prepare train and test datasets (optionally a validation dataset)
+ Pre-process and normalize text documents
+ Feature engineering
+ Model training
+ Model prediction and evaluation

An optional final step is to deploy the model on your own server or in the cloud. The following figure shows a detailed workflow for building a standard text classification system with supervised learning (classification) models.

<img src="https://github.com/dipanjanS/nlp_workshop_dhs18/blob/master/Unit%2012%20-%20Project%209%20-%20Sentiment%20Analysis%20-%20Supervised%20Learning/sentiment_classifier_workflow.png?raw=1">

In our scenario, the documents are movie reviews and the classes are the review sentiments, which can be either positive or negative, making this a binary classification problem. We will build the models using deep learning in the subsequent sections.

In [1]:
!nvidia-smi

Mon Jul 19 23:49:06 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')

Collecting contractions
  Downloading contractions-0.0.52-py2.py3-none-any.whl (7.2 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.2.tar.gz (321 kB)
     |████████████████████████████████| 321 kB 5.0 MB/s 
Collecting anyascii
  Downloading anyascii-0.2.0-py3-none-any.whl (283 kB)
     |████████████████████████████████| 283 kB 56.2 MB/s 
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
  Created wheel for pyahocorasick: filename=pyahocorasick-1.4.2-cp37-cp37m-linux_x86_64.whl size=85440 sha256=cc905bb10bc9753ff0a1e1795ac1bf0fd6600f1df21e2d709c8630e179bf53e2
  Stored in directory: /root/.cache/pip/wheels/25/19/a6/8f363d9939162782bb8439d886469756271abc01f76fbd790f
Successfully built pyahocorasick
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully install

True

## Load Dataset

Let's load our movie review dataset, which contains 50,000 reviews and their corresponding sentiment labels (positive or negative).

In [3]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [4]:
print(pd.__version__)

1.1.5


In [5]:

dataset = pd.read_csv('/content/drive/My Drive/NLP_DeepLearning_Course/Week1/movie_reviews.csv.bz2')
dataset.info()
frac = 1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [6]:
# downsample if needed
frac = 0.2
dataset = dataset.sample(frac=frac, random_state=253)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 23409 to 36285
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     10000 non-null  object
 1   sentiment  10000 non-null  object
dtypes: object(2)
memory usage: 234.4+ KB


In [7]:
dataset.head()

| | review | sentiment |
|---|---|---|
| 23409 | I normally wouldn't waste my time criticizing ... | negative |
| 38373 | Really, I can't believe that I spent $5 on thi... | negative |
| 42721 | i LOVED THIS MOVIEE well i loved the romance p... | positive |
| 34145 | Even though this was a disaster in the box off... | positive |
| 10674 | The danish movie "Slim Slam Slum" surprised me... | negative |


## Split Dataset into Train and Test sets

Since sentiment analysis is a supervised learning task, we split our movie review dataset into a train set for fitting the model and a test set for evaluating it on unseen reviews.

In [8]:
# build train and test datasets
reviews = dataset['review'].values
sentiments = dataset['sentiment'].values

train_reviews = reviews[:int(35000*frac)]
train_sentiments = sentiments[:int(35000*frac)]

test_reviews = reviews[int(35000*frac):]
test_sentiments = sentiments[int(35000*frac):]

In [9]:
print(train_reviews.shape)
print(train_sentiments.shape)
print(test_reviews.shape)
print(test_sentiments.shape)

(7000,)
(7000,)
(3000,)
(3000,)
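
The boundary `int(35000*frac)` used above works out to a 70/30 split of the sampled rows. Because `dataset.sample` already returns the rows in random order, slicing by position gives a random split. A minimal sketch of the arithmetic, with a hypothetical list of integers standing in for the reviews:

```python
# Hypothetical stand-in for the sampled reviews: 10000 items when frac = 0.2
frac = 0.2
total_rows = int(50000 * frac)
rows = list(range(total_rows))

# Same boundary expression as in the notebook cell
split_at = int(35000 * frac)

train_rows = rows[:split_at]
test_rows = rows[split_at:]

print(len(train_rows), len(test_rows))   # 7000 3000
print(len(train_rows) / total_rows)      # 0.7
```

Setting `frac = 1` would keep the same proportions on the full dataset (35,000 train and 15,000 test reviews).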


## Text Wrangling and Normalization

The movie reviews were collected by scraping web content. Scraped data typically contains HTML tags and other markup that can safely be discarded.

In this section we will also normalize our corpus by removing accented characters, newline characters and so on. Let's get started.

In [10]:
import contractions
from bs4 import BeautifulSoup
import numpy as np
import re
from tqdm import tqdm
import unicodedata


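def strip_html_tags(text):
  soup = BeautifulSoup(text, "html.parser")
  # drop embedded frames and scripts before extracting the visible text
  for s in soup(['iframe', 'script']):
    s.extract()
  stripped_text = soup.get_text()
  # collapse runs of carriage returns and newlines into a single newline
  stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
  return stripped_text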

def remove_accented_chars(text):
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  return text

def pre_process_corpus(docs):
  norm_docs = []
  for doc in tqdm(docs):
    # strip HTML tags
    doc = strip_html_tags(doc)
    # remove extra newlines
    doc = doc.translate(doc.maketrans("\n\t\r", "   "))
    # lower case
    doc = doc.lower()
    # remove accented characters
    doc = remove_accented_chars(doc)
    # fix contractions
    doc = contractions.fix(doc)
    # remove special characters\whitespaces
    # use regex to keep only letters, numbers and spaces
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    # use regex to remove extra spaces
    doc = re.sub(' +', ' ', doc)
    # remove trailing and leading spaces
    doc = doc.strip()  
    norm_docs.append(doc)
  
  return norm_docs

In [11]:
%%time

norm_train_reviews = pre_process_corpus(train_reviews)
norm_test_reviews = pre_process_corpus(test_reviews)

100%|██████████| 7000/7000 [00:03<00:00, 2286.84it/s]
100%|██████████| 3000/3000 [00:01<00:00, 2309.05it/s]

CPU times: user 4.35 s, sys: 46.5 ms, total: 4.4 s
Wall time: 4.37 s
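
To see what the normalization does to a single review, here is a simplified sketch that uses only the standard library. It is an approximation of the pipeline above, not a replacement: a regular expression stands in for BeautifulSoup, contractions are not expanded, and `normalize_text` and the sample string are made up for illustration.

```python
import re
import unicodedata

def normalize_text(doc):
    """Simplified, stdlib-only version of the cleaning steps (assumption:
    a regex stands in for BeautifulSoup and contractions are not expanded)."""
    doc = re.sub(r'<[^>]+>', ' ', doc)                    # drop HTML tags
    doc = doc.translate(str.maketrans("\n\t\r", "   "))   # newlines/tabs -> spaces
    doc = doc.lower()
    # decompose accented characters, then drop the non-ASCII combining marks
    doc = unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('ascii')
    doc = re.sub(r'[^a-z0-9\s]', '', doc)                 # keep letters, digits, whitespace
    doc = re.sub(' +', ' ', doc).strip()                  # collapse repeated spaces
    return doc

sample = "<br />The café scene was GREAT!!\nLoved it."
print(normalize_text(sample))   # the cafe scene was great loved it
```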





## Label Encode Class Labels

Our dataset has labels in the form of positive and negative classes. We convert them into a numeric form the model can consume by performing label encoding, which assigns a unique integer to each class. For example:
``negative: 0`` and ``positive: 1``
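
scikit-learn's ``LabelEncoder`` numbers the classes in sorted order, which is why ``negative`` becomes 0 and ``positive`` becomes 1. The idea can be sketched in plain Python; `fit_label_map` is a hypothetical helper written for this illustration, not part of any library.

```python
def fit_label_map(labels):
    """Map each distinct label to an integer, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

sentiments = ['negative', 'positive', 'positive', 'negative']
label_map = fit_label_map(sentiments)
encoded = [label_map[s] for s in sentiments]

# The inverse mapping turns predicted integers back into class names
inverse_map = {idx: label for label, idx in label_map.items()}
decoded = [inverse_map[i] for i in encoded]

print(label_map)   # {'negative': 0, 'positive': 1}
print(encoded)     # [0, 1, 1, 0]
```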

In [12]:
import gensim
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Activation, Dense
from sklearn.preprocessing import LabelEncoder

In [13]:
le = LabelEncoder()
# tokenize train reviews & encode train labels
tokenized_train = [nltk.word_tokenize(text)
                       for text in tqdm(norm_train_reviews)]
y_train = le.fit_transform(train_sentiments)
# tokenize test reviews & encode test labels
tokenized_test = [nltk.word_tokenize(text)
                       for text in tqdm(norm_test_reviews)]
y_test = le.transform(test_sentiments)

100%|██████████| 7000/7000 [00:05<00:00, 1374.09it/s]
100%|██████████| 3000/3000 [00:02<00:00, 1340.46it/s]


In [14]:
# print class label encoding map and encoded labels
print('Sentiment class label map:', dict(zip(le.classes_, le.transform(le.classes_))))
print('Sample test label transformation:\n'+'-'*35,
      '\nActual Labels:', test_sentiments[:33], '\nEncoded Labels:', y_test[:33])

Sentiment class label map: {'negative': 0, 'positive': 1}
Sample test label transformation:
----------------------------------- 
Actual Labels: ['negative' 'negative' 'positive' 'negative' 'positive' 'negative'
 'negative' 'negative' 'negative' 'negative' 'positive' 'negative'
 'negative' 'positive' 'positive' 'positive' 'positive' 'negative'
 'negative' 'negative' 'positive' 'negative' 'positive' 'negative'
 'negative' 'positive' 'negative' 'negative' 'positive' 'negative'
 'positive' 'negative' 'positive'] 
Encoded Labels: [0 0 1 0 1 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1]


## Feature Engineering based on Word2Vec Embeddings

In the previous notebook we discussed different word embedding techniques such as word2vec, GloVe and fastText. In this section we will leverage ``gensim`` to train a word2vec model on our training reviews.

In [15]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [16]:
%%time
# build word2vec model
# (gensim 3.x argument names; gensim 4.x renamed size -> vector_size and iter -> epochs)
w2v_num_features = 300
w2v_model = gensim.models.Word2Vec(tokenized_train, size=w2v_num_features, window=150,
                                   min_count=10, workers=4, iter=5)

2021-07-19 23:50:00,728 : INFO : collecting all words and their counts
2021-07-19 23:50:00,729 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-07-19 23:50:01,057 : INFO : collected 66824 word types from a corpus of 1609075 raw words and 7000 sentences
2021-07-19 23:50:01,058 : INFO : Loading a fresh vocabulary
2021-07-19 23:50:01,134 : INFO : effective_min_count=10 retains 9149 unique words (13% of original 66824, drops 57675)
2021-07-19 23:50:01,135 : INFO : effective_min_count=10 leaves 1493431 word corpus (92% of original 1609075, drops 115644)
2021-07-19 23:50:01,162 : INFO : deleting the raw counts dictionary of 66824 items
2021-07-19 23:50:01,165 : INFO : sample=0.001 downsamples 50 most-common words
2021-07-19 23:50:01,166 : INFO : downsampling leaves estimated 1076791 word corpus (72.1% of prior 1493431)
2021-07-19 23:50:01,189 : INFO : estimated required memory for 9149 words and 300 dimensions: 26532100 bytes
2021-07-19 23:50:01,190 : INFO : re

CPU times: user 2min 34s, sys: 261 ms, total: 2min 34s
Wall time: 40.9 s
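
The ``window`` argument controls how many neighbouring tokens on each side count as the context of a word. The sketch below only illustrates that idea; `context_pairs` is a hypothetical helper, and the actual gensim training loop differs in details such as downsampling of frequent words.

```python
def context_pairs(tokens, window):
    """Collect (center, context) pairs for every token that lies within
    `window` positions of the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ['the', 'movie', 'was', 'great']
print(context_pairs(tokens, window=1))
# [('the', 'movie'), ('movie', 'the'), ('movie', 'was'),
#  ('was', 'movie'), ('was', 'great'), ('great', 'was')]
```

With ``window=150`` and an average of roughly 230 tokens per review (1,609,075 words over 7,000 reviews in the log above), most of a review falls inside the context of each of its words.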


## Feature Engineering based on FastText Embeddings

Similar to the previous section, here we will train a FastText model on our corpus using ``gensim``.
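
What sets FastText apart from word2vec is that it also learns vectors for character n-grams, so a word is represented through its subword pieces. A minimal sketch of the n-gram extraction, assuming the usual `<` and `>` boundary markers; `char_ngrams` is a hypothetical helper and the n-gram range is chosen only for this example.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f'<{word}>'
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams('movie', min_n=3, max_n=4))
# ['<mo', 'mov', 'ovi', 'vie', 'ie>', '<mov', 'movi', 'ovie', 'vie>']
```

A misspelling such as "moviee" shares most of these n-grams with "movie", which is how subword models can relate the two.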

In [17]:
from gensim.models.fasttext import FastText

# Set values for various parameters
feature_size = 300    # Word vector dimensionality  
window_context = 50  # Context window size                                                                                    
min_word_count = 10   # Minimum word count                        
sample = 1e-3        # Downsample setting for frequent words
sg = 1               # skip-gram model

ft_model = FastText(tokenized_train, size=feature_size, 
                     window=window_context, min_count = min_word_count,
                     sg=sg, sample=sample, iter=2, workers=4)
ft_model

2021-07-19 23:50:41,648 : INFO : collecting all words and their counts
2021-07-19 23:50:41,649 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-07-19 23:50:41,913 : INFO : collected 66824 word types from a corpus of 1609075 raw words and 7000 sentences
2021-07-19 23:50:41,914 : INFO : Loading a fresh vocabulary
2021-07-19 23:50:41,955 : INFO : effective_min_count=10 retains 9149 unique words (13% of original 66824, drops 57675)
2021-07-19 23:50:41,956 : INFO : effective_min_count=10 leaves 1493431 word corpus (92% of original 1609075, drops 115644)
2021-07-19 23:50:41,987 : INFO : deleting the raw counts dictionary of 66824 items
2021-07-19 23:50:41,989 : INFO : sample=0.001 downsamples 50 most-common words
2021-07-19 23:50:41,990 : INFO : downsampling leaves estimated 1076791 word corpus (72.1% of prior 1493431)
2021-07-19 23:50:42,097 : INFO : estimated required memory for 9149 words, 60254 buckets and 300 dimensions: 100779444 bytes
2021-07-19 23:50:42

<gensim.models.fasttext.FastText at 0x7fa7e6b19dd0>

## Averaged Document Vectors

A sentence, in very simple terms, is a collection of words. By now we know how to transform words into vectors. But how do we transform sentences and documents into vectors?

A simple and naïve way is to average the vectors of all the words in a sentence to form a sentence vector. In this section we will use this technique to prepare our sentence/document vectors.

In [18]:
def averaged_doc_vectorizer(corpus, model, num_features):
    # vocabulary of the trained model (gensim 3.x attribute; gensim 4.x calls it index_to_key)
    vocabulary = set(model.wv.index2word)
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in tqdm(words):
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in tqdm(corpus)]
    return np.array(features)
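
To make the averaging concrete, here is a toy version with made-up 3-dimensional vectors in place of a trained model. `toy_vectors` and `average_vector` are hypothetical names used only in this sketch; as in the function above, words outside the vocabulary are skipped.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, standing in for a trained model
toy_vectors = {
    'great': np.array([1.0, 0.0, 2.0]),
    'movie': np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, num_features):
    """Mean of the vectors of in-vocabulary tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(num_features)
    return np.mean(known, axis=0)

doc_vector = average_vector(['great', 'movie', 'unseenword'], toy_vectors, 3)
print(doc_vector)   # [2. 1. 1.]
```

Every document, whatever its length, ends up as one fixed-size vector, which is what the dense network expects as input.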

In [19]:
# generate averaged word vector features from word2vec model
avg_w2v_train_features = averaged_doc_vectorizer(corpus=tokenized_train, model=w2v_model,
                                                     num_features=w2v_num_features)
avg_w2v_test_features = averaged_doc_vectorizer(corpus=tokenized_test, model=w2v_model,
                                                    num_features=w2v_num_features)

In [20]:
print('Word2Vec model:> Train features shape:', avg_w2v_train_features.shape, 
      ' Test features shape:', avg_w2v_test_features.shape)

Word2Vec model:> Train features shape: (7000, 300)  Test features shape: (3000, 300)


In [21]:
# generate averaged word vector features from fastText model
avg_ft_train_features = averaged_doc_vectorizer(corpus=tokenized_train, model=ft_model,
                                                     num_features=feature_size)
avg_ft_test_features = averaged_doc_vectorizer(corpus=tokenized_test, model=ft_model,
                                                    num_features=feature_size)

In [22]:
print('FastText model:> Train features shape:', avg_ft_train_features.shape, 
      ' Test features shape:', avg_ft_test_features.shape)

FastText model:> Train features shape: (7000, 300)  Test features shape: (3000, 300)


## Define DNN Model

Let us leverage ``tensorflow.keras`` to build our deep neural network for the movie review classification task.
We will make use of ``Dense`` layers with ``ReLU`` activation and ``Dropout`` to reduce overfitting.

Architecture used:

- 3 hidden Dense layers
- 512 - 256 - 256 (neurons)
- 20% dropout after each hidden layer
- 1 sigmoid output unit for binary classification
- binary crossentropy loss
- adam optimizer

In [23]:
def construct_deepnn_architecture(num_input_features):
    dnn_model = Sequential()
    dnn_model.add(Dense(512, input_shape=(num_input_features,)))
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(256))
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(256))
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(1))
    dnn_model.add(Activation('sigmoid'))

    dnn_model.compile(loss='binary_crossentropy', optimizer='adam',                 
                      metrics=['accuracy'])
    return dnn_model
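
The last two lines of the model work together: the sigmoid squashes the final ``Dense`` output into a probability of the positive class, and binary cross-entropy penalizes confident wrong predictions. A minimal pure-Python sketch of both follows; the logits and labels are made-up values, and the Keras implementation differs in numerical details.

```python
import math

def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

logits = [2.0, -1.0, 0.0]     # hypothetical raw outputs of the last Dense layer
probs = [sigmoid(z) for z in logits]
labels = [1, 0, 1]            # 1 = positive, 0 = negative

print([round(p, 3) for p in probs])
print(round(binary_crossentropy(labels, probs), 4))
```

A prediction of exactly 0.5 contributes `log(2)` to the loss regardless of the true label, which is a handy reference point when reading training logs.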

## Compile and Visualize Model

In [24]:
w2v_dnn = construct_deepnn_architecture(num_input_features=w2v_num_features)

In [25]:
w2v_dnn.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 512)               154112    
_________________________________________________________________
activation (Activation)      (None, 512)               0         
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               131328    
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               6
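
The parameter counts in the summary can be checked by hand: a ``Dense`` layer with n inputs and m units has n*m weights plus m biases, while ``Activation`` and ``Dropout`` layers add none. A short sketch, where `dense_params` is a hypothetical helper and the layer sizes are taken from the architecture described above:

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [300, 512, 256, 256, 1]   # input features, hidden layers, output unit
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)        # [154112, 131328, 65792, 257]
print(sum(counts))   # 351489
```

The first two values match the ``dense`` and ``dense_1`` rows of the summary.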

## Train the Model using Word2Vec Features

The first exercise is to use word2vec features as input to our deep neural network to perform movie review classification.
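
The training call in the next cell passes ``validation_split=0.1`` and ``batch_size=100``. Keras holds out the last fraction of the samples for validation, before any shuffling, and trains on the rest. The sketch below works through the resulting sizes; the variable names are chosen for this illustration and the exact rounding inside Keras is an assumption.

```python
import math

n_samples = 7000          # rows in the training feature matrix
validation_split = 0.1
batch_size = 100

# Rows before the split point are used for training, the rest for validation
n_train = int(math.floor(n_samples * (1 - validation_split)))
n_val = n_samples - n_train

# Number of gradient updates in one epoch
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)   # 6300 700 63
```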

In [26]:
batch_size = 100
w2v_dnn.fit(avg_w2v_train_features, y_train, epochs=10, batch_size=batch_size, 
            shuffle=True, validation_split=0.1, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fa7e4fefc50>

### Evaluate Model

In [27]:
from sklearn.metrics import confusion_matrix, classification_report

In [28]:
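# predict_classes() was removed from newer TensorFlow releases; thresholding the
# sigmoid output at 0.5 is equivalent for a single-unit binary classifier
y_pred = (w2v_dnn.predict(avg_w2v_test_features) > 0.5).astype('int32').ravel()
predictions = le.inverse_transform(y_pred)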


In [29]:
labels = ['negative', 'positive']
print(classification_report(test_sentiments, predictions))
pd.DataFrame(confusion_matrix(test_sentiments, predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.84      0.74      0.79      1457
    positive       0.78      0.87      0.82      1543

    accuracy                           0.81      3000
   macro avg       0.81      0.80      0.81      3000
weighted avg       0.81      0.81      0.81      3000



| | negative | positive |
|---|---|---|
| negative | 1078 | 379 |
| positive | 201 | 1342 |
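
The numbers in the classification report follow directly from the confusion matrix, whose rows are the actual classes and whose columns are the predicted classes. A short sketch that recomputes them from the four counts above, treating ``positive`` as the positive class:

```python
# Confusion matrix from the word2vec run: rows are actual, columns are predicted
tn, fp = 1078, 379    # actual negative
fn, tp = 201, 1342    # actual positive

precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(precision_neg, 2), round(recall_neg, 2))   # 0.84 0.74
print(round(precision_pos, 2), round(recall_pos, 2))   # 0.78 0.87
print(round(accuracy, 2))                              # 0.81
```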


After 10 epochs the model reaches about 81% accuracy on the test set. It is noticeably better at recovering positive reviews (recall 0.87) than negative ones (recall 0.74).

## Train the model using FastText Features

The second exercise uses FastText feature vectors. We will use the same model architecture for this exercise as well, but create a new instance of it. Let's get started.

In [30]:
ft_dnn = construct_deepnn_architecture(num_input_features=feature_size)

In [31]:
batch_size = 100
ft_dnn.fit(avg_ft_train_features, y_train, epochs=15, batch_size=batch_size, 
            shuffle=True, validation_split=0.1, verbose=1)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7fa7f0008a50>

### Evaluate the model

In [32]:
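# predict_classes() was removed from newer TensorFlow releases; thresholding the
# sigmoid output at 0.5 is equivalent for a single-unit binary classifier
y_pred = (ft_dnn.predict(avg_ft_test_features) > 0.5).astype('int32').ravel()
predictions = le.inverse_transform(y_pred)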


In [33]:
labels = le.classes_.tolist()
print(classification_report(test_sentiments, predictions))
pd.DataFrame(confusion_matrix(test_sentiments, predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.85      0.86      0.85      1457
    positive       0.86      0.85      0.86      1543

    accuracy                           0.86      3000
   macro avg       0.86      0.86      0.86      3000
weighted avg       0.86      0.86      0.86      3000



| | negative | positive |
|---|---|---|
| negative | 1247 | 210 |
| positive | 224 | 1319 |


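FastText identifies both classes with a more balanced number of prediction errors than the model trained on word2vec features (210 false positives and 224 false negatives, versus 379 and 201), and overall accuracy rises from 0.81 to 0.86. Keep in mind that the FastText network was trained for 15 epochs rather than 10, so the two runs are not strictly like for like. We encourage you to try out these models on other datasets too!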