<a href="https://colab.research.google.com/github/Nivratti/Text_Classification/blob/master/Text_sentiment_analysis_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text sentiment analysis on Amazon Reviews

# Connect Google Colab with Google Drive

In [0]:
from google.colab import drive

In [4]:
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


# Execution Time

In [5]:
!pip install ipython-autotime

%load_ext autotime



# Tensorflow with GPU -- For faster training

Enabling and testing the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

Next, we'll confirm that we can connect to the GPU with TensorFlow:

In [6]:
%tensorflow_version 1.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

TensorFlow 1.x selected.
Found GPU at: /device:GPU:0
time: 7.24 s


# Global vars

In [7]:
import os

# project folder on Drive containing the dataset, trained model and other files
DRIVE_PROJECT_BASE_DIR = "/content/gdrive/My Drive/deep_learning/text_sentiment_analysis/"

BASE_DATASET_DIR = os.path.join(
    DRIVE_PROJECT_BASE_DIR , "dataset"
)

input_zipfile = os.path.join(
    BASE_DATASET_DIR , "amazonreviews.zip"
)

time: 4.55 ms


# Load Packages

In [8]:
import matplotlib.pyplot as plt
from tensorflow.python.keras import models, layers, optimizers
import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
import bz2
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
import re
import numpy as np
%matplotlib inline

time: 9.8 ms


# Data preparation

## Unzip zip file

In [0]:
# -o overwrites files that already exist, so the cell does not stall on a prompt when re-run
!unzip -o "$input_zipfile"

Archive:  /content/gdrive/My Drive/deep_learning/text_sentiment_analysis/dataset/amazonreviews.zip

## Reading the text

The text is held in a compressed format. Luckily, we can still read it line by line. The first word gives the label, so we have to convert that into a number and then take the rest to be the comment.

In [0]:
def get_labels_and_texts(file):
    labels = []
    texts = []
    for line in bz2.BZ2File(file):
        x = line.decode("utf-8")
        labels.append(int(x[9]) - 1)
        texts.append(x[10:].strip())
    return np.array(labels), texts

train_labels, train_texts = get_labels_and_texts('train.ft.txt.bz2')
test_labels, test_texts   = get_labels_and_texts('test.ft.txt.bz2')
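
The slicing in `get_labels_and_texts` only works if every line starts with a fastText-style prefix such as `__label__1 ` or `__label__2 `: the prefix `__label__` is nine characters long, so the digit sits at index 9 and the review text starts after it. A minimal sketch of that parsing on a single line, where the sample line is made up for illustration and `parse_line` is a hypothetical helper:

```python
# Hypothetical sample line in the "__label__<digit> <text>" layout
# that the slicing in get_labels_and_texts assumes.
sample = b"__label__2 Great product: works exactly as described.\n"

def parse_line(raw):
    """Split one raw line into (label, text) using the same slicing."""
    x = raw.decode("utf-8")
    label = int(x[9]) - 1   # "__label__" is 9 characters, so x[9] is the digit
    text = x[10:].strip()   # everything after the digit, whitespace trimmed
    return label, text

label, text = parse_line(sample)
print(label, repr(text))
```

Label digit 2 maps to 1 and digit 1 maps to 0, which gives the binary targets used in the rest of the notebook.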

## View data shape

In [0]:
# Print shapes
print("Shape of train_labels: {}".format(train_labels.shape))
print("length of train_texts: {}".format(len(train_texts)))

print("Shape of test_labels: {}".format(test_labels.shape))
print("length of test_texts: {}".format(len(test_texts)))

# Text Preprocessing

The first step is to lowercase everything and replace non-word characters with spaces, since most of them are punctuation. The second step removes any characters that are still outside lowercase ASCII letters, digits and whitespace (letters with accents, for example). It might be better to map some of these to their plain ASCII equivalents, but I'm going to ignore that here. Counting the different characters also shows that this corpus contains very few unusual ones.

In [0]:
import re
NON_ALPHANUM = re.compile(r'[\W]')
# keep lowercase ASCII letters, digits 0-9 and whitespace; drop everything else
NON_ASCII = re.compile(r'[^a-z0-9\s]')
def normalize_texts(texts):
    normalized_texts = []
    for text in texts:
        lower = text.lower()
        no_punctuation = NON_ALPHANUM.sub(r' ', lower)
        no_non_ascii = NON_ASCII.sub(r'', no_punctuation)
        normalized_texts.append(no_non_ascii)
    return normalized_texts
        
train_texts = normalize_texts(train_texts)
test_texts = normalize_texts(test_texts)
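
To see what the two passes do, here is a small sketch that applies the same idea to one made-up string. The names `NON_WORD`, `NOT_KEPT` and `normalize` are illustrative only, and the sketch assumes that all digits `0-9` should be kept.

```python
import re

NON_WORD = re.compile(r"\W")           # anything that is not a letter, digit or underscore
NOT_KEPT = re.compile(r"[^a-z0-9\s]")  # anything outside lowercase ASCII, digits, whitespace

def normalize(text):
    lower = text.lower()
    spaced = NON_WORD.sub(" ", lower)  # punctuation becomes a space
    return NOT_KEPT.sub("", spaced)    # accented letters and underscores are dropped

result = normalize("Café was GREAT, 10/10!")
print(repr(result))
print(result.split())
```

The accented `é` survives the first pass (in Python 3 it counts as a word character) and is removed by the second, so "café" becomes "caf". Repeated spaces are harmless because the tokenizer later splits on whitespace.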

# Train/Validation Split
Now I'm going to set aside 20% of the training set for validation.

In [0]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, random_state=57643892, test_size=0.2
  )

In [0]:
print("Shape of train_labels: {}".format(train_labels.shape))
print("length of train_texts: {}".format(len(train_texts)))

print("Shape of val_labels: {}".format(val_labels.shape))
print("length of val_texts: {}".format(len(val_texts)))

Viewing the first five labels and the first training text.

In [0]:
print(train_labels[:5])
print(train_texts[:1])

# Tokenization

Now I will just run a Tokenizer using the top 12000 words as features.

In [0]:
MAX_FEATURES = 12000
tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(train_texts)

train_texts = tokenizer.texts_to_sequences(train_texts)
val_texts = tokenizer.texts_to_sequences(val_texts)
test_texts = tokenizer.texts_to_sequences(test_texts)
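
The idea behind a frequency-capped tokenizer can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: `fit_word_index` and `to_sequences` are hypothetical helpers, and the tiny corpus is invented. It assumes, as I understand the Keras `Tokenizer` to behave, that index 1 goes to the most frequent word, that only indices below `num_words` are kept, and that unknown words are dropped.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown or low-ranked words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

corpus = ["good good good book", "bad book", "good plot"]
word_index = fit_word_index(corpus)
sequences = to_sequences(["good book bad unseen"], word_index, num_words=3)
print(word_index)
print(sequences)
```

Because the vocabulary is fitted on the training texts only, words that appear only in the validation or test sets simply vanish from their sequences.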


# Padding Sequences
In order to use batches effectively, I'm going to need to take my sequences and turn them into sequences of the same length. I'm just going to make everything here the length of the longest sentence in the training set. I'm not dealing with this here, but it may be advantageous to have variable lengths so that each batch contains sentences of similar lengths. This might help mitigate issues that arise from having too many padded elements in a sequence.


In [0]:
MAX_LENGTH = max(len(train_ex) for train_ex in train_texts)

train_texts = pad_sequences(train_texts, maxlen=MAX_LENGTH)
val_texts = pad_sequences(val_texts, maxlen=MAX_LENGTH)
test_texts = pad_sequences(test_texts, maxlen=MAX_LENGTH)
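
With its default arguments, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, truncates from the front as well. The sketch below mimics that behaviour in plain Python on invented sequences; `pad_left` is a hypothetical helper, not part of Keras.

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad with `value`; longer sequences keep their last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

seqs = [[5, 2], [7, 1, 4, 9], [3]]
max_length = max(len(s) for s in seqs)  # same rule as MAX_LENGTH above
padded = pad_left(seqs, max_length)
print(padded)
```

Since a single very long review sets the width for every row, most rows end up mostly zeros. Capping the length at a high percentile of the training lengths would be one alternative, which I have not tried here.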

# Save Tokenizer

In [0]:
import pickle

# saving
tokenizer_info = {
    "tokenizer"   : tokenizer,
    "MAX_LENGTH"  : MAX_LENGTH,
    "MAX_FEATURES": MAX_FEATURES,
}
with open('tokenizer_info.pickle', 'wb') as handle:
    pickle.dump(tokenizer_info, handle, protocol=pickle.HIGHEST_PROTOCOL)
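
Reading the file back at inference time is the mirror image of saving it. The sketch below shows the round trip with a stand-in dictionary: a plain dict takes the place of the fitted tokenizer, and the file is written to a temporary directory so that nothing in the project folder is touched.

```python
import os
import pickle
import tempfile

# Stand-in for the real dictionary; a plain dict replaces the fitted Tokenizer.
tokenizer_info = {
    "tokenizer": {"good": 1, "book": 2},
    "MAX_LENGTH": 4,
    "MAX_FEATURES": 12000,
}

path = os.path.join(tempfile.mkdtemp(), "tokenizer_info.pickle")
with open(path, "wb") as handle:
    pickle.dump(tokenizer_info, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Later, e.g. in a prediction script: load everything in one go.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored["MAX_LENGTH"], restored["MAX_FEATURES"])
```

Storing `MAX_LENGTH` and `MAX_FEATURES` next to the tokenizer means new text can be padded to exactly the shape the model was trained on.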

In [0]:
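# free the memory held by the tokenizer
del tokenizer

import gc

# gc.get_count() returns the per-generation collection counters, not a total object count
print(f"GC counters before collection -- {gc.get_count()}")

# run a full garbage collection
gc.collect()
print(f"GC counters after collection -- {gc.get_count()}")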

# Labels info

In [0]:
labels_dict = {
    "negative": 0,
    "positive": 1,
}
num_classes = len(labels_dict.items())
print(f"num_classes : {num_classes}")

# Convert to categorical data

In [0]:
from keras.utils import to_categorical

train_y = to_categorical(train_labels, num_classes=num_classes, dtype='float32')
test_y = to_categorical(test_labels, num_classes=num_classes, dtype='float32')
val_y = to_categorical(val_labels, num_classes=num_classes, dtype='float32')

# Print shapes
print("Shape of train_y: {}".format(train_y.shape))
print("Shape of test_y: {}".format(test_y.shape))
print("Shape of val_y: {}".format(val_y.shape))

View a single record

In [0]:
print(train_y[0])
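
One-hot encoding turns each integer label into a row with a single 1.0 in the column of its class. A minimal plain-Python sketch of the idea, using invented labels and a hypothetical `one_hot` helper rather than `to_categorical` itself:

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1.0 in the label's column."""
    return [
        [1.0 if col == label else 0.0 for col in range(num_classes)]
        for label in labels
    ]

labels = [0, 1, 1]  # 0 = negative, 1 = positive
encoded = one_hot(labels, num_classes=2)
print(encoded)

# Taking the argmax of each row recovers the original integer labels.
decoded = [row.index(max(row)) for row in encoded]
print(decoded)
```

This two-column layout is what the softmax output layers and the `categorical_crossentropy` loss in the models below expect.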

# Deep learning Models

In [0]:
from keras.models import Sequential
from keras.layers import (
    Activation, Bidirectional, Conv1D, Dense, Dropout, Embedding, Flatten,
    GlobalMaxPooling1D, GRU, LSTM, MaxPooling1D, SpatialDropout1D,
)
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, TensorBoard, Callback, EarlyStopping

## Utilities

### Plotting accuracy and loss graph

In [0]:
import matplotlib.pyplot as plt

def plot_accuracy_loss(history):
  axes = plt.axes()
  # axes.set_ylim([0, 1])

  # Plot training & validation accuracy values
  plt.plot(history.history['accuracy'])
  plt.plot(history.history['val_accuracy'])
  plt.title('Model accuracy')
  plt.ylabel('Accuracy')
  plt.xlabel('Epoch')
  plt.legend(['Train', 'Validation'], loc='upper left')
  plt.show()

  # Plot training & validation loss values
  axes = plt.axes()
  # axes.set_ylim([0, 1])
  plt.plot(history.history['loss'])
  plt.plot(history.history['val_loss'])
  plt.title('Model loss')
  plt.ylabel('Loss')
  plt.xlabel('Epoch')
  plt.legend(['Train', 'Validation'], loc='upper left')
  plt.show()

## LSTM

In [0]:
embed_dim = 64

model_LSTM = Sequential()
model_LSTM.add(Embedding(MAX_FEATURES, embed_dim, input_length=MAX_LENGTH, mask_zero=True))
model_LSTM.add(LSTM(64,dropout=0.4,return_sequences=True))
model_LSTM.add(LSTM(32,dropout=0.5,return_sequences=False))
model_LSTM.add(Dense(num_classes, activation='softmax'))
model_LSTM.compile(loss = 'categorical_crossentropy', optimizer=Adam(lr = 0.001), metrics = ['accuracy'])
model_LSTM.summary()

In [0]:
# Train model
epochs = 2
batch_size = 2048 # 64 # use more if gpu available - for faster processing 
history = model_LSTM.fit(
    train_texts, train_y, 
    validation_data=(val_texts, val_y), 
    epochs=epochs, batch_size=batch_size, verbose=1, shuffle=True
)

In [0]:
plot_accuracy_loss(history)

In [0]:
# preds = model_LSTM.predict(test_texts)

# print('Accuracy score: {:0.4}'.format(accuracy_score(test_labels, 1 * (preds > 0.5))))
# print('F1 score: {:0.4}'.format(f1_score(test_labels, 1 * (preds > 0.5))))
# print('ROC AUC score: {:0.4}'.format(roc_auc_score(test_labels, preds)))
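
The commented-out evaluation above thresholds the predictions at 0.5, which fits a model with a single sigmoid output. The models in this notebook end in a two-column softmax, so one way to score them is to take the argmax of each row first. A plain-Python sketch of that, with invented probabilities and hypothetical helper names:

```python
def argmax_labels(probabilities):
    """Pick the column with the highest probability in each row."""
    return [row.index(max(row)) for row in probabilities]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_positive(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Invented softmax outputs: column 0 = negative, column 1 = positive.
preds = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
true_labels = [0, 1, 1, 1]

pred_labels = argmax_labels(preds)
print(pred_labels)
print(accuracy(true_labels, pred_labels))
print(f1_positive(true_labels, pred_labels))
```

With the real NumPy predictions, the equivalent would be along the lines of `accuracy_score(test_labels, preds.argmax(axis=1))` and `roc_auc_score(test_labels, preds[:, 1])`.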

## Convolutional Neural Net Model

In [0]:
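# Hyperparameters for the convolutional block. These values are assumed
# (they are not defined earlier in the notebook); tune them as needed.
filters = 250
kernel_size = 3
hidden_dims = 250

model_cnn = Sequential()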

model_cnn.add(Embedding(MAX_FEATURES, embed_dim, input_length=MAX_LENGTH))
model_cnn.add(Dropout(0.2))

model_cnn.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model_cnn.add(GlobalMaxPooling1D())

model_cnn.add(Dense(hidden_dims))
model_cnn.add(Dropout(0.2))
model_cnn.add(Activation('relu'))

model_cnn.add(Dense(num_classes))
model_cnn.add(Activation('softmax'))

model_cnn.compile(loss = 'categorical_crossentropy', optimizer=Adam(lr = 0.001), metrics = ['accuracy'])

print(model_cnn.summary())

In [0]:
# Train model
epochs = 2
batch_size = 2048 # 64 # use more if gpu available - for faster processing 
history = model_cnn.fit(
    train_texts, train_y, 
    validation_data=(val_texts, val_y), 
    epochs=epochs, 
    batch_size=batch_size, verbose=1, shuffle=True
)

In [0]:
plot_accuracy_loss(history)

## GRU

In [0]:
def build_gru_model():
    sequences = layers.Input(shape=(MAX_LENGTH,))
    embedded = layers.Embedding(MAX_FEATURES, 64)(sequences)
    x = layers.CuDNNGRU(128, return_sequences=True)(embedded)
    x = layers.CuDNNGRU(128)(x)
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dense(100, activation='relu')(x)
    predictions = layers.Dense(num_classes, activation='softmax')(x)
    model = models.Model(inputs=sequences, outputs=predictions)
    model.compile(
        loss='categorical_crossentropy',
        optimizer='adam',
        metrics=['accuracy']
    )
    return model
    
gru_model = build_gru_model()

In [0]:
epochs = 2
batch_size = 2048 # 128 # use more if gpu available - for faster processing 

# keep the returned History so the plot below shows this model, not the previous one
history = gru_model.fit(
    train_texts, 
    train_y, 
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(val_texts, val_y),
)

In [0]:
plot_accuracy_loss(history)

## Bidirectional-GRU

In [0]:
def build_bidirectional_gru():
	model = Sequential()
	model.add(Embedding(MAX_FEATURES, embed_dim, input_length=MAX_LENGTH))
	model.add(SpatialDropout1D(0.25))
	model.add(Bidirectional(GRU(64,dropout=0.4,return_sequences = True)))
	model.add(Bidirectional(GRU(32,dropout=0.5,return_sequences = False)))
	model.add(Dense(num_classes, activation='softmax'))
	model.compile(
		loss = 'categorical_crossentropy', optimizer=Adam(lr = 0.001), 
		metrics = ['accuracy']
	)
	return model
	
bidirectional_gru_model = build_bidirectional_gru()

In [0]:
epochs = 2
batch_size = 2048 # 128 # use more if gpu available - for faster processing 

# keep the returned History so the plot below shows this model, not the previous one
history = bidirectional_gru_model.fit(
    train_texts, 
    train_y, 
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(val_texts, val_y),
)

In [0]:
plot_accuracy_loss(history)

# Machine Learning

In [0]:
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC

# class names in label order (0 = negative, 1 = positive)
classes = list(labels_dict.keys())

model = LinearSVC()
model.fit(train_texts, train_labels)
y_pred = model.predict(val_texts)

print(f"Accuracy : {accuracy_score(val_labels, y_pred)}")

conf_mat = confusion_matrix(val_labels, y_pred)
fig, ax = plt.subplots(figsize=(10,6))
sns.heatmap(conf_mat, annot=True, fmt='3.0f', cmap="summer", xticklabels=classes, yticklabels=classes)
plt.title('Confusion_matrix', y=1.05, size=15)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()