# <u>Chapter 4</u>: Sentiment Analysis

Deciphering the emotional tone behind a sequence of words finds extensive utility in analyzing survey responses, customer feedback, or product reviews. In particular, the advent of social networks offered new possibilities for people to express their opinions on various issues instantly. Therefore, it is not surprising that many stakeholders, such as companies, academia, and governments, aim to exploit public opinion on various topics and acquire valuable insight.
The focus of this chapter is another typical problem in natural language processing, called `Sentiment analysis`, which is the extraction of sentiment from a piece of text. 

We will focus on extracting the sentiment of a corpus taken from [Snap](https://snap.stanford.edu/data/web-Amazon-links.html).


In [None]:
# Install the necessary modules.
%pip install pandas
%pip install matplotlib
%pip install seaborn
%pip install scikit-learn
%pip install numpy
%pip install tensorflow
%pip install keras
%pip install pydot

# To install graphviz follow the instructions at https://graphviz.gitlab.io/download/

# For Windows users.
# 
# If you get the following error during the installation of the packages you need to enable long path names through the Windows registry:
# ERROR: Could not install packages due to an OSError

## Exploratory data analysis

We start with an exploratory data analysis as in every machine learning problem. Before we start an in-depth analysis, we extract some basic information from the corpus. Then, we generate different plots to shed some light on the dataset and avoid possible pitfalls in the subsequent analysis.

First, we create a function that reads the product categories from a file.

In [None]:
import gzip
import pandas as pd

pd.options.mode.chained_assignment = None

# Method for reading the categories file.
def readCategories(filename):
  i, productId, d = 0, '', {}
  f = gzip.open(filename, 'rb')

  # Iterate over all lines in the file.
  for l in f:
    spacesPos = l.find(b' ')
    l = l.strip().decode("latin-1")
    
    # Check whether we are reading a product id or a product category.
    if spacesPos != -1:
      # The categories are separated by a comma.
      for c in l.split(','):
        # Store the category for a specific product.
        d[i] = {'product/productId':productId, 'category':c}
        i += 1
    else:
      productId = l # Store the product id.

  return pd.DataFrame.from_dict(d, columns=['product/productId', 'category'],  orient='index')

We can now call the `readCategories` function on the `categories.txt.gz` file and obtain its data.

In [None]:
df = readCategories('./data/categories.txt.gz')

# Remove duplicate categories for each product.
df = df.drop_duplicates(subset=['product/productId', 'category'], keep='first')

df.head()

In [None]:
# Merge the categories for each product.
df_merged = pd.DataFrame(df.groupby('product/productId', as_index=False)['category'].apply(lambda x: "%s" % ' '.join(x)))

df_merged.head()

We can now plot the distribution of the Amazon item categories.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import re
import seaborn as sns

sns.set(font_scale=1.5)

# Get the categories distribution and keep the top 5.
x = df.category.value_counts()
x = x.sort_values(ascending=False)
x = x.iloc[0:5]

# Create the plot.
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title('Amazon items distribution')
plt.xlabel('Category')
plt.ylabel('Number of items')

ax.xaxis.set_tick_params(labelsize=14)
ax.yaxis.set_tick_params(labelsize=14)

# Add the text labels.
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x()+rect.get_width()/2, height+10, label, 
            ha='center', va='bottom', fontsize=14)

In [None]:
# Method for reading the keys/values in the file.
def parseKeysValues(filename):
  entry = {}
  f = gzip.open(filename, 'rb')
  
  # Iterate over all lines in the file.
  for l in f:
    l = l.strip()
    # The key/value pairs are separated by a colon.
    colonPos = l.find(b':')
    if colonPos == -1:
      yield entry
      entry = {}
      continue
    key = l[:colonPos].decode("latin-1")
    value = l[colonPos+2:].decode("latin-1")
    entry[key] = value
  yield entry

In [None]:
# Method for reading the reviews file.
def readReviews(path, num=-1):
  i = 0
  df = {}
  for d in parseKeysValues(path):
    df[i] = d
    i += 1
    if i == num:
      break
  return pd.DataFrame.from_dict(df, orient='index')

In [None]:
df_reviews = readReviews('./data/Software.txt.gz')

# Convert the scores to float values.
df_reviews['review/score'] = df_reviews['review/score'].astype(float)

df_reviews[['product/productId', 'review/score', 'review/text']].tail()

In [None]:
df_reviews = pd.merge(df_reviews, df_merged, on='product/productId', how='left')

df_reviews[['product/productId', 'review/score', 'review/text', 'category']].tail()

In [None]:
df_reviews.shape

In [None]:
# The dataset is known to contain duplicates.
df_reviews = df_reviews.drop_duplicates(subset=['review/userId','product/productId'], keep='first', inplace=False)

df_reviews.shape

In [None]:
# Keep only the reviews for the Software category (in practice all).
df_software = df_reviews.loc[[i for i in df_reviews['category'].index if re.search('Software', df_reviews['category'][i])]]

df_software.shape

The next step is to show the rating distribution.

In [None]:
# Get the rating distribution.
x = df_software['review/score'].value_counts()
x = x.sort_index()

# Create the plot.
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title('Ratings distribution')
plt.xlabel('Stars')
plt.ylabel('Number of items')

ax.xaxis.set_tick_params(labelsize=14)
ax.yaxis.set_tick_params(labelsize=14)

# Add the text labels.
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x()+rect.get_width()/2, height+10, label, 
            ha='center', va='bottom', fontsize=14)

More information about the distribution of the software items is shown in the following bar chart.

In [None]:
# Get the software distribution and keep the top 5.
x = df_software['product/title'].str[0:17].value_counts()
x = x.sort_values(ascending=False)
x = x.iloc[0:5]

# Create the plot.
plt.figure(figsize=(12, 6))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title('Software distribution')
plt.xlabel('Name')
plt.ylabel('Number of reviews')
plt.xticks(rotation=30)

ax.xaxis.set_tick_params(labelsize=14)
ax.yaxis.set_tick_params(labelsize=14)

# Add the text labels.
rects = ax.patches
labels = x.values

for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x()+rect.get_width()/2, height+10, label,
            ha='center', va='bottom', fontsize=14)


The box plot below summarizes the ratings for a selection of software titles.

In [None]:
# Get the data for specific software.
df_software_sub = df_software.loc[
    (df_software['product/title'].str.match(r'Documents To Go Premium Edition')) |
    (df_software['product/title'].str.match(r'TOPO! National Geographic.* York')) |
    (df_software['product/title'].str.match(r'Pajama Sam 2 Thunder and Lightning')) |
    (df_software['product/title'].str.match(r'Instant Immersion French: "New')) |
    (df_software['product/title'].str.match(r'Encyclopedia Britannica 2000 Deluxe')) |
    (df_software['product/title'].str.match(r'Logos Bible Atlas')) |
    (df_software['product/title'].str.match(r'Instant Immersion German Platinum')) ] 

# Reduce the name of the title.
df_software_sub['product/shorttitle'] = df_software_sub['product/title'].str[0:12]

# Create the plot.
plt.figure(figsize=(10, 5))
ax = sns.boxplot(x='product/shorttitle', y='review/score', data=df_software_sub)
plt.title('Software Ratings')
plt.xlabel('Name')
plt.ylabel('Score')
plt.xticks(rotation=30)

ax.xaxis.set_tick_params(labelsize=14)
ax.yaxis.set_tick_params(labelsize=14)

We can calculate the distribution of the reviews based on their word count.

In [None]:
review_length = df_reviews['review/text'].apply(lambda col: len(col.split(' ')))
df_reviews['review_length'] = review_length

# Create the plot.
plt.figure(figsize=(10, 6))
ax = sns.histplot(data=review_length)
plt.xlim(0, 400)
plt.title('Review length distribution')
plt.xlabel('Word count')
plt.ylabel('Review count')

ax.xaxis.set_tick_params(labelsize=14)
ax.yaxis.set_tick_params(labelsize=14)

In [None]:
# Get the number of reviews per user.
reviewers = df_reviews.groupby(by=['review/userId'], as_index=False).count().sort_values(by=['product/productId'], ascending=False)
reviewers = reviewers[['review/userId', 'product/productId']]
reviewers.columns = ['review/userId', 'review/count']

# Store the top reviewers.
top_reviewers = reviewers[reviewers['review/count'] >= 50]
top_reviewers = top_reviewers[['review/userId']]

print(top_reviewers)

In [None]:
# Extract the data for top reviewers.
top_rev_help = pd.merge(top_reviewers, df_reviews, on='review/userId', how='left')
top_rev_help = top_rev_help[top_rev_help['review/userId'] != 'unknown']
top_rev_help = top_rev_help[top_rev_help['review_length'] < 400]
top_rev_help = top_rev_help.sort_values(by=['review/score'], ascending=False)

# Calculate helpfulness score.
top_rev_help['review/helpscore'] = top_rev_help['review/helpfulness'].str.replace('/0', '/1')
top_rev_help['review/helpscore'] = top_rev_help['review/helpscore'].fillna(1000).apply(pd.eval)

# Format the data.
top_rev_help['reviewers'] = 'top'
top_rev_help = top_rev_help.sort_values(by=['review/score'], ascending=False)
top_rev_help = top_rev_help.reset_index(drop=True)
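
The helpfulness field is a string of the form `helpful/total`, which the cell above turns into a number by patching zero denominators and evaluating the expression. As a minimal sketch of the same conversion without evaluating strings, the hypothetical helper below (`helpfulness_ratio` is not part of the chapter's code) parses the two counts explicitly. It assumes the same conventions as the cell above: a zero denominator is treated as 1 and a missing value becomes 1000.

```python
def helpfulness_ratio(value, missing=1000.0):
    """Convert a 'helpful/total' string such as '3/4' into a float."""
    # Missing entries (e.g. NaN) are not strings; mirror fillna(1000).
    if not isinstance(value, str):
        return missing
    helpful, total = value.split('/')
    total = int(total)
    # Mirror the '/0' -> '/1' replacement to avoid dividing by zero.
    return int(helpful) / (total if total != 0 else 1)

for raw in ['3/4', '0/0', None]:
    print(raw, '->', helpfulness_ratio(raw))
```

If this approach were adopted, it could be applied with something like `top_rev_help['review/helpfulness'].apply(helpfulness_ratio)`.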

In [None]:
# Create the plot.
ax = sns.relplot(x=top_rev_help.index, y="review/helpscore", hue="reviewers", size="review/score",
            sizes=(40, 400), alpha=.5, palette="muted", 
            height=6, aspect=8/6, data=top_rev_help)

plt.title('Review helpfulness')
plt.xlabel('Sample')
plt.ylabel('Helpfulness')

In [None]:
# Store the bottom reviewers.
bottom_reviewers = reviewers[reviewers['review/count'] == 1]
bottom_reviewers = bottom_reviewers[['review/userId']]

# Keep 130 random bottom reviewers.
bottom_reviewers = bottom_reviewers.sample(130, random_state=123)

# Extract the data for bottom reviewers.
bottom_rev_help = pd.merge(bottom_reviewers, df_reviews, on='review/userId', how='left')
bottom_rev_help = bottom_rev_help[bottom_rev_help['review_length'] < 400]
bottom_rev_help = bottom_rev_help.sort_values(by=['review/score'], ascending=False)

# Calculate helpfulness score.
bottom_rev_help['review/helpscore'] = bottom_rev_help['review/helpfulness'].str.replace('/0', '/1')
bottom_rev_help['review/helpscore'] = bottom_rev_help['review/helpscore'].fillna(1000).apply(pd.eval)

# Format the data.
bottom_rev_help['reviewers'] = 'bottom'
bottom_rev_help = bottom_rev_help.sort_values(by=['review/score'], ascending=False)
bottom_rev_help = bottom_rev_help.reset_index(drop=True)

In [None]:
# Create the plot.
ax = sns.relplot(x=bottom_rev_help.index, y="review/helpscore", hue="reviewers", size="review/score",
            sizes=(40, 400), alpha=.5, palette="hls", 
            height=6, aspect=8/6, data=bottom_rev_help)

plt.title('Review helpfulness')
plt.xlabel('Sample')
plt.ylabel('Helpfulness')

In [None]:
# Unpivot the dataframe from wide to long format. 
stripplot_df = pd.melt(top_rev_help[['review/userId', 'review/helpscore']], "review/userId", var_name="m")

# Create the plots.
f, ax = plt.subplots()
f.set_figheight(6)
f.set_figwidth(12)

# Create a plot to show the helpfulness score per reviewer.
sns.stripplot(x="value", y="m", hue="review/userId",
              data=stripplot_df, dodge=True, 
              alpha=.6, zorder=1)

# Show the conditional means of the scores.
sns.pointplot(x="value", y="m", hue="review/userId",
              data=stripplot_df, dodge=.8 - .8 / 3,
              join=False, palette="dark",
              markers="d", scale=1, ci=None)

plt.title('Helpfulness score per reviewer')
plt.xlabel('score')
plt.ylabel('')

# Configure the legend.
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[2:], labels[2:], title="reviewer",
          handletextpad=0, columnspacing=1,
          loc="lower left", ncol=3, frameon=True)

## Linear Regression

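`Linear regression` is perhaps one of the most well-known algorithms in statistics and machine learning. It models the relationship between x (independent variable) and y (dependent variable) as a straight line, y = ax + b, choosing the slope and intercept that best fit the data, typically by minimizing the squared error.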
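
To make the idea of a best-fitting line concrete, here is a minimal sketch that solves an ordinary least-squares problem directly with NumPy. The four data points are made up for illustration and lie exactly on a line, so the recovered slope and intercept should match it closely; this is not the chapter's dataset.

```python
import numpy as np

# Hypothetical toy data (assumption): points on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Design matrix with a column for x and a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Solve min ||A @ [slope, intercept] - y||^2.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print("y = {:.2f}x + {:.2f}".format(slope, intercept))
```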

In [None]:
from sklearn.linear_model import LinearRegression

# Read the data from the csv file.
data = pd.read_csv('./data/2019.csv')

# Keep these two categories.
x = data['GDP per capita']
y = data['Score']

# Reshape the data.
x = x.values.reshape(-1,1)
y = y.values.reshape(-1,1)

# Create and fit the linear regression model.
lmodel = LinearRegression()
lmodel.fit(x, y)

# Get the predictions.
predictions = lmodel.predict(x)

# Create a dataframe with the data.
linear_df = pd.DataFrame(data, columns=['GDP per capita', 'Score'])
linear_df['Predictions'] = predictions

# Create the plot.
plt.figure(figsize=(10, 6))
sns.scatterplot(data=linear_df, x='GDP per capita', y='Score')
sns.lineplot(data=linear_df, x="GDP per capita", y="Predictions", color='red', linewidth=4)
plt.title("y = {:.5}x + {:.5}".format(lmodel.coef_[0][0], lmodel.intercept_[0]))
plt.xlabel('GDP per capita')
plt.ylabel('Happiness score')

#plt.show()

## Logistic Regression

`Logistic regression` is one of the most popular supervised machine learning algorithms and is used mainly for classification problems. Its output is a value between __0__ and __1__, which can be interpreted as the probability of the positive class.
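
The bounded output comes from the logistic (sigmoid) function, which squashes any real-valued score into the interval between 0 and 1. The sketch below illustrates this with the standard library only; the helper names and the 0.5 decision threshold are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Turn the probability into a hard 0/1 label (threshold is an assumption)."""
    return 1 if sigmoid(z) >= threshold else 0

# Large negative scores approach 0, large positive scores approach 1.
for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), predict_label(z))
```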

We can proceed with creating the training and test sets and perform sentiment analysis.

In [None]:
# Keep only the review text and score.
df = df_software[['review/text', 'review/score']]

# Every rating below or equal to 3 is considered negative (0) and above 3 positive (1).
df['label'] = df['review/score'].apply(lambda x: 0 if x <= 3  else 1)

df.head()

In [None]:
# Count the number of samples for each label.
df.label.value_counts()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Get the training and test sets.
df_train, df_test = train_test_split(df, test_size=0.3, stratify=df['label'], random_state=123)
	
# Create the count vectorizer.
vectorizer = CountVectorizer(binary=True)

# Fit the vocabulary on the training data only, then get the count vectors.
vectorizer.fit(df_train['review/text'].values)
countvect_train = vectorizer.transform(df_train['review/text'].values)
countvect_test = vectorizer.transform(df_test['review/text'].values)

# Get the class arrays.
train_class = df_train['label'].values
test_class = df_test['label'].values
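
To see what the binary count vectors look like, the following sketch runs `CountVectorizer(binary=True)` on a tiny made-up corpus. The texts and variable names are hypothetical. Note that the repeated word in the first text is recorded as 1, not 2, and that words unseen during fitting are ignored when transforming the test text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, used only for illustration.
train_texts = ["great software great price", "buggy software"]
test_texts = ["great but buggy and slow"]

toy_vectorizer = CountVectorizer(binary=True)
toy_train = toy_vectorizer.fit_transform(train_texts)
toy_test = toy_vectorizer.transform(test_texts)

# List the vocabulary terms in column order.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(vocab)
print(toy_train.toarray())
print(toy_test.toarray())
```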

We first calculate the baseline accuracy.

In [None]:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

print("The baseline accuracy is: " + str(df[df.label == 1].shape[0]/df.shape[0]))
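
The cell above uses the share of positive reviews as the baseline, which corresponds to always predicting the positive class. That matches the majority-class baseline only when positive reviews are the larger group. As a small sketch that works for either case, the hypothetical helper below returns the accuracy of always predicting whichever label is most frequent; the toy labels are made up.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical labels: 3 negative and 7 positive samples.
toy_labels = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
print(majority_baseline_accuracy(toy_labels))
```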

We can now train the classifier.

In [None]:
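# Create the classifier without regularization.
# Note: recent scikit-learn versions expect penalty=None; older ones used the string 'none'.
classifier = LogisticRegression(penalty=None, solver='lbfgs', max_iter=10000, random_state=123)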

# Fit the classifier with the train data.
classifier.fit(countvect_train, train_class)

# Get the predicted classes.
test_class_pred = classifier.predict(countvect_test)

# Calculate the accuracy on the test set.
metrics.accuracy_score(test_class, test_class_pred)

In [None]:
# Get the predicted classes for the training set.
train_class_pred = classifier.predict(countvect_train)

# Calculate the accuracy on the training set.
metrics.accuracy_score(train_class, train_class_pred)

We now apply `regularization` to the problem under study. Regularization adds a penalty on large model coefficients, which discourages learning an overly complex or flexible model and helps prevent overfitting.
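
As a rough illustration of the effect, the sketch below fits two logistic regression models on made-up, linearly separable data. In scikit-learn, `C` is the inverse of the regularization strength and the default penalty is L2, so a small `C` should shrink the learned coefficient compared to a large one. The data and variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data (not the review corpus): one feature, two classes.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Small C means strong regularization; large C means weak regularization.
strong = LogisticRegression(C=0.01).fit(X, y)
weak = LogisticRegression(C=100.0).fit(X, y)

print("C=0.01 coefficient:", strong.coef_[0][0])
print("C=100  coefficient:", weak.coef_[0][0])
```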

In [None]:
# Create the classifier.
classifier = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=10000, random_state=123)

# Fit the classifier with the train data.
classifier.fit(countvect_train, train_class)

# Get the predicted classes.
test_class_pred = classifier.predict(countvect_test)

# Calculate the accuracy on the test set.
metrics.accuracy_score(test_class, test_class_pred)

## Deep Neural Networks

An `artificial neural network` (ANN) is a collection of connected nodes (artificial neurons) stacked in layers. Between the input and the output sits a series of hidden layers, so called because the values of their nodes are not observed in the data but are computed internally by the network. These layers are the secret sauce of an ANN and give it its special power. Networks with many hidden layers are called `deep neural networks`.
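
To show what stacking layers means in practice, here is a minimal NumPy sketch of a forward pass through two hidden layers with ReLU activations and a sigmoid output. The layer sizes, random weights, and inputs are all assumptions chosen for illustration; no training takes place.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: keep positive values, zero out the rest."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squash scores into the range between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical sizes: 5 input features, two hidden layers of 4 nodes, 1 output.
x = rng.random((3, 5))  # a batch of 3 samples
W1, b1 = rng.normal(size=(5, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)

# Each layer applies a linear map followed by a non-linear activation.
h1 = relu(x @ W1 + b1)
h2 = relu(h1 @ W2 + b2)
out = sigmoid(h2 @ W3 + b3)

print(out.shape)
```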

Let's create our own for the problem under study.

In [None]:
import tensorflow
tensorflow.random.set_seed(2)
from numpy.random import seed
seed(1)
from keras.layers import Dropout, Dense
from keras.models import Sequential

node_num = 256
layers_num = 4
dropout = 0.5

# Create the linear stack of layers model.
model = Sequential()

# Create the input layer.
model.add(Dense(node_num, input_dim=countvect_train.shape[1], activation='relu'))
model.add(Dropout(dropout))

# Create the hidden layers.
for i in range(0, layers_num):
    model.add(Dense(node_num, input_dim=node_num, activation='relu'))
    model.add(Dropout(dropout))

# Create the output layer.
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
            optimizer='adam', metrics=['accuracy'])
			  
model.summary()

Plot the deep learning model.

In [None]:
from keras.utils import plot_model

# Plot the model.
plot_model(model, to_file='./images/model_plot.png', show_shapes=True, show_layer_names=True, dpi=100)

Train the deep learning model and calculate its accuracy.

In [None]:
# Fit the classifier with the train data and validate on the held-out test data.
model.fit(countvect_train, train_class,
          validation_data=(countvect_test, test_class),
          epochs=10, batch_size=128, verbose=2)

In [None]:
# Get the predicted classes.
test_class_pred = model.predict(countvect_test)

# Threshold the predicted probabilities at 0.5 to get either 0 or 1.
test_class_pred = [(1 if i>0.5 else 0) for i in test_class_pred]

# Calculate the accuracy on the test set.
metrics.accuracy_score(test_class, test_class_pred)

### Machine Learning Techniques for Text 
&copy;2021&ndash;2022, Nikos Tsourakis, <nikos@tsourakis.net>, Packt Publications. All Rights Reserved.