# Sentiment analysis on customer reviews

## Introduction

Sentiment analysis is a technique used to determine the emotional tone of a piece of text. By performing sentiment analysis on customer reviews, we can gain insights into how customers feel about a product or service.


Required Libraries
------------------

We will be using the following libraries in our project:

*   `pandas`: Used for data manipulation and analysis
*   `numpy`: Used for numerical operations
*   `nltk`: Used for natural language processing tasks like stemming, lemmatization, and stopword removal
*   `re`: Used for regular expressions
*   `string`: Used for string operations
*   `sklearn`: Used for building machine learning models (installed under the package name `scikit-learn`)

Before we start with the project, let's install the required libraries using the following command:

In [1]:
# !pip install pandas numpy nltk scikit-learn

In [2]:
import pandas as pd
import numpy as np
import nltk
import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

## Dataset

For this project, we will be using the Amazon Fine Food Reviews dataset, which can be downloaded from Kaggle ([Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?select=Reviews.csv)). This dataset consists of reviews of fine foods from Amazon. The data spans from October 1999 to October 2012 and includes over 500,000 reviews.


In [3]:
# Load the full dataset (df is overwritten by the sampled read below)
df = pd.read_csv('Reviews.csv')

### Read a random sample of about 1000 rows
# Set a random seed for reproducibility
import random
random.seed(42)

# Count the lines in the CSV file (this count includes the header line)
with open('Reviews.csv') as f:
    num_rows = sum(1 for line in f)

# Define the number of rows to read
nrows = 1000

# Define the indices of the lines to skip; line 0 (the header) is never skipped.
# Because num_rows includes the header, this leaves nrows - 1 = 999 data rows.
skiprows = sorted(random.sample(range(1, num_rows), num_rows - nrows))

# Read the CSV file
df = pd.read_csv('Reviews.csv', nrows=nrows, skiprows=skiprows)
df.shape

(999, 10)
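
The sampling step above can also be written so that it returns exactly the requested number of data rows. The following is a minimal sketch using only the standard library, not the notebook's own method: the function name `sample_csv_rows` and the file `demo_reviews.csv` are hypothetical and exist only for this illustration, and the demo data is made up.

```python
import csv
import os
import random
import tempfile

def sample_csv_rows(path, n, seed=42):
    """Return the header and exactly n randomly chosen data rows."""
    # First pass: count data rows (the header is read separately)
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader)
        total = sum(1 for _ in reader)

    # Choose which data rows to keep; indices are 0-based and exclude the header
    rng = random.Random(seed)
    keep = set(rng.sample(range(total), min(n, total)))

    # Second pass: stream the file and collect only the chosen rows
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header
        rows = [row for i, row in enumerate(reader) if i in keep]
    return header, rows

# Demonstration on a small, made-up file (hypothetical data)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo_reviews.csv')
    with open(demo_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Score', 'Text'])
        for i in range(50):
            writer.writerow([1 + i % 5, f'review number {i}'])
    header, rows = sample_csv_rows(demo_path, n=10)

print(header, len(rows))
```

Counting data rows separately from the header avoids the off-by-one that produces 999 rows instead of 1000. pandas' `read_csv` also accepts a callable for `skiprows`, so a similar keep-set could be applied there.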

In [4]:
## Remove some random rows if memory error is raised
# rows_to_remove = df.sample(frac=0.9)

# # Use the drop function to remove those rows from the original dataframe
# df = df.drop(rows_to_remove.index)
# df.head()
# df.shape

In [5]:
df.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')


Step 1: Data Cleaning
---------------------

The first step in our project is to clean the data. We will remove any unnecessary columns and rows, and perform text preprocessing on the reviews. Text preprocessing involves converting the text to lowercase, removing special characters, and removing stopwords.


In [6]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/nemsys/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# Drop any rows with missing values
df.dropna(inplace=True)

# Remove any unnecessary columns
df.drop(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Time'], axis=1, inplace=True)

# Convert the text to lowercase
df['Text'] = df['Text'].str.lower()

# Remove any special characters (keep only letters, digits, and whitespace)
df['Text'] = df['Text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

# Remove stopwords; a separate name keeps the imported `stopwords` module usable on re-runs
stop_words = set(stopwords.words('english'))
df['Text'] = df['Text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

df.head()

|   | Score | Summary | Text |
|---|-------|---------|------|
| 0 | 5 | BUY ELSEWHERE | find 6pack 40 dollars less diaperscom reason c... |
| 1 | 5 | Excellent coffee! | could get used nice strong robust coffee witho... |
| 2 | 5 | yumm!! | nice creamy usually dont justify buying kcups ... |
| 3 | 5 | Great product | brand lowest price find organic dog kibbles su... |
| 4 | 1 | deceptive photo | felt ripped photo clearly shows two bones pric... |
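
To see what each preprocessing step does, it can help to trace a single string through the same three operations. This is a minimal sketch, not the notebook's cell: `preprocess` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for NLTK's English list so that the example runs without any downloads.

```python
import re

# A tiny, hypothetical stopword set used only for this illustration;
# the tutorial itself uses NLTK's English stopword list.
DEMO_STOPWORDS = {'i', 'the', 'a', 'it', 'is', 'this', 'and', 'but'}

def preprocess(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip special characters, and drop stopwords."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return ' '.join(w for w in text.split() if w not in stop_words)

print(preprocess("I LOVE this coffee, but it's a bit pricey!"))
# love coffee its bit pricey
```

Removing punctuation before filtering turns `it's` into `its`, which no longer matches the stopword `it`. The same effect is visible in the table above, where tokens such as `dont` and `diaperscom` survive cleaning.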



Step 2: Feature Engineering
---------------------------

The next step is to extract features from the preprocessed text. We will use the Bag of Words model, in which each review is represented as a vector of word counts over the vocabulary of the whole collection. We build these vectors with the [Scikit-learn CountVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [25]:
from sklearn.feature_extraction.text import CountVectorizer


# Create a CountVectorizer object
cv = CountVectorizer()

# Convert the text data to a sparse matrix of word counts
X = cv.fit_transform(df['Text'])
# X.data holds only the stored non-zero counts; the index arrays are not included
print(f'Memory for sparse array: {X.data.nbytes}')

# Convert the sparse array to a dense array if you have enough RAM
X = X.toarray()
print(f'Memory for dense array: {X.data.nbytes}')

# X = cv.fit_transform(df['Text']).toarray()
y = df['Score'].values



Memory for sparse array: 270384
Memory for dense array: 58988952


In [27]:
X.shape

(999, 7381)
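
The Bag of Words idea behind the shape above (one row per review, one column per vocabulary word) can be sketched in plain Python. This is an illustration of the concept under simplifying assumptions, not how `CountVectorizer` is implemented: the helper `bag_of_words` is hypothetical, and it splits on whitespace only.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and one count vector per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [0] * len(vocab)
        for tok, c in counts.items():
            row[index[tok]] = c
        vectors.append(row)
    return vocab, vectors

docs = ['great coffee great price', 'bad coffee']
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['bad', 'coffee', 'great', 'price']
print(vectors)  # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

Word order is discarded and only counts remain. With its default settings, `CountVectorizer` additionally lowercases the text and ignores single-character tokens, so its vocabulary can differ from this sketch.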

### Sparse Array vs Dense Array

A dense array is an array in which most of the elements have a value, and these values are typically non-zero. In other words, a dense array has very few empty or "null" elements, and most of the array cells are occupied with values. For example, an array containing the numbers: [1, 2, 3, 4, 5] would be considered a dense array.

On the other hand, a sparse array is an array in which most of the elements have a value of zero or are empty. In other words, a sparse array has a lot of empty or "null" elements, and very few cells are occupied with values. For example, an array containing the numbers: [1, 0, 0, 4, 0] would be considered a sparse array.

Sparse formats can save memory and processing time on large datasets in which most values are zero, because they store only the non-zero elements and their positions, whereas a dense array stores every element regardless of its value. In the output above, the stored non-zero counts take about 270 KB, while the dense array takes about 59 MB. However, when most elements are non-zero, dense arrays are usually faster to process: their values sit in contiguous memory and can be accessed directly, without the index lookups that a sparse format requires.
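
`CountVectorizer` returns its result as a SciPy sparse matrix in compressed sparse row (CSR) form. The sketch below builds the three CSR arrays by hand in plain Python to show what is actually stored. It is an illustration only, not SciPy's implementation, and the helper names `to_csr` and `csr_row_dot` are hypothetical.

```python
def to_csr(dense):
    """Convert a list of rows into CSR-style arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it came from
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr

def csr_row_dot(data, indices, indptr, row, vec):
    """Dot product of one CSR row with a dense vector, touching only non-zeros."""
    start, end = indptr[row], indptr[row + 1]
    return sum(data[k] * vec[indices[k]] for k in range(start, end))

dense = [
    [1, 0, 0, 4, 0],
    [0, 0, 0, 0, 0],
    [0, 2, 0, 0, 3],
]
data, indices, indptr = to_csr(dense)
print(data)     # [1, 4, 2, 3]
print(indices)  # [0, 3, 1, 4]
print(indptr)   # [0, 2, 2, 4]

weights = [10, 20, 30, 40, 50]
print(csr_row_dot(data, indices, indptr, 0, weights))  # 170
```

Only 4 of the 15 values are stored, at the cost of two extra index arrays. This is why the sparse figure printed earlier, which measures the `data` array alone, understates the full footprint of the sparse matrix somewhat.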

## Step 3: Building Machine Learning Models


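The final step is to build machine learning models that predict the sentiment of each review. Here the review's `Score` (1 to 5) serves as the sentiment label, so this is a five-class classification problem. We will use the following algorithms: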

*   Logistic Regression
*   Naive Bayes
*   Support Vector Machines
*   Random Forest

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [10]:
# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Print the accuracy, confusion matrix, and classification report
print('Logistic Regression Accuracy:', accuracy_score(y_test, y_pred_lr))
print('Logistic Regression Confusion Matrix:', confusion_matrix(y_test, y_pred_lr))
print('Logistic Regression Classification Report:', classification_report(y_test, y_pred_lr))


Logistic Regression Accuracy: 0.6166666666666667
Logistic Regression Confusion Matrix: [[  2   2   1   4  12]
 [  2   2   0   4  11]
 [  1   0   1   3  10]
 [  1   0   1   3  40]
 [  3   5   3  12 177]]
Logistic Regression Classification Report:               precision    recall  f1-score   support

           1       0.22      0.10      0.13        21
           2       0.22      0.11      0.14        19
           3       0.17      0.07      0.10        15
           4       0.12      0.07      0.08        45
           5       0.71      0.89      0.79       200

    accuracy                           0.62       300
   macro avg       0.29      0.24      0.25       300
weighted avg       0.53      0.62      0.56       300



In [11]:
# Naive Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

# Print the accuracy, confusion matrix, and classification report
print(f'Naive Bayes Accuracy:\n{accuracy_score(y_test, y_pred_nb)}\n')
print(f'Naive Bayes Confusion Matrix:\n{confusion_matrix(y_test, y_pred_nb)}\n')
print(f'Naive Bayes Classification Report:\n{classification_report(y_test, y_pred_nb, zero_division=1)}\n')


Naive Bayes Accuracy:
0.6466666666666666

Naive Bayes Confusion Matrix:
[[  2   0   0   4  15]
 [  1   0   0   2  16]
 [  0   0   0   2  13]
 [  0   0   3   1  41]
 [  2   0   1   6 191]]

Naive Bayes Classification Report:
              precision    recall  f1-score   support

           1       0.40      0.10      0.15        21
           2       1.00      0.00      0.00        19
           3       0.00      0.00      0.00        15
           4       0.07      0.02      0.03        45
           5       0.69      0.95      0.80       200

    accuracy                           0.65       300
   macro avg       0.43      0.21      0.20       300
weighted avg       0.56      0.65      0.55       300




In [12]:
# Support Vector Machines

svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)

# Print the accuracy, confusion matrix, and classification report
print(f'Support Vector Machines Accuracy:\n{accuracy_score(y_test, y_pred_svm)}\n')
print(f'Support Vector Machines Confusion Matrix:\n{confusion_matrix(y_test, y_pred_svm)}\n')
print(f'Support Vector Machines Classification Report:\n{classification_report(y_test, y_pred_svm, zero_division=1)}\n')


Support Vector Machines Accuracy:
0.5566666666666666

Support Vector Machines Confusion Matrix:
[[  2   3   0   5  11]
 [  2   3   1   4   9]
 [  2   1   1   3   8]
 [  2   0   2   4  37]
 [  5   6   8  24 157]]

Support Vector Machines Classification Report:
              precision    recall  f1-score   support

           1       0.15      0.10      0.12        21
           2       0.23      0.16      0.19        19
           3       0.08      0.07      0.07        15
           4       0.10      0.09      0.09        45
           5       0.71      0.79      0.74       200

    accuracy                           0.56       300
   macro avg       0.26      0.24      0.24       300
weighted avg       0.52      0.56      0.53       300




In [13]:
# Random Forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)


# Print the accuracy, confusion matrix, and classification report
print(f'Random Forest Accuracy:\n{accuracy_score(y_test, y_pred_rf)}\n')
print(f'Random Forest Confusion Matrix:\n{confusion_matrix(y_test, y_pred_rf)}\n')
print(f'Random Forest Classification Report:\n{classification_report(y_test, y_pred_rf,zero_division=1)}\n')

Random Forest Accuracy:
0.6633333333333333

Random Forest Confusion Matrix:
[[  0   1   0   0  20]
 [  0   0   0   0  19]
 [  0   0   0   0  15]
 [  0   0   0   0  45]
 [  0   0   0   1 199]]

Random Forest Classification Report:
              precision    recall  f1-score   support

           1       1.00      0.00      0.00        21
           2       0.00      0.00      0.00        19
           3       1.00      0.00      0.00        15
           4       0.00      0.00      0.00        45
           5       0.67      0.99      0.80       200

    accuracy                           0.66       300
   macro avg       0.53      0.20      0.16       300
weighted avg       0.57      0.66      0.53       300
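
The per-class numbers in these reports can be recomputed directly from a confusion matrix, which makes it easier to see where they come from. The sketch below uses the Random Forest matrix printed above. The helper `per_class_metrics` is hypothetical, and its `zero_division` argument only mimics the behaviour of the same-named scikit-learn parameter for classes that are never predicted.

```python
# Random Forest confusion matrix copied from the output above
# (rows are true scores 1-5, columns are predicted scores 1-5)
cm = [
    [0, 1, 0, 0, 20],
    [0, 0, 0, 0, 19],
    [0, 0, 0, 0, 15],
    [0, 0, 0, 0, 45],
    [0, 0, 0, 1, 199],
]

def per_class_metrics(cm, zero_division=1.0):
    """Return {label: (precision, recall)} for a square confusion matrix."""
    n = len(cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[r][k] for r in range(n))  # column sum
        actual = sum(cm[k])                          # row sum
        precision = tp / predicted if predicted else zero_division
        recall = tp / actual if actual else zero_division
        metrics[k + 1] = (precision, recall)
    return metrics

total = sum(map(sum, cm))
accuracy = sum(cm[k][k] for k in range(len(cm))) / total
# Accuracy of always predicting the most frequent true class
majority_baseline = max(sum(row) for row in cm) / total

for label, (p, r) in per_class_metrics(cm).items():
    print(f'score {label}: precision={p:.2f} recall={r:.2f}')
print(f'accuracy={accuracy:.3f} majority baseline={majority_baseline:.3f}')
```

For this matrix, scores 1 and 3 are never predicted, so their precision of 1.00 in the report comes from `zero_division=1` rather than from correct predictions. The accuracy (199/300) is also slightly below what always predicting score 5 would give (200/300), which suggests that accuracy alone is a weak guide on data this imbalanced.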


