# Amazon Reviews Text Processing and Modeling

In [1]:
import pandas as pd
import numpy as np
import nltk
import string
from nltk.corpus import stopwords
from nltk import PorterStemmer

In [2]:
reviews_df = pd.read_csv('Data/Reviews with Label.csv')

In [3]:
reviews_df.head()

| Unnamed: 0 | asin | Review ID | Title | Body | Rating | Review Length | Review Label |
|---|---|---|---|---|---|---|---|
| 0 | B08CQ4HXHV | R2Y2A5WJ9Q84I9 | \nI keep stocked up\n | \nThis is my go to worm on the Little Pigeon R... | 5.0 | 250.0 | positive |
| 1 | B08CQ4HXHV | RJHR3X7CVOZE8 | \nIt just works\n | \nOne of my most successful soft plastics, the... | 4.0 | 876.0 | positive |
| 2 | B08CQ4HXHV | R2D40LMXK190YP | \nThese baits catch fish!\n | \nThey catch fish and they’re durable too. I’v... | 5.0 | 132.0 | positive |
| 3 | B08CQ4HXHV | R1KKR6D1SQ3D4D | \nThese things just catch fish.\n | \nDon't have the action of a Yamamoto but stil... | 4.0 | 132.0 | positive |
| 4 | B08CQ4HXHV | R1V6NVM2KWFOZ5 | \nBass love it\n | \nIt’s a hit with the bass, but rips easily. ... | 4.0 | 79.0 | positive |


### Text Normalization

In [4]:
# Normalizes the text of one review and returns its list of tokens
def review_process(review):
    # Drop ASCII punctuation characters; curly quotes such as ’ are not in string.punctuation and are kept
    no_punc = ''.join(char for char in review if char not in string.punctuation)

    # Stem each whitespace-separated word (NLTK's Porter stemmer also lowercases its output)
    ps = PorterStemmer()
    stemmed = [ps.stem(word) for word in no_punc.split()]

    # Load the stopword list once into a set instead of once per word
    stop_words = set(stopwords.words('english'))

    # Remove stopwords and return; filtering runs after stemming,
    # so stemmed forms such as 'thi' and 'wa' are not removed
    return [word for word in stemmed if word.lower() not in stop_words]

Compare the original review to the tokenized list of the review.

In [5]:
reviews_df['Body'][0]

'\nThis is my go to worm on the Little Pigeon River in Tennessee. I saw this on You Tubes Creek Fishing Adventures. When all else fails I turn to this. Many days this was all I needed to haul in large Smallmouth. My current PB was caught on this worm.\n'

In [6]:
reviews_df['Body'].head().apply(review_process)[0]

['thi',
 'go',
 'worm',
 'littl',
 'pigeon',
 'river',
 'tennesse',
 'saw',
 'thi',
 'tube',
 'creek',
 'fish',
 'adventur',
 'els',
 'fail',
 'turn',
 'thi',
 'mani',
 'day',
 'thi',
 'wa',
 'need',
 'haul',
 'larg',
 'smallmouth',
 'current',
 'pb',
 'wa',
 'caught',
 'thi',
 'worm']

In [7]:
reviews_df['Body'].head().apply(review_process)

0    [thi, go, worm, littl, pigeon, river, tennesse...
1    [one, success, soft, plastic, yum, 5inch, stic...
2    [catch, fish, they’r, durabl, i’v, caught, som...
3    [dont, action, yamamoto, still, catch, fish, w...
4    [it’, hit, bass, rip, easili, good, amount, th...
Name: Body, dtype: object
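
The token lists above still contain stemmed stopwords such as 'thi' and 'wa', and contractions written with curly apostrophes such as 'they’r'. One possible variation, sketched below, filters stopwords before stemming and treats curly quotes as punctuation. The names `review_process_v2`, `STOPWORDS`, `PUNCT_TABLE`, and the `stem` parameter are hypothetical and introduced only for illustration. The small stand-in stopword set keeps the sketch runnable without the NLTK corpus and is not the list the notebook uses.

```python
import string

# Stand-in stopword set for illustration only (assumption); the notebook
# uses nltk.corpus.stopwords.words('english') instead.
STOPWORDS = frozenset({
    "a", "all", "but", "i", "in", "is", "it", "its", "my",
    "on", "the", "this", "to", "was", "with",
})

# Translation table that deletes ASCII punctuation plus curly quotes
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "’‘“”")


def review_process_v2(review, stem=lambda word: word):
    """Lowercase, strip punctuation, drop stopwords, then stem what is left."""
    cleaned = review.translate(PUNCT_TABLE).lower()
    return [stem(word) for word in cleaned.split() if word not in STOPWORDS]


print(review_process_v2("\nIt’s a hit with the bass, but rips easily.\n"))
# ['hit', 'bass', 'rips', 'easily']
```

Passing `stem=PorterStemmer().stem` would apply the same stemmer as the notebook. Swapping in a different analyzer changes the vocabulary, so the models would need to be refit before any of the reported scores could be compared.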

### Pipeline: Text Vectorization, TF-IDF and Modeling

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,classification_report

In [9]:
reviews_df['Body'].isnull().sum()

18

In [10]:
# Dropping rows with null values for the review body
reviews_df = reviews_df.dropna(subset=['Body'])

In [11]:
X = reviews_df['Body']
y = reviews_df['Review Label']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=55)

In [13]:
np.unique(y_train, return_counts=True)

(array(['negative', 'positive'], dtype=object),
 array([ 289, 1491], dtype=int64))

#### Support Vector Machine

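Support Vector Machine is a binary classifier that separates points with a hyperplane chosen to maximize the margin between the two classes. The dataset has two classes, positive and negative, but each review is represented by a TF-IDF vector with one dimension per vocabulary term, so the separating boundary lives in a high-dimensional space rather than being a line in a two-dimensional plane. `SVC` uses an RBF kernel by default, so the boundary is linear only in the kernel-induced feature space. The parameter 'class_weight' is set to 'balanced' to weight each class inversely to its frequency, which puts more emphasis on observations in the minority negative class.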
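
To make the effect of `class_weight='balanced'` concrete, the sketch below applies the heuristic documented by scikit-learn, `n_samples / (n_classes * count_of_class)`, to the training class counts printed above. The variable names are illustrative only.

```python
# Class counts taken from the np.unique output on y_train
class_counts = {'negative': 289, 'positive': 1491}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: weight = n_samples / (n_classes * count_of_class)
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"{label}: {weight:.3f}")
```

With these counts, an error on a negative review is weighted about 3.08 and an error on a positive review about 0.60, so the minority class counts roughly five times as much per observation.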

In [19]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),  # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', SVC(class_weight='balanced',random_state=55)),  # create model on TF-IDF vectors with Support Vector Machine
])
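
As a rough illustration of what the 'tfidf' step computes, the sketch below reproduces the weighting `TfidfTransformer` applies with its default settings (`smooth_idf=True`, `norm='l2'`) in plain Python. The three token lists are made-up examples, not rows from the dataset, and the helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter

# Toy token lists standing in for the output of review_process (assumption)
docs = [
    ["catch", "fish", "worm"],
    ["catch", "bass"],
    ["worm", "worm", "rip"],
]

vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(token for doc in docs for token in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf(doc):
    """Return the L2-normalized TF-IDF vector of one document over vocab."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


for doc in docs:
    print([round(value, 3) for value in tfidf(doc)])
```

Terms that appear in fewer documents receive a larger IDF, and the L2 normalization keeps long and short reviews on a comparable scale.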

In [20]:
pipeline.fit(X_train,y_train)

In [21]:
predictions_SVC = pipeline.predict(X_test)

In [22]:
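# Predictions are passed first (scikit-learn's order is y_true, y_pred), so rows are predicted labels and columns are true labels
print(confusion_matrix(predictions_SVC,y_test))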

[[ 48  24]
 [ 78 614]]


In [23]:
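# Predictions are passed first (scikit-learn's order is y_true, y_pred), so the precision and recall columns are swapped and support counts predicted labels
print(classification_report(predictions_SVC,y_test))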

              precision    recall  f1-score   support

    negative       0.38      0.67      0.48        72
    positive       0.96      0.89      0.92       692

    accuracy                           0.87       764
   macro avg       0.67      0.78      0.70       764
weighted avg       0.91      0.87      0.88       764



#### Logistic Regression

Logistic Regression is a binary classifier that estimates the probability that a point belongs to a class and then assigns the point to the positive class when that probability exceeds a cutoff of 0.5. The parameter 'class_weight' is set to 'balanced' to weight each class inversely to its frequency, which puts more emphasis on observations in the minority negative class.
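
The probability cutoff can be sketched in a few lines. The example below assumes a linear score `w·x + b` has already been computed for a review. The function names and the example scores are hypothetical, and sending a score of exactly zero to the negative class is a choice made for this sketch.

```python
import math


def sigmoid(score):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_label(score, threshold=0.5):
    """Assign 'positive' when the estimated probability exceeds the cutoff."""
    return 'positive' if sigmoid(score) > threshold else 'negative'


# Hypothetical linear scores for three reviews
for score in (-1.2, 0.3, 2.5):
    print(score, round(sigmoid(score), 3), predict_label(score))
```

Because the sigmoid equals 0.5 exactly when the score is zero, a cutoff of 0.5 on the probability is equivalent to checking the sign of the linear score.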

In [30]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),   # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', LogisticRegression(class_weight='balanced',random_state=55)),  # create model on TF-IDF vectors with Logistic Regression
])

In [31]:
pipeline.fit(X_train,y_train)

In [32]:
predictions_log = pipeline.predict(X_test)

In [33]:
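# Predictions are passed first (scikit-learn's order is y_true, y_pred), so rows are predicted labels and columns are true labels
print(confusion_matrix(predictions_log,y_test))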

[[ 86  96]
 [ 40 542]]


In [34]:
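# Predictions are passed first (scikit-learn's order is y_true, y_pred), so the precision and recall columns are swapped and support counts predicted labels
print(classification_report(predictions_log,y_test))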

              precision    recall  f1-score   support

    negative       0.68      0.47      0.56       182
    positive       0.85      0.93      0.89       582

    accuracy                           0.82       764
   macro avg       0.77      0.70      0.72       764
weighted avg       0.81      0.82      0.81       764



### Model Evaluation

The Logistic Regression and SVM with Cross-Validation (SVM with CV) models performed best at classifying positive and negative reviews. Logistic Regression had the lowest weighted average F1 score, but its report shows the highest precision for negative reviews, 0.68, alongside a lower recall of 0.47. SVM with CV was more consistent, with precision, recall, and F1 score all around 0.55 for negative reviews. While SVM performed better at detecting negative reviews with fewer false negatives, Logistic Regression was more precise. Overall, Logistic Regression was preferred for its higher reported precision on negative reviews, and both models outperformed Naive Bayes and the other SVM models.

One caveat applies to these figures. The reports above pass the predictions as the first argument, whereas scikit-learn expects `(y_true, y_pred)`, so the columns labeled precision and recall are interchanged. Read conventionally, Logistic Regression recovers 68% of the truly negative test reviews (86 of 126), while 47% of the reviews it flags as negative are truly negative (86 of 182).
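
The negative-class figures can be rechecked directly from the two confusion matrices printed earlier. The sketch below assumes the layout those calls produce: because the predictions were passed as the first argument, rows correspond to predicted labels and columns to true labels. The helper name `negative_class_metrics` is hypothetical.

```python
def negative_class_metrics(matrix):
    """Conventional precision, recall and F1 for the negative class.

    Assumes rows are predicted labels and columns are true labels,
    with 'negative' at index 0.
    """
    true_negative_hits = matrix[0][0]
    predicted_negative = sum(matrix[0])           # everything flagged negative
    actual_negative = matrix[0][0] + matrix[1][0]  # everything truly negative
    precision = true_negative_hits / predicted_negative
    recall = true_negative_hits / actual_negative
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Matrices copied from the printed outputs above
matrices = {
    'SVC': [[48, 24], [78, 614]],
    'LogisticRegression': [[86, 96], [40, 542]],
}

for name, matrix in matrices.items():
    precision, recall, f1 = negative_class_metrics(matrix)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The F1 scores match the printed reports (0.48 and 0.56) because F1 is symmetric in precision and recall. Only the assignment of the two underlying values to the precision and recall columns depends on the argument order.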