# TripAdvisor Hotel Reviews
Done by: Naive Boyes

## 1. Project Information
### Dataset
Our dataset is the TripAdvisor Hotel Reviews dataset from Kaggle, which can be obtained [here](https://www.kaggle.com/datasets/andrewmvd/trip-advisor-hotel-reviews). It consists of 20,491 reviews scraped from TripAdvisor, each with a rating from 1 to 5.

The data description for our dataset is as follows:
|Column|Description                                           |
|------|------------------------------------------------------|
|Review|Review text from a particular TripAdvisor hotel review|
|Rating|Rating (1 to 5) from the same TripAdvisor hotel review|

### Task
We aim to perform sentiment analysis on the text portion of the TripAdvisor hotel reviews and predict the corresponding rating from 1 to 5.

## 2. Pre-requisites
### Installation of Libraries
Our code requires the following libraries, which can be installed with the command below (packages that are already installed are skipped):

In [1]:
%pip install pandas scikit-learn numpy nltk

Note: you may need to restart the kernel to use updated packages.


## 3. Exploratory Data Analysis and Pre-processing
### Understanding the Dataset
We first import the dataset to view its structure and basic info:

In [2]:
import pandas as pd

In [3]:
reviews_df = pd.read_csv("tripadvisor_hotel_reviews.csv")
reviews_df.head()

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5


In [4]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20491 entries, 0 to 20490
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  20491 non-null  object
 1   Rating  20491 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 320.3+ KB


We take a look at the raw review text data below:

In [5]:
for index, row in list(reviews_df.sample(n=10).iterrows()):
    print(row['Review'], end='\n\n')

held business meeting excellent service recently held business meeting hotel, 30 people u.s. 2 nights.the banquet meeting room service outstanding.i suites smaller, great restaruant bar pool 8th floor overlooking marina.walking location coconut grove.a strong recommdation need meeting space place stay miami area, trendy not,  

nice hotel nice clean small hotel close venice pier quieter residential south end venice beach, staff helpful room clean nicely decorated, complimentary shampoo conditioner lotion shower cap hair dryer room, 2nd floor street minimal problems construction site located washington blvd, noise mexican restaurant located sanborn avenue tolerable plus bring ear plugs traveling helps noise breakfast included room rate wonderful includes hard boiled eggs fresh fruit, small refrigerators rooms food stored eat time, restaurants area including italian restaurant delivers particularly near beach venice pier, hotel parking lot easy access reasonably priced.there inexpensive 

### Data Cleaning
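We clean our data by performing the following:
1. Splitting words conjoined by an asterisk `*`, a period `.`, a slash `/`, or an apostrophe `'`
2. Removing the negations `n't` and `not`
3. Removing trailing commas `,` and dashes `-`
4. Removing tokens that consist only of digits
5. Lemmatization (chosen over stemming, as lemmatization scored about one percentage point higher in test accuracy in our SVM experiments in Section 4)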

In [6]:
import nltk

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/nicholasygd/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
import re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def clean_review_text(row):
    conjoined_regex = r'(\w{2,}[\.\*\'/]\w{2,})\b'
    trailing_dash_comma_regex = r'(\w+[-,])\s'
    review = row['Review']

    # 1. Split conjoined words
    # (re.search returns only the first match, so only that token and its exact repeats are split)
    conjoined_match = re.search(conjoined_regex, review)
    if conjoined_match:
        for word in conjoined_match.groups():
            review = review.replace(word, word.replace('.', ' '))
            review = review.replace(word, word.replace('*', ' '))
            review = review.replace(word, word.replace('/', ' '))
            review = review.replace(word, word.replace('\'', ' '))

    # 2. Remove n't and not
    # (plain substring replacement, so "not" inside longer words such as "nothing" is removed too)
    review = review.replace('n\'t', '').replace('not', '')

    # 3. Remove trailing dashes, commas
    # (again, only the first matching token and its exact repeats are stripped)
    trailing_dash_comma_match = re.search(trailing_dash_comma_regex, review)
    if trailing_dash_comma_match:
        for word in trailing_dash_comma_match.groups():
            review = review.replace(word, word.rstrip('-,'))

    # 4. Remove numbers (tokens made up only of digits)
    review = ' '.join([word for word in review.split() if not word.isdigit()])

    # 5a. Lemmatization using WordNetLemmatizer (every token is lemmatized as a verb)
    review = ' '.join([lemmatizer.lemmatize(w, 'v') for w in review.split()])

    # 5b. Stemming (disabled; kept for comparison with lemmatization)
    # review = ' '.join([stemmer.stem(w) for w in review.split()])

    return review

    
reviews_df["Review"] = reviews_df.apply(clean_review_text, axis=1)
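Because `re.search` returns only the first match, the function above cleans at most one conjoined token and one trailing-punctuation token per review (plus exact repeats of those tokens). This is consistent with entries such as `hotel,` still appearing in the word counts further down. A minimal sketch of a variant that applies each rule to every occurrence with `re.sub` follows. The helper name `clean_text_all_matches` and the pattern names are hypothetical, lemmatization is left out to keep the sketch dependency-free, and switching to it would change the scores reported in this notebook.

```python
import re

# Hypothetical names; an illustrative variant, not the notebook's original function.
CONJOINED = re.compile(r"(?<=\w{2})[.*'/](?=\w{2})")   # separator between two words of 2+ characters
CONTRACTION = re.compile(r"n't\b")                      # "wasn't" -> "was"
NEGATION = re.compile(r"\bnot\b")                       # whole word "not" only, so "nothing" is kept
TRAILING = re.compile(r"(\w)[-,]+(?=\s|$)")             # trailing commas/dashes after a word

def clean_text_all_matches(review: str) -> str:
    review = CONJOINED.sub(" ", review)
    review = CONTRACTION.sub("", review)
    review = NEGATION.sub("", review)
    review = TRAILING.sub(r"\1", review)
    # Drop tokens made only of digits; split()/join also collapses repeated spaces
    return " ".join(tok for tok in review.split() if not tok.isdigit())

sample = "nights.the room wasn't bad, nothing special, not 4* hotel 30 people rooms/beds"
cleaned = clean_text_all_matches(sample)
print(cleaned)
```

Unlike the original substring replacement, this sketch removes `not` only as a whole word, which is a deliberate difference rather than a reproduction of the notebook's behaviour.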

### Mapping of Review Rating to Sentiment

To aid in our evaluation of the predicted ratings, we map the rating of each review to a sentiment. We assume the following:
- Ratings of 1-2 have a `Negative` sentiment
- Ratings of 3 have a `Neutral` sentiment
- Ratings of 4-5 have a `Positive` sentiment

In [8]:
def map_rating_to_sentiment(row):
    rating = row['Rating']
    if rating in [1,2]:
        return "Negative"
    elif rating in [3]:
        return "Neutral"
    elif rating in [4,5]:
        return "Positive"
    return "Unknown"
def map_rating_to_sentiment_score(row):
    rating = row['Rating']
    if rating in [1,2]:
        return 1
    elif rating in [3]:
        return 2
    elif rating in [4,5]:
        return 3
    return -1
reviews_df['Sentiment'] = reviews_df.apply(map_rating_to_sentiment, axis=1)
reviews_df['Sentiment Score'] = reviews_df.apply(map_rating_to_sentiment_score, axis=1)
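The two mapping functions above encode the same rule twice. As a sketch, the rule could be kept in a single lookup table; the names `RATING_TO_SENTIMENT` and `sentiment_for` are hypothetical and not part of the notebook.

```python
# Hypothetical lookup table: rating -> (sentiment label, sentiment score)
RATING_TO_SENTIMENT = {
    1: ("Negative", 1),
    2: ("Negative", 1),
    3: ("Neutral", 2),
    4: ("Positive", 3),
    5: ("Positive", 3),
}

def sentiment_for(rating):
    """Return (label, score) for a rating, or ("Unknown", -1) if it is outside 1-5."""
    return RATING_TO_SENTIMENT.get(rating, ("Unknown", -1))

labels_and_scores = [sentiment_for(r) for r in (1, 3, 5, 7)]
print(labels_and_scores)
```

With pandas, the same table could be applied to the `Rating` column through `Series.map` instead of a row-wise `apply`.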

### Splitting Data
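We split the data into training (80%), test (10%) and validation (10%) sets so that each model is evaluated on reviews it has not seen during training, which lets us detect overfitting.

These sets are shared across all classifiers to ensure a fair comparison. The split itself is performed after vectorization, in the cell that calls `np.split`; here we only import NumPy.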

In [9]:
import numpy as np

### Word Count
We take a look at the word occurrences in the reviews:

In [10]:
from collections import Counter

In [11]:
word_counter = Counter()
for index, row in reviews_df.iterrows():
    word_counter.update(row['Review'].split())

In [12]:
word_counter.most_common()

[('hotel', 44383),
 ('room', 42192),
 ('stay', 24649),
 ('great', 19134),
 ('do', 15742),
 ('staff', 15406),
 ('good', 14974),
 ('just', 12476),
 ('no', 11404),
 ('nice', 11069),
 ('time', 10499),
 ('location', 10134),
 ('walk', 9502),
 ('service', 9412),
 ('clean', 9023),
 ('breakfast', 8941),
 ('like', 8790),
 ('beach', 8776),
 ('place', 8411),
 ('get', 8309),
 ('go', 8252),
 ('food', 8131),
 ('night', 7921),
 ('really', 7627),
 ('day', 7563),
 ('pool', 7432),
 ('resort', 7401),
 ('bed', 6826),
 ('say', 6457),
 ('book', 6270),
 ('people', 6193),
 ('small', 6162),
 ('little', 6158),
 ('want', 6106),
 ('friendly', 5751),
 ('bar', 5734),
 ('look', 5537),
 ('make', 5534),
 ('view', 5234),
 ('recommend', 5123),
 ('best', 5102),
 ('excellent', 5101),
 ('lot', 4878),
 ('think', 4877),
 ('take', 4870),
 ('area', 4854),
 ('hotel,', 4738),
 ('use', 4690),
 ('trip', 4649),
 ('restaurant', 4635),
 ('come', 4631),
 ('price', 4495),
 ('check', 4488),
 ('water', 4466),
 ('floor', 4453),
 ('need', 4

### Vectorizing our Review Text

We also vectorize our `Review` text with TF-IDF, converting each review into a numeric feature vector that our classifier models can work with.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
vectorizer = TfidfVectorizer(
    min_df = 5,          # Minimum document frequency (i.e. ignore terms that appear in fewer than 5 reviews)
    max_df = 0.8,        # Maximum document frequency (i.e. ignore terms that appear in more than 80% of reviews)
    sublinear_tf = True, # Apply sublinear term frequency scaling (i.e. replace tf with 1 + log(tf))
    ngram_range=(1,3)    # Use unigrams, bigrams and trigrams as features
).fit(reviews_df['Review'])  # Note: fitted on all reviews, including the test and validation portions
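To make the `min_df` and `max_df` settings concrete, here is a hand-rolled sketch of document-frequency filtering on a toy corpus. It only imitates the filtering rule (keep a term if it appears in at least `min_df` documents and in at most `max_df` times the number of documents). The corpus, thresholds and the function name `filter_by_document_frequency` are made up for illustration, and the sketch does not compute TF-IDF weights or n-grams.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df * len(docs)]."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))   # count each term once per document
    max_count = max_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if min_df <= df <= max_count)

toy_docs = ["good hotel", "good room", "good staff hotel"]
kept = filter_by_document_frequency(toy_docs, min_df=2, max_df=0.8)
print(kept)   # "good" is in every document (too common); "room" and "staff" are in only one (too rare)
```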

In [15]:
no_of_reviews = len(reviews_df)
sections = [int(0.8 * no_of_reviews), int(0.9 * no_of_reviews)]

reviews_train, reviews_test, reviews_val = np.split(
    ary = reviews_df["Review"],             # Array to split (i.e. the review text column)
    indices_or_sections = sections          # Split points (i.e. split at the 80% and 90% marks)
)
X_train, X_test, X_val = (
    vectorizer.transform(reviews_train),
    vectorizer.transform(reviews_test),
    vectorizer.transform(reviews_val),
)
y_rating_train, y_rating_test, y_rating_val = np.split(
    ary = reviews_df["Rating"],             # Array to split (i.e. the 1-5 rating column)
    indices_or_sections = sections          # Split points (i.e. split at the 80% and 90% marks)
)
y_sentiment_train, y_sentiment_test, y_sentiment_val = np.split(
    ary = reviews_df["Sentiment Score"],    # Array to split (i.e. the 1-3 sentiment score column)
    indices_or_sections = sections          # Split points (i.e. split at the 80% and 90% marks)
)
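As a quick check of the split arithmetic, the sketch below recomputes the section boundaries for the 20,491 reviews reported by `reviews_df.info()` and derives the size of each part. The variable names `bounds`, `train_size`, `test_size` and `val_size` are illustrative.

```python
no_of_reviews = 20491                       # row count reported by reviews_df.info()
sections = [int(0.8 * no_of_reviews), int(0.9 * no_of_reviews)]

# Add the start and end positions, then take differences between consecutive boundaries
bounds = [0, *sections, no_of_reviews]
train_size, test_size, val_size = (end - start for start, end in zip(bounds, bounds[1:]))
print(sections, train_size, test_size, val_size)
```

The resulting test and validation sizes (2049 and 2050) match the `support` totals in the classification reports. Because the split is positional, the rows are not shuffled before being divided.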

## 4. Training and Testing of Models
### Choosing which models to train

In [16]:
train_svm_models = True

### Choosing which output to produce

In [17]:
svm_use_rating = True      # If True, trains a model to generate ratings  (1 - 5)
svm_use_sentiment = True   # If True, trains a model to predict sentiment ("Positive", "Neutral", "Negative")

### Sentiment Analysis using Decision Tree

### Sentiment Analysis using SVM (Support Vector Machine)
Here, we train an SVM model on our data:

In [18]:
from sklearn import svm
from sklearn.metrics import classification_report, accuracy_score, f1_score

We tested the following SVM classifier kernels:
- Poly (Very slow, performs badly on this dataset compared to the rest)
- RBF (Slow training, requires tuning of gamma value, and does not perform better than linear or sigmoid kernels even after tuning)
- Linear (Fast training, good performance)
- Sigmoid (Fast training, comparable performance to linear)

Effect of each pre-processing combination on the linear and sigmoid kernels:
- a) Splitting conjoined words, b) Removing trailing dashes/commas, and Stemming
```
Evaluation of SVM Linear Model on Test Data
- Accuracy (SVM Linear Kernel): 64.91
- F1 (SVM Linear Kernel): 64.21
- Training Time: 235.68s
- Test Prediction Time: 22.05s

Evaluation of SVM Sigmoid Model on Test Data
- Accuracy (SVM Sigmoid Kernel): 64.76
- F1 (SVM Sigmoid Kernel): 63.95
- Training Time: 200.86s
- Test Prediction Time: 21.50s
```
- a) Splitting conjoined words, b) Removing trailing dashes/commas, and Lemmatization
```
Evaluation of SVM Linear Model on Test Data
- Accuracy (SVM Linear Kernel): 65.79
- F1 (SVM Linear Kernel): 65.08
- Training Time: 255.73s
- Test Prediction Time: 23.57s

Evaluation of SVM Sigmoid Model on Test Data
- Accuracy (SVM Sigmoid Kernel): 66.03
- F1 (SVM Sigmoid Kernel): 65.29
- Training Time: 213.94s
- Test Prediction Time: 23.09s
```
- a) Splitting conjoined words and Lemmatization
```
Evaluation of SVM Linear Model on Test Data
- Accuracy (SVM Linear Kernel): 66.47
- F1 (SVM Linear Kernel): 65.81
- Training Time: 246.53s
- Test Prediction Time: 23.52s

Evaluation of SVM Sigmoid Model on Test Data
- Accuracy (SVM Sigmoid Kernel): 65.98
- F1 (SVM Sigmoid Kernel): 65.19
- Training Time: 214.71s
- Test Prediction Time: 22.73s
```
- b) Removing trailing dashes/commas, and Lemmatization
```
Evaluation of SVM Linear Model on Test Data
- Accuracy (SVM Linear Kernel): 65.84
- F1 (SVM Linear Kernel): 65.22
- Training Time: 228.56s
- Test Prediction Time: 21.98s

Evaluation of SVM Sigmoid Model on Test Data
- Accuracy (SVM Sigmoid Kernel): 66.03
- F1 (SVM Sigmoid Kernel): 65.24
- Training Time: 206.13s
- Test Prediction Time: 22.08s
```
- Lemmatization
```
Evaluation of SVM Linear Model on Test Data
- Accuracy (SVM Linear Kernel): 66.33
- F1 (SVM Linear Kernel): 65.72
- Training Time: 248.39s
- Test Prediction Time: 23.95s

Evaluation of SVM Sigmoid Model on Test Data
- Accuracy (SVM Sigmoid Kernel): 66.47
- F1 (SVM Sigmoid Kernel): 65.69
- Training Time: 269.59s
- Test Prediction Time: 40.38s
```
- b) Removing trailing dashes/commas, c) Removing numbers and Lemmatization
```
Evaluation of SVM Linear Model on Test Data
- Accuracy (SVM Linear Kernel): 66.13
- F1 (SVM Linear Kernel): 65.47
- Training Time: 231.42s
- Test Prediction Time: 21.89s

Evaluation of SVM Sigmoid Model on Test Data
- Accuracy (SVM Sigmoid Kernel): 65.98
- F1 (SVM Sigmoid Kernel): 65.26
- Training Time: 195.90s
- Test Prediction Time: 21.28s
```
- a) Splitting conjoined words, b) Removing trailing dashes/commas, c) Removing numbers and Lemmatization
```
Evaluation of SVM Linear Model on Test Data
- Accuracy (SVM Linear Kernel): 66.23
- F1 (SVM Linear Kernel): 65.55
- Training Time: 228.05s
- Test Prediction Time: 22.35s

Evaluation of SVM Sigmoid Model on Test Data
- Accuracy (SVM Sigmoid Kernel): 66.28
- F1 (SVM Sigmoid Kernel): 65.51
- Training Time: 205.24s
- Test Prediction Time: 22.25s
```
- a) Splitting conjoined words, c) Removing numbers and Lemmatization
```
Evaluation of SVM Linear Model on Test Data
- Accuracy (SVM Linear Kernel): 66.72
- F1 (SVM Linear Kernel): 66.00
- Training Time: 267.76s
- Test Prediction Time: 30.02s

Evaluation of SVM Sigmoid Model on Test Data
- Accuracy (SVM Sigmoid Kernel): 66.28
- F1 (SVM Sigmoid Kernel): 65.48
- Training Time: 256.45s
- Test Prediction Time: 29.35s
```
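To compare the runs above at a glance, the sketch below copies the reported test accuracies into a dictionary and picks the best pre-processing combination per kernel. The numbers are taken verbatim from the logs above; the dictionary layout and variable names are just one possible arrangement.

```python
# pre-processing combination -> (linear accuracy, sigmoid accuracy), copied from the logs above
# a = splitting conjoined words, b = removing trailing dashes/commas, c = removing numbers
accuracy_by_config = {
    "a + b + stemming":          (64.91, 64.76),
    "a + b + lemmatization":     (65.79, 66.03),
    "a + lemmatization":         (66.47, 65.98),
    "b + lemmatization":         (65.84, 66.03),
    "lemmatization":             (66.33, 66.47),
    "b + c + lemmatization":     (66.13, 65.98),
    "a + b + c + lemmatization": (66.23, 66.28),
    "a + c + lemmatization":     (66.72, 66.28),
}

best_linear = max(accuracy_by_config, key=lambda cfg: accuracy_by_config[cfg][0])
best_sigmoid = max(accuracy_by_config, key=lambda cfg: accuracy_by_config[cfg][1])

# Spread between the best and worst linear-kernel run that uses lemmatization
lemma_linear = [acc[0] for cfg, acc in accuracy_by_config.items() if "lemmatization" in cfg]
spread = round(max(lemma_linear) - min(lemma_linear), 2)
print(best_linear, "|", best_sigmoid, "|", spread)
```

The lemmatization runs differ by less than one percentage point on the linear kernel, so the ranking between them may not be meaningful without repeated runs.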

In [19]:
import time

In [20]:
if train_svm_models:
    kernels = ['linear', 'sigmoid']
    rating_models = dict()
    sentiment_models = dict()

    if svm_use_rating:
        for kernel in kernels:
            print(f'Training SVM {kernel.title()} Model (Rating)...')
            
            svm_kernel_train_start = time.perf_counter()
            svm_kernel_model = svm.SVC(kernel=kernel).fit(X_train, y_rating_train)
            svm_kernel_train_end = time.perf_counter()
            print(f'- Training Time: {svm_kernel_train_end - svm_kernel_train_start:.2f}s\n')
            
            print(f'\nTesting SVM {kernel.title()} Model (Rating) on Test Data...')
            svm_kernel_test_predictions = svm_kernel_model.predict(X_test)
            svm_kernel_test_end = time.perf_counter()
            svm_kernel_test_accuracy = accuracy_score(y_rating_test, svm_kernel_test_predictions)
            svm_kernel_test_f1 = f1_score(y_rating_test, svm_kernel_test_predictions, average='weighted', zero_division=0)
            print(f'Performance:')
            print(f'- Accuracy (SVM {kernel.title()} Kernel): {svm_kernel_test_accuracy*100:.2f}')
            print(f'- F1 (SVM {kernel.title()} Kernel): {svm_kernel_test_f1*100:.2f}')
            print(f'- Test Prediction Time: {svm_kernel_test_end - svm_kernel_train_end:.2f}s')
            print(f'- Classification Report:')
            print(classification_report(y_rating_test, svm_kernel_test_predictions, zero_division=0))
            print()
            rating_models[kernel] = svm_kernel_model
            
    if svm_use_sentiment:
        for kernel in kernels:
            print(f'Training SVM {kernel.title()} Model (Sentiment)...')
            svm_kernel_train_start = time.perf_counter()
            svm_kernel_model = svm.SVC(kernel=kernel).fit(X_train, y_sentiment_train)
            svm_kernel_train_end = time.perf_counter()
            print(f'- Training Time: {svm_kernel_train_end - svm_kernel_train_start:.2f}s\n')
            
            print(f'\nTesting SVM {kernel.title()} Model (Sentiment) on Test Data...')
            svm_kernel_test_predictions = svm_kernel_model.predict(X_test)
            svm_kernel_test_end = time.perf_counter()
            svm_kernel_test_accuracy = accuracy_score(y_sentiment_test, svm_kernel_test_predictions)
            svm_kernel_test_f1 = f1_score(y_sentiment_test, svm_kernel_test_predictions, average='weighted', zero_division=0)
            print(f'Performance:')
            print(f'- Accuracy (SVM {kernel.title()} Kernel): {svm_kernel_test_accuracy*100:.2f}')
            print(f'- F1 (SVM {kernel.title()} Kernel): {svm_kernel_test_f1*100:.2f}')
            print(f'- Test Prediction Time: {svm_kernel_test_end - svm_kernel_train_end:.2f}s')
            print(f'- Classification Report:')
            print(classification_report(y_sentiment_test, svm_kernel_test_predictions, zero_division=0))
            print()
            sentiment_models[kernel] = svm_kernel_model

Training SVM Linear Model (Rating)...
- Training Time: 297.56s


Testing SVM Linear Model (Rating) on Test Data...
Performance:
- Accuracy (SVM Linear Kernel): 65.98
- F1 (SVM Linear Kernel): 65.16
- Test Prediction Time: 24.28s
- Classification Report:
              precision    recall  f1-score   support

           1       0.69      0.62      0.65       103
           2       0.47      0.48      0.48       143
           3       0.54      0.26      0.35       207
           4       0.52      0.57      0.54       569
           5       0.77      0.82      0.80      1027

    accuracy                           0.66      2049
   macro avg       0.60      0.55      0.56      2049
weighted avg       0.65      0.66      0.65      2049


Training SVM Sigmoid Model (Rating)...
- Training Time: 228.51s


Testing SVM Sigmoid Model (Rating) on Test Data...
Performance:
- Accuracy (SVM Sigmoid Kernel): 65.54
- F1 (SVM Sigmoid Kernel): 64.61
- Test Prediction Time: 23.78s
- Classification Report
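The `average='weighted'` F1 used above is the support-weighted mean of the per-class F1 scores. As a rough check, the sketch below recomputes it from the rounded per-class values in the linear-kernel classification report on the test data. Because those inputs are rounded to two decimals, the result only approximates the reported 65.16.

```python
# (f1-score, support) per rating class, copied from the linear-kernel report on the test data
per_class = {1: (0.65, 103), 2: (0.48, 143), 3: (0.35, 207), 4: (0.54, 569), 5: (0.80, 1027)}

total_support = sum(support for _, support in per_class.values())
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)
print(f"weighted F1 = {weighted_f1:.3f}, macro F1 = {macro_f1:.3f}")
```

The gap between the two averages reflects how strongly the 5-star class, which makes up about half of the test set, dominates the weighted score.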

## 4. Evaluation of Models on Validation Data
### SVM Model

In [21]:
if train_svm_models:
    if svm_use_rating:
        for kernel, model in rating_models.items():
            print(f'\nTesting SVM {kernel.title()} Model (Rating) on Validation Data...')
            svm_kernel_val_start = time.perf_counter()
            svm_kernel_val_predictions = model.predict(X_val)
            svm_kernel_val_end = time.perf_counter()
            svm_kernel_val_accuracy = accuracy_score(y_rating_val, svm_kernel_val_predictions)
            svm_kernel_val_f1 = f1_score(y_rating_val, svm_kernel_val_predictions, average='weighted', zero_division=0)
            print(f'Performance:')
            print(f'- Accuracy (SVM {kernel.title()} Kernel): {svm_kernel_val_accuracy*100:.2f}')
            print(f'- F1 (SVM {kernel.title()} Kernel): {svm_kernel_val_f1*100:.2f}')
            print(f'- Validation Prediction Time: {svm_kernel_val_end - svm_kernel_val_start:.2f}s')
            print(f'- Classification Report:')
            print(classification_report(y_rating_val, svm_kernel_val_predictions, zero_division=0))
            print()
    
    if svm_use_sentiment:
        for kernel, model in sentiment_models.items():
            print(f'\nTesting SVM {kernel.title()} Model (Sentiment) on Validation Data...')
            svm_kernel_val_start = time.perf_counter()
            svm_kernel_val_predictions = model.predict(X_val)
            svm_kernel_val_end = time.perf_counter()
            svm_kernel_val_accuracy = accuracy_score(y_sentiment_val, svm_kernel_val_predictions)
            svm_kernel_val_f1 = f1_score(y_sentiment_val, svm_kernel_val_predictions, average='weighted', zero_division=0)
            print(f'Performance:')
            print(f'- Accuracy: {svm_kernel_val_accuracy*100:.2f}')
            print(f'- F1: {svm_kernel_val_f1*100:.2f}')
            print(f'- Validation Prediction Time: {svm_kernel_val_end - svm_kernel_val_start:.2f}s')
            print(f'- Classification Report:')
            print(classification_report(y_sentiment_val, svm_kernel_val_predictions, zero_division=0))
            print()


Testing SVM Linear Model (Rating) on Validation Data...
Performance:
- Accuracy (SVM Linear Kernel): 63.56
- F1 (SVM Linear Kernel): 62.44
- Validation Prediction Time: 24.33s
- Classification Report:
              precision    recall  f1-score   support

           1       0.70      0.52      0.60       114
           2       0.43      0.50      0.46       166
           3       0.46      0.19      0.27       204
           4       0.50      0.54      0.52       585
           5       0.76      0.82      0.79       981

    accuracy                           0.64      2050
   macro avg       0.57      0.51      0.53      2050
weighted avg       0.63      0.64      0.62      2050



Testing SVM Sigmoid Model (Rating) on Validation Data...
Performance:
- Accuracy (SVM Sigmoid Kernel): 63.95
- F1 (SVM Sigmoid Kernel): 62.75
- Validation Prediction Time: 23.84s
- Classification Report:
              precision    recall  f1-score   support

           1       0.71      0.52      0.60     

## 5. Notification when done training!

In [22]:
from IPython.display import Audio
sound_file = 'notification2.mp3'
display(Audio(sound_file, autoplay=True))

## 6. Evaluation on other datasets
Below, we test our trained SVM models on the following datasets to see how well they generalize to reviews from other domains:
- [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews)
- [Google Play Store Reviews](https://www.kaggle.com/datasets/prakharrathi25/google-play-store-reviews)
- [Women's E-Commerce Clothing Reviews](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews)
- [Amazon Reviews of Unlocked Mobile Phones](https://www.kaggle.com/datasets/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones)

In [23]:
fine_food_df = pd.read_csv("./amazon_fine_food_reviews.csv")
play_store_df = pd.read_csv("./google_play_store_reviews.csv")
clothing_df = pd.read_csv("./Womens Clothing E-Commerce Reviews.csv")
mobile_phones_df = pd.read_csv("./Amazon_Unlocked_Mobile.csv")

In [24]:
display(fine_food_df.head())
display(play_store_df.head())
display(clothing_df.head())
display(mobile_phones_df.head())

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,sortOrder,appId
0,gp:AOqpTOEhZuqSqqWnaKRgv-9ABYdajFUB0WugPGh-SG-...,Eric Tie,https://play-lh.googleusercontent.com/a-/AOh14...,I cannot open the app anymore,1,0,5.4.0.6,2020-10-27 21:24:41,,,newest,com.anydo
1,gp:AOqpTOH0WP4IQKBZ2LrdNmFy_YmpPCVrV3diEU9KGm3...,john alpha,https://play-lh.googleusercontent.com/a-/AOh14...,I have been begging for a refund from this app...,1,0,,2020-10-27 14:03:28,"Please note that from checking our records, yo...",2020-10-27 15:05:52,newest,com.anydo
2,gp:AOqpTOEMCkJB8Iq1p-r9dPwnSYadA5BkPWTf32Z1azu...,Sudhakar .S,https://play-lh.googleusercontent.com/a-/AOh14...,Very costly for the premium version (approx In...,1,0,,2020-10-27 08:18:40,,,newest,com.anydo
3,gp:AOqpTOGFrUWuKGycpje8kszj3uwHN6tU_fd4gLVFy9z...,SKGflorida@bellsouth.net DAVID S,https://play-lh.googleusercontent.com/-75aK0WF...,"Used to keep me organized, but all the 2020 UP...",1,0,,2020-10-26 13:28:07,What do you find troublesome about the update?...,2020-10-26 14:58:29,newest,com.anydo
4,gp:AOqpTOHls7DW8wmDFzTkHwxuqFkdNQtKHmO6Pt9jhZE...,Louann Stoker,https://play-lh.googleusercontent.com/-pBcY_Z-...,Dan Birthday Oct 28,1,0,5.6.0.7,2020-10-26 06:10:50,,,newest,com.anydo


Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


Before vectorizing the review text, we first inspect each dataset for missing values:

In [25]:
fine_food_df.info()
print()
play_store_df.info()
print()
clothing_df.info()
print()
mobile_phones_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568438 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12495 entries, 0 to 12494
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                -------------- 

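The clothing and mobile phone datasets contain rows with no review text, which we drop before vectorizing:

In [26]: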
cleaned_clothing_df = clothing_df.dropna(subset=["Review Text"])
cleaned_clothing_df.info()

cleaned_mobile_phones_df = mobile_phones_df.dropna(subset=["Reviews"])
cleaned_mobile_phones_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22641 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               22641 non-null  int64 
 1   Clothing ID              22641 non-null  int64 
 2   Age                      22641 non-null  int64 
 3   Title                    19675 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   22641 non-null  int64 
 6   Recommended IND          22641 non-null  int64 
 7   Positive Feedback Count  22641 non-null  int64 
 8   Division Name            22628 non-null  object
 9   Department Name          22628 non-null  object
 10  Class Name               22628 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.1+ MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 413778 entries, 0 to 413839
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 

In [27]:
fine_food_vectorized = vectorizer.transform(fine_food_df["Text"])
play_store_vectorized = vectorizer.transform(play_store_df["content"])
clothing_vectorized = vectorizer.transform(cleaned_clothing_df["Review Text"])
mobile_phones_vectorized = vectorizer.transform(cleaned_mobile_phones_df["Reviews"])

y_rating_ff = fine_food_df["Score"]
y_rating_ps = play_store_df["score"]
y_rating_cl = cleaned_clothing_df["Rating"]
y_rating_mp = cleaned_mobile_phones_df["Rating"]

In [28]:
def map_rating_to_sentiment_score(rating):
    if rating in [1,2]:
        return 1
    elif rating in [3]:
        return 2
    elif rating in [4,5]:
        return 3
    return -1

y_sentiment_ff = y_rating_ff.apply(map_rating_to_sentiment_score)    
y_sentiment_ps = y_rating_ps.apply(map_rating_to_sentiment_score)
y_sentiment_cl = y_rating_cl.apply(map_rating_to_sentiment_score)    
y_sentiment_mp = y_rating_mp.apply(map_rating_to_sentiment_score)

In [29]:
dataset_names = [
    'Amazon Fine Food Reviews',
    'Google Play Store Reviews',
    'Women\'s E-Commerce Clothing Reviews',
    'Amazon Reviews of Unlocked Mobile Phones'
]
dataset_x = {
    'Amazon Fine Food Reviews': fine_food_vectorized,
    'Google Play Store Reviews': play_store_vectorized,
    'Women\'s E-Commerce Clothing Reviews': clothing_vectorized,
    'Amazon Reviews of Unlocked Mobile Phones': mobile_phones_vectorized
}
dataset_y_rating = {
    'Amazon Fine Food Reviews': y_rating_ff,
    'Google Play Store Reviews': y_rating_ps,
    'Women\'s E-Commerce Clothing Reviews': y_rating_cl,
    'Amazon Reviews of Unlocked Mobile Phones': y_rating_mp
}
dataset_y_sentiment = {
    'Amazon Fine Food Reviews': y_sentiment_ff,
    'Google Play Store Reviews': y_sentiment_ps,
    'Women\'s E-Commerce Clothing Reviews': y_sentiment_cl,
    'Amazon Reviews of Unlocked Mobile Phones': y_sentiment_mp
}

In [30]:
for dataset_name in dataset_names:
    X_dataset = dataset_x[dataset_name]
    y_rating_dataset = dataset_y_rating[dataset_name]
    y_sentiment_dataset = dataset_y_sentiment[dataset_name]
    if train_svm_models:
        if svm_use_rating:
            for kernel, model in rating_models.items():
                print(f'\nTesting SVM {kernel.title()} Model (Rating) on {dataset_name}...')
                svm_kernel_dataset_start = time.perf_counter()
                svm_kernel_dataset_predictions = model.predict(X_dataset)
                svm_kernel_dataset_end = time.perf_counter()
                svm_kernel_dataset_accuracy = accuracy_score(y_rating_dataset, svm_kernel_dataset_predictions)
                svm_kernel_dataset_f1 = f1_score(y_rating_dataset, svm_kernel_dataset_predictions, average='weighted', zero_division=0)
                print(f'Performance:')
                print(f'- Accuracy (SVM {kernel.title()} Kernel): {svm_kernel_dataset_accuracy*100:.2f}')
                print(f'- F1 (SVM {kernel.title()} Kernel): {svm_kernel_dataset_f1*100:.2f}')
                print(f'- Validation Prediction Time: {svm_kernel_dataset_end - svm_kernel_dataset_start:.2f}s')
                print(f'- Classification Report:')
                print(classification_report(y_rating_dataset, svm_kernel_dataset_predictions, zero_division=0))
                print()
        
        if svm_use_sentiment:
            for kernel, model in sentiment_models.items():
                print(f'\nTesting SVM {kernel.title()} Model (Sentiment) on {dataset_name}...')
                svm_kernel_dataset_start = time.perf_counter()
                svm_kernel_dataset_predictions = model.predict(X_dataset)
                svm_kernel_dataset_end = time.perf_counter()
                svm_kernel_dataset_accuracy = accuracy_score(y_sentiment_dataset, svm_kernel_dataset_predictions)
                svm_kernel_dataset_f1 = f1_score(y_sentiment_dataset, svm_kernel_dataset_predictions, average='weighted', zero_division=0)
                print(f'Performance:')
                print(f'- Accuracy: {svm_kernel_dataset_accuracy*100:.2f}')
                print(f'- F1: {svm_kernel_dataset_f1*100:.2f}')
                print(f'- Validation Prediction Time: {svm_kernel_dataset_end - svm_kernel_dataset_start:.2f}s')
                print(f'- Classification Report:')
                print(classification_report(y_sentiment_dataset, svm_kernel_dataset_predictions, zero_division=0))
                print()


Testing SVM Linear Model (Rating) on Amazon Fine Food Reviews...
Performance:
- Accuracy (SVM Linear Kernel): 53.23
- F1 (SVM Linear Kernel): 54.70
- Validation Prediction Time: 4073.70s
- Classification Report:
              precision    recall  f1-score   support

           1       0.49      0.25      0.33     52268
           2       0.12      0.04      0.07     29769
           3       0.16      0.41      0.23     42640
           4       0.21      0.24      0.22     80655
           5       0.76      0.69      0.73    363122

    accuracy                           0.53    568454
   macro avg       0.35      0.33      0.31    568454
weighted avg       0.58      0.53      0.55    568454



Testing SVM Sigmoid Model (Rating) on Amazon Fine Food Reviews...
Performance:
- Accuracy (SVM Sigmoid Kernel): 54.12
- F1 (SVM Sigmoid Kernel): 55.19
- Validation Prediction Time: 4048.69s
- Classification Report:
              precision    recall  f1-score   support

           1       0.49   

## 7. Conclusions
In our conclusions, we can discuss the following:
- Which model performed the best?
- Performance of our best model compared to chance
- Any other interesting observations
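
For the comparison against chance, one reasonable reference point is a classifier that always predicts the most common rating. The sketch below derives that baseline from the class supports in the linear-kernel classification reports for the test and validation data. Treating the majority class as the notion of "chance" is an assumption; uniform guessing over five ratings (20%) is another option.

```python
# Number of reviews per rating (1-5), copied from the `support` column of the reports
test_support = {1: 103, 2: 143, 3: 207, 4: 569, 5: 1027}
val_support = {1: 114, 2: 166, 3: 204, 4: 585, 5: 981}

def majority_baseline(support):
    """Accuracy of always predicting the most frequent class."""
    return max(support.values()) / sum(support.values())

test_baseline = majority_baseline(test_support)
val_baseline = majority_baseline(val_support)
print(f"test: {test_baseline:.1%}, validation: {val_baseline:.1%}")
```

Against these baselines, the linear kernel's 65.98% test accuracy and 63.56% validation accuracy are roughly 16 percentage points higher.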