# Lab 7

We will use the "Industrial and Scientific" subset of the Amazon reviews dataset, which contains 77,071 reviews and 1,758,988 ratings.

https://nijianmo.github.io/amazon/index.html#subsets 

In [2]:
# code adapted from course github
import os
import json
os.environ['KMP_DUPLICATE_LIB_OK']='True'

import numpy as np
import pandas as pd

# We had to modify the original json file slightly to fix a few syntax errors.
# We added commas to separate each json object and encapsulated all of the objects
# in square brackets so that the file could be read correctly.
df = pd.read_json('Industrial_and_Scientific_5.json')

# Add column that identifies if the review was positive or negative.
# Positive reviews will be defined as reviews that have an 'overall' rating of 3 or higher (inclusive). Negative reviews will be anything below 3.
df.loc[df['overall'] >= 3.0, 'Pos_Neg'] = 'Pos'
df.loc[df['overall'] < 3, 'Pos_Neg'] = 'Neg'

# Drop the columns we don't need; overall, verified, asin (item ID), reviewText, and Pos_Neg remain
drop_columns = ['image', 'reviewerID', 'vote', 'reviewerName', 'style', 'summary', 'unixReviewTime', 'reviewTime']
df = df.drop(drop_columns, axis=1)

# Double check that the column was added correctly
df.tail(50)

Unnamed: 0,overall,verified,asin,reviewText,Pos_Neg
77021,5,True,B01GUEP5HC,Filament looks almost like bronze when printed...,Pos
77022,5,True,B01GUEP5HC,"easy smooth clean prints, a little dark but a ...",Pos
77023,5,True,B01GWD9VU8,"Quick shipping, product as described, good pro...",Pos
77024,4,True,B01GWD9VU8,Very good value. All units are complete and t...,Pos
77025,4,True,B01GXMLBD8,I use these jumpers on my breadboards and many...,Pos
77026,5,True,B01GXQMP66,GREAT PRODUCT!,Pos
77027,3,True,B01H1RDJOI,I was unable to change the display to Fahrenhe...,Pos
77028,4,True,B01H1RDJOI,Replaced the old thermometer in my YongHeng co...,Pos
77029,2,True,B01H35YR7Q,These aren't as good as there old ones. They w...,Neg
77030,2,True,B01H35YR7Q,A little too small for the flash forge build s...,Neg
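
The manual edit to the JSON file described in the code comments above may be avoidable. Here is a minimal sketch, assuming the original download is newline-delimited JSON (one review object per line; this is an assumption about the file, not something confirmed here). Each line is parsed on its own with the standard library. The two sample records are made up for illustration.

```python
import io
import json

# Hypothetical stand-in for the unmodified file: one JSON object per line,
# no separating commas, and no enclosing square brackets.
raw = io.StringIO(
    '{"overall": 5.0, "asin": "A1", "reviewText": "works well"}\n'
    '{"overall": 2.0, "asin": "A2", "reviewText": "too small"}\n'
)

# Parse each non-empty line as its own JSON document.
records = [json.loads(line) for line in raw if line.strip()]

for record in records:
    # Same rule as in the cell above: a rating of 3 or higher is positive.
    record["Pos_Neg"] = "Pos" if record["overall"] >= 3.0 else "Neg"

print([(r["asin"], r["Pos_Neg"]) for r in records])
```

If the file really is in this format, pandas should be able to read it directly with `pd.read_json(path, lines=True)`, which would make the hand edits unnecessary.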


We decided to use the Tokenizer class from TensorFlow. Its "fit_on_texts" method builds a vocabulary from textual data, which serves our purposes well. To check that it did what we wanted, we printed the first 10 entries of the word index and the first 10 word counts.

We decided to use the same vocabulary size (1000 words) and padding length (500) as we did in class because the datasets are similar. In class, we worked with the IMDB dataset, a set of movie reviews. Our dataset is a collection of Amazon product reviews, so we deemed it appropriate to reuse the same parameters since both datasets consist of reviews.
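
To make these two parameters concrete, here is a minimal pure-Python sketch of what they do. It is not the Keras implementation: the helper names `encode` and `pad_pre` are hypothetical, the toy vocabulary is made up, and the sketch assumes the Keras defaults of skipping out-of-vocabulary words and of padding and truncating at the front of each sequence.

```python
def encode(text, word_index, num_words):
    """Map words to integer ids, skipping unknown words and ids >= num_words."""
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word)
        if idx is not None and idx < num_words:
            ids.append(idx)
    return ids


def pad_pre(ids, maxlen):
    """Left-pad with zeros, or keep only the last maxlen ids if too long."""
    if len(ids) >= maxlen:
        return ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids


# Toy vocabulary ranked by frequency (1 = most frequent), like word_index.
word_index = {"the": 1, "it": 2, "works": 3, "great": 4, "filament": 5}

short = encode("It works great", word_index, num_words=5)
long = encode("the filament works great it works", word_index, num_words=5)

print(pad_pre(short, maxlen=5))  # [0, 0, 2, 3, 4]
print(pad_pre(long, maxlen=3))   # [4, 2, 3]
```

Note that "filament" (id 5) is dropped because only ids below `num_words` are kept. The padded example printed in the output further down begins with a long run of zeros for the same reason as `short` here: that review is shorter than the padding length.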

We will use the F1-score to evaluate our model's performance. There are no drastic consequences if our model is wrong; we are not, for example, working with medical data where we might incorrectly identify whether someone has cancer. So, we will simply use the F1-score, which provides the "harmonic mean of the precision and recall" (https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.f1_score.html). Because neither false negatives nor false positives are more costly for us, we do not need to weight one over the other, and the standard F1-score will be sufficient.
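
As a worked illustration of the harmonic mean, the sketch below computes precision, recall, and F1 by hand for string labels like ours. The function name `f1_for_label` is hypothetical and the labels and predictions are made up; it is not the scikit-learn implementation.

```python
def f1_for_label(y_true, y_pred, pos_label):
    """Precision, recall, and F1 with pos_label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up labels and predictions, for illustration only.
y_true = ["Pos", "Pos", "Pos", "Pos", "Neg", "Neg"]
y_pred = ["Pos", "Pos", "Pos", "Pos", "Pos", "Neg"]

precision, recall, f1 = f1_for_label(y_true, y_pred, pos_label="Pos")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here there are 4 true positives, 1 false positive, and no false negatives, so precision is 0.8, recall is 1.0, and F1 is about 0.889. With scikit-learn, string labels such as 'Pos' and 'Neg' need the positive class named explicitly, for example `f1_score(y_true, y_pred, pos_label='Pos')`, because the default `pos_label` is 1.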

We decided to use a simple 80/20 split for our training and testing data. Our dataset has over 77,000 instances, which is enough for both sets to contain a variety of cases (here, positive and negative reviews). According to this article (https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/), a train-test split is commonly used for large datasets. A single split also means the model is trained only once, so if our model were implemented in the real world, it could be evaluated efficiently on even larger datasets without a large computation time. Thus, we determined an 80/20 split was the best course of action for our model.
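
The arithmetic behind the split can be sketched as follows. This is a simplified stand-in and not scikit-learn's code: the helper `shuffle_split` is hypothetical, and the sketch assumes the test count is rounded up, which is consistent with the shapes printed in the output further down.

```python
import math
import random

n_samples = 77071  # reviews in our dataset
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # round the test count up
n_train = n_samples - n_test
print(n_train, n_test)  # 61656 15415


def shuffle_split(items, test_size, seed):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    cut = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:cut], indices[cut:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


train, test = shuffle_split(list(range(10)), test_size=0.2, seed=42)
print(len(train), len(test))  # 8 2
```

Fixing the seed, as `random_state=42` does in our call to train_test_split, makes the split reproducible across runs.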

In [15]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from sklearn.model_selection import train_test_split

reviews = df['reviewText'].astype(str)
labels = df['Pos_Neg'].values

# Tokenizing the reviews, but only keeping the top 1000 words as we did in class
top_words = 1000
tokenizer = Tokenizer(num_words=top_words)
tokenizer.fit_on_texts(reviews)

# Check to make sure things have been tokenized correctly
print("Word index [first 10 words]")
print({word: tokenizer.word_index[word] for word in list(tokenizer.word_index.keys())[:10]})

print("\nWord Counts (first 10 words):")
print({word: tokenizer.word_counts[word] for word in list(tokenizer.word_counts)[:10]})


# Encoding the tokenized words into sequences of integers
X = tokenizer.texts_to_sequences(reviews)

# Padding/Truncating the sequences to be the same length
max_review_length = 500
X = sequence.pad_sequences(X, maxlen=max_review_length)

X = np.array(X)
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'X_train shape: {X_train.shape}, y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test.shape}, y_test shape: {y_test.shape}')

# Print an example to make sure the tokenization and padding have occurred correctly
print(f'X example: {X_train[0]} y example: {y_train[0]}')

Word index [first 10 words]
{'the': 1, 'to': 2, 'and': 3, 'a': 4, 'i': 5, 'it': 6, 'is': 7, 'of': 8, 'for': 9, 'this': 10}

Word Counts (first 10 words):
{'this': 46158, 'worked': 3696, 'really': 6012, 'well': 14547, 'for': 49332, 'what': 8551, 'i': 90449, 'used': 9861, 'it': 72930, 'so': 17029}
X_train shape: (61656, 500), y_train shape: (61656,)
X_test shape: (15415, 500), y_test shape: (15415,)
X example: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 