## Sequential Logistic Regression Model for Spam Detection of Amazon "Sports and Outdoors" Product Reviews

In [3]:
# Import libraries 
import pandas as pd
import numpy as np
import json
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from scipy.sparse import hstack
import time
import joblib

In [4]:
# Load dataset
data = pd.read_json('~/Documents/Sports_and_Outdoors/Sports_and_Outdoors.json', lines=True)
data.head()

FileNotFoundError: File /home/brandon-ism/Documents/Sports_and_Outdoors/Sports_and_Outdoors.json does not exist

The input features are `reviewText`, `overall`, `summary`, and `helpful`.
The target label is `class`, which indicates whether a review is spam (1) or not spam (0).

Only the first element of the `helpful` feature is kept; it is the number of users who found the review helpful.

In [3]:
# Extract the relevant columns (.copy() avoids a SettingWithCopyWarning when 'helpful' is reassigned below)
data = data[['reviewText', 'overall', 'summary', 'helpful', 'class']].copy()

# Clean the 'helpful' column: extract the first element of the list - num of helpful votes
data['helpful'] = data['helpful'].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else 0)

# Check cleaned data
data.head()

Unnamed: 0,reviewText,overall,summary,helpful,class
0,Bought it for a ballet tutu but it is being wo...,5,Super cute,0,1
1,I origonally didn't get the item I ordered. W...,4,Happy with purchase even though it came a lot ...,0,1
2,My daughter and her friends love the colors an...,4,zebralisous,0,1
3,"Arrived very timely, cute grandbaby loves it. ...",4,Cute Tutu,0,1
4,My little girl just loves to wear this tutu be...,5,Versatile,0,1
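
To make the `helpful` cleanup rule concrete, here is a minimal sketch that applies the same condition to a few hand-made values. The function name `first_helpful_vote` and the sample values are illustrative assumptions, not part of the dataset.

```python
def first_helpful_vote(value):
    """Return the first element of a non-empty list, otherwise 0."""
    # Same condition as the lambda used on the 'helpful' column
    return value[0] if isinstance(value, list) and len(value) > 0 else 0


# Hypothetical inputs: two well-formed lists, an empty list, and two malformed entries
samples = [[3, 5], [0, 0], [], None, "n/a"]
cleaned = [first_helpful_vote(v) for v in samples]
print(cleaned)  # [3, 0, 0, 0, 0]
```

Anything that is not a non-empty list falls back to 0, so missing or malformed entries are treated as "no helpful votes".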


The data is split 80/20 into training and testing sets, respectively.

In [4]:
# Split dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(data[['reviewText', 'overall', 'summary', 'helpful']], 
                                                    data['class'], test_size=0.2, random_state=42, shuffle=True)

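We must convert the text features (`reviewText` and `summary`) into numerical vectors suitable for ML training. The numerical features (`overall` and `helpful`) are standardized, and all feature blocks are then stacked into a single matrix.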

In [5]:
# Initialize the TF-IDF Vectorizer for 'reviewText' and 'summary'
vectorizer_review = TfidfVectorizer(max_features=5000)
vectorizer_summary = TfidfVectorizer(max_features=1000)

# Fit and transform the 'reviewText' and 'summary'
X_train_review_tfidf = vectorizer_review.fit_transform(X_train['reviewText'])
X_test_review_tfidf = vectorizer_review.transform(X_test['reviewText'])

X_train_summary_tfidf = vectorizer_summary.fit_transform(X_train['summary'])
X_test_summary_tfidf = vectorizer_summary.transform(X_test['summary'])

# Standardize the numerical features ('overall' and 'helpful')
scaler = StandardScaler()

X_train_overall_helpful = scaler.fit_transform(X_train[['overall', 'helpful']])
X_test_overall_helpful = scaler.transform(X_test[['overall', 'helpful']])

# Check the shapes of each feature set to ensure consistency
print(f"Shape of X_train_review_tfidf: {X_train_review_tfidf.shape}")
print(f"Shape of X_train_summary_tfidf: {X_train_summary_tfidf.shape}")
print(f"Shape of X_train_overall_helpful: {X_train_overall_helpful.shape}")

# Combine all features into one training and testing set
X_train_combined = hstack([X_train_review_tfidf, X_train_summary_tfidf, X_train_overall_helpful])
X_test_combined = hstack([X_test_review_tfidf, X_test_summary_tfidf, X_test_overall_helpful])

# Check the final shapes
print(f"Shape of X_train_combined: {X_train_combined.shape}")
print(f"Shape of y_train: {y_train.shape}")

Shape of X_train_review_tfidf: (2410604, 5000)
Shape of X_train_summary_tfidf: (2410604, 1000)
Shape of X_train_overall_helpful: (2410604, 2)
Shape of X_train_combined: (2410604, 6002)
Shape of y_train: (2410604,)
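
The column count of the combined matrix is the sum of its parts (5000 + 1000 + 2 = 6002). The sketch below reproduces the same pattern on a tiny, made-up corpus so the shapes can be checked by hand; the texts and numeric values are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three short reviews and two numeric columns each
texts = ["great tent", "tent leaked badly", "great value great tent"]
numeric = np.array([[5, 2], [1, 0], [4, 7]], dtype=float)

# TF-IDF yields a sparse matrix with one column per vocabulary term
vec = TfidfVectorizer()
text_matrix = vec.fit_transform(texts)

# StandardScaler yields a dense array with zero mean and unit variance per column
scaled = StandardScaler().fit_transform(numeric)

# hstack concatenates column-wise; converting to CSR gives efficient row access
combined = hstack([text_matrix, scaled]).tocsr()

print(sorted(vec.vocabulary_))  # ['badly', 'great', 'leaked', 'tent', 'value']
print(combined.shape)           # (3, 7)
```

Five vocabulary terms plus two numeric columns give seven columns, mirroring how the full feature matrix is assembled.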


A timer around the training call records the total training time.

In [6]:
# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=1000, solver='lbfgs')

# Start the timer
start_time = time.time()

# Train the model
model.fit(X_train_combined, y_train)

# Stop the timer
end_time = time.time()

# Calculate the training time
training_time = end_time - start_time
print(f"Sequential Training Time: {training_time:.4f} seconds")


Sequential Training Time: 7.7188 seconds


In [7]:
# Predictions for accuracy
y_pred = model.predict(X_test_combined)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Sequential Logistic Regression Accuracy: {accuracy:.4f}")


Sequential Logistic Regression Accuracy: 1.0000


In [8]:
print("Final Summary:")
print(f"Sequential Training Time: {training_time:.4f} seconds")
print(f"Model Accuracy: {accuracy:.4f}")


Final Summary:
Sequential Training Time: 7.7188 seconds
Model Accuracy: 1.0000


The sequential Logistic Regression model achieves an accuracy score of 1.0, or 100%, on the test set,

where $\textrm{Accuracy}=\frac{\textrm{Number of Correct Predictions}}{\textrm{Total Number of Predictions}}$

The total training time for this model is approximately 7.72 seconds.
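
As a worked instance of the accuracy formula, the sketch below computes it by hand on ten made-up labels and compares it with a majority-class baseline (the score obtained by always predicting the most frequent class). The helper names and the label values are assumptions for illustration; they are not taken from the dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Number of correct predictions divided by total number of predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy of a model that always predicts the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Hypothetical labels: 8 spam (1) and 2 non-spam (0); one prediction is wrong
y_true = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

print(accuracy(y_true, y_pred))   # 0.9
print(majority_baseline(y_true))  # 0.8
```

Comparing a reported accuracy against this baseline is one way to judge how much of the score comes from class imbalance rather than from the features.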



In [None]:
# Save the trained model to a file
joblib.dump(model, 'logistic_regression_model.joblib')
print("Model saved as 'logistic_regression_model.joblib'")
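
A saved model can be restored with `joblib.load`. The sketch below shows a round trip on a tiny stand-in model written to a temporary directory; the toy data and the file name `toy_model.joblib` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset, used only to get a fitted model
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(X, y)

# Write the model to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.joblib")
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored model should produce the same predictions as the original
same = np.array_equal(toy_model.predict(X), restored.predict(X))
print(same)
```

To score new reviews later, the fitted `vectorizer_review`, `vectorizer_summary`, and `scaler` would need to be saved the same way, since the model expects inputs transformed exactly as during training.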
