In [1]:
import pandas as pd
import json

def load_jsonl(path, limit=None):
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            rows.append(json.loads(line))
    return pd.DataFrame(rows)

# Amazon Appliances Review Rating Prediction

## 1. Predictive Task and Problem Formulation

### Task Definition
In this project, we focus on predicting the star rating of an Amazon Appliance review using only the review text (the title and the written body). The goal is to take what a customer wrote and guess whether they gave the product 1, 2, 3, 4, or 5 stars.

### Evaluation Methodology
We will evaluate the model mainly using accuracy, which tells us how often the predicted rating matches the true rating. To get a deeper understanding of performance, we also look at precision, recall, and F1-score for each rating level. A confusion matrix will help us see which ratings the model tends to confuse with each other. We use an 85% training and 15% testing split, with a fixed random seed to make sure results can be reproduced.

### Baselines for Comparison
To judge how well our model performs, we compare it to simple baselines. The first is a majority-class baseline that always predicts the most common rating, typically 5 stars. The second is a random baseline that predicts ratings according to the distribution of the training data. Our own model, logistic regression with TF-IDF features, is itself a strong and commonly used baseline for text classification tasks.
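The two trivial baselines can be sketched with scikit-learn's `DummyClassifier`. The label arrays below are hypothetical toy data that stand in for the real ratings; this is an illustration of the idea, not the notebook's actual evaluation.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy labels standing in for star ratings (skewed toward 5 stars).
y_train_toy = np.array([5] * 65 + [4] * 15 + [3] * 7 + [2] * 5 + [1] * 8)
y_test_toy = np.array([5] * 13 + [4] * 3 + [3] * 1 + [2] * 1 + [1] * 2)

# DummyClassifier ignores the features, so a placeholder column is enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Majority-class baseline: always predicts the most frequent training label.
majority = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
majority_acc = accuracy_score(y_test_toy, majority.predict(X_test_toy))

# Random baseline: samples labels from the training class distribution.
stratified = DummyClassifier(strategy="stratified", random_state=42).fit(X_train_toy, y_train_toy)
stratified_acc = accuracy_score(y_test_toy, stratified.predict(X_test_toy))

print(f"Majority baseline accuracy: {majority_acc:.3f}")
print(f"Stratified random baseline accuracy: {stratified_acc:.3f}")
```

On the real data, the same two classifiers could be fitted on `y_train` and scored on `y_test` to obtain the reference numbers the model has to beat.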

### Model Validity Assessment
To check whether the model’s predictions are reliable, we examine the confusion matrix to see if the mistakes make sense, such as mixing up 4-star and 5-star reviews rather than confusing 1-star with 5-star. We also perform error analysis by looking at misclassified examples to understand where the model struggles. Finally, we compare our results to findings from similar rating-prediction or sentiment analysis work to ensure our outcomes are consistent with what others have observed.
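A minimal sketch of the error-analysis step, assuming the predictions are collected in a DataFrame next to the true labels. The rows and column names here are made up for illustration; in the notebook they would come from `y_test` and the model's predictions.

```python
import pandas as pd

# Hypothetical predictions; real values would come from the test set.
errors = pd.DataFrame({
    "review": ["works great", "stopped working after a week", "ok but noisy",
               "love it", "arrived damaged"],
    "true": [5, 1, 3, 5, 1],
    "pred": [5, 5, 4, 5, 2],
})

# Keep only misclassified rows and measure how far off each prediction is.
errors = errors[errors["true"] != errors["pred"]].copy()
errors["abs_error"] = (errors["true"] - errors["pred"]).abs()

# Largest gaps first: a 1-star review predicted as 5 stars is the most worrying case.
worst_first = errors.sort_values("abs_error", ascending=False)
print(worst_first)
```

Sorting by the size of the gap puts the implausible mistakes (for example 1 star predicted as 5) at the top, which is where manual inspection is most informative.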

---

## 2. Dataset Context and Exploratory Data Analysis

### Dataset Origin and Collection
**Source**: Amazon Review Data (2023 version) from UCSD/Julian McAuley's research group

**Citation**:
> Hou, Y., Li, J., He, Z., Yan, A., Chen, X., & McAuley, J. (2023). Bridging Language and Items for Retrieval and Recommendation. arXiv. [Link](https://arxiv.org/abs/2305.14385)

**Context**:
The dataset we use is a large-scale Amazon Reviews collection assembled in 2023 by the McAuley Lab. It contains more than 48 million items and over 571 million reviews written by about 54 million users. It includes detailed user reviews (ratings, review text, and helpfulness votes), item metadata (descriptions, prices, and raw images), and relational links such as user-item connections and bought-together graphs. The 2023 release introduces several improvements. It is significantly larger, with more than twice the number of reviews of the previous version. It includes newer interactions, spanning May 1996 to September 2023, along with richer and cleaner metadata. The timestamps are more precise, down to the second, and the dataset now comes with standard data splits to support consistent evaluation for recommendation systems.

### Data Processing
The dataset comes pre-processed in JSONL format and includes several fields such as the star rating from one to five, the short review title, the full review text, the Unix timestamp, whether the purchase was verified, the number of helpful votes, and identifiers including user ID, ASIN, and parent ASIN. For our preprocessing, we first combined the review title and text into a single input string. We then normalized the text by converting it to lowercase, removing HTML tags, and removing punctuation. Next, we removed stopwords using the NLTK English stopword list, followed by stemming with the Porter Stemmer to reduce words to their base forms. Finally, we transformed the text into numerical features using TF-IDF vectorization with bigrams and limited the vocabulary to fifty thousand features.

**Class Distribution Analysis** (see below):

In [2]:
df = load_jsonl("Appliances.jsonl")


In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.columns.tolist()

In [None]:
df_review = df[['rating', 'title', 'text']].dropna().copy()
df_review.head()

In [None]:
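# Build the combined text from df_review itself so only rows that survived dropna() are used
df_review['reviews'] = df_review['title'] + ' ' + df_review['text']
df_review = df_review.drop(columns=["title", "text"])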
df_review.head()

## EDA

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.figure()
sns.countplot(x="rating", data=df)
plt.title("Star Rating Distribution")
plt.xlabel("Rating")
plt.ylabel("Number of reviews")
plt.show()

df["rating"].value_counts(normalize=True).sort_index()

The rating distribution in the dataset is heavily skewed toward positive reviews. Approximately sixty five percent of all entries are five star reviews, making this the dominant class. Four star reviews account for about fifteen percent of the data, while one, two, and three star reviews together make up only about twenty percent. This imbalance has important implications for modeling. A naive model that always predicts five stars would already reach an accuracy of roughly sixty five percent, so any useful model must outperform this baseline. The model also needs to capture subtle differences in language to separate adjacent rating levels, and the minority classes from one to three stars will be much harder to predict reliably. As a result, precision and recall are likely to vary substantially across different rating categories.

In [None]:
plt.figure()
plt.hist(df["helpful_vote"], bins=30)
plt.title("Distribution of Helpful Votes")
plt.xlabel("helpful_vote")
plt.ylabel("Number of reviews")
plt.show()

plt.figure()
plt.hist(df["helpful_vote"], bins=30, log=True)
plt.title("Distribution of Helpful Votes (log scale)")
plt.xlabel("helpful_vote")
plt.ylabel("Number of reviews (log)")
plt.show()

(df["helpful_vote"] > 0).mean()

In [None]:
# Print explicitly: only the last expression in a cell is displayed automatically
print(df["verified_purchase"].value_counts(normalize=True))
sns.countplot(x="verified_purchase", data=df)
plt.title("Verified vs Non-Verified Reviews")
plt.show()

In [None]:
# Convert millisecond Unix timestamps to datetimes, then extract year and month
df['timestamp_dt'] = pd.to_datetime(df['timestamp'], unit='ms')
df['year'] = df['timestamp_dt'].dt.year
df['month'] = df['timestamp_dt'].dt.month

year_counts = df['year'].value_counts().sort_index()
print(year_counts)

plt.figure(figsize=(12, 5))
year_counts.plot(kind='bar')
plt.title('Number of Reviews per Year')
plt.xlabel('Year')
plt.ylabel('Number of reviews')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
reviews_per_item = df["parent_asin"].value_counts()

plt.figure()
plt.hist(reviews_per_item, bins=50, log=True)
plt.title("Distribution of Reviews per Product (log scale)")
plt.xlabel("# reviews per product")
plt.ylabel("# products")
plt.show()

print("Median reviews per product:", reviews_per_item.median())
print("Top 10 most-reviewed products:")
print(reviews_per_item.head(10))

In [None]:
df['review_text_full'] = (
    df['title'].fillna('') + ' ' + df['text'].fillna('')
)
df['len_chars'] = df['review_text_full'].str.len()
df['len_words'] = df['review_text_full'].str.split().str.len()

print(df['len_words'].describe())

plt.figure()
plt.hist(df['len_words'], bins=50)
plt.title('Review Length (words)')
plt.xlabel('# words')
plt.ylabel('# reviews')
plt.show()

In [None]:
# Correlation analysis between numeric features
numeric_cols = ['rating', 'helpful_vote', 'len_words', 'year']
corr_matrix = df[numeric_cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1)
plt.title('Correlation Matrix of Numeric Features')
plt.tight_layout()
plt.show()

print("\nKey correlations:")
print(corr_matrix)

## Key Insights from EDA

Based on the exploratory analysis above, here are the key findings:

### Dataset Overview
- **Size**: 2.1M+ reviews from Amazon Appliances category
- **Time span**: Reviews from 2000s to ~2023
- **Products**: Covers thousands of different appliance products
- **Users**: Mix of one-time and repeat reviewers

### Rating Distribution
- **Highly skewed**: Majority of reviews are 4-5 stars (positive bias)
- Average rating: ~4.2/5.0
- This suggests potential class imbalance for predictive modeling

### Review Characteristics
- **Length**: Most reviews are short (median ~30-40 words)
- **Negative and mixed reviews**: 1-star and 3-star reviews tend to be longer (users explain problems)
- **5-star reviews**: Often brief and positive

### Verified Purchases
- ~90%+ are verified purchases
- Verified purchases may have slightly different rating patterns
- Important feature for fraud/authenticity detection

### Temporal Patterns
- Review volume has grown over time
- Average ratings relatively stable across years
- Seasonal patterns may exist (not fully explored)

### Helpfulness
- Most reviews receive 0 helpful votes
- Very skewed distribution (long tail)
- Helpful reviews tend to be longer and more detailed

### Potential ML Tasks
1. **Rating prediction** from review text
2. **Helpfulness prediction** (useful for ranking)
3. **Verified purchase detection** (fraud detection)
4. **Sentiment analysis** beyond simple rating
5. **Review quality assessment**

In [None]:
# Most helpful reviews analysis
most_helpful = df.nlargest(10, 'helpful_vote')

print("TOP 10 MOST HELPFUL REVIEWS:")
print("=" * 80)
for idx, row in most_helpful.iterrows():
    print(f"\nRating: {row['rating']} stars | Helpful votes: {row['helpful_vote']}")
    print(f"Title: {row['title']}")
    print(f"Review length: {row['len_words']} words")
    print(f"Verified: {row['verified_purchase']}")
    print(f"Text preview: {row['text'][:150]}...")
    print("-" * 80)

In [None]:
# Text analysis: extreme ratings
# Sample reviews for different ratings
print("=" * 80)
print("SAMPLE 5-STAR REVIEWS:")
print("=" * 80)
sample_5_star = df[df['rating'] == 5.0].sample(3, random_state=42)
for idx, row in sample_5_star.iterrows():
    print(f"\nTitle: {row['title']}")
    print(f"Text: {row['text'][:200]}..." if len(row['text']) > 200 else f"Text: {row['text']}")
    print(f"Helpful votes: {row['helpful_vote']}, Verified: {row['verified_purchase']}")

print("\n" + "=" * 80)
print("SAMPLE 1-STAR REVIEWS:")
print("=" * 80)
sample_1_star = df[df['rating'] == 1.0].sample(3, random_state=42)
for idx, row in sample_1_star.iterrows():
    print(f"\nTitle: {row['title']}")
    print(f"Text: {row['text'][:200]}..." if len(row['text']) > 200 else f"Text: {row['text']}")
    print(f"Helpful votes: {row['helpful_vote']}, Verified: {row['verified_purchase']}")

In [None]:
# Temporal trends analysis
# Reviews over time with rating breakdown
df_time = df.groupby(['year', 'rating']).size().reset_index(name='count')
df_time_pivot = df_time.pivot(index='year', columns='rating', values='count').fillna(0)

fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Stacked area chart
df_time_pivot.plot(kind='area', stacked=True, ax=axes[0], alpha=0.7, 
                   colormap='RdYlGn', legend=True)
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Number of Reviews')
axes[0].set_title('Review Volume Over Time (by Rating)')
axes[0].legend(title='Rating', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[0].grid(axis='y', alpha=0.3)

# Average rating over time
avg_rating_over_time = df.groupby('year')['rating'].mean()
axes[1].plot(avg_rating_over_time.index, avg_rating_over_time.values, 
             marker='o', linewidth=2, markersize=8, color='steelblue')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Average Rating')
axes[1].set_title('Average Rating Trend Over Time')
axes[1].set_ylim(3.5, 5)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Average rating by year:")
print(avg_rating_over_time)

In [None]:
# Verified vs Non-verified purchase analysis
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Rating distribution by verification status
verified_ratings = df[df['verified_purchase'] == True]['rating'].value_counts(normalize=True).sort_index()
unverified_ratings = df[df['verified_purchase'] == False]['rating'].value_counts(normalize=True).sort_index()

x = verified_ratings.index
width = 0.35
axes[0, 0].bar(x - width/2, verified_ratings.values, width, label='Verified', alpha=0.8)
axes[0, 0].bar(x + width/2, unverified_ratings.values, width, label='Non-verified', alpha=0.8)
axes[0, 0].set_xlabel('Rating')
axes[0, 0].set_ylabel('Proportion')
axes[0, 0].set_title('Rating Distribution: Verified vs Non-Verified')
axes[0, 0].legend()
axes[0, 0].grid(axis='y', alpha=0.3)

# Average rating by verification
avg_ratings = df.groupby('verified_purchase')['rating'].mean()
axes[0, 1].bar(['Non-Verified', 'Verified'], 
               [avg_ratings[False], avg_ratings[True]], 
               color=['coral', 'steelblue'])
axes[0, 1].set_ylabel('Average Rating')
axes[0, 1].set_title('Average Rating by Verification Status')
axes[0, 1].set_ylim(0, 5)
axes[0, 1].grid(axis='y', alpha=0.3)

# Review length by verification
avg_length = df.groupby('verified_purchase')['len_words'].mean()
axes[1, 0].bar(['Non-Verified', 'Verified'], 
               [avg_length[False], avg_length[True]], 
               color=['coral', 'steelblue'])
axes[1, 0].set_ylabel('Average Words')
axes[1, 0].set_title('Average Review Length by Verification')
axes[1, 0].grid(axis='y', alpha=0.3)

# Helpful votes by verification
avg_helpful = df.groupby('verified_purchase')['helpful_vote'].mean()
axes[1, 1].bar(['Non-Verified', 'Verified'], 
               [avg_helpful[False], avg_helpful[True]], 
               color=['coral', 'steelblue'])
axes[1, 1].set_ylabel('Average Helpful Votes')
axes[1, 1].set_title('Average Helpful Votes by Verification')
axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("Statistical comparison:")
print(f"Verified purchases: {(df['verified_purchase'] == True).sum():,} ({(df['verified_purchase'] == True).mean()*100:.1f}%)")
print(f"Non-verified: {(df['verified_purchase'] == False).sum():,} ({(df['verified_purchase'] == False).mean()*100:.1f}%)")
print(f"\nAverage rating - Verified: {avg_ratings[True]:.3f}, Non-verified: {avg_ratings[False]:.3f}")

In [None]:
# User activity analysis
reviews_per_user = df['user_id'].value_counts()

print(f"Total unique users: {df['user_id'].nunique():,}")
print(f"Total unique products: {df['parent_asin'].nunique():,}")
print(f"\nReviews per user statistics:")
print(reviews_per_user.describe())

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution of reviews per user
axes[0].hist(reviews_per_user.values, bins=50, log=True, edgecolor='black')
axes[0].set_xlabel('Number of Reviews per User')
axes[0].set_ylabel('Number of Users (log scale)')
axes[0].set_title('Distribution of User Activity')
axes[0].grid(axis='y', alpha=0.3)

# Top reviewers
top_reviewers = reviews_per_user.head(20)
axes[1].barh(range(len(top_reviewers)), top_reviewers.values)
axes[1].set_yticks(range(len(top_reviewers)))
axes[1].set_yticklabels([f"User {i+1}" for i in range(len(top_reviewers))])
axes[1].set_xlabel('Number of Reviews')
axes[1].set_title('Top 20 Most Active Reviewers')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

# What percentage of users wrote only 1 review?
single_review_pct = (reviews_per_user == 1).sum() / len(reviews_per_user) * 100
print(f"\n{single_review_pct:.1f}% of users wrote only 1 review")

In [None]:
# Rating vs Review Length analysis
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
df.boxplot(column='len_words', by='rating', ax=axes[0])
axes[0].set_title('Review Length by Rating')
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Number of Words')
axes[0].set_ylim(0, 200)  # Focus on main distribution

# Average review length per rating
avg_length_by_rating = df.groupby('rating')['len_words'].mean()
axes[1].bar(avg_length_by_rating.index, avg_length_by_rating.values, color='steelblue')
axes[1].set_title('Average Review Length by Rating')
axes[1].set_xlabel('Rating')
axes[1].set_ylabel('Average Number of Words')
axes[1].grid(axis='y', alpha=0.3)

plt.suptitle('')  # Remove auto-generated title
plt.tight_layout()
plt.show()

print("Average words per rating:")
print(avg_length_by_rating)

Our analysis of review length shows a noticeable relationship between length and rating. Reviews with one or three stars tend to be longer, since negative or mixed opinions often come with detailed explanations of the issues the customer experienced. In contrast, five star reviews are often short and to the point, with simple statements such as “Great product” or “Love it.” This pattern suggests that review length may provide a weak signal for predicting the rating. Even so, length alone is not enough because it cannot capture the meaning or sentiment behind the words. The TF-IDF features we use capture the content of a review through term weights, but they do not encode length directly: scikit-learn's `TfidfVectorizer` L2-normalizes each document vector by default, which largely removes the effect of review length. If length proves useful, it would have to be added as a separate feature.

---

## 3. Modeling Approach

### Problem Formulation
To formulate our task as a machine learning problem, we treat each combined review text, which includes both the title and the body, as the input. The goal is to predict a rating from one to five, so the output is a discrete class label. The objective of the model is to minimize classification error and, in practice, to maximize accuracy on held-out data. For this task, logistic regression paired with TF-IDF features is an appropriate and effective choice.

### Model: Logistic Regression with TF-IDF

Logistic regression offers several benefits for text-based prediction. It is easy to interpret because the learned coefficients reveal which words contribute most strongly to each rating class. It is also computationally efficient and scales well to large datasets, which is important given that we work with millions of reviews. The model has a long track record of success as a baseline in text classification, and it produces probabilistic outputs that indicate the model’s confidence in each prediction. It also performs reasonably well even without extensive hyperparameter tuning, which simplifies experimentation.

However, logistic regression has some limitations. It relies on a linear decision boundary and therefore cannot model more complex or non-linear patterns in language. It treats each word independently, so it ignores word order and broader contextual meaning. Because TF-IDF creates a very high-dimensional feature space, especially with a vocabulary of fifty thousand terms, the model can overfit without proper regularization. Finally, it lacks true semantic understanding and cannot naturally account for synonyms or paraphrased expressions. Despite these limitations, logistic regression serves as a strong and reliable baseline for our predictive task.

**Alternative Models**

| Model | Advantages | Disadvantages | Why Not Used |
|-------|-----------|---------------|--------------|
| **Naive Bayes** | Very fast, simple baseline | Strong independence assumptions | Less accurate than Logistic Regression in practice |
| **Random Forest** | Handles non-linearity, feature interactions | Slow on high-dim sparse data, less interpretable | Computational cost prohibitive at this scale |
| **Neural Networks (LSTM/BERT)** | Captures context, word order, semantics | Requires GPU, long training time, less interpretable | Beyond scope; would be interesting future work |
| **SVM** | Good for high-dim spaces | Kernel SVM training scales roughly quadratically or worse with the number of samples | Too slow for 2M samples |

**Feature Engineering: TF-IDF**

To represent the review text numerically, we use Term Frequency–Inverse Document Frequency, which assigns higher weight to informative words and lower weight to very common terms that do not contribute much meaning. The TF-IDF value for a term depends on how often it appears in a particular document and how rare it is across the full collection of documents. In practice, this helps highlight words such as “broken” or “amazing,” while down-weighting words like “the” or “and.” Our configuration uses a TF-IDF vectorizer with both unigrams and bigrams, ignores terms that appear in fewer than two documents, and limits the vocabulary to fifty thousand features. Bigrams are included because they capture short phrases and simple forms of context, such as expressions like “highly recommend.” They can also distinguish “not good” from “good,” but only if “not” survives preprocessing; NLTK's English stopword list contains “not,” so that particular bigram is lost when the full list is applied. The feature limit of fifty thousand provides a practical compromise that keeps the representation expressive without letting the vocabulary grow to several hundred thousand terms, which would make training slower and more memory intensive.
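To make the weighting concrete, here is a small worked sketch in plain Python. It uses the smoothed IDF variant that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, and leaves out the final L2 normalization step. The three tokenized reviews are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of already-tokenized reviews.
docs = [
    ["the", "fridge", "arrived", "broken"],
    ["the", "ice", "maker", "is", "amazing"],
    ["the", "filter", "fits", "the", "fridge"],
]

n_docs = len(docs)
# Document frequency: number of reviews that contain each term at least once.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw term count times smoothed IDF, ln((1 + n) / (1 + df)) + 1."""
    counts = Counter(doc)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }

weights = tfidf(docs[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the first review, “the” appears in every document and receives the minimum weight of 1.0, while “broken” appears in only one document and is weighted more heavily.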

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

In [None]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Download required NLTK data
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')

stemmer = PorterStemmer()
english_stopwords = set(stopwords.words('english'))

In [None]:
import re

def preprocess(review):
    # lowercase, then replace HTML tags with a space so adjacent words are not glued together
    text = re.sub(r'<.*?>', ' ', review.lower().strip())
    # replace punctuation / special chars with spaces (keep word characters + whitespace)
    text = re.sub(r'[^\w\s]', ' ', text)
    
    # Use simple split instead of nltk.word_tokenize to avoid compatibility issues
    tokens = text.split()
    processed_tokens = []

    for t in tokens:
        # remove stopwords *before* stemming
        if t not in english_stopwords:
            stem = stemmer.stem(t)
            processed_tokens.append(stem)

    # join back into a string
    return ' '.join(processed_tokens)

### Text Preprocessing Pipeline

Our preprocessing pipeline applies several standard NLP steps to clean and standardize the review text. We begin by removing any HTML tags, such as line break markers, to ensure that formatting artifacts do not appear as tokens. All text is then converted to lowercase so that words like “Great” and “great” are treated the same. We also remove punctuation and other special characters, followed by tokenization to split the text into individual words. Common stopwords such as “the,” “and,” and “is” are removed because they usually do not contribute meaningful sentiment information. Finally, we apply stemming with the Porter Stemmer to reduce words to their root forms, which helps control vocabulary size by treating related variants as the same term.

These choices come with certain trade-offs. Stemming simplifies the vocabulary and can improve generalization, but it may remove subtle distinctions in meaning. Stopword removal reduces dimensionality but can eliminate important context, especially negation: NLTK's English stopword list includes words such as “not” and “no,” so a phrase like “not good” is reduced to “good” before vectorization. Keeping negation words out of the stopword set is a simple way to avoid this. We use stemming rather than lemmatization because lemmatization requires part-of-speech tagging and adds additional complexity that is unnecessary for our baseline model.
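The negation trade-off can be seen in a short sketch. The stopword set below is a small hand-picked subset used only for illustration (the notebook uses NLTK's full English list), and the helper names are hypothetical.

```python
# Hand-picked illustrative subset of an English stopword list (assumption);
# NLTK's full list also contains the negation words "not" and "no".
STOPWORDS = {"the", "is", "it", "this", "not", "no", "and", "a"}
NEGATIONS = {"not", "no"}

def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace, and drop any token in `stopwords`."""
    return [tok for tok in text.lower().split() if tok not in stopwords]

def bigrams(tokens):
    """Adjacent token pairs, joined with a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

review = "This is not good"
dropped = remove_stopwords(review, STOPWORDS)            # negation removed
kept = remove_stopwords(review, STOPWORDS - NEGATIONS)   # negation preserved

print(dropped, bigrams(dropped))
print(kept, bigrams(kept))
```

With the full set, the review collapses to the single token “good”; excluding the negation words keeps the “not good” bigram available to the vectorizer.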

In [None]:
df_review = df_review.copy()
df_review['reviews'] = df_review['reviews'].apply(preprocess)

In [None]:
X = df_review['reviews']
y = df_review['rating']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

In [None]:
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams + bigrams (optional but usually helpful)
    min_df=2,             # ignore super rare terms
    max_features=50000    # cap dimensionality (tune as needed)
)

X_train_vec = tfidf.fit_transform(X_train)
X_test_vec  = tfidf.transform(X_test)

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    solver='newton-cg',
    max_iter=1000
)

log_reg.fit(X_train_vec, y_train)

### Model Training Configuration

For training, we use logistic regression with a configuration that is well suited to large text datasets. The model uses the newton-cg solver, a Newton-type method that uses second-order information and converges efficiently when working with high-dimensional TF-IDF features. We set the maximum number of iterations to one thousand to give the optimizer room to converge. The default L2 penalty with `C=1.0` (where `C` is the inverse of the regularization strength) helps prevent overfitting, which is important because the feature space contains fifty thousand terms.

The model is trained on eighty five percent of the dataset, which corresponds to roughly one point eight million samples. Binary logistic regression can be extended to five rating classes in two ways: a one-vs-rest scheme that trains five separate binary classifiers, or a single multinomial (softmax) model. With the newton-cg solver and default settings, current scikit-learn versions fit the multinomial model, so the per-class weight vectors are learned jointly and the five predicted probabilities for a review sum to one. Each class's weights score how strongly the TF-IDF features of a review point toward that rating.

---

## 4. Evaluation and Results

### Evaluation Metrics

**Why Accuracy is Appropriate**:
- Intuitive interpretation: % of reviews correctly classified
- Standard for multi-class classification
- Allows direct comparison with prior work

**Why Additional Metrics Matter**:
- **Precision**: Of reviews predicted as k-stars, what % are actually k-stars?
- **Recall**: Of all actual k-star reviews, what % did we find?
- **F1-Score**: Harmonic mean balancing precision and recall
- **Confusion Matrix**: Shows which ratings are commonly confused
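As a sketch of how these metrics relate to the confusion matrix, the snippet below derives per-class precision, recall, and F1 by hand. The 3-class matrix is hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical 3-class confusion matrix: rows are true labels, columns are predictions.
toy_labels = [1, 3, 5]
toy_cm = [
    [8, 1, 1],    # true 1-star
    [2, 3, 5],    # true 3-star
    [1, 2, 27],   # true 5-star
]

def per_class_metrics(cm, labels):
    """Return {label: (precision, recall, f1)} computed from a confusion matrix."""
    metrics = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        predicted_k = sum(row[k] for row in cm)   # column sum: everything predicted as k
        actual_k = sum(cm[k])                     # row sum: everything that truly is k
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

toy_metrics = per_class_metrics(toy_cm, toy_labels)
toy_accuracy = sum(toy_cm[k][k] for k in range(len(toy_labels))) / sum(map(sum, toy_cm))

print(f"Accuracy: {toy_accuracy:.3f}")
for label, (p, r, f1) in toy_metrics.items():
    print(f"{label}-star  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
```

In this toy case overall accuracy is 0.76, yet the 3-star class has a recall of only 0.30, which is the kind of gap that accuracy alone hides.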

### Baseline Performance

Two simple baselines help contextualize the performance of our model. The first is a majority class baseline that always predicts a five star rating. Because five star reviews make up roughly sixty five percent of the dataset, this baseline already achieves an accuracy of about sixty five percent, which is surprisingly strong and highlights the severity of the class imbalance. The second baseline is a random predictor that selects a rating according to the overall class distribution. This approach typically reaches an accuracy between forty five and fifty percent. For our model to demonstrate meaningful learning, it must perform noticeably better than both of these baselines, not only in overall accuracy but also in its ability to predict the minority classes more effectively.
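The expected accuracy of both baselines follows directly from the class shares. In the sketch below, the 65% and 15% figures come from the text, while the split of the remaining 20% across 1, 2, and 3 stars is a placeholder assumption.

```python
# Assumed class shares: 5-star and 4-star shares are from the text; the split of
# the remaining 20% across 1, 2, and 3 stars is a placeholder for illustration.
class_share = {1: 0.10, 2: 0.04, 3: 0.06, 4: 0.15, 5: 0.65}
assert abs(sum(class_share.values()) - 1.0) < 1e-9

# Always predicting the most common class is right whenever that class occurs.
majority_accuracy = max(class_share.values())

# Sampling predictions from the same distribution: class k is guessed with
# probability p_k and that guess is correct with probability p_k.
random_accuracy = sum(p * p for p in class_share.values())

print(f"Majority baseline: {majority_accuracy:.3f}")
print(f"Distribution-matched random baseline: {random_accuracy:.3f}")
```

Under these assumed shares the random baseline lands at about 0.46, dominated by the 0.65² term from the 5-star class, which is consistent with the range quoted above.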

### Baseline Comparisons

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# 1. Basic metrics
y_pred = log_reg.predict(X_test_vec)

print("Evaluation on Test Set:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:\n")
print(classification_report(y_test, y_pred, digits=3))

# 2. Confusion matrix heatmap
labels = [1, 2, 3, 4, 5]
cm = confusion_matrix(y_test, y_pred, labels=labels)

plt.figure(figsize=(6, 5))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=labels,
    yticklabels=labels
)
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Confusion Matrix – Logistic Regression + TF-IDF")
plt.tight_layout()
plt.show()

# 3. True vs predicted label distribution
true_counts = np.bincount(y_test.values.astype(int), minlength=6)  
pred_counts = np.bincount(y_pred.astype(int),        minlength=6)

labels = [1, 2, 3, 4, 5]
x = np.arange(len(labels))
width = 0.35

plt.figure(figsize=(8, 4))
plt.bar(x - width/2, true_counts[1:], width, label='True')
plt.bar(x + width/2, pred_counts[1:], width, label='Predicted')
plt.xticks(x, labels)
plt.xlabel("Star rating")
plt.ylabel("Count")
plt.title("True vs Predicted Rating Distribution")
plt.legend()
plt.tight_layout()
plt.show()


### Results Analysis

The overall accuracy of the model can be read directly from the classification report. We expect it to fall somewhere between sixty and seventy five percent, which is typical for TF-IDF based text classification on large review datasets; note that anything at or below roughly sixty five percent would mean the model does not beat the majority-class baseline. The per class results show clear patterns. Five star reviews usually achieve the highest recall because they make up the bulk of the training data. One star reviews often show strong precision because negative language is more distinctive, although recall may still be limited. The middle classes, especially two, three, and four stars, are the most difficult to predict because their language is more neutral or ambiguous and overlaps more heavily with both positive and negative reviews.

The confusion matrix helps illustrate these effects. Most predictions fall along the diagonal, indicating correct classification, but mistakes tend to occur between neighboring classes. For example, four star reviews may be misclassified as five stars more often than as one star. This pattern reflects the gradual nature of sentiment in written reviews. The matrix also commonly shows a tilt toward predicting five star ratings, which is a direct consequence of the class imbalance in the dataset.
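One way to quantify the "neighboring classes" pattern is to measure what share of the errors are off by exactly one star. The matrix below is hypothetical; with real results, the confusion matrix computed on the test set would be used instead.

```python
import numpy as np

# Hypothetical 5x5 confusion matrix (rows: true 1-5 stars, columns: predicted).
toy_cm5 = np.array([
    [80, 10,  5,  2,   3],
    [20, 15, 10,  3,   2],
    [10,  8, 20, 12,  10],
    [ 3,  2, 10, 40,  45],
    [ 2,  1,  4, 30, 463],
])

true_idx, pred_idx = np.indices(toy_cm5.shape)
distance = np.abs(true_idx - pred_idx)   # 0 on the diagonal, 1 for neighbors, ...

total_errors = toy_cm5[distance > 0].sum()
adjacent_errors = toy_cm5[distance == 1].sum()
extreme_errors = toy_cm5[distance == 4].sum()   # 1-star vs 5-star confusions

print(f"Share of errors off by one star: {adjacent_errors / total_errors:.1%}")
print(f"Share of errors confusing 1 and 5 stars: {extreme_errors / total_errors:.1%}")
```

A high off-by-one share together with a small 1-versus-5 share would support the claim that the model's mistakes follow the ordinal structure of the ratings.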

The comparison between the true and predicted rating distributions provides another perspective on model performance. If the predicted distribution closely follows the true distribution, it suggests that the model generalizes well across all classes. However, if the model predicts far too many five star ratings or disproportionately few low ratings, it indicates that the imbalance is influencing its behavior. This type of comparison helps identify whether the model is learning meaningful patterns or simply leaning toward the majority class.

---

## 5. Related Work and Discussion

### Prior Work on Amazon Review Data
**Citation**:
> Mishra, M., Chopde, J., Shah, M., Parikh, P., Babu, R. C., & Woo, J. (2019). Big Data Predictive Analysis of Amazon Product Review. Cal State LA. [Link](https://www.calstatela.edu/sites/default/files/amazonprodreviewapic-ist2019.pdf)

The Amazon Reviews dataset has been used in many past projects, mostly to study customer behavior or perform simple sentiment analysis. “Big Data Predictive Analysis of Amazon Product Review,” by Mishra et al., lists several prior works on this kind of data. For example, Bhavesh analyzed reviews from a single category (baby products) and focused on classifying reviews as positive or negative. That work stayed within basic sentiment tasks and did not try to predict star ratings or examine multiple product types. Max worked with more Amazon categories, but the project focused mainly on descriptive analysis using Spark, such as plots, summaries, and statistics, and did not build any predictive models. Mishra et al. explain why Amazon review data matters in online shopping: since customers cannot see products in person, they depend heavily on reviews and ratings. Their project uses Big Data tools to process the data and builds a recommendation system that predicts what items a user may like.

### Comparison with Our Results

In contrast, our work focuses on building a predictive model that estimates a review's star rating directly from its text. Our project moves beyond binary sentiment labels by training a five-class model to capture finer distinctions in customer opinions. We use logistic regression with TF-IDF features and compare it against majority-class and random baselines; more expressive methods such as neural networks are left as future work. Our analysis is limited to the Appliances category, and we evaluate the model with accuracy, confusion matrices, and error analysis to understand where predictions succeed or fail.

### What We Learned

Through this project we learned how preprocessing choices, such as tokenization and feature representation, shape the performance of text-based predictive models. We found that simple baseline models can perform surprisingly well when combined with well-engineered features, while more complex models require careful tuning to avoid overfitting. We also learned how imbalanced ratings affect both training and evaluation, and we saw the importance of validating our models with proper metrics rather than relying only on accuracy. Finally, we gained experience integrating the full workflow from exploratory analysis to modeling, evaluation, and interpretation.

### Conclusion

Overall, our project shows that star rating prediction from review text is a feasible and informative task. Our models capture meaningful patterns in customer language and outperform trivial baselines, demonstrating that even relatively simple methods can provide strong predictive ability. Compared to prior work, our approach combines predictive modeling, evaluation, and analysis rather than stopping at sentiment labels or descriptive statistics. The project highlights the value of machine learning in understanding online review data and provides a foundation for future extensions such as personalized recommendations or category-specific models.

---

## References

1. Hou, Y., Li, J., He, Z., Yan, A., Chen, X., & McAuley, J. (2023). Bridging Language and Items for Retrieval and Recommendation. *arXiv preprint*. https://arxiv.org/abs/2305.14385

2. Sahoo, P. Amazon Review Rating Prediction. GitHub repository. https://github.com/pallavrajsahoo/Amazon-Review-Rating-Prediction

3. McAuley, J. Amazon Review Data (2023). UCSD. https://cseweb.ucsd.edu/~jmcauley/datasets.html

4. Mishra, M., Chopde, J., Shah, M., Parikh, P., Babu, R. C., & Woo, J. (2019). Big Data Predictive Analysis of Amazon Product Review. Cal State LA. https://www.calstatela.edu/sites/default/files/amazonprodreviewapic-ist2019.pdf