Restaurants receive thousands of customer reviews, but star ratings alone fail to explain why ratings increase or decline. Reviews often contain rich information about food quality, service, pricing, ambience, and operational factors, yet this information remains unstructured and difficult to analyze at scale. This limits restaurants’ ability to identify the drivers of customer satisfaction and dissatisfaction.

Notebook 2: Sentiment Prediction and Aspect Extraction

This notebook applies previously trained and saved models to perform large-scale sentiment prediction and topic (aspect) extraction on Yelp restaurant reviews.

Specifically, a fine-tuned transformer-based sentiment classification model is used to predict customer sentiment from review text, while a BERTopic-based topic model identifies the key aspects customers discuss in their reviews. The outputs from both models are combined with business identifiers, review text, star ratings, and restaurant operational attributes to construct a unified analytical dataset (aspect_df).

This dataset serves as the foundation for downstream analysis aimed at explaining why restaurant ratings vary beyond star scores alone.

In [None]:
!pip install pyspark



In [None]:
import pyspark
print("PySpark installed and imported successfully!")

PySpark installed and imported successfully!


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("YelpAnalysis") \
    .config("spark.driver.memory", "8g") \
    .getOrCreate()


Loading the data

In [None]:
business_path = "/content/drive/MyDrive/yelp_dataset/yelp_academic_dataset_business.json"

business_df = spark.read.json(business_path)
business_df.printSchema()


root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: string (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: string (nullable = true)
 |    |-- BYOB: string (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: string (nullable = true)
 |    |-- BikeParking: string (nullable = true)
 |    |-- BusinessAcceptsBitcoin: string (nullable = true)
 |    |-- BusinessAcceptsCreditCards: string (nullable = true)
 |    |-- BusinessParking: string (nullable = true)
 |    |-- ByAppointmentOnly: string (nullable = true)
 |    |-- Caters: string (nullable = true)
 |    |-- CoatCheck: string (nullable = true)
 |    |-- Corkage: string (nullable = true)
 |    |-- DietaryRestrictions: string (nullable = true)
 |    |-- DogsAllowed: string (nullable = true)
 |    |-- DriveThru: string (nullable = true)
 |    |-- GoodForDancing: str

In [None]:
review_path = "/content/drive/MyDrive/yelp_dataset/yelp_academic_dataset_review.json"

reviews_df = spark.read.json(review_path)
reviews_df.printSchema()


root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: double (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)



Clean and Filter Data

In [None]:
# Mandatory: drop businesses that are missing an ID or categories
from pyspark.sql.functions import col

business_df_clean = business_df.filter(
    col("business_id").isNotNull() &
    col("categories").isNotNull()
)


In [None]:
# Keep only businesses whose categories mention "Restaurants"
restaurants_df = business_df_clean.filter(
    col("categories").contains("Restaurants")
)
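
Note that contains("Restaurants") is a substring match on the comma-separated categories string, so it also accepts any category whose name merely includes that word. The plain-Python sketch below contrasts the two behaviours. The helper name has_category and the sample strings are made up for illustration and are not part of the notebook.

```python
def has_category(categories, wanted="Restaurants"):
    """Return True if `wanted` is one of the comma-separated category tokens."""
    if categories is None:
        return False
    tokens = [t.strip() for t in categories.split(",")]
    return wanted in tokens

# Made-up category strings, for illustration only.
samples = [
    "Restaurants, Mexican",
    "Pop-Up Restaurants, Event Planning & Services",
    "Shopping, Fashion",
    None,
]

# Substring match (what `contains` does) versus exact token match.
substring_hits = [s is not None and "Restaurants" in s for s in samples]
exact_hits = [has_category(s) for s in samples]

print(substring_hits)
print(exact_hits)
```

Whether the broader substring match is acceptable depends on the analysis. If an exact match is wanted in Spark, something like array_contains(split(col("categories"), ", "), "Restaurants") should behave like the helper above, though that variant has not been run against this dataset.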

In [None]:
# Mandatory: drop reviews that are missing any required field
reviews_df_clean = reviews_df.filter(
    col("review_id").isNotNull() &
    col("business_id").isNotNull() &
    col("text").isNotNull() &
    col("stars").isNotNull()
)


In [None]:
# Sample roughly 5% of all reviews (fixed seed for reproducibility)
reviews_sample = reviews_df_clean.sample(fraction=0.05, seed=42)
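
Spark's sample(fraction=...) keeps each row independently with the given probability, so the result is only approximately 5% of the input. The sample is also taken before the join with restaurants, so it is 5% of all reviews rather than 5% of restaurant reviews. The sketch below illustrates the idea in plain Python. It is not Spark's implementation, and bernoulli_sample is a hypothetical helper.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

rows = list(range(100_000))  # stand-in for review rows

sample_a = bernoulli_sample(rows, fraction=0.05, seed=42)
sample_b = bernoulli_sample(rows, fraction=0.05, seed=42)

# The size is close to 5,000 but is not guaranteed to be exactly 5,000.
print(len(sample_a))
# Re-running with the same seed on the same input gives the same sample.
print(sample_a == sample_b)
```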

In [None]:
# Keep only reviews that belong to restaurants
restaurant_reviews_df = reviews_sample.join(
    restaurants_df.select("business_id"),
    on="business_id",
    how="inner"
)



Select Useful Fields for Sentiment Analysis

In [None]:
sentiment_df = restaurant_reviews_df.select("review_id", "business_id", "text", "stars")

In [None]:
from pyspark.sql.functions import col, when

sentiment_df = sentiment_df.withColumn(
    "label",
    when(col("stars") <= 2, 0)  # Negative
    .when(col("stars") == 3, 1)  # Neutral
    .otherwise(2)  # Positive
)
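
The same star-to-label rule can be written as a small plain-Python function, which is handy for spot-checking the encoding outside Spark. The names star_to_label and LABEL_NAMES are illustrative and do not appear elsewhere in the notebook.

```python
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}

def star_to_label(stars):
    """Mirror the when/otherwise chain: <=2 -> 0, ==3 -> 1, else 2."""
    if stars <= 2:
        return 0
    if stars == 3:
        return 1
    return 2

labels = [star_to_label(s) for s in [1.0, 2.0, 3.0, 4.0, 5.0]]
for stars, label in zip([1.0, 2.0, 3.0, 4.0, 5.0], labels):
    print(stars, label, LABEL_NAMES[label])
```

One difference worth keeping in mind: in the Spark expression a null star value would fall through to otherwise(2) and be labelled positive. That does not happen here only because rows with null stars were filtered out earlier.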



In [None]:
!pip install transformers datasets accelerate



Load tokenizer

In [None]:
from transformers import AutoTokenizer

MODEL_DIR = "/content/drive/MyDrive/restaurant_sentiment_model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)


Load Model

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


Sentiment prediction function

In [None]:
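import torch
import numpy as np

def predict_sentiment(texts, batch_size=16):
    """Return an array of predicted class ids, one per input text."""
    all_preds = []

    # Process the texts in fixed-size batches to bound memory use
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        # Pad to the longest text in the batch; truncate to 128 tokens
        inputs = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=128,
            return_tensors="pt"
        )

        # Inference only, so gradients are not needed
        with torch.no_grad():
            outputs = model(**inputs)
            preds = torch.argmax(outputs.logits, dim=1)

        all_preds.extend(preds.cpu().numpy())

    return np.array(all_preds)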


SENTIMENT + ASPECT EXTRACTION

Get reviews + business_id

In [None]:
# Limit in Spark so that only 10,000 rows are collected to the driver
reviews_pdf = sentiment_df.select("business_id", "text").limit(10000).toPandas()
reviews_texts = reviews_pdf["text"].tolist()

In [None]:
!pip install -U bertopic

Collecting bertopic
  Downloading bertopic-0.17.4-py3-none-any.whl.metadata (24 kB)
Downloading bertopic-0.17.4-py3-none-any.whl (154 kB)
Installing collected packages: bertopic
Successfully installed bertopic-0.17.4


Load aspect extraction model

In [None]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
MODEL_PATH = "/content/drive/MyDrive/restaurant_sentiment_model/aspect"
topic_model = BERTopic.load(
    MODEL_PATH,
    embedding_model=embedding_model
)


Obtain the topics from reviews

In [None]:
topics, probs = topic_model.transform(reviews_texts)

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

2025-12-25 11:40:52,777 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2025-12-25 11:41:01,803 - BERTopic - Dimensionality - Completed ✓
2025-12-25 11:41:01,804 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2025-12-25 11:41:02,429 - BERTopic - Cluster - Completed ✓
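
BERTopic assigns the topic id -1 to documents it treats as outliers, so it is worth checking how many reviews ended up there before interpreting the topics. The sketch below uses a short stand-in list instead of the real topics output. The variable names are illustrative.

```python
from collections import Counter

topics_demo = [0, 3, -1, 0, 1, -1, 0, 3]  # stand-in for `topics`

# Count reviews per topic id (int() also handles NumPy integer types).
topic_counts = Counter(int(t) for t in topics_demo)

# Share of reviews that were not assigned to any topic.
outlier_share = topic_counts[-1] / len(topics_demo)

# Most frequent real topics, skipping the outlier bucket.
top_topics = [(t, n) for t, n in topic_counts.most_common() if t != -1][:3]

print(outlier_share)
print(top_topics)
```

With the real model loaded, topic_model.get_topic_info() lists each topic id alongside its size and representative words, which helps map ids to readable aspect names.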


Sentiment prediction using saved sentiment model

In [None]:
sentiment_labels = predict_sentiment(reviews_texts)
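
The returned array holds integer class ids. Assuming the saved model was fine-tuned with the same encoding as the label column defined earlier (0 = negative, 1 = neutral, 2 = positive), which this notebook does not verify, the ids can be turned into names and tallied as sketched below. ID2NAME and the stand-in preds list are illustrative.

```python
from collections import Counter

# Assumed encoding; check it against the model's training setup.
ID2NAME = {0: "negative", 1: "neutral", 2: "positive"}

preds = [2, 2, 0, 1, 2, 0]  # stand-in for `sentiment_labels`

# Map ids to names and compute the share of each class.
names = [ID2NAME[int(p)] for p in preds]
counts = Counter(names)
shares = {name: counts[name] / len(names) for name in ID2NAME.values()}

print(counts)
print(shares)
```

If the saved configuration carries a label mapping, model.config.id2label is a more reliable source than a hard-coded dictionary.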


Combine aspect extraction topics with sentiment prediction labels

In [None]:
import pandas as pd

aspect_df = pd.DataFrame({
    "business_id": reviews_pdf["business_id"],
    "review": reviews_texts,
    "topic": topics,
    "sentiment": sentiment_labels
})
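
One simple way to use the combined table is a topic-by-sentiment cross-tabulation, which shows how sentiment is distributed within each aspect. The sketch below runs on a tiny made-up frame with the same topic and sentiment columns. It is one possible summary, not the analysis performed downstream.

```python
import pandas as pd

# Made-up rows with the same column names as aspect_df.
demo = pd.DataFrame({
    "topic":     [0, 0, 0, 1, 1, -1],
    "sentiment": [2, 2, 0, 0, 1, 2],
})

# Rows: topic ids; columns: sentiment ids;
# values: share of reviews within each topic (each row sums to 1).
topic_sentiment = pd.crosstab(
    demo["topic"], demo["sentiment"], normalize="index"
)

print(topic_sentiment)
```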


Merge business attributes

In [None]:
from pyspark.sql.functions import col, coalesce, lit

# Flatten the nested attributes struct dynamically
attribute_cols = business_df.select("attributes.*").columns

# One column per attribute, with nulls replaced by "Unknown"
business_attrs = business_df.select(
    "business_id",
    *[
        coalesce(col(f"attributes.{c}"), lit("Unknown")).alias(c)
        for c in attribute_cols
    ]
)

# Convert to Pandas
business_attrs_pd = business_attrs.toPandas()

# Merge with aspect_df (left join keeps every review row)
aspect_df = aspect_df.merge(
    business_attrs_pd,
    on="business_id",
    how="left"
)
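
A left merge like this can silently duplicate review rows if the attribute table ever contains repeated business_id values, and it leaves NaN (not "Unknown") for any business without a match. The sketch below shows how the validate and indicator options of pandas merge can guard against both. The frames and the WiFi values are made up for illustration.

```python
import pandas as pd

# Made-up review rows and attribute rows.
reviews = pd.DataFrame({
    "business_id": ["a", "a", "b", "c"],
    "topic": [0, 1, 0, 2],
})
attrs = pd.DataFrame({
    "business_id": ["a", "b"],
    "WiFi": ["free", "Unknown"],
})

merged = reviews.merge(
    attrs,
    on="business_id",
    how="left",
    validate="many_to_one",  # raises if attrs repeats a business_id
    indicator=True,          # adds `_merge`: "both" or "left_only"
)

# Reviews whose business had no attribute row.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

In this notebook both tables come from the same business file, so no unmatched rows are expected, but the check is cheap to keep.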


In [None]:
PATH = "/content/drive/MyDrive/restaurant_sentiment_model/aspect_df.pkl"

# Save the combined dataset for downstream analysis
aspect_df.to_pickle(PATH)
