**Model 3** - Embeddings + GMM

- Following on from Model 1, where TF-IDF + KNN failed to determine user sentiment effectively, I will now explore a more advanced unsupervised approach: sentence embeddings clustered with a Gaussian mixture model (GMM), with the aim of producing clusters that align more closely with sentiment.


## Import + Configuration setup

In [1]:
import numpy as np
import pandas as pd

from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import normalize
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

RANDOM_SEED = 42
rng = np.random.default_rng(RANDOM_SEED)

K = 5
MIN_REVIEWS = 30     # cluster only products with >= this many usable reviews
TEST_FRAC = 0.10     # 90/10 split within product
TEXT_COL = "text_model"   # change if yours is different
ID_COL = "item_id"


### LOAD DATA

In [2]:
from pathlib import Path
import pandas as pd

# Search from the notebook folder downward + one level up
roots = [Path.cwd(), Path.cwd().parent]

candidates = []
for root in roots:
    candidates += list(root.rglob("reviews_stitched.csv"))
    candidates += list(root.rglob("reviews_stitched.csv.gz"))
    candidates += list(root.rglob("user_reviews.csv"))
    candidates += list(root.rglob("user_reviews.csv.gz"))

if not candidates:
    raise FileNotFoundError("Could not find reviews_stitched/user_reviews csv(.gz) under current project folders.")

# pick the biggest file (usually the real dataset)
candidates = sorted(candidates, key=lambda p: p.stat().st_size, reverse=True)
path = candidates[0]

print("Found:", path.resolve())
df_raw = pd.read_csv(path, low_memory=False)
print("df_raw shape:", df_raw.shape)
print("Columns:", len(df_raw.columns))



Found: C:\Users\Ben_h\Ironhack\Ironhack-week6\ai-user-reviews-agg-project\data\user_reviews.csv
df_raw shape: (67992, 28)
Columns: 28


### Load Cleaned Model Dataframe

In [3]:
import re

# product id
if "id" not in df_raw.columns:
    raise KeyError("Expected product id column 'id' not found.")
df_raw["item_id"] = df_raw["id"].astype(str)

# text_model = title + text
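# fall back to an empty Series (not a plain str) so the .str accessor below still works
empty = pd.Series("", index=df_raw.index)
title = df_raw["reviews.title"].fillna("").astype(str) if "reviews.title" in df_raw.columns else empty
text  = df_raw["reviews.text"].fillna("").astype(str)  if "reviews.text" in df_raw.columns else empty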

df_raw["text_model"] = (title.str.strip() + ". " + text.str.strip()).str.strip()
df_raw["text_model"] = df_raw["text_model"].str.replace(r"^\.\s*", "", regex=True)

def normalize_text(s: str) -> str:
    s = (s or "").lower().strip()
    s = re.sub(r"\s+", " ", s)
    return s

df_raw["text_model"] = df_raw["text_model"].map(normalize_text)

# keep only what we need in THIS notebook
keep = ["item_id", "text_model"]

# optional metadata for reporting
for c in ["name", "brand", "categories"]:
    if c in df_raw.columns:
        keep.append(c)

# evaluation-only rating (never used in embeddings)
if "reviews.rating" in df_raw.columns:
    keep.append("reviews.rating")

df_model = df_raw[keep].copy()

print("df_model shape:", df_model.shape)
print("df_model columns:", df_model.columns.tolist())
display(df_model.head(5))



df_model shape: (67992, 6)
df_model columns: ['item_id', 'text_model', 'name', 'brand', 'categories', 'reviews.rating']


| | item_id | text_model | name | brand | categories | reviews.rating |
|---|---|---|---|---|---|---|
| 0 | AVqkIhwDv8e3D1O-lebb | kindle. this product so far has not disappoint... | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... | 5.0 |
| 1 | AVqkIhwDv8e3D1O-lebb | very fast. great for beginner or experienced p... | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... | 5.0 |
| 2 | AVqkIhwDv8e3D1O-lebb | beginner tablet for our 9 year old son.. inexp... | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... | 5.0 |
| 3 | AVqkIhwDv8e3D1O-lebb | good!!!. i've had my fire hd 8 two weeks now a... | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... | 4.0 |
| 4 | AVqkIhwDv8e3D1O-lebb | fantastic tablet for kids. i bought this for m... | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... | 5.0 |


### Filter usable rows


In [4]:
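K = 3   # overrides K = 5 from the configuration cell: one cluster per sentiment class (negative/neutral/positive)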
MIN_REVIEWS = 30
TEST_FRAC = 0.10
RANDOM_SEED = 42

df_model["text_len"] = df_model["text_model"].fillna("").astype(str).str.len()

# basic usable text filter (tune if needed)
df_cluster = df_model[df_model["text_len"] >= 10].copy()

counts = df_cluster["item_id"].value_counts()
eligible_items = counts[counts >= MIN_REVIEWS].index
df_cluster = df_cluster[df_cluster["item_id"].isin(eligible_items)].copy()

print("Eligible items:", len(eligible_items))
print("Eligible rows:", len(df_cluster))


Eligible items: 42
Eligible rows: 67618


### SPLIT 90/10

In [5]:
import numpy as np

assert "df_cluster" in globals(), "df_cluster not defined. Run the filtering cell first."
assert "item_id" in df_cluster.columns, df_cluster.columns.tolist()

rng = np.random.default_rng(42)

def split_within_item(g, test_frac=0.10):
    n = len(g)
    test_n = max(1, int(round(n * test_frac)))
    idx = g.index.to_numpy()
    test_idx = rng.choice(idx, size=test_n, replace=False)
    g = g.copy()
    g["split"] = "train"
    g.loc[test_idx, "split"] = "test"
    return g

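# note: in this run the grouping column item_id is missing from the result
# (see the KeyError below); it is restored from df_cluster in a later cell
df_split = (
    df_cluster
    .groupby("item_id", group_keys=False)
    .apply(split_within_item, test_frac=0.10)
)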

KEEP_COLS = ["item_id", "text_model", "name", "brand", "categories", "reviews.rating", "text_len", "split"]
df_split = df_split[KEEP_COLS].copy()


print(df_split["split"].value_counts())
print("df_split columns:", df_split.columns.tolist())




KeyError: "['item_id'] not in index"

In [6]:
print("df_split columns:", df_split.columns.tolist())
print("index names:", df_split.index.names)


df_split columns: ['text_model', 'name', 'brand', 'categories', 'reviews.rating', 'text_len', 'split']
index names: [None]
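
The missing `item_id` column is consistent with `groupby(...).apply(...)` excluding the grouping column from the frames it passes to the function, a behaviour that depends on the pandas version (an assumption here, since the notebook does not record the version). A minimal sketch of a split that avoids `apply` altogether, so the column cannot be dropped, is shown below. `add_split_column` is a hypothetical helper, not part of the notebook, and it assumes the DataFrame index is unique.

```python
import numpy as np
import pandas as pd

def add_split_column(df, id_col="item_id", test_frac=0.10, seed=42):
    """Return a copy of df with a 'split' column sampled within each id_col group."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["split"] = "train"
    # .groups maps each item id to the index labels of its rows
    for _, idx in out.groupby(id_col).groups.items():
        test_n = max(1, int(round(len(idx) * test_frac)))
        test_idx = rng.choice(idx.to_numpy(), size=test_n, replace=False)
        out.loc[test_idx, "split"] = "test"
    return out

# toy data: 20 reviews for item "a", 10 for item "b"
toy = pd.DataFrame({
    "item_id": ["a"] * 20 + ["b"] * 10,
    "text_model": ["some review text"] * 30,
})
toy_split = add_split_column(toy)
print(toy_split.groupby(["item_id", "split"]).size())
```

Because the original frame is only copied and annotated, every column (including `item_id`) survives, and the later repair cell would not be needed.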


## Build Embeddings

In [7]:
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

texts = df_split["text_model"].tolist()

# encode once; the rows of E follow the row order of df_split
E = embedder.encode(texts, batch_size=64, show_progress_bar=True, convert_to_numpy=True)

# L2-normalize each row so every embedding has unit length
E = normalize(E)

print("Embeddings:", E.shape)



Batches: 100%|██████████| 1057/1057 [03:12<00:00,  5.50it/s]


Embeddings: (67618, 384)
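
`normalize(E)` rescales each row to unit L2 norm. One consequence worth seeing once is that the dot product of two normalized rows equals their cosine similarity. The sketch below checks this on random stand-in vectors (not the real review embeddings); the array `X` and its size are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 384))   # stand-in for raw 384-dimensional embeddings
X_unit = normalize(X)           # default: L2 norm, applied per row

# cosine similarity computed from the raw vectors
norms = np.linalg.norm(X, axis=1)
cosine_manual = (X @ X.T) / (norms[:, None] * norms[None, :])

# plain dot product of the normalized vectors
cosine_dot = X_unit @ X_unit.T

print("row norms after normalize:", np.linalg.norm(X_unit, axis=1).round(6))
print("dot product matches cosine:", np.allclose(cosine_manual, cosine_dot))
```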


In [21]:
print("df_split rows:", len(df_split), "E rows:", E.shape[0])
print("df_split columns:", df_split.columns.tolist())


df_split rows: 67618 E rows: 67618
df_split columns: ['text_model', 'name', 'brand', 'categories', 'reviews.rating', 'text_len', 'split', 'cluster']


In [24]:
print("has item_id?", "item_id" in df_split.columns)
print(df_split.columns.tolist())


has item_id? False
['text_model', 'name', 'brand', 'categories', 'reviews.rating', 'text_len', 'split', 'cluster']


In [25]:
# pick the right source df that definitely has item_id:
SOURCE = df_cluster  # or df_raw, whichever has item_id aligned to df_split.index

df_split["item_id"] = SOURCE.loc[df_split.index, "item_id"].values
assert "item_id" in df_split.columns


### Fit GMM - per product

In [26]:
from sklearn.mixture import GaussianMixture
import pandas as pd

df_split["cluster"] = -1

# align df rows to embedding rows
row_pos = pd.Series(range(len(df_split)), index=df_split.index)

for item_id, g in df_split.groupby("item_id"):
    g_train = g[g["split"] == "train"]
    g_test  = g[g["split"] == "test"]

    if len(g_train) < K:
        continue

    X_train_item = E[row_pos.loc[g_train.index].values]

    gmm = GaussianMixture(
        n_components=K,
        covariance_type="diag",
        reg_covar=1e-5,
        random_state=RANDOM_SEED
    )
    gmm.fit(X_train_item)

    df_split.loc[g_train.index, "cluster"] = gmm.predict(X_train_item)

    if len(g_test) > 0:
        X_test_item = E[row_pos.loc[g_test.index].values]
        df_split.loc[g_test.index, "cluster"] = gmm.predict(X_test_item)

print("Assigned:", int((df_split["cluster"] >= 0).sum()))
print("Unassigned:", int((df_split["cluster"] < 0).sum()))
print(df_split["cluster"].value_counts(dropna=False))




Assigned: 67618
Unassigned: 0
cluster
1    25630
0    21342
2    20646
Name: count, dtype: int64
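
The notebook fixes `K = 3` so that the clusters can be mapped onto three sentiment classes. As a hedged aside, one common way to check how many components the data itself supports is to compare the Bayesian information criterion (BIC) across several values of K, where lower is better. The sketch below uses synthetic 2-D blobs rather than the review embeddings, and `candidate_ks` is an illustrative choice, not something taken from the notebook.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# three well-separated synthetic blobs, 200 points each
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(200, 2)) for c in centers])

candidate_ks = [1, 2, 3, 4, 5]
bic = {}
for k in candidate_ks:
    gmm = GaussianMixture(
        n_components=k,
        covariance_type="diag",
        reg_covar=1e-5,
        random_state=42,
    )
    gmm.fit(X)
    bic[k] = gmm.bic(X)   # lower BIC = better trade-off between fit and complexity

best_k = min(bic, key=bic.get)
print({k: round(v, 1) for k, v in bic.items()})
print("Lowest BIC at K =", best_k)
```

If applied per product, a BIC-preferred K other than 3 would be one more sign that the clusters are tracking something other than three-way sentiment.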


### Evaluation

In [27]:
import numpy as np
import pandas as pd

if "reviews.rating" not in df_split.columns:
    raise KeyError("No 'reviews.rating' found. df_eval cannot be created for scoring.")

def rating_to_sentiment(r):
    if pd.isna(r):
        return np.nan
    r = float(r)
    if r <= 2:
        return "negative"
    if r == 3:
        return "neutral"
    return "positive"

# df_eval = only rows we can evaluate
df_eval = df_split[
    (df_split["cluster"] >= 0) &
    (df_split["reviews.rating"].notna())
].copy()

df_eval["true_sentiment"] = df_eval["reviews.rating"].map(rating_to_sentiment)

# per-item mapping: clusters sorted by mean rating within that item
maps = []
for item_id, g in df_eval.groupby("item_id"):
    means = g.groupby("cluster")["reviews.rating"].mean().sort_values()
    if len(means) < 3:
        continue
    order = means.index.tolist()
    maps.append([item_id, order[0], order[1], order[2]])

item_map = pd.DataFrame(maps, columns=["item_id", "neg_c", "neu_c", "pos_c"])

df_eval = df_eval.merge(item_map, on="item_id", how="inner")

def cluster_to_sentiment_row(row):
    if row["cluster"] == row["neg_c"]:
        return "negative"
    if row["cluster"] == row["neu_c"]:
        return "neutral"
    if row["cluster"] == row["pos_c"]:
        return "positive"
    return np.nan

df_eval["pred_sentiment"] = df_eval.apply(cluster_to_sentiment_row, axis=1)

print("df_eval shape:", df_eval.shape)
print(df_eval[["true_sentiment","pred_sentiment"]].head(5))



df_eval shape: (67618, 14)
  true_sentiment pred_sentiment
0       positive       positive
1       positive       negative
2       positive        neutral
3       positive       positive
4       positive        neutral
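
The per-item mapping above can also be sketched without the Python loop and the row-wise `apply`: rank each item's clusters by mean rating and label the lowest, middle, and highest as negative, neutral, and positive. `map_clusters_to_sentiment` and `LABELS` are hypothetical names, and the toy frame stands in for `df_eval`. Note that the ratings are used to name the clusters, so this evaluation is not fully label-free.

```python
import pandas as pd

LABELS = {0: "negative", 1: "neutral", 2: "positive"}

def map_clusters_to_sentiment(df, id_col="item_id", cluster_col="cluster",
                              rating_col="reviews.rating"):
    """Label each item's clusters by the rank of their mean rating."""
    means = (
        df.groupby([id_col, cluster_col])[rating_col]
        .mean()
        .rename("mean_rating")
        .reset_index()
    )
    # keep only items that have all three clusters, as the loop above does
    n_clusters = means.groupby(id_col)[cluster_col].transform("count")
    means = means[n_clusters == 3].copy()
    # rank within item: 0 = lowest mean rating, 2 = highest
    rank = means.groupby(id_col)["mean_rating"].rank(method="first").astype(int) - 1
    means["pred_sentiment"] = rank.map(LABELS)
    return df.merge(
        means[[id_col, cluster_col, "pred_sentiment"]],
        on=[id_col, cluster_col],
        how="inner",
    )

# toy item: cluster 0 averages 5.0, cluster 1 averages 1.5, cluster 2 averages 3.0
toy = pd.DataFrame({
    "item_id": ["a"] * 6,
    "cluster": [0, 0, 1, 1, 2, 2],
    "reviews.rating": [5, 5, 1, 2, 3, 3],
})
toy_eval = map_clusters_to_sentiment(toy)
print(toy_eval[["cluster", "pred_sentiment"]].drop_duplicates())
```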


### Plot Results

In [28]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = df_eval["true_sentiment"]
y_pred = df_eval["pred_sentiment"]

print("Accuracy:", round(accuracy_score(y_true, y_pred), 4))
print("\nClassification report:\n", classification_report(y_true, y_pred, digits=4))

labels = ["negative","neutral","positive"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"true_{l}" for l in labels], columns=[f"pred_{l}" for l in labels])
display(cm_df)


Accuracy: 0.3465

Classification report:
               precision    recall  f1-score   support

    negative     0.0547    0.5308    0.0991      2466
     neutral     0.0419    0.3104    0.0738      2880
    positive     0.9503    0.3409    0.5018     62272

    accuracy                         0.3465     67618
   macro avg     0.3489    0.3940    0.2249     67618
weighted avg     0.8789    0.3465    0.4689     67618



| | pred_negative | pred_neutral | pred_positive |
|---|---|---|---|
| true_negative | 1309 | 742 | 415 |
| true_neutral | 1290 | 894 | 696 |
| true_positive | 21341 | 19702 | 21229 |
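
To put these numbers in context, the reported accuracy and macro F1 can be recomputed directly from the confusion matrix and compared with a trivial baseline that labels every review positive. This is a worked-arithmetic sketch using only the counts printed above; the variable names are illustrative.

```python
import numpy as np

# rows = true (negative, neutral, positive), columns = predicted (same order)
cm = np.array([
    [1309,    742,   415],
    [1290,    894,   696],
    [21341, 19702, 21229],
])

total = cm.sum()
support = cm.sum(axis=1)          # true count per class
predicted = cm.sum(axis=0)        # predicted count per class
correct = np.diag(cm)

accuracy = correct.sum() / total
precision = correct / predicted
recall = correct / support
f1 = 2 * precision * recall / (precision + recall)
macro_f1 = f1.mean()

# baseline: predict "positive" for every review
baseline_accuracy = support[2] / total
baseline_f1_positive = 2 * support[2] / (total + support[2])   # precision = support/total, recall = 1
baseline_macro_f1 = baseline_f1_positive / 3                   # the other two classes score F1 = 0

print("GMM accuracy:", round(accuracy, 4), "macro F1:", round(macro_f1, 4))
print("All-positive accuracy:", round(baseline_accuracy, 4), "macro F1:", round(baseline_macro_f1, 4))
print("Predicted label counts:", predicted.tolist())
```

The predicted counts are close to an even three-way split, while the true labels are dominated by the positive class, which is why precision on negative and neutral is so low.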


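### Final evaluation of unsupervised learning

Unsupervised result: The clusters capture structure (topic/style/length) but do not consistently align with sentiment labels. Against rating-derived sentiment, accuracy is 0.3465 and macro F1 is 0.2249. Precision is especially poor for negative (0.0547) and neutral (0.0419): positive reviews make up 62,272 of the 67,618 rows (about 92%), yet the predictions are spread almost evenly across the three labels (23,940 / 21,338 / 22,340), so most reviews predicted as negative or neutral are actually positive. Label noise in the rating-derived sentiment adds to the mismatch.

Closing comment: Because the unsupervised clusters did not align reliably with sentiment (low macro F1, minority classes poorly separated), we switched to a supervised text-classification approach to obtain robust negative/neutral/positive predictions.

Decision: Switch to supervised learning.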