# Generate Star Labels (1–5)

This notebook generates star labels using `nlptown/bert-base-multilingual-uncased-sentiment` and writes them to `data/generate_stars/processed/reviews_with_stars.csv`. Text cleaning was assessed as unnecessary for this dataset.

### Step 1 — Environment & imports

Ensure `pandas`, `bs4`, `transformers`, and `torch` are importable. If any import fails, install the missing package into the active `venv`.

In [1]:
import importlib
mods = ["pandas","bs4","transformers","torch"]
for m in mods:
    try:
        importlib.import_module(m)
        print("OK", m)
    except Exception as e:
        print("FAIL", m, ":", e)
from pathlib import Path
import pandas as pd

import sys, torch, os
print("python:", sys.executable)
print("torch:", torch.__version__)
print("cuda_available:", torch.cuda.is_available())
print("cuda:", torch.version.cuda)
print("gpu:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")

OK pandas
OK bs4
OK transformers
OK torch
python: c:\Users\TARIK\venv\Scripts\python.exe
torch: 2.8.0+cu129
cuda_available: True
cuda: 12.9
gpu: NVIDIA GeForce RTX 5060 Laptop GPU
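
If an import fails, it can help to turn the list of failures into an install command for the interpreter that is actually running the notebook. The following is a minimal sketch under that assumption; `missing_modules` and `pip_names` are hypothetical helper names, not part of the notebook. It only prints the command and does not install anything.

```python
import importlib.util
import sys

def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Same module list as the cell above.
required = ["pandas", "bs4", "transformers", "torch"]
missing = missing_modules(required)

# Import names and PyPI names can differ: `bs4` is installed as `beautifulsoup4`.
pip_names = {"bs4": "beautifulsoup4"}

if missing:
    packages = " ".join(pip_names.get(n, n) for n in missing)
    # Using sys.executable targets the interpreter of the active venv.
    print(f"{sys.executable} -m pip install {packages}")
else:
    print("All required modules are importable.")
```

Building the command from `sys.executable` avoids installing into a different Python than the one the kernel uses.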


### Step 2, 3 and 4 — Config

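Define the model id, batch size, input and output paths, and random seed. The same cell then loads the unified reviews and scores every comment with the pretrained model, producing a continuous rating (`stars_float`) and an integer rating (`stars`).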

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
import torch

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

# Resolve project base directory (works from project root or notebooks/)
CWD = Path.cwd()
BASE_DIR = CWD if (CWD / "data" / "generate_stars" / "processed").exists() else CWD.parent
if not (BASE_DIR / "data" / "generate_stars" / "processed").exists():
    raise FileNotFoundError("Could not locate data/generate_stars/processed from current working directory")

class CFG:
    MODEL_ID = "nlptown/bert-base-multilingual-uncased-sentiment"
    BATCH_SIZE = 64  # GPU-enabled
    IN_PATH = BASE_DIR / "data" / "generate_stars" / "processed" / "reviews_unified.csv"
    OUT_WITH_STARS = BASE_DIR / "data" / "generate_stars" / "processed" / "reviews_with_stars.csv"
    SEED = 42

# Step 3 — Load data

df = pd.read_csv(CFG.IN_PATH)

# Step 4 — Generate stars (continuous + integer)
model = AutoModelForSequenceClassification.from_pretrained(CFG.MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(CFG.MODEL_ID)

device = 0 if torch.cuda.is_available() else -1
pipeline = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    task="sentiment-analysis",
    truncation=True,
    device=device,
)

label_to_int = {"1 star":1, "2 stars":2, "3 stars":3, "4 stars":4, "5 stars":5}

stars_float = []
stars = []
for i in range(0, len(df), CFG.BATCH_SIZE):
    texts = df["comment"].iloc[i:i+CFG.BATCH_SIZE].tolist()
    # top_k=None returns the full label distribution (replaces the deprecated return_all_scores=True)
    dists = pipeline(texts, batch_size=CFG.BATCH_SIZE, truncation=True, top_k=None)
    for dist in dists:
        ev = sum(label_to_int[d["label"]] * float(d["score"]) for d in dist)
        stars_float.append(ev)
        stars.append(int(min(5, max(1, round(ev)))))

df_out = df.copy()
# Keep float (rounded to 1 decimal) and an internal integer for training/EDA
df_out["stars_float"] = pd.Series(stars_float).round(1)
df_out["stars"] = pd.Series(stars).astype(int)

# Persist only float ratings in the CSV
cols_to_save = ["source", "place", "comment", "stars_float"]
CFG.OUT_WITH_STARS.parent.mkdir(parents=True, exist_ok=True)
df_out[cols_to_save].to_csv(CFG.OUT_WITH_STARS, index=False)
print("Wrote", CFG.OUT_WITH_STARS)


Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Wrote c:\Users\TARIK\Desktop\Charles Darwin University\4 - Year 1 - Semester 2\IT CODE FAIR\Data Science Challenge\data\generate_stars\processed\reviews_with_stars.csv
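
The continuous rating computed in the cell above is the probability-weighted mean of the five star labels, and the integer rating is that mean rounded and clamped to 1..5. The sketch below reproduces the arithmetic on a single distribution. The distribution is invented for illustration and is not model output; `expected_stars` and `to_int_stars` are hypothetical helper names.

```python
LABEL_TO_INT = {"1 star": 1, "2 stars": 2, "3 stars": 3, "4 stars": 4, "5 stars": 5}

def expected_stars(dist):
    """Probability-weighted mean star value for one review."""
    return sum(LABEL_TO_INT[d["label"]] * float(d["score"]) for d in dist)

def to_int_stars(ev):
    """Round to the nearest integer and clamp to the range 1..5."""
    return int(min(5, max(1, round(ev))))

# Hypothetical label distribution (scores sum to 1), for illustration only.
dist = [
    {"label": "1 star", "score": 0.05},
    {"label": "2 stars", "score": 0.10},
    {"label": "3 stars", "score": 0.20},
    {"label": "4 stars", "score": 0.40},
    {"label": "5 stars", "score": 0.25},
]

ev = expected_stars(dist)          # 0.05 + 0.20 + 0.60 + 1.60 + 1.25
print(round(ev, 1), to_int_stars(ev))  # prints: 3.7 4
```

Python's built-in `round` rounds exact halves to the nearest even integer, so an expected value of exactly 2.5 becomes 2 while 3.5 becomes 4. Exact halves are rare for real model scores, but this matters if the integer labels are later compared against a different rounding rule.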


### Step 5 — Averages per place (descending)

Compute and display average stars and review counts by `place`, ordered high → low.

In [13]:
agg = (
    df_out
    .groupby("place")
    .agg(avg_stars=("stars","mean"), num_reviews=("stars","size"))
    .reset_index()
)
agg = agg.sort_values(["avg_stars","num_reviews"], ascending=[False, False])
print(agg.head(30))

                                                place  avg_stars  num_reviews
7            Tjoritja / West MacDonnell National Park   4.539033          269
10                   West MacDonnell – Ormiston Gorge   4.396985          398
6                          Nitmiluk (Katherine Gorge)   4.391304          506
5   Nitmiluk (Katherine Gorge / Nitmiluk National ...   4.312292          301
1                        Devils Marbles (Karlu Karlu)   4.257895          380
2                                              Kakadu   4.102616          497
4                 Kakadu National Park – Gunlom Falls   4.075472          265
9                            West MacDonnell Ormiston   3.900000           10
0                           Alice Springs Desert Park   3.869219         3265
8                                    Uluru-Kata Tjuta   3.628382         3474
3                                 Kakadu Gunlom Falls   3.113772          167


### Step 6 — Phase 4 Config (training)

Configure training outputs and random seed. We’ll train a quick baseline (TF‑IDF + LogisticRegression) for 5‑class star prediction.


In [14]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import joblib

TRAIN_SEED = 42
MODEL_DIR = BASE_DIR / "models" / "star_classifier_log"
MODEL_DIR.mkdir(parents=True, exist_ok=True)



### Step 7 — Stratified split (train/val)

Create a stratified 80/20 train/validation split from `df_out`, using the integer `stars` generated by the pretrained model as the target for supervised training and evaluation.


In [15]:
X_train, X_val, y_train, y_val = train_test_split(
    df_out["comment"].astype(str),
    df_out["stars"].astype(int),
    test_size=0.2,
    random_state=TRAIN_SEED,
    stratify=df_out["stars"],
)
len(X_train), len(X_val)


(7625, 1907)
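
The split sizes printed above can be checked by hand. The sketch assumes that a fractional `test_size` is converted to a row count by rounding up, with the remaining rows going to training. This matches the printed sizes, but the rounding rule is stated here as an assumption rather than taken from the notebook.

```python
import math

n_samples = 7625 + 1907  # total rows implied by the output above
test_size = 0.2

# Assumption: the validation count is rounded up, the rest is training data.
n_test = math.ceil(test_size * n_samples)  # 0.2 * 9532 = 1906.4, rounded up
n_train = n_samples - n_test
print(n_train, n_test)  # prints: 7625 1907
```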

### Step 8 — Train baseline classifier

Train `TF-IDF + LogisticRegression` on the training split.


In [16]:
clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50000, ngram_range=(1,2)) ),
    ("logreg", LogisticRegression(max_iter=200, n_jobs=None, C=2.0, class_weight="balanced")),
])
clf.fit(X_train, y_train)
print("Trained baseline classifier.")


Trained baseline classifier.
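
The `class_weight="balanced"` setting weights each class by `n_samples / (n_classes * class_count)`, so rare classes count for more during fitting. The sketch below applies that formula to the validation support counts from the Step 9 report. These counts stand in for the training counts, which the notebook does not print. Because the split is stratified, the training proportions should be similar.

```python
# Validation-set class counts from the Step 9 report, used as a stand-in
# for the (unprinted) training counts.
support = {1: 33, 2: 248, 3: 324, 4: 604, 5: 698}

n_samples = sum(support.values())
n_classes = len(support)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {c: n_samples / (n_classes * n) for c, n in support.items()}
for c, w in weights.items():
    print(c, round(w, 2))
```

Under these counts a 1-star review weighs roughly twenty times as much as a 5-star review.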


### Step 9 — Evaluate (Accuracy, macro F1, Confusion Matrix)

Evaluate on the validation set and save the confusion matrix figure to `data/generate_stars/visualizations/star_confusion_matrix.png`.


In [17]:
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, f1_score

pred_val = clf.predict(X_val)
print(classification_report(y_val, pred_val, digits=3))
print("Accuracy:", accuracy_score(y_val, pred_val))
print("Macro F1:", f1_score(y_val, pred_val, average="macro"))

cm = confusion_matrix(y_val, pred_val, labels=[1,2,3,4,5])
os.makedirs(BASE_DIR / "data" / "generate_stars" / "visualizations", exist_ok=True)
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=[1,2,3,4,5], yticklabels=[1,2,3,4,5])
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Star Classifier - Confusion Matrix")
plt.tight_layout()
plt.savefig(BASE_DIR / "data" / "generate_stars" / "visualizations" / "star_confusion_matrix.png")
plt.close()
print("Saved confusion matrix to data/generate_stars/visualizations/star_confusion_matrix.png")


              precision    recall  f1-score   support

           1      0.500     0.182     0.267        33
           2      0.460     0.536     0.495       248
           3      0.412     0.432     0.422       324
           4      0.592     0.546     0.568       604
           5      0.769     0.781     0.775       698

    accuracy                          0.605      1907
   macro avg      0.547     0.495     0.505      1907
weighted avg      0.607     0.605     0.604      1907

Accuracy: 0.605138961719979
Macro F1: 0.505374262249147
Saved confusion matrix to data/generate_stars/visualizations/star_confusion_matrix.png
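
The gap between accuracy (0.605) and macro F1 (0.505) comes from how the per-class scores are averaged. Macro F1 gives every class equal weight, so the weak 1-star class (33 samples) pulls it down. The weighted average follows class size. The sketch below recomputes both averages from the per-class rows of the report above. Because the inputs are the three-decimal values printed there, the results match only to that precision.

```python
# (f1-score, support) per class, copied from the classification report above.
report = {
    1: (0.267, 33),
    2: (0.495, 248),
    3: (0.422, 324),
    4: (0.568, 604),
    5: (0.775, 698),
}

total = sum(n for _, n in report.values())

# Macro: unweighted mean over classes.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted: mean over classes, weighted by support.
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total

print(round(macro_f1, 3), round(weighted_f1, 3))  # prints: 0.505 0.604
```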


### Step 10 — Save trained model

Persist the trained baseline pipeline to `models/star_classifier_log/model.joblib` for reuse.


In [18]:
out_model_path = MODEL_DIR / "model.joblib"
joblib.dump(clf, out_model_path)
print("Saved:", out_model_path)


Saved: c:\Users\TARIK\Desktop\Charles Darwin University\4 - Year 1 - Semester 2\IT CODE FAIR\Data Science Challenge\models\star_classifier_log\model.joblib


### Step 11 — Re‑score all comments with the trained classifier

Use the trained baseline to predict stars for every comment and write `data/generate_stars/processed/reviews_with_stars_trained.csv`.


In [19]:
# Predict stars for every comment in the already-loaded dataset with the trained baseline
full_comments = df["comment"].astype(str)
trained_preds = clf.predict(full_comments)

df_trained = df.copy()
df_trained["stars"] = trained_preds
out_trained = BASE_DIR / "data" / "generate_stars" / "processed" / "reviews_with_stars_trained.csv"
out_trained.parent.mkdir(parents=True, exist_ok=True)
df_trained.to_csv(out_trained, index=False)
print("Wrote", out_trained)


Wrote c:\Users\TARIK\Desktop\Charles Darwin University\4 - Year 1 - Semester 2\IT CODE FAIR\Data Science Challenge\data\generate_stars\processed\reviews_with_stars_trained.csv


### Step 12 — Averages by place (trained labels)

Compute and display average stars and review counts by `place` using the trained labels (descending), for comparison with the pretrained averages from Step 5.


In [20]:
agg_tr = (
    df_trained
    .groupby("place")
    .agg(avg_stars=("stars","mean"), num_reviews=("stars","size"))
    .reset_index()
)
agg_tr = agg_tr.sort_values(["avg_stars","num_reviews"], ascending=[False, False])
print(agg_tr.head(30))


                                                place  avg_stars  num_reviews
7            Tjoritja / West MacDonnell National Park   4.583643          269
6                          Nitmiluk (Katherine Gorge)   4.416996          506
10                   West MacDonnell – Ormiston Gorge   4.374372          398
5   Nitmiluk (Katherine Gorge / Nitmiluk National ...   4.335548          301
1                        Devils Marbles (Karlu Karlu)   4.231579          380
2                                              Kakadu   4.134809          497
4                 Kakadu National Park – Gunlom Falls   4.037736          265
0                           Alice Springs Desert Park   3.837672         3265
8                                    Uluru-Kata Tjuta   3.565630         3474
9                            West MacDonnell Ormiston   3.500000           10
3                                 Kakadu Gunlom Falls   3.041916          167
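
To make the comparison with Step 5 concrete, the per-place difference between the two averages can be computed directly. The sketch below is dependency-free and uses four rows copied from the two tables. In the notebook the same comparison could presumably be done by merging `agg` and `agg_tr` on `place`. The dictionary names are hypothetical.

```python
# avg_stars per place, copied from the Step 5 (pretrained) and Step 12 (trained) tables.
pretrained = {
    "Tjoritja / West MacDonnell National Park": 4.539033,
    "Uluru-Kata Tjuta": 3.628382,
    "West MacDonnell Ormiston": 3.900000,
    "Kakadu Gunlom Falls": 3.113772,
}
trained = {
    "Tjoritja / West MacDonnell National Park": 4.583643,
    "Uluru-Kata Tjuta": 3.565630,
    "West MacDonnell Ormiston": 3.500000,
    "Kakadu Gunlom Falls": 3.041916,
}

# Positive delta: the trained baseline rates the place higher than the pretrained model.
deltas = {place: trained[place] - pretrained[place] for place in pretrained}

# List places by the size of the shift, largest first.
ranked = sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)
for place, delta in ranked:
    print(f"{place:45s} {delta:+.3f}")
```

Among these rows the largest shift belongs to West MacDonnell Ormiston, which has only 10 reviews, so its average is the most sensitive to individual label changes.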
