| Step                          | Purpose                                                      |
| ----------------------------- | ------------------------------------------------------------ |
| **Imports**                   | Load libraries                                               |
| **Load Data**                 | Read reviews and identify columns                            |
| **Label Prep**                | Create spam labels                                           |
| **Feature Extraction**        | Generate TF-IDF, MiniLM, and BERT embeddings                 |
| **Train/Test Split**          | Split for modeling                                           |
| **Oversampling**              | Balance training classes                                     |
| **Model Training**            | Train and ensemble RF & XGB                                  |
| **Threshold Tuning**          | Find best cutoff for spam probability                        |
| **Final Evaluation**          | Assess model performance                                     |
| **Predict All**               | Get predictions for all reviews                              |
| **Duplicate Detection**       | Flag nearly identical reviews                                |
| **Off-topic/Informativeness** | Detect off-topic and score informativeness                   |
| **Dashboard**                 | Assemble all outputs per review                              |
| **Export Results**            | Save high-quality and filtered reviews to separate CSV files |


# Preparation

In [None]:
!pip install pandas scikit-learn xgboost sentence-transformers transformers tqdm imbalanced-learn spacy
!python -m spacy download en_core_web_sm

Imports

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

from sentence_transformers import SentenceTransformer
from transformers import BertTokenizer, BertModel
import torch

from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

Load Data

Loads the review data, identifies columns, and prepares the review texts for processing.

In [None]:
df = pd.read_csv('USS_Reviews_Silver.csv')
text_col = 'review'
id_col = 'id' if 'id' in df.columns else None
label_col = 'label' if 'label' in df.columns else None
texts = df[text_col].fillna('').astype(str).tolist()  # fillna first so missing reviews become '' rather than the string 'nan'
print(f"Total reviews loaded: {len(texts)}")


Label Preparation

Creates target labels for spam classification. If the data has no label column, reviews shorter than 8 characters (after stripping whitespace) are labeled as spam; this is a rough proxy, not ground truth.

In [None]:
if label_col is None:
    y = np.array([1 if len(txt.strip()) < 8 else 0 for txt in texts])  # proxy: short = spam
else:
    y = df[label_col].values
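
As a rough illustration of the proxy rule above, the sketch below applies the same length test to a few made-up strings. The helper name `proxy_label` and the sample texts are hypothetical and not part of the notebook.

```python
def proxy_label(text: str, min_chars: int = 8) -> int:
    """Return 1 (spam) when the stripped text is shorter than min_chars, else 0."""
    return 1 if len(text.strip()) < min_chars else 0


# Made-up examples: two very short reviews and one ordinary one
samples = ["ok", "   good   ", "Great rides, but the queues were very long."]
labels = [proxy_label(s) for s in samples]
print(labels)  # [1, 1, 0]
```

Because the rule only looks at length, a short but genuine review such as "Loved it" would also be counted as spam, which is worth keeping in mind when reading the evaluation numbers.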


# Feature Extraction

Feature Extraction (TF-IDF, MiniLM, BERT)

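Extracts three types of features:
  - TF-IDF: Term weights over the 256 highest-frequency terms.
  - MiniLM (sentence-transformers, all-MiniLM-L6-v2): 384-dimensional semantic sentence embeddings.
  - BERT CLS token (bert-base-uncased): 768-dimensional contextual embedding of the [CLS] token.

All features are concatenated column-wise into X_all, giving up to 1,408 columns per review.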

In [None]:
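# TF-IDF over the 256 highest-frequency terms.
# Note: the vectorizer is fitted on all reviews, including those later held out for testing.
tfidf = TfidfVectorizer(max_features=256)
X_tfidf = tfidf.fit_transform(texts).toarray()

# Sentence embeddings from MiniLM (despite the "w2v" names, this is not word2vec)
w2v_model = SentenceTransformer('all-MiniLM-L6-v2')
X_w2v = w2v_model.encode(texts, show_progress_bar=True)

# [CLS] embeddings from BERT, computed one review at a time on the CPU
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.eval()

def get_bert_cls(text):
    """Return the final-layer [CLS] vector for one review (truncated to 128 tokens)."""
    with torch.no_grad():
        encoded = bert_tokenizer(text, return_tensors='pt', truncation=True, max_length=128, padding='max_length')
        output = bert_model(**encoded)
        return output.last_hidden_state[0, 0, :].numpy()

X_bert = np.vstack([get_bert_cls(t) for t in tqdm(texts, desc="BERT Embeddings")])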

X_all = np.hstack([X_tfidf, X_w2v, X_bert])
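
To make the TF-IDF part of the feature set more concrete, here is a minimal pure-Python sketch of smoothed TF-IDF with L2-normalized rows, following the formula scikit-learn documents for its default settings. It assumes plain whitespace tokenization (the real `TfidfVectorizer` tokenizes differently), and `tfidf_vectors` and the sample reviews are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Smoothed TF-IDF with L2-normalized rows over a whitespace-tokenized corpus."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])
    return vocab, rows


vocab, rows = tfidf_vectors(["great ride great staff", "long queue", "great queue"])
for row in rows:
    print([round(v, 3) for v in row])
```

Each row has unit length, so rows can be compared regardless of review length. In the notebook these columns are then stacked next to the MiniLM and BERT embeddings.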


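Train/Test Split

Holds out 30% of the reviews as a stratified test set. The row indices are split alongside the features and labels.

In [None]: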
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X_all, y, np.arange(len(y)), test_size=0.3, random_state=42, stratify=y
)


Oversampling to Balance Classes

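Randomly duplicates minority-class (spam) rows in the training split until both classes are the same size, to reduce model bias toward the majority class. Only the training split is resampled, so the test set keeps the original class distribution.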

In [None]:
ros = RandomOverSampler(random_state=42)
X_train_bal, y_train_bal = ros.fit_resample(X_train, y_train)
print(f"Balanced training samples: {len(y_train_bal)} (spam: {sum(y_train_bal)}, not spam: {len(y_train_bal)-sum(y_train_bal)})")
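
The idea behind random oversampling can be sketched in a few lines of plain Python: minority rows are drawn with replacement until every class matches the size of the largest one. This is only an illustration of the technique, not imbalanced-learn's implementation; `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))  # sample with replacement
            y_out.append(label)
    return X_out, y_out


# Toy training split: four non-spam rows and one spam row
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

No new information is created: the spam rows are exact copies, which is why resampling is applied after the split rather than before it.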


# Modelling and Training

Model Training (RandomForest + XGBoost + Voting Ensemble)
- Trains a RandomForest and an XGBoost model.
- Combines both into a soft-voting ensemble for improved prediction robustness.



In [None]:
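# The training data is already balanced by oversampling, so class_weight='balanced'
# and scale_pos_weight (which evaluates to 1.0 here) have no additional effect.
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
xgb = XGBClassifier(
    scale_pos_weight=(y_train_bal==0).sum()/(y_train_bal==1).sum(),
    eval_metric='logloss', n_jobs=-1  # use_label_encoder dropped: recent XGBoost versions ignore it
)
# Optional standalone fits; VotingClassifier.fit below refits its own clones of both models.
rf.fit(X_train_bal, y_train_bal)
xgb.fit(X_train_bal, y_train_bal)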

voting = VotingClassifier(estimators=[('rf', rf), ('xgb', xgb)], voting='soft')
voting.fit(X_train_bal, y_train_bal)
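
Soft voting averages the class probabilities predicted by each model and picks the class with the highest average. A minimal sketch for a single review follows, assuming equal weights for both models; `soft_vote` and the probability values are made up for illustration.

```python
def soft_vote(prob_lists):
    """Average per-class probabilities from several models for one sample."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]


rf_probs = [0.70, 0.30]   # hypothetical [P(not spam), P(spam)] from the random forest
xgb_probs = [0.40, 0.60]  # hypothetical [P(not spam), P(spam)] from XGBoost
avg = soft_vote([rf_probs, xgb_probs])
predicted = max(range(len(avg)), key=avg.__getitem__)
print([round(p, 2) for p in avg], predicted)
```

Here the two models disagree, and the more confident one (the random forest) decides the outcome. The averaged spam probability is also what the threshold tuning step operates on.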


Threshold Tuning for F1 Score

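Searches thresholds from 0.05 to 0.50 in steps of 0.05 for the cutoff that maximizes the F1 score of the spam class. Note that the search runs on the test split (the code has no separate validation split), so the F1 reported in the final evaluation is measured on the same data used to pick the threshold and is likely optimistic.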

In [None]:
y_probs = voting.predict_proba(X_test)[:,1]
thresholds = np.arange(0.05, 0.51, 0.05)
best_f1, best_thresh = 0, 0.5
for thresh in thresholds:
    y_pred_tune = (y_probs > thresh).astype(int)
    f1 = f1_score(y_test, y_pred_tune, pos_label=1)
    if f1 > best_f1:
        best_f1 = f1
        best_thresh = thresh
print(f"\nBest F1 (spam) on validation: {best_f1:.3f} at threshold {best_thresh:.2f}")
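
One way to keep the final evaluation independent of the threshold choice is to tune on a separate validation split and only then score the test split once. The sketch below shows the search itself in plain Python on toy data; `y_val`, `val_probs`, `f1_at`, and `tune_threshold` are hypothetical names, and the notebook itself does not create a validation split.

```python
def f1_at(y_true, probs, thresh):
    """F1 for the positive class when predicting 1 for probabilities above thresh."""
    tp = sum(1 for t, p in zip(y_true, probs) if t == 1 and p > thresh)
    fp = sum(1 for t, p in zip(y_true, probs) if t == 0 and p > thresh)
    fn = sum(1 for t, p in zip(y_true, probs) if t == 1 and p <= thresh)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(y_val, val_probs, thresholds):
    """Return the first threshold with the highest F1, together with that F1."""
    best_thresh, best_f1 = 0.5, 0.0
    for t in thresholds:
        f1 = f1_at(y_val, val_probs, t)
        if f1 > best_f1:
            best_thresh, best_f1 = t, f1
    return best_thresh, best_f1


# Toy validation labels and predicted spam probabilities
y_val = [0, 0, 0, 1, 1, 0, 1]
val_probs = [0.05, 0.20, 0.35, 0.30, 0.80, 0.10, 0.45]
thresholds = [round(0.05 * k, 2) for k in range(1, 11)]  # 0.05, 0.10, ..., 0.50
best_thresh, best_f1 = tune_threshold(y_val, val_probs, thresholds)
print(best_thresh, round(best_f1, 3))  # 0.2 0.857
```

The chosen threshold would then be applied unchanged to the test-set probabilities.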


# Evaluation

Final Model Evaluation

Evaluates the tuned ensemble on the test set and prints the confusion matrix, classification report, and accuracy.

In [None]:
y_pred_adj = (y_probs > best_thresh).astype(int)
print("\n=== FINAL MODEL EVALUATION ===")
print(confusion_matrix(y_test, y_pred_adj))
print(classification_report(y_test, y_pred_adj))
print(f"Accuracy: {accuracy_score(y_test, y_pred_adj):.4f}")
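
The per-class numbers in the classification report can be recomputed by hand from the confusion matrix. The sketch below does this for the matrix reported for this run, [[8614, 9], [22, 179]], assuming scikit-learn's layout of true classes in rows and predicted classes in columns.

```python
# Confusion matrix from the run: rows are true classes, columns are predicted classes
tn, fp = 8614, 9    # true non-spam: predicted non-spam, predicted spam
fn, tp = 22, 179    # true spam:     predicted non-spam, predicted spam

precision = tp / (tp + fp)                 # share of predicted spam that is really spam
recall = tp / (tp + fn)                    # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"{precision:.2f} {recall:.2f} {f1:.2f} {accuracy:.4f}")  # 0.95 0.89 0.92 0.9965
```

With only 201 spam reviews among 8,824 test rows, accuracy is dominated by the non-spam class, so the spam-class precision, recall, and F1 are the more informative figures.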


Predict on All Data for Dashboard

Applies the trained model to all data for dashboard reporting and further analysis.



In [None]:
spam_probs = voting.predict_proba(X_all)[:,1]
spam_pred = (spam_probs > best_thresh).astype(int)


Duplicate Detection Using Cosine Similarity

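For each review, looks up its nearest neighbors by cosine distance on the combined feature vectors and flags the review as a near-duplicate if any neighbor lies within a distance of 0.03 (cosine similarity above 0.97).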

In [None]:
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=5, metric='cosine', n_jobs=-1)
nn.fit(X_all)
distances, indices = nn.kneighbors(X_all)
dup_flags = [False]*len(texts)
dup_ref = [None]*len(texts)
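for i in range(len(texts)):
    for idx, dist in zip(indices[i], distances[i]):
        if idx == i:
            continue  # skip the review itself; with exact duplicates it is not always listed first
        if dist < 0.03:  # cosine distance < 0.03, i.e. cosine similarity > 0.97
            dup_flags[i] = True
            dup_ref[i] = int(df[id_col].iloc[idx]) if id_col else int(idx)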
            break
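
The distance test used for duplicate flagging can be sketched without scikit-learn. This brute-force version compares every pair of vectors, which is only practical for small inputs; the notebook instead queries a nearest-neighbor index. `cosine_distance`, `flag_duplicates`, and the toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def flag_duplicates(vectors, max_dist=0.03):
    """For each vector, return the index of the first other vector within max_dist, else None."""
    dup_of = [None] * len(vectors)
    for i, vi in enumerate(vectors):
        for j, vj in enumerate(vectors):
            if i != j and cosine_distance(vi, vj) < max_dist:
                dup_of[i] = j
                break
    return dup_of


# Toy feature vectors: the first two point in almost the same direction
vectors = [[1.0, 0.0, 1.0], [1.0, 0.05, 1.0], [0.0, 1.0, 0.0]]
dup_of = flag_duplicates(vectors)
print(dup_of)  # [1, 0, None]
```

Note that both members of a near-duplicate pair are flagged, so filtering on the flag removes every copy rather than keeping one.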


Off-topic Detection & Informativeness Scoring


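Off-topic detection: A review is flagged as off-topic only if it contains none of the listed keywords and spaCy finds no noun chunks in it.

Informativeness scoring: A heuristic score between 0 and 1 that combines the unique-word ratio, noun and adjective density, named-entity density, and small bonuses for specific entity types and length. Reviews with fewer than 8 words get a reduced score based on length alone.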

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
keywords = ['product', 'service', 'order', 'buy', 'delivery', 'quality', 'price', 'shipping', 'refund', 'support']
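def is_off_topic(txt):
    lowered = txt.lower()
    doc = nlp(lowered)
    # Substring match on the lowercased text, so "Refund" and "refunded" both count
    has_keyword = any(k in lowered for k in keywords)
    return not has_keyword and len(list(doc.noun_chunks)) == 0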
off_topic_flags = [is_off_topic(txt) for txt in texts]

def informativeness(txt, nlp=nlp):
    doc = nlp(txt)
    words = [token.text for token in doc if not token.is_punct and not token.is_space]
    num_words = len(words)
    if num_words == 0:
        return 0.0

    if num_words < 8:
        base = 0.1 + 0.5 * (num_words / 8)
        return round(base, 2)

    unique_ratio = len(set([w.lower() for w in words])) / num_words
    num_nouns = sum(1 for token in doc if token.pos_ in ["NOUN", "PROPN"])
    num_adjs = sum(1 for token in doc if token.pos_ == "ADJ")
    num_ents = len(list(doc.ents))
    specifics = sum(1 for ent in doc.ents if ent.label_ in ["PERSON", "ORG", "GPE", "LOC", "DATE", "TIME", "MONEY", "CARDINAL", "ORDINAL", "PERCENT"])
    specifics_bonus = min(0.2, 0.05 * specifics)
    length_bonus = min(0.1, 0.1 * (num_words / 50))

    score = (
        0.4 * unique_ratio +
        0.2 * (num_nouns / num_words) +
        0.15 * (num_adjs / num_words) +
        0.05 * (num_ents / num_words) +
        specifics_bonus +
        length_bonus
    )
    return round(min(1.0, max(0.0, score)), 2)

info_scores = [informativeness(txt) for txt in texts]
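
To see how the weights in the informativeness formula combine, the arithmetic can be isolated from spaCy by passing the counts in directly. The sketch below mirrors the formula in the cell above; the function name `informativeness_from_counts` and the example counts are hypothetical.

```python
def informativeness_from_counts(num_words, num_unique, num_nouns, num_adjs,
                                num_ents, num_specifics):
    """Same scoring arithmetic as the notebook's informativeness(), driven by raw counts."""
    if num_words == 0:
        return 0.0
    if num_words < 8:
        # Short reviews: score depends on length only
        return round(0.1 + 0.5 * (num_words / 8), 2)
    specifics_bonus = min(0.2, 0.05 * num_specifics)
    length_bonus = min(0.1, 0.1 * (num_words / 50))
    score = (
        0.4 * (num_unique / num_words)
        + 0.2 * (num_nouns / num_words)
        + 0.15 * (num_adjs / num_words)
        + 0.05 * (num_ents / num_words)
        + specifics_bonus
        + length_bonus
    )
    return round(min(1.0, max(0.0, score)), 2)


# A 4-word review: 0.1 + 0.5 * (4 / 8) = 0.35
short_score = informativeness_from_counts(4, 4, 1, 1, 0, 0)
# A 20-word review: 0.32 + 0.05 + 0.03 + 0.01 + 0.10 + 0.04 = 0.55
long_score = informativeness_from_counts(20, 16, 5, 4, 4, 2)
print(short_score, long_score)  # 0.35 0.55
```

The unique-word ratio carries the largest weight, so a long review that repeats the same few words scores lower than a shorter one with varied vocabulary.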


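Compile Dashboard Output and Export Results

Builds one dashboard record per review, then writes reviews with no flags to USS_Reviews_Silver_1.csv and flagged reviews (spam, duplicate, or off-topic), with the extra columns, to USS_Reviews_Silver_2.csv.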

In [None]:
dashboard = []
for i, row in df.iterrows():
    item = {
        "review_id": int(row[id_col]) if id_col else int(i),
        "review": row[text_col],
        "is_spam": bool(spam_pred[i]),
        "spam_probability": float(round(spam_probs[i], 4)),
        "is_duplicate": bool(dup_flags[i]),
        "duplicate_of": dup_ref[i],
        "is_off_topic": bool(off_topic_flags[i]),
        "informativeness_score": float(info_scores[i])
    }
    dashboard.append(item)


flags_df = pd.DataFrame({
    "is_spam": spam_pred.astype(bool),  # cast 0/1 ints so | and ~ below act as logical ops
    "is_duplicate": dup_flags,
    "is_off_topic": off_topic_flags
})

keep_mask = ~(flags_df["is_spam"] | flags_df["is_duplicate"] | flags_df["is_off_topic"])
removed_mask = ~keep_mask

df[keep_mask].to_csv("USS_Reviews_Silver_1.csv", index=False)

df_extra = df.copy()
df_extra['is_spam'] = spam_pred
df_extra['spam_probability'] = spam_probs
df_extra['is_duplicate'] = dup_flags
df_extra['duplicate_of'] = dup_ref
df_extra['is_off_topic'] = off_topic_flags
df_extra['informativeness_score'] = info_scores

df_extra[removed_mask].to_csv("USS_Reviews_Silver_2.csv", index=False)

print(f"USS_Reviews_Silver_1.csv: {keep_mask.sum()} records kept (original columns only)")
print(f"USS_Reviews_Silver_2.csv: {removed_mask.sum()} records filtered out (with new columns)")
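
The cell above builds the `dashboard` list but does not show it being saved. One possible way to persist it, sketched here under the assumption that a JSON file is wanted, is to serialize the list with the standard `json` module. The file name `dashboard.json`, the helper `dashboard_to_json`, and the sample records are hypothetical.

```python
import json

# Two made-up records in the same shape as the notebook's dashboard items
dashboard = [
    {"review_id": 0, "review": "Great rides and friendly staff.", "is_spam": False,
     "spam_probability": 0.0123, "is_duplicate": False, "duplicate_of": None,
     "is_off_topic": False, "informativeness_score": 0.55},
    {"review_id": 1, "review": "ok", "is_spam": True,
     "spam_probability": 0.9876, "is_duplicate": True, "duplicate_of": 7,
     "is_off_topic": True, "informativeness_score": 0.16},
]


def dashboard_to_json(items):
    """Serialize dashboard records; values must be plain Python types, not NumPy scalars."""
    return json.dumps(items, ensure_ascii=False, indent=2)


payload = dashboard_to_json(dashboard)
restored = json.loads(payload)  # round-trip check

# To write the file (hypothetical path):
# with open("dashboard.json", "w", encoding="utf-8") as fh:
#     fh.write(payload)
```

The explicit `int`, `bool`, and `float` casts in the notebook's loop matter here, because `json` cannot serialize NumPy scalar types such as `numpy.int64` directly.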


Total reviews loaded: 29412


BERT Embeddings: 100%|██████████| 29412/29412 [37:37<00:00, 13.03it/s]


Balanced training samples: 40240 (spam: 20120, not spam: 20120)


Parameters: { "use_label_encoder" } are not used.


Best F1 (spam) on validation: 0.920 at threshold 0.45

=== FINAL MODEL EVALUATION ===
[[8614    9]
 [  22  179]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      8623
           1       0.95      0.89      0.92       201

    accuracy                           1.00      8824
   macro avg       0.97      0.94      0.96      8824
weighted avg       1.00      1.00      1.00      8824

Accuracy: 0.9965
USS_Reviews_Silver_1.csv: 24021 records kept (original columns only)
USS_Reviews_Silver_2.csv: 5391 records filtered out (with new columns)


In [None]:
# Read the filtered "good" reviews file
df1 = pd.read_csv('USS_Reviews_Silver_1.csv')

# Read the filtered "removed" (spam/duplicate/off-topic) reviews file
df2 = pd.read_csv('USS_Reviews_Silver_2.csv')

# Optional: also save both DataFrames as Parquet (needs pyarrow or fastparquet installed)
df1.to_parquet('USS_Reviews_Silver_1.parquet', index=False)
df2.to_parquet('USS_Reviews_Silver_2.parquet', index=False)