[TikTok Tech Jam 2025](https://bytedance.sg.larkoffice.com/docx/E6bhdCMUqojrxsxOFn5lFVdMgkc)

# Problem Statement - '1. Filtering the Noise: ML for Trustworthy Location Reviews'

Design and implement an ML-based system to evaluate the quality and relevancy of Google location reviews.

The system should:
1. **Gauge review quality**: Detect spam, advertisements, irrelevant content, and rants from users who have likely never visited the location.

2. **Assess relevancy**: Determine whether the content of a review is genuinely related to the location being reviewed.

3. **Enforce policies**: Automatically flag or filter out reviews that violate the following example policies:
    - No advertisements or promotional content.
    - No irrelevant content (e.g., reviews about unrelated topics).
    - No rants or complaints from users who have not visited the place (can be inferred from content, metadata, or other signals).

In [None]:
!pip install openai

## Imports

In [None]:
from openai import OpenAI
from openai import AsyncOpenAI
from tqdm import tqdm #progress bar
from sklearn.model_selection import train_test_split
import pandas as pd
import nest_asyncio, asyncio, aiohttp, json, re

In [None]:
# This is how you load a secret key on Google Colab
from google.colab import userdata
api_key = userdata.get('OPENAI_API_KEY')

## VS Code method to store the api_key securely
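
Outside Colab, one common approach is to keep the key in a `.env` file that is listed in `.gitignore`, and to read it into the process at startup. The sketch below is a minimal, stdlib-only illustration of that idea. The helper name `load_env_file` and the `KEY=value` file layout are assumptions made for this example; the `python-dotenv` package offers a more complete implementation.

```python
import os

def load_env_file(path=".env"):
    """Read KEY=value lines from a .env file into os.environ (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=value
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            # Do not overwrite variables already set in the real environment
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))

# Usage (assumes a .env file containing a line such as OPENAI_API_KEY=...):
# load_env_file()
# api_key = os.environ["OPENAI_API_KEY"]
```

Because the file is read at runtime and never committed, the key stays out of the notebook and out of git history.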

## Quick git tutorial with VS Code and .env

### Example via OpenAI 'Responses' API

https://platform.openai.com/docs/api-reference/responses

In [None]:
# Test openai first!

client = OpenAI(api_key = api_key)

response = client.responses.create(
    model="gpt-4o-mini-2024-07-18",
    input="Write a one-sentence bedtime story about a unicorn."
)

print(response.output_text)

## Example Payloads for Reference

### Azure AI

```
import requests

headers = {
    "Content-Type": "application/json",
    "api-key": AZURE_OPENAI_API_KEY
}
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
    "max_tokens": 150
}

endpoint = f"{AZURE_OPENAI_API_BASE}/openai/deployments/{DEPLOYMENT_NAME}/chat/completions?api-version={API_VERSION}"

response = requests.post(endpoint, headers=headers, json=payload)

response_data = response.json()
message_content = response_data['choices'][0]['message']['content']
```

### Vertex AI

```
import requests

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {ACCESS_TOKEN}"   # Generated via service account
}

payload = {
    "contents": [
        {
            "role": "user",
            "parts": [{"text": prompt}]
        }
    ]
}

endpoint = f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/gemini-1.5-pro:generateContent"

response = requests.post(endpoint, headers=headers, json=payload)
message_content = response.json()["candidates"][0]["content"]["parts"][0]["text"]

```

### AWS Bedrock
```
import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

payload = {
    "messages": [
        {"role": "user", "content": [{"text": prompt}]}
    ]
}

response = client.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=payload["messages"]
)

message_content = response["output"]["message"]["content"][0]["text"]
```

## 2 Possible Data Sources

### 1. Use "reviews.csv" from Kaggle (Easy)
- Google Review Data: open datasets containing Google location reviews (e.g., Google Local Reviews on Kaggle: https://www.kaggle.com/datasets/denizbilginn/google-maps-restaurant-reviews)


### 2. Web scrape Google reviews (Hard)
- APIFY: https://apify.com/compass/google-maps-reviews-scraper?fpr=c2c1t



# 1. Working with clean Kaggle "reviews.csv" data

To train a text classification model, as with any supervised machine learning task, you need labeled data.

2 approaches:
- Label the data manually (by hand).
- Use a relatively modern LLM to perform pseudolabelling for you. (Recommended)



In [None]:
df = pd.read_csv('reviews.csv')
df

| Unnamed: 0 | business_name | author_name | text | photo | rating | rating_category |
|---|---|---|---|---|---|---|
| 0 | Haci'nin Yeri - Yigit Lokantasi | Gulsum Akar | We went to Marmaris with my wife for a holiday... | dataset/taste/hacinin_yeri_gulsum_akar.png | 5 | taste |
| 1 | Haci'nin Yeri - Yigit Lokantasi | Oguzhan Cetin | During my holiday in Marmaris we ate here to f... | dataset/menu/hacinin_yeri_oguzhan_cetin.png | 4 | menu |
| 2 | Haci'nin Yeri - Yigit Lokantasi | Yasin Kuyu | Prices are very affordable. The menu in the ph... | dataset/outdoor_atmosphere/hacinin_yeri_yasin_... | 3 | outdoor_atmosphere |
| 3 | Haci'nin Yeri - Yigit Lokantasi | Orhan Kapu | Turkey's cheapest artisan restaurant and its f... | dataset/indoor_atmosphere/hacinin_yeri_orhan_k... | 5 | indoor_atmosphere |
| 4 | Haci'nin Yeri - Yigit Lokantasi | Ozgur Sati | I don't know what you will look for in terms o... | dataset/menu/hacinin_yeri_ozgur_sati.png | 3 | menu |
| ... | ... | ... | ... | ... | ... | ... |
| 1095 | Miss Pizza | Salih Gursoy | There are so many types of pizza; you are surp... | dataset/taste/miss_pizza_salih_gursoy.png | 5 | taste |
| 1096 | Miss Pizza | Kemal Amangeldi | I tried the smoked ribeye pizza; the dough is ... | dataset/indoor_atmosphere/miss_pizza_kemal_ama... | 5 | indoor_atmosphere |
| 1097 | Miss Pizza | Ulkem Esen | Crowded and expensive place. | dataset/menu/miss_pizza_ulkem_esen.png | 3 | menu |
| 1098 | Miss Pizza | Ilkin Saymaz | No bad. It was very crowded; there was no ligh... | dataset/taste/miss_pizza_ilkin_saymaz.png | 3 | taste |


- `business_name`, `author_name`, and `rating_category` are not strong candidate features.
- `photo` may be relevant.
- `rating` is subjective.
- `text` is the most direct signal: sentiment can be determined from it directly.

## Let's dip our toes into the pseudolabelling part first!

The LLM should label each review as exactly one of the following categories:

1. Advertisement
2. Irrelevant
3. Complaint_No_Visit
4. Legitimate


In [None]:
client = OpenAI(api_key = api_key)

labels = ["Advertisement", "Irrelevant", "Complaint_No_Visit", "Legitimate"]

results = []
df_subset = df.head(50).copy()  # test small first!

for _, row in tqdm(df_subset.iterrows(), total=len(df_subset)):  # total lets tqdm show progress
    review = str(row["text"])

    prompt = """
      You are a data annotator for Google location reviews.
      Classify each review into exactly ONE of these categories:
      {labels}

      Definitions:
      - Advertisement: Contains promo, phone numbers, links, or offers.
      - Irrelevant: Not related to the location.
      - Complaint_No_Visit: Complaint by someone who likely didn’t visit (e.g., "never been", "heard", "they say").
      - Legitimate: A genuine, relevant review about the user’s experience.

      Examples:
      1. "Use code PIZZA10 for discount" → Advertisement
      2. "Never been here but heard it’s bad" → Complaint_No_Visit
      3. "Food was great and the staff were friendly" → Legitimate
      4. "I am batman" → Irrelevant

      Return only one of the label names.
      Review: "{review}"
  """.format(labels=labels, review=review)

    try:
        response = client.responses.create(
            model="gpt-4o-mini-2024-07-18",
            input=prompt,
        )
        label = response.output_text.strip()
    except Exception as e:
        print(e)
        label = "Unknown"

    results.append(label)

df_labeled = df_subset.copy()
df_labeled["label"] = results
df_labeled.head()


## This method is too slow: at about 1 s per request, labelling all 1,100 rows one API call at a time takes roughly 18 minutes.

## Imagine a bigger dataset! We need to call the API **asynchronously**.
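
To see why asynchronous calls help, here is a small sketch that replaces the real API with `asyncio.sleep`. The function names and the 0.1 s latency are made up for illustration; only the standard library is used. In a notebook, where an event loop is already running, use `await run_concurrent(10)` directly (or apply `nest_asyncio` first) instead of `asyncio.run`.

```python
import asyncio
import time

async def fake_api_call(i, latency=0.1):
    """Stand-in for one API request; sleeps instead of calling a real service."""
    await asyncio.sleep(latency)
    return i

async def run_sequential(n):
    # Each call waits for the previous one to finish
    return [await fake_api_call(i) for i in range(n)]

async def run_concurrent(n):
    # All calls are in flight at once; gather preserves input order
    return await asyncio.gather(*(fake_api_call(i) for i in range(n)))

def timed(coro_fn, n):
    """Run coro_fn(n) in a fresh event loop and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start

if __name__ == "__main__":
    _, t_seq = timed(run_sequential, 10)
    _, t_con = timed(run_concurrent, 10)
    print(f"sequential: {t_seq:.2f}s, concurrent: {t_con:.2f}s")
```

The sequential version pays the latency once per call, while the concurrent version pays it roughly once in total, because the waiting overlaps.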

# Hands-On Asyncio

In [None]:
nest_asyncio.apply()

In [None]:
# import the async client
from openai import AsyncOpenAI

# still need to pass in your api_key
client = AsyncOpenAI(api_key=api_key)

labels = ["Advertisement", "Irrelevant", "Complaint_No_Visit", "Legitimate"]

def build_prompt(text):
    return f"""
    You are a data annotator for Google location reviews.
    Classify each review into exactly ONE of these categories: {labels}

    Definitions:
    - Advertisement: Contains promo, phone numbers, links, or offers.
    - Irrelevant: Not related to the location.
    - Complaint_No_Visit: Complaint by someone who likely didn’t visit (e.g., "never been", "heard", "they say").
    - Legitimate: A genuine, relevant review about the user’s experience.

    Return only one of the label names.
    Review: "{text}"
    """


In [None]:
# define annotate_row
_____
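
One possible way to fill in this blank is sketched below. It is not the only answer. It assumes the async client exposes the same `responses.create(...)` call and `output_text` attribute used in the synchronous loop earlier, and the factory name `make_annotate_row` is hypothetical. Mapping anything outside the four labels to `"Unknown"` is an extra safeguard added here, not something the notebook requires.

```python
import asyncio

VALID_LABELS = ["Advertisement", "Irrelevant", "Complaint_No_Visit", "Legitimate"]

def make_annotate_row(client, build_prompt, model="gpt-4o-mini-2024-07-18"):
    """Return an async annotate_row(text) bound to a client and a prompt builder."""
    async def annotate_row(text):
        try:
            response = await client.responses.create(
                model=model,
                input=build_prompt(text),
            )
            label = response.output_text.strip()
        except Exception as e:
            # Same fallback as the synchronous loop: log and mark as Unknown
            print(e)
            return "Unknown"
        # Assumption: treat any answer outside the label set as Unknown
        return label if label in VALID_LABELS else "Unknown"
    return annotate_row
```

In the notebook this would be wired up as `annotate_row = make_annotate_row(client, build_prompt)`, where `client` is assumed to be an `AsyncOpenAI(api_key=api_key)` instance, so that `label_batch` can keep calling `annotate_row(t)` with a single argument.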

In [None]:
# define batch labelling
async def label_batch(texts, batch_size=25):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        tasks = [annotate_row(t) for t in batch]
        batch_results = await asyncio.gather(*tasks)
        results.extend(batch_results)
        print(f"Processed {len(results)}/{len(texts)}")
    return results

texts = df["text"].astype(str).tolist()[:1100]   # limit to 1.1k
labels_out = await label_batch(texts)
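
A fixed batch waits for its slowest request before the next batch starts, and a single exception inside `asyncio.gather` aborts the whole run. As an alternative sketch (not the notebook's method), a semaphore keeps a fixed number of requests in flight at all times. The names `label_all` and `guarded` are hypothetical, and `annotate` stands for any async function that takes a text and returns a label.

```python
import asyncio

async def label_all(texts, annotate, max_concurrency=25):
    """Label every text with at most `max_concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(text):
        async with semaphore:
            try:
                return await annotate(text)
            except Exception as e:
                # One failed request should not discard the rest of the run
                print(e)
                return "Unknown"

    # gather returns results in the same order as `texts`
    return await asyncio.gather(*(guarded(t) for t in texts))
```

In the notebook this could be called as `labels_out = await label_all(texts, annotate_row)`.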


In [None]:
# Save the output first, just in case
df_labeled = pd.DataFrame({"text": texts, "label": labels_out})
df_labeled.to_csv("labeled_reviews.csv", index=False)

In [None]:
df_labeled

## Short coding test

Given an array, count how many times an element occurs in it, using Python. This is a HackerRank (Easy) level problem.

In [None]:
labels_out

In [None]:
# Write your answer below:
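
One possible answer is sketched below, using a small made-up list in place of `labels_out`. It shows an explicit loop first, then the built-in shortcuts that do the same thing.

```python
from collections import Counter

def count_occurrences(items, target):
    """Count how many times `target` appears in `items` with an explicit loop."""
    count = 0
    for item in items:
        if item == target:
            count += 1
    return count

# Illustrative data standing in for labels_out
sample = ["Legitimate", "Advertisement", "Legitimate", "Irrelevant", "Legitimate"]

print(count_occurrences(sample, "Legitimate"))  # 3
print(sample.count("Legitimate"))               # built-in equivalent, also 3
print(Counter(sample))                          # counts for every label at once
```

`Counter(labels_out)` gives the full label distribution in one call, which is handy for spotting class imbalance before training.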

## Data Augmentation

In [None]:
import json

client = OpenAI(api_key=api_key)

response = client.responses.create(
    model="gpt-4o-mini",
    input="""
    Generate 30 examples of fake Google location reviews that are clearly advertisements.
    Each should sound like a review left by a business promoting itself.
    Vary the business type (restaurants, cafes, gyms, salons, etc.).

    Each review should be 1–2 sentences long and include ad-like elements
    such as links, phone numbers, promo codes, or phrases like "call now", "visit", "discount", etc.

    # Note: Output the list of advertisements as the value for the key 'advertisements'.
    # Only output the JSON object without commentary or markdown.
    """
)

# 1. Access the output text (expected to be a JSON string)
json_string = response.output_text

# 2. Parse the JSON string directly into a Python dictionary
data = json.loads(json_string)

# 3. The final list of strings is accessed directly by the key
ads = data.get('advertisements', [])

print(f"Successfully parsed {len(ads)} advertisements.")
# Example of the resulting list:
# print(ads)
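
`json.loads` raises an error if the model wraps its answer in a Markdown code fence or adds commentary despite the instruction. A defensive parser along the following lines can help. This is a sketch under that assumption; the helper name `parse_json_output` and the sample output are made up for illustration.

```python
import json

FENCE = "`" * 3  # three backticks, built indirectly so this example stays readable

def parse_json_output(raw):
    """Parse model output as JSON, tolerating an optional Markdown code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the whole opening fence line (it may carry a language tag)
        text = text.split("\n", 1)[1] if "\n" in text else ""
    if text.rstrip().endswith(FENCE):
        text = text.rstrip()[: -len(FENCE)]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # caller decides whether to retry or skip

# Illustrative output wrapped in a fence
raw = FENCE + 'json\n{"advertisements": ["Call 555-0100 now!"]}\n' + FENCE
data = parse_json_output(raw) or {}
print(data.get("advertisements", []))
```

Returning `None` instead of raising lets the caller retry the generation call or skip the batch.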

In [None]:
# @title
ads=['Delicious vegan options await you at Green Eats! Visit www.GreenEatsCafe.com for a 15% discount on your first meal. Call us at 555-2345 today!',
 'Experience luxury haircare at Glamour Salon! Book your appointment now at www.GlamourSalon.com and use promo code SHINE for 20% off. Call 555-7890!',
 'Savor the best sushi in town at Ocean Breeze! Order now at www.OceanBreezeSushi.com and get a free appetizer with any entree. Call 555-4567!',
 'Join the fitness revolution at PowerHouse Gym! Sign up today for a FREE personal training session at www.PowerHouseGym.com. Call 555-3210!',
 'Indulge your sweet tooth at Choco Heaven! Enjoy a 10% discount on your first order with promo code YUM at www.ChocoHeaven.com. Call us at 555-1112!',
 'Elevate your style at Trendy Boutique! Visit www.TrendyBoutique.com for exclusive offers, and call 555-2223 for personal styling advice!',
 'Refresh your look at Artisan Barbers! Book today at www.ArtisanBarbers.com for 25% off your first haircut. Call 555-3334 now!',
 'Unwind at Blissful Spa with our awesome treatments! Visit www.BlissfulSpa.com for special packages and call us at 555-4445 to book today!',
 'Grab the best coffee in town at Daily Grind! Join our loyalty program at www.DailyGrindCafe.com and get your 5th cup free. Call 555-5556!',
 'Taste the authentic flavors at Spice Journey! Enjoy 20% off your first meal when you order online at www.SpiceJourney.com. Call 555-6667!',
 'Get fit at Active Life Studio! Sign up now at www.ActiveLifeStudio.com and receive a free class. Call 555-7778 for details!',
 'Spoil yourself at Radiant Nails! Visit www.RadiantNails.com and book an appointment with promo code BEAUTY for 15% off. Call 555-8889!',
 'Treat your taste buds at Pasta Paradise! Order online at www.PastaParadise.com for a 10% discount with promo code DELICIOUS. Call 555-9990!',
 'Transform your skin at Pure Glow Esthetics! Visit www.PureGlow.com and book a facial today to get a 20% discount. Call 555-1011!',
 'Join the community at Yoga Harmony! Sign up for a free trial class at www.YogaHarmony.com and call 555-1213 for more information!',
 'Experience unparalleled flavors at Taco Fiesta! Order now at www.TacoFiesta.com for a free drink with any meal. Call 555-1415!',
 'Get pampered at Luxurious Escapes Spa! Visit www.LuxuriousEscapes.com for special pricing and call 555-1617 to book your getaway!',
 "Discover the art of cooking at Chef's Table! Enroll in our cooking classes at www.ChefsTable.com and use promo code COOK20 for 20% off. Call 555-1819!",
 'Stay stylish with fresh designs from Fashion Forward! Check out www.FashionForward.com for great deals, and call 555-2021 for personalized shopping!',
 'Revitalize your home at Dream Interiors! Visit www.DreamInteriors.com and schedule a consultation today. Call 555-2222 for special offers!',
 'Indulge in gourmet burgers at Burger Bliss! Order at www.BurgerBliss.com for a buy-one-get-one-free deal. Call 555-2423 now!',
 'Join the revolution at Elite Martial Arts! Register today at www.EliteMartialArts.com and get your first month for just $29. Call 555-2624!',
 'Satisfy your cravings at Sweet Treats Bakery! Visit www.SweetTreatsBakery.com for a special deal on cupcakes. Call 555-2825 for inquiries!',
 'Get fit effortlessly at Body Balance! Sign up at www.BodyBalanceGym.com and enjoy your first class for FREE. Call us at 555-3036!',
 'Discover amazing wines at Vineyard Wonders! Enjoy 10% off your first purchase at www.VineyardWonders.com. Call 555-3237 to learn more!',
 'Upgrade your wellness with Green Leaf Supplements! Visit www.GreenLeafSupplements.com now for 20% off your first order. Call 555-3438!',
 'Chase the unique at Artful Studios! Book a painting session today at www.ArtfulStudios.com and get a discount with promo code CREATE. Call 555-3639!',
 'Dine in luxury at Night Sky Restaurant! Reserve your table now at www.NightSkyRestaurant.com for exclusive offers. Call 555-3840!',
 'Find your zen at Serenity Yoga Studio! Visit www.SerenityYoga.com for a free first class and call 555-4041 for details!',
 'Fall in love with fashion at Chic Trends! Shop online at www.ChicTrends.com and use promo code STYLE for 15% off your first order. Call 555-4242!']

df_labeled = pd.read_csv("labeled_reviews.csv")  # same relative path it was saved to (/content/ on Colab)

In [None]:
df_ads = pd.DataFrame({
    "text": ads,
    "label": "Advertisement"
})

df_combined = pd.concat([df_labeled, df_ads], ignore_index=True)

print(df_combined["label"].value_counts())

In [None]:
df_combined.to_csv("labeled_reviews+augmented.csv",index=False)

# Training a local classifier using a Hugging Face model

In [None]:
df = pd.read_csv('labeled_reviews+augmented.csv')

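# Assumption: 80/10/10 train/val/test split; adjust the ratios as needed
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)

val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)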

print(f"Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")

In [None]:
!pip install transformers datasets torch scikit-learn accelerate

In [None]:
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score, classification_report
import torch
import numpy as np
from torch import nn
from sklearn.utils.class_weight import compute_class_weight

# Define label mappings
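label2id = {"Advertisement": 0, "Irrelevant": 1, "Complaint_No_Visit": 2, "Legitimate": 3}

id2label = {idx: label for label, idx in label2id.items()}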

# Handle class imbalance - Oversample minority classes
def smart_oversample(df, target_column='label', target_ratio=0.3):
    """
    Oversample minority classes to reach target_ratio of majority class
    """
    max_size = df[target_column].value_counts().max()

    lst = [df]
    for class_label, group in df.groupby(target_column):
        current_size = len(group)
        target_size = int(max_size * target_ratio)

        if current_size < target_size:
            # Oversample to target
            lst.append(group.sample(target_size - current_size, replace=True, random_state=42))

    df_balanced = pd.concat(lst)
    return df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# Apply oversampling to training data only
print("\nOriginal training distribution:")
print(train_df['label'].value_counts())

train_df_balanced = smart_oversample(train_df, target_ratio=0.3)

print("\nBalanced training distribution:")
print(train_df_balanced['label'].value_counts())
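
To make the effect of `target_ratio` concrete, the sketch below reproduces only the size arithmetic of `smart_oversample`, without pandas. The helper name `oversample_targets` and the class counts are hypothetical and for illustration only.

```python
def oversample_targets(counts, target_ratio=0.3):
    """Return class sizes after topping minority classes up to target_ratio * max."""
    target_size = int(max(counts.values()) * target_ratio)
    # Classes already at or above the target keep their size unchanged
    return {label: max(n, target_size) for label, n in counts.items()}

# Hypothetical class counts, for illustration only
counts = {"Legitimate": 1000, "Advertisement": 60, "Irrelevant": 25, "Complaint_No_Visit": 10}
print(oversample_targets(counts))
```

Every class below the target is sampled with replacement up to the same size, so very small classes end up as many repeated copies of a few reviews. That is one reason the notebook combines oversampling with class weights rather than balancing the classes fully.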

In [None]:
# Compute class weights for remaining imbalance
labels = train_df_balanced['label'].map(label2id).values
class_weights = compute_class_weight('balanced', classes=np.unique(labels), y=labels)
class_weights = torch.tensor(class_weights, dtype=torch.float)
print("\nClass weights:", class_weights)

# Prepare datasets
train_dataset = Dataset.from_pandas(train_df_balanced[['text', 'label']])
val_dataset = Dataset.from_pandas(val_df[['text', 'label']])
test_dataset = Dataset.from_pandas(test_df[['text', 'label']])

# Load tokenizer and model
# (https://huggingface.co/FacebookAI/xlm-roberta-base)
model_name = "FacebookAI/xlm-roberta-base"

print(f"\nLoading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4,
    id2label=id2label,
    label2id=label2id
)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=512, padding=False)

# Tokenize datasets
print("\nTokenizing datasets...")
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Convert labels to integers
def encode_labels(examples):
    examples['label'] = [label2id[label] for label in examples['label']]
    return examples

train_dataset = train_dataset.map(encode_labels, batched=True)
val_dataset = val_dataset.map(encode_labels, batched=True)
test_dataset = test_dataset.map(encode_labels, batched=True)

# Data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Custom Trainer with weighted loss
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits

        # Apply weighted cross entropy loss
        loss_fct = nn.CrossEntropyLoss(weight=class_weights.to(model.device))
        loss = loss_fct(logits, labels)

        return (loss, outputs) if return_outputs else loss

# Metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1_macro': f1_score(labels, predictions, average='macro'),
        'f1_weighted': f1_score(labels, predictions, average='weighted')
    }
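
For reference, the `'balanced'` mode passed to `compute_class_weight` in the cell above gives each class the weight `n_samples / (n_classes * count_of_that_class)`, so rarer classes get larger weights. The pure-Python sketch below reproduces that formula on made-up label ids; the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight for class c = n_samples / (n_classes * count_c)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

y = [0] * 6 + [1] * 2 + [2] * 2  # illustrative label ids
print(balanced_class_weights(y))
```

These weights are what `WeightedTrainer` passes to `nn.CrossEntropyLoss`, so a mistake on a rare class costs more than a mistake on the majority class.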

In [None]:

# Training arguments
training_args = TrainingArguments(
    output_dir="./review-classifier",
    eval_strategy="epoch",  # Use "evaluation_strategy" if you have older transformers
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    push_to_hub=False,
    logging_steps=10,
    save_total_limit=2,  # Keep only 2 best checkpoints
    report_to="none",
)

In [None]:
import os
# os.environ["WANDB_DISABLED"] = "true"

# Initialize Trainer
print("\nInitializing trainer...")
trainer = WeightedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,  # Use "processing_class=tokenizer" if you have newer transformers
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train
print("\nStarting training...")
trainer.train()

# Evaluate on test set
print("\nEvaluating on test set...")
test_results = trainer.evaluate(test_dataset)
print(f"\nTest Results: {test_results}")

# Detailed classification report
predictions = trainer.predict(test_dataset)
y_pred = np.argmax(predictions.predictions, axis=1)
y_true = test_dataset['label']

print("\nClassification Report:")
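# Pass labels explicitly so the report still works if a class is missing from the test set
print(classification_report(
    y_true, y_pred,
    labels=list(label2id.values()),
    target_names=list(label2id.keys()),
    zero_division=0,
))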

# Save model
print("\nSaving model...")
model.save_pretrained("./final-review-classifier")
tokenizer.save_pretrained("./final-review-classifier")

print("\nTraining complete! Model saved to './final-review-classifier'")

The F1 score combines precision and recall into a single metric: it is their harmonic mean, F1 = 2 · precision · recall / (precision + recall). This makes it particularly useful for evaluating models on imbalanced datasets, where plain accuracy can be misleading. `f1_macro` averages the per-class F1 scores equally, while `f1_weighted` weights each class by its number of true samples.
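
The sketch below computes macro and weighted F1 by hand on a tiny made-up example, to show how they differ when one class is rare. The function names are chosen for this illustration; in the notebook itself, `sklearn.metrics.f1_score` does this work.

```python
def f1_per_class(y_true, y_pred):
    """Return {class: (f1, support)} computed from precision and recall."""
    out = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        out[c] = (f1, tp + fn)  # support = number of true samples of class c
    return out

def f1_macro(scores):
    # Every class counts equally, however rare it is
    return sum(f1 for f1, _ in scores.values()) / len(scores)

def f1_weighted(scores):
    # Each class counts in proportion to its support
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total

y_true = [0, 0, 0, 0, 1, 1]  # illustrative labels: class 1 is the minority
y_pred = [0, 0, 0, 0, 0, 1]
scores = f1_per_class(y_true, y_pred)
print(f1_macro(scores), f1_weighted(scores))
```

Macro F1 comes out lower than weighted F1 here because the missed minority-class sample hurts class 1's score, and macro averaging gives that class the same say as the majority class. This is why the training arguments select the best checkpoint by `f1_macro`.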

In [None]:
# Load and test your model
from transformers import pipeline

print("Loading trained model for inference...")
classifier = pipeline(
    "text-classification",
    model="./final-review-classifier",
    tokenizer="./final-review-classifier",
    device=0 if torch.cuda.is_available() else -1  # Use GPU if available
)

# Test with sample reviews
test_reviews = [
    "The food was amazing, highly recommend this restaurant!", #Legitimate
    "Visit our website for 50% off! www.example.com Call 555-1234!", #Advertisement
    "I wanted to go but it was closed when I arrived. Very disappointed.", #Complaint_No_Visit
    "I am batman! In the darkest moment of my life, I look up towards the sky. Longing for the dark knight" #Irrelevant
]

print("\nTesting classifier on sample reviews:")
results = classifier(test_reviews)
for review, result in zip(test_reviews, results):
    print(f"\nReview: {review}")
    print(f"Predicted: {result['label']}, Confidence: {result['score']:.4f}")

## Zip

In [None]:
import shutil

shutil.make_archive('review-classifier-model', 'zip', './final-review-classifier')

## Unzip

In [None]:
!unzip review-classifier-model.zip -d review_classifier_model

In [None]:
# Re-import in case this runs in a fresh session
from transformers import pipeline
import torch

model_path = "review_classifier_model"

classifier = pipeline(
    "text-classification",
    model=model_path,
    tokenizer=model_path,
    device=0 if torch.cuda.is_available() else -1
)

print("Model loaded.")


## Inference

In [None]:
reviews = ["The food was delicious!", "Visit www.spam.com for deals!"]
results = classifier(reviews)

for review, result in zip(reviews, results):
    print(f"Review: {review}")
    print(f"Predicted: {result['label']}, Confidence: {result['score']:.4f}\n")
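
To connect the predictions back to the "enforce policies" requirement in the problem statement, the classifier output can be mapped to an action. The sketch below is one possible rule, not a prescribed one. The name `decide`, the action strings, and the 0.8 confidence threshold are assumptions chosen for illustration.

```python
POLICY_VIOLATIONS = {"Advertisement", "Irrelevant", "Complaint_No_Visit"}

def decide(result, threshold=0.8):
    """Map one classifier result ({'label', 'score'}) to an action."""
    if result["label"] not in POLICY_VIOLATIONS:
        return "keep"
    # Confident violations are filtered; uncertain ones go to manual review
    return "filter" if result["score"] >= threshold else "flag_for_review"

# Illustrative results in the shape returned by the text-classification pipeline
results = [
    {"label": "Legitimate", "score": 0.98},
    {"label": "Advertisement", "score": 0.95},
    {"label": "Irrelevant", "score": 0.55},
]
print([decide(r) for r in results])  # ['keep', 'filter', 'flag_for_review']
```

Sending low-confidence violations to manual review instead of filtering them outright reduces the risk of removing genuine reviews.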