# Solution pipeline

This pipeline first translates each review into English with Google Translate, then classifies the translated text as flagged or clean with our model. Reviews whose text is empty, including those left empty once emojis are removed, are flagged, as discussed in `README.md`.
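
The overall flow can be sketched without the real services. In this minimal sketch, `translate_to_english` and `classify` are hypothetical stand-ins for Google Translate and the trained model, and the keyword rule inside `classify` is only a placeholder, not how the model decides.

```python
def translate_to_english(text: str) -> str:
    """Hypothetical stand-in for the Google Translate call; returns the text unchanged."""
    return text


def classify(text: str) -> str:
    """Hypothetical stand-in for the trained model; uses a placeholder keyword rule."""
    return "flagged" if "hacked" in text.lower() else "clean"


def run_pipeline(reviews: list[str]) -> list[str]:
    """Translate each review, flag empty ones, and classify the rest."""
    labels = []
    for review in reviews:
        english = translate_to_english(review)
        if not english.strip():
            # Empty reviews never reach the classifier.
            labels.append("flagged")
        else:
            labels.append(classify(english))
    return labels


labels = run_pipeline(["My account got hacked here", "Great food", ""])
print(labels)
```

The point of the sketch is the ordering: translation happens first, and the empty-text rule is applied before classification.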

## Install packages

Install the `torch` build that matches your CUDA version, following the official PyTorch website and the notes in `README.md`. The command below assumes CUDA 12.9 (`cu129`).

In [None]:
%pip install transformers pandas googletrans
%pip install torch torchvision --index-url https://download.pytorch.org/whl/cu129

## Import packages

In [None]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch
import pandas as pd
from googletrans import Translator
import re

## Load model

This will load the trained model from our training script.

In [None]:
save_dir = "./data/saved_model"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForSequenceClassification.from_pretrained(save_dir, local_files_only=True)
model.to(device)
tokenizer = BertTokenizer.from_pretrained(save_dir, local_files_only=True)
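
Because `local_files_only=True` rules out any download, loading fails if `save_dir` is incomplete. A small pre-check can give a clearer message. This is a minimal sketch: the helper name `missing_model_files` is hypothetical, and the file names assume the training script saved an unsharded model and its tokenizer with `save_pretrained` in the default layout.

```python
from pathlib import Path


def missing_model_files(save_dir: str) -> list[str]:
    """Return the expected files that are absent from save_dir (assumed layout)."""
    root = Path(save_dir)
    missing = []
    # Model configuration and the BERT WordPiece vocabulary.
    for name in ("config.json", "vocab.txt"):
        if not (root / name).is_file():
            missing.append(name)
    # Weights are stored under one of these names, depending on the save format.
    weight_files = ("model.safetensors", "pytorch_model.bin")
    if not any((root / name).is_file() for name in weight_files):
        missing.append(" or ".join(weight_files))
    return missing


problems = missing_model_files("./data/saved_model")
if problems:
    print("Missing from save_dir:", ", ".join(problems))
```

An empty list means the directory at least has the files that `from_pretrained` looks for; it does not verify that their contents are valid.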

## Pipeline

This cell runs the full pipeline. Place the reviews to classify in `reviews` and run the cell. It prints the predicted label for each review (`clean` for class 0, `flagged` for class 1) and saves the predictions to `./data/predictions.csv`.

In [None]:
reviews = [
    "My Roblox account got hacked from this location",
    "I hear this is a top university I wanna go here",
    "Amazing place for students worldwide. Top notch facilities for everything you care about. Really interesting lot of students to hang around. You'll love this space, the Campus is attracted all over the city of Singapore. For vegetarians, it's a bit tricky to get the desired food. Amazing public transport and AQI less than 50.",
    "真他妈的好吃，推荐他们的辣子鸡",
    "restoran ini ada nasi lemak yang terbaik di seluruh Malaysia",
    "هذا المطعم يقدم أفضل كبسة في الرياض",
    "👍",
    ""
]
df = pd.DataFrame(reviews, columns=['original_text'])

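texts = df['original_text'].to_list()
indices = []
texts_to_translate = []
translated_texts = texts.copy()

async def translate_bulk():
    """Translate every non-English review into English, leaving the rest as-is."""
    async with Translator() as translator:
        for index, text in enumerate(texts):
            # Skip empty reviews: there is nothing to detect or translate.
            if not text.strip():
                continue
            result = await translator.detect(text)
            if result.lang != 'en':
                indices.append(index)
                texts_to_translate.append(text)
        # Only call the translation API when at least one review needs it.
        if texts_to_translate:
            translations = await translator.translate(texts_to_translate, dest='en')
            for i, translation in zip(indices, translations):
                translated_texts[i] = translation.text

# Top-level await works in a notebook cell; in a plain script, use asyncio.run(translate_bulk()).
await translate_bulk()
df['text'] = translated_texts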

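# Ranges are kept narrow: a single span from U+24C2 to U+1F251 would also
# match CJK ideographs and strip any text that was left untranslated.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F700-\U0001F77F"  # alchemical symbols
    "\U0001F780-\U0001F7FF"  # geometric shapes extended
    "\U0001F800-\U0001F8FF"  # supplemental arrows
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U0001FA00-\U0001FA6F"  # chess symbols
    "\U0001FA70-\U0001FAFF"  # symbols and pictographs extended
    "\U00002702-\U000027B0"  # dingbats
    "\U00002600-\U000026FF"  # miscellaneous symbols
    "\U00002B00-\U00002BFF"  # miscellaneous symbols and arrows
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U0001F200-\U0001F251"  # enclosed ideographic supplement
    "\U000024C2"             # circled letter M
    "\U0000FE0F"             # variation selector left behind by some emojis
    "]+"
)

def remove_emojis(text):
    """Strip emoji characters so they do not reach the tokenizer."""
    return EMOJI_PATTERN.sub('', text)

df["text"] = df["text"].apply(remove_emojis)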

# Empty reviews (including those left empty after emoji removal) are flagged by default.
results = ["flagged"] * len(df)
cleaned_texts = df['text'].to_list()
non_empty_indices = [i for i, t in enumerate(cleaned_texts) if t.strip()]
non_empty_texts = [cleaned_texts[i] for i in non_empty_indices]
if non_empty_texts:
    # Tokenize the whole batch; reviews longer than 128 tokens are truncated.
    inputs = tokenizer(non_empty_texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model.eval()  # keep dropout disabled during inference
    with torch.no_grad():
        logits = model(**inputs).logits
        predictions = torch.argmax(logits, dim=1)
    # Class 0 is clean, class 1 is flagged.
    for idx, pred in zip(non_empty_indices, predictions.tolist()):
        results[idx] = "clean" if pred == 0 else "flagged"
df["label"] = results
for review, label in zip(df['text'], df['label']):
    print(f"Review: {review}\nPredicted label: {label}\n")

df = df[['original_text', 'label']]
df.to_csv("./data/predictions.csv", index=False)
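
One thing to watch with character-class emoji patterns: a single range running from U+24C2 to U+1F251 also spans the CJK ideographs (U+4E00 to U+9FFF), so any Chinese text that was left untranslated would be stripped along with the emojis. The sketch below compares such a wide range with a few narrower ones. The names `WIDE` and `NARROW` and the choice of narrower ranges are illustrative assumptions, not the exact set used in the pipeline.

```python
import re

# One very wide range, from the circled letter M up to U+1F251.
WIDE = re.compile("[\U000024C2-\U0001F251]+")

# Narrower alternative (illustrative): explicit symbol and emoji blocks only.
NARROW = re.compile(
    "["
    "\U000024C2"
    "\U00002600-\U000026FF"
    "\U0001F1E6-\U0001F1FF"
    "\U0001F200-\U0001F251"
    "\U0001F300-\U0001FAFF"
    "]+"
)

sample = "辣子鸡 👍"

# The wide range removes the Chinese characters and, since U+1F44D lies
# above U+1F251, leaves the thumbs-up in place.
wide_result = WIDE.sub("", sample)

# The narrow ranges remove only the emoji and keep the Chinese text.
narrow_result = NARROW.sub("", sample)

print(repr(wide_result))
print(repr(narrow_result))
```

This matters mainly when translation fails or is skipped, because a review stripped down to empty text is flagged without ever reaching the model.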