[TikTok Tech Jam 2025](https://bytedance.sg.larkoffice.com/docx/E6bhdCMUqojrxsxOFn5lFVdMgkc)

# Problem Statement - '1. Filtering the Noise: ML for Trustworthy Location Reviews'

Design and implement an ML-based system to evaluate the quality and relevancy of Google location reviews.

The system should:
1. **Gauge review quality**: Detect spam, advertisements, irrelevant content, and rants from users who have likely never visited the location.

2. **Assess relevancy**: Determine whether the content of a review is genuinely related to the location being reviewed.

3. **Enforce policies**: Automatically flag or filter out reviews that violate the following example policies:
    - No advertisements or promotional content.
    - No irrelevant content (e.g., reviews about unrelated topics).
    - No rants or complaints from users who have not visited the place (can be inferred from content, metadata, or other signals).

In [None]:
!pip install openai



## Imports

In [None]:
from openai import OpenAI
from openai import AsyncOpenAI
from tqdm import tqdm #progress bar
from sklearn.model_selection import train_test_split
import pandas as pd
import nest_asyncio, asyncio, aiohttp, json, re

In [None]:
# Load the secret API key from Google Colab's Secrets (userdata)
from google.colab import userdata
api_key = userdata.get('OPENAI_API_KEY')

## VS Code method to store api_key securely
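
Outside Colab, one common pattern is to keep the key in a local `.env` file that is listed in `.gitignore`, then read it into the environment at startup. The sketch below is a minimal, dependency-free version of that idea. The file name `.env`, the helper `load_env_file`, and the `KEY=VALUE` layout are assumptions for illustration; the `python-dotenv` package offers a more complete implementation.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Hypothetical helper: copy KEY=VALUE lines from a .env file into os.environ.

    Variables that are already set in the environment are not overwritten.
    """
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional surrounding quotes
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


load_env_file()  # does nothing if there is no .env file in the working directory
api_key = os.environ.get("OPENAI_API_KEY")  # None if the key is not configured
```

With this in place, the `.env` file would contain a single line such as `OPENAI_API_KEY=...`, and the key never appears in the notebook or in version control.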

## Quick Git tutorial with VS Code and .env

### Example via OpenAI 'Responses' API

https://platform.openai.com/docs/api-reference/responses

In [None]:
# Test openai first!

client = OpenAI(api_key = api_key)

response = client.responses.create(
    model="gpt-4o-mini-2024-07-18",
    input="Write a one-sentence bedtime story about a unicorn."
)

print(response.output_text)

As the moon shone brightly, a gentle unicorn named Luna spread her shimmering wings and soared through the starlit sky, sprinkling dreams of magic and wonder over all the sleeping children below.


## Example Payloads for Reference

### Azure AI

```
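import requests

# AZURE_OPENAI_API_KEY, AZURE_OPENAI_API_BASE, DEPLOYMENT_NAME, API_VERSION and prompt
# are placeholders that you must define yourself.
headers = {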
    "Content-Type": "application/json",
    "api-key": AZURE_OPENAI_API_KEY
}
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
    "max_tokens": 150
}

endpoint = f"{AZURE_OPENAI_API_BASE}/openai/deployments/{DEPLOYMENT_NAME}/chat/completions?api-version={API_VERSION}"

response = requests.post(endpoint, headers=headers, json=payload)

response_data = response.json()
message_content = response_data['choices'][0]['message']['content']
```

### Vertex AI

```
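import requests

# ACCESS_TOKEN, LOCATION, PROJECT_ID and prompt are placeholders that you must define yourself.
headers = {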
    "Content-Type": "application/json",
    "Authorization": f"Bearer {ACCESS_TOKEN}"   # Generated via service account
}

payload = {
    "contents": [
        {
            "role": "user",
            "parts": [{"text": prompt}]
        }
    ]
}

endpoint = f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/gemini-1.5-pro:generateContent"

response = requests.post(endpoint, headers=headers, json=payload)
message_content = response.json()["candidates"][0]["content"]["parts"][0]["text"]

```

### AWS Bedrock

```
import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

payload = {
    "messages": [
        {"role": "user", "content": [{"text": prompt}]}
    ]
}

response = client.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=payload["messages"]
)

message_content = response["output"]["message"]["content"][0]["text"]
```

## 2 Possible Data Sources

### 1.   Using this "reviews.csv" from Kaggle (Easy)
- Google Review Data: Open datasets containing Google location reviews, e.g., Google Local Reviews on Kaggle: https://www.kaggle.com/datasets/denizbilginn/google-maps-restaurant-reviews


### 2.   Web scrape Google reviews (Hard)
- APIFY: https://apify.com/compass/google-maps-reviews-scraper?fpr=c2c1t



# 1. Working with clean Kaggle "reviews.csv" data

To train a text classification model, as with any supervised machine learning, you need labeled data.

2 Approaches
- Label the reviews manually (by hand).
- Use a relatively modern LLM to pseudo-label them for you. (Recommended)



In [None]:
df = pd.read_csv('reviews.csv')
df

Unnamed: 0,business_name,author_name,text,photo,rating,rating_category
0,Haci'nin Yeri - Yigit Lokantasi,Gulsum Akar,We went to Marmaris with my wife for a holiday...,dataset/taste/hacinin_yeri_gulsum_akar.png,5,taste
1,Haci'nin Yeri - Yigit Lokantasi,Oguzhan Cetin,During my holiday in Marmaris we ate here to f...,dataset/menu/hacinin_yeri_oguzhan_cetin.png,4,menu
2,Haci'nin Yeri - Yigit Lokantasi,Yasin Kuyu,Prices are very affordable. The menu in the ph...,dataset/outdoor_atmosphere/hacinin_yeri_yasin_...,3,outdoor_atmosphere
3,Haci'nin Yeri - Yigit Lokantasi,Orhan Kapu,Turkey's cheapest artisan restaurant and its f...,dataset/indoor_atmosphere/hacinin_yeri_orhan_k...,5,indoor_atmosphere
4,Haci'nin Yeri - Yigit Lokantasi,Ozgur Sati,I don't know what you will look for in terms o...,dataset/menu/hacinin_yeri_ozgur_sati.png,3,menu
...,...,...,...,...,...,...
1095,Miss Pizza,Salih Gursoy,There are so many types of pizza; you are surp...,dataset/taste/miss_pizza_salih_gursoy.png,5,taste
1096,Miss Pizza,Kemal Amangeldi,I tried the smoked ribeye pizza; the dough is ...,dataset/indoor_atmosphere/miss_pizza_kemal_ama...,5,indoor_atmosphere
1097,Miss Pizza,Ulkem Esen,Crowded and expensive place.,dataset/menu/miss_pizza_ulkem_esen.png,3,menu
1098,Miss Pizza,Ilkin Saymaz,No bad. It was very crowded; there was no ligh...,dataset/taste/miss_pizza_ilkin_saymaz.png,3,taste


business_name, author_name, and rating_category are not strong candidate features.

photo may be relevant.

rating is subjective.

text directly conveys the reviewer's sentiment, so it is the main input for classification.

## Let's dip our toes into the pseudolabelling part first!

The LLM should assign each review to exactly one of the following categories:

1. Advertisement
2. Irrelevant
3. Complaint_No_Visit
4. Legitimate


In [None]:
client = OpenAI(api_key = api_key)

labels = ["Advertisement", "Irrelevant", "Complaint_No_Visit", "Legitimate"]

results = []
df_subset = df.head(50).copy()  # test small first!

for _, row in tqdm(df_subset.iterrows()):
    review = str(row["text"])

    prompt = """
      You are a data annotator for Google location reviews.
      Classify each review into exactly ONE of these categories:
      {labels}

      Definitions:
      - Advertisement: Contains promo, phone numbers, links, or offers.
      - Irrelevant: Not related to the location.
      - Complaint_No_Visit: Complaint by someone who likely didn’t visit (e.g., "never been", "heard", "they say").
      - Legitimate: A genuine, relevant review about the user’s experience.

      Examples:
      1. "Use code PIZZA10 for discount" → Advertisement
      2. "Never been here but heard it’s bad" → Complaint_No_Visit
      3. "Food was great and the staff were friendly" → Legitimate
      4. "I am batman" → Irrelevant

      Return only one of the label names.
      Review: "{review}"
  """.format(labels=labels, review=review)

    try:
        response = client.responses.create(
            model="gpt-4o-mini-2024-07-18",
            input=prompt,
        )
        label = response.output_text.strip()
    except Exception as e:
        print(e)
        label = "Unknown"

    results.append(label)

df_labeled = df_subset.copy()
df_labeled["label"] = results
df_labeled.head()


50it [00:37,  1.34it/s]


Unnamed: 0,business_name,author_name,text,photo,rating,rating_category,label
0,Haci'nin Yeri - Yigit Lokantasi,Gulsum Akar,We went to Marmaris with my wife for a holiday...,dataset/taste/hacinin_yeri_gulsum_akar.png,5,taste,Legitimate
1,Haci'nin Yeri - Yigit Lokantasi,Oguzhan Cetin,During my holiday in Marmaris we ate here to f...,dataset/menu/hacinin_yeri_oguzhan_cetin.png,4,menu,Legitimate
2,Haci'nin Yeri - Yigit Lokantasi,Yasin Kuyu,Prices are very affordable. The menu in the ph...,dataset/outdoor_atmosphere/hacinin_yeri_yasin_...,3,outdoor_atmosphere,Legitimate
3,Haci'nin Yeri - Yigit Lokantasi,Orhan Kapu,Turkey's cheapest artisan restaurant and its f...,dataset/indoor_atmosphere/hacinin_yeri_orhan_k...,5,indoor_atmosphere,Legitimate
4,Haci'nin Yeri - Yigit Lokantasi,Ozgur Sati,I don't know what you will look for in terms o...,dataset/menu/hacinin_yeri_ozgur_sati.png,3,menu,Legitimate


## This method is too slow: at roughly 1 s per request, labeling all 1,100 rows one API call at a time takes about 18 minutes.

## Imagine a much bigger dataset. We need to call the API **asynchronously**.
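
To make the speed-up concrete before touching the real API, the sketch below compares sequential and concurrent execution using a simulated request. `fake_api_call`, its 0.05 s latency, and the concurrency limit of 10 are assumptions for illustration only; no network call is made.

```python
import asyncio
import time


async def fake_api_call(text, latency=0.05):
    """Hypothetical stand-in for one API request: waits, then returns a dummy label."""
    await asyncio.sleep(latency)
    return f"label-for-{text}"


async def run_sequential(texts):
    # One request at a time: total time is roughly len(texts) * latency.
    return [await fake_api_call(t) for t in texts]


async def run_concurrent(texts, max_concurrency=10):
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(t):
        async with semaphore:
            return await fake_api_call(t)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(limited(t) for t in texts))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


demo_texts = [f"review {i}" for i in range(20)]
seq_result, seq_seconds = timed(run_sequential(demo_texts))
con_result, con_seconds = timed(run_concurrent(demo_texts))
print(f"sequential: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

`asyncio.run` works in a plain script. Inside a notebook, where an event loop is already running, you would `await run_concurrent(demo_texts)` directly instead.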

# Hands-On Asyncio

In [None]:
nest_asyncio.apply()

In [None]:
# import async
from openai import AsyncOpenAI

# still need to pass in your api_key
aclient = AsyncOpenAI(api_key=api_key)

labels = ["Advertisement", "Irrelevant", "Complaint_No_Visit", "Legitimate"]

def build_prompt(text):
    return f"""
    You are a data annotator for Google location reviews.
    Classify each review into exactly ONE of these categories: {labels}

    Definitions:
    - Advertisement: Contains promo, phone numbers, links, or offers.
    - Irrelevant: Not related to the location.
    - Complaint_No_Visit: Complaint by someone who likely didn’t visit (e.g., "never been", "heard", "they say").
    - Legitimate: A genuine, relevant review about the user’s experience.

    Return only one of the label names.
    Review: "{text}"
    """


In [None]:
async def annotate_row(text):
    try:
        response = await aclient.responses.create(
            model="gpt-4o-mini-2024-07-18",
            input=build_prompt(text),
        )
        label = response.output_text.strip()
    except Exception as e:
        label = f"Error: {e}"
    return label

In [None]:
async def label_batch(texts, batch_size=25):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        tasks = [annotate_row(t) for t in batch]
        batch_results = await asyncio.gather(*tasks)
        results.extend(batch_results)
        print(f"Processed {len(results)}/{len(texts)}")
    return results

texts = df["text"].astype(str).tolist()[:1100]   # limit to 1.1k
labels_out = await label_batch(texts)


Processed 25/1100
Processed 50/1100
Processed 75/1100
Processed 100/1100
Processed 125/1100
Processed 150/1100
Processed 175/1100
Processed 200/1100
Processed 225/1100
Processed 250/1100
Processed 275/1100
Processed 300/1100
Processed 325/1100
Processed 350/1100
Processed 375/1100
Processed 400/1100
Processed 425/1100
Processed 450/1100
Processed 475/1100
Processed 500/1100
Processed 525/1100
Processed 550/1100
Processed 575/1100
Processed 600/1100
Processed 625/1100
Processed 650/1100
Processed 675/1100
Processed 700/1100
Processed 725/1100
Processed 750/1100
Processed 775/1100
Processed 800/1100
Processed 825/1100
Processed 850/1100
Processed 875/1100
Processed 900/1100
Processed 925/1100
Processed 950/1100
Processed 975/1100
Processed 1000/1100
Processed 1025/1100
Processed 1050/1100
Processed 1075/1100
Processed 1100/1100


In [None]:
# Save the output first, just in case
df_labeled = pd.DataFrame({"text": texts, "label": labels_out})
df_labeled.to_csv("labeled_reviews.csv", index=False)

In [None]:
df_labeled

Unnamed: 0,text,label
0,We went to Marmaris with my wife for a holiday...,Legitimate
1,During my holiday in Marmaris we ate here to f...,Legitimate
2,Prices are very affordable. The menu in the ph...,Legitimate
3,Turkey's cheapest artisan restaurant and its f...,Legitimate
4,I don't know what you will look for in terms o...,Legitimate
...,...,...
1095,There are so many types of pizza; you are surp...,Legitimate
1096,I tried the smoked ribeye pizza; the dough is ...,Legitimate
1097,Crowded and expensive place.,Complaint_No_Visit
1098,No bad. It was very crowded; there was no ligh...,Legitimate
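
Because the label column comes straight from free-form model output (and `annotate_row` returns an `Error: ...` string on failure), it can help to normalize it before counting or training. The sketch below is one possible cleanup step; `normalize_label` and the `"Unknown"` fallback are hypothetical names, not part of the notebook above.

```python
VALID_LABELS = ["Advertisement", "Irrelevant", "Complaint_No_Visit", "Legitimate"]


def normalize_label(raw, valid=VALID_LABELS, fallback="Unknown"):
    """Map a raw model output to one of the valid labels, or to the fallback."""
    cleaned = str(raw).strip().strip("\"'`.").lower()
    lookup = {label.lower(): label for label in valid}
    if cleaned in lookup:
        return lookup[cleaned]
    # Accept a longer answer only if it mentions exactly one valid label.
    hits = [label for label in valid if label.lower() in cleaned]
    return hits[0] if len(hits) == 1 else fallback


raw_outputs = [" legitimate.\n", "Label: Advertisement", "Error: timeout", "Legitimate or Irrelevant"]
print([normalize_label(r) for r in raw_outputs])
```

Rows that end up as `"Unknown"` can then be re-sent to the API or dropped before training.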


## Short coding test

Given a list, count the occurrences of each element in Python. This is a HackerRank (Easy) style problem.

In [None]:
labels_out

['Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Irrelevant',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Complaint_No_Visit',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Complaint_No_Visit',
 'Legitimate',
 'Legitimate',
 'Legitimate',
 'Complaint_No_Visit',
 'Complaint_No_Visit',
 'Legiti

## Answer

In [None]:
# brute force

Legitimate = []
Irrelevant = []
Complaint_No_Visit = []
Advertisement = []

for label in labels_out:
  if label == "Legitimate":
    Legitimate.append(label)
  elif label == "Irrelevant":
    Irrelevant.append(label)
  elif label == "Complaint_No_Visit":
    Complaint_No_Visit.append(label)
  elif label == "Advertisement":
    Advertisement.append(label)


print(f"Legitimate: {len(Legitimate)}")
print(f"Irrelevant: {len(Irrelevant)}")
print(f"Complaint_No_Visit: {len(Complaint_No_Visit)}")
print(f"Advertisement: {len(Advertisement)}")

Legitimate: 959
Irrelevant: 25
Complaint_No_Visit: 115
Advertisement: 1


In [None]:
# use libraries

from collections import Counter

Counter(labels_out)

Counter({'Legitimate': 959,
         'Irrelevant': 25,
         'Complaint_No_Visit': 115,
         'Advertisement': 1})

## Imbalanced Data Problem

Imbalanced data is a machine learning problem where one class has significantly more examples than others, causing models to become biased toward the majority class and perform poorly on the minority class.

In this case, if I trained a model on this pseudo-labeled data, it could simply try its luck and predict "Legitimate" every time, and it would be right most of the time (***/1100 correct).

That looks like a really good score, doesn't it? Yet such a model would never flag a single policy violation, so accuracy alone is misleading here.
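
The effect can be checked with a few lines of arithmetic. The sketch below uses the label counts printed by `Counter(labels_out)` above and scores a hypothetical classifier that always predicts the majority class; macro-averaged recall is included as one example of a metric that exposes the problem.

```python
from collections import Counter

# Counts copied from the Counter(labels_out) output above.
label_counts = Counter({
    "Legitimate": 959,
    "Irrelevant": 25,
    "Complaint_No_Visit": 115,
    "Advertisement": 1,
})

total = sum(label_counts.values())
majority_label, majority_count = label_counts.most_common(1)[0]

# Accuracy of a model that answers with the majority class every time.
baseline_accuracy = majority_count / total

# Recall per class for that same model: perfect on the majority class, zero elsewhere.
recall = {label: (1.0 if label == majority_label else 0.0) for label in label_counts}
macro_recall = sum(recall.values()) / len(recall)

print(f"Always predicting {majority_label!r}: accuracy = {baseline_accuracy:.3f}")
print(f"Macro-averaged recall = {macro_recall:.2f}")
```

Accuracy comes out at roughly 87%, while macro-averaged recall is only 0.25, because three of the four classes are never predicted.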

## How to deal with Imbalanced Data?

1. **Oversample the minority class** (duplicate minority examples) or **undersample the majority class** (drop majority examples)
2. **Class weighting** - Assign a higher weight to the minority class
3. **Generate synthetic data** - Data augmentation
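
The first two options can be sketched in plain Python. The helpers `balanced_class_weights` and `random_oversample` below are hypothetical names, and the toy data is made up for illustration. The weight formula `n_samples / (n_classes * class_count)` is the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority examples until every class matches the largest one."""
    rng = random.Random(seed)
    texts_by_label = {}
    for text, label in zip(texts, labels):
        texts_by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in texts_by_label.values())
    out_texts, out_labels = [], []
    for label, items in texts_by_label.items():
        extras = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extras)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Toy data for illustration only.
toy_labels = ["Legitimate"] * 6 + ["Advertisement"] * 2
toy_texts = [f"review {i}" for i in range(len(toy_labels))]

weights = balanced_class_weights(toy_labels)
balanced_texts, balanced_labels = random_oversample(toy_texts, toy_labels)
print(weights)
print(Counter(balanced_labels))
```

Oversampling by duplication adds no new information, which is why the notebook goes on to generate synthetic examples for the rarest class.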



## Data Augmentation

In [None]:
import json

client = OpenAI(api_key=api_key)

response = client.responses.create(
    model="gpt-4o-mini",
    input="""
    Generate 30 examples of fake Google location reviews that are clearly advertisements.
    Each should sound like a review left by a business promoting itself.
    Vary the business type (restaurants, cafes, gyms, salons, etc.).

    Each review should be 1–2 sentences long and include ad-like elements
    such as links, phone numbers, promo codes, or phrases like "call now", "visit", "discount", etc.

    # Note: Output the list of advertisements as the value for the key 'advertisements'.
    # Only output the JSON object without commentary or markdown.
    """
)

# 1. Access the output text (expected to be a JSON string, though the model may not always comply)
json_string = response.output_text

# 2. Parse the JSON string into a Python dictionary (raises json.JSONDecodeError on invalid JSON)
data = json.loads(json_string)

# 3. The final list of strings is accessed directly by the key
ads = data.get('advertisements', [])

print(f"Successfully parsed {len(ads)} advertisements.")
# Example of the resulting list:
# print(ads)

Successfully parsed 28 advertisements.
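
Note that the model returned 28 items although 30 were requested, and nothing guarantees that the output is bare JSON. The sketch below is one defensive way to parse such output; `strip_code_fence` and `extract_string_list` are hypothetical helpers, and the sample string is made up for illustration.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to avoid typing it literally


def strip_code_fence(text):
    """Remove a surrounding Markdown code fence (with optional language tag), if present."""
    text = text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        inner = text[len(FENCE):-len(FENCE)]
        first_line, _, rest = inner.partition("\n")
        # Drop the opening line if it is empty or a language tag such as "json".
        if not first_line.strip() or first_line.strip().isalpha():
            inner = rest
        return inner.strip()
    return text


def extract_string_list(raw_output, key):
    """Return the non-empty strings stored under `key`, or [] if parsing fails."""
    try:
        data = json.loads(strip_code_fence(raw_output))
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    items = data.get(key, [])
    if not isinstance(items, list):
        return []
    return [item.strip() for item in items if isinstance(item, str) and item.strip()]


# Made-up model output: fenced JSON with one good item, one blank item, one non-string.
raw = FENCE + 'json\n{"advertisements": ["Call 555-0100 now!", "  ", 42]}\n' + FENCE
ads_demo = extract_string_list(raw, "advertisements")
requested = 3
if len(ads_demo) < requested:
    print(f"Got {len(ads_demo)} of {requested} requested items; consider asking again for the rest.")
```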


In [None]:
ads=['Delicious vegan options await you at Green Eats! Visit www.GreenEatsCafe.com for a 15% discount on your first meal. Call us at 555-2345 today!',
 'Experience luxury haircare at Glamour Salon! Book your appointment now at www.GlamourSalon.com and use promo code SHINE for 20% off. Call 555-7890!',
 'Savor the best sushi in town at Ocean Breeze! Order now at www.OceanBreezeSushi.com and get a free appetizer with any entree. Call 555-4567!',
 'Join the fitness revolution at PowerHouse Gym! Sign up today for a FREE personal training session at www.PowerHouseGym.com. Call 555-3210!',
 'Indulge your sweet tooth at Choco Heaven! Enjoy a 10% discount on your first order with promo code YUM at www.ChocoHeaven.com. Call us at 555-1112!',
 'Elevate your style at Trendy Boutique! Visit www.TrendyBoutique.com for exclusive offers, and call 555-2223 for personal styling advice!',
 'Refresh your look at Artisan Barbers! Book today at www.ArtisanBarbers.com for 25% off your first haircut. Call 555-3334 now!',
 'Unwind at Blissful Spa with our awesome treatments! Visit www.BlissfulSpa.com for special packages and call us at 555-4445 to book today!',
 'Grab the best coffee in town at Daily Grind! Join our loyalty program at www.DailyGrindCafe.com and get your 5th cup free. Call 555-5556!',
 'Taste the authentic flavors at Spice Journey! Enjoy 20% off your first meal when you order online at www.SpiceJourney.com. Call 555-6667!',
 'Get fit at Active Life Studio! Sign up now at www.ActiveLifeStudio.com and receive a free class. Call 555-7778 for details!',
 'Spoil yourself at Radiant Nails! Visit www.RadiantNails.com and book an appointment with promo code BEAUTY for 15% off. Call 555-8889!',
 'Treat your taste buds at Pasta Paradise! Order online at www.PastaParadise.com for a 10% discount with promo code DELICIOUS. Call 555-9990!',
 'Transform your skin at Pure Glow Esthetics! Visit www.PureGlow.com and book a facial today to get a 20% discount. Call 555-1011!',
 'Join the community at Yoga Harmony! Sign up for a free trial class at www.YogaHarmony.com and call 555-1213 for more information!',
 'Experience unparalleled flavors at Taco Fiesta! Order now at www.TacoFiesta.com for a free drink with any meal. Call 555-1415!',
 'Get pampered at Luxurious Escapes Spa! Visit www.LuxuriousEscapes.com for special pricing and call 555-1617 to book your getaway!',
 "Discover the art of cooking at Chef's Table! Enroll in our cooking classes at www.ChefsTable.com and use promo code COOK20 for 20% off. Call 555-1819!",
 'Stay stylish with fresh designs from Fashion Forward! Check out www.FashionForward.com for great deals, and call 555-2021 for personalized shopping!',
 'Revitalize your home at Dream Interiors! Visit www.DreamInteriors.com and schedule a consultation today. Call 555-2222 for special offers!',
 'Indulge in gourmet burgers at Burger Bliss! Order at www.BurgerBliss.com for a buy-one-get-one-free deal. Call 555-2423 now!',
 'Join the revolution at Elite Martial Arts! Register today at www.EliteMartialArts.com and get your first month for just $29. Call 555-2624!',
 'Satisfy your cravings at Sweet Treats Bakery! Visit www.SweetTreatsBakery.com for a special deal on cupcakes. Call 555-2825 for inquiries!',
 'Get fit effortlessly at Body Balance! Sign up at www.BodyBalanceGym.com and enjoy your first class for FREE. Call us at 555-3036!',
 'Discover amazing wines at Vineyard Wonders! Enjoy 10% off your first purchase at www.VineyardWonders.com. Call 555-3237 to learn more!',
 'Upgrade your wellness with Green Leaf Supplements! Visit www.GreenLeafSupplements.com now for 20% off your first order. Call 555-3438!',
 'Chase the unique at Artful Studios! Book a painting session today at www.ArtfulStudios.com and get a discount with promo code CREATE. Call 555-3639!',
 'Dine in luxury at Night Sky Restaurant! Reserve your table now at www.NightSkyRestaurant.com for exclusive offers. Call 555-3840!',
 'Find your zen at Serenity Yoga Studio! Visit www.SerenityYoga.com for a free first class and call 555-4041 for details!',
 'Fall in love with fashion at Chic Trends! Shop online at www.ChicTrends.com and use promo code STYLE for 15% off your first order. Call 555-4242!']

df_labeled = pd.read_csv("/content/labeled_reviews.csv")

In [None]:
df_ads = pd.DataFrame({
    "text": ads,
    "label": "Advertisement"
})

df_combined = pd.concat([df_labeled, df_ads], ignore_index=True)

print(df_combined["label"].value_counts())

label
Legitimate            963
Complaint_No_Visit    111
Advertisement          28
Irrelevant             26
Name: count, dtype: int64


In [None]:
df_combined.to_csv("labeled_reviews+augmented.csv",index=False)

# Training a local classifier using a Hugging Face model

In [None]:
!pip install setfit

Collecting setfit
  Downloading setfit-1.1.3-py3-none-any.whl.metadata (12 kB)
Collecting evaluate>=0.3.0 (from setfit)
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading setfit-1.1.3-py3-none-any.whl (75 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate, setfit
Successfully installed evaluate-0.4.6 setfit-1.1.3


In [None]:
from setfit import Trainer, SetFitModel, TrainingArguments
from datasets import Dataset
import pandas as pd
import os
os.environ["WANDB_DISABLED"] = "true"

# 1. Load your data
df = pd.read_csv('labeled_reviews+augmented.csv')

# 2. Convert to Hugging Face Dataset format
# Ensure columns are mapped correctly if they aren't named "text" and "label"
# If your CSV has "text" and "label" columns, no mapping is needed.
dataset = Dataset.from_pandas(df)

# 3. Load the model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# 4. Define Training Arguments
# This is where hyperparameters live now
args = TrainingArguments(
    batch_size=16,
    num_epochs=1,
    num_iterations=20,  # The number of text pairs to generate for contrastive learning
    eval_strategy="epoch",  # `evaluation_strategy` is deprecated in this SetFit release
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
)

# 5. Initialize Trainer
trainer = Trainer(
    model=model,
    args=args,                  # Pass the arguments object here
    train_dataset=dataset,
    eval_dataset=dataset,       # Using same for demo; split your data in real usage!
    metric="accuracy",
    column_mapping={"text": "text", "label": "label"} # Explicitly map columns just in case
)

# 6. Train
print("Training local model...")
trainer.train()

# 7. Evaluate & Save
metrics = trainer.evaluate()
print(f"Model Accuracy: {metrics['accuracy']}")

# Save the final model for inference
model.save_pretrained("my_local_review_filter")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
The `evaluation_strategy` argument is deprecated and will be removed in a future version. Please use `eval_strategy` instead.
Applying column mapping to the training dataset
Applying column mapping to the evaluation dataset
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Map:   0%|          | 0/1130 [00:00<?, ? examples/s]

Training local model...


***** Running training *****
  Num unique pairs = 45200
  Batch size = 16
  Num epochs = 1


Epoch,Training Loss,Validation Loss
1,0.0002,0.00021


***** Running evaluation *****



Model Accuracy: 0.9991150442477876
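
This accuracy is measured on the same rows the model was trained on, so it says little about how the model handles unseen reviews. A held-out split that preserves class proportions gives a fairer estimate. The sketch below is a small standard-library version with a hypothetical helper `stratified_split` and toy data; with scikit-learn available, `train_test_split(texts, labels, test_size=0.2, stratify=labels, random_state=42)` does the same job.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, test_size=0.2, seed=42):
    """Split so that every class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in indices_by_label.values():
        rng.shuffle(indices)
        # Classes with a single example stay entirely in the training set.
        n_test = max(1, round(len(indices) * test_size)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idx):
        return [texts[i] for i in idx], [labels[i] for i in idx]

    return pick(sorted(train_idx)), pick(sorted(test_idx))


# Toy data for illustration only.
toy_texts = [f"review {i}" for i in range(15)]
toy_labels = ["Legitimate"] * 10 + ["Advertisement"] * 5

(train_texts, train_labels), (test_texts, test_labels) = stratified_split(toy_texts, toy_labels)
print(len(train_texts), len(test_texts))
```

The training portion would then be passed as `train_dataset` and the held-out portion as `eval_dataset`.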




In [None]:
from setfit import SetFitModel

# Load from your local folder
model = SetFitModel.from_pretrained("my_local_review_filter")

reviews = [
    "The food was amazing, highly recommend!",
    "Call 555-1234 for cheap loans!",
    "I hate this place, never been there though."
]

# Run inference (No GPU needed for this part)
preds = model(reviews)

print(preds)
# Output: ['Legitimate', 'Advertisement', 'Complaint_No_Visit']

['Legitimate' 'Advertisement' 'Complaint_No_Visit']




## Zip and download the model

In [None]:
import shutil
from google.colab import files

# 1. Zip the folder created by model.save_pretrained()
# syntax: make_archive(output_filename, 'zip', dir_to_zip)
shutil.make_archive("my_local_review_filter", 'zip', "my_local_review_filter")

# 2. Trigger the browser download
files.download("my_local_review_filter.zip")
