# Homework: SMS Spam Classification

**Course:** Deep Learning

**Objective:** Train a model to classify SMS messages as spam or ham.

**Dataset:** SMS Spam Collection  
* **Source:** UCI ML Repository  
* **Download:** https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection  
* **Size:** ~5,500 messages (13% spam, 87% ham)  
* **Format:** TSV with columns  
  * `label`: “spam” (1) / “ham” (0)
  * `text`: raw SMS content  

**Tasks:**
1. Load and explore the dataset.
2. Preprocess the text.
3. Define and train a model (any method from the course).
4. Evaluate the model's performance using standard classification metrics on the test set.

> **Success:** achieve ≥ 0.90 F1-score on the test set.  
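
Since only about 13% of the messages are spam, accuracy alone can look strong for a model that is of little use. The minimal sketch below uses made-up labels (not drawn from the dataset) and assumes the F1 target refers to the spam class (label 1). It compares a classifier that always answers "ham" with one that catches most spam.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy ground truth (illustrative only): 13 spam followed by 87 ham.
y_true = [1] * 13 + [0] * 87

# A degenerate "model" that labels every message as ham.
y_all_ham = [0] * 100

# A model that catches 11 of the 13 spam messages and raises 1 false alarm.
y_decent = [1] * 11 + [0] * 2 + [1] * 1 + [0] * 86

acc_ham = accuracy_score(y_true, y_all_ham)
f1_ham = f1_score(y_true, y_all_ham, zero_division=0)  # no positives predicted
acc_dec = accuracy_score(y_true, y_decent)
f1_dec = f1_score(y_true, y_decent)

print(f"all-ham: acc={acc_ham:.2f}, f1={f1_ham:.2f}")
print(f"decent:  acc={acc_dec:.2f}, f1={f1_dec:.2f}")
```

In this toy case the second model reaches 0.97 accuracy but only 0.88 F1, so high accuracy does not by itself guarantee the 0.90 F1 threshold.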


# Prerequisites
The install cell may print pip errors; these can be safely ignored.

In [None]:
%pip install -q datasets
import os
import random
import numpy as np
import torch
import requests, zipfile, io
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import BertTokenizerFast, BertForSequenceClassification, Trainer, TrainingArguments, set_seed
from datasets import Dataset

# Don't change ssid (the random seed), or test results will not be comparable
ssid = 42
random.seed(ssid)
np.random.seed(ssid)
torch.manual_seed(ssid)
torch.cuda.manual_seed_all(ssid)
set_seed(ssid)

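def download_data():
  # Fetch the zipped SMS Spam Collection archive from the UCI repository
  url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
  response = requests.get(url)
  response.raise_for_status()  # fail early on an HTTP error instead of unzipping an error page
  with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall("data")
  # The file is tab-separated with no header row: label first, then the message text
  df = pd.read_csv("data/SMSSpamCollection", sep='\t', header=None, names=['label', 'text'])
  df['label'] = df['label'].map({'ham': 0, 'spam': 1})
  return df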

def train_val_test(df):
  train_df, temp_df = train_test_split(df, test_size=0.3, stratify=df['label'], random_state=ssid)
  val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df['label'], random_state=ssid)
  return train_df, val_df, test_df

# Data

In [None]:
df = download_data()
train_df, val_df, test_df = train_val_test(df)
df

| | label | text |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | 1 | This is the 2nd time we have tried 2 contact u... |
| 5568 | 0 | Will ü b going to esplanade fr home? |
| 5569 | 0 | Pity, * was in mood for that. So...any other s... |
| 5570 | 0 | The guy did some bitching but I acted like i'd... |
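
For the exploration task, a natural first check is the class balance and whether the stratified split preserves it. The sketch below is only an illustration: `toy_df` is a synthetic stand-in with the same `label` and `text` columns and roughly 13% spam, and the seed 42 mirrors `ssid`. On the real data the same calls can be applied to `df`, `train_df`, `val_df`, and `test_df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real DataFrame (assumption: same columns, ~13% spam).
toy_df = pd.DataFrame({
    "label": [1] * 130 + [0] * 870,
    "text": [f"message {i}" for i in range(1000)],
})

# Fraction of each class in the full frame.
balance = toy_df["label"].value_counts(normalize=True)
print(balance)

# Same 70/15/15 stratified scheme as train_val_test.
train_part, temp_part = train_test_split(
    toy_df, test_size=0.3, stratify=toy_df["label"], random_state=42
)
val_part, test_part = train_test_split(
    temp_part, test_size=0.5, stratify=temp_part["label"], random_state=42
)

# The mean of a 0/1 label column is the spam share of that split.
for name, part in [("train", train_part), ("val", val_part), ("test", test_part)]:
    print(f"{name}: n={len(part)}, spam share={part['label'].mean():.3f}")
```

Each split should show a spam share close to that of the full frame, which is what `stratify` is meant to guarantee.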


# Your training code here

In [None]:
<YOUR_CODE>

# Evaluation

In [None]:
y_pred_test = ...  # <Your predictions here for TEST>
y_test = test_df['label']
acc = accuracy_score(y_test, y_pred_test)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred_test, average='binary')
print("\nTest —     acc: {:.3f}, prec: {:.3f}, rec: {:.3f}, f1: {:.3f}".format(
    acc, prec, rec, f1
))
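
The metrics above expect `y_pred_test` to hold hard 0/1 labels in the same order as `test_df['label']`. If the model returns per-class scores instead (sequence classifiers such as the imported `BertForSequenceClassification` typically return logits of shape `(n_samples, 2)`), one common conversion is an argmax over the class axis. The sketch below uses made-up logits, and `logits_to_labels` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def logits_to_labels(logits):
    """Hypothetical helper: map (n_samples, n_classes) scores to class indices."""
    return np.argmax(np.asarray(logits), axis=-1)

# Made-up scores for three messages; column 0 = ham, column 1 = spam.
fake_logits = np.array([
    [2.1, -1.3],   # ham scores higher
    [-0.4, 0.9],   # spam scores higher
    [0.2, 0.1],    # ham scores slightly higher
])

pred = logits_to_labels(fake_logits)
print(pred)  # [0 1 0]
```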