# SMS Spam Classification — Data Preparation

**Purpose:** prepare the dataset for the assignment: load, explore, preprocess, split and save train/validation/test CSVs.

**How to use:** run the cells in order. This notebook will:
- load the dataset file set in `DATA_PATH` (`SMSSpamCollection`),
- show basic exploratory data analysis (EDA),
- provide preprocessing functions,
- create stratified splits and save `train.csv`, `validation.csv`, `test.csv`.




In [1]:
# Imports & constants
import os
import re
import string
from collections import Counter

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text as sklearn_text

# Set path to dataset file. 
DATA_PATH = r'sms+spam+collection/SMSSpamCollection'  
OUT_DIR = '.'
RANDOM_STATE = 42

print("DATA_PATH:", DATA_PATH)
print("OUT_DIR:", OUT_DIR)


DATA_PATH: sms+spam+collection/SMSSpamCollection
OUT_DIR: .


In [2]:
def load_data(filepath):
    """
    Load the SMS Spam Collection (tab-separated, no header row) into a DataFrame
    with columns ['label', 'message'].
    quoting=3 (csv.QUOTE_NONE) keeps quote characters as literal text; rows with a
    missing message are dropped.
    """
    df = pd.read_csv(filepath, sep='\t', header=None, names=['label','message'], quoting=3, engine='python')
    df = df.dropna(subset=['message']).reset_index(drop=True)
    df['label'] = df['label'].astype(str).str.strip()
    return df

# Load
df = load_data(DATA_PATH)
print("Rows loaded:", len(df))
display(df.head(8))

# Class distribution
print("\nClass distribution:")
print(df['label'].value_counts())


Rows loaded: 5574


| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |



Class distribution:
label
ham     4827
spam     747
Name: count, dtype: int64


## Exploratory Data Analysis (EDA)

In this section we will:
- Inspect class balance
- Look at message length statistics
- Find frequent words overall and per class

These checks inform the preprocessing steps and the choice of evaluation metrics.
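
To see why class balance matters for metric choice, here is a minimal sketch using the class counts printed above (4827 ham, 747 spam). The `majority_baseline_accuracy` helper is a hypothetical name introduced only for this illustration; it computes the accuracy of a classifier that always predicts the most frequent class.

```python
# Illustrative sketch: accuracy of an "always predict the majority class" baseline.
# The counts are taken from the class distribution printed above.
class_counts = {"ham": 4827, "spam": 747}

def majority_baseline_accuracy(counts):
    """Return the accuracy obtained by always predicting the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total

baseline = majority_baseline_accuracy(class_counts)
spam_share = class_counts["spam"] / sum(class_counts.values())
print(f"Spam share: {spam_share:.3f}")
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

Because a model that never flags spam already reaches roughly 87% accuracy, metrics such as precision, recall, or F1 on the spam class are likely to be more informative than accuracy alone.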


In [3]:
# Add message length feature
df['msg_length'] = df['message'].astype(str).apply(len)

# Basic stats
stats = df.groupby('label')['msg_length'].agg(['count','mean','median','std']).reset_index()
display(stats)

# Quick sample of spam and ham messages
print("\nSample spam messages:")
display(df[df['label']=='spam'].sample(5, random_state=RANDOM_STATE)[['message','msg_length']])
print("\nSample ham messages:")
display(df[df['label']=='ham'].sample(5, random_state=RANDOM_STATE)[['message','msg_length']])


| | label | count | mean | median | std |
|---|---|---|---|---|---|
| 0 | ham | 4827 | 71.471929 | 52.0 | 58.326643 |
| 1 | spam | 747 | 138.676037 | 149.0 | 28.87125 |



Sample spam messages:


| | message | msg_length |
|---|---|---|
| 1456 | Summers finally here! Fancy a chat or flirt wi... | 159 |
| 1853 | This is the 2nd time we have tried 2 contact u... | 154 |
| 673 | Get ur 1st RINGTONE FREE NOW! Reply to this ms... | 153 |
| 947 | Ur cash-balance is currently 500 pounds - to m... | 137 |
| 2881 | Last Chance! Claim ur £150 worth of discount v... | 152 |



Sample ham messages:


| | message | msg_length |
|---|---|---|
| 688 | Dear,Me at cherthala.in case u r coming cochin... | 169 |
| 2520 | Ok. I only ask abt e movie. U wan ktv oso? | 42 |
| 1079 | Convey my regards to him | 24 |
| 1724 | Hi Jon, Pete here, Ive bin 2 Spain recently & ... | 157 |
| 1974 | I had askd u a question some hours before. Its... | 53 |


In [4]:
# Helper to show top N tokens using CountVectorizer 
def top_n_words(text_series, n=20, ngram_range=(1,1), analyzer='word', stop_words=None):
    vect = CountVectorizer(ngram_range=ngram_range, analyzer=analyzer, stop_words=stop_words)
    X = vect.fit_transform(text_series.astype(str))
    counts = np.asarray(X.sum(axis=0)).ravel()
    vocab = np.array(vect.get_feature_names_out())
    top_idx = np.argsort(counts)[::-1][:n]
    return pd.DataFrame({'token': vocab[top_idx], 'count': counts[top_idx]})

# Overall top words
print("Overall top words (word tokens):")
display(top_n_words(df['message'], n=20))

# Top words in spam
print("Top words in spam:")
display(top_n_words(df[df['label']=='spam']['message'], n=20))

# Top words in ham
print("Top words in ham:")
display(top_n_words(df[df['label']=='ham']['message'], n=20))


Overall top words (word tokens):


| | token | count |
|---|---|---|
| 0 | to | 2253 |
| 1 | you | 2245 |
| 2 | the | 1339 |
| 3 | and | 980 |
| 4 | in | 903 |
| 5 | is | 897 |
| 6 | me | 807 |
| 7 | my | 766 |
| 8 | it | 752 |
| 9 | for | 711 |


Top words in spam:


| | token | count |
|---|---|---|
| 0 | to | 691 |
| 1 | call | 355 |
| 2 | you | 297 |
| 3 | your | 264 |
| 4 | free | 224 |
| 5 | the | 206 |
| 6 | for | 204 |
| 7 | now | 199 |
| 8 | or | 188 |
| 9 | txt | 163 |


Top words in ham:


| | token | count |
|---|---|---|
| 0 | you | 1948 |
| 1 | to | 1562 |
| 2 | the | 1133 |
| 3 | and | 858 |
| 4 | in | 823 |
| 5 | me | 777 |
| 6 | my | 754 |
| 7 | is | 739 |
| 8 | it | 718 |
| 9 | that | 560 |


## Preprocessing functions

This section provides:
- a simple deterministic cleaner (`preprocess_text`): lowercasing, punctuation removal, whitespace normalization
- an optional stopword-removal wrapper to compare effects later.

The original `message` column is kept so that error analysis can be done on the raw text later.
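
As a rough illustration of what these cleaning steps do to a message, the sketch below applies lowercasing, punctuation removal, and whitespace normalization to two short strings. `clean_demo` is a hypothetical stand-in written for this example, not the notebook's `preprocess_text`, and the sample strings are made up.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_demo(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (illustrative only)."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()

samples = ["Free entry!!  Call NOW...", "Dear,Me at home"]
cleaned = [clean_demo(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

Deleting punctuation outright joins words that were separated only by a punctuation mark (`Dear,Me` becomes `dearme`). Replacing punctuation with a space instead is one possible alternative if that matters for the downstream model.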


In [5]:
# Basic preprocessing function
def preprocess_text(text, lower=True, remove_punct=True, remove_digits=False, strip=True):
    """
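    Minimal, deterministic text cleaner:
      - lower: lowercase
      - remove_punct: remove ASCII punctuation characters (string.punctuation)
      - remove_digits: replace runs of digits with a space (optional)
      - strip: strip leading/trailing whitespace
    Runs of whitespace are always collapsed to a single space.
    Returns the cleaned string ('' for non-string input).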
    """
    if not isinstance(text, str):
        return ''
    if lower:
        text = text.lower()
    if remove_digits:
        text = re.sub(r'\d+', ' ', text)
    if remove_punct:
        text = text.translate(str.maketrans('', '', string.punctuation))
    if strip:
        text = text.strip()
    # normalize whitespace
    text = re.sub(r'\s+', ' ', text)
    return text

# Stopword removal helper using sklearn stop words set
STOP_WORDS = sklearn_text.ENGLISH_STOP_WORDS

def remove_stopwords(text, stopwords=STOP_WORDS):
    words = text.split()
    filtered = [w for w in words if w not in stopwords]
    return " ".join(filtered)

# Apply cleaning to new column `message_clean`
df['message_clean'] = df['message'].apply(lambda x: preprocess_text(x, lower=True, remove_punct=True))
# Extra variant for stopword removal 
df['message_clean_nostop'] = df['message_clean'].apply(remove_stopwords)

# Showing few rows
display(df[['label','message','message_clean','message_clean_nostop']].head(6))


| | label | message | message_clean | message_clean_nostop |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | go until jurong point crazy available only in ... | jurong point crazy available bugis n great wor... |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar joking wif u oni | ok lar joking wif u oni |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | free entry in 2 a wkly comp to win fa cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | ham | U dun say so early hor... U c already then say... | u dun say so early hor u c already then say | u dun say early hor u c say |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | nah i dont think he goes to usf he lives aroun... | nah dont think goes usf lives |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | freemsg hey there darling its been 3 weeks now... | freemsg hey darling 3 weeks word id like fun t... |


## Create stratified train / validation / test splits and save CSVs

We use **stratified splits**, so each set keeps the same spam/ham proportion.  
Train : Validation : Test = 70 : 15 : 15.  
Saved files:
- `train.csv`
- `validation.csv`
- `test.csv`
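
A 70 : 15 : 15 split can be produced with two successive two-way splits: first hold out 15% as the test set, then take 0.15 / 0.85 of the remainder as validation. The sketch below works through that arithmetic for the 5574 rows loaded earlier. It assumes each held-out size is rounded up, which, to the best of our reading, is how scikit-learn's `train_test_split` treats a fractional `test_size`; `two_stage_sizes` is a hypothetical helper name used only here.

```python
import math

def two_stage_sizes(n_rows, train_frac=0.7, val_frac=0.15, test_frac=0.15):
    """Return (n_train, n_val, n_test) for a test-first, then validation split.

    Assumption: each held-out set has ceil(fraction * rows) rows, and the
    remaining rows stay in the larger part.
    """
    n_test = math.ceil(test_frac * n_rows)
    n_train_val = n_rows - n_test
    # Validation fraction relative to the remaining train+validation rows.
    val_rel = val_frac / (train_frac + val_frac)
    n_val = math.ceil(val_rel * n_train_val)
    return n_train_val - n_val, n_val, n_test

n_train, n_val, n_test = two_stage_sizes(5574)
print(n_train, n_val, n_test)
```

The point of the rescaling step is that the second split operates on 85% of the data, so passing 0.15 directly would yield a validation set smaller than 15% of the full dataset.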


In [6]:
def split_data(df, train_frac=0.7, val_frac=0.15, test_frac=0.15, random_state=RANDOM_STATE, stratify_col='label'):
    assert abs(train_frac + val_frac + test_frac - 1.0) < 1e-6, "fractions must sum to 1"
    stratify = df[stratify_col] if stratify_col in df.columns else None
    # split off test
    train_val, test = train_test_split(df, test_size=test_frac, random_state=random_state, stratify=stratify)
    val_size_of_trainval = val_frac / (train_frac + val_frac)
    stratify_trainval = train_val[stratify_col] if stratify is not None else None
    train, val = train_test_split(train_val, test_size=val_size_of_trainval, random_state=random_state, stratify=stratify_trainval)
    return train.reset_index(drop=True), val.reset_index(drop=True), test.reset_index(drop=True)

def save_splits(train, val, test, out_dir=OUT_DIR):
    os.makedirs(out_dir, exist_ok=True)
    train.to_csv(os.path.join(out_dir, 'train.csv'), index=False)
    val.to_csv(os.path.join(out_dir, 'validation.csv'), index=False)
    test.to_csv(os.path.join(out_dir, 'test.csv'), index=False)
    return (os.path.join(out_dir, 'train.csv'),
            os.path.join(out_dir, 'validation.csv'),
            os.path.join(out_dir, 'test.csv'))

# Execute split and save (only the raw 'label' and 'message' columns are written)
train_df, val_df, test_df = split_data(df, train_frac=0.7, val_frac=0.15, test_frac=0.15, random_state=RANDOM_STATE)
paths = save_splits(train_df[['label','message']], val_df[['label','message']], test_df[['label','message']], out_dir=OUT_DIR)

print("Saved splits:")
print(" train:", paths[0], "rows:", len(train_df))
print(" val:  ", paths[1], "rows:", len(val_df))
print(" test: ", paths[2], "rows:", len(test_df))

# Verify label counts per split
print("\nLabel counts — train / val / test:")
print(train_df['label'].value_counts().to_string())
print(val_df['label'].value_counts().to_string())
print(test_df['label'].value_counts().to_string())


Saved splits:
 train: .\train.csv rows: 3901
 val:   .\validation.csv rows: 836
 test:  .\test.csv rows: 837

Label counts — train / val / test:
label
ham     3378
spam     523
label
ham     724
spam    112
label
ham     725
spam    112
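
As a quick sanity check on stratification, the sketch below recomputes the spam share of each split from the label counts printed above and compares it with the share in the full dataset (747 of 5574). The 0.001 tolerance is an arbitrary choice for this illustration.

```python
# Label counts copied from the output above: (ham, spam) per split.
split_counts = {
    "train": (3378, 523),
    "validation": (724, 112),
    "test": (725, 112),
}
overall_spam_share = 747 / 5574

spam_shares = {}
for name, (ham, spam) in split_counts.items():
    spam_shares[name] = spam / (ham + spam)
    print(f"{name:<10} spam share: {spam_shares[name]:.4f}")

# Largest deviation of any split from the overall spam share.
max_gap = max(abs(share - overall_spam_share) for share in spam_shares.values())
print(f"overall    spam share: {overall_spam_share:.4f} (max gap {max_gap:.4f})")
assert max_gap < 0.001  # illustrative tolerance
```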
