### Project 1 [Spam-Filter]
Project type: Individual

**Deadline:** 11.12.2025 23:59:00

Max possible score: 80

You are provided with a dataset of SMS messages classified as spam and non-spam, ```spam.csv```.

**Goal:** Build a model that will classify a message as spam/non-spam using the Naive Bayes algorithm.

Please review the ```dataset.py``` and ```model.py``` modules.

Message classification accuracy should be greater than 95% for both the validation and test datasets.
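
For reference, the decision rule that a multinomial Naive Bayes classifier with additive (Laplace) smoothing applies can be written as below. The notation is chosen to mirror the attributes used in ```model.py``` ($\alpha$ for `alpha`, $N_{\text{voc}}$ for `Nvoc`, $N_c$ for `Nspam` / `Nham`); treat it as a summary sketch rather than a required formulation.

$$
\hat{c} = \arg\max_{c \in \{\text{spam},\, \text{ham}\}} \left[ \log P(c) + \sum_{i} \log P(w_i \mid c) \right],
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + \alpha}{N_c + \alpha \, N_{\text{voc}}}
$$

Here $w_i$ are the words of the message, $\mathrm{count}(w, c)$ is the number of times $w$ occurs in training messages of class $c$, and $N_c$ is the total number of word occurrences in class $c$. Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long messages.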

In [12]:
1 + 1
# your code for testing

2

In [None]:
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [21]:
import pandas as pd
df = pd.read_csv('spam.csv', encoding='latin-1')
print("First 10 rows:")
print(df.head(10))
print("\nColumn names:", df.columns.tolist())
print("Total rows:", len(df))

First 10 rows:
     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   
5  spam  FreeMsg Hey there darling it's been 3 week's n...        NaN   
6   ham  Even my brother is not like to speak with me. ...        NaN   
7   ham  As per your request 'Melle Melle (Oru Minnamin...        NaN   
8  spam  WINNER!! As a valued network customer you have...        NaN   
9  spam  Had your mobile 11 months or more? U R entitle...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  
5        NaN        NaN 

In [22]:
%%writefile dataset.py
import numpy as np
import re

class Dataset:
    def __init__(self, X, y):
        self._x = X
        self._y = y
        self.train = None
        self.val = None
        self.test = None
        self.label2num = {}
        self.num2label = {}
        self._transform()

    def __len__(self):
        return len(self._x)

    def _transform(self):
        '''
        Clean the message text and convert labels to numbers.
        '''
        # Clean the message text
        cleaned_messages = []
        for message in self._x:
            # Convert to lowercase
            message = str(message).lower()
            # Remove every character except letters and whitespace
            message = re.sub(r'[^a-z\s]', '', message)
            # Collapse repeated whitespace
            message = re.sub(r'\s+', ' ', message).strip()
            cleaned_messages.append(message)

        self._x = np.array(cleaned_messages)

        # Map labels to numbers
        self.label2num = {'ham': 0, 'spam': 1}
        self.num2label = {0: 'ham', 1: 'spam'}

        # Convert the label array to its numeric form
        self._y = np.array([self.label2num[label] for label in self._y])

    def split_dataset(self, val=0.1, test=0.1):
        '''
        Split the dataset into train, validation, and test sets.
        '''
        n = len(self._x)
        indices = np.random.permutation(n)

        test_size = int(n * test)
        val_size = int(n * val)
        train_size = n - test_size - val_size

        train_indices = indices[:train_size]
        val_indices = indices[train_size:train_size + val_size]
        test_indices = indices[train_size + val_size:]

        self.train = (self._x[train_indices], self._y[train_indices])
        self.val = (self._x[val_indices], self._y[val_indices])
        self.test = (self._x[test_indices], self._y[test_indices])


Overwriting dataset.py
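
To make the two steps of ```dataset.py``` easier to check by hand, here is a minimal standalone sketch of the cleaning rule and of the split-size arithmetic. The helper names `clean_message` and `split_sizes` are hypothetical (they are not part of the `Dataset` class), and the sample message is made up; the row count 5572 and the 0.15 fractions are the values used further down in this notebook.

```python
import re

def clean_message(message):
    """Lowercase, keep only a-z and whitespace, then collapse whitespace runs."""
    message = str(message).lower()
    message = re.sub(r'[^a-z\s]', '', message)
    return re.sub(r'\s+', ' ', message).strip()

def split_sizes(n, val=0.1, test=0.1):
    """Return (train, val, test) sizes using the same int() truncation as split_dataset."""
    test_size = int(n * test)
    val_size = int(n * val)
    return n - test_size - val_size, val_size, test_size

# Digits and punctuation are dropped, so "0800" and "!!" disappear entirely.
cleaned = clean_message("WINNER!! As a valued network customer, call 0800 now")
sizes = split_sizes(5572, val=0.15, test=0.15)
print(cleaned)
print(sizes)
```

Because the fractions are truncated with `int()`, any rounding remainder ends up in the training set.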


In [10]:
%%writefile model.py
import numpy as np

class Model:
    def __init__(self, alpha=1):
        self.vocab = set()
        self.spam = {}
        self.ham = {}
        self.alpha = alpha
        self.label2num = None
        self.num2label = None
        self.Nvoc = None
        self.Nspam = None
        self.Nham = None
        self._train_X, self._train_y = None, None
        self._val_X, self._val_y = None, None
        self._test_X, self._test_y = None, None
        self.pspam = None
        self.pham = None

    def fit(self, dataset):
        '''
        Train the model on the data from dataset.
        '''
        self._train_X, self._train_y = dataset.train
        self._val_X, self._val_y = dataset.val
        self._test_X, self._test_y = dataset.test
        self.label2num = dataset.label2num
        self.num2label = dataset.num2label

        # Compute the prior probabilities
        spam_count = np.sum(self._train_y == self.label2num['spam'])
        ham_count = np.sum(self._train_y == self.label2num['ham'])
        total = len(self._train_y)

        self.pspam = spam_count / total
        self.pham = ham_count / total

        # Separate the messages by class
        spam_messages = self._train_X[self._train_y == self.label2num['spam']]
        ham_messages = self._train_X[self._train_y == self.label2num['ham']]

        # Count word frequencies in spam
        for message in spam_messages:
            words = message.split()
            for word in words:
                self.vocab.add(word)
                self.spam[word] = self.spam.get(word, 0) + 1

        # Count word frequencies in ham
        for message in ham_messages:
            words = message.split()
            for word in words:
                self.vocab.add(word)
                self.ham[word] = self.ham.get(word, 0) + 1

        self.Nvoc = len(self.vocab)
        self.Nspam = sum(self.spam.values())
        self.Nham = sum(self.ham.values())

    def inference(self, message):
        '''
        Predict the label for a single message.
        '''
        words = message.split()

        log_pspam = np.log(self.pspam)
        log_pham = np.log(self.pham)

        for word in words:
            spam_word_count = self.spam.get(word, 0)
            p_word_spam = (spam_word_count + self.alpha) / (self.Nspam + self.alpha * self.Nvoc)

            ham_word_count = self.ham.get(word, 0)
            p_word_ham = (ham_word_count + self.alpha) / (self.Nham + self.alpha * self.Nvoc)

            log_pspam += np.log(p_word_spam)
            log_pham += np.log(p_word_ham)

        # Compare the log-posteriors directly; a tie falls through to "ham"
        if log_pspam > log_pham:
            return "spam"
        return "ham"

    def validation(self):
        '''
        Compute accuracy on the validation data.
        '''
        correct = 0
        total = len(self._val_y)

        for i in range(total):
            message = self._val_X[i]
            true_label_num = self._val_y[i]
            true_label = self.num2label[true_label_num]

            predicted_label = self.inference(message)

            if predicted_label == true_label:
                correct += 1

        val_acc = correct / total
        return val_acc

    def test(self):
        '''
        Compute accuracy on the test data.
        '''
        correct = 0
        total = len(self._test_y)

        for i in range(total):
            message = self._test_X[i]
            true_label_num = self._test_y[i]
            true_label = self.num2label[true_label_num]

            predicted_label = self.inference(message)

            if predicted_label == true_label:
                correct += 1

        test_acc = correct / total
        return test_acc

Overwriting model.py
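
The scoring done in `Model.inference` can be traced on a toy example small enough to verify by hand. The sketch below is a simplified standalone version, not the `Model` class itself: the word counts, the priors, and the helper names `log_score` and `classify` are all made up for illustration.

```python
import math

# Hypothetical toy counts (not taken from spam.csv)
spam_counts = {"win": 3, "prize": 2, "call": 1}
ham_counts = {"lunch": 2, "call": 2, "meeting": 2}
vocab = set(spam_counts) | set(ham_counts)   # 5 distinct words
alpha = 1.0                                  # smoothing strength
p_spam, p_ham = 0.2, 0.8                     # assumed class priors

def log_score(words, counts, prior):
    """Log prior plus the sum of smoothed log word likelihoods for one class."""
    total = sum(counts.values())
    score = math.log(prior)
    for word in words:
        score += math.log((counts.get(word, 0) + alpha) / (total + alpha * len(vocab)))
    return score

def classify(message):
    """Return 'spam' only if the spam score is strictly higher, else 'ham'."""
    words = message.split()
    spam_score = log_score(words, spam_counts, p_spam)
    ham_score = log_score(words, ham_counts, p_ham)
    return "spam" if spam_score > ham_score else "ham"

print(classify("win prize"))
print(classify("lunch meeting"))
```

With these counts both classes have 6 word occurrences, so every likelihood has the denominator 6 + 1 * 5 = 11. For "win prize" the spam side is 0.2 * (4/11) * (3/11) against 0.8 * (1/11) * (1/11) for ham, so spam wins; a word unseen in a class still gets a nonzero probability of 1/11 thanks to smoothing.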


In [24]:
import pandas as pd
import numpy as np
from dataset import Dataset
from model import Model

np.random.seed(42)

df = pd.read_csv('spam.csv', encoding='latin-1')
df = df[['v1', 'v2']]
df.columns = ['label', 'message']

print(f"Total messages: {len(df)}")
print(f"Spam: {sum(df['label'] == 'spam')}, Ham: {sum(df['label'] == 'ham')}")

X = df['message'].values
y = df['label'].values

dataset = Dataset(X, y)
dataset.split_dataset(val=0.15, test=0.15)

print(f"\nTrain: {len(dataset.train[0])}")
print(f"Validation: {len(dataset.val[0])}")
print(f"Test: {len(dataset.test[0])}")

model = Model(alpha=1.5)
model.fit(dataset)

print(f"\nVocabulary size: {model.Nvoc}")

val_accuracy = model.validation()
test_accuracy = model.test()

print(f"\nValidation Accuracy: {val_accuracy*100:.2f}%")
print(f"Test Accuracy: {test_accuracy*100:.2f}%")

test_messages = [
    "WINNER! You have won a prize. Call now!",
    "Hey, are we still meeting for lunch tomorrow?",
    "Urgent! Your account will be closed. Click here!",
    "Thanks for the meeting today."
]

print("\nPrediction examples:")
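for msg in test_messages:
    # Only lowercasing is applied here; punctuation is not stripped as in Dataset._transform,
    # so tokens such as "winner!" are scored as unseen words.
    prediction = model.inference(msg.lower())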
    print(f"{msg} - {prediction}")

Total messages: 5572
Spam: 747, Ham: 4825

Train: 3902
Validation: 835
Test: 835

Vocabulary size: 6940

Validation Accuracy: 97.01%
Test Accuracy: 96.17%

Prediction examples:
WINNER! You have won a prize. Call now! - spam
Hey, are we still meeting for lunch tomorrow? - ham
Urgent! Your account will be closed. Click here! - spam
Thanks for the meeting today. - ham
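
Since the classes are imbalanced (747 spam against 4825 ham), it can help to put the accuracy figures next to a majority-class baseline and to look at errors per class. The sketch below is one possible way to do that; `confusion_counts` is a hypothetical helper that is not part of ```model.py```, and the label lists are toy data rather than real predictions.

```python
def confusion_counts(true_labels, predicted_labels, positive="spam"):
    """Count true/false positives and negatives, treating `positive` as the target class."""
    tp = fp = fn = tn = 0
    for true, pred in zip(true_labels, predicted_labels):
        if pred == positive and true == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif true == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Accuracy of a classifier that always answers "ham", from the class counts printed above
baseline = 4825 / (4825 + 747)
print(f"Always-ham baseline accuracy: {baseline:.4f}")

# Toy labels to show the helper; replace with real labels and model predictions
true_toy = ["spam", "ham", "spam", "ham"]
pred_toy = ["spam", "spam", "ham", "ham"]
tp, fp, fn, tn = confusion_counts(true_toy, pred_toy)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, precision, recall)
```

In practice the predictions would come from calling `model.inference` on each validation or test message, and the zero-division case (no predicted or no actual spam) would need a guard.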


In [29]:
import re

# Helper: clean a raw message (lowercase, letters and whitespace only) and classify it
def spam_check(msg):
    clean = re.sub(r'[^a-z\s]', '', msg.lower()).strip()
    return model.inference(clean)

# Interactive loop: type 'exit' to quit
while True:
    txt = input("\n> ")
    if txt == 'exit': break
    print(spam_check(txt))


> nice
ham

> exit
