## Assignment 2 - 10 points

- Download the [Spam Or Not Spam](https://www.kaggle.com/datasets/ozlerhakan/spam-or-not-spam-dataset) dataset
- Try and compare different vectorization methods: **3 points**
  - `sklearn.feature_extraction.text.CountVectorizer`
  - `sklearn.feature_extraction.text.TfidfVectorizer`
- Train models on the resulting vectors, using cross-validation and hyperparameter tuning: **3 points**
  - `sklearn.tree.DecisionTreeClassifier`
  - `sklearn.linear_model.LogisticRegression`
  - Naive Bayes
- Compare the quality of the trained models on a held-out set - **1 point**

- The solution is reproducible: `random_state` is fixed and the notebook runs from start to finish without errors - **2 points**

- Code style follows pep8 and [On writing clean Jupyter notebooks](https://ploomber.io/blog/clean-nbs/) - **1 point**

To submit the homework, attach a link to the PR (Pull Request) from the hw_1 branch into the main branch of your private repository on github.com



## Setup
```
.
├── data
├── notebooks
│   └── hw_2.ipynb
├── src
│   └── utils
│       ├── __init__.py
│       └── stop_words.py
└── setup.py
```

The notebook must be located in the `notebooks` directory; the `data` directory is for the data; `src/utils/*` and `setup.py` must be present so that the custom stop words can be loaded.


## Libraries and constants

Install the module from `setup.py`

In [2]:
! pip install --editable ../

Import the libraries and define constants and helper functions

In [45]:
import sys
sys.path.append('../')
from utils.stop_words import SPAM_STOP_WORDS

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.metrics import classification_report
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline

SEED = 566
# data dir
DATA_DIR = "../data"

DATA = "../data/spam_or_not_spam.csv"


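def classification_report_pd(y_test, y_pred):
    """Return sklearn's classification report as a DataFrame."""
    report = pd.DataFrame(
        classification_report(y_true=y_test, y_pred=y_pred, output_dict=True)
    ).transpose()
    report.support = report.support.astype(int)
    # The 'accuracy' row holds a single number repeated in every column:
    # keep it under f1-score only and show the total sample count as support.
    report.loc['accuracy', 'support'] = report.loc['macro avg', 'support']
    report.loc['accuracy', 'precision'] = np.nan
    report.loc['accuracy', 'recall'] = np.nan
    return report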


## Preliminary data processing

### Download
(Don't forget to uncomment)

In [2]:
!kaggle datasets download -d ozlerhakan/spam-or-not-spam-dataset
!mv ./spam-or-not-spam-dataset.zip ../data/
!unzip ../data/spam-or-not-spam-dataset.zip
!mv ./spam_or_not_spam.csv ../data/

### Exploring the data

In [3]:
data = pd.read_csv(DATA)
data.head()

Unnamed: 0,email,label
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0
1,martin a posted tassos papadopoulos the greek ...,0
2,man threatens explosion in moscow thursday aug...,0
3,klez the virus that won t die already the most...,0
4,in adding cream to spaghetti carbonara which ...,0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   email   2999 non-null   object
 1   label   3000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 47.0+ KB


One value in the `email` column is null; I will replace it with an empty string.

In [5]:
data['email'] = data['email'].fillna('')

In [6]:
data.label.value_counts()

label
0    2500
1     500
Name: count, dtype: int64

About 17% of the emails (500 of 3000) are spam: imbalanced, but still a reasonably good dataset.
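
As a quick sanity check on the class balance, the sketch below recomputes the spam share and the accuracy of a trivial classifier that always predicts "not spam", using only the counts printed above. The variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 2500, 1: 500}

total = sum(counts.values())
spam_share = counts[1] / total
# A model that always predicts the majority class (0) is right
# on every non-spam email and wrong on every spam email.
majority_baseline_accuracy = max(counts.values()) / total

print(f"spam share: {spam_share:.3f}")
print(f"majority-class accuracy: {majority_baseline_accuracy:.3f}")
```

Any accuracy reported for a trained model is best read against this majority-class baseline rather than against zero.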

Moreover, judging by the text, it appears to be already cleaned (no punctuation, capital letters, or grammar). Kaggle also states: `all the numbers and URLs were converted to strings as NUMBER and URL respectively`. That is good: the vectorizer will not drop these tokens, and they may be significant for spam detection.
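
To illustrate why the `NUMBER` and `URL` placeholders survive vectorization, here is a minimal stdlib sketch that mimics the default word analyzer of `CountVectorizer` and `TfidfVectorizer`: lowercase the text, then keep tokens of two or more word characters. The `tokenize` helper and the sample string are made up for illustration.

```python
import re

# Default token pattern of the sklearn text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and extract tokens like the default analyzer."""
    return TOKEN_PATTERN.findall(text.lower())


# Made-up sample in the style of the dataset, not an actual row.
sample = "klez the virus that won t die visit URL within NUMBER days"
tokens = tokenize(sample)
print(tokens)
```

The placeholders come out as the ordinary tokens `url` and `number`, while single-character fragments such as `t` are dropped.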


So I will try working without stop-word removal: minimal preprocessing has already been done, and removing stop words seems likely to only hurt.

### Train-Test split

In [54]:
X, y = data.email, data.label

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=SEED,
                                                    stratify=y)

display(y_train.value_counts())
display(y_test.value_counts())

label
0    2000
1     400
Name: count, dtype: int64

label
0    500
1    100
Name: count, dtype: int64

##  Vectorization


### CountVectorizer
I decided to try customizing the stop-word list and removed the words that, in my opinion, may be related to spam: all, already, become(s), call, interest, seem(s), top, you(r/self/...)

The updated list is in `utils/stop_words.py` (imported at the beginning)

In [8]:
vectorizer_count = CountVectorizer(max_df=0.8,
                                   min_df=0.001,
                                   stop_words=list(SPAM_STOP_WORDS))
X_train_vectorized_count = vectorizer_count.fit_transform(X_train)
X_test_vectorized_count = vectorizer_count.transform(X_test)

### Tf-idf vectorizer

In [9]:
vectorizer_tfidf = TfidfVectorizer(max_df=0.8,
                                   min_df=0.001,
                                   stop_words=list(SPAM_STOP_WORDS))
X_train_vectorized_tfidf = vectorizer_tfidf.fit_transform(X_train)
X_test_vectorized_tfidf = vectorizer_tfidf.transform(X_test)

## Inspection and analysis

In [10]:
count_df = pd.DataFrame(X_train_vectorized_count.toarray(), columns=vectorizer_count.get_feature_names_out())
count_df = count_df.stack().reset_index().rename(
    columns={0: 'count', 'level_0': 'document', 'level_1': 'term', 'level_2': 'term'})
count_df = count_df.sort_values(by=['document', 'count'], ascending=[True, False]).groupby(['document']).head()
count_df.sort_values(by='count', ascending=False).head(100)

Unnamed: 0,document,term,count
20861998,2069,states,119
21488709,2132,alb,119
23171551,2298,you,109
2449127,242,you,105
9312926,923,you,105
...,...,...,...
5898568,585,des,33
3720566,369,checking,33
19459955,1930,report,33
19462482,1930,your,31


This does not seem very informative; let's look at `tf_idf`

In [11]:
tfidf_df = pd.DataFrame(X_train_vectorized_tfidf.toarray(), columns=vectorizer_tfidf.get_feature_names_out())
tfidf_df = tfidf_df.stack().reset_index().rename(
    columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term', 'level_2': 'term'})
tfidf_df = tfidf_df.sort_values(by=['document', 'tfidf'], ascending=[True, False]).groupby(['document']).head()
tfidf_df.sort_values(by='tfidf', ascending=False).head(100)

Unnamed: 0,document,term,tfidf
8390011,832,hyperlink,1.000000
4892598,485,hyperlink,1.000000
5759392,571,hyperlink,0.997259
9503282,942,tab,0.966686
21488709,2132,alb,0.963996
...,...,...,...
7464232,740,mt,0.708583
10623006,1053,wheel,0.707915
20019513,1986,domain,0.707903
22774368,2259,news,0.707729
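
The scores of exactly 1.0 in this table have a simple explanation. With the default settings, `TfidfVectorizer` uses a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes every row, so a document whose only retained term is `hyperlink` always gets a weight of 1 for it, however often the term occurs. The sketch below reimplements that formula in plain Python on toy documents; `tfidf_rows` is an illustrative helper, not part of the notebook.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Tf-idf with smoothed idf and L2-normalized rows (sklearn defaults)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        weights = {t: c * idf[t] for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# Toy tokenized documents (illustrative, not taken from the dataset).
docs = [
    ["hyperlink", "hyperlink", "hyperlink"],
    ["free", "offer"],
    ["meeting", "notes", "offer"],
]
rows = tfidf_rows(docs)
print(rows[0])
print(rows[1])
```

In the second toy document the rarer term `free` receives a larger weight than `offer`, which appears in two documents.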


Interesting! There are many words that result from poor HTML parsing: `hyperlink`, `bm`, `numbertnumber` (most likely this was `number ? T : number`), `tab`, `alb`, `supplied`, `img`, `msgs`, `xml`. In addition, `you`, which was removed from the stop-word list, is better put back:

In [12]:
extra_stops = {"hyperlink", "bm", "numbertnumber", "tab", "alb", "supplied", "img", "you", "xml", "msgs"}

FINAL_STOP_WORDS = list(SPAM_STOP_WORDS | extra_stops)
len(FINAL_STOP_WORDS)

309

## Model training

We will create 8 pipelines and 8 parameter grids and search over them with `HalvingGridSearchCV`

In [13]:
## Grids

grid_tf = {"tfidf__max_df": np.linspace(0.3, 0.8, 6),
           "tfidf__min_df": [0.0, 0.002, 0.005],
           "tfidf__ngram_range": ((1, 1), (1, 2)),
           "tfidf__stop_words": [FINAL_STOP_WORDS, ],  # list of 1 element
           }
grid_count = {"counter__max_df": np.linspace(0.3, 0.8, 6),
              "counter__min_df": [0.0, 0.002, 0.005],
              "counter__ngram_range": ((1, 1), (1, 2)),
              "counter__stop_words": [FINAL_STOP_WORDS, ],  # list of 1 element
              }
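grid_lr = {"lr__C": np.linspace(0.1, 1, 10),
           "lr__penalty": ("l1", "l2"),
           # the default lbfgs solver does not support l1; liblinear handles both
           "lr__solver": ["liblinear", ],
           "lr__random_state": [SEED, ]}
grid_dt = {"dt__criterion": ("gini", "entropy"),
           "dt__max_depth": [5, 10, 15],
           # min_samples_split must be at least 2 when given as an integer
           "dt__min_samples_split": [2, 5, 25],
           "dt__random_state": [SEED, ]}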
grid_mnb = {"mnb__alpha": [0.01, 0.1, 1.0], }
grid_cnb = {"cnb__alpha": [0.01, 0.1, 1.0],
            "cnb__norm": [True, False], }

## LR models
pipe_count_lr = Pipeline(
    steps=[
        ('counter', CountVectorizer()),
        ('lr', LogisticRegression())
    ]
)
pipe_tf_lr = Pipeline(
    steps=[
        ('tfidf', TfidfVectorizer()),
        ('lr', LogisticRegression())
    ]
)
grid_count_lr = {**grid_count, **grid_lr}
grid_tf_lr = {**grid_tf, **grid_lr}

## DT models
pipe_count_dt = Pipeline(
    steps=[
        ('counter', CountVectorizer()),
        ('dt', DecisionTreeClassifier())
    ]
)
pipe_tf_dt = Pipeline(
    steps=[
        ('tfidf', TfidfVectorizer()),
        ('dt', DecisionTreeClassifier())
    ]
)
grid_count_dt = {**grid_count, **grid_dt}
grid_tf_dt = {**grid_tf, **grid_dt}

## NB multinomial
pipe_count_mnb = Pipeline(
    steps=[
        ('counter', CountVectorizer()),
        ('mnb', MultinomialNB())
    ]
)
pipe_tf_mnb = Pipeline(
    steps=[
        ('tfidf', TfidfVectorizer()),
        ('mnb', MultinomialNB())
    ]
)
grid_count_mnb = {**grid_count, **grid_mnb}
grid_tf_mnb = {**grid_tf, **grid_mnb}
## NB complement
pipe_count_cnb = Pipeline(
    steps=[
        ('counter', CountVectorizer()),
        ('cnb', ComplementNB())
    ]
)
pipe_tf_cnb = Pipeline(
    steps=[
        ('tfidf', TfidfVectorizer()),
        ('cnb', ComplementNB())
    ]
)
grid_count_cnb = {**grid_count, **grid_cnb}
grid_tf_cnb = {**grid_tf, **grid_cnb}

pipes = {"count_lr": (pipe_count_lr, grid_count_lr),
         "tf_lr": (pipe_tf_lr, grid_tf_lr),
         "count_dt": (pipe_count_dt, grid_count_dt),
         "tf_dt": (pipe_tf_dt, grid_tf_dt),
         "count_mnb": (pipe_count_mnb, grid_count_mnb),
         "tf_mnb": (pipe_tf_mnb, grid_tf_mnb),
         "count_cnb": (pipe_count_cnb, grid_count_cnb),
         "tf_cnb": (pipe_tf_cnb, grid_tf_cnb), }


In [53]:
grid_searches = dict()
for pipe_name in pipes:
    pipe, parameter_grid = pipes[pipe_name]

    grid_search = HalvingGridSearchCV(
        pipe,
        param_grid=parameter_grid,
        n_jobs=-1,
        verbose=1,
        cv=5,
        scoring='accuracy',
        random_state=SEED,
    )
    grid_search.fit(X_train, y_train)
    grid_searches[pipe_name] = grid_search
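
To get a feel for how much work each search does, the sketch below counts the candidates in every combined grid by multiplying the number of values per hyperparameter. Successive halving starts with all of these candidates on a small subset of the samples and, with the default `factor=3`, keeps roughly the best third at each iteration. The dictionary names here are illustrative only.

```python
from math import prod

# Number of values per multi-valued hyperparameter, as defined in the grids.
# Single-valued entries (stop_words, random_state) do not change the product.
vectorizer_sizes = [6, 3, 2]  # max_df, min_df, ngram_range
model_sizes = {
    "lr": [10, 2],     # C, penalty
    "dt": [2, 3, 3],   # criterion, max_depth, min_samples_split
    "mnb": [3],        # alpha
    "cnb": [3, 2],     # alpha, norm
}

n_vectorizer = prod(vectorizer_sizes)
grid_sizes = {name: n_vectorizer * prod(sizes)
              for name, sizes in model_sizes.items()}

for name, n_candidates in grid_sizes.items():
    print(f"{name}: {n_candidates} candidates per vectorizer")
```

The counts are the same for the `CountVectorizer` and `TfidfVectorizer` variants of each model, since both vectorizer grids have the same shape.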

In [16]:
estimators = dict()
for grid_search_name in grid_searches:
    estimators[grid_search_name] = grid_searches[grid_search_name].best_estimator_

In [47]:
for estimator_name in estimators:
    print(estimator_name)
    y_pred = estimators[estimator_name].predict(X_test)
    display(classification_report_pd(y_test, y_pred))
    print('=============================================')

count_lr


Unnamed: 0,precision,recall,f1-score,support
0,0.982283,0.998,0.990079,500
1,0.98913,0.91,0.947917,100
accuracy,,,0.983333,600
macro avg,0.985707,0.954,0.968998,600
weighted avg,0.983425,0.983333,0.983052,600


tf_lr


Unnamed: 0,precision,recall,f1-score,support
0,0.94162,1.0,0.969932,500
1,1.0,0.69,0.816568,100
accuracy,,,0.948333,600
macro avg,0.97081,0.845,0.89325,600
weighted avg,0.95135,0.948333,0.944371,600


count_dt


Unnamed: 0,precision,recall,f1-score,support
0,0.966337,0.976,0.971144,500
1,0.873684,0.83,0.851282,100
accuracy,,,0.951667,600
macro avg,0.92001,0.903,0.911213,600
weighted avg,0.950895,0.951667,0.951167,600


tf_dt


Unnamed: 0,precision,recall,f1-score,support
0,0.955253,0.982,0.968442,500
1,0.895349,0.77,0.827957,100
accuracy,,,0.946667,600
macro avg,0.925301,0.876,0.898199,600
weighted avg,0.945269,0.946667,0.945028,600


count_mnb


Unnamed: 0,precision,recall,f1-score,support
0,0.986193,1.0,0.993049,500
1,1.0,0.93,0.963731,100
accuracy,,,0.988333,600
macro avg,0.993097,0.965,0.97839,600
weighted avg,0.988494,0.988333,0.988162,600


tf_mnb


Unnamed: 0,precision,recall,f1-score,support
0,0.968992,1.0,0.984252,500
1,1.0,0.84,0.913043,100
accuracy,,,0.973333,600
macro avg,0.984496,0.92,0.948648,600
weighted avg,0.97416,0.973333,0.972384,600


count_cnb


Unnamed: 0,precision,recall,f1-score,support
0,0.986193,1.0,0.993049,500
1,1.0,0.93,0.963731,100
accuracy,,,0.988333,600
macro avg,0.993097,0.965,0.97839,600
weighted avg,0.988494,0.988333,0.988162,600


tf_cnb


Unnamed: 0,precision,recall,f1-score,support
0,0.988095,0.996,0.992032,500
1,0.979167,0.94,0.959184,100
accuracy,,,0.986667,600
macro avg,0.983631,0.968,0.975608,600
weighted avg,0.986607,0.986667,0.986557,600




For the comparison I will rely on `accuracy`, as the standard metric in this case.

Surprisingly, `CountVectorizer` works better than `Tf-Idf` for every model (honestly, I expected the opposite).

The best results came from two models at once, `ComplementNB` and `MultinomialNB` with `CountVectorizer`: both reach `accuracy=0.9883`. The other models are not much worse, though.
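
Because the classes are imbalanced, it helps to see what stands behind a single accuracy number. The `count_mnb` report above (recall 1.0 on 500 non-spam emails, recall 0.93 on 100 spam emails) implies the confusion-matrix counts used in this sketch, from which the other reported metrics can be recomputed. The variable names are illustrative.

```python
# Confusion-matrix counts implied by the count_mnb report:
# all 500 non-spam emails are classified correctly, 7 of 100 spam emails are missed.
tn, fp = 500, 0
fn, tp = 7, 93

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_spam = tp / (tp + fp)
recall_spam = tp / (tp + fn)
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)
precision_ham = tn / (tn + fn)
recall_ham = tn / (tn + fp)
macro_recall = (recall_ham + recall_spam) / 2

print(f"accuracy={accuracy:.6f}")
print(f"spam: precision={precision_spam:.6f} recall={recall_spam:.6f} f1={f1_spam:.6f}")
print(f"not spam: precision={precision_ham:.6f} recall={recall_ham:.6f}")
print(f"macro recall={macro_recall:.6f}")
```

In other words, this model flags no legitimate email as spam and lets 7 spam emails through, which accuracy alone does not show.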

Parameters of the best models:


In [51]:
estimators['count_mnb']

In [52]:
estimators['count_cnb']

Best parameters for both:

- CountVectorizer: `min_df=0.0`,`max_df=0.6`, `ngram_range=(1,2)`
- NB: `alpha=0.1`