# Assignment

### Assignment 1 - 30 points

- Load the [Spam Or Not Spam](https://www.kaggle.com/datasets/ozlerhakan/spam-or-not-spam-dataset) dataset
- Try and compare different vectorization methods:
  - `sklearn.feature_extraction.text.CountVectorizer`
  - `sklearn.feature_extraction.text.TfidfVectorizer`
- Train models on the resulting vectors, using cross-validation and hyperparameter tuning:
  - `sklearn.tree.DecisionTreeClassifier`
  - `sklearn.linear_model.LogisticRegression`
  - Naive Bayes
- Compare the quality of the trained models on a held-out set

Before submitting, make sure the solution is reproducible: every `random_state` is fixed and the notebook runs from start to finish without errors.

#### Issue date

09.10.2023

#### Soft deadline

17.10.2023 20:00 MSK

#### Grading criteria

- The Spam Or Not Spam dataset is loaded - **2 points**
- Text preprocessing relevant to the task is implemented - **3 points**
- All vectorization models are fitted - **5 points**
- All required classification models are trained - **5 points**
- The classification models are trained using cross-validation - **5 points**
- Hyperparameters are tuned for all classification models - **5 points**
- The quality of the resulting models is compared - **5 points**

# Importing libraries

In [61]:
import numpy as np
import pandas as pd
import os
import random
import re
import warnings
warnings.filterwarnings("ignore")

import lightgbm

import optuna

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


seed=42

In [2]:
def seed_everything(seed):
    """Seed NumPy and the stdlib RNG; export PYTHONHASHSEED for subprocesses."""
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)


seed_everything(seed)
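
As a quick check of what fixing the seed buys, the sketch below reseeds the same generators twice and confirms that the draws repeat. The helper `draw_sample` is hypothetical and not part of the notebook. Note that setting `PYTHONHASHSEED` from inside a running interpreter does not change the hash seed of that process; it only affects subprocesses started afterwards.

```python
import os
import random

import numpy as np


def seed_everything(seed):
    """Seed NumPy, the stdlib RNG, and export PYTHONHASHSEED for child processes."""
    np.random.seed(seed)
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)


def draw_sample():
    """Hypothetical helper: draw a few numbers from both generators."""
    return np.random.rand(3).tolist(), [random.random() for _ in range(3)]


seed_everything(42)
first = draw_sample()
seed_everything(42)
second = draw_sample()
print(first == second)  # True
```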

# Loading the dataset

In [3]:
data = pd.read_csv('spam_or_not_spam.csv')
data.head()

Unnamed: 0,email,label
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0
1,martin a posted tassos papadopoulos the greek ...,0
2,man threatens explosion in moscow thursday aug...,0
3,klez the virus that won t die already the most...,0
4,in adding cream to spaghetti carbonara which ...,0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   email   2999 non-null   object
 1   label   3000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 47.0+ KB


In [5]:
data[data.email.isnull()]

Unnamed: 0,email,label
2966,,1


In [6]:
data.dropna(inplace=True)
data.label.value_counts()

0    2500
1     499
Name: label, dtype: int64

## Text preprocessing

In [27]:
texts = data.email
texts_preproccessed = texts.str.lower()
texts_preproccessed = texts_preproccessed.apply(lambda x: re.sub(r'[^a-z ]', '', x))
texts_preproccessed = texts_preproccessed.str.replace(r'[ ]+', r' ', regex=True).str.strip()
texts_preproccessed

0       date wed number aug number number number numbe...
1       martin a posted tassos papadopoulos the greek ...
2       man threatens explosion in moscow thursday aug...
3       klez the virus that won t die already the most...
4       in adding cream to spaghetti carbonara which h...
                              ...                        
2995    abc s good morning america ranks it the number...
2996    hyperlink hyperlink hyperlink let mortgage len...
2997    thank you for shopping with us gifts for all o...
2998    the famous ebay marketing e course learn to se...
2999    hello this is chinese traditional number o num...
Name: email, Length: 2999, dtype: object
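
The three cleaning steps above (lowercasing, dropping everything except `a-z` and spaces, collapsing repeated spaces) can be sketched on a single string. The function name `clean_text` and the sample input are illustrative assumptions, not part of the notebook.

```python
import re


def clean_text(text):
    """Lowercase, keep only a-z and spaces, then collapse runs of spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z ]", "", text)      # deletes digits, punctuation, tabs, newlines
    text = re.sub(r" +", " ", text).strip()  # collapse repeated spaces, trim the ends
    return text


print(clean_text("Date: Wed  NUMBER Aug, 2002!"))  # date wed number aug
```

Characters outside `a-z` are deleted rather than replaced with a space, so tokens separated only by a tab or newline are glued together (`"a\tb"` becomes `"ab"`). Replacing them with a space instead would avoid that, if it matters for the data at hand.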

In [43]:
broken_texts = texts_preproccessed[texts_preproccessed.apply(len)<20]
broken_texts

806     url
2806       
2828       
Name: email, dtype: object

In [51]:
texts_filtered = texts_preproccessed[~texts_preproccessed.index.isin(broken_texts.index)]
labels_filtered = data.label[texts_filtered.index]

# Text vectorization and model training

In [63]:
count_vec = CountVectorizer()
tf_idf = TfidfVectorizer()

log_reg = LogisticRegression(random_state=seed)
tree = DecisionTreeClassifier(random_state=seed)
bayes = GaussianNB()


models = [log_reg, tree]
vectorizers = [count_vec, tf_idf]

In [65]:
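optuna_log = optuna.integration.OptunaSearchCV(
    log_reg,
    {"C": optuna.distributions.FloatDistribution(1e-10, 1e10, log=True),
     # OptunaSearchCV expects distribution objects, not plain lists
     "penalty": optuna.distributions.CategoricalDistribution(['l1', 'l2'])},
)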

In [55]:
X_train, X_test, y_train, y_test = train_test_split(
    texts_filtered,
    labels_filtered,
    test_size=0.2,
    random_state=seed,
    stratify=labels_filtered,
)
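
Because spam is the minority class (499 of 2,999 emails before filtering), `stratify` keeps the class ratio the same in both parts of the split. A minimal sketch on synthetic labels; the arrays `X_toy` and `y_toy` are made up for illustration and do not come from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 500 negatives and 100 positives
y_toy = np.array([0] * 500 + [1] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both parts keep the 1:5 positive-to-negative ratio of the full set
print(y_tr.mean(), y_te.mean())
```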

In [56]:
for vectorizer in vectorizers:
    for model in models:
        pipe = Pipeline([
            ('vectorizer', vectorizer),
            ('model', model),
        ])
        pipe.fit(X_train, y_train)
        print(f'Combination {vectorizer}, {model}: \n',
              classification_report(y_test, pipe.predict(X_test)))
        print('\n', '-------------------------------')

Combination CountVectorizer(), LogisticRegression(random_state=42): 
               precision    recall  f1-score   support

           0       0.99      1.00      0.99       500
           1       1.00      0.93      0.96       100

    accuracy                           0.99       600
   macro avg       0.99      0.97      0.98       600
weighted avg       0.99      0.99      0.99       600


 -------------------------------
Combination CountVectorizer(), DecisionTreeClassifier(random_state=42): 
               precision    recall  f1-score   support

           0       0.97      0.98      0.98       500
           1       0.91      0.86      0.88       100

    accuracy                           0.96       600
   macro avg       0.94      0.92      0.93       600
weighted avg       0.96      0.96      0.96       600


 -------------------------------
Combination TfidfVectorizer(), LogisticRegression(random_state=42): 
               precision    recall  f1-score   support

           0

#### Naive Bayes + count_vectorizer

In [57]:
X_train_transformed = count_vec.fit_transform(X_train)
# GaussianNB needs a dense array; .toarray() is the explicit equivalent of .A
bayes.fit(X_train_transformed.toarray(), y_train)
print(classification_report(y_test, bayes.predict(count_vec.transform(X_test).toarray())))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       500
           1       0.95      0.74      0.83       100

    accuracy                           0.95       600
   macro avg       0.95      0.87      0.90       600
weighted avg       0.95      0.95      0.95       600



#### Naive Bayes + tf_idf

In [58]:
X_train_transformed = tf_idf.fit_transform(X_train)
# GaussianNB needs a dense array; .toarray() is the explicit equivalent of .A
bayes.fit(X_train_transformed.toarray(), y_train)
print(classification_report(y_test, bayes.predict(tf_idf.transform(X_test).toarray())))

              precision    recall  f1-score   support

           0       0.94      0.99      0.97       500
           1       0.93      0.71      0.81       100

    accuracy                           0.94       600
   macro avg       0.94      0.85      0.89       600
weighted avg       0.94      0.94      0.94       600
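
`GaussianNB` needs a dense array, which is why the cells above convert the sparse matrix first. For word counts, a common alternative is `MultinomialNB`, which accepts the sparse output of the vectorizer directly. This is a different model from the one used above, shown only as a sketch on a made-up four-document corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (not from the dataset)
docs = [
    "win free money now",
    "free prize click now",
    "meeting agenda for monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

nb_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("model", MultinomialNB()),  # works on sparse count matrices, no dense conversion
])
nb_pipe.fit(docs, labels)

print(nb_pipe.predict(["free money prize", "monday report"]))  # [1 0]
```

Keeping the vectorizer and the model in one `Pipeline` also means the same object can be passed to a cross-validated search without fitting the vocabulary on held-out folds.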



With default hyperparameters for both the vectorizer and the model, the best result among the combinations above came from logistic regression with the simpler CountVectorizer().

#### Gradient boosting

In [35]:
lgb = lightgbm.LGBMClassifier(random_state=seed)

In [36]:
X_train_transformed = tf_idf.fit_transform(X_train)
lgb.fit(X_train_transformed, y_train)
print(classification_report(y_test, lgb.predict(tf_idf.transform(X_test))))

[LightGBM] [Info] Number of positive: 399, number of negative: 2000
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.062105 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 63591
[LightGBM] [Info] Number of data points in the train set: 2399, number of used features: 2135
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.166319 -> initscore=-1.611941
[LightGBM] [Info] Start training from score -1.611941
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       500
           1       0.98      0.95      0.96       100

    accuracy                           0.99       600
   macro avg       0.98      0.97      0.98       600
weighted avg       0.99      0.99      0.99       600



# Hyperparameter tuning

In [70]:
optuna_log = optuna.integration.OptunaSearchCV(
    log_reg,
    {"C": optuna.distributions.FloatDistribution(1e-10, 1e10, log=True),
     "penalty":optuna.distributions.CategoricalDistribution(['l1', 'l2'])},
)

optuna_tree = optuna.integration.OptunaSearchCV(
    tree,
    {},  # search space for the tree is not defined yet
)
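
The search space for the decision tree is left empty above. As a sketch of what a cross-validated search over tree hyperparameters could look like, the example below uses scikit-learn's `GridSearchCV` rather than `OptunaSearchCV`, wraps the vectorizer and the model in one `Pipeline` so the vocabulary is refit inside each fold, and runs on a made-up corpus. The grid values are illustrative assumptions, not tuned choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up corpus: 6 spam-like and 6 ham-like messages
docs = ["free money now", "win a free prize", "cheap loans click here",
        "free offer click now", "win money fast", "claim your prize now",
        "meeting moved to monday", "see attached report", "lunch at noon tomorrow",
        "project status update", "please review my patch", "notes from the meeting"]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("model", DecisionTreeClassifier(random_state=42)),
])

# Hypothetical grid; parameter names follow the "<step>__<param>" convention
param_grid = {
    "model__max_depth": [2, 4, None],
    "model__min_samples_leaf": [1, 2],
    "model__criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(docs, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same parameter names could be mapped onto Optuna distributions (integer ranges for `max_depth` and `min_samples_leaf`, a categorical choice for `criterion`) if the Optuna-based search is preferred.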

In [75]:
optuna_log.fit(count_vec.fit_transform(X_train), y_train)

[I 2023-10-25 17:32:21,336] A new study created in memory with name: no-name-ffcfb07d-64db-4d12-b51d-7f894d832895
[I 2023-10-25 17:32:27,871] Trial 0 finished with value: 0.9749591162143354 and parameters: {'C': 0.01592499032163402, 'penalty': 'l2'}. Best is trial 0 with value: 0.9749591162143354.
[W 2023-10-25 17:32:27,906] Trial 1 failed with parameters: {'C': 0.00020701025948382555, 'penalty': 'l1'} because of the following error: The value nan is not acceptable.
[W 2023-10-25 17:32:27,908] Trial 1 failed with value nan.
[W 2023-10-25 17:32:27,942] Trial 2 failed with parameters: {'C': 0.001043822605696029, 'penalty': 'l1'} because of the following error: The value nan is not acceptable.
[W 2023-10-25 17:32:27,943] Trial 2 failed with value nan.
[W 2023-10-25 17:32:27,990] Trial 3 failed with parameters: {'C': 903.2994939650298, 'penalty': 'l1'} because of the following error: The value nan is not acceptable.
[W 2023-10-25 17:32:27,992] Trial 3 failed with value nan.
[W 2023-10-25 1

OptunaSearchCV(estimator=LogisticRegression(random_state=42), n_jobs=1,
               param_distributions={'C': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None),
                                    'penalty': CategoricalDistribution(choices=('l1', 'l2'))})
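
The failed trials in the log above all use `penalty='l1'`. A likely cause is that the default `lbfgs` solver of `LogisticRegression` does not support L1 regularization: the fit raises an error inside cross-validation, the fold score becomes `nan`, and Optuna rejects the trial. A minimal sketch of the behaviour on synthetic data, assuming a solver that supports L1 (`liblinear` here) is an acceptable substitute:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable data for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

try:
    # The default solver is lbfgs, which rejects the l1 penalty at fit time
    LogisticRegression(penalty="l1").fit(X, y)
    lbfgs_error = None
except ValueError as exc:
    lbfgs_error = str(exc)
print(lbfgs_error)

# liblinear (and saga) support both l1 and l2
l1_model = LogisticRegression(penalty="l1", solver="liblinear", random_state=42)
l1_model.fit(X, y)
print(l1_model.score(X, y))
```

Passing `solver='liblinear'` (or `'saga'`) to the estimator handed to `OptunaSearchCV` should let both penalty choices be evaluated instead of discarding every L1 trial.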

In [77]:
print(classification_report(y_test, optuna_log.predict(count_vec.transform(X_test))))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00       500
           1       1.00      0.97      0.98       100

    accuracy                           0.99       600
   macro avg       1.00      0.98      0.99       600
weighted avg       1.00      0.99      0.99       600



In [None]:
# TODO: optuna_lgb (Optuna search for the LightGBM model) was left unfinished

# BERT

In [14]:
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification, 
    DataCollatorWithPadding,
    TrainingArguments, 
    Trainer,
)
from datasets import ClassLabel, Dataset, Value

In [15]:
model = AutoModelForSequenceClassification.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [17]:
def tokenize(texts):
    return tokenizer(texts["email"], padding="max_length", truncation=True)


In [18]:
bert_dataset = Dataset.from_pandas(data, split='train')
bert_dataset = bert_dataset.class_encode_column("label")

bert_dataset = bert_dataset.map(tokenize, batched=True)
bert_dataset = bert_dataset.train_test_split(test_size=0.3, seed=seed)  # fixed seed for reproducibility
bert_dataset

Stringifying the column:   0%|          | 0/2999 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/2999 [00:00<?, ? examples/s]

Map:   0%|          | 0/2999 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['email', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2099
    })
    test: Dataset({
        features: ['email', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 900
    })
})

In [19]:
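def f1(eval_pred):
    logits, labels = eval_pred
    # The Trainer passes raw logits; take the argmax to get class predictions
    predictions = np.argmax(logits, axis=-1)
    # compute_metrics must return a dict mapping metric names to values
    return {"f1": f1_score(labels, predictions)}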
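
For reference, the `Trainer` calls `compute_metrics` with an object that unpacks into model outputs and label ids, and expects a dict of named metrics back. The sketch below exercises such a function on hand-made logits without loading a model; the function name `compute_f1` and the toy arrays are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score


def compute_f1(eval_pred):
    """Turn (logits, labels) into a dict with the binary F1 score."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # class with the highest logit
    return {"f1": f1_score(labels, predictions)}


# Four examples, two classes; the last spam example is misclassified as not spam
toy_logits = np.array([[2.0, -1.0], [0.1, 1.5], [-0.3, 0.9], [1.2, 0.4]])
toy_labels = np.array([0, 1, 1, 1])

print(compute_f1((toy_logits, toy_labels)))
```

Here precision is 1 and recall is 2/3, so the F1 score comes out at 0.8.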

In [20]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
training_args = TrainingArguments(
    output_dir="bert",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch"
)
trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=bert_dataset['train'], 
    eval_dataset=bert_dataset['test'],
    data_collator=data_collator,
    compute_metrics=f1
)

In [None]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
