# Deep Learning and Natural Language Processing

## Homework 2

1. Load the Spam Or Not Spam dataset.
2. Try and compare different vectorization methods (3 points):

  *   sklearn.feature_extraction.text.CountVectorizer
  *   sklearn.feature_extraction.text.TfidfVectorizer

3. Train models on the resulting vectors, using cross-validation and hyperparameter tuning (3 points):

  *   sklearn.tree.DecisionTreeClassifier
  *   sklearn.linear_model.LogisticRegression
  *   Naive Bayes

4. Compare the quality of the trained models on a held-out set (1 point).
5. Make the solution reproducible: random_state is fixed and the notebook runs from start to finish without errors (2 points).
6. Follow code style at the level of PEP 8 and "On writing clean Jupyter notebooks" (1 point).

In [None]:
# install spaCy
!pip install -U spacy

# English pipeline for spaCy
!python3 -m spacy download en_core_web_sm

In [2]:
# import libraries
import numpy as np
import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
# the experimental flag must be imported before HalvingGridSearchCV
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline

In [3]:
from google.colab import files
uploaded = files.upload()

Saving spam_or_not_spam.csv to spam_or_not_spam.csv


In [4]:
# build the baseline pipeline from named steps, e.g. counter=..., clf=...
def get_baseline(**kwargs):
  return Pipeline(steps=list(kwargs.items()))


# hyperparameter search with successive halving;
# fits on the global X_train and y_train defined in section 3
def fit_grid_search(pipe, params, n_jobs=-1, scoring='f1', cv=5):
  grid_search = HalvingGridSearchCV(
      pipe,
      param_grid=params,
      n_jobs=n_jobs,
      cv=cv,
      scoring=scoring,
      random_state=2023
  )

  grid_search.fit(X_train, y_train)

  return grid_search


# evaluation on the held-out set; compares against the global y_test
def estimate_test(model, X_test):
  y_pred = model.predict(X_test)
  report = classification_report(y_test, y_pred, output_dict=True)
  return report
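
The helpers above read `X_train`, `y_train`, and `y_test` from the notebook's global scope, so they only work after the split cell has run. As a minimal sketch of an alternative, the variants below take the data as explicit arguments. The names `fit_grid_search_explicit` and `estimate_test_explicit` and the toy texts are hypothetical and are not part of the notebook.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def fit_grid_search_explicit(pipe, params, X, y,
                             scoring='f1', cv=5, n_jobs=None):
    """Run a successive-halving search on the data passed in."""
    search = HalvingGridSearchCV(
        pipe,
        param_grid=params,
        scoring=scoring,
        cv=cv,
        n_jobs=n_jobs,
        random_state=2023,
    )
    return search.fit(X, y)


def estimate_test_explicit(model, X, y):
    """Return the classification report as a dict for the given data."""
    return classification_report(y, model.predict(X), output_dict=True)


# toy, trivially separable data (hypothetical, for illustration only)
texts = ['win free money now', 'meeting at noon tomorrow'] * 20
labels = [1, 0] * 20

pipe = Pipeline([('counter', CountVectorizer()), ('clf', MultinomialNB())])
search = fit_grid_search_explicit(
    pipe, {'clf__alpha': [0.1, 1.0]}, texts, labels
)
report = estimate_test_explicit(search.best_estimator_, texts, labels)
print(search.best_params_, report['accuracy'])
```

Passing the data explicitly makes each call independent of cell execution order, which helps with the reproducibility requirement.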

### 1. Exploratory analysis

In [5]:
df = pd.read_csv('spam_or_not_spam.csv')
df

Unnamed: 0,email,label
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0
1,martin a posted tassos papadopoulos the greek ...,0
2,man threatens explosion in moscow thursday aug...,0
3,klez the virus that won t die already the most...,0
4,in adding cream to spaghetti carbonara which ...,0
...,...,...
2995,abc s good morning america ranks it the NUMBE...,1
2996,hyperlink hyperlink hyperlink let mortgage le...,1
2997,thank you for shopping with us gifts for all ...,1
2998,the famous ebay marketing e course learn to s...,1


In [6]:
# data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   email   2999 non-null   object
 1   label   3000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 47.0+ KB


In [7]:
# missing values
df.isna().sum()

email    1
label    0
dtype: int64

In [8]:
df = df.dropna()

In [9]:
# class balance
df['label'].value_counts()

0    2500
1     499
Name: label, dtype: int64

### 2. Normalization, tokenization, and lemmatization

In [10]:
nlp = spacy.load("en_core_web_sm")

# work on an explicit copy: df comes from dropna(), and assigning a new
# column to it directly triggers pandas' SettingWithCopyWarning
df = df.copy()

# lemmatize, lowercase, and drop stop words, punctuation, numbers,
# e-mail addresses, and whitespace tokens
df['cleaned_text'] = df['email'].apply(
    lambda x: ' '.join(
      token.lemma_.lower() for token in nlp(x) if
      not token.is_stop
      and not token.is_punct
      and not token.is_digit
      and not token.like_email
      and not token.like_num
      and not token.is_space
    )
  )

df.sample(5)


Unnamed: 0,email,label,cleaned_text
2954,otc newsletter discover tomorrow s winners fo...,1,otc newsletter discover tomorrow s winner imme...
2417,url URL date NUMBER NUMBER NUMBERtNUMBER NUMBE...,0,url url date number number numbertnumber numbe...
647,we have a partnership with webex we use their ...,0,partnership webex use serivce cross firewall a...
2378,url URL date not supplied yet another lego obs...,0,url url date supply lego obsessive build work ...
7,martin adamson wrote isn t it just basically a...,0,martin adamson write isn t basically mixture b...


### 3. Comparing CountVectorizer and TfidfVectorizer


In [11]:
# training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df['cleaned_text'], df['label'], random_state=2023
)
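
The classes are imbalanced (2500 to 499), and the split above does not stratify by label. As a hedged illustration only, the sketch below shows how the `stratify` argument of `train_test_split` keeps the spam share nearly identical in both parts. It uses synthetic labels with the same class counts, not the notebook's data, and using it in the notebook would change the reported scores.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic labels with the same class counts as the dataset
y = np.array([0] * 2500 + [1] * 499)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, random_state=2023
)

# share of the positive (spam) class in each part
train_share = y_tr.mean()
test_share = y_te.mean()
print(round(train_share, 3), round(test_share, 3))
```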

Let's compare the document-term matrices produced by CountVectorizer and TfidfVectorizer.

In [12]:
vectorizer = CountVectorizer(max_df=0.7, min_df=0.003)
X_train_vectorized = vectorizer.fit_transform(X_train)

pd.DataFrame(X_train_vectorized.toarray(), columns=vectorizer.get_feature_names_out()).head()

Unnamed: 0,aa,aaron,abandon,ability,able,abroad,absence,absolute,absolutely,abstract,...,yesterday,yield,york,young,yup,ziggy,zip,zone,zope,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
vectorizer = TfidfVectorizer(max_df=0.7, min_df=0.003)
X_train_vectorized = vectorizer.fit_transform(X_train)

pd.DataFrame(X_train_vectorized.toarray(), columns=vectorizer.get_feature_names_out()).head()

Unnamed: 0,aa,aaron,abandon,ability,able,abroad,absence,absolute,absolutely,abstract,...,yesterday,yield,york,young,yup,ziggy,zip,zone,zope,zzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.070433,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
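
The difference between the two matrices is easier to see on a tiny corpus. The sketch below uses three made-up documents (an assumption, not data from the notebook): CountVectorizer stores raw term counts, while TfidfVectorizer reweights them by inverse document frequency and, with the default `norm='l2'`, scales every row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# hypothetical toy corpus
corpus = [
    'free money free offer',
    'meeting about money',
    'offer expires today',
]

counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfVectorizer().fit_transform(corpus)

# raw integer counts versus real-valued weights
print(counts.toarray())
print(tfidf.toarray().round(3))

# with the default L2 normalization every TF-IDF row has unit length
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```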


Create a summary table to record the results of training each model with each text vectorization method.

In [14]:
# summary table
index = ['DecisionTreeClassifier', 'LogisticRegression', 'MultinomialNB']
columns = ['CountVectorizer', 'TfidfVectorizer']

res_df = pd.DataFrame([[0, 0]] * 3, index=index, columns=columns)
res_df

Unnamed: 0,CountVectorizer,TfidfVectorizer
DecisionTreeClassifier,0,0
LogisticRegression,0,0
MultinomialNB,0,0


Parameter grids

In [15]:
# parameter grid for CountVectorizer
counter_params = {
    'counter__max_df': np.linspace(0.7, 1.0, 4),
    'counter__min_df': [0.0, 0.001, 0.003, 0.005],
    'counter__ngram_range': [(1, 1), (1, 2)],
}

# parameter grid for TfidfVectorizer
tfidf_params = {
    'tfidf__max_df': np.linspace(0.7, 1.0, 4),
    'tfidf__min_df': [0.0, 0.001, 0.003, 0.005],
    'tfidf__norm': ['l1', 'l2'],
}

# parameters for DecisionTreeClassifier
# ('log_loss' requires scikit-learn 1.1 or later)
tree_params = {
    'clf__criterion': ['gini', 'entropy', 'log_loss'],
    'clf__random_state': [2023]
}

# parameters for LogisticRegression
logreg_params = {
    'clf__C': np.linspace(0.1, 1, 10),
    'clf__random_state': [2023]
}

# parameters for MultinomialNB
# ('force_alpha' requires scikit-learn 1.2 or later)
nb_params = {
    'clf__alpha': np.linspace(0.1, 1, 10),
    'clf__force_alpha': [True, False]
}
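
The `step__parameter` keys route each value to the pipeline step with that name, so the prefixes `counter` and `clf` must match the step names passed to the pipeline. The sketch below, which assumes nothing beyond scikit-learn itself, checks that the keys of one merged grid are valid pipeline parameters and counts how many candidates the search starts with.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

counter_params = {
    'counter__max_df': np.linspace(0.7, 1.0, 4),
    'counter__min_df': [0.0, 0.001, 0.003, 0.005],
    'counter__ngram_range': [(1, 1), (1, 2)],
}
tree_params = {
    'clf__criterion': ['gini', 'entropy', 'log_loss'],
    'clf__random_state': [2023],
}

# merge the classifier grid with the vectorizer grid
params = dict(tree_params, **counter_params)

pipe = Pipeline([
    ('counter', CountVectorizer()),
    ('clf', DecisionTreeClassifier()),
])

# every grid key must be a parameter the pipeline actually exposes
valid_names = pipe.get_params()
unknown = [name for name in params if name not in valid_names]

# number of candidates: 4 * 4 * 2 vectorizer settings times 3 criteria
n_candidates = len(ParameterGrid(params))
print(unknown, n_candidates)
```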

#### 1. CountVectorizer
DecisionTreeClassifier

In [16]:
params = dict(tree_params, **counter_params)

In [17]:
# baseline pipeline
pipe = get_baseline(counter=CountVectorizer(), clf=DecisionTreeClassifier())

# hyperparameter search
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# evaluation on the held-out set
report = estimate_test(model, X_test)

# store the result in the summary table
# (.loc avoids chained assignment on a column copy)
res_df.loc['DecisionTreeClassifier', 'CountVectorizer'] = round(
    report['accuracy'], 3
)

LogisticRegression

In [18]:
params = dict(logreg_params, **counter_params)

In [19]:
# baseline pipeline
pipe = get_baseline(counter=CountVectorizer(), clf=LogisticRegression())

# hyperparameter search
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# evaluation on the held-out set
report = estimate_test(model, X_test)

# store the result in the summary table
# (.loc avoids chained assignment on a column copy)
res_df.loc['LogisticRegression', 'CountVectorizer'] = round(
    report['accuracy'], 3
)

MultinomialNB

In [20]:
params = dict(nb_params, **counter_params)

In [21]:
# baseline pipeline
pipe = get_baseline(counter=CountVectorizer(), clf=MultinomialNB())

# hyperparameter search
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# evaluation on the held-out set
report = estimate_test(model, X_test)

# store the result in the summary table
# (.loc avoids chained assignment on a column copy)
res_df.loc['MultinomialNB', 'CountVectorizer'] = round(
    report['accuracy'], 3
)

#### 2. TfidfVectorizer
DecisionTreeClassifier

In [22]:
params = dict(tree_params, **tfidf_params)

In [23]:
# baseline pipeline
pipe = get_baseline(tfidf=TfidfVectorizer(), clf=DecisionTreeClassifier())

# hyperparameter search
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# evaluation on the held-out set
report = estimate_test(model, X_test)

# store the result in the summary table
# (.loc avoids chained assignment on a column copy)
res_df.loc['DecisionTreeClassifier', 'TfidfVectorizer'] = round(
    report['accuracy'], 3
)


LogisticRegression

In [24]:
params = dict(logreg_params, **tfidf_params)

In [25]:
# baseline pipeline
pipe = get_baseline(tfidf=TfidfVectorizer(), clf=LogisticRegression())

# hyperparameter search
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# evaluation on the held-out set
report = estimate_test(model, X_test)

# store the result in the summary table
# (.loc avoids chained assignment on a column copy)
res_df.loc['LogisticRegression', 'TfidfVectorizer'] = round(
    report['accuracy'], 3
)

MultinomialNB

In [26]:
params = dict(nb_params, **tfidf_params)

In [27]:
# baseline pipeline
pipe = get_baseline(tfidf=TfidfVectorizer(), clf=MultinomialNB())

# hyperparameter search
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# evaluation on the held-out set
report = estimate_test(model, X_test)

# store the result in the summary table
# (.loc avoids chained assignment on a column copy)
res_df.loc['MultinomialNB', 'TfidfVectorizer'] = round(
    report['accuracy'], 3
)

#### Results

In [28]:
res_df

Unnamed: 0,CountVectorizer,TfidfVectorizer
DecisionTreeClassifier,0.96,0.953
LogisticRegression,0.987,0.967
MultinomialNB,0.988,0.984
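
The table reports accuracy, while the grid search optimizes F1. With roughly five ham messages per spam message, accuracy alone can look high even for a weak spam detector. The sketch below illustrates this on made-up labels (not the notebook's predictions): a classifier that never predicts spam still gets high accuracy but zero F1 for the spam class. The same dictionary keys can be read from the `report` returned by `estimate_test`.

```python
from sklearn.metrics import classification_report

# hypothetical labels with a 5:1 class ratio; the model predicts ham only
y_true = [0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0]

report = classification_report(
    y_true, y_pred, output_dict=True, zero_division=0
)

accuracy = report['accuracy']      # share of correct predictions
spam_f1 = report['1']['f1-score']  # F1 for the positive (spam) class
print(round(accuracy, 3), spam_f1)
```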


In [29]:
!pip freeze > requirements.txt