# Глубокое обучение и обработка естественного языка

## Домашняя работа №3

1. Загрузить набор данных Spam Or Not Spam (или любой другой, какой вам нравится)
2. Обучить модели и сравнить различные способы векторизации с помощью внутренней оценки (intrinsic):

  *   Word2Vec SkipGram / CBOW (параметр sg в gensim.models.word2vec.Word2Vec) - 3 балла
  *   fastText (можно взять в gensim, или в fasttext как на семинаре) - 2 балла

3. Обучить на полученных векторах модели LogisticRegression и сравнить качество на отложенной выборке - 2 балла
4. Обеспечена воспроизводимость решения: зафиксированы random_state, ноутбук воспроизводится от начала до конца без ошибок - 2 балла
5. Соблюден code style на уровне pep8 и On writing clean Jupyter notebooks - 1 балл

**Примечания**: \


*   Для получения более качественных эмбеддингов стоит предварительно сделать предобработку корпуса - отсеять стоп-слова, провести нормализацию и тп. Предобработка рассматривалась в первой лекции/семинаре
*   В данном случае под intrinsic оценкой подразумевается просто использование методов most_similar, doesnt_match. Однако, если есть желание, можно измерить косинусное расстояние между отдельными парами слов и проверить, есть ли корреляция с корпусами для intrinsic-оценки, которые обсуждались на семинаре



In [None]:
# установка spaCy
!pip install -U spacy

# English pipeline в spaCy
!python3 -m spacy download en_core_web_sm

In [238]:
import pandas as pd
import spacy
import gensim.models
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

In [None]:
# from google.colab import files
# uploaded = files.upload()

### 1. Разведочный анализ

In [239]:
df = pd.read_csv('spam_or_not_spam.csv')
df

Unnamed: 0,email,label
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0
1,martin a posted tassos papadopoulos the greek ...,0
2,man threatens explosion in moscow thursday aug...,0
3,klez the virus that won t die already the most...,0
4,in adding cream to spaghetti carbonara which ...,0
...,...,...
2995,abc s good morning america ranks it the NUMBE...,1
2996,hyperlink hyperlink hyperlink let mortgage le...,1
2997,thank you for shopping with us gifts for all ...,1
2998,the famous ebay marketing e course learn to s...,1


In [240]:
# типы данных
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   email   2999 non-null   object
 1   label   3000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 47.0+ KB


In [241]:
# пропуски в данных
df.isna().sum()

email    1
label    0
dtype: int64

In [242]:
df = df.dropna()

In [243]:
# соотношение классов
df['label'].value_counts()

0    2500
1     499
Name: label, dtype: int64

### 2. Нормализация, токенизация и лемматизация

In [244]:
nlp = spacy.load("en_core_web_sm")

df['email_cleaned'] = df['email'].apply(
    lambda x: [token.lemma_.lower() for token in nlp(x) if
      not token.is_stop
      and not token.is_punct
      and not token.is_digit
      and not token.like_email
      and not token.like_num
      and not token.is_space
    ]
  )

df.sample(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['email_cleaned'] = df['email'].apply(


Unnamed: 0,email,label,email_cleaned
591,begin forwarded text date mon NUMBER sep NUMB...,0,"[begin, forward, text, date, mon, number, sep,..."
1234,on fri NUMBER oct NUMBER NUMBER NUMBER NUMBER ...,0,"[fri, number, oct, number, number, number, num..."
882,i actually thought of this kind of active chat...,0,"[actually, think, kind, active, chat, aol, num..."
2531,NUMBER fight the risk of cancer URL NUMBER sli...,1,"[number, fight, risk, cancer, url, number, sli..."
1732,if and when we package this perhaps we should...,0,"[package, use, barry, s, trick, greg, s, trick..."


### 3. Сравнение Word2Vec и FastText


In [276]:
# тренеровочная и тестовая выборки
X_train, X_test, y_train, y_test = train_test_split(df['email_cleaned'], df['label'], random_state=2023)

Word2Vec Skip-gram

In [246]:
skip_gram = gensim.models.Word2Vec(
    sentences=X_train,
    vector_size=300,
    window=7,
    min_count=2,
    sg=1,
    hs=0,
    negative=5,
    epochs=25,
    seed=2023,
)

In [277]:
"""
 Есть пустые предложения, обработать
"""
# векторизированные предложения
X_train_vectorized = []

for index, sentence in enumerate(X_train):
  sentence_vector = []
  if len(sentence) == 0:
    print("EMPTY")
  for token in sentence:
    try:
      token_vector = skip_gram.wv[token]
      sentence_vector.append(token_vector)
    except KeyError as e:
      token_vector = np.zeros(300)
      sentence_vector.append(token_vector)
  if len(sentence_vector) != 0:
    X_train_vectorized.append(np.mean(sentence_vector, axis=0))
  else:
    print("HERE")
    print(sentence_vector)
    y_train.pop(index)

X_train_vectorized = np.array(X_train_vectorized)

EMPTY
HERE
[]


In [259]:
for x in X_train_vectorized:
  print(x.shape)

(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)
(300,)

In [257]:
logereg = LogisticRegression()
logereg.fit(X_train_vectorized, y_train)
y_pred = logereg.predict(X_train_vectorized)

ValueError: ignored

In [None]:
# векторизированные предложения
X_test_vectorized = []

for index, sentence in enumerate(X_train):
  sentence_vector = []
  for token in sentence:
    try:
      token_vector = skip_gram.wv[token]
      sentence_vector.append(token_vector)
    except KeyError as e:
      pass
  if len(sentence_vector) != 0:
    X_test_vectorized.append(np.mean(sentence_vector, axis=0))
  else:
    y_test.pop(index)

X_test_vectorized = np.array(X_test_vectorized)

In [249]:
# classification_report(y_train, y_pred, output_dict=True)
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94      1883
           1       0.73      0.54      0.62       365

    accuracy                           0.89      2248
   macro avg       0.82      0.75      0.78      2248
weighted avg       0.89      0.89      0.89      2248



In [None]:
y_pred = logereg.predict(X_train_vectorized)

Word2Vec CBOW

In [None]:
cbow = gensim.models.Word2Vec(
    sentences=df['email_cleaned'].to_numpy(),
    vector_size=300,
    window=7,
    min_count=10,
    sg=0,
    hs=0,
    negative=5,
    epochs=25,
    seed=2023,
)

In [None]:
cbow.wv.most_similar(positive=['computer'], topn=5)

[('useless', 0.5570467114448547),
 ('science', 0.5214260220527649),
 ('tech', 0.5202290415763855),
 ('virus', 0.48782187700271606),
 ('accessible', 0.48421990871429443)]

In [None]:
# подключение библиотек
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline

In [None]:
# функция, возвращает бейзлайн
def get_baseline(**kwargs):
  keys = list(kwargs.keys())
  values = list(kwargs.values())
  return Pipeline(
      steps=[ (key, value) for key, value in kwargs.items()]
      )

# функция перебора моделей
def fit_grid_search(pipe, params, n_jobs=-1, scoring='f1', cv=5):
  grid_search = HalvingGridSearchCV(
      pipe,
      param_grid=params,
      n_jobs=-1,
      cv=cv,
      scoring=scoring,
      random_state=2023
    )

  grid_search.fit(X_train, y_train)

  return  grid_search


# функция оценки на отложенной выборке
def estimate_test(model, X_test):
  y_pred = model.predict(X_test)
  report = classification_report(y_test, y_pred, output_dict=True)
  return report

### 3. Сравнение CountVectorizer и TfidfVectorizer


Сравним результирующие матрицы векторов для CountVectorizer и TfidfVectorizer

In [None]:
vectorizer = CountVectorizer(max_df=0.7, min_df=0.003)
X_train_vectorized = vectorizer.fit_transform(X_train)

pd.DataFrame(X_train_vectorized.toarray(), columns=vectorizer.get_feature_names_out()).head()

Unnamed: 0,aa,aaron,abandon,ability,able,abroad,absence,absolute,absolutely,abstract,...,yesterday,yield,york,young,yup,ziggy,zip,zone,zope,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
vectorizer = TfidfVectorizer(max_df=0.7, min_df=0.003)
X_train_vectorized = vectorizer.fit_transform(X_train)

pd.DataFrame(X_train_vectorized.toarray(), columns=vectorizer.get_feature_names_out()).head()

Unnamed: 0,aa,aaron,abandon,ability,able,abroad,absence,absolute,absolutely,abstract,...,yesterday,yield,york,young,yup,ziggy,zip,zone,zope,zzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.070433,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Создадим сводную таблицу, в которую будем заносить результаты обучения моделей на разных методах векторазиции текста

In [None]:
# сводная таблица
index = ['DecisionTreeClassifier', 'LogisticRegression', 'MultinomialNB']
columns = ['CountVectorizer', 'TfidfVectorizer']

res_df = pd.DataFrame([[0, 0]] * 3, index=index, columns=columns)
res_df

Unnamed: 0,CountVectorizer,TfidfVectorizer
DecisionTreeClassifier,0,0
LogisticRegression,0,0
MultinomialNB,0,0


Сетки параметров

In [None]:
# сетка параметров CountVectorizer
counter_params = {
    'counter__max_df': np.linspace(0.7, 1.0, 4),
    'counter__min_df': [0.0, 0.001, 0.003, 0.005],
    'counter__ngram_range': [(1, 1), (1, 2)],
}

# сетка параметров TfidfVectorizer
tfidf_params = {
    'tfidf__max_df': np.linspace(0.7, 1.0, 4),
    'tfidf__min_df': [0.0, 0.001, 0.003, 0.005],
    'tfidf__norm': ['l1', 'l2'],
}

# параметры DecisionTreeClassifier
tree_params = {
    'clf__criterion': ['gini', 'entropy', 'log_loss'],
    'clf__random_state': [2023]
}

# параметры классификатора LogisticRegression
logreg_params = {
    'clf__C': np.linspace(0.1, 1, 10),
    'clf__random_state': [2023]
}

# параметры классификатора MultinomialNB
nb_params = {
    'clf__alpha': np.linspace(0.1, 1, 10),
    'clf__force_alpha': [True, False]
}

#### 1. CountVectorizer
DecisionTreeClassifier

In [None]:
params = dict(tree_params, **counter_params)

In [None]:
# бейзлайн
pipe = get_baseline(counter=CountVectorizer(), clf=DecisionTreeClassifier())

# перебор моделей
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# оценка на отложенной выборке
report = estimate_test(model, X_test)

# сохранение результата в таблицу
res_df['CountVectorizer'].iloc[0] = round(report['accuracy'], 3)

LogisticRegression

In [None]:
params = dict(logreg_params, **counter_params)

In [None]:
# бейзлайн
pipe = get_baseline(counter=CountVectorizer(), clf=LogisticRegression())

# перебор моделей
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# оценка на отложенной выборке
report = estimate_test(model, X_test)

# сохранение результата в таблицу
res_df['CountVectorizer'].iloc[1] = round(report['accuracy'], 3)

MultinomialNB

In [None]:
params = dict(nb_params, **counter_params)

In [None]:
# бейзлайн
pipe = get_baseline(counter=CountVectorizer(), clf=MultinomialNB())

# перебор моделей
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# оценка на отложенной выборке
report = estimate_test(model, X_test)

# сохранение результата в таблицу
res_df['CountVectorizer'].iloc[2] = round(report['accuracy'], 3)

#### 2. TfidfVectorizer
DecisionTreeClassifier

In [None]:
params = dict(tree_params, **tfidf_params)

In [None]:
# бейзлайн
pipe = get_baseline(tfidf=TfidfVectorizer(), clf=DecisionTreeClassifier())

# перебор моделей
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# оценка на отложенной выборке
report = estimate_test(model, X_test)

# сохранение результата в таблицу
res_df['TfidfVectorizer'].iloc[0] = round(report['accuracy'], 3)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  res_df['TfidfVectorizer'].iloc[0] = round(report['accuracy'], 3)


LogisticRegression

In [None]:
params = dict(logreg_params, **tfidf_params)

In [None]:
# бейзлайн
pipe = get_baseline(tfidf=TfidfVectorizer(), clf=LogisticRegression())

# перебор моделей
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# оценка на отложенной выборке
report = estimate_test(model, X_test)

# сохранение результата в таблицу
res_df['TfidfVectorizer'].iloc[1] = round(report['accuracy'], 3)

MultinomialNB

In [None]:
params = dict(nb_params, **tfidf_params)

In [None]:
# бейзлайн
pipe = get_baseline(tfidf=TfidfVectorizer(), clf=MultinomialNB())

# перебор моделей
grid_search_result = fit_grid_search(pipe, params)
model = grid_search_result.best_estimator_

# оценка на отложенной выборке
report = estimate_test(model, X_test)

# сохранение результата в таблицу
res_df['TfidfVectorizer'].iloc[2] = round(report['accuracy'], 3)

#### Итог

In [None]:
res_df

Unnamed: 0,CountVectorizer,TfidfVectorizer
DecisionTreeClassifier,0.96,0.953
LogisticRegression,0.987,0.967
MultinomialNB,0.988,0.984


In [None]:
!pip freeze > requirements.txt