# Homework NLP-3
What's in your vector?

#### Goal:
In this homework you will learn to work with pretrained vector representations.

#### Description / step-by-step instructions:
For data, use either the dataset collected in the first lesson (preferred) or download the IMDB movie-review data (https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews), in which each review is labeled with a sentiment: "positive" or "negative".

1) Split the collected data into train/test, holding out 20-30% of the observations for testing.
2) Apply a tf-idf transformation to the text. Use both individual tokens and bigrams, filter out stop words as well as words that occur too rarely or too often (the min_df/max_df parameters), and remember to turn off the l2 normalization that is enabled by default.
3) Train a random forest or a gradient boosting model (LightGBM or CatBoost) on the resulting vectors and choose the best hyperparameter combination with GridSearch.
4) Now use pretrained word2vec / fastText embeddings to vectorize the text. Vectorize the texts with word2vec/fastText, weighting the word vectors by tf-idf.
Tip: for Russian texts you can take pretrained embeddings from RusVectores https://rusvectores.org/ru/models/ (embeddings whose tagset parameter is "НЕТ", i.e. none, are suitable). For English you can use word2vec vectors trained on Google News.
5) Repeat the experiment from step 3 with the vectors obtained in step 4.

#### Grading criteria:
Train/test split - 1 point  \
Text preprocessing with tf-idf - 2 points  \
Training a model on tf-idf vectors - 2 points  \
Text preprocessing with pretrained word2vec/fastText embeddings - 3 points  \
Training a model on pretrained embeddings - 2 points

In [None]:
import nltk
import pandas as pd
import numpy as np
import seaborn as sns
from tqdm import tqdm
import time
from sklearn.metrics import classification_report, roc_auc_score
tqdm.pandas()  # enables progress_apply on pandas objects

from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier
import lightgbm as lgb

In [None]:
df = pd.read_csv('IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
df.shape

(50000, 2)

In [None]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

The dataset is balanced across classes.

# 1. Text preprocessing

In [None]:
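# Remove punctuation, digits, and a few special characters
import re
import string
# re.escape keeps characters such as ']' and '\' literal inside the character class
spec_chars = re.escape(string.punctuation) + r'\d\n\xa0«»\t—…'

df['clear'] = df['review'].str.lower().str.replace(f'[{spec_chars}]', '', regex=True)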
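
The raw reviews contain HTML line breaks (`<br />`), and deleting only the punctuation leaves stray `br` tokens in the text. A possible refinement, sketched here under the assumption that anything between `<` and `>` is markup and can be dropped, is to strip tags before removing punctuation. The helper name `clean_review` is hypothetical and is not used elsewhere in this notebook.

```python
import re
import string

TAG_RE = re.compile(r"<[^>]+>")  # assumption: '<...>' spans are HTML tags
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}" + r"\d\n\xa0«»\t—…]")


def clean_review(text):
    """Lowercase, drop HTML tags, then drop punctuation/digits and squeeze whitespace."""
    text = TAG_RE.sub(" ", text.lower())
    text = PUNCT_RE.sub("", text)
    return " ".join(text.split())


sample = "A wonderful little production. <br /><br />The filming"
print(clean_review(sample))
```

Replacing tags with a space rather than an empty string keeps the words on either side of a tag from being glued together.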

## 1.1. Tokenizing the text into individual tokens (bigrams are built later by the tf-idf vectorizer)

In [None]:
# Load the tokenizer

from nltk.tokenize import word_tokenize  # word_tokenize splits a text
                # into individual word tokens.

nltk.download('punkt') # download the 'punkt' package from NLTK: the pretrained tokenizer models
  # that word_tokenize relies on to find sentence boundaries before splitting into words.

df['tokens'] = df['clear'].progress_apply(word_tokenize) # new DataFrame column that holds
                                                         # the tokenization result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
100%|██████████| 50000/50000 [00:47<00:00, 1048.01it/s]


## 1.2. Stemming and stop-word removal

In [None]:
%%time
# Remove stop words and apply stemming (the text is in English)

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download('stopwords')
stops = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")

# Note: stemmer.stem receives the whole joined string here, so it effectively treats each review
# as one long word; individual tokens stay unstemmed (see the 'stem' column in the output).
df['stem'] = df['tokens'].progress_apply(lambda x: ' '.join([i for i in x if i not in stops])).progress_apply(stemmer.stem)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
100%|██████████| 50000/50000 [00:01<00:00, 33529.86it/s]
100%|██████████| 50000/50000 [00:11<00:00, 4187.25it/s]

CPU times: user 12.3 s, sys: 171 ms, total: 12.5 s
Wall time: 13.5 s
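
`SnowballStemmer.stem` expects a single word, so stemming every token means calling it once per token. The sketch below shows that variant on a toy token list; the helper name `stem_tokens` and the one-word stop set are hypothetical. Switching the notebook to this version would change the `stem` column and therefore the scores reported further down.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")


def stem_tokens(tokens, stop_words):
    """Drop stop words, stem each remaining token, and join back into one string."""
    return " ".join(stemmer.stem(tok) for tok in tokens if tok not in stop_words)


stemmed = stem_tokens(["watching", "the", "movies"], stop_words={"the"})
print(stemmed)
```

With a DataFrame this would be applied as `df['tokens'].progress_apply(lambda toks: stem_tokens(toks, stops))`.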





In [None]:
df.head()

Unnamed: 0,review,sentiment,clear,tokens,stem
0,One of the other reviewers has mentioned that ...,positive,one of the other reviewers has mentioned that ...,"[one, of, the, other, reviewers, has, mentione...",one reviewers mentioned watching oz episode yo...
1,A wonderful little production. <br /><br />The...,positive,a wonderful little production br br the filmin...,"[a, wonderful, little, production, br, br, the...",wonderful little production br br filming tech...
2,I thought this was a wonderful way to spend ti...,positive,i thought this was a wonderful way to spend ti...,"[i, thought, this, was, a, wonderful, way, to,...",thought wonderful way spend time hot summer we...
3,Basically there's a family where a little boy ...,negative,basically theres a family where a little boy j...,"[basically, theres, a, family, where, a, littl...",basically theres family little boy jake thinks...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter matteis love in the time of money is a ...,"[petter, matteis, love, in, the, time, of, mon...",petter matteis love time money visually stunni...


## 1.4. tf-idf text vectorization

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

In [None]:
# Convert the text into numeric vectors.
# TF-IDF assigns each term a weight that reflects its importance:
# terms that are frequent in a given document but rare in the other documents
# get higher weights.
# norm=None turns off the default l2 normalization; min_df=2 drops terms found in fewer than 2 documents.
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,2), norm=None, min_df=2)

Store the feature in X and the target variable in y.

In [None]:
X = df['stem']
y = df['sentiment']

Split the data into train and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                          test_size=0.3, random_state=42, stratify=y)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((35000,), (15000,), (35000,), (15000,))

In [None]:
X_train = vectorizer.fit_transform(X_train) # fit TfidfVectorizer on the training texts
    # and convert them into a TF-IDF matrix.
X_test = vectorizer.transform(X_test) # convert the test texts with the vocabulary and IDF weights learned on train.
  # Use transform here (not fit_transform), so that nothing is learned from the test set.

In [None]:
le = LabelEncoder() # LabelEncoder maps the text labels to integers
le.fit(y_train) # fit the encoder on the training labels y_train
# The encoder is fitted on the training labels only (le.fit(y_train)).
# le.transform() is then applied to the test labels y_test, so they are mapped
# to the same integers that were used to train the model.

# 2. Training LightGBM or CatBoost on the resulting vectors and tuning hyperparameters with GridSearch

## 2.1. CatBoost + tf-idf

CatBoost often gives good results without hyperparameter tuning, so we start with it.

In [None]:
#!pip freeze

In [None]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl (98.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.5


In [None]:
from catboost import CatBoostClassifier

# initialize the model
clf_cat = CatBoostClassifier(iterations=50, logging_level='Silent')

# train the model on the training data
clf_cat.fit(X_train, le.transform(y_train));
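
The model is trained on integer labels produced by `le.transform`, so it helps to know which integer stands for which class. A small sketch on toy labels (the variable names `toy_le` and `label_mapping` are illustrative): `LabelEncoder` sorts the classes, so `negative` becomes 0 and `positive` becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

toy_le = LabelEncoder()
toy_le.fit(["positive", "negative", "positive"])

# classes_ is sorted; the position of a class is its integer code
label_mapping = {label: int(code) for code, label in enumerate(toy_le.classes_)}
print(label_mapping)

# inverse_transform turns model predictions back into readable labels
decoded = toy_le.inverse_transform([1, 0])
print(list(decoded))
```

This also means that column 1 of `predict_proba` corresponds to the `positive` class.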

In [None]:
# predict labels for the test data
y_pred = clf_cat.predict(X_test)

In [None]:
pred_proba = clf_cat.predict_proba(X_test)

In [None]:
roc_auc_score(le.transform(y_test), pred_proba[:,1], average='macro')

0.9364777333333333

In [None]:
print(classification_report(le.transform(y_test), y_pred))

              precision    recall  f1-score   support

           0       0.87      0.84      0.86      7500
           1       0.85      0.88      0.86      7500

    accuracy                           0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000
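
The two metrics above measure different things: ROC AUC is computed from the predicted probabilities and does not depend on a decision threshold, while the classification report is computed from hard labels. A toy sketch, with four made-up scores that are not model output and an assumed cut-off of 0.5:

```python
import numpy as np
from sklearn.metrics import classification_report, roc_auc_score

toy_true = np.array([0, 0, 1, 1])
toy_proba = np.array([0.1, 0.4, 0.35, 0.8])  # illustrative scores for the positive class

toy_auc = roc_auc_score(toy_true, toy_proba)  # ranking quality, no threshold involved
toy_pred = (toy_proba >= 0.5).astype(int)     # hard labels at a 0.5 cut-off

print(toy_auc)
print(classification_report(toy_true, toy_pred))
```

Here three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75, and at the 0.5 cut-off one positive example is missed.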



## 2.2. LightGBM + tf-idf

### GridSearchCV

In [None]:
"""
%%time

model = LGBMClassifier(random_state=42)
params = {
    'num_leaves': [5, 15, 31], # default=31
    'max_depth': [-1, 3, 7], # default=-1
    'learning_rate': [0.5, 0.1, 0.01], # default=0.1
}

clf = GridSearchCV(model, params, scoring='f1', verbose=10)
clf.fit(X_train, le.transform(y_train))
"""

Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV 1/5; 1/27] END learning_rate=0.5, max_depth=-1, num_leaves=5;, score=0.863 total time=  11.1s
[CV 3/5; 3/27] END learning_rate=0.5, max_depth=-1, num_leaves=31;, score=0.879 total time=  20.2s
[CV 1/5; 5/27] END learning_rate=0.5, max_depth=3, num_leaves=15;, score=0.853 total time=  11.4s
[CV 2/5; 5/27] END learning_rate=0.5, max_depth=3, num_leaves=15;, score=0.845 total time=  12.7s
[CV 3/5; 5/27] END learning_rate=0.5, max_depth=3, num_leaves=15;, score=0.858 total time=  12.1s
[CV 4/5; 5/27] END learning_rate=0.5, max_depth=3, num_leaves=15;, score=0.858 total time=  11.0s
[CV 5/5; 5/27] END learning_rate=0.5, max_depth=3, num_leaves=15;, score=0.857 total time=  13.1s
[CV 1/5; 6/27] END learning_rate=0.5, max_depth=3, num_leaves=31;, score=0.853 total time=  11.8s
[CV 2/5; 6/27] END learning_rate=0.5, max_depth=3, num_leaves=31;, score=0.845 total time=  11.8s
[CV 3/5; 6/27] END learning_rate=0.5, max_depth=3, num_leaves=31;, score=0.858 total time=  15.0s
[CV 4/5; 6/27] END learning_rate=0.5, max_depth=3, num_leaves=31;, score=0.858 total time=  13.6s
[CV 5/5; 6/27] END learning_rate=0.5, max_depth=3, num_leaves=31;, score=0.857 total time=  12.7s
[CV 5/5; 7/27] END learning_rate=0.5, max_depth=7, num_leaves=5;, score=0.864 total time=  12.8s
[CV 1/5; 8/27] END learning_rate=0.5, max_depth=7, num_leaves=15;, score=0.863 total time=  13.9s
[CV 2/5; 9/27] END learning_rate=0.5, max_depth=7, num_leaves=31;, score=0.849 total time=  13.7s
[CV 4/5; 9/27] END learning_rate=0.5, max_depth=7, num_leaves=31;, score=0.858 total time=  14.1s
[CV 5/5; 9/27] END learning_rate=0.5, max_depth=7, num_leaves=31;, score=0.859 total time=  14.5s
[CV 3/5; 10/27] END learning_rate=0.1, max_depth=-1, num_leaves=5;, score=0.821 total time=  12.8s
[CV 4/5; 10/27] END learning_rate=0.1, max_depth=-1, num_leaves=5;, score=0.817 total time=  16.1s
[CV 1/5; 13/27] END learning_rate=0.1, max_depth=3, num_leaves=5;, score=0.814 total time=  13.6s
[CV 2/5; 14/27] END learning_rate=0.1, max_depth=3, num_leaves=15;, score=0.804 total time=  13.1s
[CV 3/5; 14/27] END learning_rate=0.1, max_depth=3, num_leaves=15;, score=0.813 total time=  13.3s
[CV 5/5; 14/27] END learning_rate=0.1, max_depth=3, num_leaves=15;, score=0.818 total time=  14.2s
[CV 1/5; 15/27] END learning_rate=0.1, max_depth=3, num_leaves=31;, score=0.815 total time=  13.5s
[CV 2/5; 15/27] END learning_rate=0.1, max_depth=3, num_leaves=31;, score=0.804 total time=  12.7s
[CV 3/5; 15/27] END learning_rate=0.1, max_depth=3, num_leaves=31;, score=0.813 total time=  13.6s
[CV 5/5; 15/27] END learning_rate=0.1, max_depth=3, num_leaves=31;, score=0.818 total time=  13.8s
[CV 1/5; 16/27] END learning_rate=0.1, max_depth=7, num_leaves=5;, score=0.819 total time=  17.1s
[CV 4/5; 17/27] END learning_rate=0.1, max_depth=7, num_leaves=15;, score=0.836 total time=  16.1s
[CV 2/5; 19/27] END learning_rate=0.01, max_depth=-1, num_leaves=5;, score=0.731 total time=  14.3s
[CV 3/5; 19/27] END learning_rate=0.01, max_depth=-1, num_leaves=5;, score=0.742 total time=  13.3s
[CV 4/5; 21/27] END learning_rate=0.01, max_depth=-1, num_leaves=31;, score=0.774 total time=  17.8s
[CV 5/5; 21/27] END learning_rate=0.01, max_depth=-1, num_leaves=31;, score=0.768 total time=  17.0s
[CV 1/5; 24/27] END learning_rate=0.01, max_depth=3, num_leaves=31;, score=0.731 total time=  12.1s
[CV 2/5; 26/27] END learning_rate=0.01, max_depth=7, num_leaves=15;, score=0.750 total time=  14.1s
[CV 3/5; 26/27] END learning_rate=0.01, max_depth=7, num_leaves=15;, score=0.768 total time=  15.3s
[output truncated]

In [None]:
"""
clf.best_params_
"""

{'learning_rate': 0.5, 'max_depth': -1, 'num_leaves': 15}
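
In the search above the tf-idf matrix is built once from the whole training set, so the vocabulary and IDF weights also see the validation part of every fold. One common alternative, sketched here under assumptions rather than taken from this notebook, is to put the vectorizer and the model in a `Pipeline` so that both are refitted inside each fold. The toy texts, the grid values, and the use of `RandomForestClassifier` (one of the models the assignment allows) are illustrative choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = positive, 0 = negative
toy_texts = ["great film", "loved it", "wonderful acting", "really great story",
             "bad film", "hated it", "terrible acting", "really bad story"] * 3
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), norm=None)),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "tfidf__min_df": [1, 2],
    "clf__n_estimators": [10, 50],
    "clf__max_depth": [None, 3],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(toy_texts, toy_labels)
print(search.best_params_, search.best_score_)
```

With this layout the vectorizer settings can be tuned in the same grid as the classifier settings, at the cost of recomputing tf-idf for every fit.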

In [None]:
# Refit the model with the hyperparameters found by the grid search

# initialize the model with the selected hyperparameters
clf = LGBMClassifier(**{'learning_rate': 0.5, 'max_depth': -1, 'num_leaves': 15}, random_state=42)

# train the model on the training data
clf.fit(X_train, le.transform(y_train));

[LightGBM] [Info] Number of positive: 17500, number of negative: 17500
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 39.125274 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 87186
[LightGBM] [Info] Number of data points in the train set: 35000, number of used features: 28140
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000


In [None]:
# predict labels for the test data
y_pred = clf.predict(X_test)

In [None]:
pred_proba = clf.predict_proba(X_test)

In [None]:
roc_auc_score(le.transform(y_test), pred_proba[:,1], average='macro')

0.9483703466666665
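
For reference, ROC AUC can be read as the share of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks that reading on four toy labels and scores that are not taken from this notebook; `pairwise_auc` is a hypothetical helper, and its quadratic loop is only meant for small illustrations, not for the 15,000 test reviews.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive score with every negative score
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (an assumption for illustration, not the model's predictions)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

Three of the four pairs are ordered correctly (only 0.35 vs 0.4 is not), which gives 0.75.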

In [None]:
print(classification_report(le.transform(y_test), y_pred))

              precision    recall  f1-score   support

           0       0.88      0.87      0.87      7500
           1       0.87      0.88      0.88      7500

    accuracy                           0.88     15000
   macro avg       0.88      0.88      0.88     15000
weighted avg       0.88      0.88      0.88     15000





# 3. Vectorizing the text with pre-trained fasttext embeddings

We use FastText rather than word2vec: it is the more recent model and builds word vectors from character n-grams, so it can also produce vectors for words missing from its vocabulary.

## 3.1. Pre-trained FastText embeddings + LGBM

In [None]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.13.1-py3-none-any.whl (238 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... done
  Created wheel for fasttext: filename=fasttext-0.9.3-cp310-cp310-linux_x86_64.whl size=4246764 sha256=836c3f7b605ca3dc7396654bbb48bcaf7606409bd3559def9ca36e677e30d209
  Stored in d

# Model selection
I chose between two models:
1) facebook/fasttext-en-vectors (https://huggingface.co/facebook/fasttext-en-vectors), for which cosine_similarity("man", "boy") = 0.0616533
2) cc.en.300.bin.gz (https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz), which is described in detail at https://programmersought.com/article/457510884978/83

The cells below use cc.en.300.bin.gz.
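
The `cosine_similarity` figure quoted for the first model can be understood through a minimal sketch of the measure itself. The vectors below are toy 3-dimensional values, not real FastText embeddings, and the zero-vector convention is an assumption made here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors (assumptions for illustration only)
man = [1.0, 2.0, 0.0]
boy = [2.0, 4.0, 0.0]  # same direction as `man`
car = [0.0, 0.0, 3.0]  # orthogonal to `man`

print(cosine_similarity(man, boy))
print(cosine_similarity(man, car))
```

With a loaded model, the same function could be applied to `ft.get_word_vector("man")` and `ft.get_word_vector("boy")`.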

In [None]:
# Download the file with pre-trained English word vectors.
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz

--2024-07-15 14:21:08--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 18.165.83.35, 18.165.83.91, 18.165.83.44, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|18.165.83.35|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4503593528 (4.2G) [application/octet-stream]
Saving to: ‘cc.en.300.bin.gz’


2024-07-15 14:27:08 (12.0 MB/s) - ‘cc.en.300.bin.gz’ saved [4503593528/4503593528]



In [None]:
# Decompress the .bin.gz file in place: -d means "decompress" and removes the archive (add -k to keep the original):
!gzip -d /content/cc.en.300.bin.gz
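
If the archive should be kept (for example, to avoid downloading 4.2 GB again), the decompression can also be done from Python. This is a minimal sketch using only the standard library; `gunzip_keep` is a hypothetical helper name, not part of fasttext.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_keep(archive_path):
    """Decompress `x.gz` to `x` next to it and leave the archive in place."""
    archive = Path(archive_path)
    target = archive.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        # Copy in chunks so a multi-gigabyte file is never held in memory at once
        shutil.copyfileobj(src, dst)
    return target
```

In this notebook the call would be `gunzip_keep('/content/cc.en.300.bin.gz')`, which writes `/content/cc.en.300.bin` beside the archive.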

In [None]:
# Load the model
import fasttext
ft = fasttext.load_model('cc.en.300.bin')

In [None]:
len(ft.get_words())

2000000

In [None]:
# Convert the texts in the `'clear'` column of DataFrame `df` into embedding vectors with the FastText model `ft`
# (progress_apply is available only after tqdm.pandas() has been called)
embeddings = df['clear'].progress_apply(ft.get_sentence_vector)

NameError: name 'df' is not defined

In [None]:
embeddings[0].shape

(300,)
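
As far as I understand, `ft.get_sentence_vector` averages the normalized word vectors with equal weights, whereas the assignment asks for tf-idf weights. Below is a minimal sketch of a tf-idf-weighted average. It uses plain Python lists and a hypothetical 2-dimensional lookup table in place of `ft.get_word_vector`, so it runs without the model. The helper names `idf_weights` and `tfidf_sentence_vector` are assumptions, and the IDF formula follows the smoothed variant that scikit-learn uses by default.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Smoothed IDF per token: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def tfidf_sentence_vector(tokens, idf, get_word_vector, dim):
    """Average of word vectors weighted by term count * IDF."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in Counter(tokens).items():
        weight = count * idf.get(tok, 0.0)  # tokens unseen when fitting IDF get weight 0
        if weight == 0.0:
            continue
        vec = get_word_vector(tok)
        total = [t + weight * x for t, x in zip(total, vec)]
        weight_sum += weight
    # An empty or fully unknown text yields the zero vector
    return [t / weight_sum for t in total] if weight_sum else total


# Hypothetical lookup standing in for ft.get_word_vector
toy_vectors = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "movie": [0.0, 1.0]}
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "movie"]]

idf = idf_weights(docs)
vec = tfidf_sentence_vector(docs[0], idf, toy_vectors.__getitem__, dim=2)
print(vec)
```

Because "movie" occurs in every toy document, its IDF is the lowest, so the resulting vector leans towards "good". With the real model one would pass `ft.get_word_vector` and `dim=300`, and fit the IDF weights on the training texts only.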

In [None]:
# Dimensionality reduction with PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=100)  # desired number of components
# `embeddings` is a Series of 300-dimensional arrays; stack it into a 2D matrix first
reduced_embeddings = pca.fit_transform(np.array(embeddings.tolist()))

In [None]:
# Create a FastText model with reduced (100-dimensional) embeddings.
# fasttext has no constructor that accepts a matrix of reduced sentence vectors;
# fasttext.util.reduce_model reduces the model's own vectors in place instead.
import fasttext.util

new_ft = fasttext.load_model('cc.en.300.bin')  # separate copy, so `ft` stays 300-dimensional
fasttext.util.reduce_model(new_ft, 100)  # new embedding dimension

# Save the reduced model
new_ft.save_model('cc.en.100.bin')

In [None]:
len(ft.get_words())

2000000

In [None]:
embeddings = df['clear'].progress_apply(ft.get_sentence_vector)

In [None]:
embeddings[0].shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(np.array(embeddings.tolist()), df['sentiment'], test_size=0.3, random_state=42, stratify=df['sentiment'])

In [None]:
%%time
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

model = LGBMClassifier(random_state=123)
params = {
    'num_leaves': [5, 15, 31], # default=31
    'max_depth': [-1, 3, 7], # default=-1
    'learning_rate': [0.5, 0.1, 0.01], # default=0.1
}

clf = GridSearchCV(model, params, scoring='f1', verbose=10)
clf.fit(X_train, le.transform(y_train))

In [None]:
clf.best_params_

{'learning_rate': 0.1, 'max_depth': 7, 'num_leaves': 31}
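
The size of this search follows directly from the grid: three values for each of three hyperparameters, cross-validated on 5 folds (the `GridSearchCV` default when `cv` is not passed, which matches the `[CV 3/5; 26/27]` lines in the earlier log). A small sketch that enumerates the same grid with the standard library:

```python
from itertools import product

params = {
    'num_leaves': [5, 15, 31],
    'max_depth': [-1, 3, 7],
    'learning_rate': [0.5, 0.1, 0.01],
}
cv_folds = 5  # GridSearchCV's default when `cv` is not passed

# Every combination of one value per hyperparameter
candidates = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 27 135
```

The selected combination is one of these 27 candidates; with the default `refit=True`, one more fit on the full training set follows the 135 cross-validation fits.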

In [None]:
y_pred = clf.predict(X_test)
print(classification_report(le.transform(y_test), y_pred))

              precision    recall  f1-score   support

           0       0.84      0.84      0.84      7500
           1       0.84      0.85      0.84      7500

    accuracy                           0.84     15000
   macro avg       0.84      0.84      0.84     15000
weighted avg       0.84      0.84      0.84     15000

