# Baseline solution for author classification from text

## Work plan

1. **Data preprocessing**: text cleaning and dataset preparation.

2. **Minimal baseline**: random prediction or prediction of the most frequent class, used as a reference point when evaluating the other approaches.

3. **Heuristics**:
    - **Lexicon-based**: prediction using dictionaries, for example for sentiment and toxicity;
    - **Rule-based**: prediction using various patterns: keywords, regular expressions, counts of positive/negative words, text length and average number of words per sentence, readability formulas (for example, Flesch-Kincaid), detection of simple templates and specific phrases, measures of vocabulary diversity and text structure, and so on.

4. **Statistical approaches and ML**:
   - **BoW** or **TF-IDF** with linear classifiers (**Logistic Regression** / **SVM**);
   - **N-grams** and **Naive Bayes**: taking word order into account.

5. **Building embeddings and training a simple model**:
    - **Word2Vec** / **GloVe** with **LogReg** / **Linear Classifier** / **SVM**

6. **Ensembles** (combinations of simple models to improve accuracy).
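Step 6 is not implemented in this part of the notebook. As a rough illustration only, hard voting over the predictions of several already-fitted models could look like the sketch below. The function `hard_vote` and the toy predictions are hypothetical and do not come from the original code.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Combine per-model label lists by majority vote, one text at a time.

    Ties are resolved in favour of the label predicted by the earliest model.
    """
    combined = []
    for labels in zip(*predictions_per_model):
        # most_common(1) returns the first-seen label among equally frequent ones
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined

# Toy predictions of three hypothetical models for four texts
lexicon_pred = ['Hugo', 'Eco', 'Aesop', 'Hugo']
bow_pred = ['Hugo', 'Hugo', 'Aesop', 'Eco']
tfidf_pred = ['Eco', 'Hugo', 'Aesop', 'Dumas']

ensemble_pred = hard_vote([lexicon_pred, bow_pred, tfidf_pred])
print(ensemble_pred)
```

With only three voters and about a hundred classes, complete disagreement (as in the last text) is likely to be common, so the order of the models acts as a tie-breaker.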

## Importing libraries

In [1]:
!pip install langdetect
!pip install pymorphy2
!pip install gensim
!unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/


In [2]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns

from bs4 import BeautifulSoup
import re
import string

import nltk
import spacy
from nltk import tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Make sure the required NLTK resources are downloaded
# ('corpora' is a directory name, not a package id, so it is not requested here)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from langdetect import detect

from collections import Counter

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

sns.set_theme(rc={'figure.figsize':(15, 7)})

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
/kaggle/input/glove-text/glove.6B.100d.txt
/kaggle/input/top-100-authors-preprocessed-train-test/top100_preprocessed_test.pq
/kaggle/input/top-100-authors-preprocessed-train-test/top100_preprocessed.pq


In [3]:
RANDOM_STATE = 17

## Loading the data

In [4]:
# Helper that prints a short summary of a DataFrame
def data_info(df):
    print(f'Number of rows - {df.shape[0]}\n', f'Number of columns - {df.shape[1]}', sep='')
    
    display(df.head(3))
    return df.info()

In [5]:
data_train = pd.read_parquet('/kaggle/input/top-100-authors-preprocessed-train-test/top100_preprocessed.pq')
data_test = pd.read_parquet('/kaggle/input/top-100-authors-preprocessed-train-test/top100_preprocessed_test.pq')

In [6]:
data_info(data_train)

Number of rows - 408
Number of columns - 42


Unnamed: 0,author,author_sn,title,text,FileName,is_gutenberg,text_len,text_len2,words_cnt,words_symbols,...,has_html,text_cleaned,text_cleaned_words,text_cleaned_lemmatization,fk_score,lexical_diversity,average_sentence_length,sentence_count,average_word_length,syllable_count
0,Geoffrey_Chaucer,Chaucer,The_Canterbury_Tales,THE CANTERBURY TALES And other Poems of GEOFF...,Geoffrey_Chaucer-The_Canterbury_Tales.txt,0,1650684,1517228,279354,0.184121,...,1,canterbury tales poems geoffrey chaucer edited...,"[canterbury, tales, poems, geoffrey, chaucer, ...",canterbury tale poem geoffrey chaucer edited p...,61.96818,0.127333,13.57943,10569,8.670947,222379
1,Umberto_Eco,Eco,From_The_Tree_To_The_Labyrinth,FROM THE TREE TO THE LABYRINTH FROM THE TREE ...,Umberto_Eco-From_the_Tree_to_the_Labyrinth.txt,0,1533277,1527975,249969,0.163595,...,0,tree labyrinth tree labyrinth historical studi...,"[tree, labyrinth, tree, labyrinth, historical,...",tree labyrinth tree labyrinth historical study...,17.781446,0.153241,14.392021,9701,9.125995,287891
2,Victor_Hugo,Hugo,Les_Misérables,LES MISÉRABLES By Victor Hugo Translated by I...,Victor_Hugo-Les_Misérables.txt,1,3304624,3210219,565614,0.176192,...,0,les mis rables victor hugo translated isabel f...,"[les, mis, rables, victor, hugo, translated, i...",le mi rables victor hugo translated isabel f h...,41.217732,0.08812,9.140349,29961,9.656992,506079


<class 'pandas.core.frame.DataFrame'>
Index: 408 entries, 0 to 408
Data columns (total 42 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   author                      408 non-null    object 
 1   author_sn                   408 non-null    object 
 2   title                       408 non-null    object 
 3   text                        408 non-null    object 
 4   FileName                    408 non-null    object 
 5   is_gutenberg                408 non-null    int64  
 6   text_len                    408 non-null    int64  
 7   text_len2                   408 non-null    int64  
 8   words_cnt                   408 non-null    int64  
 9   words_symbols               408 non-null    float64
 10  words_dots                  408 non-null    float64
 11  words_commas                408 non-null    float64
 12  words_excls                 408 non-null    float64
 13  words_questions             408 non-null

In [7]:
data_info(data_test)

Number of rows - 51
Number of columns - 16


Unnamed: 0,author,author_sn,title,text,FileName,is_gutenberg,has_html,text_cleaned,text_cleaned_words,text_cleaned_lemmatization,fk_score,lexical_diversity,average_sentence_length,sentence_count,average_word_length,syllable_count
5582,Aesop,Aesop,Aesop's_Fables,\r\n\r\n[Illustration]\r\n\r\n\r\n\r\n\r\nThe ...,Aesop-Aesop's_Fables.txt,1,0,illustration fables aesop selected told anew h...,"[illustration, fables, aesop, selected, told, ...",illustration fable aesop selected told anew hi...,60.888825,0.355881,12.258803,568,9.037197,10988
3938,Agatha_Christie,Christie,Poirot_Investigates,\r\n\r\n\r\n\r\nProduced by an anonymous Proje...,Agatha_Christie-Poirot_Investigates.txt,1,0,produced anonymous project gutenberg volunteer...,"[produced, anonymous, project, gutenberg, volu...",produced anonymous project gutenberg volunteer...,47.552857,0.22527,7.341569,3481,9.462748,45865
3881,Alexandre_Dumas,Dumas,Celebrated_Crimes_(complete),\n\n\n\n\nProduced by David Widger.\n\n\n\n\n\...,Alexandre_Dumas-Celebrated_Crimes_(complete).txt,1,1,produced david widger celebrated crimes comple...,"[produced, david, widger, celebrated, crimes, ...",produced david widger celebrated crime complet...,31.969128,0.074072,13.57758,21855,9.773315,565011


<class 'pandas.core.frame.DataFrame'>
Index: 51 entries, 5582 to 3319
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   author                      51 non-null     object 
 1   author_sn                   51 non-null     object 
 2   title                       51 non-null     object 
 3   text                        51 non-null     object 
 4   FileName                    51 non-null     object 
 5   is_gutenberg                51 non-null     int64  
 6   has_html                    51 non-null     int64  
 7   text_cleaned                51 non-null     object 
 8   text_cleaned_words          51 non-null     object 
 9   text_cleaned_lemmatization  51 non-null     object 
 10  fk_score                    51 non-null     float64
 11  lexical_diversity           51 non-null     float64
 12  average_sentence_length     51 non-null     float64
 13  sentence_count              51 non-nu

In [8]:
# Make sure that no test titles remain in the training data
test_titles = list(data_test['title'].unique())

data_train = data_train[~data_train.title.isin(test_titles)]
data_train.shape

(363, 42)

In [9]:
# Share of test-set authors that also appear in the training data
len(set(data_test['author']) & set(data_train['author'])) / len(set(data_test['author']))

0.9607843137254902
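A ratio of about 0.96 means that a few test authors have no texts left in the training set after the title filter, so no model trained here can predict them correctly. A minimal sketch for listing such authors is shown below; `unseen_authors` and the toy label lists are hypothetical, and in the notebook the inputs would be `data_train['author']` and `data_test['author']`.

```python
def unseen_authors(train_authors, test_authors):
    """Return the test authors that never occur in the training labels, sorted."""
    return sorted(set(test_authors) - set(train_authors))

# Toy labels standing in for the real author columns
train_authors = ['Aesop', 'Victor_Hugo', 'Umberto_Eco', 'Aesop']
test_authors = ['Aesop', 'Victor_Hugo', 'Agatha_Christie']

missing = unseen_authors(train_authors, test_authors)
coverage = 1 - len(missing) / len(set(test_authors))
print(missing, round(coverage, 3))
```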

In [10]:
# Needed by the functions below
data_train['text_cleaned_words'] = data_train['text_cleaned_words'].apply(lambda x: list(x))
data_test['text_cleaned_words'] = data_test['text_cleaned_words'].apply(lambda x: list(x))

In [11]:
def metrics(y_pred, y_test, y_pred_probas=None):    
    # Metrics for multiclass classification
    accuracy = accuracy_score(y_test, y_pred)
    precision_macro = precision_score(y_test, y_pred, average='macro')
    precision_micro = precision_score(y_test, y_pred, average='micro')
    recall_macro = recall_score(y_test, y_pred, average='macro')
    recall_micro = recall_score(y_test, y_pred, average='micro')
    f1_macro = f1_score(y_test, y_pred, average='macro')
    f1_micro = f1_score(y_test, y_pred, average='micro')

    # ROC AUC needs class probabilities; the explicit None check also works
    # when a NumPy array is passed (its truth value is ambiguous)
    if y_pred_probas is not None:
        roc_auc = roc_auc_score(y_test, y_pred_probas, multi_class='ovr')
    else:
        roc_auc = None

    # Collect the metrics into a DataFrame
    metrics_table = {
        'Metric': ['Accuracy', 'Precision (Macro)', 'Precision (Micro)', 
                   'Recall (Macro)', 'Recall (Micro)', 
                   'F1 Score (Macro)', 'F1 Score (Micro)', 'ROC AUC'],
        'Score': [accuracy, precision_macro, precision_micro, 
                  recall_macro, recall_micro, 
                  f1_macro, f1_micro, roc_auc]
    }
    
    return pd.DataFrame(metrics_table)


In [12]:
columns_to_drop = list(set(data_train.columns) - set(data_test.columns)) + ['author_sn', 'title', 'FileName', 'is_gutenberg', 'has_html']

In [13]:
# Keep only the features we need
data_train = data_train.drop(columns_to_drop, axis=1)
data_test = data_test.drop(['author_sn', 'title', 'FileName', 'is_gutenberg', 'has_html'], axis=1)

# Split into features and target
X_train, y_train = data_train.drop(['author'], axis=1), data_train['author']
X_test, y_test = data_test.drop(['author'], axis=1), data_test['author']

In [14]:
print(f'Training data shape: {X_train.shape}, {y_train.shape}')
print(f'Test data shape: {X_test.shape}, {y_test.shape}')

Training data shape: (363, 10), (363,)
Test data shape: (51, 10), (51,)


Preprocessing was already done in the previous step, so to save time I did not copy the preprocessing code into this notebook. The data is already preprocessed: lowercased, stripped of punctuation and stop words, tokenized, and lemmatized. What follows is the modeling work.
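Since the preprocessing code is not included here, the following is only a minimal sketch of what such a cleaning step might look like, assuming plain regular expressions and a tiny hand-written stop-word list (`STOP_WORDS` and `clean_text` are hypothetical; the original step presumably used NLTK resources and also lemmatized the tokens, which this sketch omits).

```python
import re

# A tiny illustrative stop-word list, not the one used for the dataset
STOP_WORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'was', 'by'}

def clean_text(text):
    """Lowercase, replace non-ASCII-letter characters with spaces, tokenise, drop stop words."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)   # punctuation, digits and accented letters become spaces
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = clean_text('LES MISÉRABLES, by Victor Hugo: the story of Jean Valjean.')
print(tokens)
```

An ASCII-only filter like this one splits accented words, which would be consistent with the `mis rables` fragment visible in the `text_cleaned` column for *Les Misérables*.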

## DummyClassifier: establishing the baseline

In [15]:
dummy_classifier = DummyClassifier(strategy='most_frequent', random_state=RANDOM_STATE)
dummy_classifier.fit(X_train, y_train)
y_pred = dummy_classifier.predict(X_test)

baseline_metrics = metrics(y_pred, y_test)
baseline_metrics

Unnamed: 0,Metric,Score
0,Accuracy,0.019608
1,Precision (Macro),0.000384
2,Precision (Micro),0.019608
3,Recall (Macro),0.019608
4,Recall (Micro),0.019608
5,F1 Score (Macro),0.000754
6,F1 Score (Micro),0.019608
7,ROC AUC,
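The macro-averaged values in this table can be reproduced by hand. The sketch below assumes 51 distinct authors in the test set, one text each, with the majority class being one of them; this assumption is consistent with the reported numbers but is not stated explicitly in the notebook.

```python
n_classes = 51   # assumed number of distinct labels in the test set
n_texts = 51     # test rows, one per author under this assumption

# The dummy model predicts the same author for every text,
# so exactly one prediction out of n_texts is correct.
accuracy = 1 / n_texts

# Only the predicted class has non-zero precision and recall.
precision_majority = 1 / n_texts   # 1 correct out of 51 predictions of that class
recall_majority = 1.0              # its single text is recovered
f1_majority = (2 * precision_majority * recall_majority
               / (precision_majority + recall_majority))

# Macro averaging: unweighted mean over all classes (the other 50 contribute 0).
precision_macro = precision_majority / n_classes
recall_macro = recall_majority / n_classes
f1_macro = f1_majority / n_classes

print(round(accuracy, 6), round(precision_macro, 6),
      round(recall_macro, 6), round(f1_macro, 6))
```

Micro averaging pools all decisions before computing the ratio, which is why every micro metric equals accuracy in single-label multiclass classification.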


## Heuristics

### Lexicon-based

In [16]:
# Build the list of each author's most frequent words
def author_top_words(data, col_author='author', col_words='text_cleaned_words', n_top=50):
    author_words_count = data.groupby([col_author])[col_words].sum().to_frame()\
                             .apply(lambda x: Counter(x[col_words]), axis=1).to_frame()\
                             .apply(lambda row: sorted(row[0].items(), key=lambda x: x[1], reverse=True)[:n_top], axis=1).to_frame()

    author_words_count = author_words_count[0].apply(lambda x: dict(x).keys()).to_frame().reset_index()
    author_words_count.columns = ['author', "top_words"]
    
    return author_words_count

To build a **lexicon-based** heuristic for author identification, we first collect, for each author, a set of their most frequently used (or most distinctive) words. A text by an unknown author is then compared against these sets, and the most likely author is chosen by how often each author's lexicon occurs in the text. In other words, we count how many words from each author's lexicon appear in the text and pick the author with the most matches.
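The scoring idea can be seen on a toy example. The lexicons and the text below are made up for illustration (`toy_lexicons` and `score_authors` are hypothetical names); the notebook builds 50-word lexicons from the training texts instead.

```python
from collections import Counter

# Hypothetical two-author lexicons
toy_lexicons = {
    'Author_A': ['sea', 'ship', 'captain'],
    'Author_B': ['prince', 'soul', 'god'],
}

def score_authors(text, lexicons):
    """Count how many tokens of `text` fall into each author's lexicon."""
    word_count = Counter(text.split())
    return {author: sum(word_count[w] for w in words)
            for author, words in lexicons.items()}

scores = score_authors('captain ship sea storm ship soul', toy_lexicons)
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Because the score is a raw count, words that are frequent for many authors can dominate it; restricting lexicons to words that are distinctive for one author is one possible refinement.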

In [17]:
def lexicon_based_classifier(text, author_lexicons):
    word_count = Counter(text.split())
    # Index the lexicons by author once instead of on every loop iteration
    lexicons = author_lexicons.set_index('author')['top_words']
    # Score of an author = number of tokens in the text that belong to the author's lexicon
    scores = {author: sum(word_count[word] for word in words)
              for author, words in lexicons.items()}
        
    return max(scores, key=scores.get)

Example: predicting the author of a Dostoyevsky text from the training data.

In [18]:
author_lexicons = author_top_words(data_train, n_top=50)
author_lexicons
lexicon_based_classifier(X_train.loc[201]['text_cleaned'], author_lexicons=author_lexicons)

'Fyodor_Dostoyevsky'

Now predict the author for every text in the test set and compute the metrics.

In [19]:
y_pred = X_test['text_cleaned'].apply(lambda x: lexicon_based_classifier(x, author_lexicons))

# Compute the metrics
lexicon_based_metrics = metrics(y_pred, y_test)
lexicon_based_metrics

Unnamed: 0,Metric,Score
0,Accuracy,0.568627
1,Precision (Macro),0.450292
2,Precision (Micro),0.568627
3,Recall (Macro),0.508772
4,Recall (Micro),0.568627
5,F1 Score (Macro),0.467836
6,F1 Score (Micro),0.568627
7,ROC AUC,


### Rule-based approach

Rule-based: prediction using various patterns: keywords, regular expressions, counts of positive/negative words, text length and average number of words per sentence, readability formulas (for example, Flesch-Kincaid), detection of simple templates and specific phrases, measures of vocabulary diversity and text structure, and so on.
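A few of these features can be computed with the standard library alone. The sketch below is an assumption-laden illustration: `style_features` is a hypothetical helper, the sentence splitter is a naive regular expression rather than NLTK, and the exact definitions behind the dataset columns with similar names are not shown in this notebook, so the values need not match them.

```python
import re

def style_features(text):
    """Return simple stylometric features for a raw text string."""
    # Naive sentence split: ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        return {'average_sentence_length': 0.0,
                'lexical_diversity': 0.0,
                'average_word_length': 0.0}
    return {
        'average_sentence_length': len(words) / len(sentences),  # words per sentence
        'lexical_diversity': len(set(words)) / len(words),       # type-token ratio
        'average_word_length': sum(len(w) for w in words) / len(words),
    }

features = style_features('The cat sat. The cat ran away! Did the dog follow?')
print(features)
```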

In [20]:
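# Count the syllables in a word (rough vowel-group heuristic)
def nsyllables(word):
    word = word.lower()
    syllable_count = 0
    vowels = "aeiouy"

    # Guard against empty tokens
    if not word:
        return 0
    
    if word[0] in vowels:
        syllable_count += 1
        
    # Each vowel that follows a consonant starts a new vowel group
    for i in range(1, len(word)):
        if word[i] in vowels and word[i - 1] not in vowels:
            syllable_count += 1
        
    # A trailing "e" is usually silent
    if word.endswith("e"):
        syllable_count -= 1
    # Every non-empty word has at least one syllable
    if syllable_count == 0:
        syllable_count = 1
    
    return syllable_count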

In [21]:
# Analyse a text: readability score and average sentence length
def analyze_text(text, text_cleaned):
    # Split the raw text into sentences
    sentences = nltk.sent_tokenize(text)
    # Split the cleaned text into words
    words = nltk.word_tokenize(text_cleaned)
    word_count = len(words)
    sentence_count = len(sentences)

    # Return a pair in every case, because the caller unpacks two values
    if sentence_count == 0 or word_count == 0:
        return 0, 0

    average_sentence_length = word_count / sentence_count
    syllable_count = sum(nsyllables(word) for word in words)

    # Readability score; these constants are those of the Flesch Reading Ease formula
    fk_score = 206.835 - 1.015 * (word_count / sentence_count) - 84.6 * (syllable_count / word_count)

    return fk_score, average_sentence_length
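For reference, the expression computed in `analyze_text` has the form of the Flesch Reading Ease score (higher values mean easier text) rather than the Flesch-Kincaid grade level. Both standard formulas are shown below for comparison; the symbols $N_{\text{words}}$, $N_{\text{sentences}}$ and $N_{\text{syllables}}$ are notation introduced here for the three counts used in the function.

```latex
\begin{aligned}
\text{Reading Ease} &= 206.835 - 1.015\,\frac{N_{\text{words}}}{N_{\text{sentences}}} - 84.6\,\frac{N_{\text{syllables}}}{N_{\text{words}}} \\
\text{Grade Level}  &= 0.39\,\frac{N_{\text{words}}}{N_{\text{sentences}}} + 11.8\,\frac{N_{\text{syllables}}}{N_{\text{words}}} - 15.59
\end{aligned}
```

Note that the function counts sentences on the raw text but words on the cleaned text (stop words removed), so its words-per-sentence ratio is lower than it would be on the raw text alone.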

In [22]:
author_profiles = data_train.groupby('author')[['fk_score', 'average_sentence_length']].mean().reset_index()
author_profiles.head(3)

Unnamed: 0,author,fk_score,average_sentence_length
0,Aesop,68.432096,8.176658
1,Agatha_Christie,50.55296,6.591788
2,Ahmet_Hamdi_Tanpinar,33.215016,8.483373


In [23]:
# Predict the author of a text
def predict_author(text, cleaned_text, author_profiles):    
    fk_score, average_sentence_length = analyze_text(text, cleaned_text)

    # Index the profiles by author once, outside the loop
    profiles = author_profiles.set_index('author')
    scores = dict.fromkeys(profiles.index, 0)
    
    for author in profiles.index:
        author_fk = profiles.loc[author, 'fk_score']
        author_sentence_length = profiles.loc[author, 'average_sentence_length']

        score = 0
        # Compare sentence length
        if abs(average_sentence_length - author_sentence_length) < 1:
            score += 1

        # Compare text complexity
        if abs(fk_score - author_fk) < 10:
            score += 1
            
        scores[author] = score
    
    # On ties, max() returns the first author with the highest score
    predicted_author = max(scores, key=scores.get)
    
    return predicted_author

Let's make predictions on the test data.

In [24]:
y_pred = X_test.apply(lambda row: predict_author(row['text'], row['text_cleaned'], author_profiles), axis=1)

# Compute the metrics
rule_based_metrics = metrics(y_pred, y_test)
rule_based_metrics

Unnamed: 0,Metric,Score
0,Accuracy,0.058824
1,Precision (Macro),0.032308
2,Precision (Micro),0.058824
3,Recall (Macro),0.046154
4,Recall (Micro),0.058824
5,F1 Score (Macro),0.033566
6,F1 Score (Micro),0.058824
7,ROC AUC,


The metrics are low: accuracy is about 0.06, only slightly above the most-frequent-class baseline (about 0.02). It may be worth adding more rules or revising the current ones.
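One possible revision, shown here only as a sketch and not as the notebook's method, is to replace the two threshold rules with a nearest-profile rule: pick the author whose mean feature vector is closest to the text's features after scaling. With only two binary rules, many authors probably end up with the same score, and a continuous distance avoids such ties. The function `nearest_profile` is hypothetical; the profile values are rounded from the `author_profiles` output above.

```python
import math

def nearest_profile(features, profiles):
    """Pick the author whose mean feature vector is closest to `features`.

    Each feature is divided by its standard deviation across authors so that
    fk_score (tens of units) does not dominate sentence length (units).
    """
    names = list(features)
    scales = {}
    for name in names:
        values = [p[name] for p in profiles.values()]
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        scales[name] = std or 1.0   # guard against a constant feature

    def distance(profile):
        return math.sqrt(sum(((features[n] - profile[n]) / scales[n]) ** 2
                             for n in names))

    return min(profiles, key=lambda author: distance(profiles[author]))

# Per-author means, rounded from the first rows of author_profiles
profiles = {
    'Aesop': {'fk_score': 68.43, 'average_sentence_length': 8.18},
    'Agatha_Christie': {'fk_score': 50.55, 'average_sentence_length': 6.59},
    'Ahmet_Hamdi_Tanpinar': {'fk_score': 33.22, 'average_sentence_length': 8.48},
}

predicted = nearest_profile({'fk_score': 47.6, 'average_sentence_length': 7.3}, profiles)
print(predicted)
```

Two features are unlikely to separate around a hundred authors on their own, so this would mainly be useful as one ingredient among more features.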

## Statistical approaches

Next, we try ML and statistical methods.

- BoW / TF-IDF + LogReg / SVM (classification based on keywords)
- N-grams + Naive Bayes (classification that takes word order into account)
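To make the difference between the two representations concrete, here is a small pure-Python sketch on made-up documents. It uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, which to my knowledge is the default form in scikit-learn's `TfidfVectorizer`; the final L2 normalisation is omitted, and the names `docs`, `bow`, `df` and `tfidf` are illustrative only.

```python
import math
from collections import Counter

docs = [
    'whale sea ship whale',
    'sea storm ship',
    'prince ball prince',
]

# Bag of words: raw term counts per document
bow = [Counter(doc.split()) for doc in docs]

# Document frequency: in how many documents each term occurs
df = Counter(term for counts in bow for term in counts)
n_docs = len(docs)

def tfidf(counts):
    """Weight raw counts by a smoothed inverse document frequency."""
    return {term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()}

weights = tfidf(bow[0])
print(dict(bow[0]))
print({term: round(w, 3) for term, w in weights.items()})
```

Terms that occur in fewer documents (`whale`) get a larger weight per occurrence than terms shared across documents (`sea`, `ship`), which is the property that distinguishes TF-IDF from plain counts.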

### BoW + LogReg 

In [25]:
text_feature = 'text_cleaned_lemmatization'
num_features = ['fk_score', 'lexical_diversity', 'average_sentence_length', 'sentence_count', 'average_word_length', 'syllable_count']

preprocessor = ColumnTransformer(
    transformers = [('text', CountVectorizer(), text_feature),
                    ('num', StandardScaler(), num_features)],
    remainder='drop'
)

# Build the Pipeline
log_reg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(random_state=RANDOM_STATE))
])

# Define the hyperparameter search space
log_reg_params =[
    {'preprocessor__text__max_df': np.arange(0.7, 1, 0.1),
     'preprocessor__text__min_df': np.arange(0.1, 0.4, 0.1),  
     'preprocessor__text__max_features': [1500, 3500, 5000, 8000, 10000, 20000, 50000],
     'model__C': [0.01, 0.1, 1, 5, 10, 30, 50, 150]}
]

In [26]:
rand_search = RandomizedSearchCV(log_reg, log_reg_params, n_iter=10, scoring='accuracy', cv=3, verbose=3, random_state=RANDOM_STATE)
rand_search.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END model__C=30, preprocessor__text__max_df=0.9999999999999999, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.2;, score=0.587 total time=  24.7s
[CV 2/3] END model__C=30, preprocessor__text__max_df=0.9999999999999999, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.2;, score=0.653 total time=  23.0s
[CV 3/3] END model__C=30, preprocessor__text__max_df=0.9999999999999999, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.2;, score=0.587 total time=  25.2s
[CV 1/3] END model__C=30, preprocessor__text__max_df=0.7, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.30000000000000004;, score=0.463 total time=  22.3s
[CV 2/3] END model__C=30, preprocessor__text__max_df=0.7, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.30000000000000004;, score=0.603 total time=  23.2s
[CV 3/3] END model__C=30, preprocessor__text__max_df=0.7,

In [27]:
cv_result_cols = ['param_preprocessor__text__min_df', 'param_preprocessor__text__max_features','param_preprocessor__text__max_df', 
                  'param_model__C', 'mean_test_score', 'std_test_score', 'rank_test_score']

print(f'Best metric value: {rand_search.best_score_}')


cv_results = pd.DataFrame(rand_search.cv_results_)[cv_result_cols].sort_values(by=['rank_test_score'])
cv_results

Best metric value: 0.6584022038567493


Unnamed: 0,param_preprocessor__text__min_df,param_preprocessor__text__max_features,param_preprocessor__text__max_df,param_model__C,mean_test_score,std_test_score,rank_test_score
7,0.4,8000,1.0,50.0,0.658402,0.014047,1
0,0.2,1500,1.0,30.0,0.608815,0.031167,2
2,0.2,5000,0.8,1.0,0.570248,0.006748,3
9,0.2,50000,0.8,0.1,0.570248,0.006748,3
5,0.3,5000,0.7,150.0,0.556474,0.04493,5
3,0.2,8000,0.9,0.01,0.548209,0.112981,6
4,0.2,50000,0.7,10.0,0.539945,0.010308,7
6,0.3,1500,0.7,5.0,0.520661,0.05356,8
1,0.3,1500,0.7,30.0,0.515152,0.062699,9
8,0.3,1500,0.7,0.1,0.509642,0.043905,10


In [28]:
log_reg_BoW = rand_search.best_estimator_

y_pred = log_reg_BoW.predict(X_test)
log_reg_bow_metrics = metrics(y_pred, y_test)
log_reg_bow_metrics 

Unnamed: 0,Metric,Score
0,Accuracy,0.54902
1,Precision (Macro),0.399691
2,Precision (Micro),0.54902
3,Recall (Macro),0.518519
4,Recall (Micro),0.54902
5,F1 Score (Macro),0.432451
6,F1 Score (Micro),0.54902
7,ROC AUC,


### TF-IDF + LogReg 

In [29]:
text_feature = 'text_cleaned_lemmatization'
num_features = ['fk_score', 'lexical_diversity', 'average_sentence_length', 'sentence_count', 'average_word_length', 'syllable_count']

preprocessor = ColumnTransformer(
    transformers = [('text', TfidfVectorizer(), text_feature),
                    ('num', StandardScaler(), num_features)],
    remainder='drop'
)

# Build the Pipeline
log_reg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(random_state=RANDOM_STATE))
])

# Define the hyperparameter search space
log_reg_params =[
    {'preprocessor__text__max_df': np.arange(0.7, 1, 0.1),
     'preprocessor__text__min_df': np.arange(0.1, 0.4, 0.1),  
     'preprocessor__text__max_features': [1500, 3500, 5000, 8000, 10000, 20000, 35000],
     'model__C': [0.01, 0.1, 1, 5, 10, 30, 50, 150]}
]

In [30]:
rand_search = RandomizedSearchCV(log_reg, log_reg_params, n_iter=10, scoring='accuracy', cv=3, verbose=3, random_state=RANDOM_STATE)
rand_search.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END model__C=30, preprocessor__text__max_df=0.9999999999999999, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.2;, score=0.397 total time=  22.6s
[CV 2/3] END model__C=30, preprocessor__text__max_df=0.9999999999999999, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.2;, score=0.678 total time=  20.3s
[CV 3/3] END model__C=30, preprocessor__text__max_df=0.9999999999999999, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.2;, score=0.521 total time=  20.6s
[CV 1/3] END model__C=30, preprocessor__text__max_df=0.7, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.30000000000000004;, score=0.397 total time=  21.6s
[CV 2/3] END model__C=30, preprocessor__text__max_df=0.7, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.30000000000000004;, score=0.628 total time=  20.1s
[CV 3/3] END model__C=30, preprocessor__text__max_df=0.7,

In [31]:
cv_result_cols = ['param_preprocessor__text__min_df', 'param_preprocessor__text__max_features','param_preprocessor__text__max_df', 
                  'param_model__C', 'mean_test_score', 'std_test_score', 'rank_test_score']

print(f'Best metric value: {rand_search.best_score_}')


cv_results = pd.DataFrame(rand_search.cv_results_)[cv_result_cols].sort_values(by=['rank_test_score'])
cv_results

Best metric value: 0.5426997245179064


Unnamed: 0,param_preprocessor__text__min_df,param_preprocessor__text__max_features,param_preprocessor__text__max_df,param_model__C,mean_test_score,std_test_score,rank_test_score
7,0.4,8000,1.0,50.0,0.5427,0.111766,1
0,0.2,1500,1.0,30.0,0.53168,0.114979,2
5,0.3,5000,0.7,150.0,0.512397,0.09986,3
1,0.3,1500,0.7,30.0,0.484848,0.102189,4
6,0.3,1500,0.7,5.0,0.465565,0.116161,5
4,0.2,35000,0.7,10.0,0.460055,0.113183,6
2,0.2,5000,0.8,1.0,0.371901,0.140253,7
8,0.3,1500,0.7,0.1,0.220386,0.054959,8
9,0.2,35000,0.8,0.1,0.217631,0.051094,9
3,0.2,8000,0.9,0.01,0.168044,0.010308,10


In [32]:
log_reg_tfidf = rand_search.best_estimator_

y_pred = log_reg_tfidf.predict(X_test)
#y_pred_proba = log_reg_tfidf.predict_proba(X_test)
log_reg_tfidf_metrics = metrics(y_pred, y_test)
log_reg_tfidf_metrics 

Unnamed: 0,Metric,Score
0,Accuracy,0.568627
1,Precision (Macro),0.442949
2,Precision (Micro),0.568627
3,Recall (Macro),0.557692
4,Recall (Micro),0.568627
5,F1 Score (Macro),0.474359
6,F1 Score (Micro),0.568627
7,ROC AUC,


### BoW + ngrams + Logistic Regression

In [33]:
text_feature = 'text_cleaned_lemmatization'
num_features = ['fk_score', 'lexical_diversity', 'average_sentence_length', 'sentence_count', 'average_word_length', 'syllable_count']

preprocessor = ColumnTransformer(
    transformers = [('text', CountVectorizer(ngram_range=(1, 2)), text_feature),
                    ('num', StandardScaler(), num_features)],
    remainder='drop'
)

# Build the Pipeline
log_reg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(random_state=RANDOM_STATE))
])

# Define the hyperparameter search space
log_reg_params =[
    {'preprocessor__text__max_df': np.arange(0.7, 1, 0.1),
     'preprocessor__text__min_df': np.arange(0.1, 0.4, 0.1),  
     'preprocessor__text__max_features': [1500, 3500, 5000, 8000, 10000],
     'model__C': [0.01, 0.1, 1, 5, 10, 30, 50, 150]}
]
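The `ngram_range=(1, 2)` setting tells the vectorizer to use both single words and pairs of adjacent words as features. The sketch below shows, on a toy token list, which features that produces; `word_ngrams` is a hypothetical helper and not part of scikit-learn.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, shorter n-grams first."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the sequence
        grams.extend(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

tokens = 'call me ishmael'.split()
print(word_ngrams(tokens))
```

Bigrams greatly enlarge the vocabulary, which is probably why the fits in this section take noticeably longer than the unigram ones.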

In [34]:
rand_search = RandomizedSearchCV(log_reg, log_reg_params, n_iter=5, scoring='accuracy', cv=3, verbose=3, random_state=RANDOM_STATE)
rand_search.fit(X_train, y_train)

Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV 1/3] END model__C=150, preprocessor__text__max_df=0.9999999999999999, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.4;, score=0.686 total time= 1.1min
[CV 2/3] END model__C=150, preprocessor__text__max_df=0.9999999999999999, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.4;, score=0.661 total time= 1.3min
[CV 3/3] END model__C=150, preprocessor__text__max_df=0.9999999999999999, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.4;, score=0.587 total time= 1.4min
[CV 1/3] END model__C=5, preprocessor__text__max_df=0.7, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.2;, score=0.496 total time= 1.0min
[CV 2/3] END model__C=5, preprocessor__text__max_df=0.7, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.2;, score=0.595 total time= 1.3min
[CV 3/3] END model__C=5, preprocessor__text__max_df=0.7, preprocessor__text__max_features

In [35]:
cv_result_cols = ['param_preprocessor__text__min_df', 'param_preprocessor__text__max_features','param_preprocessor__text__max_df', 
                  'param_model__C', 'mean_test_score', 'std_test_score', 'rank_test_score']

print(f'Best metric value: {rand_search.best_score_}')


cv_results = pd.DataFrame(rand_search.cv_results_)[cv_result_cols].sort_values(by=['rank_test_score'])
cv_results

Best metric value: 0.652892561983471


Unnamed: 0,param_preprocessor__text__min_df,param_preprocessor__text__max_features,param_preprocessor__text__max_df,param_model__C,mean_test_score,std_test_score,rank_test_score
3,0.3,5000,1.0,10.0,0.652893,0.047235,1
0,0.4,1500,1.0,150.0,0.644628,0.042141,2
2,0.4,1500,1.0,0.1,0.633609,0.030428,3
4,0.3,3500,0.7,30.0,0.559229,0.047874,4
1,0.2,1500,0.7,5.0,0.534435,0.043383,5


In [36]:
log_reg_BoW_ngrams = rand_search.best_estimator_

y_pred = log_reg_BoW_ngrams.predict(X_test)
log_reg_bow_ngram_metrics = metrics(y_pred, y_test)
log_reg_bow_ngram_metrics 

Unnamed: 0,Metric,Score
0,Accuracy,0.529412
1,Precision (Macro),0.410303
2,Precision (Micro),0.529412
3,Recall (Macro),0.490909
4,Recall (Micro),0.529412
5,F1 Score (Macro),0.430303
6,F1 Score (Micro),0.529412
7,ROC AUC,


### BoW + SVM

Now let's define the pipeline for the SVM model.

In [37]:
preprocessor = ColumnTransformer(
    transformers = [('text', CountVectorizer(), text_feature),
                    ('num', StandardScaler(), num_features)],
    remainder='drop'
)

# Build the Pipeline
SVC_classif = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', SVC(random_state=RANDOM_STATE))
])


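# Define the hyperparameter search space
SVC_params =[
    {
     # NOTE: np.arange(0.7, 1.1, 0.1) also yields 1.1; a float max_df is a proportion
     # and should not exceed 1.0, and the max_df=1.1 candidates get NaN scores below
     'preprocessor__text__max_df': np.arange(0.7, 1.1, 0.1),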
     'preprocessor__text__min_df': np.arange(0.1, 0.4, 0.1),  
     'preprocessor__text__max_features': [100, 300, 500, 1000, 1500, 3500, 5000, 8000, 10000],
     'model__C': [0.01, 0.1, 1, 5, 10, 30, 50, 150] 
    }
]

In [38]:
rand_search = RandomizedSearchCV(SVC_classif, SVC_params, n_iter=10, scoring='accuracy', cv=3, verbose=3, random_state=RANDOM_STATE)
rand_search.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END model__C=5, preprocessor__text__max_df=0.8999999999999999, preprocessor__text__max_features=500, preprocessor__text__min_df=0.4;, score=0.281 total time=  13.0s
[CV 2/3] END model__C=5, preprocessor__text__max_df=0.8999999999999999, preprocessor__text__max_features=500, preprocessor__text__min_df=0.4;, score=0.504 total time=  12.9s
[CV 3/3] END model__C=5, preprocessor__text__max_df=0.8999999999999999, preprocessor__text__max_features=500, preprocessor__text__min_df=0.4;, score=0.240 total time=  13.0s
[CV 1/3] END model__C=150, preprocessor__text__max_df=0.7, preprocessor__text__max_features=300, preprocessor__text__min_df=0.2;, score=0.306 total time=  12.8s
[CV 2/3] END model__C=150, preprocessor__text__max_df=0.7, preprocessor__text__max_features=300, preprocessor__text__min_df=0.2;, score=0.488 total time=  13.0s
[CV 3/3] END model__C=150, preprocessor__text__max_df=0.7, preprocessor__text__max_features=300

In [39]:
print(f'Best metric value: {rand_search.best_score_}')

cv_results = pd.DataFrame(rand_search.cv_results_)[cv_result_cols].sort_values(by=['rank_test_score'])
cv_results

Best metric value: 0.40495867768595045


Unnamed: 0,param_preprocessor__text__min_df,param_preprocessor__text__max_features,param_preprocessor__text__max_df,param_model__C,mean_test_score,std_test_score,rank_test_score
5,0.2,300,0.9,150.0,0.404959,0.128742,1
1,0.2,300,0.7,150.0,0.349862,0.099479,2
0,0.4,500,0.9,5.0,0.341598,0.116161,3
6,0.1,10000,0.7,10.0,0.30854,0.127021,4
4,0.3,500,0.8,1.0,0.22314,0.03374,5
2,0.4,10000,1.0,0.01,0.112948,0.003896,6
7,0.3,5000,0.9,0.1,0.112948,0.003896,6
3,0.3,500,1.1,150.0,,,8
8,0.4,500,1.1,30.0,,,8
9,0.1,100,1.1,0.1,,,8
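
The three candidates with `max_df = 1.1` have empty scores in the table above. A likely cause is that `np.arange` with a fractional step can overshoot the intended upper bound, and a float `max_df` above 1.0 is not a valid document-frequency proportion. A minimal sketch of a grid built with `np.linspace` instead; the names `max_df_grid` and `min_df_grid` are illustrative and do not appear in the notebook:

```python
import numpy as np

# With a fractional step, the number of generated points depends on rounding,
# so the last value may land on (or past) the intended stop.
raw_grid = np.arange(0.7, 1.1, 0.1)

# np.linspace takes the number of points, so both endpoints are fixed explicitly.
# Rounding removes artefacts such as 0.8999999999999999.
max_df_grid = np.round(np.linspace(0.7, 1.0, num=4), 2).tolist()
min_df_grid = np.round(np.linspace(0.1, 0.4, num=4), 2).tolist()

print('arange  :', raw_grid)
print('linspace:', max_df_grid)
```

Lists built this way could be passed as the `preprocessor__text__max_df` and `preprocessor__text__min_df` entries of the search space without producing out-of-range candidates.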


In [40]:
best_SVC_BoW = rand_search.best_estimator_

y_pred = best_SVC_BoW.predict(X_test)
#y_pred_proba = best_SVC_BoW.predict_proba(X_test)
svc_bow_metrics = metrics(y_pred, y_test)
svc_bow_metrics

Unnamed: 0,Metric,Score
0,Accuracy,0.490196
1,Precision (Macro),0.379245
2,Precision (Micro),0.490196
3,Recall (Macro),0.471698
4,Recall (Micro),0.490196
5,F1 Score (Macro),0.401903
6,F1 Score (Micro),0.490196
7,ROC AUC,
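
The ROC AUC row is empty, presumably because only hard label predictions are passed to the `metrics` helper (the `predict_proba` call is commented out, and a plain `SVC` does not expose it). A hedged sketch of one way to get class probabilities from an SVM and compute a multi-class ROC AUC. The Iris data, the `random_state` values and the calibration wrapper are assumptions made for illustration, not part of the notebook:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy multi-class data standing in for the author dataset (assumption).
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Wrap the SVM in a calibrator so that predict_proba becomes available.
clf = CalibratedClassifierCV(SVC(random_state=42), cv=3)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One-vs-rest ROC AUC, macro-averaged over the classes.
auc = roc_auc_score(y_te, proba, multi_class='ovr', average='macro')
print(f'ROC AUC (OvR, macro): {auc:.3f}')
```

`SVC(probability=True)` is the built-in alternative to the explicit calibration wrapper used here.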


### tfidf + SVM

In [41]:
preprocessor = ColumnTransformer(
    transformers = [('text',TfidfVectorizer(), text_feature),
                    ('num', StandardScaler(), num_features)],
    remainder='drop'
)

# Build the Pipeline
SVC_classif = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', SVC(random_state=RANDOM_STATE))
])


# Define the search space
# NOTE: np.arange(0.7, 1.1, 0.1) can also yield max_df ≈ 1.1, which is out of
# range for a float max_df and shows up as NaN rows in the search results.
SVC_params = [
    {
     'preprocessor__text__max_df': np.arange(0.7, 1.1, 0.1),
     'preprocessor__text__min_df': np.arange(0.1, 0.4, 0.1),
     'preprocessor__text__max_features': [300, 500, 1000, 1500, 3500, 5000, 8000],
     'model__C': [0.01, 0.1, 1, 5, 10, 30, 50, 150]
    }
]

In [42]:
rand_search = RandomizedSearchCV(SVC_classif, SVC_params, n_iter=10, scoring='accuracy', cv=3, verbose=3, random_state=RANDOM_STATE)
rand_search.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END model__C=10, preprocessor__text__max_df=0.8999999999999999, preprocessor__text__max_features=500, preprocessor__text__min_df=0.4;, score=0.273 total time=  12.9s
[CV 2/3] END model__C=10, preprocessor__text__max_df=0.8999999999999999, preprocessor__text__max_features=500, preprocessor__text__min_df=0.4;, score=0.529 total time=  13.1s
[CV 3/3] END model__C=10, preprocessor__text__max_df=0.8999999999999999, preprocessor__text__max_features=500, preprocessor__text__min_df=0.4;, score=0.397 total time=  13.5s
[CV 1/3] END model__C=0.1, preprocessor__text__max_df=0.7, preprocessor__text__max_features=300, preprocessor__text__min_df=0.4;, score=0.140 total time=  12.7s
[CV 2/3] END model__C=0.1, preprocessor__text__max_df=0.7, preprocessor__text__max_features=300, preprocessor__text__min_df=0.4;, score=0.132 total time=  13.3s
[CV 3/3] END model__C=0.1, preprocessor__text__max_df=0.7, preprocessor__text__max_features=

In [43]:
print(f'Best metric value: {rand_search.best_score_}')

cv_results = pd.DataFrame(rand_search.cv_results_)[cv_result_cols].sort_values(by=['rank_test_score'])
cv_results

Best metric value: 0.39944903581267216


Unnamed: 0,param_preprocessor__text__min_df,param_preprocessor__text__max_features,param_preprocessor__text__max_df,param_model__C,mean_test_score,std_test_score,rank_test_score
0,0.4,500,0.9,10.0,0.399449,0.104611,1
8,0.2,300,0.9,50.0,0.396694,0.092768,2
9,0.2,500,0.9,150.0,0.393939,0.105045,3
3,0.1,8000,0.8,30.0,0.380165,0.094471,4
5,0.4,3500,0.9,150.0,0.380165,0.090784,4
7,0.1,5000,0.8,5.0,0.371901,0.094471,6
6,0.1,3500,0.8,1.0,0.253444,0.087028,7
1,0.4,300,0.7,0.1,0.143251,0.010308,8
2,0.3,1500,1.1,1.0,,,9
4,0.3,8000,1.1,0.1,,,9
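
The `std_test_score` column is around 0.09 to 0.10 for most candidates, which is large compared with the gaps between their mean scores, so the ranking from three folds is noisy. A minimal sketch of a repeated stratified split that averages over more fits; the Iris data, the fold counts and `C=10` are placeholders chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Toy data standing in for the author dataset (assumption).
X, y = load_iris(return_X_y=True)

# 5 stratified folds repeated 3 times with different shuffles -> 15 scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(SVC(C=10), X, y, scoring='accuracy', cv=cv)

print(f'accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} fits')
```

The same `cv` object can be passed to `RandomizedSearchCV` through its `cv` argument. Stratified splitting needs enough samples per class for the chosen number of folds, which may be a constraint with many authors and few texts each.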


In [44]:
best_SVM_tfidf = rand_search.best_estimator_

y_pred = best_SVM_tfidf.predict(X_test)
#y_pred_proba = best_SVM_tfidf.predict_proba(X_test)
SVM_tfidf_metrics = metrics(y_pred, y_test)
SVM_tfidf_metrics 

Unnamed: 0,Metric,Score
0,Accuracy,0.509804
1,Precision (Macro),0.400744
2,Precision (Micro),0.509804
3,Recall (Macro),0.472727
4,Recall (Micro),0.509804
5,F1 Score (Macro),0.416364
6,F1 Score (Micro),0.509804
7,ROC AUC,


### bigrams BoW + MultinomialNB

In [45]:
preprocessor = ColumnTransformer(
    transformers = [('text', CountVectorizer(ngram_range=(1,2)), text_feature),
                    ('num', MinMaxScaler(), num_features)],
    remainder='drop'
)

# Build the Pipeline
naive_bayes_classifier = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', MultinomialNB())
])


# Define the search space
naive_bayes_params =[
    {
     'preprocessor__text__max_df': np.arange(0.6, 0.8, 0.1),
     'preprocessor__text__min_df': np.arange(0.05, 0.4, 0.1),  
     'preprocessor__text__max_features': [500, 1000, 1500, 3500],
    }
]

In [46]:
rand_search = RandomizedSearchCV(naive_bayes_classifier, naive_bayes_params, n_iter=5, scoring='accuracy', cv=3, verbose=3, random_state=RANDOM_STATE)
rand_search.fit(X_train, y_train)

Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV 1/3] END preprocessor__text__max_df=0.7999999999999999, preprocessor__text__max_features=3500, preprocessor__text__min_df=0.25000000000000006;, score=0.711 total time=  50.5s
[CV 2/3] END preprocessor__text__max_df=0.7999999999999999, preprocessor__text__max_features=3500, preprocessor__text__min_df=0.25000000000000006;, score=0.719 total time= 1.0min
[CV 3/3] END preprocessor__text__max_df=0.7999999999999999, preprocessor__text__max_features=3500, preprocessor__text__min_df=0.25000000000000006;, score=0.702 total time= 1.1min
[CV 1/3] END preprocessor__text__max_df=0.7999999999999999, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.05;, score=0.645 total time=  49.5s
[CV 2/3] END preprocessor__text__max_df=0.7999999999999999, preprocessor__text__max_features=1500, preprocessor__text__min_df=0.05;, score=0.711 total time= 1.0min
[CV 3/3] END preprocessor__text__max_df=0.7999999999999999, preprocessor__t

In [47]:
print(f'Best metric value: {rand_search.best_score_}')
cv_result_cols = ['param_preprocessor__text__min_df', 'param_preprocessor__text__max_features', 'param_preprocessor__text__max_df', 'mean_test_score', 'std_test_score', 'rank_test_score']

cv_results = pd.DataFrame(rand_search.cv_results_)[cv_result_cols].sort_values(by=['rank_test_score'])
cv_results

Best metric value: 0.7107438016528925


Unnamed: 0,param_preprocessor__text__min_df,param_preprocessor__text__max_features,param_preprocessor__text__max_df,mean_test_score,std_test_score,rank_test_score
0,0.25,3500,0.8,0.710744,0.006748,1
3,0.15,3500,0.7,0.699725,0.014047,2
1,0.05,1500,0.8,0.672176,0.028094,3
4,0.15,1000,0.7,0.636364,0.037571,4
2,0.05,500,0.7,0.597796,0.047396,5


In [48]:
naive_bayes_BoW = rand_search.best_estimator_

y_pred = naive_bayes_BoW.predict(X_test)
#y_pred_proba = naive_bayes_BoW.predict_proba(X_test)
naive_bayes_BoW_metrics = metrics(y_pred, y_test)
naive_bayes_BoW_metrics 

Unnamed: 0,Metric,Score
0,Accuracy,0.647059
1,Precision (Macro),0.505747
2,Precision (Micro),0.647059
3,Recall (Macro),0.568966
4,Recall (Micro),0.647059
5,F1 Score (Macro),0.525862
6,F1 Score (Micro),0.647059
7,ROC AUC,
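
`MultinomialNB` expects non-negative feature values, which is presumably why the numeric columns in this pipeline go through `MinMaxScaler` rather than `StandardScaler`. The text branch can be sketched on its own as follows; the four sentences and the author labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus with two "authors" (illustrative only).
texts = [
    'the sea was calm and the sky was grey',
    'the sea was rough and the wind was cold',
    'my dear friend i write to you again',
    'my dear sister i write with good news',
]
authors = ['A', 'A', 'B', 'B']

# ngram_range=(1, 2) keeps single words and adjacent word pairs as count features.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, authors)

vocab = model.named_steps['countvectorizer'].vocabulary_
print('bigram "the sea" in vocabulary:', 'the sea' in vocab)

prediction = model.predict(['i write to my dear friend'])
print('predicted author:', prediction[0])
```

Bigrams let the classifier pick up short fixed phrases that single-word counts cannot distinguish, at the cost of a much larger vocabulary.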


### Word2Vec and GloVe embeddings with simple models

#### Word2Vec + LogReg

In [49]:
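class Word2VecVectorizer(BaseEstimator, TransformerMixin):
    """Represent each text as the mean of its words' Word2Vec vectors."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        # The Word2Vec model is trained beforehand, so there is nothing to fit
        return self

    def transform(self, X):
        vecs = []
        for text in X:
            # Keep only the words present in the Word2Vec vocabulary
            word_vecs = [self.model.wv[word] for word in text.split() if word in self.model.wv]
            # Fall back to a zero vector when no word of the text is known
            text_vector = np.mean(word_vecs, axis=0) if word_vecs else np.zeros(self.model.vector_size)
            vecs.append(text_vector)
        return np.array(vecs)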
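
The averaging step used by the vectorizer above can be illustrated without training a model. In this sketch a plain dictionary stands in for `model.wv`; the words, the 3-dimensional vectors and the name `mean_pool` are hypothetical:

```python
import numpy as np

# Stand-in for model.wv: a dict of 3-dimensional vectors (hypothetical values).
word_vectors = {
    'sea': np.array([1.0, 0.0, 0.0]),
    'sky': np.array([0.0, 1.0, 0.0]),
}
VECTOR_SIZE = 3

def mean_pool(text):
    """Average the vectors of known words; return a zero vector if none are known."""
    known = [word_vectors[w] for w in text.split() if w in word_vectors]
    return np.mean(known, axis=0) if known else np.zeros(VECTOR_SIZE)

pooled = mean_pool('sea sky unknown')   # average of 'sea' and 'sky'
empty = mean_pool('nothing known')      # no known words -> zero vector
print(pooled)
print(empty)
```

Mean pooling discards word order and weights every known word equally, which may blur the stylistic cues that authorship attribution relies on.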

In [50]:
y_train.shape

(363,)

In [51]:
model_w2v = Word2Vec(sentences=X_train['text_cleaned_words'], vector_size=100, window=5, min_count=1)

preprocessor = ColumnTransformer(
    transformers = [('text', Word2VecVectorizer(model_w2v), text_feature),
                    ('num', StandardScaler(), num_features)],
    remainder='drop'
)


log_reg_w2v = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])



log_reg_w2v.fit(X_train, y_train)

y_pred = log_reg_w2v.predict(X_test)
metrics_w2v_log_reg = metrics(y_pred, y_test)
metrics_w2v_log_reg

Unnamed: 0,Metric,Score
0,Accuracy,0.294118
1,Precision (Macro),0.166217
2,Precision (Micro),0.294118
3,Recall (Macro),0.283019
4,Recall (Micro),0.294118
5,F1 Score (Macro),0.195642
6,F1 Score (Micro),0.294118
7,ROC AUC,


#### Word2Vec + SVM

In [52]:
model_w2v = Word2Vec(sentences=X_train['text_cleaned_words'], vector_size=100, window=5, min_count=1)

preprocessor = ColumnTransformer(
    transformers = [('text', Word2VecVectorizer(model_w2v), text_feature),
                    ('num', StandardScaler(), num_features)],
    remainder='drop'
)



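# NOTE: despite its name, this pipeline wraps LogisticRegression rather than an SVM,
# so the metrics below nearly repeat the Word2Vec + LogReg run.
# Use SVC(random_state=RANDOM_STATE) as the classifier to evaluate an actual SVM.
svm_w2v = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])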



svm_w2v.fit(X_train, y_train)

y_pred = svm_w2v.predict(X_test)
metrics_svm_w2v = metrics(y_pred, y_test)
metrics_svm_w2v

Unnamed: 0,Metric,Score
0,Accuracy,0.294118
1,Precision (Macro),0.165274
2,Precision (Micro),0.294118
3,Recall (Macro),0.283019
4,Recall (Micro),0.294118
5,F1 Score (Macro),0.194654
6,F1 Score (Micro),0.294118
7,ROC AUC,


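Word2Vec performed worse than the simpler models: both pipelines reach an accuracy of 0.294, below every BoW and TF-IDF model above. Because the pipeline in the Word2Vec + SVM subsection is built with `LogisticRegression`, the two runs are effectively the same model and do not provide an independent SVM result.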

#### GloVe

In [53]:
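# Compute a vector for each text as the mean of its GloVe word vectors.
# NOTE: `for word in text` yields words only if `text` is a list of tokens;
# for a plain string it yields single characters, so call text.split() first.
def get_vector(text):
    return np.mean([glove_vectors[word] for word in text if word in glove_vectors], axis=0)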

In [54]:
# Load the GloVe vectors
glove_vectors = {}
with open('/kaggle/input/glove-text/glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype='float32')
        glove_vectors[word] = vector

scaler = StandardScaler()

X_train_glove = np.array([get_vector(text) for text in X_train['text_cleaned']])
X_test_glove = np.array([get_vector(text) for text in X_test['text_cleaned']])
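
The loading loop above relies on the GloVe text format, where each line holds a token followed by its vector components. The sketch below parses two made-up lines from an in-memory buffer and contrasts character-level and word-level lookup; the sample tokens and values are hypothetical:

```python
import io
import numpy as np

# Two made-up lines in the GloVe text format: token, then vector components.
sample = io.StringIO('sea 0.1 0.2 0.3\na 0.9 0.9 0.9\n')

vectors = {}
for line in sample:
    token, *components = line.split()
    vectors[token] = np.array(components, dtype='float32')

text = 'a sea'

# Iterating over a string yields characters: 'a', ' ', 's', 'e', 'a'.
char_hits = [ch for ch in text if ch in vectors]

# Splitting first yields words: 'a', 'sea'.
word_hits = [w for w in text.split() if w in vectors]

print('character-level matches:', char_hits)
print('word-level matches     :', word_hits)
```

If the texts are stored as plain strings, only the word-level variant actually averages word embeddings. The `glove.6B` vectors are English and lowercased, so texts in another language or casing would match few tokens.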

In [55]:
log_reg = LogisticRegression(random_state=RANDOM_STATE)
log_reg.fit(X_train_glove, y_train)

y_pred = log_reg.predict(X_test_glove)

log_reg_glove = metrics(y_pred, y_test)
log_reg_glove

Unnamed: 0,Metric,Score
0,Accuracy,0.019608
1,Precision (Macro),0.000392
2,Precision (Micro),0.019608
3,Recall (Macro),0.019608
4,Recall (Micro),0.019608
5,F1 Score (Macro),0.000769
6,F1 Score (Micro),0.019608
7,ROC AUC,


# Conclusions

In [56]:
all_metrics = pd.concat([baseline_metrics.T.loc['Score'], lexicon_based_metrics.T.loc['Score'], rule_based_metrics.T.loc['Score'], 
                        log_reg_bow_metrics.T.loc['Score'], log_reg_tfidf_metrics.T.loc['Score'], log_reg_bow_ngram_metrics.T.loc['Score'], svc_bow_metrics.T.loc['Score'],
                        SVM_tfidf_metrics.T.loc['Score'], naive_bayes_BoW_metrics.T.loc['Score'], metrics_w2v_log_reg.T.loc['Score'], metrics_svm_w2v.T.loc['Score'], log_reg_glove.T.loc['Score']], axis=1).T

all_metrics.columns = ['Accuracy', 'Precision (Macro)', 'Precision (Micro)', 'Recall (Macro)', 'Recall (Micro)', 'F1 Score (Macro)', 'F1 Score (Micro)', 'ROC AUC']
all_metrics.index = ['baseline', 'lexicon_based', 'rule_based', 'log_reg_bow', 'log_reg_tfidf', 'log_reg_bow_ngram', 'svc_bow', 'svc_tfidf', 'naive_bayes_BoW', 'log_reg_w2v',
                    'svm_w2v', 'log_reg_glove']

all_metrics.sort_values(by=['Accuracy', 'F1 Score (Macro)', 'F1 Score (Micro)'], ascending=False)

Unnamed: 0,Accuracy,Precision (Macro),Precision (Micro),Recall (Macro),Recall (Micro),F1 Score (Macro),F1 Score (Micro),ROC AUC
naive_bayes_BoW,0.647059,0.505747,0.647059,0.568966,0.647059,0.525862,0.647059,
log_reg_tfidf,0.568627,0.442949,0.568627,0.557692,0.568627,0.474359,0.568627,
lexicon_based,0.568627,0.450292,0.568627,0.508772,0.568627,0.467836,0.568627,
log_reg_bow,0.54902,0.399691,0.54902,0.518519,0.54902,0.432451,0.54902,
log_reg_bow_ngram,0.529412,0.410303,0.529412,0.490909,0.529412,0.430303,0.529412,
svc_tfidf,0.509804,0.400744,0.509804,0.472727,0.509804,0.416364,0.509804,
svc_bow,0.490196,0.379245,0.490196,0.471698,0.490196,0.401903,0.490196,
log_reg_w2v,0.294118,0.166217,0.294118,0.283019,0.294118,0.195642,0.294118,
svm_w2v,0.294118,0.165274,0.294118,0.283019,0.294118,0.194654,0.294118,
rule_based,0.058824,0.032308,0.058824,0.046154,0.058824,0.033566,0.058824,
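
The summary table is assembled by concatenating the metric tables in one order and assigning row labels from a separate list, so the two can drift apart. A hedged alternative sketch keeps each label next to its table in a dictionary. The `toy_metrics` helper, the model names and the scores are made up, and the tables are only assumed to have `Metric` and `Score` columns like the outputs above:

```python
import pandas as pd

def toy_metrics(accuracy, f1_macro):
    """Build a hypothetical metrics table with 'Metric' and 'Score' columns."""
    return pd.DataFrame({
        'Metric': ['Accuracy', 'F1 Score (Macro)'],
        'Score': [accuracy, f1_macro],
    })

# Each label is stored next to its table, so they cannot get out of sync.
results = {
    'model_b': toy_metrics(0.6, 0.5),
    'model_a': toy_metrics(0.8, 0.7),
}

# One row per approach, one column per metric, best accuracy first.
summary = pd.DataFrame(
    {name: table.set_index('Metric')['Score'] for name, table in results.items()}
).T.sort_values(by='Accuracy', ascending=False)

print(summary)
```

Column names then come from the `Metric` values themselves instead of a hand-written list.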
