<h1 style="text-align: center;font-size: 30px">Classifying reviews using only text embeddings</h1>

# Contents
### 0. [Importing and checking the data](#chapter0)
### 1. [Computing all candidate embeddings](#chapter1)
#### 1.1. [TF-IDF](#chapter1.1)
#### 1.2. [Word2Vec](#chapter1.2)
### 2. [Trying different models](#chapter2)
#### 2.1. [Logistic regression](#chapter2.1)
#### 2.2. [Naive Bayes classifier](#chapter2.2)
#### 2.3. [Decision tree](#chapter2.3)
#### 2.4. [SVM](#chapter2.4)

<center id="chapter0"><h1 style="font-size: 24px"> 0. Importing the data. </h1></center>

In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.metrics import (
    f1_score, 
    accuracy_score,
    classification_report, 
)

In [3]:
X_train_df = pd.read_csv("../../data/fully-cleaned/text_train_cleaned.csv", index_col=0)
y_train_df = pd.read_csv("../../data/processed/train/target_train_df.csv", index_col=0)

X_val_df = pd.read_csv("../../data/fully-cleaned/text_val_cleaned.csv", index_col=0)
y_val_df = pd.read_csv("../../data/processed/val/target_val_df.csv", index_col=0)

In [4]:
X_train_df = X_train_df.fillna('')
X_val_df = X_val_df.fillna('')

In [5]:
# DataFrame.info() prints its summary and returns None, so display() is not needed.
X_train_df.info()
X_val_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4161 entries, 0 to 4160
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    4161 non-null   object
dtypes: object(1)
memory usage: 65.0+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 462 entries, 4161 to 4622
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    462 non-null    object
dtypes: object(1)
memory usage: 7.2+ KB

In [6]:
X_train = X_train_df['text']
y_train = y_train_df.values

X_val = X_val_df['text']
y_val = y_val_df.values

<center id="chapter1"><h1 style="font-size: 24px"> 1. Computing different embeddings. </h1></center>

## 1.1. TF-IDF <a id="chapter1.1"></a>

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
tfidf_10000 = TfidfVectorizer(max_features=10000)
tfidf_8000 = TfidfVectorizer(max_features=8000)
tfidf_5000 = TfidfVectorizer(max_features=5000)
tfidf_2500 = TfidfVectorizer(max_features=2500)

In [10]:
X_train_tfidf_10000 = tfidf_10000.fit_transform(X_train).toarray()
X_val_tfidf_10000 = tfidf_10000.transform(X_val).toarray()

X_train_tfidf_8000 = tfidf_8000.fit_transform(X_train).toarray()
X_val_tfidf_8000 = tfidf_8000.transform(X_val).toarray()

X_train_tfidf_5000 = tfidf_5000.fit_transform(X_train).toarray()
X_val_tfidf_5000 = tfidf_5000.transform(X_val).toarray()

X_train_tfidf_2500 = tfidf_2500.fit_transform(X_train).toarray()
X_val_tfidf_2500 = tfidf_2500.transform(X_val).toarray()
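
`max_features` only caps the vocabulary: when the training corpus has fewer distinct terms than the cap, vectorizers with different caps end up with the same vocabulary and produce the same features. A minimal sketch of that check on a tiny made-up corpus (the `corpus` list and the variable names are illustrative assumptions, not the review data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); it contains only 5 distinct terms.
corpus = ["good phone good camera", "bad battery", "good battery"]

cap_3 = TfidfVectorizer(max_features=3).fit(corpus)
cap_8000 = TfidfVectorizer(max_features=8000).fit(corpus)
cap_10000 = TfidfVectorizer(max_features=10000).fit(corpus)

# The cap is only binding when it is below the number of distinct terms.
vocab_sizes = {
    "cap_3": len(cap_3.vocabulary_),
    "cap_8000": len(cap_8000.vocabulary_),
    "cap_10000": len(cap_10000.vocabulary_),
}
print(vocab_sizes)
```

In this notebook the equivalent check would be `len(tfidf_10000.vocabulary_)` compared against the other three vectorizers.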

In [11]:
def try_model_with_tfidf(model, model_name: str = ""):
    """Fit `model` on each TF-IDF feature set and print train/validation accuracy."""
    # Feature matrices and targets are read from the notebook's global scope.
    feature_sets = {
        "tfidf_10000": (X_train_tfidf_10000, X_val_tfidf_10000),
        "tfidf_8000": (X_train_tfidf_8000, X_val_tfidf_8000),
        "tfidf_5000": (X_train_tfidf_5000, X_val_tfidf_5000),
        "tfidf_2500": (X_train_tfidf_2500, X_val_tfidf_2500),
    }

    for name, (X_tr, X_v) in feature_sets.items():
        print(f"-----------------Results for {model_name} + {name}-----------------")
        # The same estimator object is refit on every feature set.
        model.fit(X_tr, y_train)
        train_prediction = model.predict(X_tr)
        val_prediction = model.predict(X_v)
        print("Train accuracy: ", accuracy_score(y_train, train_prediction))
        print("Val accuracy: ", accuracy_score(y_val, val_prediction))
        print()
        print()
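
When the targets form a 2-D array of binary indicators (which `MultiOutputClassifier` on `y_train_df.values` suggests, though that is an assumption about the data), `accuracy_score` reports subset accuracy: a sample counts as correct only if every one of its labels is predicted correctly. The sketch below uses made-up arrays to compare that number with per-label accuracy and micro-averaged F1 (`f1_score` is imported at the top of the notebook but not used):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up binary indicator targets: 4 samples, 3 labels (illustrative only).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],   # one label wrong
                   [0, 1, 0],   # exact match
                   [1, 0, 0],   # one label wrong
                   [0, 0, 1]])  # exact match

subset_acc = accuracy_score(y_true, y_pred)            # share of rows matched exactly
per_label_acc = (y_true == y_pred).mean()              # share of individual cells matched
micro_f1 = f1_score(y_true, y_pred, average="micro")   # F1 pooled over all labels

print(subset_acc, per_label_acc, micro_f1)
```

Subset accuracy is the strictest of the three, so low values from the helper above do not by themselves mean that most individual labels are wrong.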

## 1.2. Word2Vec <a id="chapter1.2"></a>

In [12]:
X_train_split = [sentence.split() for sentence in X_train.values]
X_val_split = [sentence.split() for sentence in X_val.values]

In [13]:
import gensim

In [14]:
w2v_50 = gensim.models.Word2Vec(
    sentences=X_train_split, vector_size=50, window=5, min_count=2, workers=4
)

w2v_100 = gensim.models.Word2Vec(
    sentences=X_train_split, vector_size=100, window=5, min_count=2, workers=4
)

w2v_150 = gensim.models.Word2Vec(
    sentences=X_train_split, vector_size=150, window=5, min_count=2, workers=4
)

w2v_200 = gensim.models.Word2Vec(
    sentences=X_train_split, vector_size=200, window=5, min_count=2, workers=4
)

In [15]:
def vectorize_text(text_data, word2vec_model):
    """
    Compute one embedding per text by averaging its word vectors.
        text_data: collection of tokenized texts (lists of words).
        word2vec_model: trained model that provides word embeddings.
    """
    # Zero vector used for empty texts and for out-of-vocabulary words.
    no_vector = np.zeros(word2vec_model.vector_size, dtype=np.float32)
    vectors = []
    for sentence in text_data:
        if len(sentence) == 0:
            vectors.append(no_vector)
        else:
            # Out-of-vocabulary words contribute zeros, which pulls the mean toward zero.
            vector = np.mean(
                [word2vec_model.wv[word] if word in word2vec_model.wv else no_vector for word in sentence],
                axis=0,
            )
            vectors.append(vector)
    return np.array(vectors)
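
`vectorize_text` averages over every token, substituting a zero vector for words the model has not seen, so texts with many unknown words get embeddings shrunk toward zero. The sketch below illustrates the effect with plain NumPy; the `toy_vectors` dictionary is a hypothetical stand-in for `word2vec_model.wv`, and averaging only the known words is shown as one possible alternative, not as the method used in this notebook:

```python
import numpy as np

# Hypothetical 2-dimensional word vectors standing in for `word2vec_model.wv`.
toy_vectors = {
    "good": np.array([1.0, 0.0], dtype=np.float32),
    "phone": np.array([0.0, 1.0], dtype=np.float32),
}
dim = 2
tokens = ["good", "phone", "zzz", "qqq"]  # the last two are out of vocabulary

zero = np.zeros(dim, dtype=np.float32)

# Variant used in vectorize_text: unknown words contribute a zero vector.
mean_with_zeros = np.mean([toy_vectors.get(t, zero) for t in tokens], axis=0)

# Alternative: average only the known words (zero vector if none are known).
known = [toy_vectors[t] for t in tokens if t in toy_vectors]
mean_known_only = np.mean(known, axis=0) if known else zero

print(mean_with_zeros)
print(mean_known_only)
```

With `min_count=2` and a small training corpus, rare words are dropped from the vocabulary, so the difference between the two variants can matter for short reviews.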

In [16]:
X_train_w2v_50 = vectorize_text(X_train_split, w2v_50)
X_val_w2v_50 = vectorize_text(X_val_split, w2v_50)

X_train_w2v_100 = vectorize_text(X_train_split, w2v_100)
X_val_w2v_100 = vectorize_text(X_val_split, w2v_100)

X_train_w2v_150 = vectorize_text(X_train_split, w2v_150)
X_val_w2v_150 = vectorize_text(X_val_split, w2v_150)

X_train_w2v_200 = vectorize_text(X_train_split, w2v_200)
X_val_w2v_200 = vectorize_text(X_val_split, w2v_200)

In [17]:
def try_model_with_word2vec(model, model_name: str = ""):
    """Fit `model` on each Word2Vec feature set and print train/validation accuracy."""
    # Feature matrices and targets are read from the notebook's global scope.
    feature_sets = {
        "w2v_50": (X_train_w2v_50, X_val_w2v_50),
        "w2v_100": (X_train_w2v_100, X_val_w2v_100),
        "w2v_150": (X_train_w2v_150, X_val_w2v_150),
        "w2v_200": (X_train_w2v_200, X_val_w2v_200),
    }

    for name, (X_tr, X_v) in feature_sets.items():
        print(f"-----------------Results for {model_name} + {name}-----------------")
        # The same estimator object is refit on every feature set.
        model.fit(X_tr, y_train)
        train_prediction = model.predict(X_tr)
        val_prediction = model.predict(X_v)
        print("Train accuracy: ", accuracy_score(y_train, train_prediction))
        print("Val accuracy: ", accuracy_score(y_val, val_prediction))
        print()
        print()

<center id="chapter2"><h1 style="font-size: 24px"> 2. Trying different models. </h1></center>

In [18]:
from sklearn.multioutput import MultiOutputClassifier

## 2.1. Logistic regression <a id="chapter2.1"></a>

In [19]:
from sklearn.linear_model import LogisticRegression

In [20]:
log_clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
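
`MultiOutputClassifier` fits one independent copy of the base estimator per target column and stacks their predictions. A small sketch on synthetic data (the arrays `X_demo` and `Y_demo` are made up for illustration and are unrelated to the review dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic data: 40 samples, 5 features, 3 binary target columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
Y_demo = np.column_stack(
    [X_demo[:, 0] > 0, X_demo[:, 1] > 0, X_demo[:, 2] > 0]
).astype(int)

demo_clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, Y_demo)

n_estimators = len(demo_clf.estimators_)    # one fitted model per target column
pred_shape = demo_clf.predict(X_demo).shape  # predictions keep the (samples, targets) layout

print(n_estimators, pred_shape)
```

Because the per-column models are independent, correlations between labels are not used by this wrapper.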

In [21]:
try_model_with_tfidf(log_clf, "LogReg {max_iter=1000}")

-----------------Results for LogReg {max_iter=1000} + tfidf_10000-----------------
Train accuracy:  0.2989665945686133
Val accuracy:  0.22943722943722944


-----------------Results for LogReg {max_iter=1000} + tfidf_8000-----------------
Train accuracy:  0.2989665945686133
Val accuracy:  0.22943722943722944


-----------------Results for LogReg {max_iter=1000} + tfidf_5000-----------------
Train accuracy:  0.2989665945686133
Val accuracy:  0.22943722943722944


-----------------Results for LogReg {max_iter=1000} + tfidf_2500-----------------
Train accuracy:  0.30545541937034365
Val accuracy:  0.23809523809523808




In [24]:
log_clf = MultiOutputClassifier(LogisticRegression(max_iter=100, penalty='l2', solver='liblinear'))
try_model_with_tfidf(log_clf, "L2 LogReg {max_iter=100}")

-----------------Results for L2 LogReg {max_iter=100} + tfidf_10000-----------------
Train accuracy:  0.26508050949291034
Val accuracy:  0.2012987012987013


-----------------Results for L2 LogReg {max_iter=100} + tfidf_8000-----------------
Train accuracy:  0.26508050949291034
Val accuracy:  0.2012987012987013


-----------------Results for L2 LogReg {max_iter=100} + tfidf_5000-----------------
Train accuracy:  0.27180966113914923
Val accuracy:  0.19913419913419914


-----------------Results for L2 LogReg {max_iter=100} + tfidf_2500-----------------
Train accuracy:  0.27950012016342224
Val accuracy:  0.2077922077922078




In [53]:
log_clf = MultiOutputClassifier(LogisticRegression(max_iter=500))
try_model_with_word2vec(log_clf, "LogReg {max_iter=500}")

-----------------Results for LogReg {max_iter=500} + w2v_50-----------------
Train accuracy:  0.03148281663061764
Val accuracy:  0.023809523809523808


-----------------Results for LogReg {max_iter=500} + w2v_100-----------------
Train accuracy:  0.020427781783225185
Val accuracy:  0.017316017316017316


-----------------Results for LogReg {max_iter=500} + w2v_150-----------------
Train accuracy:  0.009853400624849795
Val accuracy:  0.010822510822510822


-----------------Results for LogReg {max_iter=500} + w2v_200-----------------
Train accuracy:  0.0
Val accuracy:  0.0




## 2.2. Naive Bayes <a id="chapter2.2"></a>

In [22]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, CategoricalNB, ComplementNB

In [31]:
gaus_nb = MultiOutputClassifier(GaussianNB())
try_model_with_tfidf(gaus_nb, "Naive Bayes - Gaussian")

-----------------Results for Naive Bayes - Gaussian + tfidf_10000-----------------
Train accuracy:  0.565969718817592
Val accuracy:  0.03463203463203463


-----------------Results for Naive Bayes - Gaussian + tfidf_8000-----------------
Train accuracy:  0.565969718817592
Val accuracy:  0.03463203463203463


-----------------Results for Naive Bayes - Gaussian + tfidf_5000-----------------
Train accuracy:  0.49170872386445563
Val accuracy:  0.023809523809523808


-----------------Results for Naive Bayes - Gaussian + tfidf_2500-----------------
Train accuracy:  0.34895457822638787
Val accuracy:  0.025974025974025976




In [33]:
multi_nb = MultiOutputClassifier(MultinomialNB())
try_model_with_tfidf(multi_nb, "Naive Bayes - Multinomial")

-----------------Results for Naive Bayes - Multinomial + tfidf_10000-----------------
Train accuracy:  0.08435472242249459
Val accuracy:  0.047619047619047616


-----------------Results for Naive Bayes - Multinomial + tfidf_8000-----------------
Train accuracy:  0.08435472242249459
Val accuracy:  0.047619047619047616


-----------------Results for Naive Bayes - Multinomial + tfidf_5000-----------------
Train accuracy:  0.10430185051670271
Val accuracy:  0.06060606060606061


-----------------Results for Naive Bayes - Multinomial + tfidf_2500-----------------
Train accuracy:  0.1514059120403749
Val accuracy:  0.09956709956709957




In [23]:
com_nb = MultiOutputClassifier(ComplementNB())
try_model_with_tfidf(com_nb, "Naive Bayes - Complement")

-----------------Results for Naive Bayes - Complement + tfidf_10000-----------------
Train accuracy:  0.3821196827685652
Val accuracy:  0.21428571428571427


-----------------Results for Naive Bayes - Complement + tfidf_8000-----------------
Train accuracy:  0.3821196827685652
Val accuracy:  0.21428571428571427


-----------------Results for Naive Bayes - Complement + tfidf_5000-----------------
Train accuracy:  0.3821196827685652
Val accuracy:  0.21428571428571427


-----------------Results for Naive Bayes - Complement + tfidf_2500-----------------
Train accuracy:  0.3491949050708964
Val accuracy:  0.21645021645021645




In [60]:
gaus_nb = MultiOutputClassifier(GaussianNB())
try_model_with_word2vec(gaus_nb, "Naive Bayes - Gaussian")

-----------------Results for Naive Bayes - Gaussian + w2v_50-----------------
Train accuracy:  0.006488824801730353
Val accuracy:  0.015151515151515152


-----------------Results for Naive Bayes - Gaussian + w2v_100-----------------
Train accuracy:  0.001201634222542658
Val accuracy:  0.0021645021645021645


-----------------Results for Naive Bayes - Gaussian + w2v_150-----------------
Train accuracy:  0.0007209805335255948
Val accuracy:  0.0021645021645021645


-----------------Results for Naive Bayes - Gaussian + w2v_200-----------------
Train accuracy:  0.0
Val accuracy:  0.0




## 2.3. Decision tree <a id="chapter2.3"></a>

In [24]:
from sklearn.tree import DecisionTreeClassifier

In [25]:
tree = MultiOutputClassifier(DecisionTreeClassifier(min_samples_leaf=3, max_depth=50))
try_model_with_tfidf(tree, "Decision tree {min_samples_leaf=3, max_depth=50}")

-----------------Results for Decision tree {min_samples_leaf=3, max_depth=50} + tfidf_10000-----------------
Train accuracy:  0.5722182167748138
Val accuracy:  0.29004329004329005


-----------------Results for Decision tree {min_samples_leaf=3, max_depth=50} + tfidf_8000-----------------
Train accuracy:  0.5726988704638308
Val accuracy:  0.2878787878787879


-----------------Results for Decision tree {min_samples_leaf=3, max_depth=50} + tfidf_5000-----------------
Train accuracy:  0.5729391973083393
Val accuracy:  0.29004329004329005


-----------------Results for Decision tree {min_samples_leaf=3, max_depth=50} + tfidf_2500-----------------
Train accuracy:  0.5794280221100697
Val accuracy:  0.30303030303030304




In [28]:
tree = MultiOutputClassifier(DecisionTreeClassifier(max_depth=35, min_samples_leaf=5))
try_model_with_tfidf(tree, "Decision tree {max_depth=35, min_samples_leaf=5}")

-----------------Results for Decision tree {max_depth=35, min_samples_leaf=5} + tfidf_10000-----------------
Train accuracy:  0.4784907474164864
Val accuracy:  0.2987012987012987


-----------------Results for Decision tree {max_depth=35, min_samples_leaf=5} + tfidf_8000-----------------
Train accuracy:  0.4763278058159096
Val accuracy:  0.3008658008658009


-----------------Results for Decision tree {max_depth=35, min_samples_leaf=5} + tfidf_5000-----------------
Train accuracy:  0.4784907474164864
Val accuracy:  0.2943722943722944


-----------------Results for Decision tree {max_depth=35, min_samples_leaf=5} + tfidf_2500-----------------
Train accuracy:  0.483297284306657
Val accuracy:  0.3246753246753247




In [26]:
tree = MultiOutputClassifier(DecisionTreeClassifier(max_depth=20, min_samples_leaf=5, max_leaf_nodes=100))
try_model_with_tfidf(tree, "Decision tree {max_depth=20, min_samples_leaf=5, max_leaf_nodes=100}")

-----------------Results for Decision tree {max_depth=20, min_samples_leaf=5, max_leaf_nodes=100} + tfidf_10000-----------------
Train accuracy:  0.47368421052631576
Val accuracy:  0.3051948051948052


-----------------Results for Decision tree {max_depth=20, min_samples_leaf=5, max_leaf_nodes=100} + tfidf_8000-----------------
Train accuracy:  0.4710406152367219
Val accuracy:  0.3051948051948052


-----------------Results for Decision tree {max_depth=20, min_samples_leaf=5, max_leaf_nodes=100} + tfidf_5000-----------------
Train accuracy:  0.4732035568372987
Val accuracy:  0.30303030303030304


-----------------Results for Decision tree {max_depth=20, min_samples_leaf=5, max_leaf_nodes=100} + tfidf_2500-----------------
Train accuracy:  0.47969238163902905
Val accuracy:  0.3181818181818182




In [27]:
tree = MultiOutputClassifier(DecisionTreeClassifier(max_depth=35, min_samples_leaf=5))
try_model_with_word2vec(tree, "Decision tree {max_depth=35, min_samples_leaf=5}")

-----------------Results for Decision tree {max_depth=35, min_samples_leaf=5} + w2v_50-----------------
Train accuracy:  0.4768084595049267
Val accuracy:  0.15584415584415584


-----------------Results for Decision tree {max_depth=35, min_samples_leaf=5} + w2v_100-----------------
Train accuracy:  0.4890651285748618
Val accuracy:  0.14935064935064934


-----------------Results for Decision tree {max_depth=35, min_samples_leaf=5} + w2v_150-----------------
Train accuracy:  0.5039653929343908
Val accuracy:  0.1471861471861472


-----------------Results for Decision tree {max_depth=35, min_samples_leaf=5} + w2v_200-----------------
Train accuracy:  0.4861812064407594
Val accuracy:  0.14935064935064934




## 2.4. SVM <a id="chapter2.4"></a>

In [47]:
from sklearn.svm import SVC

In [48]:
svc = MultiOutputClassifier(SVC())
try_model_with_word2vec(svc, "SVC")

-----------------Results for SVC + w2v_50-----------------
Train accuracy:  0.031242489786109107
Val accuracy:  0.02813852813852814


-----------------Results for SVC + w2v_100-----------------
Train accuracy:  0.029560201874549386
Val accuracy:  0.02813852813852814


-----------------Results for SVC + w2v_150-----------------
Train accuracy:  0.029560201874549386
Val accuracy:  0.02813852813852814


-----------------Results for SVC + w2v_200-----------------
Train accuracy:  0.02859889449651526
Val accuracy:  0.02813852813852814


