# Project for Wikishop

**Task**

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation. The goal is to train a model that classifies comments as toxic or non-toxic. A dataset labeled for the toxicity of edits is available.

The *F1* quality metric must be at least 0.75.
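
For reference, F1 is the harmonic mean of precision and recall for the positive (here, toxic) class. The sketch below computes it from raw labels in plain Python. The function name `f1_from_labels` and the toy labels are illustrative assumptions; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # Without true positives, precision or recall is zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up for illustration): 1 = toxic, 0 = not toxic
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 4))
```

Because the toxic class is a minority, F1 is more informative here than accuracy: a model that never flags anything would be accurate on most comments but would score 0 on F1.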

**Work plan:**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.


**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

# Data preparation

In [1]:
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, Pool
import re
import spacy
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

In [3]:
try:
    df = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    # Fall back to a local copy when the hosted path is unavailable
    df = pd.read_csv(r"C:\Users\Markm\Downloads\toxic_comments.csv")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


In [6]:
# Take a random sample of 50,000 rows (no random_state is set, so the sample differs between runs)
df = df.sample(50000).reset_index(drop=True)

In [7]:
# Drop the uninformative column
df = df.drop('Unnamed: 0', axis=1)

In [8]:
text = list(df['text'])

In [9]:
text

['"\n\n ""iPhone"" trademark dispute \n\nInteresting that this article is completely devoid of the battle over the iPhone name.\n\nApple, Cisco settle iPhone trademark lawsuit CNET.com (February 21, 2007 6:45 PM PST)\nCisco and Apple Agreement on IOS Trademark, Cisco.com (June 7, 2010 at 12:00 pm PST)\nCisco lost rights to iPhone trademark last year, experts say ZDNet.com (January 12, 2007  12:35 GMT (04:35 PST))\netc.   "',
 "Tartaglia's formula \n\nIn Tetrahedron#Volume it is stated that Tartaglia's formula is is essentially due to the painter Piero della Francesca in the 15th century. If this is the case, it should be also stated in the Niccolò Fontana Tartaglia article.",
 '"\n\nLOL. I admire your sense of humour.  Happy editing.talk "',
 'Please add to this page Concert Hall Hamalir.',
 ', LIZA Minelli.  And why on earth does a lowly software programmer like gwernob have a page?  I will be nominating it for deletion, as you are IRRELEVANT outside of your Wikiworld.',
 '"\n\n Popul

In [10]:
# Text-cleaning function based on a regular expression
def clear_text(text):
    text_cleared = []
    for elem in text:
        # Replace everything except ASCII letters with spaces
        letters_only = re.sub(r'[^a-zA-Z]', ' ', str(elem))
        # Collapse repeated whitespace and lowercase the result
        text_cleared.append(" ".join(letters_only.split()).lower())
    return text_cleared

In [11]:
text_cleared = clear_text(text)

In [12]:
text_cleared

['iphone trademark dispute interesting that this article is completely devoid of the battle over the iphone name apple cisco settle iphone trademark lawsuit cnet com february pm pst cisco and apple agreement on ios trademark cisco com june at pm pst cisco lost rights to iphone trademark last year experts say zdnet com january gmt pst etc',
 'tartaglia s formula in tetrahedron volume it is stated that tartaglia s formula is is essentially due to the painter piero della francesca in the th century if this is the case it should be also stated in the niccol fontana tartaglia article',
 'lol i admire your sense of humour happy editing talk',
 'please add to this page concert hall hamalir',
 'liza minelli and why on earth does a lowly software programmer like gwernob have a page i will be nominating it for deletion as you are irrelevant outside of your wikiworld',
 'popular culture a few months ago i performed a crapectomy by removing the entire popular culture section it s a stupid thing to
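
The cleaned output above contains fragments such as `niccol` and `th century`: the pattern `[^a-zA-Z]` replaces every character outside the ASCII letters, including digits and accented letters, with a space. A minimal single-string sketch of the same step (the helper name `clean_one` and the sample string are illustrative, not part of the notebook):

```python
import re


def clean_one(text):
    """Same idea as the cleaning function above, applied to a single string."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(letters_only.split()).lower()


sample = "Niccolò Fontana Tartaglia, 15th century"
cleaned = clean_one(sample)
print(cleaned)  # niccol fontana tartaglia th century
```

For an English-only toxicity classifier this loss is probably acceptable, but it is worth keeping in mind when inspecting odd tokens in the vocabulary.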

In [13]:
nlp = spacy.load('en_core_web_lg')

In [14]:
# Lemmatization function based on the spaCy library
def lemmatize(text):
    lemmatize_text = []
    for elem in text:
        doc = nlp(elem)
        # Join the lemmas of all tokens back into a single string
        lemmas = ' '.join([token.lemma_ for token in doc])
        lemmatize_text.append(lemmas)
    return lemmatize_text

In [15]:
lem_text = lemmatize(text_cleared)

In [16]:
df['text'] = pd.Series(lem_text)

In [17]:
text_col = ['text']

**Conclusions**
1. The data has been cleaned and lemmatized.

# Training

In [18]:
X_series = df['text']
X_dataframe = df[['text']]
 
y = df['toxic']

In [19]:
features_train, features_test, target_train, target_test = train_test_split(
    X_series, y, test_size=0.5, random_state=42, stratify=y
)

In [20]:
features_test.shape

(25000,)

In [21]:
target_test.shape

(25000,)
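
The split above passes `stratify=y`, which keeps the share of toxic comments roughly equal in the training and test halves. The plain-Python sketch below illustrates the idea only; it is not scikit-learn's implementation, and the function name `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.5, seed=42):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut each class at the same ratio
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (made up): 10% positive class
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_share, test_share)
```

With an imbalanced target, an unstratified split can leave one half with noticeably fewer positive examples, which makes F1 on the test set harder to compare with the cross-validation score.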

In [23]:
# Build a pipeline with TfidfVectorizer and logistic regression
model_pipe = Pipeline(
    [
        (
            'vect',
            TfidfVectorizer()
        ),
       
        (
            'clf',
            LogisticRegression(
                random_state=42
            )
        )
    ])
 
 
# Quick fit on the first 100 rows to check that the pipeline runs
model_pipe.fit(features_train.head(100), target_train.head(100))
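
To make the `vect` step less of a black box, here is a small plain-Python sketch of TF-IDF weighting. It uses the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization that scikit-learn documents for `TfidfVectorizer` defaults. Tokenization is simplified to whitespace splitting, and the function name `tfidf` and the toy corpus are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalization per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Toy corpus (made up for illustration)
docs = ["please add this page", "this page be stupid", "happy editing"]
vectors = tfidf(docs)
print(vectors[0])
```

Terms that occur in fewer documents (here `please`) get a larger weight than terms shared across documents (`this`, `page`), which is what lets a linear model pick up on rare but telling words.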

In [24]:
# Hyperparameter search space for logistic regression
params = [{
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'clf': [LogisticRegression(random_state=42)],
    'clf__C' : [0.1, 10, 20],
    'clf__class_weight': [None, 'balanced']
    }
]
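
The `class_weight='balanced'` option in the grid above reweights classes inversely to their frequency, using `n_samples / (n_classes * class_count)` as described in the scikit-learn documentation. A minimal sketch of that formula in plain Python (the function name `balanced_class_weights` and the toy labels are illustrative):

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight for each class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (made up): 90 non-toxic, 10 toxic
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

In this toy case an error on a toxic comment costs nine times as much as an error on a non-toxic one, which typically trades some precision for recall on the minority class.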

In [25]:
grid = GridSearchCV(
        model_pipe,
        params,
        cv=4,
        n_jobs=-1,
        scoring='f1',
        error_score='raise',
    )
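
A quick way to estimate the cost of this search is to count the model fits. The sketch below copies the values from the parameter grid above (the `clf` entry has a single option, so it does not add combinations) and multiplies by the number of folds:

```python
from itertools import product

# Values copied from the parameter grid above
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'clf__C': [0.1, 10, 20],
    'clf__class_weight': [None, 'balanced'],
}
cv_folds = 4

combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds
print(len(combinations), n_fits)  # 18 72
```

With the default `refit=True`, `GridSearchCV` also performs one final fit of the best combination on the full training set.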

In [26]:
grid.fit(features_train, target_train)
 
display(grid)
display(grid.best_estimator_)
print(grid.best_params_)
print(grid.best_score_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


{'clf': LogisticRegression(C=20, class_weight='balanced', random_state=42), 'clf__C': 20, 'clf__class_weight': 'balanced', 'vect__ngram_range': (1, 1)}
0.7608036854203879


In [27]:
pred_log = grid.best_estimator_.predict(features_test)

In [28]:
f1 = f1_score(target_test, pred_log)
f1

0.7515705311250716

Let's try the CatBoostClassifier model.

In [29]:
model = CatBoostClassifier(random_state = 42, task_type = 'CPU')

In [30]:
features_test = features_test.to_numpy()

In [31]:
# Create the training and test pools
train_data = Pool(features_train.to_numpy(), target_train, text_features =list(range(len(text_col))))
test_data = Pool(features_test, text_features =list(range(len(text_col))))

In [32]:
model.fit(train_data)

Learning rate set to 0.040724
0:	learn: 0.6411051	total: 192ms	remaining: 3m 12s
1:	learn: 0.5962636	total: 262ms	remaining: 2m 10s
2:	learn: 0.5510935	total: 331ms	remaining: 1m 49s
3:	learn: 0.5151743	total: 395ms	remaining: 1m 38s
4:	learn: 0.4785857	total: 471ms	remaining: 1m 33s
5:	learn: 0.4486973	total: 534ms	remaining: 1m 28s
6:	learn: 0.4230400	total: 602ms	remaining: 1m 25s
7:	learn: 0.4006791	total: 665ms	remaining: 1m 22s
8:	learn: 0.3809972	total: 750ms	remaining: 1m 22s
9:	learn: 0.3597530	total: 839ms	remaining: 1m 23s
10:	learn: 0.3425156	total: 924ms	remaining: 1m 23s
11:	learn: 0.3279214	total: 1.01s	remaining: 1m 23s
12:	learn: 0.3137785	total: 1.1s	remaining: 1m 23s
13:	learn: 0.3000420	total: 1.19s	remaining: 1m 23s
14:	learn: 0.2881863	total: 1.27s	remaining: 1m 23s
15:	learn: 0.2782343	total: 1.36s	remaining: 1m 23s
16:	learn: 0.2676722	total: 1.45s	remaining: 1m 23s
17:	learn: 0.2599248	total: 1.53s	remaining: 1m 23s
18:	learn: 0.2521567	total: 1.62s	remaining: 

<catboost.core.CatBoostClassifier at 0x1d55b087fd0>

In [33]:
preds_class = model.predict(test_data)

In [34]:
preds_class

array([0, 0, 0, ..., 1, 0, 0], dtype=int64)

In [35]:
f1 = f1_score(target_test, preds_class)
f1

0.7664041994750657

# Overall conclusion

1. Two models were trained: LogisticRegression (in a pipeline with TfidfVectorizer()) and CatBoostClassifier.
2. The best LogisticRegression parameters are C=20, class_weight='balanced', random_state=42, with ngram_range (1, 1) for TfidfVectorizer. The F1 score is 0.7516 on the test data (0.7608 in cross-validation).
3. The F1 score of CatBoostClassifier without hyperparameter tuning is 0.7664 on the test data.
4. Both models meet the task requirement (F1 of at least 0.75), but CatBoostClassifier scores slightly better even without hyperparameter tuning.