<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for "Wikishop"

The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.
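
For reference, F1 is the harmonic mean of precision and recall on the positive (toxic) class. The following is a minimal sketch that computes it by hand; the helper name `f1_binary` and the label lists are made up for illustration and are not taken from the dataset.

```python
def f1_binary(y_true, y_pred, positive=1):
    """Compute F1 for the positive class from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no true positives: treat F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only: 1 = toxic, 0 = non-toxic
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

score = f1_binary(y_true, y_pred)
print(score)
```

Here three of the four toxic comments are found and one clean comment is flagged by mistake, so precision and recall are both 3/4 and the F1 score sits exactly at the project threshold.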

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [1]:
import os
import re

import nltk
import pandas as pd
import spacy
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

In [3]:
import warnings
# Suppress warnings
warnings.filterwarnings("ignore")

# Show full column contents without truncation
pd.set_option('display.max_colwidth', None)

In [4]:
def upload(pth):
    # Read the CSV once, using the first column as the index
    if os.path.exists(pth):
        return pd.read_csv(pth, index_col=0)
    # Fail loudly instead of silently returning None
    raise FileNotFoundError(f'File not found: {pth}')

In [5]:
pth = '/datasets/toxic_comments.csv'
df = upload(pth)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


In [7]:
df.head()

Unnamed: 0,text,toxic
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0
4,"You, sir, are my hero. Any chance you remember what page that's on?",0


In [8]:
df['text'].duplicated().sum()

0

In [9]:
df['toxic'].unique()

array([0, 1])

In [10]:
df['toxic'].value_counts() 

0    143106
1     16186
Name: toxic, dtype: int64

In [11]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
lemmatizer = WordNetLemmatizer()

In [13]:
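def lemmatize(text):
    # Split the text into tokens
    tokens = nltk.word_tokenize(text)
    # Drop stop words (case-sensitive match, so capitalized forms such as "They" are kept)
    tokens = [word for word in tokens if word not in stopwords]
    # Lemmatize each token (WordNet treats it as a noun unless a POS tag is passed)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)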

In [14]:
df['lemmatized'] = df['text'].apply(lemmatize)

In [15]:
df.head()

Unnamed: 0,text,toxic,lemmatized
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,"Explanation Why edits made username Hardcore Metallica Fan reverted ? They n't vandalism , closure GAs I voted New York Dolls FAC . And please n't remove template talk page since I 'm retired now.89.205.38.27"
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,"D'aww ! He match background colour I 'm seemingly stuck . Thanks . ( talk ) 21:51 , January 11 , 2016 ( UTC )"
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,"Hey man , I 'm really trying edit war . It 's guy constantly removing relevant information talking edits instead talk page . He seems care formatting actual info ."
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0,"`` More I ca n't make real suggestion improvement - I wondered section statistic later , subsection `` '' type accident '' '' -I think reference may need tidying exact format ie date format etc . I later , no-one else first - preference formatting style reference want please let know . There appears backlog article review I guess may delay reviewer turn . It 's listed relevant form eg Wikipedia : Good_article_nominations # Transport ``"
4,"You, sir, are my hero. Any chance you remember what page that's on?",0,"You , sir , hero . Any chance remember page 's ?"


In [16]:
def clean_text(text):
    # Replace line breaks with spaces
    text = re.sub(r'[\n\r]', ' ', text)

    # Convert the text to lowercase
    text = text.lower()

    return text

In [17]:
df['lemmatized'] = df['lemmatized'].apply(clean_text)
df.head()

Unnamed: 0,text,toxic,lemmatized
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,"explanation why edits made username hardcore metallica fan reverted ? they n't vandalism , closure gas i voted new york dolls fac . and please n't remove template talk page since i 'm retired now.89.205.38.27"
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,"d'aww ! he match background colour i 'm seemingly stuck . thanks . ( talk ) 21:51 , january 11 , 2016 ( utc )"
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,"hey man , i 'm really trying edit war . it 's guy constantly removing relevant information talking edits instead talk page . he seems care formatting actual info ."
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0,"`` more i ca n't make real suggestion improvement - i wondered section statistic later , subsection `` '' type accident '' '' -i think reference may need tidying exact format ie date format etc . i later , no-one else first - preference formatting style reference want please let know . there appears backlog article review i guess may delay reviewer turn . it 's listed relevant form eg wikipedia : good_article_nominations # transport ``"
4,"You, sir, are my hero. Any chance you remember what page that's on?",0,"you , sir , hero . any chance remember page 's ?"


We loaded the data and the English stop-word list, removed stop words, lemmatized the text, replaced line breaks with spaces, and converted the text to lowercase. The classes are imbalanced: roughly 10% of the comments (16,186 of 159,292) are labeled toxic.
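
In the lemmatized output above, tokens such as "They", "I" and "And" survive stop-word removal, because the filter compares tokens against a lowercase stop-word list before the text itself is lowercased. The sketch below shows the alternative order: lowercase first, then filter. To stay self-contained it makes two assumptions that differ from the notebook: a tiny hand-written stop-word set stands in for the NLTK list, and a plain regular expression replaces `nltk.word_tokenize`. The function name `normalize` is also hypothetical.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead
STOPWORDS = {"i", "my", "they", "and", "the", "are", "you", "that", "on", "what"}


def normalize(text):
    """Lowercase the text, tokenize it, then drop stop words."""
    # Replace line breaks with spaces and lowercase BEFORE filtering
    text = re.sub(r"[\n\r]+", " ", text).lower()
    # Keep runs of letters and apostrophes as tokens (a simplification)
    tokens = re.findall(r"[a-z']+", text)
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


sample = "You, sir, are my hero.\nAny chance you remember what page that's on?"
print(normalize(sample))
```

With this order, capitalized stop words like "You" are removed along with their lowercase forms. Whether that improves the F1 score on this dataset has not been measured here.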

## Training

In [18]:
X = df['lemmatized']
y = df['toxic'].values

In [19]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [20]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((127433,), (31859,), (127433,), (31859,))

In [21]:
pipe_final = Pipeline([
        ('tfidf', TfidfVectorizer(min_df = 1)),
        ('model', LogisticRegression())])

In [22]:
# LogisticRegression parameters
# Note: the default lbfgs solver does not support the 'l1' penalty,
# so the 'l1' candidates fail to fit and receive a NaN score
l_params = {}

l_params['model__C'] = [0.1, 1.0, 10.0]
l_params['model__penalty'] = ['l1', 'l2']
l_params['model'] = [LogisticRegression()]

# DecisionTreeClassifier parameters

tree_params = {}

tree_params['model__criterion'] = ['gini', 'entropy']
tree_params['model__max_depth'] = [2, 4, 6]
tree_params['model'] = [DecisionTreeClassifier()]

params = [l_params, tree_params]

In [23]:
grid = GridSearchCV(pipe_final, param_grid=params, cv=5, scoring='f1', n_jobs=-1, verbose=False)
grid.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('model', LogisticRegression())]),
             n_jobs=-1,
             param_grid=[{'model': [LogisticRegression(C=10.0)],
                          'model__C': [0.1, 1.0, 10.0],
                          'model__penalty': ['l1', 'l2']},
                         {'model': [DecisionTreeClassifier()],
                          'model__criterion': ['gini', 'entropy'],
                          'model__max_depth': [2, 4, 6]}],
             scoring='f1', verbose=False)

In [24]:
# Save the best model (the classifier step of the best pipeline) in a separate variable
best_model = grid.best_estimator_.named_steps['model']

In [25]:
grid.best_score_

0.7747916112940048

In [26]:
results = pd.DataFrame(grid.cv_results_)

# Extract the model name from each candidate's estimator repr
results['model_name'] = results['param_model'].apply(lambda x: str(x).split('(')[0])

# Find the row with the highest mean CV score for each model
max_index = results.groupby('model_name')['mean_test_score'].idxmax()

# Extract the results of the best candidates
best_models = results.loc[max_index]

# F1 is already "higher is better", so no sign flip is needed; sort from best to worst
sorted_results = best_models.sort_values('mean_test_score', ascending=False).reset_index(drop=True)

# Display the results
sorted_results[['model_name', 'mean_test_score', 'mean_fit_time', 'mean_score_time']]

Unnamed: 0,model_name,mean_test_score,mean_fit_time,mean_score_time
0,LogisticRegression,0.774792,48.198417,0.960087
1,DecisionTreeClassifier,0.531101,6.45288,0.871489


In [29]:
pred = grid.predict(X_test)

In [30]:
res = f1_score(y_test, pred)
res

0.7775648841665216

## Conclusions

In this project we loaded the data and preprocessed it: stop-word removal, lemmatization, line-break removal, and lowercasing. Two models were trained on TF-IDF features: logistic regression and a decision tree. Logistic regression performed best, with F1 ≈ 0.77 in cross-validation and ≈ 0.78 on the test set, which meets the project requirement (F1 ≥ 0.75). The best decision tree reached only F1 ≈ 0.53.
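
One caveat about the logistic regression grid: scikit-learn's default `lbfgs` solver does not support the `l1` penalty, so those candidates fail to fit and are scored as NaN rather than actually being compared. The sketch below assumes the `liblinear` solver instead, which accepts both `l1` and `l2`. The toy corpus and labels are made up for illustration, and `cv=2` is used only because the corpus is tiny. This is not the configuration that produced the scores reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus, for illustration only (1 = toxic, 0 = non-toxic)
texts = [
    "you are an idiot",
    "what a stupid edit",
    "shut up you moron",
    "idiot and stupid troll",
    "thanks for the helpful edit",
    "please see the talk page",
    "great work on the article",
    "i added a reference",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # liblinear supports both the l1 and l2 penalties
    ("model", LogisticRegression(solver="liblinear")),
])

param_grid = {
    "model__C": [0.1, 1.0, 10.0],
    "model__penalty": ["l1", "l2"],
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=2, scoring="f1")
grid.fit(texts, labels)

# Every candidate should now have a finite score (no failed fits)
scores = grid.cv_results_["mean_test_score"]
print(scores)
print(grid.best_params_)
```

On the real data the same change would make the `l1` candidates take part in the search. Whether any of them would beat the `l2` result is not known from this notebook.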