# Project 12. Text processing (NLP)

The online store «Викишоп» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative (toxic). You have a dataset in which edits are labelled for toxicity.

Build a model with an *F1* score of at least 0.7.
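
As a reminder of what the 0.7 target measures, here is a minimal sketch of how *F1* for the positive (toxic) class follows from precision and recall. The helper name `f1_from_counts` and the counts are hypothetical and chosen only for illustration; the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from confusion-matrix counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of flagged comments that are truly toxic
    recall = tp / (tp + fn)     # share of toxic comments that were flagged
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, not taken from the dataset
tp, fp, fn = 70, 30, 30
score = f1_from_counts(tp, fp, fn)
print(round(score, 3))  # 0.7
```

Because F1 ignores true negatives, a model that labels every comment as non-toxic scores 0 here even though its accuracy on this imbalanced data would look high.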

### Project instructions

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

### Data description

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

# 1. Preparation

<div class="alert alert-info"> <b>Student comment:</b> Import the libraries </div>

In [1]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

from sklearn.metrics import f1_score

import re

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df = pd.read_csv('/datasets/toxic_comments.csv').head(50000)
df.head()


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
text     50000 non-null object
toxic    50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.4+ KB


________________

<div class="alert alert-info"> <b>Student comment:</b> Lemmatize the comment text, removing extraneous characters along the way </div>

In [4]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
# word_tokenize and WordNetLemmatizer also need the 'punkt' and 'wordnet' NLTK resources
nltk.download('averaged_perceptron_tagger')

# Lemmatize with POS tag
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

corpus = df['text']

def lemmatize(text):
    m = WordNetLemmatizer()
    word_list = nltk.word_tokenize(text)
    
    # Lemmatize list of words and join
    lemmatized_output = ' '.join([m.lemmatize(w, get_wordnet_pos(w)) for w in word_list])
    
    return lemmatized_output

def clear_text(text):
    x = re.sub(r'[^a-zA-Z]', ' ', text) 
    x = x.split()
    return " ".join(x)

print("Original text:\n", corpus[0])
print("Cleaned and lemmatized text:\n", lemmatize(clear_text(corpus[0])))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Original text:
 Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27
Cleaned and lemmatized text:
 Explanation Why the edits make under my username Hardcore Metallica Fan be revert They weren t vandalism just closure on some GAs after I vote at New York Dolls FAC And please don t remove the template from the talk page since I m retire now
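
The output above shows contractions being split ("weren t", "don t", "I m"), because the cleaning regex replaces every non-letter, including the apostrophe, with a space. The sketch below reproduces that rule on a short sample and compares it with a hypothetical variant, `clear_text_keep_apostrophes`, which is not part of the original notebook and is shown only as one possible alternative.

```python
import re

def clear_text(text: str) -> str:
    """Same rule as in the notebook: keep only Latin letters."""
    return " ".join(re.sub(r"[^a-zA-Z]", " ", text).split())

def clear_text_keep_apostrophes(text: str) -> str:
    """Hypothetical variant: also keep apostrophes, so contractions stay whole."""
    return " ".join(re.sub(r"[^a-zA-Z']", " ", text).split())

sample = "They weren't vandalisms, just closure. 89.205.38.27"
cleaned = clear_text(sample)
cleaned_alt = clear_text_keep_apostrophes(sample)
print(cleaned)      # They weren t vandalisms just closure
print(cleaned_alt)  # They weren't vandalisms just closure
```

Whether keeping apostrophes helps depends on the tokenizer and the stop-word list used afterwards, so it is worth checking on the validation set rather than assuming.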


In [5]:
%%time
df['lemm_text'] = corpus.apply(lambda x: lemmatize(clear_text(x)))
df.tail()

CPU times: user 9min 15s, sys: 48.6 s, total: 10min 3s
Wall time: 10min 9s


| | text | toxic | lemm_text |
|---|---|---|---|
| 49995 | "\nWell, she's done writing articles. But does... | 0 | Well she s do write article But doesn t the UN... |
| 49996 | There was also an Atlantis wrap back in 2001. ... | 0 | There be also an Atlantis wrap back in Maybe t... |
| 49997 | "\nI've removed the various ""references"" as ... | 0 | I ve remove the various reference a a userpage... |
| 49998 | RE: Removal of Templates \n\nI would be most g... | 1 | RE Removal of Templates I would be most gratef... |
| 49999 | Exactly. I want the drawing to have come first... | 0 | Exactly I want the draw to have come first too... |


<div class="alert alert-info"> <b>Student comment:</b> a column with the lemmatized comment text has been created </div>

_________________________________________________________________

<div class="alert alert-info"> <b>Student comment:</b> split the data into train, validation and test sets, and separate the features from the targets </div>

In [6]:
# 60% train, 20% validation, 20% test
train, valid_plus_test = train_test_split(df, test_size=0.4, random_state=123)
valid, test = train_test_split(valid_plus_test, test_size=0.5, random_state=123)

print('Training set:')
print(train.shape)

print('-----')
print('Validation set:')
print(valid.shape)

print('-----')
print('Test set:')
print(test.shape)

# Only the lemmatized text remains as a feature
features_train = train.drop(['text', 'toxic'], axis=1)
target_train = train['toxic']

features_valid = valid.drop(['text', 'toxic'], axis=1)
target_valid = valid['toxic']

features_test = test.drop(['text', 'toxic'], axis=1)
target_test = test['toxic']

Training set:
(30000, 3)
-----
Validation set:
(10000, 3)
-----
Test set:
(10000, 3)


In [7]:
print(df[df['toxic'] == 0]['toxic'].count())
print(df[df['toxic'] == 1]['toxic'].count())

print(df[df['toxic'] == 0]['toxic'].count() / df[df['toxic'] == 1]['toxic'].count())

44845
5155
8.699321047526674
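
The ratio of roughly 8.7 non-toxic comments per toxic one is what motivates the repeat factor of 9 used for upsampling in the next cell. A minimal sketch of that arithmetic, using a hypothetical helper `upsample_repeat` and a label list rebuilt from the counts printed above:

```python
from collections import Counter

def upsample_repeat(labels) -> int:
    """Round the majority/minority ratio to the nearest whole repeat count."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return max(1, round(majority / minority))

# Label distribution reported above: 44845 zeros and 5155 ones
labels = [0] * 44845 + [1] * 5155
repeat = upsample_repeat(labels)
print(repeat)  # 9
```

Strictly speaking, the ratio would be better measured on the training split only, since that is the only part being upsampled; here it is computed on the whole sample, as in the notebook.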


In [8]:
from sklearn.utils import shuffle
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

features_train_upsampled, target_train_upsampled = upsample(features_train, target_train, 9)

In [9]:
print(features_train.shape)
print(target_train.shape)
print('\n upsampled')
print(features_train_upsampled.shape)
print(target_train_upsampled.shape)
print()
print(features_valid.shape)
print(target_valid.shape)
print()
print(features_test.shape)
print(target_test.shape)

(30000, 1)
(30000,)

 upsampled
(55104, 1)
(55104,)

(10000, 1)
(10000,)

(10000, 1)
(10000,)
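
The shapes above are enough to recover the class counts in the training set and to check that upsampling really balanced it. This is a small arithmetic sketch; the only inputs are the two row counts printed above and the repeat factor of 9.

```python
n_train = 30000      # training rows before upsampling (from the output above)
n_upsampled = 55104  # training rows after upsampling (from the output above)
repeat = 9           # repeat factor passed to upsample()

# Each toxic row is kept once and added (repeat - 1) more times
n_ones = (n_upsampled - n_train) // (repeat - 1)
n_zeros = n_train - n_ones
n_ones_upsampled = n_ones * repeat

print(n_ones, n_zeros, n_ones_upsampled)     # 3138 26862 28242
print(round(n_ones_upsampled / n_zeros, 2))  # 1.05
```

After upsampling the two classes are almost equal in size, so passing `class_weight='balanced'` to the models later on should change very little on top of it.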


<div class="alert alert-info"> <b>Student comment:</b> Build the feature matrix of TF-IDF values </div>

In [10]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

corpus_train = features_train_upsampled['lemm_text'].values.astype('U')
corpus_valid = features_valid['lemm_text'].values.astype('U')
corpus_test = features_test['lemm_text'].values.astype('U')

nltk.download('stopwords')
# Use a separate name so the nltk `stopwords` module is not shadowed;
# recent scikit-learn versions expect a list here rather than a set
stop_words = list(stopwords.words('english'))

count_tf_idf = TfidfVectorizer(stop_words=stop_words)

# Fit the vectorizer on the training corpus only, then reuse it for valid and test
tf_idf_train = count_tf_idf.fit_transform(corpus_train)
tf_idf_valid = count_tf_idf.transform(corpus_valid)
tf_idf_test = count_tf_idf.transform(corpus_test)

print('corpus_train:', corpus_train.shape)
print("train matrix shape:", tf_idf_train.shape)
print()
print('corpus_valid:', corpus_valid.shape)
print("valid matrix shape:", tf_idf_valid.shape)
print()
print('corpus_test:', corpus_test.shape)
print("test matrix shape:", tf_idf_test.shape)


[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


corpus_train: (55104,)
train matrix shape: (55104, 58318)

corpus_valid: (10000,)
valid matrix shape: (10000, 58318)

corpus_test: (10000,)
test matrix shape: (10000, 58318)
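
The vectorizer is fitted on the training corpus and only applied to the validation and test corpora, which keeps their vocabulary from leaking into training. One way to make that hard to get wrong is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline`. The sketch below is an illustration on invented toy sentences, not the notebook's actual setup; it uses scikit-learn's built-in `"english"` stop-word list instead of the NLTK one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
train_texts = [
    "you be a stupid idiot", "thank you for the helpful edit",
    "stupid idiot go away", "please see the talk page",
    "what an idiot", "the article look good now",
]
train_labels = [1, 0, 1, 0, 1, 0]
new_texts = ["stupid idiot", "helpful talk page"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(class_weight="balanced", random_state=12345)),
])

# fit() learns the vocabulary and the classifier from the training texts only;
# predict() reuses that vocabulary for unseen texts
pipe.fit(train_texts, train_labels)
predictions = pipe.predict(new_texts)
print(predictions)
```

In the notebook this would mean passing the `lemm_text` column directly to `pipe.fit` and `pipe.predict` instead of building the TF-IDF matrices by hand.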


# 2. Training

<div class="alert alert-info"> <b>Student comment:</b> Train several models (logistic regression, decision tree, random forest, SGD classifier) </div>

In [11]:
%%time
model_log = LogisticRegression(class_weight='balanced')
model_log.fit(tf_idf_train, target_train_upsampled)
predicted_log = model_log.predict(tf_idf_valid)

print(f1_score(target_valid, predicted_log))

0.7214285714285715
CPU times: user 5.23 s, sys: 4.37 s, total: 9.6 s
Wall time: 9.62 s


In [12]:
predicted_log = model_log.predict(tf_idf_test)

print(f1_score(target_test, predicted_log))

0.7140238313473877
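
The scores above come from `predict`, which for logistic regression corresponds to a probability threshold of 0.5. Since the target metric is F1, the threshold can also be tuned on the validation set. The sketch below is an optional extension that was not part of the original run; `best_f1_threshold` is a hypothetical helper and the labels and probabilities are invented.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, proba, thresholds=None):
    """Return the (threshold, f1) pair that maximizes F1 over a grid."""
    if thresholds is None:
        thresholds = np.arange(0.05, 1.0, 0.05)
    scores = [
        f1_score(y_true, (proba >= t).astype(int), zero_division=0)
        for t in thresholds
    ]
    best = int(np.argmax(scores))  # first maximum if several thresholds tie
    return float(thresholds[best]), float(scores[best])

# Hypothetical validation labels and predicted probabilities of the toxic class
y_valid = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba_valid = np.array([0.05, 0.10, 0.20, 0.52, 0.62, 0.30, 0.80, 0.90])

threshold, score = best_f1_threshold(y_valid, proba_valid)
print(round(threshold, 2), score)
```

With the notebook's objects the probabilities would come from `model_log.predict_proba(tf_idf_valid)[:, 1]`, and the chosen threshold should then be applied once to the test set rather than re-tuned there.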


In [13]:
%%time
model_tree = DecisionTreeClassifier(random_state=12345)
model_tree.fit(tf_idf_train, target_train_upsampled)
predicted_tree = model_tree.predict(tf_idf_valid)

print(f1_score(target_valid, predicted_tree))

0.6074140241179098
CPU times: user 17.4 s, sys: 13.8 ms, total: 17.4 s
Wall time: 17.5 s


In [14]:
predicted_tree = model_tree.predict(tf_idf_test)

print(f1_score(target_test, predicted_tree))

0.5910313901345292


In [15]:
%%time
best_model = None 
best_result = 0
best_depth = 0
for depth in range(1, 100, 5):
    model_tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model_tree.fit(tf_idf_train, target_train_upsampled)
    predicted_tree = model_tree.predict(tf_idf_valid) 
    result_tree = f1_score(target_valid, predicted_tree) 
    if result_tree > best_result:
        best_model = model_tree
        best_result = result_tree
        best_depth = depth
        f1_tree = f1_score(target_valid, predicted_tree)

print('best model depth =', best_depth)
print("F1_score =", f1_tree)

best model depth = 61
F1_score = 0.6194774346793349
CPU times: user 3min 14s, sys: 143 ms, total: 3min 14s
Wall time: 3min 15s


In [16]:
predicted_tree = best_model.predict(tf_idf_test)

print(f1_score(target_test, predicted_tree))

0.5896568390526824


In [17]:
%%time
model_for = RandomForestClassifier()
model_for.fit(tf_idf_train, target_train_upsampled)
predicted_for = model_for.predict(tf_idf_valid)

print(f1_score(target_valid, predicted_for))

0.5835509138381201
CPU times: user 11.2 s, sys: 18 ms, total: 11.2 s
Wall time: 11.3 s


In [18]:
predicted_for = model_for.predict(tf_idf_test)

print(f1_score(target_test, predicted_for))

0.6051417270929466


In [19]:
%%time
model_s = SGDClassifier(class_weight='balanced')
model_s.fit(tf_idf_train, target_train_upsampled)
predicted_s = model_s.predict(tf_idf_valid)

print(f1_score(target_valid, predicted_s))

0.7084967320261438
CPU times: user 250 ms, sys: 84.8 ms, total: 335 ms
Wall time: 347 ms


In [20]:
predicted_s = model_s.predict(tf_idf_test)

print(f1_score(target_test, predicted_s))

0.706989247311828


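# 3. Conclusions

<div class="alert alert-info"> <b>Student comment:</b> All project tasks have been completed. On the 50,000-row sample, logistic regression gave the best F1 (0.721 on validation, 0.714 on test), followed by the SGD classifier (0.708 / 0.707); both meet the 0.7 target. The decision tree (0.607 / 0.591 by default, 0.619 / 0.590 at max_depth=61) and the random forest (0.584 / 0.605) fall short of it. </div>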

# Review checklist

- [x]  The Jupyter Notebook opens
- [x]  All code runs without errors
- [x]  Code cells are in execution order
- [x]  Data is loaded and prepared
- [x]  Models are trained
- [x]  The *F1* score is at least 0.7
- [x]  Conclusions are written