# Non-toxic comment recognition model

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

A complete natural language processing workflow consists of the following steps:

1. Tokenization: splitting the text into individual words or tokens.

2. Stop-word removal: dropping frequent words that carry little meaning (for example, prepositions and conjunctions).

3. Stemming or lemmatization: reducing words to their base form. Stemming strips word endings with simple rules, whereas lemmatization uses the vocabulary and grammar of the language (for example, the part of speech) to return the dictionary form.

4. Vectorization: converting the text into numeric vectors that can be used to train machine learning models.

5. Classification or clustering: applying machine learning algorithms to classify or cluster the text data.
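As a rough orientation, the steps listed above can be condensed into a single scikit-learn pipeline. The sketch below is only an illustration on an invented toy corpus (the texts and labels are assumptions, not rows from `toxic_comments.csv`), and it skips lemmatization: `TfidfVectorizer` handles tokenization, stop-word removal and vectorization, and `LogisticRegression` does the classification.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; 1 = toxic, 0 = non-toxic.
texts = [
    "you are a stupid idiot",
    "thanks for the helpful edit",
    "what an idiot, shut up",
    "I appreciate the detailed explanation",
    "shut up you stupid troll",
    "thanks, the article looks better now",
]
labels = [1, 0, 1, 0, 1, 0]

# TfidfVectorizer tokenizes, lowercases and drops English stop words;
# LogisticRegression then classifies the resulting sparse vectors.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(),
)
pipeline.fit(texts, labels)

predictions = pipeline.predict(["thanks for the edit", "you stupid idiot"])
print(predictions)
```

The notebook below performs the same steps one at a time with NLTK, which makes each intermediate result visible.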

## Preparation

In [6]:
# Standard library
import re
import warnings

# Data handling
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

# NLP tooling
import nltk
import spacy
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Modelling and evaluation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Download the NLTK resources used below
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


The cell above imports all the required libraries.

In [7]:
file_path = '/content/sample_data/toxic_comments.csv'
df = pd.read_csv(file_path)
df = df.sample(frac=0.5, random_state=42)  # Randomly sample 50% of the data to reduce computation time
df

,Unnamed: 0,text,toxic
31015,31055,"Sometime back, I just happened to log on to ww...",0
102832,102929,"""\n\nThe latest edit is much better, don't mak...",0
67317,67385,""" October 2007 (UTC)\n\nI would think you'd be...",0
81091,81167,Thanks for the tip on the currency translation...,0
90091,90182,I would argue that if content on the Con in co...,0
...,...,...,...
1028,1028,less likeable messages here. Mark I know you a...,0
147303,147459,"""\n\n Nubsor \n\nHaving seen your heartfelt pl...",0
109563,109660,"""\n\nSligocki, before removing valuable explan...",0
101825,101922,Speedy oppose: Moving this page to the propose...,0
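The `Unnamed: 0` column looks like an index that was saved together with the data. Assuming it carries no predictive information, one option is to read it as the index while loading. The sketch below demonstrates this on a small in-memory CSV (the rows are invented for illustration) and also prints the class share, which is worth checking before modelling.

```python
import io

import pandas as pd

# Toy CSV standing in for toxic_comments.csv; the leading unnamed column
# mimics an index that was saved together with the data.
csv_text = """,text,toxic
0,thanks for the tip,0
1,you are an idiot,1
2,nice article,0
3,please add a source,0
"""

# index_col=0 turns the unnamed first column into the index
# instead of a regular "Unnamed: 0" column.
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Sample half of the rows reproducibly, as in the cell above.
toy_sample = toy_df.sample(frac=0.5, random_state=42)

# Share of toxic comments in the toy data.
toxic_share = toy_df["toxic"].mean()

print(toy_df.columns.tolist())
print(len(toy_sample), toxic_share)
```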


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 79646 entries, 31015 to 10600
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  79646 non-null  int64 
 1   text        79646 non-null  object
 2   toxic       79646 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 2.4+ MB


In [9]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Text tokenization

In [10]:
# Tokenize a single text into a list of tokens
def tokenize_text(text):
    return word_tokenize(text)

# Apply tokenization to every text in the dataset
df['tokenized_text'] = df['text'].apply(tokenize_text)

# Inspect the tokenization results for the first few rows
df['tokenized_text'].head()

31015     [Sometime, back, ,, I, just, happened, to, log...
102832    [``, The, latest, edit, is, much, better, ,, d...
67317     [``, October, 2007, (, UTC, ), I, would, think...
81091     [Thanks, for, the, tip, on, the, currency, tra...
90091     [I, would, argue, that, if, content, on, the, ...
Name: tokenized_text, dtype: object
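The output above shows two things that are easy to miss: double quotes become the tokens `` `` `` and `''`, and contractions such as "don't" are split into `do` and `n't`. The sketch below reproduces this with NLTK's `TreebankWordTokenizer`, a rule-based tokenizer that needs no downloaded resources. It is closely related to, but not the same as, the tokenizer behind `word_tokenize`, so treat it as an approximation; the sample sentence is adapted from the second row of the dataset.

```python
from nltk.tokenize import TreebankWordTokenizer

# Rule-based tokenizer that works without downloading the 'punkt' resource.
tokenizer = TreebankWordTokenizer()

text = '"The latest edit is much better, don\'t make it worse"'
tokens = tokenizer.tokenize(text)

print(tokens)
```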

In [11]:
df.head(15)

,Unnamed: 0,text,toxic,tokenized_text
31015,31055,"Sometime back, I just happened to log on to ww...",0,"[Sometime, back, ,, I, just, happened, to, log..."
102832,102929,"""\n\nThe latest edit is much better, don't mak...",0,"[``, The, latest, edit, is, much, better, ,, d..."
67317,67385,""" October 2007 (UTC)\n\nI would think you'd be...",0,"[``, October, 2007, (, UTC, ), I, would, think..."
81091,81167,Thanks for the tip on the currency translation...,0,"[Thanks, for, the, tip, on, the, currency, tra..."
90091,90182,I would argue that if content on the Con in co...,0,"[I, would, argue, that, if, content, on, the, ..."
1860,1860,"""=Reliable sources===\nCheating:\n""""Barry Bond...",1,"[``, =Reliable, sources===, Cheating, :, '', '..."
125293,125422,WTF=\n\nHow The Fuck Does This Person Merit A ...,1,"[WTF=, How, The, Fuck, Does, This, Person, Mer..."
148986,149142,"cajuns, acadians\nCajuns, acadians, louisianan...",0,"[cajuns, ,, acadians, Cajuns, ,, acadians, ,, ..."
89697,89784,Hi - I dropped a pin in Google Maps at the cer...,0,"[Hi, -, I, dropped, a, pin, in, Google, Maps, ..."
64256,64323,Re removal of accessdate= for urls books \n\nT...,0,"[Re, removal, of, accessdate=, for, urls, book..."


## Stop-word removal

In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
# Get the set of English stop words
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [14]:
# Remove stop words from a list of tokens (case-insensitive comparison)
def remove_stop_words(tokens):
    return [word for word in tokens if word.lower() not in stop_words]

# Remove stop words from every tokenized text
df['text_without_stopwords'] = df['tokenized_text'].apply(remove_stop_words)

# Inspect the results for the first few rows
df['text_without_stopwords'].head()

31015     [Sometime, back, ,, happened, log, www.izoom.i...
102832    [``, latest, edit, much, better, ,, n't, make,...
67317     [``, October, 2007, (, UTC, ), would, think, '...
81091     [Thanks, tip, currency, translation, ., Think,...
90091     [would, argue, content, Con, comparison, Arts,...
Name: text_without_stopwords, dtype: object
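Punctuation tokens such as `,` and `` `` `` survive the stop-word filter, as the output above shows. With the default `TfidfVectorizer` settings, tokens without at least two word characters are ignored anyway, so this does not change the features, but dropping them earlier keeps the intermediate columns easier to read. The sketch below is one possible filter; the small stop-word set is a hand-picked assumption so that the example runs without the NLTK corpus.

```python
# A tiny hand-picked stop-word set, used here only so the example runs
# without downloading the NLTK corpus (an assumption for illustration).
toy_stop_words = {"the", "is", "it", "do", "a", "to"}

tokens = ["``", "The", "latest", "edit", "is", "much", "better", ",",
          "do", "n't", "make", "it", "worse", "''"]


def remove_stop_words_and_punct(tokens, stop_words):
    """Keep tokens that contain a letter and are not stop words."""
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words and any(ch.isalpha() for ch in tok)
    ]


cleaned = remove_stop_words_and_punct(tokens, toy_stop_words)
print(cleaned)
# ['latest', 'edit', 'much', 'better', "n't", 'make', 'worse']
```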

## Lemmatization

In [15]:
# Make sure the required NLTK resources are downloaded
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [16]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Cache of WordNet POS tags, keyed by word
pos_cache = {}

# Determine the part of speech of a word.
# Note: the word is tagged in isolation, so sentence context is ignored.
def get_wordnet_pos(word):
    if word in pos_cache:
        return pos_cache[word]

    # First letter of the Penn Treebank tag, e.g. 'V' for 'VBD'
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    # Fall back to NOUN for tags that WordNet does not cover
    pos_cache[word] = tag_dict.get(tag, wordnet.NOUN)
    return pos_cache[word]

# Lemmatize a list of tokens
def lemmatize_text(tokens):
    return [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

# Apply lemmatization to every text
df['lemmatized_text'] = df['text_without_stopwords'].apply(lemmatize_text)

# Inspect the results for the first few rows
df['lemmatized_text'].head()

31015     [Sometime, back, ,, happen, log, www.izoom.in,...
102832    [``, late, edit, much, well, ,, n't, make, art...
67317     [``, October, 2007, (, UTC, ), would, think, '...
81091     [Thanks, tip, currency, translation, ., Think,...
90091     [would, argue, content, Con, comparison, Arts,...
Name: lemmatized_text, dtype: object
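Because `get_wordnet_pos` tags each word on its own and caches the result, a word always receives the same part of speech regardless of where it occurs. A possible alternative is to tag the whole token list once and map each Penn Treebank tag to a WordNet tag. The sketch below shows only the mapping step; the `(token, tag)` pairs are written by hand in the format that `nltk.pos_tag` returns, and the helper name `penn_to_wordnet` is hypothetical.

```python
# WordNet part-of-speech constants are single letters; they are written out
# here so the example runs without NLTK data ('a' = wordnet.ADJ,
# 'n' = wordnet.NOUN, 'v' = wordnet.VERB, 'r' = wordnet.ADV).
PENN_PREFIX_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag such as 'VBD' to a WordNet POS letter."""
    return PENN_PREFIX_TO_WORDNET.get(penn_tag[:1].upper(), "n")


# Hand-written (token, tag) pairs in the format that nltk.pos_tag returns
# for a whole sentence; the tags are illustrative, not tagger output.
tagged_sentence = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("saw", "NN")]

wordnet_tags = [(tok, penn_to_wordnet(tag)) for tok, tag in tagged_sentence]
print(wordnet_tags)
# [('I', 'n'), ('saw', 'v'), ('the', 'n'), ('saw', 'n')]
```

In the notebook this would amount to calling `pos_tag(tokens)` once per comment and passing each pair to `lemmatizer.lemmatize(tok, penn_to_wordnet(tag))`. Whether the extra context improves the final metric on this dataset has not been checked here.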

## Vectorization

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
import gc

# Join the lemmatized tokens back into strings
df['lemmatized_text_joined'] = df['lemmatized_text'].apply(' '.join)

# Initialize the TF-IDF vectorizer with a cap on the number of features
vectorizer = TfidfVectorizer(max_features=10000)  # Example: limit the vocabulary to 10,000 terms

# Fit the vectorizer and transform the lemmatized texts
tfidf_matrix = vectorizer.fit_transform(df['lemmatized_text_joined'])

# Free memory
del df['lemmatized_text_joined']
gc.collect()

# tfidf_matrix is a sparse matrix.
# To look at the result it can be converted to a DataFrame.
# Note: .toarray() densifies the whole matrix (79,646 rows x up to 10,000 columns),
# which needs several GB of RAM.
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Inspect the first few rows of the resulting matrix
tfidf_df.head()


Unnamed: 0,00,000,000000,01,02,03,04,05,06,07,...,zionist,zionists,zoe,zombie,zone,zoo,zoom,zora,zuck,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.213064,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Classification

In [19]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, df['toxic'], test_size=0.2, random_state=42)

# Create and train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.95      1.00      0.97     14289
           1       0.94      0.58      0.71      1641

    accuracy                           0.95     15930
   macro avg       0.95      0.79      0.84     15930
weighted avg       0.95      0.95      0.95     15930



The classification results look fairly good, especially in terms of precision and overall accuracy (0.95). However, recall (0.58) and F1-score (0.71) for class 1 (toxic comments) are noticeably lower than for class 0 (1.00 and 0.97). In other words, the model is better at recognizing non-toxic comments than toxic ones: it finds only about 58% of the toxic comments in the test set.
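The low recall for the toxic class is typical for imbalanced data: the test set contains 14,289 non-toxic and only 1,641 toxic comments. One possible remedy, sketched below on invented toy data, is to stratify the split and pass `class_weight="balanced"` to `LogisticRegression`. The sketch also places the vectorizer inside a pipeline so that TF-IDF statistics are learned from the training split only. Whether this raises F1 on the real dataset has not been verified here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy, imbalanced data invented for illustration (1 = toxic).
toy_texts = (
    ["thanks for the helpful edit"] * 8
    + ["nice article, well sourced"] * 8
    + ["you stupid idiot"] * 2
    + ["shut up idiot"] * 2
)
toy_labels = [0] * 16 + [1] * 4

# stratify keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The vectorizer sits inside the pipeline, so it is fitted on the training
# split only; class_weight="balanced" up-weights the rarer toxic class.
toy_model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
toy_model.fit(X_tr, y_tr)

toy_f1 = f1_score(y_te, toy_model.predict(X_te))
print(f"F1 for the toxic class: {toy_f1:.2f}")
```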