### Text sentiment classification on a dataset of English-language tweets (toxic_comments.csv) using deeppavlov
#### Install the deeppavlov library

In [1]:
!pip install deeppavlov



In [2]:
from deeppavlov import build_model  # Import the build_model function from DeepPavlov
import pandas as pd
df = pd.read_csv('datasets/toxic_comments.csv', encoding="utf-8", on_bad_lines="skip", engine='python')

### Extension 1: apply preprocessing (lemmatization, vectorization)

In [3]:
!pip install pymorphy2
from sklearn.feature_extraction.text import TfidfVectorizer
import pymorphy2
import re

# Initialize the lemmatizer (pymorphy2 is a Russian morphological analyzer,
# so English words are mostly left unchanged)
morph = pymorphy2.MorphAnalyzer()

# Text preprocessing function
def preprocess_text(text):
    # Remove punctuation and digits
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    # Convert the text to lowercase
    text = text.lower()
    # Lemmatize each word
    words = text.split()
    return ' '.join(morph.parse(word)[0].normal_form for word in words)

def vectorizer_text(text_series):
    # Vectorize the text with TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Keep at most the 5000 most frequent terms
    tfidf_matrix = tfidf_vectorizer.fit_transform(text_series)  # Convert the texts to a sparse TF-IDF matrix
    return tfidf_matrix

# Apply preprocessing to the text
df['clean_text'] = df['text'].apply(preprocess_text)

# Vectorize the cleaned text (one sparse row per document)
df['vectorized_text'] = list(vectorizer_text(df['clean_text']))
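
To make the cleaning steps concrete, here is a minimal sketch of the regex part of `preprocess_text` applied to a single string. Lemmatization is left out so that the snippet runs without pymorphy2. The helper name `clean_only` and the sample string are illustrative assumptions and are not part of the notebook.

```python
import re

def clean_only(text):
    """Regex cleanup from preprocess_text, without the lemmatization step."""
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation; \w keeps letters, digits and "_"
    text = re.sub(r'\d+', '', text)      # drop runs of digits
    text = text.lower()
    return ' '.join(text.split())        # collapse repeated whitespace

# Hypothetical sample string, chosen to exercise each step
sample = "D'aww! He matches 2 colours, @bts_twt"
cleaned = clean_only(sample)
print(cleaned)
```

Because `\w` also matches the underscore, user handles such as `bts_twt` keep it after cleaning, while the leading `@` is removed.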



In [4]:
# Build the list of phrases to analyze
phrases = df['clean_text'].tolist()
# Build a model based on a pretrained BERT fine-tuned to classify insults in English
model = build_model('insults_kaggle_bert', download=True, install=True)
# Run the phrases through the model to get labels (Insult or Not Insult)
labels = model(phrases)
df['tone'] = labels
display(df.head(50))

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

Unnamed: 0.1,Unnamed: 0,text,toxic,clean_text,vectorized_text,tone
0,0,Explanation\nWhy the edits made under my usern...,0,explanation why the edits made under my userna...,"(0, 1173)\t0.1381053351696153\n (0, 1420)\t...",Not Insult
1,1,D'aww! He matches this background colour I'm s...,0,daww he matches this background colour im seem...,"(0, 1806)\t0.24961010930058378\n (0, 921)\t...",Not Insult
2,2,"Hey man, I'm really not trying to edit war. It...",0,hey man im really not trying to edit war its j...,"(0, 874)\t0.2010046334121889\n (0, 29)\t0.2...",Not Insult
3,3,"""\nMore\nI can't make any real suggestions on ...",0,more i cant make any real suggestions on impro...,"(0, 1866)\t0.11917766467274624\n (0, 545)\t...",Not Insult
4,4,"You, sir, are my hero. Any chance you remember...",0,you sir are my hero any chance you remember wh...,"(0, 1702)\t0.2846932570208014\n (0, 1851)\t...",Not Insult
5,5,"""\n\nCongratulations from me as well, use the ...",0,congratulations from me as well use the tools ...,"(0, 1745)\t0.43162263669804446\n (0, 1801)\...",Not Insult
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,cocksucker before you piss around on my work,"(0, 1883)\t0.43868616429245694\n (0, 130)\t...",Insult
7,7,Your vandalism to the Matt Shirvington article...,0,your vandalism to the matt shirvington article...,"(0, 187)\t0.3447985796031575\n (0, 1871)\t0...",Insult
8,8,Sorry if the word 'nonsense' was offensive to ...,0,sorry if the word nonsense was offensive to yo...,"(0, 576)\t0.14977252348118514\n (0, 600)\t0...",Not Insult
9,9,alignment on this subject and which are contra...,0,alignment on this subject and which are contra...,"(0, 525)\t0.4209031671207119\n (0, 1724)\t0...",Not Insult
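
The dataset already carries a reference `toxic` column, so the model's labels can be checked against it. The sketch below does this for the ten rows displayed above, using plain lists in place of the DataFrame columns. Mapping `Insult` to 1 is an assumption about how the two label schemes line up, since "toxic" and "insult" are not necessarily the same notion.

```python
# Values copied from the ten rows displayed above
toxic = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
tone = ["Not Insult"] * 6 + ["Insult", "Insult", "Not Insult", "Not Insult"]

# Assumption: the "Insult" label corresponds to toxic == 1
predicted = [1 if label == "Insult" else 0 for label in tone]

matches = sum(p == t for p, t in zip(predicted, toxic))
agreement = matches / len(toxic)
disagreements = [i for i, (p, t) in enumerate(zip(predicted, toxic)) if p != t]

print(f"agreement: {agreement:.2f}, mismatched rows: {disagreements}")
```

On the full DataFrame the same comparison could be written as `(df['tone'] == 'Insult').astype(int).eq(df['toxic']).mean()`. Ten rows are far too few to judge the model; the point is only to show how the check is set up.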


### Extension 2: solve the same task for Russian

In [5]:
df = pd.read_csv('datasets/rusentitweet_full.csv', encoding="utf-8", on_bad_lines="skip", engine='python')
# Build a model based on a pretrained BERT fine-tuned for sentiment analysis in Russian
model = build_model('rusentiment_convers_bert', download=True, install=True)
# Apply preprocessing to the text
df['clean_text'] = df['text'].apply(preprocess_text)

# Vectorize the cleaned text
df['vectorized_text'] = list(vectorizer_text(df['clean_text']))

phrases = df['clean_text'].tolist()
labels = model(phrases)
df['tone'] = labels  # store the predicted labels in the tone column
display(df.head(50))

Some weights of the model checkpoint at DeepPavlov/rubert-base-cased-conversational were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassi

Unnamed: 0.1,Unnamed: 0,text,label,id,clean_text,vectorized_text,tone
0,0,@varlamov @McFaul На,skip,"1,32793476580731E+018",varlamov mcfaul на,"(0, 269)\t0.4195509891106602\n (0, 60)\t0.6...",neutral
1,1,велл они всё равно что мусор так что ничего с...,negative,"1,25294318138735E+018",велла они всё равно что мусор так что ничего с...,"(0, 409)\t0.3301957582920934\n (0, 282)\t0....",negative
2,2,"""трезвая жизнь какая-то такая стрёмная""\n(с) а...",negative,"1,32361066906168E+018",трезвый жизнь какаять такой стрёмный с артём a...,"(0, 7)\t0.3986021850260091\n (0, 107)\t0.39...",negative
3,3,Ой какие неожиданные результаты 🤭 https://t.co...,neutral,"1,33623166116025E+018",ой какой неожиданный результат httpstcozwohpdkuqq,"(0, 43)\t0.45940351095040555\n (0, 359)\t0....",positive
4,4,@Shvonder_chief @dimsmirnov175 На заборе тоже ...,neutral,"1,29242173645413E+018",shvonder_chief dimsmirnov на забор тоже написа...,"(0, 143)\t0.2819257586136414\n (0, 392)\t0....",neutral
5,5,@idkwhht мы тоже мебельная компания уджина😳😳😳,neutral,"1,30375391117461E+018",idkwhht мы тоже мебельный компания уджина,"(0, 452)\t0.42623255053817144\n (0, 227)\t0...",neutral
6,6,Счастья здоровья 10 классникам https://t.co/M9...,speech,"1,3399173764276E+018",счастие здоровье классник httpstcomvunsixdi,"(0, 32)\t0.5\n (0, 224)\t0.5\n (0, 198)\t0...",speech
7,7,@dntbliev НЕ ПАЛИ.,neutral,"1,26789820717614E+018",dntbliev не пасть,"(0, 313)\t0.6384933294190861\n (0, 279)\t0....",neutral
8,8,@BTS_twt ты такой красивый 😭😭😭🥺💓,positive,"1,28155155170687E+018",bts_twt ты такой красивый,"(0, 236)\t0.5949625294958931\n (0, 444)\t0....",positive
9,9,"@Ladyzchensk Цыган , хуле ...",negative,"1,25776169012236E+018",ladyzchensk цыган хула,"(0, 467)\t0.5773502691896257\n (0, 470)\t0....",skip
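
In both experiments the whole list of phrases is passed to the model in a single call. For a large dataset it may be safer to feed the list in chunks. The sketch below assumes only what the cells above already rely on: that the model is a callable taking a list of strings and returning one label per string. The helper `predict_in_batches`, the `batch_size` value and the stand-in `dummy_model` are hypothetical and are not part of DeepPavlov.

```python
def predict_in_batches(model, phrases, batch_size=64):
    """Call `model` on consecutive slices of `phrases` and concatenate the labels."""
    labels = []
    for start in range(0, len(phrases), batch_size):
        batch = phrases[start:start + batch_size]
        labels.extend(model(batch))
    return labels

# Hypothetical stand-in for a real model, so the sketch runs on its own
def dummy_model(batch):
    return ["neutral" for _ in batch]

phrases = [f"text {i}" for i in range(150)]
labels = predict_in_batches(dummy_model, phrases, batch_size=64)
print(len(labels))
```

With a real model the call would be `df['tone'] = predict_in_batches(model, df['clean_text'].tolist())`; a suitable batch size depends on the available memory.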
