# Search and merge datasets

---
## Notebook contents ##

* Collecting the dataset files listed in the table linked below
* Establishing data types, formats and allowed values for the initial raw database
* Defining merge recommendations for the initial database
* Merge
* Studying the basic functionality of the Great Expectations library (extra task)

After execution, the merged dataset is expected to be saved and shared with the team. 

---
## Data description and dtypes, values schema

* **raw_text_id (int)**: unique identifier of the raw row.  

* **dataset_id (string)**: identifier of the source dataset within the Google Sheet.  

* **source_platform (string)**: name of the platform or resource (site, forum, etc.) the text was obtained from.  
The information was collected empirically from the datasets' description sources (see the links in the table).  
UPD: not all datasets specify a source for each text_raw row, so some values in the source_platform column are comma-separated "header-level" values shared by the whole dataset.  
* **is_verified (float)**: flag indicating whether the original labelling was verified, either manually or automatically.  
The information was collected empirically from the datasets' description sources.
* **text_raw (string)**: original message text without any preprocessing.  
* **is_toxic (int)**: binary target feature of message toxicity (1: toxic, 0: non-toxic).  
Allowed values: `[0, 1, np.nan]`
* **toxicity_type (string)**: multiclass feature defining the type of statement in a narrower classification.  
Allowed values: `['SENSITIVE', 'INSULT', 'INAPPROPRIATE', 'THREAT', 'OBSCENITY', 'UNKNOWN']`.  
UPD: not all binary/multiclass labels are present in the labelling; see the distribution plots.
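
To make the value schema above concrete, here is a minimal sketch of a per-row check written in plain Python. The helper names (`is_missing`, `split_platforms`, `check_row`) are hypothetical and not part of the notebook. Treating an empty string as a missing label is an assumption as well.

```python
import math

# Allowed values, copied from the schema description above.
ALLOWED_IS_TOXIC = {0, 1}
ALLOWED_TOXICITY_TYPES = {
    "SENSITIVE", "INSULT", "INAPPROPRIATE", "THREAT", "OBSCENITY", "UNKNOWN",
}


def is_missing(value) -> bool:
    """True for None, an empty string, or a float NaN (assumed 'missing' markers)."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def split_platforms(source_platform: str) -> list[str]:
    """Split a comma-separated 'header-level' value into platform names."""
    return [p.strip() for p in source_platform.split(",") if p.strip()]


def check_row(row: dict) -> list[str]:
    """Return human-readable problems found in one row (hypothetical helper)."""
    problems = []
    is_toxic = row.get("is_toxic")
    if not is_missing(is_toxic) and is_toxic not in ALLOWED_IS_TOXIC:
        problems.append(f"is_toxic={is_toxic!r} is not one of 0, 1, NaN")
    label = row.get("toxicity_type")
    if not is_missing(label) and label not in ALLOWED_TOXICITY_TYPES:
        problems.append(f"toxicity_type={label!r} is not an allowed label")
    return problems


print(split_platforms("2ch, pikabu"))
print(check_row({"is_toxic": 2, "toxicity_type": "SPAM"}))
```

A check like this only complements the Great Expectations validation used later in the notebook; it does not replace it.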

---
## Multiclass labels notes
### Multiclass labels variations ### 
* **INAPPROPRIATE**: inappropriate statement.
* **SENSITIVE**: sensitive topic.
* **INSULT**: insult.
* **THREAT**: threat.
* **OBSCENITY**: obscenity / vulgarity.

### 'inappropriate' vs 'sensitive' and how to set up `is_toxic` values? ### 
* INAPPROPRIATE. The main sign of INAPPROPRIATE is an inappropriate manner of delivery (a tone that does not fit the context of the topic, rude/vulgar vocabulary).  
Source: https://huggingface.co/apanc/russian-inappropriate-messages  
Therefore the INAPPROPRIATE label **should not** automatically be assigned is_toxic=1.  
* SENSITIVE. For this label the content of the topic matters, not the manner of delivery.  
Within sensitive topics, a statement can express either a socially disapproved or a socially approved opinion.  
So SENSITIVE **should not** automatically be assigned is_toxic=1 either.  
Source: https://aclanthology.org/2021.bsnlp-1.4
* INSULT, THREAT: unambiguously is_toxic=1.
* OBSCENITY: should not be assigned is_toxic=1.
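
The rules above can be sketched as a small mapping function. This is only an illustration under two stated assumptions: that `NORMAL` is the non-toxic label (as in one of the source datasets), and that a comma-joined multi-label value counts as toxic as soon as it contains INSULT or THREAT. The function name `is_toxic_from_labels` is hypothetical.

```python
import math

# Labels that the notes above treat as unambiguously toxic.
ALWAYS_TOXIC = {"INSULT", "THREAT"}
# Assumption: "NORMAL" marks a non-toxic message in the source data.
NON_TOXIC = {"NORMAL"}


def is_toxic_from_labels(toxicity_type: str) -> float:
    """Map a (possibly comma-joined) toxicity_type string to 1.0, 0.0 or NaN."""
    labels = {lab.strip() for lab in toxicity_type.split(",") if lab.strip()}
    if labels & ALWAYS_TOXIC:
        return 1.0  # at least one unambiguously toxic label
    if labels and labels <= NON_TOXIC:
        return 0.0  # only non-toxic labels present
    # INAPPROPRIATE, SENSITIVE, OBSCENITY or no label: leave undecided
    return math.nan


for value in ["INSULT,OBSCENITY", "NORMAL", "SENSITIVE", ""]:
    print(repr(value), "->", is_toxic_from_labels(value))
```

Leaving the undecided cases as NaN keeps them available for later labelling instead of forcing a binary value.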

--- 


In [None]:
# cool stuff clears the outputs:
from IPython.utils import io
with io.capture_output() as captured:
    !pip install .

## Imports

In [None]:
import ast 
import re
import kagglehub
import numpy as np
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 
import great_expectations as gx
import great_expectations.expectations as gxe 

import torch
from transformers import BertTokenizer, BertForSequenceClassification


## Downloads

In [None]:
# Download latest version
path = kagglehub.dataset_download("blackmoon/russian-language-toxic-comments")

print("Path to dataset files:", path)

In [None]:
# Download latest version
path = kagglehub.dataset_download("alexandersemiletov/toxic-russian-comments")

print("Path to dataset files:", path)

In [None]:

# Download latest version
path = kagglehub.dataset_download("nigula/russianinappropriatemessages")

print("Path to dataset files:", path)

## Merge

### Set up merging target schema ##

In [None]:
# target data schema of the resulting dataframe: 
TARGET_SCHEMA = {
    'raw_text_id': 'int',
    'dataset_id': 'string',
    'source_platform': 'string', # name of the site/forum/etc.
    'is_verified': 'float', # whether the source labels were verified manually/automatically

    'text_raw': 'string', # raw text column
    'is_toxic': 'int',  # binary target column
    'toxicity_type': 'string' # multilabel target column
}
# note: "bunch" is the local name for each large source dataset being merged; it is identified in TARGET_SCHEMA by dataset_id.


In [None]:
# different columns in sub-datasets to be renamed: 
COLUMNS_MAP = {
    'comment': 'text_raw',
    'text_message': 'text_raw',
    'label_text': 'text_raw',
    'text': 'text_raw',
    'comments': 'text_raw',
    
    'toxic': 'is_toxic', 
    'primary_label': 'toxicity_type',
    'toxicity': 'is_toxic',
    'hate_speech': 'is_toxic',
    'abusive': 'is_toxic',

    'source': 'source_platform',
    # 'author': 'nickname',
}

In [None]:
df_common = pd.DataFrame(columns=TARGET_SCHEMA.keys())
df_common = df_common.astype(TARGET_SCHEMA)

df_common.info()

In [None]:
df_common.columns

### Set up path variables

In [None]:
# path to save the common df csv: 
df_common_path = 'data/raw/df_common.csv'

# paths of already downloaded datasets from the links (or from the data.zip archive): 
data_0_path = 'data/raw/data_0/labeled.csv'
data_1_txt_path = "data/raw/data_1/dataset.txt"  # txt dataset id 1 
txt_to_csv_path = 'data/raw/data_1/parsed.csv'   # csv path to save txt dataset 1 
data_2_path = 'data/raw/data_2/Inappapropriate_messages.csv'
# dataset 3 data was not found
data_4_train_path = 'data/raw/data_4/train-00000-of-00001.parquet'
data_4_test_path = 'data/raw/data_4/test-00000-of-00001.parquet'
data_5_path = 'data/raw/data_5/russian_distorted_toxicity.tsv'
data_6_path = 'data/raw/data_6/labled.csv'
data_7_path = 'data/raw/data_7/final_data.csv'
data_8_path = 'data/raw/data_8/sensitive_topics.csv'

### Great expectations handlers

In [None]:
# base core great expectations context: 
context = gx.get_context(mode="ephemeral")

# register pandas data sources: 
datasource = context.data_sources.add_or_update_pandas(name="pandas dataframes")
asset = datasource.add_dataframe_asset(name="df_bunch_asset") # asset for a structural bunch block
batch_definition = asset.add_batch_definition_whole_dataframe("bunch") # batch definition for the datasets being merged

In [None]:
def clear_validation_results(val_results) -> str:
    """returns the more readable version of the results"""
    
    results_dict = val_results.to_json_dict()
        
    rows = []
    for r in results_dict['results']:
        row = {
            "expectation": r['expectation_config']['type'],
            "column": r['expectation_config']['kwargs'].get('column'),
            "success": r['success'],
            "unexpected_count": r['result'].get('unexpected_count'),
            "unexpected_percent": r['result'].get('unexpected_percent'),
            "partial_unexpected_list": r['result'].get('partial_unexpected_list'),
        }
        rows.append(row)
    
    # filter only incorrect info:   
    df_invalid = pd.DataFrame(rows)
    df_invalid = df_invalid[df_invalid['success'] == False]
    
    # more readable version of outputs: 
    message = (
        "Batch is incorrect! The following expectations failed:\n\n" +
        df_invalid.to_string(index=False)
    ) 

    return message

def validate_on_intersections(df_merged): 
    """simple validator for merged data"""

    batch = batch_definition.get_batch(batch_parameters={"dataframe": df_merged})
    validator = context.get_validator(batch=batch)
    validator.expect_column_values_to_be_unique("text_raw")
    results = validator.validate()


    return results


def validate_df_batch(
        df_batch: pd.DataFrame
    ):
    """Simple Great Expectations workflow for datasets to be merged.
    df_batch must be composed based on TARGET_SCHEMA. 
    """

    # create batch from df in terms of batch definition: 
    batch = batch_definition.get_batch(batch_parameters={"dataframe": df_batch})

    # Create the batch validator and add its clauses (expectations): 
    validator = context.get_validator(batch=batch)

    for col in TARGET_SCHEMA.keys():
        validator.expect_column_to_exist(col)
    validator.expect_column_values_to_not_be_null("text_raw")
    validator.expect_column_values_to_be_unique("text_raw")

    validator.expect_column_values_to_not_match_regex("text_raw", r"^\s*$")
    # validator.expect_column_values_to_be_unique("raw_text_id")
    validator.expect_column_values_to_be_in_set("is_toxic", [0, 1, ''])
    validator.expect_column_values_to_be_in_set("is_verified", [0, 1])

    # Soft types checking: 
    for col, dtype in TARGET_SCHEMA.items():
        actual = str(df_batch[col].dtype)
        if not actual.startswith(dtype):
            print(f"Warning! Dtype of {col} = {actual}, expected {dtype}")

    # Start validator: 
    results = validator.validate()

    # Check results: 
    if not results.success:
        message = clear_validation_results(results)
        raise ValueError(message)
    else: 
        print('Batch is valid!')

    return results

def merge_on_schema(df: pd.DataFrame, df_bunch: pd.DataFrame) -> pd.DataFrame:
    """df_bunch is the abstract name for a large corpus collected from the internet 
    (specifically the ones processed in this notebook).

    validate_df_batch may also be reused for other datasets to be merged. 
    """

    def reset_indexes(df, df_bunch):
        """Set new index values in df_bunch based on max index found in df"""

        max_existing_id = df["raw_text_id"].max() if not df.empty else -1
        # Find rows that have NaN or duplicate raw_text_id in df_bunch and correct them if found: 
        invalid_mask = df_bunch["raw_text_id"].isna() | df_bunch["raw_text_id"].duplicated()
        if invalid_mask.any():
            n_invalid = invalid_mask.sum()
            print(f"Warning: {n_invalid} invalid raw_text_id(s) found. Reassigning unique IDs...")
            new_ids = pd.RangeIndex(start=max_existing_id + 1, stop=max_existing_id + 1 + n_invalid)
            df_bunch.loc[invalid_mask, "raw_text_id"] = new_ids

        return df_bunch

    # Common check: 
    validate_df_batch(df_bunch)
    
    df_bunch = reset_indexes(df, df_bunch)
    df_bunch["raw_text_id"] = df_bunch["raw_text_id"].astype("int")

    merged_df = pd.merge(df, df_bunch, how='outer')

    # check on the intersections of merged dataframe: 
    inter_results = validate_on_intersections(merged_df)

    # hard types: 
    merged_df['raw_text_id'] = merged_df['raw_text_id'].astype(int)
    merged_df['text_raw'] = merged_df['text_raw'].astype('string')
    merged_df['source_platform'] = merged_df['source_platform'].astype('string')
    # merged_df['nickname'] = merged_df['nickname'].astype('string')
        
    merged_df['toxicity_type'] = merged_df['toxicity_type'].astype('string')

    # if not inter_results.success: 
    #     try: 
    #         merged_df = merged_df.drop_duplicates(subset='text_raw', keep='first')
    #         mergef_df = reset_indexes(mergef_df)
    #     except Exception as ex: 
    #         print(ex)
    #         message = clear_validation_results(inter_results)
    #         raise ValueError(message)

    return merged_df

### Consider each dataset, transform it to a common view, and merge.

## id 0 ## 

In [None]:
data_0 = pd.read_csv(data_0_path)
data_0.head()

In [None]:
data_0 = data_0.rename(columns=COLUMNS_MAP)
data_0.head()

In [None]:
data_0_bunch = pd.DataFrame(columns=df_common.columns, data=data_0)

data_0_bunch['dataset_id'] = 0
data_0_bunch['source_platform'] = '2ch, pikabu'
data_0_bunch['is_verified'] = 1
data_0_bunch['is_toxic'] = data_0_bunch['is_toxic'].astype(int)

data_0_bunch['toxicity_type'] = (
    data_0_bunch['toxicity_type'].fillna('')
).astype(str)

data_0_bunch.head()



In [None]:
df_common = merge_on_schema(
    df=df_common,
    df_bunch=data_0_bunch
)

In [None]:
df_common.shape

In [None]:
df_common.head()

## id 1 ## 

In [None]:
encoding = "utf-8"          

# ----- Parser for a single line -----
label_pattern = re.compile(r'^(?:__label__[^ \t\r\n]+(?:,__label__[^ \t\r\n]+)*)')  

def parse_line(line: str):
    """
    Returns (labels_list, text).
    Examples of labels at the start of a line:
      __label__INSULT text...
      __label__INSULT,__label__THREAT text...
      __label__INSULT,__label__THREAT    text...
    """
    line = line.rstrip("\n")
    m = label_pattern.match(line)
    if not m:
        # if the line does not start with a label, treat the whole line as unlabelled text
        return [], line.strip()
    labels_block = m.group(0)
    # extract the individual labels and drop the "__label__" prefix
    raw_labels = [lab.replace("__label__", "") for lab in labels_block.split(",")]
    # the text is the rest of the line after the labels
    text = line[m.end():].strip()
    return raw_labels, text

# ----- Read the file and collect the rows -----
rows = []
with open(data_1_txt_path, "r", encoding=encoding) as f:
    for i, ln in enumerate(f, start=1):
        if not ln.strip():
            # skip empty lines
            continue
        labels, text = parse_line(ln)
        rows.append({"text": text, "labels": labels, "primary_label": labels[0] if labels else None})

# ----- Build the DataFrame -----
df = pd.DataFrame(rows)
print("Loaded rows:", len(df))
display(df.head(10))


In [None]:
df.to_csv(txt_to_csv_path)

In [None]:
data_1 = pd.read_csv(txt_to_csv_path, index_col=0)

In [None]:
data_1.head()

In [None]:
data_1 = data_1.rename(columns=COLUMNS_MAP)
data_1.head()

In [None]:
data_1_bunch = pd.DataFrame(columns=df_common.columns, data=data_1)

data_1_bunch['dataset_id'] = 1
data_1_bunch['source_platform'] = 'ok.ru'
data_1_bunch['is_verified'] = 1 # competition data is usually verified

data_1_bunch['toxicity_type'] = data_1.loc[:, 'labels'].apply(lambda r: ','.join(ast.literal_eval(r)))
data_1_bunch['toxicity_type'] = (
    data_1_bunch['toxicity_type'].fillna('')
).astype(str)

# data_1_bunch['label'] = data_1_bunch['label'].astype(int)
data_1_bunch.head()

In [None]:
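def set_tox_type(r: str):
    """Map a toxicity_type string to a binary is_toxic value."""
    if r == 'NORMAL': # normal category 
        return 0
    elif r in ['INSULT', 'THREAT']: # toxic categories  
        return 1
    else: 
        # everything else stays unlabelled, including comma-joined values such as 'INSULT,THREAT'
        return np.nan

data_1_bunch['is_toxic'] = data_1_bunch['toxicity_type'].apply(set_tox_type)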

data_1_bunch.head()

In [None]:
data_1_bunch.toxicity_type.unique()

In [None]:
data_1_bunch.isna().sum()

In [None]:
data_1_bunch = data_1_bunch.drop_duplicates(subset='text_raw', keep='first')

In [None]:
df_common = merge_on_schema(df_common, data_1_bunch)

In [None]:
df_common.tail()

In [None]:
df_common.to_csv(df_common_path)

## id 2 ## 

In [None]:
data_2 = pd.read_csv(data_2_path)

In [None]:
data_2.shape

In [None]:
data_2.head()

In [None]:
# why is this distribution so sharply split?..
plt.hist(x=data_2['inappropriate'], bins=10)
# plt.hist(x=pd.read_csv('data/raw/data_2/Inappapropriate_messages_last.csv')['inappropriate'], bins=10)

In [None]:
data_2 = data_2.rename(columns=COLUMNS_MAP)

In [None]:
data_2.head()

In [None]:
data_2_bunch = pd.DataFrame(columns=df_common.columns, data=data_2)

data_2_bunch['dataset_id'] = 2
data_2_bunch['source_platform'] = '2ch.hk, Pikabu.ru, answers.mail.ru'

# Yandex.Toloka is stated as the labelling method => is_verified=1: 
data_2_bunch['is_verified'] = 1 

data_2_bunch['toxicity_type'] = (
    data_2_bunch['toxicity_type'].fillna('')
).astype(str)


# data_1_bunch['label'] = data_1_bunch['label'].astype(int)
data_2_bunch.head()

All rows in this dataset are described as relating to dangerous topics.  
=> we can't label those rows as normal, even if INAPPROPRIATE == 0. All of these topics are at least sensitive. 

In [None]:
data_2_bunch['toxicity_type'] = data_2['inappropriate'].apply(lambda r: 'INAPPROPRIATE' if r >= 0.5 else 'SENSITIVE')
data_2_bunch.head()

In [None]:
# binary labels (is_toxic) are left empty here; they are to be assigned and checked in future experiments.
data_2_bunch.head()

In [None]:
data_2_bunch.dtypes

In [None]:
data_2_bunch.isna().sum()

In [None]:
data_2_bunch = data_2_bunch.drop_duplicates(subset='text_raw', keep='first')

In [None]:
df_common = merge_on_schema(df_common, data_2_bunch)

In [None]:
df_common.tail()

In [None]:
df_common.shape

In [None]:
data_2_bunch['source_platform'].unique()

## id 3 ## 
The data for this dataset was not found.

## id 4 ## 

In [None]:
data_4 = pd.concat([
    pd.read_parquet(data_4_train_path),
    pd.read_parquet(data_4_test_path)
])

In [None]:
data_4.head()

In [None]:
# labels are reversed in this dataset => fix that: 
data_4.loc[:, 'is_toxic'] = data_4.loc[:, 'label'].apply(lambda x: 1 if x==0 else 0)

In [None]:
data_4.head()

In [None]:
data_4 = data_4.rename(columns=COLUMNS_MAP)
data_4_bunch = pd.DataFrame(columns=df_common.columns, data=data_4)

data_4_bunch['dataset_id'] = 4
data_4_bunch['source_platform'] = '2ch, vk' # 

# Hugging Face dataset without any information about labelling or correctness => can't mark it as verified 
data_4_bunch['is_verified'] = 0 
data_4_bunch['toxicity_type'] = (
    data_4_bunch['toxicity_type'].fillna('')
).astype(str)


data_4_bunch['is_toxic'] = data_4_bunch['is_toxic'].astype(int)
data_4_bunch.head()

In [None]:
# try to merge the dfs and see how the intersection expectations work: 
try: 
    merge_on_schema(df_common, data_4_bunch)
except Exception as ex:
    print(ex)

### Demo of GE pros

Great Expectations is a processing layer that reports problems in the dataset that need resolving.  
The specific output for this problem is:   
{'success': False, 'expectation_config': {'type': 'expect_column_values_to_not_match_regex', 'kwargs': {'batch_id': 'pandas dataframes-df_bunch_asset', 'column': 'text_raw', 'regex': '^\\s*$'}

The regex expectation flags empty rows => delete them.  
Duplicates of rows already in the primary df were also found => delete the duplicates from the batch df:  
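
The same cleanup rule (drop blank or whitespace-only texts, drop texts already present in the common dataset, drop repeats within the batch) can be sketched on plain Python lists. `clean_texts` is a hypothetical helper, not part of the notebook, which applies these steps with pandas masks instead.

```python
def clean_texts(batch, existing):
    """Return texts from `batch` that are non-blank, not in `existing`, and not repeated."""
    seen = set(existing)  # texts already present in the common dataset
    kept = []
    for text in batch:
        if text is None or not text.strip():
            continue  # empty or whitespace-only text
        if text in seen:
            continue  # duplicate of an existing row or of an earlier batch row
        seen.add(text)
        kept.append(text)
    return kept


cleaned = clean_texts(["a", " ", "", "b", "a", "c"], existing={"c"})
print(cleaned)
```

Using `strip()` covers both `''` and `' '` in one condition, which would avoid handling the two cases separately for each dataset.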

In [None]:
data_4_bunch[data_4_bunch['text_raw']==''].shape

In [None]:
data_4_bunch = data_4_bunch[data_4_bunch['text_raw']!='']
data_4_bunch= data_4_bunch[~data_4_bunch['text_raw'].isin(df_common['text_raw'])]
data_4_bunch = data_4_bunch.drop_duplicates(subset='text_raw')
data_4_bunch.shape

Trying to merge once more => success: 

In [None]:
df_common = merge_on_schema(df_common, data_4_bunch)

In [None]:
df_common.to_csv(df_common_path)

In [None]:
df_common.shape

In [None]:
data_4_bunch['source_platform'].unique()

## id 5 ## 

In [None]:
data_5 = pd.read_csv(data_5_path, sep='\t')

In [None]:
data_5.head()

In [None]:
data_5['comments'].isna().sum()

In [None]:
data_5['corrected'].isna().sum()

In [None]:
data_5.isna().sum()

In [None]:
data_5 = data_5.rename(columns=COLUMNS_MAP)
pd.DataFrame(columns=df_common.columns, data=data_5)

In [None]:

data_5_bunch = pd.DataFrame(columns=df_common.columns, data=data_5)

data_5_bunch['dataset_id'] = 5
# data_5_bunch['source_platform'] = 'vk, other' # vk passed explicitly, "several source data" named to other
# https://github.com/alla-g/toxicity-detection-thesis/tree/main?tab=readme-ov-file


data_5_bunch['is_verified'] = 0 
data_5_bunch['toxicity_type'] = (
    data_5_bunch['toxicity_type'].fillna('')
).astype(str)

data_5_bunch['is_toxic'] = data_5_bunch['is_toxic'].astype(int)
data_5_bunch.dropna(subset='text_raw', inplace=True)

data_5_bunch.head()

In [None]:
data_5_bunch.shape

In [None]:
data_5_bunch[~data_5_bunch['text_raw'].isin(df_common['text_raw'])].shape

In [None]:
data_5_bunch[~data_5_bunch['text_raw'].isin(df_common['text_raw'])]

In [None]:
# uniqueness: 
data_5[~data_5['text_raw'].isin(df_common['text_raw'])].shape

In [None]:
data_5_bunch = data_5_bunch[~data_5_bunch['text_raw'].isin(df_common['text_raw'])]
data_5_bunch = data_5_bunch[data_5_bunch['text_raw'] != ' ']
data_5_bunch.dropna(subset='text_raw', inplace=True)
data_5_bunch.drop_duplicates(subset='text_raw', inplace=True)

In [None]:
data_5_bunch.shape

In [None]:
data_5_bunch.head()

In [None]:
df_common = merge_on_schema(df_common, data_5_bunch)

## id 6 ## 

In [None]:
data_6 = pd.read_csv(data_6_path)
data_6.head()

In [None]:
data_6 = data_6.rename(columns=COLUMNS_MAP)
data_6_bunch = pd.DataFrame(columns=df_common.columns, data=data_6)

data_6_bunch['dataset_id'] = 6
data_6_bunch['source_platform'] = 'YouTube' 
data_6_bunch['is_verified'] = 0 
data_6_bunch['toxicity_type'] = (
    data_6_bunch['toxicity_type'].fillna('')
).astype(str)

data_6_bunch['is_toxic'] = data_6_bunch['is_toxic'].astype(int)
# data_6_bunch.dropna(subset='text_raw', inplace=True)

data_6_bunch.head()

In [None]:
data_6_bunch[~data_6_bunch['text_raw'].isin(df_common['text_raw'])].shape

In [None]:
data_6_bunch = data_6_bunch[data_6_bunch['text_raw']!='']
data_6_bunch = data_6_bunch[data_6_bunch['text_raw']!=' ']
data_6_bunch.dropna(subset='text_raw', inplace=True)
data_6_bunch = data_6_bunch.drop_duplicates(subset='text_raw')

In [None]:
data_6_bunch.head()

In [None]:
data_6_bunch.source_platform.unique()

In [None]:
df_common = merge_on_schema(df_common, data_6_bunch)

In [None]:
df_common.shape

## id 7 ## 

In [None]:
data_7 = pd.read_csv(data_7_path, sep=';', index_col=0)

In [None]:
data_7.head()

In [None]:
data_7 = data_7.rename(columns=COLUMNS_MAP)
data_7_bunch = pd.DataFrame(columns=df_common.columns, data=data_7)

data_7_bunch['dataset_id'] = 7
data_7_bunch['source_platform'] = 'Social Media, TV-Scripts (South Park)' 

# no information about the labelling is provided: 
data_7_bunch['is_verified'] = 0 
data_7_bunch['toxicity_type'] = (
    data_7_bunch['toxicity_type'].fillna('')
).astype(str)
data_7_bunch['is_toxic'] = data_7_bunch['is_toxic'].astype(int)

data_7_bunch.head()

In [None]:
data_7_bunch.drop_duplicates(subset='text_raw', inplace=True)

In [None]:
df_common = merge_on_schema(df_common, data_7_bunch)

In [None]:
df_common.shape

In [None]:
df_common.to_csv(df_common_path)

In [None]:
df_common['is_toxic'].isna().sum()

In [None]:
df_common.head()

## id 8

In [None]:
data_8 = pd.read_csv(data_8_path)

In [None]:
data_8.head(2)

In [None]:
data_8.shape

How much data does each category contain? Is it necessary to add the data from those columns as separate multilabel values?  
Or is it possible to map these values onto the existing labels? 

In [None]:
cols = [col for col in data_8.columns if col!='text']

for col in cols: 
    print(col)
    dat = data_8[data_8[col] == 1]
    print(dat.shape[0])
    print(dat.text.values[:3])

Check how the categories map onto the existing labels: 
* "online/offline crime" -> THREAT


In [None]:
cols

The data is too heterogeneous to be correctly added to the general dataset

## Review of the correctness, save

In [None]:
df_common.shape

In [None]:
df_common.drop_duplicates(subset='text_raw', inplace=True)
df_common.shape

In [None]:
df_common.toxicity_type.unique() # <- corrected multiple labels in the rows

In [None]:
df_common.source_platform.unique() # <- repaired sources (something was wrong in the previous dataset)

In [None]:
df_common.to_csv(df_common_path)

## Conclusions of data collection 
* Great Expectations can be really useful in simple real-time data pipelines, but for static processing it is slightly redundant (though the library is still worth a look);
* A class imbalance was found in the single-label (binary) data: 200k+ non-toxic vs ~50k toxic. 
* The merged dataset totals 455551 rows, including both multi-label and single-label rows. 
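
To put the imbalance into numbers, here is a minimal sketch that computes inverse-frequency class weights. The counts are the rounded figures quoted above (200k and 50k), not exact values from the dataset. The formula `n_samples / (n_classes * count)` is the common "balanced" heuristic, and `class_weights` is a hypothetical helper name.

```python
def class_weights(counts: dict) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Rounded counts from the conclusions above (illustrative, not exact).
approx_counts = {0: 200_000, 1: 50_000}
weights = class_weights(approx_counts)
ratio = approx_counts[0] / approx_counts[1]
print(weights, ratio)
```

With these rounded counts the toxic class would be weighted about four times more heavily than the non-toxic one; whether to use weights, resampling, or neither is left to the modelling stage.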

### Steps to be done
* Run the existing models (ML, DL, LLM/RAG) on the unlabelled data 
* Find and use the simplest possible cloud storage for this task to store the raw data, precomputed features and models 

# Visualize data fullness and variety

**NOTE**: these plots were created to check data completeness and the correctness of the merge; the main plots are still in `dataset_eda.ipynb`

In [None]:
import pandas as pd
df = pd.read_csv('data/raw/df_common.csv', index_col=0)

In [None]:
df_plot = df.copy()
df_plot['is_toxic'] = df_plot['is_toxic'].fillna('UNKNOWN')
df_plot['is_toxic'] = df_plot['is_toxic'].astype(str)

toxicity_counts = df_plot.groupby(['toxicity_type', 'is_toxic'], sort=True).size().unstack(fill_value=0)
# sort values 
toxicity_counts['total'] = toxicity_counts.sum(axis=1)
toxicity_counts = toxicity_counts.sort_values('total', ascending=False)
toxicity_counts = toxicity_counts.drop(columns='total')

fig, ax = plt.subplots(figsize=(12, 6))
toxicity_counts.plot(kind='bar', stacked=True, ax=ax, width=0.8, alpha=0.7) 
ax.set_xticks(range(len(toxicity_counts.index)))
ax.set_xticklabels(toxicity_counts.index, rotation=20, ha='right')

plt.title('Distribution of message types\n by toxicity', fontsize=14)
plt.xlabel('Toxicity type', fontsize=12)
plt.ylabel('Messages count', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.legend(title='is_toxic')
plt.tight_layout()
plt.show()

'Obscenity' values are missing from the common EDA, but they may still be of interest. 

In [None]:
df['source_platform'].value_counts()

In [None]:
plt.figure(figsize=(10, 6))
df['source_platform'].value_counts().plot(kind='bar', color='steelblue')

plt.title('Top data platforms\nfor combined sources', fontsize=14)
plt.xlabel('Download platform', fontsize=12)
plt.ylabel('Count', fontsize=12)

plt.xticks(rotation=20, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.tight_layout()
plt.show()



In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter

all_sources = []

for item in df['source_platform'].dropna():
    parts = [p.strip() for p in item.split(',')]
    all_sources.extend(parts)

freq_dict = dict(Counter(all_sources))

wordcloud = WordCloud(
    width=1000,
    height=600,
    background_color='white',
    colormap='viridis',
    prefer_horizontal=0.6,
    relative_scaling=0.2  
).generate_from_frequencies(freq_dict)

plt.figure(figsize=(8, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Unique platforms word cloud', fontsize=16)
plt.tight_layout()
plt.show()


Cool, we confirmed that no platforms or unique labels were lost!

# Check the correctness of labelling using the existing models

In [None]:
from IPython.utils import io
with io.capture_output() as captured:
    !pip install .
    !pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126 
    !pip3 install transformers
    !nvcc --version

In [None]:
torch.cuda.is_available()

### russian_toxicity_classifier (ID 0, 1 in Datasets table)
https://huggingface.co/s-nlp/russian_toxicity_classifier

There is only one model in the list above...

In [None]:
from tqdm import tqdm

def inference(model, tokenizer, text, max_length=512):
    '''simple inference example for bert-like models; returns the predicted class id or NaN'''
    try: 
        text = text.strip() 
        # truncate by tokens (not characters) so the input fits the model's limit
        batch = tokenizer.encode(text, return_tensors='pt', truncation=True, max_length=max_length)
        with torch.no_grad():  # no gradients are needed for inference
            outp = model(batch)
        pred = int(torch.argmax(outp.logits, dim=1).item())
    except Exception as ex: 
        print(ex)
        pred = np.nan
    return pred

# load tokenizer and model weights
tokenizer = BertTokenizer.from_pretrained('s-nlp/russian_toxicity_classifier')
model = BertForSequenceClassification.from_pretrained('s-nlp/russian_toxicity_classifier')
model.eval()

max_size = 512 # max input length in tokens; most messages are shorter
results = []
for text_raw in tqdm(df['text_raw'], total=len(df), desc="Processing texts"):
    text = text_raw if isinstance(text_raw, str) else ""
    results.append(inference(model, tokenizer, text, max_length=max_size))

# assign as a plain list: df's index may have gaps, so a fresh pd.Series would be misaligned
df['tox_type_model_0'] = results

In [None]:
df['tox_type_model_0']