## Problem Statement

The Conversation AI team, a research initiative founded by Jigsaw and Google, builds technology to protect voices in conversation. In 2020, Jigsaw organized a Kaggle competition in which competitors had to build machine learning models that identify toxicity in online conversations, where toxicity is defined as `anything rude, disrespectful, or otherwise likely to make someone leave the discussion`. If these contributions can be identified, we could have a safer, more collaborative internet.

## Dataset Description

As part of the competition, competitors were provided several files, specifically:

`jigsaw-toxic-comment-train.csv` - data from the [Jigsaw toxic comment classification competition](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge). The dataset is made up of English comments from Wikipedia’s talk page edits.

`jigsaw-unintended-bias-train.csv` - data from the [Jigsaw Unintended Bias in Toxicity Classification competition](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification). This is an expanded version of the Civil Comments dataset with a range of additional labels.

`sample_submission.csv` - a sample submission file.

`test.csv` - comments from Wikipedia talk pages in different non-English languages.

`validation.csv` - comments from Wikipedia talk pages in different non-English languages.

## Evaluation Metric
Submissions were evaluated based on Area Under the ROC Curve between the predicted probability and the observed target.
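
To make the metric concrete, here is a minimal sketch of ROC AUC computed as the fraction of (toxic, non-toxic) pairs that the model ranks correctly, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the competition code; in practice a library implementation would normally be used.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via pairwise ranking (Mann-Whitney U identity)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Count positive/negative pairs that are ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 3 of the 4 pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

Because the metric depends only on the ranking of the predictions, the predicted probabilities do not need to be calibrated to score well.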

## Strategies to Tackle
### `Monolingual Approach`
Monolingual models are language models trained on a single language.

They are focused on understanding and generating text in a specific language.

For example, a monolingual model trained on English will be proficient in understanding and generating English text.

These models are typically used for tasks such as text classification and sentiment analysis within a specific language.

Monolingual models are beneficial when the same single language appears in our training data, our test data and the unseen data we expect later.

### `Multilingual Approach`
Multilingual models, on the other hand, are trained on multiple languages.

They are designed to handle and process text in multiple languages, allowing them to perform across different languages.

Multilingual models have the advantage of providing language-agnostic solutions, as they can handle a wide range of languages.

They can be used for zero-shot and few-shot learning, where the model performs a task in a language it has not been specifically trained on by leveraging its knowledge of other languages.

### `Which models to use for our problem?`
According to the competition data, the training dataset contains only English comments. The validation dataset contains very few samples, in `Spanish`, `Turkish` and `Italian` only, and the test dataset contains `Turkish`, `Spanish`, `Italian`, `Russian`, `French` and `Portuguese`.

Since our validation and test datasets contain non-English languages, building multilingual models is a better approach than building monolingual models.

If we had only one language across all datasets (as described above), monolingual models would be the better choice.

Let's discuss the multilingual approach a bit more:

How are multilingual models trained?

Multilingual models are pre-trained on a mix of different languages, and they don't distinguish between the languages.

The English BERT was pre-trained on English Wikipedia and the BookCorpus dataset, while multilingual models such as mBERT and XLM-Roberta were pre-trained on Wikipedia text from the 102 languages with the largest Wikipedias and on CommonCrawl data covering 100 languages, respectively.

### `Cross Lingual Transfer`
Cross-lingual transfer refers to transfer learning using data and models available for one language for which ample such resources are available (e.g., English) to solve tasks in another, commonly more low-resource, language.

In our case, we are trying to create an application that can automatically detect whether a sentence or phrase is toxic or not. 

Models like XLM-Roberta let us fine-tune on an English dataset and then identify toxic comments in other languages.

XLM-R is able to take the task-specific knowledge it learnt in one language and apply it to other languages, even though it never saw those languages during fine-tuning.

This concept of transfer learning applied from one language to another is referred to as `Cross-Lingual Transfer (AKA Zero-Shot Learning)`.

Another reason to use pre-trained multilingual models for a task like ours is the `Lack of resources for some languages`, i.e., different languages have different amounts of training data available for building models like BERT and its variants.

Languages such as English, Chinese, Russian, Indonesian and Vietnamese are high-resource languages, whereas languages such as Sundanese and Assamese are low-resource languages.

Training our own BERT-like model for these low-resource languages would be costly in terms of both data collection and performance; therefore, we should leverage these multilingual models.

### `What experiments did I perform ?`
Overall, I performed 9 experiments, keeping the following ideas in mind.

Apply pre-processing techniques such as removing stopwords, removing URLs, expanding contractions, collapsing repeated characters in words and removing punctuation.

From a model standpoint, we experimented with 2 models: mBERT and XLM-Roberta.

From a training-data perspective, we used 2 types of datasets: in one case, the provided training datasets, balanced by the target; in the other, the training dataset translated into the languages present in the test dataset, combined with the original English data, again with class balancing.

We always trained on the validation dataset as well to further boost the performance of the model.
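
The pre-processing steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `preprocess` function, the tiny `CONTRACTIONS` table and the `STOPWORDS` set are hypothetical stand-ins (the notebook downloads NLTK's stopword list instead), and the exact rules used in the experiments are not shown in this document.

```python
import re
import string

# Hypothetical, tiny lookup tables used only for illustration.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'm": "i am"}
STOPWORDS = {"the", "a", "is", "are", "to"}


def preprocess(text):
    """Lowercase, strip URLs, expand contractions, collapse repeats, drop punctuation and stopwords."""
    text = text.lower()
    # Remove URLs before touching punctuation so that they are still recognisable.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Expand contractions using the lookup table.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Collapse any character repeated three or more times into a single one.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop stopwords and normalise whitespace.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)


cleaned = preprocess("I don't like thisss!!! See https://example.com")
print(cleaned)  # i do not like this see
```

Rules like these are language-specific (the tables above are English only), which is worth keeping in mind when the validation and test comments are in other languages.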

Now we build a model with the following ideas:
> Train on the original training and validation datasets with class balancing (undersampling), fine-tune the model for 2 epochs on the training dataset and then on the validation dataset, and apply no preprocessing to the text. We will be leveraging the TPUs offered by Kaggle.

## Installing Libraries

In [1]:
!pip install nltk
!pip install transformers --quiet

import re
import nltk
import string
import os, gc
import pandas as pd
import tensorflow as tf
from transformers import TFAutoModel
from transformers import AutoTokenizer
nltk.download('stopwords')


True

## Setting data paths

In [2]:
main_data_dir_path = "../input/jigsaw-multilingual-toxic-comment-classification/"
toxic_comment_train_csv_path = main_data_dir_path + "jigsaw-toxic-comment-train.csv"
unintended_bias_train_csv_path = main_data_dir_path + "jigsaw-unintended-bias-train.csv"
validation_csv_path = main_data_dir_path + "validation.csv"
test_csv_path = main_data_dir_path + "test.csv"
submission_csv_path = main_data_dir_path + "sample_submission.csv"

## TPU Configurations

Initializing the TPU configuration and other constants: the number of epochs, `BATCH_SIZE` (16 × the number of TPU cores), `MAX_LEN` (the maximum sequence length in tokens), the model (`xlm-roberta-large`), the number of samples for undersampling (150k), the learning rate (1e-5), and so on.

In [3]:
#################### TPU Configurations ####################
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

AUTO = tf.data.experimental.AUTOTUNE
# Configuration
EPOCHS = 2
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 192
MODEL = 'xlm-roberta-large'
NUM_SAMPLES = 150000
RANDOM_STATE = 42
LEARNING_RATE = 1e-5 ######################### MAIN CHANGE ############################
WEIGHT_DECAY = 1e-6

Running on TPU  
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: local
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/
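
As a quick check of the batch-size arithmetic in the configuration above, the sketch below computes the global batch size for 8 replicas and the resulting number of steps per epoch. The dataset size `num_examples` is a hypothetical value chosen only to make the numbers concrete; the real training set size is not stated in this document.

```python
# Values taken from the configuration cell and the TPU log above.
PER_REPLICA_BATCH = 16
NUM_REPLICAS = 8  # the log reports "Num TPU Cores: 8"

# Each training step processes one global batch, split evenly across replicas.
global_batch_size = PER_REPLICA_BATCH * NUM_REPLICAS

# Hypothetical dataset size, for illustration only.
num_examples = 100_000
steps_per_epoch = num_examples // global_batch_size

# Examples not covered by the integer number of steps in one pass.
leftover = num_examples - steps_per_epoch * global_batch_size

print(global_batch_size, steps_per_epoch, leftover)  # 128 781 32
```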

## Reading & Balancing the data by Target column

In [4]:
## Reading csv files 
train1 = pd.read_csv(toxic_comment_train_csv_path)
train2 = pd.read_csv(unintended_bias_train_csv_path)
valid = pd.read_csv(validation_csv_path)
test = pd.read_csv(test_csv_path)
sub = pd.read_csv(submission_csv_path)

## Converting floating points to integers ##
train2.toxic = train2['toxic'].round().astype(int)

##### BALANCING THE DATA #####
# Take all rows from the toxic-comment training file, all toxic rows from the
# unintended-bias training file, and a random sample of NUM_SAMPLES (150k)
# non-toxic rows from the unintended-bias training file.

# Combine train1 with a subset of train2
train = pd.concat([
    train1[['comment_text', 'toxic']],
    train2[['comment_text', 'toxic']].query('toxic==1'),
    train2[['comment_text', 'toxic']].query('toxic==0').sample(n=NUM_SAMPLES, random_state=RANDOM_STATE)
])

## Dropping missing observations with respect to comment-text column 
train = train.dropna(subset=['comment_text'])
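
The undersampling applied to the unintended-bias file above can be illustrated without pandas. The sketch below is a simplified stand-in, not the notebook's code: the `undersample` helper and the toy `rows` are hypothetical, and it mirrors only the part of the cell that keeps every toxic row and samples a fixed number of non-toxic rows.

```python
import random


def undersample(rows, n_negative, seed=42):
    """Keep every toxic row plus a random sample of n_negative non-toxic rows."""
    toxic = [r for r in rows if r["toxic"] == 1]
    non_toxic = [r for r in rows if r["toxic"] == 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(non_toxic, min(n_negative, len(non_toxic)))
    return toxic + sampled


# Toy data (hypothetical): 10 toxic rows and 90 non-toxic rows.
rows = [{"comment_text": f"c{i}", "toxic": int(i % 10 == 0)} for i in range(100)]
balanced = undersample(rows, n_negative=20)
print(len(balanced), sum(r["toxic"] for r in balanced))  # 30 10
```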

In [5]:
def encode(texts, tokenizer, max_len):
    """
    Tokenizes a list of texts with a Hugging Face tokenizer,
    truncating or padding every text to max_len tokens.
    """       
    tokens = tokenizer(texts, max_length=max_len, 
                    truncation=True, padding='max_length',
                    add_special_tokens=True, return_tensors='np')
    
    return tokens
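
To show what `truncation=True` and `padding='max_length'` amount to, here is a minimal sketch that pads or cuts a list of token ids to a fixed length and builds the matching attention mask. The helper `pad_or_truncate`, the `PAD_ID` value and the token ids are assumptions for illustration; the real tokenizer also keeps the closing special token when it truncates, which this sketch does not model.

```python
PAD_ID = 1  # assumption: placeholder padding id used only for this illustration


def pad_or_truncate(token_ids, max_len, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = token_ids[:max_len]          # cut sequences that are too long
    mask = [1] * len(ids)              # 1 marks real tokens
    padding = max_len - len(ids)
    # Pad short sequences; 0 in the mask tells the model to ignore padding.
    return ids + [pad_id] * padding, mask + [0] * padding


# Hypothetical token ids for a short sentence.
input_ids, attention_mask = pad_or_truncate([0, 87, 1220, 2], max_len=6)
print(input_ids)       # [0, 87, 1220, 2, 1, 1]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

Fixed-length inputs matter here because the model's input layers are declared with a static shape of `max_len`.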

## Encoding comment_text

We first initialize the tokenizer from the Hugging Face transformers library and then encode the comment texts of our training, validation and test datasets.

In [6]:
## Intializing the tokenizer ##
tokenizer = AutoTokenizer.from_pretrained(MODEL)

train_inputs = encode(train['comment_text'].values.tolist(), 
                      tokenizer, max_len=MAX_LEN)
validation_inputs = encode(valid['comment_text'].values.tolist(),
                          tokenizer, max_len=MAX_LEN)
test_inputs = encode(test['content'].values.tolist(),
                    tokenizer, max_len=MAX_LEN)



## Preparing data using the tf.data.Dataset API

We write a function that creates a tuple of inputs and outputs, where the inputs are a dictionary.

We'll use the `tf.data.Dataset` API to pass our inputs and outputs as a tuple `(inputs, outputs)`, where the inputs are `{"input_ids": input_ids, "attention_mask": attention_mask}` and the outputs are the labels.

In [7]:
def map_fn(input_ids, attention_mask, labels=None):
    if labels is not None:
        return {"input_ids": input_ids, "attention_mask": attention_mask}, labels
    else:
        return {"input_ids": input_ids, "attention_mask": attention_mask}

In [8]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_inputs["input_ids"],
                                                    train_inputs["attention_mask"],
                                                   train['toxic']))
train_dataset = train_dataset.map(map_fn)
train_dataset = train_dataset.repeat().shuffle(buffer_size=2048,seed=RANDOM_STATE).batch(BATCH_SIZE).prefetch(AUTO)

validation_dataset = tf.data.Dataset.from_tensor_slices((validation_inputs['input_ids'],
                                                         validation_inputs['attention_mask'],
                                                        valid['toxic']))
validation_dataset = validation_dataset.map(map_fn)
validation_dataset = validation_dataset.batch(BATCH_SIZE).prefetch(AUTO)

test_dataset = tf.data.Dataset.from_tensor_slices((test_inputs['input_ids'],
                                                  test_inputs['attention_mask']))
test_dataset = test_dataset.map(map_fn)
test_dataset = test_dataset.batch(BATCH_SIZE)

## Building the model

In [9]:
def build_model(transformer_layer, max_len):
    """
    Creating the model input layers, output layers,
    model definition and compilation.
        
    Returns: model object after compiling. 
    """
    input_ids = tf.keras.layers.Input(shape=(max_len,), 
                                      dtype=tf.int32, 
                                      name="input_ids")
    attention_mask = tf.keras.layers.Input(shape=(max_len,), 
                                       dtype=tf.int32, 
                                       name="attention_mask")
    output = transformer_layer.roberta(input_ids, 
                                 attention_mask=attention_mask)[1]
    x = tf.keras.layers.Dense(1024, activation='relu')(output)
    y = tf.keras.layers.Dense(1, activation='sigmoid',name='outputs')(x)
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask],
                             outputs=y)
    
    optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE,
                                         weight_decay=WEIGHT_DECAY)
    loss = tf.keras.losses.BinaryCrossentropy()
    AUC = tf.keras.metrics.AUC()
    
    model.compile(loss=loss, metrics=[AUC], optimizer=optimizer)    
    return model
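
The output layer and loss used above can be written out by hand. The following sketch computes a sigmoid and the mean binary cross-entropy in plain Python; the function names, the clipping constant `eps` and the sample logits are illustrative assumptions, and Keras' own implementation may differ in numerical details.

```python
import math


def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical logits for one toxic (label 1) and one non-toxic (label 0) comment.
probs = [sigmoid(z) for z in (2.0, -1.0)]
loss = binary_cross_entropy([1, 0], probs)
print(round(loss, 4))  # 0.2201
```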

## Loading model on TPUs

It is important to initialize and compile the model inside the `with strategy.scope():` block.

One thing I want to point out: for some reason, I got slightly different results on each run even though I set the seed before building the model. The differences are very small, so the results stay consistent across runs of the pipeline.

In [10]:
with strategy.scope():
    transformer_layer = TFAutoModel.from_pretrained(MODEL)
    tf.random.set_seed(RANDOM_STATE)
    model = build_model(transformer_layer,
                        max_len=MAX_LEN)
model.summary()

Downloading tf_model.h5: 100%|██████████| 2.24G/2.24G [00:46<00:00, 48.0MB/s]
All model checkpoint layers were used when initializing TFXLMRobertaModel.

All the layers of TFXLMRobertaModel were initialized from the model checkpoint at xlm-roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaModel for predictions without further training.


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 192)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 192)]        0           []                               
                                                                                                  
 roberta (TFXLMRobertaMainLayer  TFBaseModelOutputWi  559890432  ['input_ids[0][0]',              
 )                              thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 192,                                           

## Training the model on Only English data for 2 epochs

In [11]:
train_steps_per_epoch = train_inputs['input_ids'].shape[0] // BATCH_SIZE
train_history = model.fit(train_dataset,
                         steps_per_epoch=train_steps_per_epoch,
                         validation_data=validation_dataset,
                         epochs=2) 

Epoch 1/2




Epoch 2/2


## Training the model on Validation data for 2 epochs further to fine-tune on it

In [12]:
validation_steps_per_epoch = validation_inputs['input_ids'].shape[0] // BATCH_SIZE
validation_history = model.fit(validation_dataset.repeat(),
                              steps_per_epoch=validation_steps_per_epoch,
                              epochs=2)

Epoch 1/2
Epoch 2/2


- Public leaderboard score on Kaggle (test dataset): 0.936; private leaderboard score: 0.9346

## Results
All experiments used BCE loss and fine-tuned for 2 epochs each on the training and validation datasets. "Original" means the original training and validation datasets; "Translated" means the training data translated into the languages present in the test dataset (along with the original English data) and the validation dataset.

| Experiment | Model | Preprocessing | Training data | Learning rate | Public Test LeaderBoard Score | Private Test LeaderBoard Score |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | mBERT | No | Original | 2e-5 | 0.8850 | 0.8869 |
| 2 | xlm-roberta-large | No | Original | 2e-5 | 0.9259 | 0.9264 |
| 3 | mBERT | Yes | Original | 2e-5 | 0.8259 | 0.8239 |
| 4 | xlm-roberta-large | Yes | Original | 2e-5 | 0.8755 | 0.8754 |
| 5 | mBERT | No | Translated | 1e-5 | 0.9195 | 0.9212 |
| 6 | xlm-roberta-large | No | Translated | 1e-5 | 0.9329 | 0.9212 |
| 7 | mBERT | Yes | Translated | 1e-5 | 0.8696 | 0.9212 |
| 8 | xlm-roberta-large | Yes | Translated | 1e-5 | 0.8861 | 0.8866 |
| 9 | xlm-roberta-large | No | Original | 1e-5 | 0.936 | 0.9346 |

## Predicting on Test dataset 

In [13]:
sub['toxic'] = model.predict(test_dataset, verbose=1)
sub.to_csv('submission.csv', index=False)





In [14]:
sub.head()

Unnamed: 0,id,toxic
0,0,0.000308
1,1,0.000241
2,2,0.266148
3,3,6.3e-05
4,4,7.8e-05


## Saving the model

In [15]:
model_save_path = "../working/Multilingual_toxic_comment_classifier"
model.save(model_save_path)



INFO:tensorflow:Assets written to: ../working/Multilingual_toxic_comment_classifier/assets




## Loading the model

In [16]:
import tensorflow as tf

model_save_path = "../working/Multilingual_toxic_comment_classifier"
loaded_model = tf.keras.models.load_model(model_save_path)
y = loaded_model.predict(test_dataset.take(1))
y[:6]



array([[3.0799874e-04],
       [2.2472920e-04],
       [2.6646560e-01],
       [5.7183450e-05],
       [7.6287179e-05],
       [3.1223629e-02]], dtype=float32)

Next we write a function to prepare new text: we encode it with the tokenizer, using a maximum sequence length of 192.

In [17]:
from transformers import AutoTokenizer
tokenizer_ = AutoTokenizer.from_pretrained("xlm-roberta-large")

text = "politicians are like cancer for this country"
def prep_data(text, tokenizer, max_len=192):
    tokens = tokenizer(text, max_length=max_len, 
                    truncation=True, padding='max_length',
                    add_special_tokens=True, return_tensors='tf')
    
    return {"input_ids": tokens['input_ids'],
            "attention_mask": tokens['attention_mask']}

Predicting the probability of toxic and non-toxic on a sample text.

In [18]:
prob_of_toxic_comment = loaded_model.predict(prep_data(text=text, tokenizer=tokenizer_, max_len=192))[0][0]
prob_of_non_toxic_comment = 1 - prob_of_toxic_comment
probs = {"prob_of_toxic_comment": prob_of_toxic_comment,
         "prob_of_non_toxic_comment": prob_of_non_toxic_comment}
probs



{'prob_of_toxic_comment': 0.26497197,
 'prob_of_non_toxic_comment': 0.7350280284881592}
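
Since `model.predict` returns NumPy values, it can be convenient to convert the probability to a plain Python `float` before passing it to a user interface or serializing it. The helper below is a small sketch of that idea; the name `to_label_scores` and the range check are assumptions added for illustration.

```python
def to_label_scores(prob_toxic):
    """Build a {label: probability} dict from the toxic-class probability."""
    p = float(prob_toxic)  # also converts NumPy scalars to a Python float
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return {"prob_of_toxic_comment": p,
            "prob_of_non_toxic_comment": 1.0 - p}


scores = to_label_scores(0.25)
print(scores)  # {'prob_of_toxic_comment': 0.25, 'prob_of_non_toxic_comment': 0.75}
```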

### Testing the model with a Gradio app before pushing it to HuggingFace Spaces

In [28]:
!pip3 install gradio --quiet
import tensorflow as tf
import gradio as gr

loaded_model = tf.keras.models.load_model(model_save_path)

from transformers import AutoTokenizer
tokenizer_ = AutoTokenizer.from_pretrained("xlm-roberta-large")

examples_list = ["politicians are like cancer for this country", 
                 "Хохлы, это отдушина затюканого россиянина, мол, вон, а у хохлов еще хуже. Если бы хохлов не было,",
                "Для каких стан является эталоном современная система здравоохранения РФ? Для Зимбабве? Ты тупой? хох",
                ]

def prep_data(text, tokenizer, max_len=192):
    tokens = tokenizer(text, max_length=max_len, 
                    truncation=True, padding='max_length',
                    add_special_tokens=True, return_tensors='tf')
    
    return {"input_ids": tokens['input_ids'],
            "attention_mask": tokens['attention_mask']}

def predict(text):
    prob_of_toxic_comment = loaded_model.predict(prep_data(text=text, tokenizer=tokenizer_, max_len=192))[0][0]
    prob_of_non_toxic_comment = 1 - prob_of_toxic_comment
    probs = {"prob_of_toxic_comment": float(prob_of_toxic_comment),
             "prob_of_non_toxic_comment": float(prob_of_non_toxic_comment)}
    return probs

interface = gr.Interface(fn=predict, inputs=gr.components.Textbox(lines=4,label='Comment'),
                        outputs=[gr.Label(label='Probabilities')], examples=examples_list,
                        title='Multi-Lingual Toxic Comment Classification.',
                        description='XLM-Roberta Large model')
interface.launch(debug=False, share=True)

Running on local URL:  http://127.0.0.1:7865
Running on public URL: https://af370decb4339b429e.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces






Wow!!!

Our application is up and running. This link is only temporary and remains active for only 72 hours. For permanent hosting, we can upload our Gradio app interface to [HuggingFace Spaces](https://huggingface.co/spaces).

Next, manually download all the files and folders from the Kaggle output, along with this Kaggle kernel, to your local machine.

## Turning our Multi-Lingual Toxic Comment Classification Gradio Demo into a deployable app

We'll deploy the demo application on HuggingFace Spaces.

What is HuggingFace Spaces?

It is a resource that allows anybody to host and share machine learning applications.

### Deployed Gradio App Structure
To upload our Gradio app, we'll want to put everything together into a single directory.

For example, our demo might live at the path `demos/multilingual_toxic_comment_files` with the following structure:
```    
demos/
    └── multilingual_toxic_comment_files/
        ├── Multilingual_toxic_comment_classifier/
        │   ├── variables/
        │   │   ├── variables.data-00000-of-00001
        │   │   └── variables.index
        │   ├── fingerprint.pb
        │   ├── keras_metadata.pb
        │   └── saved_model.pb 
        ├── app.py
        ├── examples/
        │   └── sample_comments.csv
        └── requirements.txt
```

Where:
- `Multilingual_toxic_comment_classifier` is our saved fine-tuned model (with its associated binary files).
- `app.py` contains our Gradio app, our data preprocessing function and our predict function.
    **Note**: `app.py` is the default filename used for HuggingFace Spaces, if we deploy our apps there.
- `examples` contains a sample dataframe of toxic and non-toxic comments in Russian, Spanish, English, Italian, Turkish, Portuguese and French to showcase our Gradio application.
- `requirements.txt` lists the dependencies/packages needed to run our application, such as tensorflow, gradio and transformers.
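
As a rough sketch of how the `requirements.txt` described above could be generated, the snippet below writes the three package names into an app directory. The helper `write_requirements` is hypothetical, leaving the packages unpinned is an assumption (pinning the versions used for training is usually safer), and a temporary directory is used so that the example does not touch real files.

```python
import tempfile
from pathlib import Path

# Package names come from the description above; leaving them unpinned is an assumption.
REQUIREMENTS = ["tensorflow", "gradio", "transformers"]


def write_requirements(app_dir):
    """Create app_dir if needed and write one package name per line."""
    app_dir = Path(app_dir)
    app_dir.mkdir(parents=True, exist_ok=True)
    path = app_dir / "requirements.txt"
    path.write_text("\n".join(REQUIREMENTS) + "\n", encoding="utf-8")
    return path


# Write into a throwaway directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    req_path = write_requirements(Path(tmp) / "multilingual_toxic_comment_files")
    written = req_path.read_text(encoding="utf-8")

print(written)
```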

### Creating a demo folder to store our Multilingual Toxic Comment Classifier App files
To begin, we'll create an empty directory `demos/` that will contain all the necessary files for the application.

We can achieve this using Python's `pathlib.Path("path_of_dir")` to establish the directory path and then `pathlib.Path("path_of_dir").mkdir()` to create it.

In [1]:
############### ROOT_DIR : I Have put the files in my E: drive
## Importing Packages 
import shutil
from pathlib import Path
import os

ROOT_DIR = "\\".join(os.getcwd().split("\\")[:2])

## Create the multilingual toxic comment demo path
multilingual_toxic_comment_demo_path = Path(f"{ROOT_DIR}/demos/multilingual_toxic_comment_files")

## Remove files that might already exist, then create a fresh directory.
if multilingual_toxic_comment_demo_path.exists():
    shutil.rmtree(multilingual_toxic_comment_demo_path)

multilingual_toxic_comment_demo_path.mkdir(parents=True,   # Create parent folders if needed
                                           exist_ok=True)  # Don't fail if it already exists

### Creating a folder of example comments to use with our Multilingual Toxic Comment demo
Now we'll create an empty directory called `examples` and store a sample dataset containing comments in Russian, Turkish, English, Spanish, Portuguese, French and Italian. I collected these comments online and created a CSV file from them.

To do so we'll:

1. Create an empty directory `examples/` within the `demos/multilingual_toxic_comment_files` directory.
2. Collect some comment samples from online in these languages and create a CSV file out of them containing both toxic as well as non-toxic comments.

In [2]:
import pandas as pd
from pathlib import Path

## Create examples directory
multilingual_toxic_comment_examples_path = multilingual_toxic_comment_demo_path / "examples"
multilingual_toxic_comment_examples_path.mkdir(parents=True, exist_ok=True)

sample_comments = Path("sample_comments.csv")

comments = {"comment_text": ["Хохлы, это отдушина затюканого россиянина, мол, вон, а у хохлов еще хуже. Если бы хохлов не было, кисель их бы придумал.",
                 "Страницу обнови, дебил. Это тоже не оскорбление, а доказанный факт - не-дебил про себя во множественном числе писать не будет. Или мы в тебя верим - это ты и твои воображаемые друзья?",
                 "В шапке были ссылки на инфу по текущему фильму марвел. Эти ссылки были заменены на фразу Репортим брипидора, игнорируем его посты. Если этого недостаточно, чтобы понять, что модератор абсолютный неадекват, и его нужно лишить полномочий, тогда эта борда пробивает абсолютное дно по неадекватности.",
                 "Про графику было обидно) я так то проходил все серии гта со второй части по пятую, кроме гта 4. И мне не мешала графика ни в одной из частей. На компе у меня было куча видеокарт. Начиная с 32мб RIVA TNT и заканчивая 2Гб 560Ti на которой я спокойно играю который год в танки, гта5, ведьмака3 купил на распродаже и начал проходить. Да, не на ультрах. С пониженными текстурами. И не мешает. Я не понимаю дрочева на графике, требовать графику уровня плойки 4 минимум. Мне надо чтобы глаза не резало, только и всего. По поводу управления, мне не хватает переходника на type c. У меня джойстик есть от иксбокса360. Потенциала в мобильных играх достаточно чтобы забить кнопки как забивались в той же NFS MW в 2005. Не самая плохая игра была.",
                 "This is such an urgent design problem; kudos to you for taking it on. Very impressive!",
                 "haha you guys are a bunch of losers.", "ur a sh*tty comment.",
                 "Il solito vizio,o moda, della sinistra di andare ad aiutare tutti tranne chi ne ha bisogno in casa nostra. Quanti autorespiratori si sarebbero potuti acquistare con 50 milioni di euro? Bastardi traditori della patria!!",
                 'SIAMO ALLA FOLLIA', 
                 '20px Caro editor, encontramos problemas na edição que fez na página Sertanejo universitário. A edição teve de ser revertida por não ser adequada para a Wikipédia. Se quiser experimentar a edição de páginas pode fazê-lo à vontade na página de testes da Wikipédia. Recomenda-se a leitura das páginas Breve introdução sobre a Wikipédia, O que a Wikipédia não é e Erros comuns na Wikipédia. Obrigado pela compreensão.    Vitor       Mazuco    Msg ',
                 "Le contributeur  y  tente de prouver par l absurde que le commentaire de diff du contributeur  x  est ridicule en recopiant ce dernier, et supprime sans autre explication un passage apparemment parfaitement consensuel. Qui plus est, le contributeur  y  ne prend pas la peine de discuter de la précédente contribution du contributeur  x , alors que l article a déjà un bandeau d avertissement à ne pas se lancer dans des guerres d édition. Bref, la prochaine fois, je vous bloque pour désorganisation du projet en vue d une argumentation personnelle. L article est déjà assez instable pour que vous n y mêliez pas une guerre d ego - et si vous n aimez pas qu on vous rappelle de ne pas  jouer au con , qui n est en rien une insulte, mais la détection d un problème de comportement, n y jouez pas. SammyDay (discuter) "]}

pd.DataFrame(comments, 
             columns=['comment_text']).to_csv(multilingual_toxic_comment_examples_path / sample_comments,
                                              index=False)

Now let's verify that our example comments are present. We read the CSV file back from the `demos/multilingual_toxic_comment_files/examples/` directory with `pd.read_csv()` and then format the comments into a list of lists (to make it compatible with the `examples` parameter of Gradio's `gradio.Interface()`).

In [3]:
example_list = [[example] for example in pd.read_csv(multilingual_toxic_comment_examples_path / sample_comments)['comment_text'].tolist()]
example_list

[['Хохлы, это отдушина затюканого россиянина, мол, вон, а у хохлов еще хуже. Если бы хохлов не было, кисель их бы придумал.'],
 ['Страницу обнови, дебил. Это тоже не оскорбление, а доказанный факт - не-дебил про себя во множественном числе писать не будет. Или мы в тебя верим - это ты и твои воображаемые друзья?'],
 ['В шапке были ссылки на инфу по текущему фильму марвел. Эти ссылки были заменены на фразу Репортим брипидора, игнорируем его посты. Если этого недостаточно, чтобы понять, что модератор абсолютный неадекват, и его нужно лишить полномочий, тогда эта борда пробивает абсолютное дно по неадекватности.'],
 ['Про графику было обидно) я так то проходил все серии гта со второй части по пятую, кроме гта 4. И мне не мешала графика ни в одной из частей. На компе у меня было куча видеокарт. Начиная с 32мб RIVA TNT и заканчивая 2Гб 560Ti на которой я спокойно играю который год в танки, гта5, ведьмака3 купил на распродаже и начал проходить. Да, не на ультрах. С пониженными текстурами. И 

### Moving our trained XLM-Roberta model binary files into our multilingual_toxic_comment_files demo directory.
We saved our fine-tuned model in the `output/working/Multilingual_toxic_comment_classifier/` directory, and we'll now move the model files into the `demos/multilingual_toxic_comment_files/` directory, as specified above.

We use Python's `shutil.move()` function, passing in `src` (the source path of the model directory) and `dst` (the destination directory it is moved into).

In [5]:
## Importing Libraries
import shutil

## Create a source path for our target model
multilingual_toxic_comment_model_dir_path = f"{ROOT_DIR}\\output\\working\\Multilingual_toxic_comment_classifier\\"

## Create a destination path for our target model
multilingual_toxic_comment_model_dir_destination = multilingual_toxic_comment_demo_path

## Try to move the file
try:
    print(f"Attempting to move the {multilingual_toxic_comment_model_dir_path} to {multilingual_toxic_comment_model_dir_destination}")
    
    ## Move the model
    shutil.move(src=multilingual_toxic_comment_model_dir_path,
           dst=multilingual_toxic_comment_model_dir_destination)
    
    print("Model move completed")
## If the move fails (e.g. the model has already been moved), check whether it exists at the destination
except OSError:  # shutil.Error is a subclass of OSError
    print(f"No model found at {multilingual_toxic_comment_model_dir_path}, perhaps it's already moved.")
    print(f"Model already exists at {multilingual_toxic_comment_model_dir_destination}: {multilingual_toxic_comment_model_dir_destination.exists()}")

Attempting to move the E:\MultiLingual-Toxic-Comment-Classification\output\working\Multilingual_toxic_comment_classifier\ to E:\MultiLingual-Toxic-Comment-Classification\demos\multilingual_toxic_comment_files
Model move completed
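
To see why the model folder ends up inside the demo directory, here is a minimal, self-contained sketch of how `shutil.move()` behaves when the destination is an existing directory. The temporary folders, the `move_into()` helper and the `saved_model.pb` stand-in file are placeholders used only for illustration, not the project's real paths or code.

```python
import shutil
import tempfile
from pathlib import Path


def move_into(src_dir: Path, dst_dir: Path) -> Path:
    """Move src_dir inside dst_dir and return its new location."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    # When dst is an existing directory, shutil.move() places src inside it
    new_location = shutil.move(str(src_dir), str(dst_dir))
    return Path(new_location)


# Placeholder layout that mimics output/working/<model> and demos/<demo>
root = Path(tempfile.mkdtemp())
model_dir = root / "output" / "working" / "Multilingual_toxic_comment_classifier"
model_dir.mkdir(parents=True)
(model_dir / "saved_model.pb").write_text("placeholder")  # stand-in file, not a real model

demo_dir = root / "demos" / "multilingual_toxic_comment_files"
moved_dir = move_into(model_dir, demo_dir)
print(moved_dir.relative_to(root))
```

The model directory keeps its own name and becomes a subfolder of the demo directory, which is why a relative path such as `Multilingual_toxic_comment_classifier/` can be used from inside the demo folder.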


## Turning our Gradio App into a Python Script (`app.py`)

In [8]:
## Check which directory we are currently in
import os
os.getcwd()

'E:\\MultiLingual-Toxic-Comment-Classification\\notebooks'

Now we'll move into the `demos/` directory, where we'll write some helper utilities.

In `cd ../demos/`, the `..` part moves us up and out of the `notebooks` directory, and the `demos/` part moves us into the `demos` directory.
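
As a quick illustration of how a relative path like `../demos/` is resolved, here is a small sketch. The `resolve_cd()` helper and the `/project` root are hypothetical and only there for the example; it uses POSIX-style paths so the result is the same on every platform.

```python
import posixpath


def resolve_cd(current_dir: str, target: str) -> str:
    """Return the directory that `cd target` would lead to from current_dir."""
    # normpath collapses ".." segments and drops the trailing slash
    return posixpath.normpath(posixpath.join(current_dir, target))


# Hypothetical POSIX-style project root, used only for illustration
print(resolve_cd("/project/notebooks", "../demos/"))  # /project/demos
```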

In [9]:
cd ../demos/

E:\MultiLingual-Toxic-Comment-Classification\demos


In [19]:
%%writefile multilingual_toxic_comment_files/app.py
import tensorflow as tf
import gradio as gr
import pandas as pd
from transformers import AutoTokenizer

model_save_path = "Multilingual_toxic_comment_classifier/"
### Loading the fine-tuned model ###
loaded_model = tf.keras.models.load_model(model_save_path)
### Initializing the tokenizer ###
tokenizer_ = AutoTokenizer.from_pretrained("xlm-roberta-large")

examples_list = [
    [example]
    for example in pd.read_csv("examples/sample_comments.csv")["comment_text"].tolist()
]


def prep_data(text, tokenizer, max_len=192):
    tokens = tokenizer(
        text,
        max_length=max_len,
        truncation=True,
        padding="max_length",
        add_special_tokens=True,
        return_tensors="tf",
    )

    return {
        "input_ids": tokens["input_ids"],
        "attention_mask": tokens["attention_mask"],
    }


def predict(text):
    # The model returns one toxicity score per input; take the first (only) value
    prob_of_toxic_comment = loaded_model.predict(
        prep_data(text=text, tokenizer=tokenizer_, max_len=192)
    )[0][0]
    prob_of_non_toxic_comment = 1 - prob_of_toxic_comment
    # gr.Label expects a mapping of label -> confidence with plain Python floats
    probs = {
        "prob_of_toxic_comment": float(prob_of_toxic_comment),
        "prob_of_non_toxic_comment": float(prob_of_non_toxic_comment),
    }
    return probs


interface = gr.Interface(
    fn=predict,
    inputs=gr.components.Textbox(lines=4, label="Comment"),
    outputs=[gr.Label(label="Probabilities")],
    examples=examples_list,
    title="Multi-Lingual Toxic Comment Classification.",
    description="XLM-Roberta Large model",
)
interface.launch(debug=False)

Overwriting multilingual_toxic_comment_files/app.py
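
The post-processing inside `predict()` can be tried out without loading the model. The sketch below assumes, as the app code suggests, that the model outputs a single probability of the comment being toxic. The `to_label_scores()` helper and the clamping step are additions for illustration and are not part of `app.py`.

```python
def to_label_scores(prob_toxic: float) -> dict:
    """Turn one toxicity probability into the label -> confidence mapping shown by gr.Label."""
    # Clamp to [0, 1] (an extra safety step, not in app.py) so no score can be negative
    prob_toxic = min(max(float(prob_toxic), 0.0), 1.0)
    return {
        "prob_of_toxic_comment": prob_toxic,
        "prob_of_non_toxic_comment": 1.0 - prob_toxic,
    }


scores = to_label_scores(0.25)
print(scores)  # {'prob_of_toxic_comment': 0.25, 'prob_of_non_toxic_comment': 0.75}
```

Keeping this step in a small pure function makes it easy to check the output format separately from the (slow) model call.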


### Creating a requirements.txt file for our Gradio App (`requirements.txt`)
This is the last file we need to create for our application.

It lists all the packages our Gradio application needs.

When we deploy our demo app to HuggingFace Spaces, it will read this file and install the dependencies listed there so our application can run:

1. `tensorflow==2.12`
2. `pandas==1.5.2`
3. `gradio==3.1.4`
4. `transformers==4.28.1`

In [16]:
%%writefile multilingual_toxic_comment_files/requirements.txt
tensorflow==2.12
pandas==1.5.2
gradio==3.1.4
transformers==4.28.1

Overwriting multilingual_toxic_comment_files/requirements.txt
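
Before deploying, it can help to compare these pins with what is installed locally. The following is a rough sketch using the standard library's `importlib.metadata`. The `parse_pins()` and `installed_version()` helpers are hypothetical names, and the parser only handles simple `name==version` lines like the ones above.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def installed_version(package: str):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


requirements_text = """tensorflow==2.12
pandas==1.5.2
gradio==3.1.4
transformers==4.28.1
"""

for package, pinned in parse_pins(requirements_text).items():
    print(f"{package}: pinned {pinned}, installed {installed_version(package)}")
```

The printed versions depend on the local environment, so a mismatch here is only a hint that the local setup differs from what the Space will install.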


## Deploying our Application to HuggingFace Spaces
To deploy our demo, there are two main options for uploading to HuggingFace Spaces:

1. [Uploading via the Hugging Face Web Interface (easiest)](https://huggingface.co/docs/hub/repositories-getting-started#adding-files-to-a-repository-web-ui)
2. [Uploading via the command line or terminal](https://huggingface.co/docs/hub/repositories-getting-started#terminal)

NOTE: To host any application on HuggingFace, we first need to [sign up for a free HuggingFace Account](https://huggingface.co/join)

### Running our Application locally

1. Open the terminal or command prompt.
2. Change into the `multilingual_toxic_comment_files` directory (`cd multilingual_toxic_comment_files`).
3. Create a virtual environment with `python3 -m venv env` (or `python -m venv env`).
4. Activate the environment with `source env/Scripts/activate` (Git Bash on Windows; on Linux or macOS the script is at `env/bin/activate`).
5. Install the dependencies with `pip install -r requirements.txt`.
> If you run into errors, you might need to upgrade `pip` using `pip install --upgrade pip`.  
6. Run the app with `python3 app.py`.

This should start a Gradio demo locally at a URL such as `http://127.0.0.1:7860/`.
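
The activation step is the one that differs most between platforms. As a small sketch, the hypothetical helper below returns where the activation script of a `venv` environment lives: the environment's scripts folder is `Scripts` on Windows and `bin` elsewhere. The environment name `env` matches the steps above.

```python
import os
from pathlib import Path


def activate_script(env_dir: str = "env", os_name: str = os.name) -> Path:
    """Return the path of the activation script inside a venv directory."""
    # venv puts its scripts in "Scripts" on Windows ("nt") and in "bin" elsewhere
    scripts_dir = "Scripts" if os_name == "nt" else "bin"
    return Path(env_dir) / scripts_dir / "activate"


# Prints the path for the platform this code runs on
print(activate_script().as_posix())
```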

### Uploading to Hugging Face
We've verified that our multilingual toxic comment classification application works on our local system.

To upload our application to Hugging Face Spaces, we need to do the following.

1. [Sign up](https://huggingface.co/welcome) for a Hugging Face account.
2. Start a new Hugging Face Space by going to our profile at the top right corner and then select [New Space](https://huggingface.co/new-space).
3. Give the Space a name, e.g. `Chirag1994/multilingual_toxic_comment_classification_app`.
4. Select a license (I am using MIT license).
5. Select Gradio as the Space SDK (software development kit).
6. Choose whether your Space is Public or Private (I am keeping it Public).
7. Click Create Space.
8. Clone the repository locally by running `git clone https://huggingface.co/spaces/[YOUR_USERNAME]/[YOUR_SPACE_NAME]` in the terminal or command prompt. In my case that is `git clone https://huggingface.co/spaces/Chirag1994/multilingual_toxic_comment_classification_app`.
9. Copy or move the contents of the `multilingual_toxic_comment_files` demo folder into the cloned repo folder.
10. To upload and track larger files (e.g., files greater than 10MB), we need to [install Git LFS](https://git-lfs.github.com/), which stands for Git Large File Storage.
11. Open the cloned directory in VS Code (or any editor) and, in a terminal (Git Bash in my case), run `git lfs install` once Git LFS is installed. Then use `git lfs track` to mark the files we want stored with LFS, for example `git lfs track "Multilingual_toxic_comment_classifier/**"` to track every file in the model directory.
12. Create a new `.gitignore` file and list the files and folders that we don't want git to track, such as:
    - `__pycache__/`
    - `.vscode/`
    - `venv/`
    - `.gitignore`
    - `.gitattributes`
13. Add the rest of the files and commit them with:
    - `git add .`
    - `git commit -m "commit message that you want"`
14. Push (upload) the files to Hugging Face:
    - `git push`
15. It might take a couple of minutes to finish, and then the app will be up and running.
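
To decide which files need Git LFS before committing, we can scan the repo folder for files above the size limit. This is a minimal sketch: the `files_larger_than()` helper is hypothetical, the 10MB default follows the figure mentioned in the steps above, and the demo uses tiny placeholder files with a 100-byte limit so it runs instantly.

```python
import tempfile
from pathlib import Path


def files_larger_than(folder: Path, limit_bytes: int = 10 * 1024 * 1024) -> list:
    """Return paths (relative to folder) of files bigger than limit_bytes, sorted."""
    return sorted(
        path.relative_to(folder).as_posix()
        for path in folder.rglob("*")
        if path.is_file() and path.stat().st_size > limit_bytes
    )


# Tiny demo with placeholder files and a 100-byte limit instead of 10MB
demo_repo = Path(tempfile.mkdtemp())
(demo_repo / "app.py").write_bytes(b"x" * 10)
(demo_repo / "model").mkdir()
(demo_repo / "model" / "weights.bin").write_bytes(b"x" * 1000)

print(files_larger_than(demo_repo, limit_bytes=100))  # ['model/weights.bin']
```

Running the helper on the cloned repo folder with the default limit lists the files that are candidates for `git lfs track`.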

## Our Final Application deployed on HuggingFace Spaces 

In [6]:
# IPython is a library to help make Python interactive
from IPython.display import IFrame

# Embed the multilingual toxic comment classifier Gradio demo
IFrame(src="https://chirag1994-multilingual-toxic-comment-classifier.hf.space", width=1000, height=800)