# NLP - Toxic Comments Classifier (Report)

Team Number: 6
<br/>
Team Members: Dina Boshnaq, Iris Loret, Ingrid Hansen
<br/>
Streamlit app URL: https://toxiccomments.streamlit.app/

## Introduction
For this assignment we classify comments as toxic or not toxic. The dataset is the Toxic Comment Classification dataset from Kaggle. We do single-label classification (toxic or not toxic) instead of multi-label classification. We use a Transformer model from Hugging Face, specifically the DistilBERT base model (uncased). We apply transfer learning: we load the pre-trained DistilBERT tokenizer and build our own classifier on top of the pre-trained language model.
We split our dataset, train the model, and finally make predictions on text that we pass to the model.

We run our code on Kaggle because the dataset is hosted there, and in our experience this used less memory. We first tried Colab, but we ran out of memory there, so we switched to Kaggle.

## EDA
We start by downloading the dataset from Kaggle. We use the train.csv file to train our model. We add the file to a dataset on Kaggle and then load it as a pandas dataframe in our code.

We did a simple EDA on the data where we checked the different columns and their datatypes.

<img src="reportImages/info.jpg" width="400"/>

The data has a comment_text column with the comments and 6 columns, one for each type of toxicity (1 means the comment is of that type, 0 means it is not). Since we're only doing single-label classification, we want a single label column to predict, which we call is_toxic. For each row, we check whether any of the 6 toxicity columns is set to 1. If at least one is, we set is_toxic to 1; otherwise we set it to 0. Then we drop the 6 columns, keeping only comment_text and is_toxic. We also drop the id column since we have no use for it.
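
The row-wise check described above can be sketched in plain Python. This is an illustration of the logic only, not the notebook's code: the six column names are assumed to match the header of the Kaggle train.csv file, and the two sample rows are made up.

```python
# Assumption: these are the six toxicity columns in train.csv.
TOXICITY_COLUMNS = [
    "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate",
]

def add_is_toxic(rows):
    """Return new rows that keep only comment_text and a derived is_toxic flag."""
    result = []
    for row in rows:
        # is_toxic is 1 if at least one of the six flags is 1, otherwise 0.
        is_toxic = int(any(row[col] == 1 for col in TOXICITY_COLUMNS))
        result.append({"comment_text": row["comment_text"], "is_toxic": is_toxic})
    return result

# Two made-up rows, for illustration only.
sample_rows = [
    {"id": "a1", "comment_text": "Thanks for the help!", "toxic": 0,
     "severe_toxic": 0, "obscene": 0, "threat": 0, "insult": 0, "identity_hate": 0},
    {"id": "b2", "comment_text": "(an insulting comment)", "toxic": 1,
     "severe_toxic": 0, "obscene": 0, "threat": 0, "insult": 1, "identity_hate": 0},
]

labeled_rows = add_is_toxic(sample_rows)
print(labeled_rows)
```

With a pandas dataframe, the same flag can be computed column-wise, for example with a row-wise `any` over the six columns followed by a cast to integer.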

When checking the value counts of the is_toxic column, the value 0 had 143346 entries while the value 1 had 16225. The data is therefore imbalanced, and it was also more data than we needed while we were still testing. We undersampled and balanced the data by taking only 2000 samples from each class (the 0 entries and the 1 entries). This makes training the model faster and easier. Once this runs, we know the approach works and we can increase the number of samples to get a better model.
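
A minimal sketch of this kind of undersampling is shown below. It is library-free and uses toy data; the function name, the seed, and the scaled-down `per_class=50` are choices made for this example, whereas the report uses 2000 samples per class.

```python
import random

def undersample(rows, label_key="is_toxic", per_class=2000, seed=42):
    """Randomly keep at most `per_class` rows for each label value, then shuffle."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, min(per_class, len(group))))
    rng.shuffle(balanced)
    return balanced

# Toy data with a 9:1 imbalance (900 non-toxic rows, 100 toxic rows).
toy_rows = [{"comment_text": f"c{i}", "is_toxic": int(i % 10 == 0)} for i in range(1000)]

balanced_rows = undersample(toy_rows, per_class=50)
counts = {label: sum(r["is_toxic"] == label for r in balanced_rows) for label in (0, 1)}
print(counts)
```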

An issue that was encountered later on when trying to train the model is that it wasn't seeing the is_toxic column as the target label column, so we had to rename it to "label" per the documentation.

Up to this point, the label is an integer that is either 0 or 1. This alone isn't enough to identify the labels, especially since we set the number of labels to 2 when making the model. The model configuration, in particular the num_labels parameter, needs to match the number of unique labels used during training, and id2label and label2id need to be correctly configured in the config.json file in the model directory.
If the labels are only numeric (0 and 1), the model might expect a mapping like id2label = {0: '0', 1: '1'} and label2id = {'0': 0, '1': 1}. With string labels the mapping is more intuitive: id2label = {0: 'Not Toxic', 1: 'Toxic'} and label2id = {'Not Toxic': 0, 'Toxic': 1}. We tried both, and the second approach worked: we first map 0 to "Not Toxic" and 1 to "Toxic", then encode the labels:

    id2label = {0: "Not Toxic", 1: "Toxic"}
    label2id = {"Not Toxic": 0, "Toxic": 1}
    df_toxic_balanced["label"] = df_toxic_balanced["label"].apply(lambda x: label2id[x])

As a result, the config.json that is produced later when making the model is correct: it contains the 2 labels and the correct problem type, "single_label_classification".
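
The two-step mapping (integers to names, then names back to integer ids) can be illustrated without a dataframe. The label values in `raw_labels` are made up, and building `label2id` by inverting `id2label` is just one convenient way to keep the two dictionaries consistent.

```python
id2label = {0: "Not Toxic", 1: "Toxic"}
# Invert id2label so both dictionaries always agree with each other.
label2id = {name: idx for idx, name in id2label.items()}

raw_labels = [0, 1, 1, 0]  # made-up is_toxic values

# Step 1: replace the integers with readable names.
named_labels = [id2label[value] for value in raw_labels]

# Step 2: encode the names back to the integer ids the model trains on.
encoded_labels = [label2id[name] for name in named_labels]

print(named_labels)
print(encoded_labels)
```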

The final correct config.json:

<img src="reportImages/rightConfig.jpg" width="400"/>

The initial wrong config.json:

<img src="reportImages/wrongConfig.jpg" width="400"/>


We then make a Hugging Face dataset from our dataframe so that we can apply tokenization to it.

## Tokenizing the data

The pre-trained transformer model we're using is a distilled version of the BERT base model (DistilBERT), specifically the uncased one (case insensitive).
We decided to try this one first since it's smaller and faster than BERT but based on it. It was pre-trained in a self-supervised way using masked language modeling (MLM), which lets the model learn a bidirectional representation of a sentence. It should get the job done. There are other transformer models that are more accurate at classifying comments by toxicity, but they're bigger and take longer to train. Since we're testing things out first, this model suffices.

A tokenizer converts raw input (sentences) into smaller units called tokens (words or pieces of words) and maps them to numeric ids, which is the form the model can process.
We initialize the tokenizer with the AutoTokenizer class from the Hugging Face transformers library. This class loads the pre-trained tokenizer that belongs to the pre-trained DistilBERT uncased model. We set the use_fast parameter to True to enable the fast tokenizer.

A function named "preprocess" applies the tokenizer to the "comment_text" column of the dataset, which contains the input text to be analyzed. The tokenizer is called with truncation=True and max_length=128, so sequences longer than 128 tokens are truncated to 128 tokens. This caps the length of the input sequences, which makes processing by the model more efficient. In summary, this function transforms the raw text in the "comment_text" column into tokenized sequences suitable as input to the model.

We apply the preprocess function to our dataset with batched=True, so the function is applied to batches of samples rather than to each sample individually. Then we create a DataCollatorWithPadding object (a class in the Hugging Face transformers library) to form batches of the data. We pass it the tokenizer we made earlier and the padding strategy 'max_length', which pads every sequence to a fixed maximum length (the strategy 'longest' would instead pad to the longest sequence in the batch). Batches of uniform length allow efficient parallel processing in deep learning frameworks. Two operations work together to achieve this: truncation and padding.
Truncation shortens sequences that exceed the maximum length by removing tokens from the end; in our case it is done by the tokenizer in the preprocess function. Padding adds special padding tokens (with id 0 for this model) to the end of sequences that are shorter than the target length; this is done by the data collator.
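
To make truncation and padding concrete, here is a small library-free sketch on lists of token ids. It is a simplification and not the Hugging Face implementation: the helper names and token ids are invented for the example, the maximum length is scaled down from 128 to 5, and real tokenizers also preserve special tokens such as the closing separator when truncating. The sketch shows padding either to the longest sequence in the batch or to a fixed target length.

```python
def truncate(token_ids, max_length):
    """Drop tokens from the end so the sequence has at most max_length tokens."""
    return token_ids[:max_length]

def pad_batch(batch, pad_token_id=0, target_length=None):
    """Pad already-truncated sequences to target_length (default: longest in batch).

    Returns the padded ids plus an attention mask (1 = real token, 0 = padding).
    """
    if target_length is None:
        target_length = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = target_length - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for two comments of different lengths.
raw_batch = [[101, 7, 8, 9, 10, 11, 102], [101, 7, 102]]

truncated = [truncate(seq, max_length=5) for seq in raw_batch]
padded_longest = pad_batch(truncated)                  # pad to longest in batch
padded_fixed = pad_batch(truncated, target_length=8)   # pad to a fixed length

print(padded_longest)
print(padded_fixed)
```

The attention mask is what lets the model ignore the padding positions, so the added tokens do not influence the prediction.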

## Creating Train and Test set

We split the tokenized dataset into training and test sets: 70% for training and 30% for testing. Our dataset is now the following:

<img src="reportImages/datasetSplit.jpg"/>

We define the training set and test set individually so we can pass them to the model later. They are called tok_train_dataset and tok_test_dataset.
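
The 70/30 split can be sketched as a shuffle followed by a cut. This is a plain-Python illustration, not the call used in the notebook; the function name, the seed, and the integer stand-ins for tokenized comments are assumptions made for the example. The total of 4000 follows from the 2000 samples per class chosen earlier.

```python
import random

def train_test_split(items, test_size=0.3, seed=42):
    """Shuffle a copy of items and split it into (train, test) parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(4000))  # stand-ins for the 4000 tokenized comments
train_examples, test_examples = train_test_split(examples)
print(len(train_examples), len(test_examples))  # prints: 2800 1200
```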

## Creating an Evaluation Metric

We use the Hugging Face library "Evaluate" for evaluating our model. It offers a wide range of evaluation tools, grouped into 3 types. We use the "Metric" type, which measures the performance of a model on a given dataset, usually by comparing the model's predictions to ground truth labels. From this type we use the "Accuracy" metric. According to the documentation on the Hugging Face website, accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where:

- TP: True positive
- TN: True negative
- FP: False positive
- FN: False negative

Accuracy is a common evaluation metric for classification models. It suits our binary classification of toxic and non-toxic comments, especially since we balanced the two classes.

We then make a compute_metrics function to be used later in the evaluation of the model. It takes the model's evaluation output (eval_pred) as a parameter, extracts the predicted labels, compares them to the true labels (references), and calculates the accuracy using the loaded accuracy metric. The accuracy value is then returned. This function encapsulates the logic for computing accuracy during the evaluation phase and is used in a later step by the model trainer.
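
The logic of such a function can be sketched without any libraries. This is not the notebook's implementation, which relies on the accuracy metric loaded from Evaluate; here the accuracy is computed by hand, `eval_pred` is assumed to be a pair of (logits, labels), and the logits and labels are made up.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(predictions, references):
    """Proportion of predictions that equal the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def compute_metrics(eval_pred):
    """Turn per-class scores into predicted labels and score them."""
    logits, labels = eval_pred
    predictions = [argmax(row) for row in logits]
    return {"accuracy": accuracy(predictions, labels)}

# Made-up scores for four comments; column 0 = Not Toxic, column 1 = Toxic.
example_logits = [[2.1, -1.3], [0.2, 1.7], [1.0, 0.4], [-0.5, 0.9]]
example_labels = [0, 1, 1, 1]

metrics = compute_metrics((example_logits, example_labels))
print(metrics)  # prints: {'accuracy': 0.75}
```

Three of the four predictions match the labels, so (TP + TN) / (TP + TN + FP + FN) = 3 / 4.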

## Making the model


## Conclusion

### Making the model

    # Download the pre-trained model
    model = AutoModelForSequenceClassification.from_pretrained(
        pretrained_model,
        num_labels=2,
        id2label=id2label,
        label2id=label2id,
        output_attentions=True,
    )

Output:

    Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias']
    - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
    - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight

    # Defining the training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=2e-5,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        fp16=True,
        num_train_epochs=2,
        weight_decay=0.01,
    )

    # Training the model on our data with our specific training arguments
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tok_train_dataset,
        eval_dataset=tok_test_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()

Output:

    You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

    TrainOutput(global_step=350, training_loss=0.2803288922991071, metrics={'train_runtime': 150.4657, 'train_samples_per_second': 37.218, 'train_steps_per_second': 2.326, 'total_flos': 741817432473600.0, 'train_loss': 0.2803288922991071, 'epoch': 2.0})
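
The reported global_step=350 can be cross-checked with a little arithmetic. The sketch below assumes an effective batch size that is not stated anywhere in this report; it is inferred from the ratio of train_samples_per_second to train_steps_per_second in the training output, so treat it as an estimate.

```python
import math

# Figures from the report: 2000 samples per class, 70% used for training, 2 epochs.
balanced_samples = 2 * 2000
train_samples = balanced_samples * 70 // 100
num_train_epochs = 2

# Throughput figures copied from the training output.
train_samples_per_second = 37.218
train_steps_per_second = 2.326

# Assumption: the effective batch size is inferred from the throughput ratio.
effective_batch_size = round(train_samples_per_second / train_steps_per_second)

steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(effective_batch_size, steps_per_epoch, total_steps)  # prints: 16 175 350
```

Under this assumption the computed total matches the 350 steps reported by the trainer.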

    # Saving the trained model along with other training-related information
    trainer.save_model("comments_model")

    # Saving the raw model object as is, without any additional information related to training
    # We will use this for prediction
    model_save_path = '/kaggle/working/comments_model.pkl'

    with open(model_save_path, 'wb') as model_file:
        pickle.dump(model, model_file)

    # Loading the raw model
    model_pickle_path = '/kaggle/working/comments_model.pkl'

    with open(model_pickle_path, 'rb') as model_file:
        model = pickle.load(model_file)

### Testing the model

    # Load the pre-trained tokenizer and model
    model_path = "/kaggle/working/comments_model"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)

    # Create a text classification pipeline using the loaded model and tokenizer
    pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)

    # Make predictions on sample texts and print the results
    print(pipeline("You are beautiful"))
    print(pipeline("You are ugly"))

Output:

    [{'label': 'Not Toxic', 'score': 0.7726908326148987}]
    [{'label': 'Toxic', 'score': 0.9823653101921082}]
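
Each result contains the most likely label and a score between 0 and 1. As far as we understand the pipeline's default behaviour for single-label classification, this score is the softmax probability of the winning class. The sketch below illustrates that computation in plain Python; the logits are hypothetical and the `classify` helper is invented for this example, so it does not reproduce the exact scores shown above.

```python
import math

id2label = {0: "Not Toxic", 1: "Toxic"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for one comment: [Not Toxic, Toxic].
result = classify([-2.0, 2.0])
print(result)
```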


    # For extra insight, we look at the json file
    import json

    config_path = "/kaggle/working/comments_model/config.json"

    with open(config_path, 'r') as config_file:
        config = json.load(config_file)

    config


Output:

    {'_name_or_path': 'distilbert-base-cased',
     'activation': 'gelu',
     'architectures': ['DistilBertForSequenceClassification'],
     'attention_dropout': 0.1,
     'dim': 768,
     'dropout': 0.1,
     'hidden_dim': 3072,
     'id2label': {'0': 'Not Toxic', '1': 'Toxic'},
     'initializer_range': 0.02,
     'label2id': {'Not Toxic': 0, 'Toxic': 1},
     'max_position_embeddings': 512,
     'model_type': 'distilbert',
     'n_heads': 12,
     'n_layers': 6,
     'output_attentions': True,
     'output_past': True,
     'pad_token_id': 0,
     'problem_type': 'single_label_classification',
     'qa_dropout': 0.1,
     'seq_classif_dropout': 0.2,
     'sinusoidal_pos_embds': False,
     'tie_weights_': True,
     'torch_dtype': 'float32',
     'transformers_version': '4.29.2',
     'vocab_size': 28996}