# Hate Speech Detection on Assamese - HASOC 2023

## Team: Code Fellas
- Members: Abhinav, Adarsh, Ananya, Dinesh

In the HASOC 2023 competition, our team "Code Fellas" took on the challenge of hate speech detection in the Assamese language. We employed a variety of approaches, ranging from basic machine learning models to more advanced deep learning techniques.

### Approaches Explored:

1. **Traditional Models:**
   - Logistic Regression
   - Support Vector Machine (SVM)
   - XGBoost
   - Decision Trees

2. **Deep Learning Models:**
   - LSTM (Long Short-Term Memory)
   - BiLSTM (Bidirectional LSTM)
   - LSTM with CNN 1D
   - BiLSTM with CNN 1D
   - XLM-RoBERTa
   - mBERT (cased and uncased)
   - M-RoBERTa
   - Assamese BERT
   - DistilBERT
   - IndicBERT
   
### Results:

Of all the models we tried, IndicBERT gave the best results for hate speech detection in Assamese, reaching an F1 score of 0.69726.
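
For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score is computed from true and predicted labels. The toy labels and the helper names `f1_for_class` and `macro_f1` are ours, purely for illustration; the macro average is shown as one common choice, and the averaging behind the score reported above is not stated in this notebook.

```python
def f1_for_class(truths, preds, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(truths, preds) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(truths, preds) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(truths, preds) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(truths, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(truths) | set(preds))
    return sum(f1_for_class(truths, preds, c) for c in classes) / len(classes)


# Toy example (made-up labels): 0 = NOT, 1 = HOF
toy_truths = [1, 1, 1, 0, 0, 0]
toy_preds = [1, 1, 0, 0, 0, 1]
toy_score = macro_f1(toy_truths, toy_preds)
print(round(toy_score, 5))
```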

Our journey in this competition allowed us to delve into the complexities of hate speech detection, explore a wide range of models, and understand their strengths and weaknesses in the context of Assamese text.

We're proud of our team's collaborative efforts and the achievements we've made in advancing the field of hate speech detection for the Assamese language. We look forward to future opportunities to contribute to such meaningful tasks.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install sentencepiece

In [None]:
!pip install transformers torch

In [None]:
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModel
from torch.utils.data import DataLoader, TensorDataset

## 1. Set the train and test data paths and choose the model name

In [None]:
TRAIN = '/content/train_A_AH_HASOC2023.csv'
TEST = '/content/test_A_AH_HASOC2023.csv'
MAPPER = ['text', 'task_1'] # [X, Y]
# MODEL = 'bert-base-multilingual-cased'
# MODEL = 'bert-base-multilingual-uncased'
# MODEL = 'unitary/multilingual-toxic-xlm-roberta'
# MODEL = 'l3cube-pune/assamese-bert'
# MODEL = 'xlm-mlm-100-1280'
# MODEL = 'distilbert-base-multilingual-cased'
MODEL = 'ai4bharat/indic-bert'
NUM_LABELS = 2

## 2. Split the training data into train and validation sets. We first hold out 20% for validation (test size 0.2); later we train on the full training data

In [None]:
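# Load the training data
train_df = pd.read_csv(TRAIN)

# Encode the text labels as integers: NOT -> 0, HOF -> 1
train_df[MAPPER[1]] = train_df[MAPPER[1]].map({'NOT': 0, 'HOF': 1})
# Hold out 20% as a validation split (named test_df here), stratified on the label column
train_df, test_df = train_test_split(train_df, test_size=0.2, random_state=42, stratify=train_df[MAPPER[1]])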
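
To show what stratifying the split buys us, here is a small pure-Python sketch that splits each class separately so the held-out part keeps the same label ratio as the full data. `stratified_split` and the toy labels are hypothetical names and data for illustration; this is not scikit-learn's implementation, only the idea behind the `stratify` argument.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class contributes the same fraction to the held-out part."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # held-out share of this class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy data: 70 NOT (0) and 30 HOF (1) labels
toy_labels = [0] * 70 + [1] * 30
train_idx, val_idx = stratified_split(toy_labels)
val_counts = Counter(toy_labels[i] for i in val_idx)
print(dict(val_counts))  # the 70/30 ratio is preserved in the held-out part
```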

In [None]:
!pip install sacremoses

## 3. Load the tokenizer and the pretrained model

In [None]:
# AutoModelForSequenceClassification adds a classification head, so the model returns a loss (when labels are passed) and logits
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL)    # Create the tokenizer
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=NUM_LABELS)  # Create the model with NUM_LABELS output classes
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')     # Use the CUDA GPU if available; otherwise fall back to the CPU

## 4. Preprocess the data: tokenize the text and convert it to PyTorch tensors

In [None]:
def preprocess_data(df):
    inputs = tokenizer(df[MAPPER[0]].tolist(), padding=True, truncation=True, return_tensors='pt', max_length=128)
    labels = torch.tensor(df[MAPPER[1]].tolist())
    return inputs, labels

train_inputs, train_labels = preprocess_data(train_df)   # Tokenize the training split and convert it to tensors
test_inputs, test_labels = preprocess_data(test_df)  # Same for the held-out validation split, which still has labels
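
As a rough picture of what `padding=True`, `truncation=True` and `max_length` do, the sketch below pads and truncates lists of made-up token ids and builds the matching attention mask. `pad_and_truncate` is a hypothetical helper, the ids are invented, and the real tokenizer also handles special tokens and its own pad id and padding side, so treat this only as a simplified illustration.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut every sequence to max_length, then pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


# Two made-up token id sequences, with a small max_length for readability
toy_ids, toy_mask = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(toy_ids)
print(toy_mask)
```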

In [None]:
batch_size = 96     # Taking the batch size as 96

train_dataset = TensorDataset(train_inputs['input_ids'], train_inputs['attention_mask'], train_labels)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = TensorDataset(test_inputs['input_ids'], test_inputs['attention_mask'], test_labels)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
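
To make the batching concrete, this sketch cuts a list into consecutive chunks the way an unshuffled loader that keeps the last partial batch would. `iter_batches` and the example size of 200 items are assumptions for illustration and do not reflect the actual dataset size.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical example: 200 items with the batch size used above
toy_items = list(range(200))
toy_batches = list(iter_batches(toy_items, 96))
toy_sizes = [len(b) for b in toy_batches]
print(toy_sizes)  # the last batch holds the leftover items

# The number of batches is the item count divided by the batch size, rounded up
expected_batches = math.ceil(len(toy_items) / 96)
```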

## 5. Defining our optimizer

In [None]:
model.to(device)

# AdamW optimizer with an initial learning rate of 1e-5
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# Exponential decay scheduler with gamma = 0.9; it only changes the learning rate when scheduler.step() is called,
# and that call is commented out in the training loop, so the rate stays at 1e-5
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
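
For reference, exponential decay multiplies the learning rate by gamma after every scheduler step, so after n steps the rate is the initial rate times gamma to the power n. The sketch below computes those values in plain Python. `exponential_lr` is a hypothetical helper, and the values assume one `scheduler.step()` per epoch, which the training loop in this notebook leaves commented out.

```python
def exponential_lr(initial_lr, gamma, num_epochs):
    """Learning rate used in each epoch when the rate decays by gamma once per epoch."""
    return [initial_lr * gamma ** epoch for epoch in range(num_epochs)]


# Same starting rate and gamma as above, for the first five epochs
schedule = exponential_lr(1e-5, 0.9, 5)
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.3e}")
```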

## 6. Training Phase of the Model

In [None]:
num_epochs = 30  # start with 30 epochs; later, set this to the best epoch found

PATIENCE = 2  # stop training once the validation loss has failed to improve on its best value for 2 epochs in a row
best_val_loss = float('inf')  # best validation loss so far; starts at infinity so the first epoch always improves on it
early_stopping_counter = 0  # number of consecutive epochs without a new best validation loss

# Train the model epoch by epoch
for epoch in range(num_epochs):
    model.train()  # switch to training mode
    total_loss = 0  # accumulated training loss over all batches in this epoch

    # Train batch by batch, using the batch size defined above
    for batch in train_loader:
        input_ids, attention_mask, labels = batch
        # Move the input tensors to the selected device (GPU if available)
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        optimizer.zero_grad()  # clear the gradients left over from the previous batch
        # Passing labels makes the model return the loss along with the logits
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()  # backpropagation: compute the gradients of the loss w.r.t. the model parameters
        optimizer.step()  # update the model parameters from the computed gradients

        total_loss += loss.item()  # add this batch's loss to the epoch total
    model.eval()  # switch to evaluation mode for the validation pass
    val_loss = 0  # accumulated validation loss for this epoch
    with torch.no_grad():
        for batch in test_loader:  # test_loader holds the validation split here
            input_ids, attention_mask, labels = batch
            input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)  # move the tensors to the device

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)  # forward pass on the validation batch
            val_loss += outputs.loss.item()  # accumulate the validation loss (not accuracy) over the batches

    # scheduler.step()  # left disabled, so the learning rate stays constant
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        early_stopping_counter = 0  # new best validation loss: reset the counter
    else:
        early_stopping_counter += 1  # no improvement over the best validation loss: increase the counter by 1

    print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {total_loss / len(train_loader)}, val_loss: {val_loss / len(test_loader)}')  # print the average losses for this epoch
    if early_stopping_counter >= PATIENCE:  # stop training once the counter reaches the patience value
        print("Early stopping triggered.")
        break

## We check the epoch up to which both the training loss and the validation loss kept decreasing, then re-run training for that many epochs. This keeps the model from overfitting or underfitting.

## Here, training stopped at the 6th epoch, and the 4th epoch was the best one.
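
The stopping rule can be traced on its own with a few loss values. The sketch below replays the same counter logic as the training loop on a made-up list of validation losses; `simulate_early_stopping` and the numbers are hypothetical, chosen only so that the best epoch is the 4th and training halts at the 6th, matching the pattern described above.

```python
def simulate_early_stopping(val_losses, patience=2):
    """Return (best_epoch, stopped_at) for a sequence of per-epoch validation losses."""
    best_loss = float('inf')
    best_epoch = None
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, counter = loss, epoch, 0  # new best: reset the counter
        else:
            counter += 1  # no improvement over the best loss so far
        if counter >= patience:
            return best_epoch, epoch  # halt before seeing any later epochs
    return best_epoch, len(val_losses)


# Made-up losses: improvement through epoch 4, then two epochs without a new best
hypothetical_losses = [0.69, 0.66, 0.64, 0.61, 0.62, 0.63, 0.60]
best_epoch, stopped_at = simulate_early_stopping(hypothetical_losses)
print(best_epoch, stopped_at)
```

Note that the last value in the list is never reached: with a patience of 2, training halts before a later improvement could be seen, which is the usual trade-off of a small patience value.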

## 7. Evaluation on the held-out split (from the train/test split) and prediction phase

In [None]:
from sklearn.metrics import classification_report  # per-class precision, recall and F1
model.eval()  # switch to evaluation mode
predictions = []  # predicted labels
truths = []  # true labels
with torch.no_grad():
    for batch in test_loader:  # evaluate on the held-out split batch by batch
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)  # move the batch to the device

        outputs = model(input_ids, attention_mask=attention_mask)  # forward pass without labels: only the logits are needed
        logits = outputs.logits
        predicted_labels = torch.argmax(logits, dim=1)  # pick the class with the highest logit for each example
        truths.extend(labels.cpu().tolist())  # move the true labels to the CPU and collect them
        predictions.extend(predicted_labels.cpu().tolist())  # move the predicted labels to the CPU and collect them

print(classification_report(truths, predictions))  # compare the true and predicted labels
# test_df['predicted_label'] = predictions
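
The step from logits to labels is just a row-wise argmax. A plain-Python sketch with made-up logits is shown below; `argmax_labels` is a hypothetical helper standing in for `torch.argmax(logits, dim=1)` on a nested list.

```python
def argmax_labels(logits):
    """For each row of logits, return the index of the largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Made-up logits for three examples; column 0 = NOT, column 1 = HOF
toy_logits = [[0.2, -0.1], [-1.3, 0.4], [2.0, 1.5]]
toy_predicted = argmax_labels(toy_logits)
print(toy_predicted)
```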

## 8. Make predictions on the original (provided) test data

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)  # load the tokenizer
# model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=NUM_LABELS)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # select the device
test_df = pd.read_csv(TEST)  # read the original test data from the path defined in step 1

# Tokenize the text only; the provided test data is not expected to carry labels
def preprocess_data(df):
    inputs = tokenizer(df[MAPPER[0]].tolist(), padding=True, truncation=True, return_tensors='pt', max_length=128)
    return inputs


test_inputs = preprocess_data(test_df)  # preprocess the test data
test_dataset = TensorDataset(test_inputs['input_ids'], test_inputs['attention_mask'])
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)  # serve the test data in batches, in the original order

In [None]:
model.eval()  # switch to evaluation mode
predictions = []  # predicted labels

with torch.no_grad():
    for batch in test_loader:  # go through the test data batch by batch
        input_ids, attention_mask = batch
        input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)  # move the batch to the device

        outputs = model(input_ids, attention_mask=attention_mask)  # forward pass
        logits = outputs.logits
        predicted_labels = torch.argmax(logits, dim=1)  # pick the class with the highest logit for each example
        predictions.extend(predicted_labels.cpu().tolist())

test_df['predicted_label'] = predictions  # add the predictions as a new column of the test dataframe

In [None]:
test_df #viewing the resulting test dataframe

In [None]:
test_df["predicted_label"].value_counts()  # check the class balance of the predictions, e.g. to spot a model that predicts only one class

In [None]:
test_df = test_df.drop("text", axis=1)  # drop the text column, which is not needed in the output file

In [None]:
test_df

In [None]:
test_df[MAPPER[1]] = test_df["predicted_label"].map({0: 'NOT', 1: 'HOF'}) #reverse mapping the binary labels to the original text labels
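
The label encoding and its reverse form a simple round trip. The sketch below builds the reverse mapping from the forward one so the two cannot drift apart; `LABEL_TO_ID`, `ID_TO_LABEL` and the toy labels are names and data we introduce for illustration.

```python
# Forward mapping used when encoding the training labels
LABEL_TO_ID = {'NOT': 0, 'HOF': 1}
# Reverse mapping derived from it, used when writing predictions back as text
ID_TO_LABEL = {idx: label for label, idx in LABEL_TO_ID.items()}

# Toy round trip on made-up labels
toy_text_labels = ['HOF', 'NOT', 'NOT', 'HOF']
toy_encoded = [LABEL_TO_ID[label] for label in toy_text_labels]
toy_decoded = [ID_TO_LABEL[idx] for idx in toy_encoded]
print(toy_encoded, toy_decoded)
```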

In [None]:
test_df

In [None]:
test_df = test_df.drop("predicted_label",axis=1)

In [None]:
test_df

## 9. Saving the predictions to the file

In [None]:
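# Note: the file name mentions l3cube-pune/assamese-bert and epoch 7, while MODEL above is set to ai4bharat/indic-bert; rename it to match the run if needed
test_df.to_csv("final_assamese_l3cube-pune_assamese-bert_epoch_7.csv", index=False)  # save the predictions dataframe to a CSV file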

In [None]:
test_df.head(2)