# Hate speech in Bangla

Retrieved from: https://www.kaggle.com/naurosromim/bengali-hate-speech-dataset/version/1


Original Paper: https://arxiv.org/abs/2012.09686

## Corpus

In [1]:
import os
import pandas as pd

file_path = os.path.join(os.getcwd(), "data", "bn_hate.csv")

df = pd.read_csv(file_path)
df.head()

Unnamed: 0,sentence,hate,category
0,যত্তসব পাপন শালার ফাজলামী!!!!!,1,sports
1,পাপন শালা রে রিমান্ডে নেওয়া দরকার,1,sports
2,জিল্লুর রহমান স্যারের ছেলে এতো বড় জারজ হবে এটা...,1,sports
3,শালা লুচ্চা দেখতে পাঠার মত দেখা যায়,1,sports
4,তুই তো শালা গাজা খাইছচ।তুর মার হেডায় খেলবে সাকিব,1,sports


In [2]:
categories = set(df["category"].values)
categories

{'Meme, TikTok and others',
 'celebrity',
 'crime',
 'entertainment',
 'politics',
 'religion',
 'sports'}

## Preprocessing

- The paper doesn't mention any specific tokenization or representation method, other than saying the authors used gensim to train word2vec and fastText embeddings.

- Tokenizing social media text is difficult, because people often use non-standard words, in varying forms and frequently misspelled. So what works on a regular Bangla corpus may not work well here.

- The paper also doesn't use contextualized embeddings. It focuses on distributional embeddings such as word2vec.

Let's try the BNLP Toolkit first. If it doesn't work, we can fall back to the BERT tokenizer, which uses WordPiece subword tokenization.
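For intuition about what a rule-based tokenizer has to do on this kind of text, here is a minimal, dependency-free sketch. It is not how BNLP is implemented; `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names used only for illustration. It keeps runs of Bangla characters together and splits off every other symbol, including a danda that is glued to the next word.

```python
import re

# The Bangla block (U+0980-U+09FF) covers letters, vowel signs and the hasanta,
# so a whole word stays together. Python's \w does not match the combining
# vowel signs and would split Bangla words apart.
TOKEN_PATTERN = re.compile(
    r"[\u0980-\u09FF]+"               # a run of Bangla characters
    r"|[A-Za-z0-9]+"                  # a run of Latin letters or digits
    r"|[^\s\u0980-\u09FFA-Za-z0-9]"   # any other single non-space symbol
)


def simple_tokenize(text):
    """Split text into Bangla words, Latin words and single symbols."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("আমি বাংলায় গান গাই"))
print(simple_tokenize("খাইছচ।তুর"))
# ['খাইছচ', '।', 'তুর']
```

A sketch like this does nothing about spelling variants, which is where subword tokenizers tend to help.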

In [3]:
from bnlp import BasicTokenizer

basic_tokenizer = BasicTokenizer()

# test with a bn sentence
tokens = basic_tokenizer.tokenize("আমি বাংলায় গান গাই")
tokens



['আমি', 'বাংলায়', 'গান', 'গাই']

In [4]:
# now on a sentence from the corpus
sample_sentence = df["sentence"].values[0]
tokens = basic_tokenizer.tokenize(sample_sentence)
tokens

['যত্তসব', 'পাপন', 'শালার', 'ফাজলামী', '!', '!', '!', '!', '!']

In [5]:
from string import punctuation

def remove_punctuation(tokens):
    return [token for token in tokens if token not in punctuation]
  
remove_punctuation(tokens)  

['যত্তসব', 'পাপন', 'শালার', 'ফাজলামী']
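One caveat: `string.punctuation` only lists ASCII characters, so a Bangla danda (`।`, U+0964) would survive the filter above if the tokenizer emits it as its own token. A possible extension, sketched here under that assumption, also checks the Unicode category of each character. `is_punctuation` and `remove_punctuation_unicode` are hypothetical helpers, not part of BNLP.

```python
import unicodedata
from string import punctuation as ascii_punctuation


def is_punctuation(token):
    """True if every character is ASCII punctuation or in a Unicode P* category."""
    # Note: an empty token also counts as punctuation here, so it gets dropped.
    return all(
        ch in ascii_punctuation or unicodedata.category(ch).startswith("P")
        for ch in token
    )


def remove_punctuation_unicode(tokens):
    """Drop tokens that consist only of punctuation characters."""
    return [token for token in tokens if not is_punctuation(token)]


print(remove_punctuation_unicode(["খাইছচ", "।", "তুর", "!", "?!"]))
# ['খাইছচ', 'তুর']
```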

Now to clean all the sentences.

In [6]:
sentences = df["sentence"].values
labels = df["hate"].values

In [7]:
# tokenize and remove punctuation

def clean_sentences(sentences):
    cleaned_sentences = []
    for s in sentences:
        tokens = basic_tokenizer.tokenize(s)
        cleaned_sentences.append(remove_punctuation(tokens))
    return cleaned_sentences

In [8]:
X = clean_sentences(sentences)
y = [int(i) for i in labels]

## The issue with pretrained Word2Vec embeddings

This corpus contains a lot of words that are not used in standard Bangla. As a result, most of them will raise a `KeyError` during lookup in a pretrained Word2Vec vocabulary. Besides, the available Bangla Word2Vec embeddings were trained on general, formal corpora, so it doesn't make sense to use them for informal text. Embeddings should be chosen to match the contents of the corpus.

So what to do? Training our own embeddings seems a better option than getting empty vectors from a pretrained word2vec model.
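To put a number on the mismatch before deciding, one could measure the out-of-vocabulary rate of the tokenized corpus against a pretrained vocabulary. The sketch below is only an illustration: `oov_report` is a hypothetical helper, and the toy vocabulary stands in for something like the `key_to_index` mapping of a pretrained gensim model.

```python
from collections import Counter


def oov_report(tokenized_sentences, known_vocab):
    """Return (share of unknown word types, share of unknown tokens, unknown counts)."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    total = sum(counts.values())
    missing = {tok: c for tok, c in counts.items() if tok not in known_vocab}
    type_rate = len(missing) / len(counts) if counts else 0.0
    token_rate = sum(missing.values()) / total if total else 0.0
    return type_rate, token_rate, missing


# Toy data; in practice these would be X and a pretrained vocabulary.
toy_sentences = [["আমি", "গান", "গাই"], ["পাপন", "গান"]]
toy_vocab = {"আমি", "গান", "গাই"}

type_rate, token_rate, missing = oov_report(toy_sentences, toy_vocab)
print(type_rate, token_rate, missing)
# 0.25 0.2 {'পাপন': 1}
```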

## Training a word2vec embedding on this corpus

Well ..... Let's do it.

In [9]:
from gensim.models import Word2Vec

# embedding_dim is the size of a word vector,
# i.e. the word2vec output for a single word
def train_w2v_model(tokenized_sentences, embedding_dim, window_size):
    # max_vocab_size caps the vocabulary while it is built (infrequent words get pruned)
    model = Word2Vec(tokenized_sentences, vector_size=embedding_dim, window=window_size,
                     min_count=1, max_vocab_size=10_000)
    return model

w2v_model = train_w2v_model(X, embedding_dim=300, window_size=5)

In [10]:
w2v_model.wv.most_similar("শয়তান")

[('বেয়াদব', 0.9845836162567139),
 ('মাদারচোদ', 0.9790210127830505),
 ('খোর', 0.9774585962295532),
 ('হালা', 0.9759621024131775),
 ('মাদারচুদ', 0.9738253951072693),
 ('বন্ড', 0.9722453355789185),
 ('চুদ', 0.9721114039421082),
 ('তাহেরি', 0.9688363075256348),
 ('চোরের', 0.9666358232498169),
 ('বেটা', 0.966454029083252)]

In [11]:
vocabulary = w2v_model.wv.index_to_key
len(vocabulary)

2786

## Encode inputs with embeddings

In [12]:
from tqdm import tqdm

def encode_X_tokens_with_embed_idx(X, corpus_embedding=w2v_model):
    encoded_X = list()
    
    for i in tqdm(range(len(X)), desc="encode_tokens_with_embed_idx"):
        idxs = []
        tokens = X[i]

        for token in tokens:
            try:
                idx = corpus_embedding.wv.key_to_index[token]
            except KeyError:
                # token isn't in the vocab; fall back to 0
                # (note: 0 is also a real vocabulary index and the padding value)
                idx = 0

            idxs.append(idx)

        encoded_X.append(idxs)

    return encoded_X


encoded_X = encode_X_tokens_with_embed_idx(X)

encode_tokens_with_embed_idx: 100%|██████████████████████████████████████████| 30000/30000 [00:00<00:00, 196076.75it/s]


In [13]:
encoded_X[0]

[2500, 67, 293, 0]
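The trailing `0` here comes from the unknown-token fallback, but index 0 is also a real vocabulary entry (`index_to_key[0]`) and is the value used for padding. A common way to keep these apart is to reserve dedicated ids and shift the vocabulary indices. The sketch below assumes that convention; `PAD_IDX`, `UNK_IDX`, `build_token_to_id` and `encode` are hypothetical names, and the toy mapping stands in for `w2v_model.wv.key_to_index`.

```python
PAD_IDX = 0  # reserved for padding
UNK_IDX = 1  # reserved for tokens missing from the vocabulary
OFFSET = 2   # real vocabulary entries start after the reserved ids


def build_token_to_id(key_to_index):
    """Shift every vocabulary index so that 0 and 1 stay free."""
    return {tok: idx + OFFSET for tok, idx in key_to_index.items()}


def encode(tokens, token_to_id):
    """Map tokens to ids, sending unknown tokens to UNK_IDX."""
    return [token_to_id.get(tok, UNK_IDX) for tok in tokens]


# Toy stand-in for w2v_model.wv.key_to_index
toy_key_to_index = {"না": 0, "পাপন": 1, "শালার": 2}
token_to_id = build_token_to_id(toy_key_to_index)

print(encode(["পাপন", "শালার", "ফাজলামী"], token_to_id))
# [3, 4, 1]
```

With this scheme the embedding matrix would need two extra rows in front (for example zeros) so that row `i + 2` still holds the vector of vocabulary entry `i`.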

## Padding
Left-pad with 0 up to a maximum sequence length.

Now to find that sequence length ...

In [14]:
max_seq_len = max([len(x) for x in encoded_X])
max_seq_len

560

In [15]:
import numpy as np


def pad_tokens(encoded_X, seq_len=max_seq_len):
    padded = np.zeros(
        (len(encoded_X), seq_len),
        dtype=np.int32
    )

    for i in tqdm(range(len(encoded_X)), desc="pad"):
        tokens = encoded_X[i]
        if len(tokens) == 0:
            continue

        padded[i, -len(tokens):] = np.array(tokens)

    return padded


padded_X = pad_tokens(encoded_X)


pad: 100%|███████████████████████████████████████████████████████████████████| 30000/30000 [00:00<00:00, 389616.91it/s]


In [16]:
padded_X[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

So, erm, a lot of sparse values. Let's see if it works well. Otherwise we can always go back and find a better compression technique.
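One simple option for reducing the padding would be to cap the sequence length at a percentile of the observed lengths instead of the maximum (560 here), and truncate the few longer sentences. This is a sketch of that idea, not something the notebook does: `length_at_percentile` and `pad_or_truncate` are hypothetical helpers, the 80th percentile is an arbitrary choice, and truncation throws away tokens.

```python
def length_at_percentile(lengths, percentile):
    """Smallest observed length covering at least `percentile` percent of sequences."""
    ordered = sorted(lengths)
    # nearest-rank percentile using integer arithmetic (percentile is an int, 1-100)
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def pad_or_truncate(ids, seq_len, pad_value=0):
    """Keep the first seq_len ids, then left-pad with pad_value up to seq_len."""
    ids = ids[:seq_len]
    return [pad_value] * (seq_len - len(ids)) + ids


toy_lengths = [4, 5, 3, 6, 560]
seq_len = length_at_percentile(toy_lengths, 80)
print(seq_len)
# 6
print(pad_or_truncate([2500, 67, 293, 0], seq_len))
# [0, 0, 2500, 67, 293, 0]
print(pad_or_truncate(list(range(1, 9)), seq_len))
# [1, 2, 3, 4, 5, 6]
```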

## PyTorch dataset

### Dataset

In [17]:
import torch
from torch.utils.data import Dataset, DataLoader

class BNHateSpeechDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        sentence = self.X[idx]
        label = self.y[idx]
        
        return torch.tensor(sentence), torch.tensor(label)

### Train Test Split

In [18]:
from sklearn.model_selection import train_test_split

# train and test split
x_train, x_test, y_train, y_test = train_test_split(
    padded_X, y, random_state=42, train_size=0.8
)


# train and validation split
# https://datascience.stackexchange.com/questions/15135/train-test-validation-set-splitting-in-sklearn
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, train_size=0.8, random_state=42)


In [19]:
trainset = BNHateSpeechDataset(x_train, y_train)
testset = BNHateSpeechDataset(x_test, y_test)
valset = BNHateSpeechDataset(x_val, y_val)

## CNN


This model is based on https://arxiv.org/abs/1408.5882  


### Definition

In [20]:


import torch.nn as nn
import torch.nn.functional as F


class HateSpeechCNN(nn.Module):
    def __init__(self, freeze_embeddings=True):
        super(HateSpeechCNN, self).__init__()

        # properties
        self.kernel_sizes = [3, 4, 5]
        self.num_filters = 100
        self.embedding_dim = w2v_model.wv.vector_size
        self.output_size = 1
        self.vocab_size = len(w2v_model.wv.index_to_key)

        # convert embeddings to tensors!
        self.corpus_embedding = torch.from_numpy(w2v_model.wv.vectors)

        # neural network

        # embedding layer
        # by default we're freezing embeddings
        self.embedding = nn.Embedding.from_pretrained(
            self.corpus_embedding, freeze=freeze_embeddings)

        # conv layers
        # 3 conv layers, since 3 kernel sizes
        self.conv1d = nn.ModuleList([
            nn.Conv2d(1, self.num_filters,
                      (k, self.embedding_dim), padding=(k - 2, 0))

            for k in self.kernel_sizes
        ])

        # final linear layer
        self.linear = nn.Linear(len(self.kernel_sizes)
                                * self.num_filters, self.output_size)

        # dropout and sigmoid
        # why sigmoid? Well, binary classification task!
        self.dropout = nn.Dropout(0.1)
        self.sigmoid = nn.Sigmoid()

    # helper
    def conv_and_pool(self, x, conv):
        """
        Convolutional + max pooling layer
        """
        # squeeze last dim to get size: (batch_size, num_filters, conv_seq_length)
        x = F.relu(conv(x)).squeeze(3)

        # 1D pool over conv_seq_length
        # squeeze to get size: (batch_size, num_filters)
        x_max = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x_max

    def forward(self, x):
        embeds = self.embedding(x)
        # add a channel dim for Conv2d: (batch_size, 1, seq_len, embedding_dim)
        embeds = embeds.unsqueeze(1)

        conv_out = [self.conv_and_pool(embeds, conv) for conv in self.conv1d]

        # concatenate the pooled outputs: (batch_size, num_kernels * num_filters)
        out = torch.cat(conv_out, 1)
        # apply dropout
        out = self.dropout(out)

        # linear
        out = self.linear(out)

        return self.sigmoid(out)
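To see why the final linear layer takes `len(kernel_sizes) * num_filters` inputs, it helps to trace the sequence dimension through each convolution. The sketch below applies the standard convolution output-length formula (dilation 1) to the padding `k - 2` used above. `conv_out_len` is a hypothetical helper, and the sequence length of 560 is taken from the padding step.

```python
def conv_out_len(seq_len, kernel_size, padding, stride=1):
    """Output length of a convolution along one dimension (dilation 1)."""
    return (seq_len + 2 * padding - kernel_size) // stride + 1


seq_len = 560
num_filters = 100
kernel_sizes = [3, 4, 5]

for k in kernel_sizes:
    # each conv yields (batch_size, num_filters, conv_seq_length) after the squeeze
    print(k, conv_out_len(seq_len, k, padding=k - 2))
# 3 560
# 4 561
# 5 562

# max pooling over conv_seq_length leaves num_filters values per kernel size
print(len(kernel_sizes) * num_filters)
# 300
```

The differing lengths (560, 561, 562) do not matter, because the max pooling collapses each of them to a single value per filter before concatenation.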


### Instantiation

In [21]:
# model
cnn = HateSpeechCNN()
print(cnn)

# hyperparameters
learning_rate = 0.001
batch_size = 128

loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(cnn.parameters(), lr=learning_rate)

# device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

cnn = cnn.to(device)

# dataloaders
train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(testset, batch_size=batch_size, shuffle=False, drop_last=True)
validation_loader = DataLoader(valset, batch_size=batch_size, shuffle=False, drop_last=True)

HateSpeechCNN(
  (embedding): Embedding(2786, 300)
  (conv1d): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(3, 300), stride=(1, 1), padding=(1, 0))
    (1): Conv2d(1, 100, kernel_size=(4, 300), stride=(1, 1), padding=(2, 0))
    (2): Conv2d(1, 100, kernel_size=(5, 300), stride=(1, 1), padding=(3, 0))
  )
  (linear): Linear(in_features=300, out_features=1, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (sigmoid): Sigmoid()
)
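Because every loader uses `drop_last=True`, the last partial batch of each split is skipped. The arithmetic below sketches how many samples that affects, assuming the corpus has 30,000 rows (the count shown by the progress bars) and that the training share of each 80/20 split is rounded down. `split_sizes` and `full_batches` are hypothetical helpers.

```python
def split_sizes(n, train_fraction=0.8):
    """Sizes of the two parts of a split, rounding the first part down."""
    n_train = int(n * train_fraction)
    return n_train, n - n_train


def full_batches(n, batch_size=128):
    """Number of full batches and number of samples dropped by drop_last=True."""
    return n // batch_size, n % batch_size


n_total = 30000  # assumption: row count read off the progress bars
n_trainval, n_test = split_sizes(n_total)
n_train, n_val = split_sizes(n_trainval)

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    batches, dropped = full_batches(n)
    print(name, n, batches, dropped)
# train 19200 150 0
# val 4800 37 64
# test 6000 46 112
```

Under these assumptions, 112 test sentences and 64 validation sentences never get evaluated. Setting `drop_last=False` for the test and validation loaders would include them.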


### Tensorboard setup

In [22]:
# set up tensorboard
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()

### Evaluation setup

In [23]:
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

### Training

In [24]:
# train

def train_cnn(model, train_loader, validation_loader, epochs, optimizer, loss_fn, device):
    step_counter = 0

    for e in range(epochs):
        print(f"Epoch: {e + 1}/{epochs}")
        # set to train mode
        model.train()
        for sample in tqdm(train_loader):
            sentence, label = sample

            # send to device
            sentence = sentence.to(device)
            label = label.to(device)
        
            # zero gradients
            model.zero_grad()
        
            # forward pass
            output = model(sentence)
            loss = loss_fn(output.squeeze(), label.float())
            loss.backward()
            optimizer.step()
        
        
            # +1 to step counter
            step_counter += 1
        
            if step_counter % 50 == 0:
                # evaluate training
                pred = torch.round(output.squeeze())
                pred = pred.cpu().detach().numpy()

                accuracy = accuracy_score(label.cpu().detach().numpy(), pred)
                f1 = f1_score(label.cpu().detach().numpy(), pred)

                writer.add_scalars(
                    'Train_F1_Train_Accuracy',
                    {
                        'F1': f1,
                        'Accuracy': accuracy
                    },
                    step_counter)


                # validation
                validation_losses = []
                validation_f1s = []
                validation_accs = []
        
                # set to eval mode
                model.eval()
                with torch.no_grad():
                    # run the forward pass on val loader
                    for sample in validation_loader:
                        sentence, label = sample
                        sentence = sentence.to(device)
                        label = label.to(device)
                        val_output = model(sentence)

                        val_loss = loss_fn(val_output.squeeze(), label.float())
                        validation_losses.append(val_loss.item())
            
                        # use the validation output, not the last training output
                        pred = torch.round(val_output.squeeze())
                        pred = pred.cpu().detach().numpy()

                        accuracy = accuracy_score(label.cpu().detach().numpy(), pred)
                        f1 = f1_score(label.cpu().detach().numpy(), pred)
            
                        validation_accs.append(accuracy)
                        validation_f1s.append(f1)

                    # log avg validation loss and training loss
                    writer.add_scalars(
                        'Loss_Train_ValidationAVG', 
                        {
                            "Train": loss.item(),
                            "Validation": np.mean(validation_losses)
                        }, 
                        step_counter)
        
                    writer.add_scalars(
                        'Evaluation_Val_F1_Val_Accuracy',
                        {
                            'F1': np.mean(validation_f1s),
                            'Accuracy': np.mean(validation_accs)
                        },
                        step_counter)

                # switch back to train mode for the remaining batches of this epoch
                model.train()

    # flush tensorboard
    writer.flush()
    writer.close()


In [25]:
# call train function
train_cnn(
    model=cnn,
    train_loader=train_loader,
    validation_loader=validation_loader,
    epochs=10,
    optimizer=optimizer,
    loss_fn=loss_fn,
    device=device)

Epoch: 1/10


100%|████████████████████████████████████████████████████████████████████████████████| 150/150 [00:14<00:00, 10.25it/s]


Epoch: 2/10


100%|████████████████████████████████████████████████████████████████████████████████| 150/150 [00:12<00:00, 11.61it/s]


Epoch: 3/10


100%|████████████████████████████████████████████████████████████████████████████████| 150/150 [00:12<00:00, 11.63it/s]


Epoch: 4/10


100%|████████████████████████████████████████████████████████████████████████████████| 150/150 [00:12<00:00, 11.58it/s]


Epoch: 5/10


100%|████████████████████████████████████████████████████████████████████████████████| 150/150 [00:12<00:00, 11.58it/s]


Epoch: 6/10


100%|████████████████████████████████████████████████████████████████████████████████| 150/150 [00:13<00:00, 11.53it/s]


Epoch: 7/10


100%|████████████████████████████████████████████████████████████████████████████████| 150/150 [00:12<00:00, 11.57it/s]


Epoch: 8/10


100%|████████████████████████████████████████████████████████████████████████████████| 150/150 [00:13<00:00, 11.44it/s]


Epoch: 9/10


100%|████████████████████████████████████████████████████████████████████████████████| 150/150 [00:12<00:00, 11.55it/s]


Epoch: 10/10


100%|████████████████████████████████████████████████████████████████████████████████| 150/150 [00:12<00:00, 11.55it/s]


### Testing time!

In [26]:
def inference(model, test_loader):
    accs = []
    f1s = []
    
    model.eval()
    with torch.no_grad():
        for sample in tqdm(test_loader):
            sentence, label = sample
            sentence = sentence.to(device)
            label = label.to(device)
            
            output = model(sentence)
            pred = torch.round(output.squeeze())
            
            pred = pred.cpu().detach().numpy()
            
            acc = accuracy_score(label.cpu().detach().numpy(), pred)
            accs.append(acc)
            
            f1 = f1_score(label.cpu().detach().numpy(), pred)
            f1s.append(f1)
            
    return np.mean(accs), np.mean(f1s)


acc, f1 = inference(cnn, test_loader)
print("Accuracy : ", acc)
print("F1 : ", f1)

100%|██████████████████████████████████████████████████████████████████████████████████| 46/46 [00:02<00:00, 21.13it/s]

Accuracy :  0.8211616847826086
F1 :  0.7009803530045093
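Note that these numbers are means of per-batch scores. For accuracy with equal batch sizes that matches the overall value, but F1 is not linear, so the mean of per-batch F1 scores can differ from the F1 computed over all predictions at once. The toy example below illustrates the gap with a hand-written binary F1. `f1_binary` is a hypothetical helper, and the labels are made up for illustration.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 if it is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# two toy batches of (true labels, predictions)
batches = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),
    ([1, 0, 0, 0], [0, 1, 0, 0]),
]

# mean of per-batch F1, as in the inference function above
mean_batch_f1 = sum(f1_binary(t, p) for t, p in batches) / len(batches)

# F1 over all predictions pooled together
all_true = [label for t, _ in batches for label in t]
all_pred = [label for _, p in batches for label in p]
pooled_f1 = f1_binary(all_true, all_pred)

print(mean_batch_f1)
# 0.5
print(round(pooled_f1, 4))
# 0.6667
```

Collecting all predictions and labels first and calling `f1_score` once would give the pooled value.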





## Tensorboard Logs

TensorBoard logs are available at https://tensorboard.dev/experiment/oO65xxIVRKeYNiKdTlLiNA/