# Hate Speech Detection on Bodo - HASOC 2023

## Team: Code Fellas
- Members: Abhinav, Adarsh, Ananya, Dinesh

For the HASOC 2023 competition, our team "Code Fellas" worked on hate speech detection in Bodo. We tried a range of approaches, from classical machine learning models to deep learning and transformer-based models.

### Approaches Explored:

1. **Traditional Models:**
   - Logistic Regression
   - Support Vector Machine (SVM)
   - XGBoost
   - Decision Trees

2. **Deep Learning Models:**
   - LSTM (Long Short-Term Memory)
   - BiLSTM (Bidirectional LSTM)
   - LSTM with CNN 1D
   - BiLSTM with CNN 1D
   - XLM-RoBERTa
   - mBERT (cased and uncased)
   - M-Roberta
   - DistilBERT
   - IndicBERT

### Results:
After experimentation, we found that the BiLSTM model gave the best results for hate speech detection in Bodo, with an F1 score of 0.83513.
Because Bodo is a low-resource language, unlike Assamese and Bengali, BERT-based models may not outperform a BiLSTM trained directly on the task data.
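
As a rough illustration of how a score like the one quoted above is computed, the sketch below implements macro-averaged F1 in plain Python. The choice of macro averaging, the helper name `macro_f1` and the toy labels are assumptions made for this example, not details taken from our submission.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy labels (0 = NOT, 1 = HOF), made up for illustration
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(round(macro_f1(y_true, y_pred), 4))
```

Macro averaging gives both classes equal influence on the score, regardless of how many examples each class has.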

Our journey in this competition allowed us to delve into the complexities of hate speech detection, explore a wide range of models, and understand their strengths and weaknesses in the context of Bodo text.

We're proud of our team's collaborative effort and of the results we achieved on hate speech detection for the Bodo language. We look forward to future opportunities to contribute to such meaningful tasks.

In [None]:
import pandas as pd

In [None]:
# Importing dataset
data1 = pd.read_csv("/content/train_BO_AH_HASOC2023.csv")
test_data = pd.read_csv("/content/test_BO_AH_HASOC2023.csv")

In [None]:
data1.head()

Unnamed: 0,S. No.,text,task_1
0,1,गोदाव खामानि मावओ बोला नो सानसे देरहा थारगोन,NOT
1,2,निखावरि सुबुंफोरा सिखाव,HOF
2,3,मा बिमा ख'र' परिबर्थननि खोथा फैखो बेयाव मोसौ,HOF
3,4,थोद जामबा सैमा साला मा मिसेस जाखो बेलाय,HOF
4,5,माखौ बकिबाय थादों नों बोरमा फानथा दम दंब्ला खा...,HOF


In [None]:
# Mapping "NOT" to 0 and "HOF" to 1
data1["task_1"] = data1["task_1"].map({"NOT" : 0, "HOF":1})

In [None]:
data1

Unnamed: 0,S. No.,text,task_1
0,1,गोदाव खामानि मावओ बोला नो सानसे देरहा थारगोन,0
1,2,निखावरि सुबुंफोरा सिखाव,1
2,3,मा बिमा ख'र' परिबर्थननि खोथा फैखो बेयाव मोसौ,1
3,4,थोद जामबा सैमा साला मा मिसेस जाखो बेलाय,1
4,5,माखौ बकिबाय थादों नों बोरमा फानथा दम दंब्ला खा...,1
...,...,...,...
1674,1675,नोंलाय जामबा नोंबो सासे सनमान गैयै मानसिसो गिद...,1
1675,1676,एै मावजि लाब गैया दानो बनद खालामनायनि खोथा बुं...,1
1676,1677,सिखला फुरकव रपे खलामनांगव,1
1677,1678,सेनदेल खुबै नांगौलै सालाफोरखौ,1


## 1. Data Preprocessing and Cleaning

In [None]:
import re
import nltk
import string

In [None]:
# Keeping only the text and the labels (task_1); .copy() avoids pandas' SettingWithCopyWarning on later assignments
data1 = data1[["text", "task_1"]].copy()

In [None]:
# Removing usernames from the text and replacing them with an empty string
def username_remover(text):
  text = re.sub(r'@[^ ]+',"", text)
  return text
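
To see what the pattern does, here is a small self-contained check. The sample string and the usernames in it are made up for illustration; note that the space following a removed username is left in place.

```python
import re

def username_remover(text):
    # '@' followed by any run of non-space characters is treated as a username
    return re.sub(r'@[^ ]+', "", text)

sample = "@someone निखावरि @another_user सिखाव"
print(repr(username_remover(sample)))
```

The leftover spaces are harmless here, because the later cleaning step splits the text on whitespace and joins the words with single spaces.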

In [None]:
data1["text"] = data1["text"].apply(username_remover)


### Cleaning the text by removing hashtags, URLs, newlines and punctuation (including brackets). The clean text is then split into words, runs of three or more repeated characters in each word are reduced to two, and the reduced words are joined back together.

In [None]:
# Function that cleans the text by removing the unnecessary parts mentioned above
def clean(text):
  text = re.sub(r'#\w+', '', text)                    # hashtags
  text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs (raw string avoids invalid-escape warnings)
  text = re.sub(r'\n', ' ', text)                     # newlines become spaces so adjacent words are not glued together
  text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation, including brackets
  pattern = re.compile(r"(.)\1{2,}")                  # any character repeated 3 or more times in a row
  # Reducing each run of repeated characters to two characters, word by word
  reduced_words = [pattern.sub(r"\1\1", word) for word in text.split()]
  return ' '.join(reduced_words)

In [None]:
# Testing the function on a Sinhala sentence. The 2nd word (මමමමමමම) is reduced to (මම) and the trailing period is removed.
print(clean("සුමිත් මමමමමමම අවස්ථාවේ සිටියාවූ උපදේශයේ උපකාරයක් පවතී."))

සුමිත් මම අවස්ථාවේ සිටියාවූ උපදේශයේ උපකාරයක් පවතී
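
`string.punctuation` only contains ASCII punctuation, so Devanagari marks such as the danda (।) pass through `clean` unchanged. Python's `\w` may also stop at Devanagari vowel signs, which can leave part of a hashtag behind. The sketch below shows one possible extension. The function name `clean_devanagari`, the extra character set and the `#\S+` hashtag pattern are assumptions for illustration and are not part of the pipeline that produced our results.

```python
import re
import string

# ASCII punctuation plus the Devanagari danda and double danda (assumed extras)
EXTRA_PUNCT = "\u0964\u0965"
PUNCT_PATTERN = re.compile('[%s]' % re.escape(string.punctuation + EXTRA_PUNCT))
REPEAT_PATTERN = re.compile(r"(.)\1{2,}")

def clean_devanagari(text):
    text = re.sub(r'#\S+', '', text)                    # whole hashtag token
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = PUNCT_PATTERN.sub('', text)                  # ASCII and Devanagari punctuation
    # Collapse runs of 3+ repeated characters to 2, word by word
    words = [REPEAT_PATTERN.sub(r"\1\1", w) for w in text.split()]
    return ' '.join(words)

print(clean_devanagari("निखावरि सुबुंफोरा सिखाव। https://example.com #tag"))
```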


In [None]:
# Text is cleaned by applying the clean function
data1["text"] = data1["text"].apply(clean)


In [None]:
# Removing duplicates from the text
data1 = data1.drop_duplicates('text')

In [None]:
data1.isnull().sum()

text      0
task_1    0
dtype: int64

In [None]:
data1['task_1'].value_counts()

1    998
0    681
Name: task_1, dtype: int64
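
The counts above show a moderate imbalance (998 HOF vs 681 NOT). The fit calls in this notebook do not reweight the classes, but one common option is to pass per-class weights to `model.fit` through its `class_weight` argument. The sketch below computes "balanced" weights from the counts; the weighting formula and the helper name are assumptions chosen for illustration.

```python
# Class counts taken from value_counts() above: label -> number of rows
counts = {1: 998, 0: 681}

def balanced_class_weights(counts):
    """weight[c] = n_samples / (n_classes * count[c]), so rarer classes weigh more."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

class_weight = balanced_class_weights(counts)
print({label: round(w, 4) for label, w in class_weight.items()})
# The dict could then be passed as model.fit(..., class_weight=class_weight)
```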

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Splitting data into text and label
x = data1['text']
y = data1['task_1']

In [None]:
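# The full training set is used for fitting; validation data comes from validation_split in model.fit
x_train = x
y_train = y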

## 2. Tokenization to convert text to keras tensors

In [None]:
!pip install keras-preprocessing



In [None]:
from keras.models import Sequential
from keras.layers import Conv1D, Embedding, Dense,MaxPool1D, LSTM, Dropout
from keras.preprocessing.text import Tokenizer
# from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.sequence import pad_sequences
from wordcloud import WordCloud

In [None]:
# Loading the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_train.values)
word_index = tokenizer.word_index
print('Found %s unique tokens' % len(word_index))

Found 4806 unique tokens


In [None]:
# Getting the maximum length of text in the dataset, measured in characters (not tokens)
lengths = [len(text) for text in x_train.values]
max(lengths)

394
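
The value above is a character count, whereas `maxlen` in `pad_sequences` is measured in tokens. A sketch of how token counts could be inspected when choosing `maxlen` follows. The whitespace split is a simplifying assumption (the Keras `Tokenizer` also filters punctuation), and the helper name is hypothetical.

```python
def token_length_stats(texts):
    """Return (max, mean) of the whitespace-token counts per text."""
    lengths = [len(text.split()) for text in texts]
    return max(lengths), sum(lengths) / len(lengths)

# Two rows from the training data shown earlier
texts = [
    "निखावरि सुबुंफोरा सिखाव",
    "थोद जामबा सैमा साला मा मिसेस जाखो बेलाय",
]
longest, mean = token_length_stats(texts)
print(longest, mean)
```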

In [None]:
# Converting each text to a sequence of integer word indices
X = tokenizer.texts_to_sequences(x_train.values)
X[0]

[252, 20, 1502, 1503, 172, 456, 457, 869]
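
The Keras `Tokenizer` assigns indices by descending word frequency, starting at 1 (index 0 is reserved and never assigned to a word). When no `oov_token` is set, as here, words that were not seen during fitting are dropped by `texts_to_sequences`. The plain-Python sketch below mimics that behaviour on whitespace-split text. It is a simplified stand-in, not the Keras implementation, and the function names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an index: most frequent word -> 1, next -> 2, ..."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace words by their indices; unseen words are silently dropped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

word_index = fit_word_index(["a b a", "b c a"])
print(word_index)                                   # {'a': 1, 'b': 2, 'c': 3}
print(texts_to_sequences(["a d c"], word_index))    # [[1, 3]]
```

This matters at test time: any test word that does not appear in the training text contributes nothing to the model input.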

In [None]:
# Padding (or truncating) every sequence to 130 tokens
X = pad_sequences(X, maxlen=130)

In [None]:
X.shape

(1679, 130)

In [None]:
print('Shape of data tensor:', X.shape)

Shape of data tensor: (1679, 130)
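
By default `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`), so short texts get leading zeros and texts longer than `maxlen` keep only their last `maxlen` tokens. A small plain-Python sketch of that default behaviour, with a hypothetical helper name:

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the default pad_sequences behaviour for a single sequence."""
    truncated = sequence[-maxlen:]                            # keep the last maxlen items
    return [value] * (maxlen - len(truncated)) + truncated   # left-pad with `value`

print(pad_pre([252, 20, 1502], maxlen=5))      # [0, 0, 252, 20, 1502]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))   # [3, 4, 5, 6]
```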


## 3. Applying our models: LSTM, BiLSTM and CNN

In [None]:
from keras.regularizers import l2
from keras.layers import Conv1D, Embedding, Dense, MaxPool1D, Bidirectional, LSTM, Dropout
from tensorflow.keras.callbacks import EarlyStopping

### i. Using Bidirectional LSTM layer

In [None]:
model = Sequential()
model.add(Embedding(len(word_index) + 1, 128, input_length=X.shape[1]))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2, kernel_regularizer=l2(1e-4))))  # BiLSTM layer
model.add(Dropout(0.2))
model.add(Dense(256,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 130, 128)          615296    
                                                                 
 dropout_6 (Dropout)         (None, 130, 128)          0         
                                                                 
 bidirectional_2 (Bidirectio  (None, 256)              263168    
 nal)                                                            
                                                                 
 dropout_7 (Dropout)         (None, 256)               0         
                                                                 
 dense_4 (Dense)             (None, 256)               65792     
                                                                 
 dropout_8 (Dropout)         (None, 256)               0         
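
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size = 4806 + 1  # unique tokens + 1 for the reserved index 0

print(embedding_params(vocab_size, 128))   # 615296
print(2 * lstm_params(128, 128))           # 263168 (forward + backward LSTM)
print(dense_params(256, 256))              # 65792
```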
                                                      

#### After experimentation we arrived at fitting the model using 3 epochs and a batch size of 64, which yielded the best results mentioned in the beginning.

In [None]:
# history = model.fit(X, yc_train, epochs=20, batch_size=32, validation_split=0.2, callbacks=[early_stopping])
history = model.fit(X, y_train, epochs=3, batch_size=64, validation_split=0.2)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
# Getting the text from test data
x_test = test_data["text"]

In [None]:
x_test

0              BPF बानाय लांनाय लामाया 5 बोसोरानो जोरासै
1      बै समाव माबेयाव हाबसोनानै दंमोन नोंलाय, दाना ब...
2      बे थांखिखौ मिनिग्रापोरा हारिखौ लेवारपोरबायदि थ...
3                   मोसौ खुगायाव एमफौ नांबाय नोंनाव सैमा
4      2003आव BTC गोरोबथा जादों बेनि थाखाय बो थोजासे ...
                             ...                        
415                                आं आनो खाजा होआखै मोन
416    बियो आंखौ बिनि बिमा बिफा बुथारनायनि थाखाय दायन...
417                     राहुलआ गावनि फोरोंगिरिखौ मान होआ
418                     राकेशआ गावनि फोरोंगिरिखौ मान होआ
419                      रमेशआ निखावरि मानसिखौ रायज्लाया
Name: text, Length: 420, dtype: object

In [None]:
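# Converting the test texts to padded sequences with the tokenizer fitted on the training text.
# The test text is tokenized as-is; username_remover and clean were applied only to the training text.
x_test = tokenizer.texts_to_sequences(x_test.values)
x_test = pad_sequences(x_test, maxlen=130)
# Thresholding the sigmoid outputs at 0.5 to get class labels (0 = NOT, 1 = HOF)
test_prediction = (model.predict(x_test) > 0.5).astype("int32")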



In [None]:
# Inspecting the predicted test labels
test_prediction

array([[0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
    

In [None]:
# Creating an empty DataFrame for the submission
fin_data = pd.DataFrame()

In [None]:
# Final data
fin_data

In [None]:
# Creating the list of serial numbers (1-based)
lst = list(range(1, len(test_prediction) + 1))

In [None]:
# Adding the serial numbers and the predicted labels as columns
fin_data["S. No."] = lst
fin_data["task_1"] = test_prediction.ravel()

In [None]:
# Mapping 1 back to "HOF" and 0 back to "NOT"
fin_data["task_1"] = fin_data["task_1"].map({1 : "HOF", 0 : "NOT"})
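
Since the label mapping is used in both directions (text to integer for training, integer to text for the submission), deriving one dictionary from the other keeps the two consistent. A minimal sketch, with hypothetical constant names and toy predictions:

```python
LABEL_TO_ID = {"NOT": 0, "HOF": 1}
# Inverse mapping derived from LABEL_TO_ID, so the two can never disagree
ID_TO_LABEL = {idx: label for label, idx in LABEL_TO_ID.items()}

predictions = [0, 1, 1, 0]  # toy model outputs after thresholding
labels = [ID_TO_LABEL[p] for p in predictions]
print(labels)  # ['NOT', 'HOF', 'HOF', 'NOT']
```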

In [None]:
# Exporting our predictions to csv
fin_data.to_csv("bodo_BiLSTM_without_earlystopping.csv")


### ii. Using Conv1D layer with BiLSTM layer
#### Early stopping, which helps prevent overfitting, is set up below, but the final run uses a fixed 3 epochs without it

In [None]:
model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 128, input_length = X.shape[1]))
model2.add(Dropout(0.2))
model2.add(Conv1D(filters=128, kernel_size=3, padding='same', activation='relu'))   # Conv1D layer
model2.add(MaxPool1D(pool_size = 2))
# model2.add(LSTM(200,dropout=0.2, recurrent_dropout=0.2))
model2.add(Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2, kernel_regularizer=l2(1e-4))))  # BiLSTM layer
model2.add(Dropout(0.2))
model2.add(Dense(256,activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(1,activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model2.summary())

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 130, 128)          615296    
                                                                 
 dropout_9 (Dropout)         (None, 130, 128)          0         
                                                                 
 conv1d_1 (Conv1D)           (None, 130, 128)          49280     
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 65, 128)          0         
 1D)                                                             
                                                                 
 bidirectional_3 (Bidirectio  (None, 256)              263168    
 nal)                                                            
                                                                 
 dense_7 (Dense)             (None, 1)                
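
The Conv1D parameter count and the pooled sequence length in this summary follow directly from the layer settings. A sketch of the arithmetic, with hypothetical helper names:

```python
def conv1d_params(kernel_size, in_channels, filters):
    # one kernel_size x in_channels weight block per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters

def maxpool1d_length(length, pool_size):
    # default settings: stride equal to pool_size, 'valid' padding
    return (length - pool_size) // pool_size + 1

print(conv1d_params(3, 128, 128))    # 49280
print(maxpool1d_length(130, 2))      # 65
```

Because the convolution uses `padding='same'`, it keeps the sequence length at 130; only the pooling layer halves it.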

In [None]:
# Early stopping on the validation loss with a patience of 3 epochs, restoring the best weights
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
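
To make the `patience` setting concrete, the sketch below simulates the stopping rule on a made-up list of validation losses. It is a simplified model of the idea (stop once the loss has not improved for `patience` consecutive epochs, then go back to the best epoch), not the Keras implementation, and the loss values are invented.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch    # training would stop here
    return len(val_losses) - 1, best_epoch  # ran through all epochs

stop_epoch, best_epoch = simulate_early_stopping([0.70, 0.60, 0.65, 0.66, 0.67, 0.50])
print(stop_epoch, best_epoch)
```

Under this rule, a run of only 3 epochs can never be cut short with a patience of 3, since the first epoch always counts as an improvement.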

#### After experimentation we arrived at fitting the model using 3 epochs and a batch size of 64.

In [None]:
# model2.fit(X, yc_train,epochs=20,batch_size = 32, validation_split=0.2, callbacks=[early_stopping])
model2.fit(X, y_train,epochs=3,batch_size = 64, validation_split=0.2)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7a39bf7d97e0>

In [None]:
# Predicting the test labels
test_prediction2 = (model2.predict(x_test) > 0.5).astype("int32")



In [None]:
# Creating an empty DataFrame for the submission
fin_data = pd.DataFrame()

In [None]:
# Final data
fin_data

In [None]:
# Creating the list of serial numbers (1-based)
lst = list(range(1, len(test_prediction2) + 1))

In [None]:
# Adding the serial numbers and the predicted labels as columns
fin_data["S. No."] = lst
fin_data["task_1"] = test_prediction2.ravel()

In [None]:
# Mapping 1 back to "HOF" and 0 back to "NOT"
fin_data["task_1"] = fin_data["task_1"].map({1 : "HOF", 0 : "NOT"})

In [None]:
# Exporting our predictions to csv
fin_data.to_csv("bodo_BiLSTM_CNN1D_without_earlystopping.csv")