## Gender Based Violence - Tweet classification challenge

This challenge aims to classify tweets about GBV without using keywords.

Can you develop a machine learning model that classifies a tweet about gender-based violence into one of these five categories?
- Sexual violence
- Emotional violence
- Economic violence
- Physical violence
- Harmful traditional practice


### Install modules/package

In [None]:
!pip install -U nltk #natural language toolkit

### Import libraries/modules

In [75]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import re # For regular expressions in text cleaning
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
# Open Multilingual Wordnet for lemmatizer
nltk.download('omw-1.4')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score

from collections import Counter


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/ML_models/GBV_tweet_classification/

### Load the dataset

In [None]:
train = pd.read_csv("Train.csv")

In [None]:
if 'Tweet_ID' in train.columns:
  train = train.drop("Tweet_ID", axis=1)
train.head()

In [None]:
test = pd.read_csv("Test.csv")

In [None]:
test.head()

### EDA

In [None]:
train.shape # the training dataset has 39650 rows

In [None]:
train["type"].unique()

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})

sns.countplot(x=train["type"]) # pass the column via x=; newer seaborn versions accept only `data` positionally

The dataset is highly imbalanced: in the plot above, sexual violence accounts for more than 80% of the tweets.

In [None]:
test.shape # the test dataset has 15581 rows

In [78]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    text = str(text).lower() # Convert to lowercase
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE) # Remove URLs
    text = re.sub(r'@\w+|#\w+', '', text) # Remove mentions and hashtags (as they might not be direct content)
    text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
    text = re.sub(r'\d+', '', text) # Remove numbers
    text = text.strip() # Remove leading/trailing whitespace
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words] # Lemmatize and remove stop words
    return " ".join(words)
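To see what the regex steps in `preprocess_text` do in isolation, here is a minimal sketch that leaves out the NLTK stop-word and lemmatization step, so it runs without any downloads. The function name `clean_with_regex_only` and the sample tweet are made up for illustration.

```python
import re

def clean_with_regex_only(text):
    """Apply only the regex steps of preprocess_text (no stop words, no lemmatizer)."""
    text = str(text).lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # remove URLs
    text = re.sub(r'@\w+|#\w+', '', text)        # remove mentions and hashtags
    text = re.sub(r'[^\w\s]', '', text)          # remove punctuation
    text = re.sub(r'\d+', '', text)              # remove digits
    return " ".join(text.split())                # collapse runs of whitespace

# Hypothetical sample tweet, not taken from the dataset
sample = "Check THIS out @user1 #GBV https://t.co/abc123 she was beaten 3 times!!"
cleaned = clean_with_regex_only(sample)
print(cleaned)  # check this out she was beaten times
```

Note that the hashtag text itself is dropped, not just the `#` symbol, so any signal carried by hashtags is lost at this step.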

In [79]:
# apply preprocessing for both train and test data
train['tweet_cleaned'] = train['tweet'].apply(preprocess_text)
test['tweet_cleaned'] = test['tweet'].apply(preprocess_text)

Separate the data into X (features) and y (labels).

In [81]:
X = train['tweet_cleaned']
y = train["type"]

In [82]:
# Splitting data into test and training
# Training size = 67%
# Testing data size = 33%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
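With classes this imbalanced, a plain random split can leave the rarest classes unevenly represented between the two parts. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same on both sides. The sketch below uses made-up toy data; applying it to the real data would change the numbers reported further down.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data with a 90/10 imbalance (hypothetical, not the challenge set)
texts = [f"tweet {i}" for i in range(100)]
labels = ["majority"] * 90 + ["minority"] * 10

# stratify=labels preserves the 90/10 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

train_counts_toy = Counter(y_tr)
test_counts_toy = Counter(y_te)
print(train_counts_toy, test_counts_toy)
```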

In [91]:
# Convert the cleaned tweet text into numerical TF-IDF vectors, since the model cannot work on raw text.
# The vectorizer is fitted on the training split only and then applied to the held-out split.
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

In [86]:
# Encode the label column (type) as integers, since the model requires numeric targets
encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test)
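It helps to know which integer stands for which category when reading the outputs that follow. `LabelEncoder` keeps the sorted unique labels in `classes_` and uses each label's position as its code. The sketch below mimics that ordering with plain Python rather than calling the encoder; the label strings are the ones that appear in the classification report.

```python
labels = ['sexual_violence', 'Physical_violence', 'emotional_violence',
          'economic_violence', 'Harmful_Traditional_practice']

# sorted() orders capitalised names first, the same order the
# classification report prints them in.
classes = sorted(set(labels))
code_of = {label: code for code, label in enumerate(classes)}

for label, code in code_of.items():
    print(code, label)
```

So class 4 is `sexual_violence` and class 0 is `Harmful_Traditional_practice`.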

In [87]:
print(Counter(y_train_encoded))

Counter({np.int64(4): 21909, np.int64(1): 3971, np.int64(3): 425, np.int64(2): 142, np.int64(0): 118})
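The counts above can be turned into class shares to quantify the imbalance seen in the plot. This is a small sketch that copies the printed counts by hand instead of reusing `y_train_encoded`, so it can run on its own.

```python
from collections import Counter

# Counts copied from the Counter output above (encoded label -> tweets)
train_counts = Counter({4: 21909, 1: 3971, 3: 425, 2: 142, 0: 118})
total = sum(train_counts.values())

# Share of the training split taken by each class
shares = {label: count / total for label, count in train_counts.items()}
for label, share in sorted(shares.items()):
    print(f"class {label}: {share:.1%}")

# How many times larger the biggest class is than the smallest
imbalance_ratio = max(train_counts.values()) / min(train_counts.values())
print(f"largest / smallest class: {imbalance_ratio:.0f}x")
```

Class 4 (`sexual_violence`, going by the class order in the classification report) takes a little over 82% of the training split.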


In [92]:
# Creating a pipeline for balancing the dataset and using LGBM Classifier for classification
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', LGBMClassifier(random_state=42, n_jobs=-1))
])

In [93]:
# fit training data into the model
# X_train_vectorized and y_train_encoded
pipeline.fit(X_train_vectorized, y_train_encoded)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 4.759888 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 318712
[LightGBM] [Info] Number of data points in the train set: 109545, number of used features: 3740
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
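The LightGBM log offers a quick sanity check that SMOTE ran inside the pipeline. With its default `sampling_strategy`, SMOTE oversamples every class except the largest up to the majority count, so the training set should hold five equally sized classes. The repeated start score also appears to be the log of the class prior, which would be `log(1/5)` for balanced classes; that reading of the log line is an interpretation, not something the notebook states.

```python
import math

majority_count = 21909   # largest class in the Counter output above
n_classes = 5

# After SMOTE, every class has as many rows as the majority class
rows_after_smote = majority_count * n_classes
print(rows_after_smote)         # compare with "Number of data points in the train set"

# With five equally sized classes, each class prior is 1/5
init_score = math.log(1 / n_classes)
print(round(init_score, 6))     # compare with "Start training from score"
```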


In [94]:
# predicting the y_test data
y_predicted = pipeline.predict(X_test_vectorized)



In [95]:
# Check the unique values from the predicted types
print("Predicted unique tweet types ",np.unique(y_predicted))

Predicted unique tweet types  [0 1 2 3 4]


In [96]:
# Checking the accuracy score of our model
print('Accuracy Score\n:', accuracy_score(y_test_encoded, y_predicted))

Accuracy Score
: 0.9980894153611005


In [97]:
print('Classification Report\n:', classification_report(y_test_encoded, y_predicted, target_names=encoder.classes_))
print('Confusion Matrix\n:', confusion_matrix(y_test_encoded, y_predicted))

Classification Report
:                               precision    recall  f1-score   support

Harmful_Traditional_practice       0.96      0.97      0.96        70
           Physical_violence       1.00      1.00      1.00      1975
           economic_violence       1.00      0.99      0.99        75
          emotional_violence       0.98      0.99      0.98       226
             sexual_violence       1.00      1.00      1.00     10739

                    accuracy                           1.00     13085
                   macro avg       0.99      0.99      0.99     13085
                weighted avg       1.00      1.00      1.00     13085

Confusion Matrix
: [[   68     0     0     0     2]
 [    1  1969     0     0     5]
 [    0     0    74     0     1]
 [    0     1     0   223     2]
 [    2     7     0     4 10726]]


In [99]:
# Redefine the pipeline, then fine-tune it with GridSearchCV
pipeline_for_grid = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', LGBMClassifier(random_state=42, n_jobs=-1)) # is_unbalance is left at its default here and varied in param_grid
])

param_grid = {
    'smote__k_neighbors': [3, 5, 7], # Tuning k_neighbors for SMOTE
    'classifier__n_estimators': [100, 200, 300],
    'classifier__learning_rate': [0.05, 0.1, 0.2],
    'classifier__num_leaves': [20, 31, 40], # Default is 31
    'classifier__max_depth': [-1, 10, 20], # -1 means no limit
    'classifier__min_child_samples': [20, 30, 40],
    'classifier__subsample': [0.7, 0.8, 0.9],
    'classifier__colsample_bytree': [0.7, 0.8, 0.9],
    'classifier__reg_alpha': [0, 0.1, 0.5], # L1 regularization
    'classifier__reg_lambda': [0, 0.1, 0.5], # L2 regularization
    'classifier__is_unbalance': [True, False] # Experiment with built-in imbalance handling
}

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42) # Reduced n_splits for faster tuning initially

grid_search = GridSearchCV(pipeline_for_grid, param_grid, cv=skf, scoring='f1_macro', n_jobs=-1, verbose=2)
grid_search.fit(X_train_vectorized, y_train_encoded) # Fit GridSearchCV on the original training data, pipeline handles SMOTE

print("Best parameters: ", grid_search.best_params_)
print("Best cross-validation F1-macro score: ", grid_search.best_score_)

y_pred_tuned = grid_search.predict(X_test_vectorized)
print("\nTuned Model Classification Report:\n", classification_report(y_test_encoded, y_pred_tuned, target_names=encoder.classes_))
print("\nTuned Model Confusion Matrix:\n", confusion_matrix(y_test_encoded, y_pred_tuned))

Fitting 3 folds for each of 118098 candidates, totalling 354294 fits


KeyboardInterrupt: 
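The search was interrupted because an exhaustive grid multiplies the number of values of every parameter. The arithmetic below reproduces the candidate count from the log; the `seconds_per_fit` figure is a made-up placeholder to give a feel for the total runtime, not a measurement.

```python
from math import prod

# Number of values listed for each hyperparameter in param_grid above
values_per_param = {
    'smote__k_neighbors': 3,
    'classifier__n_estimators': 3,
    'classifier__learning_rate': 3,
    'classifier__num_leaves': 3,
    'classifier__max_depth': 3,
    'classifier__min_child_samples': 3,
    'classifier__subsample': 3,
    'classifier__colsample_bytree': 3,
    'classifier__reg_alpha': 3,
    'classifier__reg_lambda': 3,
    'classifier__is_unbalance': 2,
}

n_candidates = prod(values_per_param.values())
n_folds = 3
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)

seconds_per_fit = 10  # hypothetical placeholder, not measured
est_days = n_fits * seconds_per_fit / 86400
print(f"about {est_days:.0f} days at {seconds_per_fit}s per fit on one worker")
```

Trimming the grid to two or three parameters, or sampling a fixed number of candidates with scikit-learn's `RandomizedSearchCV` (via its `n_iter` argument), keeps the search tractable. It may also be worth checking the LightGBM parameter documentation before keeping `subsample` and `is_unbalance` in the grid, since they may have no effect with the settings used here.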

In [71]:
from sklearn.ensemble import RandomForestClassifier
base_classifier = RandomForestClassifier(random_state=42)

In [72]:
from imblearn.ensemble import BalancedBaggingClassifier

bb_classifier = BalancedBaggingClassifier(estimator=base_classifier,
                                          sampling_strategy='auto',
                                          n_estimators=300,
                                          replacement=False,
                                          random_state=42)

In [73]:
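# X_res, y_res are not defined in this notebook; presumably resampled training data left over from an earlier session
bb_classifier.fit(X_res, y_res)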

KeyboardInterrupt: 

In [63]:

predicted = bb_classifier.predict(X_test_vectorized)

In [64]:
predicted
np.unique(predicted)

array([0, 1, 2, 3, 4])

In [65]:
print('Accuracy Score\n', accuracy_score(y_test_encoded, predicted))

Accuracy Score
 0.9942682460833014


In [66]:
print("Classification report", classification_report(y_test_encoded, predicted))
print("Confusion Matrix", confusion_matrix(y_test_encoded, predicted) )

Classification report               precision    recall  f1-score   support

           0       0.53      0.99      0.69        70
           1       1.00      1.00      1.00      1975
           2       0.95      1.00      0.97        75
           3       0.98      1.00      0.99       226
           4       1.00      0.99      1.00     10739

    accuracy                           0.99     13085
   macro avg       0.89      1.00      0.93     13085
weighted avg       1.00      0.99      0.99     13085

Confusion Matrix [[   69     0     0     0     1]
 [    1  1973     0     0     1]
 [    0     0    75     0     0]
 [    0     0     0   226     0]
 [   59     4     4     5 10667]]
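The per-class figures in the report can be re-derived from the confusion matrix, which makes it clear where the low precision for class 0 comes from. In scikit-learn's `confusion_matrix`, rows are true classes and columns are predicted classes. The sketch copies the matrix by hand so it runs on its own.

```python
# Confusion matrix of the balanced bagging classifier, copied from above
# (rows = true class, columns = predicted class)
cm = [
    [69, 0, 0, 0, 1],
    [1, 1973, 0, 0, 1],
    [0, 0, 75, 0, 0],
    [0, 0, 0, 226, 0],
    [59, 4, 4, 5, 10667],
]

n = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total

precision, recall = [], []
for k in range(n):
    predicted_k = sum(cm[i][k] for i in range(n))  # column sum: predicted as k
    actual_k = sum(cm[k])                          # row sum: truly k
    precision.append(cm[k][k] / predicted_k)
    recall.append(cm[k][k] / actual_k)

print(f"accuracy: {accuracy:.4f}")
for k in range(n):
    print(f"class {k}: precision={precision[k]:.2f} recall={recall[k]:.2f}")
```

The 59 class-4 tweets predicted as class 0 barely dent class 4's recall, but they nearly double the number of class-0 predictions, which is why precision for class 0 drops to 0.53 while overall accuracy stays above 99%.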


### Evaluation metric
The evaluation metric on the Zindi leaderboard is accuracy, so let's see how a simple model fares.

- Our simple model scored 88% accuracy, which looks fair, but be careful: the data is highly imbalanced, as discussed earlier, so accuracy alone can be misleading.

- Question: is the class imbalance in the test set the same as in the training set? Can you think of different ways to deal with an imbalanced dataset?

- The simple model also predicted only 3 of the 5 categories ('Physical_violence', 'emotional_violence', 'sexual_violence'), so 88% might not be that good. By comparison, the SMOTE + LightGBM pipeline above scored about 99.8% on the held-out split and predicted all five categories.
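To put an accuracy figure in context on data this imbalanced, it helps to compare it with a trivial baseline that always predicts the largest class. The sketch below computes that baseline from the held-out class supports shown in the classification report earlier; it is plain arithmetic, not a trained model.

```python
# Held-out class supports, copied from the classification report above
support = {
    'Harmful_Traditional_practice': 70,
    'Physical_violence': 1975,
    'economic_violence': 75,
    'emotional_violence': 226,
    'sexual_violence': 10739,
}
total = sum(support.values())
majority = max(support, key=support.get)

# A "model" that predicts the majority class for every tweet
baseline_accuracy = support[majority] / total
# Its recall is 1 for the majority class and 0 for every other class
baseline_macro_recall = 1 / len(support)

print(majority)
print(f"accuracy: {baseline_accuracy:.3f}, macro recall: {baseline_macro_recall:.2f}")
```

An accuracy in the high 80s is therefore only a few points better than labelling every tweet `sexual_violence`, which is why the macro-averaged scores are worth reporting alongside accuracy.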


### Submission

In [None]:
sample_submission = pd.read_csv("SampleSubmission.csv")

In [None]:
sample_submission.head()

In [100]:
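def predict_result(classifier, encoder, data, vectorizer):
    # Use the cleaned text: the vectorizer was fitted on 'tweet_cleaned', not on the raw tweets
    x_vectorized = vectorizer.transform(data['tweet_cleaned'])

    y_predicted = classifier.predict(x_vectorized)

    # Map the integer predictions back to the original label names
    return encoder.inverse_transform(y_predicted)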

In [108]:
# Let's predict on the test data
y_test_result = predict_result(pipeline, encoder, test, vectorizer)



In [110]:
y_test_result

array(['sexual_violence', 'Harmful_Traditional_practice',
       'Harmful_Traditional_practice', ..., 'sexual_violence',
       'sexual_violence', 'sexual_violence'], dtype=object)

In [111]:
np.unique(y_test_result)

array(['Harmful_Traditional_practice', 'Physical_violence',
       'economic_violence', 'emotional_violence', 'sexual_violence'],
      dtype=object)

In [112]:
sample_submission["type"] = y_test_result

In [115]:
%pwd

'/content/drive/MyDrive/ML_models/GBV_tweet_classification'

In [116]:
sample_submission.to_csv('SampleSubmission.csv', index=False)

### To do
- Do more analysis.
- Try other ways to balance the dataset: undersampling, oversampling, SMOTE, etc.
- Try other text classification models, e.g., using nltk.
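As a starting point for the balancing item, random undersampling can be sketched in a few lines of standard-library Python. The helper `random_undersample` and the toy data are hypothetical; imbalanced-learn ships a `RandomUnderSampler` that covers the same idea and would be the more practical choice inside a pipeline.

```python
import random
from collections import Counter, defaultdict

def random_undersample(texts, labels, seed=42):
    """Keep as many examples of each class as the rarest class has."""
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    n_keep = min(len(items) for items in by_class.values())
    rng = random.Random(seed)

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Sample without replacement from each class
        for text in rng.sample(items, n_keep):
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Toy data (hypothetical): 8 majority tweets, 2 minority tweets
texts = [f"tweet {i}" for i in range(10)]
labels = ["sexual_violence"] * 8 + ["economic_violence"] * 2

X_small, y_small = random_undersample(texts, labels)
print(Counter(y_small))
```

On the real training split this would keep only 118 tweets per class, so it discards most of the data; that trade-off is the main reason oversampling methods such as SMOTE are often tried first.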
