**Dataset**
Labeled dataset collected from Twitter (Lab 1 - Hate Speech.tsv)

**Objective**
Distinguish tweets that contain hate speech from other tweets. <br>
0 -> no hate speech <br>
1 -> contains hate speech <br>



**Evaluation metric**
macro f1 score

### Import used libraries

In [None]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.stem import PorterStemmer
nltk.download('wordnet')
import string
import contractions
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Load Dataset

###### Note: search how to load data from a TSV file

In [None]:
df = pd.read_csv("Lab 1 - Hate Speech.tsv", sep= "\t")

In [None]:
pd.set_option('display.max_rows',500)
pd.set_option('display.max_colwidth',500)

In [None]:
df.head(50)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo
6,7,0,@user camping tomorrow @user @user @user @user @user @user @user dannyâ¦
7,8,0,the next school year is the year for exams.ð¯ can't think about that ð­ #school #exams #hate #imagine #actorslife #revolutionschool #girl
8,9,0,we won!!! love the land!!! #allin #cavs #champions #cleveland #clevelandcavaliers â¦
9,10,0,@user @user welcome here ! i'm it's so #gr8 !


### Data splitting

Splitting the data before EDA is good practice: it helps maintain the integrity of the machine learning process, prevents data leakage, simulates real-world scenarios more accurately, and ensures that model performance is evaluated on genuinely unseen data.

In [None]:
x=df.drop(['id','label'],axis=1)
y=df['label']

In [None]:
print("shape of x :",x.shape)
print("shape of y  :" ,y.shape)

shape of x : (31535, 1)
shape of y  : (31535,)


In [None]:
x_train ,x_test ,y_train , y_test = train_test_split(x,y,test_size=0.15, random_state=24, shuffle=True, stratify= y)
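
The `stratify=y` argument keeps the class ratio roughly the same in both splits, which matters when one class is rare. A minimal sketch on synthetic labels (the 90/10 toy data and the names `X_toy`, `y_toy` are assumptions for illustration, not the lab dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1000 samples, 10% positive
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=24, shuffle=True, stratify=y_toy
)

# With stratification, the positive share stays close to 10% in both splits
print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```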

### EDA on training data

- check NaNs

In [None]:
x_train.isna().sum()

tweet    0
dtype: int64

- check duplicates

In [None]:
x_train.duplicated().sum()

1947

In [None]:
duplicate_rows = x_train[x_train.duplicated()]

In [None]:
x_train[x_train.duplicated(keep=False)]

Unnamed: 0,tweet
11592,secrets of a #marriage and a #family
6751,have my lover stop being angry at me visit us..&gt;&gt;&gt; #lover #friend #astrologer #love
5014,i am thankful for saturdays. #thankful #positive
16977,7 impoant things to allow the #child to be
29968,100 amazing health benefits of cucumbers! #healthy is !! #altwaystoheal!
...,...
21032,(advanced value chain videos at ) #valuechain
26602,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
21766,one more day to go ... first swimming lesson #facingfears @user
9785,i am thankful for getting encouragement. #thankful #positive


- show a representative sample of data texts to find out required preprocessing steps

In [None]:
x_train.head(10)

Unnamed: 0,tweet
20587,@user punk! in a gay pub! #brewpix #edinburgh #ccblooms
25897,when you get your hair cut + colored it makes you feel good â¡
29415,tonight i finally get to see @user !!!!! @user @user @user #utrecht
3669,live your life not your age. #behappy positivepositivepositive #love #hope #faithâ¦
6656,"trump backers sing 'happy bihday' to presumptive nominee: toward the end of his speech, trump suppor..."
24459,#reminder to count our #blessings today. #fathersday #sunset! some don't haveâ¦
7666,"@user just sky series link for your, no doubt, amazing second series of sensitive skin"
17389,have a wonderful f r i d a y ð¸ #love #emikagifts #jewelrydesign #designer #handmadeâ¦
3201,f a t h e r Â´s d a y ð #fathersday #love #family #enjoy #tbt #bw #black #whiteâ¦
12553,free my phone #im


- check dataset balancing

In [None]:
y_train.value_counts()

label
0    24923
1     1881
Name: count, dtype: int64

The data is imbalanced: only about 7% of the training tweets (1,881 of 26,804) are labeled as hate speech.
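
To put numbers on the imbalance, the sketch below reuses the counts printed by `y_train.value_counts()` and works out the class shares, plus the accuracy of a baseline that always predicts the majority class. That baseline reaches roughly 93% accuracy without finding a single hateful tweet, which is one reason accuracy is a poor fit for this task.

```python
# Class counts copied from the y_train.value_counts() output above
counts = {0: 24923, 1: 1881}
total = sum(counts.values())

# Share of each class in the training set
shares = {label: n / total for label, n in counts.items()}
print(shares)

# Accuracy of a baseline that always predicts the majority class (0)
majority_accuracy = counts[0] / total
print(round(majority_accuracy, 3))
```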

- Cleaning and preprocessing steps:
    - 1) Remove duplicates.
    - 2) Remove numbers.
    - 3) Remove URLs, mentions, and hashtags.
    - 4) Expand contractions.
    - 5) Remove punctuation marks.
    - 6) Remove non-ASCII characters and special symbols.
    - 7) Replace "f r i d a y" with "friday".
    - 8) Stem each word.
    - 9) Lowercase.
    - 10) Remove stop words.
    - 11) Vectorize words.

### Cleaning and Preprocessing (Train data)

In [None]:
x_train.drop_duplicates(inplace=True)

In [None]:
y_train.drop(duplicate_rows.index,inplace=True)
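
Dropping rows from `x_train` and `y_train` in two separate steps works only as long as both objects share the same index labels. One alternative, sketched here on a hypothetical toy frame (`x_toy` and `y_toy` are made-up stand-ins), is to deduplicate the features first and then select the matching labels by index:

```python
import pandas as pd

# Toy stand-ins for x_train / y_train (hypothetical data)
x_toy = pd.DataFrame({"tweet": ["a", "b", "a", "c"]}, index=[10, 11, 12, 13])
y_toy = pd.Series([0, 1, 0, 0], index=[10, 11, 12, 13], name="label")

# Keep the first occurrence of each tweet, then align labels by index
x_dedup = x_toy.drop_duplicates()
y_dedup = y_toy.loc[x_dedup.index]

print(x_dedup.index.tolist())  # [10, 11, 13]
print(y_dedup.tolist())        # [0, 1, 0]
```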

In [None]:
# Remove any numbers (raw string avoids an invalid-escape warning for \d)
x_train['tweet'] = x_train['tweet'].apply(lambda x: re.sub(r'\d+',"",x))

In [None]:
# Remove URLs, mentions, and hashtags (raw string avoids invalid-escape warnings)
x_train['tweet'] = x_train['tweet'].apply(lambda x: re.sub(r'http\S+|www\S+|@[^\s]+|#\S+',"",x))
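
The pattern deletes each hashtag together with its word, so any signal carried by the tag text is lost. The sketch below applies the pattern to one sample tweet from this notebook and, as an untested alternative that is not evaluated here, shows a variant that drops only the `#` sign (the names `removed` and `keep_words` are illustrative):

```python
import re

# Same pattern as above: URLs, mentions, and whole hashtags are deleted
pattern = r'http\S+|www\S+|@[^\s]+|#\S+'
sample = "@user punk! in a gay pub! #brewpix #edinburgh #ccblooms"

# split/join collapses the whitespace left behind by the removals
removed = " ".join(re.sub(pattern, "", sample).split())
print(removed)  # punk! in a gay pub!

# Hypothetical variant: drop only the '#' sign and keep the hashtag word
keep_words = re.sub(r'http\S+|www\S+|@[^\s]+', "", sample).replace("#", "")
keep_words = " ".join(keep_words.split())
print(keep_words)  # punk! in a gay pub! brewpix edinburgh ccblooms
```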

In [None]:
x_train.head(10)

Unnamed: 0,tweet
20587,punk! in a gay pub!
25897,when you get your hair cut + colored it makes you feel good â¡
29415,tonight i finally get to see !!!!!
3669,live your life not your age. positivepositivepositive
6656,"trump backers sing 'happy bihday' to presumptive nominee: toward the end of his speech, trump suppor..."
24459,to count our today. some don't haveâ¦
7666,"just sky series link for your, no doubt, amazing second series of sensitive skin"
17389,have a wonderful f r i d a y ð¸
3201,f a t h e r Â´s d a y ð
12553,free my phone


In [None]:
def preprocess_text(text):
    # Expand contractions (e.g. "don't" -> "do not")
    expanded_text = contractions.fix(text)
    # Remove punctuation marks
    cleaned_text = expanded_text.translate(str.maketrans('', '', string.punctuation))
    return cleaned_text
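
The order of the two steps matters: if punctuation were stripped first, the apostrophe in a contraction would simply disappear. The small sketch below isolates the punctuation step using only the standard library. Characters outside `string.punctuation`, such as emoji remnants, pass through untouched.

```python
import string

# Translation table that deletes every character in string.punctuation
table = str.maketrans('', '', string.punctuation)

no_punct = "punk! in a gay pub!".translate(table)
print(no_punct)  # punk in a gay pub

# Without expanding contractions first, the apostrophe is just dropped
unexpanded = "some don't have".translate(table)
print(unexpanded)  # some dont have
```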

In [None]:
x_train['tweet'] = x_train['tweet'].apply(preprocess_text)

In [None]:
x_train.head(10)

Unnamed: 0,tweet
20587,punk in a gay pub
25897,when you get your hair cut colored it makes you feel good â¡
29415,tonight i finally get to see
3669,live your life not your age positivepositivepositive
6656,trump backers sing happy bihday to presumptive nominee toward the end of his speech trump suppor
24459,to count our today some do not haveâ¦
7666,just sky series link for your no doubt amazing second series of sensitive skin
17389,have a wonderful f r i d a y ð¸
3201,f a t h e are Â´s d a y ð
12553,free my phone


In [None]:
# Remove non-ASCII characters and special symbols
x_train['tweet'] = x_train['tweet'].apply(lambda x: re.sub(r'[^\x00-\x7F]+',"",x))
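
A short sketch of what this pattern does, using made-up strings: symbols such as emoji are removed, but so are accented letters, which may or may not be desirable depending on the data.

```python
import re

non_ascii = re.compile(r'[^\x00-\x7F]+')

# A heart symbol (U+2661) is removed, leaving a trailing space behind
no_symbol = non_ascii.sub("", "feel good \u2661")
print(repr(no_symbol))  # 'feel good '

# Side effect: accented letters (here e-acute and i-diaeresis) are removed too
no_accents = non_ascii.sub("", "caf\u00e9 na\u00efve")
print(no_accents)  # caf nave
```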

In [None]:
x_train.head(10)

Unnamed: 0,tweet
20587,punk in a gay pub
25897,when you get your hair cut colored it makes you feel good
29415,tonight i finally get to see
3669,live your life not your age positivepositivepositive
6656,trump backers sing happy bihday to presumptive nominee toward the end of his speech trump suppor
24459,to count our today some do not have
7666,just sky series link for your no doubt amazing second series of sensitive skin
17389,have a wonderful f r i d a y
3201,f a t h e are s d a y
12553,free my phone


In [None]:
x_train['tweet'] = x_train['tweet'].str.replace('f r i d a y', 'friday', regex=False)
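
The replacement above handles one literal string only. A more general option, offered as an untested idea rather than part of this lab, is a regular expression that joins any run of three or more space-separated single letters. The helper name `collapse_spaced_letters` is hypothetical, and the pattern can over-merge genuine one-letter words, so it would need checking on real data. It would also have to run before contraction expansion, because by this stage the sample "f a t h e r" has become "f a t h e are".

```python
import re

# Three or more single letters separated by single spaces, e.g. "f r i d a y"
SPACED = re.compile(r'\b(?:[a-z] ){2,}[a-z]\b')

def collapse_spaced_letters(text):
    """Join runs of space-separated single letters into one token."""
    return SPACED.sub(lambda m: m.group(0).replace(" ", ""), text)

print(collapse_spaced_letters("have a wonderful f r i d a y"))
# have a wonderful friday
```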

In [None]:
x_train.head(10)

Unnamed: 0,tweet
20587,punk in a gay pub
25897,when you get your hair cut colored it makes you feel good
29415,tonight i finally get to see
3669,live your life not your age positivepositivepositive
6656,trump backers sing happy bihday to presumptive nominee toward the end of his speech trump suppor
24459,to count our today some do not have
7666,just sky series link for your no doubt amazing second series of sensitive skin
17389,have a wonderful friday
3201,f a t h e are s d a y
12553,free my phone


In [None]:
stemmer = PorterStemmer()

In [None]:
# function to stem each word in a tweet
def stem_words(tweet):
    # split tweet to words
    words = tweet.split()
    # Stem each word using the Porter stemmer
    stemmed_words = [stemmer.stem(word) for word in words]
    # Join the stemmed words back into a single string
    stemmed_tweet = ' '.join(stemmed_words)
    return stemmed_tweet

In [None]:
x_train['tweet'] = x_train['tweet'].apply(stem_words)
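
To see what stemming does to individual words, the sketch below runs the Porter stemmer on a few words taken from the sample tweets. Stems are not always dictionary words, and some collide with other words (for instance "his" becomes "hi").

```python
from nltk.stem import PorterStemmer

demo_stemmer = PorterStemmer()

# Words taken from the sample tweets shown in this notebook
words = ["colored", "finally", "happy", "series", "his"]
stems = [demo_stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```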

In [None]:
x_train.head(50)

Unnamed: 0,tweet
20587,punk in a gay pub
25897,when you get your hair cut color it make you feel good
29415,tonight i final get to see
3669,live your life not your age positivepositiveposit
6656,trump backer sing happi bihday to presumpt nomine toward the end of hi speech trump suppor
24459,to count our today some do not have
7666,just sky seri link for your no doubt amaz second seri of sensit skin
17389,have a wonder friday
3201,f a t h e are s d a y
12553,free my phone


### Cleaning and Preprocessing (Test data)

In [None]:
# Remove any numbers (raw string avoids an invalid-escape warning for \d)
x_test['tweet'] = x_test['tweet'].apply(lambda x: re.sub(r'\d+',"",x))

In [None]:
# Remove URLs, mentions, and hashtags (raw string avoids invalid-escape warnings)
x_test['tweet'] = x_test['tweet'].apply(lambda x: re.sub(r'http\S+|www\S+|@[^\s]+|#\S+',"",x))

In [None]:
x_test['tweet'] = x_test['tweet'].apply(preprocess_text)

In [None]:
# Remove non-ASCII characters and special symbols
x_test['tweet'] = x_test['tweet'].apply(lambda x: re.sub(r'[^\x00-\x7F]+',"",x))

In [None]:
x_test['tweet'] = x_test['tweet'].str.replace('f r i d a y', 'friday', regex=False)

In [None]:
x_test['tweet'] = x_test['tweet'].apply(stem_words)

In [None]:
x_test.head(20)

Unnamed: 0,tweet
16999,so in love with the beach life
21285,trump is a liber elit textbook definit of a croni capitalist trump onli care for trump
14236,yessssss st seri win in oz
2200,thi weekend i will be wish i am in the you with all the other author luck
25558,poloz albea wildfir will cut yy q gdp by about
21661,it is the are so even a would not them
19003,my nigga trippin earli with it nigga out here shootin finger
1456,go to a conc a guy check in goe through secur with gun
22226,edc la vega we liter cannot even omg
27244,is get as bad as


### Modelling

In [None]:
vectorizer = CountVectorizer(stop_words='english',lowercase=True)
vectorizer.fit(x_train['tweet'])
x_train_v = vectorizer.transform(x_train['tweet'])

In [None]:
logistic_regression = LogisticRegression()
# Fit the model on the training data
logistic_regression.fit(x_train_v, y_train)

In [None]:
x_test_v = vectorizer.transform(x_test['tweet'])
y_pred = logistic_regression.predict(x_test_v)

#### Evaluation

**Evaluation metric:**
macro f1 score

Macro F1 is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of its size. This makes it a useful metric for evaluating a classification model, **particularly when the classes are imbalanced**.

![Calculation](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/639c3d934e82c1195cdf3c60_macro-f1.webp)
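
The calculation can be checked by hand on a small example. The labels below are made up for illustration: the per-class F1 scores are computed from true positives, false positives, and false negatives, averaged, and compared with scikit-learn's `f1_score`.

```python
from sklearn.metrics import f1_score

# Toy labels (hypothetical): 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

def f1_for(label):
    """F1 of one class, treating that class as the positive one."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_hat))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_hat))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_hat))
    return 2 * tp / (2 * tp + fp + fn)

# Class 0 scores 0.875 and class 1 scores 0.5; macro F1 is their plain average
macro_manual = (f1_for(0) + f1_for(1)) / 2
macro_sklearn = f1_score(y_true, y_hat, average='macro')
print(macro_manual, macro_sklearn)
```

Accuracy on the same toy predictions is 0.8, while macro F1 is lower because the weak score on the rare class carries half the weight.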

In [None]:
# Calculate macro F1-score
macro_f1 = f1_score(y_test, y_pred, average='macro')

print("Macro F1-score: {:.2f}".format(macro_f1))

Macro F1-score: 0.74


### Enhancement

- Using different N-grams
- Using different text representation technique
- Hyperparameter tuning

Using n-gram CountVectorizer

In [None]:
vectorizer_ngram = CountVectorizer(stop_words='english',lowercase=True, ngram_range=(1, 3))
vectorizer_ngram.fit(x_train['tweet'])
x_train_v_n = vectorizer_ngram.transform(x_train['tweet'])

In [None]:
logistic_regression_2 = LogisticRegression()
# Fit the model on the training data
logistic_regression_2.fit(x_train_v_n, y_train)

In [None]:
x_test_v_n = vectorizer_ngram.transform(x_test['tweet'])
y_pred = logistic_regression_2.predict(x_test_v_n)

In [None]:
# Calculate macro F1-score
macro_f1 = f1_score(y_test, y_pred, average='macro')

print("Macro F1-score: {:.2f}".format(macro_f1))

Macro F1-score: 0.77


Using TF-IDF

In [None]:
vectorizer_2 = TfidfVectorizer(stop_words='english',lowercase=True)
vectorizer_2.fit(x_train['tweet'])
x_train_v_f = vectorizer_2.transform(x_train['tweet'])

In [None]:
logistic_regression_3 = LogisticRegression()
# Fit the model on the training data
logistic_regression_3.fit(x_train_v_f, y_train)

In [None]:
x_test_v_f = vectorizer_2.transform(x_test['tweet'])
y_pred = logistic_regression_3.predict(x_test_v_f)

In [None]:
# Calculate macro F1-score
macro_f1 = f1_score(y_test, y_pred, average='macro')

print("Macro F1-score: {:.2f}".format(macro_f1))

Macro F1-score: 0.67


Using TF-IDF with n-grams

In [None]:
vectorizer_ngram_2 = TfidfVectorizer(stop_words='english',lowercase=True,ngram_range=(1,3))
vectorizer_ngram_2.fit(x_train['tweet'])
x_train_v_f = vectorizer_ngram_2.transform(x_train['tweet'])

In [None]:
logistic_regression_4 = LogisticRegression()
# Fit the model on the training data
logistic_regression_4.fit(x_train_v_f, y_train)

In [None]:
x_test_v_f = vectorizer_ngram_2.transform(x_test['tweet'])
y_pred = logistic_regression_4.predict(x_test_v_f)

In [None]:
# Calculate macro F1-score
macro_f1 = f1_score(y_test, y_pred, average='macro')

print("Macro F1-score: {:.2f}".format(macro_f1))

Macro F1-score: 0.66


-------------------------------------------------------

#### Extra: use custom scikit-learn Transformers

Custom transformers in scikit-learn give you flexibility, reusability, and control over the data transformation process. Because they plug into scikit-learn's pipelines, you can combine multiple preprocessing steps and the model into a single workflow, which makes the code more modular, readable, and easier to maintain.

##### link: https://www.andrewvillazon.com/custom-scikit-learn-transformers/

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stemmer = PorterStemmer()

    def preprocess_text(self, text):
        # Remove numbers (raw strings avoid invalid-escape warnings)
        text = re.sub(r'\d+', '', text)
        # Remove URLs, mentions, and hashtags
        text = re.sub(r'http\S+|www\S+|@[^\s]+|#\S+', '', text)
        # Expand contractions
        text = contractions.fix(text)
        # Remove punctuation marks
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Remove non-ASCII characters and special symbols
        text = re.sub(r'[^\x00-\x7F]+', '', text)
        # Replace 'f r i d a y' with 'friday'
        text = text.replace('f r i d a y', 'friday')
        # Split the text into words and stem each one
        words = text.split()
        stemmed_words = [self.stemmer.stem(word) for word in words]
        # Join the stemmed words back into a single string
        preprocessed_text = ' '.join(stemmed_words)
        return preprocessed_text

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [self.preprocess_text(text) for text in X]
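
One detail worth knowing: `BaseEstimator` derives `get_params` and `set_params` from the constructor arguments, so any option exposed in `__init__` can later be switched on or off from outside, for example in a grid search. A reduced sketch follows. The class `RegexCleaner` and its `remove_digits` option are hypothetical and are not the transformer defined above.

```python
import re
from sklearn.base import BaseEstimator, TransformerMixin

class RegexCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical, reduced transformer: strips mentions and, optionally, digits."""

    def __init__(self, remove_digits=True):
        # Store constructor arguments unchanged so get_params/set_params work
        self.remove_digits = remove_digits

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        cleaned = []
        for text in X:
            text = re.sub(r'@\S+', '', text)
            if self.remove_digits:
                text = re.sub(r'\d+', '', text)
            cleaned.append(" ".join(text.split()))
        return cleaned

cleaner = RegexCleaner(remove_digits=False)
print(cleaner.get_params())                       # {'remove_digits': False}
print(cleaner.fit_transform(["@user 7 things"]))  # ['7 things']
```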


#### Extra: use scikit-learn pipeline

##### link: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Using pipelines in scikit-learn promotes better code organization, reproducibility, and efficiency in machine learning workflows.

In [None]:
# Create a pipeline with the CustomTransformer and LogisticRegression
pipeline = Pipeline([
    ('preprocessor', CustomTransformer()),
    ('vectorizer', CountVectorizer(stop_words='english',ngram_range=(1,3))),
    ('classifier', LogisticRegression())
])

In [None]:
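# Fit the pipeline on the training data
# Note: x_train['tweet'] was already cleaned in the cells above, so the
# preprocessor is applied here to text that has been cleaned once already
pipeline.fit(x_train['tweet'], y_train)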

In [None]:
# Make predictions on the test data
y_pred = pipeline.predict(x_test['tweet'])

In [None]:
# Calculate macro F1-score
macro_f1 = f1_score(y_test, y_pred, average='macro')

# Print macro F1-score
print("Macro F1-score: {:.2f}".format(macro_f1))

Macro F1-score: 0.77
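
Hyperparameter tuning is listed under Enhancement but is not carried out in this notebook. A pipeline makes it straightforward, because `GridSearchCV` can address the parameters of any step by name. The sketch below is only an illustration: the tiny corpus, the grid values, and `max_iter=1000` are assumptions. On the real data, the training tweets and labels would be passed to `search.fit` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus, only to make the sketch runnable
texts = ["love thi phone", "hate thi phone", "love my friend", "hate my neighbor",
         "love the beach", "hate the traffic", "love good food", "hate bad food"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Step name + double underscore + parameter name addresses a nested parameter
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 3)],
    'classifier__C': [0.1, 1.0, 10.0],
}

# Score each combination by cross-validated macro F1
search = GridSearchCV(pipe, param_grid, scoring='f1_macro', cv=2)
search.fit(texts, labels)
print(search.best_params_)
```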


### Conclusion and final results


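The best result comes from CountVectorizer with n-grams (`ngram_range=(1, 3)`): macro F1-score = 0.77, compared with 0.74 for unigram counts, 0.67 for TF-IDF, and 0.66 for TF-IDF with n-grams.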

#### Done!