# Introduction
**Dataset**
Labeled dataset collected from Twitter (Hate Speech.tsv)

**Objective**
Distinguish tweets containing hate speech from other tweets. <br>
0 -> no hate speech <br>
1 -> contains hate speech <br>

**Evaluation metric**
Macro F1 score

**I will use custom transformers and a Pipeline in this notebook**

# Import used libraries

In [10]:
! pip install contractions






In [11]:
! pip install emoji



In [12]:
import warnings
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
warnings.filterwarnings(action='ignore',category=FutureWarning)

In [132]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [57]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import contractions
import re
import string
import emoji
import unicodedata
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


# Load Dataset

In [18]:
# Load the TSV file
data = pd.read_csv('Hate Speech.tsv', delimiter='\t')

# Display the first few rows of the DataFrame
print(data.head())

   id  label                                              tweet

0   1      0  @user when a father is dysfunctional and is so...

1   2      0  @user @user thanks for #lyft credit i can't us...

2   3      0                                bihday your majesty

3   4      0  #model   i love u take with u all the time in ...

4   5      0             factsguide: society now    #motivation


In [19]:
data.drop('id', axis=1, inplace=True)

In [20]:
data.shape

(31535, 2)

In [21]:
print(data.head())

   label                                              tweet

0      0  @user when a father is dysfunctional and is so...

1      0  @user @user thanks for #lyft credit i can't us...

2      0                                bihday your majesty

3      0  #model   i love u take with u all the time in ...

4      0             factsguide: society now    #motivation


# Data preprocessing

### 1. Data splitting

Splitting the data before EDA is good practice: it prevents information from the test set leaking into preprocessing and modelling decisions, simulates real-world use more faithfully, and keeps the evaluation on unseen data reliable.

In [22]:
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
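
Since the labels in this dataset are heavily imbalanced, one option worth considering is a stratified split, which keeps the class ratio the same in both parts. The following is a minimal sketch on a made-up toy DataFrame (`toy` and its values are assumptions for illustration; the notebook itself uses a plain random split):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 negative and 10 positive examples
toy = pd.DataFrame({
    "label": [0] * 90 + [1] * 10,
    "tweet": [f"tweet {i}" for i in range(100)],
})

# stratify= keeps the 9:1 class ratio in both the train and the test part
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

print(toy_train["label"].value_counts().to_dict())
print(toy_test["label"].value_counts().to_dict())
```

With a stratified split the minority class is guaranteed to appear in the test part in the same proportion as in the full data, which makes per-class metrics less noisy.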

### 2. EDA on training data

- check NaNs

In [23]:
train_data.isna().sum()

label    0
tweet    0
dtype: int64

- check duplicates

In [24]:
train_data[train_data.duplicated()]


Unnamed: 0,label,tweet
16764,0,angry squeaking frog video: #frog #nature...
28579,0,#model i love u take with u all the time in ...
30794,0,i'm so and #grateful now that - #affirmations
23699,0,can #lighttherapy help with or #depression? ...
11994,0,i am thankful for good hugs. #thankful #positive
...,...,...
25939,0,#arkansas gorilla simulator: you need to do ...
189,0,i am thankful for cats. #thankful #positive
25658,0,#model i love u take with u all the time in ...
17568,0,i am genuine. #i_am #positive #affirmation


There are 1779 duplicated rows in the training data; they should be removed.
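
The number of duplicates can be read off directly instead of from the length of the displayed frame. A small sketch on a made-up frame (`toy` is hypothetical, not the notebook's data):

```python
import pandas as pd

# Hypothetical frame with one row repeated three times
toy = pd.DataFrame({
    "label": [0, 0, 1, 0],
    "tweet": ["same text", "same text", "other text", "same text"],
})

# duplicated() marks every repeat after the first occurrence
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```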

In [25]:
train_data= train_data.drop_duplicates()

In [26]:
# check again
train_data[train_data.duplicated()]

Unnamed: 0,label,tweet


In [27]:
# Unpacking train_data and test_data into X_train, y_train, X_test, y_test
X_train, y_train = train_data['tweet'], train_data['label']
X_test, y_test = test_data['tweet'], test_data['label']

- check dataset balancing

In [28]:
train_data['label'].value_counts()

label
0    21837
1     1612
Name: count, dtype: int64

There is a large gap between the class sizes (21837 vs. 1612 tweets), so I will:
*  Try to reduce this imbalance
*  Use a macro-averaged metric to evaluate model performance
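
One common way to reduce the effect of the imbalance without resampling is to weight the classes inversely to their frequency. The sketch below only computes those weights from the counts shown above; it is an illustration, not a step the notebook performs:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label vector with the class counts reported above
y = np.array([0] * 21837 + [1] * 1612)

# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
class_weight = dict(zip([0, 1], weights))
print(class_weight)
```

Passing `class_weight="balanced"` to `LogisticRegression` applies the same weighting during training.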

- show a representative sample of data texts to find out required preprocessing steps

In [29]:
sample_data = train_data.sample(n=20)
print(sample_data)

       label                                              tweet

9577       0  genuinely was playing ring of fire and then @u...

4511       0  #left always like to overreact on mr. #trump's...

21315      0                                    bihday rg @user

13363      0  25 science backed ways to feel happier!    #re...

16121      0  i love how i can lay in my bed comfoably and w...

12412      1  #paladino ... what can i label him as ..that h...

2400       0  @user i am very disappointed after watching 10...

13273      0  finally got it framed. thanks georgiana_draghi...

11951      0  it rains here a lot, but there's never any thu...

26040      0  @user no suppo for the #friendlyrobotics #rl50...

24414      0  wigs &amp; hooker boots caoon fashion   @user ...

20841      0  1000 followers ð thank you for following me...

24799      0  @user find your purpose in life to be truly  ....

11311      0                #governance #feminism gets to be  ?

4119       0  @user the b

## According to the data, I need to do:
- Some cleaning and preprocessing:
    - Lowercasing
    - Remove # symbols
    - Remove digits
    - Remove usernames
    - Remove URLs
    - Remove retweet markers (RT)
    - Remove emojis
    - Remove non-ASCII characters
    - Remove punctuation
    - Stemming or lemmatizing
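
To see what these steps do to a single tweet, here is a standard-library-only sketch. `clean_tweet` and the sample text are hypothetical; the emoji step is left out to avoid a third-party dependency, and the final whitespace collapse is an extra convenience that is not in the list above:

```python
import re
import string

def clean_tweet(text):
    """Apply the cleaning steps listed above to one tweet (illustrative helper)."""
    text = text.lower()                                   # lowercasing
    text = re.sub(r'#', '', text)                         # remove # symbols
    text = re.sub(r'\d', '', text)                        # remove digits
    text = re.sub(r'@\w+', '', text)                      # remove usernames
    text = re.sub(r'http\S+|www\S+', '', text)            # remove URLs
    text = re.sub(r'\brt\b', '', text)                    # remove retweet markers
    text = text.encode('ascii', 'ignore').decode('ascii') # remove non-ASCII characters
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    return ' '.join(text.split())                         # collapse leftover whitespace

sample = "@user Thanks for #lyft credit!! Visit https://t.co/abc123 RT 2day"
print(clean_tweet(sample))  # thanks for lyft credit visit day
```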




### Cleaning 

In [39]:
# Load nltk resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data]   Package punkt is already up-to-date!

[nltk_data] Downloading package stopwords to /root/nltk_data...

[nltk_data]   Package stopwords is already up-to-date!

[nltk_data] Downloading package wordnet to /root/nltk_data...

[nltk_data]   Package wordnet is already up-to-date!


True

#### Extra: use custom scikit-learn Transformers

Custom scikit-learn transformers give flexibility, reusability, and control over the data transformation process. Because they plug into scikit-learn's pipelines, multiple preprocessing steps and the model can be combined into a single workflow, which makes the code more modular, readable, and easier to maintain.

##### link: https://www.andrewvillazon.com/custom-scikit-learn-transformers/

#### TextPreprocessor Transformer

In [84]:
class TextPreprocessor(BaseEstimator, TransformerMixin):
    """Stateless tweet cleaner that can be used as a pipeline step."""

    def __init__(self, use_stemming=False, use_lemmatization=False):
        self.use_stemming = use_stemming
        self.use_lemmatization = use_lemmatization
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit only returns self
        return self

    def _clean(self, text):
        # Lowercasing
        text = text.lower()

        # Remove # symbols
        text = re.sub(r'#', '', text)

        # Remove digits
        text = re.sub(r'\d', '', text)

        # Remove usernames (assuming they start with @)
        text = re.sub(r'@\w+', '', text)

        # Remove URLs
        text = re.sub(r'http\S+|www\S+', '', text)

        # Remove retweet markers (the text is already lowercase here)
        text = re.sub(r'\brt\b', '', text)

        # Remove emojis
        text = ''.join(char for char in text if not emoji.is_emoji(char))

        # Remove non-ASCII characters
        text = text.encode('ascii', 'ignore').decode('ascii')

        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))

        # Apply stemming or lemmatization.
        # Note: both calls receive the whole cleaned tweet as one string,
        # so NLTK treats it as a single token rather than word by word.
        if self.use_stemming:
            text = self.stemmer.stem(text)
        elif self.use_lemmatization:
            text = self.lemmatizer.lemmatize(text)

        return text

    def transform(self, X, y=None):
        # Clean every tweet; fit_transform is inherited from TransformerMixin
        return [self._clean(text) for text in X]
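
NLTK's `PorterStemmer.stem` expects a single word, so a whole tweet passed to it, as in `TextPreprocessor`, is treated as one long word. Below is a minimal sketch of a token-level variant, assuming plain whitespace tokenization; `stem_tokens` is a hypothetical helper that is not part of the notebook, and switching to it would change the scores reported in this notebook:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(text):
    """Stem each whitespace-separated token (hypothetical helper)."""
    return " ".join(stemmer.stem(token) for token in text.split())

# Token by token: every word is reduced to its stem
print(stem_tokens("cats running"))  # cat run

# Whole string at once: the stemmer sees a single "word"
print(stemmer.stem("cats running"))
```

The same idea applies to `WordNetLemmatizer.lemmatize`, which also works on one word at a time.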


In [128]:
## check transformer
# tx=TextPreprocessor(use_stemming=True)
# xx=tx.fit_transform(X_train)
# xx

#### FeatureExtractor Transformer

In [117]:

class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer='bow', ngram_range=(1, 3)):
        # Store constructor arguments under their own names so that
        # get_params() and clone() (used by GridSearchCV) work correctly
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        # Initialize the correct vectorizer based on the user's choice
        if self.vectorizer == 'bow':
            self.vectorizer_ = CountVectorizer(stop_words='english', ngram_range=self.ngram_range)
        elif self.vectorizer == 'tfidf':
            self.vectorizer_ = TfidfVectorizer(stop_words='english', ngram_range=self.ngram_range)
        else:
            raise ValueError("Invalid vectorizer type. Choose 'bow' or 'tfidf'.")

        # Fit the vectorizer to the data
        self.vectorizer_.fit(X)
        return self

    def transform(self, X):
        # Transform the data using the fitted vectorizer
        return self.vectorizer_.transform(X)

    def fit_transform(self, X, y=None):
        # Fit the vectorizer, then transform the same data
        self.fit(X, y)
        return self.transform(X)


In [129]:
## check transformer
# tv=FeatureExtractor(vectorizer='bow', ngram_range=(1, 3))
# vv=tv.fit_transform(xx)
# vv

**You are doing great so far!**

### Modelling

In [118]:
model= LogisticRegression()

#### Extra: use scikit-learn pipeline

##### link: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

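A scikit-learn pipeline chains preprocessing, feature extraction, and the model into a single estimator, so every transformation fitted on the training data is reapplied unchanged at prediction time. This improves code organization and reproducibility.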

#### Example usage:

In [110]:
# Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessing', TextPreprocessor(use_stemming=True)),
    ('Vectorizing', FeatureExtractor(vectorizer='bow')),
    ('model', model),
])


In [111]:
# Training the model
pipeline.fit(X_train, y_train)


In [119]:
X_train

17097    big shout out to @user who replaced my hard us...
8390     #happy #monday! no #need or #reason to #feel  ...
12840    check out all the stars who made our sister te...
25608    awesome -- it's the rare wildly hyped phenomen...
9295     jamming to the new @user song my day is made! ...
                               ...                        
29802    @user @user congratulations!!! niblles forever...
5390     cross check day passed!  next week, on to resp...
860      black professor makes assumptions about an ent...
15795    look at the door #lucky #work #worked #working...
23654    add me  close peepz #staycloseofamily &lt;3   ...
Name: tweet, Length: 23449, dtype: object

In [112]:
# Evaluating the model
y_pred = pipeline.predict(X_test)

#### Evaluation

**Evaluation metric:**
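Macro F1 score

The macro F1 score is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. This makes it a useful metric for classification models, **particularly when the classes are imbalanced**, because poor performance on the minority class cannot be hidden by the majority class.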

![Calculation](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/639c3d934e82c1195cdf3c60_macro-f1.webp)
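
The calculation can be checked on a tiny example. The labels below are made up for illustration; `f1_score` with `average=None` returns the per-class scores and `average='macro'` returns their unweighted mean:

```python
from sklearn.metrics import f1_score

# Hypothetical labels: four negatives and two positives
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 1, 1, 0]

# Per-class F1: class 0 has precision = recall = 3/4, class 1 has 1/2
per_class = f1_score(y_true, y_hat, average=None)

# Macro F1 is the plain average of the per-class scores
macro = f1_score(y_true, y_hat, average='macro')

print(per_class)
print(macro)  # 0.625
```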

In [114]:
# Evaluate the model using macro-averaged metrics
report = classification_report(y_test, y_pred, digits=3)

# Print the classification report
print(report)

              precision    recall  f1-score   support



           0      0.964     0.997     0.980      5875

           1      0.922     0.493     0.643       432



    accuracy                          0.962      6307

   macro avg      0.943     0.745     0.811      6307

weighted avg      0.961     0.962     0.957      6307


### Enhancement

- Using different n-gram ranges
- Using a different text representation technique
- Hyperparameter tuning
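
The experiments in this section change one setting at a time by hand. As a hedged sketch of how the same search could be automated, the example below runs `GridSearchCV` with macro F1 scoring on a made-up toy corpus. It uses scikit-learn's built-in `CountVectorizer` instead of the custom transformers so that it stays self-contained; the texts, labels, step names, and grid values are all assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 0 = no hate speech, 1 = hate speech
texts = ["i love this", "what a great day", "so happy today", "thank you friend",
         "i hate them", "they are awful people", "hate hate hate", "awful hateful words"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline(steps=[
    ("vectorizer", CountVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Keys follow the <step name>__<parameter> convention
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "model__C": [0.1, 1.0, 10.0],
    "model__class_weight": [None, "balanced"],
}

# scoring="f1_macro" selects the combination with the best macro F1
search = GridSearchCV(toy_pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the search cross-validates on the training data only, the test set stays untouched until the final evaluation.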

In [115]:

pipeline = Pipeline(steps=[
    ('preprocessing', TextPreprocessor(use_stemming=True)),
    ('Vectorizing', FeatureExtractor(vectorizer='tfidf')),
    ('model', model),
])

# Training the model
pipeline.fit(X_train, y_train)

# Evaluating the model
y_pred = pipeline.predict(X_test)

# Evaluate the model using macro-averaged metrics
report = classification_report(y_test, y_pred, digits=3)

# Print the classification report
print(report)


              precision    recall  f1-score   support



           0      0.939     0.998     0.968      5875

           1      0.831     0.125     0.217       432



    accuracy                          0.938      6307

   macro avg      0.885     0.562     0.593      6307

weighted avg      0.932     0.938     0.916      6307


In [116]:

pipeline = Pipeline(steps=[
    ('preprocessing', TextPreprocessor(use_lemmatization=True)),
    ('Vectorizing', FeatureExtractor(vectorizer='tfidf')),
    ('model', model),
])

# Training the model
pipeline.fit(X_train, y_train)

# Evaluating the model
y_pred = pipeline.predict(X_test)

# Evaluate the model using macro-averaged metrics
report = classification_report(y_test, y_pred, digits=3)

# Print the classification report
print(report)


              precision    recall  f1-score   support



           0      0.940     0.998     0.968      5875

           1      0.809     0.127     0.220       432



    accuracy                          0.938      6307

   macro avg      0.874     0.563     0.594      6307

weighted avg      0.931     0.938     0.917      6307


In [182]:
pipeline = Pipeline(steps=[
    ('preprocessing', TextPreprocessor(use_lemmatization=True)),
    ('Vectorizing', FeatureExtractor(vectorizer='bow',ngram_range=(1,2))),
    ('model', model),
])

# Training the model
pipeline.fit(X_train, y_train)

# Evaluating the model
y_pred = pipeline.predict(X_test)

# Evaluate the model using macro-averaged metrics
report = classification_report(y_test, y_pred, digits=3)

# Print the classification report
print(report)

              precision    recall  f1-score   support



           0      0.966     0.996     0.981      5875

           1      0.909     0.530     0.670       432



    accuracy                          0.964      6307

   macro avg      0.938     0.763     0.825      6307

weighted avg      0.963     0.964     0.960      6307


#### Done!