**Dataset**
labeled datasset collected from twitter (Lab 1 - Hate Speech.tsv)

**Objective**
classify tweets containing hate speech from other tweets. <br>
0 -> no hate speech <br>
1 -> contains hate speech <br>

**Total Estimated Time = 90-120 Mins**

**Evaluation metric**
macro f1 score

### Import used libraries

In [9]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
     ---------------------------------------- 1.5/1.5 MB 2.3 MB/s eta 0:00:00
Collecting regex>=2021.8.3
  Downloading regex-2024.4.16-cp310-cp310-win_amd64.whl (268 kB)
     -------------------------------------- 268.9/268.9 kB 1.3 MB/s eta 0:00:00
Installing collected packages: regex, nltk
Successfully installed nltk-3.8.1 regex-2024.4.16


In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In [57]:
!pip install contractions
import contractions
import string

In [15]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\NAWAL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Load Dataset

###### Note: search how to load the data from tsv file

In [44]:
pd.set_option('display.max_rows',500)
pd.set_option('display.max_colwidth',500)

In [45]:
data = pd.read_csv("Lab 1 - Hate Speech.tsv", sep='\t')

In [46]:
data.head(10)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo
6,7,0,@user camping tomorrow @user @user @user @user @user @user @user dannyâ¦
7,8,0,the next school year is the year for exams.ð¯ can't think about that ð­ #school #exams #hate #imagine #actorslife #revolutionschool #girl
8,9,0,we won!!! love the land!!! #allin #cavs #champions #cleveland #clevelandcavaliers â¦
9,10,0,@user @user welcome here ! i'm it's so #gr8 !


### Data splitting

It is a good practice to split the data before EDA helps maintain the integrity of the machine learning process, prevents data leakage, simulates real-world scenarios more accurately, and ensures reliable model performance evaluation on unseen data.

In [47]:
X_train, X_test, y_train, y_test = train_test_split(data['tweet'], data['label'], test_size=0.2, random_state=123)

### EDA on training data

- check NaNs

In [48]:
X_train.isnull().sum()

0

- check duplicates

In [49]:
X_train.duplicated().sum()

1807

- show a representative sample of data texts to find out required preprocessing steps

In [52]:
X_train.sample(10)

12060                             why are white people #expats when the rest of us are #immigrants?   #classism #imperialsm
628                               @user happy father's day to all dads except #rogergoodell! #fathersday   #goodellsucksâ¦
3972                                                                               @user i got muted on facebook... #nogift
9937                                          so it's a..........ððððð #girl #baby #pregnancy #princess  â¦
5858                                                                       i wake up to a empty phone every morningðð
15032                                                             ð   south sudan allowed soldiers to rape as wages : un
6661     jo_yana : my body is in the air but my mind is still in firenze... |    #pittiuomo #pittiuomo90 #pittiimmagine â¦
6114                     @user i pay 70 bucks for internet and i get 140 up 70 down #halodedication #grownassman   #hashtag
9334    

- check dataset balancing

In [35]:
y_train.value_counts()

label
0    23491
1     1737
Name: count, dtype: int64

###### **unbalanced data ---> do we will use macro**

- Cleaning and Preprocessing are:
    - 1
    - 2
    - 3
    - ... etc.

### Cleaning and Preprocessing

#### Extra: use custom scikit-learn Transformers

Using custom transformers in scikit-learn provides flexibility, reusability, and control over the data transformation process, allowing you to seamlessly integrate with scikit-learn's pipelines, enabling you to combine multiple preprocessing steps and modeling into a single workflow. This makes your code more modular, readable, and easier to maintain.

##### link: https://www.andrewvillazon.com/custom-scikit-learn-transformers/

#### Example usage:

In [65]:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        # Add code for fitting the transformer here
        
        # Remove stopwords
        vectorizer = CountVectorizer(stop_words='english')
        vectorizer.fit(X)
        X = vectorizer.transform(X)
        return X
    
    def transform(self, X):
        # Add code for transforming the data here
        
        cleaned_texts = []
        for text in X:
            # Lowercase
            text = text.lower()
            # Removing newlines and tabs
            text = text.replace('\n', '').replace('\t', '')
            # Remove URLs
            text = re.sub(r'http\S+', '', text)
            # Remove punctuation and digits
            text = re.sub(r'[^a-zA-Z\s]', '', text)
            # remove contractions then remove punctuations
            text = contractions.fix(text)
            text = text.translate(str.maketrans('', '', string.punctuation))
            
            cleaned_texts.append(text)
            
        return cleaned_texts
    
    def fit_transform(self, X, y=None):
        # This function combines fit and transform
        self.fit(X, y)
        return self.transform(X)

**You  are doing Great so far!**

### Modelling

#### Extra: use scikit-learn pipline

##### link: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Using pipelines in scikit-learn promotes better code organization, reproducibility, and efficiency in machine learning workflows.

#### Example usage:

In [66]:
from sklearn.pipeline import Pipeline

model = LogisticRegression()

# Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('vectorizer', TfidfVectorizer()),
    ('model', model),
])

# Now you can use the pipeline for training and prediction
pipeline.fit(X_train, y_train)

#### Evaluation

**Evaluation metric:**
macro f1 score

Macro F1 score is a useful metric in scenarios where you want to evaluate the overall performance of a multi-class classification model, **particularly when the classes are imbalanced**

![Calculation](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/639c3d934e82c1195cdf3c60_macro-f1.webp)

In [67]:
from sklearn import metrics

In [79]:
y_pred = pipeline.predict(X_test)
report = metrics.classification_report(y_test, y_pred)
print(report)   # without macro 

              precision    recall  f1-score   support

           0       0.94      1.00      0.97      5831
           1       0.90      0.27      0.41       476

    accuracy                           0.94      6307
   macro avg       0.92      0.63      0.69      6307
weighted avg       0.94      0.94      0.93      6307



In [80]:
from sklearn.metrics import f1_score

In [82]:
macro_f1 = f1_score(y_test, y_pred, average='macro') 

print("Macro F1 Score:", macro_f1)

Macro F1 Score: 0.6907059017590702


### Enhancement

- Using different N-grams
- Using different text representation technique
- Hyperparameter tuning

In [71]:
param_grid = {
    'vectorizer': [TfidfVectorizer(), CountVectorizer()],
    'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'model__C': [0.1, 1, 10, 100]}

In [73]:
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro', verbose=1)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [74]:
best_model = grid_search.best_estimator_

### Conclusion and final results


In [76]:
y_pred = best_model.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

In [77]:
macro_f1 = f1_score(y_test, y_pred, average='macro')
macro_f1

0.828886395643053

#### Done!