**Dataset**
A labeled dataset of tweets collected from Twitter.

**Objective**
Distinguish tweets that contain hate speech from those that do not. <br>
0 -> no hate speech <br>
1 -> contains hate speech <br>


**Evaluation metric**
Macro F1 score

### Import used libraries

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 500)

### Load Dataset

###### Note: a TSV file is loaded with `pd.read_csv` by setting `sep="\t"`

In [4]:
df = pd.read_csv("Lab 1 - Hate Speech.tsv", sep= "\t")
df.head(100)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo
6,7,0,@user camping tomorrow @user @user @user @user @user @user @user dannyâ¦
7,8,0,the next school year is the year for exams.ð¯ can't think about that ð­ #school #exams #hate #imagine #actorslife #revolutionschool #girl
8,9,0,we won!!! love the land!!! #allin #cavs #champions #cleveland #clevelandcavaliers â¦
9,10,0,@user @user welcome here ! i'm it's so #gr8 !


### Data splitting

It is good practice to split the data before EDA. Exploring only the training portion maintains the integrity of the machine learning process: it prevents data leakage (decisions shaped by examples the model will later be tested on), simulates real-world scenarios more accurately, and ensures that performance is evaluated reliably on truly unseen data.

In [None]:
df.columns

Index(['id', 'label', 'tweet'], dtype='object')

In [5]:
from sklearn.model_selection import train_test_split

x = df[['tweet']]
y = df[['label']]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [6]:
x.head()

Unnamed: 0,tweet
0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,bihday your majesty
3,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,factsguide: society now #motivation


### EDA on training data

In [7]:
training_data = pd.concat([x_train, y_train], axis = 1)

In [8]:
training_data.head()

Unnamed: 0,tweet,label
17097,big shout out to @user who replaced my hard used jacket under warranty. looking forward to breaking it in on the wet coast!,0
8390,#happy #monday! no #need or #reason to #feel or #cry ever #again. #god has #fulfilled his #promise to us #all! have and #awesome day!,0
12840,"check out all the stars who made our sister team, @user",0
25608,awesome -- it's the rare wildly hyped phenomenon that actually lives up to the hype.,0
9295,jamming to the new @user song my day is made! #hcr #floatyourboat #yesssssâ¦,0


- check NaNs

In [9]:
training_data.isnull().sum()

tweet    0
label    0
dtype: int64

- check duplicates

In [10]:
training_data.duplicated().sum()

1779

In [11]:
training_data.drop_duplicates(inplace = True)

- show a sample of tweets to identify the required preprocessing steps

In [12]:
sample_tweets = training_data['tweet'].sample(n=20)
for tweet in sample_tweets:
    print(tweet)
    print("-" *100)


no words needed âºï¸ð #kids #children #love #adorable  #model #daughter   #picofthedayâ¦
----------------------------------------------------------------------------------------------------
@user @user @user @user is it true #lnp will cut seniors pension if they own their home on july 1st 2017?
----------------------------------------------------------------------------------------------------
our winter menu is coming soon   #thegrove #newmenu #pies #lambshanks #pasta
----------------------------------------------------------------------------------------------------
this idiot makes my blood boil!  #pig #liar
----------------------------------------------------------------------------------------------------
#gingergothitched we can't get over yesterday night! #love #wedding #beautiful #couple #inlove   #pay...
----------------------------------------------------------------------------------------------------
got my tickets for @user today. looking forward to a great day! than

- check class balance

In [13]:
class_counts = training_data['label'].value_counts()
print("Class Counts:")
print(class_counts)

Class Counts:
label
0    21837
1     1612
Name: count, dtype: int64


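The classes are heavily imbalanced (roughly 93% of the training tweets are labeled 0 and 7% are labeled 1), so the macro-averaged F1 score is used for evaluation: it weights both classes equally regardless of their size.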
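To illustrate why plain accuracy is misleading here, the sketch below scores a hypothetical baseline that always predicts class 0, using the class counts printed above. The baseline and the variable names are assumptions for illustration, not part of the notebook's pipeline.

```python
# Class counts taken from the value_counts() output above.
n_negative, n_positive = 21837, 1612
total = n_negative + n_positive

# Hypothetical baseline: always predict class 0 (no hate speech).
# It gets every negative right and every positive wrong.
accuracy = n_negative / total

# Per-class F1 = 2*TP / (2*TP + FP + FN).
# Class 0: TP = all negatives, FP = all positives, FN = 0.
f1_negative = 2 * n_negative / (2 * n_negative + n_positive)
# Class 1: no true positives, so F1 is taken as 0.
f1_positive = 0.0

macro_f1_baseline = (f1_negative + f1_positive) / 2
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1_baseline:.3f}")
```

The baseline reaches about 0.931 accuracy but only about 0.482 macro F1, because the score for the hate-speech class is zero and counts for half of the average.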

- Cleaning and preprocessing steps:
    - lemmatization
    - stop-word removal
    - removal of emojis, hashtags, mentions, and other non-word characters (except spaces)
    - removal of digits
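The regex-based steps in the list above can be sketched with the standard library alone. This is a simplified illustration, not the notebook's transformer: the tiny stop-word set `DEMO_STOP_WORDS` and the function `clean_tweet` are hypothetical names, lemmatization is omitted because it needs NLTK, and the final whitespace collapse is an extra step added for readability.

```python
import re

# Hypothetical, tiny stop-word set used only for illustration;
# the notebook uses NLTK's English stop-word list instead.
DEMO_STOP_WORDS = {"a", "and", "for", "i", "in", "is", "the", "to"}


def clean_tweet(text, stop_words=DEMO_STOP_WORDS):
    """Apply the regex-based cleaning steps to one tweet (no lemmatization)."""
    # Drop stop words first, while the tokens are still intact.
    tokens = [tok for tok in text.split() if tok.lower() not in stop_words]
    cleaned = " ".join(tokens)
    cleaned = re.sub(r"[^\w\s]", "", cleaned)        # punctuation, '#', '@'
    cleaned = re.sub(r"[^\x00-\x7F]+", "", cleaned)  # non-ASCII characters
    cleaned = re.sub(r"\d+", "", cleaned)            # digits
    return re.sub(r"\s+", " ", cleaned).strip()      # collapse leftover spaces


example = "@user thanks for the #lyft credit in 2017!!!"
print(clean_tweet(example))  # user thanks lyft credit
```

Note that these patterns strip only the `#` and `@` symbols; the word attached to a hashtag or mention stays in the text.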

### Cleaning and Preprocessing

#### Extra: use custom scikit-learn Transformers

Using custom transformers in scikit-learn provides flexibility, reusability, and control over the data transformation process, allowing you to seamlessly integrate with scikit-learn's pipelines, enabling you to combine multiple preprocessing steps and modeling into a single workflow. This makes your code more modular, readable, and easier to maintain.

##### link: https://www.andrewvillazon.com/custom-scikit-learn-transformers/

#### Example usage:

In [14]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [15]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [16]:
import re

In [34]:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        """Return a Series of cleaned tweets built from the 'tweet' column of X."""
        transformed_tweets = []
        for tweet in X['tweet']:
            words = nltk.word_tokenize(tweet)
            # Drop stop words, then lemmatize the remaining tokens
            transformed_words = [lemmatizer.lemmatize(word) for word in words if word.lower() not in stop_words]
            transformed_tweet = ' '.join(transformed_words)
            transformed_tweet = re.sub(r'[^\w\s]', '', transformed_tweet)  # remove punctuation and symbols (keep word characters and whitespace)
            transformed_tweet = re.sub(r'[^\x00-\x7F]+', '', transformed_tweet)  # remove non-ASCII characters such as ¼âïðµ
            transformed_tweet = re.sub(r'\d+', '', transformed_tweet)  # remove digits

            transformed_tweets.append(transformed_tweet)

        # CountVectorizer expects a 1-D iterable of strings, so return a Series
        return pd.Series(transformed_tweets)

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)

In [35]:
x_train = training_data[['tweet']]
y_train = training_data[['label']]

In [29]:
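# Optional sanity check: run the transformer on its own (fit() alone returns the transformer, not the data)
#transformer = CustomTransformer()
#transformer.fit_transform(x_train).head()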

### Modelling

#### Extra: use scikit-learn pipeline

##### link: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

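A scikit-learn `Pipeline` chains preprocessing, vectorization, and the model into a single estimator. A single `fit` call trains every step in order, and `predict` applies the same fitted transformations to new data, which keeps the code organized, makes results reproducible, and avoids accidentally fitting a preprocessing step on test data.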

#### Example usage:

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

model = LogisticRegression()
Vectorizer = CountVectorizer()

pipeline = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('Vectorizing', Vectorizer),
    ('model', model),
])

In [38]:
#Now you can use the pipeline for training and prediction
pipeline.fit(x_train, y_train)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [39]:
y_pred = pipeline.predict(x_test)

#### Evaluation

**Evaluation metric:**
Macro F1 score

Macro F1 is the unweighted mean of the per-class F1 scores, so every class contributes equally regardless of how many examples it has. This makes it a useful measure of overall classification performance, **particularly when the classes are imbalanced**.

![Calculation](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/639c3d934e82c1195cdf3c60_macro-f1.webp)
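To make the averaging concrete, here is a minimal pure-Python sketch that computes macro F1 from two label lists. The helper names `f1_for_class` and `macro_f1_score` and the toy labels are assumptions for illustration; the notebook itself relies on `sklearn.metrics.f1_score` with `average='macro'`.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Assumption: a class with no true or predicted examples scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1_score(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels: eight tweets of class 0, two of class 1.
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(f1_for_class(y_true_demo, y_pred_demo, 0))  # 0.875
print(f1_for_class(y_true_demo, y_pred_demo, 1))  # 0.5
print(macro_f1_score(y_true_demo, y_pred_demo))   # 0.6875
```

In this toy case 8 of 10 predictions are correct, yet the macro F1 is only 0.6875 because the minority class, with an F1 of 0.5, carries the same weight as the majority class.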

In [40]:
from sklearn.metrics import f1_score

macro_f1 = f1_score(y_test, y_pred, average='macro')
macro_f1

0.8169867382910861

Let's try another model

In [55]:
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
Vectorizer = CountVectorizer()

pipeline = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('Vectorizing', Vectorizer),
    ('model', model),
])

In [56]:
pipeline.fit(x_train, y_train)

  y = column_or_1d(y, warn=True)


In [57]:
y_pred = pipeline.predict(x_test)

In [58]:
macro_f1 = f1_score(y_test, y_pred, average='macro')
macro_f1

0.8326000394036601

### Conclusion and final results



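On the test set, the SGDClassifier pipeline reached a macro F1 of about 0.833, slightly ahead of LogisticRegression at about 0.817 (the latter stopped at its iteration limit before converging).

The SGDClassifier is a linear classifier that handles large-scale, sparse data efficiently. It is particularly useful for text classification and other natural language processing tasks, where bag-of-words features produce very high-dimensional input.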