**Dataset**
labeled dataset collected from twitter (Hate Speech.tsv)

**Objective**
classify tweets containing hate speech from other tweets. <br>
0 -> no hate speech <br>
1 -> contains hate speech <br>

**Evaluation metric**
macro f1 score

**Steps**

To classify hate speech in tweets, follow these key steps:

1. **Data Preprocessing**: Clean text (remove punctuation, stopwords, etc.), lowercase, tokenize, and so on.
2. **Text Representation**: Use Bag of Words, TF-IDF, or word embeddings (e.g., GloVe, Word2Vec, or FastText).
3. **Modeling Approaches**:
   - **Traditional Models**: Logistic Regression, Naive Bayes, SVM, Random Forest.
   - **Deep Learning**: LSTM or RNN.
4. **Evaluation**
5. **Optimization**: Use hyperparameter tuning, regularization, and ensemble methods for better performance.


### Import used libraries

In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')  # For tokenization
nltk.download('wordnet')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [2]:
import pandas as pd
import numpy as np
import re
import gensim.downloader as api
from gensim.models import KeyedVectors

# Torch for deep learning
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Scikit-learn
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Imbalanced-learn
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from imblearn.pipeline import make_pipeline

# NLTK for text preprocessing
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Set pandas display options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 500)


### Load Dataset

###### Note: search how to load the data from tsv file

In [3]:
df = pd.read_csv("Hate Speech.tsv", sep= "\t")
df.head()


Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation


### Data splitting

It is a good practice to split the data before EDA helps maintain the integrity of the machine learning process, prevents data leakage, simulates real-world scenarios more accurately, and ensures reliable model performance evaluation on unseen data.

In [4]:
# Split data into features (X) and target (y)
X = df.drop('label', axis=1)
y = df['label']

# Split data into train and test (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Further split the training data into train and validation (75% train, 25% validation)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print(f"Training Set: {X_train.shape}, Validation Set: {X_val.shape}, Test Set: {X_test.shape}")


Training Set: (18921, 2), Validation Set: (6307, 2), Test Set: (6307, 2)


In [5]:
# Combine features and labels for each dataset
train_combined = pd.concat([X_train, y_train], axis=1)
test_combined = pd.concat([X_test, y_test], axis=1)
val_combined = pd.concat([X_val, y_val], axis=1)


### EDA on training data

- check NaNs
  
result:there is no need for dropping null in prepossing step

In [6]:
# Check for NaN values in each column
nan_counts_train = train_combined.isna().sum()
nan_counts_test = test_combined.isna().sum()
nan_counts_validation = val_combined.isna().sum()
print(f"the count of NaNs in training dataset features {nan_counts_train} ")
print(f"the count of NaNs in testing dataset features{nan_counts_test} ")
print(f"the count of NaNs in validation dataset features{nan_counts_validation}")


the count of NaNs in training dataset features id       0
tweet    0
label    0
dtype: int64 
the count of NaNs in testing dataset featuresid       0
tweet    0
label    0
dtype: int64 
the count of NaNs in validation dataset featuresid       0
tweet    0
label    0
dtype: int64


- check duplicates

there is no duplicates

In [7]:

# Check for duplicates in each combined dataset
duplicates_train_combined = train_combined.duplicated().sum()
duplicates_test_combined = test_combined.duplicated().sum()
duplicates_val_combined = val_combined.duplicated().sum()

# Print the count of duplicated rows in each dataset
print(f"The count of duplicated rows in training dataset : {duplicates_train_combined}")
print(f"The count of duplicated rows in testing dataset : {duplicates_test_combined}")
print(f"The count of duplicated rows in validation dataset : {duplicates_val_combined}")


The count of duplicated rows in training dataset : 0
The count of duplicated rows in testing dataset : 0
The count of duplicated rows in validation dataset : 0


- show a representative sample of data texts to find out required preprocessing steps

In [8]:
sample_data = X_train.sample(5, random_state=42)

# Print the representative sample
sample_data.head()


Unnamed: 0,id,tweet
15827,16232,â #eur/usd unable to regain 1.1300 despite dovish fed #blog #silver #gold #forex
3220,3335,take pop out for churrasco this sunday! #fathersday
30026,30454,new video shows how this guy and how he staed to abuse and humiliated an muslim guyâ¦
13620,13974,video: currently at @user @user #cute #eyes #like #follow #instafit #
7436,7561,@user @user #wishing you a lovely mid week wednesday day


- check dataset balancing

  The Dataset is not balanced

In [9]:
# Check class distribution for training, testing, and validation datasets
train_class_distribution = train_combined['label'].value_counts(normalize=True)
test_class_distribution = test_combined['label'].value_counts(normalize=True)
val_class_distribution = val_combined['label'].value_counts(normalize=True)

# Print class distributions
print("Class distribution in training dataset:")
print(train_class_distribution)

print("\nClass distribution in testing dataset:")
print(test_class_distribution)

print("\nClass distribution in validation dataset:")
print(val_class_distribution)


Class distribution in training dataset:
label
0    0.928598
1    0.071402
Name: proportion, dtype: float64

Class distribution in testing dataset:
label
0    0.931505
1    0.068495
Name: proportion, dtype: float64

Class distribution in validation dataset:
label
0    0.931822
1    0.068178
Name: proportion, dtype: float64


- Cleaning and Preprocessing are:
    - 1-Fix Encoding Issues: As seen in the sample, there are encoding problems (â, â¦)
    - 2-Remove Hashtags and Mentions
    - 3-Remove Special Characters and Punctuation
    - 4-Lowercasing: Convert all text to lowercase to standardize it.
    - 5-Remove Stopwords
    - 6-remove URLs if they are found
    - 7-Lemmatization: Reduce words to their base forms (e.g., "running" → "run").
    - 8- data is unbalanced we can use oversampling teq like ADASYN .
    - 9-Tokenization: Break the text into individual words.
    - 10-Handling Emojis: Emojis might carry sentiment and may need to be processed or kept.


### Cleaning and Preprocessing

#### Use custom scikit-learn Transformers

Using custom transformers in scikit-learn provides flexibility, reusability, and control over the data transformation process, allowing you to seamlessly integrate with scikit-learn's pipelines, enabling you to combine multiple preprocessing steps and modeling into a single workflow. This makes your code more modular, readable, and easier to maintain.

##### link: https://www.andrewvillazon.com/custom-scikit-learn-transformers/

#### Example usage:

In [None]:
class CustomTextTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, max_features=1000, vectorizer=None):
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        X_preprocessed = X['tweet'].apply(self.preprocess_text)
        if self.vectorizer:
            self.vectorizer.fit(X_preprocessed)
        return self

    def transform(self, X):
        X_preprocessed = X['tweet'].apply(self.preprocess_text)
        if self.vectorizer:
            X_transformed = self.vectorizer.transform(X_preprocessed.astype(str)).toarray()
            return X_transformed
        return X_preprocessed

    def fit_transform(self, X, y=None):
        X_preprocessed = X['tweet'].apply(self.preprocess_text)
        if self.vectorizer:
            X_transformed = self.vectorizer.fit_transform(X_preprocessed.astype(str)).toarray()
            return X_transformed
        return X_preprocessed

    def preprocess_text(self, text):
        text = self.fix_encoding(text)
        text = self.remove_hashtags_mentions(text)
        text = self.remove_special_characters(text)
        text = self.lowercase_text(text)
        text = self.remove_stopwords(text)
        text = self.remove_urls(text)
        text = self.lemmatize_text(text)
        text = self.tokenize_text(text)
        return text

    def fix_encoding(self, text):
        return text.encode('utf-8', 'ignore').decode('utf-8') if isinstance(text, str) else str(text)

    def remove_hashtags_mentions(self, text):
        return re.sub(r'[@#]\w+', '', text)

    def remove_special_characters(self, text):
        return re.sub(r'[^\w\s]', '', text)

    def lowercase_text(self, text):
        return text.lower()

    def remove_stopwords(self, text):
        return ' '.join([word for word in text.split() if word not in self.stop_words])

    def remove_urls(self, text):
        return re.sub(r'http\S+|www\S+', '', text)

    def lemmatize_text(self, text):
        return ' '.join([self.lemmatizer.lemmatize(word) for word in text.split()])

    def tokenize_text(self, text):
        return ' '.join(word_tokenize(text))


In [11]:

# Initialize the custom transformer without ADASYN
transformer = CustomTextTransformer(max_features=1000,vectorizer=TfidfVectorizer(max_features=1000))


# Transform training data
X_train_transformed = transformer.fit_transform(X_train)

# Apply ADASYN separately to transformed training data
adasyn = ADASYN(random_state=42)
X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train_transformed, y_train)

print(pd.Series(y_train_resampled).value_counts())

label
1    17899
0    17570
Name: count, dtype: int64


In [12]:
X_test_transformed = transformer.transform(X_test)


**You  are doing Great so far!**

### Modelling

#### Extra: use scikit-learn pipline

##### link: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Using pipelines in scikit-learn promotes better code organization, reproducibility, and efficiency in machine learning workflows.

#### Example usage:

In [None]:
class LSTMClassifier(nn.Module):
    def __init__(self, input_size=1000, hidden_size=64, output_size=1, epochs=10, batch_size=32):
        super(LSTMClassifier, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.epochs = epochs
        self.batch_size = batch_size

        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()
        self.optimizer = optim.Adam(self.parameters(), lr=0.001)
        self.criterion = nn.BCELoss()

    def forward(self, x):
        lstm_out, _ = self.lstm(x)

        if len(lstm_out.shape) == 3:
            lstm_out = lstm_out[:, -1, :]  
        elif len(lstm_out.shape) == 2:
            lstm_out = lstm_out[:, :self.hidden_size] 

        out = self.fc(lstm_out)
        return self.sigmoid(out)

    def fit(self, X, y):
        X_train_tensor = torch.tensor(X, dtype=torch.float32)
        y_train_tensor = torch.tensor(y, dtype=torch.float32).unsqueeze(1)

        train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
        train_loader = DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)

        self.train()
        for epoch in range(self.epochs):
            for X_batch, y_batch in train_loader:
                self.optimizer.zero_grad()
                outputs = self(X_batch)
                loss = self.criterion(outputs, y_batch)
                loss.backward()
                self.optimizer.step()
            print(f"Epoch {epoch+1}/{self.epochs}, Loss: {loss.item()}")

    def predict(self, X):
        X_tensor = torch.tensor(X, dtype=torch.float32).unsqueeze(1)

        self.eval()

        with torch.no_grad():
            predictions = self(X_tensor)

        predictions_binary = (predictions >= 0.5).float()

        return predictions_binary


In [14]:

pipeline = Pipeline([
    ('preprocessing', CustomTextTransformer(max_features=1000, vectorizer=TfidfVectorizer(max_features=1000))),
    ('adasyn', ADASYN(random_state=42)),
    ('classifier', LogisticRegression())
])
# Now you can fit the pipeline on training data and use it for predictions
pipeline.fit(X_train, y_train)
y_pred_logistic = pipeline.predict(X_test)


In [15]:
pipeline = Pipeline([
    ('preprocessing', CustomTextTransformer(max_features=1000, vectorizer=TfidfVectorizer(max_features=1000))),
    ('adasyn', ADASYN(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit the pipeline on training data
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = pipeline.predict(X_test)

In [16]:
# Create the pipeline with CustomTextTransformer, ADASYN, and LSTMClassifier

pipeline = Pipeline([
    ('preprocessing', CustomTextTransformer(max_features=1000, vectorizer=TfidfVectorizer(max_features=1000))),
    ('adasyn', ADASYN(random_state=42)),
    ('classifier', LSTMClassifier(input_size=1000, hidden_size=64, epochs=10, batch_size=32))
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Predict with the pipeline
y_pred_lstm = pipeline.predict(X_test)

Epoch 1/10, Loss: 0.5308780670166016
Epoch 2/10, Loss: 0.28685304522514343
Epoch 3/10, Loss: 0.5718701481819153
Epoch 4/10, Loss: 0.29492565989494324
Epoch 5/10, Loss: 0.31292545795440674
Epoch 6/10, Loss: 0.38197338581085205
Epoch 7/10, Loss: 0.3122020959854126
Epoch 8/10, Loss: 0.5348643660545349
Epoch 9/10, Loss: 0.17346535623073578
Epoch 10/10, Loss: 0.1454213410615921


#### Evaluation

**Evaluation metric:**
macro f1 score

Macro F1 score is a useful metric in scenarios where you want to evaluate the overall performance of a multi-class classification model, **particularly when the classes are imbalanced**

![Calculation](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/639c3d934e82c1195cdf3c60_macro-f1.webp)

For LogisticRegression with TfidfVectorizer

In [17]:
f1_macro = f1_score(y_test, y_pred_logistic, average='macro')

print("F1 Score:", f1_macro)

F1 Score: 0.5879052172791551


For RandomForestClassifier with TfidfVectorizer

In [18]:
f1 = f1_score(y_test, y_pred_rf, average='macro')
print("F1 Score:", f1)

F1 Score: 0.6183633905152892


For LSTMClassifier with TfidfVectorizer

In [19]:
# Evaluate the model
f1 = f1_score(y_test, y_pred_lstm, average='macro')
print(f"F1 Score: {f1}")

F1 Score: 0.6091836863525852


### Enhancement

- Using different text representation or modeling techniques
- Hyperparameter tuning

Use glove instead of TfidfVectorizer

In [20]:
embedding_model = api.load("glove-wiki-gigaword-100")

class EmbeddingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, embedding_model, embedding_dim=100):
        self.embedding_model = embedding_model
        self.embedding_dim = embedding_dim

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([self._average_embedding(doc) for doc in X])

    def _average_embedding(self, doc):
        words = doc.split()
        word_embeddings = [self.embedding_model[word] for word in words if word in self.embedding_model]
        return np.mean(word_embeddings, axis=0) if word_embeddings else np.zeros(self.embedding_dim)





In [21]:
pipeline = Pipeline([
    ('preprocessing', CustomTextTransformer(max_features=1000, vectorizer=None)),
    ('embedding', EmbeddingTransformer(embedding_model=embedding_model)),
    ('adasyn', ADASYN(random_state=42)),
    ('classifier', LogisticRegression())
])

# Fit and predict with the pipeline
pipeline.fit(X_train, y_train)
y_pred_logistic_glove = pipeline.predict(X_test)

In [22]:
pipeline = Pipeline([
    ('preprocessing', CustomTextTransformer(max_features=1000, vectorizer=None)),
    ('embedding', EmbeddingTransformer(embedding_model=embedding_model)),
    ('adasyn', ADASYN(random_state=42)),
    ('classifier', RandomForestClassifier())
])

# Fit and predict with the pipeline
pipeline.fit(X_train, y_train)
y_pred_rf_glove = pipeline.predict(X_test)

In [23]:
pipeline = Pipeline([
    ('preprocessing', CustomTextTransformer(max_features=1000, vectorizer=None)),
    ('embedding', EmbeddingTransformer(embedding_model=embedding_model)),
    ('adasyn', ADASYN(random_state=42)),
    ('classifier',LSTMClassifier(input_size=100, hidden_size=64, epochs=10, batch_size=32))
])

# Fit and predict with the pipeline
pipeline.fit(X_train, y_train)
y_pred_lstm_glove = pipeline.predict(X_test)

Epoch 1/10, Loss: 0.42926865816116333
Epoch 2/10, Loss: 0.29352521896362305
Epoch 3/10, Loss: 0.36207205057144165
Epoch 4/10, Loss: 0.09106875956058502
Epoch 5/10, Loss: 0.19850663840770721
Epoch 6/10, Loss: 0.27941569685935974
Epoch 7/10, Loss: 0.1117481142282486
Epoch 8/10, Loss: 0.29841166734695435
Epoch 9/10, Loss: 0.06586093455553055
Epoch 10/10, Loss: 0.11626700311899185


For LogisticRegression with glove

In [24]:
f1 = f1_score(y_test, y_pred_logistic_glove, average='macro')
print("F1 Score:", f1)

F1 Score: 0.5913852813185521


For RandomForestClassifier with glove

In [25]:
f1 = f1_score(y_test, y_pred_rf_glove, average='macro')
print("F1 Score:", f1)

F1 Score: 0.719606430468441


For LSTMClassifier with glove

In [26]:
# Evaluate the model
f1 = f1_score(y_test, y_pred_lstm_glove, average='macro')
print(f"F1 Score: {f1}")

F1 Score: 0.6643312874543827


Fine tuning for lstm with glove

In [27]:


# Define hyperparameters grid
param_grid = {
    'hidden_size': [32, 64, 128],
    'epochs': [5, 10, 15],
    'batch_size': [16, 32, 64]
}

# Generate all combinations of hyperparameters
grid = ParameterGrid(param_grid)

# Store results
results = []

for params in grid:
    print(f"Training with parameters: {params}")

    # Define the pipeline with current parameters
    pipeline = Pipeline([
        ('preprocessing', CustomTextTransformer(max_features=1000, vectorizer=None)),
        ('embedding', EmbeddingTransformer(embedding_model=embedding_model)),
        ('adasyn', ADASYN(random_state=42)),
        ('classifier', LSTMClassifier(input_size=100, hidden_size=params['hidden_size'], epochs=params['epochs'], batch_size=params['batch_size']))
    ])

    # Fit the pipeline on the training data
    pipeline.fit(X_train, y_train)

    # Predict on validation data
    y_pred_val = pipeline.predict(X_val)

    # Evaluate performance using F1 score
    score = f1_score(y_val, y_pred_val, average='macro')
    results.append({**params, 'f1_score': score})
    print(f"Validation f1_score: {score}")

# Get the best model based on F1 score
best_params = max(results, key=lambda x: x['f1_score'])
print(f"Best parameters: {best_params}")


Training with parameters: {'batch_size': 16, 'epochs': 5, 'hidden_size': 32}
Epoch 1/5, Loss: 0.5170995593070984
Epoch 2/5, Loss: 0.21430934965610504
Epoch 3/5, Loss: 0.8821848630905151
Epoch 4/5, Loss: 0.22437450289726257
Epoch 5/5, Loss: 0.34859591722488403
Validation f1_score: 0.6218548282197617
Training with parameters: {'batch_size': 16, 'epochs': 5, 'hidden_size': 64}
Epoch 1/5, Loss: 0.31929394602775574
Epoch 2/5, Loss: 0.23752300441265106
Epoch 3/5, Loss: 0.3831334412097931
Epoch 4/5, Loss: 0.26279494166374207
Epoch 5/5, Loss: 0.4612191617488861
Validation f1_score: 0.6451704296619705
Training with parameters: {'batch_size': 16, 'epochs': 5, 'hidden_size': 128}
Epoch 1/5, Loss: 0.42075666785240173
Epoch 2/5, Loss: 0.3708648979663849
Epoch 3/5, Loss: 0.2890675365924835
Epoch 4/5, Loss: 0.297117680311203
Epoch 5/5, Loss: 0.3386310636997223
Validation f1_score: 0.6723673831279523
Training with parameters: {'batch_size': 16, 'epochs': 10, 'hidden_size': 32}
Epoch 1/10, Loss: 0.4343

In [28]:
# Define the pipeline with current parameters
pipeline = Pipeline([
        ('preprocessing', CustomTextTransformer(max_features=1000, vectorizer=None)),
        ('embedding', EmbeddingTransformer(embedding_model=embedding_model)),
        ('adasyn', ADASYN(random_state=42)),
        ('classifier', LSTMClassifier(input_size=100, hidden_size=128, epochs=15, batch_size=16))
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)
y_pred_lstm_glove_tuned = pipeline.predict(X_test)

Epoch 1/15, Loss: 0.23303402960300446
Epoch 2/15, Loss: 0.5155577063560486
Epoch 3/15, Loss: 0.4906882643699646
Epoch 4/15, Loss: 0.22091908752918243
Epoch 5/15, Loss: 0.05827739089727402
Epoch 6/15, Loss: 0.43039825558662415
Epoch 7/15, Loss: 0.40079304575920105
Epoch 8/15, Loss: 0.16123683750629425
Epoch 9/15, Loss: 0.0581737756729126
Epoch 10/15, Loss: 0.18877704441547394
Epoch 11/15, Loss: 0.20361539721488953
Epoch 12/15, Loss: 0.316938579082489
Epoch 13/15, Loss: 0.013389569707214832
Epoch 14/15, Loss: 0.02015012502670288
Epoch 15/15, Loss: 0.01455944124609232


In [29]:
# Evaluate the model
f1 = f1_score(y_test, y_pred_lstm_glove_tuned, average='macro')
print(f"F1 Score: {f1}")

F1 Score: 0.7361710571477604


### Conclusion and final results




### Summary

1. **Exploratory Data Analysis (EDA)**:
   - The dataset was found to be clean with no missing values or duplicates.

2. **Data Imbalance**:
   - The dataset was imbalanced, and **ADASYN** was used as an oversampling technique to balance the class distribution in the training data.

3. **Preprocessing Steps**:
   - **Encoding Fixes**: Resolved encoding issues.
   - **Text Cleaning**: Removed hashtags, mentions, special characters, stopwords, and URLs.
   - **Lemmatization and Tokenization**: Reduced words to base forms and split text into words.
   - **Emoji Handling**: Emojis were preserved, considering their potential sentiment value.

4. **Modeling**:
   - **Logistic Regression**: Achieved an **F1 score of 0.5879**.
   - **Random Forest**: Achieved a slightly better **F1 score of 0.618363**.
   - **LSTM Classifier**: Outperformed the other models with the highest **F1 score of 0.609**.

5. **Conclusions**:
   - The **Random Forest** was the most effective.
   - The **ADASYN** oversampling technique helped to address data imbalance effectively.


   ### Model Enhancement and Hyperparameter Tuning

#### **1. Using Pre-trained Word Embeddings (GloVe)**:
   - The **GloVe word embeddings** (100-dimensional) were used to improve text representation.
   - The following models were evaluated using the GloVe embeddings:

   **Pipeline for Logistic Regression:**
   - **F1 Score**: 0.59138
   
   **Pipeline for Random Forest:**
   - **F1 Score**: 0.71960
   
   **Pipeline for LSTM:**
   - **F1 Score**: 0.66433
   
#### **2. Hyperparameter Tuning for LSTM Classifier**:
   - **Hyperparameters tested**:
     - `hidden_size`: [32, 64, 128]
     - `epochs`: [5, 10, 15]
     - `batch_size`: [16, 32, 64]
   
   - After tuning, the best hyperparameters were:
     - `hidden_size`: 128
     - `epochs`: 15
     - `batch_size`: 16
     - **Validation F1 Score**: 0.74159104
     - **Test F1 Score**: 0.7361710571477604

#### **3. Conclusions**:
   - **GloVe embeddings** improved the performance of the models.
   - **Random Forest** showed the best performance in terms of F1 score among the non-LSTM models.but after fine tuning for lstm , the lstm
   one become better
   - **LSTM Classifier** showed significant improvement after hyperparameter tuning and become the best model among the models
   

#### Done!