#### Import Necessary Libraries:
Import pandas for data manipulation, re for regular expressions, pickle for saving the model, and the scikit-learn modules used for resampling, feature extraction, modeling, and evaluation.

In [63]:
# Import necessary libraries
import pandas as pd
import re
import pickle
from sklearn.utils import resample
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    classification_report, confusion_matrix,
)

#### Load the Training and Testing Data:
Use pandas to load your training and testing data from CSV files. This data will be used for training and testing your hate speech detection model.

In [64]:
# Load the training and testing data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
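
Before going further, it can help to confirm that the expected columns are present and to look at the class balance. The sketch below is only an illustration: it reads a tiny in-memory CSV instead of train.csv, and the `id` column and all row values are made up. Only the `label` and `tweet` column names come from this notebook.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for train.csv.
csv_text = (
    "id,label,tweet\n"
    "1,0,hello world\n"
    "2,1,some offensive text\n"
    "3,0,nice day\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# The rest of the notebook relies on these two columns.
missing = {'label', 'tweet'} - set(df.columns)
if missing:
    raise KeyError(f"Missing expected columns: {sorted(missing)}")

# Share of each class; a strong skew is what motivates upsampling later.
class_share = df['label'].value_counts(normalize=True)
print(class_share)
```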

#### Text Cleaning Function (clean_text):
This function cleans the text data by performing the following operations on input text:

* Convert text to lowercase.
* Remove special characters and punctuation marks.
* Remove numbers.
* Remove URLs.
* Remove hashtags.

The function takes an input text (or a pandas Series) and returns the cleaned text (or a cleaned Series).

In [65]:
def clean_text(text):
    """
    Cleans text data by converting to lowercase, removing special characters, numbers, URLs, and hashtags.
    
    Args:
        text (str or pd.Series): Input text or text Series to be cleaned.
    
    Returns:
        str or pd.Series: Cleaned text or text Series.
    """
    if isinstance(text, str):
        text = text.lower()
        # Pass re.UNICODE through flags=; the fourth positional argument of re.sub is count.
        text = re.sub(r'[^\w\s]', '', text, flags=re.UNICODE)
        text = re.sub(r'[0-9]+', '', text)
        text = re.sub(r'(@[^a-zA-Z]+)|(\\w+:\\/\\/\\s+)|^rt|http.+?', ' ', text)
        text = re.sub(r'#\w*', '', text)
        return text
    elif isinstance(text, pd.Series):
        cleaned_series = text.str.lower()
        cleaned_series = cleaned_series.str.replace(r'[^\w\s]', '', regex=True)
        cleaned_series = cleaned_series.str.replace(r'[0-9]+', '', regex=True)
        cleaned_series = cleaned_series.str.replace(r'(@[^a-zA-Z]+)|(\\w+:\\/\\/\\s+)|^rt|http.+?', ' ', regex=True)
        cleaned_series = cleaned_series.str.replace(r'#\w*', '', regex=True)
        return cleaned_series
    else:
        raise ValueError("Input must be a string or a pandas Series.")
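
The order of the substitutions matters. Once punctuation has been stripped, characters such as `#`, `@`, `:` and `/` are gone, so patterns for hashtags, mentions, and URLs that run afterwards have little left to match. The following is a minimal sketch of one possible ordering, not the function used in this notebook; `clean_tweet` and all four compiled patterns are hypothetical names and choices made for illustration.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#\w+')
RETWEET_RE = re.compile(r'^rt\s+')


def clean_tweet(text: str) -> str:
    """Lowercase a tweet, then strip URLs, mentions, hashtags, punctuation, and digits."""
    text = text.lower()
    # Remove structured tokens first, while their marker characters still exist.
    text = RETWEET_RE.sub('', text)
    text = URL_RE.sub(' ', text)
    text = MENTION_RE.sub(' ', text)
    text = HASHTAG_RE.sub(' ', text)
    # Only now strip the remaining punctuation and digits.
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'[0-9]+', '', text)
    # Collapse the whitespace left behind by the substitutions.
    return re.sub(r'\s+', ' ', text).strip()


print(clean_tweet("RT @user42: Check https://example.com/a1 #Fun times, 100%!"))
# prints "check times"
```

Swapping in a function like this would change the features the model sees, so any scores would need to be recomputed.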



#### Apply Text Cleaning to Training and Testing Data:
Use the clean_text function to clean the tweet text in both the training and testing datasets. This cleaning process prepares the text data for feature extraction and modeling.

In [66]:
# Apply text cleaning to the training and testing data

train_data['tweet'] = train_data['tweet'].apply(clean_text)
test_data['tweet'] = test_data['tweet'].apply(clean_text)


#### Upsample the Minority Class:
Address the class imbalance issue by oversampling the minority class (label 1) in the training data using the resample function from scikit-learn. The minority class is randomly resampled with replacement to match the majority class size.

In [67]:
# Upsample the minority class

train_majority = train_data[train_data.label == 0]
train_minority = train_data[train_data.label == 1]
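# Draw as many minority rows as there are majority rows instead of hardcoding the count
train_minority_upsampled = resample(train_minority, replace=True, n_samples=len(train_majority), random_state=123)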
train_upsampled = pd.concat([train_majority, train_minority_upsampled])
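
The effect of upsampling is easiest to see on a small example. The sketch below applies the same `resample` call to a made-up six-row frame (the values are hypothetical) and checks that both classes end up the same size.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame standing in for train_data; the values are made up.
toy = pd.DataFrame({
    'tweet': ['a', 'b', 'c', 'd', 'e', 'f'],
    'label': [0, 0, 0, 0, 1, 1],
})

majority = toy[toy.label == 0]
minority = toy[toy.label == 1]

# Sample minority rows with replacement until they match the majority count.
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=123,
)
balanced = pd.concat([majority, minority_upsampled])

print(balanced['label'].value_counts())
```

Every upsampled row is a copy of an existing minority row, so the balanced frame contains duplicates by construction.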


#### Split the Data into Training and Testing Sets:
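Split the upsampled data into a training set and a held-out test set using train_test_split (by default, 25% of the rows are held out). Because the split happens after upsampling, copies of the same minority-class tweet can land in both sets, so scores on the held-out set tend to be optimistic.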

In [68]:
# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(train_upsampled['tweet'], train_upsampled['label'], random_state=0)
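
An alternative ordering, shown here only as a sketch on made-up data, is to split first and upsample the training part alone. That keeps duplicated minority rows out of the held-out set. The frame contents, the `stratify` argument, and all variable names below are assumptions for illustration, not part of this notebook's workflow.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data standing in for train_data; 16 majority and 4 minority rows.
toy = pd.DataFrame({
    'tweet': [f'tweet {i}' for i in range(20)],
    'label': [0] * 16 + [1] * 4,
})

# 1. Split first, stratifying so both parts keep the original class ratio.
train_part, test_part = train_test_split(
    toy, test_size=0.25, stratify=toy['label'], random_state=0
)

# 2. Upsample the minority class inside the training part only.
train_major = train_part[train_part.label == 0]
train_minor = train_part[train_part.label == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=123
)
train_balanced = pd.concat([train_major, train_minor_up])

# 3. Check that no tweet appears in both the training and held-out parts.
overlap = set(train_balanced['tweet']) & set(test_part['tweet'])
print(len(overlap))
```

With this ordering the held-out set keeps the original class imbalance, so per-class precision and recall become more informative than overall accuracy.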


#### Machine Learning Pipeline:
Define a machine learning pipeline that includes the following steps:

* list_to_series: A custom transformation step to convert the data into a pandas Series format for compatibility with other pipeline components.
* CountVectorizer: Convert the text data into a matrix of token counts (bag of words).
* TfidfTransformer: Apply TF-IDF (Term Frequency-Inverse Document Frequency) transformation to the token counts.
* RandomForestClassifier: Use a Random Forest Classifier with 100 decision trees for text classification.

In [69]:
# Define a machine learning pipeline

# Define a function to convert a list to a pandas Series
def list_to_series(data):
    return pd.Series(data)

# Define the pipeline
pipeline = Pipeline([
    ('list_to_series', FunctionTransformer(list_to_series, validate=False)),
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=0)),
])
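
As a side note, scikit-learn's `TfidfVectorizer` combines the `CountVectorizer` and `TfidfTransformer` steps into one. The sketch below checks this on a three-document toy corpus (the documents are made up) with default parameters for every component.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

docs = ['the cat sat', 'the dog barked', 'a cat and a dog']  # toy corpus

# Two-step version, as in the pipeline above.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
# One-step version.
one_step = TfidfVectorizer()

X_two = two_step.fit_transform(docs)
X_one = one_step.fit_transform(docs)

# Compare the dense forms of the two sparse matrices.
matches = np.allclose(X_two.toarray(), X_one.toarray())
print(matches)
```

Keeping the two steps separate, as this notebook does, is still useful when the raw token counts are needed on their own.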

#### Train the Model:
Fit the pipeline to the training data (X_train and y_train), then predict labels for the held-out tweets (X_test). Because feature extraction and the classifier live in one pipeline, a single fit call trains both and a single predict call applies both.

In [70]:
# Train the model

model = pipeline.fit(X_train, y_train)
y_predict = model.predict(X_test)

#### Store the Model:
Save the trained model to a file using the pickle library. The saved model can be loaded and used later for predictions.

In [71]:
# Store the model

model_filename = 'finalized_model.sav'
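# Use a context manager so the file is flushed and closed after writing
with open(model_filename, 'wb') as model_file:
    pickle.dump(model, model_file)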

#### Evaluate the Model:
Calculate various performance metrics for the model, including accuracy, precision, recall, F1 score, classification report, and the confusion matrix, using y_test and the model's predictions (y_predict). These metrics help assess the model's performance in detecting hate speech.

In [72]:
# Evaluate the model

accuracy = accuracy_score(y_test, y_predict)
precision = precision_score(y_test, y_predict)
recall = recall_score(y_test, y_predict)
f1 = f1_score(y_test, y_predict)
classification_rep = classification_report(y_test, y_predict)
confusion_mtx = confusion_matrix(y_test, y_predict)

#### Print Evaluation Results:
Display the evaluation results, including accuracy, precision, recall, F1 score, the classification report (with precision, recall, and F1 for each class), and the confusion matrix. In the confusion matrix, rows are true labels and columns are predicted labels: in the output below, 21 tweets with true label 0 were predicted as 1, and no tweet with true label 1 was missed.

In [73]:
# Print evaluation results

print('Accuracy Score: ', accuracy)
print('Precision Score: ', precision)
print('Recall Score: ', recall)
print('F1 Score: ', f1)
print('\nClassification Report:\n', classification_rep)
print('\nConfusion Matrix:\n', confusion_mtx)

Accuracy Score:  0.9985868102288021
Precision Score:  0.9972041006523765
Recall Score:  1.0
F1 Score:  0.9986000933271115

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      7370
           1       1.00      1.00      1.00      7490

    accuracy                           1.00     14860
   macro avg       1.00      1.00      1.00     14860
weighted avg       1.00      1.00      1.00     14860


Confusion Matrix:
 [[7349   21]
 [   0 7490]]
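
The headline scores can be reproduced by hand from the confusion matrix above, which is a quick way to check how each metric is defined. The four counts are copied from the printed matrix; the variable names are just for illustration.

```python
# Counts from the confusion matrix above (rows: true label, columns: predicted label).
tn, fp = 7349, 21   # true label 0: predicted 0, predicted 1
fn, tp = 0, 7490    # true label 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the tweets predicted as 1, the share that truly are 1
recall = tp / (tp + fn)      # of the tweets that truly are 1, the share that was found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The classification report shows 1.00 everywhere only because it rounds to two decimals.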


#### Load the Model:
Use pickle to load the previously stored model (finalized_model.sav) from the file. This allows you to make predictions using the trained model without retraining it.

In [74]:
# Load the model

# Use a context manager so the file is closed after reading
with open(model_filename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
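
A save-and-load round trip can be checked end to end on a tiny stand-in model. In the sketch below the texts, labels, file name, and the small `toy_model` pipeline are all made up for illustration. It writes to a temporary directory so nothing is left on disk.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny stand-in model; texts and labels are made up.
texts = ['good day', 'nice day', 'bad hate', 'awful hate']
labels = [0, 0, 1, 1]
toy_model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=0),
).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.sav')
    # Save the fitted pipeline, then read it back from the same file.
    with open(path, 'wb') as f:
        pickle.dump(toy_model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions.
same = bool((restored.predict(texts) == toy_model.predict(texts)).all())
print(same)
```

Unpickling can execute arbitrary code, so only load model files from a source you trust, and ideally with the same scikit-learn version that produced them.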

#### Use the Model:
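Finally, use the loaded model to predict the label of a single cleaned tweet from the test data (the row at index 4). The predict method expects an iterable of texts, so the tweet is wrapped in a list, and the result is an array with one label per input.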

In [76]:
# Use the model

result = loaded_model.predict([test_data['tweet'][4]])
print(result)

[0]
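
The same call works for several tweets at once, and a pipeline that ends in a random forest also exposes `predict_proba` for class probabilities. The sketch below uses a tiny made-up model as a stand-in for the loaded one; all texts, labels, and names are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny stand-in for the loaded model; texts and labels are made up.
texts = ['have a nice day', 'what a lovely morning', 'i hate you', 'you are awful']
labels = [0, 0, 1, 1]
toy_model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=0),
).fit(texts, labels)

new_tweets = ['lovely day', 'awful hate']
predicted = toy_model.predict(new_tweets)
# One row per tweet, one column per class in toy_model.classes_.
probabilities = toy_model.predict_proba(new_tweets)

for tweet, label, probs in zip(new_tweets, predicted, probabilities):
    print(tweet, label, probs)
```

New text should go through the same cleaning step as the training data before it is passed to the model.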

