

<h1 style='; border:0; border-radius: 10px; text-shadow: 1px 1px black; font-weight: bold; color:#4D1873'><center> Twitter-Sentiment-Analysis:
Detecting Hate Speech in Tweets Using ML
</center></h1>

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#4D1873; font-family:arial;color:#FFFFFF;font-size:170%;text-align:center;border-radius:55px 1px;">INTRODUCTION</p>

<div style="border-radius:5px;
            border : black solid;
            background-color: #E3E3E3;
            text-align: left">


### <mark>Sentiment analysis</mark>
<p style="text-align: left">Sentiment analysis is a specialized technique in natural language processing (NLP) that focuses on identifying and interpreting the sentiment expressed in text. Organizations use sentiment analysis systems to derive insights from unstructured data sources. These systems automate the analysis, replacing manual evaluation with rule-based, automatic (machine-learning), or hybrid approaches.</p>

### <mark>About</mark>
<p>This project develops a machine learning model to detect hate speech in tweets. A tweet counts as hate speech if it contains racist or sexist sentiment. The goal is to classify tweets into two categories:</p>
<ul>
<li> Contains hate speech (racist/sexist)</li>
<li> Does not contain hate speech.</li>
</ul>

<p>By combining data preprocessing, feature engineering (e.g., count vectors, TF-IDF), and classification models (e.g., Logistic Regression, SVM, Random Forest, and gradient boosting with XGBoost or LightGBM), this project aims to build a system for automated content moderation that supports safer online environments and business needs such as legal compliance and brand protection.</p>

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#4D1873 ;font-family:arial;color:#FFFFFF;font-size:170%;text-align:center;border-radius:55px 1px;">IMPORT NECESSARY LIBRARIES</p>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import model_selection, preprocessing, linear_model, metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import ensemble
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from xgboost import XGBClassifier


import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from textblob import Word
nltk.download('wordnet')

from termcolor import colored
from warnings import filterwarnings
filterwarnings('ignore')

from sklearn import set_config
set_config(print_changed_only = False)

print(colored("\nLIBRARIES WERE SUCCESSFULLY IMPORTED...", color = "blue", attrs = ["dark", "bold"]))

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.



LIBRARIES WERE SUCCESSFULLY IMPORTED...


[nltk_data] Downloading package wordnet to /root/nltk_data...


<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#4D1873 ;font-family:arial;color:#FFFFFF;font-size:170%;text-align:center;border-radius:55px 1px;">LOAD DATASETS</p>

In [2]:
# Adjust both paths to wherever train.csv and test.csv are stored in your environment
train_set = pd.read_csv("/train.csv",
                   encoding = "utf-8",
                   engine = "python",
                   header = 0)

test_set = pd.read_csv("/test.csv",
                   encoding = "utf-8",
                   engine = "python",
                   header = 0)

print(colored("\nDATASETS WERE SUCCESSFULLY LOADED...", color = "blue", attrs = ["dark", "bold"]))

FileNotFoundError: [Errno 2] No such file or directory: '/train.csv'

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">The first five rows of train set</span>

In [None]:
train_set.head(n = 5).style.background_gradient(cmap = "summer")

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">The first five rows of test set</span>

In [None]:
test_set.head(n = 5).style.background_gradient(cmap = "summer")

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Shapes of the train and test sets</span>

In [None]:
print("Train set shape: {} and test set shape: {}".format(train_set.shape, test_set.shape))

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Get general information about train set</span>

In [None]:
train_set.info()

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Check whether there are duplicated values</span>

In [None]:
print("Totally there are {} duplicated values in train_set".format(train_set.duplicated().sum()))

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Get the number of classes of the "label" variable of train set</span>

In [None]:
train_set.groupby("label").count().style.background_gradient(cmap = "summer")
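
Hate-speech datasets are often heavily imbalanced, so it can help to look at class proportions as well as raw counts. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not the actual dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for train_set; 1 = hate speech, 0 = not
toy = pd.DataFrame({
    "tweet": ["a", "b", "c", "d", "e"],
    "label": [0, 0, 0, 0, 1],
})

# Share of each class; a strong skew means accuracy alone can be misleading
class_share = toy["label"].value_counts(normalize=True)
print(class_share)
```

When one class dominates, precision, recall, and F1 for the minority class are usually more informative than overall accuracy.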

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#4D1873 ;font-family:arial;color:#FFFFFF;font-size:170%;text-align:center;border-radius:55px 1px;">CLEAN AND PROCESS DATASET</p>

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Convert uppercase letters to lowercase letters in "tweet" columns</span>

In [None]:
# Lowercase every token (splitting and re-joining also collapses repeated whitespace)
train_set["tweet"] = train_set["tweet"].apply(lambda text: " ".join(word.lower() for word in text.split()))
test_set["tweet"] = test_set["tweet"].apply(lambda text: " ".join(word.lower() for word in text.split()))

print(colored("\nCONVERTED SUCCESSFULLY...", color = "blue", attrs = ["dark", "bold"]))

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Delete punctuation marks from "tweet" columns</span>

In [None]:
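# regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default
train_set["tweet"] = train_set["tweet"].str.replace(r'[^\w\s]', '', regex=True)
test_set["tweet"] = test_set["tweet"].str.replace(r'[^\w\s]', '', regex=True)

print(colored("\nDELETED SUCCESSFULLY...", color = "blue", attrs = ["dark", "bold"]))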

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Delete numbers from "tweet" columns</span>

In [None]:
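# regex=True is required so that \d is interpreted as "any digit" rather than a literal string
train_set['tweet'] = train_set['tweet'].str.replace(r'\d', '', regex=True)
test_set['tweet'] = test_set['tweet'].str.replace(r'\d', '', regex=True)

print(colored("\nNUMBERS DELETED SUCCESSFULLY...", color = "blue", attrs = ["dark", "bold"]))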

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Delete stopwords from "tweet" columns</span>

In [None]:
# A set gives constant-time membership checks, which matters when filtering every token
sw = set(stopwords.words("english"))
train_set['tweet'] = train_set['tweet'].apply(lambda text: " ".join(word for word in text.split() if word not in sw))
test_set['tweet'] = test_set['tweet'].apply(lambda text: " ".join(word for word in text.split() if word not in sw))

print(colored("\nSTOPWORDS DELETED SUCCESSFULLY...", color = "blue", attrs = ["dark", "bold"]))

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Lemmatization: reduce the words in the "tweet" columns to their base (dictionary) forms</span>

In [None]:
train_set['tweet'] = train_set['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
test_set['tweet'] = test_set['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

print(colored("\nDONE SUCCESSFULLY...", color = "blue", attrs = ["dark", "bold"]))
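
The lowercasing, punctuation, digit, and stopword steps above can also be collected into a single function, which makes them easier to test and reuse. This is a sketch under stated assumptions: `clean_tweet` and the tiny `STOPWORDS` set are hypothetical names introduced here so the example runs without downloading NLTK data, and lemmatization is left out.

```python
import re

# Hypothetical mini stopword list, used only to keep the sketch self-contained;
# the notebook itself relies on NLTK's English stopword list
STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "of", "in"}

def clean_tweet(text: str) -> str:
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    text = re.sub(r"\d", "", text)        # remove digits
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_tweet("The 2 CATS are running, to the park!!!")
print(cleaned)  # cats running park
```

A function like this could be applied to both frames with `train_set["tweet"].apply(clean_tweet)`, so the train and test sets are guaranteed to go through identical steps.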

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Drop "id" column from datasets</span>

In [None]:
train_set = train_set.drop("id", axis = 1)
test_set = test_set.drop("id", axis = 1)

print(colored("\n'ID' COLUMNS DROPPED SUCCESSFULLY...", color = "blue", attrs = ["dark", "bold"]))

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Look at the train set after cleaning</span>

In [None]:
train_set.head(n = 10)

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Look at the test set after cleaning</span>

In [None]:
test_set.head(n = 10)

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Divide datasets</span>

In [None]:
x = train_set["tweet"]
y = train_set["label"]

train_x, test_x, train_y, test_y = model_selection.train_test_split(x, y, test_size = 0.20, shuffle = True, random_state = 11)

print(colored("\nDIVIDED SUCCESSFULLY...", color = "blue", attrs = ["dark", "bold"]))
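
When the label is imbalanced, a plain random split can leave the held-out part with very few positive examples. One common option is a stratified split, sketched below on made-up data; the `texts` and `labels` lists are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 20% positive class
texts = [f"tweet {i}" for i in range(10)]
labels = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

# stratify=labels keeps the class ratio the same in both parts of the split
tr_x, te_x, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=11
)
print(sum(tr_y), sum(te_y))
```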

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#4D1873 ;font-family:arial;color:#FFFFFF;font-size:170%;text-align:center;border-radius:55px 1px;">VECTORIZE DATA</p>

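Word vectorization is an NLP technique that maps words or documents to vectors of real numbers so that machine learning models can work with them. The two methods used below, count vectors and TF-IDF, represent each tweet by how often each vocabulary term occurs in it; word embeddings are a related technique that instead encodes word similarity and semantics in dense vectors.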

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">"Count Vectors" method</span>

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit(train_x)

x_train_count = vectorizer.transform(train_x)
x_test_count = vectorizer.transform(test_x)

x_train_count.toarray()

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">"TF-IDF" method</span>

In [None]:
tf_idf_word_vectorizer = TfidfVectorizer()
tf_idf_word_vectorizer.fit(train_x)

x_train_tf_idf_word = tf_idf_word_vectorizer.transform(train_x)
x_test_tf_idf_word = tf_idf_word_vectorizer.transform(test_x)

x_train_tf_idf_word.toarray()

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#4D1873 ;font-family:arial;color:#FFFFFF;font-size:170%;text-align:center;border-radius:55px 1px;">BUILD MACHINE LEARNING MODELS</p>

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Logistic regression model with "count-vectors" method</span>

In [None]:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from termcolor import colored

# Fit the Logistic Regression model
log = LogisticRegression()
log_model = log.fit(x_train_count, train_y)

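# Cross-validation accuracy score, estimated on the training split only
# (cross_val_score refits a clone of the estimator on every fold, so the
# held-out split stays untouched for the final evaluation below)
accuracy = cross_val_score(log_model,
                           x_train_count,
                           train_y,
                           cv=20).mean()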

# Predict on the test data
y_pred = log_model.predict(x_test_count)

# Calculate precision, recall, and F1 score using sklearn
precision = metrics.precision_score(test_y, y_pred)
recall = metrics.recall_score(test_y, y_pred)
f1 = metrics.f1_score(test_y, y_pred)

# Print results
print(colored("\nLogistic regression model with 'count-vectors' method", color="red", attrs=["dark", "bold"]))
print(colored("Accuracy ratio: ", color="red", attrs=["dark", "bold"]), accuracy)
print(colored("Precision: ", color="red", attrs=["dark", "bold"]), precision)
print(colored("Recall: ", color="red", attrs=["dark", "bold"]), recall)
print(colored("F1 Score: ", color="red", attrs=["dark", "bold"]), f1)

# Optionally, print a full classification report (including precision, recall, F1 score)
print("\nClassification Report:")
print(metrics.classification_report(test_y, y_pred))
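
One way to combine vectorization and cross-validation is to wrap both steps in a scikit-learn pipeline, so the vocabulary is learned only from the training part of each fold. This is a sketch on a made-up corpus, not the notebook's actual setup; `texts`, `labels`, and `pipe` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = offensive, 0 = neutral
texts = ["you are awful", "awful hateful words", "hateful and awful",
         "lovely sunny day", "have a lovely day", "sunny and lovely"] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

# The vectorizer is refit on the training part of every fold,
# so no vocabulary information leaks from the validation part
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, texts, labels, cv=5, scoring="f1")
print(scores.mean())
```

Scoring with `"f1"` instead of the default accuracy is a deliberate choice here, since F1 reflects performance on the positive class.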


### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Logistic regression model with "tf-idf" method</span>

In [None]:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from termcolor import colored

# Fit the Logistic Regression model with TF-IDF features
log = LogisticRegression()
log_model = log.fit(x_train_tf_idf_word, train_y)

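# Cross-validation accuracy score, estimated on the training split only
# (the held-out split is reserved for the final evaluation below)
accuracy = cross_val_score(log_model,
                           x_train_tf_idf_word,
                           train_y,
                           cv=20).mean()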

# Predict on the test data
y_pred = log_model.predict(x_test_tf_idf_word)

# Calculate precision, recall, and F1 score using sklearn
precision = metrics.precision_score(test_y, y_pred)
recall = metrics.recall_score(test_y, y_pred)
f1 = metrics.f1_score(test_y, y_pred)

# Print results
print(colored("\nLogistic regression model with 'tf-idf' method", color="red", attrs=["dark", "bold"]))
print(colored("Accuracy ratio: ", color="red", attrs=["dark", "bold"]), accuracy)
print(colored("Precision: ", color="red", attrs=["dark", "bold"]), precision)
print(colored("Recall: ", color="red", attrs=["dark", "bold"]), recall)
print(colored("F1 Score: ", color="red", attrs=["dark", "bold"]), f1)

# Optionally, print a full classification report (including precision, recall, F1 score)
print("\nClassification Report:")
print(metrics.classification_report(test_y, y_pred))


### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">XGBoost model with "count-vectors" method</span>

In [None]:
from sklearn import metrics
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from termcolor import colored

# Fit the XGBoost model with 'count-vectors' method
xgb = XGBClassifier()
xgb_model = xgb.fit(x_train_count, train_y)

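# Cross-validation accuracy score, estimated on the training split only
# (the held-out split is reserved for the final evaluation below)
accuracy = cross_val_score(xgb_model,
                           x_train_count,
                           train_y,
                           cv=20).mean()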

# Predict on the test data
y_pred = xgb_model.predict(x_test_count)

# Calculate precision, recall, and F1 score using sklearn
precision = metrics.precision_score(test_y, y_pred)
recall = metrics.recall_score(test_y, y_pred)
f1 = metrics.f1_score(test_y, y_pred)

# Print results
print(colored("\nXGBoost model with 'count-vectors' method", color="red", attrs=["dark", "bold"]))
print(colored("Accuracy ratio: ", color="red", attrs=["dark", "bold"]), accuracy)
print(colored("Precision: ", color="red", attrs=["dark", "bold"]), precision)
print(colored("Recall: ", color="red", attrs=["dark", "bold"]), recall)
print(colored("F1 Score: ", color="red", attrs=["dark", "bold"]), f1)

# Optionally, print a full classification report (including precision, recall, F1 score)
print("\nClassification Report:")
print(metrics.classification_report(test_y, y_pred))


### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">XGBoost model with "tf-idf" method</span>

In [None]:
from sklearn import metrics
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from termcolor import colored

# Fit the XGBoost model with TF-IDF features
xgb = XGBClassifier()
xgb_model = xgb.fit(x_train_tf_idf_word, train_y)

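# Cross-validation accuracy score, estimated on the training split only
# (the held-out split is reserved for the final evaluation below)
accuracy = cross_val_score(xgb_model,
                           x_train_tf_idf_word,
                           train_y,
                           cv=20).mean()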

# Predict on the test data
y_pred = xgb_model.predict(x_test_tf_idf_word)

# Calculate precision, recall, and F1 score using sklearn
precision = metrics.precision_score(test_y, y_pred)
recall = metrics.recall_score(test_y, y_pred)
f1 = metrics.f1_score(test_y, y_pred)

# Print results
print(colored("\nXGBoost model with 'tf-idf' method", color="red", attrs=["dark", "bold"]))
print(colored("Accuracy ratio: ", color="red", attrs=["dark", "bold"]), accuracy)
print(colored("Precision: ", color="red", attrs=["dark", "bold"]), precision)
print(colored("Recall: ", color="red", attrs=["dark", "bold"]), recall)
print(colored("F1 Score: ", color="red", attrs=["dark", "bold"]), f1)

# Optionally, print a full classification report (including precision, recall, F1 score)
print("\nClassification Report:")
print(metrics.classification_report(test_y, y_pred))


### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">Light GBM model with "count-vectors" method</span>

In [None]:
from sklearn import metrics
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
from termcolor import colored

# Fit the LightGBM model with 'count-vectors' method
lgbm = LGBMClassifier()
lgbm_model = lgbm.fit(x_train_count.astype("float64"), train_y)

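# Cross-validation accuracy score, estimated on the training split only
# (the held-out split is reserved for the final evaluation below)
accuracy = cross_val_score(lgbm_model,
                           x_train_count.astype("float64"),
                           train_y,
                           cv=20).mean()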

# Predict on the test data
y_pred = lgbm_model.predict(x_test_count.astype("float64"))

# Calculate precision, recall, and F1 score using sklearn
precision = metrics.precision_score(test_y, y_pred)
recall = metrics.recall_score(test_y, y_pred)
f1 = metrics.f1_score(test_y, y_pred)

# Print results
print(colored("\nLight GBM model with 'count-vectors' method", color="red", attrs=["dark", "bold"]))
print(colored("Accuracy ratio: ", color="red", attrs=["dark", "bold"]), accuracy)
print(colored("Precision: ", color="red", attrs=["dark", "bold"]), precision)
print(colored("Recall: ", color="red", attrs=["dark", "bold"]), recall)
print(colored("F1 Score: ", color="red", attrs=["dark", "bold"]), f1)

# Optionally, print a full classification report (including precision, recall, F1 score)
print("\nClassification Report:")
print(metrics.classification_report(test_y, y_pred))


### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;"> Light GBM model with "tf-idf" method</span>

In [None]:
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier
from termcolor import colored

# Fit the LightGBM model with TF-IDF features
# (stored under its own name so the count-vectors lgbm_model used later is not overwritten)
lgbm_tfidf = LGBMClassifier()
lgbm_tfidf_model = lgbm_tfidf.fit(x_train_tf_idf_word, train_y)

# Cross-validation accuracy score, estimated on the training split only
accuracy = cross_val_score(lgbm_tfidf_model,
                           x_train_tf_idf_word,
                           train_y,
                           cv=20).mean()

# Predict on the test data
y_pred = lgbm_tfidf_model.predict(x_test_tf_idf_word)

# Calculate precision, recall, F1 score using sklearn
precision = metrics.precision_score(test_y, y_pred)
recall = metrics.recall_score(test_y, y_pred)
f1 = metrics.f1_score(test_y, y_pred)

# Print results
print(colored("\nLight GBM model with 'tf-idf' method", color="red", attrs=["dark", "bold"]))
print(colored("Accuracy ratio: ", color="red", attrs=["dark", "bold"]), accuracy)
print(colored("Precision: ", color="red", attrs=["dark", "bold"]), precision)
print(colored("Recall: ", color="red", attrs=["dark", "bold"]), recall)
print(colored("F1 Score: ", color="red", attrs=["dark", "bold"]), f1)

# Optionally, print a full classification report (including precision, recall, F1 score)
print("\nClassification Report:")
print(metrics.classification_report(test_y, y_pred))


### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;"> ROC curve and AUC</span>

In [None]:
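# Score the held-out split with predicted probabilities for the positive class;
# evaluating on the training data would give an over-optimistic curve
X_eval = x_test_count.astype("float64")
y_score = lgbm_model.predict_proba(X_eval)[:, 1]

# AUC is computed from the scores, not from hard 0/1 predictions
lgbm_roc_auc = roc_auc_score(test_y, y_score)

fpr, tpr, thresholds = roc_curve(test_y, y_score)
plt.figure()
plt.plot(fpr, tpr, label='AUC (area = %0.2f)' % lgbm_roc_auc)
plt.legend(loc='lower right')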
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
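
ROC AUC measures how well a model ranks positives above negatives, so it should be computed from scores or probabilities. Passing thresholded 0/1 predictions discards the ranking and generally gives a different value. A small sketch with made-up numbers (`y_true` and `y_prob` are assumptions):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.6, 0.55, 0.8, 0.9, 0.2]          # hypothetical predicted probabilities
y_hard = [1 if p >= 0.5 else 0 for p in y_prob]   # labels after a 0.5 threshold

auc_from_probs = roc_auc_score(y_true, y_prob)    # 8 of 9 positive/negative pairs ranked correctly
auc_from_labels = roc_auc_score(y_true, y_hard)   # ties between equal labels count as half
print(auc_from_probs, auc_from_labels)
```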

# **Random Forest Model**

In [None]:
!pip install scikit-learn


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
# Assumes a CSV file where 'tweet' holds the tweet text and 'label' is the target (1 for hate speech, 0 for non-hate speech)
df = pd.read_csv('/train.csv')
ds = pd.read_csv('/test.csv')

# Split the data into features (X) and labels (y)
X = df['tweet']
y = df['label']

# Split the dataset into training and testing sets (65% training, 35% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

# Text preprocessing: Convert text to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)  # You can adjust max_features as needed
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train the Random Forest model
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)  # You can adjust n_estimators
rf_classifier.fit(X_train_tfidf, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test_tfidf)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#4D1873 ;font-family:arial;color:#FFFFFF;font-size:170%;text-align:center;border-radius:55px 1px;">ESTIMATION OVER TEST SET</p>

### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;"> Look at the first 5 rows of the test set</span>

In [None]:
test_set.head()

## <mark>Here we encode values of "tweet" column of test set with "count-vectors" method.</mark>

In [None]:
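vectorizer = CountVectorizer()
vectorizer.fit(train_x)
# Keep the original DataFrame intact and store the encoded matrix under a new name
test_set_count = vectorizer.transform(test_set["tweet"])
test_set_count.toarray()

In [None]:
lgbm_model.predict(test_set_count.astype("float64"))[0:5]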

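### <span style = "background:#4D1873; font-size:100%; color:#fff; border-radius:0px;">SVM model with "count-vectors" method</span>

In [None]: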
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from termcolor import colored

# Fit the SVM model
svm = SVC()
svm_model = svm.fit(x_train_count.astype("float64"), train_y)

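# Cross-validation accuracy score, estimated on the training split only
# (the held-out split is reserved for the final evaluation below)
accuracy = cross_val_score(svm_model,
                           x_train_count.astype("float64"),
                           train_y,
                           cv=20).mean()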

# Predict on the test data
y_pred = svm_model.predict(x_test_count.astype("float64"))

# Calculate precision, recall, and F1 score using sklearn
precision = metrics.precision_score(test_y, y_pred)
recall = metrics.recall_score(test_y, y_pred)
f1 = metrics.f1_score(test_y, y_pred)

# Print results
print(colored("\nSVM model with 'count-vectors' method", color="red", attrs=["dark", "bold"]))
print(colored("Accuracy ratio: ", color="red", attrs=["dark", "bold"]), accuracy)
print(colored("Precision: ", color="red", attrs=["dark", "bold"]), precision)
print(colored("Recall: ", color="red", attrs=["dark", "bold"]), recall)
print(colored("F1 Score: ", color="red", attrs=["dark", "bold"]), f1)

# Optionally, print a full classification report (including precision, recall, F1 score)
print("\nClassification Report:")
print(metrics.classification_report(test_y, y_pred))


# Conclusion 📝

- We used the **Twitter Sentiment Analysis** dataset and explored its shape, column types, duplicates, and label distribution.
- We prepared the tweet text by lowercasing it, removing punctuation, digits, and stopwords, and lemmatizing the remaining words.
- We trained **Logistic Regression**, **XGBoost**, **LightGBM**, **Random Forest**, and **SVM** classifiers on count-vector and TF-IDF features.
- We evaluated the models with accuracy, precision, recall, F1 score, and a ROC curve.
- If you're interested in working on any text-based project, you can apply the same methodology. However, you may need to adjust a few settings, such as column names and preprocessing steps, depending on your dataset.
- We specifically worked on a **binary classification** problem, where the task is to classify each tweet as hate speech (racist/sexist) or not.


<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#4D1873 ;font-family:arial;color:#FFFFFF;font-size:170%;text-align:center;border-radius:55px 1px;">VISUALIZATION WITH WORD CLOUD</p>