## Twitter Sentiment Analysis
### Detecting hateful tweets (dataset provided by Analytics Vidhya)

#### The objective of this task is to detect hate speech in tweets. For simplicity, a tweet is treated as hate speech if it carries racist or sexist sentiment, so the task is to separate racist or sexist tweets from all other tweets. Formally, given a training sample of tweets and labels, where label '1' marks a racist/sexist tweet and label '0' marks any other tweet, the objective is to predict the labels on the test dataset.

### Loading Libraries and Data

In [1]:
import pandas as pd

In [2]:
# Load the training dataset
train_df = pd.read_csv(r"C:\Users\karth\OneDrive\Documents\Artificial Intelligence Course Materials\Self Exploration Projects\Twitter Sentiment Analysis\train.csv")

### Text Normalization

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [4]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\karth\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\karth\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\karth\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
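# Text preprocessing function
def preprocess_text(text):
    text = text.lower()
    # Remove URLs and @mentions first, while '://', '.' and '@' are still present
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'@[^\s]+', '', text)
    # Then replace any remaining non-word character with a space
    text = re.sub(r'\W', ' ', text)
    tokens = word_tokenize(text)
    # Drop English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Reduce each token to its WordNet lemma (treated as a noun by default)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    text = ' '.join(tokens)
    return text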

In [6]:
# Apply text preprocessing to the training dataset
train_df['tweet_clean'] = train_df['tweet'].apply(preprocess_text)
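
In a cleaning function like `preprocess_text`, the order of the `re.sub` calls matters: once every non-word character has been replaced by a space, the `://` and `@` that the URL and mention patterns look for are gone. The following is a minimal sketch using only the standard `re` module. The helper names `strip_symbols_first` and `strip_symbols_last`, the sample tweet, and the `example.com` URL are made up for illustration and do not come from the dataset.

```python
import re

URL_PATTERN = r'https?://\S+|www\.\S+'
MENTION_PATTERN = r'@[^\s]+'

def strip_symbols_first(text):
    # Non-word characters are replaced before the URL/mention patterns run,
    # so those patterns no longer find '://' or '@' to anchor on.
    text = text.lower()
    text = re.sub(r'\W', ' ', text)
    text = re.sub(URL_PATTERN, '', text)
    text = re.sub(MENTION_PATTERN, '', text)
    return ' '.join(text.split())

def strip_symbols_last(text):
    # URLs and mentions are removed while their marker characters still exist.
    text = text.lower()
    text = re.sub(URL_PATTERN, '', text)
    text = re.sub(MENTION_PATTERN, '', text)
    text = re.sub(r'\W', ' ', text)
    return ' '.join(text.split())

# Hypothetical tweet, not taken from the dataset
sample = "@user Great read! https://example.com/post #mondaymotivation"
print(strip_symbols_first(sample))
print(strip_symbols_last(sample))
```

With the symbols stripped first, fragments of the handle and the URL survive as ordinary tokens and can end up in the TF-IDF vocabulary.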

### Feature Extraction (using TF-IDF)

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

In [12]:
# Fit and transform the cleaned tweets from training data
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['tweet_clean']).toarray()
y_train_tfidf = train_df['label']
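
To make the TF-IDF weights less of a black box, here is a small hand-rolled sketch of the weighting that scikit-learn documents for its defaults (`smooth_idf=True`, `norm='l2'`): raw term counts multiplied by `ln((1 + n) / (1 + df)) + 1`, then each row scaled to unit length. The function `tfidf_rows`, the whitespace tokenization, and the three toy documents are assumptions for illustration; `TfidfVectorizer` uses its own tokenizer and, with `max_features=1000`, keeps only the most frequent terms.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized weights."""
    tokenized = [doc.split() for doc in docs]          # simplification: whitespace tokens
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Toy corpus, not taken from the dataset
docs = ["love sunny day", "cold rainy day", "love love weekend"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:<8} {weight:.3f}")
```

In the first document, "day" appears in two of the three documents and therefore gets a smaller weight than "sunny", which appears in only one.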

### Split the training data into training and validation sets

In [43]:
from sklearn.model_selection import train_test_split

In [44]:
# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_tfidf, y_train_tfidf, test_size=0.2, random_state=42)
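
If the positive class is rare, as the gap between accuracy and recall later in this notebook suggests, a purely random split can shift the class ratio between the two sets. `train_test_split` accepts a `stratify=` argument for this. The idea behind it can be sketched in plain Python as below. The function `stratified_split` and the 7% toy label distribution are assumptions for illustration, not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)   # per-class validation quota
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels: 70 positives out of 1000 rows (hypothetical ratio)
toy_labels = [1] * 70 + [0] * 930
train_idx, val_idx = stratified_split(toy_labels)
val_positive_rate = sum(toy_labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), val_positive_rate)
```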

### Model Training - Logistic Regression

In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
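# Initialize and train the model (Logistic Regression) on TF-IDF features
# Note: X_train_tfidf is the full matrix, so it still contains the rows held out as X_val;
# fitting on X_train, y_train instead would keep the validation set unseen
lr_model_tfidf = LogisticRegression()
lr_model_tfidf.fit(X_train_tfidf, y_train_tfidf)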

In [45]:
# Make predictions on the validation set
y_pred_lr_val = lr_model_tfidf.predict(X_val)
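
A validation score only estimates generalization when the model has not seen the validation rows during fitting. Here `fit` receives `X_train_tfidf`, the full matrix from which `X_val` was drawn. The toy sketch below shows how much that can inflate a score in the extreme case of a model that simply memorizes its training rows. The `Memorizer` class, the integer row ids standing in for feature vectors, and the random labels are all hypothetical.

```python
import random
from collections import Counter

class Memorizer:
    """Toy classifier: recalls labels of rows it has seen, else predicts the majority class."""
    def fit(self, X, y):
        self.lookup = dict(zip(X, y))
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.lookup.get(x, self.default) for x in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

rng = random.Random(0)
X_all = list(range(1000))                                   # unique ids stand in for feature rows
y_all = [1 if rng.random() < 0.1 else 0 for _ in X_all]     # labels carry no learnable signal

X_fit, y_fit = X_all[:800], y_all[:800]
X_held, y_held = X_all[800:], y_all[800:]

# Fit on everything (held-out rows included) vs. fit on the training part only
leaky = accuracy(y_held, Memorizer().fit(X_all, y_all).predict(X_held))
honest = accuracy(y_held, Memorizer().fit(X_fit, y_fit).predict(X_held))
print("fit on all rows:", leaky)
print("fit on training rows only:", honest)
```

Tree ensembles such as Random Forest can behave much like this memorizer on rows they were trained on, so in a workflow like this one the usual remedy is to call `fit(X_train, y_train)` and reserve `X_val` for scoring.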

### Model Training - Support Vector Machine (SVM)

In [20]:
from sklearn.svm import SVC

In [21]:
# Initialize and train the model (SVM) on TF-IDF features
svm_model_tfidf = SVC(kernel='linear')
svm_model_tfidf.fit(X_train_tfidf, y_train_tfidf)

In [48]:
# Make predictions on the validation set
y_pred_svm_val = svm_model_tfidf.predict(X_val)

### Model Training - Random Forest

In [24]:
from sklearn.ensemble import RandomForestClassifier

In [25]:
# Initialize and train the model (Random Forest) on TF-IDF features
rf_model_tfidf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model_tfidf.fit(X_train_tfidf, y_train_tfidf)

In [49]:
# Make predictions on the validation set
y_pred_rf_val = rf_model_tfidf.predict(X_val)

### Model Training - XGBoost

In [28]:
import xgboost as xgb

In [29]:
# Initialize and train the model (XGBoost) on TF-IDF features
xgb_model_tfidf = xgb.XGBClassifier()
xgb_model_tfidf.fit(X_train_tfidf, y_train_tfidf)

In [51]:
# Make predictions on the validation set
y_pred_xgb_val = xgb_model_tfidf.predict(X_val)

### Fine Tuning XGBoost

In [32]:
from sklearn.model_selection import GridSearchCV

In [33]:
# Define hyperparameters grid for XGBoost
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.1, 0.01, 0.05],
    'max_depth': [3, 4, 5]
}

In [34]:
# Initialize XGBoost classifier
xgb_classifier = xgb.XGBClassifier()

In [35]:
# Initialize GridSearchCV
grid_search_xgb = GridSearchCV(estimator=xgb_classifier, param_grid=param_grid_xgb, cv=5, scoring='accuracy')

In [36]:
# Perform grid search
grid_search_xgb.fit(X_train_tfidf, y_train_tfidf)
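
The cost of this search is easy to underestimate. A quick count with the standard library, using the same grid and `cv=5` as above, shows how many model fits it implies. The extra fit at the end reflects `GridSearchCV`'s default `refit=True`. The variable names `combinations` and `n_fits` are illustrative.

```python
from itertools import product

param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.1, 0.01, 0.05],
    'max_depth': [3, 4, 5],
}
cv_folds = 5

# Every combination of one value per hyperparameter
names = list(param_grid_xgb)
combinations = [dict(zip(names, values)) for values in product(*param_grid_xgb.values())]

# One fit per combination per fold, plus the final refit of the best combination
n_fits = len(combinations) * cv_folds + 1
print(len(combinations), "combinations,", n_fits, "fits")
```

Since the search ranks candidates by `scoring='accuracy'`, it may favor models that miss many positive tweets. Passing `scoring='f1'` is one alternative worth trying if recall on label 1 matters, though that is a suggestion rather than something this notebook tests.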

In [37]:
# Best parameters
print("Best Parameters for XGBoost:", grid_search_xgb.best_params_)

Best Parameters for XGBoost: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}


In [38]:
# Initialize and train the tuned XGBoost model
xgb_model_tfidf_tuned = xgb.XGBClassifier(**grid_search_xgb.best_params_)
xgb_model_tfidf_tuned.fit(X_train_tfidf, y_train_tfidf)

In [52]:
# Make predictions on the validation set
y_pred_xgb_tuned_val = xgb_model_tfidf_tuned.predict(X_val)

### Evaluation Metrics

In [41]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [46]:
# Calculate evaluation metrics for Logistic Regression on the validation set
accuracy_lr_val = accuracy_score(y_val, y_pred_lr_val)
precision_lr_val = precision_score(y_val, y_pred_lr_val)
recall_lr_val = recall_score(y_val, y_pred_lr_val)
f1_lr_val = f1_score(y_val, y_pred_lr_val)

In [47]:
# Print evaluation metrics for Logistic Regression on the validation set
print("Evaluation Metrics for Logistic Regression (Validation Set):")
print("Accuracy:", accuracy_lr_val)
print("Precision:", precision_lr_val)
print("Recall:", recall_lr_val)
print("F1 Score:", f1_lr_val)

Evaluation Metrics for Logistic Regression (Validation Set):
Accuracy: 0.9493195682778038
Precision: 0.851063829787234
Recall: 0.3508771929824561
F1 Score: 0.49689440993788814
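
The four numbers above are consistent with a single confusion matrix. The counts in the sketch below are not printed by the notebook: they are the smallest whole-number counts that reproduce all four reported values, so treat them as an inference. They show why accuracy alone is misleading here: labelling every tweet as 0 would already score about 0.93.

```python
# Inferred confusion-matrix counts (assumption: smallest integers matching the printed metrics)
tp, fp, fn, tn = 160, 28, 296, 5909

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)            # of the tweets flagged 1, how many really are 1
recall = tp / (tp + fn)               # of the tweets that are 1, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

# Baseline that predicts 0 for every tweet: all negatives right, all positives wrong
baseline_accuracy = (tn + fp) / total

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("All-zeros baseline accuracy:", baseline_accuracy)
```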


In [54]:
# Calculate evaluation metrics for SVM
accuracy_svm_val = accuracy_score(y_val, y_pred_svm_val)
precision_svm_val = precision_score(y_val, y_pred_svm_val)
recall_svm_val = recall_score(y_val, y_pred_svm_val)
f1_svm_val = f1_score(y_val, y_pred_svm_val)

In [55]:
# Print evaluation metrics
print("Evaluation Metrics for SVM:")
print("Accuracy:", accuracy_svm_val)
print("Precision:", precision_svm_val)
print("Recall:", recall_svm_val)
print("F1 Score:", f1_svm_val)
print()

Evaluation Metrics for SVM:
Accuracy: 0.9480681995933051
Precision: 0.8229166666666666
Recall: 0.34649122807017546
F1 Score: 0.4876543209876543



In [56]:
# Calculate evaluation metrics for Random Forest
accuracy_rf_val = accuracy_score(y_val, y_pred_rf_val)
precision_rf_val = precision_score(y_val, y_pred_rf_val)
recall_rf_val = recall_score(y_val, y_pred_rf_val)
f1_rf_val = f1_score(y_val, y_pred_rf_val)

In [57]:
print("Evaluation Metrics for Random Forest:")
print("Accuracy:", accuracy_rf_val)
print("Precision:", precision_rf_val)
print("Recall:", recall_rf_val)
print("F1 Score:", f1_rf_val)
print()

Evaluation Metrics for Random Forest:
Accuracy: 0.993430314406382
Precision: 0.9859154929577465
Recall: 0.9210526315789473
F1 Score: 0.9523809523809524



In [58]:
# Calculate evaluation metrics for XGBoost
accuracy_xgb_val = accuracy_score(y_val, y_pred_xgb_val)
precision_xgb_val = precision_score(y_val, y_pred_xgb_val)
recall_xgb_val = recall_score(y_val, y_pred_xgb_val)
f1_xgb_val = f1_score(y_val, y_pred_xgb_val)

In [59]:
print("Evaluation Metrics for XGBoost:")
print("Accuracy:", accuracy_xgb_val)
print("Precision:", precision_xgb_val)
print("Recall:", recall_xgb_val)
print("F1 Score:", f1_xgb_val)
print()

Evaluation Metrics for XGBoost:
Accuracy: 0.9585484123259815
Precision: 0.9098712446351931
Recall: 0.4649122807017544
F1 Score: 0.6153846153846153



In [60]:
# Calculate evaluation metrics for tuned XGBoost
accuracy_xgb_tuned_val = accuracy_score(y_val, y_pred_xgb_tuned_val)
precision_xgb_tuned_val = precision_score(y_val, y_pred_xgb_tuned_val)
recall_xgb_tuned_val = recall_score(y_val, y_pred_xgb_tuned_val)
f1_xgb_tuned_val = f1_score(y_val, y_pred_xgb_tuned_val)

In [61]:
print("Evaluation Metrics for Tuned XGBoost:")
print("Accuracy:", accuracy_xgb_tuned_val)
print("Precision:", precision_xgb_tuned_val)
print("Recall:", recall_xgb_tuned_val)
print("F1 Score:", f1_xgb_tuned_val)

Evaluation Metrics for Tuned XGBoost:
Accuracy: 0.9540122008446739
Precision: 0.9090909090909091
Recall: 0.39473684210526316
F1 Score: 0.5504587155963303


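#### From the evaluation metrics, we can observe the following:

#### Random Forest achieved the highest accuracy (0.993) and F1-score (0.952) of all models. Because every model was fit on the full TF-IDF matrix, which includes the validation rows, these scores reflect memorization of already-seen tweets as much as generalization, and the Random Forest figures in particular should be read as optimistic.
#### Logistic Regression and SVM reached about 0.95 accuracy but recalled only about 35% of the racist/sexist tweets, giving F1-scores near 0.49.
#### XGBoost had high precision (0.910) but a much lower recall (0.465) than Random Forest (0.921).
#### Tuned XGBoost did not improve on the untuned version: recall fell from 0.465 to 0.395 and F1-score from 0.615 to 0.550, with precision essentially unchanged (0.909 vs 0.910).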
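
The comparison can be checked mechanically from the printed numbers. The sketch below copies the reported validation metrics, rounded to four decimals, into a dictionary and ranks the models by F1-score. The names `reported`, `ranking`, and `recall_change` are illustrative.

```python
# Validation metrics as printed in the cells above, rounded to four decimals
reported = {
    "Logistic Regression": {"accuracy": 0.9493, "precision": 0.8511, "recall": 0.3509, "f1": 0.4969},
    "SVM":                 {"accuracy": 0.9481, "precision": 0.8229, "recall": 0.3465, "f1": 0.4877},
    "Random Forest":       {"accuracy": 0.9934, "precision": 0.9859, "recall": 0.9211, "f1": 0.9524},
    "XGBoost":             {"accuracy": 0.9585, "precision": 0.9099, "recall": 0.4649, "f1": 0.6154},
    "Tuned XGBoost":       {"accuracy": 0.9540, "precision": 0.9091, "recall": 0.3947, "f1": 0.5505},
}

# Rank models from best to worst F1-score
ranking = sorted(reported, key=lambda name: reported[name]["f1"], reverse=True)
for name in ranking:
    m = reported[name]
    print(f"{name:<20} acc={m['accuracy']:.4f} prec={m['precision']:.4f} "
          f"rec={m['recall']:.4f} f1={m['f1']:.4f}")

# Effect of tuning on XGBoost recall (negative means tuning reduced recall)
recall_change = reported["Tuned XGBoost"]["recall"] - reported["XGBoost"]["recall"]
print(f"Recall change after tuning: {recall_change:+.4f}")
```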