**Title: Balanced Dataset Creation for Binary Classification**

**Methodology:**

The data sampling process created a balanced dataset from an imbalanced source dataset with binary labels (Yes/No). The original dataset was heavily imbalanced, with 9,728 "Yes" samples and 1,433 "No" samples (roughly 6.8 to 1). To address this, the majority class ("Yes") was randomly undersampled without replacement to match the size of the minority class ("No"). Sampling used a fixed random seed (42) to ensure reproducibility. After sampling, the combined data was shuffled to prevent any order-based bias.

**Output:**

The sampling process resulted in a perfectly balanced dataset with the following characteristics:

Final size: 2,866 total samples

"Yes" class: 1,433 samples

"No" class: 1,433 samples

The balanced dataset was saved as 'balanced_dataset.csv' for subsequent model training.

In [None]:
import pandas as pd
import numpy as np

# Read the CSV file
print("Reading the file...")
df = pd.read_csv('Final_Data_GPT-Abusive.csv')

# Separate YES and NO comments
yes_comments = df[df['Related'] == 'Yes']
no_comments = df[df['Related'] == 'No']

print(f"Original dataset:")
print(f"YES comments: {len(yes_comments)}")
print(f"NO comments: {len(no_comments)}")

# Randomly sample YES comments to match NO comments count
sampled_yes = yes_comments.sample(n=len(no_comments), random_state=42)

# Combine sampled YES comments with all NO comments
balanced_df = pd.concat([sampled_yes, no_comments])

# Shuffle the final dataset
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"\nBalanced dataset:")
print(f"YES comments: {len(balanced_df[balanced_df['Related'] == 'Yes'])}")
print(f"NO comments: {len(balanced_df[balanced_df['Related'] == 'No'])}")

# Save to new file
balanced_df.to_csv('balanced_dataset.csv', index=False)
print("\nBalanced dataset saved to 'balanced_dataset.csv'")

Reading the file...
Original dataset:
YES comments: 9728
NO comments: 1433

Balanced dataset:
YES comments: 1433
NO comments: 1433

Balanced dataset saved to 'balanced_dataset.csv'
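
The undersampling step above can also be written as a small reusable helper. The following is a minimal sketch, not the code that produced the results: the function name `undersample_to_minority` and the toy DataFrame are hypothetical and exist only for illustration. It assumes the same column names as the notebook (`comment_body`, `Related`).

```python
import pandas as pd


def undersample_to_minority(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class, then shuffle."""
    minority_size = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=minority_size, random_state=seed)  # sampling without replacement
        for _, group in df.groupby(label_col)
    ]
    # Shuffle the combined rows so the classes are not stored in blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data (made up for illustration): 6 "Yes" rows and 2 "No" rows.
toy = pd.DataFrame({
    "comment_body": [f"comment {i}" for i in range(8)],
    "Related": ["Yes"] * 6 + ["No"] * 2,
})
balanced_toy = undersample_to_minority(toy, "Related")
print(balanced_toy["Related"].value_counts().to_dict())
```

Unlike the cell above, this version does not hard-code the "Yes" and "No" labels, so it would also work if the label values or the direction of the imbalance changed.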


**Title: Comparative Analysis of Machine Learning Models for Binary Classification**

**Methodology:**

Six machine learning models were evaluated on the balanced dataset. Each model was trained and tested on the same 80-20 train-test split, generated with a fixed random seed (42). The split was not stratified, so the 574-sample test set contains 262 "No" and 312 "Yes" samples. The text was converted to bag-of-words count features using CountVectorizer with English stop words removed. The following models were evaluated:

1. Naive Bayes (MultinomialNB)

2. Logistic Regression

3. Support Vector Machine (LinearSVC)

4. Random Forest

5. Decision Tree

6. XGBoost
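
The evaluation procedure described above can be sketched as a single loop instead of one cell per model. This is an illustrative sketch under stated assumptions, not the notebook's actual code: the toy corpus is invented, XGBoost is left out so the example depends only on scikit-learn, and `stratify=labels` is an addition that the cells below do not use.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy corpus, made up for illustration only; it is not the notebook's data.
yes_words = ["late", "missing", "broken", "wrong", "damaged",
             "delayed", "lost", "cancelled", "duplicate", "incomplete"]
no_words = ["sunny", "rainy", "cloudy", "windy", "snowy",
            "foggy", "stormy", "humid", "mild", "chilly"]
texts = [f"refund order {w}" for w in yes_words] + [f"weather report {w}" for w in no_words]
labels = ["Yes"] * len(yes_words) + ["No"] * len(no_words)

# 80-20 split with the same seed as the notebook.
# stratify=labels keeps the class ratio equal in both parts (an addition here).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": LinearSVC(random_state=42, max_iter=10000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, clf in models.items():
    # The pipeline fits the vectorizer on the training text only,
    # then applies the same vocabulary to the test text.
    pipe = make_pipeline(CountVectorizer(stop_words="english"), clf)
    pipe.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, pipe.predict(X_test))
    print(f"{name}: accuracy = {results[name]:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the preprocessing identical across models and avoids repeating the load, split, and vectorize steps in every cell.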

**Output:**

Performance metrics for each model:

**Naive Bayes:**

Accuracy: 0.50

F1-scores: No (0.47), Yes (0.53)

**Logistic Regression:**

Accuracy: 0.45

F1-scores: No (0.47), Yes (0.44)

**Support Vector Machine:**

Accuracy: 0.47

F1-scores: No (0.48), Yes (0.44)

**Random Forest:**

Accuracy: 0.51

F1-scores: No (0.54), Yes (0.48)

**Decision Tree:**

Accuracy: 0.49

F1-scores: No (0.51), Yes (0.48)

**XGBoost:**

Accuracy: 0.46

F1-scores: No (0.49), Yes (0.43)

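Among the models tested, Random Forest had the highest accuracy (0.51) and the highest F1-score for the "No" class (0.54), while Naive Bayes had the highest F1-score for the "Yes" class (0.53). These differences are small, however: accuracies range from 0.45 to 0.51, which is at or below chance level for a two-class problem. Because the test set contains 312 "Yes" and 262 "No" samples, always predicting "Yes" would score about 0.54, higher than every model here. The results therefore suggest that the bag-of-words features carry little usable signal for this task as configured, and they do not support a meaningful ranking of the models.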
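
One way to put these accuracies in context is to compare them with a majority-class baseline computed from the test-set class counts. The sketch below uses only numbers reported in this notebook (the support counts and the accuracies); the standard-error estimate is a rough binomial approximation added here as an assumption, not part of the original analysis.

```python
import math

# Test-set class counts, taken from the "support" column of the reports below.
support = {"No": 262, "Yes": 312}
n_test = sum(support.values())

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(support.values()) / n_test

# Rough standard error of an accuracy estimate near 0.5 on n_test samples
# (binomial approximation; an assumption for illustration).
std_err = math.sqrt(0.5 * 0.5 / n_test)

# Accuracies as reported in the Output section above.
reported_accuracy = {
    "Naive Bayes": 0.50,
    "Logistic Regression": 0.45,
    "Support Vector Machine": 0.47,
    "Random Forest": 0.51,
    "Decision Tree": 0.49,
    "XGBoost": 0.46,
}

print(f"Majority-class baseline: {majority_baseline:.3f}")
print(f"Approx. standard error:  {std_err:.3f}")
for name, acc in reported_accuracy.items():
    print(f"{name}: {acc:.2f} ({acc - majority_baseline:+.3f} vs. baseline)")
```

With a standard error of roughly 0.02, the gaps between the six models are within about two to three standard errors of each other, so the ordering should be read with caution.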

In [None]:
# Install gdown to download files from Google Drive
!pip install gdown





In [None]:

# Please Run this Cell First
file_id = '19vuDz3MQJjYtdiSS9OtMLIPGQwxbyTgb'
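# Pass the full URL; the --id flag is deprecated in recent gdown releases
!gdown "https://drive.google.com/uc?id={file_id}" -O balanced_dataset.csv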


Downloading...
From: https://drive.google.com/uc?id=19vuDz3MQJjYtdiSS9OtMLIPGQwxbyTgb
To: /content/balanced_dataset.csv
100% 3.70M/3.70M [00:00<00:00, 211MB/s]


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score


# 1. Load the data
print("Loading data...")
df = pd.read_csv('balanced_dataset.csv')

# 2. Remove rows with NaN values
print("Cleaning data...")
df = df.dropna(subset=['comment_body', 'Related'])  # Remove rows with NaN in these columns

# 3. Prepare features (X) and labels (y)
X = df['comment_body']  # Your text column
y = df['Related']    # Your label column

# 4. Split data into training (80%) and testing (20%) sets
print("Splitting data into training and testing sets...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Convert text to numerical features
print("Converting text to features...")
vectorizer = CountVectorizer(stop_words='english')
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# 6. Train the Naive Bayes model
print("Training Naive Bayes model...")
model = MultinomialNB()
model.fit(X_train_features, y_train)

# 7. Make predictions
print("Making predictions...")
predictions = model.predict(X_test_features)

# 8. Print results
print("\nModel Performance:")
print("-" * 50)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
print("\nDetailed Classification Report:")
print(classification_report(y_test, predictions))

# 9. Print dataset sizes
print("\nDataset Information:")
print(f"Original dataset size: {len(df)}")
print(f"Training set size: {len(X_train)} comments")
print(f"Testing set size: {len(X_test)} comments")


Loading data...
Cleaning data...
Splitting data into training and testing sets...
Converting text to features...
Training Naive Bayes model...
Making predictions...

Model Performance:
--------------------------------------------------
Accuracy: 0.50

Detailed Classification Report:
              precision    recall  f1-score   support

          No       0.46      0.48      0.47       262
         Yes       0.54      0.52      0.53       312

    accuracy                           0.50       574
   macro avg       0.50      0.50      0.50       574
weighted avg       0.50      0.50      0.50       574


Dataset Information:
Original dataset size: 2866
Training set size: 2292 comments
Testing set size: 574 comments
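
As a cross-check on the report above, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes the per-class and macro-averaged F1 for Naive Bayes from the rounded precision and recall values in the report, so small rounding differences are possible.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall values copied from the Naive Bayes report.
no_f1 = f1_score_from(0.46, 0.48)
yes_f1 = f1_score_from(0.54, 0.52)

# Macro average: unweighted mean of the per-class scores.
macro_f1 = (no_f1 + yes_f1) / 2

print(f"No F1:    {no_f1:.2f}")
print(f"Yes F1:   {yes_f1:.2f}")
print(f"Macro F1: {macro_f1:.2f}")
```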


In [None]:
from sklearn.linear_model import LogisticRegression


# 1. Load and clean data
print("Loading data...")
df = pd.read_csv('balanced_dataset.csv')
df = df.dropna(subset=['comment_body', 'Related'])  # Remove rows with NaN

# 2. Prepare features (X) and labels (y)
X = df['comment_body']  # Text data
y = df['Related']    # Labels

# 3. Split data 80-20
print("Splitting data...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Convert text to features
print("Converting text to features...")
vectorizer = CountVectorizer(stop_words='english')
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# 5. Train Logistic Regression model
print("Training Logistic Regression model...")
model = LogisticRegression(max_iter=1000)  # Increased iterations for convergence
model.fit(X_train_features, y_train)

# 6. Make predictions
print("Making predictions...")
predictions = model.predict(X_test_features)

# 7. Print results
print("\nModel Performance:")
print("-" * 50)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
print("\nDetailed Classification Report:")
print(classification_report(y_test, predictions))

# 8. Print dataset sizes
print("\nDataset Information:")
print(f"Total comments processed: {len(df)}")
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

Loading data...
Splitting data...
Converting text to features...
Training Logistic Regression model...
Making predictions...

Model Performance:
--------------------------------------------------
Accuracy: 0.45

Detailed Classification Report:
              precision    recall  f1-score   support

          No       0.42      0.53      0.47       262
         Yes       0.50      0.39      0.44       312

    accuracy                           0.45       574
   macro avg       0.46      0.46      0.45       574
weighted avg       0.46      0.45      0.45       574


Dataset Information:
Total comments processed: 2866
Training set size: 2292
Testing set size: 574


In [None]:
from sklearn.svm import LinearSVC

# 1. Load and clean data
print("Loading data...")
df = pd.read_csv('balanced_dataset.csv')
df = df.dropna(subset=['comment_body', 'Related'])  # Remove rows with NaN

# 2. Prepare features (X) and labels (y)
X = df['comment_body']  # Text data
y = df['Related']    # Labels


# 3. Split data 80-20
print("Splitting data...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Convert text to features
print("Converting text to features...")
vectorizer = CountVectorizer(stop_words='english')
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# 5. Train SVM model
print("Training SVM model...")
svm_model = LinearSVC(random_state=42, max_iter=110000)
svm_model.fit(X_train_features, y_train)


# 6. Make predictions
print("Making predictions...")
predictions = svm_model.predict(X_test_features)

# 7. Print results
print("\nModel Performance:")
print("-" * 50)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
print("\nDetailed Classification Report:")
print(classification_report(y_test, predictions))

# 8. Print dataset sizes
print("\nDataset Information:")
print(f"Total comments processed: {len(df)}")
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")


Loading data...
Splitting data...
Converting text to features...
Training SVM model...
Making predictions...

Model Performance:
--------------------------------------------------
Accuracy: 0.47

Detailed Classification Report:
              precision    recall  f1-score   support

          No       0.43      0.55      0.48       262
         Yes       0.51      0.39      0.44       312

    accuracy                           0.47       574
   macro avg       0.47      0.47      0.46       574
weighted avg       0.47      0.47      0.46       574


Dataset Information:
Total comments processed: 2866
Training set size: 2292
Testing set size: 574


In [None]:
from sklearn.ensemble import RandomForestClassifier

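# 1. Load and clean data
print("Loading data...")
df = pd.read_csv('balanced_dataset.csv')
df = df.dropna(subset=['comment_body', 'Related'])  # Remove rows with NaN (CountVectorizer cannot handle them)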

# 2. Prepare features (X) and labels (y)
X = df['comment_body']  # Text data
y = df['Related']      # YES/NO labels

# 3. Split data 80-20
print("Splitting data...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Convert text to features
print("Converting text to features...")
vectorizer = CountVectorizer(stop_words='english')
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# 5. Train Random Forest model
print("Training Random Forest model...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_features, y_train)

# 6. Make predictions
print("Making predictions...")
predictions = rf_model.predict(X_test_features)

# 7. Print results
print("\nModel Performance:")
print("-" * 50)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
print("\nDetailed Classification Report:")
print(classification_report(y_test, predictions))

# 8. Print dataset sizes
print("\nDataset Information:")
print(f"Total comments processed: {len(df)}")
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

Loading data...
Splitting data...
Converting text to features...
Training Random Forest model...
Making predictions...

Model Performance:
--------------------------------------------------
Accuracy: 0.51

Detailed Classification Report:
              precision    recall  f1-score   support

          No       0.47      0.63      0.54       262
         Yes       0.57      0.42      0.48       312

    accuracy                           0.51       574
   macro avg       0.52      0.52      0.51       574
weighted avg       0.53      0.51      0.51       574


Dataset Information:
Total comments processed: 2866
Training set size: 2292
Testing set size: 574


In [None]:

from sklearn.tree import DecisionTreeClassifier


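# 1. Load and clean data
print("Loading data...")
df = pd.read_csv('balanced_dataset.csv')
df = df.dropna(subset=['comment_body', 'Related'])  # Remove rows with NaN (CountVectorizer cannot handle them)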

# 2. Prepare features (X) and labels (y)
X = df['comment_body']  # Text data
y = df['Related']      # YES/NO labels

# 3. Split data 80-20
print("Splitting data...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Convert text to features
print("Converting text to features...")
vectorizer = CountVectorizer(stop_words='english')
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# 5. Train Decision Tree model
print("Training Decision Tree model...")
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train_features, y_train)

# 6. Make predictions
print("Making predictions...")
predictions = dt_model.predict(X_test_features)

# 7. Print results
print("\nModel Performance:")
print("-" * 50)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
print("\nDetailed Classification Report:")
print(classification_report(y_test, predictions))

# 8. Print dataset sizes
print("\nDataset Information:")
print(f"Total comments processed: {len(df)}")
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

Loading data...
Splitting data...
Converting text to features...
Training Decision Tree model...
Making predictions...

Model Performance:
--------------------------------------------------
Accuracy: 0.49

Detailed Classification Report:
              precision    recall  f1-score   support

          No       0.46      0.58      0.51       262
         Yes       0.55      0.42      0.48       312

    accuracy                           0.49       574
   macro avg       0.50      0.50      0.49       574
weighted avg       0.51      0.49      0.49       574


Dataset Information:
Total comments processed: 2866
Training set size: 2292
Testing set size: 574


In [None]:

from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

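# 1. Load and clean data
print("Loading data...")
df = pd.read_csv('balanced_dataset.csv')
df = df.dropna(subset=['comment_body', 'Related'])  # Remove rows with NaN (CountVectorizer cannot handle them)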

# 2. Prepare features (X) and labels (y)
X = df['comment_body']  # Text data
y = df['Related']      # YES/NO labels

# 3. Convert Yes/No labels to 1/0 (XGBoost expects numeric class labels)
print("Converting labels to numbers...")
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)  # Classes are sorted alphabetically: 'No' -> 0, 'Yes' -> 1

# 4. Split data 80-20
print("Splitting data...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Convert text to features
print("Converting text to features...")
vectorizer = CountVectorizer(stop_words='english')
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# 6. Train XGBoost model
print("Training XGBoost model...")
xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_train_features, y_train)

# 7. Make predictions
print("Making predictions...")
predictions = xgb_model.predict(X_test_features)

# 8. Convert predictions back to YES/NO for the report
predictions_labels = label_encoder.inverse_transform(predictions)
y_test_labels = label_encoder.inverse_transform(y_test)

# 9. Print results
print("\nModel Performance:")
print("-" * 50)
print(f"Accuracy: {accuracy_score(y_test_labels, predictions_labels):.2f}")
print("\nDetailed Classification Report:")
print(classification_report(y_test_labels, predictions_labels))

# 10. Print dataset sizes
print("\nDataset Information:")
print(f"Total comments processed: {len(df)}")
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

Loading data...
Converting labels to numbers...
Splitting data...
Converting text to features...
Training XGBoost model...
Making predictions...

Model Performance:
--------------------------------------------------
Accuracy: 0.46

Detailed Classification Report:
              precision    recall  f1-score   support

          No       0.43      0.56      0.49       262
         Yes       0.51      0.38      0.43       312

    accuracy                           0.46       574
   macro avg       0.47      0.47      0.46       574
weighted avg       0.47      0.46      0.46       574


Dataset Information:
Total comments processed: 2866
Training set size: 2292
Testing set size: 574
