Twitter Sentiment Analysis Assignment


Problem Statement:

Perform sentiment analysis on a dataset of tweets. The dataset includes tweets, user information, and corresponding sentiment labels (positive, negative, or neutral). The objective is to analyze the sentiments expressed in the tweets, develop a sentiment classifier, and gain insights into public opinion on Twitter. Use one of the datasets in the files for this exercise.

In [1]:
import pandas as pd

# Load the training and testing datasets
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Display the first few rows of the dataset
print(train_data.head())
print(test_data.head())


import re

def preprocess_with_regex(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs (http\S+ already covers https)
    text = re.sub(r'@\w+|#', '', text)  # Remove mentions and the '#' symbol (hashtag words are kept)
    text = re.sub(r'[^a-z\s]', '', text)  # Remove special characters
    tokens = text.split()  # Tokenize
    return ' '.join(tokens)

# Apply to the train dataset
train_data['cleaned_text'] = train_data['tweet'].apply(preprocess_with_regex)
train_data[['tweet', 'cleaned_text']].head()

# Apply to the test dataset
test_data['cleaned_text'] = test_data['tweet'].apply(preprocess_with_regex)
test_data[['tweet', 'cleaned_text']].head()


   id  label                                              tweet
0   1      0   @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0                                bihday your majesty
3   4      0  #model   i love u take with u all the time in ...
4   5      0             factsguide: society now    #motivation
      id                                              tweet
0  31963  #studiolife #aislife #requires #passion #dedic...
1  31964   @user #white #supremacists want everyone to s...
2  31965  safe ways to heal your #acne!!    #altwaystohe...
3  31966  is the hp and the cursed child book up for res...
4  31967    3rd #bihday to my amazing, hilarious #nephew...


                                               tweet                                       cleaned_text
0  #studiolife #aislife #requires #passion #dedic...  studiolife aislife requires passion dedication...
1   @user #white #supremacists want everyone to s...  white supremacists want everyone to see the ne...
2     safe ways to heal your #acne!! #altwaystohe...  safe ways to heal your acne altwaystoheal heal...
3  is the hp and the cursed child book up for res...  is the hp and the cursed child book up for res...
4    3rd #bihday to my amazing, hilarious #nephew...  rd bihday to my amazing hilarious nephew eli a...
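
As a quick check of the cleaning function, the sketch below runs it on a single made-up tweet. The sample string is an assumption modeled on row 4 above, not a row taken from the dataset. It illustrates two side effects of the regexes: hashtag words survive because only the '#' character is removed, and digits are dropped, which is why "3rd" becomes "rd".

```python
import re

def preprocess_with_regex(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)  # Remove URLs
    text = re.sub(r'@\w+|\#', '', text)  # Remove mentions and the '#' symbol
    text = re.sub(r'[^a-z\s]', '', text)  # Remove everything except letters and whitespace
    return ' '.join(text.split())  # Collapse repeated whitespace

# Hypothetical sample tweet, used only for illustration
sample = "@user 3rd #bihday to my amazing, hilarious #nephew!! http://t.co/abc"
cleaned = preprocess_with_regex(sample)
print(cleaned)
```

If digits or punctuation such as "!!" carry signal for the task, the third regex is the line to relax.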


In [4]:
# Feature Extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Utilize TF-IDF for feature extraction
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Feature extraction and dataset splitting completed.")
print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")

Feature extraction and dataset splitting completed.
Training set size: (25569, 5000)
Testing set size: (6393, 5000)
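
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization per document). The three-document corpus is invented for illustration, and the sketch ignores the vectorizer's own tokenization and the max_features cutoff.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already cleaned and whitespace-tokenizable
docs = ["love this day", "love love this", "hate this day"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Raw count * idf for each vocabulary term, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for term in vocab:
    print(f"{term:<5} df={df[term]} idf={idf[term]:.3f}")
```

A term that appears in every document ("this") gets the minimum IDF of 1, while a term seen in only one document ("hate") gets the largest weight.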


In [5]:
#Support Vector Machine (SVM) Model Tuning:

from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Support Vector Machine (SVM) classifier
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.98      5937
           1       0.86      0.46      0.60       456

    accuracy                           0.96      6393
   macro avg       0.91      0.73      0.79      6393
weighted avg       0.95      0.96      0.95      6393

Accuracy Score:
0.9562020960425466
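
The accuracy above is easier to interpret next to two small calculations, sketched here from the rounded figures in the report (so results agree with the report only to within rounding). The first recomputes the class-1 F1 from its precision and recall; the second is the accuracy a classifier would get by always predicting class 0 on this test split.

```python
# Rounded class-1 precision and recall copied from the SVM report above
precision_1, recall_1 = 0.86, 0.46

# F1 is the harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Supports copied from the report: 5937 class-0 rows, 456 class-1 rows
support_0, support_1 = 5937, 456
baseline_accuracy = support_0 / (support_0 + support_1)

print(f"class-1 F1 from rounded P/R: {f1_1:.2f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Because roughly 93% of the test rows are class 0, accuracy alone says little here; the class-1 recall and F1 are the more informative numbers.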


In [6]:
#LSTM Model Tuning:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

# Preprocess the text data for the model
tokenizer = Tokenizer(num_words=5000, lower=True)
tokenizer.fit_on_texts(train_data['cleaned_text'])
X_train = tokenizer.texts_to_sequences(train_data['cleaned_text'])
X_test = tokenizer.texts_to_sequences(test_data['cleaned_text'])

X_train = pad_sequences(X_train, maxlen=100)
X_test = pad_sequences(X_test, maxlen=100)

# Encode the labels
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_data['label'])
num_classes = len(label_encoder.classes_)  # 2 here: the labels are 0 and 1
y_train = to_categorical(y_train, num_classes=num_classes)

# Split the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Build the LSTM model
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=100))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), verbose=1)

# Predict on the validation set
y_val_pred = model.predict(X_val)
y_val_pred_classes = np.argmax(y_val_pred, axis=1)
y_val_classes = np.argmax(y_val, axis=1)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_val_classes, y_val_pred_classes, target_names=label_encoder.classes_.astype(str)))

print("Accuracy Score:")
print(accuracy_score(y_val_classes, y_val_pred_classes))



Epoch 1/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 25s 56ms/step - accuracy: 0.9238 - loss: 0.2850 - val_accuracy: 0.9545 - val_loss: 0.1294
Epoch 2/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 23s 57ms/step - accuracy: 0.9633 - loss: 0.1065 - val_accuracy: 0.9589 - val_loss: 0.1215
Epoch 3/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 23s 57ms/step - accuracy: 0.9731 - loss: 0.0781 - val_accuracy: 0.9539 - val_loss: 0.1259
Epoch 4/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 22s 56ms/step - accuracy: 0.9779 - loss: 0.0644 - val_accuracy: 0.9543 - val_loss: 0.1289
Epoch 5/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 24s 59ms/step - accuracy: 0.9821 - loss: 0.0503 - val_accuracy: 0.9589 - val_loss: 0.1340
200/200 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.
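
In the training log above, val_loss is lowest at epoch 2 and rises afterwards while training loss keeps falling, which suggests the later epochs mostly overfit. The sketch below replays the logged val_loss values through a simple patience rule to show where early stopping would have halted. The helper name and the patience value of 2 are assumptions for illustration; the logic mirrors the usual "stop after N epochs without improvement" rule rather than any particular library's implementation.

```python
def early_stopping_trace(losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based, for per-epoch losses."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    stop_epoch = len(losses)  # default: training ran to the end
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                stop_epoch = epoch
                break
    return best_epoch, stop_epoch

# val_loss per epoch, copied from the LSTM log above
lstm_val_loss = [0.1294, 0.1215, 0.1259, 0.1289, 0.1340]
best_epoch, stop_epoch = early_stopping_trace(lstm_val_loss, patience=2)
print(best_epoch, stop_epoch)
```

In Keras the same idea is available as the tf.keras.callbacks.EarlyStopping callback (for example with monitor='val_loss' and restore_best_weights=True), passed to model.fit through the callbacks argument.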

In [7]:
#Naive Bayes Model Tuning:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Naive Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict on the test set
y_pred = nb_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      5937
           1       0.93      0.31      0.47       456

    accuracy                           0.95      6393
   macro avg       0.94      0.66      0.72      6393
weighted avg       0.95      0.95      0.94      6393

Accuracy Score:
0.9494759893633662
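
The gap between the "macro avg" and "weighted avg" rows comes entirely from how the two classes are weighted. The sketch below recomputes both averages for recall from the rounded per-class values in the Naive Bayes report, so the results match the report only to within rounding.

```python
# Per-class recall and support copied from the Naive Bayes report above
recall = {0: 1.00, 1: 0.31}
support = {0: 5937, 1: 456}

# Macro average: every class counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: every class counts in proportion to its support
weighted_recall = (
    sum(recall[c] * support[c] for c in recall) / sum(support.values())
)

print(f"macro recall:    {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```

The weighted figure is dominated by class 0, so the macro average is the one that exposes the weak class-1 recall.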


In [8]:
#Random Forest Model Tuning:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.98      5937
           1       0.84      0.52      0.64       456

    accuracy                           0.96      6393
   macro avg       0.90      0.76      0.81      6393
weighted avg       0.96      0.96      0.95      6393

Accuracy Score:
0.9590176755826686


In [9]:
#Feedforward Neural Network (FNN) Model Tuning:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Dropout, Flatten
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

# Preprocess the text data for the model
tokenizer = Tokenizer(num_words=5000, lower=True)
tokenizer.fit_on_texts(train_data['cleaned_text'])
X_train = tokenizer.texts_to_sequences(train_data['cleaned_text'])
X_test = tokenizer.texts_to_sequences(test_data['cleaned_text'])

X_train = pad_sequences(X_train, maxlen=100)
X_test = pad_sequences(X_test, maxlen=100)

# Encode the labels
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_data['label'])
num_classes = len(label_encoder.classes_)  # 2 here: the labels are 0 and 1
y_train = to_categorical(y_train, num_classes=num_classes)

# Split the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Build the Feedforward Neural Network (FNN) model
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=100))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), verbose=1)

# Predict on the validation set
y_val_pred = model.predict(X_val)
y_val_pred_classes = np.argmax(y_val_pred, axis=1)
y_val_classes = np.argmax(y_val, axis=1)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_val_classes, y_val_pred_classes, target_names=label_encoder.classes_.astype(str)))

print("Accuracy Score:")
print(accuracy_score(y_val_classes, y_val_pred_classes))


Epoch 1/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 5s 9ms/step - accuracy: 0.9293 - loss: 0.2407 - val_accuracy: 0.9535 - val_loss: 0.1310
Epoch 2/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.9694 - loss: 0.0888 - val_accuracy: 0.9573 - val_loss: 0.1282
Epoch 3/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.9893 - loss: 0.0353 - val_accuracy: 0.9521 - val_loss: 0.1579
Epoch 4/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.9949 - loss: 0.0197 - val_accuracy: 0.9543 - val_loss: 0.2311
Epoch 5/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.9982 - loss: 0.0075 - val_accuracy: 0.9560 - val_loss: 0.2666
200/200 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      593

In [10]:
#Logistic Regression Model Tuning:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Logistic Regression classifier
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

# Predict on the test set
y_pred = lr_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      5937
           1       0.92      0.32      0.47       456

    accuracy                           0.95      6393
   macro avg       0.93      0.66      0.72      6393
weighted avg       0.95      0.95      0.94      6393

Accuracy Score:
0.9494759893633662


In [11]:
#Gradient Boosting Model Tuning:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Gradient Boosting classifier
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)

# Predict on the test set
y_pred = gb_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      5937
           1       0.88      0.26      0.40       456

    accuracy                           0.94      6393
   macro avg       0.91      0.63      0.69      6393
weighted avg       0.94      0.94      0.93      6393

Accuracy Score:
0.9447833567964962


In [12]:
#Convolutional Neural Network (CNN) Model Tuning:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Dense, Dropout, Flatten
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

# Preprocess the text data for the model
tokenizer = Tokenizer(num_words=5000, lower=True)
tokenizer.fit_on_texts(train_data['cleaned_text'])
X_train = tokenizer.texts_to_sequences(train_data['cleaned_text'])
X_test = tokenizer.texts_to_sequences(test_data['cleaned_text'])

X_train = pad_sequences(X_train, maxlen=100)
X_test = pad_sequences(X_test, maxlen=100)

# Encode the labels
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_data['label'])
num_classes = len(label_encoder.classes_)  # 2 here: the labels are 0 and 1
y_train = to_categorical(y_train, num_classes=num_classes)

# Split the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Build the Convolutional Neural Network (CNN) model
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=100))
model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), verbose=1)

# Predict on the validation set
y_val_pred = model.predict(X_val)
y_val_pred_classes = np.argmax(y_val_pred, axis=1)
y_val_classes = np.argmax(y_val, axis=1)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_val_classes, y_val_pred_classes, target_names=label_encoder.classes_.astype(str)))

print("Accuracy Score:")
print(accuracy_score(y_val_classes, y_val_pred_classes))

Epoch 1/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 6s 12ms/step - accuracy: 0.9189 - loss: 0.2624 - val_accuracy: 0.9521 - val_loss: 0.1363
Epoch 2/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 5s 12ms/step - accuracy: 0.9693 - loss: 0.0926 - val_accuracy: 0.9567 - val_loss: 0.1351
Epoch 3/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 5s 12ms/step - accuracy: 0.9855 - loss: 0.0465 - val_accuracy: 0.9528 - val_loss: 0.1628
Epoch 4/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 5s 12ms/step - accuracy: 0.9913 - loss: 0.0254 - val_accuracy: 0.9556 - val_loss: 0.2419
Epoch 5/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 5s 12ms/step - accuracy: 0.9958 - loss: 0.0136 - val_accuracy: 0.9482 - val_loss: 0.2982
200/200 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97    

In [13]:
#K-Nearest Neighbors (KNN) Model Tuning:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a K-Nearest Neighbors classifier
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Predict on the test set
y_pred = knn_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.94      1.00      0.97      5937
           1       0.97      0.17      0.28       456

    accuracy                           0.94      6393
   macro avg       0.96      0.58      0.63      6393
weighted avg       0.94      0.94      0.92      6393

Accuracy Score:
0.9402471453151885


In [14]:
#Decision Tree Model Tuning:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree classifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Predict on the test set
y_pred = dt_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))



Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97      5937
           1       0.60      0.58      0.59       456

    accuracy                           0.94      6393
   macro avg       0.78      0.78      0.78      6393
weighted avg       0.94      0.94      0.94      6393

Accuracy Score:
0.9424370405130612


In [16]:
pip install -U xgboost

Collecting xgboost
  Downloading xgboost-2.1.3-py3-none-win_amd64.whl.metadata (2.1 kB)
Downloading xgboost-2.1.3-py3-none-win_amd64.whl (124.9 MB)



In [18]:
#XGBoost Model Tuning:

from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an XGBoost classifier
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)  # binary labels, so logloss rather than mlogloss
xgb_model.fit(X_train, y_train)

# Predict on the test set
y_pred = xgb_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      5937
           1       0.83      0.39      0.53       456

    accuracy                           0.95      6393
   macro avg       0.89      0.69      0.75      6393
weighted avg       0.95      0.95      0.94      6393

Accuracy Score:
0.9508837791334271


In [17]:
#Bidirectional LSTM Model Tuning:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D, Bidirectional
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

# Preprocess the text data for the model
tokenizer = Tokenizer(num_words=5000, lower=True)
tokenizer.fit_on_texts(train_data['cleaned_text'])
X_train = tokenizer.texts_to_sequences(train_data['cleaned_text'])
X_test = tokenizer.texts_to_sequences(test_data['cleaned_text'])

X_train = pad_sequences(X_train, maxlen=100)
X_test = pad_sequences(X_test, maxlen=100)

# Encode the labels
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_data['label'])
num_classes = len(label_encoder.classes_)  # 2 here: the labels are 0 and 1
y_train = to_categorical(y_train, num_classes=num_classes)

# Split the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Build the Bidirectional LSTM model
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=100))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), verbose=1)

# Predict on the validation set
y_val_pred = model.predict(X_val)
y_val_pred_classes = np.argmax(y_val_pred, axis=1)
y_val_classes = np.argmax(y_val, axis=1)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_val_classes, y_val_pred_classes, target_names=label_encoder.classes_.astype(str)))

print("Accuracy Score:")
print(accuracy_score(y_val_classes, y_val_pred_classes))

Epoch 1/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 32s 72ms/step - accuracy: 0.9307 - loss: 0.2648 - val_accuracy: 0.9531 - val_loss: 0.1316
Epoch 2/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 29s 72ms/step - accuracy: 0.9646 - loss: 0.1025 - val_accuracy: 0.9557 - val_loss: 0.1195
Epoch 3/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 29s 72ms/step - accuracy: 0.9711 - loss: 0.0804 - val_accuracy: 0.9546 - val_loss: 0.1265
Epoch 4/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 29s 72ms/step - accuracy: 0.9797 - loss: 0.0615 - val_accuracy: 0.9576 - val_loss: 0.1291
Epoch 5/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 28s 70ms/step - accuracy: 0.9824 - loss: 0.0509 - val_accuracy: 0.9560 - val_loss: 0.1421
200/200 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.98      0.

In [19]:
#Ridge Classifier Model Tuning:

from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Ridge Classifier
ridge_model = RidgeClassifier()
ridge_model.fit(X_train, y_train)

# Predict on the test set
y_pred = ridge_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.98      5937
           1       0.86      0.45      0.59       456

    accuracy                           0.96      6393
   macro avg       0.91      0.72      0.78      6393
weighted avg       0.95      0.96      0.95      6393

Accuracy Score:
0.9554199906147348


In [20]:
#AdaBoost Model Tuning:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an AdaBoost classifier
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_model.fit(X_train, y_train)

# Predict on the test set
y_pred = ada_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))





Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.97      5937
           1       0.71      0.42      0.53       456

    accuracy                           0.95      6393
   macro avg       0.83      0.70      0.75      6393
weighted avg       0.94      0.95      0.94      6393

Accuracy Score:
0.9465039887376818


In [22]:
pip install lightgbm

Collecting lightgbm
  Downloading lightgbm-4.5.0-py3-none-win_amd64.whl.metadata (17 kB)
Downloading lightgbm-4.5.0-py3-none-win_amd64.whl (1.4 MB)
Installing collected packages: lightgbm
Successfully installed lightgbm-4.5.0
Note: you may need to restart the kernel to use updated packages.




In [23]:
#LightGBM Model Tuning:

import lightgbm as lgb
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a LightGBM classifier
lgb_model = lgb.LGBMClassifier(n_estimators=100, random_state=42)
lgb_model.fit(X_train, y_train)

# Predict on the test set
y_pred = lgb_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

[LightGBM] [Info] Number of positive: 1786, number of negative: 23783
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.029679 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 52754
[LightGBM] [Info] Number of data points in the train set: 25569, number of used features: 1573
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.069850 -> initscore=-2.588993
[LightGBM] [Info] Start training from score -2.588993
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      5937
           1       0.83      0.39      0.53       456

    accuracy                           0.95      6393
   macro avg       0.89      0.69      0.75      6393
weighted avg       0.95      0.95      0.94      6393

Accuracy Score:
0.9505709369623025
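
The LightGBM log above reports the class counts in the training split (1786 positive, 23783 negative), which is enough to reproduce two of its numbers and to derive class weights. The sketch below checks that pavg is the positive fraction and that the initial score is its log-odds, then computes the "balanced" weights n_samples / (n_classes * class_count) and the negative-to-positive ratio. Applying such weights to any of the models here is a suggestion to test, not something the notebook does.

```python
import math

# Class counts copied from the LightGBM log above
n_pos, n_neg = 1786, 23783
n_total = n_pos + n_neg

# Fraction of positive rows and its log-odds
p_avg = n_pos / n_total
init_score = math.log(p_avg / (1 - p_avg))

# "Balanced" class weights: n_samples / (n_classes * class_count)
class_weights = {0: n_total / (2 * n_neg), 1: n_total / (2 * n_pos)}

# Ratio commonly used to up-weight the positive class in boosted trees
neg_pos_ratio = n_neg / n_pos

print(f"p_avg={p_avg:.6f} init_score={init_score:.6f}")
print(f"class_weights={class_weights}")
print(f"neg/pos ratio={neg_pos_ratio:.2f}")
```

The balanced formula is the one scikit-learn applies when an estimator is given class_weight='balanced'; whether reweighting raises class-1 recall on this dataset would need to be confirmed by rerunning the models.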


In [25]:
pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp312-cp312-win_amd64.whl.metadata (1.2 kB)
Collecting graphviz (from catboost)
  Downloading graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Downloading catboost-1.2.7-cp312-cp312-win_amd64.whl (101.7 MB)

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'c:\\Python312\\etc'
Consider using the `--user` option or check the permissions.
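
The error message itself suggests the `--user` option. As a hedged sketch, the helper below (a hypothetical name, `pip_install_command`) builds a pip command for the same interpreter that runs the notebook, so the package lands in the per-user site-packages instead of `c:\Python312`. It assumes the notebook is not running inside a virtual environment, where `--user` is not accepted.

```python
import sys


def pip_install_command(package, user=True):
    """Build a pip command bound to the interpreter running this notebook."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if user:
        cmd.append("--user")  # install into the per-user site-packages
    cmd.append(package)
    return cmd


print(pip_install_command("catboost"))

# To actually run the install from a cell:
# import subprocess
# subprocess.check_call(pip_install_command("catboost"))
```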



In [26]:
#CatBoost Model Tuning:
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a CatBoost classifier
cat_model = CatBoostClassifier(iterations=100, random_state=42, verbose=0)
cat_model.fit(X_train, y_train)

# Predict on the test set
y_pred = cat_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.98      5937
           1       0.84      0.42      0.56       456

    accuracy                           0.95      6393
   macro avg       0.90      0.71      0.77      6393
weighted avg       0.95      0.95      0.95      6393

Accuracy Score:
0.9527608321601752


In [39]:
pip install tf-keras

Collecting tf-keras
  Downloading tf_keras-2.18.0-py3-none-any.whl.metadata (1.6 kB)
Collecting tensorflow<2.19,>=2.18 (from tf-keras)
  Using cached tensorflow-2.18.0-cp312-cp312-win_amd64.whl.metadata (3.3 kB)
Collecting tensorflow-intel==2.18.0 (from tensorflow<2.19,>=2.18->tf-keras)
  Using cached tensorflow_intel-2.18.0-cp312-cp312-win_amd64.whl.metadata (4.9 kB)
Collecting tensorboard<2.19,>=2.18 (from tensorflow-intel==2.18.0->tensorflow<2.19,>=2.18->tf-keras)
  Using cached tensorboard-2.18.0-py3-none-any.whl.metadata (1.6 kB)
Downloading tf_keras-2.18.0-py3-none-any.whl (1.7 MB)

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'c:\\python312\\scripts\\tensorboard.exe'
Consider using the `--user` option or check the permissions.



In [42]:
# #Transformer-based Model (BERT) Tuning:
# # For BERT, we will use the transformers library from Hugging Face. This requires a bit more setup, including tokenization and model configuration.
# from transformers import BertTokenizer, TFBertForSequenceClassification
# from tensorflow.keras.optimizers import Adam
# from tensorflow.keras.losses import SparseCategoricalCrossentropy
# from tensorflow.keras.metrics import SparseCategoricalAccuracy
# from sklearn.preprocessing import LabelEncoder
# from sklearn.metrics import classification_report, accuracy_score
# import tensorflow as tf

# # Load the BERT tokenizer
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# # Tokenize the text data
# def tokenize_texts(texts, max_len=128):
#     return tokenizer(
#         texts.tolist(),
#         max_length=max_len,
#         padding='max_length',
#         truncation=True,
#         return_tensors='tf'
#     )

# # Hold out 20% of the labelled training data for evaluation,
# # because test.csv has no 'label' column
# train_df, val_df = train_test_split(
#     train_data, test_size=0.2, random_state=42, stratify=train_data['label']
# )
# train_encodings = tokenize_texts(train_df['cleaned_text'])
# val_encodings = tokenize_texts(val_df['cleaned_text'])

# # Encode the labels
# label_encoder = LabelEncoder()
# y_train = label_encoder.fit_transform(train_df['label'])
# y_val = label_encoder.transform(val_df['label'])

# # Build the BERT model (the labels in train.csv are binary: 0 and 1)
# model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# # Compile the model
# optimizer = Adam(learning_rate=2e-5)
# loss = SparseCategoricalCrossentropy(from_logits=True)
# metric = SparseCategoricalAccuracy('accuracy')

# model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# # Train the model
# history = model.fit(
#     {'input_ids': train_encodings['input_ids'], 'attention_mask': train_encodings['attention_mask']},
#     y_train,
#     validation_data=(
#         {'input_ids': val_encodings['input_ids'], 'attention_mask': val_encodings['attention_mask']},
#         y_val,
#     ),
#     epochs=3,
#     batch_size=16
# )

# # Predict on the held-out validation set
# y_pred = model.predict({'input_ids': val_encodings['input_ids'], 'attention_mask': val_encodings['attention_mask']})
# y_pred_classes = tf.argmax(y_pred.logits, axis=1).numpy()

# # Evaluate the model
# print("Classification Report:")
# print(classification_report(y_val, y_pred_classes, target_names=label_encoder.classes_.astype(str)))

# print("Accuracy Score:")
# print(accuracy_score(y_val, y_pred_classes))

In [28]:
# Additional text-classification models: Extra Trees, Stochastic Gradient Descent (SGD), and a simple Recurrent Neural Network (RNN) using Keras.

# Extra Trees Model Tuning:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an Extra Trees classifier
et_model = ExtraTreesClassifier(n_estimators=100, random_state=42)
et_model.fit(X_train, y_train)

# Predict on the test set
y_pred = et_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      5937
           1       0.85      0.56      0.68       456

    accuracy                           0.96      6393
   macro avg       0.91      0.78      0.83      6393
weighted avg       0.96      0.96      0.96      6393

Accuracy Score:
0.9615204129516659
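
Each model cell repeats the same vectorize, split, fit, and report steps, and fits the TF-IDF vocabulary on all rows before splitting. A minimal sketch of one way to factor this out is shown below: a hypothetical helper, `evaluate_model`, splits first (stratified by label) and wraps the vectorizer and classifier in a scikit-learn pipeline so that TF-IDF is fitted on the training part only. The toy texts and labels are made up and only stand in for `train_data['cleaned_text']` and `train_data['label']`.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def evaluate_model(texts, labels, classifier, test_size=0.2, random_state=42):
    """Split first, then fit TF-IDF + classifier on the training part only."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state, stratify=labels
    )
    pipeline = make_pipeline(TfidfVectorizer(max_features=5000), classifier)
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print(classification_report(y_test, y_pred, zero_division=0))
    return pipeline, accuracy_score(y_test, y_pred)


# Made-up stand-in for train_data['cleaned_text'] and train_data['label'].
toy_texts = [
    "love this sunny day", "happy birthday friend", "great music tonight",
    "thanks for the help", "such a lovely family", "hate speech example here",
    "awful hateful words", "good morning world", "best coffee ever",
    "hateful rant again",
] * 3
toy_labels = [0, 0, 0, 0, 0, 1, 1, 0, 0, 1] * 3

model, acc = evaluate_model(
    toy_texts, toy_labels, ExtraTreesClassifier(n_estimators=100, random_state=42)
)
print(f"Accuracy: {acc:.3f}")
```

On the real data the call would look like `evaluate_model(train_data['cleaned_text'], train_data['label'], ExtraTreesClassifier(n_estimators=100, random_state=42))`. Scores obtained this way may differ slightly from the ones printed above, since the split is stratified and the vocabulary no longer sees the held-out rows.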


In [30]:
#Stochastic Gradient Descent (SGD) Model Tuning:

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Stochastic Gradient Descent classifier
sgd_model = SGDClassifier(random_state=42)
sgd_model.fit(X_train, y_train)

# Predict on the test set
y_pred = sgd_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      5937
           1       0.90      0.35      0.50       456

    accuracy                           0.95      6393
   macro avg       0.93      0.67      0.74      6393
weighted avg       0.95      0.95      0.94      6393

Accuracy Score:
0.9508837791334271
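
The SGD report shows high precision (0.90) but low recall (0.35) on class 1, which is typical when one class makes up only about 7% of the data. One common option, untested on this dataset, is `class_weight='balanced'`, which reweights the loss inversely to class frequency. The sketch below compares the default and balanced settings on synthetic data generated with roughly the same class ratio; the data, the `recalls` dictionary, and the feature count are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the same 93/7 class ratio as the tweets.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.93, 0.07], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

recalls = {}
for name, class_weight in [("default", None), ("balanced", "balanced")]:
    clf = SGDClassifier(class_weight=class_weight, random_state=42)
    clf.fit(X_train, y_train)
    # Recall of the minority class (label 1) on the held-out part.
    recalls[name] = recall_score(y_test, clf.predict(X_test))
    print(f"{name:>8}: class 1 recall = {recalls[name]:.2f}")
```

Reweighting usually trades some precision for recall, so both columns of the classification report should be checked after the change.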


In [33]:
#Simple Recurrent Neural Network (RNN) Model Tuning:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, SpatialDropout1D
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

# Preprocess the text data for the model
tokenizer = Tokenizer(num_words=5000, lower=True)
tokenizer.fit_on_texts(train_data['cleaned_text'])
X_train = tokenizer.texts_to_sequences(train_data['cleaned_text'])
X_test = tokenizer.texts_to_sequences(test_data['cleaned_text'])

X_train = pad_sequences(X_train, maxlen=100)
X_test = pad_sequences(X_test, maxlen=100)

# Encode the labels (the number of classes is taken from the data; train.csv has labels 0 and 1)
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_data['label'])
num_classes = len(label_encoder.classes_)
y_train = to_categorical(y_train, num_classes=num_classes)

# Split the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Build the Simple RNN model
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=100))
model.add(SpatialDropout1D(0.2))
model.add(SimpleRNN(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), verbose=1)

# Predict on the validation set
y_val_pred = model.predict(X_val)
y_val_pred_classes = np.argmax(y_val_pred, axis=1)
y_val_classes = np.argmax(y_val, axis=1)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_val_classes, y_val_pred_classes, target_names=label_encoder.classes_.astype(str)))

print("Accuracy Score:")
print(accuracy_score(y_val_classes, y_val_pred_classes))

Epoch 1/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 9s 18ms/step - accuracy: 0.8160 - loss: 0.4640 - val_accuracy: 0.9337 - val_loss: 0.2094
Epoch 2/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 7s 17ms/step - accuracy: 0.9400 - loss: 0.1829 - val_accuracy: 0.9476 - val_loss: 0.1696
Epoch 3/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 7s 17ms/step - accuracy: 0.9584 - loss: 0.1175 - val_accuracy: 0.9503 - val_loss: 0.1528
Epoch 4/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 7s 17ms/step - accuracy: 0.9700 - loss: 0.0836 - val_accuracy: 0.9520 - val_loss: 0.1866
Epoch 5/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 7s 17ms/step - accuracy: 0.9798 - loss: 0.0605 - val_accuracy: 0.9520 - val_loss: 0.1857
200/200 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.97    
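
In the training log above, the validation loss reaches its minimum at epoch 3 and rises afterwards while the training loss keeps falling, which suggests the last two epochs mostly overfit. The sketch below applies a simple patience rule to the logged `val_loss` values to show where training would have stopped. The function name `early_stopping_epoch` is hypothetical; in Keras the same idea is available as the `tf.keras.callbacks.EarlyStopping` callback (for example with `monitor='val_loss'` and `restore_best_weights=True`), passed to `model.fit` through `callbacks`.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (best_epoch, stop_epoch), both 1-based, for a patience rule on val_loss."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# val_loss values copied from the five epochs logged above.
rnn_val_losses = [0.2094, 0.1696, 0.1528, 0.1866, 0.1857]

best_epoch, stop_epoch = early_stopping_epoch(rnn_val_losses, patience=1)
print(best_epoch, stop_epoch)  # 3 4
```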

In [34]:
# Additional models: Gradient Boosting Machines (GBM), a Voting Classifier, and a simple Multi-Layer Perceptron (MLP) using Keras.

# Gradient Boosting Machines (GBM) Model Tuning:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Gradient Boosting classifier
gbm_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gbm_model.fit(X_train, y_train)

# Predict on the test set
y_pred = gbm_model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      5937
           1       0.88      0.26      0.40       456

    accuracy                           0.94      6393
   macro avg       0.91      0.63      0.69      6393
weighted avg       0.94      0.94      0.93      6393

Accuracy Score:
0.9447833567964962
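
Gradient Boosting has the lowest class 1 recall so far (0.26) despite high precision (0.88). `predict` uses a 0.5 probability cutoff, and lowering that cutoff is one way to trade precision for recall. The sketch below shows the mechanics on made-up labels and scores; the helper name `precision_recall_at` and the numbers are illustrative assumptions, not results from this dataset.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall for class 1 when predicting 1 if score >= threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and class 1 probabilities, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.35, 0.40, 0.55, 0.90]

for threshold in (0.5, 0.3):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.67, recall=0.50
# threshold=0.3: precision=0.57, recall=1.00
```

With the fitted model, the scores would come from `gbm_model.predict_proba(X_test)[:, 1]`. The threshold should be chosen on a validation split rather than on the data used for the final report.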


In [35]:
#Voting Classifier Model Tuning:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_data['cleaned_text'])

# Labels
y = train_data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base models
log_clf = LogisticRegression(max_iter=1000, random_state=42)
nb_clf = MultinomialNB()
svm_clf = SVC(kernel='linear', probability=True, random_state=42)

# Combine base models into a voting classifier
voting_clf = VotingClassifier(estimators=[
    ('lr', log_clf),
    ('nb', nb_clf),
    ('svm', svm_clf)
], voting='soft')

# Train the voting classifier
voting_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = voting_clf.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      5937
           1       0.90      0.42      0.57       456

    accuracy                           0.96      6393
   macro avg       0.93      0.71      0.77      6393
weighted avg       0.95      0.96      0.95      6393

Accuracy Score:
0.9552635695291726
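
With `voting='soft'` and no weights, the ensemble averages the class probabilities of its base models and predicts the class with the highest average, which is why `SVC` needs `probability=True`. The sketch below contrasts this with hard (majority) voting for a single tweet. The probabilities are hypothetical and chosen so that the two rules disagree.

```python
from collections import Counter

# Hypothetical [P(class 0), P(class 1)] from three base models for one tweet.
probabilities = {
    "lr": [0.60, 0.40],
    "nb": [0.70, 0.30],
    "svm": [0.10, 0.90],
}

# Soft voting: average the probabilities per class, then take the largest.
n_models = len(probabilities)
mean_probs = [sum(p[c] for p in probabilities.values()) / n_models for c in range(2)]
soft_vote = mean_probs.index(max(mean_probs))

# Hard voting: each model casts one vote for its most likely class.
votes = [p.index(max(p)) for p in probabilities.values()]
hard_vote = Counter(votes).most_common(1)[0][0]

print([round(m, 3) for m in mean_probs])  # [0.467, 0.533]
print(soft_vote, hard_vote)               # 1 0
```

One confident model can outweigh two lukewarm ones under soft voting, so the result depends on how well calibrated the base models' probabilities are.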


In [36]:
#Multi-Layer Perceptron (MLP) Model Tuning:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Embedding
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

# Preprocess the text data for the model
tokenizer = Tokenizer(num_words=5000, lower=True)
tokenizer.fit_on_texts(train_data['cleaned_text'])
X_train = tokenizer.texts_to_sequences(train_data['cleaned_text'])
X_test = tokenizer.texts_to_sequences(test_data['cleaned_text'])

X_train = pad_sequences(X_train, maxlen=100)
X_test = pad_sequences(X_test, maxlen=100)

# Encode the labels (the number of classes is taken from the data; train.csv has labels 0 and 1)
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_data['label'])
num_classes = len(label_encoder.classes_)
y_train = to_categorical(y_train, num_classes=num_classes)

# Split the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Build the Multi-Layer Perceptron (MLP) model
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=100))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), verbose=1)

# Predict on the validation set
y_val_pred = model.predict(X_val)
y_val_pred_classes = np.argmax(y_val_pred, axis=1)
y_val_classes = np.argmax(y_val, axis=1)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_val_classes, y_val_pred_classes, target_names=label_encoder.classes_.astype(str)))

print("Accuracy Score:")
print(accuracy_score(y_val_classes, y_val_pred_classes))



Epoch 1/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 5s 9ms/step - accuracy: 0.9213 - loss: 0.2451 - val_accuracy: 0.9559 - val_loss: 0.1288
Epoch 2/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.9714 - loss: 0.0827 - val_accuracy: 0.9587 - val_loss: 0.1382
Epoch 3/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.9907 - loss: 0.0341 - val_accuracy: 0.9545 - val_loss: 0.1698
Epoch 4/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.9957 - loss: 0.0150 - val_accuracy: 0.9559 - val_loss: 0.2319
Epoch 5/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 3s 9ms/step - accuracy: 0.9983 - loss: 0.0074 - val_accuracy: 0.9517 - val_loss: 0.2659
200/200 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.98      0.97      593
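
The log above shows training accuracy climbing to 0.9983 while the validation loss roughly doubles (0.1288 to 0.2659), a pattern consistent with overfitting. Part of the reason is model size: the sketch below counts the trainable parameters of the two largest layers from the hyperparameters in the cell (vocabulary 5000, embedding size 128, sequence length 100, hidden width 128). The helper names are illustrative, and the small softmax output layer is left out.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim


def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units


vocab_size, embed_dim, seq_len, hidden = 5000, 128, 100, 128

n_embedding = embedding_params(vocab_size, embed_dim)
# Flatten turns the (100, 128) embedding output into 12,800 inputs.
n_hidden = dense_params(seq_len * embed_dim, hidden)

print(f"Embedding:  {n_embedding:,}")  # 640,000
print(f"Dense(128): {n_hidden:,}")     # 1,638,528
```

That is over two million parameters fitted on roughly 25,600 training tweets (400 batches of 64), so stopping after the first epoch or two, or shrinking the flattened input, would be reasonable things to try.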

Choosing the "best" model depends heavily on the specific problem, dataset, and constraints such as computational resources and interpretability requirements. However, here are some general considerations for different scenarios:

## General-Purpose and High Accuracy:
Gradient Boosting Machines (GBM) / XGBoost / LightGBM / CatBoost:
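Why: These models often provide high accuracy on structured data. They can still overfit, so they are usually regularized through the learning rate, tree depth, and early stopping. They are versatile and can handle a variety of data types and structures.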
When: When you need high accuracy and have sufficient computational resources. These models are particularly effective for structured/tabular data.
Applications: Financial modeling, web search ranking, healthcare, fraud detection.

## Deep Learning for Sequential Data:
LSTM / Bidirectional LSTM / Transformer-based Models (BERT):
Why: These models are excellent for handling sequential data and understanding context in natural language processing (NLP) tasks.
When: When dealing with time series data, text data, or any sequential data where context is important.
Applications: Speech recognition, language modeling, text classification, question answering, language translation.

## Simplicity and Interpretability:
Logistic Regression / Decision Tree / Naive Bayes:
Why: These models are simple, interpretable, and quick to train. They work well for smaller datasets and provide insights into feature importance.
When: When interpretability is crucial, and you need a quick, simple model for binary or multiclass classification.
Applications: Credit scoring, medical diagnosis, spam detection, document classification.

## Ensemble Methods:
Random Forest / Extra Trees / Voting Classifier:
Why: These models combine multiple base models to improve accuracy and robustness. They are less prone to overfitting compared to single models.
When: When you need a robust model that can handle large datasets and reduce overfitting.
Applications: Fraud detection, recommendation systems, healthcare.

## Efficiency and Scalability:
Stochastic Gradient Descent (SGD) / Ridge Classifier:
Why: These models are efficient and can handle large datasets and online learning scenarios.
When: When you need a fast and scalable model for large datasets or real-time applications.
Applications: Text classification, image recognition, recommendation systems.

## Conclusion:
Best All-Rounder: Gradient Boosting Machines (GBM) / XGBoost / LightGBM / CatBoost are often considered the best all-rounders for structured data due to their high accuracy and robustness.
Best for Sequential Data: Transformer-based Models (BERT) lead in NLP tasks because they model context and handle complex language tasks.
Best for Simplicity and Interpretability: Logistic Regression / Decision Tree are preferred when interpretability and simplicity are key.
On this dataset, accuracy alone is a weak guide, because about 93% of the held-out tweets belong to class 0 (5937 of 6393). Among the six TF-IDF models from LightGBM through the Voting Classifier above, Extra Trees had the highest accuracy (0.96) and the highest class 1 recall (0.56), while Gradient Boosting had the lowest class 1 recall (0.26).
Ultimately, the best model is context-dependent, and it's often beneficial to try multiple models and perform thorough cross-validation to determine which one performs best for your specific use case.
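
As a minimal sketch of that cross-validation step, the code below scores two candidate classifiers with stratified k-fold cross-validation and macro-averaged F1, which weights both classes equally and is therefore less flattering than accuracy on imbalanced labels. The toy texts and labels are made up and only stand in for `train_data['cleaned_text']` and `train_data['label']`; the choice of candidates, `n_splits=3`, and the `cv_scores` dictionary are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-in for train_data['cleaned_text'] and train_data['label'].
texts = [
    "love this sunny day", "happy birthday friend", "great music tonight",
    "thanks for the help", "such a lovely family", "hate speech example here",
    "awful hateful words", "good morning world", "best coffee ever",
    "hateful rant again",
] * 3
labels = [0, 0, 0, 0, 0, 1, 1, 0, 0, 1] * 3

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=42),
    "naive_bayes": MultinomialNB(),
}

cv_scores = {}
for name, clf in candidates.items():
    # The pipeline refits TF-IDF inside each fold, so no fold sees held-out text.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")
    cv_scores[name] = scores.mean()
    print(f"{name}: macro F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```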
