This notebook contains the code used to explore and evaluate Logistic Regression, SVC, Random Forest, and an ensemble model.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import nltk
from scipy import sparse
import os
from scipy import io

In [2]:
path= "labeled_data.csv"
original=pd.read_csv(path)

In [7]:
data = original.copy()
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
encoder = LabelEncoder()
data["Overall Sentiment"] = encoder.fit_transform(data["Overall Sentiment"])

In [8]:
X = data["After_lemmatization"]
Y = data["Overall Sentiment"]

The classes are imbalanced, so they need to be balanced before training.

In [10]:
Y.value_counts()

2    334196
1     93398
0     72406
Name: Overall Sentiment, dtype: int64
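To put the counts above in proportion, the short sketch below computes each class's share and the majority-to-minority ratio. The numbers are copied from the value_counts() output; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
# (keys are the LabelEncoder codes).
counts = {2: 334196, 1: 93398, 0: 72406}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"class {label}: {share:.1%} of {total} rows")

# How many majority-class rows exist per minority-class row.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

The largest class is several times the size of the smallest, which is why a resampling step follows.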

In [11]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/sajangurung/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

The following CountVectorizer converts the text data into a sparse matrix.

In [12]:
bow_counts = CountVectorizer(tokenizer= word_tokenize, # type of tokenization
                             ngram_range=(1,3)) # number of n-grams
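To illustrate what ngram_range=(1,3) means, here is a minimal pure-Python sketch that lists the unigrams, bigrams, and trigrams of an already tokenized sentence. The helper word_ngrams is hypothetical and only enumerates the terms; CountVectorizer additionally builds the vocabulary and counts occurrences per document.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all contiguous word n-grams for every n in the inclusive range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["not", "very", "good"]  # stand-in for word_tokenize output
print(word_ngrams(tokens))
# ['not', 'very', 'good', 'not very', 'very good', 'not very good']
```

Each distinct n-gram becomes one column of the matrix, so a wide n-gram range grows the vocabulary quickly on a large corpus.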

Transforming the features into a sparse matrix using CountVectorizer.

In [13]:
X_bow = bow_counts.fit_transform(X)



In [14]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
rus = RandomUnderSampler(random_state=42)
smote = SMOTE(random_state=42)
X_res, Y_res = smote.fit_resample(X_bow, Y)

In [16]:
Y_res.value_counts()

2    334196
1    334196
0    334196
Name: Overall Sentiment, dtype: int64

Splitting the data into 90% training and 10% test sets, since the dataset is large.

In [15]:
from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X_res, Y_res, test_size=0.1, random_state=42)

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

The following function evaluates a model.

In [19]:
def evaluate_model(model, x_train, y_train, x_test, y_test, name):
    """Fit the model, predict on the test set, and print a classification report."""
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)

    # Calculate per-class metrics for the model: precision, recall, F1-score
    report = classification_report(y_test, y_pred)

    print("Report for ", name)
    print(report)
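For readers unfamiliar with the columns of the classification report, the sketch below computes precision, recall, F1-score, and support for one class by hand on a toy example. The helper per_class_metrics and the toy labels are hypothetical; returning 0.0 when a metric is undefined is a simplification chosen for this sketch.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, F1 and support for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
for label in (0, 1, 2):
    p, r, f1, s = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={s}")
```

The "macro avg" row of the report is the unweighted mean of these per-class values, while "weighted avg" weights each class by its support.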

Running Logistic Regression model with default parameters.

In [None]:
best_model = LogisticRegression()

In [18]:
evaluate_model(best_model, X_train, y_train, X_test, y_test, "Logistic Regression")

Report for  Logistic Regression
              precision    recall  f1-score   support

           0       0.83      0.77      0.80      1432
           1       0.78      0.85      0.81      1436
           2       0.81      0.80      0.80      1437

    accuracy                           0.80      4305
   macro avg       0.80      0.80      0.80      4305
weighted avg       0.80      0.80      0.80      4305



The following code performs a grid search for Logistic Regression.

In [17]:
from sklearn.model_selection import train_test_split, GridSearchCV

# Define the logistic regression model
logreg_model = LogisticRegression()

# Define hyperparameters to tune
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'solver': ['liblinear', 'saga']
}

# Create GridSearchCV instance with 10-fold cross-validation
grid_search = GridSearchCV(logreg_model, param_grid, cv=10, scoring='accuracy', n_jobs=-1)

# Perform grid search on the data
grid_search.fit(X_train, y_train)

# Display the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Access the best model directly (optional)
best_model = grid_search.best_estimator_

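# Evaluate the best model using cross-validation.
# NOTE: as run, this step raised the NameError shown below, because
# cross_val_score was never imported. `y` is also undefined here;
# the labels that match X_train are y_train.
cv_results = cross_val_score(best_model, X_train, y, cv=10, scoring='accuracy')
print("Cross-Validation Accuracy: {:.2f} (+/- {:.2f})".format(cv_results.mean(), cv_results.std() * 2))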

evaluate_model(best_model, X_train, y_train, X_test, y_test, "Logistic Regression")



Best Hyperparameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}


NameError: name 'cross_val_score' is not defined

The code above took 16 hours to run. The grid search itself finished and reported the best hyperparameters, but the cross-validation step that followed it failed with a NameError. Due to time constraints the search was not rerun; the reported hyperparameters were used directly to train Logistic Regression.

Best Hyperparameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
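For reference, a minimal sketch of how the step after the search could be written so that it does not fail: cross_val_score is imported and the labels passed to it match the training features. The synthetic data, the tiny grid, and all demo_ names are assumptions made so that the sketch runs in seconds; they are not the notebook's data or search space.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Small synthetic stand-in for the notebook's features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

# Deliberately tiny grid over C only, to keep the run short.
demo_grid = {"C": [0.1, 1, 10]}
demo_search = GridSearchCV(LogisticRegression(max_iter=1000), demo_grid, cv=3, scoring="accuracy")
demo_search.fit(X_tr, y_tr)
print("Best Hyperparameters:", demo_search.best_params_)

# cross_val_score is imported above, and y_tr matches X_tr row for row.
cv_scores = cross_val_score(demo_search.best_estimator_, X_tr, y_tr, cv=3, scoring="accuracy")
print("Cross-Validation Accuracy: {:.2f} (+/- {:.2f})".format(cv_scores.mean(), cv_scores.std() * 2))
```

Placing every import in the first cell, and checking the code that follows the search on a small grid first, would avoid losing a long run to a NameError at the very end.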

In [29]:
best_model = LogisticRegression(C= 10, penalty= 'l1', solver= 'liblinear')
evaluate_model(best_model, X_train, y_train, X_test, y_test, "Logistic Regression")

Report for  Logistic Regression
              precision    recall  f1-score   support

           0       0.78      0.89      0.83      6757
           1       0.86      0.77      0.81      6567
           2       0.91      0.89      0.90      6775

    accuracy                           0.85     20099
   macro avg       0.85      0.85      0.85     20099
weighted avg       0.85      0.85      0.85     20099



Evaluating SVC with default parameters.

In [27]:
from sklearn.svm import SVC

model = SVC()

evaluate_model(model, X_train, y_train, X_test, y_test, "SVC")

Report for  SVC
              precision    recall  f1-score   support

           0       0.73      0.81      0.77      6757
           1       0.86      0.69      0.77      6567
           2       0.83      0.91      0.87      6775

    accuracy                           0.80     20099
   macro avg       0.81      0.80      0.80     20099
weighted avg       0.81      0.80      0.80     20099



Grid searching hyperparameters for SVC

In [32]:
# Define the SVM model
svm_model = SVC()

# Define hyperparameters to tune
param_grid = {
    'C': [0.1, 1],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

# Create GridSearchCV instance with 10-fold cross-validation
grid_search = GridSearchCV(svm_model, param_grid, cv=10, scoring='accuracy', n_jobs=-1)

# Perform grid search on the data
grid_search.fit(X_train, y_train)

# Display the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

best_model = grid_search.best_estimator_

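# Predict using the best model.
# NOTE: as run, this step raised the NameError shown below: X_scaled is not
# defined anywhere in this notebook. accuracy_score is also never imported,
# and `y` is undefined.
y_pred = best_model.predict(X_scaled)

# Evaluate the best model using accuracy
accuracy = accuracy_score(y, y_pred)
print("Accuracy: {:.2f}".format(accuracy))
evaluate_model(best_model, X_train, y_train, X_test, y_test, "SVC")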

Best Hyperparameters: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}


NameError: name 'X_scaled' is not defined

The grid search completed, but an error occurred after the hyperparameters were found. The run took 17 hours.
Best Hyperparameters: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
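A hedged sketch of how the prediction and accuracy step could look once the undefined names are replaced: accuracy_score is imported, and the best estimator is scored on the held-out test split. The synthetic data, the reduced grid, and the demo_ names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Small synthetic stand-in for the notebook's features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

# Reduced grid so the sketch finishes quickly.
demo_grid = {"C": [0.1, 1], "kernel": ["linear", "rbf"]}
demo_search = GridSearchCV(SVC(), demo_grid, cv=3, scoring="accuracy")
demo_search.fit(X_tr, y_tr)
print("Best Hyperparameters:", demo_search.best_params_)

# Predict on the held-out split and compare against its own labels.
best_svc = demo_search.best_estimator_
y_pred = best_svc.predict(X_te)
accuracy = accuracy_score(y_te, y_pred)
print("Accuracy: {:.2f}".format(accuracy))
```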

Evaluating SVC with best hyperparameters.

In [33]:
evaluate_model(best_model, X_train, y_train, X_test, y_test, "SVC")

Report for  SVC
              precision    recall  f1-score   support

           0       0.75      0.87      0.80      6757
           1       0.85      0.75      0.79      6567
           2       0.90      0.87      0.88      6775

    accuracy                           0.83     20099
   macro avg       0.83      0.83      0.83     20099
weighted avg       0.83      0.83      0.83     20099



In [58]:
X_original = original["After_lemmatization"]
Y_original = original["Overall Sentiment"]

In [59]:
X_original_bow = bow_counts.fit_transform(X_original)



In [60]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
x_resampled, y_resampled = rus.fit_resample(X_original_bow, Y_original)

In [61]:
y_resampled.value_counts()

Negative    72406
Neutral     72406
Positive    72406
Name: Overall Sentiment, dtype: int64

In [62]:
X_train, X_test, y_train, y_test = train_test_split(x_resampled, y_resampled, test_size=0.1, random_state=42)

In [63]:
best_model_LR = LogisticRegression(C= 10, penalty= 'l1', solver= 'liblinear')
evaluate_model(best_model_LR, X_train, y_train, X_test, y_test, "Logistic Regression")



Report for  Logistic Regression
              precision    recall  f1-score   support

    Negative       0.87      0.84      0.86      7170
     Neutral       0.84      0.87      0.86      7213
    Positive       0.88      0.87      0.87      7339

    accuracy                           0.86     21722
   macro avg       0.86      0.86      0.86     21722
weighted avg       0.86      0.86      0.86     21722



In [66]:
# Best hyperparameters from the grid search: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}

best_model_SVC = SVC(C=1, gamma='scale', kernel='linear')

evaluate_model(best_model_SVC, X_train, y_train, X_test, y_test, "SVC")

Report for  SVC
              precision    recall  f1-score   support

    Negative       0.85      0.85      0.85      7170
     Neutral       0.85      0.88      0.86      7213
    Positive       0.86      0.85      0.86      7339

    accuracy                           0.86     21722
   macro avg       0.86      0.86      0.86     21722
weighted avg       0.86      0.86      0.86     21722



Performing a randomized search on RandomForestClassifier.

In [18]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the hyperparameter search space
param_dist = {
    'n_estimators': [50, 100, 150, 200],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Create a Random Forest classifier
rf = RandomForestClassifier()

# Random search with cross-validation
random_search = RandomizedSearchCV(
    rf, param_distributions=param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42
)

# Fit the random search to the data
random_search.fit(X_train, y_train)

# Print the best parameters
print("Best Parameters:", random_search.best_params_)

# Evaluate the model with the best parameters on the test set
best_model = random_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)

5 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/site-packages/sklearn/base.py", line 1145, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.11/site-packages/sklearn/base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn

Best Parameters: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Test Accuracy: 0.7203343449922882
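Five failed fits out of 50 is consistent with one sampled candidate failing on all five folds. One likely cause, stated here as an assumption because the traceback above is truncated, is the 'auto' option for max_features, which recent scikit-learn releases no longer accept for RandomForestClassifier. The sketch below runs the same kind of search without that option and with error_score='raise', so that any invalid candidate stops the search immediately. The synthetic data, the reduced search space, and the demo_ names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic stand-in for the notebook's training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)

# Reduced search space without 'auto' for max_features.
demo_param_dist = {
    "n_estimators": [10, 20],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False],
}

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=demo_param_dist,
    n_iter=4,
    cv=3,
    scoring="accuracy",
    random_state=42,
    error_score="raise",  # fail fast instead of recording NaN scores
)
demo_search.fit(X_demo, y_demo)
print("Best Parameters:", demo_search.best_params_)
```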


Evaluating Random Forest with the best hyperparameters.

In [20]:
from sklearn.metrics import classification_report

evaluate_model(best_model, X_train, y_train, X_test, y_test, "Random Forest")

Report for  Random Forest
              precision    recall  f1-score   support

           0       0.90      0.40      0.55      6757
           1       0.64      0.89      0.74      6567
           2       0.76      0.88      0.81      6775

    accuracy                           0.72     20099
   macro avg       0.76      0.72      0.70     20099
weighted avg       0.76      0.72      0.70     20099



Creating an ensemble from the three models above. A VotingClassifier combines the models, each configured with its best hyperparameters. Hard voting is used because SVC, with its default probability=False, does not expose the class probabilities that soft voting requires.

In [27]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

best_model_SVC = SVC(C=1, gamma='scale', kernel='linear')
best_model_LR = LogisticRegression(C=10, penalty='l1', solver='liblinear')

# Ensemble model: majority (hard) vote over the three tuned classifiers.
# best_model is the Random Forest selected by the randomized search above.
ensemble_classifier = VotingClassifier(
    estimators=[
        ("lr", best_model_LR),
        ("svc", best_model_SVC),
        ("rf", best_model),
    ],
    n_jobs=-1,
    voting="hard",
)
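To make hard voting concrete, the pure-Python sketch below takes the predicted labels of three models and returns the majority label for each sample. The helper hard_vote and the toy predictions are hypothetical. Ties are broken in favour of the smallest label, which is meant to mirror the ascending-order tie rule described in the scikit-learn documentation.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote per sample; ties go to the smallest label."""
    voted = []
    # zip(*...) regroups the predictions so each tuple holds one sample's votes.
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


lr_pred = [0, 1, 2, 2]   # toy Logistic Regression predictions
svc_pred = [0, 2, 2, 1]  # toy SVC predictions
rf_pred = [1, 1, 2, 0]   # toy Random Forest predictions
print(hard_vote([lr_pred, svc_pred, rf_pred]))
# [0, 1, 2, 0]
```

Only predicted labels are needed for this scheme, so no model has to provide probabilities.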

Evaluating above ensemble model.

In [28]:
evaluate_model(ensemble_classifier, X_train, y_train, X_test, y_test, "Ensemble")

Report for  Ensemble
              precision    recall  f1-score   support

           0       0.78      0.87      0.82      6757
           1       0.87      0.75      0.81      6567
           2       0.89      0.91      0.90      6775

    accuracy                           0.85     20099
   macro avg       0.85      0.84      0.84     20099
weighted avg       0.85      0.85      0.84     20099

