Flow of the Code:

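1. Preprocessing: The positive and negative review texts are combined, lemmatized, and stripped of stop words and non-alphabetic tokens.

2. Labeling: Each review is labeled positive (1), negative (-1), or neutral (0), depending on whether its negative or positive part is empty.

3. Feature Extraction: TF-IDF vectorization converts the cleaned text into numerical features.

4. Imbalance Handling: SMOTE oversamples the minority classes in the training set.

5. Model Training: A Logistic Regression classifier is cross-validated and then trained.

6. Evaluation: Model performance is measured on the test set and logged with MLflow.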

1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import mlflow
import mlflow.sklearn
from imblearn.over_sampling import SMOTE


2. SpaCy Model Loading

In [2]:
# Load SpaCy's English model (disable unnecessary components)
nlp = spacy.load('en_core_web_sm', disable=["parser", "ner"])

3. Loading the Dataset

In [3]:
# Load the dataset
df = pd.read_csv('Hotel_Reviews.csv')

4. Preprocessing Function
(Lemmatization and Stopword Removal)

In [4]:
# Preprocessing function: lemmatize in batches with nlp.pipe and drop
# stop words and non-alphabetic tokens
def preprocess_texts(texts):
    cleaned_texts = []
    for doc in nlp.pipe(texts, batch_size=1000):
        tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
        cleaned_texts.append(' '.join(tokens))
    return cleaned_texts

5. Combining and Preprocessing Reviews


In [5]:
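# Combine reviews and apply preprocessing with batch processing
# Note: the placeholder strings 'No Negative' / 'No Positive' are part of the
# combined text, so tokens from them may end up as features that reveal the label
df['cleaned_review'] = preprocess_texts(df['Positive_Review'] + ' ' + df['Negative_Review'])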


6. Sentiment Labeling


In [6]:
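# Label sentiment from the placeholder strings used for empty review parts
def label_sentiment(row):
    if row['Negative_Review'].strip() == 'No Negative':
        return 1  # Positive
    elif row['Positive_Review'].strip() == 'No Positive':
        return -1  # Negative
    else:
        return 0  # Neutral (both parts contain text)

# Apply the labeling function row by row to create the Sentiment column
df['Sentiment'] = df.apply(label_sentiment, axis=1)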
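
The same labeling rule can also be written with vectorized pandas operations, which tends to be faster than calling a Python function once per row. The sketch below is only an illustration on a small made-up DataFrame (`toy` and its rows are assumptions, not part of the dataset); it relies on `np.select` picking the first matching condition, which mirrors the if/elif order of `label_sentiment`.

```python
import numpy as np
import pandas as pd

# Toy rows using the same placeholder strings as the dataset (made up for illustration)
toy = pd.DataFrame({
    "Positive_Review": ["Great staff", "No Positive", "Nice view", "No Positive"],
    "Negative_Review": ["No Negative", "Dirty room", "Small bathroom", "No Negative"],
})

no_negative = toy["Negative_Review"].str.strip().eq("No Negative")
no_positive = toy["Positive_Review"].str.strip().eq("No Positive")

# np.select takes the first condition that is True, like the if/elif chain
toy["Sentiment"] = np.select([no_negative, no_positive], [1, -1], default=0)
print(toy["Sentiment"].tolist())
```

The last toy row has both placeholders set; it is labeled positive because the "No Negative" check comes first, which is also how the row-wise function behaves.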

7. Class Distribution Check


In [8]:
# Inspect the class distribution before resampling
print("Original class distribution:", df['Sentiment'].value_counts())


Original class distribution: Sentiment
 0    352029
 1    127890
-1     35819
Name: count, dtype: int64
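
To make the degree of imbalance easier to read, the counts printed above can be turned into class shares and a majority-to-minority ratio. This is a small arithmetic sketch that only reuses the three printed counts; the variable names are illustrative.

```python
# Counts copied from the printed class distribution
counts = {0: 352029, 1: 127890, -1: 35819}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"class {label:>2}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```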


8. TF-IDF Vectorization


In [9]:
# Initialize TF-IDF Vectorizer with n-grams
tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=1500, max_df=0.8, min_df=0.01)
X = tfidf.fit_transform(df['cleaned_review'])
y = df['Sentiment']

9. Train-Test Split


In [None]:
# Split the data with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
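
Passing `stratify=y` keeps the class proportions the same in the training and test sets, which matters here because the negative class is small. A minimal sketch on made-up labels (the arrays `X_toy` and `y_toy` and the helper `class_counts` are hypothetical) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with a 70% / 25% / 5% skew
y_toy = np.array([0] * 140 + [1] * 50 + [-1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

def class_counts(labels):
    """Return a {label: count} dict for an array of labels."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), counts.tolist()))

print("train:", class_counts(y_tr))
print("test: ", class_counts(y_te))
```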


10. Handling Class Imbalance (SMOTE)


In [None]:
# Handle class imbalance using SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [None]:
# Shift labels from (-1, 0, 1) to (0, 1, 2) because np.bincount needs non-negative integers
print("Resampled class distribution:", np.bincount(y_train_resampled + 1))

Resampled class distribution: [246420 246420 246420]
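
After resampling, every class has 246420 training samples, which is the size of the majority class in the 70% training split. SMOTE reaches this by creating synthetic minority samples that lie on the line segment between an existing minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation idea with NumPy; it is not imbalanced-learn's implementation, and the points, the helper `synthesize`, and `k=2` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Five hypothetical minority-class points in 2-D
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])

def synthesize(points, n_new, k=2):
    """Create n_new points by interpolating toward one of the k nearest neighbours."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distances from the chosen point to every minority point
        dists = np.linalg.norm(points - points[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # index 0 is the point itself
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + gap * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=10)
print(synthetic.shape)
```

Because each synthetic point is a convex combination of two real points, it always stays inside the region spanned by the existing minority samples.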


11. Logistic Regression Model and Cross-Validation


In [None]:
# Initialize Logistic Regression
model = LogisticRegression(max_iter=200)
# 5-fold stratified cross-validation on the resampled training data, scored by macro-F1
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train_resampled, y_train_resampled, cv=cv, scoring='f1_macro')
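
One caveat with cross-validating on data that was already resampled: synthetic samples derived from a training point can land in the validation fold, so the fold scores may be optimistic. A common alternative is to resample inside each fold, on the training part only. The sketch below shows that loop under stated assumptions: the data comes from `make_classification` rather than the TF-IDF matrix, and a hypothetical helper `oversample` uses plain random oversampling as a stand-in for SMOTE so that the example depends only on scikit-learn and NumPy. imbalanced-learn also provides its own `Pipeline` class that applies a sampler only during fitting, which would be the more direct route with SMOTE.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the TF-IDF features (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=42,
)

def oversample(X, y, seed=42):
    """Hypothetical helper: duplicate rows at random until every class matches the largest."""
    target = np.bincount(y).max()
    parts_X, parts_y = [], []
    for label in np.unique(y):
        X_up = resample(X[y == label], replace=True, n_samples=target, random_state=seed)
        parts_X.append(X_up)
        parts_y.append(np.full(target, label))
    return np.vstack(parts_X), np.concatenate(parts_y)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in cv_demo.split(X_demo, y_demo):
    # Resample the training fold only; the validation fold keeps its real distribution
    X_bal, y_bal = oversample(X_demo[train_idx], y_demo[train_idx])
    clf = LogisticRegression(max_iter=200).fit(X_bal, y_bal)
    y_val_pred = clf.predict(X_demo[val_idx])
    fold_scores.append(f1_score(y_demo[val_idx], y_val_pred, average="macro"))

print("macro-F1 per fold:", np.round(fold_scores, 3))
```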

12. Model Training and Evaluation with MLflow


In [None]:
# Start MLflow run
with mlflow.start_run():
    model.fit(X_train_resampled, y_train_resampled)
    
    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate model performance
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, digits=4)
    cm = confusion_matrix(y_test, y_pred)
    
    print("Classification Report:")
    print(report)
    print("Confusion Matrix:")
    print(cm)
    print("Accuracy:", accuracy)

    # Log parameters, metrics, and models to MLflow
    mlflow.log_param("max_iter", 200)
    mlflow.log_param("resampling", "SMOTE")
    mlflow.log_metric("accuracy", accuracy)
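    # Note: this is the mean cross-validated macro-F1 on the resampled training data,
    # not the macro-F1 on the test set
    mlflow.log_metric("f1_macro", scores.mean())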
    mlflow.sklearn.log_model(model, "model")
    mlflow.sklearn.log_model(tfidf, "tfidf_vectorizer")

Classification Report:
              precision    recall  f1-score   support

          -1     0.9873    0.9963    0.9918     10746
           0     0.9986    0.9938    0.9962    105609
           1     0.9865    0.9971    0.9918     38367

    accuracy                         0.9948    154722
   macro avg     0.9908    0.9957    0.9932    154722
weighted avg     0.9948    0.9948    0.9948    154722

Confusion Matrix:
[[ 10706     35      5]
 [   138 104952    519]
 [     0    110  38257]]
Accuracy: 0.9947841935859154
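
The figures in the classification report can be recomputed directly from the confusion matrix, which is a useful sanity check when reading the output. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, ordered here as -1, 0, 1. The sketch below only reuses the matrix printed above.

```python
import numpy as np

# Confusion matrix copied from the printed output; rows = true labels, columns = predictions
cm_printed = np.array([
    [10706, 35, 5],
    [138, 104952, 519],
    [0, 110, 38257],
])
labels = [-1, 0, 1]

recall = cm_printed.diagonal() / cm_printed.sum(axis=1)      # correct / actual, per class
precision = cm_printed.diagonal() / cm_printed.sum(axis=0)   # correct / predicted, per class
accuracy_check = cm_printed.diagonal().sum() / cm_printed.sum()

for label, p, r in zip(labels, precision, recall):
    print(f"class {label:>2}: precision={p:.4f} recall={r:.4f}")
print(f"accuracy={accuracy_check:.4f}")
```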


