# Logistic Regression

In this notebook we will present the Logistic Regression model and compare how the different vectorization methods perform with it.

First of all, we load our preprocessed dataset and do all the different vectorizations.


In [1]:
# Data handling
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

import multiprocessing


from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

In [2]:
df = pd.read_csv("../DATASETS/preprocessed_text.csv")

In [3]:
df.isnull().sum()
df.fillna('', inplace=True)
df.head()

Unnamed: 0,content,score,content_cleaned
0,Plsssss stoppppp giving screen limit like when...,2,plss stopp giving screen limit like when you a...
1,Good,5,good
2,👍👍,5,thumbs_up
3,Good,3,good
4,"App is useful to certain phone brand ,,,,it is...",1,app is useful to certain phone brand it is not...


In [4]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the model and transform the data
bow = vectorizer.fit_transform(df['content_cleaned'])

print(len(vectorizer.vocabulary_))
print(bow.shape)

39783
(113292, 39783)


### Preparing the labels 

In [5]:
y_df = df[['score']]

y_scaler = MinMaxScaler(feature_range=(1, 5))
y = y_scaler.fit_transform(y_df)

### Performing the train-test splits for the different vectors

In [6]:
bow_train, bow_test, y_train, y_test = train_test_split(bow, y, test_size=0.2, random_state=42)

## Models

Logistic Regression requires scaling of the data, otherwise it is harder for it to converge. Therefore, we use a StandardScaler to scale our data, we train using the training set and evaluate our models using the test set.

In [7]:
# Scale the data
scaler = StandardScaler(with_mean=False)  # with_mean=False because BoW has sparse matrix format

bow_train_scaled = scaler.fit_transform(bow_train)
bow_test_scaled = scaler.transform(bow_test)


In [8]:
# Logistic Regression model with L2 regularization
model = Ridge()
model.fit(bow_train_scaled, y_train)

In [9]:
# Predictions
y_pred = model.predict(bow_test_scaled)

In [10]:
# Step 7: Inverse transform the predictions and actual test values
y_pred = y_scaler.inverse_transform(y_pred.reshape(-1, 1)).flatten()
y_test = y_scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()

In [11]:
y_pred

array([-0.81484475,  2.3423072 ,  5.39278123, ...,  1.51370284,
        3.72792163,  3.82034695])

In [12]:
y_test

array([1., 1., 5., ..., 1., 1., 4.])

In [13]:
# Step 8: Clip predictions to stay within the 1-5 range
y_pred_original_clipped = np.clip(y_pred, 1, 5)

In [14]:
y_pred_original_clipped

array([1.        , 2.3423072 , 5.        , ..., 1.51370284, 3.72792163,
       3.82034695])

In [15]:
# Step 9: Evaluate the model using mean squared error
mse = mean_squared_error(y_test, y_pred_original_clipped)
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 1.5188466468997632


In [23]:
# Step 10: Print some prediction examples along with review content
print("\nExample predictions:")
for i in range(10, 30):
    print(f"Review {i+1}:")
    print(f"Content: {df['content_cleaned'].iloc[i]}")
    print(f"Predicted score = {y_pred_original_clipped[i]:.2f}, Actual score = {y_test[i]:.2f}\n")


Example predictions:
Review 11:
Content: ok
Predicted score = 2.95, Actual score = 5.00

Review 12:
Content: worst customer service very scripted refuse to take responsibility or provide any compensation when their streaming services fail on a device clearly the management is lacking in good business principles and want to make a cheap buck
Predicted score = 3.96, Actual score = 4.00

Review 13:
Content: login keeps failing
Predicted score = 1.73, Actual score = 1.00

Review 14:
Content: jjonnm
Predicted score = 5.00, Actual score = 5.00

Review 15:
Content: maybe i share my experience to how use the netflix it because so beautiful but very hard how to create an account but solid haha
Predicted score = 2.78, Actual score = 1.00

Review 16:
Content: for an app who is sole purpose is to stream entertainment it is absolutely rubbish when casting or the like some of the content is ok but it fails as a versitile streaming app
Predicted score = 1.37, Actual score = 1.00

Review 17:
Content:

In [16]:
# Saving the model
import joblib

# Save the vectorizer
joblib.dump(vectorizer, 'vectorizer.pkl')

# Save the model and scalers as well
joblib.dump(model, 'model.pkl')
joblib.dump(scaler, 'maxabs_scaler.pkl')
joblib.dump(y_scaler, 'minmax_scaler.pkl')

print("Model and scalers saved successfully.")

Model and scalers saved successfully.


## Results

We notice that the frequency-based vectorizing methods fall behind the context-based ones. Moreover, the Word2Vec vectors perform better than the GloVe ones, with our custom Word2Vec CBoW model giving the best accuracy of 79%.

## Pros and Cons of the Logistic Regression model

### Pros:
- **Simplicity**: Logistic Regression is straightforward to understand and interpret.
- **Efficiency**: Computationally efficient and fast, compared to more complex models, like neural networks and ensembles.
- **Feature Importance**: The coefficients of the Logistic Regression model can give insights to the impact of different features. In NLP that means that words or semantics that affect greatly the output class will have bigger coefficients.

### Cons:
- **Assumption of Linearity**: The model assumes a linear relationship between the features and the outcome, performing poorly when complex relationships exist and the decision boundaries are not linear.
- **Feature Scaling Required**: Features need to be scaled beforehand for optimal performance.
- **Outlier Sensitivity**: Logistic Regression is sensitive to outliers, which can unfairly affect greatly the model parameters.