# Linear Regression

In this notebook we present the Linear Regression model with L2 regularization (Ridge Regression). For simplicity, we train and evaluate it using only Bag-of-Words (BoW) vectorization.

Linear Regression is one of the best-known and most fundamental regression algorithms. Ridge Regression is Linear Regression with L2 regularization added to the fitting process: the loss function is the squared error plus the sum of the squared weights, multiplied by a regularization strength. Large coefficients therefore increase the loss and are penalised, which helps to reduce overfitting.
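
The loss described above can be written out in a few lines of NumPy. This is only an illustrative sketch: `ridge_loss`, `alpha` and the toy arrays are names and values chosen here, not part of the notebook. scikit-learn's `Ridge` uses the sum of squared residuals rather than their mean, which only changes the effective scale of the regularization strength.

```python
import numpy as np

def ridge_loss(X, y, w, b, alpha):
    """Mean squared error plus alpha times the sum of squared weights."""
    residuals = X @ w + b - y
    mse = np.mean(residuals ** 2)
    l2_penalty = alpha * np.sum(w ** 2)  # the intercept b is not penalised
    return mse + l2_penalty

# Toy data (hypothetical values, only to show the effect of the penalty)
X_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
y_toy = np.array([1.0, 2.0])
w_toy = np.array([1.0, 1.0])

loss_without_penalty = ridge_loss(X_toy, y_toy, w_toy, b=0.0, alpha=0.0)
loss_with_penalty = ridge_loss(X_toy, y_toy, w_toy, b=0.0, alpha=0.1)
print(loss_without_penalty, loss_with_penalty)
```

With the same weights, the penalised loss is larger by `alpha` times the sum of squared weights, so the optimiser is pushed towards smaller coefficients.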

### Implementation in Python

Let's begin by importing the libraries we need.

In [1]:
# Data handling
import numpy as np
import pandas as pd

# Vectorization
from sklearn.feature_extraction.text import CountVectorizer

# Preprocessing, splitting, model and metric
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

### Loading the dataset

We load our dataset and replace any missing values with empty strings.

In [2]:
df = pd.read_csv("../DATASETS/preprocessed_text.csv")

In [3]:
df.isnull().sum()
df.fillna('', inplace=True)
df.head()

| Unnamed: 0 | content | score | content_cleaned |
|---|---|---|---|
| 0 | Plsssss stoppppp giving screen limit like when... | 2 | plss stopp giving screen limit like when you a... |
| 1 | Good | 5 | good |
| 2 | 👍👍 | 5 | thumbs_up |
| 3 | Good | 3 | good |
| 4 | App is useful to certain phone brand ,,,,it is... | 1 | app is useful to certain phone brand it is not... |


### Vectorization

We vectorize our dataset using BoW.

In [4]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the model and transform the data
bow = vectorizer.fit_transform(df['content_cleaned'])

print(len(vectorizer.vocabulary_))
print(bow.shape)

39783
(113292, 39783)
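
To make the shape above concrete, here is a minimal pure-Python sketch of what a Bag-of-Words matrix contains: one row per document and one column per vocabulary word, holding word counts. `bag_of_words` and the toy documents are hypothetical. `CountVectorizer` tokenizes differently (by default it lowercases and keeps only tokens of two or more characters) and returns a sparse matrix, so this is an illustration rather than a replacement.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and one count vector per document."""
    vocab = sorted({token for doc in docs for token in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(token, 0) for token in vocab])
    return vocab, rows

# Toy documents (hypothetical, in the style of the cleaned reviews)
toy_docs = ["good app", "good good movies", "app is slow"]
toy_vocab, toy_rows = bag_of_words(toy_docs)

print(toy_vocab)
for row in toy_rows:
    print(row)
```

Most entries are zero even in this tiny example, which is why the real matrix of 113292 rows and 39783 columns is stored in sparse format.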


### Preparing the labels 

We scale the labels to the range 1 to 5 with MinMaxScaler. They are already in that range, so the transformation leaves them unchanged here, but fitting the scaler lets us apply the same mapping to new, unseen data.

In [5]:
y_df = df[['score']]

y_scaler = MinMaxScaler(feature_range=(1, 5))
y = y_scaler.fit_transform(y_df)
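
The mapping that `MinMaxScaler(feature_range=(1, 5))` applies can be sketched with NumPy as follows. `min_max_scale` is a hypothetical helper and the 0-10 scale is an invented example, used only to show a case where the scaling actually changes the values.

```python
import numpy as np

def min_max_scale(values, data_min, data_max, new_min=1.0, new_max=5.0):
    """Map [data_min, data_max] linearly onto [new_min, new_max]."""
    values = np.asarray(values, dtype=float)
    unit = (values - data_min) / (data_max - data_min)
    return unit * (new_max - new_min) + new_min

# Scores that already span 1..5 are left unchanged
scores = np.array([1, 2, 3, 4, 5])
scaled = min_max_scale(scores, data_min=scores.min(), data_max=scores.max())

# Hypothetical labels on a 0..10 scale are mapped onto 1..5
rescaled = min_max_scale([0, 5, 10], data_min=0, data_max=10)
print(scaled, rescaled)
```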

### Train - test split

We perform the train-test split, keeping 20% of the original data for evaluation. We also split the dataframe indices in the same way, so that each test prediction can be traced back to its review.

In [6]:
indices = df.index

bow_train, bow_test, y_train, y_test, train_idx, test_idx = train_test_split(bow, y, indices, test_size=0.2, random_state=42)

### Scaling

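Ridge Regression is sensitive to the scale of the features: the L2 penalty acts on the coefficients, and the scale of a feature determines the size of its coefficient. Unscaled features can also make it harder for iterative solvers to converge. Therefore, we use a StandardScaler to scale our data. We fit it on the training set only and then apply the same transformation to the test set.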

In [7]:
# Scale the data
scaler = StandardScaler(with_mean=False)  # with_mean=False because BoW has sparse matrix format

bow_train_scaled = scaler.fit_transform(bow_train)
bow_test_scaled = scaler.transform(bow_test)
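
As a rough picture of what `StandardScaler(with_mean=False)` does, the sketch below divides each column by its standard deviation on the training data, without subtracting the mean, so zero counts stay zero. It uses a small dense array with invented values, and `fit_scale` is a hypothetical helper; the notebook itself works on a sparse matrix.

```python
import numpy as np

def fit_scale(X_train):
    """Per-column standard deviation; constant columns get scale 1."""
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero
    return std

# Toy count matrix (hypothetical values)
X_train_toy = np.array([[0.0, 2.0, 1.0],
                        [0.0, 0.0, 1.0],
                        [4.0, 0.0, 1.0],
                        [0.0, 2.0, 1.0]])
X_test_toy = np.array([[2.0, 1.0, 3.0]])

scale = fit_scale(X_train_toy)         # statistics come from the training set only
X_train_scaled = X_train_toy / scale
X_test_scaled = X_test_toy / scale     # the test set reuses the training statistics
print(scale)
print(X_train_scaled)
```

Because nothing is subtracted, the positions of the zeros are identical before and after scaling, which is what keeps the sparse format usable.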


### Model

Next, we define our Ridge model and train it.

In [8]:
# Ridge model (Linear Regression with L2 regularization)
model = Ridge()
model.fit(bow_train_scaled, y_train)
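
For intuition about what fitting a ridge model computes, the sketch below solves the ridge normal equations directly with NumPy on toy data. It is a simplification under stated assumptions: there is no intercept, the matrix is small and dense, and `fit_ridge_closed_form` is a hypothetical helper. By default `Ridge()` uses `alpha=1.0`, fits an intercept and chooses a solver automatically, so it does not necessarily work this way on a large sparse matrix.

```python
import numpy as np

def fit_ridge_closed_form(X, y, alpha=1.0):
    """Solve (X^T X + alpha * I) w = X^T y for the weights w (no intercept)."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

# Toy data (hypothetical values)
X_toy = np.eye(2)
y_toy = np.array([2.0, 4.0])

w_ols = fit_ridge_closed_form(X_toy, y_toy, alpha=0.0)    # no regularization
w_ridge = fit_ridge_closed_form(X_toy, y_toy, alpha=1.0)  # weights are shrunk
print(w_ols, w_ridge)
```

On this toy problem the regularized weights are half the size of the unregularized ones, which shows the shrinkage effect of the L2 penalty.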

After we train our model, we can make predictions on our test dataset.

In [9]:
# Predictions
y_pred = model.predict(bow_test_scaled)

To evaluate the predictions properly, we need to invert the label scaling and bring them back to the original range of values.

In [10]:
# Inverse transform the predictions and actual test values
y_pred = y_scaler.inverse_transform(y_pred.reshape(-1, 1)).flatten()
y_test = y_scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()

In [11]:
y_pred

array([-0.81484475,  2.3423072 ,  5.39278123, ...,  1.51370284,
        3.72792163,  3.82034695])

In [12]:
y_test

array([1., 1., 5., ..., 1., 1., 4.])

We notice that some predictions are lower than 1 or higher than 5, so we clip them to the 1-5 range.

In [13]:
# Clip predictions to stay within the 1-5 range
y_pred_original_clipped = np.clip(y_pred, 1, 5)

In [14]:
y_pred_original_clipped

array([1.        , 2.3423072 , 5.        , ..., 1.51370284, 3.72792163,
       3.82034695])

### Evaluation

Now that we have predictions, we can evaluate our model using the Mean Squared Error, a standard metric for regression tasks.

In [15]:
# Evaluate the model using mean squared error
mse = mean_squared_error(y_test, y_pred_original_clipped)
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 1.5188466468997632
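
The metric itself is simple to compute by hand, and its square root (RMSE) is expressed in the same units as the scores, which makes it easier to interpret. A minimal sketch with invented toy values; `mse_and_rmse` is a hypothetical helper, and only the last line uses the value printed above.

```python
import numpy as np

def mse_and_rmse(y_true, y_pred):
    """Mean squared error and its square root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return mse, np.sqrt(mse)

# Toy example (hypothetical values)
toy_mse, toy_rmse = mse_and_rmse([1, 5, 3], [2, 5, 1])
print(f"toy MSE = {toy_mse:.3f}, toy RMSE = {toy_rmse:.3f}")

# RMSE corresponding to the MSE reported by the notebook
notebook_rmse = np.sqrt(1.5188466468997632)
print(f"notebook RMSE = {notebook_rmse:.2f}")
```

An RMSE of roughly 1.23 means that the typical error is a little over one point on the 1-5 scale.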


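We also print some prediction examples, along with the true value and the review content. Several predictions are close to the actual score, although some reviews are clearly missed (for example, Review 11 below is predicted at 2.95 while its actual score is 5.00).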

In [16]:
# Print some prediction examples along with review content
print("\nExample predictions:")
for i in range(10, 30):
    print(f"Review {i+1}:")
    print(f"Content: {df['content_cleaned'][test_idx[i]]}")
    print(f"Predicted score = {y_pred_original_clipped[i]:.2f}, Actual score = {y_test[i]:.2f}\n")


Example predictions:
Review 11:
Content: it takes like 2 3 minute to open the app that really freaks me out please do something
Predicted score = 2.95, Actual score = 5.00

Review 12:
Content: wh0 does not love netflix the nest shows and movies are on there the on problem is that you pay for about everything you need to pay for more than 1 person to download stuff and you need to pay for more than 1 person to be able to watch
Predicted score = 3.96, Actual score = 4.00

Review 13:
Content: why has my app changed side i used to scroll from right to left and now it is reversed i know it is tiny bug but ui bugs are the worse
Predicted score = 1.73, Actual score = 1.00

Review 14:
Content: it is awesome to wear that you can use so many pictures now i love the updates and also they are putting brand new movies on it and i love it is so amazing how you can just watch a brand new movie on your phone
Predicted score = 5.00, Actual score = 5.00

Review 15:
Content: the resolution paired with m

### Saving the model

In the end, we save our model, along with the vectorizer and the scalers we used for future use.

In [17]:
# Saving the model
import joblib

# Save the vectorizer
joblib.dump(vectorizer, 'vectorizer.pkl')

# Save the model and scalers as well
joblib.dump(model, 'model.pkl')
# Note: despite the file name, this file holds the StandardScaler fitted above
joblib.dump(scaler, 'maxabs_scaler.pkl')
joblib.dump(y_scaler, 'minmax_scaler.pkl')

print("Model and scalers saved successfully.")

Model and scalers saved successfully.
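
The saved objects could later be reused along the following lines. This is a sketch under assumptions: `predict_scores` is a hypothetical helper that mirrors the steps of this notebook (vectorize, scale, predict, invert the label scaling, clip), and new texts are assumed to be cleaned in the same way as `content_cleaned`. Loading is shown in comments because it depends on the files written above.

```python
import numpy as np

def predict_scores(texts, vectorizer, scaler, model, y_scaler):
    """Apply the notebook's pipeline to new, already cleaned texts."""
    features = vectorizer.transform(texts)            # BoW counts
    features_scaled = scaler.transform(features)      # same scaling as in training
    raw = np.asarray(model.predict(features_scaled)).reshape(-1, 1)
    scores = y_scaler.inverse_transform(raw).flatten()
    return np.clip(scores, 1, 5)                      # keep scores in the 1-5 range

# Possible usage with the files saved above:
# import joblib
# vectorizer = joblib.load('vectorizer.pkl')
# scaler = joblib.load('maxabs_scaler.pkl')
# model = joblib.load('model.pkl')
# y_scaler = joblib.load('minmax_scaler.pkl')
# print(predict_scores(["good app"], vectorizer, scaler, model, y_scaler))
```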


## Pros and Cons of the Linear Regression model

### Pros:
- **Simplicity**: Linear Regression is straightforward to understand and interpret.
- **Efficiency**: Computationally efficient and fast, compared to more complex models, like neural networks and ensembles.
- **Feature Importance**: The coefficients of the Linear Regression model give insight into the impact of different features. In NLP, this means that words which strongly affect the predicted score have coefficients with a larger absolute value.
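
The feature-importance point can be checked by ranking words by their coefficient. A minimal sketch, assuming the word list comes from `vectorizer.get_feature_names_out()` (available in recent scikit-learn versions) and the weights from `model.coef_`; `top_weighted_words` is a hypothetical helper and the toy words and weights below are invented.

```python
import numpy as np

def top_weighted_words(feature_names, coefficients, k=3):
    """Return the k most positive and the k most negative (word, weight) pairs."""
    names = np.asarray(feature_names)
    coefs = np.ravel(coefficients)  # coef_ may have shape (1, n_features)
    order = np.argsort(coefs)       # ascending: most negative first
    most_negative = [(str(names[i]), float(coefs[i])) for i in order[:k]]
    most_positive = [(str(names[i]), float(coefs[i])) for i in order[::-1][:k]]
    return most_positive, most_negative

# Toy vocabulary and weights (hypothetical values)
toy_names = ["awesome", "bug", "love", "slow", "the"]
toy_coefs = [[0.8, -0.6, 0.5, -0.9, 0.01]]

positive, negative = top_weighted_words(toy_names, toy_coefs, k=2)
print("Pushes the score up:  ", positive)
print("Pushes the score down:", negative)

# Possible usage with the fitted objects from this notebook:
# top_weighted_words(vectorizer.get_feature_names_out(), model.coef_, k=10)
```

Since the features were standardized before fitting, the coefficients refer to scaled word counts, which makes their magnitudes more comparable across words.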

### Cons:
- **Assumption of Linearity**: The model assumes a linear relationship between the features and the outcome, so it performs poorly when the true relationship is more complex.
- **Feature Scaling Required**: Features need to be scaled beforehand for optimal performance.
- **Outlier Sensitivity**: Linear Regression is sensitive to outliers, which can have a disproportionate effect on the model parameters.