# Random Forest Regressor

In this notebook we will present the Random Forest Regressor. For simplicity we will train and evaluate it using only Word2Vec vectorization.

Random Forest is an ensemble algorithm. It fits several decision trees on different sub-samples of the training dataset and makes predictions by averaging the outputs of all those trees. Averaging improves accuracy and reduces over-fitting to noise compared with a single tree.
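
The averaging step can be checked directly. The sketch below is a minimal illustration on synthetic data (the arrays `X` and `y` are made up for this example and are not the review dataset): it fits a small forest and compares its prediction with the mean of the individual trees' predictions, which scikit-learn exposes through `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=10, max_depth=4, random_state=0)
forest.fit(X, y)

# Each fitted tree is available in `estimators_`
tree_predictions = np.array([tree.predict(X) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

# Compare the forest's prediction with the average over its trees
print(np.allclose(forest.predict(X), manual_average))
```

Because every tree predicts an average of training targets, the forest's output also stays within the range of the labels it was trained on.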

### Implementation in Python

Let's begin by importing the libraries we need.

In [1]:
# Data handling
import numpy as np
import pandas as pd

from gensim import models

import multiprocessing

from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

### Loading the dataset

We load our dataset and replace missing values with empty strings.

In [2]:
df = pd.read_csv("../DATASETS/preprocessed_text.csv")

In [3]:
df.isnull().sum()
df.fillna('', inplace=True)
df.head()

| Unnamed: 0 | content | score | content_cleaned |
|---|---|---|---|
| 0 | Plsssss stoppppp giving screen limit like when... | 2 | plss stopp giving screen limit like when you a... |
| 1 | Good | 5 | good |
| 2 | 👍👍 | 5 | thumbs_up |
| 3 | Good | 3 | good |
| 4 | App is useful to certain phone brand ,,,,it is... | 1 | app is useful to certain phone brand it is not... |


### Vectorization

We vectorize our dataset using Word2Vec. Here we train our own CBoW model on the dataset, so that the embeddings reflect the vocabulary of the reviews. Each review is then represented by the average of its word vectors.

In [4]:
def get_average_word2vec2(tokens_list, model, vector_size):
    valid_tokens = [token for token in tokens_list if token in model.wv]
    if not valid_tokens:
        return np.zeros(vector_size)
    word_vectors = [model.wv[token] for token in valid_tokens]
    average_vector = np.mean(word_vectors, axis=0)
    return average_vector

# Define model parameters
vector_size = 100   # Dimensionality of the word vectors
window_size = 5     # Context window size
min_count = 1       # Minimum word frequency
workers = multiprocessing.cpu_count()  # Number of worker threads to use

# Tokenize the text data
df['tokens'] = df['content_cleaned'].apply(lambda x: x.split())

# Train the Word2Vec model
cbow = models.Word2Vec(df['tokens'].tolist(), vector_size=vector_size, sg=0, window=window_size, min_count=min_count, workers=workers)

df['word2vec_cbow'] = df['tokens'].apply(lambda x: get_average_word2vec2(x, cbow, vector_size))

df.head()

| Unnamed: 0 | content | score | content_cleaned | tokens | word2vec_cbow |
|---|---|---|---|---|---|
| 0 | Plsssss stoppppp giving screen limit like when... | 2 | plss stopp giving screen limit like when you a... | [plss, stopp, giving, screen, limit, like, whe... | [0.39999062, 0.3815289, 0.21260098, -0.0204688... |
| 1 | Good | 5 | good | [good] | [0.57290155, -2.2940826, -0.4351687, 0.2013278... |
| 2 | 👍👍 | 5 | thumbs_up | [thumbs_up] | [-1.2796623, -1.1963838, -1.4142389, 0.0607745... |
| 3 | Good | 3 | good | [good] | [0.57290155, -2.2940826, -0.4351687, 0.2013278... |
| 4 | App is useful to certain phone brand ,,,,it is... | 1 | app is useful to certain phone brand it is not... | [app, is, useful, to, certain, phone, brand, i... | [0.45239678, 0.37467182, 0.7946273, 0.4409721,... |
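
To make the averaging step concrete without training a model, here is a small sketch that uses a hand-written lookup table in place of `model.wv`. The table `toy_vectors` and the helper `average_vector` are hypothetical stand-ins, and the 3-dimensional vectors are invented for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained model.wv
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "app": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, vector_size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)

# "very" is out of vocabulary and is skipped
print(average_vector(["very", "good", "app"], toy_vectors, 3).tolist())  # [2.0, 1.0, 1.0]
# No known tokens: fall back to the zero vector
print(average_vector(["unknown"], toy_vectors, 3).tolist())              # [0.0, 0.0, 0.0]
```

This is also why rows 1 and 3 above share the same vector: both reviews reduce to the single token `good`, even though their scores differ.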


### Preparing the labels 

We scale the labels into the range 1 to 5 with MinMaxScaler. They are already in that range, so the transformation leaves them unchanged, but we fit the scaler so that the same object can be reused on new, unseen data.

In [5]:
y_df = df[['score']]

y_scaler = MinMaxScaler(feature_range=(1, 5))
y = y_scaler.fit_transform(y_df)
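
For reference, min-max scaling to a range `[low, high]` maps each value `x` to `low + (x - data_min) * (high - low) / (data_max - data_min)`. The sketch below applies that formula by hand to show what happens to labels that already span 1 to 5. The helper `min_max_scale` is written for this illustration and is not part of scikit-learn.

```python
import numpy as np

def min_max_scale(values, data_min, data_max, low=1.0, high=5.0):
    """Map values from [data_min, data_max] linearly onto [low, high]."""
    values = np.asarray(values, dtype=float)
    return low + (values - data_min) * (high - low) / (data_max - data_min)

scores = np.array([1, 2, 3, 4, 5])

# Fitted on labels that already span 1..5, the mapping changes nothing
print(min_max_scale(scores, scores.min(), scores.max()).tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]

# If the fitted labels only spanned 2..4, they would be stretched to 1..5
print(min_max_scale([2, 3, 4], 2, 4).tolist())                     # [1.0, 3.0, 5.0]
```

The labels come out unchanged only when both a 1 and a 5 appear among the fitted scores, which is worth checking before relying on it.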

### Train - test split

We perform the train-test split, keeping 20% of the original data for evaluation. We also split the dataframe indices in the same way, so that each test row can later be traced back to its review.

In [6]:
w2v_cbow = np.vstack(df['word2vec_cbow'].values)

indices = df.index

w2v_cbow_train, w2v_cbow_test, y_train, y_test, train_idx, test_idx = train_test_split(w2v_cbow, y.flatten(), indices, test_size=0.2, random_state=42)
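
The sketch below illustrates why passing the indices as a third array is useful: `train_test_split` selects the same rows from every array it receives, so the returned indices point back to the original rows. The arrays `features`, `labels` and `row_ids` are synthetic stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix, labels and dataframe index
features = np.arange(20).reshape(10, 2)
labels = np.arange(10) * 10
row_ids = np.arange(10)

f_train, f_test, l_train, l_test, id_train, id_test = train_test_split(
    features, labels, row_ids, test_size=0.2, random_state=42
)

# The same rows are selected from every array, so the ids recover the originals
print(np.array_equal(features[id_test], f_test))
print(np.array_equal(labels[id_test], l_test))
```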

### Scaling

Random Forest does not require feature scaling. Each tree splits on one feature at a time by comparing it with a threshold, so only the ordering of values within a feature matters; no hyperplane coefficients are fitted across features.
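
A quick way to see this is to rescale one feature and refit with the same seed. The sketch below is an illustration on synthetic integer-valued features (chosen so that the rescaling is exact in floating point), assuming scikit-learn's threshold-based splitting; it is not taken from the notebook's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 3)).astype(float)
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Rescale one feature by a large positive factor
X_scaled = X.copy()
X_scaled[:, 0] *= 1024.0

forest_a = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=0).fit(X, y)
forest_b = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=0).fit(X_scaled, y)

# Compare predictions of the two forests on their respective inputs
print(np.allclose(forest_a.predict(X), forest_b.predict(X_scaled)))
```

The rescaling moves the split thresholds but not the way samples are partitioned, so the two forests make the same predictions.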

### Model

Next we define our Random Forest model and train it.

In [17]:
model = RandomForestRegressor(n_estimators=80, max_depth=15)
model.fit(w2v_cbow_train, y_train)

After we train our model, we can make predictions on our test dataset.

In [18]:
# Predictions
y_pred = model.predict(w2v_cbow_test)

To evaluate the predictions properly, we invert the scaling to map them back to the original range of scores.

In [19]:
# Inverse transform the predictions and actual test values
y_pred = y_scaler.inverse_transform(y_pred.reshape(-1, 1)).flatten()
y_test = y_scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()

In [20]:
y_pred

array([1.33793049, 1.70539285, 4.89782447, ..., 1.52231426, 3.23168107,
       4.39613392])

In [21]:
y_test

array([1., 1., 5., ..., 1., 1., 4.])

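Because a Random Forest averages training targets, its predictions should already fall between 1 and 5. We still clip them to the 1-5 range as a safeguard, which leaves the values shown here unchanged.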

In [22]:
# Clip predictions to stay within the 1-5 range
y_pred_original_clipped = np.clip(y_pred, 1, 5)

In [23]:
y_pred_original_clipped

array([1.33793049, 1.70539285, 4.89782447, ..., 1.52231426, 3.23168107,
       4.39613392])
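
In the output above, the displayed entries are identical before and after clipping. Counting how many values `np.clip` actually changed is a cheap way to check whether the step had any effect. The sketch below uses made-up predictions, two of which lie outside the rating range.

```python
import numpy as np

# Made-up predictions, two of them outside the valid rating range
raw_predictions = np.array([0.7, 1.34, 3.2, 4.9, 5.4])
clipped = np.clip(raw_predictions, 1, 5)

print(clipped.tolist())                         # [1.0, 1.34, 3.2, 4.9, 5.0]
# Number of values that clipping actually changed
print(int(np.sum(clipped != raw_predictions)))  # 2
```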

### Evaluation

Now that we have predictions, we can evaluate our model using the Mean Squared Error, a standard metric for regression tasks.

In [24]:
# Evaluate the model using mean squared error
mse = mean_squared_error(y_test, y_pred_original_clipped)
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 1.3157371740238468
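
The MSE is expressed in squared score units, which makes it hard to read against a 1-5 scale. Taking the square root gives an error in score points. The short sketch below does this for the value printed above; it is an added convenience and not part of the original evaluation.

```python
import math

mse = 1.3157371740238468  # value reported above
rmse = math.sqrt(mse)
print(f"Root Mean Squared Error: {rmse:.3f}")  # Root Mean Squared Error: 1.147
```

A typical prediction is therefore off by a little more than one point.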


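We also print some prediction examples, along with the true score and the review content. In the examples shown, most predictions land within about one point of the true score, although some are far off, such as Review 11 (predicted 2.29, actual 5.00).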

In [25]:
# Print some prediction examples along with review content
print("\nExample predictions:")
for i in range(10, 30):
    print(f"Review {i+1}:")
    print(f"Content: {df['content_cleaned'][test_idx[i]]}")
    print(f"Predicted score = {y_pred_original_clipped[i]:.2f}, Actual score = {y_test[i]:.2f}\n")


Example predictions:
Review 11:
Content: it takes like 2 3 minute to open the app that really freaks me out please do something
Predicted score = 2.29, Actual score = 5.00

Review 12:
Content: wh0 does not love netflix the nest shows and movies are on there the on problem is that you pay for about everything you need to pay for more than 1 person to download stuff and you need to pay for more than 1 person to be able to watch
Predicted score = 3.11, Actual score = 4.00

Review 13:
Content: why has my app changed side i used to scroll from right to left and now it is reversed i know it is tiny bug but ui bugs are the worse
Predicted score = 1.66, Actual score = 1.00

Review 14:
Content: it is awesome to wear that you can use so many pictures now i love the updates and also they are putting brand new movies on it and i love it is so amazing how you can just watch a brand new movie on your phone
Predicted score = 4.83, Actual score = 5.00

Review 15:
Content: the resolution paired with m

### Saving the model

Finally, we save our model, along with the vectorizer and the label scaler, for future use.

In [27]:
# Saving the model
import joblib

# Save the vectorizer
joblib.dump(cbow, 'vectorizer.pkl')

# Save the model and scalers as well
joblib.dump(model, 'model.pkl')
joblib.dump(y_scaler, 'minmax_scaler.pkl')

print("Model and scalers saved successfully.")

Model and scalers saved successfully.
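
Loading the artifacts back is the mirror operation, using `joblib.load`. The sketch below shows a round trip on a small synthetic forest written to a temporary directory; the data and the temporary path are assumptions made for this example, not the notebook's files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the trained forest
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.uniform(1, 5, size=50)
original = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(original, path)
    restored = joblib.load(path)

# Check that the restored model reproduces the original predictions
print(np.array_equal(original.predict(X), restored.predict(X)))
```

The same call should apply to `vectorizer.pkl`, `model.pkl` and `minmax_scaler.pkl`. New reviews would need the same cleaning, tokenization and vector averaging before being passed to the restored model.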
