# Helpfulness Prediction With Random Forest
## Data Science and Big Data Analytics Project

---

### Authors: 
- **Andrea Alberti** ([GitHub](https://github.com/AndreaAlberti07))
- **Davide Ligari** ([GitHub](https://github.com/DavideLigari01))
- **Cristian Andreoli** ([GitHub](https://github.com/CristianAndreoli94))

### Date: September 2023

---

## Data: 
The chosen dataset is [Amazon Books Reviews](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews).


## Goal:
Build a model that predicts the helpfulness of a review from its content.

---

### Model Training and Evaluation using Grid Search Cross Validation

The features are extracted from the review text using Word2Vec with a context window of 5, producing a 30-dimensional vector for each word. Each review is then represented by the average of the vectors of the words it contains. The model is a Random Forest regressor (`RandomForestRegressor`), evaluated on the test set with MSE and $R^2$.

| N_grid | n_estimators | max_depth | min_samples_split | Test MSE             | Test $R^2$         |
| ------ | ------------ | --------- | ----------------- | -------------------- | ------------------ |
| 2      | 100          | None      | 4                 | 0.025927002564121476 | 0.2581554095121458 |
| 3      | 200          | None      | 2                 | 0.025702599368344427 | 0.2531769186466868 |

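The averaging step described above can be sketched as follows. This is a minimal illustration, not the project's preprocessing code: `document_vector` and `toy_vectors` are hypothetical names, the random vectors stand in for a trained Word2Vec lookup (for example the `.wv` attribute of a gensim model trained with `vector_size=30, window=5`), and returning a zero vector when no word is known is an assumption.

```python
import numpy as np


def document_vector(tokens, word_vectors, dim=30):
    """Average the vectors of the tokens that appear in `word_vectors`.

    Assumption: tokens missing from the vocabulary are skipped, and a
    document with no known token maps to a zero vector.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy stand-in for a trained Word2Vec lookup (hypothetical data).
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=30) for w in ["great", "book", "boring"]}

# "unknownword" is not in the lookup, so only two vectors are averaged.
doc = document_vector(["great", "book", "unknownword"], toy_vectors)
print(doc.shape)
```

Averaging discards word order, so two reviews containing the same words in a different order get the same 30-dimensional representation.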

In [None]:
import joblib
import numpy as np
import pymongo as pm
import pandas as pd
import gensim
import sklearn as sk
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Download the NLTK stopword list
nltk.download('stopwords')


In [3]:

# Load the training and test data
train_data = np.load(
    '../_gitignore/train_data_wv2_30_5.npz', allow_pickle=True)
test_data = np.load(
    '../_gitignore/test_data_w2v_30_5.npz', allow_pickle=True)
X_train_embedding = train_data['x']
Y_train = train_data['y']
X_test_embedding = test_data['x']
Y_test = test_data['y']

# Define the hyperparameter grid you want to search over
param_grid_1 = {
    'n_estimators': [100, 200],       # Number of trees in the forest
    'max_depth': [30, 50],     # Maximum depth of the trees
    'min_samples_split': [4, 8],    # Minimum samples required to split a node
    # Add more hyperparameters and their values to explore here
}

param_grid_2 = {
    'n_estimators': [50, 100],       # Number of trees in the forest
    'max_depth': [None, 10],     # Maximum depth of the trees
    'min_samples_split': [4, 8],    # Minimum samples required to split a node
    # Add more hyperparameters and their values to explore here
}


param_grid_3 = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10],     # Maximum depth of the trees
    'min_samples_split': [2, 5],    # Minimum samples required to split a node
    # Add more hyperparameters and their values to explore here
}
# Create the RandomForestRegressor model
rand_forest = RandomForestRegressor(random_state=42)

# Create a GridSearchCV object with the model and hyperparameter grid
grid_search = GridSearchCV(estimator=rand_forest,
                           param_grid=param_grid_3, cv=2, n_jobs=-1)

# Fit the GridSearchCV object to your training data
grid_search.fit(X_train_embedding, Y_train)

# Get the best hyperparameters and the best estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

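# Save the best model to a file named after its hyperparameters
model_path = (
    f"../trained/rand_forest_model_md{best_params['max_depth']}"
    f"_mss{best_params['min_samples_split']}_ne{best_params['n_estimators']}.gz"
)
joblib.dump(best_estimator, model_path, compress=('gzip', 3))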

# Print the best hyperparameters
print("Best Hyperparameters:", best_params)

# Evaluate the best estimator on the test set
Y_test_pred = best_estimator.predict(X_test_embedding)
mse = mean_squared_error(Y_test, Y_test_pred)
print("Mean Squared Error on Test Set:", mse)

Y_train_pred = best_estimator.predict(X_train_embedding)
mse = mean_squared_error(Y_train, Y_train_pred)
print("Mean Squared Error on Train Set:", mse)

Best Hyperparameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
Mean Squared Error on Test Set: 0.02587508853631114
Mean Squared Error on Train Set: 0.0036453911772139323

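To get a feel for the cost of the search above, the sketch below enumerates the candidates in `param_grid_3` by hand using only the standard library. It mirrors what `GridSearchCV` does when it expands a grid, but it is an illustration rather than scikit-learn's implementation; `candidates`, `n_folds` and `n_fits` are names introduced here.

```python
from itertools import product

# Same grid as param_grid_3 in the cell above.
param_grid_3 = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
}

# Every combination of one value per hyperparameter is one candidate.
names = list(param_grid_3)
candidates = [dict(zip(names, values))
              for values in product(*param_grid_3.values())]

n_folds = 2                           # cv=2 in the GridSearchCV call
n_fits = len(candidates) * n_folds    # one fit per candidate per fold

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` trains the best candidate once more on the full training set, so the search above runs 25 fits in total.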

In [5]:
# model_path = '../trained/rand_forest_model_mdNone_mss4_ne100.gz'
model_path = '../trained/rand_forest_model_mdNone_mss2_ne100.gz'
loaded_model = joblib.load(model_path)

# Score the loaded model (not the in-memory best_estimator) on the test set
Y_test_pred = loaded_model.predict(X_test_embedding)
r2 = r2_score(Y_test, Y_test_pred)
print('Model mdNone_mss2_ne100')
print("R2 Score on Test Set:", r2)

# Score the same model on the training set to gauge overfitting
Y_train_pred = loaded_model.predict(X_train_embedding)
r2 = r2_score(Y_train, Y_train_pred)
print("R2 Score on Train Set:", r2)

Model mdNone_mss2_ne100
R2 Score on Test Set: 0.2531769186466868
R2 Score on Train Set: 0.8941849152318941

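The two metrics reported above are closely related: $R^2$ equals one minus the ratio between the MSE and the variance of the targets. The sketch below checks this on toy numbers, which are made up for illustration and are not taken from the dataset; `mse` and `r2` are hypothetical helpers that follow the same definitions as `mean_squared_error` and `r2_score`.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """1 - MSE / Var(y_true); 1.0 is a perfect fit, 0.0 matches the mean."""
    y_true = np.asarray(y_true, dtype=float)
    return 1.0 - mse(y_true, y_pred) / float(np.var(y_true))


# Toy helpfulness scores in [0, 1] (made-up values).
y_true = [0.0, 0.5, 1.0, 0.5]
y_pred = [0.1, 0.5, 0.8, 0.6]

print("MSE:", mse(y_true, y_pred))
print("R2: ", r2(y_true, y_pred))
```

Because the helpfulness targets have a small variance, a small absolute MSE can still correspond to a modest $R^2$, which is why both metrics are worth reporting together.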