# Helpfulness Prediction
## Data Science and Big Data Analytics Project

---

### Authors: 
- **Andrea Alberti** ([GitHub](https://github.com/AndreaAlberti07))
- **Davide Ligari** ([GitHub](https://github.com/DavideLigari01))
- **Cristian Andreoli** ([GitHub](https://github.com/CristianAndreoli94))

### Date: September 2023

---

## Data: 
The chosen dataset is [Amazon Books Reviews](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews).


## Goal:
Build a model able to predict the helpfulness of a review based on its content. 

---

In [1]:
import pymongo as pm
import pyspark as ps
import pandas as pd
import numpy as np
import gensim
import sklearn as sk
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestRegressor
import joblib

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/andreaalberti/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 1. Data Loading

In [None]:
client = pm.MongoClient('mongodb://localhost:27017/')
spark_db = client['spark_db']
books_ratings = spark_db['books_rating']

### 2. Data Reshaping
The goal is to predict the helpfulness of a review from its content, so all samples without text or without helpfulness votes are removed.
To account for both the number of helpful votes and the total number of votes, Laplace (additive) smoothing is applied. The helpfulness score to predict is computed as:
$$helpfulness\_score = \frac{helpful\_votes + smoothing\_param}{total\_votes + 2 \cdot smoothing\_param}$$

The helpfulness score therefore lies in the range [0,1] and approaches the raw ratio of helpful to total votes as the number of votes grows.
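
As a quick illustration of the formula above, the following minimal sketch computes the smoothed score for a few vote counts. The helper name `smoothed_helpfulness` and the sample counts are hypothetical and chosen only for illustration; the notebook itself computes the score inside the MongoDB pipeline.

```python
def smoothed_helpfulness(helpful_votes, total_votes, smoothing_param=1):
    """Smoothed helpfulness ratio (hypothetical helper mirroring the formula above)."""
    return (helpful_votes + smoothing_param) / (total_votes + 2 * smoothing_param)

# Illustrative vote counts (not taken from the dataset)
for helpful, total in [(1, 1), (10, 10), (100, 100), (50, 100)]:
    score = smoothed_helpfulness(helpful, total)
    print(f"{helpful}/{total}: raw={helpful / total:.3f}, smoothed={score:.3f}")
```

With `smoothing_param = 1`, a review with a single helpful vote out of one scores 2/3 rather than 1.0, while a review with 100 helpful votes out of 100 scores about 0.990, so the score trusts the raw ratio more as votes accumulate.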

In [None]:
# Keep only reviews with a score and with non-zero helpful and total vote counts
pipeline_remove = {'$match':{
                        'review/score':{'$exists':True},
                        'N_helpful':{'$exists':True, '$ne':0},
                        'Tot_votes':{'$exists':True, '$ne':0}
                        }
                }

smoothing_param = 1

pipeline_project = {'$project':{
                            'review/text':1,
                            'helpfulness_score':{'$divide':[
                                                        {'$sum':['$N_helpful', smoothing_param]},
                                                        {'$sum': ['$Tot_votes', smoothing_param*2]}
                                                             ]
                                                 },
                            '_id':0,
                            }
                    }

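from sklearn.model_selection import train_test_split

mongo_dataset = books_ratings.aggregate([pipeline_remove, pipeline_project])
df_dataset = pd.DataFrame(list(mongo_dataset))
# Select the columns by name so that the split does not depend on field order
arr_dataset = df_dataset[['review/text', 'helpfulness_score']].to_numpy()

X_train, X_test, Y_train, Y_test = train_test_split(arr_dataset[:,0], arr_dataset[:,1], test_size=0.2, random_state=42)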

### 3. Features Extraction
To let the model capture context and recognize similar words, we opted for a word embedding approach instead of a simple bag-of-words representation. In particular, we used the Word2Vec model provided by the Gensim library. The model is trained on the training set and then used to transform the reviews into vector representations.
The model specification is:
- **vector_size**: 30
- **window**: 5
- **min_count**: 2
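
To illustrate why dense embeddings can capture word similarity where a bag-of-words representation cannot, here is a minimal sketch using cosine similarity. The three-dimensional `dense` vectors are hand-written, hypothetical values (a trained Word2Vec model would produce 30-dimensional vectors learned from the reviews), and `cosine` is a helper defined only for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot (bag-of-words style) vectors: distinct words are always orthogonal
one_hot = {"good": [1, 0, 0], "great": [0, 1, 0], "boring": [0, 0, 1]}

# Hypothetical dense embeddings, hand-written for illustration only
dense = {"good": [0.9, 0.8, 0.1], "great": [0.8, 0.9, 0.2], "boring": [-0.7, -0.6, 0.3]}

print("one-hot good vs great:", cosine(one_hot["good"], one_hot["great"]))
print("dense   good vs great:", cosine(dense["good"], dense["great"]))
print("dense   good vs boring:", cosine(dense["good"], dense["boring"]))
```

In the one-hot case every pair of distinct words has similarity 0, whereas in the toy dense space "good" and "great" are close and "good" and "boring" point in opposite directions.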

In [None]:
stop_words = set(stopwords.words('english'))

def preprocess(doc):
    tokens = gensim.utils.simple_preprocess(doc)
    return [token for token in tokens if token not in stop_words]

X_train_w2v = [preprocess(doc) for doc in X_train]

In [None]:
model = gensim.models.Word2Vec(X_train_w2v, vector_size=30, window=5, min_count=2)

model.save('../model/_gitignore/word2vec.model')

def get_embedding(doc):
    embeddings = []
    words = preprocess(doc)
    for word in words:
        if word in model.wv:
            embeddings.append(model.wv[word])
    if len(embeddings) > 0:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(model.vector_size)

X_train_embedding = [get_embedding(doc) for doc in X_train]
X_test_embedding = [get_embedding(doc) for doc in X_test]

In [None]:
# Store the embeddings in the folder from which the training cell loads them
np.savez('../model/_gitignore/train_data_wv2_30_5.npz',x = X_train_embedding, y = Y_train)
np.savez('../model/_gitignore/test_data_w2v_30_5.npz',x = X_test_embedding, y = Y_test)

### 4. Model Training and Evaluation using Grid Search Cross Validation

The features are extracted from the review text using Word2Vec with window = 5, producing a 30-dimensional vector for each word. Each document is represented by the average of the vectors of its words. The model used is a Random Forest Regressor, tuned with grid search and cross-validation.
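
To get a sense of the cost of the search, the sketch below enumerates the combinations of a grid using only the standard library. It assumes the values of the first grid defined in the following cell and the 2-fold cross-validation passed to `GridSearchCV`; the variable names `candidates` and `n_fits` are introduced here for illustration.

```python
from itertools import product

# Same values as the first grid in the cell below
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [30, 50],
    'min_samples_split': [4, 8],
}
cv_folds = 2

# Cartesian product of all hyperparameter values
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
n_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With `refit=True`, the default in `GridSearchCV`, one additional fit on the whole training set follows for the best candidate.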

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
import joblib
from sklearn.metrics import mean_squared_error


# Load the training and test data
train_data = np.load('../model/_gitignore/train_data_wv2_30_5.npz', allow_pickle=True)
test_data = np.load('../model/_gitignore/test_data_w2v_30_5.npz', allow_pickle=True)
X_train_embedding = train_data['x']
Y_train = train_data['y']
X_test_embedding = test_data['x']
Y_test = test_data['y']

# Hyperparameter grids to search over
param_grid_1 = {
    'n_estimators': [100, 200],      # Number of trees in the forest
    'max_depth': [30, 50],           # Maximum depth of the trees
    'min_samples_split': [4, 8],     # Minimum samples required to split a node
}

# Alternative grids (not used by the search below)
param_grid_2 = {
    'n_estimators': [50, 100],       # Number of trees in the forest
    'max_depth': [None, 10],         # Maximum depth of the trees
    'min_samples_split': [4, 8],     # Minimum samples required to split a node
}

param_grid_3 = {
    'n_estimators': [50, 100],       # Number of trees in the forest
    'max_depth': [None, 10],         # Maximum depth of the trees
    'min_samples_split': [2, 5],     # Minimum samples required to split a node
}

# Create the RandomForestRegressor model
rand_forest = RandomForestRegressor(random_state=42)

# Create a GridSearchCV object with the model and hyperparameter grid
grid_search = GridSearchCV(estimator=rand_forest, param_grid=param_grid_1, cv=2, n_jobs=-1)

# Fit the GridSearchCV object to your training data
grid_search.fit(X_train_embedding, Y_train)

# Get the best hyperparameters and the best estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

# Save the best model to a file
joblib.dump(best_estimator, '../model/_gitignore/rand_forest_model_md_mss_ne.gz', compress=('gzip', 3))

# Print the best hyperparameters
print("Best Hyperparameters:", best_params)

# Evaluate the best estimator on the test set
Y_test_pred = best_estimator.predict(X_test_embedding)
mse = mean_squared_error(Y_test, Y_test_pred)
print("Mean Squared Error on Test Set:", mse)

Y_train_pred = best_estimator.predict(X_train_embedding)
mse = mean_squared_error(Y_train, Y_train_pred)
print("Mean Squared Error on Train Set:", mse)


Best Hyperparameters: {'max_depth': 30, 'min_samples_split': 4, 'n_estimators': 200}
Mean Squared Error on Test Set: 0.024280057489063204
Mean Squared Error on Train Set: 0.004935147918521729
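
Since the target lies in [0,1], the errors are easier to read as root mean squared errors. The short sketch below converts the two MSE values printed above; the variable names are introduced here for illustration.

```python
import math

# MSE values copied from the output above
mse_test = 0.024280057489063204
mse_train = 0.004935147918521729

rmse_test = math.sqrt(mse_test)
rmse_train = math.sqrt(mse_train)

print(f"RMSE test:  {rmse_test:.3f}")
print(f"RMSE train: {rmse_train:.3f}")
print(f"test/train MSE ratio: {mse_test / mse_train:.1f}")
```

This gives an RMSE of roughly 0.156 on the test set against roughly 0.070 on the train set. A gap of this size may indicate some overfitting, although the notebook does not investigate it further.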


---

## Using total votes as a feature to predict helpful votes

### Not useful in practice, since the total votes are not available at the time of the review

In [None]:
# Keep only reviews with a score and with non-zero helpful and total vote counts
pipeline_remove = {'$match':{
                        'review/score':{'$exists':True},
                        'N_helpful':{'$exists':True, '$ne':0},
                        'Tot_votes':{'$exists':True, '$ne':0}
                        }
                }

pipeline_project = {'$project':{
                            'review/text':1,
                            'Tot_votes':1,
                            'N_helpful':1,
                            '_id':0,
                            }
                    }

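from sklearn.model_selection import train_test_split

mongo_dataset = books_ratings.aggregate([pipeline_remove, pipeline_project])
df_dataset = pd.DataFrame(list(mongo_dataset))
# Fix the column order explicitly: text and total votes are features, helpful votes are the target
arr_dataset = df_dataset[['review/text', 'Tot_votes', 'N_helpful']].to_numpy()

X_train, X_test, Y_train, Y_test = train_test_split(arr_dataset[:,0:2], arr_dataset[:,2], test_size=0.2, random_state=42)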

In [None]:
stop_words = set(stopwords.words('english'))

def preprocess(doc):
    tokens = gensim.utils.simple_preprocess(doc)
    return [token for token in tokens if token not in stop_words]

X_train_w2v = [preprocess(doc[0]) for doc in X_train[:]]

In [None]:
model = gensim.models.Word2Vec(X_train_w2v, vector_size=30, window=5, min_count=2)

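def get_embedding(doc):
    # doc[0] is the review text, doc[1] the total number of votes
    words = preprocess(doc[0])
    embeddings = [model.wv[word] for word in words if word in model.wv]
    if len(embeddings) > 0:
        text_embedding = np.mean(embeddings, axis=0)
    else:
        text_embedding = np.zeros(model.vector_size)
    # Append the total votes as an extra feature, also for reviews with no known words
    return np.append(text_embedding, doc[1])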

X_train_embedding = [get_embedding(doc) for doc in X_train[:]]
X_test_embedding = [get_embedding(doc) for doc in X_test[:]]

In [None]:
np.savez('../model/train_data_wv2_totvotes_30_5.npz',x = X_train_embedding, y = Y_train)
np.savez('../model/test_data_w2v_totvotes_30_5.npz',x = X_test_embedding, y = Y_test)

In [None]:
rand_forest = RandomForestRegressor(n_estimators=100, random_state=42)
rand_forest.fit(X_train_embedding, Y_train)
joblib.dump(rand_forest, '../model/rand_forest_model_totvotes.gz', compress=('gzip', 3))

In [None]:
Y_test_pred = rand_forest.predict(X_test_embedding)
Y_train_pred = rand_forest.predict(X_train_embedding)

In [None]:
from sklearn.metrics import mean_squared_error

def rmse(Y_true, Y_pred):
    return np.sqrt(mean_squared_error(Y_true, Y_pred))

rmse_test = rmse(Y_test, Y_test_pred)
print('RMSE on test set: ', rmse_test)

rmse_train = rmse(Y_train, Y_train_pred)
print('RMSE on train set: ', rmse_train)