# Notebook: Training Different Types of Supervised Learning Models and Evaluating Their Metrics

### Considerations:
- **Feature Scaling**: Models like SVM, neural networks, and even those that use regularization (like Ridge and Lasso) benefit from feature scaling. Ensuring all numeric features are on a similar scale can improve model performance.
- **Model Tuning**: Utilizing grid search or randomized search to tune the hyperparameters of whichever models you choose can significantly enhance their performance.
- **Cross-Validation**: Use k-fold cross-validation to estimate how well your model generalizes to unseen data. This is especially important given the potential variability in your data, because a single train-test split can give an overly optimistic or pessimistic score.
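
The three considerations above can be combined in one object. The following is a minimal sketch, not the code this notebook runs: it assumes synthetic data from `make_regression` in place of the real features, and a `Ridge` model with an illustrative `alpha` grid. Putting the scaler inside a `Pipeline` means it is re-fit on the training portion of each fold, so the validation fold never influences the scaling statistics.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: the real features come from the parquet file)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=42
)

# Scaling is a pipeline step, so it is re-fit on each training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Illustrative grid; the "model__" prefix routes the parameter to the Ridge step
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    pipeline, param_grid, cv=cv, scoring="neg_root_mean_squared_error"
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["model__alpha"]
cv_rmse = -search.best_score_  # scorer is negated, so flip the sign
print(f"best alpha: {best_alpha}, cross-validated RMSE: {cv_rmse:.3f}")
```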

### Implementation:
You can use libraries such as Scikit-learn for most traditional models, and TensorFlow or PyTorch if you opt for neural networks. Given your data's complexity and the different insights each model family can offer, it is worth experimenting with a few approaches and tuning their parameters.

## Evaluation Metrics

### 1. **Mean Absolute Error (MAE)**
   - **Why**: MAE is easy to interpret because it gives the average error magnitude in the same units as the target (`avg_rating`). It is less sensitive to outliers than RMSE, because each error contributes in proportion to its size rather than its square. Since the ratings are likely in a limited range (e.g., 1-5), this will give us a clear idea of the typical prediction error.
   - **Interpretation**: A lower MAE indicates that the predictions are, on average, close to the actual ratings. If the MAE is 0.1, for example, it means the predictions deviate by an average of 0.1 points from the actual ratings.

### 2. **Root Mean Squared Error (RMSE)**
   - **Why**: RMSE penalizes larger errors more than MAE, which can be useful if we want to heavily penalize predictions that are far from the true ratings. Since outliers might occur in review ratings (e.g., extremely negative or positive experiences), RMSE helps reflect the seriousness of these larger deviations.
   - **Interpretation**: The RMSE gives an error measurement in the same units as `avg_rating` and emphasizes large errors. It’s good for understanding how much larger errors affect the model’s predictions.
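
To make the difference between MAE and RMSE concrete, here is a small sketch with made-up ratings (the arrays are hypothetical, not taken from the dataset). Both prediction sets have the same MAE, but the one that concentrates its error in a single large miss has a much higher RMSE.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical ratings, for illustration only
y_true = np.array([4.0, 4.5, 3.5, 4.0, 5.0])
y_pred_small = np.array([4.2, 4.3, 3.7, 3.8, 4.8])    # every error is 0.2
y_pred_outlier = np.array([4.0, 4.5, 3.5, 4.0, 4.0])  # a single error of 1.0

print(f"small errors: MAE={mae(y_true, y_pred_small):.3f}, "
      f"RMSE={rmse(y_true, y_pred_small):.3f}")    # MAE=0.200, RMSE=0.200
print(f"one outlier:  MAE={mae(y_true, y_pred_outlier):.3f}, "
      f"RMSE={rmse(y_true, y_pred_outlier):.3f}")  # MAE=0.200, RMSE=0.447
```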

### 3. **R-squared (R²)**
   - **Why**: R-squared provides an indication of how well the model explains the variance in `avg_rating`. It's particularly useful when comparing different models because it quantifies how much of the variance in the target variable the model can explain.
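   - **Interpretation**: An R² value close to 1 indicates that the model explains most of the variability in the data, while a value close to 0 indicates that it does little better than always predicting the mean. On held-out data R² can even be negative, meaning the model is worse than that baseline. Because R² is unitless, it does not tell us how large the errors are in rating points, so we should use it alongside other metrics like MAE or RMSE.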

### 4. **Adjusted R-squared**
   - **Why**: Adjusted R-squared is especially important when we have multiple features (as we do, with NLP features like word embeddings, topic modeling, sentiment analysis, etc.). It adjusts for the number of predictors, penalizing the addition of irrelevant features. This helps ensure that we're not overfitting by adding unnecessary complexity.
   - **Interpretation**: An increase in adjusted R-squared indicates that adding more features improves the model meaningfully, while a decrease suggests that the additional features are unnecessary or even detrimental.
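
As a worked example of the penalty, the sketch below applies the standard adjusted R² formula to hypothetical numbers (an R² of 0.80 on 100 samples). The values are chosen only for illustration; they are not results from this notebook.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n samples and p predictors."""
    return 1 - (1 - r2) * ((n - 1) / (n - p - 1))

# Hypothetical numbers, chosen only to illustrate the penalty
r2 = 0.80
n = 100

adj_few = adjusted_r_squared(r2, n, p=10)
adj_many = adjusted_r_squared(r2, n, p=50)

print(f"p=10: {adj_few:.3f}")   # p=10: 0.778
print(f"p=50: {adj_many:.3f}")  # p=50: 0.596
```

With the same raw R², the model that needed 50 features is penalized far more than the one that needed 10. Note that the formula requires `n > p + 1`.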

### **Why RMSE as primary metric?**
- **Penalty on Large Errors**: RMSE penalizes larger errors more than Mean Absolute Error (MAE). This is crucial when we want to minimize significant deviations in predicted ratings, which are likely to be more impactful than smaller deviations.
- **Real-World Relevance**: In review-based prediction problems like ours, an occasional large error (e.g., predicting a rating far from the actual value) could be more problematic than a series of small errors. RMSE helps emphasize these larger errors, making the model focus on reducing them.
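- **Smooth Gradient**: Squared-error losses are differentiable everywhere, whereas absolute error has a kink at zero. This suits gradient-based optimizers and tends to help convergence during training. The grid search in this notebook scores candidates with negative mean squared error, which ranks models in the same order as RMSE.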

### **How to Use Other Metrics:**
- **MAE**: Use it as a secondary metric to give an interpretable measure of average error. It’s easy to understand and will provide insight into the typical error magnitude, but it won't penalize large errors as much.
- **R-squared (R²) or Adjusted R-squared**: These can be helpful when evaluating how much variance in `avg_rating` the model explains, especially when we compare different models. However, we don’t rely on these alone because they don’t provide direct information about prediction error.


In [1]:
import pandas as pd
import os
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from datetime import datetime
import math

In [2]:
# Load the data
def load_data(parquet_file_path):
    df = pd.read_parquet(parquet_file_path)
    return df

# Preprocessing and feature-label separation
def preprocess_data(df):
    # Drop non-numeric features that are not required for training
    df = df.drop(columns=['hotel_id'])  # Example of dropping non-informative features

    # Define the label (target) and the features
    X = df.drop(columns=['avg_rating'])  # Features (all except avg_rating)
    y = df['avg_rating']  # Label

    return X, y

# Train-Test Split
def split_data(X, y, test_size=0.2, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test

# Feature Scaling
def scale_data(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled

In [3]:
parquet_file_path = '../output/l3_data_2024-09-10_19-33-39.parquet'  # Replace with actual parquet file path

# Load dataset
df = load_data(parquet_file_path)

# Preprocess data
X, y = preprocess_data(df)

# Split data
X_train, X_test, y_train, y_test = split_data(X, y)

# Scale data
X_train_scaled, X_test_scaled = scale_data(X_train, X_test)

## 1. **Linear Regression Models**
- **Ridge Regression**: This model is particularly effective if there's multicollinearity in your data or if you want to reduce overfitting. Ridge regression uses L2 regularization, which penalizes the sum of the squared coefficients.
- **Lasso Regression**: If your dataset has redundant or less important features, Lasso can help automatically perform feature selection by applying L1 regularization, which penalizes the sum of the absolute values of the coefficients.
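
The training cell later in this notebook fits a plain `LinearRegression`, so the following is only a sketch of how the two regularized variants differ. It assumes synthetic data in which 3 of 10 features are informative, and an illustrative `alpha=1.0` for both models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): only 3 of the 10 features carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=42
)
X_scaled = StandardScaler().fit_transform(X_demo)

ridge = Ridge(alpha=1.0).fit(X_scaled, y_demo)
lasso = Lasso(alpha=1.0).fit(X_scaled, y_demo)

# L2 shrinks coefficients toward zero; L1 can set them exactly to zero
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print(f"Ridge coefficients equal to zero: {n_zero_ridge}")
print(f"Lasso coefficients equal to zero: {n_zero_lasso}")
```

The features Lasso zeroes out are effectively removed from the model, which is the feature-selection behavior described above.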

## 2. **Tree-Based Models**
- **Random Forest**: This ensemble method builds many decision trees at training time and, for regression, outputs the mean of the individual trees' predictions (for classification, the majority class). Averaging over many trees makes it less prone to overfitting than a single decision tree. Note that scikit-learn's implementation expects categorical features to be numerically encoded first.
- **Gradient Boosting Machines (GBMs)**: Libraries such as XGBoost, LightGBM, and CatBoost often give strong results on tabular data. These models build trees sequentially, each new tree correcting the errors of the ones before it, and they need little feature preprocessing (for example, no feature scaling), which can be useful for your mixed data types. This notebook uses scikit-learn's `GradientBoostingRegressor`.

## 3. **Support Vector Machines (SVM)**
- While typically used for classification, SVMs can also be adapted for regression (SVR). With a non-linear kernel such as RBF, they might be particularly useful if the relationship between the features and the rating is not linear.
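
To illustrate the effect of the kernel, the sketch below fits SVR with a linear and an RBF kernel to a noisy sine curve. The data is synthetic and the `C` and `epsilon` values are illustrative assumptions, not tuned settings from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Synthetic non-linear target (assumption): a noisy sine curve
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

# Simple hold-out split: first 150 points to fit, last 50 to evaluate
X_fit, y_fit = X_demo[:150], y_demo[:150]
X_eval, y_eval = X_demo[150:], y_demo[150:]

rmse_by_kernel = {}
for kernel in ("linear", "rbf"):
    model = SVR(kernel=kernel, C=10, epsilon=0.1).fit(X_fit, y_fit)
    preds = model.predict(X_eval)
    rmse_by_kernel[kernel] = float(np.sqrt(mean_squared_error(y_eval, preds)))

for kernel, value in rmse_by_kernel.items():
    print(f"{kernel}: RMSE={value:.3f}")
```

On data like this, the RBF kernel can follow the curve while the linear kernel cannot, which is why both kernels appear in the search grid.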

## 4. **Neural Networks**
- **Deep Learning**: Given the complexity and high dimensionality of your data (especially with word embeddings and sentiment scores), neural networks might capture interactions that other models miss. A simple feedforward neural network or more complex architectures like convolutional neural networks (CNNs) might be effective, depending on the size and quality of your dataset.

## 5. **Ensemble Learning**
- **Stacking**: Combining the predictions of several models can often lead to better performance than any single model. For instance, stacking decision trees, SVM, and linear models might give you a more robust prediction.
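
Stacking is not part of the training cell below, so this is only a sketch of how it could look with scikit-learn's `StackingRegressor`. It assumes synthetic data, and the base models and the `Ridge` meta-model are illustrative choices.

```python
import math

from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Base models: a tree, an SVR (scaled inside its own pipeline), and a linear model
base_models = [
    ("tree", DecisionTreeRegressor(max_depth=5, random_state=42)),
    ("svr", make_pipeline(StandardScaler(), SVR(C=10))),
    ("ridge", Ridge(alpha=1.0)),
]

# The meta-model is trained on out-of-fold predictions of the base models
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=5)
stack.fit(X_tr, y_tr)

stack_pred = stack.predict(X_te)
stack_rmse = math.sqrt(mean_squared_error(y_te, stack_pred))
stack_r2 = r2_score(y_te, stack_pred)
print(f"Stacked RMSE: {stack_rmse:.3f}, R2: {stack_r2:.3f}")
```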

In [4]:
def adjusted_r_squared(r2, n, p):
    """Calculate Adjusted R-Squared.

    r2: plain R-squared, n: number of samples, p: number of features.
    Requires n > p + 1.
    """
    return 1 - (1 - r2) * ((n - 1) / (n - p - 1))

In [5]:
def train_and_evaluate_models_with_tuning(X_train, X_test, y_train, y_test, model_save_path='models'):
    models = {
        'Linear Regression': (LinearRegression(), {}),
        'Decision Tree': (DecisionTreeRegressor(random_state=42), {
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }),
        'Random Forest': (RandomForestRegressor(random_state=42), {
            'n_estimators': [50, 100, 200],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }),
        'Gradient Boosting': (GradientBoostingRegressor(random_state=42), {
            'n_estimators': [50, 100, 200],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 5, 10],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }),
        'Support Vector Machine': (SVR(), {
            'kernel': ['linear', 'rbf'],
            'C': [0.1, 1, 10],
            'epsilon': [0.1, 0.2, 0.5]
        }),
        'Neural Network': (MLPRegressor(random_state=42, max_iter=500), {
            'hidden_layer_sizes': [(50,), (100,), (50, 50)],
            'activation': ['relu', 'tanh'],
            'solver': ['adam', 'sgd'],
            'learning_rate_init': [0.001, 0.01]
        })
    }
    
    results = []  # To store results as a list of dictionaries
    
    n = len(y_test)  # Number of test samples (metrics below are computed on the test set)
    p = X_train.shape[1]  # Number of features

    # Create directory if it doesn't exist
    os.makedirs(model_save_path, exist_ok=True)

    for name, (model, param_grid) in models.items():
        print(f"Tuning {name}...")
        
        # Perform GridSearchCV for hyperparameter tuning
        grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
        grid_search.fit(X_train, y_train)
        
        # Get the best estimator (model) from GridSearch
        best_model = grid_search.best_estimator_
        y_pred = best_model.predict(X_test)
        
        # Calculate metrics
        rmse = math.sqrt(mean_squared_error(y_test, y_pred))
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        adj_r2 = adjusted_r_squared(r2, n, p)
        
        # Append results to the list as a dictionary
        results.append({
            'Model': name,
            'RMSE': rmse,
            'MAE': mae,
            'R2': r2,
            'Adjusted R2': adj_r2,
            'Best Params': grid_search.best_params_  # Save the best parameters
        })
        
        # Save the model with current date in filename
        current_date = datetime.now().strftime("%Y%m%d_%H%M%S")
        model_filename = f"{model_save_path}/model_{name.replace(' ', '_').lower()}_{current_date}.pkl"
        
        with open(model_filename, 'wb') as file:
            pickle.dump(best_model, file)
        
        print(f"Model {name} saved as {model_filename} with best params: {grid_search.best_params_}")
    
    # Convert the list of results into a DataFrame
    results_df = pd.DataFrame(results)
    
    # Print the results DataFrame
    print(results_df)
    
    return results_df

In [6]:
# Train and evaluate models
results_df = train_and_evaluate_models_with_tuning(X_train_scaled, X_test_scaled, y_train, y_test, model_save_path='../models')

# Display best model based on RMSE
best_model_row = results_df.loc[results_df['RMSE'].idxmin()]
best_model_name = best_model_row['Model']
best_model_rmse = best_model_row['RMSE']

print(f"\nBest model: {best_model_name} with RMSE: {best_model_rmse}")

Tuning Linear Regression...


Model Linear Regression saved as ../models/model_linear_regression_20240914_093546.pkl with best params: {}
Tuning Decision Tree...
Model Decision Tree saved as ../models/model_decision_tree_20240914_093557.pkl with best params: {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 2}
Tuning Random Forest...
Model Random Forest saved as ../models/model_random_forest_20240914_101805.pkl with best params: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Tuning Gradient Boosting...


KeyboardInterrupt: 