# Model Evaluation and Hyperparameter Tuning

## Introduction
In this notebook, we will focus on hyperparameter tuning for the best-performing models from the previous notebook. We will use Grid Search to find the optimal parameters and evaluate the tuned models on the test set.

## Step 1: Load the Preprocessed Data
We begin by loading the preprocessed data once again.

In [22]:
import pandas as pd
import pickle
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the preprocessed data

In [23]:
X_train_scaled = pd.read_csv('../Data/Clean-Data/X_train_scaled.csv')
X_test_scaled = pd.read_csv('../Data/Clean-Data/X_test_scaled.csv')
y_train = pd.read_csv('../Data/Clean-Data/y_train.csv')
y_test = pd.read_csv('../Data/Clean-Data/y_test.csv')

# Convert target to 1D array

In [24]:
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

Loading Data: The preprocessed data is reloaded for hyperparameter tuning.

## Step 2: Hyperparameter Tuning for XGBoost
We will perform Grid Search to find the best parameters for the XGBoost model.

In [25]:
import xgboost as xgb
from sklearn.metrics import make_scorer, mean_squared_error

# Initialize the XGBoost model
xgb_model = xgb.XGBRegressor(random_state=42)
# Train the XGBoost model
xgb_model.fit(X_train_scaled, y_train)

# Make predictions on the test data
y_pred_xgb = xgb_model.predict(X_test_scaled)

# Calculate the metrics
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

# Define the parameter grid to search over
param_grid = {
    'n_estimators': [50],# 100, 200, 500],
    'max_depth': [3],# 5, 7, 10],
    'learning_rate': [0.2]# , 0.1, 0.2]
}

# Define the scoring metric
scoring = make_scorer(mean_squared_error, greater_is_better=False)

# Initialize Grid Search with Cross-Validation
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, scoring=scoring, cv=5, n_jobs=-1, verbose=1)

# Fit the grid search to the data
grid_search.fit(X_train_scaled, y_train)

# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score (Negative MSE):", best_score)


Fitting 5 folds for each of 1 candidates, totalling 5 fits
Best Parameters: {'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 50}
Best Score (Negative MSE): -1468.939231218048


Grid Search: We use Grid Search to explore different hyperparameter combinations and find the best configuration for the XGBoost model.

## Step 3: Evaluate the Tuned XGBoost Model
After tuning, we evaluate the model on the test set.

In [26]:
# Train the XGBoost model
xgb_model.fit(X_train_scaled, y_train)

# Make predictions on the test data
y_pred_xgb = xgb_model.predict(X_test_scaled)

# Calculate the metrics
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)


print("XGBoost Model Performance with Best Parameters:")
print(f"Mean Absolute Error (MAE): {mae_xgb}")
print(f"Mean Squared Error (MSE): {mse_xgb}")
print(f"R-squared: {r2_xgb}")



XGBoost Model Performance with Best Parameters:
Mean Absolute Error (MAE): 29.637879615610892
Mean Squared Error (MSE): 1638.6236042712737
R-squared: 0.3351529836654663


Evaluation: The tuned model is evaluated on the test set to compare its performance with the untuned version.

## Step 4: Hyperparameter Tuning for Random Forest
Similarly, we perform hyperparameter tuning for the Random Forest model.

In [27]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest model
rf_model = RandomForestRegressor(random_state=42)

# Define the parameter grid to search over
param_grid = {
    'n_estimators': [50], #100, 200, 500, 1000],
    'max_depth': [5], #10, 15, 20],
    'min_samples_split': [2]#, 5, 10]
}

# Initialize Grid Search with Cross-Validation
grid_search_rf = GridSearchCV(estimator=rf_model, param_grid=param_grid, scoring=scoring, cv=5, n_jobs=-1, verbose=1)

# Fit the grid search to the data
grid_search_rf.fit(X_train_scaled, y_train)

# Get the best parameters and the best score
best_params_rf = grid_search_rf.best_params_
best_score_rf = grid_search_rf.best_score_




print("Best Parameters for Random Forest:", best_params_rf)
print("Best Score (Negative MSE):", best_score_rf)


Fitting 5 folds for each of 1 candidates, totalling 5 fits
Best Parameters for Random Forest: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 50}
Best Score (Negative MSE): -1545.43907743731


Grid Search for Random Forest: Similar to XGBoost, we tune the Random Forest model using Grid Search.

## Step 5: Evaluate the Tuned Random Forest Model
Finally, we evaluate the tuned Random Forest model on the test set.

In [28]:
best_rf_model = grid_search_rf.best_estimator_

# Make predictions on the test data using the best model
y_pred_rf = best_rf_model.predict(X_test_scaled)

# Calculate the metrics
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Model Performance with Best Parameters:")
print(f"Mean Absolute Error (MAE): {mae_rf}")
print(f"Mean Squared Error (MSE): {mse_rf}")
print(f"R-squared: {r2_rf}")


Random Forest Model Performance with Best Parameters:
Mean Absolute Error (MAE): 29.617878327337333
Mean Squared Error (MSE): 1629.475794837296
R-squared: 0.33886464982436715


## Test tensor flow

In [29]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Define the model architecture
def create_model(input_dim, neurons_layer1=64, neurons_layer2=32, neurons_layer3=16, dropout_rate=0.2):
    model = Sequential([
        Dense(neurons_layer1, activation='relu', input_shape=(input_dim,)),
        Dropout(dropout_rate),
        Dense(neurons_layer2, activation='relu'),
        Dropout(dropout_rate),
        Dense(neurons_layer3, activation='relu'),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Define hyperparameter combinations to try
param_combinations = [
    {'neurons_layer1': 64, 'neurons_layer2': 32, 'neurons_layer3': 16, 'dropout_rate': 0.2, 'batch_size': 32, 'epochs': 100},
    {'neurons_layer1': 128, 'neurons_layer2': 64, 'neurons_layer3': 32, 'dropout_rate': 0.3, 'batch_size': 64, 'epochs': 150},
    {'neurons_layer1': 32, 'neurons_layer2': 16, 'neurons_layer3': 8, 'dropout_rate': 0.1, 'batch_size': 16, 'epochs': 50}
]

# Function to evaluate model
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test).flatten()
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mae, mse, r2

# Prepare data (assuming X_train_scaled, y_train, X_test_scaled, y_test are already defined)
input_dim = X_train_scaled.shape[1]

# Manually search through hyperparameters
best_model = None
best_mse = float('inf')
best_params = None

for params in param_combinations:
    print(f"Training with parameters: {params}")
    
    # Create and train the model
    model = create_model(input_dim, **{k: v for k, v in params.items() if k != 'batch_size' and k != 'epochs'})
    history = model.fit(
        X_train_scaled, y_train,
        epochs=params['epochs'],
        batch_size=params['batch_size'],
        validation_split=0.2,
        verbose=0
    )
    
    # Evaluate the model
    mae_tf, mse_tf, r2_tf = evaluate_model(model, X_test_scaled, y_test)
     
    # Check if this is the best model so far
    if mse_tf < best_mse:
        best_mse = mse_tf
        best_model = model
        best_params = params

print(f"Best parameters: {best_params}")

# Evaluate the best model
mae_tf, mse_tf, r2_tf = evaluate_model(best_model, X_test_scaled, y_test)

print("TensorFlow Model Performance with Best Parameters:")
print(f"Mean Absolute Error (MAE): {mae_tf}")
print(f"Mean Squared Error (MSE): {mse_tf}")
print(f"R-squared: {r2_tf}")


# Assuming best_model is the best TensorFlow model after manual tuning
best_tf_model = best_model  # Assign best_model to best_tf_model



Training with parameters: {'neurons_layer1': 64, 'neurons_layer2': 32, 'neurons_layer3': 16, 'dropout_rate': 0.2, 'batch_size': 32, 'epochs': 100}


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


[1m43/43[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step
Training with parameters: {'neurons_layer1': 128, 'neurons_layer2': 64, 'neurons_layer3': 32, 'dropout_rate': 0.3, 'batch_size': 64, 'epochs': 150}


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


[1m43/43[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step
Training with parameters: {'neurons_layer1': 32, 'neurons_layer2': 16, 'neurons_layer3': 8, 'dropout_rate': 0.1, 'batch_size': 16, 'epochs': 50}


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


[1m43/43[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step 
Best parameters: {'neurons_layer1': 64, 'neurons_layer2': 32, 'neurons_layer3': 16, 'dropout_rate': 0.2, 'batch_size': 32, 'epochs': 100}
[1m43/43[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
TensorFlow Model Performance with Best Parameters:
Mean Absolute Error (MAE): 30.63292102077948
Mean Squared Error (MSE): 1712.349695024922
R-squared: 0.305239737033844


Evaluation: The tuned Random Forest model is evaluated to determine if the tuning improved its performance.

## Step 5: Model 4 - XGBoost
We will now test the XGBoost model.

In [30]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## Step 4: Hyperparameter Tuning for Gradient Boosting
# Here, we perform hyperparameter tuning for the Gradient Boosting model using Grid Search with Cross-Validation.

# Initialize the Gradient Boosting model without setting specific hyperparameters
gb_model = GradientBoostingRegressor(random_state=42)

# Define the parameter grid to search over
param_grid = {
    'n_estimators': [50],# 100, 200],  # Number of boosting stages to be run
    'learning_rate': [0.01],# 0.1, 0.2],  # Learning rate shrinks the contribution of each tree
    'min_samples_split': [2],#, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1]#, 2, 4]  # Minimum number of samples required to be at a leaf node
}

# Initialize Grid Search with Cross-Validation
grid_search_gb = GridSearchCV(estimator=gb_model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5, n_jobs=-1, verbose=1)

# Fit the grid search to the data
grid_search_gb.fit(X_train_scaled, y_train)

# Get the best parameters and the best score
best_params_gb = grid_search_gb.best_params_
best_score_gb = grid_search_gb.best_score_

print("Best Parameters for Gradient Boosting:", best_params_gb)
print("Best Score (Negative MSE):", best_score_gb)

## Step 5: Evaluate the Tuned Gradient Boosting Model
# After tuning, we evaluate the performance of the best Gradient Boosting model on the test set.

# Retrieve the best Gradient Boosting model
best_gb_model = grid_search_gb.best_estimator_

# Train the Gradient Boosting model with the best parameters on the entire training data
best_gb_model.fit(X_train_scaled, y_train)

# Make predictions on the testing data
y_pred_gb = best_gb_model.predict(X_test_scaled)

# Calculate and print regression metrics
mae_gb = mean_absolute_error(y_test, y_pred_gb)
mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)

print("Gradient Boosting Model Performance with Best Parameters:")
print(f"Mean Absolute Error (MAE): {mae_gb}")
print(f"Mean Squared Error (MSE): {mse_gb}")
print(f"R-squared: {r2_gb}")


Fitting 5 folds for each of 1 candidates, totalling 5 fits
Best Parameters for Gradient Boosting: {'learning_rate': 0.01, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Best Score (Negative MSE): -1980.6512935756223
Gradient Boosting Model Performance with Best Parameters:
Mean Absolute Error (MAE): 34.5589328275395
Mean Squared Error (MSE): 1955.156942012443
R-squared: 0.2067244118622915


## Step 6: Comparison of models
Compare all models

In [31]:
# Assuming you've completed training and evaluating all models, including the TensorFlow model
# Now assign best_model to best_tf_model
best_tf_model = best_model

## Step 6: Model Comparison
# Compare the performance of all the models and select the best one.

# Store the performance metrics for each model in a dictionary
model_performance = {
    'Random Forest': {
        'MAE': mae_rf,
        'MSE': mse_rf,
        'R2': r2_rf,
        'Model': best_rf_model  # Ensure best_rf_model is defined
    },
    'Gradient Boosting': {
        'MAE': mae_gb,
        'MSE': mse_gb,
        'R2': r2_gb,
        'Model': best_gb_model  # Ensure best_gb_model is defined
    },
    'TensorFlow Neural Network': {
        'MAE': mae_tf,
        'MSE': mse_tf,
        'R2': r2_tf,
        'Model': best_tf_model  # Now correctly assigned
    },
    'XGBoost': {
        'MAE': mae_xgb,
        'MSE': mse_xgb,
        'R2': r2_xgb,
        'Model': xgb_model  # Ensure xgb_model is defined
    }
}

# Print the performance of each model
for model_name, metrics in model_performance.items():
    print(f"{model_name} Performance:")
    print(f"Mean Absolute Error (MAE): {metrics['MAE']}")
    print(f"Mean Squared Error (MSE): {metrics['MSE']}")
    print(f"R-squared: {metrics['R2']}")
    print()

# Determine the best model based on R-squared or MSE
best_model_name = max(model_performance, key=lambda name: model_performance[name]['R2'])
best_final_model = model_performance[best_model_name]['Model']

print(f"The best model is: {best_model_name}")


Random Forest Performance:
Mean Absolute Error (MAE): 29.617878327337333
Mean Squared Error (MSE): 1629.475794837296
R-squared: 0.33886464982436715

Gradient Boosting Performance:
Mean Absolute Error (MAE): 34.5589328275395
Mean Squared Error (MSE): 1955.156942012443
R-squared: 0.2067244118622915

TensorFlow Neural Network Performance:
Mean Absolute Error (MAE): 30.63292102077948
Mean Squared Error (MSE): 1712.349695024922
R-squared: 0.305239737033844

XGBoost Performance:
Mean Absolute Error (MAE): 29.637879615610892
Mean Squared Error (MSE): 1638.6236042712737
R-squared: 0.3351529836654663

The best model is: Random Forest


## Step 7: Save the Model and Scaler
Save the best model to be used.

In [32]:
## Step 7: Save the Model and Scaler
# Save the best model and the scaler for future use.

# Concatenate the features (X) data
X_combined = pd.concat([X_train_scaled, X_test_scaled], axis=0).reset_index(drop=True)

# Save the combined features
X_combined.to_csv('../Data/Clean-Data/X_combined.csv', index=False)
print("Combined features saved successfully!")

# Initialize and fit the scaler
scaler = MinMaxScaler()
scaler.fit(X_combined)
scaled_features = scaler.fit_transform(X_combined)
scaler.transform(X_combined)

# Save the scaler
with open('scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)
print("Scaler saved successfully!")

# Save the best model
with open('model.pkl', 'wb') as file:
    pickle.dump(best_final_model, file)
print(f"Best model ({best_model_name}) saved as 'best_model.pkl'")

# Verify that the model can be loaded
with open('model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)
print("Model loaded successfully for verification!")


Combined features saved successfully!
Scaler saved successfully!
Best model (Random Forest) saved as 'best_model.pkl'
Model loaded successfully for verification!


# Conclusion
In this notebook, we successfully performed hyperparameter tuning for two of our best models and evaluated their performance on the test set. The results of this tuning will help us choose the final model for predicting Instagram post interactions.