# Model Selection

## Introduction
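In this notebook, we train four regression models (XGBoost, Random Forest, Gradient Boosting, and a TensorFlow neural network) to predict Instagram post interactions. Each model is trained on the same data and evaluated on the same held-out test set using MAE, MSE, and R-squared, and the results are compared to select the best one.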

## Step 1: Load the Preprocessed Data
We start by loading the preprocessed data from the previous notebook.

In [1]:
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## Load the preprocessed data

In [2]:
X_train_scaled = pd.read_csv('../Data/Clean-Data/X_train_scaled.csv')
X_test_scaled = pd.read_csv('../Data/Clean-Data/X_test_scaled.csv')
y_train = pd.read_csv('../Data/Clean-Data/y_train.csv')
y_test = pd.read_csv('../Data/Clean-Data/y_test.csv')

In [1]:
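# Keep the feature names in a separate variable. Assigning them back to
# X_train_scaled would replace the DataFrame with a column Index.
# Run this cell only after the data has been loaded.
feature_names = X_train_scaled.columns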

## Convert target to 1D array

In [3]:
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

## Display basic info

In [4]:
print("Training Data Shape:", X_train_scaled.shape)
print("Testing Data Shape:", X_test_scaled.shape)

Training Data Shape: (5390, 1062)
Testing Data Shape: (1348, 1062)


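Loading Data: The training set has 5,390 rows and the test set has 1,348 rows, each with 1,062 scaled features. The targets are flattened to 1-D arrays, which is the shape the regressors below expect.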
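The `ravel()` step above matters because `pd.read_csv` returns a single-column DataFrame, whose values have shape `(n, 1)`, while the target passed to `fit` is expected to be 1-D with shape `(n,)`. A minimal sketch of the difference, using a made-up four-value column (`y_column` and its numbers are illustrative, not taken from the dataset):

```python
import numpy as np

# Hypothetical stand-in for y_train.values: one column, four rows.
y_column = np.array([[12.0], [40.0], [7.0], [95.0]])

# ravel() flattens the (n, 1) column into a 1-D array of length n.
y_flat = y_column.ravel()

print("Column shape:", y_column.shape)
print("Flattened shape:", y_flat.shape)
```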

## Step 2: Model 1 - XGBoost
We first test XGBoost, a gradient-boosted tree library with built-in L1 and L2 regularization (`reg_alpha` and `reg_lambda` below).

In [5]:
import xgboost as xgb

# Initialize the XGBoost model
xgb_model = xgb.XGBRegressor(
    n_estimators=50,
    learning_rate=0.2,
    max_depth=3,
    reg_alpha=6.0,
    reg_lambda=10.0,
    random_state=42
)

# Train the model on the training data
xgb_model.fit(X_train_scaled, y_train)

# Make predictions on the testing data
y_pred = xgb_model.predict(X_test_scaled)

# Calculate and print regression metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("XGBoost Model Performance:")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared: {r2}")


XGBoost Model Performance:
Mean Absolute Error (MAE): 28.647811002487686
Mean Squared Error (MSE): 1524.4347323018417
R-squared: 0.38148343563079834


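XGBoost: XGBoost is an ensemble method based on gradient boosting: trees are added one at a time, each fitted to reduce the error that remains. With these settings it reaches a test MAE of about 28.6 and an R-squared of about 0.38.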
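To make the three reported numbers easier to interpret, the sketch below computes MAE, MSE, and R-squared directly with NumPy, plus RMSE, which the notebook does not report. The helper name `regression_metrics` and the toy values are assumptions for illustration; the notebook itself uses the `sklearn.metrics` functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for two 1-D sequences of equal length."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)

    # R-squared compares the model's squared error with the squared error
    # of always predicting the mean of y_true.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Toy values for illustration only.
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(metrics)
```

RMSE is in the same units as the target. For the XGBoost result above, the square root of the MSE is about 39, noticeably larger than the MAE of about 28.6, which suggests that a minority of posts have large errors.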

## Step 3: Model 2 - Random Forest
Next, we will test the Random Forest model, which is an ensemble of decision trees.

In [6]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest model
rf_model = RandomForestRegressor(
    n_estimators=50,
    max_depth=3,
    random_state=42
)

# Train the model on the training data
rf_model.fit(X_train_scaled, y_train)

# Make predictions on the testing data
y_pred = rf_model.predict(X_test_scaled)

# Calculate and print regression metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Random Forest Model Performance:")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared: {r2}")


Random Forest Model Performance:
Mean Absolute Error (MAE): 30.599498506537387
Mean Squared Error (MSE): 1706.0952522841314
R-squared: 0.3077774547952207


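Random Forest: Random Forest is an ensemble method that trains many decision trees on bootstrap samples of the data and averages their predictions, which reduces variance and helps control overfitting. Here it reaches a test MAE of about 30.6 and an R-squared of about 0.31, behind XGBoost; the shallow `max_depth=3` limit may be holding it back.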
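The averaging idea can be sketched in a few lines of NumPy. The code below is a simplified illustration, not scikit-learn's implementation: it uses depth-1 trees (stumps) on one synthetic feature and omits the per-split feature subsampling that random forests can also apply. The helpers `fit_stump` and `predict_stump` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a depth-1 regression tree on one feature by scanning every split threshold."""
    best = (np.inf, None, y.mean(), y.mean())  # (sse, threshold, left_value, right_value)
    for threshold in np.unique(x)[:-1]:  # skip the maximum so the right side is never empty
        left, right = y[x <= threshold], y[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    if threshold is None:  # no valid split was found, so predict a constant
        return np.full(x.shape, left_value)
    return np.where(x <= threshold, left_value, right_value)

# Synthetic data for illustration only.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) * 20 + rng.normal(0, 3, size=200)

# Bagging: fit each stump on a bootstrap sample (rows drawn with replacement).
stumps = []
for _ in range(50):
    idx = rng.integers(0, len(x), size=len(x))
    stumps.append(fit_stump(x[idx], y[idx]))

# The ensemble prediction is the average of the individual predictions.
individual_preds = np.array([predict_stump(s, x) for s in stumps])
ensemble_pred = individual_preds.mean(axis=0)

individual_mse = ((individual_preds - y) ** 2).mean(axis=1)
ensemble_mse = np.mean((y - ensemble_pred) ** 2)

print("Average MSE of a single stump:", individual_mse.mean())
print("MSE of the averaged ensemble:", ensemble_mse)
```

Because squared error is convex, the averaged prediction never has a higher MSE than the average MSE of its members, which is the basic argument for bagging.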

## Step 4: Model 3 - Gradient Boosting
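Next, we test scikit-learn's `GradientBoostingRegressor`. It uses the same boosting idea as XGBoost but, in this configuration, has no explicit L1/L2 regularization terms and uses a lower learning rate (0.1 instead of 0.2).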

In [7]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize the Gradient Boosting model
gb_model = GradientBoostingRegressor(
    n_estimators=50,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

# Train the model on the training data
gb_model.fit(X_train_scaled, y_train)

# Make predictions on the testing data
y_pred = gb_model.predict(X_test_scaled)

# Calculate and print regression metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Gradient Boosting Model Performance:")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared: {r2}")


Gradient Boosting Model Performance:
Mean Absolute Error (MAE): 28.901184853503146
Mean Squared Error (MSE): 1532.2172727112911
R-squared: 0.3783258356161827


Gradient Boosting: Like XGBoost, Gradient Boosting builds trees sequentially, with each new tree fitted to correct the errors of the ensemble built so far. Its test MAE of about 28.9 and R-squared of about 0.38 are close to XGBoost's.
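The sequential error-correction described above can be sketched for squared-error loss, where the quantity each new tree fits is simply the current residual. This is a simplified illustration with depth-1 trees on one synthetic feature, not the scikit-learn or XGBoost implementation; `fit_stump`, `predict_stump`, and the data are hypothetical, while `learning_rate` and `n_estimators` mirror the values used in the cell above.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a depth-1 regression tree on one feature by scanning every split threshold."""
    best = (np.inf, None, y.mean(), y.mean())  # (sse, threshold, left_value, right_value)
    for threshold in np.unique(x)[:-1]:  # skip the maximum so the right side is never empty
        left, right = y[x <= threshold], y[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    if threshold is None:  # no valid split was found, so predict a constant
        return np.full(x.shape, left_value)
    return np.where(x <= threshold, left_value, right_value)

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) * 20 + rng.normal(0, 3, size=200)

learning_rate = 0.1
n_estimators = 50

# Start from a constant prediction; the mean minimizes squared error.
prediction = np.full_like(y, y.mean())
train_mse = [np.mean((y - prediction) ** 2)]

for _ in range(n_estimators):
    # For the loss 0.5 * (y - f)^2, the negative gradient is the residual y - f.
    residuals = y - prediction
    stump = fit_stump(x, residuals)
    # Add a shrunken copy of the new stump to the running prediction.
    prediction += learning_rate * predict_stump(stump, x)
    train_mse.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {train_mse[0]:.2f}")
print(f"Training MSE after {n_estimators} rounds: {train_mse[-1]:.2f}")
```

The training error can only go down from round to round in this setup, which is why boosting relies on a small learning rate, shallow trees, and a limited number of rounds to avoid overfitting.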

## Step 5: Model 4 - TensorFlow Neural Network
Finally, we will test a neural network model using TensorFlow.

In [8]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Define the model architecture
def create_model(input_dim):
    model = Sequential([
        Dense(64, activation='relu', input_shape=(input_dim,)),
        Dropout(0.2),
        Dense(32, activation='relu'),
        Dropout(0.2),
        Dense(16, activation='relu'),
        Dense(1)
    ])
    return model

# Initialize and compile the model
input_dim = X_train_scaled.shape[1]
tf_model = create_model(input_dim)
tf_model.compile(optimizer='adam', loss='mse')

# Train the model
history = tf_model.fit(
    X_train_scaled,
    y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

# Make predictions on the testing data
y_pred = tf_model.predict(X_test_scaled).flatten()

# Calculate and print regression metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("TensorFlow Model Performance:")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared: {r2}")


43/43 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
TensorFlow Model Performance:
Mean Absolute Error (MAE): 30.772515076204293
Mean Squared Error (MSE): 1741.0597383077863
R-squared: 0.29359114170074463


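Neural Networks: This model is a small feedforward network with three hidden layers (64, 32, and 16 units) and dropout after the first two. Such networks can learn non-linear patterns in high-dimensional features, but here it has the weakest test scores of the four models (MAE of about 30.8, R-squared of about 0.29).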
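One way to put this result in context is to compare the size of the network with the amount of training data. The sketch below counts the trainable parameters implied by the layer widths in `create_model` and the 1,062 input features, and estimates how many rows remain for weight updates once `validation_split=0.2` holds out a fifth of the 5,390 training rows. The variable names are illustrative.

```python
# Layer widths taken from create_model; 1062 is the feature count printed in Step 1.
layer_sizes = [1062, 64, 32, 16, 1]

# A Dense layer has (inputs * units) weights plus one bias per unit.
# Dropout layers add no trainable parameters.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)

# validation_split=0.2 holds out 20% of the 5390 training rows.
n_rows = 5390
n_fit_rows = n_rows - n_rows // 5

print("Parameters per Dense layer:", params_per_layer)
print("Total trainable parameters:", total_params)
print("Rows used for weight updates:", n_fit_rows)
```

With roughly 70,000 parameters fitted on about 4,300 rows for 100 epochs, overfitting is a plausible explanation for the weaker test scores, although the notebook does not inspect the `history` object to confirm it.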

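## Conclusion
In this notebook, we trained and evaluated four models on the same test set. XGBoost performed best (MAE of about 28.6, R-squared of about 0.38), with Gradient Boosting close behind; Random Forest and the neural network trailed. In the next notebook, we will focus on hyperparameter tuning to further improve the performance of the best model.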
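The test metrics printed in Steps 2 to 5 can be collected into one ranking. The sketch below copies those values, rounded to three decimals, into a plain dictionary and sorts by MAE; the `results` structure and the choice of MAE as the sort key are assumptions made for illustration.

```python
# Test-set metrics copied from the outputs in this notebook, rounded to three decimals.
results = {
    "XGBoost": {"mae": 28.648, "mse": 1524.435, "r2": 0.381},
    "Random Forest": {"mae": 30.599, "mse": 1706.095, "r2": 0.308},
    "Gradient Boosting": {"mae": 28.901, "mse": 1532.217, "r2": 0.378},
    "TensorFlow NN": {"mae": 30.773, "mse": 1741.060, "r2": 0.294},
}

# Lower MAE is better, so an ascending sort puts the best model first.
ranked = sorted(results.items(), key=lambda item: item[1]["mae"])

print(f"{'Model':<20}{'MAE':>8}{'MSE':>12}{'R2':>8}")
for name, m in ranked:
    print(f"{name:<20}{m['mae']:>8.3f}{m['mse']:>12.3f}{m['r2']:>8.3f}")

best_model = ranked[0][0]
print("Best model by MAE:", best_model)
```

Sorting by MSE or by R-squared gives the same order for these four results, so the choice of metric does not change the selection here.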