# Predictive Modeling on Customer Spending

This project applies numeric prediction techniques to build models that predict customer spending. The dataset records whether each consumer made a purchase in response to a test mailing of a catalog and, if so, how much money the consumer spent.

## Part 1 
This part aims to build numeric prediction models that predict Spending from the other available customer information (excluding the Purchase attribute from the inputs). Several models were applied: linear regression, k-NN, regression tree, SVM regression, neural network, and two ensemble methods (random forest and gradient boosting). I explore each technique and present the best result (the best predictive model).

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import root_mean_squared_error, make_scorer

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

In [8]:
# Load Data

df = pd.read_excel("/Users/shawnwang/Desktop/Predictive Analytics/HW3/HW3.xlsx", sheet_name="All Data")
print(df.head())

   sequence_number  US  source_a  source_c  source_b  source_d  source_e  \
0                1   1         0         0         1         0         0   
1                2   1         0         0         0         0         1   
2                3   1         0         0         0         0         0   
3                4   1         0         1         0         0         0   
4                5   1         0         1         0         0         0   

   source_m  source_o  source_h  ...  source_x  source_w  Freq  \
0         0         0         0  ...         0         0     2   
1         0         0         0  ...         0         0     0   
2         0         0         0  ...         0         0     2   
3         0         0         0  ...         0         0     1   
4         0         0         0  ...         0         0     1   

   last_update_days_ago  1st_update_days_ago  Web order  Gender=male  \
0                  3662                 3662          1            0   
1 

In [161]:
# Data Processing

df = df.drop(columns=["sequence_number"])

X = df.iloc[:, :-2]
y = df["Spending"]

# 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify numeric columns
numeric_cols = X.select_dtypes(include=["int64", "float64"]).columns
binary_cols = [col for col in numeric_cols if X[col].nunique() == 2]
continuous_cols = [col for col in numeric_cols if col not in binary_cols]

# Standardize continuous columns only
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[continuous_cols] = scaler.fit_transform(X_train[continuous_cols])
X_test_scaled[continuous_cols] = scaler.transform(X_test[continuous_cols])

# Optional log transformation since spending is skewed (disabled; the models below use raw Spending)
# y_train_log = np.log1p(y_train)
# y_test_log = np.log1p(y_test)
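
The commented-out log transform above could instead be applied inside the estimator, so that predictions come back on the original Spending scale and RMSE stays comparable across models. A minimal sketch, assuming scikit-learn's `TransformedTargetRegressor`; the `X_demo` and `y_demo` arrays are synthetic placeholders, not the catalog data:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, right-skewed, non-negative target (placeholder for Spending)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.expm1(np.abs(X_demo[:, 0] + 0.1 * rng.normal(size=200)))

# The regressor is fit on log1p(y); predict() maps back with expm1
log_lr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_lr.fit(X_demo, y_demo)

pred_demo = log_lr.predict(X_demo)  # already on the original scale
print(pred_demo[:3])
```

Because the inverse transform is part of the estimator, the object can be passed to `cross_val_score` or `GridSearchCV` like any other regressor.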

#### Linear Regression

In [164]:
lr_model = LinearRegression()

# Nested CV setup (Inner CV for GridSearch and Outer CV for Model Evaluation)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics (RMSE scorer)
rmse_scorer = make_scorer(root_mean_squared_error)

# Perform cross-validation (outer CV only; linear regression has no hyperparameters to tune)
lr_scores = cross_val_score(lr_model, X_train_scaled, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"Linear Regression Nested CV RMSE: {lr_scores.mean():.4f} ± {lr_scores.std():.4f}")

# Train the best model on full training set
lr_model.fit(X_train_scaled, y_train)

# Make predictions on test data
y_pred_lr = lr_model.predict(X_test_scaled)
print(f"Linear Regression Test RMSE: {root_mean_squared_error(y_test, y_pred_lr):.4f}")

Linear Regression Nested CV RMSE: 126.1608 ± 15.9257
Linear Regression Test RMSE: 129.3107


#### k-NN

In [167]:
knn = KNeighborsRegressor()

# Define Hyperparameter Grid for Tuning
knn_params = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"]
}

# Nested CV setup (Inner CV for GridSearch and Outer CV for Model Evaluation)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics (RMSE scorer)
rmse_scorer = make_scorer(root_mean_squared_error)

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
knn_grid = GridSearchCV(knn, knn_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
knn_scores = cross_val_score(knn_grid, X_train_scaled, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"k-NN Nested CV RMSE: {knn_scores.mean():.4f} ± {knn_scores.std():.4f}")

# Train the best model on full training set
knn_grid.fit(X_train_scaled, y_train)

# Make predictions on test data
y_pred_knn = knn_grid.predict(X_test_scaled)
print(f"k-NN Test RMSE: {root_mean_squared_error(y_test, y_pred_knn):.4f}")

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
k-NN Nested CV RMSE: 144.8318 ± 22.1507
Fitting 5 folds for each of 8 candidates, totalling 40 fits
k-NN Test RMSE: 159.6923
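
One caveat on the scorer used in these cells: `make_scorer` defaults to `greater_is_better=True`, and `GridSearchCV` keeps the candidate with the highest score, so a raw RMSE scorer steers the inner search toward the candidate with the largest error. The RMSE values reported for the tuned models may therefore be pessimistic. A minimal sketch of the difference on synthetic data follows; `rmse` is a local stand-in for `root_mean_squared_error`, and the data and grid are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor


def rmse(y_true, y_pred):
    """Root mean squared error (stand-in for root_mean_squared_error)."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid = {"n_neighbors": [1, 5, 50]}

# Default greater_is_better=True: the search maximizes RMSE
maximizing = GridSearchCV(KNeighborsRegressor(), grid, scoring=make_scorer(rmse), cv=cv)
maximizing.fit(X_demo, y_demo)

# greater_is_better=False: scores are negated, so the search minimizes RMSE
minimizing = GridSearchCV(
    KNeighborsRegressor(),
    grid,
    scoring=make_scorer(rmse, greater_is_better=False),
    cv=cv,
)
minimizing.fit(X_demo, y_demo)

print("picked when maximizing:", maximizing.best_params_, round(maximizing.best_score_, 2))
print("picked when minimizing:", minimizing.best_params_, round(-minimizing.best_score_, 2))
```

With `greater_is_better=False` (or the built-in `"neg_root_mean_squared_error"` scoring string), the scores come back negated, so they need a sign flip before being printed as RMSE.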


#### Regression Tree

In [132]:
tree = DecisionTreeRegressor(random_state=42)

# Define Hyperparameter Grid for Tuning
tree_params = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10]
}

# Nested CV setup (Inner CV for GridSearch and Outer CV for Model Evaluation)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics (RMSE scorer)
rmse_scorer = make_scorer(root_mean_squared_error)

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
tree_grid = GridSearchCV(tree, tree_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
tree_scores = cross_val_score(tree_grid, X_train, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"Regression Tree Nested CV RMSE: {tree_scores.mean():.4f} ± {tree_scores.std():.4f}")

# Train the best model on full training set
tree_grid.fit(X_train, y_train)

# Make predictions on test data
y_pred_tree = tree_grid.predict(X_test)
print(f"Regression Tree Test RMSE: {root_mean_squared_error(y_test, y_pred_tree):.4f}")

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Regression Tree Nested CV RMSE: 181.7520 ± 27.8873
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Regression Tree Test RMSE: 184.3340
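
The `cross_val_score(GridSearchCV(...))` call above hides the two loops of nested CV. Unrolled, it looks roughly like the sketch below, which also exposes the hyperparameters chosen in each outer fold. The synthetic data and the small grid are illustrative assumptions, and the built-in `"neg_root_mean_squared_error"` scorer is used here in place of the notebook's custom scorer:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid = {"max_depth": [2, 4, None]}

outer_rmse, chosen = [], []
for train_idx, val_idx in outer_cv.split(X_demo):
    # Inner loop: tune hyperparameters on the outer-training portion only
    search = GridSearchCV(
        DecisionTreeRegressor(random_state=42),
        grid,
        scoring="neg_root_mean_squared_error",
        cv=inner_cv,
    )
    search.fit(X_demo[train_idx], y_demo[train_idx])

    # Outer loop: score the refit best model on the held-out fold
    pred = search.predict(X_demo[val_idx])
    outer_rmse.append(float(np.sqrt(np.mean((y_demo[val_idx] - pred) ** 2))))
    chosen.append(search.best_params_)

print(f"Nested CV RMSE: {np.mean(outer_rmse):.4f} ± {np.std(outer_rmse):.4f}")
print(chosen)
```

If the chosen parameters differ a lot between outer folds, the tuning itself is unstable, which is worth knowing before refitting on the full training set.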


#### SVM Regression

In [135]:
svr = SVR()

# Define Hyperparameter Grid for Tuning
svr_params = {
    "C": [0.1, 1, 10],
    "epsilon": [0.1, 0.2, 0.5],
    "kernel": ["rbf", "linear"]
}

# Nested CV setup (Inner CV for GridSearch and Outer CV for Model Evaluation)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics (RMSE scorer)
rmse_scorer = make_scorer(root_mean_squared_error)

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
svr_grid = GridSearchCV(svr, svr_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
svr_scores = cross_val_score(svr_grid, X_train_scaled, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"SVM Regression Nested CV RMSE: {svr_scores.mean():.4f} ± {svr_scores.std():.4f}")

# Train the best model on full training set
svr_grid.fit(X_train_scaled, y_train)

# Make predictions on test data
y_pred_svr = svr_grid.predict(X_test_scaled)
print(f"SVM Regression Test RMSE: {root_mean_squared_error(y_test, y_pred_svr):.4f}")

Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
SVM Regression Nested CV RMSE: 204.8357 ± 19.7009
Fitting 5 folds for each of 18 candidates, totalling 90 fits
SVM Regression Test RMSE: 215.2204
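
One possible contributor to the weak SVR result (an assumption, not verified on this dataset) is that the target is left in raw spending units while `C` and `epsilon` are searched over small values: `C` caps each dual coefficient, and `epsilon` is a tolerance measured in target units. Standardizing the target inside the estimator is one way to test this. A sketch on synthetic data with a deliberately large target scale:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def rmse(y_true, y_pred):
    """Root mean squared error on plain NumPy arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic smooth signal whose amplitude is in the hundreds (placeholder data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = 1000 * np.sin(X_demo[:, 0]) + rng.normal(0, 50, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same SVR settings, with and without target standardization
raw_svr = SVR(C=1, epsilon=0.1).fit(X_tr, y_tr)
scaled_svr = TransformedTargetRegressor(
    regressor=SVR(C=1, epsilon=0.1),
    transformer=StandardScaler(),
).fit(X_tr, y_tr)

rmse_raw = rmse(y_te, raw_svr.predict(X_te))
rmse_scaled = rmse(y_te, scaled_svr.predict(X_te))
print(f"raw target RMSE: {rmse_raw:.2f}, standardized target RMSE: {rmse_scaled:.2f}")
```

Widening the `C` grid to much larger values would be an alternative way to probe the same question.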


#### Neural Network

In [178]:
mlp = MLPRegressor(
    max_iter=2000,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=42
)

# Define Hyperparameter Grid for Tuning
mlp_params = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "activation": ["relu"],
    "alpha": [0.0001, 0.001],
    "learning_rate": ["constant", "adaptive"]
}

# Nested CV setup (Inner CV for GridSearch and Outer CV for Model Evaluation)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics (RMSE scorer)
rmse_scorer = make_scorer(root_mean_squared_error)

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
mlp_grid = GridSearchCV(mlp, mlp_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
mlp_scores = cross_val_score(mlp_grid, X_train_scaled, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"Neural Network Nested CV RMSE: {mlp_scores.mean():.4f} ± {mlp_scores.std():.4f}")

# Train the best model on full training set
mlp_grid.fit(X_train_scaled, y_train)

# Make predictions on test data
y_pred_mlp = mlp_grid.predict(X_test_scaled)
print(f"Neural Network Test RMSE: {root_mean_squared_error(y_test, y_pred_mlp):.4f}")

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Neural Network Nested CV RMSE: 124.5299 ± 15.2990
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Neural Network Test RMSE: 132.8257


#### Random Forest

In [141]:
rf = RandomForestRegressor(random_state=42)

# Define Hyperparameter Grid for Tuning
rf_params = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5]
}

# Nested CV setup (Inner CV for GridSearch and Outer CV for Model Evaluation)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics (RMSE scorer)
rmse_scorer = make_scorer(root_mean_squared_error)

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
rf_grid = GridSearchCV(rf, rf_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
rf_scores = cross_val_score(rf_grid, X_train, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"Random Forest Nested CV RMSE: {rf_scores.mean():.4f} ± {rf_scores.std():.4f}")

# Train the best model on full training set
rf_grid.fit(X_train, y_train)

# Make predictions on test data
y_pred_rf = rf_grid.predict(X_test)
print(f"Random Forest Test RMSE: {root_mean_squared_error(y_test, y_pred_rf):.4f}")

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Random Forest Nested CV RMSE: 131.7516 ± 19.1827
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Random Forest Test RMSE: 138.3140


#### Gradient Boosting

In [144]:
gb = GradientBoostingRegressor(random_state=42)

# Define Hyperparameter Grid for Tuning
gb_params = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5]
}

# Nested CV setup (Inner CV for GridSearch and Outer CV for Model Evaluation)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics (RMSE scorer)
rmse_scorer = make_scorer(root_mean_squared_error)

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
gb_grid = GridSearchCV(gb, gb_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
gb_scores = cross_val_score(gb_grid, X_train, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"Gradient Boosting Nested CV RMSE: {gb_scores.mean():.4f} ± {gb_scores.std():.4f}")

# Train the best model on full training set
gb_grid.fit(X_train, y_train)

# Make predictions on test data
y_pred_gb = gb_grid.predict(X_test)
print(f"Gradient Boosting Test RMSE: {root_mean_squared_error(y_test, y_pred_gb):.4f}")

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Gradient Boosting Nested CV RMSE: 136.1910 ± 21.8745
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Gradient Boosting Test RMSE: 139.7450


### Summary
Comparing the nested CV RMSE across models, the Neural Network has the lowest value (124.53 ± 15.30), with a test RMSE of 132.83, so it is the preferred predictive model here. Linear Regression follows closely (CV RMSE 126.16, test RMSE 129.31); the gap between the two is well within one standard deviation of the fold scores. I also tried a simple ensemble that averages the Neural Network and Linear Regression predictions. Its test RMSE (130.70) falls between the two individual models, slightly better than the Neural Network and slightly worse than Linear Regression:

In [152]:
# Step 1: Get predictions from both models
y_pred_lr = lr_model.predict(X_test_scaled)
y_pred_nn = mlp_grid.predict(X_test_scaled)

# Step 2: Average predictions
y_pred_ensemble = (y_pred_lr + y_pred_nn)/2

# Step 3: Evaluate ensemble
from sklearn.metrics import root_mean_squared_error

ensemble_rmse = root_mean_squared_error(y_test, y_pred_ensemble)
print(f"Ensemble Test RMSE: {ensemble_rmse:.4f}")

Ensemble Test RMSE: 130.6990
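
The ensemble result is consistent with a general property of equal-weight averaging: the error of the averaged prediction is the average of the two error vectors, so by the triangle inequality its RMSE cannot exceed the mean of the two individual RMSEs (here 130.70 against (129.31 + 132.83) / 2 = 131.07). A small numeric check with made-up predictions, assumed only for illustration:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error on plain NumPy arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up target and two imperfect predictors (not the notebook's models)
rng = np.random.default_rng(0)
y_true = rng.gamma(shape=2.0, scale=50.0, size=500)
pred_a = y_true + rng.normal(0, 30, size=500)
pred_b = y_true + rng.normal(5, 35, size=500)
pred_avg = (pred_a + pred_b) / 2

rmse_a = rmse(y_true, pred_a)
rmse_b = rmse(y_true, pred_b)
rmse_avg = rmse(y_true, pred_avg)
print(f"A: {rmse_a:.2f}  B: {rmse_b:.2f}  average: {rmse_avg:.2f}")
```

How far below that bound the average lands depends on how correlated the two models' errors are; highly correlated errors leave little to gain.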


## Part 2
As a variation, I will create a separate “restricted” dataset (i.e., a subset of the original dataset), which includes only purchase records (i.e., where Purchase = 1), and build numeric prediction models to predict Spending for this restricted dataset.

In [79]:
df_p = pd.read_excel("HW3.xlsx", sheet_name="All Data")

# Filter to only include customers who made a purchase
df_p = df_p[df_p["Purchase"] == 1]

# Drop the sequence_number column (as before)
df_p = df_p.drop(columns=["sequence_number"])

X = df_p.iloc[:, :-2]
y = df_p["Spending"]

# 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify numeric columns
numeric_cols = X.select_dtypes(include=["int64", "float64"]).columns
binary_cols = [col for col in numeric_cols if X[col].nunique() == 2]
continuous_cols = [col for col in numeric_cols if col not in binary_cols]

# Standardize continuous columns only
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[continuous_cols] = scaler.fit_transform(X_train[continuous_cols])
X_test_scaled[continuous_cols] = scaler.transform(X_test[continuous_cols])
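
In both parts the scaler is fit once on the whole training set and the scaled data is then passed to cross-validation, so each validation fold has already influenced the scaling statistics. The effect is usually small, but wrapping the preprocessing in a pipeline refits the scaler inside every fold. A sketch, assuming a toy frame whose column names (`Freq`, `last_update_days_ago`, `US`) are borrowed from the dataset preview and whose values are synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer table
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "Freq": rng.integers(0, 10, size=120),
    "last_update_days_ago": rng.integers(1, 4000, size=120),
    "US": rng.integers(0, 2, size=120),
})
y_demo = 20.0 * X_demo["Freq"] + rng.normal(0, 10, size=120)

# Scale continuous columns only; binary columns pass through unchanged
continuous = ["Freq", "last_update_days_ago"]
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), continuous)],
    remainder="passthrough",
)
knn_pipe = Pipeline([
    ("prep", preprocess),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])

# The scaler is refit on the training portion of every fold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
neg_rmse = cross_val_score(
    knn_pipe, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
print(f"CV RMSE: {-neg_rmse.mean():.4f} ± {neg_rmse.std():.4f}")
```

The same pipeline object can be handed to `GridSearchCV`, with grid keys prefixed by the step name (for example `knn__n_neighbors`).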

In [81]:
# Nested CV setup (Inner CV for GridSearch and Outer CV for Model Evaluation)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics (RMSE scorer)
rmse_scorer = make_scorer(root_mean_squared_error)

#### Linear Regression

In [83]:
lr_model = LinearRegression()

# Perform cross-validation (outer CV only; linear regression has no hyperparameters to tune)
lr_scores = cross_val_score(lr_model, X_train_scaled, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"Linear Regression Nested CV RMSE: {lr_scores.mean():.4f} ± {lr_scores.std():.4f}")

# Train the best model on full training set
lr_model.fit(X_train_scaled, y_train)

# Make predictions on test data
y_pred_lr = lr_model.predict(X_test_scaled)
print(f"Linear Regression Test RMSE: {root_mean_squared_error(y_test, y_pred_lr):.4f}")

Linear Regression Nested CV RMSE: 157.8116 ± 20.8627
Linear Regression Test RMSE: 189.4194


#### k-NN

In [86]:
knn = KNeighborsRegressor()

# Define Hyperparameter Grid for Tuning
knn_params = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"]
}

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
knn_grid = GridSearchCV(knn, knn_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
knn_scores = cross_val_score(knn_grid, X_train_scaled, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"k-NN Nested CV RMSE: {knn_scores.mean():.4f} ± {knn_scores.std():.4f}")

# Train the best model on full training set
knn_grid.fit(X_train_scaled, y_train)

# Make predictions on test data
y_pred_knn = knn_grid.predict(X_test_scaled)
print(f"k-NN Test RMSE: {root_mean_squared_error(y_test, y_pred_knn):.4f}")

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
k-NN Nested CV RMSE: 171.0356 ± 21.6982
Fitting 5 folds for each of 8 candidates, totalling 40 fits
k-NN Test RMSE: 191.4186


#### Regression Tree

In [89]:
tree = DecisionTreeRegressor(random_state=42)

# Define Hyperparameter Grid for Tuning
tree_params = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10]
}

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
tree_grid = GridSearchCV(tree, tree_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
tree_scores = cross_val_score(tree_grid, X_train, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"Regression Tree Nested CV RMSE: {tree_scores.mean():.4f} ± {tree_scores.std():.4f}")

# Train the best model on full training set
tree_grid.fit(X_train, y_train)

# Make predictions on test data
y_pred_tree = tree_grid.predict(X_test)
print(f"Regression Tree Test RMSE: {root_mean_squared_error(y_test, y_pred_tree):.4f}")

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Regression Tree Nested CV RMSE: 222.3281 ± 11.3222
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Regression Tree Test RMSE: 208.1318


#### SVM Regression

In [92]:
svr = SVR()

# Define Hyperparameter Grid for Tuning
svr_params = {
    "C": [0.1, 1, 10],
    "epsilon": [0.1, 0.2, 0.5],
    "kernel": ["rbf", "linear"]
}

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
svr_grid = GridSearchCV(svr, svr_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
svr_scores = cross_val_score(svr_grid, X_train_scaled, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"SVM Regression Nested CV RMSE: {svr_scores.mean():.4f} ± {svr_scores.std():.4f}")

# Train the best model on full training set
svr_grid.fit(X_train_scaled, y_train)

# Make predictions on test data
y_pred_svr = svr_grid.predict(X_test_scaled)
print(f"SVM Regression Test RMSE: {root_mean_squared_error(y_test, y_pred_svr):.4f}")

Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
SVM Regression Nested CV RMSE: 210.4521 ± 33.9121
Fitting 5 folds for each of 18 candidates, totalling 90 fits
SVM Regression Test RMSE: 269.1500


#### Neural Network

In [95]:
mlp = MLPRegressor(
    max_iter=2000,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=42
)

# Define Hyperparameter Grid for Tuning
mlp_params = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "activation": ["relu"],
    "alpha": [0.0001, 0.001],
    "learning_rate": ["constant", "adaptive"]
}

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
mlp_grid = GridSearchCV(mlp, mlp_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
mlp_scores = cross_val_score(mlp_grid, X_train_scaled, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"Neural Network Nested CV RMSE: {mlp_scores.mean():.4f} ± {mlp_scores.std():.4f}")

# Train the best model on full training set
mlp_grid.fit(X_train_scaled, y_train)

# Make predictions on test data
y_pred_mlp = mlp_grid.predict(X_test_scaled)
print(f"Neural Network Test RMSE: {root_mean_squared_error(y_test, y_pred_mlp):.4f}")

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Neural Network Nested CV RMSE: 167.6999 ± 15.6801
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Neural Network Test RMSE: 191.1333


#### Random Forest

In [98]:
rf = RandomForestRegressor(random_state=42)

# Define Hyperparameter Grid for Tuning
rf_params = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5]
}

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
rf_grid = GridSearchCV(rf, rf_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
rf_scores = cross_val_score(rf_grid, X_train, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"Random Forest Nested CV RMSE: {rf_scores.mean():.4f} ± {rf_scores.std():.4f}")

# Train the best model on full training set
rf_grid.fit(X_train, y_train)

# Make predictions on test data
y_pred_rf = rf_grid.predict(X_test)
print(f"Random Forest Test RMSE: {root_mean_squared_error(y_test, y_pred_rf):.4f}")

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Random Forest Nested CV RMSE: 160.5808 ± 19.4517
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Random Forest Test RMSE: 183.5751


#### Gradient Boosting

In [101]:
gb = GradientBoostingRegressor(random_state=42)

# Define Hyperparameter Grid for Tuning
gb_params = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5]
}

# Perform GridSearchCV for hyperparameter tuning (Inner CV); Nested Cross-Validation (Outer CV)
gb_grid = GridSearchCV(gb, gb_params, scoring=rmse_scorer, cv=inner_cv, n_jobs=-1, verbose=1)
gb_scores = cross_val_score(gb_grid, X_train, y_train, cv=outer_cv, scoring=rmse_scorer)
print(f"Gradient Boosting Nested CV RMSE: {gb_scores.mean():.4f} ± {gb_scores.std():.4f}")

# Train the best model on full training set
gb_grid.fit(X_train, y_train)

# Make predictions on test data
y_pred_gb = gb_grid.predict(X_test)
print(f"Gradient Boosting Test RMSE: {root_mean_squared_error(y_test, y_pred_gb):.4f}")

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Gradient Boosting Nested CV RMSE: 171.3641 ± 17.7902
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Gradient Boosting Test RMSE: 188.3235


### Conclusion
Judging by the nested CV RMSE, the best-performing models are Linear Regression (157.81) and Random Forest (160.58). Random Forest also has the lowest test RMSE (183.58).

## Part 3
For each predictive modeling technique, this part discusses how predictive performance differs between the models built in Part 1 and Part 2: which models exhibit better predictive performance?

#### Conclusion
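The main difference between Part 1 and Part 2 is the data. Part 1 includes many records with zero spending, so the dataset is larger but its target is heavily skewed toward zero. Part 2 contains buyers only, so the dataset is smaller and spending varies more from record to record.
In general, every model has a higher RMSE in Part 2. This is likely due to the widely varying spending among buyers and the smaller training set. RMSE is expressed in units of Spending and the two parts have different target distributions, so the two sets of values are not measured on the same population. The per-model comparisons are listed below.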
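
One reason the Part 1 and Part 2 RMSE values are hard to compare directly: RMSE averages squared errors over all rows, so rows that are easy to predict pull it down. As a hypothetical extreme, if every non-buyer were predicted exactly and buyers had RMSE r, the overall RMSE would be r·sqrt(p), where p is the buyer share. The numbers below, including the 50% buyer share, are invented for illustration and are not taken from the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, buyer_share = 2000, 0.5  # hypothetical values
n_buyers = int(n_total * buyer_share)

# Prediction errors: noisy for buyers, exactly zero for non-buyers
buyer_errors = rng.normal(0, 180, size=n_buyers)
all_errors = np.concatenate([buyer_errors, np.zeros(n_total - n_buyers)])

rmse_buyers = float(np.sqrt(np.mean(buyer_errors ** 2)))
rmse_all = float(np.sqrt(np.mean(all_errors ** 2)))

print(f"buyers only: {rmse_buyers:.2f}")
print(f"all rows:    {rmse_all:.2f}")
print(f"r * sqrt(p): {rmse_buyers * np.sqrt(buyer_share):.2f}")
```

Real models do not predict non-buyers perfectly, so this only bounds the effect, but it shows why a lower Part 1 RMSE does not by itself mean better predictions for buyers.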

#### Linear Regression: Part2 has high variance that the linear model probably cannot capture well, leading to higher error.
Part1
Linear Regression Nested CV RMSE: 126.1608 ± 15.9257
Linear Regression Test RMSE: 129.3107

Part2
Linear Regression Nested CV RMSE: 157.8116 ± 20.8627
Linear Regression Test RMSE: 189.4194

#### k-NN: k-NN can struggle with noisy targets and high-dimensional feature spaces. The smaller, more variable Part2 data can make k-NN (which is distance-based) less stable.
Part1
k-NN Nested CV RMSE: 144.8318 ± 22.1507
k-NN Test RMSE: 159.6923

Part2
k-NN Nested CV RMSE: 171.0356 ± 21.6982
k-NN Test RMSE: 191.4186

#### Regression Tree: A single regression tree tends to overfit more on smaller, high-variance datasets, which is consistent with the Part2 result.
Part1
Regression Tree Nested CV RMSE: 181.7520 ± 27.8873
Regression Tree Test RMSE: 184.3340

Part2
Regression Tree Nested CV RMSE: 222.3281 ± 11.3222
Regression Tree Test RMSE: 208.1318

#### SVM Regression: As with the other models, SVM performed worse on Part2.
Part1
SVM Regression Nested CV RMSE: 204.8357 ± 19.7009
SVM Regression Test RMSE: 215.2204

Part2
SVM Regression Nested CV RMSE: 210.4521 ± 33.9121
SVM Regression Test RMSE: 269.1500

#### Neural Network: The Neural Network performed relatively well on Part1 but suffers more on Part2, where the smaller and less consistent dataset may prevent it from learning more complex patterns.
Part1
Neural Network Nested CV RMSE: 124.5299 ± 15.2990
Neural Network Test RMSE: 132.8257

Part2
Neural Network Nested CV RMSE: 167.6999 ± 15.6801
Neural Network Test RMSE: 191.1333

#### Random Forest: Random Forest degrades less than the Neural Network between Part1 and Part2 and has the lowest Part2 test RMSE. Averaging many trees may reduce variance and produce smoother, more generalizable predictions.
Part1
Random Forest Nested CV RMSE: 131.7516 ± 19.1827
Random Forest Test RMSE: 138.3140

Part2
Random Forest Nested CV RMSE: 160.5808 ± 19.4517
Random Forest Test RMSE: 183.5751

#### Gradient Boosting: May struggle with the high variance of Part2.
Part1
Gradient Boosting Nested CV RMSE: 136.1910 ± 21.8745
Gradient Boosting Test RMSE: 139.7450

Part2
Gradient Boosting Nested CV RMSE: 171.3641 ± 17.7902
Gradient Boosting Test RMSE: 188.3235
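
The per-model figures above can be gathered into one table to make the Part 1 to Part 2 change easier to scan. This is a convenience sketch in plain Python; the numbers are copied from the outputs reported above, and the variable names are illustrative:

```python
# (nested CV RMSE, test RMSE) as reported in the outputs above
part1 = {
    "Linear Regression": (126.1608, 129.3107),
    "k-NN": (144.8318, 159.6923),
    "Regression Tree": (181.7520, 184.3340),
    "SVM Regression": (204.8357, 215.2204),
    "Neural Network": (124.5299, 132.8257),
    "Random Forest": (131.7516, 138.3140),
    "Gradient Boosting": (136.1910, 139.7450),
}
part2 = {
    "Linear Regression": (157.8116, 189.4194),
    "k-NN": (171.0356, 191.4186),
    "Regression Tree": (222.3281, 208.1318),
    "SVM Regression": (210.4521, 269.1500),
    "Neural Network": (167.6999, 191.1333),
    "Random Forest": (160.5808, 183.5751),
    "Gradient Boosting": (171.3641, 188.3235),
}

# One row per model: CV and test RMSE for both parts plus the increase
rows = []
for name in part1:
    cv1, test1 = part1[name]
    cv2, test2 = part2[name]
    rows.append((name, cv1, cv2, cv2 - cv1, test1, test2, test2 - test1))

header = f"{'Model':<18}{'CV P1':>9}{'CV P2':>9}{'dCV':>8}{'Test P1':>9}{'Test P2':>9}{'dTest':>8}"
print(header)
for name, cv1, cv2, d_cv, t1, t2, d_t in sorted(rows, key=lambda r: r[2]):
    print(f"{name:<18}{cv1:>9.2f}{cv2:>9.2f}{d_cv:>8.2f}{t1:>9.2f}{t2:>9.2f}{d_t:>8.2f}")

best_cv_part1 = min(part1, key=lambda m: part1[m][0])
best_cv_part2 = min(part2, key=lambda m: part2[m][0])
print("Lowest CV RMSE, Part 1:", best_cv_part1)
print("Lowest CV RMSE, Part 2:", best_cv_part2)
```

Sorting by the Part 2 CV column puts the models in the order used for the Part 2 conclusion.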


#### Acknowledgement
This project is inspired by and recreated from assignments in the Predictive Analytics course taught by Professor Yicheng Song in the UMN MSBA program.