### PDT XGBoost Regression Model
#### Data:
* This model takes the labeled features of a pendant drop profile and predicts Beta as a function of them. The input features are Drop Height, Capillary Radius, R-s, R-e, and Smax. The current model is trained, tested, and tuned on the dataset (data/pdt-dataset.csv), which has 2500 entries.


In [2]:
import pandas as pd
import pickle
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, StratifiedKFold

from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor


In [3]:
# Scale every value by 10^6 and cast to int64 so the decimal digits survive as integers.
# StratifiedKFold (used for the grid search below) needs discrete target values, not continuous floats.
df = pd.read_csv('../data/pdt-dataset.csv').apply(lambda x: x*1000000).astype('int64')
df.head()

| | Drop Height | Capillary Radius | R-s | R-e | Smax | Beta |
|---|---|---|---|---|---|---|
| 0 | 2943094 | 682425 | 886383 | 1090550 | 3590000 | 400000 |
| 1 | 3033584 | 668466 | 892828 | 1087285 | 3689763 | 400000 |
| 2 | 3130900 | 665749 | 879073 | 1084616 | 3789526 | 400000 |
| 3 | 3231715 | 672651 | 892636 | 1084586 | 3889289 | 400000 |
| 4 | 3338794 | 690917 | 884025 | 1088827 | 3989053 | 400000 |
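
Why the integer cast matters can be checked in isolation. The following is a minimal sketch on synthetic stand-in data (`X_demo`, `y_float`, and `y_int` are illustrative names, not part of the notebook): `StratifiedKFold` rejects a continuous float target but accepts the same values once they are scaled to integers. The sketch rounds before casting because `astype('int64')` truncates toward zero; that rounding step is a suggestion and is not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: 4 distinct Beta-like values, 5 samples each.
rng = np.random.default_rng(0)
X_demo = rng.random((20, 3))
y_float = np.repeat([0.4, 0.5, 0.6, 0.7], 5)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# A float target with non-integer values is treated as "continuous",
# which StratifiedKFold does not support.
try:
    list(skf.split(X_demo, y_float))
    float_rejected = False
except ValueError as err:
    float_rejected = True
    print("Float target rejected:", err)

# Scaling by 10^6 and casting to int64 turns each value into a class label.
# Rounding first avoids off-by-one results from float truncation.
y_int = np.round(y_float * 1_000_000).astype("int64")
folds = list(skf.split(X_demo, y_int))
print("Number of folds:", len(folds))
```

Each test fold here contains one sample of every target value, which is the stratification the notebook relies on.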


In [4]:
X = df.drop('Beta', axis=1)
y = df['Beta']

# StratifiedKFold keeps the proportion of each Beta value roughly the same in every fold.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
y.head()

0    400000
1    400000
2    400000
3    400000
4    400000
Name: Beta, dtype: int64

In [5]:
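# Search a hyperparameter grid (a dict mapping parameter names to candidate
# values) with cross-validation and print the best configuration.
def grid_search(params, random=False):
    # Initialize XGB Regressor with objective='reg:squarederror' (squared-error loss)
    xgb = XGBRegressor(booster='gbtree', objective='reg:squarederror',
                       random_state=2)
    if random:
        # Randomized search samples n_iter=20 candidates from params
        grid = RandomizedSearchCV(xgb, params, cv=kfold, n_iter=20, n_jobs=-1)
    else:
        # Grid search tries every combination in params
        grid = GridSearchCV(xgb, params, cv=kfold, n_jobs=-1)
    grid.fit(X, y)
    best_params = grid.best_params_
    print("Best params:", best_params)
    # best_score_ is the mean cross-validated score of the best candidate
    # (R^2 by default for a regressor), not a score on the training set.
    best_score = grid.best_score_
    print("Training score: {:.3f}".format(best_score))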
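
The score that `grid_search` prints comes from `best_score_`, which is the mean cross-validated score under the estimator's default scorer (R² for regressors). To rank candidates by an error metric instead, a `scoring` argument can be passed to the search. The sketch below shows this on synthetic data; it uses `DecisionTreeRegressor` as a lightweight stand-in for `XGBRegressor` and a plain `KFold`, both of which are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(2)
X_demo = rng.random((200, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

search = GridSearchCV(
    DecisionTreeRegressor(random_state=2),  # stand-in for XGBRegressor
    {"max_depth": [2, 3, 5]},
    cv=KFold(n_splits=5, shuffle=True, random_state=2),
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, so RMSE is negated
)
search.fit(X_demo, y_demo)

# Flip the sign back to report a positive RMSE.
cv_rmse = -search.best_score_
print("Best params:", search.best_params_)
print("Cross-validated RMSE: {:.3f}".format(cv_rmse))
```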

In [6]:
grid_search(params={'n_estimators': [100, 200, 400, 800]})

Best params: {'n_estimators': 800}
Training score: 0.999


In [7]:
grid_search(params={'learning_rate':[0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]})

Best params: {'learning_rate': 0.1}
Training score: 0.999


In [8]:
grid_search(params={'max_depth':[2, 3, 5, 6, 8]})

Best params: {'max_depth': 5}
Training score: 0.999
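
The three searches above each vary one hyperparameter while the others stay at their defaults. A joint search over all three would instead evaluate every combination. As a rough sketch of the cost, `ParameterGrid` can count the candidates before anything is fitted (`joint_params` is an illustrative name, and this joint search was not run in the notebook):

```python
from sklearn.model_selection import ParameterGrid

# The same candidate values as the three separate searches, combined.
joint_params = {
    "n_estimators": [100, 200, 400, 800],
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5],
    "max_depth": [2, 3, 5, 6, 8],
}

n_candidates = len(ParameterGrid(joint_params))  # 4 * 7 * 5 combinations
n_folds = 5                                      # matches n_splits=5 in kfold
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)  # 140 700
```

With 700 model fits for the full grid, calling `grid_search(joint_params, random=True)` would be a cheaper option, since it samples only `n_iter=20` of the 140 candidates.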


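#### Tuned XGBoost Regressor
* n_estimators: 800
* learning_rate: 0.1
* max_depth: 5

Each hyperparameter was tuned on its own, with the others left at their defaults. Best cross-validation score from the searches above (R², the default scorer for a regressor): 0.999. RMSE on the held-out test set: 0.0034324513493428823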



In [9]:
# Build, train, test, and save our model
xgb = XGBRegressor(booster='gbtree', objective='reg:squarederror',
    random_state=2, learning_rate=.1, n_estimators=800, max_depth=5)

# Reload the dataset in its original units (no 10^6 scaling), so the RMSE is in units of Beta
df = pd.read_csv('../data/pdt-dataset.csv')
X = df.drop('Beta', axis=1)
y = df['Beta']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)

reg_mse = mean_squared_error(y_test, y_pred)
reg_rmse = np.sqrt(reg_mse)
print(reg_rmse)

with open("../models/pdt-regression-model.pkl", 'wb') as f:
    pickle.dump(xgb, f)

0.0034324513493428823
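
The value printed above is the root mean squared error (RMSE), the square root of the MSE, so it is in the same units as Beta. A small worked example with made-up numbers (the arrays below are illustrative, not taken from the dataset) shows the relationship:

```python
import numpy as np

# Made-up true and predicted Beta values for illustration.
y_true = np.array([0.40, 0.50, 0.60])
y_hat = np.array([0.41, 0.49, 0.60])

errors = y_hat - y_true        # per-sample errors: about 0.01, -0.01, 0.0
mse = np.mean(errors ** 2)     # mean of squared errors, units of Beta^2
rmse = np.sqrt(mse)            # back in units of Beta

print("MSE:  {:.6f}".format(mse))
print("RMSE: {:.6f}".format(rmse))
```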


An example of how to load the saved model:

In [10]:
# Load the model from models folder
with open("../models/pdt-regression-model.pkl", 'rb') as f:
    model = pickle.load(f)
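
Once loaded, the model can predict Beta for a new drop profile. The sketch below assumes the model was trained on the five feature columns shown in the `df.head()` output, in that order; `predict_beta` and `FEATURE_COLUMNS` are hypothetical helper names, and `ConstantModel` is only a stand-in so the example runs without the pickle file.

```python
import numpy as np
import pandas as pd

# Assumed training column order, taken from the df.head() output.
FEATURE_COLUMNS = ["Drop Height", "Capillary Radius", "R-s", "R-e", "Smax"]


def predict_beta(model, measurements):
    """Predict Beta for one drop profile given a dict of feature values."""
    missing = [c for c in FEATURE_COLUMNS if c not in measurements]
    if missing:
        raise KeyError("missing features: {}".format(missing))
    # Build a one-row frame with the same column names used in training.
    row = pd.DataFrame([[measurements[c] for c in FEATURE_COLUMNS]],
                       columns=FEATURE_COLUMNS)
    return float(model.predict(row)[0])


class ConstantModel:
    """Stand-in with the same predict() interface as the pickled model."""

    def predict(self, X):
        return np.full(len(X), 0.4)


# First row of the dataset, converted back from the 10^6 scaling.
sample = {"Drop Height": 2.943094, "Capillary Radius": 0.682425,
          "R-s": 0.886383, "R-e": 1.090550, "Smax": 3.59}
demo_beta = predict_beta(ConstantModel(), sample)
print(demo_beta)
```

With the real model, the call would be `predict_beta(model, sample)` using the `model` object loaded above.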

Experimenting with a wider Beta range, using the same hyperparameters

In [11]:
# Build, train, and test the model (saving is commented out so the original model file is not overwritten)
xgb = XGBRegressor(booster='gbtree', objective='reg:squarederror',
    random_state=2, learning_rate=.1, n_estimators=800, max_depth=5)

df = pd.read_csv('../data/pdt-dataset-wider-beta.csv')
X = df.drop('Beta', axis=1)
y = df['Beta']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)

reg_mse = mean_squared_error(y_test, y_pred)
reg_rmse = np.sqrt(reg_mse)
print(reg_rmse)

#with open("../models/pdt-regression-model.pkl", 'wb') as f:
#    pickle.dump(xgb, f)

0.004374367554479346


Experiment with the same hyperparameters on the wider-Beta dataset, but without Smax as an input feature

In [12]:
# Build, train, and test the model (this experiment's model is not saved)
xgb = XGBRegressor(booster='gbtree', objective='reg:squarederror',
    random_state=2, learning_rate=.1, n_estimators=800, max_depth=5)

df = pd.read_csv('../data/pdt-dataset-wider-beta-no-Smax.csv')
X = df.drop('Beta', axis=1)
y = df['Beta']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)

reg_mse = mean_squared_error(y_test, y_pred)
reg_rmse = np.sqrt(reg_mse)
print(reg_rmse)

0.004334942946277289
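
The three experiments above repeat the same split, fit, and score steps on different CSV files. One possible way to factor that out is a small helper such as the hypothetical `holdout_rmse` below, which accepts any estimator with `fit` and `predict`. The demonstration uses `LinearRegression` on synthetic data as a stand-in so it runs without XGBoost or the dataset files.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def holdout_rmse(model, df, target="Beta", test_size=0.3, random_state=42):
    """Fit model on a train split of df and return the RMSE on the test split."""
    X = df.drop(target, axis=1)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return float(np.sqrt(mean_squared_error(y_test, y_pred)))


# Synthetic, exactly linear data, so a linear model should fit it almost perfectly.
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100)})
demo_df["Beta"] = 2.0 * demo_df["a"] + 3.0 * demo_df["b"]

demo_rmse = holdout_rmse(LinearRegression(), demo_df)
print(demo_rmse)
```

In the notebook, the same helper could be called with the tuned `XGBRegressor` and each dataset loaded via `pd.read_csv`, which would keep the split settings identical across experiments.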
