# Week 4 - Models and Experimentation

## Step 1 Training a model

For the purposes of this demo, we will be using this [adapted demo](https://www.datacamp.com/tutorial/xgboost-in-python) and training an XGBoost model, and then doing some experimentation and hyperparameter tuning.


If running this notebook locally, use the following steps to create virtual environment:
- Don't use past python 3.10
- To create virtual environment use "venv"

`python -m venv NAME`

- Try to avoid anaconda, poetry or similar package management platforms
- To install a package use pip

`python -m pip install <package-name>`

- once you are done working with this virtual environment, deactivate it with `deactivate`

### Install packages

In [3]:
!pip install wandb -qU

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.2/2.2 MB[0m [31m10.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m207.3/207.3 kB[0m [31m9.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m267.1/267.1 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.7/62.7 kB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[?25h

In [4]:
import xgboost as xgb
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


### Import data

We will be using Diamonds dataset imported from Seaborn. It is also available on [Kaggle](https://www.kaggle.com/datasets/shivam2503/diamonds).

Read about the features by following the link. We will be predicting the price of diamonds.

In [5]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [6]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB


In [7]:
diamonds.shape

(53940, 10)

In [8]:
X,y = diamonds.drop('price', axis=1), diamonds[['price']]

# For the cut, color and clarity use pandas category to enable XGBoost ability to deal with categorical data.

X['cut'] = X['cut'].astype('category')
X['color'] = X['color'].astype('category')
X['clarity'] = X['clarity'].astype('category')

### Split the data and train a model

In [9]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

In [10]:
# Define hyperparameters
params = {"objective": "reg:squarederror", "tree_method": "gpu_hist"}

n = 100
model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
)


    E.g. tree_method = "hist", device = "cuda"



In [11]:
# Define evaluation metrics - Root Mean Squared Error

predictions = model.predict(dtest)
rmse = mean_squared_error(y_test, predictions, squared=False)
print(f"RMSE: {rmse}")

RMSE: 532.8838153117543



    E.g. tree_method = "hist", device = "cuda"



### Incorporate validation

In [12]:
params = {"objective": "reg:squarederror", "tree_method": "gpu_hist"}
n = 100

# Create the validation set
evals = [(dtrain, "train"), (dtest, "validation")]

In [13]:
evals = [(dtrain, "train"), (dtest, "validation")]

model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=10,
)

[0]	train-rmse:2859.49097	validation-rmse:2851.62630
[10]	train-rmse:550.99470	validation-rmse:571.16640
[20]	train-rmse:491.51435	validation-rmse:544.08058



    E.g. tree_method = "hist", device = "cuda"



[30]	train-rmse:464.38845	validation-rmse:537.01895
[40]	train-rmse:445.99106	validation-rmse:533.85127
[50]	train-rmse:430.36010	validation-rmse:532.90320
[60]	train-rmse:418.87898	validation-rmse:533.04629
[70]	train-rmse:409.66247	validation-rmse:533.58046
[80]	train-rmse:397.34048	validation-rmse:534.31963
[90]	train-rmse:389.94294	validation-rmse:532.61946
[99]	train-rmse:377.70831	validation-rmse:532.88383


In [14]:
# Incorporate early stopping
n = 10000


model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=50,
   # Activate early stopping
   early_stopping_rounds=50
)

[0]	train-rmse:2859.49097	validation-rmse:2851.62630
[50]	train-rmse:430.36010	validation-rmse:532.90320
[100]	train-rmse:377.56825	validation-rmse:532.79980
[103]	train-rmse:375.44970	validation-rmse:532.50220


In [15]:
# Cross-validation

params = {"objective": "reg:squarederror", "tree_method": "gpu_hist"}
n = 1000

results = xgb.cv(
   params, dtrain,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=20
)



    E.g. tree_method = "hist", device = "cuda"



In [16]:
results.head()

Unnamed: 0,train-rmse-mean,train-rmse-std,test-rmse-mean,test-rmse-std
0,2861.153015,8.266765,2861.773555,36.937516
1,2081.378004,5.534608,2084.973481,32.064109
2,1545.361682,3.287745,1553.681211,31.059209
3,1182.364236,3.585787,1192.464771,26.157805
4,941.828819,2.971779,958.467497,23.613538


In [17]:
best_rmse = results['test-rmse-mean'].min()

best_rmse

549.1039652582465

## Start W&B


- Login into your W&B profile using the code below
- Alternatively you can set environment variables. There are several env variables which you can set to change the behavior of W&B logging. The most important are:
    - WANDB_API_KEY - find this in your "Settings" section under your profile
    - WANDB_BASE_URL - this is the url of the W&B server

- Find your API Token in "Profile" -> "Setttings" in the W&B App



## Experiments

In [18]:
# Log in to your W&B account
import wandb

wandb.login()

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [19]:
from sklearn.model_selection import GridSearchCV
import math
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

In [20]:
y_train.max()

price    18818
dtype: int64

In [21]:


import xgboost as xgb

def train_model(config=None):
    with wandb.init(config=config):
        config = wandb.config

        params = {
            'max_depth': int(config.max_depth),
            'n_estimators': int(config.n_estimators),
            'learning_rate': config.learning_rate,
            'subsample': config.subsample,
            'objective': 'reg:squarederror'
        }

        # Train model
        model = xgb.train(params, dtrain, num_boost_round=100)

        # Predict and evaluate
        preds = model.predict(dtest)
        rmse = np.sqrt(mean_squared_error(y_test, preds))

        # Log the rmse to Weights & Biases
        wandb.log({'rmse': rmse})




sweep_config = {
    'method': 'random',
    'metric': {
        'name': 'rmse',
        'goal': 'minimize'
    },
    'parameters': {
        'n_estimators': {
            'values': [100, 150, 50, 200, 120]
        },
        'max_depth': {
            'values': [6, 8, 4, 10, 7]
        },
        'learning_rate': {
            'values': [0.01, 0.05, 0.3, 0.15, 0.07]
        },
        'subsample': {
            'values': [1.0, 0.8, 0.9, 1.0, 0.85]
        }
    }
}


sweep_id = wandb.sweep(sweep_config, project="diamond_price_prediction")
wandb.agent(sweep_id, train_model, count=5)
# Set count=5 to limit to 5 experiments


Create sweep with ID: woa3iqd4
Sweep URL: https://wandb.ai/practicum-class/diamond_price_prediction/sweeps/woa3iqd4


[34m[1mwandb[0m: Agent Starting Run: fqwlvrzr with config:
[34m[1mwandb[0m: 	learning_rate: 0.07
[34m[1mwandb[0m: 	max_depth: 4
[34m[1mwandb[0m: 	n_estimators: 150
[34m[1mwandb[0m: 	subsample: 0.8
[34m[1mwandb[0m: Currently logged in as: [33mshreya18[0m ([33mpracticum-class[0m). Use [1m`wandb login --relogin`[0m to force relogin


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,561.45907


[34m[1mwandb[0m: Agent Starting Run: g28o958v with config:
[34m[1mwandb[0m: 	learning_rate: 0.15
[34m[1mwandb[0m: 	max_depth: 4
[34m[1mwandb[0m: 	n_estimators: 50
[34m[1mwandb[0m: 	subsample: 1


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,548.3458


[34m[1mwandb[0m: Agent Starting Run: 26c5ok2l with config:
[34m[1mwandb[0m: 	learning_rate: 0.15
[34m[1mwandb[0m: 	max_depth: 7
[34m[1mwandb[0m: 	n_estimators: 200
[34m[1mwandb[0m: 	subsample: 0.85


VBox(children=(Label(value='0.001 MB of 0.010 MB uploaded\r'), FloatProgress(value=0.11494146925983961, max=1.…

0,1
rmse,▁

0,1
rmse,534.86974


[34m[1mwandb[0m: Agent Starting Run: 2kt9dcow with config:
[34m[1mwandb[0m: 	learning_rate: 0.05
[34m[1mwandb[0m: 	max_depth: 4
[34m[1mwandb[0m: 	n_estimators: 150
[34m[1mwandb[0m: 	subsample: 0.85


VBox(children=(Label(value='0.001 MB of 0.010 MB uploaded\r'), FloatProgress(value=0.11494146925983961, max=1.…

0,1
rmse,▁

0,1
rmse,594.04096


[34m[1mwandb[0m: Agent Starting Run: 2x8arhwi with config:
[34m[1mwandb[0m: 	learning_rate: 0.05
[34m[1mwandb[0m: 	max_depth: 8
[34m[1mwandb[0m: 	n_estimators: 120
[34m[1mwandb[0m: 	subsample: 1


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,531.43709


In [22]:
# # Initialize W&B for experiment tracking
# wandb.init(project="xgboost_5_experiments")


# # Define five distinct hyperparameter sets
# hyperparameter_sets = [
#     {'n_estimators': 100, 'max_depth': 6, 'learning_rate': 0.1, 'subsample': 1.0, 'enable_categorical': True},
#     {'n_estimators': 150, 'max_depth': 8, 'learning_rate': 0.05, 'subsample': 0.8, 'enable_categorical': True},
#     {'n_estimators': 50, 'max_depth': 4, 'learning_rate': 0.3, 'subsample': 0.9,  'enable_categorical': True},
#     {'n_estimators': 200, 'max_depth': 10, 'learning_rate': 0.15, 'subsample': 1.0, 'enable_categorical': True},
#     {'n_estimators': 120, 'max_depth': 7, 'learning_rate': 0.07, 'subsample': 0.85, 'enable_categorical': True},
# ]

# # Run five experiments
# results = []

# for idx, params in enumerate(hyperparameter_sets):
#     # Log each experiment with W&B
#     with wandb.init(reinit=True, name=f"experiment_{idx + 1}"):
#         # Create a model with the current hyperparameters
#         model = xgb.train(
#           params=params,
#           dtrain=dtrain,
#           evals=evals,
#           num_boost_round=n,
#           verbose_eval=10
#         )

#         # model = XGBRegressor(**params)

#         # Train the model
#         # model.fit(X_train, y_train)

#         # Predict on the test set
#         # y_pred = model.predict(dtest)

#         # Calculate mse
#         mse = mean_squared_error(y_test, y_pred)

#         # Log the hyperparameters and accuracy
#         wandb.log({
#             "hyperparameters": params,
#             "mean_squared_error": mse,
#         })

#         # Store the results
#         results.append({
#             "experiment": idx + 1,
#             "mean_squared_error": mse,
#             "hyperparameters": params
#         })

# # Output the results of the experiments
# print("Results of 5 Experiments:")
# for result in results:
#     print(result)

# wandb.finish()  # End the W&B run

## Randomized Search

In [23]:
# from sklearn.model_selection import RandomizedSearchCV
# import scipy.stats as st

# # Initialize W&B for experiment tracking
# wandb.init(project="xgboost_random_search")

# # Define a hyperparameter space for Random Search
# param_distributions = {
#     'n_estimators': st.randint(50, 200),
#     'max_depth': st.randint(3, 10),
#     'learning_rate': st.uniform(0.01, 0.3),
#     'subsample': st.uniform(0.7, 0.3),
#     'colsample_bytree': st.uniform(0.7, 0.3),
#     'enable_categorical': [True],
# }

# # Configure RandomizedSearchCV with XGBRegressor
# random_search = RandomizedSearchCV(
#     estimator=XGBRegressor(),
#     param_distributions=param_distributions,
#     n_iter=5,  # Number of random experiments
#     scoring='neg_mean_squared_error',  # Evaluation metric for regression
#     cv=3,  # Cross-validation folds
#     n_jobs=-1,  # Parallel processing
#     random_state=42,  # For reproducibility
# )

# # Run Random Search
# random_search.fit(X_train, y_train)

# # Get the best model
# best_model = random_search.best_estimator_

# # Predict on the test set
# y_pred = best_model.predict(X_test)

# # Calculate mean squared error
# mse = mean_squared_error(y_test, y_pred)

# # Log the best hyperparameters and MSE to W&B
# wandb.log({
#     "best_hyperparameters": random_search.best_params_,
#     "mean_squared_error": mse,
# })

# print(f"Best Hyperparameters: {random_search.best_params_}")
# print(f"Mean Squared Error: {mse}")

# wandb.finish()  # End the W&B run

## Findings



In this experiment, I used an XGBoost model to predict diamond prices and tweaked the settings through Weights & Biases to identify the best parameters. I conducted five different tests, adjusting critical parameters like max_depth, n_estimators, learning_rate, and subsample to achieve the lowest possible root mean square error (RMSE).

For each test, I trained the model using the DMatrix data format, evaluated its performance by examining the RMSE, and tracked everything with Weights & Biases. The performance varied significantly based on the settings, highlighting the importance of choosing the right parameters.

Weights & Biases made it easy to monitor each test in real time and compare different trials. I could access detailed results through links provided for each run, which made it straightforward to pinpoint the best settings to enhance the model’s predictions."


In [32]:
##end of notebook

In [None]:
# TO DO
# Start experiment tracking with W&B
# Do at least 5 experiments with various hyperparameters
# Choose any method for hyperparameter tuning: grid search, random search, bayesian search
# Describe your findings and what you see