<a href="https://colab.research.google.com/github/Shreyasi-jcx3419/weightsandbiases/blob/main/Practicum_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 - Models and Experimentation

## Step 1 Training a model

This demo adapts [this XGBoost tutorial](https://www.datacamp.com/tutorial/xgboost-in-python): we train an XGBoost model, then experiment with it and tune its hyperparameters.


If you run this notebook locally, create a virtual environment first:
- Use Python 3.10 or earlier.
- Create the virtual environment with `venv`:

`python -m venv NAME`

- Activate it with `source NAME/bin/activate` (macOS/Linux) or `NAME\Scripts\activate` (Windows).
- Try to avoid Anaconda, Poetry, or similar package managers.
- Install packages with pip:

`python -m pip install <package-name>`

- When you are done working in the virtual environment, deactivate it with `deactivate`.

### Install packages

In [1]:
!pip install wandb -qU

In [2]:
import xgboost as xgb
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


### Import data

We will use the Diamonds dataset, loaded from Seaborn. It is also available on [Kaggle](https://www.kaggle.com/datasets/shivam2503/diamonds).

The Kaggle page describes each feature. Our target is the price of a diamond.

In [3]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [4]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB


In [5]:
diamonds.shape

(53940, 10)

In [6]:
# Separate the features from the target (price)
X, y = diamonds.drop('price', axis=1), diamonds[['price']]

# Store cut, color, and clarity as pandas categoricals so that XGBoost
# can handle them natively (see enable_categorical=True below).

X['cut'] = X['cut'].astype('category')
X['color'] = X['color'].astype('category')
X['clarity'] = X['clarity'].astype('category')

### Split the data and train a model

In [7]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

In [8]:
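# Define hyperparameters. For GPU training in XGBoost 2.x, use
# tree_method="hist" with device="cuda" ("gpu_hist" is deprecated).
params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}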

n = 100
model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
)





In [9]:
# Evaluate with root mean squared error (RMSE)

predictions = model.predict(dtest)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"RMSE: {rmse}")

RMSE: 532.8838153117543
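RMSE is the square root of the mean squared difference between the true and predicted values, so it is expressed in the same unit as the target (here, price). As a minimal sketch of that formula in plain Python, using made-up toy values rather than the notebook's data:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices and predictions, for illustration only.
toy_true = [326, 327, 334, 335]
toy_pred = [320, 330, 334, 345]

# Errors are 6, -3, 0, -10, so the mean squared error is 145 / 4 = 36.25.
print(rmse(toy_true, toy_pred))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.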






### Incorporate validation

In [10]:
params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}
n = 100

# Evaluate on both the training set and the held-out set after each boosting round
evals = [(dtrain, "train"), (dtest, "validation")]

In [11]:
model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=10,
)

[0]	train-rmse:2859.49097	validation-rmse:2851.62630
[10]	train-rmse:550.99470	validation-rmse:571.16640
[20]	train-rmse:491.51435	validation-rmse:544.08058
[30]	train-rmse:464.38845	validation-rmse:537.01895
[40]	train-rmse:445.99106	validation-rmse:533.85127
[50]	train-rmse:430.36010	validation-rmse:532.90320
[60]	train-rmse:418.87898	validation-rmse:533.04629
[70]	train-rmse:409.66247	validation-rmse:533.58046
[80]	train-rmse:397.34048	validation-rmse:534.31963
[90]	train-rmse:389.94294	validation-rmse:532.61946
[99]	train-rmse:377.70831	validation-rmse:532.88383


In [12]:
# Incorporate early stopping
n = 10000


model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=50,
   # Activate early stopping
   early_stopping_rounds=50
)

[0]	train-rmse:2859.49097	validation-rmse:2851.62630
[50]	train-rmse:430.36010	validation-rmse:532.90320
[100]	train-rmse:377.56825	validation-rmse:532.79980
[103]	train-rmse:375.44970	validation-rmse:532.50220
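The idea behind early stopping is to stop once the validation metric has failed to improve for a fixed number of consecutive rounds. The following is a simplified sketch of that rule in plain Python, not XGBoost's implementation; the function name, the `patience` parameter, and the toy scores are all illustrative assumptions:

```python
def early_stopping_round(val_scores, patience):
    """Return (stop_round, best_round) for a 'lower is better' metric.

    Training stops at the first round that is `patience` rounds past the
    best round seen so far, or at the last round if that never happens.
    """
    best_score = float("inf")
    best_round = 0
    for rnd, score in enumerate(val_scores):
        if score < best_score:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            return rnd, best_round
    return len(val_scores) - 1, best_round


# Hypothetical validation scores, one per boosting round.
toy_scores = [10.0, 8.0, 7.5, 7.6, 7.7, 7.8, 7.4]
stop, best = early_stopping_round(toy_scores, patience=3)
print(stop, best)
```

With a patience of 3, the loop stops at round 5 and never sees the lower score at round 6. That is the trade-off of a small patience value: less wasted training, but a higher chance of stopping before a late improvement.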


In [13]:
# Cross-validation

params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}
n = 1000

results = xgb.cv(
   params, dtrain,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=20
)






In [14]:
results.head()

Unnamed: 0,train-rmse-mean,train-rmse-std,test-rmse-mean,test-rmse-std
0,2861.153015,8.266765,2861.773555,36.937516
1,2081.378004,5.534608,2084.973481,32.064109
2,1545.361682,3.287745,1553.681211,31.059209
3,1182.364236,3.585787,1192.464771,26.157805
4,941.828819,2.971779,958.467497,23.613538


In [15]:
best_rmse = results['test-rmse-mean'].min()

best_rmse

549.1039652582465
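With `nfold=5`, the training data is divided into five folds; each fold serves once as the test set while the other four are used for training, and the reported metrics are averaged over the folds. A minimal sketch of how such folds can be built, using contiguous and unshuffled index ranges (XGBoost's own fold construction may differ, for example by shuffling rows); `k_fold_indices` is a hypothetical helper, not part of any library:

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row lands in exactly one test fold, which is why the cross-validated RMSE is usually a steadier estimate than a single train/test split.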

## Start W&B


- Log in to your W&B profile using the code below.
- Alternatively, set environment variables. Several environment variables change the behavior of W&B logging; the most important are:
    - WANDB_API_KEY - your API key, found in the "Settings" section under your profile
    - WANDB_BASE_URL - the URL of the W&B server

- Find your API key under "Profile" -> "Settings" in the W&B App.
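As a sketch of the environment-variable route, the two variables can be set from Python before any W&B call is made. The helper name `configure_wandb_env` and the placeholder key are hypothetical, and the default URL assumes the hosted W&B service; a self-hosted server would use its own address:

```python
import os


def configure_wandb_env(api_key, base_url="https://api.wandb.ai"):
    """Set the W&B environment variables for the current process."""
    os.environ["WANDB_API_KEY"] = api_key
    os.environ["WANDB_BASE_URL"] = base_url


# Placeholder value: replace with the key from your W&B settings page.
configure_wandb_env("YOUR_API_KEY")
print(os.environ["WANDB_BASE_URL"])
```

Avoid committing a real key to a notebook or repository; reading it from a secrets manager or an untracked file is safer.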



In [16]:
# Log in to your W&B account
import wandb

wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

True

In [28]:
sweep_config = {
    'method': 'random',
    'metric': {
        'name': 'rmse',
        'goal': 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'min': 0.01,
            'max': 0.2
        },
        'max_depth': {
            'values': [3, 5, 7, 9]
        },
        'colsample_bytree': {
            'min': 0.6,
            'max': 0.9
        },
        'n_estimators': {
            'values': [100, 150, 200, 250]
        }
    }
}
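With `'method': 'random'`, each run gets an independently sampled configuration. The sketch below imitates that idea in plain Python, assuming that `min`/`max` ranges are sampled uniformly and that `values` lists are sampled with equal probability; W&B's actual sampler may differ, and `sample_config` is a hypothetical helper, not a W&B function:

```python
import random


def sample_config(parameters, rng):
    """Draw one configuration from a sweep-style 'parameters' dict."""
    sampled = {}
    for name, spec in parameters.items():
        if "values" in spec:
            # Discrete choice among the listed values.
            sampled[name] = rng.choice(spec["values"])
        else:
            # Continuous range, assumed uniform between min and max.
            sampled[name] = rng.uniform(spec["min"], spec["max"])
    return sampled


# Same search space as the sweep configuration above.
toy_parameters = {
    "learning_rate": {"min": 0.01, "max": 0.2},
    "max_depth": {"values": [3, 5, 7, 9]},
    "colsample_bytree": {"min": 0.6, "max": 0.9},
    "n_estimators": {"values": [100, 150, 200, 250]},
}

rng = random.Random(0)  # seeded only to make the illustration repeatable
trial = sample_config(toy_parameters, rng)
print(trial)
```

Random search does not learn from earlier runs, so its coverage of the search space depends only on how many configurations are drawn.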


In [30]:
import wandb

sweep_id = wandb.sweep(sweep_config, project="diamonds_experiments", entity='shreyasi_jcx3419')

Create sweep with ID: 85bepaho
Sweep URL: https://wandb.ai/shreyasi_jcx3419/diamonds_experiments/sweeps/85bepaho


In [31]:
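def train():
    # The sweep agent calls this function once per run;
    # wandb.config then holds the hyperparameters sampled for that run.
    run = wandb.init()
    config = wandb.config

    params = {
        'objective': 'reg:squarederror',
        'learning_rate': config.learning_rate,
        'max_depth': int(config.max_depth),
        'subsample': 0.9,
        'colsample_bytree': config.colsample_bytree,
        'eval_metric': 'rmse'
    }

    # The native xgb.train API takes the number of trees as num_boost_round,
    # not as an 'n_estimators' entry in params.
    model = xgb.train(params, dtrain, num_boost_round=int(config.n_estimators))

    # Score on the held-out set and log the metric that the sweep minimizes
    predictions = model.predict(dtest)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    wandb.log({'rmse': rmse})

    run.finish()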


In [32]:
wandb.agent(sweep_id, train)

wandb: Agent Starting Run: kji4505m with config:
wandb: 	colsample_bytree: 0.6309207421453186
wandb: 	learning_rate: 0.03610392227790473
wandb: 	max_depth: 7
wandb: 	n_estimators: 100
wandb: Currently logged in as: shreyasi_jcx3419. Use `wandb login --relogin` to force relogin

rmse: 594.51182

wandb: Agent Starting Run: y4phi63w with config:
wandb: 	colsample_bytree: 0.8043782320341248
wandb: 	learning_rate: 0.02657517082673636
wandb: 	max_depth: 3
wandb: 	n_estimators: 100

rmse: 958.3554

wandb: Agent Starting Run: 7qedavsw with config:
wandb: 	colsample_bytree: 0.7195927256929011
wandb: 	learning_rate: 0.0375031642243228
wandb: 	max_depth: 7
wandb: 	n_estimators: 200

rmse: 526.95738

wandb: Agent Starting Run: ycrsirti with config:
wandb: 	colsample_bytree: 0.6550777607515582
wandb: 	learning_rate: 0.07937288235849038
wandb: 	max_depth: 3
wandb: 	n_estimators: 200

rmse: 591.29472

wandb: Agent Starting Run: i80io65p with config:
wandb: 	colsample_bytree: 0.644953499231781
wandb: 	learning_rate: 0.03114292955812417
wandb: 	max_depth: 5
wandb: 	n_estimators: 200

rmse: 566.34407

wandb: Agent Starting Run: sl47xymd with config:
wandb: 	colsample_bytree: 0.8721812952306391
wandb: 	learning_rate: 0.1016978301968217
wandb: 	max_depth: 5
wandb: 	n_estimators: 100

rmse: 530.57959

wandb: Agent Starting Run: ip6go59v with config:
wandb: 	colsample_bytree: 0.6218020275618733
wandb: 	learning_rate: 0.04815391261460225
wandb: 	max_depth: 7
wandb: 	n_estimators: 250

rmse: 529.93645

wandb: Agent Starting Run: 8e1bz8xr with config:
wandb: 	colsample_bytree: 0.7982189086824258
wandb: 	learning_rate: 0.14895782029958657
wandb: 	max_depth: 7
wandb: 	n_estimators: 200

rmse: 538.69343

wandb: Agent Starting Run: 15joznoo with config:
wandb: 	colsample_bytree: 0.86376521970067
wandb: 	learning_rate: 0.10245846789068146
wandb: 	max_depth: 9
wandb: 	n_estimators: 200

rmse: 539.55947

wandb: Agent Starting Run: fzzjbkju with config:
wandb: 	colsample_bytree: 0.6931333989259658
wandb: 	learning_rate: 0.0970773938791552
wandb: 	max_depth: 5
wandb: 	n_estimators: 150

rmse: 533.97562

wandb: Agent Starting Run: wdcu3b2w with config:
wandb: 	colsample_bytree: 0.8932619636557046
wandb: 	learning_rate: 0.14410028553741674
wandb: 	max_depth: 9
wandb: 	n_estimators: 250

Each entry in the log is one run of the sweep, with its own values for `colsample_bytree`, `learning_rate`, `max_depth`, and `n_estimators`. A brief overview of the results:

1. **Range of RMSE**: Across the ten runs that reported a result, RMSE ranges from about 526.96 to 958.36. Lower is better, so the best observed RMSE is 526.96, only slightly below the 532.88 of the untuned model trained earlier.
2. **Configuration Variations**: `max_depth` varied from 3 to 9, `n_estimators` from 100 to 250, `learning_rate` from about 0.027 to 0.149, and `colsample_bytree` from about 0.62 to 0.89.
3. **Best and Worst Runs**: The best run used a `max_depth` of 7, `n_estimators` of 200, a `learning_rate` around 0.038, and `colsample_bytree` around 0.72. The worst run had a `max_depth` of 3, a `learning_rate` of around 0.027, and `n_estimators` set to 100; shallow trees combined with a low learning rate and few boosting rounds likely left the model underfit.
4. **Link to Further Details**: Each run has its own page in W&B, reachable from the sweep URL printed when the sweep was created, for exploring configurations, performance charts, and other metrics.
5. **Progress and Consistency**: No `count` was passed to `wandb.agent`, so the agent keeps launching randomly sampled runs until it is stopped; the last run in the log had not reported an RMSE when the output was captured. Each run's data is saved and synced to W&B for later analysis.
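The best and worst runs can also be read off programmatically. As a small sketch, the dictionary below holds the RMSE values copied by hand from the sweep log above (run ID to RMSE); in practice the same values could be pulled from the W&B project instead:

```python
# RMSE per run, transcribed from the sweep log (run ID -> RMSE).
logged_rmse = {
    "kji4505m": 594.51182,
    "y4phi63w": 958.3554,
    "7qedavsw": 526.95738,
    "ycrsirti": 591.29472,
    "i80io65p": 566.34407,
    "sl47xymd": 530.57959,
    "ip6go59v": 529.93645,
    "8e1bz8xr": 538.69343,
    "15joznoo": 539.55947,
    "fzzjbkju": 533.97562,
}

# Sort ascending: lower RMSE is better.
ranked = sorted(logged_rmse.items(), key=lambda item: item[1])
best_run, best_value = ranked[0]
worst_run, worst_value = ranked[-1]

print(f"best:  {best_run} ({best_value})")
print(f"worst: {worst_run} ({worst_value})")
```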
