# Predicting Load Demand Using Weather

We want to examine the relationship between weather and electricity load demand in New York State. The weather and load data were obtained from ~link~ ~link~

The data was processed and then combined into a single consolidated source, which this notebook uses to predict load demand with a Random Forest.

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
import xgboost 
import dask_cudf
from dask import delayed
import dask_xgboost
from dask.distributed import Client, wait
from dask.dataframe import from_delayed
import cudf
import dask
from dask_cuda import LocalCUDACluster
import numpy as np
import pandas as pd


cluster = LocalCUDACluster()
client = Client(cluster)

# Reading Data

Let's read in the data and take a look at the available fields.

In [4]:
merged_ddf = dask_cudf.read_csv("processed_data.csv")
merged_ddf.head().to_pandas()

| Unnamed: 0.1 | Unnamed: 0 | county | day | month | hour | year | Load_x | air_tmp_x | dew_x | sea_pressure_x | ... | wind_spd_tmp | season | Weekday | Load_x_t-1 | air_tmpt-1 | wind_spd_t-1 | Load_x_5 | air_tmp_x_5 | sea_pressure_x_5 | wind_spd_x_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2595 | 0.0 | 20.0 | 9.0 | 12.0 | 2019.0 | 4009.800049 | 233.0 | -14.0 | 10137.0 | ... | 0.064377682 | 2.0 | 1.0 | -1.0 | -1.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2594 | 0.0 | 20.0 | 9.0 | 11.0 | 2019.0 | 9195.900391 | 217.0 | -13.0 | 10140.0 | ... | 0.096774194 | 2.0 | 1.0 | 4009.800049 | 233.0 | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2593 | 0.0 | 20.0 | 9.0 | 10.0 | 2019.0 | 9097.0 | 217.0 | -13.0 | 10139.0 | ... | 0.069124424 | 2.0 | 1.0 | 9195.900391 | 217.0 | 21.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2592 | 0.0 | 20.0 | 9.0 | 9.0 | 2019.0 | 7576.100098 | 67.0 | -999.0 | 10178.0 | ... | 0.611940299 | 2.0 | 1.0 | 9097.0 | 217.0 | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2591 | 0.0 | 20.0 | 9.0 | 8.0 | 2019.0 | 7330.799805 | 67.0 | -999.0 | 10178.0 | ... | 0.537313433 | 2.0 | 1.0 | 7576.100098 | 67.0 | 41.0 | 7441.920069 | 160.2 | 10154.4 | 25.6 |


## Brief about some fields

`county` is a categorical variable identifying the county in New York State that the entry belongs to.

`day`, `month`, `year`, and `hour` give the time at which the record was taken. They were obtained by splitting the `TimeStamp` column.

`Load_x` is the actual load consumed at that time. Similarly, `air_tmp_x`, `dew_x`, ... are the weather values recorded at that time.

`Load_x_t-1` is the load recorded at time `t-1`. It was calculated per county so that lagged values never cross county boundaries. The same holds for all columns ending in `t-1`; in the sample above, a lag of `-1.0` appears to mark a row with no previous entry.

`Load_x_5` is the mean over a rolling window of 5 entries (calculated using `rolling`). In the sample above, the value in row 4 (7441.92) is the mean of `Load_x` over rows 0 to 4, and rows without a full window hold `0.0`.
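
As a rough illustration of how lag and rolling columns like these could be derived, the sketch below uses pandas on a toy frame. The frame, its values, and the helper name `add_lag_and_rolling` are hypothetical. The `-1` fill for missing lags and the `0` fill for incomplete windows mirror what the sample rows show, but the original preprocessing code is not part of this notebook, so treat this as an assumption about how it may have worked.

```python
import pandas as pd

def add_lag_and_rolling(df, col="Load_x", group="county", window=5):
    """Add a per-group lag-1 column and a per-group rolling mean (hypothetical helper)."""
    out = df.copy()
    # Lag by one row within each county so values never leak across counties.
    out[f"{col}_t-1"] = df.groupby(group)[col].shift(1).fillna(-1.0)
    # Rolling mean over `window` rows within each county; incomplete windows become 0.
    out[f"{col}_{window}"] = (
        df.groupby(group)[col]
        .transform(lambda s: s.rolling(window).mean())
        .fillna(0.0)
    )
    return out

# Toy data: two counties with three readings each (made-up values).
toy = pd.DataFrame({
    "county": [0, 0, 0, 1, 1, 1],
    "Load_x": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})
features = add_lag_and_rolling(toy, window=2)
print(features)
```

Grouping before shifting is what keeps the first row of each county from inheriting the last value of the previous county.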

### Excluded columns

We will exclude `Unnamed: 0`, `precip_6_x` (because most of its values are null), and `TimeStamp`, which is redundant because `day`, `month`, `year`, and `hour` already encode it. `Load_x` is also kept out of the inputs because it is the prediction target.

In [5]:
from dask_ml.model_selection import train_test_split

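def split_data(merged_ddf):
    # Positional 80/20 split by row label, without shuffling.
    # Relies on the global `input_cols` defined in the next cell.
    split_idx = int(0.8 * len(merged_ddf))

    # Label-based .loc slices include both endpoints, so the row labelled
    # `split_idx` can end up in both the train and the test split.
    X_train = merged_ddf[input_cols].loc[:split_idx]
    y_train = merged_ddf['Load_x'].loc[:split_idx]

    X_test = merged_ddf[input_cols].loc[split_idx:]
    y_test = merged_ddf['Load_x'].loc[split_idx:]

    print("Train len ", len(X_train))
    print("Test len ", len(X_test))

    # xgboost.DMatrix needs in-memory data, so materialize the dask collections.
    train_dmat = xgboost.DMatrix(X_train.compute(), y_train.compute())
    test_dmat = xgboost.DMatrix(X_test.compute(), y_test.compute())
    return train_dmat, test_dmat, X_train, X_test, y_train, y_test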

In [6]:
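# Input features: everything except the null-heavy, redundant, and target columns.
excluded_cols = ['precip_6_x', 'TimeStamp', 'Load_x', 'Unnamed: 0']
input_cols = [c for c in merged_ddf.columns if c not in excluded_cols]

# Cast the features to float32 before building the DMatrix.
for col in input_cols:
    merged_ddf[col] = merged_ddf[col].astype('float32')

train_dmat, test_dmat, X_train, X_test, y_train, y_test = split_data(merged_ddf)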

Train len  41222
Test len  10306
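
The split sizes above are worth a quick sanity check, because label-based `.loc` slices include both endpoints. The sketch below uses plain pandas on a hypothetical 10-row frame (not the notebook's data) to show the difference between `.loc` and position-based `.iloc`. Whether the same one-row overlap occurs in the `dask_cudf` split depends on how the index is laid out across partitions, so this is an illustration rather than a diagnosis.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})  # default RangeIndex with labels 0..9
split_idx = int(0.8 * len(toy))       # 8

# Label-based slicing: both endpoints are included, so label 8 is in both parts.
loc_train, loc_test = toy.loc[:split_idx], toy.loc[split_idx:]

# Position-based slicing: half-open intervals, so the parts are disjoint.
iloc_train, iloc_test = toy.iloc[:split_idx], toy.iloc[split_idx:]

print(len(loc_train), len(loc_test))    # 9 2
print(len(iloc_train), len(iloc_test))  # 8 2
```

If the two split lengths add up to one more than the number of rows, the boundary row is being counted twice.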


# Random Forest

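Now that we have our train and test splits, we can train the Random Forest model. We start with 100 trees per round (`num_parallel_tree`) to keep computation fast while tuning the parameters, then increase the number of trees to observe the impact on performance. Because training also runs several boosting rounds with a learning rate below 1, the model is, strictly speaking, a boosted random forest.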
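
For comparison, XGBoost's documentation describes a standalone random forest as a single boosting round that grows many trees in parallel, with row and column subsampling and a learning rate of 1. The sketch below builds such a parameter dictionary. The helper name `standalone_rf_params` and its defaults are assumptions for illustration, not the configuration used in this notebook.

```python
def standalone_rf_params(n_trees=100, max_depth=7):
    """Parameters for a single-round random forest in XGBoost (illustrative helper)."""
    return {
        "objective": "reg:squarederror",
        "tree_method": "gpu_hist",
        "num_parallel_tree": n_trees,  # size of the forest
        "learning_rate": 1.0,          # no shrinkage, since there is only one round
        "subsample": 0.6,              # row sampling per tree (needs to be < 1)
        "colsample_bynode": 0.9,       # column sampling per split (needs to be < 1)
        "max_depth": max_depth,
    }

params = standalone_rf_params()
# Hypothetical usage, given a DMatrix named train_dmat:
#   forest = xgboost.train(params, train_dmat, num_boost_round=1)
print(params["num_parallel_tree"], params["learning_rate"])
```

The key difference from the cells that follow is `num_boost_round=1` together with `learning_rate=1.0`; with more rounds and a smaller learning rate, each round adds another forest on top of the previous ones.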

In [7]:
rf_gpu_parameters = {'colsample_bynode': 0.9,
 'learning_rate': 0.1,
 'max_depth': 7,
 'num_parallel_tree': 100,
 'objective': 'reg:squarederror',
 'subsample': 0.6,
 'tree_method': 'gpu_hist',
 'min_child_weight': 6,
 'gamma': 0.1,
 'alpha': 1.0,
 'reg_lambda': 1.0,
}


In [50]:
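rf_model = xgboost.train(
    rf_gpu_parameters,
    train_dmat,
    num_boost_round=100,
    # Early stopping monitors the last entry in `evals`, which is Train here
    # (see the log below); list the test set last to stop on Test-rmse instead.
    evals=[(test_dmat, "Test"), (train_dmat, "Train")],
    early_stopping_rounds=10,
    verbose_eval=10
)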

[0]	Test-rmse:15353.2	Train-rmse:30389.3
Multiple eval metrics have been passed: 'Train-rmse' will be used for early stopping.

Will train until Train-rmse hasn't improved in 10 rounds.
[10]	Test-rmse:5389.15	Train-rmse:10920.3
[20]	Test-rmse:2144.79	Train-rmse:4374.81
[30]	Test-rmse:1457.65	Train-rmse:2449.48
[40]	Test-rmse:1467.07	Train-rmse:1986.77
[50]	Test-rmse:1508.23	Train-rmse:1847.91
[60]	Test-rmse:1531.4	Train-rmse:1765.46
[70]	Test-rmse:1540.33	Train-rmse:1696.03
[80]	Test-rmse:1537.48	Train-rmse:1636.58
[90]	Test-rmse:1542.27	Train-rmse:1586.27
[99]	Test-rmse:1542.45	Train-rmse:1545.27
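
The log shows that early stopping monitored `Train-rmse`, because XGBoost uses the last entry in `evals`. To make the patience rule itself concrete, here is a minimal pure-Python sketch. `early_stopping_round` is a hypothetical helper and the scores are made-up numbers, so this illustrates the idea rather than XGBoost's internal implementation.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round) for a metric where lower is better.

    Training stops once `patience` rounds have passed without improving on the best score.
    """
    best_round, best_score = 0, float("inf")
    for rnd, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(scores) - 1

# A validation curve that bottoms out at round 2 and then drifts upwards.
val_rmse = [10.0, 8.0, 7.0, 7.5, 7.6, 7.7, 7.8]
print(early_stopping_round(val_rmse, patience=3))  # (2, 5)

# A training curve that keeps improving never triggers the rule.
train_rmse = [10.0, 8.0, 7.0, 6.5, 6.2, 6.0, 5.9]
print(early_stopping_round(train_rmse, patience=3))  # (6, 6)
```

A training metric that keeps falling, as in the second call, never triggers the rule, which is consistent with the run above continuing through round 99.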


### Tuning Learning Rate

The model already seems to be yielding good results. Let's try changing the learning rate to see if we can do better.

In [31]:
min_rmse = float("Inf")
best_lr = None

for lr in [0.1, 0.2, 0.3]:
    print("CV with lr={}".format(lr))
    # Update our parameters
    rf_gpu_parameters['learning_rate'] = lr
    # Run 3-fold cross-validation
    cv_results = xgboost.cv(
        rf_gpu_parameters,
        train_dmat,
        num_boost_round=100,
        seed=42,
        nfold=3,
        metrics={'rmse'},
    )
    # Lowest mean test RMSE across rounds, and the (zero-based) round where it occurs
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()

    print("\tRMSE {} for {} rounds".format(mean_rmse, boost_rounds))
    # Keep the best learning rate seen so far
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_lr = lr
print("Best LR: {} RMSE: {}".format(best_lr, min_rmse))

CV with lr=0.1
	RMSE 2303.4584959999997 for 99 rounds
CV with lr=0.2
	RMSE 2302.688558 for 99 rounds
CV with lr=0.3
	RMSE 2306.0630696666667 for 95 rounds
Best LR: 0.2 RMSE: 2302.688558
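
Note that `argmin` returns a zero-based position, so the "rounds" printed above are round indices; the number of boosting rounds to keep is that index plus one. A small pandas sketch with made-up CV results shows the distinction (the column name matches the one used in the cell above, while the numbers are hypothetical):

```python
import pandas as pd

# Made-up stand-in for the DataFrame returned by xgboost.cv.
cv_results = pd.DataFrame(
    {"test-rmse-mean": [2500.0, 2350.0, 2310.0, 2302.0, 2305.0]}
)

best_rmse = cv_results["test-rmse-mean"].min()
best_index = cv_results["test-rmse-mean"].argmin()  # zero-based position of the minimum
best_num_rounds = best_index + 1                    # number of boosting rounds to train

print(best_rmse, best_index, best_num_rounds)  # 2302.0 3 4
```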


In [8]:
rf_gpu_parameters['learning_rate'] = 0.2

### Tuning Max Depth and Min Child Weight

In [33]:
def grid_search_depth_wt(gridsearch_params, train_dmat, xgb_gpu_params):
    """Cross-validate each (max_depth, min_child_weight) pair and return the best one."""
    min_rmse = float("Inf")
    best_params = None
    for max_depth, min_child_weight in gridsearch_params:
        print("CV with max_depth={}, min_child_weight={}".format(
            max_depth, min_child_weight))
        # Update the parameters that were passed in (this mutates the caller's dict)
        xgb_gpu_params['max_depth'] = max_depth
        xgb_gpu_params['min_child_weight'] = min_child_weight

        # Run 3-fold cross-validation
        cv_results = xgboost.cv(
            xgb_gpu_params,
            train_dmat,
            num_boost_round=100,
            seed=42,
            nfold=3,
            metrics={'rmse'},
            early_stopping_rounds=10
        )
        # Lowest mean test RMSE and the (zero-based) round where it occurs
        mean_rmse = cv_results['test-rmse-mean'].min()
        boost_rounds = cv_results['test-rmse-mean'].argmin()

        print("\tRMSE {} for {} rounds".format(mean_rmse, boost_rounds))
        # Keep the best pair seen so far
        if mean_rmse < min_rmse:
            min_rmse = mean_rmse
            best_params = (max_depth, min_child_weight)
    print("Best params: {}, {}, RMSE: {}".format(best_params[0], best_params[1], min_rmse))
    return best_params

In [34]:
gridsearch_params = [(depth, child_wt) 
                     for depth in range(5,8) 
                     for child_wt in range(4,7)
                    ]

best_params = grid_search_depth_wt(gridsearch_params, train_dmat, rf_gpu_parameters)

CV with max_depth=5, min_child_weight=4
	RMSE 2324.291748 for 99 rounds
CV with max_depth=5, min_child_weight=5
	RMSE 2325.2718096666667 for 98 rounds
CV with max_depth=5, min_child_weight=6
	RMSE 2327.5948080000003 for 99 rounds
CV with max_depth=6, min_child_weight=4
	RMSE 2307.7678223333332 for 99 rounds
CV with max_depth=6, min_child_weight=5
	RMSE 2310.602458 for 99 rounds
CV with max_depth=6, min_child_weight=6
	RMSE 2311.2578126666667 for 99 rounds
CV with max_depth=7, min_child_weight=4
	RMSE 2301.7738446666667 for 97 rounds
CV with max_depth=7, min_child_weight=5
	RMSE 2304.104736 for 99 rounds
CV with max_depth=7, min_child_weight=6
	RMSE 2302.688558 for 99 rounds
Best params: 7, 4, RMSE: 2301.7738446666667
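
The tuning loops in this notebook share one pattern: iterate over candidate settings, score each with cross-validation, and keep the lowest RMSE. A generic sketch of that pattern follows. `select_best` and the toy `fake_cv_rmse` scorer are hypothetical stand-ins; in the notebook the score would come from `xgboost.cv`.

```python
from itertools import product

def select_best(candidates, score_fn):
    """Return (best_candidate, best_score) for a lower-is-better score function."""
    best_candidate, best_score = None, float("inf")
    for candidate in candidates:
        score = score_fn(candidate)
        if score < best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score

# Same grid as above: max_depth in 5..7, min_child_weight in 4..6.
grid = list(product(range(5, 8), range(4, 7)))

def fake_cv_rmse(params):
    """Toy scorer standing in for a cross-validated RMSE (not real results)."""
    max_depth, min_child_weight = params
    return 100.0 - 10.0 * max_depth + min_child_weight

print(select_best(grid, fake_cv_rmse))  # ((7, 4), 34.0)
```

Passing the scorer in as a function keeps the selection logic separate from the model, so the same helper could cover the learning-rate and regularization searches as well.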


In [9]:
rf_gpu_parameters['max_depth'], rf_gpu_parameters['min_child_weight'] = 7, 4

### Tuning Alpha and Lambda

In [37]:
min_rmse = float("Inf")
best_params = None
alpha_lambda = [(alpha, lam)
                for alpha in [2.0, 5.0, 10.0]
                for lam in [0.5, 1.5, 2.5]]
for alpha, lam in alpha_lambda:
    print("CV with alpha={}, lambda={}".format(alpha, lam))
    # Update our parameters (L1 and L2 regularization strengths)
    rf_gpu_parameters['alpha'] = alpha
    rf_gpu_parameters['reg_lambda'] = lam
    # Run 3-fold cross-validation
    cv_results = xgboost.cv(
        rf_gpu_parameters,
        train_dmat,
        num_boost_round=100,
        seed=42,
        nfold=3,
        metrics={'rmse'},
        early_stopping_rounds=10
    )
    # Lowest mean test RMSE and the (zero-based) round where it occurs
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()

    print("\tRMSE {} for {} rounds".format(mean_rmse, boost_rounds))
    # Keep the best pair seen so far
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_params = (alpha, lam)
print("Best params: {}, {} RMSE: {}".format(best_params[0], best_params[1], min_rmse))

CV with alpha=2.0, lambda=0.5
	RMSE 2303.67749 for 99 rounds
CV with alpha=2.0, lambda=1.5
	RMSE 2300.6336263333337 for 99 rounds
CV with alpha=2.0, lambda=2.5
	RMSE 2297.274902333333 for 99 rounds
CV with alpha=5.0, lambda=0.5
	RMSE 2301.998454 for 99 rounds
CV with alpha=5.0, lambda=1.5
	RMSE 2300.6818843333335 for 99 rounds
CV with alpha=5.0, lambda=2.5
	RMSE 2301.1280923333334 for 99 rounds
CV with alpha=10.0, lambda=0.5
	RMSE 2300.6433103333334 for 99 rounds
CV with alpha=10.0, lambda=1.5
	RMSE 2299.0305176666666 for 97 rounds
CV with alpha=10.0, lambda=2.5
	RMSE 2300.4023436666666 for 99 rounds
Best params: 2.0, 2.5 RMSE: 2297.274902333333


In [10]:
rf_gpu_parameters['alpha'], rf_gpu_parameters['reg_lambda'] = 2.0, 2.5

In [11]:
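model = xgboost.train(
    rf_gpu_parameters,
    train_dmat,
    num_boost_round=300,
    # Early stopping monitors the last entry in `evals`, which is Train here
    # (see the log below); list the test set last to stop on Test-rmse instead.
    evals=[(test_dmat, "Test"), (train_dmat, "Train")],
    verbose_eval=10,
    early_stopping_rounds=10
)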

[0]	Test-rmse:13647.5	Train-rmse:27081.1
Multiple eval metrics have been passed: 'Train-rmse' will be used for early stopping.

Will train until Train-rmse hasn't improved in 10 rounds.
[10]	Test-rmse:1855.81	Train-rmse:3785.54
[20]	Test-rmse:1452.52	Train-rmse:1995.81
[30]	Test-rmse:1529.56	Train-rmse:1775.8
[40]	Test-rmse:1547.48	Train-rmse:1640.45
[50]	Test-rmse:1541.64	Train-rmse:1540.89
[60]	Test-rmse:1547.64	Train-rmse:1460.4
[70]	Test-rmse:1535.02	Train-rmse:1392.05
[80]	Test-rmse:1515.63	Train-rmse:1332.22
[90]	Test-rmse:1511.07	Train-rmse:1277.84
[100]	Test-rmse:1507.82	Train-rmse:1228.98
[110]	Test-rmse:1505.45	Train-rmse:1184.69
[120]	Test-rmse:1510.05	Train-rmse:1143.7
[130]	Test-rmse:1505.57	Train-rmse:1105.07
[140]	Test-rmse:1507.18	Train-rmse:1069.15
[150]	Test-rmse:1507.08	Train-rmse:1034.51
[160]	Test-rmse:1506.74	Train-rmse:1002.52
[170]	Test-rmse:1505.06	Train-rmse:972.751
[180]	Test-rmse:1507.21	Train-rmse:944.001
[190]	Test-rmse:1510.94	Train-rmse:916.679
[200]	Tes

In [12]:
rf_gpu_parameters['num_parallel_tree'] = 1000
rf_gpu_parameters

{'colsample_bynode': 0.9,
 'learning_rate': 0.2,
 'max_depth': 7,
 'num_parallel_tree': 1000,
 'objective': 'reg:squarederror',
 'subsample': 0.6,
 'tree_method': 'gpu_hist',
 'min_child_weight': 4,
 'gamma': 0.1,
 'alpha': 2.0,
 'reg_lambda': 2.5}

In [None]:
model = xgboost.train(
    rf_gpu_parameters,
    train_dmat,
    num_boost_round=300,
    evals=[(test_dmat, "Test"), (train_dmat, "Train")],
    verbose_eval=10,
    early_stopping_rounds=10
)

[0]	Test-rmse:13649.5	Train-rmse:27080.7
Multiple eval metrics have been passed: 'Train-rmse' will be used for early stopping.

Will train until Train-rmse hasn't improved in 10 rounds.
[10]	Test-rmse:1853.48	Train-rmse:3786.6
[20]	Test-rmse:1451.31	Train-rmse:1995.11
[30]	Test-rmse:1515.49	Train-rmse:1775.84
[40]	Test-rmse:1539.39	Train-rmse:1639.46
[50]	Test-rmse:1540.47	Train-rmse:1540.11
[60]	Test-rmse:1534.53	Train-rmse:1460.49
[70]	Test-rmse:1527.7	Train-rmse:1392.62
[80]	Test-rmse:1520.86	Train-rmse:1332.42
[90]	Test-rmse:1515.13	Train-rmse:1278.78
[100]	Test-rmse:1515.31	Train-rmse:1229.89
[110]	Test-rmse:1512.48	Train-rmse:1185.21
[120]	Test-rmse:1513.44	Train-rmse:1143.43
[130]	Test-rmse:1513	Train-rmse:1104.82
[140]	Test-rmse:1511.68	Train-rmse:1068.43
[150]	Test-rmse:1512.5	Train-rmse:1034.34
[160]	Test-rmse:1512.72	Train-rmse:1002.31
[170]	Test-rmse:1513.52	Train-rmse:972.17
[180]	Test-rmse:1514.76	Train-rmse:943.576
