# Predict Bike Sharing Demand with AutoGluon Template

## Project: Predict Bike Sharing Demand with AutoGluon
This notebook is a template with each step that you need to complete for the project.

Please fill in your code where there are explicit `?` markers in the notebook. You are welcome to add more cells and code as you see fit.

Once you have completed all the code implementations, please export your notebook as an HTML file so the reviewers can view your code. Make sure all cell outputs are present and correct.

`File-> Export Notebook As... -> Export Notebook as HTML`

There is also a writeup to complete after all code implementation is done. Please answer all questions and attach the necessary tables and charts. You can complete the writeup in either markdown or PDF.

Completing the code template and writeup template will cover all of the rubric points for this project.

The rubric contains "Stand Out Suggestions" for enhancing the project beyond the minimum requirements. The stand out suggestions are optional. If you decide to pursue the "stand out suggestions", you can include the code in this notebook and also discuss the results in the writeup file.

## Step 1: Create an account with Kaggle

### Create Kaggle Account and download API key
Below is an example of the steps to get the API username and key. Each student will have their own username and key.

## Step 2: Download the Kaggle dataset using the kaggle python library

### Open up SageMaker Studio and use the starter template

1. The notebook should use an `ml.t3.medium` instance (2 vCPU + 4 GiB)
2. The notebook should use the kernel: `Python 3 (MXNet 1.8 Python 3.7 CPU Optimized)`

### Install packages

In [None]:
# !pip install -U pip
# !pip install -U setuptools wheel
# !pip install -U "mxnet<2.0.0" bokeh==2.0.1
# !pip install autogluon --no-cache-dir
# # Without --no-cache-dir, smaller aws instances may have trouble installing

### Setup Kaggle API Key

In [None]:
# # Create the .kaggle directory and an empty kaggle.json file
# # (same location that the next cell writes to)
# !mkdir -p /root/.kaggle
# !touch /root/.kaggle/kaggle.json
# !chmod 600 /root/.kaggle/kaggle.json

In [None]:
# # Fill in your user name and key from creating the kaggle account and API token file
# import json
# kaggle_username = "YOUR_KAGGLE_USERNAME"  # placeholder: never commit real credentials
# kaggle_key = "YOUR_KAGGLE_API_KEY"        # placeholder: never commit real credentials

# # Save the API token to the kaggle.json file
# with open("/root/.kaggle/kaggle.json", "w") as f:
#     f.write(json.dumps({"username": kaggle_username, "key": kaggle_key}))
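
The token setup above can also be done from Python without shell commands. The following is a minimal sketch, not part of the original template: the helper name `write_kaggle_token` is hypothetical, and it assumes the credentials are supplied by the caller (for example from environment variables) rather than typed into the notebook.

```python
import json
import stat
from pathlib import Path


def write_kaggle_token(username: str, key: str, config_dir: Path) -> Path:
    """Write kaggle.json into config_dir and restrict it to the owner."""
    config_dir.mkdir(parents=True, exist_ok=True)
    token_path = config_dir / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))
    # Equivalent of `chmod 600`: read/write for the owner only
    token_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

A call such as `write_kaggle_token(my_user, my_key, Path.home() / ".kaggle")` would then replace the `mkdir`, `touch`, and `chmod` cell, where `my_user` and `my_key` are whatever variables hold your own credentials.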

### Download and explore dataset

In [None]:
# Download the dataset. It arrives as a .zip file, so you'll need to unzip it as well.
# !kaggle competitions download -c bike-sharing-demand

# If you have already unzipped it, the -o flag overwrites the existing files.
# The later cells read from the data/ directory, so extract there.
# !unzip -o bike-sharing-demand.zip -d data
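
If `unzip` is not available on the instance, the archive can be extracted with the standard library instead. This is a small sketch under the assumption that the downloaded file is a regular zip archive; the helper name `extract_archive` is hypothetical.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_archive(zip_path: Path, dest_dir: Path) -> List[str]:
    """Extract every member of zip_path into dest_dir, overwriting existing files."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())
```

For this notebook the call would be `extract_archive(Path("bike-sharing-demand.zip"), Path("data"))`, since the cells below read `data/train.csv`, `data/test.csv`, and `data/sampleSubmission.csv`.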

In [1]:
import pandas as pd
from autogluon.tabular import TabularPredictor
import autogluon.core as ag

In [2]:
# Create the train dataset in pandas by reading the csv
# Set the parsing of the datetime column so you can use some of the `dt` features in pandas later
train = pd.read_csv('data/train.csv', parse_dates=['datetime'])
train.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [4]:
# Summary statistics of the train dataset to view the min/max/variation of each feature.
train.describe()

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
count,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0
mean,2.506614,0.028569,0.680875,1.418427,20.23086,23.655084,61.88646,12.799395,36.021955,155.552177,191.574132
std,1.116174,0.166599,0.466159,0.633839,7.79159,8.474601,19.245033,8.164537,49.960477,151.039033,181.144454
min,1.0,0.0,0.0,1.0,0.82,0.76,0.0,0.0,0.0,0.0,1.0
25%,2.0,0.0,0.0,1.0,13.94,16.665,47.0,7.0015,4.0,36.0,42.0
50%,3.0,0.0,1.0,1.0,20.5,24.24,62.0,12.998,17.0,118.0,145.0
75%,4.0,0.0,1.0,2.0,26.24,31.06,77.0,16.9979,49.0,222.0,284.0
max,4.0,1.0,1.0,4.0,41.0,45.455,100.0,56.9969,367.0,886.0,977.0


In [5]:
# Create the test dataframe in pandas by reading the csv, remember to parse the datetime!
test = pd.read_csv('data/test.csv', parse_dates=['datetime'])
test.head()
# test.shape

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed
0,2011-01-20 00:00:00,1,0,1,1,10.66,11.365,56,26.0027
1,2011-01-20 01:00:00,1,0,1,1,10.66,13.635,56,0.0
2,2011-01-20 02:00:00,1,0,1,1,10.66,13.635,56,0.0
3,2011-01-20 03:00:00,1,0,1,1,10.66,12.88,56,11.0014
4,2011-01-20 04:00:00,1,0,1,1,10.66,12.88,56,11.0014


In [6]:
# Load the sample submission the same way as the train and test datasets
submission = pd.read_csv('data/sampleSubmission.csv', parse_dates=['datetime'])
submission.head()
# submission.shape

Unnamed: 0,datetime,count
0,2011-01-20 00:00:00,0
1,2011-01-20 01:00:00,0
2,2011-01-20 02:00:00,0
3,2011-01-20 03:00:00,0
4,2011-01-20 04:00:00,0


## Step 3: Train a model using AutoGluon’s Tabular Prediction

Requirements:
* We are predicting `count`, so it is the label we are setting.
* Ignore the `casual` and `registered` columns, as they are not present in the test dataset.
* Use the `root_mean_squared_error` as the metric to use for evaluation.
* Set a time limit of 10 minutes (600 seconds).
* Use the preset `best_quality` to focus on creating the best model.

In [7]:
# casual and registered columns to ignore
ignore_cols = ['casual','registered']
# train.drop(ignore_cols, axis=1, inplace=True)  # not needed: learner_kwargs={'ignored_columns': ignore_cols} is passed to TabularPredictor instead

target = 'count'
metric = 'root_mean_squared_error'
ttime = 10 * 60 # train various models for 10 minutes, 10 x 60 seconds
train.info() # both columns are still listed here; they are ignored at fit time rather than dropped

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB


In [9]:
# To ignore columns of the train data in fit, pass 'ignored_columns' through learner_kwargs of TabularPredictor
predictor = TabularPredictor(label=target, eval_metric=metric, learner_kwargs={'ignored_columns': ignore_cols}).fit(
    train_data=train,
    time_limit=ttime,
    presets='best_quality'
)

No path specified. Models will be saved in: "AutogluonModels/ag-20221216_163054/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20221216_163054/"
AutoGluon Version:  0.6.0
Python Version:     3.8.15
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022
Train Data Rows:    10886
Train Data Columns: 11
Label Column: count
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == int and many unique label-values observed).
	Label info (max, min, mean, stddev): (977, 1, 191.57413, 181.14445)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])


### Review AutoGluon's training run with ranking of models that did the best.

In [10]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                     model   score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0      WeightedEnsemble_L3  -50.397299      22.317726  442.836139                0.000764           0.591672            3       True         20
1   NeuralNetFastAI_BAG_L2  -51.446703      21.325593  419.577812                0.611587          55.819029            2       True         18
2   RandomForestMSE_BAG_L2  -53.327027      21.214292  381.178461                0.500286          17.419677            2       True         15
3     ExtraTreesMSE_BAG_L2  -54.295980      21.205088  369.005760                0.491083           5.246977            2       True         17
4           XGBoost_BAG_L2  -54.985735      20.875395  371.384517                0.161389           7.625733            2       True         19
5          LightGBM_BAG_L2  -55.162694      21.001076  374.963897         



{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
  'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
  'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
  'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
  'ExtraTreesMSE_BAG_L1': 'StackerEnsembleModel_XT',
  'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
  'XGBoost_BAG_L1': 'StackerEnsembleModel_XGBoost',
  'NeuralNetTorch_BAG_L1': 'StackerEnsembleModel_TabularNeuralNetTorch',
  'LightGBMLarge_BAG_L1': 'StackerEnsembleModel_LGB',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel',
  'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
  'RandomForestMSE_BAG_L2': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
  'ExtraTreesMSE_BAG_L2': 'StackerEnsembleModel_XT',
  'NeuralNetFastAI_BAG_L2': 'StackerEnsembleModel_NNFa

In [15]:
# Leaderboard scored on the training data itself, so score_test is optimistic; compare with score_val
predictor.leaderboard(data=train, silent=True)


Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,KNeighborsDist_BAG_L1,-0.0,-84.125061,0.036026,0.052007,0.057873,0.036026,0.052007,0.057873,1,True,2
1,WeightedEnsemble_L2,-0.0,-84.125061,0.038656,0.053028,0.512507,0.00263,0.001021,0.454634,2,True,12
2,RandomForestMSE_BAG_L1,-42.939157,-116.548359,0.581102,0.535277,10.285983,0.581102,0.535277,10.285983,1,True,5
3,ExtraTreesMSE_BAG_L1,-45.921107,-124.600676,0.559784,0.398963,2.964968,0.559784,0.398963,2.964968,1,True,7
4,ExtraTreesMSE_BAG_L2,-62.756012,-54.29598,45.278469,21.205088,369.00576,0.636455,0.491083,5.246977,2,True,17
5,KNeighborsUnif_BAG_L1,-70.693174,-101.546199,0.038055,0.031702,0.042162,0.038055,0.031702,0.042162,1,True,1
6,RandomForestMSE_BAG_L2,-73.960021,-53.327027,45.34578,21.214292,381.178461,0.703765,0.500286,17.419677,2,True,15
7,XGBoost_BAG_L2,-77.701485,-54.985735,44.961787,20.875395,371.384517,0.319773,0.161389,7.625733,2,True,19
8,LightGBM_BAG_L2,-84.631163,-55.162694,45.287336,21.001076,374.963897,0.645321,0.28707,11.205114,2,True,14
9,XGBoost_BAG_L1,-85.623725,-131.624665,1.386324,0.732848,11.67873,1.386324,0.732848,11.67873,1,True,9
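
The leaderboard above is scored on the same rows the models were trained on, which is why `KNeighborsDist_BAG_L1` reaches a `score_test` of `-0.0` while its `score_val` is much worse. One way to get a less optimistic local estimate is to keep some rows aside before fitting. The following is a minimal sketch of such a split; the helper name `holdout_split` is hypothetical, and a random split is only an assumption here (a time-based split may be closer to how the competition divides the data).

```python
import pandas as pd


def holdout_split(df: pd.DataFrame, frac: float = 0.2, seed: int = 0):
    """Return (train_part, holdout_part) with no shared rows."""
    holdout_part = df.sample(frac=frac, random_state=seed)
    train_part = df.drop(index=holdout_part.index)
    return train_part, holdout_part
```

The predictor would then be fit on `train_part`, and `predictor.leaderboard(data=holdout_part, silent=True)` would report scores on rows the models have not seen.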



### Create predictions from test dataset

In [16]:
# evaluation = predictor.evaluate(test)
predictions = predictor.predict(test)

y_pred = pd.DataFrame(predictions, columns=['count'])
y_pred # print dataframe

Unnamed: 0,count
0,26.345877
1,43.131622
2,47.567802
3,50.524109
4,52.965538
...,...
6488,157.488251
6489,157.505081
6490,153.458344
6491,146.478683


#### NOTE: Kaggle will reject the submission if any prediction is negative, so we set everything to be >= 0.

In [17]:
# Describe the `predictions` series to see if there are any negative values
predictions.describe()

count    6493.000000
mean      100.251656
std        87.535828
min        -1.232798
25%        23.113815
50%        69.305283
75%       165.564026
max       353.474365
Name: count, dtype: float64

In [18]:
# How large are the negative values in total?
# Note: this sums the negative predictions; it does not count them
neg_values = predictions[predictions < 0].sum()
print('Sum of negative values: ', neg_values)


Sum of negative values:  -4.846719


In [None]:
# Set them to zero
# df[df < 0] = 0
predictions[predictions < 0] = 0
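
Counting negative predictions and summing them are two different operations in pandas, and they are easy to mix up. The sketch below uses a toy series (made-up values, not the notebook's predictions) to show both, along with `clip` as a non-mutating alternative to the masked assignment.

```python
import pandas as pd

# Toy predictions standing in for the real `predictions` series
toy_predictions = pd.Series([26.3, -1.2, 43.1, -0.4, 0.0], name="count")

# The boolean mask sums to the number of negative entries
num_negative = int((toy_predictions < 0).sum())

# Selecting with the mask and summing gives their total instead
sum_negative = float(toy_predictions[toy_predictions < 0].sum())

# clip returns a new series and leaves the original untouched
clipped = toy_predictions.clip(lower=0)
```

Using `clip` keeps the raw predictions available for inspection, whereas `predictions[predictions < 0] = 0` overwrites them in place.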

### Set predictions to submission dataframe, save, and submit

In [None]:
submission["count"] = predictions
submission.to_csv("LOCAL_submission.csv", index=False)
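
Before uploading, it can be worth checking the submission frame locally. The following is a hedged sketch: the helper name `check_submission` is hypothetical, the expected columns are taken from the sample submission shown earlier, and the checks are only the ones this notebook mentions or implies, not an official list of Kaggle's validation rules.

```python
import pandas as pd


def check_submission(df: pd.DataFrame, expected_rows: int) -> None:
    """Raise ValueError if the frame does not look like a valid submission."""
    if list(df.columns) != ["datetime", "count"]:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(df)}")
    if df["count"].isna().any():
        raise ValueError("count contains missing values")
    if (df["count"] < 0).any():
        raise ValueError("count contains negative values")
```

In this notebook the call would be `check_submission(submission, expected_rows=6493)`, matching the 6493 predictions reported by `predictions.describe()`.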

In [None]:
!kaggle competitions submit -c bike-sharing-demand -f LOCAL_submission.csv -m "first raw submission"

#### View submission via the command line or in the web browser under the competition's page - `My Submissions`

In [None]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6


#### Initial score of `1.77321`
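
The Kaggle score is not on the same scale as the RMSE that AutoGluon optimizes locally. As far as we know, this competition is scored with root mean squared logarithmic error (RMSLE), which would also explain why negative predictions are not accepted. The sketch below is a plain-Python version of that metric for sanity-checking predictions against known counts; it is an illustration, not Kaggle's own scoring code.

```python
import math
from typing import Sequence


def rmsle(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared logarithmic error for non-negative inputs."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if min(y_true) < 0 or min(y_pred) < 0:
        raise ValueError("inputs must be non-negative")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is taken on `log(1 + value)`, being off by a factor of two costs about the same at 10 rentals as at 500, which differs from how RMSE weights errors.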

## Step 4: Exploratory Data Analysis and Creating an additional feature
* Any additional feature will do, but a great suggestion would be to separate out the datetime into hour, day, or month parts.

In [None]:
# Create a histogram of all features to show the distribution of each one relative to the data. This is part of the exploratory data analysis
train.hist(figsize=(30,18), legend=True, grid=False)

In [None]:
train.corr()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plot the correlation matrix of the features (datetime excluded), not the raw rows
remove_time = train.drop('datetime', axis=1)

fig, ax = plt.subplots(figsize=(35,30))
sns.heatmap(remove_time.corr(), annot = False, ax=ax)

In [None]:
# Create new features from the parts of the datetime column
train['hour'] = train['datetime'].dt.hour
train['day'] = train['datetime'].dt.day
train['month'] = train['datetime'].dt.month
train['year'] = train['datetime'].dt.year



test['hour'] = test['datetime'].dt.hour
test['day'] = test['datetime'].dt.day
test['month'] = test['datetime'].dt.month
test['year'] = test['datetime'].dt.year

train.head(20)
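
The same four assignments are applied to both `train` and `test`, and again in the later tuning steps. A small helper keeps the two frames in sync. This is a sketch only; the function name `add_datetime_parts` is hypothetical and it assumes the column has already been parsed as a datetime.

```python
import pandas as pd


def add_datetime_parts(df: pd.DataFrame, column: str = "datetime") -> pd.DataFrame:
    """Return a copy of df with hour, day, month and year columns added."""
    out = df.copy()
    for part in ("hour", "day", "month", "year"):
        # .dt exposes each calendar component as an attribute of the same name
        out[part] = getattr(out[column].dt, part)
    return out
```

With this helper the cell reduces to `train = add_datetime_parts(train)` and `test = add_datetime_parts(test)`.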

## Make category types for these so models know they are not just numbers
* AutoGluon originally sees these as ints, but in reality they are int representations of a category.
* Setting the dtype to category will classify these as categories in AutoGluon.

In [None]:
train["season"] = train['season'].astype('category')
train["weather"] = train['weather'].astype('category')

test["season"] = test['season'].astype('category')
test["weather"] = test['weather'].astype('category')

                                         
train.info()                                         
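
When `astype('category')` is called separately on two frames, each frame infers its own set of categories from the values it happens to contain. The sketch below, on toy data, shows one way to give both frames an identical categorical dtype by building it once from the training values. Whether this makes a difference to AutoGluon's own preprocessing is not something this notebook verifies, so treat it as an optional precaution.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy frames: the test split happens to lack weather code 4
train_toy = pd.DataFrame({"weather": [1, 2, 3, 4, 1]})
test_toy = pd.DataFrame({"weather": [1, 1, 2, 3]})

# Build the dtype once from the training data and reuse it for both frames
weather_dtype = CategoricalDtype(categories=sorted(train_toy["weather"].unique()))
train_toy["weather"] = train_toy["weather"].astype(weather_dtype)
test_toy["weather"] = test_toy["weather"].astype(weather_dtype)
```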

In [None]:
# View our new features
train.head()

In [None]:
# View histogram of all features again now with the hour feature
train.hist(figsize=(30,22), legend=True, grid=False)

## Step 5: Rerun the model with the same settings as before, just with more features

In [None]:
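# Same settings as before: casual and registered are still in `train`, so keep ignoring them
predictor_new_features = TabularPredictor(label=target, eval_metric = metric, learner_kwargs={'ignored_columns': ignore_cols}).fit(
    train_data = train,
    time_limit = ttime,
    presets = 'best_quality'
)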

In [None]:
predictor_new_features.fit_summary()

In [None]:
# predicting with new features
new_feat_predictions = predictor_new_features.predict(test)
new_feat_predictions.head()

In [None]:
# Same as before: prepare the submission file
submission_new_features = pd.read_csv('data/sampleSubmission.csv', parse_dates=['datetime'])

# Set all negative values to zero, then replace counts with the new predictions
new_feat_predictions[new_feat_predictions < 0] = 0
submission_new_features["count"] = new_feat_predictions
submission_new_features.to_csv("LOCAL_submission_new_features.csv", index=False)

In [None]:
!kaggle competitions submit -c bike-sharing-demand -f LOCAL_submission_new_features.csv -m "new features"

In [None]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

#### New Score of `0.72416`

## Step 6: Hyperparameter optimization
* There are many options for hyperparameter optimization.
* You can change AutoGluon's higher-level parameters or the hyperparameters of the individual models.
* Tuning the hyperparameters of the individual models requires the `hyperparameters` and `hyperparameter_tune_kwargs` arguments of `fit`.
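
The tuning cells in this step define search spaces such as `ag.space.Real(1e-4, 1e-1, log=True)` for a learning rate. The following sketch illustrates what sampling on a log scale means, using only the standard library. It is not AutoGluon's sampler; the function name `sample_log_uniform` is hypothetical and exists only for this illustration.

```python
import math
import random


def sample_log_uniform(lower: float, upper: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(lower), log(upper)]."""
    if not 0 < lower < upper:
        raise ValueError("bounds must satisfy 0 < lower < upper")
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


rng = random.Random(0)
draws = [sample_log_uniform(1e-4, 1e-1, rng) for _ in range(1000)]

# Each decade (1e-4..1e-3, 1e-3..1e-2, 1e-2..1e-1) should receive about a third of the draws
share_smallest_decade = sum(d < 1e-3 for d in draws) / len(draws)
```

With a linear scale, almost all draws would fall above `1e-2`, so small learning rates would rarely be tried. That is the usual reason for searching learning rates on a log scale.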

### Step 6a: AutoGluon high-level parameter set to `light`
`light`: Results in smaller models. This generally makes inference much faster and disk usage much lower, but with worse accuracy.

In [None]:

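# casual and registered are still in `train`, so keep ignoring them as in Step 3
predictor_light_hpo = TabularPredictor(label=target, eval_metric = metric, learner_kwargs={'ignored_columns': ignore_cols}).fit(
    train_data = train,
    time_limit = ttime,
    presets = 'best_quality',
    hyperparameters='light'
)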

In [None]:
predictor_light_hpo.fit_summary()

In [None]:
# Remember to set all negative values to zero

predictions_light_hpo = predictor_light_hpo.predict(test)
predictions_light_hpo.head()


In [None]:
predictions_light_hpo.describe()

In [None]:
# Summing the boolean mask counts the negative predictions
neg_val_light_hpo = (predictions_light_hpo < 0).sum()
print('Number of negative values: ', neg_val_light_hpo)

In [None]:
# Same as before: prepare the submission file
submission_light_hpo = pd.read_csv('data/sampleSubmission.csv', parse_dates=['datetime'])

# Set all negative values to zero before saving
predictions_light_hpo[predictions_light_hpo < 0] = 0
submission_light_hpo["count"] = predictions_light_hpo
submission_light_hpo.to_csv("LOCAL_submission_new_light_hpo.csv", index=False)

In [None]:
!kaggle competitions submit -c bike-sharing-demand -f LOCAL_submission_new_light_hpo.csv -m "new features with 'light' hyperparameters"

In [None]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

#### New Score of `0.47104`

### Step 6b: AutoGluon low-level hyperparameter tuning on the raw data


In [None]:
train_init = pd.read_csv('data/train.csv', parse_dates=['datetime'])
test_init = pd.read_csv('data/test.csv', parse_dates=['datetime'])
submission_init = pd.read_csv('data/sampleSubmission.csv', parse_dates=['datetime'])

ignore_cols = ['casual','registered']
train_init.drop(ignore_cols, axis=1, inplace=True)  # manual drop, instead of the ignored_columns kwarg of TabularPredictor

target = 'count'
metric = 'root_mean_squared_error'
ttime = 10 * 60 # train various models for 10 minutes, 10 x 60 seconds
train_init.info() # confirm that the casual and registered columns were removed by the manual drop


# setting up individual hyper-parameters for each algorithm
# https://lightgbm.readthedocs.io/en/latest/Parameters.html
gbm_options = {
    # 'num_boost_round': 500,
    'num_leaves': ag.space.Int(lower=100, upper=500, default=250),
    # 'tree_learner': 'feature',  #  serial, feature, data, voting
}


# https://catboost.ai/docs/concepts/parameter-tuning.html
cat_options = {
    'iterations':  ag.space.Int(200, 500, default=250),
    # 'depth': ag.space.Int(4, 10, default=6),
    # 'random_strength': ag.space.Int(0, 20, default=7),
}


# https://xgboost.readthedocs.io/en/latest/parameter.html
xgb_options = { # empty dict uses default params
}


nueral_net_option = {
    # 'num_epochs': 200,
    'learning_rate': ag.space.Real(1e-4, 1e-1, default=5e-4, log=True),
    'dropout_prob': ag.space.Real(0.01, 0.6, default=0.1),
    # 'activation': ag.space.Categorical('relu', 'softrelu', 'tanh'),
}


# hyperparameters for each model
# {} uses AutoGluon's default settings for that model
hyperparameters = {
    'GBM': gbm_options,
    'CAT': cat_options,
    'NN_TORCH': nueral_net_option,
    'XGB': xgb_options,
    'RF': {},
    'FASTAI': {}
}


hyperparameter_tune_kwargs = {
    'num_trials': 10,
    'searcher': 'auto',  # auto random, bayesopt
    'scheduler': 'local', 
}



predictor_init = TabularPredictor(label=target, eval_metric=metric).fit(
    train_data = train_init,
    time_limit = ttime,
    presets = 'best_quality',
    hyperparameters = hyperparameters,
    hyperparameter_tune_kwargs = hyperparameter_tune_kwargs   
)


predictor_init.fit_summary()


init_predictions = predictor_init.predict(test_init)
init_predictions.head()


init_predictions[init_predictions < 0] = 0


import os
os.makedirs("init_preds", exist_ok=True)  # to_csv does not create missing directories
submission_init["count"] = init_predictions
submission_init.to_csv("init_preds/INIT_submission.csv", index=False)


!kaggle competitions submit -c bike-sharing-demand -f init_preds/INIT_submission.csv -m "Init raw submission"
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

### Step 6c: AutoGluon low-level hyperparameter tuning using the new features


In [None]:

train_feat = pd.read_csv('data/train.csv', parse_dates=['datetime'])
test_feat = pd.read_csv('data/test.csv', parse_dates=['datetime'])
submission_feat = pd.read_csv('data/sampleSubmission.csv', parse_dates=['datetime'])

ignore_cols = ['casual','registered']
train_feat.drop(ignore_cols, axis=1, inplace=True)  # manual drop, instead of the ignored_columns kwarg of TabularPredictor

target = 'count'
metric = 'root_mean_squared_error'
ttime = 10 * 60 # train various models for 10 minutes, 10 x 60 seconds


train_feat['hour'] = train_feat['datetime'].dt.hour
train_feat['day'] = train_feat['datetime'].dt.day
train_feat['month'] = train_feat['datetime'].dt.month
train_feat['year'] = train_feat['datetime'].dt.year
test_feat['hour'] = test_feat['datetime'].dt.hour
test_feat['day'] = test_feat['datetime'].dt.day
test_feat['month'] = test_feat['datetime'].dt.month
test_feat['year'] = test_feat['datetime'].dt.year


train_feat["season"] = train_feat['season'].astype('category')
train_feat["weather"] = train_feat['weather'].astype('category')
test_feat["season"] = test_feat['season'].astype('category')
test_feat["weather"] = test_feat['weather'].astype('category')


train_feat.info() # confirm that the casual and registered columns were removed by the manual drop


# setting up individual hyper-parameters for each algorithm
# https://lightgbm.readthedocs.io/en/latest/Parameters.html
gbm_options = {
    # 'num_boost_round': 500,
    'num_leaves': ag.space.Int(lower=100, upper=500, default=250),
    # 'tree_learner': 'feature',  #  serial, feature, data, voting
}


# https://catboost.ai/docs/concepts/parameter-tuning.html
cat_options = {
    'iterations':  ag.space.Int(200, 500, default=250),
    # 'depth': ag.space.Int(4, 10, default=6),
    # 'random_strength': ag.space.Int(0, 20, default=7),
}


# https://xgboost.readthedocs.io/en/latest/parameter.html
xgb_options = { # empty dict uses default params
}


nueral_net_option = {
    # 'num_epochs': 250,
    'learning_rate': ag.space.Real(1e-4, 1e-1, default=5e-4, log=True),
    'dropout_prob': ag.space.Real(0.01, 0.6, default=0.1),
    # 'activation': ag.space.Categorical('relu', 'softrelu', 'tanh'),
}


# hyperparameters for each model
# {} uses AutoGluon's default settings for that model
hyperparameters = {
    'GBM': gbm_options,
    'CAT': cat_options,
    'NN_TORCH': nueral_net_option,
    'XGB': xgb_options,
    'RF': {},
    'FASTAI': {}
}


hyperparameter_tune_kwargs = {
    'num_trials': 10,
    'searcher': 'auto',  # auto random, bayesopt
    'scheduler': 'local', 
}



predictor_feat = TabularPredictor(label=target, eval_metric=metric).fit(
    train_data = train_feat,
    time_limit = ttime,
    presets = 'best_quality',
    hyperparameters = hyperparameters,
    hyperparameter_tune_kwargs = hyperparameter_tune_kwargs   
)


predictor_feat.fit_summary()


feat_predictions = predictor_feat.predict(test_feat)
feat_predictions.head()


feat_predictions[feat_predictions < 0] = 0


import os
os.makedirs("feat_preds", exist_ok=True)  # to_csv does not create missing directories
submission_feat["count"] = feat_predictions
submission_feat.to_csv("feat_preds/FEAT_submission.csv", index=False)


!kaggle competitions submit -c bike-sharing-demand -f feat_preds/FEAT_submission.csv -m "New Features submission"
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

### Step 6d: AutoGluon low-level extreme hyperparameter tuning using the new features
#### (Tuned LightGBM, CatBoost, Neural Network, Random Forest, XGBoost)


In [None]:

train_hpo = pd.read_csv('data/train.csv', parse_dates=['datetime'])
test_hpo = pd.read_csv('data/test.csv', parse_dates=['datetime'])
submission_hpo = pd.read_csv('data/sampleSubmission.csv', parse_dates=['datetime'])

ignore_cols = ['casual','registered']
train_hpo.drop(ignore_cols, axis=1, inplace=True)  # manual drop, instead of the ignored_columns kwarg of TabularPredictor

target = 'count'
metric = 'root_mean_squared_error'
ttime = 10 * 60 # train various models for 10 minutes, 10 x 60 seconds


train_hpo['hour'] = train_hpo['datetime'].dt.hour
train_hpo['day'] = train_hpo['datetime'].dt.day
train_hpo['month'] = train_hpo['datetime'].dt.month
train_hpo['year'] = train_hpo['datetime'].dt.year
test_hpo['hour'] = test_hpo['datetime'].dt.hour
test_hpo['day'] = test_hpo['datetime'].dt.day
test_hpo['month'] = test_hpo['datetime'].dt.month
test_hpo['year'] = test_hpo['datetime'].dt.year


train_hpo["season"] = train_hpo['season'].astype('category')
train_hpo["weather"] = train_hpo['weather'].astype('category')
test_hpo["season"] = test_hpo['season'].astype('category')
test_hpo["weather"] = test_hpo['weather'].astype('category')


# train_hpo.info() # confirm that the casual and registered columns were removed by the manual drop


# setting up individual hyper-parameters for each algorithm
# https://lightgbm.readthedocs.io/en/latest/Parameters.html
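gbm_options = {
    'num_boost_round': 500,
    'num_leaves': ag.space.Int(lower=100, upper=700), # default=250),
    # a plain Python list is passed through as a single value, so wrap the choices in a search space
    'tree_learner': ag.space.Categorical('serial', 'feature', 'data', 'voting')
}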


# https://catboost.ai/docs/concepts/parameter-tuning.html
cat_options = {
    'iterations':  ag.space.Int(50, 1000), #, default=250),
    'depth': ag.space.Int(2, 16), # CatBoost limits tree depth to 16, so the upper bound is capped there
    'random_strength': ag.space.Int(0, 200), #, default=7),
}


# https://xgboost.readthedocs.io/en/latest/parameter.html
xgb_options = {
    # each default must lie inside its own search range
    'learning_rate': ag.space.Real(1e-3, 1e-1, default=1e-2, log=True),
    'max_depth': ag.space.Int(6, 200), #, default=6),
    'min_child_weight': ag.space.Int(6, 250), #, default=6),
    'subsample': ag.space.Real(0.1, 1, default=0.4),
    'lambda':  ag.space.Real(0.5, 10, default=1.0),
    'alpha': ag.space.Real(0.5, 10, default=1.0),
}


# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
rf_options = {
    'n_estimators': ag.space.Int(150, 2000),
    'max_depth': ag.space.Int(6, 500),
    # wrap the choices in a search space rather than a plain list
    'bootstrap': ag.space.Categorical(True, False)
}


nueral_net_option = {
    'num_epochs': 400,
    'learning_rate': ag.space.Real(1e-5, 1e-1, default=5e-4, log=True),
    'dropout_prob': ag.space.Real(0.05, 0.6, default=0.1),
    'activation': ag.space.Categorical('relu', 'softrelu', 'tanh'),
}


# hyperparameters for each model
# {} uses AutoGluon's default settings for that model
hyperparameters = {
    'GBM': gbm_options,
    'CAT': cat_options,
    'RF': rf_options,
    'XGB': xgb_options,
    'NN_TORCH': nueral_net_option,
    'FASTAI': {},
    'KNN': {}
}


hyperparameter_tune_kwargs = {
    'num_trials': 10,
    'searcher': 'bayes',  # auto, random, or bayes
    'scheduler': 'local', 
}



predictor_hpo = TabularPredictor(label=target, eval_metric=metric).fit(
    train_data = train_hpo,
    time_limit = ttime,
    presets = 'best_quality',
    hyperparameters = hyperparameters,
    hyperparameter_tune_kwargs = hyperparameter_tune_kwargs   
)


predictor_hpo.fit_summary()


In [None]:

predictions_hpo = predictor_hpo.predict(test_hpo)
predictions_hpo.head()


predictions_hpo[predictions_hpo < 0] = 0


import os
os.makedirs("hpo_preds", exist_ok=True)  # to_csv does not create missing directories
submission_hpo["count"] = predictions_hpo
submission_hpo.to_csv("hpo_preds/HPO_submission.csv", index=False)


!kaggle competitions submit -c bike-sharing-demand -f hpo_preds/HPO_submission.csv -m "New Features - More Tuning Run 2 - submission"
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

## Step 7: Write a Report
### Refer to the markdown file for the full report
### Creating plots and table for report

In [None]:
# Taking the top model score from each training run and creating a line plot to show improvement
# You can create these in the notebook and save them to PNG or use some other tool (e.g. google sheets, excel)
fig = pd.DataFrame(
    {
        "model": ["initial", "add_features", 'hpo_init', "light_hpo", 'hpo_feat', 'hpo_hpo'],
        "score": [-51.016775, -30.188990, -114.598932,  -37.163054, -33.831518, -35.414700]
    }
).plot(x="model", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_train_score.png')


In [None]:
# Take the Kaggle scores from all six runs and create a line plot to show improvement
fig = pd.DataFrame(
    {
        "test_eval": ["initial", "add_features",  'hpo_init', "light_hpo",  'hpo_feat', 'hpo_hpo'],
        "score":  [1.77321, 0.72416,  1.40947,  0.47104, 0.45144, 0.49312]
    }
).plot(x="test_eval", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_test_score.png')
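
A line plot shows the trend, but the report may also need the relative change of each run against the first submission. The sketch below computes that from the same Kaggle scores used in the plot; the variable names are illustrative only.

```python
# Kaggle scores copied from the plotting cell (lower is better)
kaggle_scores = {
    "initial": 1.77321,
    "add_features": 0.72416,
    "hpo_init": 1.40947,
    "light_hpo": 0.47104,
    "hpo_feat": 0.45144,
    "hpo_hpo": 0.49312,
}

baseline = kaggle_scores["initial"]

# A positive percentage means the run improved on the initial submission
improvement_pct = {
    name: round(100 * (baseline - score) / baseline, 1)
    for name, score in kaggle_scores.items()
}

# The run with the lowest score is the best one
best_run = min(kaggle_scores, key=kaggle_scores.get)
```

Here `best_run` evaluates to `"hpo_feat"`, the tuned run on the engineered features.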

### Hyperparameter table

In [None]:
# The hyperparameters tuned in each of the three HPO runs, with the Kaggle score as the result
pd.DataFrame({
    "model": ["hpo_init",    "hpo_feat",     "hpo_hpo"],
    "hpo1": ['num_leaves',  'num_leaves',   ['num_leaves', 'num_boost_round', 'tree_learner']],
    "hpo2": ['iterations',  'iterations',   ['iterations', 'depth', 'num_epochs']],
    "hpo3": ['learning_rate',   'learning_rate',    ['learning_rate', 'dropout_prob', 'random_strength']],
    "score": [1.40947, 0.45144, 0.49312]
})