# Hamilton - Time Series model

#### Requirements:

- Install dependencies (listed in `requirements.txt`)
- Download and decompress the data

More details on how to set up your environment can be found [here](https://github.com/flaviassantos/hamilton/tree/main/examples/model_examples/time-series#set-up).

***

Uncomment and run the cell below if you are in a Google Colab environment. It will:
1. Mount Google Drive. You will be asked to authenticate and grant permissions.
2. Change directory to Google Drive.
3. Make a directory named "hamilton-tutorials".
4. Change directory to it.
5. Clone this repository to your Google Drive.
6. Change your current directory to the time-series example.
7. Install requirements.

Because the repository is cloned to Google Drive, any modifications will be saved, and you won't lose them if you close your browser.

In [1]:
## 1. Mount Google Drive.
# from google.colab import drive
# drive.mount('/content/drive')
## 2. Change directory to Google Drive.
# %cd /content/drive/MyDrive
## 3. Make a directory "hamilton-tutorials".
# !mkdir hamilton-tutorials
## 4. Change directory to it.
# %cd hamilton-tutorials
## 5. Clone this repository to your Google Drive.
# !git clone https://github.com/DAGWorks-Inc/hamilton/
## 6. Change your current directory to the time-series example.
# %cd hamilton/examples/model_examples/time-series
## 7. Install requirements.
# %pip install -r requirements.txt
# from IPython.display import clear_output
# clear_output()  # optionally clear outputs
# To check your current working directory you can type `!pwd` in a cell and run it.
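
If the same notebook is shared between Colab and local machines, one possible way to guard the setup steps is to check whether the `google.colab` package is importable before running them. The helper name `in_colab` below is hypothetical and is not part of this example's code; treat it as a minimal sketch.

```python
import importlib.util


def in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        # find_spec imports the parent package ("google") first, which
        # raises ModuleNotFoundError when that package is not installed.
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        return False


if in_colab():
    print("Running in Colab: uncomment and run the setup cell.")
else:
    print("Not in Colab: skip the setup cell.")
```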

***
This example shows how one might use Hamilton to build a time-series model for the M5 Forecasting Kaggle challenge.

>For demonstration purposes, the data used to train the model in this notebook has been reduced.
***

In [3]:
import logging
import sys
import time

import data_loaders
import model_pipeline
import pandas as pd
import transforms

from hamilton import driver

In [4]:
logger = logging.getLogger(__name__)


# These values are hard-coded here, but they could be passed in or loaded from a separate versioned file.
model_params = {
    "num_leaves": 55,
    "min_child_weight": 0.034,
    "feature_fraction": 0.379,
    "bagging_fraction": 0.418,
    "min_data_in_leaf": 106,
    "objective": "regression",
    "max_depth": -1,
    "learning_rate": 0.005,
    "boosting_type": "gbdt",
    "bagging_seed": 11,
    "metric": "rmse",
    "verbosity": -1,
    "reg_alpha": 0.3899,
    "reg_lambda": 0.648,
    "random_state": 222,
}
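
As the comment above notes, the parameters could instead live in a versioned file. The following is a minimal sketch of that option, assuming a JSON file; the file name `model_params.json` and the helper `load_model_params` are hypothetical and do not exist in this example. The demo writes a small subset of the parameters to a temporary directory so that it runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_model_params(path):
    """Read model parameters from a JSON file and return them as a dict."""
    with open(path) as f:
        params = json.load(f)
    if not isinstance(params, dict):
        raise ValueError(f"Expected a JSON object in {path}")
    return params


# Demo only: write a subset of the parameters to a temporary file and read it back.
example_params = {"num_leaves": 55, "learning_rate": 0.005, "objective": "regression"}
with tempfile.TemporaryDirectory() as tmp_dir:
    params_path = Path(tmp_dir) / "model_params.json"  # hypothetical file name
    params_path.write_text(json.dumps(example_params, indent=2))
    loaded_params = load_model_params(params_path)

print(loaded_params)
```

The loaded dictionary could then be passed as the `"model_params"` entry of the driver config in place of the hard-coded one.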

In [5]:
def main():
    """The main function to orchestrate everything."""
    start_time = time.time()
    config = {
        "calendar_path": "m5-forecasting-accuracy/calendar.csv",
        "sell_prices_path": "m5-forecasting-accuracy/sell_prices.csv",
        "sales_train_validation_path": "m5-forecasting-accuracy/sales_train_validation.csv",
        "submission_path": "m5-forecasting-accuracy/sample_submission.csv",
        "load_test2": "False",
        "n_fold": 2,
        "model_params": model_params,
        "num_rows_to_skip": 2750000,  # for training set
    }
    dr = driver.Driver(config, data_loaders, transforms, model_pipeline)
    dr.display_all_functions("./all_functions.dot", {"format": "png"})
    dr.visualize_execution(
        ["kaggle_submission_df"], "./kaggle_submission_df.dot", {"format": "png"}
    )
    kaggle_submission_df: pd.DataFrame = dr.execute(["kaggle_submission_df"])
    duration = time.time() - start_time
    logger.info(f"Duration: {duration}")
    kaggle_submission_df.to_csv("kaggle_submission_df.csv", index=False)
    logger.info(f"Shape of submission DF: {kaggle_submission_df.shape}")
    logger.info(kaggle_submission_df.head())
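
The `driver.Driver(config, data_loaders, transforms, model_pipeline)` call builds a dataflow from the functions in those modules: each function defines a node named after the function, and its parameter names declare the nodes or config values it depends on. The block below is a toy, standard-library-only illustration of that idea, not Hamilton's implementation. The functions `raw_sales`, `total_sales`, and `mean_sales` and the `execute` helper are hypothetical and do not come from this example's modules.

```python
import inspect


# Hypothetical stand-ins for functions that would live in a module.
def raw_sales(num_rows: int) -> list:
    """Pretend loader: returns a small list of daily sales."""
    return list(range(1, num_rows + 1))


def total_sales(raw_sales: list) -> int:
    """Depends on the node named `raw_sales`."""
    return sum(raw_sales)


def mean_sales(total_sales: int, raw_sales: list) -> float:
    """Depends on the nodes named `total_sales` and `raw_sales`."""
    return total_sales / len(raw_sales)


def execute(target, functions, config, cache=None):
    """Compute `target` by recursively resolving the parameters it names."""
    cache = {} if cache is None else cache
    if target in cache:
        return cache[target]
    if target in config:  # config values act as inputs to the graph
        return config[target]
    fn = functions[target]
    kwargs = {
        name: execute(name, functions, config, cache)
        for name in inspect.signature(fn).parameters
    }
    cache[target] = fn(**kwargs)  # each node is computed at most once
    return cache[target]


functions = {f.__name__: f for f in (raw_sales, total_sales, mean_sales)}
result = execute("mean_sales", functions, {"num_rows": 4})
print(result)  # 2.5
```

In this toy version, requesting `mean_sales` plays the role that requesting `kaggle_submission_df` plays in `main()`: only the nodes needed for the requested output are computed.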

In [6]:
logging.basicConfig(level=logging.INFO, stream=sys.stdout)
main()

INFO:data_loaders:Loading from parquet.
INFO:data_loaders:submission has 60980 rows and 29 columns
INFO:data_loaders:Loading from parquet.
INFO:data_loaders:sales_train_validation has 3049 rows and 1919 columns
INFO:utils:sales_train_validation: Mem. usage decreased to 322.63 Mb (9.4% reduction)
INFO:data_loaders:Melted sales train validation has 5832737 rows and 8 columns
INFO:data_loaders:Loading from parquet.
INFO:utils:calendar: Mem. usage decreased to  0.12 Mb (41.9% reduction)
INFO:data_loaders:calendar has 1969 rows and 14 columns
INFO:data_loaders:Loading from parquet.
INFO:utils:sell_prices: Mem. usage decreased to 14.35 Mb (31.2% reduction)
INFO:data_loaders:sell_prices has 684112 rows and 4 columns
INFO:data_loaders:Our final dataset to train has 3936457 rows and 18 columns
INFO:model_pipeline:Fold: 1
Training until validation scores don't improve for 50 rounds
[100]	training's rmse: 3.25014	valid_1's rmse: 2.51533
[200]	training's rmse: 2.91563	valid_1's rmse: 2.25289
[300]

***
Here's the Kaggle Submission DAG that this code executes:
***
![DAG](kaggle_submission_df.dot.png)
