# gQuant tutorial: build an XGBoost model in 30 minutes to predict the next-day stock return


XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. Since its introduction, it has been credited not only with winning numerous Kaggle competitions but also with being the driving force behind several cutting-edge industry applications. XGBoost natively supports GPU acceleration, which can speed up training and inference by orders of magnitude.

gQuant is a graph computation tool built on top of RAPIDS, which includes the XGBoost algorithm. The gQuant project has a JupyterLab extension that guides the user through building data science workflows in the browser. In this tutorial, we will learn step by step how to use the gQuant user interface and build a simple XGBoost model from scratch to predict whether the next-day stock return is positive or negative. The tutorial is organized as follows:

    1. Prepare a fake dataset with categorical variables
    2. Preprocess the dataset so it is ready for the XGBoost algorithm
    3. Train an XGBoost model and run inference
    4. Visualize the machine learning results
    5. Accelerate XGBoost inference with the forest inference library
    6. Switch to a stock dataset and predict the positive/negative next-day stock return
    
Each step comes with animated GIF files that show the details. To get the most out of the tutorial, we recommend following the steps in the animations and trying to reproduce the results.

## Prepare the environment

Let's import the `TaskGraph` class from the gQuant library.

In [1]:
import sys; sys.path.insert(0, '..')
from gquant.dataframe_flow import TaskGraph

## Prepare for running in Dask environment

Let's start the Dask local cluster environment for distributed computation.

Dask provides a web-based dashboard that helps track progress, identify performance issues, and debug failures. To learn more about the Dask dashboard, follow this [link](https://distributed.dask.org/en/latest/web.html).


In [2]:
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()
from dask.distributed import Client
client = Client(cluster)
client

| Client | Cluster |
| --- | --- |
| Scheduler: tcp://127.0.0.1:39931 | Workers: 2 |
| Dashboard: http://127.0.0.1:8787/status | Cores: 2 |
| | Memory: 100.00 GB |


## Prepare the dataset

### Add a data generator node
In this step, we add gQuant TaskGraph nodes, connect them, and evaluate the graph.
<img src="images/xgboost/create_node.gif" align="center">

### Explore and visualize the data
In this step, we change the data generator configuration and visualize the result.
<img src="images/xgboost/visualize_data.gif" align="center">

### Add categorical variable
To simulate categorical variables, we convert two of the continuous variables into categorical variables and encode them with one-hot encoding.
<img src="images/xgboost/categorical_variable.gif" align="center">
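
The idea of turning a continuous variable into a one-hot encoded categorical variable can be sketched in plain Python. This is only an illustration, not the implementation of the gQuant nodes used in the animation; the cut points, the sample values, and the helper names `bin_value` and `one_hot` are assumptions made for this example.

```python
from bisect import bisect_right


def bin_value(value, edges):
    """Return the index of the bin that `value` falls into.

    `edges` are sorted interior cut points, so there are len(edges) + 1 bins.
    """
    return bisect_right(edges, value)


def one_hot(category, n_categories):
    """Encode an integer category as a list of 0/1 indicator values."""
    return [1 if i == category else 0 for i in range(n_categories)]


# Hypothetical continuous feature values and cut points (not from the tutorial).
feature = [-1.7, -0.2, 0.4, 2.3]
edges = [-1.0, 0.0, 1.0]  # 3 cut points -> 4 bins

# Step 1: continuous value -> integer category.
categories = [bin_value(v, edges) for v in feature]

# Step 2: integer category -> one indicator column per category.
encoded = [one_hot(c, len(edges) + 1) for c in categories]

for value, category, row in zip(feature, categories, encoded):
    print(value, category, row)
```

Each encoded row contains exactly one `1`, so a single continuous column becomes as many indicator columns as there are bins.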

After this step, you should have a TaskGraph that looks like this:

In [3]:
task_graph = TaskGraph.load_taskgraph('../taskgraphs/xgboost_example/data_generator.gq.yaml')
task_graph.draw()

GQuantWidget(sub=HBox(), value=[OrderedDict([('id', 'data_gen'), ('type', 'ClassificationData'), ('conf', {'n_…

### Save the graph and create a composite node
We encapsulate the dataset creation steps in a single composite node.
<img src="images/xgboost/create_composite_node.gif" align="center">

## Preprocess the data

### Split the dataset into train and test
We split the dataset randomly into train and test so we can test the performance of the learned XGBoost model later.
<img src="images/xgboost/split_the_dataset.gif" align="center">
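
As a rough sketch of what a random train/test split does, the snippet below shuffles row indices with a fixed seed and cuts them at a chosen fraction. The function name `train_test_split`, the 80/20 ratio, and the sample rows are assumptions for illustration; the gQuant node in the animation has its own configuration.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_train = int(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


# Hypothetical (feature, label) rows.
rows = [(i, i % 2) for i in range(10)]
train, test = train_test_split(rows, train_fraction=0.8, seed=42)

print(len(train), len(test))
```

Fixing the seed keeps the split identical between runs, which makes later comparisons of model performance easier to reproduce.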

### Normalize the features
Although it is not needed for an XGBoost model, normalizing the features can be useful for other machine learning models. Think of this step as a placeholder for whatever preprocessing is needed to clean up the dataset.
<img src="images/xgboost/normalize.gif" align="center">
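
One common form of normalization is z-scoring, where each feature is shifted to zero mean and scaled to unit standard deviation. The sketch below assumes that form (the tutorial does not state which scheme the node uses) and fits the statistics on the training column only, then reuses them for the test column. The names `fit_scaler` and `transform` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Compute the mean and population standard deviation of a column."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        sigma = 1.0  # avoid dividing by zero for constant columns
    return mu, sigma


def transform(column, mu, sigma):
    """Apply previously fitted statistics to a column."""
    return [(x - mu) / sigma for x in column]


# Hypothetical feature column, already split into train and test parts.
train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 10.0]

mu, sigma = fit_scaler(train_col)          # statistics come from train only
train_scaled = transform(train_col, mu, sigma)
test_scaled = transform(test_col, mu, sigma)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test data avoids leaking information from the test set into preprocessing.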

After this step, you should have a TaskGraph that looks like this:

In [4]:
task_graph = TaskGraph.load_taskgraph('../taskgraphs/xgboost_example/ml_preprocess.gq.yaml')
task_graph.draw()

GQuantWidget(sub=HBox(), value=[OrderedDict([('id', ''), ('type', 'Output_Collector'), ('conf', {}), ('inputs'…

## Machine Learning
### Train an XGBoost Model and run inference
In this step, we feed the prepared dataset into an XGBoost training node. The resulting model object is then used to run inference on both the train and test datasets.
<img src="images/xgboost/train_and_infer.gif" align="center">
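
To give a feel for what "train, then run inference" means for a boosted tree model, here is a toy gradient boosting classifier built from one-split decision stumps on a single feature. It is a plain first-order sketch with logistic loss, not XGBoost itself (which adds second-order information, regularization, and GPU support); every function name and data value below is an assumption made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:  # the largest value cannot split the data
            continue
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def train(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the negative gradient of the logistic loss."""
    stumps = []
    scores = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, x)
                  for s, x in zip(scores, xs)]
    return {"stumps": stumps, "learning_rate": learning_rate}


def predict_proba(model, x):
    """Inference: sum the scaled stump outputs and squash to a probability."""
    score = sum(model["learning_rate"] * stump_predict(s, x)
                for s in model["stumps"])
    return sigmoid(score)


# Hypothetical one-feature training set with binary labels.
xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.8, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

model = train(xs, ys)
print(predict_proba(model, 0.2), predict_proba(model, 1.9))
```

The trained model is just a list of small trees, which is why the same model object can be reused for inference on any dataset with the same features.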


### gQuant evaluates a subgraph, no wasted computation
In this step, we show that switching to a dask_cudf output port lets the graph run in a distributed environment automatically. The graph only runs the nodes that are needed to produce the requested results.
<img src="images/xgboost/dask_and_sub_graph.gif" align="center">
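
The "no wasted computation" behavior can be illustrated with a tiny graph evaluator that only runs the nodes a requested output depends on. This is a conceptual sketch, not gQuant's scheduler; the `evaluate` function, the graph layout, and all node names other than `data_gen` are hypothetical.

```python
def evaluate(graph, targets):
    """Evaluate only the nodes that `targets` depend on.

    `graph` maps a node id to (function, list of input node ids).
    Returns the target results and the list of nodes that actually ran.
    """
    cache = {}
    executed = []

    def run(node_id):
        if node_id not in cache:
            func, inputs = graph[node_id]
            args = [run(i) for i in inputs]  # evaluate upstream nodes first
            cache[node_id] = func(*args)
            executed.append(node_id)
        return cache[node_id]

    results = {t: run(t) for t in targets}
    return results, executed


# Hypothetical four-node graph; "plot" is not needed to compute "train".
graph = {
    "data_gen": (lambda: [1.0, 2.0, 3.0, 4.0], []),
    "normalize": (lambda xs: [x / max(xs) for x in xs], ["data_gen"]),
    "train": (lambda xs: sum(xs) / len(xs), ["normalize"]),
    "plot": (lambda xs: f"plot of {len(xs)} points", ["data_gen"]),
}

results, executed = evaluate(graph, ["train"])
print(results)
print(executed)
```

Requesting only `train` leaves the `plot` node untouched, and the cache ensures a shared upstream node such as `data_gen` runs once even when several outputs depend on it.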


After this step, you should have a TaskGraph that looks like this:

In [5]:
task_graph = TaskGraph.load_taskgraph('../taskgraphs/xgboost_example/xgboost_model.gq.yaml')
task_graph.draw()

GQuantWidget(sub=HBox(), value=[OrderedDict([('id', 'data_gen'), ('type', 'ClassificationData'), ('conf', {'n_…

### Visualize the training result
gQuant provides analysis nodes to evaluate the XGBoost model. In this step, we check the ROC curve and the feature importances.
<img src="images/xgboost/xgboost_metrics.gif" align="center">
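
For readers unfamiliar with the ROC curve, the sketch below computes its points and the area under it (AUC) from labels and predicted scores in plain Python. It illustrates the metric only and is not the code behind the gQuant analysis node; the function names and the sample labels and scores are assumptions.

```python
def roc_curve(labels, scores):
    """Return (false positive rate, true positive rate) points.

    The decision threshold is swept from the highest score to the lowest;
    tied scores are processed together so they produce a single point.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical true labels and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_curve(labels, scores)
print(points)
print(auc(points))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.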


After this step, you should have a TaskGraph that looks like this:

In [6]:
task_graph = TaskGraph.load_taskgraph('../taskgraphs/xgboost_example/metrics.gq.yaml')
task_graph.draw()

GQuantWidget(sub=HBox(), value=[OrderedDict([('id', 'data_gen'), ('type', 'ClassificationData'), ('conf', {'n_…

### Forest inference for deployment
The RAPIDS Forest Inference Library provides a large performance boost for XGBoost model inference, as shown in this [blog](https://medium.com/rapids-ai/rapids-forest-inference-library-prediction-at-100-million-rows-per-second-19558890bc35). In this step, we export the XGBoost model so that it can be used with the forest inference library.
<img src="images/xgboost/forest_inference.gif" align="center">

### Distributed inference
In production, there is usually a lot of data to process, so we run inference in a distributed environment.
<img src="images/xgboost/distributed_inference.gif" align="center">

After this step, you should have a TaskGraph that looks like this:

In [7]:
task_graph = TaskGraph.load_taskgraph('../taskgraphs/xgboost_example/tree_inference.gq.yaml')
task_graph.draw()

GQuantWidget(sub=HBox(), value=[OrderedDict([('id', 'data_gen'), ('type', 'ClassificationData'), ('conf', {'n_…

### Create a custom node
Since we have a nice XGBoost model composite node, let's convert it to a normal gQuant node for future use without writing any Python code. How cool is that!
<img src="images/xgboost/custom_node.gif" align="center">

## Real life example

So far, we have worked with fake data to predict binary classes. Let's switch to a more meaningful dataset.

### Get the stock data
We prepare a dataset whose features are computed from technical indicators. We convert the next-day return into a binary label indicating a positive or negative return.
We reuse the TaskGraph from the previous [06_xgboost_trade](https://github.com/rapidsai/gQuant/blob/master/notebooks/06_xgboost_trade.ipynb) notebook.
<img src="images/xgboost/prepare_stock_data.gif" align="center">
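
The labeling step can be sketched as follows. The snippet assumes close-to-close returns and treats a zero return as negative; both choices, the function name `next_day_labels`, and the price series are assumptions for illustration rather than the exact logic of the TaskGraph.

```python
def next_day_labels(closes):
    """Label each day 1 if the next day's return is positive, else 0.

    The last day has no next-day close, so it receives no label.
    """
    labels = []
    for today, tomorrow in zip(closes, closes[1:]):
        next_return = tomorrow / today - 1.0  # close-to-close return
        labels.append(1 if next_return > 0 else 0)
    return labels


# Hypothetical daily closing prices.
closes = [100.0, 102.0, 101.0, 101.0, 105.0]
labels = next_day_labels(closes)

print(labels)
```

Because each label looks one day ahead, the features for a given day must be computed only from data available up to that day to avoid look-ahead bias.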

After this step, you should have a TaskGraph that looks like this:


In [8]:
task_graph = TaskGraph.load_taskgraph('../taskgraphs/xgboost_example/stock_data.gq.yaml')
task_graph.draw()

GQuantWidget(sub=HBox(), value=[OrderedDict([('id', 'stock_data'), ('type', 'CsvStockLoader'), ('conf', {'file…

### Run the XGBoost model on the stock data
This is the last step! We add the custom XGBoost node created earlier, so we can now train a model and make predictions for our stock dataset easily. As you can see, the ROC result is not bad at all!
<img src="images/xgboost/xgboost_stock_data.gif" align="center">

After this step, you should have a TaskGraph that looks like this:

Note that you need to create the custom node as shown earlier to see this graph.

In [9]:
task_graph = TaskGraph.load_taskgraph('../taskgraphs/xgboost_example/xgboost_stock.gq.yaml')
task_graph.draw()

GQuantWidget(sub=HBox(), value=[OrderedDict([('id', 'stock_data'), ('type', 'CsvStockLoader'), ('conf', {'file…