<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# NVTabular demo on RecSys2020 Challenge

## Overview

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.  It provides a high level abstraction to simplify code and accelerates computation on the GPU using the RAPIDS cuDF library.

## RecSys2020 Challenge

The [RecSys](https://recsys.acm.org/) conference is the leading data science conference for recommender systems and organizes an annual competiton in recommender systems. The [RecSys Challenge 2020](https://recsys-twitter.com/), hosted by Twitter, was about predicting interactions of ~200 mio. tweet-user pairs. NVIDIA's team scored 1st place. The team explained the solution in the [blogpost](https://medium.com/rapids-ai/winning-solution-of-recsys2020-challenge-gpu-accelerated-feature-engineering-and-training-for-cd67c5a87b1f) and published the code on [github](https://github.com/rapidsai/deeplearning/tree/main/RecSys2020).

## Learning objectives

This notebook covers the end-2-end pipeline, from loading the original .tsv file to training model, with NVTabular, dask_cudf and XGBoost. We demonstrate multi-GPU support for NVTabular and new `nvt.ops`, implemented based on the success of our RecSys2020 solution.
1. **NVTabular** to preprocess the original .tsv file.
2. **dask_cudf** to split the preprocessed data into a training and validation set with.
3. **NVTabular** to create additional features with only **~70 lines of code**.
4. **dask_cudf** / **XGBoost on GPU** to train our model.

In [1]:
# External Dependencies
import time
import glob
import gc

import cupy
import cupy as cp
import cudf
import rmm
import dask
import dask_cudf

import numpy as np
import nvtabular as nvt
import xgboost as xgb

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from dask.distributed import wait
from dask.utils import parse_bytes
from dask.delayed import delayed
from nvtabular.utils import device_mem_size


Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so.

For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice/.

For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')


In [2]:
time_total_start = time.time()


In [3]:
BASE_DIR = "/raid/data/recsys2020/"


First, we initalize our local cuda cluster and assign the memory pool.

In [4]:
# device_spill_frac = 0.9
# capacity = device_mem_size(kind="total")

# cluster = LocalCUDACluster(protocol="ucx",
#                           rmm_pool_size="31GB",
#                           CUDA_VISIBLE_DEVICES = "0,1",
#                           local_directory = BASE_DIR + 'dask/',
#                           device_memory_limit = capacity * device_spill_frac,
#                           enable_tcp_server_over_ucx=True,
#                           enable_nvlink=True)
# client = Client(cluster)
# client

cluster = LocalCUDACluster(protocol="tcp", rmm_pool_size="31GB")
client = Client(cluster)
client


0,1
Client  Scheduler: tcp://127.0.0.1:43681  Dashboard: http://127.0.0.1:8787/status,Cluster  Workers: 2  Cores: 2  Memory: 49.17 GB


In [5]:
# Initialize RMM pool on ALL workers
def _rmm_pool():
    rmm.reinitialize(
        pool_allocator=True, initial_pool_size=None,  # Use default size
    )


client.run(_rmm_pool)


{'tcp://127.0.0.1:32997': None, 'tcp://127.0.0.1:37465': None}

Next, we define the data schema:
- features are the column names in the original .tsv file. The .tsv file has no header.
- cat_names are the categorical features, we want to use.
- cont_names are the numerical features, we want to use.
- label_name contains the target column.

In [6]:
features = [
    "text_tokens",  ###############
    "hashtags",  # Tweet Features
    "tweet_id",  #
    "media",  #
    "links",  #
    "domains",  #
    "tweet_type",  #
    "language",  #
    "timestamp",  ###############
    "a_user_id",  ###########################
    "a_follower_count",  # Engaged With User Features
    "a_following_count",  #
    "a_is_verified",  #
    "a_account_creation",  ###########################
    "b_user_id",  #######################
    "b_follower_count",  # Engaging User Features
    "b_following_count",  #
    "b_is_verified",  #
    "b_account_creation",  #######################
    "b_follows_a",  #################### Engagement Features
    "reply",  # Target Reply
    "retweet",  # Target Retweet
    "retweet_comment",  # Target Retweet with comment
    "like",  # Target Like
    ####################
]

cat_names = [
    "hashtags",
    "tweet_id",
    "media",
    "links",
    "domains",
    "tweet_type",
    "language",
    "a_user_id",
    "a_is_verified",
    "b_user_id",
    "b_is_verified",
    "b_follows_a",
]

cont_names = [
    "timestamp",
    "a_follower_count",
    "a_following_count",
    "a_account_creation",
    "b_follower_count",
    "b_following_count",
    "b_account_creation",
]

label_name = ["reply", "retweet", "retweet_comment", "like"]


We initialize our NVTabular workflow.

In [7]:
proc = nvt.Workflow(cat_names=cat_names, cont_names=cont_names, label_name=label_name)


We define two helper function, we apply in our NVTabular workflow:
1. splitmedia2 splits the entries in media by `\t` and keeps only the first two values (if available),
2. count_token counts the number of token in a column (e.g. how many hashtags are in a tweet),

In [8]:
def splitmedia2(col):
    if col.shape[0] == 0:
        return col
    else:
        return (
            col.str.split("\t")[0].fillna("") + "_" + col.str.split("\t")[1].fillna("")
        )


def count_token(col, token):
    not_null = col.isnull() == 0
    return ((col.str.count(token) + 1) * not_null).fillna(0)


We initialize a nvt.Dataset. The engine is `csv` as the `.tsv` file has a similar structure. The `.tsv` file uses the special character `\x01` to separate columns. There is no header in the file and we define column names with the parameter names.

In [9]:
trains_itrs = nvt.Dataset(
    BASE_DIR + "training.tsv",
    header=None,
    names=features,
    engine="csv",
    sep="\x01",
    part_size="1GB",
)


Next we define our preprocessing workflow.

1. Extract weekday from the timestamp column.
2. Count the number of tokens in the columns hashtags, domains, links.
3. Fill in na/missing values in target columns + hashtags, domains, links.
4. Change data type to uint32 to save memory/storage.
5. Apply splitmedia function.
6. Encode columns in continous integer to save memory. Some categorical columns contain long hashes of type String as values to preserve the privacy of the users (e.g. userId, language, etc.). Long hashes of type String requires signficant amount of memory to store. We encode/map the Strings to continous Integer to save significant memory.
7. Change data type to uint32 to save memory/storage.

In [10]:
proc.add_feature(
    [
        # Extract weekday from the timestamp column
        nvt.ops.LambdaOp(
            op_name="wd",
            f=lambda col, gdf: cudf.to_datetime(col, unit="s").dt.weekday,
            columns=["timestamp"],
            replace=False,
        ),
        # Count the number of tokens in the columns hashtags, domains, links
        nvt.ops.LambdaOp(
            op_name="count_t",
            f=lambda col, gdf: count_token(col, "\t"),
            columns=["hashtags", "domains", "links"],
            replace=False,
        ),
        # Fill in na/missing values in target columns + hashtags, domains, links
        nvt.ops.FillMissing(columns=label_name + ["hashtags", "domains", "links"]),
        # Change data type to uint32 to save memory/storage
        nvt.ops.LambdaOp(
            op_name="astypeint32",
            f=lambda col, gdf: col.astype(np.uint32),
            columns=label_name
            + [
                "timestamp",
                "a_follower_count",
                "a_following_count",
                "a_account_creation",
                "b_follower_count",
                "b_following_count",
                "b_account_creation",
            ],
            replace=True,
        ),
        # Apply splitmedia function
        nvt.ops.LambdaOp(
            op_name="splitmedia",
            f=lambda col, gdf: splitmedia2(col),
            columns=["media"],
            replace=True,
        ),
        # Encode columns in continous integer to save memory
        nvt.ops.Categorify(
            columns=[
                "media",
                "language",
                "tweet_type",
                "tweet_id",
                "a_user_id",
                "b_user_id",
                "hashtags",
                "domains",
                "links",
            ]
        ),
        # Change data type to uint32 to save memory/storage
        nvt.ops.LambdaOp(
            op_name="astypeint32_2",
            f=lambda col, gdf: col.astype(np.uint32),
            replace=True,
            columns=[
                "media",
                "language",
                "tweet_type",
                "tweet_id",
                "a_user_id",
                "b_user_id",
                "hashtags",
                "domains",
                "links",
            ],
        ),
    ]
)


We apply the workflow and save the output to `preprocess/`,

In [11]:
%%time

time_preproc_start = time.time()
proc.apply(trains_itrs, record_stats=True, output_path=BASE_DIR + 'preprocess/')




CPU times: user 4min 8s, sys: 1min 36s, total: 5min 45s
Wall time: 6min 55s


_metadata	 part.2.parquet   part.31.parquet  part.43.parquet
part.0.parquet	 part.20.parquet  part.32.parquet  part.44.parquet
part.1.parquet	 part.21.parquet  part.33.parquet  part.45.parquet
part.10.parquet  part.22.parquet  part.34.parquet  part.46.parquet
part.11.parquet  part.23.parquet  part.35.parquet  part.47.parquet
part.12.parquet  part.24.parquet  part.36.parquet  part.48.parquet
part.13.parquet  part.25.parquet  part.37.parquet  part.49.parquet
part.14.parquet  part.26.parquet  part.38.parquet  part.5.parquet
part.15.parquet  part.27.parquet  part.39.parquet  part.50.parquet
part.16.parquet  part.28.parquet  part.4.parquet   part.6.parquet
part.17.parquet  part.29.parquet  part.40.parquet  part.7.parquet
part.18.parquet  part.3.parquet   part.41.parquet  part.8.parquet
part.19.parquet  part.30.parquet  part.42.parquet  part.9.parquet


# 2. Split

We split the training data by time into a train and validation set. The first 5 days are train and the last 2 days are test. We use the weekday for it.

In [13]:
%%time

time_split_start = time.time()

df = dask_cudf.read_parquet(BASE_DIR + 'preprocess/*.parquet')
VALID_DOW = [1, 2]
valid = df[df['timestamp_wd'].isin(VALID_DOW)].reset_index(drop=True)
train = df[~df['timestamp_wd'].isin(VALID_DOW)].reset_index(drop=True)
train = train.sort_values(["b_user_id", "timestamp"]).reset_index(drop=True)
valid = valid.sort_values(["b_user_id", "timestamp"]).reset_index(drop=True)
train.to_parquet(BASE_DIR + 'nv_train/')
valid.to_parquet(BASE_DIR + 'nv_valid/')
time_split = time.time()-time_split_start

del train; del valid


CPU times: user 7.54 s, sys: 493 ms, total: 8.03 s
Wall time: 34.7 s


119

# 3. Feature Engineering 

Now, we can apply the actual feature engineering. First, we define the data schema again.

In [14]:
CATEGORICAL_COLUMNS = [
    "hashtags",
    "tweet_id",
    "media",
    "links",
    "domains",
    "tweet_type",
    "language",
    "a_user_id",
    "a_is_verified",
    "b_user_id",
    "b_is_verified",
    "b_follows_a",
    "dt_dow",
]

CONTINUOUS_COLUMNS = [
    "timestamp",
    "a_follower_count",
    "a_following_count",
    "b_follower_count",
    "b_following_count",
    "hashtags_count_t",
    "domains_count_t",
    "links_count_t",
]
LABEL_COLUMNS = ["reply", "retweet", "retweet_comment", "like"]


We initalize the NVTabular workflow.

In [15]:
proc = nvt.Workflow(
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    label_name=LABEL_COLUMNS,
)


Finally, we define our preprocessing workflow.

1. Transform the target columns into boolean (0/1) targets.
2. Apply TargetEncoding with kfold of 5 and smoothing of 20. TargetEncoding is explained in [here](https://medium.com/rapids-ai/target-encoding-with-rapids-cuml-do-more-with-your-categorical-data-8c762c79e784) and [here](https://github.com/rapidsai/deeplearning/blob/main/RecSys2020Tutorial/03_3_TargetEncoding.ipynb).
3. CountEncode the columns media, tweet_type, language, a_user_id, b_user_id. CountEncoding is explained [here](https://github.com/rapidsai/deeplearning/blob/main/RecSys2020Tutorial/03_4_CountEncoding.ipynb).
4. Calc ratio of a_following_count/a_follower_count.
5. Calc ratio of b_following_count/b_follower_count.
6. Transform timestamp to datetime type.
7. Extract hour.
8. Extract minute.
9. Extract seconds.
10. Convert columns to float.
11. Difference encode b_follower_count, b_following_count, language grouped by b_user_id. DifferenceEncoding is explained [here](https://github.com/rapidsai/deeplearning/blob/main/RecSys2020Tutorial/05_2_TimeSeries_Differences.ipynb).
12. Fill missing values.

In [16]:
proc.add_feature(
    [
        # Transform the target columns into boolean (0/1) targets
        nvt.ops.LambdaOp(
            op_name="change",
            f=lambda col, gdf: (col > 0).astype("int8"),
            columns=LABEL_COLUMNS,
            replace=True,
        ),
        # Apply TargetEncoding with kfold of 5 and smoothing of 20
        nvt.ops.TargetEncoding(
            cat_groups=[
                "media",
                "tweet_type",
                "language",
                "a_user_id",
                "b_user_id",
                [
                    "domains",
                    "language",
                    "b_follows_a",
                    "tweet_type",
                    "media",
                    "a_is_verified",
                ],
            ],
            cont_target=LABEL_COLUMNS,
            kfold=5,
            p_smooth=20,
        ),
        # CountEncode the columns media, tweet_type, language, a_user_id, b_user_id
        nvt.ops.JoinGroupby(
            columns=["media", "tweet_type", "language", "a_user_id", "b_user_id"]
        ),
        # Calc ratio of a_following_count/a_follower_count
        nvt.ops.LambdaOp(
            op_name="a_ff_rate",
            f=lambda col, gdf: gdf["a_following_count"] / gdf["a_follower_count"],
            columns=["a_following_count"],
            replace=False,
        ),
        # Calc ratio of b_following_count/b_follower_count
        nvt.ops.LambdaOp(
            op_name="b_ff_rate",
            f=lambda col, gdf: gdf["b_following_count"] / gdf["b_follower_count"],
            columns=["b_following_count"],
            replace=False,
        ),
        # Transform timestamp to datetime type
        nvt.ops.LambdaOp(
            op_name="to_datetime",
            f=lambda col, gdf: cudf.to_datetime(col.astype("int32"), unit="s"),
            columns=["timestamp"],
            replace=False,
        ),
        # Extract hour
        nvt.ops.LambdaOp(
            op_name="to_hour",
            f=lambda col, gdf: col.dt.hour,
            columns=["timestamp_to_datetime"],
            replace=False,
        ),
        # Extract minute
        nvt.ops.LambdaOp(
            op_name="to_minute",
            f=lambda col, gdf: col.dt.minute,
            columns=["timestamp_to_datetime"],
            replace=False,
        ),
        # Extract seconds
        nvt.ops.LambdaOp(
            op_name="to_second",
            f=lambda col, gdf: col.dt.second,
            columns=["timestamp_to_datetime"],
            replace=False,
        ),
        # Convert columns to float
        nvt.ops.LambdaOp(
            op_name="asfloat",
            f=lambda col, gdf: col.astype("float32"),
            columns=["b_follower_count", "b_following_count", "language"],
            replace=True,
        ),
        # Difference encode b_follower_count, b_following_count, language grouped by b_user_id
        nvt.ops.DifferenceLag(
            "b_user_id",
            columns=["b_follower_count", "b_following_count", "language"],
            shift=1,
        ),
        # Fill missing values
        nvt.ops.FillMissing(fill_val=0),
    ]
)


We initialize the train and valid as NVTabular datasets.

In [17]:
train_dataset = nvt.Dataset(
    glob.glob(BASE_DIR + "nv_train/*.parquet"), engine="parquet", part_size="4GB"
)
valid_dataset = nvt.Dataset(
    glob.glob(BASE_DIR + "nv_valid/*.parquet"), engine="parquet", part_size="4GB"
)


We apply the workflow to the datasets.

In [18]:
%%time

time_fe_start = time.time()
proc.apply(train_dataset, record_stats=True, output_path=BASE_DIR + 'nv_train_fe/')
proc.apply(valid_dataset, record_stats=False, output_path=BASE_DIR + 'nv_valid_fe/')


  "right", dtype_r, dtype_l, libcudf_join_type
  "right", dtype_r, dtype_l, libcudf_join_type


CPU times: user 3min 27s, sys: 2min 13s, total: 5min 41s
Wall time: 5min 44s


# 4. Train

After the preprocessing and feature engineering is done, we can train a model to predict our targets. We load our datasets with `dask_cudf`.

In [19]:
train = dask_cudf.read_parquet(BASE_DIR + "nv_train_fe/*.parquet")
valid = dask_cudf.read_parquet(BASE_DIR + "nv_valid_fe/*.parquet")


In [20]:
train.columns


Index(['timestamp', 'a_follower_count', 'a_following_count',
       'b_follower_count', 'b_following_count', 'hashtags_count_t',
       'domains_count_t', 'links_count_t', 'hashtags', 'tweet_id', 'media',
       'links', 'domains', 'tweet_type', 'language', 'a_user_id',
       'a_is_verified', 'b_user_id', 'b_is_verified', 'b_follows_a', 'reply',
       'retweet', 'retweet_comment', 'like', 'TE_media_reply',
       'TE_media_retweet', 'TE_media_retweet_comment', 'TE_media_like',
       'TE_tweet_type_reply', 'TE_tweet_type_retweet',
       'TE_tweet_type_retweet_comment', 'TE_tweet_type_like',
       'TE_language_reply', 'TE_language_retweet',
       'TE_language_retweet_comment', 'TE_language_like', 'TE_a_user_id_reply',
       'TE_a_user_id_retweet', 'TE_a_user_id_retweet_comment',
       'TE_a_user_id_like', 'TE_b_user_id_reply', 'TE_b_user_id_retweet',
       'TE_b_user_id_retweet_comment', 'TE_b_user_id_like',
       'TE_domains_language_b_follows_a_tweet_type_media_a_is_verified_

We use following input features:

In [21]:
features = [
    "media",
    "tweet_type",
    "language",
    "a_follower_count",
    "a_following_count",
    "a_is_verified",
    "b_follower_count",
    "b_following_count",
    "b_is_verified",
    "b_follows_a",
    "hashtags_count_t",
    "domains_count_t",
    "links_count_t",
    "TE_media_reply",
    "TE_media_retweet",
    "TE_media_retweet_comment",
    "TE_media_like",
    "TE_tweet_type_reply",
    "TE_tweet_type_retweet",
    "TE_tweet_type_retweet_comment",
    "TE_tweet_type_like",
    "TE_language_reply",
    "TE_language_retweet",
    "TE_language_retweet_comment",
    "TE_language_like",
    "TE_a_user_id_reply",
    "TE_a_user_id_retweet",
    "TE_a_user_id_retweet_comment",
    "TE_a_user_id_like",
    "TE_b_user_id_reply",
    "TE_b_user_id_retweet",
    "TE_b_user_id_retweet_comment",
    "TE_b_user_id_like",
    "TE_domains_language_b_follows_a_tweet_type_media_a_is_verified_reply",
    "TE_domains_language_b_follows_a_tweet_type_media_a_is_verified_retweet",
    "TE_domains_language_b_follows_a_tweet_type_media_a_is_verified_retweet_comment",
    "TE_domains_language_b_follows_a_tweet_type_media_a_is_verified_like",
    "media_count",
    "tweet_type_count",
    "language_count",
    "a_user_id_count",
    "b_user_id_count",
    "b_follower_count_DifferenceLag",
    "b_following_count_DifferenceLag",
    "language_DifferenceLag" "dt_dow",
    "timestamp_to_datetime_to_hour",
    "timestamp_to_datetime_to_minute",
    "timestamp_to_datetime_to_second",
]
features = [x for x in features if x in train.columns]
label_names = ["reply", "retweet", "retweet_comment", "like"]
other = ["tweet_id"]
print(features, len(features))


['media', 'tweet_type', 'language', 'a_follower_count', 'a_following_count', 'a_is_verified', 'b_follower_count', 'b_following_count', 'b_is_verified', 'b_follows_a', 'hashtags_count_t', 'domains_count_t', 'links_count_t', 'TE_media_reply', 'TE_media_retweet', 'TE_media_retweet_comment', 'TE_media_like', 'TE_tweet_type_reply', 'TE_tweet_type_retweet', 'TE_tweet_type_retweet_comment', 'TE_tweet_type_like', 'TE_language_reply', 'TE_language_retweet', 'TE_language_retweet_comment', 'TE_language_like', 'TE_a_user_id_reply', 'TE_a_user_id_retweet', 'TE_a_user_id_retweet_comment', 'TE_a_user_id_like', 'TE_b_user_id_reply', 'TE_b_user_id_retweet', 'TE_b_user_id_retweet_comment', 'TE_b_user_id_like', 'TE_domains_language_b_follows_a_tweet_type_media_a_is_verified_reply', 'TE_domains_language_b_follows_a_tweet_type_media_a_is_verified_retweet', 'TE_domains_language_b_follows_a_tweet_type_media_a_is_verified_retweet_comment', 'TE_domains_language_b_follows_a_tweet_type_media_a_is_verified_like',

We drop the columns, which are not required for training.

In [22]:
RMV = [x for x in train.columns if not (x in features + label_names + other)]

train = train.drop(RMV, axis=1)
RMV = [c for c in RMV if c in valid.columns]
valid = valid.drop(RMV, axis=1)
wait(train)
wait(valid)


DoneAndNotDoneFutures(done=set(), not_done=set())

Our experiments show that we require only 10% of the training dataset. Our feature engineering, such as TargetEncoding, uses the training datasets and leverage the information of the full dataset. The actual model training does not require all the data, anymore.

In [23]:
SAMPLE_RATIO = 0.1
SEED = 1

if SAMPLE_RATIO < 1.0:
    train["sample"] = train["tweet_id"].map_partitions(
        lambda cudf_df: cudf_df.hash_encode(stop=10)
    )
    print(len(train))

    train = train[train["sample"] < 10 * SAMPLE_RATIO]
    (train,) = dask.persist(train)
    train.head()
    print(len(train))


Y_train = train[label_names]
(Y_train,) = dask.persist(Y_train)
Y_train.head()

train = train.drop(["sample", "tweet_id"] + label_names, axis=1)
(train,) = dask.persist(train)
train.head()

print("Using %i features:" % (len(features)), train.shape[1])


71043477
7103396
Using 47 features: 47


Similar to the training dataset, our experiments show that 35% of our validation dataset is enough to get a good estimate of the performance metric.

In [24]:
SAMPLE_RATIO = 0.35  # VAL SET NOW SIZE OF TEST SET
SEED = 1
if SAMPLE_RATIO < 1.0:
    print(len(valid))
    valid["sample"] = valid["tweet_id"].map_partitions(
        lambda cudf_df: cudf_df.hash_encode(stop=10)
    )

    valid = valid[valid["sample"] < 10 * SAMPLE_RATIO]
    (valid,) = dask.persist(valid)
    valid.head()
    print(len(valid))

Y_valid = valid[label_names]
(Y_valid,) = dask.persist(Y_valid)
Y_valid.head()

valid = valid.drop(["sample", "tweet_id"] + label_names, axis=1)
(valid,) = dask.persist(valid)


27142542
10842998


We initialize our XGBoost parameter.

In [25]:
print("XGB Version", xgb.__version__)

xgb_parms = {
    "max_depth": 8,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.3,
    "eval_metric": "logloss",
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",
    "predictor": "gpu_predictor",
}


XGB Version 1.2.1


In [26]:
if train.columns.duplicated().sum() > 0:
    raise Exception(f"duplicated!: { train.columns[train.columns.duplicated()] }")
print("no dup :) ")
print(f"X_train.shape {train.shape}")
print(f"X_valid.shape {valid.shape}")


no dup :) 
X_train.shape (Delayed('int-41d6dc48-e4d9-489e-91fb-dff751f22a00'), 47)
X_valid.shape (Delayed('int-5f9520c3-de19-497f-ad92-65cfd0aff43a'), 47)


We optimize the datatype a last time.

In [27]:
for col in train.columns:
    if train[col].dtype == "bool":
        train[col] = train[col].astype("int8")
        valid[col] = valid[col].astype("int8")
train, valid = dask.persist(train, valid)


We train our XGBoost models. The challenge requires to predict 4 targets, does a user
1. like a tweet
2. reply a tweet
3. comment a tweet
4. comment and reply a tweet
<br>
We train 4x XGBoost models for 300 rounds on a GPU.

In [28]:
%%time
time_train_start = time.time()

NROUND = 300
VERBOSE_EVAL = 50    
oof = np.zeros((len(valid),len(label_names)))
preds = []
for i in range(4):

    name = label_names[i]
    print('#'*25);print('###',name);print('#'*25)
       
    start = time.time(); print('Creating DMatrix...')
    dtrain = xgb.dask.DaskDMatrix(client,data=train,label=Y_train.iloc[:, i])
    dvalid = xgb.dask.DaskDMatrix(client,data=valid,label=Y_valid.iloc[:, i])
    print('Took %.1f seconds'%(time.time()-start))
             
    start = time.time(); print('Training...')
    model = xgb.dask.train(client, xgb_parms, 
                           dtrain=dtrain,
                           #evals=[(dtrain,'train'),(dvalid,'valid')],
                           num_boost_round=NROUND,
                           #early_stopping_rounds=ESR,
                           verbose_eval=VERBOSE_EVAL) 
    print('Took %.1f seconds'%(time.time()-start))
        
    start = time.time(); print('Predicting...')
    #Y_valid[f'pred_{name}'] = xgb.dask.predict(client,model,valid)
    #oof[:, i] += xgb.dask.predict(client,model,dvalid).compute()
    preds.append(xgb.dask.predict(client,model,valid))
    print('Took %.1f seconds'%(time.time()-start))
        
    del model, dtrain, dvalid



#########################
### reply
#########################
Creating DMatrix...
Took 1.5 seconds
Training...


  self.sync(self._update_scheduler_info)


Took 12.9 seconds
Predicting...


  [<function _predict_async.<locals>.mapped_predict  ... titions>, True]
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s)


Took 0.5 seconds
#########################
### retweet
#########################
Creating DMatrix...
Took 0.3 seconds
Training...


  self.sync(self._update_scheduler_info)


Took 12.6 seconds
Predicting...
Took 0.5 seconds
#########################
### retweet_comment
#########################
Creating DMatrix...
Took 0.3 seconds
Training...


  self.sync(self._update_scheduler_info)


Took 11.6 seconds
Predicting...
Took 0.4 seconds
#########################
### like
#########################
Creating DMatrix...
Took 0.3 seconds
Training...


  self.sync(self._update_scheduler_info)


Took 13.0 seconds
Predicting...
Took 0.6 seconds
CPU times: user 5.44 s, sys: 864 ms, total: 6.31 s
Wall time: 56.2 s


In [29]:
yvalid = Y_valid[label_names].values.compute()
oof = cupy.array([i.values.compute() for i in preds]).T


In [30]:
from sklearn.metrics import auc


def precision_recall_curve(y_true, y_pred):
    y_true = y_true.astype("float32")
    ids = cupy.argsort(-y_pred)
    y_true = y_true[ids]
    y_pred = y_pred[ids]
    y_pred = cupy.flip(y_pred, axis=0)

    acc_one = cupy.cumsum(y_true)
    sum_one = cupy.sum(y_true)

    precision = cupy.flip(acc_one / cupy.cumsum(cupy.ones(len(y_true))), axis=0)
    precision[:-1] = precision[1:]
    precision[-1] = 1.0

    recall = cupy.flip(acc_one / sum_one, axis=0)
    recall[:-1] = recall[1:]
    recall[-1] = 0
    n = (recall == 1).sum()

    return precision[n - 1 :], recall[n - 1 :], y_pred[n:]


def compute_prauc(pred, gt):
    prec, recall, thresh = precision_recall_curve(gt, pred)
    recall, prec = cupy.asnumpy(recall), cupy.asnumpy(prec)
    prauc = auc(recall, prec)
    return prauc


def log_loss(y_true, y_pred, eps=1e-7, normalize=True, sample_weight=None):
    y_true = y_true.astype("int32")
    y_pred = cupy.clip(y_pred, eps, 1 - eps)
    if y_pred.ndim == 1:
        y_pred = cupy.expand_dims(y_pred, axis=1)
    if y_pred.shape[1] == 1:
        y_pred = cupy.hstack([1 - y_pred, y_pred])

    y_pred /= cupy.sum(y_pred, axis=1, keepdims=True)
    loss = -cupy.log(y_pred)[cupy.arange(y_pred.shape[0]), y_true]
    return _weighted_sum(loss, sample_weight, normalize).item()


def _weighted_sum(sample_score, sample_weight, normalize):
    if normalize:
        return cupy.average(sample_score, weights=sample_weight)
    elif sample_weight is not None:
        return cupy.dot(sample_score, sample_weight)
    else:
        return sample_score.sum()


# FAST METRIC FROM GIBA
def compute_rce_fast(pred, gt):
    cross_entropy = log_loss(gt, pred)
    yt = np.mean(gt)
    strawman_cross_entropy = -(yt * np.log(yt) + (1 - yt) * np.log(1 - yt))
    return (1.0 - cross_entropy / strawman_cross_entropy) * 100.0


Finally, we calculate the performance metric PRAUC and RCE for each target.

In [31]:
txt = ""
for i in range(4):
    prauc = compute_prauc(oof[:, i], yvalid[:, i])  # .item()
    rce = compute_rce_fast(oof[:, i], yvalid[:, i]).item()
    txt_ = f"{label_names[i]:20} PRAUC:{prauc:.5f} RCE:{rce:.5f}"
    print(txt_)
    txt += txt_ + "\n"


reply                PRAUC:0.13947 RCE:16.98449
retweet              PRAUC:0.52080 RCE:28.51452
retweet_comment      PRAUC:0.05037 RCE:10.66172
like                 PRAUC:0.76762 RCE:24.91603


In [32]:
time_total = time.time() - time_total_start


In [39]:
print("Total time: {:.2f}s".format(time_total))
print()
print("1. Preprocessing:       {:.2f}s".format(time_preproc))
print("2. Splitting:           {:.2f}s".format(time_split))
print("3. Feature engineering: {:.2f}s".format(time_fe))
print("4. Training:            {:.2f}s".format(time_train))


Total time: 884.19s

1. Preprocessing:       415.91s
2. Splitting:           34.61s
3. Feature engineering: 344.50s
4. Training:            56.21s
