# MetaSpore Getting Started

MetaSpore is a machine learning platform that provides a one-stop solution for data preprocessing, model training, and online prediction.

This article briefly introduces the basic MetaSpore API.

## Prepare Data

We use the publicly available dataset [Terabyte Click Logs](https://labs.criteo.com/2013/12/download-terabyte-click-logs-2/) published by CriteoLabs as our demo dataset.

We sample the dataset at a rate of 0.001 so that the demo finishes quickly. More information about the demo dataset can be found in [MetaSpore Demo Dataset](https://ks3-cn-beijing.ksyuncs.com/dmetasoul-bucket/demo/criteo/index.html).

Execute the following cell to download the demo dataset into the working directory. The data files take up about 2.1 GiB of disk space, and the download may take several minutes. If the download fails, please refer to [MetaSpore Demo Dataset](https://ks3-cn-beijing.ksyuncs.com/dmetasoul-bucket/demo/criteo/index.html) and download the dataset manually.
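The article does not say how the sample was drawn. As a minimal sketch, assuming simple independent (Bernoulli) sampling of rows, a rate of 0.001 keeps roughly one row in a thousand. The helper ``sample_lines`` and the synthetic rows are hypothetical and are not part of MetaSpore.

```python
import random


def sample_lines(lines, rate, seed=0):
    """Keep each line independently with probability `rate`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < rate]


# Hypothetical stand-in for tab-separated click-log rows.
rows = [f"{i % 2}\t{i}\tfeature_{i}" for i in range(100_000)]
sample = sample_lines(rows, rate=0.001)
print(len(sample))  # on the order of 100 rows; the exact count depends on the seed
```

A fixed seed makes the subset reproducible, which helps when the same sample has to be regenerated later.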

In [1]:
# import metaspore
# metaspore.demo.download_dataset()

You can check the downloaded dataset by executing the following cell.

In [2]:
!ls -l ${PWD}/data/

total 8
drwxrwxr-x 3 ec2-user ec2-user 4096 Jul  8 09:25 test
drwxrwxr-x 3 ec2-user ec2-user 4096 Jul  8 14:47 train


(Optional) To upload the dataset to your own S3 bucket:

1. Fill ``{YOUR_S3_BUCKET}`` and ``{YOUR_S3_PATH}`` with your preferred values in the following cell.
2. Uncomment the cell by removing the leading ``#`` character.
3. Execute the cell.
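The steps above combine a bucket name and a path into one destination URI. The sketch below shows that composition for a recursive ``aws s3 cp`` call. The helper ``s3_upload_command`` and the values ``my-bucket`` and ``datasets/criteo/demo/data`` are placeholders chosen for illustration; the command is only built, not executed.

```python
def s3_upload_command(local_dir, bucket, path):
    """Build the argument list for a recursive `aws s3 cp` upload."""
    # Strip stray slashes so the pieces join into a single clean URI.
    destination = f"s3://{bucket.strip('/')}/{path.strip('/')}/"
    return ["aws", "s3", "cp", "--recursive", local_dir, destination]


# Placeholder values; substitute your own bucket and path.
cmd = s3_upload_command("./data/", "my-bucket", "datasets/criteo/demo/data")
print(" ".join(cmd))
```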

In [3]:
# YOUR_S3_BUCKET='s3://sagemaker-us-west-2-452145973879/datasets/CriteoLabs/'
# YOUR_S3_PATH='datasets/CriteoLabs'

In [4]:
# !aws s3 cp --recursive ${PWD}/data/ s3://sagemaker-us-west-2-452145973879/datasets/CriteoLabs/demo/data/

Alternatively, you can open a terminal by selecting the ``File`` -> ``New`` -> ``Terminal`` menu item and execute Bash commands in it.

You can check the uploaded dataset in your S3 bucket by uncommenting and executing the following cell.

In [5]:
# !aws s3 ls s3://sagemaker-us-west-2-452145973879/datasets/CriteoLabs/demo/data/
# !aws s3 ls s3://mv-mtg-di-for-poc-datalab/2024/06/14/00/

The ``schema`` directory contains configuration files and must also be uploaded to S3 so that the model can be trained in a cluster environment.

In [6]:
#!aws s3 cp --recursive ${PWD}/schema/ s3://sagemaker-us-west-2-452145973879/datasets/CriteoLabs/demo/schema/

In the rest of the article, we assume the demo dataset and schemas have been uploaded to `ROOT_DIR`.

In [7]:
# ROOT_DIR = 's3://sagemaker-us-west-2-452145973879/datasets/CriteoLabs/demo'
# ROOT_DIR = '.'
ROOT_DIR = 's3://mv-mtg-di-for-poc-datalab'

## Define the Model

We can define our neural network model by subclassing ``torch.nn.Module``, as with ordinary PyTorch models. The following ``DemoModule`` class provides an example.

Compared with ordinary PyTorch models, the notable difference is the ``_sparse`` layer, created by instantiating ``ms.EmbeddingSumConcat`` with an embedding size and the paths of two text files. ``ms.EmbeddingSumConcat`` makes it possible to define large-scale sparse models in PyTorch, which is a distinguishing feature of MetaSpore.

The ``_schema_dir`` field is an S3 directory, which makes it possible to use the ``DemoModule`` class in a cluster environment.
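To give a rough intuition for what a layer of this kind computes, here is a plain-Python sketch. It assumes, based on the layer's name and on the dense input being sized as ``feature_count * embedding_size``, that the embeddings of each feature group are summed and the per-group sums are concatenated. The function ``embedding_sum_concat`` and the toy tables are hypothetical and do not reflect MetaSpore's actual implementation.

```python
def embedding_sum_concat(sample, tables, embedding_size):
    """Sum the embeddings within each feature group, then concatenate the sums.

    sample: list of feature groups, each a list of ids.
    tables: one dict per group mapping id -> embedding vector.
    """
    output = []
    for ids, table in zip(sample, tables):
        pooled = [0.0] * embedding_size
        for key in ids:
            pooled = [p + v for p, v in zip(pooled, table[key])]
        output.extend(pooled)  # concatenate this group's pooled vector
    return output


# Toy embedding tables for two feature groups, embedding size 2.
tables = [
    {"a": [1.0, 2.0], "b": [10.0, 20.0]},
    {"x": [0.5, 0.5]},
]
sample = [["a", "b"], ["x"]]
vector = embedding_sum_concat(sample, tables, embedding_size=2)
print(vector)  # [11.0, 22.0, 0.5, 0.5]
```

The output length is the number of feature groups times the embedding size, which matches how the dense layers in the model are sized.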

In [8]:
# import torch
# import metaspore as ms

# class DemoModule(torch.nn.Module):
#     def __init__(self):
#         super().__init__()
#         self._embedding_size = 16
#         self._schema_dir = ROOT_DIR + '/schema/'
#         self._column_name_path = self._schema_dir + 'column_name_demo.txt'
#         self._combine_schema_path = self._schema_dir + 'combine_schema_demo.txt'
#         self._sparse = ms.EmbeddingSumConcat(self._embedding_size, self._column_name_path, self._combine_schema_path)
#         self._sparse.updater = ms.FTRLTensorUpdater()
#         self._sparse.initializer = ms.NormalTensorInitializer(var=0.01)
#         self._dense = torch.nn.Sequential(
#             ms.nn.Normalization(self._sparse.feature_count * self._embedding_size),
#             torch.nn.Linear(self._sparse.feature_count * self._embedding_size, 1024),
#             torch.nn.ReLU(),
#             torch.nn.Linear(1024, 512),
#             torch.nn.ReLU(),
#             torch.nn.Linear(512, 1),
#         )

#     def forward(self, x):
#         x = self._sparse(x)
#         x = self._dense(x)
#         return torch.sigmoid(x)

In [9]:
import metaspore as ms
import torch
import torch.nn as nn


def nansum(x):
    return torch.where(torch.isnan(x), torch.zeros_like(x), x).sum()


def log_loss(yhat, y):
    return nansum(-(y * (yhat + 1e-12).log() + (1 - y) *
                    (1 - yhat + 1e-12).log()))

# Custom main model entry point
class DNNModelMain(nn.Module):
    def __init__(self, feature_config_file):
        super().__init__()
        self._embedding_size = 16
        self._schema_dir = ROOT_DIR + '/schema/'
        self._column_name_path = self._schema_dir + 'column_name_mobivista.txt'
        self._combine_schema_path = self._schema_dir + 'combine_schema_mobivista.txt'
        self.feature_config_file = feature_config_file  # TODO not used
        self._sparse = ms.EmbeddingSumConcat(
            self._embedding_size,
            combine_schema_source=self._column_name_path,
            combine_schema_file_path=self._combine_schema_path,
            # enable_feature_gen=True,
            # feature_config_file=feature_config_file,
            # enable_fgs=False
        )
        self._sparse.updater = ms.FTRLTensorUpdater(alpha=0.01)
        self._sparse.initializer = ms.NormalTensorInitializer(var=0.001)
        extra_attributes = {
            "enable_fresh_random_keep": True,
            "fresh_dist_range_from": 0, 
            "fresh_dist_range_to": 1000,
            "fresh_dist_range_mean": 950,
            "enable_feature_gen": True,
            "use_hash_code": False
        }
        self._sparse.extra_attributes = extra_attributes
        feature_count = self._sparse.feature_count
        feature_dim = self._sparse.feature_count * self._embedding_size

        self._gateEmbedding = GateEmbedding(feature_dim, feature_count, self._embedding_size)
        self._h1 = nn.Linear(feature_dim, 1024)
        self._h2 = FourChannelHidden(1024, 512)
        self._h3 = FourChannelHidden(512, 256)
        self._h4 = nn.Linear(256, 1)

        self._bn = ms.nn.Normalization(feature_dim, momentum=0.01, eps=1e-5, affine=True)
        self._zero = torch.zeros(1, 1)
        self.act0 = nn.Sigmoid()

    def forward(self, x):
        emb = self._sparse(x)
        bno = self._bn(emb)
        
        # print(f"self._sparse._data.type: {type(self._sparse._data)}, self._sparse._data.shape: {self._sparse._data.shape}") 
        # print(f"x.type: {type(x)}, x.shape: {x.shape}, x: {x}")
        # print(f"emb.type: {type(emb)}, emb.shape: {emb.shape}, ") # emb: {emb}
        # print(f"bno.type: {type(bno)}, bno.shape: {bno.shape}, ")  # bno: {bno}
        
        d = self._gateEmbedding(bno)
        o = self._h1(d)
        r, s1, s2, s3 = self._h2(o, self._zero, self._zero, self._zero)
        r, s1, s2, s3 = self._h3(r, s1, s2, s3)
        return self.act0(self._h4(r))


class FourChannelHidden(nn.Module):
    def __init__(self, in_size, out_size):
        super().__init__()
        self.wc2 = nn.Linear(int(in_size / 4), int(in_size / 4))
        self.wc3 = nn.Linear(int(in_size), int(in_size - in_size / 4 * 3))
        self.w = nn.Linear(int(in_size + int(in_size / 4) * 2) + int(in_size - int(in_size / 4) * 3) + 3, out_size)
        self.act1 = nn.Tanh()
        self.act = nn.ReLU()
        self.fl = int(in_size / 4)

    def forward(self, input, i1, i2, i3):
        f0 = input[:, :self.fl]
        f1 = input[:, self.fl:self.fl * 2]
        f2 = input[:, self.fl * 2:self.fl * 3]
        f3 = input[:, self.fl * 3:]

        c1 = self.act1(f0 * f1) * f1
        c2 = self.act1(self.wc2(f2) * f2)
        c3 = self.act1(f3 * self.wc3(input))

        s1 = torch.sum(c1, 1, True) + i1
        s2 = torch.sum(c2, 1, True) + i2
        s3 = torch.sum(c3, 1, True) + i3

        return self.act(self.w(torch.cat((input, c1, c2, c3, s1, s2, s3), 1))), s1, s2, s3


class GateEmbedding(nn.Module):
    def __init__(self, in_size, out_size, emb_size):
        super().__init__()
        self.layer1 = torch.nn.Linear(in_size, out_size)
        self.out_size = out_size
        self.emb_size = emb_size
        self.act2 = nn.Sigmoid()

    def forward(self, input):
        gate = self.act2(self.layer1(input))
        gate_reshape = torch.reshape(gate, (-1, self.out_size, 1))
        input_reshape = torch.reshape(input, (-1, self.out_size, self.emb_size))
        return (gate_reshape * input_reshape).reshape(-1, self.out_size * self.emb_size)
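
The input width of the final linear layer in ``FourChannelHidden`` is easier to follow when written out. The sketch below recomputes it from the tensors that ``forward`` concatenates; the helper name ``four_channel_concat_width`` is hypothetical and only mirrors the arithmetic in the class above.

```python
def four_channel_concat_width(in_size):
    """Width of torch.cat((input, c1, c2, c3, s1, s2, s3), 1) in FourChannelHidden."""
    quarter = in_size // 4        # width of f0, f1, f2, and therefore of c1 and c2
    last = in_size - 3 * quarter  # width of f3, and therefore of c3
    return in_size + quarter + quarter + last + 3  # the 3 covers s1, s2, s3


print(four_channel_concat_width(1024))  # 1795, input width of self.w in _h2
print(four_channel_concat_width(512))   # 899, input width of self.w in _h3
```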



Instantiate the model class to define our PyTorch model. The cell below uses ``DNNModelMain``; the ``DemoModule`` line is kept as a commented-out alternative.

In [10]:
# module = DemoModule()
module = DNNModelMain('schema/combine_schema_mobivista.txt')

[WARN] 2024-07-09 08:27:49.458 STSAssumeRoleWithWebIdentityCredentialsProvider [140266886629184] Token file must be specified to use STS AssumeRole web identity creds provider.
[2024-07-09 08:27:49.459] [info] [s3_sdk_filesys.cpp:357] Try to open S3 stream: s3://mv-mtg-di-for-poc-datalab/schema/combine_schema_mobivista.txt, read_only true
loaded combine schema from combine schema file 's3://mv-mtg-di-for-poc-datalab/schema/combine_schema_mobivista.txt'
[2024-07-09 08:27:49.632] [info] [s3_sdk_filesys.cpp:380] Opened read-only stream for object: s3://mv-mtg-di-for-poc-datalab/schema/combine_schema_mobivista.txt with total length 2260
[2024-07-09 08:27:49.636] [info] [s3_sdk_filesys.cpp:419] Read S3 object s3://mv-mtg-di-for-poc-datalab/schema/combine_schema_mobivista.txt with size 2260 at position 0 larger than total size: 2260, change size to 2260
[2024-07-09 08:27:49.734] [info] [s3_sdk_filesys.cpp:413] Read S3 object s3://mv-mtg-di-for-poc-datalab/schema/combine_schem

## Train the Model

To train our model, we first create a ``ms.PyTorchEstimator``, passing in several arguments, including our PyTorch model ``module`` and the number of workers and servers.

``model_out_path`` specifies where to store the trained model.

``input_label_column_index`` specifies the column index of the label column in the dataset, which is ``0`` for the demo dataset.

In [11]:
model_out_path = ROOT_DIR + '/output/dev/model_out/'
estimator = ms.PyTorchEstimator(module=module,
                                worker_count=4,
                                server_count=4,
                                model_out_path=model_out_path,
                                experiment_name='0.1',
                                input_label_column_index=0)

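Next, we create a Spark session by calling ``ms.spark.get_session()`` and load the training dataset. The demo CSV files can be loaded by calling ``ms.input.read_s3_csv()``; the cell below instead reads ORC files with ``spark_session.read.orc()``.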

``delimiter`` specifies the column delimiter of the dataset, which is the TAB character ``'\t'`` for the demo dataset.

We also need to pass column names because the csv files do not contain headers.
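As a small sketch of how column names can be read from a schema file, the helper below takes the second space-separated field of each line. The sample content, including the name ``label``, is hypothetical because the real file is not shown in this article, and ``read_column_names`` is an illustrative helper rather than a MetaSpore function.

```python
import io


def read_column_names(lines):
    """Return the second whitespace-separated field of every well-formed line."""
    names = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            names.append(fields[1])
    return names


# Hypothetical file content in an "index name" layout.
schema_text = "0 label\n1 _11001\n2 _11002\n"
print(read_column_names(io.StringIO(schema_text)))  # ['label', '_11001', '_11002']
```

Using ``split()`` without an argument tolerates repeated spaces and trailing newlines, at the cost of silently skipping lines that have fewer than two fields.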

In [12]:
column_names = []
with open('./schema/column_name_mobivista.txt', 'r') as f:
    for line in f:
        column_names.append(line.split(' ')[1].strip())
print(column_names)

['_11001', '_11002', '_11003', '_11004', '_11007', '_11008', '_11021', '_11022', '_11023', '_11024', '_11041', '_11042', '_11043', '_11044', '_11045', '_11046', '_11061', '_11062', '_11063', '_11064', '_11065', '_11066', '_11081', '_11082', '_11083', '_11084', '_11085', '_11086', '_11601', '_11602', '_11603', '_12001', '_12002', '_12003', '_12004', '_12005', '_12006', '_20001', '_20002', '_20003', '_20101', '_20102', '_20201', '_20202', '_20203', '_20204', '_20205', '_20206', '_20207', '_20208', '_20209', '_20210', '_30001', '_30002', '_30003', '_30004', '_30005', '_30006', '_30201', '_30202', '_30203', '_30204', '_30205', '_30206', '_30207', '_40001', '_40002', '_40003', '_40004', '_40005', '_40201', '_40202', '_40203', '_40204', '_40205', '_40206', '_40207', '_40208', '_40209', '_40210', '_40211', '_40212', '_40213', '_40214', '_40215', '_40231', '_40301', '_40302', '_40303', '_40304', '_40305', '_40306', '_40307', '_40321', '_40322', '_40323', '_40324', '_50801', '_50802', '_50805',

In [13]:
# train_dataset_path = ROOT_DIR + '/data/train/day_0_0.001_train.csv'

file_base_path = 's3://mv-mtg-di-for-poc-datalab/2024/06/14/00/'
file_names = [f'part-{str(i).zfill(5)}-1e73cc51-9b17-4439-9d71-7d505df2cae3-c000.snappy.orc' for i in range(11)]
train_dataset_path = [file_base_path + file_name for file_name in file_names]

# train_dataset_path = 's3://mv-mtg-di-for-poc-datalab/2024/06/14/00/part-00000-1e73cc51-9b17-4439-9d71-7d505df2cae3-c000.snappy.orc'
# train_dataset_path = 's3://mv-mtg-di-for-poc-datalab/2024/06/14/00'

spark_confs = {
    'spark.eventLog.enabled':'true',
    'spark.executor.memory': '20g',
    'spark.driver.memory': '10g',
}

spark_session = ms.spark.get_session(local=True,
                                     batch_size=100,
                                     worker_count=estimator.worker_count,
                                     server_count=estimator.server_count,
                                     log_level='INFO',
                                     spark_confs=spark_confs)

# train_dataset = ms.input.read_s3_csv(spark_session, train_dataset_path, delimiter='\t', column_names=column_names)

train_dataset = spark_session.read.orc(train_dataset_path)
# train_dataset.printSchema()

24/07/09 08:27:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/09 08:27:53 INFO deprecation: fs.s3a.server-side-encryption-key is deprecated. Instead, use fs.s3a.server-side-encryption.key
24/07/09 08:27:54 INFO InMemoryFileIndex: It took 880 ms to list leaf files for 11 paths.


In [14]:
train_dataset.count()

24/07/09 08:27:56 INFO FileSourceStrategy: Pushed Filters: 
24/07/09 08:27:56 INFO FileSourceStrategy: Post-Scan Filters: 
24/07/09 08:27:56 INFO FileSourceStrategy: Output Data Schema: struct<>
24/07/09 08:27:56 INFO CodeGenerator: Code generated in 113.692871 ms
24/07/09 08:27:56 INFO CodeGenerator: Code generated in 13.114585 ms
24/07/09 08:27:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 260.8 KiB, free 5.8 GiB)
24/07/09 08:27:57 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.4 KiB, free 5.8 GiB)
24/07/09 08:27:57 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-16-14-249.us-west-2.compute.internal:42701 (size: 27.4 KiB, free: 5.8 GiB)
24/07/09 08:27:57 INFO SparkContext: Created broadcast 0 from count at NativeMethodAccessorImpl.java:0
24/07/09 08:27:57 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.


133715

In [15]:
train_dataset = train_dataset.limit(200000)

In [16]:
train_dataset.count()

24/07/09 08:28:01 INFO FileSourceStrategy: Pushed Filters: 
24/07/09 08:28:01 INFO FileSourceStrategy: Post-Scan Filters: 
24/07/09 08:28:01 INFO FileSourceStrategy: Output Data Schema: struct<>
24/07/09 08:28:01 INFO CodeGenerator: Code generated in 11.527332 ms
24/07/09 08:28:01 INFO CodeGenerator: Code generated in 8.607257 ms
24/07/09 08:28:01 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 260.8 KiB, free 5.8 GiB)
24/07/09 08:28:01 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 27.4 KiB, free 5.8 GiB)
24/07/09 08:28:01 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-16-14-249.us-west-2.compute.internal:42701 (size: 27.4 KiB, free: 5.8 GiB)
24/07/09 08:28:01 INFO SparkContext: Created broadcast 3 from count at NativeMethodAccessorImpl.java:0
24/07/09 08:28:01 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.

133715

In [17]:
# train_dataset = train_dataset.limit(200000)

Finally, we call the ``fit()`` method of ``ms.PyTorchEstimator`` to train our model. This takes several minutes, and you can follow the progress in the output of the cell. The trained model is saved to ``model_out_path`` and returned in the ``model`` variable.
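The progress lines in the training output have the form ``timestamp -- key: value, key: value, ...``. The sketch below parses two lines copied from this article's output into dictionaries. In these two lines the cumulative ``loss`` equals the mean of the two ``Δloss`` values, which suggests that the unprefixed metrics are running values and the ``Δ``-prefixed ones cover the latest interval; this reading is inferred from the log, not taken from MetaSpore documentation. ``parse_progress`` is a hypothetical helper.

```python
def parse_progress(line):
    """Split 'timestamp -- key: value, key: value, ...' into a dict of floats."""
    timestamp, _, body = line.partition(" -- ")
    metrics = {"timestamp": timestamp}
    for item in body.split(", "):
        key, _, value = item.partition(": ")
        metrics[key] = float(value)  # float() also accepts 'nan'
    return metrics


# Two progress lines copied from the training output in this article.
lines = [
    "2024-07-09 08:28:39.760 -- auc: 0.5682206284153005, Δauc: 0.5682206284153005, "
    "pcoc: 21.160964488983154, Δpcoc: 21.160964488983154, "
    "loss: 0.007082240581512451, Δloss: 0.007082240581512451, #instance: 1000",
    "2024-07-09 08:28:41.913 -- auc: 0.5187041068989071, Δauc: 0.5065957991803278, "
    "pcoc: 21.054775555928547, Δpcoc: 20.948586622873943, "
    "loss: 0.007033280372619629, Δloss: 0.006984320163726807, #instance: 2000",
]
first, second = map(parse_progress, lines)

# Mean of the two per-interval losses versus the logged cumulative loss.
running_mean = (first["Δloss"] + second["Δloss"]) / 2
print(abs(running_mean - second["loss"]) < 1e-12)  # True
```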

In [18]:
model = estimator.fit(train_dataset)

[2024-07-09 08:28:05.696] [info] PS job with coordinator address 172.16.14.249:38755 started.
[2024-07-09 08:28:05.696] [info] PSRunner::RunPS: pid: 25611, tid: 27139, thread: 0x7f922bfff700
[2024-07-09 08:28:05.696] [info] PSRunner::RunPSCoordinator: pid: 25611, tid: 27139, thread: 0x7f922bfff700
[2024-07-09 08:28:05.697] [info] ActorProcess::Receiving: Coordinator pid: 25611, tid: 27144, thread: 0x7f92241fc700


24/07/09 08:28:05 INFO SparkContext: Starting job: collect at PythonRDD.scala:180
24/07/09 08:28:05 INFO SparkContext: Starting job: collect at /home/ec2-user/anaconda3/envs/metaspore/lib/python3.8/site-packages/metaspore/agent.py:291
24/07/09 08:28:05 INFO DAGScheduler: Got job 2 (collect at PythonRDD.scala:180) with 4 output partitions
24/07/09 08:28:05 INFO DAGScheduler: Final stage: ResultStage 4 (collect at PythonRDD.scala:180)
24/07/09 08:28:05 INFO DAGScheduler: Parents of final stage: List()
24/07/09 08:28:05 INFO DAGScheduler: Missing parents: List()
24/07/09 08:28:05 INFO DAGScheduler: Submitting ResultStage 4 (PythonRDD[16] at RDD at PythonRDD.scala:53), which has no missing parents
24/07/09 08:28:05 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 6.8 KiB, free 5.8 GiB)
24/07/09 08:28:05 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 4.3 KiB, free 5.8 GiB)
24/07/09 08:28:05 INFO BlockManagerInfo: Added broa

[2024-07-09 08:28:06.962] [info] C[0]:9: The coordinator has connected to 4 servers and 4 workers.
PS Coordinator node C[0]:9 is ready.


[2024-07-09 08:28:07.012] [info] S[1]:26 has connected to others.
[2024-07-09 08:28:07.012] [info] W[1]:28 has connected to others.
[2024-07-09 08:28:07.012] [info] W[0]:12 has connected to others.
[2024-07-09 08:28:07.012] [info] W[3]:60 has connected to others.
[2024-07-09 08:28:07.012] [info] S[0]:10 has connected to others.
[2024-07-09 08:28:07.012] [info] S[2]:42 has connected to others.
[2024-07-09 08:28:07.012] [info] S[3]:58 has connected to others.
[2024-07-09 08:28:07.012] [info] W[2]:44 has connected to others.
PS Worker node W[3]:60 is ready.
PS Worker node W[1]:28 is ready.
PS Worker node W[0]:12 is ready.
PS Server node S[1]:26 is ready.
PS Server node S[2]:42 is ready.
PS Server node S[0]:10 is ready.
PS Worker node W[2]:44 is ready.
PS Server node S[3]:58 is ready.
24/07/09 08:28:07 INFO PythonRunner: Times: total = 1242, boot = 277, init = 899, finish = 66

2024-07-09 08:28:39.760 -- auc: 0.5682206284153005, Δauc: 0.5682206284153005, pcoc: 21.160964488983154, Δpcoc: 21.160964488983154, loss: 0.007082240581512451, Δloss: 0.007082240581512451, #instance: 1000
2024-07-09 08:28:41.913 -- auc: 0.5187041068989071, Δauc: 0.5065957991803278, pcoc: 21.054775555928547, Δpcoc: 20.948586622873943, loss: 0.007033280372619629, Δloss: 0.006984320163726807, #instance: 2000
2024-07-09 08:28:44.099 -- auc: 0.520181862771027, Δauc: 0.564282853455565, pcoc: 21.222007536552322, Δpcoc: 21.571013409158457, loss: 0.0069747254053751625, Δloss: 0.00685761547088623, #instance: 3000
2024-07-09 08:28:46.338 -- auc: 0.516218910350688, Δauc: 0.5337323661608295, pcoc: 21.191953496730076, Δpcoc: 21.09917798249618, loss: 0.006894747495651245, Δloss: 0.006654813766479493, #instance: 4000
2024-07-09 08:28:48.541 -- auc: 0.5272748179656701, Δauc: 0.5057406513224958, pcoc: 21.741008927336836, Δpcoc: 24.457388426128187, loss: 0.0067712158203125, Δloss: 0.00627708911895752, #in

2024-07-09 08:29:38.305 -- auc: 0.5293626294528302, Δauc: 0.580328205128205, pcoc: 6.163493042779986, Δpcoc: 1.4147229290008545, loss: 0.002661504030227661, Δloss: 0.0014188719987869264, #instance: 28000
2024-07-09 08:29:40.494 -- auc: 0.5319619642521545, Δauc: 0.6315210057175326, pcoc: 5.968241206450438, Δpcoc: 1.6255710536035999, loss: 0.0026192869729009167, Δloss: 0.0014372093677520752, #instance: 29000
2024-07-09 08:29:42.644 -- auc: 0.5339718591330834, Δauc: 0.616985889800955, pcoc: 5.870465298418445, Δpcoc: 2.40199361349407, loss: 0.002568730954329173, Δloss: 0.0011026064157485963, #instance: 30000
2024-07-09 08:29:44.779 -- auc: 0.537462756605966, Δauc: 0.6818647540983607, pcoc: 5.725058767596465, Δpcoc: 1.5264451901117961, loss: 0.002529177411910026, Δloss: 0.0013425711393356323, #instance: 31000
2024-07-09 08:29:47.074 -- auc: 0.5388793738698944, Δauc: 0.613965970081078, pcoc: 5.5626272022503835, Δpcoc: 1.2491667447266754, loss: 0.0024936601035296917, Δloss: 0.0013926235437393

2024-07-09 08:30:38.984 -- auc: 0.601767021863062, Δauc: 0.7578630217519107, pcoc: 3.6795907221817616, Δpcoc: 1.1595884561538696, loss: 0.0019289275482296943, Δloss: 0.0011842916011810304, #instance: 56000
2024-07-09 08:30:41.148 -- auc: 0.6048132966174046, Δauc: 0.7666699956722927, pcoc: 3.6227366022473246, Δpcoc: 1.1560013986402942, loss: 0.0019171003343766196, Δloss: 0.001254776358604431, #instance: 57000
2024-07-09 08:30:43.365 -- auc: 0.6063286100246239, Δauc: 0.7022448979591837, pcoc: 3.591372358684767, Δpcoc: 1.4335124015808105, loss: 0.0019010831405376566, Δloss: 0.0009881030917167663, #instance: 58000
2024-07-09 08:30:45.628 -- auc: 0.6082942521304577, Δauc: 0.7385074095500868, pcoc: 3.5520155713004153, Δpcoc: 1.163229693537173, loss: 0.0018860470965757208, Δloss: 0.0010139565467834472, #instance: 59000
2024-07-09 08:30:47.897 -- auc: 0.6095491111244128, Δauc: 0.6895798475553077, pcoc: 3.5164383120426366, Δpcoc: 1.2217050899158826, loss: 0.0018719715972741445, Δloss: 0.0010415

2024-07-09 08:31:38.578 -- auc: 0.638328025122592, Δauc: 0.7209761491075659, pcoc: 2.8897510818189116, Δpcoc: 0.9800937175750732, loss: 0.001635926707681403, Δloss: 0.0011367332935333253, #instance: 83000
2024-07-09 08:31:40.726 -- auc: 0.6393108042361362, Δauc: 0.7375508130081301, pcoc: 2.879645829330195, Δpcoc: 1.6436471343040466, loss: 0.001625846783320109, Δloss: 0.0007892130613327027, #instance: 84000
2024-07-09 08:31:42.941 -- auc: 0.6408337936331073, Δauc: 0.7918646752658973, pcoc: 2.8648311015927628, Δpcoc: 1.2409723334842258, loss: 0.001616258741827572, Δloss: 0.0008108632564544678, #instance: 85000
2024-07-09 08:31:45.084 -- auc: 0.6423785403832009, Δauc: 0.7840816326530612, pcoc: 2.847752908819917, Δpcoc: 1.147618818283081, loss: 0.0016078091797440551, Δloss: 0.000889596402645111, #instance: 86000
2024-07-09 08:31:47.238 -- auc: 0.6429747266594853, Δauc: 0.7066491560873583, pcoc: 2.8287981566166374, Δpcoc: 1.0136549813406808, loss: 0.0016006148758975938, Δloss: 0.00098190474

2024-07-09 08:32:38.861 -- auc: 0.6664878917744271, Δauc: 0.7666110302212437, pcoc: 2.435406409958748, Δpcoc: 1.0088087443647713, loss: 0.0014814785504126335, Δloss: 0.0011723276376724243, #instance: 111000
2024-07-09 08:32:40.996 -- auc: 0.6672265261395396, Δauc: 0.7524923076923077, pcoc: 2.4231452323984395, Δpcoc: 1.140135612487793, loss: 0.001477778366633824, Δloss: 0.001067057967185974, #instance: 112000
2024-07-09 08:32:43.133 -- auc: 0.6681113542701298, Δauc: 0.7768615384615385, pcoc: 2.4105258795103675, Δpcoc: 1.0774174404144288, loss: 0.0014741071093398912, Δloss: 0.0010629262924194336, #instance: 113000
2024-07-09 08:32:45.311 -- auc: 0.6684236589711251, Δauc: 0.7119215881245783, pcoc: 2.3952539185231383, Δpcoc: 0.991286746386824, loss: 0.001472607013426329, Δloss: 0.0013030961751937867, #instance: 114000
2024-07-09 08:32:47.491 -- auc: 0.6692778427823116, Δauc: 0.8142419378374435, pcoc: 2.3887984123717394, Δpcoc: 1.56034178960891, loss: 0.0014678174801494763, Δloss: 0.0009218

24/07/09 08:33:27 INFO ArrowPythonRunner: Times: total = 290567, boot = -29214, init = 29368, finish = 290413
24/07/09 08:33:27 INFO DataWritingSparkTask: Writer for partition 0 is committing.
24/07/09 08:33:27 INFO DataWritingSparkTask: Committed partition 0 (task 103, attempt 0, stage 19.0)
24/07/09 08:33:27 INFO Executor: Finished task 0.0 in stage 19.0 (TID 103). 3126 bytes result sent to driver
24/07/09 08:33:27 INFO TaskSetManager: Finished task 0.0 in stage 19.0 (TID 103) in 290612 ms on ip-172-16-14-249.us-west-2.compute.internal (executor driver) (1/1)
24/07/09 08:33:27 INFO TaskSchedulerImpl: Removed TaskSet 19.0, whose tasks have all completed, from pool 
24/07/09 08:33:27 INFO DAGScheduler: ResultStage 19 (save at NativeMethodAccessorImpl.java:0) finished in 290.627 s
24/07/09 08:33:27 INFO DAGScheduler: Job 16 is finished. Cancelling potential speculative or zombie tasks for this job
24/07/09 08:33:27 INFO TaskSchedulerImpl: Killing all running tasks in stage 19: Stage fin

2024-07-09 08:33:28.394 -- auc: 0.6791733503943378, Δauc: 1.0, pcoc: 2.2107853569724103, Δpcoc: nan, loss: 0.0014082378691300413, Δloss: nan, #instance: 133000
2024-07-09 08:33:28.401 -- auc: 0.6791733503943378, Δauc: 1.0, pcoc: 2.2107853569724103, Δpcoc: nan, loss: 0.0014082378691300413, Δloss: nan, #instance: 133000
2024-07-09 08:33:28.407 -- auc: 0.6791733503943378, Δauc: 1.0, pcoc: 2.2107853569724103, Δpcoc: nan, loss: 0.0014082378691300413, Δloss: nan, #instance: 133000
2024-07-09 08:33:28.422 -- auc: 0.6794450507586803, Δauc: 0.7371942446043165, pcoc: 2.202266472963432, Δpcoc: 0.8652276277542115, loss: 0.0014089247200878906, Δloss: 0.00153668860455493, #instance: 133715


saving model to s3://mv-mtg-di-for-poc-datalab/output/dev/model_out/
saving model to s3://mv-mtg-di-for-poc-datalab/output/dev/model_out/
saving model to s3://mv-mtg-di-for-poc-datalab/output/dev/model_out/
saving model to s3://mv-mtg-di-for-poc-datalab/output/dev/model_out/
Get aws endpoint from env: 
[WARN] 2024-07-09 08:33:28.535 STSAssumeRoleWithWebIdentityCredentialsProvider [139660716283712] Token file must be specified to use STS AssumeRole web identity creds provider.
[2024-07-09 08:33:28.536] [info] [s3_sdk_filesys.cpp:357] Try to open S3 stream: s3://mv-mtg-di-for-poc-datalab/output/dev/model_out/_sparse__sparse_meta.json, read_only false
Get aws endpoint from env: 
[WARN] 2024-07-09 08:33:28.997 STSAssumeRoleWithWebIdentityCredentialsProvider [139659815823104] Token file must be specified to use STS AssumeRole web identity creds provider.
Get aws endpoint from env: 

[2024-07-09 08:34:37.053] [info] C[0]:9 has stopped.
[2024-07-09 08:34:37.059] [info] PS job with coordinator address 172.16.14.249:38755 stopped.


24/07/09 08:34:37 INFO PythonRunner: Times: total = 68675, boot = -221, init = 222, finish = 68674
24/07/09 08:34:37 INFO PythonRunner: Times: total = 68675, boot = -264, init = 307, finish = 68632
24/07/09 08:34:37 INFO PythonRunner: Times: total = 68675, boot = -265, init = 266, finish = 68674
24/07/09 08:34:37 INFO PythonRunner: Times: total = 68675, boot = -265, init = 266, finish = 68674
24/07/09 08:34:37 INFO Executor: Finished task 2.0 in stage 21.0 (TID 110). 1311 bytes result sent to driver
24/07/09 08:34:37 INFO Executor: Finished task 1.0 in stage 21.0 (TID 109). 1311 bytes result sent to driver
24/07/09 08:34:37 INFO Executor: Finished task 0.0 in stage 21.0 (TID 108). 1311 bytes result sent to driver
24/07/09 08:34:37 INFO Executor: Finished task 3.0 in stage 21.0 (TID 111). 1311 bytes result sent to driver
24/07/09 08:34:37 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 21 because the barrier taskSet requires 4 slots, while the total numbe

## Evaluate the Model

To evaluate our model, we load the test dataset. The demo CSV file can be loaded with ``ms.input.read_s3_csv()`` again, passing in the column delimiter ``'\t'``; the cell below instead reads an ORC file with ``spark_session.read.orc()``.

In [19]:
# test_dataset_path = ROOT_DIR + '/data/test/day_1_0.001_test.csv'
# test_dataset = ms.input.read_s3_csv(spark_session, test_dataset_path, delimiter='\t', column_names=column_names)

test_dataset_path = 's3://mv-mtg-di-for-poc-datalab/2024/06/15/00/part-00000-f79b9ee6-aaf5-4117-88d5-44eea69dcea3-c000.snappy.orc'
test_dataset = spark_session.read.orc(test_dataset_path)
# test_dataset.printSchema()

[2024-07-09 08:34:37.096] [info] PS job with coordinator address 172.16.14.249:38755 stopped.
ps agent deregistered for process 27178 thread 0x7f054b621740
[2024-07-09 08:34:37.097] [info] PS job with coordinator address 172.16.14.249:38755 stopped.
ps agent deregistered for process 27170 thread 0x7f054b621740
[2024-07-09 08:34:37.097] [info] PS job with coordinator address 172.16.14.249:38755 stopped.
ps agent deregistered for process 27164 thread 0x7f054b621740
[2024-07-09 08:34:37.097] [info] PS job with coordinator address 172.16.14.249:38755 stopped.
ps agent deregistered for process 27154 thread 0x7f054b621740
24/07/09 08:34:37 INFO InMemoryFileIndex: It took 72 ms to list leaf files for 1 paths.


Next, we call the ``model.transform()`` method to transform the test dataset, which will add a column named ``rawPrediction`` to the test dataset representing the predicted labels. For ease of integration with Spark MLlib, ``model.transform()`` will also add a column named ``label`` to the test dataset representing the actual labels.

Like the training process, this will take several minutes and you can see the progress by looking at the output of the cell. The transformed test dataset is stored in the ``result`` variable.
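Once the ``label`` and ``rawPrediction`` columns are available, metrics like those in the progress output can be computed from them. The plain-Python sketch below computes AUC by pairwise comparison, plus a ratio labeled ``pcoc``. Reading ``pcoc`` as the sum of predictions divided by the sum of labels is an assumption, since this article does not define it. The functions and the toy data are hypothetical; for a real Spark DataFrame an evaluator from Spark MLlib would be the natural choice.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def pcoc(labels, scores):
    """Assumed definition: total predicted probability over total observed positives."""
    return sum(scores) / sum(labels)


# Toy stand-ins for the 'label' and 'rawPrediction' columns.
labels = [1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.5, 0.1]
print(auc(labels, scores))   # 5 of the 6 positive/negative pairs are ordered correctly
print(pcoc(labels, scores))  # values above 1 mean the model over-predicts on average
```

The pairwise loop is quadratic in the number of rows, so it only suits small samples collected to the driver.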

In [None]:
result = model.transform(test_dataset)

[2024-07-09 08:34:38.389] [info] PS job with coordinator address 172.16.14.249:45359 started.
[2024-07-09 08:34:38.389] [info] PSRunner::RunPS: pid: 25611, tid: 32667, thread: 0x7f9201fef700
[2024-07-09 08:34:38.389] [info] PSRunner::RunPSCoordinator: pid: 25611, tid: 32667, thread: 0x7f9201fef700
[2024-07-09 08:34:38.390] [info] ActorProcess::Receiving: Coordinator pid: 25611, tid: 32670, thread: 0x7f92241fc700
[2024-07-09 08:34:38.413] [info] C[0]:9: The coordinator has connected to 4 servers and 4 workers.
PS Coordinator node C[0]:9 is ready.


24/07/09 08:34:38 INFO SparkContext: Starting job: collect at /home/ec2-user/anaconda3/envs/metaspore/lib/python3.8/site-packages/metaspore/agent.py:308
24/07/09 08:34:38 INFO DAGScheduler: Got job 19 (collect at /home/ec2-user/anaconda3/envs/metaspore/lib/python3.8/site-packages/metaspore/agent.py:308) with 4 output partitions
24/07/09 08:34:38 INFO DAGScheduler: Final stage: ResultStage 22 (collect at /home/ec2-user/anaconda3/envs/metaspore/lib/python3.8/site-packages/metaspore/agent.py:308)
24/07/09 08:34:38 INFO DAGScheduler: Parents of final stage: List()
24/07/09 08:34:38 INFO DAGScheduler: Missing parents: List()
24/07/09 08:34:38 INFO DAGScheduler: Submitting ResultStage 22 (PythonRDD[57] at collect at /home/ec2-user/anaconda3/envs/metaspore/lib/python3.8/site-packages/metaspore/agent.py:308), which has no missing parents
24/07/09 08:34:38 INFO SparkContext: Starting job: collect at /home/ec2-user/anaconda3/envs/metaspore/lib/python3.8/site-packages/metaspore/agent.py:291

2024-07-09 08:35:29.093 -- auc: 0.7935697261539958, Δauc: 0.7935697261539958, pcoc: 1.342880056017921, Δpcoc: 1.342880056017921, loss: 0.0009004550576210022, Δloss: 0.0009004550576210022, #instance: 1000
2024-07-09 08:35:30.228 -- auc: 0.7706256604678019, Δauc: 0.7506712999526142, pcoc: 1.2108774844636307, Δpcoc: 1.1042600228236272, loss: 0.0010036511719226837, Δloss: 0.0011068472862243652, #instance: 2000
2024-07-09 08:35:30.389 -- auc: 0.7721788785762943, Δauc: 0.7796660882053736, pcoc: 1.3667513988912106, Δpcoc: 1.797696927014519, loss: 0.0009359679619471232, Δloss: 0.0008006015419960022, #instance: 3000
2024-07-09 08:35:30.619 -- auc: 0.7728555860071498, Δauc: 0.7733968225713141, pcoc: 1.3270700060088059, Δpcoc: 1.2166522171186365, loss: 0.0009517827332019806, Δloss: 0.0009992270469665527, #instance: 4000
2024-07-09 08:35:31.396 -- auc: 0.7610224948875256, Δauc: 0.7120733389702283, pcoc: 1.3227039142088457, Δpcoc: 1.3061886974003003, loss: 0.0009715264320373535, Δloss: 0.0010505012

24/07/09 08:35:32 INFO ArrowPythonRunner: Times: total = 8166, boot = 9, init = 4096, finish = 4061
24/07/09 08:35:32 INFO MemoryStore: Block rdd_87_5 stored as values in memory (estimated size 92.4 MiB, free 5.7 GiB)
24/07/09 08:35:32 INFO BlockManagerInfo: Added rdd_87_5 in memory on ip-172-16-14-249.us-west-2.compute.internal:42701 (size: 92.4 MiB, free: 5.7 GiB)
24/07/09 08:35:32 INFO DataWritingSparkTask: Writer for partition 5 is committing.
24/07/09 08:35:32 INFO DataWritingSparkTask: Committed partition 5 (task 173, attempt 0, stage 36.0)
24/07/09 08:35:32 INFO Executor: Finished task 5.0 in stage 36.0 (TID 173). 2202 bytes result sent to driver
24/07/09 08:35:32 INFO TaskSetManager: Finished task 5.0 in stage 36.0 (TID 173) in 8326 ms on ip-172-16-14-249.us-west-2.compute.internal (executor driver) (6/8)

2024-07-09 08:35:32.530 -- auc: 0.7587315108523889, Δauc: 0.7849909481783208, pcoc: 1.313184674490582, Δpcoc: 1.5647252798080444, loss: 0.0009743590950965881, Δloss: 0.0008268364071846009, #instance: 8000
2024-07-09 08:35:32.781 -- auc: 0.7563653490700832, Δauc: 0.740676923076923, pcoc: 1.2907458502261793, Δpcoc: 1.1327765274047852, loss: 0.0009864906999799941, Δloss: 0.0010835435390472413, #instance: 9000
2024-07-09 08:35:33.594 -- auc: 0.7605478829212844, Δauc: 0.7967469262295082, pcoc: 1.2849988259209526, Δpcoc: 1.23686749736468, loss: 0.0009884464502334595, Δloss: 0.0010060482025146483, #instance: 10000
2024-07-09 08:35:33.845 -- auc: 0.7617235558718601, Δauc: 0.7710137457044673, pcoc: 1.246319194868499, Δpcoc: 0.9562219619750977, loss: 0.00100769089568745, Δloss: 0.001200135350227356, #instance: 11000
2024-07-09 08:35:34.673 -- auc: 0.7655332124414678, Δauc: 0.7985743680188124, pcoc: 1.229484454481846, Δpcoc: 1.0761680688176836, loss: 0.0010165041486422221, Δloss: 0.00111344993114

24/07/09 08:35:34 INFO ArrowPythonRunner: Times: total = 12325, boot = -254, init = 7088, finish = 5491
24/07/09 08:35:34 INFO MemoryStore: Block rdd_87_0 stored as values in memory (estimated size 131.0 MiB, free 5.6 GiB)
24/07/09 08:35:34 INFO BlockManagerInfo: Added rdd_87_0 in memory on ip-172-16-14-249.us-west-2.compute.internal:42701 (size: 131.0 MiB, free: 5.6 GiB)
24/07/09 08:35:34 INFO DataWritingSparkTask: Writer for partition 0 is committing.
24/07/09 08:35:34 INFO DataWritingSparkTask: Committed partition 0 (task 168, attempt 0, stage 36.0)
24/07/09 08:35:35 INFO Executor: Finished task 0.0 in stage 36.0 (TID 168). 2202 bytes result sent to driver
24/07/09 08:35:35 INFO TaskSetManager: Finished task 0.0 in stage 36.0 (TID 168) in 12472 ms on ip-172-16-14-249.us-west-2.compute.internal (executor driver) (7/8)
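
The progress lines in the output above are interleaved with Spark log lines, which makes them hard to follow by eye. The following is a minimal sketch, not part of the MetaSpore API, that picks out the metric lines from captured output and turns each into a dictionary. The function name ``parse_progress_line`` is hypothetical, and the field names (``auc``, ``Δauc``, ``pcoc``, ``loss``, ``#instance``) are copied verbatim from the output; the sketch assumes nothing about how they are defined.

```python
import re

# Matches "name: number" pairs such as "auc: 0.79" or "#instance: 1000".
# \w also matches the Greek letter in names like "Δauc".
METRIC_RE = re.compile(r"(#?\w+): ([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def parse_progress_line(line):
    """Return the metrics of one progress line as a dict, or None for any other line."""
    # Progress lines look like "<timestamp> -- <name>: <value>, ...".
    _, separator, metrics_part = line.partition(" -- ")
    if not separator:
        return None
    return {name: float(value) for name, value in METRIC_RE.findall(metrics_part)}


sample = ("2024-07-09 08:35:29.093 -- auc: 0.7935697261539958, "
          "Δauc: 0.7935697261539958, pcoc: 1.342880056017921, "
          "Δpcoc: 1.342880056017921, loss: 0.0009004550576210022, "
          "Δloss: 0.0009004550576210022, #instance: 1000")
print(parse_progress_line(sample))
```

Lines without the `` -- `` separator, such as the Spark ``INFO`` lines, yield ``None`` and can be filtered out.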


``result`` is an ordinary PySpark DataFrame, so it can be inspected with the usual DataFrame methods such as ``show()``.

In [None]:
result.show(5)

Finally, we use ``pyspark.ml.evaluation.BinaryClassificationEvaluator`` to compute test AUC.

In [None]:
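from pyspark.ml.evaluation import BinaryClassificationEvaluator

# The default metric is areaUnderROC, computed from the rawPrediction and label columns.
evaluator = BinaryClassificationEvaluator()
test_auc = evaluator.evaluate(result)
print('test_auc: %g' % test_auc)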
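
With its default settings, ``BinaryClassificationEvaluator`` reports the area under the ROC curve. To illustrate what that number measures, the sketch below computes AUC on a small hand-made example as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. The function ``pairwise_auc`` and the toy data are hypothetical. This is not how Spark computes the metric: Spark derives it from the ROC curve, so its value on a large dataset may differ slightly.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # a tie counts as half
    return correct / (len(positives) * len(negatives))


# Toy data: two positives and three negatives.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.5, 0.1]
print('toy_auc: %g' % pairwise_auc(labels, scores))
```

The quadratic loop keeps the definition visible; it is only suitable for small inputs.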

When all computations are done, call the ``stop()`` method of ``spark_session`` to make sure all its resources are released. The call is commented out in the following cell; uncomment it before executing.

In [None]:
# spark_session.stop()

## Summary

We illustrated how to train and evaluate a neural network model in MetaSpore. Ease of adoption is a design goal of MetaSpore, so users familiar with PyTorch and Spark MLlib should be able to get started easily.