In [1]:
##%matplotlib widget
## with %matplotlib notebook: seems to require ipympl as part of environment, either
## part of the conda environment or "pip install ipympl"
## otherwise, does not show ANY plots in notebook, plt.savefig() works
%matplotlib notebook  
##%matplotlib inline    ## --plt.savefig()  works, but re-sizing does NOT

This notebook is a short demo to illustrate execution. For odd historical reasons, it uses "toy Monte Carlo" (simulated data) for "training" and "full LHCb MC" for validation.

The network architecture is a "simple" model that uses 1 input channel (the KDE [kernel density estimator], computed from the track parameters) feeding 5 convolutional layers followed by a fully connected layer.

In today's version, the network will start with weights from a previously trained version.
 

Check the current GPU usage. Please try to be nice!

In [2]:
!nvidia-smi

Sun Apr  4 09:24:44 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  TITAN V             Off  | 00000000:03:00.0 Off |                  N/A |
| 28%   31C    P8    23W / 250W |   9749MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:83:00.0 Off |                    0 |
| N/A   32C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN V             Off  | 00000000:84:00.0 Off |                  N/A |
| 28%   

> **WARNING**: The card numbers here are *not* the same as in CUDA. You have been warned.

## Imports

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import time
import torch
import pandas as pd
import mlflow

# Python 3 standard library
from pathlib import Path

from torchsummary import summary

'''
HELPER FUNCTIONS
'''
# From model/collectdata.py
from model.collectdata_mdsA import collect_data
# For poca KDE
from model.collectdata_poca_KDE import collect_data_poca

# From model/loss.py
##from loss import Loss
from model.alt_loss_A import Loss

# From model/training.py
from model.training import trainNet, select_gpu, Results

# From model/models.py
##  will start with model from TwoFeatures_CNN6Layer_A in the first instance
##  see relevant cell below

from model.models_mjp_30Mar21 import ACN_2i4_10L_4S_BN as ModelA

# From model/utilities.py
from model.utilities import load_full_state, count_parameters, Params

from model.plots import dual_train_plots, replace_in_ax

## adds image of model architecture
import hiddenlayer as HL


    pip install -U awkward1

In Python:

    >>> import awkward1 as ak
    >>> new_style_array = ak.from_awkward0(old_style_array)
    >>> old_style_array = ak.to_awkward0(new_style_array)



Set up the Torch device configuration. All tensors and model parameters need to know which device they live on.
`select_gpu` takes a bus ID number, which is the same as the GPU index in the `nvidia-smi` listing at the top of this notebook.

In [4]:
device = select_gpu(0)

1 available GPUs (initially using device 0):
  0 TITAN V


  and should_run_async(code)
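How `select_gpu` maps a bus ID to a CUDA device is not shown in this notebook. One common way to make CUDA's numbering agree with the `nvidia-smi` listing is to set two environment variables before CUDA is initialised. The sketch below assumes that approach; `pin_gpu_by_bus_order` is a hypothetical helper, not part of `model.training`.

```python
import os


def pin_gpu_by_bus_order(index, env=None):
    """Expose only the GPU at `index`, counted in PCI bus order.

    Hypothetical helper (an assumption, not the notebook's select_gpu).
    It only has an effect if it runs before the first CUDA call.
    """
    env = os.environ if env is None else env
    # Enumerate GPUs in PCI bus order, the order nvidia-smi lists them in.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Hide every other GPU; the chosen one then appears as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return "cuda:0"


scratch = {}  # a scratch dict, so this demo does not touch the real environment
device_name = pin_gpu_by_bus_order(2, env=scratch)
print(device_name, scratch)
```

With PyTorch, the returned name would typically be passed to `torch.device(...)`. Hiding all but one GPU in this way would be consistent with the "1 available GPUs" message printed above, though that is an inference rather than something the notebook states.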


### Set up local parameters

In [5]:
# params order - batch size, epochs, lr, epoch_start (which is usually set to 0)
args = Params(128, 500, 1e-5, 200)
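
The definition of `Params` lives in `model/utilities.py` and is not reproduced here. As a rough sketch of what such a container could look like, the dataclass below is a hypothetical stand-in that follows the field order given in the comment and the attribute names used elsewhere in this notebook (`args.batch_size`, `args.epochs`, `args.lr`, `args.epoch_start`).

```python
from dataclasses import dataclass


@dataclass
class Params:
    """Hypothetical stand-in for model.utilities.Params (an assumption)."""
    batch_size: int
    epochs: int
    lr: float
    epoch_start: int = 0


args = Params(128, 500, 1e-5, 200)

# vars() works on dataclass instances, so parameters can be logged in a loop.
for key, value in vars(args).items():
    print(key, value)

# Epoch numbering continues from epoch_start rather than restarting at 0.
first_epoch = args.epoch_start
last_epoch = args.epoch_start + args.epochs - 1
print(first_epoch, last_epoch)
```

With these values the run covers epochs 200 through 699, which is why the training cell compares against `args.epochs + args.epoch_start - 1` to detect the last epoch.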

## Loading data

Load the dataset, split into parts, then move to device (see `collectdata.py` in the `../model` directory)

In [6]:
## newer vernacular
## Training dataset. You can put as many files here as desired.
##  set the option load_XandXsq = True to use both KDE and KDE^2 as input features

## This is used when training with the original KDE
train_loader = collect_data('/share/lazy/sokoloff/ML-data_A/Aug14_80K_train.h5',
                             '/share/lazy/sokoloff/ML-data_AA/Oct03_80K_train.h5',
#                             '/share/lazy/sokoloff/ML-data_AA/Oct03_40K_train.h5',
                             '/share/lazy/will/ML_mdsA/June30_2020_80k_1.h5',
#                             '/share/lazy/will/ML_mdsA/June30_2020_80k_3.h5',
#                             '/share/lazy/will/ML_mdsA/June30_2020_80k_4.h5',
#                             '/share/lazy/will/ML_mdsA/June30_2020_80k_5.h5',
#                             '/share/lazy/will/ML_mdsA/June30_2020_80k_6.h5',
#                             '/share/lazy/will/ML_mdsA/June30_2020_80k_7.h5',
#                             '/share/lazy/will/ML_mdsA/June30_2020_80k_8.h5',
#                             '/share/lazy/will/ML_mdsA/June30_2020_80k_9.h5',
                            #'/share/lazy/sokoloff/ML-data_AA/Oct03_80K2_train.h5',
                             batch_size=args.batch_size,
## With a larger dataset (240K events from the files above) and 11 GB of GPU memory,
## note that the dataset will overflow the GPU memory; device=device allows the data to move
## back and forth between CPU and GPU memory. While this allows use of a larger dataset, it
## slows performance by about 10%, so comment it out when not needed.
                            device=device,
                            masking=True, shuffle=True,
                            load_XandXsq=False,
                            load_xy=False)

# Validation dataset. You can slice to reduce the size.
## dataAA -> /share/lazy/sokoloff/ML-data_AA/
val_loader = collect_data('/share/lazy/sokoloff/ML-data_AA/Oct03_20K_val.h5',
## mds val_loader = collect_data('dataAA/HLT1CPU_1kevts_val.h5',

                          batch_size=args.batch_size,
                          slice=slice(256 * 39),
                          device=device,
                          masking=True, shuffle=False,
                          load_XandXsq=False,
                          load_xy=False)

'''
## This is used when training with the new KDE
train_loader = collect_data_poca('/share/lazy/will/data/June30_2020_80k_1.h5',
                            '/share/lazy/will/data/June30_2020_80k_3.h5',
                            batch_size=args.batch_size,
                            device=device,
                            masking=True, shuffle=True,
                           ## slice = slice(0,18000)
                           )

val_loader = collect_data_poca('/share/lazy/sokoloff/ML-data_AA/20K_POCA_kernel_evts_200926.h5',
                            batch_size=args.batch_size,
                            device=device,
                            masking=True, shuffle=True,
                            ##slice = slice(18000,None)
                           )
'''

Loading data...
Loaded /share/lazy/sokoloff/ML-data_A/Aug14_80K_train.h5 in 14.7 s
Loaded /share/lazy/sokoloff/ML-data_AA/Oct03_80K_train.h5 in 12.76 s
Loaded /share/lazy/will/ML_mdsA/June30_2020_80k_1.h5 in 12.74 s
Constructing 240000 event dataset took 6.135 s
Loading data...
Loaded /share/lazy/sokoloff/ML-data_AA/Oct03_20K_val.h5 in 3.015 s
Constructing 9984 event dataset took 0.1226 s
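
The event counts printed above follow directly from the file list and the `slice` argument. A quick check of that arithmetic, assuming each "80K" training file holds exactly 80,000 events and the 20K validation file holds at least 9,984:

```python
import math

events_per_file = 80_000   # assumption: taken from the "80K" in each file name
n_train_files = 3          # the three uncommented training paths
batch_size = 128           # args.batch_size

n_train = events_per_file * n_train_files
train_batches = math.ceil(n_train / batch_size)

# slice(256 * 39) keeps the first 9984 entries of whatever it is applied to.
n_val = len(range(20_000)[slice(256 * 39)])
val_batches = math.ceil(n_val / batch_size)

print(n_train, train_batches)  # 240000 1875
print(n_val, val_batches)      # 9984 78
```

Both totals divide evenly by the batch size, so no partial batch is left over in either loader.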


## Preparing the model

Prepare a model, use multiple GPUs if they are VISIBLE, and move the model to the device.

In [7]:
# Set model to use (defined above)
model = ModelA()

# Prints out layout of each model (keep commented out)
##summary(model, input_size=(4, 4000))
##print(model.parameters)

# Sets save directory for mlflow
mlflow.tracking.set_tracking_uri('file:/share/lazy/pv-finder_model_repo')
mlflow.set_experiment('Four Feature AllCNN')

Traceback (most recent call last):
  File "/home/michael24peters/.local/lib/python3.7/site-packages/mlflow/store/tracking/file_store.py", line 197, in list_experiments
    experiment = self._get_experiment(exp_id, view_type)
  File "/home/michael24peters/.local/lib/python3.7/site-packages/mlflow/store/tracking/file_store.py", line 260, in _get_experiment
    meta = read_yaml(experiment_dir, FileStore.META_DATA_FILE_NAME)
  File "/home/michael24peters/.local/lib/python3.7/site-packages/mlflow/utils/file_utils.py", line 167, in read_yaml
    raise MissingConfigException("Yaml file '%s' does not exist." % file_path)
mlflow.exceptions.MissingConfigException: Yaml file '/share/lazy/pv-finder_model_repo/ML/meta.yaml' does not exist.


In [8]:
print("Let's use", torch.cuda.device_count(), "GPUs!")
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)

Let's use 1 GPUs!


Define the loss function and optimizer, then load the weights of a previously trained model:

In [9]:
loss = Loss(epsilon=1e-5,coefficient=2.5)
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)

##  use the first five layers from a pre-existing model
##  see example at https://discuss.pytorch.org/t/how-to-load-part-of-pre-trained-model/1113
##   ML -> /share/lazy/sokoloff/ML

# When loading pretrained models, use this code; otherwise, comment it out
# For other pretrained models, go to MLFlow and find the path for "run_stats.pyt"
pretrained_dict = '/share/lazy/pv-finder_model_repo/12/c1c6ae0fb0b24eb0a85cd76eea7f5ef6/artifacts/run_stats.pyt'
load_full_state(model, optimizer, pretrained_dict)

we also froze 0 weights
Of the 35.0 parameter layers to update in the current model, 35.0 were loaded
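
The internals of `load_full_state` are not shown here. The forum thread linked in the cell describes the usual pattern for partial loading: keep only the pretrained entries that also exist in the current model, then update the current state with them. A minimal sketch of that filtering step follows. Plain dictionaries of lists stand in for PyTorch state dicts so that it runs without a checkpoint file, and both `merge_matching` and the size check are assumptions, not the notebook's code.

```python
def merge_matching(current, pretrained):
    """Copy entries from `pretrained` whose name and size match `current`.

    Returns the merged dictionary and the number of entries copied.
    """
    merged = dict(current)
    loaded = 0
    for name, values in pretrained.items():
        if name in current and len(values) == len(current[name]):
            merged[name] = list(values)
            loaded += 1
    return merged, loaded


# Toy "state dicts": the last entry has a different size in the old model.
current = {"conv1.weight": [0.0, 0.0], "conv1.bias": [0.0], "fc.weight": [0.0] * 3}
pretrained = {"conv1.weight": [0.5, -0.5], "conv1.bias": [0.1], "fc.weight": [1.0] * 4}

merged, loaded = merge_matching(current, pretrained)
print(f"Of the {len(current)} entries in the current model, {loaded} were loaded")
```

In PyTorch the merged dictionary would then be handed to `model.load_state_dict(...)`. When every entry matches, as in the "35.0 were loaded" message above, the filter copies everything.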


Let's move the model's weight matrices to the GPU:

In [10]:
model.to(device)

ACN_2i4_10L_4S_BN(
  (conv1): Conv(
    (0): Conv1d(1, 20, kernel_size=(25,), stride=(1,), padding=(12,))
    (1): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout(p=0.15, inplace=False)
    (3): LeakyReLU(negative_slope=0.01)
  )
  (conv2): Conv(
    (0): Conv1d(20, 10, kernel_size=(15,), stride=(1,), padding=(7,))
    (1): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout(p=0.15, inplace=False)
    (3): LeakyReLU(negative_slope=0.01)
  )
  (conv3): Conv(
    (0): Conv1d(30, 10, kernel_size=(15,), stride=(1,), padding=(7,))
    (1): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout(p=0.15, inplace=False)
    (3): LeakyReLU(negative_slope=0.01)
  )
  (conv4): Conv(
    (0): Conv1d(10, 10, kernel_size=(15,), stride=(1,), padding=(7,))
    (1): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout(p=0.15, 

## Train 



The body of this loop runs once per epoch. `Results` is a named tuple of per-epoch values (the training and validation loss for the epoch, and the time each took). Start by setting up a plot:

In [11]:
ax, tax, lax, lines = dual_train_plots()
fig = ax.figure
plt.tight_layout()
# This gets built up during the run - do not rerun this cell
results = pd.DataFrame([], columns=Results._fields)

<IPython.core.display.Javascript object>
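
The "once per epoch" pattern relies on the training function being a generator that yields a named tuple after each epoch, so the caller can plot and log while training continues. The sketch below shows only that control flow. `EpochResult`, `train_epochs`, and the placeholder loss values are hypothetical; the real `trainNet` and `Results` are defined in `model/training.py`.

```python
from collections import namedtuple
import time

# Hypothetical fields, chosen for illustration only.
EpochResult = namedtuple("EpochResult", ["epoch", "cost", "val", "time"])


def train_epochs(n_epochs, epoch_start=0):
    """Yield one EpochResult per epoch, numbered from epoch_start."""
    for epoch in range(epoch_start, n_epochs):
        t0 = time.time()
        # Placeholder numbers; a real loop would train and validate here.
        cost = 1.0 / (epoch + 1)
        val = 1.1 / (epoch + 1)
        yield EpochResult(epoch, cost, val, time.time() - t0)


history = []
for result in train_epochs(203, epoch_start=200):
    # The caller decides what to do after each epoch (plot, log, save).
    history.append(result._asdict())

print([row["epoch"] for row in history])  # [200, 201, 202]
```

Passing the end epoch as `epochs + epoch_start`, as the training cell does, makes a resumed run continue its epoch numbering instead of restarting at zero.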

In [12]:
avgEff = 0.0
avgFP = 0.0

print('for model: ', model)   
run_name = 'ACN_2i4_10L_4S_BN (P3)'
# Create an mlflow run
with mlflow.start_run(run_name=run_name) as run:
    # Log parameters of the model
    for key, value in vars(args).items():
        print(key, value)
        mlflow.log_param(key, value)
    
    # Log parameter count in the model
    mlflow.log_param('Parameters', count_parameters(model))
    
    # Begin run
    for result in trainNet(model, optimizer, loss,
                            train_loader, val_loader,
                            args.epochs+args.epoch_start, epoch_start=args.epoch_start,
                            notebook=True, device=device):

        result = result._asdict()
        results = results.append(pd.Series(result), ignore_index=True)
        xs = results.index

        # Update the plot above
        lines['train'].set_data(results.index, results.cost)
        lines['val'].set_data(results.index, results.val)

        #filter first cost epoch (can be really large)
        max_cost = max(max(results.cost if len(results.cost)<2 else results.cost[1:]), max(results.val))
        min_cost = min(min(results.cost), min(results.val))
    
        # The plot limits need updating too
        ax.set_ylim(min_cost*.9, max_cost*1.1)  
        ax.set_xlim(-.5, len(results.cost) - .5)
    
        replace_in_ax(lax, lines['eff'], xs, results['eff_val'].apply(lambda x: x.eff_rate))
        replace_in_ax(tax, lines['fp'], xs, results['eff_val'].apply(lambda x: x.fp_rate))            
            
        # Redraw the figure
        fig.canvas.draw()
            
        ## MLFLOW ##
        # Log metrics
        mlflow.log_metric('Efficiency', result['eff_val'].eff_rate, result['epoch'])
        mlflow.log_metric('False Positive Rate',  result['eff_val'].fp_rate, result['epoch'])
        mlflow.log_metric('Validation Loss',  result['val'], result['epoch'])
        mlflow.log_metric('Training Loss',  result['cost'], result['epoch'])
        
        # If we are in the last 10 epochs (including the last epoch)
        if(result['epoch'] >= args.epochs + args.epoch_start - 10):
            avgEff += result['eff_val'].eff_rate
            avgFP += result['eff_val'].fp_rate
           
        # If we are on the last epoch
        if(result['epoch'] == args.epochs + args.epoch_start - 1):
            print('Averaging...\n')
            avgEff /= 10
            avgFP /= 10
            mlflow.log_metric('10 Efficiency Average', avgEff)
            mlflow.log_metric('10 False Positive Average', avgFP)
            print('Average Eff: ', avgEff)
            print('Average FP Rate: ', avgFP)
            
        
        # Log tags
        mlflow.set_tag('Skip connections', '4')
        mlflow.set_tag('Asymmetry', '2.5')
        mlflow.set_tag('KDE_A', 'False')
        mlflow.set_tag('BN Input', 'False')

        # Save model state dictionary, optimizer state dictionary, and epoch number
        torch.save({
            'model':model.state_dict(),
            'optimizer':optimizer.state_dict(),
            'epoch':args.epochs+result['epoch']
            }, 'run_stats.pyt')
        # Save the run stats into mlflow
        mlflow.log_artifact('run_stats.pyt')
    
    # Generate tight plot at end of training
    dual_train_plots(results.index,
                 results.cost, results.val, 
                 results['eff_val'].apply(lambda x: x.eff_rate),
                 results['eff_val'].apply(lambda x: x.fp_rate))
    plt.tight_layout()
    # Save plot
    plt.savefig('plot.png')  
    mlflow.log_artifact('plot.png')

for model:  ACN_2i4_10L_4S_BN(
  (conv1): Conv(
    (0): Conv1d(1, 20, kernel_size=(25,), stride=(1,), padding=(12,))
    (1): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout(p=0.15, inplace=False)
    (3): LeakyReLU(negative_slope=0.01)
  )
  (conv2): Conv(
    (0): Conv1d(20, 10, kernel_size=(15,), stride=(1,), padding=(7,))
    (1): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout(p=0.15, inplace=False)
    (3): LeakyReLU(negative_slope=0.01)
  )
  (conv3): Conv(
    (0): Conv1d(30, 10, kernel_size=(15,), stride=(1,), padding=(7,))
    (1): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout(p=0.15, inplace=False)
    (3): LeakyReLU(negative_slope=0.01)
  )
  (conv4): Conv(
    (0): Conv1d(10, 10, kernel_size=(15,), stride=(1,), padding=(7,))
    (1): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Drop

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  file=sys.stderr,


Epoch 200: train=1316.05, val=1314.7, took 81.229 s
  Validation Found 320 of 54504, added 9471 (eff 0.59%) (0.949 FP/event)


Epoch 201: train=1316.03, val=1314.67, took 78.758 s
  Validation Found 320 of 54504, added 9472 (eff 0.59%) (0.949 FP/event)


Epoch 202: train=1316.01, val=1314.65, took 80.04 s
  Validation Found 322 of 54504, added 9466 (eff 0.59%) (0.948 FP/event)


Epoch 203: train=1316.0, val=1314.63, took 79.581 s
  Validation Found 326 of 54504, added 9470 (eff 0.60%) (0.948 FP/event)


Epoch 204: train=1315.98, val=1314.61, took 80.859 s
  Validation Found 327 of 54504, added 9460 (eff 0.60%) (0.947 FP/event)


Epoch 205: train=1315.96, val=1314.61, took 78.652 s
  Validation Found 324 of 54504, added 9463 (eff 0.59%) (0.948 FP/event)


Epoch 206: train=1315.95, val=1314.59, took 78.541 s
  Validation Found 326 of 54504, added 9467 (eff 0.60%) (0.948 FP/event)


Epoch 207: train=1315.93, val=1314.57, took 78.488 s
  Validation Found 323 of 54504, added 9470 (eff 0.59%) (0.948 FP/event)


Epoch 208: train=1315.91, val=1314.54, took 78.416 s
  Validation Found 327 of 54504, added 9463 (eff 0.60%) (0.948 FP/event)


Epoch 209: train=1315.89, val=1314.52, took 81.154 s
  Validation Found 327 of 54504, added 9460 (eff 0.60%) (0.947 FP/event)


Epoch 210: train=1315.88, val=1314.37, took 80.57 s
  Validation Found 331 of 54504, added 9467 (eff 0.61%) (0.948 FP/event)


Epoch 211: train=1315.86, val=1314.5, took 80.864 s
  Validation Found 328 of 54504, added 9466 (eff 0.60%) (0.948 FP/event)


Epoch 212: train=1315.84, val=1314.48, took 80.639 s
  Validation Found 328 of 54504, added 9470 (eff 0.60%) (0.948 FP/event)


Epoch 213: train=1315.82, val=1314.46, took 80.079 s
  Validation Found 328 of 54504, added 9466 (eff 0.60%) (0.948 FP/event)


Epoch 214: train=1315.81, val=1314.45, took 78.907 s
  Validation Found 328 of 54504, added 9464 (eff 0.60%) (0.948 FP/event)


Epoch 215: train=1315.79, val=1314.43, took 81.3 s
  Validation Found 329 of 54504, added 9462 (eff 0.60%) (0.948 FP/event)


Epoch 216: train=1315.77, val=1314.41, took 80.345 s
  Validation Found 330 of 54504, added 9467 (eff 0.61%) (0.948 FP/event)


Epoch 217: train=1315.75, val=1313.77, took 80.942 s
  Validation Found 322 of 54504, added 9474 (eff 0.59%) (0.949 FP/event)


Epoch 218: train=1315.74, val=1314.33, took 81.426 s
  Validation Found 333 of 54504, added 9463 (eff 0.61%) (0.948 FP/event)


Epoch 219: train=1315.72, val=1313.67, took 81.262 s
  Validation Found 325 of 54504, added 9478 (eff 0.60%) (0.949 FP/event)


Epoch 220: train=1315.7, val=1314.28, took 81.203 s
  Validation Found 332 of 54504, added 9460 (eff 0.61%) (0.947 FP/event)


Epoch 221: train=1315.69, val=1314.24, took 79.226 s
  Validation Found 329 of 54504, added 9466 (eff 0.60%) (0.948 FP/event)


Epoch 222: train=1315.67, val=1313.7, took 78.369 s
  Validation Found 323 of 54504, added 9474 (eff 0.59%) (0.949 FP/event)


Epoch 223: train=1315.66, val=1313.39, took 78.358 s
  Validation Found 320 of 54504, added 9487 (eff 0.59%) (0.95 FP/event)


Epoch 224: train=1315.64, val=1313.99, took 78.466 s
  Validation Found 323 of 54504, added 9476 (eff 0.59%) (0.949 FP/event)


Epoch 225: train=1315.63, val=1313.75, took 79.024 s
  Validation Found 322 of 54504, added 9477 (eff 0.59%) (0.949 FP/event)





KeyboardInterrupt: 
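
The run above stopped with a `KeyboardInterrupt` well before the final epoch, so the averaging branch of the training cell never ran. That branch sums the metric over the last 10 epochs and divides by 10, which equals the mean only when at least 10 epochs run. A minimal alternative sketch divides by the number of epochs actually collected; `tail_average` is a hypothetical helper and the efficiencies are made-up illustration values.

```python
def tail_average(values, window=10):
    """Average the last `window` entries, dividing by how many were used."""
    tail = values[-window:]
    if not tail:
        raise ValueError("no epochs were recorded")
    return sum(tail) / len(tail)


# Made-up efficiencies for 12 epochs; only the last 10 enter the average.
eff_by_epoch = [0.50, 0.52] + [0.60] * 10
long_avg = tail_average(eff_by_epoch)
print(long_avg)

# With only 4 epochs, a fixed "/ 10" would understate the mean.
short_run = [0.60, 0.60, 0.60, 0.60]
short_avg = tail_average(short_run)
fixed_divisor = sum(short_run) / 10
print(short_avg, fixed_divisor)
```

Keeping the per-epoch values in a list (or in the `results` DataFrame already built during training) also means the average can still be computed after an interrupted run.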

In [None]:
##quit()