# Analyze Data

**Requirements**: You may need to create a virtual environment (venv), install IPython, and set it up for use in Jupyter Notebook (for help, see https://www.geeksforgeeks.org/using-jupyter-notebook-in-virtual-environment/).

**Download** the NF timing parameter model from the Duke Box and put it in the folder "examples/NF_model".

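**Needed packages**: normalizing-flows, torch, numpy, pandas, uproot

You may also need to install tqdm and matplotlib. The other modules imported below (os, time, datetime, tracemalloc, linecache) are part of the Python standard library and need no installation.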
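
Before running the notebook, it can help to check which of these modules are importable in the active environment. The following is a minimal sketch using only the standard library. `missing_modules` is a hypothetical helper (it is not part of util.py), and the module list is an assumption: in particular, the import name of the normalizing-flows package is assumed to be `normflows` and may differ in your installation.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names; "normflows" is a guess for the normalizing-flows package.
required = ["normflows", "torch", "numpy", "pandas", "uproot", "tqdm", "matplotlib"]
print("Missing modules:", missing_modules(required))
```

An empty list means every module was found; anything listed still needs to be installed into the venv that the notebook kernel uses.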

**Note**: the analysis code does not need ROOT. However, the processing code (not shown here) does, so to run it you need PyROOT or eicshell.

### This notebook contains the code to run the data analysis, but all of the functions are in util.py to keep things (a little) organized

### If you want to debug specific functions, feel free to copy and paste them here rather than importing them
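
One way to get a function's source into a cell for editing is `inspect.getsource` from the standard library. The sketch below is demonstrated on `json.dumps` so that it runs anywhere; in this notebook you would presumably pass a function imported from util.py instead, such as `generateSiPMOutput`.

```python
import inspect
import json

# Demonstrated on a standard-library function. In this notebook, replace
# json.dumps with a function imported from util.py, e.g. generateSiPMOutput.
source = inspect.getsource(json.dumps)
print(source[:200])  # show the first 200 characters of the source
```

This only works for functions defined in a Python source file, which is the case for anything imported from util.py.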


In [5]:
from util import print_w_time, get_compiled_NF_model, generateSiPMOutput, display_top
import uproot
import numpy as np
from torch import device as torchdevice
from torch.cuda import is_available as torchcudaisavailable
import matplotlib.pyplot as plot
import time
from os.path import exists as ospathexists
from os import makedirs as osmakedirs
from tqdm import tqdm
import datetime
from pandas import read_csv as pd_read_csv

# Use the GPU if one is available, otherwise fall back to the CPU
device = torchdevice('cuda' if torchcudaisavailable() else 'cpu')

def checkdir(path):
    """Create the directory `path` if it does not already exist."""
    if not ospathexists(path):
        osmakedirs(path)

print_w_time("began analyze_data")
outputDataframePathName = "dfs/test_output.csv"
inputProcessedData = "dfs/test_data_from_processing.csv"


# --- Memory profiling setup ---
import linecache
import os
import tracemalloc

tracemalloc.start()  # begin tracing memory allocations
# --- End of memory profiling setup ---
model_path = "./NF_model/run_7_3context_8flows_26hl_256hu_2000bs.pth"
model_compile = get_compiled_NF_model(model_path) #load timing param model
processed_data = pd_read_csv(inputProcessedData) #open input csv
begin = time.time()
df = generateSiPMOutput(processed_data, model_compile, batch_size=50000) #analyze data
end = time.time() #stop timing here so that the CSV write is not counted

df.to_csv(outputDataframePathName) #save output (this is the clustering input)
print_w_time(f"generateSiPMOutput took {(end - begin) / 60} minutes")
print_w_time("finished job")
print_w_time("analyzing memory snapshot")

snapshot = tracemalloc.take_snapshot() #some memory profiling can be useful
display_top(snapshot)
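
`display_top` lives in util.py and is not shown in this notebook. The sketch below is modeled on the "pretty top" example in the Python `tracemalloc` documentation and is only an assumption about what such a helper does: it groups a snapshot's allocations by source line and reports the largest ones. `top_allocations` is a hypothetical name, not a function from util.py.

```python
import linecache
import tracemalloc


def top_allocations(snapshot, key_type="lineno", limit=5):
    """Return (filename, lineno, size in KiB, source line) for the largest allocations."""
    # Drop allocations made by the import machinery and by unknown frames.
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    stats = snapshot.statistics(key_type)  # sorted from largest to smallest
    rows = []
    for stat in stats[:limit]:
        frame = stat.traceback[0]
        line = linecache.getline(frame.filename, frame.lineno).strip()
        rows.append((frame.filename, frame.lineno, stat.size / 1024, line))
    return rows


# Standalone demonstration: trace a deliberate allocation of roughly 1 MiB.
tracemalloc.start()
payload = [bytes(1024) for _ in range(1000)]
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

rows = top_allocations(snapshot)
for filename, lineno, size_kib, line in rows:
    print(f"{filename}:{lineno}: {size_kib:.1f} KiB  {line}")
```

Grouping by `"lineno"` points at individual lines; passing `key_type="filename"` instead gives a coarser per-file summary, which can be easier to read for a long job.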

13:08:56 began analyze_data


  self.load_state_dict(torch.load(path))


13:08:58 Processing data in generateSiPMOutput...
13:09:00 starting sampling
13:09:00 Starting batch # 1.0 / 3
13:09:00 Starting batch # 2.0 / 3
13:09:01 Starting batch # 3.0 / 3
13:09:01 sampling took 0.029230237007141113 minutes
13:09:01 creating df
13:09:10 Beginning pulse process
         90214248 function calls (89406655 primitive calls) in 239.321 seconds

   Ordered by: cumulative time
   List reduced from 617 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.048   10.048  239.321  239.321 /hpc/group/vossenlab/rck32/eic/eicKLMcluster/examples/util.py:266(generateSiPMOutput)
   384784    2.234    0.000  144.579    0.000 /hpc/group/vossenlab/rck32/ML_venv/lib64/python3.9/site-packages/pandas/core/groupby/ops.py:607(get_iterator)
   128260    0.440    0.000   93.706    0.001 /hpc/group/vossenlab/rck32/ML_venv/lib64/python3.9/site-packages/pandas/core/groupby/ops.py:622(_get_splitter)
   128260    0.210    0.000   9