Include our favorite packages

In [None]:
#Magic function for matplotlib to display inside Jupyter notebook
%matplotlib inline 

import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['font.size']  = 16
matplotlib.rcParams['font.family']= 'serif'

import numpy as np
import pandas as pd

One of the more useful tools for physics analysis with Pandas is the so-called "query". This is sort of like TTree->Draw on <b>steroids</b>.

For deep learning, in particular the Faster-RCNN detector, we want to analyze the network inferences while putting restrictions on certain quantities of interest, including the network score and IoU.

Pandas is perfect for slicing and dicing your data to find out what the hell is going on. No longer will you have to write, and re-write scripts to do analysis. We can load data in memory once and get our hands <b>dirty</b> until we find an interesting result. Once again I give a shout out to the Pandas book for more amazing things you can do with pandas.
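
To give a feel for what a query looks like before touching the real file, here is a minimal sketch on a made-up table. The values and the `iou` column are hypothetical (the data file used in this notebook does not contain an IoU column); only `DataFrame.query` itself is the real pandas call.

```python
import pandas as pd

# Toy detections: all numbers and the 'iou' column are invented for illustration
toy_df = pd.DataFrame({
    'image_id': [1, 1, 2, 3],
    'prob':     [0.95, 0.40, 0.70, 0.20],
    'iou':      [0.80, 0.10, 0.55, 0.60],
})

# Keep only confident detections that also overlap the truth box well
good = toy_df.query('prob > 0.5 and iou > 0.5')
print(good)
```

The selection is written as a single string, so trying a different cut only means editing that string and re-running the cell.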

Let's start with a data file produced from the Faster-RCNN detector network. I've sliced off ~10MB from the ~100MB full network output to save you from downloading a very large data file.

http://www.nevis.columbia.edu/~vgenty/public/some_detections.txt

In [None]:
#What is in here... just pure ascii
!head some_detections.txt

In [None]:
detect_df = pd.read_csv("some_detections.txt", # the file
                        sep=' ',               # separator is a space
                        names=['d1','image_id','prob','x1','y1','x2','y2'],  # The first column is a dummy index,
                        usecols=range(1,7))    # so lets ignore that first dummy column

# Print it and take a look
detect_df

What's in this file? 

0) The first column is the image number (it's really the TTree index number in the LArCV ROOT file...). You see this number repeats itself many times. That's b/c each row is a detection and we get multiple detections per image.

1) Second column is the detection score.

2) Columns [3,4,5,6] are the bounding box coordinates x1,y1,x2,y2 (two opposite corners of the box).

The ground truth label is not in this specific output, so we will have to add it in by hand. All images with image_id <= 1000 are "cosmic" images and contain no MC neutrino. Images with a larger image_id actually have a neutrino in them. Trust me on this.
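
The column layout described above can be checked on a few made-up lines without downloading anything. This is only a sketch: the three rows are invented, and `io.StringIO` stands in for the real file.

```python
import io
import pandas as pd

# Three invented lines in the same layout:
# dummy index, image_id, prob, x1, y1, x2, y2
toy_text = (
    "0 7 0.91 10 20 110 220\n"
    "1 7 0.35 15 25 90 200\n"
    "2 1203 0.88 40 60 140 260\n"
)

toy_df = pd.read_csv(io.StringIO(toy_text),
                     sep=' ',
                     names=['d1', 'image_id', 'prob', 'x1', 'y1', 'x2', 'y2'],
                     usecols=range(1, 7))  # skip the dummy first column

# Number of detections per image: image 7 appears twice, image 1203 once
counts = toy_df.groupby('image_id').size()
print(counts)
```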

In [None]:
# How many cosmic inferences do we have?
print(detect_df.query('image_id <= 1000').index.size)

# How many neutrino inferences do we have?
print(detect_df.query('image_id > 1000').index.size)

Not very useful since the detections are not unique, but let's go ahead and assign a ground truth class to each of the images. For this we can use the pandas apply() function.

In [None]:
# make a function that assigns the ground truth and takes as input the row
def assign_gt(row):
    if row['image_id'] <= 1000 :
        return 0
    return 1

# make a new column called gt (ground truth)
detect_df['gt'] = detect_df.apply(assign_gt,axis=1)
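
Row-by-row `apply` is easy to read but can get slow on large tables. As a sketch of an alternative, the same label can be built by comparing the whole column at once. The toy `image_id` values below are made up, and the 1000 threshold is the one used in this notebook.

```python
import pandas as pd

# Invented image ids on both sides of the 1000 threshold
toy_df = pd.DataFrame({'image_id': [3, 1000, 1001, 2500]})

# Row-by-row version, same logic as the apply() cell
def assign_gt(row):
    if row['image_id'] <= 1000:
        return 0
    return 1

gt_apply = toy_df.apply(assign_gt, axis=1)

# Vectorized version: compare the whole column, then cast bool -> int
gt_vector = (toy_df['image_id'] > 1000).astype(int)

print(gt_vector.tolist())
```

Both versions give the same labels; the vectorized one avoids calling a Python function once per row.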

Usually we care about the <b>best</b> detection per image, so let's slice the dataframe <b>per image</b> and pick the box with the highest network score.

In [None]:
top_predictions = []
for name, group in detect_df.groupby('image_id'):
    # sort the group based on probability
    sorted_df = group.sort_values(by='prob',ascending=False)
    
    # get the top index (use positional indexer iloc, not ix)
    # there is a more elegant way to do this but you will find when doing analysis
    # just do what you have to to get the job done
    top_predictions.append(sorted_df.iloc[0])
    

In [None]:
#Convert to a data frame
top_df = pd.DataFrame(top_predictions)
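
For reference, one possible shorter route to the same kind of table is `groupby` followed by `idxmax`. This is a sketch on invented numbers, not a replacement for the loop: on ties `idxmax` keeps the first occurrence, which may differ from the row the sort-based loop picks.

```python
import pandas as pd

# Invented detections: two for image 1, three for image 2
toy_df = pd.DataFrame({
    'image_id': [1, 1, 2, 2, 2],
    'prob':     [0.2, 0.9, 0.5, 0.4, 0.7],
})

# For each image_id, idxmax returns the index label of the row with the largest prob
best_idx = toy_df.groupby('image_id')['prob'].idxmax()

# Select those rows from the original table
toy_top = toy_df.loc[best_idx]
print(toy_top)
```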

In [None]:
# Just want to make a note here that linspace and arange are not the same. You should provide 1 extra bin edge
# in the case of linspace (40->41) to get the correct binning. In arange you see I go over the limit
# by one step (1.0+0.025) so that the upper edge at 1.0 is included.

print("np.linspace")
print(np.linspace(0,1.0,40))

print("np.arange")
print(np.arange(0,1.0+0.025,0.025))

In [None]:
arange_   = np.arange(0,1.0,0.025)
linspace_ = np.linspace(0,1.0,40)
bins_     = np.arange(0,1.0+0.025,0.025)


l = plt.hist(arange_,bins=linspace_)
print("39 bins...")
plt.show()

a = plt.hist(arange_,bins_)
print("40 nice bins...")
plt.show()
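
The bin counting can also be checked without plotting. A minimal sketch: N bins need N+1 edges, so `linspace` should be asked for 41 points to get 40 bins of width 0.025. The variable names here are just for illustration.

```python
import numpy as np

n_bins = 40

edges_wrong = np.linspace(0, 1.0, n_bins)      # 40 edges -> only 39 bins
edges_right = np.linspace(0, 1.0, n_bins + 1)  # 41 edges -> 40 bins

# Number of bins is always number of edges minus one
print(len(edges_wrong) - 1, len(edges_right) - 1)

# Width of the first bin in the correct binning
width = edges_right[1] - edges_right[0]
print(width)
```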

In [None]:
#Let's now look at how well the detection network can separate neutrinos from cosmic images

neutrinos = top_df.query('gt == 1.0')['prob'].values
cosmics   = top_df.query('gt == 0.0')['prob'].values


div=0.025
bins = np.arange(0,1.0+div,div)

fig,ax = plt.subplots(figsize=(10,6))


# note... normed=True (density=True in newer matplotlib) here will give an area normalized histogram, not what we want
# with 1/N weights each bin holds the fraction of events, so if we scale cosmic and neutrino events up
# all we have to do is multiply this histogram by a scalar to get the total rate, not the case for area normalized!
ax.hist(neutrinos,
        bins=bins,
        weights=np.full(neutrinos.size, 1.0/neutrinos.size),
        alpha=0.5,
        lw=2,
        histtype='stepfilled',
        color='blue',
        label='neutrino')
ax.hist(cosmics,
        bins=bins,
        weights=np.full(cosmics.size, 1.0/cosmics.size),
        alpha=0.5,
        lw=2,
        histtype='stepfilled',
        color='red',
        label='cosmics')

ax.set_ylim(0,.10)
ax.set_ylabel("Events (Normalized)",fontweight='bold')
ax.set_xlabel("Nu Score",fontweight='bold')
ax.legend(loc='upper left')

#Kaleko's suggestion
ax.grid()
plt.show()
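
To make the normalization comment in the plotting cell concrete, here is a small sketch comparing the two options with `np.histogram`. The scores are random numbers, not network output. With 1/N weights the bin heights sum to 1 (each bin is a fraction of events); with `density=True` it is the area (height times bin width) that sums to 1.

```python
import numpy as np

# Random stand-in scores in [0, 1); NOT real network output
rng = np.random.RandomState(0)
scores = rng.uniform(0, 1, size=500)

bins = np.linspace(0, 1.0, 41)  # 40 bins of width 0.025

# Option 1: each event carries weight 1/N, so each bin holds a fraction of events
w = np.full(scores.size, 1.0 / scores.size)
frac, _ = np.histogram(scores, bins=bins, weights=w)

# Option 2: area-normalized histogram
dens, _ = np.histogram(scores, bins=bins, density=True)

print(frac.sum())                    # sum of bin heights, ~1
print((dens * np.diff(bins)).sum())  # total area, ~1
```

Multiplying `frac` by an expected number of events gives expected events per bin directly, which is why the weighted version is used in the plot.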