# Processing jet images

This Jupyter Notebook will process a jet image data set from an HDF5-file to a Pandas DataFrame ready for input to a network. 

**Data description**:<br>
The data used in this example can be downloaded from: https://zenodo.org/record/269622#.X7JntC-HJAb. It is 2.2 GB in size. A full dataset description can be found in [arXiv:1701.05927]. It is an HDF5-file with the following fields:
- 'image' : array of dim (872666, 25, 25), contains the pixel intensities of each 25x25 image
- 'signal' : binary array to identify signal (1, i.e. W boson) vs background (0, i.e. QCD)
- 'jet_eta': eta coordinate per jet
- 'jet_phi': phi coordinate per jet
- 'jet_mass': mass per jet
- 'jet_pt': transverse momentum per jet
- 'jet_delta_R': distance between leading and subleading subjets if 2 subjets present, else 0
- 'tau_1', 'tau_2', 'tau_3': substructure variables per jet (a.k.a. n-subjettiness, where n=1, 2, 3)
- 'tau_21': tau2/tau1 per jet
- 'tau_32': tau3/tau2 per jet

Reading HDF5 file

In [1]:
import h5py

path = '/Users/nallenallis/Documents/LTH/Exjobb/data/jet-images_Mass60-100_pT250-300_R1.25_Pix25.hdf5'
file = h5py.File(path,'r')

# Jet image
image = file['/image']

# Jet variables
signal = file['/signal']

m = file['/jet_mass']
pt = file['/jet_pt']
eta = file['/jet_eta']
phi = file['/jet_phi']
deltaR = file['/jet_delta_R']

tau_1 = file['/tau_1']
tau_2 = file['/tau_2']
tau_3 = file['/tau_3']
tau_21 = file['/tau_21']
tau_32 = file['/tau_32']

# Dimensions
nbr_images = image.shape[0]
nbr_variables = 5          # = 11 
dim0 = image.shape[1] 
dim1 = image.shape[2]

In [2]:
import numpy as np
import pandas as pd

image = np.array(image)
image = image.reshape((nbr_images,dim0*dim1))



data = np.zeros([nbr_images, dim0*dim1+nbr_variables])

#for n in range(nbr_images): # Loops through all images
    #for c in range(len(cols)): # Loops through all columns
        #if c < dim0*dim1: # If we are in a "image" column
        #    data[n,c] = image[n*c]
        #else: # If we are in a "variable" column
        #    data[n,c] = col


In [3]:
#variables = {'signal/background': signal, 'm': m, 'pt': pt, 'eta': eta, 'phi': phi, 'deltaR': deltaR, 'tau_1': tau_1, 'tau_2': tau_2, 'tau_3': tau_3, 'tau_21': tau_21, 'tau_32': tau_32} 
variables = {'signal/background': signal, 'm': m, 'pt': pt, 'eta': eta, 'phi': phi} 
df2 = pd.DataFrame(variables)

pixels = [str(x) for x in range(dim0*dim1)] # Creates array ['0', '1', '2', ..., 'dim0*dim1']
df1 = pd.DataFrame(np.zeros([nbr_images, 625]), columns = pixels)

### Truncating the data

Because the dataset is so large (contains 872666 jet images), one optional step is to truncate the data and only include a limited set. 

In [10]:
nbr_images = 50000
df1.truncate(before=0, after=nbr_images-1)
df2.truncate(before=0, after=nbr_images-1)

Unnamed: 0,signal/background,m,pt,eta,phi
0,1.0,95.136238,299.065826,-1.517478,5.756136
1,1.0,81.271561,291.957397,-0.458434,5.938481
2,1.0,76.364853,251.558395,1.375518,2.774098
3,1.0,85.025551,271.143646,0.202537,1.718446
4,1.0,88.171738,271.161774,-0.767848,1.980324
...,...,...,...,...,...
49995,0.0,79.455582,250.558533,1.413080,5.841528
49996,0.0,75.540802,265.298645,-1.599532,1.360684
49997,0.0,61.758492,284.996735,0.589839,5.080204
49998,0.0,69.689804,289.933319,1.064852,3.736380


In [5]:
df = pd.concat([df1, df2])

In [11]:
df = df1.assign(m = m)

In [9]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,620,621,622,623,624,signal/background,m,pt,eta,phi
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,


In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])

frames = [df1, df2]
result = pd.concat(frames)

In [None]:
result.head()