# Advanced Skills

## 1. Downsizing data with a specific ratio

### 1) Introduction
Have you considered that data with 5 labels can also be used for binary classification? For example, we have 5 jet types: t, q, g, w, z. Instead of a 5-class tagger, suppose we only want a T tagger, which means all the other jets are grouped together as not-T. If we use the data directly, the classes are imbalanced: assuming the five types are equally represented, there are only 1/4 as many T jets as not-T jets. We therefore need to downsize the data so that the classes are roughly balanced.<br><br>
Additionally, for testing we usually work with only part of the data, so downsizing is useful there as well.



### 2) Code
Generally, we use conditional slicing (boolean indexing) in pandas to accomplish this task.

In [None]:
import numpy as np
import h5py
import pandas as pd
from tqdm import tqdm

# load data
load_path = "data/samples.h5"
df = pd.read_hdf(load_path, key="data")   

In [None]:
# Select the rows that meet the condition
tJets = df[df['j_t']==1]

print("There are %d constituents of T jets." %(tJets.shape[0]))

However, we cannot sample this DataFrame row by row, because each row is a single constituent and a jet spans several rows. Instead, we want to split the data by jets.
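
To see why sampling by jet matters, here is a minimal sketch on a made-up table. The `toy` DataFrame, its `pt` column and its values are assumptions for illustration; only the `j_index` column name follows the data above. Sampling rows picks individual constituents, whereas sampling `j_index` values keeps every chosen jet whole.

```python
import numpy as np
import pandas as pd

# Toy table: 3 jets with 4, 2 and 3 constituents (illustrative values only).
toy = pd.DataFrame({
    "j_index": [0, 0, 0, 0, 1, 1, 2, 2, 2],
    "pt":      [9.0, 7.0, 4.0, 1.0, 8.0, 2.0, 6.0, 5.0, 3.0],
})
full_sizes = toy.groupby("j_index").size()

rng = np.random.default_rng(0)

# Row-level sampling: picks constituents, so jets may be cut apart.
row_sample = toy.sample(n=5, random_state=0)

# Jet-level sampling: pick jet indices first, then keep all of their rows.
chosen = rng.choice(toy["j_index"].unique(), size=2, replace=False)
jet_sample = toy[toy["j_index"].isin(chosen)]

# Every jet in jet_sample still has all of its constituents.
kept_sizes = jet_sample.groupby("j_index").size()
print((kept_sizes == full_sizes.loc[kept_sizes.index]).all())
```

The same two-step pattern (choose indices, then filter with `isin`) is what the following cells apply to the real data.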

In [None]:
labels=['j_t','j_q','j_g','j_w','j_z']
jet_dict = {}
for label in labels:
    jet_dict[label] = np.unique(df[df[label]==1]['j_index'])

Now we have a dictionary in which the keys are labels and the values are the corresponding jet indices.<br>
Now let's take 100 jets of each type.

In [None]:
jet_size = np.repeat([100],5)
df100 = pd.DataFrame()

for label, indices in jet_dict.items():
    chosen_jets = np.random.choice(indices,size=jet_size[labels.index(label)], replace=False)
    temp_df = df[df['j_index'].isin(chosen_jets)]
    df100 = pd.concat([df100,temp_df],axis=0)

df100.info()

Here we have a dataset with 100 jets for each category.
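
Coming back to the binary T tagger from the introduction, one way to balance T against not-T is to take as many not-T jets as T jets, shared equally among the other four labels. The sketch below runs on a synthetic stand-in for `df`: the `make_toy` helper, the `is_t` column name and the jet counts are all assumptions for illustration. With the real data you would skip the toy construction and reuse `df` and `jet_dict`.

```python
import numpy as np
import pandas as pd

labels = ["j_t", "j_q", "j_g", "j_w", "j_z"]
rng = np.random.default_rng(42)

def make_toy(n_jets_per_label=40, n_const=3):
    """Hypothetical stand-in for df: one-hot labels plus a j_index per row."""
    rows = []
    j_index = 0
    for label in labels:
        for _ in range(n_jets_per_label):
            for _ in range(n_const):
                row = {name: int(name == label) for name in labels}
                row["j_index"] = j_index
                rows.append(row)
            j_index += 1
    return pd.DataFrame(rows)

toy_df = make_toy()
toy_jets = {label: toy_df.loc[toy_df[label] == 1, "j_index"].unique()
            for label in labels}

n_t = 20              # number of T jets to keep (illustrative)
n_other = n_t // 4    # not-T jets taken from each of the other labels
sizes = {"j_t": n_t, "j_q": n_other, "j_g": n_other,
         "j_w": n_other, "j_z": n_other}

chosen = np.concatenate([
    rng.choice(toy_jets[label], size=size, replace=False)
    for label, size in sizes.items()
])
df_bin = toy_df[toy_df["j_index"].isin(chosen)].copy()
df_bin["is_t"] = df_bin["j_t"]   # binary target: 1 for T, 0 for not-T

# Count jets (not rows) per class.
counts = df_bin.groupby("is_t")["j_index"].nunique()
print(counts)
```

Counting with `nunique()` on `j_index` matters here, because counting rows would measure constituents rather than jets.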

### 3) Exercise
Try to create a dataset with 500 jets in total, with the ratio t:q:g:z:w = 3:2:1:0:0.

## 2. N-Constituents of Jets
The number of constituents varies from jet to jet, ranging from 20 to more than 200. For classification, every jet must contain the same number of constituents. For example, if we set n_con = 40 and the constituents are ranked by transverse momentum (the default), the first 40 constituents are kept. If a jet has fewer than 40 constituents, we use zero-padding.

In [None]:
def data_transform (nConstituents, data):
    kColumns = data.columns.shape[0]

    # we expect the output shape (mJets, nConstituents, kColumns)
    jet_list = list(set(data['j_index']))
    data_expected = []

    for jet in tqdm(jet_list):
        # Zero-pad jets that have too few constituents:
        # start from an all-zero array and copy the constituents in.
        jet_frame = np.zeros((nConstituents, kColumns))
        jet_temp = data[data['j_index']==jet].values
        if (jet_temp.shape[0]<nConstituents):
            for i, constituent in enumerate(jet_temp):
                jet_frame[i] = constituent
        else:
            jet_frame = jet_temp[:nConstituents] + jet_frame
        data_expected.append(jet_frame)
    # "j_index" is not needed for the machine learning part, so drop it.
    # Note: the slice below assumes "j_index" is the last column.

    return np.array(data_expected)[:,:,:-1]
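
As a quick check of the truncation and padding rule, the sketch below applies it to two made-up jets. It assumes the constituents should first be sorted by a transverse-momentum column (column 0 here); `data_transform` itself keeps rows in their stored order. The helper `pad_or_truncate` and the toy values are hypothetical.

```python
import numpy as np

def pad_or_truncate(constituents, n_con, pt_col=0):
    """Return an (n_con, k) array: top n_con rows by pT, zero-padded if short."""
    # Sort rows by descending pT (column pt_col).
    order = np.argsort(-constituents[:, pt_col])
    ranked = constituents[order][:n_con]
    # Start from zeros and copy the real constituents into the first rows.
    out = np.zeros((n_con, constituents.shape[1]))
    out[: ranked.shape[0]] = ranked
    return out

# Two toy jets with columns (pt, eta): one short, one long.
short_jet = np.array([[5.0, 0.1], [9.0, -0.2]])
long_jet = np.array([[1.0, 0.0], [4.0, 0.3], [2.0, -0.1], [8.0, 0.2]])

batch = np.stack([pad_or_truncate(j, n_con=3) for j in (short_jet, long_jet)])
print(batch.shape)      # (2, 3, 2)
print(batch[0])         # short jet: two real rows, then one row of zeros
print(batch[1, :, 0])   # long jet: the three largest pT values, 8, 4, 2
```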

## 3. Jet Clustering & Rotating
### 1) Dependencies
You need Linux to run "pyjet". On Windows, you can use WSL; see this tutorial for installing it: [WSL Tutorial](https://github.com/451488975/Anaconda_Setup/blob/master/CPU_with_WSL.ipynb)

Make sure you have:
 - pyjet
<br>

### 2) Clustering
Sometimes the data comes as 4-momenta, either (px,py,pz,e) or (pT,eta,phi,mass).

In [None]:
from pyjet import cluster

def _load (filePath, nJets=200000, nConstituents=40):
    '''
    Returns:
        momenta: (nJets, 4, nConstituents)
    '''
    # Column names in interleaved order: E_0, PX_0, PY_0, PZ_0, E_1, ...
    cols = [c + '_' + str(i) for i in range(nConstituents) for c in ('E', 'PX', 'PY', 'PZ')] + ['is_signal_new']
    df = pd.read_hdf(filePath, key='table', stop=nJets, columns=cols)
    # Enforce the interleaved order so that the reshape below groups
    # the 4 momentum components of each constituent together
    momenta = df[cols].iloc[:, :-1].to_numpy()
    momenta = momenta.reshape(-1, nConstituents, 4)
    nJets = slice(nJets)
    momenta = momenta[nJets, :nConstituents, :]
    momenta = np.transpose(momenta, (0, 2, 1))
    label = df['is_signal_new']
    return momenta, label

In [None]:
filePath = "data/4m_samples.h5"
nJets = 100
nConstituents = 40
data,labels = _load(filePath,nJets , nConstituents)

# Take the first jet and arrange it as (nConstituents, 4)
event = np.transpose(data[0], (1, 0))
eventCopy = np.rec.fromarrays([event[:,0], event[:,1], event[:,2], event[:,3]], names='E, PX, PY, PZ', formats='f8, f8, f8, f8')

# R: Clustering radius for the main jets
# p = -1, 0, 1 => anti-kt, C/A, kt Algorithm
# ep = False, True => (pT,eta,phi,mass), (E,px,py,pz)
R0 = 0.2  # example values, matching the defaults of _rotate below
p = 1
sequence = cluster(eventCopy, R=R0, p=p, ep=True)
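
The two 4-momentum conventions mentioned above are related by standard kinematic formulas. The sketch below converts between them with plain NumPy; the function names `to_ptetaphim` and `to_epxpypz` and the sample values are assumptions for illustration, and the forward conversion assumes pT > 0 (zero-padded constituents would need to be masked first).

```python
import numpy as np

def to_ptetaphim(e, px, py, pz):
    """(E, px, py, pz) -> (pT, eta, phi, mass); assumes pT > 0."""
    pt = np.hypot(px, py)
    eta = np.arcsinh(pz / pt)          # pseudorapidity
    phi = np.arctan2(py, px)           # azimuthal angle in (-pi, pi]
    # Clip small negative values caused by rounding before the square root.
    mass = np.sqrt(np.maximum(e**2 - (pt**2 + pz**2), 0.0))
    return pt, eta, phi, mass

def to_epxpypz(pt, eta, phi, mass):
    """(pT, eta, phi, mass) -> (E, px, py, pz)."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    e = np.sqrt(mass**2 + (pt * np.cosh(eta))**2)
    return e, px, py, pz

# Two made-up particles.
e = np.array([13.0, 10.0])
px = np.array([3.0, 1.0])
py = np.array([4.0, 2.0])
pz = np.array([0.0, 5.0])

pt, eta, phi, mass = to_ptetaphim(e, px, py, pz)
print(pt[0], eta[0], mass[0])   # first particle: pT = 5, eta = 0, mass = 12

# Converting back recovers the original components.
e2, px2, py2, pz2 = to_epxpypz(pt, eta, phi, mass)
print(np.allclose([e, px, py, pz], [e2, px2, py2, pz2]))
```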

### 3) Rotating
Rotation is performed to remove the stochastic nature of the decay angle relative to the η − φ coordinate system. For two-body decay processes (such as the hadronic decay of a W boson), the direction connecting the axes of the leading two subjets can be rotated until the leading subjet is directly above the subleading subjet.
<br><br>
More information about Jet-Image: [Paper](https://arxiv.org/pdf/1407.5675.pdf)

In [None]:
def _rotate(event, R0 = 0.2,  p = 1):
    '''
    input:
        event: (4, nConstituents), i.e. one jet as returned by _load
        R0 = Clustering radius for the main jets
        p = -1, 0, 1 => anti-kt, C/A, kt Algorithm
    '''
    event = np.transpose(event,(1,0))
    eventCopy = np.core.records.fromarrays( [event[:,0],event[:,1],event[:,2],event[:,3]], names= 'E, PX, PY, PZ' , formats = 'f8, f8, f8,f8')
    sequence = cluster(eventCopy, R=R0, p= p, ep=True)
    # List of jets
    jets = sequence.inclusive_jets()
    if len(jets)<2:
        return []
    else:
        subjet_data = event
        subjet_array = jets
        
        p = np.linalg.norm(event[:, 1:], axis=1)
        eta = 0.5 * np.log((p + event[:, 3]) / (p - event[:, 3]))
        phi = np.arctan2(event[:, 2], event[:, 1])
                
        # Shift all data such that the leading subjet
        # jet new center is located at (eta,phi) = (0,0)
        eta -= subjet_array[0].eta
        phi = np.array( [_deltaPhi(i,subjet_array[0].phi) for i in phi])
        
        # Rotate the jet image such that the second leading
        # jet is located at -pi/2
        s1x, s1y = subjet_array[1].eta - subjet_array[0].eta, _deltaPhi(subjet_array[1].phi,subjet_array[0].phi)
        
        theta = np.arctan2(s1y, s1x)
        if theta < 0.0:
            theta += 2 * np.pi
        etaRot, phiRot = _rotate2D(eta, phi, np.pi - theta)
        
        # Return the rotated (eta, phi) coordinates of the constituents
        return etaRot, phiRot

### 4) Exercise
Try to generate a jet image from 4-momenta data.