# Advanced Skills
## 1. Downsizing data with a specific ratio
### 1) Introduction
Have you considered that data with 5 labels can also be used for binary classification? For example, we have 5 kinds of jets: t, q, g, w, z. Instead of a 5-class tagger, suppose we only want a T tagger, so all the other jets are grouped as not-T. If we use the data directly, the input is imbalanced: assuming the five classes are about equally populated, T has only 1/4 as many jets as not-T. We therefore downsize the data so that the classes are roughly balanced.<br><br>
Additionally, for testing we usually use only part of the data, so downsizing is often necessary.
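
To make the imbalance concrete, here is a minimal sketch on made-up labels (not the real "samples.h5"; the array `labels` and the 200-jets-per-class count are assumptions for illustration). It counts T versus not-T and then randomly downsizes not-T to the size of T.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy labels: 1000 jets, 200 per class (one entry per jet).
labels = np.repeat(["t", "q", "g", "w", "z"], 200)

is_t = labels == "t"
t_idx = np.flatnonzero(is_t)
not_t_idx = np.flatnonzero(~is_t)
print(len(t_idx), len(not_t_idx))  # 200 800, i.e. a ratio of 1:4

# Downsize not-T so that both classes contain the same number of jets.
not_t_kept = rng.choice(not_t_idx, size=len(t_idx), replace=False)
balanced_idx = np.sort(np.concatenate([t_idx, not_t_kept]))
balanced_is_t = is_t[balanced_idx]
```

Changing `size` in `rng.choice` is one way to reach other ratios between the classes.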



In [None]:
import numpy as np
import h5py
import pandas as pd
from tqdm import tqdm
# TODO
# Load the data "samples.h5" and find out how many constituents are labelled as "j_t".
# Hint: if a constituent belongs to a t jet, the value of "j_t" should be 1, otherwise 0.
filePath = "data/samples.h5"

However, we cannot directly split this Series. Instead, we want to split the data by jets.

In [None]:
# TODO
# Find out how many jets there are of each kind (t, q, g, w, z).
# Hint: for each jet, you only want to count it once.


Now we have a dictionary in which keys are labels and values are the corresponding jet indices.<br>
Let's take 10 jets of each kind.

In [None]:
# Example: randomly take 10 w jets.
jet_w = np.unique(df[df['j_w']==1]['j_index'])
chosen_jets = np.random.choice(jet_w, size = 10, replace=False) 
# Replace=False => we won't take an element more than once.
w_10 = df[df["j_index"].isin(chosen_jets)]

# TODO #1
# Create a dataframe with 10 jets for each kind.
# Hint: operate separately and then concatenate.





# TODO #2
# Bonus: Try to create a dataset with 100 jets in total but with ratio t:q:g:z:w = 3:2:1:0:0.
# Hint: change random size.

## 2. N-Constituents of Jets
The number of constituents varies from jet to jet, ranging from 20 to more than 200. For classification, we want every jet to contain the same number of constituents. For example, if we set nConstituents = 40 and rank by transverse momentum (the default), the first 40 constituents are kept. If a jet has fewer than 40 constituents, we use zero-padding.
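
As a rough illustration of this rule for a single jet, the sketch below ranks rows by pT, keeps the leading ones and zero-pads the rest. The helper name `fit_constituents`, the column layout (pT in column 0) and the toy values are assumptions for illustration, not part of the dataset.

```python
import numpy as np

def fit_constituents(jet, n_constituents, pt_column=0):
    """Return an (n_constituents, k) array: highest-pT rows first, zero rows after."""
    jet = np.asarray(jet, dtype=float)
    # Rank constituents by descending pT.
    order = np.argsort(-jet[:, pt_column], kind="stable")
    ranked = jet[order]
    # Start from zeros so that unused slots are already zero-padded.
    out = np.zeros((n_constituents, jet.shape[1]))
    n_keep = min(len(ranked), n_constituents)
    out[:n_keep] = ranked[:n_keep]
    return out

# Toy jet with 3 constituents and 2 columns (pT, eta); values are made up.
toy_jet = np.array([[5.0, 0.1], [20.0, -0.3], [10.0, 0.2]])
padded = fit_constituents(toy_jet, 5)     # 3 real rows followed by 2 zero rows
truncated = fit_constituents(toy_jet, 2)  # only the 2 highest-pT rows
```

Allocating the zero array first means padding and truncation share one code path; only the number of copied rows differs.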

In [None]:
# TODO
# How many constituents are in the jet with j_index=100139009?

You are right if you got 35. But we want exactly 40 slots for each jet, which means we need to zero-pad the last 5 slots.

In [None]:
# Example: zero-padding [0,1,2] to [0,1,2,0,0]
array = np.arange(0,3,1)

# Create a zero array with the expected shape (5,).
array_exp = np.zeros(5,dtype=int)           
# copy the corresponding elements
for i,ele in enumerate(array):
    array_exp[i] = ele

In [None]:
# Example: Slice [0,1,2,3,4,5,6,7] to [0,1,2,3,4]
array = np.arange(0,8,1)

# Create a zero array with the expected shape (5,).
array_exp = np.zeros(5,dtype=int)           
# Add the corresponding elements
array_exp = array[:5]+array_exp

Now we know how to deal with both cases: the number of constituents is greater or lower than the expected number. First, let's try to integrate them into one method.

In [None]:
# TODO
# Write a method that can fit this array into the shape (2,5).
# Hint: operate on each row and then combine them. (for loop)
array = [[0,1,2],[0,1,2,3,4,5,6,7]]



In [None]:
# TODO
# In the next step we will use the method we just learned to trim our data into the desired shape.
# Hint: 1. Use "j_index" to identify jets. 
# 2. The expected output shape is in 3D: (mJets, nConstituents, kColumns), 
# so the zero array you create would be in size (nConstituents, kColumns)
# 3. loop through jets, operate one by one and then combine.

## 3. Jet Clustering & Rotating
### 1) Dependencies
You need Linux to run "pyjet". Here is a tutorial for installing WSL: [WSL Tutorial](https://github.com/451488975/Anaconda_Setup/blob/master/CPU_with_WSL.ipynb)

Make sure you have:
 - pyjet
<br>

### 2) Get low/high-level features
Sometimes the data comes as 4-momenta, either (E, px, py, pz) or (pT, eta, phi, mass). But we need more features, such as etaRel, phiRel, ...
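
For orientation, the two 4-momentum forms are related by the standard kinematic formulas pT = sqrt(px² + py²), phi = atan2(py, px), eta = arcsinh(pz / pT) and m² = E² − |p|². The sketch below applies them to made-up values; the function name `to_pt_eta_phi_mass` is hypothetical and the handling of zero-padded slots is only one possible choice.

```python
import numpy as np

def to_pt_eta_phi_mass(e, px, py, pz):
    """Convert (E, px, py, pz) arrays to (pT, eta, phi, mass) arrays."""
    e, px, py, pz = (np.asarray(a, dtype=float) for a in (e, px, py, pz))
    pt = np.hypot(px, py)                    # transverse momentum
    phi = np.arctan2(py, px)                 # azimuthal angle in [-pi, pi]
    with np.errstate(divide="ignore", invalid="ignore"):
        eta = np.arcsinh(pz / pt)            # pseudorapidity; nan for zero-padded slots (pT = pz = 0)
    m2 = e**2 - (px**2 + py**2 + pz**2)      # invariant mass squared
    mass = np.sqrt(np.maximum(m2, 0.0))      # clip small negative values caused by rounding
    return pt, eta, phi, mass

# Made-up constituents; the last one is a zero-padded slot.
e = np.array([5.0, np.sqrt(np.cosh(1.0) ** 2 + 4.0), 0.0])
px = np.array([3.0, 1.0, 0.0])
py = np.array([4.0, 0.0, 0.0])
pz = np.array([0.0, np.sinh(1.0), 0.0])
pt, eta, phi, mass = to_pt_eta_phi_mass(e, px, py, pz)
```

The second toy constituent is built so that eta = 1 and mass = 2, which gives a quick sanity check of the formulas.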

In [None]:
# TODO
# Load and observe your data: columns, shape, datatype...
# Answer the questions:
# 1. how many jets are there?
# 2. how many constituents are in each jet?
# 3. Are all the jets full? (no zero-padding)
# 4. What is the datatype of each feature?
# 5. Which features will we use to get transverse momentum, pseudorapidity, ...?
filePath = "data/4m_samples.h5"


In [None]:
# TODO
# Load with specific shape (nJets, 4, nConstituents)
# Hint: read through the documentation of pandas.read_hdf, 
# you will find a way to take just a part of rows and columns.


In [None]:
# The answer to the last question is provided here, because all subsequent steps build on it.
def _load(filePath, nJets, nConstituents):
    '''
    Returns:
        momenta: (nJets, 4, nConstituents)
        label:   'is_signal_new' value of each jet
    '''
    # Interleaved order (E_0, PX_0, PY_0, PZ_0, E_1, ...) so that the reshape
    # below puts one constituent's (E, PX, PY, PZ) on each row.
    momentaCols = [c + '_' + str(i) for i in range(nConstituents) for c in ('E', 'PX', 'PY', 'PZ')]
    cols = momentaCols + ['is_signal_new']

    df = pd.read_hdf(filePath, key='data', stop=nJets, columns=cols)
    # Select the columns explicitly so their order does not depend on the file layout.
    momenta = df[momentaCols].to_numpy()
    # One row per constituent: (nJets, nConstituents, 4)
    momenta = momenta.reshape(-1, nConstituents, 4)
    momenta = np.transpose(momenta, (0, 2, 1))
    label = df['is_signal_new']
    return momenta, label

In [None]:
# TODO
# get low level features: pT, eta, phi, ... 
# Try to get as many features as possible; the answer will be provided after camp.
# Hint: reference page (https://www.lhc-closer.es/taking_a_closer_look_at_lhc/0.momentum)



### 3) Cluster and Rotate
Rotation is performed to remove the stochastic nature of the decay angle relative to the η − φ coordinate system. For two-body decay processes (such as the hadronic decay of a W boson), the direction connecting the axes of the two leading subjets can be rotated until the leading subjet is directly above the subleading subjet.
<br><br>
More information about Jet-Image: [Paper](https://arxiv.org/pdf/1407.5675.pdf)

In [None]:
# Example: Clustering
# For each event (jet), we cluster its constituents to get subjets.
# Assume you already have the constituents of one jet in an array called "event".
# Unfortunately, this cell is not runnable as is, because "event", "R0" and "p" are undefined.
# Once you complete the feature-extraction part and have at least 'pT, eta, phi, mass',
# you can load your results to run this code.
# Note that event is one jet, and its columns must follow the order 'pT, eta, phi, mass'.
# R0 is the clustering radius and p selects the algorithm (-1: anti-kT, 0: Cambridge/Aachen, 1: kT).
from pyjet import cluster
flattened_event = np.rec.fromarrays([event[:,0], event[:,1], event[:,2], event[:,3]], names='pT, eta, phi, mass', formats='f8, f8, f8, f8')
sequence = cluster(flattened_event, R=R0, p=p)
jets = sequence.inclusive_jets()

If you get an error about pyjet, please check whether you are using a Linux system. WSL is recommended.

In [None]:
# TODO
# Try clustering one piece of jet from your results.

Now we have the four features of the subjets. We want to put the leading subjet at the origin and the subleading subjet at -pi/2.


In [None]:
# Example: Rotation
pT = event[:, 0]
eta = event[:, 1]
phi = event[:, 2]
mass = event[:, 3]
# Shift all constituents relative to the leading subjet so that
# the jet image is centered at the origin (eta, phi) = (0, 0).
def deltaPhi(phi1,phi2):
    # Wrap the difference into the range [-pi, pi).
    x = phi1-phi2
    while x>= np.pi: x -= np.pi*2.
    while x< -np.pi: x += np.pi*2.
    return x

eta -= jets[0].eta
phi = np.array( [deltaPhi(i,jets[0].phi) for i in phi])

# Rotate the jet image such that the second leading
# subjet is located at -pi/2
s1x, s1y = jets[1].eta -jets[0].eta, deltaPhi(jets[1].phi,jets[0].phi)

theta = np.arctan2(s1y, s1x)
if theta < 0.0:
    theta += 2 * np.pi
def rotate(x, y, a):
    xp = x * np.cos(a) - y * np.sin(a)
    yp = x * np.sin(a) + y * np.cos(a)
    return xp, yp
# Rotating by (-pi/2 - theta) moves the subleading subjet from angle theta to -pi/2.
etaRot, phiRot = rotate(eta, phi, -np.pi / 2 - theta)

In [None]:
# TODO
# Try to cluster and rotate every jet to get etaRot and phiRot.

### 4) Visualization
Let's review how to plot a jet image.<br>
In this case, the data we are given contains zero-padding. We need to remove the padded entries before plotting, since zeros would lead to nan for eta and phi.
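
The sketch below shows one way this filtering could look on made-up values for a single zero-padded jet; the arrays `pt`, `eta_rot` and `phi_rot`, the bin count and the range are all assumptions for illustration. It masks out the padded slots and fills a pT-weighted 2D histogram with `np.histogram2d`.

```python
import numpy as np

# Made-up values for one zero-padded jet (the last two slots are padding).
pt = np.array([30.0, 12.0, 5.0, 0.0, 0.0])
eta_rot = np.array([0.00, 0.10, -0.20, np.nan, np.nan])
phi_rot = np.array([0.00, -0.30, 0.15, np.nan, np.nan])

# Indices of the padded slots, and a mask selecting the real constituents.
padding_idx = np.where(pt == 0)[0]
mask = pt > 0

# pT-weighted 2D histogram over (eta, phi); each bin holds the summed pT.
image, eta_edges, phi_edges = np.histogram2d(
    eta_rot[mask],
    phi_rot[mask],
    bins=4,
    range=[[-0.4, 0.4], [-0.4, 0.4]],
    weights=pt[mask],
)
```

The resulting `image` array can then be displayed with a plotting library of your choice.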

In [None]:
# TODO
# Find out the indices where pT=0 from the features you get previously.
# Hint: np.where()

In [None]:
# TODO
# Get all the non-zero (etaRot, phiRot, pT) sets using conditional slicing.
# And then plot a 2D-histogram.