# 4-top quark event data analysis

The following exercise aims to get you acquainted with some common data analysis tools, which come in handy for reading in raw data and manipulating it into a form from which we can extract useful information. This exercise relies on a real particle physics dataset which you can download from [Google Drive](https://drive.google.com/file/d/14RfPABjncO3DXrwQkVsSVmu6IhR2CFpR/view).

# The problem

Currently a hot topic at the Large Hadron Collider is the search for the production of 4 of the heaviest quarks (the top quark) in a single event. These special events are called "4-top events". A signal for the production of these events has not been observed so far. The discovery of "4-top events" is interesting since it might be that more events are measured than expected. This could point to physics beyond the Standard Model of particle physics, since new physical processes could also generate events with 4 top quarks. The top quarks decay to other particles which are then measured in the detector surrounding the point where the quarks were originally produced.

<b>Background</b>

If you know the mass of a particle, most of the time you know <i>what that particle is</i>. However, there is no way to just build a single detector that gives you the mass. You need to be clever and make use of special relativity, specifically <a href="http://en.wikipedia.org/wiki/Relativistic_mechanics">relativistic kinematics</a>.

To determine the mass ($m$) of a particle you need to know the 4-momenta of the particles ($\mathbf{P}$) that are detected after the collision: the energy ($E$), the momentum in the x direction ($p_x$), the momentum in the y direction ($p_y$), the momentum in the z direction ($p_z$).

$$\mathbf{P} = (E,p_x,p_y,p_z)$$

$$
m = \sqrt{E^2-(p_x^2+p_y^2 + p_z^2)}
$$
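
As a minimal sketch of the formula above (in natural units, $c = 1$), the mass of a parent particle can be computed by summing the 4-momenta of its detected decay products component by component and applying the formula to the sum. The function name `invariant_mass` and the example numbers are illustrative assumptions, not part of the dataset.

```python
import math

def invariant_mass(four_momenta):
    """Invariant mass of a system of particles given as (E, px, py, pz) tuples, with c = 1."""
    E = sum(p[0] for p in four_momenta)
    px = sum(p[1] for p in four_momenta)
    py = sum(p[2] for p in four_momenta)
    pz = sum(p[3] for p in four_momenta)
    m_squared = E**2 - (px**2 + py**2 + pz**2)
    # Rounding errors can push m^2 slightly below zero for (nearly) massless systems
    return math.sqrt(max(m_squared, 0.0))

# Two massless particles flying back to back along x, each with energy 50 (made-up numbers)
print(invariant_mass([(50.0, 50.0, 0.0, 0.0), (50.0, -50.0, 0.0, 0.0)]))   # 100.0
```

Each particle on its own is massless here, yet the pair has a mass of 100: the mass belongs to the combined system, which is why summing the decay products reveals the parent particle.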

# The LHC Dataset

The simulated training and validation data are provided in a one-line-per-event text format (CSV), where each line has variable length and contains 5 event-level quantities followed by low-level features for each object in the event. The format of each line in the CSV file is:

event ID; process ID; event weight; MET; METphi; obj1, E1, pt1, eta1, phi1; obj2, E2, pt2,
eta2, phi2; $\ldots$



- **obj** specifies the type of particle detected in the event, e.g. an electron (e), a photon (p), a so-called jet (j), a so-called b-jet (b), a muon (m), etc. A trailing + or - specifies the charge of the particle.

- (**E**,**pt**,**eta**,**phi**) specify the 4-vector of the measured particle, i.e. the energy, the transverse component of the momentum, the pseudorapidity eta (which encodes the polar angle theta) and the azimuthal angle phi.

- **event ID** is a serial integer to uniquely identify that particular event in the run.

- **event weight** is a real number to determine the likelihood of the process. This is a generator quantity and does not exist in real data. It should not be used for "training".

- **process ID** is a string referring to the process which generated the event. In real-life events this is unknown on an event-by-event basis, but in simulated data sets (like the one we are providing you with) this information is accessible.

- **MET** and **METphi** entries are the magnitude and the azimuthal angle of the missing transverse energy vector of the event, respectively. "Missing" means that this momentum is taken away by undetected particles, e.g. neutrinos.
<br>
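
The dataset stores each object as ($E$, $p_T$, $\eta$, $\phi$), whereas the mass formula above uses Cartesian components. The standard relations $p_x = p_T\cos\phi$, $p_y = p_T\sin\phi$ and $p_z = p_T\sinh\eta$ connect the two. A small sketch of that conversion follows; the helper name `to_cartesian` is an assumption made for illustration, and the input numbers are simply values of the kind found in the file.

```python
import math

def to_cartesian(E, pt, eta, phi):
    """Convert (E, pt, eta, phi) to the Cartesian 4-momentum (E, px, py, pz)."""
    px = pt * math.cos(phi)    # transverse momentum projected on x
    py = pt * math.sin(phi)    # transverse momentum projected on y
    pz = pt * math.sinh(eta)   # longitudinal momentum from the pseudorapidity
    return E, px, py, pz

E, px, py, pz = to_cartesian(331927.0, 147558.0, -1.44969, -1.76399)
print(px, py, pz)
```

A quick sanity check is that `math.hypot(px, py)` reproduces `pt`, and that a negative `eta` gives a negative `pz`.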

As an example, an event corresponding to the final state of the 
background $t\bar{t}+2j$ process with two $b$-jets and one jet reads as follows:
<br>
<br>
$$ \tiny
94;ttbar;0.00167779;112288;1.74766;b,331927,147558,-1.44969,-1.76399;j,100406,85589,-0.568259,-1.17144;b,55808.8,54391.4,-0.198215,1.726;j,72078.9,52432.5,-0.835736,1.57786; \ldots ;
$$
<br>
* <a href="http://en.wikipedia.org/wiki/Jet_(particle_physics)">Jets</a> are cones of hadrons produced in a collision.
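
To make the line format concrete, here is a minimal sketch that parses one event line by hand in plain Python. The function name `parse_event` and the dictionary keys are assumptions chosen for illustration; the input is the example event above, truncated after its first two objects.

```python
def parse_event(line):
    """Split one 'event ID;process ID;weight;MET;METphi;obj,E,pt,eta,phi;...' line."""
    fields = [f for f in line.strip().split(';') if f]   # drop empty trailing fields
    event = {
        'event_id': int(fields[0]),
        'process_id': fields[1],
        'weight': float(fields[2]),
        'met': float(fields[3]),
        'met_phi': float(fields[4]),
        'objects': [],
    }
    # Everything after the 5 event-level fields is one object per ';'-separated group
    for obj in fields[5:]:
        name, E, pt, eta, phi = obj.split(',')
        event['objects'].append((name, float(E), float(pt), float(eta), float(phi)))
    return event

line = "94;ttbar;0.00167779;112288;1.74766;b,331927,147558,-1.44969,-1.76399;j,100406,85589,-0.568259,-1.17144;"
event = parse_event(line)
print(event['process_id'], len(event['objects']), event['objects'][0][0])   # ttbar 2 b
```

The notebook itself takes a different route and lets pandas do the splitting for all events at once, which is far faster for 400,000 lines.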

In [None]:
# Common data analysis tool imports
import numpy as np               # Numpy. Optimized and highly performant multidimensional arrays in Python
from numpy import argmax    
import pandas as pd              # The Pandas package. Very useful tool for data analysis and data manipulation
import seaborn as sns            # Statistical plotting package, alternative to matplotlib
import matplotlib.pyplot as plt  # Standard Python plotting package

In [None]:
# Importing functionality from scikit-learn, a machine learning library
from sklearn.metrics import confusion_matrix, roc_curve    # Self-explanatory functionality

In [None]:
# Accessing downloaded CSV files from Google Colab (this is specific for Google Colab, other solutions exist as well)
from google.colab import files
uploaded = files.upload()

Saving TrainingValidationData_320k_vs_80k_shuffle.csv to TrainingValidationData_320k_vs_80k_shuffle.csv


In [None]:
import io
df = pd.read_csv(io.BytesIO(uploaded['TrainingValidationData_320k_vs_80k_shuffle.csv']), header = None, sep='\n')   # if working locally, you can simply provide a string with the filepath. 
# The above reads in the CSV data into a Pandas "Dataframe" object, a sort of in-memory "database", which can be queried like a database, filtered, etc.
# You can inspect, interact, and modify DataFrames in many ways ---> Read the docs! (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

# As you can see, we specify "header = None", and "sep = '\n' " to reflect the fact that this file has an "unusual" shape (no header row, and rows have other delimiters like ";")
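
Depending on the installed pandas version, `sep='\n'` may be rejected by `read_csv` with an error, since newer releases do not accept the line terminator as a separator. If that happens, one possible workaround, sketched here under that assumption, is to split the raw bytes into lines yourself and wrap them in a single-column DataFrame, so that column `0` again holds one event per row. The helper name `lines_to_frame` is hypothetical, and the two sample lines are made up to mimic the file format.

```python
import pandas as pd

def lines_to_frame(raw_bytes):
    """Build a one-column DataFrame (column 0) holding one text line per row."""
    text = raw_bytes.decode('utf-8')
    lines = [ln for ln in text.splitlines() if ln.strip()]   # skip blank lines
    return pd.DataFrame(lines)

# Made-up bytes standing in for uploaded['TrainingValidationData_320k_vs_80k_shuffle.csv']
sample = b"1;ttbar;1;100.0;0.5;j,10,9,0.1,0.2;\n2;4top;1;200.0;-0.5;b,20,18,0.3,0.4;j,5,4,0.0,1.0;\n"
df_demo = lines_to_frame(sample)
print(df_demo.shape)   # (2, 1)
```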

In [None]:
df = df[0].str.split(';|,', expand=True) # Because the data is in a CSV format, but has ";" separated content additionally... we want to expand this into a proper "database" for further manipulation
print(df.shape) # show the number of events (rows) and the number of data points (columns)

(400000, 101)
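
What the `str.split(';|,', expand=True)` call does is easiest to see on a toy example. The pattern `';|,'` is a regular expression meaning "semicolon or comma"; with `expand=True` every piece gets its own column, and shorter rows are padded with missing values. The two lines below are made up and merely mimic the file format.

```python
import pandas as pd

toy = pd.Series([
    "1;ttbar;1;100.0;0.5;j,10,9,0.1,0.2;",                  # event with one object
    "2;4top;1;200.0;-0.5;b,20,18,0.3,0.4;j,5,4,0.0,1.0;",   # event with two objects
])
parts = toy.str.split(';|,', expand=True)

print(parts.shape)   # (2, 16)
print(parts)         # the shorter first row is padded on the right
```

The longest row sets the width: 5 event-level pieces, 2 x 5 object pieces, and one empty string produced by the `;` that ends the line.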


In [None]:
df.head(10)   # Show the first 10 rows of the data... There's a lot of "None"s in the later columns... Let's change that!

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
0,21,ttbar,1,19592.4,-0.0410686,b,431384.0,247371.0,-1.15222,1.84686,j,213946.0,209386.0,0.15921,-1.54733,j,437117.0,114355.0,-2.01385,2.49587,b,325344.0,102696.0,-1.81806,-2.61373,j,264381.0,94381.4,-1.68816,0.911486,j,102289.0,79279.5,0.742968,-0.824569,j,286397.0,47371.0,-2.48432,0.284316,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,21,ttbar,1,34361.5,-2.39277,b,104553.0,50546.8,1.34776,2.99585,b,92290.0,41017.6,1.44207,-1.93482,e+,157002.0,84815.6,1.22637,0.880097,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,21,ttbar,1,24356.0,-0.260029,j,255675.0,73086.1,1.92294,-1.81685,b,304656.0,69365.6,2.15966,2.62402,j,34450.9,31419.0,-0.422891,1.1691,m+,143384.0,73078.3,1.29477,-0.391141,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,21,ttbar,1,36565.7,-0.666041,b,131526.0,78025.7,1.10913,-3.06327,j,481649.0,65577.1,2.68213,0.276183,j,189547.0,58669.4,1.83809,-0.951545,b,54248.3,44391.9,-0.599836,2.1239,b,46417.5,34953.1,-0.773663,-2.8925,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,21,ttbar,1,8966.28,-1.89928,j,185195.0,79061.0,-1.49199,0.682821,j,57319.1,53416.0,-0.335488,-2.89574,j,156252.0,41927.5,-1.98904,-2.01592,j,26106.0,25063.6,-0.0169258,1.74293,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5,21,ttbar,1,23917.4,1.69316,b,74494.2,52650.0,0.869359,1.74075,b,66159.8,52004.6,0.694358,-0.847169,j,407808.0,50008.8,2.7878,2.3065,j,162398.0,43717.4,1.98514,-0.648184,j,169970.0,38039.3,2.17656,-2.52978,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6,21,ttbar,1,25571.3,-2.39031,j,158001.0,134569.0,0.565989,-0.573557,j,137536.0,98372.8,0.861471,2.93359,j,95527.2,93224.6,0.12508,-2.82266,b,60877.1,58178.3,-0.211073,-0.0354712,j,72025.9,48625.8,0.940629,0.292041,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
7,21,ttbar,1,19804.2,0.81136,j,145088.0,111437.0,0.736566,2.51313,j,168921.0,93965.8,-1.19046,-1.05389,j,88423.6,59281.9,-0.951829,-0.0881277,b,45176.7,43237.6,0.0439013,-0.176301,j,251543.0,38700.7,-2.55859,-3.05716,b,42004.9,33691.1,-0.67644,2.94843,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
8,21,ttbar,1,62962.5,0.886331,b,166871.0,146545.0,-0.487993,-1.11008,j,92130.1,84659.9,0.318398,2.22891,j,153137.0,79068.5,1.26205,1.7495,j,79708.3,68682.7,-0.553417,-2.32295,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
9,21,ttbar,1,27831.8,-2.6922,j,269994.0,138348.0,1.27167,-1.45395,j,878112.0,113719.0,2.73222,2.02191,j,107918.0,106497.0,0.137113,0.840852,j,212507.0,47149.7,2.18415,-2.63698,j,42072.4,39504.3,0.31538,2.80517,j,47230.5,36065.8,0.744566,-0.391823,j,246004.0,31563.8,2.74167,-1.51909,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
df = df.fillna(0) # Replace None values with 0's ---> Handy for type conversions later!
print("len(df.columns)", len(df.columns)) # You can access the list of "column names" with df.columns... As you can see above, they're currently just indexed (by numbers)
# print(df.columns) <--- just to confirm the previous statement 

### Columns are fully separated (so you can access each via DataFrame indexing), so the number of columns reflects the number of data points there are ###
### But let's compute the number of original "objects". Each line was originally grouped into 5's: [EventID, processID, evtweight, MET, METphi]    ###
### followed by [obj, E, pt, eta, phi] for each object. The following lines compute this, and one could use this information to add headers        ###

n_obj = (len(df.columns) - 5) // 5       
print("n_obj =", n_obj)
df = df.drop(range(5 + 5 * n_obj, len(df.columns)), axis = 'columns') # Drop the empty column at the end

len(df.columns) 101
n_obj = 19
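
The column arithmetic can be checked by hand: 101 columns = 5 event-level columns + 5 columns for each of 19 objects + 1 leftover column, which most likely comes from the `;` that terminates every line. A tiny sketch of the same bookkeeping (the function name `count_objects` is an assumption for illustration):

```python
def count_objects(n_columns, n_event_fields=5, n_object_fields=5):
    """Return (number of object slots, number of leftover columns)."""
    n_obj, leftover = divmod(n_columns - n_event_fields, n_object_fields)
    return n_obj, leftover

print(count_objects(101))   # (19, 1)
```

The leftover count is exactly the number of columns removed by the `df.drop(...)` call above.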


In [None]:
df.head(10) # Ok we're getting somewhere...

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,21,ttbar,1,19592.4,-0.0410686,b,431384.0,247371.0,-1.15222,1.84686,j,213946.0,209386.0,0.15921,-1.54733,j,437117.0,114355.0,-2.01385,2.49587,b,325344.0,102696.0,-1.81806,-2.61373,j,264381.0,94381.4,-1.68816,0.911486,j,102289.0,79279.5,0.742968,-0.824569,j,286397,47371.0,-2.48432,0.284316,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,21,ttbar,1,34361.5,-2.39277,b,104553.0,50546.8,1.34776,2.99585,b,92290.0,41017.6,1.44207,-1.93482,e+,157002.0,84815.6,1.22637,0.880097,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,21,ttbar,1,24356.0,-0.260029,j,255675.0,73086.1,1.92294,-1.81685,b,304656.0,69365.6,2.15966,2.62402,j,34450.9,31419.0,-0.422891,1.1691,m+,143384.0,73078.3,1.29477,-0.391141,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,21,ttbar,1,36565.7,-0.666041,b,131526.0,78025.7,1.10913,-3.06327,j,481649.0,65577.1,2.68213,0.276183,j,189547.0,58669.4,1.83809,-0.951545,b,54248.3,44391.9,-0.599836,2.1239,b,46417.5,34953.1,-0.773663,-2.8925,,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,21,ttbar,1,8966.28,-1.89928,j,185195.0,79061.0,-1.49199,0.682821,j,57319.1,53416.0,-0.335488,-2.89574,j,156252.0,41927.5,-1.98904,-2.01592,j,26106.0,25063.6,-0.0169258,1.74293,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,21,ttbar,1,23917.4,1.69316,b,74494.2,52650.0,0.869359,1.74075,b,66159.8,52004.6,0.694358,-0.847169,j,407808.0,50008.8,2.7878,2.3065,j,162398.0,43717.4,1.98514,-0.648184,j,169970.0,38039.3,2.17656,-2.52978,,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,21,ttbar,1,25571.3,-2.39031,j,158001.0,134569.0,0.565989,-0.573557,j,137536.0,98372.8,0.861471,2.93359,j,95527.2,93224.6,0.12508,-2.82266,b,60877.1,58178.3,-0.211073,-0.0354712,j,72025.9,48625.8,0.940629,0.292041,,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,21,ttbar,1,19804.2,0.81136,j,145088.0,111437.0,0.736566,2.51313,j,168921.0,93965.8,-1.19046,-1.05389,j,88423.6,59281.9,-0.951829,-0.0881277,b,45176.7,43237.6,0.0439013,-0.176301,j,251543.0,38700.7,-2.55859,-3.05716,b,42004.9,33691.1,-0.67644,2.94843,,0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,21,ttbar,1,62962.5,0.886331,b,166871.0,146545.0,-0.487993,-1.11008,j,92130.1,84659.9,0.318398,2.22891,j,153137.0,79068.5,1.26205,1.7495,j,79708.3,68682.7,-0.553417,-2.32295,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,21,ttbar,1,27831.8,-2.6922,j,269994.0,138348.0,1.27167,-1.45395,j,878112.0,113719.0,2.73222,2.02191,j,107918.0,106497.0,0.137113,0.840852,j,212507.0,47149.7,2.18415,-2.63698,j,42072.4,39504.3,0.31538,2.80517,j,47230.5,36065.8,0.744566,-0.391823,j,246004,31563.8,2.74167,-1.51909,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# Make column headers so we can call them something meaningful
heads = ['EventID', 'ProcessID', 'evtweight', 'MET', 'METphi']
for i in range(n_obj):
  heads.append('obj%d' % i)
  heads.append('obj%d_E' % i)
  heads.append('obj%d_pt' % i)
  heads.append('obj%d_eta' % i)
  heads.append('obj%d_phi' % i)
df.columns = heads # Set the DataFrame column names to headers

process_mapping = { 'ttbar': 0, '4top': 1} 
df['ProcessID'] = df['ProcessID'].map(process_mapping) # Map 'ttbar' to 0, and '4top' to 1
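
After the split, every cell is still a string (or the integer 0 written by `fillna`), so numeric columns generally need an explicit conversion before any arithmetic or plotting. One possible way, shown on a made-up two-row frame that reuses the column naming scheme defined above, is `pd.to_numeric`. With `errors='coerce'`, entries that cannot be parsed (such as empty strings) become `NaN`; replacing them by 0 afterwards is a choice made here to match the padding convention used earlier.

```python
import pandas as pd

toy = pd.DataFrame({
    'ProcessID': [0, 1],
    'obj0':      ['b', 'j'],
    'obj0_pt':   ['247371.0', '50546.8'],
    'obj1_pt':   ['209386.0', ''],      # empty string: no second object in this event
})

# Select the columns that should hold numbers and convert them in one go
numeric_cols = [c for c in toy.columns if c.endswith('_pt')]
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors='coerce').fillna(0.0)

print(toy['obj0_pt'].tolist(), toy['obj1_pt'].tolist())
```

On the full dataset the same idea would apply to all `_E`, `_pt`, `_eta` and `_phi` columns as well as `MET` and `METphi`.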

In [None]:
df.head(50) # Let's take a look at the final data-set (first 50 events) ...

Unnamed: 0,EventID,ProcessID,evtweight,MET,METphi,obj0,obj0_E,obj0_pt,obj0_eta,obj0_phi,obj1,obj1_E,obj1_pt,obj1_eta,obj1_phi,obj2,obj2_E,obj2_pt,obj2_eta,obj2_phi,obj3,obj3_E,obj3_pt,obj3_eta,obj3_phi,obj4,obj4_E,obj4_pt,obj4_eta,obj4_phi,obj5,obj5_E,obj5_pt,obj5_eta,obj5_phi,obj6,obj6_E,obj6_pt,obj6_eta,obj6_phi,...,obj11,obj11_E,obj11_pt,obj11_eta,obj11_phi,obj12,obj12_E,obj12_pt,obj12_eta,obj12_phi,obj13,obj13_E,obj13_pt,obj13_eta,obj13_phi,obj14,obj14_E,obj14_pt,obj14_eta,obj14_phi,obj15,obj15_E,obj15_pt,obj15_eta,obj15_phi,obj16,obj16_E,obj16_pt,obj16_eta,obj16_phi,obj17,obj17_E,obj17_pt,obj17_eta,obj17_phi,obj18,obj18_E,obj18_pt,obj18_eta,obj18_phi
0,21,0,1.0,19592.4,-0.0410686,b,431384.0,247371.0,-1.15222,1.84686,j,213946.0,209386.0,0.15921,-1.54733,j,437117.0,114355.0,-2.01385,2.49587,b,325344.0,102696.0,-1.81806,-2.61373,j,264381.0,94381.4,-1.68816,0.911486,j,102289.0,79279.5,0.742968,-0.824569,j,286397.0,47371.0,-2.48432,0.284316,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,21,0,1.0,34361.5,-2.39277,b,104553.0,50546.8,1.34776,2.99585,b,92290.0,41017.6,1.44207,-1.93482,e+,157002.0,84815.6,1.22637,0.880097,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,21,0,1.0,24356.0,-0.260029,j,255675.0,73086.1,1.92294,-1.81685,b,304656.0,69365.6,2.15966,2.62402,j,34450.9,31419.0,-0.422891,1.1691,m+,143384.0,73078.3,1.29477,-0.391141,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,21,0,1.0,36565.7,-0.666041,b,131526.0,78025.7,1.10913,-3.06327,j,481649.0,65577.1,2.68213,0.276183,j,189547.0,58669.4,1.83809,-0.951545,b,54248.3,44391.9,-0.599836,2.1239,b,46417.5,34953.1,-0.773663,-2.8925,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,21,0,1.0,8966.28,-1.89928,j,185195.0,79061.0,-1.49199,0.682821,j,57319.1,53416.0,-0.335488,-2.89574,j,156252.0,41927.5,-1.98904,-2.01592,j,26106.0,25063.6,-0.0169258,1.74293,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,21,0,1.0,23917.4,1.69316,b,74494.2,52650.0,0.869359,1.74075,b,66159.8,52004.6,0.694358,-0.847169,j,407808.0,50008.8,2.7878,2.3065,j,162398.0,43717.4,1.98514,-0.648184,j,169970.0,38039.3,2.17656,-2.52978,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,21,0,1.0,25571.3,-2.39031,j,158001.0,134569.0,0.565989,-0.573557,j,137536.0,98372.8,0.861471,2.93359,j,95527.2,93224.6,0.12508,-2.82266,b,60877.1,58178.3,-0.211073,-0.0354712,j,72025.9,48625.8,0.940629,0.292041,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,21,0,1.0,19804.2,0.81136,j,145088.0,111437.0,0.736566,2.51313,j,168921.0,93965.8,-1.19046,-1.05389,j,88423.6,59281.9,-0.951829,-0.0881277,b,45176.7,43237.6,0.0439013,-0.176301,j,251543.0,38700.7,-2.55859,-3.05716,b,42004.9,33691.1,-0.67644,2.94843,,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,21,0,1.0,62962.5,0.886331,b,166871.0,146545.0,-0.487993,-1.11008,j,92130.1,84659.9,0.318398,2.22891,j,153137.0,79068.5,1.26205,1.7495,j,79708.3,68682.7,-0.553417,-2.32295,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,21,0,1.0,27831.8,-2.6922,j,269994.0,138348.0,1.27167,-1.45395,j,878112.0,113719.0,2.73222,2.02191,j,107918.0,106497.0,0.137113,0.840852,j,212507.0,47149.7,2.18415,-2.63698,j,42072.4,39504.3,0.31538,2.80517,j,47230.5,36065.8,0.744566,-0.391823,j,246004.0,31563.8,2.74167,-1.51909,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# Now that we have the data-set in the form we want it in, here's the task for you. 

1. Plot a histogram showing the distribution of a kinematic variable of one object, for example the 'pt' of 'obj1'.
2. Separate this into "signal" and "background" contributions by using the 'ProcessID' column and DataFrame filtering.
3. Plot the signal (4-top==1) and background (ttbar==0) distributions on the same histogram figure.
4. Make a cut, and evaluate the confusion matrix for this cut.
5. Make multiple cuts, and plot the ROC curve of the distribution.
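
Steps 4 and 5 can be illustrated on synthetic numbers before touching the real data. In the sketch below, the two Gaussian samples, the cut value, and the helper name `confusion_counts` are all made-up assumptions; only the logic carries over: classify an event as signal if its value exceeds the cut, count the four outcomes, then repeat for many cuts. `sklearn.metrics.confusion_matrix` and `roc_curve`, imported earlier, offer ready-made versions of the same bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
background = rng.normal(loc=50.0, scale=15.0, size=1000)   # stand-in for ttbar (label 0)
signal = rng.normal(loc=80.0, scale=15.0, size=1000)       # stand-in for 4top (label 1)

def confusion_counts(signal, background, cut):
    """Count outcomes when events with value > cut are classified as signal."""
    tp = int(np.sum(signal > cut))        # signal kept
    fn = int(np.sum(signal <= cut))       # signal rejected
    fp = int(np.sum(background > cut))    # background kept
    tn = int(np.sum(background <= cut))   # background rejected
    return tn, fp, fn, tp

tn, fp, fn, tp = confusion_counts(signal, background, cut=65.0)
print("TN, FP, FN, TP =", tn, fp, fn, tp)

# Scan many cuts: each one yields a (false positive rate, true positive rate) point of the ROC curve
cuts = np.linspace(0.0, 130.0, 50)
fpr = np.array([np.mean(background > c) for c in cuts])
tpr = np.array([np.mean(signal > c) for c in cuts])
# plt.plot(fpr, tpr) would draw the curve
```

A loose cut keeps nearly everything (both rates near 1), a tight cut rejects nearly everything (both rates near 0), and the curve traced in between shows how well the variable separates the two samples.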




In [None]:
########################################
################ Hints #################
########################################

# You may find it useful to make a DataFrame filter. Can you understand what's going on in the following line of code?
new_df = df[df["ProcessID"]==1]

# You may also find it useful to convert DataFrame columns to NumPy arrays for plotting
new_nparray = new_df["obj0_pt"].to_numpy().astype('float') # <----- Check the type of this array, and its individual elements, with "type()". See what you get!
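
The hint can be tried out on a tiny made-up DataFrame: the comparison `toy['ProcessID'] == 1` produces a boolean Series (a mask), and indexing with it keeps only the rows where the mask is `True`. The toy values below are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'ProcessID': [0, 1, 1, 0],
    'obj0_pt':   ['247371.0', '50546.8', '73086.1', '78025.7'],   # strings, as after str.split
})

mask = toy['ProcessID'] == 1            # boolean Series: False, True, True, False
signal_pt = toy[mask]['obj0_pt'].to_numpy().astype('float')
background_pt = toy[~mask]['obj0_pt'].to_numpy().astype('float')

print(type(signal_pt), signal_pt.dtype, signal_pt)
# plt.hist(background_pt, alpha=0.5, label='ttbar'); plt.hist(signal_pt, alpha=0.5, label='4top')
# would overlay the two distributions in one figure
```

Without the `astype('float')` step the array would still contain strings, which is why checking the element type is worthwhile.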

# The solution

Nothing here (yet). Good luck! 😇