<a href="https://colab.research.google.com/github/AritraStark/E2E_GSOC_2022/blob/main/E2E_eval_task_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Common Task 2. Deep Learning based Quark-Gluon Classification:**

Datasets: https://cernbox.cern.ch/index.php/s/hqz8zE7oxyPjvsL

Description: 125x125 matrices (three-channel images) for two classes of particles,
quarks and gluons, impinging on a calorimeter.
For a description of the first dataset, please refer to the link provided for the dataset.

Please use a Convolutional Neural Network (CNN) architecture of your choice to
achieve the highest possible classification performance on this dataset (in your preferred framework, for example TensorFlow/Keras or PyTorch).

Please provide a Jupyter notebook that shows your solution.

Downloading datasets:

In [1]:
!wget https://cernbox.cern.ch/index.php/s/hqz8zE7oxyPjvsL/download
!mkdir data
!7z x -o/content/data download

--2022-03-23 02:30:30--  https://cernbox.cern.ch/index.php/s/hqz8zE7oxyPjvsL/download
Resolving cernbox.cern.ch (cernbox.cern.ch)... 137.138.120.151, 128.142.53.35, 128.142.170.17, ...
Connecting to cernbox.cern.ch (cernbox.cern.ch)|137.138.120.151|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘download’

download                [     <=>            ] 690.93M  11.5MB/s    in 63s     

2022-03-23 02:31:37 (10.9 MB/s) - ‘download’ saved [724495360]


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 724495360 bytes (691 MiB)

Extracting archive: download
--
Path = download
Type = tar
Physical Size = 724495360
Headers Size = 2560
Code Page = UTF-8

  0%     15% - QCDToGGQQ_IMGjet_RH1a

Setting up imports:

In [16]:
import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import os
import pyarrow as pa
import pyarrow.parquet as pq
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
!pip install fastparquet




In [90]:
from google.colab import drive
drive.mount("/content/gdrive")
data_dir="/content/data"
files=os.listdir(data_dir)
print(files)
# os.listdir returns bare file names, so join each one with the directory before opening it.
# read() loads every row group of a file into a single table.
df=[pq.ParquetFile(os.path.join(data_dir, f)).read().to_pandas() for f in files]
data=pd.concat(df,ignore_index=True)


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
['QCDToGGQQ_IMGjet_RH1all_jet0_run2_n55494.test.snappy.parquet', 'QCDToGGQQ_IMGjet_RH1all_jet0_run1_n47540.test.snappy.parquet', 'QCDToGGQQ_IMGjet_RH1all_jet0_run0_n36272.test.snappy.parquet']

Retrieve details of the dataset and then split the data:

In [85]:
data.describe()


|       | pt         | m0        | y        |
|-------|------------|-----------|----------|
| count | 139306.0   | 139306.0  | 139306.0 |
| mean  | 117.123943 | 21.392223 | 0.5      |
| std   | 26.067888  | 6.431673  | 0.500002 |
| min   | 70.110046  | 3.372931  | 0.0      |
| 25%   | 98.322607  | 16.934837 | 0.0      |
| 50%   | 111.731697 | 20.507958 | 0.5      |
| 75%   | 130.70047  | 24.806043 | 1.0      |
| max   | 337.051727 | 75.950493 | 1.0      |


In [87]:
data.head()

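|   | X_jets | pt         | m0        | y   |
|---|--------|------------|-----------|-----|
| 0 |        | 107.854118 | 18.723455 | 0.0 |
| 1 |        | 130.238617 | 22.919355 | 1.0 |
| 2 |        | 153.767715 | 28.0182   | 1.0 |
| 3 |        | 114.816589 | 23.951887 | 0.0 |
| 4 |        | 108.982056 | 22.844416 | 0.0 |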


In [97]:
y_jet = data['y']
X_jet = data.drop(['y'], axis=1)
X_jet.shape, y_jet.shape

((139306, 3), (139306,))

Split data into testing and training sets:

In [53]:
X_train, X_test, y_train, y_test = train_test_split( X_jet, y_jet, random_state=48, test_size=0.1 )
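
The split above still holds DataFrames, while a Keras CNN expects a dense image array. The following is a minimal sketch of one way to build that array, assuming each `X_jets` entry is a nested sequence of 3 channels, each with 125 rows of 125 values (a channels-first layout, which is an assumption about the files and not confirmed here). The helper name `to_image_tensor` and the tiny synthetic input are hypothetical.

```python
import numpy as np

def to_image_tensor(x_jets):
    """Stack nested per-event arrays into one (N, H, W, C) float32 array."""
    images = []
    for event in x_jets:
        # Assumed layout: event -> C channels, channel -> H rows, row -> W values
        channels = [
            np.stack([np.asarray(row, dtype=np.float32) for row in channel])
            for channel in event
        ]
        # Put channels last, the default layout for Keras Conv2D
        images.append(np.stack(channels, axis=-1))
    return np.stack(images)

# Tiny synthetic stand-in for the X_jets column: 2 events, 3 channels, 4x4 pixels
fake_jets = [[[[float(c)] * 4 for _ in range(4)] for c in range(3)] for _ in range(2)]
X = to_image_tensor(fake_jets)
print(X.shape)
```

If the layout assumption holds, `to_image_tensor(X_train['X_jets'])` would give an array suitable for `model.fit`. Note that 139306 images of 125x125x3 float32 values take roughly 26 GB, so converting and training in batches may be necessary.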

Delete variables to free memory:

In [26]:
del data

Define the CNN model:

In [68]:
model = tf.keras.Sequential([
  # Input: 125x125 images with three channels, as stated in the task description
  tf.keras.layers.Conv2D(64, 3, padding='same', input_shape=(125, 125, 3), activation='relu'),
  tf.keras.layers.MaxPool2D(),
  tf.keras.layers.Dropout(0.25),
  tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
  tf.keras.layers.MaxPool2D(),
  tf.keras.layers.Dropout(0.25),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(64, activation='relu'),
  tf.keras.layers.Dropout(0.5),
  tf.keras.layers.Dense(32, activation='relu'),
  tf.keras.layers.Dropout(0.5),
  # Single sigmoid unit: predicted probability of class 1
  tf.keras.layers.Dense(1, activation='sigmoid'),
])
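
To get a feel for the size of this network, the sketch below works out the flattened feature count and the number of trainable parameters by hand. It assumes a 125x125x3 input (the image size from the task description), `'same'` padding in the convolutions, and the default 2x2 max pooling, which halves each spatial dimension and rounds down. The function name `conv_pool_stack_summary` is hypothetical.

```python
def conv_pool_stack_summary(input_shape, conv_filters=(64, 32), kernel=3,
                            dense_units=(64, 32, 1)):
    """Return (flattened feature count, trainable parameter count)."""
    h, w, c = input_shape
    params = 0
    for filters in conv_filters:
        # Conv2D: one kernel x kernel x in_channels weight block plus a bias per filter
        params += kernel * kernel * c * filters + filters
        c = filters
        # 'same' padding keeps h and w; 2x2 max pooling halves them (floor)
        h, w = h // 2, w // 2
    flat = h * w * c
    units_in = flat
    for units in dense_units:
        # Dense: weight matrix plus bias; Dropout layers add no parameters
        params += units_in * units + units
        units_in = units
    return flat, params

flat, total = conv_pool_stack_summary((125, 125, 3))
print(flat, total)
```

Almost all of the parameters sit in the first dense layer, so the flattened feature count is the main lever for shrinking the model.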

Defining callback:

In [47]:
filepath="classifier_weights2-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
checkpoint1 = tf.keras.callbacks.ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint1]
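
The checkpoint path is a Python format template that Keras fills in with the 1-based epoch number and the logged metrics. A short sketch of how the template expands, using made-up values for the epoch and accuracy:

```python
template = "classifier_weights2-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"

# Illustrative values only: epoch 7 with a validation accuracy of 0.7123
name = template.format(epoch=7, val_accuracy=0.7123)
print(name)

# Without zero padding the epoch, the rebuilt name no longer matches the saved file
unpadded = f"classifier_weights2-improvement-{7}-{0.7123:.2f}.hdf5"
print(unpadded == name)
```

Any code that rebuilds the file name later has to apply the same `02d` and `.2f` formatting, otherwise the weights file is not found for epochs below 10.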

Compiling and Fitting the model with training data:

In [74]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['accuracy'])
history = model.fit(X_train, y_train, 
                    validation_split=0.1, 
                    epochs=100, 
                    batch_size=1000,
                    callbacks=callbacks_list)

ValueError: ignored
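
For reference, the `binary_crossentropy` loss named in the compile step averages the negative log-likelihood of the true label under the predicted probability. A minimal NumPy sketch follows; the clipping constant `eps` is an assumption of this sketch, chosen only to avoid `log(0)`.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels in {0, 1} and probabilities."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip so that log() never sees exactly 0 or 1
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

confident = binary_cross_entropy([1, 0], [0.9, 0.1])   # correct and confident
uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])  # always predicts 0.5
print(confident, uninformed)
```

A classifier that always outputs 0.5 scores ln 2 (about 0.693), so a training loss that stays near that value suggests the network is not learning anything on this balanced dataset.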

Plotting the results: 

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Accuracy of the model')
plt.ylabel('Accuracy')
plt.xlabel('Epochs')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Loss metrics of the model')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

Checking the performance of the model on the training data and computing predictions:

In [None]:
best_epoch=np.argmax(history.history['val_accuracy'])
best_acc=np.max(history.history['val_accuracy'])
# The checkpoint name zero-pads the epoch ({epoch:02d}), so pad it the same way here
model.load_weights(f"classifier_weights2-improvement-{best_epoch+1:02d}-{best_acc:.2f}.hdf5")

#delete history to free up memory
del history

predictions = model.predict(X_train)
bin =[0 if p<0.5 else 1 for p in predictions]

Classification report and ROC AUC score:

In [None]:
print(classification_report(y_train,bin))
print("ROC AUC:")
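# ROC AUC is a ranking metric, so score the predicted probabilities, not the thresholded labels
roc_auc_score(y_train, predictions.ravel())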

Classification Report and ROC AUC score on test data:

In [None]:
del predictions
del bin
predictions = model.predict(X_test)
bin =[0 if p<0.5 else 1 for p in predictions]
print(classification_report(y_test,bin))
print("ROC AUC:")
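# ROC AUC is a ranking metric, so score the predicted probabilities, not the thresholded labels
roc_auc_score(y_test, predictions.ravel())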
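
ROC AUC measures how well the scores rank positives above negatives, so thresholding the probabilities at 0.5 before scoring can change the result. The sketch below illustrates this with a small pairwise implementation on made-up labels and scores; the function name `roc_auc` is hypothetical, and this is not how scikit-learn computes the metric internally.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float).ravel()
    pos, neg = scores[y_true], scores[~y_true]
    # Compare every positive score with every negative score; ties count half
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Made-up labels and predicted probabilities, for illustration only
y = [0, 0, 1, 1, 1]
p = [0.1, 0.6, 0.55, 0.7, 0.9]

auc_prob = roc_auc(y, p)
hard_labels = [0 if v < 0.5 else 1 for v in p]
auc_hard = roc_auc(y, hard_labels)
print(auc_prob, auc_hard)
```

In this example the thresholded labels give a lower AUC than the raw probabilities, because all predictions on the same side of 0.5 collapse into ties.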





