<a href="https://colab.research.google.com/github/MehaRima/2sxc-content-app/blob/master/Half-Running_Particle_identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About

In this programming assignment you will train a classifier to identify the type of a particle. There are six particle types: electron, proton, muon, kaon, pion and ghost. A ghost is either a particle of a type other than the first five or detector noise.

Different particle types leave different responses in the detector systems, or subdetectors. There are five systems: the tracking system, the ring imaging Cherenkov detector (RICH), the electromagnetic and hadron calorimeters, and the muon system.

![pid](https://github.com/MehaRima/Advanced-Machine-Learning-Specialization/blob/master/Addressing%20Large%20Hadron%20Collider%20Challenges%20by%20Machine%20Learning/Week2/pic/pid.jpg?raw=1)

Your task is to identify the particle type using the responses in the detector systems.

# Attention

Download the data files from https://github.com/hse-aml/hadron-collider-machine-learning/releases/tag/Week_2

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install matplotlib-venn
!apt-get -qq install -y libfluidsynth1
# To determine which version you're using:
!pip show tensorflow

# For the current version: 
!pip install --upgrade tensorflow

# For a specific version:
!pip install tensorflow==1.2

# For the latest nightly build:
!pip install tf-nightly
# https://pypi.python.org/pypi/libarchive
!apt-get -qq install -y libarchive-dev && pip install -U libarchive
import libarchive
!apt-get -qq install python-cartopy python3-cartopy
import cartopy
# https://pypi.python.org/pypi/pydot
!apt-get -qq install -y graphviz && pip install pydot
import pydot

Selecting previously unselected package libfluidsynth1:amd64.
(Reading database ... 144579 files and directories currently installed.)
Preparing to unpack .../libfluidsynth1_1.1.9-1_amd64.deb ...
Unpacking libfluidsynth1:amd64 (1.1.9-1) ...
Setting up libfluidsynth1:amd64 (1.1.9-1) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
/sbin/ldconfig.real: /usr/local/lib/python3.6/dist-packages/ideep4py/lib/libmkldnn.so.0 is not a symbolic link

Name: tensorflow
Version: 2.3.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.6/dist-packages
Requires: astunparse, opt-einsum, termcolor, protobuf, wrapt, keras-preprocessing, absl-py, tensorboard, tensorflow-estimator, gast, wheel, grpcio, numpy, six, google-pasta, scipy, h5py
Required-by: fancyimpute
Requirement already up-to-date: tensorflow in /usr/local/lib

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas
import numpy
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
import utils

# Download data

Download data used to train classifiers.

### Read training file

In [4]:
data = pandas.read_csv('training.csv.gz')

In [5]:
data.head()

Unnamed: 0,TrackP,TrackNDoFSubdetector2,BremDLLbeElectron,MuonLooseFlag,FlagSpd,SpdE,EcalDLLbeElectron,DLLmuon,RICHpFlagElectron,EcalDLLbeMuon,TrackQualitySubdetector2,FlagPrs,DLLelectron,DLLkaon,EcalE,TrackQualityPerNDoF,DLLproton,PrsDLLbeElectron,FlagRICH1,MuonLLbeBCK,FlagHcal,EcalShowerLongitudinalParameter,Calo2dFitQuality,TrackPt,TrackDistanceToZ,RICHpFlagPion,HcalDLLbeElectron,Calo3dFitQuality,FlagEcal,MuonLLbeMuon,TrackNDoFSubdetector1,RICHpFlagProton,RICHpFlagKaon,GhostProbability,TrackQualitySubdetector1,Label,RICH_DLLbeBCK,FlagRICH2,FlagBrem,HcalDLLbeMuon,TrackNDoF,RICHpFlagMuon,RICH_DLLbeKaon,RICH_DLLbeElectron,HcalE,MuonFlag,FlagMuon,PrsE,RICH_DLLbeMuon,RICH_DLLbeProton
0,74791.156263,15.0,0.232275,1.0,1.0,3.2,-2.505719,6.604153,1.0,1.92996,17.58568,1.0,-6.411697,-7.213295,1e-06,1.46755,-26.667494,-2.730674,1.0,-5.152923,1.0,-999.0,19.954819,3141.930677,0.61364,1.0,-0.909544,-999.0,1.0,-0.661823,4.0,1.0,1.0,0.018913,5.366212,Muon,-21.913,1.0,1.0,1.015345,28.0,1.0,-7.2133,-0.2802,5586.589846,1.0,1.0,10.422315,-2.081143e-07,-24.8244
1,2738.489989,15.0,-0.357748,0.0,1.0,3.2,1.864351,0.263651,1.0,-2.061959,20.23068,1.0,5.453014,6e-06,1531.542,3.57054,-0.711194,1.773806,1.0,-999.0,0.0,33.187644,0.037601,199.573653,0.46548,1.0,0.434909,13.667366,1.0,-999.0,10.0,0.0,0.0,0.351206,9.144749,Ghost,-0.703617,0.0,1.0,-2.394644,32.0,1.0,-0.324317,1.707283,-7e-06,0.0,1.0,43.334935,2.771583,-0.648017
2,2161.409908,17.0,-999.0,0.0,0.0,-999.0,-999.0,-999.0,0.0,-999.0,11.619878,0.0,-999.0,-999.0,-999.0,0.826442,-999.0,-999.0,0.0,-999.0,0.0,-999.0,-999.0,94.829418,0.241891,0.0,-999.0,-999.0,0.0,-999.0,5.0,0.0,0.0,0.195717,1.459992,Ghost,-999.0,0.0,0.0,-999.0,27.0,0.0,-999.0,-999.0,-999.0,0.0,0.0,-999.0,-999.0,-999.0
3,15277.73049,20.0,-0.638984,0.0,1.0,3.2,-2.533918,-8.724949,1.0,-3.253981,15.336305,1.0,-10.616585,-39.447507,4385.688,1.076721,-29.291509,-3.053104,1.0,-999.0,1.0,231.190351,2.839508,808.631064,0.680705,1.0,-1.50416,1939.259641,1.0,-999.0,9.0,0.0,1.0,0.003972,22.950573,Pion,-47.223118,1.0,1.0,-0.321242,36.0,1.0,-35.202221,-14.742319,4482.803707,0.0,1.0,2.194175,-3.070819,-29.291519
4,7563.700195,19.0,-0.638962,0.0,1.0,3.2,-2.087146,-7.060422,1.0,-0.995816,10.954629,1.0,-8.144945,26.050386,1220.930044,0.439767,21.386587,-2.730648,1.0,-999.0,1.0,-794.866475,1.209193,1422.569214,0.575066,1.0,-1.576249,1867.165142,1.0,-999.0,5.0,0.0,0.0,0.015232,3.516173,Proton,15.304688,0.0,1.0,-1.038026,33.0,1.0,25.084287,-10.272412,5107.55468,0.0,1.0,1.5e-05,-5.373712,23.653087


### List of columns in the samples

Here, **Spd** stands for Scintillating Pad Detector, **Prs** for Preshower, **Ecal** for electromagnetic calorimeter and **Hcal** for hadronic calorimeter; **Brem** denotes traces of particles that were deflected by the detector.

- ID - id value for tracks (present only in the test file, for submission purposes)
- Label - string valued observable denoting particle types. Can take values "Electron", "Muon", "Kaon", "Proton", "Pion" and "Ghost". This column is absent in the test file.
- FlagSpd - flag (0 or 1), if reconstructed track passes through Spd
- FlagPrs - flag (0 or 1), if reconstructed track passes through Prs
- FlagBrem - flag (0 or 1), if reconstructed track passes through Brem
- FlagEcal - flag (0 or 1), if reconstructed track passes through Ecal
- FlagHcal - flag (0 or 1), if reconstructed track passes through Hcal
- FlagRICH1 - flag (0 or 1), if reconstructed track passes through the first RICH detector
- FlagRICH2 - flag (0 or 1), if reconstructed track passes through the second RICH detector
- FlagMuon - flag (0 or 1), if reconstructed track passes through muon stations (Muon)
- SpdE - energy deposit associated to the track in the Spd
- PrsE - energy deposit associated to the track in the Prs
- EcalE - energy deposit associated to the track in the Ecal
- HcalE - energy deposit associated to the track in the Hcal
- PrsDLLbeElectron - delta log-likelihood for a particle candidate to be electron using information from Prs
- BremDLLbeElectron - delta log-likelihood for a particle candidate to be electron using information from Brem
- TrackP - particle momentum
- TrackPt - particle transverse momentum
- TrackNDoFSubdetector1  - number of degrees of freedom for track fit using hits in the tracking sub-detector1
- TrackQualitySubdetector1 - chi2 quality of the track fit using hits in the tracking sub-detector1
- TrackNDoFSubdetector2 - number of degrees of freedom for track fit using hits in the tracking sub-detector2
- TrackQualitySubdetector2 - chi2 quality of the track fit using hits in the  tracking sub-detector2
- TrackNDoF - number of degrees of freedom for track fit using hits in all tracking sub-detectors
- TrackQualityPerNDoF - chi2 quality of the track fit per degree of freedom
- TrackDistanceToZ - distance between track and z-axis (beam axis)
- Calo2dFitQuality - quality of the 2d fit of the clusters in the calorimeter 
- Calo3dFitQuality - quality of the 3d fit in the calorimeter with assumption that particle was electron
- EcalDLLbeElectron - delta log-likelihood for a particle candidate to be electron using information from Ecal
- EcalDLLbeMuon - delta log-likelihood for a particle candidate to be muon using information from Ecal
- EcalShowerLongitudinalParameter - longitudinal parameter of Ecal shower
- HcalDLLbeElectron - delta log-likelihood for a particle candidate to be electron using information from Hcal
- HcalDLLbeMuon - delta log-likelihood for a particle candidate to be muon using information from Hcal
- RICHpFlagElectron - flag (0 or 1) if momentum is greater than threshold for electrons to produce Cherenkov light
- RICHpFlagProton - flag (0 or 1) if momentum is greater than threshold for protons to produce Cherenkov light
- RICHpFlagPion - flag (0 or 1) if momentum is greater than threshold for pions to produce Cherenkov light
- RICHpFlagKaon - flag (0 or 1) if momentum is greater than threshold for kaons to produce Cherenkov light
- RICHpFlagMuon - flag (0 or 1) if momentum is greater than threshold for muons to produce Cherenkov light
- RICH_DLLbeBCK  - delta log-likelihood for a particle candidate to be background using information from RICH
- RICH_DLLbeKaon - delta log-likelihood for a particle candidate to be kaon using information from RICH
- RICH_DLLbeElectron - delta log-likelihood for a particle candidate to be electron using information from RICH
- RICH_DLLbeMuon - delta log-likelihood for a particle candidate to be muon using information from RICH
- RICH_DLLbeProton - delta log-likelihood for a particle candidate to be proton using information from RICH
- MuonFlag - muon flag (is this track a muon), determined from muon stations
- MuonLooseFlag - muon flag (is this track a muon), determined from muon stations using looser criteria
- MuonLLbeBCK - log-likelihood for a particle candidate to be not muon using information from muon stations
- MuonLLbeMuon - log-likelihood for a particle candidate to be muon using information from muon stations
- DLLelectron - delta log-likelihood for a particle candidate to be electron using information from all subdetectors
- DLLmuon - delta log-likelihood for a particle candidate to be muon using information from all subdetectors
- DLLkaon - delta log-likelihood for a particle candidate to be kaon using information from all subdetectors
- DLLproton - delta log-likelihood for a particle candidate to be proton using information from all subdetectors
- GhostProbability - probability for a particle candidate to be ghost track. This variable is an output of classification model used in the tracking algorithm.

In the feature descriptions, delta log-likelihood means the difference between the log-likelihood of the mass hypothesis that a given track was left by some particle (for example, an electron) and the log-likelihood of the mass hypothesis that the track was left by a pion (so DLLpion = 0, which is why there are no such columns). This is done because most tracks (~80%) are left by pions, and in practice we need to discriminate other particles from pions. In other words, the null hypothesis is that the particle is a pion.
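To make the definition concrete, here is a minimal sketch with made-up log-likelihood values. The numbers and the helper name `delta_log_likelihood` are illustrative assumptions, not values taken from the dataset.

```python
def delta_log_likelihood(loglik_hypothesis, loglik_pion):
    """DLL = log L(hypothesis) - log L(pion)."""
    return loglik_hypothesis - loglik_pion


# Hypothetical log-likelihoods of one track under three mass hypotheses.
loglik = {"pion": -12.0, "kaon": -9.5, "electron": -20.0}

# Subtract the pion log-likelihood from every hypothesis.
dll = {name: delta_log_likelihood(value, loglik["pion"]) for name, value in loglik.items()}
print(dll)
```

In this toy case the kaon hypothesis gets a positive DLL (more likely than a pion), the electron hypothesis a negative one, and the pion entry is zero by construction.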

### Look at the labels set

The training data contains six classes. Each class corresponds to a particle type. Your task is to predict the type of a particle.

In [7]:
set(data.Label)

{'Electron', 'Ghost', 'Kaon', 'Muon', 'Pion', 'Proton'}

Convert the particle types into class numbers.

In [8]:
data['Class'] = utils.get_class_ids(data.Label.values)
set(data.Class)

{0, 1, 2, 3, 4, 5}
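The helper `utils.get_class_ids` is not shown in this notebook. A minimal sketch of what such a mapping could look like is given below. The alphabetical mapping is an assumption: the rows printed later in this notebook are consistent with Ghost=1, Kaon=2, Muon=3 and Pion=4, while Electron=0 and Proton=5 are guesses. The name `labels_to_class_ids` is hypothetical.

```python
import numpy as np

# Assumed mapping (alphabetical order of the six labels); only partly
# confirmed by the rows visible in this notebook.
LABEL_TO_CLASS = {"Electron": 0, "Ghost": 1, "Kaon": 2, "Muon": 3, "Pion": 4, "Proton": 5}


def labels_to_class_ids(labels):
    """Translate an iterable of label strings into an integer array."""
    return np.array([LABEL_TO_CLASS[label] for label in labels], dtype=int)


example_ids = labels_to_class_ids(["Muon", "Ghost", "Ghost", "Pion", "Proton"])
print(example_ids)
```

An unknown label raises a `KeyError`, which is usually preferable to silently assigning a wrong class.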

### Define training features

The following set of features describes particle responses in the detector systems:

![features](https://github.com/MehaRima/Advanced-Machine-Learning-Specialization/blob/master/Addressing%20Large%20Hadron%20Collider%20Challenges%20by%20Machine%20Learning/Week2/pic/features.jpeg?raw=1)

There are also several combined features. The full list follows.

In [9]:
features = list(set(data.columns) - {'Label', 'Class'})
features

['RICH_DLLbeProton',
 'MuonLooseFlag',
 'TrackDistanceToZ',
 'SpdE',
 'FlagPrs',
 'FlagHcal',
 'RICHpFlagKaon',
 'FlagMuon',
 'RICH_DLLbeMuon',
 'TrackQualitySubdetector1',
 'EcalShowerLongitudinalParameter',
 'GhostProbability',
 'RICHpFlagElectron',
 'RICHpFlagProton',
 'EcalDLLbeMuon',
 'RICHpFlagPion',
 'MuonLLbeMuon',
 'MuonLLbeBCK',
 'TrackPt',
 'RICH_DLLbeBCK',
 'PrsE',
 'HcalDLLbeElectron',
 'DLLelectron',
 'RICH_DLLbeElectron',
 'DLLkaon',
 'MuonFlag',
 'FlagEcal',
 'FlagBrem',
 'DLLmuon',
 'TrackQualityPerNDoF',
 'FlagSpd',
 'RICH_DLLbeKaon',
 'HcalDLLbeMuon',
 'RICHpFlagMuon',
 'EcalDLLbeElectron',
 'FlagRICH1',
 'TrackP',
 'FlagRICH2',
 'TrackNDoFSubdetector2',
 'TrackNDoFSubdetector1',
 'DLLproton',
 'Calo3dFitQuality',
 'Calo2dFitQuality',
 'TrackNDoF',
 'PrsDLLbeElectron',
 'BremDLLbeElectron',
 'TrackQualitySubdetector2',
 'EcalE',
 'HcalE']
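Because a Python `set` has no defined order and string hashing is randomized per process by default, the order of `features` can differ between sessions. That matters once the features are turned into a plain array for the scaler and the models, for example when weights saved in one session are reloaded in another. A possible deterministic variant, sketched on a short stand-in list of column names (the names `toy_columns` and `ordered_features` are illustrative):

```python
# Stand-in for `data.columns`; the real frame has many more columns.
toy_columns = ["TrackP", "Label", "DLLkaon", "Class", "TrackPt"]

excluded = {"Label", "Class"}

# Keep the original column order instead of relying on set iteration order.
ordered_features = [column for column in toy_columns if column not in excluded]
print(ordered_features)
```

With the real data, the same list comprehension over `data.columns` would give a feature order that is stable across runs.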

### Divide training data into 2 parts

In [10]:
training_data, validation_data = train_test_split(data, random_state=11, train_size=0.90, test_size=0.10)

In [11]:
len(training_data), len(validation_data)

(1080000, 120000)

In [12]:
training_data.head()

Unnamed: 0,TrackP,TrackNDoFSubdetector2,BremDLLbeElectron,MuonLooseFlag,FlagSpd,SpdE,EcalDLLbeElectron,DLLmuon,RICHpFlagElectron,EcalDLLbeMuon,TrackQualitySubdetector2,FlagPrs,DLLelectron,DLLkaon,EcalE,TrackQualityPerNDoF,DLLproton,PrsDLLbeElectron,FlagRICH1,MuonLLbeBCK,FlagHcal,EcalShowerLongitudinalParameter,Calo2dFitQuality,TrackPt,TrackDistanceToZ,RICHpFlagPion,HcalDLLbeElectron,Calo3dFitQuality,FlagEcal,MuonLLbeMuon,TrackNDoFSubdetector1,RICHpFlagProton,RICHpFlagKaon,GhostProbability,TrackQualitySubdetector1,Label,RICH_DLLbeBCK,FlagRICH2,FlagBrem,HcalDLLbeMuon,TrackNDoF,RICHpFlagMuon,RICH_DLLbeKaon,RICH_DLLbeElectron,HcalE,MuonFlag,FlagMuon,PrsE,RICH_DLLbeMuon,RICH_DLLbeProton,Class
968874,12192.330089,17.0,-0.170024,0.0,1.0,0.0,-1.831868,-0.996012,1.0,0.5017775,23.639733,0.0,-7.273712,-19.783081,-4.186518e-07,3.54997,-22.811781,-3.062694,1.0,-2.853407,1.0,-5079.33644,28.628712,477.872599,0.403936,1.0,-0.924518,1184.381821,1.0,-0.008407,1.0,0.0,1.0,0.301772,2.120835,Ghost,-18.459708,1.0,1.0,1.397398,27.0,1.0,-19.783108,-6.720408,1914.190913,1.0,1.0,1.967353e-06,-2.804208,-25.041909,1
912808,40379.488299,7.0,-999.0,0.0,0.0,-999.0,-999.0,-0.792678,1.0,-999.0,5.240217,0.0,-0.017085,-14.755505,-999.0,1.268794,-18.450505,-999.0,0.0,-999.0,0.0,-999.0,-999.0,762.761798,0.795385,1.0,-999.0,-999.0,0.0,-999.0,3.0,1.0,1.0,0.023019,0.364699,Pion,-12.998593,1.0,0.0,-999.0,13.0,1.0,-14.755493,-2.304893,-999.0,0.0,1.0,-999.0,-0.967993,-18.450494,4
30045,5294.729989,12.0,-0.381504,1.0,1.0,3.2,-0.589851,-2.054347,1.0,-2.14267,20.121625,1.0,-1.78397,10.255301,2258.582,1.229169,13.835801,2.359803,1.0,-0.928991,1.0,-232.8181,0.681833,278.262307,0.156057,1.0,-1.60722,421.908218,1.0,-0.268291,12.0,0.0,0.0,0.00619,13.794493,Muon,10.983106,0.0,1.0,0.94749,33.0,1.0,13.137106,-7.920494,2567.385966,1.0,1.0,75.69901,-1.713094,13.416006,3
313784,4645.819821,17.0,-0.417099,1.0,1.0,3.2,0.649028,-8.301831,1.0,7.650223e-07,29.58565,1.0,-5.962948,-22.006813,982.29,1.447289,-17.656713,-2.730665,1.0,-0.000606,1.0,510.621553,3.820987,296.548215,0.391684,1.0,0.114287,129.041032,1.0,-0.011906,11.0,0.0,0.0,0.024626,8.282235,Ghost,-16.707492,0.0,1.0,-2.394636,37.0,1.0,-22.006791,-24.447692,256.265351,0.0,1.0,1.919901,-7.184691,-17.656691,1
100864,3100.629887,17.0,-0.553602,0.0,1.0,3.2,-2.72342,-12.468398,1.0,-1.438773,10.914584,1.0,-11.139076,0.814791,872.42,0.64448,1.080291,-2.382539,1.0,-999.0,1.0,-999.0,4.541087,144.371037,0.058232,1.0,0.434926,-999.0,1.0,-999.0,9.0,0.0,0.0,0.002423,7.402662,Kaon,0.560707,0.0,1.0,-2.394626,36.0,1.0,0.807007,-50.220193,1.1e-05,0.0,1.0,-8.050605e-07,-9.541493,0.771707,2


In [13]:
training_data[features].describe()

Unnamed: 0,RICH_DLLbeProton,MuonLooseFlag,TrackDistanceToZ,SpdE,FlagPrs,FlagHcal,RICHpFlagKaon,FlagMuon,RICH_DLLbeMuon,TrackQualitySubdetector1,EcalShowerLongitudinalParameter,GhostProbability,RICHpFlagElectron,RICHpFlagProton,EcalDLLbeMuon,RICHpFlagPion,MuonLLbeMuon,MuonLLbeBCK,TrackPt,RICH_DLLbeBCK,PrsE,HcalDLLbeElectron,DLLelectron,RICH_DLLbeElectron,DLLkaon,MuonFlag,FlagEcal,FlagBrem,DLLmuon,TrackQualityPerNDoF,FlagSpd,RICH_DLLbeKaon,HcalDLLbeMuon,RICHpFlagMuon,EcalDLLbeElectron,FlagRICH1,TrackP,FlagRICH2,TrackNDoFSubdetector2,TrackNDoFSubdetector1,DLLproton,Calo3dFitQuality,Calo2dFitQuality,TrackNDoF,PrsDLLbeElectron,BremDLLbeElectron,TrackQualitySubdetector2,EcalE,HcalE
count,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0,1080000.0
mean,-52.29568,0.1908407,0.6790303,-144.5036,0.8510917,0.764663,0.4059546,0.8207509,-49.26364,6.851082,-425.7734,0.05513315,0.9514204,0.210612,-183.1246,0.8586537,-808.7499,-809.2076,928.2852,-52.10927,-134.0039,-235.3832,-14.03302,-51.09546,-14.52442,0.1662343,0.8175639,0.7918694,-12.68219,1.282943,0.8530556,-51.90322,-235.5069,0.9186778,-182.9481,0.832513,16132.64,0.5016713,14.75204,6.198936,-14.91176,1003.561,-167.4038,29.42315,-150.061,-207.6723,16.09497,2346.141,2898.032
std,215.3698,0.392964,1.303041,354.6216,0.3559983,0.4242095,0.4910761,0.3835609,214.8977,5.071566,5245.722,0.09242857,0.2149877,0.4077435,385.4955,0.3483786,392.069,391.1265,1605.476,215.2615,363.0068,423.4402,104.8488,215.2222,107.4946,0.372291,0.3862037,0.4059709,104.9807,0.627084,0.3540507,215.4569,423.3731,0.2733294,385.5809,0.3734103,27539.21,0.4999974,4.037619,3.115005,107.461,2408.174,396.9166,6.026995,355.0281,405.4207,7.769702,5709.488,7656.111
min,-999.0,0.0,-1.319307e-05,-999.0,0.0,0.0,0.0,0.0,-999.0,9.406193e-06,-386169.8,0.001456686,0.0,0.0,-999.0,0.0,-999.0,-999.0,1.447011,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,0.0,0.0,-999.0,0.04504852,0.0,-999.0,-999.0,0.0,-999.0,0.0,1115.38,0.0,1.0,1.0,-999.0,-999.0,-999.0,7.0,-999.0,-999.0,0.0003229625,-999.0,-999.0
25%,-16.18322,0.0,0.2824411,0.0,1.0,1.0,0.0,1.0,-4.503437,3.080789,-999.0,0.003852832,1.0,0.0,-3.367626,1.0,-999.0,-999.0,268.4007,-16.19943,9.201665e-06,-1.926173,-7.059118,-13.09732,-10.69791,0.0,1.0,1.0,-5.711785,0.8856753,1.0,-14.69696,-3.027614,1.0,-3.000639,1.0,4135.297,0.0,13.0,4.0,-12.0406,-999.0,0.09502276,26.0,-3.062679,-0.6250978,10.71093,-3.34899e-06,-1.446553e-05
50%,-2.947544e-06,0.0,0.513634,3.2,1.0,1.0,0.0,1.0,-0.4692121,5.845922,-295.5909,0.009668268,1.0,0.0,-1.956066,1.0,-999.0,-999.0,546.0906,-2.783857e-06,2.468449,-1.128672,-4.360961,-2.355985,2.762477e-06,0.0,1.0,1.0,-1.985259,1.106393,1.0,-1.248008e-06,-1.792448,1.0,-2.315904,1.0,8064.075,1.0,16.0,6.0,1.508806e-06,128.6276,0.8067113,30.0,-2.730653,-0.532902,15.24483,659.904,577.2029
75%,8.919904,0.0,0.6923893,3.2,1.0,1.0,1.0,1.0,1.300734,9.508951,155.0091,0.05510186,1.0,0.0,0.4339372,1.0,-999.0,-999.0,1110.806,8.770619,8.776679,-0.1173962,0.002593773,1.672999,8.46073,0.0,1.0,1.0,1.669644,1.448324,1.0,8.457116,0.1986138,1.0,0.5918626,1.0,17495.5,1.0,18.0,8.0,8.917628,1657.563,4.362203,34.0,-1.331085,-0.05933648,20.41093,3001.512,3047.605
max,146.2984,1.0,42.13652,3.2,1.0,1.0,1.0,1.0,142.8335,99.14788,729387.4,0.4000019,1.0,1.0,2.153017,1.0,4.311586e-05,0.02319657,427157.0,77.61361,280.58,3.127981,17.6927,186.1542,168.1698,1.0,1.0,1.0,14.71079,3.999967,1.0,158.8015,2.873421,1.0,4.341298,1.0,4673862.0,1.0,30.0,24.0,146.2984,9999.937,998.4028,52.0,3.46316,4.791513,104.8634,315680.5,798525.3


In [14]:
training_data[features].isnull().any().any()

False
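Although `isnull` reports no missing values, the `describe()` output above shows a minimum of -999 for many columns. A plausible reading, which is an assumption and not something the notebook states, is that -999 marks a missing subdetector response. The sketch below counts the fraction of such values per column on a tiny made-up frame; `SENTINEL` and `toy` are illustrative names.

```python
import pandas as pd

SENTINEL = -999.0  # assumed missing-value marker, inferred from the column minima

# Tiny made-up frame standing in for training_data[features].
toy = pd.DataFrame({
    "EcalE": [1531.5, -999.0, 4385.7, -999.0],
    "TrackP": [2738.5, 2161.4, 15277.7, 7563.7],
})

# Fraction of rows per column that hold the sentinel value.
sentinel_fraction = (toy == SENTINEL).mean()
print(sentinel_fraction)
```

Applied to the real data, `(training_data[features] == -999).mean()` would show which features are most affected, which may guide the preprocessing for the neural network.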

In [15]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
training_scale = scaler.fit_transform(training_data[features].values)

In [16]:
validation_scale = scaler.transform(validation_data[features].values)
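The two cells above fit the scaler on the training sample only and then reuse its statistics for the validation sample. A small sketch on synthetic data (the arrays and the name `toy_scaler` are made up for illustration) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(1000, 2))
valid = rng.normal(loc=50.0, scale=10.0, size=(200, 2))

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # mean and std are estimated on train only
valid_scaled = toy_scaler.transform(valid)      # the same statistics are reused, no refit

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(valid_scaled.mean(axis=0), valid_scaled.std(axis=0))
```

The training columns end up with zero mean and unit standard deviation, while the validation columns are only approximately standardized. That is expected: refitting on the validation sample would leak information about it into the preprocessing.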

# Sklearn classifier

In this step, your task is to train a **Sklearn** classifier that achieves a low **log loss** value.


TASK: tune the classifier parameters to achieve the lowest **log loss** value you can on the validation sample.



```
# Optional alternative: CatBoost gradient boosting (task_type='GPU' requires a GPU runtime)
from catboost import CatBoostClassifier

gb = CatBoostClassifier(learning_rate=0.05, iterations=1000, task_type='GPU')

gb.fit(training_data[features].values, training_data.Class.values)

```



In [18]:
from sklearn.ensemble import GradientBoostingClassifier

In [19]:
%%time 
gb = GradientBoostingClassifier(learning_rate=0.5, n_estimators=50, subsample=0.8, random_state=13,
                                min_samples_leaf=10, max_depth=3) 
# Alternative configuration (smaller learning rate, much deeper trees):
# gb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=50, subsample=0.8, random_state=13,
#                                 min_samples_leaf=10, max_depth=30)
gb.fit(training_scale, training_data.Class.values)



CPU times: user 53min 4s, sys: 1.56 s, total: 53min 6s
Wall time: 53min 7s


### Log loss on the validation sample

In [None]:
# predict each track
proba_gb = gb.predict_proba(validation_scale)

In [None]:
log_loss(validation_data.Class.values, proba_gb)
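For reference, multiclass log loss is the mean negative log of the probability assigned to the true class. The NumPy sketch below computes it by hand on three made-up predictions; `multiclass_log_loss` is a hypothetical helper, and for valid probability rows it should agree with `sklearn.metrics.log_loss`.

```python
import numpy as np


def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability of the true class."""
    proba = np.clip(proba, eps, 1 - eps)            # avoid log(0)
    picked = proba[np.arange(len(y_true)), y_true]  # probability given to the true class
    return float(-np.mean(np.log(picked)))


y_true = np.array([0, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

manual = multiclass_log_loss(y_true, proba)
print(manual)
```

Here the value is -(ln 0.7 + ln 0.6 + ln 0.5) / 3, roughly 0.52. A classifier that always predicts a uniform distribution over six classes would score ln 6, about 1.79, which is a useful baseline when reading the numbers in this notebook.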

# Keras neural network

In this step, your task is to train a **Keras** NN classifier that achieves a low **log loss** value.


TASK: tune the classifier parameters to achieve the lowest **log loss** value you can on the validation sample. Data preprocessing may help you improve your score.

In [None]:

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam
from keras.utils import np_utils
from keras.callbacks import EarlyStopping, ModelCheckpoint

In [None]:
def nn_model(input_dim):
    model = Sequential()
    model.add(Dense(100, input_dim=input_dim))
    model.add(Activation('relu'))
    
    model.add(Dense(50))
    model.add(Activation('relu'))

    model.add(Dense(6))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer=Adam())
    return model

In [None]:
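import os
os.makedirs('output', exist_ok=True)  # make sure the checkpoint directory exists before training
callback = [EarlyStopping(monitor='val_loss', min_delta=0, patience=4, verbose=0, mode='auto'),
            ModelCheckpoint('output/{val_loss:.4f}.hdf5', monitor='val_loss', verbose=1, save_best_only=True, mode='auto')]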

In [None]:
nn = nn_model(len(features))
nn.fit(training_scale, np_utils.to_categorical(training_data.Class.values),
       validation_data=(validation_scale, np_utils.to_categorical(validation_data.Class.values)),
       epochs=50, verbose=1, batch_size=256, callbacks=callback)
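The `categorical_crossentropy` loss expects one-hot encoded targets, which is why the class ids are passed through `np_utils.to_categorical` above. A NumPy-only sketch of that encoding follows; the helper `one_hot` is hypothetical and not part of Keras.

```python
import numpy as np


def one_hot(class_ids, n_classes):
    """Return one row per sample, with a single 1.0 in the class column."""
    return np.eye(n_classes, dtype="float32")[class_ids]


encoded = one_hot(np.array([3, 1, 5]), n_classes=6)
print(encoded)
```

Taking `argmax` along the last axis recovers the original class ids, which is also how a predicted class can be read off the softmax output of the network.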

In [None]:
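# The file name encodes the best val_loss of a particular run; adjust it to match your own checkpoint
nn.load_weights('output/0.5680.hdf5')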

### Log loss on the validation sample

In [None]:
# predict each track; with a softmax output layer, predict() already returns class probabilities
proba_nn = nn.predict(validation_scale)

In [None]:
log_loss(validation_data.Class.values, proba_nn)

# Quality metrics

Plot ROC curves and the dependence of signal efficiency on particle momentum and transverse momentum.

In [None]:
proba = proba_nn

In [None]:
utils.plot_roc_curves(proba, validation_data.Class.values)
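The implementation of `utils.plot_roc_curves` is not shown here. One common way to get per-class ROC information from multiclass probabilities is the one-vs-rest scheme, sketched below on made-up data. The helper `one_vs_rest_auc` and the toy arrays are assumptions for illustration, not the notebook's own method.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc


def one_vs_rest_auc(proba, class_ids):
    """Return {class_id: AUC}, treating each class in turn as the signal."""
    scores = {}
    for class_id in range(proba.shape[1]):
        is_signal = (class_ids == class_id).astype(int)
        fpr, tpr, _ = roc_curve(is_signal, proba[:, class_id])
        scores[class_id] = auc(fpr, tpr)
    return scores


# Made-up predictions in which every class is perfectly separated.
toy_labels = np.array([0, 0, 1, 1, 2, 2])
toy_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.5, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
])

toy_auc = one_vs_rest_auc(toy_proba, toy_labels)
print(toy_auc)
```

On the real validation sample the same function could be called as `one_vs_rest_auc(proba, validation_data.Class.values)`; the `fpr` and `tpr` arrays inside the loop are what one would pass to `plt.plot` to draw the curves.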

In [None]:
utils.plot_signal_efficiency_on_p(proba, validation_data.Class.values, validation_data.TrackP.values, 60, 50)
plt.show()

In [None]:
utils.plot_signal_efficiency_on_pt(proba, validation_data.Class.values, validation_data.TrackPt.values, 60, 50)
plt.show()

# Prepare submission

Select your best classifier and prepare the submission file.

In [None]:
test = pandas.read_csv('test.csv.gz')

In [None]:
best_model = nn

In [None]:
test_scale = scaler.transform(test[features].values)

In [None]:
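# predict test sample
# (newer Keras releases dropped predict_proba from Sequential; for a Keras model use best_model.predict there)
submit_proba = best_model.predict_proba(test_scale)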
submit_ids = test.ID

In [None]:
from IPython.display import FileLink
utils.create_solution(submit_ids, submit_proba, filename='submission_file.csv.gz')
FileLink('submission_file.csv.gz')  # show a download link for the generated file