
# TB: CC62 Data Mining
## Kaggle: ATLAS Experiment Higgs Boson

## Objectives:

- Detect the Higgs boson in simulated data, in order to reproduce the behavior of the ATLAS experiment.
- Treat the task as a binary classification (event detection) problem.
- Using simulated data whose features characterize events detected by ATLAS, classify each event as **"tau tau decay of a Higgs boson"** or **"background"**.

## Understanding the Data


The ATLAS experiment observed a signal of the Higgs boson **decaying into two tau particles**, but this decay is a small signal buried in **background** noise. The goal of the challenge is to explore the potential of advanced machine learning methods to improve the discovery significance of the experiment.
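
The Kaggle challenge ranked submissions by the approximate median significance (AMS), which is one concrete way to quantify "discovery significance". The sketch below assumes `s` and `b` are the summed event weights of the selected true signal and background events; the function name `ams` and the parameter name `b_reg` are illustrative and do not appear elsewhere in this notebook.

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate median significance.

    s: summed weight of selected signal events (assumed input).
    b: summed weight of selected background events (assumed input).
    b_reg: regularization term; the challenge used 10.
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

# Example: more selected signal at the same background gives a higher score.
low = ams(10.0, 100.0)
high = ams(20.0, 100.0)
```

For small `s` relative to `b`, the value is close to `s / sqrt(b + b_reg)`, which is why the `Weight` column matters when evaluating a classifier on this data.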

In [1]:
import pandas as pd
import numpy as np

In [4]:
# Read file
dfHiggsBoson = pd.read_csv("training.csv")
dfHiggsBoson.head()

Unnamed: 0,EventId,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,DER_sum_pt,DER_pt_ratio_lep_tau,DER_met_phi_centrality,DER_lep_eta_centrality,PRI_tau_pt,PRI_tau_eta,PRI_tau_phi,PRI_lep_pt,PRI_lep_eta,PRI_lep_phi,PRI_met,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt,Weight,Label
0,100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733,2.0,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,0.002653,s
1,100001,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,2.078,125.157,0.879,1.414,-999.0,42.014,2.039,-3.011,36.918,0.501,0.103,44.704,-1.916,164.546,1.0,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226,2.233584,b
2,100002,-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,9.336,197.814,3.776,1.414,-999.0,32.154,-0.705,-2.093,121.409,-0.953,1.052,54.283,-2.186,260.414,1.0,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251,2.347389,b
3,100003,143.905,81.417,80.943,0.414,-999.0,-999.0,-999.0,3.31,0.414,75.968,2.354,-1.285,-999.0,22.647,-1.655,0.01,53.321,-0.522,-3.1,31.082,0.06,86.062,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-0.0,5.446378,b
4,100004,175.864,16.915,134.805,16.405,-999.0,-999.0,-999.0,3.891,16.405,57.983,1.056,-1.385,-999.0,28.209,-2.197,-2.231,29.774,0.798,1.569,2.723,-0.871,53.131,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,6.245333,b


In [5]:
def analizarColumns(dataframe):
  """Print the position, name, dtype and value range of every column."""
  def auxRango(arr):
    # Only numeric columns have a meaningful min/max; others show '-'.
    if pd.api.types.is_numeric_dtype(arr):
      return np.min(arr), np.max(arr)
    return '-', '-'
  print("%3s | %27s | %7s | %25s"%("#","Name","Type", "Range"))
  # Iterating over a DataFrame yields its column names.
  for i, columna in enumerate(dataframe):
    cmin, cmax = auxRango(dataframe[columna])
    print("%3s | %27s | %7s | %20s - %s"%( i,
        columna, dataframe[columna].dtype, cmin, cmax
        ))

In [6]:
analizarColumns(dfHiggsBoson)

  # |                        Name |    Type |                     Range
  0 |                     EventId |   int64 |               100000 - 137953
  1 |                DER_mass_MMC | float64 |               -999.0 - 858.887
  2 | DER_mass_transverse_met_lep | float64 |                0.002 - 690.075
  3 |                DER_mass_vis | float64 |                 7.33 - 674.553
  4 |                    DER_pt_h | float64 |                  0.0 - 2834.9990000000003
  5 |        DER_deltaeta_jet_jet | float64 |               -999.0 - 8.459
  6 |            DER_mass_jet_jet | float64 |               -999.0 - 3624.1540000000005
  7 |         DER_prodeta_jet_jet | float64 |               -999.0 - 14.717
  8 |          DER_deltar_tau_lep | float64 |                0.228 - 5.684
  9 |                  DER_pt_tot | float64 |                  0.0 - 2834.9990000000003
 10 |                  DER_sum_pt | float64 |   46.306000000000004 - 1419.965
 11 |        DER_pt_ratio_lep_tau | float64 |        

The variables are described on the CERN Open Data page. Among them, the following are worth noting:

- **EventId**: A unique integer identifier of the event.
- **DER_mass_MMC**: The estimated mass of the Higgs boson candidate.
- **PRI_jet_num**: The number of jets (integer with value of 0, 1, 2 or 3; possible larger values have been capped at 3). **Some variables are *undefined* if the value of PRI_jet_num is below 1.**
- **Weight**: The event weight $w_i$.
- **Label**: The event label (string) $y_i \in \{s, b\}$ (s for signal, b for background).
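
As a small illustration of how `Label` and `Weight` might be prepared for a binary classifier, the sketch below encodes the label as 0/1 and sums the weights per class. It uses a toy frame built from the five rows of the `head()` output so that it runs on its own; the column name `target` is an assumption introduced here, not part of the dataset.

```python
import pandas as pd

# Toy rows copied from the head() output; the notebook would use dfHiggsBoson.
toy = pd.DataFrame({
    "Weight": [0.002653, 2.233584, 2.347389, 5.446378, 6.245333],
    "Label": ["s", "b", "b", "b", "b"],
})

# Hypothetical binary target: 1 for signal ("s"), 0 for background ("b").
toy["target"] = (toy["Label"] == "s").astype(int)

# Total event weight per class, useful for checking class balance.
weight_by_label = toy.groupby("Label")["Weight"].sum()
print(weight_by_label)
```

In these five rows, the single signal event carries a much smaller weight than the background events, so counting rows and summing weights can give quite different pictures of class balance.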

In [11]:
dfHiggsBoson[dfHiggsBoson['PRI_jet_num'] == 0]

Unnamed: 0,EventId,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,DER_sum_pt,DER_pt_ratio_lep_tau,DER_met_phi_centrality,DER_lep_eta_centrality,PRI_tau_pt,PRI_tau_eta,PRI_tau_phi,PRI_lep_pt,PRI_lep_eta,PRI_lep_phi,PRI_met,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt,Weight,Label
3,100003,143.905,81.417,80.943,0.414,-999.0,-999.0,-999.0,3.310,0.414,75.968,2.354,-1.285,-999.0,22.647,-1.655,0.010,53.321,-0.522,-3.100,31.082,0.060,86.062,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-0.0,5.446378,b
4,100004,175.864,16.915,134.805,16.405,-999.0,-999.0,-999.0,3.891,16.405,57.983,1.056,-1.385,-999.0,28.209,-2.197,-2.231,29.774,0.798,1.569,2.723,-0.871,53.131,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,6.245333,b
8,100008,105.594,50.559,100.989,4.288,-999.0,-999.0,-999.0,2.904,4.288,65.333,0.675,-1.366,-999.0,39.008,2.433,-2.532,26.325,0.210,1.884,37.791,0.024,129.804,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,5.296003,b
10,100010,-999.000,86.240,79.692,27.201,-999.0,-999.0,-999.0,2.338,27.201,81.734,1.750,-1.412,-999.0,29.718,-0.866,2.878,52.016,0.126,-1.288,51.276,0.688,250.178,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,2.299504,b
13,100013,82.488,31.663,64.128,8.232,-999.0,-999.0,-999.0,2.823,8.232,58.649,1.303,-1.414,-999.0,25.470,-0.654,-2.990,33.179,-1.665,-0.354,12.439,1.433,163.420,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,2.183892,b
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37934,137934,166.362,82.627,141.396,5.053,-999.0,-999.0,-999.0,3.140,5.053,104.638,1.367,-1.404,-999.0,44.204,-0.439,1.144,60.433,1.277,-2.509,34.543,-0.250,101.575,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,5.513306,b
37937,137937,67.787,29.559,54.955,22.727,-999.0,-999.0,-999.0,2.104,22.727,61.849,0.869,-1.414,-999.0,33.097,-0.284,0.950,28.752,0.224,-1.091,9.929,3.063,104.137,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-0.0,1.681611,b
37942,137942,-999.000,69.684,38.274,19.379,-999.0,-999.0,-999.0,1.649,19.379,52.089,1.046,-1.414,-999.0,25.457,2.396,-1.881,26.632,2.303,2.756,54.719,0.457,75.789,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,4.230124,b
37944,137944,115.145,10.467,78.348,5.415,-999.0,-999.0,-999.0,3.273,5.415,68.953,0.622,-1.410,-999.0,42.517,-0.836,2.993,26.436,-1.983,-0.073,18.781,-0.548,82.035,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,0.018636,s
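
In the rows above, every jet-related column holds -999.0 when `PRI_jet_num` is 0, which suggests that -999.0 is the placeholder for undefined values. One way to make that explicit, sketched here on a toy frame with a few columns taken from the `head()` output, is to replace the placeholder with `NaN` and count missing entries per jet count. The names `cleaned` and `missing_by_jets` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy subset of columns from the head() output; the notebook would use dfHiggsBoson.
toy = pd.DataFrame({
    "DER_mass_MMC": [138.47, 160.937, -999.0, 143.905, 175.864],
    "PRI_jet_num": [2.0, 1.0, 1.0, 0.0, 0.0],
    "PRI_jet_leading_pt": [67.435, 46.226, 44.251, -999.0, -999.0],
    "PRI_jet_subleading_pt": [46.062, -999.0, -999.0, -999.0, -999.0],
})

# Assumption: -999.0 always means "undefined", never a real measurement.
cleaned = toy.replace(-999.0, np.nan)

# Count undefined entries in each column, grouped by the number of jets.
missing_by_jets = (
    cleaned.drop(columns="PRI_jet_num")
    .isna()
    .groupby(cleaned["PRI_jet_num"])
    .sum()
)
print(missing_by_jets)
```

Note that `DER_mass_MMC` can be undefined even when jets are present, so the jet count alone does not explain every -999.0 in the data.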


### Normalize the data:

In [None]:
# TODO: Normalize
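
The normalization step is still open in this notebook. One possible approach, shown only as a sketch, is z-score standardization that treats -999.0 as missing so the placeholder does not distort the mean and standard deviation. The helper name `standardize` and its `missing` parameter are hypothetical, and other scalings (for example min-max) may suit the final model better.

```python
import numpy as np
import pandas as pd

def standardize(df, columns, missing=-999.0):
    """Return a copy of df with the given columns z-scored.

    Values equal to `missing` become NaN first (assumed placeholder),
    so they are ignored when computing the mean and standard deviation.
    """
    out = df.copy()
    for col in columns:
        values = out[col].replace(missing, np.nan)
        # pandas skips NaN in mean() and std(); std() uses ddof=1 by default.
        out[col] = (values - values.mean()) / values.std()
    return out

# Toy example with one undefined entry.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, -999.0]})
scaled = standardize(toy, ["x"])
```

A constant column has zero standard deviation and would come out as NaN, so such columns would need separate handling.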

In [2]:
import math

def entropia(columna):
  """Shannon entropy (in bits) of the values in a column."""
  n = len(columna)
  valores = pd.unique(columna)
  ent = 0
  for valor in valores:
    pi = len(columna[columna == valor])/n # P(i) = n(i) / n(Omega)
    ent += -1*pi*math.log2(pi)
  return ent
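
For comparison, the same Shannon entropy can be computed without an explicit loop. The sketch below is an equivalent formulation under the assumption that the input is a one-dimensional sequence of labels; `entropy_bits` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

def entropy_bits(column):
    """Shannon entropy in bits, computed from relative value frequencies."""
    # value_counts(normalize=True) gives P(i) for every distinct value.
    probs = pd.Series(column).value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

# A balanced two-class column has entropy 1 bit; a constant column has 0.
balanced = entropy_bits(["s", "b", "s", "b"])
constant = entropy_bits(["b", "b", "b", "b"])
skewed = entropy_bits(["s", "b", "b", "b"])
```

Applied to the `Label` column, a value close to 1 would indicate roughly balanced classes, while a value near 0 would indicate that one class dominates.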

## References

- Udacity (2016). *The Kaggle Challenge. Higgs Boson*. Retrieved from: https://www.youtube.com/watch?v=Sombn6OSvZU
- Nguyen, D., Mejia Breton, K. & Cruz, I. (2016). *Kaggle Competition. Higgs Boson Machine Learning Challenge*. NYC Data Science Academy. Retrieved from: https://www.youtube.com/watch?v=Rn3dwmhVH3o