# To get started with Data Understanding:
>Kaggle Link:
https://www.kaggle.com/competitions/higgs-boson/data
>
>Journal article about the dataset (*it documents the data well*):
https://proceedings.mlr.press/v42/cowa14.pdf
>
>Official dataset (*if you don't want to use Kaggle*):
https://opendata.cern.ch/record/328

Type this into your notebook:
>>`!curl -O https://opendata.cern.ch/record/328/files/atlas-higgs-challenge-2014-v2.csv.gz`
>>
>>`!ls`
>>
>>`!gunzip 'atlas-higgs-challenge-2014-v2.csv.gz'`
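
If shell commands are not available in your environment, the decompression step can also be done from Python. The following is a minimal sketch using only the standard library; `gunzip_file` is a hypothetical helper name, not something from the dataset's documentation.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, keep_original=False):
    """Decompress `gz_path` next to itself and return the output path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # "data.csv.gz" -> "data.csv"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks, never load it all at once
    if not keep_original:
        gz_path.unlink()  # mimic gunzip, which removes the .gz file
    return out_path
```

After the download, `gunzip_file('atlas-higgs-challenge-2014-v2.csv.gz')` should leave the same `.csv` file that `!gunzip` produces. Alternatively, `pd.read_csv` can read the `.csv.gz` file directly, since it infers gzip compression from the file extension.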


In [1]:
!curl -O -k https://opendata.cern.ch/record/328/files/atlas-higgs-challenge-2014-v2.csv.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 62.5M  100 62.5M    0     0  5313k      0  0:00:12  0:00:12 --:--:-- 8645k


In [2]:
!ls # sanity check.

atlas-higgs-challenge-2014-v2.csv.gz  sample_data


In [3]:
!gunzip 'atlas-higgs-challenge-2014-v2.csv.gz'

In [4]:
!ls # sanity check.

atlas-higgs-challenge-2014-v2.csv  sample_data


In [5]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

In [19]:
df = pd.read_csv('atlas-higgs-challenge-2014-v2.csv')
df.drop(['KaggleSet', 'KaggleWeight', 'EventId', 'Weight'], axis=1, inplace=True) # drop columns that we do not need to feed into our model.

In [20]:
len(df.columns) # 30 input variables plus 'Label' (the 31st), which tells us which rows are sig/bkg.

31

### I looped through the dataframe columns to get:
> #### `VarNames`, which has every column: all the **PRI** and **DER** variables, plus `Label`.
> #### `RawNames`, which has all the **PRI** (primitive/raw) variables.
> #### `FeatureNames`, which has all the **DER** (derived/feature) variables.

In [21]:
VarNames=list(df.columns) # All the variables in the dataset.
RawNames=list() # Variables classified as 'RAW'.
FeatureNames=list() # Variables derived from 'RAW'.
for col in df.columns:
  if col == 'Label':
    continue # skipping sig/bkg.
  if 'PRI' in col:
    RawNames.append(col)
  elif 'DER' in col:
    FeatureNames.append(col)
# Note: After reading the article, "The subject of the Challenge
# was to study the H to tau tau channel."
# So maybe we only test ['DER_pt_h','DER_deltar_tau_lep','DER_pt_ratio_lep_tau','PRI_tau_pt','PRI_tau_eta','PRI_tau_phi']?

In [23]:
# So now we have lists that separate the raw variables from the feature variables.
print(VarNames)
print(RawNames)
print(FeatureNames)

['DER_mass_MMC', 'DER_mass_transverse_met_lep', 'DER_mass_vis', 'DER_pt_h', 'DER_deltaeta_jet_jet', 'DER_mass_jet_jet', 'DER_prodeta_jet_jet', 'DER_deltar_tau_lep', 'DER_pt_tot', 'DER_sum_pt', 'DER_pt_ratio_lep_tau', 'DER_met_phi_centrality', 'DER_lep_eta_centrality', 'PRI_tau_pt', 'PRI_tau_eta', 'PRI_tau_phi', 'PRI_lep_pt', 'PRI_lep_eta', 'PRI_lep_phi', 'PRI_met', 'PRI_met_phi', 'PRI_met_sumet', 'PRI_jet_num', 'PRI_jet_leading_pt', 'PRI_jet_leading_eta', 'PRI_jet_leading_phi', 'PRI_jet_subleading_pt', 'PRI_jet_subleading_eta', 'PRI_jet_subleading_phi', 'PRI_jet_all_pt', 'Label']
['PRI_tau_pt', 'PRI_tau_eta', 'PRI_tau_phi', 'PRI_lep_pt', 'PRI_lep_eta', 'PRI_lep_phi', 'PRI_met', 'PRI_met_phi', 'PRI_met_sumet', 'PRI_jet_num', 'PRI_jet_leading_pt', 'PRI_jet_leading_eta', 'PRI_jet_leading_phi', 'PRI_jet_subleading_pt', 'PRI_jet_subleading_eta', 'PRI_jet_subleading_phi', 'PRI_jet_all_pt']
['DER_mass_MMC', 'DER_mass_transverse_met_lep', 'DER_mass_vis', 'DER_pt_h', 'DER_deltaeta_jet_jet', 'DE
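
The same split by prefix can be written more compactly with list comprehensions. This is only a sketch on a shortened, hypothetical column list standing in for `df.columns`; `split_by_prefix` is an illustrative name. It uses `startswith` rather than `in`, which assumes the prefix always appears at the start of the column name, as it does in the output above.

```python
# Hypothetical, shortened column list standing in for df.columns.
columns = ["DER_mass_MMC", "DER_pt_h", "PRI_tau_pt", "PRI_jet_num", "Label"]


def split_by_prefix(columns, label="Label"):
    """Return (inputs, raw, derived) lists of column names."""
    inputs = [c for c in columns if c != label]            # everything except the label
    raw = [c for c in inputs if c.startswith("PRI_")]      # primitive (raw) variables
    derived = [c for c in inputs if c.startswith("DER_")]  # derived (feature) variables
    return inputs, raw, derived


inputs, raw, derived = split_by_prefix(columns)
print(raw)      # ['PRI_tau_pt', 'PRI_jet_num']
print(derived)  # ['DER_mass_MMC', 'DER_pt_h']
```

Returning the label-free `inputs` list separately avoids relying on `Label` being the last column when selecting model inputs.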

#### Checking the unique values of `Label` shows how signal and background rows are marked, which is what we will filter on.

In [24]:
df.Label.unique()

array(['s', 'b'], dtype=object)
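
Beyond the unique values, it can help to see how many rows fall into each class. Here is a minimal sketch using `value_counts` on a toy frame; the labels below are made up for illustration and say nothing about the real class balance.

```python
import pandas as pd

# Toy labels standing in for df.Label; these counts are made up.
toy = pd.DataFrame({"Label": ["s", "b", "b", "s", "b"]})

counts = toy["Label"].value_counts()                    # rows per class
fractions = toy["Label"].value_counts(normalize=True)   # share of rows per class
print(counts.to_dict())     # {'b': 3, 's': 2}
print(fractions.to_dict())  # {'b': 0.6, 's': 0.4}
```

On the real data the equivalent call would be `df.Label.value_counts(normalize=True)`.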

### Import necessary ML libraries.

In [37]:
import tensorflow as tf
from IPython.display import Image
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Sequential #keras.Sequential([])
from tensorflow.keras.layers import Dense #layers.Dense()
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

In [38]:
n = len(df)
print("sample size: ", n)
print("80% of total data: ", int(n*.8))

sample size:  818238
80% of total data:  654590


In [39]:
#n = len(df) # sample size.
#Train_sample=df[:int(n * .8)] # training sample size.
#Test_sample=df[int(n * .8):] # testing sample size.

#X_Train=np.array(Train_sample[VarNames[:-1]]) # EXPLANATORY training sample of every feature (except Label).
#y_Train=np.array(Train_sample.Label=='s') # RESPONSE training sample of signal.

#X_Test=np.array(Test_sample[VarNames[:-1]]) # EXPLANATORY testing sample of every feature (except Label).
#y_Test=np.array(Test_sample.Label=='s') # RESPONSE testing sample of signal.

In [40]:
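# A shuffled split might be better: it makes the train and test sets simple random samples (SRS)
# instead of the first 80% and last 20% of the rows.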
X = df[VarNames[:-1]].values
y = (df.Label == 's').astype(int).values

X_Train, X_Test, y_Train, y_Test = train_test_split(
    X, y, test_size=0.2, random_state=32, shuffle=True
)
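
If the signal and background classes turn out to be imbalanced, one option is to pass `stratify` to `train_test_split` so that both splits keep the same signal fraction. This is a sketch on toy arrays, not something the notebook above does; `X_toy` and `y_toy` are made-up stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 30 signal (1) and 70 background (0).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=32, shuffle=True, stratify=y_toy
)
print(len(y_tr), len(y_te))  # 80 20
print(y_tr.mean(), y_te.mean())  # signal fraction in each split
```

With a plain shuffled split the two fractions are only approximately equal; stratifying makes them match the overall fraction as closely as the row counts allow.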