# [HIGGS UCI](https://archive.ics.uci.edu/ml/datasets/HIGGS#) 

In this notebook we will work with a real dataset from high energy physics (HEP). The goal is to increase the accuracy on the test subset compared to the baseline.

### Introduction

The field of high energy physics is devoted to the study of the elementary constituents of matter. By investigating the structure of matter and the laws that govern its interactions, this field strives to discover the fundamental
properties of the physical universe. The primary tools of experimental high energy physicists are modern accelerators, which collide protons and/or antiprotons to create exotic particles that occur only at extremely high
energy densities. Collisions at high energy particle colliders are a fruitful source of exotic particle discoveries. Observing these particles and measuring their properties may yield critical insights about the
very nature of matter. Finding these rare particles requires solving difficult signal-versus-background classification problems.  Such discoveries require powerful statistical methods, and machine learning tools play
a critical role. Given the limited quantity and expensive nature of the data, improvements in analytical tools
directly boost particle discovery potential.

### Data Set Information:

The data has been produced using Monte Carlo simulations. 

The first column is the class label:
- 1 for signal, 
- 0 for background.

The other 28 columns are features (21 low-level features followed by 7 high-level features):
- lepton pT, 
- lepton eta, 
- lepton phi, 
- missing energy magnitude, 
- missing energy phi, 
- jet 1 pt, 
- jet 1 eta, 
- jet 1 phi, 
- jet 1 b-tag, 
- jet 2 pt, 
- jet 2 eta, 
- jet 2 phi, 
- jet 2 b-tag, 
- jet 3 pt, 
- jet 3 eta, 
- jet 3 phi, 
- jet 3 b-tag, 
- jet 4 pt, 
- jet 4 eta, 
- jet 4 phi, 
- jet 4 b-tag, 


- m_jj, 
- m_jjj, 
- m_lv, 
- m_jlv, 
- m_bb, 
- m_wbb, 
- m_wwbb.

The first 21 features (columns 1-21) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes.
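
To make "functions of the first 21 features" concrete, here is a minimal sketch of one such derived quantity: the invariant mass of a two-object system computed from each object's (pT, eta, phi), treating both objects as massless. The function name is hypothetical. The sketch is not expected to reproduce the `m_jj` column exactly, because the jet pairing and any rescaling applied to the stored values are not described here.

```python
import math

def invariant_mass_massless(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two massless objects given (pT, eta, phi) for each."""
    # m^2 = 2 * pT1 * pT2 * (cosh(eta1 - eta2) - cos(phi1 - phi2))
    m_squared = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    # Guard against tiny negative values caused by floating-point rounding
    return math.sqrt(max(m_squared, 0.0))

# Illustrative input: two back-to-back central objects with equal pT
mass = invariant_mass_massless(50.0, 0.0, 0.0, 50.0, 0.0, math.pi)
print(mass)
```

The formula depends only on the difference in eta and the difference in phi, which is why such combinations can carry information that no single low-level column shows on its own.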

The last 500,000 examples are used as a test set.

Benchmark results using boosted decision trees from a standard physics package and 5-layer neural networks are presented in the original paper:

**Baldi, P., P. Sadowski, and D. Whiteson.** “[Searching for Exotic Particles in High-energy Physics with Deep Learning](https://arxiv.org/pdf/1402.4735.pdf).” Nature Communications 5 (July 2, 2014).


In [None]:
# import packages

import pandas as pd
import numpy as np
import xgboost as xgb
import wget
from sklearn.metrics import accuracy_score

# run `pip install <package>` for any package you don't have

In [None]:
# download the HIGGS dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"  
wget.download(url, 'HIGGS.csv.gz') 

In [None]:
# open the dataset 

data = pd.read_csv('HIGGS.csv.gz', names=list(range(29)))

In [None]:
print('Number of rows = ', data.shape[0])
print('Number of columns = ', data.shape[1])
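
The columns are loaded with integer names 0 to 28. As an optional convenience, the sketch below builds readable names in the same order as the feature list in the data set description. The identifiers `LOW_LEVEL`, `HIGH_LEVEL`, and `COLUMN_NAMES`, and the snake_case spellings, are assumptions made for illustration; they are not part of the dataset.

```python
# Column names in file order: the label first, then the 28 features.
LOW_LEVEL = ["lepton_pt", "lepton_eta", "lepton_phi",
             "missing_energy_magnitude", "missing_energy_phi"]
for i in range(1, 5):
    # Each of the four jets contributes pt, eta, phi, and a b-tag value
    LOW_LEVEL += [f"jet{i}_pt", f"jet{i}_eta", f"jet{i}_phi", f"jet{i}_btag"]

HIGH_LEVEL = ["m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"]

COLUMN_NAMES = ["label"] + LOW_LEVEL + HIGH_LEVEL
print(len(COLUMN_NAMES))
```

Assigning `data.columns = COLUMN_NAMES` would attach these names to the loaded frame. The rest of the notebook slices by position, so it works the same with or without the renaming.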

We will split the data into training and test subsets. Our goal is to predict the class label stored in the first column (column 0).

In [None]:
# convert to a NumPy array once; the last 500,000 rows form the test set
values = data.to_numpy()
X_train, y_train = values[:-500000, 1:], values[:-500000, 0]
X_test, y_test = values[-500000:, 1:], values[-500000:, 0]
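
If hyperparameters are compared on the test rows themselves, the reported test accuracy becomes optimistic. One common way around this, sketched below, is to hold out part of the training rows as a validation subset. The helper `holdout_split` and its default values are hypothetical, and the tiny arrays are synthetic stand-ins for `X_train` and `y_train`.

```python
import numpy as np

def holdout_split(X, y, val_fraction=0.1, seed=0):
    """Shuffle row indices and split (X, y) into fit and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, fit_idx = order[:n_val], order[n_val:]
    return X[fit_idx], y[fit_idx], X[val_idx], y[val_idx]

# Synthetic stand-in data: 20 rows, 2 features, alternating labels
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.arange(20) % 2
X_fit, y_fit, X_val, y_val = holdout_split(X_demo, y_demo, val_fraction=0.25)
print(X_fit.shape, X_val.shape)
```

With the real data the call would be `holdout_split(X_train, y_train)`, leaving `X_test` and `y_test` untouched until the final evaluation.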

### Baseline

Using the default hyperparameter values of the XGBoost classifier, we will train the model and check its accuracy on the test subset.

In [None]:
# init the model

clf = xgb.XGBClassifier()

In [None]:
# fit the model on the train subset

clf.fit(X_train, y_train)

In [None]:
# make predictions on the test subset

pred = clf.predict(X_test)

In [None]:
print('Accuracy on the test is ', round(accuracy_score(y_test, pred), 2))
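
An accuracy figure is easier to interpret next to a trivial reference. The sketch below computes the accuracy of always predicting the most frequent training label; the helper name `majority_class_accuracy` is hypothetical and the arrays are synthetic stand-ins for `y_train` and `y_test`.

```python
import numpy as np

def majority_class_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return float(np.mean(y_test == majority))

# Synthetic stand-in labels: the majority training label is 1.0
y_train_demo = np.array([1.0, 1.0, 0.0, 1.0])
y_test_demo = np.array([1.0, 0.0, 0.0, 1.0])
reference = majority_class_accuracy(y_train_demo, y_test_demo)
print(reference)
```

Any trained model should clear this reference comfortably; if it does not, the pipeline most likely contains a bug.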

**Task:** by applying any model you wish and tuning its hyperparameters, increase the accuracy on the test subset above the baseline.
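
One possible starting point, offered as a sketch and not as the intended solution, is a small grid search scored on a validation subset carved out of the training rows. The helper `select_by_validation` and its argument names are hypothetical. It accepts any model factory whose instances provide `fit` and `predict`, and it expects `predict` to return a NumPy array.

```python
from itertools import product

def select_by_validation(make_model, grid, X_fit, y_fit, X_val, y_val):
    """Fit one model per parameter combination and keep the best one,
    judged by accuracy on the validation subset."""
    best_params, best_score = None, -1.0
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        model = make_model(**params)
        model.fit(X_fit, y_fit)
        # Fraction of validation rows predicted correctly
        score = float((model.predict(X_val) == y_val).mean())
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With the objects from this notebook, a call could look like `select_by_validation(xgb.XGBClassifier, {"max_depth": [4, 8], "learning_rate": [0.1, 0.3]}, X_fit, y_fit, X_val, y_val)`, where `X_fit`, `y_fit`, `X_val`, and `y_val` are assumed to come from splitting `X_train` and `y_train`. The grid values are illustrative only. Once the parameters are chosen, the model is refit on all training rows and scored once on `X_test`.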