# Prepare Data for Higgs Dataset

## Install requirements
We will need pandas for the data preparation. 


## Prepare data

### Download and Store Data

To run the examples, we first download the HIGGS dataset from the UCI Machine Learning Repository. We will download, uncompress, and store the dataset under:

```
/tmp/nvflare/dataset/input/
```

You can download the file directly with either wget or curl, whichever you have installed. Here we use curl. The zip file is larger than 2.6 GB, so the download will take a while.

In [None]:
!mkdir -p /tmp/nvflare/dataset/input

!curl -o /tmp/nvflare/dataset/input/higgs.zip https://archive.ics.uci.edu/static/public/280/higgs.zip

Alternatively, download with wget: `wget -P /tmp/nvflare/dataset/input/ https://archive.ics.uci.edu/static/public/280/higgs.zip`

Next, we extract the downloaded zip file with `unzip`, then decompress the enclosed `HIGGS.csv.gz` with `gunzip`. Both tools are assumed to be pre-installed.

In [None]:
!unzip -d /tmp/nvflare/dataset/input/ /tmp/nvflare/dataset/input/higgs.zip

In [None]:
!gunzip -c /tmp/nvflare/dataset/input/HIGGS.csv.gz > /tmp/nvflare/dataset/input/higgs.csv

Let's check our current files under the data folder.

In [None]:
!ls -al /tmp/nvflare/dataset/input/

### Data Split 

The HIGGS dataset contains 11 million instances (rows), each with 28 attributes.
The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. 
The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. The last 500,000 examples are used as a test set.

The first column is the class label (1 for signal, 0 for background), followed by the 28 features (21 low-level features then 7 high-level features): lepton pT, lepton eta, lepton phi, missing energy magnitude, missing energy phi, jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, jet 2 phi, jet 2 b-tag, jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, jet 4 b-tag, m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb. For more detailed information about each feature, please see the original paper.
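To make the column layout concrete, here is a minimal sketch that splits one raw CSV row into its label, the 21 low-level features, and the 7 high-level features. The helper name `split_row` is hypothetical, and the sample row holds placeholder values rather than real HIGGS data.

```python
import csv
import io

NUM_LOW_LEVEL = 21   # kinematic properties (columns 2-22)
NUM_HIGH_LEVEL = 7   # derived high-level features (columns 23-29)


def split_row(row):
    """Split one HIGGS CSV row into (label, low_level, high_level).

    Hypothetical helper for illustration only.
    """
    expected = 1 + NUM_LOW_LEVEL + NUM_HIGH_LEVEL
    if len(row) != expected:
        raise ValueError(f"expected {expected} columns, got {len(row)}")
    values = [float(v) for v in row]
    label = int(values[0])
    low_level = values[1:1 + NUM_LOW_LEVEL]
    high_level = values[1 + NUM_LOW_LEVEL:]
    return label, low_level, high_level


# Placeholder row: label 1 followed by 28 dummy feature values (not real data).
sample_line = ",".join(["1.0"] + [f"{i / 10:.1f}" for i in range(28)])
row = next(csv.reader(io.StringIO(sample_line)))
label, low_level, high_level = split_row(row)
print(label, len(low_level), len(high_level))
```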

We will split the dataset uniformly, so that all clients have the same amount of data, and write the results under the output directory:

```
/tmp/nvflare/dataset/output/
```
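As a rough illustration of what a uniform split means, the sketch below computes one equally sized row range per site. The function name `uniform_split_ranges` is hypothetical, and dropping the leftover rows is an assumption of this sketch; it is not necessarily how `split_csv.py` handles the remainder.

```python
def uniform_split_ranges(total_rows, site_num):
    """Return one (start, end) row range per site, with end exclusive.

    Assumption: rows that do not divide evenly are left out.
    """
    if site_num <= 0:
        raise ValueError("site_num must be positive")
    rows_per_site = total_rows // site_num
    ranges = []
    for i in range(site_num):
        start = i * rows_per_site
        ranges.append((start, start + rows_per_site))
    return ranges


# 11 million rows shared by 3 sites: every site gets 3,666,666 rows.
print(uniform_split_ranges(11_000_000, 3))
```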

First, to mimic real-world use cases, we generate a header file that stores the feature names (CSV file headers) in the data directory.

#### Generate the csv header file


In [None]:
import csv

# Column names: the class label followed by the 28 features
features = ["label", "lepton_pt", "lepton_eta", "lepton_phi",
            "missing_energy_magnitude", "missing_energy_phi",
            "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b_tag",
            "jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b_tag",
            "jet_3_pt", "jet_3_eta", "jet_3_phi", "jet_3_b_tag",
            "jet_4_pt", "jet_4_eta", "jet_4_phi", "jet_4_b_tag",
            "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"]

# Path of the header file to write
file_path = '/tmp/nvflare/dataset/input/headers.csv'

with open(file_path, 'w', newline='') as file:
    csv_writer = csv.writer(file)
    csv_writer.writerow(features)

print(f"features written to {file_path}")

In [None]:
!cat /tmp/nvflare/dataset/input/headers.csv

The following steps assume that you are in the "/examples/hello-world/step-by-step/higgs" directory.

In [None]:
!pwd

#### Split higgs.csv into multiple csv files for clients

Then we split the data into multiple files, one for each site. We make sure each site has a "header.csv" file corresponding to its CSV data. In a horizontal split, all the headers are the same, while for vertical learning each site can have different headers.

First, we install the requirements, again assuming the current directory is '/examples/hello-world/step-by-step/higgs'.

In [None]:
!pwd

In [None]:
%pip install -r requirements.txt

In this tutorial, we use 3 clients with a uniform split. To do so, simply run `split_csv.py`. It will take a few minutes.

> Note: we use a sample rate of 0.3 to make the demo faster to run. You can lower it further, for example to 0.003, to reduce the file size, which is especially useful during development or debugging.

In [None]:
!python split_csv.py \
  --input_data_path=/tmp/nvflare/dataset/input/higgs.csv \
  --input_header_path=/tmp/nvflare/dataset/input/headers.csv \
  --output_dir=/tmp/nvflare/dataset/output/ \
  --site_num=3 \
  --sample_rate=0.3

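For a rough sense of the resulting file sizes, the sketch below estimates the number of rows per site for a given sample rate. It assumes the sample rate is applied to the full 11 million rows before an even split; the function name `rows_per_site` is hypothetical, and the actual counts produced by `split_csv.py` may differ slightly.

```python
def rows_per_site(total_rows, sample_rate, site_num):
    """Estimate rows per site, assuming sampling happens before an even split."""
    sampled = round(total_rows * sample_rate)
    return sampled // site_num


print(rows_per_site(11_000_000, 0.3, 3))    # 1100000
print(rows_per_site(11_000_000, 0.003, 3))  # 11000
```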
Now let's check the files and their instance counts.

In [None]:
!ls -al /tmp/nvflare/dataset/output/

In [None]:
!wc -l /tmp/nvflare/dataset/output/site-1.csv

In [None]:
!wc -l /tmp/nvflare/dataset/output/site-2.csv

In [None]:
!wc -l /tmp/nvflare/dataset/output/site-3.csv

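The same check can be sketched in Python when `wc` is not available. The helper `count_lines` is hypothetical, and the file names simply mirror the ones used in the cells above. Its result can differ by one from `wc -l` when a file does not end with a newline.

```python
import os


def count_lines(path):
    """Count lines in a file by streaming it, without loading it into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


output_dir = "/tmp/nvflare/dataset/output"
for site in ("site-1", "site-2", "site-3"):
    path = os.path.join(output_dir, f"{site}.csv")
    if os.path.exists(path):  # skip quietly if the split has not been run yet
        print(path, count_lines(path))
```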
Our data is now prepared, and we are ready to run other computations.