# Prepare Data for Higgs Dataset

## Install requirements
We will need pandas for the data preparation. 
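
Before continuing, we can check that the requirement is importable. The snippet below is a minimal sketch; the helper name `missing_packages` is hypothetical and is not part of this example's code.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every requested package is available.
print(missing_packages(["pandas"]))
```

If pandas is reported as missing, it can be installed with `pip install pandas`.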


## Prepare data

### Download and Store Data

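To run the examples, we first download the dataset from the UCI Machine Learning Repository (the URL is in the download command below). The download is a zip archive that contains a single gzip-compressed .csv file. By default, we assume the archive is downloaded to the path below and uncompressed in the same directory:

```
/tmp/nvflare/dataset/input/higgs.zip
```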

You can download the file directly with either wget or curl, whichever is installed. The cell below uses curl. The file is over 2.6 GB, so the download takes a while.

In [1]:
! mkdir -p /tmp/nvflare/dataset/input

! curl -o /tmp/nvflare/dataset/input/higgs.zip https://archive.ics.uci.edu/static/public/280/higgs.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2686M    0 2686M    0     0  22.6M      0 --:--:--  0:01:58 --:--:-- 19.4M


Alternatively, download with wget: `wget -P /tmp/nvflare/dataset/input/ https://archive.ics.uci.edu/static/public/280/higgs.zip`
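
If neither curl nor wget is available, the same download can be done from Python. The following is a minimal sketch using only the standard library; the function name `download` and the `chunk_size` parameter are our own choices for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, dest_path, chunk_size=1024 * 1024):
    """Stream `url` to `dest_path` in chunks so the file is never fully in memory."""
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest


# Usage (commented out because it fetches the full 2.6+ GB archive):
# download(
#     "https://archive.ics.uci.edu/static/public/280/higgs.zip",
#     "/tmp/nvflare/dataset/input/higgs.zip",
# )
```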

Next, we unzip higgs.zip. The `unzip` and `gunzip` tools are already installed in this environment, so we call them directly.

In [2]:
! unzip -d /tmp/nvflare/dataset/input/ /tmp/nvflare/dataset/input/higgs.zip

Archive:  /tmp/nvflare/dataset/input/higgs.zip
  inflating: /tmp/nvflare/dataset/input/HIGGS.csv.gz  


In [3]:
!gunzip -c /tmp/nvflare/dataset/input/HIGGS.csv.gz > /tmp/nvflare/dataset/input/higgs.csv
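
The two extraction steps above can also be done in Python when `unzip` or `gunzip` is not installed. This is a sketch under the assumption that the archive holds exactly one `.gz` member, as the `unzip` output above shows; the function name `extract_higgs` is hypothetical.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def extract_higgs(zip_path, out_dir, csv_name="higgs.csv"):
    """Unpack the zip into `out_dir`, then decompress its single .gz member to `csv_name`."""
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)  # like `unzip -d out_dir zip_path`
        gz_members = [m for m in archive.namelist() if m.endswith(".gz")]
    if len(gz_members) != 1:
        raise ValueError(f"expected exactly one .gz member, found {gz_members}")
    csv_path = out_dir / csv_name
    # Like `gunzip -c member.gz > csv_path`: stream so the 8 GB file never sits in memory.
    with gzip.open(out_dir / gz_members[0], "rb") as src, open(csv_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return csv_path
```

For the paths used here, the call would be `extract_higgs("/tmp/nvflare/dataset/input/higgs.zip", "/tmp/nvflare/dataset/input/")`.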

In [1]:
!ls -al /tmp/nvflare/dataset/input/

total 13348436
drwxrwxr-x 2 chester chester       4096 Nov 22 10:05 .
drwxrwxr-x 3 chester chester       4096 Nov 22 09:55 ..
-rw-rw-r-- 1 chester chester 8035497980 Nov 22 10:12 higgs.csv
-rwx------ 1 chester chester 2816407858 May 22  2023 HIGGS.csv.gz
-rw-rw-r-- 1 chester chester 2816865137 Nov 22 09:57 higgs.zip


### Data Split

The HIGGS dataset contains 11 million instances (rows), each with 28 attributes.
The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. 
The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. The last 500,000 examples are used as a test set.

The first column is the class label (1 for signal, 0 for background), followed by the 28 features (21 low-level features, then 7 high-level features): lepton pT, lepton eta, lepton phi, missing energy magnitude, missing energy phi, jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, jet 2 phi, jet 2 b-tag, jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, jet 4 b-tag, m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb. For more detailed information about each feature, see the original paper.
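
To make the column layout concrete, the sketch below splits one CSV line into the label, the 21 low-level features, and the 7 high-level features. The function name `split_row` is hypothetical and used only for illustration.

```python
def split_row(line):
    """Split one HIGGS CSV line into (label, low_level, high_level)."""
    values = [float(v) for v in line.strip().split(",")]
    if len(values) != 29:
        raise ValueError(f"expected 29 columns (label + 28 features), got {len(values)}")
    label = int(values[0])    # column 1: 1 for signal, 0 for background
    low_level = values[1:22]  # columns 2-22: kinematic properties
    high_level = values[22:]  # columns 23-29: derived features
    return label, low_level, high_level
```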

Since the HIGGS dataset is already recorded in random order, the data split is specified by a contiguous index range for each client rather than by a vector of random instance indices. We split the dataset uniformly: all clients have the same amount of data. The output directory is

```
/tmp/nvflare/dataset/output/

```
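
The contiguous, uniform index ranges described above can be sketched as follows. This is an illustration and not necessarily how split_csv.py computes them: the function name `site_ranges` is hypothetical, the end index is assumed to be exclusive, and any remainder rows are assumed to be dropped so that every site gets the same count.

```python
def site_ranges(total_rows, site_num, sample_rate=1.0):
    """Return {site_name: (start_index, end_index)} with an exclusive end index."""
    sampled_rows = int(total_rows * sample_rate)  # rows kept after sampling
    rows_per_site = sampled_rows // site_num      # equal share; remainder is dropped
    return {
        f"site-{i + 1}": (i * rows_per_site, (i + 1) * rows_per_site)
        for i in range(site_num)
    }
```

With 11 million rows, 3 sites, and a sample rate of 0.003, this gives 11,000 rows per site: `(0, 11000)`, `(11000, 22000)`, and `(22000, 33000)`.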

To mirror real-world use cases, we put the feature names (the CSV file headers) into a separate file in the input directory. When we split the data, we make sure each site gets its own header file (for example, `site-1_header.csv`) that corresponds to its CSV data. In a horizontal split, all headers are the same; in vertical learning, each site may have different headers.
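
For the horizontal case, giving each site its own header file amounts to copying one shared file. The sketch below assumes the `site-<n>_header.csv` naming that appears in the output listing later in this notebook; the function name `copy_headers` is hypothetical.

```python
import shutil
from pathlib import Path


def copy_headers(header_path, output_dir, site_num):
    """Copy the shared header file to `<output_dir>/site-<n>_header.csv` for each site."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for i in range(1, site_num + 1):
        target = output_dir / f"site-{i}_header.csv"
        shutil.copy(header_path, target)
        copied.append(target)
    return copied
```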


We provide a simple Python script, split_csv.py, to split the data. Before running it, we write the header file. The split itself may take a few minutes.


In [7]:
import csv

# Column names: the class label followed by the 28 features
features = [
    "label",
    "lepton_pt", "lepton_eta", "lepton_phi",
    "missing_energy_magnitude", "missing_energy_phi",
    "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b_tag",
    "jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b_tag",
    "jet_3_pt", "jet_3_eta", "jet_3_phi", "jet_3_b_tag",
    "jet_4_pt", "jet_4_eta", "jet_4_phi", "jet_4_b_tag",
    "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb",
]

# Specify the file path
file_path =  '/tmp/nvflare/dataset/input/headers.csv'

with open(file_path, 'w', newline='') as file:
    csv_writer = csv.writer(file)
    csv_writer.writerow(features)

print(f"features written to {file_path}")

features written to /tmp/nvflare/dataset/input/headers.csv


In [8]:
!cat /tmp/nvflare/dataset/input/headers.csv

label,lepton_pt,lepton_eta,lepton_phi,missing_energy_magnitude,missing_energy_phi,jet_1_pt,jet_1_eta,jet_1_phi,jet_1_b_tag,jet_2_pt,jet_2_eta,jet_2_phi,jet_2_b_tag,jet_3_pt,jet_3_eta,jet_3_phi,jet_3_b_tag,jet_4_pt,jet_4_eta,jet_4_phi,jet_4_b_tag,m_jj,m_jjj,m_lv,m_jlv,m_bb,m_wbb,m_wwbb
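
A quick sanity check on the header file is to confirm it has 29 unique column names (the label plus 28 features). This is a minimal sketch; the function name `read_header` is hypothetical.

```python
import csv


def read_header(path, expected_columns=29):
    """Read the single header row and verify the column count and uniqueness."""
    with open(path, newline="") as f:
        names = next(csv.reader(f))
    if len(names) != expected_columns:
        raise ValueError(f"expected {expected_columns} columns, got {len(names)}")
    if len(set(names)) != len(names):
        raise ValueError("duplicate column names in header")
    return names
```

For example, `read_header("/tmp/nvflare/dataset/input/headers.csv")` returns the list of names or raises `ValueError` if the file is malformed.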


Now we are ready to split the data. A sample rate of 20% (0.2) makes the demo faster to run. You can choose an even smaller value, such as the 0.003 used in the command below, to reduce the file sizes during development or debugging.

We assume you are in the "higgs" directory:

In [12]:
!pwd

/home/chester/projects/NVFlare/examples/hello-world/step-by-step/higgs


In [9]:
!python split_csv.py \
  --input_data_path=/tmp/nvflare/dataset/input/higgs.csv \
  --input_header_path=/tmp/nvflare/dataset/input/headers.csv \
  --output_dir=/tmp/nvflare/dataset/output/ \
  --site_num=3 \
  --sample_rate=0.003

site-1= start_index=0 end_index=11000
site-2= start_index=11000 end_index=22000
site-3= start_index=22000 end_index=33000
File copied to /tmp/nvflare/dataset/output/site-1_header.csv
File copied to /tmp/nvflare/dataset/output/site-2_header.csv
File copied to /tmp/nvflare/dataset/output/site-3_header.csv
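
Given a start and end index like those printed above, writing one site's file is a matter of copying that row range. The sketch below streams the input line by line so the 8 GB file is never loaded into memory. It is an illustration only, not the actual split_csv.py; the function name `write_site_file` is hypothetical and the end index is assumed to be exclusive.

```python
import itertools
from pathlib import Path


def write_site_file(input_path, output_path, start_index, end_index):
    """Copy rows [start_index, end_index) of `input_path` to `output_path`; return rows written."""
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(input_path, "r") as src, open(output_path, "w") as dst:
        # islice skips the first `start_index` lines and stops before `end_index`.
        for line in itertools.islice(src, start_index, end_index):
            dst.write(line)
            written += 1
    return written
```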


In [39]:
!ls -al /tmp/nvflare/dataset/output/

total 1079464
drwxrwxr-x 2 chester chester      4096 Nov 21 21:23 .
drwxrwxr-x 4 chester chester      4096 Nov 21 14:30 ..
-rw-rw-r-- 1 chester chester 368443760 Nov 21 21:50 site-1.csv
-rw-rw-r-- 1 chester chester       287 Nov 21 21:50 site-1_header.csv
-rw-rw-r-- 1 chester chester 368444134 Nov 21 21:50 site-2.csv
-rw-rw-r-- 1 chester chester       287 Nov 21 21:50 site-2_header.csv
-rw-rw-r-- 1 chester chester 368448833 Nov 21 21:50 site-3.csv
-rw-rw-r-- 1 chester chester       287 Nov 21 21:50 site-3_header.csv


The data is now prepared, and we are ready to run other computations.

In [10]:
! wc -l /tmp/nvflare/dataset/output/site-1.csv

11000 /tmp/nvflare/dataset/output/site-1.csv
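
The same row count can be obtained in Python without loading the file into memory. This is a minimal sketch that, like `wc -l`, counts newline characters; the function name `count_rows` is hypothetical.

```python
def count_rows(path, chunk_size=1024 * 1024):
    """Count newline characters in `path`, reading in fixed-size binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")
    return count
```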


In [5]:
! grep  '0\.000000000000000000e+00\.1'  /tmp/nvflare/dataset/input/higgs.csv | wc -l

0


In [12]:
! grep  '0\.000000000000000000e+00\.1'  /tmp/nvflare/dataset/output/site-*.csv | wc -l

0
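
The two grep checks above count the lines that contain the literal text `0.000000000000000000e+00.1`. A rough Python equivalent is sketched below; the function name `count_matching_lines` is hypothetical, and a plain substring test is used instead of a regular expression.

```python
def count_matching_lines(paths, needle):
    """Return how many lines across `paths` contain the literal string `needle`."""
    total = 0
    for path in paths:
        with open(path, "r") as f:
            total += sum(1 for line in f if needle in line)
    return total
```

For the output files, the paths could be collected with `glob.glob("/tmp/nvflare/dataset/output/site-*.csv")`.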


In [None]:
1.0,0.9075421094894408,0.3291472792625427,0.3594118654727936,1.49796986579895,-0.3130095303058624,1.09553062915802,-0.5575249195098877,-1.588229775428772,2.1730761528015137,0.812581181526184,-0.2136419266462326,1.2710145711898804,2.2148721218109126,0.4999939501285553,-1.2614318132400513,0.7321561574935913,0.0,0.3987008929252625,-1.1389300823211668,-0.0008191101951524,0.0,  0.3022198975086212,0.8330481648445129,0.9856996536254884,0.9780983924865722,0.7797321677207946,0.9923557639122008,0.7983425855636596
1.0,0.9075421094894408,0.3291472792625427,0.3594118654727936,1.49796986579895,-0.3130095303058624,1.09553062915802,-0.5575249195098877,-1.588229775428772,2.1730761528015137,0.812581181526184,-0.2136419266462326,1.2710145711898804,2.2148721218109126,0.4999939501285553,-1.2614318132400513,0.7321561574935913,0.0,0.3987008929252625,-1.1389300823211668,-0.0008191101951524,0.0.1,0.3022198975086212,0.8330481648445129,0.9856996536254884,0.9780983924865722,0.7797321677207946,0.9923557639122008,0.7983425855636596