Dataset before the selection:
* 13778028 $QCD$ events
* 26335315 $t\bar{t}$ events
* 11083533 $W+jets$ events

Total of 51196876 events.
<br>
Dataset after the selection: 

## Steps

### 1) Features preparation

**HLF classifier**
* Convert the HLF to a dense vector type (needed to use the MinMax scaler)
    * New column `hfeatures_dense`
* Create a small pipeline with the following steps (for HLF):
    * One-hot-encode
        * New column `encoded_label`
    * MinMax scaler 
        * New column `HLF_input`
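
To make the two HLF pipeline stages concrete, here is a minimal plain-Python sketch of one-hot encoding and column-wise min-max scaling. The helper names `one_hot` and `min_max_scale` and the toy numbers are hypothetical; the notebook itself relies on Spark's `OneHotEncoderEstimator` and `MinMaxScaler`, which this sketch only imitates on a tiny list.

```python
def one_hot(label, num_classes=3):
    """Return a one-hot list for an integer class label (0=QCD, 1=tt, 2=W+jets)."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def min_max_scale(rows):
    """Rescale every column of `rows` to [0, 1] using the column-wise min and max."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no range; map it to the midpoint 0.5
            (x - lo) / (hi - lo) if hi > lo else 0.5
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


# Toy HLF values (hypothetical numbers, two features per event)
hlf = [[10.0, 2.0], [20.0, 4.0], [30.0, 3.0]]
scaled_hlf = min_max_scale(hlf)
encoded = [one_hot(label) for label in [0, 1, 2]]
```

The minimum and maximum are computed over the whole dataset, which is why the Spark pipeline has to be fitted before it can transform the data.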
    
**Particle-sequence classifier**
* For each event:
    * Sort particles by decreasing $\Delta R$ distance from the isolated lepton
    * Filter the empty rows
    * Scale the features
* Create the new column `GRU_input`

### 2)  Dataset preparation

* Shuffle the dataframe
* Create test and train dataframes (fraction: 0.8/0.2 (train/test))
    * `~12M` for training and `4M` for test 
* Create train/test dataset of different sizes and save them as parquet files
    * `100k`/`20k`
    * `1M`/`200k`
    * `6M`/ `1M`
    * `12M` / `4M` (full dataframe)
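
The shuffle-then-split idea can be sketched on an in-memory list as follows. This is only an illustration with a hypothetical helper `shuffle_and_split`; the notebook performs the same steps on Spark dataframes with `randomSplit` and `orderBy(rand())`.

```python
import random


def shuffle_and_split(rows, train_fraction=0.8, seed=42):
    """Shuffle `rows` reproducibly and split them into train and test lists."""
    rng = random.Random(seed)
    shuffled = list(rows)   # copy so the input is left untouched
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


events = list(range(100))   # stand-in for event rows
train, test = shuffle_and_split(events)
```

Unlike this sketch, Spark's `randomSplit` assigns rows to the splits by random sampling, so the resulting sizes match the requested fractions only approximately.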

## Create a spark session

In [1]:
import findspark
findspark.init('/usr/hdp/spark/')

In [2]:
application_name = 'FeaturePreparation'
master = "yarn"
num_executors = 60
executor_memory = '6G'
driver_memory = '128G'
num_cores = 4

In [3]:
from pyspark.sql import SparkSession
import os 

os.environ["PYTHONHOME"] = "/afs/cern.ch/work/m/migliori/public/anaconda2"
os.environ["PYTHONPATH"] = "/afs/cern.ch/work/m/migliori/public/anaconda2/lib/python2.7/site-packages"

spark = SparkSession.builder\
        .appName(application_name)\
        .config("spark.pyspark.python",
                "/afs/cern.ch/work/m/migliori/public/anaconda2/bin/python")\
        .config("spark.master", master)\
        .config("spark.executor.cores", str(num_cores))\
        .config("spark.executor.instances", str(num_executors))\
        .config("spark.executor.memory", executor_memory)\
        .config("spark.driver.memory", driver_memory)\
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
        .config("spark.driver.maxResultSize", "128G") \
        .config("spark.dynamicAllocation.enabled", 'false')\
        .getOrCreate()

In [4]:
spark

## Load the dataset

In [5]:
from __future__ import print_function

In [6]:
%%time
data = spark.read.format("parquet") \
        .load("hdfs://analytix/cms/bigdatasci/vkhriste/data/events2features_19092018")

events = data.count()
print('There are', events, 'events')

There are 20021476 events
CPU times: user 8.12 ms, sys: 10 ms, total: 18.1 ms
Wall time: 1min 41s


In [7]:
data.printSchema()

root
 |-- hfeatures: vector (nullable = true)
 |-- label: long (nullable = true)
 |-- lfeatures: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: double (containsNull = true)



Let's check how many events there are for each class. The three classes are labeled as follows:

* $QCD=0$
* $t\bar{t}=1$
* $W=2$

In [8]:
%%time
counts = data.groupBy('label').count().collect()

CPU times: user 5.39 ms, sys: 10.2 ms, total: 15.6 ms
Wall time: 1min 8s


In [9]:
labels = ['QCD', 'tt', 'W+jets']

qcd_events = 0
tt_events = 0 
wjets_events = 0

print('There are:')
for i in range(3):
    print('\t*',counts[i][1],labels[counts[i].label],
          'events (frac = {:.3f})'.format(counts[i][1]*1.0/events))
    if counts[i].label==0:
        qcd_events = counts[i][1]
    elif counts[i].label==1:
        tt_events = counts[i][1] 
    elif counts[i].label==2:
        wjets_events = counts[i][1]

There are:
	* 1426343 QCD events (frac = 0.071)
	* 14265397 tt events (frac = 0.713)
	* 4329736 W+jets events (frac = 0.216)


## Feature preparation 

The elements of the `hfeatures` column are lists, hence we need to convert them into dense vectors with `Vectors.dense`.

In [10]:
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

vector_dense_udf = udf(lambda r : Vectors.dense(r),VectorUDT())
data = data.withColumn('hfeatures_dense',vector_dense_udf('hfeatures'))

Now we can build the pipeline to scale HLF and encode the labels

In [11]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator
from pyspark.ml.feature import MinMaxScaler

## One-Hot-Encode
encoder = OneHotEncoderEstimator(inputCols=["label"],
                                 outputCols=["encoded_label"],
                                 dropLast=False)

## Scale feature vector
scaler = MinMaxScaler(inputCol="hfeatures_dense",
                      outputCol="HLF_input")

pipeline = Pipeline(stages=[encoder, scaler])

%time fitted_pipeline = pipeline.fit(data)

CPU times: user 294 ms, sys: 293 ms, total: 587 ms
Wall time: 1min 34s


In [12]:
data = fitted_pipeline.transform(data)

Now, for the particle-sequence classifier, we need to sort the particles in each event by decreasing $\Delta R$ distance from the isolated lepton, where 

$$
\Delta R = \sqrt{\Delta \eta^2 + \Delta \phi^2}
$$

From the production of the low-level features we know that the isolated lepton is the first particle and that the $19$ features (for each particle) are: <br>

['Energy', 'Px', 'Py', 'Pz', 'Pt', 'Eta', 'Phi', 'vtxX', 'vtxY', 'vtxZ', 'ChPFIso', 'GammaPFIso', 'NeuPFIso', 'isChHad', 'isNeuHad', 'isGamma', 'isEle', 'isMu', 'Charge'] <br>


hence we need features $5$ ($\eta$) and $6$ ($\phi$), counting from zero, to compute $\Delta R$.

In [13]:
import math

class lepAngularCoordinates():
    
    def __init__(self, eta, phi):
        self.Eta = eta
        self.Phi = phi
        
    def Phi_mpi_pi(self, x):
        ## Wrap the angle into the interval [-pi, pi)
        while x >= math.pi:
            x -= 2*math.pi
        while x < -math.pi:
            x += 2*math.pi
        return x
    
    def DeltaR(self, eta, phi):
        deta = self.Eta - eta
        dphi = self.Phi_mpi_pi(self.Phi - phi)
        return math.sqrt(deta*deta + dphi*dphi)
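
As a quick check of why the $\Delta\phi$ wrapping matters, here is a standalone sketch of the intended computation. The function names `wrap_phi` and `delta_r` and the particle coordinates are hypothetical and chosen only for illustration.

```python
import math


def wrap_phi(dphi):
    """Map an angle difference into the interval [-pi, pi)."""
    while dphi >= math.pi:
        dphi -= 2 * math.pi
    while dphi < -math.pi:
        dphi += 2 * math.pi
    return dphi


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance between two particles in the (eta, phi) plane."""
    deta = eta1 - eta2
    dphi = wrap_phi(phi1 - phi2)
    return math.sqrt(deta * deta + dphi * dphi)


# Two particles on either side of the phi = +/- pi boundary (hypothetical values)
close = delta_r(0.0, 3.0, 0.0, -3.0)
naive = abs(3.0 - (-3.0))
```

Without the wrapping, two particles that are almost collinear across the $\phi = \pm\pi$ boundary would appear to be nearly $2\pi$ apart, which would distort the $\Delta R$ ordering.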

In [14]:
from pyspark.sql.types import ArrayType, DoubleType
from sklearn.preprocessing import StandardScaler
#import numpy as np

@udf(returnType=ArrayType(ArrayType(DoubleType())))
def transform(particles):
    ## The isolated lepton is the first particle in the list
    ISOlep = lepAngularCoordinates(particles[0][5], particles[0][6])
    
    ## Sort the particles based on the distance from the isolated lepton
    particles.sort(key = lambda part: ISOlep.DeltaR(part[5], part[6]),
                   reverse=True)
    
    ## Filter the empty rows before standardizing 
    #particles = np.asarray(particles)
    #particles = particles[~(particles==0.).all(axis=1)]
    
    ## Standardize
    particles = StandardScaler().fit_transform(particles).tolist()
    
    return particles
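
The per-event steps (sort, filter, scale) can also be sketched without Spark or scikit-learn. The function `prepare_sequence`, its index parameters, and the toy event below are hypothetical. The zero-row filter is commented out in the cell above, so including it here is an assumption about the intended behaviour, following the step list at the top of the notebook.

```python
import math


def prepare_sequence(particles, eta_idx=5, phi_idx=6):
    """Sort by decreasing Delta R from the first particle, drop all-zero rows, standardize."""
    lep_eta, lep_phi = particles[0][eta_idx], particles[0][phi_idx]

    def delta_r(p):
        deta = lep_eta - p[eta_idx]
        # Wrap the phi difference into [-pi, pi)
        dphi = (lep_phi - p[phi_idx] + math.pi) % (2 * math.pi) - math.pi
        return math.sqrt(deta * deta + dphi * dphi)

    # Drop zero-padded rows, then sort farthest-first
    rows = [p for p in particles if any(v != 0.0 for v in p)]
    rows = sorted(rows, key=delta_r, reverse=True)

    # Column-wise standardization with the population standard deviation
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(c) / n for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
            for c, m in zip(columns, means)]
    return [
        # A constant column is mapped to 0.0 instead of dividing by zero
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy event with three features per particle: [pt, eta, phi]
toy = [[10.0, 0.0, 0.0],   # isolated lepton (first particle)
       [5.0, 1.0, 0.0],
       [1.0, 3.0, 0.0],
       [0.0, 0.0, 0.0]]    # zero padding
seq = prepare_sequence(toy, eta_idx=1, phi_idx=2)
```

Note that the mean and standard deviation are computed per event, as in the UDF, so the scaling differs from event to event.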

In [15]:
data = data.withColumn('GRU_input', transform('lfeatures'))

In [16]:
data.printSchema()

root
 |-- hfeatures: vector (nullable = true)
 |-- label: long (nullable = true)
 |-- lfeatures: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: double (containsNull = true)
 |-- hfeatures_dense: vector (nullable = true)
 |-- encoded_label: vector (nullable = true)
 |-- HLF_input: vector (nullable = true)
 |-- GRU_input: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: double (containsNull = true)



## Undersampled dataframes

### 1)  Balanced classes

In [17]:
qcd = data.filter('label=0')
tt = data.filter('label=1')
wjets = data.filter('label=2')

In [18]:
## Create the undersampled dataframes
tt = tt.sample(False, qcd_events*1.0/tt_events) 
wjets = wjets.sample(False, qcd_events*1.0/wjets_events)

dataUndersampled = qcd.union(tt).union(wjets)

In [19]:
## Check that the undersampling worked
dataUndersampled.groupBy('label').count().show()

+-----+-------+
|label|  count|
+-----+-------+
|    0|1426343|
|    1|1427480|
|    2|1425283|
+-----+-------+
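
The three counts are close but not identical. A short sketch of the arithmetic suggests why, under the assumption that `sample` without replacement keeps each row independently with probability `fraction`. The helper `binomial_std` is hypothetical; the event counts are the ones printed earlier in the notebook.

```python
# Class counts taken from the groupBy output earlier in the notebook
qcd_events = 1426343
tt_events = 14265397
wjets_events = 4329736

# Fractions passed to DataFrame.sample for the two majority classes
tt_fraction = qcd_events * 1.0 / tt_events
wjets_fraction = qcd_events * 1.0 / wjets_events


def binomial_std(n, p):
    """Standard deviation of the number of kept rows when each of n rows is kept with probability p."""
    return (n * p * (1.0 - p)) ** 0.5


tt_sigma = binomial_std(tt_events, tt_fraction)
wjets_sigma = binomial_std(wjets_events, wjets_fraction)

print('tt fraction = {:.4f}, spread ~ {:.0f} events'.format(tt_fraction, tt_sigma))
print('W+jets fraction = {:.4f}, spread ~ {:.0f} events'.format(wjets_fraction, wjets_sigma))
```

Under this assumption, the observed differences from the QCD count are of the order of one standard deviation for both classes, so the small imbalance is expected.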



In [20]:
from pyspark.sql.functions import rand 
trainUndersampled, testUndersampled = dataUndersampled.randomSplit([0.8, 0.2], seed=42)
trainUndersampled = trainUndersampled.orderBy(rand())

In [21]:
%%time
## write to parquet
PATH = 'hdfs://analytix/user/migliori/HLT/'
trainUndersampled.write.parquet(PATH+'trainUndersampled_v2.parquet',
                                              mode='overwrite')
testUndersampled.write.parquet(PATH+'testUndersampled_v2.parquet',
                                              mode='overwrite')

CPU times: user 3.32 s, sys: 2.85 s, total: 6.17 s
Wall time: 1h 55min 40s


### 2) Balance tt/wjets

Undersample the tt events so that there are as many tt events as W+jets events; the QCD events are kept as they are.

In [22]:
qcd = data.filter('label=0')
tt = data.filter('label=1')
wjets = data.filter('label=2')

In [23]:
tt = tt.sample(False, wjets_events*1.0/tt_events)
dataUndersampled_tt = qcd.union(tt).union(wjets)

In [24]:
dataUndersampled_tt.groupBy('label').count().show()

+-----+-------+
|label|  count|
+-----+-------+
|    0|1426343|
|    1|4327752|
|    2|4329736|
+-----+-------+



In [25]:
from pyspark.sql.functions import rand 
trainUndersampled_tt, testUndersampled_tt = dataUndersampled_tt.randomSplit([0.9, 0.1], seed=42)
trainUndersampled_tt = trainUndersampled_tt.orderBy(rand())

In [26]:
%%time
## write to parquet
PATH = 'hdfs://analytix/user/migliori/HLT/'
trainUndersampled_tt.write.parquet(PATH+'trainUndersampled_tt_v2.parquet',
                                              mode='overwrite')
testUndersampled_tt.write.parquet(PATH+'testUndersampled_tt_v2.parquet',
                                              mode='overwrite')

CPU times: user 3.49 s, sys: 2.9 s, total: 6.39 s
Wall time: 2h 20min


### 3) Save the entire dataset

In [17]:
from pyspark.sql.functions import rand 
train, test = data.randomSplit([0.9, 0.1], seed=42)
train = train.orderBy(rand())

In [None]:
%%time
## write to parquet
PATH = 'hdfs://analytix/user/migliori/HLT/'
train.write.parquet(PATH+'train.parquet', mode='overwrite')
test.write.parquet(PATH+'test.parquet',mode='overwrite')