Sometimes we don't want to use all of the data available to us, but we still want to order the samples by the value of a particular observable. For LSTM networks we would like the full distribution of this observable to be available, but we also want the network to train in a reasonable time. To cull some of the data we can use a query. For example, as demonstrated below, we can keep only the rows that have PT_ET > 1.5. This raises another issue: what is the maximum number of samples per object type that we need in order not to cut off any information? Without a query the answer would be easy: it is simply the maximum number of rows used for each object in our pandas tables. With a query, we cannot know how many of those rows have PT_ET > 1.5 without computing it explicitly. To save time, we can take a quick sample of the data, apply the query, and see what the maximum number of rows is. In the network we can then plug in those maxima, plus some headroom so that we don't cut off part of the distribution we haven't seen yet.

In [5]:
import glob
import pandas as pd
for c in ["ttbar_lepFilter_13TeV", "wjets_lepFilter_13TeV", "qcd_lepFilter_13TeV"]:
    files = glob.glob("/data/shared/Delphes/%s/pandas_h5/*.h5" % c)
    # Sample only one file per process to keep this check quick.
    f = files[0]
    store = pd.HDFStore(f)
    for obj in ["EFlowPhoton", "EFlowNeutralHadron", "EFlowTrack"]:
        m = []  # number of rows per event that survive the cut
        dataframe = store.get(obj)
        # Quick sample: only events with Entry 1..499 are inspected.
        for i in range(1,500):
            df = dataframe[dataframe["Entry"] == i]
            df = df.query("PT_ET > 1.5")
            m.append(len(df.index))
        print(c,obj,"MAX: "+str(max(m)),"AVG: " + str(sum(m)/len(m)))
    store.close()

('ttbar_lepFilter_13TeV', 'EFlowPhoton', 'MAX: 92', 'AVG: 39')
('ttbar_lepFilter_13TeV', 'EFlowNeutralHadron', 'MAX: 167', 'AVG: 73')
('ttbar_lepFilter_13TeV', 'EFlowTrack', 'MAX: 164', 'AVG: 82')
('wjets_lepFilter_13TeV', 'EFlowPhoton', 'MAX: 65', 'AVG: 23')
('wjets_lepFilter_13TeV', 'EFlowNeutralHadron', 'MAX: 163', 'AVG: 58')
('wjets_lepFilter_13TeV', 'EFlowTrack', 'MAX: 125', 'AVG: 49')
('qcd_lepFilter_13TeV', 'EFlowPhoton', 'MAX: 78', 'AVG: 38')
('qcd_lepFilter_13TeV', 'EFlowNeutralHadron', 'MAX: 200', 'AVG: 73')
('qcd_lepFilter_13TeV', 'EFlowTrack', 'MAX: 170', 'AVG: 78')
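
The per-event loop above filters the table once per `Entry`, which gets slow on larger samples. The same counts can be obtained with one query followed by a `groupby`. This is a minimal sketch, assuming a table with `Entry` and `PT_ET` columns as in the cell above; the toy DataFrame and the helper name `count_passing_per_entry` are illustrative and not part of the original analysis. Note that this helper counts events `0..n_entries-1`, whereas the loop above starts at 1.

```python
import pandas as pd

def count_passing_per_entry(dataframe, query, n_entries):
    """Count the rows per Entry that pass `query`, for Entry 0..n_entries-1.

    Events with no surviving rows are reported as 0 so that they still
    contribute to the average.
    """
    passed = dataframe.query(query)
    counts = passed.groupby("Entry").size()
    return counts.reindex(range(n_entries), fill_value=0)

# Toy stand-in for one object table (hypothetical values).
toy = pd.DataFrame({
    "Entry": [0, 0, 0, 1, 1, 2],
    "PT_ET": [0.5, 2.0, 3.0, 1.0, 1.6, 0.2],
})

counts = count_passing_per_entry(toy, "PT_ET > 1.5", n_entries=3)
print("MAX:", counts.max(), "AVG:", counts.mean())
```

Reindexing over the full event range matters: without it, events in which nothing passes the cut would drop out of the result and bias the average upward.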


In [7]:
store = pd.HDFStore("/data/shared/Delphes/ttbar_lepFilter_13TeV/pandas_h5/ttbar_lepFilter_13TeV_0.h5")
dataframe = store.get("NumValues")
print(dataframe)

      Electron  MuonTight  Photon  MissingET  EFlowPhoton  EFlowNeutralHadron  \
0            0          2       1          1          279                 210   
1            1          1       0          1          712                 502   
2            1          1       1          1          345                 242   
3            0          1       0          1          280                 226   
4            0          1       0          1          804                 640   
5            0          0       0          1          383                 316   
6            1          0       0          1          647                 466   
7            0          1       1          1          448                 339   
8            0          1       0          1          473                 435   
9            2          0       0          1          452                 352   
10           0          1       3          1          315                 220   
11           1          1   

In [None]:
# PT_ET > 1
('ttbar_lepFilter_13TeV', 'EFlowPhoton', 'MAX: 165', 'AVG: 82')
('ttbar_lepFilter_13TeV', 'EFlowNeutralHadron', 'MAX: 287', 'AVG: 132')
('ttbar_lepFilter_13TeV', 'EFlowTrack', 'MAX: 320', 'AVG: 158')
('wjets_lepFilter_13TeV', 'EFlowPhoton', 'MAX: 157', 'AVG: 57')
('wjets_lepFilter_13TeV', 'EFlowNeutralHadron', 'MAX: 277', 'AVG: 112')
('wjets_lepFilter_13TeV', 'EFlowTrack', 'MAX: 276', 'AVG: 116')
('qcd_lepFilter_13TeV', 'EFlowPhoton', 'MAX: 178', 'AVG: 80')
('qcd_lepFilter_13TeV', 'EFlowNeutralHadron', 'MAX: 339', 'AVG: 133')
('qcd_lepFilter_13TeV', 'EFlowTrack', 'MAX: 392', 'AVG: 155')

# PT_ET > 1.5
('ttbar_lepFilter_13TeV', 'EFlowPhoton', 'MAX: 92', 'AVG: 39')
('ttbar_lepFilter_13TeV', 'EFlowNeutralHadron', 'MAX: 167', 'AVG: 73')
('ttbar_lepFilter_13TeV', 'EFlowTrack', 'MAX: 164', 'AVG: 82')
('wjets_lepFilter_13TeV', 'EFlowPhoton', 'MAX: 65', 'AVG: 23')
('wjets_lepFilter_13TeV', 'EFlowNeutralHadron', 'MAX: 163', 'AVG: 58')
('wjets_lepFilter_13TeV', 'EFlowTrack', 'MAX: 125', 'AVG: 49')
('qcd_lepFilter_13TeV', 'EFlowPhoton', 'MAX: 78', 'AVG: 38')
('qcd_lepFilter_13TeV', 'EFlowNeutralHadron', 'MAX: 200', 'AVG: 73')
('qcd_lepFilter_13TeV', 'EFlowTrack', 'MAX: 170', 'AVG: 78')
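
To turn sampled maxima like these into fixed sequence lengths for the network, one option is to add some headroom and then sort, truncate, and zero-pad each event. The following is only a sketch under stated assumptions: the 20% headroom, the descending sort order, the `Eta` column, and the helper names `padded_length` and `event_to_sequence` are all hypothetical choices, not something taken from the original analysis.

```python
import numpy as np
import pandas as pd

def padded_length(sampled_max, headroom_pct=20):
    """Sampled maximum plus a percentage of headroom, rounded up (assumed 20%)."""
    extra = (sampled_max * headroom_pct + 99) // 100  # integer ceiling
    return sampled_max + extra

def event_to_sequence(event_df, feature_cols, max_len,
                      query="PT_ET > 1.5", sort_col="PT_ET"):
    """Apply the cut, order rows by `sort_col` (descending, an assumption),
    then truncate or zero-pad to exactly `max_len` rows."""
    selected = event_df.query(query).sort_values(sort_col, ascending=False)
    values = selected[feature_cols].to_numpy(dtype=float)[:max_len]
    out = np.zeros((max_len, len(feature_cols)), dtype=float)
    out[:len(values)] = values
    return out

# Toy single-event table (hypothetical values).
event = pd.DataFrame({
    "PT_ET": [0.5, 3.0, 2.0],
    "Eta": [0.1, -1.2, 0.7],
})

seq = event_to_sequence(event, ["PT_ET", "Eta"], max_len=4)
print(padded_length(200))
print(seq)
```

Because the rows are sorted before truncation, an event that exceeds `max_len` loses its lowest-`PT_ET` rows first under this ordering.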