# b jet characteristics

Accurate modeling of the b jet characteristics is important because there is evidence that the dimuon mass excess may be correlated with a corresponding excess in the $\mu\mu$ + b jet mass.


## b jet efficiencies

I will first focus on measuring the efficiency of b tagging, using the CSV tagger with the "tight" working point (CSV > 0.898).  The b tagging POG provides data-to-MC scale factors, but analyzers need to calculate their own b tag efficiencies in MC, since these depend on both the composition of the sample under consideration and the selection that is applied.

I will consider 8 TeV MC and then 13 TeV MC.  The samples considered here will be $t\bar{t}$, Drell-Yan, and a B' signal sample.

In [11]:
# imports and initial configuration
%cd '/home/naodell/work/CMS/amumu'
%matplotlib notebook

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import beta
import seaborn as sns

import nllfitter.plot_tools as pt

pt.set_new_tdr()
matplotlib.rcParams['figure.figsize'] = (8,8)

/home/naodell/work/CMS/amumu


In [12]:
# get the datasets

ntuple_dir  = 'data/flatuples/mumu_2012'
datasets    = [
               'ttbar_lep', 'ttbar_semilep',
               'zjets_m-50', 'zjets_m-10to50',
               'bprime_t-channel'
              ]
cuts        = 'lepton1_pt > 25 and abs(lepton1_eta) < 2.1 \
               and lepton2_pt > 5 and abs(lepton2_eta) < 2.4 \
               and lepton1_q != lepton2_q \
               and 12 < dilepton_mass < 70'

data_manager = pt.DataManager(input_dir     = ntuple_dir,
                              dataset_names = datasets,
                              selection     = 'mumu',
                              period        = 2012,
                              cuts          = cuts
                             )

Loading dataframes: 100%|███████████████| 5.00/5.00 [00:00<00:00, 6.76it/s]


The datasets used here have some preselection requirements:

* the IsoMu24_eta2p1 trigger must have fired
* there must be at least one good PV
* at least two tight ID, loose track isolated muons
* lead muon $p_{T} > 25$ GeV and $|\eta| < 2.1$
* trailing muon $p_{T} > 5$ GeV and $|\eta| < 2.4$
* $q_{\mu1} \neq q_{\mu2}$
* $12 < M_{\mu\mu} < 70$ GeV

The jets used for measuring the efficiencies are reconstructed with the anti-$k_{T}$ algorithm, clustering particle-flow (PF) objects with a distance parameter of $R = 0.5$.  The jet flavor is determined by matching a reconstructed PF jet to a jet that has been reconstructed by clustering generator-level partons (_include link to CMS mc flavor matching_).  The MC flavor matching greatly simplifies the efficiency measurement in the simulation.
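To make the idea of flavor matching concrete, here is a minimal sketch that assigns a flavor to a reconstructed jet by finding the closest generator-level jet in $\Delta R$. This is a simplified stand-in and not the CMS flavor-matching algorithm; the function names, the tuple layout, and the 0.3 matching radius are all assumptions made for illustration.

```python
import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the phi difference wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    return np.hypot(eta1 - eta2, dphi)

def match_flavor(reco_jet, gen_jets, max_dr=0.3):
    """
    Return the flavor of the closest generator-level jet within max_dr,
    or 0 if nothing matches.  reco_jet is (eta, phi); gen_jets is a list
    of (eta, phi, flavor) tuples.  The 0.3 radius is an illustrative choice.
    """
    best_flavor, best_dr = 0, max_dr
    for eta, phi, flavor in gen_jets:
        dr = delta_r(reco_jet[0], reco_jet[1], eta, phi)
        if dr < best_dr:
            best_flavor, best_dr = flavor, dr
    return best_flavor

# Toy generator-level jets: one b jet (PDG ID 5) and one gluon jet (21).
gen_jets = [(0.50, 3.10, 5), (-1.20, 0.40, 21)]

# The reco jet sits just across the phi = +/-pi boundary from the b jet.
print(match_flavor((0.55, -3.12), gen_jets))
```

The explicit wrapping of $\Delta\phi$ matters for jets near $\phi = \pm\pi$, which would otherwise appear to be far apart.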

The definition for our denominator object is:

* PF jet with $p_{T} > 10$ GeV and $|\eta| < 2.4$
* MC flavor is 5 or -5 (PDG ID for a b quark)

The numerator object is a denominator object with the additional requirement that:

* CSV > 0.898 (tight b tag POG WP)

It will be interesting to measure the efficiency as both a function of $p_{T}$ and $\eta$.
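The denominator and numerator definitions above can be sketched on a toy table of jets. The column names (`jet_pt`, `jet_eta`, `jet_flavor`, `jet_csv`) and the values are hypothetical and are not the branches of the ntuples used in this notebook.

```python
import pandas as pd

# Toy jets; column names and values are made up for illustration.
jets = pd.DataFrame({
    'jet_pt':     [45.0, 8.0, 60.0, 33.0, 120.0],
    'jet_eta':    [0.3, 1.1, -2.6, -1.9, 0.8],
    'jet_flavor': [5, 5, -5, -5, 4],
    'jet_csv':    [0.95, 0.99, 0.97, 0.40, 0.92],
})

# Denominator: true b jets (|flavor| == 5) inside the kinematic acceptance.
is_b = jets['jet_flavor'].abs() == 5
in_acceptance = (jets['jet_pt'] > 10) & (jets['jet_eta'].abs() < 2.4)
denom = jets[is_b & in_acceptance]

# Numerator: denominator jets that also pass the tight CSV working point.
numer = denom[denom['jet_csv'] > 0.898]

efficiency = len(numer) / len(denom)
print(len(numer), len(denom), efficiency)
```

Because the numerator is built from the denominator, it is a strict subset by construction, which is what the binomial treatment of the uncertainties assumes.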

In [13]:
labels    = ['ttbar', 'zjets', 'bprime_xb']
fmts      = {'ttbar': 'ko', 'zjets': 'bo', 'bprime_xb': 'ro'}
data_2012 = {}
for l in labels:
    df = data_manager.get_dataframe(l)
    df = df.query('gen_bjet_pt > 10 and abs(gen_bjet_eta) < 2.4').copy()  # copy so the new column is not set on a view
    df.loc[:,'tagged'] = (df['gen_bjet_tag'] > 0.898)
    data_2012[l] = df

UndefinedVariableError: name 'gen_bjet_pt' is not defined

Let's take a look at the b tag values:

In [14]:
data_2012['ttbar'].hist('gen_bjet_tag', 
                        bins=30, 
                        range=(0, 1), 
                        histtype='step'
                       )
plt.plot([0.898, 0.898], [0, 1.5e5], 'k--')  # tight working point
plt.ylim([0, 1.5e5])

This distribution looks pretty reasonable.  Keep in mind that every event that goes into the above histogram has a b quark in the generator process.  Now let's measure the efficiencies.  We'll want one function that bins the data and calculates the efficiencies, and one that plots the result.

It's important to handle the errors of the efficiency correctly.  I have adopted the [Clopper-Pearson](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval) interval construction and will be using the 95% CL interval for my error bars.
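As a standalone illustration of the interval construction, here is a minimal sketch for a single pair of counts. The function name `clopper_pearson` and the example counts are my own choices for illustration; the explicit handling of the `k == 0` and `k == n` edge cases is an addition, since the beta quantile is undefined when one of its shape parameters is zero.

```python
from scipy.stats import beta

def clopper_pearson(k, n, cl=0.95):
    """
    Clopper-Pearson interval for k successes out of n trials.
    Returns (lower, upper).  The bounds are set to 0 and 1 explicitly
    when k == 0 or k == n, where the beta quantile is undefined.
    """
    alpha = 1.0 - cl
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

# Example: 45 tagged jets out of 100 true b jets (made-up counts).
lo, hi = clopper_pearson(45, 100)
print(f'eff = {45/100:.2f}, 95% CL interval = [{lo:.3f}, {hi:.3f}]')

# Edge cases: the interval stays inside [0, 1] by construction.
print(clopper_pearson(0, 20))
print(clopper_pearson(20, 20))
```

Unlike a symmetric normal approximation, this interval never extends below 0 or above 1, which is useful in bins where the efficiency is close to either limit or the statistics are low.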

In [15]:
def calculate_efficiency(num, den, bins):
    '''
    Calculates efficiencies given the provided binning and estimates
    uncertainties using the Clopper-Pearson interval construction. 
    
    Parameters:
    ===========
    num: array for numerator (subset of denominator)
    den: array for denominator
    bins: bin edges (or number of bins), passed to np.histogram
    '''
    n, _ = np.histogram(num, bins=bins)
    d, b = np.histogram(den, bins=bins)
    
    x = (b[1:] + b[:-1])/2.
    x_err = (b[1:] - b[:-1])/2.
    eff = n.astype(float)/d
    eff_err = [np.abs(eff - beta.ppf(0.025, n, d - n + 1)), 
               np.abs(eff - beta.ppf(0.975, n+1, d - n))]
    
    return x, eff, x_err, eff_err

def efficiency_plot(data, var, labels, bins, fmts):
    '''
    Produces efficiency plots
    '''
    label_dict = {'gen_bjet_pt':r'$\sf p_{T}$',
                  'gen_bjet_eta':r'$\sf \eta$',
                  'gen_bjet_phi':r'$\sf \phi$'
                 }
    for l in labels:
        df = data[l]
        numer = df.query('tagged')[var]
        denom = df[var]
        x, eff, x_err, y_err = calculate_efficiency(numer.values, 
                                                    denom.values, 
                                                    bins=bins
                                                   )
        plt.errorbar(x, eff, yerr=y_err, xerr=x_err,
                     fmt=fmts[l],
                     capsize=0,
                     markersize=8.,
                     elinewidth=2
                    )
    plt.xlim([bins[0], bins[-1]])
    plt.ylim([0., 1.1])
    plt.xlabel(label_dict[var])
    plt.ylabel(r'$\sf \epsilon_{b\,tag}$')

    plt.legend(labels)

    plt.grid()
    plt.show()

### b tag efficiency vs. jet transverse momentum (8 TeV)

In [16]:
# efficiencies vs. pt
var = 'gen_bjet_pt'
bins = [10, 30, 50, 80, 120, 210, 300]
efficiency_plot(data_2012, var, labels, bins, fmts)

KeyError: 'ttbar'

### b tag efficiency vs. jet pseudorapidity (8 TeV)

In [None]:
# efficiencies vs. eta
var = 'gen_bjet_eta'
bins = [-2.5, -1.5, -1., -0.5, 0., 0.5, 1.0, 1.5, 2.5]
efficiency_plot(data_2012, var, labels, bins, fmts)

## 13 TeV b jet efficiencies

Let's do the same with 2016 MC.

In [20]:
ntuple_dir   = 'data/flatuples/mumu_rereco_2016'
datasets     = ['ttbar_lep', 
                'zjets_m-50', 'zjets_m-10to50',
                'z1jets_m-50', 'z1jets_m-10to50',
                'z2jets_m-50', 'z2jets_m-10to50',
                'z3jets_m-50', 'z3jets_m-10to50',
               ]
data_manager = pt.DataManager(input_dir     = ntuple_dir,
                              dataset_names = datasets,
                              selection     = 'mumu',
                              period        = 2016,
                              cuts          = cuts
                             )

labels = ['ttbar']#, 'zjets']
data_2016 = {}
for l in labels:
    df = data_manager.get_dataframe(l)
    df = df.query('gen_bjet_pt > 10 and abs(gen_bjet_eta) < 2.4').copy()  # copy so the new column is not set on a view
    df.loc[:,'tagged'] = (df['gen_bjet_tag'] > 0.935)
    data_2016[l] = df

Loading dataframes: 100%|███████████████| 9.00/9.00 [00:03<00:00, 1.40it/s]


### b tag efficiency vs. jet transverse momentum (13 TeV)

In [21]:
# efficiencies vs. pt
var = 'gen_bjet_pt'
bins = [10, 30, 50, 80, 120, 210, 300]
efficiency_plot(data_2016, var, labels, bins, fmts)


### b tag efficiency vs. jet pseudorapidity (13 TeV)

In [22]:
# efficiencies vs. eta
var = 'gen_bjet_eta'
bins = [-2.5, -1.5, -1., -0.5, 0., 0.5, 1.0, 1.5, 2.5]
efficiency_plot(data_2016, var, labels, bins, fmts)
