# Binned Likelihood fit

This is an introduction to the Binned Likelihood fit, which applies when the data has been binned but the statistics are low (otherwise, one would use a ChiSquare fit). The example is inspired by [Covid-19 calculations related to the arrival of the first variant](https://dm.dk/forskerforum/magasinet/2021/343/forskerne-bag-coronamodellerne/), where day 1 is the 1st of December 2020 (first observation of the B.1.1.7 "British" Alpha variant).

The code is an illustration of the difference between a ChiSquare and a Binned Likelihood fit.

The main task is to play around with the number of samples available for fitting the progression of the spread of a disease, and to see how the different methods cope when data is scarce.


### Authors: 
- Malthe Nordentoft (Niels Bohr Institute, malthe.nordentoft@nbi.ku.dk)
- Troels C. Petersen (Niels Bohr Institute, petersen@nbi.dk)

### Date:    
- 21-11-2024 (latest update)

***

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from iminuit import Minuit
from iminuit import cost
from scipy import stats

In [None]:
r = np.random   # Random generator
r.seed(44)      # Set a random seed (but a fixed one)

# Truth values for sampling and disease progression:
samples = 1000
slope = 0.075
thalf = 75

## Generate data

In this data set we want to predict how fast a new variant of a disease will spread in a population, and forecast the future time point $t_{1/2}$ at which half of the new infections are due to the new variant. We are in the early days of the new variant, so we only have data from the last month (Ndays = 30). Every day `samples` tests are conducted (1000 with the truth values above), and the number of new variant cases is recorded. We have good reason to believe that the spread of the disease grows as a sigmoid function (two competing exponentials), so that is the fitting function.
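
Written out, the sigmoid implemented as `sigmoid` in the code is as follows. The symbols $s$ (for `slope`) and $N$ (for the number of daily samples) are notation chosen here for illustration:

$$
f(t) = \frac{N}{1 + e^{-s\,(t - t_{1/2})}} = N\,\frac{e^{s t}}{e^{s t} + e^{s\,t_{1/2}}}, \qquad f(t_{1/2}) = \frac{N}{2}.
$$

The second form makes the two exponentials explicit. Well before $t_{1/2}$ the expected count grows roughly exponentially, $f(t) \approx N e^{s\,(t - t_{1/2})}$, so with only early data the fit has to infer $t_{1/2}$ from the growth rate and the overall level of a few small counts.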

<center>
Your mission - and you have no choice but to accept it - is to determine when one should expect $t_{1/2}$.
</center>

In [None]:
def sigmoid(time, slope, thalf, N):
    return N/(1 + np.exp(-slope*(time - thalf)))

def generate_data(time):
    # Expected number of positives per day; the observed counts are Poisson distributed around it
    rates = sigmoid(time, slope, thalf, samples)
    return np.random.poisson(rates)

Ndays = 30
time = np.arange(Ndays)
data_Npos = generate_data(time)       # Data, consisting of Npositives (with new variant) daily

## Visualising the data

The data provided by the doctors is shown below, already binned by day. As is obvious from the plot, the data is very sparse, but it is the best we can do at the current time.

In [None]:
fig, ax = plt.subplots(1,1, figsize = (7,5))
ax.bar(time, data_Npos)
ax.set(xlabel = 'Time (days)', ylabel = 'Number of positive tests')
plt.show()

## Binned likelihood
We define our own Binned Likelihood function to fit the histogram given to us. It is a good assumption that the bin count of a histogram is Poisson distributed, so for each bin we compute the Poisson likelihood of the measured bin count given the predicted bin count, i.e. the rate of positive samples from our sigmoid progression of the disease. Finally, we take the logarithm of the product of the likelihoods (which turns the product into a sum) and multiply by $-2$, as this is numerically beneficial and turns the problem into a minimisation.
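
As a worked formula, in notation introduced here ($n_i$ is the observed count on day $i$ and $\mu_i$ is the sigmoid prediction for that day), the quantity being minimised is

$$
-2\ln L = -2\sum_{i} \ln\!\left[\frac{\mu_i^{\,n_i}\, e^{-\mu_i}}{n_i!}\right] = 2\sum_{i}\left[\mu_i - n_i \ln\mu_i + \ln(n_i!)\right].
$$

The $\ln(n_i!)$ term does not depend on the fit parameters, so it shifts the minimum value but not its position. A bin with $n_i = 0$ simply contributes $2\mu_i$, which is finite, so empty bins still constrain the fit.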

In [None]:
def BinnedLH(slope, thalf):
    rates = sigmoid(time, slope, thalf, samples)   # Predicted counts per day
    # -2*ln(L); logpmf avoids log(0) when a pmf value underflows to zero
    return -2*np.sum(stats.poisson.logpmf(data_Npos, rates))

## Fitting the data
We make an educated guess of the fitting parameters for the sigmoid function, and then fit with both a Binned Likelihood method and a Chi2 method.


In [None]:
slope_guess = 0.15
thalf_guess = 50

mfit = Minuit(BinnedLH, slope = slope_guess, thalf = thalf_guess)
mfit.errordef = 1.0    # BinnedLH returns -2*ln(L), which is chi2-like (0.5 would be for -ln(L))
mfit.migrad()

# Chi2 fit: bins with zero counts must be excluded, as their uncertainty sqrt(N) would be zero
mask = data_Npos > 0
c = cost.LeastSquares(time[mask], data_Npos[mask], np.sqrt(data_Npos[mask]), sigmoid)
mfit_chi2 = Minuit(c, slope = slope_guess, thalf = thalf_guess, N = samples)
mfit_chi2.fixed["N"] = True   # Fixed as we always take the same number of samples every day, thus it is not a fitting parameter
mfit_chi2.migrad();
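
The `errordef` setting has to match the scaling of the cost function: iminuit expects 1 for chi2-like functions (including $-2\ln L$) and 0.5 for $-\ln L$. The following minimal sketch is not part of the original exercise. It uses a single Poisson bin and hypothetical names (`n_obs`, `sigma_from_curvature`) to illustrate that matched settings give the expected error of $\sqrt{n}$, while a mismatch scales the error by $1/\sqrt{2}$.

```python
import math

n_obs = 100  # hypothetical observed count in a single bin

def neg2_log_l(mu):
    """-2 ln L for one Poisson bin, dropping the constant ln(n!) term."""
    return -2.0 * (n_obs * math.log(mu) - mu)

def neg_log_l(mu):
    """-ln L for the same bin."""
    return 0.5 * neg2_log_l(mu)

def sigma_from_curvature(cost, mu_hat, errordef, h=0.1):
    """Parabolic error: the step in mu that raises the cost by errordef."""
    second_derivative = (cost(mu_hat + h) - 2.0 * cost(mu_hat) + cost(mu_hat - h)) / h**2
    return math.sqrt(2.0 * errordef / second_derivative)

mu_hat = float(n_obs)  # maximum-likelihood estimate for a single Poisson bin

sigma_neg2_matched = sigma_from_curvature(neg2_log_l, mu_hat, errordef=1.0)
sigma_nll_matched = sigma_from_curvature(neg_log_l, mu_hat, errordef=0.5)
sigma_mismatched = sigma_from_curvature(neg2_log_l, mu_hat, errordef=0.5)

print(f"-2 ln L with errordef 1.0: sigma = {sigma_neg2_matched:.3f}")
print(f"  -ln L with errordef 0.5: sigma = {sigma_nll_matched:.3f}")
print(f"-2 ln L with errordef 0.5: sigma = {sigma_mismatched:.3f}")
```

Both matched combinations should give an error close to $\sqrt{100} = 10$, while the mismatched one comes out a factor $\sqrt{2}$ too small.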

In [None]:
future_time = np.arange(120)

fig, ax = plt.subplots(1,1, figsize = (10, 6))
ax.set_xlim(0,120)
ax.bar(time, data_Npos)
ax.set(xlabel = 'Time (days)', ylabel = 'Number of positive tests')
ax.plot(future_time, sigmoid(future_time, slope, thalf,    samples), label = "Actual progression of the disease" )
ax.plot(future_time, sigmoid(future_time, *mfit.values[:], samples), label = 'Binned Likelihood fit')
ax.plot(future_time, sigmoid(future_time, *mfit_chi2.values[:]), label = 'Chi2 fit')
ax.vlines(thalf,               0, samples/2, label = 'Actual $t_{half}$', ls = '--')
ax.vlines(mfit.values[1],      0, samples/2, label = 'Binned likelihood $t_{half}$', ls = '--', color = 'tab:orange')
ax.vlines(mfit_chi2.values[1], 0, samples/2, label = 'Chi2 $t_{half}$', ls = '--', color = 'tab:green')

# Shade the 1 sigma band around the Binned Likelihood t_half
t_lo = mfit.values[1] - mfit.errors[1]
t_hi = mfit.values[1] + mfit.errors[1]
ax.fill_between([t_lo, t_hi], [0, 0],
                [sigmoid(t_lo, *mfit.values, samples), sigmoid(t_hi, *mfit.values, samples)],
                color = 'tab:orange', alpha = 0.5)
ax.legend();

## Questions

1. Alter the number of samples taken per day from 1000 to, say, 200, 100, and 50, and see the effect of the statistics.

_Example Answer 1:_ As we increase the number of samples, the relative uncertainty on the measurements decreases, which makes the fit much better. As we decrease the number of samples, the relative uncertainty increases a lot, and some daily counts even end up as zero, making it much harder for the Chi2 method to produce a good fit. This is where the Binned Likelihood fit shines, and where it is preferable to the Chi2 method. One could also try an Unbinned Likelihood fit, but this would entail trying to "unbin" the data, which complicates things unnecessarily.
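
To make the effect of statistics concrete, the sketch below (not from the original notebook; the helper names are illustrative) uses the truth values `slope = 0.075` and `thalf = 75` to compute, for each choice of `samples`, the expected count on the last day and the expected number of days with zero positives among the first 30. Zero-count days are exactly the bins a Chi2 fit with $\sqrt{N}$ uncertainties has to discard.

```python
import math

slope, thalf, n_days = 0.075, 75, 30   # truth values used in this notebook

def expected_count(day, samples):
    """Sigmoid prediction for the number of positives on a given day."""
    return samples / (1.0 + math.exp(-slope * (day - thalf)))

def expected_empty_days(samples):
    """Sum over days of the Poisson probability of observing zero, P(0) = exp(-mu)."""
    return sum(math.exp(-expected_count(day, samples)) for day in range(n_days))

summary = {}
for samples in (1000, 200, 100, 50):
    mu_last = expected_count(n_days - 1, samples)
    summary[samples] = (mu_last, expected_empty_days(samples))
    print(f"samples = {samples:4d}: expected count on last day = {mu_last:5.1f}, "
          f"expected empty days = {summary[samples][1]:4.1f}")
```

The expected number of empty days grows as `samples` shrinks, which is where the Chi2 fit loses information that the Binned Likelihood fit still uses.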

***

2. Use samples=100, and alter the number of days that we have tested for the disease, to investigate how early the binned likelihood (and the ChiSquare) fit can reasonably predict the outcome.

_Example Answer 2:_ Even at low values like 15 days we almost always get a good fit with the Binned Likelihood. However, the ChiSquare fit now certainly has its problems, and occasionally gets the answer completely wrong.

***

### (Semi) Advanced question
3. Right now we assume that the disease starts the day we start testing for it. Usually we are not that great at predicting the onset of a disease. Alter the script so that the $t_{1/2}$ prediction is given as the number of days into the future (i.e. after day 30).
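
One possible approach (a sketch under assumptions, not the authors' solution) is to reparametrise the sigmoid so that the fitted parameter is the number of days from the end of the observation period to $t_{1/2}$. The names `sigmoid_ahead`, `days_ahead`, and `today` are hypothetical; `today` is assumed to equal `Ndays` from the notebook.

```python
import math

today = 30   # number of days observed so far (assumed equal to Ndays in the notebook)

def sigmoid_ahead(time, slope, days_ahead, N):
    """Sigmoid whose midpoint lies days_ahead days after 'today'."""
    return N / (1.0 + math.exp(-slope * (time - (today + days_ahead))))

# With the truth values slope = 0.075 and thalf = 75, the midpoint lies 45 days ahead,
# so evaluating at day 30 + 45 gives half of the daily samples.
half_level = sigmoid_ahead(today + 45, 0.075, 45, 1000)
print(half_level)
```

In the notebook one would use `np.exp` instead of `math.exp` so that the function accepts arrays, and pass `days_ahead` to `Minuit` in place of `thalf`. Since the two parameters differ only by a constant shift, the fitted uncertainty on `days_ahead` is the same as the one on `thalf`.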