## Self-exciting point process models

These are a class of models which take an explicit statistical model, fit that model to the data, and then use the fitted model to produce a risk intensity, and thus a prediction.

The statistical model is a [Point Process](https://en.wikipedia.org/wiki/Point_process).  The events occurring are split into two classes:

- "Background" events which are modelled as an [Inhomogeneous Poisson Process](https://en.wikipedia.org/wiki/Poisson_point_process#Inhomogeneous_Poisson_point_process).  That is, these events occur randomly in time and space, with the probability of occurrence (a normalised "intensity") varying in time and space, but not depending on past events.
- "Triggered" events which do depend upon the past.  Typically an event will trigger further events with some probability / intensity, usually centred on the trigger event.

This model fits the theoretical pattern of e.g. burglaries whereby a single perhaps "random" event can then lead to repeat events at the same or similar location.

Further reading:

- In the financial mathematics / statistics literature, such processes are called [Hawkes Processes](http://mathworld.wolfram.com/HawkesProcess.html).
- See the introductory notes: [arXiv:1507.02822 [math.PR]](https://arxiv.org/abs/1507.02822).
- Be aware that the term "Hawkes process" can also be used to describe _one specific_ form of the trigger intensity, which is not appropriate for us.
- Such models are also common in earthquake prediction.

References:

1. Mohler et al, "Self-Exciting Point Process Modeling of Crime", Journal of the American Statistical Association 2011, DOI:10.1198/jasa.2011.ap09546
2. Ogata, "On Lewis' Simulation Method for Point Processes", IEEE Transactions on Information Theory, 1981.
3. Rasmussen, "Temporal point processes: the conditional intensity function", [Lecture notes](http://people.math.aau.dk/~jgr/teaching/punktproc11/tpp.pdf)
4. Rosser, Cheng, "Improving the Robustness and Accuracy of Crime Prediction with the Self-Exciting Point Process Through Isotropic Triggering" Appl. Spatial Analysis DOI:10.1007/s12061-016-9198-y

## Probability theory background

We work with events occurring in two dimensional space, at coordinates $(x,y)$, and at time $t$.  Sometimes we think of time as being special, and so a different dimension, and sometimes we just think of a three dimensional point process in coordinates $(x,y,t)$.

We let $\lambda(x,y,t)$ be the "conditional intensity" of the point process.  This is defined to be the expected number of events seen in a small region about $(x,y,t)$, divided by the volume of that region, in the limit as the size of the region tends to 0.  Intuitively, a larger intensity means that we are more likely to see events near $(x,y,t)$.
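In symbols, one way to write this definition is the following.  This is a sketch only: the notation $B_\delta$, $N$ and $\mathcal{H}_t$ is introduced here for illustration and is not taken from the references.

$$ \lambda(x,y,t) = \lim_{\delta\to 0} \frac{\mathbb{E}\big[ N(B_\delta) \mid \mathcal{H}_t \big]}{|B_\delta|} $$

Here $B_\delta$ is a small region about $(x,y,t)$ of size $\delta$, $|B_\delta|$ is its volume, $N(B_\delta)$ is the number of events falling in $B_\delta$, and $\mathcal{H}_t$ is the history of all events occurring before time $t$.  The conditioning on $\mathcal{H}_t$ is what makes the intensity "conditional".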

(An intuitive view of probability is the [Frequentist viewpoint](https://en.wikipedia.org/wiki/Probability_interpretations#Frequentism).  If you roll a die repeatedly, then the statement that the probability of getting a 3 is 1/6 means that, over a large number of repeated trials, you would expect to see the die land on 3 about 1/6 of the time.  I find it quite hard to give an interpretation, within this framework, of a point process where the _random_ occurrences of past events change the intensity function, because it seems difficult to say exactly what "repeating the process" would mean.  Of course, in the mathematical axiomatisation of modern probability theory, there are existence theorems for all the models we will study.)

Our model gives $\lambda$ in the following form.  Let $(x_i,y_i,t_i)$ be the events.  Then

$$ \lambda(x,y,t) = \nu(t)\mu(x,y) + \sum_{i : t_i<t} g(t-t_i, x-x_i, y-y_i). $$

Let us explain the two parts:

- $\nu(t)\mu(x,y)$ is the "background intensity".  The model assumes that this factorises into two parts: $\nu(t)$ the time varying component, and $\mu(x,y)$ the space component.
- For example, $\nu(t)$ could vary depending on time of day or the day of the week.
- $\mu(x,y)$ varies to allow certain areas to have a higher likelihood of crime than other areas.
- This form is easier to study than assuming $\mu(x,y,t)$, but does not allow us to model changes which occur in a coupled way in space and time (e.g. a major sporting event near a stadium might increase the intensity of crime in the area and time together).

The sum is taken over all events which occurred _before_ the current time.

- $g$ is a three dimensional intensity function (though only ever evaluated for $t>0$).
- We evaluate it at $g(t-t_i,x-x_i,y-y_i)$, so only the "delta" between the trigger $t_i$ and the next event at $(t,x,y)$ matters.
- Thus the model is that an event at $(t_i,x_i,y_i)$ causes an increase in the intensity around this point.
- The increase in intensity is the same for all trigger points: there is no dependence on the location of the trigger.
- Again, this is a simplifying assumption.  Intuitively, we might expect a model of burglary to have a different "excitation function" $g$ if we were studying burglary in inner city terrace houses, vs burglary in low density ex-urban environments.
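To make the formula for $\lambda$ concrete, here is a minimal Python sketch which evaluates it for a given list of events.  The function names, and the particular trigger `gaussian_trigger` (exponential decay in time, Gaussian in space, with made-up parameters `theta`, `omega`, `sigma`), are assumptions chosen for illustration; they are not the forms fitted in (1).

```python
import math

def gaussian_trigger(dt, dx, dy, theta=0.5, omega=1.0, sigma=0.1):
    """Hypothetical trigger g: exponential decay in time, Gaussian in space."""
    if dt <= 0:
        return 0.0  # g is only ever evaluated for positive time differences
    time_part = theta * omega * math.exp(-omega * dt)
    space_part = (math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2))
                  / (2 * math.pi * sigma ** 2))
    return time_part * space_part

def conditional_intensity(x, y, t, events, nu, mu, g):
    """Evaluate nu(t) * mu(x, y) plus the sum of g over events with t_i < t.

    `events` is a list of (x_i, y_i, t_i) triples.
    """
    background = nu(t) * mu(x, y)
    triggered = sum(g(t - ti, x - xi, y - yi)
                    for (xi, yi, ti) in events if ti < t)
    return background + triggered

# Constant background (an assumption) and two past events close together.
events = [(0.5, 0.5, 1.0), (0.52, 0.49, 2.0)]
nu = lambda t: 1.0
mu = lambda x, y: 1.0

lam_near = conditional_intensity(0.5, 0.5, 2.1, events, nu, mu, gaussian_trigger)
lam_far = conditional_intensity(0.0, 0.0, 2.1, events, nu, mu, gaussian_trigger)
lam_before = conditional_intensity(0.5, 0.5, 0.5, events, nu, mu, gaussian_trigger)
```

Near the two events the intensity is raised well above the background, far away it is essentially just the background, and before any event has occurred it is exactly the background.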

## Parametric vs non-parametric forms

The original Hawkes process, and self-exciting point processes as used in earthquake research, are often [Parametric](https://en.wikipedia.org/wiki/Parametric_statistics) in nature, the underlying form of the functions $\nu, \mu$ and $g$ coming from theory.

An alternative is to take a [Non-Parametric](https://en.wikipedia.org/wiki/Nonparametric_statistics) viewpoint, and estimate the functions using kernel density estimation (KDE).  This is the approach taken by (1), where a variable bandwidth KDE is used.  However, the following quote from (1) (page 104) should be noted:

> ... we find that for predictive purposes variable bandwidth KDE is less accurate than fixed bandwidth KDE.  We therefore estimate $\mu (x,y)$ in Equation (10) using fixed bandwidth Gaussian KDE.

The reasons given behind this change in methodology are, I believe, an admission that looking at the _local geography_ in which crime is taking place is important.
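To illustrate the difference between the two kinds of KDE, here is a small pure Python sketch.  It is not the estimator from (1): in particular, taking each point's bandwidth to be the distance to its $k$-th nearest neighbour is just one common choice which we assume here, and the function names and the `min_bandwidth` guard are our own.

```python
import math

def fixed_bandwidth_kde(points, bandwidth):
    """2D Gaussian KDE where every point uses the same bandwidth."""
    norm = 1.0 / (len(points) * 2 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for (px, py) in points:
            r2 = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-r2 / (2 * bandwidth ** 2))
        return norm * total

    return density

def variable_bandwidth_kde(points, k, min_bandwidth=1e-6):
    """2D Gaussian KDE where each point has its own bandwidth.

    Assumption: the bandwidth of a point is the distance to its k-th
    nearest neighbour (requires 1 <= k < len(points)).
    """
    bandwidths = []
    for (px, py) in points:
        dists = sorted(math.hypot(px - qx, py - qy) for (qx, qy) in points)
        # dists[0] is the point itself; guard against duplicate points.
        bandwidths.append(max(dists[k], min_bandwidth))

    def density(x, y):
        total = 0.0
        for (px, py), h in zip(points, bandwidths):
            r2 = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-r2 / (2 * h * h)) / (2 * math.pi * h * h)
        return total / len(points)

    return density

# A tight cluster of three points and one isolated point.
points = [(0.2, 0.2), (0.25, 0.2), (0.2, 0.3), (0.8, 0.7)]
fixed = fixed_bandwidth_kde(points, bandwidth=0.1)
variable = variable_bandwidth_kde(points, k=1)
```

With a variable bandwidth, the isolated point is smeared over a wide area while the clustered points give sharp peaks; with a fixed bandwidth, every point contributes a bump of the same width.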

## Simulation

There are two main simulation methods, both of which treat time as a special variable, and seek to generate events ordered in time.  We should say that, in the abstract, the point processes we are discussing are not bounded in space or time, and so there is no notion of "start time".  Thus a "perfect" simulation is difficult, as we do not (and cannot) know how far back in time we need to start in order to get an accurate simulation around time zero.  A similar problem obviously occurs in fitting data: if we have data from a relative time of $0$, then events close to $0$ might well be triggered by events before time $0$ (i.e. events we do not know about).  There is a fair amount of theory to suggest that events should not be triggered from _too far in the past_ (order of 2 months is discussed regularly in the literature) and so we shall tacitly ignore such problems.

The more general method is a form of [Rejection Sampling](https://en.wikipedia.org/wiki/Rejection_sampling), and is known in the literature as Ogata's thinning procedure, see (2).  Suppose we have simulated events at times $t_1 < t_2 < \cdots < t_n$.  We seek a fixed number $\lambda_{\max}$ such that
$$ \lambda_{\max} \geq \max_{x,y,t>t_n} \lambda(t,x,y) $$
where $\lambda$ is computed from the events simulated so far.  Let $t_{\text{current}} = t_n$.  We then sample the next point $(t,x,y)$ with $t>t_{\text{current}}$ from an ordinary, homogeneous Poisson process with intensity $\lambda_{\max}$.

- We must constrain space to a finite area for this to make sense.
- To speed up the algorithm, we might instead choose a function $\lambda_{\max}(x,y)$ which dominates $\lambda$ and sample from the time homogeneous, space inhomogeneous process instead.
- We then decide to accept or reject the point: pick $z$ uniformly at random between $0$ and $\lambda_{\max}$, and then accept if $z\leq \lambda(t,x,y)$.  Thus the point is accepted with probability $\lambda(t,x,y)/\lambda_{\max}$.
- If we reject, we try again, but update $t_{\text{current}}$ to be $t$, so each time we are advancing time.

This method, as with all rejection techniques, can be rather slow.
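The thinning procedure can be sketched in Python as follows.  This is an illustration under stated assumptions, not the implementation from (2): we start at time $0$ with no history, restrict space to a rectangle, and use a hypothetical model (`make_model`) with constant background `mu0` and a trigger which decays exponentially in time and is Gaussian in space.  Because that trigger only decays, the intensity evaluated at the current time and at zero spatial offset gives a valid upper bound until the next event is accepted.

```python
import math
import random

def make_model(mu0, theta, omega, sigma):
    """Hypothetical model: constant background, exponential/Gaussian trigger."""
    peak = 1.0 / (2 * math.pi * sigma ** 2)  # largest value of the spatial part

    def intensity(t, x, y, events):
        total = mu0
        for (ti, xi, yi) in events:
            dt = t - ti
            if dt > 0:
                r2 = (x - xi) ** 2 + (y - yi) ** 2
                total += (theta * omega * math.exp(-omega * dt)
                          * peak * math.exp(-r2 / (2 * sigma ** 2)))
        return total

    def bound(t_current, events):
        # Each term only decays for t > t_current, so this dominates lambda.
        return mu0 + sum(theta * omega * math.exp(-omega * (t_current - ti)) * peak
                         for (ti, _, _) in events)

    return intensity, bound

def simulate_thinning(intensity, bound, width, height, t_end, rng):
    """Simulate events (t, x, y) on [0, width] x [0, height] up to time t_end."""
    events = []
    area = width * height
    t_current = 0.0
    while True:
        lam_max = bound(t_current, events)
        # Candidate point from a homogeneous Poisson process of intensity lam_max.
        t = t_current + rng.expovariate(lam_max * area)
        if t > t_end:
            return events
        x = rng.uniform(0, width)
        y = rng.uniform(0, height)
        z = rng.uniform(0, lam_max)
        if z <= intensity(t, x, y, events):
            events.append((t, x, y))  # accept
        t_current = t  # advance time whether we accepted or rejected

rng = random.Random(0)
intensity, bound = make_model(mu0=5.0, theta=0.5, omega=1.0, sigma=0.1)
events = simulate_thinning(intensity, bound, 1.0, 1.0, 10.0, rng)
```

Note that triggered events which would fall outside the rectangle are never generated, so this sketch slightly under-represents triggering near the boundary.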

### Make use of the branching structure

An alternative simulation technique is to notice that the total intensity is a _sum_ of terms, so each term contributes essentially a new, independent process.  This suggests a simulation strategy reminiscent of a [Branching Process](https://en.wikipedia.org/wiki/Branching_process):

- We firstly simulate an inhomogeneous Poisson process with intensity $\nu(t)\mu(x,y)$.  These are the "background events".
- We then iteratively pass over _all_ the events we have so far generated, in time order.
- Suppose we are now processing an event at $(t_i,x_i,y_i)$.  This will generate further events with intensity $g(t-t_i, x-x_i, y-y_i)$ which we simulate again as an inhomogeneous Poisson process.
- It is important to note that we do not simply simulate the background events, and then allow each of them to trigger extra events.  It is quite allowed that a triggered event itself triggers further events, and so on and so forth.  But each time, the triggered events will be in the future, and so if we only want to simulate a finite time window, we will eventually simulate all events in that window.
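The steps above can be sketched in Python as follows.  As before, the concrete model is an assumption made for illustration: a constant background rate `mu0`, and a trigger for which each event has a Poisson number of direct offspring with mean `theta`, exponentially distributed delays with rate `omega`, and Gaussian spatial offsets with standard deviation `sigma`.  We assume `theta < 1` so that the cascades stay small.  A heap is used so that events are processed, and returned, in time order.

```python
import heapq
import random

def poisson(rate, rng):
    """Sample Poisson(rate) by counting unit-rate arrivals in [0, rate]."""
    n = 0
    total = rng.expovariate(1.0)
    while total < rate:
        n += 1
        total += rng.expovariate(1.0)
    return n

def simulate_branching(mu0, theta, omega, sigma, width, height, t_end, rng):
    """Simulate events (t, x, y) on the time window [0, t_end)."""
    # Background events: homogeneous here for simplicity (an assumption).
    n_background = poisson(mu0 * width * height * t_end, rng)
    heap = [(rng.uniform(0, t_end), rng.uniform(0, width), rng.uniform(0, height))
            for _ in range(n_background)]
    heapq.heapify(heap)

    events = []
    while heap:
        ti, xi, yi = heapq.heappop(heap)  # earliest unprocessed event
        events.append((ti, xi, yi))
        # Every event, background or triggered, may trigger further events.
        for _ in range(poisson(theta, rng)):
            t = ti + rng.expovariate(omega)
            if t >= t_end:
                continue  # falls outside the simulated time window
            child = (t, xi + rng.gauss(0, sigma), yi + rng.gauss(0, sigma))
            heapq.heappush(heap, child)
    return events

rng = random.Random(0)
events = simulate_branching(mu0=5.0, theta=0.5, omega=1.0, sigma=0.1,
                            width=1.0, height=1.0, t_end=10.0, rng=rng)
```

Unlike the thinning sketch, triggered events here are allowed to land outside the rectangle; filter them afterwards if only events in the region are wanted.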

# Model fitting

We now describe the optimisation algorithm from (1) (see also the description in (4)).