# Introduction to Quant Finance

## Module 1.5: Bayesian inference

### 1.5.3 Representing prior knowledge: postcodes

In the last module we looked at incorporating prior knowledge into our models, but effectively cheated by saying "all outcomes are just as likely, until we have data". Often we have more information than that.

In this module we look at how to take prior information about the domain we are investigating and use it to alter our model.


- France's *La Poste* has used automated sorting since 1964.
- Handwritten digit recognition has been well studied.
- To use digit recognition in practice for sorting mail, we need a *prior* model for how probable each postcode is, independent of the digitized handwritten image in front of us.

A sample of some Australian and international postcodes:
- 2000 (Sydney)
- 3122 (Hawthorn, VIC)
- 4350 (used for 44 towns near Toowoomba, QLD)
- 8007 (PO boxes in Collins Street West)
- A-1220 (Vienna, Austria)
- Tsuen Wan (Hong Kong): no postcodes in HK
- 02138 (Cambridge, MA)
- EC1V 4AD (London)


### A prior for Australian postcodes

**What prior information do we have?**

- Do all Australian postcodes have 4 digits? Yes.
- What range? 0200 to 9944
- States:
   - NSW: postcodes 1000-1999 (PO boxes), 2000-2599, 2620-2899, 2921-2999
   - ACT: 0200-0299 (PO boxes), 2600-2619, 2900-2920
   - VIC: 3000-3999, 8000-8999 (PO boxes)
   - QLD: 4000-4999, 9000-9999 (PO boxes)
   - SA: 5000-5799, 5800-5999 (PO boxes)
   - WA: 6000-6797, 6800-6999 (PO boxes)
   - TAS: 7000-7799, 7800-7999 (PO boxes)
   - NT: 0800-0899, 0900-0999 (PO boxes)

**Also:**

- 25% of all mail goes to these CBD postcodes: 2000, 2001, 3000, 3001, 4000, 4001, 5000, 5001, 6000, 6001.
- We can look up population for each postcode. Or, if we don't have population info by postcode, we could seed the prior with state population data.
- Within each state, 80% of mail goes to postcodes of the form x0xx and x1xx (metropolitan city areas and suburbs).

### How do we encode this prior information for machine learning purposes?

### Goal: construct a prior $p(\textrm{postcode})$ over all 4-digit postcodes

### Valid ranges


In [1]:
# Delivery-area postcodes only; PO box ranges are not included.
# range() excludes its upper bound, so 2900-2920 needs range(2900, 2921).
act_postcodes = set(range(2600, 2620)) | set(range(2900, 2921))

postcodes_by_state = {
    'Australian Capital Territory': act_postcodes,
    'New South Wales': set(range(2000, 3000)) - act_postcodes,
    'Victoria': set(range(3000, 4000)),
    'Queensland': set(range(4000, 5000)),
    'South Australia': set(range(5000, 5800)),
    'Western Australia': set(range(6000, 6798)),
    'Tasmania': set(range(7000, 7800)),
    'Northern Territory': set(range(800, 900)),
}
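
Because the prior will assign each postcode to exactly one state, it is worth confirming that the state sets do not overlap. The following is a minimal sketch: it rebuilds the sets from the delivery-area ranges in the state list above (PO box ranges are left out, as in the cell above), and the names `delivery_ranges`, `expand` and `sets_by_state` are illustrative assumptions rather than part of the notebook.

```python
from itertools import combinations

# Delivery-area ranges (inclusive) taken from the state list above;
# PO box ranges are deliberately omitted. `delivery_ranges` is a
# hypothetical name used only for this check.
delivery_ranges = {
    'Australian Capital Territory': [(2600, 2619), (2900, 2920)],
    'New South Wales': [(2000, 2599), (2620, 2899), (2921, 2999)],
    'Victoria': [(3000, 3999)],
    'Queensland': [(4000, 4999)],
    'South Australia': [(5000, 5799)],
    'Western Australia': [(6000, 6797)],
    'Tasmania': [(7000, 7799)],
    'Northern Territory': [(800, 899)],
}

def expand(ranges):
    """Turn inclusive (low, high) pairs into a set of postcodes."""
    codes = set()
    for low, high in ranges:
        codes |= set(range(low, high + 1))
    return codes

sets_by_state = {state: expand(r) for state, r in delivery_ranges.items()}

# No postcode may belong to two states.
overlaps = [(a, b) for a, b in combinations(sets_by_state, 2)
            if sets_by_state[a] & sets_by_state[b]]
counts = {state: len(codes) for state, codes in sets_by_state.items()}
print(overlaps)
print(counts['Australian Capital Territory'], counts['New South Wales'])
```

With inclusive upper bounds, the ACT ends up with 41 delivery postcodes and NSW with 959, and the list of overlaps is empty.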

### State populations

We will start by using state populations as a proxy for really knowing the proportion of mail sent to each postcode.

(If we obtain more data, we can update and improve our model by applying Bayes' theorem later.)

In [2]:
import pandas as pd

state_populations = pd.read_hdf('../Data/aus_state_populations.h5')

In [None]:
state_populations

These are the desired feature expectations for each state.

In [None]:
# Source of the data:
def fetch_state_populations():
    url = 'http://www.ausstats.abs.gov.au/Ausstats/subscriber.nsf/0/D52DEAAFCEDF7B2ACA2580EB00133359/$File/31010do001_201609.xls'

    # Recent pandas versions spell this keyword `sheet_name` (formerly `sheetname`).
    state_pop = pd.read_excel(url, sheet_name='Table_8', skiprows=6,
                              names=['State', 'Population', '%'])

    state_pop.set_index('State', inplace=True)

    drop_row_idx = list(state_pop.index).index('Other Territories')

    state_pop.drop(state_pop.index[drop_row_idx:], inplace=True)

    state_pop['Population'] = state_pop['Population'].astype(int)
    # state_pop.to_hdf('state_populations.h5', key='populations')
    return state_pop

### How to incorporate this?

... to model the probability of e.g. $p(\textrm{postcode}=3122)$?

In [None]:
def prior_state(state):
    return state_populations['%'].loc[state] / 100

In [None]:
prior_state('New South Wales')

Now we have a prior $p(\text{state})$.

### From the law of total probability:

$p(\textrm{postcode}) = \sum_{\textrm{all states}} p(\textrm{postcode | state}) p(\textrm{state})$

#### Exercise

Assuming you have a function `prior_postcode_given_state(postcode, state)`, implement this as a function `prior_postcode(postcode)`.

### Solution hint:

Iterate over all states in `state_populations.index`.

### Solution:


In [None]:
def prior_postcode(postcode):
    p = 0.0
    for state in state_populations.index:
        p += prior_postcode_given_state(postcode, state) * prior_state(state)
    assert p <= 1
    return p

#### Exercise

- Now write the function `prior_postcode_given_state(postcode, state)`.

Assume you can assign equal probability to each valid postcode in the corresponding state (or 0 probability for the wrong state).

### Then try out both your functions -- for example:


In [None]:
prior_postcode_given_state(3122, 'Victoria')

prior_postcode(3122)

### Solution:


In [None]:
def prior_postcode_given_state(postcode, state):
    postcodes = postcodes_by_state[state]
    return 1 / len(postcodes) if postcode in postcodes else 0

In [None]:
prior_postcode_given_state(3122, 'Victoria')

In [None]:
prior_postcode(3122)

### What did we do?

We informally constructed a prior model that is as **flat** (uninformative) as possible **subject to a constraint**: the total probability assigned to each state's postcodes equals that state's share of the population. Each postcode's probability is therefore its state's population share divided by the number of postcodes in that state.
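
Written out for a single postcode, and assuming (as in `postcodes_by_state`) that the state postcode sets are disjoint, only one term of the sum over states is non-zero. Here $N_{\textrm{state}}$ is notation introduced for this note: the number of valid postcodes in the state containing the postcode.

$$p(\textrm{postcode}) = p(\textrm{postcode} \mid \textrm{state}) \, p(\textrm{state}) = \frac{p(\textrm{state})}{N_{\textrm{state}}}$$

For example, with the 1000 Victorian postcodes in `postcodes_by_state`, $p(\textrm{postcode}=3122) = p(\textrm{Victoria}) / 1000$. Summing over all postcodes in a state recovers $p(\textrm{state})$, so the prior sums to 1 only to the extent that the state shares themselves do.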

### Consider now: how would you update the model to reflect that ...

1. 25% of all mail goes to one of the CBD postcodes; and
2. Within each state, 80% of mail goes to postcodes of the form x0xx and x1xx (metropolitan city areas and suburbs)?

### Maximum entropy models: the easy way

Here we see how to derive such prior models in a more systematic and principled way using the `maxentropy` package.
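
To see what such a model looks like underneath, here is a minimal pure-Python sketch on a toy problem. It is not how `maxentropy` is implemented. It assumes a made-up sample space of 20 codes, a single binary feature, and a target expectation of 0.25; all names are hypothetical. A maximum entropy distribution under expectation constraints has the exponential form $p(x) \propto \exp(\lambda f(x))$, so the sketch searches by bisection for the $\lambda$ that meets the constraint.

```python
import math

# Toy setup (all values hypothetical): 20 codes, two of them "CBD".
toy_space = list(range(20))
toy_cbd = {0, 1}
target = 0.25  # desired E[f], i.e. share of mail going to the CBD codes

def f(x):
    """Binary feature: 1 if the code is a CBD code, else 0."""
    return 1.0 if x in toy_cbd else 0.0

def distribution(lam):
    """p(x) proportional to exp(lam * f(x)), normalised over the space."""
    weights = [math.exp(lam * f(x)) for x in toy_space]
    total = sum(weights)
    return [w / total for w in weights]

def expectation(lam):
    """Expected value of f under the distribution for this lam."""
    return sum(p * f(x) for p, x in zip(distribution(lam), toy_space))

# E[f] increases monotonically with lam, so bisection finds the root.
low, high = -20.0, 20.0
for _ in range(100):
    mid = (low + high) / 2
    if expectation(mid) < target:
        low = mid
    else:
        high = mid
lam = (low + high) / 2
p = distribution(lam)

print(round(p[0], 6), round(p[5], 6))
```

The two CBD codes share the 0.25 equally (0.125 each) and the other 18 codes share the remaining 0.75 equally: the distribution is flat everywhere the constraint does not force a difference.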

### Step 1: Set up the domain (or "sample space")


In [None]:
import numpy as np
samplespace = np.arange(10000, dtype=np.uint16)

In [None]:
samplespace

### Step 2: Set up a list of feature functions whose expectations you want to constrain


In [None]:
def is_valid(postcodes):
    return [200 <= postcode < 10000 for postcode in postcodes]

In [None]:
# def in_nsw(postcodes):
#     return [postcode in postcodes_by_state['New South Wales'] for postcode in postcodes]
# etc.

In [None]:
def in_given_state(state):
    def in_state(postcodes):
        return [postcode in postcodes_by_state[state] for postcode in postcodes]
    return in_state

In [None]:
state_populations.index

In [None]:
features = [is_valid] + \
           [in_given_state(state) for state in state_populations.index]

In [None]:
features
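
The "expectation" of a feature is its probability-weighted average over the sample space. A minimal sketch, assuming a flat distribution over all 10000 four-digit codes and using plain Python lists instead of the notebook's arrays; the helper names `expectation` and `in_victoria` are illustrative only:

```python
# Flat distribution over 0000-9999, for illustration only.
codes = range(10000)
flat = [1 / 10000] * 10000

def is_valid(postcodes):
    return [200 <= postcode < 10000 for postcode in postcodes]

def in_victoria(postcodes):
    # Delivery-area range for Victoria from the state list.
    return [3000 <= postcode <= 3999 for postcode in postcodes]

def expectation(feature, probs, space):
    """E_p[f] = sum over x of p(x) * f(x); True counts as 1."""
    return sum(p * value for p, value in zip(probs, feature(space)))

e_valid = expectation(is_valid, flat, codes)
e_vic = expectation(in_victoria, flat, codes)
print(round(e_valid, 6), round(e_vic, 6))
```

Under the flat distribution these come out at 0.98 and 0.1. Fitting the model means moving the distribution until each expectation matches its target value instead.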

### Step 3: create a `MinDivergenceModel` object from this list of features and sample space


In [None]:
!pip install maxentropy

In [None]:
from maxentropy.skmaxent import MinDivergenceModel

model = MinDivergenceModel(features, samplespace)

### Step 4: define your desired array of expected feature function values (one for each feature)


In [None]:
pop = state_populations['%'] / 100
pop

In [None]:
state_populations['%'].sum()

(This excludes the other territories, like Norfolk Island. Ignore this for now.)

In [None]:
k = np.r_[0.999, pop.values].reshape(1, -1)

In [None]:
k

In [None]:
len(features) == k.shape[1]

### Step 5: fit your model under those constraints


In [None]:
model.fit(k)

In [None]:
model.expectations() - k

In [None]:
np.allclose(model.expectations(), k, atol=1e-6)

### Result: our fitted prior model is given by `model.probdist()`


In [None]:
model.probdist()

In [None]:
assert len(model.probdist()) == len(samplespace)

We now have a prior probability $\textrm{prior}(\textrm{postcode})$ for each 4-digit postcode.

### What are the most probable postcodes?


In [None]:
p = model.probdist()
np.argsort(p)[::-1]

### Visualized


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
fig, axes = plt.subplots(1, figsize=(12, 5))
plt.plot(samplespace, p, '.', )
axes.set_xlabel('postcode')
axes.set_ylabel('probability')

#### Exercise incorporating more prior knowledge

Now try to incorporate the additional prior knowledge that 25% of all mail
goes to the following CBD postcodes:

In [None]:
CBD_POSTCODES = {2000, 2001, 3000, 3001, 4000, 4001, 5000, 5001, 6000, 6001}

### Solution hint:
1. Define a new feature function `in_cbd(postcodes)` and append this to your list of features.
2. Add an additional value (0.25) to your array of constraint values of feature expectations.
3. Re-create your model passing in your new features.
4. Re-fit your model passing in the new constraints.

In [None]:
def in_cbd(postcodes):
    return [postcode in CBD_POSTCODES for postcode in postcodes]

features2 = features + [in_cbd]

In [None]:
k2 = np.c_[k, 0.25]
k2

In [None]:
assert len(features2) == k2.shape[1]

In [None]:
model2 = MinDivergenceModel(features2, samplespace)

In [None]:
model2.fit(k2);

In [None]:
assert np.allclose(model2.expectations(), k2, atol=1e-6)

In [None]:
p2 = model2.probdist()

In [None]:
fig, axes = plt.subplots(1, figsize=(12, 5))
plt.plot(samplespace, p2, '.', )
axes.set_xlabel('postcode')
axes.set_ylabel('probability')

We can see more by using a logarithmic vertical axis:

In [None]:
fig, axes = plt.subplots(1, figsize=(12, 5))
plt.semilogy(samplespace, p2, '.', )
axes.set_xlabel('postcode')
axes.set_ylabel('probability')

### More prior knowledge: CBD, inner suburbs, outer suburbs, regional centres

Here is an example of how to incorporate this extra information:

- Within each state, 80% of mail goes to postcodes of the form x0xx and x1xx (metropolitan city areas and suburbs).

In [None]:
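def which_ring(postcodes):
    """
    Return 100 times the hundreds digit of each postcode.

    Returns
    -------
    A list with one entry per postcode:
    0 if the postcode is x0xx
    100 if the postcode is x1xx
    200 if the postcode is x2xx
    ... and so on, up to 900 for x9xx
    """
    return [postcode % 1000 - postcode % 100 for postcode in postcodes]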

In [None]:
which_ring([1234, 800, 2900, 3000, 2001, 2099, 3122])

In [None]:
def in_city_metropolitan_area(postcodes):
    return [ring == 0 or ring == 100 for ring in which_ring(postcodes)]

In [None]:
in_city_metropolitan_area([3136, 3122, 2001])
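
As a check on the arithmetic, here is a self-contained sketch that repeats the two functions and evaluates them on the sample inputs. The expression `postcode % 1000 - postcode % 100` isolates the hundreds digit (times 100); the variable names `rings` and `metro` are introduced only for this check.

```python
def which_ring(postcodes):
    # Hundreds digit times 100: 3122 -> 122 - 22 = 100.
    return [postcode % 1000 - postcode % 100 for postcode in postcodes]

def in_city_metropolitan_area(postcodes):
    # Metropolitan means ring 0 (x0xx) or ring 100 (x1xx).
    return [ring in (0, 100) for ring in which_ring(postcodes)]

rings = which_ring([1234, 800, 2900, 3000, 2001, 2099, 3122])
metro = in_city_metropolitan_area([3136, 3122, 2001])
print(rings)  # [200, 800, 900, 0, 0, 0, 100]
print(metro)  # [True, True, True]
```

Note that 0800 and 2900 land in rings 800 and 900, so the x0xx/x1xx rule treats them as non-metropolitan even though both lie in the valid ranges for the NT and the ACT.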

In [None]:
features3 = features2 + [in_city_metropolitan_area]

In [None]:
k3 = np.c_[k2, 0.8]

In [None]:
model3 = MinDivergenceModel(features3, samplespace)

In [None]:
model3.fit(k3);

In [None]:
np.allclose(model3.expectations(), k3, atol=1e-5)

In [None]:
p3 = model3.probdist()

In [None]:
fig, axes = plt.subplots(1, figsize=(12, 5))
plt.semilogy(samplespace, p3, '.', )
axes.set_xlabel('postcode')
axes.set_ylabel('probability')

### Conclusion
This prior reflects **precisely** the information we put into the model.

No less:
- all constraints we placed on it are satisfied;

No more:
- no additional information is reflected / assumed which we didn't explicitly add. It is as flat as possible (maximal entropy) subject to our constraints.

### Next notebook: we will fuse this prior information with data for a better model


In [None]:
np.save('postcode_prior3.npy', p3)

### Aside: we cannot make `model` equivalent to `model2` by adding one constraint (`in_cbd`) and minimizing KL divergence from `model`

Let's try it ...

In [None]:
model4 = MinDivergenceModel([in_cbd], samplespace, model.log_probdist())

In [None]:
k4 = np.array([0.25], ndmin=2)

In [None]:
model4.fit(k4)

In [None]:
model4.expectations()

In [None]:
p4 = model4.probdist()

In [None]:
fig, axes = plt.subplots(1, figsize=(12, 5))
plt.semilogy(samplespace, p4, '.', )
axes.set_xlabel('postcode')
axes.set_ylabel('probability')

The result is different because the process is different. We are no longer asserting the same constraints as before -- we are only asserting one single constraint. So this will in general have higher entropy (be flatter) than the more constrained model.
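
A toy calculation makes the difference concrete. This is a sketch under stated assumptions, not a reproduction of `model4`: it uses a made-up 10-point sample space with one "state" constraint and one "CBD" constraint (all names and numbers hypothetical), and it relies on the fact that for a single binary feature the minimum-divergence update simply rescales the probability inside and outside the feature's set.

```python
# Toy sample space (hypothetical): codes 0-9, a "state" {0..4}, a "CBD" {0}.
space = range(10)
state = {0, 1, 2, 3, 4}
cbd = {0}
state_share, cbd_share = 0.6, 0.25

# Step 1: flattest distribution with P(state) = 0.6.
prior = [state_share / len(state) if x in state
         else (1 - state_share) / (len(space) - len(state)) for x in space]

# Step 2: minimum-KL update of `prior` subject only to P(cbd) = 0.25.
# For one binary feature this rescales mass inside and outside the set.
prior_cbd = sum(prior[x] for x in cbd)
updated = [prior[x] * cbd_share / prior_cbd if x in cbd
           else prior[x] * (1 - cbd_share) / (1 - prior_cbd) for x in space]

new_cbd_share = sum(updated[x] for x in cbd)
new_state_share = sum(updated[x] for x in state)
print(round(sum(updated), 6), round(new_cbd_share, 4), round(new_state_share, 4))
```

The updated distribution still sums to 1 and meets the CBD constraint, but the state share has drifted from 0.6 to about 0.66. The earlier constraint is no longer enforced, which is why the two-step result differs from fitting all constraints jointly.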