# MSCF 46982 Market Microstructure and Algorithmic Trading
# Fall 2025 Mini 2

Before you turn this assignment in, make sure everything runs as
expected. First, **restart the kernel** by selecting
Kernel$\rightarrow$Restart Kernel from the menubar (or using the `0,0`
keyboard shortcut), and then **run all cells** by selecting
Run$\rightarrow$Run All Cells from the menubar.

Make sure you replace all instances of `YOUR CODE HERE` or "YOUR ANSWER
HERE" with your solution and remove the `notimplemented` exception.


As indicated in the syllabus, you are encouraged to discuss the
material presented in class but not the specifics of assignments
(including this one).


---

In [1]:
import os
os.environ['PYKX_JUPYTERQ'] = 'true'
os.environ['PYKX_4_1_ENABLED'] = 'true'
import pykx as kx


PyKX now running in 'jupyter_qfirst' mode. All cells by default will be run as q code. 
Include '%%py' at the beginning of each cell to run as python code. 


## Probability of Informed Trading (PIN)

The stock market is filled with informed and uninformed traders
(participants).  But what is the likelihood that an informed trader is
trading a particular security?  The PIN statistic developed by Easley,
O'Hara et al. uses buy and sell volumes to compute the probability of
informed trading.  We implemented the [Easley, Hvidkjaer and O’Hara
(2010)][EHO] (EHO) factorization of the PIN log likelihood function
(with equal buy and sell uninformed trading rates $\epsilon$) in
class.

[EHO]: https://doi.org/10.1017/S0022109010000074 "Easley, Hvidkjaer and O'Hara (2010)"

In this assignment, you are asked to expand this model to
differentiate between buy and sell uninformed trading rates
$\epsilon_b$ and $\epsilon_s$, implement the [Lin and Ke (2011)][LK]
factorization which is more numerically stable than the original EHO
factorization, and use the [Yan and Zhang (2012)][YZ] parameter
initialization algorithm to increase the probability that the
parameter estimation process finds a valid solution.

[LK]: https://doi.org/10.1016/j.finmar.2011.03.001 "Lin and Ke (2011)"
[YZ]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=890486 "Yan and Zhang (2012)"

The steps include counting buy and sell trades with the Lee-Ready
algorithm, implementing the PIN ratio function and the PIN log
likelihood function, generating a list of initial parameters, using
the `scipy optimize` package to fit the PIN parameters for each set of
initial parameters, and finally calculating the PIN ratio.

Before beginning this assignment, make sure the `scipy` package is
installed by running the following from the Anaconda prompt.

```
conda install -c anaconda scipy
```

We start by opening a connection to the NYSE Daily TAQ database. The
symbol saved in `h` is a Kdb+ file handle.  It will be used for all
database communication.

NOTE: The database is located on a CMU server behind the firewall.
If you are doing this assignment from home, you will need to connect
to the CMU network using the [Cisco AnyConnect VPN][] software.

[Cisco AnyConnect VPN]:
https://www.cmu.edu/computing/services/endpoint/network-access/vpn/index.html
"Cisco AnyConnect VPN"


In [2]:
\c 5 100
/ windows and mac/linux use different environment variables
home:`HOME`USERPROFILE "w"=first string .z.o
upf:0N!` sv (hsym`$getenv home),`cmu_userpass.txt
h:`$":tcps://tpr-mscf-kx.tepper.cmu.edu:5000:",first read0 upf


In [3]:
%%py
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import minimize
from scipy.optimize import OptimizeResult
plt.style.use('default')

### Part A (2 Points) ###

Complete the `lr` function so that it downloads the total number of
buy and sell trades for a given list of symbols, list of trade
condition codes, and date.

The returned table should have 4 columns: `date`, `sym`, `B`, and `S`
where the `B` column indicates the total number of buy trades and the
`S` column indicates the total number of sell trades.

NOTE: Be careful not to include any records where the side is unknown
(i.e. null).
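
For intuition, the classification logic can be sketched in plain Python on a handful of made-up trades. This is a minimal illustration, not the database query requested above: the function name `classify_trades` and the sample prices (in integer cents) are assumptions. A trade above the prevailing midpoint is treated as a buy, a trade below it as a sell, and a trade at the midpoint falls back to the tick test (the direction of the last price change). A midpoint trade with no prior price change stays unknown and is excluded from both counts.

```python
def classify_trades(prices, bids, asks):
    """Return +1 (buy), -1 (sell) or None (unknown) for each trade."""
    sides = []
    tick = None   # direction of the most recent non-zero price change
    prev = None   # previous trade price
    for price, bid, ask in zip(prices, bids, asks):
        # tick test: update the direction only when the price actually moved
        if prev is not None and price != prev:
            tick = 1 if price > prev else -1
        prev = price
        mid = 0.5 * (bid + ask)
        if price > mid:        # quote rule: above the midpoint -> buy
            sides.append(1)
        elif price < mid:      # quote rule: below the midpoint -> sell
            sides.append(-1)
        else:                  # at the midpoint -> fall back to the tick test
            sides.append(tick)
    return sides

# hypothetical trades and prevailing quotes, in integer cents
prices = [1000, 1002, 1001, 1001, 1003]
bids   = [ 998, 1000, 1000, 1000, 1002]
asks   = [1002, 1002, 1002, 1002, 1004]

sides = classify_trades(prices, bids, asks)
B = sum(1 for s in sides if s == 1)    # unknown sides are not counted
S = sum(1 for s in sides if s == -1)
print(sides, B, S)
```

The first trade sits exactly at the midpoint and has no earlier price to compare against, so its side is `None` and it contributes to neither `B` nor `S`.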


In [57]:
/ (s)ym, trade (c)ondition codes, (d)a(t)e
lr:{[s;c;dt]
 t:select date, sym, time, price from trade where date=dt, sym in s, cond in c;
 t:aj[`sym`time;t]select sym, time, bid, ask from nbbo where date=dt;
 t:update mid:.5*ask+bid, tick:fills -1 0N 1@1+signum deltas[first price;price] by sym from t;
 t:update side:?[price>mid;1;?[price<mid;-1;tick]] from t;
 t:0!select B:sum side>0, S:sum side<0 by date, sym from t;
 t}

In [58]:
/ pass function to the database for execution
h (lr;`BAC`TSLA;" ";2020.02.03)

date       sym  B     S    
---------------------------
2020.02.03 BAC  34541 34475
2020.02.03 TSLA 66009 59999


Your results should match the following:

```
date       sym  B     S    
---------------------------
2020.02.03 BAC  34541 34475
2020.02.03 TSLA 66009 59999
```

In [59]:
rnd:{x*"j"$y%x}
assert:{if[not x~y;'"expecting '",(-3!x),"' but found '",(-3!y),"'"]}
/ confirm all columns are included
assert[`date`sym`B`S] cols h (lr;`BAC;" ";2020.02.03)
/ confirm schema is correct
assert["dsii"] first flip value meta h (lr;`BAC;" ";2020.02.03)
/ confirm query only selects specified dates
assert[2020.02.03+0 1]  exec distinct date from raze {h (lr;`BAC;" ";x)} each 2020.02.03 + 0 1
/ confirm only requested syms have been returned
assert[`BAC`TSLA] exec distinct sym from h (lr;`BAC`TSLA;" ";2020.02.03)
/ confirm correct number of rows have been returned
assert[2] count h (lr;`BAC`TSLA;" ";2020.02.03)
/ confirm only selected condition codes have been returned
assert[0i] first (h (lr;`BAC;"O";2020.02.03))`B
assert[34541i] first (h (lr;`BAC;" ";2020.02.03))`B

In [60]:
10#t:raze {h (lr;`PBJ;" ";x)} each 2020.02.01 + til 30
.pykx.set[`B] .pykx.tonp get `B set t`B
.pykx.set[`S] .pykx.tonp get `S set t`S


date       sym B  S 
--------------------
2020.02.03 PBJ 13 12
2020.02.04 PBJ 3  3 
2020.02.05 PBJ 7  13
2020.02.06 PBJ 3  3 
2020.02.07 PBJ 3  7 
..


### Part B (2 Points) ###

In class, we implemented a version of the PIN ratio which assumed
equal rates of buy and sell uninformed participants. The following is
the more generic PIN calculation, which differentiates between the
two.

$$
\text{PIN} =\frac{\alpha\mu}{\epsilon_b+\epsilon_s+\alpha\mu}
$$


Complete the `pin` function so that it computes the PIN ratio given
the parameter vector "eemad": $\epsilon_b$, $\epsilon_s$, $\mu$, $\alpha$, and $\delta$.


In [61]:
/ (e)psilon buy, (e)psilon sell, (m)u, (a)lpha, (d)elta
pin:{[eemad]
 am:eemad[3]*eemad[2];
 p:am%eemad[0]+eemad[1]+am;
 p}

In [62]:
/ test pin function
pin 10000 10000 100 .5 .5

0.002493766


Your results should match the following:
```
0.002493766
```
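
As a check, the test value can be reproduced by hand from the formula above with $\epsilon_b=\epsilon_s=10000$, $\mu=100$ and $\alpha=.5$ (note that $\delta$ does not enter the ratio):

$$
\text{PIN} = \frac{.5 \cdot 100}{10000 + 10000 + .5 \cdot 100} = \frac{50}{20050} \approx 0.002493766
$$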

In [63]:
/ confirm PIN limits
assert[1f] pin 0 0 1 .5 .5
assert[0f] pin 1 1 0 .5 .5
assert[0f] pin 1 1 1 0 .5
assert[.5] pin 2 2 2 2 .5

### Part C (2 Points) ###

This assignment requires us to implement the PIN **log likelihood** in
python.  Let's get some practice by first implementing the PIN *ratio*
in python.

Complete the python `pin` function so that it computes the PIN ratio
given the generic parameters "eemad": $\epsilon_b$, $\epsilon_s$,
$\mu$, $\alpha$, and $\delta$.


In [64]:
%%py
# (e)psilon buy, (e)psilon sell, (m)u, (a)lpha, (d)elta
def pin(eemad):
    epsilon_b, epsilon_s, mu, alpha, delta = eemad
    am = alpha * mu
    p = am / (epsilon_b + epsilon_s + am)
    return p

In [65]:
%%py
# test pin function
print(pin([10000,10000,100,.5,.5]))

0.0024937655860349127


Your results should match the following:

```
0.0024937655860349127
```

In [66]:
%%py
# confirm PIN limits
assert 1==pin([0, 0,1,.5,.5])
assert 0==pin([1, 1,0,.5,.5])
assert 0==pin([1, 1,1,0,.5])
assert .5==pin([2, 2,2,2,.5])


### Part D (4 Points) ###

In class, we implemented the [Easley, Hvidkjaer and O’Hara
(2010)][EHO] factorization of the PIN statistic which assumes a single
buy and sell uninformed participant rate. For this section, you will
implement a more generic PIN log likelihood statistic that
differentiates between buy and sell rates and uses the [Lin and Ke
(2011)][LK] factorization to make the calculation more numerically
stable and remove biases which make PIN values artificially low when B
and S values are high.

[EHO]: https://doi.org/10.1017/S0022109010000074 "Easley, Hvidkjaer and O'Hara (2010)"
[LK]: https://doi.org/10.1016/j.finmar.2011.03.001 "Lin and Ke (2011)"

$$
\begin{align}
\log \mathcal{L}(M|\Theta) &= \sum_{d=1}^n -\epsilon_b -\epsilon_s + B_d\log(\mu+\epsilon_b) + S_d\log(\mu+\epsilon_s) + e_{max,d} \\
&+ \sum_{d=1}^n \log \left( (1-\alpha)\exp(e_{1,d}-e_{max,d}) + \alpha(1-\delta)\exp(e_{2,d}-e_{max,d}) + \alpha\delta\exp(e_{3,d}-e_{max,d}) \right)
\end{align}
$$

where:
$$
\begin{align}
e_{1,d} &= -B_{d}\log\left(1+\frac{\mu}{\epsilon_b}\right)-S_{d}\log\left(1+\frac{\mu}{\epsilon_s}\right) \\
e_{2,d} & = -\mu- S_{d}\log\left(1 + \frac{\mu}{\epsilon_s} \right) \\
e_{3,d} &= -\mu- B_{d}\log\left(1 + \frac{\mu}{\epsilon_b} \right) \\
e_{\max, d} &= \max\left(e_{1,d}, e_{2,d}, e_{3,d} \right) \\
\end{align}
$$

Complete the `pin_ll` function so that it computes the log likelihood
given the parameters "eemad" $\epsilon_b$, $\epsilon_s$, $\mu$,
$\alpha$, $\delta$ and `B`,`S`.
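
The role of $e_{max,d}$ can be seen in isolation with a small sketch. The values below are made up and the helper names `naive_log_mix` and `stable_log_mix` are assumptions; the point is only that exponentiating large negative exponents directly underflows to zero, while subtracting the maximum first keeps at least one term equal to $\exp(0)=1$.

```python
import math

# hypothetical exponents and mixture weights for a single day
e = [-1000.0, -1001.0, -1002.0]
w = [0.5, 0.3, 0.2]

def naive_log_mix(w, e):
    # exp(-1000) underflows to 0.0, so the sum is 0.0 and log fails
    return math.log(sum(wi * math.exp(ei) for wi, ei in zip(w, e)))

def stable_log_mix(w, e):
    # factor out the largest exponent before exponentiating
    emax = max(e)
    return emax + math.log(sum(wi * math.exp(ei - emax) for wi, ei in zip(w, e)))

try:
    naive = naive_log_mix(w, e)
except ValueError:
    naive = None  # math.log(0.0) raises ValueError

stable = stable_log_mix(w, e)
print(naive, stable)
```

The same idea carries over to vectors of daily values, where the maximum is taken element-wise across the three exponents.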


In [71]:
/ (e)psilon buy, (e)psilon sell, (m)u, (a)lpha, (d)elta, (B)uy count, (S)ell count
pin_ll:{[eemad;B;S]
 eb:eemad[0]; es:eemad[1]; m:eemad[2]; a:eemad[3]; d:eemad[4];
 e1:(neg[B]*log 1+m%eb)+neg[S]*log 1+m%es;
 e2:(neg m)+neg[S]*log 1+m%es;
 e3:(neg m)+neg[B]*log 1+m%eb;
 emax:e1|e2|e3;
 sum1:sum (neg eb+es)+(B*log m+eb)+(S*log m+es)+emax;
 sum2:sum log ((1-a)*exp e1-emax)+((a*1-d)*exp e2-emax)+(a*d)*exp e3-emax;
 ll:sum1+sum2;
 ll}

In [72]:
/ test pin_ll function
pin_ll[3 3 4 .5 .5;B;S]

393.7007


Your results should match the following:
```
393.7007
```

In [73]:
/ confirm pin_ll calculation

assert[399] rnd[1] pin_ll[3 3 4 .5 1;B;S]
assert[393] rnd[1] pin_ll[3 3 4 1 .5;B;S]
assert[424] rnd[1] pin_ll[2 2 10 .5 .5;B;S]
assert[379] rnd[1] pin_ll[1 1 20 .5 .5;B;S]
assert[288] rnd[1] pin_ll[20 20 1 .5 .5;B;S]

### Part E (4 Points) ###

We will need to use this function in the optimizer objective
function, so let's convert the `pin_ll` to python.

Complete the python `pin_ll` function so that it computes the PIN log
likelihood given the parameters "eemad": $\epsilon_b$, $\epsilon_s$, $\mu$,
$\alpha$, $\delta$ and `B`,`S`.

In [74]:
%%py
# (e)psilon buy, (e)psilon sell, (m)u, (a)lpha, (d)elta, (B)uy count, (S)ell count
def pin_ll(eemad,B,S):
    epsilon_b, epsilon_s, mu, alpha, delta = eemad
    
    e1 = -B * np.log(1 + mu / epsilon_b) - S * np.log(1 + mu / epsilon_s)
    e2 = -mu - S * np.log(1 + mu / epsilon_s)
    e3 = -mu - B * np.log(1 + mu / epsilon_b)
    
    emax = np.maximum(np.maximum(e1, e2), e3)
    
    sum1 = np.sum(-(epsilon_b + epsilon_s) + B * np.log(mu + epsilon_b) + 
                  S * np.log(mu + epsilon_s) + emax)
    
    sum2 = np.sum(np.log((1 - alpha) * np.exp(e1 - emax) + 
                         alpha * (1 - delta) * np.exp(e2 - emax) + 
                         alpha * delta * np.exp(e3 - emax)))
    
    ll = sum1 + sum2
    return ll

In [75]:
%%py
# test function
print(pin_ll([3, 3,4,.5,.5],B,S))

393.7007238616798


Your results should match the following:
```
393.7007238616798
```

In [76]:
%%py
# confirm pin_ll calculation
assert 399 == round(pin_ll([3, 3, 4, .5, 1],B,S),0)
assert 393 == round(pin_ll([3, 3, 4, 1, .5],B,S),0)
assert 424 == round(pin_ll([2, 2, 10, .5, .5],B,S),0)
assert 379 == round(pin_ll([1, 1, 20, .5, .5],B,S),0)
assert 288 == round(pin_ll([20, 20, 1, .5, .5],B,S),0)

### Part F (2 Points) ###

To find the values of $\epsilon_b$, $\epsilon_s$, $\mu$, $\alpha$, and $\delta$ that
maximize the log likelihood we can use the `minimize` function in the
python `scipy optimize` package.  Since this is a minimization
algorithm, we must first implement our `objective` function that gets
smaller as the log likelihood gets bigger.

Complete the python `objective` function so that it returns the
negative of the log likelihood.
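
The idea of turning a maximization into a minimization can be illustrated on a one-parameter toy problem. The sketch below is a stand-in under stated assumptions, not the PIN objective: it fits a single Poisson rate to the first five `B` values shown earlier, drops the constant factorial term, and uses a hand-written ternary search (the hypothetical `ternary_min`) instead of `scipy`. The minimizer of the negative log likelihood should land on the sample mean, which is the known maximum likelihood estimate of a Poisson rate.

```python
import math

counts = [13, 3, 7, 3, 3]  # first five daily buy counts shown above

def poisson_ll(lam, x):
    # Poisson log likelihood without the constant -sum(log(x!)) term
    return sum(xi * math.log(lam) - lam for xi in x)

def neg_ll(lam, x):
    # minimizing this is the same as maximizing poisson_ll
    return -poisson_ll(lam, x)

def ternary_min(f, lo, hi, iters=200):
    # shrink [lo, hi] around the minimum of a unimodal function
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

lam_hat = ternary_min(lambda lam: neg_ll(lam, counts), 0.1, 50.0)
print(lam_hat, sum(counts) / len(counts))
```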



In [77]:
%%py
def objective(eemad,B,S):
    o = -pin_ll(eemad,B,S)
    return o

In [78]:
%%py
# test function
print(objective([3, 3,4,.5,.5],B,S))

-393.7007238616798


Your results should match the following:
```
-393.7007238616798
```

In [79]:
%%py
# confirm objective calculation
assert -394 == round(objective([3, 3,4,.5,.5],B,S),0)
assert -399 == round(objective([3, 3,4,.5,1],B,S),0)

### Part G (2 Points) ###

We now need to create a set of bounds for the `minimize` function so
that it does not try to generate illogical solutions.

Create a variable called `bounds` which holds a list of bounds, one for
each of the parameters expected by the `objective` function.  The order
of the bounds must match the order of the coefficients. You may refer to
the `minimize` [documentation][MINIMIZE] for specifics on the expected
bounds structure.

[MINIMIZE]:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
"Scipy Optimize Minimize"


In [80]:
%%py
bounds = [(0, None), (0, None), (0, None), (0, 1), (0, 1)]
# display bounds
print(bounds)


[(0, None), (0, None), (0, None), (0, 1), (0, 1)]


In [81]:
%%py
# confirm bounds
assert 0 == bounds[0][0]
assert None == bounds[0][1]
assert 0 == bounds[1][0]
assert None == bounds[1][1]
assert 0 == bounds[2][0]
assert None == bounds[2][1]
assert 0 == bounds[3][0]
assert 1 == bounds[3][1]
assert 0 == bounds[4][0]
assert 1 == bounds[4][1]

### Part H (3 Points) ###

Initial parameters are crucial to finding solutions to the PIN
objective function.  Even with good initial values, it is still
possible for the optimization to return boundary solutions -- or fail
to find a solution altogether.  The chance of these problems increases
as daily volumes increase.

The [Yan and Zhang (2012)][YZ] paper sets out a method for
initializing parameters for the PIN ratio that distinguishes between
buy and sell uninformed trader rates $\epsilon_b$ and $\epsilon_s$.

[YZ]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=890486 "Yan and Zhang (2012)"

$$
\begin{align}
\alpha^0 &= \alpha_i \\
\delta^0 &= \delta_j \\
\epsilon_b^0& = \gamma_k \cdot \overline{B} \\
\mu^0 &= \frac{\overline{B}-\epsilon_b^0}{\alpha^0 \cdot (1-\delta^0)} \\
\epsilon_s^0 &= \overline{S}-\alpha^0 \cdot \delta^0 \cdot  \mu^0
\end{align}
$$
  
where $\alpha_i$, $\delta_j$ and $\gamma_k$ take values from .1, .3, .5,
.7, .9 and $\overline{B}$ and $\overline{S}$ are the sample buy and
sell averages respectively.

Using this technique, build a dataframe with 125 records and 5 columns
`eb`, `es`, `m`, `a` and `d`.



In [82]:
%%py
df = pd.DataFrame()
# Sample averages
B_bar = np.mean(B)
S_bar = np.mean(S)

# Values for alpha, delta, and gamma
values = [0.1, 0.3, 0.5, 0.7, 0.9]

rows = []
for a in values:
    for d in values:
        for gamma in values:
            eb = gamma * B_bar
            m = (B_bar - eb) / (a * (1 - d))
            es = S_bar - a * d * m
            rows.append({'eb': eb, 'es': es, 'm': m, 'a': a, 'd': d})

df = pd.DataFrame(rows)
print(df[['eb','es','m','a','d']])

           eb         es          m    a    d
0    0.594737  11.931579  59.473684  0.1  0.1
1    1.784211  12.063743  46.257310  0.1  0.1
2    2.973684  12.195906  33.040936  0.1  0.1
3    4.163158  12.328070  19.824561  0.1  0.1
4    5.352632  12.460234   6.608187  0.1  0.1
..        ...        ...        ...  ...  ...
120  0.594737 -35.647368  59.473684  0.9  0.9
121  1.784211 -24.942105  46.257310  0.9  0.9
122  2.973684 -14.236842  33.040936  0.9  0.9
123  4.163158  -3.531579  19.824561  0.9  0.9
124  5.352632   7.173684   6.608187  0.9  0.9

[125 rows x 5 columns]


Your results should match the following:
```
           eb         es          m    a    d
0    0.594737  11.931579  59.473684  0.1  0.1
1    1.784211  12.063743  46.257310  0.1  0.1
2    2.973684  12.195906  33.040936  0.1  0.1
3    4.163158  12.328070  19.824561  0.1  0.1
4    5.352632  12.460234   6.608187  0.1  0.1
..        ...        ...        ...  ...  ...
120  0.594737 -35.647368  59.473684  0.9  0.9
121  1.784211 -24.942105  46.257310  0.9  0.9
122  2.973684 -14.236842  33.040936  0.9  0.9
123  4.163158  -3.531579  19.824561  0.9  0.9
124  5.352632   7.173684   6.608187  0.9  0.9

[125 rows x 5 columns]
```
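
The first row can be reproduced by hand. Working backwards from the printed values, the sample averages are approximately $\overline{B}\approx 5.947368$ and $\overline{S}\approx 12.526316$ (these are inferred from the output, not stated elsewhere). With $\alpha^0=\delta^0=\gamma_k=.1$:

$$
\begin{align}
\epsilon_b^0 &= .1 \cdot 5.947368 \approx 0.594737 \\
\mu^0 &= \frac{5.947368-0.594737}{.1 \cdot (1-.1)} = \frac{5.352631}{.09} \approx 59.4737 \\
\epsilon_s^0 &= 12.526316 - .1 \cdot .1 \cdot 59.4737 \approx 11.931579
\end{align}
$$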

In [83]:
%%py
# confirm dataframe has correct columns
assert len({'eb','es','m','a','d'} - set(df.columns)) == 0
# confirm dataframe has correct number of records
assert len(df)==125
# confirm 'a' and 'd' range from .1 through .9
assert df[['d','a']].values.min() == .1
assert df[['d','a']].values.max() == .9


A few of these 125 initial parameter sets can be removed because they
have negative $\epsilon_s^0$ values.  Filter the `df` dataframe to
only include positive (greater than 0) $\epsilon_s$ values.



In [84]:
%%py
df = df[df.es > 0]
print(len(df))

105


In [85]:
%%py
# confirm dataframe has been filtered correctly
assert len(df)==105
assert (df.es>0).all()

[Ersan and Alıcı (2016)][EA] demonstrate that it is possible to
exclude additional parameter sets where $\mu^0$ exceeds the maximum of
buy and sell values ($B$ and $S$).

Exclude these records from the `df` dataframe as well.

[EA]: https://doi.org/10.1016/j.intfin.2016.04.001 "Ersan and Alıcı (2016)"


In [91]:
%%py
df = df[df.m < np.array([B, S]).max()]
print(len(df))

93


In [92]:
%%py
# confirm dataframe has been filtered correctly
assert len(df)==93
assert (df.m<np.array([B,S]).max()).all()

### Part I  (3 Points) ###

The last step is to call `minimize` from `scipy.optimize` on all
initial parameter sets and return the solution that produces the
smallest objective value (that is, the largest log likelihood).

An `OptimizeResult` variable called `sol`, which stores the result of
the `minimize` function, has been initialized with the failure state.

We will be coding in a vector style (no explicit for loops).  You
should use the `functools.partial` function to create a projection of
the `minimize` routine that has the following parameters assigned:

1. The objective function
2. The `args` parameter, which holds extra arguments that need to be
   passed to the objective function (the buy and sell vectors)
3. A `method` parameter, which specifies a minimization method that handles
   non-linear functions (use 'nelder-mead')
4. The `bounds` parameter, which holds the bounds vector

The code will then iterate over all initial parameter sets in the `df`
dataframe to find the optimal values for $\epsilon_b$, $\epsilon_s$,
$\mu$, $\alpha$, $\delta$ and return the best set.
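
The binding pattern described above can be sketched without `scipy`. In the example below, `evaluate` is a hypothetical stand-in for an optimizer (it only evaluates the objective at the starting point) and `squared_distance` is a made-up objective; neither is part of the assignment. What it shows is how `functools.partial` fixes the first positional argument and the keyword arguments, leaving a callable that only needs the starting point.

```python
from functools import partial

def evaluate(fun, x0, args=(), label=""):
    # hypothetical stand-in for an optimizer: report the objective at x0
    return {"x": x0, "fun": fun(x0, *args), "label": label}

def squared_distance(x, target):
    # made-up objective with one extra argument passed through `args`
    return (x - target) ** 2

# bind the objective and the keyword arguments; only x0 remains free
f = partial(evaluate, squared_distance, args=(3.0,), label="demo")

# apply the projection to every starting point and keep the best result
results = [f(x0) for x0 in [0.0, 2.5, 5.0]]
best = min(results, key=lambda r: r["fun"], default=None)
print(best)
```

Passing `default=` to `min` matters when the list of candidates may be empty: without it, `min` raises `ValueError` on an empty sequence.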


In [93]:
%%py
sol=OptimizeResult(success=False,x=[1, 1,0,0,0])
from functools import partial
f = partial(minimize, objective, args=(B, S), method='nelder-mead', bounds=bounds)
sols = [f(eemad) for eemad in df[['eb','es','m','a','d']].values]
sols = [s for s in sols if s.success] # only include successful optimizations
sol = min(sols,default=sol,key=lambda s: s.fun) # pick minimum solution
print(sol)

       message: Optimization terminated successfully.
       success: True
        status: 0
           fun: -532.8859186327364
             x: [ 5.947e+00  6.929e+00  3.531e+01  1.585e-01  1.000e+00]
           nit: 333
          nfev: 534
 final_simplex: (array([[ 5.947e+00,  6.929e+00, ...,  1.585e-01,
                         1.000e+00],
                       [ 5.947e+00,  6.929e+00, ...,  1.585e-01,
                         1.000e+00],
                       ...,
                       [ 5.947e+00,  6.929e+00, ...,  1.585e-01,
                         1.000e+00],
                       [ 5.947e+00,  6.929e+00, ...,  1.585e-01,
                         1.000e+00]], shape=(6, 5)), array([-5.329e+02, -5.329e+02, -5.329e+02, -5.329e+02,
                       -5.329e+02, -5.329e+02]))


Your results should be similar to the following (specifically, `success: True`):
```
       message: Optimization terminated successfully.
       success: True
        status: 0
           fun: -532.8859186327365
             x: [ 5.947e+00  6.929e+00  3.531e+01  1.585e-01  1.000e+00]
           nit: 333
          nfev: 534
 final_simplex: (array([[ 5.947e+00,  6.929e+00, ...,  1.585e-01,
                         1.000e+00],
                       [ 5.947e+00,  6.929e+00, ...,  1.585e-01,
                         1.000e+00],
                       ...,
                       [ 5.947e+00,  6.929e+00, ...,  1.585e-01,
                         1.000e+00],
                       [ 5.947e+00,  6.929e+00, ...,  1.585e-01,
                         1.000e+00]]), array([-5.329e+02, -5.329e+02, -5.329e+02, -5.329e+02,
                       -5.329e+02, -5.329e+02]))
```

In [94]:
%%py
# confirm successful optimization
assert sol.success

### Part J  (1 Point) ###

We can now determine the probability of informed trading (PIN).

Using the parameter results obtained from the optimization, compute
the PIN statistic and store the value in the variable `final_pin`


In [95]:
%%py
final_pin = pin(sol.x)
print(final_pin)

0.30296407890559435
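
As a rough sanity check, plugging the rounded parameters printed in `sol.x` above into the PIN formula gives a value consistent with the computed result:

$$
\text{PIN} \approx \frac{.1585 \cdot 35.31}{5.947 + 6.929 + .1585 \cdot 35.31} \approx \frac{5.597}{18.473} \approx .303
$$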


In [96]:
%%py
# confirm PIN is computed accurately
assert hash(round(final_pin,8)) == 698587605910858368