## Ranking Reviews

To some degree we follow the logic here: https://nbviewer.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter4_TheGreatestTheoremNeverTold/Ch4_LawOfLargeNumbers_PyMC2.ipynb

However, this is a problem I've encountered often. People try to sort, rank, or order things by some score when that score is computed from different sample sizes for different entities, leading to the "pathological ordering" described in the notebook. I'll add my own extension on how to sort these cases rationally.
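To make the problem concrete, here is a minimal sketch with two made-up products (the names and star counts are hypothetical, not data from this notebook). Sorting by the raw average puts a product with two reviews ahead of one with ninety.

```python
# Hypothetical star counts (index 0 = 1 star, ..., index 4 = 5 stars).
products = {
    "A (2 reviews)": [0, 0, 0, 0, 2],
    "B (90 reviews)": [1, 2, 5, 22, 60],
}

def mean_stars(counts):
    """Average star rating from a list of counts for 1..5 stars."""
    total = sum(counts)
    return sum(star * c for star, c in enumerate(counts, start=1)) / total

# Naive ranking: sort by raw mean, ignoring how many reviews back it up.
ranked = sorted(products, key=lambda name: mean_stars(products[name]), reverse=True)
for name in ranked:
    print(name, round(mean_stars(products[name]), 2), "from", sum(products[name]), "reviews")
```

Product A wins on the raw mean even though its score rests on only two reviews, which is the kind of ordering the rest of this section tries to avoid.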

In [65]:
import numpy as np
np.random.seed(42)

### generate scores


Suppose we have k products, each with n_i reviews ranging from 1 to 5 stars. We draw each product's reviews from its own multinomial distribution with randomly set params. We can use a Dirichlet distribution (stick-breaking) to generate valid values for these 5-outcome probabilities.

Dirichlet distributions are very cool! More on understanding how their params work: https://builtin.com/data-science/dirichlet-distribution

In [66]:
test_dirichlets = np.random.dirichlet(alpha=[3,3,6,6,4], size=3)
test_dirichlets

array([[0.14930397, 0.10255247, 0.21486729, 0.21486882, 0.31840746],
       [0.2012259 , 0.09612238, 0.34422275, 0.30538451, 0.05304446],
       [0.05960729, 0.14390011, 0.44425147, 0.23044177, 0.12179936]])

In [67]:
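# NOTE: n_reviews and true_product_distributions are defined in cell In [69] below; run that cell first.
# The outputs shown for this cell and the next come from an earlier run, so they don't match the values further down.
np.random.multinomial(n_reviews[0], true_product_distributions[0])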

array([ 9,  7, 21, 10,  4])

In [68]:
true_product_distributions[0]

array([0.22946703, 0.20287075, 0.25821619, 0.15238942, 0.15705661])

In [69]:
k = 10
n_reviews = np.random.randint(low = 1, high=100, size=k)
true_product_distributions = np.random.dirichlet(alpha=[3,3,6,6,4], size=k)

reviews_arraylist = []
for i in range(0, k):
    reviews_by_stars = np.random.multinomial(n_reviews[i], true_product_distributions[i])
    reviews_arraylist.append(reviews_by_stars)


reviews_array = np.array(reviews_arraylist)

In [70]:
reviews_array # number of 1, 2, 3, 4, and 5 star reviews for each product; each row sums to n_reviews[i]

array([[ 1,  1,  1,  4,  2],
       [16, 28, 22, 15,  9],
       [ 1, 13, 21, 14,  4],
       [ 0,  1,  0,  1,  0],
       [ 9, 18, 26, 20, 11],
       [ 6, 25, 32, 10, 19],
       [ 3,  5, 26, 16, 10],
       [ 8, 10,  8, 26, 19],
       [ 5,  2, 26,  7,  4],
       [ 2,  1,  2,  2,  1]])

In [73]:
reviews_array.sum(axis=1) # needs to match n_reviews

array([ 9, 90, 53,  2, 84, 92, 60, 71, 44,  8])

In [74]:
n_reviews # matches!

array([ 9, 90, 53,  2, 84, 92, 60, 71, 44,  8])

### modeling approach

If we were modeling binomial outcomes, we could use the beta-binomial model and skip pymc altogether, because the posterior beta distribution is super easy to generate from the prior and the binomial outcomes. Take the number of heads and add it to the first param, take the number of tails and add it to the second param, done.
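The update described above can be sketched in a few lines. The function name, the uniform Beta(1, 1) prior, and the 7 heads / 3 tails counts are all assumptions chosen for illustration.

```python
def beta_binomial_update(alpha, beta, heads, tails):
    """Posterior Beta params: add heads to the first param, tails to the second."""
    return alpha + heads, beta + tails

prior = (1.0, 1.0)  # assumed uniform Beta(1, 1) prior
posterior = beta_binomial_update(*prior, heads=7, tails=3)  # made-up coin flips

# Mean of a Beta(a, b) distribution is a / (a + b).
posterior_mean = posterior[0] / (posterior[0] + posterior[1])
print(posterior, round(posterior_mean, 3))
```

With these numbers the posterior is Beta(8, 4), with mean 8 / 12.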

Is an equally simple closed form possible for multinomials, say with Dirichlet priors, as a generalization of the beta-binomial model? I'm actually not sure. But we can definitely solve this in pymc. And since we've already reasoned out a good data generating process for simulating this data, we can use it as the model going the other way too.
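For what it's worth, the Dirichlet is the conjugate prior of the multinomial, so the same add-the-counts update carries over: a Dirichlet(alpha) prior plus observed counts gives a Dirichlet(alpha + counts) posterior. Below is a minimal sketch of that update, assuming the same Dirichlet(5,5,5,5,5) prior as the pymc model further down and the star counts copied from `reviews_array` above. The names `prior_alpha`, `posterior_alpha`, and `posterior_mean` are mine. It could serve as a cross-check on the sampled posterior means.

```python
import numpy as np

# Star counts copied from reviews_array above (rows = products, columns = 1..5 stars).
reviews_array = np.array([
    [ 1,  1,  1,  4,  2],
    [16, 28, 22, 15,  9],
    [ 1, 13, 21, 14,  4],
    [ 0,  1,  0,  1,  0],
    [ 9, 18, 26, 20, 11],
    [ 6, 25, 32, 10, 19],
    [ 3,  5, 26, 16, 10],
    [ 8, 10,  8, 26, 19],
    [ 5,  2, 26,  7,  4],
    [ 2,  1,  2,  2,  1],
])

# Assumed prior: Dirichlet(5, 5, 5, 5, 5) for every product.
prior_alpha = np.full(5, 5.0)

# Conjugate update: posterior concentration = prior concentration + observed counts.
posterior_alpha = prior_alpha + reviews_array

# Mean of a Dirichlet is its concentration vector normalized to sum to 1.
posterior_mean = posterior_alpha / posterior_alpha.sum(axis=1, keepdims=True)
print(posterior_mean.round(3))
```

Note how the product with only 2 reviews (row 3) stays close to the flat prior mean of 0.2 per star, while the products with 80+ reviews are dominated by their observed counts.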

More docs: https://www.pymc.io/projects/docs/en/v3/pymc-examples/examples/mixture_models/dirichlet_mixture_of_multinomials.html

In [76]:
import pymc as pm




In [111]:
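with pm.Model() as model:
    # one Dirichlet(5,5,5,5,5) prior per product over the 5 star probabilities
    dirichlet = pm.Dirichlet("dirichlet", (5, 5, 5, 5, 5), shape=reviews_array.shape)
    # observed star counts: row i is a multinomial draw of n_reviews[i] reviews
    obs = pm.Multinomial("obs", n_reviews, dirichlet, observed=reviews_array)

    # start the Metropolis sampler from the MAP estimate
    start = pm.find_MAP()
    step = pm.Metropolis()
    trace = pm.sample(12000, step=step, initvals=start)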




Multiprocess sampling (4 chains in 4 jobs)
Metropolis: [dirichlet]


Sampling 4 chains for 1_000 tune and 12_000 draw iterations (4_000 + 48_000 draws total) took 19 seconds.


In [112]:
model.basic_RVs

[dirichlet, obs]

In [117]:
n_reviews

array([ 9, 90, 53,  2, 84, 92, 60, 71, 44,  8])

In [115]:
trace.posterior.dirichlet.mean(axis=(0,1))

In [116]:
true_product_distributions # first-order sanity check: how close the posterior means are to these true values, relative to n_reviews

array([[0.08525117, 0.06902905, 0.19863354, 0.45397739, 0.19310885],
       [0.19834825, 0.2858744 , 0.23411255, 0.17431408, 0.10735072],
       [0.06992841, 0.1774605 , 0.38267632, 0.24650138, 0.12343339],
       [0.04303441, 0.23826195, 0.43628372, 0.18426625, 0.09815366],
       [0.10918894, 0.25443778, 0.27934232, 0.20465334, 0.15237761],
       [0.05873772, 0.29226417, 0.27569534, 0.17207715, 0.20122562],
       [0.07216832, 0.12708475, 0.31644872, 0.3415245 , 0.14277371],
       [0.13412921, 0.12472006, 0.08303889, 0.38596207, 0.27214977],
       [0.11761106, 0.06867592, 0.4748723 , 0.27500973, 0.06383099],
       [0.13403426, 0.0954606 , 0.32509155, 0.2396074 , 0.20580619]])

In [None]:
## TODO: plot real generative multinomial vs. the estimated distribution for each multinomial, as well as N

In [110]:
trace.posterior

In [102]:
trace

In [94]:
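# draw one sample from the prior; the variable is named "dirichlet" in the model above (d_prior was its name in an earlier run)
pm.draw(model["dirichlet"])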

array([[0.19461367, 0.17856145, 0.1617343 , 0.31512483, 0.14996576],
       [0.10759511, 0.25917069, 0.14669635, 0.25022167, 0.23631618],
       [0.11413721, 0.19817167, 0.17496452, 0.18849554, 0.32423107],
       [0.12928869, 0.2174786 , 0.19863772, 0.2304824 , 0.22411259],
       [0.1636006 , 0.25306585, 0.21901389, 0.21944815, 0.14487151],
       [0.25199895, 0.37617752, 0.12100707, 0.08738399, 0.16343246],
       [0.13277469, 0.25092196, 0.11986336, 0.16142649, 0.3350135 ],
       [0.10214171, 0.27287848, 0.12576772, 0.30982944, 0.18938265],
       [0.27565949, 0.23342822, 0.16225733, 0.18385917, 0.1447958 ],
       [0.14774076, 0.20141228, 0.25057921, 0.15529878, 0.24496897]])