## Ranking Reviews

This notebook loosely follows the logic of Chapter 4 of *Probabilistic Programming and Bayesian Methods for Hackers*: https://nbviewer.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter4_TheGreatestTheoremNeverTold/Ch4_LawOfLargeNumbers_PyMC2.ipynb

However, this is a problem I've encountered often before. People try to sort or rank things by some score when that score is computed from a different sample size for each entity, which leads to the "pathological ordering" described in the notebook. I'll add my own extension on how to sort these cases rationally.
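
To make the problem concrete, here is a minimal sketch of ranking by the naive mean rating. The review counts are hypothetical values chosen for illustration, not data from this notebook.

```python
import numpy as np

# Hypothetical review counts per star level (1 to 5 stars); illustration only.
counts = np.array([
    [0, 0, 0, 0, 2],      # product 0: two 5-star reviews
    [2, 3, 10, 60, 125],  # product 1: 200 reviews, mostly 4 and 5 stars
    [1, 0, 0, 0, 0],      # product 2: a single 1-star review
])
stars = np.arange(1, 6)

n_per_product = counts.sum(axis=1)
mean_rating = counts @ stars / n_per_product

# Sort products from highest to lowest naive mean rating.
naive_order = np.argsort(-mean_rating)
for idx in naive_order:
    print(f"product {idx}: mean={mean_rating[idx]:.2f}, n={n_per_product[idx]}")
```

The products with only one or two reviews land at the top and bottom of the ranking, while the product with 200 reviews sits in between, even though its score is backed by far more evidence.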

In [2]:
import numpy as np

### generate scores


Suppose we have k products, each with n_i reviews rated from 1 to 5 stars. We draw each product's reviews from its own multinomial distribution with randomly chosen parameters. A Dirichlet distribution (stick-breaking) generates valid values for these 5-outcome probabilities: every draw is a vector of non-negative values that sums to 1.

Dirichlet distributions are very cool! More on understanding how their params work: https://builtin.com/data-science/dirichlet-distribution
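
As a rough sketch of how the params behave: each component of a Dirichlet draw has expected value alpha_i / sum(alpha), and scaling all of alpha up keeps that mean while tightening the spread. The seed, the sample size, and the 10x scaling factor below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility
alpha = np.array([3, 3, 6, 6, 4], dtype=float)

# Expected value of each component: alpha_i / sum(alpha).
expected = alpha / alpha.sum()

# Same expected value, but 10x the concentration clusters draws more tightly.
loose = rng.dirichlet(alpha, size=100_000)
tight = rng.dirichlet(alpha * 10, size=100_000)

print("expected mean:   ", expected.round(3))
print("empirical mean:  ", loose.mean(axis=0).round(3))
print("std (alpha):     ", loose.std(axis=0).round(3))
print("std (10 * alpha):", tight.std(axis=0).round(3))
```

With alpha=[3,3,6,6,4], the 3- and 4-star outcomes get the most probability mass on average, and 1- and 2-star outcomes the least.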

In [27]:
test_dirichlets = np.random.dirichlet(alpha=[3,3,6,6,4], size=3)
test_dirichlets

array([[0.11106197, 0.1860041 , 0.29004338, 0.16202759, 0.25086297],
       [0.10797905, 0.07232049, 0.45295255, 0.16084241, 0.2059055 ],
       [0.07259105, 0.07268462, 0.22065449, 0.48427044, 0.14979939]])

In [43]:
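# n_reviews and true_product_distributions are defined in the In [54] cell below;
# run that cell first when executing the notebook top to bottom.
np.random.multinomial(n_reviews[0], true_product_distributions[0])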

array([ 6,  5, 28, 22,  4])

In [44]:
true_product_distributions[0]

array([0.06694398, 0.08143229, 0.43420093, 0.30442076, 0.11300205])
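
The sampled counts only approximate the true distribution, and how closely depends on the number of reviews. A small sketch of that effect, reusing the probabilities printed above; the seed and the two sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility

# Probabilities copied from the true_product_distributions[0] output above,
# renormalized to absorb the rounding in the printed values.
p = np.array([0.06694398, 0.08143229, 0.43420093, 0.30442076, 0.11300205])
p = p / p.sum()

def max_abs_error(n):
    """Largest gap between observed star proportions and p after n reviews."""
    counts = rng.multinomial(n, p)
    return np.abs(counts / n - p).max()

error_small = max_abs_error(10)
error_large = max_abs_error(100_000)
print(f"n=10:     {error_small:.4f}")
print(f"n=100000: {error_large:.4f}")
```

A product with a handful of reviews can show star proportions far from its true distribution, which is why scores from small samples are unreliable for ranking.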

In [54]:
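k = 42  # number of products
# Reviews per product: randint excludes `high`, so each product gets 1 to 99 reviews.
n_reviews = np.random.randint(low=1, high=100, size=k)
# True (hidden) star-rating probabilities, one row per product.
true_product_distributions = np.random.dirichlet(alpha=[3, 3, 6, 6, 4], size=k)

# Draw the observed counts of 1- to 5-star reviews for each product.
reviews_arraylist = []
for i in range(k):
    reviews_by_stars = np.random.multinomial(n_reviews[i], true_product_distributions[i])
    reviews_arraylist.append(reviews_by_stars)

reviews_array = np.array(reviews_arraylist)  # shape (k, 5)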

In [56]:
reviews_array  # counts of 1-, 2-, 3-, 4-, and 5-star reviews per product; row i sums to n_reviews[i]

array([[ 6, 11, 27, 25, 24],
       [17,  4, 26, 14,  7],
       [ 3,  5, 13, 11,  9],
       [ 5,  3,  7, 17,  6],
       [ 2, 15, 17, 14,  5],
       [ 6,  3, 13, 13,  9],
       [12, 17, 25, 11, 20],
       [11, 11,  9, 26,  7],
       [26,  7, 25, 29,  1],
       [ 3,  8, 12, 13, 14],
       [ 6,  1,  6, 19, 15],
       [ 9,  2,  5,  5,  1],
       [ 3,  4, 15,  8,  3],
       [ 3, 15, 10, 17, 28],
       [ 1,  5,  3,  8,  1],
       [13,  2,  5,  3,  3],
       [ 8,  1,  7,  7,  6],
       [13, 18, 27, 25,  8],
       [ 5,  3,  8, 15,  4],
       [ 8,  9, 13, 25,  9],
       [ 2,  3,  7,  9,  2],
       [ 1,  4,  0,  3,  2],
       [11,  3, 17, 22, 10],
       [17,  7,  5, 39, 25],
       [ 2,  0, 10,  5,  4],
       [13,  2, 29, 11, 11],
       [11,  0, 24, 22,  3],
       [16,  7,  5, 17,  3],
       [ 6,  7, 20, 28, 12],
       [ 5,  6, 13,  4,  5],
       [ 3,  4,  5, 12,  2],
       [13,  4,  8, 13, 12],
       [ 1,  7, 14, 17, 20],
       [17,  5, 16, 30, 11],
       [10,  9