## Task One:

Take the sushi data and munge it into the following format:

https://github.com/Rahgooy/MDFT/blob/master/data/example.json

Here we need to compute the frequency and option sets, so we take sushi set A and use the oiliness and price attributes.

Split the 5000 rankings 4000/1000, and for each fold of that split compute the empirical choice distribution over the 10 options.

## Task Two:

Using all 5000 users, we need 15 subsets: 10 to learn from and the other 5 for testing. To build these we need to do the following:
* Pick 15 random subsets of size 3, plus the whole set of 10.
* Compute the empirical choice probabilities over each subset for all 5000 users.

In [13]:
# Includes and Standard Magic...
### Standard Magic and startup initializers.

# Load Numpy
import numpy as np
# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt
# Load Pandas
import pandas as pd
# Load Stats
from scipy import stats
import seaborn as sns
# import gurobipy as gpy


# This lets us show plots inline and also save PDF plots if we want them
%matplotlib inline
matplotlib.style.use('fivethirtyeight')
from matplotlib.backends.backend_pdf import PdfPages

# These two lines are for Pandas: they widen the notebook so DataFrames display easily.
# (display and HTML are imported from IPython.display; IPython.core.display is deprecated for this.)
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

In [14]:
from io import StringIO

data_tsv = StringIO('''idx	name	style	major_group	minor_group	oiliness	freq_ate	price	freq_sold
0	ebi	1	0	6	2.72897800776197	2.13842173350582	1.83841991341991	0.84
1	anago	1	0	3	0.926384364820847	1.99022801302932	1.99245867768595	0.88
2	maguro	1	0	1	1.76955903271693	2.34850640113798	1.87472451790634	0.88
3	ika	1	0	5	2.68840082361016	2.04323953328758	1.51515151515152	0.92
4	uni	1	0	8	0.81304347826087	1.64347826086957	3.28728191000918	0.88
5	ikura	1	0	7	1.26487252124646	1.97946175637394	2.69536271808999	0.88
6	tamago	1	1	9	2.36807095343681	1.86622320768662	1.03246753246753	0.84
7	toro	1	0	1	0.551854655563967	2.05753217259652	4.48545454545455	0.8
8	tekka_maki	0	0	1	2.24713375796178	1.87898089171975	1.57983682983683	0.44
9	kappa_maki	0	1	11	3.73054755043228	1.45677233429395	1.02	0.4''')

columns = "idx, name, style, major_group, minor_group, oiliness, freq_ate, price, freq_sold"

In [15]:
df_meta = pd.read_csv(data_tsv, sep='\t', index_col="idx")
df_meta

idx,name,style,major_group,minor_group,oiliness,freq_ate,price,freq_sold
0,ebi,1,0,6,2.728978,2.138422,1.83842,0.84
1,anago,1,0,3,0.926384,1.990228,1.992459,0.88
2,maguro,1,0,1,1.769559,2.348506,1.874725,0.88
3,ika,1,0,5,2.688401,2.04324,1.515152,0.92
4,uni,1,0,8,0.813043,1.643478,3.287282,0.88
5,ikura,1,0,7,1.264873,1.979462,2.695363,0.88
6,tamago,1,1,9,2.368071,1.866223,1.032468,0.84
7,toro,1,0,1,0.551855,2.057532,4.485455,0.8
8,tekka_maki,0,0,1,2.247134,1.878981,1.579837,0.44
9,kappa_maki,0,1,11,3.730548,1.456772,1.02,0.4


In [16]:
# Read in the data...

df_orderdata = pd.read_csv("./sushi3a.5000.10.order", sep=" ")
df_orderdata = df_orderdata.drop(columns=["drop", "drop.1"])
df_orderdata

,pos1,pos2,pos3,pos4,pos5,pos6,pos7,pos8,pos9,pos10
0,5,0,3,4,6,9,8,1,7,2
1,0,9,6,3,7,2,8,1,5,4
2,7,0,2,3,8,4,5,1,9,6
3,4,5,7,0,2,3,1,6,8,9
4,8,6,5,0,3,9,2,7,4,1
...,...,...,...,...,...,...,...,...,...,...
4995,3,7,4,5,6,0,1,2,8,9
4996,7,1,3,8,6,0,5,2,9,4
4997,7,2,4,5,3,6,1,0,8,9
4998,7,2,3,0,8,1,5,9,6,4


## First Cut

Here we need to compute the frequency and option sets, so we take sushi set A and use the oiliness and price attributes.

Split the 5000 rankings 4000/1000, and for each fold of that split compute the empirical choice distribution over the 10 options.
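
One detail worth guarding against when computing the empirical choice distribution: `value_counts` only reports values that actually occur, so a fold in which some option is never ranked first would give a list shorter than 10. The following is a minimal sketch, not the notebook's own code; the helper name `first_choice_distribution` and the toy frame are made up for illustration, and it assumes the first-ranked option lives in a column called `pos1`, as in `df_orderdata`.

```python
import pandas as pd

def first_choice_distribution(orders: pd.DataFrame, n_options: int = 10) -> list:
    """Fraction of rows whose first-ranked option is 0, 1, ..., n_options - 1."""
    dist = orders["pos1"].value_counts(normalize=True)
    # reindex guarantees exactly one entry per option, in option order,
    # filling options that never appear in first position with 0.0.
    dist = dist.reindex(range(n_options), fill_value=0.0)
    return dist.tolist()

# Toy data (hypothetical): four users, only options 1, 4 and 7 are ever ranked first.
toy = pd.DataFrame({"pos1": [7, 7, 4, 1], "pos2": [4, 1, 7, 7]})
dist = first_choice_distribution(toy)
print(dist)
```

On the real data this could be applied to each train or test slice in place of the `value_counts` / `sort_index` pair.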


In [17]:
# Print out the M information (one [oiliness, price] pair per sushi type)...
m_data = df_meta[["oiliness", "price"]].values.tolist()
print("M:",m_data)

M: [[2.7289780077619703, 1.8384199134199097], [0.9263843648208471, 1.9924586776859499], [1.76955903271693, 1.87472451790634], [2.68840082361016, 1.51515151515152], [0.81304347826087, 3.28728191000918], [1.26487252124646, 2.69536271808999], [2.3680709534368103, 1.03246753246753], [0.551854655563967, 4.48545454545455], [2.24713375796178, 1.5798368298368302], [3.73054755043228, 1.02]]


In [18]:
# # Choice probabilities are the number of times it appears in the FIRST POSITION.
# D = df_orderdata["pos1"].value_counts(normalize=True)
# D.sort_index(inplace=True)
# print(D)
# print(list(D))


In [19]:
# Use scikit-learn's KFold to get 5 train/test splits (4000 train / 1000 test each) and print them out.

import numpy as np
from sklearn.model_selection import KFold
kf = KFold(n_splits = 5, shuffle = True, random_state = 2)
# NOTE: kf.split yields (train_idx, test_idx), in that order.
for i, (train_idx, test_idx) in enumerate(kf.split(df_orderdata)):
    train = df_orderdata.iloc[train_idx]
    test = df_orderdata.iloc[test_idx]
    print("FOLD ",i)
    D = train["pos1"].value_counts(normalize=True)
    D.sort_index(inplace=True)
    print("TRAIN: ",list(D))
    D = test["pos1"].value_counts(normalize=True)
    D.sort_index(inplace=True)
    print("TEST: ",list(D))


FOLD  0
TRAIN:  [0.09225, 0.10875, 0.079, 0.0455, 0.15075, 0.1105, 0.04075, 0.3445, 0.0215, 0.0065]
TEST:  [0.089, 0.115, 0.088, 0.046, 0.144, 0.103, 0.043, 0.335, 0.027, 0.01]
FOLD  1
TRAIN:  [0.09225, 0.11075, 0.08325, 0.04525, 0.1545, 0.10725, 0.04025, 0.3365, 0.02275, 0.00725]
TEST:  [0.089, 0.107, 0.071, 0.047, 0.129, 0.116, 0.045, 0.367, 0.022, 0.007]
FOLD  2
TRAIN:  [0.0885, 0.109, 0.0815, 0.043, 0.14725, 0.10975, 0.0425, 0.348, 0.02325, 0.00725]
TEST:  [0.104, 0.114, 0.078, 0.056, 0.158, 0.106, 0.036, 0.321, 0.02, 0.007]
FOLD  3
TRAIN:  [0.09475, 0.11075, 0.0795, 0.04475, 0.1475, 0.11025, 0.04075, 0.34125, 0.02325, 0.00725]
TEST:  [0.079, 0.107, 0.086, 0.049, 0.157, 0.104, 0.043, 0.348, 0.02, 0.007]
FOLD  4
TRAIN:  [0.09025, 0.11075, 0.08075, 0.0495, 0.147, 0.10725, 0.04175, 0.34275, 0.02225, 0.00775]
TEST:  [0.097, 0.107, 0.081, 0.03, 0.159, 0.116, 0.039, 0.342, 0.024, 0.005]


## Task Two:

Using all 5000 users, we need 15 subsets: 10 to learn from and the other 5 for testing. To build these we need to do the following:
* Pick 15 random subsets of size 3, plus the whole set of 10.
* Compute the empirical choice probabilities over each subset for all 5000 users.

In [20]:
# Compute Empirical choice probabilities for whole 10... 
# NOTE: Same as above, we just basically compute the empirical distribution of the first position.

D = df_orderdata["pos1"].value_counts(normalize=True)
D.sort_index(inplace=True)
print(list(D))

[0.0916, 0.11, 0.0808, 0.0456, 0.1494, 0.109, 0.0412, 0.3426, 0.0226, 0.0072]


In [21]:
# Pick subsets of size 3...
column_names = df_orderdata.columns

In [22]:
import itertools
import random
combos = list(itertools.combinations(list(range(10)),3))
random.shuffle(combos)
subsets = combos[:15]
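
The shuffle above is unseeded, so re-running the cell picks different subsets each time. If reproducibility matters, one option is to draw from a dedicated, seeded generator. This is a hedged sketch rather than the notebook's method: the function name `sample_subsets` and the seed value are arbitrary choices for illustration, and the 10/5 split simply follows the "10 we learn from, other 5 for testing" description.

```python
import itertools
import random

def sample_subsets(n_options=10, size=3, k=15, seed=0):
    """Draw k distinct option subsets of the given size, reproducibly."""
    combos = list(itertools.combinations(range(n_options), size))
    rng = random.Random(seed)  # private generator; leaves the global random state alone
    # sample() draws without replacement, so no subset is repeated.
    return [list(c) for c in rng.sample(combos, k)]

subsets = sample_subsets()
# First 10 subsets for learning, remaining 5 for testing.
train_subsets, test_subsets = subsets[:10], subsets[10:]
```

Because the seed is fixed, the subsets chosen will not match the ones printed in this notebook, which came from an unseeded shuffle.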

In [23]:
## For each subset, compute the choice probabilities: the chosen option is whichever
## member of the subset occurs first in the user's ranking.
idx = []
D = []
for a,b,c in subsets:
    counts = [0,0,0]
    print("idx:",[a,b,c])
    idx.append([a, b, c])
    for i,r in df_orderdata.iterrows():
        # Get the index (position) of each option in this user's ranking.
        # np.where returns a tuple of arrays, so take the first match explicitly.
        posa = int(np.where(r.values == a)[0][0])
        posb = int(np.where(r.values == b)[0][0])
        posc = int(np.where(r.values == c)[0][0])
        # Increment the count for whichever option is ranked earliest...
        if posa < posb and posa < posc:
            counts[0] += 1
        elif posb < posc:
            counts[1] += 1
        else:
            counts[2] += 1
    probs = [x / 5000. for x in counts]
    print("D:", probs)
    D.append(probs)

idx: [1, 2, 9]
D: [0.4034, 0.5334, 0.0632]
idx: [1, 5, 9]
D: [0.4224, 0.47, 0.1076]
idx: [2, 4, 7]
D: [0.2028, 0.2468, 0.5504]
idx: [0, 2, 6]
D: [0.3702, 0.4964, 0.1334]
idx: [1, 4, 7]
D: [0.2168, 0.2232, 0.56]
idx: [2, 4, 5]
D: [0.4084, 0.3214, 0.2702]
idx: [6, 7, 9]
D: [0.1484, 0.7956, 0.056]
idx: [0, 4, 7]
D: [0.2252, 0.224, 0.5508]
idx: [0, 4, 8]
D: [0.3674, 0.399, 0.2336]
idx: [2, 5, 9]
D: [0.4964, 0.4376, 0.066]
idx: [0, 8, 9]
D: [0.5852, 0.3552, 0.0596]
idx: [3, 5, 9]
D: [0.3514, 0.5582, 0.0904]
idx: [5, 7, 8]
D: [0.2586, 0.6238, 0.1176]
idx: [4, 5, 6]
D: [0.3928, 0.3614, 0.2458]
idx: [5, 8, 9]
D: [0.566, 0.3476, 0.0864]
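
The row-by-row loop above works but is slow, since it walks all 5000 rankings once per subset. Below is a sketch of a vectorized alternative, under the assumption that every row is a complete permutation of `0..m-1` (as in the `pos1`..`pos10` columns). The function name `subset_choice_probs` and the toy rankings are hypothetical.

```python
import numpy as np

def subset_choice_probs(orders, subset):
    """Fraction of users whose earliest-ranked member of `subset` is each item."""
    orders = np.asarray(orders)
    # Each row is a permutation, so argsort inverts it:
    # positions[u, item] is the rank position of `item` for user u.
    positions = np.argsort(orders, axis=1)
    # For each user, the index (within `subset`) of the item ranked earliest.
    winners = positions[:, list(subset)].argmin(axis=1)
    counts = np.bincount(winners, minlength=len(subset))
    return (counts / len(orders)).tolist()

# Toy rankings over 3 options (hypothetical): each row lists options from best to worst.
toy_orders = [[2, 0, 1],
              [0, 1, 2],
              [1, 2, 0],
              [0, 2, 1]]
probs = subset_choice_probs(toy_orders, (0, 2))
print(probs)
```

In this notebook the call would presumably look like `subset_choice_probs(df_orderdata.values, [a, b, c])`, and it also handles the whole set of 10 by passing `range(10)` as the subset.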


In [25]:
idx

[[1, 2, 9],
 [1, 5, 9],
 [2, 4, 7],
 [0, 2, 6],
 [1, 4, 7],
 [2, 4, 5],
 [6, 7, 9],
 [0, 4, 7],
 [0, 4, 8],
 [2, 5, 9],
 [0, 8, 9],
 [3, 5, 9],
 [5, 7, 8],
 [4, 5, 6],
 [5, 8, 9]]

In [26]:
D

[[0.4034, 0.5334, 0.0632],
 [0.4224, 0.47, 0.1076],
 [0.2028, 0.2468, 0.5504],
 [0.3702, 0.4964, 0.1334],
 [0.2168, 0.2232, 0.56],
 [0.4084, 0.3214, 0.2702],
 [0.1484, 0.7956, 0.056],
 [0.2252, 0.224, 0.5508],
 [0.3674, 0.399, 0.2336],
 [0.4964, 0.4376, 0.066],
 [0.5852, 0.3552, 0.0596],
 [0.3514, 0.5582, 0.0904],
 [0.2586, 0.6238, 0.1176],
 [0.3928, 0.3614, 0.2458],
 [0.566, 0.3476, 0.0864]]
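
To get from these lists to a JSON file along the lines of the linked example, the pieces can be bundled and serialized. This is only a sketch: the key names `"M"`, `"idx"` and `"D"` are an assumption taken from the labels printed in this notebook and have not been checked against `example.json`, the function name `build_payload` is hypothetical, and the values below are toy numbers rather than the real results.

```python
import json

def build_payload(m_data, idx, D, tol=1e-6):
    """Bundle attributes, option sets and choice distributions, with sanity checks."""
    if len(idx) != len(D):
        raise ValueError("need exactly one distribution per option set")
    for options, probs in zip(idx, D):
        if len(options) != len(probs):
            raise ValueError(f"size mismatch for option set {options}")
        if abs(sum(probs) - 1.0) > tol:
            raise ValueError(f"probabilities for {options} do not sum to 1")
    # Key names are assumed, not verified against the target format.
    return {"M": m_data, "idx": idx, "D": D}

# Toy inputs for illustration only.
payload = build_payload(
    m_data=[[2.7, 1.8], [0.9, 2.0], [1.8, 1.9]],
    idx=[[0, 1, 2]],
    D=[[0.5, 0.3, 0.2]],
)
text = json.dumps(payload, indent=2)
print(text)
```

The checks are cheap and catch the most likely slip, which is `idx` and `D` drifting out of step when cells are re-run out of order.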