# Matrix Porridge
by Kyle Archie, M.Eng

This work builds upon the first notebook in this repository. Please read through that one for background and explanation of the data / process.

## Notebook 2: Filtering our Input Data via Clustering

In this notebook, we'll be using the DBSCAN clustering algorithm to trim our list of potential ingredients down. In the last notebook, we got a solution, but many of the ingredients were similar to one another. By using clustering, we can select just one ingredient from clusters of many similar ones. And the reason for using DBSCAN over k-means clustering for this is that DBSCAN also automatically flags outliers, which we will will want to include here. If we were building a classifier or something like that, we might be using those outlier flags to filter out undesireable data, but here, if something's an outlier, it means there are no other ingredients quite like it, and those unique ingredients should absolutely be included. Futher, k-means analysis requires we pick a "k" (i.e. number of clusters). We can use the elbow technique to iteratively find the optimal value of k, but this will likely leave those outliers out of the final set, since the lower our k value, the more our outliers get thrown into general buckets. DBSCAN is just an obvious choice in this application.

We will use the elbow technique to tune our epsilon value. Once we have our final list of clusters, we'll pick one sample per cluster, plus all outliers, and that will become our final ingredients list, which we'll then run through the same optimization algorithm as last time.

In [1]:
import pandas as pd

ingredients = pd.read_excel("Matrix Porridge (filtered).xlsx", "core ingredients")
ingredients['Sodium']=ingredients['Sodium']/1000 #values are in mg... convert to g
ingredients.set_index('Shrt_Desc',inplace=True)
ingredients=ingredients[['Energ_Kcal','Protein','Carbohydrt','Fiber_TD','Sugar_Tot','Lipid_Tot','FA_Sat','FA_Poly','Sodium']].dropna()

Reference used: https://towardsdatascience.com/k-means-vs-dbscan-clustering-49f8e627de27

In [2]:
from sklearn import metrics
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import numpy as np

In [3]:
ingredients_dbscan = DBSCAN(eps=0.5, min_samples=2)
ingredients_dbscan.fit(ingredients.values)
labels = ingredients_dbscan.labels_

# Creating a numpy array with all values set to false by default
samples_mask = np.zeros_like(labels, dtype=bool)
# add outliers
samples_mask[labels==-1] = True

#add outliers to our final dataset
final_ingredients_df=ingredients[samples_mask]

In [4]:
# Finding the number of clusters in labels (ignoring noise if present)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
for i in range(n_clusters):
    sample=ingredients[labels==i].sample(1)
    final_ingredients_df=final_ingredients_df.append(sample)


In [5]:
inputs=final_ingredients_df.T
# inputs.columns=ingredients['Shrt_Desc']
# inputs.dropna(inplace=True,axis=1) #some of these ingredients have null values. Remove those. 
inputs.fillna(0,inplace=True)
inputs.head(10)

Shrt_Desc,"BUTTER,WITH SALT","BUTTER,WHIPPED,WITH SALT","BUTTER OIL,ANHYDROUS","CHEESE,BLUE","CHEESE,BRICK","CHEESE,BRIE","CHEESE,CAMEMBERT","CHEESE,CHEDDAR","CHEESE,COLBY","CHEESE,COTTAGE,CRMD,LRG OR SML CURD",...,"WHEAT FLR,WHITE,BREAD,ENR","MACARONI,DRY,ENR","MACARONI,COOKED,ENRICHED","NOODLES,EGG,DRY,ENRICHED","NOODLES,EGG,CKD,UNENR,WO/ SALT","SPAGHETTI,CKD,UNENR,W/ SALT","WHEAT FLR,WHITE (INDUSTRIAL),10% PROT,UNBLEACHED,ENR","WHEAT FLR,WHITE (INDUSTRIAL),11.5% PROT,UNBLEACHED,ENR","WHEAT FLR,WHITE (INDUSTRIAL),13% PROT,BLEACHED,UNENR","WHEAT FLR,WHITE (INDUSTRIAL),15% PROT,BLEACHED,ENR"
Energ_Kcal,717.0,717.0,876.0,353.0,371.0,334.0,300.0,403.0,394.0,98.0,...,361.0,371.0,158.0,384.0,138.0,157.0,366.0,363.0,362.0,362.0
Protein,0.85,0.85,0.28,21.4,23.24,20.75,19.8,24.9,23.76,11.12,...,11.98,13.04,5.8,14.16,4.54,5.8,9.71,11.5,13.07,15.33
Carbohydrt,0.06,0.06,0.0,2.34,2.79,0.45,0.46,1.28,2.57,3.38,...,72.53,74.67,30.86,71.27,25.16,30.59,76.22,73.81,72.2,69.88
Fiber_TD,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.4,3.2,1.8,3.3,1.2,1.8,2.4,2.4,2.4,2.4
Sugar_Tot,0.06,0.06,0.0,0.5,0.51,0.45,0.46,0.52,0.52,2.67,...,0.31,2.67,0.56,1.88,0.4,0.56,0.49,1.12,1.1,0.92
Lipid_Tot,81.11,81.11,99.48,28.74,29.68,27.68,24.26,33.14,32.11,4.3,...,1.66,1.51,0.93,4.44,2.07,0.93,1.48,1.45,1.38,1.41
FA_Sat,51.368,50.489,61.924,18.669,18.764,17.41,15.259,21.092,20.218,1.718,...,0.244,0.277,0.176,1.18,0.419,0.176,0.302,0.268,0.189,0.272
FA_Poly,3.043,3.012,3.694,0.8,0.784,0.826,0.724,0.942,0.953,0.123,...,0.727,0.564,0.319,1.331,0.552,0.319,0.845,0.749,0.683,0.712
Sodium,0.576,0.827,0.002,1.395,0.56,0.629,0.842,0.621,0.604,0.364,...,0.002,0.006,0.001,0.021,0.005,0.131,0.002,0.002,0.002,0.002


In [6]:
print(ingredients.columns)

Index(['Energ_Kcal', 'Protein', 'Carbohydrt', 'Fiber_TD', 'Sugar_Tot',
       'Lipid_Tot', 'FA_Sat', 'FA_Poly', 'Sodium'],
      dtype='object')


In [7]:
requirements=pd.read_excel("Matrix Porridge (filtered).xlsx", "requirements (2021)")
requirements['min (g)']/=3
requirements['max (g)']/=3

print(requirements)

                                          Unnamed: 0    min (g)     max (g)
0                                                Fat  14.814815   25.925926
1   n-6 polyunsaturated fatty acidsa (linoleic acid)   3.703704    7.407407
2  n-3 polyunsaturated fatty acidsa (α-linolenic ...   0.444444    0.888889
3                                       Carbohydrate  75.000000  108.333333
4                                            Protein  16.666667   58.333333
5                                             Sodium   0.500000    0.766667


Note: our data does not split the two kinds of polyunsaturated fat, so we'll have to sum these up. There's another dataset I've been looking at which might work better for this, but we'll get to that later.

In [8]:
fat_min=requirements.iloc[0,1]
fat_max=requirements.iloc[0,2]
fat_half_range=(fat_max-fat_min)/2 #calculate this once so we don't need to do it repeatedly in our function later
fat_opt=(fat_min+fat_max)/2

pufat_min=requirements.iloc[1,1]+requirements.iloc[2,2]
pufat_max=requirements.iloc[1,2]+requirements.iloc[2,2]
pufat_half_range=(pufat_max-pufat_min)/2
pufat_opt=(pufat_min+pufat_max)/2

carb_min=requirements.iloc[3,1]
carb_max=requirements.iloc[3,2]
carb_half_range=(carb_max-carb_min)/2
carb_opt=(carb_min+carb_max)/2

protein_min=requirements.iloc[4,1]
protein_max=requirements.iloc[4,2]
protein_half_range=(protein_max-protein_min)/2
protein_opt=(protein_min+protein_max)/2

sodium_min=requirements.iloc[5,1]
sodium_max=requirements.iloc[5,2]
sodium_half_range=(sodium_max-sodium_min)/2
sodium_opt=(sodium_min+sodium_max)/2

## The Approach
This is clearly an optimization problem. However, it is a bit more complicated than what you'd typically use Linear Programming to solve. We could potentially frame it that way... with an A matrix 2330 columns wide. But our objective function here isn't written easily as a function of pure X (our ingredients vector), no matter what we decide to optimize.

There are, however, many modern machine learning approaches that can help us here. The trick is to use a solver / algorithm that works with a custom function instead of a vector. This way, we can to optimize for a custom value function. SciPy's Linear Programming functionality requires that our objective be a vector, but that doesn't really work here. But there are plenty of alternatives, so we'll try a few of those. 

First, however, we need to define what it is we seek to optimize. Eventually, we may wish to make this something the user could select from a list of options (which would also alter constraints), to accommodate different nutrition guides, such as Atkins, or a high fiber diet. Or, maybe we seek to maximize quantity of food while still meeting nutrition guidelines. For now, I'm going to take bit of a fuzzy logic approach, with our function outputting a value that's most optimal when all nutrition requirements are exactly in the center of the ranges and where we impose serious (but linear with a slope) penalties if any nutritional requirements fall outisde the acceptable ranges.

In [9]:
import numpy as np
x=np.zeros(len(final_ingredients_df))
x[4]=1
x[39]=1
x[554]=1
x[1000]=1
A=inputs.values #get A matrix
y=A.dot(x)
print(y)

[864.     54.76   35.96    6.2     4.85   57.67   32.003   6.767   1.274]


In [10]:
def diet_function(ingredients_vector):
    y=A.dot(ingredients_vector)
    calorie_penalty=10*abs(666.7-y[0]) #weighting calories very highly here
    
    protein_penalty=abs(y[1]-protein_opt)
    if abs(y[1]-protein_opt)>protein_half_range:
        protein_penalty*=20
    
    carb_penalty=abs((y[2]+y[3])-carb_opt)
    if carb_penalty>carb_half_range:
        carb_penalty*=20    
    fiber_bonus=y[3]
    sugar_penalty=y[4]

        
    fat_penalty=abs(y[5]-fat_opt)
    if fat_penalty>fat_half_range:
        fat_penalty*=20
       
    sat_fat_penalty=y[6]
    
    pufat_penalty=abs(y[7]-pufat_opt)
    if pufat_penalty>pufat_half_range:
        pufat_penalty*=20
           
    sodium_penalty=abs(y[8]-sodium_opt)
    if sodium_penalty>sodium_half_range:
        sodium_penalty*=100            
            
    value=100-calorie_penalty-protein_penalty-carb_penalty+fiber_bonus-sugar_penalty-fat_penalty-sat_fat_penalty-pufat_penalty-sodium_penalty
    return -value #return negative since this is a minimization problem
        

Note: I increased the penalty for sodium from notebook 1. Hopefully, we'll get a result that's a bit less salty.

In [11]:
from scipy.optimize import minimize,dual_annealing

In [12]:
bounds=tuple([(0,10) for i in range(len(final_ingredients_df))])
x0=[1]*len(final_ingredients_df)

In [13]:
%%time
res=minimize(diet_function,x0,method='TNC',bounds=bounds,tol=1e-6,options={'maxiter':int(1e9),'minfev':-100})

Wall time: 3min 4s


### Local vs Global Optimization
SciPy's minimize function is a local optimization algorithm, with many different methods you can choose from to find a local minima based on various methods depending on whether you have bounded inputs or other constraints. 

For those that unfamiliar with local vs global optimization concepts, Mathworks (the makers of Matlab) explains it quite well [here](https://www.mathworks.com/help/gads/what-is-global-optimization.html).

In a nutshell, because of the way we set up our value function with different slopes / contributions to the overall value for each nutritional category depending on whether they are inside our outside our acceptable ranges, we have made this into a nonlinear problem. What that means is that if we start at a random initial position on our value function and use something like gradient descent or Newton's method (or various other approaches) to follow the slope to the local minima, we can't be sure that this is the same as the overall or global minima, which is the true optimal solution. Our starting point and various other hyperparameters (like learning rate) matter. Check out the following graph for a visual explanation:

<img src="https://www.mathworks.com/help/gads/local_vs_global.png">

So the way that we typically go about finding the true optimal solution for these sorts of problems is to use a local optimizer with a multitude of initial starting conditions. Those starting conditions can be purely random, or they can follow some sort of search logic. SciPy offers several options. Generally, I find the dual_annealing offers a good overall performance here. But, I'm going to hold off on running that for now because it's quite slow, and as you'll see later, we have a lot of work left to do before we're ready to go for the final run.

In [14]:
# %%time
# res=dual_annealing(diet_function,x0=x0,bounds=bounds,maxfun=1e8,local_search_options={'method':'TNC','options':{'maxiter':int(1e9),'minfev':-100}})

In [15]:
print(res)

     fun: 2430.7753009544113
     jac: array([5284.46448698, 5309.30551577, 6445.37049084, ..., 4807.35457131,
       4769.34555991, 4723.92857773])
 message: 'Converged (|x_n-x_(n-1)| ~= 0)'
    nfev: 2740311
     nit: 1123
  status: 2
 success: True
       x: array([0., 0., 0., ..., 0., 0., 0.])


In [16]:
solution=res.x

In [17]:
output=pd.Series(A.dot(solution),index=inputs.index)
print(output)

Energ_Kcal    666.700000
Protein        46.095861
Carbohydrt    133.846173
Fiber_TD       50.057140
Sugar_Tot      65.843184
Lipid_Tot       7.578306
FA_Sat          1.451818
FA_Poly         3.032653
Sodium          3.994649
dtype: float64


Note: this solution looks pretty good, but it's a bit high in both carbs and sodium. Sugar content is high too, thought it's offset by a very nice fiber content. On the sugar side, they do appear to be natural sugars from vegetables, at least. We may want to revise our value function a bit, but this is not a bad initial result.

In [18]:
solution_ds=pd.Series(solution,index=final_ingredients_df.index)
solution_ds.to_csv('matrix_porridge_refined_recipe.csv')
solution_ds=solution_ds[solution_ds>0.4]
print(solution_ds)

Shrt_Desc
SALAD DRSNG,SWT&SOUR                                         0.408838
BAMBOO SHOOTS,CND,DRND SOL                                   0.410254
CABBAGE,CHINESE (PE-TSAI),RAW                                0.491663
CUCUMBER,WITH PEEL,RAW                                       0.563560
CUCUMBER,PEELED,RAW                                          0.680466
LETTUCE,GRN LEAF,RAW                                         0.443536
SQUASH,SMMR,CROOKNECK&STRAIGHTNECK,RAW                       0.423032
SQUASH,SMMR,CROOKNECK&STRAIGHTNECK,CND,DRND,SOLID,WO/SALT    0.509706
SQUASH,SMMR,ZUCCHINI,INCL SKN,RAW                            0.442922
WATERCRESS,RAW                                               0.663274
WAXGOURD,(CHINESE PRESERVING MELON),CKD,BLD,DRND,WO/SALT     0.589757
BEANS,MUNG,MATURE SEEDS,SPROUTED,CND,DRND SOL                0.538580
SQUASH,SMMR,ALL VAR,RAW                                      0.491937
ASPARAGUS,CND,NO SALT,SOL&LIQUIDS                            0.531024
PICKLES,CU

Note: units are in 100g. We'll have to convert this something more useful when we're ready to make it.

## Next Steps

We still have a couple clusters of similar items here, and our sodium level is still high. One thing to see is that there are several types of items that say "W/SALT" in their description. I confirmed these have "WO/SALT" counterparts. For our next iteration, we'll remove all the items with salt added.

Also, I'm realizing that my original dataset from 2010 has no meat in it... just meat alternatives like tofu. This is probably because I was a vegetarian at the time, and even though I'm not a strict vegetarian now, it would probably be worth testing this both with and without meat to see the difference and offer people a couple options.

I've now re-downloaded the USDA database, which they offer in a convenient MS Access format, though they did not predefine any of the relationships between tables, not did they create the final summary sheet via a query to generate what we have now, so I'm having to do a bit of work.

The next version will have that updated dataset, and I will remove any items that have added salt, add back in regular table salt as an ingredient in case it's needed, and see how it does. With any luck, we'll be ready to run this through our global optimizer.