# Matrix Porridge
by Kyle Archie, M.Eng

This work builds upon the first two notebooks in this repository. Please read through that one for background and explanation of the data / process.

## Notebook 3: Filtering our Input Data via Clustering

I went back to the [root data from the USDA](https://fdc.nal.usda.gov/download-datasets.html). I used the MS Access download option for April 2021. I had to build several new queries to generate an export we can use that resembles the old format. Unfortunately, the database itself is too big to upload to GitHub, but I included the main export as an Excel sheet here which includes all the different foods within the "survey_fndds_foods" table, along with their categories, the full descriptions from the "foods" table along with the nutrition content for all the tested nutrients (I believe this is per 100g of that food). Then, I selected the categories for unprocessed, raw ingredients, at least as much I was able to. The filtered data was then copied to another worksheet, which is what we'll be working with from here on out.

Note: I still intend to use clustering technique, possibly with some refinements, to filter out similar items. Unfortunately, it appears that the descriptions for many of these foods has changed since I used this data last. "w/ salt" is no longer a string we can use to parse out variants of ingredients with added salt. I may have to manually scrub this.

In [1]:
import pandas as pd

ingredients = pd.read_excel("FoodData_Export.xlsx", "filtered")
ingredients['Sodium, Na']=ingredients['Sodium, Na']/1000 #values are in mg... convert to g
ingredients.set_index('description',inplace=True)
ingredients['Total Fat']=ingredients[['Fatty acids, total saturated','Fatty acids, total monounsaturated','Fatty acids, total polyunsaturated']].sum(axis=1)
ingredients=ingredients[['Energy','Protein','Carbohydrate, by difference','Fiber, total dietary','Sugars, total including NLEA','Total Fat','Fatty acids, total saturated','Fatty acids, total polyunsaturated','Sodium, Na']].dropna()

Reference used: https://towardsdatascience.com/k-means-vs-dbscan-clustering-49f8e627de27

In [2]:
from sklearn import metrics
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import numpy as np

In [3]:
ingredients_dbscan = DBSCAN(eps=0.5, min_samples=2)
ingredients_dbscan.fit(ingredients.values)
labels = ingredients_dbscan.labels_

# Creating a numpy array with all values set to false by default
samples_mask = np.zeros_like(labels, dtype=bool)
# add outliers
samples_mask[labels==-1] = True

#add outliers to our final dataset
final_ingredients_df=ingredients[samples_mask]

In [4]:
# Finding the number of clusters in labels (ignoring noise if present)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
for i in range(n_clusters):
    sample=ingredients[labels==i].sample(1)
    final_ingredients_df=final_ingredients_df.append(sample)


In [5]:
inputs=final_ingredients_df.T
# inputs.columns=ingredients['Shrt_Desc']
# inputs.dropna(inplace=True,axis=1) #some of these ingredients have null values. Remove those. 
inputs.fillna(0,inplace=True)
inputs.head(10)

description,"Apple juice, 100%, with calcium added","Apple, raw","Applesauce, unsweetened",Apple pie filling,"Apple, baked","Apple, candied","Beef, bacon, cooked","Turkey bacon, cooked",Bacon bits,"Banana, raw",...,"Cornish game hen, roasted, skin eaten","Cornish game hen, roasted, skin not eaten","Tomato and vegetable juice, 100%, low sodium",Tomato juice cocktail,"Yogurt, Greek, NS as to type of milk or flavor","Yogurt, Greek, low fat milk, fruit","Yogurt, Greek, NS as to type of milk, flavors other than fruit","Yogurt, NS as to type of milk or flavor","Yogurt, low fat milk, fruit","Yogurt, NS as to type of milk, flavors other than fruit"
Energy,48.0,52.0,42.0,100.0,112.0,134.0,449.0,368.0,476.0,89.0,...,257.0,133.0,22.0,22.0,73.0,105.0,95.0,63.0,89.0,73.0
Protein,0.12,0.26,0.17,0.1,0.32,1.34,31.3,29.5,32.0,1.09,...,22.08,23.11,0.6,0.93,9.95,8.17,8.64,5.25,4.66,5.09
"Carbohydrate, by difference",11.49,13.81,11.27,26.1,22.7,29.61,1.4,4.24,28.6,22.84,...,0.0,0.0,4.59,3.87,3.94,12.29,9.54,7.04,14.46,9.82
"Fiber, total dietary",0.3,2.4,1.1,1.0,2.5,1.8,0.0,0.0,10.2,2.6,...,0.0,0.0,0.8,0.5,0.0,1.0,0.0,0.0,0.1,0.0
"Sugars, total including NLEA",9.47,10.39,9.39,13.8,18.99,24.17,0.0,4.24,0.0,12.23,...,0.0,0.0,3.28,2.84,3.56,11.23,9.54,7.04,12.01,9.82
Total Fat,0.088,0.086,0.024,0.0,2.806,1.939,32.78,23.186,23.828,0.217,...,16.511,3.144,0.072,0.224,1.792,2.362,2.514,1.47,1.299,1.426
"Fatty acids, total saturated",0.029,0.028,0.008,0.0,1.812,0.64,14.35,6.933,4.055,0.112,...,5.008,0.982,0.014,0.088,1.23,1.599,1.465,1.0,0.881,0.97
"Fatty acids, total polyunsaturated",0.051,0.051,0.014,0.0,0.16,0.908,1.58,6.871,13.548,0.073,...,3.57,0.932,0.042,0.076,0.076,0.113,0.203,0.044,0.039,0.043
"Sodium, Na",0.005,0.001,0.002,0.047,0.004,0.062,1.5,2.021,1.77,0.001,...,0.386,0.385,0.058,0.169,0.034,0.033,0.04,0.07,0.065,0.068


In [6]:
print(ingredients.columns)

Index(['Energy', 'Protein', 'Carbohydrate, by difference',
       'Fiber, total dietary', 'Sugars, total including NLEA', 'Total Fat',
       'Fatty acids, total saturated', 'Fatty acids, total polyunsaturated',
       'Sodium, Na'],
      dtype='object')


In [7]:
requirements=pd.read_excel("Matrix Porridge (filtered).xlsx", "requirements (2021)")
requirements['min (g)']/=3
requirements['max (g)']/=3

print(requirements)

                                          Unnamed: 0    min (g)     max (g)
0                                                Fat  14.814815   25.925926
1   n-6 polyunsaturated fatty acidsa (linoleic acid)   3.703704    7.407407
2  n-3 polyunsaturated fatty acidsa (α-linolenic ...   0.444444    0.888889
3                                       Carbohydrate  75.000000  108.333333
4                                            Protein  16.666667   58.333333
5                                             Sodium   0.500000    0.766667


Note: our data does not split the two kinds of polyunsaturated fat, so we'll have to sum these up. There's another dataset I've been looking at which might work better for this, but we'll get to that later.

In [8]:
fat_min=requirements.iloc[0,1]
fat_max=requirements.iloc[0,2]
fat_half_range=(fat_max-fat_min)/2 #calculate this once so we don't need to do it repeatedly in our function later
fat_opt=(fat_min+fat_max)/2

pufat_min=requirements.iloc[1,1]+requirements.iloc[2,2]
pufat_max=requirements.iloc[1,2]+requirements.iloc[2,2]
pufat_half_range=(pufat_max-pufat_min)/2
pufat_opt=(pufat_min+pufat_max)/2

carb_min=requirements.iloc[3,1]
carb_max=requirements.iloc[3,2]
carb_half_range=(carb_max-carb_min)/2
carb_opt=(carb_min+carb_max)/2

protein_min=requirements.iloc[4,1]
protein_max=requirements.iloc[4,2]
protein_half_range=(protein_max-protein_min)/2
protein_opt=(protein_min+protein_max)/2

sodium_min=requirements.iloc[5,1]
sodium_max=requirements.iloc[5,2]
sodium_half_range=(sodium_max-sodium_min)/2
sodium_opt=(sodium_min+sodium_max)/2

## The Approach
This is clearly an optimization problem. However, it is a bit more complicated than what you'd typically use Linear Programming to solve. We could potentially frame it that way... with an A matrix 2330 columns wide. But our objective function here isn't written easily as a function of pure X (our ingredients vector), no matter what we decide to optimize.

There are, however, many modern machine learning approaches that can help us here. The trick is to use a solver / algorithm that works with a custom function instead of a vector. This way, we can to optimize for a custom value function. SciPy's Linear Programming functionality requires that our objective be a vector, but that doesn't really work here. But there are plenty of alternatives, so we'll try a few of those. 

First, however, we need to define what it is we seek to optimize. Eventually, we may wish to make this something the user could select from a list of options (which would also alter constraints), to accommodate different nutrition guides, such as Atkins, or a high fiber diet. Or, maybe we seek to maximize quantity of food while still meeting nutrition guidelines. For now, I'm going to take bit of a fuzzy logic approach, with our function outputting a value that's most optimal when all nutrition requirements are exactly in the center of the ranges and where we impose serious (but linear with a slope) penalties if any nutritional requirements fall outisde the acceptable ranges.

In [9]:
import numpy as np
x=np.zeros(len(final_ingredients_df))
x[4]=1
x[39]=1
x[554]=1
x[1000]=1
A=inputs.values #get A matrix
y=A.dot(x)
print(y)

[432.     30.36   49.53   11.7    25.12   11.347   3.49    3.42    0.551]


In [10]:
def diet_function(ingredients_vector):
    y=A.dot(ingredients_vector)
    calorie_penalty=(666.7-y[0])**2 #weighting calories very highly here
    
    protein_penalty=abs(y[1]-protein_opt)
    if abs(y[1]-protein_opt)>protein_half_range:
        protein_penalty*=protein_penalty
    
    carb_penalty=abs((y[2]+y[3])-carb_opt)
    if carb_penalty>carb_half_range:
        carb_penalty*=carb_penalty    
#     fiber_bonus=y[3]
    sugar_penalty=y[4]**2

        
    fat_penalty=abs(y[5]-fat_opt)
    if fat_penalty>fat_half_range:
        fat_penalty*=fat_penalty
       
    sat_fat_penalty=y[6]**2
    
    pufat_penalty=abs(y[7]-pufat_opt)
    if pufat_penalty>pufat_half_range:
        pufat_penalty*=pufat_penalty
           
    sodium_penalty=abs(y[8]-sodium_opt)
    if sodium_penalty>sodium_half_range:
        sodium_penalty*=sodium_penalty*10 #adjusting for small number         
            
    value=calorie_penalty+protein_penalty+carb_penalty+sugar_penalty+fat_penalty+sat_fat_penalty+pufat_penalty+sodium_penalty
    return value 
        

Note: There are several changes here. I now use a square of the error when outside the acceptable ranges for the penalty terms, as well as an overall square for the calorie penalty. I eliminated the 100 calorie offset, as well as the fiber bonus, meaning the optimal solution will now be the root of the function, which should work better with Newton's method, which is, according to the documentation, the core of the TNC algorithm. 

In [11]:
from scipy.optimize import minimize,dual_annealing

In [12]:
bounds=tuple([(0,10) for i in range(len(final_ingredients_df))])
x0=[.05]*len(final_ingredients_df)

Note: trying a different initial condition here... .05 instead of 1

### Local vs Global Optimization
SciPy's minimize function is a local optimization algorithm, with many different methods you can choose from to find a local minima based on various methods depending on whether you have bounded inputs or other constraints. 

For those that unfamiliar with local vs global optimization concepts, Mathworks (the makers of Matlab) explains it quite well [here](https://www.mathworks.com/help/gads/what-is-global-optimization.html).

In a nutshell, because of the way we set up our value function with different slopes / contributions to the overall value for each nutritional category depending on whether they are inside our outside our acceptable ranges, we have made this into a nonlinear problem. What that means is that if we start at a random initial position on our value function and use something like gradient descent or Newton's method (or various other approaches) to follow the slope to the local minima, we can't be sure that this is the same as the overall or global minima, which is the true optimal solution. Our starting point and various other hyperparameters (like learning rate) matter. Check out the following graph for a visual explanation:

<img src="https://www.mathworks.com/help/gads/local_vs_global.png">

So the way that we typically go about finding the true optimal solution for these sorts of problems is to use a local optimizer with a multitude of initial starting conditions. Those starting conditions can be purely random, or they can follow some sort of search logic. SciPy offers several options. Generally, I find the dual_annealing offers a good overall performance here. But, I'm going to hold off on running that for now because it's quite slow, and as you'll see later, we have a lot of work left to do before we're ready to go for the final run.

In [13]:
# %%time
# res=minimize(diet_function,x0,method='TNC',bounds=bounds,tol=1e-1,options={'maxiter':int(1e9),'minfev':0})

In [14]:
%%time
res=dual_annealing(diet_function,x0=x0,bounds=bounds,maxfun=5e9,local_search_options={'method':'TNC','options':{'maxiter':int(1e3),'minfev':0}})

Wall time: 4h 6min 32s


In [15]:
print(res)

     fun: 80.18513803165641
 message: ['Maximum number of iteration reached']
    nfev: 518776608
    nhev: 0
     nit: 1000
    njev: 0
  status: 0
 success: True
       x: array([6.83638083e-05, 2.70462482e-05, 6.06013217e-05, ...,
       1.38144044e-05, 2.33028450e-04, 3.20255267e-04])


In [16]:
solution=res.x

In [17]:
output=pd.Series(A.dot(solution),index=inputs.index)
print(output)

Energy                                666.519828
Protein                                48.726152
Carbohydrate, by difference            71.448159
Fiber, total dietary                   16.005785
Sugars, total including NLEA            5.663570
Total Fat                              20.284846
Fatty acids, total saturated            5.133039
Fatty acids, total polyunsaturated      6.448692
Sodium, Na                              1.420729
dtype: float64


Note: this solution looks pretty good, but it's a bit high in both carbs and sodium. Sugar content is high too, thought it's offset by a very nice fiber content. On the sugar side, they do appear to be natural sugars from vegetables, at least. We may want to revise our value function a bit, but this is not a bad initial result.

In [19]:
solution_ds=pd.Series(solution*100,index=final_ingredients_df.index,name="grams") #multiply by 100 for grams
solution_ds=solution_ds[solution_ds>0.1]
solution_ds.to_csv('matrix_porridge_recipe_v3.csv')
print(solution_ds)

description
Beans, from dried, NS as to type, no added fat       1.382660
Beans, from canned, NS as to type, fat added         1.128672
Beans, from canned, NS as to type, no added fat      0.923379
Beans, from fast food / restaurant, NS as to type    0.528421
White beans, from dried, no added fat                0.881901
                                                       ...   
Yellow rice, cooked, no added fat                    2.959176
Rice, wild, 100%, cooked, fat added                  1.355446
Rice, white and wild, cooked, NS as to fat           2.434946
Clams, baked or broiled, fat added                   0.198527
Mussels, cooked, NS as to cooking method             0.351050
Name: grams, Length: 514, dtype: float64


## Next Steps

There are two related issues here. First, the global optimizer ran for 4.5 hours and still didn't finish. It found a pretty good solution (better than I could find using the local optimizer on this new set), but it's high in saturated fat and sodium. But the other thing to note here is that several of the above ingredients are compound ingredients that include repeating core ingredients. Clearly there's more manual filtering I need to do on the input data.

So, next I'm going to spend an hour or two manually culling the input dataset. If we were talking a longer list of ingredients, I'd probably work on doing this a bit more programmatically, but in my experience, when working with smaller datasets (<5k records), it's usually worth having a human do this part.

# Add a Wordcloud

In [20]:
import stylecloud

In [22]:
stylecloud.gen_stylecloud(file_path='matrix_porridge_recipe_v3.csv',
                          icon_name= "fas fa-apple-alt")

<img src="stylecloud.png">