In [2]:
import pandas as pd

### The goal is to find clusters of different food groups based on Nutrient Values.

    Clustering based on nutrient value of food can be useful in demand prediction because it allows companies to identify groups of customers who have similar dietary preferences and needs. By clustering customers based on the nutrient value of the food they consume, companies can better understand the demand patterns of different customer groups and tailor their marketing and sales efforts to meet the needs of each group.

    Here are some ways clustering based on nutrient value can be helpful in demand prediction:

    1. Identifying customer preferences: By clustering customers based on the nutrient value of the food they consume, companies can identify which nutrients are most important to different customer segments. This information can be used to predict demand for foods that are high in specific nutrients, such as protein, fiber, or vitamins.

    2. Customizing product offerings: By understanding the nutrient preferences of different customer segments, companies can tailor their product offerings to meet the needs of each group. For example, a company could develop a line of products that are high in protein for customers who are looking to increase their protein intake.

    3. Creating targeted marketing campaigns: Clustering based on nutrient value can help companies create targeted marketing campaigns that resonate with specific customer segments. For example, a company could create a campaign that emphasizes the nutritional benefits of its products to appeal to health-conscious customers.

    4. Forecasting demand: By analyzing historical demand patterns for different nutrient categories, companies can forecast future demand and adjust their production, inventory, and supply chain strategies accordingly.

    Overall, clustering based on nutrient value can provide valuable insights into customer behavior and preferences that can be used to improve demand prediction and optimize business operations.


### Furthermore, this project shows in detail how the clusters change over the years. This helps in understanding how consumer behaviour shifts as people change their lifestyles over time.
 

In [60]:
nut_val_df = pd.read_excel("Nutrient_Vals/2019-2020 FNDDS At A Glance - FNDDS Nutrient Values.xlsx", header=1)
ing_nut_val_df = pd.read_excel("Nutrient_Vals/2019-2020 FNDDS At A Glance - Ingredient Nutrient Values.xlsx", header = 1)

In [65]:
import plotly
plotly.__version__

'5.6.0'

In [63]:
nut_val_df.iloc[0:1500,:].to_csv("Nutrient_Vals/nut_val_df_trimmed.csv")

In [64]:
nut_val_df

Unnamed: 0,Food code,Main food description,WWEIA Category number,WWEIA Category description,Energy (kcal),Protein (g),Carbohydrate (g),"Sugars, total\n(g)","Fiber, total dietary (g)",Total Fat (g),...,20:1\n(g),22:1\n(g),18:2\n(g),18:3\n(g),18:4\n(g),20:4\n(g),20:5 n-3\n(g),22:5 n-3\n(g),22:6 n-3\n(g),Water\n(g)
0,11000000,"Milk, human",9602,Human milk,70,1.03,6.89,6.89,0.0,4.38,...,0.040,0.000,0.374,0.052,0.0,0.026,0.000,0.000,0.000,87.50
1,11100000,"Milk, NFS",1004,"Milk, reduced fat",52,3.33,4.83,4.88,0.0,2.14,...,0.002,0.000,0.074,0.008,0.0,0.003,0.000,0.001,0.000,88.92
2,11111000,"Milk, whole",1002,"Milk, whole",61,3.27,4.63,4.81,0.0,3.20,...,0.004,0.000,0.115,0.013,0.0,0.004,0.001,0.002,0.000,88.10
3,11112110,"Milk, reduced fat (2%)",1004,"Milk, reduced fat",50,3.36,4.90,4.89,0.0,1.90,...,0.002,0.000,0.061,0.007,0.0,0.003,0.000,0.001,0.000,89.10
4,11112210,"Milk, low fat (1%)",1006,"Milk, lowfat",43,3.38,5.18,4.96,0.0,0.95,...,0.001,0.000,0.033,0.004,0.0,0.001,0.000,0.000,0.000,89.70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5619,99997802,Tomatoes as ingredient in omelet,9999,Not included in a food category,25,1.11,5.48,3.42,1.6,0.23,...,0.000,0.000,0.089,0.004,0.0,0.000,0.000,0.000,0.000,92.57
5620,99997804,Other vegetables as ingredient in omelet,9999,Not included in a food category,39,3.25,5.74,2.73,1.4,0.39,...,0.000,0.000,0.174,0.001,0.0,0.000,0.000,0.000,0.000,89.67
5621,99997810,Vegetables as ingredient in curry,9999,Not included in a food category,52,1.81,11.60,3.25,2.2,0.19,...,0.000,0.000,0.052,0.012,0.0,0.000,0.000,0.000,0.000,85.59
5622,99998130,Sauce as ingredient in hamburgers,9999,Not included in a food category,272,1.34,17.14,13.08,0.6,22.85,...,0.106,0.133,11.810,1.682,0.0,0.015,0.000,0.000,0.002,55.97


In [35]:
nut_val_df = nut_val_df.iloc[:2500,:]

The clustering analysis covers two aspects:
1. The nutrient composition of different food items (FNDDS).
    - This helps us understand what kinds of foods people have been consuming over the years.
    
2. The nutrient composition of the ingredients used in the food industry.
    - This helps us understand how the ingredients in use have changed over the years.

## FNDDS Dataset


In [13]:
nut_val_df.columns

Index(['Food code', 'Main food description', 'WWEIA Category number',
       'WWEIA Category description', 'Energy (kcal)', 'Protein (g)',
       'Carbohydrate (g)', 'Sugars, total\n(g)', 'Fiber, total dietary (g)',
       'Total Fat (g)', 'Fatty acids, total saturated (g)',
       'Fatty acids, total monounsaturated (g)',
       'Fatty acids, total polyunsaturated (g)', 'Cholesterol (mg)',
       'Retinol (mcg)', 'Vitamin A, RAE (mcg_RAE)', 'Carotene, alpha (mcg)',
       'Carotene, beta (mcg)', 'Cryptoxanthin, beta (mcg)', 'Lycopene (mcg)',
       'Lutein + zeaxanthin (mcg)', 'Thiamin (mg)', 'Riboflavin (mg)',
       'Niacin (mg)', 'Vitamin B-6 (mg)', 'Folic acid (mcg)',
       'Folate, food (mcg)', 'Folate, DFE (mcg_DFE)', 'Folate, total (mcg)',
       'Choline, total (mg)', 'Vitamin B-12 (mcg)',
       'Vitamin B-12, added\n(mcg)', 'Vitamin C (mg)',
       'Vitamin D (D2 + D3) (mcg)', 'Vitamin E (alpha-tocopherol) (mg)',
       'Vitamin E, added\n(mg)', 'Vitamin K (phylloquinone) (

#### The dataset above has many features, and high-dimensional data may not be well suited to clustering. Hence, we group the nutrient values of the foods as follows:
1. Macro nutrients - cluster analysis based on 'Protein (g)', 'Carbohydrate (g)', 'Sugars, total\n(g)', 'Fiber, total dietary (g)' and 'Total Fat (g)'

2. Vitamins - cluster analysis based on the vitamin content of the foods

3. Fatty acids - cluster analysis based on nutrients such as cholesterol and the saturated/unsaturated fatty acids.

4. Minerals 

5. Amino acids

In [36]:

macro_nutrients = ['WWEIA Category description','Protein (g)',
       'Carbohydrate (g)', 'Sugars, total\n(g)', 'Fiber, total dietary (g)',
       'Total Fat (g)']

fatty_acids = ['WWEIA Category description','Fatty acids, total saturated (g)',
       'Fatty acids, total monounsaturated (g)',
       'Fatty acids, total polyunsaturated (g)', 'Cholesterol (mg)',
       'Retinol (mcg)']

vitamins = ['WWEIA Category description','Vitamin A, RAE (mcg_RAE)', 'Vitamin B-12 (mcg)',
       'Vitamin B-12, added\n(mcg)', 'Vitamin C (mg)',
       'Vitamin D (D2 + D3) (mcg)', 'Vitamin E (alpha-tocopherol) (mg)',
       'Vitamin E, added\n(mg)', 'Vitamin K (phylloquinone) (mcg)']

minerals = ['WWEIA Category description','Calcium (mg)', 'Phosphorus (mg)', 'Magnesium (mg)', 'Iron\n(mg)',
       'Zinc\n(mg)', 'Copper (mg)', 'Selenium (mcg)', 'Potassium (mg)']


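# Note: despite its name, this group holds carotenoids, B vitamins and folate columns, not amino acids.
amino_acids = ['WWEIA Category description','Carotene, alpha (mcg)',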
       'Carotene, beta (mcg)', 'Cryptoxanthin, beta (mcg)', 'Lycopene (mcg)',
       'Lutein + zeaxanthin (mcg)', 'Thiamin (mg)', 'Riboflavin (mg)',
       'Niacin (mg)', 'Vitamin B-6 (mg)', 'Folic acid (mcg)',
       'Folate, food (mcg)', 'Folate, DFE (mcg_DFE)', 'Folate, total (mcg)']

df_macros = nut_val_df[macro_nutrients]
df_fatty_acids = nut_val_df[fatty_acids]
df_vitamins = nut_val_df[vitamins]
df_minerals = nut_val_df[minerals]
df_aminos = nut_val_df[amino_acids]

In [37]:
df_macros['WWEIA Category description'].value_counts()

Meat mixed dishes               254
Chicken, whole pieces           161
Eggs and omelets                147
Poultry mixed dishes            128
Fish                            109
                               ... 
Butter and animal fats            1
Coleslaw, non-lettuce salads      1
Cakes and pies                    1
Baby food: mixtures               1
Cookies and brownies              1
Name: WWEIA Category description, Length: 82, dtype: int64

### A plot of top 20 categories to understand the primary contributors

In [38]:
import plotly.express as px

In [39]:
df_to_plot = pd.DataFrame(columns = ["Category", "Value"])

In [40]:
# Count the categories once and keep the 20 most frequent.
top_categories = df_macros['WWEIA Category description'].value_counts().head(20)
df_to_plot["Category"] = top_categories.index
df_to_plot["Value"] = top_categories.values

In [41]:
fig = px.bar(df_to_plot, x="Value", y="Category", orientation='h')
fig.show()

### The main objective of this plot is to identify the food categories that contribute most to the nutrient clusters

## Next, let's check whether the nutrients in the foods are correlated with each other.

In [20]:
import plotly.graph_objects as go

In [42]:
def plot_corr_plot(df, columns):
    """Plot a correlation heatmap for every column of df except the first (the category label)."""
    x = columns[1:]
    # Compute the correlation matrix once and reuse it for the cell values and the hover text.
    corr = df.iloc[:, 1:].corr()
    heat = go.Heatmap(z=corr,
                      x=x,
                      y=x,
                      xgap=1, ygap=1,
                      colorbar_thickness=20,
                      colorbar_ticklen=3,
                      hovertext=corr,
                      hoverinfo='text'
                      )

    title = 'Correlation Matrix'

    layout = go.Layout(title_text=title, title_x=0.5,
                       width=600, height=600,
                       xaxis_showgrid=False,
                       yaxis_showgrid=False,
                       yaxis_autorange='reversed')

    fig = go.Figure(data=[heat], layout=layout)
    fig.show()


In [43]:
(plot_corr_plot(df_macros, macro_nutrients))
(plot_corr_plot(df_fatty_acids, fatty_acids))
(plot_corr_plot(df_minerals, minerals))
(plot_corr_plot(df_aminos, amino_acids))
(plot_corr_plot(df_vitamins, vitamins))
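
The heatmaps above are read by eye. As a complement, the strongest correlations can also be listed as numbers. The sketch below is only an illustration on a small made-up table: the helper name `top_correlated_pairs` and the toy values are assumptions and are not part of the FNDDS data.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes("number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Hypothetical toy data: sugars track carbohydrates closely, protein does not.
toy = pd.DataFrame({
    "Category": ["a", "b", "c", "d", "e"],
    "Carbohydrate (g)": [5.0, 10.0, 20.0, 40.0, 50.0],
    "Sugars (g)": [4.0, 9.0, 18.0, 35.0, 46.0],
    "Protein (g)": [20.0, 3.0, 12.0, 7.0, 15.0],
})
top_pairs = top_correlated_pairs(toy, n=1)
print(top_pairs)
```

The same helper could be applied to `df_macros` or any of the other nutrient groups, since it ignores non-numeric columns such as the category label.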

## Now, we move on to clustering
1. We run a separate K-means clustering for each category of nutrients

In [23]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation, DBSCAN
import scipy.cluster.hierarchy as sch
import plotly.figure_factory as ff
import plotly.express as px
import plotly.graph_objects as go

In [44]:
def plot_wcss_elbow(df):
    """Plot the within-cluster sum of squares (WCSS) for k = 1..9 clusters."""
    X = df.iloc[:,1:].values
    k_values = list(range(1, 10))

    wcss = []
    for k in k_values:
        kmeans = KMeans(n_clusters = k, init = "k-means++", max_iter = 500, n_init = 10, random_state = 123)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)

    # The x-axis uses exactly the k values that were fitted, so x and y have the same length.
    fig = go.Figure(data = go.Scatter(x = k_values, y = wcss))

    fig.update_layout(title='WCSS vs. Cluster number',
                       xaxis_title='Clusters',
                       yaxis_title='WCSS')
    fig.show()
    

def plot_clusters(df, n_clusters, X_, Y, Z):
    """Cluster the numeric columns of df with K-means and return a 3D scatter of three chosen features."""
    X = df.iloc[:,1:].values
    # Fit K-means with the requested number of clusters.
    kmeans = KMeans(n_clusters = n_clusters, init="k-means++", max_iter = 500, n_init = 10, random_state = 123)
    identified_clusters = kmeans.fit_predict(X)
    data_with_clusters = df.iloc[:,1:].copy()
    data_with_clusters['Cluster'] = identified_clusters
    fig = px.scatter_3d(data_with_clusters, x = X_, y=Y, z=Z,
                  color='Cluster', opacity = 0.8)
    return fig
    #fig.show()
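
The elbow idea behind `plot_wcss_elbow` can be seen without the plotting code. The sketch below is a minimal illustration on synthetic data, assuming three well-separated blobs; the helper name `wcss_by_k` and the toy points are assumptions and are not taken from the nutrient tables.

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss_by_k(X, k_values, random_state=123):
    """Fit K-means for each k and return {k: within-cluster sum of squares}."""
    wcss = {}
    for k in k_values:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=random_state)
        km.fit(X)
        wcss[k] = km.inertia_
    return wcss

# Synthetic data: 50 points around each of three distant centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

wcss = wcss_by_k(X_toy, range(1, 7))
for k, value in wcss.items():
    print(k, round(value, 1))
```

On data like this the WCSS drops sharply up to k = 3 and flattens afterwards, which is the "elbow" one looks for in the plots. Real nutrient data is unlikely to show such a clean bend.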
    

In [45]:
plot_wcss_elbow(df_aminos)
plot_wcss_elbow(df_fatty_acids)
plot_wcss_elbow(df_macros)
plot_wcss_elbow(df_vitamins)
plot_wcss_elbow(df_minerals)


KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=10.

In [48]:
macro_nutrients

['WWEIA Category description',
 'Protein (g)',
 'Carbohydrate (g)',
 'Sugars, total\n(g)',
 'Fiber, total dietary (g)',
 'Total Fat (g)']

In [49]:
X, Y, Z

('WWEIA Category description', 'Protein (g)', 'Fiber, total dietary (g)')

In [50]:
import random
X, Y, Z = random.sample(macro_nutrients[1:], 3)
print(X, Y, Z)
plot_clusters(df_macros, 4, X, Y, Z)

Carbohydrate (g) Sugars, total
(g) Protein (g)






In [51]:
from plotly.subplots import make_subplots
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

n_clusters = 3
# Feature matrix for the macro-nutrient group (the first column is the category label).
X = df_macros.iloc[:, 1:].values
fig = make_subplots(rows=1, cols=2,
                    print_grid=False,
                    subplot_titles=('The silhouette plot for the various clusters.',
                                    'The visualization of the clustered data.'),
                    )
# The 1st subplot is the silhouette plot.
# The silhouette coefficient can range from -1 to 1; the axis is limited
# to [-0.1, 1] here.
fig['layout']['xaxis1'].update(title='The silhouette coefficient values',
                               range=[-0.1, 1])

# The (n_clusters + 1) * 10 is for inserting blank space between silhouette
# plots of individual clusters, to demarcate them clearly.
fig['layout']['yaxis1'].update(title='Cluster label',
                               showticklabels=False,
                               range=[0, len(X) + (n_clusters + 1) * 10])

# The clusterer itself is initialized in a later cell with n_clusters and a
# random seed of 10 for reproducibility.

In [52]:
X = df_macros.iloc[:,1:].values    

In [53]:
df_macros

Unnamed: 0,WWEIA Category description,Protein (g),Carbohydrate (g),"Sugars, total\n(g)","Fiber, total dietary (g)",Total Fat (g)
0,Human milk,1.03,6.89,6.89,0.0,4.38
1,"Milk, reduced fat",3.33,4.83,4.88,0.0,2.14
2,"Milk, whole",3.27,4.63,4.81,0.0,3.20
3,"Milk, reduced fat",3.36,4.90,4.89,0.0,1.90
4,"Milk, lowfat",3.38,5.18,4.96,0.0,0.95
...,...,...,...,...,...,...
2495,Bagels and English muffins,7.89,51.58,13.84,4.0,3.98
2496,Yeast breads,14.68,47.63,7.02,8.1,4.65
2497,Yeast breads,13.36,43.34,6.39,7.4,4.23
2498,Yeast breads,12.35,46.94,12.27,7.1,3.83


In [55]:
n_clusters

3

In [56]:
cluster_labels

array([1, 1, 1, ..., 2, 2, 2])

In [54]:
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)

# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(X, cluster_labels)
print("For n_clusters =", n_clusters,
      "The average silhouette_score is :", silhouette_avg)

# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(X, cluster_labels)
y_lower = 10

for i in range(n_clusters):
    # Aggregate the silhouette scores for samples belonging to
    # cluster i, and sort them
    ith_cluster_silhouette_values = \
        sample_silhouette_values[cluster_labels == i]

    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    #colors = cm.spectral(cluster_labels.astype(float) / n_clusters)
    cmap = cm.get_cmap("nipy_spectral")
    colors = cmap(cluster_labels.astype(float) / n_clusters)    
    filled_area = go.Scatter(y=np.arange(y_lower, y_upper),
                             x=ith_cluster_silhouette_values,
                             mode='lines',
                             showlegend=False,
                             line=dict(width=0.5),
                              #color=colors),
                             fill='tozerox')
    fig.append_trace(filled_area, 1, 1)

    # Compute the new y_lower for next plot
    y_lower = y_upper + 10  # 10 for the 0 samples


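# The vertical line for the average silhouette score of all the values.
# It needs two points with the same x so that it spans the height of the plot.
axis_line = go.Scatter(x=[silhouette_avg, silhouette_avg],
                       y=[0, y_lower],
                       showlegend=False,
                       mode='lines',
                       line=dict(color="red", dash='dash',
                                 width =1) )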

fig.append_trace(axis_line, 1, 1)

# 2nd Plot showing the actual clusters formed
#colors = matplotlib.colors.colorConverter.to_rgb(cm.spectral(float(i) / n_clusters))
#colors = 'rgb'+str(colors)

clusters = go.Scatter(
            x=X[:,0],
            y=X[:,1],
            mode='markers',
            showlegend=False,
            marker=dict(
                size=12,
                color=cluster_labels,                # set color to an array/list of desired values
                colorscale='Viridis',   # choose a colorscale
                opacity=0.8
            )
        )
fig.append_trace(clusters, 1, 2)

# Labeling the clusters
centers_ = clusterer.cluster_centers_
# Draw green markers with a black outline at the cluster centers
centers = go.Scatter(x=centers_[:, 0], 
                     y=centers_[:, 1],                    
                     showlegend=False,
                     mode='markers',
                     marker=dict(color='green', size=10,
                                 line=dict(color='black',
                                                         width=1))
                    )

fig.append_trace(centers, 1, 2)

# X[:, 0] and X[:, 1] are the first two macro-nutrient columns.
fig['layout']['xaxis2'].update(title='Protein (g)',
                               zeroline=False)
fig['layout']['yaxis2'].update(title='Carbohydrate (g)',
                              zeroline=False)


fig['layout'].update(title="Silhouette analysis for KMeans clustering on macro-nutrient data "
                     "with n_clusters = %d" % n_clusters)
fig.show()





For n_clusters = 3 The average silhouette_score is : 0.33090158977817774
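
A single average silhouette score is easier to judge when it is compared across several values of k. The sketch below shows one way to do that on synthetic data; the helper name `silhouette_by_k` and the three-blob toy set are assumptions made for illustration, not results from the macro-nutrient table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(X, k_values, random_state=10):
    """Return {k: average silhouette score} for K-means fitted with each k."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

# Synthetic data: 50 points around each of three distant centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = silhouette_by_k(X_toy, range(2, 7))
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The silhouette score needs at least two clusters, so the loop starts at k = 2. Running the same helper on `X` from the macro-nutrient cell would show whether 3 is actually the best-scoring choice there.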


In [134]:
import matplotlib

In [128]:

cmap(cluster_labels.astype(float) / n_clusters)

array([[0.61960784, 0.00392157, 0.25882353, 1.        ],
       [0.61960784, 0.00392157, 0.25882353, 1.        ],
       [0.61960784, 0.00392157, 0.25882353, 1.        ],
       ...,
       [0.61960784, 0.00392157, 0.25882353, 1.        ],
       [0.61960784, 0.00392157, 0.25882353, 1.        ],
       [0.74771242, 0.89803922, 0.62745098, 1.        ]])

In [123]:
colors

array([[0.61960784, 0.00392157, 0.25882353, 1.        ],
       [0.61960784, 0.00392157, 0.25882353, 1.        ],
       [0.61960784, 0.00392157, 0.25882353, 1.        ],
       ...,
       [0.61960784, 0.00392157, 0.25882353, 1.        ],
       [0.61960784, 0.00392157, 0.25882353, 1.        ],
       [0.74771242, 0.89803922, 0.62745098, 1.        ]])


In [4]:
ing_nut_val_df_reduced = ing_nut_val_df[["Ingredient description", "Nutrient description", "Nutrient value"]]

In [5]:
ing_nut_val_df_reduced

Unnamed: 0,Ingredient description,Nutrient description,Nutrient value
0,"Butter, stick, salted",Protein,0.85
1,"Butter, stick, salted",Total Fat,82.20
2,"Butter, stick, salted",Carbohydrate,0.06
3,"Butter, stick, salted",Energy,743.00
4,"Butter, stick, salted",Alcohol,0.00
...,...,...,...
122325,Folic acid as ingredient,20:5 n-3,0.00
122326,Folic acid as ingredient,22:1,0.00
122327,Folic acid as ingredient,22:5 n-3,0.00
122328,Folic acid as ingredient,"Fatty acids, total monounsaturated",0.00


In [6]:
li = []
for grp in ing_nut_val_df_reduced.groupby("Ingredient description"):
    li.append([grp[0]]+list(grp[1]["Nutrient value"]))
    

In [95]:
cols = []
cols.append("Ingredient description")
for grp in ing_nut_val_df_reduced.groupby("Ingredient description"):
    cols.extend(list(grp[1]["Nutrient description"]))    
    break

In [98]:
ing_nut_val_df_transformed = pd.DataFrame.from_records(li, columns=cols)
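
The loop-based reshape above relies on every ingredient listing its nutrients in the same order. A pivot aligns the values by nutrient name instead. The sketch below is a possible alternative shown on a tiny made-up table; the rows are hypothetical, and only the three column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical long-format rows; note that "Flour" lists its nutrients in a different order.
long_df = pd.DataFrame({
    "Ingredient description": ["Butter", "Butter", "Flour", "Flour"],
    "Nutrient description": ["Protein", "Total Fat", "Total Fat", "Protein"],
    "Nutrient value": [0.85, 82.2, 1.0, 10.3],
})

# One row per ingredient, one column per nutrient, matched by name rather than position.
wide_df = (
    long_df.pivot_table(index="Ingredient description",
                        columns="Nutrient description",
                        values="Nutrient value",
                        aggfunc="first")
    .reset_index()
)
wide_df.columns.name = None
print(wide_df)
```

Unlike the loop, this orders the nutrient columns alphabetically, so the column order differs from `ing_nut_val_df_transformed`.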

In [100]:
ing_nut_val_df_transformed.columns

Index(['Ingredient description', 'Protein', 'Total Fat', 'Carbohydrate',
       'Energy', 'Alcohol', 'Water', 'Caffeine', 'Theobromine',
       'Sugars, total', 'Fiber, total dietary', 'Calcium', 'Iron', 'Magnesium',
       'Phosphorus', 'Potassium', 'Sodium', 'Zinc', 'Copper', 'Selenium',
       'Retinol', 'Vitamin A, RAE', 'Carotene, beta', 'Carotene, alpha',
       'Vitamin E (alpha-tocopherol)', 'Vitamin D (D2 + D3)',
       'Cryptoxanthin, beta', 'Lycopene', 'Lutein + zeaxanthin', 'Vitamin C',
       'Thiamin', 'Riboflavin', 'Niacin', 'Vitamin B-6', 'Folate, total',
       'Vitamin B-12', 'Choline, total', 'Vitamin K (phylloquinone)',
       'Folic acid', 'Folate, food', 'Folate, DFE', 'Vitamin E, added',
       'Vitamin B-12, added', 'Cholesterol', 'Fatty acids, total saturated',
       '4:0', '6:0', '8:0', '10:0', '12:0', '14:0', '16:0', '18:0', '18:1',
       '18:2', '18:3', '20:4', '22:6 n-3', '16:1', '18:4', '20:1', '20:5 n-3',
       '22:1', '22:5 n-3', 'Fatty acids, total m

In [16]:
ing_nut_val_df["Ingredient description"].value_counts()

Butter, stick, salted                                                 65
Crackers, melba toast, plain                                          65
Danish pastry, cinnamon, enriched                                     65
Croutons, seasoned                                                    65
Croissants, cheese                                                    65
                                                                      ..
Pork, cured, ham, extra lean and regular, canned, unheated            65
Pork, cured, ham, boneless, extra lean and regular, unheated          65
Pork, fresh, variety meats and by-products, feet, cooked, simmered    65
Pork, cured, salt pork, raw                                           65
Folic acid as ingredient                                              65
Name: Ingredient description, Length: 1882, dtype: int64

In [8]:
nut_val_df["WWEIA Category description"].value_counts()

Meat mixed dishes                                   263
Pasta mixed dishes, excludes macaroni and cheese    175
Chicken, whole pieces                               161
Other vegetables and combinations                   154
Eggs and omelets                                    147
                                                   ... 
Enhanced water                                        2
Bottled water                                         1
Baby water                                            1
Grapes                                                1
Human milk                                            1
Name: WWEIA Category description, Length: 169, dtype: int64

In [10]:
df_macros = nut_val_df[macro_nutrients]

In [11]:
df_fatty_acids = nut_val_df[fatty_acids]

In [12]:
df_fatty_acids

Unnamed: 0,"Fatty acids, total saturated (g)","Fatty acids, total monounsaturated (g)","Fatty acids, total polyunsaturated (g)",Cholesterol (mg),Retinol (mcg)
0,2.009,1.658,0.497,14,60
1,1.249,0.458,0.070,9,57
2,1.860,0.688,0.108,12,31
3,1.110,0.400,0.058,8,83
4,0.568,0.210,0.032,5,58
...,...,...,...,...,...
5619,0.038,0.035,0.094,0,0
5620,0.061,0.002,0.175,0,0
5621,0.051,0.017,0.064,0,0
5622,3.544,5.321,13.522,13,4


In [13]:
df_macros

Unnamed: 0,Protein (g),Carbohydrate (g),"Sugars, total\n(g)","Fiber, total dietary (g)",Total Fat (g)
0,1.03,6.89,6.89,0.0,4.38
1,3.33,4.83,4.88,0.0,2.14
2,3.27,4.63,4.81,0.0,3.20
3,3.36,4.90,4.89,0.0,1.90
4,3.38,5.18,4.96,0.0,0.95
...,...,...,...,...,...
5619,1.11,5.48,3.42,1.6,0.23
5620,3.25,5.74,2.73,1.4,0.39
5621,1.81,11.60,3.25,2.2,0.19
5622,1.34,17.14,13.08,0.6,22.85


In [69]:
import plotly.express as px
import plotly.graph_objects as go


In [51]:
macro_nutrients

['Protein (g)',
 'Carbohydrate (g)',
 'Sugars, total\n(g)',
 'Fiber, total dietary (g)',
 'Total Fat (g)']

In [66]:
data_with_clusters

Unnamed: 0,"Fatty acids, total saturated (g)","Fatty acids, total monounsaturated (g)","Fatty acids, total polyunsaturated (g)",Cholesterol (mg),Retinol (mcg),Cluster
0,2.009,1.658,0.497,14,60,0
1,1.249,0.458,0.070,9,57,0
2,1.860,0.688,0.108,12,31,0
3,1.110,0.400,0.058,8,83,0
4,0.568,0.210,0.032,5,58,0
...,...,...,...,...,...,...
5619,0.038,0.035,0.094,0,0,0
5620,0.061,0.002,0.175,0,0,0
5621,0.051,0.017,0.064,0,0,0
5622,3.544,5.321,13.522,13,4,0


In [64]:
fig = px.scatter_3d(data_with_clusters, x = 'Fatty acids, total saturated (g)', y='Fatty acids, total monounsaturated (g)', z='Fatty acids, total polyunsaturated (g)',
              color='Cluster', opacity = 0.8)
fig.show()

In [32]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords

In [52]:
df = pd.read_csv("Nutrient_Vals/nut_val_df.csv")
# Requires the NLTK stop-word corpus; run nltk.download('stopwords') once if it is missing.
stop = stopwords.words('english')

In [53]:
def remove_long_desc(x):
    """Return NaN for descriptions longer than six words so they can be dropped later."""
    if len(x.split(" ")) > 6:
        return np.nan
    return x

df["Main food description"] = df["Main food description"].apply(remove_long_desc)

In [55]:
df.dropna(subset=["Main food description"], inplace = True)

In [56]:
df["Main food description"].value_counts()

Milk, human                            1
Noodle soup vegetables, Asian style    1
Lemon, raw                             1
Kumquat, raw                           1
Grapefruit, canned                     1
                                      ..
Sambar, vegetable stew                 1
Lentil curry                           1
Lentil curry rice                      1
Soy nuts                               1
Industrial oil ingredient food         1
Name: Main food description, Length: 4449, dtype: int64

In [57]:
df["Main food description"] = df["Main food description"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [58]:
df["Main food description"].value_counts()

Milk, human                            1
Noodle soup vegetables, Asian style    1
Lemon, raw                             1
Kumquat, raw                           1
Grapefruit, canned                     1
                                      ..
Sambar, vegetable stew                 1
Lentil curry                           1
Lentil curry rice                      1
Soy nuts                               1
Industrial oil ingredient food         1
Name: Main food description, Length: 4449, dtype: int64
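
The two cleaning steps above (dropping long descriptions, then removing stop words) can be checked on a few strings before running them on the full table. The sketch below is a self-contained illustration: the small `STOP_WORDS` set stands in for the NLTK list, and the helper name `clean_description` and the sample descriptions are made up.

```python
import numpy as np
import pandas as pd

# Assumption: a tiny hand-written stop-word set used in place of nltk's English list.
STOP_WORDS = {"a", "and", "as", "for", "in", "of", "or", "the", "with"}

def clean_description(text, max_words=6, stop_words=STOP_WORDS):
    """Return NaN for long descriptions, otherwise the text without stop words."""
    words = text.split()
    if len(words) > max_words:
        return np.nan
    return " ".join(w for w in words if w.lower() not in stop_words)

# Hypothetical descriptions: short, short with stop words, and too long.
toy = pd.Series([
    "Milk, human",
    "Rice with beans and cheese",
    "Pasta with tomato sauce and meat, home recipe, canned",
])
cleaned = toy.apply(clean_description).dropna()
print(cleaned.tolist())
```

This version lowercases each word before the lookup, so capitalised stop words are removed as well, which differs slightly from the case-sensitive check in the cell above.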

In [61]:
df_n = df.drop_duplicates(subset=["Main food description"], keep="last")

In [62]:
df_n.shape

(4449, 70)

In [72]:
d = pd.read_csv("ARM/ItemList.csv",header=None)

In [73]:
d

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,sausage,whole milk,semi-finished bread,yogurt,,,,,,,
1,whole milk,pastry,salty snack,,,,,,,,
2,canned beer,misc. beverages,,,,,,,,,
3,sausage,hygiene articles,,,,,,,,,
4,soda,pickled vegetables,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
14958,tropical fruit,berries,other vegetables,yogurt,kitchen towels,napkins,,,,,
14959,bottled water,herbs,,,,,,,,,
14960,fruit/vegetable juice,onions,,,,,,,,,
14961,soda,root vegetables,semi-finished bread,,,,,,,,


In [63]:
df_n.to_csv("Nutrient_Vals/nut_val_df_.csv")