In [35]:
import pandas as pd
import numpy as np
import plotly.express as px
from pathlib import Path

We load the dataset and use Plotly for plotting, since interactive scatter plots make it easy to identify outliers.

In [36]:
df_path = Path("complete_datasets", "2024-07-07.csv")

df = pd.read_csv(df_path)

# Remove rows with a missing product name; this is simply missing data
print(f'{df["Product"].isna().sum()} rows with NA removed')
# dropna on the single column would not remove rows from df, so drop on the frame itself
df = df.dropna(subset=["Product"])

240 rows with NA removed


Perform linear regression to estimate kcal from carbohydrates, proteins and fats

In [37]:
from sklearn.linear_model import LinearRegression

regression_df = df[["Product", "Category_1", "Carbohydrates", "Protein", "Fats", "Energy (kcal)"]].dropna()

X = regression_df[["Carbohydrates", "Protein", "Fats"]]
y = regression_df["Energy (kcal)"]

model = LinearRegression()
model.fit(X, y)

print(f"Model coefficients: {model.coef_}\nModel intercept: {model.intercept_}\nR-squared: {model.score(X, y)}")

Model coefficients: [3.93722744 4.06686215 8.99387318]
Model intercept: 6.810265089095111
R-squared: 0.9783291184924829
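
The fitted coefficients sit close to the widely used Atwater general factors of 4 kcal/g for carbohydrates, 4 kcal/g for protein and 9 kcal/g for fat, which suggests the regression is recovering the usual label arithmetic. A minimal sketch of that comparison, with the coefficients copied from the output above; the `atwater` array and the example macronutrient values are assumptions for illustration and are not taken from the dataset:

```python
import numpy as np

# Coefficients and intercept copied from the regression output above
coef = np.array([3.93722744, 4.06686215, 8.99387318])  # carbs, protein, fat
intercept = 6.810265089095111

# Atwater general factors in kcal per gram (assumption: not part of the notebook)
atwater = np.array([4.0, 4.0, 9.0])

# Relative deviation of each fitted coefficient from its Atwater factor
rel_dev = (coef - atwater) / atwater
for name, dev in zip(["Carbohydrates", "Protein", "Fats"], rel_dev):
    print(f"{name}: {dev:+.2%}")

# Hypothetical product with 50 g carbs, 10 g protein and 20 g fat per 100 g
macros = np.array([50.0, 10.0, 20.0])
kcal_regression = float(macros @ coef + intercept)
kcal_atwater = float(macros @ atwater)
print(f"Regression: {kcal_regression:.1f} kcal, Atwater: {kcal_atwater:.1f} kcal")
```

Large residuals are therefore more likely to point at unusual products or label errors than at a poorly fitted model.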


To visualize the residuals of the model, we create an interactive scatterplot

In [38]:
carb_coef, protein_coef, fat_coef = model.coef_

# Compute kcal estimates from the regression coefficients (equivalent to model.predict(X))
regression_df["kcal_estimation"] = (regression_df["Carbohydrates"] * carb_coef +
                                    regression_df["Protein"] * protein_coef +
                                    regression_df["Fats"] * fat_coef +
                                    model.intercept_)

# Plot the results
fig = px.scatter(x=regression_df["kcal_estimation"] - regression_df["Energy (kcal)"],
                 y=regression_df["Energy (kcal)"],
                 hover_name=regression_df["Product"],
                 color=regression_df["Category_1"],
                 labels={
                     "x": "Residuals (Estimated - True)",
                     "y": "True Calories (kcal)"
                 })

fig.update_layout(xaxis_range=[-500, 500],
                  yaxis_range=[0, 1000],
                  legend_title_text="Category")
fig.show()

Note that we imposed axis limits on this graph, so the most extreme points are not visible. Let's take a quick look at the items with absolute residuals over 200 kcal.

In [56]:
regression_df["abs_residuals"] = abs(regression_df["kcal_estimation"] - regression_df["Energy (kcal)"])

print(regression_df["Product"][regression_df["abs_residuals"] > 200])

3459             Conimex Woksaus knoflook koriander
4596          AH Salmiakdrop zout & hard suikervrij
5411                    Nutrilon A.R. 1 0-6 maanden
5586                        Canderel Green klontjes
7660                          Buisman Classic aroma
13164                  Marqués de Requena Brut doos
13347                      Canderel Stevia zoetstof
14639                       Marqués de Requena Brut
17173    Flower Farm Hazelnootpasta zonder palmolie
18237                           Rio Stevia zoetstof
18238                        Rio Sucralose zoetstof
19164             Ghaia Surinaamse roti gele erwten
21532                      Sella & Mosca Vermentino
22400                  Knorr Spaghetteria formaggio
22403                                   So fab Rose
22540       Pure Via Alternatief voor kristalsuiker
23203          Garden Gourmet Sensational chipolata
23204            Garden Gourmet Sensational merguez
Name: Product, dtype: object


What is interesting about several of these products is that our estimate is not off: Albert Heijn simply made an error and swapped the kJ and kcal values. The most extreme example is shown below:

<div style="text-align: center;">
    <img src="images/kcal_error.png" alt="Albert Heijn Hazelnut Paste" style="width:500px;" />
</div>


Here are the energy details for four of these products on the Albert Heijn website as of October 13th, 2024:

* Flower Farm Hazelnootpasta zonder palmolie (547 kJ, 2288 kcal)
* Ghaia Surinaamse roti gele erwten (228 kJ, 963 kcal)
* Garden Gourmet Sensational chipolata (212 kJ, 877 kcal)
* Garden Gourmet Sensational merguez (210 kJ, 869 kcal)

Some more products with this error:
* Marqués de Requena Brut (& brut doos) (75 kJ, 314 kcal)
* Sella & Mosca Vermentino (70 kJ, 293 kcal)
* Knorr Spaghetteria formaggio (104 kJ, 435 kcal)
* So fab Rose (75 kJ, 312 kcal)

In all these cases, the listed kcal value exceeds the kJ value, which is impossible because 1 kJ ≈ 0.239 kcal.
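
This check can also be automated. Since 1 kcal = 4.184 kJ, a swapped label shows up as a kcal/kJ ratio close to 4.184 instead of roughly 0.239. A minimal sketch using four of the label values listed above plus one made-up control row; the column names `kJ` and `kcal`, the tolerance and the helper `flag_swapped_units` are hypothetical, and the dataset used in this notebook may not expose a kJ column at all:

```python
import pandas as pd

KJ_PER_KCAL = 4.184  # 1 kcal expressed in kJ

labels = pd.DataFrame({
    "Product": [
        "Flower Farm Hazelnootpasta zonder palmolie",
        "Ghaia Surinaamse roti gele erwten",
        "Garden Gourmet Sensational chipolata",
        "Garden Gourmet Sensational merguez",
        "Hypothetical correct label",  # made-up control row, not from the dataset
    ],
    "kJ": [547, 228, 212, 210, 2288],
    "kcal": [2288, 963, 877, 869, 547],
})

def flag_swapped_units(frame, rel_tol=0.05):
    """Mark rows whose kcal/kJ ratio is within rel_tol of 4.184 (units swapped)."""
    ratio = frame["kcal"] / frame["kJ"]
    return (ratio - KJ_PER_KCAL).abs() <= rel_tol * KJ_PER_KCAL

swapped = flag_swapped_units(labels)

# Swap the two columns back, for flagged rows only
fixed = labels.copy()
fixed.loc[swapped, ["kJ", "kcal"]] = labels.loc[swapped, ["kcal", "kJ"]].to_numpy()
print(fixed)
```

The relative tolerance absorbs rounding on the label; it is a guess and would need tuning against real data.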

We also notice that many sugar-free products have high carbohydrate counts, but these carbohydrates are indigestible, so we should attempt to identify these products.

In [107]:
pd.set_option('display.max_rows', 500)
# Products with more than 50 g of carbohydrates, sorted by largest absolute residual first
high_carb_products = (
    regression_df.loc[regression_df["Carbohydrates"] > 50, ["Product", "abs_residuals"]]
    .sort_values("abs_residuals", ascending=False)
)

fig = px.line(
    x=range(0, len(high_carb_products)),
    y=high_carb_products["abs_residuals"],
    title="Sorted absolute residual values",
    labels={
        "x": "Product #",
        "y": "Absolute residual value"
    },
    width = 600,
    height = 400
)

fig.show()

The line chart shows a sudden drop in residual values. Here is a zoomed-in view:

In [112]:
fig = px.line(
    x=range(0, len(high_carb_products)),
    y=high_carb_products["abs_residuals"],
    title="Sorted absolute residual values",
    labels={
        "x": "Product #",
        "y": "Absolute residual value"
    },
    width = 600,
    height = 400
)

fig.update_layout(xaxis_range=[0, 115], yaxis_range=[0, 500])
fig.show()

Let's set a cut-off at product #111.
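
Reading the cut-off from the chart works, but the position of the drop can also be located programmatically as the largest gap between consecutive sorted residuals. A minimal sketch on made-up toy values (not the notebook's data); `largest_drop_index` is a hypothetical helper, and on real data a noisy tail may call for smoothing first:

```python
import numpy as np

def largest_drop_index(sorted_desc):
    """Return how many leading items sit before the largest drop.

    `sorted_desc` must already be sorted in descending order.
    """
    values = np.asarray(sorted_desc, dtype=float)
    if values.size < 2:
        raise ValueError("need at least two values")
    drops = values[:-1] - values[1:]  # non-negative for descending input
    return int(np.argmax(drops)) + 1

# Toy residuals with an obvious gap after the fourth value
toy_residuals = [310.0, 295.0, 280.0, 262.0, 40.0, 35.0, 31.0, 28.0]
cutoff = largest_drop_index(toy_residuals)
head = toy_residuals[:cutoff]
print(cutoff, head)
```

The returned count plays the same role as the hand-picked 111 in the slice `high_carb_products[:111]`.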

We will create some keywords to check if these "high carb" products are indeed gum and sugar free candy

In [139]:
high_carb_products = high_carb_products[:111]


keywords = ["gum", "kauwgom", "suikervrij", "sugarfree", "sugar-free", "zoetstof", "mint", "menthol"]  # Add your desired keywords here

no_keywords = ~high_carb_products["Product"].str.contains('|'.join(keywords), case=False, na=False)

print(high_carb_products["Product"][no_keywords])

17173    Flower Farm Hazelnootpasta zonder palmolie
22540       Pure Via Alternatief voor kristalsuiker
5586                        Canderel Green klontjes
7660                          Buisman Classic aroma
4636                            Abdij Broodmix mais
20303                     Vicks Ademvrij eucalyptus
16851                         Food2Smile Very berry
15903                          Stimorol Wild cherry
15899           Stimorol Max splash strawberry lime
15902                             Stimorol Original
Name: Product, dtype: object


This list still contains some sugar-free mints and a sugar alternative, but it makes the exceptions easier to identify:

* Buisman Classic aroma is a glucose syrup and molasses mixture for coffee, which contains pure carbohydrate
* Abdij Broodmix mais is a mixture of flours, dietary fibers and starches for bread making

Our regression error in these cases likely comes from the fact that dietary fiber is not taken into account for nutrition and 