In [1]:
import pandas as pd
import numpy as np

In [2]:
data_folder = 'data/'
data = pd.read_csv(data_folder + 'en.openfoodfacts.org.products.csv', sep='\t')



# Data Preprocessing

## Chosen Fields

(at least some text that explains why we dropped the other fields)

In [3]:
chosen_fields = ['product_name', 'packaging_tags', 'brands_tags',
                 'origins_tags', 'manufacturing_places_tags', 'labels_en', 'stores', 'countries_en',
                 'additives_n', 'ingredients_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil_n', 
                 'nutrition_grade_fr', 'pnns_groups_1', 'fruits-vegetables-nuts_100g',
                 'main_category_en', 'energy_100g', 'energy-from-fat_100g', 'fat_100g', 
                 'saturated-fat_100g', 'monounsaturated-fat_100g', 'polyunsaturated-fat_100g', 
                 'omega-3-fat_100g', 'omega-6-fat_100g', 'omega-9-fat_100g', 'trans-fat_100g', 
                 'cholesterol_100g', 'carbohydrates_100g', 'sugars_100g', 'proteins_100g', 'sodium_100g', 
                 'nutrition-score-fr_100g', 'nutrition-score-uk_100g']

data = data[chosen_fields]
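
Since only a subset of the columns is kept, the selection could also be made while the file is read. The sketch below is an illustration under stated assumptions, not the approach used in this notebook: it relies on the `usecols` and `low_memory` parameters of `pd.read_csv`, and both the in-memory sample and the shortened `wanted_fields` list are made up for the example.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for the real tab-separated export.
sample_tsv = (
    "code\tproduct_name\tcreator\tenergy_100g\n"
    "1\tCrackers\tsomeone\t1500\n"
    "2\tCacao\tsomeone\t\n"
)

# Hypothetical, shortened version of the field list used in this notebook.
wanted_fields = ['product_name', 'energy_100g']

# usecols restricts parsing to the listed columns; low_memory=False makes
# pandas infer each column's dtype from the whole file rather than per chunk.
subset = pd.read_csv(io.StringIO(sample_tsv), sep='\t',
                     usecols=wanted_fields, low_memory=False)
print(subset.columns.tolist())
```

On the full file, passing `usecols=chosen_fields` in the same way should reduce both loading time and memory use, since the unused columns are never parsed.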

## Field Cleaning

The 'pnns_groups_1' field is particularly useful in our study: 
it sorts the food entries into clear categories, as seen below.

In [4]:
data['pnns_groups_1'].value_counts()

unknown                    122802
Sugary snacks               29975
Milk and dairy products     17932
Composite foods             14659
Cereals and potatoes        14575
Fish Meat Eggs              13970
Beverages                   12410
Fat and sauces              11400
Fruits and vegetables       11086
Salty snacks                 5591
fruits-and-vegetables        1537
sugary-snacks                1450
cereals-and-potatoes           25
salty-snacks                    3
Name: pnns_groups_1, dtype: int64
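
Several labels in these counts differ only in case and hyphenation (for instance `Sugary snacks` and `sugary-snacks`). As a rough sketch of how such variants could be spotted programmatically, the example below normalises a handful of labels copied from the output above and groups the original spellings by their normalised form. The names `labels`, `normalised`, `variants` and `duplicates` are illustrative only.

```python
import pandas as pd

# A few labels copied from the counts above.
labels = pd.Series(['Sugary snacks', 'sugary-snacks',
                    'Fruits and vegetables', 'fruits-and-vegetables',
                    'Beverages'])

# Normalise case and hyphenation so that variants of one label coincide.
normalised = labels.str.lower().str.replace('-', ' ', regex=False)

# Collect the original spellings behind each normalised form and keep
# only the forms that have more than one spelling.
variants = labels.groupby(normalised).agg(list)
duplicates = variants[variants.map(len) > 1]
print(duplicates)
```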

A simple map groups these labels into broader categories and merges the duplicated entries (for example 'Sugary snacks' and 'sugary-snacks'):

In [6]:
my_map = {'unknown' : 'Unknown',
 'Sugary snacks' : 'Snacks', 
 'Milk and dairy products' : 'Fish Meat Eggs Dairy',
 'Composite foods' : 'Composite', 
 'Cereals and potatoes' : 'Starchy', 
 'Fish Meat Eggs' : 'Fish Meat Eggs Dairy',
 'Beverages' : 'Beverages',
 'Fat and sauces' : 'Fat Sauces',
 'Fruits and vegetables' : 'Fruits Vegetables',
 'Salty snacks' : 'Snacks',
 'fruits-and-vegetables' : 'Fruits Vegetables',
 'sugary-snacks' : 'Snacks',
 'cereals-and-potatoes' : 'Starchy',
 'salty-snacks' : 'Snacks'
}

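# Assign the result back instead of using inplace=True on a column selection,
# which may operate on a temporary copy
data['pnns_groups_1'] = data['pnns_groups_1'].replace(my_map)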
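
Labels that are absent from the map pass through `replace` unchanged, so it can be worth checking that nothing was missed. The following is a minimal sketch on a toy column, assuming a shortened map; `toy`, `toy_map` and the label `'Brand new group'` are made up for illustration.

```python
import pandas as pd

# Toy column with a missing value and a label the map does not cover.
toy = pd.Series(['unknown', 'Sugary snacks', 'salty-snacks', None, 'Brand new group'])

# Shortened, illustrative version of the map above.
toy_map = {'unknown': 'Unknown', 'Sugary snacks': 'Snacks', 'salty-snacks': 'Snacks'}

cleaned = toy.replace(toy_map)

# Any non-missing value that is not one of the target names was not
# covered by the map.
present = cleaned.dropna()
unmapped = present[~present.isin(list(toy_map.values()))]
print(unmapped.tolist())
```

Missing values are left as they are by `replace`, which is why they are dropped before the check.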

Now the categories are clear:

In [7]:
data['pnns_groups_1'].value_counts()

Unknown                 122802
Snacks                   37019
Fish Meat Eggs Dairy     31902
Composite                14659
Starchy                  14600
Fruits Vegetables        12623
Beverages                12410
Fat Sauces               11400
Name: pnns_groups_1, dtype: int64

From the 'main_category_en' field, we can recover more entries for the previous categories, as well as a new meaningful category, 'Plant-based foods and beverages', which we will rename 'Veggie' and add as a new binary field in the data set.

In [8]:
data['main_category_en'].value_counts().head(20)

Plant-based foods and beverages    38490
Beverages                          26082
Sugary snacks                      25179
Dairies                            16129
Meats                               9783
Groceries                           9674
Meals                               8338
Spreads                             4624
Frozen foods                        3152
Fruit juices                        3108
Desserts                            3076
Salty snacks                        3005
Seafood                             2919
Canned foods                        2766
Fats                                1878
Baby foods                          1036
Sweeteners                           944
Sandwiches                           905
Farming products                     796
Fish and meat and eggs               740
Name: main_category_en, dtype: int64

First, the useful categories of `pnns_groups_1` and `main_category_en` are merged.
This merged categorical field is called `Category`.

In [9]:
my_map_2 = {
    'Beverages' : 'Beverages',
    'Sugary snacks' : 'Snacks',
    'Dairies' : 'Fish Meat Eggs Dairy',
    'Meats' : 'Fish Meat Eggs Dairy',
    'Meals' : 'Composite',
    'Fruit juices' : 'Beverages',
    'Salty snacks' : 'Snacks',
    'Fats' : 'Fat Sauces',
    'Fish and meat and eggs' : 'Fish Meat Eggs Dairy'
}

In [10]:
not_in_pnns = data[data['pnns_groups_1'].isna()]
# A plain list is the safest type to reference with @keys inside query()
keys = list(my_map_2.keys())
not_in_pnns.query('main_category_en in @keys')['main_category_en'].value_counts()

Beverages                 5842
Meats                       68
Meals                       16
Fish and meat and eggs      12
Name: main_category_en, dtype: int64

We see that the useful entries that are not already present in `pnns_groups_1` come mainly from the `Beverages` category. Nonetheless, we build the new `Category` field as described above: it takes the values of `pnns_groups_1` and, where those are missing, the mapped values of `main_category_en`.

In [11]:
# First add the values from 'pnns_groups_1'
data['Category'] = data['pnns_groups_1']
# Adds the values in 'main_category_en' that are not in 'pnns_groups_1' after applying the map
new_vals = not_in_pnns.query('main_category_en in @keys')['main_category_en'].replace(my_map_2)
data.loc[new_vals.index, 'Category'] = new_vals
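
The same merge can also be written with `Series.map` and `Series.fillna`. This is an alternative formulation shown on toy data, not the code used in this notebook; `toy`, `toy_map` and `fallback` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'pnns_groups_1': ['Snacks', None, None, None],
    'main_category_en': ['Sugary snacks', 'Fruit juices', 'Groceries', None],
})

# Shortened, illustrative version of the second map.
toy_map = {'Fruit juices': 'Beverages', 'Meats': 'Fish Meat Eggs Dairy'}

# Series.map returns NaN for values absent from the dict, so unmapped
# main categories do not leak into the merged field.
fallback = toy['main_category_en'].map(toy_map)

# fillna only touches rows where 'pnns_groups_1' is missing.
toy['Category'] = toy['pnns_groups_1'].fillna(fallback)
print(toy['Category'].tolist())
```

Rows whose main category is not in the map, such as `Groceries` here, keep a missing `Category`, which matches the behaviour of the `query`-based version.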

Here is the new 'Category' field.

In [15]:
data['Category'].value_counts()

Unknown                 122802
Snacks                   37019
Fish Meat Eggs Dairy     31982
Beverages                18252
Composite                14675
Starchy                  14600
Fruits Vegetables        12623
Fat Sauces               11400
Name: Category, dtype: int64

Now the `Plant-based foods and beverages` category is used to create a new field called `Veggie`.

In [27]:
data['Veggie'] = data['main_category_en'] == 'Plant-based foods and beverages'

The plant-based groups of `Category` (`Fruits Vegetables` and `Starchy`) are also added:

In [25]:
data['Veggie'] = np.logical_or(data['Veggie'], data['Category'] == 'Fruits Vegetables')
data['Veggie'] = np.logical_or(data['Veggie'], data['Category'] == 'Starchy')

In [28]:
data['Veggie'].value_counts()

False    659334
True      38490
Name: Veggie, dtype: int64

Note that a `False` value does not mean that an entry is not veggie; it only means that we do not know whether it is.
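
Because the flag is built in two cells, its content depends on the order in which those cells are run. One possible way to avoid this, sketched here on made-up toy data, is to combine both conditions in a single expression; `toy`, `plant_based` and `veggie_groups` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'main_category_en': ['Plant-based foods and beverages', 'Meats', None, 'Groceries'],
    'Category': [None, 'Fish Meat Eggs Dairy', 'Starchy', 'Fruits Vegetables'],
})

# Condition 1: the main category is the plant-based one.
plant_based = toy['main_category_en'] == 'Plant-based foods and beverages'

# Condition 2: the merged category is one of the plant-based groups.
veggie_groups = toy['Category'].isin(['Fruits Vegetables', 'Starchy'])

# A row is flagged as soon as either condition holds; missing values in
# either column simply evaluate to False.
toy['Veggie'] = plant_based | veggie_groups
print(toy['Veggie'].tolist())
```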

In [30]:
data.head(10)

Unnamed: 0,product_name,packaging_tags,brands_tags,origins_tags,manufacturing_places_tags,labels_en,stores,countries_en,additives_n,ingredients_from_palm_oil_n,...,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,proteins_100g,sodium_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,Category,Veggie
0,Vitória crackers,,,,,,,France,,,...,,,70.1,15.0,7.8,0.551181,,,,False
1,Cacao,,,,,,,France,,,...,,,,,,,,,,False
2,Sauce Sweety chili 0%,,,,,,,France,,,...,,,4.8,0.4,0.2,0.803150,,,,False
3,Mini coco,,,,,,,France,,,...,,,10.0,3.0,2.0,0.452756,,,,False
4,Mendiants,,,,,,,France,,,...,,,,,,,,,,False
5,Salade de carottes râpées,,,,,,,France,,,...,,,5.3,3.9,0.9,0.165354,,,,False
6,Fromage blanc aux myrtilles,,,,,,,France,,,...,,,16.3,16.3,4.4,0.098425,,,,False
7,,,,,,,,France,,,...,,,,,,,,,,False
8,Vainilla,,,,,,,France,,,...,,,,,,,,,,False
9,Baguette parisien,,,,,,,France,,,...,,,38.4,1.8,11.7,0.266929,,,,False
