# Open Food Facts Data Exploration

In [1]:
import pandas as pd
import numpy as np

#### Data Loading

In [2]:
data_folder = 'data/'
data = pd.read_csv(data_folder + 'en.openfoodfacts.org.products.csv', sep='\t')



We can see the size of the data is not a problem; we can load it easily using pandas.

### Data Preprocessing

#### Choosing The Fields
There are 174 different fields in this OpenFoodFacts dataset. After exploring each field and its value counts, we decided to keep the 32 fields listed below (for now). We dropped each of the other fields for one of the following reasons:
- it was too poorly represented (fewer than 50 occurrences)
- it was too specific (micronutrients)
- some other field conveyed the same information in a clearer format
- it is of no interest for this study (barcodes, image URLs)
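
The occurrence-count criterion can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the exact procedure we followed: the names `toy`, `min_occurrences` and `sparse_fields` are hypothetical, and the threshold of 2 merely stands in for the 50 used on the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    'product_name': ['a', 'b', 'c', 'd'],
    'energy_100g': [100.0, 250.0, np.nan, 80.0],
    'rare_field_100g': [np.nan, np.nan, np.nan, 1.0],
})

min_occurrences = 2  # stands in for the threshold of 50 used on the real data

# DataFrame.count() returns the number of non-NaN values per column.
occurrences = toy.count()

# Fields with too few non-NaN values are candidates for removal.
sparse_fields = occurrences[occurrences < min_occurrences].index.tolist()
kept = toy.drop(columns=sparse_fields)
```

The other criteria (too specific, redundant, of no interest) are judgment calls and cannot be automated this way.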

In [292]:
chosen_fields = ['product_name', 'packaging_tags', 'brands_tags',
                 'origins_tags', 'manufacturing_places_tags', 'labels_en', 'stores', 'countries_en',
                 'additives_n', 'ingredients_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil_n', 
                 'nutrition_grade_fr', 'pnns_groups_1', 'fruits-vegetables-nuts_100g',
                 'main_category_en', 'energy_100g', 'energy-from-fat_100g', 'fat_100g', 
                 'saturated-fat_100g', 'monounsaturated-fat_100g', 'polyunsaturated-fat_100g', 
                 'omega-3-fat_100g', 'omega-6-fat_100g', 'omega-9-fat_100g', 'trans-fat_100g', 
                 'cholesterol_100g', 'carbohydrates_100g', 'sugars_100g', 'proteins_100g', 'sodium_100g', 
                 'nutrition-score-fr_100g', 'nutrition-score-uk_100g']

In [196]:
chosen_data = data[chosen_fields]

#### Data Cleaning

We want to clean (i.e. group similar values together) the packaging_tags, brands_tags, origins_tags, manufacturing_places_tags, labels_en, stores, countries_en and pnns_groups_1 fields. The fields ending with "\_100g" and "\_n" are numeric (floats), so they do not need this kind of cleaning.
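
As a rough sketch of what grouping similar values could look like, assuming that most variants differ only in case and surrounding whitespace: the sample values, the `canonical` mapping and the `normalize` helper below are all hypothetical, and the real fields will likely need a larger, manually built mapping.

```python
import pandas as pd

# Hypothetical raw values; the real fields hold many more variants.
stores = pd.Series(['Carrefour', 'carrefour ', 'CARREFOUR', 'Migros', None])

# Hypothetical mapping from known variants to a canonical label,
# to be filled in manually after inspecting value_counts().
canonical = {'carrefour market': 'carrefour'}

def normalize(value):
    """Lowercase, trim whitespace, then apply the manual mapping."""
    if pd.isna(value):
        return value  # keep missing values as they are
    cleaned = str(value).strip().lower()
    return canonical.get(cleaned, cleaned)

cleaned_stores = stores.map(normalize)

# value_counts() ignores missing values by default.
counts = cleaned_stores.value_counts()
```

Comparing `value_counts()` before and after normalization shows how many distinct labels were merged.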

### Plan for Analysis and Communication

In [333]:
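data_size = data.shape[0]
def calculate_apparition_pct(field):
    # `field` is a column name or a list of column names; with a list,
    # a row is counted only if none of the listed columns is NaN.
    print('The apparition percentage is %.3f%%' % (100*data[field].dropna().shape[0]/data_size))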

Now that we have a cleaned dataset, we can start working on the analysis and communication. 
The first step will be to calculate the [NutriScore](http://fr.openfoodfacts.org/score-nutritionnel-experimental-france) where it is missing.

In [334]:
calculate_apparition_pct('nutrition_grade_fr')

The apparition percentage is 20.263%


We have fields for all the parameters needed by the NutriScore formula (energy, saturated fats, sugars, proteins, sodium, fiber and the "fruits, vegetables and nuts" percentage, all per 100g).
Unfortunately, as the cells below show, the percentage of products for which we could derive the NutriScore drops from 77.3% to 38.3% once we also require the fiber field.
This may or may not be a problem, depending on whether a NaN means that the product contains no fiber (i.e. NaN corresponds to 0) or that the value is simply missing.
We will therefore have to investigate.
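
To make the effect concrete, here is a small sketch on made-up data showing how requiring one more field reduces the share of complete rows, and how that share recovers if missing fiber values are treated as 0. The `toy` frame and the `coverage_pct` helper are hypothetical, and the NaN-means-0 assumption is exactly what remains to be verified.

```python
import numpy as np
import pandas as pd

# Made-up values, only meant to illustrate the mechanism.
toy = pd.DataFrame({
    'energy_100g': [100.0, 250.0, 80.0, np.nan],
    'sugars_100g': [5.0, 12.0, 1.0, 3.0],
    'fiber_100g': [2.0, np.nan, np.nan, 1.0],
})

def coverage_pct(frame, fields):
    """Percentage of rows where every column in `fields` is non-NaN."""
    return 100 * frame[fields].dropna().shape[0] / frame.shape[0]

without_fiber = coverage_pct(toy, ['energy_100g', 'sugars_100g'])
with_fiber = coverage_pct(toy, ['energy_100g', 'sugars_100g', 'fiber_100g'])

# Unverified assumption: a missing fiber value means 0 g of fiber.
filled = toy.fillna({'fiber_100g': 0.0})
with_fiber_filled = coverage_pct(
    filled, ['energy_100g', 'sugars_100g', 'fiber_100g'])
```

On this toy frame the coverage goes from 75% to 25% when fiber is required, and back to 75% under the fill assumption.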

A bigger problem, however, is the very low apparition percentage of the fruits-vegetables-nuts_100g field (0.494%, see below). Again, the NaNs may correspond to 0s. In any case, we will need to investigate.
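
One possible starting point for this investigation, offered as a heuristic rather than a proof: if contributors frequently enter an explicit 0, then a NaN more plausibly means "not reported" than "none". The sketch below uses made-up values, and `explicit_zero_pct` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Made-up values, only meant to illustrate the check.
toy = pd.DataFrame({
    'fruits-vegetables-nuts_100g': [50.0, 0.0, np.nan, np.nan,
                                    0.0, 12.0, np.nan, np.nan],
})

def explicit_zero_pct(series):
    """Percentage of the non-missing values that are exactly 0."""
    present = series.dropna()
    if present.empty:
        return float('nan')  # nothing to measure
    return 100 * (present == 0).sum() / present.shape[0]

zero_pct = explicit_zero_pct(toy['fruits-vegetables-nuts_100g'])
```

A high share of explicit zeros would argue against filling NaNs with 0, while a share close to zero would leave the question open.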

In [335]:
calculate_apparition_pct(['energy_100g', 'saturated-fat_100g', 'sugars_100g', 'proteins_100g', 'sodium_100g'])

The apparition percentage is 77.319%


In [336]:
calculate_apparition_pct(['energy_100g', 'saturated-fat_100g', 'sugars_100g', 'proteins_100g', 'sodium_100g', 'fiber_100g'])

The apparition percentage is 38.294%


In [337]:
calculate_apparition_pct('fruits-vegetables-nuts_100g')

The apparition percentage is 0.494%


In the worst case, if we do not find a smart way to tackle these issues, we will simply end up with overall smaller nutritional scores. Either way, we will still be able to proceed with the analysis.