# Open Food Facts: the carbon “food-print” we do not eat

## Abstract
<i>Everything we do has a carbon footprint, and our diet is no exception. Growing, farming, processing and packaging our food consumes energy and organic resources, which results in the emission of greenhouse gases such as CO<sub>2</sub>. In our project, we use the Open Food Facts dataset to analyze the processed foods industry - its manufacturing, product composition, and sales - for the main sources of carbon emissions. We explain how the carbon footprint is distributed, starting from an understanding of the products, followed by a breakdown of production countries and points of sale, and an evaluation of trends in diet composition, with a special focus on products with high nutrition grades in France and the UK.

With this study, we want to provide a better understanding of the agri-food industry and eventually help reduce carbon emissions.</i>

In this notebook, we perform the above analysis on the OpenFoodFacts database, which we pre-processed using the __Open Food Facts - Cleanse Data__ notebook in the main directory.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Abstract" data-toc-modified-id="Abstract-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Abstract</a></span></li><li><span><a href="#Import-cleansed-data" data-toc-modified-id="Import-cleansed-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import cleansed data</a></span></li><li><span><a href="#Analyse-data" data-toc-modified-id="Analyse-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Analyse data</a></span><ul class="toc-item"><li><span><a href="#Production-/-manufacture-impact" data-toc-modified-id="Production-/-manufacture-impact-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Production / manufacture impact</a></span><ul class="toc-item"><li><span><a href="#Global-distribution-of-global-food-producers" data-toc-modified-id="Global-distribution-of-global-food-producers-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>Global distribution of global food producers</a></span><ul class="toc-item"><li><span><a href="#How-is-this-distribution-impacted-when-we-consider-neutral-and-large-carbon-footprint-products?" data-toc-modified-id="How-is-this-distribution-impacted-when-we-consider-neutral-and-large-carbon-footprint-products?-3.1.1.1"><span class="toc-item-num">3.1.1.1&nbsp;&nbsp;</span>How is this distribution impacted when we consider neutral and large carbon footprint products?</a></span></li></ul></li><li><span><a href="#Case-study:-Palm-oil" data-toc-modified-id="Case-study:-Palm-oil-3.1.2"><span class="toc-item-num">3.1.2&nbsp;&nbsp;</span>Case study: Palm oil</a></span><ul class="toc-item"><li><span><a href="#Can-we-observe-any-trend-in-the-number-of-products-including-palm-oil-(assuming-a-strong-dependence-between-date-the-product-was-added-to-the-database-and-data-the-product-was-invented)?" 
data-toc-modified-id="Can-we-observe-any-trend-in-the-number-of-products-including-palm-oil-(assuming-a-strong-dependence-between-date-the-product-was-added-to-the-database-and-data-the-product-was-invented)?-3.1.2.1"><span class="toc-item-num">3.1.2.1&nbsp;&nbsp;</span>Can we observe any trend in the number of products including palm oil (assuming a strong dependence between date the product was added to the database and data the product was invented)?</a></span></li><li><span><a href="#Which-countries-use-palm-oils-for-production?" data-toc-modified-id="Which-countries-use-palm-oils-for-production?-3.1.2.2"><span class="toc-item-num">3.1.2.2&nbsp;&nbsp;</span>Which countries use palm oils for production?</a></span></li></ul></li></ul></li><li><span><a href="#Good-nutrition-impact" data-toc-modified-id="Good-nutrition-impact-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Good nutrition impact</a></span><ul class="toc-item"><li><span><a href="#High-nutrional-products" data-toc-modified-id="High-nutrional-products-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>High-nutrional products</a></span><ul class="toc-item"><li><span><a href="#Has-there-been-a-surge-in-high-graded-Products-in-the-France-over-the-past-years?" data-toc-modified-id="Has-there-been-a-surge-in-high-graded-Products-in-the-France-over-the-past-years?-3.2.1.1"><span class="toc-item-num">3.2.1.1&nbsp;&nbsp;</span>Has there been a surge in high graded Products in the France over the past years?</a></span></li><li><span><a href="#Has-there-been-a-surge-in-high-graded-Products-in-the-France-over-the-past-years?" data-toc-modified-id="Has-there-been-a-surge-in-high-graded-Products-in-the-France-over-the-past-years?-3.2.1.2"><span class="toc-item-num">3.2.1.2&nbsp;&nbsp;</span>Has there been a surge in high graded Products in the France over the past years?</a></span></li><li><span><a href="#Where-do-these-product-come-from-and-where-are-they-manufactured?" 
data-toc-modified-id="Where-do-these-product-come-from-and-where-are-they-manufactured?-3.2.1.3"><span class="toc-item-num">3.2.1.3&nbsp;&nbsp;</span>Where do these product come from and where are they manufactured?</a></span></li><li><span><a href="#Where-are-those-products-sold?" data-toc-modified-id="Where-are-those-products-sold?-3.2.1.4"><span class="toc-item-num">3.2.1.4&nbsp;&nbsp;</span>Where are those products sold?</a></span></li></ul></li><li><span><a href="#Carbon-footprint-of-nutritionally-high-graded-products" data-toc-modified-id="Carbon-footprint-of-nutritionally-high-graded-products-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Carbon footprint of nutritionally-high graded products</a></span><ul class="toc-item"><li><span><a href="#Is-there-a-general-correlation-between-high-carbon-footprint-and-price?" data-toc-modified-id="Is-there-a-general-correlation-between-high-carbon-footprint-and-price?-3.2.2.1"><span class="toc-item-num">3.2.2.1&nbsp;&nbsp;</span>Is there a general correlation between high carbon footprint and price?</a></span></li></ul></li></ul></li></ul></li></ul></div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import folium
from scipy import stats
from datetime import datetime

import json
import pickle

import os
import sys
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)
    
%load_ext autoreload
%autoreload 2
    
import libs.exploring as explore
import libs.visualising as visualize
import libs.cleansing as cleanse

# Set up plotly environment
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import plotly.tools as tls
init_notebook_mode(connected=True)

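The commented-out command below embeds a plot hosted on plot.ly for the website. The `save_plots_offline` flag is passed to the plotting helpers further down.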

In [None]:
# tls.get_embed('https://plot.ly/~maxencedraguet/25/')
save_plots_offline = False

## Import cleansed data

In [None]:
# Import data
open_food_facts_csv_file = "./data/openfoodfacts_clean.csv"

food_facts_pd = pd.read_csv(open_food_facts_csv_file,
                            delimiter="\t")

In [None]:
# Change column data types (values that cannot be parsed become NaN / NaT)
food_facts_pd['carbon-footprint_100g'] = pd.to_numeric(food_facts_pd['carbon-footprint_100g'], errors='coerce')
food_facts_pd['energy_100g'] = pd.to_numeric(food_facts_pd['energy_100g'], errors='coerce')
food_facts_pd['price_per_100g'] = pd.to_numeric(food_facts_pd['price_per_100g'], errors='coerce')
food_facts_pd['created_datetime'] = pd.to_datetime(food_facts_pd['created_datetime'], errors='coerce')

In [None]:
# Extract year from created date
food_facts_pd['created_yyyy'] = food_facts_pd["created_datetime"].dt.year

In addition to the OpenFoodFacts dataset, we obtained an extract of the Eaternity dataset hosted by ETH Zurich, which contains 692 more products and their CO2 footprint. Unfortunately, these products are not contained in the OpenFoodFacts database, so we lack manufacturing and purchasing information for this set. Furthermore, because the Eaternity categories were provided in German, we assigned the OpenFoodFacts categories by manually matching the category strings.
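
The manual category matching can be pictured as a simple lookup table. The sketch below is only an illustration: the German category names, the `category_de` column and the `category_lookup` dictionary are hypothetical and are not taken from the Eaternity extract.

```python
import pandas as pd

# Hypothetical lookup from German category strings to Open Food Facts style categories
category_lookup = {
    "Milchprodukte": "Dairies",
    "Schokolade": "Sugary snacks",
    "Gemüse": "Plant-based foods",
}

# Toy stand-in for the Eaternity extract (the column name is an assumption)
demo = pd.DataFrame({"category_de": ["Schokolade", "Gemüse", "Fisch"]})

# Map each German category; anything without a manual match is marked as 'Unknown'
demo["main_category"] = demo.category_de.map(category_lookup).fillna("Unknown")
print(demo)
```

Categories without a manual match end up as 'Unknown', so they remain visible instead of being dropped silently.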

In [None]:
# Import data
eaternity_csv_file = "./data/carbon_footprint_clean.csv"

eaternity_pd = pd.read_csv(eaternity_csv_file, delimiter="\t")

## Analyse data

Before we analyse the data, we have some confessions to make:

The data that we loaded into this notebook was already preprocessed in the "Open Food Facts - Cleanse data" notebook, which can be found in the same directory. There, we translated countries, labels, and categories, and formatted and matched tags. However, we also dropped more than 90% of the dataset because the data points were not complete enough for the purpose of our analysis.

OpenFoodFacts was initiated in France, and products sold in France dominate the dataset by far. Moreover, most of the products are sold in Europe or other industrialised nations, and we have no or only sparse data about Africa, Asia, Australia, and South America. This excludes the majority of the world population, and especially the societies in Asia and Africa that are currently undergoing the most decisive transformations.

Further, we only have qualitative data about the products, meaning no information about the quantities in which they are produced and purchased worldwide. We also could not find publicly available datasets on the quantities in which certain products are consumed. As a consequence, we cannot attach a scale to the insights that we gain throughout this notebook.

What we are trying to say is that the data is by no means representative enough to answer the research questions that we posed in the abstract. However, we provide the methods to perform this analysis on the limited dataset and see what insights we can already squeeze out of the data at hand.

### Production / manufacture impact

#### Global distribution of global food producers

In [None]:
countries_label = pd.read_csv("./data/country_lookup.csv")[['name', 'cca3']]     

First, we look at where most food items originate.

In [None]:
visualize.plot_occurrences_on_map(df=food_facts_pd, 
                                  column_key='origins',
                                  save_offline = save_plots_offline, 
                                  save_offline_title= 'origin_all',
                                  show_distances=False)

**Where are those products manufactured?**

In [None]:
visualize.plot_occurrences_on_map(df=food_facts_pd, 
                                 column_key='manufacturing_places',
                                 save_offline = save_plots_offline, 
                                 save_offline_title= 'manufacturing_all',
                                 show_distances=False)

Please note the log scale of the color map. We can observe that most manufacturers in the dataset are located in France. Other common manufacturers are located in other European countries, as well as in North and Central America. Only few products are manufactured by companies located in Africa, South America, Asia, and Oceania.
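
To illustrate why a log scale helps here, the sketch below counts occurrences per country and takes the base-10 logarithm. It assumes that a cell of `manufacturing_places` holds a comma-separated string of countries, which may not match the cleansed data, and it is not the implementation of `visualize.plot_occurrences_on_map`.

```python
import numpy as np
import pandas as pd

# Toy data; the comma-separated layout of the column is an assumption
demo = pd.DataFrame(
    {"manufacturing_places": ["France", "France,Germany", "France", "Italy", None]}
)

# One row per (product, country) pair, then count products per country
counts = (
    demo.manufacturing_places.dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

# On a log scale, a dominant country no longer flattens all the others
log_counts = np.log10(counts)
print(pd.DataFrame({"count": counts, "log10_count": log_counts}))
```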

**Where are those products bought?**

In [None]:
visualize.plot_occurrences_on_map(df=food_facts_pd, 
                                  column_key='purchase_places',
                                  save_offline = save_plots_offline, 
                                  save_offline_title= 'purchase_all',
                                  show_distances=False)

This plot also reveals the predominance of products sold in France. The trend remains that most products in the database are sold in Europe, while all other continents play only a marginal role.

In conclusion, we note that we mainly have data for "western" countries, with a <b>huge bias toward France</b>. We mostly lack information for countries in Africa, South America and Central Asia. Our dataset is thus clearly not a truthful representation of global trends. We therefore restrict our analysis to France, meaning that we only keep products purchased in France. [This country was selected since it is the best represented one.]

This is carried out in the next cell. Note that <i>purchase_places</i> is only required to contain 'France' as one of the entries in its list. Other countries may thus still appear in the <i>purchase_places</i> column.
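
As a rough illustration of this "contains France" rule, the sketch below uses a hypothetical helper `sold_in_france`. It is a stand-in written for this example, not the actual `explore.filter_france`, and it assumes that the entries are either a list or a comma-separated string.

```python
import pandas as pd

def sold_in_france(places):
    """Return 'France' if France is among the entries, else None (hypothetical helper)."""
    if isinstance(places, str):
        entries = [p.strip() for p in places.split(",")]
    elif isinstance(places, (list, tuple)):
        entries = [str(p).strip() for p in places]
    else:
        return None  # missing value
    return "France" if "France" in entries else None

demo = pd.DataFrame({"purchase_places": ["France", "Germany,France", "Spain", None]})
demo["filter"] = demo.purchase_places.apply(sold_in_france)

# Rows sold in France are kept even if they list other countries as well
demo_fr = demo[demo["filter"] == "France"].drop(columns=["filter"])
print(demo_fr)
```

The second row survives the filter although it also lists Germany, which is exactly the caveat mentioned above.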

In [None]:
# Extract only products that are sold in France.
food_facts_pd['filter'] = food_facts_pd.purchase_places.apply(lambda l: explore.filter_france(l))
food_facts_pd = food_facts_pd[food_facts_pd['filter'] == 'France'].drop(columns=['filter'])

The next two figures show the new distribution of the data, and additionally mark the routes that the products need to travel before they land in a supermarket in France.

In [None]:
visualize.plot_occurrences_on_map(df=food_facts_pd, 
                                  column_key='manufacturing_places',
                                  save_offline = save_plots_offline, 
                                  save_offline_title= 'manufacturing_fr',
                                  show_distances=True)

In [None]:
visualize.plot_occurrences_on_map(df=food_facts_pd, 
                                  column_key='origins',
                                  save_offline = save_plots_offline, 
                                  save_offline_title= 'origins_fr',
                                  show_distances=True)

Filtering for products sold in France mostly removed products manufactured in Central Asia and North America. Most of the products that we analyse further are most likely manufactured in Europe, followed by the Americas. Only very few products originate from Asia, and even fewer from Africa.

##### How is this distribution impacted when we consider neutral and large carbon footprint products? 

The carbon footprint data from Eaternity is restricted to Germany, and that of Open Food Facts is much too sparse to be informative (and biased towards France). Hence, we unfortunately lack the data to answer this question.

#### Case study: Palm oil

In [None]:
# Extract products whose ingredient list mentions palm oil (missing ingredient lists count as no match)
palm_oil_pd = food_facts_pd[food_facts_pd.ingredients_text.str.contains("palm", na=False)]

##### Can we observe any trend in the number of products including palm oil (assuming a strong dependence between date the product was added to the database and data the product was invented)?

In [None]:
print('{0:.2f} % of the products in the dataset contain palm oil'.format(palm_oil_pd.shape[0]/food_facts_pd.shape[0]*100))

In [None]:
proportions,palm_oil_over_time = explore.proportion_palm_oil(food_facts_pd)

In [None]:
visualize.palm_oil_overtime(proportions,
                            palm_oil_over_time,
                            save = save_plots_offline, 
                            save_title= 'palm_oil_over_time')

The plot above shows, for each year, the percentage of products added to the database that contain palm oil. Overall, the share varies between 3.5% and 5.6%, with a steady decrease since 2015. Even though we cannot make any statement about the quantities of palm oil products sold, this suggests that producers increasingly avoid palm oil in the new products they bring to the market.
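
For reference, a yearly share of this kind could be computed roughly as follows. This is a minimal sketch on toy data, not the code of `explore.proportion_palm_oil`; it only reuses the column names `created_yyyy` and `ingredients_text` from this notebook.

```python
import pandas as pd

# Toy data with the two columns used in this notebook
demo = pd.DataFrame({
    "created_yyyy": [2014, 2014, 2015, 2015, 2015, 2016],
    "ingredients_text": ["sugar, palm oil", "wheat", "huile de palme", None, "milk", "cocoa"],
})

# True where the ingredient list mentions palm oil; missing lists count as False
contains_palm = demo.ingredients_text.str.contains("palm", na=False)

# The mean of a boolean column is the fraction of True values, here per year
share_per_year = contains_palm.groupby(demo.created_yyyy).mean() * 100
print(share_per_year.round(1))
```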

##### Which countries use palm oils for production?

In [None]:
# Count palm-oil products per origin entry
a = palm_oil_pd.origins.value_counts()

In [None]:
df_colors = visualize.create_colorbar_df(food_facts_pd)

In [None]:
visualize.plot_column_composition(palm_oil_pd,
                                  df_colors,
                                  'manufacturing_places',
                                  save_offline=save_plots_offline, 
                                  save_offline_title='palm_oil_manufacturing_places')

More than 76% of the products in the database that contain palm oil as an ingredient are manufactured in Europe. From the [Atlas dataset](https://atlas.media.mit.edu/en/profile/hs92/1511/), we see, however, that these countries import at least 15 times more palm oil than they produce, which suggests that even the ingredients are shipped across the globe before they are processed into food products.

### Good nutrition impact

Next, let's take a look at the nutrition grades of the products sold in France. The nutrition grade of a product is an indicator of the type of ingredients that are processed into a food product. The grade is derived from a score developed by a [research team led by Professor Serge Hercberg](https://world.openfoodfacts.org/nutriscore). The score is based on the share of plant-based ingredients (fruits, vegetables and nuts) and on the nutritional values for energy (kJ), saturated fats, sugar and sodium. We use it to observe how the composition of products changed over the years.

In [None]:
# Keep only the columns needed for the nutrition analysis
nutrition_fr = food_facts_pd[['product_name',
                              'created_datetime',
                              'nutrition-score-fr_100g', 
                              'main_category', 
                              'origins', 
                              'purchase_places', 
                              'manufacturing_places',
                              'stores']
                            ]

nutrition_fr = nutrition_fr[nutrition_fr['nutrition-score-fr_100g'].notna()]
nutrition_over_time = nutrition_fr.sort_values(by = 'created_datetime')
nutrition_over_time['main_category'] = nutrition_over_time.main_category.fillna(value='Unknown')

The meaning of the nutrition score is explained at https://world.openfoodfacts.org/nutriscore. The main facts are the following:
- Products are rated according to the amount of nutrients they contain [per 100 g] and given a grade between A and E (A being the best).

<img src="Images/nutriscore.png" height="540" width="336">

- For solid products, the grade is derived from a nutrition score as shown in the next table. The score itself has two parts. The first considers energy, saturated fat, sugars and sodium; high levels of these are considered unhealthy. The second reflects the proportion of fruits, vegetables and nuts, fibers and proteins, for which high levels are considered beneficial to health.

<center><img src="Images/nutriscore_table.png" height="1000" width="900"></center>
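
The mapping from score to grade for solid foods can be sketched as below. The thresholds are the commonly published Nutri-Score cut-offs for solid foods, which we assume match the table above. The function `grade_for_solid` is a hypothetical illustration and ignores the product category, unlike the `explore.assign_score` helper used in this notebook, which also receives `main_category`.

```python
def grade_for_solid(score):
    """Map a Nutri-Score value for a solid food to a letter grade (assumed thresholds)."""
    if score <= -1:
        return "A"
    if score <= 2:
        return "B"
    if score <= 10:
        return "C"
    if score <= 18:
        return "D"
    return "E"

# Lower scores are better: few "negative" points and many "positive" points
for score in (-3, 1, 7, 15, 25):
    print(score, grade_for_solid(score))
```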



In [None]:
# Assign a letter grade from the numeric score and the product category
nutrition_over_time["nutrition_grade"] = (
    nutrition_over_time[['nutrition-score-fr_100g', 'main_category']]
    .apply(explore.assign_score, axis=1)
)


#### High-nutrional products

##### What are those products made of?

In [None]:
nutrition_over_time_reduced = explore.count_nutrition_grade(nutrition_over_time)

The following plot shows the most common categories among the products that have a nutrition score.

In [None]:
visualize.make_content_stacked_bar(visualize.plot_grade_content(nutrition_over_time), 
                                   df_colors,
                                   'keys', 
                                   'grade', 
                                   'Percentage',
                                   save_offline=save_plots_offline, 
                                   save_offline_title='nutrition_content')

Observe how products with good nutrition grades are mostly (more than 50%) plant-based, and how this category, as well as carbs and canned food, shrinks as the grade worsens. This reduction is compensated by a sharp increase in the prevalence of sugary snacks and a smaller increase in meat-based products. Seafood and dairy seem to concentrate in the lower and higher part of the middle grades, respectively.

##### Has there been a surge in high graded Products in the France over the past years?

In [None]:
visualize.make_grade_stacked_bar(nutrition_over_time_reduced, 
                                 'nutrition_grade', 
                                 'year', 
                                 'Count',
                                 save_plots_offline, 
                                 'nutrition_grade')                        

We observe that, as time passes, more products with a nutrition grade are being added, with a peak in 2015-2016. Now, how has the composition of these products evolved?

In [None]:
visualize.make_grade_stacked_bar(nutrition_over_time_reduced, 
                                 'nutrition_grade', 
                                 'year', 
                                 'Percentage',
                                 save_plots_offline, 
                                 'nutrition_percentage')

We observe that the share of each grade has remained mostly stable over the last six years, with a barely noticeable peak in 2013 for the high nutrition grade 'A'. However, since 2013, the proportion of badly graded products among all graded products has grown.
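
The difference between the count view and the percentage view can be reproduced with a cross-tabulation. The sketch below uses toy data with columns named `year` and `nutrition_grade`. The real layout of `nutrition_over_time_reduced` may differ, so this is an assumption-laden illustration rather than the code of `explore.count_nutrition_grade`.

```python
import pandas as pd

# Toy data: one row per graded product
demo = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014, 2014, 2014],
    "nutrition_grade": ["A", "D", "A", "B", "D", "D"],
})

# Absolute number of products per year and grade
counts = pd.crosstab(demo.year, demo.nutrition_grade)

# Share of each grade within a year; every row sums to 100
percentages = pd.crosstab(demo.year, demo.nutrition_grade, normalize="index") * 100
print(counts)
print(percentages)
```

Normalising per year removes the effect of the growing number of entries, so changes in the percentage plot reflect changes in composition only.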

##### Where do these product come from and where are they manufactured?

In [None]:
visualize.plot_column_composition(nutrition_fr, 
                                  df_colors,
                                  'manufacturing_places', 
                                  save_offline=save_plots_offline, 
                                  save_offline_title='nutrition_manufacturing_places') 

Naturally, most of the food consumed in France is manufactured there, though approximately 30% is produced elsewhere. The plot above only displays these 30%.

##### Where are those products sold?
Since we filtered the dataset to products sold in France, this question becomes obsolete. However, observe that the products in the database span a variety of supermarkets and hence consumer groups.

In [None]:
visualize.plot_column_composition(nutrition_fr, 
                                  df_colors,
                                  'stores',
                                  save_offline=save_plots_offline, 
                                  save_offline_title='nutrition_stores',
                                  num_values=8)

#### Carbon footprint of nutritionally-high graded products

In this section, we investigate the carbon footprint of different products and categories in the OpenFoodFacts dataset. Common sense would suggest that most products with high nutrition grades are organic (plant, fruit, vegetables, …) and thus have a small manufacturing footprint. Let's see what story the data has to tell...

But as before, we should be careful, as the dataset is biased. So we begin by examining what kind of data is present in the datasets.

In [None]:
# Extracting all products containing carbon footprint information from the database
carbon_footprints = food_facts_pd[food_facts_pd['carbon-footprint_100g'].notna()]
display(carbon_footprints.main_category.value_counts())

Before we begin, please note that most of the products are either sugary snacks (predominantly chocolates) or plant-based, and hence contain only few ingredients.

First, we should get a feel for the data that we are dealing with. Therefore, we visualize the origin and composition of the products.

In [None]:
visualize.plot_column_composition(carbon_footprints, 
                                  df_colors,
                                  column_str='manufacturing_places',
                                  save_offline = save_plots_offline, 
                                  save_offline_title = 'carbon_manufacturing_places')

We wanted to investigate whether there is a correlation between country of origin, and hence transportation distance, and the carbon footprint. However, as more than 80% of the products are produced in France or neighbouring countries, the uncertainty would be too high, given that we only know the manufacturers' locations at country granularity.
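
To make the granularity problem concrete: with country-level data, a transport distance can only be approximated between representative points, for example capitals. The sketch below computes such a great-circle distance with the haversine formula. Using capitals as stand-ins is our assumption for this illustration, and the function `haversine_km` is not part of the notebook's libraries.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Capitals as stand-ins for "manufactured in Germany, sold in France"
paris = (48.8566, 2.3522)
berlin = (52.5200, 13.4050)
distance = haversine_km(*paris, *berlin)
print(f"Paris - Berlin: {distance:.0f} km")
```

For a product made just across the border, the true route may be a small fraction of this figure, which is why such an estimate is too uncertain for products from France and its neighbours.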

In [None]:
visualize.plot_column_composition(carbon_footprints, 
                                  df_colors,
                                  column_str='main_category',
                                  save_offline = save_plots_offline, 
                                  save_offline_title = 'carbon_main_category')

We see that the main categories for which we have carbon footprint data are sugary snacks (mainly plain chocolates), plant-based products and dairies. This is not surprising, since they are made up of only few ingredients and are therefore easy for manufacturers to trace.

##### Is there a general correlation between high carbon footprint and price? 

In order to investigate this question, we combine the data above with what we obtained from the Eaternity database.

In [None]:
# Columns shared by both data sources
carbon_columns = ['product_name', 'main_category', 'energy_100g', 'carbon-footprint_100g', 'price_per_100g']

# Open Food Facts products that report a carbon footprint
carbon_footprints_food_facts = food_facts_pd.loc[food_facts_pd['carbon-footprint_100g'].notna(), carbon_columns]

# Eaternity products
carbon_footprints_eaternity = eaternity_pd[carbon_columns]

# Stack both sources and keep only complete rows
carbon_footprints = pd.concat([carbon_footprints_food_facts, carbon_footprints_eaternity], ignore_index=True).dropna()

In [None]:
visualize.plot_column_composition(carbon_footprints, 
                                  df_colors,
                                  column_str='main_category',
                                  num_values=8,
                                  save_offline = save_plots_offline, 
                                  save_offline_title = 'carbon_eat_main_category')

As we see above, combining these two datasets gives us more variety in the categories that the products come from.

Next, we plot them against prices that we collected from the online stores of Walmart, Monoprix, Kaufland and Migros.
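
Alongside the plots, a correlation coefficient gives a single-number summary of the relationship. The sketch below uses made-up values and the column names from this notebook. It computes the Pearson coefficient and a rank-based (Spearman) coefficient, the latter as the Pearson correlation of the ranks. No such computation is part of the original analysis, so treat it as one possible approach.

```python
import pandas as pd

# Made-up values, for illustration only
demo = pd.DataFrame({
    "carbon-footprint_100g": [50, 120, 300, 900, 150],
    "price_per_100g": [0.4, 0.9, 1.5, 2.0, 3.0],
})

carbon = demo["carbon-footprint_100g"]
price = demo["price_per_100g"]

# Pearson measures linear association and is sensitive to outliers
pearson = carbon.corr(price)

# Spearman is the Pearson correlation of the ranks, so it only uses the ordering
spearman = carbon.rank().corr(price.rank())
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Because carbon footprints span several orders of magnitude, the rank-based coefficient is usually the more robust of the two for data like this.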

In [None]:
# Carbon footprint versus price per 100 g
visualize.plot_cluster_by_tags(carbon_footprints,
                               df_colors,
                               save_offline = save_plots_offline, 
                               save_offline_title = 'carbon_scatter_O',
                               plot2D_features = ["carbon-footprint_100g", "price_per_100g"],
                               cluster="main_category")

In [None]:
# Carbon footprint versus energy per 100 g
visualize.plot_cluster_by_tags(carbon_footprints,
                               df_colors,
                               save_offline = save_plots_offline, 
                               save_offline_title = 'carbon_energy',
                               plot2D_features = ["carbon-footprint_100g", "energy_100g"],
                               cluster="main_category")

**Thanks for reading!**