# Open Food Facts: the carbon “food-print” we do not eat

## Abstract
<i>Everything we do has a carbon footprint, and our diet is no exception. Growing, farming, processing and packaging our food consumes energy and organic resources and releases greenhouse gases such as CO<sub>2</sub>. In this project, we use the Open Food Facts dataset to analyze the processed foods industry (its manufacturing, product composition, and sales) for the main sources of carbon emissions. We explain how the carbon footprint is distributed, starting with an understanding of the products, followed by a breakdown of production countries and points of sale, and an evaluation of trends in diet composition, with a special focus on products with high nutrition grades in France and the UK.

With this study, we want to provide a better understanding of the agri-food industry and, eventually, help reduce carbon emissions.</i>

In this notebook, we perform the above analysis on the Open Food Facts database, which we pre-processed using the __Open Food Facts - Cleanse Data__ notebook in the main directory.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Abstract" data-toc-modified-id="Abstract-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Abstract</a></span></li><li><span><a href="#Import-cleansed-data" data-toc-modified-id="Import-cleansed-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import cleansed data</a></span></li><li><span><a href="#Analyse-data" data-toc-modified-id="Analyse-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Analyse data</a></span><ul class="toc-item"><li><span><a href="#Production-/-manufacture-impact" data-toc-modified-id="Production-/-manufacture-impact-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Production / manufacture impact</a></span><ul class="toc-item"><li><span><a href="#Global-distribution-of-global-food-producers" data-toc-modified-id="Global-distribution-of-global-food-producers-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>Global distribution of global food producers</a></span><ul class="toc-item"><li><span><a href="#Which-are-the-dominant-global-food-producers-and-manufacturers?" data-toc-modified-id="Which-are-the-dominant-global-food-producers-and-manufacturers?-3.1.1.1"><span class="toc-item-num">3.1.1.1&nbsp;&nbsp;</span>Which are the dominant global food producers and manufacturers?</a></span></li><li><span><a href="#How-is-this-distribution-impacted-when-we-consider-neutral-and-large-carbon-footprint-products?" 
data-toc-modified-id="How-is-this-distribution-impacted-when-we-consider-neutral-and-large-carbon-footprint-products?-3.1.1.2"><span class="toc-item-num">3.1.1.2&nbsp;&nbsp;</span>How is this distribution impacted when we consider neutral and large carbon footprint products?</a></span></li></ul></li><li><span><a href="#Case-study:-Palm-oil" data-toc-modified-id="Case-study:-Palm-oil-3.1.2"><span class="toc-item-num">3.1.2&nbsp;&nbsp;</span>Case study: Palm oil</a></span><ul class="toc-item"><li><span><a href="#Can-we-observe-any-trend-in-the-number-of-products-including-this-oil-(assuming-a-strong-dependence-between-date-the-product-was-added-to-the-database-and-data-the-product-was-invented)?" data-toc-modified-id="Can-we-observe-any-trend-in-the-number-of-products-including-this-oil-(assuming-a-strong-dependence-between-date-the-product-was-added-to-the-database-and-data-the-product-was-invented)?-3.1.2.1"><span class="toc-item-num">3.1.2.1&nbsp;&nbsp;</span>Can we observe any trend in the number of products including this oil (assuming a strong dependence between date the product was added to the database and data the product was invented)?</a></span></li><li><span><a href="#Which-country-use-palm-oils-for-production?" data-toc-modified-id="Which-country-use-palm-oils-for-production?-3.1.2.2"><span class="toc-item-num">3.1.2.2&nbsp;&nbsp;</span>Which country use palm oils for production?</a></span></li></ul></li></ul></li><li><span><a href="#Good-nutrition-impact" data-toc-modified-id="Good-nutrition-impact-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Good nutrition impact</a></span><ul class="toc-item"><li><span><a href="#High-nutrional-products" data-toc-modified-id="High-nutrional-products-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>High-nutrional products</a></span><ul class="toc-item"><li><span><a href="#Has-there-been-a-surge-in-high-graded-Products-in-the-UK-/-France-over-the-past-years?" 
data-toc-modified-id="Has-there-been-a-surge-in-high-graded-Products-in-the-UK-/-France-over-the-past-years?-3.2.1.1"><span class="toc-item-num">3.2.1.1&nbsp;&nbsp;</span>Has there been a surge in high graded Products in the UK / France over the past years?</a></span></li><li><span><a href="#What-are-those-products-made-of?" data-toc-modified-id="What-are-those-products-made-of?-3.2.1.2"><span class="toc-item-num">3.2.1.2&nbsp;&nbsp;</span>What are those products made of?</a></span></li><li><span><a href="#Where-do-these-product-come-from-and-where-are-they-manufactured?" data-toc-modified-id="Where-do-these-product-come-from-and-where-are-they-manufactured?-3.2.1.3"><span class="toc-item-num">3.2.1.3&nbsp;&nbsp;</span>Where do these product come from and where are they manufactured?</a></span></li><li><span><a href="#Where-are-those-products-sold?" data-toc-modified-id="Where-are-those-products-sold?-3.2.1.4"><span class="toc-item-num">3.2.1.4&nbsp;&nbsp;</span>Where are those products sold?</a></span></li></ul></li><li><span><a href="#Carbon-footprint-of-nutrionally-high-graded-products" data-toc-modified-id="Carbon-footprint-of-nutrionally-high-graded-products-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Carbon footprint of nutrionally-high graded products</a></span><ul class="toc-item"><li><span><a href="#Can-we-establish-a-meaningful-correlation-between-these-product-and-the-carbon-footprint--or-an-estimated-price-(using-another-dataset-or-creating-our-own-with-web-scraping)?" 
data-toc-modified-id="Can-we-establish-a-meaningful-correlation-between-these-product-and-the-carbon-footprint--or-an-estimated-price-(using-another-dataset-or-creating-our-own-with-web-scraping)?-3.2.2.1"><span class="toc-item-num">3.2.2.1&nbsp;&nbsp;</span>Can we establish a meaningful correlation between these product and the carbon footprint  or an estimated price (using another dataset or creating our own with web scraping)?</a></span></li><li><span><a href="#Is-there-a-general-correlation-between-high-carbon-footprint-and-price?" data-toc-modified-id="Is-there-a-general-correlation-between-high-carbon-footprint-and-price?-3.2.2.2"><span class="toc-item-num">3.2.2.2&nbsp;&nbsp;</span>Is there a general correlation between high carbon footprint and price?</a></span></li></ul></li></ul></li></ul></li></ul></div>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import folium
from scipy import stats
import seaborn as sns
from datetime import datetime

import json
import pickle

import os
import sys
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)
    
%load_ext autoreload
%autoreload 2
    
import libs.exploring as explore
import libs.visualising as visualize
import libs.cleansing as cleanse

## Import cleansed data

In [2]:
# Import data
open_food_facts_csv_file = "./data/openfoodfacts_clean.csv"

food_facts_pd = pd.read_csv(open_food_facts_csv_file,
                            delimiter="\t")

In [3]:
# Change column data types (values that cannot be parsed become NaN / NaT)
food_facts_pd['carbon-footprint_100g'] = pd.to_numeric(food_facts_pd['carbon-footprint_100g'], errors='coerce')
food_facts_pd['energy_100g'] = pd.to_numeric(food_facts_pd['energy_100g'], errors='coerce')
food_facts_pd['price_per_100g'] = pd.to_numeric(food_facts_pd['price_per_100g'], errors='coerce')
food_facts_pd['created_datetime'] = pd.to_datetime(food_facts_pd['created_datetime'], errors='coerce')

# Replace missing values
food_facts_pd.origins_cleaned = food_facts_pd.origins_cleaned.fillna("['Unknown']")
food_facts_pd.manufacturing_place_cleaned = food_facts_pd.manufacturing_place_cleaned.fillna("['Unknown']")
food_facts_pd.purchase_places_cleaned = food_facts_pd.purchase_places_cleaned.fillna("['Unknown']")
# food_facts_pd = food_facts_pd.fillna('')

# List tags
food_facts_pd.origins_cleaned = \
    food_facts_pd.origins_cleaned.apply(cleanse.read)

food_facts_pd.manufacturing_place_cleaned = \
    food_facts_pd.manufacturing_place_cleaned.apply(cleanse.read)

food_facts_pd.purchase_places_cleaned = \
    food_facts_pd.purchase_places_cleaned.apply(cleanse.read)
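The list columns are stored in the CSV as strings such as `"['France']"` and have to be turned back into Python lists. The implementation of `cleanse.read` is not shown in this notebook, so the following is only a minimal sketch of what such a parser could look like, assuming the strings are valid Python list literals; `parse_list_tag` is a hypothetical name, not a function from `libs.cleansing`.

```python
import ast


def parse_list_tag(raw):
    """Hypothetical stand-in for cleanse.read: turn "['France']" into ['France']."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: keep the raw value as a single tag.
        return [raw]
    # Wrap scalars so that the result is always a list.
    return list(parsed) if isinstance(parsed, (list, tuple)) else [parsed]


print(parse_list_tag("['France', 'Spain']"))
print(parse_list_tag("['Unknown']"))
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for data read from a file.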

In [4]:
food_facts_pd.head(5)

Unnamed: 0.1,Unnamed: 0,code,created_t,created_datetime,product_name,quantity,packaging,brands,categories_en,labels_en,...,main_category,energy_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,origins_cleaned,manufacturing_place_cleaned,purchase_places_cleaned,price_per_100g,store_currency
0,0,274722,1514659309,2017-12-30 18:41:49,Blanquette de Volaille et son Riz,,"carton,plastique",Comme J’aime,"Meals,Meat-based products,Meals with meat,Poul...","['Viande Française', 'Made In France']",...,meals,450.0,,0.0,0.0,[Unknown],[France],[France],,
1,1,394710,1484497370,2017-01-15 16:22:50,Danoises à la cannelle roulées,1.150 kg,Frais,Kirkland Signature,"Sugary snacks,Biscuits and cakes,Pastries",[''],...,sugary-snacks,1520.0,,,,[France],[France],[France],,
2,2,1071894,1409411252,2014-08-30 15:07:32,Flute,,"Paper,plastic film",Waitrose,"Plant-based foods and beverages,Plant-based fo...",[''],...,plant-based-foods-and-beverages,,,,,[Canada],[Unknown],[Canada],,
3,3,1938067,1484501528,2017-01-15 17:32:08,Chaussons tressés aux pommes,1.200 kg,Frais,Kirkland Signature,"Sugary snacks,Biscuits and cakes,Pastries",[''],...,sugary-snacks,1090.0,,9.0,9.0,[France],[United Kingdom],[United Kingdom],,
4,4,4302544,1488464896,2017-03-02 14:28:16,Pain Burger Artisan,1.008 kg / 12 pain,"Frais,plastique",Kirkland Signature,boulange,[''],...,boulange,1160.0,,1.0,1.0,[Canada],[Unknown],[Canada],,


In [5]:
# Import data
carbon_footprint_csv_file = "./data/carbon_footprint_categories.csv"

carbon_footprint_pd = pd.read_csv(carbon_footprint_csv_file)


In [6]:
carbon_footprint_pd.head(5)

Unnamed: 0.1,Unnamed: 0,ID,Title,Weight [gram/serving],CO2-Value [gram CO2/serving],CO2 rating,FAT,WATER,ENERC,PROT,category
0,0,4300175162708,K Classic - Junger Gemüsemais,100,9.0,20.812,3.480252,52.999834,765.65552,8.601195,Gemüsekonserven
1,1,4388840231829,ja! Gemüsemais,100,17.0,37.941,2.312597,35.218431,1070.401621,5.715417,Gemüsekonserven
2,2,8851613101392,Aroy-D - Kokosnussmilch,100,35.0,47.49,25.2,33.86,1230.0,5.34,Kokosmilch
3,3,4003994111000,Kelloggs Cornflakes Die Originalen 375 g,100,29.0,50.203,2.30561,10.145873,1458.792557,7.147392,"Ceralien, Cornflakes"
4,4,4005009100542,Tortilla Chips Meersalz,100,55.0,53.102,25.168,5.083,1918.4,13.321,Tortilla Chips


## Analyse data

### Production / manufacture impact

#### Global distribution of global food producers

In [None]:
countries_label = pd.read_csv("./data/country_lookup.csv")[['name', 'cca3']]                                          

##### Which are the dominant global food producers and manufacturers?

- Where do those products originate?

In [None]:
values_set, values_count_origins = visualize.plot_occurences_of_distinct_values(food_facts_pd, 'origins_cleaned')

In [None]:
values_count_origins = pd.DataFrame.from_dict(values_count_origins, orient='index', columns=['Count']).reset_index().rename(index=str, columns={"index": "Country", "Count": "Count"})
values_count_origins = values_count_origins[values_count_origins.Country != "Unknown"]
values_count_origins['cca3'] = values_count_origins.Country.apply(lambda l: visualize.search_cca3(l, countries_label))
values_count_origins['Count'] = values_count_origins.Count.apply(lambda l: np.log(l))
values_count_origins= values_count_origins[['cca3', 'Count']]
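The cells above count how often each country appears and log-scale the counts before mapping them. The sketch below reproduces that step on toy data. It assumes that the helper `visualize.plot_occurences_of_distinct_values` returns a plain per-country count (its implementation is not shown here); `toy` and `LogCount` are illustrative names only.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy data: each product carries a list of origin countries.
toy = pd.DataFrame({"origins_cleaned": [["France"], ["France", "Spain"], ["Unknown"]]})

# Count every country once per product it is listed for.
counts = Counter(country for tags in toy["origins_cleaned"] for country in tags)

counts_pd = (pd.DataFrame.from_dict(counts, orient="index", columns=["Count"])
               .reset_index()
               .rename(columns={"index": "Country"}))
counts_pd = counts_pd[counts_pd["Country"] != "Unknown"].copy()

# np.log is the natural logarithm; np.log10 would give base 10 instead.
counts_pd["LogCount"] = np.log(counts_pd["Count"])
print(counts_pd)
```

Log-scaling keeps a single dominant country from flattening the colour scale for all the others.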


In [None]:
country_geo = './data/world-countries.json'

m = folium.Map(location=[0, 0], tiles='Mapbox Bright', zoom_start=1.5, control_scale= True)

# choropleth maps bind Pandas Data Frames and json geometries.
folium.Choropleth(geo_data=country_geo,
               data=values_count_origins,
               columns=['cca3', 'Count'],
               fill_color='YlGnBu',
               nan_fill_color='purple', 
               nan_fill_opacity=0.4,
               key_on='feature.id',
               threshold_scale=[0, 1, 2, 5,8,11],
               legend_name='Origin country: natural log of the number of entries [countries in purple have no data]',
               fill_opacity=0.7, 
               line_opacity=0.2,
               ).add_to(m)
m.save("folium-palm_oil_products-origin_countries.html")
m

Note that countries in purple are not assigned any value.

- Where are those products manufactured?

In [None]:
values_set, values_count_manufacturing = visualize.plot_occurences_of_distinct_values(food_facts_pd, 'manufacturing_place_cleaned')

In [None]:
values_count_manufacturing = pd.DataFrame.from_dict(values_count_manufacturing, orient='index', columns=['Count']).reset_index().rename(index=str, columns={"index": "Country", "Count": "Count"})
values_count_manufacturing = values_count_manufacturing[values_count_manufacturing.Country != "Unknown"]
values_count_manufacturing['cca3'] = values_count_manufacturing.Country.apply(lambda l: visualize.search_cca3(l, countries_label))
values_count_manufacturing['Count'] = values_count_manufacturing.Count.apply(lambda l: np.log(l))
values_count_manufacturing= values_count_manufacturing[['cca3', 'Count']]


In [None]:
country_geo = './data/world-countries.json'

m = folium.Map(location=[0, 0], tiles='Mapbox Bright', zoom_start=1.5)

# choropleth maps bind Pandas Data Frames and json geometries.
folium.Choropleth(geo_data=country_geo,
               data=values_count_manufacturing,
               columns=['cca3', 'Count'],
               fill_color='YlGnBu', 
               nan_fill_color='purple', 
               nan_fill_opacity=0.4,
               key_on='feature.id',
               threshold_scale=[0, 1, 2, 5,8,11],
               legend_name='Manufacturing: natural log of the number of entries [countries in purple have no data]',
               fill_opacity=0.7, 
               line_opacity=0.2,
               ).add_to(m)
m.save("folium-palm_oil_products-production_countries.html")
m

- Where are those products bought?

In [None]:
values_set, values_count_purchase = visualize.plot_occurences_of_distinct_values(food_facts_pd, 'purchase_places_cleaned')

In [None]:
values_count_purchase = pd.DataFrame.from_dict(values_count_purchase, orient='index', columns=['Count']).reset_index().rename(index=str, columns={"index": "Country", "Count": "Count"})
values_count_purchase = values_count_purchase[values_count_purchase.Country != "Unknown"]
values_count_purchase['cca3'] = values_count_purchase.Country.apply(lambda l: visualize.search_cca3(l, countries_label))
values_count_purchase['Count'] = values_count_purchase.Count.apply(lambda l: np.log(l))
values_count_purchase= values_count_purchase[['cca3', 'Count']]


In [None]:
country_geo = './data/world-countries.json'

m = folium.Map(location=[0, 0], tiles='Mapbox Bright', zoom_start=1.5)

# choropleth maps bind Pandas Data Frames and json geometries.
folium.Choropleth(geo_data=country_geo,
               data=values_count_purchase,
               columns=['cca3', 'Count'],
               fill_color='YlGnBu',
               nan_fill_color='purple', 
               nan_fill_opacity=0.4,
               key_on='feature.id',
               threshold_scale=[0, 1, 2, 5,8,11],
               legend_name='Purchasing: natural log of the number of entries [countries in purple have no data]',
               fill_opacity=0.7, 
               line_opacity=0.2,
               ).add_to(m)
m.save("folium-palm_oil_products-purchase_countries.html")
m

In conclusion, we mainly have data for "western" countries, with a huge bias toward France. We mostly lack information for countries in Africa and Central Asia. Our dataset is thus clearly not a faithful representation of the world. We shall therefore restrict our analysis to France, that is, to products whose purchase countries include France. [This category was selected since it is the best populated one.]

This is carried out in the next cell. Note that <i>purchase_places_cleaned</i> only needs to contain 'France' as one of the entries in its list; there may be other entries as well.
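As an illustration of this membership test, here is a minimal sketch on toy data. It does not reproduce `explore.filter_france`, whose code is not shown here; `sold_in_france` is a hypothetical helper that returns a boolean instead.

```python
import pandas as pd


def sold_in_france(places):
    """Hypothetical helper: True if 'France' is one of the listed purchase places."""
    return "France" in places


toy = pd.DataFrame({
    "product_name": ["A", "B", "C"],
    "purchase_places_cleaned": [["France"], ["Canada"], ["France", "Belgium"]],
})

# Keep every product that lists France, even alongside other countries.
france_only = toy[toy["purchase_places_cleaned"].apply(sold_in_france)]
print(france_only["product_name"].tolist())
```

Product "C" is kept although it is also sold in Belgium, which matches the rule stated above.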

In [None]:
food_facts_pd['filter'] = food_facts_pd.purchase_places_cleaned.apply(lambda l: explore.filter_france(l))
food_facts_pd = food_facts_pd[food_facts_pd['filter'] == 'France'].drop(columns=['filter'])

##### How is this distribution impacted when we consider neutral and large carbon footprint products? 

In [None]:
# Carbon footprint dataset from Eaternity
# This will be assessed in a future version of this project

#### Case study: Palm oil

##### Can we observe any trend in the number of products including this oil (assuming a strong dependence between date the product was added to the database and data the product was invented)?

In [None]:
# Extract products whose ingredient list mentions palm oil (missing ingredient lists count as no match)
palm_oil_pd = food_facts_pd[food_facts_pd.ingredients_text.str.contains("palm", na=False)]

In [None]:
print('{0:.2f} % of the products in the dataset contain palm oil'.format(palm_oil_pd.shape[0]/food_facts_pd.shape[0]*100))
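The share printed above depends on how the substring match treats missing values and letter case. The following sketch shows the same computation on toy data; the `case=False` argument is an addition for illustration and is not part of the original filter.

```python
import pandas as pd

toy = pd.DataFrame({
    "ingredients_text": ["sucre, huile de palme", "farine, sel", None, "Palm oil, cocoa"]
})

# na=False treats missing ingredient lists as "no match";
# case=False (an addition here) also matches the capitalised "Palm".
has_palm = toy["ingredients_text"].str.contains("palm", case=False, na=False)

share = has_palm.mean() * 100
print("{0:.2f} % of the toy products mention palm".format(share))
```

A plain substring match can also catch unrelated ingredients that merely contain the letters "palm", so the resulting share is best read as an upper estimate.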

In [None]:
#palm_oil_pd.groupby('main_category')

##### Which country use palm oils for production?

In [None]:
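# Count how often each origin country appears among the palm oil products
# (Series.explode, available in pandas 0.25 and later, puts each list entry on its own row)
palm_oil_origin_counts = palm_oil_pd.origins_cleaned.explode().value_counts()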

In [None]:
visualize.plot_column_composition(palm_oil_pd, 
                                  ['purchase_places_cleaned', 
                                   'manufacturing_place_cleaned']
                                 )

### Good nutrition impact

#### High-nutrional products

##### Has there been a surge in high graded Products in the UK / France over the past years?

In [None]:
nutrition_fr = food_facts_pd[['created_datetime',
                              'nutrition-score-fr_100g', 
                              'main_category', 
                              'origins_cleaned', 
                              'purchase_places_cleaned', 
                              'manufacturing_place_cleaned',
                              'stores']
                            ]
nutrition_fr = nutrition_fr[nutrition_fr['nutrition-score-fr_100g'].notna()]
nutrition_over_time = nutrition_fr.sort_values(by = 'created_datetime')

In [None]:
ax = nutrition_over_time["created_datetime"]\
        .groupby(nutrition_over_time["created_datetime"].dt.year)\
        .count()\
        .plot(kind="bar", color="#1F77B4")
plt.title('Added products with nutrition factor by year')
plt.show()
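The bar chart is built from a per-year count of products that have a nutrition score. A minimal sketch of that aggregation on made-up rows, using the same column names as the notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "created_datetime": pd.to_datetime(
        ["2016-05-01", "2017-01-15", "2017-12-30", "2017-03-02"]),
    "nutrition-score-fr_100g": [3.0, None, 9.0, 1.0],
})

# Keep only products that have a nutrition score.
scored = toy[toy["nutrition-score-fr_100g"].notna()]

# Number of scored products added to the database per year.
per_year = scored.groupby(scored["created_datetime"].dt.year)["created_datetime"].count()
print(per_year)
```

Because the grouping key is the date a product was added to the database, the counts reflect contributor activity as much as any real trend in the products themselves.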

##### What are those products made of?
What is the composition? Do they contain many additives?  Where are these products sold? 

In [None]:
visualize.plot_column_composition(nutrition_fr, 
                                  ['main_category']
                                 )

##### Where do these product come from and where are they manufactured?

In [None]:
visualize.plot_column_composition(nutrition_fr, 
                                  ['purchase_places_cleaned', 
                                   'manufacturing_place_cleaned']
                                 )

##### Where are those products sold?

In [None]:
visualize.plot_column_composition(nutrition_fr, 
                                  ['stores']
                                 )

#### Carbon footprint of nutrionally-high graded products
Common sense would suggest that most products with a high nutrition grade are organic (plants, fruit, vegetables, …) and therefore undergo little manufacturing, which would give them a small footprint.

In [None]:
carbon_footprints = food_facts_pd[food_facts_pd['carbon-footprint_100g'].notna()]

First, we should get a feel for the data we are dealing with. We therefore visualize the origin and composition of the products.

In [None]:
visualize.plot_column_composition(carbon_footprints, 
                                  columns=['origins_cleaned', 
                                           'stores'])

In [None]:
visualize.plot_column_composition(carbon_footprints, 
                                  columns=['main_category'])

##### Can we establish a meaningful correlation between these product and the carbon footprint  or an estimated price (using another dataset or creating our own with web scraping)? 

In [None]:
# Price per 100 g against carbon footprint per 100 g, clustered by main category
visualize.plot_cluster_by_tags(df=carbon_footprints,
                               plot2D_features = ["carbon-footprint_100g", "price_per_100g"],
                               cluster="main_category")

The prices above were collected from the online stores of Walmart, Monoprix, and Migros. Note that the only products in the dataset with a carbon footprint are dairy products and sweets, so we hope to gain more insight into other products from the Eaternity carbon footprint dataset.

##### Is there a general correlation between high carbon footprint and price? 

We are waiting for more carbon footprint data before analysing this dependency.
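Once enough data is available, one possible way to quantify the dependency is a correlation coefficient between the two columns. The sketch below uses made-up numbers, not values from the dataset, and is only one of several reasonable approaches.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "carbon-footprint_100g": [50.0, 120.0, 300.0, 800.0, None],
    "price_per_100g": [0.4, 0.9, 1.1, 2.5, 1.0],
})

# Use only products for which both values are known.
pairs = toy.dropna(subset=["carbon-footprint_100g", "price_per_100g"])

# Pearson measures linear association; Spearman is Pearson applied to ranks.
pearson = pairs["carbon-footprint_100g"].corr(pairs["price_per_100g"])
spearman = pairs["carbon-footprint_100g"].rank().corr(pairs["price_per_100g"].rank())
print("Pearson: {0:.2f}, Spearman: {1:.2f}".format(pearson, spearman))
```

The rank-based coefficient is less sensitive to a few very expensive or very carbon-intensive products, which may matter given how small the subset with carbon footprint values currently is.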

In [21]:
from gensim.models import word2vec
import logging
from googletrans import Translator

In [11]:
from translate import Translator
translator = Translator(from_lang="german",to_lang="english")
# Translate every category from German to English and print both versions
for category in carbon_footprint_pd['category']:
    cat_translated = translator.translate(category)
    print(category, cat_translated)



Gemüsekonserven Tinned vegetables
Gemüsekonserven Tinned vegetables
Kokosmilch Coconut milk
Ceralien, Cornflakes Kiosks, cornflakes
Tortilla Chips Tortilla chips
Sonnenblumenkerne Sunflower seeds
Brot Bread
Brot Bread
Gemüsaufstriche &amp; -salate Gemüsaufstriche &amp;amp; salads
Brot Bread
Brot Bread
Senf &amp; Senfsaucen Mustard &amp;amp; mustard sauces
Kakaopulver ohne Zucker Cocoa without sugar
Kartoffeln Potatoes
Kartoffeln Potatoes
Kokosmilch Coconut milk
Sonnenblumenkerne Sunflower seeds
Brot Bread
Zucker Sugar
Rübenkraut, Sirup &amp; Melasse Turnip tops, syrup &amp;amp; molasses
Zucker Sugar
Brot Bread
Bonbons &amp; Lutscher Candy &amp;amp; lollipops
Kartoffelbeilagen &amp; Pommes Frites Potato side dishes &amp;amp; French fries
Verdickungs- &amp; Geliermittel Thickeners &amp;amp; gelling agents
Mohn Poppy seeds
Verdickungs- &amp; Geliermittel Thickeners &amp;amp; gelling agents
Kartoffelbeilagen &amp; Pommes Frites Potato side dishes &amp;amp; French fries
Schokoladetafeln Cho

KeyboardInterrupt: 
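The output above shows two things: many categories repeat (for example "Brot"), and some contain HTML entities such as `&amp;`. One possible refinement, sketched here under those observations, is to unescape each category and translate every distinct value only once. `translate_unique` and `fake_translate` are hypothetical names; in the notebook, `translator.translate` would be passed as `translate_fn`.

```python
import html


def translate_unique(categories, translate_fn):
    """Translate each distinct category once and return a lookup dict."""
    lookup = {}
    for raw in categories:
        # Undo HTML escaping such as "&amp;" before translating.
        clean = html.unescape(raw)
        if clean not in lookup:
            lookup[clean] = translate_fn(clean)
    return lookup


# Stand-in translator for illustration; it records how often it is called.
fake_translations = {"Brot": "Bread", "Senf & Senfsaucen": "Mustard & mustard sauces"}
calls = []


def fake_translate(text):
    calls.append(text)
    return fake_translations.get(text, text)


lookup = translate_unique(["Brot", "Brot", "Senf &amp; Senfsaucen"], fake_translate)
print(lookup, len(calls))
```

The lookup is keyed by the unescaped category, so the same unescaping has to be applied before mapping the translations back onto the `category` column.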

**Thanks for Reading !**