# Project 3
# Design an application for public health

## Problem statement

Call for projects to find innovative ideas for food-related applications.
   - The problem statement is to be set according to the targeted objectives

## Tasks

- Process the dataset to identify variables relevant to the analyses that follow.
- Automate this processing to avoid repeating the operations.
- Produce visualizations to better understand the data.
- Perform a univariate analysis of each variable of interest to summarize its behavior.
- Confirm or refute the hypotheses with a multivariate analysis.
- Run the appropriate statistical tests to check the significance of the results.
- Develop an application idea.
- Identify arguments for (or against) the feasibility of the application based on the Open Food Facts data.
- Write an exploration report and pitch your idea during the project defense.

## Skills assessed

- Communicate results using readable and relevant graphical representations
- Perform a univariate statistical analysis
- Perform cleaning operations on structured data
- Perform a multivariate statistical analysis

## Starting hypothesis for the application

Build an application that tells whether a food product scanned by a user contains allergens, or whether it is suitable for a user following a special diet (for example, people with diabetes), so that it can recommend alternative products.
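
As a rough illustration of the allergen check described above, here is a minimal sketch. The function name `check_product` and the user allergen list are hypothetical; the `en:mustard` tag format is the one found in the `allergens` column of the dataset. Missing allergen information is treated as "nothing found", which does not guarantee that the product is allergen-free.

```python
def check_product(product_allergens, user_allergens):
    """Return the user's allergens found in a product's allergen string.

    Hypothetical helper: `product_allergens` is a comma-separated tag string
    such as 'en:mustard,en:milk'; `user_allergens` is a list of plain names.
    """
    if not isinstance(product_allergens, str):
        return set()  # missing allergen information (NaN or None)
    # Drop the language prefix ('en:') and normalise the case of each tag.
    found = {tag.split(':')[-1].strip().lower()
             for tag in product_allergens.split(',')}
    return found & {name.lower() for name in user_allergens}


print(check_product('en:mustard,en:milk', ['Milk', 'Gluten']))
```

A non-empty result would trigger the search for alternative products.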


## Data cleaning

In [4]:
# Load the libraries

%matplotlib inline

import numpy as np
import pandas as pd

In [5]:
# Read the data file in chunks, with a predefined list of missing-value markers

naValues = ['nan','unknown']
chunksize = 100000
chunks = []

for chunk in pd.read_csv('en.openfoodfacts.org.products.csv',sep='\t',encoding="utf-8",na_values=naValues,
                         chunksize=chunksize,low_memory=False):
    chunks.append(chunk)

data00 = pd.concat(chunks,axis=0)

data00.to_csv('data_00.csv')

# Statistics on the original file

statList = ['Size','Rows','Columns','NaN count','NaN percentage']

statsValues = pd.DataFrame(columns=statList)

# Drop any 'Unnamed' columns (e.g. a saved index column)
data00 = data00.loc[:, ~data00.columns.str.contains('^Unnamed')]

nbRows = data00.shape[0]
nbCols = data00.shape[1]
nbNaN = data00.isna().sum().sum()
pNaN = round(100.0 * nbNaN/data00.size,2)

statsValues.loc[0] = [data00.size,nbRows,nbCols,nbNaN,pNaN]

statsValues

Unnamed: 0,Size,Rows,Columns,NaN count,NaN percentage
0,250140009.0,1381989.0,181.0,198021674.0,79.16
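
Since the same statistics are computed several times in this notebook, they could be wrapped in a small helper. The sketch below is one possible way to do it; the name `missing_stats` and the column labels are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_stats(df):
    """Summarise the size and missing values of a DataFrame in a one-row table."""
    n_nan = int(df.isna().sum().sum())
    return pd.DataFrame([{
        'Size': df.size,
        'Rows': df.shape[0],
        'Columns': df.shape[1],
        'NaN count': n_nan,
        'NaN percentage': round(100.0 * n_nan / df.size, 2),
    }])


# Toy data: 2 rows x 2 columns, 3 missing values out of 4 cells
toy = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})
stats = missing_stats(toy)
print(stats)
```

Calling `missing_stats(data00)` would then replace the repeated block of assignments.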


In [9]:
pd.options.display.max_rows = 200
pd.options.display.max_columns = 181

data00.head()

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,packaging,packaging_tags,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,traces,traces_tags,traces_en,serving_size,serving_quantity,no_nutriments,additives_n,additives,additives_tags,additives_en,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutriscore_score,nutriscore_grade,nova_group,pnns_groups_1,pnns_groups_2,states,states_tags,states_en,brand_owner,main_category,main_category_en,image_url,image_small_url,image_ingredients_url,image_ingredients_small_url,image_nutrition_url,image_nutrition_small_url,energy-kj_100g,energy-kcal_100g,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,-butyric-acid_100g,-caproic-acid_100g,-caprylic-acid_100g,-capric-acid_100g,-lauric-acid_100g,-myristic-acid_100g,-palmitic-acid_100g,-stearic-acid_100g,-arachidic-acid_100g,-behenic-acid_100g,-lignoceric-acid_100g,-cerotic-acid_100g,-montanic-acid_100g,-melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,-alpha-linolenic-acid_100g,-eicosapentaenoic-acid_100g,-docosahexaenoic-acid_100g,omega-6-fat_100g,-linoleic-acid_100g,-arachidonic-acid_100g,-gamma-linolenic-acid_100g,-dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,-oleic-acid_100g,-elaidic-acid_100g,-gondoic-acid_100g,-mead-acid_100g,-erucic-acid_100g,-nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,-sucrose_100g,-glucose_100g,-fructose_100g,-lactose_100g,-maltose_100g,-maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,-soluble-fiber_100g,-insoluble-fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-dried_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,carbon-footprint-from-meat-or-fish_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
0,17,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1529059080,2018-06-15T10:38:00Z,1561463718,2019-06-25T11:55:18Z,Vitória crackers,,,,,,,,,,,,,,,,,,,,,,,,France,en:france,France,,,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,1569.0,1569.0,,7.0,3.08,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,70.1,15.0,,,,,,,,,,,,7.8,,,,1.4,0.56,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,31,http://world-en.openfoodfacts.org/product/0000...,isagoofy,1539464774,2018-10-13T21:06:14Z,1539464817,2018-10-13T21:06:57Z,Cacao,,130 g,,,,,,,,,,,,,,,,,,,,,,France,en:france,France,,,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,3327986,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1574175736,2019-11-19T15:02:16Z,1574175737,2019-11-19T15:02:17Z,Filetes de pollo empanado,,,,,,,,,,,,,,,,,,,,,,,,en:es,en:spain,Spain,,,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,100,http://world-en.openfoodfacts.org/product/0000...,del51,1444572561,2015-10-11T14:09:21Z,1444659212,2015-10-12T14:13:32Z,moutarde au moût de raisin,,100g,,,courte paille,courte-paille,"Epicerie, Condiments, Sauces, Moutardes","en:groceries,en:condiments,en:sauces,en:mustards","Groceries,Condiments,Sauces,Mustards",,,,,Delois france,fr:delois-france,fr:delois-france,,,,,,,courte paille,France,en:france,France,eau graines de téguments de moutarde vinaigre ...,en:mustard,,,,,,,,0.0,,,,0.0,,,0.0,,,18.0,d,,Fat and sauces,Dressings and sauces,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,en:mustards,Mustards,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,,936.0,,936.0,,8.2,2.2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,29.0,22.0,,,,,,,,,0.0,,,5.1,,,,4.6,1.811024,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,18.0,,,,,,,,
4,1111111111,http://world-en.openfoodfacts.org/product/0000...,openfoodfacts-contributors,1560020173,2019-06-08T18:56:13Z,1560020173,2019-06-08T18:56:13Z,Sfiudwx,,dgesc,,,Watt,watt,Xsf,fr:xsf,fr:xsf,,,,,,,,,,,,,,,en:France,en:france,France,,,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,fr:xsf,fr:xsf,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [6]:
# Read the data file in chunks, with a predefined list of missing-value markers,
# keeping a 10% sample of each chunk.
# Note: `chunks` is not reset here, so it still holds the full dataset loaded earlier;
# data0 therefore contains all 1,381,989 rows plus the 10% sample (1,520,188 rows in total).

for chunk in pd.read_csv('en.openfoodfacts.org.products.csv',sep='\t',encoding="utf-8",na_values=naValues,
                         chunksize=chunksize,low_memory=False):
    chunks.append(chunk.sample(frac=0.1))

data0 = pd.concat(chunks,axis=0)
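
If the intent is to keep only a 10% sample, the chunk list has to start empty. The sketch below shows one way to do this; the function name `sample_in_chunks`, the `seed` parameter and the in-memory toy file are assumptions made for illustration, not part of the original notebook.

```python
import io

import pandas as pd


def sample_in_chunks(source, frac=0.1, chunksize=100000, seed=0):
    """Read a tab-separated file in chunks and keep a fraction of each chunk."""
    sampled = []  # fresh list, so chunks from an earlier read are not carried over
    reader = pd.read_csv(source, sep='\t', na_values=['nan', 'unknown'],
                         chunksize=chunksize, low_memory=False)
    for chunk in reader:
        # random_state makes the sample reproducible between runs
        sampled.append(chunk.sample(frac=frac, random_state=seed))
    return pd.concat(sampled, axis=0)


# Toy stand-in for the real file: 100 rows, read 20 rows at a time
toy_file = io.StringIO('code\tname\n' + ''.join(f'{i}\tp{i}\n' for i in range(100)))
sample = sample_in_chunks(toy_file, frac=0.1, chunksize=20)
print(len(sample))
```

On the real file, the call would be `sample_in_chunks('en.openfoodfacts.org.products.csv')`.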

In [10]:
# Rename the columns: strip leading and trailing hyphens

data0.columns = data0.columns.str.strip('-')
    
data0.to_csv('data_01.csv',sep='\t',encoding="utf-8")


pd.options.display.max_rows = 200
pd.options.display.max_columns = 181

data0.head()

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,packaging,packaging_tags,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,traces,traces_tags,traces_en,serving_size,serving_quantity,no_nutriments,additives_n,additives,additives_tags,additives_en,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutriscore_score,nutriscore_grade,nova_group,pnns_groups_1,pnns_groups_2,states,states_tags,states_en,brand_owner,main_category,main_category_en,image_url,image_small_url,image_ingredients_url,image_ingredients_small_url,image_nutrition_url,image_nutrition_small_url,energy-kj_100g,energy-kcal_100g,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,butyric-acid_100g,caproic-acid_100g,caprylic-acid_100g,capric-acid_100g,lauric-acid_100g,myristic-acid_100g,palmitic-acid_100g,stearic-acid_100g,arachidic-acid_100g,behenic-acid_100g,lignoceric-acid_100g,cerotic-acid_100g,montanic-acid_100g,melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,alpha-linolenic-acid_100g,eicosapentaenoic-acid_100g,docosahexaenoic-acid_100g,omega-6-fat_100g,linoleic-acid_100g,arachidonic-acid_100g,gamma-linolenic-acid_100g,dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,oleic-acid_100g,elaidic-acid_100g,gondoic-acid_100g,mead-acid_100g,erucic-acid_100g,nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,sucrose_100g,glucose_100g,fructose_100g,lactose_100g,maltose_100g,maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,soluble-fiber_100g,insoluble-fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-dried_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,carbon-footprint-from-meat-or-fish_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
0,17,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1529059080,2018-06-15T10:38:00Z,1561463718,2019-06-25T11:55:18Z,Vitória crackers,,,,,,,,,,,,,,,,,,,,,,,,France,en:france,France,,,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,1569.0,1569.0,,7.0,3.08,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,70.1,15.0,,,,,,,,,,,,7.8,,,,1.4,0.56,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,31,http://world-en.openfoodfacts.org/product/0000...,isagoofy,1539464774,2018-10-13T21:06:14Z,1539464817,2018-10-13T21:06:57Z,Cacao,,130 g,,,,,,,,,,,,,,,,,,,,,,France,en:france,France,,,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,3327986,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1574175736,2019-11-19T15:02:16Z,1574175737,2019-11-19T15:02:17Z,Filetes de pollo empanado,,,,,,,,,,,,,,,,,,,,,,,,en:es,en:spain,Spain,,,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,100,http://world-en.openfoodfacts.org/product/0000...,del51,1444572561,2015-10-11T14:09:21Z,1444659212,2015-10-12T14:13:32Z,moutarde au moût de raisin,,100g,,,courte paille,courte-paille,"Epicerie, Condiments, Sauces, Moutardes","en:groceries,en:condiments,en:sauces,en:mustards","Groceries,Condiments,Sauces,Mustards",,,,,Delois france,fr:delois-france,fr:delois-france,,,,,,,courte paille,France,en:france,France,eau graines de téguments de moutarde vinaigre ...,en:mustard,,,,,,,,0.0,,,,0.0,,,0.0,,,18.0,d,,Fat and sauces,Dressings and sauces,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,en:mustards,Mustards,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,,936.0,,936.0,,8.2,2.2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,29.0,22.0,,,,,,,,,0.0,,,5.1,,,,4.6,1.811024,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,18.0,,,,,,,,
4,1111111111,http://world-en.openfoodfacts.org/product/0000...,openfoodfacts-contributors,1560020173,2019-06-08T18:56:13Z,1560020173,2019-06-08T18:56:13Z,Sfiudwx,,dgesc,,,Watt,watt,Xsf,fr:xsf,fr:xsf,,,,,,,,,,,,,,,en:France,en:france,France,,,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,fr:xsf,fr:xsf,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


### Description of the data file

In [22]:
statList = ['Size','Rows','Columns','NaN count','NaN percentage']

statsValues = pd.DataFrame(columns=statList)

# Drop any 'Unnamed' columns (e.g. a saved index column)
data0 = data0.loc[:, ~data0.columns.str.contains('^Unnamed')]

nbRows = data0.shape[0]
nbCols = data0.shape[1]
nbNaN = data0.isna().sum().sum()
pNaN = round(100.0 * nbNaN/data0.size,2)

statsValues.loc[0] = [data0.size,nbRows,nbCols,nbNaN,pNaN]

statsValues

Unnamed: 0,Size,Rows,Columns,NaN count,NaN percentage
0,275154028.0,1520188.0,181.0,217825645.0,79.16


### List of relevant columns
This is a list of columns selected for their relevance to the starting hypothesis: creator, created_t, last_modified_t, product_name, labels_en, countries_en, allergens, additives_n, additives_tags, nutriscore_score, nutriscore_grade, pnns_groups_1, pnns_groups_2, main_category_en, and the nutritional values per 100 g of product.

This list will evolve throughout the analysis as the working hypotheses change.

In [23]:
columns100g = [col for col in data0.columns if '100g' in col]
#print(columns100g)


list_columns_1 =  ['creator','created_t','last_modified_t','product_name','labels_en',
                 'countries_en','allergens','additives_n','additives_tags','nutriscore_score',
                 'nutriscore_grade','pnns_groups_1','pnns_groups_2','main_category_en']
list_columns = list_columns_1 + columns100g

data1 = data0.loc[:,list_columns]

In [24]:
data1.shape

(1520188, 125)
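
One way to support the choice of columns is to look at how well each one is filled. The sketch below computes a fill rate per column on toy data; the helper name `fill_rate` and the toy values are hypothetical, and the notebook itself does not use this criterion.

```python
import numpy as np
import pandas as pd


def fill_rate(df):
    """Percentage of non-missing values per column, highest first."""
    return (100.0 * df.notna().mean()).round(2).sort_values(ascending=False)


# Toy data reusing three column names from the dataset
toy = pd.DataFrame({
    'product_name': ['a', 'b', 'c', 'd'],
    'allergens': ['en:milk', None, None, None],
    'fat_100g': [1.0, 2.0, np.nan, np.nan],
})
rates = fill_rate(toy)
print(rates)
```

Applied to `data1`, this would show which of the selected columns carry enough values to be usable.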

### Cleaning the columns that contain strings

This step rewrites the columns that contain strings. It removes unexpected characters that could complicate the lookup of a product entered by a user.

In [25]:
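def clean_columns(value):
    """Clean a comma-separated string cell; return None for non-string input (e.g. NaN)."""
    newList = None

    # Ligatures and special letters replaced by ASCII equivalents
    unicodeStr = {'Æ':'AE','Ð':'D','Ø':'O','Þ':'TH','ß':'ss','æ':'ae','ð':'d','ø':'o',
                  'þ':'th','Œ':'OE','œ':'oe','ƒ':'f'}

    if isinstance(value,str):
        for key,replacement in unicodeStr.items():
            value = value.replace(key,replacement)

        # Remove stray commas, HTML quote entities and the '+' and '#' characters
        value = value.strip(',').strip().replace('&quot;','').replace('+','').replace('#','')
        items = value.split(sep=',')
        # Items containing ':' are assumed to start with a two-letter language prefix such as 'en:'
        items = [cnt[3:].strip().capitalize() if ':' in cnt else cnt.strip().capitalize() for cnt in items]
        # Join the items back into a single string, without quotes or hyphens
        newList = str(items).strip("[]").replace("\'","").replace('\"',"").replace('-'," ")

    return newList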

In [26]:
# Apply the same cleaning to every text column
text_columns = ['product_name','labels_en','countries_en','allergens',
                'additives_tags','pnns_groups_1','pnns_groups_2','main_category_en']

for col in text_columns:
    data1.loc[:,col] = [clean_columns(cell) for cell in data1[col]]
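
`clean_columns` drops the first three characters of any item that contains a colon, which also truncates items where the colon is not part of a language prefix. A stricter variant, sketched below, removes the prefix only when the item starts with two lowercase letters followed by a colon. The helper name `strip_language_prefix` and the second example string are made up for illustration.

```python
import re

# Two lowercase letters followed by ':' at the start of the tag, e.g. 'en:' or 'fr:'
LANG_PREFIX = re.compile(r'^[a-z]{2}:')


def strip_language_prefix(tag):
    """Remove a leading language prefix from a tag, if there is one."""
    return LANG_PREFIX.sub('', tag.strip())


print(strip_language_prefix('en:mustard'))       # prefix removed
print(strip_language_prefix('Ratio 2:1 syrup'))  # no prefix, left untouched
```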

### Removing duplicates based on product names

In [27]:
# Drop rows without a product name, then keep the first row for each name
data1 = data1[data1['product_name'].notna()]
data1 = data1.sort_values('product_name')
data1 = data1.drop_duplicates(subset='product_name',keep='first')

data1.head()

Unnamed: 0,creator,created_t,last_modified_t,product_name,labels_en,countries_en,allergens,additives_n,additives_tags,nutriscore_score,nutriscore_grade,pnns_groups_1,pnns_groups_2,main_category_en,energy-kj_100g,energy-kcal_100g,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,butyric-acid_100g,caproic-acid_100g,caprylic-acid_100g,capric-acid_100g,lauric-acid_100g,myristic-acid_100g,palmitic-acid_100g,stearic-acid_100g,arachidic-acid_100g,behenic-acid_100g,lignoceric-acid_100g,cerotic-acid_100g,montanic-acid_100g,melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,alpha-linolenic-acid_100g,eicosapentaenoic-acid_100g,docosahexaenoic-acid_100g,omega-6-fat_100g,linoleic-acid_100g,arachidonic-acid_100g,gamma-linolenic-acid_100g,dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,oleic-acid_100g,elaidic-acid_100g,gondoic-acid_100g,mead-acid_100g,erucic-acid_100g,nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,sucrose_100g,glucose_100g,fructose_100g,lactose_100g,maltose_100g,maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,soluble-fiber_100g,insoluble-fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-dried_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,carbon-footprint-from-meat-or-fish_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
1159041,gaetanorusso,1511464316,1511786161,,,Italy,,,,,,,,Beverages,0.0,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
414772,openfoodfacts-contributors,1570900068,1570988951,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
565250,kiliweb,1556851883,1556851883,n / a ferments natali pour yaourt au b...,,France,,,,,,,,,,1335.0,1335.0,,1.3,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,46.0,44.0,,,,,,,,,,,,31.0,,,,1.2,0.48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1048868,kiliweb,1564505827,1564505829,10 salsichas frankfurt,,France,,,,,,,,,,590.0,590.0,,10.3,3.8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.5,0.0,,,,,,,,,,,,8.7,,,,1.7,0.68,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
995037,kiliweb,1521144793,1564036884,apple crisps,,France,,,,,,,,,,1540.0,1540.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,89.4,73.7,,,,,,,,,,,,0.0,,,,0.8,0.32,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
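
Keeping the first row of each group of duplicates may retain a mostly empty entry. As an alternative, the sketch below keeps the row with the most filled cells for each product name. The function name `drop_duplicates_keep_fullest` and the toy values are hypothetical; this is a variation on the cell above, not the method used in the notebook.

```python
import numpy as np
import pandas as pd


def drop_duplicates_keep_fullest(df, key='product_name'):
    """For each value of `key`, keep the row with the most non-missing cells."""
    filled = df.notna().sum(axis=1).to_numpy()
    # Positions sorted from the fullest row to the emptiest one (stable for ties)
    order = np.argsort(-filled, kind='stable')
    return df.iloc[order].drop_duplicates(subset=key, keep='first').sort_values(key)


toy = pd.DataFrame({
    'product_name': ['Cacao', 'Cacao', 'Apple crisps'],
    'fat_100g': [np.nan, 7.0, 0.0],
    'sugars_100g': [np.nan, 15.0, 73.7],
})
result = drop_duplicates_keep_fullest(toy)
print(result)
```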


### Cleaning the columns that contain nutritional values

I now process the columns that contain numeric values. Missing values in the 'additives_n' column are interpreted as meaning that the product contains no additives; they are replaced with 0.0.

In [28]:
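# Assign the result back: calling fillna(inplace=True) on a selected column may not update data1 in recent pandas versions
data1['additives_n'] = data1['additives_n'].fillna(0.0)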

For the nutritional values, all negative values are removed, as are quantities above 100 per 100 g of product, except in the columns holding energy values. Finally, I keep only the rows that contain at least one nutritional value.

In [29]:
for col in columns100g:

    # Note: no '_100g' column name contains 'nutriscore' (the score columns are
    # 'nutrition-score-fr_100g' and 'nutrition-score-uk_100g'), so this test is
    # always true and negative scores are set to NaN as well.
    if 'nutriscore' not in col:
        data1.loc[data1[col] < 0.0,col] = np.nan

    # Energy columns are not bounded by 100 per 100 g
    if 'energy' not in col:
        data1.loc[data1[col] > 100.0,col] = np.nan

# Keep only the rows with at least one nutritional value
data1 = data1[data1[columns100g].notna().any(axis=1)]

data1.head()

Unnamed: 0,creator,created_t,last_modified_t,product_name,labels_en,countries_en,allergens,additives_n,additives_tags,nutriscore_score,nutriscore_grade,pnns_groups_1,pnns_groups_2,main_category_en,energy-kj_100g,energy-kcal_100g,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,butyric-acid_100g,caproic-acid_100g,caprylic-acid_100g,capric-acid_100g,lauric-acid_100g,myristic-acid_100g,palmitic-acid_100g,stearic-acid_100g,arachidic-acid_100g,behenic-acid_100g,lignoceric-acid_100g,cerotic-acid_100g,montanic-acid_100g,melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,alpha-linolenic-acid_100g,eicosapentaenoic-acid_100g,docosahexaenoic-acid_100g,omega-6-fat_100g,linoleic-acid_100g,arachidonic-acid_100g,gamma-linolenic-acid_100g,dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,oleic-acid_100g,elaidic-acid_100g,gondoic-acid_100g,mead-acid_100g,erucic-acid_100g,nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,sucrose_100g,glucose_100g,fructose_100g,lactose_100g,maltose_100g,maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,soluble-fiber_100g,insoluble-fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-dried_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,carbon-footprint-from-meat-or-fish_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
1159041,gaetanorusso,1511464316,1511786161,,,Italy,,0.0,,,,,,Beverages,0.0,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
565250,kiliweb,1556851883,1556851883,n / a ferments natali pour yaourt au b...,,France,,0.0,,,,,,,,1335.0,1335.0,,1.3,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,46.0,44.0,,,,,,,,,,,,31.0,,,,1.2,0.48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1048868,kiliweb,1564505827,1564505829,10 salsichas frankfurt,,France,,0.0,,,,,,,,590.0,590.0,,10.3,3.8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.5,0.0,,,,,,,,,,,,8.7,,,,1.7,0.68,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
995037,kiliweb,1521144793,1564036884,apple crisps,,France,,0.0,,,,,,,,1540.0,1540.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,89.4,73.7,,,,,,,,,,,,0.0,,,,0.8,0.32,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1065671,kiliweb,1539432615,1586787450,biotechusa zero bar 2050 chocolate hazelnut....,"Gluten free, No lactose",France,,0.0,,,,,,,,745.0,745.0,,7.5,2.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,6.0,0.1,,,,,,,,,,,,20.0,,,,0.34,0.136,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
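
The range rules described above can be illustrated on a small table. The sketch below assumes that the score columns (those whose name contains `nutrition-score`) are allowed to hold negative values, which is an assumption about the intent rather than what the cell above does. The function name `clean_100g` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def clean_100g(df, score_marker='nutrition-score', energy_marker='energy'):
    """Blank out-of-range '_100g' values and drop rows left without any value."""
    out = df.copy()
    cols = [c for c in out.columns if '100g' in c]
    for col in cols:
        if score_marker not in col:
            out[col] = out[col].mask(out[col] < 0)    # negative quantity -> NaN
        if energy_marker not in col:
            out[col] = out[col].mask(out[col] > 100)  # more than 100 per 100 g -> NaN
    return out[out[cols].notna().any(axis=1)]


toy = pd.DataFrame({
    'fat_100g': [8.2, -1.0, 250.0],
    'energy_100g': [936.0, 1569.0, np.nan],
    'nutrition-score-fr_100g': [18.0, -4.0, np.nan],
})
cleaned = clean_100g(toy)
print(cleaned)
```

The third toy row is dropped because its only value is out of range, while the negative score in the second row is kept.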


In [30]:
data1.to_csv('data_02.csv',sep='\t',encoding="utf-8")

In [31]:
data1['main_category_en'].value_counts()

Snacks                    21654
Confectioneries           10358
Sauces                     9694
Biscuits                   9187
Cheeses                    8203
                          ...  
Gingerbread toasts            1
Pate de figatelli             1
Kangourous                    1
Lardons de dinde fumes        1
Whey protein                  1
Name: main_category_en, Length: 17689, dtype: int64

In [32]:
data1['pnns_groups_1'].unique()

array([None, 'Milk and dairy products', 'Fruits and vegetables',
       'Beverages', 'Composite foods', 'Salty snacks',
       'Cereals and potatoes', 'Sugary snacks', 'Fish meat eggs',
       'Fat and sauces'], dtype=object)

In [33]:
data1['pnns_groups_1'].value_counts()

Sugary snacks              72350
Milk and dairy products    37833
Cereals and potatoes       37127
Fish meat eggs             32951
Beverages                  32058
Fat and sauces             28909
Composite foods            28438
Fruits and vegetables      21084
Salty snacks               11831
Name: pnns_groups_1, dtype: int64

### Keeping only the products sold in France

In [34]:
data1 = data1[data1['pnns_groups_1'].notna()]
data1 = data1[data1['nutriscore_score'].notna()]

data2 = data1.loc[data1['countries_en'].str.contains('France',na=False)]

#data2.to_csv('data_03.csv',sep='\t',encoding="utf-8")

In [35]:
print(data2.shape)

(115652, 125)
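
The filter above keeps every product whose country list mentions France, including products that are also sold elsewhere. If a stricter reading is wanted (products sold in France only), a helper along the lines below could be used. The name `sold_in` and its `only` flag are hypothetical, and the sketch assumes country names are separated by commas.

```python
def sold_in(countries, country='France', only=False):
    """Tell whether a comma-separated country string lists `country`.

    With only=True, the country must be the single entry of the list.
    """
    if not isinstance(countries, str):
        return False  # missing country information
    names = [name.strip() for name in countries.split(',')]
    return names == [country] if only else country in names


print(sold_in('France, Spain'))             # listed among several countries
print(sold_in('France, Spain', only=True))  # not sold in France alone
```

It could be applied with `data1['countries_en'].map(sold_in)` to build a boolean mask.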


In [36]:
data2['pnns_groups_1'].value_counts()

Sugary snacks              25993
Fish meat eggs             15636
Milk and dairy products    15012
Composite foods            13444
Cereals and potatoes       11818
Beverages                  10529
Fruits and vegetables       8989
Fat and sauces              7556
Salty snacks                6675
Name: pnns_groups_1, dtype: int64

In [37]:
statList = ['Size','Rows','Columns','NaN count','NaN percentage']

statsValues = pd.DataFrame(columns=statList)

# Drop any 'Unnamed' columns (e.g. a saved index column)
data2 = data2.loc[:, ~data2.columns.str.contains('^Unnamed')]

nbRows = data2.shape[0]
nbCols = data2.shape[1]
nbNaN = data2.isna().sum().sum()
pNaN = round(100.0 * nbNaN/data2.size,2)

statsValues.loc[0] = [data2.size,nbRows,nbCols,nbNaN,pNaN]

statsValues

Unnamed: 0,Size,Rows,Columns,NaN count,NaN percentage
0,14456500.0,115652.0,125.0,11796876.0,81.6
