---
# Data Cleaning
---
This notebook covers the data cleaning stage of the project.

1. The type of each column and its range of values or categories will be analyzed.
2. Columns that provide no valuable information will be discarded.
3. Missing values will be found and replaced.
4. Categories that are highly correlated with another column will be replaced, in order not to pass the same information twice to the classification algorithm.
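
As a rough illustration of how step 4 might be approached for numeric columns, the sketch below flags pairs whose absolute Pearson correlation exceeds a cut-off. The toy frame, the column names `a`, `b`, `c`, and the 0.95 threshold are all assumptions made for the example, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'b' is exactly 2 * 'a', 'c' is only weakly related.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})

threshold = 0.95  # assumed cut-off, to be tuned for the real data

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair is reported once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

correlated_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold
]
print(correlated_pairs)
```

One column from each reported pair could then be dropped or replaced before training.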


## Columns available in the dataset and their descriptions

| Column name (French) | Column name (English) | Units / value codes | Needed for further analysis |
|----------|----------|----------|----------|
| Code AGB | AGB Code | None | No |
| Code CIQUAL | CIQUAL Code | None | No |
| Groupe d'aliment | Food Group | None | Yes |
| Sous-groupe d'aliment | Food Sub-group | None | Yes |
| Nom du Produit en Français | Product Name in French | None | No (keeping LCI Name) |
| LCI Name | LCI Name | None | Yes |
| Code saison | Season Code | 0: out of season<br>1: in season<br>2: mixed consumption | Yes |
| Code avion | Airplane Code | 1: by airplane | Yes |
| Livraison (temperature et distance) | Delivery (temperature and distance) | None | Yes |
| Approche emballage | Packaging Approach | None | Yes |
| Préparation (cuisson) | Preparation (cooking) | None | Yes |
| DQR - Note de qualité de la donnée | DQR - Data Quality Rating | 1 (excellent) to 5 (very low) | No |
| Score unique EF 3.1 | EF 3.1 Single Score | mPt/kg of product | Yes |
| Changement climatique | Climate Change | kg CO2 eq/kg of product | Yes |
| Appauvrissement de la couche d'ozone | Ozone Layer Depletion | kg CFC-11 eq/kg of product | Yes |
| Rayonnements ionisants | Ionizing Radiation | kBq U-235 eq/kg of product | Yes |
| Formation photochimique d'ozone | Photochemical Ozone Formation | kg NMVOC eq/kg of product | Yes |
| Particules fines | Fine Particles | disease inc./kg of product | Yes |
| Effets toxicologiques sur la santé humaine : substances non-cancérogènes | Toxicological Effects on Human Health: Non-carcinogenic Substances | CTUh/kg of product | Yes |
| Effets toxicologiques sur la santé humaine : substances cancérogènes | Toxicological Effects on Human Health: Carcinogenic Substances | CTUh/kg of product | Yes |
| Acidification terrestre et eaux douces | Terrestrial and Freshwater Acidification | mol H+ eq/kg of product | Yes |
| Eutrophisation eaux douces | Freshwater Eutrophication | kg P eq/kg of product | Yes |
| Eutrophisation marine | Marine Eutrophication | kg N eq/kg of product | Yes |
| Eutrophisation terrestre | Terrestrial Eutrophication | mol N eq/kg of product | Yes |
| Écotoxicité pour écosystèmes aquatiques d'eau douce | Ecotoxicity for Freshwater Aquatic Ecosystems | CTUe/kg of product | Yes |
| Utilisation du sol | Land Use | Pt/kg of product | Yes |
| Épuisement des ressources eau | Water Resource Depletion | m3 depriv./kg of product | Yes |
| Épuisement des ressources énergétiques | Energy Resource Depletion | MJ/kg of product | Yes |
| Épuisement des ressources minéraux | Mineral Resource Depletion | kg Sb eq/kg of product | Yes |
| Changement climatique - émissions biogéniques | Climate Change - Biogenic Emissions | kg CO2 eq/kg of product | Yes |
| Changement climatique - émissions fossiles | Climate Change - Fossil Emissions | kg CO2 eq/kg of product | Yes |
| Changement climatique - émissions liées au changement d'affectation des sols | Climate Change - Emissions from Land Use Change | kg CO2 eq/kg of product | Yes |

---

### Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np

### Read data file

In [2]:
df = pd.read_excel('../data/AGRIBALYSE3.2_Tableur produits alimentaires_VF_20_11_24.xlsx', sheet_name="Synthese", header=1)  
df.head()

Unnamed: 0,Code\nAGB,Code\nCIQUAL,Groupe d'aliment,Sous-groupe d'aliment,Nom du Produit en Français,LCI Name,code saison (0 : hors saison ; 1 : de saison ; 2 : mix de consommation FR),code avion (1 : par avion),Livraison,Approche emballage,...,Eutrophisation marine,Eutrophisation terrestre,Écotoxicité pour écosystèmes aquatiques d'eau douce,Utilisation du sol,Épuisement des ressources eau,Épuisement des ressources énergétiques,Épuisement des ressources minéraux,Changement climatique - émissions biogéniques,Changement climatique - émissions fossiles,Changement climatique - émissions liées au changement d'affectation des sols
0,Code\nAGB,Code\nCIQUAL,Groupe d'aliment,Sous-groupe d'aliment,Nom du Produit en Français,LCI Name,code saison (0 : hors saison ; 1 : de saison ;...,code avion (1 : par avion),Livraison,Approche emballage,...,kg N eq/kg de produit,mol N eq/kg de produit,CTUe/kg de produit,Pt/kg de produit,m3 depriv./kg de produit,MJ/kg de produit,kg Sb eq/kg de produit,kg CO2 eq/kg de produit,kg CO2 eq/kg de produit,kg CO2 eq/kg de produit
1,11172,11172,aides culinaires et ingrédients divers,aides culinaires,"Court-bouillon pour poissons, déshydraté","Aromatic stock cube, for fish, dehydrated",2,0,Ambiant (long),PACK PROXY,...,0.026783,0.137099,70.183757,106.3095,3.380742,700.15958,0.000051,0.103694,7.459628,0.021197
2,25525,25525,aides culinaires et ingrédients divers,aides culinaires,"Pizza, sauce garniture pour",Topping sauce for pizza,2,0,Ambiant (long),PACK PROXY,...,0.004162,0.030263,11.027442,67.673943,2.468103,24.405351,0.000006,0.033626,1.015114,-0.108325
3,11214,11214,aides culinaires et ingrédients divers,aides culinaires,"Préparation culinaire à base de soja, type ""cr...","Soy ""cream"" preparation",2,0,Ambiant (long),PACK PROXY,...,0.007233,0.024434,30.835753,116.49228,0.422468,22.429809,0.000004,0.02518,0.964544,0.184348
4,11084,11084,aides culinaires et ingrédients divers,algues,"Agar (algue), cru","Seaweed, agar, raw",2,0,Ambiant (long),PACK PROXY,...,0.015034,0.143648,57.86752,26.718351,4.833158,395.94639,0.000079,0.040063,11.740311,0.006545


### Remove first row
The first row repeats the column headers and the unit labels instead of holding product data, so it is dropped.

In [3]:
df = df.drop(index=[0])
df

Unnamed: 0,Code\nAGB,Code\nCIQUAL,Groupe d'aliment,Sous-groupe d'aliment,Nom du Produit en Français,LCI Name,code saison (0 : hors saison ; 1 : de saison ; 2 : mix de consommation FR),code avion (1 : par avion),Livraison,Approche emballage,...,Eutrophisation marine,Eutrophisation terrestre,Écotoxicité pour écosystèmes aquatiques d'eau douce,Utilisation du sol,Épuisement des ressources eau,Épuisement des ressources énergétiques,Épuisement des ressources minéraux,Changement climatique - émissions biogéniques,Changement climatique - émissions fossiles,Changement climatique - émissions liées au changement d'affectation des sols
1,11172,11172,aides culinaires et ingrédients divers,aides culinaires,"Court-bouillon pour poissons, déshydraté","Aromatic stock cube, for fish, dehydrated",2,0,Ambiant (long),PACK PROXY,...,0.026783,0.137099,70.183757,106.3095,3.380742,700.15958,0.000051,0.103694,7.459628,0.021197
2,25525,25525,aides culinaires et ingrédients divers,aides culinaires,"Pizza, sauce garniture pour",Topping sauce for pizza,2,0,Ambiant (long),PACK PROXY,...,0.004162,0.030263,11.027442,67.673943,2.468103,24.405351,0.000006,0.033626,1.015114,-0.108325
3,11214,11214,aides culinaires et ingrédients divers,aides culinaires,"Préparation culinaire à base de soja, type ""cr...","Soy ""cream"" preparation",2,0,Ambiant (long),PACK PROXY,...,0.007233,0.024434,30.835753,116.49228,0.422468,22.429809,0.000004,0.02518,0.964544,0.184348
4,11084,11084,aides culinaires et ingrédients divers,algues,"Agar (algue), cru","Seaweed, agar, raw",2,0,Ambiant (long),PACK PROXY,...,0.015034,0.143648,57.86752,26.718351,4.833158,395.94639,0.000079,0.040063,11.740311,0.006545
5,20995,20995,aides culinaires et ingrédients divers,algues,"Ao-nori (Enteromorpha sp.), séchée ou déshydratée","Sea lettuce (Enteromorpha sp.), dried or dehyd...",2,0,Ambiant (long),PACK PROXY,...,0.015034,0.143648,57.86752,26.718351,4.833158,395.94639,0.000079,0.040063,11.740311,0.006545
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2454,6581,6581,"viandes, œufs, poissons",viandes cuites,"Veau, jarret, braisé ou bouilli","Veal, knuckle or shank, braised or boiled",2,0,Glacé,PACK AGB,...,0.091173,1.457487,391.69164,1336.417,4.021206,149.24438,0.000057,15.861507,11.205989,2.617929
2455,6523,6523,"viandes, œufs, poissons",viandes cuites,"Veau, noix, grillée/poêlée","Veal, tenderloin, grilled/pan-fried",2,0,Glacé,PACK AGB,...,0.091187,1.457302,392.31917,1339.1618,3.972692,141.89131,0.000056,15.821821,10.946006,2.621672
2456,6524,6524,"viandes, œufs, poissons",viandes cuites,"Veau, noix, rôtie","Veal, tenderloin, roasted",2,0,Glacé,PACK AGB,...,0.091077,1.457291,392.24462,1336.1372,3.991295,149.67724,0.000059,15.82174,10.910768,2.617884
2457,6551,6551,"viandes, œufs, poissons",viandes cuites,"Veau, rôti, cuit","Veal, roast, cooked",2,0,Glacé,PACK AGB,...,0.113935,1.821645,490.59667,1670.0795,4.984198,184.17966,0.000072,19.864699,13.630845,3.272343
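
The output above shows that the index now starts at 1. If a zero-based index is preferred, the drop can be followed by `reset_index(drop=True)`. The sketch below shows this on a small, hypothetical frame whose first row holds unit labels; the values are made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values): row 0 holds labels/units, not product data.
raw = pd.DataFrame({
    'LCI Name': ['LCI Name', 'Topping sauce for pizza', 'Seaweed, agar, raw'],
    'Changement climatique': ['kg CO2 eq/kg de produit', 0.94, 11.79],
})

# Drop the units row by label, then renumber the remaining rows from 0.
clean = raw.drop(index=[0]).reset_index(drop=True)
print(clean)
```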


---
### Removing unnecessary columns 
---

In [4]:
df.columns

Index(['Code\nAGB', 'Code\nCIQUAL', 'Groupe d'aliment',
       'Sous-groupe d'aliment', 'Nom du Produit en Français', 'LCI Name',
       'code saison (0 : hors saison ; 1 : de saison ; 2 : mix de consommation FR)',
       'code avion (1 : par avion)', 'Livraison', 'Approche emballage ',
       'Préparation',
       'DQR - Note de qualité de la donnée (1 excellente ; 5 très faible)',
       'Score unique EF 3.1', 'Changement climatique',
       'Appauvrissement de la couche d'ozone', 'Rayonnements ionisants',
       'Formation photochimique d'ozone', 'Particules fines',
       'Effets toxicologiques sur la santé humaine : substances non-cancérogènes',
       'Effets toxicologiques sur la santé humaine : substances cancérogènes',
       'Acidification terrestre et eaux douces', 'Eutrophisation eaux douces',
       'Eutrophisation marine', 'Eutrophisation terrestre',
       'Écotoxicité pour écosystèmes aquatiques d'eau douce',
       'Utilisation du sol', 'Épuisement des ressources eau',

In [5]:
df = df.drop(columns=['Code\nAGB', 
                      'Code\nCIQUAL',
                      'Nom du Produit en Français',
                      'DQR - Note de qualité de la donnée (1 excellente ; 5 très faible)'               
                     ])

df.head()

Unnamed: 0,Groupe d'aliment,Sous-groupe d'aliment,LCI Name,code saison (0 : hors saison ; 1 : de saison ; 2 : mix de consommation FR),code avion (1 : par avion),Livraison,Approche emballage,Préparation,Score unique EF 3.1,Changement climatique,...,Eutrophisation marine,Eutrophisation terrestre,Écotoxicité pour écosystèmes aquatiques d'eau douce,Utilisation du sol,Épuisement des ressources eau,Épuisement des ressources énergétiques,Épuisement des ressources minéraux,Changement climatique - émissions biogéniques,Changement climatique - émissions fossiles,Changement climatique - émissions liées au changement d'affectation des sols
1,aides culinaires et ingrédients divers,aides culinaires,"Aromatic stock cube, for fish, dehydrated",2,0,Ambiant (long),PACK PROXY,Pas de préparation,1.874152,7.584518,...,0.026783,0.137099,70.183757,106.3095,3.380742,700.15958,5.1e-05,0.103694,7.459628,0.021197
2,aides culinaires et ingrédients divers,aides culinaires,Topping sauce for pizza,2,0,Ambiant (long),PACK PROXY,Pas de préparation,0.148315,0.940414,...,0.004162,0.030263,11.027442,67.673943,2.468103,24.405351,6e-06,0.033626,1.015114,-0.108325
3,aides culinaires et ingrédients divers,aides culinaires,"Soy ""cream"" preparation",2,0,Ambiant (long),PACK PROXY,Pas de préparation,0.147701,1.174072,...,0.007233,0.024434,30.835753,116.49228,0.422468,22.429809,4e-06,0.02518,0.964544,0.184348
4,aides culinaires et ingrédients divers,algues,"Seaweed, agar, raw",2,0,Ambiant (long),PACK PROXY,Pas de préparation,1.547348,11.78692,...,0.015034,0.143648,57.86752,26.718351,4.833158,395.94639,7.9e-05,0.040063,11.740311,0.006545
5,aides culinaires et ingrédients divers,algues,"Sea lettuce (Enteromorpha sp.), dried or dehyd...",2,0,Ambiant (long),PACK PROXY,Pas de préparation,1.547348,11.78692,...,0.015034,0.143648,57.86752,26.718351,4.833158,395.94639,7.9e-05,0.040063,11.740311,0.006545
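
Some of the original headers contain an embedded newline (`'Code\nAGB'`) or a trailing space (`'Approche emballage '`), which makes them easy to mistype when selecting columns. One possible way to normalize such names before selecting columns is sketched below on an empty toy frame; this step is an optional suggestion and is not part of the notebook's pipeline.

```python
import pandas as pd

# Toy frame with the kinds of header quirks seen in the dataset.
toy = pd.DataFrame(columns=['Code\nAGB', 'Approche emballage ', 'Livraison'])

# Replace embedded newlines with a space and trim surrounding whitespace.
toy.columns = toy.columns.str.replace('\n', ' ', regex=False).str.strip()
print(list(toy.columns))
```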


---
### Renaming columns 
---
(French to English)

In [6]:
french = df.drop(columns=['LCI Name']).columns

french

Index(['Groupe d'aliment', 'Sous-groupe d'aliment',
       'code saison (0 : hors saison ; 1 : de saison ; 2 : mix de consommation FR)',
       'code avion (1 : par avion)', 'Livraison', 'Approche emballage ',
       'Préparation', 'Score unique EF 3.1', 'Changement climatique',
       'Appauvrissement de la couche d'ozone', 'Rayonnements ionisants',
       'Formation photochimique d'ozone', 'Particules fines',
       'Effets toxicologiques sur la santé humaine : substances non-cancérogènes',
       'Effets toxicologiques sur la santé humaine : substances cancérogènes',
       'Acidification terrestre et eaux douces', 'Eutrophisation eaux douces',
       'Eutrophisation marine', 'Eutrophisation terrestre',
       'Écotoxicité pour écosystèmes aquatiques d'eau douce',
       'Utilisation du sol', 'Épuisement des ressources eau',
       'Épuisement des ressources énergétiques',
       'Épuisement des ressources minéraux',
       'Changement climatique - émissions biogéniques',
       'Ch

In [7]:
eng = [ 
    'Food Group', 
    'Food Sub-group', 
    'Season Code', 
    'Airplane Code', 
    'Delivery', 
    'Packaging', 
    'Preparation', 
    'EF Score', 
    'Climate Change', 
    'Ozone Layer Depletion', 
    'Ionizing Radiation', 
    'Photochemical Ozone Formation', 
    'Fine Particles', 
    'Toxicological Effects (Non-carcinogenic)', 
    'Toxicological Effects (Carcinogenic)', 
    'Terrestrial and Freshwater Acidification', 
    'Freshwater Eutrophication', 
    'Marine Eutrophication', 
    'Terrestrial Eutrophication', 
    'Ecotoxicity for Freshwater Aquatic Ecosystems', 
    'Land Use', 
    'Water Resource Depletion', 
    'Energy Resource Depletion', 
    'Mineral Resource Depletion', 
    'Climate Change - Biogenic Emissions', 
    'Climate Change - Fossil Emissions', 
    'Climate Change - Emissions from Land Use Change'
]

In [8]:
french2eng = dict(zip(french, eng))

df = df.rename(columns=french2eng)

df.head()

Unnamed: 0,Food Group,Food Sub-group,LCI Name,Season Code,Airplane Code,Delivery,Packaging,Preparation,EF Score,Climate Change,...,Marine Eutrophication,Terrestrial Eutrophication,Ecotoxicity for Freshwater Aquatic Ecosystems,Land Use,Water Resource Depletion,Energy Resource Depletion,Mineral Resource Depletion,Climate Change - Biogenic Emissions,Climate Change - Fossil Emissions,Climate Change - Emissions from Land Use Change
1,aides culinaires et ingrédients divers,aides culinaires,"Aromatic stock cube, for fish, dehydrated",2,0,Ambiant (long),PACK PROXY,Pas de préparation,1.874152,7.584518,...,0.026783,0.137099,70.183757,106.3095,3.380742,700.15958,5.1e-05,0.103694,7.459628,0.021197
2,aides culinaires et ingrédients divers,aides culinaires,Topping sauce for pizza,2,0,Ambiant (long),PACK PROXY,Pas de préparation,0.148315,0.940414,...,0.004162,0.030263,11.027442,67.673943,2.468103,24.405351,6e-06,0.033626,1.015114,-0.108325
3,aides culinaires et ingrédients divers,aides culinaires,"Soy ""cream"" preparation",2,0,Ambiant (long),PACK PROXY,Pas de préparation,0.147701,1.174072,...,0.007233,0.024434,30.835753,116.49228,0.422468,22.429809,4e-06,0.02518,0.964544,0.184348
4,aides culinaires et ingrédients divers,algues,"Seaweed, agar, raw",2,0,Ambiant (long),PACK PROXY,Pas de préparation,1.547348,11.78692,...,0.015034,0.143648,57.86752,26.718351,4.833158,395.94639,7.9e-05,0.040063,11.740311,0.006545
5,aides culinaires et ingrédients divers,algues,"Sea lettuce (Enteromorpha sp.), dried or dehyd...",2,0,Ambiant (long),PACK PROXY,Pas de préparation,1.547348,11.78692,...,0.015034,0.143648,57.86752,26.718351,4.833158,395.94639,7.9e-05,0.040063,11.740311,0.006545
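
Because `zip` pairs the two lists purely by position and silently stops at the shorter one, a mismatch in length would leave some columns unrenamed without any error. A minimal sketch of a guard is given below; `build_rename_map` is a hypothetical helper introduced for illustration only.

```python
def build_rename_map(old_names, new_names):
    """Pair old and new column names by position, refusing mismatched lengths."""
    old_names = list(old_names)
    new_names = list(new_names)
    if len(old_names) != len(new_names):
        raise ValueError(
            f'{len(old_names)} old names but {len(new_names)} new names'
        )
    return dict(zip(old_names, new_names))


# Example with two of the column names used in this notebook.
mapping = build_rename_map(['Livraison', 'Préparation'], ['Delivery', 'Preparation'])
print(mapping)
```

The same check could be applied to `french` and `eng` before calling `df.rename`.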


---
### Change data type of columns
---

**Looking at the data type of the dataset columns**


In [9]:
df.dtypes

Food Group                                         object
Food Sub-group                                     object
LCI Name                                           object
Season Code                                        object
Airplane Code                                      object
Delivery                                           object
Packaging                                          object
Preparation                                        object
EF Score                                           object
Climate Change                                     object
Ozone Layer Depletion                              object
Ionizing Radiation                                 object
Photochemical Ozone Formation                      object
Fine Particles                                     object
Toxicological Effects (Non-carcinogenic)           object
Toxicological Effects (Carcinogenic)               object
Terrestrial and Freshwater Acidification           object
Freshwater Eut

In [10]:
col_int = ['Season Code', 'Airplane Code']

col_float = [
    'EF Score', 
    'Climate Change', 
    'Ozone Layer Depletion', 
    'Ionizing Radiation', 
    'Photochemical Ozone Formation', 
    'Fine Particles', 
    'Toxicological Effects (Non-carcinogenic)', 
    'Toxicological Effects (Carcinogenic)', 
    'Terrestrial and Freshwater Acidification', 
    'Freshwater Eutrophication', 
    'Marine Eutrophication', 
    'Terrestrial Eutrophication', 
    'Ecotoxicity for Freshwater Aquatic Ecosystems', 
    'Land Use', 
    'Water Resource Depletion', 
    'Energy Resource Depletion', 
    'Mineral Resource Depletion', 
    'Climate Change - Biogenic Emissions', 
    'Climate Change - Fossil Emissions', 
    'Climate Change - Emissions from Land Use Change'
]

# Map every listed column to its target dtype
dict_col_int = dict.fromkeys(col_int, 'int')

dict_col_float = dict.fromkeys(col_float, 'float')

In [11]:
df = df.astype(dict_col_int)

df = df.astype(dict_col_float)

df.dtypes

Food Group                                          object
Food Sub-group                                      object
LCI Name                                            object
Season Code                                          int64
Airplane Code                                        int64
Delivery                                            object
Packaging                                           object
Preparation                                         object
EF Score                                           float64
Climate Change                                     float64
Ozone Layer Depletion                              float64
Ionizing Radiation                                 float64
Photochemical Ozone Formation                      float64
Fine Particles                                     float64
Toxicological Effects (Non-carcinogenic)           float64
Toxicological Effects (Carcinogenic)               float64
Terrestrial and Freshwater Acidification           float
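
`astype` raises an error as soon as it meets a value it cannot convert. If that happens, `pd.to_numeric` with `errors='coerce'` can help locate the offending rows first. The sketch below uses a hypothetical column containing one non-numeric entry; the values are invented for the example.

```python
import pandas as pd

# Toy column (hypothetical): two numeric strings and one non-numeric entry.
toy = pd.DataFrame({'EF Score': ['1.87', '0.15', 'n/a']})

# Values that cannot be parsed become NaN instead of raising an error.
converted = pd.to_numeric(toy['EF Score'], errors='coerce')

# Rows that were present originally but failed the conversion.
bad_rows = toy[converted.isna() & toy['EF Score'].notna()]
print(bad_rows)
```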

---
### Removing error rows
---
There are 7 rows that appear twice: once with errors and once without. The versions with errors contain the word 'error' in the 'LCI Name' column, so that column is used to remove them.

In [12]:
# .copy() avoids a SettingWithCopyWarning when columns are modified later
df = df[~df['LCI Name'].str.contains('error')].copy()

df

Unnamed: 0,Food Group,Food Sub-group,LCI Name,Season Code,Airplane Code,Delivery,Packaging,Preparation,EF Score,Climate Change,...,Marine Eutrophication,Terrestrial Eutrophication,Ecotoxicity for Freshwater Aquatic Ecosystems,Land Use,Water Resource Depletion,Energy Resource Depletion,Mineral Resource Depletion,Climate Change - Biogenic Emissions,Climate Change - Fossil Emissions,Climate Change - Emissions from Land Use Change
1,aides culinaires et ingrédients divers,aides culinaires,"Aromatic stock cube, for fish, dehydrated",2,0,Ambiant (long),PACK PROXY,Pas de préparation,1.874152,7.584518,...,0.026783,0.137099,70.183757,106.309500,3.380742,700.159580,0.000051,0.103694,7.459628,0.021197
2,aides culinaires et ingrédients divers,aides culinaires,Topping sauce for pizza,2,0,Ambiant (long),PACK PROXY,Pas de préparation,0.148315,0.940414,...,0.004162,0.030263,11.027442,67.673943,2.468103,24.405351,0.000006,0.033626,1.015114,-0.108325
3,aides culinaires et ingrédients divers,aides culinaires,"Soy ""cream"" preparation",2,0,Ambiant (long),PACK PROXY,Pas de préparation,0.147701,1.174072,...,0.007233,0.024434,30.835753,116.492280,0.422468,22.429809,0.000004,0.025180,0.964544,0.184348
4,aides culinaires et ingrédients divers,algues,"Seaweed, agar, raw",2,0,Ambiant (long),PACK PROXY,Pas de préparation,1.547348,11.786920,...,0.015034,0.143648,57.867520,26.718351,4.833158,395.946390,0.000079,0.040063,11.740311,0.006545
5,aides culinaires et ingrédients divers,algues,"Sea lettuce (Enteromorpha sp.), dried or dehyd...",2,0,Ambiant (long),PACK PROXY,Pas de préparation,1.547348,11.786920,...,0.015034,0.143648,57.867520,26.718351,4.833158,395.946390,0.000079,0.040063,11.740311,0.006545
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2454,"viandes, œufs, poissons",viandes cuites,"Veal, knuckle or shank, braised or boiled",2,0,Glacé,PACK AGB,Cuisson à l'eau,2.744841,29.685425,...,0.091173,1.457487,391.691640,1336.417000,4.021206,149.244380,0.000057,15.861507,11.205989,2.617929
2455,"viandes, œufs, poissons",viandes cuites,"Veal, tenderloin, grilled/pan-fried",2,0,Glacé,PACK AGB,Poêle,2.719832,29.389500,...,0.091187,1.457302,392.319170,1339.161800,3.972692,141.891310,0.000056,15.821821,10.946006,2.621672
2456,"viandes, œufs, poissons",viandes cuites,"Veal, tenderloin, roasted",2,0,Glacé,PACK AGB,Four,2.738488,29.350392,...,0.091077,1.457291,392.244620,1336.137200,3.991295,149.677240,0.000059,15.821740,10.910768,2.617884
2457,"viandes, œufs, poissons",viandes cuites,"Veal, roast, cooked",2,0,Glacé,PACK AGB,Four,3.418940,36.767887,...,0.113935,1.821645,490.596670,1670.079500,4.984198,184.179660,0.000072,19.864699,13.630845,3.272343
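
Before discarding rows, it can be useful to count how many match the filter and compare that with the expected number. The sketch below does this on a hypothetical three-row frame; the exact wording of the error label (`'- error'`) is an assumption, as is the choice to match case-insensitively.

```python
import pandas as pd

# Toy frame (hypothetical labels): one product appears with and without an error flag.
toy = pd.DataFrame({
    'LCI Name': [
        'Veal, roast, cooked',
        'Veal, roast, cooked - error',
        'Seaweed, agar, raw',
    ]
})

# na=False treats missing names as "not an error" instead of propagating NaN.
is_error = toy['LCI Name'].str.contains('error', case=False, na=False)
print('Rows flagged:', int(is_error.sum()))

kept = toy[~is_error].copy()
print(kept)
```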


---
### Investigating missing values
---

In [13]:
df.isna().sum()

Food Group                                         0
Food Sub-group                                     0
LCI Name                                           0
Season Code                                        0
Airplane Code                                      0
Delivery                                           0
Packaging                                          0
Preparation                                        0
EF Score                                           0
Climate Change                                     0
Ozone Layer Depletion                              0
Ionizing Radiation                                 0
Photochemical Ozone Formation                      0
Fine Particles                                     0
Toxicological Effects (Non-carcinogenic)           0
Toxicological Effects (Carcinogenic)               0
Terrestrial and Freshwater Acidification           0
Freshwater Eutrophication                          0
Marine Eutrophication                         
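
With many columns, the full listing is long even when nothing is missing. A compact alternative is to show only the columns that do contain missing values, sketched here on a hypothetical two-column frame with one gap.

```python
import pandas as pd

# Toy frame (hypothetical values) with a single missing entry in 'Delivery'.
toy = pd.DataFrame({
    'Delivery': ['Glacé', None, 'Congelé'],
    'EF Score': [1.0, 2.0, 3.0],
})

missing = toy.isna().sum()
# Keep only the columns with at least one missing value.
missing = missing[missing > 0]
print(missing)
```

An empty result means no column needs imputation.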

---
### Investigating the values of object columns
---

**Food Group**

In [14]:
print('There are ',len(df['Food Group'].unique()),'categories.')

df['Food Group'].unique()

There are  11 categories.


array(['aides culinaires et ingrédients divers', 'aliments infantiles',
       'boissons', 'entrées et plats composés',
       'fruits, légumes, légumineuses et oléagineux', 'glaces et sorbets',
       'lait et produits laitiers', 'matières grasses',
       'produits céréaliers', 'produits sucrés',
       'viandes, œufs, poissons'], dtype=object)

**Food Sub-group**


In [15]:
print('There are ',len(df['Food Sub-group'].unique()),'categories.')

df['Food Sub-group'].unique()

There are  61 categories.


array(['aides culinaires', 'algues', 'condiments',
       'denrées destinées à une alimentation particulière', 'épices',
       'herbes', 'ingrédients divers', 'sauces', 'sels',
       'céréales et biscuits infantiles', 'desserts infantiles',
       'laits et boissons infantiles',
       'petits pots salés et plats infantiles', 'boisson alcoolisées',
       'boissons sans alcool', 'eaux', 'feuilletées et autres entrées',
       'pizzas, tartes et crêpes salées', 'plats composés',
       'plats végétariens', 'salades composées et crudités', 'sandwichs',
       'soupes', 'fruits', 'fruits à coque et graines oléagineuses',
       'légumes', 'légumineuses', 'pommes de terre et autres tubercules',
       'desserts glacés', 'glaces', 'sorbets',
       'crèmes et spécialités à base de crème', 'fromages', 'laits',
       'produits laitiers frais et assimilés', 'autres matières grasses',
       'beurres', 'huiles de poissons', 'huiles et graisses végétales',
       'margarines', 'céréales de pe

**LCI Name**


In [16]:
print('There are ',len(df['LCI Name'].unique()),'categories.')


There are  2448 categories.


**Delivery**

In [17]:
print('There are ',len(df['Delivery'].unique()),'categories.')

df['Delivery'].unique()

There are  5 categories.


array(['Ambiant (long)', 'Glacé', 'Congelé', 'Ambiant (moyen)',
       'Ambiant (court)'], dtype=object)

**Packaging**

In [18]:
print('There are ',len(df['Packaging'].unique()),'categories.')

df['Packaging'].unique()

There are  2 categories.


array(['PACK PROXY', 'PACK AGB'], dtype=object)

**Preparation**

In [19]:
print('There are ',len(df['Preparation'].unique()),'categories.')

df['Preparation'].unique()

There are  8 categories.


array(['Pas de préparation', 'Micro-onde ',
       'Réfrigéré chez le consommateur', "Cuisson à l'eau", 'Four ',
       'Four', 'Poêle', 'Micro-onde'], dtype=object)
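
The per-column checks above can also be collected in a single pass. The sketch below counts the distinct values of every non-numeric column of a small, hypothetical frame; the data are invented and only the column names are borrowed from this notebook.

```python
import pandas as pd

# Toy frame (hypothetical values) with two categorical columns and one numeric column.
toy = pd.DataFrame({
    'Packaging': ['PACK PROXY', 'PACK AGB', 'PACK AGB'],
    'Delivery': ['Glacé', 'Congelé', 'Glacé'],
    'EF Score': [1.87, 0.15, 1.55],
})

# Count distinct values for every non-numeric column.
summary = {
    col: toy[col].nunique()
    for col in toy.select_dtypes(exclude='number').columns
}
print(summary)
```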

---
### Cleaning the Preparation column
---
Some categories appear twice because of a trailing space (`'Four '` and `'Four'`, `'Micro-onde '` and `'Micro-onde'`). Stripping the whitespace merges them.

In [20]:
df['Preparation'] = df['Preparation'].str.strip()
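
A quick way to confirm the effect is to compare the number of distinct values before and after stripping. The sketch below reuses the eight category labels printed above in a standalone Series.

```python
import pandas as pd

# The eight labels reported by df['Preparation'].unique() before cleaning.
prep = pd.Series([
    'Pas de préparation', 'Micro-onde ',
    'Réfrigéré chez le consommateur', "Cuisson à l'eau", 'Four ',
    'Four', 'Poêle', 'Micro-onde',
])

before = prep.nunique()
after = prep.str.strip().nunique()
print(before, after)
```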

---
### Write cleaned data file
---

In [21]:
df.to_excel('../data/AGRIBALYSE3.2_Synthese_cleaned.xlsx', index=False)