In [30]:
import pandas as pd
import string
import cufflinks as cf
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot 
import matplotlib.pyplot as plt

# Data Recovery and Exploration

In this notebook we will:
- Import the different data sets.
- Describe the feature selection.
- Clean the data to make it usable for NLP.
- Discuss the variables we want to predict.


## I) Download the full data set

Go to https://www.oqali.fr/donnees-publiques/base-de-donnees-oqali/, select "Exporter toute la base de données Oqali", and download the data sets named "Ingredients" and "Description du produit".

**Warning**: There are missing values in the files available on this website. My notebooks are based on another version of the data set without missing values, but they will still work with the data you can find on Oqali.

Finally, put them in the folder: "...\food-classification\Data_Preprocessing\Data_Oqali\"


## II) The features

The main goal of this project is to classify products obtained by web scraping with an algorithm trained on the Oqali data set, predicting a "Secteur" and a "Famille" for each product. Among the available variables, only five hold our attention, because a feature must be present both in the web-scraped data and in the Oqali data set. Not all of them will turn out to be useful for the project.

- Features we can find with both Oqali and web scraping:

**Nom**: The name of the article, example : "VITTEL Eau minérale naturelle plate 50cl".

**Dénomination légale de vente**: the sales name, the commercial name found on product labels; example: "VITTEL NATURE PET 50CL".

**Ingrédients**: The composition of the article.

**Marque**: the brand of the product, example: "Auchan". The brand could be a determining variable when assigning products to a sector or family. However, since we will classify products from new brands that are not present in the database, this variable will not be used. Brand names sometimes appear in other fields, such as the label, and this will be taken into account during data pre-processing.

**Mode de conservation**: the storage method, example: "à stocker entre +5°C et +30°C dans un endroit propre et sans odeur". The storage method is either 'ambient', 'fresh' or 'frozen'.

For various reasons, we cannot use all of these features. For example, the brand is irrelevant because our data set covers only specific brands: if we extract data from other brands, it will not help us classify the products. For the rest of the project we keep **Nom**, **Dénomination légale de vente**, **Mode de conservation** and **Ingrédients**.

There are two types of variables:
- text variables: Nom, Dénomination légale de vente and Ingrédients;
- categorical variable: Mode de conservation.

For text variables, a product is represented by an n-dimensional vector. n is the number of different words in the space of words present in the set of textual variables. The value of element number k will be 1 if word k appears in the textual fields of the product, and 0 otherwise.
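
The binary representation described above can be sketched in plain Python. This is only an illustration: the helper names `build_vocabulary` and `binary_vector` and the toy texts are assumptions made for the example, not part of the project code.

```python
def build_vocabulary(texts):
    """Map each distinct lowercase word to a column index (sorted for stability)."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}


def binary_vector(text, vocabulary):
    """Return a 0/1 list: element k is 1 if word k occurs in the text."""
    present = set(text.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Toy product texts (illustrative, not taken from Oqali).
texts = ["sirop de grenadine", "sirop de menthe", "yaourt a boire"]
vocabulary = build_vocabulary(texts)
vectors = [binary_vector(text, vocabulary) for text in texts]

print(list(vocabulary))  # ['a', 'boire', 'de', 'grenadine', 'menthe', 'sirop', 'yaourt']
print(vectors[0])        # [0, 0, 1, 1, 0, 1, 0]
```

Here n is 7, the number of distinct words across the three texts, and each product becomes a vector of length 7.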

As for the conservation mode, since it can take only three possible values, we encode it as a vector of size 3, in which each column corresponds to one of the possible values: 'Fresh', 'Ambient', 'Frozen'.
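
As a minimal sketch of this encoding, assuming the three English labels used in the text (the labels stored in the data, such as "Frais" or "Ambiant", would replace them), a hypothetical `one_hot` helper could look like this:

```python
# Labels taken from the text above; the actual values in the data are in French.
MODES = ["Fresh", "Ambient", "Frozen"]


def one_hot(mode, modes=MODES):
    """Return a list with a single 1 at the position of `mode`."""
    if mode not in modes:
        raise ValueError(f"unknown conservation mode: {mode!r}")
    return [1 if mode == candidate else 0 for candidate in modes]


print(one_hot("Ambient"))  # [0, 1, 0]
```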

## III) Data Cleaning

Now that all files are available, we will create the full data set, which will be used in the second part for NLP.

**Warning**: This part was written for another version of the data set, so small modifications may be needed.

In [31]:
produits = pd.read_csv(r'C:\Users\Thomas Aujoux\Documents\GitHub\food-classification\Data_Preprocessing\Data_Oqali\produits.csv', low_memory=False, sep=';', encoding = 'latin1')
produits = produits.rename(columns={ 'Code.du.Produit': 'Code_produit', 'Dénomination.de.vente': 'Denomination_de_vente', 'Famille.de.produits': 'Famille', "Mode.de.conservation": 'Conservation', 'Nom.du.produit': 'Nom'})
produits = produits[["Code_produit", "Secteur", "Famille", "Denomination_de_vente", "Nom", "Conservation"]]
produits

Unnamed: 0,Code_produit,Secteur,Famille,Denomination_de_vente,Nom,Conservation
0,450,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Lait fermente sucre a boire aux fruits enrichi...,GERVAIS A BOIRE,Frais
1,453,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Yaourt a boire sucre aromatise,P_TIT YOP PARFUM FRAISE,Frais
2,455,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Yaourt a boire sucre aux fruits,YOCO A BOIRE,Frais
3,456,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Yaourt a boire sucre aux fruits,YOCO A BOIRE,Frais
4,460,Produits laitiers et desserts frais,Fromages frais nature non sucres gourmands,_,FROMAGE BLANC 40% MG,Frais
...,...,...,...,...,...,...
66995,101536,Sirops et boissons concentrees a diluer,Sirops,Sirop d_orange,SIROP D_ORANGE,Ambiant
66996,101537,Sirops et boissons concentrees a diluer,Sirops,Sirop de fruits exotiques,SIROP TROPICAL,Ambiant
66997,101540,Sirops et boissons concentrees a diluer,Sirops,Sirop de the peche,SIROP THE PECHE,Ambiant
66998,101542,Sirops et boissons concentrees a diluer,Sirops,Sirop de grenadine,SIROP DE GRENADINE,Ambiant


This data set contains the information we will use later: the features "Secteur", "Famille", "Denomination_de_vente", "Nom" and "Conservation". There are about 67,000 products; we will check whether this is enough rows for machine learning given the number of "Secteur" and "Famille" values.

In [3]:
ingredients2 = pd.read_csv(r'C:\Users\Thomas Aujoux\Documents\GitHub\food-classification\Data_Preprocessing\Data_Oqali\ingredients.csv', low_memory=False, sep=';', encoding = 'latin1')
ingredients2 = ingredients2.rename(columns={ 'Code.du.Produit': 'Code_produit' })
ingredients2 = ingredients2[["Code_produit", "Ingrédient"]]
ingredients2 = pd.merge(produits[["Code_produit", "Secteur", "Famille", "Denomination_de_vente", "Nom", "Conservation"]], ingredients2, on="Code_produit")
ingredients2 = ingredients2.rename(columns={ 'Ingrédient': 'Ingredient' })
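# Drop the obsolete RHF sectors; na=True also drops rows whose "Secteur" is missing, as "== False" did.
ingredients2 = ingredients2[~ingredients2["Secteur"].str.contains("Bouillons et potages_RHF", regex=False, na=True)]
ingredients2 = ingredients2[~ingredients2["Secteur"].str.contains("Sauces condimentaires_RHF", regex=False, na=True)]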
ingredients2.to_csv(r'C:\Users\Thomas Aujoux\Documents\GitHub\food-classification\Data_Preprocessing\Data_Oqali\merged_final.csv', index=True, sep=';')
len(ingredients2["Secteur"].unique())

31

In [32]:
ingredients = pd.read_csv(r'C:\Users\Thomas Aujoux\Documents\GitHub\food-classification\Data_Preprocessing\Data_Oqali\ingredients.csv', low_memory=False, sep=';', encoding = 'latin1')
ingredients = ingredients.rename(columns={ 'Code.du.Produit': 'Code_produit' })
ingredients = ingredients[["Code_produit", "Ingrédient"]]
ingredients = pd.merge(ingredients, produits[["Code_produit", "Secteur", "Famille"]], on="Code_produit")
ingredients

Unnamed: 0,Code_produit,Ingrédient,Secteur,Famille
0,450,lait ecreme reconstitue,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques
1,450,sucre,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques
2,450,fruit : framboise,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques
3,450,creme,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques
4,450,"epaississants : amidon transforme de mais, gom...",Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques
...,...,...,...,...
1466083,101542,extrait de vanille,Sirops et boissons concentrees a diluer,Sirops
1466084,101543,sucre de canne liquide (ingredient issu de l_a...,Sirops et boissons concentrees a diluer,Sirops
1466085,101543,eau,Sirops et boissons concentrees a diluer,Sirops
1466086,101543,aromes naturels de menthe,Sirops et boissons concentrees a diluer,Sirops


This data set contains the information we will use later: the features "Secteur", "Famille" and "Ingrédient". It has 1,466,088 rows, that is approximately 22 ingredients per product, which is a lot. Each ingredient also comes with details about its composition, for example "epaississants : amidon transforme de mais, gomme, ...". This leads to two potential issues:

First, there are many different words, and some may be useless for our machine learning algorithm.

Second, the words in the web-scraped data can be slightly different, so they will not match those in our data set.

We therefore apply the following cleaning steps:

1. Many rows in the data set will not be used: the sectors 'Sauces condimentaires_RHF' and 'Bouillons et potages_RHF' contain old data that we remove.

2. The features contain too much information. We want clean features without too many possible values, so that they match the web-scraped data easily. We delete everything after the ":" in an ingredient, since it describes the composition of that ingredient. One example is "chocolat noir : masse de cacao, sucre, emulsifiant : lecithine de soja", which becomes "chocolat noir".

We do this for three reasons. First, Oqali data are very precise, whereas web-scraped data often omit such details and give only "chocolat noir".

Second, these sub-ingredients are not the main components of the product, only parts of an ingredient, and we can assume that in most cases their share of the product is very low.

Finally, we tested classifiers with and without them: the results were almost the same, but with them the model overfitted because there were too many features.

3. We delete everything in parentheses "()" for the same reasons as for ":".

4. To check that the data are clean, and for the rest of the study, we replace every "_" and "*" with " ".

5. We remove unnecessary punctuation. Python's `string` library provides a predefined list, `string.punctuation`, which contains ``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``. The code below uses a custom list based on it that also strips digits.

6. We drop every row whose ingredient appears only once in the data.
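
Steps 2 to 5 above can be checked on a single string with the standard library only. The function `clean_ingredient` below is a simplified, hypothetical stand-in written for this illustration; it is not the notebook's `data_cleaning` function and differs in small details (non-greedy parentheses, whitespace collapsing).

```python
import re

# Custom punctuation list assumed for this sketch (it also strips digits).
PUNCTUATION = set('!"#$%&\'()+,-./;:<=>?@[\\]^_`{|}~1234567890')


def clean_ingredient(text):
    """Apply steps 2 to 5 to one ingredient or product string."""
    # Step 2: keep only the tokens before the first standalone ":".
    tokens = text.split()
    if ":" in tokens:
        tokens = tokens[: tokens.index(":")]
    text = " ".join(tokens)
    # Step 3: drop parenthesised details, including an unclosed "(".
    text = re.sub(r"\(.*?\)", " ", text)
    text = re.sub(r"\(.*", "", text)
    # Step 4: replace "_" and "*" with spaces.
    text = text.replace("_", " ").replace("*", " ")
    # Step 5: remove punctuation and digits.
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    # Collapse repeated spaces left behind by the removals.
    return " ".join(text.split())


print(clean_ingredient("chocolat noir : masse de cacao, sucre, emulsifiant : lecithine de soja"))
# chocolat noir
print(clean_ingredient("FROMAGE BLANC 40% MG"))
# FROMAGE BLANC MG
```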

In [33]:
def remove_colon(tokens):
    # Keep the tokens that come before the first standalone ":" (a colon
    # surrounded by spaces). Missing or empty values become an empty string.
    if not isinstance(tokens, list) or not tokens:
        return ""
    kept = [tokens[0]]
    for token in tokens[1:]:
        if token == ":":
            break
        kept.append(token)
    return " ".join(kept)


# Custom punctuation list: it also contains the digits, which are removed too.
list_punctuation = '!"#$%&\'()+,-./;:<=>?@[\\]^_`{|}~1234567890'

def remove_punctuation(text):
    # Return the text without any character of list_punctuation.
    return "".join([i for i in text if i not in list_punctuation])


def data_cleaning(df, column):
    # Drop the obsolete RHF sectors; copy to avoid SettingWithCopyWarning.
    df = df[~df['Secteur'].isin(['Sauces condimentaires_RHF', 'Bouillons et potages_RHF'])].copy()

    # Keep only the text that precedes the first " : ".
    df[column] = df[column].str.split().map(remove_colon)

    # Remove parenthesised details, including an unclosed "(" up to the end.
    df[column] = df[column].str.replace(r"\s\(.*\)", "", regex=True)
    df[column] = df[column].str.replace(r"\s\(.*\)\s", " ", regex=True)
    df[column] = df[column].str.replace(r"\(.*\)", "", regex=True)
    df[column] = df[column].str.replace(r"\(.*", "", regex=True)

    # Trim leading and trailing whitespace.
    df[column] = df[column].str.strip()

    # Values that appear only once are filtered later, for the ingredients only.

    # Replace "_" and "*" with spaces, then strip punctuation and digits.
    df[column] = df[column].str.replace('_', ' ', regex=False)
    df[column] = df[column].str.replace('*', ' ', regex=False)
    df[column] = df[column].apply(remove_punctuation)

    return df

In [34]:
produits = data_cleaning(produits, "Denomination_de_vente")
produits = data_cleaning(produits, "Nom")
produits = data_cleaning(produits, "Conservation")
ingredients = data_cleaning(ingredients, 'Ingrédient')
ingredients = ingredients[ingredients.duplicated(subset=['Ingrédient'], keep=False)]

Now that our features are clean, we can merge the two data sets to use them in the NLP part.

In [35]:
ingredients = ingredients.groupby(['Code_produit', 'Secteur'])['Ingrédient'].agg(lambda col: ' '.join(col)).reset_index(name='Ingrédient')
df = pd.merge(produits, ingredients, on='Code_produit')
#df = df.drop('index', axis=1)
df = df.drop('Secteur_y', axis=1)
df = df.rename(columns={ 'Secteur_x': 'Secteur' })
df

Unnamed: 0,Code_produit,Secteur,Famille,Denomination_de_vente,Nom,Conservation,Ingrédient
0,450,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Lait fermente sucre a boire aux fruits enrichi...,GERVAIS A BOIRE,Frais,lait ecreme reconstitue sucre fruit creme epai...
1,453,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Yaourt a boire sucre aromatise,P TIT YOP PARFUM FRAISE,Frais,yaourt au lait partiellement ecreme sucre siro...
2,455,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Yaourt a boire sucre aux fruits,YOCO A BOIRE,Frais,yaourt au lait mg sucre fruits fruit fruit fr...
3,456,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Yaourt a boire sucre aux fruits,YOCO A BOIRE,Frais,yaourt au lait mg sucre fraise fructose conce...
4,460,Produits laitiers et desserts frais,Fromages frais nature non sucres gourmands,,FROMAGE BLANC MG,Frais,lait ecreme pasteurise creme pasteurisee ferme...
...,...,...,...,...,...,...,...
65793,101536,Sirops et boissons concentrees a diluer,Sirops,Sirop d orange,SIROP D ORANGE,Ambiant,sucre eau jus d orange a base de concentre aci...
65794,101537,Sirops et boissons concentrees a diluer,Sirops,Sirop de fruits exotiques,SIROP TROPICAL,Ambiant,sucre eau jus de fruits a base de concentres j...
65795,101540,Sirops et boissons concentrees a diluer,Sirops,Sirop de the peche,SIROP THE PECHE,Ambiant,sucre eau jus de peche a base de concentre aci...
65796,101542,Sirops et boissons concentrees a diluer,Sirops,Sirop de grenadine,SIROP DE GRENADINE,Ambiant,sirop de glucose fructose eau sucre liquide ju...


In [36]:
df3 = df[["Secteur", "Famille", "Denomination_de_vente", "Conservation", "Ingrédient"]]
import dataframe_image as dfi
dfi.export(df3, "../images/df.png", max_rows=10)


## IV) Descriptive statistics about variables of interest : "Secteur" and "Famille".

### 1) Analysis of our dataset

The Oqali nomenclature is a 2-level nomenclature. It comprises 31 sectors and 637 families, with most sectors comprising 3 to 62 families. As more than 58,000 products are already classified in this nomenclature, we will perform a supervised categorization.

The particularity of this nomenclature is that it can change at any time following changes in the market, through the appearance, deletion or modification of sectors or families. We therefore need to provide an algorithm capable of taking this into account.

In [17]:
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

df["Secteur"].iplot(kind='hist',
    xTitle='Sector',
    linecolor='black',
    yTitle='Quantity',
    title="Distribution of food products by sector")
print("There are",len(df["Secteur"].unique()), "different Sectors")

There are 31 different Sectors


In [16]:
df.groupby('Secteur').count()['Famille'].sort_values(ascending=False).iplot(kind='bar', yTitle='Quantity', linecolor='black', opacity=0.8, title='Sector bar chart', xTitle='Sector')

In [18]:
trace = go.Table(
                header=dict(values=['Sectors','Products per sector'],
                fill = dict(color=['#EABEB0']), 
                align = ['left'] * 5),
                cells=dict(values=[df.Secteur.value_counts().index,df.Secteur.value_counts()],
                align = ['left'] * 5))

layout = go.Layout(title='Number of products in each sector',
                   titlefont = dict(size = 20),
                   width=500, height=650, 
                   paper_bgcolor =  'rgba(0,0,0,0)',
                   plot_bgcolor = 'rgba(0,0,0,0)',
                   autosize = False,
                   margin=dict(l=30,r=30,b=1,t=50,pad=1),
                   )
data = [trace]
fig = dict(data=data, layout=layout)
iplot(fig)

We now analyse the distribution of our data set across sectors.

First, there are 31 sectors, which is a large number for multi-class classification. Second, the data set is highly imbalanced: some sectors have around 6,000 products, others only 128. A classification problem is imbalanced when the labels are not equally distributed, here across the sectors. This is a well-known problem, for instance in fraud detection, and it requires adapting the approach. We will see some solutions later in this notebook.

In [60]:
df["Famille"].iplot(kind='hist',
    xTitle='Family',
    linecolor='black',
    yTitle='Quantity',
    title="Distribution of food products by family")

In [20]:
df.groupby('Famille').count()['Secteur'].sort_values(ascending=False).iplot(kind='bar', yTitle='Quantity', linecolor='black', opacity=0.8, title='Family bar chart', xTitle='Family')

In [74]:
trace = go.Table(
                header=dict(values=['Families','Products per family'],
                fill = dict(color=['#EABEB0']), 
                align = ['left'] * 5),
                cells=dict(values=[df.Famille.value_counts().index,df.Famille.value_counts()],
                align = ['left'] * 5))

layout = go.Layout(title='Number of products in each family',
                   titlefont = dict(size = 20),
                   width=500, height=650, 
                   paper_bgcolor =  'rgba(0,0,0,0)',
                   plot_bgcolor = 'rgba(0,0,0,0)',
                   autosize = False,
                   margin=dict(l=30,r=30,b=1,t=50,pad=1),
                   )
data = [trace]
fig = dict(data=data, layout=layout)
iplot(fig)

We now analyse the distribution of our data set across families.

#### Imbalanced Dataset

Imbalanced data often refers to a classification problem in which the classes are not equally represented. The distribution of data across classes is fairly uneven and unbalanced, which could be detrimental to sectors or families with few products. There is also an unbalanced distribution of families among sectors.

As seen previously, the data are highly imbalanced across sectors; across families the imbalance is extreme. Some families contain 1,400 products, whereas others contain only one, such as "Fruits au sirop tres leger" or "Fruits au sirop lourd". Even with some solutions, it will be impossible to classify those families.
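
One simple way to quantify this imbalance is to count the products per class and compare the largest class with the smallest. The sketch below uses toy labels and a hypothetical helper, `imbalance_summary`; the counts are illustrative, not the real Oqali figures.

```python
from collections import Counter


def imbalance_summary(labels):
    """Summarise how unevenly the labels are distributed."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return {
        "classes": len(counts),
        "largest": largest,
        "smallest": smallest,
        "ratio": largest / smallest,
        # Classes with a single example cannot be split into train and test sets.
        "singletons": sorted(label for label, n in counts.items() if n == 1),
    }


# Toy labels (illustrative only).
labels = ["Sirops"] * 6 + ["Yaourts"] * 3 + ["Fruits au sirop lourd"]
summary = imbalance_summary(labels)
print(summary["ratio"])       # 6.0
print(summary["singletons"])  # ['Fruits au sirop lourd']
```

The same function could be applied to the values of `df["Famille"]` to list the families that contain a single product.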

#### Imbalanced Dataset and biased models

This problem leads to a biased classification model. However, it is less critical here than in fraud detection: some families are simply less present in supermarkets, and misclassifying them is less costly than missing a fraud. For example, if the model overfits the family "Yaourts et laits fermentes sucres classiques" and underfits the family "Fruits au sirop lourd", the impact is limited because there are few "Fruits au sirop lourd" products to classify. However, if one day we web-scrape data containing only "Fruits au sirop lourd", this will become an issue. Even if it is not the biggest problem, we still need to find solutions.

Typical classifiers, such as decision trees and logistic regression, tend to handle imbalanced classes poorly when used with default settings. They are biased toward the larger classes, while classes with fewer data points are treated as noise and frequently ignored. As a result, minority classes have a higher misclassification rate than majority classes.

### 2) Solutions

#### Get new data

The first solution would be to find new data for the under-represented families. However, the data are entered manually, so this will not be possible.

#### Data-level approaches

The second solution would be to re-sample the data set to make it balanced. There are two ways to do so. "Under-sampling" removes samples from over-represented classes; we will not use it because we already have few products. "Over-sampling" adds samples to the minority classes, for example with SMOTE (Synthetic Minority Oversampling Technique), which creates synthetic samples of the minority class. We will not apply this solution either, because over-sampling for text classification is a difficult task and previous work showed that this method does not give good results.

#### Class Weight

The third solution is to give more weight to minority classes while training the model. This biases the model toward the minority classes and helps improve its performance on them.
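
The notebook does not state which weighting scheme is used, so the sketch below is an assumption: it applies the common "balanced" heuristic, weight(c) = n_samples / (n_classes * count(c)), which is the formula behind scikit-learn's `class_weight="balanced"` option. The function name and the toy labels are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy labels (illustrative only).
labels = ["Sirops"] * 6 + ["Yaourts"] * 3 + ["Fruits au sirop lourd"]
weights = balanced_class_weights(labels)
for label, weight in weights.items():
    print(label, round(weight, 3))
```

Rare classes receive the largest weights, so an error on a rare family costs more during training than an error on a frequent one. A dictionary of this form could be passed as the `class_weight` argument of scikit-learn estimators that accept it.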


### 3) Evaluation Metrics

There are different evaluation metrics for our problem :

- Accuracy: the proportion of the total number of predictions that were correct.

- Precision: the proportion of predicted positive cases that are actually positive.

- Sensitivity or Recall : the proportion of actual positive cases which are correctly identified.

- F1 Score: The F1 score can be interpreted as a harmonic mean of the precision and recall, F1 Score = 2 * (precision * recall) / (precision + recall)

We are in a special case of imbalanced data. If the model overfits a big family, it will almost always predict that family, which still gives a good accuracy. In this case, it is easy to get high accuracy without actually making useful predictions. Accuracy makes sense as an evaluation metric only if the class labels are roughly uniformly distributed.

A confusion matrix can be a good performance instrument for this task.

The F1 score with 'macro' average is a useful measure of prediction success when the classes are very imbalanced. Precision measures the ability of a classification model to identify only the relevant data points, while recall measures its ability to find all the relevant cases within a data set.
High scores for both precision and recall show that the classifier returns accurate results (precision) and returns a majority of all positive results (recall). An ideal system with high precision and high recall will return many results, with all results labeled correctly.
The macro average is computed by taking the arithmetic mean of the F1 scores of all classes.
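
The difference between accuracy and macro-averaged F1 can be seen on a small example. The sketch below is written in plain Python for illustration; the labels "big" and "small" and the function names are assumptions, and the classifier is a deliberately naive one that always predicts the majority class.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score of one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Arithmetic mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Share of predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Nine products in a big family, one in a small family;
# the model always predicts the big family.
y_true = ["big"] * 9 + ["small"]
y_pred = ["big"] * 10

print(accuracy(y_true, y_pred))            # 0.9
print(round(macro_f1(y_true, y_pred), 3))  # 0.474
```

Accuracy stays at 0.9 although the small family is never predicted, while macro F1 drops to about 0.47 because the F1 score of the small family is 0.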


In [9]:
df["variable"] = df["Denomination_de_vente"] + " " + df["Nom"] + " " + df["Conservation"] + " " + df["Ingrédient"]
df2 = df[["Code_produit", "Secteur", "Famille", "variable"]]

In [11]:
df2.to_csv(r'C:\Users\Thomas Aujoux\Documents\GitHub\food-classification\Data_Preprocessing\Data_Oqali\merged_final.csv')
df2

Unnamed: 0,Code_produit,Secteur,Famille,variable
0,450,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Lait fermente sucre a boire aux fruits enrichi...
1,453,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Yaourt a boire sucre aromatise P TIT YOP PARFU...
2,455,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Yaourt a boire sucre aux fruits YOCO A BOIRE F...
3,456,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,Yaourt a boire sucre aux fruits YOCO A BOIRE F...
4,460,Produits laitiers et desserts frais,Fromages frais nature non sucres gourmands,FROMAGE BLANC MG Frais lait ecreme pasteuri...
...,...,...,...,...
65793,101536,Sirops et boissons concentrees a diluer,Sirops,Sirop d orange SIROP D ORANGE Ambiant sucre ea...
65794,101537,Sirops et boissons concentrees a diluer,Sirops,Sirop de fruits exotiques SIROP TROPICAL Ambia...
65795,101540,Sirops et boissons concentrees a diluer,Sirops,Sirop de the peche SIROP THE PECHE Ambiant suc...
65796,101542,Sirops et boissons concentrees a diluer,Sirops,Sirop de grenadine SIROP DE GRENADINE Ambiant ...


In [1]:
df.to_csv(r'C:\Users\Thomas Aujoux\Documents\GitHub\food-classification\Data_Preprocessing\Data_Oqali\merged.csv')
df
