### Identify outliers in `products.csv`

Use `pandas` to load and analyze the data in the CSV file `products.csv`, then identify the outliers (order of magnitude: a few hundred).

In [85]:
from pathlib import Path

import numpy as np
import pandas as pd

In [34]:
DATA_DIR = Path("../../data")
product_file_path = DATA_DIR / "products.csv"

In [35]:
product_df = pd.read_csv(product_file_path, low_memory=False)

In [36]:
product_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320772 entries, 0 to 320771
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Unnamed: 0          320772 non-null  int64  
 1   code                320749 non-null  object 
 2   fat_100g            243891 non-null  float64
 3   saturated-fat_100g  229554 non-null  float64
 4   sugars_100g         244971 non-null  float64
 5   fiber_100g          200886 non-null  float64
 6   proteins_100g       259922 non-null  float64
 7   salt_100g           255510 non-null  float64
 8   sodium_100g         255463 non-null  float64
 9   autre               320772 non-null  float64
dtypes: float64(8), int64(1), object(1)
memory usage: 24.5+ MB


In [9]:
product_df.describe()

Unnamed: 0.1,Unnamed: 0,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre
count,320772.0,243891.0,229554.0,244971.0,200886.0,259922.0,255510.0,255463.0,320772.0
mean,160385.5,12.730379,5.129932,16.003484,2.862111,7.07594,2.028624,0.798815,65.861511
std,92599.044612,17.578747,8.014238,22.327284,12.867578,8.409054,128.269454,50.504428,32.091021
min,0.0,0.0,0.0,-17.86,-6.7,-800.0,0.0,0.0,0.0
25%,80192.75,0.0,0.0,1.3,0.0,0.7,0.0635,0.025,41.907242
50%,160385.5,5.0,1.79,5.71,1.5,4.76,0.58166,0.229,75.670259
75%,240578.25,20.0,7.14,24.0,3.6,10.0,1.37414,0.541,94.145336
max,320771.0,714.29,550.0,3520.0,5380.0,430.0,64312.8,25320.0,889.38


- Duplicate codes? -> Didier
- Negative values? -> Salomé
- Values > 100 (with a tolerance)? -> Aatif
- Column values that do not sum to 100 (with a tolerance)? -> Salomé
- Missing values? -> Aatif + Didier
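
The "> 100" check in the list above has no dedicated cell in this notebook. A minimal sketch of what it could look like follows, using a hypothetical toy DataFrame (`toy_df`) and an assumed tolerance of 1 g; neither comes from the notebook.

```python
import pandas as pd

# Toy data standing in for products.csv (illustrative values only).
toy_df = pd.DataFrame(
    {
        "fat_100g": [12.0, 714.29, 100.5],
        "sugars_100g": [5.0, 20.0, None],
    }
)

TOLERANCE = 1.0  # assumed tolerance in grams, not taken from the notebook

# A per-100 g quantity cannot exceed 100 g.
# Comparisons with NaN evaluate to False, so missing values are not flagged.
mask_over_100 = (toy_df > 100 + TOLERANCE).any(axis=1)
n_over_100 = int(mask_over_100.sum())

print(toy_df[mask_over_100])
```

On the real data, the same mask could be applied to `product_df[columns_to_check]` once that list is defined.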



In [42]:
product_df.head()

Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre
0,0,3087,,,,,,,,100.0
1,1,4530,28.57,28.57,14.29,3.6,3.57,0.0,0.0,21.4
2,2,4559,17.86,0.0,17.86,7.1,17.86,0.635,0.25,38.435
3,3,16087,57.14,5.36,3.57,7.1,17.86,1.22428,0.482,7.26372
4,4,16094,1.43,,,5.7,8.57,,,84.3


## Duplicate codes

In [38]:
columns_to_check = [
    "fat_100g",
    "saturated-fat_100g",
    "sugars_100g",
    "fiber_100g",
    "proteins_100g",
    "salt_100g",
    "sodium_100g",
    "autre",
]

In [40]:
# Number of codes whose rows disagree on at least one nutrient column
(product_df[["code"] + columns_to_check].groupby("code").nunique() > 1).any(axis=1).sum()

59

Summary:
- 23 missing codes,
- 320580 unique code values among the 320749 items with a non-missing code,
- 59 codes associated with at least 2 items that have different compositions.
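
The cells that produced the first two figures are not shown. The sketch below illustrates one way to obtain the three kinds of counts on a hypothetical toy DataFrame; the variable names are assumptions and the numbers are not those of `products.csv`.

```python
import pandas as pd

# Toy data: code "A" is duplicated with identical values,
# code "B" is duplicated with conflicting values, one code is missing.
toy_df = pd.DataFrame(
    {
        "code": ["A", "A", "B", "B", None, "C"],
        "fat_100g": [1.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

n_missing_codes = int(toy_df["code"].isna().sum())
n_with_code = int(toy_df["code"].notna().sum())
n_unique_codes = int(toy_df["code"].nunique())  # NaN is excluded by default

# groupby drops rows whose key is NaN; nunique counts distinct non-NaN values.
distinct_per_code = toy_df.groupby("code")["fat_100g"].nunique()
n_conflicting_codes = int((distinct_per_code > 1).sum())
```

Because `nunique` ignores NaN by default, a code whose rows differ only in that one has a missing value is not counted as conflicting.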

## Negative values

In [41]:
mask_neg = (product_df[columns_to_check] < 0).any(axis=1)
product_df[mask_neg]

Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre
8582,8582,11213420608,0.0,0.0,-1.2,1.2,2.41,0.38354,0.151,97.05546
18209,18209,21130493432,0.8,0.0,-0.8,0.8,0.8,0.87376,0.344,97.18224
23784,23784,28400231053,33.33,13.33,0.0,-6.7,,6.43382,2.533,51.07318
33781,33781,36800416727,46.43,8.93,3.57,3.6,-3.57,0.99822,0.393,39.64878
115310,115310,4029816,0.0,,,,-500.0,25.4,10.0,564.6
117739,117739,608866999263,3.57,0.0,-3.57,3.6,7.14,0.9525,0.375,87.9325
146284,146284,789280259062,13.33,3.33,-6.67,6.7,,2.032,0.8,80.478
150858,150858,813922021028,6.25,1.25,-6.25,1.2,1.25,1.1938,0.47,94.6362
164030,164030,856336001538,21.43,3.57,-17.86,17.9,17.86,1.93294,0.761,54.40606
169119,169119,875208001230,0.0,,0.0,,-800.0,7.62,3.0,889.38


In [60]:
sum(mask_neg)

11
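
To see which columns the negative values come from, the row mask can be complemented with a per-column count. This is a sketch on a hypothetical toy DataFrame, not the notebook's own code.

```python
import pandas as pd

# Toy data with a few negative and missing entries (illustrative values only).
toy_df = pd.DataFrame(
    {
        "sugars_100g": [-1.2, 0.0, 3.5, None],
        "proteins_100g": [2.4, -500.0, 1.0, -3.57],
    }
)

is_negative = toy_df < 0  # NaN compares as False

neg_per_column = is_negative.sum()             # one count per column
n_rows_with_neg = int(is_negative.any(axis=1).sum())  # rows with at least one negative

print(neg_per_column)
```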

## Salt and sodium

In [62]:
ratio_col = product_df["salt_100g"] / product_df["sodium_100g"]

In [67]:
expected_ratio = ratio_col[~ratio_col.isna()].mean()

In [74]:
expected_salt_100g = product_df["sodium_100g"] * expected_ratio

mask_inf = expected_salt_100g < product_df["salt_100g"] - 0.3
mask_sup = expected_salt_100g > product_df["salt_100g"] + 0.3

product_df[mask_inf | mask_sup]

Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre
21632,21632,24600017008,0.0,,,,0.0,102.0,40.0,0.0
109154,109154,96619911936,0.0,,,,0.0,107.0,42.0,0.0
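
In the rows shown by `head()`, salt is 2.54 times sodium (for example 0.635 / 0.25). If the estimated ratio is close to that value, the two flagged rows differ from the expected salt by only about 0.3 to 0.4 g (40 × 2.54 = 101.6 against 102), which just exceeds the absolute tolerance. The sketch below is a variant under stated assumptions: it uses the median ratio instead of the mean, and divides only where sodium is strictly positive. The toy DataFrame and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative): consistent rows, a zero row, a rounded large value, a missing salt.
toy_df = pd.DataFrame(
    {
        "salt_100g": [0.635, 1.22428, 0.0, 102.0, np.nan],
        "sodium_100g": [0.25, 0.482, 0.0, 40.0, 0.1],
    }
)

# Divide only where sodium is strictly positive to avoid 0/0 and x/0.
valid = toy_df["sodium_100g"] > 0
ratio = toy_df.loc[valid, "salt_100g"] / toy_df.loc[valid, "sodium_100g"]

# The median is less sensitive to extreme ratios than the mean; NaN is skipped.
expected_ratio = ratio.median()

TOLERANCE = 0.3  # same absolute tolerance as in the cell above
gap = (toy_df["salt_100g"] - toy_df["sodium_100g"] * expected_ratio).abs()
mask_inconsistent = gap > TOLERANCE  # rows with a missing value are not flagged
```

A relative tolerance (for example a percentage of the expected salt) would be another option if rounded values on salt-rich products should not be flagged.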


## Fat and saturated fat

In [93]:
fat_mask = product_df["saturated-fat_100g"] > product_df["fat_100g"]
print(sum(fat_mask))

na_fat_mask = product_df["fat_100g"].isna() & (~ product_df["saturated-fat_100g"].isna())
product_df[na_fat_mask]

354


Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre


In [92]:
product_df["fat_100g"] = np.where(
    na_fat_mask,
    product_df["saturated-fat_100g"],
    product_df["fat_100g"],
)
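
The comparison and the fill above can be reproduced on a small example to check how missing values behave. This is a sketch on a hypothetical toy DataFrame; it uses `fillna` in place of `np.where`, which should give the same result here.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame(
    {
        "fat_100g": [28.57, 10.0, np.nan, 5.0],
        "saturated-fat_100g": [28.57, 12.5, 3.0, np.nan],
    }
)

# Saturated fat is part of total fat, so it should never exceed it.
# Comparisons involving NaN are False, so incomplete rows are not flagged.
fat_mask = toy_df["saturated-fat_100g"] > toy_df["fat_100g"]

# Rows where only the saturated value is known.
na_fat_mask = toy_df["fat_100g"].isna() & toy_df["saturated-fat_100g"].notna()

# Use the saturated value as a lower bound for the missing total fat.
filled_fat = toy_df["fat_100g"].fillna(toy_df["saturated-fat_100g"])
```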

In [89]:
product_df.head()

Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre
0,0,3087,,,,,,,,100.0
1,1,4530,28.57,28.57,14.29,3.6,3.57,0.0,0.0,21.4
2,2,4559,17.86,0.0,17.86,7.1,17.86,0.635,0.25,38.435
3,3,16087,57.14,5.36,3.57,7.1,17.86,1.22428,0.482,7.26372
4,4,16094,1.43,,,5.7,8.57,,,84.3


## Sum to 100 grams

In [94]:
columns_to_sum = [
    "fat_100g",
    "sugars_100g",
    "fiber_100g",
    "proteins_100g",
    "salt_100g",
    "autre",
]

In [100]:
product_df["total_weight_100g"] = product_df[columns_to_sum].sum(axis=1, skipna=True)

display(product_df)

Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre,total_weight_100g
0,0,3087,,,,,,,,100.00000,100.000
1,1,4530,28.57,28.57,14.29,3.6,3.57,0.00000,0.000,21.40000,71.430
2,2,4559,17.86,0.00,17.86,7.1,17.86,0.63500,0.250,38.43500,99.750
3,3,16087,57.14,5.36,3.57,7.1,17.86,1.22428,0.482,7.26372,94.158
4,4,16094,1.43,,,5.7,8.57,,,84.30000,100.000
...,...,...,...,...,...,...,...,...,...,...,...
320767,320767,9948282780603,,,,,,,,100.00000,100.000
320768,320768,99567453,0.00,0.00,0.00,0.0,0.00,0.00000,0.000,100.00000,100.000
320769,320769,9970229501521,,,,,,,,100.00000,100.000
320770,320770,9980282863788,,,,,,,,100.00000,100.000


In [102]:
too_high_mask = product_df["total_weight_100g"] > 101
sum(too_high_mask)

888

In [103]:
too_low_mask = product_df["total_weight_100g"] < 99
sum(too_low_mask)

136932
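
With `skipna=True`, a missing nutrient is summed as if it were 0, so rows with missing values can fall below the threshold for that reason alone. The sketch below, on a hypothetical toy DataFrame, compares that sum with one restricted to complete rows through `min_count`; the variable names and the 1 g tolerance constant are assumptions that mirror the 99/101 thresholds.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame(
    {
        "fat_100g": [20.0, 1.43, 60.0],
        "sugars_100g": [30.0, np.nan, 50.0],
        "autre": [50.0, 84.3, 10.0],
    }
)
cols = ["fat_100g", "sugars_100g", "autre"]
TOLERANCE = 1.0  # same band as the 99 / 101 thresholds

# Missing values are counted as 0.
total = toy_df[cols].sum(axis=1, skipna=True)

# The result is NaN unless all columns are present in the row.
total_complete = toy_df[cols].sum(axis=1, min_count=len(cols))

too_high = total > 100 + TOLERANCE
too_low = total < 100 - TOLERANCE
too_low_complete = total_complete < 100 - TOLERANCE  # NaN compares as False
```

Comparing `too_low.sum()` with `too_low_complete.sum()` on the real data would show how many low totals involve at least one missing column.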

# With dask

In [53]:
import dask.dataframe as dd
# dask infers dtypes from a sample of the file, so `code` is read explicitly as text
product_ddf = dd.read_csv(product_file_path, dtype={'code': 'object'})


In [54]:
describe_df = product_df.describe()

In [55]:
display(describe_df)

Unnamed: 0.1,Unnamed: 0,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre
count,320772.0,243891.0,229554.0,244971.0,200886.0,259922.0,255510.0,255463.0,320772.0
mean,160385.5,12.730379,5.129932,16.003484,2.862111,7.07594,2.028624,0.798815,65.861511
std,92599.044612,17.578747,8.014238,22.327284,12.867578,8.409054,128.269454,50.504428,32.091021
min,0.0,0.0,0.0,-17.86,-6.7,-800.0,0.0,0.0,0.0
25%,80192.75,0.0,0.0,1.3,0.0,0.7,0.0635,0.025,41.907242
50%,160385.5,5.0,1.79,5.71,1.5,4.76,0.58166,0.229,75.670259
75%,240578.25,20.0,7.14,24.0,3.6,10.0,1.37414,0.541,94.145336
max,320771.0,714.29,550.0,3520.0,5380.0,430.0,64312.8,25320.0,889.38


In [56]:
mask_neg = (product_ddf[columns_to_check] < 0).any(axis=1)
neg_ddf = product_ddf[mask_neg]

In [121]:
pandas_mask_neg = (product_df[columns_to_check] < 0).any(axis=1)
pandas_neg_ddf = product_df[pandas_mask_neg]

In [122]:
print(neg_ddf)

Dask DataFrame Structure:
              Unnamed: 0    code fat_100g saturated-fat_100g sugars_100g fiber_100g proteins_100g salt_100g sodium_100g    autre
npartitions=1                                                                                                                   
                   int64  string  float64            float64     float64    float64       float64   float64     float64  float64
                     ...     ...      ...                ...         ...        ...           ...       ...         ...      ...
Dask Name: getitem, 6 graph layers


In [123]:
pandas_neg_ddf

Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre,total_weight_100g
8582,8582,11213420608,0.0,0.0,-1.2,1.2,2.41,0.38354,0.151,97.05546,99.849
18209,18209,21130493432,0.8,0.0,-0.8,0.8,0.8,0.87376,0.344,97.18224,99.656
23784,23784,28400231053,33.33,13.33,0.0,-6.7,,6.43382,2.533,51.07318,84.137
33781,33781,36800416727,46.43,8.93,3.57,3.6,-3.57,0.99822,0.393,39.64878,90.677
115310,115310,4029816,0.0,,,,-500.0,25.4,10.0,564.6,90.0
117739,117739,608866999263,3.57,0.0,-3.57,3.6,7.14,0.9525,0.375,87.9325,99.625
146284,146284,789280259062,13.33,3.33,-6.67,6.7,,2.032,0.8,80.478,95.87
150858,150858,813922021028,6.25,1.25,-6.25,1.2,1.25,1.1938,0.47,94.6362,98.28
164030,164030,856336001538,21.43,3.57,-17.86,17.9,17.86,1.93294,0.761,54.40606,95.669
169119,169119,875208001230,0.0,,0.0,,-800.0,7.62,3.0,889.38,97.0


In [124]:
neg_ddf.compute()

Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre
8582,8582,11213420608,0.0,0.0,-1.2,1.2,2.41,0.38354,0.151,97.05546
18209,18209,21130493432,0.8,0.0,-0.8,0.8,0.8,0.87376,0.344,97.18224
23784,23784,28400231053,33.33,13.33,0.0,-6.7,,6.43382,2.533,51.07318
33781,33781,36800416727,46.43,8.93,3.57,3.6,-3.57,0.99822,0.393,39.64878
115310,115310,4029816,0.0,,,,-500.0,25.4,10.0,564.6
117739,117739,608866999263,3.57,0.0,-3.57,3.6,7.14,0.9525,0.375,87.9325
146284,146284,789280259062,13.33,3.33,-6.67,6.7,,2.032,0.8,80.478
150858,150858,813922021028,6.25,1.25,-6.25,1.2,1.25,1.1938,0.47,94.6362
164030,164030,856336001538,21.43,3.57,-17.86,17.9,17.86,1.93294,0.761,54.40606
169119,169119,875208001230,0.0,,0.0,,-800.0,7.62,3.0,889.38


In [59]:
len(neg_ddf)

11