### Identify outliers in `products.csv`

Use `pandas` to load and analyze the data in the CSV file `products.csv`. Identify the outliers (order of magnitude: a few hundred).

In [1]:
from pathlib import Path

import pandas as pd

In [2]:
DATA_DIR = Path("../../data")
product_file_path = DATA_DIR / "products.csv"

In [3]:
product_df = pd.read_csv(product_file_path, low_memory=False)

In [4]:
product_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320772 entries, 0 to 320771
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Unnamed: 0          320772 non-null  int64  
 1   code                320749 non-null  object 
 2   fat_100g            243891 non-null  float64
 3   saturated-fat_100g  229554 non-null  float64
 4   sugars_100g         244971 non-null  float64
 5   fiber_100g          200886 non-null  float64
 6   proteins_100g       259922 non-null  float64
 7   salt_100g           255510 non-null  float64
 8   sodium_100g         255463 non-null  float64
 9   autre               320772 non-null  float64
dtypes: float64(8), int64(1), object(1)
memory usage: 24.5+ MB


In [5]:
product_df.head(10)

Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre
0,0,3087,,,,,,,,100.0
1,1,4530,28.57,28.57,14.29,3.6,3.57,0.0,0.0,21.4
2,2,4559,17.86,0.0,17.86,7.1,17.86,0.635,0.25,38.435
3,3,16087,57.14,5.36,3.57,7.1,17.86,1.22428,0.482,7.26372
4,4,16094,1.43,,,5.7,8.57,,,84.3
5,5,16100,18.27,1.92,11.54,7.7,13.46,,,47.11
6,6,16117,,,,,8.89,,,91.11
7,7,16124,18.75,4.69,15.62,9.4,14.06,0.1397,0.055,37.2853
8,8,16193,37.5,22.5,42.5,7.5,5.0,,,0.0
9,9,16513,100.0,7.14,,,,,,0.0


In [6]:
product_df.describe(include="all").round(2)

Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre
count,320772.0,320749.0,243891.0,229554.0,244971.0,200886.0,259922.0,255510.0,255463.0,320772.0
unique,,320579.0,,,,,,,,
top,,70650800367.0,,,,,,,,
freq,,3.0,,,,,,,,
mean,160385.5,,12.73,5.13,16.0,2.86,7.08,2.03,0.8,65.86
std,92599.04,,17.58,8.01,22.33,12.87,8.41,128.27,50.5,32.09
min,0.0,,0.0,0.0,-17.86,-6.7,-800.0,0.0,0.0,0.0
25%,80192.75,,0.0,0.0,1.3,0.0,0.7,0.06,0.02,41.91
50%,160385.5,,5.0,1.79,5.71,1.5,4.76,0.58,0.23,75.67
75%,240578.25,,20.0,7.14,24.0,3.6,10.0,1.37,0.54,94.15
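
The minima above already show negative values (for example -800.0 for `proteins_100g`), which cannot be valid for a quantity expressed per 100 g. A minimal sketch of a per-column range check follows, assuming that valid values lie between 0 and 100. The helper `summarize_out_of_range` and the small `sample_df` frame are hypothetical and only serve as an illustration; they are not part of the original notebook.

```python
import pandas as pd


def summarize_out_of_range(df, columns, lower=0.0, upper=100.0):
    """Count, per column, the values below `lower` and above `upper`.

    Hypothetical helper. Comparisons with NaN evaluate to False,
    so missing values are never counted.
    """
    values = df[columns]
    return pd.DataFrame(
        {
            "below_lower": (values < lower).sum(),
            "above_upper": (values > upper).sum(),
        }
    )


# Tiny illustrative frame with made-up values, not the real data set.
sample_df = pd.DataFrame(
    {
        "fat_100g": [28.57, 0.0, None, 105.0],
        "sugars_100g": [14.29, -1.2, 0.0, None],
        "proteins_100g": [3.57, 2.41, -800.0, 250.0],
    }
)

summary = summarize_out_of_range(sample_df, list(sample_df.columns))
print(summary)
```

On the real data, the same call could be made with `product_df` and the list of `*_100g` columns.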


In [7]:
sorted_product_df = product_df.sort_values("fat_100g", ascending=False)
sorted_product_df.head(10)


Products with a missing code

In [8]:
mask = product_df["code"].isna()
print(mask.value_counts())
display(product_df[mask])

code
False    320749
True         23
Name: count, dtype: int64


Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre
189068,189068,,,,,,,,,100.0
189103,189103,,,,,,0.137,,,99.863
189109,189109,,,,,,,,,100.0
189119,189119,,,,,,0.122,,,99.878
189152,189152,,,,,,0.158,,,99.842
189160,189160,,,,,,0.156,,,99.844
189162,189162,,,,,,0.158,,,99.842
189168,189168,,,,,,0.12,,,99.88
189242,189242,,,,,,,,,100.0
189244,189244,,,,,,,,,100.0
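
Missing codes are one identifier problem. The `describe` output also reports fewer unique codes (320579) than non-null codes (320749), which suggests that some codes appear more than once. A possible way to list them is sketched below, on a small made-up frame: only the code `70650800367` (reported as `top` by `describe`) comes from the data, and the other values are illustrative assumptions.

```python
import pandas as pd

# Tiny illustrative frame; the nutrient values are made up.
sample_df = pd.DataFrame(
    {
        "code": ["3087", "70650800367", "70650800367", None, "4530", None],
        "fat_100g": [None, 1.0, 1.5, None, 28.57, None],
    }
)

# duplicated() treats missing values as equal to each other, so rows
# without a code are excluded explicitly.
has_code = sample_df["code"].notna()

# keep=False marks every member of a duplicated group, not only the repeats.
duplicated_mask = has_code & sample_df["code"].duplicated(keep=False)

duplicated_rows = sample_df[duplicated_mask].sort_values("code")
print(duplicated_rows)
```

Whether duplicated codes count as outliers or simply as rows to merge depends on the analysis; the sketch only locates them.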


In [12]:
negative_mask = (
    (product_df["fat_100g"] < 0)
    |
    (product_df["saturated-fat_100g"] < 0)
    |
    (product_df["sugars_100g"] < 0)
    |
    (product_df["fiber_100g"] < 0)
    |
    (product_df["proteins_100g"] < 0)
    |
    (product_df["salt_100g"] < 0)
    |
    (product_df["sodium_100g"] < 0)
    |
    (product_df["autre"] < 0)
)

In [9]:
columns_to_check = [
    "fat_100g",
    "saturated-fat_100g",
    "sugars_100g",
    "fiber_100g",
    "proteins_100g",
    "salt_100g",
    "sodium_100g",
    "autre",
]

# NaN < 0 evaluates to False, so missing values are never flagged.
negative_mask = (product_df[columns_to_check] < 0).any(axis=1)


In [13]:
negative_mask.value_counts()

False    320761
True         11
Name: count, dtype: int64

In [11]:
product_df[negative_mask]

Unnamed: 0.1,Unnamed: 0,code,fat_100g,saturated-fat_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,autre
8582,8582,11213420608,0.0,0.0,-1.2,1.2,2.41,0.38354,0.151,97.05546
18209,18209,21130493432,0.8,0.0,-0.8,0.8,0.8,0.87376,0.344,97.18224
23784,23784,28400231053,33.33,13.33,0.0,-6.7,,6.43382,2.533,51.07318
33781,33781,36800416727,46.43,8.93,3.57,3.6,-3.57,0.99822,0.393,39.64878
115310,115310,4029816,0.0,,,,-500.0,25.4,10.0,564.6
117739,117739,608866999263,3.57,0.0,-3.57,3.6,7.14,0.9525,0.375,87.9325
146284,146284,789280259062,13.33,3.33,-6.67,6.7,,2.032,0.8,80.478
150858,150858,813922021028,6.25,1.25,-6.25,1.2,1.25,1.1938,0.47,94.6362
164030,164030,856336001538,21.43,3.57,-17.86,17.9,17.86,1.93294,0.761,54.40606
169119,169119,875208001230,0.0,,0.0,,-800.0,7.62,3.0,889.38
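
Negative values are one type of outlier. In the rows displayed above, `salt_100g` is always 2.54 times `sodium_100g` (for example 25.4 and 10.0), so a row where both are present but disagree could be treated as another type. The sketch below assumes that this ratio is expected to hold throughout the file, which the notebook does not state. The constant `SALT_PER_SODIUM`, the 1% tolerance and the last two rows of `sample_df` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumption: ratio observed in the displayed rows, not a documented rule.
SALT_PER_SODIUM = 2.54

# First three rows are taken from the head() output; the last two are made up.
sample_df = pd.DataFrame(
    {
        "salt_100g": [0.635, 1.22428, 0.1397, 5.0, None],
        "sodium_100g": [0.25, 0.482, 0.055, 0.5, 0.3],
    }
)

# Only rows where both values are present can be compared.
both_present = sample_df["salt_100g"].notna() & sample_df["sodium_100g"].notna()

# Allow a 1% relative tolerance to absorb rounding in the stored values.
consistent = np.isclose(
    sample_df["salt_100g"],
    SALT_PER_SODIUM * sample_df["sodium_100g"],
    rtol=1e-2,
)

inconsistent_mask = both_present & ~consistent
print(sample_df[inconsistent_mask])
```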


Bonus: redo the analysis of one type of outlier using Dask.

In [None]:
# Your code here