# Machine Learning course project: analysis of the Open Food Facts dataset

## Loading the data

### From the CSV format

In [14]:
import pandas as pd

In [15]:
# Tab-separated, gzip-compressed export; only the first 100 rows are read
path = "https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv.gz"
df = pd.read_csv(path, nrows=100, sep="\t", encoding="utf-8")

In [16]:
df

| | code | url | creator | created_t | created_datetime | last_modified_t | last_modified_datetime | last_modified_by | last_updated_t | last_updated_datetime | ... | glycemic-index_100g | water-hardness_100g | choline_100g | phylloquinone_100g | beta-glucan_100g | inositol_100g | carnitine_100g | sulphate_100g | nitrate_100g | acidity_100g |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 54 | http://world-en.openfoodfacts.org/product/0000... | kiliweb | 1582569031 | 2020-02-24T18:30:31Z | 1733085204 | 2024-12-01T20:33:24Z | | 1736276870 | 2025-01-07T19:07:50Z | ... | | | | | | | | | | |
| 1 | 63 | http://world-en.openfoodfacts.org/product/0000... | kiliweb | 1673620307 | 2023-01-13T14:31:47Z | 1732913331 | 2024-11-29T20:48:51Z | insectproductadd | 1738686803 | 2025-02-04T16:33:23Z | ... | | | | | | | | | | |
| 2 | 114 | http://world-en.openfoodfacts.org/product/0000... | kiliweb | 1580066482 | 2020-01-26T19:21:22Z | 1737247862 | 2025-01-19T00:51:02Z | smoothie-app | 1738687801 | 2025-02-04T16:50:01Z | ... | | | | | | | | | | |
| 3 | 1 | http://world-en.openfoodfacts.org/product/0000... | inf | 1634745456 | 2021-10-20T15:57:36Z | 1738676541 | 2025-02-04T13:42:21Z | waistline-app | 1738684455 | 2025-02-04T15:54:15Z | ... | | | | | | | | | | |
| 4 | 105 | http://world-en.openfoodfacts.org/product/0000... | kiliweb | 1572117743 | 2019-10-26T19:22:23Z | 1738073570 | 2025-01-28T14:12:50Z | | 1738685184 | 2025-02-04T16:06:24Z | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 91 | http://world-en.openfoodfacts.org/product/0000... | openfoodfacts-contributors | 1536994879 | 2018-09-15T07:01:19Z | 1729528185 | 2024-10-21T16:29:45Z | roboto-app | 1738681915 | 2025-02-04T15:11:55Z | ... | | | | | | | | | | |
| 96 | 92 | http://world-en.openfoodfacts.org/product/0000... | product-scan-com | 1638580447 | 2021-12-04T01:14:07Z | 1728651371 | 2024-10-11T12:56:11Z | fix-code-bot | 1738682899 | 2025-02-04T15:28:19Z | ... | | | | | | | | | | |
| 97 | 93 | http://world-en.openfoodfacts.org/product/0000... | openfoodfacts-contributors | 1537041112 | 2018-09-15T19:51:52Z | 1728236141 | 2024-10-06T17:35:41Z | macrofactor | 1738687164 | 2025-02-04T16:39:24Z | ... | | | | | | | | | | |
| 98 | 94 | http://world-en.openfoodfacts.org/product/0000... | openfoodfacts-contributors | 1572449868 | 2019-10-30T15:37:48Z | 1728034902 | 2024-10-04T09:41:42Z | fix-code-bot | 1737510101 | 2025-01-22T01:41:41Z | ... | | | | | | | | | | |
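
In the rows above, `code` is shown as a small integer (54, 63, ...), while the truncated URLs appear to contain zero-padded barcodes. This suggests that pandas inferred an integer type and dropped the leading zeros. The sketch below illustrates the effect and one possible fix, the `dtype` parameter of `read_csv`. It uses a small made-up tab-separated sample in place of the real export, so the barcode values are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Made-up tab-separated sample standing in for the real export (assumption).
sample = "code\tcreator\n0000000000054\tkiliweb\n0000000000063\tkiliweb\n"

# Default type inference parses the barcodes as integers and drops the zeros.
inferred = pd.read_csv(io.StringIO(sample), sep="\t")

# Forcing a string dtype keeps each barcode exactly as written in the file.
as_text = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"code": str})

print(inferred["code"].tolist())
print(as_text["code"].tolist())
```

The same `dtype={"code": str}` argument could be passed to the `read_csv` call on the real file if the barcodes need to be kept verbatim, for example to join against another source.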


In [17]:
numeric_cols, ordinal_cols, non_ordinal_cols = detect_and_filter_columns(df, max_categories=3)

# Example: Access numeric columns
print("\nNumeric Columns DataFrame:")
for col, values in numeric_cols.items():
    print(f"{col}: {values.tolist()}")


Numeric Columns DataFrame:
code: [54, 63, 114, 1, 105, 2, 3, 4, 475, 5, 6, 6666, 7, 8, 9, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59, 60, 61, 62, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]
created_t: [1582569031, 1673620307, 1580066482, 1634745456, 1572117743, 1722606455, 1716818343, 1560176426, 1714206330, 1605337720, 1732037972, 1709219541, 1678803019, 1609862762, 1527242583, 1677761447, 1476947941, 1718649410, 1578683912, 1553970319, 1710345659, 1523810594, 1703127173, 1728736811, 1729882037, 1718717714, 1536930846, 1560170250, 1614525537, 1669383630, 1579212608, 1673331788, 1572188426, 1559030989, 1626699977, 1648551873, 1481840144, 1673637219, 1719401058, 1622534821, 1536945788, 1545469597, 1537366963, 1551029683, 1535603981, 1537945936, 172743
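
The helper `detect_and_filter_columns` is called above but is not defined in this section. To make the cell easier to follow, here is a hypothetical stand-in, deliberately given a different name (`detect_and_filter_columns_sketch`). It assumes that the helper drops entirely empty columns, returns numeric columns as one DataFrame, and splits the remaining columns by their number of distinct values using `max_categories`. The real helper may well use different rules, in particular for what counts as "ordinal".

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def detect_and_filter_columns_sketch(df, max_categories=3):
    """Hypothetical stand-in, not the notebook's actual helper."""
    # Columns with no values at all carry no information, so drop them first.
    usable = df.dropna(axis=1, how="all")

    numeric, low_cardinality, high_cardinality = [], [], []
    for col in usable.columns:
        if is_numeric_dtype(usable[col]):
            numeric.append(col)
        elif usable[col].nunique() <= max_categories:
            # Assumption: few distinct values is treated as "ordinal" here.
            low_cardinality.append(col)
        else:
            high_cardinality.append(col)

    return usable[numeric], usable[low_cardinality], usable[high_cardinality]


# Tiny made-up frame to exercise the sketch.
demo = pd.DataFrame({
    "code": [1, 2, 3, 4],
    "creator": ["a", "b", "c", "d"],
    "grade": ["x", "y", "x", "y"],
    "empty": [None, None, None, None],
})
num_df, ord_df, other_df = detect_and_filter_columns_sketch(demo, max_categories=3)
print(list(num_df.columns), list(ord_df.columns), list(other_df.columns))
```

Returning DataFrames keeps the sketch compatible with the loop above, since `DataFrame.items()` yields `(column name, Series)` pairs.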

### From the Parquet format

Parquet is a columnar format optimized for working with large datasets.

The dataset can also be loaded in this format from [its location on Hugging Face](https://huggingface.co/datasets/openfoodfacts/product-database). This requires the libraries listed below.

If they are not already installed, install the required libraries:
```
pip install huggingface-hub
pip install fastparquet
```

In [18]:
# Login using e.g. `huggingface-cli login` to access this dataset
splits = {'food': 'food.parquet', 'beauty': 'beauty.parquet'}
df = pd.read_parquet("hf://datasets/openfoodfacts/product-database/" + splits["food"])

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
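
The `ImportError` above means that neither `pyarrow` nor `fastparquet` could be imported in the running kernel. A minimal sketch of a check that can be run before calling `read_parquet`, using only the standard library (the function name `available_parquet_engines` is made up for this example):

```python
import importlib.util


def available_parquet_engines():
    """Return the parquet engines named in the error that are importable here."""
    candidates = ("pyarrow", "fastparquet")
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()
if engines:
    print("Parquet engines found:", ", ".join(engines))
else:
    print("No parquet engine found; install pyarrow or fastparquet first.")
```

If the list is empty after running `pip install`, the package was probably installed into a different environment than the one the notebook kernel uses. Restarting the kernel after installation is also worth trying.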

### Other methods

For full details on the various options for loading the data, see the [project's dedicated page](https://world.openfoodfacts.org/data).