# Explanation of the initialization cells
We tested several setups: locally with PyCharm, and in the cloud with Kaggle and Google Colab.
Each environment therefore has its own configuration cell below.
For a local run, check that Java 11 is installed. On Kaggle, upload the CSV as a dataset so that it is available to the notebook.
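
As a rough sketch of how the per-environment setup could be automated, the snippet below picks the file paths used later in this notebook and checks that a `java` executable is on the `PATH`. The helper names (`detect_environment`, `data_paths`, `java_available`) are hypothetical, and testing for the `/kaggle/input` directory is an assumed heuristic, not something the notebook does. Google Colab is left out because no Colab paths are given here.

```python
import os
import shutil

# Paths used later in the notebook, keyed by environment.
PATHS = {
    "kaggle": (
        "/kaggle/input/openfoodfacts/en.openfoodfacts.org.products.csv",
        "/kaggle/working/en.openfoodfacts.org.products.parquet",
    ),
    "local": (
        "./data/en.openfoodfacts.org.products.csv",
        "./data/en.openfoodfacts.org.products.parquet",
    ),
}

def detect_environment(kaggle_root="/kaggle/input"):
    """Return "kaggle" when the Kaggle input directory exists, else "local" (assumed heuristic)."""
    return "kaggle" if os.path.isdir(kaggle_root) else "local"

def data_paths(environment):
    """Return (csv_path, parquet_path) for the given environment name."""
    return PATHS[environment]

def java_available():
    """True when a `java` executable is found on the PATH; the version is not checked."""
    return shutil.which("java") is not None

environment = detect_environment()
file_path_csv, file_path_parquet = data_paths(environment)
print(f"Environment: {environment}, CSV: {file_path_csv}, Java found: {java_available()}")
```

With this in place, only one configuration cell would be needed instead of one per environment.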


In [1]:
import time
from pyspark.sql import SparkSession

In [2]:
# Kaggle configuration: install PySpark and point to the uploaded dataset
!pip install pyspark
file_path_csv = "/kaggle/input/openfoodfacts/en.openfoodfacts.org.products.csv"
file_path_parquet = "/kaggle/working/en.openfoodfacts.org.products.parquet"

In [2]:
# Local configuration: the CSV is expected in ./data
file_path_csv = "./data/en.openfoodfacts.org.products.csv"
file_path_parquet = "./data/en.openfoodfacts.org.products.parquet"

In [3]:
start_time = time.time()
print("Démarrage du script...")

# Initialize a SparkSession with reduced logging
spark = SparkSession.builder \
    .appName("Exploration OpenFoodFacts") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")  # Only log errors

print("PySpark chargé")

try:
    # Load the tab-separated CSV file as a Spark DataFrame, then keep a 20% sample
    df_csv_before_sample = spark.read.csv(file_path_csv, header=True, inferSchema=True, sep="\t")
    print("Fichier CSV chargé.")
    df_csv = df_csv_before_sample.sample(withReplacement=False, fraction=0.2)  # 20% sample; no seed is set, so it is not reproducible
    print("Echantillonage terminé")

    # Save the DataFrame in Parquet format
    df_csv.write.parquet(file_path_parquet, mode="overwrite")
    print("Données sauvegardées au format Parquet.")

    # Reload the Parquet file for later analysis
    df_parquet = spark.read.parquet(file_path_parquet)
    print("Fichier Parquet chargé.")

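    # Create the Hive table
    print("Création et insertion dans la table Hive...")

    # Write the sampled data to a table named "hive_table".
    # Hive support is not enabled on the session, so this is a Spark-managed table.
    df_csv.write.mode("overwrite").saveAsTable("hive_table")

except Exception as e:
    # Report the actual error instead of hiding it
    print(f"Error during initialization: {e}")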

finally:
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Temps d'exécution : {elapsed_time:.2f} secondes")


Démarrage du script...

24/12/03 11:29:23 WARN Utils: Your hostname, cedric-galaxy resolves to a loopback address: 127.0.1.1; using 192.168.1.26 instead (on interface wlo1)
24/12/03 11:29:23 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/03 11:29:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

PySpark chargé
Fichier CSV chargé.
Echantillonage terminé
Données sauvegardées au format Parquet.
Fichier Parquet chargé.
Création et insertion dans la table Hive...
Temps d'exécution : 78.20 secondes


                                                                                

In [4]:
# Time a row count on the DataFrame sampled from the CSV
csv_start_time = time.time()
csv_row_count = df_csv.count()  # Not cached, so this re-reads the CSV file
csv_end_time = time.time()
csv_elapsed_time_ms = (csv_end_time - csv_start_time) * 1000
print(f"CSV (comptage des lignes): {csv_elapsed_time_ms:.3f} ms - Nombre de lignes: {csv_row_count}")

# Time a row count on the Parquet DataFrame
parquet_start_time = time.time()
parquet_row_count = df_parquet.count()
parquet_end_time = time.time()
parquet_elapsed_time_ms = (parquet_end_time - parquet_start_time) * 1000
print(f"Parquet (comptage des lignes): {parquet_elapsed_time_ms:.3f} ms - Nombre de lignes: {parquet_row_count}")

# Time a row count on the Hive table
hive_start_time = time.time()
df_hive = spark.sql("SELECT * FROM hive_table")  # Query the table
hive_row_count = df_hive.count()
hive_end_time = time.time()
hive_elapsed_time_ms = (hive_end_time - hive_start_time) * 1000
print(f"Hive (comptage des lignes): {hive_elapsed_time_ms:.3f} ms - Nombre de lignes: {hive_row_count}")

# Compare the timings
print("\nComparaison des performances :")
print(f"CSV Execution Time: {csv_elapsed_time_ms:.3f} ms")
print(f"Parquet Execution Time: {parquet_elapsed_time_ms:.3f} ms")
print(f"Hive Execution Time: {hive_elapsed_time_ms:.3f} ms")

                                                                                

CSV (comptage des lignes): 6113.584 ms - Nombre de lignes: 702698
Parquet (comptage des lignes): 569.154 ms - Nombre de lignes: 702698
Hive (comptage des lignes): 577.465 ms - Nombre de lignes: 702698

Comparaison des performances :
CSV Execution Time: 6113.584 ms
Parquet Execution Time: 569.154 ms
Hive Execution Time: 577.465 ms
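
The three measurements above repeat the same start/stop pattern. One possible way to factor it out is a small helper, sketched below. The name `timed` is hypothetical, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals. The workload here is a stand-in so the snippet runs without Spark.

```python
import time

def timed(label, action):
    """Run `action()` once and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = action()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label}: {elapsed_ms:.3f} ms - result: {result}")
    return result, elapsed_ms

# Stand-in workload; in the notebook this could be e.g. `df_parquet.count`.
rows, elapsed_ms = timed("demo count", lambda: sum(1 for _ in range(1000)))
```

In the notebook, a call such as `timed("Parquet", df_parquet.count)` would replace each block of five lines. A single run per source remains a rough indication only, since the first action on a fresh session also pays warm-up costs.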


# Preliminary analysis
### 1. Overview of rows and columns



In [6]:
# Time a row count on the DataFrame sampled from the CSV
csv_start_time = time.time()
csv_row_count = df_csv.count()
csv_end_time = time.time()
csv_elapsed_time_ms = (csv_end_time - csv_start_time) * 1000
print(f"CSV (comptage des lignes): {csv_elapsed_time_ms:.3f} ms - Nombre de lignes: {csv_row_count}")

# Time a row count on the Parquet DataFrame
parquet_start_time = time.time()
parquet_row_count = df_parquet.count()
parquet_end_time = time.time()
parquet_elapsed_time_ms = (parquet_end_time - parquet_start_time) * 1000
print(f"Parquet (comptage des lignes): {parquet_elapsed_time_ms:.3f} ms - Nombre de lignes: {parquet_row_count}")

# Time a row count on the Hive table
hive_start_time = time.time()
df_hive = spark.sql("SELECT * FROM hive_table")
hive_row_count = df_hive.count()
hive_end_time = time.time()
hive_elapsed_time_ms = (hive_end_time - hive_start_time) * 1000
print(f"Hive (comptage des lignes): {hive_elapsed_time_ms:.3f} ms - Nombre de lignes: {hive_row_count}")

# Compare the timings
print("\nComparaison des performances :")
print(f"CSV Execution Time: {csv_elapsed_time_ms:.3f} ms")
print(f"Parquet Execution Time: {parquet_elapsed_time_ms:.3f} ms")
print(f"Hive Execution Time: {hive_elapsed_time_ms:.3f} ms")

                                                                                

CSV (comptage des lignes): 4385.785 ms - Nombre de lignes: 702698
Parquet (comptage des lignes): 117.743 ms - Nombre de lignes: 702698
Hive (comptage des lignes): 137.143 ms - Nombre de lignes: 702698

Comparaison des performances :
CSV Execution Time: 4385.785 ms
Parquet Execution Time: 117.743 ms
Hive Execution Time: 137.143 ms


### 2. Handling missing values


In [12]:
from pyspark.sql.functions import col, count, when

start_time = time.time()

# Compute, for each column, the fraction of missing values (null or empty string), between 0.0 and 1.0
total_rows = df_parquet.count()
missing_data = (
    df_parquet.select([
        (count(when(col(c).isNull() | (col(c) == ""), c)) / total_rows).alias(c)
        for c in df_parquet.columns
    ])
)

# Transform columns into rows (melt operation)
missing_data_melted = missing_data.selectExpr(
    "stack({0}, {1}) as (Column, MissingPercentage)".format(
        len(df_parquet.columns),
        # The loop variable is called `name` so it is not confused with the imported col()
        ", ".join([f"'{name}', `{name}`" for name in df_parquet.columns])
    )
).filter(col("MissingPercentage").isNotNull()).orderBy(col("MissingPercentage").desc())

# Identify columns with 100% missing data
columns_to_drop = (
    missing_data_melted.filter(col("MissingPercentage") == 1.0)
    .select("Column")
    .rdd.flatMap(lambda x: x)
    .collect()
)

# Drop columns with 100% missing values
df_cleanedby_missing_value = df_parquet.drop(*columns_to_drop)

# Display the top 10 columns with the highest missing percentages
print("Top 10 columns with the highest missing percentages:")
missing_data_melted.show(10, truncate=False)

# Print dropped columns
print(f"Columns dropped due to 100% missing values: {columns_to_drop}")

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Execution completed in {elapsed_time:.2f} seconds")


                                                                                

Top 10 columns with the highest missing percentages:




+--------------------------------+------------------+
|Column                          |MissingPercentage |
+--------------------------------+------------------+
|cities                          |1.0               |
|allergens_en                    |1.0               |
|additives                       |0.9999985769135532|
|nutrition-score-uk_100g         |0.9999985769135532|
|nervonic-acid_100g              |0.9999971538271064|
|elaidic-acid_100g               |0.9999971538271064|
|chlorophyl_100g                 |0.9999971538271064|
|water-hardness_100g             |0.9999971538271064|
|gamma-linolenic-acid_100g       |0.9999957307406596|
|dihomo-gamma-linolenic-acid_100g|0.9999957307406596|
+--------------------------------+------------------+
only showing top 10 rows

Columns dropped due to 100% missing values: ['cities', 'allergens_en']
Execution completed in 15.94 seconds
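
Note that the `MissingPercentage` column holds fractions between 0 and 1, and that several columns sit just below 1.0, so the `== 1.0` filter keeps them. A threshold-based variant could look like the plain-Python sketch below. The values are copied from the table above; the 0.99 cutoff and the name `columns_to_drop_above` are assumptions for illustration.

```python
# Fractions copied from the table above (1.0 means the column is always missing).
missing_fractions = {
    "cities": 1.0,
    "allergens_en": 1.0,
    "additives": 0.9999985769135532,
    "nutrition-score-uk_100g": 0.9999985769135532,
    "nervonic-acid_100g": 0.9999971538271064,
}

def columns_to_drop_above(fractions, threshold):
    """Return the column names whose missing fraction is >= threshold, sorted by name."""
    return sorted(name for name, fraction in fractions.items() if fraction >= threshold)

print(columns_to_drop_above(missing_fractions, 1.0))   # only fully empty columns
print(columns_to_drop_above(missing_fractions, 0.99))  # hypothetical 99% cutoff
```

With the strict threshold only `cities` and `allergens_en` are selected, as in the notebook; with the looser one all five columns listed here would be dropped.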


                                                                                

### 3. Handling duplicates

In [15]:
from pyspark.sql.functions import col, count

start_time = time.time()

# Analyzing Duplicates in 'code', 'product_name', and 'brands'
duplicates = (
    df_cleanedby_missing_value.groupBy("code", "product_name", "brands")
    .count()
    .filter(col("count") > 1)
)

# Print the number of (code, product_name, brands) groups that occur more than once
print(f"There are {duplicates.count()} duplicate rows based on 'code', 'product_name', and 'brands'.")
duplicates.show(truncate=False)

# Remove duplicates where 'code', 'product_name', and 'brands' are the same
df_cleaned_by_duplicate = df_cleanedby_missing_value.dropDuplicates(["code", "product_name", "brands"])

print(f"Number of rows before removing duplicates: {df_cleanedby_missing_value.count()}")
print(f"Number of rows after removing duplicates: {df_cleaned_by_duplicate.count()}")

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Execution completed in {elapsed_time:.2f} seconds")


There are 87 duplicate rows based on 'code', 'product_name', and 'brands'.
+---------------------+-----------------------------------------------------+----------------------------------------+-----+
|code                 |product_name                                         |brands                                  |count|
+---------------------+-----------------------------------------------------+----------------------------------------+-----+
|3.250393046759E12    |Porc mariné longue conservation                      |Jean roze                               |2    |
|3.25456643534E12     |ERREUR_IMAGES                                        |Pierre Chanau,Auchan                    |2    |
|3.25456649037E12     |Saucisses de volaille                                |Auchan                                  |2    |
|3.596710344819E12    |Rôti de Porc cuit supérieur                          |Auchan                                  |2    |
|3.596710408924E12    |Chiffonnade de jambon cuit 

### 4. Handle outliers

In [14]:
from pyspark.sql.functions import col, regexp_extract
from pyspark.sql.types import IntegerType, DoubleType, FloatType

# Extract the first run of digits from the "quantity" column as a number
df_cleaned_by_outliers = df_cleaned_by_duplicate.withColumn(
    "quantity_numeric",
    regexp_extract(col("quantity"), r"(\d+)", 1).cast("double")
)

# Detect the numeric columns
numeric_columns = [
    field.name for field in df_cleaned_by_outliers.schema.fields
    if isinstance(field.dataType, (IntegerType, DoubleType, FloatType))
]
print(f"Numeric columns detected: {numeric_columns}")

if not numeric_columns:
    print("No numeric columns found. Please check your data.")
else:
    total_rows_before = df_cleaned_by_outliers.count()
    removed_rows_total = 0

    # Loop over the numeric columns to detect outliers
    for column in numeric_columns:
        try:
            # Compute the quartiles with approxQuantile (5% relative error)
            quantiles = df_cleaned_by_outliers.approxQuantile(column, [0.25, 0.75], 0.05)
            if len(quantiles) < 2:
                print(f"Column '{column}' has insufficient data. Skipping...")
                continue

            q1, q3 = quantiles
            iqr = q3 - q1
            lower_bound = q1 - 1.5 * iqr
            upper_bound = q3 + 1.5 * iqr

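            # Count the rows outside the IQR fences (nothing is removed from the DataFrame here)
            df_outliers = df_cleaned_by_outliers.filter((col(column) < lower_bound) | (col(column) > upper_bound))
            removed_rows = df_outliers.count()
            # A row that is an outlier in several columns is counted once per column
            removed_rows_total += removed_rows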

            print(f"Column '{column}': {removed_rows} rows detected as outliers.")

        except Exception as e:
            print(f"Error processing column '{column}': {e}")

    print(f"Total rows before filtering: {total_rows_before}")
    print(f"Total rows removed as outliers: {removed_rows_total}")
    print(f"Total rows remaining: {total_rows_before - removed_rows_total}")

Numeric columns detected: ['code', 'created_t', 'last_modified_t', 'last_updated_t', 'serving_quantity', 'additives_n', 'nutriscore_score', 'nova_group', 'ecoscore_score', 'product_quantity', 'unique_scans_n', 'completeness', 'last_image_t', 'energy-kj_100g', 'energy-kcal_100g', 'energy_100g', 'energy-from-fat_100g', 'fat_100g', 'saturated-fat_100g', 'butyric-acid_100g', 'caproic-acid_100g', 'caprylic-acid_100g', 'capric-acid_100g', 'lauric-acid_100g', 'myristic-acid_100g', 'palmitic-acid_100g', 'stearic-acid_100g', 'arachidic-acid_100g', 'behenic-acid_100g', 'lignoceric-acid_100g', 'cerotic-acid_100g', 'montanic-acid_100g', 'melissic-acid_100g', 'unsaturated-fat_100g', 'monounsaturated-fat_100g', 'omega-9-fat_100g', 'polyunsaturated-fat_100g', 'omega-3-fat_100g', 'omega-6-fat_100g', 'alpha-linolenic-acid_100g', 'eicosapentaenoic-acid_100g', 'docosahexaenoic-acid_100g', 'linoleic-acid_100g', 'arachidonic-acid_100g', 'gamma-linolenic-acid_100g', 'dihomo-gamma-linolenic-acid_100g', 'olei

                                                                                

Column 'created_t': 3302 rows detected as outliers.
Column 'last_modified_t': 3453 rows detected as outliers.
Column 'last_updated_t': 0 rows detected as outliers.
Column 'serving_quantity': 23175 rows detected as outliers.
Column 'additives_n': 20688 rows detected as outliers.
Column 'nutriscore_score': 68 rows detected as outliers.
Column 'nova_group': 21424 rows detected as outliers.
Column 'ecoscore_score': 905 rows detected as outliers.
Column 'product_quantity': 20553 rows detected as outliers.
Column 'unique_scans_n': 30026 rows detected as outliers.
Column 'completeness': 39599 rows detected as outliers.
Column 'last_image_t': 9984 rows detected as outliers.
Column 'energy-kj_100g': 829 rows detected as outliers.
Column 'energy-kcal_100g': 8312 rows detected as outliers.
Column 'energy_100g': 8359 rows detected as outliers.
Column 'energy-from-fat_100g': 22 rows detected as outliers.
Column 'fat_100g': 28966 rows detected as outliers.
Column 'saturated-fat_100g': 60539 rows det

                                                                                

Column 'quantity_numeric': 10045 rows detected as outliers.
Total rows before filtering: 702611
Total rows removed as outliers: 586842
Total rows remaining: 115769
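
The total above is the sum of the per-column counts, so a row that is extreme in several columns is counted several times, and no row is actually filtered out of `df_cleaned_by_outliers`. The sketch below illustrates the difference between the summed count and the number of distinct outlier rows. It is plain Python on made-up values, and it uses exact quartiles from `statistics.quantiles` rather than Spark's `approxQuantile`; the helper names are hypothetical.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_rows(rows, column):
    """Return the set of row indices whose value in `column` lies outside the fences."""
    values = [row[column] for row in rows if row[column] is not None]
    lower, upper = iqr_bounds(values)
    return {
        index for index, row in enumerate(rows)
        if row[column] is not None and not (lower <= row[column] <= upper)
    }

# Made-up rows: the last one is extreme in both columns.
rows = [{"fat_100g": f, "sugars_100g": s}
        for f, s in [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (90, 95)]]

per_column = {c: outlier_rows(rows, c) for c in ("fat_100g", "sugars_100g")}
summed = sum(len(indices) for indices in per_column.values())
distinct = len(set().union(*per_column.values()))
print(f"sum of per-column counts: {summed}, distinct outlier rows: {distinct}")
```

Here the summed count is 2 while only 1 distinct row is affected, which is why "Total rows remaining" in the output above should be read as a lower bound rather than the size of a filtered DataFrame.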


# Data cleaning


In [None]:
# Display the schema of the cleaned DataFrame
df_cleaned_by_outliers.printSchema()

In [None]:
# Select the columns to keep
selected_column = [
    'code',
    'product_name',
    'brands',
    'categories',
    "main_category",
    'quantity',
    'packaging',
    'countries',
    'ingredients_text',
    'allergens',
    'serving_size',
    'energy-kcal_100g',
    'fat_100g',
    'saturated-fat_100g',
    "proteins_100g",
    'sugars_100g',
    'salt_100g',
    'nutriscore_score',
    'nutriscore_grade',
    "food_groups_en",
    # The export names these columns with a _100g suffix; "sugar" is already covered by sugars_100g
    "sodium_100g",
    "fiber_100g"
]

# Start from the deduplicated DataFrame so the earlier cleaning steps are kept
df_transformed = df_cleaned_by_duplicate.select(selected_column)
df_transformed.show(5, truncate=False)

In [None]:
# Cast the numeric columns to double.
# "quantity" is left out: it holds free text (the outlier step extracts its digits with a regex),
# so a plain cast to double would turn those values into null.
column_to_convert = ["nutriscore_score", "energy-kcal_100g",
                     "fat_100g", "saturated-fat_100g", "proteins_100g", "sugars_100g", "salt_100g"]
# Apply the conversion
for column in column_to_convert:
    df_transformed = df_transformed.withColumn(column, col(column).cast("double"))


In [None]:
# display the schema of the transformed DataFrame
df_transformed.printSchema()

In [None]:
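# Convert code to a string. The column was inferred as a double, so go through
# decimal first to avoid scientific notation such as "3.250393046759E12".
df_transformed = df_transformed.withColumn("code", col("code").cast("decimal(20,0)").cast("string"))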

In [None]:
# display the schema of the transformed DataFrame
df_transformed.printSchema()

# Data transformation (Transform):
Add computed columns, for example a nutritional quality index.  
Compute a score based on nutrients (e.g., sodium, sugar, fiber).  
Extract the main category of a product (e.g., "beverages", "snacks").  
Group the data by category (categories) to analyze trends (e.g., average calories per category).

--> Which calculations should we perform?  
--> Which categories should we create?
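
To make the open questions above more concrete, here is one possible shape for these transformations, written in plain Python so the logic can be checked without a Spark session. Everything in it is an assumption: the weights in `nutrition_index` are arbitrary and are not the official Nutri-Score formula, treating the first comma-separated entry of `categories` as the main category is a guess, and the sample products are made up.

```python
from collections import defaultdict
from statistics import mean

def main_category(categories):
    """Return the first comma-separated category, lower-cased, or None when empty."""
    if not categories:
        return None
    first = categories.split(",")[0].strip().lower()
    return first or None

def nutrition_index(sugars, sodium, fiber):
    """Toy index (assumed weights): fiber raises the score, sugars and sodium lower it."""
    return round(2.0 * fiber - 1.0 * sugars - 10.0 * sodium, 2)

def mean_by_category(products, field):
    """Group products by main category and average `field`, skipping missing values."""
    groups = defaultdict(list)
    for product in products:
        category = main_category(product.get("categories"))
        if category is not None and product.get(field) is not None:
            groups[category].append(product[field])
    return {category: mean(values) for category, values in groups.items()}

# Made-up products, for illustration only.
products = [
    {"categories": "Beverages, Sodas", "energy-kcal_100g": 40.0},
    {"categories": "Beverages, Juices", "energy-kcal_100g": 50.0},
    {"categories": "Snacks, Chips", "energy-kcal_100g": 530.0},
    {"categories": None, "energy-kcal_100g": 100.0},
]
print(mean_by_category(products, "energy-kcal_100g"))
print(nutrition_index(sugars=10.0, sodium=0.5, fiber=3.0))
```

Once the calculations are agreed on, the same logic would be expressed on the Spark DataFrame with column expressions and a `groupBy` aggregation.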


In [None]:
print("Transformation")

# Exploratory analysis:
Use window functions to:
Find the highest-calorie products per category.
Identify production trends by brand (brands).
Generate descriptive statistics (e.g., median and mean of nutrients per category).
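
The "top products per category" idea can be sketched in plain Python as a rank within each group, which is what a window function partitioned by category and ordered by calories computes (in PySpark, `row_number()` over `Window.partitionBy(...).orderBy(...)`). The function name, field names, and sample values below are made up for illustration.

```python
from collections import defaultdict

def top_n_per_category(products, n=1):
    """Rank products by calories within each category and keep the top n names of each."""
    by_category = defaultdict(list)
    for product in products:
        by_category[product["category"]].append(product)
    result = {}
    for category, items in by_category.items():
        ranked = sorted(items, key=lambda p: p["kcal"], reverse=True)
        result[category] = [p["name"] for p in ranked[:n]]
    return result

# Made-up products, for illustration only.
products = [
    {"name": "cola", "category": "beverages", "kcal": 42.0},
    {"name": "orange juice", "category": "beverages", "kcal": 45.0},
    {"name": "chips", "category": "snacks", "kcal": 536.0},
    {"name": "rice cakes", "category": "snacks", "kcal": 387.0},
]
print(top_n_per_category(products, n=1))
```

Unlike a plain `groupBy` with `max`, ranking within the group keeps the whole product row, so the name and brand of the top product remain available.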

In [None]:
print("Exploration")

# Saving the data (Save):
Partition the data by category (categories) and year (year).
Save the transformed results in Parquet format with Snappy compression.
Save the transformed results to databases: PostgreSQL/SQL Server/MySQL/Snowflake/BigQuery
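
The dataset has no `year` column as such, so it has to be derived first. The sketch below assumes it comes from `created_t` interpreted as a Unix timestamp in seconds, and shows the `column=value` directory layout that a partitioned Parquet write produces. The helper names, the base path, and the timestamp are hypothetical.

```python
from datetime import datetime, timezone

def year_from_epoch(seconds):
    """Convert a Unix timestamp in seconds to its UTC calendar year."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).year

def partition_dir(base, category, year):
    """Build a Hive-style partition directory: base/category=<value>/year=<value>."""
    return f"{base}/category={category}/year={year}"

created_t = 1_700_000_000  # made-up timestamp (falls in November 2023, UTC)
print(partition_dir("output/products", "snacks", year_from_epoch(created_t)))
```

Partitioning on the raw `categories` text would likely create a very large number of small directories, so a coarser column such as a main category is probably a better partition key.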

In [None]:
print("Saving the data (load)")



# Presenting the results:
Visualize the results as charts or tables
(students can use a tool such as Jupyter Notebook locally, or Google Colab).

In [None]:
print("Presenting the data")