# Data Processing and Metadata Enrichment Pipeline

This notebook guides you through a step-by-step process for loading, processing, and analyzing a dataset using a combination of custom scripts. The workflow includes loading data, creating metadata, filtering data, and performing fuzzy matching.

## Steps:
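1. Set up the environment and import the necessary modules.
2. Define and check the dataset directory.
3. Load and cache DataFrames.
4. Create and display metadata.
5. Fetch, compare, and configure data fields.
6. Enrich the metadata with the field configuration and save it.
7. Explore clusters of columns in a Dash app.
8. Filter, process, and fuzzy-match the data, then save it.
9. Explore nutrition-score combinations in a dashboard.
10. Run integrity checks against nutrient maximum limits.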

## Step 1: Setup Environment

We begin by importing the necessary libraries and functions.


In [1]:

# Import necessary libraries and custom modules
import os
import json
import pandas as pd
import gc
import numpy as np



# Now, import the necessary custom functions from the scripts

from src.scripts.df_metadata import display_metadata_dfs, create_metadata_dfs, enrich_metadata_df
from src.scripts.fetch_data_fields import fetch_and_compare_data_fields
from src.scripts.build_data_fields_config import build_data_fields_config


# Change the current working directory to 'src'
os.chdir(os.path.join(os.getcwd(), 'src'))

## Step 2: Define and Check Dataset Directory

Define the dataset directory and confirm that it exists. Every later step reads its input files from this directory, so a missing directory should be caught here.


In [2]:
from src.scripts.df_generator import get_dataset_directory, check_directory_exists, load_or_cache_dataframes, show_loaded_dfs
# Define the dataset directory
notebook_directory = os.getcwd()  # After the chdir in Step 1, this is the 'src' directory
dataset_directory = os.path.join(notebook_directory, 'dataset')

# Check if the dataset directory exists
if not check_directory_exists(dataset_directory):
    print(f"Error: Directory '{dataset_directory}' does not exist.")
else:
    print(f"Dataset directory found: {dataset_directory}")


Dataset directory found: c:\Git\Mission3\src\dataset
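
The helper `check_directory_exists` is imported from `src.scripts.df_generator` and its body is not shown in this notebook. A minimal sketch of what such a check could look like, assuming it simply returns a boolean. The name `directory_exists` is a hypothetical stand-in, not the project's actual implementation:

```python
import os
import tempfile


def directory_exists(path: str) -> bool:
    """Return True only if `path` exists and is a directory (hypothetical stand-in)."""
    return os.path.isdir(path)


# Usage: a temporary directory exists; a made-up child path does not.
with tempfile.TemporaryDirectory() as tmp:
    found = directory_exists(tmp)
    missing = directory_exists(os.path.join(tmp, "no_such_dataset"))

print(found, missing)
```

Using `os.path.isdir` rather than `os.path.exists` also rejects the case where the path points to a regular file.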


## Step 3: Load and Cache DataFrames

Load the data from the dataset directory into pandas DataFrames. The data can be loaded from cache or directly from the source files if the cache is not available.


In [3]:
# Directory to store cached DataFrames
CACHE_DIR = os.path.join(notebook_directory, 'data', 'cache') 

# Optionally, you can define a list of specific files to process
specific_files = ['fr.openfoodfacts.org.products.csv']  # Set to None to process all files

# Load DataFrames from cache or source files
dfs = load_or_cache_dataframes(dataset_directory, CACHE_DIR, file_list=specific_files, separator='\t')

# Check if DataFrames are loaded
if not dfs:
    print("No DataFrames were loaded. Exiting.")
else:
    print(f"Loaded DataFrames: {list(dfs.keys())}")
    show_loaded_dfs(dfs, df_names=None)


Loaded 'fr.openfoodfacts.org.products' from cache.
Nullity matrix for 'fr.openfoodfacts.org.products' has been generated.
Loaded DataFrames: ['fr.openfoodfacts.org.products']
Currently loaded DataFrames:
DataFrame for file 'fr.openfoodfacts.org.products (320767, 146)':
            code                                                url  ... nutrition-score-fr_100g nutrition-score-uk_100g
0  0000000003087  http://world-fr.openfoodfacts.org/produit/0000...  ...                     NaN                     NaN
1  0000000004530  http://world-fr.openfoodfacts.org/produit/0000...  ...                    14.0                    14.0
2  0000000004559  http://world-fr.openfoodfacts.org/produit/0000...  ...                     0.0                     0.0
3  0000000016087  http://world-fr.openfoodfacts.org/produit/0000...  ...                    12.0                    12.0
4  0000000016094  http://world-fr.openfoodfacts.org/produit/0000...  ...                     NaN                     NaN

[5 

<Figure size 1500x900 with 0 Axes>
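
The load-or-cache behaviour described in this step can be sketched as follows. This is an illustration only: the function `load_or_cache`, the pickle cache format, and the `dtype=str` choice are assumptions, and the real `load_or_cache_dataframes` may work differently.

```python
import os
import tempfile

import pandas as pd


def load_or_cache(csv_path: str, cache_dir: str, separator: str = "\t") -> pd.DataFrame:
    """Load a DataFrame from a pickle cache, or read the CSV and cache it (illustrative)."""
    os.makedirs(cache_dir, exist_ok=True)
    name = os.path.splitext(os.path.basename(csv_path))[0]
    cache_path = os.path.join(cache_dir, f"{name}.pkl")

    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)

    # dtype=str keeps identifiers such as barcodes intact; cast columns later as needed.
    df = pd.read_csv(csv_path, sep=separator, dtype=str)
    df.to_pickle(cache_path)
    return df


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    with open(csv_path, "w", encoding="utf-8") as fh:
        fh.write("code\tproduct_name\n0000000003087\tFarine\n")

    cache_dir = os.path.join(tmp, "cache")
    first = load_or_cache(csv_path, cache_dir)   # reads the CSV and writes the cache
    second = load_or_cache(csv_path, cache_dir)  # reads the cached pickle
    cache_files = os.listdir(cache_dir)
```

The second call never touches the CSV, which is the point of caching a file with several hundred thousand rows.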

## Step 4: Create and Display Metadata

Generate metadata for the loaded DataFrames and display it to understand the structure and content of the data.


In [4]:
# Create metadata DataFrames
metadata_dfs = create_metadata_dfs(dfs)

# Check if metadata DataFrames were created
if not metadata_dfs:
    print("No metadata DataFrames were created. Exiting.")
else:
    print(f"Created Metadata DataFrames: {list(metadata_dfs.keys())}")
    display_metadata_dfs(metadata_dfs)


Created Metadata DataFrames: ['fr.openfoodfacts.org.products']
Metadata for fr.openfoodfacts.org.products (146, 8):
                          Column Name    Dtype  ... Duplicate Percentage  Missing Percentage
0                                code   object  ...             0.000000            0.007170
1                                 url   object  ...             0.000000            0.007170
2                             creator   object  ...            98.897947            0.000624
3                           created_t   object  ...            40.901410            0.000935
4                    created_datetime   object  ...            40.899993            0.002806
..                                ...      ...  ...                  ...                 ...
141  collagen-meat-protein-ratio_100g  float64  ...            96.363636           99.948561
142                        cocoa_100g  float64  ...            91.139241           99.704458
143             carbon-footprint_100g  float64 
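
To make the metadata columns concrete, here is a minimal sketch of how per-column statistics such as `Duplicate Percentage` and `Missing Percentage` could be computed. The function `build_metadata` and the exact formulas are assumptions. `create_metadata_dfs` produces eight columns and may define these percentages differently.

```python
import pandas as pd


def build_metadata(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column of `df` in one row (illustrative subset of columns)."""
    rows = []
    for column in df.columns:
        series = df[column]
        rows.append({
            "Column Name": column,
            "Dtype": str(series.dtype),
            # Share of rows whose value already appeared earlier in the column.
            "Duplicate Percentage": series.duplicated().mean() * 100,
            # Share of rows where the value is missing.
            "Missing Percentage": series.isna().mean() * 100,
        })
    return pd.DataFrame(rows)


sample = pd.DataFrame({
    "code": ["001", "002", "003", "004"],
    "creator": ["a", "a", "a", "b"],
    "cocoa_100g": [None, None, None, 70.0],
})
metadata = build_metadata(sample)
print(metadata)
```

Note that `duplicated()` treats repeated missing values as duplicates of each other, so a sparse column scores high on both measures, as the last rows of the real output suggest.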

## Step 5: Fetch, Compare, and Configure Data Fields

Fetch and compare data fields from the dataset, and build the necessary configuration files.


In [5]:
DATA_DIR = os.path.join(notebook_directory,'data')

# Run the fetch and compare data fields script
fetch_and_compare_data_fields(DATA_DIR)

# Build the config file
build_data_fields_config()


Created HISTORY_DIR: True
Created DIFF_DIR: True
No changes detected on data\data_fields.txt.
Config file 'data_fields_config.json' has been updated and saved.


## Step 6: Enrich and Save Metadata

Load the data-field configuration, combine the per-DataFrame metadata into a single table, enrich it with the configuration, and save the result as a CSV file.


In [6]:
# Load data_fields_config.json
script_dir = os.path.join(notebook_directory, 'scripts')
config_path = os.path.join(notebook_directory, 'config', 'data_fields_config.json')

with open(config_path, 'r') as file:
    config = json.load(file)

# Enrich the metadata DataFrame
combined_metadata = pd.concat(metadata_dfs.values(), keys=metadata_dfs.keys()).reset_index(level=0).rename(columns={'level_0': 'DataFrame'})
combined_metadata = enrich_metadata_df(combined_metadata, config)


# Save the combined metadata DataFrame to a CSV file
output_dir = os.path.join(notebook_directory, 'data')
os.makedirs(output_dir, exist_ok=True)

combined_metadata_path = os.path.join(output_dir, 'combined_metadata.csv')
combined_metadata.to_csv(combined_metadata_path, index=False)
print(f"Combined metadata {combined_metadata.shape} has been saved or updated.")


Combined metadata (146, 11) has been saved or updated.
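
The `pd.concat(..., keys=...)` chain used above is compact, so a small standalone example may help. The two toy metadata tables below are invented for illustration; only the concat, `reset_index`, and `rename` pattern mirrors the notebook.

```python
import pandas as pd

# Hypothetical per-DataFrame metadata tables, keyed by DataFrame name.
metadata_by_name = {
    "products": pd.DataFrame({"Column Name": ["code", "url"]}),
    "brands": pd.DataFrame({"Column Name": ["brand"]}),
}

combined = (
    # keys= adds an outer index level holding each table's name.
    pd.concat(list(metadata_by_name.values()), keys=list(metadata_by_name.keys()))
    # Move that unnamed outer level into a column, which pandas calls 'level_0'.
    .reset_index(level=0)
    .rename(columns={"level_0": "DataFrame"})
)
print(combined)
```

Each row now carries the name of the DataFrame it describes, which keeps the combined table usable even when several files are loaded.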


## Step 7: Identify Column Clusters

Explore clusters of columns based on their duplicate percentage and fill percentage, using an interactive Dash app.



In [7]:
from src.scripts.plot_metadata_clusters import run_dash_app

# Run the Dash app  
run_dash_app(combined_metadata)


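## Step 8: Filter, Process, and Fuzzy-Match the Data

Filter the metadata and DataFrames, run the field frequency checks, and apply fuzzy matching with a threshold of 90. Then drop rows that have no nutrition data, drop duplicate rows, remove the fields that are no longer needed, and save the processed DataFrame and its metadata.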



In [8]:
from src.scripts.df_filtering import filter_metadata_and_dataframes, process_dataframe
from src.scripts.df_fuzzywuzzy import fuzzy_dataframe


# Specify your datetime checks as a list of tuples
datetime_checks = [
    # ('created_t', 'created_datetime'),
    # ('last_modified_t', 'last_modified_datetime')
]

# Specify your field frequency checks as a list of tuples
field_checks = [
    #(['countries', 'countries_tags', 'countries_fr'], 'countries'),
    #(['ingredients_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil_n'], 'ingredients_palm_oil'),
    (['nutrition_grade_fr', 'nutrition-score-fr_100g', 'nutrition-score-uk_100g'], 'nutrition'),
    #(['brands_tags', 'brands'], 'brands'),
    #(['additives_n', 'additives', 'additives_tags', 'additives_fr'], 'additives'),
    #(['states', 'states_tags', 'states_fr'], 'states'),
    (['pnns_groups_1', 'pnns_groups_2'], 'pnns_groups')
    
]

# Columns to check for at least one non-null value
columns_to_check = [
    'nutrition_grade_fr', 'energy_100g', 'fat_100g', 'saturated-fat_100g',
    'trans-fat_100g', 'cholesterol_100g', 'carbohydrates_100g', 'sugars_100g',
    'fiber_100g', 'proteins_100g', 'salt_100g', 'sodium_100g', 'vitamin-a_100g',
    'vitamin-c_100g', 'calcium_100g', 'iron_100g', 'nutrition-score-fr_100g',
    'nutrition-score-uk_100g'
]

# Fields to delete after analysis
fields_to_delete = [
    'url', 'created_t', 'created_datetime', 'last_modified_t', 'last_modified_datetime',
    'states', 'states_tags', 'states_fr', 'countries', 'countries_tags', 'countries_fr',
    'brands_tags', 'brands', 'additives_n', 'additives', 'additives_tags', 'additives_fr',
    'creator', 'ingredients_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil_n',
    'quantity', 'serving_size', 'ingredients_text', 'product_name',
    'categories', 'categories_tags', 'categories_fr', 'packaging', 'packaging_tags',
    'image_url', 'image_small_url', 'main_category', 'main_category_fr'
]

# Now process your specific DataFrame
df_name = 'fr.openfoodfacts.org.products'

# Filter and process DataFrame in one step
if df_name in dfs:
    combined_metadata, filtered_dfs = filter_metadata_and_dataframes(combined_metadata, dfs, 20)
    process_dataframe(filtered_dfs[df_name], log_dir='logs', temp_dir='temp', datetime_checks=datetime_checks, field_checks=field_checks)
    fuzzy_dataframe(temp_dir='temp', config_dir='config', checks=field_checks, threshold=90)

    # Drop rows where all specified columns are null
    filtered_dfs[df_name].dropna(subset=columns_to_check, how='all', inplace=True)

    # Drop duplicates after filtering rows
    filtered_dfs[df_name].drop_duplicates(inplace=True)

    # Delete the specified columns from combined_metadata and update the related DataFrame
    combined_metadata = combined_metadata[~combined_metadata['Column Name'].isin(fields_to_delete)]
    filtered_dfs[df_name] = filtered_dfs[df_name][combined_metadata['Column Name']]

    # Save the processed DataFrame to the dataset directory
    dataset_path = os.path.join('dataset', f'processed_{df_name}.csv')
    filtered_dfs[df_name].to_csv(dataset_path, index=False)
    
    # Save the updated metadata to the data directory
    metadata_path = os.path.join('data', 'processed_metadata.csv')
    combined_metadata.to_csv(metadata_path, index=False)
    
    print(f"Processed DataFrame '{df_name}' and metadata have been saved.")
else:
    print(f"DataFrame '{df_name}' not found in the loaded DataFrames.")


INFO:root:Updated DataFrame 'fr.openfoodfacts.org.products' to retain only relevant columns.
INFO:root:Check the nutrition combination file for more details about fields frequency.
INFO:root:Check the pnns_groups combination file for more details about fields frequency.
INFO:root:[nutrition] Total combinations to process: 441
INFO:root:[nutrition] Processing combination 1/441
INFO:root:[nutrition] Processing combination 2/441
INFO:root:[nutrition] Processing combination 3/441
INFO:root:[nutrition] Processing combination 4/441
INFO:root:[nutrition] Processing combination 5/441
INFO:root:[nutrition] Processing combination 6/441
INFO:root:[nutrition] Processing combination 7/441
INFO:root:[nutrition] Processing combination 8/441
INFO:root:[nutrition] Processing combination 9/441
INFO:root:[nutrition] Processing combination 10/441
INFO:root:[nutrition] Processing combination 11/441
INFO:root:[nutrition] Processing combination 12/441
INFO:root:[nutrition] Processing combination 13/441
INFO:

Processed DataFrame 'fr.openfoodfacts.org.products' and metadata have been saved.
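
The internals of `fuzzy_dataframe` are not shown here. As a rough illustration of threshold-based fuzzy matching, the sketch below groups near-identical labels using the standard library's `difflib` as a stand-in for fuzzywuzzy, so its scores will not match fuzzywuzzy's exactly. The helpers `normalize`, `similarity`, and `group_similar` are hypothetical names.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and treat hyphens as spaces before comparing."""
    return value.lower().replace("-", " ").strip()


def similarity(a: str, b: str) -> float:
    """Similarity score on a 0-100 scale, computed on normalized strings."""
    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_similar(values, threshold=90):
    """Map each value to the first earlier value that scores at or above `threshold`."""
    canonical = []
    mapping = {}
    for value in values:
        match = next((c for c in canonical if similarity(value, c) >= threshold), None)
        if match is None:
            canonical.append(value)
            match = value
        mapping[value] = match
    return mapping


labels = ["Fruits and vegetables", "fruits-and-vegetables", "Cereals and potatoes"]
mapping = group_similar(labels)
print(mapping)
```

With a threshold of 90, the two spellings of the fruit and vegetable group collapse into one label while the cereal group stays separate.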


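## Step 9: Nutrition Score Clustering Dashboard

Load the nutrition combination log from the `temp` directory (the directory passed as `temp_dir` in Step 8) and explore it in a Dash dashboard.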



In [9]:
from src.scripts.plot_nutriscore import run_dash_app_nutriscore, safe_eval

# notebook_directory was defined in Step 2 and points to the 'src' directory
nutriscore_path = os.path.join(notebook_directory, 'temp', 'nutrition_combination_log.csv')
nutriscore = pd.read_csv(nutriscore_path)

# Run the Dash app
run_dash_app_nutriscore(nutriscore)


### Nutrient Maximum Limits Justification

The following maximum limits are established to ensure that the values in each column of your dataset remain within a realistic and scientifically accurate range. These limits are based on general nutritional guidelines, food composition databases, and what is considered physiologically plausible for the nutrients and metrics in question.

#### 1. **Energy (energy_100g): 900 kcal/100g**
- **Justification**: The upper limit of 900 kcal per 100 grams is based on the most energy-dense foods, such as pure fats and oils. Olive oil, for instance, contains about 884 kcal per 100g, so 900 kcal is a reasonable upper boundary for catching extreme outliers that suggest data entry errors. Open Food Facts documents its `_100g` fields in g, or kJ for energy, so if `energy_100g` is stored in kJ the equivalent bound is about 3,766 kJ (900 × 4.184).

#### 2. **Fat (fat_100g): 100g/100g**
- **Justification**: Pure fat, like oils and lard, contains 100g of fat per 100g. This limit reflects the fact that no food item should logically contain more fat than its total weight, so 100g/100g is the natural upper boundary.

#### 3. **Saturated Fat (saturated-fat_100g): 50g/100g**
- **Justification**: Saturated fat is a subset of total fat. High-saturated-fat products such as butter contain roughly 50g of saturated fat per 100g, so a limit of 50g/100g flags anything above that level for review.

#### 4. **Carbohydrates (carbohydrates_100g): 100g/100g**
- **Justification**: Similar to fat, carbohydrates can theoretically make up 100% of a food's weight. However, this is rare, as most foods contain a mix of nutrients. This limit helps identify potential errors where carbohydrate content might have been overstated.

#### 5. **Sugars (sugars_100g): 100g/100g**
- **Justification**: Sugars, a subset of carbohydrates, can also theoretically reach 100g/100g in foods composed entirely of sugar (e.g., pure glucose or sucrose). The 100g/100g limit ensures that entries exceeding this amount are flagged as potential errors.

#### 6. **Sodium (sodium_100g): 2.3g/100g**
- **Justification**: The Dietary Guidelines for Americans recommend limiting sodium intake to 2,300 mg per day. That figure is a daily intake limit rather than a compositional one, but 2.3g per 100g still represents a high sodium concentration. Heavily salted products such as soy sauce can reach or exceed this level, so values above the threshold are treated as extreme cases to review.

#### 7. **Salt (salt_100g): 5.75g/100g**
- **Justification**: Salt is approximately 40% sodium and 60% chloride by weight. If sodium is at its upper limit of 2.3g/100g, the corresponding salt content is 2.3 / 0.4 = 5.75g/100g. This keeps the sodium and salt limits consistent with each other.
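
The sodium-to-salt relationship used for this limit can be checked with a few lines of arithmetic. The 0.4 mass fraction is the rounded value from the text, and the helper name `sodium_to_salt` is illustrative:

```python
SODIUM_FRACTION_OF_SALT = 0.4  # approximate mass fraction of sodium in salt, as used in the text


def sodium_to_salt(sodium_g: float) -> float:
    """Convert a sodium amount (g) to the equivalent amount of salt (g)."""
    return sodium_g / SODIUM_FRACTION_OF_SALT


salt_limit = sodium_to_salt(2.3)
print(round(salt_limit, 2))
```

The same helper can be used to cross-check rows where both `sodium_100g` and `salt_100g` are present.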

#### 8. **Trans Fat (trans-fat_100g): 55.33g/100g**
- **Justification**: Trans fats are typically found in processed foods, and while 55.33g/100g is higher than what would be found in natural foods, this limit allows for capturing high-trans fat industrial products, though it is still within a physiologically plausible range.

#### 9. **Cholesterol (cholesterol_100g): 55.08mg/100g**
- **Justification**: Cholesterol content in food can vary widely, with organ meats like liver containing very high levels. A limit of 55.08mg/100g provides a boundary that accommodates high-cholesterol foods without allowing for implausible entries.

#### 10. **Fiber (fiber_100g): 99.49g/100g**
- **Justification**: Fiber is a non-digestible carbohydrate. Foods high in fiber, like bran, can contain significant amounts, though a fiber content near 100% would be extremely rare. The limit ensures the integrity of fiber content without exceeding realistic values.

#### 11. **Proteins (proteins_100g): 99.04g/100g**
- **Justification**: Protein content can be very high in certain food products, like protein supplements. The upper limit of 99.04g/100g ensures that entries are realistic, as 100% protein content would be nearly impossible in natural foods.

#### 12. **Vitamin A (vitamin-a_100g): 57.12mg/100g**
- **Justification**: Vitamin A levels can be very high in foods like liver, but 57.12mg/100g represents an upper limit where concentrations beyond this could indicate a data entry error, as such high levels are unusual.

#### 13. **Vitamin C (vitamin-c_100g): 56.09mg/100g**
- **Justification**: Foods rich in Vitamin C, such as certain fruits, can have high concentrations, but levels above this are uncommon. This limit helps to filter out improbable values.

#### 14. **Calcium (calcium_100g): 56.03mg/100g**
- **Justification**: High-calcium foods like dairy products can approach these levels, but any values above this would likely be due to fortification or data errors. The limit ensures consistency with expected nutritional values.

#### 15. **Iron (iron_100g): 56.21mg/100g**
- **Justification**: Iron-rich foods like red meat and fortified cereals can have significant iron content. However, 56.21mg/100g acts as a reasonable upper bound, capturing outliers without excluding naturally iron-rich foods.

### Additional Justification:
- **Nutritional Guidelines**: These limits are set based on standard nutritional data and guidelines provided by sources such as the USDA Food Composition Databases, European Food Safety Authority (EFSA), and general dietary recommendations.
- **Data Integrity**: These limits also ensure that the data is free from common errors, such as mistyping or incorrect unit conversions, which could lead to implausible nutrient values.

By setting these limits, the script can effectively identify and remove outliers or erroneous entries, ensuring that the dataset is clean and reliable for further analysis or reporting.
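
How `run_integrity_check` applies these limits is not shown in this notebook. The sketch below illustrates one way to flag values above a per-column maximum with pandas. The dictionary `MAX_LIMITS` holds a subset of the limits listed above, while `flag_above_limits` and the sample data are assumptions made for the example.

```python
import pandas as pd

# Subset of the upper bounds from the section above; units follow the text.
MAX_LIMITS = {
    "fat_100g": 100,
    "saturated-fat_100g": 50,
    "sugars_100g": 100,
    "sodium_100g": 2.3,
    "salt_100g": 5.75,
}


def flag_above_limits(df: pd.DataFrame, limits: dict) -> pd.DataFrame:
    """Return a boolean DataFrame marking values that exceed their column's limit."""
    flags = pd.DataFrame(False, index=df.index, columns=list(limits))
    for column, upper in limits.items():
        if column in df.columns:
            # Missing values compare as False, so they are never flagged.
            flags[column] = df[column] > upper
    return flags


sample = pd.DataFrame({
    "fat_100g": [12.0, 150.0, None],
    "sodium_100g": [0.5, 1.0, 9.9],
})
flags = flag_above_limits(sample, MAX_LIMITS)
clean = sample[~flags.any(axis=1)]  # keep rows with no flagged value
print(clean)
```

Keeping the flags as a separate DataFrame makes it possible to log which column triggered each removal before any row is dropped.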


In [10]:
from src.scripts.df_generator import check_directory_exists, load_or_cache_dataframes, show_loaded_dfs
from src.scripts.df_business_data_integrity import run_integrity_check
import os

# Define the dataset directory
notebook_directory = os.getcwd()  # The working directory is 'src' after the chdir in Step 1
dataset_directory = os.path.join(notebook_directory, 'dataset')

# Check if the dataset directory exists
if not check_directory_exists(dataset_directory):
    print(f"Error: Directory '{dataset_directory}' does not exist.")
else:
    print(f"Dataset directory found: {dataset_directory}")

# Directory to store cached DataFrames
CACHE_DIR = os.path.join(notebook_directory, 'data', 'cache') 

# Optionally, you can define a list of specific files to process
specific_files = ['processed_fr.openfoodfacts.org.products.csv']

# Load DataFrames from cache or source files
processed_dfs = load_or_cache_dataframes(dataset_directory, CACHE_DIR, file_list=specific_files, separator=',')

# Check if DataFrames are loaded
if not processed_dfs:
    print("No DataFrames were loaded. Exiting.")
else:
    print(f"Loaded DataFrames: {list(processed_dfs.keys())}")
    show_loaded_dfs(processed_dfs, df_names=None)

# Get the specific DataFrame for processing
df_name = 'processed_fr.openfoodfacts.org.products'
if df_name in processed_dfs:
    df = processed_dfs[df_name]  # Extract the DataFrame from the dictionary of loaded DataFrames
    run_integrity_check(df, log_dir='logs')
else:
    print(f"DataFrame '{df_name}' not found in the loaded DataFrames.")


Dataset directory found: c:\Git\Mission3\src\dataset
Loaded 'processed_fr.openfoodfacts.org.products' from cache.
Nullity matrix for 'processed_fr.openfoodfacts.org.products' has been generated.
Loaded DataFrames: ['processed_fr.openfoodfacts.org.products']
Currently loaded DataFrames:
DataFrame for file 'processed_fr.openfoodfacts.org.products (262828, 21)':
            code nutrition_grade_fr  ... nutrition-score-fr_100g nutrition-score-uk_100g
0  0000000004530                  d  ...                    14.0                    14.0
1  0000000004559                  b  ...                     0.0                     0.0
2  0000000016087                  d  ...                    12.0                    12.0
3  0000000016094                NaN  ...                     NaN                     NaN
4  0000000016100                NaN  ...                     NaN                     NaN

[5 rows x 21 columns]




<Figure size 1500x900 with 0 Axes>

In [11]:
df.head()

    code nutrition_grade_fr  ... nutrition-score-fr_100g nutrition-score-uk_100g
0   4530                  d  ...                    14.0                    14.0
1   4559                  b  ...                     0.0                     0.0
2  16087                  d  ...                    12.0                    12.0
3  16094                NaN  ...                     NaN                     NaN
4  16100                NaN  ...                     NaN                     NaN
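
The `code` values in the `df.head()` output appear without their leading zeros (`4530` rather than `0000000004530`). One common cause is a barcode column being parsed as an integer when a CSV file is read back. A minimal sketch of the effect and of the `dtype` argument of `pandas.read_csv` that avoids it is shown below. Whether `load_or_cache_dataframes` exposes such an option is not shown in this notebook, so treat this as an illustration rather than a fix for the pipeline.

```python
import io

import pandas as pd

csv_text = "code,nutrition_grade_fr\n0000000004530,d\n0000000004559,b\n"

# Default parsing infers an integer column and drops the leading zeros.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the barcode exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

print(as_numbers["code"].iloc[0], as_text["code"].iloc[0])
```

Reading `code` as text matters whenever it is used as a join key or compared against the original barcodes.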
