# Data Processing and Metadata Enrichment Pipeline

This notebook walks through loading, processing, and analyzing a dataset with a set of custom scripts. The workflow covers loading data, creating metadata, filtering data, and performing fuzzy matching.

## Steps:
1. Set up environment and import necessary modules.
2. Define and check dataset directory.
3. Load and cache dataframes.
4. Create and display metadata.
5. Fetch, compare, and configure data fields.
6. Filter and process dataframes.
7. Perform fuzzy matching.
8. Save processed data.

## Step 1: Setup Environment

We begin by importing the necessary libraries and functions.


In [1]:

# Import necessary libraries and custom modules
import os
import json
import pandas as pd
import gc
import numpy as np



# Now, import the necessary custom functions from the scripts
from src.scripts.df_generator import get_dataset_directory, check_directory_exists, load_or_cache_dataframes, show_loaded_dfs
from src.scripts.df_metadata import display_metadata_dfs, create_metadata_dfs, enrich_metadata_df
from src.scripts.fetch_data_fields import fetch_and_compare_data_fields
from src.scripts.build_data_fields_config import build_data_fields_config
from src.scripts.df_filtering import filter_metadata_and_dataframes, process_dataframe
from src.scripts.df_fuzzywuzzy import fuzzy_dataframe

# Change the current working directory to 'src'
os.chdir(os.path.join(os.getcwd(), 'src'))

## Step 2: Define and Check Dataset Directory

Define the dataset directory and confirm that it exists. Later steps build their paths from `notebook_directory`, which is the `src` folder after the `os.chdir` call in Step 1.


In [2]:
# Define the dataset directory
notebook_directory = os.getcwd()  # After the chdir in Step 1, this is the 'src' folder, not the notebook's own folder
dataset_directory = os.path.join(notebook_directory, 'dataset')

# Check if the dataset directory exists
if not check_directory_exists(dataset_directory):
    print(f"Error: Directory '{dataset_directory}' does not exist.")
else:
    print(f"Dataset directory found: {dataset_directory}")


Dataset directory found: c:\Users\Shiva\Documents\Data Scientist\Mission 3 - Preparer des données pour un organismer de sante publique\src\dataset

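The implementation of `check_directory_exists` is not shown in this notebook. As a minimal sketch of what such a check could look like, the hypothetical `directory_exists` helper below (an assumption, not the project's code) returns `True` only when the path exists and is a directory. It is exercised against a temporary folder so it can be run anywhere.

```python
import tempfile
from pathlib import Path


def directory_exists(path):
    """Hypothetical stand-in for check_directory_exists.

    Returns True only if `path` exists and is a directory.
    """
    return Path(path).is_dir()


# Exercise the helper in a throwaway folder instead of the real dataset path
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "dataset"
    missing_before = directory_exists(dataset_dir)  # folder not created yet
    dataset_dir.mkdir()
    found_after = directory_exists(dataset_dir)     # folder now exists

print(missing_before, found_after)
```

Using `is_dir()` rather than `exists()` also rejects the case where a regular file happens to have the expected name.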

## Step 3: Load and Cache DataFrames

Load the data from the dataset directory into pandas DataFrames. The data can be loaded from cache or directly from the source files if the cache is not available.


In [3]:
# Directory to store cached DataFrames
CACHE_DIR = os.path.join(notebook_directory, 'data', 'cache') 

# Optionally, you can define a list of specific files to process
specific_files = ['fr.openfoodfacts.org.products.csv']  # Set to None to process all files

# Load DataFrames from cache or source files
dfs = load_or_cache_dataframes(dataset_directory, CACHE_DIR, file_list=specific_files, separator='\t')

# Check if DataFrames are loaded
if not dfs:
    print("No DataFrames were loaded. Exiting.")
else:
    print(f"Loaded DataFrames: {list(dfs.keys())}")
    show_loaded_dfs(dfs, df_names=None)


Loaded 'fr.openfoodfacts.org.products' from cache.
Nullity matrix for 'fr.openfoodfacts.org.products' has been generated.
Loaded DataFrames: ['fr.openfoodfacts.org.products']
Currently loaded DataFrames:
DataFrame for file 'fr.openfoodfacts.org.products (320767, 146)':
            code                                                url                     creator   created_t      created_datetime  ... collagen-meat-protein-ratio_100g cocoa_100g carbon-footprint_100g nutrition-score-fr_100g nutrition-score-uk_100g
0  0000000003087  http://world-fr.openfoodfacts.org/produit/0000...  openfoodfacts-contributors  1474103866  2016-09-17T09:17:46Z  ...                              NaN        NaN                   NaN                     NaN                     NaN
1  0000000004530  http://world-fr.openfoodfacts.org/produit/0000...             usda-ndb-import  1489069957  2017-03-09T14:32:37Z  ...                              NaN        NaN                   NaN                    14.0        

<Figure size 1500x900 with 0 Axes>

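The internals of `load_or_cache_dataframes` are not shown here. The load-or-cache idea described above can be sketched as follows, assuming a pickle file per source CSV. The `load_or_cache` helper, its return value, and the cache file naming are all hypothetical.

```python
import os
import tempfile

import pandas as pd


def load_or_cache(csv_path, cache_dir, sep="\t"):
    """Hypothetical helper: reuse a pickle cache if present, else parse the CSV.

    Returns the DataFrame and a label telling where it came from.
    """
    os.makedirs(cache_dir, exist_ok=True)
    name = os.path.splitext(os.path.basename(csv_path))[0]
    cache_path = os.path.join(cache_dir, f"{name}.pkl")

    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path), "cache"

    df = pd.read_csv(csv_path, sep=sep, low_memory=False)
    df.to_pickle(cache_path)  # store for the next run
    return df, "source"


# Demonstrate on a tiny tab-separated file written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "products.csv")
    pd.DataFrame({"code": [1, 2], "cocoa_100g": [70.0, None]}).to_csv(
        csv_path, sep="\t", index=False
    )
    cache_dir = os.path.join(tmp, "cache")
    first_df, first_origin = load_or_cache(csv_path, cache_dir)
    second_df, second_origin = load_or_cache(csv_path, cache_dir)

print(first_origin, second_origin)
```

This sketch never invalidates the cache. If the source file changes, the cached copy has to be deleted by hand or the helper extended to compare modification times.
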
## Step 4: Create and Display Metadata

Generate metadata for the loaded DataFrames and display it to understand the structure and content of the data.


In [4]:
# Create metadata DataFrames
metadata_dfs = create_metadata_dfs(dfs)

# Check if metadata DataFrames were created
if not metadata_dfs:
    print("No metadata DataFrames were created. Exiting.")
else:
    print(f"Created Metadata DataFrames: {list(metadata_dfs.keys())}")
    display_metadata_dfs(metadata_dfs)


Created Metadata DataFrames: ['fr.openfoodfacts.org.products']
Metadata for fr.openfoodfacts.org.products (146, 8):
                          Column Name    Dtype      Type  Fill Percentage  NaN Percentage  Bad Null Percentage  Duplicate Percentage  Missing Percentage
0                                code   object   str(41)        99.992830        0.007170                  0.0              0.006859            0.007170
1                                 url   object  str(652)        99.992830        0.007170                  0.0              0.006859            0.007170
2                             creator   object   str(68)        99.999376        0.000624                  0.0             98.897642            0.000624
3                           created_t   object   str(37)        99.999065        0.000935                  0.0             40.901651            0.000935
4                    created_datetime   object   str(25)        99.997194        0.002806                  0.0         

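How `create_metadata_dfs` computes its columns is not shown in this notebook. The sketch below illustrates one plausible way to derive a few of them with pandas. The `column_metadata` function and the exact definitions (in particular, counting duplicates among non-null values relative to the total row count) are assumptions for illustration only.

```python
import pandas as pd


def column_metadata(df):
    """Hypothetical per-column summary; the definitions are assumptions."""
    n_rows = len(df)
    rows = []
    for col in df.columns:
        series = df[col]
        nan_pct = series.isna().mean() * 100
        # Rows whose non-null value repeats an earlier one, as a share of all rows
        dup_pct = series.dropna().duplicated().sum() / n_rows * 100
        rows.append(
            {
                "Column Name": col,
                "Dtype": str(series.dtype),
                "Fill Percentage": 100 - nan_pct,
                "NaN Percentage": nan_pct,
                "Duplicate Percentage": dup_pct,
            }
        )
    return pd.DataFrame(rows)


sample = pd.DataFrame(
    {
        "code": ["a", "b", "c", None],    # one missing value, no repeats
        "creator": ["x", "x", "x", "y"],  # no missing values, two repeats
    }
)
meta = column_metadata(sample)
print(meta)
```
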
## Step 5: Fetch, Compare, and Configure Data Fields

Fetch and compare data fields from the dataset, and build the necessary configuration files.


In [5]:
DATA_DIR = os.path.join(notebook_directory, 'data')

# Run the fetch and compare data fields script
fetch_and_compare_data_fields(DATA_DIR)

# Build the config file
build_data_fields_config()


Created HISTORY_DIR: True
Created DIFF_DIR: True
No changes detected on data\data_fields.txt.
Config file 'data_fields_config.json' has been updated and saved.

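The comparison performed by `fetch_and_compare_data_fields` is not shown here. As a rough illustration of detecting changes between two snapshots of a field list, the sketch below uses set differences and the standard library's `difflib`. The `diff_field_lists` helper and the sample field names are hypothetical.

```python
import difflib


def diff_field_lists(previous, current):
    """Hypothetical helper: fields added and removed between two snapshots."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    return added, removed


previous_fields = ["code", "url", "creator"]
current_fields = ["code", "url", "created_t"]

added, removed = diff_field_lists(previous_fields, current_fields)

# A line-oriented diff, similar to what a diff file on disk could contain
diff_lines = list(
    difflib.unified_diff(
        previous_fields,
        current_fields,
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
)

if not added and not removed:
    print("No changes detected.")
else:
    print("\n".join(diff_lines))
```

The set difference answers "what changed", while the unified diff also preserves the position of each change, which is useful when the order of fields matters.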

## Step 6: Filter and Process DataFrames

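Filter the metadata and the corresponding DataFrames, then save the filtered data. The first cell loads the data fields configuration, combines the per-file metadata into one DataFrame, enriches it with the configuration, and saves it as a CSV file.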


In [6]:
# Load the data fields configuration (data_fields_config.json)
script_dir = os.path.join(notebook_directory, 'scripts')
config_path = os.path.join(notebook_directory, 'config', 'data_fields_config.json')

with open(config_path, 'r') as file:
    config = json.load(file)

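# Combine the per-file metadata into one DataFrame, labeling each row with its source file
combined_metadata = (
    pd.concat(metadata_dfs.values(), keys=metadata_dfs.keys())
    .reset_index(level=0)
    .rename(columns={'level_0': 'DataFrame'})
)

# Enrich the combined metadata with the data fields configuration
combined_metadata = enrich_metadata_df(combined_metadata, config)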


# Save the combined metadata DataFrame to a CSV file
output_dir = os.path.join(notebook_directory, 'data')
os.makedirs(output_dir, exist_ok=True)

combined_metadata_path = os.path.join(output_dir, 'combined_metadata.csv')
combined_metadata.to_csv(combined_metadata_path, index=False)
print(f"Combined metadata {combined_metadata.shape} has been saved or updated.")


Combined metadata (146, 11) has been saved or updated.

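The `pd.concat(..., keys=...)` call in the cell above relies on pandas turning the keys into an outer index level, which `reset_index(level=0)` then exposes as a column named `level_0`. The toy example below shows that mechanism on two small, made-up metadata frames. The frame names and contents are illustrative only.

```python
import pandas as pd

# Two made-up metadata frames keyed by a source name
metadata_by_file = {
    "products": pd.DataFrame({"Column Name": ["code", "url"]}),
    "brands": pd.DataFrame({"Column Name": ["brand"]}),
}

combined = (
    pd.concat(
        list(metadata_by_file.values()),
        keys=list(metadata_by_file.keys()),  # keys become the outer index level
    )
    .reset_index(level=0)                    # unnamed level 0 becomes column 'level_0'
    .rename(columns={"level_0": "DataFrame"})
    .reset_index(drop=True)                  # tidy row numbering for display
)

print(combined)
```

Each row of the combined frame now records which source DataFrame it describes, so the metadata for several files can be filtered or grouped together.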

In [12]:
import os
import pandas as pd
from src.scripts.plot_metadata_clusters import run_dash_app

# Load or prepare your data
output_dir = os.path.join(notebook_directory, 'data')
combined_metadata_path = os.path.join(output_dir, 'combined_metadata.csv')
combined_metadata = pd.read_csv(combined_metadata_path)

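# Placeholder: this rebinds dfs and discards the DataFrames loaded in Step 3
dfs = {}  # Load your actual DataFrames if needed

# Run the Dash app
# This call raises the TypeError shown below: run_dash_app() accepts no positional
# arguments, so check its signature in plot_metadata_clusters before passing data.
run_dash_app(combined_metadata, dfs)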

TypeError: run_dash_app() takes 0 positional arguments but 2 were given
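
The error above means the function was called with two positional arguments it does not accept. The real signature of `run_dash_app` is not shown in this notebook, so the sketch below uses a hypothetical zero-argument stand-in, `run_app`, to show how `inspect.signature` can reveal what a function accepts before it is called, and how the same `TypeError` arises.

```python
import inspect


def run_app():
    """Hypothetical stand-in for an entry point that takes no arguments."""
    return "started"


# Inspect the signature before calling
signature = inspect.signature(run_app)
positional = [
    p
    for p in signature.parameters.values()
    if p.kind
    in (inspect.Parameter.POSITIONAL_ONLY, inspect.Parameter.POSITIONAL_OR_KEYWORD)
]
print(f"run_app{signature} accepts {len(positional)} positional argument(s)")

# Calling it with two positional arguments reproduces the same kind of error
try:
    run_app("metadata", {})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The same message is also raised for functions whose parameters are all keyword-only, so inspecting the signature is the reliable way to tell which case applies.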