# Goal 
In this notebook I will reorganize the columns for:
1. **Food taxonomy**
2. **Ferment type**
3. **Processing**

## The taxonomy levels:
1. Top level - Animal or Plant origin
2. Second level - Main ingredient type
>> 2.1 Animal will be separated into: <br>
>> >> 2.1.1 - Meat<br>
>> >> 2.1.2 - Fish<br>
>> >> 2.1.3 - Dairy<br>

>> 2.2 Plant will be separated into: <br>
>> >> 2.2.1 - Vegetables/Fruit<br>
>> >> 2.2.2 - Cereal<br>
>> >> 2.2.3 - Legumes<br>
>> >> 2.2.4 - Root<br>
3. Third level - specific substrate
4. Fourth level - food name. 
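
The first two levels of the outline above can be written down as a small lookup structure. This is only a sketch: the names `TAXONOMY_LEVELS`, `is_valid_pair` and `origin_of` are hypothetical and do not appear elsewhere in the notebook, and the labels are copied from the outline as written.

```python
# Hypothetical encoding of levels 1 and 2 from the outline above.
TAXONOMY_LEVELS = {
    "Animal": ["Meat", "Fish", "Dairy"],
    "Plant": ["Vegetables/Fruit", "Cereal", "Legumes", "Root"],
}


def is_valid_pair(origin, ingredient_type):
    """Return True if the level-2 type is listed under the level-1 origin."""
    return ingredient_type in TAXONOMY_LEVELS.get(origin, [])


def origin_of(ingredient_type):
    """Return the level-1 origin for a level-2 type, or None if it is unlisted."""
    for origin, types in TAXONOMY_LEVELS.items():
        if ingredient_type in types:
            return origin
    return None


print(is_valid_pair("Animal", "Dairy"))
print(origin_of("Cereal"))
```

Keeping the hierarchy in one place makes it possible to validate a row's level-1/level-2 combination instead of trusting two independently filled columns.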


## Ferment type:
We will indicate the following:
1. Acid - lactic, acetic, mixed
2. Alcohol - yes, no
3. Degradation - protein, fat, or mixed

While some foods contain more than one fermentation product, we will try to indicate the main one. <br> 
For example, the main acid in cheese is lactic acid, but most (if not all) cheeses also contain acetic acid.

## Processing:
We will indicate the following:
1. Aerobic vs anaerobic fermentation.
2. Additional ingredients - salt, enzymes (e.g. rennet)
3. Fermentation temp
* For fermentation temp we will focus on the primary fermentation and not the aging temp.
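
A possible way to keep the categorical fields above consistent is to check each record against a set of allowed values. The field names (`acid`, `alcohol`, `degradation`, `oxygen`) and the helper `invalid_fields` are assumptions made for this sketch; only the allowed values come from the lists above.

```python
# Allowed values taken from the lists above; the field names are hypothetical.
ALLOWED = {
    "acid": {"lactic", "acetic", "mixed"},
    "alcohol": {"yes", "no"},
    "degradation": {"protein", "fat", "mixed"},
    "oxygen": {"aerobic", "anaerobic"},
}


def invalid_fields(record):
    """Return the names of fields whose value is not in the allowed set.

    Fields absent from the record are skipped rather than reported.
    """
    return [
        field
        for field, allowed in ALLOWED.items()
        if field in record and record[field] not in allowed
    ]


# Illustrative records, not rows from the real table.
ok_record = {"acid": "lactic", "alcohol": "no"}
bad_record = {"acid": "citric", "alcohol": "maybe"}

print(invalid_fields(ok_record))
print(invalid_fields(bad_record))
```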


### Start date: 08/04/25
### Last update to goals: 

In [30]:
import pandas as pd
import numpy as np
import os
import sys

# load the data
dataset = pd.read_csv('data/data_source/2025-03-24-all-ff-mag-metadata-cleaned-curated.tsv', sep='\t')

# get the relevant columns
dataset = dataset[['sample_description', 'fermented_food', 'specific_substrate', 'substrate_category', 'general_category']]

# remove duplicates
dataset = dataset.drop_duplicates()
dataset = dataset.reset_index(drop=True)

display(dataset.shape, dataset.head())

(259, 5)

Unnamed: 0,sample_description,fermented_food,specific_substrate,substrate_category,general_category
0,meat,sausage,meat,meat,fermented_meat
1,whey,whey,milk,dairy,fermented_dairy
2,cheese_brine,cheese,milk,dairy,fermented_dairy
3,cheese,cheese,milk,dairy,fermented_dairy
4,tea,tea,tea,plant,fermented_beverages


In [31]:
# check the number of unique values in each column
for column in dataset.columns:
    print(f'{column}: {dataset[column].nunique()}')


sample_description: 248
fermented_food: 136
specific_substrate: 78
substrate_category: 22
general_category: 21


In [32]:
# check for missing values
missing_values = dataset.isnull().sum()
print(missing_values)


sample_description    0
fermented_food        0
specific_substrate    1
substrate_category    0
general_category      0
dtype: int64


In [33]:
# Count unique values per row
unique_counts = dataset.apply(lambda x: len(set(x)), axis=1)

# Print summary of rows by number of unique values
for i in range(1, 6):
    count = (unique_counts == i).sum()
    print(f"Rows with {i} unique values: {count}")

Rows with 1 unique values: 0
Rows with 2 unique values: 2
Rows with 3 unique values: 19
Rows with 4 unique values: 127
Rows with 5 unique values: 111


# Quick pulse check
We currently have five columns describing the food, but fewer than half of the rows (111 of 259) have 5 unique values. <br>
Based on the number of unique values per column, it looks as though sample_description corresponds to the lowest level, <br>
but a look at the table indicates that the lowest level is a composite of sample_description and fermented_food.<br>
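
One way to test whether a set of columns behaves as a composite lowest level is to check whether it identifies each row uniquely. The sketch below uses toy rows invented for illustration (not taken from the real table), and `is_unique_key` is a hypothetical helper.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "sample_description": ["brine", "brine", "rind"],
    "fermented_food": ["cheese", "pickles", "cheese"],
})


def is_unique_key(df, cols):
    """Return True if no two rows share the same values in `cols`."""
    return not bool(df.duplicated(subset=cols).any())


print(is_unique_key(toy, ["sample_description"]))
print(is_unique_key(toy, ["fermented_food"]))
print(is_unique_key(toy, ["sample_description", "fermented_food"]))
```

In the toy data neither column is unique alone, while the pair is; running the same check on `dataset` would show whether that also holds for the real rows.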



In [34]:
# Evaluate the unique values in the higher-level columns:
# substrate_category, general_category and specific_substrate

# Check the unique values in the substrate_category column
print(dataset['substrate_category'].unique())
print()
# Check the unique values in the general_category column
print(dataset['general_category'].unique())
print()
# Check the unique values in the specific_substrate column
print(dataset['specific_substrate'].unique())


['meat' 'dairy' 'plant' 'other_plant_based' 'grains' 'grain' 'vegetable'
 'legumes' 'tuber_root' 'fruit' 'rhizome' 'fruits_and_vegetables'
 'fruit_rhizome' 'legume' 'seed' 'flower' 'seafood' 'sugar' 'seaweed'
 'supplement' 'probiotics' 'soybean']

['fermented_meat' 'fermented_dairy' 'fermented_beverages'
 'other_fermented_grains' 'other_plant_based_fermentation'
 'pickled_fruits_vegetables_roots' 'fermented_soybean'
 'other_fermented_legumes' 'other_fermented_legume' 'sourdough' 'vinegar'
 'fermented_cassava' 'other_fermented_seeds' 'chocolate'
 'fermented_seafood' 'other_fermented_meat' 'grains' 'probiotics'
 'fermented_grain' 'fermented_soybeans' 'fermented_vegetable']

['meat' 'milk' 'tea' 'agave' 'teff' 'nectar' 'barley' 'grain' 'cabbage'
 'soybean' 'yellow_pea' 'pea' 'legume' 'wheat' 'rye' 'millet' 'turnip'
 'orange' 'pineapple' 'ginger' 'cucumber' 'fruit' 'daikon' 'vegetable'
 'rice' 'apple_cider' 'garlic' 'okra' 'tomato_and_mustards' 'carrot'
 'lemon_ginger' 'cassava' 'locust_be

In [43]:
# create the first level column - substrate_origin
# 'meat', 'dairy' and 'seafood' map to 'Animal', 'probiotics' maps to 'Microbial', everything else maps to 'Plant'
def get_substrate_origin(row):
    if row['substrate_category'] in ['meat', 'dairy', 'seafood']:
        return 'Animal'
    elif row['substrate_category'] in ['probiotics']:
        return 'Microbial'
    else:
        return 'Plant'
dataset['substrate_origin'] = dataset.apply(get_substrate_origin, axis=1)

dataset['substrate_origin'].value_counts()



substrate_origin
Plant        198
Animal        60
Microbial      1
Name: count, dtype: int64

In [64]:
# group substrate_category values to create the second level column - main_ingredient_group
main_ingredient_groups = {'Meat': ['meat'],
                          'Fish': ['seafood'],
                          'Dairy': ['dairy'],
                          'Vegetables/Fruit': ['plant','other_plant_based','vegetable','fruit','fruits_and_vegetables','flower','sugar','seaweed'],
                          'Cereal': ['grains', 'grain'],
                          'Legumes/nuts/seed': ['legumes','legume','seed','soybean'],
                          'Root': ['tuber_root', 'rhizome','fruit_rhizome'],
                          'Microbial': ['probiotics','supplement']}

# create the second level column; categories missing from the mapping fall back to 'Other'
def get_main_ingredient(row):
    for key, values in main_ingredient_groups.items():
        if row['substrate_category'] in values:
            return key
    return 'Other'
dataset['main_ingredient_group'] = dataset.apply(get_main_ingredient, axis=1)
dataset['main_ingredient_group'].value_counts()


main_ingredient_group
Vegetables/Fruit     98
Dairy                45
Cereal               40
Legumes/nuts/seed    40
Root                 19
Meat                 11
Fish                  4
Microbial             2
Name: count, dtype: int64
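
The row-wise `apply` above could also be expressed as a flat lookup combined with `Series.map`. This is an alternative sketch, not the method used in this notebook: the grouping is abbreviated to three entries, `category_to_group` is a hypothetical name, and `'mushroom'` is an invented category included only to show the `'Other'` fallback.

```python
import pandas as pd

# Abbreviated copy of the grouping, for illustration.
groups = {
    "Meat": ["meat"],
    "Fish": ["seafood"],
    "Cereal": ["grains", "grain"],
}

# Invert the grouping into one flat lookup: substrate_category -> group.
category_to_group = {
    category: group
    for group, categories in groups.items()
    for category in categories
}

toy = pd.DataFrame(
    {"substrate_category": ["meat", "grain", "grains", "seafood", "mushroom"]}
)

# Unmapped categories become NaN after map() and are then labelled 'Other'.
toy["main_ingredient_group"] = (
    toy["substrate_category"].map(category_to_group).fillna("Other")
)
print(toy)
```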

# Specific substrate
After reviewing the data I decided that it would be easier to assign values to this column manually.<br>
There are multiple instances in which I would have to define multi-column conditions for specific rows. <br>
Considering that there are only 259 rows and many are easy to define, it should be easier to add the values manually. 
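
If the manual values are kept in a separate hand-edited table, a left merge with `indicator=True` can show which rows still lack a manual entry. This is a sketch under assumptions: the `manual` table and its `specific_substrate_manual` column are hypothetical, and in practice the table would be read from whichever file was edited by hand.

```python
import pandas as pd

# A few rows shaped like the dataset above.
toy = pd.DataFrame({
    "sample_description": ["cheese_brine", "whey", "tea"],
    "fermented_food": ["cheese", "whey", "tea"],
})

# Hypothetical hand-edited table covering only some of the rows.
manual = pd.DataFrame({
    "sample_description": ["cheese_brine", "whey"],
    "fermented_food": ["cheese", "whey"],
    "specific_substrate_manual": ["milk", "milk"],
})

# Left merge keeps every dataset row; _merge records where each row was found.
merged = toy.merge(
    manual,
    on=["sample_description", "fermented_food"],
    how="left",
    indicator=True,
)

# Rows present only on the left side have no manual value yet.
uncurated = merged[merged["_merge"] == "left_only"]
print(uncurated[["sample_description", "fermented_food"]])
```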


In [69]:
# save the dataset
dataset.to_csv('data/food_taxonomy.csv', index=False)

In [68]:
dataset.shape

(259, 7)

In [70]:
dataset[dataset['main_ingredient_group'] == 'Microbial']

Unnamed: 0,sample_description,fermented_food,specific_substrate,substrate_category,general_category,substrate_origin,main_ingredient_group
210,dietary supplement,dietary_supplement,,supplement,probiotics,Plant,Microbial
211,probiotics,probiotics,unknown,probiotics,probiotics,Microbial,Microbial
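
The first row above pairs a `Plant` origin with a `Microbial` group: `get_substrate_origin` only checks for `'probiotics'`, while the grouping also places `'supplement'` under `Microbial`. Whether that is intended is not stated here. A cross-check along the following lines could flag such rows; the `expected_origin` mapping is an assumption about how the two levels should relate.

```python
import pandas as pd

# Assumed relationship between level 2 and level 1; any group not listed
# here is treated as 'Plant'.
expected_origin = {
    "Meat": "Animal",
    "Fish": "Animal",
    "Dairy": "Animal",
    "Microbial": "Microbial",
}

# The two rows shown in the output above.
rows = pd.DataFrame({
    "sample_description": ["dietary supplement", "probiotics"],
    "substrate_origin": ["Plant", "Microbial"],
    "main_ingredient_group": ["Microbial", "Microbial"],
})

expected = rows["main_ingredient_group"].map(expected_origin).fillna("Plant")
mismatches = rows[rows["substrate_origin"] != expected]
print(mismatches)
```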


In [66]:
dataset['specific_substrate'].unique()

array(['meat', 'milk', 'tea', 'agave', 'teff', 'nectar', 'barley',
       'grain', 'cabbage', 'soybean', 'yellow_pea', 'pea', 'legume',
       'wheat', 'rye', 'millet', 'turnip', 'orange', 'pineapple',
       'ginger', 'cucumber', 'fruit', 'daikon', 'vegetable', 'rice',
       'apple_cider', 'garlic', 'okra', 'tomato_and_mustards', 'carrot',
       'lemon_ginger', 'cassava', 'locust_bean', 'melon_seed', 'oil_bean',
       'corn', 'sorghum', 'palm_sap', 'cassava_kapok_seeds',
       'sorghum_millet', 'plantain', 'African_yam_bean',
       'cacao_fruit_pulp_and_seeds', 'chili_pepper', 'cacao', 'fig',
       'chilli', 'coffee', 'tea_sugar', 'grape', 'apple_juice', 'banana',
       'black_cherry', 'blueberry', 'chrysanthemum', 'apple',
       'sugar_cane', 'fish', 'pepper', 'radish', 'broth', 'mustard',
       'seafood', 'sugar', 'scallion', 'wakame', 'lupine', 'grains',
       'turnip_leaves', 'wine', 'bamboo_shoots', nan, 'unknown',
       'coffee_berries', 'potato', 'elderberries',
    

# 05/09/25
# Creating an up-to-date food taxonomy table

In [1]:
import pandas as pd

pd.set_option('display.max_columns', None)

# load the metadata table
data = pd.read_csv('../data/Food_MAGs_curated_metadata_250502.csv')

data.head()

Unnamed: 0,mag_id,sample_description,completeness,contamination,contigs,total_length,gc,n50,sample_accession,run_accession,country,project_accession,study_accession,database_origin,Reference,level1_substrate_origin,level2_main_ingredient_group,main_ingredient,food_name,consistency,main_ferment,acid_type,acid_level,alcohol_lovel,protein_degradation,fat_degradation,added_ingredients,fermentation_temp,aging_time,agin_temp,level3_food_type,representative_95id,representative_99id,domain,phylum,class,order,family,genus,species
0,C-03.Ssa-BR,raw canastra cheese,97.97,1.1,182,1896140,39.4,16848,SAMN17450155,SRR13496634,Brazil,PRJNA693797,SRP302686,,"Kothe CI, Mohellibi N, Renault P. Revealing th...",Animal,Dairy,milk,cheese canastra,solid,acid; amino acids,lactic,medium,none,high,medium,salt; rennet,mesophilic,>1 month,cold,cheese,C-03.Ssa-BR,C-03.Ssa-BR,Bacteria,Bacillota,Bacilli,Lactobacillales,Streptococcaceae,Streptococcus,Streptococcus sp003521145
1,C-R02.bin.1,raw canastra cheese,92.75,1.03,84,3174852,65.79,74585,SAMN17450161,SRR13496623,Brazil,PRJNA693797,SRP302686,NCBI,"Kothe CI, Mohellibi N, Renault P. Revealing th...",Animal,Dairy,milk,cheese canastra,solid,acid; amino acids,lactic,medium,none,high,medium,salt; rennet,mesophilic,>1 month,cold,cheese,C-R02.bin.1,C-R02.bin.1,Bacteria,Actinomycetota,Actinomycetes,Actinomycetales,Micrococcaceae,Galactobacter,Galactobacter sp022712295
2,C-R03.bin.7,raw canastra cheese,98.91,0.0,161,2047554,38.52,23316,SAMN17450162,SRR13496622,Brazil,PRJNA693797,SRP302686,NCBI,"Kothe CI, Mohellibi N, Renault P. Revealing th...",Animal,Dairy,milk,cheese canastra,solid,acid; amino acids,lactic,medium,none,high,medium,salt; rennet,mesophilic,>1 month,cold,cheese,C-R03.bin.7,C-R03.bin.7,Bacteria,Bacillota,Bacilli,Lactobacillales,Aerococcaceae,Bavariicoccus,Bavariicoccus seileri
3,C-R06.bin.10,raw araxa cheese,91.62,1.35,147,3636195,37.73,38153,SAMN17450164,SRR13496620,Brazil,PRJNA693797,SRP302686,NCBI,"Kothe CI, Mohellibi N, Renault P. Revealing th...",Animal,Dairy,milk,cheese araxa,solid,acid; amino acids,lactic,medium,none,medium,low,salt; rennet,mesophilic,>1 month,cold,cheese,C-R06.bin.10,C-R06.bin.10,Bacteria,Pseudomonadota,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Proteus,Proteus vulgaris
4,C-R06.bin.12,raw araxa cheese,100.0,0.64,54,3940686,70.36,126465,SAMN17450164,SRR13496620,Brazil,PRJNA693797,SRP302686,NCBI,"Kothe CI, Mohellibi N, Renault P. Revealing th...",Animal,Dairy,milk,cheese araxa,solid,acid; amino acids,lactic,medium,none,medium,low,salt; rennet,mesophilic,>1 month,cold,cheese,WolfeBE_2014__491_Bayley__bin.8,WolfeBE_2014__491_Bayley__bin.8,Bacteria,Actinomycetota,Actinomycetes,Actinomycetales,Dermabacteraceae,Brachybacterium,Brachybacterium alimentarium


In [3]:
food_cols = ['level1_substrate_origin',
       'level2_main_ingredient_group', 'main_ingredient', 'level3_food_type', 'food_name',
       'consistency', 'main_ferment', 'acid_type', 'acid_level',
       'alcohol_lovel', 'protein_degradation', 'fat_degradation',
       'added_ingredients', 'fermentation_temp', 'aging_time', 'agin_temp',
       ]

food_table = data[food_cols]
food_table = food_table.drop_duplicates()
food_table.head()

Unnamed: 0,level1_substrate_origin,level2_main_ingredient_group,main_ingredient,level3_food_type,food_name,consistency,main_ferment,acid_type,acid_level,alcohol_lovel,protein_degradation,fat_degradation,added_ingredients,fermentation_temp,aging_time,agin_temp
0,Animal,Dairy,milk,cheese,cheese canastra,solid,acid; amino acids,lactic,medium,none,high,medium,salt; rennet,mesophilic,>1 month,cold
3,Animal,Dairy,milk,cheese,cheese araxa,solid,acid; amino acids,lactic,medium,none,medium,low,salt; rennet,mesophilic,>1 month,cold
9,Animal,Dairy,milk,cheese,cheese serro,solid,acid; amino acids,lactic,medium,none,medium,low,salt; rennet,mesophilic,>1 month,cold
14,Animal,Dairy,milk goat,fresh cheese,ambriss,cream,acid,lactic,medium,none,low,none,salt,mesophilic,20 days,ambient
15,Animal,Dairy,milk,fresh cheese,labneh,cream,acid,lactic,medium,none,low,none,salt,mesophilic,20 days,ambient


In [4]:
food_table.shape

(184, 16)

In [5]:
# save updated food taxonomy table
food_table.to_csv('../data/food_taxonomy/food_taxonomy_250509.csv', index=False)
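
Before relying on the saved table as a taxonomy, it may be worth checking that each lower-level value sits under exactly one parent. The sketch below is one possible check: `children_with_multiple_parents` is a hypothetical helper, the first three rows mirror values shown above, and the conflicting row is deliberately made up to show what a violation looks like.

```python
import pandas as pd

toy = pd.DataFrame({
    "level3_food_type": ["cheese", "cheese", "fresh cheese"],
    "food_name": ["cheese canastra", "cheese araxa", "labneh"],
})


def children_with_multiple_parents(df, child, parent):
    """Return child values that appear under more than one parent value."""
    counts = df.groupby(child)[parent].nunique()
    return counts[counts > 1].index.tolist()


# Deliberately inconsistent row, invented for the demonstration.
conflict = pd.DataFrame({"level3_food_type": ["cheese"], "food_name": ["labneh"]})
bad = pd.concat([toy, conflict], ignore_index=True)

print(children_with_multiple_parents(toy, "food_name", "level3_food_type"))
print(children_with_multiple_parents(bad, "food_name", "level3_food_type"))
```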