# SWSN - EaT-PIM with simplified ingredients and ingredient metadata

Bartosz Stachowiak 148259<br>
Andrzej Kajdasz 148273

## 1. Introduction
After reading [EaT-PIM: Substituting Entities in Procedural Instructions using Flow Graphs and Embeddings (Sola S. Shirai, HyeongSik Kim)](https://dspace.rpi.edu/bitstream/handle/20.500.13015/6364/ISWC_EaT_PIM.pdf?sequence=1&isAllowed=y&fbclid=IwAR3RaVCT2_kb0T5NmVNs2ulIxjWkbDBu8T9wfUrIk7pSrjLcRQEeA6BqVkg), we noticed room for improvement in the approach. We hypothesize that incorporating information beyond the ingredients themselves would improve the quality of the suggestions.
<br><br>
The first and most important task for us is to incorporate ingredient metadata in substitute prediction. The proposed model utilizes only the contextual information about the ingredient from recipes, which is not always accurate (completely different ingredients might be prepared in similar ways but do not taste alike). To improve the quality of the prediction, we could incorporate metadata about individual ingredients (e.g. taste, type) and make a prediction as a combination of the two sources. Our plan is to describe the metadata as a one-hot encoded feature matrix, compute a similarity metric between the missing ingredient and its alternatives, combine that metric with the paper's original approach using weights, and select the most appropriate replacements.
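
The plan above can be illustrated with a minimal sketch. It assumes cosine similarity over one-hot vectors and a linear weighting; the schema, the two ingredient records, the weight `alpha`, and the embedding similarity of `0.8` are all hypothetical placeholders, not values from the paper or from our data.

```python
import math


def one_hot(record, schema):
    """Encode a metadata record as a flat 0/1 vector following `schema`."""
    return [
        1.0 if record.get(feature) == value else 0.0
        for feature, values in schema.items()
        for value in values
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_score(embedding_sim, metadata_sim, alpha=0.5):
    """Weighted combination: alpha weights the embedding-based similarity."""
    return alpha * embedding_sim + (1 - alpha) * metadata_sim


# Hypothetical schema and records, for illustration only.
SCHEMA = {"taste": ["sweet", "salty"], "origin": ["animal", "plant"]}
butter = {"taste": "salty", "origin": "animal"}
margarine = {"taste": "salty", "origin": "plant"}

metadata_sim = cosine(one_hot(butter, SCHEMA), one_hot(margarine, SCHEMA))
# 0.8 stands in for a similarity produced by the embedding model.
score = combined_score(embedding_sim=0.8, metadata_sim=metadata_sim, alpha=0.6)
```

Candidates would then be ranked by `score`; the weight `alpha` would need to be tuned.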
<br><br>
The second proposition, equally important, is to simplify the ingredients to get more informative predictions. The authors seem to have used a very granular distinction between individual ingredients, which sometimes gives very unhelpful substitution suggestions (e.g. pork => boneless pork). Simplifying the distinction and grouping very similar ingredients could lead to more insightful predictions. To achieve this, we will prune modifiers from existing ingredients (e.g. "boneless" in boneless pork) and aggregate ingredients to their simplest form.
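
As a rough illustration of modifier pruning, the sketch below drops words from a small, hypothetical modifier list. The list and the helper name `simplify` are assumptions made for this example and are not the method used later in the notebook.

```python
# Hypothetical modifier list; a real one would be derived from the data.
MODIFIERS = {"boneless", "fresh", "chopped", "frozen"}


def simplify(ingredient: str) -> str:
    """Remove known modifier words; keep the original name if nothing is left."""
    words = ingredient.lower().split()
    kept = [word for word in words if word not in MODIFIERS]
    return " ".join(kept) if kept else ingredient.lower()


simplified = [simplify(name) for name in ["boneless pork", "pork", "fresh chopped parsley"]]
```

With this list, "boneless pork" and "pork" collapse into the same entry, so one can no longer be suggested as a substitute for the other.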

We strongly believe that with these two changes, we will be able to improve upon the initial solution.

## 2. EDA

### 2.1. Original Dataset

Raw data are taken from [kaggle - Food.com Recipes and Interactions dataset](https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions).

This dataset was created by scraping over 230k recipes from [Food.com](https://www.food.com).

In [None]:
import itertools
import collections
import json

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('data/RAW_recipes.csv')
df.set_index('id', inplace=True)
df.head()

#### 2.1.1. Column Description

The dataset consists of recipes, each described by the following:
- **name** - name of the recipe, string, lowercase 
- **id** - id of the recipe, int
- **minutes** - time needed to prepare the recipe in minutes, int
- **contributor_id** - id of the contributor, int
- **submitted** - date of submission, string, format: YYYY-MM-DD
- **tags** - tags of the recipe, list of lowercase strings
- **nutrition** - nutrition information, list of 7 floats:
  - **calories (#)**
  - **total fat (PDV)**
  - **sugar (PDV)**
  - **sodium (PDV)**
  - **protein (PDV)**
  - **saturated fat (PDV)**
  - **carbohydrates (PDV)**
- **n_steps** - number of steps in the recipe, int
- **steps** - steps of the recipe, list of strings
- **description** - description of the recipe, string
- **ingredients** - ingredients of the recipe, list of strings
- **n_ingredients** - number of ingredients in the recipe, int

The authors of the paper have used only the **ingredients** and **steps** columns, which we will also use in our approach.

In [None]:
def parse_string_list(string_list: str) -> list[str]:
    return [token[1:-1] for token in string_list[1:-1].replace(" , ", "; ").split(", ")]

ingredients = df.ingredients.apply(parse_string_list).tolist()
steps = df.steps.apply(parse_string_list).tolist()
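
An alternative sketch, assuming each cell is stored as a Python list literal (which the slicing in `parse_string_list` suggests): the standard library's `ast.literal_eval` can parse it directly. The helper name `parse_list_literal` and the example string are hypothetical. Unlike `parse_string_list`, this version leaves `" , "` inside an item untouched rather than rewriting it to `"; "`, so the two functions do not always return the same strings.

```python
import ast


def parse_list_literal(cell: str) -> list[str]:
    """Parse a string such as "['a', 'b']" into a list of strings."""
    parsed = ast.literal_eval(cell)
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list literal, got {type(parsed).__name__}")
    return [str(item) for item in parsed]


# Hypothetical cell value, for illustration only.
example = "['winter squash', 'mexican seasoning', 'salt , pepper']"
parsed_example = parse_list_literal(example)
```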

#### 2.1.2. Ingredients

On average, each recipe has 9 ingredients.

In [None]:
df.n_ingredients.describe()

In [None]:
df.n_ingredients.hist(bins=80)
plt.title("Number of ingredients per recipe")
plt.show()

In [None]:
ingredient_counts = collections.Counter(itertools.chain.from_iterable(ingredients))

print("Total number of ingredients:", len(ingredient_counts))

In [None]:
most_common_ingredients = ingredient_counts.most_common(20)
plt.figure(figsize=(10, 5))
plt.barh([ingredient for ingredient, _ in most_common_ingredients], [count for _, count in most_common_ingredients])
plt.show()

In [None]:
counts_arr = np.array(tuple(ingredient_counts.values()))
used_only_once_count = (counts_arr < 2).sum()
used_less_than_5_count = (counts_arr < 5).sum()

print(f"{used_only_once_count / len(ingredient_counts) * 100:.2f}% of ingredients are used only in one recipe")
print(f"{used_less_than_5_count / len(ingredient_counts) * 100:.2f}% of ingredients are used in less than 5 recipes")

In [None]:
ing_x = [i for i in range(1, counts_arr.max() + 1)]
ing_y = [(counts_arr < i).sum() / counts_arr.shape[0] for i in ing_x]
plt.plot(ing_x, ing_y)
plt.xscale('log')
plt.title("Cumulative distribution of ingredient counts")
plt.xlabel("Used in less than n recipes")
plt.ylabel("% of ingredients")
plt.show()

Many ingredients are used very sparsely, which is a problem for our approach. They are unlikely to be useful in the prediction, but some of them might be grouped into more general ingredients. Some examples of such ingredients are:

In [None]:
for ing in list(ingredient_counts.keys())[-10:]:
    print(ing)

#### 2.1.3. Steps

On average, each recipe has 10 steps.

In [None]:
df.n_steps.describe()

In [None]:
df.n_steps.hist(bins=80)
plt.title("Number of steps per recipe")
plt.show()

Example of a recipe:

In [None]:
for i, step in enumerate(steps[0]):
    print(f"{i+1:>4}. {step}")

### 2.2. Taste Dataset

Because we want to include ingredient metadata in the substitute prediction, we needed a dataset that provides this information.

For this reason we analyzed the [DANS - TASTE, FAT AND TEXTURE DATABASE - TASTE VALUES DUTCH FOODS](https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:195372/tab/2) dataset, as it appeared to be the most promising. Overall, however, it is hard to find an extensive dataset covering the ingredients from the original problem, so at later stages of the project we might resort to using a different dataset or even creating our own.


The selected dataset was compiled with the most frequently consumed foods in the Netherlands and their taste intensity relative to other available foods.
For the purposes of our program, some of the data is irrelevant and will be omitted. 

In [None]:
ingredient_df = pd.read_csv('data/ingredients_taste/Dutch_Foods.csv').fillna(0)
ingredient_df.head()

#### 2.2.1. Column Description

The dataset consists of products described by the following features:
- **Food_code** - Food code, as much as possible based on the nevocode
- **Product_brand** - Product tested
- **NEVO_code** - Corresponding NEVO code (0=no NEVO code)
- **Product_description_NL** - Product description for the nevocode, in Dutch
- **Product_description_EN** - Product description for the nevocode, in English
- **Food_group_code** - Food group code
- **Food_group_NL** - Food group in Dutch
- **Food_group_EN** - Food group in English
- **Date** - Date of profiling
- **Serving_methods** - Standardized serving methods (temperature, with or without crust, etc.)
- **Preparation_method** - Standardized preparation method for cooked foods
- **Reference_control_foods** - Reference foods (=1) and control foods (=2)

For each of the five basic tastes (sweet, sour, bitter, umami, salt), and also for fat:
- **no_taste** - Number of panellists for *taste*
- **m_taste** - Mean intensity value for *taste*
- **sd_taste** - Standard deviation of the *taste* intensity value
- **se_taste** - Standard error of the mean *taste* intensity value

For our program we need only *taste* data with product description and food group.

In [None]:
ingredient_df = ingredient_df.drop(columns = ['Reference_control_foods', 'Food_code', 'Product_brand', 'NEVO_code', 'Product_description_NL','Food_group_code', 'Food_group_NL', 'Date', 'Serving_methods', 'Preparation_method', 'no_sweet', 'no_sour', 'no_bitter', 'no_umami', 'no_fat', 'no_salt'])
value_taste_columns = ['m_sweet', 'm_sour', 'm_bitter', 'm_umami', 'm_fat', 'm_salt']
print(f'There are {ingredient_df.shape[0]} different products')

#### 2.2.2. Food groups

In [None]:
print(f"There are {len(ingredient_df.groupby('Food_group_EN'))} different food groups in the dataset")
ingredient_df.groupby('Food_group_EN').count()['Product_description_EN'].sort_values(ascending=False).plot.barh(figsize = (15, 10))
plt.title('Number of products per food group')
plt.show()

Four groups have only one product. This gives a very big contrast when comparing to a group of almost 80 different products of vegetables and (non) alcoholic beverages.

#### 2.2.3. Taste

In [None]:
ingredient_df.filter(items = value_taste_columns).describe()

Each product was rated on a scale of 0 to 100 for each of the tastes. Most products have a high value of only one of the tastes (rarely two).

In [None]:
for taste in value_taste_columns:
    product = ingredient_df.sort_values(taste).tail(1)[['Product_description_EN', taste]]
    print(f"The most {taste[2:]} product was {product['Product_description_EN'].values[0]} with value {int(product[taste].values[0])}.")

### 2.3. Summary

The structure of the original dataset will be useful in our approach: the ingredients are given in a uniform way, which will allow us to group them into more general categories and simplify the problem.

After the grouping we should be able to use at least part of the taste dataset to enrich the information about the ingredients.

Our main concern is that the taste dataset is relatively small: only 627 products, whereas the original dataset contains almost 15k ingredients. Even after simplification, it's likely we'll have missing metadata for some ingredients.

For this reason we might switch to a different dataset or even create our own if we find it necessary.

## 3. Data Preparation - Exploring Approaches

As a part of our project, we need to recategorize the ingredients to their simplest form. We tried two approaches - AI-based and rule-based.
We also discarded the ingredients that could not be recategorized.

### 3.1. Ingredients categorization by GPT-3.5 turbo model

To facilitate AI-based categorization we decided to use OpenAI's GPT-3.5 turbo model using their API.

This solution allowed us to quickly obtain results of decent quality.
The drawback, however, is that this approach is not reproducible: each run of the script yields slightly different results (even with a fixed seed and temperature = 0).
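
One common way to work around this kind of non-reproducibility is to store the first answer for each ingredient and reuse it on later runs. The sketch below shows the idea under stated assumptions: `categorize_with_cache` and `fake_model` are hypothetical names, and `fake_model` is a stand-in for the real API call, which is not shown in this notebook.

```python
import json
import tempfile
from pathlib import Path


def categorize_with_cache(ingredients, categorize, cache_path):
    """Return {ingredient: category}, calling `categorize` only for unseen names."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    # dict.fromkeys removes duplicates while keeping the input order.
    missing = [name for name in dict.fromkeys(ingredients) if name not in cache]
    for name in missing:
        cache[name] = categorize(name)
    if missing:
        path.write_text(json.dumps(cache, indent=2, sort_keys=True))
    return {name: cache[name] for name in ingredients}


calls = []


def fake_model(name):
    """Stand-in for the real model call (hypothetical); records every call."""
    calls.append(name)
    return name.split()[-1]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "categorized_cache.json"
    first = categorize_with_cache(["boneless pork", "pork"], fake_model, cache_file)
    second = categorize_with_cache(["boneless pork", "pork"], fake_model, cache_file)
```

The second call is answered entirely from the cache file, so repeated runs return identical categories even if the underlying model would not.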

#### 3.1.1. Loading the categories

In [None]:
CATEGORIZED_INGREDIENTS_PATH = "./data/categorized.json"
BAD_CATEGORY = "unknown"

with open(CATEGORIZED_INGREDIENTS_PATH) as f:
    categorized_ingredients: dict = json.load(f)

categorized_ingredients.pop(BAD_CATEGORY, None)
categorized_ingredients = {k.lower(): v for k, v in categorized_ingredients.items() if v != BAD_CATEGORY}

failed = [
    ingredient
    for ingredient in ingredient_counts.keys()
    if ingredient not in categorized_ingredients
]
print(f"Failed to categorize {len(failed)} ingredients")

Due to the way the GPT-3.5 turbo model works, some ingredients were not processed by the model: they were either skipped or processed incorrectly. We decided to discard these ingredients.

In [None]:
failed[:10]

In [None]:
counted_categories = collections.Counter(categorized_ingredients.values())
most_common = counted_categories.most_common(20)

plt.figure(figsize=(10, 5))
plt.barh([category for category, _ in most_common], [count for _, count in most_common])
plt.show()

#### 3.1.2. Loss of data analysis

In [None]:
num_categories = len(counted_categories)
num_ingredients = len(ingredient_counts)

print("Total number of categories:", num_categories)
print("Total number of ingredients:", num_ingredients)
print(f"Reduction in number of ingredients to: {num_categories / num_ingredients * 100:.2f}% of original size")

In [None]:
ai_recategorized_ingredients = [
    categorized_ingredients[ingredient]
    for ingredient in itertools.chain.from_iterable(ingredients)
    if ingredient in categorized_ingredients
]

len(ai_recategorized_ingredients)
print(f"Retained {len(ai_recategorized_ingredients) / df['n_ingredients'].sum() * 100:.2f}% of ingredients usage")

#### 3.1.3. Recategorization results

In [None]:
ingredient_recounts = collections.Counter(ai_recategorized_ingredients)
most_common_ingredients_recategorized = ingredient_recounts.most_common(20)

fig, axs = plt.subplots(1, 2, figsize=(15, 5))
axs[0].barh([ingredient for ingredient, _ in most_common_ingredients], [count for _, count in most_common_ingredients])
axs[0].set_title("Most common ingredients (original)")
axs[1].barh([ingredient for ingredient, count in most_common_ingredients_recategorized if count > 1000], [count for ingredient, count in most_common_ingredients_recategorized if count > 1000])
axs[1].set_title("Most common ingredients (after recategorization)")
plt.show()

In [None]:
MIN_RECIPIES_COUNT = 5

re_counts_arr = np.array(tuple(ingredient_recounts.values()))
re_ing_x = [i for i in range(1, re_counts_arr.max() + 1)]
re_ing_y = [(re_counts_arr < i).sum() / re_counts_arr.shape[0] for i in re_ing_x]

fig, axs = plt.subplots(1, 2, figsize=(15, 5))

axs[0].plot(ing_x, ing_y, label="Original")
axs[0].plot(re_ing_x, re_ing_y, label="After recategorization")
axs[0].plot([MIN_RECIPIES_COUNT, MIN_RECIPIES_COUNT], [0, 1], label="Minimum count", linestyle="--", color="black")
axs[0].set_xscale('log')
axs[0].set_title("Cumulative distribution of ingredient counts")
axs[0].set_xlabel("Used in less than n recipes")
axs[0].set_ylabel("% of ingredients")
axs[0].legend()

axs[1].plot(sorted(counts_arr), label="Original")
axs[1].plot(sorted(re_counts_arr), label="After recategorization")
axs[1].plot([0, counts_arr.shape[0]], [MIN_RECIPIES_COUNT, MIN_RECIPIES_COUNT], label="Minimum count", linestyle="--", color="black")
axs[1].set_title("Sorted ingredient counts")
axs[1].set_xlabel("Ingredient index")
axs[1].set_ylabel("Number of recipes")
axs[1].legend()
axs[1].set_yscale('log')

plt.show()

In [None]:
parsed_min_count = [(v, count) for v, count in ingredient_counts.most_common() if count >= MIN_RECIPIES_COUNT]
ai_parsed_min_recount = [(v, count) for v, count in ingredient_recounts.most_common() if count >= MIN_RECIPIES_COUNT]

print(f"Number of ingredients with at least {MIN_RECIPIES_COUNT} recipes:")
print(f"Original: {len(parsed_min_count)}")
print(f"After recategorization: {len(ai_parsed_min_recount)}")
print(f"Reduced to: {len(ai_parsed_min_recount) / len(parsed_min_count) * 100:.2f}% of original size")

#### 3.1.4. Summary

Recategorization by the GPT-3.5 turbo model was a good starting point, but it was not perfect. It reduced the number of ingredients from 15k to 3k overall, and from 8k to 2k when counting only ingredients used in at least 5 recipes. This is still a lot of ingredients to work with, much more than the number of entries in the taste dataset.

With this approach it will be necessary to create our own dataset with ingredients and their metadata.

### 3.2. Ingredients categorization by analytical method

As an alternative to the AI-based approach, we also tried an analytical method that groups ingredients based on the noun they contain.

#### 3.2.1. Preparing groups of keywords

The keywords are ingredient names grouped into several categories. They were created based on the most frequent nouns in the dataset.

In [None]:
from report_utils import grouping
grouping.groups.keys()
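
The `report_utils.grouping` module is not shown in this notebook, so the following is only a minimal sketch of how keyword-based grouping could work. The `GROUPS` dictionary and both helper names are hypothetical. The nested `group -> keyword -> ingredients` result mirrors the way `ingredient_map` is indexed later in this section.

```python
from collections import defaultdict

# Hypothetical keyword groups; the real ones live in report_utils.grouping.
GROUPS = {
    "meat": ["pork", "beef", "chicken"],
    "dairy": ["milk", "cheese", "butter"],
}


def find_keyword(ingredient, groups=GROUPS):
    """Return (group, keyword) for the first keyword found among the words."""
    words = ingredient.lower().split()
    for group, keywords in groups.items():
        for keyword in keywords:
            if keyword in words:
                return group, keyword
    return "uncategorized", "uncategorized"


def group_all(ingredients):
    """Build a nested mapping: group -> keyword -> list of original names."""
    result = defaultdict(lambda: defaultdict(list))
    for ingredient in ingredients:
        group, keyword = find_keyword(ingredient)
        result[group][keyword].append(ingredient)
    return result


grouped = group_all(["boneless pork chops", "cheddar cheese", "xanthan gum"])
```

Matching on whole words avoids accidental hits inside longer words, at the cost of missing plural forms such as "cheeses".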

#### 3.2.2. Splitting complex ingredients

The dataset contains several ingredients that are in fact a combination of two, three or even four different elements. Such ingredients can be separated into their subcomponents.

In [None]:
for ingredient in ingredient_counts:
    if " and " in ingredient and " with " in ingredient:
        print(ingredient)

In [None]:
separator_words = [" and ", " or ", " with ", " & ", " in "]
formula = grouping.create_formula(separator_words)

splited_ingredient = {}
for ingredient in ingredient_counts:
    for word in grouping.split_ingriedents(ingredient, formula):
        if word in splited_ingredient.keys():
            splited_ingredient[word] += 1
        else:
            splited_ingredient[word] = 1

In [None]:
print(f"Number of ingredients before split: {len(ingredient_counts)}")
print(f"Number of ingredients after split: {len(splited_ingredient)}")
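
The helpers `grouping.create_formula` and `grouping.split_ingriedents` are defined outside this notebook. One plausible way to implement such a split, shown here purely as an assumption, is a regular expression built from the separator words. The name `split_ingredient` is hypothetical.

```python
import re

SEPARATORS = [" and ", " or ", " with ", " & ", " in "]

# Escape each separator so characters like "&" are matched literally.
pattern = re.compile("|".join(re.escape(sep) for sep in SEPARATORS))


def split_ingredient(ingredient):
    """Split an ingredient name on the separator words, dropping empty parts."""
    return [part.strip() for part in pattern.split(ingredient) if part.strip()]


parts = split_ingredient("macaroni and cheese with bacon")
```

Because the separators include surrounding spaces, words that merely contain them (e.g. "sandwich") are left intact. The " in " separator is aggressive, though: a name like "tomatoes in juice" would be split into two separate ingredients.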

#### 3.2.3. Grouping by the categories

In [None]:
ingredient_map = grouping.group_ingredients(splited_ingredient.keys())

#### 3.2.4. Recategorization results

In [None]:
group_list = [[len(group.keys()), key_group] for key_group, group in ingredient_map.items()]
group_list.sort(reverse=True)
plt.figure(figsize=(10, 5))
plt.barh([name.split("_")[0] for _, name in group_list], [counts for counts, _ in group_list])
plt.title("Number of different products in each group")
plt.show()

In [None]:
product_list = []
for product in ingredient_map.values():
    for key, products in product.items():
        product_list.append([len(products), key])
product_list.sort(reverse=True)
plt.figure(figsize=(10, 5))
plt.barh([name for _, name in product_list[1:25]], [counts for counts, _ in product_list[1:25]])
plt.title("Most popular products")
plt.show()


#### 3.2.5. Uncategorized products

In [None]:
uncategorized = len(ingredient_map["uncategorized"]["uncategorized"])
print(f"Number of uncategorized ingredients: {uncategorized}")

In [None]:
invert_map = {
    ingredient: key
    for key, group in ingredient_map.items()
    for _, ingredients in group.items()
    for ingredient in ingredients
}

In [None]:
invert_map = grouping.invert_grouping(ingredient_map)

rule_recategorized_ingredients = [
    invert_map[ingredient]
    for ingredient in itertools.chain.from_iterable(ingredients)
    if ingredient in invert_map and invert_map[ingredient] != "uncategorized"
]

In [None]:
ingredient_rule_recounts = collections.Counter(rule_recategorized_ingredients)
most_common_ingredients_rule_recategorized = ingredient_rule_recounts.most_common(20)

fig, axs = plt.subplots(1, 2, figsize=(15, 5))
axs[0].barh([ingredient for ingredient, _ in most_common_ingredients], [count for _, count in most_common_ingredients])
axs[0].set_title("Most common ingredients (original)")
axs[1].barh([ingredient for ingredient, count in most_common_ingredients_rule_recategorized if count > 1000], [count for ingredient, count in most_common_ingredients_rule_recategorized if count > 1000])
axs[1].set_title("Most common ingredients (after recategorization)")
plt.show()

#### 3.2.6. Comparison with baseline and previous approach

In [None]:
rule_re_counts_arr = np.array(tuple(ingredient_rule_recounts.values()))
rule_re_ing_x = [i for i in range(1, rule_re_counts_arr.max() + 1)]
rule_re_ing_y = [(rule_re_counts_arr < i).sum() / rule_re_counts_arr.shape[0] for i in rule_re_ing_x]

fig, axs = plt.subplots(1, 2, figsize=(15, 5))

axs[0].plot(ing_x, ing_y, label="Original")
axs[0].plot(re_ing_x, re_ing_y, label="After recategorization (AI)")
axs[0].plot(rule_re_ing_x, rule_re_ing_y, label="After recategorization (rule based)")
axs[0].plot([MIN_RECIPIES_COUNT, MIN_RECIPIES_COUNT], [0, 1], label="Minimum count", linestyle="--", color="black")
axs[0].set_xscale('log')
axs[0].set_title("Cumulative distribution of ingredient counts")
axs[0].set_xlabel("Used in less than n recipes")
axs[0].set_ylabel("% of ingredients")
axs[0].legend()

axs[1].plot(sorted(counts_arr), label="Original")
axs[1].plot(sorted(re_counts_arr), label="After recategorization (AI)")
axs[1].plot(sorted(rule_re_counts_arr), label="After recategorization (rule based)")
axs[1].plot([0, counts_arr.shape[0]], [MIN_RECIPIES_COUNT, MIN_RECIPIES_COUNT], label="Minimum count", linestyle="--", color="black")
axs[1].set_title("Sorted ingredient counts")
axs[1].set_xlabel("Ingredient index")
axs[1].set_ylabel("Number of recipes")
axs[1].legend()
axs[1].set_yscale('log')

plt.show()

Number of ingredients after recategorization:

In [None]:
len(ingredient_rule_recounts)

In [None]:
rule_parsed_min_recount = [(v, count) for v, count in ingredient_rule_recounts.most_common() if count >= MIN_RECIPIES_COUNT]

print(f"Number of ingredients with at least {MIN_RECIPIES_COUNT} recipes:")
print(f"Original: {len(parsed_min_count)}")
print(f"After recategorization: {len(rule_parsed_min_recount)}")
print(f"Reduced to: {len(rule_parsed_min_recount) / len(parsed_min_count) * 100:.2f}% of original size")

#### 3.2.7. Summary

The second approach brings down the number of ingredients much more substantially, down to 250. This is a much more manageable number, but it's still a lot more than the number of entries in the taste dataset.

Moreover this method omits almost 1k ingredients, which is a lot of data to lose.

### 3.3. Conclusion from data preparation

Both approaches have their pros and cons. The AI-based approach is much more flexible and can be easily extended to include more categories, but it's not reproducible and the results are not always the same. The analytical approach is much more stable, but it's not as flexible and requires manual work to create the categories.

Analyzing the results of the AI approach more deeply, we can still observe some redundancy, such as separate categories for "beef" and "roasted beef", and several ingredients split into their singular and plural forms.

Therefore, we decided to apply an extended analytical method to the results of the AI-based approach to get the best of both worlds.


## 4. Data Preparation - Combining both methods

### 4.1. Applying analytical method on the results of AI-based approach

As stated in the previous section, we decided to apply the analytical method to the results of the AI-based approach to get the best of both worlds.

Whilst this leads to fewer ingredients than the analytical method alone, the final result is of better quality.

In [None]:
final_grouping = grouping.group_ingredients(categorized_ingredients.values())
final_inversed_map = grouping.invert_grouping(final_grouping)

len(final_grouping["uncategorized"]["uncategorized"])

In [None]:
counted_categories

In [None]:
for ingredient in set(final_grouping['uncategorized']['uncategorized']):
    print(f"{ingredient:<20} {counted_categories[ingredient]:>4}")

In [None]:
final_rule_recategorized_ingredients = [
    final_inversed_map[ingredient]
    for ingredient in itertools.chain.from_iterable(ingredients)
    if ingredient in final_inversed_map and final_inversed_map[ingredient] != "uncategorized"
]

### 4.2. Recategorization results

In [None]:
ingredient_final_recounts = collections.Counter(final_rule_recategorized_ingredients)

final_re_counts_arr = np.array(tuple(ingredient_final_recounts.values()))

final_ing_x = [i for i in range(1, final_re_counts_arr.max() + 1)]
final_ing_y = [(final_re_counts_arr < i).sum() / final_re_counts_arr.shape[0] for i in final_ing_x]

fig, axs = plt.subplots(1, 2, figsize=(15, 5))

axs[0].plot(ing_x, ing_y, label="Original")
axs[0].plot(re_ing_x, re_ing_y, label="After recategorization (AI)")
axs[0].plot(rule_re_ing_x, rule_re_ing_y, label="After recategorization (rule based)")
axs[0].plot(final_ing_x, final_ing_y, label="After recategorization (final)")
axs[0].plot([MIN_RECIPIES_COUNT, MIN_RECIPIES_COUNT], [0, 1], label="Minimum count", linestyle="--", color="black")
axs[0].set_xscale('log')
axs[0].set_title("Cumulative distribution of ingredient counts")
axs[0].set_xlabel("Used in less than n recipes")
axs[0].set_ylabel("% of ingredients")
axs[0].legend()

axs[1].plot(sorted(counts_arr), label="Original")
axs[1].plot(sorted(re_counts_arr), label="After recategorization (AI)")
axs[1].plot(sorted(rule_re_counts_arr), label="After recategorization (rule based)")
axs[1].plot(sorted(final_re_counts_arr), label="After recategorization (final)")
axs[1].plot([0, counts_arr.shape[0]], [MIN_RECIPIES_COUNT, MIN_RECIPIES_COUNT], label="Minimum count", linestyle="--", color="black")
axs[1].set_title("Sorted ingredient counts")
axs[1].set_xlabel("Ingredient index")
axs[1].set_ylabel("Number of recipes")
axs[1].legend()
axs[1].set_yscale('log')

plt.show()

In [None]:
# most common ingredients recategorized

most_common_ingredients_recategorized = ingredient_final_recounts.most_common(20)

fig, axs = plt.subplots(1, 2, figsize=(15, 5))
axs[0].barh([ingredient for ingredient, _ in most_common_ingredients], [count for _, count in most_common_ingredients])
axs[0].set_title("Most common ingredients (original)")
axs[1].barh([ingredient for ingredient, count in most_common_ingredients_recategorized if count > 1000], [count for ingredient, count in most_common_ingredients_recategorized if count > 1000])
axs[1].set_title("Most common ingredients (after recategorization)")
plt.show()

In [None]:
categories_reduction = {
    "original": len(ingredient_counts),
    "ai": len(ai_parsed_min_recount),
    "rule": len(rule_parsed_min_recount),
    "final": len(ingredient_final_recounts)
}

usage_reduction = {
    "original": sum(ingredient_counts.values()),
    "ai": sum(dict(ai_parsed_min_recount).values()),
    "rule": sum(dict(rule_parsed_min_recount).values()),
    "final": sum(ingredient_final_recounts.values())
}

fig, axs = plt.subplots(1, 2, figsize=(15, 5))

axs[0].bar(categories_reduction.keys(), categories_reduction.values())
axs[0].set_title("Number of ingredients after recategorization")

axs[1].bar(usage_reduction.keys(), usage_reduction.values())
axs[1].set_title("Number of ingredient usages after recategorization")
plt.show()


In [None]:
with open("./data/final_recategorized.json", 'w') as f:
    json.dump(list(ingredient_final_recounts.keys()), f)

## 5. Data Preparation - Ingredients metadata

As part of our project, we need to enrich the ingredients with metadata. We decided to create our own dataset of ingredients and their metadata, once again using an AI-based approach.

We tasked OpenAI's GPT-3.5 turbo model with generating metadata for each ingredient, containing the following information:
- **origin** - origin of the ingredient, one of: "animal", "plant", "other"
- **type** - type of the ingredient, one of: "raw", "processed"
- **state** - state of the ingredient, one of: "solid", "liquid", "gas"
- **texture** - texture of the ingredient, one of: "soft", "hard", "crunchy", "smooth"
- **taste** - taste of the ingredient, one of: "sweet", "sour", "bitter", "umami", "salty"
- **taste-intensity** - intensity of the taste of the ingredient, one of: "low", "medium", "high"
- **smell** - smell of the ingredient, one of: "sweet", "sour", "bitter", "umami", "salty"
- **smell-intensity** - intensity of the smell of the ingredient, one of: "low", "medium", "high"
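
Since a language model can return values outside the requested options, it may be worth checking every generated record against the schema above. The following is a minimal sketch: the allowed values are copied from the list above, while the `validate` helper and the sample records are hypothetical.

```python
ALLOWED = {
    "origin": {"animal", "plant", "other"},
    "type": {"raw", "processed"},
    "state": {"solid", "liquid", "gas"},
    "texture": {"soft", "hard", "crunchy", "smooth"},
    "taste": {"sweet", "sour", "bitter", "umami", "salty"},
    "taste-intensity": {"low", "medium", "high"},
    "smell": {"sweet", "sour", "bitter", "umami", "salty"},
    "smell-intensity": {"low", "medium", "high"},
}


def validate(record):
    """Return a list of 'field: value' strings for missing or disallowed values."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if value not in allowed:
            problems.append(f"{field}: {value!r}")
    return problems


# Hypothetical records, for illustration only.
good = {
    "origin": "animal", "type": "processed", "state": "solid", "texture": "soft",
    "taste": "salty", "taste-intensity": "medium",
    "smell": "sour", "smell-intensity": "low",
}
bad = dict(good, texture="chewy")
del bad["smell"]

good_problems = validate(good)
bad_problems = validate(bad)
```

Records with a non-empty problem list could be regenerated or discarded before one-hot encoding.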


### 5.1. Loading the metadata

In [None]:
with open("./data/characterized.json") as f:
    characterized_ingredients_df = pd.read_json(f).T

characterized_ingredients_df.head()

In [None]:
fig, axs = plt.subplots(2, 4, figsize=(15, 5))
for i, column in enumerate(characterized_ingredients_df.columns):
    ax = axs[i // 4, i % 4]
    characterized_ingredients_df[column].value_counts().plot.barh(ax=ax, title=column)
    ax.set_ylabel("")

plt.tight_layout() 
plt.show()