<a href="https://colab.research.google.com/github/AEStoa/ELI/blob/main/Testcode2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Recipe Cleaning Steps  

 call database, call recipe

 lowercase all words

 convert fractions to decimals

 "half" -> "0.5"  

 "¾" -> "0.75"

 convert written numbers ("five" -> "5")

 break numbers from letters ("5g" becomes "5 g")

 remove non alphanumeric except periods

 remove periods except those in decimals

 Move to tokenizing


Tokenizing Steps

 tokenize recipe   

 check database for ingredient terms and keep multi-word ingredients as one token

 Extract ingredient, quantity, and unit from tokenized recipe (2nd cleaning function, removes extra words)  

 no matter the order they appear in the recipe, group them into sets of 3 (tuples)  

 in the groups of 3 find the unit location in the tuple (check the unit set)  

 reorganize the tuples of 3 and print them in the output  



Assumptions about recipe:

*unit will always follow quantity

*quantity is always written or listed as a number/decimal/fraction

In [2]:
pip install word2number


Collecting word2number
  Downloading word2number-1.1.zip (9.7 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: word2number
  Building wheel for word2number (setup.py) ... [?25l[?25hdone
  Created wheel for word2number: filename=word2number-1.1-py3-none-any.whl size=5567 sha256=1d6be47a0a82193136d1f93864e06e24d2043b14079259c9d6df299e094cb326
  Stored in directory: /root/.cache/pip/wheels/84/ff/26/d3cfbd971e96c5aa3737ecfced81628830d7359b55fbb8ca3b
Successfully built word2number
Installing collected packages: word2number
Successfully installed word2number-1.1


In [7]:
import requests
import re
import csv
from word2number import w2n
from fractions import Fraction
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

def load_database(database_url):
    response = requests.get(database_url)
    database = set()
    if response.status_code == 200:
        csv_content = response.text.splitlines()
        csv_reader = csv.reader(csv_content)
        next(csv_reader)  # Skip the first row (headers)
        for row in csv_reader:
            ingredient, _ = row
            database.add(ingredient.lower())
    return database

def convert_common_fractions(text):
    common_fractions = {
        'half': '0.5',
        'quarter': '0.25',
        'three quarter': '0.75'
        # Add more common fractions as needed
    }
    for fraction, value in common_fractions.items():
        text = text.replace(fraction, value)
    return text

def convert_fractions_to_decimals(text):
    # Function to convert fractions to decimals
    fraction_pattern = r'(\d+)/(\d+)'

    def replace_fraction(match):
        numerator, denominator = map(int, match.groups())
        decimal_value = numerator / denominator
        return str(decimal_value)

    cleaned_text = re.sub(fraction_pattern, replace_fraction, text)
    return cleaned_text

def convert_written_numbers(text):
    # Convert written-out numbers to numeric values
    words = text.split()
    for i, word in enumerate(words):
        try:
            numeric_value = w2n.word_to_num(word)
            words[i] = str(numeric_value)
        except ValueError:
            pass  # Ignore words that are not written-out numbers
    return ' '.join(words)

def break_numbers_from_letters(text):
    pattern = r'(\d+)([a-zA-Z]+)'

    def separate_numbers_letters(match):
        return match.group(1) + ' ' + match.group(2)

    cleaned_text = re.sub(pattern, separate_numbers_letters, text)
    return cleaned_text

def remove_non_alphanumeric_except_periods(text):
    # Remove all non-alphanumeric characters except periods
    cleaned_text = re.sub(r'[^\w.]', ' ', text)
    return cleaned_text

def remove_punctuation_except_periods(text):
    # Remove all periods except those next to or inside numbers
    cleaned_text = re.sub(r'(?<!\d)\.(?!\d)', '', text)
    return cleaned_text

def tokenize_text(cleaned_text, database):
    # Tokenize the cleaned text while preserving decimals and multi-word ingredients
    words = re.findall(r'\b\d+\.\d+\b|\b\w+\b', cleaned_text)
    tokens = []
    i = 0

    while i < len(words):
        is_multi_word = False
        for j in range(i, len(words)):
            phrase = ' '.join(words[i:j + 1])
            if phrase.lower() in database:
                tokens.append(phrase.lower())
                i = j + 1
                is_multi_word = True
                break
        if not is_multi_word:
            tokens.append(words[i].lower())
            i += 1

    return tokens


def extract_numerical_units_and_ingredients(tokenized_text):
    numerical_values = []
    units = []
    ingredients = []

    i = 0
    while i < len(tokenized_text):
        token = tokenized_text[i]
        if re.match(r'^\d+(\.\d+)?$', token):  # Check if the token is a numerical value
            numerical_values.append(float(token))
            next_token_index = i + 1
            if next_token_index < len(tokenized_text):
                next_token = tokenized_text[next_token_index]
                if next_token in ["g", "gram", "grams",
    "kg", "kilogram", "kilograms",
    "mg", "milligram", "milligrams",
    "oz", "ounce", "ounces",
    "lb", "pound", "pounds",
    "ml", "milliliter", "milliliters",
    "l", "liter", "liters",
    "tsp", "teaspoon", "teaspoons",
    "tbsp", "tablespoon", "tablespoons",
    "cup", "cups",
    "pt", "pint", "pints",
    "qt", "quart", "quarts",
    "gal", "gallon", "gallons"]:
                    units.append(next_token)
                    i += 2  # Skip the numerical value and unit token
                else:
                    units.append(None)
                    i += 1  # Skip only the numerical value
            else:
                units.append(None)
                i += 1  # Skip only the numerical value
        else:
            numerical_values.append(None)
            units.append(None)
            ingredients.append(token)
            i += 1

    return numerical_values, units, ingredients


def clean_text(text, database):
    # Function to clean the text and print original, cleaned, and tokenized versions
    lowercased_text = text.lower()
    cleaned_text = convert_fractions_to_decimals(lowercased_text)
    cleaned_text = convert_written_numbers(cleaned_text)
    cleaned_text = convert_common_fractions(cleaned_text)
    cleaned_text = break_numbers_from_letters(cleaned_text)
    cleaned_text = remove_non_alphanumeric_except_periods(cleaned_text)
    cleaned_text = remove_punctuation_except_periods(cleaned_text)

    # Tokenize the cleaned text preserving decimals and multi-word ingredients
    tokenized_text = tokenize_text(cleaned_text, database)

    # Extract ingredient details
    ingredient_details = extract_ingredient_details(tokenized_text, database)

    # Print original, cleaned, and tokenized versions
    print("Original Recipe:")
    print(text)
    print("\nCleaned Recipe:")
    print(cleaned_text)
    print("\nTokenized Recipe:")
    print(tokenized_text)
    print("\nCleaned Tokenized Recipe:")
    print(ingredient_details)

    # Return tokenized text and ingredient details
    return tokenized_text, ingredient_details

def extract_ingredient_details(tokenized_text, database):
    ingredient_details = []
    i = 0

    while i < len(tokenized_text):
        ingredient = None
        numerical_value = None
        unit = None

        # Extract numerical value
        if tokenized_text[i] and (tokenized_text[i].replace(".", "").isdigit() or re.match(r'^\d+/\d+$', tokenized_text[i])):
            numerical_value = tokenized_text[i]
            i += 1

            # Extract unit
            if i < len(tokenized_text):
                unit = tokenized_text[i]
                i += 1

        # Extract ingredient
        if i < len(tokenized_text):
            ingredient = tokenized_text[i]
            i += 1

        # Check if the ingredient is in the database (valid ingredient)
        if ingredient.lower() in database:
            # Append to the ingredient_details list
            ingredient_details.append({
                'ingredient': ingredient,
                'numerical_value': numerical_value,
                'unit': unit
            })

    return ingredient_details



# Dictionary of recipe URLs
recipe_urls = {
    1: 'https://raw.githubusercontent.com/AEStoa/ELI/main/Recipe2_6.txt',
    2: 'https://raw.githubusercontent.com/AEStoa/ELI/main/Recipe2_5.txt',
    3: 'https://raw.githubusercontent.com/AEStoa/ELI/main/Recipe1v1%20-%20Copy%20(3).txt',
    5: 'https://raw.githubusercontent.com/AEStoa/ELI/main/Recipe1v1%20-%20Copy%20(6).txt',
    7: 'https://raw.githubusercontent.com/AEStoa/ELI/main/Recipe1v1%20-%20Copy(10).txt',
    8: 'https://raw.githubusercontent.com/AEStoa/ELI/main/Recipe2_1.txt'
}

# Database URL
database_url = 'https://raw.githubusercontent.com/AEStoa/ELI/main/FakeingredientC02e.csv'

# Ask user to pick a recipe
print("Choose a recipe:")

for key in recipe_urls:
    print(f"{key}: Recipe {key}")

recipe_choice = int(input("Enter the number of the recipe: "))

# Validate user input
selected_url = recipe_urls.get(recipe_choice)
if not selected_url:
    print('Invalid recipe choice.')
else:
    database = load_database(database_url)
    response = requests.get(selected_url)
    if response.status_code == 200:
        cleaned_text = clean_text(response.text, database)
    else:
        print('Failed to retrieve the recipe file. Status code:', response.status_code)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Choose a recipe:
1: Recipe 1
2: Recipe 2
3: Recipe 3
5: Recipe 5
7: Recipe 7
8: Recipe 8
Enter the number of the recipe: 5
Original Recipe:


The recipe will have 5g of salt, 2g of sugar, and 3g of powdered sugar



Cleaned Recipe:
the recipe will have 5 g of salt  2 g of sugar  and 3 g of powdered sugar

Tokenized Recipe:
['the', 'recipe', 'will', 'have', '5', 'g', 'of', 'salt', '2', 'g', 'of', 'sugar', 'and', '3', 'g', 'of', 'powdered sugar']

Cleaned Tokenized Recipe:
[{'ingredient': 'salt', 'numerical_value': None, 'unit': None}, {'ingredient': 'sugar', 'numerical_value': None, 'unit': None}, {'ingredient': 'powdered sugar', 'numerical_value': None, 'unit': None}]
