

# Recipe Preprocessing (NLP Pipeline)

This notebook cleans a free-text recipe paragraph and extracts a structured list of ingredients using lightweight NLP preprocessing.

**Pipeline:** lowercasing and cleaning → tokenization → stopword removal → lemmatization & stemming → ingredient extraction.


In [None]:
# Recipe Preprocessing Assignment

# Download required NLTK resources
import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("averaged_perceptron_tagger_eng")
nltk.download("punkt_tab")

from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import re


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [None]:
# Step 1: Input Recipe Paragraph

text = """To make fried rice, take 2 cups of rice, 1 tablespoon oil,
2 onions, 1 carrot, 1/2 cup peas, and salt to taste. Cook for 10 minutes."""

print("Original Text:\n", text, "\n")


Original Text:
 To make fried rice, take 2 cups of rice, 1 tablespoon oil, 
2 onions, 1 carrot, 1/2 cup peas, and salt to taste. Cook for 10 minutes. 



In [None]:
# Step 2: Cleaning (lowercase; remove fractions, numbers, and punctuation)
# Note: unit words such as "cups" and "tablespoon" are kept by this step

def clean_text(text):
    text = text.lower()
    text = re.sub(r"\d+/\d+", " ", text)             # remove fractions like 1/2
    text = re.sub(r"\d+", " ", text)                 # remove pure numbers
    text = re.sub(r"[^a-z\s]", " ", text)            # keep only letters
    text = re.sub(r"\s+", " ", text).strip()         # remove extra spaces
    return text

cleaned = clean_text(text)
print("Cleaned Text:\n", cleaned, "\n")


Cleaned Text:
 to make fried rice take cups of rice tablespoon oil onions carrot cup peas and salt to taste cook for minutes 
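
The cleaned text still contains unit words such as `cups` and `tablespoon`, because the regexes above only strip digits and punctuation. One possible way to drop them is a small word filter. This is a minimal sketch, assuming a hand-written unit list; `UNIT_WORDS` and `strip_units` are hypothetical names and not part of the notebook.

```python
# Hypothetical, deliberately small unit list (an assumption); extend for real recipes.
UNIT_WORDS = {
    "cup", "cups", "tablespoon", "tablespoons",
    "teaspoon", "teaspoons", "minute", "minutes",
}

def strip_units(cleaned_text):
    """Drop measurement words from text that is already lowercased and cleaned."""
    kept = [w for w in cleaned_text.split() if w not in UNIT_WORDS]
    return " ".join(kept)

# The cleaned text printed by Step 2.
cleaned_example = ("to make fried rice take cups of rice tablespoon oil onions "
                   "carrot cup peas and salt to taste cook for minutes")
without_units = strip_units(cleaned_example)
print(without_units)
```

A set lookup keeps the filter cheap, but it only removes exact matches, so abbreviations such as `tbsp` would need their own entries.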



In [None]:
# Step 3: Tokenization

tokens = word_tokenize(cleaned)
print("Tokenized Words:\n", tokens, "\n")


Tokenized Words:
 ['to', 'make', 'fried', 'rice', 'take', 'cups', 'of', 'rice', 'tablespoon', 'oil', 'onions', 'carrot', 'cup', 'peas', 'and', 'salt', 'to', 'taste', 'cook', 'for', 'minutes'] 



In [None]:
# Step 4: Stopword Removal

stop_words = set(stopwords.words("english"))
filtered_tokens = [w for w in tokens if w not in stop_words]
print("After Stopword Removal:\n", filtered_tokens, "\n")


After Stopword Removal:
 ['make', 'fried', 'rice', 'take', 'cups', 'rice', 'tablespoon', 'oil', 'onions', 'carrot', 'cup', 'peas', 'salt', 'taste', 'cook', 'minutes'] 



In [None]:
# Step 5: Lemmatization & Stemming
# Helper: map Penn Treebank POS tags to WordNet POS constants

def get_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. 'NNS', 'VBD') to a WordNet POS constant."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    # Nouns and any unrecognized tag fall back to NOUN
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
pos_tags = nltk.pos_tag(filtered_tokens, lang='eng')

lemmatized_tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(tag))
                     for word, tag in pos_tags]
print("After Lemmatization:\n", lemmatized_tokens, "\n")

stemmer = SnowballStemmer("english")
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("After Stemming:\n", stemmed_tokens, "\n")


After Lemmatization:
 ['make', 'fried', 'rice', 'take', 'cup', 'rice', 'tablespoon', 'oil', 'onion', 'carrot', 'cup', 'pea', 'salt', 'taste', 'cook', 'minute'] 

After Stemming:
 ['make', 'fri', 'rice', 'take', 'cup', 'rice', 'tablespoon', 'oil', 'onion', 'carrot', 'cup', 'pea', 'salt', 'tast', 'cook', 'minut'] 
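
The two outputs above agree on most tokens, so the difference between lemmatization and stemming is easy to miss. The sketch below lines up the three printed lists and keeps only the tokens where lemma and stem disagree. The lists are copied from the outputs above; the variable names are illustrative.

```python
# Lists copied from the printed outputs of Steps 4 and 5.
filtered = ['make', 'fried', 'rice', 'take', 'cups', 'rice', 'tablespoon', 'oil',
            'onions', 'carrot', 'cup', 'peas', 'salt', 'taste', 'cook', 'minutes']
lemmas = ['make', 'fried', 'rice', 'take', 'cup', 'rice', 'tablespoon', 'oil',
          'onion', 'carrot', 'cup', 'pea', 'salt', 'taste', 'cook', 'minute']
stems = ['make', 'fri', 'rice', 'take', 'cup', 'rice', 'tablespoon', 'oil',
         'onion', 'carrot', 'cup', 'pea', 'salt', 'tast', 'cook', 'minut']

# Keep only the positions where the lemma and the stem differ.
differences = [(w, l, s) for w, l, s in zip(filtered, lemmas, stems) if l != s]

for word, lemma, stem in differences:
    print(f"{word:>8} -> lemma: {lemma:<8} stem: {stem}")
```

The three mismatches (`fried`, `taste`, `minutes`) show the stemmer producing truncated forms that are not dictionary words, while the lemmas stay readable.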



In [None]:
# Step 6: Extract Ingredients
# (Simple dictionary lookup approach)

# Entries must be in lemma form, because the lookup runs on lemmatized tokens
# ("peas" is lemmatized to "pea", so the lexicon stores "pea")
INGREDIENT_LEXICON = {"rice","oil","onion","carrot","pea","salt","pepper","garlic"}

ingredients = [w for w in lemmatized_tokens if w in INGREDIENT_LEXICON]
ingredients = list(dict.fromkeys(ingredients))  # remove duplicates, keep order
print("Final List of Extracted Ingredients:\n", ingredients, "\n")


Final List of Extracted Ingredients:
 ['rice', 'oil', 'onion', 'carrot', 'pea', 'salt'] 
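
A dictionary lookup only matches when a token and a lexicon entry share the same form. The sketch below is one way to make the lookup tolerant of a lexicon that mixes singular and plural entries: it accepts a token if either its surface form or its lemma is listed. `extract_ingredients` and `mixed_lexicon` are hypothetical names introduced for illustration, and the token lists are copied from the outputs above.

```python
def extract_ingredients(surface_tokens, lemma_tokens, lexicon):
    """Return unique lemmas whose lemma or surface form is in the lexicon, in order."""
    seen = {}  # dict preserves insertion order, so it doubles as an ordered set
    for surface, lemma in zip(surface_tokens, lemma_tokens):
        if lemma in lexicon or surface in lexicon:
            seen.setdefault(lemma, None)
    return list(seen)

# Token lists copied from the printed outputs of Steps 4 and 5.
surface = ['make', 'fried', 'rice', 'take', 'cups', 'rice', 'tablespoon', 'oil',
           'onions', 'carrot', 'cup', 'peas', 'salt', 'taste', 'cook', 'minutes']
lemmas = ['make', 'fried', 'rice', 'take', 'cup', 'rice', 'tablespoon', 'oil',
          'onion', 'carrot', 'cup', 'pea', 'salt', 'taste', 'cook', 'minute']

# Assumed lexicon mixing singular ("onion") and plural ("peas") entries.
mixed_lexicon = {"rice", "oil", "onion", "carrot", "peas", "salt", "pepper", "garlic"}

found = extract_ingredients(surface, lemmas, mixed_lexicon)
print(found)
```

Checking both forms costs one extra set lookup per token and avoids silently dropping an ingredient when the lexicon and the lemmatizer disagree on its form.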




### Reflection (Why preprocessing matters)
Preprocessing reduces noise and standardizes text, which makes it easier
for machines to extract structured information. Stopword removal drops
function words such as "to" and "of", while lemmatization and stemming collapse
inflected forms (e.g., onions → onion). The ingredient lookup runs on lemmas;
stems such as "fri" and "tast" are not dictionary words and would be harder to
match against a lexicon. Extraction is only as complete as that lexicon, whose
entries must be stored in the same normalized form as the tokens.
