### PDF Text Extraction from the Cookbook PDF

In [35]:
import pymupdf
import re
from rapidfuzz import fuzz, process
import os
import json

Examining a random page in the book to see the structure and quality of the extracted text.

In [36]:
doc = pymupdf.open("a guide to modern cookery.pdf")
example_page = doc[201]
text = example_page.get_text()
print(text)

176  GUIDE  TO  MODERN  COOKERY 
448— EQQS  EN  COCOTTE  A  LA  SOUBISE 
Garnish  the  bottom  and  sides  of  the  cocottes  with  a  coating 
of  thick  Soubise  purde.  Break  the  eggs,  season,  and  poach. 
When  dishing  up,  surround  the  yolks  with  a  thread  of  melted 
meat-glaze. 
449— MOULDED  EGGS 
These  form  a  very  ornamental  dish,  but  the  time  required 
to  prepare  them  being  comparatively  long,  poached,  soft-boiled, 
and  other  kinds  of  eggs  are  generally  preferred  in  their  stead. 
They  are  made  in  variously  shaped  moulds,  ornamented  accord- 
ing to  the  nature  of  the  preparation,  and  the  eggs  are  broken 
into  them  direct,  or  they  may  be  inserted  in  the  form  of 
scrambled  eggs,  together  with  raw  eggs  poached  in  a  hain- 
marie. 
Whatever  be  the  mode  of  preparation,  the  moulds  should 
always  be  liberally  buttered.  The  usual  time  allowed  for  the 
poaching  of  the  eggs  in  moulds  is  from 

Fantastic, the text is extracted well and follows a consistent structure: each recipe has a numbered title in all caps, followed by its instructions. There are some mistakes from whatever OCR was originally used, but they appear to be minor (e.g. "EQQS" instead of "EGGS").

Now I'll open the PDF and examine it manually:
- Pages 1-22 are the preface and table of contents.
- The glossary could be useful but is structured differently and not all entries are useful. I'll ignore it for now since most entries are elaborated on later.
- Each part appears to have a brief introduction before following the structure seen above.
- Given that the book's own page numbering begins at PDF page 27, I'll extract each chapter separately and label its metadata with the chapter title (e.g. "Stocks", "Sauces", "Soups", etc.) without having to iterate through the entire 1000-page PDF immediately.

Extracting the table of contents to get the titles and page numbers for each chapter.

In [37]:
doc = pymupdf.open("a guide to modern cookery.pdf")
toc_pages = [doc[20], doc[21]]
toc_page_texts = []

for page in toc_pages:
    text = page.get_text()
    toc_page_texts.append(text)
    print(text)

CONTENTS 
PART    I 
FUNDAMENTAL   ELEMENTS 
CHAPTER  I 
PAGE 
FONDS  DE  CUISINE  ........  I 
CHAPTER  II 
THE  LEADING  WARM   SAUCES     .....  •  '5 
CHAPTER  III 
THE   SMALL  COMPOUND   SAUCES  ...  .  .  24 
CHAPTER  IV 
COLD  SAUCES  AND  COMPOUND  BUTTERS        .....  48 
CHAPTER  V 
SAVOURY  JELLIES  OR  ASPICS  .  ......  59 
CHAPTER  VI 
THE  COURT-BOUILLONS  AND  THE  MARINADES  .  .  .  -64 
CHAPTER  VII 
\J/:  ELEMENTARY  PREPARATIONS  .....  70 
CHAPTER  VIII 
THE  VARIOUS  GARNISHES   FOR  SOUPS  .  .  .  .  87 
CHAPTER  IX 
GARNISHING  PREPARATIONS   FOR  RELEVis  AND   ENTR]£eS  .  .  92 
CHAPTER  X 
U^DING  CULINARY  OPERATIONS  .  ....  97 

xii  CONTENTS 
PART   II 
RECIPES  AND   MODES   OF  PROCEDURE 
CHAPTER  XI 
PAGE 
HORS-D'CEUVRES      .  .  .  .  .  .  .  ,  .137 
CHAPTER  XII 
EGGS  .......  .  .        164 
CHAPTER  XIII 
SOUPS  ..........      197 
CHAPTER  XIV 
FISH  ..........        260 
CHAPTER  XV 
RELEVilS  AND  ENTRIES  OF  BUTCHER'S  MEAT  ....

In [None]:
chapters = {}

for text in toc_page_texts:
    lines = text.split("\n")
    for i, line in enumerate(lines):
        if i > 3:
            if i%2 == 1:
                # the title is the first words of the line 
                # followed by some amount of "."
                # the page number is at the end
                split_text = line.split(".")
                title = split_text[0]
                page_num = split_text[-1]
                chapters[title] = page_num 
            
for chapter in chapters:
    print(chapter)
    print(chapters[chapter])

FONDS  DE  CUISINE  
  I 
THE  LEADING  WARM   SAUCES     
  •  '5 
THE   SMALL  COMPOUND   SAUCES  
  24 
COLD  SAUCES  AND  COMPOUND  BUTTERS        
  48 
SAVOURY  JELLIES  OR  ASPICS  
  59 
THE  COURT-BOUILLONS  AND  THE  MARINADES  
  -64 
\J/:  ELEMENTARY  PREPARATIONS  
  70 
THE  VARIOUS  GARNISHES   FOR  SOUPS  
  87 
GARNISHING  PREPARATIONS   FOR  RELEVis  AND   ENTR]£eS  
  92 
U^DING  CULINARY  OPERATIONS  
  97 
HORS-D'CEUVRES      
137 
EGGS  
        164 
SOUPS  
      197 
FISH  
        260 
RELEVilS  AND  ENTRIES  OF  BUTCHER'S  MEAT  
       352 
RELEVES  AND  ENTRIES  OF  POULTRY  AND  GAME    
       473 
ROASTS  AND  SALADS         
       605 
VEGETABLES  AND   FARINACEOUS  PRODUCTS  
       624 
SAVORIES       
       678 
ENTREMETS
       687 
ICES   AND  SHERBETS  
  788 
DRINKS   AND   REFRESHMENTS     
816 
FRUIT-STEWS  AND  JAMS  ,
       820 
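The parsing above relies on line position (every other line after the first few). A possible alternative is to match the dot leaders directly, so "CHAPTER" and "PAGE" lines are skipped without counting. This is only a sketch: `parse_toc_line` is a hypothetical helper, not part of the notebook, and it cannot repair a page number whose digits the OCR dropped (the `'5` entry would still come out as 5).

```python
import re

# title, then at least one dot leader, then whatever trails (page number plus OCR noise)
TOC_LINE = re.compile(r"^(?P<title>.+?)\s*\.[\s.]*(?P<tail>.*)$")

def parse_toc_line(line):
    """Return (title, page) for a contents line, or None if it has no dot leaders.

    page is None when the trailing text holds no digits (e.g. 'I' misread for 1),
    which flags the entry for manual review.
    """
    match = TOC_LINE.match(line)
    if not match:
        return None
    title = " ".join(match.group("title").split())   # collapse the doubled spaces
    digits = re.sub(r"\D", "", match.group("tail"))   # keep only the digits
    return title, (int(digits) if digits else None)

for raw in [
    "CHAPTER  VI",
    "THE  COURT-BOUILLONS  AND  THE  MARINADES  .  .  .  -64",
    "FONDS  DE  CUISINE  ........  I",
]:
    print(parse_toc_line(raw))
```

Entries that come back with `None` as the page still need a manual fix.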


These chapters have OCR errors in the title or the page number:
- FONDS  DE  CUISINE  
- THE  LEADING  WARM   SAUCES     
- THE  COURT-BOUILLONS  AND  THE  MARINADES  
- \J/:  ELEMENTARY  PREPARATIONS  
- GARNISHING  PREPARATIONS   FOR  RELEVis  AND   ENTR]£eS  
- U^DING  CULINARY  OPERATIONS  
- HORS-D'CEUVRES      
- RELEVilS  AND  ENTRIES  OF  BUTCHER'S  MEAT  
- RELEVES  AND  ENTRIES  OF  POULTRY  AND  GAME    

In [29]:
chapters['FONDS  DE  CUISINE  '] = 1
chapters['THE  LEADING  WARM   SAUCES     '] = 15
chapters['THE  COURT-BOUILLONS  AND  THE  MARINADES  '] = 64
correct_titles = [
    'THE  ELEMENTARY  PREPARATIONS  ',
    'GARNISHING  PREPARATIONS   FOR  RELEVES  AND   ENTREES  ',
    'LEADING  CULINARY  OPERATIONS  ',
    "HORS-D'OEUVRES      ",
    "RELEVES  AND  ENTREES  OF  BUTCHER'S  MEAT  ",
    "RELEVES  AND  ENTREES  OF  POULTRY  AND  GAME    "  
]
incorrect_titles = [
    '\\J/:  ELEMENTARY  PREPARATIONS  ',  # escaped backslash: '\J' is an invalid escape sequence
    'GARNISHING  PREPARATIONS   FOR  RELEVis  AND   ENTR]£eS  ',
    'U^DING  CULINARY  OPERATIONS  ',
    "HORS-D'CEUVRES      ",
    "RELEVilS  AND  ENTRIES  OF  BUTCHER'S  MEAT  ",
    "RELEVES  AND  ENTRIES  OF  POULTRY  AND  GAME    " 
]
for i in range(len(correct_titles)):
    chapters[correct_titles[i]] = chapters[incorrect_titles[i]]
    del chapters[incorrect_titles[i]]

I can disregard the first few lines of each contents page. There are some minor issues with the text (e.g. "U^DING" instead of "LEADING"). Looking at the PDF, it seems the original pages of the book had marks on them that confused the OCR. Since there are only a few mistakes, I'll fix them manually to keep moving. If this becomes too much of a hindrance, I'll find a better method.
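If the manual fixes stop scaling, one option would be to snap each OCR'd title to the closest entry in a hand-typed list of correct titles. The sketch below uses the standard-library `difflib` as a stand-in so it runs without extra dependencies (`rapidfuzz.process` would be the natural choice in this notebook). `KNOWN_TITLES` and `correct_title` are hypothetical names, and the list is only a subset typed in for illustration.

```python
import difflib

# Assumption: reference titles transcribed by hand from the printed contents page
KNOWN_TITLES = [
    "FONDS DE CUISINE",
    "THE ELEMENTARY PREPARATIONS",
    "GARNISHING PREPARATIONS FOR RELEVES AND ENTREES",
    "LEADING CULINARY OPERATIONS",
    "HORS-D'OEUVRES",
    "RELEVES AND ENTREES OF BUTCHER'S MEAT",
    "RELEVES AND ENTREES OF POULTRY AND GAME",
]

def correct_title(ocr_title, known_titles=KNOWN_TITLES, cutoff=0.6):
    """Return the closest known title, or the whitespace-cleaned input if none is close."""
    cleaned = " ".join(ocr_title.split())  # collapse doubled spaces, trim the ends
    matches = difflib.get_close_matches(cleaned, known_titles, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned

print(correct_title("U^DING  CULINARY  OPERATIONS  "))
print(correct_title("RELEVilS  AND  ENTRIES  OF  BUTCHER'S  MEAT  "))
```

Titles with no close match are returned unchanged (apart from whitespace), so chapters missing from the reference list are left alone.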

Confirm the fixed titles and update the page numbers accordingly.

In [30]:
for chapter in chapters:
    chapters[chapter] = int(chapters[chapter])+26  # 27-1 for zero indexing
    print(chapter)
    print(chapters[chapter])

FONDS  DE  CUISINE  
27
THE  LEADING  WARM   SAUCES     
41
THE   SMALL  COMPOUND   SAUCES  
50
COLD  SAUCES  AND  COMPOUND  BUTTERS        
74
SAVOURY  JELLIES  OR  ASPICS  
85
THE  COURT-BOUILLONS  AND  THE  MARINADES  
90
THE  VARIOUS  GARNISHES   FOR  SOUPS  
113
EGGS  
190
SOUPS  
223
FISH  
286
ROASTS  AND  SALADS         
631
VEGETABLES  AND   FARINACEOUS  PRODUCTS  
650
SAVORIES       
704
ENTREMETS
713
ICES   AND  SHERBETS  
814
DRINKS   AND   REFRESHMENTS     
842
FRUIT-STEWS  AND  JAMS  ,
846
THE  ELEMENTARY  PREPARATIONS  
96
GARNISHING  PREPARATIONS   FOR  RELEVES  AND   ENTREES  
118
LEADING  CULINARY  OPERATIONS  
123
HORS-D'OEUVRES      
163
RELEVES  AND  ENTREES  OF  BUTCHER'S  MEAT  
378
RELEVES  AND  ENTREES  OF  POULTRY  AND  GAME    
499
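The offset is worth double-checking against the page examined earlier: `doc[201]` printed book page 176, which would imply an offset of 25 rather than 26 if the pagination is continuous. A minimal sketch of the mapping, assuming printed page 1 really sits on PDF page 27 (1-indexed). `book_page_to_index` is a hypothetical helper, not part of the cells above.

```python
def book_page_to_index(book_page, first_page_pdf_number=27):
    """Map a printed page number to a zero-based PDF page index.

    Assumes printed page 1 is on PDF page `first_page_pdf_number` (1-indexed)
    and that no unnumbered pages are inserted later in the scan.
    """
    first_index = first_page_pdf_number - 1   # zero-based index of printed page 1
    return first_index + (book_page - 1)      # printed page 1 adds no extra offset

print(book_page_to_index(1))
print(book_page_to_index(176))
```

If the two checks agree with the PDF, the chapter start indices above would each be one page too late, so the first page of every chapter is worth inspecting before extracting.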


Next I'll extract the recipes from Chapter 1, Fonds de Cuisine. Once the method is refined, I'll iterate over every chapter and extract them all. I'll also write each extracted chapter to its own text file so I can pick this up later without re-running all of these cells.

In [None]:
page_nums = list(chapters.values())
chapter_range = [page_nums[0], page_nums[1]]
pages = []

for i in range(chapter_range[0], chapter_range[1]):
    page = doc[i]
    pages.append(page)
    text = page.get_text()
    print(text)
    

By briefly examining the single chapter output we see the following pattern for most recipes:
- Title: "3- CHICKEN CONSOMME"
- optionally: "Quantities" followed by the final amount and a list of ingredients
- Instructions: "Mode of Procedure" or "Preparation"
- Remarks: "Remarks"

Occasionally a book title, page number, or chapter title appears on its own line. I'll remove these lines from the extracted text before extracting recipes with the patterns noted above.

Lines for removal will contain:
- "GUIDE  TO  MODERN  COOKERY"
- Chapter title (e.g. "FONDS DE CUISINE")

These lines contain occasional OCR typos, so I'll use fuzzy matching instead of exact comparison to find them.
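One detail that may matter here: the extracted text uses doubled spaces ("GUIDE  TO  MODERN  COOKERY"), so collapsing whitespace before comparing keeps the scores from being dragged down. The sketch below shows the idea with the standard-library `difflib` as a stand-in for `rapidfuzz` (its ratio is on a 0-1 scale and is computed over the whole line, unlike `fuzz.partial_ratio`, so thresholds are not interchangeable). `normalize` and `is_running_header` are hypothetical helpers.

```python
import difflib

def normalize(line):
    """Upper-case the line and collapse runs of whitespace to single spaces."""
    return " ".join(line.upper().split())

def is_running_header(line, phrases, threshold=0.8):
    """True if the whole line is close to one of the header phrases."""
    cleaned = normalize(line)
    if not cleaned:
        return False
    return any(
        difflib.SequenceMatcher(None, cleaned, normalize(phrase)).ratio() >= threshold
        for phrase in phrases
    )

phrases = ["GUIDE TO MODERN COOKERY", "FONDS DE CUISINE"]
lines = [
    "176  GUIDE  TO  MODERN  COOKERY ",
    "Garnish  the  bottom  and  sides  of  the  cocottes  with  a  coating ",
]
kept = [line for line in lines if not is_running_header(line, phrases)]
print(kept)
```

Comparing against the whole line means a long instruction line that merely mentions a phrase is less likely to be dropped by accident.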

In [None]:
# testing line cleaning to an output file for easier viewing
page_nums = list(chapters.values())
titles = list(chapters.keys())
chapter_range = [page_nums[0], page_nums[1]]
os.makedirs("extracted chapters", exist_ok=True)  # open() fails if the folder is missing
chapter_out = open("extracted chapters/chapter1.txt",'wb')
pages = []

target_phrases = ["GUIDE TO MODERN COOKERY", titles[0]]

for i in range(chapter_range[0], chapter_range[1]):
    page = doc[i]
    pages.append(page)
    text = page.get_text()
    lines = text.splitlines()
    filtered_lines = [
        line for line in lines
        if not any(fuzz.partial_ratio(line.strip(), phrase) > 80 for phrase in target_phrases)
    ]
    chapter_out.write("\n".join(filtered_lines).encode("utf8") + b"\n")

chapter_out.close()

Now that most of the inconsequential lines have been removed, the recipes need to be separated and parsed for relevant information.
Following the earlier identification of the recipe structure, a possible format for a JSONified recipe could be:
{
    "recipe_id": 1,
    "title": "Ordinary or White Consomme",
    "quantities": {
        "final_amount": "4 quarts",
        "ingredients": [
            "3 lbs. of shin of beef",
            "3 lbs. of lean beef",
            ...
        ]
    },
    "instructions": "Preparation - Put the meat into a stock-pot...",
    "remarks": "Remarks Relative to..."
}

Unfortunately, after reading further and skimming through the book, I found that many chapters have only a title and instructions, with the ingredients spread throughout the text. The recipe format will instead have just "recipe_id", "title", and "instructions".
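To make the reduced format concrete, here is a small sketch that splits a shortened excerpt of the page 176 text into records. It assumes recipe numbers are digits only followed by an em dash, which is narrower than the `[A-Za-z0-9]{1,4}` identifier pattern used in the extraction cell. `split_recipes` is a hypothetical helper, and the sample instructions are abbreviated.

```python
import json
import re

sample = (
    "448— EQQS  EN  COCOTTE  A  LA  SOUBISE \n"
    "Garnish  the  bottom  and  sides  of  the  cocottes  with  a  coating \n"
    "of  thick  Soubise  purde. \n"
    "449— MOULDED  EGGS \n"
    "These  form  a  very  ornamental  dish. \n"
)

# a recipe header: 1-4 digits, an em dash, then the title up to the end of the line
HEADER = re.compile(r"^(?P<id>\d{1,4})—\s*(?P<title>.+)$", re.MULTILINE)

def split_recipes(text):
    """Split text into records; each body runs until the next header."""
    headers = list(HEADER.finditer(text))
    recipes = []
    for k, match in enumerate(headers):
        end = headers[k + 1].start() if k + 1 < len(headers) else len(text)
        body = text[match.end():end]
        recipes.append({
            "recipe_id": match.group("id"),
            "title": " ".join(match.group("title").split()),
            "instructions": " ".join(body.split()),  # joins lines with single spaces
        })
    return recipes

sample_recipes = split_recipes(sample)
print(json.dumps(sample_recipes, indent=4, ensure_ascii=False))
```

Joining with `" ".join(body.split())` puts a space at every line break, so words on adjacent lines cannot run together even if the extracted lines carry no trailing space.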

I'll now combine the above cleaning step and the recipe extraction. Later, I may revisit the recipe extraction to add Named Entity Recognition to pull out various ingredients or methods to use as metadata.

In [33]:
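# sort chapters by start page: the renamed titles were re-inserted at the end of the dict
sorted_chapters = sorted(chapters.items(), key=lambda item: item[1])
titles = [title for title, _ in sorted_chapters]
page_nums = [page for _, page in sorted_chapters]
all_recipes = {}

n = 0
# each chapter runs from its own start page up to the next chapter's start page
# (the final chapter has no following start page here, so it is not processed)
for i in range(len(page_nums) - 1):
    chapter_range = [page_nums[i], page_nums[i+1]]
    pages = []
    recipes = []

    # filtering out the title and chapter title lines
    target_phrases = ["GUIDE TO MODERN COOKERY", titles[n]]
    n += 1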
    for j in range(chapter_range[0], chapter_range[1]):
        page = doc[j]
        text = page.get_text()
        lines = text.splitlines()
        filtered_lines = [
            line for line in lines
            if not any(fuzz.partial_ratio(line.strip(), phrase) > 80 for phrase in target_phrases)
        ]
        pages.append("\n".join(filtered_lines))
    chapter_text = "\n".join(pages)

    # extracting recipes
    # extract number and title
    recipe_pattern = re.compile(
        r'(?P<header>[A-Za-z0-9]{1,4}—\s+.+?)(?=\n[A-Za-z0-9]{1,4}—|\Z)',
    re.DOTALL)  # 1 to 4 character identifiers,the em dash, spaces, look ahead to stop capturing when another identifier is found

    for block in recipe_pattern.finditer(chapter_text):
        block_text = block.group(0) # extract whole match
        current_recipe = {}

        # extract title and id
        header_pattern = re.compile(
            r'^(?P<id>[A-Za-z0-9]{1,4})—(?P<title>.*)$',
            re.MULTILINE
        )
        header_match = header_pattern.search(block_text)
        if header_match:
            current_recipe["recipe_id"] = header_match.group("id")
            current_title = header_match.group("title").strip()
            # extract everything until the next recipe match and add to current_recipe as "instructions"
            current_instructions = block_text[header_match.end():].strip().replace("\n","")
            
            # remove the extra spaces and swap '^' with 'e' 
            current_title = current_title.replace('^', 'e')
            current_instructions = current_instructions.replace('^', 'e')
            current_title = re.sub(r'\s+', ' ', current_title).strip()
            current_instructions = re.sub(r'\s+', ' ', current_instructions).strip()

            current_recipe['title'] = current_title
            current_recipe['instructions'] = current_instructions
        else:
            # if no header is found, just store everything as instructions
            current_instructions = block_text.strip().replace("\n", "")
            current_instructions = current_instructions.replace('^', 'e')
            current_instructions = re.sub(r'\s+', ' ', current_instructions).strip()
            current_recipe["instructions"] = current_instructions

        # add to list of recipes for this chapter
        recipes.append(current_recipe)
    
    # add list of this chapter's recipes to the dictionary of recipes by chapter
    all_recipes[titles[n-1]] = recipes 

Save each chapter of JSONified recipes to a file for future use.

In [34]:
output_dir = "json_chapters"
os.makedirs(output_dir, exist_ok=True)

for chapter, chapter_recipes in all_recipes.items():
    filename = f"{chapter}.json"
    filepath = os.path.join(output_dir, filename)

    with open(filepath, "w", encoding='utf-8') as f:
        # dump this chapter's recipes, not the leftover `recipes` list from the last loop iteration
        json.dump(chapter_recipes, f, indent=4, ensure_ascii=False)
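To pick the work up later without re-running the extraction, the saved files can be read back into a dictionary keyed by chapter. A minimal sketch, assuming the `json_chapters` folder written above and one list of recipes per file. `load_chapters` is a hypothetical helper.

```python
import json
import os

def load_chapters(input_dir="json_chapters"):
    """Read every .json file in input_dir into {chapter name: list of recipes}."""
    loaded = {}
    for filename in sorted(os.listdir(input_dir)):
        if filename.endswith(".json"):
            chapter = filename[:-len(".json")]  # file name without the extension
            with open(os.path.join(input_dir, filename), encoding="utf-8") as f:
                loaded[chapter] = json.load(f)
    return loaded

# only attempt the reload if the folder from the previous cell exists
if os.path.isdir("json_chapters"):
    reloaded = load_chapters()
    print({chapter: len(items) for chapter, items in reloaded.items()})
```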

    

### Next Steps:
1. Preprocess and clean the extracted recipes further
    - Optionally: 
        - May want to perform Named Entity Recognition (NER) to pull out ingredients or other specific techniques.
        - Chunk recipe instructions when they get too long
2. Compute embeddings
    - Generate embeddings for each text chunk and link it with its associated metadata.
    - Ollama: https://ollama.com/blog/embedding-models 
    - Relevant Reddit Thread: https://www.reddit.com/r/LocalLLaMA/comments/17oyd1r/finding_better_embedding_models/
    - Embedding Leaderboard: https://huggingface.co/spaces/mteb/leaderboard
    - Will likely use `bge-large-en-v1.5`, `e5-large-v2`, or `mxbai-embed-large-v1`; e5 currently seems best for my purpose
3. Index with a vector DB
    - FAISS, Pinecone, Weaviate, or Milvus
    - insert each embedding with its metadata
    - configure similarity parameters
    - Ollama tutorials use ChromaDB
4. Integrate with the LLM
    - Embed the query with the embed model from earlier, then perform similarity search in the vector db. Use whatever is retrieved as context for the LLM.
5. Prototype, Test, and Iterate
    - Experiment with the various parameters
6. Make a Gradio Frontend
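For the optional chunking step, one simple approach is a sliding window over words with some overlap, so each chunk keeps a little context from the previous one. This is only a sketch: `chunk_words` is a hypothetical helper, and the window and overlap sizes are placeholder values that would need tuning against the chosen embedding model's input limit.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into chunks of at most max_words words, overlapping by `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks

instructions = " ".join(f"w{k}" for k in range(10))
for chunk in chunk_words(instructions, max_words=4, overlap=1):
    print(chunk)
```

Each chunk would then carry the recipe's `recipe_id` and `title` as metadata when it is embedded.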