# HOL4 to LEAN Translation using Gemini API

This notebook translates HOL4 theorem statements to LEAN using Google's Gemini API.

## Setup and Imports

In [16]:
# Install required packages
!pip install google-generativeai



In [17]:
import json
import google.generativeai as genai
import time
from typing import List, Dict
import os

## Configuration

Set your Gemini API key here. You can get one from: https://makersuite.google.com/app/apikey
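
Because the next cell reads the key from the `GEMINI_API_KEY` environment variable, an unset variable silently passes `None` to `genai.configure`. A minimal sketch of a guard that fails early instead is shown here; `require_api_key` is a hypothetical helper name, not part of the Gemini SDK.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None,
                    var_name: str = "GEMINI_API_KEY") -> str:
    """Return the API key found in `env`, or raise a descriptive error if it is unset."""
    # Fall back to the real process environment when no mapping is supplied.
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the notebook."
        )
    return key
```

With this helper, the configuration cell could call `genai.configure(api_key=require_api_key())`.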

In [18]:
# Set your API key
API_KEY = os.getenv("GEMINI_API_KEY")

genai.configure(api_key=API_KEY)

## File Paths Configuration

In [20]:
# Input and output file paths - processing 4 files in order
INPUT_FILES = [
    "extracted/locationScript.json",
    "extracted/namespaceScript.json",
    "extracted/astScript.json", 
    "extracted/namespacePropsScript.json"
]

OUTPUT_DIR = "extracted"

In [38]:
# LEAN project configuration
LEAN_PROJECT_DIR = "extracted/CML_Lean"
LEAN_SOURCE_DIR = f"{LEAN_PROJECT_DIR}/CML_Lean"

In [39]:
## Setup LEAN Package Structure

def create_lean_package_structure():
    """Create the LEAN package directory structure and configuration files."""
    import os
    
    # Create directories
    os.makedirs(LEAN_SOURCE_DIR, exist_ok=True)
    
    # Create lakefile.lean
    lakefile_content = '''import Lake
open Lake DSL

package «CML_Lean» where
  -- add package configuration options here

lean_lib «CML_Lean» where
  -- add library configuration options here

@[default_target]
lean_exe «cml_lean» where
  root := `Main
'''
    
    with open(f"{LEAN_PROJECT_DIR}/lakefile.lean", 'w', encoding='utf-8') as f:
        f.write(lakefile_content)
    
    # Create lean-toolchain file
    toolchain_content = 'leanprover/lean4:stable'
    with open(f"{LEAN_PROJECT_DIR}/lean-toolchain", 'w', encoding='utf-8') as f:
        f.write(toolchain_content)
    
    # Create Main.lean
    main_content = '''import «CML_Lean»

def main : IO Unit :=
  IO.println "CakeML Translated Theories"
'''
    with open(f"{LEAN_PROJECT_DIR}/Main.lean", 'w', encoding='utf-8') as f:
        f.write(main_content)
    
    print(f"Created LEAN package structure at: {LEAN_PROJECT_DIR}/")
    print(f"  - lakefile.lean")
    print(f"  - lean-toolchain")
    print(f"  - Main.lean")
    print(f"  - {LEAN_SOURCE_DIR}/ (source directory)")

# Create the package structure
create_lean_package_structure()

Created LEAN package structure at: extracted/CML_Lean/
  - lakefile.lean
  - lean-toolchain
  - Main.lean
  - extracted/CML_Lean/CML_Lean/ (source directory)


## Important: Building the LEAN Package

After generating the LEAN files, you need to:

1. **Open the package folder in a NEW VS Code window**: 
   - Open `extracted/CML_Lean/` as the workspace root (not the current directory)
   - This tells LEAN's VS Code extension where to find the package

2. **Build the package** (run in terminal inside `extracted/CML_Lean/`):
   ```bash
   lake build
   ```

3. **Alternative**: If you want to keep working from the current workspace, run `lake build` with `extracted/CML_Lean/` as the working directory, which is what the build cell in this notebook does via `cwd=LEAN_PROJECT_DIR`.

The error "unknown module prefix 'CML_Lean'" occurs because:
- LEAN looks for packages relative to the current workspace root
- The `lakefile.lean` defines the package, but LEAN needs to know where it is
- Opening the package folder directly or building with `lake` resolves this
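
A quick way to confirm that a folder is the package root is to check for the two files this notebook writes there. The sketch below assumes the layout created by `create_lean_package_structure()` (`lakefile.lean` plus `lean-toolchain`); `missing_package_files` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def missing_package_files(project_dir: str) -> List[str]:
    """List the files this notebook expects at the package root but cannot find."""
    expected = ["lakefile.lean", "lean-toolchain"]
    root = Path(project_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty result for `missing_package_files("extracted/CML_Lean")` suggests the folder is suitable as the workspace root; a non-empty result names what is missing.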

In [40]:
## Create CML_Lean root module file

def create_cml_lean_root_module():
    """Create the root CML_Lean.lean file that imports all theory modules."""
    import os
    
    # Get list of all .lean files (theories) in the source directory
    theories = []
    if os.path.exists(LEAN_SOURCE_DIR):
        for filename in sorted(os.listdir(LEAN_SOURCE_DIR)):
            if filename.endswith('.lean'):
                theory_name = filename.replace('.lean', '')
                theories.append(theory_name)
    
    # Create CML_Lean.lean root file
    root_content = f'''-- Root module for CML_Lean package
-- This file imports all translated HOL4 theories

'''
    
    for theory in theories:
        root_content += f'import CML_Lean.{theory}\n'
    
    with open(f"{LEAN_SOURCE_DIR}.lean", 'w', encoding='utf-8') as f:
        f.write(root_content)
    
    print(f"Created root module file: {LEAN_SOURCE_DIR}.lean")
    print(f"Imported {len(theories)} theory modules:")
    for theory in theories:
        print(f"  - {theory}")

# Create the root module
create_cml_lean_root_module()

Created root module file: extracted/CML_Lean/CML_Lean.lean
Imported 4 theory modules:
  - ast
  - location
  - namespace
  - namespaceProps


## Build the LEAN Package

**Run the cell below to automatically build the package!**

No manual steps needed - the notebook will:
1. Run `lake build` in the package directory
2. Show you the build output
3. Tell you if it succeeded or failed

(Alternatively, you can run `lake build` manually in a terminal if you prefer)

In [41]:
# Build the LEAN package
import subprocess
import os

print("Building LEAN package...")
print("="*80)

# Change to the package directory and run lake build
try:
    result = subprocess.run(
        ["lake", "build"],
        cwd=LEAN_PROJECT_DIR,
        capture_output=True,
        text=True,
        timeout=300  # 5 minute timeout
    )
    
    print("STDOUT:")
    print(result.stdout)
    
    if result.stderr:
        print("\nSTDERR:")
        print(result.stderr)
    
    if result.returncode == 0:
        print("\n✓ Package built successfully!")
        print("\nTo work with these LEAN files:")
        print(f"1. Open '{LEAN_PROJECT_DIR}' folder in VS Code")
        print("2. The LEAN extension will recognize the package")
        print("3. Imports like 'import CML_Lean.namespace' will work")
    else:
        print(f"\n✗ Build failed with exit code {result.returncode}")
        
except FileNotFoundError:
    print("ERROR: 'lake' command not found!")
    print("\nYou need to install LEAN 4 first:")
    print("  Visit: https://lean-lang.org/lean4/doc/setup.html")
    print("\nOr run manually in terminal:")
    print(f"  cd {LEAN_PROJECT_DIR}")
    print("  lake build")
    
except subprocess.TimeoutExpired:
    print("ERROR: Build timed out after 5 minutes")
    
except Exception as e:
    print(f"ERROR: {str(e)}")

Building LEAN package...
STDOUT:
✖ [2/14] Building CML_Lean.namespace (927ms)
trace: .> LEAN_PATH=E:\NUS\mcomp\Dissertation\CakeML_data_extraction\extracted\CML_Lean\.lake\build\lib\lean c:\Users\my-pc\.elan\toolchains\leanprover--lean4---v4.25.2\bin\lean.exe E:\NUS\mcomp\Dissertation\CakeML_data_extraction\extracted\CML_Lean\CML_Lean\namespace.lean -o E:\NUS\mcomp\Dissertation\CakeML_data_extraction\extracted\CML_Lean\.lake\build\lib\lean\CML_Lean\namespace.olean -i E:\NUS\mcomp\Dissertation\CakeML_data_extraction\extracted\CML_Lean\.lake\build\lib\lean\CML_Lean\namespace.ilean -c E:\NUS\mcomp\Dissertation\CakeML_data_extraction\extracted\CML_Lean\.lake\build\ir\CML_Lean\namespace.c --setup E:\NUS\mcomp\Dissertation\CakeML_data_extraction\extracted\CML_Lean\.lake\build\ir\CML_Lean\namespace.setup.json --json
error: CML_Lean/namespace.lean:54:0: (kernel) arg #5 of 'CML_Lean.namespace.cml_namespace.mk' contains a non valid occurrence of the datatypes being declared
error: CML_Lean/names
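
The error lines in the build output above appear to follow the shape `error: <file>:<line>:<col>: <message>`. A small sketch that collects such lines from `result.stdout` is given here; the pattern is inferred from the one complete error line shown, not from a documented Lake output format, and `parse_build_errors` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Inferred shape: "error: <file>:<line>:<col>: <message>"
ERROR_LINE = re.compile(
    r"^error: (?P<file>.+?):(?P<line>\d+):(?P<col>\d+): (?P<message>.*)$"
)


def parse_build_errors(output: str) -> List[Dict[str, object]]:
    """Extract file, line, column, and message from each complete error line."""
    errors = []
    for raw in output.splitlines():
        match = ERROR_LINE.match(raw.strip())
        if match:  # truncated or differently shaped lines are skipped
            errors.append({
                "file": match["file"],
                "line": int(match["line"]),
                "col": int(match["col"]),
                "message": match["message"],
            })
    return errors
```

Grouping the parsed entries by `file` gives a per-theory list of failures that could be fed back to the model in a later repair pass.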

## Initialize Gemini Model

In [25]:
model = genai.GenerativeModel('gemini-2.5-pro')

# Initialize chat for maintaining conversation context across files
chat = model.start_chat(history=[])

## Translation Functions

### Translation Strategy

This notebook uses a **sequential file translation approach with conversation context**:

1. **Ordered Processing**: Files are processed in dependency order:
   - locationScript.json (HOL4 base script)
   - namespaceScript.json (base definitions)
   - astScript.json (may depend on namespace)
   - namespacePropsScript.json (theorems about namespace)

2. **Conversation Context**: Uses Gemini's chat API to maintain context across files, so later translations can reference earlier ones.

3. **Dependency Awareness**: The LLM remembers previous translations, ensuring consistent naming and type usage across files.

4. **Type Consistency**: When translating theorems that reference datatypes or definitions from previous files, the LLM can recall how those were translated.
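
The processing order in this notebook is written by hand in `INPUT_FILES`. As a sketch, the same order could be derived from each theory's `ancestors` field with a topological sort. This assumes Python 3.9+ for `graphlib`; `dependency_order` is a hypothetical helper, and the ancestor list given for `namespaceProps` in the usage note is an assumption, not data from the extracted files.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List


def dependency_order(ancestors: Dict[str, List[str]]) -> List[str]:
    """Order theories so each one comes after those of its ancestors that are also in the set."""
    known = set(ancestors)
    # Drop ancestors that are not being translated (e.g. HOL4 library theories).
    graph = {
        theory: [a for a in deps if a in known]
        for theory, deps in ancestors.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

For example, `dependency_order({"ast": ["namespace", "location"], "namespace": ["alist"], "location": [], "namespaceProps": ["namespace"]})` places `namespace` and `location` before `ast`, and `namespace` before `namespaceProps`; the relative order of unrelated theories is not guaranteed.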

In [26]:
def translate_file_with_context(data: List[Dict], file_name: str, is_first: bool = False) -> List[Dict]:
    """
    Translate HOL4 statements to LEAN using chat context.
    
    Args:
        data: List of dictionaries with 'kind', 'name', 'statement', 'theory', 'ancestors' fields
        file_name: Name of the file being translated
        is_first: Whether this is the first file (sets up initial context)
    
    Returns:
        List of translated items with LEAN statements
    """
    
    # Get theory and ancestors info
    theory = data[0]['theory'] if data else 'unknown'
    ancestors = data[0]['ancestors'] if data else []
    
    # Build the prompt with all statements
    if is_first:
        prompt = f"""You are an expert in formal theorem proving systems. I will be translating multiple HOL4 theory files to LEAN 4 syntax. Please maintain context across our conversation as later files may reference definitions from earlier ones.

Starting with file: {file_name}
Theory: {theory}
Ancestors: {ancestors}

Translate ALL of the following HOL4 statements to LEAN 4 syntax.

Instructions:
- Use LEAN 4 syntax (not LEAN 3)
- Preserve the logical structure and meaning
- Use appropriate LEAN type annotations
- Handle option types (SOME/NONE in HOL4 → some/none in LEAN)
- Convert HOL4 list notation to LEAN list notation
- Use LEAN's unicode symbols where appropriate (e.g., ∀, ∃, →, ∧, ∨)
- Later statements may reference the previous ones.
- This theory's ancestors are: {ancestors}. If any of these ancestors have been translated in our conversation, you may reference their definitions, types, and functions in your translation.

CRITICAL - Reserved Keywords:
- LEAN 4 has reserved keywords. If a HOL4 variable name matches a LEAN reserved keyword, prefix it with "cml_"
- Examples: 'id' becomes 'cml_id', 'namespace' becomes 'cml_namespace', etc.

CRITICAL - Function Usage:
- ONLY use functions that are predefined in LEAN 4 standard library or Mathlib
- Common safe functions: List.map, List.filter, List.length, Option.map, Nat.add, etc.
- DO NOT assume functions exist - verify they are in LEAN 4 stdlib

Format your response as a JSON array where each element has:
{{
  "name": "original_name",
  "statement": "translated LEAN 4 statement"
}}

Here are the HOL4 statements to translate:

"""
    else:
        prompt = f"""Now translating the next file: {file_name}
Theory: {theory}
Ancestors: {ancestors}

IMPORTANT - Ancestor Dependencies:
This theory depends on the following ancestors: {ancestors}
- Some of these ancestors may have been translated in our previous conversation
- When translating this file, you SHOULD reference types, datatypes, definitions, and functions from those ancestor theories
- Use the SAME names and type signatures that you used when translating the ancestor files
- Maintain consistency with all previous translations

This file may also reference types and definitions from other previously translated files. Please use the SAME translated names and types from our earlier conversation.

Translate ALL of the following HOL4 statements to LEAN 4 syntax, maintaining consistency with previous translations and utilizing definitions from ancestor theories:

"""
    
    # Add all statements to the prompt
    for i, item in enumerate(data, 1):
        prompt += f"\n{i}. {item['kind']}: {item['name']}\n"
        prompt += f"   HOL4 Statement:\n   {item['statement']}\n"
    
    prompt += "\n\nPlease provide the translations as a JSON array. Include ONLY the JSON array in your response, no additional text or markdown."
    
    try:
        print(f"Sending {len(data)} statements from {file_name} to LLM for translation...")
        response = chat.send_message(prompt)
        response_text = response.text.strip()
        
        # Clean up markdown formatting if present
        if response_text.startswith("```json"):
            response_text = response_text.replace("```json", "").replace("```", "").strip()
        elif response_text.startswith("```"):
            lines = response_text.split("\n")
            response_text = "\n".join(lines[1:-1]).strip()
        
        # Parse the JSON response
        translated_items = json.loads(response_text)
        
        # Match translations back to original items
        name_to_translation = {item['name']: item['statement'] for item in translated_items}
        
        result = []
        for item in data:
            lean_statement = name_to_translation.get(item['name'], f"[Translation not found for {item['name']}]")
            translated_item = {
                "kind": item['kind'],
                "name": item['name'],
                "statement": lean_statement,
                "original_hol4": item['statement'],
                "theory": item.get('theory'),
                "ancestors": item.get('ancestors', [])
            }
            result.append(translated_item)
        
        return result
        
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON response: {str(e)}")
        print(f"Response text: {response_text[:500]}...")
        raise
    except Exception as e:
        print(f"Error during translation: {str(e)}")
        raise
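
The fence clean-up in the function above only triggers when the response starts with a code fence, and it removes every triple backtick, including any that occur inside a translated statement. One possible alternative, sketched here under the assumption that the response contains exactly one top-level JSON array, is to slice from the first `[` to the last `]`; `extract_json_array` is a hypothetical helper.

```python
import json
from typing import Any, List


def extract_json_array(response_text: str) -> List[Any]:
    """Parse the outermost JSON array in `response_text`, ignoring surrounding fences or prose."""
    start = response_text.find("[")
    end = response_text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in response")
    # json.loads raises json.JSONDecodeError (a ValueError) if the slice is malformed.
    return json.loads(response_text[start:end + 1])
```

The sketch still fails if the model writes square brackets in prose before or after the array, so the existing `json.JSONDecodeError` handler remains useful.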

In [27]:
def translate_multiple_files(input_files: List[str], output_dir: str) -> Dict[str, List[Dict]]:
    """
    Translate multiple JSON files in order, maintaining conversation context.
    
    Args:
        input_files: List of input JSON file paths in dependency order
        output_dir: Directory to save output files
    
    Returns:
        Mapping from each input file name to its list of translated items
    """
    
    all_results = {}
    
    for i, input_path in enumerate(input_files):
        is_first = (i == 0)
        file_name = input_path.split('/')[-1]
        
        # Load the input JSON file
        print(f"\n{'='*80}")
        print(f"Processing file {i+1}/{len(input_files)}: {file_name}")
        print(f"{'='*80}")
        
        with open(input_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        print(f"Loaded {len(data)} items")
        print(f"  Types: {sum(1 for x in data if x['kind'] == 'Type')}")
        print(f"  Datatypes: {sum(1 for x in data if x['kind'] == 'Datatype')}")
        print(f"  Definitions: {sum(1 for x in data if x['kind'] == 'Definition')}")
        print(f"  Theorems: {sum(1 for x in data if x['kind'] == 'Theorem')}")
        
        # Translate with context
        translated_data = translate_file_with_context(data, file_name, is_first)
        
        # Save individual output file
        output_path = f"{output_dir}/output_{file_name}"
        print(f"Saving translated data to: {output_path}")
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(translated_data, f, indent=2, ensure_ascii=False)
        
        all_results[file_name] = translated_data
        print(f"✓ Completed {file_name}: {len(translated_data)} items translated")
        
        # Small delay between files
        if i < len(input_files) - 1:
            print("\nWaiting 2 seconds before next file...")
            time.sleep(2)
    
    print(f"\n{'='*80}")
    print("All translations complete!")
    print(f"{'='*80}")
    print(f"Total files processed: {len(input_files)}")
    print(f"Total items translated: {sum(len(results) for results in all_results.values())}")
    
    return all_results
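
The loop above waits a fixed two seconds between files but stops at the first failed request. A generic retry wrapper with exponential backoff is sketched below as one way to tolerate transient failures. It is an assumption that retrying `chat.send_message` is safe for the chat history, so the sketch is kept independent of the Gemini SDK; `call_with_retries` is a hypothetical helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], attempts: int = 3,
                      base_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying on any exception with delays of base_delay * 2**attempt seconds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

A call site might look like `call_with_retries(lambda: translate_file_with_context(data, file_name, is_first))`.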

## Test Translation on First File

Let's test the translation on the first file (locationScript.json):

In [28]:
# Test on first file only
test_file = INPUT_FILES[0]
print(f"Testing translation on: {test_file}")

with open(test_file, 'r', encoding='utf-8') as f:
    test_data = json.load(f)

print(f"Loaded {len(test_data)} items")
print(f"  Types: {sum(1 for x in test_data if x['kind'] == 'Type')}")
print(f"  Datatypes: {sum(1 for x in test_data if x['kind'] == 'Datatype')}")
print(f"  Definitions: {sum(1 for x in test_data if x['kind'] == 'Definition')}")
print(f"  Theorems: {sum(1 for x in test_data if x['kind'] == 'Theorem')}")

# Test translation on a small subset (first 5 items)
test_sample = test_data[:5]
print(f"\nTesting translation on first {len(test_sample)} items...")

# Reset chat for testing
chat = model.start_chat(history=[])
translated_sample = translate_file_with_context(test_sample, test_file.split('/')[-1], is_first=True)

print(f"\nTranslation Results:")
print("="*80)
for item in translated_sample:
    print(f"\n{item['kind']}: {item['name']}")
    print(f"Theory: {item['theory']}, Ancestors: {item['ancestors']}")
    print(f"HOL4: {item['original_hol4'][:100]}..." if len(item['original_hol4']) > 100 else f"HOL4: {item['original_hol4']}")
    print(f"LEAN: {item['statement'][:100]}..." if len(item['statement']) > 100 else f"LEAN: {item['statement']}")
    print("-"*80)

Testing translation on: extracted/locationScript.json
Loaded 28 items
  Types: 0
  Datatypes: 2
  Definitions: 12
  Theorems: 14

Testing translation on first 5 items...
Sending 5 statements from locationScript.json to LLM for translation...

Translation Results:

Datatype: locn
Theory: location, Ancestors: []
HOL4: locn = UNKNOWNpt | EOFpt | POSN num num
LEAN: inductive Locn where
  | unknownpt
  | eofpt
  | posn (row : Nat) (col : Nat)
--------------------------------------------------------------------------------

Definition: locnrow_def
Theory: location, Ancestors: []
HOL4: locnrow (POSN r c) = r
LEAN: def locnrow (l : Locn) : Nat :=
  match l with
  | .posn r _ => r
  | _ => 0
--------------------------------------------------------------------------------

Definition: locn_rowupdate_def
Theory: location, Ancestors: []
HOL4: locn_rowupdate f (POSN r c) = POSN (f r) c
LEAN: def locn_rowupdate (f : Nat → Nat) (l : Locn) : Locn :=
  match l with
  | .posn r c => .posn (f r) ...
----

In [29]:
## Run Full Translation on All Files

# Reset chat to start fresh
chat = model.start_chat(history=[])

print("Starting sequential translation with context...")
results = translate_multiple_files(INPUT_FILES, OUTPUT_DIR)

print("\n" + "="*80)
print("Translation completed successfully!")
print("="*80)

Starting sequential translation with context...

Processing file 1/4: locationScript.json
Loaded 28 items
  Types: 0
  Datatypes: 2
  Definitions: 12
  Theorems: 14
Sending 28 statements from locationScript.json to LLM for translation...
Saving translated data to: extracted/output_locationScript.json
✓ Completed locationScript.json: 28 items translated

Waiting 2 seconds before next file...

Processing file 2/4: namespaceScript.json
Loaded 22 items
  Types: 1
  Datatypes: 2
  Definitions: 19
  Theorems: 0
Sending 22 statements from namespaceScript.json to LLM for translation...
Saving translated data to: extracted/output_namespaceScript.json
✓ C

## Export Statistics

In [30]:
# Generate statistics about the translations
print("Translation Statistics by File:")
print("="*80)

total_items = 0
for input_file in INPUT_FILES:
    file_name = input_file.split('/')[-1]
    output_file = f"{OUTPUT_DIR}/output_{file_name}"
    
    try:
        with open(output_file, 'r', encoding='utf-8') as f:
            translated_data = json.load(f)
        
        print(f"\n{file_name}:")
        print(f"  Total items: {len(translated_data)}")
        print(f"  Types: {sum(1 for x in translated_data if x['kind'] == 'Type')}")
        print(f"  Datatypes: {sum(1 for x in translated_data if x['kind'] == 'Datatype')}")
        print(f"  Definitions: {sum(1 for x in translated_data if x['kind'] == 'Definition')}")
        print(f"  Theorems: {sum(1 for x in translated_data if x['kind'] == 'Theorem')}")
        print(f"  Theory: {translated_data[0]['theory'] if translated_data else 'N/A'}")
        
        total_items += len(translated_data)
    except FileNotFoundError:
        print(f"\n{file_name}: Output file not found")

print(f"\n{'='*80}")
print(f"Grand Total: {total_items} items translated across {len(INPUT_FILES)} files")

Translation Statistics by File:

locationScript.json:
  Total items: 28
  Types: 0
  Datatypes: 2
  Definitions: 12
  Theorems: 14
  Theory: location

namespaceScript.json:
  Total items: 22
  Types: 1
  Datatypes: 2
  Definitions: 19
  Theorems: 0
  Theory: namespace

astScript.json:
  Total items: 30
  Types: 4
  Datatypes: 19
  Definitions: 7
  Theorems: 0
  Theory: ast

namespacePropsScript.json:
  Total items: 100
  Types: 0
  Datatypes: 0
  Definitions: 3
  Theorems: 97
  Theory: namespaceProps

Grand Total: 180 items translated across 4 files
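
The per-kind counts above are computed with four separate `sum(...)` passes, repeated in three cells. A compact alternative using `collections.Counter` is sketched here; `count_kinds` and `KINDS` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List

KINDS = ("Type", "Datatype", "Definition", "Theorem")


def count_kinds(items: List[Dict]) -> Dict[str, int]:
    """Count items per kind in one pass, reporting 0 for kinds that do not occur."""
    counts = Counter(item["kind"] for item in items)
    return {kind: counts.get(kind, 0) for kind in KINDS}
```

Each report line could then be printed from a single loop over `count_kinds(translated_data).items()`.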


In [34]:
## Generate LEAN Files

def create_lean_file_from_json(input_json_path: str, output_lean_path: str, ancestors: List[str] = None) -> None:
    """
    Create a .lean file from the translated JSON data.
    
    Args:
        input_json_path: Path to the translated JSON file
        output_lean_path: Path to the output .lean file
        ancestors: List of ancestor theory names to import
    """
    with open(input_json_path, 'r', encoding='utf-8') as f:
        translated_data = json.load(f)
    
    theory_name = translated_data[0]['theory'] if translated_data else 'Unknown'
    ancestors = ancestors or (translated_data[0].get('ancestors', []) if translated_data else [])
    
    # Get list of all translated theory files in the package
    import os
    available_theories = set()
    if os.path.exists(LEAN_SOURCE_DIR):
        for filename in os.listdir(LEAN_SOURCE_DIR):
            if filename.endswith('.lean'):
                theory = filename.replace('Script.lean', '').replace('.lean', '')
                available_theories.add(theory)
    
    # Filter ancestors to only include those we have translated
    valid_ancestors = [a for a in ancestors if a in available_theories]
    
    with open(output_lean_path, 'w', encoding='utf-8') as f:
        # Write file header
        f.write(f"-- Auto-generated LEAN 4 file from HOL4 translation\n")
        f.write(f"-- Theory: {theory_name}\n")
        f.write(f"-- Generated using Gemini API\n\n")
        
        # Import ancestor files if they exist and have been translated
        if valid_ancestors:
            f.write(f"-- Import ancestor theories\n")
            for ancestor in valid_ancestors:
                # Use CML_Lean module prefix for imports
                f.write(f"import CML_Lean.{ancestor}\n")
            f.write("\n")
        
        # Open namespace for this theory
        f.write(f"namespace CML_Lean.{theory_name}\n\n")
        
        # Process each item
        for item in translated_data:
            kind = item['kind']
            name = item['name']
            statement = item['statement']
            
            # Add a block comment with the original HOL4 statement
            f.write(f"/-\n")
            f.write(f"Original HOL4 {kind}: {name}\n")
            f.write(f"{item['original_hol4']}\n")
            f.write(f"-/\n")
            
            if kind == "Type":
                # Write type as-is
                f.write(f"{statement}\n\n")
                
            elif kind == "Datatype":
                # Write datatype as-is
                f.write(f"{statement}\n\n")
                
            elif kind == "Definition":
                # Write definition as-is
                f.write(f"{statement}\n\n")
                
            elif kind == "Theorem":
                # Format theorem with := by sorry for tactic-mode proving
                # Check if statement already has a proof placeholder
                statement_stripped = statement.strip()
                if statement_stripped.endswith(":= by sorry") or statement_stripped.endswith(":= sorry"):
                    # Already has a proof placeholder, use as-is
                    f.write(f"{statement}\n\n")
                elif statement_stripped.startswith("theorem "):
                    # Statement starts with "theorem": append the placeholder to the
                    # stripped text so trailing whitespace does not precede ":="
                    f.write(f"{statement_stripped} := by sorry\n\n")
                else:
                    # Need to add "theorem" keyword and proof placeholder
                    f.write(f"theorem {name} : {statement} := by sorry\n\n")
        
        # Close namespace
        f.write(f"end CML_Lean.{theory_name}\n")
    
    print(f"LEAN file created: {output_lean_path}")
    
    # Print statistics
    types = sum(1 for x in translated_data if x['kind'] == 'Type')
    datatypes = sum(1 for x in translated_data if x['kind'] == 'Datatype')
    definitions = sum(1 for x in translated_data if x['kind'] == 'Definition')
    theorems = sum(1 for x in translated_data if x['kind'] == 'Theorem')
    
    print(f"Content summary:")
    print(f"  - Theory: {theory_name}")
    print(f"  - Valid ancestors (translated): {valid_ancestors}")
    if len(ancestors) > len(valid_ancestors):
        skipped = [a for a in ancestors if a not in valid_ancestors]
        print(f"  - Skipped ancestors (not translated): {skipped}")
    print(f"  - {types} Types")
    print(f"  - {datatypes} Datatypes")
    print(f"  - {definitions} Definitions")
    print(f"  - {theorems} Theorems (with := sorry)")
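
The theorem branch of the function above has three cases: the placeholder is already present, the statement begins with `theorem`, or it is a bare proposition. A sketch that isolates this logic as a pure function, so it can be checked without writing files, is shown below; `with_sorry` is a hypothetical helper name.

```python
def with_sorry(name: str, statement: str) -> str:
    """Return `statement` as a Lean 4 theorem whose proof is the `sorry` placeholder."""
    s = statement.strip()
    if s.endswith(":= by sorry") or s.endswith(":= sorry"):
        return s  # placeholder already present
    if s.startswith("theorem "):
        return f"{s} := by sorry"
    # Bare proposition: add the keyword, the name, and the placeholder.
    return f"theorem {name} : {s} := by sorry"
```

Unlike the notebook function, this sketch always works on the stripped statement, so leading or trailing whitespace in the model output does not change which case applies.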

In [36]:
def create_all_lean_files(input_files: List[str], output_dir: str) -> None:
    """
    Create LEAN files for all translated JSON files in the proper package structure.
    
    Args:
        input_files: List of original input JSON file paths
        output_dir: Directory containing the translated JSON files
    """
    print("Generating LEAN files from translated JSON files...")
    print("="*80)
    
    for input_file in input_files:
        file_name = input_file.split('/')[-1]
        
        # Path to translated JSON
        translated_json = f"{output_dir}/output_{file_name}"
        
        # Check if translated JSON exists
        try:
            with open(translated_json, 'r', encoding='utf-8') as f:
                data = json.load(f)
            
            # Get theory name from the data
            theory_name = data[0]['theory'] if data else file_name.replace('.json', '')
            ancestors = data[0].get('ancestors', []) if data else []
            
            # Path to output LEAN file using theory name
            lean_output = f"{LEAN_SOURCE_DIR}/{theory_name}.lean"
            
            print(f"\nProcessing: {file_name} -> {theory_name}.lean")
            create_lean_file_from_json(translated_json, lean_output, ancestors)
            
        except FileNotFoundError:
            print(f"\nWarning: Translated file not found: {translated_json}")
    
    print(f"\n{'='*80}")
    print("All LEAN files generated in package structure!")
    print(f"LEAN files location: {LEAN_SOURCE_DIR}/")

# Generate LEAN files for all translated JSON files
create_all_lean_files(INPUT_FILES, OUTPUT_DIR)

Generating LEAN files from translated JSON files...

Processing: locationScript.json -> location.lean
LEAN file created: extracted/CML_Lean/CML_Lean/location.lean
Content summary:
  - Theory: location
  - Valid ancestors (translated): []
  - 0 Types
  - 2 Datatypes
  - 12 Definitions
  - 14 Theorems (with := sorry)

Processing: namespaceScript.json -> namespace.lean
LEAN file created: extracted/CML_Lean/CML_Lean/namespace.lean
Content summary:
  - Theory: namespace
  - Valid ancestors (translated): []
  - Skipped ancestors (not translated): ['alist']
  - 1 Types
  - 2 Datatypes
  - 19 Definitions
  - 0 Theorems (with := sorry)

Processing: astScript.json -> ast.lean
LEAN file created: extracted/CML_Lean/CML_Lean/ast.lean
Content summary:
  - Theory: ast
  - Valid ancestors (translated): ['namespace', 'location']
  - Skipped ancestors (not translated): ['integer', 'words', 'string']
  - 4 Types
  - 19 Datatypes
  - 7 Definitions
  - 0 Theorems (with := sorry)

Processing: namespaceProps