# AuraTokenizer: Advanced Features and Integrations

This notebook demonstrates advanced features, integrations, and development workflows for the AuraTokenizer project. We'll cover:

1. Development Environment Setup
2. Basic Tokenizer Setup
3. Advanced Tokenization Features
4. Cross-Language Integration
5. Performance Benchmarking
6. Custom Extensions
7. Advanced Testing
8. Documentation Generation

Let's get started!

## 1. Development Environment Setup

First, let's set up the development environment: install the build prerequisites, build AuraTokenizer from source, and confirm that the package imports.

### Prerequisites

To build and use AuraTokenizer, you'll need:

1. Python 3.7 or later
2. Rust toolchain (can be installed from https://rustup.rs/)
3. A C++ compiler (MSVC on Windows, GCC or Clang on Linux/macOS)
4. CMake

Please refer to the project's README for detailed installation instructions. The package needs to be built from source as it contains Rust and C++ components.
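
Before building, it can help to confirm that the prerequisites are visible from the notebook's Python process. The following is a minimal sketch using only the standard library. The tool list and the helper name `check_prerequisites` are assumptions for illustration, and the C++ compiler is not checked because its executable name varies by platform.

```python
import shutil
import sys

# Executables to look for on PATH. This list is an assumption based on the
# prerequisites above; the C++ compiler is skipped because its name varies.
REQUIRED_TOOLS = ["rustc", "cargo", "cmake"]

def check_prerequisites(tools=REQUIRED_TOOLS, min_python=(3, 7)):
    """Return a dict mapping each prerequisite name to True (found) or False."""
    status = {"python": sys.version_info >= min_python}
    for tool in tools:
        # shutil.which returns the executable's path, or None if it is not on PATH
        status[tool] = shutil.which(tool) is not None
    return status

for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

A `MISSING` entry only means the tool is not on the `PATH` seen by this Python process; it may still be installed elsewhere.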

For this notebook, we'll assume you have already:
1. Cloned the repository
2. Installed the prerequisites
3. Built and installed the package using:
   ```bash
   python -m pip install maturin
   maturin develop
   ```

### Building from Source

The AuraTokenizer package consists of Python, Rust, and C++ components. Here's the complete build process:

1. Build the C++ components:
   ```bash
   # Windows (PowerShell)
   cmake -S . -B build
   cmake --build build --config Release

   # Linux/macOS
   cmake -S . -B build
   cmake --build build
   ```

2. Build the Rust components:
   ```bash
   cd aura_tokenizer_bridge
   cargo build --release
   ```

3. Build and install the Python package:
   ```bash
   python -m pip install maturin
   maturin develop
   ```

If you encounter any issues:
- Make sure all prerequisites are installed
- Check that environment variables (PATH, RUSTUP_HOME, etc.) are properly set
- Consult the troubleshooting section in the project documentation

For this notebook to work, you must complete the build process first.
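
The three build steps can also be collected into one platform-aware list, which is convenient when scripting the build. This is a sketch only: the helper name `build_steps` is hypothetical, and the working directory for the final `maturin` step is not stated above, so the repository root is assumed.

```python
import platform

def build_steps(system=None):
    """Return the build commands from the steps above as (cwd, argv) pairs."""
    system = system or platform.system()
    cmake_build = ["cmake", "--build", "build"]
    if system == "Windows":
        # Multi-config generators (e.g. Visual Studio) pick the configuration at build time
        cmake_build += ["--config", "Release"]
    return [
        (".", ["cmake", "-S", ".", "-B", "build"]),
        (".", cmake_build),
        ("aura_tokenizer_bridge", ["cargo", "build", "--release"]),
        # Assumption: the maturin step runs from the repository root
        (".", ["python", "-m", "pip", "install", "maturin"]),
        (".", ["maturin", "develop"]),
    ]

for cwd, argv in build_steps():
    print(f"[{cwd}] {' '.join(argv)}")
```

The function only prints the commands; each pair could be passed to `subprocess.run(argv, cwd=cwd, check=True)` to execute it.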

### Common Build Issues

#### CMake Errors

If you encounter a CMake parsing error like:
```
CMake Error at CMakeLists.txt:192:
Parse error. Expected a command name, got right paren with text ")".
```

This error is caused by an extra closing parenthesis in the CMakeLists.txt file. To fix it:

1. Open `Aura-Tokenizer/CMakeLists.txt`
2. Look for this section (around lines 187-192):
   ```cmake
   if(WIN32)
       set_target_properties(auratokenizer PROPERTIES
           WINDOWS_EXPORT_ALL_SYMBOLS ON
       )
   endif()
   )  # <- Remove this extra closing parenthesis
   ```
3. Remove the extra closing parenthesis after `endif()`
4. Save the file and try building again:
   ```bash
   cmake -S . -B build
   cmake --build build --config Release
   ```
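
If the stray parenthesis is hard to spot by eye, a small script can point at the offending line. The sketch below is a naive checker, not a CMake parser: it drops everything after `#` and ignores parentheses inside double-quoted strings, so unusual quoting or bracket arguments can confuse it. The function name `find_unbalanced_parens` is hypothetical.

```python
def find_unbalanced_parens(cmake_text):
    """Return (line_number, description) pairs for parentheses that do not balance."""
    problems = []
    open_lines = []  # line numbers of "(" that have not been closed yet
    for lineno, line in enumerate(cmake_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive comment stripping
        in_string = False
        for ch in code:
            if ch == '"':
                in_string = not in_string
            elif in_string:
                continue
            elif ch == "(":
                open_lines.append(lineno)
            elif ch == ")":
                if open_lines:
                    open_lines.pop()
                else:
                    problems.append((lineno, "extra ')'"))
    problems.extend((lineno, "unclosed '('") for lineno in open_lines)
    return problems

sample = """if(WIN32)
    set_target_properties(auratokenizer PROPERTIES
        WINDOWS_EXPORT_ALL_SYMBOLS ON
    )
endif()
)  # <- extra closing parenthesis
"""
print(find_unbalanced_parens(sample))
```

To check the real file, read it with `pathlib.Path("CMakeLists.txt").read_text()` and pass the result to the function.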

If you encounter other CMake issues:
1. Make sure you're in the correct directory:
   ```bash
   # Navigate to the Aura-Tokenizer subdirectory
   cd Aura-Tokenizer
   ```

2. Try cleaning the build directory:
   ```bash
   # Windows (Command Prompt)
   rmdir /s /q build
   # Windows (PowerShell)
   Remove-Item -Recurse -Force build
   # Linux/macOS
   rm -rf build
   ```

3. Run CMake with verbose output to see more details:
   ```bash
   cmake -S . -B build --log-level=VERBOSE
   ```
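
Because the clean-up command differs between shells, a portable alternative is to remove the build directory from Python. This is a minimal sketch using the standard library; the helper name `clean_build_dir` is hypothetical, and the default path `build` matches the `-B build` argument used above.

```python
import shutil
from pathlib import Path

def clean_build_dir(build_dir="build"):
    """Delete build_dir and its contents; return True if a directory was removed."""
    path = Path(build_dir)
    if not path.is_dir():
        return False  # nothing to clean
    shutil.rmtree(path)
    return True
```

Call `clean_build_dir()` from the directory that contains `CMakeLists.txt`, then re-run the configure step.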

#### Include Directory Issues

If you see CMake errors about INTERFACE_INCLUDE_DIRECTORIES:
```
Target "auratokenizer" INTERFACE_INCLUDE_DIRECTORIES property contains path:
which is prefixed in the source directory.
```

This happens when include directories are not properly configured for installation. The fix involves:

1. Using generator expressions for public includes:
   ```cmake
   target_include_directories(auratokenizer
       PUBLIC
           $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>
           $<INSTALL_INTERFACE:include>
       PRIVATE
           ${OTHER_INCLUDE_DIRS}
   )
   ```

2. Making dependency includes PRIVATE unless they're part of the public interface

With both changes in place, the build should proceed without include directory errors.

#### GTest Dependency

If you encounter a GTest dependency error:
```
Could NOT find GTest (missing: GTEST_LIBRARY GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY)
```

You have two options:

1. Skip building tests (recommended for initial setup):
   ```bash
   cmake -S . -B build -DBUILD_TESTS=OFF
   ```

2. Install GTest if you want to build tests:
   ```bash
   # Windows with vcpkg
   vcpkg install gtest:x64-windows
   
   # Linux (Debian/Ubuntu)
   sudo apt-get install libgtest-dev
   
   # Then configure with tests enabled
   cmake -S . -B build -DBUILD_TESTS=ON
   ```

The core library will build without GTest. Testing is optional and can be enabled later when needed.

In [3]:
import sys
import subprocess

def run_command(cmd):
    print(f"Running: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)

# Install build dependencies
run_command([sys.executable, "-m", "pip", "install", "maturin"])

# Build and install the package
print("\nBuilding and installing AuraTokenizer...")
run_command([sys.executable, "-m", "pip", "install", "-e", ".", "--no-deps"])

Running: c:\Users\rstiw\Desktop\AURA\python\python.exe -m pip install maturin

Building and installing AuraTokenizer...
Running: c:\Users\rstiw\Desktop\AURA\python\python.exe -m pip install -e . --no-deps


CalledProcessError: Command '['c:\\Users\\rstiw\\Desktop\\AURA\\python\\python.exe', '-m', 'pip', 'install', '-e', '.', '--no-deps']' returned non-zero exit status 1.
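
The `CalledProcessError` above reports only the exit status, not the reason the install failed. A variant of `run_command` that captures the subprocess output makes the underlying error visible in the notebook. This is a sketch; `run_command_verbose` is a hypothetical helper, not part of the project.

```python
import subprocess
import sys

def run_command_verbose(cmd):
    """Run cmd and return (returncode, combined output) instead of raising."""
    print(f"Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = result.stdout + result.stderr
    if result.returncode != 0:
        # Show the tool's own error message, which CalledProcessError hides
        print(f"Command failed with exit status {result.returncode}:")
        print(output)
    return result.returncode, output

# Example (not executed here):
# run_command_verbose([sys.executable, "-m", "pip", "install", "-e", ".", "--no-deps"])
```

Note that `pip install -e .` resolves `.` against the current working directory, so the command must run from the directory containing the package's build configuration.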

In [4]:
import sys
import os

print(f"Python version: {sys.version}")
print(f"Current directory: {os.getcwd()}")

Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
Current directory: c:\Users\rstiw\Desktop\AURA\Packages\Aura-Tokenizer\docs\examples


## 2. Basic Tokenizer Setup

Once AuraTokenizer is installed, we can create a simple tokenizer with basic configuration. This section demonstrates the core functionality before we move on to advanced features.

In [5]:
# Try importing aura_tokenizer if available
try:
    from aura_tokenizer import AuraTokenizer
    print("AuraTokenizer is installed and imported successfully.")
except ImportError as e:
    print("AuraTokenizer is not installed or cannot be imported.")
    print(f"Error: {str(e)}")
    print("\nPlease follow the installation instructions in the prerequisites section.")

AuraTokenizer is not installed or cannot be imported.
Error: cannot import name 'PyBPETokenizer' from 'aura_tokenizer.core' (c:\Users\rstiw\Desktop\AURA\python\Lib\site-packages\aura_tokenizer\core.cp312-win_amd64.pyd)

Please follow the installation instructions in the prerequisites section.


In [6]:
def create_basic_tokenizer():
    """Create a basic character-level tokenizer for demonstration."""
    try:
        from aura_tokenizer import AuraTokenizer, TokenizerConfig
        
        config = TokenizerConfig()
        config.type = "char"  # Character-level tokenization
        config.add_special_tokens = True
        
        tokenizer = AuraTokenizer(config)
        return tokenizer
    except ImportError as e:
        print(f"Could not create tokenizer: {str(e)}")
        return None

# Try creating and using a basic tokenizer
tokenizer = create_basic_tokenizer()
if tokenizer is not None:
    text = "Hello, World!"
    tokens = tokenizer.encode(text)
    print(f"\nInput text: {text}")
    print(f"Tokens: {tokens}")
    decoded = tokenizer.decode(tokens)
    print(f"Decoded text: {decoded}")

Could not create tokenizer: cannot import name 'PyBPETokenizer' from 'aura_tokenizer.core' (c:\Users\rstiw\Desktop\AURA\python\Lib\site-packages\aura_tokenizer\core.cp312-win_amd64.pyd)
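
Until the package imports cleanly, the encode/decode round trip that the cell above is meant to demonstrate can be illustrated with a pure-Python stand-in. This is not AuraTokenizer's implementation or API: the class `CharTokenizer` is hypothetical, builds its vocabulary from the input text itself, and ignores special tokens.

```python
class CharTokenizer:
    """Toy character-level tokenizer: one integer ID per distinct character."""

    def __init__(self, corpus):
        # Assign IDs in sorted order so the mapping is deterministic
        self.vocab = {ch: i for i, ch in enumerate(sorted(set(corpus)))}
        self.inverse = {i: ch for ch, i in self.vocab.items()}

    def encode(self, text):
        return [self.vocab[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.inverse[i] for i in ids)

text = "Hello, World!"
toy = CharTokenizer(text)
ids = toy.encode(text)
print(f"Tokens: {ids}")
print(f"Decoded text: {toy.decode(ids)}")
```

The property worth checking with the real tokenizer is the same: decoding the encoded IDs should reproduce the input text.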


### Current Status

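As shown above, the package must be properly built and installed before we can proceed with the examples. The import error reports that `aura_tokenizer.core` does not export `PyBPETokenizer`, which suggests that the compiled extension in `site-packages` is incomplete or out of date relative to the Python code that imports it.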

**Next Steps:**
1. Exit this notebook
2. Follow the build instructions in the "Building from Source" section above
3. Verify the installation by running:
   ```bash
   python -c "import aura_tokenizer; print(aura_tokenizer.__file__)"
   ```
4. Return to this notebook once the installation is successful
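
The verification command in step 3 imports the package, so it fails with the same error while the extension is broken. To see which copy of the package Python would load without executing it, `importlib.util.find_spec` can be used. This is a minimal sketch; the helper name `locate_package` is hypothetical.

```python
import importlib.util

def locate_package(name):
    """Return the file a top-level package would load from, or None if it is not installed."""
    # For a top-level name, find_spec locates the package without importing it
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin

print(locate_package("aura_tokenizer"))
```

If the printed path points into `site-packages` rather than your source checkout, an older installed copy may be shadowing the fresh build.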

The remaining sections of this notebook will demonstrate advanced features once you have a working installation.

## 3. Advanced Tokenization Features

*This section will cover:*
- Custom tokenization models
- Pre-tokenizers and post-processors
- Training custom vocabularies
- Handling special tokens
- Unicode normalization
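
As a preview of why Unicode normalization matters for tokenization, the sketch below uses Python's standard `unicodedata` module, not any AuraTokenizer API. The same visible character can be stored as different code point sequences, which a tokenizer would otherwise treat as different inputs.

```python
import unicodedata

def normalize_text(text, form="NFC"):
    """Normalize text to the given Unicode normalization form."""
    return unicodedata.normalize(form, text)

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The raw strings differ, but their NFC forms are identical
print(composed == decomposed)
print(normalize_text(composed) == normalize_text(decomposed))
```

Applying one normalization form consistently before tokenization ensures that visually identical strings map to the same tokens.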

## 4. Cross-Language Integration

*This section will demonstrate:*
- Using Rust components
- C++ integration
- FFI interfaces
- Performance optimizations

## 5. Performance Benchmarking

*This section will cover:*
- Throughput measurements
- Memory usage analysis
- Comparison with other tokenizers
- Optimization techniques
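
A throughput measurement can be sketched independently of any particular tokenizer. The helper below is hypothetical and takes any callable that maps a string to a sequence of tokens; `str.split` stands in for a real tokenizer here, so the printed number says nothing about AuraTokenizer's performance.

```python
import time

def measure_throughput(tokenize, texts, repeats=3):
    """Return the best observed tokens per second over `repeats` passes."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        n_tokens = sum(len(tokenize(t)) for t in texts)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, n_tokens / elapsed)
    return best

texts = ["Hello, World!"] * 10_000
rate = measure_throughput(str.split, texts)  # str.split is a stand-in tokenizer
print(f"{rate:,.0f} tokens/s")
```

Taking the best of several passes reduces the influence of one-off delays such as cache warm-up; once the package is installed, the tokenizer's `encode` method could be passed in place of `str.split`.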

## 6. Custom Extensions

*This section will show:*
- Creating custom pre-tokenizers
- Implementing custom post-processors
- Adding new tokenization algorithms
- Plugin development

## 7. Advanced Testing

*This section will demonstrate:*
- Unit test creation
- Integration testing
- Performance testing
- Cross-language testing

## 8. Documentation Generation

*This section will cover:*
- API documentation
- Example generation
- Documentation testing
- Cross-language documentation

Please complete the installation steps before proceeding with these sections.