# Downloading MFA Dictionaries for Various Languages

This notebook downloads MFA (Montreal Forced Aligner) dictionaries for the following languages:
- English
- French
- Italian
- Russian
- Spanish

**Note:** The German dictionary should already be installed.

Dictionaries will be saved to the `MFA/pretrained_models/dictionary/` folder in the project root.
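
The project-local folder above suggests that MFA's root directory points at the project's `MFA/` folder. One way to arrange that, assuming the installed MFA release honours the `MFA_ROOT_DIR` environment variable (worth confirming in the MFA documentation for your version), is to pass it to each `mfa` subprocess. The helper name `mfa_env` is hypothetical and not part of this notebook:

```python
import os
from pathlib import Path
from typing import Dict


def mfa_env(mfa_root: Path) -> Dict[str, str]:
    """Return a copy of the current environment with MFA_ROOT_DIR set.

    Assumption: the installed MFA version reads MFA_ROOT_DIR to decide
    where pretrained models are stored.
    """
    env = dict(os.environ)
    env["MFA_ROOT_DIR"] = str(mfa_root)
    return env


# Intended use: subprocess.run([...], env=mfa_env(PROJECT_ROOT / "MFA"))
env = mfa_env(Path("MFA"))
print(env["MFA_ROOT_DIR"])
```

Copying `os.environ` rather than mutating it keeps the setting scoped to the subprocess call.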

In [11]:
import subprocess
import sys
from pathlib import Path
from typing import Dict, Optional, Tuple, List

# Path settings
PROJECT_ROOT = Path('/Volumes/SSanDisk/SpeechRec-German')
MFA_DIR = PROJECT_ROOT / 'MFA' / 'pretrained_models' / 'dictionary'

# Create directory if it doesn't exist
MFA_DIR.mkdir(parents=True, exist_ok=True)

print(f"Project: {PROJECT_ROOT}")
print(f"MFA dictionaries: {MFA_DIR}")
print(f"Directory exists: {MFA_DIR.exists()}")
print()

Project: /Volumes/SSanDisk/SpeechRec-German
MFA dictionaries: /Volumes/SSanDisk/SpeechRec-German/MFA/pretrained_models/dictionary
Directory exists: True



In [12]:
# MFA dictionaries for each language
# Format: {'language': ['primary_dict_name', 'alternative_dict_name', ...]}
# If primary dictionary is not available, alternatives will be tried automatically
LANGUAGES = {
    'english': ['english_us_mfa', 'english_uk_mfa', 'english_mfa'],
    'french': ['french_mfa'],
    'italian': ['italian_cv', 'italian_mfa'],  # italian_mfa doesn't exist, using italian_cv
    'russian': ['russian_mfa'],
    'spanish': ['spanish_mfa', 'spanish_latin_america_mfa', 'spanish_spain_mfa'],
}

print("Dictionaries to download:")
print("=" * 80)
for lang_name, dict_names in LANGUAGES.items():
    primary = dict_names[0]
    alternatives = dict_names[1:] if len(dict_names) > 1 else []
    alt_str = f" (alternatives: {', '.join(alternatives)})" if alternatives else ""
    print(f"  {lang_name.capitalize():12s} -> {primary}{alt_str}")
print()

Dictionaries to download:
  English      -> english_us_mfa (alternatives: english_uk_mfa, english_mfa)
  French       -> french_mfa
  Italian      -> italian_cv (alternatives: italian_mfa)
  Russian      -> russian_mfa
  Spanish      -> spanish_mfa (alternatives: spanish_latin_america_mfa, spanish_spain_mfa)



In [13]:
def find_mfa_dict_path(dict_name: str) -> Optional[Path]:
    """
    Finds the path to an MFA dictionary.
    Checks several possible locations.
    """
    possible_paths = [
        # Local project folder
        MFA_DIR / f"{dict_name}.dict",
        # Standard MFA locations
        Path.home() / "Documents" / "MFA" / "pretrained_models" / "dictionary" / f"{dict_name}.dict",
        Path.home() / ".local" / "share" / "montreal-forced-alignment" / "pretrained_models" / "dictionary" / f"{dict_name}.dict",
    ]
    
    for path in possible_paths:
        if path.exists():
            return path
    
    return None

def check_mfa_cli() -> Tuple[bool, Optional[str]]:
    """
    Checks if MFA CLI is available.
    Returns (available, path_to_binary).
    """
    # 1. Check in PATH
    try:
        result = subprocess.run(
            ['mfa', '--version'],
            capture_output=True,
            text=True,
            timeout=5
        )
        if result.returncode == 0:
            return True, 'mfa'
    except (subprocess.TimeoutExpired, FileNotFoundError):
        pass
    
    # 2. Check direct path to local miniforge/mfa310
    local_mfa_path = PROJECT_ROOT / 'miniforge' / 'envs' / 'mfa310' / 'bin' / 'mfa'
    if local_mfa_path.exists():
        try:
            result = subprocess.run(
                [str(local_mfa_path), '--version'],
                capture_output=True,
                text=True,
                timeout=5
            )
            # MFA may return error code, but if file exists, it works
            return True, str(local_mfa_path)
        except (subprocess.TimeoutExpired, FileNotFoundError):
            pass
    
    # 3. Check via conda run (if conda is available)
    try:
        result = subprocess.run(
            ['conda', 'run', '-n', 'mfa310', 'mfa', '--version'],
            capture_output=True,
            text=True,
            timeout=5
        )
        # Even if error code, but command executed, conda is available
        # Try using conda run
        return True, 'conda run -n mfa310 mfa'
    except (subprocess.TimeoutExpired, FileNotFoundError):
        pass
    
    # 4. Check local conda from miniforge
    local_conda = PROJECT_ROOT / 'miniforge' / 'bin' / 'conda'
    if local_conda.exists():
        try:
            result = subprocess.run(
                [str(local_conda), 'run', '-n', 'mfa310', 'mfa', '--version'],
                capture_output=True,
                text=True,
                timeout=5
            )
            return True, f'{str(local_conda)} run -n mfa310 mfa'
        except (subprocess.TimeoutExpired, FileNotFoundError):
            pass
    
    return False, None

# Check MFA CLI availability
mfa_available, mfa_cmd = check_mfa_cli()

if mfa_available:
    print("✓ MFA CLI available")
    print(f"  Command: {mfa_cmd}")
    if Path(mfa_cmd).exists():
        print(f"  Path: {mfa_cmd}")
else:
    print("✗ MFA CLI not found")
    print()
    print("  Checked locations:")
    print(f"    - PATH: mfa")
    print(f"    - Local miniforge: {PROJECT_ROOT / 'miniforge' / 'envs' / 'mfa310' / 'bin' / 'mfa'}")
    print(f"    - Conda environment: conda run -n mfa310 mfa")
    print()
    print("  If MFA is installed elsewhere, specify the path manually.")
    print("  Or install Montreal Forced Aligner:")
    print("    conda install -c conda-forge montreal-forced-aligner")
print()

✓ MFA CLI available
  Command: /Volumes/SSanDisk/SpeechRec-German/miniforge/envs/mfa310/bin/mfa
  Path: /Volumes/SSanDisk/SpeechRec-German/miniforge/envs/mfa310/bin/mfa
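
The lookup order used above (first `PATH`, then a known install location) can also be sketched with `shutil.which`, which tests for an executable without spawning a process. `resolve_cli` is a hypothetical helper, not part of this notebook, and it only confirms that a binary exists, not that it runs correctly:

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Iterable, Optional


def resolve_cli(name: str, fallbacks: Iterable[Path] = ()) -> Optional[str]:
    """Return the first usable executable: PATH lookup, then fallback paths."""
    on_path = shutil.which(name)
    if on_path:
        return on_path
    for candidate in fallbacks:
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    return None


# The Python interpreter stands in for a known-good fallback binary here.
print(resolve_cli("no-such-binary-xyz", [Path(sys.executable)]))
```

In this notebook the fallback list would hold the `miniforge/envs/mfa310/bin/mfa` path under `PROJECT_ROOT`.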



In [14]:
# Check which dictionaries are already installed
print("=" * 80)
print("CHECKING INSTALLED DICTIONARIES")
print("=" * 80)
print()

available_dicts = {}
for lang_name, dict_names in LANGUAGES.items():
    found = False
    for dict_name in dict_names:
        dict_path = find_mfa_dict_path(dict_name)
        if dict_path:
            available_dicts[lang_name] = (dict_path, dict_name)
            print(f"✓ {lang_name.capitalize():12s} ({dict_name:20s}): found")
            print(f"    Path: {dict_path}")
            found = True
            break
    
    if not found:
        print(f"✗ {lang_name.capitalize():12s} ({dict_names[0]:20s}): not found")
        available_dicts[lang_name] = None

print()
print(f"Found dictionaries: {sum(1 for v in available_dicts.values() if v is not None)} out of {len(LANGUAGES)}")
print()

CHECKING INSTALLED DICTIONARIES

✓ English      (english_us_mfa      ): found
    Path: /Volumes/SSanDisk/SpeechRec-German/MFA/pretrained_models/dictionary/english_us_mfa.dict
✓ French       (french_mfa          ): found
    Path: /Volumes/SSanDisk/SpeechRec-German/MFA/pretrained_models/dictionary/french_mfa.dict
✗ Italian      (italian_cv          ): not found
✓ Russian      (russian_mfa         ): found
    Path: /Volumes/SSanDisk/SpeechRec-German/MFA/pretrained_models/dictionary/russian_mfa.dict
✓ Spanish      (spanish_mfa         ): found
    Path: /Volumes/SSanDisk/SpeechRec-German/MFA/pretrained_models/dictionary/spanish_mfa.dict

Found dictionaries: 4 out of 5



In [15]:
def download_mfa_dictionary(dict_name: str, mfa_command: str = 'mfa') -> Tuple[bool, Optional[str], Optional[str]]:
    """
    Downloads an MFA dictionary.
    
    Args:
        dict_name: Dictionary name (e.g., 'english_us_mfa')
        mfa_command: Command to run MFA (e.g., 'mfa' or 'conda run -n mfa310 mfa')
    
    Returns:
        (success, message, full_error_message)
    """
    try:
        # Build command
        if 'conda' in mfa_command:
            # For conda command, split into parts
            cmd_parts = mfa_command.split() + ['model', 'download', 'dictionary', dict_name]
        else:
            cmd_parts = [mfa_command, 'model', 'download', 'dictionary', dict_name]
        
        print(f"  Executing: {' '.join(cmd_parts)}")
        
        # Run command with increased timeout (download may take time)
        result = subprocess.run(
            cmd_parts,
            capture_output=True,
            text=True,
            timeout=300  # 5 minutes for download
        )
        
        if result.returncode == 0:
            # Check if file appeared
            dict_path = find_mfa_dict_path(dict_name)
            if dict_path:
                return True, f"Successfully downloaded: {dict_path}", None
            else:
                return True, "Downloaded, but file not found in expected locations", None
        else:
            error_msg = result.stderr.strip() or result.stdout.strip()
            full_error = f"stderr: {result.stderr}\nstdout: {result.stdout}" if result.stderr or result.stdout else "No error message"
            return False, f"Error (code {result.returncode}): {error_msg[:300]}", full_error
            
    except subprocess.TimeoutExpired:
        return False, "Timeout during download (exceeded 5 minutes)", None
    except Exception as e:
        return False, f"Exception: {str(e)}", None

print("Dictionary download function created")

Dictionary download function created
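
`download_mfa_dictionary` decides how to split the command by looking for the substring `conda`, which can misfire when a path contains spaces or the word `conda`. A minimal alternative, assuming the caller keeps the base command as a list of arguments from the start (the `build_download_command` name and the example path are hypothetical):

```python
from typing import List, Sequence


def build_download_command(base_cmd: Sequence[str], dict_name: str) -> List[str]:
    """Append the MFA dictionary-download subcommand to a base command."""
    return [*base_cmd, "model", "download", "dictionary", dict_name]


# A made-up path containing a space, passed as a single argument.
direct = build_download_command(["/opt/my tools/mfa"], "italian_cv")
via_conda = build_download_command(["conda", "run", "-n", "mfa310", "mfa"], "italian_cv")
print(direct)
print(via_conda)
```

Because the result is already a list, it can go straight to `subprocess.run` without any string splitting.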


In [16]:
# Determine which dictionaries need to be downloaded
missing_dicts = {
    lang_name: LANGUAGES[lang_name]
    for lang_name in LANGUAGES.keys()
    if available_dicts[lang_name] is None
}

if not missing_dicts:
    print("=" * 80)
    print("ALL DICTIONARIES ALREADY INSTALLED")
    print("=" * 80)
    print()
    print("All requested dictionaries are already found. Download not required.")
else:
    print("=" * 80)
    print("DOWNLOADING MISSING DICTIONARIES")
    print("=" * 80)
    print()
    print(f"Need to download {len(missing_dicts)} dictionary sets:")
    for lang_name, dict_names in missing_dicts.items():
        primary = dict_names[0]
        alternatives = dict_names[1:] if len(dict_names) > 1 else []
        alt_str = f" (will try alternatives: {', '.join(alternatives)})" if alternatives else ""
        print(f"  - {lang_name.capitalize()}: {primary}{alt_str}")
    print()
    
    if not mfa_available:
        print("⚠ MFA CLI unavailable. Cannot download dictionaries automatically.")
        print()
        print("Instructions for manual download:")
        print("  1. Install Montreal Forced Aligner:")
        print("     conda install -c conda-forge montreal-forced-aligner")
        print()
        print("  2. Download dictionaries manually:")
        for lang_name, dict_names in missing_dicts.items():
            for dict_name in dict_names:
                print(f"     mfa model download dictionary {dict_name}")
    else:
        print(f"Using command: {mfa_cmd}")
        print()

DOWNLOADING MISSING DICTIONARIES

Need to download 1 dictionary sets:
  - Italian: italian_cv (will try alternatives: italian_mfa)

Using command: /Volumes/SSanDisk/SpeechRec-German/miniforge/envs/mfa310/bin/mfa



In [17]:
# Download missing dictionaries
if missing_dicts and mfa_available:
    print("=" * 80)
    print("STARTING DOWNLOAD")
    print("=" * 80)
    print()
    
    results = {}
    
    for lang_name, dict_names in missing_dicts.items():
        print(f"[{lang_name.upper()}] Trying to download dictionary...")
        success = False
        downloaded_dict = None
        error_messages = []
        
        # Try each dictionary name in order (primary first, then alternatives)
        for dict_name in dict_names:
            print(f"  Trying: {dict_name}")
            success, message, full_error = download_mfa_dictionary(dict_name, mfa_cmd)
            
            if success:
                print(f"  ✓ {message}")
                downloaded_dict = dict_name
                # Record the path if the file is in a known location
                dict_path = find_mfa_dict_path(dict_name)
                if dict_path:
                    available_dicts[lang_name] = (dict_path, dict_name)
                # Stop at the first successful download; alternatives are not needed
                break
            else:
                print(f"  ✗ {message}")
                if full_error:
                    error_messages.append(f"{dict_name}: {full_error[:500]}")
        
        if success:
            results[lang_name] = (True, f"Successfully downloaded: {downloaded_dict}")
        else:
            results[lang_name] = (False, f"Failed to download. Tried: {', '.join(dict_names)}")
            if error_messages:
                results[lang_name] = (False, f"Failed. Last error: {error_messages[-1][:200]}")
        print()
    
    # Final statistics
    print("=" * 80)
    print("DOWNLOAD RESULTS")
    print("=" * 80)
    print()
    
    successful = sum(1 for success, _ in results.values() if success)
    failed = len(results) - successful
    
    print(f"Successfully downloaded: {successful}")
    print(f"Errors: {failed}")
    print()
    
    if successful > 0:
        print("Successfully downloaded dictionaries:")
        for lang_name, (success, message) in results.items():
            if success:
                print(f"  ✓ {lang_name.capitalize()}: {message}")
        print()
    
    if failed > 0:
        print("Dictionaries with errors:")
        for lang_name, (success, message) in results.items():
            if not success:
                print(f"  ✗ {lang_name.capitalize()}: {message}")
        print()
        print("Try downloading them manually:")
        for lang_name, (success, _) in results.items():
            if not success:
                dict_names = LANGUAGES[lang_name]
                for dict_name in dict_names:
                    print(f"  mfa model download dictionary {dict_name}")
elif missing_dicts and not mfa_available:
    print("⚠ Download skipped: MFA CLI unavailable")

STARTING DOWNLOAD

[ITALIAN] Trying to download dictionary...
  Trying: italian_cv
  Executing: /Volumes/SSanDisk/SpeechRec-German/miniforge/envs/mfa310/bin/mfa model download dictionary italian_cv
  ✓ Successfully downloaded: /Volumes/SSanDisk/SpeechRec-German/MFA/pretrained_models/dictionary/italian_cv.dict

DOWNLOAD RESULTS

Successfully downloaded: 1
Errors: 0

Successfully downloaded dictionaries:
  ✓ Italian: Successfully downloaded: italian_cv



In [18]:
# Final check of all dictionaries
print("=" * 80)
print("FINAL CHECK")
print("=" * 80)
print()

final_status = {}
for lang_name, dict_names in LANGUAGES.items():
    found = False
    for dict_name in dict_names:
        dict_path = find_mfa_dict_path(dict_name)
        if dict_path:
            final_status[lang_name] = dict_path
            file_size = dict_path.stat().st_size / (1024 * 1024)  # Size in MB
            print(f"✓ {lang_name.capitalize():12s} ({dict_name:20s}): found ({file_size:.2f} MB)")
            print(f"    {dict_path}")
            found = True
            break
    
    if not found:
        final_status[lang_name] = None
        print(f"✗ {lang_name.capitalize():12s} ({dict_names[0]:20s}): not found")

print()
total_found = sum(1 for v in final_status.values() if v is not None)
print(f"Total found: {total_found} out of {len(LANGUAGES)} dictionaries")

if total_found == len(LANGUAGES):
    print()
    print("🎉 All dictionaries successfully installed!")
else:
    print()
    print(f"⚠ Missing {len(LANGUAGES) - total_found} dictionaries.")
    print("Check error messages above and try downloading them manually.")

FINAL CHECK

✓ English      (english_us_mfa      ): found (2.97 MB)
    /Volumes/SSanDisk/SpeechRec-German/MFA/pretrained_models/dictionary/english_us_mfa.dict
✓ French       (french_mfa          ): found (4.37 MB)
    /Volumes/SSanDisk/SpeechRec-German/MFA/pretrained_models/dictionary/french_mfa.dict
✓ Italian      (italian_cv          ): found (1.91 MB)
    /Volumes/SSanDisk/SpeechRec-German/MFA/pretrained_models/dictionary/italian_cv.dict
✓ Russian      (russian_mfa         ): found (23.80 MB)
    /Volumes/SSanDisk/SpeechRec-German/MFA/pretrained_models/dictionary/russian_mfa.dict
✓ Spanish      (spanish_mfa         ): found (4.43 MB)
    /Volumes/SSanDisk/SpeechRec-German/MFA/pretrained_models/dictionary/spanish_mfa.dict

Total found: 5 out of 5 dictionaries

🎉 All dictionaries successfully installed!


## Additional Information

### Checking Installed MFA Dictionaries

You can check all installed MFA dictionaries using the command:
```bash
mfa model list dictionary
```
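
The same check can be run from Python. The sketch below wraps the command above and returns its raw output; it does not parse the listing, because the output format may differ between MFA versions. The function name `list_installed_dictionaries` is hypothetical:

```python
import subprocess
from typing import Optional, Sequence


def list_installed_dictionaries(mfa_cmd: Sequence[str] = ("mfa",)) -> Optional[str]:
    """Run `mfa model list dictionary` and return its raw stdout.

    Returns None if the executable cannot be started, times out,
    or exits with a non-zero status.
    """
    try:
        result = subprocess.run(
            [*mfa_cmd, "model", "list", "dictionary"],
            capture_output=True,
            text=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


output = list_installed_dictionaries()
print(output if output is not None else "MFA CLI not available")
```

Passing a different `mfa_cmd`, such as the full path reported by the CLI check earlier in this notebook, targets a specific installation.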

### Dictionary Alternatives

This notebook automatically tries alternative dictionary names if the primary one is not available:
- **English**: `english_us_mfa` → `english_uk_mfa` → `english_mfa`
- **French**: `french_mfa`
- **Italian**: `italian_cv` (Common Voice) → `italian_mfa` (not available)
- **Russian**: `russian_mfa`
- **Spanish**: `spanish_mfa` → `spanish_latin_america_mfa` → `spanish_spain_mfa`
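
The fallback order listed above amounts to "take the first candidate that works". A minimal sketch of that selection, using a set lookup as a stand-in for the real download attempt (`first_available` and the `installed` set are illustrative names, not part of this notebook):

```python
from typing import Callable, Iterable, Optional


def first_available(
    candidates: Iterable[str], is_available: Callable[[str], bool]
) -> Optional[str]:
    """Return the first candidate accepted by is_available, or None."""
    for name in candidates:
        if is_available(name):
            return name
    return None


# Stand-in availability check; the notebook attempts a download instead.
installed = {"italian_cv", "spanish_mfa"}
print(first_available(["italian_cv", "italian_mfa"], installed.__contains__))
print(first_available(["german_mfa"], installed.__contains__))
```

Keeping the availability check as a parameter lets the same loop serve both the "already installed" check and the download step.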

### Dictionary Paths

MFA usually saves dictionaries in one of the following locations:
1. `~/Documents/MFA/pretrained_models/dictionary/`
2. `~/.local/share/montreal-forced-alignment/pretrained_models/dictionary/`
3. Local project folder: `MFA/pretrained_models/dictionary/`

This notebook automatically checks all these locations.
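
To see every dictionary present in those locations, rather than only the names listed in `LANGUAGES`, the folders can be scanned for `*.dict` files. This is a sketch under the assumption that dictionaries are stored as `<name>.dict`, as `find_mfa_dict_path` expects; `scan_dictionaries` is a hypothetical helper:

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def scan_dictionaries(search_dirs: Iterable[Path]) -> Dict[str, Path]:
    """Map dictionary name -> first matching .dict file across search_dirs."""
    found: Dict[str, Path] = {}
    for directory in search_dirs:
        if not directory.is_dir():
            continue  # Skip locations that do not exist on this machine
        for dict_file in sorted(directory.glob("*.dict")):
            # Earlier directories take priority over later ones
            found.setdefault(dict_file.stem, dict_file)
    return found


# Demonstration with a temporary directory instead of real MFA folders.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "french_mfa.dict").write_text("placeholder\n", encoding="utf-8")
    print(sorted(scan_dictionaries([Path(tmp), Path(tmp) / "missing"])))
```

The order of `search_dirs` decides which copy wins when the same dictionary exists in more than one location.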