Automatically detect and repair character encoding corruption in text files. Supports UTF-8, Windows-1252 (also known as CP1252), and ISO-8859-1, with pattern recognition for multiple languages.
# Preview changes (safe)
python fix_encoding.py --dry-run
# Fix all files (creates backups)
python fix_encoding.py
# Preview fixes for a specific directory
python fix_encoding.py --path "/path/to/project" --dry-run
Transforms corrupted characters back to proper UTF-8 across languages:
❌ cafÃ© → ✅ café (French)
❌ espaÃ±ol → ✅ español (Spanish)
❌ rÃ©sumÃ© → ✅ résumé (French)
❌ fÃ¼r → ✅ für (German)
❌ JosÃ© → ✅ José (Spanish)
❌ naÃ¯ve → ✅ naïve (English/French)
❌ donâ€™t → ✅ don't (Typography)
❌ â‚¬50 → ✅ €50 (Euro symbol)
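Corruption of this kind typically arises when UTF-8 bytes are decoded as Windows-1252. The sketch below reproduces that round trip with the standard codecs only. The helper names `corrupt` and `repair` are illustrative and are not taken from fix_encoding.py, which may work differently.

```python
def corrupt(text):
    """Simulate the damage: UTF-8 bytes misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text):
    """Reverse the damage; return the input unchanged if it is not mojibake."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Clean text usually fails one of the two steps, so it is left alone.
        return text


for word in ["café", "español", "für", "naïve", "€50"]:
    broken = corrupt(word)
    print(broken, "->", repair(broken))
```

A whole-string round trip like this only works when the entire text was damaged the same way. Files that mix clean and corrupted passages are one reason a tool might prefer per-pattern replacement.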
Encoding corruption (mojibake) appears when text written as UTF-8 is read or re-saved as a legacy encoding such as Windows-1252 or ISO-8859-1, so that each multi-byte character turns into two or three unrelated ones. This tool detects and fixes the most common of these patterns across multiple languages.
Common scenarios:
- 🔄 Legacy system migrations
- 🗄️ Database export/import with wrong charset
- 📁 File uploads with encoding misdetection
- 🌐 Mixed hosting environment transfers
- 🛡️ Safe: Creates automatic backups before any changes
- 🧪 Preview Mode: --dry-run shows changes without modifying files
- 🔍 Multi-Encoding Detection: Handles UTF-8, Windows-1252 (CP1252), and ISO-8859-1 input
- 🌐 Multi-Language: Built-in patterns for German, French, Spanish, and more
- 🎯 Smart Filtering: Only processes text files (.html, .php, .css, .js, .xml, .json)
- 🎨 Visual Feedback: Colored output shows exactly what gets fixed
- 📦 Zero Dependencies: Uses only Python standard library
- 🖥️ Cross-Platform: Works on Windows, macOS, and Linux
- Python 3.6 or higher
- Required libraries (all part of the Python standard library):
  - os - Operating system functions
  - sys - System-specific parameters
  - argparse - Command-line argument parsing
  - shutil - File operations
  - pathlib - Path operations
🔍 Auto-Check Feature: The script automatically verifies all dependencies and Python version compatibility on startup. If anything is missing, you'll get a clear error message with instructions.
No pip installs, no external dependencies, no hassle!
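A startup check of this kind could look roughly like the sketch below. Only the module list and the 3.6 minimum come from this README; the function name `check_requirements` and the message wording are assumptions. The sketch uses `str.format` rather than f-strings so that the file still parses on interpreters older than 3.6 and can report the problem.

```python
import importlib.util
import sys

REQUIRED_MODULES = ("os", "sys", "argparse", "shutil", "pathlib")
MIN_PYTHON = (3, 6)


def check_requirements(modules=REQUIRED_MODULES, min_python=MIN_PYTHON):
    """Return a list of problems; an empty list means everything is fine."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python {}+ required, found {}".format(
            ".".join(map(str, min_python)),
            ".".join(map(str, sys.version_info[:3])),
        ))
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append("Missing module: {}".format(name))
    return problems


problems = check_requirements()
print("\n".join(problems) if problems else "All requirements met.")
```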
python-encoding-fixer/
├── fix_encoding.py # Main script
├── README.md # This file
├── examples/
│ ├── corrupted/ # Sample files with encoding issues
│ └── fixed/ # Expected results after processing
├── patterns/
│ ├── languages.json # Language-specific corruption patterns
│ └── common.json # Universal patterns
└── docs/
    └── encoding-guide.md # Technical background
# Check what would be fixed
python fix_encoding.py --dry-run
# Fix files in current directory
python fix_encoding.py
# Fix specific project directory
python fix_encoding.py --path "/var/www/multilingual-site" --dry-run
python fix_encoding.py --path "/var/www/multilingual-site"
# Process only specific file types
python fix_encoding.py --extensions .html,.php,.css
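For orientation, the three flags used above map naturally onto `argparse`. The sketch below is an assumption about how such a parser might be wired, not the script's actual code: the defaults, help strings, and the names `build_parser` and `parse_extensions` are hypothetical.

```python
import argparse

DEFAULT_EXTENSIONS = ".html,.htm,.php,.css,.js,.xml,.json"


def parse_extensions(value):
    """Turn '.html,php' into {'.html', '.php'} (lowercased, dot-prefixed)."""
    items = (item.strip().lower() for item in value.split(","))
    return {item if item.startswith(".") else "." + item for item in items if item}


def build_parser():
    parser = argparse.ArgumentParser(description="Repair encoding corruption in text files.")
    parser.add_argument("--path", default=".",
                        help="directory to scan (assumed default: current directory)")
    parser.add_argument("--dry-run", action="store_true",
                        help="show changes without modifying files")
    parser.add_argument("--extensions", type=parse_extensions,
                        default=parse_extensions(DEFAULT_EXTENSIONS),
                        help="comma-separated list of file extensions")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(
    ["--path", "/var/www/multilingual-site", "--dry-run", "--extensions", ".html,.php,.css"]
)
print(args.path, args.dry_run, sorted(args.extensions))
```

Note that argparse exposes `--dry-run` as the attribute `args.dry_run`.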
==================================================
Python Encoding Fixer v2.0
==================================================
Checking system requirements...
✓ All required modules found.
✓ Python 3.9.7 is compatible.
Multi-platform encoding repair started...
Directory: ./website
Dry-Run Mode: False
Checking: contact.php
→ cafÃ© → café (3 times, French)
→ JosÃ© → José (1 time, Spanish)
→ fÃ¼r → für (2 times, German)
✓ File repaired (6 corrections)
Checking: product_descriptions.html
→ â‚¬ → € (12 times, Euro symbol)
→ donâ€™t → don't (4 times, Typography)
✓ File repaired (16 corrections)
=== SUMMARY ===
Files checked: 47
Files changed: 18
Total corrections: 127
Languages detected: German, French, Spanish
Backups created as .backup files.
All encoding issues resolved! ✓
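Per-pattern counts like those in the report above can be produced with a plain lookup table. This sketch assumes a tiny hard-coded table. The real patterns presumably live in patterns/*.json, whose schema this README does not show, so `PATTERNS` and `fix_text` are hypothetical names.

```python
# Hypothetical pattern table: mojibake sequence -> intended character.
PATTERNS = {
    "Ã©": "é",
    "Ã±": "ñ",
    "Ã¼": "ü",
    "â‚¬": "€",
    "â€™": "'",
}


def fix_text(text, patterns=PATTERNS):
    """Replace known mojibake sequences and count each replacement."""
    counts = {}
    # Longest patterns first so a short pattern never splits a longer one.
    for bad in sorted(patterns, key=len, reverse=True):
        hits = text.count(bad)
        if hits:
            counts[bad] = hits
            text = text.replace(bad, patterns[bad])
    return text, counts


fixed, counts = fix_text("CafÃ© JosÃ©, fÃ¼r â‚¬50")
for bad, hits in counts.items():
    print("  → {} → {} ({} times)".format(bad, PATTERNS[bad], hits))
print("✓ {} corrections: {}".format(sum(counts.values()), fixed))
```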
Always create backups first! The script automatically creates .backup files, but you should also back up your entire project.
# Single file
cp file.php.backup file.php
# All files (Linux/Mac)
for backup in *.backup; do cp "$backup" "${backup%.backup}"; done
# Windows (cmd.exe; use %%f instead of %f inside a batch file)
for %f in (*.backup) do copy /Y "%f" "%~nf"
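A cross-platform alternative to the shell loops is a few lines of Python. This is a sketch and not part of the tool: `restore_backups` is a hypothetical helper, and it assumes each backup sits next to its original with `.backup` appended to the full file name. Unlike the loops above, it also descends into subdirectories.

```python
import shutil
from pathlib import Path


def restore_backups(root=".", dry_run=True):
    """Copy every *.backup file under root over its original; return the targets."""
    restored = []
    for backup in sorted(Path(root).rglob("*.backup")):
        if not backup.is_file():
            continue
        original = backup.with_suffix("")  # file.php.backup -> file.php
        if not dry_run:
            shutil.copy2(str(backup), str(original))  # copy2 also keeps timestamps
        restored.append(original)
    return restored
```

Calling `restore_backups("/path/to/project")` only lists what would be restored; pass `dry_run=False` to actually overwrite the files.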
- Legacy Website Migration: Fix encoding issues left behind by old CMS platforms
- Database Export Cleanup: Repair corrupted text in SQL dumps
- Multilingual Sites: Clean up encoding problems from mixed hosting environments
- Content Management: Fix encoding issues in WordPress, Drupal, etc.
- API Data Processing: Clean up text data from various sources
- UTF-8 (with/without BOM)
- Windows-1252, also known as CP1252 (Western European)
- ISO-8859-1 (Latin-1)
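One common way to read files in these input encodings is to try strict UTF-8 first and fall back to the single-byte code pages. The sketch below illustrates that idea. It is an assumption about approach, not a description of how fix_encoding.py detects encodings, and `decode_bytes` is a hypothetical name.

```python
def decode_bytes(data):
    """Return (text, encoding_name) using a UTF-8-first fallback chain."""
    # utf-8-sig decodes plain UTF-8 too and strips a leading BOM if present.
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte to a character, so this step never fails.
    return data.decode("latin-1"), "latin-1"
```

A file that already contains mojibake is still valid UTF-8, so decoding alone does not reveal the damage. Spotting sequences such as `Ã©` is a separate step.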
- .php - PHP files
- .html, .htm - HTML files
- .css - Stylesheets
- .js - JavaScript files
- .xml - XML files
- .json - JSON files
- Always UTF-8 without BOM
- Preserves file structure and permissions
- Creates .backup files for safety
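These output guarantees can be met with standard-library calls alone. The following is a minimal sketch, assuming the backup is written next to the original; `write_fixed` is a hypothetical helper rather than the script's real function.

```python
import shutil
from pathlib import Path


def write_fixed(path, text):
    """Back up path, then overwrite it with text as UTF-8 without a BOM."""
    path = Path(path)
    backup = path.with_name(path.name + ".backup")
    # copy2 copies the content plus permission bits and timestamps.
    shutil.copy2(str(path), str(backup))
    # Opening the existing file for writing truncates it in place, so its
    # permission bits are kept. The "utf-8" codec never emits a BOM, and
    # newline="" writes line endings exactly as they appear in text.
    with open(str(path), "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
    return backup
```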
🚨 USE AT YOUR OWN RISK - NO WARRANTY PROVIDED
This tool performs automated text manipulation which can have unexpected results. While extensively tested, encoding corruption patterns can be complex and context-dependent. The script creates automatic backups, but you should maintain your own backup strategy.
📋 Legal Disclaimer - No Warranty:
By using this software, you acknowledge that:
- You use it entirely at your own risk
- No warranty or guarantee is provided
- You are responsible for data backup and verification
- The developers are not liable for any data loss or corruption
- This software is provided "AS IS" without any express or implied warranties
Always follow the safety workflow:
- ✅ Back up your entire project manually
- ✅ Run with --dry-run first to preview changes
- ✅ Test on a small subset of files
- ✅ Verify results before proceeding with full dataset
- ✅ Keep backup files until you're certain results are correct
Found an encoding pattern that's not covered? Please open an issue with:
- The corrupted text example
- The expected correct text
- Context (file type, source system, language)
Pull requests welcome for:
- Additional language patterns
- Performance improvements
- Cross-platform compatibility enhancements
MIT License - see LICENSE file for details.
Repo Name: python-encoding-fixer
Alternative Names:
multi-encoding-fixer
utf8-corruption-repair
text-encoding-cleaner
GitHub Description: "Python tool for automatic character encoding repair. Fixes corrupted UTF-8 text (cafÃ© → café, fÃ¼r → für) across multiple languages. Zero dependencies, cross-platform. Use at your own risk - always back up first!"
This tool addresses the universal challenge of character encoding corruption in multilingual text processing. Built with Python best practices for reliability and cross-platform compatibility.