Python encoding fixer for multiple languages. Repairs corrupted UTF-8 (cafÃ© → café, fÃ¼r → für, JosÃ© → José). Supports German, French, Spanish, Portuguese. Cross-platform, zero dependencies. Always back up first!

Rigel-Computer/python-encoding-fixer


🌍 Python Encoding Fixer

Automatically detect and repair character encoding corruption in text files. Reads UTF-8, Windows-1252 (also known as CP1252), and ISO-8859-1 input, with pattern recognition for multiple languages.

🚀 Quick Start

# Preview changes (safe)
python fix_encoding.py --dry-run

# Fix all files (creates backups)
python fix_encoding.py

# Preview fixes for a specific directory (drop --dry-run to apply them)
python fix_encoding.py --path "/path/to/project" --dry-run

⚡ What It Fixes

Transforms corrupted characters back to proper UTF-8 across languages:

❌ cafÃ©        → ✅ café        (French)
❌ espaÃ±ol     → ✅ español     (Spanish)
❌ rÃ©sumÃ©     → ✅ résumé      (French)
❌ fÃ¼r         → ✅ für         (German)
❌ JosÃ©        → ✅ José        (Spanish)
❌ naÃ¯ve       → ✅ naïve       (English/French)
❌ don’t      → ✅ don't       (Typography)
❌ €50        → ✅ €50         (Euro symbol)
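The corrupted forms above are what UTF-8 bytes look like when they are misread as Windows-1252. The sketch below reproduces that corruption and reverses it with a plain encode/decode round trip. This is only an illustration of where the patterns come from, not necessarily how fix_encoding.py works internally (the README describes a pattern-based approach); the function names `corrupt` and `repair` are made up for this example.

```python
def corrupt(text: str) -> str:
    """Simulate the corruption: UTF-8 bytes misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Reverse it: re-encode as Windows-1252, then decode as UTF-8."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mojibake of this kind; leave it untouched.
        return text


for word in ("café", "für", "español", "€50"):
    broken = corrupt(word)
    print("{} -> {}".format(broken, repair(broken)))
```

Text that is already correct fails the round trip and is returned unchanged, which is why the `try`/`except` matters: without it, running the repair twice would raise an error on clean input.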

🎯 The Universal Problem

Character encoding corruption affects developers worldwide when systems mix UTF-8 with legacy encodings (Windows-1252, ISO-8859-1). This tool automatically detects and fixes the most common corruption patterns across multiple languages.

Common scenarios:

  • 🔄 Legacy system migrations
  • 🗄️ Database export/import with wrong charset
  • 📁 File uploads with encoding misdetection
  • 🌐 Mixed hosting environment transfers

✨ Features

  • 🛡️ Safe: Creates automatic backups before any changes
  • 🧪 Preview Mode: --dry-run shows changes without modifying files
  • 🔍 Multi-Encoding Detection: Handles UTF-8, Windows-1252 (CP1252), and ISO-8859-1 input
  • 🌐 Multi-Language: Built-in patterns for German, French, Spanish, and more
  • 🎯 Smart Filtering: Only processes text files (.html, .php, .css, .js, .xml, .json)
  • 🎨 Visual Feedback: Colored output shows exactly what gets fixed
  • 📦 Zero Dependencies: Uses only Python standard library
  • 🖥️ Cross-Platform: Works on Windows, macOS, and Linux

📋 Requirements

  • Python 3.6 or higher
  • Required libraries (all part of Python standard library):
    • os - Operating system functions
    • sys - System-specific parameters
    • argparse - Command-line argument parsing
    • shutil - File operations
    • pathlib - Path operations

🔍 Auto-Check Feature: The script automatically verifies all dependencies and Python version compatibility on startup. If anything is missing, you'll get a clear error message with instructions.

No pip installs, no external dependencies, no hassle!
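A startup check of this kind could look roughly like the following. This is a minimal sketch, assuming the check simply compares `sys.version_info` against the minimum version and tries to import each listed module; `check_requirements` and the constants are hypothetical names, not taken from the script.

```python
import importlib
import sys

REQUIRED_MODULES = ("os", "sys", "argparse", "shutil", "pathlib")
MIN_PYTHON = (3, 6)


def check_requirements(modules=REQUIRED_MODULES, min_version=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    if sys.version_info < min_version:
        found = ".".join(map(str, sys.version_info[:3]))
        wanted = ".".join(map(str, min_version))
        problems.append("Python {}+ required, found {}".format(wanted, found))
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            problems.append("Missing module: {}".format(name))
    return problems


for issue in check_requirements():
    print("✗ " + issue)
```

The sketch uses `str.format` rather than f-strings so that the check itself can still be parsed by interpreters older than 3.6 and report the problem instead of failing with a syntax error.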

📁 Repository Structure

python-encoding-fixer/
├── fix_encoding.py           # Main script
├── README.md                 # This file
├── examples/
│   ├── corrupted/           # Sample files with encoding issues
│   └── fixed/               # Expected results after processing
├── patterns/
│   ├── languages.json      # Language-specific corruption patterns
│   └── common.json          # Universal patterns
└── docs/
    └── encoding-guide.md    # Technical background

🔧 Usage Examples

Basic Usage

# Check what would be fixed
python fix_encoding.py --dry-run

# Fix files in current directory
python fix_encoding.py

Advanced Usage

# Fix specific project directory
python fix_encoding.py --path "/var/www/multilingual-site" --dry-run
python fix_encoding.py --path "/var/www/multilingual-site"

# Process only specific file types
python fix_encoding.py --extensions .html,.php,.css
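For readers curious how the three documented flags (`--path`, `--dry-run`, `--extensions`) fit together, here is one way they could be parsed with `argparse`. It is a sketch under assumptions: the defaults, the help texts, and the helper `parse_extensions` are illustrative and may differ from the real script.

```python
import argparse


def parse_extensions(value):
    """Turn '.html,.php' into {'.html', '.php'}, tolerating a missing dot."""
    exts = set()
    for item in value.split(","):
        item = item.strip().lower()
        if item:
            exts.add(item if item.startswith(".") else "." + item)
    return exts


def build_parser():
    parser = argparse.ArgumentParser(description="Repair encoding corruption in text files.")
    parser.add_argument("--path", default=".",
                        help="directory to scan (assumed default: current directory)")
    parser.add_argument("--dry-run", action="store_true",
                        help="show changes without modifying files")
    parser.add_argument("--extensions", type=parse_extensions,
                        default={".php", ".html", ".htm", ".css", ".js", ".xml", ".json"},
                        help="comma-separated list of file extensions to process")
    return parser


args = build_parser().parse_args(
    ["--path", "/var/www/multilingual-site", "--dry-run", "--extensions", ".html,.php,.css"]
)
print(args.path, args.dry_run, sorted(args.extensions))
```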

Sample Output

==================================================
     Python Encoding Fixer v2.0
==================================================

Checking system requirements...
✓ All required modules found.
✓ Python 3.9.7 is compatible.

Multi-platform encoding repair started...
Directory: ./website
Dry-Run Mode: False

Checking: contact.php
  → cafÃ© → café (3 times, French)
  → JosÃ© → José (1 time, Spanish)
  → fÃ¼r → für (2 times, German)
  ✓ File repaired (6 corrections)

Checking: product_descriptions.html
  → € → € (12 times, Euro symbol)
  → don’t → don't (4 times, Typography)
  ✓ File repaired (16 corrections)

=== SUMMARY ===
Files checked: 47
Files changed: 18
Total corrections: 127
Languages detected: German, French, Spanish

Backups created as .backup files.
All encoding issues resolved! ✓
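The per-file correction counts in the sample output suggest a table-driven replacement. A minimal sketch of that idea follows; the `PATTERNS` table, its labels, and `apply_patterns` are hypothetical and do not reflect the actual contents of patterns/languages.json or patterns/common.json.

```python
# Hypothetical pattern table: corrupted sequence -> (replacement, label).
PATTERNS = {
    "Ã©": ("é", "French"),
    "Ã¼": ("ü", "German"),
    "Ã±": ("ñ", "Spanish"),
    "â‚¬": ("€", "Euro symbol"),
}


def apply_patterns(text, patterns=PATTERNS):
    """Replace every known corrupted sequence and report how often each occurred."""
    report = []
    for bad, (good, label) in patterns.items():
        count = text.count(bad)
        if count:
            text = text.replace(bad, good)
            report.append((bad, good, count, label))
    return text, report


fixed, report = apply_patterns("cafÃ© fÃ¼r JosÃ©: 5 â‚¬")
print(fixed)
for bad, good, count, label in report:
    print("  → {} → {} ({} times, {})".format(bad, good, count, label))
```

Note that a sequence such as Ã© is not specific to one language, so a label attached to a byte sequence is only a rough hint; a real tool would need word-level context to attribute a fix to French rather than Spanish.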

🛡️ Safety & Recovery

Always create backups first! The script automatically creates .backup files, but you should also back up your entire project.

Restore from backups:

# Single file
cp file.php.backup file.php

# All files (Linux/Mac)
for backup in *.backup; do cp "$backup" "${backup%.backup}"; done

# All files (Windows Command Prompt; use %%f instead of %f inside a batch file)
for %f in (*.backup) do copy "%f" "%~nf"
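The shell commands above only cover the current directory and differ per platform. A portable alternative can be sketched in Python with the standard library; `restore_backups` is a hypothetical helper, not part of the tool, and it assumes backups are named by appending .backup to the original file name, as the sample output indicates.

```python
import shutil
from pathlib import Path


def restore_backups(root=".", suffix=".backup"):
    """Copy every '<name>.backup' under root back over '<name>'; return restored paths."""
    restored = []
    for backup in sorted(Path(root).rglob("*" + suffix)):
        if not backup.is_file():
            continue
        original = backup.with_name(backup.name[: -len(suffix)])
        # copy2 copies the contents and also preserves timestamps and mode bits.
        shutil.copy2(str(backup), str(original))
        restored.append(original)
    return restored
```

Calling `restore_backups("/path/to/project")` walks the directory tree recursively, so review the returned list before deleting any .backup files.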

🎯 Common Use Cases

  • Legacy Website Migration: Fix encoding issues from old CMS systems
  • Database Export Cleanup: Repair corrupted text in SQL dumps
  • Multilingual Sites: Clean up encoding problems from mixed hosting environments
  • Content Management: Fix encoding issues in WordPress, Drupal, etc.
  • API Data Processing: Clean up text data from various sources

🔍 Technical Details

Supported Input Encodings

  • UTF-8 (with/without BOM)
  • Windows-1252, also known as CP1252 (Windows Western European)
  • ISO-8859-1 (Latin-1)
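One common way to handle several input encodings is to try them from strictest to most permissive. The sketch below assumes that ordering; it is not confirmed to be the detection logic used by fix_encoding.py, and `decode_bytes` is a hypothetical name.

```python
def decode_bytes(raw):
    """Return (text, encoding_name) using the first encoding that decodes cleanly."""
    # utf-8-sig accepts UTF-8 with or without a BOM and strips the BOM if present.
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte to a character, so this last step cannot fail.
    return raw.decode("iso-8859-1"), "iso-8859-1"


print(decode_bytes(b"caf\xc3\xa9"))
print(decode_bytes(b"caf\xe9"))
```

UTF-8 goes first because random legacy bytes rarely form valid UTF-8, whereas almost any byte string decodes as Windows-1252 or ISO-8859-1 without error.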

File Types Processed

  • .php - PHP files
  • .html, .htm - HTML files
  • .css - Stylesheets
  • .js - JavaScript files
  • .xml - XML files
  • .json - JSON files
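Selecting only these file types can be sketched with `pathlib`. The names `TEXT_EXTENSIONS` and `find_text_files` are illustrative, and the case-insensitive suffix match is an assumption about the desired behavior.

```python
from pathlib import Path

TEXT_EXTENSIONS = {".php", ".html", ".htm", ".css", ".js", ".xml", ".json"}


def find_text_files(root, extensions=TEXT_EXTENSIONS):
    """Yield files under root whose suffix (case-insensitive) is in extensions."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            yield path
```

Because `Path.suffix` returns only the last extension, a file such as contact.php.backup has the suffix .backup and is skipped, so backups are never processed a second time.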

Output

  • Always UTF-8 without BOM
  • Preserves file structure and permissions
  • Creates .backup files for safety
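The three output guarantees can be combined in a small write routine. This is a sketch, assuming the backup is a sibling file with .backup appended; `write_repaired` is a hypothetical name and the real script may order these steps differently.

```python
import shutil
from pathlib import Path


def write_repaired(path, text, dry_run=False):
    """Back up path to '<name>.backup', then rewrite it as UTF-8 without a BOM."""
    path = Path(path)
    if dry_run:
        return None  # preview mode: leave the file untouched
    backup = path.with_name(path.name + ".backup")
    shutil.copy2(str(path), str(backup))  # copies contents, mode bits and timestamps
    # Rewriting an existing file in place keeps its permission bits, and the
    # plain "utf-8" codec never emits a BOM. newline="" disables newline
    # translation so the line endings present in `text` are written as-is.
    with open(str(path), "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
    return backup
```

For line endings to survive unchanged, the file would also have to be read without newline translation, for example from bytes as in a decode-with-fallback step.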

⚠️ Important Warnings & Disclaimers

🚨 USE AT YOUR OWN RISK - NO WARRANTY PROVIDED

⚠️ ALWAYS CREATE BACKUPS BEFORE RUNNING THE SCRIPT

⚠️ TEST WITH --dry-run FIRST TO PREVIEW CHANGES

⚠️ VERIFY RESULTS THOROUGHLY BEFORE DELETING BACKUP FILES

This tool performs automated text manipulation which can have unexpected results. While extensively tested, encoding corruption patterns can be complex and context-dependent. The script creates automatic backups, but you should maintain your own backup strategy.

📋 Legal Disclaimer - No Warranty:

By using this software, you acknowledge that:

  • You use it entirely at your own risk
  • No warranty or guarantee is provided
  • You are responsible for data backup and verification
  • The developers are not liable for any data loss or corruption
  • This software is provided "AS IS" without any express or implied warranties

Always follow the safety workflow:

  1. ✅ Backup your entire project manually
  2. ✅ Run with --dry-run first to preview changes
  3. ✅ Test on a small subset of files
  4. ✅ Verify results before proceeding with full dataset
  5. ✅ Keep backup files until you're certain results are correct

🤝 Contributing

Found an encoding pattern that's not covered? Please open an issue with:

  • The corrupted text example
  • The expected correct text
  • Context (file type, source system, language)

Pull requests welcome for:

  • Additional language patterns
  • Performance improvements
  • Cross-platform compatibility enhancements

📄 License

MIT License - see LICENSE file for details.

🏷️ Repository Name Suggestion

Repo Name: python-encoding-fixer

Alternative Names:

  • multi-encoding-fixer
  • utf8-corruption-repair
  • text-encoding-cleaner

GitHub Description: "Python tool for automatic character encoding repair. Fixes corrupted UTF-8 text (cafÃ© → café, fÃ¼r → für) across multiple languages. Zero dependencies, cross-platform. Use at your own risk - always back up first!"


This tool addresses the universal challenge of character encoding corruption in multilingual text processing. Built with Python best practices for reliability and cross-platform compatibility.
