
Update chardet to 7.1.0 #500

Open
pyup-bot wants to merge 1 commit into master from pyup-update-chardet-3.0.4-to-7.1.0
Conversation

@pyup-bot
Collaborator

This PR updates chardet from 3.0.4 to 7.1.0.

Changelog

7.1.0
-------------------

**Features:**

- Added PEP 263 encoding declaration detection — ``-*- coding: ... -*-``
and ``coding=...`` declarations on lines 1–2 of Python source files are
now recognized with confidence 0.95 (`249
<https://github.com/chardet/chardet/issues/249>`_)
- Added ``chardet.universaldetector`` backward-compatibility stub so that
``from chardet.universaldetector import UniversalDetector`` works with a
deprecation warning (`341
<https://github.com/chardet/chardet/issues/341>`_)

**Fixes:**

- Fixed false UTF-7 detection of ASCII text containing ``++`` or ``+word``
patterns (`332 <https://github.com/chardet/chardet/issues/332>`_)
- Fixed 0.5s startup cost on first ``detect()`` call — model norms are now
computed during loading instead of lazily iterating 21M entries (`333
<https://github.com/chardet/chardet/issues/333>`_)
- Fixed undocumented encoding name changes between chardet 5.x and 7.0 —
``detect()`` now returns chardet 5.x-compatible names by default (`338
<https://github.com/chardet/chardet/issues/338>`_)
- Improved ISO-2022-JP family detection — recognizes ESC sequences for
ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
- Fixed silent truncation of corrupt model data (``iter_unpack`` yielded
fewer tuples instead of raising)
- Fixed incorrect date in LICENSE

**Performance:**

- 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model
norms as a side-product of ``load_models()``
- ~40% faster model parsing via ``struct.iter_unpack`` for bulk entry
extraction (eliminates ~305K individual ``unpack`` calls)
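The bulk-parsing change is a standard-library technique. A minimal sketch, using a hypothetical 4-byte record layout (the real model format is not shown in this changelog):

```python
import struct

# Hypothetical packed record: two little-endian 16-bit fields per entry.
records = [(i, i * 3 % 65536) for i in range(1000)]
blob = b"".join(struct.pack("<HH", a, b) for a, b in records)

# One-call-per-record parsing: N separate struct.unpack invocations.
slow = [struct.unpack("<HH", blob[i:i + 4]) for i in range(0, len(blob), 4)]

# Bulk parsing: a single iter_unpack call walks the whole buffer, and it
# raises struct.error if the buffer is not a multiple of the record size.
fast = list(struct.iter_unpack("<HH", blob))

assert slow == fast
```

The speedup comes from moving the per-record loop from Python into C.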

**New API parameters:**

- Added ``compat_names`` parameter (default ``True``) to
:func:`~chardet.detect`, :func:`~chardet.detect_all`, and
:class:`~chardet.UniversalDetector` — set to ``False`` to get raw Python
codec names instead of chardet 5.x/6.x compatible display names
- Added ``prefer_superset`` parameter (default ``False``) — remaps legacy
ISO/subset encodings to their modern Windows/CP superset equivalents
(e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252).
**This will default to ``True`` in the next major version (8.0).**
- Deprecated ``should_rename_legacy`` in favor of ``prefer_superset`` —
a deprecation warning is emitted when used
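How the two flags above might interact can be sketched as follows. The mapping entries here are purely illustrative; the real tables live in chardet's usage docs, not in this changelog:

```python
# Hypothetical display-name table (compat_names=True output style).
_COMPAT_NAMES = {
    "windows-1252": "Windows-1252",
    "shift_jis": "SHIFT_JIS",
}

# Hypothetical subset-to-superset remap (prefer_superset=True).
_SUPERSETS = {
    "ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
}

def display_name(codec_name: str,
                 compat_names: bool = True,
                 prefer_superset: bool = False) -> str:
    """Map a raw Python codec name to the reported encoding name."""
    if prefer_superset:
        codec_name = _SUPERSETS.get(codec_name, codec_name)
    if compat_names:
        return _COMPAT_NAMES.get(codec_name, codec_name)
    return codec_name

print(display_name("windows-1252"))                      # Windows-1252
print(display_name("windows-1252", compat_names=False))  # windows-1252
print(display_name("ascii", prefer_superset=True))       # Windows-1252
```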

**Improvements:**

- Switched internal canonical encoding names to Python codec names
(e.g., ``"utf-8"`` instead of ``"UTF-8"``), with ``compat_names``
controlling the public output format.  See :doc:`usage` for the full
mapping table.
- Added ``lookup_encoding()`` to ``registry`` for case-insensitive
resolution of arbitrary encoding name input to canonical names
- Achieved 100% line coverage across all source modules (+31 tests)
- Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language
accuracy on 2,510 test files
- Pinned test-data cloning to chardet release version tags for
reproducible builds

7.0.1
-------------------

**Fixes:**

- Fixed false UTF-7 detection of SHA-1 git hashes (`324
<https://github.com/chardet/chardet/issues/324>`_)
- Fixed ``_SINGLE_LANG_MAP`` missing aliases for single-language encoding
lookup (e.g., ``big5`` → ``big5hkscs``)
- Fixed PyPy ``TypeError`` in UTF-7 codec handling

**Improvements:**

- Retrained bigram models — 24 previously failing test cases now pass
- Updated language equivalences for mutual intelligibility (Slovak/Czech,
East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages)

7.0.0
-------------------

Ground-up, MIT-licensed rewrite of chardet. Same package name, same
public API — drop-in replacement for chardet 5.x/6.x.

**Highlights:**

- **MIT license** (previous versions were LGPL)
- **96.8% accuracy** on 2,179 test files (+2.3pp vs chardet 6.0.0,
+7.7pp vs charset-normalizer)
- **41x faster** than chardet 6.0.0 with mypyc (**28x** pure Python),
**7.5x faster** than charset-normalizer
- **Language detection** for every result (90.5% accuracy across 49
languages)
- **99 encodings** across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC,
LEGACY_REGIONAL, DOS, MAINFRAME)
- **12-stage detection pipeline** — BOM, UTF-16/32 patterns, escape
sequences, binary detection, markup charset, ASCII, UTF-8 validation,
byte validity, CJK gating, structural probing, statistical scoring,
post-processing
- **Bigram frequency models** trained on CulturaX multilingual corpus
data for all supported language/encoding pairs
- **Optional mypyc compilation** — 1.49x additional speedup on CPython
- **Thread-safe** ``detect()`` and ``detect_all()`` with no measurable
overhead; scales on free-threaded Python 3.13t+
- **Negligible import memory** (96 B)
- **Zero runtime dependencies**

6.0.0
-------------------

**Features:**

- Unified single-byte charset detection with proper language-specific
bigram models for all single-byte encodings (replaces ``Latin1Prober``
and ``MacRomanProber`` heuristics)
- 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish,
Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German,
Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian,
Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian,
Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik,
Ukrainian, Vietnamese, Welsh
- ``EncodingEra`` filtering via new ``encoding_era`` parameter
- ``max_bytes`` and ``chunk_size`` parameters for ``detect()``,
``detect_all()``, and ``UniversalDetector``
- ``-e``/``--encoding-era`` CLI flag
- EBCDIC detection (CP037, CP500)
- Direct GB18030 support (replaces redundant GB2312 prober)
- Binary file detection
- Python 3.12, 3.13, and 3.14 support
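Era-based candidate filtering can be sketched as below; the era names come from this changelog, but the membership table and API shape are hypothetical:

```python
from enum import Flag, auto

class EncodingEra(Flag):
    """Eras can be combined with | to widen the candidate set."""
    MODERN_WEB = auto()
    LEGACY_ISO = auto()
    LEGACY_MAC = auto()
    LEGACY_REGIONAL = auto()
    DOS = auto()
    MAINFRAME = auto()

# Illustrative membership only; the real assignments are chardet's.
_ERA_OF = {
    "utf-8": EncodingEra.MODERN_WEB,
    "windows-1252": EncodingEra.MODERN_WEB,
    "iso-8859-5": EncodingEra.LEGACY_ISO,
    "cp437": EncodingEra.DOS,
    "cp037": EncodingEra.MAINFRAME,
}

def candidates(era: EncodingEra) -> list[str]:
    """Keep only encodings whose era is enabled in the filter."""
    return [name for name, e in _ERA_OF.items() if e & era]

print(candidates(EncodingEra.MODERN_WEB | EncodingEra.DOS))
# ['utf-8', 'windows-1252', 'cp437']
```

Defaulting ``detect()`` to ``MODERN_WEB`` shrinks the search space, which is part of why a narrower default can also be faster.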

**Breaking changes:**

- Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)
- Removed ``Latin1Prober`` and ``MacRomanProber``
- Removed EUC-TW support
- Removed ``LanguageFilter.NONE``
- ``detect()`` default changed to ``encoding_era=EncodingEra.MODERN_WEB``

**Fixes:**

- Fixed CP949 state machine
- Fixed SJIS distribution analysis (second-byte range >= 0x80)
- Fixed UTF-16/32 detection for non-ASCII-heavy text
- Fixed GB18030 ``char_len_table``
- Fixed UTF-8 state machine
- Fixed ``detect_all()`` returning inactive probers
- Fixed early cutoff bug

5.2.0
-------------------

- Added support for running the CLI via ``python -m chardet``

5.1.0
-------------------

- Added ``should_rename_legacy`` argument to remap legacy encoding names
to modern equivalents
- Added MacRoman encoding prober
- Added ``--minimal`` flag to ``chardetect`` CLI
- Added type annotations and mypy CI
- Added support for Python 3.11
- Removed support for Python 3.6

5.0.0
-------------------

- Added Johab Korean prober
- Added UTF-16/32 BE/LE probers
- Added test data for Croatian, Czech, Hungarian, Polish, Slovak,
Slovene, Greek, Turkish
- Improved XML tag filtering
- Made ``detect_all`` return child prober confidences
- Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)

4.0.0
-------------------

- Added ``detect_all()`` function returning all candidate encodings
- Converted single-byte charset probers to nested dicts (performance)
- ``CharsetGroupProber`` now short-circuits on definite matches
(performance)
- Added ``language`` field to ``detect_all`` output
- Dropped Python 2.6, 3.4, 3.5

@pyup-bot mentioned this pull request Mar 11, 2026