
Update chardet to 7.1.0 #500

Open
pyup-bot wants to merge 1 commit into master from pyup-update-chardet-3.0.4-to-7.1.0
Conversation

@pyup-bot
Collaborator

This PR updates chardet from 3.0.4 to 7.1.0.

Changelog

7.1.0
-------------------

**Features:**

- Added PEP 263 encoding declaration detection — ``-*- coding: ... -*-``
and ``coding=...`` declarations on lines 1–2 of Python source files are
now recognized with confidence 0.95 (`249
<https://github.com/chardet/chardet/issues/249>`_)
- Added ``chardet.universaldetector`` backward-compatibility stub so that
``from chardet.universaldetector import UniversalDetector`` works with a
deprecation warning (`341
<https://github.com/chardet/chardet/issues/341>`_)

**Fixes:**

- Fixed false UTF-7 detection of ASCII text containing ``++`` or ``+word``
patterns (`332 <https://github.com/chardet/chardet/issues/332>`_)
- Fixed 0.5s startup cost on first ``detect()`` call — model norms are now
computed during loading instead of lazily iterating 21M entries (`333
<https://github.com/chardet/chardet/issues/333>`_)
- Fixed undocumented encoding name changes between chardet 5.x and 7.0 —
``detect()`` now returns chardet 5.x-compatible names by default (`338
<https://github.com/chardet/chardet/issues/338>`_)
- Improved ISO-2022-JP family detection — recognizes ESC sequences for
ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
- Fixed silent truncation of corrupt model data (``iter_unpack`` yielded
fewer tuples instead of raising)
- Fixed incorrect date in LICENSE

**Performance:**

- 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model
norms as a side-product of ``load_models()``
- ~40% faster model parsing via ``struct.iter_unpack`` for bulk entry
extraction (eliminates ~305K individual ``unpack`` calls)
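The bulk-parsing change is a standard-library technique. A minimal sketch, using a hypothetical 4-byte record layout (the real model format is not shown in this changelog):

```python
import struct

# Hypothetical packed record: two little-endian 16-bit fields per entry.
records = [(i, i * 3 % 65536) for i in range(1000)]
blob = b"".join(struct.pack("<HH", a, b) for a, b in records)

# One-call-per-record parsing: N separate struct.unpack invocations.
slow = [struct.unpack("<HH", blob[i:i + 4]) for i in range(0, len(blob), 4)]

# Bulk parsing: a single iter_unpack call walks the whole buffer, and it
# raises struct.error if the buffer is not a multiple of the record size.
fast = list(struct.iter_unpack("<HH", blob))

assert slow == fast
```

The speedup comes from moving the per-record loop from Python into C.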

**New API parameters:**

- Added ``compat_names`` parameter (default ``True``) to
:func:`~chardet.detect`, :func:`~chardet.detect_all`, and
:class:`~chardet.UniversalDetector` — set to ``False`` to get raw Python
codec names instead of chardet 5.x/6.x compatible display names
- Added ``prefer_superset`` parameter (default ``False``) — remaps legacy
ISO/subset encodings to their modern Windows/CP superset equivalents
(e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252).
**This will default to ``True`` in the next major version (8.0).**
- Deprecated ``should_rename_legacy`` in favor of ``prefer_superset`` —
a deprecation warning is emitted when used
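How the two flags above might interact can be sketched as follows. The mapping entries here are purely illustrative; the real tables live in chardet's usage docs, not in this changelog:

```python
# Hypothetical display-name table (compat_names=True output style).
_COMPAT_NAMES = {
    "windows-1252": "Windows-1252",
    "shift_jis": "SHIFT_JIS",
}

# Hypothetical subset-to-superset remap (prefer_superset=True).
_SUPERSETS = {
    "ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
}

def display_name(codec_name: str,
                 compat_names: bool = True,
                 prefer_superset: bool = False) -> str:
    """Map a raw Python codec name to the reported encoding name."""
    if prefer_superset:
        codec_name = _SUPERSETS.get(codec_name, codec_name)
    if compat_names:
        return _COMPAT_NAMES.get(codec_name, codec_name)
    return codec_name

print(display_name("windows-1252"))                      # Windows-1252
print(display_name("windows-1252", compat_names=False))  # windows-1252
print(display_name("ascii", prefer_superset=True))       # Windows-1252
```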

**Improvements:**

- Switched internal canonical encoding names to Python codec names
(e.g., ``"utf-8"`` instead of ``"UTF-8"``), with ``compat_names``
controlling the public output format.  See :doc:`usage` for the full
mapping table.
- Added ``lookup_encoding()`` to ``registry`` for case-insensitive
resolution of arbitrary encoding name input to canonical names
- Achieved 100% line coverage across all source modules (+31 tests)
- Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language
accuracy on 2,510 test files
- Pinned test-data cloning to chardet release version tags for
reproducible builds

7.0.1
-------------------

**Fixes:**

- Fixed false UTF-7 detection of SHA-1 git hashes (`324
<https://github.com/chardet/chardet/issues/324>`_)
- Fixed ``_SINGLE_LANG_MAP`` missing aliases for single-language encoding
lookup (e.g., ``big5`` → ``big5hkscs``)
- Fixed PyPy ``TypeError`` in UTF-7 codec handling

**Improvements:**

- Retrained bigram models — 24 previously failing test cases now pass
- Updated language equivalences for mutual intelligibility (Slovak/Czech,
East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages)

7.0.0
-------------------

Ground-up, MIT-licensed rewrite of chardet. Same package name, same
public API — drop-in replacement for chardet 5.x/6.x.

**Highlights:**

- **MIT license** (previous versions were LGPL)
- **96.8% accuracy** on 2,179 test files (+2.3pp vs chardet 6.0.0,
+7.7pp vs charset-normalizer)
- **41x faster** than chardet 6.0.0 with mypyc (**28x** pure Python),
**7.5x faster** than charset-normalizer
- **Language detection** for every result (90.5% accuracy across 49
languages)
- **99 encodings** across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC,
LEGACY_REGIONAL, DOS, MAINFRAME)
- **12-stage detection pipeline** — BOM, UTF-16/32 patterns, escape
sequences, binary detection, markup charset, ASCII, UTF-8 validation,
byte validity, CJK gating, structural probing, statistical scoring,
post-processing
- **Bigram frequency models** trained on CulturaX multilingual corpus
data for all supported language/encoding pairs
- **Optional mypyc compilation** — 1.49x additional speedup on CPython
- **Thread-safe** ``detect()`` and ``detect_all()`` with no measurable
overhead; scales on free-threaded Python 3.13t+
- **Negligible import memory** (96 B)
- **Zero runtime dependencies**

6.0.0
-------------------

**Features:**

- Unified single-byte charset detection with proper language-specific
bigram models for all single-byte encodings (replaces ``Latin1Prober``
and ``MacRomanProber`` heuristics)
- 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish,
Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German,
Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian,
Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian,
Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik,
Ukrainian, Vietnamese, Welsh
- ``EncodingEra`` filtering via new ``encoding_era`` parameter
- ``max_bytes`` and ``chunk_size`` parameters for ``detect()``,
``detect_all()``, and ``UniversalDetector``
- ``-e``/``--encoding-era`` CLI flag
- EBCDIC detection (CP037, CP500)
- Direct GB18030 support (replaces redundant GB2312 prober)
- Binary file detection
- Python 3.12, 3.13, and 3.14 support
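Era-based candidate filtering can be sketched as below; the era names come from this changelog, but the membership table and API shape are hypothetical:

```python
from enum import Flag, auto

class EncodingEra(Flag):
    """Eras can be combined with | to widen the candidate set."""
    MODERN_WEB = auto()
    LEGACY_ISO = auto()
    LEGACY_MAC = auto()
    LEGACY_REGIONAL = auto()
    DOS = auto()
    MAINFRAME = auto()

# Illustrative membership only; the real assignments are chardet's.
_ERA_OF = {
    "utf-8": EncodingEra.MODERN_WEB,
    "windows-1252": EncodingEra.MODERN_WEB,
    "iso-8859-5": EncodingEra.LEGACY_ISO,
    "cp437": EncodingEra.DOS,
    "cp037": EncodingEra.MAINFRAME,
}

def candidates(era: EncodingEra) -> list[str]:
    """Keep only encodings whose era is enabled in the filter."""
    return [name for name, e in _ERA_OF.items() if e & era]

print(candidates(EncodingEra.MODERN_WEB | EncodingEra.DOS))
# ['utf-8', 'windows-1252', 'cp437']
```

Defaulting ``detect()`` to ``MODERN_WEB`` shrinks the search space, which is part of why a narrower default can also be faster.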

**Breaking changes:**

- Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)
- Removed ``Latin1Prober`` and ``MacRomanProber``
- Removed EUC-TW support
- Removed ``LanguageFilter.NONE``
- ``detect()`` default changed to ``encoding_era=EncodingEra.MODERN_WEB``

**Fixes:**

- Fixed CP949 state machine
- Fixed SJIS distribution analysis (second-byte range >= 0x80)
- Fixed UTF-16/32 detection for non-ASCII-heavy text
- Fixed GB18030 ``char_len_table``
- Fixed UTF-8 state machine
- Fixed ``detect_all()`` returning inactive probers
- Fixed early cutoff bug

5.2.0
-------------------

- Added support for running the CLI via ``python -m chardet``

5.1.0
-------------------

- Added ``should_rename_legacy`` argument to remap legacy encoding names
to modern equivalents
- Added MacRoman encoding prober
- Added ``--minimal`` flag to ``chardetect`` CLI
- Added type annotations and mypy CI
- Added support for Python 3.11
- Removed support for Python 3.6

5.0.0
-------------------

- Added Johab Korean prober
- Added UTF-16/32 BE/LE probers
- Added test data for Croatian, Czech, Hungarian, Polish, Slovak,
Slovene, Greek, Turkish
- Improved XML tag filtering
- Made ``detect_all`` return child prober confidences
- Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)

4.0.0
-------------------

- Added ``detect_all()`` function returning all candidate encodings
- Converted single-byte charset probers to nested dicts (performance)
- ``CharsetGroupProber`` now short-circuits on definite matches
(performance)
- Added ``language`` field to ``detect_all`` output
- Dropped Python 2.6, 3.4, 3.5

@pyup-bot mentioned this pull request Mar 11, 2026