An educational project demonstrating how invisible Unicode characters can be used to hide malicious code within empty variables in seemingly legitimate source files, along with defense tools to detect them.
- What is this?
- Project structure
- Learning levels
- The trick: invisible code between quotes
- Where do these attacks appear?
- Invisible Unicode techniques
- Requirements
- Defense and Detection Strategies
- Disclaimer
- License
This project simulates, in a benign and controlled manner, a technique used in real-world attacks: concealing a payload within invisible Unicode characters. The payload sits between the quotes of a string that appears to be empty; text editors, web browsers, and even GitHub show nothing there. Anyone opening the file sees normal code with completely empty quotes, but running it launches a hidden background process.
The goal is educational: understand how these attacks work so you can defend against them. Learn how the technique is used in supply chain attacks and other attack types so you can protect CI/CD pipelines, developer workstations, and production environments, and avoid running a seemingly harmless file that contains invisible code.
.
├── level_1/                    # Level 1: Zero-Width binary encoding
│   ├── README.md
│   ├── level1_python.py
│   └── level1_node.js
│
├── level_2/                    # Level 2: Variation Selectors
│   ├── README.md
│   ├── level2_python.py
│   └── level2_node.js
│
├── level_3/                    # Level 3: Full reverse shell
│   ├── README.md
│   ├── python/                 # Generator and listener in Python
│   │   ├── bait_generator.py
│   │   ├── rs_listener.py
│   │   └── Bait To Client/
│   ├── javascript/             # Generator and listener in JavaScript
│   │   ├── bait_generator.js
│   │   ├── rs_listener.js
│   │   └── Bait To Client/
│   └── csharp/                 # Generator and listener in C#
│       ├── BaitGenerator.cs
│       └── rs_listener.cs
│
└── defense/                    # Detection scanners (14 categories)
    ├── README.md
    ├── python/unicode_scanner.py
    └── javascript/unicode_scanner.js
- Each level is self-contained with its own code and README
- Each language is independent: use Python, JavaScript, or C# without needing the others
- The `defense/` folder contains scanners that detect invisible characters in any file
The project is organized into 3 progressive levels. Each level is self-contained with its own code, scripts, and README.
| Level | Technique | What you will learn | Difficulty |
|---|---|---|---|
| Level 1 | Zero-Width binary | Binary encoding with invisible characters. 2 characters (U+200B, U+200D) represent bits 0 and 1. Invisible in ALL editors. | Introductory |
| Level 2 | Variation Selectors | 256 invisible characters that map 1:1 to bytes. Needs 8x fewer characters than Level 1. Invisible in IDEs and GitHub. | Basic |
| Level 3 | Full reverse shell | Real-world application: a benign reverse shell hidden inside files that look normal. Generators, listeners, and defense. | Basic type II |
Recommendation: start with Level 1 even if you already have experience. Each level builds on the previous one, and the progression makes everything clearer.
# Run each level's commands from the repository root.
# Level 1, try the binary encoding:
cd level_1
python level1_python.py # or: node level1_node.js
# Level 2, try Variation Selectors:
cd level_2
python level2_python.py # or: node level2_node.js
# Level 3, the full attack:
cd level_3 # see level_3/README.md for the step-by-step guide

# This LOOKS like an empty string:
empty = ''
# But between those quotes there are thousands of invisible Unicode characters
# encoding a complete hidden PAYLOAD.
# No editor, IDE, browser, or even GitHub shows ANYTHING between the quotes.
# A 5-line file can weigh 15 KB because of the hidden code.

Open any file from the levels in your favorite text editor and look for the string that appears empty. You will not see anything. Run it, and you will discover what was hidden there: a message (levels 1 and 2) or a full reverse shell (level 3).
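To make the idea concrete, here is a minimal sketch of the Level 1 zero-width encoding. It is not the code shipped in `level_1/`: the function names `zw_encode` and `zw_decode` are hypothetical, and the assignment of U+200B to bit 0 and U+200D to bit 1 is an assumption based on the order in which the table lists them.

```python
ZERO = "\u200b"  # ZERO WIDTH SPACE, assumed here to encode bit 0
ONE = "\u200d"   # ZERO WIDTH JOINER, assumed here to encode bit 1


def zw_encode(message: str) -> str:
    """Encode text as invisible characters, 8 per UTF-8 byte."""
    bits = "".join(f"{byte:08b}" for byte in message.encode("utf-8"))
    return "".join(ONE if bit == "1" else ZERO for bit in bits)


def zw_decode(hidden: str) -> str:
    """Recover the text, skipping any character that is not ZERO or ONE."""
    bits = "".join("1" if ch == ONE else "0" for ch in hidden if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # drop a trailing partial byte, if any
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


empty = zw_encode("hello")
print(repr(empty == ""))  # False: the string only looks empty
print(len(empty))         # 40 (5 bytes x 8 invisible characters)
print(zw_decode(empty))   # hello
```

Checking `len()` or the byte size of a string that looks empty is the quickest way to confirm that something is hidden in it.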
This invisible-character technique can be used across multiple attack vectors. It is not just theory: real cases have been found in GitHub projects and npm/PyPI packages. Recent incidents in the npm registry, including ones affecting libraries with millions of downloads, have demonstrated the impact of such attacks.
| Vector | How it works | Real-world example |
|---|---|---|
| Supply Chain Attack | An attacker publishes a package on npm/PyPI with code hidden in invisible characters. Upon installation, the payload executes silently | Malicious npm packages that steal environment variables or tokens |
| Trojan Source (CVE-2021-42574) | Bidi characters are used to make code look different from what it actually does. An `if` appears to guard a function, but in reality it always executes | University of Cambridge research (2021) |
| Pre-install / Post-install hooks | In `package.json`, the `preinstall` or `postinstall` fields run a script automatically during `npm install`. The attacker hides the payload there | Typosquatting on npm and CI/CD pipeline attacks |
| Pull requests with hidden commits | An attacker opens a PR on an open source project. In some commit, one of the `.js`, `.py`, or `.cs` files hides the payload in "empty" strings. If the reviewer lacks detection tools, the malicious code makes it to production | Malicious commits in projects |
| Shared files | A "useful" script is sent to a colleague (hex converter, utility, emulator). The file works normally, but running it also launches a hidden background process | Exactly what this project demonstrates |
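The Trojan Source row can be checked for mechanically. The sketch below is an illustration, not part of this project's scanners: `find_bidi_controls` is a hypothetical helper that uses Python's standard `unicodedata.bidirectional()` to flag the embedding, override, and isolate controls (U+202A-202E and U+2066-2069) in a piece of source text.

```python
import unicodedata

# Bidirectional classes of the embedding, override, and isolate controls.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def find_bidi_controls(source: str):
    """Return (line, column, description) for each bidi control character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in BIDI_CONTROLS:
                label = f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"
                hits.append((line_no, col, label))
    return hits


# A right-to-left override hidden just before the closing quote.
sample = 'if access == "user\u202e":\n    pass\n'
for hit in find_bidi_controls(sample):
    print(hit)
```

Columns are counted in code points, so they may not match what an editor displays once the text has been visually reordered.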
- In a `package.json`: using lifecycle hooks like `"preinstall": "node setup.js"` or `"postinstall"`. If `setup.js` has a payload obfuscated with Unicode, the system (or the CI/CD pipeline) will blindly execute it when building the project.
- In a GitHub commit: a `.js` or `.py` file that looks like a normal utility. Inside a string that appears empty (`''` or ``` `` ```), there are thousands of invisible characters encoding a malicious payload.
- In a configuration file: a `.env` or `.yaml` looks normal but contains a hidden payload that executes when parsed by certain frameworks.
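For the `package.json` case, a quick first step is to see which scripts would run automatically on install. The following is a rough sketch under stated assumptions: `install_hooks` is a hypothetical helper, and besides the `preinstall` and `postinstall` fields named above it also reports `install`, which npm runs between the two.

```python
import json

# Scripts that npm runs automatically while installing a package.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def install_hooks(package_json_text: str) -> dict:
    """Return the install-time scripts declared in a package.json document."""
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts") or {}
    return {name: scripts[name] for name in INSTALL_HOOKS if name in scripts}


manifest_text = """
{
  "name": "example-package",
  "scripts": {
    "preinstall": "node setup.js",
    "test": "node test.js"
  }
}
"""
print(install_hooks(manifest_text))  # {'preinstall': 'node setup.js'}
```

Any file a reported hook points to (here `setup.js`) is a good candidate for an invisible-character scan before installing.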
Multiple families of invisible Unicode characters can be exploited. Each one carries a different risk level depending on its potential to conceal code:
| Technique | Unicode Range | Risk | Description |
|---|---|---|---|
| Variation Selectors | U+FE00-FE0F, U+E0100-E01EF | CRITICAL | Steganography: 256 values, 1 char = 1 byte |
| Tags Block | U+E0001-E007F | CRITICAL | Steganography: maps 1:1 to ASCII |
| Zero-Width Characters | U+200B-200D | HIGH | Binary encoding: 8 chars = 1 byte |
| Bidi Overrides | U+202A-202E, U+2066-2069 | HIGH | Trojan Source (CVE-2021-42574) |
| Bidi Marks | U+200E-200F, U+061C | MEDIUM | LTR/RTL direction marks |
| Invisible Operators | U+2060-2064 | MEDIUM | Word Joiner and invisible operators |
| Mongolian Free VS | U+180B-180D | MEDIUM | Mongolian variation selectors |
| Hangul Fillers | U+115F-1160, U+3164, U+FFA0 | MEDIUM | Empty, invisible Hangul characters |
| Line/Paragraph Separators | U+2028-2029 | MEDIUM | Break strings in JavaScript |
| Deprecated Format | U+206A-206F | LOW | Deprecated but functional formatting |
| Interlinear Annotations | U+FFF9-FFFB | LOW | Invisible annotation markers |
| Musical Formatting | U+1D173-1D17A | LOW | Invisible musical formatting |
| Shorthand Controls | U+1BCA0-1BCA3 | LOW | Invisible shorthand formatting |
| Other Invisible | U+00AD, U+034F, U+180E, U+FEFF | LOW | Soft Hyphen, CGJ, MVS, BOM |
The defense scanners in this project detect all 14 categories.
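The "1 char = 1 byte" claim for Variation Selectors can be sketched as follows. The exact mapping is an assumption, not taken from this project's code: bytes 0-15 go to U+FE00-FE0F and bytes 16-255 go to U+E0100-E01EF, which together give the 256 values in the table. All function names are hypothetical.

```python
from typing import Optional


def byte_to_vs(value: int) -> str:
    """Map one byte (0-255) to one variation selector (assumed mapping)."""
    return chr(0xFE00 + value) if value < 16 else chr(0xE0100 + value - 16)


def vs_to_byte(ch: str) -> Optional[int]:
    """Inverse of byte_to_vs; returns None for any other character."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def vs_encode(data: bytes) -> str:
    """Encode bytes as invisible characters, one per byte."""
    return "".join(byte_to_vs(b) for b in data)


def vs_decode(text: str) -> bytes:
    """Collect the bytes carried by variation selectors, ignoring other text."""
    decoded = (vs_to_byte(ch) for ch in text)
    return bytes(b for b in decoded if b is not None)


hidden = vs_encode(b"hello")
print(len(hidden))        # 5 characters for 5 bytes
print(vs_decode(hidden))  # b'hello'
```

Compared with the zero-width encoding, which spends 8 characters per byte, the same payload needs one eighth as many characters here.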
You need only one of the following languages to use the project:
| Language | Minimum version |
|---|---|
| Python | 3.6+ |
| Node.js | 14+ |
| C# (.NET) | 6+ |
You do not need all three installed. Use whichever you prefer.
No single method protects against all vectors. Each strategy covers different scenarios:
This project includes scanners that detect invisible characters in any source code file. This is the most direct defense against this technique, regardless of how the file arrived (PR, package, shared file, etc.):
# Python:
python defense/python/unicode_scanner.py --decode suspicious_file.py
# JavaScript:
node defense/javascript/unicode_scanner.js --decode suspicious_file.js

Protects against: all vectors (shared files, PRs, packages, hidden commits)
For more details on using the scanner, see the defense README.
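To illustrate the idea behind such a scanner, here is a minimal sketch built only from the ranges in the techniques table. It is not the implementation in `defense/python/unicode_scanner.py`; the names `CATEGORIES`, `classify`, and `scan_text` are hypothetical.

```python
from collections import Counter

# (category, risk, inclusive code point ranges), copied from the table above.
CATEGORIES = [
    ("Variation Selectors", "CRITICAL", [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]),
    ("Tags Block", "CRITICAL", [(0xE0001, 0xE007F)]),
    ("Zero-Width Characters", "HIGH", [(0x200B, 0x200D)]),
    ("Bidi Overrides", "HIGH", [(0x202A, 0x202E), (0x2066, 0x2069)]),
    ("Bidi Marks", "MEDIUM", [(0x200E, 0x200F), (0x061C, 0x061C)]),
    ("Invisible Operators", "MEDIUM", [(0x2060, 0x2064)]),
    ("Mongolian Free VS", "MEDIUM", [(0x180B, 0x180D)]),
    ("Hangul Fillers", "MEDIUM", [(0x115F, 0x1160), (0x3164, 0x3164), (0xFFA0, 0xFFA0)]),
    ("Line/Paragraph Separators", "MEDIUM", [(0x2028, 0x2029)]),
    ("Deprecated Format", "LOW", [(0x206A, 0x206F)]),
    ("Interlinear Annotations", "LOW", [(0xFFF9, 0xFFFB)]),
    ("Musical Formatting", "LOW", [(0x1D173, 0x1D17A)]),
    ("Shorthand Controls", "LOW", [(0x1BCA0, 0x1BCA3)]),
    ("Other Invisible", "LOW",
     [(0x00AD, 0x00AD), (0x034F, 0x034F), (0x180E, 0x180E), (0xFEFF, 0xFEFF)]),
]


def classify(ch: str):
    """Return (category, risk) for an invisible character, else None."""
    cp = ord(ch)
    for name, risk, ranges in CATEGORIES:
        if any(low <= cp <= high for low, high in ranges):
            return name, risk
    return None


def scan_text(text: str) -> Counter:
    """Count the invisible characters in text, grouped by (category, risk)."""
    counts = Counter()
    for ch in text:
        hit = classify(ch)
        if hit is not None:
            counts[hit] += 1
    return counts


report = scan_text("empty = '\u200b\u200d\ufe00'")
for (name, risk), count in report.items():
    print(f"{risk:8} {name}: {count}")
```

Some of these characters have legitimate uses (emoji sequences, a leading byte order mark, right-to-left text), so a real scanner needs context or thresholds to limit false positives.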
Setting `ignore-scripts` to `true` stops the package manager from running the scripts declared in the `preinstall` and `postinstall` fields of `package.json`, so no code runs automatically when you install a package:
# For npm:
npm config set ignore-scripts true
# For pnpm:
pnpm config set ignore-scripts true

Protects against: Supply Chain Attacks via npm/pnpm hooks only. Does not protect against files you run manually (like the baits in this project) or against hidden code in PRs or commits.
A file with a hidden payload weighs much more than it appears. For example, a 1 KB visible script can weigh 15 KB on disk because of the invisible characters. If a file with only a few lines weighs more than expected, it is suspicious.
Protects against: shared files and PRs, a quick check before running or merging.
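The size check can be automated by comparing a file's total bytes with the bytes spent on characters that normally render as nothing. This is only a heuristic sketch: `size_report` and `looks_padded` are hypothetical names, the 0.5 threshold is an arbitrary assumption, and counting Unicode categories `Cf` (format) and `Mn` (nonspacing mark) covers zero-width characters, tags, and variation selectors but not every family in the table (Hangul fillers, for example).

```python
import unicodedata

# Categories that cover zero-width characters, tags (Cf) and variation selectors (Mn).
HIDDEN_CATEGORIES = ("Cf", "Mn")


def size_report(text: str):
    """Return (total_bytes, hidden_bytes) for the UTF-8 encoding of text."""
    total = len(text.encode("utf-8"))
    hidden = sum(
        len(ch.encode("utf-8"))
        for ch in text
        if unicodedata.category(ch) in HIDDEN_CATEGORIES
    )
    return total, hidden


def looks_padded(text: str, threshold: float = 0.5) -> bool:
    """Flag text whose invisible share of bytes exceeds an arbitrary threshold."""
    total, hidden = size_report(text)
    return total > 0 and hidden / total > threshold


bait = "empty = '" + "\u200b" * 100 + "'\n"
print(size_report(bait))   # (311, 300)
print(looks_padded(bait))  # True
```

Text in scripts that rely heavily on combining marks will also score high here, so treat the result as a prompt for a closer look rather than a verdict.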
LEGAL WARNING
This project is EXCLUSIVELY EDUCATIONAL and is designed to:
- Demonstrate how invisible Unicode characters can be exploited in real attacks
- Teach about the multiple attack vectors where this technique appears
- Provide DEFENSE tools to detect these attacks
- Train cybersecurity professionals in detection techniques
Using this software for unauthorized access to computer systems is ILLEGAL and is punishable under:
- Computer Fraud and Abuse Act (CFAA), United States
- Computer Misuse Act 1990, United Kingdom
- And equivalent legislation in each jurisdiction
The author assumes NO responsibility for misuse of this software. It should only be used in controlled environments, with explicit authorization, and for the purposes of learning, research, or authorized penetration testing.
By cloning or downloading this repository, you agree to use it solely for educational and legal purposes.
This project is licensed under the MIT License.