A custom-built static site generator written in Python with zero external dependencies. It transforms a folder of Markdown files into production-ready static HTML, using a hierarchical, tree-based node architecture and recursive directory processing. It also copies static assets and works with interchangeable HTML templates. This project was implemented as part of the Boot.dev programming course.
The architecture follows a three-layer HTML node hierarchy:
- HTMLNode: Abstract base class defining the interface for all HTML nodes
- LeafNode: Represents HTML elements with no children (text nodes, images, etc.)
- ParentNode: Represents HTML elements containing child nodes (containers like <div>, <p>, <ul>, etc.)
The transformation pipeline converts content through several stages:
Markdown → Blocks → TextNodes → HTMLNodes → HTML Tree → HTML String
+-----------------+     +------------------+     +-----------------+
|    content/     |---->|  Recursive Dir   |---->| Markdown Files  |
|    (source)     |     |    Traversal     |     |   Discovered    |
+-----------------+     +------------------+     +--------+--------+
                                                          |
                                                          v
                        +------------------+     +-----------------+
                        |  Title Extract   |<----| Parse Markdown  |
                        |  (from h1 tags)  |     |                 |
                        +--------+---------+     +--------+--------+
                                 |                        |
                                 v                        v
+-----------------+     +-----------------+      +-----------------+
|     Static/     |     |  template.html  |      |  Generate HTML  |
|    (assets)     |     |  (HTML shell)   |      |   (Node Tree)   |
+--------+--------+     +--------+--------+      +--------+--------+
         |                       |                        |
         |                       v                        |
         |              +------------------+              |
         +------------->|    HTML Files    |<-------------+
                        |   (Generated)    |
                        +--------+---------+
                                 |
                                 v
                        +------------------+
                        |      docs/       |
                        |     (output)     |
                        +------------------+
- Static Asset Copying: Recursively copies all files from static/ to docs/
- Content Discovery: Recursively traverses the content/ directory to find all .md files
- Page Generation: For each Markdown file:
  - Parse Markdown into HTML node tree
  - Extract page title from h1 heading
  - Inject content and title into HTML template
  - Write generated HTML to the corresponding path in docs/
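The discovery step can be sketched as a recursive walk that pairs each Markdown file with its output path. This is a minimal illustration, not the project's code: the function name `find_markdown_pages` and the `.md` to `.html` renaming are assumptions.

```python
import os


def find_markdown_pages(content_dir, dest_dir):
    """Return (source, destination) path pairs for every .md file under content_dir.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    pairs = []
    for entry in sorted(os.listdir(content_dir)):
        src = os.path.join(content_dir, entry)
        if os.path.isdir(src):
            # Recurse, mirroring the subdirectory under the output root.
            pairs.extend(find_markdown_pages(src, os.path.join(dest_dir, entry)))
        elif entry.endswith(".md"):
            # Assumed mapping: content/blog/tom/index.md -> docs/blog/tom/index.html
            html_name = os.path.splitext(entry)[0] + ".html"
            pairs.append((src, os.path.join(dest_dir, html_name)))
    return pairs
```

Each pair can then be handed to a single-page generation step, so discovery and rendering stay separate.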
The parsing process occurs in distinct stages:
       Raw Markdown
             |
             v
+-------------------------+
|  markdown_to_blocks()   |  → Split into block-level chunks
|    (Block detection)    |    (headings, lists, paragraphs, etc.)
+-------------------------+
             |
             v
+-------------------------+
|   text_to_textnodes()   |  → Parse inline formatting
|    (Inline parsing)     |    (bold, italic, code, links, images)
+-------------------------+
             |
             v
+-------------------------+
| text_node_to_html_node  |  → Convert to HTML representation
|    (Node conversion)    |    (TextNode → LeafNode/ParentNode)
+-------------------------+
             |
             v
+-------------------------+
|    Parent/Leaf Nodes    |  → Build HTML tree structure
|     (Tree assembly)     |    (Nest nodes into container elements)
+-------------------------+
             |
             v
HTML String (via to_html())
The Markdown parsing pipeline works at two levels: block and inline.
At the block level, the document is split into blocks of text separated by blank lines. Each block is assigned a type that describes what kind of text it is, such as a heading or a code block.
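As a rough sketch of block splitting, assuming blocks are separated by one or more blank lines: `markdown_to_blocks` is the name used in the pipeline diagram, while `block_type` and its string labels are hypothetical and cover only a few block kinds.

```python
def markdown_to_blocks(markdown):
    """Split a document into non-empty blocks separated by blank lines."""
    blocks = []
    for chunk in markdown.split("\n\n"):
        chunk = chunk.strip()
        if chunk:  # extra blank lines produce empty chunks; skip them
            blocks.append(chunk)
    return blocks


def block_type(block):
    """Classify a block by its leading characters (simplified, hypothetical labels)."""
    if block.startswith("```") and block.endswith("```"):
        return "code"
    if block.startswith("#"):
        hashes = len(block) - len(block.lstrip("#"))
        # A heading needs 1-6 hashes followed by a space.
        if hashes <= 6 and block[hashes:hashes + 1] == " ":
            return "heading"
    lines = block.split("\n")
    if all(line.startswith(">") for line in lines):
        return "quote"
    if all(line.startswith("- ") for line in lines):
        return "unordered_list"
    return "paragraph"
```

One limitation of this simple split: a fenced code block that itself contains a blank line would be cut in two, so a fuller version would need to track fences.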
At the inline level, the text inside each block is parsed for styling such as bold or italic text, links, and images. This step maps the text to text nodes, each tagged with the type of HTML it will become.
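One way to picture the inline step is splitting plain text on paired delimiters. The sketch below is an assumption about how such a pass could look, not the project's implementation: the `TextNode` fields follow the description in this README, but `split_on_delimiter` and the type labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextNode:
    text: str
    text_type: str = "text"
    url: Optional[str] = None


def split_on_delimiter(nodes, delimiter, text_type):
    """Split plain-text nodes on a paired delimiter such as ** or a backtick."""
    result = []
    for node in nodes:
        if node.text_type != "text":
            result.append(node)  # already styled; leave untouched
            continue
        parts = node.text.split(delimiter)
        if len(parts) % 2 == 0:
            raise ValueError(f"unclosed {delimiter!r} in {node.text!r}")
        for i, part in enumerate(parts):
            if part == "":
                continue
            # Even indexes sit outside the delimiters, odd indexes inside.
            result.append(TextNode(part, "text" if i % 2 == 0 else text_type))
    return result


nodes = [TextNode("This is **bold** and `code`")]
nodes = split_on_delimiter(nodes, "**", "bold")
nodes = split_on_delimiter(nodes, "`", "code")
```

Running the passes one delimiter at a time keeps each pass simple, at the cost of walking the text once per style.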
A key implementation detail is the split_image_or_link_nodes() function, which parses link and image syntax without using regular expressions. The reasons for this choice:
- Simplicity: Every parsing decision is explicit and modifiable.
- Performance: A regex engine also walks the string character by character to find matches, and I would have needed several regex functions to achieve the same functionality. So I chose to scan the string once and collect all of the necessary information in a single pass.
- Zero dependencies: Uses only built-in Python functionality.
The parser maintains two boolean states:
- in_link_text: Currently parsing the text portion between [ and ]
- in_link_url: Currently parsing the URL portion between ( and )
As it iterates through each character:
- When [ is encountered outside of any state → enter in_link_text
- When ] is encountered in in_link_text → capture text, enter in_link_url
- When ) is encountered in in_link_url → capture URL, create TextNode
- For images, check if [ is preceded by ! to distinguish images from links
Process for the input "Check out [my site](https://example.com) and ![logo](img.png)":
- Parse "Check out " as plain text
- Detect [ → enter link text state, capture "my site"
- Detect ] → exit link text, enter URL state
- Detect ) → capture "https://example.com", create LINK_TEXT node
- Parse " and " as plain text
- Detect ![ → enter image state, capture "logo"
- Detect ] → exit image text, enter URL state
- Detect ) → capture "img.png", create IMAGE_TEXT node
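The single-pass scan described above can be sketched as a small state machine. This is an illustration written for this README, not the actual split_image_or_link_nodes() code: the function name `split_links_and_images`, the tuple output, and the handling of malformed input are all assumptions.

```python
def split_links_and_images(text):
    """Scan text once and return (kind, text, url) tuples.

    kind is "text", "link", or "image". Hypothetical sketch; URLs that
    contain ")" are not supported.
    """
    nodes = []

    def add_text(s):
        # Merge with a preceding plain-text node so malformed syntax stays in one piece.
        if not s:
            return
        if nodes and nodes[-1][0] == "text":
            nodes[-1] = ("text", nodes[-1][1] + s, None)
        else:
            nodes.append(("text", s, None))

    plain, label, url = [], [], []
    in_link_text = in_link_url = is_image = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_link_text:
            if ch == "]":
                if text[i + 1:i + 2] == "(":
                    in_link_text, in_link_url = False, True
                    i += 1  # consume the "(" as well
                else:
                    # "]" without "(": not a link, so restore the raw characters.
                    add_text(("![" if is_image else "[") + "".join(label) + "]")
                    label, in_link_text = [], False
            else:
                label.append(ch)
        elif in_link_url:
            if ch == ")":
                kind = "image" if is_image else "link"
                nodes.append((kind, "".join(label), "".join(url)))
                label, url, in_link_url = [], [], False
            else:
                url.append(ch)
        elif ch == "[":
            is_image = bool(plain) and plain[-1] == "!"
            if is_image:
                plain.pop()  # the "!" belongs to the image syntax
            add_text("".join(plain))
            plain, in_link_text = [], True
        else:
            plain.append(ch)
        i += 1
    # Flush whatever is left, including an unterminated link or image.
    tail = "".join(plain)
    if in_link_text:
        tail += ("![" if is_image else "[") + "".join(label)
    elif in_link_url:
        tail += ("![" if is_image else "[") + "".join(label) + "](" + "".join(url)
    add_text(tail)
    return nodes
```

For the example input this yields a text node, a link node, another text node, and an image node, in that order. Text that only looks like a link, such as "a [b] c", comes back as a single plain-text node.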
src/node.py contains the core data structures and parsing logic:
- HTMLNode: Base class with tag, value, children, and props attributes
- LeafNode: Renders as <tag>value</tag> or plain text
- ParentNode: Renders as <tag>children_html</tag> by concatenating child to_html() results
- TextNode: Intermediate representation with text, text_type, and optional url
- Parsing functions: Block detection, inline formatting, and conversion utilities
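A minimal sketch of the three node classes, based only on the attributes and rendering rules listed above. The constructor signatures and the `props_to_html` helper are assumptions, and the sketch does no HTML escaping or special handling of void elements such as img.

```python
class HTMLNode:
    """Base class: holds the shared attributes, leaves rendering to subclasses."""

    def __init__(self, tag=None, value=None, children=None, props=None):
        self.tag = tag
        self.value = value
        self.children = children
        self.props = props

    def to_html(self):
        raise NotImplementedError("subclasses implement to_html()")

    def props_to_html(self):
        # Hypothetical helper: render {"href": "x"} as ' href="x"'.
        if not self.props:
            return ""
        return "".join(f' {key}="{val}"' for key, val in self.props.items())


class LeafNode(HTMLNode):
    def __init__(self, tag, value, props=None):
        super().__init__(tag, value, None, props)

    def to_html(self):
        if self.tag is None:
            return self.value  # plain text, no wrapping element
        return f"<{self.tag}{self.props_to_html()}>{self.value}</{self.tag}>"


class ParentNode(HTMLNode):
    def __init__(self, tag, children, props=None):
        super().__init__(tag, None, children, props)

    def to_html(self):
        # Render each child recursively and concatenate the results.
        inner = "".join(child.to_html() for child in self.children)
        return f"<{self.tag}{self.props_to_html()}>{inner}</{self.tag}>"


paragraph = ParentNode("p", [
    LeafNode(None, "Visit "),
    LeafNode("a", "my site", {"href": "https://example.com"}),
])
html = paragraph.to_html()
```

Because ParentNode only calls to_html() on its children, parents can nest to any depth without knowing what kind of node they contain.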
src/main.py orchestrates the generation process:
- copy_directory(): Recursively copies static assets
- generate_pages_recursive(): Discovers and processes all Markdown files
- generate_page(): Converts a single Markdown file to HTML
- main(): Entry point with CLI argument handling for basepath
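For illustration, a recursive copy in the spirit of copy_directory() might look like the sketch below. The signature is a guess, and clearing the destination before copying is an assumption rather than documented behaviour.

```python
import os
import shutil


def copy_directory(src, dest):
    """Recursively copy src into dest.

    Assumption: any existing dest is removed first so stale files do not
    survive a rebuild.
    """
    if os.path.exists(dest):
        shutil.rmtree(dest)
    os.makedirs(dest)
    for entry in os.listdir(src):
        src_path = os.path.join(src, entry)
        dest_path = os.path.join(dest, entry)
        if os.path.isdir(src_path):
            copy_directory(src_path, dest_path)  # descend into subdirectories
        else:
            shutil.copy(src_path, dest_path)
```

The standard library's shutil.copytree() does the same job in one call; writing the recursion by hand mainly makes each step visible.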
template.html is the HTML template containing placeholders:
- {{ Title }}: Replaced with extracted h1 heading
- {{ Content }}: Replaced with generated HTML from Markdown

The template is replaceable, so you can swap in different base sites to build from.
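Title extraction and placeholder injection can be sketched with plain string operations. The function names `extract_title` and `render_page` are hypothetical; only the two placeholder strings come from the template description above.

```python
def extract_title(markdown):
    """Return the text of the first h1 heading (a line starting with '# ')."""
    for line in markdown.split("\n"):
        if line.startswith("# "):
            return line[2:].strip()
    raise ValueError("no h1 heading found")


def render_page(template, title, content_html):
    """Fill the two placeholders in the template string."""
    page = template.replace("{{ Title }}", title)
    return page.replace("{{ Content }}", content_html)


template = "<title>{{ Title }}</title><body>{{ Content }}</body>"
title = extract_title("intro line\n# Hello World \n\nSome text")
page = render_page(template, title, "<h1>Hello World</h1>")
```

Raising an error when no h1 exists is a design choice made for this sketch; falling back to a default title would work just as well.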
Generate the static site:
# Default basepath ("/")
python3 src/main.py
# Custom basepath (for GitHub Pages, etc.)
python3 src/main.py "/your-repo-name/"
The generator will:
- Copy all files from static/ to docs/
- Process all .md files in content/ and subdirectories
- Output generated HTML to corresponding paths in docs/
.
├── content/              # Markdown source files
│   ├── index.md
│   └── blog/
│       ├── tom/
│       ├── glorfindel/
│       └── majesty/
├── static/               # Static assets (CSS, images)
│   ├── index.css
│   └── images/
├── docs/                 # Generated output (created by generator)
├── src/
│   ├── main.py           # Entry point and orchestration
│   └── node.py           # Core classes and parsing logic
├── template.html         # HTML template
└── build.sh              # Build script