## 🪴 seedGPT

**A seed-sized, Pythonic GPT-2 implementation from scratch.**  
No high-level frameworks. No frills. Just pure attention.

> _"Every forest begins with a seed."_  
> `seedGPT` is a minimal yet faithful implementation of the GPT-2 architecture, handcrafted from the ground up using vanilla Python and PyTorch.  
> It's designed as a **starting point** — to learn, experiment, and grow your own language model ecosystem.

---

### 🌱 Highlights

- GPT-2 style Transformer decoder
- Clean training and inference loops
- From scratch, no external GPT libraries
- Ideal for beginners, researchers, and tinkerers

---

- Aaron Hung


---
### 📚 **Corpus Download in One Line**

We use **Project Gutenberg** to download large-scale, high-quality literary texts for language model training.
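
A minimal sketch of that download step, using only the standard library. The helper names `gutenberg_txt_url` and `download_corpus` are hypothetical, and the `.txt.utf-8` URL pattern is an assumption about how Project Gutenberg serves plain-text files; check the ebook page if the link fails.

```python
from pathlib import Path
from urllib.request import urlopen


def gutenberg_txt_url(ebook_id: int) -> str:
    """Build the plain-text URL for an ebook (assumed URL pattern)."""
    return f"https://www.gutenberg.org/ebooks/{ebook_id}.txt.utf-8"


def download_corpus(ebook_id: int, dest: str = "corpus.txt") -> Path:
    """Download one ebook as UTF-8 text and save it to `dest`."""
    with urlopen(gutenberg_txt_url(ebook_id), timeout=30) as response:
        # "utf-8-sig" drops a leading byte-order mark if the file has one.
        text = response.read().decode("utf-8-sig")
    path = Path(dest)
    path.write_text(text, encoding="utf-8")
    return path
```

With these helpers, the one line is `download_corpus(22566)`, which would save the text to `corpus.txt` in the current directory.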

---

### 🔍 **Creative Alternatives to Gutenberg**

1. **Wikipedia Dumps** (via `wikiextractor` or Hugging Face):
   - High coverage, factual, multi-language.
   - Great for general knowledge and clean sentence structure.

2. **Common Crawl**:
   - Massive web-scale data, noisy but diverse.
   - Used by many large models (e.g., GPT-3, LLaMA).

3. **OpenSubtitles**:
   - Dialog-style corpora from movies and TV.
   - Great for conversational tone and multilingual content (e.g., Swedish, Taiwanese, Moroccan Arabic).

4. **Books3** (from The Pile):
   - Rich English literature, more modern and diverse than Gutenberg.
   - Contains fiction and non-fiction, but its copyright status is disputed.

5. **Tatoeba Project**:
   - Sentence-aligned multilingual database.
   - Useful for languages such as **Swedish**, including under-resourced ones like **Minnan/Taiwanese** and **Moroccan Darija**.

6. **CC100**:
   - A cleaned version of Common Crawl available in over 100 languages.
   - Good for training multilingual models.

7. **Language-specific Wikis & Blogs**:
   - **Moroccan Darija (Arabic dialect)** → scrape local forums, YouTube comments.
   - **Traditional Chinese** → zh.wikisource.org, Chinese literature sites like open-lit.com.
   - **Taiwanese (Minnan)** → Taiwanese Bible corpora, folk song lyrics, gov.tw resources.

---

### 🧠 **What Did Karpathy Use for nanoGPT?**

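The `nanoGPT` repository ships data preparation scripts for:
- **Tiny Shakespeare** (small sample, in character-level and GPT-2 BPE variants)  
- **OpenWebText** (for full-scale runs, an open reproduction of GPT-2's WebText)

**The Pile** (which contains Books3, arXiv, etc.) is not one of the repository's prepared datasets; treat it as an optional larger corpus.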

**Advantages (Tiny Shakespeare in particular):**
- Highly replicable and clean structure
- Small enough to understand learning dynamics
- Great for bootstrapping/debugging models

**Disadvantages:**
- Not diverse
- Lacks multilinguality
- Not suitable for production-grade models
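
For a corpus as small as Tiny Shakespeare, a character-level vocabulary is a common starting point. The following is a minimal sketch of that idea, not seedGPT's actual tokenizer: the function names are hypothetical, and the 90/10 train/validation split is an assumed default.

```python
def build_vocab(text: str) -> tuple[dict[str, int], dict[int, str]]:
    """Map every distinct character to an integer id, and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text: str, stoi: dict[str, int]) -> list[int]:
    """Turn a string into a list of character ids."""
    return [stoi[ch] for ch in text]


def decode(ids: list[int], itos: dict[int, str]) -> str:
    """Turn a list of character ids back into a string."""
    return "".join(itos[i] for i in ids)


def train_val_split(ids: list[int], train_fraction: float = 0.9):
    """Split ids into a training prefix and a validation suffix."""
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]
```

Encoding followed by decoding should round-trip the original text, which makes a quick sanity check before training.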

---

### ✅ **Comparison Table**

| Source                     | Size   | Quality   | Diversity               | Languages            | License       |
|----------------------------|--------|-----------|-------------------------|----------------------|---------------|
| Project Gutenberg          | Medium | High      | Mostly English classics | English, some others | ✅ Free        |
| Tiny Shakespeare (nanoGPT) | Small  | Very High | Low (focused)           | English only         | ✅ Free        |
| Common Crawl               | Huge   | Mixed     | Very High               | 🌍 100+               | ⚠️ Mixed      |
| OpenSubtitles              | Large  | Medium    | Dialogue-heavy          | Many                 | ⚠️ Grey area  |
| Tatoeba                    | Small  | High      | Sentence-level          | 💬 400+               | ✅ Free        |
| CC100                      | Huge   | Medium    | Web-style               | 🌐 100+               | ⚠️ Mixed (inherits Common Crawl terms) |
---

### 🧪 **Recommendation for Special Languages**

- **Traditional Chinese (繁體中文)**: Use zh.wikisource, Chinese Open-Lit, or scrape Taiwanese news sites.
- **Taiwanese / Minnan (台語)**: Bible corpora, gov.tw local dialect projects, or oral transcription data.
- **Moroccan Darija**: Data is scarce. Options include Facebook and YouTube comments, local forums, and WhatsApp chat corpora.
- **Swedish**: Use `sv.wikipedia`, OpenSubtitles in Swedish, and local government documents (open data portals).
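
For the Tatoeba route, a small filter can pull out only the languages of interest. This is a sketch under stated assumptions: it expects a tab-separated export with three columns (`id`, language code, sentence text) and no header row, and it assumes ISO 639-3 codes such as `swe`, `nan`, and `ary`. Verify both against the export you actually download. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def sentences_by_language(
    lines: Iterable[str], wanted: set[str]
) -> dict[str, list[str]]:
    """Group sentence text by language code, keeping only `wanted` codes."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        # Assumed layout per line: id <TAB> language code <TAB> sentence
        parts = line.rstrip("\r\n").split("\t", 2)
        if len(parts) != 3:
            continue  # skip malformed rows instead of failing
        _sentence_id, lang, text = parts
        if lang in wanted:
            buckets[lang].append(text)
    return dict(buckets)
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, so a large export does not have to be loaded into memory at once.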

---


### For now, the corpus will be:
`Dorothy and the Wizard in Oz` from Project Gutenberg, in Plain Text UTF-8:  
https://www.gutenberg.org/ebooks/22566
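
Project Gutenberg plain-text files wrap the book in a license header and footer, which are usually removed before training. Below is a minimal sketch of that cleanup. It assumes the file uses the `*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***` and matching `*** END OF ...` marker lines; the function name is hypothetical, and text without markers is returned unchanged apart from trimming.

```python
import re

# Marker lines as they commonly appear in Gutenberg files (assumed format).
_START = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
_END = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return only the text between the START and END marker lines."""
    start = _START.search(text)
    end = _END.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()
```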

### 🐙 Some Dev-Notes:
- On Windows, install `pylzma` to decompress archive files.
- `pylzma` may not work on macOS/Linux; consider `py7zr` instead, or the standard-library modules `lzma`, `zipfile`, and `tarfile`.
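
A cross-platform sketch of the standard-library route, covering `.zip`, tar archives, and single-file `.xz`/`.lzma` streams. The function name and the `data` output directory are assumptions. `.7z` archives are not handled here, since the standard library cannot read them; that case would still need `py7zr`.

```python
import lzma
import shutil
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive: str, out_dir: str = "data") -> Path:
    """Extract a zip, tar, or xz/lzma file into `out_dir` (trusted files only)."""
    src = Path(archive)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    if zipfile.is_zipfile(src):
        with zipfile.ZipFile(src) as zf:
            zf.extractall(out)
    elif tarfile.is_tarfile(src):
        # tarfile.open detects the compression (.tar, .tar.gz, .tar.xz).
        with tarfile.open(src) as tf:
            tf.extractall(out)
    elif src.suffix in {".xz", ".lzma"}:
        # A single compressed file: "book.txt.xz" becomes "book.txt".
        with lzma.open(src, "rb") as fin, open(out / src.stem, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    else:
        raise ValueError(f"Unsupported archive format: {src.name}")
    return out
```

The format is detected from the file contents where possible rather than from the extension, so a misnamed zip or tar file is still handled.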
