A pipeline for scraping, preprocessing, and anonymizing Arabic judicial rulings from the Saudi Ministry of Justice.
1. Download the dataset from <https://data.mendeley.com/drafts/np538c95yy>
2. Install dependencies (requirements.txt)
3. Run preprocessing/redaction scripts as described in the configuration section
MOJScraper.py → ColumnSplit.py → NameRedaction.py
| Step | Script | Input | Output |
|---|---|---|---|
| 1 | MOJScraper.py | MOJ website | output1.csv, output2.csv, … |
| 2 | ColumnSplit.py | .xlsx from step 1 | .xlsx with EVENTS, REASONING, RULING columns |
| 3 | NameRedaction.py | .xlsx from step 2 | .xlsx with person names replaced by [Person Name] |
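The column split in step 2 can be sketched as keyword-based sectioning. This is a minimal illustration, not the actual ColumnSplit.py logic: the section-marker headings below are assumptions, and the real script may key on different Arabic phrases.

```python
# Hypothetical Arabic section headings; the real ColumnSplit.py may use
# different markers.
SECTION_MARKERS = {
    "EVENTS": "الوقائع",      # "the facts"
    "REASONING": "الأسباب",   # "the reasons"
    "RULING": "نص الحكم",     # "text of the ruling"
}

def split_ruling(text: str) -> dict:
    """Split a full ruling into sections based on heading keywords.

    Each section runs from its marker to the next marker (or end of text).
    """
    positions = []
    for column, marker in SECTION_MARKERS.items():
        idx = text.find(marker)
        if idx != -1:
            positions.append((idx, column))
    positions.sort()

    sections = {}
    for i, (start, column) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(text)
        sections[column] = text[start:end].strip()
    return sections
```

Rulings missing a marker simply omit that column, so downstream scripts should handle absent keys.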
Requirements: Python 3.9+, Google Chrome, ChromeDriver matching your Chrome version, and Stanza Arabic models.
```
pip install -r requirements.txt
```

Download the Stanza Arabic models (one-time):

```python
import stanza
stanza.download("ar")
```

All settings are in config.py. Edit it before running any script:
- Set `CHROME_DRIVER_PATH` to your local ChromeDriver executable.
- Set `STANZA_MODEL_DIR` to where the Stanza models are saved.
- Set the input/output file paths for each step.
- Adjust `COURT_TYPE` to target a different court category.
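A config.py following the settings above might look like the sketch below. All values and the per-step path variable names are illustrative assumptions, not the project's actual defaults:

```python
# config.py — illustrative values only; adjust for your machine.
CHROME_DRIVER_PATH = "/usr/local/bin/chromedriver"
STANZA_MODEL_DIR = "stanza_resources"
COURT_TYPE = "commercial"  # hypothetical court-category label

# Per-step input/output paths (variable names are assumptions).
SCRAPER_OUTPUT_DIR = "data/raw"
SPLIT_INPUT_FILE = "data/raw/judgments.xlsx"
SPLIT_OUTPUT_FILE = "data/split/judgments_split.xlsx"
REDACTION_INPUT_FILE = "data/split/judgments_split.xlsx"
REDACTION_OUTPUT_FILE = "data/redacted/judgments_redacted.xlsx"

CHUNK_SIZE = 1000                         # judgments per output CSV
TARGET_COLUMN_LETTERS = ["B", "C", "D"]   # spreadsheet columns to redact
```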
Run each script in order:
```
python MOJScraper.py
python ColumnSplit.py
python NameRedaction.py
```

MOJScraper.py saves a checkpoint.json and resumes automatically if interrupted.
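The checkpoint/resume behavior can be sketched as follows. The `last_page` key is an assumption; only the checkpoint.json filename comes from this README:

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # filename from the README

def load_checkpoint() -> int:
    """Return the last scraped page index, or 0 on a fresh start."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, encoding="utf-8") as f:
            return json.load(f).get("last_page", 0)
    return 0

def save_checkpoint(page: int) -> None:
    """Persist progress so an interrupted run can resume."""
    with open(CHECKPOINT_FILE, "w", encoding="utf-8") as f:
        json.dump({"last_page": page}, f)
```

Saving after every page keeps the worst-case loss on a crash to a single page of results.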
- Scraper output is split into chunks of 1,000 judgments per CSV (configurable via `CHUNK_SIZE`).
- Redaction runs in stages: role-pattern matching, then Stanza NER, then a global sweep for any remaining names.
- Only the columns specified in `TARGET_COLUMN_LETTERS` are redacted.
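The first redaction stage (role-pattern matching) can be sketched with a regular expression. The role words and the three-token name assumption below are illustrative only; the real NameRedaction.py also runs Stanza NER and a final global sweep:

```python
import re

# Hypothetical role words ("defendant", "plaintiff", "witness") followed
# by a three-token name. The actual patterns in NameRedaction.py may differ.
ROLE_PATTERN = re.compile(
    r"(المدعى عليه|المدعي|الشاهد)\s+(\S+\s\S+\s\S+)"
)

def redact_roles(text: str) -> str:
    """First pass: replace names that follow known role words."""
    # Keep the role word, replace the name tokens with the placeholder.
    return ROLE_PATTERN.sub(r"\1 [Person Name]", text)
```

Pattern matching catches names the NER model misses in formulaic legal phrasing, which is why it runs before the statistical pass.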