A pipeline for scraping, preprocessing, and anonymizing Arabic judicial rulings from the Saudi Ministry of Justice.
1. Download the dataset from <https://data.mendeley.com/drafts/np538c95yy>
2. Install dependencies (requirements.txt)
3. Run preprocessing/redaction scripts as described in the configuration section
MOJScraper.py → ColumnSplit.py → NameRedaction.py
| Step | Script | Input | Output |
|---|---|---|---|
| 1 | MOJScraper.py | MOJ website | output1.csv, output2.csv, … |
| 2 | ColumnSplit.py | .xlsx from step 1 | .xlsx with EVENTS, REASONING, RULING columns |
| 3 | NameRedaction.py | .xlsx from step 2 | .xlsx with person names replaced by [Person Name] |
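The column split in step 2 can be sketched as keyword-based sectioning. This is a minimal illustration, not the actual ColumnSplit.py logic: the section-marker headings below are assumptions, and the real script may key on different Arabic phrases.

```python
# Hypothetical Arabic section headings; the real ColumnSplit.py may use
# different markers.
SECTION_MARKERS = {
    "EVENTS": "الوقائع",      # "the facts"
    "REASONING": "الأسباب",   # "the reasons"
    "RULING": "نص الحكم",     # "text of the ruling"
}

def split_ruling(text: str) -> dict:
    """Split a full ruling into sections based on heading keywords.

    Each section runs from its marker to the next marker (or end of text).
    """
    positions = []
    for column, marker in SECTION_MARKERS.items():
        idx = text.find(marker)
        if idx != -1:
            positions.append((idx, column))
    positions.sort()

    sections = {}
    for i, (start, column) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(text)
        sections[column] = text[start:end].strip()
    return sections
```

Rulings missing a marker simply omit that column, so downstream scripts should handle absent keys.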
Requirements: Python 3.9+, Google Chrome, ChromeDriver matching your Chrome version, and Stanza Arabic models.
```
pip install -r requirements.txt
```

Download the Stanza Arabic models (one-time):

```python
import stanza
stanza.download("ar")
```

All settings are in config.py. Edit it before running any script:
- Set `CHROME_DRIVER_PATH` to your local ChromeDriver executable.
- Set `STANZA_MODEL_DIR` to where the Stanza models are saved.
- Set the input/output file paths for each step.
- Adjust `COURT_TYPE` to target a different court category.
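A config.py following the settings above might look like the sketch below. All values and the per-step path variable names are illustrative assumptions, not the project's actual defaults:

```python
# config.py — illustrative values only; adjust for your machine.
CHROME_DRIVER_PATH = "/usr/local/bin/chromedriver"
STANZA_MODEL_DIR = "stanza_resources"
COURT_TYPE = "commercial"  # hypothetical court-category label

# Per-step input/output paths (variable names are assumptions).
SCRAPER_OUTPUT_DIR = "data/raw"
SPLIT_INPUT_FILE = "data/raw/judgments.xlsx"
SPLIT_OUTPUT_FILE = "data/split/judgments_split.xlsx"
REDACTION_INPUT_FILE = "data/split/judgments_split.xlsx"
REDACTION_OUTPUT_FILE = "data/redacted/judgments_redacted.xlsx"

CHUNK_SIZE = 1000                         # judgments per output CSV
TARGET_COLUMN_LETTERS = ["B", "C", "D"]   # spreadsheet columns to redact
```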
Run each script in order:
```
python MOJScraper.py
python ColumnSplit.py
python NameRedaction.py
```

MOJScraper.py saves a checkpoint.json and resumes automatically if interrupted.
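The checkpoint/resume behavior can be sketched as follows. The `last_page` key is an assumption; only the checkpoint.json filename comes from this README:

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # filename from the README

def load_checkpoint() -> int:
    """Return the last scraped page index, or 0 on a fresh start."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, encoding="utf-8") as f:
            return json.load(f).get("last_page", 0)
    return 0

def save_checkpoint(page: int) -> None:
    """Persist progress so an interrupted run can resume."""
    with open(CHECKPOINT_FILE, "w", encoding="utf-8") as f:
        json.dump({"last_page": page}, f)
```

Saving after every page keeps the worst-case loss on a crash to a single page of results.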
- Scraper output is split into chunks of 1,000 judgments per CSV (configurable via `CHUNK_SIZE`).
- Redaction runs in stages: role-pattern matching, then Stanza NER, then a global sweep for any remaining names.
- Only the columns specified in `TARGET_COLUMN_LETTERS` are redacted.
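The first redaction stage (role-pattern matching) can be sketched with a regular expression. The role words and the three-token name assumption below are illustrative only; the real NameRedaction.py also runs Stanza NER and a final global sweep:

```python
import re

# Hypothetical role words ("defendant", "plaintiff", "witness") followed
# by a three-token name. The actual patterns in NameRedaction.py may differ.
ROLE_PATTERN = re.compile(
    r"(المدعى عليه|المدعي|الشاهد)\s+(\S+\s\S+\s\S+)"
)

def redact_roles(text: str) -> str:
    """First pass: replace names that follow known role words."""
    # Keep the role word, replace the name tokens with the placeholder.
    return ROLE_PATTERN.sub(r"\1 [Person Name]", text)
```

Pattern matching catches names the NER model misses in formulaic legal phrasing, which is why it runs before the statistical pass.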