AlshamRepo/ArabiCCR

ArabiCCR

A pipeline for scraping, preprocessing, and anonymizing Arabic judicial rulings from the Saudi Ministry of Justice.


Step-by-step instructions:

1. Download the dataset from https://data.mendeley.com/drafts/np538c95yy
2. Install the dependencies listed in requirements.txt (see Setup).
3. Edit config.py as described in the Configuration section, then run the preprocessing and redaction scripts (see Usage).

Pipeline

MOJScraper.py  →  ColumnSplit.py  →  NameRedaction.py

Step  Script            Input              Output
1     MOJScraper.py     MOJ website        output1.csv, output2.csv, …
2     ColumnSplit.py    .xlsx from step 1  .xlsx with EVENTS, REASONING, RULING columns
3     NameRedaction.py  .xlsx from step 2  .xlsx with person names replaced by [Person Name]
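
The numbered scraper outputs in step 1 (output1.csv, output2.csv, …) suggest chunked writing. The following is a minimal sketch of how such chunking could work, not the code in MOJScraper.py: the function name `write_chunks`, its parameters, and the `utf-8-sig` encoding choice are assumptions made for illustration.

```python
import csv
from pathlib import Path

CHUNK_SIZE = 1000  # judgments per CSV; placeholder mirroring the documented setting


def write_chunks(rows, out_dir, chunk_size=CHUNK_SIZE, header=None):
    """Write rows to output1.csv, output2.csv, ... with at most chunk_size rows each.

    Hypothetical helper: the real scraper may name or encode its files differently.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(rows), chunk_size):
        path = out_dir / f"output{start // chunk_size + 1}.csv"
        # utf-8-sig (an assumption) helps spreadsheet tools detect UTF-8 Arabic text
        with path.open("w", newline="", encoding="utf-8-sig") as fh:
            writer = csv.writer(fh)
            if header:
                writer.writerow(header)
            writer.writerows(rows[start:start + chunk_size])
        paths.append(path)
    return paths
```

With 2,500 rows and the default chunk size, this sketch would produce three files, the last holding the remaining 500 rows.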

Setup

Requirements: Python 3.9+, Google Chrome, ChromeDriver matching your Chrome version, and Stanza Arabic models.

pip install -r requirements.txt

Download Stanza Arabic models (one-time):

import stanza
stanza.download("ar")

Configuration

All settings are in config.py. Edit it before running any script:

  • Set CHROME_DRIVER_PATH to your local ChromeDriver executable.
  • Set STANZA_MODEL_DIR to where Stanza models are saved.
  • Set input/output file paths for each step.
  • Adjust COURT_TYPE to target a different court category.
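
As a rough illustration of the settings listed above, a config.py might look like the sketch below. Only `CHROME_DRIVER_PATH`, `STANZA_MODEL_DIR`, `COURT_TYPE`, `CHUNK_SIZE`, and `TARGET_COLUMN_LETTERS` are names taken from this README; the path variable names, the `config_problems` helper, and every value are hypothetical placeholders, so check the actual config.py in the repository for the real names.

```python
# config.py (illustrative sketch; every value is a placeholder to replace)
from pathlib import Path

# Setting names that appear in this README
CHROME_DRIVER_PATH = Path("/usr/local/bin/chromedriver")  # placeholder path
STANZA_MODEL_DIR = Path.home() / "stanza_resources"       # placeholder path
COURT_TYPE = "REPLACE_ME"                 # valid values are defined by the project
CHUNK_SIZE = 1000                         # judgments per scraper CSV
TARGET_COLUMN_LETTERS = ["B", "C", "D"]   # placeholder spreadsheet columns

# Hypothetical names for the per-step input/output paths
COLUMN_SPLIT_INPUT = Path("data/step1.xlsx")
COLUMN_SPLIT_OUTPUT = Path("data/step2.xlsx")
REDACTION_INPUT = COLUMN_SPLIT_OUTPUT     # step 3 reads what step 2 wrote
REDACTION_OUTPUT = Path("data/step3.xlsx")


def config_problems():
    """Return a list of readable problems; an empty list means the basics look sane."""
    problems = []
    if not CHROME_DRIVER_PATH.is_file():
        problems.append(f"CHROME_DRIVER_PATH does not exist: {CHROME_DRIVER_PATH}")
    if CHUNK_SIZE <= 0:
        problems.append("CHUNK_SIZE must be a positive integer")
    if not all(c.isalpha() and c.isupper() for c in TARGET_COLUMN_LETTERS):
        problems.append("TARGET_COLUMN_LETTERS must be uppercase column letters")
    return problems
```

Calling `config_problems()` before the first run is one way to catch a wrong ChromeDriver path early, before the scraper starts a browser session.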

Usage

Run each script in order:

python MOJScraper.py
python ColumnSplit.py
python NameRedaction.py

MOJScraper.py saves a checkpoint.json and resumes automatically if interrupted.
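
The resume behaviour can be pictured with a small checkpoint loop. This is a sketch under assumptions, not the scraper's actual logic: the `last_index` field, the function names, and the write-then-rename step are all hypothetical, and the real checkpoint.json may store different information.

```python
import json
import os
from pathlib import Path

CHECKPOINT_FILE = Path("checkpoint.json")


def load_checkpoint(path=CHECKPOINT_FILE):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not path.exists():
        return {"last_index": -1}  # hypothetical field: index of the last finished item
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def save_checkpoint(state, path=CHECKPOINT_FILE):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path.with_suffix(".tmp")
    with tmp.open("w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def scrape_all(items, fetch, path=CHECKPOINT_FILE):
    """Process items in order, skipping those a previous run already finished."""
    state = load_checkpoint(path)
    results = []
    for i in range(state["last_index"] + 1, len(items)):
        results.append(fetch(items[i]))
        state["last_index"] = i
        save_checkpoint(state, path)  # record progress after every item
    return results
```

Saving after every item costs one small file write per judgment but means an interrupted run repeats at most the item that was in flight.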


Notes

  • Scraper output is split into chunks of 1,000 judgments per CSV (configurable via CHUNK_SIZE).
  • Redaction runs in two passes. The first applies role pattern matching followed by Stanza NER; the second is a global sweep for any names still present.
  • Only columns specified in TARGET_COLUMN_LETTERS are redacted.
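
To make the redaction passes concrete, here is a minimal sketch of the idea, not the code in NameRedaction.py. The role words, the regular expression, and all function names are hypothetical, and the NER step is passed in as a callable (`find_persons`) so the example runs without downloading Stanza models. The pattern is a crude heuristic: it treats the two or three Arabic-letter tokens after a role word as the name and would over-capture when no punctuation follows the name.

```python
import re

PLACEHOLDER = "[Person Name]"

# Hypothetical role words (plaintiff, defendant, witness); the real list may differ
ROLE_WORDS = ["المدعي", "المدعى عليه", "الشاهد"]
ARABIC_WORD = r"[\u0621-\u064A]+"
ROLE_PATTERN = re.compile(
    r"(?:" + "|".join(re.escape(w) for w in sorted(ROLE_WORDS, key=len, reverse=True)) + r")"
    r"\s*[:/]?\s*"
    r"(?P<name>" + ARABIC_WORD + r"(?:\s+" + ARABIC_WORD + r"){1,2})"
)


def first_pass(text, find_persons, found):
    """Role patterns first, then NER; every name seen is recorded in `found`."""
    def _role_sub(match):
        found.add(match.group("name"))
        prefix = match.group(0)[: match.start("name") - match.start()]
        return prefix + PLACEHOLDER  # keep the role word and separator

    text = ROLE_PATTERN.sub(_role_sub, text)
    # find_persons stands in for Stanza NER. With Stanza it might be built from
    # stanza.Pipeline("ar", processors="tokenize,ner") by collecting ent.text for
    # entities in doc.ents whose type is "PER" (assumed label for persons).
    for name in find_persons(text):
        found.add(name)
        text = text.replace(name, PLACEHOLDER)
    return text


def global_sweep(text, found):
    """Replace any recorded name wherever it still appears, longest names first."""
    for name in sorted(found, key=len, reverse=True):
        text = text.replace(name, PLACEHOLDER)
    return text


def redact(texts, find_persons):
    found = set()
    partially_redacted = [first_pass(t, find_persons, found) for t in texts]
    return [global_sweep(t, found) for t in partially_redacted]
```

The sweep runs only after the first pass has seen every text, so a name identified through a role pattern in one ruling is also replaced where it appears in another ruling without a role word beside it.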
