<a href="https://colab.research.google.com/github/AlinaniS/Data_Mining_Judgement_Topic_Classification/blob/main/Data_Mining_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification of the topics associated with judgments using the judgment text

## Business Understanding

### General and Specific Objectives


The primary business objective is to automate the classification of legal judgments by topic. This addresses the significant inefficiency and inconsistency of manual classification, which is a key bottleneck in legal information management. By developing an accurate, automated system, the project pursues the general and specific objectives listed below.

General objectives:
1.	Define the overall business problem or opportunity that data mining will address: The core problem is the inefficiency and subjectivity of manually classifying legal judgments by topic. This project addresses it by developing an automated system that assigns judicial judgments to predefined topic categories (e.g., "Criminal Law," "Family Law," "Contract Disputes") based on their textual content. This enables faster information retrieval, more efficient legal research, and better decision support for legal practitioners, researchers, and policymakers.

2.	To understand the business context: Legal professionals, researchers, and the public often need to search for judgments based on specific legal topics (e.g., contract law, criminal law, intellectual property). Manually tagging these judgments is a time-consuming and often inconsistent process. By applying text mining and classification techniques, the project seeks to address the need for a scalable and accurate topic classification method which will enable faster access to relevant case law and support data-driven legal analytics.


Specific objectives:
1.	Translate business problems into data mining tasks: Convert the need for topic classification into a supervised text classification task, leveraging NLP techniques (e.g., TF-IDF, word embeddings) and ML models (e.g., Support Vector Machines (SVM), BERT).

2.	Align objectives with measurable business outcomes:
I.	Increased efficiency: Reduce the time and human effort required to classify a new judgment by a certain percentage (e.g., 86%)
II.	Improved consistency: Achieve a higher inter-rater reliability score compared to manual classification.
III.	Enhanced searchability: Enable more precise and faster searching of legal databases by topic, leading to a quantifiable reduction in search time for legal professionals.

3.	Requirements elicitation to determine customer needs: Interview legal stakeholders to identify:
I.	Required topic categories (e.g., "Intellectual Property," "Labor Disputes").
II.	Minimum acceptable accuracy (e.g., 95% for high-priority categories).
III.	Integration needs (e.g., compatibility with legal and research databases such as Westlaw or JSTOR).

4.	Identify factors influencing project outcomes:
I.	Data quality and availability: The success of the project is highly dependent on having a large, diverse, and well-labeled dataset of judgments.
II.	Complexity of legal language: The nuanced and technical nature of legal language can pose a challenge to classification models.
III.	Computational resources: Training complex deep learning models will require significant computational power.
IV.	Expert input: The need for domain experts (legal professionals) to help with data labeling and model validation.


5.	Phase Outcomes

•   Phase 1 (Business Understanding & Data Understanding): A clear project plan, identified data sources, and an initial understanding of the data's quality and characteristics.

•   Phase 2 (Data Preparation): A clean and processed dataset ready for modelling.

•   Phase 3 (Modelling): A trained and evaluated classification model.

•   Phase 4 (Evaluation & Deployment): A final report on the model's performance and a plan for integration into a real-world system.


Core business objectives:

•	To create a valuable tool that can be used to improve access to and analysis of legal data.

•	To reduce manual effort in legal document management.

•	To improve the quality and consistency of legal information classification.

Background information:

• The project builds upon existing work in natural language processing (NLP) and text mining, specifically in the legal domain. It leverages advances in deep learning for text analysis to tackle a long-standing problem in legal information management.

Key success criteria—how will the relative success of the project be measured?

The project will be considered successful if it achieves an accuracy of ≥ 86% (with balanced precision, recall, and F1-score), reduces manual review time by 75% (from ~20 minutes to ≤ 5 minutes per judgement), attains ≥ 80% positive feedback from legal professionals, and can scale to process large volumes of judgements without performance loss.
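
The thresholds above can be turned into an automated check. The following is a minimal sketch, not the project's actual evaluation code: the function names, the toy labels, and the review-time and feedback figures are all hypothetical, and only the three thresholds come from the success criteria stated above.

```python
from collections import Counter

# Thresholds taken from the success criteria stated in the text.
MIN_ACCURACY = 0.86
MAX_REVIEW_MINUTES = 5.0
MIN_POSITIVE_FEEDBACK = 0.80


def macro_scores(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1s.append(2 * precision * recall / total if total else 0.0)
        precisions.append(precision)
        recalls.append(recall)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


def check_success(scores, review_minutes, positive_feedback):
    """Compare measured values against the project thresholds."""
    return {
        "accuracy": scores["accuracy"] >= MIN_ACCURACY,
        "review_time": review_minutes <= MAX_REVIEW_MINUTES,
        "feedback": positive_feedback >= MIN_POSITIVE_FEEDBACK,
    }


# Toy labels for illustration only (not real judgments).
y_true = ["criminal", "criminal", "family", "contract", "family", "contract"]
y_pred = ["criminal", "criminal", "family", "contract", "family", "family"]

scores = macro_scores(y_true, y_pred)
checks = check_success(scores, review_minutes=4.5, positive_feedback=0.82)
print(scores)
print(checks)
```

Reporting each criterion separately, rather than a single pass/fail flag, makes it easier to see which target is holding the project back.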



### Situational analysis

This section provides a detailed assessment of the current state of the project, including a review of available resources, constraints, and a preliminary cost-benefit analysis. This analysis is crucial for determining the feasibility and scope of the project.

---

1.3.1. Resources Available

**Data**
- Primary source: ZambiaLII (access to thousands of judicial judgments).
- Current collection: 8,871 judgments stored as large scanned PDFs.
- Key constraint: documents are image-based (not text), so OCR is required (e.g., Tesseract or cloud OCR services) to extract text.
- Labeling constraint: no pre-existing topic labels — labels must be created manually or inferred from metadata (titles, citations).
- Legal/access note: ZambiaLII is open access, so there are no financial or legal barriers to acquiring the data.

**Personnel**
- Team composition: five computer science students with data-mining background.
- Constraint/risk: no legal domain expert on Zambian law, increasing risk of misinterpretation and mislabeling legal terms.
- Mitigation strategy: rely on publicly available case abstracts and metadata for cross-checking; explicitly document this limitation in the project plan.

**Technology**
- Available resources: personal laptops and free cloud notebooks (e.g., Google Colab).
- Software approach: open-source stack (Python, NLTK, spaCy, scikit-learn, etc.) to avoid licensing costs.
- Resource constraint: limited to free Colab GPU/compute resources; dataset size (a few thousand docs) is considered manageable within those limits.

**Project Requirements & Deliverables**
- Academic requirements: deliver a fully documented codebase in Jupyter notebook format and a formal written report within the semester.
- Data requirement: prepare a dataset of between a few hundred and roughly 1,000 judgments for model training.
- Representativeness: dataset must include judgments from different courts and time periods to reduce systemic bias.

---

1.3.2. Cost-Benefit Analysis

**Summary**
A preliminary cost-benefit analysis indicates the project is worthwhile: educational and prototyping benefits outweigh costs.

**Costs**
- Student effort/time: primary cost (OCR processing, manual labeling, model development).
- Financial costs: negligible due to open-source tools and free data sources; possible but avoidable paid cloud costs if the project remains within free tiers.

**Benefits**
- Educational value: hands-on experience applying data-mining and NLP to a real problem.
- Prototype deliverable: a functional automated legal topic classifier.
- Long-term impact: a successful prototype could form the foundation for a larger system that improves legal research efficiency and could justify future investment.



### Broad goals of data mining process


Purpose: turn the legal team's need for topic tagging of judgements into clear technical targets, usable outputs, and success criteria so stakeholders can act on model results with confidence.

1. Outputs that enable achievement of the business objectives

Per-document outputs
- Predicted topic label(s) (single-label or multi-label) + calibrated confidence scores.
- Short human-readable explanation per prediction (top contributing phrases / feature importances / example similar cases).

System outputs / artifacts
- Labeled training dataset and annotation guide (taxonomic definitions, examples).
- Evaluation report (per-class precision/recall, confusion matrix, macro/micro F1, calibration).
- Scoring pipeline: batch job and real-time API (with SLA: latency, throughput).
- Dashboard / report for legal teams: topic distributions, trending topics, error examples, uncertain cases for manual review.
- Integration docs and deployment checklist for case-management/search systems.

2. Technical criteria for a successful outcome (objective)

Model quality
- Target metrics (set with stakeholders): e.g., macro-F1 ≥ 0.75; per-priority-topic recall ≥ 0.80; precision thresholds for high-cost mistake classes.
- Calibration: predicted probabilities reflect true likelihoods (Brier score or calibration curve).

Operational
- API latency ≤ X ms; batch throughput ≥ N docs/hour; uptime ≥ 99%.
- Resource/compute limits acceptable for deployment environment.

Business KPIs
- Measurable reduction in manual labeling/review time (e.g., ≥ 50% during pilot).
- Increase in retrieval/triage accuracy (e.g., topic-based search precision improvement).
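
The calibration criterion can be made concrete with the multi-class Brier score, which is the mean squared difference between the predicted class probabilities and the one-hot true label (lower is better). This is a rough sketch under stated assumptions: the `brier_score` helper and the toy probabilities are illustrative, and each prediction is assumed to carry a probability for every class.

```python
def brier_score(probabilities, true_labels):
    """Multi-class Brier score.

    probabilities: list of dicts mapping each class label to a probability.
    true_labels:   list of the correct label for each document.
    Assumes every dict contains an entry for every class.
    """
    if len(probabilities) != len(true_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for probs, truth in zip(probabilities, true_labels):
        # Squared error against the one-hot encoding of the true label.
        total += sum(
            (p - (1.0 if label == truth else 0.0)) ** 2
            for label, p in probs.items()
        )
    return total / len(true_labels)


# Toy predictions for illustration only.
predicted = [
    {"criminal": 0.9, "family": 0.1},  # confident and correct
    {"criminal": 0.4, "family": 0.6},  # hesitant but correct
]
truth = ["criminal", "family"]

score = brier_score(predicted, truth)
print(f"Brier score: {score:.3f}")
```

A model that always predicts the true class with probability 1 scores 0, so tracking this number across retraining rounds gives a simple signal of whether the confidence scores can be trusted.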

3. Subjective criteria that require human judgment

Interpretability & trust
- Legal analysts must understand explanations and accept model decisions in a sample audit (human acceptance rate target, e.g., ≥ 85%).

Usability
- Labels and example outputs match legal definitions and are actionable for routing/prioritization.

Adoption
- Stakeholder sign-off in a pilot: confidence to use model outputs without manual double-check for low-risk topics.

These subjective criteria should be evaluated by named stakeholders (legal lead, adjudication manager) with clear acceptance tests (sample review protocols).

4. Link outputs → criteria → business use

Example mapping:

Output: per-judgement topic + confidence → Use: auto-route to specialist team when confidence ≥ 0.9 → Technical success: precision ≥ 0.90 on routed cases; Business success: 60% reduction in routing time.

Output: low-confidence queue → Use: human-in-loop labeling to improve rare-topic performance → Technical success: increased recall for rare topics after 2 retraining iterations.
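
The routing rule in the mapping above could be sketched as follows. The 0.9 threshold comes from the example; the function names, tuple layout, and sample predictions are hypothetical and only illustrate the idea.

```python
AUTO_ROUTE_THRESHOLD = 0.9  # confidence cut-off from the example mapping


def route(predictions, threshold=AUTO_ROUTE_THRESHOLD):
    """Split (doc_id, topic, confidence) tuples into auto-routed and review queues."""
    auto_routed, review_queue = [], []
    for doc_id, topic, confidence in predictions:
        if confidence >= threshold:
            auto_routed.append((doc_id, topic, confidence))
        else:
            review_queue.append((doc_id, topic, confidence))
    return auto_routed, review_queue


def routed_precision(auto_routed, true_topics):
    """Share of auto-routed documents whose predicted topic was correct."""
    if not auto_routed:
        return None  # nothing was routed, so precision is undefined
    correct = sum(1 for doc_id, topic, _ in auto_routed if true_topics[doc_id] == topic)
    return correct / len(auto_routed)


# Toy predictions and ground truth for illustration only.
predictions = [
    ("doc-1", "Criminal Law", 0.97),
    ("doc-2", "Family Law", 0.62),
    ("doc-3", "Contract Disputes", 0.93),
]
true_topics = {
    "doc-1": "Criminal Law",
    "doc-2": "Family Law",
    "doc-3": "Family Law",
}

auto_routed, review_queue = route(predictions)
precision = routed_precision(auto_routed, true_topics)
print(len(auto_routed), len(review_queue), precision)
```

Measuring precision only on the auto-routed subset matches the technical success criterion in the mapping, since low-confidence cases are handled by people anyway.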

5. Risks & mitigations (brief)

Risk: class imbalance / rare topics → Mitigation: active learning, oversampling, hierarchical taxonomy (coarse → fine).

Risk: noisy text (OCR/redaction) → Mitigation: preprocessing & quality flags; route bad-quality docs to human review.

Risk: stakeholder mistrust → Mitigation: provide explanations, confusion examples, and a pilot with feedback loop.
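
One way to implement the quality flag for noisy OCR text is a simple heuristic: measure the share of tokens that look like ordinary words or numbers and send low-scoring documents to human review. The sketch below is an assumption, not a tested method; the helper names and the 0.8 cut-off are hypothetical and would need tuning on real OCR output.

```python
import string

MIN_CLEAN_RATIO = 0.8  # hypothetical cut-off; tune on real OCR samples


def looks_clean(token):
    """True if the token is a plain word or number once edge punctuation is stripped."""
    core = token.strip(string.punctuation)
    # Allow internal apostrophes and hyphens, as in "appellant's" or "cross-appeal".
    core = core.replace("'", "").replace("-", "")
    return core.isalpha() or core.isdigit()


def ocr_quality_score(text):
    """Fraction of whitespace-separated tokens that look clean (0.0 for empty text)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(looks_clean(token) for token in tokens) / len(tokens)


def needs_human_review(text, min_ratio=MIN_CLEAN_RATIO):
    """Flag documents whose OCR output appears too noisy to classify reliably."""
    return ocr_quality_score(text) < min_ratio


# Toy strings for illustration only.
clean_text = "The appellant was convicted of theft by the Subordinate Court."
noisy_text = "Th3 app€llant w@s c0nv1cted 0f th€ft"

print(ocr_quality_score(clean_text), needs_human_review(clean_text))
print(ocr_quality_score(noisy_text), needs_human_review(noisy_text))
```

A heuristic like this is cheap to run before classification, but it will misjudge text with many case citations or section numbers, so flagged documents should be spot-checked before the threshold is fixed.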




### Project Planning

This section outlines the detailed plan for executing the data mining project, translating the business and data mining goals into a structured, time-bound, and resource-based work breakdown. The plan is organized into four distinct phases that align with the CRISP-DM methodology.

1.5.1. Executive Summary & High-Level Timeline

The primary goal is to deliver an automated topic-classification pipeline for judicial judgments using supervised Natural Language Processing (NLP). The solution must meet the defined success criteria, including achieving a target accuracy of ≥86%, reducing manual review time, and securing positive stakeholder feedback. The project is structured across four phases, with a focus on iterative development to manage key constraints such as data availability and computational resources.


| Phase | Dates (2025) | Duration (Days) |
|---|---|---|
| 1. Business & Data Understanding | 11 Aug – 14 Aug | 4 |
| 2. Data Preparation | 15 Aug – 19 Aug | 5 |
| 3. Modelling (with iterations) | 20 Aug – 24 Aug | 5 |
| 4. Evaluation & Deployment Plan | 25 Aug – 29 Aug | 5 |


1.5.2. Phase-by-Phase Plan

**Phase 1: Business & Data Understanding (11–14 Aug)**
- Goal: To finalize project scope, confirm requirements with stakeholders, and secure initial data access.
- Tasks:
  - Day 1 (11 Aug): Project kickoff and stakeholder interviews to define topic categories, acceptable error rates, and key performance indicators (KPIs).
  - Day 2 (12 Aug): Document requirements and formalize acceptance criteria.
  - Day 3 (13 Aug): Identify data sources (e.g., court repositories) and request access to extract an initial data sample (≈ 10-50 judgments per category).
  - Day 4 (14 Aug): Conduct a quick quality assessment of the data sample (e.g., OCR quality, class imbalance) and finalize the data annotation plan.
- Resources: Project lead, data scientist, legal subject matter expert (SME), and a data engineer.
- Dependencies: Timely approval for data access is a critical path item. Delays here will directly impact subsequent phases.


**Phase 2: Data Preparation (15–19 Aug)**
- Goal: To create a clean, annotated, and version-controlled dataset ready for modeling.
- Tasks:
  - Day 5 (15 Aug): Set up the annotation tool (e.g., Label Studio) and train labelers on the established guidelines.
  - Day 6-7 (16-17 Aug): Annotate a small seed dataset (≈ 200-300 judgments) and perform an inter-rater reliability check to ensure consistency.
  - Day 8 (18 Aug): Perform preprocessing steps including OCR cleanup, normalization, and initial feature extraction (e.g., TF-IDF).
  - Day 9 (19 Aug): Create a stratified train, validation, and test split, and version the final dataset.
- Resources: Data engineer, 1-2 labelers, and a legal SME for quality assurance.
- Dependencies: This phase is entirely dependent on the successful completion of data extraction in Phase 1.

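
The inter-rater reliability check in Phase 2 is commonly done with Cohen's kappa, which corrects the raw agreement between two labelers for the agreement expected by chance. The following is a minimal sketch assuming exactly two labelers and one topic per judgment; the function name and the toy annotations are illustrative, and the plan itself does not name a specific reliability statistic.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")

    n = len(labels_a)
    # Observed agreement: share of documents given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy annotations for illustration only.
annotator_1 = ["criminal", "criminal", "family", "family", "contract", "contract"]
annotator_2 = ["criminal", "criminal", "family", "contract", "contract", "contract"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

If the seed set yields a low kappa, the annotation guidelines are the first thing to revisit before labeling more judgments.
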

**Phase 3: Modelling & Iterations (20–24 Aug)**
- Goal: To develop and tune a robust classification model that meets the performance criteria. This phase is designed to be iterative.
- Tasks:
  - Day 10 (20 Aug): Establish a baseline model using a simple approach (e.g., TF-IDF with SVM or Logistic Regression) and evaluate its performance.
  - Day 11-12 (21-22 Aug): Develop and train an advanced model, such as a fine-tuned transformer (e.g., BERT), and perform hyperparameter tuning.
  - Day 13-14 (23-24 Aug): Conduct two planned iterations to refine the model. This includes addressing identified weaknesses (e.g., through data augmentation, class weighting) and selecting the final model.
- Resources: ML engineer, data scientist, and a GPU-enabled cloud environment (e.g., Google Colab).
- Dependencies: This phase relies on the availability of the prepared labeled dataset from Phase 2 and sufficient computational resources (GPU access) for advanced models. A fallback plan (using simpler models like SVM) will be used if GPU access is a limiting factor.

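
The Day 10 baseline could look roughly like the sketch below, which pairs TF-IDF features with a linear SVM in scikit-learn and applies class weighting as one of the imbalance remedies mentioned above. The training and test sentences are invented placeholders, not ZambiaLII judgments, and the hyperparameters are untuned assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented placeholder texts; real input would be the OCR-extracted judgments.
train_texts = [
    "the accused was convicted of theft and sentenced to imprisonment",
    "the appellant appeals against conviction for aggravated robbery",
    "the court considered custody of the children after the divorce",
    "the petitioner seeks dissolution of the marriage and maintenance",
    "the defendant breached the contract by failing to deliver the goods",
    "damages were claimed for breach of the sale agreement",
]
train_labels = [
    "Criminal Law", "Criminal Law",
    "Family Law", "Family Law",
    "Contract Disputes", "Contract Disputes",
]

# TF-IDF over unigrams and bigrams feeding a linear SVM.
# class_weight="balanced" reweights classes inversely to their frequency.
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(class_weight="balanced"),
)
baseline.fit(train_texts, train_labels)

test_texts = [
    "the accused was sentenced for robbery",
    "custody and maintenance after divorce",
]
test_labels = ["Criminal Law", "Family Law"]

predictions = baseline.predict(test_texts)
macro_f1 = f1_score(test_labels, predictions, average="macro")
print(list(predictions), macro_f1)
```

Keeping the vectorizer and classifier in one pipeline means the same preprocessing is applied at training and prediction time, which also makes the baseline easy to swap for Logistic Regression or to serialize later.
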

**Phase 4: Evaluation, UAT & Deployment Plan (25–29 Aug)**
- Goal: To finalize model performance, validate it with end-users, and create a clear plan for system deployment and integration.
- Tasks:
  - Day 15 (25 Aug): Final evaluation of the selected model against all technical KPIs (e.g., accuracy, runtime).
  - Day 16 (26 Aug): Conduct User Acceptance Testing (UAT) with legal SMEs to gather feedback on usability and trust, targeting ≥80% positive feedback.
  - Day 17-18 (27-28 Aug): Prepare deployment artifacts (e.g., serialized model, API specifications) and create a formal integration plan for connecting the model to existing legal databases.
  - Day 19 (29 Aug): Final project presentation to stakeholders and formal sign-off.
- Resources: ML engineer, DevOps/infra specialist, legal SME, and the project lead.
- Dependencies: This phase is contingent on the successful completion of the modeling phase and requires active participation from stakeholders for UAT and sign-off.


1.5.3. Risk Analysis & Mitigation

Several risks have been identified, and a contingency strategy is in place to ensure the project stays on track.

- Data Access & Quality: The highest risk. A delay in securing data access or poor OCR quality could halt the project. The mitigation strategy is to request data samples immediately and work in parallel on public data to build a prototype.
- Compute Availability (GPU): A high risk for advanced models. The mitigation is to reserve cloud resources in advance and have a fallback to lighter models if needed.
- Expert Availability: The limited availability of legal SMEs is a medium risk. This will be managed by scheduling specific time blocks for them and using a targeted sampling approach for reviews instead of full-document checks.
- Contingency Strategy: A two-iteration buffer is built into the modeling phase. In case of significant delays or unmet targets, the project scope may be reduced (e.g., focusing on only the highest-priority topics) to ensure a successful outcome within the allotted timeline.



## Data Understanding

## Data Preparation

## Modelling

## Evaluation

## Deployment

| Phase                                                         | Dates (2025)    | Duration (days) |
|---------------------------------------------------------------|-----------------|-----------------|
| Phase 1 — Business & Data Understanding                       | 11 Aug — 14 Aug | 4               |
| Phase 2 — Data Preparation (collection, labeling, preprocessing) | 15 Aug — 19 Aug | 5               |
| Phase 3 — Modelling (baseline → advanced; 2 iterations)       | 20 Aug — 24 Aug | 5               |
| Phase 4 — Evaluation, UAT & Deployment plan                   | 25 Aug — 29 Aug | 5               |

In [None]:
# Print the source of the Markdown cell that holds the timeline table
import json

# The path assumes the notebook file was saved to this Colab location
with open('/content/n/notebook.ipynb', 'r', encoding='utf-8') as f:
    notebook_json = json.load(f)

for cell in notebook_json['cells']:
    if cell.get('id') == 'c96deac5':
        print("Content of cell c96deac5:")
        print("".join(cell['source']))
        break
else:
    # The loop finished without finding the cell
    print("Cell c96deac5 not found")