# 1. BUSINESS UNDERSTANDING
## 1.1 PROBLEM STATEMENT

The Zambia Government Gazette publishes official notices, legal updates, tenders, and public announcements. These documents are published as PDFs and are often unstructured, which makes it difficult for stakeholders such as lawyers, journalists, researchers, and the general public to find relevant information quickly. Manual searching is time-consuming and prone to oversight. An automated method is needed to classify Gazette notices into categories (e.g., legal notices, tenders, appointments, public warnings) for faster retrieval and analysis.

## 1.2 BUSINESS OBJECTIVES

The primary objective is to develop a system that automatically processes and classifies Gazette publications into predefined categories. Success will mean that end users can:

Quickly identify and filter notices by category.

Reduce time spent manually scanning through documents.

Gain improved access to relevant legal or public information.

From a real-world perspective, this will increase efficiency for professionals and citizens who rely on the Gazette for important updates.

## 1.3 DATA MINING GOALS

We will build a text classification model that:

Extracts text from Gazette PDFs.

Preprocesses the text (cleaning, tokenization, stopword removal).

Classifies each notice into categories.

Outputs labeled data for easy search and retrieval.

The approach will likely involve Natural Language Processing (NLP) and machine learning algorithms such as Logistic Regression or Support Vector Machines.

## 1.4 INITIAL SUCCESS CRITERIA

The project will be considered successful if:

The classification model achieves at least 80% accuracy on the test dataset.

Categories are clearly defined, distinct, and interpretable.

The pipeline can handle at least 10 new Gazette PDFs per month without major manual intervention.

Users confirm that classification results improve search speed and relevance compared to manual reading.

## 1.5 SCOPE & ASSUMPTIONS

Scope:

Focus on classifying Gazette notices into 4–6 main categories.

Work only with English-language Gazettes.

Process and analyze a subset of recent Gazette issues.

Assumptions:

PDF files are accessible and legally permissible for analysis.

Categories remain consistent over time.

OCR (Optical Character Recognition) will be needed for scanned documents.

## 1.6 RISKS & CONSTRAINTS

Risks:

Poor text quality from scanned PDFs may reduce OCR accuracy.

Some notices may belong to multiple categories, complicating classification.

Limited labelled training data could impact model performance.

Constraints:

Legal constraint: Must comply with any copyright or government data use regulations.

## 1.7 EXPECTED BENEFITS

The system will enable faster retrieval of Gazette notices, enhance transparency, and support informed decision-making for both professionals and the public.

# 2. METHODOLOGY
## 2.1 DATA COLLECTION

Download Gazette PDFs from the official Zambia Government Gazette website.

Ensure documents cover a representative period to include diverse categories.

Maintain a record of file metadata (date, publication number) for reference.
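
The metadata record can be derived from the file names themselves. The sketch below assumes the naming pattern seen in the downloaded files (for example `zm-government-gazette-dated-2024-11-29-no-7677.pdf`); the function name `parse_gazette_filename` and the returned field names are illustrative choices, not part of the project code.

```python
import re
from datetime import date

# Assumed naming pattern, inferred from files such as
# "zm-government-gazette-dated-2024-11-29-no-7677.pdf".
FILENAME_PATTERN = re.compile(r"dated-(\d{4})-(\d{2})-(\d{2})-no-(\d+)\.pdf$")


def parse_gazette_filename(filename):
    """Return the publication date and gazette number, or None if the name does not match."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        return None
    year, month, day, number = (int(group) for group in match.groups())
    return {
        "filename": filename,
        "published": date(year, month, day),
        "gazette_no": number,
    }


meta = parse_gazette_filename("zm-government-gazette-dated-2024-11-29-no-7677.pdf")
print(meta["published"].isoformat(), meta["gazette_no"])
```

Files that do not follow the pattern return `None`, so they can be listed and handled by hand rather than silently mislabeled.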

## 2.2 DATA PREPROCESSING

Extract text from PDFs, using OCR for scanned documents that have no text layer.

Remove irrelevant elements (headers, footers, page numbers).

Tokenize text and remove stopwords, punctuation, and special characters.

Standardize text formatting (e.g., lowercasing, stemming).
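
The cleaning steps above could look roughly like the following. This is a minimal sketch using only the standard library: the stopword list is a tiny illustrative set (an assumption; NLTK or spaCy provide fuller lists), the sample sentence is invented, and stemming is left out to keep the example dependency-free.

```python
import re

# Tiny illustrative stopword list (an assumption, not the project's final list).
STOPWORDS = {"the", "of", "and", "to", "in", "a", "is", "for", "by", "on", "that", "this"}


def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    tokens = text.split()
    # Single-character tokens are usually extraction noise, so drop them too
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


# Invented sample sentence, for illustration only
tokens = preprocess("The Registrar hereby gives NOTICE of intention to issue a duplicate certificate.")
print(tokens)
```

Dropping single-character tokens is a design choice aimed at stray letters left behind by PDF extraction; it may need revisiting if short tokens turn out to carry meaning.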

## 2.3 MODEL SELECTION

Evaluate multiple classification algorithms: Logistic Regression, Support Vector Machines, and Random Forest.

Use TF-IDF or word embeddings to represent text features.

Optimize model parameters using cross-validation.
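
One way to compare the candidate algorithms and tune a parameter with cross-validation is sketched below using scikit-learn. The notices and labels are invented toy data, the two-fold split is chosen only because the toy set is tiny, and the parameter grid is an arbitrary example rather than a recommended setting.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy notices and labels, for illustration only
texts = [
    "Tender for the supply of office furniture",
    "Invitation to bid for road construction works",
    "Tender notice for supply of medical equipment",
    "Bids are invited for the supply of computers",
    "Appointment of a new registrar of lands",
    "The President has appointed a board chairperson",
    "Appointment of acting permanent secretary",
    "Notice of appointment of a commissioner",
]
labels = ["tender"] * 4 + ["appointment"] * 4

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Compare the candidates with the same TF-IDF features and the same folds
scores = {}
for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores[name] = cross_val_score(pipe, texts, labels, cv=2, scoring="f1_macro").mean()

# Tune the regularization strength C of Logistic Regression by grid search
grid = GridSearchCV(
    make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=2,
    scoring="f1_macro",
)
grid.fit(texts, labels)

print(scores)
print(grid.best_params_)
```

Keeping the vectorizer inside the pipeline means TF-IDF statistics are fitted on the training folds only, which avoids leaking information from the validation fold.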

## 2.4 EVALUATION METRICS

Accuracy, precision, recall, and F1-score for each category.

Confusion matrix to identify misclassification trends.

User feedback on relevance and usefulness of classified notices.
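
The quantitative metrics above can be computed with scikit-learn as in the following sketch. The true and predicted labels are invented for illustration, and the 0.80 threshold mirrors the accuracy target stated in the initial success criteria.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Invented labels, for illustration only
y_true = ["tender", "tender", "appointment", "legal", "legal", "appointment"]
y_pred = ["tender", "legal", "appointment", "legal", "legal", "appointment"]
categories = ["appointment", "legal", "tender"]

accuracy = accuracy_score(y_true, y_pred)
# Rows are true categories, columns are predicted categories
cm = confusion_matrix(y_true, y_pred, labels=categories)

print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_true, y_pred, labels=categories, zero_division=0))
print(cm)

# Check against the 80% accuracy target from the success criteria
meets_target = accuracy >= 0.80
```

In this toy example one tender is predicted as a legal notice, which shows up as an off-diagonal entry in the confusion matrix and as reduced recall for the tender category.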

# 3. TOOLS AND TECHNOLOGIES

Programming Languages: Python

Libraries: scikit-learn, pandas, NumPy, NLTK, spaCy, PyMuPDF, PyPDF2, Tesseract OCR

Environment: Jupyter Notebook / Python IDE

Version Control: Git / GitHub

# 4. EXPECTED OUTCOMES

Automated classification of Gazette notices into predefined categories.

A searchable dataset with labeled notices for faster retrieval.

Insights into the distribution and frequency of notice types.

Reduced manual effort for users accessing Gazette information.
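
As a small illustration of the distribution insight mentioned above, the frequency of each notice type can be tallied once notices are labeled. The labels below are invented, and `predicted` is a hypothetical variable standing in for the classifier output.

```python
from collections import Counter

# Hypothetical classifier output, invented for illustration
predicted = ["tender", "legal notice", "tender", "appointment", "tender", "legal notice"]

counts = Counter(predicted)
total = sum(counts.values())

# List categories from most to least frequent with their share of all notices
for category, n in counts.most_common():
    print(f"{category}: {n} ({n / total:.0%})")
```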

# 5. FUTURE ENHANCEMENTS

Implement a web interface for searching and filtering classified notices.

Incorporate advanced NLP techniques like BERT for improved classification accuracy.

Expand to multilingual Gazettes or other official publications.

Introduce trend analysis and reporting for frequently published notice types.




In [None]:
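import os
from google.colab import drive

# Mount Google Drive first so the folders are created on Drive itself,
# not in a local directory that would block a later mount
drive.mount('/content/drive')

# Create project folders on Drive (adjust path if different)
base = "/content/drive/MyDrive/csc4792-project_team_28"
folders = [
    "01_Business_Understanding",
    "02_Data_Understanding/raw_pdfs",
    "02_Data_Understanding/extracted_csv",
    "03_Data_Preparation",
    "04_Modeling",
    "05_Evaluation",
    "06_Deployment",
    "Reports",
    "Slides"
]
for f in folders:
    # exist_ok=True makes the cell safe to re-run
    os.makedirs(os.path.join(base, f), exist_ok=True)

print("Folders created at:", base)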

In [None]:
!pip install PyMuPDF

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Extract text from PDFs using PyMuPDF
import glob
import os

import fitz  # PyMuPDF
import pandas as pd

pdf_folder = "/content/drive/MyDrive/csc4792-project_team_28/02_Data_Understanding/raw_pdfs"
out_csv = "/content/drive/MyDrive/csc4792-project_team_28/02_Data_Understanding/extracted_csv/gazette_raw_texts.csv"


def extract_text_from_pdf(path):
    """Return the text of all pages joined by newlines, or "" if the PDF cannot be read."""
    try:
        # The context manager closes the document even if a page fails to parse
        with fitz.open(path) as doc:
            return "\n".join(page.get_text("text") for page in doc)
    except Exception as e:
        print("Error reading", path, e)
        return ""


# Sort the paths so the extraction order is reproducible
pdf_paths = sorted(glob.glob(os.path.join(pdf_folder, "*.pdf")))
print("Found PDFs:", len(pdf_paths))

records = []
for p in pdf_paths:
    fname = os.path.basename(p)
    print("Extracting:", fname)
    txt = extract_text_from_pdf(p)
    records.append({"filename": fname, "text": txt})

# One row per PDF: the file name and its full extracted text
df_raw = pd.DataFrame(records)
df_raw.to_csv(out_csv, index=False)
print("Saved CSV:", out_csv)
df_raw.head(3)


Found PDFs: 10
Extracting: zm-government-gazette-dated-2024-11-29-no-7677.pdf
Extracting: zm-government-gazette-dated-2024-12-13-no-7684.pdf
Extracting: zm-government-gazette-dated-2025-02-14-no-7712.pdf
Extracting: zm-government-gazette-dated-2025-03-21-no-7732.pdf
Extracting: zm-government-gazette-dated-2025-04-04-no-7737.pdf
Extracting: zm-government-gazette-dated-2025-04-11-no-7742.pdf
Extracting: zm-government-gazette-dated-2025-05-23-no-7763.pdf
Extracting: zm-government-gazette-dated-2025-06-20-no-7773.pdf
Extracting: zm-government-gazette-dated-2025-06-27-no-7775.pdf
Extracting: zm-government-gazette-dated-2025-06-27-no-7776.pdf
Saved CSV: /content/drive/MyDrive/csc4792-project_team_28/02_Data_Understanding/extracted_csv/gazette_raw_texts.csv


| | filename | text |
|---|---|---|
| 0 | zm-government-gazette-dated-2024-11-29-no-7677... | \nPublished by Authority \nREPUBLIC \nOF ZAM... |
| 1 | zm-government-gazette-dated-2024-12-13-no-7684... | \n \n \n \n \nREPUBLIC \nOF ZAMBIA \nGOVE... |
| 2 | zm-government-gazette-dated-2025-02-14-no-7712... | \na \niter \ni \n_ \nREPUBLIC \nOF ZAMBIA \... |
