# Kenya Constitution AI Agent – Business Understanding

## Project Context
The Kenyan Constitution of 2010 is a comprehensive legal document that governs the country's laws and citizens' rights. However, it is long, complex, and written in legal language that many people, especially young people, find hard to understand.

Finding the relevant provision quickly is also a challenge, which makes it harder for citizens to exercise their rights or comply with legal obligations.

## Business Problem
*Problem Statement:*  
Citizens, especially young people, need an accessible way to understand and query the Kenyan Constitution in everyday language. Traditional legal consultation is expensive, and reading the full document is time-consuming.

*Objective:*  
Develop an AI agent that uses Natural Language Processing (NLP) to understand user queries and provide relevant answers from the Constitution. The agent should:

- Accept questions in *English and Kiswahili*.
- Return accurate, understandable answers.
- Quote the specific article/section from the Constitution for credibility.

*Stakeholders:*  
- Kenyan youths (primary users)
- NGOs and civic education organizations
- Government institutions promoting civic engagement

## Expected Impact
- *Empowerment:* Citizens can better understand their rights and duties.
- *Accessibility:* Reduces reliance on legal experts for basic constitutional questions.
- *Scalability:* The system can be deployed online and integrated via API for wider use.

## Key Considerations
- Language support (English + Kiswahili)
- Handling legal jargon
- Accurate referencing of articles/chapters
- Efficient query handling for fast responses

# Data Understanding

## Overview of the Data
For this project, our data sources are:

1. *English version of the Kenyan Constitution (2010)*  
   - Source: [The Constitution of Kenya 2010 PDF](Data/The_Constitution_of_Kenya_2010.pdf)  
   - Contains all chapters, articles, and legal provisions in English.  
   - Needs text extraction and cleaning to convert it into a structured format for NLP.

2. *Kiswahili version of the Kenyan Constitution*  
   - Source: [Kielelezo_Pantanifu_cha_Katiba_ya_Kenya.pdf](Data/Kielelezo_Pantanifu_cha_Katiba_ya_Kenya.pdf)  
   - Same content as the English version, translated into Kiswahili.  
   - Requires extraction, cleaning, and alignment with the English dataset.

*Purpose of Using Both Languages:*  
- Enable the AI agent to respond to user queries in *English* or *Kiswahili*.  
- Improve accessibility and inclusivity for all Kenyan citizens.  

## Data Structure
After preprocessing, the expected data structure is:

| Article/Section | Text (English) | Text (Kiswahili) |
|-----------------|----------------|-----------------|
| Article 1       | Text content   | Swahili content |
| Article 2       | Text content   | Swahili content |
| ...             | ...            | ...             |

*Notes:*  
- Each row corresponds to an *article* or *clause*.  
- This will allow the NLP model to retrieve relevant sections when users ask questions.  
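
As a rough illustration of how one row per article supports retrieval, the sketch below ranks rows by word overlap with a query. The `rows` sample reuses article titles from the Constitution's table of contents as stand-in text, and `tokenize` and `rank_rows` are hypothetical helpers rather than the project's retrieval method. A real system would more likely compare embeddings.

```python
import re

# Stand-in rows in the expected structure; the text is just article titles.
rows = [
    {"article": "Article 1", "text_en": "Sovereignty of the people"},
    {"article": "Article 14", "text_en": "Citizenship by birth"},
    {"article": "Article 16", "text_en": "Dual citizenship"},
]

def tokenize(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def rank_rows(query, rows, text_key="text_en"):
    """Return rows with a non-zero word overlap, best match first."""
    q = tokenize(query)
    scored = [(len(q & tokenize(r[text_key])), r) for r in rows]
    scored = [(s, r) for s, r in scored if s > 0]
    # Sort on the score only, so the row dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored]

best = rank_rows("Is dual citizenship allowed?", rows)
print(best[0]["article"])
```

Because each row carries its article label, the matched row can be cited directly in the answer.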

## Data Quality Considerations
- PDFs contain headers, footers, and formatting that need cleaning.  
- Ensure *text alignment* between English and Kiswahili versions.  
- Maintain *article/chapter references* to allow citations in responses.  
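
One possible way to handle running headers and footers is to drop any line that recurs on most pages. The sketch below assumes the text is available as a list of per-page strings (for example, the per-page results of `page.extract_text()` before they are joined). `strip_repeated_lines` and its `min_fraction` threshold are hypothetical and would need tuning against the real PDFs.

```python
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on at least `min_fraction` of the pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    # Never treat a line seen on a single page as a header or footer.
    threshold = max(2, min_fraction * len(pages))
    repeated = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines() if ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned

pages = [
    "Constitution of Kenya, 2010\nfirst page body",
    "Constitution of Kenya, 2010\nsecond page body",
    "Constitution of Kenya, 2010\nthird page body",
]
print(strip_repeated_lines(pages))
```

Short legitimate lines that happen to repeat on many pages would also be removed, so the result should be checked by eye.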

## Next Steps
1. Extract text from PDFs into a structured format (JSON/CSV).  
2. Clean text by removing:
   - Page numbers
   - Footnotes
   - Unnecessary whitespace and formatting characters  
3. Verify consistency between English and Kiswahili versions.  
4. Prepare a dataset ready for:
   - Embeddings creation
   - NLP query retrieval
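
The cleaning in step 2 could look roughly like the sketch below. It assumes page numbers appear as lines containing only one to three digits, which has not been checked against the PDFs. `clean_text` is a hypothetical helper, and footnote removal is not covered.

```python
import re

def clean_text(text):
    """Remove page-number-only lines and normalise whitespace."""
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip lines that are only digits, assumed to be page numbers.
        if re.fullmatch(r"\d{1,3}", stripped):
            continue
        # Collapse runs of spaces and tabs inside the line.
        lines.append(re.sub(r"[ \t]+", " ", stripped))
    # Collapse runs of blank lines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()

print(clean_text("Territory   of\tKenya.\n12\n\n\n\nNext line  "))
```

A digit-only line that is really part of the text would be dropped as well, so the rule may need a stricter condition.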



In [4]:
!pip install pdfplumber
import pdfplumber
print("pdfplumber installed successfully!")

pdfplumber installed successfully!


## Extract text from PDFs

In [9]:
# Import required libraries
import pdfplumber  # For PDF text extraction
import pandas as pd

# Define file paths
english_pdf_path = "../Data/The_Constitution_of_Kenya_2010.pdf"
kiswahili_pdf_path = "../Data/Kielelezo_Pantanifu_cha_Katiba_ya_Kenya.pdf"

# Function to extract text from a PDF
def extract_pdf_text(pdf_path):
    all_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                all_text.append(text)
    return "\n".join(all_text)

# Extract English and Kiswahili texts
english_text = extract_pdf_text(english_pdf_path)
kiswahili_text = extract_pdf_text(kiswahili_pdf_path)

# Optional: Preview first 1000 characters
print("English Preview:\n", english_text[:1000])
print("\nKiswahili Preview:\n", kiswahili_text[:1000])

English Preview:
 LAWS OF KENYA
THE CONSTITUTION OF KENYA, 2010
Published by the National Council for Law Reporting
with the Authority of the Attorney-General
www.kenyalaw.org
Constitution of Kenya, 2010
THE CONSTITUTION OF KENYA, 2010
ARRANGEMENT OF ARTICLES
PREAMBLE
CHAPTER ONE—SOVEREIGNTY OF THE PEOPLE AND
SUPREMACY OF THIS CONSTITUTION
1—Sovereignty of the people.
2—Supremacy of this Constitution.
3—Defence of this Constitution.
CHAPTER TWO—THE REPUBLIC
4—Declaration of the Republic.
5—Territory of Kenya.
6—Devolution and access to services.
7—National, official and other languages.
8—State and religion.
9—National symbols and national days.
10—National values and principles of governance.
11—Culture.
CHAPTER THREE—CITIZENSHIP
12—Entitlements of citizens.
13—Retention and acquisition of citizenship.
14—Citizenship by birth.
15—Citizenship by registration.
16—Dual citizenship.
17—Revocation of citizenship.
18—Legislation on citizenship.
CHAPTER FOUR—THE BILL OF RIGHTS
PART 1—GENERAL
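
The preview shows that the "ARRANGEMENT OF ARTICLES" section lists each article as a number, a long dash, and a title. A minimal sketch, assuming every entry fits on one line in that shape, can turn those lines into a number-to-title lookup. `parse_toc` is a hypothetical helper, and the sample string copies two entries from the preview.

```python
import re

# "\u2014" is the long dash that separates number and title in the preview.
TOC_ENTRY = re.compile(r"^(\d+)\u2014(.+?)\.?$", re.MULTILINE)

def parse_toc(text):
    """Map article number -> title for lines shaped like '<number><dash><title>.'."""
    return {int(num): title.strip() for num, title in TOC_ENTRY.findall(text)}

sample = (
    "CHAPTER TWO\u2014THE REPUBLIC\n"
    "4\u2014Declaration of the Republic.\n"
    "5\u2014Territory of Kenya.\n"
)
print(parse_toc(sample))
```

Chapter and part headings are skipped because they do not start with a digit. Titles that wrap onto a second line would be cut short.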

## Split text into articles


In [12]:
import re

def split_articles(text, keyword="Article"):
    """
    Splits the Constitution text wherever `keyword` is followed by a number.
    Returns a list of tuples: (Article Number/Title, Text)

    Caveat: the pattern matches the keyword anywhere in the text, so in-text
    cross-references are treated as article boundaries too.
    """
    # The capture group makes re.split keep each matched title in the result.
    pattern = rf"({re.escape(keyword)} \d+.*?)\n"
    splits = re.split(pattern, text)

    # splits = [text before first match, title_1, body_1, title_2, body_2, ...]
    articles = []
    for title, body in zip(splits[1::2], splits[2::2]):
        articles.append((title.strip(), body.strip()))
    return articles

english_articles = split_articles(english_text, keyword="Article")
kiswahili_articles = split_articles(kiswahili_text, keyword="Kifungu")  # Kiswahili keyword
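
An alternative worth testing is to split on the article numbers themselves, since the extracted body text appears to start each article with a number and a period at the beginning of a line (for example `5. Kenya consists of the ...`). The sketch below is only that, a sketch. `split_numbered_articles` is a hypothetical helper, it assumes articles are numbered consecutively, and the sample text is a placeholder.

```python
import re

# A line that starts with a 1-3 digit number, a period and whitespace.
ARTICLE_START = re.compile(r"^(\d{1,3})\.\s", re.MULTILINE)

def split_numbered_articles(text, first=1):
    """Return {article_number: text}, accepting numbers only in sequence."""
    starts = []
    expected = first
    for match in ARTICLE_START.finditer(text):
        # Ignore numbered lines that are not the next expected article.
        if int(match.group(1)) == expected:
            starts.append((expected, match.start()))
            expected += 1
    articles = {}
    for idx, (num, pos) in enumerate(starts):
        end = starts[idx + 1][1] if idx + 1 < len(starts) else len(text)
        articles[num] = text[pos:end].strip()
    return articles

sample = (
    "Title one.\n1. (1) body a.\n"
    "Title two.\n2. body b, see Article 1.\n"
    "Title three.\n3. body c."
)
articles = split_numbered_articles(sample)
print(list(articles))
```

Keying on the number gives both languages a shared identifier. One known weakness: the heading line of the next article ends up at the tail of the previous body and would need a separate pass to move.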

## Align English and Kiswahili

In [13]:
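# Pair articles by list position. This assumes both lists contain the same
# articles in the same order; extra entries in the longer list are dropped.
aligned_articles = []
min_len = min(len(english_articles), len(kiswahili_articles))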

for i in range(min_len):
    eng_title, eng_text = english_articles[i]
    kis_title, kis_text = kiswahili_articles[i]
    aligned_articles.append({
        "Article/Section": eng_title,
        "Text_English": eng_text,
        "Text_Kiswahili": kis_text
    })

print(f"Aligned {len(aligned_articles)} articles successfully.")

Aligned 61 articles successfully.
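
Pairing by list position only works if both splitters return the same articles in the same order. A more defensive option, sketched below, joins the two languages on the article number and reports numbers that appear in only one of them. It assumes each language has already been parsed into a `{article_number: text}` dict, which the code above does not yet produce. `align_by_number` and the sample texts are hypothetical.

```python
def align_by_number(english, kiswahili):
    """Join two {article_number: text} dicts on their shared numbers."""
    shared = sorted(english.keys() & kiswahili.keys())
    rows = [
        {
            "Article/Section": f"Article {n}",
            "Text_English": english[n],
            "Text_Kiswahili": kiswahili[n],
        }
        for n in shared
    ]
    # Numbers present in only one language point to an extraction problem.
    unmatched = sorted(english.keys() ^ kiswahili.keys())
    return rows, unmatched

rows, unmatched = align_by_number(
    {1: "English text 1", 2: "English text 2", 3: "English text 3"},
    {1: "Kiswahili text 1", 3: "Kiswahili text 3", 4: "Kiswahili text 4"},
)
print(len(rows), unmatched)
```

The returned `rows` list has the same keys as `aligned_articles`, so it could be passed to `pd.DataFrame` in the same way.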


## Convert to DataFrame

In [14]:
data = pd.DataFrame(aligned_articles)
data.head(5)

|   | Article/Section | Text_English | Text_Kiswahili |
|---|-----------------|--------------|----------------|
| 0 | Article 10. | Territory of Kenya.\n5. Kenya consists of the ... | hiyo.\ninaweza kutoa usaidizi, ikiwemo-\n(3) M... |
| 1 | Article 24. | Retention and acquisition of citizenship.\n13.... | (5) Hatua yoyote inayochukuliwa chini ya (4) i... |
| 2 | Article 14 (4), may be revoked if— | (a) the citizenship was acquired by fraud, fal... | uamuzi. (c) kuwa huru dhidi ya aina zote za gh... |
| 3 | Article 43, if the State claims that it | does not have the resources to implement the r... | (2) Kila mtu anayo haki ya kutaka kurekebishwa... |
| 4 | Article 43. | (3) All State organs and all public officers h... | (2) Haki hiyo inaendelea hadi kwa kutunga, kue... |


## Save Structured Dataset

In [16]:
data.to_csv("../Data/kenya_constitution_structured.csv", index=False)


# Step 3: Data Preparation

In this step, we prepare the Kenya Constitution dataset for the downstream tasks of our AI agent.

*Objectives:*
1. Load structured CSV containing English and Kiswahili articles.
2. Clean and normalize text:
   - Remove extra whitespace, newlines, and tabs.
   - Remove page numbers and non-standard characters.
3. Create useful NLP features:
   - Word counts, sentence counts, character counts.
4. Build reusable pipelines:
   - Modular functions for cleaning, preprocessing, and vectorization.
5. Ensure alignment between English and Kiswahili articles.
6. Prepare the dataset for embeddings, retrieval, ML, and Deep Learning.
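
The length features in objective 3 could be computed with a small function like the one below. It is a sketch only: `text_features` is a hypothetical helper, the sentence split is a simple punctuation rule, and the sample string is a placeholder rather than constitutional text.

```python
import re

def text_features(text):
    """Return simple length features for one article's text."""
    words = text.split()
    # Rough split on ., ! or ? followed by whitespace or the end of the text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    return {
        "char_count": len(text),
        "word_count": len(words),
        "sentence_count": len(sentences),
    }

features = text_features("First sentence here. Second one follows.")
print(features)
```

Applied per row (for example with `data["Text_English"].apply(text_features)`), this yields one feature dict per article. Clause numbers such as `5.` also end in a period, so sentence counts on legal text will run high unless numbering is stripped first.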