# Kenya Constitution AI Agent – Business Understanding

## Project Context
The Constitution of Kenya, 2010 is the supreme law of the country and sets out citizens' rights and the structure of government. However, it is long, complex, and written in legal language that many people, especially young people, find difficult to understand.

Accessing relevant information quickly can be challenging, making it difficult for citizens to exercise their rights or comply with legal obligations.

## Business Problem
*Problem Statement:*  
Citizens, especially young people, need an accessible way to understand and query the Kenyan Constitution in everyday language. Traditional legal consultation is expensive, and reading the full document is time-consuming.

*Objective:*  
Develop an AI agent that uses Natural Language Processing (NLP) to understand user queries and provide relevant answers from the Constitution. The agent should:

- Accept questions in *English and Kiswahili*.
- Return accurate, understandable answers.
- Quote the specific article/section from the Constitution for credibility.

*Stakeholders:*  
- Kenyan youths (primary users)
- NGOs and civic education organizations
- Government institutions promoting civic engagement

## Expected Impact
- *Empowerment:* Citizens can better understand their rights and duties.
- *Accessibility:* Reduces reliance on legal experts for basic constitutional questions.
- *Scalability:* The system can be deployed online and integrated via API for wider use.

## Key Considerations
- Language support (English + Kiswahili)
- Handling legal jargon
- Accurate referencing of articles/chapters
- Efficient query handling for fast responses

# Data Understanding

## Overview of the Data
For this project, our data sources are:

1. *English version of the Kenyan Constitution (2010)*  
   - Source: [The Constitution of Kenya 2010 PDF](Data/The_Constitution_of_Kenya_2010.pdf)  
   - Contains all chapters, articles, and legal provisions in English.  
   - Needs text extraction and cleaning to convert it into a structured format for NLP.

2. *Kiswahili version of the Kenyan Constitution*  
   - Source: [Kielelezo_Pantanifu_cha_Katiba_ya_Kenya.pdf](Data/Kielelezo_Pantanifu_cha_Katiba_ya_Kenya.pdf)  
   - Same content as the English version, translated into Kiswahili.  
   - Requires extraction, cleaning, and alignment with the English dataset.

*Purpose of Using Both Languages:*  
- Enable the AI agent to respond to user queries in *English* or *Kiswahili*.  
- Improve accessibility and inclusivity for all Kenyan citizens.  

## Data Structure
After preprocessing, the expected data structure is:

| Article/Section | Text (English) | Text (Kiswahili) |
|-----------------|----------------|-----------------|
| Article 1       | Text content   | Swahili content |
| Article 2       | Text content   | Swahili content |
| ...             | ...            | ...             |

*Notes:*  
- Each row corresponds to an *article* or *clause*.  
- This will allow the NLP model to retrieve relevant sections when users ask questions.  
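
To make the retrieval idea concrete, here is a minimal sketch of looking up a row and returning its article label for citation. It is not the project's retrieval method: plain keyword overlap stands in for embeddings, and the `rows` list, its keys, and the `retrieve` helper are hypothetical names with placeholder text.

```python
import re

# Hypothetical rows in the target structure (placeholder text, not real constitutional wording).
rows = [
    {"article": "Article 1", "text_en": "placeholder text about sovereignty of the people"},
    {"article": "Article 2", "text_en": "placeholder text about supremacy of the constitution"},
]

def tokenize(text):
    """Lowercase a string and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, rows, text_key="text_en"):
    """Return the row sharing the most tokens with the question, or None if nothing overlaps."""
    question_tokens = tokenize(question)
    best, best_score = None, 0
    for row in rows:
        score = len(question_tokens & tokenize(row[text_key]))
        if score > best_score:
            best, best_score = row, score
    return best

hit = retrieve("What does supremacy of the constitution mean?", rows)
print(hit["article"] if hit else "no match")  # Article 2
```

Keeping the article label in the same row as the text is what lets the answer carry its citation.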

## Data Quality Considerations
- PDFs contain headers, footers, and formatting that need cleaning.  
- Ensure *text alignment* between English and Kiswahili versions.  
- Maintain *article/chapter references* to allow citations in responses.  

## Next Steps
1. Extract text from PDFs into a structured format (JSON/CSV).  
2. Clean text by removing:
   - Page numbers
   - Footnotes
   - Unnecessary whitespace and formatting characters  
3. Verify consistency between English and Kiswahili versions.  
4. Prepare a dataset ready for:
   - Embeddings creation
   - NLP query retrieval
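
The cleaning step for page numbers and running headers could look roughly like the sketch below. It assumes the text is available as one string per page; `strip_page_noise` and the `min_repeats` threshold are hypothetical, and the rule "a line repeated on several pages is a header" is a heuristic that can also drop legitimate repeated lines.

```python
import re
from collections import Counter

def strip_page_noise(pages, min_repeats=3):
    """Drop blank lines, bare page numbers, and lines that recur on at least min_repeats pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    repeated = {line for line, n in counts.items() if n >= min_repeats}

    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip()
            and not re.fullmatch(r"\d{1,4}", line.strip())  # bare page number
            and line.strip() not in repeated                # running header/footer
        ]
        cleaned.append("\n".join(kept))
    return cleaned

# Invented sample pages with a running header and page numbers.
pages = [
    "Constitution of Kenya, 2010\nfirst body line\n12",
    "Constitution of Kenya, 2010\nsecond body line\n13",
    "Constitution of Kenya, 2010\nthird body line\n14",
]
print(strip_page_noise(pages))
```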



In [1]:
!pip install pdfplumber
import pdfplumber
print("pdfplumber installed successfully!")

pdfplumber installed successfully!


## Extract texts from PDFs

In [3]:
# Import required libraries
import pdfplumber  # For PDF text extraction
import pandas as pd

# Define file paths
english_pdf_path = "../Data/The_Constitution_of_Kenya_2010.pdf"
kiswahili_pdf_path = "../Data/Kielelezo_Pantanifu_cha_Katiba_ya_Kenya.pdf"

# Function to extract text from a PDF
def extract_pdf_text(pdf_path):
    all_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                all_text.append(text)
    return "\n".join(all_text)

# Extract English and Kiswahili texts
english_text = extract_pdf_text(english_pdf_path)
kiswahili_text = extract_pdf_text(kiswahili_pdf_path)

# Optional: Preview first 1000 characters
print("English Preview:\n", english_text[:1000])
print("\nKiswahili Preview:\n", kiswahili_text[:1000])

English Preview:
 LAWS OF KENYA
THE CONSTITUTION OF KENYA, 2010
Published by the National Council for Law Reporting
with the Authority of the Attorney-General
www.kenyalaw.org
Constitution of Kenya, 2010
THE CONSTITUTION OF KENYA, 2010
ARRANGEMENT OF ARTICLES
PREAMBLE
CHAPTER ONE—SOVEREIGNTY OF THE PEOPLE AND
SUPREMACY OF THIS CONSTITUTION
1—Sovereignty of the people.
2—Supremacy of this Constitution.
3—Defence of this Constitution.
CHAPTER TWO—THE REPUBLIC
4—Declaration of the Republic.
5—Territory of Kenya.
6—Devolution and access to services.
7—National, official and other languages.
8—State and religion.
9—National symbols and national days.
10—National values and principles of governance.
11—Culture.
CHAPTER THREE—CITIZENSHIP
12—Entitlements of citizens.
13—Retention and acquisition of citizenship.
14—Citizenship by birth.
15—Citizenship by registration.
16—Dual citizenship.
17—Revocation of citizenship.
18—Legislation on citizenship.
CHAPTER FOUR—THE BILL OF RIGHTS
PART 1—GENERAL

## Split text into articles


In [4]:
import re
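def split_articles(text, keyword="Article"):
    """
    Splits the Constitution text wherever "<keyword> <number>" appears.
    Returns a list of tuples: (matched line fragment, text up to the next match).

    Note: the pattern is not anchored to the start of a line, so it also
    matches in-text cross-references such as "Article 43, if the State claims that it".
    """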
    pattern = rf"({keyword} \d+.*?)\n"
    splits = re.split(pattern, text)
    
    articles = []
    for i in range(1, len(splits), 2):
        title = splits[i].strip()
        body = splits[i+1].strip() if i+1 < len(splits) else ""
        articles.append((title, body))
    return articles

english_articles = split_articles(english_text, keyword="Article")
kiswahili_articles = split_articles(kiswahili_text, keyword="Kifungu")  # Kiswahili keyword
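
An alternative worth considering is to split on the article numbers themselves. The sketch below assumes each article body starts on a line of the form `N. ` and that numbers increase by exactly one; `split_numbered_articles` is a hypothetical helper, and the assumption about the layout is drawn only from the extracted samples shown in this notebook.

```python
import re

def split_numbered_articles(text):
    """Split text at lines starting with 'N. ' where N is the next expected article number."""
    articles = []
    expected = 1
    current_number, current_lines = None, []
    for line in text.splitlines():
        match = re.match(r"(\d+)\.\s", line)
        if match and int(match.group(1)) == expected:
            # A new article starts: store the previous one first.
            if current_number is not None:
                articles.append((current_number, "\n".join(current_lines).strip()))
            current_number, current_lines = expected, [line]
            expected += 1
        elif current_number is not None:
            current_lines.append(line)
    if current_number is not None:
        articles.append((current_number, "\n".join(current_lines).strip()))
    return articles

# Invented sample text; the cross-reference to "Article 5" must not start a new article.
sample = (
    "1. (1) First clause text.\n"
    "(2) See Article 5 for details.\n"
    "2. Second article text.\n"
)
for number, body in split_numbered_articles(sample):
    print(number, repr(body))
```

Requiring the next expected number keeps in-text cross-references and numbered list items from starting a new article. Table-of-contents entries that use a dash after the number, as in the preview above, do not match either. A heading line printed just before a number would stay attached to the previous article and would need a further step.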

## Align English and Kiswahili

In [5]:
aligned_articles = []
min_len = min(len(english_articles), len(kiswahili_articles))

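# Pair segments by list position; this assumes both splits produced
# the same segments in the same order, which is not checked here.
for i in range(min_len):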
    eng_title, eng_text = english_articles[i]
    kis_title, kis_text = kiswahili_articles[i]
    aligned_articles.append({
        "Article/Section": eng_title,
        "Text_English": eng_text,
        "Text_Kiswahili": kis_text
    })

print(f"Aligned {len(aligned_articles)} articles successfully.")

Aligned 61 articles successfully.
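
Pairing by position only works if both splits yield the same segments in the same order. A more defensive option, sketched below, is to key each language's text by article number and join on that key. The dictionaries and the `align_by_number` helper are hypothetical, and the texts are placeholders.

```python
def align_by_number(english, kiswahili):
    """Pair texts that share an article number; also return numbers present in only one language."""
    common = sorted(english.keys() & kiswahili.keys())
    unmatched = sorted(english.keys() ^ kiswahili.keys())
    aligned = [
        {"Article": n, "Text_English": english[n], "Text_Kiswahili": kiswahili[n]}
        for n in common
    ]
    return aligned, unmatched

# Placeholder texts keyed by article number.
english_by_number = {1: "English text 1", 2: "English text 2", 3: "English text 3"}
kiswahili_by_number = {1: "Maandishi 1", 3: "Maandishi 3", 4: "Maandishi 4"}

aligned, unmatched = align_by_number(english_by_number, kiswahili_by_number)
print([row["Article"] for row in aligned])  # [1, 3]
print(unmatched)                            # [2, 4]
```

Reporting the unmatched numbers makes extraction gaps visible instead of silently shifting every later pair.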


## Convert to DataFrame

In [6]:
data = pd.DataFrame(aligned_articles)
data.head(5)

|   | Article/Section | Text_English | Text_Kiswahili |
|---|-----------------|--------------|----------------|
| 0 | Article 10. | Territory of Kenya.\n5. Kenya consists of the ... | hiyo.\ninaweza kutoa usaidizi, ikiwemo-\n(3) M... |
| 1 | Article 24. | Retention and acquisition of citizenship.\n13.... | (5) Hatua yoyote inayochukuliwa chini ya (4) i... |
| 2 | Article 14 (4), may be revoked if— | (a) the citizenship was acquired by fraud, fal... | uamuzi. (c) kuwa huru dhidi ya aina zote za gh... |
| 3 | Article 43, if the State claims that it | does not have the resources to implement the r... | (2) Kila mtu anayo haki ya kutaka kurekebishwa... |
| 4 | Article 43. | (3) All State organs and all public officers h... | (2) Haki hiyo inaendelea hadi kwa kutunga, kue... |


## Save Structured Dataset

In [7]:
data.to_csv("../Data/kenya_constitution_structured.csv", index=False)


# Step 3: Data Preparation

In this step, we will prepare the Kenya Constitution dataset for all downstream tasks of our AI Agent:

*Objectives:*
1. Load structured CSV containing English and Kiswahili articles.
2. Clean and normalize text:
   - Remove extra whitespaces, newlines, and tabs.
   - Remove page numbers and non-standard characters.
3. Create useful NLP features:
   - Word counts, sentence counts, character counts.
4. Build reusable pipelines:
   - Modular functions for cleaning, preprocessing, and vectorization.
5. Ensure alignment between English and Kiswahili articles.
6. Prepare the dataset for embeddings, retrieval, ML, and Deep Learning.

## Step 3.1: Load and Preview Dataset

We load the structured CSV containing the Constitution in English and Kiswahili.
We perform an initial preview to understand its structure and content.

In [8]:
# -------------------------------
# Import required libraries
# -------------------------------
import pandas as pd
import numpy as np
import re

# -------------------------------
# Load structured CSV
# -------------------------------
data_path = "../Data/kenya_constitution_structured.csv"
data = pd.read_csv(data_path)

# Preview first 5 rows
print(f"Dataset shape: {data.shape}")
data.head(5)

Dataset shape: (61, 3)


|   | Article/Section | Text_English | Text_Kiswahili |
|---|-----------------|--------------|----------------|
| 0 | Article 10. | Territory of Kenya.\n5. Kenya consists of the ... | hiyo.\ninaweza kutoa usaidizi, ikiwemo-\n(3) M... |
| 1 | Article 24. | Retention and acquisition of citizenship.\n13.... | (5) Hatua yoyote inayochukuliwa chini ya (4) i... |
| 2 | Article 14 (4), may be revoked if— | (a) the citizenship was acquired by fraud, fal... | uamuzi. (c) kuwa huru dhidi ya aina zote za gh... |
| 3 | Article 43, if the State claims that it | does not have the resources to implement the r... | (2) Kila mtu anayo haki ya kutaka kurekebishwa... |
| 4 | Article 43. | (3) All State organs and all public officers h... | (2) Haki hiyo inaendelea hadi kwa kutunga, kue... |


## Step 3.2: Text Cleaning and Normalization

We clean both English and Kiswahili text by:
1. Removing extra whitespaces, newlines, and tabs.
2. Stripping leading/trailing spaces.
3. Preparing the text for NLP tokenization and embeddings.

In [9]:
# -------------------------------
# Define cleaning function
# -------------------------------
def clean_text(text):
    """
    Cleans text by:
    - Replacing newlines and tabs with spaces
    - Collapsing runs of whitespace and trimming the ends
    Page numbers and non-standard characters are not removed here.
    """
    if pd.isna(text):
        return ""
    text = text.replace("\n", " ").replace("\t", " ")
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text

# Apply cleaning to both English and Kiswahili
data['Text_English'] = data['Text_English'].apply(clean_text)
data['Text_Kiswahili'] = data['Text_Kiswahili'].apply(clean_text)

# Preview cleaned text
data[['Text_English','Text_Kiswahili']].head(3)

|   | Text_English | Text_Kiswahili |
|---|--------------|----------------|
| 0 | Territory of Kenya. 5. Kenya consists of the t... | hiyo. inaweza kutoa usaidizi, ikiwemo- (3) Mtu... |
| 1 | Retention and acquisition of citizenship. 13. ... | (5) Hatua yoyote inayochukuliwa chini ya (4) i... |
| 2 | (a) the citizenship was acquired by fraud, fal... | uamuzi. (c) kuwa huru dhidi ya aina zote za gh... |
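
If non-standard characters also need handling, Unicode normalization is one option. The sketch below is an assumption about what "non-standard" could mean for PDF-extracted text (ligatures, non-breaking spaces, invisible format characters); `normalize_text` is a hypothetical helper, not part of the notebook's pipeline.

```python
import re
import unicodedata

def normalize_text(text):
    """Apply NFKC normalization, drop invisible format characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. ligature "fi" -> "fi", non-breaking space -> space
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cf":
            # Invisible format characters such as soft hyphens and zero-width spaces.
            continue
        # Control characters (including newline and tab) become plain spaces.
        out.append(" " if category == "Cc" else ch)
    return re.sub(r"\s+", " ", "".join(out)).strip()

# Invented sample with a ligature, a non-breaking space, a zero-width space, and a newline.
print(normalize_text("of\ufb01ce\u00a0of  the\u200b State\n"))  # office of the State
```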


## Step 3.3: Feature Engineering

We create new features from the text:
- **Word count** for English and Kiswahili articles.
- **Character count** for each article.
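- **Sentence count**, approximated by splitting on `.`, `!`, and `?`. Clause numbers such as "5." also end in a full stop, so the count is only a rough estimate.

These features will help in EDA, statistics, and as potential ML features.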

In [10]:
# -------------------------------
# Word counts
# -------------------------------
data['English_Word_Count'] = data['Text_English'].apply(lambda x: len(x.split()))
data['Kiswahili_Word_Count'] = data['Text_Kiswahili'].apply(lambda x: len(x.split()))

# -------------------------------
# Character counts
# -------------------------------
data['English_Char_Count'] = data['Text_English'].apply(len)
data['Kiswahili_Char_Count'] = data['Text_Kiswahili'].apply(len)

# -------------------------------
# Sentence counts
# -------------------------------
data['English_Sentence_Count'] = data['Text_English'].apply(lambda x: len(re.split(r'[.!?]', x)))
data['Kiswahili_Sentence_Count'] = data['Text_Kiswahili'].apply(lambda x: len(re.split(r'[.!?]', x)))

# Preview dataset with new features
data.head(5)

|   | Article/Section | Text_English | Text_Kiswahili | English_Word_Count | Kiswahili_Word_Count | English_Char_Count | Kiswahili_Char_Count | English_Sentence_Count | Kiswahili_Sentence_Count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Article 10. | Territory of Kenya. 5. Kenya consists of the t... | hiyo. inaweza kutoa usaidizi, ikiwemo- (3) Mtu... | 627 | 81 | 4047 | 519 | 36 | 4 |
| 1 | Article 24. | Retention and acquisition of citizenship. 13. ... | (5) Hatua yoyote inayochukuliwa chini ya (4) i... | 576 | 1134 | 3311 | 7010 | 26 | 32 |
| 2 | Article 14 (4), may be revoked if— | (a) the citizenship was acquired by fraud, fal... | uamuzi. (c) kuwa huru dhidi ya aina zote za gh... | 435 | 1419 | 2634 | 8926 | 16 | 56 |
| 3 | Article 43, if the State claims that it | does not have the resources to implement the r... | (2) Kila mtu anayo haki ya kutaka kurekebishwa... | 170 | 112 | 1083 | 651 | 5 | 5 |
| 4 | Article 43. | (3) All State organs and all public officers h... | (2) Haki hiyo inaendelea hadi kwa kutunga, kue... | 444 | 2203 | 2715 | 13852 | 13 | 70 |
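
Splitting on punctuation counts clause numbers such as "5." and the empty string after the final full stop. A slightly stricter count, sketched below, keeps only segments that contain at least one letter. `count_sentences` is a hypothetical helper and still only a heuristic for legal text.

```python
import re

def count_sentences(text):
    """Count punctuation-delimited segments that contain at least one letter."""
    segments = re.split(r"[.!?]+", text)
    # [^\W\d_] matches any Unicode letter, so bare numbers and empty tails are skipped.
    return sum(1 for segment in segments if re.search(r"[^\W\d_]", segment))

sample = "Territory of Kenya. 5. Kenya consists of the territory."
print(len(re.split(r"[.!?]", sample)))  # 4 with the simple split
print(count_sentences(sample))          # 2 with the letter filter
```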


## Step 3.4: Check for Missing Data and Alignment

Check for missing values and confirm that both text columns have the same number of rows. This check does not verify that each row holds the same article in both languages.

In [11]:
# -------------------------------
# Check for missing values
# -------------------------------
missing_data = data.isnull().sum()
print("Missing values in dataset:\n", missing_data)

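# -------------------------------
# Compare row counts of the two text columns
# (always equal within one DataFrame; this does not check
# that row i holds the same article in both languages)
# -------------------------------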
if len(data['Text_English']) != len(data['Text_Kiswahili']):
    print("Warning: Mismatch between English and Kiswahili articles!")
else:
    print("English and Kiswahili articles are aligned correctly.")

Missing values in dataset:
 Article/Section             0
Text_English                0
Text_Kiswahili              0
English_Word_Count          0
Kiswahili_Word_Count        0
English_Char_Count          0
Kiswahili_Char_Count        0
English_Sentence_Count      0
Kiswahili_Sentence_Count    0
dtype: int64
English and Kiswahili articles are aligned correctly.
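
A content-level sanity check could compare the word counts of each pair. The sketch below flags rows whose Kiswahili-to-English ratio falls outside a band. `flag_length_mismatch` and the 0.5 to 2.0 band are assumptions chosen for illustration, not calibrated values. The input pairs are the word counts of the first five rows shown in Step 3.3.

```python
def flag_length_mismatch(pairs, low=0.5, high=2.0):
    """Return indices of (english_words, kiswahili_words) pairs whose ratio lies outside [low, high]."""
    flagged = []
    for index, (english_words, kiswahili_words) in enumerate(pairs):
        ratio = kiswahili_words / english_words if english_words else float("inf")
        if not low <= ratio <= high:
            flagged.append(index)
    return flagged

# (English_Word_Count, Kiswahili_Word_Count) for rows 0-4 of the Step 3.3 output.
pairs = [(627, 81), (576, 1134), (435, 1419), (170, 112), (444, 2203)]
print(flag_length_mismatch(pairs))  # [0, 2, 4]
```

Flagged rows are candidates for manual review. A ratio inside the band does not prove that the two texts are translations of each other.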


## Step 3.5: Save Cleaned Dataset

We save the cleaned and feature-engineered dataset to CSV.  
This dataset will be used across EDA, visualization, ML, and Deep Learning pipelines.

In [12]:
cleaned_data_path = "../Data/kenya_constitution_prepared.csv"
data.to_csv(cleaned_data_path, index=False)
print(f"Cleaned dataset saved to {cleaned_data_path}")

Cleaned dataset saved to ../Data/kenya_constitution_prepared.csv


# Step 4: Exploratory Data Analysis (EDA)

In this step, we aim to understand the structure and content of the cleaned Kenyan Constitution dataset. 
We will explore both English and Kiswahili text to identify:
- Number of articles/sections
- Distribution of text length per article
- Common words and phrases
- Coverage of topics across the Constitution

This helps us identify potential preprocessing needs and informs the feature engineering and NLP pipeline.
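
For the common-words question, a frequency count is a reasonable starting point. The sketch below uses invented sample sentences and a hypothetical `top_words` helper; the minimum word length is a crude stand-in for a stop-word list, which would need separate English and Kiswahili versions.

```python
import re
from collections import Counter

def top_words(texts, n=3, min_length=4):
    """Return the n most frequent lowercase words with at least min_length letters."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[^\W\d_]+", text.lower())  # runs of Unicode letters
        counts.update(word for word in words if len(word) >= min_length)
    return counts.most_common(n)

# Invented sample sentences for illustration.
texts = [
    "Every person has the right to life.",
    "Every citizen has the right to vote.",
    "Every person is equal before the law.",
]
print(top_words(texts))
```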

In [14]:
# Import required libraries
import pandas as pd

# Load the cleaned dataset
cleaned_data_path = "../Data/kenya_constitution_prepared.csv"
data = pd.read_csv(cleaned_data_path)

# Overview of the dataset
print("Dataset shape:", data.shape)
print("Columns:", data.columns)
data.head(5)

Dataset shape: (61, 9)
Columns: Index(['Article/Section', 'Text_English', 'Text_Kiswahili',
       'English_Word_Count', 'Kiswahili_Word_Count', 'English_Char_Count',
       'Kiswahili_Char_Count', 'English_Sentence_Count',
       'Kiswahili_Sentence_Count'],
      dtype='object')


|   | Article/Section | Text_English | Text_Kiswahili | English_Word_Count | Kiswahili_Word_Count | English_Char_Count | Kiswahili_Char_Count | English_Sentence_Count | Kiswahili_Sentence_Count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Article 10. | Territory of Kenya. 5. Kenya consists of the t... | hiyo. inaweza kutoa usaidizi, ikiwemo- (3) Mtu... | 627 | 81 | 4047 | 519 | 36 | 4 |
| 1 | Article 24. | Retention and acquisition of citizenship. 13. ... | (5) Hatua yoyote inayochukuliwa chini ya (4) i... | 576 | 1134 | 3311 | 7010 | 26 | 32 |
| 2 | Article 14 (4), may be revoked if— | (a) the citizenship was acquired by fraud, fal... | uamuzi. (c) kuwa huru dhidi ya aina zote za gh... | 435 | 1419 | 2634 | 8926 | 16 | 56 |
| 3 | Article 43, if the State claims that it | does not have the resources to implement the r... | (2) Kila mtu anayo haki ya kutaka kurekebishwa... | 170 | 112 | 1083 | 651 | 5 | 5 |
| 4 | Article 43. | (3) All State organs and all public officers h... | (2) Haki hiyo inaendelea hadi kwa kutunga, kue... | 444 | 2203 | 2715 | 13852 | 13 | 70 |


## Step 4.1: Text Length Analysis

Understanding the length of articles helps us identify:
- Very short or very long articles
- Articles that may need splitting or combining
- Distribution differences between English and Kiswahili versions

In [15]:
# Add text length columns
data['Length_English'] = data['Text_English'].apply(lambda x: len(str(x).split()))
data['Length_Kiswahili'] = data['Text_Kiswahili'].apply(lambda x: len(str(x).split()))

# Basic statistics
print(data[['Length_English', 'Length_Kiswahili']].describe())

       Length_English  Length_Kiswahili
count       61.000000         61.000000
mean       292.950820        801.426230
std        453.638565       1431.641557
min          1.000000          8.000000
25%         45.000000        123.000000
50%        170.000000        337.000000
75%        411.000000        835.000000
max       3160.000000       9713.000000


### Step 4.1: Text Length Analysis – Interpretation

We analyzed the number of words per article in both the English and Kiswahili versions of the Kenyan Constitution. Here are the key insights:

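- **Number of Rows:** The dataset has 61 rows. The Constitution itself has 264 articles, so each row is a text segment produced by the splitting step rather than one complete article.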
  
- **Mean Length:**
  - English articles have an average length of approximately **293 words**.
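  - Kiswahili articles are longer on average, with approximately **801 words** per article. A gap this large is unlikely to come from translation alone: rows were paired by position, not by article number, so many rows probably hold different provisions in each language (row 0 pairs 627 English words with 81 Kiswahili words).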

- **Variability (Standard Deviation):**
  - English: 454 words  
  - Kiswahili: 1,432 words  
  - The Kiswahili text has much higher variability, meaning some sections are extremely long compared to others.

- **Minimum & Maximum:**
  - The shortest English article has **1 word** (most likely a fragment left by the splitting step rather than a real article), while the shortest Kiswahili article has **8 words**.  
  - The longest English article has **3,160 words**, whereas the longest Kiswahili article has **9,713 words**, showing a wide range in section lengths.

- **Quartiles:**
  - English 25th percentile: 45 words, 50th percentile (median): 170 words, 75th percentile: 411 words  
  - Kiswahili 25th percentile: 123 words, 50th percentile: 337 words, 75th percentile: 835 words  
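  - At every quartile the Kiswahili entries are longer than the English ones. Because rows are paired by position, this compares the two length distributions and does not show that each Kiswahili article is longer than its English counterpart.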

**Implications for NLP:**
- Models need to handle a wide range of text lengths, especially for the Kiswahili version.  
- Extremely long articles may require splitting or special handling in embeddings or LLM input.  
- Very short sections may need context aggregation for meaningful responses from the AI agent.
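
One way to handle the very long entries is to split them into overlapping word windows before creating embeddings. The sketch below is a minimal version of that idea; `chunk_words` and the default sizes are assumptions, and real limits depend on the embedding model chosen later.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words, each sharing `overlap` words with the previous one."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# Invented 450-word text: w0 w1 ... w449
long_text = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(long_text)
print(len(chunks))                      # 3
print([len(c.split()) for c in chunks]) # [200, 200, 90]
```

The overlap keeps a clause that straddles a boundary visible in both neighbouring chunks. Each chunk would still need to carry its article reference so that answers can cite it.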