# Text Cleaning and Preprocessing Pipeline for **PDF Text Data**

This notebook extracts and preprocesses textual data from PDF files using **PyMuPDF** for extraction and **spaCy** for processing and cleaning.  
The cleaned text is organized and exported for use in NLP tasks such as keyword extraction, topic modeling, and qualitative validation.


## Step 1: Import & Load Dependencies

We import the main Python libraries used throughout the notebook (`pathlib` is imported separately in Step 2). These include:

- `fitz` from PyMuPDF – for PDF reading and text extraction.
- `spacy` – for natural language processing and text cleaning.
- `pandas` – for working with structured tabular data.


In [1]:
import spacy
import fitz
import pandas as pd

## Step 2: List the PDF Files to Process

In this step, we use Python’s `pathlib.Path` to navigate the file system and locate all PDF files within the target dataset folder.

The `Path` object provides an easy and readable way to:
- Set the directory to search (e.g., `Path("../1_datasets")`).
- Recursively search for all files ending in `.pdf` using `.rglob("*.pdf")`.

For each PDF file found, we create a dictionary containing:
- The file name,
- The full file path (as a string),
- The theme (based on directory name).

This structured list of dictionaries allows us to track the necessary information and apply consistent preprocessing in the following steps.


In [2]:
from pathlib import Path

# Dataset folder, located one level above the current working directory
datasets_folder = Path.cwd().parent / "1_datasets"
raw_text_data = []

# Recursively search for all PDF files in subdirectories
for pdf_path in datasets_folder.rglob("*.pdf"):
    raw_text_data.append(
        {
            "name": pdf_path.stem,  # filename without extension
            "theme": pdf_path.parent.name,  # name of the immediate parent folder
            "path": str(pdf_path),  # path to pdf file for later processing
        }
    )

raw_text_data

[{'name': 'Access Denied',
  'theme': 'barriers_to_access',
  'path': '/Volumes/Adata su650 512gb/C/karim/ET6-CDSP-group-24-repo/1_datasets/barriers_to_access/Access Denied.pdf'},
 {'name': 'Accessibility to digital technology',
  'theme': 'barriers_to_access',
  'path': '/Volumes/Adata su650 512gb/C/karim/ET6-CDSP-group-24-repo/1_datasets/barriers_to_access/Accessibility to digital technology.pdf'},
 {'name': 'Bridging the digital divide',
  'theme': 'barriers_to_access',
  'path': '/Volumes/Adata su650 512gb/C/karim/ET6-CDSP-group-24-repo/1_datasets/barriers_to_access/Bridging the digital divide.pdf'},
 {'name': 'ICT and Disability',
  'theme': 'barriers_to_access',
  'path': '/Volumes/Adata su650 512gb/C/karim/ET6-CDSP-group-24-repo/1_datasets/barriers_to_access/ICT and Disability.pdf'},
 {'name': 'Mobile Disability Gap Report 2021',
  'theme': 'barriers_to_access',
  'path': '/Volumes/Adata su650 512gb/C/karim/ET6-CDSP-group-24-repo/1_datasets/barriers_to_access/Mobile Disability G
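
To see what `.stem` and `.parent.name` return without touching the real dataset, the lookup above can be tried on a throwaway directory. This is a minimal sketch: `list_pdfs`, the folder names, and the placeholder files are hypothetical and exist only for illustration (the files are empty, not real PDFs).

```python
import tempfile
from pathlib import Path


def list_pdfs(root):
    """Return one record per PDF found anywhere below `root`."""
    records = []
    # sorted() gives a stable order; rglob itself makes no ordering promise
    for pdf_path in sorted(Path(root).rglob("*.pdf")):
        records.append(
            {
                "name": pdf_path.stem,          # filename without ".pdf"
                "theme": pdf_path.parent.name,  # immediate parent folder
                "path": str(pdf_path),          # stored as a plain string
            }
        )
    return records


# Build a tiny throwaway tree with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "theme_a").mkdir()
    (root / "theme_b").mkdir()
    (root / "theme_a" / "First report.pdf").touch()
    (root / "theme_b" / "Second report.pdf").touch()
    (root / "theme_b" / "notes.txt").touch()  # ignored: not a PDF

    demo_records = list_pdfs(root)

print([(r["name"], r["theme"]) for r in demo_records])
# [('First report', 'theme_a'), ('Second report', 'theme_b')]
```

Sorting the paths is optional, but it keeps the record order reproducible across machines and file systems.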

## Step 3: Extract Raw Text from Each PDF

Using `PyMuPDF`, we extract raw textual content from each PDF, page by page. This is done through a helper function that:

- Opens the PDF file
- Iterates through its pages
- Extracts plain text content from each page
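- Combines the page texts into a single string per document

The helper only returns the raw text. It is passed to the cleaning function in Step 5, and the cleaned result is what gets stored in the list of dictionaries under the `'cleaned_text'` key.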


In [3]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts all text from a PDF file using PyMuPDF.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the PDF.
    """
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text
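
The page-by-page accumulation can also be written with `str.join`, which avoids repeated string concatenation on long documents. The sketch below is one possible variant, not the notebook's method: `join_page_texts` and the `separator` parameter are assumptions, and `FakePage` is a hypothetical stand-in so the example runs without a PDF. Any iterable of objects exposing `get_text()` works, which is how a PyMuPDF document behaves when iterated.

```python
def join_page_texts(pages, separator="\n"):
    """Concatenate the text of each page, in order.

    The separator is an assumption of this sketch; the notebook's helper
    concatenates page texts directly with no separator.
    """
    return separator.join(page.get_text() for page in pages)


class FakePage:
    """Hypothetical stand-in for a PDF page, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


demo_text = join_page_texts([FakePage("first page"), FakePage("second page")])
print(repr(demo_text))  # 'first page\nsecond page'
```

An explicit separator guards against the last word of one page fusing with the first word of the next when a page's text does not end in whitespace.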

## Step 4: Clean and Normalize Extracted Text Using spaCy

In this step, we use `spaCy`'s English language model (`en_core_web_sm`) to clean and normalize the raw text extracted from the PDFs.

Each text entry is processed through a custom function that:
- Tokenizes the text into individual words and symbols.
- Removes stopwords (common words like "the", "and", "of" that carry little meaning).
- Filters out punctuation, whitespace-only tokens, URLs, and email addresses.
- Drops numeric tokens shorter than three digits (for example, "21" is removed while "2021" is kept).
- Lemmatizes the remaining tokens and converts them to lowercase for uniformity.

The result is a cleaned version of the text that preserves meaningful words, which is better suited for NLP tasks such as keyword extraction, topic modeling, or qualitative analysis.


In [4]:
nlp = spacy.load("en_core_web_sm")


def clean_text(text):
    doc = nlp(text)
    return " ".join(
        token.lemma_.lower()
        for token in doc
        if not token.is_stop
        and not token.is_punct
        and not token.like_url
        and not token.like_email
        and not token.is_space
        and not (token.is_digit and len(token.text) < 3)
    )
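
The filter conditions in `clean_text` can be isolated into a single predicate, which makes the digit rule easier to check. This is a sketch under stated assumptions: `keep_token` and `fake_token` are hypothetical names, and the tokens are `SimpleNamespace` stand-ins that mimic the spaCy `Token` attributes used above, so the example runs without loading a language model.

```python
from types import SimpleNamespace


def keep_token(token):
    """Mirror the filter conditions used in clean_text."""
    return not (
        token.is_stop
        or token.is_punct
        or token.like_url
        or token.like_email
        or token.is_space
        or (token.is_digit and len(token.text) < 3)
    )


def fake_token(text, **flags):
    """Hypothetical stand-in for a spaCy Token; unset flags default to False."""
    defaults = dict.fromkeys(
        ["is_stop", "is_punct", "like_url", "like_email", "is_space", "is_digit"],
        False,
    )
    defaults.update(flags)
    return SimpleNamespace(text=text, **defaults)


tokens = [
    fake_token("the", is_stop=True),
    fake_token("report"),
    fake_token("21", is_digit=True),    # dropped: fewer than 3 digits
    fake_token("2021", is_digit=True),  # kept: 4 digits
    fake_token(",", is_punct=True),
]
kept = [t.text for t in tokens if keep_token(t)]
print(kept)  # ['report', '2021']
```

Note that the predicate does not require tokens to be alphabetic, so longer numbers such as years pass through to the cleaned text.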

## Step 5: Store Cleaned Text into the Dataset

Here we iterate through the list of PDF records and apply the text cleaning function from Step 4. The cleaned output is added as a new key-value pair (`'cleaned_text'`) to each dictionary.

Only the cleaned text is stored in each record. If extraction or cleaning fails for a file, its `'cleaned_text'` is set to `None` and the error is printed.


In [5]:
for item in raw_text_data:
    try:
        # Extract the raw text, then clean it once
        raw = extract_text_from_pdf(item["path"])
        item["cleaned_text"] = clean_text(raw)
    except Exception as e:
        item["cleaned_text"] = None
        print(f"Error with {item['path']}: {e}")
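
When many files are processed, it can help to collect failures in a list rather than only printing them. The following is a minimal sketch of that idea, assuming the extraction and cleaning functions are passed in as arguments; `process_records`, `fake_extract`, and the file names are hypothetical and let the loop run without real PDFs or a spaCy model.

```python
def process_records(records, extract, clean):
    """Add 'cleaned_text' to each record and return the paths that failed.

    `extract` and `clean` are injected so the loop can be exercised
    with simple stand-in functions.
    """
    failed = []
    for item in records:
        try:
            raw = extract(item["path"])
            item["cleaned_text"] = clean(raw)  # raw text is cleaned a single time
        except Exception as exc:
            item["cleaned_text"] = None
            failed.append((item["path"], str(exc)))
    return failed


def fake_extract(path):
    """Hypothetical extractor that fails for one specific file name."""
    if path == "broken.pdf":
        raise ValueError("cannot open file")
    return "Some RAW text"


demo = [{"path": "ok.pdf"}, {"path": "broken.pdf"}]
failures = process_records(demo, fake_extract, str.lower)
print(demo[0]["cleaned_text"])  # some raw text
print(failures)                 # [('broken.pdf', 'cannot open file')]
```

In the notebook, the call would be `process_records(raw_text_data, extract_text_from_pdf, clean_text)`, and the returned list shows at a glance which documents need attention.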

## Step 6: Convert Data into a Structured DataFrame

We now convert the list of cleaned records into a pandas DataFrame for easier handling and export. Only the `name`, `theme`, and `cleaned_text` columns are kept; the file path is left out.

In [6]:
cleaned_datasets_df = pd.DataFrame(
    raw_text_data, columns=["name", "theme", "cleaned_text"]
)

cleaned_datasets_df

|   | name | theme | cleaned_text |
|---|------|-------|--------------|
| 0 | Access Denied | barriers_to_access | telecom operators africa fail person disabilit... |
| 1 | Accessibility to digital technology | barriers_to_access | term condition access use find assistive techn... |
| 2 | Bridging the digital divide | barriers_to_access | vol.:(0123456789 discover global society 2025 ... |
| 3 | ICT and Disability | barriers_to_access | information communication technology ict disab... |
| 4 | Mobile Disability Gap Report 2021 | barriers_to_access | mobile disability gap report 2021 mobile disab... |
| 5 | Social protection and access to assistive tech... | barriers_to_access | term condition access use find assistive techn... |
| 6 | Strengthening ICT Accessibility for PWDs in Af... | barriers_to_access | strengthen ict accessibility pwd africa post n... |
| 7 | The Millions of Nigerians Whose Use of Technol... | barriers_to_access | assistive tech leave million nigerian use tech... |
| 8 | The State of Digital Inclusion in Africa | barriers_to_access | state digital inclusion africa challenge disab... |
| 9 | Akinola Opeolu | case_studies | blind tech ceo return inspire student pacelli ... |


## Step 7: Export Cleaned Data to CSV

The final cleaned DataFrame is exported as `cleaned_datasets.csv` to the `1_datasets/processed_data` directory.

It serves as the starting point for downstream NLP workflows like keyword extraction, topic modeling, or annotation.


In [9]:
cleaned_datasets_df.to_csv(
    "../1_datasets/processed_data/cleaned_datasets.csv", index=False
)
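
The export fails if the `processed_data` folder does not exist yet, because `to_csv` does not create missing parent folders. A small helper can create the folder first. This is a sketch, with `prepare_output_path` as a hypothetical name; it is demonstrated on a temporary directory rather than the real dataset folder.

```python
import tempfile
from pathlib import Path


def prepare_output_path(base_dir, filename="cleaned_datasets.csv"):
    """Create `base_dir` if needed and return the full CSV path."""
    out_dir = Path(base_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out_dir / filename


with tempfile.TemporaryDirectory() as tmp:
    out_path = prepare_output_path(Path(tmp) / "1_datasets" / "processed_data")
    dir_created = out_path.parent.is_dir()
    # In the notebook this would be followed by:
    # cleaned_datasets_df.to_csv(out_path, index=False)

print(out_path.name, dir_created)  # cleaned_datasets.csv True
```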

## Next Steps
- Use the cleaned CSV for topic modeling, keyword extraction, or qualitative coding.
