# Keywords-in-Context (KWIC) Analysis by Theme

This notebook presents a **Keywords-in-Context (KWIC)** analysis across a collection of documents categorized into different thematic areas. The goal is to examine how the most relevant keywords for each theme are used within the actual textual context of the documents.

### 🔍 Objectives

- Analyze keyword usage contextually within each theme.
- Surface insights about how critical terms are framed in the data.


### 🧪 Analysis Workflow

1. **Load and extract text** from PDF files grouped by theme.
2. **Preprocess** text by removing non-ASCII characters to ensure clean NLP tokenization.
3. **Load top keywords** per theme (previously identified).
4. **Perform KWIC analysis** using spaCy to extract left and right context around each keyword.
5. **Display results** in a structured table showing:
   - Left context
   - Keyword
   - Right context

### 📁 Themes Covered

- Barriers to Access  
- Digital Infrastructure  
- Inclusive Digital Technology  
- Tech Ecosystem  
- Case Studies

---

By comparing how key terms appear across different themes, this notebook enables deeper insights into language patterns and framing within the dataset.


### Importing Required Libraries
In this step, we import essential Python libraries:
- `pandas` for data manipulation,
- `fitz` (PyMuPDF) to extract text from PDF documents,
- `spacy` for natural language processing (NLP) tasks like tokenization.

In [1]:
import pandas as pd
import fitz
import spacy

### Loading the spaCy Language Model
We load the small English spaCy model `en_core_web_sm`, which will be used to tokenize our text and identify keywords in context.


In [2]:
nlp = spacy.load("en_core_web_sm")

### Reading PDF Files from Dataset
Here, we define the dataset folder structure and collect the text data from all PDF files across different thematic folders.
Each file is tagged with:
- its name (filename),
- the theme it belongs to (parent folder name),
- and the path for accessing its contents.

In [3]:
from pathlib import Path

# Relative path to the dataset folder
datasets_folder = Path.cwd().parent / "1_datasets"
raw_text_data = []

# Recursively search for all PDF files in subdirectories
for pdf_path in datasets_folder.rglob("*.pdf"):
    raw_text_data.append(
        {
            "name": pdf_path.stem,  # filename without extension
            "theme": pdf_path.parent.name,  # name of the immediate parent folder
            "path": str(pdf_path),  # path to pdf file for later processing
        }
    )

raw_text_data

[{'name': 'Access Denied',
  'theme': 'barriers_to_access',
  'path': '/Volumes/Adata su650 512gb/C/karim/ET6-CDSP-group-24-repo/1_datasets/barriers_to_access/Access Denied.pdf'},
 {'name': 'Accessibility to digital technology',
  'theme': 'barriers_to_access',
  'path': '/Volumes/Adata su650 512gb/C/karim/ET6-CDSP-group-24-repo/1_datasets/barriers_to_access/Accessibility to digital technology.pdf'},
 {'name': 'Bridging the digital divide',
  'theme': 'barriers_to_access',
  'path': '/Volumes/Adata su650 512gb/C/karim/ET6-CDSP-group-24-repo/1_datasets/barriers_to_access/Bridging the digital divide.pdf'},
 {'name': 'ICT and Disability',
  'theme': 'barriers_to_access',
  'path': '/Volumes/Adata su650 512gb/C/karim/ET6-CDSP-group-24-repo/1_datasets/barriers_to_access/ICT and Disability.pdf'},
 {'name': 'Mobile Disability Gap Report 2021',
  'theme': 'barriers_to_access',
  'path': '/Volumes/Adata su650 512gb/C/karim/ET6-CDSP-group-24-repo/1_datasets/barriers_to_access/Mobile Disability G
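
To make the tagging rule concrete, the sketch below applies `stem` and `parent.name` to a single path. The path is a shortened, hypothetical stand-in for the entries listed above, and `PurePosixPath` is used only so the example runs without touching the file system.

```python
from pathlib import PurePosixPath

# Hypothetical path, shortened from the listing above for illustration
pdf_path = PurePosixPath("1_datasets/barriers_to_access/Access Denied.pdf")

record = {
    "name": pdf_path.stem,          # filename without the .pdf extension
    "theme": pdf_path.parent.name,  # immediate parent folder
    "path": str(pdf_path),
}
print(record)
```

Because `rglob` searches every subfolder, a PDF nested one level deeper than its theme folder would be tagged with the deeper folder's name rather than the theme.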

### Extracting Text from PDFs
This function, `extract_text_from_pdf()`, uses PyMuPDF to extract all text from a given PDF file, page by page.
It will later be applied to each file collected from the dataset.

In [4]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts all text from a PDF file using PyMuPDF.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the PDF.
    """
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

### Applying Text Extraction
We apply the text extraction function to each file in our dataset.
If a PDF cannot be processed (e.g., because it is corrupted), we set its text to `None` and print the file path with the error message, so the loop continues with the remaining files.

In [6]:
for item in raw_text_data:
    try:
        raw = extract_text_from_pdf(item["path"])
        item["text"] = raw
    except Exception as e:
        item["text"] = None
        print(f"Error with {item['path']}: {e}")
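
If many files fail, a list of failures can be easier to review than scattered print output. The sketch below is one possible way to collect them. The helper `extract_all` and the stand-in `fake_extractor` are hypothetical names introduced here; the stand-in replaces `extract_text_from_pdf()` so the example runs without any PDF files.

```python
def extract_all(items, extractor):
    """Fill item["text"] for each item and return the paths that failed."""
    failed = []
    for item in items:
        try:
            item["text"] = extractor(item["path"])
        except Exception as error:  # same broad catch as the notebook cell
            item["text"] = None
            failed.append((item["path"], str(error)))
    return failed


def fake_extractor(path):
    # Stand-in for extract_text_from_pdf(); fails on one made-up path
    if path.endswith("broken.pdf"):
        raise ValueError("cannot open file")
    return f"text of {path}"


items = [{"path": "a.pdf"}, {"path": "broken.pdf"}]
failed = extract_all(items, fake_extractor)
print(failed)
```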

### Creating Theme-based DataFrames
Now we create a `DataFrame` from all the text data and split it into separate DataFrames according to their thematic category.
This allows us to conduct KWIC analysis independently for each theme, such as:
- Barriers to Access
- Digital Infrastructure
- Inclusive Digital Technology
- Tech Ecosystem
- Case Studies

In [7]:
# create a DataFrame from the raw text data
text_datasets_df = pd.DataFrame(raw_text_data, columns=["name", "theme", "text"])

# Create separate dataframes for each thematic category
barrier_to_access = text_datasets_df[text_datasets_df["theme"] == "barriers_to_access"]
digital_infrastructure = text_datasets_df[
    text_datasets_df["theme"] == "digital_infrastructure"
]
inclusive_digital_technology = text_datasets_df[
    text_datasets_df["theme"] == "inclusive_digital_technology"
]
tech_ecosystem = text_datasets_df[text_datasets_df["theme"] == "tech_ecosystem"]
case_studies = text_datasets_df[
    text_datasets_df["theme"] == "case_studies"
]  # Documents about specific case studies
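
The five filters above follow one pattern, so the split can also be written as a single grouping step. The sketch below shows the idea in plain Python on records shaped like `raw_text_data`. The function `group_by_theme` and the sample records (including the made-up "Report A") are illustrative and not part of the notebook.

```python
from collections import defaultdict


def group_by_theme(records):
    """Return {theme: [records with that theme]}, keeping input order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["theme"]].append(record)
    return dict(grouped)


sample = [
    {"name": "Access Denied", "theme": "barriers_to_access", "text": "..."},
    {"name": "ICT and Disability", "theme": "barriers_to_access", "text": "..."},
    {"name": "Report A", "theme": "case_studies", "text": "..."},
]
by_theme = group_by_theme(sample)
print({theme: len(docs) for theme, docs in by_theme.items()})
```

With pandas, `text_datasets_df.groupby("theme")` offers the same grouping directly on the DataFrame.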

### Loading Top Keywords by Theme
We read a CSV file containing the **top 20 extracted keywords** for each theme.
Then, we split these keywords into separate DataFrames by theme to match the document groups prepared earlier.

In [8]:
# Load the top 20 keywords for each thematic category
per_theme_keywords = pd.read_csv(
    "../1_datasets/processed_data/per_theme_top_keywords.csv",
)

# create separate DataFrames for top keywords for each theme
barrier_to_access_keywords = per_theme_keywords[
    per_theme_keywords["theme"] == "barriers_to_access"
]
digital_infrastructure_keywords = per_theme_keywords[
    per_theme_keywords["theme"] == "digital_infrastructure"
]
inclusive_digital_technology_keywords = per_theme_keywords[
    per_theme_keywords["theme"] == "inclusive_digital_technology"
]
tech_ecosystem_keywords = per_theme_keywords[
    per_theme_keywords["theme"] == "tech_ecosystem"
]
case_studies_keywords = per_theme_keywords[
    per_theme_keywords["theme"] == "case_studies"
]
barrier_to_access_keywords.head(20)

| | theme | keyword | score |
|---|---|---|---|
| 0 | barriers_to_access | access | 0.348421 |
| 1 | barriers_to_access | accessibility | 0.301716 |
| 2 | barriers_to_access | accessible | 0.144786 |
| 3 | barriers_to_access | assistive | 0.126104 |
| 4 | barriers_to_access | barrier | 0.124236 |
| 5 | barriers_to_access | communication | 0.137313 |
| 6 | barriers_to_access | country | 0.122368 |
| 7 | barriers_to_access | device | 0.119565 |
| 8 | barriers_to_access | digital | 0.492273 |
| 9 | barriers_to_access | gap | 0.119565 |
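
The rows shown above appear to be ordered alphabetically by keyword rather than by score. The sketch below shows one way to build a per-theme keyword list ranked by score using only the standard library. The inline CSV is a stand-in made of three rows copied from the output above; it assumes the columns `theme`, `keyword`, and `score`, and the real file has more rows and may carry an extra index column.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the output above, standing in for the real file
sample_csv = io.StringIO(
    "theme,keyword,score\n"
    "barriers_to_access,access,0.348421\n"
    "barriers_to_access,accessibility,0.301716\n"
    "barriers_to_access,digital,0.492273\n"
)

keywords_by_theme = defaultdict(list)
for row in csv.DictReader(sample_csv):
    keywords_by_theme[row["theme"]].append((row["keyword"], float(row["score"])))

# Rank each theme's keywords from highest to lowest score
for pairs in keywords_by_theme.values():
    pairs.sort(key=lambda pair: pair[1], reverse=True)

ranked = [keyword for keyword, _ in keywords_by_theme["barriers_to_access"]]
print(ranked)
```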


### Helper Function to Clean Text

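This helper function removes every non-ASCII character from the text, so symbols such as bullets and typographic quotes left over from PDF extraction do not end up as stray tokens. Accented letters are dropped as well rather than converted to their unaccented forms.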

In [9]:
def remove_non_ascii(text):
    return text.encode("ascii", errors="ignore").decode("ascii")
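
The sketch below shows what this cleaning does to accented text and contrasts it with a possible alternative. `fold_to_ascii` is a hypothetical helper, not part of the notebook: it decomposes characters with `unicodedata.normalize` before dropping what remains outside ASCII, which keeps the base letters.

```python
import unicodedata


def remove_non_ascii(text):
    # Same helper as in the notebook: non-ASCII characters are deleted
    return text.encode("ascii", errors="ignore").decode("ascii")


def fold_to_ascii(text):
    # Hypothetical alternative: split accented letters into base letter
    # plus combining mark, then drop whatever is still non-ASCII
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


sample = "caf\u00e9 na\u00efve"  # "café naïve"
print(remove_non_ascii(sample))  # caf nave
print(fold_to_ascii(sample))     # cafe naive
```

Which behavior is preferable depends on the corpus: deleting characters can turn a word into a different token, while folding changes the spelling but keeps the word recognizable.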

### Keywords In Context (KWIC) Function Definition
This function, `kwic_spacy_multiple()`, is at the core of our analysis.

It takes:
- a list of documents,
- a list of keywords,
- and a context window size (default is 7 tokens to the left and right).

For each keyword found in each document, it extracts:
- the left context (up to N tokens),
- the keyword itself,
- and the right context (up to N tokens).

This structure allows us to understand **how keywords are used in different contexts** across documents in a given theme.

In [10]:
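def kwic_spacy_multiple(texts, keywords, window=7):
    """
    Finds every keyword occurrence in the texts and returns it with its context.

    Args:
        texts (list[str]): Documents to search.
        keywords (list[str]): Keywords to match, case-insensitively.
        window (int): Number of tokens to keep on each side of a match.

    Returns:
        list[tuple]: (keyword label, left context, matched token, right context).
    """
    tokenized_docs = []

    # Tokenize each document with spaCy and keep only the token strings
    for doc in texts:
        spacy_doc = nlp(doc)
        tokens = [token.text for token in spacy_doc]
        tokenized_docs.append(tokens)

    results = []

    for tokens in tokenized_docs:
        for i, token in enumerate(tokens):
            for kw in keywords:
                if token.lower() == kw.lower():
                    # Slicing clips the window at the start and end of the document
                    left = tokens[max(i - window, 0) : i]
                    right = tokens[i + 1 : i + window + 1]
                    # Flatten any newlines so each context stays on one line
                    left_str = " ".join(left).replace("\n", " ").strip()
                    right_str = " ".join(right).replace("\n", " ").strip()
                    results.append((kw, left_str, token, right_str))
    return results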
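
To see the windowing logic in isolation, the sketch below applies the same slicing to a short made-up sentence. It swaps spaCy for a simple regular-expression tokenizer so that it runs without a language model; that tokenizer is an assumption of this example and will split some text differently from spaCy. `kwic_simple` is a hypothetical name.

```python
import re


def kwic_simple(text, keywords, window=7):
    """KWIC over one text using a regex tokenizer as a stand-in for spaCy."""
    # Words and single punctuation marks become separate tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)
    wanted = {kw.lower() for kw in keywords}
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() in wanted:
            left = " ".join(tokens[max(i - window, 0) : i])
            right = " ".join(tokens[i + 1 : i + window + 1])
            rows.append((token.lower(), left, token, right))
    return rows


rows = kwic_simple(
    "Digital access matters. Access to devices is uneven.",
    ["access"],
    window=2,
)
for row in rows:
    print(row)
# ('access', 'Digital', 'access', 'matters .')
# ('access', 'matters .', 'Access', 'to devices')
```

The first match has only one token of left context because the window is clipped at the start of the text, and punctuation counts toward the window size.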

### KWIC for Barriers to Access Theme

Here we perform KWIC analysis for the "Barriers to Access" theme using:
- the associated documents (with non-ASCII characters removed),
- and its top 20 keywords.

The resulting DataFrame (`barrier_to_access_kwic_df`) contains:
- the keyword label (the keyword as listed in the CSV),
- the left context,
- the keyword as it appears in the text,
- the right context.

These columns help us qualitatively explore how the keywords are used.

In [13]:
barrier_to_access_kwic_list = kwic_spacy_multiple(
    # Skip documents whose extraction failed (text is None)
    barrier_to_access["text"].dropna().apply(remove_non_ascii).tolist(),
    barrier_to_access_keywords["keyword"].tolist(),
    window=7,
)

barrier_to_access_kwic_df = pd.DataFrame(
    barrier_to_access_kwic_list,
    columns=["Keyword Label", "Left Context", "Keyword", "Right Context"],
)
barrier_to_access_kwic_df.head(60)

| | Keyword Label | Left Context | Keyword | Right Context |
|---|---|---|---|---|
| 0 | access | Persons With Disabilities August 2020 | Access | Denied : Introduction Research Design |
| 1 | accessible | and Scope Results Availability of | Accessible | Handsets in Sales Outlets Promotion of |
| 2 | accessible | Sales Outlets Promotion of Awareness of | Accessible | Mobile Telecommunications Procurement Polici... |
| 3 | mobile | Outlets Promotion of Awareness of Accessible | Mobile | Telecommunications Procurement Policies on A... |
| 4 | accessible | Accessible Mobile Telecommunications Procure... | Accessible | Handsets Physical Accessibility of Sales Out... |
| 5 | accessibility | Procurement Policies on Accessible Handsets ... | Accessibility | of Sales Outlets Capacity of Telecoms |
| 6 | accessible | with Disabilities Development and Availabil... | Accessible | Applications Availability of Discounted Rat... |
| 7 | accessibility | of Discounted Rates for Telecom Services | Accessibility | and Awareness of Emergency Mobile Communications |
| 8 | mobile | Services Accessibility and Awareness of Emer... | Mobile | Communications Existence of Code of Conduct |
| 9 | mobile | Existence of Code of Conduct on | Mobile | Telecommunications Accessibility Discussion ... |
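
A quick tally of keyword labels can show which terms dominate before reading the contexts line by line. The sketch below counts the labels of the ten rows shown above; it covers only that excerpt, not the full result.

```python
from collections import Counter

# Keyword labels of the ten rows shown above, in order
labels = [
    "access", "accessible", "accessible", "mobile", "accessible",
    "accessibility", "accessible", "accessibility", "mobile", "mobile",
]

counts = Counter(labels)
for label, count in counts.most_common():
    print(f"{label}: {count}")
```

On the full table, `barrier_to_access_kwic_df["Keyword Label"].value_counts()` gives the same kind of tally.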


### KWIC for Digital Infrastructure Theme

We now apply the same KWIC analysis process described earlier (see "Barriers to Access") to the **Digital Infrastructure** theme.

This involves:
- Preprocessing the text (cleaning non-ASCII characters),
- Using spaCy to tokenize documents,
- Extracting left and right contexts for the top 20 keywords.

We then explore how the keywords are used in context in this theme’s documents.


In [17]:
digital_infrastructure_kwic_list = kwic_spacy_multiple(
    # Skip documents whose extraction failed (text is None)
    digital_infrastructure["text"].dropna().apply(remove_non_ascii).tolist(),
    digital_infrastructure_keywords["keyword"].tolist(),
    window=7,
)

digital_infrastructure_kwic_df = pd.DataFrame(
    digital_infrastructure_kwic_list,
    columns=["Keyword Label", "Left Context", "Keyword", "Right Context"],
)
digital_infrastructure_kwic_df.head(50)

| | Keyword Label | Left Context | Keyword | Right Context |
|---|---|---|---|---|
| 0 | household | 2024 FinAccess | Household | Survey ACCESS USAGE QUALITY |
| 1 | survey | 2024 FinAccess Household | Survey | ACCESS USAGE QUALITY |
| 2 | access | 2024 FinAccess Household Survey | ACCESS | USAGE QUALITY IMPACT |
| 3 | kenya | However , CBK , KNBS , FSD | Kenya | and partners make no claims , |
| 4 | kenya | this report . Central Bank of | Kenya | Collaborating Partners 2024 FINACCESS HOUSEHOLD |
| 5 | household | Kenya Collaborating Partners 2024 FINACCESS | HOUSEHOLD | SURVEY i 2024 FINACCESS HOUSEHOLD |
| 6 | survey | Collaborating Partners 2024 FINACCESS HOUSEHOLD | SURVEY | i 2024 FINACCESS HOUSEHOLD SURVEY |
| 7 | household | HOUSEHOLD SURVEY i 2024 FINACCESS | HOUSEHOLD | SURVEY CONTENTS FOrEwOrd v |
| 8 | survey | SURVEY i 2024 FINACCESS HOUSEHOLD | SURVEY | CONTENTS FOrEwOrd v |
| 9 | survey | Economic Context 2 1.2 | Survey | Objectives 3 1.3 Survey |


### KWIC for Inclusive Digital Technology Theme

We now apply the same KWIC analysis process described earlier (see "Barriers to Access") to the **Inclusive Digital Technology** theme.

This involves:
- Preprocessing the text (cleaning non-ASCII characters),
- Using spaCy to tokenize documents,
- Extracting left and right contexts for the top 20 keywords.

We then explore how the keywords are used in context in this theme’s documents.


In [18]:
inclusive_digital_technology_kwic_list = kwic_spacy_multiple(
    # Skip documents whose extraction failed (text is None)
    inclusive_digital_technology["text"].dropna().apply(remove_non_ascii).tolist(),
    inclusive_digital_technology_keywords["keyword"].tolist(),
    window=7,
)

inclusive_digital_technology_kwic_df = pd.DataFrame(
    inclusive_digital_technology_kwic_list,
    columns=["Keyword Label", "Left Context", "Keyword", "Right Context"],
)
inclusive_digital_technology_kwic_df.head(50)

| | Keyword Label | Left Context | Keyword | Right Context |
|---|---|---|---|---|
| 0 | digital | Theme : Taking | Digital | Accessibility & Assistive Technology in Africa to |
| 1 | accessibility | Theme : Taking Digital | Accessibility | & Assistive Technology in Africa to the |
| 2 | assistive | Theme : Taking Digital Accessibility & | Assistive | Technology in Africa to the Next level |
| 3 | technology | Theme : Taking Digital Accessibility & Assistive | Technology | in Africa to the Next level |
| 4 | access | change and make generations to come | access | . If we get together , |
| 5 | kenya | Njoki Coordinator Short Stature Society of | Kenya | ( SSSK ) 6 . Julius |
| 6 | need | be impaired by disability , thus the | need | to support accessibility . The Internet |
| 7 | accessibility | disability , thus the need to support | accessibility | . The Internet needs to be |
| 8 | accessibility | Vint Cerf , VP , and Chief | Accessibility | Evangelist Google Key Takeaways The |
| 9 | accessibility | Takeaways Today we are talking about | accessibility | and the various ways we can improve |


### KWIC for Tech Ecosystem Theme

We now apply the same KWIC analysis process described earlier (see "Barriers to Access") to the **Tech Ecosystem** theme.

This involves:
- Preprocessing the text (cleaning non-ASCII characters),
- Using spaCy to tokenize documents,
- Extracting left and right contexts for the top 20 keywords.

We then explore how the keywords are used in context in this theme’s documents.


In [19]:
tech_ecosystem_kwic_list = kwic_spacy_multiple(
    # Skip documents whose extraction failed (text is None)
    tech_ecosystem["text"].dropna().apply(remove_non_ascii).tolist(),
    tech_ecosystem_keywords["keyword"].tolist(),
    window=7,
)

tech_ecosystem_kwic_df = pd.DataFrame(
    tech_ecosystem_kwic_list,
    columns=["Keyword Label", "Left Context", "Keyword", "Right Context"],
)
tech_ecosystem_kwic_df.head(50)  # display the first 50 results

| | Keyword Label | Left Context | Keyword | Right Context |
|---|---|---|---|---|
| 0 | employment | Best Practices in the | Employment | of Persons with Intellectual Disabilities : |
| 1 | intellectual | in the Employment of Persons with | Intellectual | Disabilities : A Case Study of Uganda |
| 2 | case | Persons with Intellectual Disabilities : A | Case | Study of Uganda , December 2022 |
| 3 | uganda | Intellectual Disabilities : A Case Study of | Uganda | , December 2022 1 |
| 4 | uganda | , December 2022 1 | Uganda | Association for the Mentally Handicapped / Inc... |
| 5 | inclusion | Uganda Association for the Mentally Handicapped / | Inclusion | Uganda Best Practices in the |
| 6 | uganda | for the Mentally Handicapped / Inclusion | Uganda | Best Practices in the Employment of |
| 7 | employment | Uganda Best Practices in the | Employment | of Persons with Intellectual Disabilities : A |
| 8 | intellectual | Practices in the Employment of Persons with | Intellectual | Disabilities : A Case Study of |
| 9 | case | Persons with Intellectual Disabilities : A | Case | Study of Uganda Prepared by |


### KWIC for Case Studies Theme

We now apply the same KWIC analysis process described earlier (see "Barriers to Access") to the **Case Studies** theme.

This involves:
- Preprocessing the text (cleaning non-ASCII characters),
- Using spaCy to tokenize documents,
- Extracting left and right contexts for the top 20 keywords.

We then explore how the keywords are used in context in this theme’s documents.


In [20]:
case_studies_kwic_list = kwic_spacy_multiple(
    # Skip documents whose extraction failed (text is None)
    case_studies["text"].dropna().apply(remove_non_ascii).tolist(),
    case_studies_keywords["keyword"].tolist(),
    window=7,
)

case_studies_kwic_df = pd.DataFrame(
    case_studies_kwic_list,
    columns=["Keyword Label", "Left Context", "Keyword", "Right Context"],
)
case_studies_kwic_df.head(50)  # display the first 50 results

| | Keyword Label | Left Context | Keyword | Right Context |
|---|---|---|---|---|
| 0 | student | Returns to Inspire Students former Pacelli | student | leads digital skills training for Y'ello Care |
| 1 | digital | Inspire Students former Pacelli student leads | digital | skills training for Y'ello Care Years |
| 2 | school | Akinola walked the corridors of the Pacelli | School | for the Blind & Partially Sighted |
| 3 | student | Partially Sighted Children in Surulere as a | student | with big dreams and uncertain pathways |
| 4 | student | , he returned , not as a | student | , but as a CEO , a |
| 5 | assistive | empowering individuals with vision loss through | assistive | technology and digital skills training . |
| 6 | technology | individuals with vision loss through assistive | technology | and digital skills training . Since |
| 7 | digital | vision loss through assistive technology and | digital | skills training . Since launching in |
| 8 | assistive | provided over $ 26,000 worth of | assistive | tech and trained more than 200 blind |
| 9 | digital | trained more than 200 blind persons in | digital | literacy , from foundation to advanced |
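
The five theme cells differ only in their inputs, so the same analysis could be driven by one loop. The sketch below is one possible shape for that, not the notebook's method. `run_kwic_by_theme` and `toy_kwic` are hypothetical names; `toy_kwic` is a simplified stand-in for `kwic_spacy_multiple()` that reports matches without context so the example runs without spaCy. The sample phrases are taken from the outputs above.

```python
def run_kwic_by_theme(texts_by_theme, keywords_by_theme, kwic_fn, window=7):
    """Apply kwic_fn to every theme and return {theme: list of KWIC rows}."""
    return {
        theme: kwic_fn(texts, keywords_by_theme.get(theme, []), window)
        for theme, texts in texts_by_theme.items()
    }


def toy_kwic(texts, keywords, window):
    # Stand-in for kwic_spacy_multiple(): whitespace tokens, empty contexts
    wanted = {kw.lower() for kw in keywords}
    return [
        (token.lower(), "", token, "")
        for text in texts
        for token in text.split()
        if token.lower() in wanted
    ]


texts_by_theme = {
    "case_studies": ["assistive technology and digital skills training"],
    "tech_ecosystem": ["Best Practices in the Employment of Persons"],
}
keywords_by_theme = {
    "case_studies": ["digital"],
    "tech_ecosystem": ["employment"],
}

results = run_kwic_by_theme(texts_by_theme, keywords_by_theme, toy_kwic, window=7)
for theme, rows in results.items():
    print(theme, rows)
```

In the notebook, passing `kwic_spacy_multiple` as `kwic_fn` and wrapping each result in `pd.DataFrame(...)` with the four column names would reproduce the per-theme tables.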
