# NLP Text Loading and Cleaning Pipeline

## Overview
This module handles the complete text extraction and cleaning pipeline for sustainability reports, transforming PDF documents into cleaned text files suitable for NLP analysis. The cleaning process was developed after manually reviewing extracted text files to identify document-level issues requiring correction.

## Text Extraction Process
**PDF to Text**: Uses spaCyLayout with spaCy to extract text from PDF sustainability reports while preserving reading order
- **Challenge addressed**: Multi-column layouts in sustainability reports cause standard extraction tools to scramble sentences by reading left-to-right across columns
- **Processing time**: ~30 minutes per report due to layout-aware parsing complexity
- **Caching mechanism**: Saves extracted text as .txt files to avoid re-processing
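
The caching step can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: `load_or_extract` and its `extract` parameter are hypothetical names, and `extract` stands in for the slow layout-aware parsing call.

```python
from pathlib import Path
from typing import Callable


def load_or_extract(pdf_path: Path, cache_path: Path,
                    extract: Callable[[Path], str]) -> str:
    """Return the cached text if it exists; otherwise extract it and cache it."""
    if cache_path.exists():
        # Cache hit: skip the expensive parsing step entirely
        return cache_path.read_text(encoding="utf-8")

    text = extract(pdf_path)  # slow, layout-aware step (hypothetical callable)

    # Create the cache folder on first use, then store the result
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding="utf-8")
    return text
```

In this pipeline, `extract` would wrap the spaCyLayout call, for example `lambda p: layout(str(p)).text`.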

## Multi-Stage Cleaning Pipeline

### Stage 1: Initial Extraction (`/Clean/` folder)
Raw text extraction from PDFs using spaCyLayout, cached to prevent re-processing

### Stage 2: Manual Corrections (`/Cleaner/` folder)  
**Company-specific character encoding fixes** identified through manual review of extracted files:
- **EDP**: Extensive character mapping (30+ replacements) for garbled characters like "ĸ" → "g", "ƙ" → "r", number encoding fixes
- **Terna Energy**: Specific character issues like "/idotaccent" → "i"
- **Other companies**: Targeted fixes for PDF extraction artifacts and encoding errors
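
A company-specific correction of this kind boils down to a lookup table applied to the whole text. The sketch below shows the idea under a few assumptions: `apply_replacements` and `SAMPLE_FIXES` are hypothetical names, the table is only a small illustrative subset, and keys are applied longest-first as a precaution so that a multi-character sequence is handled before any shorter key it might contain.

```python
from typing import Mapping


def apply_replacements(text: str, replacements: Mapping[str, str]) -> str:
    """Replace every garbled sequence in `text` with its intended character(s)."""
    # Longest keys first (an assumption of this sketch, see above)
    for wrong in sorted(replacements, key=len, reverse=True):
        text = text.replace(wrong, replacements[wrong])
    return text


# Illustrative subset only; the real tables are built per company
SAMPLE_FIXES = {"ĸ": "g", "ƙ": "r", "/idotaccent": "i"}

print(apply_replacements("eneƙĸy", SAMPLE_FIXES))
print(apply_replacements("w/idotaccentnd", SAMPLE_FIXES))
```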

### Stage 3: Automated Final Cleaning (`/Cleanest/` folder)
Systematic text standardization across all documents:
- **Space normalization**: Multiple spaces reduced to single spaces
- **Chemical term standardization**: "CO 2" → "CO2" (case-insensitive) to prevent token splitting in spaCy
- **Bracket spacing**: Remove spaces in "( text )" → "(text)" for consistent tokenization
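
The three rules above can be combined into a single function. This is a minimal sketch: `final_clean` is a hypothetical name, and the CO2 rule here keeps the original capitalisation, in line with the "CO 2" → "CO2" example.

```python
import re


def final_clean(text: str) -> str:
    """Apply space, CO2 and bracket normalization to one document."""
    # 1. Collapse runs of spaces into a single space
    text = re.sub(r" +", " ", text)
    # 2. Join "CO 2" into "CO2"; the match is case-insensitive and the
    #    captured letters are written back unchanged
    text = re.sub(r"\b(co) 2\b", r"\g<1>2", text, flags=re.IGNORECASE)
    # 3. Remove the space after "(" and before ")"
    text = text.replace("( ", "(").replace(" )", ")")
    return text


print(final_clean("Scope 1  emissions ( CO 2 ) fell"))
# prints: Scope 1 emissions (CO2) fell
```

The order matters: collapsing spaces first means the CO2 and bracket rules only ever need to handle a single space.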

## Processing Configuration
- **Test mode**: 2 companies for development/testing
- **Actual mode**: 14 sample companies for full analysis
- **Years processed**: 2021 and 2022 sustainability reports
- **Text handling**: Increased spaCy max_length to 1.5M characters for large reports
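
spaCy rejects texts longer than `nlp.max_length`, which defaults to 1,000,000 characters, so it can help to check document sizes before running the pipeline. The helper below is a hypothetical sketch of such a check and is not part of the notebook; `oversized` and `MAX_LENGTH` are illustrative names.

```python
from typing import Dict, Mapping

MAX_LENGTH = 1_500_000  # limit taken from the configuration above


def oversized(texts: Mapping[str, str],
              max_length: int = MAX_LENGTH) -> Dict[str, int]:
    """Return the character count of every document that exceeds `max_length`."""
    return {name: len(text) for name, text in texts.items()
            if len(text) > max_length}
```

An empty result means every report fits within the configured limit.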

## Manual Review Integration
The cleaning steps were specifically designed after manually checking extracted .txt files to identify:
- Encoding issues from PDF extraction artifacts
- Layout-specific problems requiring targeted corrections
- Document-level cleaning requirements for reliable NLP processing
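
A simple character inventory can support this kind of manual review by showing which non-ASCII characters occur most often in an extracted file. The sketch below is an assumption about how such a check could look, not the procedure that was actually used; `unusual_characters` is a hypothetical name.

```python
import unicodedata
from collections import Counter
from typing import List, Tuple


def unusual_characters(text: str, top: int = 20) -> List[Tuple[str, str, int]]:
    """List the most frequent non-ASCII characters with their Unicode names."""
    counts = Counter(ch for ch in text if ord(ch) > 127 and not ch.isspace())
    return [(ch, unicodedata.name(ch, "UNNAMED"), n)
            for ch, n in counts.most_common(top)]
```

Legitimate characters such as accented letters in company names will also appear in the list, so the output is a starting point for inspection, not a list of errors.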

## Final Output
Clean, standardized text files in `/Cleanest/` folders ready for communication analysis modules, with all major extraction artifacts and encoding issues resolved through the systematic multi-stage approach.

In [None]:
import spacy
from spacy_layout import spaCyLayout
from pathlib import Path
import pandas as pd
import numpy as np

# Load models
nlp_blank = spacy.blank("en")
layout = spaCyLayout(nlp_blank)
nlp = spacy.load("en_core_web_lg")

# Increase max_length to safely handle long texts
nlp.max_length = 1_500_000

print("Models loaded successfully")

In [None]:
# Toggle between "test" and "actual"
MODE = "actual"  

# Define configuration based on mode
if MODE == "test":
    report_names = [ 
        "Axpo_Holding_AG", "NEOEN_SA"
    ]
    folders = {
        "2021": Path("data/NLP/Testing/Reports/2021"),
        "2022": Path("data/NLP/Testing/Reports/2022")
    }

    # Output paths for clean text files
    clean_text_folders = {
        "2021": Path("data/NLP/Testing/Reports/Clean/2021"),
        "2022": Path("data/NLP/Testing/Reports/Clean/2022")
    }

elif MODE == "actual":
    report_names = [ 
        "Akenerji_Elektrik_Uretim_AS",
        "Arendals_Fossekompani_ASA",
        "Atlantica_Sustainable_Infrastructure_PLC",
        "CEZ",
        "EDF",
        "EDP_Energias_de_Portugal_SA",
        "Endesa",
        "ERG_SpA",
        "Orsted",
        "Polska_Grupa_Energetyczna_PGE_SA",
        "Romande_Energie_Holding_SA",
        "Scatec_ASA",
        "Solaria_Energia_y_Medio_Ambiente_SA",
        "Terna_Energy_SA"
    ]
    folders = {
        "2021": Path("data/NLP/Reports/2021"),
        "2022": Path("data/NLP/Reports/2022")
    }

    # Output paths for clean text files
    clean_text_folders = {
        "2021": Path("data/NLP/Reports/Clean/2021"),
        "2022": Path("data/NLP/Reports/Clean/2022")
    }

else:
    raise ValueError("Invalid MODE. Use 'test' or 'actual'.")

# Check availability
for name in report_names:
    file_name = f"{name.replace('_', ' ')}.pdf"
    in_2021 = (folders["2021"] / file_name).exists()
    in_2022 = (folders["2022"] / file_name).exists()
    print(f"{file_name}: 2021: {'✔️' if in_2021 else '❌'} | 2022: {'✔️' if in_2022 else '❌'}")


In [None]:
print(f"Processing {len(report_names)} companies for years: {list(folders.keys())}")

documents = {}

for version, folder_path in folders.items():
    for name in report_names:
        file_name = name.replace("_", " ") + ".pdf"
        pdf_path = folder_path / file_name
        clean_text_path = clean_text_folders[version] / f"{name}.txt"

        try:
            # Only parse layout if text file doesn't exist
            if clean_text_path.exists():
                with open(clean_text_path, "r", encoding="utf-8") as f:
                    clean_text = f.read()
                print(f"Loaded cached text: {clean_text_path.name}")
            else:
                layout_doc = layout(str(pdf_path))
                clean_text = layout_doc.text
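                # Create the cache folder on first run so the write cannot fail
                clean_text_path.parent.mkdir(parents=True, exist_ok=True)
                with open(clean_text_path, "w", encoding="utf-8") as f:
                    f.write(clean_text)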
                print(f"Saved clean text: {clean_text_path.name}")

            # Run linguistic analysis
            nlp_doc = nlp(clean_text)
            doc_key = f"{name}_{version}"
            documents[doc_key] = nlp_doc

        except Exception as e:
            print(f"Error processing {file_name}: {e}")

print(f"\nTotal documents loaded: {len(documents)}")

## Clean specific reports

In [None]:
from pathlib import Path

# Input and output paths
input_path = Path("data/NLP/Reports/Clean/2021/EDP_Energias_de_Portugal_SA.txt")
output_path = Path("data/NLP/Reports/Cleaner/2021/EDP_Energias_de_Portugal_SA_corrected.txt")
output_path.parent.mkdir(parents=True, exist_ok=True)

# Mapping of garbled characters to their intended replacements
replacements = {
    "ĸ": "g",
    "ƙ": "r",
    "ķ": "f",
    "ǎ": "w",
    "Ǎ": "v",
    "ǔ": "y",
    "Ǧ": "fi",
    "Ǔ": "x",
    "s̫": "s",
    "E̺": "E",
    "I": "I",  # no-op as written; the key may have lost a combining mark during export
    "i̺": "i)",
    "O̺": "O)",
    "o̺": "o)",
    "U̺": "U)",
    "s̺": "s)",
    "Ƙ": "q",
    "ǧ": "fl",
    "a̬": "a;",
    "n̬": "n;",
    "e̬": "e;",
    "t̺": "t)",
    # Garbled digits
    "˩": "0",
    "˪": "1",
    "˫": "2",
    "ˬ": "3",
    "˭": "4",
    "ˮ": "5",
    "˯": "6",
    "˰": "7",
    "˱": "8",
    "˲": "9"
}

# Read original text
with open(input_path, "r", encoding="utf-8") as f:
    text = f.read()

# Replace incorrect characters with correct ones
for wrong, correct in replacements.items():
    text = text.replace(wrong, correct)

# Write the corrected text to the output file
with open(output_path, "w", encoding="utf-8") as f:
    f.write(text)


In [None]:
from pathlib import Path

# Input and output paths
input_path = Path("data/NLP/Reports/Clean/2021/Terna_Energy_SA.txt")
output_path = Path("data/NLP/Reports/Cleaner/2021/Terna_Energy_SA_corrected.txt")
output_path.parent.mkdir(parents=True, exist_ok=True)

# Table with replacements
replacements = {
    "/idotaccent": "i",
}

# Read original text
with open(input_path, "r", encoding="utf-8") as f:
    text = f.read()

# Replace incorrect characters with correct ones
for wrong, correct in replacements.items():
    text = text.replace(wrong, correct)

# Write the corrected text to the output file
with open(output_path, "w", encoding="utf-8") as f:
    f.write(text)


In [None]:
import re
from pathlib import Path

input_folders = {
    "2021": Path("data/NLP/Reports/Cleaner/2021"),
    "2022": Path("data/NLP/Reports/Cleaner/2022")
}
output_folders = {
    "2021": Path("data/NLP/Reports/Cleanest/2021"),
    "2022": Path("data/NLP/Reports/Cleanest/2022")
}

for folder in output_folders.values():
    folder.mkdir(parents=True, exist_ok=True)

for year, input_folder in input_folders.items():
    if input_folder.exists():
        for file_path in input_folder.glob("*.txt"):
            with open(file_path, "r", encoding="utf-8") as f:
                text = f.read()

            # Replace multiple spaces with a single space
            cleaned_text = re.sub(r' +', ' ', text)
            # Join 'CO 2' into 'CO2' (case-insensitive match, original capitalisation kept)
            cleaned_text = re.sub(r'\b(co) 2\b', r'\g<1>2', cleaned_text, flags=re.IGNORECASE)
            # Remove space after opening bracket and before closing bracket
            cleaned_text = cleaned_text.replace('( ', '(').replace(' )', ')')


            output_path = output_folders[year] / file_path.name
            with open(output_path, "w", encoding="utf-8") as f:
                f.write(cleaned_text)

            print(f"Processed: {file_path.name}")


Processed: Akenerji_Elektrik_Uretim_AS.txt
Processed: Arendals_Fossekompani_ASA.txt
Processed: Atlantica_Sustainable_Infrastructure_PLC.txt
Processed: CEZ.txt
Processed: EDF.txt
Processed: EDP_Energias_de_Portugal_SA.txt
Processed: Endesa.txt
Processed: ERG_SpA.txt
Processed: Orsted.txt
Processed: Polska_Grupa_Energetyczna_PGE_SA.txt
Processed: Romande_Energie_Holding_SA.txt
Processed: Scatec_ASA.txt
Processed: Solaria_Energia_y_Medio_Ambiente_SA.txt
Processed: Terna_Energy_SA.txt
Processed: Akenerji_Elektrik_Uretim_AS.txt
Processed: Arendals_Fossekompani_ASA.txt
Processed: Atlantica_Sustainable_Infrastructure_PLC.txt
Processed: CEZ.txt
Processed: EDF.txt
Processed: EDP_Energias_de_Portugal_SA.txt
Processed: Endesa.txt
Processed: ERG_SpA.txt
Processed: Orsted.txt
Processed: Polska_Grupa_Energetyczna_PGE_SA.txt
Processed: Romande_Energie_Holding_SA.txt
Processed: Scatec_ASA.txt
Processed: Solaria_Energia_y_Medio_Ambiente_SA.txt
Processed: Terna_Energy_SA.txt
