# 02. Cleaning Logic R&D (Experiment)

**Status:** `ARCHIVED / EXPERIMENTAL`
**Outcome:** The complex recursive logic developed here was **rejected** in favor of a robust, deterministic approach.

---

## 1. Objective
Analyze why ~30% of addresses failed during the initial geocoding pass, and determine whether aggressive regex-based text preprocessing could recover them.

**Hypothesis:** A recursive cleaner that strips descriptions (e.g., *"Hermosa casa en..."*) and architectural terms will significantly increase the API Hit Rate.

In [15]:
import pandas as pd
import re
import unicodedata

# 1. LOAD FAILURE SAMPLES
# We manually define the "Hardest Cases" found in the geocoding cache for reproducibility.
dirty_data_samples = [
    "meseta, sonterra",
    "boulevard villas del meson, villas del mesón, juriquilla",  # Redundancy + Accents
    "altozano el nuevo",                                       # Adjective Noise
    "anillo vial fray junipero el refugio",
    "pre del reconocido despacho de arquitectura goma arquitectos, el campanario", # Description Leakage
    "moderna con excelente distribución y acabados en jurica",  # Description Leakage
    "zikura, zibatá, el marqués",                              # Unmapped Cluster
    "faisan 1418, nuevo refugio",
    "valles, la purísima"
]

df_test = pd.DataFrame(dirty_data_samples, columns=['raw_input'])

print("--- BASELINE: HARD FAILURE CASES ---")
display(df_test)

--- BASELINE: HARD FAILURE CASES ---


| | raw_input |
|---|---|
| 0 | meseta, sonterra |
| 1 | boulevard villas del meson, villas del mesón, juriquilla |
| 2 | altozano el nuevo |
| 3 | anillo vial fray junipero el refugio |
| 4 | pre del reconocido despacho de arquitectura goma arquitectos, el campanario |
| 5 | moderna con excelente distribución y acabados en jurica |
| 6 | zikura, zibatá, el marqués |
| 7 | faisan 1418, nuevo refugio |
| 8 | valles, la purísima |


## 2. The Experiment: Aggressive Recursive Cleaning (V4)

We prototyped an `ExperimentalCleaner` class, adding rules incrementally to address the failure patterns identified above. It uses:
1.  **Recursion:** An iterative loop (capped at 5 passes) that repeats until the string stabilizes, peeling off stacked connectors (e.g., *"con y en..."*).
2.  **Specific Stopwords:** Words taken directly from the failing rows (`goma`, `despacho`, `reconocido`).
3.  **Accent Normalization:** A `_remove_accents` helper intended to support deduplication. It is defined in the class but not called by `clean`, so accented duplicates such as `meson` / `mesón` are not merged in the results below.
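
The accent-normalization idea can be illustrated in isolation. The sketch below uses the same NFKD technique as the prototype's helper; the function name `fold_accents` is ours and is not part of the notebook. Note that this folding also turns `ñ` into `n`, which may or may not be desirable for place names.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop combining marks, e.g. 'ó' -> 'o'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = "villas del meson"
accented = "villas del mesón"

# The raw strings differ, so a naive duplicate check misses them.
print(plain == accented)

# After folding, both variants compare equal.
print(fold_accents(plain) == fold_accents(accented))
```

Comparing folded strings while keeping the original spelling for the geocoder query is one way to deduplicate without losing accents in the output.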

In [16]:
# --- EXPERIMENTAL LOGIC (DO NOT USE IN PROD) ---
class ExperimentalCleaner:
    NOISE_PATTERNS = [
        r"venta", r"preventa", r"oportunidad", r"fraccionamiento",
        r"residencial", r"condominio", r"lotes?", r"terrenos?",
        r"despacho", r"arquitectura", r"arquitectos", # Specific Overfitting?
        r"moderna", r"excelente", r"distribuci[óo]n", r"acabados",
        r"reconocido", r"goma", r"pre",
        r"\bnueva\b", r"\bnuevo\b"
    ]

    @staticmethod
    def _remove_accents(input_str):
        # NOTE: intended for accent-insensitive deduplication, but never called by clean() below.
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        return "".join([c for c in nfkd_form if not unicodedata.combining(c)])

    @staticmethod
    def clean(raw_address):
        if not raw_address: return ""
        cleaned = raw_address.lower()

        # 1. Strip Noise
        for pattern in ExperimentalCleaner.NOISE_PATTERNS:
            cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)

        # 2. Recursive Loop (The "Onion Peeler")
        prev_string = ""
        loop_count = 0
        while cleaned != prev_string and loop_count < 5:
            prev_string = cleaned
            loop_count += 1
            # Remove connectors at start/end
            cleaned = re.sub(r'^\s*(?:y|con|de|del|en)\b\s*', '', cleaned)
            cleaned = re.sub(r'\s+\b(?:y|con|de|del|en|el|la)\s*$', '', cleaned)
            cleaned = re.sub(r'\s+', ' ', cleaned).strip()

        return cleaned

# Apply Experiment
df_test['experimental_clean'] = df_test['raw_input'].apply(ExperimentalCleaner.clean)

print("--- EXPERIMENTAL RESULTS ---")
display(df_test[['raw_input', 'experimental_clean']])

--- EXPERIMENTAL RESULTS ---


| | raw_input | experimental_clean |
|---|---|---|
| 0 | meseta, sonterra | meseta, sonterra |
| 1 | boulevard villas del meson, villas del mesón, juriquilla | boulevard villas del meson, villas del mesón, juriquilla |
| 2 | altozano el nuevo | altozano |
| 3 | anillo vial fray junipero el refugio | anillo vial fray junipero el refugio |
| 4 | pre del reconocido despacho de arquitectura goma arquitectos, el campanario | , el campanario |
| 5 | moderna con excelente distribución y acabados en jurica | jurica |
| 6 | zikura, zibatá, el marqués | zikura, zibatá, el marqués |
| 7 | faisan 1418, nuevo refugio | faisan 1418, refugio |
| 8 | valles, la purísima | valles, la purísima |


## 3. Analysis & Pivot Decision

### The Findings
1.  **Overfitting Risk:** To clean rows like *"pre del reconocido despacho..."*, we had to hardcode listing-specific words such as `goma` and `reconocido`. This is brittle: if a new scraper brings *"bonita casa de autor"*, the description passes through uncleaned until more stopwords are added.
2.  **Map Granularity vs. Text Cleaning:**
    * Input: `zikura, zibatá, el marqués` -> Cleaned: `zikura, zibatá, el marqués` (unchanged).
    * **Issue:** This still fails geocoding because OpenStreetMap does not know "Zikura". Cleaning the text does not solve the lack of map data.
3.  **Diminishing Returns:** The recursive loop adds computational complexity and unpredictability for edge cases that are better handled by a fallback strategy.
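
The brittleness in finding 1 can be reproduced with a reduced copy of the experimental stopword list. This is a minimal sketch, not the notebook's class: `strip_noise` is a hypothetical helper, and `"la presa, juriquilla"` is an invented input used only to show a related hazard, namely that a pattern without word boundaries (`pre`) also matches inside legitimate words.

```python
import re

# Subset of the experimental patterns, copied as written (no word boundaries).
NOISE = [r"despacho", r"reconocido", r"goma", r"pre"]


def strip_noise(text: str) -> str:
    """Remove every NOISE pattern, then collapse whitespace."""
    cleaned = text.lower()
    for pattern in NOISE:
        cleaned = re.sub(pattern, "", cleaned)
    return re.sub(r"\s+", " ", cleaned).strip()


# An unseen description is not in the list, so it survives untouched.
print(strip_noise("bonita casa de autor, el campanario"))

# The bare "pre" pattern also eats the start of an ordinary word.
print(strip_noise("la presa, juriquilla"))
```

Wrapping short patterns in `\b...\b` would prevent the second problem, but not the first: an open-ended vocabulary of marketing text cannot be enumerated in advance.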

### The Decision
**We reject the Experimental V4 logic.**

Instead of trying to "regex" our way to perfection, we will implement a **Robust Strategy**:
1.  **Conservative Cleaning:** Remove only obvious marketing noise (e.g., `venta`, `oportunidad`).
2.  **Geocoding Fallback:** Let the Geocoder script handle granularity. If `Zikura, Zibatá` fails, the script should split the string on commas and retry with `Zibatá`.
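
The fallback in point 2 could look roughly like the sketch below. It is an illustration under assumptions, not the Geocoder script itself: `geocode_with_fallback`, `fake_geocode`, and `FAKE_INDEX` are hypothetical names, the lookup table stands in for a real geocoding call, and the coordinates are dummy placeholders.

```python
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]


def geocode_with_fallback(
    address: str,
    geocode: Callable[[str], Optional[Coordinates]],
) -> Optional[Tuple[str, Coordinates]]:
    """Try the full address, then drop leading comma-separated parts one by one."""
    parts = [p.strip() for p in address.split(",") if p.strip()]
    for start in range(len(parts)):
        candidate = ", ".join(parts[start:])
        result = geocode(candidate)
        if result is not None:
            return candidate, result
    return None


# Stand-in for a real geocoder: only the coarser area is "known".
FAKE_INDEX = {"zibatá, el marqués": (0.0, 0.0)}  # dummy coordinates


def fake_geocode(query: str) -> Optional[Coordinates]:
    return FAKE_INDEX.get(query)


print(geocode_with_fallback("zikura, zibatá, el marqués", fake_geocode))
print(geocode_with_fallback("meseta, sonterra", fake_geocode))
```

Returning the candidate that matched, alongside the coordinates, makes it possible to record how coarse each result is.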

---

## 4. Final Production Logic (Selected)

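The following logic has been migrated to `src/utils/clean_text.py`. It is deterministic, applies each rule in a single pass (no loop), and contains no listing-specific stopwords, which avoids the overfitting seen in the experiment.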

In [17]:
# --- FINAL PRODUCTION LOGIC ---
class AddressCleaner:
    """
    Service class for normalizing real estate address strings.
    Selected for Production: Jan 2026.
    """

    # Marketing patterns and common OCR/Typo errors to strip.
    # Order matters: longer phrases come before the shorter patterns they contain
    # (e.g. "preventa" before "venta"), otherwise a stray "pre" is left behind.
    NOISE_PATTERNS = [
        r"venta de casa en", r"casa en venta", r"en venta",
        r"preventa", r"venta", r"remate", r"oportunidad",
        r"fraccionamient[o0]",  # Also handles common typos like 'fraccionamient0'
        r"residencial", r"condominio",
        r"lotes?", r"terrenos?", r"departamentos?", r"casas?",
        r"\bnueva\b", r"\bnuevo\b",  # Removes adjectives like "Nueva en..."
    ]

    @staticmethod
    def clean(raw_address: str) -> str:
        if not isinstance(raw_address, str) or len(raw_address) < 3:
            return ""

        cleaned = raw_address.lower()

        # 1. Strip Marketing Noise
        for pattern in AddressCleaner.NOISE_PATTERNS:
            cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)

        # 2. Prune Macro-Locations
        cleaned = re.sub(r'\b(quer[ée]taro|m[ée]xico|qro)\b', '', cleaned)

        # 3. Structural Deduplication
        cleaned = re.sub(r'\b(.+?)(?:[\s,]+)\1\b', r'\1', cleaned)

        # 4. Remove Orphan Prepositions (Safe Mode)
        # We protect 'el', 'la' because 'El Refugio' requires them.
        cleaned = re.sub(r'^\s*(?:en|de)\b\s*', '', cleaned)

        # 5. Final Formatting
        cleaned = re.sub(r'[^a-z0-9\s,áéíóúñ]', '', cleaned)
        cleaned = re.sub(r'\s+', ' ', cleaned)
        # Trim stray commas and spaces together so ", x" does not keep a leading space
        cleaned = cleaned.strip(' ,')

        return cleaned

# Validation of Final Logic
df_test['final_production_clean'] = df_test['raw_input'].apply(AddressCleaner.clean)
display(df_test[['raw_input', 'final_production_clean']])

| | raw_input | final_production_clean |
|---|---|---|
| 0 | meseta, sonterra | meseta, sonterra |
| 1 | boulevard villas del meson, villas del mesón, juriquilla | boulevard villas del meson, villas del mesón, juriquilla |
| 2 | altozano el nuevo | altozano el |
| 3 | anillo vial fray junipero el refugio | anillo vial fray junipero el refugio |
| 4 | pre del reconocido despacho de arquitectura goma arquitectos, el campanario | pre del reconocido despacho de arquitectura goma arquitectos, el campanario |
| 5 | moderna con excelente distribución y acabados en jurica | moderna con excelente distribución y acabados en jurica |
| 6 | zikura, zibatá, el marqués | zikura, zibatá, el marqués |
| 7 | faisan 1418, nuevo refugio | faisan 1418, refugio |
| 8 | valles, la purísima | valles, la purísima |
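
The structural deduplication step relies on a regex backreference (`\1`), which only matches an exact repeat of the captured text. The sketch below isolates that one rule, using the same pattern as `AddressCleaner.clean`; the `dedupe` wrapper and the first two inputs are ours, added for illustration. It shows why row 1 above is returned unchanged: `meson` and `mesón` are not the same string.

```python
import re

# Same pattern as step 3 of AddressCleaner.clean.
DEDUP = re.compile(r"\b(.+?)(?:[\s,]+)\1\b")


def dedupe(text: str) -> str:
    """Collapse an exact repeat separated by spaces and/or commas."""
    return DEDUP.sub(r"\1", text)


# Exact repeats are collapsed.
print(dedupe("juriquilla, juriquilla"))
print(dedupe("villas del meson, villas del meson, juriquilla"))

# An accent difference breaks the backreference, so nothing is removed.
print(dedupe("villas del meson, villas del mesón, juriquilla"))
```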


## 5. Next Steps

1.  **Code Migration:** The robust `AddressCleaner` class is saved in `src/utils/clean_text.py`.
2.  **Geocoding Strategy:** We will handle the remaining "Not Found" cases (like *Meseta* or *Zikura*) in the **Geocoding Script** using a comma-splitting fallback mechanism (e.g., if `A, B` fails, try `B`).