# District Cleaning: Chhattisgarh

This notebook standardizes **district names** for this state across:
- Enrolment data
- Demographic update data
- Biometric update data

**All data is saved back to the same cleaned files.**

In [1]:
import pandas as pd
from pathlib import Path

pd.set_option("display.max_rows", None)
pd.set_option("display.width", None)

CLEAN_DIR = Path("../../data/processed/cleaned")

enrol_df = pd.read_csv(CLEAN_DIR / "enrolment_clean.csv")
demo_df  = pd.read_csv(CLEAN_DIR / "demographic_clean.csv")
bio_df   = pd.read_csv(CLEAN_DIR / "biometric_clean.csv")

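for df in [enrol_df, demo_df, bio_df]:
    for col in ["state", "district"]:
        # Trim and Title Case the names. Missing values are masked back to NaN,
        # since astype(str) can turn them into literal strings such as "Nan".
        normalized = df[col].astype(str).str.strip().str.title()
        df[col] = normalized.where(df[col].notna())

print("✅ All datasets loaded and normalized (Title Case)")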


✅ All datasets loaded and normalized (Title Case)
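To make the effect of this normalization concrete, here is a small sketch on made-up values (the series `raw` is illustrative and not drawn from the real files). It suggests that trimming and Title Casing unify casing only: spelling and spacing variants such as `Janjgir - Champa` survive, which is presumably why a mapping step is still needed afterwards.

```python
import pandas as pd

# Hypothetical raw values, including one missing entry.
raw = pd.Series(
    ["  balodabazar-bhatapara ", "JANJGIR - CHAMPA", "uttar bastar kanker", None]
)

# Same chain as the loading cell: cast to str, trim, Title Case.
naive = raw.astype(str).str.strip().str.title()

# Mask the result so entries that were missing in `raw` stay missing.
safe = naive.where(raw.notna())

print(safe.tolist())
```

Note that the spaced hyphen in the second value is preserved, so it would still differ from `Janjgir-Champa` after loading.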


In [5]:
STATE_NAME = "Chhattisgarh"

districts = sorted(
    set(
        enrol_df.loc[enrol_df["state"] == STATE_NAME, "district"].dropna()
        .tolist()
    )
)

print(f"State: {STATE_NAME}")
print(f"Number of unique districts: {len(districts)}")

pd.DataFrame(
    {"District Name": districts}
)


State: Chhattisgarh
Number of unique districts: 33


|   | District Name |
|---|---|
| 0 | Balod |
| 1 | Balodabazar-Bhatapara |
| 2 | Balrampur-Ramanujganj |
| 3 | Bastar |
| 4 | Bemetara |
| 5 | Bijapur |
| 6 | Bilaspur |
| 7 | Dakshin Bastar Dantewada |
| 8 | Dhamtari |
| 9 | Durg |
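The listing above is built from the enrolment data only. As a possible complement, the sketch below collects district names from several datasets and flags where each name occurs, which may help surface variants that exist only in the demographic or biometric files. The helper `districts_by_dataset` and the toy frames are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def districts_by_dataset(frames, state):
    """Return one row per district name and one boolean column per dataset."""
    seen = {
        label: set(df.loc[df["state"] == state, "district"].dropna())
        for label, df in frames.items()
    }
    all_names = sorted(set().union(*seen.values()))
    return pd.DataFrame(
        {label: [name in names for name in all_names] for label, names in seen.items()},
        index=pd.Index(all_names, name="district"),
    )

# Toy frames standing in for enrol_df and demo_df.
toy_frames = {
    "enrolment": pd.DataFrame(
        {"state": ["Chhattisgarh", "Chhattisgarh"], "district": ["Durg", "Korea"]}
    ),
    "demographic": pd.DataFrame(
        {"state": ["Chhattisgarh", "Chhattisgarh"], "district": ["Durg", "Koriya"]}
    ),
}

presence = districts_by_dataset(toy_frames, "Chhattisgarh")
print(presence)
```

In the notebook, the call would presumably look like `districts_by_dataset({"enrolment": enrol_df, "demographic": demo_df, "biometric": bio_df}, STATE_NAME)`.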


## District Mapping

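Add mappings in **Title Case only**. The loading cell converts every `state` and `district` value to Title Case, so a variant written in any other casing will never match.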

Format:
```python
DISTRICT_MAPPING = {
    "Correct District": ["Wrong Name 1", "Wrong Name 2"],
}
```
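Because the mapping is maintained by hand, it may be worth checking it before use. The following is a minimal sketch, assuming a hypothetical helper `build_lookup` that flattens the format above into a `variant -> correct` dictionary and rejects entries that are not Title Case, that are themselves a correct name, or that are listed under two different correct names.

```python
def build_lookup(mapping):
    """Flatten {correct: [variants]} into {variant: correct}, rejecting bad entries."""
    lookup = {}
    for correct, variants in mapping.items():
        for variant in variants:
            # Variants must match what the loader produces (trimmed Title Case).
            if variant != variant.strip().title():
                raise ValueError(f"{variant!r} is not trimmed Title Case")
            # A correct name must not also be listed as a wrong variant.
            if variant in mapping:
                raise ValueError(f"{variant!r} is also used as a correct name")
            # One variant cannot point at two different correct names.
            if lookup.get(variant, correct) != correct:
                raise ValueError(f"{variant!r} maps to more than one district")
            lookup[variant] = correct
    return lookup

example_mapping = {"Korea": ["Koriya"], "Kabeerdham": ["Kawardha"]}
lookup = build_lookup(example_mapping)
print(lookup)
```

Failing early on a malformed entry is a design choice here; a silent mismatch would otherwise just show up as zero rows fixed.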

In [3]:
DISTRICT_MAPPING = {
    # "Correct District": ["Wrong Variant 1", "Wrong Variant 2"]
    "Balodabazar-Bhatapara": ["Baloda Bazar"],
    "Balrampur-Ramanujganj": ["Balrampur"],
    "Dakshin Bastar Dantewada": ["Dantewada"],
    "Gaurela-Pendra-Marwahi": ["Gaurella Pendra Marwahi"],
    "Janjgir-Champa": ["Janjgir Champa", "Janjgir - Champa"],
    "Uttar Bastar Kanker": ["North Bastar Kanker", "Kanker"],
    "Kabeerdham": ["Kawardha"],
    "Khairagarh-Chhuikhadan-Gandai": ["Khairagarh Chhuikhadan Gandai"],
    "Korea": ["Koriya"],
    "Mohla-Manpur-Ambagarh Chouki": ["Mohalla-Manpur-Ambagarh Chowki"]
}

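def apply_mapping(df, state, mapping, label):
    """Rename wrong district variants in place for one state; return rows changed."""
    total = 0
    for correct, wrongs in mapping.items():
        # Restrict to the target state so same-named districts elsewhere are untouched.
        mask = (
            (df["state"] == state) &
            (df["district"].isin(wrongs))
        )
        count = int(mask.sum())
        df.loc[mask, "district"] = correct
        total += count
        if count > 0:
            print(f"✔ {label} → {correct} : {count} rows fixed")
    return total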

total_fixes = 0
total_fixes += apply_mapping(enrol_df, STATE_NAME, DISTRICT_MAPPING, "Enrolment")
total_fixes += apply_mapping(demo_df,  STATE_NAME, DISTRICT_MAPPING, "Demographic")
total_fixes += apply_mapping(bio_df,   STATE_NAME, DISTRICT_MAPPING, "Biometric")

print(f"✅ Total fixes in {STATE_NAME}: {total_fixes}")


✔ Enrolment → Balodabazar-Bhatapara : 46 rows fixed
✔ Enrolment → Balrampur-Ramanujganj : 22 rows fixed
✔ Enrolment → Dakshin Bastar Dantewada : 7 rows fixed
✔ Enrolment → Gaurela-Pendra-Marwahi : 3 rows fixed
✔ Enrolment → Janjgir-Champa : 1 rows fixed
✔ Enrolment → Uttar Bastar Kanker : 24 rows fixed
✔ Enrolment → Kabeerdham : 10 rows fixed
✔ Enrolment → Khairagarh-Chhuikhadan-Gandai : 4 rows fixed
✔ Enrolment → Korea : 19 rows fixed
✔ Demographic → Balodabazar-Bhatapara : 103 rows fixed
✔ Demographic → Balrampur-Ramanujganj : 38 rows fixed
✔ Demographic → Dakshin Bastar Dantewada : 26 rows fixed
✔ Demographic → Janjgir-Champa : 13 rows fixed
✔ Demographic → Uttar Bastar Kanker : 76 rows fixed
✔ Demographic → Kabeerdham : 19 rows fixed
✔ Demographic → Khairagarh-Chhuikhadan-Gandai : 23 rows fixed
✔ Demographic → Korea : 76 rows fixed
✔ Demographic → Mohla-Manpur-Ambagarh Chouki : 4 rows fixed
✔ Biometric → Ba
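After the mapping has been applied, a quick check that no listed variant is left for the state can confirm the fixes took effect. The sketch below uses a hypothetical helper `remaining_variants` and a toy frame; it is one possible way to verify the result, not a step the notebook itself runs.

```python
import pandas as pd

def remaining_variants(df, state, mapping):
    """Count rows in `state` whose district is still one of the listed wrong variants."""
    wrong = sorted({variant for variants in mapping.values() for variant in variants})
    mask = (df["state"] == state) & df["district"].isin(wrong)
    return df.loc[mask, "district"].value_counts().to_dict()

# Toy frame: one leftover variant in the target state, one in another state.
toy_df = pd.DataFrame(
    {
        "state": ["Chhattisgarh", "Chhattisgarh", "Odisha"],
        "district": ["Korea", "Koriya", "Koriya"],
    }
)

leftover = remaining_variants(toy_df, "Chhattisgarh", {"Korea": ["Koriya"]})
print(leftover)
```

An empty dictionary would indicate that every listed variant has been replaced within the state; rows from other states are deliberately ignored, mirroring the state filter in `apply_mapping`.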

In [4]:
enrol_df.to_csv(CLEAN_DIR / "enrolment_clean.csv", index=False)
demo_df.to_csv(CLEAN_DIR / "demographic_clean.csv", index=False)
bio_df.to_csv(CLEAN_DIR / "biometric_clean.csv", index=False)

print("💾 All cleaned files saved successfully (overwritten)")


💾 All cleaned files saved successfully (overwritten)
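Since the save step overwrites its own input files, an interrupted write could leave a cleaned file truncated. One way to reduce that risk, sketched here under the assumption that a hypothetical helper `save_csv_atomically` is acceptable in this workflow, is to write to a temporary file in the same directory and then swap it into place with `os.replace`.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

def save_csv_atomically(df, target):
    """Write df to a temporary file beside `target`, then swap it into place."""
    target = Path(target)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    os.close(fd)  # pandas reopens the path itself
    try:
        df.to_csv(tmp_name, index=False)
        os.replace(tmp_name, target)  # replaces the old file in one step
    except BaseException:
        # Clean up the partial temporary file and re-raise the original error.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise

# Demonstration in a throwaway directory rather than the real CLEAN_DIR.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "enrolment_clean.csv"
    save_csv_atomically(pd.DataFrame({"district": ["Durg"]}), out_path)
    round_trip = pd.read_csv(out_path)

print(round_trip["district"].tolist())
```

The temporary file is created next to the target so that the final rename stays on the same filesystem.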
