# Data‑Cleaning Pipeline – Quick Guide

Welcome! This short guide shows how to run the `data_cleaning_modular.py` script, tweak its behaviour, and understand the available options.

---

## 1. Directory layout

```text
project/
├─ data_cleaning_modular.py   # the refactored pipeline
├─ Projet+Python_Dataset_Edstats_csv/   # raw CSVs
└─ cleaned_data/              # will be auto‑created for outputs
```

*Change the paths in the **Configuration** section at the top of the script if your folders differ.*

---

## 2. Running the script

### 2.1 As a CLI script

```bash
python data_cleaning_modular.py
```

Each CSV declared in **`FILES_INFO`** is processed in order, and the result is saved to `./cleaned_data` as `<name>_cleaned.csv`.

### 2.2 From a Notebook

```python
from data_cleaning_modular import main, FILES_INFO
main()  # uses default list
```

To process a different set of files, pass your own list of `FileConfig` objects instead, e.g. `main(my_files)`.

---

## 3. Configuration building‑blocks

### 3.1 `FileConfig`

Represents one file to clean.

| Field              | Type                | Purpose                                                                              |
| ------------------ | ------------------- | ------------------------------------------------------------------------------------ |
| `filename`         | `str`               | CSV file name **inside** `SOURCE_DIR`.                                               |
| `critical_columns` | `list[str] \| None` | Rows with missing values in *all* of these columns are dropped; the columns themselves are exempt from sparsity-based dropping. |
| `options`          | `dict`              | Per‑file overrides for cleaning behaviour (see below).                               |

```python
{
    "filename": "EdStatsSeries.csv",
    "critical_columns": None,
    "options": {
        "remove_outliers": False,
        "drop_columns_threshold": 0.4,
    },
}
```

### 3.2 `CleaningOptions`

All numeric thresholds are fractions (0–1). Defaults are listed in the second column.

| Option                    | Default  | Meaning                                                                                    |
| ------------------------- | -------- | ------------------------------------------------------------------------------------------ |
| `drop_columns_threshold`  | **0.5**  | Drop any column whose missing‑value ratio > threshold *unless* it’s critical.              |
| `amputate_rows_threshold` | **0.05** | Per column: if the number of rows with a missing value is ≤ threshold × row count, those rows are dropped; otherwise the gaps are imputed. |
| `remove_duplicates`       | **True** | Toggle duplicate‑row removal.                                                              |
| `remove_outliers`         | **True** | Toggle numeric outlier removal (IQR rule).                                                 |
| `numeric_cols`            | `None`   | List of columns to apply outlier logic to. `None` = all numeric cols.                      |
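
The two thresholds can be read as simple per-column decisions. The sketch below restates them in plain Python. The function names are hypothetical and the script's internals may differ; the comparisons (`>` for columns, `≤` for rows) follow the table.

```python
def columns_to_drop(missing_ratio, threshold=0.5, critical=()):
    """Return columns whose missing ratio exceeds threshold, sparing critical ones."""
    return [
        col
        for col, ratio in missing_ratio.items()
        if ratio > threshold and col not in critical
    ]


def missing_value_strategy(n_missing, n_rows, threshold=0.05):
    """Decide what to do with one column's gaps: nothing, drop rows, or impute."""
    if n_missing == 0:
        return "nothing"
    return "drop_rows" if n_missing <= threshold * n_rows else "impute"


# Ratios taken from the EdStatsCountry-Series.csv log further down.
ratios = {"Unnamed: 3": 1.0, "CountryCode": 0.0, "SeriesCode": 0.0, "DESCRIPTION": 0.0}
print(columns_to_drop(ratios))            # ['Unnamed: 3']
print(missing_value_strategy(10, 613))    # drop_rows  (10 <= 30.65)
print(missing_value_strategy(100, 613))   # impute     (100 > 30.65)
```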

For example, to turn off outlier removal and tweak both thresholds:

```python
"options": {
    "remove_outliers": False,
    "drop_columns_threshold": 0.4,
    "amputate_rows_threshold": 0.02,
}
```

---

## 4. Behaviour at a glance

1. **Column pruning** – sparsity check, preserves critical columns.
2. **Row pruning** – drops a row only if it is missing **all** critical columns.
3. **Amputate vs. Impute** – small gaps → drop rows; bigger gaps → impute (median for numeric, mode for others).
4. **Duplicate removal** – optional.
5. **Outlier filtering** – optional, IQR (1.5×) on selected numeric columns.
6. **Save** – cleaned CSV to `OUTPUT_DIR`.
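
The imputation rule in step 3 (median for numeric columns, mode for the rest) can be illustrated without pandas. This is a dependency-free sketch using the standard `statistics` module, with `None` standing in for a missing value. The helper name `impute_column` is hypothetical, and the script itself presumably works on DataFrame columns rather than lists.

```python
from statistics import median, mode


def impute_column(values):
    """Fill None gaps with the median (numeric) or the mode (anything else)."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn a fill value from
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    fill = median(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]


print(impute_column([1.0, None, 3.0, 10.0]))        # [1.0, 3.0, 3.0, 10.0]
print(impute_column(["Low", None, "High", "Low"]))  # ['Low', 'Low', 'High', 'Low']
```

The median is used for numbers because a single extreme value (10.0 above) does not pull it the way it would pull a mean.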

Every step logs what it did so you can audit the run.
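
The IQR (1.5×) rule from step 5 is sketched below in plain Python. Two details are assumptions, since the guide does not state them: rows with a missing value in the column are kept, and values exactly on a fence count as inside. The function names and the sample values are illustrative only.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outlier_rows(rows, column, k=1.5):
    """Keep rows whose value in `column` is missing or inside the fences."""
    present = [r[column] for r in rows if r[column] is not None]
    if len(present) < 2:
        return list(rows)  # too few points to estimate quartiles
    low, high = iqr_bounds(present, k)
    return [r for r in rows if r[column] is None or low <= r[column] <= high]


rows = [{"2010": v} for v in (84.5, 85.2, 85.5, 86.1, 400.0, None)]
kept = drop_outlier_rows(rows, "2010")
print([r["2010"] for r in kept])  # [84.5, 85.2, 85.5, 86.1, None]
```

Here Q1 is 85.2 and Q3 is 86.1, so the fences sit at roughly 83.85 and 87.45 and only the 400.0 row is removed.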

---

## 5. Extending the pipeline

* **YAML/JSON config** – load the file, pass each dict to `FileConfig.from_mapping()`, then hand the resulting list to `main()`.
* **Custom imputation** – subclass `DataCleaner` and override `_handle_missing_values()`.
* **CI integration** – a single `python data_cleaning_modular.py` step in your pipeline cleans fresh data on every commit.
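
As a rough illustration of the JSON route, the sketch below parses a config and checks its shape before anything is handed to the pipeline. The config contents and the helper `load_file_mappings` are hypothetical. The final hand-off is shown only as a comment, because it needs the real module.

```python
import json

# Hypothetical config file contents; the keys mirror the FileConfig fields.
CONFIG_TEXT = """
[
  {"filename": "EdStatsCountry-Series.csv"},
  {"filename": "EdStatsSeries.csv",
   "options": {"remove_outliers": false, "drop_columns_threshold": 0.4}}
]
"""


def load_file_mappings(text):
    """Parse JSON text into a list of plain dicts, one per file to clean."""
    mappings = json.loads(text)
    if not isinstance(mappings, list):
        raise ValueError("config must be a JSON array of objects")
    for entry in mappings:
        if not isinstance(entry, dict) or "filename" not in entry:
            raise ValueError(f"each entry needs a 'filename': {entry!r}")
    return mappings


mappings = load_file_mappings(CONFIG_TEXT)
print([m["filename"] for m in mappings])
# ['EdStatsCountry-Series.csv', 'EdStatsSeries.csv']

# With the real module importable, the hand-off would presumably be:
#   from data_cleaning_modular import main, FileConfig
#   main([FileConfig.from_mapping(m) for m in mappings])
```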

Happy cleaning!


```python
from data_cleaning_modular import main, FileConfig

# Year columns used both as critical columns and as outlier targets for EdStatsData.csv.
YEAR_COLUMNS = ["2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2025", "2030"]

FILES_INFO: list[FileConfig] = [
    FileConfig.from_mapping(cfg)
    for cfg in [
        {
            "filename": "EdStatsCountry.csv",
            "critical_columns": [
                "Country Code",
                "Short Name",
                "Region",
                "Income Group",
            ],
        },
        # No critical columns or overrides: defaults apply.
        {
            "filename": "EdStatsCountry-Series.csv",
        },
        {
            "filename": "EdStatsSeries.csv",
        },
        {
            "filename": "EdStatsData.csv",
            "critical_columns": list(YEAR_COLUMNS),
            "options": {
                "numeric_cols": list(YEAR_COLUMNS),
            },
        },
    ]
]

main(FILES_INFO)
```


```text
--- Processing Projet+Python_Dataset_Edstats_csv/EdStatsCountry.csv ---

=== Cleaning: EdStatsCountry.csv ===
Initial shape: (241, 32)

Unnamed: 0,Country Code,Short Name,Table Name,Long Name,2-alpha code,Currency Unit,Special Notes,Region,Income Group,WB-2 code,...,IMF data dissemination standard,Latest population census,Latest household survey,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,Unnamed: 31
0,ABW,Aruba,Aruba,Aruba,AW,Aruban florin,SNA data for 2000-2011 are updated from offici...,Latin America & Caribbean,High income: nonOECD,AW,...,,2010,,,Yes,,,2012.0,,
1,AFG,Afghanistan,Afghanistan,Islamic State of Afghanistan,AF,Afghan afghani,Fiscal year end: March 20; reporting period fo...,South Asia,Low income,AF,...,General Data Dissemination System (GDDS),1979,"Multiple Indicator Cluster Survey (MICS), 2010/11","Integrated household survey (IHS), 2008",,2013/14,,2012.0,2000.0,

Missing percentage per column:
Unnamed: 31                                          100.0%
National accounts reference year                      86.7%
Alternative conversion factor                         80.5%
Other groups                                          75.9%
Latest industrial data                                55.6%
Vital registration complete                           53.9%
External debt Reporting status                        48.5%
Latest household survey                               41.5%
Latest agricultural census                            41.1%
Lending category                                      40.2%
PPP survey year                                       39.8%
Special Notes                                         39.8%
Source of most recent Income and expenditure data     33.6%
Government Accounting concept                         33.2%
Latest water withdrawal data                          25.7%
Balance of Payments Manual in use                     24.9%
IMF data

Unnamed: 0,Country Code,Short Name,Table Name,Long Name,2-alpha code,Currency Unit,Special Notes,Region,Income Group,WB-2 code,...,External debt Reporting status,System of trade,Government Accounting concept,IMF data dissemination standard,Latest population census,Latest household survey,Source of most recent Income and expenditure data,Latest agricultural census,Latest trade data,Latest water withdrawal data
0,ABW,Aruba,Aruba,Aruba,AW,Aruban florin,SNA data for 2000-2011 are updated from offici...,Latin America & Caribbean,High income: nonOECD,AW,...,Actual,Special trade system,Consolidated central government,General Data Dissemination System (GDDS),2010,"Multiple Indicator Cluster Survey (MICS), 2012","Integrated household survey (IHS), 2012",2010,2012.0,2000
1,AFG,Afghanistan,Afghanistan,Islamic State of Afghanistan,AF,Afghan afghani,Fiscal year end: March 20; reporting period fo...,South Asia,Low income,AF,...,Actual,General trade system,Consolidated central government,General Data Dissemination System (GDDS),1979,"Multiple Indicator Cluster Survey (MICS), 2010/11","Integrated household survey (IHS), 2008",2013/14,2012.0,2000

Saved cleaned file → cleaned_data/EdStatsCountry_cleaned.csv
```

```text
--- Processing Projet+Python_Dataset_Edstats_csv/EdStatsCountry-Series.csv ---

=== Cleaning: EdStatsCountry-Series.csv ===
Initial shape: (613, 4)

Unnamed: 0,CountryCode,SeriesCode,DESCRIPTION,Unnamed: 3
0,ABW,SP.POP.TOTL,Data sources : United Nations World Population...,
1,ABW,SP.POP.GROW,Data sources: United Nations World Population ...,

Missing percentage per column:
Unnamed: 3     100.0%
CountryCode      0.0%
SeriesCode       0.0%
DESCRIPTION      0.0%
dtype: object

Dropping columns with >50% missing: ['Unnamed: 3']
Columns after drop: ['CountryCode', 'SeriesCode', 'DESCRIPTION']

No critical columns set.

-- Handling missing values in non-critical columns --
No duplicates found.

=== Summary for EdStatsCountry-Series.csv ===
Final shape: (613, 3)
Columns: ['CountryCode', 'SeriesCode', 'DESCRIPTION']

Unnamed: 0,CountryCode,SeriesCode,DESCRIPTION
0,ABW,SP.POP.TOTL,Data sources : United Nations World Population...
1,ABW,SP.POP.GROW,Data sources: United Nations World Population ...

Saved cleaned file → cleaned_data/EdStatsCountry-Series_cleaned.csv
```

```text
--- Processing Projet+Python_Dataset_Edstats_csv/EdStatsSeries.csv ---

=== Cleaning: EdStatsSeries.csv ===
Initial shape: (3665, 21)

Unnamed: 0,Series Code,Topic,Indicator Name,Short definition,Long definition,Unit of measure,Periodicity,Base Period,Other notes,Aggregation method,...,Notes from original source,General comments,Source,Statistical concept and methodology,Development relevance,Related source links,Other web links,Related indicators,License Type,Unnamed: 20
0,BAR.NOED.1519.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15-19 with...,Percentage of female population age 15-19 with...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
1,BAR.NOED.1519.ZS,Attainment,Barro-Lee: Percentage of population age 15-19 ...,Percentage of population age 15-19 with no edu...,Percentage of population age 15-19 with no edu...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,

Missing percentage per column:
Other web links                        100.0%
Unnamed: 20                            100.0%
License Type                           100.0%
Notes from original source             100.0%
Unit of measure                        100.0%
Related indicators                     100.0%
Development relevance                   99.9%
General comments                        99.6%
Limitations and exceptions              99.6%
Statistical concept and methodology     99.4%
Aggregation method                      98.7%
Periodicity                             97.3%
Related source links                    94.1%
Base Period                             91.4%
Other notes                             84.9%
Short definition                        41.2%
Series Code                              0.0%
Long definition                          0.0%
Indicator Name                           0.0%
Topic                                    0.0%
Source                                   0.0%
dt

Unnamed: 0,Series Code,Topic,Indicator Name,Short definition,Long definition,Source
0,BAR.NOED.1519.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15-19 with...,Percentage of female population age 15-19 with...,Robert J. Barro and Jong-Wha Lee: http://www.b...
1,BAR.NOED.1519.ZS,Attainment,Barro-Lee: Percentage of population age 15-19 ...,Percentage of population age 15-19 with no edu...,Percentage of population age 15-19 with no edu...,Robert J. Barro and Jong-Wha Lee: http://www.b...

Saved cleaned file → cleaned_data/EdStatsSeries_cleaned.csv
```

```text
--- Processing Projet+Python_Dataset_Edstats_csv/EdStatsData.csv ---

=== Cleaning: EdStatsData.csv ===
Initial shape: (886930, 70)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
0,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,,,,,,,...,,,,,,,,,,
1,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,,,,,,,...,,,,,,,,,,

Missing percentage per column:
Unnamed: 69       100.0%
2017              100.0%
2016               98.1%
1971               96.0%
1973               96.0%
                   ...  
2010               72.7%
Indicator Name      0.0%
Indicator Code      0.0%
Country Name        0.0%
Country Code        0.0%
Length: 70, dtype: object

Dropping columns with >50% missing: ['1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2020', '2035', '2040', '2045', '2050', '2055', '2060', '2065', '2070', '2075', '2080', '2085', '2090', '2095', '2100', 'Unnamed: 69']
Columns after drop: ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2025', '2030']

-- M

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,2010,2011,2012,2013,2014,2015,2016,2017,2025,2030
4,Arab World,ARB,"Adjusted net enrolment rate, primary, both sex...",SE.PRM.TENR,85.211998,85.24514,86.101669,85.51194,85.320152,,,,,
5,Arab World,ARB,"Adjusted net enrolment rate, primary, female (%)",SE.PRM.TENR.FE,82.871651,82.861389,84.401413,83.914032,83.820831,,,,,

Saved cleaned file → cleaned_data/EdStatsData_cleaned.csv
```
