## PM2.5 Concentration Data Cleaning

This notebook documents the process of cleaning and preparing the PM2.5 air pollution dataset for analysis.

- **Goal**: To ensure data quality and consistency for valid statistical analysis
- **Dataset**: Annual mean PM2.5 concentrations by country from the WHO
- **Timeframe**: 2010–2019
- **Geography**: 25 diverse countries

---

### 📦 Import Required Pandas Library

In [51]:
import pandas as pd

### 📁 Loading the Dataset

Load the raw CSV file, make a copy that will hold the cleaned dataset,
and preview the first few rows.

In [52]:
raw_data = pd.read_csv("../1_datasets/raw_datasets/pm25_annual_concentration.csv")
cleaned_pm25_data = raw_data.copy()
cleaned_pm25_data.head()

Unnamed: 0,IndicatorCode,Indicator,ValueType,ParentLocationCode,ParentLocation,Location type,SpatialDimValueCode,Location,Period type,Period,...,FactValueUoM,FactValueNumericLowPrefix,FactValueNumericLow,FactValueNumericHighPrefix,FactValueNumericHigh,Value,FactValueTranslationID,FactComments,Language,DateModified
0,SDGPM25,Concentrations of fine particulate matter (PM2.5),text,AFR,Africa,Country,KEN,Kenya,Year,2019,...,,,6.29,,13.74,10.01 [6.29-13.74],,,EN,2022-08-11T21:00:00.000Z
1,SDGPM25,Concentrations of fine particulate matter (PM2.5),text,AMR,Americas,Country,TTO,Trinidad and Tobago,Year,2019,...,,,7.44,,12.55,10.02 [7.44-12.55],,,EN,2022-08-11T21:00:00.000Z
2,SDGPM25,Concentrations of fine particulate matter (PM2.5),text,EUR,Europe,Country,GBR,United Kingdom of Great Britain and Northern I...,Year,2019,...,,,9.73,,10.39,10.06 [9.73-10.39],,,EN,2022-08-11T21:00:00.000Z
3,SDGPM25,Concentrations of fine particulate matter (PM2.5),text,AMR,Americas,Country,GRD,Grenada,Year,2019,...,,,7.07,,13.2,10.08 [7.07-13.20],,,EN,2022-08-11T21:00:00.000Z
4,SDGPM25,Concentrations of fine particulate matter (PM2.5),text,AMR,Americas,Country,BRA,Brazil,Year,2019,...,,,8.23,,12.46,10.09 [8.23-12.46],,,EN,2022-08-11T21:00:00.000Z


In [53]:
# check the initial number of rows and columns
cleaned_pm25_data.shape

(9450, 34)

### 🧾 Initial Column Inspection
We inspect the full list of column names to identify:
- Metadata columns
- Unnecessary columns


In [54]:
cleaned_pm25_data.columns

Index(['IndicatorCode', 'Indicator', 'ValueType', 'ParentLocationCode',
       'ParentLocation', 'Location type', 'SpatialDimValueCode', 'Location',
       'Period type', 'Period', 'IsLatestYear', 'Dim1 type', 'Dim1',
       'Dim1ValueCode', 'Dim2 type', 'Dim2', 'Dim2ValueCode', 'Dim3 type',
       'Dim3', 'Dim3ValueCode', 'DataSourceDimValueCode', 'DataSource',
       'FactValueNumericPrefix', 'FactValueNumeric', 'FactValueUoM',
       'FactValueNumericLowPrefix', 'FactValueNumericLow',
       'FactValueNumericHighPrefix', 'FactValueNumericHigh', 'Value',
       'FactValueTranslationID', 'FactComments', 'Language', 'DateModified'],
      dtype='object')
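
One possible way to shortlist columns for dropping is to look for columns that are entirely empty or that repeat a single value. The sketch below runs on a small made-up frame (`toy`), not on the WHO file, and the helper name `uninformative_columns` is hypothetical.

```python
import pandas as pd


def uninformative_columns(df: pd.DataFrame) -> list[str]:
    """Return columns that are all-null or hold a single distinct value."""
    # nunique(dropna=True) is 0 for all-null columns and 1 for constant ones
    distinct = df.nunique(dropna=True)
    return distinct[distinct <= 1].index.tolist()


# Toy data for illustration only; column names mirror the raw dataset
toy = pd.DataFrame(
    {
        "IndicatorCode": ["SDGPM25", "SDGPM25", "SDGPM25"],
        "Location": ["Kenya", "Brazil", "Fiji"],
        "FactComments": [None, None, None],
        "FactValueNumeric": [10.01, 10.09, 7.19],
    }
)

candidates = uninformative_columns(toy)
print(candidates)
```

A constant column is not automatically useless (it may record the unit or the indicator), so the result is a list to review by hand rather than a list to drop blindly.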

### 🧹 Column Renaming and Dropping
Drop columns:
- Remove columns not needed for the analysis

Rename columns:
- Give the remaining columns standardized, descriptive names


In [55]:
cleaned_pm25_data = cleaned_pm25_data.drop(
    [
        "IndicatorCode",
        "Indicator",
        "ValueType",
        "ParentLocationCode",
        "ParentLocation",
        "SpatialDimValueCode",
        "Period type",
        "IsLatestYear",
        "Dim1 type",
        "Dim1ValueCode",
        "Dim2 type",
        "Dim2",
        "Dim2ValueCode",
        "Dim3 type",
        "Dim3",
        "Dim3ValueCode",
        "DataSourceDimValueCode",
        "DataSource",
        "FactValueNumericPrefix",
        "FactValueUoM",
        "FactValueNumericLowPrefix",
        "FactValueNumericHighPrefix",
        "FactValueTranslationID",
        "FactComments",
        "Language",
        "DateModified",
        "Location type",
        "Value",
    ],
    axis=1,
    errors="ignore",
)
cleaned_pm25_data.head()

Unnamed: 0,Location,Period,Dim1,FactValueNumeric,FactValueNumericLow,FactValueNumericHigh
0,Kenya,2019,Cities,10.01,6.29,13.74
1,Trinidad and Tobago,2019,Rural,10.02,7.44,12.55
2,United Kingdom of Great Britain and Northern I...,2019,Cities,10.06,9.73,10.39
3,Grenada,2019,Total,10.08,7.07,13.2
4,Brazil,2019,Towns,10.09,8.23,12.46


In [56]:
cleaned_pm25_data = cleaned_pm25_data.rename(
    columns={
        "Location": "Country",
        "Period": "Year",
        "FactValueNumeric": "PM25 concentration",
        "Dim1": "Population Category",
        "FactValueNumericLow": "PM25 lower bound",
        "FactValueNumericHigh": "PM25 upper bound",
    }
)
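
By default, `rename` skips keys that do not match any column, so a typo in the mapping goes unnoticed. As a minimal sketch on a toy one-row frame (the misspelled key `"Perod"` is deliberate and hypothetical), passing `errors="raise"` makes pandas report the problem instead:

```python
import pandas as pd

# Toy single-row frame using the original WHO column names
toy = pd.DataFrame(
    {
        "Location": ["Kenya"],
        "Period": [2019],
        "Dim1": ["Cities"],
        "FactValueNumeric": [10.01],
    }
)

# "Perod" is a deliberate typo to show the failure mode
typo_mapping = {"Location": "Country", "Perod": "Year"}

# Default behavior: the unknown key is skipped without any warning
silently_renamed = toy.rename(columns=typo_mapping)
print(list(silently_renamed.columns))

# With errors="raise", pandas raises KeyError for the unknown key
try:
    toy.rename(columns=typo_mapping, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)
```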

### 🎯 Filtering by Country, Year, and Category

To keep the data consistent and focused on relevant entries, we filter by country, year, and population category using the `isin()` method.

In [57]:
countries_to_keep = [
    "Egypt",
    "India",
    "United States of America",
    "Chile",
    "Nigeria",
    "Italy",
    "China",
    "Japan",
    "Afghanistan",
    "Germany",
    "Russian Federation",
    "Spain",
    "Romania",
    "Indonesia",
    "Saudi Arabia",
    "Bangladesh",
    "Republic of Korea",
    "South Africa",
    "Ethiopia",
    "Kenya",
    "Brazil",
    "Mexico",
    "Canada",
    "Australia",
    "Fiji",
]
years_to_keep = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
category_to_keep = ["Total"]

cleaned_pm25_data = cleaned_pm25_data[
    cleaned_pm25_data["Country"].isin(countries_to_keep)
    & cleaned_pm25_data["Year"].isin(years_to_keep)
    & cleaned_pm25_data["Population Category"].isin(category_to_keep)
]
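
Because `isin()` ignores names that never match, a misspelled country would simply vanish from the result. The sketch below, on made-up rows, shows one way to list requested countries that did not survive the filter; `"Kenia"` is a deliberately wrong spelling and the variable names are illustrative.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame(
    {
        "Country": ["Kenya", "Kenya", "Brazil", "Fiji"],
        "Year": [2019, 2018, 2019, 2009],
        "Population Category": ["Total", "Cities", "Total", "Total"],
    }
)

countries = ["Kenya", "Brazil", "Fiji", "Kenia"]  # "Kenia" is a deliberate typo
years = list(range(2010, 2020))  # 2010 through 2019 inclusive

mask = (
    toy["Country"].isin(countries)
    & toy["Year"].isin(years)
    & toy["Population Category"].isin(["Total"])
)
filtered = toy[mask]

# Requested countries with no remaining rows after filtering
missing_countries = sorted(set(countries) - set(filtered["Country"]))
print(missing_countries)
```

On the real data, a similar check could compare the row count against the number of countries times the number of years (25 × 10 = 250 in this notebook).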

### 🔍 Summary Statistics & Missing Data

- Check `.info()` for data types and nulls
- Use `.isnull().sum()` to count missing values

In [58]:
cleaned_pm25_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 250 entries, 32 to 9420
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Country              250 non-null    object 
 1   Year                 250 non-null    int64  
 2   Population Category  250 non-null    object 
 3   PM25 concentration   250 non-null    float64
 4   PM25 lower bound     250 non-null    float64
 5   PM25 upper bound     250 non-null    float64
dtypes: float64(3), int64(1), object(2)
memory usage: 13.7+ KB


In [59]:
cleaned_pm25_data.isnull().sum()

Country                0
Year                   0
Population Category    0
PM25 concentration     0
PM25 lower bound       0
PM25 upper bound       0
dtype: int64
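
Beyond counting nulls, a further sanity check could confirm that each concentration lies between its lower and upper bound. This is a sketch on three rows copied from the cleaned output, not a step from the original notebook:

```python
import pandas as pd

# Three rows taken from the cleaned dataset, for illustration
toy = pd.DataFrame(
    {
        "Country": ["Germany", "Japan", "Brazil"],
        "PM25 concentration": [10.73, 10.84, 10.94],
        "PM25 lower bound": [10.56, 10.02, 9.37],
        "PM25 upper bound": [10.93, 11.55, 13.01],
    }
)

# Element-wise check: lower bound <= concentration <= upper bound
in_bounds = toy["PM25 concentration"].between(
    toy["PM25 lower bound"], toy["PM25 upper bound"]
)

# Rows that violate the expected ordering (empty if the data is consistent)
violations = toy[~in_bounds]
print(len(violations))
```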

### 🧹 Drop Unnecessary Column

We drop the `"Population Category"` column: after filtering it contains only the value `"Total"`, so it is no longer needed for our analysis.  
We also reset the index so that it is clean and sequential after filtering.


In [60]:
cleaned_pm25_data = cleaned_pm25_data.drop(
    ["Population Category"], axis=1, errors="ignore"
)
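# reset_index returns a new DataFrame, so assign the result back
cleaned_pm25_data = cleaned_pm25_data.reset_index(drop=True)
cleaned_pm25_data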

Unnamed: 0,Country,Year,PM25 concentration,PM25 lower bound,PM25 upper bound
0,Germany,2019,10.73,10.56,10.93
1,Japan,2019,10.84,10.02,11.55
2,Brazil,2019,10.94,9.37,13.01
3,Kenya,2019,12.52,7.80,17.78
4,Romania,2019,13.30,12.55,14.10
...,...,...,...,...,...
245,Afghanistan,2010,68.97,49.48,96.51
246,Fiji,2010,7.19,3.79,12.39
247,Canada,2010,7.76,7.55,7.96
248,Australia,2010,9.03,8.14,9.93


### 💾 Export Cleaned Data

Save the cleaned dataset to a `.csv` file for future use:

In [61]:
cleaned_pm25_data.to_csv("cleaned_pm25_data.csv", index=False)
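
As an optional follow-up, the export can be verified by reading it back and comparing it with the in-memory frame. The sketch below uses an in-memory `io.StringIO` buffer and a two-row toy frame instead of the real file, purely for illustration:

```python
import io

import pandas as pd

# Two rows taken from the cleaned dataset, for illustration
toy = pd.DataFrame(
    {
        "Country": ["Germany", "Japan"],
        "Year": [2019, 2019],
        "PM25 concentration": [10.73, 10.84],
    }
)

# Write to an in-memory buffer the same way the notebook writes to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new DataFrame
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if columns, dtypes, or values differ
pd.testing.assert_frame_equal(toy, reloaded)
```

With `index=False`, the row index is not written, so the reloaded frame gets a fresh 0-based index.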