# **Data Handling and Preprocessing**  
## **Professional Air Pollution Data Processing Notebook**  
*This notebook focuses on comprehensive preprocessing of air pollution time-series data. The goal is to clean, structure, and transform raw data into a format suitable for forecasting and analysis.*  
📌 **Objectives of this Notebook:**  
- Load, inspect, and preprocess raw data.  
- Handle missing values and inconsistencies.  
- Create engineered features to enhance forecasting models.  
- Save the final processed dataset for modeling and visualization.  
***


## **Step 1: Import Required Libraries**  
To manage and preprocess the air pollution data, we import two Python libraries:  
- **`pandas`**: For data manipulation and handling.  
- **`matplotlib.pyplot`**: For visualization (if needed later).  

We also print the installed pandas version so that results can be reproduced with the same release.

In [31]:
import pandas as pd
import matplotlib.pyplot as plt

# Record the pandas version for reproducibility
print(f"Using pandas version: {pd.__version__}")




Using pandas version: 2.2.3


## **Step 2: Load the Raw Dataset**  
The dataset, stored in CSV format, is loaded into a Pandas DataFrame.  
**Key actions in this step:**  
- Read the file and store it as `raw_data`.  
- Display the first and last few rows for initial inspection.  
- Identify potential missing values and inconsistencies.
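
The missing-value check mentioned in the list above could be sketched as follows. The toy frame and its values are made up for illustration and only mimic a few of the dataset's columns; `summary` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw CSV; the values are illustrative only.
toy = pd.DataFrame({
    "Timestamp": ["01-01-2017", "02-01-2017", "03-01-2017"],
    "PM2.5 (µg/m³)": [230.5, np.nan, np.nan],
    "TOT-RF (mm)": [0.0, 0.0, 0.0],
    "Station_ID": [1, 1, 1],
})

# Count and share of missing values per column.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean().round(2)

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary)
```

Running the same two calls on `raw_data` would show which columns are mostly empty before any filling is applied.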

In [32]:
# Step 2: Load the raw dataset
file_path = "all_stations_daily_data.csv"
raw_data = pd.read_csv(file_path)
print("Step 2: Loaded raw dataset")
print(raw_data.head(3))
print(raw_data.tail(3))


Step 2: Loaded raw dataset
    Timestamp  PM2.5 (µg/m³)  PM10 (µg/m³)  NO (µg/m³)  NO2 (µg/m³)  \
0  01-01-2017          230.5        329.45       14.22        11.21   
1  02-01-2017            NaN           NaN         NaN          NaN   
2  03-01-2017            NaN           NaN         NaN          NaN   

   NOx (ppb)  NH3 (µg/m³)  SO2 (µg/m³)  CO (mg/m³)  Ozone (µg/m³)  ...  \
0      25.43          NaN          NaN        0.74           56.5  ...   
1        NaN          NaN          NaN         NaN            NaN  ...   
2        NaN          NaN          NaN         NaN            NaN  ...   

   AT (°C)  RH (%)  WS (m/s)  WD (deg)  RF (mm)  TOT-RF (mm)  SR (W/mt2)  \
0      NaN     NaN       NaN       NaN      NaN          0.0         NaN   
1      NaN     NaN       NaN       NaN      NaN          0.0         NaN   
2      NaN     NaN       NaN       NaN      NaN          0.0         NaN   

   BP (mmHg)  VWS (m/s)  Station_ID  
0        NaN        NaN           1  
1        N

## **Step 3: Convert 'Timestamp' to Datetime Format**  
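The `Timestamp` column is read as strings in day-first `DD-MM-YYYY` form (for example, `02-01-2017` is 2 January 2017).  
To perform time-series operations, we convert it to a `datetime` type with `dayfirst=True`; `errors='coerce'` turns any unparseable entry into `NaT` instead of raising an error.  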
**Why is this necessary?**  
- Ensures proper handling of time-dependent trends.  
- Enables feature engineering (such as extracting year, month, and day).  
- Facilitates smooth visualization and forecasting.
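
A small sketch of how day-first parsing and `errors='coerce'` behave, using made-up strings rather than the real file. Counting `NaT` values afterwards is one way to see how many timestamps failed to parse.

```python
import pandas as pd

# Illustrative strings: two valid day-first dates and one invalid entry.
dates = pd.Series(["01-01-2017", "13-01-2017", "not a date"])

# dayfirst=True reads "13-01-2017" as 13 January; the invalid entry becomes NaT.
parsed = pd.to_datetime(dates, dayfirst=True, errors="coerce")

n_failed = parsed.isna().sum()
print(parsed)
print(f"Unparseable timestamps: {n_failed}")
```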

In [33]:
# Step 3: Convert 'Timestamp' to datetime for feature engineering
raw_data['Timestamp'] = pd.to_datetime(raw_data['Timestamp'], dayfirst=True, errors='coerce')
print("\nStep 3: Converted 'Timestamp' to datetime")
print(raw_data[['Timestamp']].head())


Step 3: Converted 'Timestamp' to datetime
   Timestamp
0 2017-01-01
1 2017-01-02
2 2017-01-03
3 2017-01-04
4 2017-01-05


## **Step 4: Generate Time-based Features**  
To give forecasting models access to calendar effects, we extract four time-based features:  
- **Year**: Extracts the year from the timestamp.  
- **Month**: Extracts the month of the year.  
- **Day**: Extracts the day of the month.  
- **DayOfWeek**: Identifies the weekday (0=Monday, 6=Sunday).  

These features let a model pick up long-term trends, monthly seasonality, and weekday effects.

In [34]:

# Step 4: Add time-based features
raw_data['Year'] = raw_data['Timestamp'].dt.year
raw_data['Month'] = raw_data['Timestamp'].dt.month
raw_data['Day'] = raw_data['Timestamp'].dt.day
raw_data['DayOfWeek'] = raw_data['Timestamp'].dt.dayofweek
print("\nStep 4: Added time-based features")
print(raw_data[['Year', 'Month', 'Day', 'DayOfWeek']].head())


Step 4: Added time-based features
   Year  Month  Day  DayOfWeek
0  2017      1    1          6
1  2017      1    2          0
2  2017      1    3          1
3  2017      1    4          2
4  2017      1    5          3


## **Step 5: Define Core Features for Processing**  
Certain pollutants are key indicators of air quality. We focus on the following:  
- **PM2.5 (µg/m³)**: Fine particulate matter (particles up to 2.5 µm in diameter).  
- **PM10 (µg/m³)**: Inhalable particulate matter (particles up to 10 µm in diameter).  
- **NOx (ppb)**: Nitrogen oxides concentration.  
- **CO (mg/m³)**: Carbon monoxide concentration.  
- **Ozone (µg/m³)**: Ground-level ozone concentration.  

The interpolation, rolling-average, and lag steps that follow are applied only to these five columns.
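
The later cells skip any core feature that is absent from the DataFrame without reporting it. A minimal sketch of an explicit check is shown below; the toy frame with two columns deliberately left out and the name `missing_columns` are assumptions for illustration.

```python
import pandas as pd

core_features = ['PM2.5 (µg/m³)', 'PM10 (µg/m³)', 'NOx (ppb)', 'CO (mg/m³)', 'Ozone (µg/m³)']

# Toy frame with only some of the expected columns (illustrative).
toy = pd.DataFrame(columns=['Timestamp', 'PM2.5 (µg/m³)', 'PM10 (µg/m³)', 'NOx (ppb)', 'Station_ID'])

# List the core features that are not present, so nothing is skipped silently.
missing_columns = [col for col in core_features if col not in toy.columns]
if missing_columns:
    print(f"Core features not found in the data: {missing_columns}")
```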

In [35]:
# Step 5: Define core features for processing
core_features = ['PM2.5 (µg/m³)', 'PM10 (µg/m³)', 'NOx (ppb)', 'CO (mg/m³)', 'Ozone (µg/m³)']
print("\nStep 5: Defined core features for processing")
print(core_features)


Step 5: Defined core features for processing
['PM2.5 (µg/m³)', 'PM10 (µg/m³)', 'NOx (ppb)', 'CO (mg/m³)', 'Ozone (µg/m³)']


## **Step 6: Interpolate Missing Values**  
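Missing readings in the core pollutant columns are filled before any derived features are computed.  
**Approach used:**  
- **Linear interpolation:** Each gap is estimated from the neighboring values within the same station (`Station_ID`) group. With `limit_direction='both'`, gaps at the start or end of a station's series are filled with the nearest valid value.  
- **Explicit rounding:** Values are rounded to four decimal places to maintain consistency.  

Interpolated values are estimates: a long gap is filled with a straight line between the measurements on either side of it.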
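
The following sketch illustrates per-station linear interpolation on a toy frame whose rows are deliberately interleaved across two stations. It uses `groupby(...).transform(...)`, which returns values aligned to the original row index; the column values are made up.

```python
import numpy as np
import pandas as pd

col = "PM2.5 (µg/m³)"

# Rows alternate between stations 1 and 2 (illustrative values).
toy = pd.DataFrame({
    "Station_ID": [1, 2, 1, 2, 1, 2],
    col: [10.0, 100.0, np.nan, np.nan, 30.0, 300.0],
})

# Interpolate within each station; transform keeps each value on its own row.
toy[col] = (
    toy.groupby("Station_ID")[col]
    .transform(lambda s: s.interpolate(method="linear", limit_direction="both"))
    .round(4)
)
print(toy)
```

Each gap is filled from its own station's neighbors (20.0 for station 1, 200.0 for station 2), regardless of row order.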

In [36]:

# Step 6: Interpolate missing values with explicit rounding
for feature in core_features:
    if feature in raw_data.columns:
        # transform() returns a result aligned to raw_data's index, so values
        # stay on the correct rows even if the rows are not sorted by station
        raw_data[feature] = (
            raw_data.groupby('Station_ID')[feature]
            .transform(lambda x: x.interpolate(method='linear', limit_direction='both'))
            .round(4)  # Explicit rounding to 4 decimal places
        )







## **Step 7: Create Rolling Averages and Lag Features**  
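To capture temporal dependencies, we add two derived features per pollutant, computed separately for each station:  
- **7-day rolling mean**: The average over the current row and up to six preceding rows. With `min_periods=1`, the first rows of each station are averaged over fewer values instead of being left as NaN.  
- **Lag feature**: The previous row's pollutant value (`shift(1)`), which is NaN for the first row of each station.  

Both features rely on the data having one row per station per day, in chronological order.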
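
A toy illustration of the two derived features, with made-up values and a shortened column name (`PM2.5`) for readability. It shows that both are computed within each station, so station 2 does not inherit values from station 1.

```python
import pandas as pd

toy = pd.DataFrame({
    "Station_ID": [1, 1, 1, 2, 2],
    "PM2.5": [10.0, 20.0, 30.0, 100.0, 200.0],
})

grouped = toy.groupby("Station_ID")["PM2.5"]

# Rolling mean over up to 7 rows; early rows use however many values exist.
toy["PM2.5_rolling_mean"] = grouped.transform(
    lambda s: s.rolling(window=7, min_periods=1).mean()
)

# Previous row's value within the same station; NaN on each station's first row.
toy["PM2.5_lag_1"] = grouped.shift(1)
print(toy)
```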

In [37]:
# Step 7: Create rolling averages and lag features for pollutants
for feature in core_features:
    if feature in raw_data.columns:
        # Rolling mean
        raw_data[f'{feature}_rolling_mean'] = raw_data.groupby("Station_ID")[feature].transform(
            lambda x: x.rolling(window=7, min_periods=1).mean()
        ).round(4)  # Explicit rounding to 4 decimal places
        # Lag feature
        raw_data[f'{feature}_lag_1'] = raw_data.groupby("Station_ID")[feature].shift(1).round(4)
print("\nStep 7: Created rolling averages and lag features")
print(raw_data[[f'{feature}_rolling_mean' for feature in core_features]].head())


Step 7: Created rolling averages and lag features
   PM2.5 (µg/m³)_rolling_mean  PM10 (µg/m³)_rolling_mean  \
0                    230.5000                   329.4500   
1                    230.0676                   328.8471   
2                    229.6352                   328.2442   
3                    229.2028                   327.6412   
4                    228.7705                   327.0383   

   NOx (ppb)_rolling_mean  CO (mg/m³)_rolling_mean  Ozone (µg/m³)_rolling_mean  
0                 25.4300                   0.7400                     56.5000  
1                 25.4745                   0.7410                     56.4408  
2                 25.5190                   0.7421                     56.3816  
3                 25.5636                   0.7431                     56.3224  
4                 25.6081                   0.7441                     56.2632  


## **Step 8: Handle NaNs in Derived Features**  
The lag features contain a NaN in the first row of each station. To address this, we use:  
- **Backward-fill (`bfill`)**: Fills each NaN with the next valid value below it.  
- **Explicit rounding**: Keeps values at four decimal places.  

This fills the NaNs that `shift(1)` introduces; the rolling means are computed with `min_periods=1` and normally have none to fill. Missing values in other columns are handled in the next step.
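
As a hedged variant, the backward-fill can be restricted to each station with `groupby(...).bfill()`, followed by a count of any NaNs that remain. Note that this differs from the notebook's cell, which fills down the whole column without grouping; the toy values and the shortened column name are illustrative.

```python
import numpy as np
import pandas as pd

# Lag column as it looks right after shift(1): NaN on each station's first row.
toy = pd.DataFrame({
    "Station_ID": [1, 1, 2, 2],
    "PM2.5_lag_1": [np.nan, 10.0, np.nan, 100.0],
})

# Backward-fill within each station only.
toy["PM2.5_lag_1"] = toy.groupby("Station_ID")["PM2.5_lag_1"].bfill()

# Verify that nothing is left unfilled.
remaining = toy["PM2.5_lag_1"].isna().sum()
print(toy)
print(f"NaNs remaining in lag column: {remaining}")
```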

In [38]:
# Step 8: Handle NaNs in derived features with rounding
for feature in core_features:
    if feature in raw_data.columns:
        # Series.bfill() replaces the deprecated fillna(method='bfill')
        raw_data[f'{feature}_lag_1'] = raw_data[f'{feature}_lag_1'].bfill().round(4)
        raw_data[f'{feature}_rolling_mean'] = raw_data[f'{feature}_rolling_mean'].bfill().round(4)
print("\nStep 8: Handled NaNs in derived features")
print(raw_data[[f'{feature}_lag_1' for feature in core_features]].head())


Step 8: Handled NaNs in derived features
   PM2.5 (µg/m³)_lag_1  PM10 (µg/m³)_lag_1  NOx (ppb)_lag_1  CO (mg/m³)_lag_1  \
0             230.5000            329.4500          25.4300            0.7400   
1             230.5000            329.4500          25.4300            0.7400   
2             229.6352            328.2442          25.5190            0.7421   
3             228.7705            327.0383          25.6081            0.7441   
4             227.9057            325.8325          25.6971            0.7462   

   Ozone (µg/m³)_lag_1  
0              56.5000  
1              56.5000  
2              56.3816  
3              56.2632  
4              56.1449  




## **Step 9: Forward-Fill Any Remaining Missing Data**  
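Any missing values that remain in other columns, such as the meteorological readings, are forward-filled (`ffill`): each NaN takes the last valid value above it in the same column.  
**Points to note:**  
- Gaps inside a series are filled, which keeps each column continuous.  
- NaNs at the very start of a column have no earlier value and remain NaN.  
- The fill is applied to the whole DataFrame rather than per station, so a value can carry over from the last row of one station into the first rows of the next.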
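
A short sketch of what forward-filling does and does not fill, on made-up values with shortened column names. Forward-fill has nothing to copy into a NaN at the top of a column, so such values stay missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NH3": [np.nan, 5.0, np.nan, 7.0],   # starts with a missing value
    "RH": [60.0, np.nan, np.nan, 55.0],  # gap in the middle only
})

# Propagate the last valid value downward in every column.
filled = toy.ffill()

# Leading NaNs are still there after the fill.
remaining = filled.isna().sum()
print(filled)
print(remaining)
```

A check like `remaining` on the real DataFrame would show whether any columns still need attention before modeling.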

In [39]:
# Step 9: Forward-fill any remaining NaNs
# DataFrame.ffill() replaces the deprecated fillna(method='ffill')
raw_data = raw_data.ffill().round(4)
print("\nStep 9: Forward-filled remaining NaNs")


Step 9: Forward-filled remaining NaNs




## **Step 10: Save the Enhanced Dataset**  
Once the data is fully processed, we save it to a new CSV file for further analysis and modeling.  
**Filename:** `Enhanced_Time-Series_Air_Pollution_Data_Revised.csv`  

In [40]:
# Step 10: Save the enhanced dataset to a new file with rounding applied
output_path = "Enhanced_Time-Series_Air_Pollution_Data_Revised.csv"
cleaned_data = raw_data.reset_index(drop=True).round(4)
cleaned_data.to_csv(output_path, index=False)
print(f"\nStep 10: Enhanced dataset saved to: {output_path}")


Step 10: Enhanced dataset saved to: Enhanced_Time-Series_Air_Pollution_Data_Revised.csv

This cleaned dataset is now ready for advanced modeling, forecasting, and visualization.
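
When the saved file is read back later, the timestamp column comes in as plain strings unless it is parsed again. The sketch below shows a round trip on a toy frame written to a temporary directory; the file name `toy_enhanced.csv` and the values are hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2017-01-01", "2017-01-02"]),
    "PM2.5 (µg/m³)": [230.5, 229.6352],
    "Station_ID": [1, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_enhanced.csv")  # hypothetical file name
    toy.to_csv(path, index=False)
    # parse_dates restores the datetime dtype that CSV cannot store.
    reloaded = pd.read_csv(path, parse_dates=["Timestamp"])

print(reloaded.dtypes)
```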