# XGBoost Model for Predicting Atmospheric Emissions

In this notebook, we develop an XGBoost model to predict emissions of four pollutants (`nox`, `pm10`, `pm2.5`, `so2`) from the preprocessed training and test datasets.

---

## 1. Importing Libraries and Loading the Data

We will start by loading the necessary libraries and data.


In [1]:
# Import necessary libraries
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score

# Define paths for train and test data
TRAIN_DATA_PATH = '../../data/processed/train_data.csv'
TEST_DATA_PATH = '../../data/processed/test_data.csv'

# Load the data
train_data = pd.read_csv(TRAIN_DATA_PATH)
test_data = pd.read_csv(TEST_DATA_PATH)

# Display the first few rows of the data
train_data.head()


| Unnamed: 0 | Easting | Northing | Borough_Barnet | Borough_Bexley | Borough_Brent | Borough_Bromley | Borough_Camden | Borough_City | Borough_City of Westminster | Borough_Croydon | ... | Source_Small Private Vessels | Source_Small Scale Waste Burning | Source_Taxi | Source_TfL Bus | Source_WTS | Source_Wood Burning | nox | pm10 | pm2.5 | so2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.152332 | 0.689451 | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | -0.07177 | -0.090787 | -0.1597 | -0.014359 |
| 1 | 0.880811 | 1.171608 | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | 0.053368 | -0.084313 | -0.129099 | -0.005289 |
| 2 | -0.841094 | 1.332327 | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | -0.071278 | -0.091106 | -0.160221 | -0.014359 |
| 3 | -0.634465 | 0.93053 | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | -0.07177 | -0.092857 | -0.163465 | -0.014359 |
| 4 | -0.772218 | 1.010889 | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | -0.07177 | -0.092857 | -0.163465 | -0.014359 |
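
The `Unnamed: 0` column in the preview above is what pandas typically produces when a DataFrame's index was written to CSV and then read back without `index_col`. Assuming that is what happened here (the notebook does not confirm it), the sketch below shows the difference on a small in-memory CSV; `csv_text` is a made-up stand-in, not the real `train_data.csv`. Reading the real files this way would remove one column, so the feature count would be one lower than the shapes printed later in the notebook.

```python
import io

import pandas as pd

# Toy CSV standing in for train_data.csv (hypothetical content, for illustration only).
# The empty first header field mimics a saved DataFrame index.
csv_text = ",Easting,nox\n0,-0.15,-0.07\n1,0.88,0.05\n"

# Default read: the unnamed leading column becomes a regular column called "Unnamed: 0".
with_index_column = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 treats that column as the row index instead of a feature.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```

If the column is kept, the model receives a row counter as a feature, which is unlikely to carry information about emissions.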


## 2. Separating Features and Target Variables

Next, we will separate the features (X) and the target columns (y) that we are trying to predict.


In [2]:
# Define the target columns (pollutants)
TARGET_COLUMNS = ['nox', 'pm10', 'pm2.5', 'so2']

# Separate features (X) and targets (y) for training and test sets
X_train = train_data.drop(columns=TARGET_COLUMNS)
y_train = train_data[TARGET_COLUMNS]

X_test = test_data.drop(columns=TARGET_COLUMNS)
y_test = test_data[TARGET_COLUMNS]

# Replace any missing values with 0 (this applies to the targets as well as the features)
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)
y_train = y_train.fillna(0)
y_test = y_test.fillna(0)

# Check the shape of the training and test data
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


X_train shape: (115180, 95), y_train shape: (115180, 4)
X_test shape: (28796, 95), y_test shape: (28796, 4)
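
Because `fillna(0)` silently replaces missing values, including missing targets, it can help to count them first and to confirm that the training and test frames expose the same columns. The following is a minimal sketch on toy data; `missing_value_report`, `toy_train`, and `toy_test` are hypothetical names introduced for illustration and do not appear in the notebook.

```python
import numpy as np
import pandas as pd


def missing_value_report(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, omitting complete columns."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Toy frames standing in for the notebook's train_data and test_data.
toy_train = pd.DataFrame(
    {
        "Easting": [0.1, np.nan, 0.3],
        "nox": [0.5, 0.2, np.nan],
        "so2": [0.0, 0.1, 0.2],
    }
)
toy_test = pd.DataFrame({"Easting": [0.4], "nox": [0.1], "so2": [0.3]})

# Columns with at least one missing value, and how many each has.
report = missing_value_report(toy_train)
print(report)

# Train and test should expose the same columns in the same order.
columns_match = list(toy_train.columns) == list(toy_test.columns)
print(columns_match)
```

An empty report would mean the `fillna(0)` calls are no-ops; a non-empty one shows which columns are being zero-filled.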


## 3. Training the XGBoost Model

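We will now initialize the XGBoost model and train it on the training data. Because `y_train` has four columns, this fits one multi-output regressor that predicts all four pollutants at once (multi-output targets are supported from XGBoost 1.6 onward).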


In [3]:
# Initialize the XGBoost model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

# Train the model
xgb_model.fit(X_train, y_train)


## 4. Making Predictions and Evaluating the Model

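After training, we will make predictions on the test set and evaluate the model's performance using Mean Squared Error (MSE) and R-squared. With four target columns, scikit-learn's default is to compute each metric per target and report the unweighted average.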


In [4]:
# Make predictions on the test set
y_pred = xgb_model.predict(X_test)

# Evaluate the model using Mean Squared Error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")


Mean Squared Error: 1.030501155992772
R-squared: -4.707204818725586
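
An R-squared below zero for a target means the predictions have a larger squared error than a constant prediction at that target's test-set mean. Since the reported value is an average over four targets, at least one pollutant must be fitted very poorly, and a per-target breakdown shows which. The sketch below uses `multioutput="raw_values"` on synthetic arrays; `y_true` and `y_hat` are invented stand-ins for `y_test` and `y_pred`, with one column deliberately spoiled, so the numbers say nothing about the notebook's model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

target_names = ["nox", "pm10", "pm2.5", "so2"]

# Synthetic stand-ins for y_test and y_pred (NOT the notebook's data).
rng = np.random.default_rng(0)
y_true = rng.normal(size=(200, 4))
y_hat = y_true + rng.normal(scale=0.1, size=(200, 4))

# Deliberately spoil the last target with a constant offset to mimic one bad column.
y_hat[:, 3] = y_true[:, 3] + 5.0

# One value per target instead of a single uniform average.
per_target_mse = mean_squared_error(y_true, y_hat, multioutput="raw_values")
per_target_r2 = r2_score(y_true, y_hat, multioutput="raw_values")

for name, mse_i, r2_i in zip(target_names, per_target_mse, per_target_r2):
    print(f"{name}: MSE={mse_i:.4f}, R2={r2_i:.4f}")

# The default score is the unweighted mean of the per-target scores.
average_r2 = r2_score(y_true, y_hat)
print(f"Average R2: {average_r2:.4f}")
```

In this toy case three targets score close to 1 while the spoiled one is strongly negative, and the average alone would hide that split.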


## 5. Saving the Trained Model

Finally, we will save the trained model for future use.


In [5]:
# Save the trained XGBoost model to the models folder in JSON format
MODEL_SAVE_PATH = '../../models/xgboost_model.json'
xgb_model.save_model(MODEL_SAVE_PATH)

print(f"Model saved to: {MODEL_SAVE_PATH}")


Model saved to: ../../models/xgboost_model.json
