# Air Quality Prediction

## 1. Introduction
This notebook analyzes the Air Quality dataset from the UCI Machine Learning Repository. The goal is to predict the concentration of Benzene (C6H6(GT)) based on other sensor readings and environmental factors. The project involves data loading, cleaning, exploratory data analysis, and building a regression model.

## 2. Data Loading and Initial Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = 'AirQualityUCI.csv'
df = pd.read_csv(file_path, sep=';', decimal=',', na_values=-200)

# Display the first few rows
df.head()

In [None]:
# Get a summary of the dataframe
df.info()

The initial inspection reveals that the dataset has 9471 entries. There are two unnamed columns at the end that are completely null, which should be dropped. Several columns have missing values that need to be addressed. The `Date` and `Time` columns need to be combined into a datetime index.
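
The empty columns can also be found programmatically rather than by reading the `df.info()` output. The following is a minimal sketch on a toy frame, not the notebook's actual data: the frame `raw` and its values are made up for illustration, and the check for fully empty rows is an extra precaution that assumes the file might also contain blank lines.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw CSV (hypothetical values):
# one column and one row contain nothing but missing values.
raw = pd.DataFrame({
    "Date": ["10/03/2004", "10/03/2004", None],
    "CO(GT)": [2.6, 2.0, np.nan],
    "Unnamed: 2": [np.nan, np.nan, np.nan],
})

# Names of columns in which every value is missing
empty_cols = raw.columns[raw.isna().all(axis=0)].tolist()

# Number of rows in which every value is missing
n_empty_rows = int(raw.isna().all(axis=1).sum())

# Drop both kinds of empty structure
cleaned = raw.drop(columns=empty_cols).dropna(how="all")

print("Empty columns:", empty_cols)
print("Empty rows:", n_empty_rows)
print("Shape after cleaning:", cleaned.shape)
```

Detecting the columns by content avoids hardcoding names such as `Unnamed: 15`, which depend on how many columns the parser finds.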

## 3. Data Cleaning and Preprocessing

In [None]:
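# Drop the empty trailing columns
df.drop(columns=['Unnamed: 15', 'Unnamed: 16'], inplace=True)

# Drop any rows that are entirely empty (e.g. blank trailing lines in the CSV),
# so they do not end up with a missing timestamp in the index
df.dropna(how='all', inplace=True)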

# Combine Date and Time into a single datetime column
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H.%M.%S')

# Set the new DateTime column as the index
df.set_index('DateTime', inplace=True)

# Drop the original Date and Time columns
df.drop(columns=['Date', 'Time'], inplace=True)

# Display the dataframe info again to see the changes
df.info()

### Handling Missing Values
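Now, let's address the missing values. A common strategy for time series data is forward fill (`ffill`), which propagates the last valid observation, or backward fill (`bfill`), which propagates the next one. Air quality readings change gradually from hour to hour, so carrying the last reading forward is a reasonable approximation for short gaps. Forward fill is used here because it relies only on past values, whereas backward fill would copy future readings into earlier rows.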
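
To make the behaviour of forward fill concrete, here is a small sketch on a made-up hourly series (the values and the name `readings` are illustrative only). It also shows the optional `limit` argument, which caps how many consecutive gaps are filled; whether a cap is appropriate for this dataset is a judgment call that the notebook itself does not make.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with a 4-hour gap and a 1-hour gap
idx = pd.date_range("2004-03-10 18:00", periods=8, freq=pd.Timedelta(hours=1))
readings = pd.Series(
    [5.0, np.nan, np.nan, np.nan, np.nan, 7.0, np.nan, 8.0], index=idx
)

# Unlimited forward fill: every gap is filled with the last valid value
filled_all = readings.ffill()

# Capped forward fill: at most 2 consecutive missing hours are filled
filled_capped = readings.ffill(limit=2)

# Count the values that are still missing after the capped fill
remaining = int(filled_capped.isna().sum())

print(filled_capped)
print("Still missing after capped fill:", remaining)
```

With a cap, long sensor outages stay visible as missing values instead of being replaced by a single stale reading repeated for many hours.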

In [None]:
# Check the percentage of missing values in each column
missing_percentage = df.isnull().sum() / len(df) * 100
print(missing_percentage)

# The NMHC(GT) column has a very high percentage of missing values, so we'll drop it.
df.drop(columns=['NMHC(GT)'], inplace=True)

# For the other columns, forward fill the missing values.
# (fillna(method='ffill') is deprecated in recent pandas versions; ffill() is the equivalent.)
df.ffill(inplace=True)

# Verify that there are no more missing values
df.isnull().sum()

## 4. Exploratory Data Analysis (EDA)

In [None]:
# Plot the time series of our target variable, Benzene concentration C6H6(GT)
plt.figure(figsize=(15, 6))
df['C6H6(GT)'].plot()
plt.title('Hourly Benzene Concentration (C6H6(GT))')
plt.xlabel('Date')
plt.ylabel('Concentration (microg/m^3)')
plt.grid(True)
plt.show()

In [None]:
# Plot the correlation matrix to see relationships between variables
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Air Quality Variables')
plt.show()

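The correlation matrix shows strong positive correlations between Benzene `C6H6(GT)` and several other sensor readings, particularly `PT08.S2(NMHC)`, `PT08.S1(CO)`, and `CO(GT)`. This suggests that these features carry much of the signal the predictive model will need. Note that `df.corr()` measures linear (Pearson) association only, so a weak correlation does not rule out a nonlinear relationship.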
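
Reading one column of a large heatmap can be error-prone, so it may help to rank the features by their correlation with the target directly. The sketch below uses synthetic data; the column names `target`, `strong_sensor`, `weak_sensor`, and `noise` are hypothetical stand-ins, not columns of the Air Quality dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature tracks the target closely, one weakly, one not at all
rng = np.random.default_rng(0)
n = 500
target = rng.gamma(shape=2.0, scale=5.0, size=n)
toy = pd.DataFrame({
    "target": target,
    "strong_sensor": 3.0 * target + rng.normal(0, 1.0, n),
    "weak_sensor": 0.2 * target + rng.normal(0, 10.0, n),
    "noise": rng.normal(0, 1.0, n),
})

# Take the target's column of the correlation matrix, drop its
# self-correlation, and sort by absolute strength
ranked = (
    toy.corr()["target"]
    .drop("target")
    .sort_values(key=np.abs, ascending=False)
)

print(ranked)
```

In the notebook, the same pattern would apply with `df.corr()['C6H6(GT)']` in place of the toy frame.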

## 5. Feature Engineering

In [None]:
# Create time-based features
df['Hour'] = df.index.hour
df['DayOfWeek'] = df.index.dayofweek
df['Month'] = df.index.month

# Create lag features for the target variable
df['C6H6_lag1'] = df['C6H6(GT)'].shift(1)
df['C6H6_lag24'] = df['C6H6(GT)'].shift(24)

# Drop rows with NaN values created by the shift operation
df.dropna(inplace=True)

df.head()

## 6. Model Building and Training

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

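# Define features (X) and target (y).
# The target 'C6H6(GT)' must not appear in the feature list; including it
# would leak the answer into the model and inflate every metric.
features = [
    'CO(GT)', 'PT08.S1(CO)', 'PT08.S2(NMHC)', 'NOx(GT)',
    'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH', 'AH',
    'Hour', 'DayOfWeek', 'Month', 'C6H6_lag1', 'C6H6_lag24'
]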
target = 'C6H6(GT)'

X = df[features]
y = df[target]

# Since this is time series data, split chronologically (no shuffling):
# the first 80% of rows train the model, the last 20% test it
split_point = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_point], X.iloc[split_point:]
y_train, y_test = y.iloc[:split_point], y.iloc[split_point:]

# Initialize and train the Random Forest Regressor model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

## 7. Model Evaluation

In [None]:
# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'R-squared (R2): {r2:.2f}')

In [None]:
# Visualize the predictions vs. actual values
plt.figure(figsize=(15, 6))
plt.plot(y_test.index, y_test, label='Actual Values')
plt.plot(y_test.index, y_pred, label='Predicted Values', alpha=0.7)
plt.title('Benzene Concentration: Actual vs. Predicted')
plt.xlabel('Date')
plt.ylabel('Concentration (microg/m^3)')
plt.legend()
plt.grid(True)
plt.show()

## 8. Conclusion
In the run reported here, the Random Forest Regressor achieved an R-squared value close to 1.0 together with low MAE and MSE, and the plot shows the predicted values closely following the actual values on the held-out final 20% of the timeline. Scores this high should be read with care, though. They are only meaningful if the target `C6H6(GT)` is not itself among the input features, and much of the signal may come from `C6H6_lag1`, since the concentration one hour earlier is already a strong predictor of the current value. Comparing the model against a naive baseline that simply repeats the previous hour's reading would show how much it adds beyond that.
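
A persistence baseline of this kind can be sketched as follows. The example runs on a synthetic hourly series with a daily cycle, not on the Air Quality data, and the names `series`, `frame`, `baseline_mae`, and `baseline_r2` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic hourly series: a daily cycle plus noise (60 days)
rng = np.random.default_rng(42)
hours = np.arange(24 * 60)
daily_cycle = 10 + 5 * np.sin(2 * np.pi * hours / 24)
series = pd.Series(daily_cycle + rng.normal(0, 1.0, hours.size), name="target")

# The persistence "prediction" for each hour is the previous hour's value
frame = pd.DataFrame({"target": series, "lag1": series.shift(1)}).dropna()

# Same chronological 80/20 split as used for the model
split_point = int(len(frame) * 0.8)
test = frame.iloc[split_point:]

# Score the baseline with the same metrics as the model
baseline_mae = mean_absolute_error(test["target"], test["lag1"])
baseline_r2 = r2_score(test["target"], test["lag1"])

print(f"Persistence baseline MAE: {baseline_mae:.2f}")
print(f"Persistence baseline R2:  {baseline_r2:.2f}")
```

In the notebook, the equivalent baseline would score `X_test['C6H6_lag1']` against `y_test`. A model is only adding value to the extent that its MAE is clearly lower, and its R-squared clearly higher, than these baseline figures.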