# Artificial Intelligence and Machine Learning - Coursework 1 - 1st diet
## Air quality dataset
# Student Name: 
# Student Email:

I confirm that the material contained within the submitted coursework is all my own work unless otherwise stated below.

---

## 1. Introduction and problem definition
This assignment uses a dataset of 9358 hourly-averaged responses from five metal-oxide chemical sensors embedded in an air quality chemical multi-sensor device. The device was installed at street level in a heavily polluted area of an Italian city and recorded data from March 2004 to February 2005. Alongside the raw sensor responses, the dataset contains hourly-averaged ground-truth concentrations for carbon monoxide (CO), nitrogen dioxide (NO2), benzene, total nitrogen oxides (NOx), and non-methane hydrocarbons (NMHC). The data also shows evidence of cross-sensitivities, concept drift, and sensor drift, all of which can degrade the sensors' ability to estimate concentrations.
  
## 1.1 Classification or regression
Both tasks are regression problems. A regression problem aims to predict a continuous numerical value, such as the concentration of carbon monoxide (CO) in mg/m^3. Task 1 estimates the CO concentration from sensor data, time of day, day of the week, and possibly humidity and temperature. Regression is appropriate whenever the target variable is continuous, because it models the relationship between the input features and that target.

In Task 2, an Air Quality Index (AQI) is defined and machine learning is used to predict it. AQIs are often reported as categorical bands, but here the index is defined by aggregating continuous gas readings, so it is itself continuous and Task 2 is also treated as regression.

## Task 1: Predicting CO concentration using regression


Problem statement: Build a regression model that reliably predicts the concentration of carbon monoxide (CO) in mg/m^3 from the hourly-averaged raw sensor values (in particular PT08.S1(CO)), the day of the week, the time, and other parameters such as temperature and humidity. The CO(GT) variable, measured by a co-located certified reference analyzer, serves as the ground truth.

## Task 2: Predicting the defined Air Quality Index using regression

Problem statement: Combine ground-truth measurements of several gases into a custom Air Quality Index (AQI). Then train a regression model to predict that AQI from raw sensor data and other relevant columns, without using the ground-truth columns as predictors.

These problem statements frame both tasks as regression: each target is a continuous numerical value, which determines the choice of regression models and evaluation metrics used in the rest of the notebook.


## 2. Data ingestion

## Loading the Dataset:

The dataset was loaded into a Pandas DataFrame for easy manipulation and analysis.

## Initial Exploration:
An initial exploration of the dataset was conducted to understand its structure, features, and contents.

## Column Selection:
Columns deemed relevant for the tasks were identified based on the analysis requirements.

## Handling Missing Values:
Placeholder values (-200) were replaced with NaN to accurately represent missing or unknown data.
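
As a minimal sketch of this step, the snippet below replaces the -200 placeholder with NaN on a small hand-made series and then applies the forward fill that the notebook uses later. The series `co` and its values are hypothetical; only the -200 convention comes from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor column using -200 as the missing-value placeholder
co = pd.Series([2.6, -200, -200, 1.6, -200], name="CO(GT)")

# Replace the placeholder with NaN so pandas treats it as missing
co_nan = co.replace(-200, np.nan)
print("Missing readings:", int(co_nan.isna().sum()))

# Forward fill: carry the last valid reading forward over each gap
co_filled = co_nan.ffill()
print(co_filled.tolist())
```

Counting the NaN values before filling shows how much of each column is imputed, which helps judge whether forward fill is acceptable for that column.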

## Date and Time Processing:
The 'Time' column was converted to a datetime format, and features such as day of the week and hour of the day were derived from it.

## Peak and Valley Time Identification:
Additional features were created to identify peak and valley hours based on the time of day.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Specify the path to the CSV file
file_path = "AirQuality.csv"

# Load the data into a Pandas DataFrame
air_quality_data = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print("Preview of the Data:")
print(air_quality_data.head())

# Display information about the data types and missing values
print("\nData Types and Missing Values:")
print(air_quality_data.info())

# Display basic statistics for each numerical field
print("\nStatistical Summary:")
print(air_quality_data.describe())


Preview of the Data:
         Date      Time  CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  \
0  10/03/2004  18:00:00     2.6         1360       150      11.9   
1  10/03/2004  19:00:00     2.0         1292       112       9.4   
2  10/03/2004  20:00:00     2.2         1402        88       9.0   
3  10/03/2004  21:00:00     2.2         1376        80       9.2   
4  10/03/2004  22:00:00     1.6         1272        51       6.5   

   PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  NO2(GT)  PT08.S4(NO2)  PT08.S5(O3)  \
0           1046      166          1056      113          1692         1268   
1            955      103          1174       92          1559          972   
2            939      131          1140      114          1555         1074   
3            948      172          1092      122          1584         1203   
4            836      131          1205      116          1490         1110   

   T(C)    RH      AH  Unnamed: 15  Unnamed: 16  
0  13.6  48.9  0.7578          NaN          N

## 3. Data preparation (common to both tasks)
Column management:

- Removed the 'Unnamed: 15' and 'Unnamed: 16' columns, which are irrelevant for both tasks.

Handling missing values:

- Replaced placeholder values (-200) with NaN.
- Filled missing values using forward fill.

Date and time processing:

- Converted 'Time' to datetime and extracted the day of the week.
- Created 'HourOfDay' from 'Time' for time-based analysis.

Peak and valley time identification:

- Marked peak hours (8 to 11 and 18 to 21) and valley hours (2 to 5), that is, the intervals 08:00-11:59, 18:00-21:59, and 02:00-05:59.
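
The date and time steps listed above can be sketched on a small hand-made frame. This is an illustration under stated assumptions, not the notebook's exact code: it assumes 'Date' is in day/month/year format and 'Time' is HH:MM:SS, as in the data preview, and the frame `sample` and the column name `Timestamp` are hypothetical. Combining the two columns before parsing ties each row to its own calendar date, so the weekday reflects that date.

```python
import pandas as pd

# Hypothetical three-row sample in the same layout as the dataset preview
sample = pd.DataFrame({
    "Date": ["10/03/2004", "11/03/2004", "13/03/2004"],
    "Time": ["18:00:00", "03:00:00", "09:00:00"],
})

# Combine date and time, then parse with an explicit day-first format
sample["Timestamp"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"], format="%d/%m/%Y %H:%M:%S"
)

# Derive the calendar features
sample["DayOfWeek"] = sample["Timestamp"].dt.day_name()
sample["HourOfDay"] = sample["Timestamp"].dt.hour

# Peak hours: 08:00-11:59 and 18:00-21:59; valley hours: 02:00-05:59
hour = sample["HourOfDay"]
sample["PeakTime"] = hour.between(8, 11) | hour.between(18, 21)
sample["ValleyTime"] = hour.between(2, 5)

print(sample[["Timestamp", "DayOfWeek", "HourOfDay", "PeakTime", "ValleyTime"]])
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates such as 10/03/2004.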

In [2]:
# Drop the two unnamed columns, which are not needed for either task
columns_to_drop = ['Unnamed: 15', 'Unnamed: 16']
air_quality_data.drop(columns=columns_to_drop, inplace=True)

# Replace the -200 placeholder with NaN in the entire DataFrame
air_quality_data.replace(-200, np.nan, inplace=True)

# Forward fill: each missing value takes the last valid reading above it
air_quality_data.ffill(inplace=True)

# Convert 'Time' to datetime type.
# Note: 'Time' holds only a clock time, so the parsed datetime carries a
# placeholder date rather than the row's own 'Date' value.
air_quality_data['Time'] = pd.to_datetime(air_quality_data['Time'])

# Create a new attribute for the day of the week.
# Note: this is derived from the placeholder date above, so it is the same
# for every row; the true weekday would have to come from the 'Date' column.
air_quality_data['DayOfWeek'] = air_quality_data['Time'].dt.day_name()

# Create a new field for the hour of the day
air_quality_data['HourOfDay'] = air_quality_data['Time'].dt.hour

# Create a new field for peak time: hours 8-11 and 18-21
hour = air_quality_data['HourOfDay']
air_quality_data['PeakTime'] = ((hour >= 8) & (hour < 12)) | ((hour >= 18) & (hour < 22))

# Create a new field for valley time: hours 2-5
air_quality_data['ValleyTime'] = (hour >= 2) & (hour < 6)

print("\nData Types and Missing Values:")
air_quality_data.info()  # info() prints its report directly
print(air_quality_data.describe())
print(air_quality_data['Date'])



Data Types and Missing Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9357 entries, 0 to 9356
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           9357 non-null   object        
 1   Time           9357 non-null   datetime64[ns]
 2   CO(GT)         9357 non-null   float64       
 3   PT08.S1(CO)    9357 non-null   float64       
 4   NMHC(GT)       9357 non-null   float64       
 5   C6H6(GT)       9357 non-null   float64       
 6   PT08.S2(NMHC)  9357 non-null   float64       
 7   NOx(GT)        9357 non-null   float64       
 8   PT08.S3(NOx)   9357 non-null   float64       
 9   NO2(GT)        9357 non-null   float64       
 10  PT08.S4(NO2)   9357 non-null   float64       
 11  PT08.S5(O3)    9357 non-null   float64       
 12  T(C)           9357 non-null   float64       
 13  RH             9357 non-null   float64       
 14  AH             9357 non-null   float64  

# TASK 1: CO concentration prediction
Predict the CO concentration (in mg/m3) based on, at least, the PT08.S1(CO) raw sensor readings, day of the week and time.

Maybe temperature and humidity can play a role as well?

Use CO(GT) as the ground truth.

## 4. Further Data preparation (specific for this task)
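## Data segregation
The Task 1 columns are selected from the prepared DataFrame, and any remaining missing values are forward-filled. Forward fill suits an hourly time series, where consecutive readings tend to be similar, but a long gap is filled with a single stale value, so the impact of the imputation on the analysis should be assessed. The data is then split at random into training (80%) and test (20%) sets.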

## Justification for Data Binning:
Simplification: Binning simplifies complex continuous data, making it more manageable.

Non-Linearity Handling: Useful for capturing non-linear patterns, enhancing model performance.

Outlier Mitigation: Helps mitigate the impact of outliers by categorizing values into intervals.

Interpretability: Binned data enhances interpretability, aiding communication with non-technical stakeholders.

## Potential Application:

Air Quality Index (AQI): Binning pollutant concentrations for AQI creation facilitates clear communication of air quality levels.

Feature Engineering: Binning features like temperature or humidity improves model understanding.

Time-of-Day Patterns: Binning time into categories reveals air quality fluctuations throughout the day.

Model Compatibility: Binning suits models like decision trees, enhancing interpretability.
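
As a sketch of the time-of-day binning mentioned above, the snippet below groups hour values into four named periods with `pandas.cut`. The bin edges and labels are illustrative assumptions, not values taken from this notebook, and the `hours` series is hypothetical.

```python
import pandas as pd

# Hypothetical hour-of-day values (0-23)
hours = pd.Series([0, 3, 7, 9, 13, 19, 23], name="HourOfDay")

# Bin edges and labels are illustrative assumptions
edges = [0, 6, 12, 18, 24]
labels = ["night", "morning", "afternoon", "evening"]

# right=False makes each bin [left, right), so hour 6 falls in "morning"
time_of_day = pd.cut(hours, bins=edges, labels=labels, right=False)

print(pd.DataFrame({"HourOfDay": hours, "TimeOfDay": time_of_day}))
```

The result is a categorical column, which would need one-hot encoding before being passed to a linear model.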

In [3]:


def preprocess_data(df):
    """Select the Task 1 columns, encode them, and split into train/test sets."""
    # Columns to keep for Task 1
    columns_to_keep = ['PT08.S1(CO)', 'DayOfWeek', 'Time', 'T(C)', 'RH', 'CO(GT)']
    air_quality_task1 = df[columns_to_keep].copy()  # Copy to avoid SettingWithCopyWarning

    # 'Time' must already be a datetime column so the hour can be extracted
    if not pd.api.types.is_datetime64_any_dtype(air_quality_task1['Time']):
        raise ValueError("'Time' column is not in datetime format.")

    # Feature engineering: extract hour from 'Time'
    air_quality_task1['HourOfDay'] = air_quality_task1['Time'].dt.hour

    # Fill any remaining missing values with forward fill
    air_quality_task1.ffill(inplace=True)

    # One-hot encode the categorical feature 'DayOfWeek'
    air_quality_task1 = pd.get_dummies(air_quality_task1, columns=['DayOfWeek'])

    # Random 80/20 split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        air_quality_task1.drop('CO(GT)', axis=1),  # Features: everything except the target
        air_quality_task1['CO(GT)'],               # Target: ground-truth CO
        test_size=0.2,
        shuffle=True,
        random_state=42
    )
    return X_train, X_test, y_train, y_test


# Example usage
X_train_task1, X_test_task1, y_train_task1, y_test_task1 = preprocess_data(air_quality_data)


## 5. Model definition and training
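Task 1: CO Concentration Prediction

- Selected relevant features: 'PT08.S1(CO)', 'DayOfWeek', 'HourOfDay', 'T(C)', 'RH', plus 'Time' converted to a numeric value.
- Handled missing values using the forward fill method.
- Encoded the categorical feature 'DayOfWeek' using one-hot encoding.
- Extracted the hour from the 'Time' column.
- Split the data into training and testing sets using train_test_split.
- Fitted a LinearRegression model as the baseline, then ran a grid search (GridSearchCV, 5-fold cross-validation) over 'fit_intercept' and 'copy_X'.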
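
The one-hot encoding step listed above can be sketched as follows. The `train` and `test` frames are hypothetical; the point of the example is that encoding the two sets separately can produce different columns when a category is absent from one of them, and that `reindex` is one way to align them.

```python
import pandas as pd

# Hypothetical train and test frames; the test frame lacks "Tuesday"
train = pd.DataFrame({"DayOfWeek": ["Monday", "Tuesday", "Monday"]})
test = pd.DataFrame({"DayOfWeek": ["Monday", "Monday"]})

# One-hot encode each frame separately
train_encoded = pd.get_dummies(train, columns=["DayOfWeek"])
test_encoded = pd.get_dummies(test, columns=["DayOfWeek"])

# Align the test columns to the training columns, filling absent categories with 0
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(train_encoded.columns.tolist())
print(test_encoded)
```

Encoding before the train/test split, as `preprocess_data` does for Task 1, avoids the mismatch altogether.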

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# 'Time' is a datetime column in X_train_task1
X_train_task1['Time_numeric'] = pd.to_numeric(X_train_task1['Time'])

# Drop the original datetime column
X_train_task1.drop('Time', axis=1, inplace=True)

model_task1 = LinearRegression()
model_task1.fit(X_train_task1, y_train_task1)

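# Example of automatic hyperparameter optimization (Grid Search).
# Note: copy_X only controls whether X is copied before fitting, so
# fit_intercept is the only setting here that can change the fitted model.
param_grid_task1 = {'copy_X': [True, False], 'fit_intercept': [True, False]}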
grid_search_task1 = GridSearchCV(model_task1, param_grid_task1, cv=5)
grid_search_task1.fit(X_train_task1, y_train_task1)

# Access the best parameters
best_params_task1 = grid_search_task1.best_params_
optimized_model_task1 = grid_search_task1.best_estimator_

# Preprocess the test data in the same way as the training data
X_test_task1['Time_numeric'] = pd.to_numeric(X_test_task1['Time'])
X_test_task1.drop('Time', axis=1, inplace=True)


## 6. Model evaluation
For Task 1, which involves predicting CO concentration, the regression metrics used are Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared (R2). Each is outlined below, with the reason it suits this task.

## Mean Absolute Error (MAE):

- Definition: MAE is the average absolute difference between the predicted and actual values.
- Suitability: MAE is in the same units as the target (mg/m^3), so it reads directly as the average prediction error. It is less sensitive to outliers than MSE.

## Mean Squared Error (MSE):

- Definition: MSE is the average squared difference between the predicted and actual values.
- Suitability: MSE gives more weight to large errors. This can be important in air quality prediction, for example when pollution peaks must not be badly underestimated.

## R-squared (R2):

- Definition: R2 is the proportion of the variance in the target variable that is predictable from the input features.
- Suitability: R2 indicates how well the model explains the variability in the data. A higher R2 indicates a better fit, with 1 being a perfect fit.
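
To make the three definitions concrete, here is a small worked example that computes them by hand in plain Python. The values in `y_true` and `y_pred` are hypothetical and chosen so the arithmetic is easy to follow; the notebook itself uses the scikit-learn functions instead.

```python
# Hypothetical true and predicted CO concentrations in mg/m^3
y_true = [2.0, 1.0, 3.0, 2.0]
y_pred = [2.5, 1.0, 2.0, 2.5]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: mean of the absolute errors
mae = sum(abs(e) for e in errors) / n

# MSE: mean of the squared errors
mse = sum(e ** 2 for e in errors) / n

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f}  MSE={mse:.3f}  R2={r2:.3f}")
# prints: MAE=0.500  MSE=0.375  R2=0.250
```

The single error of 1.0 contributes 1.0 of the 1.5 total squared error but only half of the total absolute error, which illustrates why MSE is more sensitive to large errors than MAE.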

In [5]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predictions from the baseline model
y_pred_baseline_task1 = model_task1.predict(X_test_task1)

# Predictions from the optimized model
y_pred_optimized_task1 = optimized_model_task1.predict(X_test_task1)

# Calculate metrics for the baseline model
mae_baseline_task1 = mean_absolute_error(y_test_task1, y_pred_baseline_task1)
mse_baseline_task1 = mean_squared_error(y_test_task1, y_pred_baseline_task1)
r2_baseline_task1 = r2_score(y_test_task1, y_pred_baseline_task1)

# Calculate metrics for the optimized model
mae_optimized_task1 = mean_absolute_error(y_test_task1, y_pred_optimized_task1)
mse_optimized_task1 = mean_squared_error(y_test_task1, y_pred_optimized_task1)
r2_optimized_task1 = r2_score(y_test_task1, y_pred_optimized_task1)

# Display the results
print("Baseline Model Metrics:")
print(f"MAE: {mae_baseline_task1}")
print(f"MSE: {mse_baseline_task1}")
print(f"R2: {r2_baseline_task1}")
print("\nOptimized Model Metrics:")
print(f"MAE: {mae_optimized_task1}")
print(f"MSE: {mse_optimized_task1}")
print(f"R2: {r2_optimized_task1}")


Baseline Model Metrics:
MAE: 0.6240633812936045
MSE: 0.9295436150886374
R2: 0.5966987309455807

Optimized Model Metrics:
MAE: 0.6240633812936045
MSE: 0.9295436150886374
R2: 0.5966987309455807


# TASK 2: Air Quality Index creation and prediction
Define an Air Quality Index (based on adequate literature) by combining the ground-truth readings of several gases.

Then, use ML to predict your Air Quality Index from several raw sensor readings and other columns of interest (obviously without using the ground truth column).

## 4. Further Data preparation (specific for this task)

## Air Quality Index (AQI) Calculation:

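A new column named 'Personal_AQI' was created by summing the NO2(GT) and CO(GT) ground-truth readings and the PT08.S5(O3) sensor response, which stands in for ozone because the dataset has no ground-truth O3 column. This is a simple additive formula with no unit conversion or weighting.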

## Column Selection for Task 2:
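The columns used to build the index ('NO2(GT)', 'CO(GT)', and 'PT08.S5(O3)') were dropped from the features so that the model cannot read the components of the target directly. The remaining columns, including the other ground-truth columns (NMHC(GT), C6H6(GT), and NOx(GT)), are kept as features.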

## Handling Missing Values:
Missing values in the dataset were addressed using a forward-fill method, ensuring that AQI-related features are populated appropriately.

## Data Splitting:
The dataset was split into features (X_task2) and the target variable (y_task2) for subsequent model training and evaluation.
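
A plain sum of readings in different units lets the component with the largest numeric range dominate the index. One common alternative, sketched below, scales each pollutant by a reference concentration and takes the worst sub-index per row. This is not the formula used in this notebook: the `reference` values, the `readings` frame, and the column name `Example_AQI` are hypothetical and are not regulatory limits.

```python
import pandas as pd

# Hypothetical ground-truth readings; column names follow the dataset
readings = pd.DataFrame({
    "CO(GT)": [2.6, 1.0, 6.0],         # mg/m^3
    "NO2(GT)": [113.0, 60.0, 250.0],   # microg/m^3
})

# Illustrative reference concentrations (assumptions, not regulatory limits)
reference = {"CO(GT)": 10.0, "NO2(GT)": 200.0}

# Sub-index: concentration as a percentage of its reference value
sub_indices = pd.DataFrame(
    {col: 100.0 * readings[col] / ref for col, ref in reference.items()}
)

# Overall index: the worst (largest) sub-index in each row
readings["Example_AQI"] = sub_indices.max(axis=1)

print(readings)
```

With this construction a value of 100 means that at least one pollutant has reached its reference concentration, which makes the index easier to interpret than a raw sum.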

In [6]:
# Simple additive index: NO2 and CO ground truth plus the raw O3 sensor response
air_quality_data['Personal_AQI'] = (
    air_quality_data['NO2(GT)'] + air_quality_data['CO(GT)'] + air_quality_data['PT08.S5(O3)']
)

# Drop the columns that make up the index so they cannot be used as features
columns_to_drop_task2 = ['NO2(GT)', 'CO(GT)', 'PT08.S5(O3)']
air_quality_data_task2 = air_quality_data.drop(columns=columns_to_drop_task2)

# Forward fill any remaining missing values
air_quality_data_task2.ffill(inplace=True)

# Separate the features from the target, then split 80/20 into train and test sets
X_task2 = air_quality_data_task2.drop(columns=['Personal_AQI'])
y_task2 = air_quality_data_task2['Personal_AQI']
X_train_task2, X_test_task2, y_train_task2, y_test_task2 = train_test_split(X_task2, y_task2, test_size=0.2, random_state=42)



# Display the first few rows of the DataFrame
print("Preview of the Data:")
print(air_quality_data_task2.head())

# Display information about the data types and missing values
print("\nData Types and Missing Values:")
print(air_quality_data_task2.info())

# Display basic statistics for each numerical field
print("\nStatistical Summary:")
print(air_quality_data_task2.describe())

Preview of the Data:
         Date                Time  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  \
0  10/03/2004 2023-12-10 18:00:00       1360.0     150.0      11.9   
1  10/03/2004 2023-12-10 19:00:00       1292.0     112.0       9.4   
2  10/03/2004 2023-12-10 20:00:00       1402.0      88.0       9.0   
3  10/03/2004 2023-12-10 21:00:00       1376.0      80.0       9.2   
4  10/03/2004 2023-12-10 22:00:00       1272.0      51.0       6.5   

   PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  PT08.S4(NO2)  T(C)    RH      AH  \
0         1046.0    166.0        1056.0        1692.0  13.6  48.9  0.7578   
1          955.0    103.0        1174.0        1559.0  13.3  47.7  0.7255   
2          939.0    131.0        1140.0        1555.0  11.9  54.0  0.7502   
3          948.0    172.0        1092.0        1584.0  11.0  60.0  0.7867   
4          836.0    131.0        1205.0        1490.0  11.2  59.6  0.7888   

  DayOfWeek  HourOfDay  PeakTime  ValleyTime  Personal_AQI  
0    Sunday         18      True  

## 5. Model definition and training
Regression model selection:

- Model: Random Forest Regression.
- Justification: Random Forest Regression can capture non-linear patterns and interactions between features, which makes it a reasonable choice for predicting a composite index such as the Air Quality Index.

Baseline model application:

- Fit a baseline Random Forest Regression model with default hyperparameters on the training data.

Hyperparameter optimization:

- Use automatic hyperparameter optimization (Randomized Search) to look for better parameters.

In [7]:
from sklearn.ensemble import RandomForestRegressor


def prepare_features_task2(X):
    """Turn the raw Task 2 feature frame into an all-numeric frame."""
    X = X.copy()  # Work on a copy so the split frames are not modified in place

    # Represent the 'Time' datetime as an integer (nanoseconds since the epoch)
    X['Time_numeric'] = pd.to_numeric(X['Time'])

    # Parse 'Date' (day/month/year) and extract the calendar features
    date = pd.to_datetime(X['Date'], dayfirst=True)
    X['Day'] = date.dt.day
    X['Month'] = date.dt.month
    X['Year'] = date.dt.year

    # Drop the original 'Time' and 'Date' columns, then one-hot encode 'DayOfWeek'
    X = X.drop(columns=['Time', 'Date'])
    return pd.get_dummies(X, columns=['DayOfWeek'])


# Apply the same preprocessing to the training and test features
X_train_task2 = prepare_features_task2(X_train_task2)
X_test_task2 = prepare_features_task2(X_test_task2)

# Make sure the test frame has exactly the training columns, in the same order
X_test_task2 = X_test_task2.reindex(columns=X_train_task2.columns, fill_value=0)

# Baseline model with default hyperparameters
model_task2 = RandomForestRegressor()
model_task2.fit(X_train_task2, y_train_task2)






## 6. Model evaluation
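The baseline and optimized Task 2 models are evaluated on the held-out test set with the same metrics as in Task 1: MAE, MSE, and R2. The cell below first runs the randomized hyperparameter search and then reports the metrics for both models.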

In [8]:

# Example of automatic hyperparameter optimization (Randomized Search)
from sklearn.model_selection import RandomizedSearchCV

param_dist_task2 = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

random_search_task2 = RandomizedSearchCV(model_task2, param_dist_task2, n_iter=10, cv=5)
random_search_task2.fit(X_train_task2, y_train_task2)

# Access the best parameters
best_params_task2 = random_search_task2.best_params_
optimized_model_task2 = random_search_task2.best_estimator_


# Baseline model evaluation
y_pred_baseline = model_task2.predict(X_test_task2)
mae_baseline = mean_absolute_error(y_test_task2, y_pred_baseline)
mse_baseline = mean_squared_error(y_test_task2, y_pred_baseline)
r2_baseline = r2_score(y_test_task2, y_pred_baseline)

# Optimized model evaluation
y_pred_optimized = optimized_model_task2.predict(X_test_task2)
mae_optimized = mean_absolute_error(y_test_task2, y_pred_optimized)
mse_optimized = mean_squared_error(y_test_task2, y_pred_optimized)
r2_optimized = r2_score(y_test_task2, y_pred_optimized)


# Print or use these metrics for further analysis
print("Baseline Model Metrics:")
print(f"MAE: {mae_baseline}")
print(f"MSE: {mse_baseline}")
print(f"R2: {r2_baseline}")


print("\nOptimized Model Metrics:")
print(f"MAE: {mae_optimized}")
print(f"MSE: {mse_optimized}")
print(f"R2: {r2_optimized}")

Baseline Model Metrics:
MAE: 65.59472329059828
MSE: 8291.210890362176
R2: 0.9580545288495499

Optimized Model Metrics:
MAE: 66.56750146968066
MSE: 8612.5104823307
R2: 0.9564290650971765


## 7. Conclusions
## Task 1 Conclusion and Interpretation:

Baseline Model:

- Mean Absolute Error (MAE): 0.624
- Mean Squared Error (MSE): 0.930
- R-squared (R2): 0.597

Optimized Model:

- MAE: 0.624
- MSE: 0.930
- R2: 0.597

## Interpretation:

- The baseline and optimized models give identical metrics. The grid search only varied 'fit_intercept' and 'copy_X', and the identical results indicate that it selected settings equivalent to the defaults, so the default parameters are adequate for this model.
- An R2 of approximately 0.60 means the model explains about 60% of the variance in the CO concentration, which is reasonable but leaves room for improvement. The MAE of 0.624 means predictions are off by about 0.62 mg/m^3 on average.

## Suggestions for Improvement:

- Explore more advanced regression models to see whether they capture non-linear patterns that linear regression misses.
- Conduct further feature engineering to identify additional relevant features.
- Tune hyperparameters that affect the fitted model; the current grid leaves little to tune for plain linear regression.

## Task 2 Conclusion and Interpretation:

Baseline Model:

- MAE: 65.595
- MSE: 8291.211
- R2: 0.958

Optimized Model:

- MAE: 66.568
- MSE: 8612.510
- R2: 0.956

## Interpretation:

- The baseline and optimized models are close in performance, and the optimized model is marginally worse on the test set (MAE 66.568 versus 65.595), so the default parameters already produce strong results.
- An R2 of approximately 0.96 means the model explains about 96% of the variance in the index, which indicates a good fit. The MAE and MSE are in the units of the summed index, so they are not comparable with the Task 1 values.

## Suggestions for Improvement:

- Investigate potential outliers or anomalies in the data that could affect model performance.
- Consider other models or ensemble methods for further improvement.
- Evaluate the impact of additional features on the model.


