<a href="https://colab.research.google.com/github/Ambaright/ST-554-Project1/blob/main/Task2/ST554_Project_1_Task_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ST 554 Project 1: Task 2
Programmed by: Amanda Baright

## Introduction

The increasing incidence of respiratory illness and the known carcinogenic risks associated with prolonged exposure to pollutants like benzene (`C6H6(GT)`) have made precise urban air quality monitoring a critical priority for public health and municipal traffic management. Currently, urban monitoring relies on sparse networks of fixed stations equipped with high-precision industrial spectrometers; however, the high cost and significant size of these instruments prevent the deployment of a monitoring mesh dense enough to capture the complex, turbulent diffusion of gases in a city environment. To address this gap, research has shifted toward low-cost gas multi-sensor devices, often termed "electronic noses," which utilize solid-state sensors to provide a more granular view of urban pollution.

The provided report examines data from a 13-month measurement campaign (March 2004 to April 2005) conducted along a high-traffic road in an Italian city. The study's primary objective was to evaluate the feasibility of using these low-cost devices to "densify" existing monitoring networks by comparing their readings against "Ground Truth" (GT) reference data provided by a conventional monitoring station. The dataset includes hourly mean concentrations for several "true" pollutants - `CO`, `NMHC`, `C6H6`, `NOx`, and `NO2` - recorded alongside the responses of five metal oxide chemoresistive sensors (targeted at CO, NMHC, NOx, NO2, and O3) and two sensors for weather-related variables, specifically temperature (`T`), relative humidity (`RH`), and absolute humidity (`AH`).

A central focus of this analysis is the estimation of `C6H6(GT)` (benzene). Notably, the multi-sensor device used in the study did not include a sensor specifically targeted at benzene. Instead, the study aimed to reconstruct benzene levels by employing artificial neural networks to exploit the significant linear correlations that exist between different urban pollutants. For instance, a very strong correlation coefficient of 0.98 was observed between benzene and Non-Methane Hydrocarbons (`NMHC`).

Furthermore, the study investigates the critical role of atmospheric dynamics, as the stability and selectivity of solid-state sensors are heavily influenced by seasonal changes and weather variables. Earlier findings suggest that sensor performance can be impacted by rapid shifts in humidity and low temperatures, which may necessitate periodic re-calibration to account for sensor drift and changing gas mixture ratios in the winter. By conducting an Exploratory Data Analysis (EDA) on the relationships between sensor outputs, weather conditions, and benzene concentrations, this report seeks to understand the effectiveness of "cooperative" sensor fusion in providing reliable, low-cost environmental monitoring.

## Reading in the Data

In this section, the data is read in from [Air Quality Data](https://archive.ics.uci.edu/dataset/360/air+quality) and the features are extracted and stored in a DataFrame named `air`. We then investigate how the data is stored using the `.head()`, `.tail()`, and `.info()` methods. With `.head()` we can see what the first five rows of our data look like, and with `.tail()` we can see what the last five rows look like. With `.info()` we can see the data type and non-null count for each variable.

In [15]:
# Install ucimlrepo if you haven't already
!pip install ucimlrepo

# Import needed packages
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np

# Fetch dataset
air_quality = fetch_ucirepo(id=360)

# Extract the Features
air = air_quality.data.features
print(".head()")
print(air.head())
print("\n" + ".tail()")
print(air.tail())


.head()
        Date      Time  CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  \
0  3/10/2004  18:00:00     2.6         1360       150      11.9   
1  3/10/2004  19:00:00     2.0         1292       112       9.4   
2  3/10/2004  20:00:00     2.2         1402        88       9.0   
3  3/10/2004  21:00:00     2.2         1376        80       9.2   
4  3/10/2004  22:00:00     1.6         1272        51       6.5   

   PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  NO2(GT)  PT08.S4(NO2)  PT08.S5(O3)  \
0           1046      166          1056      113          1692         1268   
1            955      103          1174       92          1559          972   
2            939      131          1140      114          1555         1074   
3            948      172          1092      122          1584         1203   
4            836      131          1205      116          1490         1110   

      T    RH      AH  
0  13.6  48.9  0.7578  
1  13.3  47.7  0.7255  
2  11.9  54.0  0.7502  
3  11.0  60.0  0.7

In [16]:
# Summary of DataFrame
air.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9357 entries, 0 to 9356
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           9357 non-null   object 
 1   Time           9357 non-null   object 
 2   CO(GT)         9357 non-null   float64
 3   PT08.S1(CO)    9357 non-null   int64  
 4   NMHC(GT)       9357 non-null   int64  
 5   C6H6(GT)       9357 non-null   float64
 6   PT08.S2(NMHC)  9357 non-null   int64  
 7   NOx(GT)        9357 non-null   int64  
 8   PT08.S3(NOx)   9357 non-null   int64  
 9   NO2(GT)        9357 non-null   int64  
 10  PT08.S4(NO2)   9357 non-null   int64  
 11  PT08.S5(O3)    9357 non-null   int64  
 12  T              9357 non-null   float64
 13  RH             9357 non-null   float64
 14  AH             9357 non-null   float64
dtypes: float64(5), int64(8), object(2)
memory usage: 1.1+ MB


## Do Basic Data Validation

When examining data, it is best practice to look at quick summary statistics of all the data to check that things make sense. We can do this using the `.describe()` method on the DataFrame. This will produce the count, mean, standard deviation, min, 25th percentile, median, 75th percentile, and max for each numeric variable.

According to the variable information provided with the dataset, missing values are tagged with the value `-200`. This explains the common min of -200. Once we convert these missing values to `NaN`, we can rerun this summary for a second data validation.

Additionally, the max values for several variables are quite high compared to the 75th percentile, which suggests potential outliers. However, there are no alarming values that would indicate a data entry error.

In [17]:
air.describe()

Unnamed: 0,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
count,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0
mean,-34.207524,1048.990061,-159.090093,1.865683,894.595276,168.616971,794.990168,58.148873,1391.479641,975.072032,9.778305,39.48538,-6.837604
std,77.65717,329.83271,139.789093,41.380206,342.333252,257.433866,321.993552,126.940455,467.210125,456.938184,43.203623,51.216145,38.97667
min,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0
25%,0.6,921.0,-200.0,4.0,711.0,50.0,637.0,53.0,1185.0,700.0,10.9,34.1,0.6923
50%,1.5,1053.0,-200.0,7.9,895.0,141.0,794.0,96.0,1446.0,942.0,17.2,48.6,0.9768
75%,2.6,1221.0,-200.0,13.6,1105.0,284.0,960.0,133.0,1662.0,1255.0,24.1,61.9,1.2962
max,11.9,2040.0,1189.0,63.7,2214.0,1479.0,2683.0,340.0,2775.0,2523.0,44.6,88.7,2.231


## Determine rate of missing values

Now that we looked at how the data was stored, we want to determine the rate of missing values. It's important to note again that the dataset uses the value `-200` to indicate a missing value. Thus, we will need to look for any cases of `-200` and switch it to `NaN`. Here we'll use the `.replace()` method to do this task.

In [18]:
air.replace(-200, np.nan, inplace = True)

Now we can look at the summary statistics for the data to see if our min values changed.

In [19]:
air.describe()

Unnamed: 0,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
count,7674.0,8991.0,914.0,8991.0,8991.0,7718.0,8991.0,7715.0,8991.0,8991.0,8991.0,8991.0,8991.0
mean,2.15275,1099.833166,218.811816,10.083105,939.153376,246.896735,835.493605,113.091251,1456.264598,1022.906128,18.317829,49.234201,1.02553
std,1.453252,217.080037,204.459921,7.44982,266.831429,212.979168,256.81732,48.370108,346.206794,398.484288,8.832116,17.316892,0.403813
min,0.1,647.0,7.0,0.1,383.0,2.0,322.0,2.0,551.0,221.0,-1.9,9.2,0.1847
25%,1.1,937.0,67.0,4.4,734.5,98.0,658.0,78.0,1227.0,731.5,11.8,35.8,0.7368
50%,1.8,1063.0,150.0,8.2,909.0,180.0,806.0,109.0,1463.0,963.0,17.8,49.6,0.9954
75%,2.9,1231.0,297.0,14.0,1116.0,326.0,969.5,142.0,1674.0,1273.5,24.4,62.5,1.3137
max,11.9,2040.0,1189.0,63.7,2214.0,1479.0,2683.0,340.0,2775.0,2523.0,44.6,88.7,2.231


Now we see that the min values are no longer `-200` and instead reflect each variable's observed range. This will allow us to more easily see the distribution of each variable and the relationships between the different variables.

Additionally, we can now count the missing values by summing up (with the `.sum()` method) the number of instances where the `.isnull()` method returns `True`. The `.isnull()` method returns `True` if a value is `NaN`.

In [20]:
air.isnull().sum()

Unnamed: 0,0
Date,0
Time,0
CO(GT),1683
PT08.S1(CO),366
NMHC(GT),8443
C6H6(GT),366
PT08.S2(NMHC),366
NOx(GT),1639
PT08.S3(NOx),366
NO2(GT),1642


The table above lists the number of missing values for each variable in our DataFrame. Out of 9357 rows, `NMHC(GT)` had the highest missingness with n = 8443 missing values (about 90.2%), followed by `CO(GT)` (n = 1683, about 18.0%), `NO2(GT)` (n = 1642, about 17.5%), and `NOx(GT)` (n = 1639, about 17.5%). The other variables of interest (excluding `Time` and `Date`) have n = 366 missing values (about 3.9%).
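
Because counts are easier to compare as proportions, one option is to take the mean of the boolean mask rather than its sum. The sketch below is a minimal illustration on a small made-up frame (the `toy` DataFrame and its values are assumptions for demonstration, not rows from the dataset). It replaces the `-200` sentinel and reports the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: -200 marks a missing reading, as in the real dataset
toy = pd.DataFrame({
    "CO(GT)": [2.6, -200, 2.2, -200],
    "C6H6(GT)": [11.9, 9.4, -200, 9.2],
    "T": [13.6, 13.3, 11.9, 11.0],
})

# Replace the sentinel and return a new frame (the original is left unchanged)
toy_clean = toy.replace(-200, np.nan)

# .isnull().mean() gives the fraction of missing values in each column
missing_rate = toy_clean.isnull().mean()
print(missing_rate)
```

The same expression, `air.isnull().mean()`, should give per-column proportions for the full data once the sentinel has been replaced.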

## Data Cleaning

Now that we have replaced our missing values with `NaN` and re-calculated the summary statistics with `.describe()` for our basic data validation, we can move on to some data cleaning. For the purposes of EDA, we'll remove all rows with missing values (`NaN`) using `.dropna()`. Because `NMHC(GT)` is observed in only 914 of the 9357 rows, this keeps at most 914 rows.

In [21]:
air.dropna(inplace = True)

# Verify all NaN are dropped
air.isnull().sum()

Unnamed: 0,0
Date,0
Time,0
CO(GT),0
PT08.S1(CO),0
NMHC(GT),0
C6H6(GT),0
PT08.S2(NMHC),0
NOx(GT),0
PT08.S3(NOx),0
NO2(GT),0
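
Dropping every row that contains any `NaN` keeps only rows where all columns were observed, which can remove many rows when a single column is mostly missing. As a hedged alternative (not the approach used in this report), `.dropna()` accepts a `subset` argument so that only the columns needed for a given analysis decide which rows are dropped. The toy frame below is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in which one column is mostly missing
toy = pd.DataFrame({
    "NMHC(GT)": [150.0, np.nan, np.nan, np.nan],
    "C6H6(GT)": [11.9, 9.4, np.nan, 9.2],
    "T": [13.6, 13.3, 11.9, 11.0],
})

# Drop rows with a NaN in any column
all_complete = toy.dropna()

# Drop rows only when the columns of interest are missing
benzene_temp = toy.dropna(subset=["C6H6(GT)", "T"])

print(len(all_complete), len(benzene_temp))
```

Which option is appropriate depends on whether the analysis needs every variable at once or only a few at a time.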


Now that we have dropped all rows with missing values, we can rename some of these columns to be easier to work with using the `.rename()` method from `pandas`. We do this by creating a dictionary where the old names are the keys and the new names are the values, and then passing that dictionary to the `columns` argument of `.rename()`.

In [22]:
new_air_names = {
    'C6H6(GT)': 'Benzene',
    'CO(GT)': 'CO',
    'NOx(GT)': 'NOx',
    'NO2(GT)': 'NO2',
    'NMHC(GT)': 'NMHC',
    'PT08.S1(CO)': 'sensorCO',
    'PT08.S2(NMHC)': 'sensorNMHC',
    'PT08.S3(NOx)': 'sensorNOx',
    'PT08.S4(NO2)': 'sensorNO2',
    'PT08.S5(O3)': 'sensorO3',
    'T': 'Temp',
    'RH': 'relHumidity',
    'AH': 'absHumidity'
}

air.rename(columns = new_air_names, inplace = True)

# Check that the variables have the new names
air.head()

Unnamed: 0,Date,Time,CO,sensorCO,NMHC,Benzene,sensorNMHC,NOx,sensorNOx,NO2,sensorNO2,sensorO3,Temp,relHumidity,absHumidity
0,3/10/2004,18:00:00,2.6,1360.0,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578
1,3/10/2004,19:00:00,2.0,1292.0,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255
2,3/10/2004,20:00:00,2.2,1402.0,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502
3,3/10/2004,21:00:00,2.2,1376.0,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867
4,3/10/2004,22:00:00,1.6,1272.0,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888


For the remaining data exploration, we will focus on the relationship between `Benzene` and the sensor variables, `Temp`, `relHumidity`, `absHumidity`, `Date`, and `Time`. Thus, we may want a subsetted DataFrame with just these variables, which is done by copying our `air` DataFrame with `.copy()` and using the `.drop()` method.

In [23]:
sub_air = air.copy()

sub_air = sub_air.drop(columns = ["CO", "NMHC", "NOx", "NO2"])
sub_air.head()

Unnamed: 0,Date,Time,sensorCO,Benzene,sensorNMHC,sensorNOx,sensorNO2,sensorO3,Temp,relHumidity,absHumidity
0,3/10/2004,18:00:00,1360.0,11.9,1046.0,1056.0,1692.0,1268.0,13.6,48.9,0.7578
1,3/10/2004,19:00:00,1292.0,9.4,955.0,1174.0,1559.0,972.0,13.3,47.7,0.7255
2,3/10/2004,20:00:00,1402.0,9.0,939.0,1140.0,1555.0,1074.0,11.9,54.0,0.7502
3,3/10/2004,21:00:00,1376.0,9.2,948.0,1092.0,1584.0,1203.0,11.0,60.0,0.7867
4,3/10/2004,22:00:00,1272.0,6.5,836.0,1205.0,1490.0,1110.0,11.2,59.6,0.7888


### Creating New Variables

Now that we have a subset of our `air` data, we may want to create new variables that can be used in our exploratory data analysis. This section will cover the creation of the following variables:

* High vs Low Value Variables: Looking at the median of the variable, and determining if the current observation has a higher or lower value.

* Month, Day, and Year Categorical Variables: Categorical variables of the month name, day of the week, and numeric and categorical variables for the year.

* Season: Using a defined user-function to determine which season the observation falls in.

#### High vs Low Value Variables

The first set of variables we may want to consider is a categorical variable for `Benzene`, each sensor value, `Temp`, `relHumidity`, and `absHumidity`, where the variable takes on a value of `high` if the observation is above the median and `low` otherwise. Here we use the median because it's less susceptible to the influence of outliers than the mean. To make this process easier, we'll use a `for` loop to do this for each variable of interest.

In [28]:
# List of variables
indep_vars = ["Benzene","sensorCO", "sensorNMHC", "sensorNOx", "sensorNO2", "sensorO3", "Temp", "relHumidity", "absHumidity"]

# Using a for loop to create a categorical variable for each variable in the list indep_vars
for var in indep_vars:
    # Calculate the median of the current variable
    median = sub_air[var].median()

    # Create a new categorical column
    new_cat_col = f"{var}_cat"

    # Using np.where() to assign 'high' if value > median and 'low' otherwise
    # Add this new_cat_col to the existing sub_air DataFrame
    sub_air[new_cat_col] = np.where(sub_air[var] > median, 'high', 'low')

    # Make the new variable a category variable
    sub_air[new_cat_col] = sub_air[new_cat_col].astype('category')

# Check the .head() of sub_air with these new variables
sub_air.head()


Unnamed: 0,Date,Time,Benzene,Benzene_cat,sensorCO,sensorCO_cat,sensorNMHC,sensorNMHC_cat,sensorNOx,sensorNOx_cat,sensorNO2,sensorNO2_cat,sensorO3,sensorO3_cat,Temp,Temp_cat,relHumidity,relHumidity_cat,absHumidity,absHumidity_cat
0,3/10/2004,18:00:00,11.9,high,1360.0,high,1046.0,high,1056.0,high,1692.0,high,1268.0,high,13.6,low,48.9,low,0.7578,low
1,3/10/2004,19:00:00,9.4,high,1292.0,high,955.0,high,1174.0,high,1559.0,high,972.0,low,13.3,low,47.7,low,0.7255,low
2,3/10/2004,20:00:00,9.0,low,1402.0,high,939.0,low,1140.0,high,1555.0,low,1074.0,high,11.9,low,54.0,high,0.7502,low
3,3/10/2004,21:00:00,9.2,high,1376.0,high,948.0,high,1092.0,high,1584.0,high,1203.0,high,11.0,low,60.0,high,0.7867,low
4,3/10/2004,22:00:00,6.5,low,1272.0,high,836.0,low,1205.0,high,1490.0,low,1110.0,high,11.2,low,59.6,high,0.7888,low
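
One detail worth noting in the loop above is that the comparison is strict, so an observation exactly equal to the median is labeled `low`. The short sketch below uses made-up values (an assumption for illustration) to show this behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical values whose median (3.0) appears in the data
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
median = values.median()

# Strict comparison: only values above the median are 'high'
labels = pd.Series(np.where(values > median, "high", "low"), dtype="category")
print(labels.tolist())
```

If ties should count as `high` instead, `>=` could be used in place of `>`.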


We might want to reorder some of these variables so that the categorical variable is next to the numeric variable.

In [29]:
var_order = ["Date", "Time", "Benzene", "Benzene_cat", "sensorCO", "sensorCO_cat", "sensorNMHC", "sensorNMHC_cat",
                "sensorNOx", "sensorNOx_cat", "sensorNO2", "sensorNO2_cat", "sensorO3", "sensorO3_cat",
                "Temp", "Temp_cat", "relHumidity", "relHumidity_cat", "absHumidity", "absHumidity_cat"]

sub_air = sub_air[var_order]
sub_air.head()

Unnamed: 0,Date,Time,Benzene,Benzene_cat,sensorCO,sensorCO_cat,sensorNMHC,sensorNMHC_cat,sensorNOx,sensorNOx_cat,sensorNO2,sensorNO2_cat,sensorO3,sensorO3_cat,Temp,Temp_cat,relHumidity,relHumidity_cat,absHumidity,absHumidity_cat
0,3/10/2004,18:00:00,11.9,high,1360.0,high,1046.0,high,1056.0,high,1692.0,high,1268.0,high,13.6,low,48.9,low,0.7578,low
1,3/10/2004,19:00:00,9.4,high,1292.0,high,955.0,high,1174.0,high,1559.0,high,972.0,low,13.3,low,47.7,low,0.7255,low
2,3/10/2004,20:00:00,9.0,low,1402.0,high,939.0,low,1140.0,high,1555.0,low,1074.0,high,11.9,low,54.0,high,0.7502,low
3,3/10/2004,21:00:00,9.2,high,1376.0,high,948.0,high,1092.0,high,1584.0,high,1203.0,high,11.0,low,60.0,high,0.7867,low
4,3/10/2004,22:00:00,6.5,low,1272.0,high,836.0,low,1205.0,high,1490.0,low,1110.0,high,11.2,low,59.6,high,0.7888,low


#### Month, Day, and Year Categorical Variables

For the next new variables, we may want to explore the `Date` column and see if we notice any trends across the month, day of the week, and year. For this we can create a new variable for each of these components. However, before we start this process, we need to convert `Date` to a workable datetime object. Once we do this, we can use the datetime accessor `.dt` to extract the information we need. We will also convert this information into categories.

In [37]:
# Make `Date` a datetime object
sub_air['Date'] = pd.to_datetime(sub_air['Date'], errors = "coerce")

# Extract the month name & convert to category
sub_air['month'] = sub_air['Date'].dt.month_name()
sub_air['month'] = sub_air['month'].astype('category')

# Extract day of the week & convert to category
sub_air['day'] = sub_air['Date'].dt.day_name()
sub_air['day'] = sub_air['day'].astype('category')

# Extract the year, except here we will have a numeric and category type variable
sub_air['year'] = sub_air['Date'].dt.year
sub_air['year_cat'] = sub_air['year'].astype('category')

# Check the .head() of sub_air with these new variables
sub_air.head()


Unnamed: 0,Date,Time,Benzene,Benzene_cat,sensorCO,sensorCO_cat,sensorNMHC,sensorNMHC_cat,sensorNOx,sensorNOx_cat,...,Temp,Temp_cat,relHumidity,relHumidity_cat,absHumidity,absHumidity_cat,month,day,year,year_cat
0,2004-03-10,18:00:00,11.9,high,1360.0,high,1046.0,high,1056.0,high,...,13.6,low,48.9,low,0.7578,low,March,Wednesday,2004,2004
1,2004-03-10,19:00:00,9.4,high,1292.0,high,955.0,high,1174.0,high,...,13.3,low,47.7,low,0.7255,low,March,Wednesday,2004,2004
2,2004-03-10,20:00:00,9.0,low,1402.0,high,939.0,low,1140.0,high,...,11.9,low,54.0,high,0.7502,low,March,Wednesday,2004,2004
3,2004-03-10,21:00:00,9.2,high,1376.0,high,948.0,high,1092.0,high,...,11.0,low,60.0,high,0.7867,low,March,Wednesday,2004,2004
4,2004-03-10,22:00:00,6.5,low,1272.0,high,836.0,low,1205.0,high,...,11.2,low,59.6,high,0.7888,low,March,Wednesday,2004,2004
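
The dates in this dataset are written month first (for example, `3/10/2004` is parsed above as March 10, 2004). A minimal sketch, assuming that format holds for every row, is to pass an explicit `format` string so parsing does not rely on inference, and to store the month as an ordered category so that summaries and plots follow calendar order rather than alphabetical order. The two example dates and the `month_order` list are chosen here for illustration.

```python
import pandas as pd

# Hypothetical example dates in month/day/year form
dates = pd.Series(["3/10/2004", "12/25/2004"])

# Explicit format; values that do not match become NaT because of errors="coerce"
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# Calendar order for the month categories
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
month = pd.Categorical(parsed.dt.month_name(), categories=month_order, ordered=True)

print(parsed.tolist())
print(month.min(), month.max())
```

With an ordered category, `March` sorts before `December`, whereas plain strings would sort alphabetically.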


#### Season Variable

Another thing we may want to consider is the season (Winter, Spring, Summer, Fall). We'll use fixed calendar dates to approximate when the seasons change. That is:

* Winter: Dec 21 through March 20
* Spring: March 21 through June 20
* Summer: June 21 through Sept 22
* Fall: Sept 23 through Dec 20

Here we can define a function called `seasons` to determine which season a date may fall in using the date ranges defined above. We then use a lambda function to apply this `seasons` function to each date.

In [42]:
# Create a seasons variable
def seasons(date):

    '''
    Taking in a date, that is a datetime object, we pull out the month and day to determine which season the date falls in.

    Winter is defined as Dec 21 through March 20
    Spring is defined as March 21 through June 20
    Summer is defined as June 21 through Sept 22
    Fall is defined as Sept 23 through Dec 20

    We then return the season as a string (or NaN if the date is missing).
    '''

    m = date.month
    d = date.day

    if (m == 12 and d >= 21) or m in [1, 2] or (m == 3 and d < 21):
        return 'winter'
    elif (m == 3 and d >= 21) or m in [4, 5] or (m == 6 and d < 21):
        return 'spring'
    elif (m == 6 and d >= 21) or m in [7, 8] or (m == 9 and d < 23):
        return 'summer'
    elif (m == 9 and d >= 23) or m in [10, 11] or (m == 12 and d < 21):
        return 'fall'

    # No branch matches when the date is missing (NaT), so return NaN
    return np.nan

# Use a lambda function to apply the seasons function
sub_air['season'] = sub_air.apply(lambda x: seasons(x['Date']), axis=1)

# Convert the season to a category
sub_air['season'] = sub_air['season'].astype('category')

# Check sub_air with .head() again
sub_air.head()

Unnamed: 0,Date,Time,Benzene,Benzene_cat,sensorCO,sensorCO_cat,sensorNMHC,sensorNMHC_cat,sensorNOx,sensorNOx_cat,...,Temp_cat,relHumidity,relHumidity_cat,absHumidity,absHumidity_cat,month,day,year,year_cat,season
0,2004-03-10,18:00:00,11.9,high,1360.0,high,1046.0,high,1056.0,high,...,low,48.9,low,0.7578,low,March,Wednesday,2004,2004,winter
1,2004-03-10,19:00:00,9.4,high,1292.0,high,955.0,high,1174.0,high,...,low,47.7,low,0.7255,low,March,Wednesday,2004,2004,winter
2,2004-03-10,20:00:00,9.0,low,1402.0,high,939.0,low,1140.0,high,...,low,54.0,high,0.7502,low,March,Wednesday,2004,2004,winter
3,2004-03-10,21:00:00,9.2,high,1376.0,high,948.0,high,1092.0,high,...,low,60.0,high,0.7867,low,March,Wednesday,2004,2004,winter
4,2004-03-10,22:00:00,6.5,low,1272.0,high,836.0,low,1205.0,high,...,low,59.6,high,0.7888,low,March,Wednesday,2004,2004,winter
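
A row-wise `apply` can be slow on larger frames. The sketch below is one possible vectorized alternative, not the method used in this report: it encodes each date as a month-day integer (for example, March 21 becomes 321) and uses `np.select` with the same boundaries as `seasons`. The function name `season_from_dates` and the example dates are hypothetical.

```python
import numpy as np
import pandas as pd

def season_from_dates(dates: pd.Series) -> pd.Series:
    """Label each datetime in `dates` with a season; missing dates stay missing."""
    # Encode month and day as one integer, e.g. June 20 -> 620
    mmdd = dates.dt.month * 100 + dates.dt.day
    conditions = [
        (mmdd >= 321) & (mmdd <= 620),   # March 21 through June 20
        (mmdd >= 621) & (mmdd <= 922),   # June 21 through Sept 22
        (mmdd >= 923) & (mmdd <= 1220),  # Sept 23 through Dec 20
    ]
    # Anything not matched above falls in winter (Dec 21 through March 20)
    labels = np.select(conditions, ["spring", "summer", "fall"], default="winter")
    # Blank out the label wherever the date itself is missing
    return pd.Series(labels, index=dates.index).where(dates.notna())

# Dates on either side of each season boundary
boundary_dates = pd.to_datetime(pd.Series([
    "2004-03-20", "2004-03-21", "2004-06-20", "2004-06-21",
    "2004-09-22", "2004-09-23", "2004-12-20", "2004-12-21",
]))
print(season_from_dates(boundary_dates).tolist())
```

Checking the dates around each boundary is a quick way to confirm that the written definitions and the code agree.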


Now that we have created a few more variables for our exploratory data analysis, we can move forward with investigating the relationship between these variables and `Benzene`.

## Numerical Summaries

Here will be a section for `You should have numeric summaries of the C6H6(GT) variable (at different levels/combinations of other variables)`.

One thing I might want to consider is creating bins, or a categorical variable for low and high levels based on the literature.
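
As a starting point for this section, the sketch below shows one way grouped summaries could be produced with `.groupby()` and `.agg()`. It uses only the five rows displayed by `.head()` earlier, copied into a small frame named `demo` (a hypothetical name), so the numbers are illustrative and are not results for the full data.

```python
import pandas as pd

# The five rows shown by .head() above (illustrative subset only)
demo = pd.DataFrame({
    "Benzene": [11.9, 9.4, 9.0, 9.2, 6.5],
    "relHumidity_cat": ["low", "low", "high", "high", "high"],
})

# Summaries of Benzene at each level of the humidity category
summary = demo.groupby("relHumidity_cat")["Benzene"].agg(["count", "mean", "median", "std"])
print(summary)
```

Passing a list of columns to `.groupby()`, such as a humidity category together with `season`, would give summaries for combinations of levels.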


## Correlations

Here will be a section to investigate the correlations between the variables. Again, using the literature as a basis for our investigation and adding on more observations.
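
A minimal sketch of how these correlations might be computed is shown below, again using only the five rows displayed by `.head()` (the frame name `demo` is hypothetical). With so few rows the values are purely illustrative and should not be read as findings.

```python
import pandas as pd

# The five rows shown by .head() above (illustrative subset only)
demo = pd.DataFrame({
    "Benzene": [11.9, 9.4, 9.0, 9.2, 6.5],
    "sensorNMHC": [1046.0, 955.0, 939.0, 948.0, 836.0],
    "sensorNOx": [1056.0, 1174.0, 1140.0, 1092.0, 1205.0],
    "Temp": [13.6, 13.3, 11.9, 11.0, 11.2],
})

# Pearson correlation of every numeric column with Benzene, sorted
corr_with_benzene = demo.corr()["Benzene"].drop("Benzene").sort_values()
print(corr_with_benzene)
```

For monotone but non-linear relationships, `demo.corr(method="spearman")` is a possible alternative.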

## Data Visualizations

### Benzene and Other Variables

Here is a section to focus on the plots of Benzene against other variables.

`You should have plots of the C6H6(GT) variable (again showing relationships with other variables)`


### Benzene and Other Variables across Time and Date

Here is a section that will look at the time series element of the data.

`You should look at relationships over time and also ignoring time.`
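
One possible way to prepare the time element is to combine `Date` and `Time` into a single timestamp, which can then serve as an index for time series plots or be grouped by hour of day. The sketch below assumes both columns are still strings in the original `m/d/YYYY` and `HH:MM:SS` forms, and uses the five rows displayed by `.head()` in a frame named `demo` (a hypothetical name).

```python
import pandas as pd

# The five rows shown by .head() above (illustrative subset only)
demo = pd.DataFrame({
    "Date": ["3/10/2004"] * 5,
    "Time": ["18:00:00", "19:00:00", "20:00:00", "21:00:00", "22:00:00"],
    "Benzene": [11.9, 9.4, 9.0, 9.2, 6.5],
})

# Combine the two string columns into one datetime column
demo["timestamp"] = pd.to_datetime(
    demo["Date"] + " " + demo["Time"], format="%m/%d/%Y %H:%M:%S"
)

# A time-indexed series, convenient for line plots over time
benzene_ts = demo.set_index("timestamp")["Benzene"]

# Mean Benzene by hour of day, ignoring the calendar date
by_hour = demo.groupby(demo["timestamp"].dt.hour)["Benzene"].mean()
print(by_hour)
```

The time-indexed series supports the "over time" view, while the hour-of-day grouping is one way to look at patterns that repeat across days.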
