# Assessment 1: Anthropogenic pollutants across diverse land covers
<a name="top"></a>

❗️ **Deadline: 2pm, Friday 6th December 2024**
<br>
❗️ **Please submit your notebook as a PDF. For instructions, see the end of this notebook.**

This is the first piece of assessed coursework for GEOG5302 Data Science for Practical Applications. It's worth 30% of the module credit. For this assignment, you'll be putting all the skills you've learnt in lectures and computer labs so far into practice. You'll find Labs 1-8 should be helpful in identifying the code and processes you'll need for each step. We start by loading in data, doing some wrangling and visual investigation, before analysing the data using clustering, classification, regression, and some spatial tools.

Some of the steps in this notebook are simple preparation or loading stages needed to complete the larger tasks. Items which are assigned marks will be clearly numbered and highlighted __in bold__, with the number of marks specified in (_italics and brackets_) after the question. To receive the marks, you must clearly type out your answer to demonstrate your understanding; just printing out the answer isn't sufficient.

You'll be working with a custom dataset containing dominant land cover and mean values of Nitrogen Oxides (NOx), Nitrogen Dioxide (NO₂), and Sulfur Dioxide (SO₂) (`Land_Cover_and_Pollution_Dataset_2017_2023.csv`) for each Local Administrative Unit (LAU2) in England and Wales (`LAU2_Dec_2014_FCB_in_England_and_Wales.shp`). The dataset spans 2017 to 2023 (7 years) and provides insights into spatial pollution patterns over time.


<img src="https://upload.wikimedia.org/wikipedia/commons/a/a4/Air_pollution_by_industrial_chimneys.jpg" alt="Industrial chimneys emitting pollutants" width="500"/>

**Figure 1:** Industrial chimneys emitting pollutants into the atmosphere (*source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Air_pollution_by_industrial_chimneys.jpg). License: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/).*)


## The Problem

Elevated levels of NOx and NO₂ are linked to a variety of societal issues, including serious respiratory conditions such as asthma and chronic obstructive pulmonary disease (COPD) [(Yorifuji et al., 2019)](#references), and the formation of ground-level ozone, which can lead to smog in dense urban areas [(Chang et al., 2019; Liu et al., 2021)](#references). SO₂ emissions are linked to industrial activities, leading to acid rain formation that adversely affects both terrestrial and aquatic ecosystems [(Yeung et al., 2021)](#references). SO₂ emissions primarily arise from the combustion of fossil fuels and industrial processes. The formation of acid rain due to SO₂ emissions poses significant threats to ecosystems, affecting soil and water quality [(Oxley et al., 2013)](#references). LAU2 units provide a granular spatial resolution that allows for detailed analysis of air pollution patterns and their impacts at local scales. This level of detail is crucial for identifying specific areas that are disproportionately affected by pollutants such as NOx, NO₂, and SO₂, thereby enabling targeted interventions and resource allocation.

_In this assessment we look at NOx, NO₂, and SO₂ at local scales in the context of landcover dominance._





## The Data

### Land Cover
Dominant land cover (e.g., agriculture, built-up areas, etc.) for each LAU2 is derived from the UK Centre for Ecology and Hydrology (CEH) Land Cover Map series ([LCM](https://www.ceh.ac.uk/data/ukceh-land-cover-maps)). The land classes we used are grouped into a simpler classification called the **UK CEH Aggregate Classes**. Understanding the primary land cover type is essential, as it can influence pollution levels and air quality in different ways.

### Atmospheric Pollution
Mean annual measurements for **NOx**, **NO₂**, and **SO₂** have been derived from the UK Government's Department for Environment, Food & Rural Affairs ([DEFRA Modelled Background Pollution Data](https://uk-air.defra.gov.uk/data/modelling-data)). We're looking at these three pollutants to explore spatial patterns and understand their impact on air quality and the environment. Each pollutant reveals a unique perspective on air quality.

- **Nitrogen Oxides (NOx)**: Primarily linked to vehicles, industry, and some agricultural practices. Elevated NOx levels contribute to respiratory health issues, acid rain formation, and ground-level ozone, all of which can harm both ecosystems and human health.

- **Nitrogen Dioxide (NO₂)**: A specific type of NOx often concentrated in urban areas with heavy traffic. Closely tied to vehicle emissions, NO₂ contributes to smog, which is why it's particularly useful to study NO₂ in relation to urban land cover, where pollution may be higher.

- **Sulfur Dioxide (SO₂)**: Produced mainly from burning fossil fuels and certain industrial activities. SO₂ can cause acid rain and respiratory health issues, making it especially relevant in areas near industrial zones.

### Boundaries
The [Local Administrative Units (LAU)](https://ec.europa.eu/eurostat/web/nuts/local-administrative-units) are part of a geographical classification defined by Eurostat to provide a standardised division of territorial units within the EU. Although updated post-Brexit, LAUs remain useful for historical data studies. Specifically, **LAU2** refers to the smallest administrative unit level, covering areas such as parishes, wards, or town boundaries in England and Wales.

> _Note_: LAU2 boundaries are similar to the [Lower Super Output Areas (LSOA)](https://www.ons.gov.uk/methodology/geography/ukgeographies/statisticalgeographies) you've used in labs, though they differ in their specific delineations.

---

*All data and Data Dictionaries can be downloaded from Minerva.*



## Setting Up
This is an independent data analysis task, but we'll provide the overall steps and the questions you need to answer in this notebook. We'll make sure you're set up with all the preliminary modules you'll need, too. <br>

__Note:__ This is just to get you started! You'll need to import more modules for specific techniques (e.g. linear regression) later on in the task!

In [None]:
!pip install palettable==3.3.0
!pip install descartes==1.1.0
!pip install pysal==2.6.0
!pip install contextily
!pip install geopandas

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn import metrics

# Silence pandas' SettingWithCopyWarning for chained assignment
pd.options.mode.chained_assignment = None

<br>

## Over to you!
### Load in the data
Load in the pollution `.csv` data from the file `Land_Cover_and_Pollution_Dataset_2017_2023.csv`.


<br>

## Section 1: Describing the Data

Start with some quick investigations to get to know the data.

__(total: 3 marks)__

__Question 1: How many rows and variables are in the data? _(1 mark)___

<br>

__Question 2: What different types of variables are in the data set, and what do the types mean? How many are there of each? _(2 marks)___

<br>

## Section 2: Data Wrangling

You'll need to do some data cleaning and preprocessing, to get the data set ready for analysis.

__(total: 13 marks)__

__Question 3: Which columns have NAs, and how many are there in each column? Describe what this means _(2 marks)___


__Question 4: One way of dealing with missing values is to remove them. Make this change, and explain how this method may affect the results of our analysis _(2 marks)___
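
As a rough illustration of the two steps above (counting missing values, then dropping them), here is a minimal sketch on a made-up table. The column names `area_id`, `no2` and `so2` are placeholders and are not the names used in the assessment data.

```python
import numpy as np
import pandas as pd

# Toy data: column names are placeholders, not the assessment's real ones.
toy = pd.DataFrame({
    "area_id": ["A", "B", "C", "D"],
    "no2": [12.1, np.nan, 9.8, 15.3],
    "so2": [1.4, 2.0, np.nan, np.nan],
})

# Number of missing values in each column.
na_counts = toy.isna().sum()
print(na_counts)

# Drop every row that has at least one missing value.
toy_complete = toy.dropna()
print(len(toy), "rows before,", len(toy_complete), "rows after")
```

Comparing the row counts before and after is a quick way to see how much data a row-wise drop actually costs.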

<br>

__Question 5: Create a box plot with year on the x axis, Mean NO2 on the y axis, and categorised by type of Dominant Land Cover. Explain what you observe: _(3 marks)___

<br>

__Question 6: It looks like something has gone wrong with one of the years. What do you think might have happened? Provide an explanation, and rectify the mistake with the year. Recreate the box plot to demonstrate this has worked _(2 marks)___

<br>

__Question 7: The final stage of our wrangling requires us to create dummy variables for each of our different land cover types, so they can later be included in regression models. Check the values first, then create a set of dummies, called `Lc_`, followed by the type. Then use `info` to show that this has worked. _(2 marks)___
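
One possible way to build prefixed dummy columns is `pandas.get_dummies`. The sketch below uses a made-up `land_cover` column with invented categories, so treat it as an illustration of the pattern rather than the expected answer.

```python
import pandas as pd

# Toy data: the column name and categories are placeholders.
toy = pd.DataFrame({"land_cover": ["Arable", "Freshwater", "Arable", "Coastal"]})

# Check which categories exist and how often each occurs.
print(toy["land_cover"].value_counts())

# One 0/1 column per category, each named "Lc_<category>".
dummies = pd.get_dummies(toy["land_cover"], prefix="Lc", dtype=int)
toy = pd.concat([toy, dummies], axis=1)

# info() lists the new columns and their types.
toy.info()
```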

__Question 8: Finally, create a stacked histogram of the total counts/frequency of `Dominant land cover` using `hue`: _(2 marks)___

<br>

## Statistics
You'll now begin to statistically explore the variables and relationships between them.

__(total: 13 marks)__

__Question 9: What is the mean SO₂ measurement in 2018 (round your answer to 2 decimal places)? _(1 mark)___

<br>

__Hypothesis Testing__
Before we start building predictive models, we're going to use hypothesis testing to examine the relationship between a couple of variables of interest. We're going to be focussing on the SO₂ levels for different land cover types. Our research question will be: _'is there a relationship between land cover type and levels of SO₂ pollution?'_

__Question 10: We'll start by creating a null hypothesis. Please explain what is meant by a null hypothesis: _(1 mark)___

<br>

__Question 11: If we assume our null hypothesis is that "there is no statistically significant difference in average SO₂ emissions between 'Built-up Areas and Gardens' and 'Arable'", what would our alternative hypothesis be? _(1 mark)___

<br>

__Question 12: To see if there is a difference between groups, we can start by running a t-test to examine the difference between SO₂ emission levels for Built-up Areas and Gardens, and Arable. Run the t-test below, report the value to 2 decimal places, and explain what the results show. Please also justify which type of t-test you have used. _(4 marks)___
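
For reference, a two-sample t-test can be run with `scipy.stats.ttest_ind`. The sketch below uses randomly generated groups and assumes Welch's variant (`equal_var=False`), which does not require equal variances. Whether that is the right variant for the real data is for you to decide and justify.

```python
import numpy as np
from scipy import stats

# Synthetic groups standing in for two land cover types (values are invented).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=2.0, scale=0.5, size=200)
group_b = rng.normal(loc=1.5, scale=0.8, size=150)

# Welch's t-test: independent samples, variances not assumed equal.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```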

<br>

__Question 13: Going back to our hypotheses, which hypothesis do we accept, and which do we reject? Is there a difference in SO₂ levels between Built-up Areas and Gardens, and Arable? _(1 mark)___

<br>

__Question 14: Before we start building predictive models, we'll explore the correlations between our key variables of interest. Create a subset (called `pollution_sub`) which includes the columns for year, NOx, NO₂, SO₂, and the land cover dummies. Then create an annotated correlation heat map of this data.__

__a) Which variables have the strongest negative association, and what is the value of this? _(1 mark)___

__b) What is the association between Arable, and Improved grassland? _(1 mark)___

*Hint: for the heatmap add the following parameters to make sure you can read the table:* `annot_kws={"size": 8}, fmt=".2f"`
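
Besides reading values off the heat map by eye, the strongest negative pair can also be pulled out of the correlation matrix directly. This is a minimal sketch on synthetic columns `a`, `b` and `c` (hypothetical names), where `b` is constructed to be negatively related to `a`.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is built to move against "a"; "c" is unrelated noise.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
toy = pd.DataFrame({
    "a": x,
    "b": -x + rng.normal(scale=0.3, size=300),
    "c": rng.normal(size=300),
})
corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# The pair of variables with the most negative correlation, and its value.
strongest_negative = pairs.idxmin()
print(strongest_negative, round(pairs.min(), 2))
```

The same `corr` matrix is what you would pass to `sns.heatmap` together with the parameters from the hint.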

<br>

__Question 15: Finally, create a scatterplot of age and average emissions, coloured by dominant land cover type. Comment on what you observe _(3 marks)___

For the clustering and classification sections of our notebook, we'll focus on just the top three dominant land covers ('Built-up Areas and Gardens', 'Arable', and 'Improved Grassland') and only on the year 2020. Create a subset here with just these variables, called `pollution_2020`.

<br>

## Clustering
You'll now start exploring these relationships in more detail, by clustering your data.

Start with the K-Means method.

(__total: 10 marks__)


First, create a new subset of the data, called `pollution_stats`. This should include just NO2 and SO2 emissions for 2020.

<br>

__Question 16: Using your pollution_stats dataset, use the elbow method to find the optimal number of clusters to use. Test for between 1 and 10 clusters, plot your results, and explain your choice of K- with reference to the meaning of the x and y axis. _(4 marks)___
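
As a reminder of the mechanics, the elbow method fits K-Means for a range of k and records the inertia (within-cluster sum of squares) each time. The sketch below uses synthetic blobs from `make_blobs` rather than the pollution data, and `n_init=10` is an assumed setting.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=42)

# Fit K-Means for k = 1..10 and record the inertia of each fit.
k_values = range(1, 11)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# Plot inertia against k; the "elbow" is where the curve flattens out.
fig, ax = plt.subplots()
ax.plot(list(k_values), inertias, marker="o")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Inertia (within-cluster sum of squares)")
```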

<br>

__Question 17: Create a K-Means clustering model using your chosen number of clusters. Report the silhouette score (rounded to 3 decimal places), and explain what this means. _(3 marks)___

(__note:__ the silhouette score can take a minute to run- this is quite a big data set!)
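
The silhouette score is available as `sklearn.metrics.silhouette_score`, which takes the data and the cluster labels. A minimal sketch on synthetic blobs (not the pollution data) looks like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with three well-separated groups.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=42)

# Fit the model and get one cluster label per observation.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette coefficient over all observations (ranges from -1 to 1).
score = silhouette_score(X, labels)
print(round(score, 3))
```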

<br>

To compare the fit of our model, we could use another clustering method, such as DBSCAN.
This function requires two values to be specified: `eps` (epsilon) and `min_samples`.

__Question 18: Explain how the DBSCAN Algorithm uses these values to calculate clusters _(3 marks)___
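
To see the two parameters in action, here is a small sketch on nine hand-placed points; the values `eps=0.5` and `min_samples=3` are arbitrary choices for this toy layout. In scikit-learn, points that end up in no cluster are given the label `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of four points each, plus one isolated point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [20.0, 20.0],
])

# eps: neighbourhood radius; min_samples: points needed within that radius.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)
```

Try changing `eps` or `min_samples` and re-running to see how the labels respond.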

<br>

## Classification

You'll extend what you've learnt from the clustering model, by adding back in Dominant Land Cover, to predict this based on pollutants.

__(total: 16 marks)__

__Question 19: Explain the difference between supervised and _un_ supervised learning methods, and how they are used with different data types _(1 mark)___

Go back to your `pollution_stats` data, and add the original `Dominant land cover` variable back in, but only focusing on the top three dominant land covers ('Built-up Areas and Gardens', 'Arable', and 'Improved Grassland').

This should give you a dataset with SO₂, NO₂, and Dominant land cover, again for the year 2020. As we only extract the top three land covers, the number of rows is reduced slightly.

One problem is that sklearn only allows us to classify on numeric data, so we need to convert our three top land cover classes ('Built-up Areas and Gardens', 'Arable', 'Improved Grassland') into integers (0,1,2). To do this, use the `key` method, which can recode categorical data. Assign this to a new variable `Lc_type`:

Load in the label encoder and encode the labels, using the new `Lc_type` variable.

Now split the data into training and test datasets, using a 75:25 split. Set the random state to 42 (`random_state=42`).
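
The encoding and splitting steps can be sketched as follows, using a made-up table with placeholder column names and categories. Note that `LabelEncoder` assigns integers in sorted order of the labels it sees, so it is worth printing the mapping to confirm which integer means what.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy data: column names and categories are placeholders.
toy = pd.DataFrame({
    "no2": [10.0, 12.5, 8.1, 20.3, 18.7, 7.4, 9.9, 21.0],
    "so2": [1.1, 1.3, 0.9, 2.5, 2.2, 0.8, 1.0, 2.7],
    "cover": ["Arable", "Arable", "Grass", "Urban", "Urban", "Grass", "Arable", "Urban"],
})

# Encode the class labels as integers and show the resulting mapping.
encoder = LabelEncoder()
y = encoder.fit_transform(toy["cover"])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# 75:25 train/test split with a fixed random state for reproducibility.
X = toy[["no2", "so2"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(len(X_train), len(X_test))
```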

<br>

__Question 20: How many instances are in the testing and training sets? Explain how these are used to create a classification model _(3 marks)___

Now you're going to build two types of models: a decision tree, and a K-Nearest Neighbours, and then compare the results to find the best fit.

First build a decision tree, set the random state to 42 (`random_state=42`), and fit it to the data.

<br>

__Question 21:__ <br>
__a) How does the model score? (to 3 decimal places) _(1 mark)___ <br>
__b) What is the model weighted average Precision and Recall? _(2 marks)___ <br>

__Question 22: Now plot the confusion matrix. Comment on what you observe? _(2 marks)___
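
For orientation, the sketch below fits a decision tree to scikit-learn's built-in iris data (not the pollution data) and computes the accuracy, the weighted-average precision and recall, and the confusion matrix.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Built-in example data with three classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training set, evaluate on the held-out test set.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
predictions = tree.predict(X_test)
accuracy = tree.score(X_test, y_test)

# Weighted averages weight each class by its number of true instances.
precision, recall, _, _ = precision_recall_fscore_support(
    y_test, predictions, average="weighted"
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, predictions)
print(cm)
```

The same evaluation code works for other scikit-learn classifiers, for example `KNeighborsClassifier(n_neighbors=20)`.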

You can now compare the results of the decision tree to a K-nearest neighbour classifier. Use the nearest 20 neighbours to build the model. <br>

__Question 23:__ <br>
__a) How does the model score? (to 3 decimal places) _(1 mark)___ <br>
__b) What is the model weighted average Precision and Recall? _(2 marks)___ <br>

<br>

__Question 24: Now plot the confusion matrix. Comment on what you observe? _(2 marks)___

<br>

__Question 25: Compare the statistical and visual results of the classification algorithms. Which is most effective, and does this vary by category? _(2 marks)___
<br>
__Hint:__ Remember the meaning of the classes: 0 = Built-up Areas and Gardens, 1 = Arable, 2 = Improved Grassland

<br>

## Regression

You've now explored different patterns in your data using correlations, clustering, and classification. It's now time to return to your original 2020 data to explore the linear relationships with a simple linear regression. First you'll calculate the ratio of NO₂/NOₓ, then you'll predict the amount of SO₂ based on NO₂, NOₓ, and the dummy variables you created earlier.


__(total: 14 marks)__

Go back to your original `pollution` data with the expanded range of land cover types. You should already have extra dummy variables you created for all land cover types earlier. Create a subset called `pollution_subset` that only includes data for the year 2023.

Create a new variable in `pollution_subset` called `NOx_NO2_Ratio`. As we are interested in NOx relative to NO2, divide NOx by NO2.

Earlier we checked for missing values, but sometimes, data will have 0 values present. If any zeros were present in the above calculation, then `NOx_NO2_Ratio` will have NaNs present. You will not be able to run an OLS in this case, so let's check for missing values again, this time only for the `NOx_NO2_Ratio`, and remove any rows with missing values.
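
The sketch below shows how pandas handles zeros in a division, using made-up columns `nox` and `no2`. Dividing zero by zero gives NaN, while dividing a non-zero value by zero gives infinity, so this example converts infinities to NaN before dropping rows. That conversion is an extra precaution assumed here, not a step the assessment asks for.

```python
import numpy as np
import pandas as pd

# Toy data: column names are placeholders.
toy = pd.DataFrame({"nox": [20.0, 0.0, 15.0], "no2": [10.0, 0.0, 0.0]})

# 20/10 -> 2.0, 0/0 -> NaN, 15/0 -> inf
toy["ratio"] = toy["nox"] / toy["no2"]

# Treat infinite ratios as missing, then drop rows where the ratio is missing.
toy["ratio"] = toy["ratio"].replace([np.inf, -np.inf], np.nan)
toy_clean = toy.dropna(subset=["ratio"])
print(toy_clean)
```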

__Question 26: We're going to build a linear regression model, but first we need to check for multicollinearity between our variables. Explain what multicollinearity is, and why it's important to check for _(2 marks)___

__Question 27: Now check for multicollinearity using `heatmap`. If there are any associations above 0.9, remove one of these variables (remove the one that comes later alphanumerically, for consistency; hint: numbers precede letters) _(2 marks)___

*Hint: for the heatmap add the following parameters to make sure you can read the table:* `annot_kws={"size": 8}, fmt=".2f"`

__Question 28: _(8 marks)___
__Build a linear regression model to predict the SO₂ for each land cover type. Use 'Lc_Freshwater' as your omitted reference category__
<br>
__a) What is the R-squared value of the model, and what does this mean? (to 3 decimal places)__ <br>
__b) Which variables are statistically significant at the 5% level?__ <br>
__c) What are the regression coefficients of the statistically significant variables? (to 3 decimal places)__ <br>
__d) Briefly explain what these findings mean.__

<br>

__Question 29: A linear regression model can tell us about correlations. Explain the difference between correlations and causation, and whether we can infer causation from the results of this model _(2 marks)___

<br>

## Data Joining

You'll now proceed to some spatial modelling, so first load in the `LAU2_Dec_2014_FCB_in_England_and_Wales` shapefile, and call it `lau2`.

Remember, the LAU2 ID is called `lau214cd`.

Now join the data to the `pollution_subset` data frame, based on the `lau214cd`. Call the new dataset `pollution_lau2`.
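
A key-based join can be sketched with `merge`. The example below uses two plain pandas tables with invented codes and values. The assumption is that the same call applies when the left-hand table is a GeoDataFrame, in which case the geometry column is carried through.

```python
import pandas as pd

# Toy tables sharing the key column "lau214cd" (codes and values are invented).
boundaries = pd.DataFrame({"lau214cd": ["E1", "E2", "W1"], "name": ["a", "b", "c"]})
values = pd.DataFrame({"lau214cd": ["E1", "W1", "X9"], "so2": [1.2, 0.7, 3.1]})

# Left join keeps every boundary; indicator=True adds a "_merge" column
# showing which rows found a match.
joined = boundaries.merge(values, on="lau214cd", how="left", indicator=True)
print(joined["_merge"].value_counts())
```

Checking the `_merge` column is a simple way to spot units that failed to match on the ID.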

<br>

## Spatial Analysis

Now you have your final dataset, you can begin to explore the spatial patterns, and whether a spatial regression model is needed.

__(total: 31 marks)__

<br>

__Question 30: Create a choropleth of the SO₂ in each LAU. Use an appropriate colour scheme, and number of quantiles. Remember to not let the legend cover the map. Explain your choice of colour scheme and what you think the choropleth shows _(4 marks)___

<br>

__Question 31:  To statistically explore the spatial patterns, we begin by calculating weights. Calculate K-nearest neighbours based on the LAU column using 20 neighbours. Using the KNN weights: What is the Moran's I value for the SO₂ (to 3 d.p.)? Is it statistically significant? Explain what this means. (_5 marks_)__
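
To make the calculation concrete, here is a from-scratch NumPy sketch of global Moran's I with row-standardised k-nearest-neighbour weights and a permutation-based pseudo p-value. It runs on a synthetic grid with an invented west-to-east trend; the helper names `knn_weights` and `morans_i` are hypothetical. For the assessment itself, the PySAL packages installed earlier provide tested implementations.

```python
import numpy as np

def knn_weights(coords, k):
    """Row-standardised k-nearest-neighbour weights matrix (n x n)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    w = np.zeros_like(dist)
    rows = np.repeat(np.arange(len(coords)), k)
    w[rows, nearest.ravel()] = 1.0 / k  # each row sums to 1
    return w

def morans_i(values, w):
    """Global Moran's I: (n / S0) * (z' W z) / (z' z)."""
    z = values - values.mean()
    return (len(values) / w.sum()) * (z @ w @ z) / (z @ z)

# Synthetic 15 x 15 grid whose values increase from west to east, plus noise.
rng = np.random.default_rng(0)
xs, ys = np.meshgrid(np.arange(15), np.arange(15))
coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
values = coords[:, 0] + rng.normal(0, 1, size=len(coords))

w = knn_weights(coords, k=8)
observed = morans_i(values, w)

# Pseudo p-value: how often do shuffled values give an I at least this large?
permuted = np.array([morans_i(rng.permutation(values), w) for _ in range(199)])
p_value = (np.sum(permuted >= observed) + 1) / (199 + 1)
print(round(observed, 3), p_value)
```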

<br>
    
__Question 32: Create a Moran scatterplot and explain what you observe (_2 marks_)__

<br>

__Question 33: Create a LISA cluster map of the average SO₂ emissions, and comment on what you observe, including what is meant by the different cluster types (_4 marks_)__

<br>
As there are spatial patterns in this data, this suggests there may be underlying spatial processes, which we can attempt to capture with a GWR model. Usually, you would need to calculate the bandwidth yourself. For simplicity, the bandwidth is specified for you here.
<br>
<br>

__Question 34: Now extend the earlier linear model to a Geographically Weighted Regression, using a bandwidth of 4200. Report:__ <br>

__a) The local correlation coefficient for NOx/NO2 mean (GWR mean) _(2 marks)___ <br>

__b) Your findings of whether/how this model has accounted for underlying spatial patterns in the data _(3 marks)___

As we have polygons, we need to extract the centroid of each one for the GWR. We've done this for you below.
You can then set these as your coordinates, and continue with your GWR model.

In [None]:
pollution_lau2['geometry'] = pollution_lau2.geometry.centroid #calculate the centroid
pollution_lau2['x'] = pollution_lau2.geometry.x #extract x
pollution_lau2['y'] = pollution_lau2.geometry.y #extract y

__Question 35: Now plot how the regression coefficients for the NOx/NO2 ratio and 'Built-up Areas and Gardens' vary across space, and discuss your findings in relation to how these variables are associated with SO2 emissions. (_6 marks_)__

<br>

__Question 36: Explain how Geographically Weighted Regression can be used to capture spatial processes in the data, whether there are further checks we could do, and what these results can tell us about the relationship between land use and pollutants across space _(5 marks)___

## End of Assessment 1
Well done on completing Assessment 1! See below for instructions on generating your PDF.

<br>

## ❗️Submission Instructions
The deadline for this assignment is 2pm on Friday 6th December, via the Minerva submission area.
You __must__ ensure you have __run your notebook in order__ prior to submission (to demonstrate that the code works), and saved your work as a PDF file.

### Creating a PDF:
On Colab, go to File > Print > Save as PDF.

Sometimes, on Chrome or Edge, this can cut off the bottom of your notebook- but this doesn't seem to be an issue on Safari or Firefox, so we recommend you try with one of these browsers first. If you're still having issues, follow the instructions below to export your notebook as a PDF via Google Drive.


__Step 1:__ Make sure you have a Google account, and save your notebook to Drive.

__Step 2:__ Run the following two cells, which connect your notebook to the Drive file path, and set up the export.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!sudo apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic pandoc

__Step 3:__ Copy the filepath to your notebook, either by using the folder icon on the left of Colab, or via your browser (making sure there are no spaces in the folder names). Then you can update the filepath below, and run the cell. It will save the PDF in the same folder on Drive.

In [None]:
!jupyter nbconvert --to pdf /content/drive/MyDrive/Colab_Notebooks/Assessment1.ipynb

And that's it!

## References
<a name="references"></a>

Chang, C.J., Li, G., Zhang, S.Q. and Yu, K.P., 2019. Employing a fuzzy-based grey modeling procedure to forecast China's sulfur dioxide emissions. *International Journal of Environmental Research and Public Health*, 16(14), p.2504. Available from: https://doi.org/10.3390/ijerph16142504

Liu, X., Fang, W., Li, H., Han, X. and Xiao, H., 2021. Is Urbanization Good for the Health of Middle-Aged and Elderly People in China?—Based on CHARLS Data. *Sustainability*, 13(9), p.4996. Available from: https://doi.org/10.3390/su13094996

Oxley, T., Dore, A.J., ApSimon, H., Hall, J. and Kryza, M., 2013. Modelling future impacts of air pollution using the multi-scale UK Integrated Assessment Model (UKIAM). *Environment International*, 61, pp.17-35. Available from: https://doi.org/10.1016/j.envint.2013.09.009

Yeung, D.W., Zhang, Y., Bai, H. and Islam, S., 2021. Collaborative environmental management for transboundary air pollution problems: A differential levies game. *Journal of Industrial & Management Optimization*, 17(2). Available from: https://doi.org/10.3934/jimo.2019121

Yorifuji, T., Kashima, S., Suryadhi, M.A.H. and Abudureyimu, K., 2019. Acute exposure to sulfur dioxide and mortality: historical data from Yokkaichi, Japan. *Archives of Environmental & Occupational Health*, 74(5), pp.271-278. Available from: https://doi.org/10.1080/19338244.2018.1434474

<br>


↑  [Click here to go to the top of the notebook](#top) ↑