
# **Air Quality Data**

This notebook documents my work on air quality data using Python. We use air quality measurements, specifically nitrogen dioxide (NO2) levels and particulate matter (PM10) data from the Chatham Roadside and Edinburgh monitoring stations, and use Python to retrieve, wrangle, clean, sort, filter, analyse and visualise the data in order to draw conclusions about air quality in these two places. The analyses in this notebook look at nitrogen dioxide and PM10 levels in these two places and can be repeated for other pollutants and/or places.

The first part of the notebook documents my work on nitrogen dioxide measurements from Chatham Roadside, Kent, and from the Edinburgh Haymarket area and St. Leonard's Street. The second part documents my work on PM10 from Chatham Roadside and Edinburgh St. Leonard's Street.

## What is air quality and what data is available on it?
**Source:** Department for Environment, Food and Rural Affairs (Defra), UK: https://uk-air.defra.gov.uk/air-pollution/

Air pollution can cause both short-term and long-term effects on health, and many people are concerned about pollution in the air that they breathe.

These people may include:

* People with heart or lung conditions, or other breathing problems, whose health may be affected by air pollution.
* Parents, carers and healthcare professionals who look after someone whose health is sensitive to pollution.
* People who want to know more about air pollution, its causes, and what they can do to help reduce it.
* The scientific community and students, who may need data on air pollution levels, either now or in the past, throughout the UK.

Free, detailed, clear and easy-to-use information on air pollution in the UK is available for all these purposes on Defra's air pollution website (link above).

## 1. Nitrogen Dioxide: 
**Source:** https://www.gov.uk/government/statistics/emissions-of-air-pollutants/emissions-of-air-pollutants-in-the-uk-nitrogen-oxides-nox 

In this section of the notebook, we look at measured nitrogen dioxide levels in the air. In the UK, nitrogen dioxide comes mostly from fuel combustion and is harmful to health.

Short-term exposure to concentrations of NO2 can cause inflammation of the airways and increase susceptibility to respiratory infections and to allergens. NO2 can exacerbate the symptoms of those already suffering from lung or heart conditions. In addition, NOx can cause changes to the environment. Deposition of Nitrogen to the environment both directly as a gas (dry deposition) and in precipitation (wet deposition) can change soil chemistry and affect biodiversity in sensitive habitats.

Nitrogen oxides are also precursors for the formation of ozone. Ozone is a gas which is also damaging to human health and can trigger inflammation of the respiratory tract, eyes, nose and throat as well as asthma attacks. Moreover, ozone can have adverse effects on the environment through oxidative damage to vegetation including crops.

**Data:**

There are over 1500 sites across the UK that monitor air quality. They are organised into networks, each gathering a particular kind of information using a particular method. There are two major types: automatic and non-automatic networks. The Monitoring Networks section of the Defra website provides further network information. All measurements from these monitoring stations can be downloaded using the Data Selector Tool on the Defra website.

https://uk-air.defra.gov.uk/data/

I have chosen to work with measured data from the Chatham Roadside and Edinburgh St. Leonard's Street stations in this notebook. The functions and code in this notebook can be used to repeat the same analyses for data from any other monitoring station in the UK.

The following data files contain measured values of nitrogen dioxide in the air, collected at a roadside monitoring station in Chatham, Kent, and at St. Leonard's Street, Edinburgh. The data from these two air quality monitoring stations was obtained from the Defra website (https://uk-air.defra.gov.uk/data/) and uploaded to my GitHub. The files can be found at:

https://raw.githubusercontent.com/JaySanthanam/Programming-for-data/main/Datasets/NO2_Kent.csv

https://raw.githubusercontent.com/JaySanthanam/Programming-for-data/main/Datasets/NO2_Edin.csv

respectively.

The datasets obtained from Defra contain:
* a heading line (with the station name), which is skipped when loading the data; a separate column for each station is created later, when the dataframes are wrangled into a new dataset.
* dates, which are given as text and need to be converted to dates.
* times, which are not all in the same format and also need to be converted, along with the Date column.
* nitrogen dioxide levels, which are also text and sometimes contain "No data". This column needs to be converted to a numeric column with null values instead of "No data".
* a Status column, which is always the same and shows the unit of measurement for the nitrogen dioxide levels.
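The cleaning function further down converts the Date column but leaves Time as text. As a minimal sketch of how the two could be combined into a single timestamp, the example below uses a small made-up sample in the layout described above. The station name, the sample rows, the `Timestamp` column name, and the reading of an hour written as `24:00:00` as midnight at the end of that day are all assumptions for illustration, not details confirmed by the source files.

```python
import io

import pandas as pd

# Made-up sample imitating the described layout: a station heading line,
# then Date / Time / value / Status columns. All values are illustrative.
raw = io.StringIO(
    "Hypothetical Station\n"
    "Date,Time,Nitrogen dioxide,Status\n"
    "01/03/2017,01:00:00,No data,V ug/m3\n"
    "01/03/2017,02:00:00,12.5,V ug/m3\n"
    "01/03/2017,24:00:00,8.25,V ug/m3\n"
)
sample = pd.read_csv(raw, skiprows=1)  # skip the station heading line

# Turn the "No data" placeholder (or any other non-numeric text) into NaN.
sample["Nitrogen dioxide"] = pd.to_numeric(sample["Nitrogen dioxide"], errors="coerce")

# Assumption: 24:00:00 means midnight at the end of that day. pandas cannot
# parse hour 24, so swap it for 00:00:00 and add one day afterwards.
is_midnight = sample["Time"].str.startswith("24:")
clock = sample["Time"].where(~is_midnight, "00:00:00")
sample["Timestamp"] = pd.to_datetime(
    sample["Date"] + " " + clock, format="%d/%m/%Y %H:%M:%S"
) + pd.to_timedelta(is_midnight.astype(int), unit="D")

print(sample[["Timestamp", "Nitrogen dioxide"]])
```

A single timestamp column makes later sorting and time-based grouping simpler, but whether the real files need the `24:00:00` handling should be checked against the data first.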

### Read, clean, sort and wrangle the data and write it to a pandas dataframe

* Read the dataset into a dataframe, skipping the first row
* Convert dates to date format
* Replace 'No data' in the Nitrogen dioxide column with null values (rows with null values are dropped in a later cell)
* Convert the Nitrogen dioxide values to float type
* Sort by Nitrogen dioxide level (optional; this step is commented out in the code)
* Create a new column for 'Weekdays' (use df['Date'].dt.weekday)
* Rename the Nitrogen dioxide column to NO2 Level (V ug/m2)
* Remove the Status column


In [14]:
import pandas as pd
import numpy as np
from datetime import datetime, timezone

def get_csv_data(url):
  # Skip the first row, which holds the station name rather than the column headers
  data = pd.read_csv(url, skiprows=1)
  return data

def data_clean_wrangle(project_data):
  # Dates arrive as day/month/year text
  project_data['Date'] = pd.to_datetime(project_data['Date'], format="%d/%m/%Y")
  # Replace the 'No data' placeholder with NaN so the column can become numeric
  project_data["Nitrogen dioxide"] = project_data["Nitrogen dioxide"].replace('No data', np.nan)
  # project_data = project_data.dropna(subset=["Nitrogen dioxide"])
  project_data["Nitrogen dioxide"] = pd.to_numeric(project_data["Nitrogen dioxide"], downcast="float")
  # project_data = project_data.sort_values(by=['Nitrogen dioxide'])
  # Day of the week as an integer: Monday=0 ... Sunday=6
  project_data['Weekdays'] = project_data['Date'].dt.weekday
  project_data = project_data.rename(columns={"Nitrogen dioxide": "NO2 Level (V ug/m2)"})
  # Status only repeats the unit of measurement, so it is dropped
  project_data = project_data.drop(columns=['Status'])
  return project_data

url =  "https://raw.githubusercontent.com/JaySanthanam/Programming-for-data/main/Datasets/NO2_Edin.csv"
data = get_csv_data(url)
Edin_data = data_clean_wrangle(data)

Did the data cleaning and wrangling work?

In [15]:
print(Edin_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35088 entries, 0 to 35087
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date                 35088 non-null  datetime64[ns]
 1   Time                 35088 non-null  object        
 2   NO2 Level (V ug/m2)  33014 non-null  float32       
 3   Weekdays             35088 non-null  int64         
dtypes: datetime64[ns](1), float32(1), int64(1), object(1)
memory usage: 959.6+ KB
None


Let us take a look at the dataset for Edinburgh.

In [16]:
print(Edin_data.head())

        Date      Time  NO2 Level (V ug/m2)  Weekdays
0 2017-03-01  01:00:00                  NaN         2
1 2017-03-01  02:00:00                  NaN         2
2 2017-03-01  03:00:00                  NaN         2
3 2017-03-01  04:00:00                  NaN         2
4 2017-03-01  05:00:00                  NaN         2
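
The first Edinburgh rows are all NaN, so it helps to quantify the gaps before going further. The `info()` output above reports 33014 non-null readings out of 35088 rows, which leaves 2074 missing. A minimal sketch of the same check in code, using a toy frame with made-up values as a stand-in for `Edin_data`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for Edin_data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2017-03-01"] * 3 + ["2017-03-02"] * 2),
        "NO2 Level (V ug/m2)": [np.nan, np.nan, 14.2, 9.8, np.nan],
    }
)

level = toy["NO2 Level (V ug/m2)"]
n_missing = int(level.isna().sum())      # number of missing readings
share_missing = level.isna().mean()      # fraction of rows that are missing
first_valid = level.first_valid_index()  # index label of the first real reading

print(f"{n_missing} missing ({share_missing:.1%}); first reading at row {first_valid}")
```

Running the same three lines on `Edin_data` would show where the real measurements start.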


Now let's repeat the same for Chatham data.

In [17]:
url =  "https://raw.githubusercontent.com/JaySanthanam/Programming-for-data/main/Datasets/NO2_Kent.csv"
data = get_csv_data(url)
chatham_data = data_clean_wrangle(data)

Let's look at the data from Chatham.

In [18]:
print(chatham_data.shape)
print(chatham_data.head())
print(chatham_data.info())

(35088, 4)
        Date      Time  NO2 Level (V ug/m2)  Weekdays
0 2017-03-01  01:00:00              4.41596         2
1 2017-03-01  02:00:00              2.82604         2
2 2017-03-01  03:00:00              3.31484         2
3 2017-03-01  04:00:00              3.31149         2
4 2017-03-01  05:00:00              5.53478         2
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35088 entries, 0 to 35087
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date                 35088 non-null  datetime64[ns]
 1   Time                 35088 non-null  object        
 2   NO2 Level (V ug/m2)  34588 non-null  float32       
 3   Weekdays             35088 non-null  int64         
dtypes: datetime64[ns](1), float32(1), int64(1), object(1)
memory usage: 959.6+ KB
None


In [23]:
Edin_data.rename(columns = {'NO2 Level (V ug/m2)':'Edin_NO2_Level'}, inplace = True)
chatham_data.rename(columns = {'NO2 Level (V ug/m2)':'Chatham_NO2_Level'}, inplace = True)

In [26]:
Nitrogen_data = Edin_data.copy()
Nitrogen_data['Chatham_NO2_Level'] = chatham_data['Chatham_NO2_Level']
Nitrogen_data

            Date      Time  Edin_NO2_Level  Weekdays  Chatham_NO2_Level
0     2017-03-01  01:00:00             NaN         2           4.415960
1     2017-03-01  02:00:00             NaN         2           2.826040
2     2017-03-01  03:00:00             NaN         2           3.314840
3     2017-03-01  04:00:00             NaN         2           3.311490
4     2017-03-01  05:00:00             NaN         2           5.534780
...          ...       ...             ...       ...                ...
35083 2021-03-01  20:00:00             NaN         0          22.479891
35084 2021-03-01  21:00:00             NaN         0          18.130960
35085 2021-03-01  22:00:00             NaN         0          25.372990
35086 2021-03-01  23:00:00             NaN         0          24.558630
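
Copying the Chatham column across matches rows by index, which works when both dataframes hold the same hours in the same order (both have 35088 rows here). As a hedged alternative, the sketch below joins on Date and Time instead, so a missing or reordered row in one file cannot shift values against the wrong hour. The two small frames and their values are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for Edin_data and chatham_data (made-up values).
edin = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2017-03-01", "2017-03-01", "2017-03-01"]),
        "Time": ["01:00:00", "02:00:00", "03:00:00"],
        "Edin_NO2_Level": [10.0, 12.0, 11.0],
    }
)
# This frame lacks the 02:00:00 row and is stored in a different order.
chatham = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2017-03-01", "2017-03-01"]),
        "Time": ["03:00:00", "01:00:00"],
        "Chatham_NO2_Level": [3.3, 4.4],
    }
)

# Match rows on the Date/Time key; validate raises if a key is duplicated.
combined = edin.merge(chatham, on=["Date", "Time"], how="left", validate="one_to_one")
print(combined)
```

With `how="left"` every Edinburgh row is kept and hours without a Chatham reading get NaN, which the later `dropna` step would then remove.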


In [31]:
Nitrogen_data = Nitrogen_data[['Date', 'Time', "Edin_NO2_Level", 'Chatham_NO2_Level', 'Weekdays']]

In [32]:
Nitrogen_data = Nitrogen_data.dropna(subset = ["Edin_NO2_Level", 'Chatham_NO2_Level'])

In [33]:
Nitrogen_data.describe()

       Edin_NO2_Level  Chatham_NO2_Level   Weekdays
count         32565.0            32565.0    32565.0
mean        17.858732          22.502911   3.024628
std         14.722378          14.758221   2.002718
min           0.10384            0.05463        0.0
25%           7.38824           11.14081        1.0
50%          13.45774          19.195709        3.0
75%         23.722401           30.88143        5.0
max        108.297943          113.06189        6.0


### Expand the dataset and show summary statistics for larger dataset
---

There is a second data set here covering the year 2021:  https://drive.google.com/uc?id=1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ  

Concatenate the two datasets to expand it to 2020 and 2021.  

Before you can concatenate the datasets you will need to clean and wrangle the second dataset in the same way as the first.  Use the code cell below.  Give the second dataset a different name. 

After the datasets have been concatenated, group the data by Weekdays and show summary statistics by day of the week.

In [None]:
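url =  "https://drive.google.com/uc?id=1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ"
# Load and clean the second dataset in the same way as the first
project_data2 = get_csv_data(url)
project_data2_sorted = data_clean_wrangle(project_data2)
# NOTE: project_data_sorted is not defined in the cells above. It is assumed to be the
# first dataset after data_clean_wrangle, with its NO2 column not yet renamed.
project_data_combined = pd.concat([project_data_sorted,project_data2_sorted], join='inner', ignore_index=True)
project_data_combined = project_data_combined.sort_values(by=['Weekdays'])
# Summary statistics of NO2 levels for each day of the week (0 = Monday ... 6 = Sunday)
print(project_data_combined.groupby(by=["Weekdays"])[["NO2 Level (V ug/m2)"]].describe())
#print(project_data_combined.shape)
#print(project_data_combined.head())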
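
The concatenate-then-group step in this section can be sketched on toy data. The helper `make_frame`, the dates and the readings are hypothetical and only mimic the shape of a cleaned dataset.

```python
import pandas as pd

def make_frame(dates, levels):
    """Build a small cleaned-style frame (hypothetical helper for this sketch)."""
    frame = pd.DataFrame({"Date": pd.to_datetime(dates), "NO2 Level (V ug/m2)": levels})
    frame["Weekdays"] = frame["Date"].dt.weekday  # Monday=0 ... Sunday=6
    return frame

first = make_frame(["2020-12-28", "2020-12-29"], [20.0, 30.0])   # Monday, Tuesday
second = make_frame(["2021-01-04", "2021-01-05"], [10.0, 50.0])  # Monday, Tuesday

# Stack the two frames, then summarise NO2 by day of the week.
combined = pd.concat([first, second], ignore_index=True)
by_day = combined.groupby("Weekdays")["NO2 Level (V ug/m2)"].agg(["count", "mean"])
print(by_day)
```

`groupby` sorts the weekday keys itself, so sorting the combined frame beforehand is not needed for the summary.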

### Helpful references
---
Skipping rows when reading datasets:  
https://www.geeksforgeeks.org/how-to-skip-rows-while-reading-csv-file-using-pandas/  

Converting strings to dates:  
https://www.geeksforgeeks.org/convert-the-column-type-from-string-to-datetime-format-in-pandas-dataframe/

Dropping rows where data has a given value:  
https://www.datasciencemadesimple.com/drop-delete-rows-conditions-python-pandas/  
(see section Drop a row or observation by condition) 

Convert a column of strings to a column of floats:
https://datatofish.com/convert-string-to-float-dataframe/  

Create a new column from data converted in an existing column:  
https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/  

Rename a column:  
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html  

Remove a column by name:  
https://www.kite.com/python/answers/how-to-delete-columns-from-a-pandas-%60dataframe%60-by-column-name-in-python
