
# **Air Quality Data Analyses**

This notebook documents my work on air quality data analyses using Python. We use air quality measurements, specifically nitrogen dioxide (NO2) and particulate matter (PM10) levels, from the Chatham Roadside and Edinburgh monitoring stations. Python is used to retrieve, wrangle, clean, sort and filter the data, and then to analyse and visualise it in order to draw conclusions about air quality in these two places. The analyses in this notebook look at nitrogen dioxide and PM10 levels, and can be repeated for any other pollutants and/or places.

The first part of the notebook documents my work on nitrogen dioxide measurements from Chatham Roadside (Kent) and from the Edinburgh Haymarket area and St. Leonard's Street.

The second part of the notebook documents my work on PM10 measurements from Chatham Roadside and Edinburgh St. Leonard's Street.

## What is air quality and what data is available on it?
#### Source: Department for Environment, Food and Rural Affairs (Defra), UK: https://uk-air.defra.gov.uk/air-pollution/

Air pollution can cause both short-term and long-term effects on health, and many people are concerned about pollution in the air that they breathe.

These people may include:

* People with heart or lung conditions, or other breathing problems, whose health may be affected by air pollution.
* Parents, carers and healthcare professionals who look after someone whose health is sensitive to pollution.
* People who want to know more about air pollution, its causes, and what they can do to help reduce it.
* The scientific community and students, who may need data on air pollution levels, either now or in the past, throughout the UK.

Free, detailed, clear and easy-to-use information on air pollution in the UK is available for all these purposes on Defra's UK air pollution website (link above).

## 1. Chatham data 

In this section of the notebook, we look at air quality data, that is, the measured nitrogen dioxide levels in the air.

The following data file contains data collected at a roadside monitoring station.  You can see the data in a spreadsheet here: https://docs.google.com/spreadsheets/d/1XpAvrpuyMsKDO76EZ3kxuddBOu7cZX1Od4uEts14zco/edit?usp=sharing

The data contains:
* a heading line (Chatham Roadside) which needs to be skipped
* dates which are sometimes left- and sometimes right-justified indicating that they are not formatted as dates, rather they are text (so need to be converted to dates)
* times which are not all in the same format
* Nitrogen Dioxide levels which are, again, text and sometimes contain nodata
* Status which is always the same
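
The issues listed above can be checked quickly before any cleaning is attempted. The following is only a minimal sketch on a made-up in-memory sample that imitates the described layout (a title line, text dates, mixed time formats and `nodata` markers); the sample values and variable names are assumptions, not rows from the real file.

```python
import io
import pandas as pd

# Hypothetical sample imitating the file layout described above;
# the values are invented for illustration only.
sample = io.StringIO(
    "Chatham Roadside\n"
    "Date,Time,Nitrogen dioxide,Status\n"
    "01/01/2020,1:00,42.9,V ugm-3\n"
    "01/01/2020,2:00:00,nodata,V ugm-3\n"
    "02/01/2020,3:00,38.1,V ugm-3\n"
)

# skiprows=1 skips the title line, so the next line is used as the header
df = pd.read_csv(sample, skiprows=1)

# How many readings carry the 'nodata' marker?
n_missing = (df["Nitrogen dioxide"] == "nodata").sum()

# A column that is "always the same" has exactly one distinct value
status_is_constant = df["Status"].nunique() == 1

# Dates read from this kind of file are still text, not datetimes
date_is_text = not pd.api.types.is_datetime64_any_dtype(df["Date"])

print(n_missing, status_is_constant, date_is_text)
```

Checks like these make it easier to decide which cleaning steps a particular file actually needs.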





### Project - clean, sort and wrangle the data

Read the dataset into a dataframe, skipping the first row   
Convert dates to date format  
Remove rows with nodata in the Nitrogen dioxide column  
Convert the Nitrogen dioxide levels values to float type  
Sort by Nitrogen dioxide level  
Create a new column for 'Weekdays' (use df['Date'].dt.weekday)  
Rename the column Nitrogen dioxide level to NO2 Level (V ug/m2)  
Remove the Status column  
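
For the `Weekdays` step, it helps to know that `Series.dt.weekday` numbers the days from 0 (Monday) to 6 (Sunday). The sketch below uses three made-up dates to show the numbering, together with `Series.dt.day_name()` as an optional, more readable alternative (the variable names are illustrative only).

```python
import pandas as pd

# Three illustrative dates in the same dd/mm/yyyy text format as the dataset
dates = pd.to_datetime(
    pd.Series(["06/01/2020", "07/01/2020", "12/01/2020"]),
    format="%d/%m/%Y",
)

# Monday is 0 and Sunday is 6
weekday_numbers = dates.dt.weekday

# Day names can be easier to read in summary tables
weekday_names = dates.dt.day_name()

print(weekday_numbers.tolist())  # [0, 1, 6]
print(weekday_names.tolist())    # ['Monday', 'Tuesday', 'Sunday']
```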

The dataset can be viewed here: https://drive.google.com/file/d/1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ/view?usp=sharing and the data can be accessed here: https://drive.google.com/uc?id=1QSNJ3B1ku8kjXsA_tCBh4fbpDK7wVLAA (this is a .csv file).  

**NOTE:** Some useful references are included at the bottom of this notebook.

Use the code cell below to write your code.

In [15]:
import pandas as pd
import numpy as np
from datetime import datetime, timezone

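def data_clean_wrangle(project_data):
  # Dates arrive as text (dd/mm/yyyy); convert them to datetime
  project_data['Date'] = pd.to_datetime(project_data['Date'], format="%d/%m/%Y")
  # Mark 'nodata' readings as missing, then drop those rows
  project_data["Nitrogen dioxide"] = project_data["Nitrogen dioxide"].replace('nodata', np.nan)
  project_data = project_data.dropna(subset = ["Nitrogen dioxide"])
  # Convert the remaining readings from text to float
  # (on some pandas versions this assignment raises SettingWithCopyWarning)
  project_data["Nitrogen dioxide"] = pd.to_numeric(project_data["Nitrogen dioxide"], downcast="float")
  # Sort by NO2 level and add the weekday number (0 = Monday ... 6 = Sunday)
  project_data_sorted = project_data.sort_values(by=['Nitrogen dioxide'])
  project_data_sorted['Weekdays'] = project_data_sorted['Date'].dt.weekday
  # Rename the measurement column and drop the constant Status column
  project_data_sorted = project_data_sorted.rename(columns={"Nitrogen dioxide": "NO2 Level (V ug/m2)"})
  project_data_sorted = project_data_sorted.drop(columns=['Status'])
  return project_data_sorted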

def get_csv_data(url):
  data = pd.read_csv(url, skiprows=1)
  return data


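# NOTE: the file name indicates that this URL points to the Edinburgh NO2 dataset
url = "https://raw.githubusercontent.com/JaySanthanam/Programming-for-data/main/Datasets/NO2_Edin.csv"
data = get_csv_data(url)
# The cleaning step is currently disabled; the raw data is inspected below
#project_data = data_clean_wrangle(data)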
print(data.shape)
print(data.head())
print(data.info())


(35092, 4)
         Date   Time Nitrogen dioxide   Status
0  01/03/2017  01:00          No data  V ugm-3
1  01/03/2017  02:00          No data  V ugm-3
2  01/03/2017  03:00          No data  V ugm-3
3  01/03/2017  04:00          No data  V ugm-3
4  01/03/2017  05:00          No data  V ugm-3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35092 entries, 0 to 35091
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Date              35089 non-null  object
 1   Time              35088 non-null  object
 2   Nitrogen dioxide  35088 non-null  object
 3   Status            35088 non-null  object
dtypes: object(4)
memory usage: 1.1+ MB
None


In [16]:
data['Date']

0        01/03/2017
1        01/03/2017
2        01/03/2017
3        01/03/2017
4        01/03/2017
            ...    
35087         44256
35088           NaN
35089           NaN
35090           NaN
35091           End
Name: Date, Length: 35092, dtype: object
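
The output above shows that the last rows of the `Date` column are not valid dd/mm/yyyy strings (`44256`, `NaN`, `End`), and the earlier preview shows `No data` rather than `nodata` as the missing-value marker. One possible way to handle both, sketched here on made-up rows shaped like that output, is to coerce unparseable values to missing and then drop them. This sketch simply discards the `44256` row, which is an assumption; that value may be a spreadsheet serial date that could instead be recovered.

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like the output above: valid text dates, a stray
# numeric string, a blank and an 'End' marker.
raw = pd.DataFrame({
    "Date": ["01/03/2017", "02/03/2017", "44256", np.nan, "End"],
    "Nitrogen dioxide": ["No data", "23.5", "18.2", np.nan, np.nan],
})

# errors="coerce" turns anything that does not match the format into NaT
raw["Date"] = pd.to_datetime(raw["Date"], format="%d/%m/%Y", errors="coerce")

# ... and anything non-numeric (including 'No data') into NaN
raw["Nitrogen dioxide"] = pd.to_numeric(raw["Nitrogen dioxide"], errors="coerce")

# Keep only rows where both the date and the reading are usable
clean = raw.dropna(subset=["Date", "Nitrogen dioxide"]).reset_index(drop=True)

print(clean)
```

Coercing first and dropping afterwards avoids having to list every possible marker (`nodata`, `No data`, `End`) explicitly.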

### Expand the dataset and show summary statistics for larger dataset
---

There is a second dataset here covering the year 2021:  https://drive.google.com/uc?id=1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ  

Concatenate the two datasets so that the combined data covers both 2020 and 2021.  

Before you can concatenate the datasets you will need to clean and wrangle the second dataset in the same way as the first.  Use the code cell below.  Give the second dataset a different name. 

After the datasets have been concatenated, group the data by Weekdays and show summary statistics by day of the week.
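
The concatenate-then-group idea can be sketched on toy data first. The frames and the short column name `NO2` below are hypothetical stand-ins for the two cleaned datasets, not the real data.

```python
import pandas as pd

# Two toy "years" with the same columns (hypothetical values)
year_a = pd.DataFrame({"Weekdays": [0, 1, 0], "NO2": [10.0, 20.0, 30.0]})
year_b = pd.DataFrame({"Weekdays": [1, 0, 1], "NO2": [40.0, 50.0, 60.0]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating indices
combined = pd.concat([year_a, year_b], ignore_index=True)

# One row of statistics per weekday
summary = combined.groupby("Weekdays")["NO2"].agg(["count", "mean", "max"])

print(summary)
```

`describe()` can be used in place of `agg([...])` when the full set of summary statistics is wanted.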

In [None]:
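# Load and clean the first dataset (URL taken from the project description above)
url1 = "https://drive.google.com/uc?id=1QSNJ3B1ku8kjXsA_tCBh4fbpDK7wVLAA"
project_data_sorted = data_clean_wrangle(pd.read_csv(url1, skiprows=1))
# Load and clean the second (2021) dataset in the same way
url = "https://drive.google.com/uc?id=1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ"
project_data2 = pd.read_csv(url, skiprows=1)
project_data2_sorted = data_clean_wrangle(project_data2)
# Stack the two cleaned datasets, keeping only the columns they share
project_data_combined = pd.concat([project_data_sorted,project_data2_sorted], join='inner', ignore_index=True)
project_data_combined = project_data_combined.sort_values(by=['Weekdays'])
# Summary statistics of NO2 per weekday (0 = Monday ... 6 = Sunday)
print(project_data_combined.groupby(by=["Weekdays"])[["NO2 Level (V ug/m2)"]].describe())
#print(project_data_combined.shape)
#print(project_data_combined.head())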

         NO2 Level (V ug/m2)             ...                      
                       count       mean  ...        75%        max
Weekdays                                 ...                      
0                     2443.0  14.373540  ...  19.659705  82.596092
1                     2427.0  15.659355  ...  21.724490  80.278442
2                     2483.0  17.288498  ...  24.675005  73.409401
3                     2504.0  15.301346  ...  20.472977  72.000839
4                     2515.0  15.776768  ...  22.257344  71.219772
5                     2495.0  12.568014  ...  17.242315  62.471668
6                     2485.0  10.272859  ...  14.229000  53.096161

[7 rows x 8 columns]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """

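The message above is the tail of pandas' `SettingWithCopyWarning`. It typically appears when a column is assigned on a DataFrame that was itself derived from another one; here the most likely trigger is the assignment that follows `dropna` inside `data_clean_wrangle`, although that is an inference from the code rather than something the output states. One common way to avoid it, shown as a sketch on made-up data, is to take an explicit copy before assigning:

```python
import numpy as np
import pandas as pd

# Made-up readings stored as text, with one missing value
df = pd.DataFrame({"Nitrogen dioxide": ["12.5", np.nan, "30.1"]})

# .copy() makes the filtered frame independent of the original,
# so the column assignment below is unambiguous
filtered = df.dropna(subset=["Nitrogen dioxide"]).copy()
filtered["Nitrogen dioxide"] = pd.to_numeric(filtered["Nitrogen dioxide"])

print(filtered["Nitrogen dioxide"].tolist())  # [12.5, 30.1]
```

The original frame `df` is left untouched, which also makes it easier to re-run a cleaning cell without side effects.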

### Helpful references
---
Skipping rows when reading datasets:  
https://www.geeksforgeeks.org/how-to-skip-rows-while-reading-csv-file-using-pandas/  

Converting strings to dates:  
https://www.geeksforgeeks.org/convert-the-column-type-from-string-to-datetime-format-in-pandas-dataframe/

Dropping rows where data has a given value:  
https://www.datasciencemadesimple.com/drop-delete-rows-conditions-python-pandas/  
(see section Drop a row or observation by condition) 

Convert a column of strings to a column of floats:
https://datatofish.com/convert-string-to-float-dataframe/  

Create a new column from data converted in an existing column:  
https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/  

Rename a column:  
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html  

Remove a column by name:  
https://www.kite.com/python/answers/how-to-delete-columns-from-a-pandas-%60dataframe%60-by-column-name-in-python
