# Air Quality: Design (wrap up) and Implement your Product

Welcome to the final lab of this project. You'll again work with the dataset you're now familiar with, from [RMCAB](http://201.245.192.252:81/home/map), the air quality monitoring network in Bogotá. In this notebook, you will complete the following steps:

1. Import Python packages
2. Load the dataset with missing values filled in (output from the last lab)
3. Use the nearest neighbor method to make a map of PM2.5 in Bogotá
4. Test different values of k for the nearest neighbor method
5. Use the best value of k to make a map of PM2.5 in Bogotá
6. Construct a map animation of PM2.5 in Bogotá
7. Display your map animation

## 1. Import Python Packages

Run the next cell to import the Python packages you'll need for this lab.

Note the `import utils` line. This line imports the functions that were specifically written for this lab. If you want to look at what these functions are, go to `File -> Open...` and open the `utils.py` file to have a look.

In [1]:
# Import Python packages.
import folium # package for interactive maps
import folium.plugins as plugins # extras for map animations
import pandas as pd # package for reading in and manipulating data
from sklearn.neighbors import KNeighborsRegressor # package for doing KNN
from datetime import datetime # package for manipulating dates

import utils # utility functions defined for this lab

print("All packages imported successfully!")

All packages imported successfully!


## 2. Load the dataset with missing values filled in (output from the last lab)

Run the next cell to read in the dataset that was the final output from the last lab, namely, a dataset with all missing values for the pollutants filled in. 

In [2]:
# Load the dataset with missing values filled in.
full_dataset = pd.read_csv('data/full_data_with_imputed_values.csv')
full_dataset['DateTime'] = pd.to_datetime(full_dataset['DateTime'], dayfirst=True)

full_dataset.head(5)

| Unnamed: 0 | DateTime | Station | Latitude | Longitude | PM2.5 | PM10 | NO | NO2 | NOX | CO | OZONE | PM2.5_imputed_flag | PM10_imputed_flag | NO_imputed_flag | NO2_imputed_flag | NOX_imputed_flag | CO_imputed_flag | OZONE_imputed_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 00:00:00 | USM | 4.532097 | -74.116947 | 32.7 | 56.6 | 7.504 | 15.962 | 23.493 | 0.44924 | 2.431 | | | | | | | |
| 1 | 2021-01-01 01:00:00 | USM | 4.532097 | -74.116947 | 39.3 | 59.3 | 16.56 | 17.866 | 34.426 | 0.69832 | 1.121 | | | | | | | |
| 2 | 2021-01-01 02:00:00 | USM | 4.532097 | -74.116947 | 70.8 | 96.4 | 22.989 | 17.802 | 40.791 | 0.88243 | 1.172 | | | | | | | |
| 3 | 2021-01-01 03:00:00 | USM | 4.532097 | -74.116947 | 81.0 | 108.3 | 3.704 | 9.886 | 13.591 | 0.29549 | 6.565 | | | | | | | |
| 4 | 2021-01-01 04:00:00 | USM | 4.532097 | -74.116947 | 56.1 | 87.7 | 2.098 | 9.272 | 11.371 | 0.16621 | 9.513 | | | | | | | |


## 3. Use the nearest neighbor method to make a map of PM2.5 in Bogotá
Here you use the nearest neighbor method to estimate pollutant values at points between the stations, so you can draw a continuous map of pollution across the city.

In [3]:
# Define a value for k
k = 3
# Define the target pollutant
target = 'PM2.5'
# Define the number of grid points (higher value implies a finer grid)
n_points_grid = 64
# Create a KNN model that weights each neighbor by the inverse of its distance
neighbors_model = KNeighborsRegressor(n_neighbors=k, weights='distance', metric='sqeuclidean')
# Isolate a single time step from the dataset
time_step = datetime.fromisoformat('2021-04-05T08:00:00')
time_step_data = full_dataset[full_dataset['DateTime'] == time_step]
neighbors_model.fit(time_step_data[['Latitude', 'Longitude']], time_step_data[[target]])
# Generate a map of predictions for Bogotá
predictions_xy, dlat, dlon = utils.predict_on_bogota(neighbors_model, n_points_grid)
utils.create_heat_map(predictions_xy, time_step_data, dlat, dlon, target)
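
To make the role of `weights='distance'` more concrete, here is a minimal pure-Python sketch of a distance-weighted nearest neighbor estimate at a single point. It is not the code inside `utils` or scikit-learn: the function name `knn_distance_weighted` and the station coordinates and values are hypothetical. It assumes, as in the cell above, that latitude and longitude are treated as plane coordinates, and that each neighbor is weighted by the inverse of its squared Euclidean distance (which is what combining `weights='distance'` with `metric='sqeuclidean'` should amount to).

```python
def knn_distance_weighted(stations, query, k):
    """Estimate a value at `query` from the k nearest stations.

    stations: list of (latitude, longitude, value) tuples (hypothetical data)
    query:    (latitude, longitude) of the point to estimate
    """
    # Squared Euclidean distance from the query point to every station
    scored = [
        ((lat - query[0]) ** 2 + (lon - query[1]) ** 2, value)
        for lat, lon, value in stations
    ]
    # Keep only the k closest stations
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]

    # If the query sits exactly on a station, use that station's value
    exact = [value for sq_dist, value in nearest if sq_dist == 0.0]
    if exact:
        return sum(exact) / len(exact)

    # Otherwise take a weighted mean, with weight = 1 / squared distance
    weights = [1.0 / sq_dist for sq_dist, _ in nearest]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Three made-up stations: (latitude, longitude, PM2.5)
stations = [(0.0, 0.0, 10.0), (0.0, 2.0, 30.0), (5.0, 5.0, 100.0)]
estimate = knn_distance_weighted(stations, query=(0.0, 0.5), k=2)
print(f"Estimated value: {estimate:.2f}")
```

In this toy example the query point is three times closer to the first station than to the second, so the first station gets nine times the weight and the estimate lands much nearer to 10 than to 30.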

## 4. Test different values of k for the nearest neighbor method
Run the cells below to first calculate the mean absolute error (MAE) for k=1, in other words, the error associated with using only the single nearest neighbor (note that the map above used k=3). After that, you'll run the same calculation for different values of k.

The way you're doing this is similar to what you did in the previous lab, where you calculated the MAE for using nearby sensor station measurements to estimate the value of PM2.5 at any given sensor station location. Here you'll evaluate the method shown in the map above at each sensor station location as if that station's measurement were replaced first with the value from its nearest neighbor station, and then with a weighted average of its k nearest neighbor stations.


The mean absolute error calculated by the code below is:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \mathrm{actual}_i - \mathrm{model}_i \right|$$

where $n$ is the number of samples in the test dataset.
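
As a quick worked example of the formula, the sketch below computes the MAE by hand for three samples. The "actual" values are the first three PM2.5 readings shown in the table in section 2; the "model" values are made up purely for illustration, and the helper name `mean_absolute_error` is just a label chosen here.

```python
def mean_absolute_error(actual, model):
    """Average of the absolute differences between paired values."""
    if len(actual) != len(model):
        raise ValueError("actual and model must have the same length")
    return sum(abs(a - m) for a, m in zip(actual, model)) / len(actual)


actual = [32.7, 39.3, 70.8]   # measured PM2.5 values (from the table above)
model = [30.0, 45.0, 60.0]    # hypothetical model estimates

# |32.7 - 30.0| + |39.3 - 45.0| + |70.8 - 60.0| = 2.7 + 5.7 + 10.8 = 19.2
# 19.2 / 3 = 6.4
mae = mean_absolute_error(actual, model)
print(f"MAE = {mae:.2f}")
```

Because the errors are taken in absolute value, over-estimates and under-estimates do not cancel each other out.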

In [4]:
# Make an estimate of mean absolute error for k=1
utils.calculate_mae_for_k(full_dataset, k=1, target_pollutant=target)

7.991864932077892
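
The evaluation described above can be pictured as a leave-one-station-out loop: hide one station, estimate its value from the remaining stations, record the absolute error, and average over all stations. The sketch below shows that idea for a single time step in plain Python. It is an assumption about the general approach, not the actual implementation of `utils.calculate_mae_for_k`; the function names and the four stations are hypothetical.

```python
def predict_from_neighbors(train, query, k):
    """Inverse-squared-distance weighted mean of the k nearest stations."""
    ranked = sorted(
        ((lat - query[0]) ** 2 + (lon - query[1]) ** 2, value)
        for lat, lon, value in train
    )[:k]
    # Assumes no two stations share a location, so every distance is > 0
    weights = [1.0 / sq_dist for sq_dist, _ in ranked]
    return sum(w * v for w, (_, v) in zip(weights, ranked)) / sum(weights)


def leave_one_out_mae(stations, k):
    """Hide each station in turn and estimate it from the others."""
    errors = []
    for i, (lat, lon, actual) in enumerate(stations):
        others = stations[:i] + stations[i + 1:]
        estimate = predict_from_neighbors(others, (lat, lon), k)
        errors.append(abs(actual - estimate))
    return sum(errors) / len(errors)


# Made-up stations: (latitude, longitude, PM2.5)
stations = [
    (0.0, 0.0, 10.0),
    (0.0, 1.0, 20.0),
    (0.0, 3.0, 40.0),
    (0.0, 7.0, 80.0),
]
for k in (1, 2, 3):
    print(f"k = {k}, MAE = {leave_one_out_mae(stations, k):.3f}")
```

The toy data here is not meant to reproduce the trend in the real dataset; it only illustrates the mechanics of hiding a station and scoring the estimate.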

After testing k=1, run the following cell to test a range of values for k.

In [5]:
# Make an estimate of mean absolute error (MAE) for a range of k values.
kmin = 1
kmax = 7

for kneighbors in range(kmin, kmax+1):
    mae = utils.calculate_mae_for_k(full_dataset, k=kneighbors, target_pollutant=target)
    print(f'k = {kneighbors}, MAE = {mae}')

k = 1, MAE = 7.991864932077892
k = 2, MAE = 7.356887738049645
k = 3, MAE = 7.214031880131258
k = 4, MAE = 7.171022397383608
k = 5, MAE = 7.160089807079792
k = 6, MAE = 7.156771045725128
k = 7, MAE = 7.156438017933703
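
The MAE keeps decreasing all the way to k=7, but each extra neighbor helps less than the previous one. The short sketch below, using the values printed above, computes how much the MAE improves at each step. This is one way to judge where adding neighbors stops paying off; the notebook itself does not state a rule for picking k, so treat the interpretation as a suggestion.

```python
# MAE values copied from the output above
mae_by_k = {
    1: 7.991864932077892,
    2: 7.356887738049645,
    3: 7.214031880131258,
    4: 7.171022397383608,
    5: 7.160089807079792,
    6: 7.156771045725128,
    7: 7.156438017933703,
}

# Improvement in MAE when going from k-1 to k neighbors
gains = {}
for k in sorted(mae_by_k):
    if k - 1 in mae_by_k:
        gains[k] = mae_by_k[k - 1] - mae_by_k[k]

for k, gain in gains.items():
    percent = 100 * gain / mae_by_k[k - 1]
    print(f"k = {k}: MAE improves by {gain:.4f} ({percent:.2f}%)")
```

Going from k=1 to k=2 reduces the MAE by roughly 0.63, while going from k=6 to k=7 reduces it by less than 0.001, so a small value of k already captures most of the benefit.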


## 5. Use the best value of k to make a map of PM2.5 in Bogotá

Run the cell below to generate a map of PM2.5 values. The map shows the concentration of the chosen pollutant over the city at the selected `end_date`. Clicking a circle on the map (a station) opens a pop-up plot of the pollutant's concentration over the selected time range (from `start_date` to `end_date`). You can change the dates and times, or `k`, to see how the data differs at various times and how the result depends on `k`.

In [6]:
k = 3
start_date = datetime.fromisoformat('2021-08-02T08:00:00')
end_date = datetime.fromisoformat('2021-08-05T08:00:00')

utils.create_heat_map_with_date_range(full_dataset, start_date, end_date, k, target)


## 6. Construct a map animation of PM2.5 in Bogotá
Run the next cell to generate an animation of PM2.5 over a specific time range. You can change k to use a different number of neighbors and change the dates and times to look at a different time range. 

In [None]:
# Choose parameters for the animation
k = 3
n_points_grid = 64
# Filter a date range
start_date = datetime.fromisoformat('2021-08-04T08:00:00')
end_date = datetime.fromisoformat('2021-08-05T08:00:00')

# Create the features for the animation (these are the shapes that will appear on the map)
features = utils.create_animation_features(full_dataset, start_date, end_date, k, n_points_grid, target)
print('Features for the animation created successfully! Run the next cell to see the result!')

## 7. Display your map animation

Run the next cell to display the animation you created!

In [None]:
# Create the map animation using the folium library
map_animation = folium.Map(location=[4.7110, -74.0721], zoom_start=11)
# Add the features to the animation
plugins.TimestampedGeoJson(
    {"type": "FeatureCollection", "features": features},
    period="PT1H",
    duration="PT1H",
    add_last_point=True
).add_to(map_animation)

# Run the animation
map_animation
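
The `period="PT1H"` argument is an ISO 8601 duration meaning one hour, so the animation advances one hourly time step per frame. As a rough sketch, the code below lists the hourly time steps between the `start_date` and `end_date` used in section 6. It assumes both endpoints are included, which may or may not match what `utils.create_animation_features` does internally.

```python
from datetime import datetime, timedelta

start_date = datetime.fromisoformat('2021-08-04T08:00:00')
end_date = datetime.fromisoformat('2021-08-05T08:00:00')
step = timedelta(hours=1)  # same interval as the "PT1H" period

# Collect every hourly time step from start_date to end_date (inclusive)
frame_times = []
current = start_date
while current <= end_date:
    frame_times.append(current.isoformat())
    current += step

print(f"Number of frames: {len(frame_times)}")
print(f"First frame: {frame_times[0]}")
print(f"Last frame: {frame_times[-1]}")
```

Under that assumption, a 24-hour range gives 25 frames. Widening the date range adds frames and makes the features take longer to build.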

## **Congratulations on finishing this lab!**

**Keep up the good work :)**