# CME538 - Introduction to Data Science
## Assignment 2 - Data Wrangling

### Learning Objectives
After completing this assignment, you should be comfortable:

- Using `requests` and `BeautifulSoup` to scrape a simple web page.
- Wrangling unstructured text data into a DataFrame format.
- Using `Pandas` functions to extract information from a DataFrame.
- Working with the `Folium` package at an introductory level.

You are free to add new cells to use as a scratch pad, but make sure to clean your code up and present your answer in the cell indicated with `# Write your code here`.

### Marking Breakdown

Question | Points
--- | ---
Question 1a | 1
Question 1b | 4
Question 1c | 1
Question 2a | 5
Question 2b | 2
Question 2c | 1
Question 2d | 1
Question 2e | 1
Question 2f | 1
Question 2g | 1
Question 2h | 1
Total | 19

One of the marks below will be added to the **Total** above.

### Code Quality

| Rank | Points | Description |
| :-- | :-- | :-- |
| Youngling | 1 | Code is unorganized, variable names are not descriptive, redundant, memory-intensive, computationally-intensive, uncommented, error-prone, difficult to understand. |
| Padawan | 2 | Code is organized, variable names are descriptive, satisfactory utilization of memory and computational resources, satisfactory commenting, readable. |
| Jedi | 3 | Code is organized, easy to understand, efficient, clean, a pleasure to read. #cleancode |

## Setup Notebook

In [None]:
# Import standard libraries
import os

# Import 3rd party libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

# Import local libraries
import utils

# Configure Notebook
import warnings
warnings.filterwarnings('ignore')
%config Completer.use_jedi = False

Let's install a super cool geospatial plotting package `Folium`.

In [None]:
!pip install folium

In [None]:
import folium

# Overview
You start a new job at an engineering company and you join a team that is planning to build a new waterfront facility on Lake Ontario. A critical input to the team's design process is information about the swell height and its seasonality (we'll learn more about seasonality in Week 5). Your manager informs you that NOAA (National Oceanic and Atmospheric Administration) has an array of sensors installed throughout the Great Lakes that measure swell height and direction, among other things.

In this assignment, you'll be working with the [NOAA - Great Lakes Environmental Research Laboratory (GLERL)](https://www.glerl.noaa.gov/res/glcfs/) dataset which contains forecasts and measurements for Ice Cover, Wave Height, Current Direction, Wind Speed, and others. We have already worked with this dataset in Lecture 3.2. 

<br>
<img src="images/noaa.gif" alt="drawing" width="500"/>
<br>

Your manager asks you to programmatically pull all available wave data from the GLERL server and report to her the minimum, maximum, and average wave height from the three closest grid points in the dataset to the planned location of the facility (lat = 43.9593°, lon = -78.1677°).

The image below shows the grid point locations in black and the location of the planned facility in red.

In [None]:
# Import 'grid_plot.csv'
grid_plot = pd.read_csv('grid_plot.csv')

# Create a map centred on the planned facility
map1 = folium.Map(location=[43.9593, -78.1677],
                 tiles='cartodbpositron',
                 zoom_start=8)

# Add grid points to the map
for idx, row in grid_plot.iterrows():
    folium.Circle(location=[row.lat, row.lon],
                  radius=20,
                  color='black').add_to(map1)

# Add markers for UofT and the planned facility
folium.Marker([43.6532, -79.3832], icon=folium.Icon(color='blue'), popup='UofT').add_to(map1)
folium.Marker([43.9593, -78.1677], icon=folium.Icon(color='red'), popup='Facility').add_to(map1)

# Display map
map1

## Question 1
### HTML & Web Scraping
For Question 1, you'll be using your web scraping skillz to create a reference DataFrame with the names of files that contain wave data for Lake Ontario in 2021. The files can be found [here](https://www.glerl.noaa.gov/emf/waves/GLERL-Donelan-Archive/2021/).

The filename format is:
LYYYY_MM.F.NC

where:
- L = lake letter (c=St. Clair, s=Superior, m=Michigan, h=Huron, e=Erie, o=Ontario)
- YYYY = year at start of simulation (GMT)
- MM = month at start of simulation (GMT)
- F = is either `in1` or `out1` (don't worry about this)  
- NC = file extension
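
As a rough illustration of the naming scheme above, the sketch below splits a filename into its parts with a regular expression. The helper name `split_archive_filename` and the pattern itself are assumptions made for this example; they are not part of the assignment's `utils` module.

```python
import re

# Hypothetical pattern for names such as 'o2021_12.in1.nc' (L YYYY _ MM . F . NC).
ARCHIVE_PATTERN = re.compile(r'^([a-z])(\d{4})_(\d{2})\.(in1|out1)\.(nc)$')

def split_archive_filename(filename):
    """Return the lake letter, year, month and file type, or None if no match."""
    match = ARCHIVE_PATTERN.match(filename.strip())
    if match is None:
        return None
    lake, year, month, file_type, _ = match.groups()
    return {'lake': lake, 'year': int(year), 'month': int(month), 'type': file_type}

print(split_archive_filename('o2021_12.in1.nc'))
```

A regular expression is only one option here; plain string methods such as `.split('.')` would work as well.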

### Question 1a
Use `requests.get()` to grab the HTML from this link (https://www.glerl.noaa.gov/emf/waves/GLERL-Donelan-Archive/2021/). Create a new variable named `response` and assign the returned object from `requests.get()` to it. Sometimes the **GLERL** server doesn't return a valid response, which produces a Python error. You may have to run this cell more than once.

In [None]:
# Write your code here.
response = ...

Let's use `BeautifulSoup` to parse the html object returned by `requests.get()`.

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

Next, let's use the `.find_all()` method to generate a list of HTML entries for each NOAA file.

In [None]:
table_rows = soup.find_all('tr')

We can see that the table data `<td>` we're interested in only starts at index 3, so let's keep everything from index 3 onward and drop the final row.

In [None]:
table_rows = table_rows[3:-1]

Let's print the first 10 rows of `table_rows` to see what they contain.

In [None]:
for table_row in table_rows[0:10]:
    print(table_row)
    print()

In [None]:
len(table_rows)

We can see that `table_rows` contains `<tr></tr>` and `<td></td>` html tags. Each list entry is wrapped in `<tr></tr>` tags and within these `<tr></tr>` tags, the data we're interested in is wrapped in `<td></td>` tags.
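
To make the `<tr>`/`<td>` nesting concrete, here is a minimal sketch on a made-up HTML snippet. The snippet, its cell contents, and the variable names are invented for illustration; the real GLERL listing may be laid out differently.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics one row of a directory listing table.
toy_html = """
<table>
  <tr><td>  example_file.nc </td><td>2021-01-01 00:00  </td><td> 1.0M</td></tr>
</table>
"""

toy_soup = BeautifulSoup(toy_html, 'html.parser')
toy_row = toy_soup.find_all('tr')[0]

# Each <td> inside the row holds one piece of information.
cells = [cell.get_text().strip() for cell in toy_row.find_all('td')]
print(cells)
```

Note how `.strip()` removes the padding around each cell's text.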

### Question 1b
Use `table_rows` to create a DataFrame that includes a row for each list item in `table_rows`. There are 54 list items in `table_rows`, so you'll be creating a DataFrame with 54 rows. Create a variable called `noaa_files` and assign the DataFrame to it. The DataFrame should have three columns, `filename`, `upload_date`, and `file_size`, which can all be extracted from the HTML snippets in `table_rows`. Make sure to use a string method to remove any excess white space. For example, there may be a filename string `'   o2021_12.in1.nc'`, but it should be entered into the DataFrame as `'o2021_12.in1.nc'`.

`noaa_files.head(10)` should return something like this, but with different `filename`, `upload_date`, and `file_size` values. The table below is just to give you an idea of what is expected; yours will look different.
<br>
<img src="images/noaa_files.png" alt="drawing" width="350"/>
<br>

In [None]:
# Write your code here.
...

# View DataFrame
noaa_files.head()

Remember, `noaa_files` should have 54 rows. Let's check.

In [None]:
print('noaa_files has {} rows'.format(noaa_files.shape[0]))

### Question 1c
Filter `noaa_files` so it only contains `.in1.nc` files for Lake Ontario. After filtering, reset the index using `.reset_index(drop=True)`.

In [None]:
# Write your code here.
noaa_files = ...

# View DataFrame
noaa_files.head()

`noaa_files` should now have 8 rows. Let's check.

In [None]:
print('noaa_files has {} rows'.format(noaa_files.shape[0]))

## Question 2
Now that we have a DataFrame containing the names of all the files we're interested in, we can start downloading these files and extracting the desired information. Because the **GLERL** server can be a bit unreliable and because they recently retired this database, we've gone ahead and downloaded a bunch of `.wav` files for you, which are located in the assignment directory.

For Question 2, we are working with a different file type because of the recent database changes: `.wav` (Wave) files with the following naming convention.

The gridded fields filename format is:
LYYYYDDDHH.N.WAV

where:
- L = lake letter (s=Superior, m=Michigan, h=Huron, e=Erie, o=Ontario)
- YYYY = year at start of simulation (GMT)
- DDD = Day Of Year at start of simulation (GMT)
- HH = hr at start of simulation (GMT)
- N = Site Number (don't worry about this, it will always be 0)

Because we are focused on wave information for Lake Ontario, our files have the following format **oYYYYDDDHH.0.wav** (`o` for Lake Ontario and `.wav` for wave measurements).
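
As a small illustration of this naming convention, the sketch below slices a filename into its parts and converts the day of year into a calendar date. The helper name `split_wave_filename` is hypothetical and the fixed character positions are an assumption based on the format described above.

```python
from datetime import datetime, timedelta

def split_wave_filename(filename):
    """Split a name like 'o202101518.0.wav' into lake, year, day of year and hour."""
    stem = filename.split('.')[0]      # e.g. 'o202101518'
    return {
        'lake': stem[0],
        'year': int(stem[1:5]),
        'day': int(stem[5:8]),
        'hour': int(stem[8:10]),
    }

parts = split_wave_filename('o202101518.0.wav')

# Day of year 1 is January 1, so add (day - 1) days to the start of the year.
start_time = datetime(parts['year'], 1, 1) + timedelta(days=parts['day'] - 1, hours=parts['hour'])
print(parts, start_time)
```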

Let's import the `noaa_files` `.csv` that corresponds to the files we've already downloaded for you.

In [None]:
noaa_files = pd.read_csv('noaa_files.csv')
noaa_files.head()

Let's also take a look at the `.wav` files we've included for you in the assignment folder.

In [None]:
[file for file in os.listdir() if '.wav' in file]

There are 32 rows in the `noaa_files` DataFrame and 32 `.wav` files in the assignment folder. They correspond to the same 32 datetimes.

### Question 2a
Write a function `noaa_file_parser(filename)` that takes a filename as an argument (e.g. filename = `o202101518.0.wav`) and returns a DataFrame with the following columns.

- filename (string)
- year (int)
- day (int)
- hour (int)
- grid_number (int)
- wave_height (float)
- wave_direction (int)
- wave_period (float)

There are two types of rows in a noaa file (e.g. `o202101518.0.wav`):

1. **Time stamp row** - this row indicates the date and time of the grid measurements to follow.
    
    Example: For file `o202101518.0.wav`, the first row is 
    
    `'2021 015 19     /glcfs/bathy/ontario5km.dat    WAVES                   746'`
    
    There are six entries in a time stamp row.
    
    - Year (GMT)
    - Day of the year (GMT)
    - Hour of the day (GMT)
    - Map file. There is a map file for each lake and it relates the grid numbers to latitudes and longitudes. `ontario5km.map` is the Lake Ontario map and it's already in your assignment folder.
    - File type (WAVES)
    - Number of grid points for a particular lake (746 for Lake Ontario).

2. **Measurement row** - this row contains the wave measurements we're interested in. Each measurement row contains wave measurements for a particular grid point.

    Example: For file `o202101518.0.wav`, the second row is 
    
    `'1   1.029  231  4.2'`
    
    There are four entries in a measurement row.
    
    - grid_number (int)
    - wave height (meters) (float)
    - wave direction (0 = toward north, 90 = toward east) (int)
    - wave period (s) (float)
    
The NOAA files contain this repeating pattern:
```
2021 015 19     /glcfs/bathy/ontario5km.dat    WAVES                   746
1   1.029  231  4.2
2   0.932  228  4.0
...
745   0.656  312  3.4
746   0.581  331  3.4
2021 015 20     /glcfs/bathy/ontario5km.dat    WAVES                   746
1   1.142  228  4.4
2   1.042  227  4.3
...
745   0.642  316  3.4
746   0.600  345  3.8
2021 015 21     /glcfs/bathy/ontario5km.dat    WAVES                   746
1   1.205  224  4.9
2   1.109  223  4.7
...
745   0.635  317  3.4
746   0.592  350  3.8
```
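
One possible way to tell the two row types apart is to count the white-space separated entries in each line. The sketch below does this on a few lines copied from the pattern above; the function name `classify_row` is made up for this example, and this is only a starting point rather than a full parser.

```python
# A few lines copied from the pattern above, held in memory for illustration.
sample_lines = [
    '2021 015 19     /glcfs/bathy/ontario5km.dat    WAVES                   746',
    '1   1.029  231  4.2',
    '2   0.932  228  4.0',
    '2021 015 20     /glcfs/bathy/ontario5km.dat    WAVES                   746',
    '1   1.142  228  4.4',
]

def classify_row(line):
    """Label a row by how many white-space separated entries it has."""
    n_entries = len(line.split())
    if n_entries == 6:
        return 'timestamp'
    if n_entries == 4:
        return 'measurement'
    return 'other'

labels = [classify_row(line) for line in sample_lines]
print(labels)
```

Calling `.split()` with no arguments splits on any run of white space, so the uneven column padding does not matter.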
    
Your function `noaa_file_parser(filename='o202101518.0.wav')` should return the following DataFrame (first 10 rows shown). 
<br>
<img src="images/noaa_file_parser.png" alt="drawing" width="550"/>
<br>

You can use the code in the cell below to explore NOAA file `'o202101518.0.wav'`.

In [None]:
open('o202101518.0.wav', 'r').read().split('\n')[0:10]

In [None]:
# Write your code here.
def noaa_file_parser(filename):
    ...

# Print DataFrame head. Here I am using the first file in the assignment directory to print (o202101518.0.wav).
file_grid_data = noaa_file_parser(filename=noaa_files.loc[0, 'filename'])
file_grid_data.head()

`file_grid_data` should have 4476 rows. Let's check.

In [None]:
print('file_grid_data has {} rows'.format(file_grid_data.shape[0]))

### Question 2b
Next, use the function you just built (`noaa_file_parser()`) to parse each NOAA file in the DataFrame `noaa_files`. Run `noaa_file_parser()` in a `for` loop that loops through each file in `noaa_files['filename']`. At the end of each loop, add the DataFrame returned by `noaa_file_parser()` to a list. After looping through every file in the DataFrame `noaa_files`, use `pd.concat()` to combine all the DataFrames. The result will be one large DataFrame containing all the wave measurement data. Create a variable `grid_data` and assign this new DataFrame to it.
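
The collect-then-concatenate pattern described above can be sketched on toy data as follows. `toy_parser` is a hypothetical stand-in that returns a tiny DataFrame; it is not the parser from Question 2a, and the column names are invented.

```python
import pandas as pd

def toy_parser(name):
    """Stand-in for a real parser: returns a tiny DataFrame tagged with `name`."""
    return pd.DataFrame({'filename': [name, name], 'value': [1.0, 2.0]})

frames = []
for name in ['file_a', 'file_b', 'file_c']:
    frames.append(toy_parser(name))

# ignore_index=True gives the combined DataFrame a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

Collecting the pieces in a list and calling `pd.concat()` once is generally cheaper than growing a DataFrame inside the loop.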

In [None]:
# Write your code here.
...

# View DataFrame
grid_data.head() 

As a quick check, `grid_data` should have 32 unique files and 143232 rows in it.

In [None]:
print('There are {} unique files in grid_data and {} rows'.format(grid_data['filename'].nunique(), grid_data.shape[0]))

### Question 2c
The map file `ontario5km.map` is located in the assignment directory and contains the latitude and longitude information for each grid_number in `grid_data`. Import the `ontario5km.map` as a DataFrame with the following column names:

- grid_number
- fortran column
- fortran row
- lat
- lon
- depth

The ascii map file structure is:

NNNNN III JJJ LL.LLLLL LL.LLLLL DDD

where:
- NNNNN = sequence number (grid_number)
- III = fortran column
- JJJ = fortran row
- LL.LLLLL = lat (decimal degrees N)
- LL.LLLLL = lon (decimal degrees W)
- DDD = depth (m)

**Hint:** The columns are white-space delimited.
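
As an illustration of reading white-space delimited text without a header row, here is a sketch on made-up rows held in memory. The values and the column names in `toy_columns` are invented; they are not taken from `ontario5km.map`.

```python
import io
import pandas as pd

# Made-up rows in a white-space delimited layout (not real map data).
toy_text = """1   10   20  43.50000  79.00000   12
2   11   20  43.50000  78.95000   30
"""

toy_columns = ['id', 'col', 'row', 'lat', 'lon', 'depth']

# sep=r'\s+' splits on runs of white space; header=None because there is no header row.
toy_map = pd.read_csv(io.StringIO(toy_text), sep=r'\s+', header=None, names=toy_columns)
print(toy_map)
```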

Create a variable `map_file` and assign this new DataFrame to it.

Below is a quick view of what the file `'ontario5km.map'` contains. Notice there are no column names so we'll have to add them ourselves.

In [None]:
open('ontario5km.map', 'r').read().split('\n')[0:10]

In [None]:
# Write your code here.
map_file = ...

# Because Longitude is in units of (decimal degrees W), let's convert these to negative values.
map_file['lon'] = map_file['lon'] * -1

# View DataFrame
map_file.head()

### Question 2d
We want our wave measurements in `grid_data` to have a geographic location so we can find the 3 closest grid points to the planned facility. To do this, we must use `pd.merge()` to map columns `lat` and `lon` from `map_file` to `grid_data`. Create a variable `grid_data_final` and assign this new DataFrame to it.
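
To show what such a merge does, the sketch below attaches coordinates to a toy table of measurements. Both toy DataFrames and the key name `point_id` are invented for this example, and the choice of `how='left'` is an assumption about the desired behaviour.

```python
import pandas as pd

toy_measurements = pd.DataFrame({'point_id': [1, 2, 1, 2],
                                 'height': [0.5, 0.7, 0.6, 0.8]})
toy_locations = pd.DataFrame({'point_id': [1, 2],
                              'lat': [43.5, 43.6],
                              'lon': [-79.0, -78.9],
                              'depth': [10, 20]})

# A left merge keeps every measurement row and attaches the matching coordinates.
# Selecting columns first avoids carrying over fields we do not need.
toy_merged = pd.merge(toy_measurements,
                      toy_locations[['point_id', 'lat', 'lon']],
                      on='point_id',
                      how='left')
print(toy_merged)
```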

`grid_data_final.head()` should look like this:
<br>
<img src="images/grid_data.png" alt="drawing" width="700"/>
<br>

In [None]:
# Write your code here.
grid_data_final = ...

# View DataFrame
grid_data_final.head()

### Question 2e
Save `grid_data_final` to the root path with file name `'grid_data_final.csv'`. Make sure to not include an index column.

In [None]:
# Write your code here.
...

### Question 2f
Next, we need to find the three grid points that are closest to the location of the planned facility. The facility will be located at (lat = 43.9593°, lon = -78.1677°). A helper function has been included to calculate the distance between any two points (lat1, lon1 and lat2, lon2). We’ll use the Haversine (or Great Circle) distance formula, which takes the latitude and longitude of two points, accounts for Earth’s curvature, and calculates the shortest distance between them along the Earth’s surface.

You can call this function as follows:

`utils.haversine(lat1, lon1, lat2, lon2)`

The distance is returned in kilometers.
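
For reference, a standalone version of the Haversine formula might look like the sketch below. It is not the code in `utils.py`, which may differ in detail; the function name `haversine_sketch` and the mean Earth radius of 6371 km are assumptions made for this example.

```python
import math

def haversine_sketch(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# One degree of latitude is roughly 111 km.
print(round(haversine_sketch(0.0, 0.0, 1.0, 0.0), 1))

# Distance between the planned facility and the UofT marker used in the maps.
print(round(haversine_sketch(43.9593, -78.1677, 43.6532, -79.3832), 1))
```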

First, create a new column called `distance` for DataFrame `grid_data_final`. You can use the `.apply()` method to apply `utils.haversine(lat1, lon1, lat2, lon2)` to each row. 

`grid_data_final.head()` should look like this:
<br>
<img src="images/grid_data_distance.png" alt="drawing" width="700"/>
<br>

In [None]:
# Write your code here.
grid_data_final['distance'] = ...

# View DataFrame
grid_data_final.head()

### Question 2g
Next, create a new DataFrame called `closest_points` which only contains wave measurement data from the three closest grid points to the planned facility at (lat = 43.9593°, lon = -78.1677°).
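
One way to think about this step: each grid point appears many times (once per time stamp), so the nearest points have to be picked per point, not per row. The sketch below shows the idea on toy data with two nearest points; the DataFrame, the column names, and the approach itself are assumptions for illustration, and other approaches are equally valid.

```python
import pandas as pd

toy = pd.DataFrame({
    'point_id': [1, 2, 3, 4, 1, 2, 3, 4],
    'distance': [5.0, 1.0, 3.0, 9.0, 5.0, 1.0, 3.0, 9.0],
    'height':   [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1],
})

# One row per point, then the ids of the two smallest distances.
unique_points = toy.drop_duplicates(subset='point_id')
nearest_ids = unique_points.nsmallest(2, 'distance')['point_id']

# Keep every measurement that belongs to one of those points.
toy_nearest = toy[toy['point_id'].isin(nearest_ids)]
print(toy_nearest)
```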

In [None]:
# Write your code here.
... 

# View DataFrame
closest_points.head() 

As a quick check, `closest_points` should have 576 rows and 3 unique `'grid_number'`. 

In [None]:
print('closest_points has {} rows and {} unique grid_number.'.format(closest_points.shape[0], closest_points['grid_number'].nunique()))

Validate that the three points you've found make sense visually with a plot. The three closest points are shown as red circles.

In [None]:
# Import 'grid_plot.csv'
grid_plot = pd.read_csv('grid_plot.csv')

# Create a map centred on the planned facility
map2 = folium.Map(location=[43.9593, -78.1677],
                 tiles='cartodbpositron',
                 zoom_start=8)

# Add grid points to the map
for idx, row in grid_plot.iterrows():
    folium.Circle(location=[row.lat, row.lon],
                  radius=20,
                  color='black').add_to(map2)

# Add the three closest grid points to the map
for idx, row in closest_points.iterrows():
    folium.Circle(location=[row.lat, row.lon],
                  radius=20,
                  color='red').add_to(map2)

# Add markers for UofT and the planned facility
folium.Marker([43.6532, -79.3832], icon=folium.Icon(color='blue'), popup='UofT').add_to(map2)
folium.Marker([43.9593, -78.1677], icon=folium.Icon(color='red'), popup='Facility').add_to(map2)

# Display map
map2

### Question 2h
The final step is to use the `closest_points` DataFrame to calculate the minimum, maximum and average wave height across the three closest points. Create variables `wave_height_min`, `wave_height_max`, `wave_height_mean` and assign the computed values to them.

In [None]:
# Write your code here.
wave_height_min = ...
wave_height_max = ...
wave_height_mean = ...

# Print answers
print('Wave height min: {} m\nWave height max: {} m\nWave height mean: {} m'.format(wave_height_min, 
                                                                                    wave_height_max, 
                                                                                    wave_height_mean)) 

**Congratulations, you're done with Assignment 2. Review your answers and clean up that code before submitting on Quercus. `#cleancode`**