# ph21: HWK 1
In this assignment, we will get buoy data from the NDBC and then use it to make some estimates about the future.

### Familiarize yourself

We will be using data from the National Data Buoy Center (NDBC). Visit the NDBC website, https://www.ndbc.noaa.gov, and access some historical data. Take a look at the URL for the data you have accessed: you will need this format to access the data from Python.

### Part 1: Getting the data

You will need to save all the data necessary for Part 2 here. The following steps are for reference.

1. Use the `requests` library (or any similar library of your choice) to download historical buoy data.
2. Save all the data you downloaded into one or more files. If you prefer, save only the part that is relevant for Part 2.
3. Use an HTML parser library (e.g., `BeautifulSoup`, `lxml`, or `html5lib`) to extract the meaning of the data (i.e., what is each column measuring, and in what units?). Save this information as well.
4. From now on, you don't need to make any more requests to the NDBC website. Use only the data you downloaded.


### Part 2: Processing the data
5. (Optional) Use the `pandas` library to read and process the data as needed, and save the processed data if you want to.
6. Plot the ocean temperatures (`WTMP`), wave heights (`WVHT`), average wave periods (`APD`), and wind speeds (`WSPD`) going back 10 years or so (some buoys don't have all the data for every year).
7. Look through `scipy.stats` and choose something like Pearson's or Spearman's correlation test. Determine what correlations (if any) you find between mean ocean temperatures and maximum wave heights or wave periods.


### All code below is provided for reference only. You may modify it as you see fit.

# Part 1

In [35]:
# Libraries you might want to use
# Uncomment the ones you need
# from datetime import datetime
from io import StringIO
from pathlib import Path

from requests import Session
# from curl_cffi.requests import Session, AsyncSession

# from bs4 import BeautifulSoup
# from lxml import html

# from matplotlib import pyplot as plt
# import seaborn as sns

import pandas as pd
# import numpy as np
# from scipy import stats

import yaml
from tqdm import tqdm

In [37]:
# Prepare a requests session. Let's pretend we are Chrome 118 here.

# Think: In some cases, we can still be blocked after being flagged as a bot on the first request.
#        Can you guess how the server knows? (Ignore JavaScript for now.)

sess = Session()
sess.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36'})

# Think: What's the difference between using requests.get() and sess.get()?

In [39]:
# Where to save the fetched data
path = Path('./Data')
path.mkdir(exist_ok=True)
url_template = 'https://www.ndbc.noaa.gov/view_text_file.php?filename={buoy_id}h{year}.txt.gz&dir=data/historical/stdmet/'
description_page_url = 'https://www.ndbc.noaa.gov/faq/measdes.shtml'

In [41]:
# Load the config file
with open('config.yaml') as f:
    config = yaml.safe_load(f)

buoy_ids: list[int] = config['buoys']  # We only use these IDs for this assignment
columns: list[str] = config['columns']  # Data columns we want to keep. You may add more if you want

FileNotFoundError: [Errno 2] No such file or directory: 'config.yaml'
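
The cell above expects a `config.yaml` in the working directory with two keys, `buoys` and `columns`. The file's actual contents are not shown in this notebook, so the following is only a sketch of one plausible layout. The buoy ID is the one used in the fetch example in this notebook, the column names are the four quantities plotted in Part 2, and `write_example_config` is a hypothetical helper.

```python
from pathlib import Path

# Assumed layout: only the key names `buoys` and `columns` come from the
# loading cell; the values listed here are illustrative placeholders.
EXAMPLE_CONFIG = """\
buoys:
  - 41010
columns:
  - WTMP
  - WVHT
  - APD
  - WSPD
"""


def write_example_config(target="config.yaml"):
    """Write the example config unless a file already exists at `target`."""
    target = Path(target)
    if target.exists():
        return False  # never overwrite a real config
    target.write_text(EXAMPLE_CONFIG)
    return True
```

Calling `write_example_config()` once before the loading cell should be enough for `yaml.safe_load` to return a dictionary with both keys. Edit the lists to match the buoys you actually want.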

To fetch one file, you can do something like this:

In [None]:
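# An example of how to fetch one data file and load it into a pd.DataFrame

r = sess.get(url_template.format(buoy_id=41010, year=2022))
r.raise_for_status()
buf = StringIO(r.text)
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas);
# skiprows=[1] skips the second header line
df = pd.read_csv(buf, sep=r'\s+', skiprows=[1])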

Take a look:

In [None]:
df.head()
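
The `skiprows=[1]` argument in the fetch example skips the second header line, which holds the units in files that use a two-line header. A minimal sketch of pairing the two header lines into a unit lookup follows. It assumes both lines have the same number of whitespace-separated fields. `units_from_header` and the `sample` text are hypothetical and not real buoy data.

```python
def units_from_header(text):
    """Map each column name to its unit using the first two lines of a file."""
    lines = text.splitlines()
    names = lines[0].split()              # keeps '#YY' as pandas would name it
    units = lines[1].lstrip("#").split()  # drop the leading comment marker
    if len(names) != len(units):
        raise ValueError("header rows have different numbers of fields")
    return dict(zip(names, units))


# Made-up sample in the assumed two-line-header layout.
sample = (
    "#YY MM DD hh mm WSPD WTMP\n"
    "#yr mo dy hr mn m/s degC\n"
    "2022 01 01 00 00 5.0 20.1\n"
)
print(units_from_header(sample))
```

This gives a quick cross-check for whatever unit dictionary you later build from the website.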

Now, your task here is:
1. For each `buoy_id`, generate all URLs pointing to its data, then fetch and save the data. If the data is missing (`status_code == 404`), skip it.
2. Save everything in raw form, or in any processed form you see fit.
3. If a file has already been downloaded, don't download it again. Just skip it.

Learn more about HTTP status codes here:
- https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
- https://http.cat/

Two important ones here for today:
- 200: OK; everything's alright
- 404: Not Found; the file is missing
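
One possible shape for the download loop is sketched below; it is not the required solution. It assumes one output file per buoy and year, named `{buoy_id}h{year}.txt` (a naming choice made here for illustration), and it accepts any session object with a requests-style `.get(url)` method. `download_history` is a hypothetical helper.

```python
from pathlib import Path

URL_TEMPLATE = (
    "https://www.ndbc.noaa.gov/view_text_file.php"
    "?filename={buoy_id}h{year}.txt.gz&dir=data/historical/stdmet/"
)


def download_history(session, buoy_ids, years, out_dir):
    """Fetch one raw text file per (buoy, year), skipping cached and missing ones.

    `session` is anything with a requests-style `.get(url)` whose response
    has `.status_code`, `.text`, and `.raise_for_status()`.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for buoy_id in buoy_ids:
        for year in years:
            target = out_dir / f"{buoy_id}h{year}.txt"
            if target.exists():
                continue  # already downloaded on an earlier run
            response = session.get(URL_TEMPLATE.format(buoy_id=buoy_id, year=year))
            if response.status_code == 404:
                continue  # no file for this buoy and year
            response.raise_for_status()  # surface any other HTTP error
            target.write_text(response.text)
            saved.append(target)
    return saved
```

With the objects defined earlier in this notebook, a call could look like `download_history(sess, buoy_ids, range(2013, 2023), path)`, where the year range is only an illustrative choice. Wrapping the outer loop in `tqdm` would add a progress bar.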

In [None]:
# ================ INSERT YOUR CODE HERE ================


Although the units for all quantities are already included in the downloaded data files, let's also retrieve them from the website, along with the description of each column, by parsing the HTML.

All the information we need is available at [`description_page`](https://www.ndbc.noaa.gov/faq/measdes.shtml#stdmet).

Your task is to:

1. Build a dictionary mapping each quantity name to its unit
2. Build a dictionary mapping each quantity name to its description

To achieve this, you may want to use an HTML parser library, such as `BeautifulSoup`, `lxml`, or `html5lib`. If you really feel like it, you can also use regex just for fun, like in the good old days.

Tip: Not sure where to start? Open the [`description_page`](https://www.ndbc.noaa.gov/faq/measdes.shtml#stdmet) in your browser and use the `inspect` tool (F12, or Ctrl+Shift+I) to see how the HTML is structured and locate the part you are interested in. Pasting a copy of the HTML into an LLM and asking it how to extract the information you want may also be a good idea.

Another tip: Something called [`xpath`](https://en.wikipedia.org/wiki/XPath) may be extremely useful here. It can be accessed through `lxml`.
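
To make the parsing idea concrete, here is a minimal sketch that uses the standard-library `html.parser` in place of the libraries suggested above. It assumes the descriptions sit in `<dt>`/`<dd>` pairs, which you should confirm with the inspect tool before relying on it. `DefinitionListParser` is a hypothetical helper, and `sample_html` is a made-up fragment, not the real page. Extracting the units from the description text is left to you.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect <dt>term</dt><dd>description</dd> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meanings = {}
        self._tag = None    # "dt" or "dd" while inside one, else None
        self._term = None   # text of the most recent <dt>
        self._chunks = []   # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._flush()   # also copes with unclosed <dt>/<dd> tags
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd", "dl"):
            self._flush()

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)

    def _flush(self):
        """Store the text collected for the element that just ended."""
        if self._tag is None:
            return
        text = " ".join("".join(self._chunks).split())  # collapse whitespace
        if self._tag == "dt":
            self._term = text
        elif self._term is not None:
            self.meanings[self._term] = text
        self._tag = None
        self._chunks = []


# Made-up fragment in the assumed structure.
sample_html = """
<dl>
  <dt>WSPD</dt>
  <dd>Wind speed (m/s), <a href="#">averaged</a> over a period.</dd>
  <dt>WTMP</dt>
  <dd>Sea surface temperature (Celsius).</dd>
</dl>
"""
parser = DefinitionListParser()
parser.feed(sample_html)
parser.close()
sample_meanings = parser.meanings
```

Feeding `html_text` from the request to the description page instead of `sample_html` would produce a candidate `meaning_map`, provided the page really uses this structure.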

In [None]:
r = sess.get(description_page_url)
r.raise_for_status()
html_text = r.text # raw html text for your input

In [None]:
# ================ INSERT YOUR CODE HERE ================

if not (path/'units.yaml').exists() or not (path/'meanings.yaml').exists():
    ...

    with open(path/'units.yaml', 'w') as f:
        yaml.dump(unit_map, f)
    with open(path/'meanings.yaml', 'w') as f:
        yaml.dump(meaning_map, f)

In [None]:
with open(path/'units.yaml') as f:
    unit_map = yaml.safe_load(f)
unit_map

In [None]:
with open(path/'meanings.yaml') as f:
    meaning_map = yaml.safe_load(f)
meaning_map

You may want to concatenate each buoy's data into a single dataframe, or any other data structure that you see fit. `pandas` is recommended here for your convenience.

After plotting, you will notice that some values are saturated at ~100. These are artifacts and should be removed. We will filter out everything $\ge 90$ in this case. If you are using `pandas`, you can do something like `df[df[columns] < 90]`.
Also, remove all rows with `NaN` values.
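
A minimal sketch of this cleanup follows, assuming `columns` lists only the measurement columns. Masking the whole frame with a condition computed on a subset of columns, as in `df[df[columns] < 90]`, can also blank out the columns that are not in `columns` (the date columns, for example). The sketch therefore applies the mask to `columns` only. `drop_saturated` is a hypothetical helper, and the `toy` values are made up.

```python
import pandas as pd


def drop_saturated(df, columns, threshold=90):
    """Blank out values >= threshold in `columns`, then drop rows with NaN."""
    cleaned = df.copy()
    # where() keeps values that pass the test and replaces the rest with NaN
    cleaned[columns] = cleaned[columns].where(cleaned[columns] < threshold)
    return cleaned.dropna()


# Made-up values for illustration, not real buoy data.
toy = pd.DataFrame({
    "#YY": [2022, 2022, 2022],
    "WTMP": [20.1, 999.0, 19.8],
    "WVHT": [1.2, 1.1, 99.0],
})
print(drop_saturated(toy, ["WTMP", "WVHT"]))
```

Only the first row of `toy` survives, because each of the other two rows has one saturated value.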

In [None]:
# ================ INSERT YOUR CODE HERE ================


As Python handles datetime objects in a fundamentally nasty way, we provide a snippet to convert YMDhm to datetime objects for you. You can use it like this:

```python
from datetime import datetime

YMDhm = df[['#YY','MM','DD','hh','mm']]
timestamp = YMDhm.agg(lambda x: datetime(*x), axis=1)
```

to get a series of datetime objects from the dataframe `df` with columns `#YY`, `MM`, `DD`, `hh`, and `mm`.

#### For one buoy, plot all of its columns over the last 10 years. You can use any plotting library of your choice.
##### Can you see any trend? If it is too noisy for human eyes, what could be done to make it clearer?

##### Look through `scipy.stats` and choose something like Pearson's or Spearman's correlation test. Determine what correlations (if any) you find between mean ocean temperatures and maximum wave heights or wave periods.
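
One way to set up the correlation test is sketched below. It assumes the statistics are aggregated per calendar month using the `#YY` and `MM` columns. The assignment does not fix the aggregation period, so monthly grouping is a choice made here for illustration. `monthly_correlation` is a hypothetical helper.

```python
import pandas as pd
from scipy import stats


def monthly_correlation(df, mean_col="WTMP", max_col="WVHT"):
    """Correlate the monthly mean of one column with the monthly max of another."""
    monthly = df.groupby(["#YY", "MM"]).agg(
        mean_value=(mean_col, "mean"),
        max_value=(max_col, "max"),
    ).dropna()
    # Both tests return a (statistic, p-value) pair
    pearson_r, pearson_p = stats.pearsonr(monthly["mean_value"], monthly["max_value"])
    spearman_rho, spearman_p = stats.spearmanr(monthly["mean_value"], monthly["max_value"])
    return {
        "pearson": (pearson_r, pearson_p),
        "spearman": (spearman_rho, spearman_p),
    }
```

Passing `max_col="APD"` repeats the test for wave periods. Pearson's test measures linear association, while Spearman's works on ranks and so also picks up monotonic but nonlinear relationships.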

In [None]:
# ================ INSERT YOUR CODE HERE ================

