# Data Scraping #

The first step is getting the data. Fortunately, [NOAA](https://www.noaa.gov/) provides many climate datasets.

In this tutorial, we will use the dataset from the [Global Forecast System (GFS)](https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs).

We will scrape the `GRIB data(.grb)` and `GRIB2 data(.grb2)` files listed under `Product Types` > `GFS Analysis` > `GFS-ANL, Historical Model`, which are served from this [link](https://www.ncei.noaa.gov/data/global-forecast-system/access/historical/analysis/).

The `period of record` of this dataset is `01Jan2007–15May2020`, and the data was collected `4 times per day` `(00, 06, 12 and 18 UTC)`.
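To make the period of record and the four daily cycles concrete, here is a minimal sketch that enumerates every analysis time in that range and builds the directory URL for a given day. The helper names `day_url` and `analysis_cycles` are made up for illustration, and the `YYYYMM/YYYYMMDD/` directory layout is an assumption about how the server organizes its listing. The resulting count is only an upper bound on the number of files, since it assumes one file per cycle and no gaps in the archive.

```python
from datetime import date, datetime, timedelta

BASE_URL = "https://www.ncei.noaa.gov/data/global-forecast-system/access/historical/analysis/"
CYCLE_HOURS = (0, 6, 12, 18)  # UTC analysis times stated for this dataset


def day_url(day):
    """Directory URL for one day, assuming a YYYYMM/YYYYMMDD/ layout."""
    return f"{BASE_URL}{day:%Y%m}/{day:%Y%m%d}/"


def analysis_cycles(start, end):
    """Yield one datetime per analysis cycle from start to end, inclusive."""
    day = start
    while day <= end:
        for hour in CYCLE_HOURS:
            yield datetime(day.year, day.month, day.day, hour)
        day += timedelta(days=1)


cycles = list(analysis_cycles(date(2007, 1, 1), date(2020, 5, 15)))
print(len(cycles), "cycles; first directory:", day_url(cycles[0]))
```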

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import urllib.request  # a bare `import urllib` does not guarantee the request submodule is loaded
import os
from tqdm import tqdm

In [None]:
url_template = 'https://www.ncei.noaa.gov/data/global-forecast-system/access/historical/analysis/'

In [None]:
dates = pd.date_range("2007-01-01","2020-05-15")

In [None]:
def create_dir_if_not_exist(directory):
    # exist_ok=True makes this a no-op when the directory is already there
    os.makedirs(directory, exist_ok=True)

In [None]:
root_dir = 'data/'
create_dir_if_not_exist(root_dir)

for date in tqdm(dates):
    try:
        year_month = date.strftime("%Y%m")
        year_month_day = date.strftime("%Y%m%d")
        # Mirror the remote layout locally: data/YYYYMM/YYYYMMDD/
        out_dir = root_dir + year_month + '/' + year_month_day + '/'
        create_dir_if_not_exist(out_dir)
        url = url_template + year_month + '/' + year_month_day + '/'
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'html.parser')

        # Prefer the GRIB2 files (gfsanl_4); `href and ...` skips anchors without an href
        file_links = soup.find_all('a', href=lambda href: href and '000.grb2' in href and 'gfsanl_4' in href)
        if len(file_links) == 0:
            # Fall back to the gfsanl_3 files when no gfsanl_4 file is listed
            file_links = soup.find_all('a', href=lambda href: href and '000.grb' in href and 'gfsanl_3' in href)
            if len(file_links) == 0:
                continue

        for link in file_links:
            file_url = url + link['href']
            urllib.request.urlretrieve(file_url, out_dir + link['href'])

    except Exception as e:
        # Report the failing day and move on to the next one
        print(f"{date:%Y%m%d}: {e}")
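The selection rule in the loop above (take the `gfsanl_4` `.grb2` links if the listing has any, otherwise fall back to the `gfsanl_3` `.grb` links) can be tried offline on a small sample. The sketch below is not the notebook's code: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages, the names `HrefCollector` and `select_analysis_files` are hypothetical, and the file names in `sample_listing` are illustrative stand-ins for a real directory listing.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # ignore anchors that carry no href
                self.hrefs.append(href)


def select_analysis_files(html):
    """Return gfsanl_4 GRIB2 links if present, otherwise gfsanl_3 GRIB links."""
    parser = HrefCollector()
    parser.feed(html)
    grib2 = [h for h in parser.hrefs if "000.grb2" in h and "gfsanl_4" in h]
    if grib2:
        return grib2
    return [h for h in parser.hrefs if "000.grb" in h and "gfsanl_3" in h]


# Illustrative listing; the file names are assumptions, not copied from the server
sample_listing = """
<a href="../">Parent Directory</a>
<a href="gfsanl_3_20200515_0000_000.grb2">file</a>
<a href="gfsanl_4_20200515_0000_000.grb2">file</a>
<a href="gfsanl_4_20200515_0000_003.grb2">file</a>
"""
print(select_analysis_files(sample_listing))
```

Because the match is a substring test, a link such as `something_000.grb2.inv` would also be selected; tightening the test to `str.endswith` is one option if such files turn up in the listing.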

In [None]:
!zip -r data.zip data
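The `!zip` cell depends on a `zip` binary being available to the notebook's shell. Where it is not, the standard library can build an equivalent archive. The following is a minimal sketch: `zip_directory` is a hypothetical helper, and the temporary folder with a placeholder file only stands in for the real `data/` directory.

```python
import os
import shutil
import tempfile


def zip_directory(src_dir, archive_base):
    """Zip src_dir, keeping its top-level folder name, and return the archive path."""
    parent, name = os.path.split(os.path.abspath(src_dir))
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=name)


# Demonstrate on a throwaway directory standing in for data/
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "data", "200701", "20070101")
    os.makedirs(demo_dir)
    with open(os.path.join(demo_dir, "example.grb"), "wb") as f:
        f.write(b"placeholder")
    archive = zip_directory(os.path.join(tmp, "data"), os.path.join(tmp, "data"))
    print(os.path.basename(archive))
```

In the notebook itself, the equivalent call would be `zip_directory("data", "data")`, which writes `data.zip` next to the `data/` folder.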

At this point, we have downloaded the climate dataset. Unfortunately, we still don't know what data the files actually contain.

So, the next section of this tutorial is **Data Inspecting**.