# Lecture 2: Web Scraping and Data Download

## Introduction

In this second lecture, we will dive into web scraping and data download for meteorological analysis. You might already be familiar with the user-friendly data download page created by **Brian K. Blaylock**, which allows manual data retrieval:

[**Brian K. Blaylock - GOES-16 Data Download**](https://home.chpc.utah.edu/~u0553130/Brian_Blaylock/cgi-bin/goes16_download.cgi)

While this manual download method is convenient, it is not suitable for automated or repeated data acquisition. We will therefore use **web scraping** to automate the process and make it more efficient.

Explore [**NOAA GOES on AWS**](https://docs.opendata.aws/noaa-goes16/cics-readme.html#accessing-goes-data-on-aws)

Let's get started!

In [None]:
# Importing the datetime module to work with date and time information.
## complete your code here:



The basic date and time types are provided by Python's `datetime` library:
https://docs.python.org/3/library/datetime.html
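As a minimal sketch of what the `datetime` module offers, the cell below creates the current UTC time and formats it in a few ways. The variable name `dnow` matches the one used later in this notebook; making it timezone-aware is an assumption of this sketch, not a requirement of the exercise.

```python
from datetime import datetime, timezone

# Current time in UTC as a timezone-aware datetime object.
dnow = datetime.now(timezone.utc)

# Show only the calendar date.
print(dnow.date())

# Format codes after the colon follow strftime conventions.
print(f"Current UTC time: {dnow:%Y-%m-%d %H:%M:%S}")

# %j is the zero-padded day of the year, which the GOES archive uses in its paths.
print(f"Day of year: {dnow:%j}")
```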

In [None]:
# Create a datetime object representing the current UTC time.
## complete your code here:



In [None]:
# Display the current date.
## complete your code here:



In [None]:
# Display a message along with the current UTC time.
## complete your code here:



In [None]:
# Define a dictionary 'dini' containing parameters for data download URL construction.
dini = {
## complete your code here:
    'src': ,                    # Data source
    'sat': ,                    # Satellite number (e.g., '16' for GOES-16)
    'str': ,                    # Spatial domain (e.g., 'F' for full disk)
    'prd': ,                    # Product type (e.g., ABI-L1b-Rad for radiance data)
    'tme': dnow                 # Date and time (current UTC time)
}

In [None]:
# Construct the URL for data download using the provided parameters.
url = (
    'https://home.chpc.utah.edu'
    '/~u0553130/Brian_Blaylock/cgi-bin/goes16_download.cgi?'
    'source={src}&'
    'satellite=noaa-goes{sat}&'
    'domain={str}&'
    'product={prd}&'
    ## complete your code here:
    'date=         '                # Format the date as 'YYYY-MM-DD'
    'hour=         '                # Format the hour as 'HH'
)
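One possible way to fill in the dictionary and the URL template is sketched below. The parameter values are illustrative assumptions (in particular `'aws'` as the source name and `'M1'` as the domain), and a fixed date is used instead of the current time so the result is reproducible. The `&` between the `date` and `hour` fields is also an assumption about how the query string is joined.

```python
from datetime import datetime, timezone

# Illustrative parameter values; adjust them to the options offered by the page.
dini = {
    'src': 'aws',              # data source (assumed value)
    'sat': '16',               # satellite number, giving 'noaa-goes16'
    'str': 'M1',               # domain (assumed value; 'F' would be full disk)
    'prd': 'ABI-L1b-Rad',      # product type (radiance data)
    'tme': datetime(2022, 11, 2, 17, 34, tzinfo=timezone.utc),  # fixed time
}

# A format spec after the colon is handed to the datetime object,
# so {tme:%Y-%m-%d} renders the date and {tme:%H} the two-digit hour.
url = (
    'https://home.chpc.utah.edu'
    '/~u0553130/Brian_Blaylock/cgi-bin/goes16_download.cgi?'
    'source={src}&'
    'satellite=noaa-goes{sat}&'
    'domain={str}&'
    'product={prd}&'
    'date={tme:%Y-%m-%d}&'
    'hour={tme:%H}'
)

# Unpack the dictionary so each key fills the placeholder of the same name.
print(url.format(**dini))
```

Keeping the template and the parameters separate means the same `url` string can be reused for another date or product by changing only `dini`.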

In [None]:
# Print the constructed URL with parameter values applied using string formatting.
## complete your code here:




In [None]:
# Import the 'requests' library for making HTTP requests.
## complete your code here:

# Import 'BeautifulSoup' from the 'bs4' library for web scraping and parsing HTML.
## complete your code here:

# Import 'minidom' from 'xml.dom' for working with XML data.
## complete your code here:


In [None]:
# Construct the complete URL with parameter values applied.
## complete your code here:
urit = 

# Make an HTTP GET request to the constructed URL.
## complete your code here:
response = 

# Parse the HTML content of the response using BeautifulSoup.
## complete your code here:
dates = 

# Find all elements with class 'mybtn-group' in the parsed HTML.
## complete your code here:
alltimesxml = 


# Display the scraped HTML content stored in the 'alltimesxml' variable.
## complete your code here:
alltimesxml

In [None]:
# Initialize an empty list to store the data to be downloaded.
## complete your code here:
ls2down = 

# Start with index -1, which selects the last link in each group by default.
nps = -1

# Iterate through the elements in 'alltimesxml'.
for i in range(len(alltimesxml)):
    # Find all 'a' tags within the current element.
    tags = alltimesxml[i].find_all('a')
    
    # Initialize an empty list to store time differences.
    nwdts = []
    
    # Iterate through the 'a' tags.
    for j in range(len(tags)):
        # Extract the file name (last path component) from the 'href' attribute.
        dts = tags[j].attrs['href'].split('/')[-1]
        
        # Split the file name on '_' and take the scan start field (third from the end).
        lsfst = dts.split('_')
        dtstr = lsfst[-3]
        
        # Convert the timestamp to a datetime object
        # (requires 'from datetime import datetime').
        ndnow = datetime.strptime(dtstr, 's%Y%j%H%M%S%f')
        
        # Difference between the minute fields only; this assumes both times
        # fall within the same hour.
        tmdf = ndnow.minute - dnow.minute
        
        # Append the absolute time difference to the list.
        nwdts.append(abs(tmdf))
    
    # Find the smallest time difference and the index of the link it belongs to.
    id1 = min(nwdts)
    nps = nwdts.index(id1)
    
    # Extract the filename and button text from the selected 'a' tag.
    lsfst = tags[nps].attrs['href'].split('/')[-1]
    print(lsfst + ' - ' + tags[nps].button.text)
    
    # Append the filename to the 'ls2down' list.
    ls2down.append(lsfst)
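The selection step can also be written without any HTML at hand. The sketch below is an alternative to the minute-field comparison used in the loop: it compares full timestamps, so it also works across hour boundaries. The helper names `scan_start` and `closest_file` are hypothetical, and the first and third file names are made up for illustration (only the middle one appears in this notebook).

```python
from datetime import datetime

def scan_start(filename):
    """Return the scan start time encoded in a GOES file name."""
    # The third field from the end looks like 's20223061734250':
    # year, day of year, hour, minute, second, then tenths of a second.
    stamp = filename.split('_')[-3]
    return datetime.strptime(stamp, 's%Y%j%H%M%S%f')

def closest_file(filenames, target):
    """Pick the file whose scan start is nearest to the target time."""
    return min(filenames, key=lambda name: abs(scan_start(name) - target))

files = [
    'OR_ABI-L1b-RadM1-M6C13_G16_s20223061730250_e20223061730319_c20223061730351.nc',  # made up
    'OR_ABI-L1b-RadM1-M6C13_G16_s20223061734250_e20223061734319_c20223061734351.nc',
    'OR_ABI-L1b-RadM1-M6C13_G16_s20223061738250_e20223061738319_c20223061738351.nc',  # made up
]

# The parsed times are naive, so the target must be a naive datetime as well.
target = datetime(2022, 11, 2, 17, 35)
print(closest_file(files, target))
```

If the reference time is timezone-aware, it would need to be converted to a naive UTC value (or the parsed times made aware) before subtracting.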

In [None]:
# Check if the 'str' key in the dini dictionary contains 'M' (a mesoscale domain such as 'M1' or 'M2')
if 'M' in dini['str']:
    # Both mesoscale domains share one bucket prefix, so reduce the value to 'M'
    dini.update({'str':'M'})

# Construct the base URL for downloading the data using a formatted string.
# This URL includes placeholders for satellite (sat), product (prd), domain (str), and time (tme),
# which are filled in from the dini dictionary.
url_base = 'https://noaa-goes{sat}.s3.amazonaws.com/{prd}{str}/{tme:%Y}/{tme:%j}/{tme:%H}/'.format(**dini)

# Initialize an empty list to store the complete URLs for downloading the data files
## complete your code here:

# Iterate over each item in the list of file identifiers (ls2down)
for urli in ls2down:
    # Uncomment the next line to print each constructed URL before adding it to the list (for debugging)
    # print(f'{url_base}{urli}')
    # Append the full URL for each file to the urls2dwn list. This URL is constructed by combining
    # the base URL with the specific file identifier (urli), allowing for direct access to each file.
    
    ## complete your code here:
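One possible way to finish the loop above, shown as a self-contained sketch. The dictionary values and the single file name are taken from the example URL used later in this notebook; a list comprehension replaces the explicit `append` loop, which is a stylistic choice rather than the only option.

```python
from datetime import datetime

# Parameters matching the example file used later in this notebook.
dini = {
    'sat': '16',
    'prd': 'ABI-L1b-Rad',
    'str': 'M',                              # mesoscale prefix after the 'M' check
    'tme': datetime(2022, 11, 2, 17, 34),    # day of year 306
}

# Bucket URL down to the hour folder: product+domain / year / day of year / hour.
url_base = 'https://noaa-goes{sat}.s3.amazonaws.com/{prd}{str}/{tme:%Y}/{tme:%j}/{tme:%H}/'.format(**dini)

# File names as they would come out of the scraping step.
ls2down = [
    'OR_ABI-L1b-RadM1-M6C13_G16_s20223061734250_e20223061734319_c20223061734351.nc',
]

# Join the base URL with each file name to get the full download URLs.
urls2dwn = [f'{url_base}{urli}' for urli in ls2down]

for full_url in urls2dwn:
    print(full_url)
```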
    

In [None]:
import requests
import os

# Prompt the user to enter the download folder path. 
download_folder = input("Enter the path to the download folder: ") # you can name the download folder "input"
 
# URL of the file to download.
urld = 'https://noaa-goes16.s3.amazonaws.com/ABI-L1b-RadM/2022/306/17/OR_ABI-L1b-RadM1-M6C13_G16_s20223061734250_e20223061734319_c20223061734351.nc'

# Send an HTTP GET request to the URL.
response = requests.get(urld)

# Check if the response is successful (status code 200).
if response.status_code == 200:
    # Extract the filename from the URL.
    filename = os.path.basename(urld)

    # Construct the complete path to save the file in the chosen download folder.
    file_path = os.path.join(download_folder, filename)

    # Write the content to the file in binary mode.
    with open(file_path, "wb") as file:
        file.write(response.content)

    print(f"File '{filename}' downloaded and saved to '{download_folder}'.")
    # Note: if you just pressed Enter at the prompt, the file is saved in the current directory.
else:
    print("Failed to download the file. Check the URL or your internet connection.")

In [None]:
# Display the URLs for downloading the data files.



In [None]:
import requests
import os

try:
    # Prompt the user to enter the download folder path (empty input means the current directory).
    download_folder = input("Enter the path to the download folder: ") or "."  # you can name the download folder "input"

    # Ensure the download folder exists.
    if not os.path.isdir(download_folder):
        print(f"Creating download folder at '{download_folder}'.")
        os.makedirs(download_folder, exist_ok=True)

    # URL of the file to download.
    urld = 'https://noaa-goes16.s3.amazonaws.com/ABI-L1b-RadM/2022/306/17/OR_ABI-L1b-RadM1-M6C13_G16_s20223061734250_e20223061734319_c20223061734351.nc'

    print("Attempting to download the file...")
    # Send an HTTP GET request to the URL.
    response = requests.get(urld, stream=True)

    # Check if the response is successful (status code 200).
    if response.status_code == 200:
        # Extract the filename from the URL.
        filename = os.path.basename(urld)

        # Construct the complete path to save the file in the chosen download folder.
        file_path = os.path.join(download_folder, filename)

        # Write the content to the file in binary mode.
        with open(file_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        print(f"File '{filename}' downloaded and saved to '{download_folder}'.")
    else:
        print(f"Failed to download the file. Server responded with status code: {response.status_code}. Check the URL or your internet connection.")
except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
import requests
import os

# Specify the local directory where you want to save the files (you can name the download folder "input").
# An empty answer falls back to the current directory.
local_directory = input("Enter the path to the download folder: ") or "."

# Ensure that the local directory exists; create it if it doesn't.
os.makedirs(local_directory, exist_ok=True)

# Iterate through the URLs and download files.
for urld in urls2dwn:
    # Extract the filename from the URL.
    ntw = urld.split('/')[-1]
    
    # Construct the complete path to save the file in the local directory.
    file_path = os.path.join(local_directory, ntw)
    
    # Send an HTTP GET request to the URL.
    resp = requests.get(urld)
    
    # Check if the response is successful (status code 200).
    if resp.status_code == 200:
        # Write the content to the file in binary mode.
        with open(file_path, "wb") as file:
            file.write(resp.content)
        print(f"File '{ntw}' downloaded and saved to '{local_directory}'.")
    else:
        print(f"Failed to download '{ntw}' from the URL: {urld}")

## complete your code here:



In [None]:
# Import necessary AWS SDK and configuration modules.
## complete your code here:




In [None]:
# Select the AWS S3 bucket name, remote file path, and local destination file name.
## complete your code here:
s3_bucket = 
bucket_file = 'ABI-L2-MCMIPF/2022/273/03/OR_ABI-L2-MCMIPF-M6_G16_s20222730350207_e20222730359521_c20222730400027.nc'
local_file = 'OR_ABI-L2-MCMIPF-M6_G16_s20222730350207_e20222730359521_c20222730400027.nc'

In [None]:
# Connect to the AWS S3 bucket using the Boto3 client.
## complete your code here:



In [None]:
# Download the file from the AWS S3 bucket to the local destination.
## complete your code here:
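The four AWS steps above can be sketched as a single helper, assuming `boto3` is installed and that the bucket allows anonymous reads. The function names `local_name` and `download_public_object` are hypothetical, and the bucket name `'noaa-goes16'` in the usage comment is inferred from the download URLs earlier in this notebook.

```python
import os

def local_name(bucket_file):
    """Local file name for an object key: the part after the last '/'."""
    return os.path.basename(bucket_file)

def download_public_object(s3_bucket, bucket_file, local_file=None):
    """Download one object from a public S3 bucket without AWS credentials."""
    # Imported here so the helper above stays usable without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # UNSIGNED turns off request signing, so no AWS account is needed.
    s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

    # Fall back to the file name taken from the key when no name is given.
    local_file = local_file or local_name(bucket_file)
    s3_client.download_file(s3_bucket, bucket_file, local_file)
    return local_file

bucket_file = 'ABI-L2-MCMIPF/2022/273/03/OR_ABI-L2-MCMIPF-M6_G16_s20222730350207_e20222730359521_c20222730400027.nc'
print(local_name(bucket_file))

# Example call (needs network access and boto3; assumed bucket name):
# download_public_object('noaa-goes16', bucket_file)
```

Compared with the `requests` approach, the client works with bucket and key names directly, so no URL has to be assembled by hand.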

