# Data from the Web
In this lesson, we will explore how to obtain data from the internet using Python.

By the end of this lesson, you should be able to:
1. Read data from a simple text web page
2. Parse data from a web page with html formatting
3. Download images from a web page using the image url
4. Download data from a website using the download link's url

## A Note on Web Scraping
Web scraping refers to gathering data from a website and storing it on another computer. In the U.S., it is generally legal to write programs that gather publicly available data from the web, as long as you don't use the information to harm the company or its website. There are a few guidelines to follow regarding web scraping:
1. Don't overwhelm sites by making excessive requests
2. Do give attribution to the sites where you retrieve your data
3. Only gather data from sites that are publicly available
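
The first guideline can be sketched in code by pausing between requests. The helper below is a minimal illustration rather than a standard recipe: the name `fetch_politely`, the `fetch` argument, and the one-second default delay are all assumptions made for this example.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each url, pausing between consecutive calls.

    `fetch` is any function that takes a url and returns its content;
    it is passed in so the pacing logic can be tried without a network.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # space out requests so the site is not overwhelmed
        results[url] = fetch(url)
    return results


# Offline demonstration with a stand-in fetch function (no requests are sent)
demo = fetch_politely(["page-a", "page-b"], fetch=str.upper, delay_seconds=0.0)
print(demo)
```

With the requests module, `fetch` could be something like `lambda url: requests.get(url, timeout=30).text`.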

In general, if you are gathering data from public websites for educational purposes (rather than writing code to support a business interest), then you are in the clear. In fact, many websites and online services, such as Google and OpenAI, provide pre-defined tools for accessing their data in the form of an Application Programming Interface (API). We will explore these tools in Lecture 10-2.

## Examples for this lesson: National Data Buoy Center
To get familiar with web scraping, we are going to download data from various components of NOAA's [National Data Buoy Center website](https://www.ndbc.noaa.gov/). This website hosts oceanographic and meteorological data collected by NOAA (and paid for by U.S. taxpayers), which is freely available for the public to access and use.

#### Import the modules required for this notebook

In [1]:
# import the requests and BeautifulSoup modules


### Part 1: Reading simple text data
To explore handling data from the web, we will start with simple text data. One easy set of data to visualize is the data from the National Data Buoy Center. Take a look at an example web page by following the link: https://www.ndbc.noaa.gov/data/realtime2/46092.txt

Now, we will access this data in Python using the requests module:

In [2]:
# define a url to the web page


# use the requests package to read the page to a response


# print the response status code


# print the response status description (reason)


# save the page text into a string


# split the text by newlines and print the first 3 lines
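
One possible way to fill in the steps above is sketched here. The helper names `fetch_text` and `first_lines` and the 30-second timeout are choices made for this example, and `sample_text` is made-up text standing in for the page so the line splitting can be tried without a network connection.

```python
import requests


def fetch_text(url, timeout=30):
    """Request a page, report its status, and return the body as a string."""
    response = requests.get(url, timeout=timeout)
    print(response.status_code, response.reason)  # status code and its description
    response.raise_for_status()  # raise an exception for 4xx/5xx responses
    return response.text


def first_lines(text, n=3):
    """Split text into lines and return the first n of them."""
    return text.splitlines()[:n]


# Made-up text standing in for a downloaded page (not real buoy data)
sample_text = "first line\nsecond line\nthird line\nfourth line\n"
for line in first_lines(sample_text):
    print(line)

# With a network connection, the same steps apply to the buoy page:
# text = fetch_text("https://www.ndbc.noaa.gov/data/realtime2/46092.txt")
# print(first_lines(text))
```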


#### &#x1F914; Mini-Exercise
Goal: The National Weather Service produces regional text-based weather forecasts that can be transmitted to boats operating in US Coastal Waters. Read in data from the most recent forecast to find out about current weather alerts. The alerts for the San Francisco Bay area can be accessed at https://tgftp.nws.noaa.gov/data/raw/fz/fzus56.kmtr.cwf.mtr.txt.

In [3]:
# enter your code here


### Part 2: Parsing data from html-formatted pages
Typically, web pages are not just text; they are formatted with HyperText Markup Language (HTML). Just like plain text pages, we can read in html-formatted pages as typical ascii-style text:

In [4]:
# define the url to the station


# use the requests module to get the data from the url


# read in the page text 


# split the text by lines


# search the lines for the one that has the "Water depth" information
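
The line-by-line search described in the comments above might look like the following sketch. The function name `lines_containing` is hypothetical, and `sample_page` is a made-up fragment rather than the real station page; on a live page, the text would come from `requests.get(url).text`.

```python
def lines_containing(page_text, keyword):
    """Return every line of page_text that contains keyword, stripped of outer whitespace."""
    return [line.strip() for line in page_text.splitlines() if keyword in line]


# Made-up html fragment (the values are invented for illustration)
sample_page = """<html>
<body>
  <p><b>Water depth:</b> 123 m</p>
  <p><b>Air temp height:</b> 4 m</p>
</body>
</html>"""

for line in lines_containing(sample_page, "Water depth"):
    print(line)  # the match still carries its html tags
```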


As you can see above, it can be a little cumbersome to search through all of the html code to find what you might be looking for on a website. To obtain data from these sites in a usable format, it helps to use tools that parse html code. Since html is a common language for web pages, there are several packages to organize and search html-scripted pages. One commonly used package is Beautiful Soup:

In [5]:
# use BeautifulSoup to parse the html data


# search for the division with the id "stn_metadata"
# store it as a variable called stn_metadata


# convert the stn_metadata to a string


# split the stn_metadata


# search for the "Water depth" in the stn_metadata
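
As a rough sketch of the steps above, the function below parses a page, pulls out one division by its id, and searches only that division. The name `search_division` is hypothetical, `"html.parser"` is Python's built-in parser (one of several that Beautiful Soup accepts), and `sample_html` is a made-up fragment, so the real station page may be structured differently.

```python
from bs4 import BeautifulSoup


def search_division(html, division_id, keyword):
    """Return the lines inside <div id=division_id> that contain keyword."""
    soup = BeautifulSoup(html, "html.parser")
    division = soup.find("div", id=division_id)
    if division is None:  # no division with that id on the page
        return []
    division_lines = str(division).splitlines()
    return [line.strip() for line in division_lines if keyword in line]


# Made-up html fragment; the depth value is invented for illustration
sample_html = """<html><body>
<div id="stn_metadata">
<p><b>Water depth:</b> 123 m</p>
</div>
<div id="footer">
<p>Water depth is reported in meters</p>
</div>
</body></html>"""

matches = search_division(sample_html, "stn_metadata", "Water depth")
print(matches)
```

Restricting the search to one division means the footer line is ignored, even though it also mentions "Water depth".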


#### &#x1F914; Mini-Exercise
Goal: Find the link to the NDBC's Facebook site.

The front page of the NDBC site contains three links to the Facebook, LinkedIn, and Twitter (X) pages for the NDBC social media. These are contained in an html division with the class "socialMediaContainer". Use the requests library to open the page and the BeautifulSoup module to parse the html text. Then, find the division with the social media information to find where the link to the Facebook page leads. In particular, print the line with the "NDBC on Facebook" string.

In [6]:
# enter your code here


### Part 3: Obtaining images from web pages
Anything that exists on a web page can be obtained and stored on your local system. For example, images that are hosted on web pages can be stored on your system. Consider again the Monterey Buoy 46092 as described here: https://www.ndbc.noaa.gov/station_page.php?station=46092

This page contains an image file for the buoy. Let's find the link to the buoy image and download it.

In [7]:
# provide a path to the buoy image


# use the requests module to get the image


# define an output name for the image


# open the file as a writable binary

    # iterate through the chunks and write to the file
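
The chunked download outlined above could be written along these lines. This is a sketch under a few assumptions: the names `write_chunks` and `download_binary`, the 8192-byte chunk size, and the timeout are choices made for this example, and the image address is left as a placeholder to be copied from the station page.

```python
import os
import tempfile

import requests


def write_chunks(chunks, output_path):
    """Write an iterable of bytes chunks to output_path; return the bytes written."""
    total = 0
    with open(output_path, "wb") as f:  # "wb" opens the file as a writable binary
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download_binary(url, output_path, chunk_size=8192, timeout=30):
    """Stream url to output_path without holding the whole file in memory."""
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        return write_chunks(response.iter_content(chunk_size=chunk_size), output_path)


# Offline demonstration of the writing step with stand-in chunks
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
bytes_written = write_chunks([b"abc", b"", b"de"], demo_path)
print(bytes_written)

# With a network connection (image_url is a placeholder for the real address):
# download_binary(image_url, "buoy_image.jpg")
```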


#### &#x1F914; Mini-Exercise
Goal: Find and store an image from your favorite web site. 

Any publicly available image on the web can be downloaded with the requests module. Download your favorite image in the block below and store it to your system. If you don't have a favorite site or image, you can download this comic: https://imgs.xkcd.com/comics/git_2x.png

In [8]:
# define the output_file path
output_file = 'lecture_10-1.jpg'

# enter your code here



# show the image using the code from Homework 4:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
fig = plt.figure()
img = mpimg.imread(output_file)
plt.imshow(img)
plt.show()
plt.close(fig)


### Part 4: Downloading data from links
A lot of data is provided online through links to files on remote data servers. For example, the historical buoy data on the NDBC site is stored in compressed format. We can have a look at the available historical data for the Monterey Buoy here: https://www.ndbc.noaa.gov/station_history.php?station=46092

Below, let's download the compressed 2022 data stored in the `46092h2022.txt.gz` file.

In [None]:
# enter the url here


# use the requests module to get the data


# define an output file 


# read the file in as chunks
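
A sketch of these steps follows, with one addition that goes beyond the cell: a quick look inside the compressed file using Python's standard `gzip` module. The names `download_file` and `preview_gzip` are hypothetical, the download link is left as a placeholder to be copied from the station history page, and the demonstration file contains made-up text rather than buoy data.

```python
import gzip
import os
import tempfile
from itertools import islice

import requests


def download_file(url, output_path, chunk_size=8192, timeout=60):
    """Stream the file at url to output_path in chunks."""
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(output_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)


def preview_gzip(path, n=3):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Offline demonstration: build a small compressed file, then preview it
demo_gz = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
with gzip.open(demo_gz, "wt", encoding="utf-8") as f:
    f.write("header\nrow 1\nrow 2\nrow 3\n")
preview = preview_gzip(demo_gz, n=2)
print(preview)

# With a network connection (data_url is a placeholder for the download link):
# download_file(data_url, "46092h2022.txt.gz")
# print(preview_gzip("46092h2022.txt.gz"))
```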
