<font color="white">.</font> | <font color="white">.</font> | <font color="white">.</font>
-- | -- | --
![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg) | <h1><font size="+3">ASTG Python Courses</font></h1> | ![NASA](https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png)

---

<CENTER>
<H1 style="color:red">
Accessing Web Resources with Python
</H1>
</CENTER>

In [None]:
from __future__ import print_function

## <font color='red'>Reference Documents</font>

* <a href="http://zetcode.com/python/requests/">Python Requests Tutorial</a>
* <a href="https://realpython.com/python-requests/">Python’s Requests Library (Guide)</a>
* <a href="https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/">What is web scraping</a>
* <a href="https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184">Building a Web Scraper from start to finish</a>
* <a href="https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/">Ultimate Guide to Web Scraping with Python Part 1: Requests and BeautifulSoup</a>
* <a href="https://www.dataquest.io/blog/web-scraping-tutorial-python/">Tutorial: Web Scraping with Python Using Beautiful Soup</a>
* <a href="https://realpython.com/beautiful-soup-web-scraper-python/">Beautiful Soup: Build a Web Scraper With Python</a>
* <a href="https://stackabuse.com/download-files-with-python/">Download Files with Python</a>

## <font color='red'>What will be Covered?</font>
+ Accessing Web Pages with Requests
+ Introduction to JSON
+ Web Scraping with JSON
+ Web Scraping with Beautiful Soup

![fig_scrap](https://miro.medium.com/max/1400/1*4BnBQE9Bu-EQ-gGz25x8pg.png)
Image Source: gurutechnolabs.com

# <font color='red'>Python `requests` Module</font>

* Requests is a simple and elegant Python HTTP (Hypertext Transfer Protocol) library. 
* It provides methods for accessing Web resources via HTTP. 
* The HTTP request returns a Response Object with all the response data (content, encoding, status, etc.).
* Requests is a third-party package, not part of the Python standard library; it can be installed with `pip install requests`.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from skimage import io

In [None]:
import os
import pprint
import requests as reqs

print(reqs.__version__)
print(reqs.__copyright__)

**Reading a Web Page**
- We use the function `get()` to grab the content of a web page into an object.
- We extract from the object the HTML content of the page.

In [None]:
resp = reqs.get("http://www.webcode.me")

We can get all information from the `resp` object:

In [None]:
print(resp.text)

We can use the `re` module to strip the HTML markup from the content.

In [None]:
import re

resp = reqs.get("http://www.webcode.me")

content = resp.text

stripped = re.sub('<[^<]+?>', '', content)
print(stripped)

- When you make a request, `Requests` makes educated guesses about the encoding of the response based on the HTTP headers. 
- The text encoding guessed by `Requests` is used when you access `resp.text`. 
- You can find out what encoding `Requests` is using, and change it.

In [None]:
resp.encoding

In [None]:
resp.encoding = 'utf-8'

If you change the encoding, `Requests` will use the new value of `resp.encoding` whenever you call `resp.text`.
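To see why the choice of encoding matters, the sketch below decodes the same bytes in two ways. It uses only the standard library, and the byte string `raw` is a made-up stand-in for the raw bytes of a response (what `Requests` exposes as `resp.content`).

```python
# Hypothetical payload: the UTF-8 bytes a server might send for "café".
raw = "café".encode("utf-8")

# Decoding with the matching encoding recovers the original text.
as_utf8 = raw.decode("utf-8")

# Decoding with a different encoding raises no error here,
# but the accented character comes out garbled.
as_latin1 = raw.decode("latin-1")

print(as_utf8)
print(as_latin1)
```

Setting `resp.encoding` has the same effect on `resp.text`: the bytes stay the same, only their interpretation changes.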

**HTTP Request**
- An HTTP request is a message sent from the client to the server to retrieve some information or to perform some action.
- The `request()` function of `requests` creates and sends a new request.
- Alternatively, we can use the convenience functions of the `requests` module: `get()`, `post()`, or `put()`.

Create a `GET` request and send it to the web site.

In [None]:
resp = reqs.request(method='GET', url="http://www.webcode.me")
print(resp.text)

**Getting the Status of a Web Page**
- We perform an HTTP request with the `get()` method and check for the returned status.
- 200 is the standard response for a successful HTTP request, and 404 indicates that the requested resource could not be found.

In [None]:
resp = reqs.get("http://www.webcode.me")
print(resp.status_code)

In [None]:
resp = reqs.get("http://www.webcode.me/news")
print(resp.status_code)
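The numeric codes can be turned into readable phrases with the standard library's `http.HTTPStatus`. The helper `is_success` below is a hypothetical name introduced for illustration; it simply tests for the 2xx range.

```python
from http import HTTPStatus

# Look up the standard reason phrase for a few status codes.
for code in (200, 404):
    status = HTTPStatus(code)
    print(code, status.phrase)


def is_success(code):
    """Return True for status codes in the 2xx (success) range."""
    return 200 <= code < 300


print(is_success(200), is_success(404))
```

In a script, a check like `is_success(resp.status_code)` could guard the code that processes the page content.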

**Other Information**

In [None]:
resp = reqs.get("http://www.webcode.me")

print("\t URL:      {}".format(resp.url))
print("\t Encoding: {}".format(resp.encoding))
print("\t Time:     {}".format(resp.elapsed))

print("Server:         {}".format(resp.headers['server']))
print("CONNECTION:     {}".format(resp.headers['CONNECTION']))
print("Date:           {}".format(resp.headers['Date']))
print("Last modified:  {}".format(resp.headers['last-modified']))
print("Content type:   {}".format(resp.headers['content-type']))

**`requests` `head()` Method**
- The `head()` method retrieves document headers. 
- The headers consist of fields, including date, server, content type, or last modification time.

In [None]:
resp = reqs.head("http://www.webcode.me")

print("Server:         {}".format(resp.headers['server']))
print("CONNECTION:     {}".format(resp.headers['CONNECTION']))
print("Date:           {}".format(resp.headers['Date']))
print("Last modified:  {}".format(resp.headers['last-modified']))
print("Content type:   {}".format(resp.headers['content-type']))

**`requests` `get()` Method**
- The `get()` method issues a GET request to the server. 
- The GET method requests a representation of the specified resource.

The following script sends a variable with a value to the `httpbin.org` server. The variable is specified directly in the URL.

In [None]:
resp = reqs.get("https://httpbin.org/get?name=Peter")
print(resp.text)

- The `get()` method takes a params parameter where we can specify the query parameters.
     - The beginning of the query parameters is denoted by a question mark (`?`).
     - The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (`key=value`).
     - Every URL can have multiple query parameters, which are separated from each other by an ampersand (`&`)

```python
resp = reqs.get("https://httpbin.org/get?name=Peter&age=23")
```
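The structure of a query string can be explored offline with the standard library's `urllib.parse`. This is only a sketch of the encoding rules described above; it does not send any request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://httpbin.org/get"

# Join key=value pairs with "&" and attach them after "?".
query = urlencode({"name": "Peter", "age": 23})
full_url = base_url + "?" + query
print(full_url)

# Going the other way: split the URL and decode its query parameters.
# Every value comes back as a list of strings.
parsed = parse_qs(urlsplit(full_url).query)
print(parsed)
```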

We send a `get()` request to the web site and pass the data, which is specified in the `params` parameter:

In [None]:
payload = {'name': 'Peter'}
resp = reqs.get("https://httpbin.org/get", params=payload)

`payload` is a dictionary of pairs of keys/values:

In [None]:
payload = {'name': 'Peter', 'age': 23}
resp = reqs.get("https://httpbin.org/get", params=payload)

In [None]:
print(resp.url)

In [None]:
print(resp.text)

You can also pass a list of items as a value:

In [None]:
payload = {'name': ['Peter', 'Johns'], 'age': 23}
resp = reqs.get("https://httpbin.org/get", params=payload)

In [None]:
print(resp.url)
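The URL that `requests` would build can also be inspected without contacting the server, by preparing the request instead of sending it. This sketch assumes the `requests` package is installed; `prepared` is a name chosen here for illustration.

```python
import requests as reqs

payload = {'name': ['Peter', 'Johns'], 'age': 23}

# Build the request object and prepare it; nothing is sent over the network.
prepared = reqs.Request("GET", "https://httpbin.org/get", params=payload).prepare()

# A list value is expanded into one key=value pair per item.
print(prepared.url)
```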

**Other Methods**

```python
requests.post('https://httpbin.org/post', data={'key':'value'})
requests.put('https://httpbin.org/put', data={'key':'value'})
requests.delete('https://httpbin.org/delete')
requests.patch('https://httpbin.org/patch', data={'key':'value'})
requests.options('https://httpbin.org/get')
```

### Summary of `requests` Methods 

| Method	| Description |
| :--- | :--- |
| delete(url, args)	| Sends a DELETE request to the specified url | 
| get(url, params, args)	| Sends a GET request to the specified url | 
| head(url, args)	| Sends a HEAD request to the specified url | 
| patch(url, data, args)	| Sends a PATCH request to the specified url | 
| post(url, data, json, args)	| Sends a POST request to the specified url | 
| put(url, data, args)	| Sends a PUT request to the specified url | 
| request(method, url, args)	| Sends a request of the specified method to the specified url| 

---

# <font color='red'>Web Scraping</font>

![fig_json](https://daveberesford.co.uk/wp-content/uploads/2019/02/data-scraping-960x594.png)
Image Source: daveberesford.co.uk

> Web scraping is a mechanism of collecting large amounts of data from the webpage and store the data into any required format which further helps us to perform analysis on the extracted data.


- Web scraping is used to extract or “scrape” data from any web page on the Internet.
- Web scraping is performed using a “**web scraper**” (or a “bot” or a “web spider” or “web crawler”). 
- A web-scraper is a program that goes to web pages, downloads the contents, extracts data out of the contents and then saves the data to a file or a database.



Web scraping involves a three-step process:

1. **Step 1**: Send an HTTP request to the webpage you want to scrape
   - The server responds to the request by returning the HTML content of the target webpage.
2. **Step 2**: Parse the HTML content
   - A parser is needed to create a nested structure of the HTML data. 
3. **Step 3**: Pull data out of HTML
   - We use Python packages such as `json` and Beautiful Soup to pull out the data and store it.
     
![fig_scrap](https://www.scrapehero.com/wp/wp-content/uploads/2018/01/xhow-does-a-web-scraper-work-simple-2.png.pagespeed.ic.MeNRriGmi9.webp)
Image Source: scrapehero.com
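To make the idea of a "nested structure" in Step 2 concrete, here is a minimal sketch that uses the standard library's `html.parser` on a small, made-up HTML string. The class `OutlineParser` is a hypothetical helper written for this illustration; it is far simpler than a real parser such as Beautiful Soup.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-formed HTML in which every tag is explicitly closed.
        self.depth -= 1


html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
parser = OutlineParser()
parser.feed(html)

# Print the tags indented by their depth in the document tree.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```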

A web scraper crawls websites, extracts data from them, transforms the data into a usable structured format, and loads it into a file or database for subsequent use.

A typical web scraper has the following components:

![fig_scrap](https://www.scrapehero.com/wp/wp-content/uploads/2018/01/xComponents-of-web-scraper1.png.pagespeed.ic.uNMyC_Y5W4.webp)

### <font color="blue">Web Scraping Rules</font>

![fig_ethics](https://hackernoon.com/hn-images/0*MPt2rectMhwklT63.jpg)

As reference, check: <a href="https://info.scrapinghub.com/web-scraping-guide/web-scraping-best-practices">The Web Scraping Best Practices Guide</a> or watch the video <a href="https://www.youtube.com/watch?v=i7DEy-ZB_Lk">Is Web Scraping Legal?</a>

- Check a website’s Terms and Conditions before you scrape it.
- Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
- The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
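One simple way to follow the "one request per second" guideline is to pause between requests. The function `fetch_politely` and its parameters are hypothetical names for this sketch; the `fetch` argument stands for whatever function actually retrieves a page (for example `reqs.get`).

```python
import time


def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Usage with a stand-in fetch function, so no network access is needed.
pages = fetch_politely(["page1", "page2", "page3"], fetch=str.upper, delay=0.1)
print(pages)
```

Passing the fetch function as an argument keeps the throttling logic separate from the HTTP code, which also makes it easy to test.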

## <font color='blue'>Web Scraping with JSON</font>

### <font color="green"> What is JSON?</font>

* JSON (JavaScript Object Notation) is a popular data format used for representing structured data. 
* It is a text format that is language independent and can be used in Python, Perl among other languages. 
* JSON format is used for data communications between servers and web applications.
* It is built on two structures:

     - A collection of name/value pairs. This is realized as an object, record, dictionary, hash table, keyed list, or associative array.
     - An ordered list of values. This is realized as an array, vector, list, or sequence.
     
     
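The main functions of Python's `json` module are:

* `dump()`: serialize a Python object and write it to a file as JSON.
* `load()`: read JSON from a file and decode it into Python objects.
* `dumps()`: serialize a Python object into a JSON-formatted string.
* `loads()`: decode a JSON string into Python objects.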

**Example of JSON Data**

```python
{
    "stations": [
        {
            "acronym": "BLD",
            "name": "Boulder Colorado",
            "latitude": 40.00,
            "longitude": -105.25
        },
        {
            "acronym": "BHD",
            "name": "Baring Head Wellington New Zealand",
            "latitude": -41.28,
            "longitude": 174.87
        }
    ]
}
```
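As a small illustration, the station data can be decoded with `json.loads()` and then handled like ordinary Python dictionaries and lists. The variable names `text` and `data` are chosen for this sketch only.

```python
import json

# The station data from the example, stored as a JSON string.
text = """
{
    "stations": [
        {"acronym": "BLD", "name": "Boulder Colorado",
         "latitude": 40.00, "longitude": -105.25},
        {"acronym": "BHD", "name": "Baring Head Wellington New Zealand",
         "latitude": -41.28, "longitude": 174.87}
    ]
}
"""

# The JSON object becomes a dict and the JSON array becomes a list.
data = json.loads(text)

for station in data["stations"]:
    print("{acronym}: {latitude:7.2f}, {longitude:8.2f}".format(**station))
```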

**Another Example of JSON Data**

We consider an online database, <a href="http://ip-api.com">IP-API.com</a>, that returns GeoIP data in JSON format. Simply opening <a href="http://ip-api.com/json/54.148.84.95">http://ip-api.com/json/54.148.84.95</a> will return the following JSON result:


```python
{
  "as": "AS16509 Amazon.com, Inc.",
  "city": "Boardman",
  "country": "United States",
  "countryCode": "US",
  "isp": "Amazon",
  "lat": 45.8696,
  "lon": -119.688,
  "org": "Amazon",
  "query": "54.148.84.95",
  "region": "OR",
  "regionName": "Oregon",
  "status": "success",
  "timezone": "America\/Los_Angeles",
  "zip": "97818"
}
```

To see your own Geolocation data in JSON format, just open <a href="http://ip-api.com/json/">http://ip-api.com/json/</a>.

In [None]:
resp = reqs.request(method='GET', url="http://ip-api.com/json/")
print(resp.text)

### <font color="green"> Serialization and Deserialization</font>

> … the process of translating data structures or object state into a format that can be stored … or transmitted … and reconstructed later (possibly in a different computer environment). (Wikipedia)

* **Serialization** is a process of transforming objects or data structures into byte streams or strings. 
* These byte streams can then be stored or transferred easily. 
* This allows the developers to save, for example, configuration data or user's progress, and then store it (on disk or in a database) or send it to another location.
* The reverse process of serialization is known as **deserialization**.

### Why do we need serialization?

We need Serialization for the following reasons:

- **Communication**: A serialized object can be transmitted over a network, which lets multiple computer systems share and reconstruct the same objects.
- **Caching**: Building a large object can take longer than deserializing it, so caching the serialized form of large objects saves time.
- **Deep Copy**: Cloning process is made simple by using Serialization. An exact replica of an object is obtained by serializing the object to a byte array, and then de-serializing it.
- **Portability**: The major advantage of Serialization is that it works across different architectures or Operating Systems.
- **Persistence**: The State of any object can be directly stored by applying Serialization on to it and stored in a database so that it can be retrieved later.

![fig_sd](https://miro.medium.com/max/1150/1*9zJJ65xk8agiQXlqd7nYUw.jpeg)
Image Source: Phonlawat Khunphet

In [None]:
import json

**Serialization**

We use the `dump()` function, which takes two arguments:
* The data object to be serialized.
* The file object to which it will be written (opened in text mode).

In [None]:
x = {
  "name": "John",
  "age": 30,
  "married": True,
  "divorced": False,
  "children": ("Ann","Billy"),
  "pets": None,
  "cars": [
    {"model": "BMW 230", "mpg": 27.5},
    {"model": "Ford Edge", "mpg": 24.1}
  ]
}

file_name = "Sample.json"
with open(file_name, "w") as fid: 
     json.dump(x, fid)

In [None]:
!cat Sample.json

**Deserializing JSON**

* Deserialization is the opposite of serialization, i.e. the conversion of JSON data into the corresponding Python objects.
* We use the `load()` function to read from a file object (`loads()` does the same from a string). The result is usually a `dict` or a `list`, depending on the root element of the JSON document.

In [None]:
with open(file_name, "r") as fid: 
     z = json.load(fid)
        
print(z)
print()
for key in z:
    print("{:>12}: {}".format(key, z[key]))
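A round trip through a string, with `dumps()` and `loads()`, shows how Python types are mapped to JSON types and back. The dictionary `record` is a shortened, made-up variant of the data above.

```python
import json

record = {"married": True, "pets": None, "children": ("Ann", "Billy"), "age": 30}

# Serialize: True -> true, None -> null, tuple -> JSON array.
encoded = json.dumps(record)
print(encoded)

# Deserialize: the JSON array comes back as a list, not a tuple.
decoded = json.loads(encoded)
print(decoded)
print(type(decoded["children"]))
```

The round trip therefore does not always return an identical object: JSON has no tuple type, so tuples are restored as lists.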

### <font color="green">Scraping the NASA Astronomy Picture Of the Day (APOD) Webpage </font>

- We want to be able to obtain from the webpage <a href="https://api.nasa.gov/planetary/apod"> https://api.nasa.gov/planetary/apod</a>,  the Astronomy picture of the day for a given day and plot the image.
- We access the webpage (using a set of parameters) and retrieve the content of the page as a JSON object.

**Query Parameters**

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
|`date` | YYYY-MM-DD | today | Date of the APOD image to retrieve |
|`start_date` | YYYY-MM-DD | none | The start of a date range, when requesting data for a range of dates. Cannot be used with `date`. |
|`end_date` | YYYY-MM-DD | today | The end of the date range, when used with `start_date`. |
| `count` |	int	| none	| If this is specified then count randomly chosen images will be returned. Cannot be used with `date` or `start_date` and `end_date`. |
| `hd` | bool | False | Retrieve the URL for the high resolution image |
| `api_key` | string | DEMO_KEY | <a href="https://api.nasa.gov/">https://api.nasa.gov/</a> key for expanded usage |


In [None]:
url = "https://api.nasa.gov/planetary/apod"
date = "2020-12-26"
payload = {'api_key': "DEMO_KEY",
          'date': date,
          'hd': True}

page_content = reqs.get(url, params=payload)

Process the data with JSON

In [None]:
if page_content.status_code == 200:
   json_page = json.loads(page_content.text)

The `json_page` variable is a dictionary of various keys and values. Let’s take a look at the keys of this variable:

In [None]:
for x in json_page:
    print(x)

Print the keys and values:

In [None]:
for x in json_page:
    print("{} --> {}".format(x, json_page[x]))
    print()

In [None]:
pprint.pprint(json_page)

Plot images:

In [None]:
if json_page["media_type"] == "image":
    io.imshow(io.imread(json_page["url"]))
    io.show()

<font color="red">If you want to download the file on your local system:</font>

In [None]:
import urllib.request

url_name = json_page["url"]
loc_file_name = os.path.basename(url_name)

urllib.request.urlretrieve(url_name, loc_file_name)

If you want to view the image through a browser, use:

In [None]:
import webbrowser
webbrowser.open(json_page["url"])

### <font color="green">Obtaining Mars Rover Photos</font>

In [None]:
rover_url = 'https://api.nasa.gov/mars-photos/api/v1/rovers/curiosity/photos'

payload = {'api_key': "DEMO_KEY",
           'sol': 1000}

response = reqs.get(rover_url, params=payload)
response_dictionary = response.json()
photos = response_dictionary['photos']

In [None]:
print(type(photos))
print(len(photos))

In [None]:
print(photos[0])

Extract the URL of each photo:

In [None]:
url_photos = list()
for photo in photos:
    url_photos.append(photo['img_src'])

print(url_photos[0])

Randomly select 20 pictures:

In [None]:
import random
url_pictures = random.sample(url_photos, 20)

Display the 20 photos:

In [None]:
fig, axes = plt.subplots(4, 5, figsize=(20, 20))
ax = axes.ravel()

for i in range(20):
    ax[i].imshow(io.imread(url_pictures[i]))

fig.tight_layout()

## Exercise 1:

Use the following code to list all the images in the provided year range:

```python
url = "https://images-api.nasa.gov/search"

payload = {
        "q": "apollo",
        "page": "1",
        "media_type": "image",
        "year_start": "2018",
        "year_end": "2020"}

response = reqs.get(url, params=payload)
response_dictionary = response.json()["collection"]["items"]
```

### <font color="green">Scraping the Earth Observatory Natural Event Tracker (EONET) Webpage </font>

- We want to be able to browse the webpage <a href="https://eonet.sci.gsfc.nasa.gov/api/v2.1/events"> https://eonet.sci.gsfc.nasa.gov/api/v2.1/events</a>,  to identify natural events on Earth.

**Query Parameters**

| Parameter | Value(s) |  Description |
| --- | --- | --- |
|`source` | Source ID | Filter the returned events by the <a href="https://eonet.sci.gsfc.nasa.gov/api/v2.1/sources">Source</a>. Multiple sources can be included in the parameter: comma separated, operates as a boolean OR. |
|`status` | open or closed | Events that have ended are assigned a closed date and the existence of that date will allow you to filter for only-open or only-closed events. Omitting the status parameter will return only the currently open events. |
| `limit` | int | Limits the number of events returned |
| `days` | int | Limit the number of prior days (including today) from which events will be returned. |



In [None]:
url = "https://eonet.sci.gsfc.nasa.gov/api/v2.1/events"
payload = {'source': "EO",
          'status': "open",
          'limit': 6,
          'days': 20}

page_content = reqs.get(url, params=payload)

In [None]:
if page_content.status_code == 200:
    json_page = json.loads(page_content.text)

In [None]:
for x in json_page:
    print(x)

In [None]:
pprint.pprint(json_page['events'])

---

## <font color='blue'>Web Scraping with Beautiful Soup</font>

- Web scraping allows you to download the HTML of a website and extract the data that you need.
- Beautiful Soup is a Python library for scraping data from websites.
- Beautiful Soup creates a parse tree from parsed HTML and XML documents.

![fig_bs4](https://www.opencodez.com/wp-content/uploads/2019/06/Web-Scraping-using-Beautiful-Soup.png)


In [None]:
from bs4 import BeautifulSoup as bso

In [None]:
source = reqs.get("http://www.webcode.me")
print(source)

**Create a beautiful soup object**

In [None]:
mysoup = bso(source.text, 'html.parser')

**Print the HTML content of the page using the `prettify` method**

In [None]:
print(mysoup.prettify())

**Obtain the title section of the page**

In [None]:
print(mysoup.title)

**Get the tag name**

In [None]:
print(mysoup.title.name)

**Get the text inside the tag**

In [None]:
print(mysoup.title.string)

In [None]:
print(mysoup.title.text)

**Beginning navigation**

In [None]:
print(mysoup.title.parent.name)

**Getting specific tags**
- HTML is made up of <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element">tags</a>. It stores all of its data in them, and in the midst of all that clutter lies the data we need. Some of the tags are:
     * `head`
     * `body`
     * `title`
     * `p` - for paragraph
     * `div` — indicates a division, or area, of the page.
     * `b` — bolds any text inside.
     * `i` — italicizes any text inside.
     * `table` — creates a table.
     * `form` — creates an input form.
- The `find` method searches for the first tag with the needed name.
- The `find_all` method searches for all tags with the needed tag name and returns them as a list.

Assume that we want to find paragraph tags `<p>`:

In [None]:
print(mysoup.p)

In [None]:
print(mysoup.p.text)

In [None]:
print(mysoup.find('p'))

In [None]:
print(type(mysoup.find('p')))

We can find all paragraphs:

In [None]:
print(mysoup.find_all('p'))

In [None]:
print(type(mysoup.find_all('p')))

To get the last paragraph only:

In [None]:
print(mysoup.find_all('p')[-1])

We can loop over the paragraphs:

In [None]:
for i, paragraph in enumerate(mysoup.find_all('p'), start=1):
    print("Paragraph Text {}: {}".format(i, paragraph.text))

In [None]:
body = mysoup.find_all('body')
print(body)

In [None]:
print("Type body:       ", type(body))
print("Type inner body: ", type(body[0]))

In [None]:
print(body[0].find_all('p'))

In [None]:
print(mysoup.find_all('body')[0].find_all('p'))

**Grab the text**

- Use the method `get_text`.

In [None]:
print(mysoup.get_text())

**Searching for tags by class and id**

- Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. 
- We can also use them when scraping to specify specific elements we want to scrape.

In [None]:
url = "http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html"
source = reqs.get(url)
mysoup = bso(source.text, 'html.parser')
print(mysoup.prettify())

- We can use the `find_all` method to search for items by `class` or by `id`. 
- In the example below, we’ll search for any `p` tag that has the class `outer-text`:

In [None]:
mysoup.find_all('p', class_='outer-text')

We can also look for any tag that has the class `outer-text`:

In [None]:
mysoup.find_all(class_="outer-text")

In [None]:
mysoup.find_all(class_="outer-text")[-1].get_text()

We can also search for elements by id:

In [None]:
mysoup.find_all(id="first")

In [None]:
mysoup.find_all(id="first")[0].get_text()

**Using CSS Selectors**

- <a href="https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors">CSS (Cascading Style Sheets)</a> is a declarative language that controls how webpages look in the browser. 
- The browser applies CSS style declarations to selected elements to display them properly. 
- A style declaration contains the properties and their values, which determine how a webpage looks.
- CSS selectors are how the CSS language allows developers to specify which HTML tags to style. 

Here are some examples:

- `p a` — finds all `a` tags inside of a `p` tag.
- `body p a` — finds all `a` tags inside of a `p` tag inside of a `body` tag.
- `html body` — finds all `body` tags inside of an `html` tag.
- `p.outer-text` — finds all `p` tags with a class of `outer-text`.
- `p#first` — finds all `p` tags with an id of `first`.
- `body p.outer-text` — finds any `p` tags with a class of `outer-text` inside of a `body` tag.

We can use the CSS selectors to search items inside webpages. `BeautifulSoup` objects support searching a page via CSS selectors using the `select` method. 

Find all the `p` tags in our page that are inside of a `body`:

In [None]:
mysoup.select("body p")

Find all the `b` tags in our page that are inside of a `p`:

In [None]:
mysoup.select("p b")

Find all `b` tags inside of a `p` tag inside of a `body`:

In [None]:
mysoup.select("body p b")

Find all `p` tags with an id of `first`:

In [None]:
mysoup.select('p#first')
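The same searches can be tried offline on a small inline document, which makes the expected matches easy to verify by eye. The HTML string below is made up for this sketch and only loosely resembles the page used above; it assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# A small, made-up document with classes and an id.
html = """
<html><body>
  <p id="first" class="outer-text">First <b>outer</b> paragraph.</p>
  <div><p class="inner-text">Inner paragraph.</p></div>
  <p class="outer-text">Second outer paragraph.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Search by tag name and class with find_all.
outer = soup.find_all("p", class_="outer-text")
print(len(outer))

# Search by id with a CSS selector.
first = soup.select("p#first")
print(first[0].get_text())

# Descendant selector: b tags inside a p tag inside the body.
bold = soup.select("body p b")
print([b.get_text() for b in bold])
```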

### <font color="blue"> Example: Extract the web link of the Astronomy Picture of the Day</font>

In [None]:
url = "https://apod.nasa.gov/apod/astropix.html"
source = reqs.get(url)
mysoup = bso(source.text, 'html.parser')

In [None]:
print(mysoup.prettify())

Print basic information of the Image of the Day:

In [None]:
print(mysoup.find('p').get_text())

In [None]:
href_comments = mysoup.find_all('a')
for a in href_comments:
    print(a.get_text())

Here we assume that the Picture of the Day is a video. If it is not the case, we will skip the next six cells.

In [None]:
mysoup.iframe

In [None]:
from IPython.display import HTML

HTML(str(mysoup.iframe))

In [None]:
mysoup.iframe['src']

In [None]:
src_list = [a['src'] for a in mysoup.select('iframe[src]')]
src_list

In [None]:
src_tags = mysoup.find_all(src=True)
src_tags

Find all tags that have an `href` attribute:

In [None]:
href_tags = mysoup.find_all(href=True)
href_tags

In [None]:
link_list = [l.get('href') for l in mysoup.find_all('a')]
link_list

In [None]:
link_list = [a['href'] for a in mysoup.select('a[href]')]
link_list

### <font color="blue"> Example: Weather Data for Greenbelt, Maryland</font>

In [None]:
url = "https://forecast.weather.gov/MapClick.php"
params = {'lat': 39.00079000000005,
          'lon': -76.88055999999995}

source = reqs.get(url, params=params)
mysoup = bso(source.text, 'html.parser')
print(mysoup.prettify())

**Extract Tonight's Forecast**

In [None]:
seven_day = mysoup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

In [None]:
period = tonight.find(class_="period-name").get_text()
print(period)

In [None]:
short_desc = tonight.find(class_="short-desc").get_text()
print(short_desc)

In [None]:
temp = tonight.find(class_="temp").get_text()
print(temp)

In [None]:
img = tonight.find("img")
desc = img['title']
print(desc)

**Extracting all Data**

We use CSS selectors to extract everything at once.

We select all items with the class `period-name` inside an item with the class `tombstone-container` in `seven_day`.

In [None]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
print(periods)

We can apply the same technique to get the other fields:

In [None]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_descs)

In [None]:
#temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
#print(temps)

In [None]:
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(descs)

We can combine the data into a Pandas DataFrame:

In [None]:
import pandas as pd
df_weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    #"temp": temps,
    "desc":descs
})
df_weather

**Detailed Forecast**

In [None]:
det_forecast = mysoup.find(id="detailed-forecast-body")

In [None]:
forecast_labels = det_forecast.find_all(class_="col-sm-2 forecast-label")

In [None]:
forecast_texts = det_forecast.find_all(class_="col-sm-10 forecast-text")

In [None]:
for a, b in zip(forecast_labels, forecast_texts):
    print("\033[1m {:>15}: \033[0m {:<}".format(a.get_text(), b.get_text()))
    print()

### Exercise 2: 

- Go to the website `https://eonet.sci.gsfc.nasa.gov/api/v2.1/events`
- Select a date range and the number of events you want to retrieve.
- Create a Pandas DataFrame that contains as columns the event type, date, latitude and longitude.

```python
url = "https://eonet.sci.gsfc.nasa.gov/api/v2.1/events"
payload = {'source': "EO",
          'status': "open",
          'limit': 6,
          'days': 20}
```

### <font color="blue"> Example: MODIS Aerosol Optical Thickness</font>

- Scientists use measurements from the MODIS sensor aboard NASA's Terra and Aqua satellites to map the amount of aerosol that is in the air all over the world. Because aerosols reflect visible and near-infrared light back to space, scientists can use satellites to make maps of where there are high concentrations of these particles.
- Scientists call this measurement aerosol optical thickness (AOT). 
- It is a measure of how much light the airborne particles prevent from traveling through the atmosphere. 
- Aerosols absorb and scatter incoming sunlight, thus reducing visibility and increasing optical thickness. An optical thickness of less than 0.1 indicates a crystal clear sky with maximum visibility, whereas a value of 1 indicates the presence of aerosols so dense that people would have difficulty seeing the Sun, even at mid-day!


In this example, we want to access the <a href="https://neo.sci.gsfc.nasa.gov/">NASA Earth Observations (NEO)</a> website to obtain the AOT measurements for a given day or a range of days (from 2000 to present).

**Select the day range of interest:**

In [None]:
import pandas as pd

beg_date = '2019-12-30'
end_date = '2019-12-31'

pd_series = pd.date_range(start=beg_date, end=end_date, freq='D')
dates = [dt.strftime('%Y-%m-%d') for dt in pd_series]

url_base = "https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MODAL2_M_AER_OD&year="

urls = [url_base+dt for dt in dates]

**Access the webpage for the first day:**

In [None]:
source = reqs.get(urls[0])
print(source)

**Parse the webpage and print its content:**

In [None]:
mysoup = bso(source.text, 'html.parser')
print(mysoup.prettify())

**Gather all the tags that have an `href` attribute:**

In [None]:
href_tags = mysoup.find_all(href=True)

**Find the `http` address that has the word `CSV`. That will give us the remote location of the file we want to read.**

In [None]:
for tag in href_tags:
    loc_url = tag["href"]
    if "CSV" in loc_url:
        csv_url = loc_url
        break

**Use `Pandas` to read the remote file:**

In [None]:
df = pd.read_csv(csv_url, index_col=0)
df

**It seems that `99999.0` corresponds to a missing value. Let us replace it with `NaN`:**

In [None]:
df = pd.read_csv(csv_url, index_col=0, na_values=99999.0)
df

**We can use `Xarray` to quickly visualize the data:**

In [None]:
import xarray as xr

da = xr.DataArray(df.values,
                  coords=[[float(lat) for lat in df.index], 
                          [float(lon) for lon in df.columns]],
                  dims=['latitude', 'longitude'])

da

In [None]:
da.plot()

## <font color="blue">Application</font>

- We want to get all the book names from the historic New York Times Best Sellers lists (Business section)
- The purpose is to:
     1. Help to compile my reading list in 2020
     2. Serve as reference to use Python for simple web analytics
- We use the Python packages: `Pandas`, `Requests` and `Beautiful Soup`
- We save data in `pickle` and `csv` formats.

The example was taken from: <a href="https://towardsdatascience.com/building-my-2020-reading-list-with-a-simple-python-script-b610c7f2c223">Building my 2020 reading list with a simple Python script</a> by Pan Wu.

In [None]:
import pandas as pd

# Collect one dictionary per book; the DataFrame is built once at the end
# (DataFrame.append was removed in pandas 2.0).
rows = []

beg_year = 2013
end_year = 2020
for the_year in range(beg_year, end_year):
    for the_month in range(1, 13):
        cur_month = str(the_month).zfill(2) # month in two digits
        # Build the URL for this year and month, then use Requests to get its content
        url = 'https://www.nytimes.com/books/best-sellers/{0}/{1}/01/business-books/'.format(the_year, cur_month)
        page = reqs.get(url)
        print(" --  try: {0}, {1} -- ".format(the_year, cur_month))

        # Ensure proper result is returned
        if page.status_code != 200:
            print("      Missing data for Year {} and Month {}".format(the_year, cur_month))
            continue

        # Use BeautifulSoup to parse the right elements out
        soup = bso(page.text, 'html.parser')

        # the specific class names are unique for this site and are the same across all the URLs
        top_list = soup.find_all("ol", {"class": "css-12yzwg4"})[0].find_all("div", {"class": "css-xe4cfy"})
        print("Year: {} - Month: {} - How many in the top list: {}".format(the_year, the_month, len(top_list)))

        # Loop through the Best Seller list of each Year-Month and collect the information
        for i, item in enumerate(top_list):
            book   = item.contents[0]
            title  = book.find_all("h3", {"class": "css-5pe77f"})[0].text
            author = book.find_all("p",  {"class": "css-hjukut"})[0].text
            review = book.get("href")
            # print("{0}, {1}; review: {2}".format(title, author, review))
            rows.append({'year': the_year, 'month': the_month, 'title': title,
                         'author': author, 'rank': i + 1, 'review': review})

# Build the DataFrame from the collected rows
nylist = pd.DataFrame(rows, columns=['year', 'month', 'title', 'author', 'rank', 'review'])

# write out the result to a pickle file for easy analysis later.
nylist.to_pickle("nylist.pkl")
nylist.to_csv("nylist.csv", index=False)

In [None]:
nylist

### Exercise 3

- Write a Python script that reads the pickle file `nylist.pkl` or the csv file `nylist.csv` and prints its content.