# <u>Web Scraping</u>

## <u>Part \#3: XML and Semi-Structured Data</u>

### <u>Where do data scientists get their data?</u>

A data scientist needs sources of data to do his or her work. While you don't need this level of detail on the AP test, here is some information about the types of data available to data scientists:

* **Unstructured data:** Most data available on the web is unstructured data. Image files, sound files, video files, text files, and HTML files are all examples of unstructured data. These are some of the richest sources of data available, but they are also difficult to process (search, sort, classify, analyze, summarize, etc.). For example, while you were easily able to write algorithms in a previous chapter to apply image filters to the pixels of an image, it is very difficult to write functions to classify the objects featured in image files (people, trees, dogs, cats, mountains, etc.). Similarly, while a web browser can effectively display an HTML file, it is not easy to write an algorithm to give a short summary of the content of the webpage. You can certainly extract useful data from an HTML file, but you just saw that it takes a great deal of effort using a module such as **Beautiful Soup**. In fact, the name of the module describes the World Wide Web: it is a beautiful soup of unstructured data. Data scientists often spend a great deal of time finding their data. To read more, here's a great blog post on the topic: https://www.dataquest.io/blog/web-scraping-beautifulsoup/


* **Semi-structured data:** You already have experience with semi-structured data: a *.csv* file is an example of semi-structured data. This is data that supports automated processing of its contents, as we saw with Pandas during our chapter on Open Data. In other courses, you may learn about [JSON](https://www.w3schools.com/js/js_json_intro.asp), and in this notebook you will learn about [XML](https://www.w3schools.com/xml/xml_whatis.asp). As you will soon see, the beauty of XML is that you can work with data in an automated way.


* **Structured data:** This is data that is stored in a *database*. Organizations such as corporations, governments, and universities will have servers dedicated to their databases and database software. The data stored in a database is similar to what you have seen in *.csv* files, but has some additional structure. We will not work with databases (structured data) in this class, but they are a great source of information. If you want to learn more, you may want to look up the term *relational database* or *SQL*.

### <u>What is XML?</u>

XML stands for eXtensible Markup Language. While HTML is meant to display webpages, XML is meant to store/transport/describe data. Humans can read and understand XML, and computers can also process XML in an automated way.
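To make "computers can also process XML in an automated way" concrete, here is a minimal sketch that parses a small XML string. The XML snippet and its values are made up for illustration (loosely modeled on the weather feed used later in this notebook), and Python's standard-library `xml.etree.ElementTree` module is used here instead of Beautiful Soup only so that the example runs without a network connection or extra packages.

```python
import xml.etree.ElementTree as ET

# A made-up XML document: each tag describes the data it contains.
sample_xml = """
<current_observation>
    <station_id>KORD</station_id>
    <temp_f>51.0</temp_f>
    <weather>Overcast</weather>
</current_observation>
"""

root = ET.fromstring(sample_xml)   # Parse the text into a tree of elements

# Walk over every child element and print its tag name and its text
for child in root:
    print(child.tag, "->", child.text)

# Look up a single element by its tag name
station = root.find("station_id").text
print("Station:", station)
```

Notice that the code never needs to know which line a value is on; it asks for data by tag name.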

**<u>Task \#1:</u>** Read the following pages from W3Schools, then answer the questions below:

  * XML Introduction: https://www.w3schools.com/xml/xml_whatis.asp

  * XML Tutorial: https://www.w3schools.com/xml/default.asp

**<u>Question \#1:</u>** What does the "extensible" in "extensible markup language" mean?

**<u>Your Answer:</u>** Most XML applications will work as expected even if new data is added or removed.

**<u>Question \#2:</u>** Suppose you are trying to write an algorithm to process data in an automated way. Why would you prefer for your algorithm to work with *extensible* data? 

**<u>Your Answer:</u>** You can easily add and remove data.
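One way to see why extensibility matters is to run the same code against an original document and an extended one. The sketch below is only an illustration: the function name `get_temp` and both XML strings are hypothetical, and the standard-library `xml.etree.ElementTree` module is used so the example runs offline.

```python
import xml.etree.ElementTree as ET

def get_temp(xml_text):
    """Return the text of the <temp_f> element as a float."""
    root = ET.fromstring(xml_text)
    return float(root.find("temp_f").text)

# The original (made-up) document contains only a temperature.
original = "<current_observation><temp_f>51.0</temp_f></current_observation>"

# The extended document adds new elements before and after <temp_f>.
extended = (
    "<current_observation>"
    "<wind_mph>12.7</wind_mph>"
    "<temp_f>51.0</temp_f>"
    "<visibility_mi>10.00</visibility_mi>"
    "</current_observation>"
)

# The same function works on both documents without any changes.
print(get_temp(original))
print(get_temp(extended))
```

Because the function looks up data by tag name rather than by position, adding new tags does not break it.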

### <u>Chicago Weather Data</u>

Before you read any further, visit the page http://w1.weather.gov/xml/current_obs/KORD.xml and look at the data on the page. This is a feed of current weather conditions at Chicago's O'Hare airport. You can use this XML document to create a webpage or app that always knows the most up-to-date weather in Chicago. 

Let's take a look at the XML source for this feed:

In [1]:
from bs4 import BeautifulSoup       # Import BeautifulSoup
from urllib.request import urlopen  # Import urlopen

xml_page = urlopen("http://w1.weather.gov/xml/current_obs/KORD.xml")   # Opens whatever page we are requesting
bs_obj = BeautifulSoup(xml_page, 'xml')  # Extracts the xml data

print(bs_obj.prettify()[:1000])  # Indents the XML so it is easier to read ('pretty'), then shows the first 1000 characters

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="latest_ob.xsl" type="text/xsl"?>
<current_observation version="1.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.weather.gov/view/current_observation.xsd">
 <credit>
  NOAA's National Weather Service
 </credit>
 <credit_URL>
  https://weather.gov/
 </credit_URL>
 <image>
  <url>
   https://weather.gov/images/xml_logo.gif
  </url>
  <title>
   NOAA's National Weather Service
  </title>
  <link>
   https://www.weather.gov
  </link>
 </image>
 <suggested_pickup>
  15 minutes after the hour
 </suggested_pickup>
 <suggested_pickup_period>
  60
 </suggested_pickup_period>
 <location>
  Chicago, Chicago-O'Hare International Airport, IL
 </location>
 <station_id>
  KORD
 </station_id>
 <latitude>
  41.97972
 </latitude>
 <longitude>
  -87.90444
 </longitude>
 <observation_time>
  Last Updated on May 23 2022, 10:51 pm CDT
 </observation_ti

If you want to extract the current temperature from this data, just run the code below:

In [2]:
current_temp = bs_obj.find('temp_f').getText()  # Grabs the text found in the inner HTML of the <temp_f> tag
print(current_temp)

51.0


**<u>Question \#3:</u>** What will you probably notice about the output from the cell directly above if you were to re-run the code in this notebook at a later date?

**<u>Your Answer:</u>** The temp will change.
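One detail worth keeping in mind: the value extracted from the XML is text (a string), not a number, so it has to be converted before you can do arithmetic with it. The short sketch below uses a hard-coded example value in place of a live request, and the variable names are chosen just for this illustration.

```python
temp_text = "51.0"              # Example value, in the same form the feed returns it (a string)

temp_f = float(temp_text)       # Convert the text to a number
temp_c = (temp_f - 32) * 5 / 9  # Standard Fahrenheit-to-Celsius conversion

print(type(temp_text).__name__, "->", type(temp_f).__name__)
print(round(temp_c, 1), "degrees Celsius")
```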

**<u>Question \#4:</u>** Where was this temperature measurement taken? To answer this question, you must write code that uses the latitude and longitude data in the XML above to create a tuple. Give your answer as a tuple in the form (latitude, longitude).

**<u>Location Tuple:</u>**  (41.97972, -87.90444)

In [3]:
lat = float(bs_obj.find('latitude').getText())
long = float(bs_obj.find('longitude').getText())
lat, long

(41.97972, -87.90444)

**<u>Task \#2:</u>** Use the Google Maps API to create a marker map with the location of the coordinates you found in the previous question.
* *Hint: Look back at Metadata Part 4 to see how we displayed GPS coordinates*

In [4]:
# Import the gmaps python module and load in your API Key:
import gmaps
gmaps.configure(api_key="AIzaSyCLla6Q7krE9xNg6SnNMoGNIzjCLddE9EU")

In [5]:
from ipywidgets.embed import embed_minimal_html # Allows us to create a separate file for the Google Maps

markers = gmaps.marker_layer([(lat, long)])    # Create markers for each tuple/coordinate
markermap = gmaps.Map()                         # Create a GMap variable
markermap.add_layer(markers)                    # Add the layer of markers to GMap

embed_minimal_html('output/MarkerMap2.html', views=[markermap])
print("*** If no map appears, open the file 'output/MarkerMap2.html' to view the saved map. ***")

markermap

*** If no map appears, open the file 'output/MarkerMap2.html' to view the saved map. ***


Map(configuration={'api_key': 'AIzaSyCLla6Q7krE9xNg6SnNMoGNIzjCLddE9EU'}, data_bounds=[(41.97971, -87.90445), …

**<u>Question \#5:</u>** What does the following function, *tag_extractor(url, tag)*, do? Some structure is provided below to help you answer this question:

**<u>Your Answers (Below):</u>**  

* _What is the purpose of the function?_
    * Your answer: It finds the specified tag in an XML file and returns the text inside that tag.
* _What purpose does the parameter **url** serve in the function?_
    * Your answer: It is the link to the XML file.
* _What purpose does the parameter **tag** serve in the function?_
    * Your answer: It is the name of the specific tag to look for in the XML file.
* _What information is being returned by the function?_  
    * Your answer: It returns the text inside the specified tag in the XML file.

In [6]:
def tag_extractor(url, tag):    
    from bs4 import BeautifulSoup  
    from urllib.request import urlopen

    xml_page = urlopen(url)   #opens whatever page we are requesting
    bs_obj = BeautifulSoup(xml_page, 'xml')
    
    return bs_obj.find(tag).getText()

tag_extractor('http://w1.weather.gov/xml/current_obs/KORD.xml', 'temp_f')

'51.0'

**<u>Question \#6:</u>** How can *tag_extractor(url, tag)* be considered an abstraction that helps to manage the complexity of a computer program? 

**<u>Your Answer:</u>** Instead of repeatedly opening different XML files and rewriting the same code to get each tag, you can call one function like the one above, which hides those details and makes the program much easier to write and read.
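An abstraction can also hide error handling from the code that calls it. The sketch below is a hypothetical variation, not the notebook's *tag_extractor*: it takes XML text that has already been downloaded (so it runs offline), uses the standard-library `xml.etree.ElementTree` module, and returns a default value when the tag is missing. The name `extract_tag_from_text` and the sample XML are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def extract_tag_from_text(xml_text, tag, default=None):
    """Return the text of the first <tag> element, or default if it is missing."""
    root = ET.fromstring(xml_text)
    element = root.find(f".//{tag}")   # Search all elements nested under the root
    if element is None or element.text is None:
        return default
    return element.text.strip()        # Remove surrounding whitespace

# Made-up sample document for illustration
sample = (
    "<current_observation>"
    "<temp_f> 51.0 </temp_f>"
    "<image><url>https://weather.gov/images/xml_logo.gif</url></image>"
    "</current_observation>"
)

print(extract_tag_from_text(sample, "temp_f"))
print(extract_tag_from_text(sample, "dewpoint_f", default="not reported"))
```

The caller never has to think about parsing or missing tags; those details live in one place.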

**<u>Task \#3:</u>** Use *tag_extractor(url, tag)* to determine the date/time of the most recent temperature measurement.

In [7]:
tag_extractor('http://w1.weather.gov/xml/current_obs/KORD.xml', 'observation_time')

'Last Updated on May 23 2022, 10:51 pm CDT'

### <u>HTML as Output</u>

**<u>Question \#7:</u>** What is the purpose of the function *html_output()*? What kind of data does html_output() produce as output?

**<u>Your Answer:</u>** It creates an HTML file that shows the current temperature taken from the XML feed.

In [8]:
# define the function: 
def html_output():    
    output_string = """
    <html>
    <head>
        <style>
            body {
                background-color: #BBBBBB; 
                text-align: center;        
            }
        </style>
    </head>

    <body>
    <h1>Chicago Weather</h1>
    <p> The current temperature in Chicago is 
    """

    output_string += tag_extractor('http://w1.weather.gov/xml/current_obs/KORD.xml', 'temp_f')

    output_string += """
    degrees Fahrenheit.
    </p>
    <br>

    </body>
    </html>
    """

    with open("output/O'Hare Temperature.html", "w") as html_file:  # The file is closed automatically
        html_file.write(output_string)
    
# now call the function: 

html_output()
print("*** Look in the 'Output' folder of 'Web Scraping Part 3' to find the new HTML file. ***")

*** Look in the 'Output' folder of 'Web Scraping Part 3' to find the new HTML file. ***
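The same idea can be written as a function that returns the HTML as a string instead of writing a file, which makes it easier to inspect the result in the notebook. This is a sketch under stated assumptions: the function name `build_weather_html` and the sample measurements are hypothetical, and `html.escape` from the standard library is used so that characters such as `<` in the data cannot break the page.

```python
from html import escape

def build_weather_html(title, measurements):
    """Return an HTML page (as a string) with one paragraph per measurement."""
    rows = "\n".join(
        f"    <p>{escape(label)}: {escape(str(value))}</p>"
        for label, value in measurements.items()
    )
    return (
        "<html>\n"
        "<body>\n"
        f"    <h1>{escape(title)}</h1>\n"
        f"{rows}\n"
        "</body>\n"
        "</html>\n"
    )

# Hard-coded example values stand in for data extracted from the feed
page = build_weather_html(
    "Chicago Weather",
    {"Temperature (F)": "51.0", "Note": "values < 32 mean freezing"},
)
print(page)
```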


**<u>Task \#4:</u>** Create your own version of *html_output()* that adds the date/time of the most recent temperature measurement, as well as two other measurements, to your output *.html* document.

In [9]:
# define the function: 
def html_output():    
    output_string = """
    <html>
    <head>
        <style>
            body {
                background-color: #BBBBBB; 
                text-align: center;        
            }
        </style>
        <meta http-equiv="refresh" content="3600">
    </head>

    <body>
    <h1>Chicago Weather</h1>
    <p> The current temperature in Chicago is 
    """

    output_string += tag_extractor('http://w1.weather.gov/xml/current_obs/KORD.xml', 'temp_f')
    output_string += """
    degrees Fahrenheit. <br>"""
    
    output_string += tag_extractor('http://w1.weather.gov/xml/current_obs/KORD.xml', 'observation_time') + '<br>'
    
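    # Fetch the coordinates with tag_extractor so the function does not depend on a variable from an earlier cell
    output_string += '(Lat, Long): ' + tag_extractor('http://w1.weather.gov/xml/current_obs/KORD.xml', 'latitude') + ',' + tag_extractor('http://w1.weather.gov/xml/current_obs/KORD.xml', 'longitude')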
    
    output_string += """
    </p>
    <br>

    </body>
    </html>
    """

    with open("output/O'Hare Temperature2.html", "w") as html_file:  # The file is closed automatically
        html_file.write(output_string)
    
# now call the function: 

html_output()
print("*** Look in the 'Output' folder of 'Web Scraping Part 3' to find the new HTML file. ***")

*** Look in the 'Output' folder of 'Web Scraping Part 3' to find the new HTML file. ***


**<u>Task \#5:</u>** This is a multistep task: 

1) Read this [w3schools documentation](https://www.w3schools.com/tags/att_meta_http_equiv.asp)

2) Add the HTML `<meta http-equiv="refresh" content="3600">` between the head tags in your html_output() function

3) Read (but do not run) the code in the cell below: 

In [10]:
running = False
import time


while running:
    html_output()
    time.sleep(3600)

**<u>Question \#8:</u>** If you were to change *running* to *True* in the code cell above, then running this cell would start an infinite loop. What would be the purpose of the infinite loop together with the HTML above, `<meta http-equiv="refresh" content="3600">`? What would it allow you to do?

**<u>Your Answer:</u>** It would refresh the page with new updated data every 3600 seconds.
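If you want to try the "update on a schedule" idea without starting a loop that never ends, one option is to limit the number of repetitions. The sketch below is a hypothetical helper (the names `run_periodically` and `fake_update` are made up for this example); `fake_update` stands in for a function like *html_output()*, and the interval is set to 0 seconds so the demonstration finishes immediately.

```python
import time

def run_periodically(task, interval_seconds, max_runs):
    """Call task() max_runs times, pausing interval_seconds between calls."""
    results = []
    for run_number in range(max_runs):
        results.append(task())
        if run_number < max_runs - 1:      # No need to wait after the final run
            time.sleep(interval_seconds)
    return results

# Stand-in for a function that regenerates the HTML file
counter = {"calls": 0}

def fake_update():
    counter["calls"] += 1
    return counter["calls"]

print(run_periodically(fake_update, interval_seconds=0, max_runs=3))
```

With a real update function, an interval of 3600 seconds would match the hourly refresh set in the `<meta>` tag.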


### <u>Experiment and Explore</u>

**<u>Task \#6:</u>** With any extra time this period, go to http://w1.weather.gov/xml/current_obs/KORD.xml, but change the **KORD** portion of this URL to another ICAO airport code for an airport in the United States or its territories. Experiment. Show the results of your experimentation in an HTML output file, and produce a Google Maps Marker Map in this notebook that includes the location you chose to explore. 

  * Airport codes in the lower-48 states: https://en.wikipedia.org/wiki/List_of_airports_by_ICAO_code:_K

  * All airport codes in the United States and territories: https://en.wikipedia.org/wiki/List_of_airports_in_the_United_States#Lists_by_ICAO_location_indicator
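Since only the station code changes from one feed URL to the next, a small helper can build the URL for you. This is a sketch: the function name `station_url` is hypothetical, and the check that a code is exactly four letters or digits is an assumption about what a valid code looks like, not a rule taken from the weather service.

```python
def station_url(icao_code):
    """Build the current-observation feed URL for a four-character station code."""
    code = icao_code.strip().upper()          # Tolerate spaces and lowercase input
    if len(code) != 4 or not code.isalnum():  # Assumed format: four letters/digits
        raise ValueError(f"Unexpected station code: {icao_code!r}")
    return f"http://w1.weather.gov/xml/current_obs/{code}.xml"

print(station_url("KORD"))
print(station_url(" panc "))
```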

In [43]:
def get_location_of_every_airport():
    # Returns the ICAO codes listed in Wikipedia's table of U.S. airports (or False if the lookup fails)
    html_page = urlopen('https://en.wikipedia.org/wiki/List_of_airports_in_the_United_States#Lists_by_ICAO_location_indicator')                    # Opens whatever page we are requesting
    bs_obj = BeautifulSoup(html_page, 'html.parser')                              # Saves the html in a Beautiful Soup object
    try:
        rows = bs_obj.findAll('tbody')[2].findAll("tr",{"valign":"top"})         # Table rows, one per airport
        all_loc = []
        for row in rows:
            code = row.findAll('td')[3].string                                    # Text of the fourth cell in the row
            all_loc.append(code[:-1])                                             # Drop the last character of the cell text
    except AttributeError:
        all_loc = False                                                         # If the expected data is not available, store 'False' instead
    return all_loc
get_location_of_every_airport()[:10]

['KBHM',
 'KDHN',
 'KHSV',
 'KMOB',
 'KMGM',
 'PALH',
 'PAMR',
 'PANC',
 'PANI',
 'PABE']

In [92]:
def get_loc_of_every_airport(how_many):
    loc = []
    all_loc = get_location_of_every_airport()
    for code in all_loc[:how_many]:                     # Only look up the first how_many airports
        url = f'http://w1.weather.gov/xml/current_obs/{code}.xml'
        latitude = tag_extractor(url, 'latitude')
        longitude = tag_extractor(url, 'longitude')
        loc.append((float(latitude),float(longitude)))  # Store each location as a (latitude, longitude) tuple
    return loc

In [93]:
final_loc = get_loc_of_every_airport(100)
final_loc

[(33.56556, -86.745),
 (31.32139, -85.44972),
 (34.64361, -86.78556),
 (30.68833, -88.24556),
 (32.30028, -86.40611),
 (61.18333, -149.96667),
 (61.21667, -149.85),
 (61.17444, -149.99611),
 (61.58139, -159.54278),
 (60.77972, -161.83778),
 (60.49167, -145.47778),
 (70.2, -148.46667),
 (59.05, -158.51667),
 (64.80389, -147.87611),
 (58.41667, -135.7),
 (59.65, -151.48333),
 (58.35472, -134.57611),
 (60.57306, -151.245),
 (55.35556, -131.71361),
 (58.67667, -156.64917),
 (55.5839, -133.067),
 (57.75, -152.5),
 (66.88576, -162.60624),
 (64.51194, -165.445),
 (56.8017, -132.9453),
 (62.05, -163.3),
 (57.048, -135.3647),
 (63.88333, -160.8),
 (53.9, -166.53333),
 (71.28528, -156.76583),
 (61.13333, -146.26667),
 (56.48333, -132.36667),
 (59.51667, -139.66667),
 (35.1575, -114.55944),
 (35.14433, -111.66637),
 (35.94582, -112.15538),
 (33.31667, -111.65),
 (36.92056, -111.44806),
 (33.427799, -112.003465),
 (34.64917, -112.42222),
 (32.13153, -110.95635),
 (32.65944, -114.59306),
 (36.28977

In [95]:
from ipywidgets.embed import embed_minimal_html # Allows us to create a separate file for the Google Maps

markers = gmaps.marker_layer(final_loc)    # Create markers for each tuple/coordinate
markermap = gmaps.Map()                         # Create a GMap variable
markermap.add_layer(markers)                    # Add the layer of markers to GMap

embed_minimal_html('output/MarkerMap2.html', views=[markermap])
print("*** If no map appears, open the file 'output/MarkerMap2.html' to view the saved map. ***")

markermap

*** If no map appears, open the file 'output/MarkerMap2.html' to view the saved map. ***


Map(configuration={'api_key': 'AIzaSyCLla6Q7krE9xNg6SnNMoGNIzjCLddE9EU'}, data_bounds=[(16.053299230651156, -1…

Locations of one hundred airports across the US.