### Metadata on the Web

We're going to take a quick historical view of metadata on the web. As we have mentioned, data within web pages (the "metadata") can be a rich source of information. Most of this is used by search engines, Facebook, Twitter, etc - but it's also available for you! Let's take a quick spin through the past ~15 years and look at how metadata on the web has changed. And, we're going to write a little HTML on the way :-0

**Let's travel back in time... to the year 2000...**

Imagine that we are going to start a technology news web site and put it on the world-wide-web, information super-highway thing. We start off by creating a web page that links to some of the big technology stories of the day, like this:

In [None]:
%%HTML

<html>

    <head>
        <title>My Technology News Site</title>
    </head>

    <body>
        <div>
            <p><strong>Steve Jobs introduces the public beta of Mac OS X</strong></p>
            <div>Sept 13, 2000 - Steve Jobs <a href="https://www.apple.com/pr/library/2000/09/13Apple-Releases-Mac-OS-X-Public-Beta.html" target="_blank">introduces</a> the public beta of Mac OS X for US$29.95.</div>
            <div>Author: Michael Young</div>
        </div>
    </body>

</html>

---

We send the link to our new site to our family and friends, and we have a handful of people reading it (well, two people really: our mom and the dog).

**So, now we want more people to read it!**

What's the best way to have more people discover this site in 2000? **Search.** And, by that I mean Google (which had launched a few years earlier in 1998). Let's ignore ~~Yahoo!~~ Oath for now, but it was a real player in search in the early days of the web.

Our good friend, the SEO Guru, told us we needed to do some work on our site so that we could move up in the rankings. After talking to the "guru", we learned to help Google by telling them a little about our humble news site using the **```meta```** ```keywords``` and ```description``` tags.

In [None]:
%%HTML

<html>

    <head>
        <title>My Technology News Site</title>
        <meta name="description" content="My Technology News Site has the most interesting technology stories every day.">
        <meta name="keywords" content="tech, news, super important tech news, technology, technology news, kim kardashian">
    </head>

    <body>
        <div>
            <div><strong>Steve Jobs introduces the public beta of Mac OS X</strong></div>
            <div>Sept 13, 2000 - Steve Jobs <a href="https://www.apple.com/pr/library/2000/09/13Apple-Releases-Mac-OS-X-Public-Beta.html" target="_blank">introduces</a> the public beta of Mac OS X for US$29.95.</div>
            <div>Author: Michael Young</div>
        </div>
    </body>

</html>

---
To review (in case web technology circa 2000 is feeling slightly remote):

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The **```keywords```** metadata is used to tell search engines about the topic of the page.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The **```description```** metadata is used to describe the site and this is what search engines use in their search results.

A **key point** to remember: none of this data is shown to users on the page (you only see it if you view the page source). It's for machines.

Here is an example of the ```description``` tag in use:

![description](https://qph.ec.quoracdn.net/main-qimg-2c6dddd356b26ca0763241db501f52f8)
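
To see what "for machines" means in practice, here is a minimal sketch of how a program might read those two tags. It uses only Python's built-in `html.parser`; the `MetaTagParser` class and the `page` string are made up for this illustration and are not how any particular search engine works.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        if name and "content" in attrs:
            self.meta[name.lower()] = attrs["content"]


# A trimmed, illustrative copy of our year-2000 page.
page = """
<html>
  <head>
    <title>My Technology News Site</title>
    <meta name="description" content="My Technology News Site has the most interesting technology stories every day.">
    <meta name="keywords" content="tech, news, technology, technology news">
  </head>
  <body></body>
</html>
"""

parser = MetaTagParser()
parser.feed(page)

description = parser.meta.get("description")
# keywords is one comma-separated string, so split it into a list
keywords = [k.strip() for k in parser.meta.get("keywords", "").split(",") if k.strip()]

print(description)
print(keywords)
```

Notice that nothing in the code checks whether the keywords have anything to do with the visible page. The machine just takes our word for it.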

**Great!**

Google now knows a little about us and is ranking our site a bit higher in the search results for "technology news." Because of that, we have a few more people showing up at our site.

Over the next few years, we expand our little tech news site into listing events as well.

In [None]:
%%HTML

<html>

    <head>
        <title>My Technology News Site - Events in San Francisco</title>
        <meta name="description" content="My Technology News Site has the most interesting technology stories and events.">
        <meta name="keywords" content="tech, news, super important tech news, technology, technology news, technology events, events, San Francisco, Silicon Valley">
    </head>

    <body>
        <div>
            <div><strong>Macworld Expo San Francisco</strong></div>
            <div>January 5-9, 2004</div>
            <div>Moscone Convention Center, San Francisco, CA</div>
        </div>
    </body>

</html>

---

**Questions**: how does a search engine know that this site/page is really about technology? How does it know it's an event listing? Does anyone see how this metadata approach could be abused?


### Microformats

Around 2005, a group of people came up with the notion of a "Microformat." The idea was to use additional markup in HTML to allow machines to easily discover data inside HTML (like our calendar event or news story). Simply put, _Microformats are a way to use html pages as both a human readable document and machine readable data, without repetition._

The idea was originally a grassroots movement from developers, but it was soon supported by some search engines and browsers. It was never part of a standards body though - just an "informal" specification. [Microformats](http://microformats.org/wiki/Main_Page) are still used and supported but, as we'll see, new metadata formats came along...

Microformats allowed developers to highlight specific elements/types of content within a page, such as:

```
hAtom - blog posts and other date-stamped content
hCalendar - events
hCard - people, organizations, contacts
hListing - listings for products or services
hMedia - media info about images, video, audio
hProduct - products
hRecipe - cooking+baking recipes
hResume - individual resumes and CVs
hReview - individual reviews and ratings
hReview-aggregate - aggregate reviews and ratings
adr - address location information
geo - latitude & longitude location (WGS84 geographic coordinates)
```

**What good does this do? What can we do with Microformats?**

A few things:
1. Search engines now have help in knowing what a page, or data within a page, is about.
2. Search engines can use this markup to know what to show in something like a "rich snippet."
3. Browsers started adding the ability to do things like detect an event in a page and allow a user to add it to their calendar (or a person's information to their Address book).

Make sense? Let's revisit our event listing by using the `hCalendar` microformat.

In [None]:
%%HTML

<html>

    <head>
        <title>My Technology News Site - Events in San Francisco</title>
        <meta name="description" content="My Technology News Site has the most interesting technology stories and events.">
        <meta name="keywords" content="tech, news, super important tech news, technology, technology news, technology events, events, San Francisco, Silicon Valley">
    </head>

    <body>
        <div class="vevent">
            <div class="summary"><strong>Macworld Expo San Francisco</strong></div>
            <div>
                 <span class="dtstart" title="2004-01-05">January 5</span>-
                 <span class="dtend" title="2004-01-09">9, 2004</span>
            </div>
            <div class="location">Moscone Convention Center, San Francisco, CA</div>
        </div>
    </body>

</html>

---
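
Here is a rough sketch of what a browser or crawler could do with that markup: look for an element with the `vevent` class and read the properties inside it. It uses only Python's built-in `html.parser`; the `HCalendarParser` class is made up for this illustration, handles just the four class names used in our event, and assumes well-formed HTML. It is not a complete hCalendar parser.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so skip them when tracking depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HCalendarParser(HTMLParser):
    TEXT_PROPS = {"summary", "location"}   # value is the element's text
    DATE_PROPS = {"dtstart", "dtend"}      # value is the title attribute

    def __init__(self):
        super().__init__()
        self.events = []
        self._event = None     # event currently being read
        self._depth = 0        # nesting depth inside the current vevent
        self._prop = None      # text property currently being read
        self._prop_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._event is None:
            if "vevent" in classes:
                self._event = {}
                self._depth = 1
            return
        self._depth += 1
        for name in classes:
            if name in self.DATE_PROPS and attrs.get("title"):
                self._event[name] = attrs["title"]
            elif name in self.TEXT_PROPS and self._prop is None:
                self._prop, self._prop_depth = name, self._depth
                self._event[name] = ""

    def handle_data(self, data):
        if self._event is not None and self._prop is not None:
            self._event[self._prop] += data

    def handle_endtag(self, tag):
        if self._event is None or tag in VOID_TAGS:
            return
        if self._prop is not None and self._depth == self._prop_depth:
            self._event[self._prop] = self._event[self._prop].strip()
            self._prop = None
        self._depth -= 1
        if self._depth == 0:
            self.events.append(self._event)
            self._event = None


# Illustrative copy of the event markup.
page = """
<div class="vevent">
    <div class="summary"><strong>Macworld Expo San Francisco</strong></div>
    <div>
         <span class="dtstart" title="2004-01-05">January 5</span>-
         <span class="dtend" title="2004-01-09">9, 2004</span>
    </div>
    <div class="location">Moscone Convention Center, San Francisco, CA</div>
</div>
"""

parser = HCalendarParser()
parser.feed(page)
print(parser.events)
```

The human sees "January 5-9, 2004" while the machine reads the `title` values, which is the "without repetition" idea: one piece of HTML serves both audiences.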

### Enter Microdata (and others)

Over the coming years, other metadata approaches emerged, such as [Microdata](http://schema.org/) (Google and other search engines), [OpenGraph](http://ogp.me/) (Facebook), [TwitterCards](https://dev.twitter.com/cards/overview) (Twitter) and others (RDF, [RDFa](https://rdfa.info/)). These were created and driven by various standards bodies, commercial interests (publishers, social networks, search engines, browsers) and developers. Again, the goal of all of these was to make it easier for machines to make sense of the data published inside web pages, and to use that data to help display and rank publishers' content and make it easier to interact with.

We're going to focus on [Microdata](https://www.w3.org/TR/microdata/) for the rest of the class, but it's worth looking into the others as well.

Similar to Microformats, Microdata is defined as: _This specification defines the HTML microdata mechanism. This mechanism allows machine-readable data to be embedded in HTML documents in an easy-to-write manner, with an unambiguous parsing model. It is compatible with numerous other data formats including RDF and JSON._

Though `Microdata` is not an official spec of *The W3C* (_The W3C HTML Working Group failed to find an editor for the specification and terminated its development with a 'Note'._) it is supported by Google, Microsoft, Yahoo and Yandex. In fact, these companies came together to create a vocabulary (specification, essentially) around microdata that is published at http://schema.org/. These companies have tried to establish an open forum and community-based process for updating the vocabulary/specification.

Let's look at an example of how microdata works. We'll start by looking at a **Movie**. Here is some simple HTML that displays information about the movie Avatar. Go ahead and run it.

In [None]:
%%HTML

<div>
    <h1>Avatar</h1>
    <div>Director: James Cameron (born August 16, 1954)</div>
    <div>Science Fiction</div>
    <div><a href="../movies/avatar-theatrical-trailer.html">Trailer</a></div>
</div>

---

### Adding Microdata to our HTML

We want to let Google and the other search engines know that this is information about a movie.

**Step 1**: Identify which section is about the Movie 🎥

Add the **`itemscope`** attribute to the HTML element which encloses the information about the movie.

```html
<div itemscope>
    ...Movie info here...
</div>
```

In [None]:
%%HTML

<div itemscope>
    <h1>Avatar</h1>
    <div>Director: James Cameron (born August 16, 1954)</div>
    <div>Science fiction</div>
    <div><a href="../movies/avatar-theatrical-trailer.html">Trailer</a></div>
</div>

---

**Step 2**: Specify the type (i.e. this thing is a Movie)

Now, add the **`itemtype`** attribute right after the **`itemscope`** and specify the type. When specifying the type, you can use any of the types listed on [schema.org](http://schema.org/docs/full.html)

```html
<div itemscope itemtype="http://schema.org/Movie">
    ...Movie info here...
</div>
```

In [None]:
%%HTML

<div itemscope itemtype="http://schema.org/Movie">
    <h1>Avatar</h1>
    <div>Director: James Cameron (born August 16, 1954)</div>
    <div>Science fiction</div>
    <div><a href="../movies/avatar-theatrical-trailer.html">Trailer</a></div>
</div>

---

**Step 3**: Use the **`itemprop`** attribute to specify properties about the Movie.

Nothing has changed visually on the page, but we've told search engines that this section of the page is about a Movie. Google thanks you! Can we go further with the [Movie type](http://schema.org/Movie)? How would we tell Google which of these fields is the movie name, director and genre? We can do this using the **`itemprop`** attribute.


In [None]:
%%HTML

<div itemscope itemtype="http://schema.org/Movie">
    <h1 itemprop="name">Avatar</h1>
    
    <div itemprop="director" itemscope itemtype="http://schema.org/Person">
        <span itemprop="name">James Cameron</span>
        (born: <span itemprop="birthDate">August 16, 1954</span>)
    </div>
        
    <div itemprop="genre">Science fiction</div>
    <div><a href="../movies/avatar-theatrical-trailer.html">Trailer</a></div>
</div>

### You Try It

Edit the HTML above and specify the trailer. Reference the [Movie documentation on schema.org.](http://schema.org/Movie)


In [None]:
%%HTML


### HOLD THE PHONE!

So, the HTML attribute markup we just used is only one of three flavors of this kind of structured data. The others are [RDFa](https://rdfa.info/) and [JSON-LD](http://json-ld.org/spec/latest/json-ld/) (where "LD" is "linked data").

The search engines support all three formats but Google [recently said](https://developers.google.com/search/docs/guides/intro-structured-data) JSON-LD is their recommended format.

Some publishers use the Microdata HTML markup, some use JSON-LD. It's a bit of the wild west out there. For the examples in this notebook, we'll stick with the HTML markup. However, if you were to take our Movie example from above and express it as JSON-LD, it would look something like this:

```json
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Movie",
  "name": "Avatar",
  "genre": "Science Fiction",
  "director": {
    "@type": "Person",
    "name": "James Cameron"
  }
}
</script>

```
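
One nice thing about JSON-LD is how little work it takes to read. As a minimal sketch (the `JsonLdParser` class and the `page` string are made up for this illustration, and only Python's standard library is used), we can pull the contents of every `application/ld+json` script tag out of a page and hand them to `json.loads`:

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Parse the contents of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


# Illustrative page containing the Movie example as JSON-LD.
page = """
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Movie",
  "name": "Avatar",
  "genre": "Science Fiction",
  "director": {
    "@type": "Person",
    "name": "James Cameron"
  }
}
</script>
"""

parser = JsonLdParser()
parser.feed(page)

movie = parser.items[0]
print(movie["@type"], "-", movie["name"], "-", movie["director"]["name"])
```

One thing to watch for: `json.loads` is strict, so a stray trailing comma after the last property raises a `json.JSONDecodeError` instead of being quietly ignored.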


### Let's look at the `NewsArticle` type 📰

Take a quick peek at the schema documentation before we get started: http://schema.org/NewsArticle

How would we go about extracting information about a NewsArticle using the tools we know (requests, BeautifulSoup, etc)?

In particular, can we find this information in a [news article](https://www.nytimes.com/2017/04/02/us/politics/trump-china-jared-kushner.html)?
* `headline`
* `author`
* `description`

In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.nytimes.com/2017/04/02/us/politics/trump-china-jared-kushner.html'

# make the request and run the page through BeautifulSoup
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

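# find every element marked up as a schema.org NewsArticle
for tag in soup.find_all(attrs={'itemtype': 'http://schema.org/NewsArticle'}):

    # walk through each property declared inside the article
    for stuff in tag.find_all(attrs={"itemprop":True}):
        
        if stuff['itemprop'] == 'headline':
            # get_text() still works when the headline contains nested tags
            print('headline: {0}'.format(stuff.get_text(strip=True)))

        elif 'author' in stuff['itemprop']:
            print('author: {0}'.format(stuff.get_text(strip=True)))

        elif stuff['itemprop'] == 'description':
            # .get() returns None instead of raising a KeyError if there is no content attribute
            print('description: {0}'.format(stuff.get('content')))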
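
The cell above depends on a live page, and publishers change their markup all the time. If you want to practice without the network, here is a rough offline sketch of the same idea using only Python's standard library. The `MicrodataParser` class and the sample `page` are invented for this illustration (the article reuses our made-up tech news story), and the parser assumes well-formed HTML, so treat it as a teaching aid and not a full implementation of the microdata spec.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so skip them when tracking depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []     # top-level items found on the page
        self._stack = []    # (item, depth) for each open itemscope
        self._depth = 0
        self._text = None   # (item, property, depth) while reading text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_void = tag in VOID_TAGS
        if not is_void:
            self._depth += 1
        current = self._stack[-1][0] if self._stack else None
        prop = attrs.get("itemprop")

        if "itemscope" in attrs and not is_void:
            # a new item; attach it to its parent if it is also a property
            item = {"@type": attrs.get("itemtype")}
            if prop and current is not None:
                current[prop] = item
            else:
                self.items.append(item)
            self._stack.append((item, self._depth))
        elif prop and current is not None:
            if "content" in attrs:
                current[prop] = attrs["content"]      # e.g. <meta itemprop=...>
            elif not is_void and self._text is None:
                current[prop] = ""                    # value is the element text
                self._text = (current, prop, self._depth)

    def handle_data(self, data):
        if self._text is not None:
            item, prop, _ = self._text
            item[prop] += data

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._text is not None and self._text[2] == self._depth:
            item, prop, _ = self._text
            item[prop] = item[prop].strip()
            self._text = None
        if self._stack and self._stack[-1][1] == self._depth:
            self._stack.pop()
        self._depth -= 1


# Invented sample markup, not taken from any real news site.
page = """
<article itemscope itemtype="http://schema.org/NewsArticle">
  <h1 itemprop="headline">Steve Jobs introduces the public beta of Mac OS X</h1>
  <meta itemprop="description" content="Apple releases the Mac OS X public beta for US$29.95.">
  <div itemprop="author" itemscope itemtype="http://schema.org/Person">
    By <span itemprop="name">Michael Young</span>
  </div>
</article>
"""

parser = MicrodataParser()
parser.feed(page)

article = parser.items[0]
print("headline:", article["headline"])
print("author:", article["author"]["name"])
print("description:", article["description"])
```

Because `author` is itself an item (a `Person`), it comes back as a nested dictionary, which mirrors how the JSON-LD version of the same data would look.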
  

### ClaimReview

Ok, let's finally take a look at the *`ClaimReview`* microdata specification: https://schema.org/ClaimReview

`ClaimReview` is a fairly recent addition to the list of types on schema.org; Google [announced support for it in October 2016](https://blog.google/topics/journalism-news/labeling-fact-check-articles-google-news/).

A `ClaimReview` is defined as: _A fact-checking review of claims made (or reported) in some creative work (referenced via itemReviewed)._

Let's look at an article from the Washington Post "Fact Checker" blog to see if they have a ClaimReview.

In [None]:
url = 'https://www.washingtonpost.com/news/fact-checker/wp/2018/04/05/president-trump-says-his-beautiful-wall-is-being-built-nope/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

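# list the type of every microdata item on the page
for tag in soup.find_all(attrs={"itemscope":True}):
    # .get() avoids a KeyError for an itemscope that has no itemtype
    print(tag.get('itemtype'))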

Now, let's take a closer look at the info in the `ClaimReview`.

In [None]:
# let's print out the info in the Claim Review
for tag in soup.find_all(attrs={'itemtype': 'http://schema.org/ClaimReview'}):

    for stuff in tag.find_all(attrs={'itemprop':'itemReviewed'}):
        print(stuff.prettify())

To compare, take a look at what Google's tool finds for this article: https://search.google.com/structured-data/testing-tool/u/0/#url=https%3A%2F%2Fwww.washingtonpost.com%2Fnews%2Ffact-checker%2Fwp%2F2018%2F04%2F05%2Fpresident-trump-says-his-beautiful-wall-is-being-built-nope%2F%3Futm_term%3D.23e8905ce6f4

One important piece of data in here is the `reviewRating` --> `alternateName` value. In WaPo's case, this is where they tell you if the Claim is 1-4 `Pinocchios`, `The Geppetto Checkmark`, `An Upside-Down Pinocchio` or `Verdict Pending`. You can see their entire rating system [here](https://www.washingtonpost.com/news/fact-checker/about-the-fact-checker/). What can you do with this information?
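
As one possible starting point, here is a small sketch that tallies ratings across a batch of fact checks. The records below are completely made up for illustration (they are not real fact checks), and the `rating_label` helper is a hypothetical name; only the property names (`claimReviewed`, `reviewRating`, `alternateName`) come from schema.org.

```python
import json
from collections import Counter

# Made-up ClaimReview records for illustration only.
claim_reviews_json = """
[
  {"@type": "ClaimReview",
   "claimReviewed": "Example claim A",
   "reviewRating": {"@type": "Rating", "alternateName": "Four Pinocchios"}},
  {"@type": "ClaimReview",
   "claimReviewed": "Example claim B",
   "reviewRating": {"@type": "Rating", "alternateName": "The Geppetto Checkmark"}},
  {"@type": "ClaimReview",
   "claimReviewed": "Example claim C",
   "reviewRating": {"@type": "Rating", "alternateName": "Four Pinocchios"}},
  {"@type": "ClaimReview",
   "claimReviewed": "Example claim D"}
]
"""


def rating_label(review):
    """Return reviewRating.alternateName, or None when it is missing."""
    rating = review.get("reviewRating") or {}
    return rating.get("alternateName")


reviews = json.loads(claim_reviews_json)

# count how often each rating label appears
counts = Counter(rating_label(review) for review in reviews)

for label, n in counts.most_common():
    print("{0}: {1}".format(label, n))
```

Not every publisher fills in every field, which is why the helper returns `None` for a missing rating instead of crashing.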

How might you use what we've just done to find how fake news is shared on social networks?

### P.S. if you were to add most of the various forms of metadata to our "technology news site" example...

...it might look something like this:

In [None]:
%%HTML

<html>

    <head>
        <title>My Technology News Site - Events in San Francisco</title>
        <meta name="description" content="My Technology News Site has the most interesting technology stories and events.">
        <meta name="keywords" content="tech, news, super important tech news, technology, technology news, technology events, events, San Francisco, Silicon Valley">
        
        <!-- OpenGraph for FB -->
        <meta property="og:title" content="My Technology News Site - Events in San Francisco" />
        <meta property="og:description" content="My Technology News Site has the most interesting technology stories and events." />
        <meta property="og:type" content="website" />
        <meta property="og:image" content="http://mysweettechsite.com/logo.png" />

        <!-- TwitterCard for Twitter -->
        
        <meta name="twitter:card" content="summary" />
        <meta name="twitter:site" content="@mytwitteraccount" />
        <meta name="twitter:title" content="My Technology News Site - Events in San Francisco" />
        <meta name="twitter:image" content="http://mysweettechsite.com/logo.png" />

        <script type="application/ld+json">
        {
            "@context": "http://schema.org",
            "@type": "Event",
            "location": {
                "@type": "Place",
                "address": {
                  "@type": "PostalAddress",
                  "addressLocality": "San Francisco",
                  "addressRegion": "CA"
                },
                "name": "The Moscone Convention Center"
            },
            "name": "Macworld Expo San Francisco",
            "startDate": "2004-01-05T09:00",
            "endDate": "2004-01-09T17:00"
        }
        </script>
    
    </head>

    <body>
        <div class="vevent">
            <div class="summary"><strong>Macworld Expo San Francisco</strong></div>
            <div>
                 <span class="dtstart" title="2004-01-05">January 5</span>-
                 <span class="dtend" title="2004-01-09">9, 2004</span>
            </div>
            <div class="location">Moscone Convention Center, San Francisco, CA</div>
        </div>
    </body>

</html>
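
With that many formats in one page, it can help to see them side by side. Here is a minimal sketch that groups the social tags by prefix, using only Python's standard library. The `SocialMetaParser` class and the trimmed `page` are made up for this illustration. The one wrinkle worth noticing is that OpenGraph tags use the `property` attribute while Twitter Card tags use `name`.

```python
from html.parser import HTMLParser


class SocialMetaParser(HTMLParser):
    """Group og:* and twitter:* meta tags by their prefix."""

    def __init__(self):
        super().__init__()
        self.tags = {"og": {}, "twitter": {}}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # OpenGraph uses the "property" attribute; Twitter Cards use "name".
        key = attrs.get("property") or attrs.get("name") or ""
        prefix, _, field = key.partition(":")
        if prefix in self.tags and field:
            self.tags[prefix][field] = attrs.get("content")


# Trimmed, illustrative copy of the <head> from our example page.
page = """
<head>
  <title>My Technology News Site - Events in San Francisco</title>
  <meta name="description" content="My Technology News Site has the most interesting technology stories and events.">
  <meta property="og:title" content="My Technology News Site - Events in San Francisco" />
  <meta property="og:type" content="website" />
  <meta name="twitter:card" content="summary" />
  <meta name="twitter:site" content="@mytwitteraccount" />
</head>
"""

parser = SocialMetaParser()
parser.feed(page)

for prefix, fields in parser.tags.items():
    print(prefix, fields)
```

The plain `description` tag is skipped here because it has no `og:` or `twitter:` prefix.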