*****************************************************************
#  The Social Web: data representation
- Instructors: Jacco van Ossenbruggen, Dayana Spagnuelo
- TAs: Michael Accetto, Oktay Kavi, Abhirup Mukherjee, Nihat Uzunalioğlu
- Exercises for Hands-on session 2
*****************************************************************

In this session you are going to mine data in various microformats. You will see the differences in what each of the formats can contain and what purpose they serve. We will start by looking at geographical data.

Prerequisites:
- Python 3.8
- Python packages: requests, BeautifulSoup4, HTMLParser, rdflib


In [1]:
# If you're using a virtualenv, make sure it's activated before running
# this cell!
!pip install requests
!pip install BeautifulSoup4
!pip install HTMLParser
!pip install rdflib

Collecting HTMLParser
  Downloading HTMLParser-0.0.2.tar.gz (6.0 kB)
Building wheels for collected packages: HTMLParser
  Building wheel for HTMLParser (setup.py) ... done
  Created wheel for HTMLParser: filename=HTMLParser-0.0.2-py3-none-any.whl size=5984 sha256=4f1e489e95a38491bf4d041ce4f0f6d1d3810864a4d8f221a2110b8b56894059
  Stored in directory: /Users/chieh/Library/Caches/pip/wheels/88/0f/43/11747d95b28379b346c15f935f4d4075e7a4ec068d3a510c79
Successfully built HTMLParser
Installing collected packages: HTMLParser
Successfully installed HTMLParser-0.0.2
Collecting rdflib
  Downloading rdflib-5.0.0-py3-none-any.whl (231 kB)
Collecting isodate
  Downloading isodate-0.6.0-py2.py3-none-any.whl (45 kB)
Installing collected packages: isodate, rdflib
Successfully installed isodate-0.6.0 rdflib-5.0.0


##  Exercise 1

Even if web pages do not use microformats, interesting data can often be extracted from the HTML. You may use packages such as BeautifulSoup to extract arbitrary pieces of data from any HTML page.
The example below shows how we can find the URL of the first image in the infobox table of the Wikipedia page on Amsterdam. Tip: compare the code below with the HTML source code of the Wikipedia page: the image URL is in the "src" attribute of the "img" element inside the "table" element with class="infobox".

In [1]:
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

# The page to scrape. Change the URL to try another Wikipedia article.
URL = 'https://en.wikipedia.org/wiki/Amsterdam'

# Send a descriptive User-Agent header along with the request.
req = requests.get(URL, headers={'User-Agent': 'Social Web Course Student'})
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(req.text, 'html.parser')
# print(req.text)
# The first <img> inside the first <table class="infobox"> is the infobox image.
image1 = soup.find_all('table', class_='infobox')[0].find('img')
print(image1['src'])


//upload.wikimedia.org/wikipedia/commons/thumb/b/be/KeizersgrachtReguliersgrachtAmsterdam.jpg/270px-KeizersgrachtReguliersgrachtAmsterdam.jpg


Extracting coordinates from a web page and reformatting them in the geo microformat (based on Example 8-1 in Mining the Social Web). Note that Wikipedia pages may encode lat/long information in different ways. One of the ways used by the Amsterdam Wikipedia page is a span element that is not shown to the user:
`<span class="geo">52.367; 4.900</span>`
This span element has a single child (`len(geoTag) == 1`) and no further structure, so we have to get the latitude and longitude manually by splitting the string on the ';' character.

In [2]:

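# find(True, 'geo') matches the first tag of any name with class="geo".
geoTag = soup.find(True, 'geo')
print(geoTag)

if geoTag and len(geoTag) > 1:
    # Structured variant: separate child elements for latitude and longitude.
    lat = geoTag.find(True, 'latitude').string
    lon = geoTag.find(True, 'longitude').string
    print('Location is at', lat, lon)
elif geoTag and len(geoTag) == 1:
    # Flat variant: a single "lat; lon" string that we split ourselves.
    (lat, lon) = geoTag.string.split(';')
    (lat, lon) = (lat.strip(), lon.strip())
    print('Location is at', lat, lon)
else:
    print('Location not found')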


<span class="geo">52.367; 4.900</span>
Location is at 52.367 4.900
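
The values extracted above are still strings. As a minimal sketch (the helper name `parse_geo` is our own and not part of the original exercise), the flat "lat; lon" form could be turned into validated floats like this, using only the standard library:

```python
def parse_geo(text):
    """Parse a 'lat; lon' string (as found in a geo span) into two floats."""
    parts = [p.strip() for p in text.split(';')]
    if len(parts) != 2:
        raise ValueError(f"expected 'lat; lon', got {text!r}")
    lat, lon = float(parts[0]), float(parts[1])
    # Reject values outside the valid latitude/longitude ranges.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    return lat, lon


print(parse_geo("52.367; 4.900"))
```

Having numeric values makes it easier to catch malformed input early, before the coordinates are written to another format.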


### Task 1

Can you convert the output of Exercise 1 into KML? Here is the KML documentation: https://developers.google.com/kml/documentation/?csw=1 and here you can find a simple example of how it is used: https://renenyffenegger.ch/notes/tools/Google-Earth/kml/index

Visualise the point in Google Maps using the following code example: https://developers.google.com/maps/documentation/javascript/examples/layer-kml-features
You will have to create your own KML file for the custom map layer and provide a URL to the KML file inside the JavaScript code, which means that you have to upload the file somewhere. You can use a service like http://pastebin.com/ to obtain a URL for your KML file: paste the code there, request the RAW format URL, and use that URL in this task.

Is KML a microformat? Why (not)?

## Ans: 
We converted the coordinates obtained in Exercise 1 into KML, available at https://pastebin.com/raw/q2Vje0zU
We then visualized the point at https://jsfiddle.net/dvzfbxkh/2/
KML resembles a microformat in that it carries information such as placemarks and coordinates that is readable by users and processable by software. Strictly speaking, however, microformats are conventions for embedding such data inside HTML pages, whereas KML is a standalone XML file format, so KML is arguably not a microformat.
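
For reference, here is a minimal sketch of how the conversion to KML could be done in Python with the standard library. The function name `point_to_kml` and the placemark name are our own assumptions; the only KML-specific facts relied on are the KML 2.2 namespace and that `<coordinates>` lists longitude before latitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def point_to_kml(name, lat, lon):
    """Build a KML document containing a single Placemark with a Point."""
    # Serialize with a default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML expects longitude first, then latitude.
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


kml_text = point_to_kml("Amsterdam", "52.367", "4.900")
print(kml_text)
```

The resulting string can be pasted into a file hosting service as described in the task. Swapping the coordinate order is an easy mistake to make, which is why the function takes `lat` and `lon` as separate named arguments.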


## Exercise 2 
To find information on the web we can use microformats. In this example, however, you will not be using hRecipe. Instead, we'll show you how to find arbitrary tags in a web page.


### Task 2 
Parsing data for a <sub><sup>veggie</sup></sub> spaghetti alla carbonara recipe (from Example 2-7 in Mining the Social Web).

In [1]:
import requests
import json
from bs4 import BeautifulSoup

# A yummy webpage (feel free to change it to your liking).
URL = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara/"

# requests will return the html found at the given webpage...
page = requests.get(URL)
# ...and a BeautifulSoup object can be created from its content.
soup = BeautifulSoup(page.content, 'html.parser')

listchildren = list(soup.children)
#print(listchildren)

We can find any element in the page through *CSS selectors*.
You can find them all [here](https://www.w3schools.com/cssref/css_selectors.asp). In short: "." selects by class, "#" selects by id, and a plain name selects by element name.

You can also combine them, so that ".class1.class2" selects all elements that have both classes. For a deeper overview, please check the link above (or search for "CSS selectors").

In [2]:
print(len(listchildren)) # we can see here how many children the html doc has got.
ingredients_unparsed = soup.select_one(".tasty-recipes-ingredients")
# let's get all the "list item" elements in a list:
ing_unp = ingredients_unparsed.findAll('li')
print(ing_unp)

4
[<li><span data-amount="1">1</span> pound spaghetti noodles</li>, <li><span data-amount="0.5" data-unit="cup">½ cup</span> smoked mozzarella cheese</li>, <li><span data-amount="0.5" data-unit="cup">½ cup</span> grated Parmesan cheese, plus more for serving</li>, <li><span data-amount="4">4</span> egg yolks</li>, <li><span data-amount="1" data-unit="cup">1 cup</span> frozen Earthbound Farm Organic peas</li>, <li><span data-amount="8" data-unit="cup">8 cups</span> Earthbound Farm Organic spinach</li>, <li><span data-amount="3" data-unit="tablespoon">3 tablespoons</span> butter</li>, <li><a class="tasty-link" data-tasty-links-no-disclosure="" href="https://www.acouplecooks.com/what-is-kosher-salt/" target="_blank">Kosher salt</a></li>, <li>Fresh ground black pepper</li>]


Mmmh... not so pretty yet. How about listing the items using the ```.text``` attribute?

In [3]:
ingredients = [t.text for t in ing_unp]
print("Ingredients:\n")
# [print(i) for i in ingredients]  # works, but also builds (and displays) a list of None values
# Instead, use a plain loop:
for ing in ingredients:
    print(ing)

Ingredients:

1 pound spaghetti noodles
½ cup smoked mozzarella cheese
½ cup grated Parmesan cheese, plus more for serving
4 egg yolks
1 cup frozen Earthbound Farm Organic peas
8 cups Earthbound Farm Organic spinach
3 tablespoons butter
Kosher salt
Fresh ground black pepper


Good. Now the instructions:

In [4]:
instructions_unparsed = soup.select_one(".tasty-recipes-instructions")
instructions_unparsed = instructions_unparsed.findAll("li")
#print(instructions_unparsed)
instructions = [inst.text for inst in instructions_unparsed]
print("Instructions:\n")
for ins in instructions:
    print(ins)

Instructions:

In a large pot, combine 6 quarts of water with 2 tablespoons kosher salt and bring it to a boil.
Grate the Parmesan and mozzarella cheese. Carefully separate four egg yolks and set aside.
Once boiling, add the pasta and cook until the pasta is just about al dente, about 7 minutes; then add peas and spinach and cook for 1 minute. Reserve 1 cup cooking water, and then drain the pasta and vegetables.
In a skillet, melt the butter, then stir in the cheeses, ¼ cup pasta water, and ¼ teaspoon kosher salt. Stir in the pasta and vegetables until creamy over low heat, adding more pasta water if necessary (note that the mozzarella will stick together in some places).
To serve, top each pasta serving with a whole egg yolk and additional Parmesan cheese, and stir the yolk into the pasta at the table (if you are uncomfortable serving egg yolks at the table, stir the egg yolks into the pasta in the skillet to heat them through). Serve immediately. (Note that the mozzarella cheese can 

Let's finish off with the title:

In [5]:
title_unparsed = soup.select_one(".post-header") # 
categorical_title = title_unparsed.text.split("›") # website specific divider.
print(categorical_title)
recipe_title = categorical_title[-1].strip() # let's remove that ugly space at the beginning.
recipe_title

['\n\n\nVegetarian Carbonara\n\nRecipes ', ' Fast Dinner Ideas ', ' Vegetarian Carbonara\n']


'Vegetarian Carbonara'

## Task 2.1
Now it's your turn. Create a function that can scrape any recipe page from the same website (other websites will use different class names).

Make sure to:

- Return itemized content (e.g. ingredients) in a list. You may want to use a list comprehension here.
- Return human-readable content. Not all items have been cleaned of their HTML markup (compare the variables ```ingredients``` and ```instructions_unparsed```), so use the ```.text``` attribute.


In [6]:
# -*- coding: utf-8 -*-

import requests
import json
from bs4 import BeautifulSoup

# Pass in the URL of a recipe page on acouplecooks.com. The class names used
# below are specific to that site, so pages from other sites will not parse.

URL = "https://www.acouplecooks.com/easy-blueberry-crisp/"  # YOUR RECIPE HERE

# Parse out some of the pertinent information for a recipe.
# For the corresponding microformat, see http://microformats.org/wiki/hrecipe.

def parse_website(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Your code here
    # Parse the header and get the title
    title_unparsed = soup.select_one(".post-header")
    categorical_title = title_unparsed.text.split("›")  # website-specific divider
    recipe_title = categorical_title[-1].strip()  # strip surrounding whitespace
    fn = recipe_title
    #print(fn)

    # Ingredients
    ingredients_unparsed = soup.select_one(".tasty-recipes-ingredients")
    # let's get all the "list item" elements in a list:
    ing_unp = ingredients_unparsed.findAll('li')
    ingredients = [t.text for t in ing_unp]

    # Instructions
    instructions_unparsed = soup.select_one(".tasty-recipes-instructions")
    instructions_unparsed = instructions_unparsed.findAll("li")
    instructions = [t.text for t in instructions_unparsed]

    return {
            'name': fn,
            'ingredients': ingredients,
            'instructions': instructions,
            }
    
recipe = parse_website(URL)
#print (recipe)
print(json.dumps(recipe,indent=4))

{
    "name": "Best Blueberry Crisp Recipe",
    "ingredients": [
        "5 cups fresh blueberries",
        "1 tablespoon vanilla extract",
        "2 tablespoons fresh lemon juice",
        "2 tablespoons arrowroot powder or cornstarch",
        "1/4 teaspoon kosher salt",
        "5 tablespoons cold unsalted butter",
        "1 cup rolled oats",
        "1/2 cup coconut sugar or granulated sugar",
        "1/2 cup almond flour",
        "1 teaspoon vanilla extract",
        "1/2 teaspoon kosher salt",
        "1 teaspoon culinary lavender, crushed under the bottom of a glass until powdery (optional)*"
    ],
    "instructions": [
        "Preheat the oven to 350F. Lightly grease an 8-inch pie plate or a cast-iron skillet.",
        "Make the filling: In a large bowl, toss together the blueberries, vanilla, lemon juice, arrowroot and salt until well coated. Transfer the filling to the prepared pie plate.",
        "Make the topping: Wipe out the bowl, then chop the butter into small

But how can we get information not just from one website, but from all of them?

The answer: microformats.

Rather than extracting the information manually from the schema.org or hRecipe markup, we can use a package, ```scrape-schema-recipe```.

Feel free to experiment with it. 
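
To get a feeling for what such a package has to do, here is a rough sketch that pulls schema.org data out of a page using only the standard library. It assumes the data is embedded as JSON-LD in a `<script type="application/ld+json">` element, which is one common way of publishing schema.org data. This is not necessarily how ```scrape-schema-recipe``` works internally, and the class name `JsonLdExtractor` and the sample HTML are made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of all JSON-LD script elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


# Hypothetical page fragment, not taken from a real website.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Example Crisp",
 "recipeIngredient": ["5 cups fresh blueberries", "1 cup rolled oats"]}
</script>
</head><body></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_HTML)
# Keep only the top-level objects that describe a recipe.
recipes = [item for item in extractor.items
           if isinstance(item, dict) and item.get("@type") == "Recipe"]
print(recipes[0]["name"])
for ingredient in recipes[0]["recipeIngredient"]:
    print(ingredient)
```

Because the property names (`name`, `recipeIngredient`, ...) come from the shared schema.org vocabulary, the same extraction code can work across websites, unlike the site-specific class names used in Task 2.1.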

In [7]:
#try scrape-schema-recipe
!pip install scrape_schema_recipe



In [8]:
import scrape_schema_recipe
URL = "https://www.acouplecooks.com/easy-blueberry-crisp/"
recipe_new = scrape_schema_recipe.scrape_url(URL, python_objects=True)
print(len(recipe_new))

recipe_new = recipe_new[0]
#keywords
print()
print(recipe_new.keys())

#print recipe name
print(recipe_new['name'])

#print recipe published date
print('\n Published date:')
print(recipe_new['datePublished'])

#print recipe ingredient
ingredients_new = recipe_new['recipeIngredient']
print("\n Ingredients:")
for ing in ingredients_new:
    print(ing)

#print instructions
print("\n Instructions:")
instructions_new = recipe_new['recipeInstructions']
print(json.dumps(instructions_new, indent=4))


1

dict_keys(['@context', '@type', 'name', 'description', 'author', 'keywords', 'image', 'url', 'recipeIngredient', 'recipeInstructions', 'prepTime', 'cookTime', 'totalTime', 'recipeYield', 'recipeCategory', 'cookingMethod', 'recipeCuisine', 'aggregateRating', 'nutrition', 'datePublished', '@id', 'isPartOf', 'mainEntityOfPage'])
Best Blueberry Crisp Recipe

 Published date:
2019-04-23

 Ingredients:
5 cups fresh blueberries
1 tablespoon vanilla extract
2 tablespoons fresh lemon juice
2 tablespoons arrowroot powder or cornstarch
1/4 teaspoon kosher salt
5 tablespoons cold unsalted butter
1 cup rolled oats
1/2 cup coconut sugar or granulated sugar
1/2 cup almond flour
1 teaspoon vanilla extract
1/2 teaspoon kosher salt
1 teaspoon culinary lavender, crushed under the bottom of a glass until powdery (optional)*

 Instructions:
[
    {
        "@type": "HowToStep",
        "text": "Preheat the oven to 350F. Lightly grease an 8-inch pie plate or a cast-iron skillet.",
        "url": "https:/

### Task 2.2
hRecipe is a microformat specifically created for recipes.
Can you for example easily compare different dessert recipe ingredients? For inspiration you can look back at the exercises you did in Hands-on session 1 where you compared different sets of tweets.

In [9]:
re01_url = "https://www.acouplecooks.com/blueberry-cake/"
re02_url = "https://www.jamieoliver.com/recipes/fruit-recipes/blueberry-cornmeal-skillet-cake/"

recipe_acouplecooks = scrape_schema_recipe.scrape_url(re01_url, python_objects=True)
recipe_jamie = scrape_schema_recipe.scrape_url(re02_url, python_objects=True)

recipe_acouplecooks = recipe_acouplecooks[0]
recipe_jamie = recipe_jamie[0]

#acouplecooks ingredients
ingredients_acouplecooks = recipe_acouplecooks['recipeIngredient']
print("Ingredients of acouplecooks recipe:\n")
for ing in ingredients_acouplecooks:
    print(ing)
print("-----------------------------------------")
#jamie ingredients
ingredients_jamie = recipe_jamie['recipeIngredient']
print("Ingredients of jamieoliver recipe:\n")
for ing in ingredients_jamie:
    print(ing)

Ingredients of acouplecooks recipe:

1 1/4 cups plus 1 teaspoon all-purpose flour, divided
2/3 cup firmly packed cup brown sugar
1 teaspoon ground cinnamon
3/4 teaspoon baking powder
3/4 teaspoon baking soda
1/4 teaspoon kosher salt
1 egg
1/3 cup olive oil
1/2 cup plain Greek yogurt
1/2 cup unsweetened applesauce
1 teaspoon vanilla extract
1 1/2 cups blueberries
1 teaspoon lemon zest plus 1 teaspoon lemon juice
Powdered sugar (optional), for dusting
-----------------------------------------
Ingredients of jamieoliver recipe:

150 g unsalted butter (at room temperature), plus extra for greasing
200 g caster sugar 
1  lemon 
150 g fine cornmeal 
150 g ground almonds 
½ teaspoon baking powder 
4 large free-range eggs 
2 teaspoons vanilla extract 
200 g blueberries 


## Exercise 3

Schema.org is one of the most widely used annotation formats. It is a multipurpose vocabulary created by a consortium consisting of Yahoo!, Google and Microsoft. It can describe entities, events, products, etc. Check out the vocabulary specs on Schema.org.

### Task 3

Parsing schema.org microdata. To parse this data you need the rdflib package, which you installed at the start of this session.



In [10]:
from rdflib import Graph

# Source: https://www.youtube.com/watch?v=sCU214rbRZ0
# Pass in the URL of a resource that serves RDF (here, a DBpedia resource).
# Note: the misspelled "Micheal_Jackson" is a DBpedia redirect page, as the output shows.
URL = "http://dbpedia.org/resource/Micheal_Jackson"

# Initialize a graph
g = Graph()

# Fetch the RDF description from DBpedia and parse it into the graph
result = g.parse(location=URL)

# Loop through the first 10 triples in the graph
for index, (sub, pred, obj) in enumerate(g):
    print(sub, pred, obj)
    if index == 9:
        break

http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageRevisionID 631226997
http://dbpedia.org/resource/Micheal_Jackson http://www.w3.org/2002/07/owl#sameAs http://dbpedia.org/resource/Micheal_Jackson
http://dbpedia.org/resource/Micheal_Jackson http://www.w3.org/ns/prov#wasDerivedFrom http://en.wikipedia.org/wiki/Micheal_Jackson?oldid=631226997
http://en.wikipedia.org/wiki/Micheal_Jackson http://xmlns.com/foaf/0.1/primaryTopic http://dbpedia.org/resource/Micheal_Jackson
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageRedirects http://dbpedia.org/resource/Michael_Jackson
http://dbpedia.org/resource/Micheal_Jackson http://www.w3.org/2000/01/rdf-schema#label Micheal Jackson
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageID 14995602
http://dbpedia.org/resource/Micheal_Jackson http://xmlns.com/foaf/0.1/isPrimaryTopicOf http://en.wikipedia.org/wiki/Micheal_Jackson


In [11]:
# Print the size of the Graph
print(f'Graph has {len(g)} facts')

Graph has 8 facts


In [12]:
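# Print out the entire Graph in the RDF Turtle format.
# rdflib 5 returns bytes from serialize(); rdflib 6 and later return str.
ttl = g.serialize(format='ttl')
print(ttl.decode('utf-8') if isinstance(ttl, bytes) else ttl)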

@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://en.wikipedia.org/wiki/Micheal_Jackson> foaf:primaryTopic <http://dbpedia.org/resource/Micheal_Jackson> .

<http://dbpedia.org/resource/Micheal_Jackson> rdfs:label "Micheal Jackson"@en ;
    dbo:wikiPageID 14995602 ;
    dbo:wikiPageRedirects <http://dbpedia.org/resource/Michael_Jackson> ;
    dbo:wikiPageRevisionID 631226997 ;
    owl:sameAs <http://dbpedia.org/resource/Micheal_Jackson> ;
    prov:wasDerivedFrom <http://en.wikipedia.org/wiki/Micheal_Jackson?oldid=631226997> ;
    foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/Micheal_Jackson> .




### Task 3.1 
Compare the schema.org information about a band on last.fm to the Facebook Open Graph information about the same band from Facebook. What are the differences? Which format do you think supports better interoperability?

## Ans:
   The following two cells show the results for the band BTS, first as structured data parsed with rdflib and then as Facebook Open Graph information. The two representations differ considerably. First, the data parsed with rdflib is more detailed and more structured, with explicit prefixes and typed statements, whereas Facebook Open Graph exposes only a limited set of properties. Second, the Python library for Open Graph that is linked from the Open Graph index page is no longer maintained. The most convenient way to inspect the Open Graph data of a page is to use Facebook's developer tools, which are designed for previewing URLs shared on Facebook and are not well suited to knowledge representation or parsing. Moreover, the schema.org data can be serialized explicitly as a graph, which helps in later processing. In our opinion, schema.org better supports interoperability and general-purpose use.

In [150]:
# The schema.org information about band BTS 
url = "http://dbpedia.org/resource/BTS"

gp = Graph()
gp.parse(url)

print(f'Graph has {len(gp)} facts')

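# rdflib 5 returns bytes from serialize(); rdflib 6 and later return str.
ttl = gp.serialize(format='ttl')
print(ttl.decode('utf-8') if isinstance(ttl, bytes) else ttl)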

Graph has 59 facts
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://dbpedia.org/resource/BTS_(disambiguation)> dbo:wikiPageRedirects <http://dbpedia.org/resource/BTS> .

<http://dbpedia.org/resource/BtS> dbo:wikiPageRedirects <http://dbpedia.org/resource/BTS> .

<http://en.wikipedia.org/wiki/BTS> foaf:primaryTopic <http://dbpedia.org/resource/BTS> .

<http://dbpedia.org/resource/BTS> a <http://dbpedia.org/class/yago/Abstraction100002137>,
        <http://dbpedia.org/class/yago/Company108058098>,
        <http://dbpedia.org/class/yago/Group100031264>,
        <http://dbpedia.org/class/yago/Institution108053576>,
        <http://dbpedia.org/class/yago/Organization108008335>,
        <http://dbpedia.org/class/yago/SocialGroup107950920>,
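
For comparison, Open Graph information is published as `<meta property="og:..." content="...">` elements in the head of an HTML page. As a minimal sketch (the class name `OpenGraphExtractor` and the sample HTML are made up for illustration and are not taken from a real band page), these properties can be collected with the standard library alone:

```python
from html.parser import HTMLParser


class OpenGraphExtractor(HTMLParser):
    """Collect Open Graph properties from <meta property="og:..."> elements."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and "content" in attrs:
            self.properties[prop] = attrs["content"]


# Hypothetical page fragment for an imaginary band.
SAMPLE_HTML = """
<html><head>
<title>Example Band</title>
<meta property="og:title" content="Example Band">
<meta property="og:type" content="website">
<meta property="og:url" content="https://example.org/example-band">
<meta name="description" content="Not an Open Graph property">
</head><body></body></html>
"""

extractor = OpenGraphExtractor()
extractor.feed(SAMPLE_HTML)
for key, value in extractor.properties.items():
    print(key, "=", value)
```

The result is a flat set of key/value pairs about the page itself, which illustrates the contrast with the graph of typed statements that rdflib returns.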
     

### Task 3.2
Explore the various microformats at http://microformats.org/ and compare the output of the exercises with the output of http://microformats.org/. Think about possible microformats you want to support in your final assignment and read up on how to parse them.

## Ans:
   We explored three microformats: hCard, hCalendar, and hReview. Comparing them with the output of the exercises, we found that schema.org data parsed with rdflib can be serialized directly, while these three microformats cannot. Each microformat also describes a different kind of object, so none of them is universal across the many categories of web data. Of the three, hCard has the most properties and can best represent entities such as persons and organizations. Still, we will probably use schema.org in our final assignment, because there are convenient resources and instructions that help us represent and parse web data with it.