*****************************************************************
#  The Social Web: data representation
- Instructors: Jacco van Ossenbruggen.
- TAs: Ayesha Noorain, Alex Boyko, Caio Silva, Elena Beretta, Mirthe Dankloff.
- Exercises for Hands-on session 2
*****************************************************************

Group 28:
- Artin Sanaye, 2717366
    * VUnetID: ase299
    * a.sanaye@student.vu.nl
- Saman Khodadadi, 2740086
    * VUnetID: ski286
    * s.khodadadi@student.vu.nl
- Farimah Mohebi, 2766661
    * VUnetID: fmo207
    * f.mohebi@student.vu.nl
- Bono Lardinois, 2601771
    * VUnetID: bls580
    * 2601771@student.vu.nl
    

In this session you are going to mine data in various microformats. You will see the differences in what each of the formats can contain and what purpose they serve. We will start by looking at geographical data.

Prerequisites:
- Python 3.8
- Python packages: requests, BeautifulSoup4, HTMLParser, rdflib


In [37]:
# If you're using a virtualenv, make sure it's activated before running
# this cell!
!pip install requests
!pip install BeautifulSoup4
!pip install HTMLParser
!pip install rdflib

Collecting HTMLParser
  Using cached HTMLParser-0.0.2.tar.gz (6.0 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: HTMLParser
  Building wheel for HTMLParser (setup.py) ... done
  Created wheel for HTMLParser: filename=HTMLParser-0.0.2-py3-none-any.whl size=5982 sha256=968f589d6b629881c9d668d6881aaccf766e3890033ff69f12d49bb68358ffbb
  Stored in directory: /Users/Bono/Library/Caches/pip/wheels/9a/89/a5/f4d70553bc8105fa3622ed2e6584b289d6e6c859e9ee8ec858
Successfully built HTMLParser
Installing collected packages: HTMLParser
Successfully installed HTMLParser-0.0.2


##  Exercise 1

Even if web pages do not use microformats, interesting data can often be extracted from the HTML. You may use packages such as BeautifulSoup to extract arbitrary pieces of data from any HTML page.
The example below shows how to find the URL of the first image in the infobox table of the Wikipedia page on Amsterdam. Tip: compare the code below with the HTML source of the Wikipedia page: the image URL is in the "src" attribute of the first "img" element inside the "table" element with class="infobox".

In [38]:
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

# The URL of the page to scrape is set here; change it to try another page.
URL = 'https://en.wikipedia.org/wiki/Amsterdam'

req = requests.get(URL, headers={'User-Agent': "Social Web Course Student"})
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(req.text, 'html.parser')
# print(req.text)
image1 = soup.find_all('table', class_='infobox')[0].find('img')
print(image1['src'])

//upload.wikimedia.org/wikipedia/commons/thumb/b/be/KeizersgrachtReguliersgrachtAmsterdam.jpg/270px-KeizersgrachtReguliersgrachtAmsterdam.jpg


Extracting coordinates from a webpage and reformatting them in the geo microformat (based on Example 8-1 in Mining the Social Web). Note that Wikipedia pages may encode lat/long information in different ways. One of the ways used by the Amsterdam Wikipedia page is a span element that is not shown to the user:
<span class="geo">52.367; 4.900</span>
This span element has a single child (len(geoTag) == 1) and no further structure, so we have to get the latitude and longitude manually by splitting the string on the ';' character.

In [39]:

geoTag = soup.find(True, 'geo')
print(geoTag)

if geoTag and len(geoTag) > 1:
    # Expanded form: separate child elements for latitude and longitude.
    lat = geoTag.find(True, 'latitude').string
    lon = geoTag.find(True, 'longitude').string
    print('Location is at', lat, lon)
elif geoTag and len(geoTag) == 1:
    # Compact form: a single "lat; lon" string.
    (lat, lon) = geoTag.string.split(';')
    (lat, lon) = (lat.strip(), lon.strip())
    print('Location is at', lat, lon)
else:
    print('Location not found')


<span class="geo">52.367; 4.900</span>
Location is at 52.367 4.900
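
The cell above only prints the coordinates. As a minimal sketch of the reformatting step, the helper below (the name `to_geo_microformat` is made up for this illustration) wraps a latitude and longitude in the expanded form of the geo microformat, with separate `latitude` and `longitude` child elements. This is the structure that the first branch of the `if` statement above looks for.

```python
from html import escape

def to_geo_microformat(lat, lon):
    """Return an HTML snippet in the expanded geo microformat (hypothetical helper)."""
    # Validate early: both values must parse as numbers within the valid range.
    lat_f, lon_f = float(lat), float(lon)
    if not (-90 <= lat_f <= 90 and -180 <= lon_f <= 180):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    return (
        '<span class="geo">'
        f'<span class="latitude">{escape(str(lat))}</span>; '
        f'<span class="longitude">{escape(str(lon))}</span>'
        '</span>'
    )

print(to_geo_microformat("52.367", "4.900"))
```

Passing the values as strings keeps the formatting of the source page (for example the trailing zeros in 4.900).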


### Task 1

Can you convert the output of Exercise 1 into KML? Here is the KML documentation: https://developers.google.com/kml/documentation/?csw=1 and here you can find a simple example of how it is used: https://renenyffenegger.ch/notes/tools/Google-Earth/kml/index

Visualise the point in Google Maps using the following code example: https://developers.google.com/maps/documentation/javascript/examples/layer-kml-features
You will have to create your own KML file for the custom map layer and provide a URL to that file inside the JavaScript code, which means that you have to upload the file somewhere. You can use a service like http://pastebin.com/ to obtain a URL for your KML file: paste the code there, request the RAW format URL, and use that URL in this task.

Is KML a microformat, why (not)?

### Jsfiddle: https://jsfiddle.net/m23dn7oe/

### Pastebin: https://pastebin.com/raw/dAyCCZGa
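
The KML file behind the Pastebin link is not reproduced here. As a minimal sketch under our own assumptions, a file with a single Placemark for the coordinates from Exercise 1 could be generated with the standard library as follows. The helper name `point_to_kml` and the placemark name are made up for this illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def point_to_kml(name, lat, lon):
    """Build a KML document with a single Placemark (hypothetical helper)."""
    # Serialize with KML as the default namespace (no prefix on the tags).
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML expects longitude first, then latitude.
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

kml_text = point_to_kml("Amsterdam", "52.367", "4.900")
print(kml_text)
```

Note the coordinate order: the geo microformat lists latitude first, while KML lists longitude first, so swapping them is an easy mistake to make.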

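Answer: No, strictly speaking KML is not a microformat.

Reason:
A microformat is an open convention for embedding structured data in an existing HTML page by reusing standard HTML attributes such as class and rel; the span with class="geo" in Exercise 1 is an example. KML is also an open, widely used format for geographic data, and it has a tag-based structure with nested elements and attributes, but it is a standalone XML format with its own element vocabulary that lives in its own file rather than inside the markup of a web page. A tag-based structure alone therefore does not make it a microformat.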


## Exercise 2 
To find information on the web we can use microformats such as [hRecipe](https://microformats.org/wiki/hrecipe) or Schema.org's [Recipe](https://schema.org/Recipe). But first, we'll show you how to find arbitrary tags in a webpage.


### Task 2 
Parsing data for a <sub><sup>veggie</sup></sub> spaghetti alla carbonara recipe (from Example 2-7 in Mining the Social Web).

In [40]:
import requests
import json
from bs4 import BeautifulSoup

# A yummy webpage (feel free to change to your likings.)
URL = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara/"

# requests will return the html found at the given webpage...
page = requests.get(URL)
# ...and a BeautifulSoup object can be created from its content.
soup = BeautifulSoup(page.content, 'html.parser')

listchildren = list(soup.children)
print(listchildren)

['html', <html lang="en-US">
<head><meta content="index, nofollow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>
<!-- This site is optimized with the Yoast SEO Premium plugin v18.1 (Yoast SEO v18.2) - https://yoast.com/wordpress/plugins/seo/ -->
<title>Vegetarian Carbonara – A Couple Cooks</title><link as="style" href="https://fonts.googleapis.com/css2?family=Montserrat:wght@800&amp;display=swap" rel="preload"/><link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@800&amp;display=swap" media="print" onload="this.media='all'" rel="stylesheet"/><noscript><link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@800&amp;display=swap" rel="stylesheet"/></noscript>
<meta content="This vegetarian carbonara is quick and delicious; an egg yolk is stirred into the pasta to create a sauce, and smoked mozarella is used instead of bacon." name="description"/>
<link href="https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara/" 

We can find any element in the page through *CSS selectors*.
You can find them all [here](https://www.w3schools.com/cssref/css_selectors.asp), but in short: "." selects by class, "#" selects by id, and a plain name selects by element name.


You can also combine them, so that looking for ".class1.class2" selects all elements that have both classes. For a deeper overview please check the link above (or google "CSS selectors").
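
To make the selector syntax concrete, here is a small sketch on a made-up HTML fragment (the ids and class names are invented for this illustration and do not come from the recipe page). It uses a separate variable `demo` so that the `soup` object from the previous cell stays untouched.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment, only for illustrating the selector syntax.
html = """
<div id="recipe">
  <ul class="ingredients main">
    <li class="item">spaghetti</li>
    <li class="item optional">peas</li>
  </ul>
  <p class="item">not a list item</p>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

print(demo.select_one("#recipe")["id"])                       # id selector
print([li.text for li in demo.select("li")])                  # element name
print([el.text for el in demo.select(".item")])               # class selector
print([el.text for el in demo.select(".item.optional")])      # both classes
print([el.text for el in demo.select("ul.ingredients li")])   # descendant combinator
```

`select` returns every match in document order, while `select_one` returns only the first match (or `None` if nothing matches).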

In [41]:
print(len(listchildren)) # we can see here how many children the html doc has got.
ingredients_unparsed = soup.select_one(".tasty-recipes-ingredients")
# let's get all the "list item" elements in a list:
ing_unp = ingredients_unparsed.findAll('li')
print(ing_unp)

2
[<li><span data-amount="1">1</span> pound spaghetti noodles</li>, <li><span data-amount="0.5" data-unit="cup">½ cup</span> smoked mozzarella cheese</li>, <li><span data-amount="0.5" data-unit="cup">½ cup</span> grated Parmesan cheese, plus more for serving</li>, <li><span data-amount="4">4</span> egg yolks</li>, <li><span data-amount="1" data-unit="cup">1 cup</span> frozen Earthbound Farm Organic peas</li>, <li><span data-amount="8" data-unit="cup">8 cups</span> Earthbound Farm Organic spinach</li>, <li><span data-amount="3" data-unit="tablespoon">3 tablespoons</span> butter</li>, <li><a class="tasty-link" data-tasty-links-no-disclosure="" href="https://www.acouplecooks.com/what-is-kosher-salt/" rel="noopener" target="_blank">Kosher salt</a></li>, <li>Fresh ground black pepper</li>]


Mmmh... not so pretty yet. How about listing their items using the text method?

In [42]:
ingredients = [t.text for t in ing_unp]
print("Ingredients:\n")
# [print(i) for i in ingredients]  # Also prints the generator
# Instead
for ing in ingredients:
    print(ing)

Ingredients:

1 pound spaghetti noodles
½ cup smoked mozzarella cheese
½ cup grated Parmesan cheese, plus more for serving
4 egg yolks
1 cup frozen Earthbound Farm Organic peas
8 cups Earthbound Farm Organic spinach
3 tablespoons butter
Kosher salt
Fresh ground black pepper


Good. Now the instructions:

In [43]:
instructions_unparsed = soup.select_one(".tasty-recipes-instructions")
instructions_unparsed = instructions_unparsed.findAll("li")
print(instructions_unparsed)

[<li id="instruction-step-1">In a large pot, combine 6 quarts of water with 2 tablespoons <a class="tasty-link" data-tasty-links-no-disclosure="" href="https://www.acouplecooks.com/what-is-kosher-salt/" rel="noopener" target="_blank">kosher salt</a> and bring it to a boil.</li>, <li id="instruction-step-2">Grate the Parmesan and mozzarella cheese. Carefully separate four egg yolks and set aside.</li>, <li id="instruction-step-3">Once boiling, add the pasta and cook until the pasta is just about al dente, about 7 minutes; then add peas and spinach and cook for 1 minute. Reserve 1 cup cooking water, and then drain the pasta and vegetables.</li>, <li id="instruction-step-4">In a skillet, melt the butter, then stir in the cheeses, ¼ cup pasta water, and ¼ teaspoon <a class="tasty-link" data-tasty-links-no-disclosure="" href="https://www.acouplecooks.com/what-is-kosher-salt/" rel="noopener" target="_blank">kosher salt</a>. Stir in the pasta and vegetables until creamy over low heat, adding 

Let's finish off with the title:

In [44]:
title_unparsed = soup.select_one(".post-header") # 
categorical_title = title_unparsed.text.split("›") # website specific divider.
recipe_title = categorical_title[-1].strip() # let's remove that ugly space at the beginning.
recipe_title

'Vegetarian Carbonara vegetarianJump to Recipeby Sonja OverhiserBuy Our Cookbook'

## Task 2.1
Now it's your turn. Create a function that can scrape any recipe webpage from the same website (other websites will have different class tags). 

Make sure to:

- return itemized content (e.g. ingredients) in a list. You may want to use a list comprehension here.
- Not all items have been cleaned of their HTML markup (see variables ```ingredients``` vs. ```instructions_unparsed```). Make sure to return a list with human-readable content (e.g. by using the ```.text``` attribute).


In [45]:
# -*- coding: utf-8 -*-

import requests
import json
from bs4 import BeautifulSoup

# Pass in a URL containing hRecipe, such as
# https://www.jamieoliver.com/recipes/pasta-recipes/veggie-carbonara/


URL = "https://www.jamieoliver.com/recipes/rice-recipes/mexican-inspired-bowl/"  # your recipe URL here

# Parse out some of the pertinent information for a recipe.
# See http://microformats.org/wiki/hrecipe.
# We need the title, the ingredients and the instructions.

def parse_website(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Your code here

    # get title
    title = soup.title
    fn = title.string
    
    # ingredients
    ingredients_unparsed = soup.select_one(".ingred-list")
    ing_unp = ingredients_unparsed.findAll('li')
    ingredients = [t.text for t in ing_unp]
    
    # Instructions
    instructions_unparsed = soup.select_one(".recipeSteps")
    instructions_unparsed = instructions_unparsed.findAll("li")
    instructions = [t.text for t in instructions_unparsed]

    return {
            'name': fn,
            'ingredients': ingredients,
            'instructions': instructions,
            }
    
recipe = parse_website(URL)
print (recipe)

{'name': 'Mexican-inspired bowl | Jamie Oliver', 'ingredients': ['\n                                                                                                            320                                                                                                                                                                g                                                                                                                                                                brown rice                                                                                                                                                        ', '\n                                                                                                            1                                                                                                                                                                                                                    corn on the
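
The ingredient strings in this output still contain long runs of spaces and newlines from the page layout. A minimal sketch of one way to tidy them up (the helper name `clean_text` is our own, and the sample string is a shortened stand-in for the real one):

```python
def clean_text(text):
    """Collapse any run of whitespace into a single space and trim the ends."""
    return " ".join(text.split())

# Shortened stand-in for one of the raw ingredient strings printed above.
raw = "\n            320            g            brown rice          "
print(clean_text(raw))
```

Inside `parse_website` this could be applied in the list comprehensions, for example `[clean_text(t.text) for t in ing_unp]`.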

But how can we get information not only from one website, but from all of them?

The answer: microformats.

Rather than extracting the information manually from the schema.org or hRecipe markup, we can use a package, ```scrape-schema-recipe```.

Feel free to experiment with it.
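
To get a feeling for what such a package has to do, here is a minimal sketch that pulls a schema.org Recipe out of a JSON-LD script element using only the standard library. We do not claim that this is how ```scrape-schema-recipe``` is implemented; the class name `JsonLdExtractor` and the page fragment are made up for this illustration, and real pages may use microdata attributes instead of JSON-LD.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False

# Made-up page fragment; real pages carry many more properties.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Example tart",
 "recipeIngredient": ["250g plain flour", "400g golden syrup"]}
</script>
</head><body></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(sample_html)
recipes = [item for item in extractor.items if item.get("@type") == "Recipe"]
print(recipes[0]["name"], recipes[0]["recipeIngredient"])
```

Because every site that follows the vocabulary uses the same property names (`name`, `recipeIngredient`, `recipeInstructions`), the same code works without site-specific class selectors.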

In [46]:
import scrape_schema_recipe

url = 'https://www.jamieoliver.com/recipes/rice-recipes/mexican-inspired-bowl/'
recipe_list = scrape_schema_recipe.scrape_url(url, python_objects=True)
recipe = recipe_list[0]

print(recipe['name'])
for i in recipe['recipeIngredient']:
    print(i)

instructions = BeautifulSoup(recipe['recipeInstructions'], 'html.parser')
print(instructions.get_text())


Mexican-inspired bowl
320 g brown rice 
1  corn on the cob 
1  red onion 
2 cloves of garlic 
1  fresh jalapeño or green chilli 
1 small red pepper 
1 small yellow pepper 
1 bunch of fresh coriander (30g)
  olive oil 
1 teaspoon ground cumin 
1 teaspoon  ground coriander 
1 stick of cinnamon 
1 x 400 g tin of quality plum tomatoes 
1  ripe avocado 
2  limes 
  extra virgin olive oil 
1 x 400 g tin of black beans 
½ teaspoon smoked paprika 
  natural yoghurt 
  hot chilli sauce 
6  wholemeal tortillas 
Cook 320g of brown rice according to the packet instructions. Once cooked, drain, return to the pan and cover with a lid to keep warm.Meanwhile, place a large dry frying pan over a medium heat. Once hot, add 1 corn on the cob and cook for around 10 minutes, or until blackened all over, turning occasionally.Peel and roughly chop 1 red onion, then peel and finely slice 2 cloves of garlic, along with 1 jalapeño or green chilli. Deseed and roughly chop 1 small red and 1 small yellow pepper, t

### Task 2.2
hRecipe is a microformat specifically created for recipes.
Can you for example easily compare different dessert recipe ingredients? For inspiration you can look back at the exercises you did in Hands-on session 1 where you compared different sets of tweets.

In [47]:
def recipe_scraper(url):
    recipe_list = scrape_schema_recipe.scrape_url(url, python_objects=True)
    recipe = recipe_list[0]

    print(recipe['name'])
    # print(recipe['recipeIngredient'])
    for i in recipe['recipeIngredient']:
        print(i)
    print(recipe['recipeInstructions']) 

## Recipe 1

In [48]:
url = "https://www.bbcgoodfood.com/recipes/easy-chocolate-molten-cakes"
recipe_scraper(url)

Easy chocolate molten cakes
100g butter, plus extra to grease
100g dark chocolate, chopped
150g light brown soft sugar
3 large eggs
½ tsp vanilla extract
50g plain flour
single cream, to serve
[{'@type': 'HowToStep', 'text': 'Heat oven to 200C/180C fan/gas 6. Butter 6 dariole moulds or basins well and place on a baking tray.'}, {'@type': 'HowToStep', 'text': 'Put 100g butter and 100g chopped dark chocolate in a heatproof bowl and set over a pan of hot water (or alternatively put in the microwave and melt in 30 second bursts on a low setting) and stir until smooth. Set aside to cool slightly for 15 mins.'}, {'@type': 'HowToStep', 'text': 'Using an electric hand whisk, mix in 150g light brown soft sugar, then 3 large eggs, one at a time, followed by ½ tsp vanilla extract and finally 50g plain flour. Divide the mixture among the darioles or basins.'}, {'@type': 'HowToStep', 'text': "You can now either put the mixture in the fridge, or freezer until you're ready to bake them. Can be cooked

## Recipe 2

In [49]:
url = "https://www.bbcgoodfood.com/recipes/treacle-tart"
recipe_scraper(url)

Treacle tart
250g plain flour
½ tsp fine salt
140g cold unsalted butter, cubed
3 tbsp icing sugar
2 medium egg yolks
2-3 tbsp cold water
400g golden syrup
1 ball stem ginger in syrup, finely chopped, plus 50g of the syrup
1 lemon, zested
2 medium eggs, lightly beaten
100g fine fresh white breadcrumbs
[{'@type': 'HowToStep', 'text': 'Sieve the flour and salt into a large bowl. Add the butter and rub together with your fingers to a fine breadcrumb-like texture (you can also do this part in a food processor). Stir though the icing sugar, then quickly add the egg yolks and 2 tbsp water, mixing swiftly with a cutlery knife to combine. Form into a ball (add another tbsp water if you need to), wrap and chill for 30 mins. Roll out to the\xa0thickness of a pound coin, and line a 22cm fluted tart tin with the pastry, leaving\xa0an overhang. Return to the fridge for 30 mins.'}, {'@type': 'HowToStep', 'text': 'Heat the oven to 200C/180C fan/gas 6. Put a baking sheet into the oven to heat up. Line 
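
Because both recipes expose their ingredients under the same `recipeIngredient` property, they can be compared directly. The sketch below is one crude way to do so: it copies the ingredient lines from the two outputs above, reduces them to sets of words, and computes their overlap (Jaccard similarity). The stopword list is an ad hoc assumption, and a word-level comparison also counts descriptive words such as "chopped" or "plain" as matches.

```python
import re

# Ingredient lines copied from the two outputs above.
molten_cakes = [
    "100g butter, plus extra to grease",
    "100g dark chocolate, chopped",
    "150g light brown soft sugar",
    "3 large eggs",
    "½ tsp vanilla extract",
    "50g plain flour",
    "single cream, to serve",
]
treacle_tart = [
    "250g plain flour",
    "½ tsp fine salt",
    "140g cold unsalted butter, cubed",
    "3 tbsp icing sugar",
    "2 medium egg yolks",
    "2-3 tbsp cold water",
    "400g golden syrup",
    "1 ball stem ginger in syrup, finely chopped, plus 50g of the syrup",
    "1 lemon, zested",
    "2 medium eggs, lightly beaten",
    "100g fine fresh white breadcrumbs",
]

# Units and filler words to ignore; this list is an ad hoc assumption.
STOPWORDS = {"g", "tsp", "tbsp", "plus", "to", "of", "the", "in", "extra"}

def ingredient_words(lines):
    """Return the set of lowercase words in the lines, minus numbers and stopwords."""
    words = set()
    for line in lines:
        words.update(re.findall(r"[a-z]+", line.lower()))
    return words - STOPWORDS

a, b = ingredient_words(molten_cakes), ingredient_words(treacle_tart)
shared = a & b
jaccard = len(shared) / len(a | b)
print(sorted(shared))
print(f"Jaccard similarity: {jaccard:.2f}")
```

In the notebook itself, the two lists could instead be taken from `recipe['recipeIngredient']` if `recipe_scraper` returned the recipe rather than only printing it.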

## Exercise 3

Schema.org is one of the most widely used annotation formats. It is a multipurpose vocabulary created by a consortium consisting of Yahoo!, Google and Microsoft. It can describe entities, events, products, etc. Check out the vocabulary specs on Schema.org.

### Task 3

Parsing schema.org microdata. To parse this data you need the rdflib package, which you installed in one of the previous steps.



In [50]:
from rdflib import Graph

# Source: https://www.youtube.com/watch?v=sCU214rbRZ0
# Pass in the URL of a DBpedia resource; DBpedia serves it as RDF.
# Note: "Micheal" is a misspelling that DBpedia keeps as a redirect resource.
URL = "http://dbpedia.org/resource/Micheal_Jackson"

# Initialize a graph
g = Graph()

# Fetch the RDF description from DBpedia and parse it into the graph
result = g.parse(location=URL)

# Loop through at most the first 10 triples in the graph
for index, (sub, pred, obj) in enumerate(g):
    print(sub, pred, obj)
    if index == 9:
        break

http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageRevisionID 1056738079
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageWikiLink http://dbpedia.org/resource/Michael_Jackson
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/property/wikiPageUsesTemplate http://dbpedia.org/resource/Template:R_from_misspelling
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageLength 68
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageID 14995602
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageRedirects http://dbpedia.org/resource/Michael_Jackson
http://dbpedia.org/resource/Micheal_Jackson http://www.w3.org/ns/prov#wasDerivedFrom http://en.wikipedia.org/wiki/Micheal_Jackson?oldid=1056738079&ns=0
http://dbpedia.org/resource/Micheal_Jackson http://xmlns.com/foaf/0.1/isPrimaryTopicOf http://en.wikipedia.org/wiki/Micheal_Jackson
http://dbpedia.o

In [51]:
# Print the size of the Graph
print(f'Graph has {len(g)} facts')

Graph has 9 facts


In [52]:
# Print out the entire Graph in the RDF Turtle format
print(g.serialize(format='ttl'))

@prefix ns1: <http://dbpedia.org/ontology/> .
@prefix ns2: <http://dbpedia.org/property/> .
@prefix ns3: <http://www.w3.org/ns/prov#> .
@prefix ns4: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://dbpedia.org/resource/Micheal_Jackson> rdfs:label "Micheal Jackson"@en ;
    ns1:wikiPageID 14995602 ;
    ns1:wikiPageLength "68"^^xsd:nonNegativeInteger ;
    ns1:wikiPageRedirects <http://dbpedia.org/resource/Michael_Jackson> ;
    ns1:wikiPageRevisionID 1056738079 ;
    ns1:wikiPageWikiLink <http://dbpedia.org/resource/Michael_Jackson> ;
    ns2:wikiPageUsesTemplate <http://dbpedia.org/resource/Template:R_from_misspelling> ;
    ns3:wasDerivedFrom <http://en.wikipedia.org/wiki/Micheal_Jackson?oldid=1056738079&ns=0> ;
    ns4:isPrimaryTopicOf <http://en.wikipedia.org/wiki/Micheal_Jackson> .




### Task 3.1 
Compare the schema.org information about a band on last.fm to the Facebook Open Graph information about the same band from Facebook. What are the differences? Which format do you think supports better interoperability?

In [53]:
URL = "https://www.last.fm/music/Tyler,+the+Creator"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
info = soup.find('div', itemtype='http://schema.org/MusicGroup')
print(info)

<div data-page-resource-blacklist-level="" data-page-resource-name="Tyler, the Creator" data-page-resource-type="artist" itemscope="" itemtype="http://schema.org/MusicGroup">
<div data-require="components/disclose-base,components/disclose-autoclose-v2,components/disclose-dropdown-v2,components/disclose-dropdown-location-picker-v2,components/disclose-collapsing-nav-v2,components/disclose-artwork,components/disclose-remove,components/disclose-search,components/disclose-hover-v3,components/disclose-select,components/disclose-lazy-buylinks,components/focus-controls,components/prevent-resubmit-v2,components/edit-scrobble,components/toggle-buttons,components/click-proxy,components/bookmark-notification,components/follow-notification,components/tourguide"></div>
<nav class="masthead"><div class="masthead-inner-wrap"><div class="masthead-logo"><span class="masthead-logo-loading"></span><a href="/"> Last.fm</a></div><a aria-controls="masthead-search" class="masthead-search-toggle" data-disclose

## Facebook

In [54]:
URL = "https://www.facebook.com/TylerTheCreator/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
info = soup.find('div', itemtype='http://schema.org/MusicGroup')
print(info)

None


The results show that the last.fm page is annotated with schema.org, but the same lookup returns nothing for Facebook. This means that last.fm uses schema.org microdata (itemscope and itemtype attributes) to structure the information on its page. Note that our code only searched the Facebook page for schema.org markup; Facebook's Open Graph protocol organizes its information differently, as meta elements with og: properties in the page head, so a result of None says nothing about the Open Graph data itself.

With this said, we think that last.fm supports better interoperability: because it uses schema.org, a vocabulary shared by many sites and search engines, systems and applications can extract information from the page in a structured manner.
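
For comparison, Open Graph data is published as `meta` elements whose `property` attribute starts with `og:`. A minimal sketch of collecting those properties with the standard library is given below. The class name `OpenGraphParser` is our own, and the HTML fragment is a made-up example with placeholder values, not real Facebook output.

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect Open Graph properties from <meta property="og:..."> elements."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if tag == "meta" and prop.startswith("og:"):
            self.properties[prop] = attrs.get("content")

# Made-up example head; the values are placeholders, not real Facebook output.
sample_html = """
<html><head>
<meta property="og:title" content="Example Band" />
<meta property="og:type" content="profile" />
<meta property="og:url" content="https://example.org/band" />
<meta name="description" content="ignored, not an og: property" />
</head></html>
"""

og = OpenGraphParser()
og.feed(sample_html)
print(og.properties)
```

Against a real page one would call `og.feed(page.text)`. Whether Facebook returns these tags to an unauthenticated request is something we did not verify in this notebook.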

### Task 3.2
Explore the various microformats at http://microformats.org/ and compare the output of the exercises with the output of http://microformats.org/. Think about possible microformats you want to support in your final assignment and read up on how to parse them.

For our final assignment, we are thinking about doing a network analysis on, for example, LinkedIn profiles. Such an analysis shows who is connected to whom. To extract information from someone's profile, one option is the h-card microformat, which makes it easy to extract personal information from a page, such as a name, an email address, or the company someone works for.
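
As a rough sketch of what reading an h-card involves, the parser below picks up the `p-name`, `p-org` and `u-email` properties from a made-up HTML fragment. It is a simplification under our own assumptions: it handles a single flat h-card, takes the email address only from a `mailto:` link, and ignores the rest of the microformats2 parsing rules. The class name `HCardParser` and the sample person are invented for this illustration.

```python
from html.parser import HTMLParser

class HCardParser(HTMLParser):
    """Collect p-name, p-org and u-email from a flat h-card (simplified sketch)."""

    PROPS = {"p-name", "p-org", "u-email"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # property whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        href = attrs.get("href") or ""
        for prop in self.PROPS & classes:
            if prop == "u-email" and href.startswith("mailto:"):
                # For links, the value is in the href, not in the link text.
                self.card["email"] = href[len("mailto:"):]
            else:
                self._current = prop[2:]  # drop the "p-" or "u-" prefix

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()
            self._current = None

# Made-up h-card; the person and addresses are placeholders.
sample_html = """
<div class="h-card">
  <span class="p-name">Jane Doe</span>
  <a class="u-email" href="mailto:jane@example.org">mail</a>
  <span class="p-org">Example Corp</span>
</div>
"""

parser = HCardParser()
parser.feed(sample_html)
print(parser.card)
```

For real pages a dedicated microformats parser would be a safer choice than this sketch, since h-cards can be nested and properties can be expressed in several ways.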

Finally, the microformats we will use for our final assignment depend strongly on the website we are going to use (we used LinkedIn as an example, but this can still change). We will probably fetch the data from the website in a JSON-based format, which makes it easier to process later in Python.