# Data formats and open data
**Exercises for week 5B** in Digital Methods, University of Copenhagen

## 1. HTML

HTML is the markup language used by web pages. It's ubiquitous on the web; even when editing this notebook you are interacting with HTML (right click and hit "View Page Source" if you need proof). Here follow some exercises to get you comfortable with navigating HTML on web pages.

> **Ex. 1**: Right click inside the cell below and hit "Inspect". This should launch the "Inspector" tool in your browser, showing you where the element that renders the cell sits inside the DOM.
1. How deeply is it nested? Are there any sibling elements? We counted seven parents and three siblings.
2. What happens when you update it? Change the text and see for yourself.
>
> *Hint: Most modern browsers (e.g. Firefox, Chrome, Brave) will let you hover elements in the DOM to show where they display on the web page.*

*HTML is a beautiful soup of hypertext!*

> **Ex. 2**: In the HTML code below:
1. What is typically the use of the `<p>`, `<h1>` and `<h2>` tags? Look them up, what are they for?
2. What are the attributes of the `div` element?
3. Create a text file that ends with ".html" and open it in a browser.

    <html>
    <body>

    <div width=200 height=100 id="main">
        <h1>This is the main title of the webpage</h1>
        <h2>This is a sub-heading</h2>
        <p>This is a paragraph of text.</p>
    </div>

    <h2>This is another sub-heading</h2>
    <p>This is a paragraph of text with some words in bold.</p>
    <img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FkFgzrTt798d2w%2Fgiphy.gif&f=1&nofb=1" width="493" height="340">
    <p>And that just above is an image.</p>

    </body>
    </html>
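For item 3, the file can also be created from Python. The sketch below is one way to do it; the helper `write_html`, the file name `toy_page.html` and the shortened `TOY_HTML` string are illustrative choices, not part of the exercise.

```python
from pathlib import Path
import tempfile

# A shortened version of the toy page (illustrative only).
TOY_HTML = """<html>
<body>
<div id="main">
    <h1>This is the main title of the webpage</h1>
    <p>This is a paragraph of text.</p>
</div>
</body>
</html>
"""

def write_html(directory, html, name="toy_page.html"):
    """Write `html` to `directory/name` and return the resulting Path."""
    path = Path(directory) / name
    path.write_text(html, encoding="utf-8")
    return path

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    page = write_html(tmp, TOY_HTML)
    page_uri = page.resolve().as_uri()          # a file:// URI a browser can open
    round_trip = page.read_text(encoding="utf-8")
```

To actually view the page, write it to a permanent folder and pass the URI to `webbrowser.open` from the standard library.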


> **Ex. 3**: Using the `requests` module, download [this web page](https://www.boliga.dk/resultat?propertyType=3&zipCodes=2200&page=1). Print the first 100 lines of the HTML string. How many lines are there in total?
>
> *Hint: use the `requests.get` method. To figure out how it works, execute `?requests.get` (after importing `requests`), which displays the function's documentation.*

In [1]:
from bs4 import BeautifulSoup
import requests as rq
url="https://www.boliga.dk/resultat?propertyType=3&zipCodes=2200&page=1"
r=rq.get(url).text

In [2]:
len(r)

678943

In [3]:
splitted=r.split('\n')

In [4]:
len(splitted) #is the number of lines!

2368

In [5]:
splitted[2]

"  <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':"

In [6]:
er_det_hundrede=splitted[0:100]

In [7]:
len(er_det_hundrede)

100

In [8]:
# To solve the assignment, print the first 100 lines as text:
# print("\n".join(er_det_hundrede))
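A minimal sketch of printing the first lines as text and counting lines, using a short stand-in string (`sample_html`, hypothetical) instead of the downloaded page. The helper names `first_lines` and `count_lines` are made up for illustration.

```python
def first_lines(text, n=100):
    """Return the first `n` lines of `text` joined back into one string."""
    return "\n".join(text.splitlines()[:n])

def count_lines(text):
    """Count lines the way str.splitlines() sees them."""
    return len(text.splitlines())

sample_html = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n"

print(first_lines(sample_html, n=2))
n_lines = count_lines(sample_html)
```

Note that `text.split('\n')` yields one extra, empty element when the text ends with a newline, so the two ways of counting can differ by one.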

## 1.2 Scraping

*Scraping* means parsing HTML and collecting the important pieces of information inside. *Crawling* is
another important concept: it refers to automatically sifting through pages of the web and scraping
information from each page. Most scraping and crawling work can be done using the two modules `requests` and
`BeautifulSoup`.
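The division of labour between crawling and scraping can be sketched as a loop. This is only an outline under stated assumptions: `crawl`, `fake_web` and `toy_scrape` are hypothetical names, and the "web" here is a dictionary so that the example runs offline.

```python
import time

def crawl(urls, fetch, scrape, delay=0.0):
    """Fetch each URL in turn and collect whatever `scrape` extracts from it."""
    results = []
    for url in urls:
        html = fetch(url)              # crawling: move from page to page
        results.extend(scrape(html))   # scraping: pull data out of one page
        time.sleep(delay)              # pause between requests to be polite
    return results

# Stand-ins so the sketch runs offline (hypothetical pages and a toy scraper).
fake_web = {
    "page1": "<p>a</p><p>b</p>",
    "page2": "<p>c</p>",
}

def toy_scrape(html):
    """Toy scraper: return the text between <p> and </p> tags."""
    return [chunk.split("</p>")[0] for chunk in html.split("<p>")[1:]]

items = crawl(["page1", "page2"], fetch=fake_web.get, scrape=toy_scrape)
print(items)
```

In real use, `fetch` could be something like `lambda url: requests.get(url).text`, and `scrape` a function built on BeautifulSoup.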

> **Ex. 4:** Load the toy example HTML with BeautifulSoup. Use the [documentation page](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for reference on how to do this.
1. Access the `h1` element inside the `div` and print out its content (which is "This is the main title of the webpage").
2. Get the value of the `src` attribute inside the `img` element.
3. Get the second subheading that contains "This is another sub-heading" and print out that content.
4. Get the `div` element by searching for its id.

In [9]:
import requests as rq
url1="""    <html>
    <body>

    <div width=200 height=100 id="main">
        <h1>This is the main title of the webpage</h1>
        <h2>This is a sub-heading</h2>
        <p>This is a paragraph of text.</p>
    </div>

    <h2>This is another sub-heading</h2>
    <p>This is a paragraph of text with some words in bold.</p>
    <img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FkFgzrTt798d2w%2Fgiphy.gif&f=1&nofb=1" width="493" height="340">
    <p>And that just above is an image.</p>

    </body>
    </html>"""
soup=BeautifulSoup(url1,"html.parser")

In [10]:
print(soup.prettify())

<html>
 <body>
  <div height="100" id="main" width="200">
   <h1>
    This is the main title of the webpage
   </h1>
   <h2>
    This is a sub-heading
   </h2>
   <p>
    This is a paragraph of text.
   </p>
  </div>
  <h2>
   This is another sub-heading
  </h2>
  <p>
   This is a paragraph of text with some words in bold.
  </p>
  <img height="340" src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FkFgzrTt798d2w%2Fgiphy.gif&amp;f=1&amp;nofb=1" width="493"/>
  <p>
   And that just above is an image.
  </p>
 </body>
</html>



In [11]:
soup.h1.string #Assignment 1

'This is the main title of the webpage'

In [12]:
soup.img["src"] #sweet, assignment 2

'https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FkFgzrTt798d2w%2Fgiphy.gif&f=1&nofb=1'

In [13]:
soup.find_all("h2")[1] #boojaaaa assignment 3

<h2>This is another sub-heading</h2>

In [14]:
soup.find(id="main") #assignment 4

<div height="100" id="main" width="200">
<h1>This is the main title of the webpage</h1>
<h2>This is a sub-heading</h2>
<p>This is a paragraph of text.</p>
</div>
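The same four lookups can also be written with CSS selectors, which make the nesting explicit (for instance, "the `h1` inside the `div`"). A sketch on a shortened copy of the toy page; `picture.gif` is a placeholder for the long image URL.

```python
from bs4 import BeautifulSoup

toy_html = """
<html><body>
<div width=200 height=100 id="main">
    <h1>This is the main title of the webpage</h1>
    <h2>This is a sub-heading</h2>
</div>
<h2>This is another sub-heading</h2>
<img src="picture.gif" width="493" height="340">
</body></html>
"""

soup = BeautifulSoup(toy_html, "html.parser")

title = soup.select_one("div#main h1").get_text()   # h1 nested inside the div
image_src = soup.select_one("img")["src"]           # attribute lookup
second_h2 = soup.find_all("h2")[1].get_text()       # the text only, not the tag
main_div = soup.select_one("#main")                 # lookup by id
```

Calling `.get_text()` returns the content as a plain string, whereas the bare element prints with its surrounding tags.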

> **Ex. 5** Load the HTML you downloaded in Ex. 3. For each post, extract price, square meter size and "Ejerudgift". You should create three different lists that contain each variable across posts.

In [15]:
import requests as rq
from bs4 import BeautifulSoup
#... to begin our work with online material and BeautifulSoup!

In [16]:
# The HTML string downloaded in Ex. 3 is stored in r:
soup2=BeautifulSoup(r,"html.parser")

In [17]:
# Every price sits in an element that is identified by a class rather than an id.
# Because `class` is a reserved word in Python, BeautifulSoup uses the keyword argument `class_`:
soup2.find_all(class_="primary-value d-flex justify-content-end")

[<div _ngcontent-sc52="" class="primary-value d-flex justify-content-end"><app-tooltip _ngcontent-sc52="" _nghost-sc40="" class="ml-2 md-right d-flex"><!-- --><p _ngcontent-sc40="" class="app-tooltip"><!-- --><!-- --> Pris er faldet <!-- --><!-- --></p><!-- --></app-tooltip> 2.495.000 kr. </div>,
 <div _ngcontent-sc52="" class="primary-value d-flex justify-content-end"><app-tooltip _ngcontent-sc52="" _nghost-sc40="" class="ml-2 md-right d-flex"><!-- --><p _ngcontent-sc40="" class="app-tooltip"><!-- --><!-- --> Pris er faldet <!-- --><!-- --></p><!-- --></app-tooltip> 8.300.000 kr. </div>,
 <div _ngcontent-sc52="" class="primary-value d-flex justify-content-end"><app-tooltip _ngcontent-sc52="" _nghost-sc40="" class="ml-2 md-right d-flex"><!-- --><p _ngcontent-sc40="" class="app-tooltip"><!-- --><!-- --> Pris er faldet <!-- --><!-- --></p><!-- --></app-tooltip> 1.345.000 kr. </div>,
 <div _ngcontent-sc52="" class="primary-value d-flex justify-content-end"><app-tooltip _ngcontent-sc52="" 
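A caveat on this kind of search: passing several class names as one string to `class_` matches only elements whose `class` attribute is exactly that string, in that order. The sketch below compares this with matching on a single class. The three `div`s are made-up markup, with class names borrowed from the output above.

```python
from bs4 import BeautifulSoup

sample = """
<div class="primary-value d-flex justify-content-end"> 2.495.000 kr. </div>
<div class="d-flex primary-value justify-content-end"> 8.300.000 kr. </div>
<div class="secondary-value d-flex"> 47 m² </div>
"""
soup = BeautifulSoup(sample, "html.parser")

# Exact string match: the second div has the same classes in another order.
exact = soup.find_all(class_="primary-value d-flex justify-content-end")
# Single class: matches any element that carries it, whatever else is there.
by_one_class = soup.find_all(class_="primary-value")
# The equivalent CSS selector.
by_selector = soup.select("div.primary-value")

prices = [div.get_text(strip=True) for div in by_one_class]
```

Matching on one distinctive class tends to be more robust if the site reorders or adds styling classes.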

In [18]:
# WE'RE SKIPPING FUNCTIONS
import re
Prices=soup2.find_all(class_="primary-value d-flex justify-content-end")
Prices_list=[]
for element in Prices:
    Prices_list.append(element)
    # to make it more cooperative as a list
# I've changed my mind - it's better to turn this into a string in order to use findall() later
# (regex patterns only search strings, not BeautifulSoup tags)
pricestring=str(Prices_list)

In [46]:
# Define the patterns we want to find with the re module:
# whitespace + digits + dot + digits + dot + digits, e.g. " 2.495.000"
number=re.compile(r"(\s\d+?\.\d+\.\d+)")
# shorter variant that keeps only the first two groups, e.g. " 2.495" (the price in millions)
numbershort=re.compile(r"(\s\d+?\.\d+)")

In [47]:
clean_prices=numbershort.findall(pricestring)

In [48]:
boligprisliste=numbershort.findall(pricestring)
len(boligprisliste)
#length of boligprisliste is 52

52

In [49]:
#WE FINALLY HAVE THE PRICESSSSS IN A LIST - no function necessary!
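The cells above keep only the leading digits of each price, which yields prices in millions. As an alternative sketch, assuming prices always appear in the form `2.495.000 kr.`, the full amount can be captured and converted to an integer number of kroner. `PRICE_RE`, `parse_prices` and the `sample` string are hypothetical.

```python
import re

# One to three digits followed by one or more ".ddd" groups, then "kr".
PRICE_RE = re.compile(r"(\d{1,3}(?:\.\d{3})+)\s*kr")

def parse_prices(text):
    """Return every Danish-formatted price in `text` as an integer number of kroner."""
    return [int(match.replace(".", "")) for match in PRICE_RE.findall(text)]

sample = "Pris er faldet 2.495.000 kr. ... 8.300.000 kr. ... 1.345.000 kr."
prices_kr = parse_prices(sample)

# Dividing by one million gives the same scale as the float-based approach.
prices_mio = [p / 1_000_000 for p in prices_kr]
```

Anchoring the pattern on `kr` makes it less likely to pick up unrelated numbers elsewhere in the markup.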

In [50]:
# Now for square meters:
# find_all("div")[13], maybe
# <span _ngcontent-boliga-app-c46="" class="text-nowrap">47 m²</span>

In [51]:
import re
collectionscrap=soup2.find_all("app-house-details")
meters=re.compile(r"\d+\s")
# so now, after defining the pattern, let's collect the square-meter details into a list

In [52]:
meterliste=[]
for element in collectionscrap:
    meterliste.append(element.find_all("span")[1])
meterstring=str(meterliste)

In [53]:
# this gives a neat list, but it is not purely numeric yet; we extract the numbers like this:
clean_meters=meters.findall(meterstring)

In [54]:
len(clean_meters)
#.. the lengths match <3 thank goodness

52
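Equal list lengths do not guarantee that position *i* in one list belongs to the same listing as position *i* in the other: one missing value shifts everything after it. A sketch of a more defensive approach is to parse one listing at a time, assuming each listing's text is available as a string. The names and the third sample listing are hypothetical.

```python
import re

SIZE_RE = re.compile(r"(\d+)\s*m²")
PRICE_RE = re.compile(r"(\d{1,3}(?:\.\d{3})+)\s*kr")

def parse_listing(text):
    """Extract (square_meters, price_kr) from one listing's text; None where missing."""
    size = SIZE_RE.search(text)
    price = PRICE_RE.search(text)
    return (
        int(size.group(1)) if size else None,
        int(price.group(1).replace(".", "")) if price else None,
    )

listings = [
    "2.495.000 kr. 60 m²",
    "8.300.000 kr. 191 m²",
    "Pris ikke oplyst 32 m²",   # made-up listing with no price
]
records = [parse_listing(text) for text in listings]

# Keep only listings where both fields were found.
complete = [r for r in records if None not in r]
```

Because each record is built from a single listing, a missing field shows up as `None` instead of silently misaligning the lists.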

In [55]:
y=[]   # square meters
x=[]   # prices in millions of kroner
for element in clean_meters:
    yel=int(element)
    y.append(yel)
for element in clean_prices:
    xyel=float(element)
    x.append(xyel)
    # the prices end up in millions because the regex only captured the leading
    # digits (e.g. " 2.495" from "2.495.000"); nothing was actually divided

In [58]:
price_meter=zip(y,x)

In [59]:
list(price_meter)

[(60, 2.495),
 (191, 8.3),
 (32, 1.345),
 (29, 1.398),
 (47, 1.445),
 (44, 1.495),
 (44, 1.495),
 (37, 1.495),
 (44, 1.575),
 (51, 1.595),
 (46, 1.595),
 (44, 1.75),
 (43, 1.75),
 (52, 1.795),
 (53, 1.795),
 (44, 1.795),
 (43, 1.795),
 (47, 1.845),
 (44, 1.895),
 (53, 1.895),
 (55, 1.898),
 (42, 1.898),
 (42, 1.975),
 (58, 1.995),
 (45, 1.999),
 (52, 2.045),
 (50, 2.095),
 (43, 2.099),
 (54, 2.145),
 (66, 2.198),
 (52, 2.199),
 (56, 2.248),
 (59, 2.295),
 (49, 2.295),
 (59, 2.295),
 (47, 2.298),
 (73, 2.299),
 (53, 2.345),
 (55, 2.345),
 (56, 2.395),
 (54, 2.395),
 (61, 2.395),
 (47, 2.4),
 (58, 2.445),
 (53, 2.445),
 (44, 2.45),
 (57, 2.495),
 (60, 2.495),
 (58, 2.495),
 (46, 2.595),
 (61, 2.625),
 (48, 2.635)]

> **Ex. 6:** Make a scatter plot of square meter size vs. extracted price. Then make a new variable that 
measures price per square meter and scatter plot this against "Ejerudgift". Can you say anything about how
"Ejerudgift" influences square meter price?

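A possible starting point for Ex. 6, sketched with the first three listings from the output above (prices converted to kroner). The helpers `price_per_sqm` and `scatter` are hypothetical, and `matplotlib` is assumed to be installed for the plotting part only.

```python
def price_per_sqm(prices_kr, sizes_m2):
    """Return the price per square meter for each (price, size) pair."""
    return [price / size for price, size in zip(prices_kr, sizes_m2)]

def scatter(x, y, xlabel, ylabel):
    """Draw a scatter plot; matplotlib is imported here so the rest runs without it."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    return fig

sizes = [60, 191, 32]
prices = [2_495_000, 8_300_000, 1_345_000]
per_sqm = price_per_sqm(prices, sizes)

# scatter(sizes, prices, "Size (m²)", "Price (kr.)")   # uncomment to plot
```

Once a list of "Ejerudgift" values has been extracted, it can be passed to the same `scatter` helper together with `per_sqm`.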
> **Supercharge:** Crawl over pages of Boliga to collect this data for the entire borough of Nørrebro. Or all of Copenhagen!
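For the crawl, the page URLs can be generated from the one used in Ex. 3. This sketch assumes that later result pages are reached simply by incrementing the `page` parameter, and that the number of pages is known or found by inspection; `page_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.boliga.dk/resultat"

def page_urls(zip_code, n_pages, property_type=3):
    """Build result-page URLs for pages 1..n_pages of one zip code."""
    urls = []
    for page in range(1, n_pages + 1):
        query = urlencode({
            "propertyType": property_type,
            "zipCodes": zip_code,
            "page": page,
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls

urls = page_urls(2200, n_pages=3)
```

Each URL can then be fetched with `requests.get(url).text` and parsed as in Ex. 5, ideally with a short pause between requests.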