# Parse a website

Now we want to put together what we have learned and parse a website!

## Example idea
We want to search for a new flat in Konstanz. <br>
Therefore, we need all available exposes matching our search criteria, stored in a pandas table,
so that we can later run the script every **X** minutes to check for new exposes and send ourselves a mail.


### whole workflow for the idea
1. choose a website
2. find the whole query to the website
3. scan the website for exposes
4. store them in a pandas table
5. load a pandas table of the old search
6. compare the new search to the old
7. if we have new exposes:
    - send a mail with the overview of the new exposes
    - extra: send a list of details for the new exposes
8. save the new table
9. have the script called automatically every **X** minutes

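The workflow above can be sketched as a single function. This is only a minimal outline under assumptions: `run_once` and its four callables (`scrape`, `load_old`, `save`, `notify`) are hypothetical names that are not used elsewhere in this notebook, and the listings are represented as a plain dict mapping expose id to details instead of a pandas table. The usage part replaces the website, the file and the mail with in-memory stand-ins.

```python
def run_once(scrape, load_old, save, notify):
    """Run one search cycle and return the ids of the new exposes.

    All four arguments are hypothetical callables supplied by the caller:
    scrape()   -> dict {expose_id: details} for the current search
    load_old() -> dict {expose_id: details} from the previous run
    save(d)    -> stores the current search for the next run
    notify(d)  -> e.g. sends a mail about the new exposes
    """
    current = scrape()
    previous = load_old()
    # an expose is new if its id was not part of the previous search
    new_ids = [expose for expose in current if expose not in previous]
    if new_ids:
        notify({expose: current[expose] for expose in new_ids})
    save(current)
    return new_ids


# usage with in-memory stand-ins instead of a real website, file and mail
sent = []
stored = {}
new = run_once(
    scrape=lambda: {1: "flat A", 2: "flat B"},
    load_old=lambda: {1: "flat A"},
    save=stored.update,
    notify=sent.append,
)
print(new)  # [2]
```
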
### our Tasks for today
1. choose a website
2. find the whole query to the website
3. scan the website for exposes
4. store them in a pandas table
5. recover them from a pandas table
6. compare the new search to the old

## Choose a website
As an example we take: https://www.immobilienscout24.de/

## find the whole query to the website
Let's fill out the search form to look for a flat matching our conditions.
- Location: `Konstanz`
- Price: `< 500`

Example:<br>
https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Konstanz-Kreis/-/-/-/EURO--500,00

Then copy the URL from the address bar of the browser.

In [1]:
# save the URL
url = "https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Konstanz-Kreis/-/-/-/EURO--500,00"

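To see where the search criteria ended up, the URL can be taken apart with the standard library. The sketch below only shows that the criteria are part of the path and not of a query string. Which segment stands for which criterion is an observation from this single URL, not a documented scheme of the website.

```python
from urllib.parse import urlsplit

url = "https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Konstanz-Kreis/-/-/-/EURO--500,00"

parts = urlsplit(url)
# drop the empty string in front of the leading "/"
segments = parts.path.split("/")[1:]

print(parts.scheme)   # https
print(parts.netloc)   # www.immobilienscout24.de
print(parts.query)    # empty: nothing follows a "?" in this URL
print(segments[4])    # Konstanz-Kreis
print(segments[-1])   # EURO--500,00
```
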
To download the website we use the package `urllib3`.

Because the website uses the `HTTPS` protocol, `urllib3` has to verify the server's certificate. For this it needs a bundle of trusted root certificates, which the package `certifi` provides.

Let's put things together.

In [2]:
import urllib3
import certifi

In [3]:
# first we have to set up a Manager for the Website
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

In [4]:
# now we have to request the content of the website using a GET request
r = http.request('GET', url)

In [5]:
# r.data is a bytes object, so we decode it into a string (the page declares UTF-8)
html = r.data.decode('utf-8')

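Before decoding, it can be worth checking that the request actually succeeded. The following is a minimal sketch, assuming that only the `status` and `data` attributes of the response are needed. `decode_body` is a hypothetical helper, and a stand-in object replaces the real response so that the example runs without network access.

```python
from types import SimpleNamespace


def decode_body(response, encoding="utf-8"):
    """Return the response body as text, or raise if the request failed.

    Hypothetical helper: it only relies on the `status` and `data`
    attributes of the response object.
    """
    if response.status != 200:
        raise RuntimeError(f"request failed with HTTP status {response.status}")
    return response.data.decode(encoding)


# stand-in object so that the sketch runs offline;
# in the notebook one would call decode_body(r) instead
fake = SimpleNamespace(
    status=200,
    data="<title>Wohnungen für Konstanz</title>".encode("utf-8"),
)
print(decode_body(fake))  # <title>Wohnungen für Konstanz</title>
```
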
In [6]:
# let's print the first 200 characters
print(html[:200])















<!doctype html>
<html lang="de">
<head>
  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/>
  <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
  <meta n


As you can see we got the whole html code.

A good way to parse html code is to use `BeautifulSoup` from the `bs4` package.

This allows us to easily search the code for specific HTML tags.

In [7]:
#@solution
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

As an example, let's find the head of the HTML document.

It's between the tags `<head>` and `</head>`.

In [8]:
#@solution
soup.find('head')

<head>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="none" name="msapplication-config"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1" name="viewport"/>
<link href="/Suche/resources/manifest.json" rel="manifest"/>
<title>Mietwohnungen Konstanz (Kreis): Wohnungen mieten in Konstanz (Kreis) bei Immobilien Scout24</title>
<meta content="index, follow" name="robots"/>
<meta content="Konstanz (Kreis): Mietwohnungen in Konstanz (Kreis). Bei Immobilien Scout24 finden Sie passende Angebote zu Wohnungen mieten oder Mietwohnung in Konstanz (Kreis)." name="description"/>
<meta content="Mietwohnungen Konstanz (Kreis), Wohnungen mieten Konstanz (Kreis)" name="keywords"/>
<link href="https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Konstanz-Kreis/-/-/-/EURO--500,00" rel="canonica

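Once an element is found, the same search methods work on that element, and attributes can be read like dictionary entries. The sketch below uses a shortened stand-in for the downloaded page. The variable names `sample` and `soup_sample` are made up for this example.

```python
from bs4 import BeautifulSoup

# shortened stand-in for the downloaded page
sample = """
<html>
<head>
  <title>Mietwohnungen Konstanz (Kreis)</title>
  <meta content="index, follow" name="robots"/>
</head>
<body><p>...</p></body>
</html>
"""

soup_sample = BeautifulSoup(sample, "html.parser")

head = soup_sample.find("head")
# search inside the <head> element only
print(head.find("title").text)                            # Mietwohnungen Konstanz (Kreis)
# attributes are read with square brackets
print(head.find("meta", {"name": "robots"})["content"])   # index, follow
```
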
Now we want to find all entries of the exposes.
For this we have to know how they are stored in the html file.

### Option 1
- look for a catchy word in the caption of an expose
- search for it in the source code of the website (in the browser: right-click, then `show source code` or `view page source`)
- work through the text until you find the tags / the pattern the website is built from

### Option 2 (better!)
- use the `inspect` tool of your browser (if it has one!)
    - Chrome: `inspect` (right-click or `CTRL + SHIFT + I`)
    - Firefox: `inspect Element` (right-click)
- move your mouse in the `Elements` tab over the elements and see which are highlighted
- find the element associated with the box of the expose

### box expose

Each expose is stored in an `<li>` element with `class="result-list__listing "`. (Note the trailing space!)

Let's search for these elements!
This time we want all of them, so we use `.find_all` instead of `.find`.

In [9]:
#@solution
entries = soup.find_all('li', {'class': "result-list__listing "})

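The difference between `.find` and `.find_all` can be tried out on a small made-up list. The sketch also uses the `class_` keyword of `BeautifulSoup`, which matches a single CSS class name, so the search does not depend on how whitespace inside the `class` attribute is handled. The markup below is invented and only borrows the class name from the website.

```python
from bs4 import BeautifulSoup

# made-up list with the same class name as on the website
sample = """
<ul>
  <li class="result-list__listing " data-id="1">first</li>
  <li class="result-list__listing " data-id="2">second</li>
  <li class="banner">advertisement</li>
</ul>
"""
soup_sample = BeautifulSoup(sample, "html.parser")

# .find returns the first match only, .find_all a list of all matches
first = soup_sample.find("li", class_="result-list__listing")
every = soup_sample.find_all("li", class_="result-list__listing")

print(first["data-id"])                   # 1
print([li["data-id"] for li in every])    # ['1', '2']
```
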
Let's have a look into the first entry.

In [10]:
#@solution
entry = entries[0]
print(entry)

<li class="result-list__listing " data-id="107798016"><div><article class="result-list-entry result-list-entry--s" data-item="result" data-listing-size="S" data-obid="107798016" id="result-107798016"><button aria-label="Ausblenden" class="button-reset result-list-entry__close-button link-internal"><span class="palm-hide fa fa-times"></span><div class="lap-hide desk-hide close-x align-center"><span class="fa fa-times font-white"></span></div></button><div class="grid grid-flex"><div aria-hidden="true" class="grid-item result-list-entry__gallery-container" style="position:relative"><div class="gallery-responsive"><div style="padding-top:75%"></div><div class="gallery-container"><a data-go-to-expose-id="107798016" data-go-to-expose-referrer="RESULT_LIST_LISTING" data-go-to-expose-searchtype="district" href="/expose/107798016"><span class="slick-bg-layer"></span><img alt="Immobilienbild" class="gallery__image block height-full" src="https://pictures.immobilienscout24.de/listings/f64bf49f-6

That is still a lot of text, so let's narrow it down.
What we can see is that the `<li>` element itself carries an attribute called `data-id`.
This seems to be the **unique** id of the expose.

In [11]:
#@solution
expose = int(entry['data-id'])
expose

107798016

If we **inspect** the element further with our browser, we can see that the **title** is stored in an `<h5>` element (**h**eadline of level 5).

Let's find it and get the text it encloses.

To do so, we search the `entry` for the `<h5>` tag.

In [12]:
#@solution
entry.find('h5')

<h5 class="result-list-entry__brand-title font-h6 onlyLarge nine-tenths font-ellipsis"><span class="result-list-entry__new-flag margin-right-xs">NEU</span>Moderne 1-Zimmer-Wohnung in bevorzugter Wohnlage von Konstanz-Allmannsdorf</h5>

We get the text with `.text`. Note that the text of the nested `<span>` (the `NEU` flag) is included as well.

In [13]:
#@solution
title = entry.find('h5').text
title

'NEUModerne 1-Zimmer-Wohnung in bevorzugter Wohnlage von Konstanz-Allmannsdorf'

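If the `NEU` flag should not end up in the title, one option is to collect only the strings that are direct children of the `<h5>` element. This is a sketch, not part of the original solution. It uses the `<h5>` element shown above with shortened class attributes, and `clean_title` and `is_new` are names made up for this example.

```python
from bs4 import BeautifulSoup

# the <h5> element from above, with the class attributes shortened
sample = (
    '<h5 class="result-list-entry__brand-title">'
    '<span class="result-list-entry__new-flag">NEU</span>'
    'Moderne 1-Zimmer-Wohnung in bevorzugter Wohnlage von Konstanz-Allmannsdorf'
    '</h5>'
)
h5 = BeautifulSoup(sample, "html.parser").find("h5")

# recursive=False restricts the search to the direct children of <h5>,
# so the string inside the nested <span> is skipped
clean_title = "".join(h5.find_all(string=True, recursive=False)).strip()
# keep the information of the flag as a separate boolean
is_new = h5.find("span", class_="result-list-entry__new-flag") is not None

print(clean_title)  # Moderne 1-Zimmer-Wohnung in bevorzugter Wohnlage von Konstanz-Allmannsdorf
print(is_new)       # True
```
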
We can do the same for the address by chaining several `.find` calls.

In [14]:
address = entry.find('div', {'class': "result-list-entry__address"}).find('button').find('div').text

The remaining information is stored in `<dl>` elements. Let's collect them and read each value from its `<dd>` element.

In [15]:
infos = entry.find_all('dl')
price = float(infos[0].find('dd').text.split()[0].replace(',', '.'))
space = float(infos[1].find('dd').text.split()[0].replace(',', '.'))
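# int() only works while the room count is a whole number;
# half rooms such as '1,5' need float() with the comma replaced, as for price and space
rooms = int(infos[2].find('dd').text.split()[0])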

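The conversion above replaces the decimal comma, but it would fail for a value with a thousands separator. A small helper can cover both cases. This is a sketch under assumptions: `parse_german_number` is a hypothetical name, and the example strings are made up to mimic the number format, since the exact text of the `<dd>` elements is not shown here.

```python
def parse_german_number(text):
    """Convert text such as '1.200,50 €' to the float 1200.5.

    Hypothetical helper. It assumes that the number is the first
    whitespace-separated token, that '.' is only used as thousands
    separator and that ',' is the decimal separator.
    """
    number = text.split()[0]
    return float(number.replace(".", "").replace(",", "."))


print(parse_german_number("450 €"))        # 450.0
print(parse_german_number("40,22 m²"))     # 40.22
print(parse_german_number("1.200,50 €"))   # 1200.5
print(parse_german_number("1,5"))          # 1.5
```
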
Let's store them in a dict so everything is ordered.

In [16]:
result = dict(
    title=title,
    expose=expose,
    address=address,
    price_cold=price,
    space=space,
    rooms=rooms,
)
result

{'title': 'NEUModerne 1-Zimmer-Wohnung in bevorzugter Wohnlage von Konstanz-Allmannsdorf',
 'expose': 107798016,
 'address': 'Konstanz, Konstanz (Kreis)',
 'price_cold': 450.0,
 'space': 31.0,
 'rooms': 1}

To do this for all entries, we put the steps into a loop. Here `rooms` is parsed as a `float`, because values such as `1,5` occur.

In [17]:
results = []
for entry in entries:
    expose = int(entry['data-id'])

    title = entry.find('h5').text

    address = entry.find('div', {'class': "result-list-entry__address"}).find('button').find('div').text

    infos = entry.find_all('dl')
    price = float(infos[0].find('dd').text.split()[0].replace(',', '.'))
    space = float(infos[1].find('dd').text.split()[0].replace(',', '.'))
    rooms = float(infos[2].find('dd').text.split()[0].replace(',', '.'))

    results.append(dict(
        title=title,
        expose=expose,
        address=address,
        price_cold=price,
        space=space,
        rooms=rooms,
    ))

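The loop stops with an error as soon as one entry lacks an expected element. One way to make it more tolerant is to move the parsing into a function that returns `None` for incomplete entries. The following is a sketch under assumptions: `parse_entry` is a hypothetical name, the markup is made up and only mimics the structure used above, the address is left out for brevity, and only a missing `<h5>` or fewer than three `<dl>` elements are handled.

```python
from bs4 import BeautifulSoup


def parse_entry(entry):
    """Return a dict for one <li> entry, or None if a required part is missing."""
    title_tag = entry.find("h5")
    infos = entry.find_all("dl")
    if title_tag is None or len(infos) < 3:
        return None

    def number(dl):
        # first token of the <dd> text, with the decimal comma turned into a point
        return float(dl.find("dd").text.split()[0].replace(",", "."))

    return dict(
        title=title_tag.text,
        expose=int(entry["data-id"]),
        price_cold=number(infos[0]),
        space=number(infos[1]),
        rooms=number(infos[2]),
    )


# made-up markup that only mimics the structure used above
sample = """
<ul>
  <li class="result-list__listing" data-id="1">
    <h5>Flat A</h5>
    <dl><dd>450 €</dd></dl><dl><dd>31 m²</dd></dl><dl><dd>1,5</dd></dl>
  </li>
  <li class="result-list__listing" data-id="2"><h5>Incomplete entry</h5></li>
</ul>
"""
entries_sample = BeautifulSoup(sample, "html.parser").find_all(
    "li", class_="result-list__listing"
)

parsed = [parse_entry(entry) for entry in entries_sample]
# keep only the entries that could be parsed completely
results_sample = [item for item in parsed if item is not None]
print(results_sample)
```
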
We can now store the results in a `pandas` `DataFrame`.

In [18]:
#@solution
import pandas as pd

In [19]:
#@solution
df = pd.DataFrame(results).set_index('expose')

In [20]:
#@solution
df

| expose | address | price_cold | rooms | space | title |
|---|---|---|---|---|---|
| 107798016 | Konstanz, Konstanz (Kreis) | 450.0 | 1.0 | 31.0 | NEUModerne 1-Zimmer-Wohnung in bevorzugter Woh... |
| 107796971 | Feldbergstr. 52, Singen (Hohentwiel), Konstanz... | 245.0 | 1.0 | 40.22 | NEUSeniorenwohnung mit Betreuung durch das DRK |
| 107597166 | Konstanz, Konstanz (Kreis) | 400.0 | 1.0 | 40.0 | NEUAll Inklusive》ZWISCHENMIETE《teilmöbilierte ... |
| 107471975 | Zähringerplatz 8, Konstanz, Konstanz (Kreis) | 350.0 | 1.0 | 43.0 | Freundliche 1-Zimmer-Wohnung mit Balkon und EB... |
| 99491471 | Birnaublich 19, Konstanz, Konstanz (Kreis) | 500.0 | 1.5 | 43.0 | +++Single- Appartement in Wallhausen +++ |
| 95942599 | Röschberg 5, Hohenfels, Konstanz (Kreis) | 340.0 | 1.0 | 30.0 | Schöne 1 Zimmer Wohnung in Hohenfels |
| 81638413 | Singen (Hohentwiel), Konstanz (Kreis) | 450.0 | 2.0 | 44.84 | NEUModerne 2 -Zimmerwohnung mit Balkon in Singen |
| 100903160 | Gartenstraße 3, Königsfeld im Schwarzwald, Sch... | 220.0 | 1.0 | 24.8 | NEUhelle 1-Zimmer-Wohnung in Königsfeld |
| 101871233 | Panoramastraße 73, Oberteuringen, Bodenseekreis | 450.0 | 2.0 | 49.0 | NEUMöblierte 2-Zimmer-Wohnung mit Balkon im Fe... |
| 107646969 | Furtwangen im Schwarzwald, Schwarzwald-Baar-Kreis | 450.0 | 3.0 | 80.0 | NEUHelle 3-Zimmer-DG-Wohnung in Furtwangen-Sch... |


Let's save the pandas table to a file, so we can restore it in the next run. We save it in `JSON` format with `df.to_json`.

In [21]:
#@solution
df.to_json('mytable.json', orient='columns')

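To load the results again we use `pd.read_json`. With `convert_axes=False` the index labels are read back as strings, so we convert them to integers and set the index name again.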

In [22]:
df_old = pd.read_json(
    'mytable.json',
    orient='columns',
    convert_dates=False,  # don't convert columns to dates
    convert_axes=False,  # don't convert the axes (index and column labels)
)
df_old.index = df_old.index.astype(int)
df_old.index.name = 'expose'
df_old

| expose | address | price_cold | rooms | space | title |
|---|---|---|---|---|---|
| 100903160 | Gartenstraße 3, Königsfeld im Schwarzwald, Sch... | 220.0 | 1.0 | 24.8 | NEUhelle 1-Zimmer-Wohnung in Königsfeld |
| 101871233 | Panoramastraße 73, Oberteuringen, Bodenseekreis | 450.0 | 2.0 | 49.0 | NEUMöblierte 2-Zimmer-Wohnung mit Balkon im Fe... |
| 107471975 | Zähringerplatz 8, Konstanz, Konstanz (Kreis) | 350.0 | 1.0 | 43.0 | Freundliche 1-Zimmer-Wohnung mit Balkon und EB... |
| 107597166 | Konstanz, Konstanz (Kreis) | 400.0 | 1.0 | 40.0 | NEUAll Inklusive》ZWISCHENMIETE《teilmöbilierte ... |
| 107646969 | Furtwangen im Schwarzwald, Schwarzwald-Baar-Kreis | 450.0 | 3.0 | 80.0 | NEUHelle 3-Zimmer-DG-Wohnung in Furtwangen-Sch... |
| 107780556 | Schwenninger Str. 11/2, Villingen-Schwenningen... | 260.0 | 1.0 | 28.0 | NEUGepflegte 1-Zimmer-DG-Wohnung mit Balkon un... |
| 107796971 | Feldbergstr. 52, Singen (Hohentwiel), Konstanz... | 245.0 | 1.0 | 40.22 | NEUSeniorenwohnung mit Betreuung durch das DRK |
| 107798016 | Konstanz, Konstanz (Kreis) | 450.0 | 1.0 | 31.0 | NEUModerne 1-Zimmer-Wohnung in bevorzugter Woh... |
| 107804050 | Sophienstraße 27, Villingen-Schwenningen, Schw... | 400.0 | 1.0 | 32.0 | NEU1 Zimmer Appartment Schwenningen |
| 107806900 | Adlerweg 9, Villingen-Schwenningen, Schwarzwal... | 500.0 | 2.5 | 54.0 | NEUFreundliche 2,5-Zimmer-EG-Wohnung mit Balko... |

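The round trip through `JSON` can be tried out without touching a file. The sketch below is a minimal example on a made-up two-row table. It relies on `to_json` returning the JSON text when no path is given, and it wraps that text in `StringIO` so that `read_json` can read it like a file. `df_small` and `restored` are names chosen for this example.

```python
from io import StringIO

import pandas as pd

df_small = pd.DataFrame(
    {"price_cold": [450.0, 245.0], "rooms": [1.0, 1.0]},
    index=pd.Index([107798016, 107796971], name="expose"),
)

# without a path, to_json returns the JSON text instead of writing a file
text = df_small.to_json(orient="columns")

restored = pd.read_json(StringIO(text), orient="columns", convert_axes=False)
print(restored.index.tolist())   # the ids come back as strings

# restore the integer ids and the index name
restored.index = restored.index.astype(int)
restored.index.name = "expose"
print(restored.loc[107798016, "price_cold"])   # 450.0
```
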

## find new entries

<div class='alert alert-block alert-info'>


<ul>
    <li>Load the file `mytable_old.json` (created yesterday).</li>
    <li>Write a function that compares the two pandas DataFrames to find new entries.</li>
</ul>

</div>

In [23]:
#@solution
df_old = pd.read_json(
    'mytable_old.json',
    orient='columns',
    convert_dates=False,  # don't convert columns to dates
    convert_axes=False,  # don't convert the axes (index and column labels)
)
df_old.index = df_old.index.astype(int)
df_old.index.name = 'expose'
df_old

| expose | address | price_cold | rooms | space | title |
|---|---|---|---|---|---|
| 100903160 | Gartenstraße 3, Königsfeld im Schwarzwald, Sch... | 220.0 | 1.0 | 24.8 | NEUhelle 1-Zimmer-Wohnung in Königsfeld |
| 101871233 | Panoramastraße 73, Oberteuringen, Bodenseekreis | 450.0 | 2.0 | 49.0 | NEUMöblierte 2-Zimmer-Wohnung mit Balkon im Fe... |
| 107471975 | Zähringerplatz 8, Konstanz, Konstanz (Kreis) | 350.0 | 1.0 | 43.0 | Freundliche 1-Zimmer-Wohnung mit Balkon und EB... |
| 107597166 | Konstanz, Konstanz (Kreis) | 400.0 | 1.0 | 40.0 | NEUAll Inklusive》ZWISCHENMIETE《teilmöbilierte ... |
| 107646969 | Furtwangen im Schwarzwald, Schwarzwald-Baar-Kreis | 450.0 | 3.0 | 80.0 | NEUHelle 3-Zimmer-DG-Wohnung in Furtwangen-Sch... |
| 107765884 | Schwarzwaldstrasse X, Schönwald im Schwarzwald... | 150.0 | 1.0 | 25.0 | NEU1-Zi Whg SCHÖNWALD neben Furtwangen |
| 107780556 | Schwenninger Str. 11/2, Villingen-Schwenningen... | 260.0 | 1.0 | 28.0 | NEUGepflegte 1-Zimmer-DG-Wohnung mit Balkon un... |
| 107796971 | Feldbergstr. 52, Singen (Hohentwiel), Konstanz... | 245.0 | 1.0 | 40.22 | NEUSeniorenwohnung mit Betreuung durch das DRK |
| 107798016 | Konstanz, Konstanz (Kreis) | 450.0 | 1.0 | 31.0 | NEUModerne 1-Zimmer-Wohnung in bevorzugter Woh... |
| 107804050 | Sophienstraße 27, Villingen-Schwenningen, Schw... | 400.0 | 1.0 | 32.0 | NEU1 Zimmer Appartment Schwenningen |


In [24]:
#@solution
def check_for_new(df, df_old):
    """
    Return the rows of `df` whose expose id does not appear in `df_old`.
    """

    # get the index of the new expose
    list_new = [expose for expose in df.index if expose not in df_old.index]
    # create sub pandas frame
    df_new = df.loc[list_new]
    return df_new

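The same comparison can also be written without an explicit loop by using `Index.isin`. This is an alternative sketch, not the solution of the notebook. The two small tables are made up, and the second selection shows the opposite direction, that is, listings that have disappeared since the old search.

```python
import pandas as pd

old = pd.DataFrame(
    {"price_cold": [450.0, 245.0]},
    index=pd.Index([107798016, 107796971], name="expose"),
)
new = pd.DataFrame(
    {"price_cold": [450.0, 370.0]},
    index=pd.Index([107798016, 107864970], name="expose"),
)

# rows of `new` whose id is not in the old index
added = new.loc[~new.index.isin(old.index)]
# the same idea in the other direction: listings that have disappeared
removed = old.loc[~old.index.isin(new.index)]

print(added.index.tolist())     # [107864970]
print(removed.index.tolist())   # [107796971]
```
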
In [25]:
#@solution
check_for_new(df, df_old)

| expose | address | price_cold | rooms | space | title |
|---|---|---|---|---|---|
| 107864970 | Freiligrathstraße 4, Friedrichshafen, Bodensee... | 370.0 | 1.0 | 33.0 | NEUNeuwertige 1-Zimmer-Wohnung mit Balkon und ... |
| 107864877 | Friedenstraße 7, Villingen-Schwenningen, Schwa... | 280.0 | 2.0 | 36.86 | NEUDas Glück hat ein Zuhause: individuelle 2-Z... |
