# Parse a website

Now let's put together what we have learned and parse a website!

## Example idea
We want to search for a new flat in Konstanz. <br>
To do so, we need all available exposes matching our search criteria, saved in a pandas table,
so that we can later run the script every **X** minutes to check for new exposes and send ourselves a mail.


### whole workflow for the idea
1. choose a website
2. find the whole query to the website
3. scan the website for exposes
4. store them in a pandas table
5. load a pandas table of the old search
6. compare the new search to the old
7. if we have new exposes:
    - send a mail with the overview of the new exposes
    - extra: send a list of details for the new exposes
8. save the new table
9. have the script called automatically every **X** minutes

### our Tasks for today
1. choose a website
2. find the whole query to the website
3. scan the website for exposes
4. store them in a pandas table
5. recover them from a pandas table
6. compare the new search to the old

## Choose a website
As an example we take: https://www.immobilienscout24.de/

## find the whole query to the website
Let's fill out the search form to look for a flat matching our conditions.
- Location: `Konstanz`
- Price: `< 500`

Run the search and copy the resulting URL, for example:<br>
https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Konstanz-Kreis/-/-/-/EURO--500,00

In [None]:
# save the URL
url = "https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Konstanz-Kreis/-/-/-/EURO--500,00"

To scan the website we use the package `urllib3`.

Because the website uses the `HTTPS` protocol, we also need a certificate bundle to verify the connection. The package `certifi` provides one.

Let's put things together.

In [None]:
import urllib3
import certifi

In [None]:
# first we have to set up a PoolManager that verifies HTTPS certificates
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

In [None]:
# now we have to request the content of the website using a GET request
r = http.request('GET', url)

In [None]:
# r.data is a byte string, so we decode it to text (assuming the page is UTF-8)
html = r.data.decode('utf-8')

In [None]:
# let's print the first 200 characters
print(html[:200])
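The cell above assumes that the page is encoded as UTF-8. As a minimal sketch, the charset could instead be read from the `Content-Type` response header, falling back to UTF-8 when none is given. The helper names `charset_from_content_type` and `decode_body` are hypothetical and not part of `urllib3`; the sample bytes are made up so the example runs without network access.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, else `default`."""
    # Parameters follow the media type, separated by ';',
    # e.g. "text/html; charset=utf-8"
    for part in content_type.split(";")[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"').lower()
    return default


def decode_body(data, content_type=""):
    """Decode raw response bytes to str using the charset from the header."""
    encoding = charset_from_content_type(content_type)
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back to UTF-8, replacing bad bytes
        return data.decode("utf-8", errors="replace")


# Offline demonstration with hand-made bytes
sample = "Seestraße 5, Konstanz".encode("latin-1")
print(decode_body(sample, "text/html; charset=ISO-8859-1"))
```

With the response object from above, the call would probably look like `decode_body(r.data, r.headers.get('Content-Type', ''))`.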

Now we want to find all the expose entries.
For this we have to know how they are stored in the HTML.

### Option 1
- look for a catchy word in the caption of an expose.
- search for it in the source code of the website (in the browser: right click, then `show source code` or `view page source`)
- work through the text until you find the tags or pattern the page is built from

### Option 2 (better!)
- use the `inspect` tool of your browser (if it has one!)
    - Chrome: `inspect` (right click or `CTRL + SHIFT + I`)
    - Firefox: `inspect Element` (right click)
- move your mouse in the `Elements` tab over the elements and see which are highlighted
- find the element associated with the box of the expose

### box expose

Let's have a look at the first entry.

Still a lot of text. Maybe we can refine it further.
What we see is that it has an attribute called `data-id` in its first tag.
This seems to be the **unique** id for the expose.
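A minimal sketch of how the entries, the id and the title could be extracted, assuming the HTML is parsed with `BeautifulSoup` (package `beautifulsoup4`), which the `find` and `find_all` calls in this section suggest. The HTML snippet and the class name `result-list__listing` are made up for illustration; the markup of the real page has to be inspected as described above and may differ.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded page (the variable `html` in the notebook)
sample_html = """
<ul>
  <li class="result-list__listing" data-id="101">
    <h5>Bright 2-room flat</h5>
  </li>
  <li class="result-list__listing" data-id="102">
    <h5>Studio near the lake</h5>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# One element per expose box; the class name is an assumption
entries = soup.find_all("li", {"class": "result-list__listing"})

entry = entries[0]
expose = int(entry["data-id"])          # attribute access works like a dict
title = entry.find("h5").text.strip()   # text of the first <h5> inside the box

print(len(entries), expose, title)
```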

We can get the address the same way, using nested `find` calls.

In [None]:
address = entry.find('div', {'class': "result-list-entry__address"}).find('button').find('div').text

All the information is stored in `<dl>` elements. Let's select them and then read the values, each of which is stored in a `<dd>` element.

In [None]:
infos = entry.find_all('dl')
price = float(infos[0].find('dd').text.split()[0].replace(',', '.'))
space = float(infos[1].find('dd').text.split()[0].replace(',', '.'))
rooms = float(infos[2].find('dd').text.split()[0].replace(',', '.'))  # float, because values like "2,5" rooms occur
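Replacing only the comma works as long as the number has no thousands separator. A minimal sketch of a more tolerant helper, assuming German number formatting (`.` for thousands, `,` for decimals) and that the number is the first whitespace-separated token; the name `parse_german_number` is hypothetical.

```python
def parse_german_number(text):
    """Convert text such as '1.234,56 €' to the float 1234.56."""
    number = text.split()[0]           # drop the unit, e.g. '€' or 'm²'
    number = number.replace(".", "")   # remove thousands separators
    number = number.replace(",", ".")  # decimal comma -> decimal point
    return float(number)


print(parse_german_number("450 €"))
print(parse_german_number("1.234,56 €"))
print(parse_german_number("2,5"))
```

Note that this helper would misread a value written with a decimal point (for example `54.5 m²`), so it only fits pages that use the German format consistently.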

Let's store them in a dict so everything is ordered.

In [None]:
result = dict(
    title=title,
    expose=expose,
    address=address,
    price_cold=price,
    space=space,
    rooms=rooms,
)
result

If we want to do it for all our entries we can put it into a loop.

In [None]:
results = []
for entry in entries:
    expose = int(entry['data-id'])

    title = entry.find('h5').text

    address = entry.find('div', {'class': "result-list-entry__address"}).find('button').find('div').text

    infos = entry.find_all('dl')
    price = float(infos[0].find('dd').text.split()[0].replace(',', '.'))
    space = float(infos[1].find('dd').text.split()[0].replace(',', '.'))
    rooms = float(infos[2].find('dd').text.split()[0].replace(',', '.'))

    results.append(dict(
        title=title,
        expose=expose,
        address=address,
        price_cold=price,
        space=space,
        rooms=rooms,
    ))
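The list of dicts can then be turned into a pandas table and written to disk. A minimal sketch, assuming the expose id serves as the index and the file is written as JSON with `orient='columns'`, which would match the `pd.read_json` call used for loading; the function `save_results` and the sample data are hypothetical.

```python
import os
import tempfile

import pandas as pd


def save_results(results, path):
    """Store a list of result dicts as a JSON table indexed by expose id."""
    df = pd.DataFrame(results).set_index("expose")
    df.to_json(path, orient="columns")
    return df


# Made-up sample data in the same shape as `results` from the loop above
sample_results = [
    dict(title="Bright 2-room flat", expose=101, address="Konstanz",
         price_cold=450.0, space=48.5, rooms=2.0),
    dict(title="Studio near the lake", expose=102, address="Konstanz",
         price_cold=390.0, space=30.0, rooms=1.0),
]

# Written to a temporary folder here; in the notebook the path
# could simply be 'mytable.json'
path = os.path.join(tempfile.mkdtemp(), "mytable.json")
df = save_results(sample_results, path)
print(df)
```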

To load the results again, we can use `pd.read_json`:

In [None]:
import pandas as pd

df_old = pd.read_json('mytable.json',
                      orient='columns',
                      convert_dates=False,  # don't convert columns to dates
                      convert_axes=False,  # don't convert the index and column labels
                      )
df_old.index = df_old.index.astype(int)  # expose ids are read as strings
df_old.index.name = 'expose'
df_old

## find new entries

<div class='alert alert-block alert-info'>


<ul>
    <li>Load the file `mytable_old.json` (created yesterday).</li>
    <li>Write a function that compares the two pandas DataFrames to find new entries.</li>
</ul>

</div>
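One possible approach, as a minimal sketch: if both tables use the expose id as index, the new entries are the rows whose id appears in the new table but not in the old one. The function name `find_new_entries` and the two small sample tables are made up for illustration.

```python
import pandas as pd


def find_new_entries(df_new, df_old):
    """Return the rows of df_new whose index (expose id) is missing in df_old."""
    new_ids = df_new.index.difference(df_old.index)
    return df_new.loc[new_ids]


# Made-up tables standing in for yesterday's and today's search
df_old = pd.DataFrame(
    {"title": ["Flat A", "Flat B"], "price_cold": [450.0, 390.0]},
    index=[101, 102],
)
df_new = pd.DataFrame(
    {"title": ["Flat B", "Flat C"], "price_cold": [390.0, 480.0]},
    index=[102, 103],
)

print(find_new_entries(df_new, df_old))
```

Comparing only the index ignores changes to existing exposes, such as a lowered price; detecting those would need an additional comparison of the shared rows.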