# Example scraping notebook

This notebook is a draft you can base your scraping project on. We scrape property listings in Iceland from mbl.is.

We only scrape properties listed in the last 24 hours. In the first cell we use the requests library to submit the search form you can find [here](https://www.mbl.is/fasteignir/). The server answers with a redirect to the results page; requests follows it, so `r.url` holds the URL of the search results.

In [1]:
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {"newtoday": "newtoday"}  # To change the search criteria, specify them here

# Submit the search form; requests follows the redirect to the results page
r = requests.post('https://www.mbl.is/fasteignir/query/', data=payload, headers=headers)
url = r.url  # URL of the results page after the redirect

With the result of the search we can start scraping using the URL we just acquired. For the sake of simplicity, we use the Gazpacho library.

In the following cell we collect the URL of each listing by going through the results page by page.

We use the tqdm library to see the progress more easily.

In [5]:
from gazpacho import get, Soup
from tqdm.notebook import tqdm

new_houses = []

html = get(url)
soup = Soup(html)

# The last word of the pagination info text is the total number of pages
page_info = soup.find('div', {'class': 'pagination'}).find('div', {'class': 'info'}).text
n_pages = int(page_info.split(" ")[-1])

for i in tqdm(range(n_pages)):
    if i != 0:
        # Follow the "next" link to load the following results page
        next_page = soup.find('span', {'class': 'next'}).find('a').attrs['href']
        url = "https://www.mbl.is" + next_page
        html = get(url)
        soup = Soup(html)

    # Each listing sits in a div whose id contains "realestate-result-"
    houses = soup.find('div', {'id': 'realestate-result-'}, partial=True)
    for house in houses:
        head = house.find('div', {'class': 'realestate-head'})
        house_url = head.find('a').attrs['href']  # relative link to the listing page
        house_id = house_url.split("/")[-2]
        house_name = head.find('h4').text

        new_houses.append({"url": house_url, "house_id": house_id, "house_name": house_name})
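
The string handling in the cell above (prefixing `https://www.mbl.is` to a relative link and taking a path segment as the listing ID) could also be done with the standard library. The following is a minimal sketch, not part of the original notebook: the helper names `absolute_url` and `listing_id` and the example href are made up for illustration. Unlike `split("/")[-2]`, this version does not depend on the link ending in a slash.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.mbl.is"


def absolute_url(href: str) -> str:
    """Resolve a relative href against the site root; absolute URLs pass through."""
    return urljoin(BASE_URL, href)


def listing_id(href: str) -> str:
    """Return the last non-empty path segment of a listing href."""
    return href.rstrip("/").rsplit("/", 1)[-1]


# Hypothetical href, invented for this example only
example_href = "/fasteignir/fasteign/123456/"
print(absolute_url(example_href))
print(listing_id(example_href))
```
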

After collecting the links for all the new listings we scrape each page separately.

In [6]:
for house in tqdm(new_houses):
    url = "https://www.mbl.is" + house["url"]
    html = get(url)
    soup = Soup(html)
    
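    # Rows of the table inside the "numbers" div
    info_table = soup.find('div',{'class':'numbers'}).find('tr')

    # The first row holds the agency (fasteignasala) logo; use its alt text if present
    try:
        house['fasteignasala'] = info_table[0].find('img').attrs['alt']
    except (AttributeError, IndexError, KeyError, TypeError):
        house['fasteignasala'] = None

    # Store the remaining rows, except the last one, under their row text
    for row in info_table[1:-1]:
        label_text = row.text
        value = row.find('td',{'class':'value'}).text
        house[label_text] = value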
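
The loop above requests every listing page back to back and stops at the first failed request. One possible refinement, sketched here under assumptions that are not in the original notebook, is to pause before each request and retry a few times. The helper name `fetch_politely`, the one-second default delay, and the retry count are arbitrary choices for illustration; the `fetch` argument is any function that takes a URL and returns HTML, such as gazpacho's `get`.

```python
import time
from typing import Callable, Optional


def fetch_politely(
    url: str,
    fetch: Callable[[str], str],
    delay: float = 1.0,
    retries: int = 2,
) -> Optional[str]:
    """Call fetch(url), pausing before each attempt.

    Returns the fetched HTML, or None if every attempt raised an exception.
    """
    for attempt in range(retries + 1):
        time.sleep(delay)  # wait before each request to limit load on the server
        try:
            return fetch(url)
        except Exception as exc:  # hypothetical catch-all; narrow it for real use
            print(f"attempt {attempt + 1} failed for {url}: {exc}")
    return None
```

In the loop above, `html = get(url)` could then become `html = fetch_politely(url, get)`, skipping the listing when the result is `None`.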

Make sure that you can turn your data into a usable dataset. For example, it is easy to create a CSV file with pandas, as the following cell shows.

In [4]:
import pandas as pd

df = pd.DataFrame(new_houses)
df.to_csv("new_listings.csv", index=False)  # index=False leaves out the row-number column

There are many more things that can be done with this approach. For example, you could download the images from each listing, or clean the data so that the price is stored as an integer.
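
As a minimal sketch of the price cleaning, assuming prices arrive as strings of whole krónur with dots as thousands separators (for example `"54.900.000 kr."`): the input format, the sample values, and the helper name `price_to_int` are assumptions, so check them against the scraped data first. A value with a decimal part would be mangled by this approach.

```python
import re
from typing import Optional


def price_to_int(raw: str) -> Optional[int]:
    """Keep only the digits of a price string; return None if it has none."""
    digits = re.sub(r"\D", "", raw)
    return int(digits) if digits else None


# Hypothetical values, invented for this example only
print(price_to_int("54.900.000 kr."))  # 54900000
print(price_to_int("Tilboð"))          # None
```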

You could also analyze the description text of each listing: which words indicate extra features that might be useful in your prediction model?
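
A simple starting point for such a text analysis is to count how often each word appears across the descriptions. The sketch below assumes the descriptions have already been scraped into a list of strings, which the notebook does not yet do. The helper name `word_counts`, the minimum word length, and the sample descriptions are all made up for illustration.

```python
import re
from collections import Counter


def word_counts(descriptions, min_length=4):
    """Count lowercase words of at least min_length characters."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"\w+", text.lower())
        counts.update(w for w in words if len(w) >= min_length)
    return counts


# Made-up descriptions, not taken from real listings
descriptions = [
    "Björt íbúð með svölum og bílskúr",
    "Rúmgóð íbúð með bílskúr",
]
print(word_counts(descriptions).most_common(3))
```

Words that are frequent in some listings and rare in others are candidates for binary features (for example, whether the description mentions a garage).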