TODO:
- docstrings on all functions
- more text on both notebooks
- working pictures with saving 
- crawler that crawls more than one listing
- crawler that crawls more than one page
- exercises that are not too hard -> think about the available time

# Introduction to Web-Scraping in Python

Creating an HTML web scraper is straightforward once the aforementioned basics of Python programming are understood. A basic understanding of HTML is also needed.

## Using the Firefox Debugger

To understand the data and the web page that we want to scrape, we usually have to use the debugging tools of our browser. In our example we use the debugger of Firefox. To open it, visit https://www.immobilienscout24.de/expose/109523308 and press *CTRL+Shift+I*, or alternatively right-click on the page and select 'Inspect Element'.

The first piece of data on the page that we are interested in is the rent, or Kaltmiete (the base rent excluding utilities). To find out where our crawler can locate this value, we can right-click on the element and select 'Inspect Element'. The resulting HTML code should be `<div class="is24qa-kaltmiete is24-value font-semibold"> 1.437,40 € </div>`.

What we can learn here is that the rent element has the `class` attribute "is24qa-kaltmiete is24-value font-semibold", which we can use later in our scraper.

## Creating a simple Crawler

In [1]:
import requests
from bs4 import BeautifulSoup

After importing our libraries, we can request the web page of interest. Since we are interested in the content of the web page, we read the `.text` attribute of the response object that `requests.get` returns. It contains the body of the page as a string.

In [2]:
r = requests.get('https://www.immobilienscout24.de/expose/109523308').text

**Important:** After requesting the web page, we have downloaded the complete page and stored it in our variable *r*. From here on we are working with a local copy of the web page, so we do not burden the web page provider with unnecessary requests!
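To make the idea of a local copy more concrete, the downloaded HTML could also be written to disk, so that re-running the notebook does not send a new request. The following is a minimal sketch under stated assumptions: the helper name `load_or_download`, its `download` parameter, and the cache file name in the usage note are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def load_or_download(url, cache_file, download):
    """Return the HTML for `url`, downloading it only if no local copy exists.

    `download` is any callable that takes a URL and returns the page as a
    string, for example `lambda u: requests.get(u).text`.
    """
    path = Path(cache_file)
    if path.exists():
        # A local copy exists, so no request is sent to the server.
        return path.read_text(encoding='utf-8')
    html = download(url)
    path.write_text(html, encoding='utf-8')
    return html
```

In the notebook this might be called as `r = load_or_download(url, 'expose_109523308.html', lambda u: requests.get(u).text)`, where the file name is only an example.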

In [3]:
soup = BeautifulSoup(r, 'html.parser')

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="de">
 <head>
  <meta content="IE=edge, requiresActiveX=true" http-equiv="X-UA-Compatible"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="Immobilien, Wohnungen, Immobilie, Wohnung, Haus, Eigentumswohung, Häuser, Mietwohnung, Wohnungsmarkt, Wohnungssuche" name="keywords"/>
  <meta content="Etagenwohnung (Wohnung/Miete): 3 Zimmer - 106,9 qm - Bautzener Straße 35, 10829 Berlin, Schöneberg (Schöneberg) bei ImmobilienScout24 (Scout-ID: 109523308)" name="description"/>
  <meta content="none" name="msapplication-config"/>
  <meta content="telephone=no" name="format-detection">
   <link href="https://www.immobilienscout24.de/expose/109523308" rel="canonical"/>
   <link href="//www.static-immobilienscout24.de" rel="dns-prefetch"/>
   <link href="//www.google.com" rel="dns-prefetch"/>
   <title>
    Sofortbezug: 3 Zimmer und 2 Balkone sorgen für einzigarti

In [5]:
soup.find_all(class_="is24qa-kaltmiete is24-value font-semibold")

[<div class="is24qa-kaltmiete is24-value font-semibold"> 1.437,40 € </div>]
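A note on how this lookup behaves: when `class_` receives a string with several class names, Beautiful Soup compares it with the exact value of the `class` attribute, so the lookup stops matching if the site reorders the classes. The sketch below uses a small hand-written HTML snippet (an assumption modelled on the element above, not a download) to show that searching for the single, most specific class works as well.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the relevant part of the listing page.
html = '<div class="is24qa-kaltmiete is24-value font-semibold"> 1.437,40 € </div>'
snippet = BeautifulSoup(html, 'html.parser')

# Exact match on the full class string, as in the cell above.
full_match = snippet.find_all(class_='is24qa-kaltmiete is24-value font-semibold')

# Match on one class only; the order of the other classes no longer matters.
single_match = snippet.find_all(class_='is24qa-kaltmiete')

# The same lookup written as a CSS selector.
css_match = snippet.select('.is24qa-kaltmiete')

print(len(full_match), len(single_match), len(css_match))
```

Matching on `is24qa-kaltmiete` alone is probably the more robust choice here, since the other two classes look like generic styling classes.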

In [6]:
type(soup.find_all(class_="is24qa-kaltmiete is24-value font-semibold"))

bs4.element.ResultSet

In [7]:
soup.find_all(class_="is24qa-kaltmiete is24-value font-semibold")[0]

<div class="is24qa-kaltmiete is24-value font-semibold"> 1.437,40 € </div>

In [8]:
soup.find_all(class_="is24qa-kaltmiete is24-value font-semibold")[0].text

' 1.437,40 € '

In [9]:
rent = soup.find_all(class_="is24qa-kaltmiete is24-value font-semibold")[0].text

In [10]:
type(rent)

str
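Because `rent` is a string, it cannot be used in calculations yet. The following sketch converts a German-formatted value such as `' 1.437,40 € '` into a `float`. The helper name `parse_german_number` is hypothetical, and the sketch assumes that `.` is always the thousands separator and `,` the decimal separator.

```python
import re


def parse_german_number(text):
    """Convert a string like ' 1.437,40 € ' or ' 106,87 m² ' to a float."""
    # Take the first run of digits, dots, and commas; this skips units
    # such as '€' or 'm²' (the superscript 2 is not matched by [0-9]).
    match = re.search(r'[0-9][0-9.,]*', text)
    if match is None:
        raise ValueError(f'no number found in {text!r}')
    number = match.group(0)
    # Drop thousands separators, then turn the decimal comma into a point.
    return float(number.replace('.', '').replace(',', '.'))


print(parse_german_number(' 1.437,40 € '))  # 1437.4
```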

Congratulations! You just created your very first web crawler!

## Getting more Data

We can get more data, such as the number of rooms and the living area in square meters.

In [11]:
def scrape_complete_page(url):
    """Download the page at `url` and return it as a parsed BeautifulSoup object."""
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    return soup

In [12]:
def extract_single_element(soup, html_class):
    """Return the text of the first element in `soup` whose class attribute is `html_class`."""
    value = soup.find_all(class_=html_class)[0].text
    return value
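Note that `find_all(...)[0]` raises an `IndexError` when a page does not contain the requested class, for example because a field is missing from a listing. A defensive variant might look like the sketch below; the name `extract_single_element_safe` and the `default` parameter are assumptions and not part of the original notebook.

```python
def extract_single_element_safe(soup, html_class, default=None):
    """Return the stripped text of the first matching element, or `default`."""
    matches = soup.find_all(class_=html_class)
    if not matches:
        # Nothing on the page carries this class; do not fail, report it.
        return default
    # Remove the surrounding whitespace seen in values such as ' 3 '.
    return matches[0].text.strip()
```

Returning `None` instead of raising lets a loop over many listings continue when a single value is missing.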

In [13]:
soup = scrape_complete_page('https://www.immobilienscout24.de/expose/109523308')

In [14]:
rooms = extract_single_element(soup, 'is24qa-zi is24-value font-semibold')
rooms

' 3 '

In [15]:
sqm = extract_single_element(soup, 'is24qa-flaeche is24-value font-semibold')
sqm

' 106,87 m² '

In [16]:
html_classes = ['is24qa-kaltmiete is24-value font-semibold', 
                'is24qa-zi is24-value font-semibold', 
                'is24qa-flaeche is24-value font-semibold']

In [25]:
urls = ['https://www.immobilienscout24.de/expose/109523308',
        'https://www.immobilienscout24.de/expose/108982092',
        'https://www.immobilienscout24.de/expose/110182204']

In [38]:
def scrape_elements(urls, html_classes_list):
    """Print the text of every class in `html_classes_list` for each URL in `urls`."""
    for url in urls:
        soup = scrape_complete_page(url)
        print('====================================================')
        print('url: ')
        print(url)
        for html_class in html_classes_list:
            print(html_class)
            print(extract_single_element(soup, html_class))

In [39]:
scrape_elements(urls, html_classes)

url: 
https://www.immobilienscout24.de/expose/109523308
is24qa-kaltmiete is24-value font-semibold
 1.437,40 € 
is24qa-zi is24-value font-semibold
 3 
is24qa-flaeche is24-value font-semibold
 106,87 m² 
url: 
https://www.immobilienscout24.de/expose/108982092
is24qa-kaltmiete is24-value font-semibold
 1.247,97 € 
is24qa-zi is24-value font-semibold
 2 
is24qa-flaeche is24-value font-semibold
 73,41 m² 
url: 
https://www.immobilienscout24.de/expose/110182204
is24qa-kaltmiete is24-value font-semibold
 1.123,88 € 
is24qa-zi is24-value font-semibold
 3 
is24qa-flaeche is24-value font-semibold
 83,25 m² 
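Printing is fine for a first look, but for later analysis it is usually more convenient to collect the values. The sketch below returns one dictionary per listing. The function name `collect_elements`, the `fetch_page` and `extract` parameters, and the optional `delay` between requests are assumptions; in this notebook, `scrape_complete_page` and `extract_single_element` could be passed in.

```python
import time


def collect_elements(urls, html_classes_list, fetch_page, extract, delay=0.0):
    """Return a list with one dict per URL, mapping each HTML class to its text."""
    rows = []
    for index, url in enumerate(urls):
        if index > 0 and delay > 0:
            # Pause between requests to go easy on the server.
            time.sleep(delay)
        soup = fetch_page(url)
        row = {'url': url}
        for html_class in html_classes_list:
            row[html_class] = extract(soup, html_class).strip()
        rows.append(row)
    return rows
```

A call such as `rows = collect_elements(urls, html_classes, scrape_complete_page, extract_single_element, delay=1.0)` would then yield a list that can be inspected or saved later.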


In [68]:
soup = scrape_complete_page('https://www.immobilienscout24.de/expose/110182204')

In [69]:
# NOTE: image extraction only works intermittently; the cause is still unclear.

In [70]:
images = []
for img in soup.find_all('img'):
    images.append(img.get('src'))

In [71]:
images

['//www.immobilienscout24.de/etc/designs/is24/img/logo.svg',
 'https://pictures.immobilienscout24.de/listings/a99695a0-6515-47db-9816-50c65c9dc137-1277517440.jpg/ORIG/legacy_thumbnail/1024x768/format/jpg/quality/80',
 '//www.static-immobilienscout24.de/statpic/3eaf17869bb51bf27bd7c91bc9853973_pixel.png',
 '//www.static-immobilienscout24.de/statpic/3eaf17869bb51bf27bd7c91bc9853973_pixel.png',
 '//www.static-immobilienscout24.de/statpic/3eaf17869bb51bf27bd7c91bc9853973_pixel.png',
 '//www.static-immobilienscout24.de/statpic/3eaf17869bb51bf27bd7c91bc9853973_pixel.png',
 '//www.static-immobilienscout24.de/statpic/3eaf17869bb51bf27bd7c91bc9853973_pixel.png',
 '//www.static-immobilienscout24.de/statpic/3eaf17869bb51bf27bd7c91bc9853973_pixel.png',
 '//www.static-immobilienscout24.de/statpic/expose/address_icon/dabb800ceffa82de1d7a9c5780015bc9_Address_Map_x1.png',
 '//www.static-immobilienscout24.de/statpic/expose/icons/79d008f4081b3e0dd0ac5fba370bfd9c_icn_zoom__56x56.svg',
 '//www.static-immo

In [67]:
product_images_urls = []

for element in images:
    # Keep only listing photos, whose URLs contain 'pictures'.
    if 'pictures' in element:
        product_images_urls.append(element)
product_images_urls


['https://pictures.immobilienscout24.de/listings/5ef793a5-6d60-4d00-8f56-6c1298a1cb60-1271017530.jpg/ORIG/legacy_thumbnail/1024x768/format/jpg/quality/80']
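Several of the `src` values above are protocol-relative (they start with `//`), and matching on the substring `'pictures'` would also accept any unrelated URL that happens to contain that word. The following sketch resolves each `src` against the page URL with `urllib.parse.urljoin` and filters on the host name instead. The helper name `listing_image_urls` and the default host `pictures.immobilienscout24.de` (taken from the output above) are assumptions.

```python
from urllib.parse import urljoin, urlparse


def listing_image_urls(page_url, srcs, host='pictures.immobilienscout24.de'):
    """Return absolute image URLs from `srcs` that are served by `host`."""
    result = []
    for src in srcs:
        if not src:
            # <img> tags without a src attribute yield None; skip them.
            continue
        # Resolve relative and protocol-relative values against the page URL.
        absolute = urljoin(page_url, src)
        if urlparse(absolute).hostname == host:
            result.append(absolute)
    return result
```

It could be called as `listing_image_urls('https://www.immobilienscout24.de/expose/110182204', images)`; the resulting absolute URLs are then suitable for a later download step.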