# Introduction to Web-Scraping in Python

Creating an HTML web scraper is straightforward once the Python basics covered earlier are properly understood. A basic understanding of HTML is also needed.

## Using the Firefox Debugger

To understand the web page and the data we want to scrape, we usually need the debugging tools of our browser. In this example we use the Firefox debugger. To open it, visit https://www.immobilienscout24.de/expose/109523308 and press *CTRL+Shift+I*, or right-click on the page and select 'Inspect Element'.

The first piece of data we are interested in is the rent, or Kaltmiete. To find out where our crawler will locate it, right-click on the element and select 'Inspect Element'. The debugger then highlights the HTML code of that element.

What we can learn here is that the rent element has the HTML class "is24qa-kaltmiete is24-value font-semibold", which we can use later in our scraper.

## Creating a simple Crawler

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import random
import re
import sys
import time

After importing our libraries, we can request the web page of interest. Since we are interested in the content of the page, we read the `.text` attribute of the response that requests returns.

In [None]:
r = requests.get('https://www.immobilienscout24.de/expose/109523308').text

**Important:** After requesting the web page, we have downloaded the complete page and stored it in our variable *r*. From here on we work with a local copy of the web page, so we do not burden the web page provider with unnecessary requests!

To make the HTML text of the response easier to search, we parse it with BeautifulSoup.

In [None]:
soup = BeautifulSoup(r, 'html.parser')

In [None]:
print(soup.prettify())

We can now search for specific elements in our text, just as we did in our web browser!

In [None]:
soup.find_all(class_="is24qa-kaltmiete is24-value font-semibold")

In [None]:
type(soup.find_all(class_="is24qa-kaltmiete is24-value font-semibold"))

In [None]:
soup.find_all(class_="is24qa-kaltmiete is24-value font-semibold")[0]

In [None]:
soup.find_all(class_="is24qa-kaltmiete is24-value font-semibold")[0].text

In [None]:
rent = soup.find_all(class_="is24qa-kaltmiete is24-value font-semibold")[0].text

In [None]:
type(rent)

Congratulations! You just created your very first web crawler!
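
The value stored in `rent` is a string, so it has to be converted before it can be used in calculations. A minimal sketch, assuming the text uses German number formatting such as `1.234,56 €` (the exact format on the page is an assumption, and `parse_german_number` is a hypothetical helper name):

```python
import re


def parse_german_number(text):
    """Convert a string such as '1.234,56 €' to a float."""
    # keep only digits, dots and commas; this drops the currency sign and spaces
    cleaned = re.sub(r"[^\d.,]", "", text)
    # German notation: '.' separates thousands, ',' marks the decimals
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return float(cleaned)


print(parse_german_number("1.234,56 €"))
```

If the page uses a different notation, the two `replace` calls would need to be adapted.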

## Getting more Data

We can get more data, like the number of rooms and the area in square meters.

We create one function to scrape the complete page and another to get single elements. This way we only request the page once but can search within the requested HTML as often as we like.

In [None]:
def scrape_complete_page(url):
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    return soup

In [None]:
def extract_single_element(soup, html_class): 
    value = soup.find_all(class_=html_class)[0].text
    return value

In [None]:
soup = scrape_complete_page('https://www.immobilienscout24.de/expose/109523308')

In [None]:
rooms = extract_single_element(soup, 'is24qa-zi is24-value font-semibold')
rooms

In [None]:
sqm = extract_single_element(soup, 'is24qa-flaeche is24-value font-semibold')
sqm

## Scraping multiple listings

In [None]:
html_classes = ['is24qa-kaltmiete is24-value font-semibold', 
                'is24qa-zi is24-value font-semibold', 
                'is24qa-flaeche is24-value font-semibold']

In [None]:
urls = ['https://www.immobilienscout24.de/expose/109523308', 
        'https://www.immobilienscout24.de/expose/108982092',
       'https://www.immobilienscout24.de/expose/110182204']

### String Methods

In [None]:
urls[0]
type(urls[0])

We only need the expose ID. Splitting the URL at every `/` returns a list, and the ID is its fifth element (index 4):

In [None]:
urls[0].split('/')

In [None]:
urls[0].split('/')[4]
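
Indexing with `[4]` depends on the URL having exactly this shape. As an alternative sketch, the standard library's `urllib.parse` can isolate the path first; `expose_id` is a hypothetical helper name, and it assumes the ID is always the last path segment.

```python
from urllib.parse import urlparse


def expose_id(url):
    """Return the last non-empty path segment of a listing URL."""
    path = urlparse(url).path  # e.g. '/expose/109523308'
    segments = [s for s in path.split("/") if s]
    return segments[-1]


print(expose_id("https://www.immobilienscout24.de/expose/109523308"))
```

This variant ignores a trailing slash or a query string, which would shift or alter the elements of a plain `split('/')`.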

We can now create a function that accepts a list of urls and a list of html classes to automatically download data from multiple listings.

In [None]:
def scrape_elements(urls, html_classes_list):
    # create an empty list where all the data is stored
    data_all = []
    for url in urls:
        time.sleep(random.uniform(0.3, 3))
        # web page gets requested only once
        soup = scrape_complete_page(url)
        print('====================================================')
        print('url: ')
        print(url)
        # create an empty list for each data set
        data_set = []
        data_set.append(url)
        # get all elements that are specified in html_classes
        for html_class in html_classes_list:
            print(html_class)
            print(extract_single_element(soup, html_class))
            # add the elements to the list
            data_set.append(extract_single_element(soup, html_class))
        # add all the data into the data_all list as list of lists
        data_all.append(data_set)
    print(data_all)
    # create a pandas dataframe to easily store the data as a .csv-file
    column_names = ['url', 'rent', 'rooms','area']
    df = pd.DataFrame(data_all, columns = column_names)
    df.to_csv('./rent_data.csv', sep=';')
    
    print(df)

In [None]:
scrape_elements(urls, html_classes)
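
Because the file is written with `sep=';'`, anything that reads it back has to use the same delimiter. A small sketch with the standard `csv` module; the sample text below is made up and only mimics the layout that `to_csv` produces, where the first, unnamed column holds the DataFrame index.

```python
import csv
import io

# made-up sample in the layout written by df.to_csv(sep=';')
sample = ";url;rent;rooms;area\n0;https://example.com/expose/1;1.000 €;2;50 m²\n"

# DictReader uses the header row as keys; the delimiter must match the writer
reader = csv.DictReader(io.StringIO(sample), delimiter=";")
rows = list(reader)
print(rows[0]["rent"], rows[0]["rooms"])
```

To read the real file, `io.StringIO(sample)` would be replaced by an opened `./rent_data.csv`.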

## Exercise 7

## Downloading images

Downloading images requires one request per image. We also need to find the image links on the page before sending those requests.

In [None]:
soup = scrape_complete_page('https://www.immobilienscout24.de/expose/109523308')

In [None]:
soup.find_all(class_='sp-image ')

In [None]:
soup.find_all(class_='sp-image ')[0]['data-src']

In [None]:
soup.find_all(class_='sp-image ')[0]['data-src'].split('/ORIG')

In [None]:
soup.find_all(class_='sp-image ')[0]['data-src'].split('/ORIG')[0]

In [None]:
images = soup.find_all(class_='sp-image ')

In [None]:
images_urls = []
for image in images:
    print(image['data-src'].split('/ORIG')[0])
    images_urls.append(image['data-src'].split('/ORIG')[0])

In [None]:
images_urls

In [None]:
import os
os.getcwd()

In [None]:
def save_images(url, images_list):
    # get expose id from the URL
    expose = url.split('/')[4]
    print("crawling pictures for expose #: " + str(expose))
    i = 0
    if not os.path.exists("./images/"):
        os.makedirs("./images/")
    for image_url in images_list:
        sys.stdout.write('\r'+"downloading image # " + str(i))

        r = requests.get(image_url)
        #print(image_url)
        if not os.path.exists("./images/" + expose + "/"):
            os.makedirs("./images/" + expose + "/")
        with open("./images/" + expose + "/" + str(i) + ".jpg", "wb") as f:
            f.write(r.content)
        i = i + 1


## Extend the scraping function with image saving

We now add the option to save images to our previous function. We also wrap the work for each URL in a `try`/`except` block, so that a page that raises an error is skipped instead of stopping the whole run.

In [None]:
def scrape_elements(urls, html_classes_list):
    # create an empty list where all the data is stored
    data_all = []
    
    for url in urls:
        # a short random break between requests is very important to not be a bother to the 
        # web service provider
        time.sleep(random.uniform(0.3, 2))
        # here we added a try and except to skip errors with pages that do not 
        # follow the regular layout of immobilienscout24.de
        try:
            # web page gets requested only once
            soup = scrape_complete_page(url)
            print('\n')
            print('====================================================')
            print('url: ' + str(url))
            # create an empty list for each data set
            data_set = []
            data_set.append(url)
            # get all elements that are specified in html_classes
            for html_class in html_classes_list:
                # print(html_class)
                # print(extract_single_element(soup, html_class))
                # add the elements to the list
                data_set.append(extract_single_element(soup, html_class))
            # add all the data into the data_all list as list of lists
            data_all.append(data_set)

            # new code to save images from all urls as well as the data from 
            # before
            images = soup.find_all(class_='sp-image ')
            images_urls = []
            for image in images:
                images_urls.append(image['data-src'].split('/ORIG')[0])
            save_images(url, images_urls)
        except Exception as e:
            # report the skipped page instead of failing silently
            print('skipped ' + str(url) + ': ' + str(e))
    print(data_all)
    # create a pandas dataframe to easily store the data as a .csv-file
    column_names = ['url', 'rent', 'rooms','area']
    df = pd.DataFrame(data_all, columns = column_names)
    df.to_csv('./rent_data.csv', sep=';')

In [None]:
scrape_elements(urls, html_classes)
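
When pages are skipped, it helps to know afterwards which ones failed and why. The following sketch shows one way to collect failures next to the results; `run_all` and `process` are hypothetical names, and `process` stands in for the per-URL work done inside the loop.

```python
def run_all(items, process):
    """Apply process() to every item and return (results, failures)."""
    results, failures = [], []
    for item in items:
        try:
            results.append(process(item))
        except Exception as e:
            # remember what failed and why instead of discarding the error
            failures.append((item, repr(e)))
    return results, failures
```

The `failures` list can then be printed or saved, so that problematic listings can be inspected in the browser later.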

## Get multiple listings

The final goal we want to achieve is to automatically get all the listings of a specific city and crawl the data as well as the images that we need.

In [None]:
for i in range(1,5):
    url = f'https://www.immobilienscout24.de/Suche/S-T/P-{i}/Wohnung-Miete/Umkreissuche/Berlin/-/229459/2511140/-/-/50?enteredFrom=result_list'
    print(url)

In [None]:
url = 'https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Umkreissuche/Berlin/-/229459/2511140/-/-/50?enteredFrom=result_list'
r = requests.get(url)
data = r.text

In [None]:
soup = BeautifulSoup(data, 'html.parser')
urls = soup.find_all('article')
urls

In [None]:
urls[0]

In [None]:
urls[0]['data-obid']

In [None]:
def listings_urls():
    url_list = []
    # range(1, pagelimit) stops before pagelimit, so this requests pages 1 and 2
    pagelimit = 3
    for i in range(1, pagelimit):
        url = f'https://www.immobilienscout24.de/Suche/S-T/P-{i}/Wohnung-Miete/Umkreissuche/Berlin/-/229459/2511140/-/-/50?enteredFrom=result_list'
        r = requests.get(url)
        data = r.text
        # name the parser explicitly, as in the earlier cells
        soup = BeautifulSoup(data, 'html.parser')
        urls = soup.find_all('article')
        for expose in urls:
            # each result carries its expose ID in the data-obid attribute
            new_url = 'https://www.immobilienscout24.de/expose/' + str(expose['data-obid'])
            url_list.append(new_url)
    return url_list


In [None]:
urls = listings_urls()

In [None]:
urls

In [None]:
scrape_elements(urls[0:10], html_classes)

## Exercise 8

## Usage of Proxies

Sometimes websites block clients that scrape their pages. In that case we can use proxies to disguise our IP address. Here we use very slow free proxies, but if you actually need proxies for your project, you should probably invest in a professional service.

We download the free proxies list once and save it locally:

In [None]:
r = requests.get(
"https://proxyscrape.com/api?request=getproxies&proxytype=http&timeout=")
with open("./proxies.txt", "wb") as f:
    f.write(r.content)

In [None]:
with open("./proxies.txt", "r") as f:
    proxies = f.read().splitlines()
    proxies = list(proxies)
    # print(proxies)

In [None]:
ip = requests.get('https://api.ipify.org').text
print('My public IP address is:', ip)

In [None]:
proxy={'https': proxies[5]}

In [None]:
proxy

In [None]:
proxy={'https': proxies[5]}
ip = requests.get('https://api.ipify.org', proxies = proxy).text
print('My public IP address is:', ip)

In [None]:
def request_with_proxy(request_url):
    with open("./proxies.txt", "r") as f:
        proxies = f.read().splitlines()
        proxies = list(proxies)
        
    try:
        # pick one proxy from the list at random
        proxy = {'https': random.choice(proxies)}
        r = requests.get(request_url, timeout=1.0, proxies = proxy)
        # print(r.text)
        return r
    except Exception as e:
        # print(e)
        return request_with_proxy(request_url)

In [None]:
request_with_proxy('https://api.ipify.org').text
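
The function above retries recursively until some proxy answers, so a list of dead proxies would keep it going until Python's recursion limit is reached. A bounded loop is one alternative. This is a sketch only: `fetch_with_retries`, `fetch`, and `max_attempts` are hypothetical names, and `fetch` is any callable that takes a proxy and either returns a response or raises.

```python
import random


def fetch_with_retries(fetch, proxies, max_attempts=5):
    """Try up to max_attempts randomly chosen proxies; return the first success."""
    last_error = None
    for _ in range(max_attempts):
        proxy = random.choice(proxies)  # always a valid element of the list
        try:
            return fetch(proxy)
        except Exception as e:
            last_error = e  # this proxy failed, try another one
    raise RuntimeError(f"no working proxy after {max_attempts} attempts") from last_error
```

With requests, `fetch` could be `lambda p: requests.get(request_url, timeout=1.0, proxies={'https': p})`, which mirrors the call used in `request_with_proxy`.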

## Crawler with Proxies

In [None]:
def scrape_complete_page_with_proxy(url):
    r = request_with_proxy(url).text
    soup = BeautifulSoup(r, 'html.parser')
    return soup

In [None]:
def save_images_with_proxy(url, images_list):
    # get expose id from the URL
    expose = url.split('/')[4]
    print("crawling pictures for expose #: " + str(expose))
    i = 0
    if not os.path.exists("./images/"):
        os.makedirs("./images/")
    for image_url in images_list:
        sys.stdout.write('\r'+"downloading image # " + str(i))

        r = request_with_proxy(image_url)
        #print(image_url)
        if not os.path.exists("./images/" + expose + "/"):
            os.makedirs("./images/" + expose + "/")
        with open("./images/" + expose + "/" + str(i) + ".jpg", "wb") as f:
            f.write(r.content)
        i = i + 1

In [None]:
def scrape_elements_with_proxy(urls, html_classes_list):
    # create an empty list where all the data is stored
    data_all = []
    
    for url in urls:
        # here we added a try and except to skip errors with pages that do not 
        # follow the regular layout of immobilienscout24.de
        try:
            # web page gets requested only once
            soup = scrape_complete_page_with_proxy(url)
            print('\n')
            print('====================================================')
            print('url: ' + str(url))
            # create an empty list for each data set
            data_set = []
            data_set.append(url)
            # get all elements that are specified in html_classes
            for html_class in html_classes_list:
                # print(html_class)
                # print(extract_single_element(soup, html_class))
                # add the elements to the list
                data_set.append(extract_single_element(soup, html_class))
            # add all the data into the data_all list as list of lists
            data_all.append(data_set)

            # new code to save images from all urls as well as the data from 
            # before
            images = soup.find_all(class_='sp-image ')
            images_urls = []
            for image in images:
                images_urls.append(image['data-src'].split('/ORIG')[0])
            save_images_with_proxy(url, images_urls)
        except Exception as e:
            # report the skipped page instead of failing silently
            print('skipped ' + str(url) + ': ' + str(e))
    print(data_all)
    # create a pandas dataframe to easily store the data as a .csv-file
    column_names = ['url', 'rent', 'rooms','area']
    df = pd.DataFrame(data_all, columns = column_names)
    df.to_csv('./rent_data.csv', sep=';')

In [None]:
scrape_elements_with_proxy(urls, html_classes)