# Testing out web scraping on pararius.nl

## Background

### Observations
- https://www.pararius.nl/huurwoningen/[CITY] is the overview page per city
- the URL changes to https://www.pararius.nl/appartement-te-huur/[CITY]/[CODE]/[STREET] when opening a listing
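
The two URL patterns above could be captured in a couple of small helpers. This is only a sketch using the standard library: the names `overview_url` and `parse_listing_url` are hypothetical, and it assumes every listing path has exactly four segments (property type, city, code, street), as in the one example observed so far.

```python
from urllib.parse import urlparse

BASE_URL = "https://www.pararius.nl"


def overview_url(city):
    """Build the overview URL for a city (hypothetical helper)."""
    return f"{BASE_URL}/huurwoningen/{city.lower()}"


def parse_listing_url(url):
    """Split a listing URL into its four path segments.

    Assumes the path looks like /<type>/<city>/<code>/<street>.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError(f"unexpected listing path: {url}")
    kind, city, code, street = parts
    return {"type": kind, "city": city, "code": code, "street": street}


print(overview_url("Rotterdam"))
print(parse_listing_url(
    "https://www.pararius.nl/appartement-te-huur/rotterdam/b6c9f139/prins-hendrikkade"
))
```

The code segment (`b6c9f139` here) looks like a natural unique key for a listing, although that is an assumption that has not been checked against the site.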

### Desired features

See:
- price
- street
- (zipcode)
- neighborhood
- agent
- number of rooms
- number of bedrooms
- suitable for sharing (based on AI)
- date added
- surface area

Other:
- sortable
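
One possible way to hold these features is a small record type that can be sorted on any field. The sketch below uses only the standard library; the class name `Listing`, its field names, and the helper `sort_listings` are illustrative choices, and the second listing is made-up sample data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Listing:
    """One scraped listing; field names are illustrative, not from the site."""
    price: int                      # monthly rent in euros
    street: str
    neighborhood: str
    agent: str
    rooms: int
    bedrooms: int
    surface_area: int               # in square metres
    date_added: str                 # kept as the raw text shown on the page
    zipcode: Optional[str] = None   # optional, as in the wish list
    suitable_for_sharing: Optional[bool] = None  # to be filled in later


def sort_listings(listings, key="price", descending=False):
    """Return a new list sorted by the given field name."""
    return sorted(listings, key=lambda item: getattr(item, key), reverse=descending)


listings = [
    Listing(price=1890, street="Prins Hendrikkade", neighborhood="Noordereiland",
            agent="Perfectrent", rooms=3, bedrooms=1, surface_area=84,
            date_added="7 weken", zipcode="3071KB"),
    # Made-up second entry, only there to have something to sort.
    Listing(price=1200, street="Example Street", neighborhood="Example Area",
            agent="Example Agent", rooms=2, bedrooms=1, surface_area=50,
            date_added="1 week"),
]

for item in sort_listings(listings, key="surface_area", descending=True):
    print(item.street, item.surface_area)
```

A list of such records can also be passed to `pd.DataFrame(...)` once sorting and filtering in pandas is wanted.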

## Scraping

### Set up

In [84]:
# Import packages
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs 

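# Methods
def get_number(string):
    """Join all ASCII digits in `string` into one int, e.g. '€ 1.890' -> 1890.

    Raises ValueError if the string contains no digits.
    """
    # ASCII digits only; str.isdigit() would also accept characters such as '²'
    allowed = "0123456789"
    joined = ''.join(ch for ch in string if ch in allowed)
    return int(joined)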

def truncate_middle(s, length, ellipsis="..."):
    """
    Truncate a string to a specified length, adding ellipses in the middle if necessary.

    Args:
        s (str): The input string.
        length (int): The maximum length of the resulting string (including ellipses).
        ellipsis (str): The ellipsis string to use (default is "...").

    Returns:
        str: The truncated string.
    """
    if len(s) <= length:
        return s  # No truncation needed

    # Calculate the length of the prefix and suffix (including ellipses)
    prefix_length = (length - len(ellipsis)) // 2
    suffix_length = length - prefix_length - len(ellipsis)

    # Construct the truncated string with ellipses in the middle
    truncated = s[:prefix_length] + ellipsis + s[-suffix_length:]

    return truncated
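
`get_number` above raises a `ValueError` when the text contains no digits, because `int('')` fails. A possible variant that returns `None` instead is sketched below. The name `get_number_or_none` is hypothetical, and the third input is an invented example of a price field without a number.

```python
import string


def get_number_or_none(text):
    """Like get_number, but return None when the text has no ASCII digits."""
    digits = "".join(ch for ch in text if ch in string.digits)
    return int(digits) if digits else None


print(get_number_or_none("€ 1.890 per maand"))  # 1890
print(get_number_or_none("84 m²"))              # 84
print(get_number_or_none("Prijs op aanvraag"))  # None
```

Checking against `string.digits` rather than `str.isdigit()` matters for surface areas: `'²'.isdigit()` is `True`, so the superscript would end up in the string passed to `int`.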

In [129]:
# Listing Page

url = 'https://www.pararius.nl/appartement-te-huur/rotterdam/b6c9f139/prins-hendrikkade'
req = requests.get(url)
print(req)

soup = bs(req.text, "html.parser")

# Monthly price
price_html = soup.find("div", {"class": "listing-detail-summary__price"})
print(price_html.text.split())
price = get_number(price_html.text)
print(price)

# Surface area
area_html = soup.find("li", {"class": "illustrated-features__item illustrated-features__item--surface-area"})
area = get_number(area_html.text)
print(area)

# Number of rooms
nrooms_html = soup.find("li", {"class": "illustrated-features__item illustrated-features__item--number-of-rooms"})
nrooms = get_number(nrooms_html.text)
print(nrooms)

# Number of bedrooms
nbedrooms_html = soup.find("dd", {"class": "listing-features__description listing-features__description--number_of_bedrooms"})
nbedrooms = get_number(nbedrooms_html.text)
print(nbedrooms)

# Number of bathrooms
nbathrooms_html = soup.find("dd", {"class": "listing-features__description listing-features__description--number_of_bathrooms"})
nbathrooms = get_number(nbathrooms_html.text)
print(nbathrooms)

# Furnished
furnished_html = soup.find("li", {"class": "illustrated-features__item illustrated-features__item--interior"})
furnished = furnished_html.text
print(furnished)

# It is very easy to add additional data from the listing
# Zipcode and neighborhood
location_html = soup.find("div", {"class": "listing-detail-summary__location"})
location_split = location_html.text.split()
zipcode = location_split[0] + location_split[1]
neighborhood = location_split[2]
print(zipcode)
print(neighborhood)

# Street
street_htmls = soup.find_all("a", {"class": "breadcrumbs__link"})
street = street_htmls[-1].text
print(street)

# Offered since
since_html = soup.find("dd", {"class": "listing-features__description listing-features__description--offered_since"})
since_down_html = since_html.find("span", {"class": "listing-features__main-description"})
since = since_down_html.text
print(since)

# Agent
agent_html = soup.find("a", {"class": "agent-summary__title-link"})
agent = agent_html.text
print(agent)

# Description
description_html = soup.find("div", {"class": "listing-detail-description__content"})
description = description_html.text
print(truncate_middle(description,100))

# ChatGPT integration

<Response [200]>
['€', '1.890', 'per', 'maand']
1890
84
3
1
1
Gemeubileerd
3071KB
(Noordereiland)
Prins Hendrikkade
7 weken
Perfectrent


Beschrijving
Tijdelijk onderkomen voor 6 maand... twee werkdagen. Alvast bedankt voor de moeite!
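
Every `soup.find(...)` call in the cell above returns `None` when the element is missing, so the following `.text` would raise an `AttributeError` on listings that lack a field. The output also shows the neighborhood still wrapped in parentheses. A minimal sketch of how both could be handled follows. The helpers `text_or_none` and `parse_location` are hypothetical, and `SAMPLE_HTML` is a hand-written stand-in reconstructed from the output above, not real page source; only the class names come from the notebook.

```python
import re
from bs4 import BeautifulSoup

# Tiny stand-in for a listing page; only the class names come from the notebook.
SAMPLE_HTML = """
<div class="listing-detail-summary__price">€ 1.890 per maand</div>
<div class="listing-detail-summary__location">3071 KB (Noordereiland)</div>
"""


def text_or_none(soup, tag, class_name):
    """Return the whitespace-normalised text of the first match, or None."""
    element = soup.find(tag, class_=class_name)
    if element is None:
        return None
    return " ".join(element.get_text().split())


def parse_location(text):
    """Split '3071 KB (Noordereiland)' into ('3071KB', 'Noordereiland')."""
    match = re.match(r"(\d{4})\s*([A-Za-z]{2})\s*\((.+)\)\s*$", text)
    if match is None:
        return None, None
    digits, letters, neighborhood = match.groups()
    return digits + letters.upper(), neighborhood


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
location = text_or_none(soup, "div", "listing-detail-summary__location")
print(parse_location(location))
print(text_or_none(soup, "a", "agent-summary__title-link"))  # missing here
```

Unlike `location_split[2]`, the regular expression keeps neighborhoods that consist of more than one word together.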




In [107]:
# Overview page
url = 'https://www.pararius.nl/huurwoningen/rotterdam'

req = requests.get(url)
print(req)

soup = bs(req.text, "html.parser")

# Maximum number of pages
numpages_html = soup.find_all("li", {"class": "pagination__item"})
numpages = get_number(numpages_html[-2].text)
print(numpages)

# Pages
# A plain for-loop may read more clearly here. Note that page-1 redirects to the
# first overview page, so starting the range at 1 would also work.
pagelinks = [f"https://www.pararius.nl/huurwoningen/rotterdam/page-{i}" for i in range(2, numpages + 1)]
print(pagelinks)

# Listing links on page
listings_html = soup.find_all("a", {"class": "listing-search-item__link listing-search-item__link--title"})
# Each href already starts with "/", so the base URL must not end with one
listing_links = ["https://www.pararius.nl" + link.get("href") for link in listings_html]
print(listing_links)

<Response [200]>
16
['https://www.pararius.nl/huurwoningen/rotterdam/page-2', 'https://www.pararius.nl/huurwoningen/rotterdam/page-3', 'https://www.pararius.nl/huurwoningen/rotterdam/page-4', 'https://www.pararius.nl/huurwoningen/rotterdam/page-5', 'https://www.pararius.nl/huurwoningen/rotterdam/page-6', 'https://www.pararius.nl/huurwoningen/rotterdam/page-7', 'https://www.pararius.nl/huurwoningen/rotterdam/page-8', 'https://www.pararius.nl/huurwoningen/rotterdam/page-9', 'https://www.pararius.nl/huurwoningen/rotterdam/page-10', 'https://www.pararius.nl/huurwoningen/rotterdam/page-11', 'https://www.pararius.nl/huurwoningen/rotterdam/page-12', 'https://www.pararius.nl/huurwoningen/rotterdam/page-13', 'https://www.pararius.nl/huurwoningen/rotterdam/page-14', 'https://www.pararius.nl/huurwoningen/rotterdam/page-15', 'https://www.pararius.nl/huurwoningen/rotterdam/page-16']
['https://www.pararius.nl/appartement-te-huur/rotterdam/b6c9f139/prins-hendrikkade', 'https://www.pararius.nl/huis-
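
The page list and the listing links from the cell above could be built with two small helpers. This is a sketch under assumptions: `page_urls` and `absolute_links` are hypothetical names, and it relies on `urllib.parse.urljoin` to combine the base URL with site-relative hrefs instead of string concatenation.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.pararius.nl"


def page_urls(city, numpages):
    """Return one URL per result page, first page included."""
    first = f"{BASE_URL}/huurwoningen/{city}"
    return [first] + [f"{first}/page-{i}" for i in range(2, numpages + 1)]


def absolute_links(hrefs):
    """Turn site-relative hrefs into absolute URLs."""
    return [urljoin(BASE_URL, href) for href in hrefs]


print(len(page_urls("rotterdam", 16)))
print(absolute_links(["/appartement-te-huur/rotterdam/b6c9f139/prins-hendrikkade"]))
```

Including the first page in the same list means a later loop over all pages does not need a special case for it.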

In [106]:
# Check: range() excludes its end value, hence numpages + 1 in the page list
print(list(range(2, 16)))

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
