## 1.1 Source Overview

ImmoScout24.de is Germany's leading online real estate platform, offering a vast array of property listings for sale and rent, including apartments, houses, commercial spaces, and land. The platform provides detailed information on each property, such as price, location, size, amenities, and photos, facilitating comprehensive property searches for users.

## 1.2 Data Access Methods


| **Method**                 | **Description**                                                                                                                                                                                                                         | **References**                                                                                                                                                                                                                              | **Considerations**                                                                                                                                                                                                                           |
|----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Official API**           | ImmoScout24 offers an official API that provides structured access to property data. However, access is typically restricted and may require approval, and the API documentation is primarily in German.                                   | [ImmoScout24 API Documentation](https://api.immobilienscout24.de/)                                                                                                                                                                         | Access involves costs ranging from [59€ per month up to 4,849€ per month](https://api.immobilienscout24.de/files/API-Pricelisting.pdf).                                                                                                                                                      |
| **Python Packages**        | There are no widely recognized Python packages dedicated solely to interacting with ImmoScout24. However, general web scraping libraries like `requests`, `BeautifulSoup`, and `Selenium` can be employed to extract data from the website. | [Scraping Immobilienscout24.de Real Estate Data](https://scrapfly.io/blog/how-to-scrape-immobillienscout24-real-estate-property-data/)                                                                                                     | Requires handling dynamic content loading and potential anti-scraping measures.                                                                                                      |
| **Web Scraping**           | Direct web scraping involves extracting data from ImmoScout24's web pages. Due to the site's dynamic content loading and anti-scraping mechanisms, this approach can be challenging and may require advanced techniques.                   | [How to Scrape Immobilienscout24.de Real Estate Data](https://hasdata.com/blog/how-to-scrape-immobilienscout24) <br> [ImmoSpider GitHub Repository](https://github.com/asmaier/ImmoSpider) <br> [Scraping Immobilienscout24.de Real Estate Data](https://scrapfly.io/blog/how-to-scrape-immobillienscout24-real-estate-property-data/) | Legal and ethical considerations are important; implementing strategies to handle dynamic content and anti-scraping measures is necessary.                                                                                                   |
| **Third-Party Services**   | Platforms like Apify offer pre-built scrapers for ImmoScout24, allowing users to extract data without developing custom solutions.                                                                                                       | [immobilienscout24.de properties pages scraper](https://apify.com/azzouzana/immobilienscout24-de-properties-pages-scraper)                                                                                                                                                       | While convenient, reliance on third-party services may involve costs and data privacy considerations.                                                                                                                                       |



## Task 2


#### robots.txt:

It is always a good idea to check the robots.txt of the website you want to scrape and to respect the rules the site owner defines there. Below is the content of the robots.txt of immoscout24.de:

User-agent: * \
Disallow: /published-downloads/ \
Disallow: /published-images/ \
Disallow: /error/ \
Disallow: /errors/ \
Disallow: /marktplatz/ \
Disallow: /de/scoutmanager/ \
Disallow: /de/iidn1/offline-IBW.jsp \
Disallow: /meinkonto/ \
Disallow: /adresse/ \
Allow: /meinkonto/premium-mitgliedschaft \
Allow: /meinkonto/kreditkarte \
Allow: /meinkonto/bewerbermappe/vermieter \
Allow: /meinkonto/*.css \
Allow: /meinkonto/*.js \
Disallow: /merkzettel/ \
Disallow: /*.pdf$ \
Disallow: /is24wiki/ \
Disallow: /main.go \
Disallow: /servlet \
Disallow: /IS24/ \
Disallow: /modules/ \
Disallow: /akamai/ \
Disallow: /immobilienpreise/radius/ \
Disallow: /immobilienpreise/api/ \

Disallow: /utm/ 

User-agent: Mediapartners-Google \
Allow: * 

---

According to the robots.txt file above, neither the “/suche/” nor the “/expose/” directory is disallowed. Crawling these paths is therefore consistent with the site's robots.txt rules, as long as the scraper strictly avoids every explicitly disallowed path. In the tutorial and example below we scrape information from the “/suche/” and “/expose/” directories only.
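
The check above can also be done programmatically. The following is a minimal sketch using Python's standard-library `urllib.robotparser`. It parses a hand-copied subset of the rules listed above (an assumption made so the example runs offline), and the helper name `is_allowed` is hypothetical. Note that `urllib.robotparser` matches plain path prefixes and does not interpret wildcard rules such as `/*.pdf$`, so those are left out here.

```python
from urllib.robotparser import RobotFileParser

# Subset of the rules shown above (wildcard rules omitted, since
# urllib.robotparser only performs simple prefix matching).
ROBOTS_RULES = """\
User-agent: *
Disallow: /published-downloads/
Disallow: /meinkonto/
Disallow: /merkzettel/
Disallow: /immobilienpreise/api/
Disallow: /utm/
"""

BASE_URL = "https://www.immobilienscout24.de"


def is_allowed(path, user_agent="*"):
    """Return True if the given path is not disallowed by ROBOTS_RULES."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_RULES.splitlines())
    return parser.can_fetch(user_agent, BASE_URL + path)


if __name__ == "__main__":
    for path in [
        "/expose/158397021",
        "/Suche/de/schleswig-holstein/kiel/wohnung-mieten",
        "/meinkonto/",
    ]:
        print(path, "->", is_allowed(path))
```

In a real run, the rules would be fetched from the live robots.txt (for example with `RobotFileParser.set_url` and `read`) rather than copied by hand, since the file can change over time.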

---

Because immoscout24.de would rather distribute its data via the paid API, the site has put barriers in place that hinder scraping. We therefore take the following steps to reduce the chance of detection:
1. `undetected_chromedriver`, which hides automation flags and bypasses some basic bot-detection JavaScript
2. Custom Chrome options that mimic a real user environment, such as `disable-infobars` and `disable-blink-features=AutomationControlled`
3. Random sleeps between actions to simulate human behavior
4. Scroll simulation, which loads lazy content and also signals human interaction, since bots usually just load the page and parse the HTML
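
Steps 3 and 4 can be sketched as a small helper. This is a minimal illustration, not a definitive implementation: `human_scroll` and its parameters are hypothetical names, and the `FakeDriver` class only stands in for a Selenium driver so the example runs without a browser. The only Selenium call assumed is `driver.execute_script`.

```python
import random
import time


def human_scroll(driver, steps=3, step_px=500, pause=(0.5, 1.0)):
    """Scroll down in small increments with random pauses, then jump to the bottom."""
    for _ in range(steps):
        driver.execute_script(f"window.scrollBy(0, {step_px});")
        time.sleep(random.uniform(*pause))  # random pause between scroll steps
    # Final jump to the page bottom so lazy-loaded content is triggered
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")


class FakeDriver:
    """Stand-in for a Selenium driver that only records the scripts it receives."""

    def __init__(self):
        self.scripts = []

    def execute_script(self, script):
        self.scripts.append(script)


if __name__ == "__main__":
    fake = FakeDriver()
    human_scroll(fake, steps=2, pause=(0, 0))
    print(fake.scripts)
```

With a real driver, the same function would be called right after `driver.get(...)` and before the page source is handed to the parser.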

---

We will use Selenium and BeautifulSoup for efficient scraping of the website:

Selenium primarily renders JavaScript-driven pages and simulates real user behavior.
BeautifulSoup parses and extracts the data from the HTML that Selenium loaded.


### Mini Tutorial: Web Scraping Property Listings from ImmoScout24.de
This tutorial demonstrates how to use Selenium with BeautifulSoup to scrape real estate data from ImmoScout24.
We'll walk through the main steps, from setup to extracting data from property pages.

#### Imports

In [None]:
# Known issue: undetected_chromedriver can fail on recent Python versions (e.g. 3.13) because distutils is no longer in the standard library. Uncomment the workaround below to resolve it:
# import sys
# try:
#     import distutils  
# except ModuleNotFoundError:
#     # If not found, use setuptools' bundled version instead
#     import setuptools._distutils as distutils
#     sys.modules['distutils'] = distutils


In [None]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import time, random, csv, json



#### 1. Setup & Installation
Install necessary packages first:

In [1]:
# pip install undetected-chromedriver selenium beautifulsoup4

#### 2. Utility Functions

In [23]:
# Mimic human behavior by pausing for a random interval between actions
def random_sleep(a=2, b=5):
    time.sleep(random.uniform(a, b))

#### 3. Extract Listing URLs

In [24]:
def extract_listing_urls(search_url, driver, max_pages=1):
    """Scrape listing URLs from the first search result page (max_pages is unused in this demo)."""
    listing_urls = []
    base_url = "https://www.immobilienscout24.de"

    driver.get(search_url)
    random_sleep(2, 4)

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    anchors = soup.find_all('a', href=lambda href: href and href.startswith('/expose/'))

    for a in anchors:
        href = base_url + a['href'].split('?')[0]
        if href not in listing_urls:
            listing_urls.append(href)

    return listing_urls[:5]  # Just return the first five for the demo

#### 4. Extract Data From Listing

In [25]:
def extract_detailed_info(soup, url):
    """Scrape individual property data"""
    data = {'url': url}

    title = soup.find('h1', id='expose-title')
    if title:
        data['title'] = title.text.strip()

    address = soup.find('span', class_='zip-region-and-country')
    if address:
        data['address'] = address.text.strip()

    cold_rent = soup.select_one('div.is24qa-kaltmiete-main')
    if cold_rent:
        data['cold_rent'] = cold_rent.text.strip()

    area = soup.select_one('div.is24qa-flaeche-main')
    if area:
        data['area'] = area.text.strip()

    rooms = soup.select_one('div.is24qa-zi-main')
    if rooms:
        data['rooms'] = rooms.text.strip()

    return data

#### 5. Main Function

In [26]:
def run_demo(search_url):
    options = uc.ChromeOptions()
    options.headless = False
    options.add_argument("--disable-blink-features=AutomationControlled")
    driver = uc.Chrome(options=options)

    all_data = []
    try:
        urls = extract_listing_urls(search_url, driver)
        for url in urls:
            driver.get(url)
            random_sleep(2, 4)
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            details = extract_detailed_info(soup, url)
            all_data.append(details)
    finally:
        driver.quit()

    return all_data

#### 6. Run it:

In [27]:
if __name__ == "__main__":
    search_url = "https://www.immobilienscout24.de/Suche/de/schleswig-holstein/kiel/wohnung-mieten"
    data = run_demo(search_url)
    for d in data:
        print(json.dumps(d, indent=2, ensure_ascii=False))

{
  "url": "https://www.immobilienscout24.de/expose/158397021",
  "title": "Bezaubernde 2 Zi.-Whg. In Bestlage Kiels",
  "address": "Düsternbrook, 24105 Kiel",
  "cold_rent": "880 €",
  "area": "63  m²",
  "rooms": "2"
}
{
  "url": "https://www.immobilienscout24.de/expose/158394744",
  "title": "3-Zimmer-Erdgeschosswohnung mit eigener Terrasse & Garten",
  "address": "Hassee, 24103 Kiel",
  "cold_rent": "858 €",
  "area": "78  m²",
  "rooms": "3"
}
{
  "url": "https://www.immobilienscout24.de/expose/158393664",
  "title": "Neuwertige Wohnung mit zwei Zimmern sowie Balkon und EBK in Kiel",
  "address": "Vorstadt, 24103 Kiel",
  "cold_rent": "1.099 €",
  "area": "68  m²",
  "rooms": "2"
}
{
  "url": "https://www.immobilienscout24.de/expose/156356800",
  "title": "Exklusive Maisonette-Wohnung in Kiel-Hassee – Perfekte Kombination aus Wohnen und Arbeiten",
  "address": "Gaarden-Süd und Kronsburg, 24113 Kiel",
  "cold_rent": "2.400 €",
  "area": "205  m²",
  "rooms": "5"
}
{
  "url": "https

#### 7. Discussion
- This approach can be scaled to collect more listings and fields.
- Limitations: Website structure changes can break the script. It also consumes more resources than APIs.
- Ethical concerns: Make sure scraping respects robots.txt and server limits. Use sleep intervals to avoid hammering the site.
- No login or API key needed.
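
The point about server limits can be made concrete with a small throttle that enforces a minimum delay between requests. This is a sketch under assumptions: the class name `Throttle` and the 3-second default are illustrative choices, not values published by the site. The clock and sleep functions are injectable only so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        """Sleep just long enough so calls are at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would create one `Throttle` and call its `wait()` method before each `driver.get(url)`.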

### More In-Depth Usage of Web-scraping for immoscout24.de
Below is an extended version of the tutorial above. It shows how to build on the existing code to retrieve the information relevant to individual use cases:

#### Steps 1 and 2 are the same as in the tutorial and are reused here unchanged.

In [28]:
def save_to_csv(data, filename):
    if not data:
        return
    fieldnames = set()
    for item in data:
        fieldnames.update(item.keys())
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(fieldnames))
        writer.writeheader()
        writer.writerows(data)

def save_to_json(data, filename):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

#### 3. Scrape Listing URLs

In [29]:
def extract_listing_urls(search_url, driver, max_pages=20):
    """
    Extract property listing URLs from search results pages by navigating through pagination.
    
    Args:
        search_url (str): The URL of the search results page.
        driver (WebDriver): Selenium WebDriver instance.
        max_pages (int): Maximum number of pages to traverse.
    
    Returns:
        list: A list of unique property listing URLs.
    """
    all_listing_urls = []
    current_page = 1

    driver.get(search_url)
    random_sleep(2, 6)

    # Accept cookie consent if present, trying several possible selectors.
    consent_selectors = [
        "#onetrust-accept-btn-handler",
        "button[data-testid='uc-accept-all-button']",
        "#uc-btn-accept-banner",
        "button.consent-accept-button",
        "button.usercentrics-button"
    ]
    for selector in consent_selectors:
        try:
            consent_button = WebDriverWait(driver, 3).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, selector))
            )
            driver.execute_script("arguments[0].click();", consent_button)
            print("Accepted cookies")
            random_sleep(1, 2)
            break
        except Exception:
            # Selector not present or not clickable in time; try the next one
            continue
    else:
        # The loop finished without a successful click
        print("No cookie consent found or already accepted")

    while current_page <= max_pages:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )

        # Scroll to load dynamic content
        for _ in range(3):
            driver.execute_script("window.scrollBy(0, 500);")
            random_sleep(0.5, 1)

        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        random_sleep(1, 2)

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        listing_elements = soup.find_all('a', href=lambda href: href and href.startswith('/expose/'))

        base_url = "https://www.immobilienscout24.de"
        page_urls = []

        for elem in listing_elements:
            url = base_url + elem['href'].split('?')[0]
            if url not in all_listing_urls and url not in page_urls:
                page_urls.append(url)

        print(f"Found {len(page_urls)} listings on page {current_page}")
        all_listing_urls.extend(page_urls)

        if current_page >= max_pages:
            print(f"Reached maximum number of pages ({max_pages})")
            break

        try:
            next_button = None
            pagination_selectors = [
                "[data-testid='pagination-button-next']",
                ".Pagination_pagination-button-next__Fd8-Y",
                "button[aria-label='Nächste Seite']",
                "[data-testid='paginationForward']",
                ".pagination-link--next"
            ]
            for selector in pagination_selectors:
                try:
                    next_button = driver.find_element(By.CSS_SELECTOR, selector)
                    break
                except NoSuchElementException:
                    continue

            if next_button:
                disabled = next_button.get_attribute("aria-disabled")
                if disabled and disabled.lower() == "true":
                    print("Next page button is disabled - reached the last page")
                    break

                driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", next_button)
                random_sleep(1, 2)
                print("Clicking next page button using JavaScript")
                driver.execute_script("arguments[0].click();", next_button)
                current_page += 1
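                WebDriverWait(driver, 10).until(lambda d: d.find_element(By.CSS_SELECTOR, "body"))
                # Wait until the browser reports that the new page has finished loading
                WebDriverWait(driver, 10).until(
                    lambda d: d.execute_script("return document.readyState") == "complete"
                )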
                random_sleep(2, 4)
            else:
                print("Next page button not found - reached the last page")
                break

        except Exception as e:
            print(f"Error navigating to next page: {str(e)}")
            break

    print(f"Total unique listings found: {len(all_listing_urls)}")
    return all_listing_urls


#### 4. Scrape Listing Details

In [30]:
def extract_detailed_info(soup, url):
    """Extract detailed property information from an expose page"""
    property_data = {'url': url}

    title_elem = soup.find('h1', id='expose-title')
    if title_elem:
        property_data['title'] = title_elem.text.strip()

    address_elem = soup.find('span', class_='zip-region-and-country')
    if address_elem:
        property_data['address'] = address_elem.text.strip()

    # Extract primary details using CSS selectors
    detail_selectors = {
        'cold_rent': 'div.is24qa-kaltmiete-main',
        'warm_rent': 'div.is24qa-warmmiete-main',
        'area': 'div.is24qa-flaeche-main',
        'rooms': 'div.is24qa-zi-main',
    }
    for key, selector in detail_selectors.items():
        elem = soup.select_one(selector)
        if elem:
            property_data[key] = elem.text.strip()

    # Extract additional details by trying multiple attribute lookups
    # (a key that repeats one from above, e.g. 'rooms', overwrites the earlier value)
    detail_items = {
        'price': 'is24qa-kaufpreis',
        'sizeliving': 'is24qa-wohnflaeche-ca',
        'rooms': 'is24qa-zimmer',
        'construction_year': 'is24qa-baujahr',
        'condition': 'is24qa-objektzustand',
        'heating_type': 'is24qa-heizungsart',
        'floor': 'is24qa-etage',
        'total_floors': 'is24qa-etagenzahl',
        'balcony': 'is24qa-balkon-terrasse-label',
        'garden': 'is24qa-garten-mitbenutzung-label',
        'basement': 'is24qa-keller-label',
        'elevator': 'is24qa-personenaufzug-label',
        'parking': 'is24qa-garage-stellplatz',
        'available_from': 'is24qa-bezugsfrei-ab',
        'kitchen': 'is24qa-einbaukueche-label',
        'furnished': 'is24qa-mobiliar',
        'guest_toilet': 'is24qa-gaeste-wc-label',
        'property_type': 'is24qa-typ',
        'energy_certificate': 'is24qa-energieausweis',
        'energy_efficiency_class': 'is24qa-energieeffizienzklasse',
    }
    for key, item_id in detail_items.items():
        elem = soup.find(class_=item_id) or soup.find(attrs={"data-id": item_id}) or soup.find(id=item_id)
        if elem:
            property_data[key] = elem.text.strip()

    # Extract description if available
    description_elem = soup.find('pre', id='expose-description')
    if description_elem:
        property_data['description'] = description_elem.text.strip()

    return property_data

#### 5. Main Scraper Runner

In [31]:
def main(search_url, output_file="properties.csv", max_pages=10):
    """Main scraping workflow that gathers property listings and extracts detailed information."""

    # Configure Chrome options for Selenium with undetected_chromedriver
    options = uc.ChromeOptions()
    options.headless = False
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--disable-extensions")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-infobars")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-browser-side-navigation")
    options.add_argument("--disable-gpu")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

    driver = uc.Chrome(options=options)

    try:
        # Extract listing URLs from the search results pages
        listing_urls = extract_listing_urls(search_url, driver, max_pages)
        if not listing_urls:
            print("No property listings found. Exiting.")
            return

        all_data = []
        successful_scrapes = 0

        # Iterate over each listing URL and scrape detailed property info
        for i, url in enumerate(listing_urls):
            try:
                print(f"Scraping {i+1}/{len(listing_urls)}: {url}")
                driver.get(url)
                random_sleep(3, 6)
                soup = BeautifulSoup(driver.page_source, 'html.parser')
                property_data = extract_detailed_info(soup, url)
                all_data.append(property_data)
                successful_scrapes += 1

                # Save progress after every 5th listing to limit data loss if an error occurs.
                if (i + 1) % 5 == 0:
                    save_to_csv(all_data, output_file)
                    save_to_json(all_data, output_file.replace('.csv', '.json'))
                    print(f"Progress saved: {successful_scrapes}/{len(listing_urls)} properties")

            except Exception as e:
                print(f"Error scraping {url}: {e}")

        save_to_csv(all_data, output_file)
        save_to_json(all_data, output_file.replace('.csv', '.json'))
        print(f"Completed scraping {successful_scrapes}/{len(listing_urls)} properties")

    finally:
        driver.quit()
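
Because progress is saved periodically, an interrupted run could be resumed instead of restarted. The following sketch assumes the JSON progress file written by `save_to_json` (a list of dicts, each with a `url` key). The helper names `load_scraped_urls` and `pending_urls` are hypothetical and not part of the code above.

```python
import json
import os


def load_scraped_urls(json_file):
    """Return the set of URLs already present in a saved progress file."""
    if not os.path.exists(json_file):
        return set()
    with open(json_file, encoding="utf-8") as f:
        records = json.load(f)
    return {r["url"] for r in records if "url" in r}


def pending_urls(listing_urls, json_file):
    """Keep only the listing URLs that have not been scraped yet, in order."""
    done = load_scraped_urls(json_file)
    return [u for u in listing_urls if u not in done]
```

To avoid overwriting earlier results, the previously saved records would also need to be loaded into `all_data` before scraping continues.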

#### 6. Run it

In [32]:
if __name__ == "__main__":
    search_url = "https://www.immobilienscout24.de/Suche/de/schleswig-holstein/kiel/wohnung-mieten?enteredFrom=one_step_search"
    main(search_url, "immoscout_properties.csv", max_pages=10)

Found 20 listings on page 1
Clicking next page button using JavaScript
Found 20 listings on page 2
Clicking next page button using JavaScript
Found 20 listings on page 3
Clicking next page button using JavaScript
Found 20 listings on page 4
Clicking next page button using JavaScript
Found 20 listings on page 5
Clicking next page button using JavaScript
Found 20 listings on page 6
Clicking next page button using JavaScript
Found 20 listings on page 7
Clicking next page button using JavaScript
Found 20 listings on page 8
Clicking next page button using JavaScript
Found 20 listings on page 9
Clicking next page button using JavaScript
Found 20 listings on page 10
Reached maximum number of pages (10)
Total unique listings found: 200
Scraping 1/200: https://www.immobilienscout24.de/expose/158397021
Scraping 2/200: https://www.immobilienscout24.de/expose/158394744
Scraping 3/200: https://www.immobilienscout24.de/expose/158393664
Scraping 4/200: https://www.immobilienscout24.de/expose/15635680

# Part 3 of the Exercise
## What makes this data interesting?

Looking at the rental listings we scraped, there are several aspects that make this data particularly valuable:

First, we have a rich set of **structured data points** including:
- Rental prices (both cold and warm rent)
- Apartment sizes ranging from 20m² to over 200m²
- Room counts from 1 to 5 rooms
- Various amenities like balconies, basements, and elevators

Second, the **geographic distribution** is fascinating! The data covers diverse neighborhoods across Kiel - from Gaarden-Ost and Mettenhof to upscale areas like Blücherplatz. This gives us a great opportunity to analyze location-based price differences.
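
As a first step toward such a location-based comparison, rent per square metre can be computed per district. The sketch below uses the four listings from the sample output above, with rent and area converted to numbers by hand. The function name `rent_per_sqm_by_district` and the record layout are assumptions for illustration, and four listings are far too few for any real conclusion.

```python
from collections import defaultdict
from statistics import median

# Values taken from the sample output above (district, cold rent in EUR, area in m²)
LISTINGS = [
    {"district": "Düsternbrook", "cold_rent": 880.0, "area": 63.0},
    {"district": "Hassee", "cold_rent": 858.0, "area": 78.0},
    {"district": "Vorstadt", "cold_rent": 1099.0, "area": 68.0},
    {"district": "Gaarden-Süd und Kronsburg", "cold_rent": 2400.0, "area": 205.0},
]


def rent_per_sqm_by_district(listings):
    """Return the median cold rent per m² for each district."""
    per_district = defaultdict(list)
    for item in listings:
        if item["area"] > 0:  # skip records with a zero area to avoid division errors
            per_district[item["district"]].append(item["cold_rent"] / item["area"])
    return {district: median(values) for district, values in per_district.items()}


if __name__ == "__main__":
    for district, value in sorted(rent_per_sqm_by_district(LISTINGS).items()):
        print(f"{district}: {value:.2f} EUR/m²")
```

The median is used instead of the mean so that a single luxury listing does not dominate a district's figure.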

Third, there are interesting **property quality indicators** like:
- Building condition descriptions ("Neuwertig," "Saniert," "Gepflegt")
- Construction years (some dating back to 1900!)
- Various amenity combinations

## Addressing the specific questions:

### Are there labels for supervised ML?
The data provides several potential target variables we could predict:
- Rental prices would be perfect for regression models
- Property conditions could be used for classification
- We could even predict property types based on features
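
Before rental prices can serve as a regression target, the scraped strings (for example `1.099 €` or `63  m²`) have to be converted to numbers. A minimal sketch, assuming German number formatting (`.` as thousands separator, `,` as decimal separator); the helper name `parse_german_number` is hypothetical.

```python
import re


def parse_german_number(text):
    """Convert a string like '1.099 €' or '63,5 m²' to a float, or None if no number is found."""
    # First run of digits, optionally with '.' thousands separators and a ',' decimal part
    match = re.search(r"\d[\d.]*(?:,\d+)?", text)
    if not match:
        return None
    number = match.group(0).replace(".", "").replace(",", ".")
    return float(number)


if __name__ == "__main__":
    for raw in ["1.099 €", "63  m²", "2,5", "keine Angabe"]:
        print(repr(raw), "->", parse_german_number(raw))
```

Returning `None` for unparseable values keeps missing data explicit, so it can be filtered or imputed before model training.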

### Does the data contain network structures?
No, this dataset doesn't have social network elements like replies or shares. Each listing stands alone as an independent property offering.

### Could we build a RAG system?
A retrieval-augmented generation system would work well with this data. We could combine the structured property attributes with the text descriptions to create a system that answers natural language queries about housing options in Kiel.

## Proposed use case: Rental Price Prediction & Recommendation Engine

A system could be built that:

1. **Predicts fair rental prices** based on features like location, size, and amenities
   - This would help students like us know if we're getting a good deal.

2. **Analyzes neighborhood affordability**
   - Our data shows huge price disparities, from €300 for a small place in Dietrichsdorf to €3,400 for luxury apartments near Schrevenpark

3. **Creates personalized recommendations**
   - Imagine telling the system "I need a 2-bedroom with a balcony near the university for under €700" and getting matching options

4. **Provides natural language insights**
   - A conversational interface could make apartment hunting much easier
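
The personalized-recommendation idea in point 3 could start as a plain filter over numeric fields before any ML is involved. A sketch under assumptions: the listings are dicts whose `cold_rent` and `rooms` values have already been converted to numbers, a non-empty `balcony` key means the listing has a balcony, and `recommend` is a hypothetical helper name. The records in the usage part are made up for illustration and are not scraped data.

```python
def recommend(listings, max_rent, min_rooms=1, require_balcony=False):
    """Return listings matching the criteria, cheapest first."""
    matches = []
    for item in listings:
        rent = item.get("cold_rent")
        rooms = item.get("rooms")
        if rent is None or rooms is None:
            continue  # skip records with missing key fields
        if rent > max_rent or rooms < min_rooms:
            continue
        if require_balcony and not item.get("balcony"):
            continue
        matches.append(item)
    return sorted(matches, key=lambda item: item["cold_rent"])


if __name__ == "__main__":
    # Made-up records for illustration only
    sample = [
        {"title": "A", "cold_rent": 650.0, "rooms": 2, "balcony": "Balkon/Terrasse"},
        {"title": "B", "cold_rent": 600.0, "rooms": 2},
        {"title": "C", "cold_rent": 900.0, "rooms": 3, "balcony": "Balkon/Terrasse"},
    ]
    for hit in recommend(sample, max_rent=700, min_rooms=2, require_balcony=True):
        print(hit["title"])
```

A natural language front end would then only need to translate a request into these filter parameters.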

