#  Scraper
Our goal in this notebook is to scrape data on house sales in the city of Thessaloniki. We will scrape data from the biggest Greek website for house sales: spitogatos.gr. First we import all the necessary libraries. We are going to use Selenium for the scraping.


In [1]:
import datetime
import logging
import os
import time

import numpy as np
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import (
    NoSuchElementException,
    StaleElementReferenceException,
    TimeoutException,
    WebDriverException,
)
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

from scrape import many_page_scrape, get_driver

First we need to know the two main Selenium objects. A WebDriver is a Python object that controls a browser session. A WebElement is a Python object representing an HTML element on a web page, as found through the WebDriver. It allows you to interact with elements on the page, such as clicking buttons, entering text, or reading content, using Python code. As a start we initialize a WebDriver instance. When we run this cell a Chromium window should open. Maximize the window.

In [2]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))


Let us now manually go to the desired website, spitogatos.gr. Notice that the website requires us to solve a captcha before we can access the listings. Throughout this script we will have to solve captchas manually every now and then, since the website shows one whenever it suspects bot activity. First let us explore this webpage a bit.

![Alt text](Screenshot.png)

Searching for houses in Thessaloniki, Greece we get:  ![Alt text](Screenshot3.png)

So what we want is to scrape the HTML code of this website. Notice that there are 30 entries on each page and more than a thousand pages in total. For that reason we have created the scrape.py file, which contains all the functions used in the scraping process. These functions were built using [selenium](https://www.selenium.dev/documentation/webdriver/), following the relevant documentation. For more details take a look at the scrape.py file.

![Alt text](Screenshot22.png)

Notice that the houses are all displayed without any filters or attributes besides the basics that we see on the entries: floor, number of rooms, number of bathrooms, price and total area. So at first glance these are the only attributes that we can scrape without much trouble.

To gather more information, like the year the house was built and other characteristics, we need to be more creative. We have created a function in the scrape.py file, called many_page_scrape, which takes a URL and scrapes all the available pages for that URL. So to find the year each house was built, we apply a filter that displays only houses built in 1952. Once our function has scraped all those entries we continue with 1953, 1954 and so on until 2024. The script saves these data as tables with columns floor, number of rooms, price, total area, location, year of construction.
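The year-by-year idea can be sketched without a browser. In the sketch below, `fake_scrape_for_year` is a hypothetical stand-in for a filtered call to `many_page_scrape`, and the rows it returns are made-up placeholders, not real listings. The only point is to show how the filter value becomes a column.

```python
from typing import Callable


def fake_scrape_for_year(year: int) -> list[dict]:
    """Hypothetical stand-in for a scrape with the year filter applied."""
    canned = {
        1952: [{"Location": "A", "Price": 90000}],
        1953: [{"Location": "B", "Price": 120000},
               {"Location": "C", "Price": 75000}],
    }
    return canned.get(year, [])


def scrape_by_year(years: range, scrape_one: Callable[[int], list[dict]]) -> list[dict]:
    """Scrape one construction year at a time and tag every row with it."""
    rows = []
    for year in years:
        for row in scrape_one(year):
            # The listing card does not show the year, so we record the
            # filter value that produced this row.
            rows.append({**row, "Year_of_construction": year})
    return rows


rows = scrape_by_year(range(1952, 1955), fake_scrape_for_year)
for row in rows:
    print(row)
```

Passing the scraping function in as an argument keeps the loop independent of Selenium, so the tagging logic can be checked on its own.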


After we scrape all years we will similarly apply all the other filters and get different tables with the same entries. After cross-examining all the different tables we can construct a final file containing detailed data for all the houses.
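One way the cross-examination could work is sketched below. It is an assumption, not the notebook's actual method: the notebook does not say how the same house is recognised across tables, so the sketch uses a hypothetical key built from the basic columns. `combine_tables`, `listing_key` and the sample rows are all illustrative.

```python
# Columns used to recognise the same listing across tables (an assumption).
KEY_COLUMNS = ("Location", "Price", "Total_area", "Floor", "Rooms", "Bathrooms")


def listing_key(row: dict) -> tuple:
    return tuple(row[col] for col in KEY_COLUMNS)


def combine_tables(tables: dict) -> list[dict]:
    """Merge per-filter tables into one row per listing.

    `tables` maps a filter (e.g. "with_parking" or 1975) to the rows scraped
    under it. Integer filters are treated as construction years; string
    filters become True/False columns.
    """
    merged = {}
    for attribute, rows in tables.items():
        for row in rows:
            record = merged.setdefault(listing_key(row), dict(row))
            if isinstance(attribute, int):
                record["Year_of_construction"] = attribute
            else:
                record[attribute] = True

    # Listings never seen under a filter get an explicit False / None.
    flag_names = [a for a in tables if isinstance(a, str)]
    for record in merged.values():
        for name in flag_names:
            record.setdefault(name, False)
        record.setdefault("Year_of_construction", None)
    return list(merged.values())


# Made-up rows for illustration only.
flat = {"Location": "A", "Price": 100000, "Total_area": 80,
        "Floor": 2, "Rooms": 3, "Bathrooms": 1}
house = {"Location": "B", "Price": 250000, "Total_area": 140,
         "Floor": 0, "Rooms": 4, "Bathrooms": 2}

final = combine_tables({
    "with_parking": [house],
    "with_balcony": [flat, house],
    1975: [flat],
})
for record in final:
    print(record)
```

A key like this merges two different houses that happen to share all the basic values, so a real pipeline would need a stronger identifier if one is available.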

We start with the scraping process first. Notice that if we wanted to scrape a different location than Thessaloniki we would have to change the urls in the scrape.py file.

The following function, `main`, performs the scraping process by attribute, with folder management and captcha handling.




**Parameters:**
- `folder_url` (str): The path to the main folder where the data will be saved.
- `attribute` (str or int): The attribute by which to filter house listings (e.g., "autonomous_heating"), or a construction year.
- `start_at_first_page` (bool, optional): If `True`, initializes the driver to start at the first page with the given attribute filter. Default is `True`.
- `custom_scrape` (bool, optional): If `True`, initializes the driver to a custom webpage. Default is `False`.

**Workflow:**
1. Optionally initializes the Selenium driver to the first page with the specified attribute.
2. Creates the main data folder and a subfolder for the specific attribute, handling any errors if folders already exist.
3. Prints the attribute being scraped.
4. Initializes an empty DataFrame with columns for house details.
5. Calls `many_page_scrape` to scrape all pages for the specified attribute, handling captchas by prompting the user to solve them manually if a `TimeoutException` occurs.
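Step 5 can be looked at in isolation. The sketch below is a simplified, browser-free version of the retry loop: `CaptchaTimeout` stands in for Selenium's `TimeoutException`, and `scrape_with_manual_retry`, `flaky_scrape` and `scripted_ask` are hypothetical names introduced only for this illustration.

```python
class CaptchaTimeout(Exception):
    """Stand-in for selenium's TimeoutException so the sketch runs anywhere."""


def scrape_with_manual_retry(scrape, ask, max_silent_retries=1):
    """Call `scrape()` until it succeeds or the user chooses to skip.

    `ask(prompt)` plays the role of input(); passing it in keeps the loop
    testable. Returns the scrape result, or None if the user skipped.
    """
    failures = 0
    while True:
        try:
            return scrape()
        except CaptchaTimeout:
            failures += 1
            ask("Solve the captcha in the browser, then press Enter to retry...")
            # After repeated failures, offer to give up on this attribute.
            if failures > max_silent_retries:
                answer = ask("Skip to the next attribute? (yes/no): ")
                if answer.strip().lower() == "yes":
                    return None


# Demo: the first two attempts hit a "captcha", the third succeeds.
attempts = {"count": 0}


def flaky_scrape():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise CaptchaTimeout()
    return ["row-1", "row-2"]


prompts = []


def scripted_ask(prompt):
    prompts.append(prompt)
    return "no"  # never skip in this demo


result = scrape_with_manual_retry(flaky_scrape, scripted_ask)
print(result, attempts["count"], len(prompts))
```

In the notebook itself, `ask` would simply be the built-in `input`, and `scrape` a call to `many_page_scrape`.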

**Note:**  
- The function expects the global `driver` object and the `many_page_scrape` and `get_driver` functions to be defined and imported.
- Data for each attribute is saved in its corresponding subfolder under the main folder.
- The scraped data are saved in batches of 50 pages (around 1500 entries) each. The last batch is usually smaller and has a different name.
- The script removes duplicate entries before saving them. This was necessary because spitogatos.gr often contains duplicate house ads.
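The batching and de-duplication described in these notes can be sketched with the standard library alone. This is an illustration under assumptions: the file names, the `save_in_batches` and `drop_duplicates` helpers, and the choice to remove duplicates within each batch are all hypothetical, and the real logic lives in scrape.py.

```python
import csv
import tempfile
from pathlib import Path

PAGES_PER_BATCH = 50  # batch size stated in the notes above


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def save_in_batches(pages, out_dir, pages_per_batch=PAGES_PER_BATCH):
    """Write one CSV per batch of pages; the file names are hypothetical."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(pages), pages_per_batch):
        batch = pages[start:start + pages_per_batch]
        rows = drop_duplicates([row for page in batch for row in page])
        if not rows:
            continue
        path = out_dir / f"pages_{start + 1}_to_{start + len(batch)}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written


# Three tiny made-up "pages"; the second repeats a listing from the first.
pages = [
    [{"Location": "A", "Price": 100}, {"Location": "B", "Price": 200}],
    [{"Location": "A", "Price": 100}],
    [{"Location": "C", "Price": 300}],
]
with tempfile.TemporaryDirectory() as tmp:
    files = save_in_batches(pages, Path(tmp) / "Houses_data_demo", pages_per_batch=2)
    file_names = [f.name for f in files]
    first_batch_lines = files[0].read_text(encoding="utf-8").splitlines()

print(file_names)
print(first_batch_lines)
```

Saving per batch means a captcha or crash late in a long run costs at most one batch of pages rather than the whole attribute.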

In [3]:

def main(folder_url, attribute, start_at_first_page=True, custom_scrape=False):
    # Point the global driver at the first results page for this attribute.
    if start_at_first_page and not custom_scrape:
        get_driver(driver, attribute=attribute)
    elif start_at_first_page and custom_scrape:
        get_driver(driver, attribute=attribute, custom_scrape=True)

    # Create the main data folder if it does not exist yet.
    try:
        os.mkdir(folder_url)
        print(f"Folder '{folder_url}' created successfully.")
    except FileExistsError:
        print(f"Folder '{folder_url}' already exists.")
    except OSError as error:
        print(f"Error creating folder '{folder_url}': {error}")

    # Create the subfolder that holds the data for this attribute.
    try:
        os.mkdir(f"{folder_url}/Houses_data_{attribute}")
        print(f"Folder Houses_data_{attribute} created successfully.")
    except FileExistsError:
        print(f"Folder Houses_data_{attribute} already exists.")
    except OSError as error:
        print(f"Error creating folder Houses_data_{attribute}: {error}")

    print(f"We are scraping attribute {attribute}")

    # Empty table that many_page_scrape fills in.
    df = pd.DataFrame(columns=["Location", "Price", "Total_area", "House_type",
                               "Floor", "Rooms", "Bathrooms", "submission_date"])
    attempt = 1
    while True:
        try:
            df = many_page_scrape(driver, df, attribute=attribute)
            break  # Exit loop if successful
        except TimeoutException:
            # A timeout is treated as a captcha blocking the page.
            input("TimeoutException: Please solve the captcha in the browser, then press Enter to retry...")
            attempt += 1
            # From the second timeout on, offer to skip this attribute.
            if attempt > 2:
                yes_no = input("Would you like to continue to the next attribute? (yes/no): ")
                if yes_no.lower() == 'yes':
                    break

        


The attribute variable can be any of the following strings:

['autonomous_heating', 'central_heating', 'individual_heating', 'no_heating', 'petrol_heating', 'natural_gas_heating', 'LPG_heating', 'electrical_heating', 'thermal_storage_heating', 'wood_heating', 'pellet_heating', 'heat_pump_heating', 'with_AC', 'with_storage_room', 'with_elavator', 'with_solar_heater', 'with_fireplace', 'Furnished', 'with_parking', 'with_garden', 'with_pool', 'with_balcony', 'last_floor']

or an integer construction year from 1952 to 2024.

In [None]:
folder_name =f"/home/tsantaris/OneDrive/Data science and AI stuff/Project spitogatos/Houses_data_{datetime.date.today()}"

attributes = ['autonomous_heating', 'central_heating', 'individual_heating', 'no_heating', 'petrol_heating', 'natural_gas_heating', 'LPG_heating',
              'electrical_heating', 'thermal_storage_heating', 'wood_heating', 'pellet_heating', 'heat_pump_heating', 'with_AC', 'with_storage_room',
              'with_elavator', 'with_solar_heater', 'with_fireplace', 'Furnished', 'with_parking', 'with_garden', 'with_pool', 'with_balcony',
              'last_floor'] + list(range(1952, 2026))  # construction years 1952 to 2025 inclusive
for attribute in attributes:
    main(folder_name, attribute, start_at_first_page=True, custom_scrape=True)

Folder '/home/tsantaris/OneDrive/Data science and AI stuff/Project spitogatos/Houses_data_2025-08-11' already exists.
Folder Houses_data_with_AC created successfully.
We are scraping attribute with_AC
the number of total pages to scrape is:1
the number of total pages to scrape is:5
Reached entry number: 30 on current page
We scraped page 1 of 5
Reached entry number: 30 on current page
We scraped page 2 of 5
Reached entry number: 30 on current page
We scraped page 3 of 5
Reached entry number: 30 on current page
We scraped page 4 of 5
Reached entry number: 26 on current page
Saved data from page 0 to 5
Folder '/home/tsantaris/OneDrive/Data science and AI stuff/Project spitogatos/Houses_data_2025-08-11' already exists.
Folder Houses_data_with_storage_room created successfully.
We are scraping attribute with_storage_room
the number of total pages to scrape is:12
Reached entry number: 30 on current page
We scraped page 1 of 12
Reached en