The Home Market Harvester is an end-to-end data pipeline that gathers, cleans, analyzes, models, and visualizes information about the real estate market. It focuses on a chosen area, comparing selected properties against the wider local market.
The system features an interactive dashboard that shows an overview of local market trends, compares selected properties with the market, and includes a map showing where the properties are located.
It gathers data from olx.pl and otodom.pl, two Polish property-listing websites. The pipeline runs locally on a personal computer and relies on free, open-source tools plus two external services for data enrichment: Nominatim, which provides location details, and openrouteservice, which calculates travel times. The dashboard is built with the streamlit framework, so it can be served at a local web address and shared with others.
scraping:
- Selenium: drives a real browser to interact with dynamically generated content and handle JavaScript interactions.
- Beautiful Soup: extracts the data from the HTML page source.
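A minimal sketch of how these two libraries typically cooperate is shown below; the URL, headless option, and extraction logic are illustrative placeholders, not the project's actual selectors or configuration.

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Placeholder URL -- the real pipeline targets olx.pl / otodom.pl with its own selectors.
LISTING_URL = "https://www.olx.pl/nieruchomosci/"

options = Options()
options.add_argument("--headless=new")  # run without opening a browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get(LISTING_URL)    # Selenium renders the JavaScript-driven page
    html = driver.page_source  # hand the rendered HTML to Beautiful Soup
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
# Hypothetical extraction: collect the text of every link on the page.
titles = [a.get_text(strip=True) for a in soup.find_all("a") if a.get_text(strip=True)]
print(titles[:10])
```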
cleaning:
- NumPy: support for large, multi-dimensional arrays and matrices.
- pandas: tools for reading, writing, and manipulating tabular data.
- jupyter: enables incremental development by running code in small chunks, allowing step-by-step data visualization, review, and iterative adjustments.
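A small, hypothetical cleaning step is sketched below; the column names and rules are illustrative, not the pipeline's actual schema.

```python
import numpy as np
import pandas as pd

# Hypothetical raw listings -- the columns are invented for the example.
raw = pd.DataFrame(
    {
        "price": ["500 000 zł", "620000", None, "500 000 zł"],
        "area_m2": ["55", "72.5", "48", "55"],
    }
)

cleaned = (
    raw.drop_duplicates()
    # strip currency text and whitespace, then coerce to numbers
    .assign(
        price=lambda df: pd.to_numeric(
            df["price"].str.replace(r"[^\d.]", "", regex=True), errors="coerce"
        ),
        area_m2=lambda df: pd.to_numeric(df["area_m2"], errors="coerce"),
    )
    .dropna(subset=["price"])
    .assign(price_per_m2=lambda df: np.round(df["price"] / df["area_m2"], 2))
)
print(cleaned)
```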
data enrichment:
- Nominatim: transforms addresses into geographic coordinates using OpenStreetMap data, enhancing the mapping and visualization of property listings.
- openrouteservice: calculates routes and travel times from OpenStreetMap data for the travel information displayed on the map.
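The sketch below shows one way these two services could be queried, using the geopy client for Nominatim and a plain HTTP request to the openrouteservice directions endpoint; the ORS_API_KEY variable name is an assumption, and the project may wire these calls up differently.

```python
import os

import requests
from geopy.geocoders import Nominatim

# Geocode a place name with Nominatim (via the geopy client).
geolocator = Nominatim(user_agent="home-market-harvester-example")
location = geolocator.geocode("Mierzęcice, śląskie, Poland")
if location is None:
    raise SystemExit("Address could not be geocoded")
print(location.latitude, location.longitude)

# Ask openrouteservice for a driving route to the destination coordinates.
# ORS_API_KEY is a hypothetical environment-variable name for the API key.
response = requests.post(
    "https://api.openrouteservice.org/v2/directions/driving-car",
    headers={"Authorization": os.environ["ORS_API_KEY"]},
    json={
        "coordinates": [
            [location.longitude, location.latitude],  # origin (lon, lat)
            [21.047047, 52.203531],                   # destination (lon, lat)
        ]
    },
    timeout=30,
)
response.raise_for_status()
duration_s = response.json()["routes"][0]["summary"]["duration"]
print(f"Estimated travel time: {duration_s / 60:.0f} minutes")
```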
model developing:
- scikit-learn: machine-learning library with tools for data analysis and pattern detection, including regression models that train quickly and accurately even on small datasets.
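As an illustration of the kind of model involved, the sketch below fits a linear regression on synthetic listing data; the features and target are invented for the example and do not reflect the project's actual model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic example: predict price from area and distance to the destination.
rng = np.random.default_rng(0)
area = rng.uniform(30, 120, size=200)
distance_km = rng.uniform(1, 30, size=200)
price = 8_000 * area - 5_000 * distance_km + rng.normal(0, 20_000, size=200)

X = np.column_stack([area, distance_km])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("MAE:", round(mean_absolute_error(y_test, model.predict(X_test))))
```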
data visualizing:
- Streamlit: framework that turns data scripts into interactive web apps for data analysis and machine learning with minimal coding.
- matplotlib: creates static, interactive, and animated visualizations in Python; used for the bar charts and the map.
- seaborn: visualization library built on matplotlib that provides a high-level interface for drawing attractive, informative statistical graphics.
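A minimal, hypothetical dashboard sketch combining these libraries is shown below; the data and chart are placeholders, whereas the real dashboard reads the processed pipeline output.

```python
# app_sketch.py -- run with: streamlit run app_sketch.py
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import streamlit as st

# Hypothetical data -- the real dashboard reads the processed pipeline output.
offers = pd.DataFrame(
    {"district": ["A", "A", "B", "B", "C"], "price_per_m2": [9800, 10100, 8700, 9100, 11250]}
)

st.title("Local market overview (sketch)")
st.dataframe(offers)

fig, ax = plt.subplots()
sns.barplot(data=offers, x="district", y="price_per_m2", ax=ax)
st.pyplot(fig)
```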
- data: houses both raw and processed datasets.
- logs: archives logs from pipeline operations, such as scraping and system activity.
- model: stores the machine-learning models trained on the housing data.
- notebooks: contains Jupyter Notebooks for data analysis, cleaning, and model development.
- pipeline: the backbone of the project, with scripts for scraping, cleaning, model development, and visualization.
- .env: defines the environment variables required for the pipeline to run properly.
Each stage of the pipeline (a_scraping, b_cleaning, c_model_developing, d_data_visualizing) is executed sequentially:
- Scraping (a_scraping): collects the initial offer data from the listing sites.
- Cleaning (b_cleaning): improves data quality by removing errors and preparing the data for analysis.
- Model Developing (c_model_developing): builds and refines the machine-learning models.
- Data Visualizing (d_data_visualizing): presents the data and insights through the interactive dashboard.
Subdirectories such as orchestration and config support these stages with tools, helper functions, and configuration management for smooth pipeline operation.
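Conceptually, the flow can be pictured as a single process calling one function per stage, as in the hypothetical sketch below; the function names and signatures are illustrative, and the real entry point is pipeline/run_pipeline.py.

```python
# Hypothetical orchestration sketch: the stage names mirror the pipeline
# directories, but the functions and their signatures are illustrative.
from pathlib import Path


def scrape(location: str, radius_km: int, offers_cap: int) -> Path:
    print(f"a_scraping: {offers_cap} offers within {radius_km} km of {location}")
    return Path("data/raw")


def clean(raw_dir: Path) -> Path:
    print(f"b_cleaning: {raw_dir}")
    return Path("data/processed")


def develop_model(clean_dir: Path) -> Path:
    print(f"c_model_developing: {clean_dir}")
    return Path("model/price_model.joblib")


def visualize(clean_dir: Path, model_path: Path) -> None:
    print(f"d_data_visualizing: {clean_dir}, {model_path}")


if __name__ == "__main__":
    raw = scrape("Warszawa", 25, 100)
    processed = clean(raw)
    model = develop_model(processed)
    visualize(processed, model)
```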
See the Pipfile for the project's dependencies. To set up the project environment:
pip install pipenv
pipenv install
pipenv shell
🚨 Note: the pipeline relies on external data sources, which may be affected by A/B tests, frontend changes, anti-bot measures, and server failures.
Found in the pipeline/config directory, this setup makes it easier to manage API keys, file paths, and server settings:
- Dynamic naming with run_pipeline.conf: the MARKET_OFFERS_TIMEPLACE variable automatically names data storage directories with a timestamp and location, such as 2024_02_20_16_37_54_Mierzęcice__Będziński__Śląskie, keeping the data organized and easy to find.
- Security with the .env file: sensitive details such as API keys, USER_OFFERS_PATH, CHROME_DRIVER_PATH, and CHROME_BROWSER_PATH are stored here rather than in the code.
- Before running the pipeline, obtain an API key for openrouteservice and set it, along with CHROME_DRIVER_PATH, CHROME_BROWSER_PATH, and USER_OFFERS_PATH, in the .env file.
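The snippet below sketches how the timestamped directory name and the .env-backed settings might be read in Python; the exact formatting, loading mechanism, and the API-key variable name are assumptions.

```python
import os
from datetime import datetime

# Sketch of the dynamic naming scheme described above; the project's exact format may differ.
location = "Mierzęcice__Będziński__Śląskie"
timestamp = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
market_offers_timeplace = f"{timestamp}_{location}"
print(market_offers_timeplace)  # e.g. 2024_02_20_16_37_54_Mierzęcice__Będziński__Śląskie

# Secrets and machine-specific paths come from the environment (.env), not from code.
chrome_driver_path = os.environ.get("CHROME_DRIVER_PATH")
chrome_browser_path = os.environ.get("CHROME_BROWSER_PATH")
user_offers_path = os.environ.get("USER_OFFERS_PATH")
ors_api_key = os.environ.get("ORS_API_KEY")  # hypothetical name for the openrouteservice key
```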
The app is executed by running the run_pipeline.py script found in the pipeline directory:
python pipeline/run_pipeline.py --location_query "Location Name" --area_radius <radius in kilometers> --scraped_offers_cap <maximum number of offers> --destination_coords <latitude, longitude> --user_data_path <path to your data.csv>
For example, to collect up to 100 housing offers within 25 km of Warsaw, calculate travel times to the destination at coordinates (52.203531, 21.047047), and compare the offers with your own data stored at D:\path\user_data.csv (this last step is optional), use the following command:
python pipeline/run_pipeline.py --location_query "Warszawa" --area_radius 25 --scraped_offers_cap 100 --destination_coords "52.203531, 21.047047" --user_data_path "D:\path\user_data.csv"
The notebooks directory includes Jupyter Notebooks that provide an interactive environment for developing and exploring the data. These notebooks are intended for development only, not for production.
Each pipeline stage can also be run independently as a Python script, except for the d_data_visualizing stage, which uses the streamlit framework to produce the interactive visualizations. For more details on this component, see the streamlit_README.
The tests directory contains scripts that check the functionality and reliability of the different parts of the pipeline. Currently, only the scraping stage has automated tests.
To execute the tests, use the following commands:
pipenv shell # at the root of the project
python -m unittest discover -s tests -p 'test_*.py'
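The block below is an illustrative example of the kind of test that could live under tests/; the helper function, file name, and sample HTML are hypothetical and do not reflect the project's actual test cases.

```python
# tests/test_parsing_example.py -- illustrative only; the real tests cover the scraping stage.
import unittest

from bs4 import BeautifulSoup

SAMPLE_HTML = '<div class="offer"><span class="price">500 000 zł</span></div>'


def extract_price(html: str) -> str:
    """Hypothetical helper mirroring the kind of parser the scraping stage uses."""
    return BeautifulSoup(html, "html.parser").select_one(".price").get_text(strip=True)


class TestPriceExtraction(unittest.TestCase):
    def test_extracts_price_text(self):
        self.assertEqual(extract_price(SAMPLE_HTML), "500 000 zł")


if __name__ == "__main__":
    unittest.main()
```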
During development, four significant insights were gained:
- Preserving HTML Source Code for Data Integrity: because web-scraping sources are unstable, the HTML source code of each listing is saved. This prevents data loss during processing and makes it easier to re-extract data if listings change. HTML files are small, so the practice costs little disk space or performance; a minimal sketch of it follows this list.
- Executing Python Scripts: running Python scripts directly from .py files is more effective than converting Jupyter Notebooks to .py files and then running them, which often causes library-compatibility issues. Direct execution avoids these problems and ensures smoother development.
- Codebase Structure Simplification: the project initially ran each step as a separate subprocess, but this complexity hindered testing because the subprocesses behave differently under the unittest environment. Integrating the codebase and using function calls within a single process proved easier to test and maintain.
- Updating Environment Variables During Runtime: to prevent issues with environment variables not updating correctly at runtime, it is better to modify the relevant system files directly within the project.
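As referenced in the first insight, the sketch below shows one simple way the raw HTML of each listing could be archived; the directory layout and naming scheme are illustrative, not the project's actual conventions.

```python
import hashlib
from pathlib import Path

# Minimal sketch of the "save the raw HTML alongside parsed data" practice.
RAW_HTML_DIR = Path("data/raw/html")


def archive_listing_html(url: str, html: str) -> Path:
    """Write the page source to disk so parsing can be re-run later without re-scraping."""
    RAW_HTML_DIR.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = RAW_HTML_DIR / name
    path.write_text(html, encoding="utf-8")
    return path


saved = archive_listing_html("https://www.olx.pl/example-offer", "<html>...</html>")
print(saved)
```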
This project is licensed under the terms of the LICENSE file located in the project root.
Note: This README covers the overall project. For detailed information on specific components or stages, please see the README files in the respective stage directories.