In [1]:
import warnings

import pandas as pd
from pandas import DataFrame
import numpy as np

import requests
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')

<h1 style="color:black; background-color:white; padding:10px; padding-bottom:10px;text-align: center;">The role of web scraping in data analysis of estate market</h1>

<h2 style="color:black; background-color:white; padding:5px; padding-bottom:10px; margin-bottom:-10px">I. Introduction</h2>
<p style="color:black; background-color:white; padding:5px; padding-bottom:20px;margin-bottom:-10px">
In the era of Big Data, businesses and researchers are constantly seeking valuable insights and meaningful information hidden within vast amounts of online data. One of the most powerful tools at their disposal is web scraping, a technique that plays a crucial role in extracting and analyzing data from websites. Web scraping enables the automated retrieval of structured data from web pages, transforming unstructured information into valuable datasets for further analysis.
<p style="color:black; background-color:white; padding:5px;padding-bottom:20px;margin-bottom:-10px">
Web scraping has become increasingly important in data analysis due to several key reasons. First and foremost, the internet is a treasure trove of valuable information. Websites contain a wealth of data ranging from product prices and customer reviews to news articles and social media posts. By leveraging web scraping techniques, analysts can collect, aggregate, and analyze this data at scale, providing valuable insights and a competitive edge.</p>
<p style="color:black; background-color:white; padding:5px;padding-bottom:20px;margin-bottom:-10px">Furthermore, web scraping allows analysts to access data from multiple sources, regardless of whether those sources provide APIs or data feeds. While some websites offer dedicated APIs, many do not, making web scraping the go-to solution for acquiring data that would otherwise be inaccessible. This accessibility to a wide range of data sources empowers analysts to conduct comprehensive and cross-domain research, uncovering trends, patterns, and correlations that were previously out of reach.</p>
<p style="color:black; background-color:white; padding:5px;padding-bottom:20px;margin-bottom:-10px">Another crucial aspect of web scraping in data analysis is the ability to monitor and track changes in online data over time. Businesses can utilize web scraping to keep an eye on competitors, monitor market dynamics, and gather valuable intelligence. By tracking changes in pricing, product availability, or customer sentiment, organizations can make data-driven decisions, adapt their strategies, and stay ahead in a rapidly evolving marketplace.</p>
<p style="color:black; background-color:white; padding:5px;padding-bottom:20px;margin-bottom:-10px">Moreover, web scraping enables the creation of large-scale datasets for training machine learning models and other data-driven applications. By systematically collecting data from diverse sources, analysts can build robust models that can automate tasks, make predictions, and provide valuable recommendations. This integration of web scraping and machine learning opens up new avenues for innovation and enables organizations to leverage the power of artificial intelligence in their data analysis workflows.</p>
<p style="color:black; background-color:white; padding:5px;">In summary, web scraping plays a vital role in data analysis by unlocking the potential of online data. It empowers analysts to access, collect, and analyze vast amounts of data from various sources, providing valuable insights and driving data-driven decision-making. As businesses strive to stay competitive in the digital age, harnessing the power of web scraping has become an indispensable tool for extracting knowledge from the web and uncovering hidden opportunities for growth and success.</p>

<h2 style="color:black; background-color:white; padding:5px; padding-bottom:10px; margin-bottom:-10px">II. Methodology</h2>
<p style="color:black; background-color:white; padding:5px; padding-bottom:20px;margin-bottom:-10px">
The methodology employed in this study aims to demonstrate the significance of web scraping in data analysis by utilizing the powerful BeautifulSoup library for extracting data from a real estate website in Bulgaria. Specifically, the focus will be on gathering data related to various types of flats in the city of Pazardzhik, including their prices.
<p style="color:black; background-color:white; padding:5px;padding-bottom:20px;margin-bottom:-10px">
The first step involves using web scraping techniques with BeautifulSoup to extract the desired data from the target website. By parsing the HTML content of the web pages, relevant information such as flat types and their corresponding prices will be systematically collected and organized.</p>
<p style="color:black; background-color:white; padding:5px;padding-bottom:20px;margin-bottom:-10px">
Once the data collection phase is complete, the next stage involves leveraging the tools and techniques of machine learning. Regression analysis will be applied to investigate the relationship between different factors, such as flat characteristics and their prices. This analysis will provide insights into pricing patterns and potentially uncover significant predictors of property prices.</p>
<p style="color:black; background-color:white; padding:5px;padding-bottom:20px;margin-bottom:-10px">Additionally, clustering algorithms will be employed to group flats based on similarities in their attributes. This clustering analysis can reveal distinct categories or clusters of flats with similar characteristics or pricing patterns, enabling a deeper understanding of the real estate market in Pazardzhik.</p>
<p style="color:black; background-color:white; padding:5px;">By combining web scraping with machine learning techniques, this study aims to showcase the power of web scraping in data analysis. The extraction of real estate data from the target website will enable the application of machine learning algorithms to gain valuable insights, identify trends, and make informed decisions in the real estate market of Pazardzhik, Bulgaria.</p>

<h2 style="color:black; background-color:white; padding:5px; padding-bottom:10px; margin-bottom:-25px">III. Empirical analysis</h2>
<h3 style="color:black; background-color:white; padding:5px; padding-bottom:10px; margin-bottom:-10px">1. Import the web page code</h3>
<p style="color:black; background-color:white; padding:5px">
In this first step, the web code of the selected page is loaded and the first 10 lines of it are visualized.</p>

In [2]:
URL_imotibg = "https://www.imot.bg/pcgi/imot.cgi?act=3&slink=9bhxh7&f1=1"
page_imoti = requests.get(URL_imotibg)

# Assuming 'page1' contains the HTML content of the web page
soup = BeautifulSoup(page_imoti.content, "html.parser")

# Get the prettified HTML as a string
prettified_html = soup.prettify()

# Split the HTML string into lines
html_lines = prettified_html.split('\n')

# Print only the first few lines
num_lines_to_print = 10  # Adjust the number of lines you want to print
for line in html_lines[:num_lines_to_print]:
    print(line)

<!DOCTYPE html>
<html lang="bg">
 <head>
  <title>
   Обяви Продава в град Пазарджик :: imot.bg
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Актуални обяви за имоти в град Пазарджик от сайта за недвижими имоти imot.bg. Вижте обяви Продава в град Пазарджик" name="description"/>
  <link href="https://www.imot.bg/favicon.ico" rel="SHORTCUT ICON"/>
  <link href="../styless/styles.css?279" rel="stylesheet" type="text/css"/>


---

<h3 style="color:black; background-color:white; padding:5px; padding-bottom:10px; margin-bottom:-25px">2. Import the data</h3>
<h4 style="color:black; background-color:white; padding:5px; padding-bottom:10px; margin-bottom:-10px">Step one. Import the data from the loading page</h4>
<p style="color:black; background-color:white; padding:5px">In this step, we filter the home page for two elements that we need for the analysis - price and type of apartments. We save the results as a dataframe.
</p>

In [100]:
estate_collection_first = {}

URL_imotibg_first_page = "https://www.imot.bg/pcgi/imot.cgi?act=3&slink=9bhxh7&f1=1"
loading_page_imoti = requests.get(URL_imotibg_first_page)

soup_loading_page = BeautifulSoup(loading_page_imoti.content, 'html.parser')

price_elements_loading_page = soup_loading_page.find_all('div', class_='price')
type_elements_loading_page = soup_loading_page.find_all('a', class_='lnk1')
location_elements_loading_page = soup_loading_page.find_all('a', class_='lnk2')

for p, t, l in zip(p_elements_loading_page, type_elements_loading_page, location_elements_loading_page):
    estate_collection_first.setdefault('Имот', []).append(t.text.strip())
    estate_collection_first.setdefault('Цена', []).append(p.text.strip())
    estate_collection_first.setdefault('Локация', []).append(l.text.strip())

# Create the DataFrame from the accumulated data
df_estate_collection_first = pd.DataFrame(estate_collection_first)
df_estate_collection_first.index = df_estate_collection_first.index + 1
df_estate_collection_first

Unnamed: 0,Имот,Цена,Локация
1,Продава 2-СТАЕН,70 000 лв.,"град Пазарджик, Младост"
2,Продава 3-СТАЕН,120 000 лв.,"град Пазарджик, Руски"
3,Продава 3-СТАЕН,68 000 EUR,"град Пазарджик, Младост"
4,Продава 2-СТАЕН,72 000 EUR,"град Пазарджик, Център"
5,Продава 3-СТАЕН,72 800 EUR,"град Пазарджик, Ставропол"
6,Продава 3-СТАЕН,160 000 лв.,"град Пазарджик, Устрем"
7,Продава 3-СТАЕН,85 000 EUR,"град Пазарджик, Център"
8,Продава 3-СТАЕН,99 500 EUR,"град Пазарджик, Център"
9,Продава 1-СТАЕН,55 000 лв.,"град Пазарджик, Център"
10,Продава 2-СТАЕН,39 000 EUR,"град Пазарджик, Моста на Лютата"


<p style="color:black; background-color:white; padding:5px;">As you can see from the dataframe we have 40 cases arranged in three columns first column which is named "Имот" , second column which is named "Цена" and third named "Локация". The apartments are from one-room to three-room in two types of curency in BGN and Euro, in all neighborhoods in the city.</p>

---

<h4 style="color:black; background-color:white; padding:5px; padding-bottom:10px; margin-bottom:-10px">Step two. Transform the data from the loading page</h4>
<p style="color:black; background-color:white; padding:5px">In this step, the data will be transformed in two directions: the names of the columns and the elements in the columns will be changed from Bulgarian to English, and secondly, the monetary unit will be unified, and all prices in BGN will be recalculated into Euros.</p>

In [101]:
df_estate_collection_first = df_estate_collection_first.rename(columns={'Имот': 'Estate_Type', 'Цена': 'Price', "Локация": "Location"})

df_estate_collection_first['Estate_Type'] = df_estate_collection_first['Estate_Type'].str.replace('Продава', '')
df_estate_collection_first['Estate_Type'] = df_estate_collection_first['Estate_Type'].str.replace('СТАЕН', 'room')

df_estate_collection_first['Location'] = df_estate_collection_first['Location'].str.replace('град', '')
df_estate_collection_first['Location'] = df_estate_collection_first['Location'].str.replace('Пазарджик', '')
df_estate_collection_first['Location'] = df_estate_collection_first['Location'].str.replace(',', '')
df_estate_collection_first['Location'] = df_estate_collection_first['Location'].str.replace(' ', '')

# Dictionary for renaming elements
rename_dict = {'Младост': 'Mladost', 'Руски': 'Ruski', 'Център': 'Shirok centre', 'Ставропол': 'Stavropol', 'Устрем': 'Ustrem', 'Моста на Лютата': 'The bridge',
               'Запад': 'Zapad', 'Идеаленцентър': 'Super centre', 'Ябълките': 'Yabalkite'}

# Rename elements in 'Column2' using the dictionary
df_estate_collection_first['Location'] = df_estate_collection_first['Location'].replace(rename_dict)

def convert_to_eur(Price):
    if 'лв.' in Price:
        Price = Price.replace('лв.', '').replace(' ', '')
        Price = round(float(Price) / 1.96)
    elif 'EUR' in Price:
        Price = Price.replace('EUR', '').replace(' ', '')
        Price = round(float(Price))
    return Price

# Apply the conversion function to the Price column
df_estate_collection_first['Price'] = df_estate_collection_first['Price'].apply(convert_to_eur)
df_estate_collection_first

Unnamed: 0,Estate_Type,Price,Location
1,2-room,35714,Mladost
2,3-room,61224,Ruski
3,3-room,68000,Mladost
4,2-room,72000,Shirok centre
5,3-room,72800,Stavropol
6,3-room,81633,Ustrem
7,3-room,85000,Shirok centre
8,3-room,99500,Shirok centre
9,1-room,28061,Shirok centre
10,2-room,39000,МостанаЛютата


---

<h4 style="color:black; background-color:white; padding:5px; padding-bottom:10px; margin-bottom:-10px">Step three. Take the links from the loading page</h4>
<p style="color:black; background-color:white; padding:5px">Since the home page only shows 40 apartments in number, in this step we find all the remaining links to pages from the property filter we used.</p>

In [123]:
next_page_elements = soup_loading_page.find_all('a', class_='pageNumbers')

# Iterate over the next page elements
for next_page_element in next_page_elements:
    
    # Extract the URL of the next page
    next_page_url = next_page_element['href']
    
    # Print the URL of the next page
    print("Next Page URL:", next_page_url)

Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=2
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=3
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=4
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=5
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=6
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=7
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=8
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=9
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=2
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=3
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=4
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=5
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=6
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&f1=7
Next Page URL: //www.imot.bg/pcgi/imot.cgi?act=3&slink=9bk2pu&

<p style="color:black; background-color:white; padding:5px;">As we can see from the data after number nine the next hyperlink is under number 2 or actually the data is duplicated.</p>

---

<h4 style="color:black; background-color:white; padding:5px; padding-bottom:10px; margin-bottom:-10px">Step four. Load the data from all pages</h4>
<p style="color:black; background-color:white; padding:5px">Once we have found the links to the rest of the pages matching the filter, we transform the code to load all apartments and not just the ones on the loading page. We remove the deduplicate links by only loading the first nine hyperlinks</p>

In [5]:
estate_collection_all = {}

URL_imotibg_first_page = "https://www.imot.bg/pcgi/imot.cgi?act=3&slink=9bhxh7&f1=1"
loading_page_imoti = requests.get(URL_imotibg_first_page)

soup_loading_page = BeautifulSoup(loading_page_imoti.content, 'html.parser')

next_page_elements = soup_loading_page.find_all('a', class_='pageNumbers')

p_elements_loading_page = soup_loading_page.find_all('div', class_='price')
type_elements_loading_page = soup_loading_page.find_all('a', class_='lnk1')
location_elements_loading_page = soup_loading_page.find_all('a', class_='lnk2')

for index, (p, t, l) in enumerate(zip(p_elements_loading_page, type_elements_loading_page, location_elements_loading_page)):
    estate_collection_all.setdefault('Имот', []).append(t.text.strip())
    estate_collection_all.setdefault('Цена', []).append(p.text.strip())
    estate_collection_all.setdefault('Локация', []).append(l.text.strip())

for next_page_element in next_page_elements[:8]:  # Take only the first 8 elements

    next_page_url = next_page_element['href']  # Join the base URL with the relative URL

    # Make a request to the next page URL
    response = requests.get(url='http:' + next_page_url)

    # Parse the HTML content of the next page
    soup_next_page = BeautifulSoup(response.content.decode('windows-1251'), 'html.parser')

    # Extract all paragraph elements from the next page
    price_elements_next_page = soup_next_page.find_all('div', class_='price')
    type_elements_next_page = soup_next_page.find_all('a', class_='lnk1')
    location_elements_next_page = soup_next_page.find_all('a', class_='lnk2')

    for index, (p, t, l) in enumerate(zip(price_elements_next_page, type_elements_next_page, location_elements_next_page)):
        estate_collection_all.setdefault('Имот', []).append(t.text.strip())
        estate_collection_all.setdefault('Цена', []).append(p.text.strip())
        estate_collection_all.setdefault('Локация', []).append(l.text.strip())

df_estate_collection_all = pd.DataFrame(estate_collection_all)
df_estate_collection_all

Unnamed: 0,Имот,Цена,Локация
0,Продава 1-СТАЕН,51 000 лв.,"град Пазарджик, Младост"
1,Продава 2-СТАЕН,70 000 лв.,"град Пазарджик, Младост"
2,Продава 3-СТАЕН,120 000 лв.,"град Пазарджик, Руски"
3,Продава 3-СТАЕН,68 000 EUR,"град Пазарджик, Младост"
4,Продава 2-СТАЕН,72 000 EUR,"град Пазарджик, Център"
...,...,...,...
345,Продава 4-СТАЕН,Цена при запитване,"град Пазарджик, Център"
346,Продава 3-СТАЕН,Цена при запитване,"град Пазарджик, Окръжна болница"
347,Продава 3-СТАЕН,Цена при запитване,"град Пазарджик, Руски"
348,Продава 3-СТАЕН,Цена при запитване,"град Пазарджик, Руски"


<p style="color:black; background-color:white; padding:5px;">The result is a dataframe with all apartments in the city of Pazardzhik, Bulgaria, a total of 350 in number, arranged in three columns - apartment type, apartment price and location.</p>

---

<h4 style="color:black; background-color:white; padding:5px; padding-bottom:10px; margin-bottom:-10px">Step five. Transform the data from all pages</h4>
<p style="color:black; background-color:white; padding:5px">In this step, as we already do that, the data will be transformed in two directions: the names of the columns and the elements in the columns will be changed from Bulgarian to English, and secondly, the monetary unit will be unified, and all prices in BGN will be recalculated into Euros. Also, any apartments without a listed price will be removed from the dataframe.</p>

In [6]:
# transform the index to start from 1 
df_estate_collection_all.index = df_estate_collection_all.index + 1

#Change the name of columns
df_estate_collection_all = df_estate_collection_all.rename(columns={'Имот': 'Estate_Type', 'Цена': 'Price', 'Локация': 'Location'})

#change the name of elements in columns
df_estate_collection_all['Estate_Type'] = df_estate_collection_all['Estate_Type'].str.replace('Продава', '')
df_estate_collection_all['Estate_Type'] = df_estate_collection_all['Estate_Type'].str.replace('СТАЕН', 'room')
df_estate_collection_all['Price'] = df_estate_collection_all['Price'].replace('Цена при запитване', float('nan'))

#remove the nan elemenents
df_estate_collection_all = df_estate_collection_all.dropna(subset=['Price'])

#transform the locations to eng
df_estate_collection_all['Location'] = df_estate_collection_all['Location'].str.replace('град', '')
df_estate_collection_all['Location'] = df_estate_collection_all['Location'].str.replace('Пазарджик', '')
df_estate_collection_all['Location'] = df_estate_collection_all['Location'].str.replace(',', '')
df_estate_collection_all['Location'] = df_estate_collection_all['Location'].str.replace(' ', '')

# Dictionary for renaming elements
rename_dict = {'Младост': 'Mladost', 'Руски': 'Ruski', 'Център': 'Shirok centre', 'Ставропол': 'Stavropol', 'Устрем': 'Ustrem', 'Моста на Лютата': 'The bridge',
               'Запад': 'Zapad', 'Идеаленцентър': 'Super centre', 'Ябълките': 'Yabalkite', 'Промишленазона': 'Industrial area', 'Изток': 'Iztok', 
               'Окръжнаболница': 'Bolnica', 'МостанаЛютата': 'The bridge'}

# Rename elements in 'Column2' using the dictionary
df_estate_collection_all['Location'] = df_estate_collection_all['Location'].replace(rename_dict)

# change the elements in price column
def convert_to_eur(Price):
    if 'лв.' in Price:
        Price = Price.replace('лв.', '').replace(' ', '')
        Price = round(float(Price) / 1.96)
    elif 'EUR' in Price:
        Price = Price.replace('EUR', '').replace(' ', '')
        Price = round(float(Price))
    return Price

# Apply the conversion function to the Price column
df_estate_collection_all['Price'] = df_estate_collection_all['Price'].apply(convert_to_eur)
df_estate_collection_all

Unnamed: 0,Estate_Type,Price,Location
1,1-room,26020,Mladost
2,2-room,35714,Mladost
3,3-room,61224,Ruski
4,3-room,68000,Mladost
5,2-room,72000,Shirok centre
...,...,...,...
340,3-room,145000,Industrial area
341,4-room,150000,Iztok
342,3-room,150000,Super centre
343,4-room,155612,Super centre


<p style="color:black; background-color:white; padding:5px;">The result is a dataframe with all apartments in the city of Pazardzhik, Bulgaria, with prices, a total of 344 in number, arranged in three columns - apartment type, apartment price (in euro) and location.</p>

---