# Project Motivation:

Recently, my father inherited an apartment in Vallecas from an aunt of his who passed away last year. He decided he wanted to sell the property to a family, so he contacted a couple of real estate agencies to get an idea of the price he could list the apartment for.

One Friday, my father mentioned to me that he would be meeting with both real estate agencies on Monday, after the weekend, to ask about the value of the apartment. I asked him if he had any idea of the price range to expect, and he told me he had no clue. At the time, I was studying data analysis and data science, so I decided to carry out a quick investigation on my own to provide him with an estimated price before the meetings. I wanted to make the project as thorough as possible in just two days; this is the result.

**Result**:
The estimated price from my analysis was around €180,000. The prices offered by the real estate agencies were €170,000 and €165,000. After negotiations with the buyers (a young couple), he sold the apartment for €160,000. Although the model wasn't entirely accurate, it gave my father a rough idea of the property's value.

## 1. Obtain the data from both real estate agencies

The first step, described in this notebook, is to obtain data on the apartments listed by both real estate agencies. In our case those agencies are Tecnocasa and Redpiso. Both of them sell properties across Spain. However, to reduce the amount of data and make it easier to understand, we are going to use only the listings located in the city of Madrid.

The same process applies to any other city, or even to several cities at once. Only the size of the dataset and its values would change.



To obtain the data we will use BeautifulSoup, a well-known library for web scraping.

If you need to install it, just run:


In [60]:
pip install requests beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


The two main libraries for web scraping are BeautifulSoup and requests. The pandas library is very useful for storing all the data in a DataFrame.

Import the libraries:


In [61]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Tecnocasa

We are going to start with the Tecnocasa data. You can easily google it and find its website: https://www.tecnocasa.es/

In the search bar we can search for the city we want, in our case the city of Madrid.

*Note: The data was obtained on 17/09/2024. Running this code on another day will produce different data*


![Image](Tecnocasa_main.png)

A map will open with a checklist on the right for selecting the neighborhoods we want. Let's select all of them:

![Image](Tecnocasa_choose.png)

That will bring us to: https://www.tecnocasa.es/venta/inmuebles/comunidad-de-madrid/madrid/madrid.html

This page lists all the properties on the website that match the parameters we have chosen (only one: located in Madrid).

![Image](Tecnocasa_list.png)

For each property, we are interested in the data that I have marked in red: price, type of property, location, bedrooms, surface area and bathrooms.

So let's get to it:


In [67]:
# Tecnocasa URL:
url = 'https://www.tecnocasa.es/venta/inmuebles/comunidad-de-madrid/madrid/madrid.html'

# Define the User-Agent:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

# Request access to the HTML code and see the response
response = requests.get(url, headers = headers)

# Response [200] means that the request succeeded
response

<Response [200]>

Response [200] means 'OK' so we can continue.
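The status check can also be done in code instead of by eye. The following is a minimal sketch, not part of the original notebook: 'get_html' is a hypothetical helper name and the 10-second timeout is an arbitrary choice. It relies on "raise_for_status", which raises an error for 4xx and 5xx responses.

```python
import requests

def get_html(url, headers=None, timeout=10):
    """Fetch a page and return its HTML, raising if the server reports an error."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text

# Possible usage with the variables defined above:
# html = get_html(url, headers=headers)
```

With a helper like this, a blocked or missing page stops the script immediately instead of producing an empty "soup" further down.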
Let's see the HTML code of the page:


In [68]:
# Create a BeautifulSoup object using the HTML code as text and Python's html parser:
soup = BeautifulSoup(response.text, "html.parser")
soup

<!DOCTYPE html>

<html lang="es-es">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=5.1" name="viewport"/>
<meta content="cScbt1vnbVnIR2YiMpCADBQDEKD0oAdFaVWlCIUU" name="csrf-token"/>
<title>Pisos y casas  en venta en Madrid - Tecnocasa.es</title>
<meta content="Pisos y casas  en venta a Madrid: ¿Quieres comprar, vender o alquilar pisos y casas? ¡Descubre las propuestas online las inmobiliarias Tecnocasa.es!" name="description">
<meta content="MKautx_6-Jv0E6VAs8uJPkZG1wLRejHrW68L_yWI2Xo" name="google-site-verification">
<meta content="" name="copyright">
<meta content="" name="author"/>
<meta content="General" name="classification"/>
<meta content="General" name="rating"/>
<meta content="Global" name="distribution"/>
<meta content="all,index,follow" name="robots"/>
<link href="https://www.tecnocasa.es/venta/inmuebles/comunidad-de-madrid/madrid/madrid.html" rel="canonical"/>
<meta content="3 days" name="revisit-after"/>
<!-- FB -->
<me

Now we have the content of the page in our "soup" object. To view the same HTML in the browser, simply open the page and press Ctrl+U.

In the HTML code we can search for the information we are looking for. In our case it is easy to see that all the information for each property is contained in an <estate-card> element, with each field in its own 'template' element identified by its 'slot' attribute. This makes web scraping much easier.


![Image](Tecnocasa_html.png)

To access the information, simply use the "find" method, passing the name of the element and its attributes as input:


In [69]:
# Example using the price of the first property:

# "find" the element we want
price_template = soup.find('template', {'slot': 'estate-price'})

# Obtain the text from the element
price = price_template.get_text(strip=True)
price

'190.000 €'

This search can be done for every property at once with the "find_all" method, which returns all the elements that match the same tag and attributes.

Using "find_all" we obtain the data for the 15 properties that appear on the page:


In [72]:
# Price of every property in the page:
price_templates = soup.find_all('template', {'slot': 'estate-price'})

# Obtain the text of each element
prices = [elem.get_text(strip=True) for elem in price_templates]

prices

['190.000 €',
 '279.000 €',
 '319.900 €',
 '116.000 €',
 '155.000 €',
 '410.000 €',
 '90.000 €',
 '299.000 €',
 '185.000 €',
 '46.000 €',
 '315.000 €',
 '17.000 €',
 '379.900 €',
 '239.900 €',
 '306.000 €']

We can use the same process for all the slot names:

"estate-price"

"estate-title"

"estate-subtitle"

"estate-rooms"

"estate-surface"

"estate-bathrooms"


In [71]:
# Find and store all the parameters:
templates = soup.find_all('template', {'slot': 'estate-price'})
price = [elem.get_text(strip=True) for elem in templates]

templates = soup.find_all('template', {'slot': 'estate-title'})
title = [elem.get_text(strip=True) for elem in templates]

templates = soup.find_all('template', {'slot': 'estate-subtitle'})
subtitle = [elem.get_text(strip=True) for elem in templates]

templates = soup.find_all('template', {'slot': 'estate-rooms'})
rooms = [elem.get_text(strip=True) for elem in templates]

templates = soup.find_all('template', {'slot': 'estate-surface'})
surface = [elem.get_text(strip=True) for elem in templates]

templates = soup.find_all('template', {'slot': 'estate-bathrooms'})
bathrooms = [elem.get_text(strip=True) for elem in templates]
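Since the six searches differ only in the slot name, they can also be written as one loop. This is a sketch of that idea rather than the notebook's own code: 'extract_slots' and 'SLOTS' are hypothetical names, and the HTML fragment is a made-up miniature of the real page, used only so the example runs without a request.

```python
from bs4 import BeautifulSoup

# Slot names taken from the list above
SLOTS = ['estate-price', 'estate-title', 'estate-subtitle',
         'estate-rooms', 'estate-surface', 'estate-bathrooms']

def extract_slots(soup, slots=SLOTS):
    """Return {slot name: list of texts} for every <template slot=...> found."""
    return {
        slot: [elem.get_text(strip=True)
               for elem in soup.find_all('template', {'slot': slot})]
        for slot in slots
    }

# Made-up fragment with a single property, only to show the call
sample_html = '''
<estate-card>
  <template slot="estate-price">190.000 €</template>
  <template slot="estate-title">Piso en venta</template>
  <template slot="estate-subtitle">Madrid, Arganzuela</template>
  <template slot="estate-rooms">2 dorm.</template>
  <template slot="estate-surface">49 m2</template>
  <template slot="estate-bathrooms">1 baño</template>
</estate-card>
'''

fields = extract_slots(BeautifulSoup(sample_html, 'html.parser'))
print(fields['estate-price'])   # ['190.000 €']
```

The dictionary keeps every field under its slot name, so adding or removing a field only means editing the 'SLOTS' list.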

Before continuing, let's explore each of the parameters:

*Remember that all values are strings*

In [74]:
print(price[0:3])
print(title[0:3])
print(subtitle[0:3])
print(rooms[0:3])
print(surface[0:3])
print(bathrooms[0:3])

['190.000 €', '279.000 €', '319.900 €']
['Piso en venta', 'Piso en venta', 'Piso en venta']
['Madrid, Arganzuela', 'Madrid, Salamanca', 'Madrid, Salamanca']
['2 dorm.', '2 dorm.', '2 dorm.']
['49 m<sup>2</sup>', '49 m<sup>2</sup>', '53 m<sup>2</sup>']
['1 baño', '1 baño', '1 baño']


Personally, I don't like the way some of these values look, especially because many parameters are numerical and we may want to use them as numbers later. So let's clean up the data a little. Since they are all lists of *strings*, this is easy:

- Remove the currency symbol ('€')
- Remove 'Madrid' from the subtitle, since it is redundant: all the selected properties are in Madrid
- Remove 'dorm.' and 'baño'
- Remove 'm<sup>2</sup>'

In [76]:
# Cleaning data:

# Numerical:
templates = soup.find_all('template', {'slot': 'estate-price'})
price = [elem.get_text(strip=True).replace(' €','') for elem in templates]

templates = soup.find_all('template', {'slot': 'estate-rooms'})
rooms = [elem.get_text(strip=True).replace(' dorm.','') for elem in templates]

templates = soup.find_all('template', {'slot': 'estate-surface'})
surface = [elem.get_text(strip=True).replace(' m<sup>2</sup>','') for elem in templates]

templates = soup.find_all('template', {'slot': 'estate-bathrooms'})
bathrooms = [elem.get_text(strip=True).replace(' baño','').replace('s','') for elem in templates]

# Categorical:
templates = soup.find_all('template', {'slot': 'estate-title'})
title = [elem.get_text(strip=True) for elem in templates]

templates = soup.find_all('template', {'slot': 'estate-subtitle'})
subtitle = [elem.get_text(strip=True).replace('Madrid, ','') for elem in templates]

# Result:
print(price[0:3])
print(title[0:3])
print(subtitle[0:3])
print(rooms[0:3])
print(surface[0:3])
print(bathrooms[0:3])

['190.000', '279.000', '319.900']
['Piso en venta', 'Piso en venta', 'Piso en venta']
['Arganzuela', 'Salamanca', 'Salamanca']
['2', '2', '2']
['49', '49', '53']
['1', '1', '1']
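The cleaned values are still strings. If they are later needed as numbers, note that the '.' in '190.000' is a thousands separator, so a direct float('190.000') would give 190.0 rather than 190000. The helper below is a small sketch of one way to handle this; 'to_int' is a hypothetical name and the notebook itself keeps the strings at this stage.

```python
def to_int(text):
    """Convert a cleaned string such as '190.000' to the integer 190000.

    Returns None when the string is not made of digits (for example, an
    empty field).
    """
    digits = text.replace('.', '').strip()
    return int(digits) if digits.isdigit() else None

print([to_int(p) for p in ['190.000', '279.000', '319.900']])
# [190000, 279000, 319900]
print(to_int(''))
# None
```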


With the data clean, we can now store it in a DataFrame for later use in analysis and predictive models:

In [77]:
# Save the information in a dataframe

# Columns
data = {'Type': title,
        'Location': subtitle,
        'Bathrooms': bathrooms,
        'Rooms': rooms,
        'Surface (m2)': surface,
        'Price (€)': price}

df = pd.DataFrame(data)
df

| | Type | Location | Bathrooms | Rooms | Surface (m2) | Price (€) |
|---|---|---|---|---|---|---|
| 0 | Piso en venta | Arganzuela | 1.0 | 2.0 | 49 | 190.0 |
| 1 | Piso en venta | Salamanca | 1.0 | 2.0 | 49 | 279.0 |
| 2 | Piso en venta | Salamanca | 1.0 | 2.0 | 53 | 319.9 |
| 3 | Box/plaza de garaje en venta | Fuencarral | | | 290 | 116.0 |
| 4 | Piso en venta | Latina | 1.0 | 3.0 | 65 | 155.0 |
| 5 | Piso en venta | Chamartín | 2.0 | 3.0 | 119 | 410.0 |
| 6 | Local comercial en venta | Carabanchel | 1.0 | | 65 | 90.0 |
| 7 | Piso en venta | Carabanchel | 2.0 | 2.0 | 76 | 299.0 |
| 8 | Piso en venta | Ciudad Lineal | 1.0 | 2.0 | 69 | 185.0 |
| 9 | Local comercial en venta | Vicálvaro | 2.0 | 5.0 | 82 | 46.0 |


It looks very good: the data seems clean, and the fields that do not appear on the website are simply empty in the DataFrame.

However, we do not want to analyze only the first 15 properties returned by the search. The interesting thing is to obtain **ALL** the properties on the website.

To do this, check the website and you will see that there are 53 pages of property results in Madrid (**Remember: as of 17/09/2024**). So we can use a 'for' loop to visit each page:

In [78]:
# Store the data in lists:
all_prices = []
all_title = []
all_subtitle = []
all_rooms = []
all_surface = []
all_bathrooms = []

# 53 pages as of 17-09-2024
for i in range(1,54):
    
    # url of page 1
    url = 'https://www.tecnocasa.es/venta/inmuebles/comunidad-de-madrid/madrid/madrid.html'
    
    # url of the rest of the pages (from 2 to 53):
    # simply append '/pag-i' to the url of page 1
    if i >=2: url = url + '/pag-'+ str(i)
    
    # Use the same process to obtain the HTML
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

    response = requests.get(url, headers = headers)

    soup = BeautifulSoup(response.text, "html.parser")
    
    
    # Get and clean the data in the same step
    templates = soup.find_all('template', {'slot': 'estate-price'})
    price = [elem.get_text(strip=True).replace(' €','') for elem in templates]

    templates = soup.find_all('template', {'slot': 'estate-rooms'})
    rooms = [elem.get_text(strip=True).replace(' dorm.','') for elem in templates]

    templates = soup.find_all('template', {'slot': 'estate-surface'})
    surface = [elem.get_text(strip=True).replace(' m<sup>2</sup>','') for elem in templates]

    templates = soup.find_all('template', {'slot': 'estate-bathrooms'})
    bathrooms = [elem.get_text(strip=True).replace(' baño','').replace('s','') for elem in templates]

    templates = soup.find_all('template', {'slot': 'estate-title'})
    title = [elem.get_text(strip=True) for elem in templates]

    templates = soup.find_all('template', {'slot': 'estate-subtitle'})
    subtitle = [elem.get_text(strip=True).replace('Madrid, ','') for elem in templates]
    
    
    # Add the clean data to the lists
    all_prices.extend(price)
    all_rooms.extend(rooms)
    all_surface.extend(surface)
    all_bathrooms.extend(bathrooms)
    all_title.extend(title)
    all_subtitle.extend(subtitle)
    
    
    # Print the progress every 10 pages:
    if i%10 == 0: print(i)

print('Finish!')


10
20
30
40
50
Finish!
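Hardcoding 53 pages works for the day the data was collected, but the number changes as listings come and go. One alternative is to keep requesting pages until one comes back with no listings. The sketch below shows only that control flow: 'scrape_all_pages' and 'fetch_page' are hypothetical names, and the assumption that a page past the end returns no listings has not been checked against the Tecnocasa website.

```python
def scrape_all_pages(fetch_page, max_pages=500):
    """Collect listings page by page until a page comes back empty.

    fetch_page is a hypothetical callable: it takes a page number
    (starting at 1) and returns the list of listings found on that page.
    max_pages is a safety limit so the loop always terminates.
    """
    listings = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:   # assumed signal that we ran past the last page
            break
        listings.extend(items)
    return listings

# Stand-in fetcher with three pages, used here instead of real HTTP requests
fake_site = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e']}
print(scrape_all_pages(lambda page: fake_site.get(page, [])))
# ['a', 'b', 'c', 'd', 'e']
```

In the notebook, 'fetch_page' would wrap the request and the "find_all" calls from the loop above.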


Save the data in a dataframe

In [79]:
data = {'Type': all_title,
        'Location': all_subtitle,
        'Bathrooms': all_bathrooms,
        'Rooms': all_rooms,
        'Surface (m2)': all_surface,
        'Price (€)': all_prices}

df = pd.DataFrame(data)
df

| | Type | Location | Bathrooms | Rooms | Surface (m2) | Price (€) |
|---|---|---|---|---|---|---|
| 0 | Piso en venta | Arganzuela | 1 | 2 | 49 | 190.000 |
| 1 | Piso en venta | Salamanca | 1 | 2 | 49 | 279.000 |
| 2 | Piso en venta | Salamanca | 1 | 2 | 53 | 319.900 |
| 3 | Box/plaza de garaje en venta | Fuencarral | | | 290 | 116.000 |
| 4 | Piso en venta | Latina | 1 | 3 | 65 | 155.000 |
| ... | ... | ... | ... | ... | ... | ... |
| 790 | Box/plaza de garaje en venta | San Blas | | | 9 | 12.000 |
| 791 | Box/plaza de garaje en venta | Retiro | | | 10 | 25.000 |
| 792 | Box/plaza de garaje en venta | San Blas | | | 15 | 24.500 |
| 793 | Box/plaza de garaje en venta | Salamanca | | | 9 | 9.500 |


We have obtained the information on the 795 properties that Tecnocasa offers in Madrid, cleaned it slightly and converted it into a DataFrame for use in subsequent analyses.

For now, we are going to save this information to an Excel file:

In [80]:
df.to_excel('inmuebles_Tecnocasa.xlsx')
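Two details are worth keeping in mind when the file is loaded again later. By default pandas also writes the row index, which comes back as an extra unnamed column, and a reader that guesses types will turn a text price such as '190.000' into the number 190.0. The sketch below illustrates both effects with a small made-up DataFrame and an in-memory CSV so that it runs without writing files; "to_excel" accepts the same 'index' argument and "read_excel" the same 'dtype' argument.

```python
import io
import pandas as pd

# Small made-up sample in the same shape as the scraped data
sample = pd.DataFrame({'Location': ['Arganzuela', 'Salamanca'],
                       'Price (€)': ['190.000', '279.000']})

# Default settings: the index is written as an extra, unnamed column
naive_buffer = io.StringIO()
sample.to_csv(naive_buffer)
naive_buffer.seek(0)
naive = pd.read_csv(naive_buffer)

# index=False skips the index; dtype=str keeps '190.000' as text
clean_buffer = io.StringIO()
sample.to_csv(clean_buffer, index=False)
clean_buffer.seek(0)
clean = pd.read_csv(clean_buffer, dtype=str)

print(list(naive.columns))           # ['Unnamed: 0', 'Location', 'Price (€)']
print(naive['Price (€)'].tolist())   # [190.0, 279.0]
print(list(clean.columns))           # ['Location', 'Price (€)']
print(clean['Price (€)'].tolist())   # ['190.000', '279.000']
```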

### Redpiso

Now we are going to perform the same search on the Redpiso website. Its main website is: https://www.redpiso.es/

In the search bar we can search for the city we want, in this case the city of Madrid.

*Note: the data was obtained on 17/09/2024. Running this code on another day may produce different data*

![Image](Redpiso_main.png)

Once the city is selected it will send us to the page: https://www.redpiso.es/venta-viviendas/madrid/madrid

This page contains the list of all the properties that this website offers with the parameters that we have chosen

![Image](Redpiso_list.png)

We repeat the process to obtain the HTML code as in the previous case:

In [81]:
url = 'https://www.redpiso.es/venta-viviendas/madrid/madrid'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

response = requests.get(url, headers = headers)

response

<Response [200]>

In [82]:
soup = BeautifulSoup(response.text, "html.parser")
soup

<!DOCTYPE html>

<html lang="es">
<head>
<title>Pisos y casas en Madrid, Madrid</title>
<base href="https://www.redpiso.es/"/>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="2085 pisos y casas en venta en Madrid, Madrid: anuncios de pisos y casas en venta en Madrid, Madrid con fotos" name="description">
<meta content="insfera" name="author">
<meta content="notranslate" name="google">
<meta content="348212335558257" property="fb:app_id"/><meta content="website" property="og:type"/><meta content="Pisos y casas en Madrid, Madrid" property="og:title"/><meta content="https://www.redpiso.es/venta-viviendas/madrid/madrid" property="og:url"/><meta content="https://www.redpiso.es/img/redpiso-logo-head.png" property="og:image"/><meta content="2085 pisos y casas en venta en Madrid, Madrid: anuncios de pisos y casas en venta en Madrid, Madrid con fotos" property="og:description"/><m

Since we have a response [200] we can go ahead and create the 'soup' object.

The HTML code on this page is different from the previous one (you can see it by pressing Ctrl+U on the page). In this case the fields are placed in different tags, as can be seen in the image:

![Image](Redpiso_html.png)

- The price is in h3
- The description is in h5
- The surface, bathrooms and rooms are in div class='property-list-options-item'
    
However, the fact that three fields share the same tag makes things quite difficult. What's more, some properties are missing one or more of these fields, and in that case the corresponding line does not appear in the HTML code at all.

In the previous case of Tecnocasa, all the tags appeared for every property, even if some were empty. That made exploration much easier. Here we have to think a little...

Let's start with the easy part, the price and description:

In [84]:
# Using find_all
templates = soup.find_all('h3')
prices = [elem.get_text(strip=True) for elem in templates]

templates = soup.find_all('h5')
description = [elem.get_text(strip=True) for elem in templates]

print(prices[0:3])
print(description[0:3])

['106.500 €', '120.000 €', '120.000 €']
['Piso en venta en CALLE BENIMAMET, San Cristóbal, Villaverde, Madrid, Madrid', 'Piso en venta en Buenavista, Carabanchel, Madrid, Madrid', 'Piso en venta en Buenavista, Carabanchel, Madrid, Madrid']


The price has exactly the same format as in the Tecnocasa data.

The description is a mix of what Tecnocasa called 'Title' and 'Subtitle'. For now we are going to leave it like this. In the next notebook we will clean the data in more depth and separate streets, areas and neighborhoods.

Now for the three fields that share a tag. One might be tempted to use "find_all" to get all three parameters at once, save them in a single list and split them later.

*SPOILER: It's a mistake, it won't work*

In [85]:
templates = soup.find_all('div', class_='property-list-options-item')

params = [elem.get_text(strip=True) for elem in templates]

print(params[0:9])

['64 m²', '2 hab.', '1', '37 m²', '1 hab.', '1', '33 m²', '1 hab.', '1']


As you can see, the surface, rooms and bathrooms are mixed, but they seem to follow a regular order. Maybe we can divide the list into three, one per parameter:

In [87]:
surface = [params[i] for i in range(0,len(params),3)]
rooms = [params[i] for i in range(1,len(params),3)]
bathrooms = [params[i] for i in range(2,len(params),3)]

print(surface[0:3])
print(rooms[0:3])
print(bathrooms[0:3])

['64 m²', '37 m²', '33 m²']
['2 hab.', '1 hab.', '1 hab.']
['1', '1', '1']


Clean the data:

In [88]:
surface = [params[i].replace(' m²','') for i in range(0,len(params),3)]
rooms = [params[i].replace(' hab.','') for i in range(1,len(params),3)]
bathrooms = [params[i] for i in range(2,len(params),3)]

print(surface[0:3])
print(rooms[0:3])
print(bathrooms[0:3])

['64', '37', '33']
['2', '1', '1']
['1', '1', '1']


It seems to work. We can do this for all properties then, right?

*SPOILER: No*

On each page there are 12 properties; therefore, the length of the 'params' list should be 3 x 12 = 36 for each page: 12 surfaces + 12 bedrooms + 12 bathrooms. We will see that this is not the case:

In [91]:
# All data in the web

# Store data in lists
all_prices = []
all_rooms = []
all_surface = []
all_bathrooms = []
all_description = []

# 174 pages as of 17-09-2024
for i in range(1,175):
    
    # url of page 1
    url = 'https://www.redpiso.es/venta-viviendas/madrid/madrid'
    
    # url of pages from 2 to 174
    if i >=2: url = url + '/pagina-'+str(i)
    
    # HTML requests
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

    response = requests.get(url, headers = headers)

    soup = BeautifulSoup(response.text, "html.parser")
    
    
    # Get and clean data in the same step:
    templates = soup.find_all('h3')
    prices = [elem.get_text(strip=True).replace(' €','') for elem in templates]
    
    templates = soup.find_all('div', class_='property-list-options-item')
    params = [elem.get_text(strip=True) for elem in templates]
    
    surface = [params[i].replace(' m²','') for i in range(0,len(params),3)]

    rooms = [params[i].replace(' hab.','') for i in range(1,len(params),3)]

    bathrooms = [params[i] for i in range(2,len(params),3)]
    
    templates = soup.find_all('h5')
    description = [elem.get_text(strip=True) for elem in templates]
    
    
    # Store data:
    all_prices.extend(prices)
    all_rooms.extend(rooms)
    all_surface.extend(surface)
    all_bathrooms.extend(bathrooms)
    all_description.extend(description)
    
    
    # Check if length of 'params' is 36 in every page
    if len(params) !=36:
        print('len(params) = ' + str(len(params)) + ' in pagina-' + str(i))

    # Print the progress every 30 pages (the bare numbers in the output)
    if i%30 == 0: print(i)
    


len(params) = 35 in pagina-2
len(params) = 35 in pagina-3
len(params) = 35 in pagina-4
len(params) = 34 in pagina-18
len(params) = 35 in pagina-23
len(params) = 35 in pagina-26
len(params) = 34 in pagina-27
30
len(params) = 34 in pagina-31
len(params) = 35 in pagina-34
len(params) = 35 in pagina-40
len(params) = 33 in pagina-43
len(params) = 35 in pagina-45
len(params) = 35 in pagina-51
len(params) = 34 in pagina-54
len(params) = 34 in pagina-59
60
len(params) = 35 in pagina-64
len(params) = 35 in pagina-74
len(params) = 35 in pagina-75
len(params) = 35 in pagina-76
90
len(params) = 35 in pagina-94
len(params) = 34 in pagina-96
len(params) = 34 in pagina-98
len(params) = 35 in pagina-102
len(params) = 35 in pagina-105
len(params) = 35 in pagina-110
len(params) = 34 in pagina-112
len(params) = 35 in pagina-119
120
len(params) = 35 in pagina-122
len(params) = 34 in pagina-140
len(params) = 35 in pagina-142
len(params) = 35 in pagina-148
150
len(params) = 30 in pagina-173
len(params) = 18

It can be clearly seen that on many pages the length of 'params' is not 36, which implies that some data is missing. And since we split the 'params' list by position, the values end up in the wrong lists:

In [97]:
print('Rooms: ', all_rooms[0:30])
print('---')
print('Surfaces: ',all_surface[0:30])
print('---')
print( 'Bathrooms: ',all_bathrooms[0:30])

Rooms:  ['2', '1', '1', '1', '1', '3', '3', '2', '1', '1', '2', '2', '2', '1', '2', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '3', '2', '1', '2', '2']
---
Surfaces:  ['64', '37', '33', '48', '30', '56', '101', '55', '34', '48', '65', '56', '51', '35', '55', '39', '33', '2 hab.', '1 hab.', '1 hab.', '2 hab.', '2 hab.', '2 hab.', '1 hab.', '41', '68', '63', '57', '47', '50']
---
Bathrooms:  ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '62 m²', '58 m²', '33 m²', '66 m²', '54 m²', '63 m²', '55 m²', '1', '1', '1', '1', '1', '1', '1']


It's all mixed...
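A tiny example makes the cause visible. The list below is made up for illustration: it describes three properties, and the second one has no rooms value. Splitting with a fixed step of 3, as in the loop above, shifts every value that comes after the gap.

```python
# Hypothetical 'params' for three properties; the second one has no rooms value
params = ['64 m²', '2 hab.', '1',
          '33 m²', '1',
          '62 m²', '2 hab.', '1']

# Fixed-step split, as in the loop above
surface = params[0::3]
rooms = params[1::3]
bathrooms = params[2::3]

print(surface)    # ['64 m²', '33 m²', '2 hab.']
print(rooms)      # ['2 hab.', '1', '1']
print(bathrooms)  # ['1', '62 m²']
```

A single missing element is enough to misplace everything after it, and the three lists no longer even have the same length.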

You might think that, instead of splitting the 'params' list by position alone, we could split it by position **AND** content. That is, each group of three must have a surface element with 'm²', a rooms element with 'hab.' and a bare number for the bathrooms. If any of these elements is missing, it is replaced by 'null'.

However, that would not work for properties with all three elements missing, since nothing in the list would show that they exist at all.

A possible solution is to use "find_all" on the parent element, which joins the three values into a single string. In this case the parent is div class='property-list-options'

![Image](Redpiso_html_zoom.png)

In [98]:
# 'soup' still holds the last page (174) from the previous for loop, so we search there
templates = soup.find_all('div', class_='property-list-options')
params = [elem.get_text(strip=True) for elem in templates]
params

['94 m²3 hab.2',
 '3 hab.1',
 '90 m²3 hab.1',
 '79 m²3 hab.1',
 '76 m²3 hab.1',
 '104 m²3 hab.2']

This is much better because every property has the 'property-list-options' tag, even when it is empty, so properties with no data still get their own entry:

In [100]:
# Divide data by type:

# If the data does not exist we leave it empty
surface = [element.split(' m²')[0] if ' m²' in element else '' for element in params]

rooms = [element.split(' hab.')[0][-1] if ' hab.' in element else '' for element in params]

bathrooms = [element[-1] if len(element)>1 else '' for element in params]

print('Surfaces: ',surface)
print('Rooms: ',rooms)
print('Bathrooms: ',bathrooms)

Surfaces:  ['94', '', '90', '79', '76', '104']
Rooms:  ['3', '3', '3', '3', '3', '3']
Bathrooms:  ['2', '1', '1', '1', '1', '2']
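One caveat about the split above: it keeps only the last character before ' hab.' and the last character of the string, so a value such as '10 hab.' would become '0', and a string that ends in 'hab.' would give '.' as the number of bathrooms. A regular expression is one way to avoid this. The sketch below is an alternative, not the notebook's method; 'parse_options' is a hypothetical name, and the pattern assumes the joined text always follows the order surface, rooms, bathrooms with whole numbers, as in the output shown.

```python
import re

# Optional surface, optional rooms, optional bathrooms, in that order
OPTIONS_PATTERN = re.compile(
    r'^(?:(?P<surface>\d+) m²)?'
    r'(?:(?P<rooms>\d+) hab\.)?'
    r'(?P<bathrooms>\d+)?$'
)

def parse_options(text):
    """Split a joined string like '94 m²3 hab.2' into its three fields.

    Missing fields, and strings that do not fit the assumed format,
    come back as empty strings.
    """
    match = OPTIONS_PATTERN.match(text)
    if match is None:
        return {'surface': '', 'rooms': '', 'bathrooms': ''}
    return {name: value or '' for name, value in match.groupdict().items()}

print(parse_options('94 m²3 hab.2'))
# {'surface': '94', 'rooms': '3', 'bathrooms': '2'}
print(parse_options('3 hab.1'))
# {'surface': '', 'rooms': '3', 'bathrooms': '1'}
print(parse_options(''))
# {'surface': '', 'rooms': '', 'bathrooms': ''}
```

For the six listings shown here both approaches give the same lists.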



This works. We repeat the new process for all the pages of the website:

In [101]:
# Store data in lists
all_prices = []
all_title = []
all_subtitle = []
all_rooms = []
all_surface = []
all_bathrooms = []
all_description = []

# 174 pages as of 17-09-2024
for i in range(1,175):
    
    # url of page 1
    url = 'https://www.redpiso.es/venta-viviendas/madrid/madrid'
    
    # url of the rest of the pages (from 2 to 174)
    if i >=2: url = url + '/pagina-'+str(i)
    
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

    response = requests.get(url, headers = headers)

    soup = BeautifulSoup(response.text, "html.parser")
    

    templates = soup.find_all('h3')
    prices = [elem.get_text(strip=True).replace(' €','') for elem in templates]
    
    
    # New method to obtain the data
    templates = soup.find_all('div', class_='property-list-options')
    params = [elem.get_text(strip=True) for elem in templates]
    
    surface = [element.split(' m²')[0] if ' m²' in element else '' for element in params]

    rooms = [element.split(' hab.')[0][-1] if ' hab.' in element else '' for element in params]

    bathrooms = [element[-1] if len(element)>1 else '' for element in params]

    templates = soup.find_all('h5')
    description = [elem.get_text(strip=True) for elem in templates] 
    
    
    # Store the data
    all_prices.extend(prices)
    all_rooms.extend(rooms)
    all_surface.extend(surface)
    all_bathrooms.extend(bathrooms)
    all_description.extend(description)

    
    # Print the progress every 30 pages:
    if i%30 == 0: print(i)

print('Finish')

30
60
90
120
150
Finish


Check the lists

In [103]:
print('Surfaces: ',all_surface[0:30])
print('Rooms: ',all_rooms[0:30])
print('Bathrooms: ',all_bathrooms[0:30])

Surfaces:  ['64', '37', '33', '48', '30', '56', '101', '55', '34', '48', '65', '56', '51', '35', '55', '39', '33', '62', '58', '33', '66', '54', '63', '55', '41', '68', '63', '57', '47', '50']
Rooms:  ['2', '1', '1', '1', '1', '3', '3', '2', '1', '1', '2', '2', '2', '1', '2', '1', '', '2', '1', '1', '2', '2', '2', '1', '1', '3', '2', '1', '2', '2']
Bathrooms:  ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']


It looks much better; now the data is split correctly.

In [104]:
# Store the data into a dataframe

data = {'Description': all_description,
        'Bathrooms': all_bathrooms,
        'Rooms': all_rooms,
        'Surface (m2)': all_surface,
        'Price (€)': all_prices}

df = pd.DataFrame(data)
df

| | Description | Bathrooms | Rooms | Surface (m2) | Price (€) |
|---|---|---|---|---|---|
| 0 | Piso en venta en CALLE BENIMAMET, San Cristóba... | 1 | 2 | 64 | 106.500 |
| 1 | Piso en venta en Buenavista, Carabanchel, Madr... | 1 | 1 | 37 | 120.000 |
| 2 | Piso en venta en Buenavista, Carabanchel, Madr... | 1 | 1 | 33 | 120.000 |
| 3 | Piso en venta en San Blas-Canillejas, Madrid, ... | 1 | 1 | 48 | 120.000 |
| 4 | Loft en venta en CALLE HERMOSILLA, Fuente del ... | 1 | 1 | 30 | 124.000 |
| ... | ... | ... | ... | ... | ... |
| 2077 | Piso en venta en CALLE ORIO, 4, Los Ángeles, V... | 1 | 3 | | A consultar |
| 2078 | Piso en venta en CALLE CANCION DEL OLVIDO, 12,... | 1 | 3 | 90 | A consultar |
| 2079 | Piso en venta en CALLE LA ALEGRIA DE LA HUERTA... | 1 | 3 | 79 | A consultar |
| 2080 | Piso en venta en CALLE Ochagavia, Valdezarza, ... | 1 | 3 | 76 | A consultar |


In fact, properties with all three problematic parameters empty no longer cause any problems:

In [107]:
df[df['Bathrooms'] == '']

| | Description | Bathrooms | Rooms | Surface (m2) | Price (€) |
|---|---|---|---|---|---|
| 637 | Piso en venta en Hispanoamérica, Chamartín, Ma... | | | | 1.150.000 |
| 2072 | Piso en venta en CALLE MARISMAS, 57, Nueva Num... | | | | A consultar |
| 2073 | Casa en venta en CALLE VILLACARRILLO, 1, Entre... | | | | A consultar |


In [105]:
# Save the DataFrame to an Excel file:
df.to_excel('inmuebles_redpiso.xlsx')


In the next notebook we will load the Excel files that we have generated, clean the data and combine it to prepare it for the analysis.