
# Book your summer vacation and save money with Python and Web Scraping

We all know how hard it is to find a cheap hotel or apartment when booking a summer vacation. Everybody ends up browsing plenty of websites, trying to find out which city has the cheapest options. This process is tedious and time-consuming.

Luckily, if you have a little knowledge of Python, you can make this process much faster. In this notebook, I will guide you step by step and show you how to develop a simple program that, given a list of cities, tells you what the cheapest hotels in these places are, ordered by price. Here is an example of the final output of this program:

In [52]:
cities = ["Monterosso", "Loano", "Imperia", "Savona"]
country   = "Italy"
people    = 5
in_day    = 20
in_month  = 8
in_year   = 2019
out_day   = 27
out_month = 8
out_year  = 2019

df = get_house_prices(cities, country, people, in_month, in_day, in_year, out_month, out_day, out_year)
df.head(9)

Unnamed: 0,House,Price,Link,City
0,Hotel Garden Lido,542,https://www.booking.com/hotel/it/grand-garden-...,Loano
1,Villa Tanca Luxury Collection,575,https://www.booking.com/hotel/it/villa-tanca-l...,Monterosso
2,trilocale arredato,651,https://www.booking.com/hotel/it/trilocale-arr...,Loano
3,Locazione turistica Casa Nives (IMP169),749,https://www.booking.com/hotel/it/locazione-tur...,Imperia
4,Casa Franco,765,https://www.booking.com/hotel/it/casa-franco-f...,Loano
5,Appartamenti Tra gli Ulivi,770,https://www.booking.com/hotel/it/appartamenti-...,Loano
6,appartamento VIA TERRE BIANCHE,844,https://www.booking.com/hotel/it/appartamento-...,Imperia
7,Villa Caterina,847,https://www.booking.com/hotel/it/villa-caterin...,Imperia
8,Belbea Tourist Resort **,859,https://www.booking.com/hotel/it/belbea-touris...,Loano


If you do not want to go through the detailed explanation of all the steps, you can simply copy and paste the function that you will find at the end of this article.

## Walkthrough

First things first, let's import the libraries that we are going to use.

In [8]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

Let's suppose you want to find a cheap place in Genova, and you want to spend a week with your four friends, from August 15 to August 22. We need to define a series of variables that we are going to use soon: check-in and check-out dates, number of people and city.

In [9]:
in_month  = 8
in_day    = 15
in_year   = 2019
out_month = 8
out_day   = 22
out_year  = 2019
people    = 5
city      = "Genova"
country   = "Italy"

In this example, we are going to scrape the information from booking.com, but you can adapt this code to any other website. Since we are going to do some web scraping, we need to set our User-Agent, the string that your browser sends to identify itself to a website. Copy yours to your clipboard and paste it into the next variable that we are going to define. We will also define a variable called url that contains the address of the web page that we will scrape the data from.

In [25]:
headers = {"User-Agent": "paste_your_user_agent_here"}
url = "https://www.booking.com/searchresults.it.html?checkin_month={in_month}&checkin_monthday={in_day}&checkin_year={in_year}&checkout_month={out_month}&checkout_monthday={out_day}&checkout_year={out_year}&group_adults={people}&group_children=0&order=price&ss={city}%2C%20{country}"\
.format(in_month=in_month,
       in_day=in_day,
       in_year=in_year,
       out_month=out_month,
       out_day=out_day,
       out_year=out_year,
       people=people,
       city=city,
       country=country)
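
The URL above is assembled by hand, which works for simple names such as "Genova" but can break for cities whose names contain spaces or accented letters. As a minimal sketch of an alternative, the standard library can percent-encode every value for us. The helper name `build_search_url` is hypothetical; the query parameters are the ones already used in the URL above.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://www.booking.com/searchresults.it.html"

def build_search_url(city, country, people,
                     in_month, in_day, in_year,
                     out_month, out_day, out_year):
    """Return the search URL for one city, with every value percent-encoded."""
    params = {
        "checkin_month": in_month,
        "checkin_monthday": in_day,
        "checkin_year": in_year,
        "checkout_month": out_month,
        "checkout_monthday": out_day,
        "checkout_year": out_year,
        "group_adults": people,
        "group_children": 0,
        "order": "price",
        "ss": f"{city}, {country}",
    }
    # quote_via=quote encodes spaces as %20, like the hand-written URL
    return BASE_URL + "?" + urlencode(params, quote_via=quote)

url = build_search_url("Genova", "Italy", 5, 8, 15, 2019, 8, 22, 2019)
print(url)
```

With this helper, a city such as "La Spezia" is encoded as `La%20Spezia` without any manual editing of the string.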

Let's now check that everything is working fine, and let's create our BeautifulSoup object that is necessary for our web scraping.

In [26]:
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
print(response)

<Response [200]>


If our output is <Response [200]>, it means that the request succeeded. If the number is not 200, the website did not return the page we asked for: go back and check first that you pasted your User-Agent correctly.
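
If you prefer the program to stop with a readable message instead of checking the printed status by eye, a small guard can do it. This is only a sketch: `ensure_ok` is a hypothetical helper, and it relies solely on the `status_code` attribute that the response object exposes.

```python
from http import HTTPStatus

def ensure_ok(response):
    """Return the response unchanged, or raise if its status code is not 200."""
    code = response.status_code
    if code != HTTPStatus.OK:
        try:
            phrase = HTTPStatus(code).phrase   # e.g. "Forbidden" for 403
        except ValueError:                     # non-standard status code
            phrase = "Unknown status"
        raise RuntimeError(f"Request failed with status {code} ({phrase})")
    return response
```

It could be used as `response = ensure_ok(requests.get(url, headers=headers))`. The requests library also offers `response.raise_for_status()`, which raises on 4xx and 5xx codes.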

### Collect house prices

We can finally start our web scraping and collect the prices of the ten cheapest houses in Genova.

In [27]:
prices = soup.find_all("div", class_="prco-ltr-right-align-helper")
prices_list = [price.get_text().split("\n") for price in prices]

If you print the object prices_list, you can see that the price of each house is hiding there. With the next few lines of code, we are going to extract it and to save everything into a pandas DataFrame.

In [28]:
# Take field 2 of entries 1, 4, 7, ..., 28 of prices_list (one entry per house)
prcs = [prices_list[i][2] for i in range(1, 29, 3)]
# Remove the non-breaking space between the currency symbol and the amount
final_prices = pd.DataFrame({"Price": [p.replace("\xa0", "") for p in prcs]})
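
The code above assumes that the page returned at least ten houses; with fewer results it would stop with an IndexError. As a hedged sketch, slicing selects the same entries (1, 4, 7, ...) but simply returns a shorter list when fewer are available. The name `pick_prices` and the sample data are hypothetical, and the sketch does not protect against entries that lack field 2.

```python
def pick_prices(prices_list, limit=10):
    """Take field 2 of every third entry, starting at index 1, for at most `limit` houses."""
    selected = prices_list[1::3][:limit]   # slicing never raises on short lists
    return [fields[2].replace("\xa0", "") for fields in selected]

# Hypothetical stand-in for the scraped data: two houses instead of ten
sample = [
    ["", "x"], ["", "", "€\xa01.630", ""], ["", "y"],
    ["", "x"], ["", "", "€\xa01.760", ""], ["", "y"],
]
print(pick_prices(sample))
```

If you use this variant, remember that the ID column must then have as many entries as the list of prices, not always ten.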

Once we have a dataframe containing the prices of the ten cheapest houses in Genova, we will add a column called "ID" that we will use to join this dataset with the others.

In [15]:
IDs = list(range(10))
final_prices["ID"] = IDs

If you print the object final_prices, you will see something like this:

In [16]:
final_prices

Unnamed: 0,Price,ID
0,€1.630,0
1,€1.760,1
2,€1.663,2
3,€2.030,3
4,€2.240,4
5,€3.089,5
6,€2.376,6
7,€2.430,7
8,€2.775,8
9,€3.065,9
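
Note that the prices are still strings at this point, so they cannot be sorted numerically yet. A minimal sketch of the conversion follows, assuming that the "." is a thousands separator (as on the Italian version of the page) and that the prices have no decimals. The helper name `parse_price` is hypothetical.

```python
def parse_price(text):
    """Convert a price string such as '€1.630' to the integer 1630."""
    cleaned = text.replace("€", "").replace("\xa0", "").replace(".", "").strip()
    return int(cleaned)

print(parse_price("€1.630"))  # 1630
```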


### Collect house links

Of course, at this point, you will want to match each price with its booking.com link, so that you can actually see the house whose price is listed in the previous object. This step is very similar to what we have just done.

In [17]:
links = soup.find_all("a", class_="hotel_name_link url")
links_list = ["https://www.booking.com" + str(link.get("href").replace("\n", "")) for link in links]
final_links = pd.DataFrame(links_list[:10], columns=["Link"])
final_links["ID"] = IDs

Now we need to merge the two dataframes.

In [18]:
df = final_prices.merge(final_links, on="ID")

If you print df, you will see something that looks like this:

In [19]:
df

Unnamed: 0,Price,ID,Link
0,€1.630,0,https://www.booking.com/hotel/it/albergo-posta...
1,€1.760,1,https://www.booking.com/hotel/it/novotelgenova...
2,€1.663,2,https://www.booking.com/hotel/it/albergo-panso...
3,€2.030,3,https://www.booking.com/hotel/it/la-piccola-ma...
4,€2.240,4,https://www.booking.com/hotel/it/memeapartment...
5,€3.089,5,https://www.booking.com/hotel/it/hotelmetropol...
6,€2.376,6,https://www.booking.com/hotel/it/genoa-city.it...
7,€2.430,7,https://www.booking.com/hotel/it/best-western-...
8,€2.775,8,https://www.booking.com/hotel/it/al-villino-br...
9,€3.065,9,https://www.booking.com/hotel/it/doria-genova....


### Collect house names

The last step is to collect the names of the hotels, in order to have a final dataframe that is easier to read.

In [20]:
names = soup.find_all("span", class_="sr-hotel__name")
names_list = [name.get_text().replace("\n", "") for name in names]
houses = pd.DataFrame(names_list, columns=["House"]).head(10)
houses["ID"] =  IDs
df = houses.merge(df, on="ID")

If you print df, you will see something that looks like this:

In [21]:
df

Unnamed: 0,House,ID,Price,Link
0,Albergo Posta,0,€1.630,https://www.booking.com/hotel/it/albergo-posta...
1,Novotel Genova City,1,€1.760,https://www.booking.com/hotel/it/novotelgenova...
2,Albergo Panson,2,€1.663,https://www.booking.com/hotel/it/albergo-panso...
3,La Piccola Maddalena,3,€2.030,https://www.booking.com/hotel/it/la-piccola-ma...
4,memeapartments,4,€2.240,https://www.booking.com/hotel/it/memeapartment...
5,Best Western Hotel Metropoli,5,€3.089,https://www.booking.com/hotel/it/hotelmetropol...
6,Holiday Inn Genoa City,6,€2.376,https://www.booking.com/hotel/it/genoa-city.it...
7,Best Western Porto Antico,7,€2.430,https://www.booking.com/hotel/it/best-western-...
8,Al Villino Bruzza,8,€2.775,https://www.booking.com/hotel/it/al-villino-br...
9,Hotel Doria,9,€3.065,https://www.booking.com/hotel/it/doria-genova....
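
The ID column exists only to line up the three lists by position. As a sketch of an alternative (not the approach used in this notebook), the lists can be paired directly with `zip`, assuming that names, prices and links appear on the page in the same order. The function name `build_rows` and the example links are hypothetical.

```python
def build_rows(names, prices, links, limit=10):
    """Pair the i-th name, price and link into one record per house."""
    rows = []
    # zip stops at the shortest list, so a missing element cannot raise an error
    for name, price, link in zip(names[:limit], prices[:limit], links[:limit]):
        rows.append({"House": name.strip(), "Price": price, "Link": link})
    return rows

rows = build_rows(
    ["\nAlbergo Posta\n", "\nNovotel Genova City\n"],
    ["€1.630", "€1.760"],
    ["https://www.booking.com/example-a", "https://www.booking.com/example-b"],  # placeholder links
)
print(rows[0])
```

Passing the result to `pd.DataFrame(rows)` gives a dataframe with the columns House, Price and Link, without the helper column.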


Awesome! Now we have the list of the ten cheapest places in Genova. But what if you want to compare the house prices in Genova with the prices of the nearby cities? We are now going to define a function that merges what we have done so far, and loops over each of the cities that we are interested in. If, for example, you also want to check the house prices in Monterosso, Loano, Imperia and Savona, our function will check the prices in all these cities and list all the hotels, starting from the cheapest.

### Complete function that loops for each city

As I mentioned above, this function simply brings together all the previous steps, and makes some adjustments. I will briefly describe each step here, and you will find the complete code at the end of this article.

1. The function takes as an input a list of cities: even if you only want to check one single city, you need to put it into a list of one element. On the other hand, the variable country must be a string. The first thing the code does is to check that there are no mistakes in the inputs.
2. Set the User-Agent (the libraries are imported once, before the function is defined).
3. Initialize an empty dataframe. We need this because we will append to it all the dataframes containing the house prices for each city.
4. Start the loop: at each cycle we update the name of the city.
5. Get prices, url links and house names, as we did before.
6. Append the dataframe that we have created for one city to the big dataframe that contains data about all the cities, and keep looping for the others.
7. Once the loop has ended, we can get rid of the column ID. We then have to change the data type of the column "Price": we need it to be numeric, so that we can order the hotels by price. At the end, we sort the dataframe by price and Bob's your uncle! We can also export our result to a CSV file.
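
The overall flow of these steps can be sketched without any network access. In the snippet below, `scrape_city` is a hypothetical stand-in for step 5 (it returns one dictionary per house, with the price already converted to a number), and the sample rows are taken from the example output at the top of this notebook.

```python
def cheapest_across_cities(cities, scrape_city):
    """Gather the houses of every city and return them cheapest first."""
    if not isinstance(cities, list):              # step 1: validate the input
        raise TypeError("cities must be in a list")
    all_rows = []                                 # step 3: empty container
    for city in cities:                           # step 4: loop over the cities
        for row in scrape_city(city):             # step 5: one dict per house
            all_rows.append({**row, "City": city})  # step 6: collect
    return sorted(all_rows, key=lambda row: row["Price"])  # step 7: sort

# Hypothetical stand-in for the scraping step
SAMPLE = {
    "Monterosso": [{"House": "Villa Tanca Luxury Collection", "Price": 575}],
    "Loano": [{"House": "trilocale arredato", "Price": 651},
              {"House": "Hotel Garden Lido", "Price": 542}],
}

for row in cheapest_across_cities(["Monterosso", "Loano"], lambda city: SAMPLE[city]):
    print(row["Price"], row["House"], row["City"])
```

The real function works on pandas dataframes instead of lists of dictionaries, but the order of the operations is the same.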

## Complete code

In [47]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_house_prices(cities, country, people, in_month, in_day, in_year, out_month, out_day, out_year):
    
    if not isinstance(cities, list):
        raise TypeError("cities must be in a list")
    if not isinstance(country, str):
        raise TypeError("country must be a string containing only one country")
    
    # Set headers (the libraries are imported above, outside the function)
    headers = {"User-Agent": "paste_your_user_agent_here"}
    
    # Initialize empty dataframe
    final_df = pd.DataFrame()
    
    # Loop for each city
    for city in cities:
        url = "https://www.booking.com/searchresults.it.html?checkin_month={in_month}&checkin_monthday={in_day}&checkin_year={in_year}&checkout_month={out_month}&checkout_monthday={out_day}&checkout_year={out_year}&group_adults={people}&group_children=0&order=price&ss={city}%2C%20{country}"\
                .format(in_month=in_month,
                        in_day=in_day,
                        in_year=in_year,
                        out_month=out_month,
                        out_day=out_day,
                        out_year=out_year,
                        people=people,
                        city=city,
                        country=country)
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, "lxml")
        
        # Get prices
        prices = soup.find_all("div", class_="prco-ltr-right-align-helper")
        prices_list = [price.get_text().split("\n") for price in prices]
        # Take field 2 of entries 1, 4, 7, ..., 28 of prices_list (one entry per house)
        prcs = [prices_list[i][2] for i in range(1, 29, 3)]
        final_prices = pd.DataFrame({"Price": [p.replace("\xa0", "") for p in prcs]})
        IDs = list(range(10))
        final_prices["ID"] = IDs
        
        # Get url links
        links = soup.find_all("a", class_="hotel_name_link url")
        links_list = ["https://www.booking.com" + str(link.get("href").replace("\n", "")) for link in links]
        final_links = pd.DataFrame(links_list[:10], columns=["Link"])
        final_links["ID"] = IDs
        
        # Merge the two dataframes
        df = final_prices.merge(final_links, on="ID")
        
        # Get house names
        names = soup.find_all("span", class_="sr-hotel__name")
        names_list = [name.get_text().replace("\n", "") for name in names]
        houses = pd.DataFrame(names_list, columns=["House"]).head(10)
        houses["ID"] = IDs
        
        # Merge everything and append to the big dataframe
        df = houses.merge(df, on="ID")
        df["City"] = city
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        final_df = pd.concat([final_df, df], ignore_index=True)
    
    # Final adjustment and sort
    del final_df["ID"]
    # Strip the euro sign and the "." thousands separator, then convert to numbers
    final_df["Price"] = final_df["Price"].str.replace("€", "", regex=False).str.replace(".", "", regex=False)
    final_df["Price"] = pd.to_numeric(final_df["Price"])
    final_df = final_df.sort_values("Price").reset_index(drop=True)
    
    final_df.to_csv("houses.csv")
    
    return final_df