# Trip planning - Prepare your next holiday!

## Part 2 - Finding the best hotels

Now that we have the **rankings of the best French cities to visit in the coming week** (see Part 1), we can move on to **finding hotels to stay at**!

To do so, we will select the five best destinations and **scrape [Booking](https://www.booking.com/index.fr.html)'s website** to get the list of hotels there. In addition to the names, we want **useful information to help us decide which hotel to pick**, such as ratings, locations and descriptions.

### Getting ready for web scraping

In [1]:
# Importing libraries

import pandas as pd
import os
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

In [2]:
# Loading the CSV created previously 

dataset = pd.read_csv("destinations.csv")

# Extracting and displaying the 5 best locations

top_destinations = dataset[dataset["rank"] <= 5].loc[:, "city_name"].to_list()
top_destinations

['Nimes', 'Aigues Mortes', 'Aix en Provence', 'Avignon', 'Uzes']
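
Note that filtering on `rank <= 5` keeps the rows in their CSV order, which is not necessarily the ranking order. If `top_destinations[0]` is meant to be the best-ranked city, sorting first makes that explicit. Here is a minimal sketch on a made-up table (the rows and the `toy` name are illustrative, not data from Part 1):

```python
import pandas as pd

# Hypothetical stand-in for destinations.csv, deliberately not sorted by rank
toy = pd.DataFrame({
    "city_name": ["Avignon", "Paris", "Nimes", "Uzes"],
    "rank": [4, 12, 1, 5],
})

# Keeping the ranks up to 5, then ordering them from best to worst
# before extracting the city names
top = toy[toy["rank"] <= 5].sort_values("rank")["city_name"].to_list()
print(top)
```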

### Scraping the list of hotels in our top 5 destinations

The library we will use for web scraping is **Scrapy**. A minor inconvenience is that Scrapy runs on Twisted's reactor, which cannot be started twice in the same process, so we'll have to **restart Jupyter's kernel** each time we move to a new city. First of all, we'll **create a class** configuring our "**spider**": it will contain all the steps required for scraping a given URL.

Then, we'll be able to run our spider for each of our 5 selected destinations and save the results.

In [3]:
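# Creating a folder that will contain our files
# (exist_ok avoids an error if the folder already exists when this cell is rerun)

os.makedirs("hotels_data/", exist_ok = True)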

#### First city

In [4]:
# Creating a class that inherits from Scrapy's Spider class

class Hotels(scrapy.Spider):
    # Name of the spider
    name = "hotels"

    # Starting URL, i.e. the welcome page of the website
    start_urls = ["https://www.booking.com/index.fr.html"]

    # Simulating a request in the search bar through a parse function
    def parse(self, response):
        return scrapy.FormRequest.from_response(response,
                                                formdata = {"ss": destination_name},
                                                callback = self.after_search)

    # Indicating what to do once search results come up
    def after_search(self, response):
        # Selecting a specific block in the page through CSS elements
        hotels = response.css(".sr_item")

        # Getting the name, URL, coordinates, score and description of the hotels,
        # again through the use of CSS selectors
        for h in hotels:
            yield {"hotel_name": h.css(".sr-hotel__name::text").get(),
                   "url": "https://www.booking.com" + h.css(".hotel_name_link").attrib["href"],
                   "coords": h.css(".sr_card_address_line a").attrib["data-coords"],
                   "score": h.css(".bui-review-score__badge::text").get(),
                   "description": h.css(".hotel_desc::text").get()}
        
        # Accessing the next page, if relevant, and reexecuting this function
        try:
            next_page = response.css("a.paging-next").attrib["href"]
        except KeyError:
            logging.info("No next page. Terminating crawling process.")
        else:
            yield response.follow(next_page, callback = self.after_search)

Let's see how to run a spider for one of our destinations. We'll then just have to restart the kernel and move on to the next one! 

We could wrap what follows in a function to keep our code **DRY (Don't Repeat Yourself)**, but this is an exception: when the kernel restarts, any function we defined is **deleted from memory**, so it would have to be redefined in every cell anyway.
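
One possible way around the restart, not used in this notebook, is to launch each crawl in its own Python process, since a new process starts with a fresh interpreter. The sketch below only shows the mechanism. `WORKER_CODE` is a hypothetical stand-in that prints its argument; a real version would be a script that defines the spider and starts the crawler.

```python
import subprocess
import sys

# Hypothetical stand-in for a script that would define the spider and run the crawl
WORKER_CODE = "import sys; print('crawling ' + sys.argv[1])"

def run_in_fresh_interpreter(city):
    """Run the worker code for one city in a brand new Python process."""
    result = subprocess.run([sys.executable, "-c", WORKER_CODE, city],
                            capture_output = True,
                            text = True,
                            check = True)
    return result.stdout.strip()

# Each call starts from a clean state, so nothing is carried over between cities
outputs = [run_in_fresh_interpreter(city) for city in ["Nimes", "Aigues Mortes"]]
print(outputs)
```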

In [5]:
# Setting the first city of our top 5 as the selected city

destination_name = top_destinations[0]

# Creating the name of the JSON file that will store the results

filename = "hotels_" + destination_name.replace(" ", "-") + ".json"

# Checking if a file with this name already exists, and deleting it if so
# This only matters if you rerun code that you already ran

if filename in os.listdir("hotels_data/"):
    os.remove("hotels_data/" + filename)

# Configuring the crawler with the browser we want to simulate (user agent),
# the amount of logging we want to display and the feed used to save the results
    
process = CrawlerProcess(settings = {"USER_AGENT": "Chrome/84.0 (compatible; MSIE 7.0; Windows NT 5.1)",
                                     "LOG_LEVEL": logging.WARNING,
                                     "FEEDS": {"hotels_data/" + filename: {"format": "json"}}})

# Starting the crawler

process.crawl(Hotels)
process.start()

It worked as intended! Now, you'll have to **restart the kernel** before searching for hotels in each of the other cities.

In the following cells, I will **copy-paste everything we wrote so far** in this notebook, so that you'll only have one cell to execute after restarting the kernel.
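
The file name and crawler settings are the part that repeats in every cell. As a sketch, they could be built as shown below. `feed_settings` is a hypothetical helper, and the `overwrite` feed option is assumed to be available (it was added in Scrapy 2.4), in which case the manual `os.remove` step is no longer needed.

```python
import logging

def feed_settings(city, folder = "hotels_data"):
    """Build the crawler settings for one city (hypothetical helper)."""
    filename = "hotels_" + city.replace(" ", "-") + ".json"
    path = folder + "/" + filename
    return {
        "USER_AGENT": "Chrome/84.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "LOG_LEVEL": logging.WARNING,
        # "overwrite" asks Scrapy to replace an existing file instead of appending to it
        "FEEDS": {path: {"format": "json", "overwrite": True}},
    }

settings = feed_settings("Aix en Provence")
print(list(settings["FEEDS"]))
```

The returned dictionary would then be passed as `CrawlerProcess(settings = feed_settings(destination_name))`.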

#### Second city

In [1]:
# REMEMBER TO RESTART YOUR KERNEL BEFORE EXECUTING THIS CELL

import pandas as pd
import os
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

dataset = pd.read_csv("destinations.csv")
top_destinations = dataset[dataset["rank"] <= 5].loc[:, "city_name"].to_list()

class Hotels(scrapy.Spider):
    name = "hotels"
    start_urls = ["https://www.booking.com/index.fr.html"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(response,
                                                formdata = {"ss": destination_name},
                                                callback = self.after_search)

    def after_search(self, response):
        hotels = response.css(".sr_item")
        for h in hotels:
            yield {"hotel_name": h.css(".sr-hotel__name::text").get(),
                   "url": "https://www.booking.com" + h.css(".hotel_name_link").attrib["href"],
                   "coords": h.css(".sr_card_address_line a").attrib["data-coords"],
                   "score": h.css(".bui-review-score__badge::text").get(),
                   "description": h.css(".hotel_desc::text").get()}
        
        try:
            next_page = response.css("a.paging-next").attrib["href"]
        except KeyError:
            logging.info("No next page. Terminating crawling process.")
        else:
            yield response.follow(next_page, callback = self.after_search)
            
# Here is the only line that changes!

destination_name = top_destinations[1]

filename = "hotels_" + destination_name.replace(" ", "-") + ".json"

if filename in os.listdir("hotels_data/"):
    os.remove("hotels_data/" + filename)

process = CrawlerProcess(settings = {"USER_AGENT": "Chrome/84.0 (compatible; MSIE 7.0; Windows NT 5.1)",
                                     "LOG_LEVEL": logging.WARNING,
                                     "FEEDS": {"hotels_data/" + filename: {"format": "json"}}})

process.crawl(Hotels)
process.start()

#### Third city

In [1]:
# REMEMBER TO RESTART YOUR KERNEL BEFORE EXECUTING THIS CELL

import pandas as pd
import os
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

dataset = pd.read_csv("destinations.csv")
top_destinations = dataset[dataset["rank"] <= 5].loc[:, "city_name"].to_list()

class Hotels(scrapy.Spider):
    name = "hotels"
    start_urls = ["https://www.booking.com/index.fr.html"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(response,
                                                formdata = {"ss": destination_name},
                                                callback = self.after_search)

    def after_search(self, response):
        hotels = response.css(".sr_item")
        for h in hotels:
            yield {"hotel_name": h.css(".sr-hotel__name::text").get(),
                   "url": "https://www.booking.com" + h.css(".hotel_name_link").attrib["href"],
                   "coords": h.css(".sr_card_address_line a").attrib["data-coords"],
                   "score": h.css(".bui-review-score__badge::text").get(),
                   "description": h.css(".hotel_desc::text").get()}
        
        try:
            next_page = response.css("a.paging-next").attrib["href"]
        except KeyError:
            logging.info("No next page. Terminating crawling process.")
        else:
            yield response.follow(next_page, callback = self.after_search)
            
# Here is the only line that changes!

destination_name = top_destinations[2]

filename = "hotels_" + destination_name.replace(" ", "-") + ".json"

if filename in os.listdir("hotels_data/"):
    os.remove("hotels_data/" + filename)

process = CrawlerProcess(settings = {"USER_AGENT": "Chrome/84.0 (compatible; MSIE 7.0; Windows NT 5.1)",
                                     "LOG_LEVEL": logging.WARNING,
                                     "FEEDS": {"hotels_data/" + filename: {"format": "json"}}})

process.crawl(Hotels)
process.start()

#### Fourth city

In [1]:
# REMEMBER TO RESTART YOUR KERNEL BEFORE EXECUTING THIS CELL

import pandas as pd
import os
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

dataset = pd.read_csv("destinations.csv")
top_destinations = dataset[dataset["rank"] <= 5].loc[:, "city_name"].to_list()

class Hotels(scrapy.Spider):
    name = "hotels"
    start_urls = ["https://www.booking.com/index.fr.html"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(response,
                                                formdata = {"ss": destination_name},
                                                callback = self.after_search)

    def after_search(self, response):
        hotels = response.css(".sr_item")
        for h in hotels:
            yield {"hotel_name": h.css(".sr-hotel__name::text").get(),
                   "url": "https://www.booking.com" + h.css(".hotel_name_link").attrib["href"],
                   "coords": h.css(".sr_card_address_line a").attrib["data-coords"],
                   "score": h.css(".bui-review-score__badge::text").get(),
                   "description": h.css(".hotel_desc::text").get()}
        
        try:
            next_page = response.css("a.paging-next").attrib["href"]
        except KeyError:
            logging.info("No next page. Terminating crawling process.")
        else:
            yield response.follow(next_page, callback = self.after_search)
            
# Here is the only line that changes!

destination_name = top_destinations[3]

filename = "hotels_" + destination_name.replace(" ", "-") + ".json"

if filename in os.listdir("hotels_data/"):
    os.remove("hotels_data/" + filename)

process = CrawlerProcess(settings = {"USER_AGENT": "Chrome/84.0 (compatible; MSIE 7.0; Windows NT 5.1)",
                                     "LOG_LEVEL": logging.WARNING,
                                     "FEEDS": {"hotels_data/" + filename: {"format": "json"}}})

process.crawl(Hotels)
process.start()

#### Fifth city

In [1]:
# REMEMBER TO RESTART YOUR KERNEL BEFORE EXECUTING THIS CELL

import pandas as pd
import os
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

dataset = pd.read_csv("destinations.csv")
top_destinations = dataset[dataset["rank"] <= 5].loc[:, "city_name"].to_list()

class Hotels(scrapy.Spider):
    name = "hotels"
    start_urls = ["https://www.booking.com/index.fr.html"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(response,
                                                formdata = {"ss": destination_name},
                                                callback = self.after_search)

    def after_search(self, response):
        hotels = response.css(".sr_item")
        for h in hotels:
            yield {"hotel_name": h.css(".sr-hotel__name::text").get(),
                   "url": "https://www.booking.com" + h.css(".hotel_name_link").attrib["href"],
                   "coords": h.css(".sr_card_address_line a").attrib["data-coords"],
                   "score": h.css(".bui-review-score__badge::text").get(),
                   "description": h.css(".hotel_desc::text").get()}
        
        try:
            next_page = response.css("a.paging-next").attrib["href"]
        except KeyError:
            logging.info("No next page. Terminating crawling process.")
        else:
            yield response.follow(next_page, callback = self.after_search)
            
# Here is the only line that changes!

destination_name = top_destinations[4]

filename = "hotels_" + destination_name.replace(" ", "-") + ".json"

if filename in os.listdir("hotels_data/"):
    os.remove("hotels_data/" + filename)

process = CrawlerProcess(settings = {"USER_AGENT": "Chrome/84.0 (compatible; MSIE 7.0; Windows NT 5.1)",
                                     "LOG_LEVEL": logging.WARNING,
                                     "FEEDS": {"hotels_data/" + filename: {"format": "json"}}})

process.crawl(Hotels)
process.start()

Here we are! Check your freshly created `hotels_data` folder: you should have **a JSON file for each of our top 5 cities**.
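
A quick way to check the result is to list which expected files are present and count the records in each. The sketch below is one possible approach: `summarize_feeds` is a hypothetical helper, and it is demonstrated on a temporary folder with made-up content rather than on the real `hotels_data` folder.

```python
import json
import os
import tempfile

def summarize_feeds(folder, cities):
    """Map each city to its number of scraped hotels, or None if its file is missing."""
    summary = {}
    for city in cities:
        path = os.path.join(folder, "hotels_" + city.replace(" ", "-") + ".json")
        if os.path.exists(path):
            with open(path, encoding = "utf-8") as f:
                summary[city] = len(json.load(f))
        else:
            summary[city] = None
    return summary

# Demonstration on a temporary folder containing a single made-up feed file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "hotels_Aigues-Mortes.json"), "w", encoding = "utf-8") as f:
        json.dump([{"hotel_name": "Example Hotel"}], f)
    summary = summarize_feeds(tmp, ["Nimes", "Aigues Mortes"])

print(summary)
```

On the real data, the call would be `summarize_feeds("hotels_data", top_destinations)`.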

It is now time to **explore these hotel lists** and combine them with the table generated in the previous notebook. That will be the purpose of **Part 3**!