# M3 Data collection and management: project
## Planning my next holidays ☀️

Let's create a script that retrieves information about all the hotels in a given city on www.booking.com 🧙
**We strongly recommend using Scrapy; it will make the job much easier!**

You can scrape as much information as you want, but we suggest that you get at least:
* The hotel name
* The URL of its booking.com page
* Its coordinates: latitude and longitude
* The score given by the website users
* The text description of the hotel

Then you can run this script for several cities from yesterday's list. Make sure you save the results in a separate file for each city and that the city name is part of the filename (because you will use it later 😉)
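
One possible way to keep the city name in the filename is sketched below. The helper names `city_to_filename` and `slug_from_filename` are hypothetical, and the `hotels_` prefix and `json_folder_city` folder are assumptions borrowed from the cells later in this notebook rather than a required convention.

```python
from pathlib import Path

PREFIX = "hotels_"  # assumed prefix, same as the one used in the crawling cell


def city_to_filename(city: str, folder: str = "json_folder_city") -> Path:
    """Build the output path for one city, e.g. json_folder_city/hotels_Le-Havre.json."""
    slug = city.strip().replace(" ", "-")  # spaces are awkward in filenames
    return Path(folder) / f"{PREFIX}{slug}.json"


def slug_from_filename(path) -> str:
    """Recover the city slug (spaces already replaced by hyphens) from a result file."""
    stem = Path(path).stem
    if not stem.startswith(PREFIX):
        raise ValueError(f"unexpected filename: {path}")
    return stem[len(PREFIX):]
```

Note that the slug is not always reversible: a hyphen in the slug may come from a space or from a hyphen in the original city name, so it is safer to compare slugs than to rebuild the city name.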

In [1]:
import os
import logging
import pandas as pd

import scrapy
from scrapy.crawler import CrawlerProcess

In [2]:
df_best_weather = pd.read_csv('weather_cities.csv', index_col=0)
df_best_weather.head() 

| Unnamed: 0 | city_id | city_name | lat | lon | temperature | main_weather | humidity | expected_rain | wind_speed | UV_indice | rank | inverted_rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | Bayonne | 43.493338 | -1.475099 | 18.26 | Clouds | 60.29 | 0.89 | 1.57 | 1.75 | 1 | 35 |
| 1 | 28 | Carcassonne | 43.213036 | 2.349107 | 17.99 | Clouds | 55.71 | 0.0396 | 3.74 | 1.77 | 2 | 34 |
| 2 | 32 | Biarritz | 43.471144 | -1.552727 | 17.69 | Clouds | 62.43 | 0.8996 | 1.61 | 1.74 | 3 | 33 |
| 3 | 19 | Cassis | 43.214036 | 5.539632 | 17.65 | Clouds | 60.0 | 0.0 | 2.27 | 1.76 | 4 | 32 |
| 4 | 27 | Collioure | 42.52505 | 3.083155 | 17.63 | Clouds | 63.57 | 0.0 | 2.39 | 1.82 | 5 | 31 |


In [3]:
# .loc slicing is label-based and inclusive, so 0:2 selects the three best-ranked cities
list_cities = df_best_weather.loc[0:2, 'city_name'].tolist()
list_cities

['Bayonne', 'Carcassonne', 'Biarritz']

In [4]:
destination = 'Biarritz'

In [5]:
class Plan_next_holidaysSpider(scrapy.Spider):

    # Name of your spider
    name = "plan_next_holidays"

    # URL to start your spider from
    start_urls = [
        'https://www.booking.com/index.fr.html'
    ]

    def parse(self, response):
        # Fill in the search form of the home page with the destination city
        return scrapy.FormRequest.from_response(
            response,
            formdata={'ss': destination},
            callback=self.after_search
        )

    def after_search(self, response):
        # Each search result is rendered in its own 'div.sr_item' card
        for hotel in response.css('div.sr_item'):
            # .attrib is an empty dict when nothing matches, so .get() returns None instead of raising KeyError
            relative_url = hotel.css('.hotel_name_link').attrib.get('href')
            yield {
                'hotel name': hotel.css('.sr-hotel__name::text').get(),
                'hotel url': 'https://www.booking.com' + relative_url if relative_url else None,
                'lat lon': hotel.css('.sr_card_address_line a').attrib.get('data-coords'),
                'score': hotel.css('.bui-review-score__badge::text').get(),
                'description': hotel.css('.hotel_desc::text').get()
            }

        # Follow the pagination link until there is no next page
        next_page = response.css('a.paging-next').attrib.get('href')
        if next_page is None:
            logging.info('No next page. Terminating crawling process.')
        else:
            yield response.follow(next_page, callback=self.after_search)
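
The spider stores the coordinates as the raw `data-coords` string, whereas the goal is a separate latitude and longitude. A minimal sketch of that split follows. The helper name `split_coords` is hypothetical, and the assumption that the attribute holds a comma-separated pair with the longitude first has not been verified here; check it against a hotel whose position you know and set `lon_first=False` if needed.

```python
from typing import Optional, Tuple


def split_coords(raw: Optional[str], lon_first: bool = True) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) parsed from an 'a,b' string, or None if it cannot be parsed."""
    if not raw:
        return None
    parts = raw.split(",")
    if len(parts) != 2:
        return None
    try:
        first, second = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    # Assumption: the attribute lists the longitude before the latitude
    return (second, first) if lon_first else (first, second)
```

Returning `None` for missing or malformed values keeps one bad hotel card from stopping the whole crawl.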

In [6]:
filename = "hotels_" + destination.replace(" ", "-") + ".json"

# If the file already exists, delete it before crawling (otherwise Scrapy appends the new results to the old ones)
# os.path.exists also returns False when the folder itself is missing, so no error is raised in that case
if os.path.exists('json_folder_city/' + filename):
    os.remove('json_folder_city/' + filename)

# Declare a new CrawlerProcess with some settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Chrome/84.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.DEBUG,
    "FEEDS": {
        'json_folder_city/' + filename : {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(Plan_next_holidaysSpider)
process.start()
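
Once the crawl has finished, a quick check of the exported file can reveal selectors that silently returned nothing. The sketch below assumes the feed is a JSON array of items carrying the five keys yielded by the spider above; the helper names `load_feed` and `count_missing` are hypothetical.

```python
import json
from pathlib import Path

# Keys yielded by the spider defined above
EXPECTED_FIELDS = ("hotel name", "hotel url", "lat lon", "score", "description")


def load_feed(path):
    """Load the exported items; an empty file is treated as an empty list."""
    text = Path(path).read_text(encoding="utf-8").strip()
    return json.loads(text) if text else []


def count_missing(records):
    """Count, for each expected field, how many records have no usable value."""
    missing = {field: 0 for field in EXPECTED_FIELDS}
    for record in records:
        for field in EXPECTED_FIELDS:
            if not record.get(field):  # absent key, None or empty string
                missing[field] += 1
    return missing
```

For example, `count_missing(load_feed('json_folder_city/' + filename))` gives a per-field tally; a field that is missing for every hotel usually means its CSS selector no longer matches the page.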

2020-11-10 20:44:24 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapybot)
2020-11-10 20:44:24 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.3 (default, Jul  2 2020, 11:26:31) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform macOS-10.15.4-x86_64-i386-64bit
2020-11-10 20:44:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-11-10 20:44:24 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 10,
 'USER_AGENT': 'Chrome/84.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-11-10 20:44:24 [scrapy.extensions.telnet] INFO: Telnet Password: a2e0a0c18da5b823
2020-11-10 20:44:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.exte