
# Plan your trip with Kayak 

## Company's description 📇

<a href="https://www.kayak.com" target="_blank">Kayak</a> is a travel search engine that helps users plan their next trip at the best price.

The company was founded in 2004 by Steve Hafner & Paul M. English. After a few rounds of fundraising, Kayak was acquired by <a href="https://www.bookingholdings.com/" target="_blank">Booking Holdings</a>, which now holds: 

* <a href="https://booking.com/" target="_blank">Booking.com</a>
* <a href="https://kayak.com/" target="_blank">Kayak</a>
* <a href="https://www.priceline.com/" target="_blank">Priceline</a>
* <a href="https://www.agoda.com/" target="_blank">Agoda</a>
* <a href="https://Rentalcars.com/" target="_blank">RentalCars</a>
* <a href="https://www.opentable.com/" target="_blank">OpenTable</a>

With over \$300 million in revenue a year, Kayak operates in almost all countries and languages to help its users book trips across the globe. 

## Project 🚧

The marketing team needs help on a new project. After doing some user research, the team discovered that **70% of their users who are planning a trip would like to have more information about the destination they are going to**. 

In addition, user research shows that **people tend to distrust the information they read if they don't know the brand** that produced the content. 

Therefore, the Kayak marketing team would like to create an application that recommends where people should plan their next holiday. The application should be based on real data about:

* Weather 
* Hotels in the area 

The application should then be able to recommend the best destinations and hotels based on the above variables at any given time. 

## Goals 🎯

As the project has just started, your team doesn't have any data that can be used to create this application. Therefore, your job will be to: 

* Scrape data from destinations 
* Get weather data from each destination 
* Get hotels' info about each destination
* Store all the information above in a data lake
* Extract, transform and load cleaned data from your datalake to a data warehouse

## Scope of this project 🖼️

The marketing team wants to focus first on the best cities to travel to in France. According to <a href="https://one-week-in.com/35-cities-to-visit-in-france/" target="_blank">One Week In.com</a>, here are the top 35 cities to visit in France: 

```python 
["Mont Saint Michel",
"St Malo",
"Bayeux",
"Le Havre",
"Rouen",
"Paris",
"Amiens",
"Lille",
"Strasbourg",
"Chateau du Haut Koenigsbourg",
"Colmar",
"Eguisheim",
"Besancon",
"Dijon",
"Annecy",
"Grenoble",
"Lyon",
"Gorges du Verdon",
"Bormes les Mimosas",
"Cassis",
"Marseille",
"Aix en Provence",
"Avignon",
"Uzes",
"Nimes",
"Aigues Mortes",
"Saintes Maries de la mer",
"Collioure",
"Carcassonne",
"Ariege",
"Toulouse",
"Montauban",
"Biarritz",
"Bayonne",
"La Rochelle"]
```

Your team should focus **only on the above cities for your project**. 


## Helpers 🦮

To help you complete this project, here are a few tips:

### Get weather data with an API 

*   Use https://nominatim.org/ to get the GPS coordinates of all the cities (no subscription required). Documentation: https://nominatim.org/release-docs/develop/api/Search/

*   Use https://openweathermap.org/appid (you have to subscribe to get a free API key) and https://openweathermap.org/api/one-call-api to get weather information for the 35 cities, and put it in a DataFrame.

*   Determine the list of cities where the weather will be the nicest within the next 7 days. For example, you can use the values of `daily.pop` and `daily.rain` to compute the expected volume of rain within the next 7 days. This is only an example: you may have a different opinion on what nice weather looks like 😎 Maybe the most important criterion for you is temperature or humidity, so feel free to change the rules!

*   Save all the results in a `.csv` file; you will use it later 😉 You can save any information that seems important to you. Don't forget to save the names of the cities, and to create a column containing a unique identifier (id) for each city (this is important for what comes next in the project).

*   Use plotly to display the best destinations on a map
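
The rain-based ranking suggested above can be sketched as follows. This is only one possible scoring rule, and it assumes each city's forecast is a list of daily entries shaped like the `daily` array of the One Call response, where `pop` is a probability between 0 and 1 and `rain` is a volume in mm that may be missing on dry days. The function names and the sample values are made up for illustration.

```python
import csv
import io


def expected_rain_mm(daily):
    """Sum pop * rain over the first 7 days; days without a 'rain' key count as 0 mm."""
    return sum(day.get("pop", 0.0) * day.get("rain", 0.0) for day in daily[:7])


def rank_cities(forecasts):
    """Return one row per city, driest first, each with a unique integer id."""
    rows = [
        {"id": city_id, "city": city, "expected_rain_mm": round(expected_rain_mm(daily), 2)}
        for city_id, (city, daily) in enumerate(forecasts.items())
    ]
    return sorted(rows, key=lambda row: row["expected_rain_mm"])


def to_csv(rows):
    """Serialize the ranked rows as CSV text, keeping the id and city columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "city", "expected_rain_mm"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up forecast values (two days per city instead of seven), for illustration only.
sample_forecasts = {
    "Paris": [{"pop": 0.5, "rain": 4.0}, {"pop": 0.2, "rain": 1.0}],
    "Marseille": [{"pop": 0.1, "rain": 2.0}, {"pop": 0.0}],
}
ranked = rank_cities(sample_forecasts)
print(to_csv(ranked))
```

The id is assigned before sorting, so it stays attached to the city regardless of its rank. Swapping `expected_rain_mm` for a temperature or humidity score only requires changing the scoring function.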

### Scrape Booking.com 

Since Booking Holdings doesn't have aggregated databases, it will be much faster to scrape data directly from booking.com.

You can scrape as much information as you want, but we suggest that you get at least:

*   Hotel name
*   URL of its booking.com page
*   Its coordinates: latitude and longitude
*   Score given by the website users
*   Text description of the hotel


### Create your data lake using S3 

Once you have built your dataset, store it in S3 as a CSV file. 
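
A minimal sketch of that upload step, assuming the dataset is already available as CSV text. The helper takes any object exposing the `put_object` method of a boto3 S3 client; the bucket name, key, and `RecordingClient` stand-in are hypothetical and only there so the example runs without AWS credentials.

```python
def upload_csv(s3_client, bucket, key, csv_text):
    """Upload CSV text to s3://bucket/key through the client's put_object call."""
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=csv_text.encode("utf-8"),
        ContentType="text/csv",
    )
    return f"s3://{bucket}/{key}"


class RecordingClient:
    """Stand-in for a real S3 client: it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def put_object(self, **kwargs):
        self.calls.append(kwargs)


# Dry run with the stand-in client (hypothetical bucket and key).
client = RecordingClient()
uri = upload_csv(client, "my-kayak-bucket", "kayak/hotels.csv", "id,city\n0,Paris\n")
print(uri)
```

With real credentials configured, the same call would take `boto3.client("s3")` in place of `RecordingClient()`.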

### ETL 

Once your data is uploaded to S3, it will be easier for the next data analysis team to extract clean data directly from a data warehouse. Therefore, create a SQL database using AWS RDS, extract your data from S3, and store it in your newly created database. 
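
The extract, transform, and load steps can be sketched as below. To stay runnable without an AWS account, this sketch swaps RDS for an in-memory SQLite database and reads the CSV from a string instead of S3; the column names and sample rows are hypothetical. Against RDS, the same structure would apply with a connection to your own database.

```python
import csv
import io
import sqlite3

# Hypothetical CSV layout with made-up rows; the real columns depend on what you scraped.
RAW_CSV = """city_id,name,score
0,Hotel A,8.5
0,Hotel B,
1,Hotel C,9.1
"""


def extract(csv_text):
    """Parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Drop rows without a name or a numeric score, and cast the numeric fields."""
    cleaned = []
    for row in rows:
        try:
            score = float(row["score"])
        except (KeyError, TypeError, ValueError):
            continue
        if not row.get("name"):
            continue
        cleaned.append((int(row["city_id"]), row["name"], score))
    return cleaned


def load(conn, rows):
    """Create the target table if needed and insert the cleaned rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS hotels (city_id INTEGER, name TEXT, score REAL)")
    conn.executemany("INSERT INTO hotels VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the RDS database
load(conn, transform(extract(RAW_CSV)))
best = conn.execute("SELECT name, score FROM hotels ORDER BY score DESC").fetchall()
print(best)
```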

## Deliverable 📬

To complete this project, your team should deliver:

* A `.csv` file in an S3 bucket containing enriched information about weather and hotels for each French city

* A SQL database from which we can get the same cleaned data as in S3 

* Two maps showing the top 5 destinations and the top 20 hotels in the area. You can use plotly or any other library to do so. It should look something like this: 

![Map](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Kayak_best_destination_project.png)

In [1]:
# Import libraries

# API AND SCRAPING
import requests
import json
import os 
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

#DATA MANIPULATION
import pandas as pd
import numpy as np
import time
import datetime

#VISUALISATION
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio

#STORAGE
from dotenv import load_dotenv
import boto3
import sqlalchemy


In [2]:
# List of cities to visit
# Hyphenated city names corrected, and tourist sites replaced by their commune
cities_list= [
"Le Mont-Saint-Michel",
"Saint-Malo",
"Bayeux",
"Le Havre",
"Rouen",
"Paris",
"Amiens",
"Lille",
"Strasbourg",
"Orschwiller", # Chateau du Haut Koenigsbourg
"Colmar",
"Eguisheim",
"Besancon",
"Dijon",
"Annecy",
"Grenoble",
"Lyon",
"La Palud-sur-Verdon", # Gorges du Verdon
"Bormes-les-Mimosas",
"Cassis",
"Marseille",
"Aix-en-Provence",
"Avignon",
"Uzes",
"Nimes",
"Aigues-Mortes",
"Saintes-Maries-de-la-mer",
"Collioure",
"Carcassonne",
"Ariege",
"Toulouse",
"Montauban",
"Biarritz",
"Bayonne",
"La Rochelle"
]

In [3]:
# Retrieve the geographic coordinates of the cities

# Send a GET request to the geo.api.gouv.fr API (tested here on a single city)
url = "https://geo.api.gouv.fr/communes?nom=Uzes&fields=nom,centre,boost=population&limit=1"
response = requests.get(url)
response.json()


[{'nom': 'Uzès',
  'centre': {'type': 'Point', 'coordinates': [4.416, 44.0289]},
  'code': '30334',
  '_score': 0.739517248275684}]

In [4]:
# Initialize the lists that will store the latitudes and longitudes
coordinates_lat = []
coordinates_lon = []

# Loop over each city in cities_list
for city in cities_list:
    # Replace spaces with '+' for the URL query
    city_query = city.replace(' ', '+')

    # Build the request URL
    url = f"https://geo.api.gouv.fr/communes?nom={city_query}&fields=nom,centre,boost=population&limit=1"

    # Send the GET request
    response = requests.get(url)
    data = response.json()

    # Check whether the response contains data
    if data:
        # 'data' is a list of results; take the first one
        first_result = data[0]

        # Extract the coordinates (GeoJSON order: [longitude, latitude])
        coordinates = first_result['centre']['coordinates']

        # Split the coordinates
        coord_lon = coordinates[0]  # Longitude
        coord_lat = coordinates[1]  # Latitude

        # Append the coordinates to the lists
        coordinates_lat.append(coord_lat)
        coordinates_lon.append(coord_lon)

# Display the lists of latitudes and longitudes
print("Latitudes:", coordinates_lat)
print("Longitudes:", coordinates_lon)

# Display the last response received (the one for the final city in the list)
print(data)


Latitudes: [48.6245, 48.6463, 49.2772, 49.4958, 49.4412, 43.7968, 49.8987, 48.5703, 48.5691, 48.2468, 48.1115, 48.0377, 47.2602, 47.3319, 45.9024, 45.1842, 45.758, 43.801, 43.1575, 43.2185, 43.2119, 43.536, 43.9416, 44.0289, 43.8322, 43.5482, 43.4958, 42.5087, 43.2078, 42.9396, 43.6007, 44.0217, 43.471, 43.4844, 46.1621]
Longitudes: [-1.5278, -2.0066, -0.7016, 0.1312, 1.0912, 1.8296, 2.2847, -1.8538, 7.7621, 7.3627, 7.3924, 7.2966, 6.0123, 5.0322, 6.1264, 5.7155, 4.8351, 6.323, 6.3615, 5.5503, 2.5438, 5.3879, 4.8333, 4.416, 4.3429, 4.1606, 4.4374, 3.0744, 2.3491, 1.6055, 1.4328, 1.3646, -1.5562, -1.4611, -1.1765]
[{'nom': 'La Rochelle', 'centre': {'type': 'Point', 'coordinates': [-1.1765, 46.1621]}, 'code': '17300', '_score': 0.05525189117447229}]
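
In the request URL above, `boost=population` is written inside the comma-separated `fields` value, so the API may not treat it as a parameter of its own. That could explain why some coordinates look off, such as a latitude of 43.7968 for Paris. A minimal sketch of building the query with `boost` as a separate parameter is shown below; treating `boost` as a standalone query parameter is an assumption about geo.api.gouv.fr, and the helper names are hypothetical. The sample payload is the `Uzes` response shown earlier.

```python
from urllib.parse import urlencode

BASE_URL = "https://geo.api.gouv.fr/communes"


def build_commune_url(city):
    """Build the search URL with 'fields' and 'boost' as two distinct parameters."""
    params = {"nom": city, "fields": "nom,centre", "boost": "population", "limit": 1}
    # safe="," keeps the comma in 'fields' readable; spaces become '+'
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


def extract_lat_lon(payload):
    """Return (latitude, longitude) from the first result, or None if there is none."""
    if not payload:
        return None
    lon, lat = payload[0]["centre"]["coordinates"]  # GeoJSON order: [lon, lat]
    return lat, lon


sample_payload = [
    {"nom": "Uzès", "centre": {"type": "Point", "coordinates": [4.416, 44.0289]}, "code": "30334"}
]
print(build_commune_url("La Rochelle"))
print(extract_lat_lon(sample_payload))
```

Returning `None` for an empty response makes it possible to keep one entry per city, so the coordinate lists stay aligned with `cities_list`.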


In [5]:
df_booking = pd.DataFrame(cities_list, columns=["city"])
df_booking["city"]

0         Le Mont-Saint-Michel
1                   Saint-Malo
2                       Bayeux
3                     Le Havre
4                        Rouen
5                        Paris
6                       Amiens
7                        Lille
8                   Strasbourg
9                  Orschwiller
10                      Colmar
11                   Eguisheim
12                    Besancon
13                       Dijon
14                      Annecy
15                    Grenoble
16                        Lyon
17         La Palud-sur-Verdon
18          Bormes-les-Mimosas
19                      Cassis
20                   Marseille
21             Aix-en-Provence
22                     Avignon
23                        Uzes
24                       Nimes
25               Aigues-Mortes
26    Saintes-Maries-de-la-mer
27                   Collioure
28                 Carcassonne
29                      Ariege
30                    Toulouse
31                   Montauban
32      

In [6]:
import scrapy

class BookingSpider(scrapy.Spider):
    name = 'booking'
    start_urls = ['https://www.booking.com/searchresults.fr.html?dest_id=-1456928;dest_type=city']

    def parse(self, response):
        hotels = response.css('div.fe_accommodationlist_item')

        # Keep at most the first 20 hotels of each results page
        for hotel in hotels[:20]:
            # .get() returns None when a selector matches nothing,
            # so guard before calling .strip() or building the URL
            price = hotel.css('div.priceInfo::text').get()
            href = hotel.css('a.accommodation-link::attr(href)').get()
            yield {
                'name': hotel.css('span.ap-name::text').get(),
                'score': hotel.css('div.bui-review-score__badge::text').get(),
                'price': price.strip() if price else None,
                'url': response.urljoin(href) if href else None
            }

        # Follow the pagination link until there is no next page
        next_page = response.css('a.paging-next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)


# Name of the file where the results will be saved
filename = "booking.json"

# If the file already exists, delete it before crawling (otherwise Scrapy
# appends the new results to the previous ones)
if os.path.exists(filename):
    os.remove(filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Simulates a browser on an OS
## LOG_LEVEL => Minimal level of log
## FEEDS => Where the file will be stored
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Chrome/97.0',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        filename : {"format": "json"},
    }
})

# Start the crawling using the spider defined above
process.crawl(BookingSpider)
process.start()

2024-06-18 23:17:39 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapybot)
2024-06-18 23:17:39 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.4, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 23.10.0, Python 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 24.0.0 (OpenSSL 3.0.13 30 Jan 2024), cryptography 42.0.2, Platform Windows-10-10.0.22631-SP0
2024-06-18 23:17:39 [scrapy.addons] INFO: Enabled addons:
[]


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-06-18 23:17:39 [scrapy.extensions.telnet] INFO: Telnet Password: c01af426af830ba6
2024-06-18 23:17:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2024-06-18 23:17:40 [scrap

In [7]:
class booking_spider(scrapy.Spider):
    # Name of your spider
    name = "booking"

    # URL to start your spider from
    start_urls = [
        'https://www.booking.com/index.fr.html'
    ]

    # Callback called on the response of the start URL.
    # It yields the ranking, title, URL, total earnings, rating and number of voters
    # for each matched element.
    # NOTE: the selectors below (ipc-*, imdb-rating) appear to be carried over from an
    # IMDb spider and will probably not match anything on booking.com.
    def parse(self, response):
        elements = response.css("div.sc-b189961a-0.hBZnfJ.cli-children")
        for element in elements:
            yield {
                "ranking": element.css('h3.ipc-title__text::text').get().split(".")[0],
                "title": element.css("h3.ipc-title__text::text").get().split(".")[1].strip(),
                "url": element.css('a.ipc-title-link-wrapper').attrib["href"],
                "total_earning": element.css('span.sc-8f57e62c-2.elpuzG::text').get(),
                "rating": element.css('span.ipc-rating-star.ipc-rating-star--base.ipc-rating-star--imdb.ratingGroup--imdb-rating::text').get(),
                "voters": element.css('span.ipc-rating-star--voteCount::text').getall()[1]
            }


# Name of the file where the results will be saved
filename = "imdb1.json"

# If the file already exists, delete it before crawling (otherwise Scrapy
# appends the new results to the previous ones)
if os.path.exists(filename):
    os.remove(filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Simulates a browser on an OS
## LOG_LEVEL => Minimal level of log
## FEEDS => Where the file will be stored
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Chrome/97.0',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        filename : {"format": "json"},
    }
})

# Start the crawling using the spider defined above
# (the class is named booking_spider; imdb_spider is not defined in this notebook)
process.crawl(booking_spider)
process.start()