# Plan your trip with Kayak
--------
## Data Collection
In this second phase, we will:

* collect each city's GPS coordinates from the Nominatim API
* collect 7 days of weather data for each city from the One Call API

### Table of Contents

* [1. Download the data from the Data Lake](#section1)
* [2. Get weather data for each destination](#section2)
    * [2.1. Get GPS coordinates from nominatim API](#section21)
       * [2.1.1. Save retrieved data in the data lake](#section211)
    * [2.2. Get weather data from One Call API](#section22)
       * [2.2.1. Save retrieved data in the data lake](#section221)

In [12]:
# getting and parsing data
import requests

# wrapping data
import pandas as pd

# time processing
from time import sleep
from datetime import datetime, timezone

# Predefined Functions
from modules import Funct as F

# global params
bucket_name = 'kayak-project'

# 1. Download the data from the Data Lake <a class="anchor" id="section1"></a>

In [None]:
# download the file from the data lake
F.download_file_dl(bucket_name, 'cities.txt', 'data/cities.txt')

# read the file
cities = F.read_txt('data/cities.txt')

# 2. Get weather data for each destination <a class="anchor" id="section2"></a>

## 2.1. Get GPS coordinates from nominatim API <a class="anchor" id="section21"></a> 🌍

In [16]:
base_url_geo = "https://nominatim.openstreetmap.org/search?"

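def nominatim_geocode(address, format='json', limit=1, **kwargs):
    '''
    Thin wrapper around the Nominatim search API.
    Returns the parsed JSON response (a list of matches) and re-raises request errors.
    Documentation : https://nominatim.org/release-docs/develop/api/Search/
    '''
    params = {"q": address, "format": format, "limit": limit, **kwargs}
    # Nominatim's usage policy asks for an identifying User-Agent; this value is a placeholder
    headers = {"User-Agent": "kayak-project-notebook"}
    # send the request, report any failure, then re-raise so the caller never
    # reaches the return statement with an undefined response
    try:
        response = requests.get(base_url_geo, params=params, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print("Http Error:", e)
        raise
    except requests.exceptions.ConnectionError as e:
        print("Error Connecting:", e)
        raise
    except requests.exceptions.Timeout as e:
        print("Timeout Error:", e)
        raise
    except requests.exceptions.RequestException as e:
        print("Something Else !!", e)
        raise
    finally:
        # pause after every request, successful or not
        sleep(1)

    return response.json()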

🗒 **_raise_for_status_** raises an `HTTPError` when the response status code signals a client or server error (4xx or 5xx)  
🗒 **_time.sleep_** pauses execution between requests. Many requests fired in rapid succession can exhaust a server's free connections and effectively amount to a **DoS attack**. Pausing one second after each request keeps the load on the service low and avoids degrading it for other users; the Nominatim usage policy sets an absolute maximum of one request per second.
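The fixed `sleep(1)` always waits a full second, even when the previous request already took that long. A minimal sketch of an alternative is shown below: a small throttle that only sleeps for the time still missing since the previous call. The `Throttle` class is a hypothetical helper written for this illustration; it is not part of `requests` or of the `modules.Funct` helpers, and the 0.2 s interval is only there to keep the demo short.

```python
import time


class Throttle:
    """Block until at least `min_interval` seconds have passed since the previous call."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=0.2)  # 1.0 would match the one-request-per-second limit
start = time.monotonic()
for city in ["Bayeux", "Rouen", "Le Havre"]:
    throttle.wait()  # a real loop would send the request for `city` right after this
elapsed = time.monotonic() - start
print(f"3 calls spread over {elapsed:.2f}s")
```

The first call goes through immediately; only the second and third wait, so three calls take roughly two intervals.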

In [17]:
columns = ['id', 'name', 'latitude', 'longitude']
data = []

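for count, city in enumerate(cities):
    # `countrycodes` restricts the free-form `q` search to France
    # (Nominatim's structured `country` parameter cannot be combined with `q`)
    response = nominatim_geocode(address=city, countrycodes='fr')
    if not response:
        print("No match found for:", city)
        continue
    row = [count, city, response[0]['lat'], response[0]['lon']]
    data.append(row)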
    
geo_df = pd.DataFrame(data=data, columns=columns)
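The loop above reads `response[0]`, which assumes Nominatim always returns at least one match. As a hedged sketch, the extraction step could be isolated in a small function that also converts the coordinates, which Nominatim returns as strings, to floats. `first_match_coordinates` is a hypothetical name used only for this illustration, and the sample list is hand-written to mimic the shape of a Nominatim JSON response.

```python
def first_match_coordinates(matches):
    """Return (lat, lon) as floats from a Nominatim-style result list, or None if it is empty."""
    if not matches:
        return None
    best = matches[0]
    return float(best["lat"]), float(best["lon"])


# Hand-written sample in the shape of a Nominatim response (coordinates are strings)
sample = [{"lat": "49.2764624", "lon": "-0.7024738"}]

print(first_match_coordinates(sample))
print(first_match_coordinates([]))
```

Returning `None` for an empty list lets the calling loop decide whether to skip the city or stop.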

In [18]:
geo_df.head()

| | id | name | latitude | longitude |
|---|---|---|---|---|
| 0 | 0 | Mont Saint Michel | 48.6359541 | -1.511459954959514 |
| 1 | 1 | St Malo | 48.649518 | -2.0260409 |
| 2 | 2 | Bayeux | 49.2764624 | -0.7024738 |
| 3 | 3 | Le Havre | 49.4938975 | 0.1079732 |
| 4 | 4 | Rouen | 49.4404591 | 1.0939658 |


In [19]:
geo_df.dtypes

id            int64
name         object
latitude     object
longitude    object
dtype: object

In [20]:
# Convert latitude and longitude from strings to floats for the Mapbox map used later
geo_df[["latitude", "longitude"]] = geo_df[["latitude", "longitude"]].apply(pd.to_numeric)

In [21]:
geo_df.dtypes

id             int64
name          object
latitude     float64
longitude    float64
dtype: object

### 2.1.1. Save retrieved data in the data lake <a class="anchor" id="section211"></a> 📚

In [None]:
geo_df.to_csv('data/cities_coordinates.csv', index=False)
F.upload_file_dl('data/cities_coordinates.csv', bucket_name, "cities_coordinates.csv")

## 2.2. Get weather data from One Call API <a class="anchor" id="section22"></a> ⛅

In [23]:
import os

base_url_weather = 'https://api.openweathermap.org/data/2.5/onecall?'

def oneCall_weather(lat, lon, exclude, API_key=None, units='metric'):
    '''
    API: One Call
    weather data params: (lat, lon, exclude, API key, units)
    url : https://api.openweathermap.org/data/2.5/onecall?lat={lat}&lon={lon}&exclude={part}&appid={API key}&units={units}
    format : json (default)
    Documentation : https://openweathermap.org/api/one-call-api
    '''
    # Assumption: the key lives in an OPENWEATHER_API_KEY environment variable
    # instead of being hard-coded in the notebook
    if API_key is None:
        API_key = os.environ['OPENWEATHER_API_KEY']
    params = {'lat': lat, 'lon': lon, 'exclude': exclude, 'appid': API_key, 'units': units}
    # report any failure, then re-raise so we never return an undefined response
    try:
        response = requests.get(base_url_weather, params=params, timeout=10)
        response.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print("Http Error:", e)
        raise
    except requests.exceptions.ConnectionError as e:
        print("Error Connecting:", e)
        raise
    except requests.exceptions.Timeout as e:
        print("Timeout Error:", e)
        raise
    except requests.exceptions.RequestException as e:
        print("Something Else !!", e)
        raise
    finally:
        # pause after every request, successful or not
        sleep(1)

    return response.json()
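A one-second pause spaces the requests out but does not recover from a transient network failure. One common complement, not used in this notebook, is to retry with exponential backoff. The sketch below assumes the wrapped function raises an exception when a request fails; `call_with_retries` and `flaky_fetch` are hypothetical names for this illustration, and the failure is simulated so the example runs without network access.

```python
import time


def call_with_retries(func, *args, retries=3, base_delay=1.0,
                      retry_on=(Exception,), **kwargs):
    """Call func(*args, **kwargs), retrying on `retry_on` with exponentially growing delays."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            time.sleep(base_delay * 2 ** attempt)  # base_delay, 2x, 4x, ...


# Simulated API call that fails twice before succeeding
attempts = {"count": 0}

def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return {"daily": []}


result = call_with_retries(flaky_fetch, retries=3, base_delay=0.01,
                           retry_on=(ConnectionError,))
print(result, "after", attempts["count"], "attempts")
```

With `requests`, `retry_on` would typically be `(requests.exceptions.ConnectionError, requests.exceptions.Timeout)`, since retrying a 4xx response such as an invalid API key will not help.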

In [24]:
# convert unix timestamp to datetime
def convertDt(unixDt):
    utc_time = datetime.fromtimestamp(unixDt, timezone.utc)
    local_time = utc_time.astimezone()
    
    return (local_time.strftime("%Y-%m-%d %H:%M:%S (%Z)"))
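Because `convertDt` calls `astimezone()` without an argument, the result depends on the timezone of the machine running the notebook. A minimal variant that takes the target timezone explicitly makes the output reproducible. `convert_dt` is a hypothetical name for this sketch, and the fixed `+01:00` offset labelled `CET` is an assumption for illustration; it ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone


def convert_dt(unix_dt, tz=timezone.utc):
    """Format a Unix timestamp in the given timezone (UTC by default)."""
    return datetime.fromtimestamp(unix_dt, tz).strftime("%Y-%m-%d %H:%M:%S (%Z)")


cet = timezone(timedelta(hours=1), "CET")  # fixed offset, no daylight saving

print(convert_dt(1642248000))       # 2022-01-15 12:00:00 (UTC)
print(convert_dt(1642248000, cet))  # 2022-01-15 13:00:00 (CET)
```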

In [25]:
# get weather data for each city
columns = ['day_time', 'temperature', 'precipitation_p', 'humidity', 'weather', 'cid']
weather_desc =[]
for i in range(len(geo_df)):
    
    cid = geo_df.loc[i, 'id'] # will be used as foreign key
    latitude = geo_df.loc[i, 'latitude']
    longitude = geo_df.loc[i, 'longitude']
    
    response_weather = oneCall_weather(lat = latitude, lon = longitude, exclude = 'current,minutely,hourly,alerts')
    
    # get weather data for the next 7 days: daily[0] is today, so start at index 1
    for j in range(1, 8):

        day_time = convertDt(int(response_weather['daily'][j]['dt']))
        temperature = response_weather['daily'][j]['temp']['day']
        precipitation_p = response_weather['daily'][j]['pop']
        humidity = response_weather['daily'][j]['humidity']
        weather = response_weather['daily'][j]['weather'][0]['description']
        
        weather_desc.append([day_time, temperature, precipitation_p, humidity, weather, cid])
    
weather_df = pd.DataFrame(weather_desc, columns =columns)
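The inner loop can also be read as a pure function from one API response to a list of rows, which makes it testable without calling the API. The sketch below is one possible shape for it; `daily_rows` is a hypothetical name, and `sample_response` is a hand-written stand-in with only the fields the loop reads (the entry for the current day uses made-up values).

```python
from datetime import datetime, timezone


def daily_rows(response, cid, days=7):
    """Turn the `daily` array of a One Call style response into flat row dicts."""
    rows = []
    # daily[0] is the current day, so the forecast starts at index 1
    for day in response["daily"][1:days + 1]:
        rows.append({
            "day_time": datetime.fromtimestamp(day["dt"], timezone.utc)
                                .strftime("%Y-%m-%d %H:%M:%S (%Z)"),
            "temperature": day["temp"]["day"],
            "precipitation_p": day["pop"],
            "humidity": day["humidity"],
            "weather": day["weather"][0]["description"],
            "cid": cid,
        })
    return rows


sample_response = {
    "daily": [
        # current day: made-up placeholder values
        {"dt": 1642161600, "temp": {"day": 5.0}, "pop": 0.0,
         "humidity": 75, "weather": [{"description": "clear sky"}]},
        # first forecast day
        {"dt": 1642248000, "temp": {"day": 6.25}, "pop": 0.0,
         "humidity": 70, "weather": [{"description": "clear sky"}]},
    ]
}

rows = daily_rows(sample_response, cid=0)
print(rows)
```

A list of dicts like this can be passed directly to `pd.DataFrame`, which takes the column names from the keys.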

In [26]:
weather_df.head(10)

| | day_time | temperature | precipitation_p | humidity | weather | cid |
|---|---|---|---|---|---|---|
| 0 | 2022-01-15 12:00:00 (UTC) | 6.25 | 0.0 | 70 | clear sky | 0 |
| 1 | 2022-01-16 12:00:00 (UTC) | 7.95 | 0.1 | 86 | overcast clouds | 0 |
| 2 | 2022-01-17 12:00:00 (UTC) | 10.18 | 0.08 | 87 | broken clouds | 0 |
| 3 | 2022-01-18 12:00:00 (UTC) | 8.23 | 0.0 | 72 | few clouds | 0 |
| 4 | 2022-01-19 12:00:00 (UTC) | 7.63 | 0.0 | 76 | scattered clouds | 0 |
| 5 | 2022-01-20 12:00:00 (UTC) | 8.39 | 0.0 | 70 | broken clouds | 0 |
| 6 | 2022-01-21 12:00:00 (UTC) | 5.95 | 0.0 | 65 | scattered clouds | 0 |
| 7 | 2022-01-15 12:00:00 (UTC) | 6.48 | 0.0 | 82 | clear sky | 1 |
| 8 | 2022-01-16 12:00:00 (UTC) | 9.33 | 0.02 | 80 | overcast clouds | 1 |
| 9 | 2022-01-17 12:00:00 (UTC) | 9.8 | 0.04 | 86 | clear sky | 1 |


### 2.2.1. Save retrieved data in the data lake <a class="anchor" id="section221"></a> 📚

In [None]:
weather_df.to_csv('data/cities_weather.csv', index=False)
F.upload_file_dl('data/cities_weather.csv', bucket_name, "cities_weather.csv")