# Extracting `DAILY` data from API and Loading it to DB 

As the purpose of our pipeline is to make Weather Data available for comparison to flights and airports data, in the first step we need to load the weather data in a raw form (JSON) into our database. So in later steps we can transform it to meaningful and useful tables.

### General Presteps:

The Goal of this Notebook is to get raw JSON data for Daily and Hourly Weather for 3 airport weather stations and load it as it is to our database.
- Find Station IDs for **defined** airports 
- Define the start and end of the period
- get the API Key from the `.env`

### Imports

we will need the credentials we saved in the `.env` file. We also will need SQLAlchemy and its functions

In [1]:
# we will need the credentials we saved in the .env file
from dotenv import dotenv_values

# We also will need SQLAlchemy and its functions
from sqlalchemy import create_engine, types
from sqlalchemy.dialects.postgresql import JSON as postgres_json

import pandas as pd

# requests library will make the API calls. 
# the json package will parse the JSON string and convert it to Python data structures
import requests
import json

# with 'datetime' we want to catch the timestamp of the API call. For the actuality reference. 
# and 'time' for slowing down a .bit
from datetime import datetime
import time

### Defining Airports andd finding the Station IDs

For our Pipeline we will use weather data from the weather stations at the 3 highly frequented airports
- **JFK**: John F. Kennedy Airport
- **MIA**: Miami International Airport
- **LAX**: Los Angeles Airport
- **

To find the Station IDs for the airpors without stressing our API Call limits, we will use the   search option of the **https://meteostat.net/**  

Search for the names of the airports above and find the Station IDs

In [28]:
airport_staids = {
    'JFK': 74486
    ,'MIA': 72202
    ,'LAX': 72295
    ,'JNB': 68212
    ,'HRE': 67816
    ,'CPT': 68816
    ,'MPM': 67842
    ,'LUN': 67765
           }

### Defining the period

Our flight Data is from 2024-01-01 until 2024-04-31. For the lectures we will use the same period for the meteostat JSON API.

In [29]:
period_start = "2019-01-01"
period_end = "2022-12-31"

### loading API Key

In [30]:
config = dotenv_values()
config.keys

<function OrderedDict.keys>

In [31]:
# getting API and DB credentials - Alternative 1: dotenv_values()

config = dotenv_values()
api_key = config['X-RapidAPI-Key']


# Part 1: Daily Station Data

Each API call will get 3 months of weather data for one Station ID.  

In the [**RapidAPI**](https://rapidapi.com/meteostat/api/meteostat/playground) interface you can find the code syntax we need to make the call. 

For each call we need to create a querystring with required parameters.

### Objectives -  Daily Station Data:

- create a for-loop for the 3 airports, generating a **querystring for each API call**
- define an empty dictionary to collect: time of the call, airport code, station id, related data
- make the API calls using the for-loop and fill the dictionary
- create pandas dataframe from the dictionary
- load the DB credentials from the `.env`
- create the engine
- define data types for the postgresql table columns
- using pandas import the dataframe to the Table in the Schema of the DB

### Test: For-loop generating the querystrings

In [32]:
airport_staids

{'JFK': 74486,
 'MIA': 72202,
 'LAX': 72295,
 'JNB': 68212,
 'HRE': 67816,
 'CPT': 68816,
 'MPM': 67842,
 'LUN': 67765}

In [33]:
# testing for-loop: querystring for each airport

for airport in airport_staids:
   
    querystring = {
        "station": airport_staids[airport]
        ,"start": period_start
        ,"end": period_end
        ,"model":"true"
    }
    print(airport, "\n", querystring)

JFK 
 {'station': 74486, 'start': '2019-01-01', 'end': '2022-12-31', 'model': 'true'}
MIA 
 {'station': 72202, 'start': '2019-01-01', 'end': '2022-12-31', 'model': 'true'}
LAX 
 {'station': 72295, 'start': '2019-01-01', 'end': '2022-12-31', 'model': 'true'}
JNB 
 {'station': 68212, 'start': '2019-01-01', 'end': '2022-12-31', 'model': 'true'}
HRE 
 {'station': 67816, 'start': '2019-01-01', 'end': '2022-12-31', 'model': 'true'}
CPT 
 {'station': 68816, 'start': '2019-01-01', 'end': '2022-12-31', 'model': 'true'}
MPM 
 {'station': 67842, 'start': '2019-01-01', 'end': '2022-12-31', 'model': 'true'}
LUN 
 {'station': 67765, 'start': '2019-01-01', 'end': '2022-12-31', 'model': 'true'}


**possible question**: what does 'model' parameter do?  
**answer**: Check the meteostat documentation for the Daily or Hourly endpoint.

### API CALL daily (per station)

In [34]:
#  let's catch each response in a dictionary. create an empty dictionary with the following keys:

weatherreport_dict = {'extracted_at':[], 
                'airport_code':[], 
                'station_id':[], 
                'extracted_data':[]
               }

# API CALL daily (station) - for the syntax: see the rapidapi interface

url = "https://meteostat.p.rapidapi.com/stations/daily"

headers = {
        "X-RapidAPI-Key": api_key,
        "X-RapidAPI-Host": "meteostat.p.rapidapi.com"
}

# for-loop for the querystrings
for airport in airport_staids:
   
    querystring = {
        "station":airport_staids[airport]
        ,"start":period_start
        ,"end":period_end
        ,"model":"true"
    }
    
    # making one call with the current querystring
    response = requests.get(url, headers=headers, params=querystring)
                
    # appending data to the dictionary:
    weatherreport_dict['extracted_at'].append(datetime.now)                # timestamp, 
    weatherreport_dict['airport_code'].append(airport)                       # airport code    
    weatherreport_dict['station_id'].append(airport_staids[airport])   # weater Station ID
    weatherreport_dict['extracted_data'].append(json.loads(response.text))   # JSON string

#### Checkout the filled dictionary

In [35]:
weatherreport_dict

{'extracted_at': [<function datetime.now(tz=None)>,
  <function datetime.now(tz=None)>,
  <function datetime.now(tz=None)>,
  <function datetime.now(tz=None)>,
  <function datetime.now(tz=None)>,
  <function datetime.now(tz=None)>,
  <function datetime.now(tz=None)>,
  <function datetime.now(tz=None)>],
 'airport_code': ['JFK', 'MIA', 'LAX', 'JNB', 'HRE', 'CPT', 'MPM', 'LUN'],
 'station_id': [74486, 72202, 72295, 68212, 67816, 68816, 67842, 67765],
 'extracted_data': [{'meta': {'generated': '2024-10-10 14:23:06'},
   'data': [{'date': '2019-01-01',
     'tavg': 10.4,
     'tmin': 4.4,
     'tmax': 15.0,
     'prcp': 2.8,
     'snow': 0.0,
     'wdir': 280.0,
     'wspd': 29.5,
     'wpgt': None,
     'pres': 1010.8,
     'tsun': None},
    {'date': '2019-01-02',
     'tavg': 3.3,
     'tmin': 1.1,
     'tmax': 5.0,
     'prcp': 0.0,
     'snow': 0.0,
     'wdir': 65.0,
     'wspd': 13.3,
     'wpgt': None,
     'pres': 1025.0,
     'tsun': None},
    {'date': '2019-01-03',
     'tavg':

### Make it a dataframe

this is our raw data, which we now can load into the database

In [36]:
weatherreport_daily_df = pd.DataFrame(weatherreport_dict)
weatherreport_daily_df

Unnamed: 0,extracted_at,airport_code,station_id,extracted_data
0,<built-in method now of type object at 0x10284...,JFK,74486,"{'meta': {'generated': '2024-10-10 14:23:06'},..."
1,<built-in method now of type object at 0x10284...,MIA,72202,"{'meta': {'generated': '2024-10-10 14:23:07'},..."
2,<built-in method now of type object at 0x10284...,LAX,72295,"{'meta': {'generated': '2024-10-10 14:23:07'},..."
3,<built-in method now of type object at 0x10284...,JNB,68212,"{'meta': {'generated': '2024-10-10 14:23:08'},..."
4,<built-in method now of type object at 0x10284...,HRE,67816,"{'meta': {'generated': '2024-10-10 14:23:09'},..."
5,<built-in method now of type object at 0x10284...,CPT,68816,"{'meta': {'generated': '2024-10-10 14:23:09'},..."
6,<built-in method now of type object at 0x10284...,MPM,67842,"{'meta': {'generated': '2024-10-10 14:23:10'},..."
7,<built-in method now of type object at 0x10284...,LUN,67765,"{'meta': {'generated': '2024-10-10 14:23:11'},..."


### SIDEBAR: For the curious and sceptics...

    In case you can't resist to know what the data looks like when flattened. 
    Here is the preview with pandas. BUT we are not transforming before loading in our pipeline just yet. 
    We Extract and Load the raw JSON.

In [37]:
pd.json_normalize(weatherreport_daily_df['extracted_data'][0]['data']).head()

Unnamed: 0,date,tavg,tmin,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun
0,2019-01-01,10.4,4.4,15.0,2.8,0.0,280.0,29.5,,1010.8,
1,2019-01-02,3.3,1.1,5.0,0.0,0.0,65.0,13.3,,1025.0,
2,2019-01-03,4.9,1.1,7.8,0.0,0.0,290.0,18.4,,1015.0,
3,2019-01-04,3.7,1.1,7.2,0.0,0.0,231.0,18.7,,1014.7,
4,2019-01-05,6.9,5.0,7.8,10.2,0.0,19.0,19.1,,1001.4,


In [38]:
# using pd.json_normalize() twice to get to the weather_stats of one airport under 'data'

df_JFK = pd.json_normalize(pd.json_normalize(weatherreport_daily_df
['extracted_data']).loc[0, 'data'])
df_JFK

Unnamed: 0,date,tavg,tmin,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun
0,2019-01-01,10.4,4.4,15.0,2.8,0.0,280.0,29.5,,1010.8,
1,2019-01-02,3.3,1.1,5.0,0.0,0.0,65.0,13.3,,1025.0,
2,2019-01-03,4.9,1.1,7.8,0.0,0.0,290.0,18.4,,1015.0,
3,2019-01-04,3.7,1.1,7.2,0.0,0.0,231.0,18.7,,1014.7,
4,2019-01-05,6.9,5.0,7.8,10.2,0.0,19.0,19.1,,1001.4,
...,...,...,...,...,...,...,...,...,...,...,...
1456,2022-12-27,-0.9,-2.1,1.7,0.0,0.0,263.0,17.3,,1025.5,
1457,2022-12-28,2.1,-2.7,7.2,0.0,0.0,221.0,13.7,,1026.7,
1458,2022-12-29,3.7,0.0,7.8,0.0,0.0,223.0,11.5,,1026.4,
1459,2022-12-30,5.4,1.1,10.6,0.0,0.0,198.0,10.1,,1025.1,


> #### Note: we only used up 3 API calls per attempt

### Loading the data into the DB

Now all we need to create a table in your Schema in our database is part of the `weather_daily_df` dataframe.  
We can use pandas' ability to work with SQLAlchemy and "save" the data to the DB using the `.to_sql()` method

In [39]:
# getting API and DB credentials - Alternative 1: dotenv_values()

config = dotenv_values()
 
pg_user = config['POSTGRES_USER'] # align the key labels with your .env file
pg_host = config['POSTGRES_HOST']
pg_port = config['POSTGRES_PORT']
pg_db = config['POSTGRES_DB']
pg_schema = config['POSTGRES_SCHEMA']
pg_pass = config['POSTGRES_PASS']

In [40]:
# updating the url
url = f'postgresql://{pg_user}:{pg_pass}@{pg_host}:{pg_port}/{pg_db}'

# creating the engine
engine = create_engine(url, echo=False)

In [41]:
engine.url # checking the url (pass is hidden)

postgresql://shaunkutsanzira:***@data-analytics-course-2.c8g8r1deus2v.eu-central-1.rds.amazonaws.com:5432/hh_analytics_24_2

In [42]:
# defining data types for the DB
dtype_dict = {
    'extracted_at':types.DateTime,
    'airport_code': types.String,
    'station_id': types.Integer,
    'extracted_data':postgres_json
             }

In [None]:
from datetime import datetime
# Assuming the 'extracted_at' column is generated using a function like `datetime.now`:
weatherreport_daily_df['extracted_at'] = pd.to_datetime(datetime.now())  # Ensure it's a valid datetime

In [47]:
# writing dataframe to DB
weatherreport_daily_df.to_sql(name = 'weatherreport_daily_raw', 
                       con = engine, 
                       schema = pg_schema, # pandas is allowing to specify, in which schema the table shall be created
                       if_exists='replace', 
                       dtype=dtype_dict,
                       index=False
                      )

8

In [46]:
weatherreport_daily_df

Unnamed: 0,extracted_at,airport_code,station_id,extracted_data
0,2024-10-10 16:31:53.951458,JFK,74486,"{'meta': {'generated': '2024-10-10 14:23:06'},..."
1,2024-10-10 16:31:53.951458,MIA,72202,"{'meta': {'generated': '2024-10-10 14:23:07'},..."
2,2024-10-10 16:31:53.951458,LAX,72295,"{'meta': {'generated': '2024-10-10 14:23:07'},..."
3,2024-10-10 16:31:53.951458,JNB,68212,"{'meta': {'generated': '2024-10-10 14:23:08'},..."
4,2024-10-10 16:31:53.951458,HRE,67816,"{'meta': {'generated': '2024-10-10 14:23:09'},..."
5,2024-10-10 16:31:53.951458,CPT,68816,"{'meta': {'generated': '2024-10-10 14:23:09'},..."
6,2024-10-10 16:31:53.951458,MPM,67842,"{'meta': {'generated': '2024-10-10 14:23:10'},..."
7,2024-10-10 16:31:53.951458,LUN,67765,"{'meta': {'generated': '2024-10-10 14:23:11'},..."


If you see a '3' as the result of the last cell. Something should be right. :) 

Check in DBeaver if you see a new table in your Schema. Don't forget to refresh your Schema.

## Now continue with the hourly data.