## Purpose

This notebook fetches the data produced by ```build_csv_postgres.ipynb``` that was uploaded to the Azure Postgres database, and converts it back to a CSV file. If you wish, you can skip this notebook and download the data manually from Postgres.

This notebook was created to practice using GraphQL and to serve as a template for further development on fetching data from a Postgres database.

### Input

Ensure you have a ```.env``` file in the same directory as this notebook with two keys:

1. *URL* - the link to the Hasura database that hosts our data via Postgres
2. *SECRET* - the unique password string given by Hasura that allows us to establish a connection with our database
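
As a minimal sketch of a pre-flight check, the snippet below verifies that both keys are present before any connection is attempted. The helper name `missing_env_keys` is hypothetical and not part of the notebook, and the example URL is a placeholder. In the notebook itself the keys would first be loaded into the process environment by `load_dotenv`.

```python
import os

REQUIRED_KEYS = ("URL", "SECRET")  # the two keys this notebook expects in .env


def missing_env_keys(env=None, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


# Example with an in-memory mapping (placeholder values) instead of the real environment
example_env = {"URL": "https://example.invalid/v1/graphql", "SECRET": ""}
print(missing_env_keys(example_env))  # ['SECRET']
```

Calling `missing_env_keys()` with no arguments checks the real environment, so a non-empty result can be reported before the client is created.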

### Output

This notebook outputs the Postgres database content as a single CSV file in the ```output``` folder (```../datasets/output/fetch_postgres.csv```).

In [1]:
from gql import gql, Client
from gql.transport.aiohttp import AIOHTTPTransport
from gql.transport.exceptions import TransportQueryError

from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

import os
import sys
import pandas as pd
import numpy as np
import json
import asyncio
import threading

In [2]:
# Store the keys from the environment file

url = os.getenv('URL')
secret = os.getenv('SECRET')

# Set up a GraphQL client for the Hasura database, authenticating with the admin secret

transport = AIOHTTPTransport(url=url, headers={
    'x-hasura-admin-secret': secret,
    'content-type': 'application/json'
})

gql_client = Client(transport=transport, fetch_schema_from_transport=True)

In [3]:
# Module-level variable that holds the Postgres database content; it is written out as a CSV at the end
# (a `global` statement is only needed inside the function that reassigns it)

data = None

# Asynchronous function required as the GraphQL queries are executed asynchronously

async def fetch_queries():

    global data
    print('Fetching started:', threading.current_thread())

    # Execute GraphQL queries to retrieve all house sale and tax assessment information
    # Fetching all of the approx. 47k rows from the Postgres database in one query produces a result that is too large and causes an error
    # To avoid this, each query only requests a window of 1000 ids, and its result is appended to our dataframe

    try:

      # Total rows: 47385; in increments of 1000 ids that takes 48 loops

      for row_limit in range(1, 49):

          # Doubled braces are emitted as literal braces inside an f-string
          # The upper bound uses _lt rather than _lte so that consecutive windows do not both fetch the boundary id

          query = gql(
              f"""
              query MyQuery {{
                  RealEstateUnits(where: {{ id: {{ _gte: "{ (row_limit-1)*1000 }" , _lt: "{row_limit*1000}" }} , PropertyType: {{_eq: "house"}} }}) {{
                      PropertyDetails
                      PropertyType
                  }}
              }}
              """
          )

          # Send query for execution

          result = await gql_client.execute_async(query)

          # The result is already a Python dict; flatten it into a Pandas dataframe and append it to our global data variable

          data = pd.concat([data, pd.json_normalize(result['RealEstateUnits'])], ignore_index=True)


    except TransportQueryError as err:
      print(f"Error: {err}")
      sys.exit(1)


    print('Fetching ended:', threading.current_thread())


# fetch_queries() is a coroutine function: calling it only returns a coroutine object, which must be run on an event loop
# We run it with asyncio.run in a separate thread and join that thread, so the dataframe is complete before the next cell uses it

thread = threading.Thread( target=asyncio.run, args=(fetch_queries(),) )
thread.start()
thread.join()

Fetching started: <Thread(Thread-3 (run), started 5964)>
Fetching ended: <Thread(Thread-3 (run), started 5964)>
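
The batching used in the cell above can be sketched on its own. This is a minimal illustration, assuming integer ids that are dense and start near 0. It uses half-open windows so that no boundary id falls into two batches. `id_windows` is a hypothetical helper, not part of the notebook.

```python
import math


def id_windows(total_rows, batch_size=1000):
    """Yield (lower, upper) id bounds; each window covers lower <= id < upper."""
    n_batches = math.ceil(total_rows / batch_size)
    for i in range(n_batches):
        yield i * batch_size, (i + 1) * batch_size


windows = list(id_windows(47385))
print(len(windows))   # 48
print(windows[0])     # (0, 1000)
print(windows[-1])    # (47000, 48000)
```

Deriving the number of batches from the row count avoids hard-coding the loop range, though the sketch would still miss rows if the ids have gaps that push them beyond the total row count.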


In [4]:
# Strip the "PropertyDetails." prefix added by json_normalize to restore the original column names, and fill missing values with NaN

data.columns = data.columns.str.replace("PropertyDetails.", "", regex=False)
data = data.fillna(value=np.nan)

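
The dot in `"PropertyDetails."` matters whenever the pattern is treated as a regular expression, because `.` then matches any character. The sketch below uses only the standard library `re` module and one made-up column name (`PropertyDetailsXprice`) to show the difference. It illustrates the pitfall in general and is not a statement about how every pandas version behaves.

```python
import re

prefix = "PropertyDetails."
# "PropertyDetailsXprice" is a made-up name used only to show the difference
columns = ["PropertyDetails.price", "PropertyDetailsXprice", "PropertyType"]

# Treated as a regex, "." matches any character, so the second name is also changed
as_regex = [re.sub(prefix, "", name) for name in columns]

# Treated literally (escaped), only the exact prefix text is removed
as_literal = [re.sub(re.escape(prefix), "", name) for name in columns]

print(as_regex)    # ['price', 'price', 'PropertyType']
print(as_literal)  # ['price', 'PropertyDetailsXprice', 'PropertyType']
```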


In [5]:
# Output dataframe as CSV file

data.to_csv('../datasets/output/fetch_postgres.csv', index=False)
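
A quick sanity check after writing the file can catch duplicated or missing rows. The sketch below uses the standard library `csv` module and a temporary file with made-up rows so that it runs on its own. The helper name `summarize_csv` is hypothetical. In the notebook, the path would be the one passed to `to_csv`.

```python
import csv
import os
import tempfile
from collections import Counter


def summarize_csv(path):
    """Return (row_count, duplicate_row_count) for the CSV at `path`, excluding the header."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        counts = Counter(tuple(row) for row in reader)
    total = sum(counts.values())
    duplicates = total - len(counts)
    return total, duplicates


# Made-up demo data: the second and third data rows are identical
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    csv.writer(tmp).writerows([["id", "price"], ["1", "100"], ["2", "200"], ["2", "200"]])

print(summarize_csv(tmp.name))  # (3, 1)
os.remove(tmp.name)
```

A non-zero duplicate count would suggest that consecutive id windows overlapped during fetching.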