# Feathr Feature Store on Home Credit

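This notebook illustrates how to use the Feathr Feature Store to build features for a Home Credit model. It includes these steps:

1. Install Feathr and configure the required environment.
2. Define pre-processing functions and features for the installments payments and credit card balance data.
3. Create training data using a point-in-time correct feature join.
4. Download and view the result.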

## Prerequisite: Install Feathr

Install Feathr using pip:

`pip install -U feathr pandavro scikit-learn`

Or if you want to use the latest Feathr code from GitHub:

`pip install -I git+https://github.com/linkedin/feathr.git#subdirectory=feathr_project pandavro scikit-learn`

In [1]:
pip install -U feathr pandavro scikit-learn

Note: you may need to restart the kernel to use updated packages.


## Prerequisite: Configure the required environment

In the first step (Provision cloud resources), you should have provisioned all the required cloud resources. If you used the Feathr CLI to create a workspace, you should have a folder containing a file called `feathr_config.yaml` with all the required configurations. Otherwise, update the configuration below.

The code below writes this configuration string to a temporary file and loads it into Feathr. Still treat [feathr_config.yaml](https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml) as the source of truth; it also explains the meaning of each variable in more detail.

In [2]:
import tempfile
yaml_config = """
# Please refer to https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml for explanations on the meaning of each field.
api_version: 1
project_config:
  project_name: 'feathr_home_credit'
  required_environment_variables:
    - 'REDIS_PASSWORD'
    - 'AZURE_CLIENT_ID'
    - 'AZURE_TENANT_ID'
    - 'AZURE_CLIENT_SECRET'
offline_store:
  adls:
    adls_enabled: true
  wasb:
    wasb_enabled: true
  s3:
    s3_enabled: false
    s3_endpoint: 's3.amazonaws.com'
  jdbc:
    jdbc_enabled: false
    jdbc_database: 'feathrtestdb'
    jdbc_table: 'feathrtesttable'
  snowflake:
    url: "dqllago-ol19457.snowflakecomputing.com"
    user: "feathrintegration"
    role: "ACCOUNTADMIN"
spark_config:
  spark_cluster: 'azure_synapse'
  spark_result_output_parts: '1'
  azure_synapse:
    dev_url: "https://feathrhomecreditcaspark.dev.azuresynapse.net"
    pool_name: "spark31"
    # workspace dir for storing all the required configuration files and the jar resources
    workspace_dir: "abfss://feathrhomecreditcafs@feathrhomecreditcasto.dfs.core.windows.net/"
    executor_size: "Small"
    executor_num: 4
    feathr_runtime_location: wasbs://public@azurefeathrstorage.blob.core.windows.net/feathr-assembly-LATEST.jar
  databricks:
    workspace_instance_url: 'https://adb-6885802458123232.12.azuredatabricks.net/'
    workspace_token_value: ''
    config_template: {'run_name':'','new_cluster':{'spark_version':'9.1.x-scala2.12','node_type_id':'Standard_D3_v2','num_workers':2,'spark_conf':{}},'libraries':[{'jar':''}],'spark_jar_task':{'main_class_name':'','parameters':['']}}
    work_dir: 'dbfs:/feathr_getting_started'
    feathr_runtime_location: wasbs://public@azurefeathrstorage.blob.core.windows.net/feathr-assembly-LATEST.jar
online_store:
  redis:
    host: 'feathrhomecreditcaredis.redis.cache.windows.net'
    port: 6380
    ssl_enabled: True
feature_registry:
  purview:
    type_system_initialization: true
    purview_name: 'feathrhomecreditcapurview'
    delimiter: '__'
"""
tmp = tempfile.NamedTemporaryFile(mode='w', delete=False)
with open(tmp.name, "w") as text_file:
    text_file.write(yaml_config)


## View the data

In this tutorial, we use Feathr Feature Store to build features for a Home Credit model. The input data consists of two CSV files, `installments_payments.csv` and `credit_card_balance.csv`, stored under `home_credit_data/` in the ADLS container configured above.

In [3]:
import glob
import os
import tempfile
from datetime import datetime, timedelta
from math import sqrt

import pandas as pd
import pandavro as pdx
from feathr import FeathrClient
from feathr import BOOLEAN, FLOAT, INT32, ValueType, STRING
from feathr import Feature, DerivedFeature, FeatureAnchor
from feathr import BackfillTime, MaterializationSettings
from feathr import FeatureQuery, ObservationSettings
from feathr import RedisSink
from feathr import INPUT_CONTEXT, HdfsSource
from feathr import WindowAggTransformation
from feathr import TypedKey
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import lit

## Set up the necessary environment variables

You have to set the environment variables below in order to run this sample. More environment variables are described in [feathr_config.yaml](https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml), which remains the source of truth and explains the meaning of each variable.

In [4]:
os.environ['REDIS_PASSWORD'] = ''
os.environ['AZURE_CLIENT_ID'] = ''
os.environ['AZURE_TENANT_ID'] = '' 
os.environ['AZURE_CLIENT_SECRET'] = ''
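
The cell above leaves every value empty, so it can help to check which variables still need a value before creating the client. The following is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper, not part of Feathr, and the variable names are copied from `required_environment_variables` in the YAML configuration.

```python
import os

# Names copied from `required_environment_variables` in the YAML config.
REQUIRED_VARS = [
    "REDIS_PASSWORD",
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_SECRET",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Fill in these variables before creating the client:", ", ".join(missing))
```

An empty string counts as missing here, which matches the placeholders in the cell above.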

Then we initialize a Feathr client:


In [5]:
client = FeathrClient(config_path=tmp.name)

## Misc pre-processing methods

## Installments payments

In [6]:
def installments_payments_preprocessing(df: DataFrame) -> DataFrame:
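    import pandas as pd
    import datetime
    from pyspark import sql
    # imported inside the function so it stays self-contained when it runs on the Spark cluster
    from pyspark.sql.functions import lit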

    def aggAvgInstalments(df):
        df_ = df.copy()
        
        
        df_['AMT_INSTALMENT'] = pd.to_numeric(df_['AMT_INSTALMENT'])
        df_['AMT_PAYMENT'] = pd.to_numeric(df_['AMT_PAYMENT'])
        
        df_['INSTALMENT_MISSED'] = (df_['AMT_INSTALMENT'] > df_['AMT_PAYMENT']).astype(int)
        df_['AMT_UNPAID'] = df_['AMT_INSTALMENT'] - df_['AMT_PAYMENT']
        df_['PERC_UNPAID'] = df_['AMT_UNPAID']/df_['AMT_INSTALMENT']
        
        df_ = df_.fillna(0)
        agg = df_.groupby("SK_ID_CURR")
        # fraction of instalments that were missed
        missed_instalments = agg['INSTALMENT_MISSED'].agg(lambda x: x.sum()/x.count()). \
            reset_index().set_index("SK_ID_CURR")
        # average unpaid share of the instalment amount, over all of the customer's instalments
        avg_percent_unpaid = agg['PERC_UNPAID'].mean().reset_index().set_index("SK_ID_CURR")
        # average unpaid amount, over all of the customer's instalments
        avg_unpaid = agg['AMT_UNPAID'].mean().reset_index().set_index("SK_ID_CURR")
        final_df = missed_instalments
        final_df = final_df.join(avg_percent_unpaid, on='SK_ID_CURR')
        final_df = final_df.join(avg_unpaid,on="SK_ID_CURR")
        return final_df

    # add a TRAN_DATE column with a static date
    df = df.withColumn("TRAN_DATE", lit(datetime.datetime(2021,1,1,11,34,44).strftime('%Y-%m-%d %X')))

    df_org = df.toPandas()

    df_aggAvgInstalments = aggAvgInstalments(df_org)

    # merge the aggregated results back into the original df
    df_result = pd.merge(df_org, df_aggAvgInstalments, on="SK_ID_CURR", how="left")
    # merging frames that share a column name produces columns with `_x` and `_y` suffixes.
    # Drop the `_x` suffix to retain the original column name. A regex is used because
    # `str.rstrip("_x")` strips any trailing `_` or `x` characters, not just the suffix.
    df_result.columns = df_result.columns.str.replace(r"_x$", "", regex=True)

    
    # convert the pandas dataframe to a Spark dataframe
    spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()
    
    return spark_session.createDataFrame(df_result)  
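
To make the three per-customer aggregates concrete, here is a small plain-Python restatement of what `aggAvgInstalments` computes, run on made-up rows. It is only a sketch: it skips the pandas `fillna(0)` step and treats a zero instalment amount as an unpaid share of `0.0`, which is a simplification introduced here.

```python
from collections import defaultdict
from statistics import mean

# Toy rows: (SK_ID_CURR, AMT_INSTALMENT, AMT_PAYMENT). The values are made up.
rows = [
    (1, 100.0, 100.0),
    (1, 100.0, 60.0),
    (2, 50.0, 50.0),
]

per_customer = defaultdict(list)
for customer, instalment, payment in rows:
    unpaid = instalment - payment
    per_customer[customer].append({
        "INSTALMENT_MISSED": int(instalment > payment),
        "AMT_UNPAID": unpaid,
        # simplification: a zero instalment yields 0.0 instead of inf/NaN
        "PERC_UNPAID": unpaid / instalment if instalment else 0.0,
    })

# Average each derived column over all of a customer's instalments.
summary = {
    customer: {
        column: mean(record[column] for record in records)
        for column in ("INSTALMENT_MISSED", "PERC_UNPAID", "AMT_UNPAID")
    }
    for customer, records in per_customer.items()
}

for customer, values in summary.items():
    print(customer, values)
```

Customer 1 missed one of two instalments, so `INSTALMENT_MISSED` averages to 0.5 and `AMT_UNPAID` to 20.0; customer 2 paid in full, so every aggregate is zero.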


## Feature definition for Installments payments

In [7]:

# source for pass through features
# the "TRAN_DATE" column is created in the `installments_payments_preprocessing` function.
installments_payments_source_core = HdfsSource(name="installmentsPaymentsSourceCore",
                          path="abfss://feathrhomecreditcafs@feathrhomecreditcasto.dfs.core.windows.net/home_credit_data/installments_payments.csv",
                          preprocessing=installments_payments_preprocessing,
                          event_timestamp_column="TRAN_DATE",
                          timestamp_format="yyyy-MM-dd HH:mm:ss"
                          )

# key definition for installments payments
key_SK_ID_CURR = TypedKey(key_column="SK_ID_CURR",
                       key_column_type=ValueType.INT32,
                       description="SK ID CURR",
                       full_name="installments_payments.SK_ID_CURR")

# pass through columns of Installments payments CSV
# columns Installments payments
f_SK_ID_PREV = Feature(name="f_SK_ID_PREV",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="SK_ID_PREV")
f_SK_ID_CURR = Feature(name="f_SK_ID_CURR",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="SK_ID_CURR")
f_NUM_INSTALMENT_VERSION = Feature(name="f_NUM_INSTALMENT_VERSION",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="NUM_INSTALMENT_VERSION")
f_NUM_INSTALMENT_NUMBER = Feature(name="f_NUM_INSTALMENT_NUMBER",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="NUM_INSTALMENT_NUMBER")
f_DAYS_INSTALMENT = Feature(name="f_DAYS_INSTALMENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="DAYS_INSTALMENT")
f_DAYS_ENTRY_PAYMENT = Feature(name="f_DAYS_ENTRY_PAYMENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="DAYS_ENTRY_PAYMENT")
f_AMT_INSTALMENT = Feature(name="f_AMT_INSTALMENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_INSTALMENT")
f_AMT_PAYMENT = Feature(name="f_AMT_PAYMENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_PAYMENT")



f_AMT_UNPAID = Feature(name="f_AMT_UNPAID",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_UNPAID")
                  

features_installments_payments_core=[
  f_SK_ID_PREV,
  f_SK_ID_CURR,
  f_NUM_INSTALMENT_VERSION,
  f_NUM_INSTALMENT_NUMBER,
  f_DAYS_INSTALMENT,
  f_DAYS_ENTRY_PAYMENT,
  f_AMT_INSTALMENT,
  f_AMT_PAYMENT,

  f_AMT_UNPAID
  ]

anchor_installments_payments_core = FeatureAnchor(name="anchor_installments_payments_core",
                                source=installments_payments_source_core, #INPUT_CONTEXT,
                                features=features_installments_payments_core)
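
Every pass-through feature above follows the same pattern: the feature name is the column name prefixed with `f_`, and the transform is the column name itself. The sketch below only builds that name/transform mapping with the standard library; `pass_through_specs` is a hypothetical helper, and turning each entry into a `Feature(...)` would still require adding `key` and `feature_type` as in the cell above.

```python
# Column names copied from the feature definitions above.
INSTALLMENT_COLUMNS = [
    "SK_ID_PREV",
    "SK_ID_CURR",
    "NUM_INSTALMENT_VERSION",
    "NUM_INSTALMENT_NUMBER",
    "DAYS_INSTALMENT",
    "DAYS_ENTRY_PAYMENT",
    "AMT_INSTALMENT",
    "AMT_PAYMENT",
    "AMT_UNPAID",
]


def pass_through_specs(columns, prefix="f_"):
    """Map each column to the name/transform pair used above (hypothetical helper)."""
    return [{"name": f"{prefix}{column}", "transform": column} for column in columns]


specs = pass_through_specs(INSTALLMENT_COLUMNS)
for spec in specs[:3]:
    print(spec)
```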


## Credit Card Balance

In [8]:
def credit_card_balance_preprocessing(df: DataFrame) -> DataFrame:
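    import pandas as pd
    import datetime
    from pyspark import sql
    # imported inside the function so it stays self-contained when it runs on the Spark cluster
    from pyspark.sql.functions import lit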

    def avgCreditBalance(df):
        df['AMT_BALANCE'] = pd.to_numeric(df['AMT_BALANCE'])
        return df.groupby('SK_ID_CURR')['AMT_BALANCE'].mean()
    
    def creditCardBalanceRollingBalance(df):
        df_final = df.copy()
        df_final = df_final.sort_values(by="MONTHS_BALANCE")
        df_final = df_final.groupby("SK_ID_CURR")['AMT_BALANCE'].agg(
            lambda x: x.ewm(span=x.shape[0], adjust=False).mean().mean()
        )
        # print(df_final.columns.values.tolist())
        df_final = df_final.reset_index(name="CREDIT_CARD_BALANCE_EMA_AVG")
        df_final = df_final.set_index('SK_ID_CURR')
        return df_final
    
    def creditCardFeatures(credit_card_balance):
        dfs = []
        dfs.append(avgCreditBalance(credit_card_balance))
        dfs.append(creditCardBalanceRollingBalance(credit_card_balance))
        final_df = dfs.pop()
        while dfs:
            final_df = final_df.join(dfs.pop(),on='SK_ID_CURR')
        return final_df

    # add a TRAN_DATE column with a static date
    df = df.withColumn("TRAN_DATE", lit(datetime.datetime(2021,1,1,11,34,44).strftime('%Y-%m-%d %X')))
    df_org = df.toPandas()

    df_result = creditCardFeatures(df_org)

    # merge the aggregated results back into the original df
    df_result = pd.merge(df_org, df_result, on="SK_ID_CURR", how="left")
    # merging frames that share a column name produces columns with `_x` and `_y` suffixes.
    # Drop the `_x` suffix to retain the original column name. A regex is used because
    # `str.rstrip("_x")` strips any trailing `_` or `x` characters, not just the suffix.
    df_result.columns = df_result.columns.str.replace(r"_x$", "", regex=True)
    
    # convert the pandas dataframe to a Spark dataframe
    spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()
    
    return spark_session.createDataFrame(df_result)  
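
The `CREDIT_CARD_BALANCE_EMA_AVG` column comes from `x.ewm(span=n, adjust=False).mean().mean()`, where `n` is the number of rows for the customer. With `adjust=False`, the exponentially weighted mean is the recursion `y[0] = x[0]`, `y[t] = (1 - alpha) * y[t-1] + alpha * x[t]` with `alpha = 2 / (span + 1)`. The plain-Python sketch below restates that recursion for one customer; it assumes the balances are already sorted by `MONTHS_BALANCE` and contain no missing values, and the function names are illustrative only.

```python
def ema_series(values, span):
    """Exponentially weighted mean with adjust=False semantics."""
    alpha = 2.0 / (span + 1)
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)  # y[0] = x[0]
        else:
            smoothed.append((1 - alpha) * smoothed[-1] + alpha * value)
    return smoothed


def ema_avg(values):
    """Average of the smoothed series, using the group size as the span."""
    smoothed = ema_series(values, span=len(values))
    return sum(smoothed) / len(smoothed)


balances = [0.0, 100.0, 200.0]  # made-up balances for one customer
print(ema_series(balances, span=len(balances)))  # [0.0, 50.0, 125.0]
print(ema_avg(balances))
```

With three rows the span is 3, so `alpha` is 0.5 and the smoothed series is `[0.0, 50.0, 125.0]`; the feature value is the mean of that series.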

## Feature definition for Credit Card Balance

In [9]:

# source for pass through features
# the "TRAN_DATE" column is created in the `credit_card_balance_preprocessing` function.
credit_card_balance_source_core = HdfsSource(name="creditCardBalanceSourceCore",
                          path="abfss://feathrhomecreditcafs@feathrhomecreditcasto.dfs.core.windows.net/home_credit_data/credit_card_balance.csv",
                          preprocessing=credit_card_balance_preprocessing,
                          event_timestamp_column="TRAN_DATE",
                          timestamp_format="yyyy-MM-dd HH:mm:ss"
                          )

# the key is shared with installments payments (key_SK_ID_CURR defined above);
# a dedicated key for this source would look like the commented-out block below
# key_SK_ID_CURR = TypedKey(key_column="SK_ID_CURR",
#                        key_column_type=ValueType.INT32,
#                        description="SK ID CURR",
#                        full_name="credit_card_balance.SK_ID_CURR")

# pass through columns of the credit card balance CSV
f_SK_ID_PREV = Feature(name="f_SK_ID_PREV",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="SK_ID_PREV")
f_SK_ID_CURR_CC = Feature(name="f_SK_ID_CURR_CC",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="SK_ID_CURR")
f_MONTHS_BALANCE = Feature(name="f_MONTHS_BALANCE",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="MONTHS_BALANCE")
f_AMT_BALANCE = Feature(name="f_AMT_BALANCE",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_BALANCE")
f_AMT_CREDIT_LIMIT_ACTUAL = Feature(name="f_AMT_CREDIT_LIMIT_ACTUAL",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_CREDIT_LIMIT_ACTUAL")
f_AMT_DRAWINGS_ATM_CURRENT = Feature(name="f_AMT_DRAWINGS_ATM_CURRENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_DRAWINGS_ATM_CURRENT")
f_AMT_DRAWINGS_CURRENT = Feature(name="f_AMT_DRAWINGS_CURRENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_DRAWINGS_CURRENT")
f_AMT_DRAWINGS_OTHER_CURRENT = Feature(name="f_AMT_DRAWINGS_OTHER_CURRENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_DRAWINGS_OTHER_CURRENT")
f_AMT_DRAWINGS_POS_CURRENT = Feature(name="f_AMT_DRAWINGS_POS_CURRENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_DRAWINGS_POS_CURRENT")
f_AMT_INST_MIN_REGULARITY = Feature(name="f_AMT_INST_MIN_REGULARITY",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_INST_MIN_REGULARITY")
f_AMT_PAYMENT_CURRENT = Feature(name="f_AMT_PAYMENT_CURRENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_PAYMENT_CURRENT")
f_AMT_PAYMENT_TOTAL_CURRENT = Feature(name="f_AMT_PAYMENT_TOTAL_CURRENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_PAYMENT_TOTAL_CURRENT")
f_AMT_RECEIVABLE_PRINCIPAL = Feature(name="f_AMT_RECEIVABLE_PRINCIPAL",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_RECEIVABLE_PRINCIPAL")
f_AMT_RECIVABLE = Feature(name="f_AMT_RECIVABLE",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_RECIVABLE")
f_AMT_TOTAL_RECEIVABLE = Feature(name="f_AMT_TOTAL_RECEIVABLE",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="AMT_TOTAL_RECEIVABLE")
f_CNT_DRAWINGS_ATM_CURRENT = Feature(name="f_CNT_DRAWINGS_ATM_CURRENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="CNT_DRAWINGS_ATM_CURRENT")
f_CNT_DRAWINGS_CURRENT = Feature(name="f_CNT_DRAWINGS_CURRENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="CNT_DRAWINGS_CURRENT")
f_CNT_DRAWINGS_OTHER_CURRENT = Feature(name="f_CNT_DRAWINGS_OTHER_CURRENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="CNT_DRAWINGS_OTHER_CURRENT")
f_CNT_DRAWINGS_POS_CURRENT = Feature(name="f_CNT_DRAWINGS_POS_CURRENT",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="CNT_DRAWINGS_POS_CURRENT")
f_CNT_INSTALMENT_MATURE_CUM = Feature(name="f_CNT_INSTALMENT_MATURE_CUM",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="CNT_INSTALMENT_MATURE_CUM")
f_NAME_CONTRACT_STATUS = Feature(name="f_NAME_CONTRACT_STATUS",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="NAME_CONTRACT_STATUS")
f_SK_DPD = Feature(name="f_SK_DPD",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="SK_DPD")
f_SK_DPD_DEF = Feature(name="f_SK_DPD_DEF",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="SK_DPD_DEF")


f_CREDIT_CARD_BALANCE_EMA_AVG = Feature(name="f_CREDIT_CARD_BALANCE_EMA_AVG",
                  key=key_SK_ID_CURR,
                  feature_type=STRING,
                  transform="CREDIT_CARD_BALANCE_EMA_AVG")



features_credit_card_balance_core=[
  # f_SK_ID_PREV,
  f_SK_ID_CURR_CC,
  f_MONTHS_BALANCE,
  f_AMT_BALANCE,
  f_AMT_CREDIT_LIMIT_ACTUAL,
  f_AMT_DRAWINGS_ATM_CURRENT,
  f_AMT_DRAWINGS_CURRENT,
  f_AMT_DRAWINGS_OTHER_CURRENT,
  f_AMT_DRAWINGS_POS_CURRENT,
  f_AMT_INST_MIN_REGULARITY,
  f_AMT_PAYMENT_CURRENT,
  f_AMT_PAYMENT_TOTAL_CURRENT,
  f_AMT_RECEIVABLE_PRINCIPAL,
  f_AMT_RECIVABLE,
  f_AMT_TOTAL_RECEIVABLE,
  f_CNT_DRAWINGS_ATM_CURRENT,
  f_CNT_DRAWINGS_CURRENT,
  f_CNT_DRAWINGS_OTHER_CURRENT,
  f_CNT_DRAWINGS_POS_CURRENT,
  f_CNT_INSTALMENT_MATURE_CUM,
  f_NAME_CONTRACT_STATUS,
  f_SK_DPD,
  f_SK_DPD_DEF,

  f_CREDIT_CARD_BALANCE_EMA_AVG

  ]

anchor_credit_card_balance_core = FeatureAnchor(name="anchor_credit_card_balance_core",
                                source=credit_card_balance_source_core, #INPUT_CONTEXT,
                                features=features_credit_card_balance_core)


In [10]:
client.build_features(
    anchor_list=[
        anchor_installments_payments_core,
        anchor_credit_card_balance_core
        ], 
    derived_feature_list=[])

## Create training data using point-in-time correct feature join

A training dataset usually contains entity id columns, multiple feature columns, an event timestamp column and a label/target column. 

To create a training dataset using Feathr, one needs to provide a feature join configuration that specifies
which features should be joined to the observation data, and how. The feature join config mainly contains: 

1. The path of a dataset that serves as the 'spine' of the to-be-created training dataset. We call this input 'spine' dataset the 'observation'
   dataset. Typically, each row of the observation data contains: 
   a) Column(s) representing entity id(s), which are used as the join key to look up (join) feature values. 
   b) A column representing the event time of the row. By default, Feathr makes sure the joined feature values have
   a timestamp earlier than it, ensuring no data leakage in the resulting training dataset. 
   c) Other columns, which are simply passed through to the output training dataset.
2. The key fields from the observation data, which are used to join with the feature data.
3. The list of feature names to be joined with the observation data. The features must be defined in the feature
   definition configs.
4. The time information of the observation data, which is compared with the feature's timestamp during the join.
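
The point-in-time rule in the list above can be sketched in a few lines of plain Python: for each observation row, pick the most recent feature value whose timestamp does not exceed the observation time. This is an illustration of the idea only, not Feathr's implementation; the data is made up, `point_in_time_lookup` is a hypothetical helper, and treating an equal timestamp as eligible is an assumption of this sketch.

```python
from bisect import bisect_right

# Feature history per entity id, sorted by timestamp: (timestamp, value).
feature_history = {
    "u1": [(10, 1.0), (20, 2.0), (30, 3.0)],
    "u2": [(15, 7.0)],
}

# Observation rows: (entity id, event time).
observations = [("u1", 25), ("u1", 5), ("u2", 15)]


def point_in_time_lookup(history, key, obs_time):
    """Return the latest value with timestamp <= obs_time, or None if there is none."""
    events = history.get(key, [])
    timestamps = [timestamp for timestamp, _ in events]
    index = bisect_right(timestamps, obs_time)
    return events[index - 1][1] if index else None


joined = [
    (key, obs_time, point_in_time_lookup(feature_history, key, obs_time))
    for key, obs_time in observations
]
for row in joined:
    print(row)
```

The second observation gets `None` because no feature value existed yet at time 5, which is exactly the leakage the rule is meant to prevent: the value recorded at time 10 is never joined to an earlier event.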

Create training dataset via:



In [11]:
feature_queries = [
    FeatureQuery(
        feature_list=[
            "f_SK_ID_PREV",
            "f_SK_ID_CURR",
            "f_NUM_INSTALMENT_VERSION",
            "f_NUM_INSTALMENT_NUMBER",
            "f_DAYS_INSTALMENT",
            "f_DAYS_ENTRY_PAYMENT",
            "f_AMT_INSTALMENT",
            "f_AMT_PAYMENT",

            "f_AMT_UNPAID"
           
        ], key=key_SK_ID_CURR),
    FeatureQuery(
        feature_list=[
            # "f_SK_ID_PREV",
            "f_SK_ID_CURR_CC",
            "f_MONTHS_BALANCE",
            "f_AMT_BALANCE",
            "f_AMT_CREDIT_LIMIT_ACTUAL",
            "f_AMT_DRAWINGS_ATM_CURRENT",
            "f_AMT_DRAWINGS_CURRENT",
            "f_AMT_DRAWINGS_OTHER_CURRENT",
            "f_AMT_DRAWINGS_POS_CURRENT",
            "f_AMT_INST_MIN_REGULARITY",
            "f_AMT_PAYMENT_CURRENT",
            "f_AMT_PAYMENT_TOTAL_CURRENT",
            "f_AMT_RECEIVABLE_PRINCIPAL",
            "f_AMT_RECIVABLE",
            "f_AMT_TOTAL_RECEIVABLE",
            "f_CNT_DRAWINGS_ATM_CURRENT",
            "f_CNT_DRAWINGS_CURRENT",
            "f_CNT_DRAWINGS_OTHER_CURRENT",
            "f_CNT_DRAWINGS_POS_CURRENT",
            "f_CNT_INSTALMENT_MATURE_CUM",
            "f_NAME_CONTRACT_STATUS",
            "f_SK_DPD",
            "f_SK_DPD_DEF",

            "f_CREDIT_CARD_BALANCE_EMA_AVG"
        ], key=key_SK_ID_CURR),
]


settings = ObservationSettings(
    observation_path="abfss://feathrhomecreditcafs@feathrhomecreditcasto.dfs.core.windows.net/home_credit_data/installments_payments.csv",
    event_timestamp_column="1609472084",
    timestamp_format="epoch"
)

client.get_offline_features(observation_settings=settings,
                            feature_query=feature_queries,
                            output_path="abfss://feathrhomecreditcafs@feathrhomecreditcasto.dfs.core.windows.net/home_credit_data/output_installment-payment_credit_card_balance.avro")
client.wait_job_to_finish(timeout_sec=7200)

2022-07-07 09:53:13.989 | INFO     | feathr._synapse_submission:upload_or_get_cloud_path:62 - Uploading /var/folders/gs/dbrzk90d0m3849n982_q27w40000gn/T/tmp8pnag696/feathr_pyspark_driver.py to cloud..
2022-07-07 09:53:13.990 | INFO     | feathr._synapse_submission:upload_file:360 - Uploading file feathr_pyspark_driver.py
2022-07-07 09:53:16.328 | INFO     | feathr._synapse_submission:upload_file:366 - /var/folders/gs/dbrzk90d0m3849n982_q27w40000gn/T/tmp8pnag696/feathr_pyspark_driver.py is uploaded to location: abfss://feathrhomecreditcafs@feathrhomecreditcasto.dfs.core.windows.net/feathr_pyspark_driver.py
2022-07-07 09:53:16.329 | INFO     | feathr._synapse_submission:upload_or_get_cloud_path:65 - /var/folders/gs/dbrzk90d0m3849n982_q27w40000gn/T/tmp8pnag696/feathr_pyspark_driver.py is uploaded to location: abfss://feathrhomecreditcafs@feathrhomecreditcasto.dfs.core.windows.net/feathr_pyspark_driver.py
2022-07-07 09:53:16.357 | INFO     | feathr._synapse_submission:upload_or_get_cloud_p
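
The observation settings above pass the literal `"1609472084"` together with `timestamp_format="epoch"`. Decoding that value shows how it relates to the static `TRAN_DATE` written by the preprocessing functions. The sketch below only does the arithmetic; the reading that the two values are meant to denote the same instant in a UTC+8 time zone is an inference, not something the source states.

```python
from datetime import datetime, timedelta, timezone

OBSERVATION_EPOCH = 1609472084  # literal used in ObservationSettings

utc_time = datetime.fromtimestamp(OBSERVATION_EPOCH, tz=timezone.utc)
print(utc_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-01-01 03:34:44

# Static TRAN_DATE set in the preprocessing functions (no time zone attached).
tran_date = datetime(2021, 1, 1, 11, 34, 44)

offset = tran_date - utc_time.replace(tzinfo=None)
print(offset)  # 8:00:00
```

The epoch value is 03:34:44 UTC on 2021-01-01, exactly eight hours behind the `TRAN_DATE` wall-clock time.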

## Download and view the result

Let's use the helper function `get_result_df` to download the result and view it:

In [12]:
import shutil
def get_result_df(client: FeathrClient) -> pd.DataFrame:
    """Download the job result dataset from cloud as a Pandas dataframe."""
    res_url = client.get_job_result_uri(block=True, timeout_sec=600)
    tmp_dir = "../../../results/output_installment-payment_credit_card_balance.avro"
    shutil.rmtree(tmp_dir, ignore_errors=True)
    client.feathr_spark_laucher.download_result(result_path=res_url, local_folder=tmp_dir)
    dataframe_list = []
    # assuming the results are in Avro format
    for file in glob.glob(os.path.join(tmp_dir, '*.avro')):
        dataframe_list.append(pdx.read_avro(file))
    vertical_concat_df = pd.concat(dataframe_list, axis=0)
    return vertical_concat_df

df_res = get_result_df(client)

2022-07-07 10:02:28.837 | INFO     | feathr._synapse_submission:wait_for_completion:134 - Current Spark job status: success
2022-07-07 10:02:29.434 | INFO     | feathr._synapse_submission:download_file:378 - Beginning reading of results from abfss://feathrhomecreditcafs@feathrhomecreditcasto.dfs.core.windows.net/home_credit_data/output_installment-payment_credit_card_balance.avro
Downloading result files: 100%|██████████| 201/201 [07:22<00:00,  2.20s/it]
2022-07-07 10:09:53.786 | INFO     | feathr._synapse_submission:download_file:407 - Finish downloading files from abfss://feathrhomecreditcafs@feathrhomecreditcasto.dfs.core.windows.net/home_credit_data/output_installment-payment_credit_card_balance.avro to ../../results/output_installment-payment_credit_card_balance.avro.


In [13]:
print(df_res)

      SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER  \
0        2525854     100309                    1.0                     7   
1        2525854     100309                    1.0                     5   
2        2525854     100309                    1.0                     3   
3        2525854     100309                    1.0                     1   
4        2525854     100309                    1.0                     2   
...          ...        ...                    ...                   ...   
67081    1460610     455971                    1.0                    10   
67082    2099905     455971                    1.0                     2   
67083    1589506     455971                    1.0                    16   
67084    2749130     455971                    1.0                     3   
67085    2749130     455971                    1.0                    15   

      DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT  \
0               

In [14]:

with pd.option_context('display.max_columns', 50, 'display.max_rows', 50000):
#    print(df_res.columns.values.tolist())
   print(df_res[[
       "f_SK_ID_PREV",
        "f_SK_ID_CURR",
        # "f_NUM_INSTALMENT_VERSION",
        # "f_NUM_INSTALMENT_NUMBER",
        # "f_DAYS_INSTALMENT",
        # "f_DAYS_ENTRY_PAYMENT",
        # "f_AMT_INSTALMENT",
        # "f_AMT_PAYMENT",

        "f_AMT_UNPAID",

        # "f_SK_ID_PREV",
        # "f_SK_ID_CURR",
        # "f_MONTHS_BALANCE",
        # "f_AMT_BALANCE",
        # "f_AMT_CREDIT_LIMIT_ACTUAL",
        # "f_AMT_DRAWINGS_ATM_CURRENT",
        # "f_AMT_DRAWINGS_CURRENT",
        # "f_AMT_DRAWINGS_OTHER_CURRENT",
        # "f_AMT_DRAWINGS_POS_CURRENT",
        # "f_AMT_INST_MIN_REGULARITY",
        # "f_AMT_PAYMENT_CURRENT",
        # "f_AMT_PAYMENT_TOTAL_CURRENT",
        # "f_AMT_RECEIVABLE_PRINCIPAL",
        # "f_AMT_RECIVABLE",
        # "f_AMT_TOTAL_RECEIVABLE",
        # "f_CNT_DRAWINGS_ATM_CURRENT",
        # "f_CNT_DRAWINGS_CURRENT",
        # "f_CNT_DRAWINGS_OTHER_CURRENT",
        # "f_CNT_DRAWINGS_POS_CURRENT",
        # "f_CNT_INSTALMENT_MATURE_CUM",
        # "f_NAME_CONTRACT_STATUS",
        # "f_SK_DPD",
        # "f_SK_DPD_DEF",

        "f_CREDIT_CARD_BALANCE_EMA_AVG"
   ]])

      f_SK_ID_PREV f_SK_ID_CURR       f_AMT_UNPAID  \
0          2525854       100309                0.0   
1          2525854       100309                0.0   
2          2525854       100309                0.0   
3          2525854       100309                0.0   
4          2525854       100309                0.0   
...            ...          ...                ...   
67081      1589506       455971  743.5050000000001   
67082      1589506       455971  743.5050000000001   
67083      1589506       455971  743.5050000000001   
67084      1589506       455971  743.5050000000001   
67085      1589506       455971  743.5050000000001   

      f_CREDIT_CARD_BALANCE_EMA_AVG  
0                               0.0  
1                               0.0  
2                               0.0  
3                               0.0  
4                               0.0  
...                             ...  
67081            35381.861876514406  
67082            35381.861876514406  
67083    