# Feathr Feature Store on Home Credit

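This notebook illustrates the use of the Feathr Feature Store to build features for a Home Credit model. It configures the Feathr client, defines pre-processing functions and features for the bureau datasets, joins them into a training dataset, and materializes a feature to the online store.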



## Prerequisite: Install Feathr

Install Feathr using pip:

`pip install -U feathr==0.7.1 pandavro scikit-learn`

Or if you want to use the latest Feathr code from GitHub:

`pip install -I git+https://github.com/linkedin/feathr.git#subdirectory=feathr_project pandavro scikit-learn`

In [None]:
%pip install -U feathr==0.7.1 pandavro scikit-learn

## Prerequisite: Configure the required environment

In the first step (Provision cloud resources), you should have provisioned all the required cloud resources. If you use Feathr CLI to create a workspace, you should have a folder with a file called `feathr_config.yaml` in it with all the required configurations. Otherwise, update the configuration below.

The code below will write this configuration string to a temporary location and load it to Feathr. Please still refer to [feathr_config.yaml](https://github.com/linkedin/feathr/blob/v0.7.2/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml) and use that as the source of truth. It should also have more explanations on the meaning of each variable.
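
The configuration cell in this section builds the YAML text with an f-string, so single braces are substituted and doubled braces are emitted literally. The following is a minimal, standard-library sketch of that pattern; `resource_prefix`, `yaml_text`, and `config_path` are illustrative names, and the YAML fragment is a shortened stand-in for the full configuration.

```python
import os
import tempfile

# Placeholder value for illustration only; use your own resource prefix.
resource_prefix = "demo"

# Inside an f-string, {name} is substituted, while {{ and }} produce literal braces.
yaml_text = f"""
online_store:
  redis:
    host: '{resource_prefix}redis.redis.cache.windows.net'
spark_config:
  databricks:
    config_template: {{'run_name': ''}}
"""

# delete=False keeps the file on disk after the handle is closed,
# so its path can be passed to another component later.
with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as tmp_file:
    tmp_file.write(yaml_text)
    config_path = tmp_file.name

# Read the file back to confirm what was written.
with open(config_path) as f:
    written = f.read()

os.remove(config_path)  # clean up the demo file
```

Writing inside the `with` block closes the handle before the path is reused, which avoids opening the same temporary file twice.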

In [None]:
RESOURCE_PREFIX = '<RESOURCE_PREFIX>'

In [None]:
import tempfile
yaml_config = f"""
# Please refer to https://github.com/linkedin/feathr/blob/v0.7.2/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml for explanations on the meaning of each field.
api_version: 1
project_config:
  project_name: 'feathr_home_credit'
  required_environment_variables:
    - 'REDIS_PASSWORD'
    - 'AZURE_CLIENT_ID'
    - 'AZURE_TENANT_ID'
    - 'AZURE_CLIENT_SECRET'
offline_store:
  adls:
    adls_enabled: true
  wasb:
    wasb_enabled: true
  s3:
    s3_enabled: false
    s3_endpoint: 's3.amazonaws.com'
  jdbc:
    jdbc_enabled: false
    jdbc_database: 'feathrtestdb'
    jdbc_table: 'feathrtesttable'
spark_config:
  spark_cluster: 'azure_synapse'
  spark_result_output_parts: '1'
  azure_synapse:
    dev_url: "https://{RESOURCE_PREFIX}spark.dev.azuresynapse.net"
    pool_name: "spark31"
    # workspace dir for storing all the required configuration files and the jar resources
    workspace_dir: "abfss://{RESOURCE_PREFIX}fs@{RESOURCE_PREFIX}sto.dfs.core.windows.net/"
    executor_size: "Small"
    executor_num: 4
    feathr_runtime_location: wasbs://public@azurefeathrstorage.blob.core.windows.net/feathr-assembly-LATEST.jar
  databricks:
    workspace_instance_url: 'https://adb-6885802458123232.12.azuredatabricks.net/'
    workspace_token_value: ''
    config_template: {{'run_name':'','new_cluster':{{'spark_version':'9.1.x-scala2.12','node_type_id':'Standard_D3_v2','num_workers':2,'spark_conf':{{}}}},'libraries':[{{'jar':''}}],'spark_jar_task':{{'main_class_name':'','parameters':['']}}}}
    work_dir: 'dbfs:/feathr_getting_started'
    feathr_runtime_location: wasbs://public@azurefeathrstorage.blob.core.windows.net/feathr-assembly-LATEST.jar
online_store:
  redis:
    host: '{RESOURCE_PREFIX}redis.redis.cache.windows.net'
    port: 6380
    ssl_enabled: True
feature_registry:
  purview:
    type_system_initialization: true
    purview_name: '{RESOURCE_PREFIX}purview'
    delimiter: '__'
"""
tmp = tempfile.NamedTemporaryFile(mode='w', delete=False)
with open(tmp.name, "w") as text_file:
    text_file.write(yaml_config)

## View the data

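In this tutorial, we use Feathr Feature Store to create features for a Home Credit model. The data consists of the `bureau.csv` and `bureau_balance.csv` files of the Home Credit dataset, stored under `home_credit_data/` in the storage account provisioned earlier. The cell below imports the libraries used throughout the notebook.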

In [None]:
import glob
import os
import tempfile
from datetime import datetime, timedelta
from math import sqrt

import pandas as pd
import pandavro as pdx
from feathr import FeathrClient
from feathr import BOOLEAN, FLOAT, INT32, ValueType, STRING
from feathr import Feature, DerivedFeature, FeatureAnchor
from feathr import BackfillTime, MaterializationSettings
from feathr import FeatureQuery, ObservationSettings
from feathr import RedisSink
from feathr import INPUT_CONTEXT, HdfsSource
from feathr import WindowAggTransformation
from feathr import TypedKey
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

## Setup necessary environment variables

You have to set up the environment variables in order to run this sample. More environment variables can be set by referring to [feathr_config.yaml](https://github.com/linkedin/feathr/blob/v0.7.2/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml), which is the source of truth and explains the meaning of each variable.
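
Because the cell below assigns empty strings, it can help to check for unset or empty values before creating the client. The following is a small sketch; `missing_env_vars` is a hypothetical helper written for this illustration (it is not part of Feathr), and the variable names are taken from `required_environment_variables` in the configuration above.

```python
import os
from typing import List, Mapping

# Names listed under required_environment_variables in the configuration.
REQUIRED_ENV_VARS = [
    "REDIS_PASSWORD",
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_SECRET",
]

def missing_env_vars(names: List[str], environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `environ` (hypothetical helper)."""
    return [name for name in names if not environ.get(name)]

# Example with an explicit mapping instead of the real process environment.
example_env = {"REDIS_PASSWORD": "secret", "AZURE_CLIENT_ID": ""}
missing = missing_env_vars(REQUIRED_ENV_VARS, example_env)
print(missing)
```

Calling `missing_env_vars(REQUIRED_ENV_VARS)` without a mapping checks the real process environment.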

In [None]:
os.environ['REDIS_PASSWORD'] = ''
os.environ['AZURE_CLIENT_ID'] = ''
os.environ['AZURE_TENANT_ID'] = '' 
os.environ['AZURE_CLIENT_SECRET'] = ''

Then we initialize a Feathr client:


In [None]:
client = FeathrClient(config_path=tmp.name)

## Misc pre-processing methods

In [None]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import lit

def add_tran_date_column(df: DataFrame) -> DataFrame:
    df = df.withColumn("TRAN_DATE", lit("2021-01-01 11:34:44"))

    return df

def add_dummy_column(df: DataFrame) -> DataFrame:   
    df = df.withColumn("DUMMY", lit("dummy"))

    return df


## Bureau balance pre-processing method
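
The pre-processing function in this section maps the `STATUS` values `X` and `C` to 0 and then, per `SK_ID_BUREAU`, averages an exponentially weighted mean whose span equals the group size. The sketch below reproduces that arithmetic in plain Python for one group. It assumes the usual recursive definition for `adjust=False`, with `alpha = 2 / (span + 1)`; the function names are illustrative.

```python
from typing import List

def status_to_number(status: str) -> float:
    """Map 'X' and 'C' to 0 and parse every other status as a number."""
    return 0.0 if status in ("X", "C") else float(status)

def ema_average(values: List[float]) -> float:
    """Mean of the exponentially weighted series with span = len(values), adjust=False."""
    alpha = 2.0 / (len(values) + 1)
    ema = values[0]  # the first smoothed value is the first observation
    smoothed = [ema]
    for value in values[1:]:
        ema = (1 - alpha) * ema + alpha * value
        smoothed.append(ema)
    return sum(smoothed) / len(smoothed)

# One bureau record with four monthly statuses.
statuses = ["C", "0", "1", "X"]
credit_status_ema_avg = ema_average([status_to_number(s) for s in statuses])
print(round(credit_status_ema_avg, 4))  # 0.16
```

With four rows, `alpha` is 0.4 and the smoothed series is 0, 0, 0.4, 0.24, whose mean is 0.16.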

In [None]:
def bureau_balance_preprocessing(df: DataFrame) -> DataFrame:
    import pandas as pd
    import datetime
    from pyspark import sql
    
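    # %H:%M:%S is used instead of the locale-dependent %X so the value always matches timestamp_format
    df = df.withColumn("TRAN_DATE", lit(datetime.datetime(2021,1,1,11,34,44).strftime('%Y-%m-%d %H:%M:%S')))
    
    # convert Spark DataFrame to pandas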
    df_org =  df.toPandas()
    df_bureauBalanceRollingCreditLoan = df_org.copy()
    
    df_bureauBalanceRollingCreditLoan['STATUS'] = df_bureauBalanceRollingCreditLoan['STATUS'].replace(['X','C'],'0')
    df_bureauBalanceRollingCreditLoan['STATUS'] = pd.to_numeric(df_bureauBalanceRollingCreditLoan['STATUS'])
    df_bureauBalanceRollingCreditLoan = df_bureauBalanceRollingCreditLoan.groupby("SK_ID_BUREAU")['STATUS'].agg(
        lambda x: x.ewm(span=x.shape[0], adjust=False).mean().mean()
    )
    df_bureauBalanceRollingCreditLoan = df_bureauBalanceRollingCreditLoan.reset_index(name="CREDIT_STATUS_EMA_AVG")
    df_bureauBalanceRollingCreditLoan = df_bureauBalanceRollingCreditLoan.set_index('SK_ID_BUREAU')
    df_result = pd.merge(df_org, df_bureauBalanceRollingCreditLoan, on="SK_ID_BUREAU", how="left")
    
    # convert pandas DataFrame to Spark DataFrame
    spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()
        
    return spark_session.createDataFrame(df_result)


## Bureau pre-processing method
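
The function in this section aggregates the bureau rows per `SK_ID_CURR`. As a rough illustration of three of those aggregates (loan count, share of active loans, and mean debt-to-credit ratio), here is a plain-Python sketch on made-up rows. The sample values are invented, and missing debt values are skipped in the mean, which is how a pandas mean treats `NaN`.

```python
from collections import defaultdict

# Made-up sample rows; only the columns needed for the three aggregates.
rows = [
    {"SK_ID_CURR": 1, "CREDIT_ACTIVE": "Active", "AMT_CREDIT_SUM_DEBT": 50.0, "AMT_CREDIT_SUM": 100.0},
    {"SK_ID_CURR": 1, "CREDIT_ACTIVE": "Closed", "AMT_CREDIT_SUM_DEBT": 0.0, "AMT_CREDIT_SUM": 200.0},
    {"SK_ID_CURR": 2, "CREDIT_ACTIVE": "Closed", "AMT_CREDIT_SUM_DEBT": None, "AMT_CREDIT_SUM": 80.0},
]

# Group the loans by customer id.
groups = defaultdict(list)
for row in rows:
    groups[row["SK_ID_CURR"]].append(row)

aggregates = {}
for customer_id, loans in groups.items():
    active = sum(1 for loan in loans if loan["CREDIT_ACTIVE"] == "Active")
    # Per-loan ratio; loans with a missing debt value are left out of the mean.
    ratios = [
        loan["AMT_CREDIT_SUM_DEBT"] / loan["AMT_CREDIT_SUM"]
        for loan in loans
        if loan["AMT_CREDIT_SUM_DEBT"] is not None and loan["AMT_CREDIT_SUM"] is not None
    ]
    aggregates[customer_id] = {
        "NUM_CREDIT_COUNT": len(loans),
        "ACTIVE_LOANS_PERCENT": active / len(loans),
        "DEBT_CREDIT_RATIO": sum(ratios) / len(ratios) if ratios else None,
    }

print(aggregates)
```

The sample data avoids a zero `AMT_CREDIT_SUM`; this sketch would raise `ZeroDivisionError` in that case, so real data needs explicit handling.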

In [None]:
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

def bureau_preprocessing(df: DataFrame) -> DataFrame:
    import datetime
    import pandas as pd
    from pyspark import sql
    from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

    def bureauBalanceRollingCreditLoan(df):
        df_final = df.copy()
        df_final['STATUS'] = df_final['STATUS'].replace(['X','C'],'0')
        df_final['STATUS'] = pd.to_numeric(df_final['STATUS'])
        df_final = df_final.groupby("SK_ID_BUREAU")['STATUS'].agg(
            lambda x: x.ewm(span=x.shape[0], adjust=False).mean().mean()
        )
        df_final = df_final.reset_index(name="CREDIT_STATUS_EMA_AVG")
        df_final = df_final.set_index('SK_ID_BUREAU')
        return df_final


    def aggCountBureau(df):
        agg = df.groupby("SK_ID_CURR")
        # count number of loans
        df_final = pd.DataFrame(agg['SK_ID_CURR'].agg('count').reset_index(name='NUM_CREDIT_COUNT'))
        # count number of loans prolonged
        loans_prolonged = agg['CNT_CREDIT_PROLONG'].sum().reset_index(name='CREDIT_PROLONG_COUNT').set_index("SK_ID_CURR")
        df_final = df_final.join(loans_prolonged,on='SK_ID_CURR')
        # count percentage of active loans
        active_loans = agg['CREDIT_ACTIVE'].value_counts().reset_index(name='ACTIVE_LOANS_COUNT')
        active_loans = active_loans[active_loans['CREDIT_ACTIVE'] == 'Active'][['SK_ID_CURR','ACTIVE_LOANS_COUNT']].set_index("SK_ID_CURR")
        df_final = df_final.join(active_loans,on='SK_ID_CURR')
        df_final['ACTIVE_LOANS_PERCENT'] = df_final['ACTIVE_LOANS_COUNT']/df_final['NUM_CREDIT_COUNT']
        df_final.drop(["ACTIVE_LOANS_COUNT"], axis=1, inplace=True)
        df_final['ACTIVE_LOANS_PERCENT'] = df_final['ACTIVE_LOANS_PERCENT'].fillna(0)
        # count credit type
        # one hot encode
        ohe = OneHotEncoder(sparse=False)
        ohe_fit = ohe.fit_transform(df[["CREDIT_TYPE"]])
        credit_type = pd.DataFrame(ohe_fit, columns = ohe.get_feature_names(["CREDIT_TYPE"]))
        credit_type.insert(loc=0, column='SK_ID_CURR', value=df['SK_ID_CURR'].values)
        credit_type = credit_type.groupby("SK_ID_CURR").sum()
        df_final = df_final.join(credit_type, on="SK_ID_CURR")
        df_final = df_final.set_index("SK_ID_CURR")

        return df_final
    
    # Average number of days between loans
    # Average number of overdue days of overdue loans
    def aggAvgBureau(df):
        # convert this column to numeric
        df['DAYS_CREDIT'] = pd.to_numeric(df['DAYS_CREDIT'])
        agg = df.groupby('SK_ID_CURR')
       
        # average of CREDIT_DAY_OVERDUE
        final_df = agg['CREDIT_DAY_OVERDUE'].mean().reset_index(name = "CREDIT_DAY_OVERDUE_MEAN")
        # average of days between credits of DAYS_CREDIT
        days_credit_between = pd.DataFrame(df['SK_ID_CURR'])
        
        days_credit_between['diff'] = agg['DAYS_CREDIT'].diff().values
        days_credit_between = days_credit_between.groupby("SK_ID_CURR")['diff'].mean().reset_index(name = 'DAYS_CREDIT_BETWEEN_MEAN')
        days_credit_between.set_index("SK_ID_CURR",inplace=True)
        final_df = final_df.join(days_credit_between, on='SK_ID_CURR')
        final_df = final_df.set_index("SK_ID_CURR")
        return final_df

    #  ratio of AMT_CREDIT_SUM_DEBT to AMT_CREDIT_SUM created
    def debtCreditRatio(df):
        df['AMT_CREDIT_SUM_DEBT'] = pd.to_numeric(df['AMT_CREDIT_SUM_DEBT'])
        df['AMT_CREDIT_SUM'] = pd.to_numeric(df['AMT_CREDIT_SUM'])
        #get debt:credit ratio
        df['DEBT_CREDIT_RATIO'] = df['AMT_CREDIT_SUM_DEBT']/df['AMT_CREDIT_SUM']
        df_final = df.groupby('SK_ID_CURR')['DEBT_CREDIT_RATIO'].mean().reset_index(name='DEBT_CREDIT_RATIO')
        df_final = df_final.set_index("SK_ID_CURR")

        df_final = df_final[df_final.columns.intersection(['SK_ID_CURR', 'DEBT_CREDIT_RATIO'])]
        
        return df_final
    
    # add a TRAN_DATE column with a static date
    # (%H:%M:%S instead of the locale-dependent %X, to match timestamp_format)
    df = df.withColumn("TRAN_DATE", lit(datetime.datetime(2021,1,1,11,34,44).strftime('%Y-%m-%d %H:%M:%S')))
    df_org = df.toPandas()
        
    df_aggCountBureau = aggCountBureau(df_org)
    df_aggAvgInstalments = aggAvgBureau(df_org)
    df_debtCreditRatio = debtCreditRatio(df_org)
    
    dfs = []

    dfs.append(df_aggCountBureau)
    dfs.append(df_aggAvgInstalments)
    dfs.append(df_debtCreditRatio)

    df_result = dfs.pop()
    while dfs:
        df_result = df_result.join(dfs.pop(),on='SK_ID_CURR')
    
    # merge the aggregated results back into the original DataFrame
    df_result = pd.merge(df_org, df_result, on="SK_ID_CURR", how="left")
    # Columns present on both sides get the suffixes `_x` and `_y` after the merge.
    # Drop the `_x` suffix to restore the original column names. A regex is used because
    # `str.rstrip("_x")` strips any trailing '_' or 'x' characters, not just the suffix.
    df_result.columns = df_result.columns.str.replace(r"_x$", "", regex=True)

    # convert pandas DataFrame to Spark DataFrame
    spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()
    
    return spark_session.createDataFrame(df_result)  


## Defining Features with Feathr:

### Bureau Dataset
1. parent dataset: bureau.csv 
    1. count aggregation features created
    1. average aggregation features created
    1. debt:credit ratio feature created
1. child dataset: bureau_balance.csv
    1. rolling window credit loan status feature will be created and joined to parent dataset
1. combining/joining both datasets, aggregated on the primary key (`SK_ID_CURR`) of application_train (the target dataframe), with the following features:
    1. count aggregation features created
    1. average aggregation features created
    1. debt:credit ratio feature created
    1. rolling window credit loan status feature will be created and joined to parent dataset

In [None]:
# Two data sources point to the same CSV because pass-through and aggregated
# features cannot be mixed in one source; each needs its own data source name.

# source for pass-through features
# The "TRAN_DATE" column is created by the `bureau_preprocessing` method.
bureau_source_core = HdfsSource(name="bureauSourceCore",
                          path=f"abfss://{RESOURCE_PREFIX}fs@{RESOURCE_PREFIX}sto.dfs.core.windows.net/home_credit_data/bureau.csv",
                          preprocessing=bureau_preprocessing,
                          event_timestamp_column="TRAN_DATE",
                          timestamp_format="yyyy-MM-dd HH:mm:ss"
                          )

# key definition for bureau datasource
key_SK_ID_BUREAU = TypedKey(key_column="SK_ID_BUREAU",
                       key_column_type=ValueType.INT32,
                       description="SK ID Bureau",
                       full_name="bureau.SK_ID_BUREAU")

key_SK_ID_CURR = TypedKey(key_column="SK_ID_CURR",
                       key_column_type=ValueType.INT32,
                       description="SK ID CURR",
                       full_name="bureau.SK_ID_CURR")

# pass through columns of BUREAU datasource CSV
f_SK_ID_CURR = Feature(name="f_SK_ID_CURR",
                        key=key_SK_ID_BUREAU, 
                        feature_type=INT32, 
                        transform="SK_ID_CURR")

f_SK_ID_BUREAU  = Feature(name="f_SK_ID_BUREAU",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="SK_ID_BUREAU")

f_CREDIT_ACTIVE = Feature(name="f_CREDIT_ACTIVE",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="CREDIT_ACTIVE")

f_CREDIT_CURRENCY = Feature(name="f_CREDIT_CURRENCY",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="CREDIT_CURRENCY")

f_DAYS_CREDIT = Feature(name="f_DAYS_CREDIT",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="DAYS_CREDIT")

f_CREDIT_DAY_OVERDUE = Feature(name="f_CREDIT_DAY_OVERDUE",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="CREDIT_DAY_OVERDUE")

f_DAYS_CREDIT_ENDDATE = Feature(name="f_DAYS_CREDIT_ENDDATE",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="DAYS_CREDIT_ENDDATE")

f_DAYS_ENDDATE_FACT = Feature(name="f_DAYS_ENDDATE_FACT",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="DAYS_ENDDATE_FACT")

f_AMT_CREDIT_MAX_OVERDUE = Feature(name="f_AMT_CREDIT_MAX_OVERDUE",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="AMT_CREDIT_MAX_OVERDUE")

f_CNT_CREDIT_PROLONG = Feature(name="f_CNT_CREDIT_PROLONG",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="CNT_CREDIT_PROLONG")

f_AMT_CREDIT_SUM = Feature(name="f_AMT_CREDIT_SUM",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="AMT_CREDIT_SUM")

f_AMT_CREDIT_SUM_DEBT = Feature(name="f_AMT_CREDIT_SUM_DEBT",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="AMT_CREDIT_SUM_DEBT")

f_AMT_CREDIT_SUM_LIMIT = Feature(name="f_AMT_CREDIT_SUM_LIMIT",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="AMT_CREDIT_SUM_LIMIT")

f_AMT_CREDIT_SUM_OVERDUE = Feature(name="f_AMT_CREDIT_SUM_OVERDUE",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="AMT_CREDIT_SUM_OVERDUE")

f_CREDIT_TYPE = Feature(name="f_CREDIT_TYPE",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="CREDIT_TYPE")

f_DAYS_CREDIT_UPDATE = Feature(name="f_DAYS_CREDIT_UPDATE",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="DAYS_CREDIT_UPDATE")

f_AMT_ANNUITY = Feature(name="f_AMT_ANNUITY",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="AMT_ANNUITY")



f_NUM_CREDIT_COUNT = Feature(name="f_NUM_CREDIT_COUNT",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="NUM_CREDIT_COUNT")

f_DEBT_CREDIT_RATIO = Feature(name="f_DEBT_CREDIT_RATIO",
                        key=key_SK_ID_BUREAU,
                        feature_type=STRING,
                        transform="DEBT_CREDIT_RATIO")


features_bureau_source_core=[
  f_SK_ID_CURR,
  f_SK_ID_BUREAU,
  f_CREDIT_ACTIVE,
  f_CREDIT_CURRENCY,
  f_DAYS_CREDIT,
  f_CREDIT_DAY_OVERDUE,
  f_DAYS_CREDIT_ENDDATE,
  f_DAYS_ENDDATE_FACT,
  f_AMT_CREDIT_MAX_OVERDUE,
  f_CNT_CREDIT_PROLONG,
  f_AMT_CREDIT_SUM,
  f_AMT_CREDIT_SUM_DEBT,
  f_AMT_CREDIT_SUM_LIMIT,
  f_AMT_CREDIT_SUM_OVERDUE,
  f_CREDIT_TYPE,
  f_DAYS_CREDIT_UPDATE,
  f_AMT_ANNUITY,

  f_NUM_CREDIT_COUNT,
  f_DEBT_CREDIT_RATIO,
  ]

anchor_bureau_source_core = FeatureAnchor(name="anchor_bureau_source_core",
                                source=bureau_source_core, #INPUT_CONTEXT,
                                features=features_bureau_source_core)


In [None]:
# source for aggregated features of BUREAU
bureau_source_agg = HdfsSource(name="bureauSourceAgg",
                          path=f"abfss://{RESOURCE_PREFIX}fs@{RESOURCE_PREFIX}sto.dfs.core.windows.net/home_credit_data/bureau.csv",
                          preprocessing=add_tran_date_column,
                          event_timestamp_column="TRAN_DATE",
                          timestamp_format="yyyy-MM-dd HH:mm:ss"
                          )


In [None]:
# source for bureau balance features
bureau_balance_source_core = HdfsSource(name="bureauBalanceSourceCore",
                          path=f"abfss://{RESOURCE_PREFIX}fs@{RESOURCE_PREFIX}sto.dfs.core.windows.net/home_credit_data/bureau_balance.csv",
                          preprocessing=bureau_balance_preprocessing,
                          event_timestamp_column="TRAN_DATE",
                          timestamp_format="yyyy-MM-dd HH:mm:ss"
                          )

f_MONTHS_BALANCE  = Feature(name="f_MONTHS_BALANCE",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING, 
                        transform="MONTHS_BALANCE")

f_STATUS  = Feature(name="f_STATUS",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING,
                        transform="STATUS")

f_CREDIT_STATUS_EMA_AVG  = Feature(name="f_CREDIT_STATUS_EMA_AVG",
                        key=key_SK_ID_BUREAU, 
                        feature_type=STRING,
                        transform="CREDIT_STATUS_EMA_AVG")
                        
features_bureau_balance_source_core=[
  f_MONTHS_BALANCE,
  f_STATUS,
  f_CREDIT_STATUS_EMA_AVG
  ]

anchor_bureau_balance_source_core = FeatureAnchor(name="anchor_bureau_balance_source_core",
                                source=bureau_balance_source_core,
                                features=features_bureau_balance_source_core)

Then we need to build those features so that they can be consumed later. Note that both the "anchor" and the "derived" features (which are not anchored to a source) have to be passed; this notebook defines no derived features, so that list is empty.

In [None]:
client.build_features(
    anchor_list=[
        anchor_bureau_source_core,
        anchor_bureau_balance_source_core,
        ], 
    derived_feature_list=[])

## Create training data using point-in-time correct feature join

A training dataset usually contains entity id columns, multiple feature columns, event timestamp column and label/target column. 

To create a training dataset using Feathr, one needs to provide a feature join configuration file to specify
what features and how these features should be joined to the observation data. The feature join config file mainly contains: 

1. The path of a dataset as the 'spine' for the to-be-created training dataset. We call this input 'spine' dataset the 'observation'
   dataset. Typically, each row of the observation data contains: 
   a) Column(s) representing entity id(s), which are used as the join key to look up (join) feature values. 
   b) A column representing the event time of the row. By default, Feathr makes sure the joined feature values have
   a timestamp earlier than it, ensuring no data leakage in the resulting training dataset. 
   c) Other columns, which are simply passed through to the output training dataset.
2. The key fields from the observation data, which are used to join with the feature data.
3. List of feature names to be joined with the observation data. The features must be defined in the feature
   definition configs.
4. The time information of the observation data used to compare with the feature's timestamp during the join.
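
The point-in-time rule described in the list above can be sketched in plain Python: for each observation row, take the most recent feature value whose timestamp does not exceed the row's event time. This is an illustration of the idea, not Feathr's implementation. The data and the names `feature_history` and `point_in_time_lookup` are made up, and treating a feature timestamp equal to the event time as eligible is an assumption of this sketch.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Feature history per entity key: (timestamp, value) pairs sorted by timestamp.
feature_history: Dict[int, List[Tuple[int, float]]] = {
    100: [(10, 1.0), (20, 2.0), (30, 3.0)],
}

def point_in_time_lookup(key: int, event_time: int) -> Optional[float]:
    """Return the latest feature value with timestamp <= event_time, or None."""
    history = feature_history.get(key, [])
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)  # number of entries at or before event_time
    return history[idx - 1][1] if idx > 0 else None

# Observation rows: (entity key, event time).
observations = [(100, 25), (100, 5), (200, 25)]
joined = [(key, t, point_in_time_lookup(key, t)) for key, t in observations]
print(joined)  # [(100, 25, 2.0), (100, 5, None), (200, 25, None)]
```

The value recorded at time 30 is never joined to the row with event time 25, which is what prevents leakage of future information into the training data.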

Create training dataset via:



In [None]:
feature_queries = [
    FeatureQuery(
        feature_list=[
            "f_SK_ID_CURR",
            "f_SK_ID_BUREAU",
            "f_CREDIT_ACTIVE",
            "f_CREDIT_CURRENCY",
            "f_DAYS_CREDIT",
            "f_CREDIT_DAY_OVERDUE",
            "f_DAYS_CREDIT_ENDDATE",
            "f_DAYS_ENDDATE_FACT",
            "f_AMT_CREDIT_MAX_OVERDUE",
            "f_CNT_CREDIT_PROLONG",
            "f_AMT_CREDIT_SUM",
            "f_AMT_CREDIT_SUM_DEBT",
            "f_AMT_CREDIT_SUM_LIMIT",
            "f_AMT_CREDIT_SUM_OVERDUE",
            "f_CREDIT_TYPE",
            "f_DAYS_CREDIT_UPDATE",
            "f_AMT_ANNUITY",

            "f_NUM_CREDIT_COUNT",
            "f_DEBT_CREDIT_RATIO",
        ], key=key_SK_ID_BUREAU),
    
    FeatureQuery(
        feature_list=[
            "f_MONTHS_BALANCE",
            "f_STATUS",
            "f_CREDIT_STATUS_EMA_AVG"
        ], key=key_SK_ID_BUREAU)
]

# The spine dataset is the same bureau.csv, joined with a constant
# event timestamp given in epoch seconds.
settings = ObservationSettings(
    observation_path=f"abfss://{RESOURCE_PREFIX}fs@{RESOURCE_PREFIX}sto.dfs.core.windows.net/home_credit_data/bureau.csv",
    event_timestamp_column="1609472084",
    timestamp_format="epoch"
)

# output would be in output_bureau.avro
client.get_offline_features(observation_settings=settings,
                            feature_query=feature_queries,
                            output_path=f"abfss://{RESOURCE_PREFIX}fs@{RESOURCE_PREFIX}sto.dfs.core.windows.net/home_credit_data/output_bureau.avro")
client.wait_job_to_finish(timeout_sec=7200)
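
The observation settings above use a constant epoch value, while the pre-processing functions write `TRAN_DATE` as a formatted string. The sketch below converts between the two with the standard library so the values can be compared. Reading both as UTC is an assumption made here for illustration; how the join job interprets `TRAN_DATE` depends on the time zone configured for Spark, which this sketch does not model.

```python
from datetime import datetime, timezone

# Constant event timestamp used in the observation settings (epoch seconds).
observation_epoch = 1609472084
observation_utc = datetime.fromtimestamp(observation_epoch, tz=timezone.utc)

# Static TRAN_DATE value written by the pre-processing functions, read as UTC here.
tran_date_utc = datetime.strptime(
    "2021-01-01 11:34:44", "%Y-%m-%d %H:%M:%S"
).replace(tzinfo=timezone.utc)
tran_date_epoch = int(tran_date_utc.timestamp())

print(observation_utc.isoformat())          # 2021-01-01T03:34:44+00:00
print(tran_date_epoch - observation_epoch)  # 28800 seconds, i.e. 8 hours
```

Read as UTC, the two values differ by eight hours, so it is worth confirming the time zone in use when checking that feature timestamps do not fall after the observation time.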

## Download and show the result

Let's use the helper function `get_result_df` to download the result and view it:
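
The helper collects every `*.avro` part file from the download folder before concatenating them. If no file matches, `pd.concat` receives an empty list and raises an error that does not mention the folder. A small guard can make that case clearer. The sketch below shows one way to do it with the standard library; `list_part_files` is a hypothetical helper, and the file names are made up for the demonstration.

```python
import glob
import os
import tempfile
from typing import List

def list_part_files(folder: str, pattern: str = "*.avro") -> List[str]:
    """Return matching files in a stable order, or fail with a clear message."""
    files = sorted(glob.glob(os.path.join(folder, pattern)))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern!r} in {folder!r}")
    return files

# Demonstration on a temporary folder with made-up part files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("part-00001.avro", "part-00000.avro", "_SUCCESS"):
        open(os.path.join(demo_dir, name), "w").close()
    part_names = [os.path.basename(p) for p in list_part_files(demo_dir)]

print(part_names)  # ['part-00000.avro', 'part-00001.avro']
```

Sorting the paths keeps the row order of the concatenated frame reproducible between runs.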

In [None]:
import shutil
def get_result_df(client: FeathrClient) -> pd.DataFrame:
    """Download the job result dataset from cloud as a Pandas dataframe."""
    res_url = client.get_job_result_uri(block=True, timeout_sec=600)
    tmp_dir = "../output_bureau.avro"
    shutil.rmtree(tmp_dir, ignore_errors=True)
    client.feathr_spark_launcher.download_result(result_path=res_url, local_folder=tmp_dir)
    dataframe_list = []
    # assuming the results are in avro format
    for file in glob.glob(os.path.join(tmp_dir, '*.avro')):
        dataframe_list.append(pdx.read_avro(file))
    vertical_concat_df = pd.concat(dataframe_list, axis=0)
    return vertical_concat_df

df_res = get_result_df(client)

In [None]:

with pd.option_context('display.max_columns', 50, 'display.max_rows', 1000):
   print(df_res.columns.values.tolist())
   print(df_res[[
      "f_SK_ID_CURR",
      "f_SK_ID_BUREAU",
      "f_CREDIT_ACTIVE",
      "f_CREDIT_CURRENCY",
      "f_DAYS_CREDIT",
      "f_CREDIT_DAY_OVERDUE",
      "f_DAYS_CREDIT_ENDDATE",
      "f_DAYS_ENDDATE_FACT",
      "f_AMT_CREDIT_MAX_OVERDUE",
      "f_CNT_CREDIT_PROLONG",
      "f_AMT_CREDIT_SUM",
      "f_AMT_CREDIT_SUM_DEBT",
      "f_AMT_CREDIT_SUM_LIMIT",
      "f_AMT_CREDIT_SUM_OVERDUE",
      "f_CREDIT_TYPE",
      "f_DAYS_CREDIT_UPDATE",
      "f_AMT_ANNUITY",

      "f_SK_ID_BUREAU",
      "f_MONTHS_BALANCE",
      "f_STATUS",
      "f_CREDIT_STATUS_EMA_AVG",

      "f_NUM_CREDIT_COUNT",
      "f_DEBT_CREDIT_RATIO"
   ]])

In [None]:
backfill_time = BackfillTime(start=datetime(2020, 5, 20), 
                             end=datetime(2020, 5, 20), 
                             step=timedelta(days=1))
redisSink = RedisSink(table_name="homeCreditDemoFeature")
settings = MaterializationSettings(name="homeCreditFeatureSetting",
                                   backfill_time=backfill_time,
                                   sinks=[redisSink],
                                   feature_names=["f_NUM_CREDIT_COUNT"])

client.materialize_features(settings)
client.wait_job_to_finish(timeout_sec=500)
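
The backfill window above is described by a start, an end, and a step. As a rough sketch, the cut-off dates implied by such a window can be enumerated as follows. Treating the range as inclusive of both ends is an assumption of this sketch; check the Feathr documentation for the exact semantics of `BackfillTime`. The helper `backfill_dates` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import List

def backfill_dates(start: datetime, end: datetime, step: timedelta) -> List[datetime]:
    """Enumerate dates from start to end (inclusive) in increments of step."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += step
    return dates

# Same window as in the cell above: a single day.
single_day = backfill_dates(datetime(2020, 5, 20), datetime(2020, 5, 20), timedelta(days=1))
print(single_day)  # [datetime.datetime(2020, 5, 20, 0, 0)]

# A three-day window for comparison.
three_days = backfill_dates(datetime(2020, 5, 20), datetime(2020, 5, 22), timedelta(days=1))
print(len(three_days))  # 3
```

With identical start and end values, the window covers exactly one date.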

In [None]:
client.get_online_features('homeCreditDemoFeature', 
                           '6841943', 
                           ['f_NUM_CREDIT_COUNT'])