## <span style='color:#ff5f27'> 📝 Imports </span>

In [1]:
from math import radians

import pandas as pd
import numpy as np

from features import transactions_fraud, window_aggs

# Mute warnings
import warnings
warnings.filterwarnings("ignore")

## <span style="color:#ff5f27;"> 💽 Loading the Data </span>

The data you will use comes from three different CSV files:

- `credit_cards.csv`: credit card information such as expiration date and provider.
- `transactions.csv`: transaction information such as timestamp, location, and the amount. Importantly, the binary `fraud_label` variable tells us whether a transaction was fraudulent or not.
- `profiles.csv`: credit card user information such as birthdate and city of residence.

You can conceptualize these CSV files as originating from separate data sources.
**All three files have a credit card number column `cc_num` in common, which you can use for joins.**
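
As a minimal sketch of such a join, the toy frames below (made-up values, not rows from the real files) are merged on `cc_num` with `DataFrame.merge`. The `validate="many_to_one"` argument is an optional check that each card appears at most once on the right-hand side.

```python
import pandas as pd

# Toy data for illustration only; these values are made up.
toy_transactions = pd.DataFrame({
    "cc_num": [1111, 2222, 1111],
    "amount": [10.0, 25.5, 7.25],
})
toy_cards = pd.DataFrame({
    "cc_num": [1111, 2222],
    "provider": ["visa", "mastercard"],
})

# A left join keeps every transaction and attaches the card's provider.
toy_joined = toy_transactions.merge(
    toy_cards, on="cc_num", how="left", validate="many_to_one"
)
print(toy_joined)
```

A left join is used here so that no transaction is lost if a card is missing from the other table; the missing fields would then show up as `NaN`.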

Let's go ahead and load the data.

In [2]:
# Read the CSV file containing credit card data
credit_cards_df = pd.read_csv(
    "https://repo.hops.works/master/hopsworks-tutorials/data/card_fraud_data/credit_cards.csv",
)

# Display the first 3 rows of the credit_cards_df DataFrame
credit_cards_df.head(3)

Unnamed: 0,cc_num,provider,expires
0,4796807885357879,visa,05/23
1,4529266636192966,visa,03/22
2,4922690008243953,visa,02/27


In [3]:
# Read the CSV file containing profile data
# Parse the "birthdate" column as datetime
profiles_df = pd.read_csv(
    "https://repo.hops.works/master/hopsworks-tutorials/data/card_fraud_data/profiles.csv", 
    parse_dates=["birthdate"],
)

# Display the first 3 rows of the profiles_df
profiles_df.head(3)

Unnamed: 0,name,sex,mail,birthdate,City,Country,cc_num
0,Catherine Zimmerman,F,valenciajason@hotmail.com,1988-09-20,Bryn Mawr-Skyway,US,4796807885357879
1,Michael Williams,M,brettkennedy@yahoo.com,1977-03-01,Gates-North Gates,US,4529266636192966
2,Jessica Krueger,F,marthacruz@hotmail.com,1947-09-10,Greenfield,US,4922690008243953


In [4]:
# Read the CSV file containing transaction data
# Parse the "datetime" column as datetime
trans_df = pd.read_csv(
    "https://repo.hops.works/master/hopsworks-tutorials/data/card_fraud_data/transactions.csv", 
    parse_dates=["datetime"],
)

# Display the first 3 rows of the trans_df
trans_df.head(3)

Unnamed: 0,tid,datetime,cc_num,category,amount,latitude,longitude,city,country,fraud_label
0,11df919988c134d97bbff2678eb68e22,2022-01-01 00:00:24,4473593503484549,Health/Beauty,62.95,42.30865,-83.48216,Canton,US,0
1,dd0b2d6d4266ccd3bf05bc2ea91cf180,2022-01-01 00:00:56,4272465718946864,Grocery,85.45,33.52253,-117.70755,Laguna Niguel,US,0
2,e627f5d9a9739833bd52d2da51761fc3,2022-01-01 00:02:32,4104216579248948,Domestic Transport,21.63,37.60876,-77.37331,Mechanicsville,US,0


---

## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

Fraudulent transactions can differ from regular ones in many ways. A typical red flag is, for instance, an unusually large transaction volume or frequency within a few hours. Fraudsters may also target elderly people in particular. To make these patterns easier for a model to learn, you will create additional features based on them. In particular, you will create two types of features:
1. **Features that aggregate data from different data sources**. This could for instance be the age of a customer at the time of a transaction, which combines the `birthdate` feature from `profiles.csv` with the `datetime` feature from `transactions.csv`.
2. **Features that aggregate data from multiple time steps**. An example of this could be the transaction frequency of a credit card in the span of a few hours, which is computed using a window function.

Let's start with the first category.

In [None]:
# Convert the 'expires' column in credit_cards_df to a datetime object (last day of the expiry month)
credit_cards_df['expires'] = pd.to_datetime(credit_cards_df['expires'], format='%m/%y') + pd.offsets.MonthEnd(0)

# Merge transactions with profiles on 'cc_num' to get birthdate
trans_df = trans_df.merge(profiles_df[['cc_num', 'birthdate']], on='cc_num', how='left')

# Compute the age at transaction
trans_df['age_at_transaction'] = (trans_df['datetime'] - trans_df['birthdate']).dt.days // 365

# Merge transactions with credit cards on 'cc_num' to get the expiration date
trans_df = trans_df.merge(credit_cards_df[['cc_num', 'expires']], on='cc_num', how='left')

# Compute days until card expires
trans_df['days_until_card_expires'] = (trans_df['expires'] - trans_df['datetime']).dt.days

# Display the first 3 rows
trans_df[['age_at_transaction', 'days_until_card_expires']].head(3)
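
Dividing the day count by 365 ignores leap years, so the result can be off by one year close to a birthday. If exact ages matter, a calendar-based calculation is one alternative. The sketch below uses made-up rows, and `exact_age` is a hypothetical helper name that is not part of the `features` module.

```python
import pandas as pd

def exact_age(event_time: pd.Series, birthdate: pd.Series) -> pd.Series:
    """Age in whole years on the day of each event (hypothetical helper)."""
    years = event_time.dt.year - birthdate.dt.year
    # Subtract one year when the birthday has not yet occurred in the event year.
    before_birthday = (
        (event_time.dt.month < birthdate.dt.month)
        | (
            (event_time.dt.month == birthdate.dt.month)
            & (event_time.dt.day < birthdate.dt.day)
        )
    )
    return years - before_birthday.astype(int)

# Toy rows: one day before and exactly on the 34th birthday.
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2022-09-19 12:00:00", "2022-09-20 00:00:24"]),
    "birthdate": pd.to_datetime(["1988-09-20", "1988-09-20"]),
})
toy["age_exact"] = exact_age(toy["datetime"], toy["birthdate"])
toy["age_approx"] = (toy["datetime"] - toy["birthdate"]).dt.days // 365
print(toy)
```

For the first toy row the two columns disagree: the approximate version already reports 34 on the day before the birthday. For a fraud model this difference is probably negligible, which may be why the simpler expression is used above.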


In [None]:
# # Compute age at transaction.
# trans_df = transactions_fraud.get_age_at_transaction(
#     trans_df, 
#     profiles_df,
# )

# # Compute days until card expires.
# trans_df = transactions_fraud.get_days_until_card_expires(
#     trans_df, 
#     credit_cards_df,
# )

# # Display the first 3 rows
# trans_df[["age_at_transaction", "days_until_card_expires"]].head(3)

In [7]:
# Drop duplicate rows in the trans_df DataFrame based on the "datetime" column
trans_df = trans_df.drop_duplicates(["datetime"])
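
Note that de-duplicating on `datetime` alone also drops transactions from different cards that happen to share a timestamp. Whether that is intended depends on the data. The toy example below (made-up values) contrasts it with de-duplicating on the pair `cc_num` and `datetime`.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "cc_num": [1111, 2222, 1111],
    "datetime": pd.to_datetime([
        "2022-01-01 00:00:24",
        "2022-01-01 00:00:24",  # same timestamp, different card
        "2022-01-01 00:05:00",
    ]),
})

# Keeps only the first row for each distinct timestamp.
by_time = toy.drop_duplicates(["datetime"])

# Keeps the first row for each distinct (card, timestamp) pair.
by_card_and_time = toy.drop_duplicates(["cc_num", "datetime"])

print(len(by_time), len(by_card_and_time))
```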

Next, you will create features that for each credit card aggregate data from multiple time steps.

You will start by computing the distance between consecutive transactions; let's call it `loc_delta`.
Here you will use the [Haversine distance](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html?highlight=haversine#sklearn.metrics.pairwise.haversine_distances) to quantify the distance between two longitude and latitude coordinates.

In [8]:
# Sort the trans_df DataFrame based on the "datetime" column in ascending order
trans_df.sort_values("datetime", inplace=True)

# Apply the radians function to the "longitude" and "latitude" columns in the trans_df DataFrame
# This is a common preprocessing step for geographical data
trans_df[["longitude", "latitude"]] = trans_df[["longitude", "latitude"]].applymap(radians)

# Create a new column "loc_delta" in trans_df representing the haversine distance between consecutive transactions for each credit card
trans_df["loc_delta"] = trans_df.groupby("cc_num")\
    .apply(lambda x: transactions_fraud.haversine(x["longitude"], x["latitude"]))\
    .reset_index(level=0, drop=True)\
    .fillna(0)
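
The `transactions_fraud.haversine` helper comes from the `features` module, and its source is not shown in this notebook. As a rough sketch of what such a function could look like, the version below computes the Haversine central angle between each point and the previous one, assuming the coordinates are already in radians. The name `haversine_sketch` and the choice to return the angle in radians are assumptions.

```python
import numpy as np
import pandas as pd

def haversine_sketch(long: pd.Series, lat: pd.Series) -> pd.Series:
    """Central angle (radians) between consecutive points; the first row is NaN."""
    prev_long = long.shift(1)
    prev_lat = lat.shift(1)
    a = (
        np.sin((lat - prev_lat) / 2) ** 2
        + np.cos(lat) * np.cos(prev_lat) * np.sin((long - prev_long) / 2) ** 2
    )
    return 2 * np.arcsin(np.sqrt(a))

# Toy path: two points on the equator a quarter turn apart, then no movement.
toy = pd.DataFrame({
    "longitude": np.radians([0.0, 90.0, 90.0]),
    "latitude": np.radians([0.0, 0.0, 0.0]),
})
toy["loc_delta"] = haversine_sketch(toy["longitude"], toy["latitude"]).fillna(0)
print(toy["loc_delta"])
```

Multiplying the angle by the Earth's mean radius (about 6371 km) would turn it into an approximate distance in kilometres.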

Next, let's compute windowed aggregates. Here you will use 1-hour windows, but feel free to experiment with different window lengths by setting `window_len` below to a value of your choice.

In [9]:
# Specify the window length as "1h" (1 hour)
window_len = "1h"

# Use the window_aggs.get_window_aggs_df function to calculate aggregated features for each window in trans_df
window_aggs_df = window_aggs.get_window_aggs_df(window_len, trans_df)

# Display the last few rows of the resulting window_aggs_df DataFrame to inspect the aggregated features
window_aggs_df.tail()

Unnamed: 0,trans_volume_mstd,trans_volume_mavg,trans_freq,loc_delta_mavg,cc_num,datetime
106015,0.0,73.08,1.0,0.045635,4032019521897961,2022-03-24 10:57:02
106016,0.0,287.33,1.0,0.045846,4032019521897961,2022-03-28 11:57:02
106017,0.0,53.88,1.0,0.00012,4032019521897961,2022-04-01 12:57:02
106018,0.0,279.73,1.0,0.045928,4032019521897961,2022-04-05 13:57:02
106019,0.0,73.66,1.0,0.045974,4032019521897961,2022-04-09 14:57:02
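
The implementation of `window_aggs.get_window_aggs_df` is not shown here. As a hedged sketch of the general technique, a per-card, time-based rolling window in pandas could look like the following. The toy data is made up, and the real helper may compute its columns differently.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "cc_num": [1111, 1111, 1111, 2222],
    "datetime": pd.to_datetime([
        "2022-01-01 00:00:00",
        "2022-01-01 00:30:00",
        "2022-01-01 02:00:00",
        "2022-01-01 00:10:00",
    ]),
    "amount": [10.0, 30.0, 50.0, 5.0],
})

# Time-based rolling windows need a sorted datetime index within each card.
rolled = (
    toy.sort_values("datetime")
    .set_index("datetime")
    .groupby("cc_num")["amount"]
    .rolling("1h")
)

# Each row aggregates the card's transactions from the preceding hour,
# including the current transaction.
toy_aggs = pd.DataFrame({
    "trans_volume_mavg": rolled.mean(),
    "trans_freq": rolled.count(),
}).reset_index()
print(toy_aggs)
```

In this toy example the second transaction of card `1111` falls within one hour of the first, so its window contains two transactions; the third is more than an hour later and stands alone.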


### <span style="color:#ff5f27;">⚙️ Convert date time object to unix epoch in milliseconds </span>

In [10]:
# Convert the "datetime" values in the trans_df DataFrame to milliseconds since the epoch
# (datetime64[ns] values are nanoseconds, so integer-divide by 10**6)
trans_df.datetime = trans_df.datetime.values.astype(np.int64) // 10 ** 6

# Convert the "datetime" values in the window_aggs_df DataFrame to milliseconds since the epoch
window_aggs_df.datetime = window_aggs_df.datetime.values.astype(np.int64) // 10 ** 6
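
The division by `10 ** 6` relies on the column having nanosecond resolution (`datetime64[ns]`). The small sketch below makes that assumption explicit with a cast and checks the result with a round trip; the timestamp is the first one from the transactions preview above.

```python
import numpy as np
import pandas as pd

# Cast explicitly so the nanosecond assumption holds regardless of pandas version.
ts = pd.Series(pd.to_datetime(["2022-01-01 00:00:24"])).astype("datetime64[ns]")

# Nanoseconds since the epoch, integer-divided by 10**6, gives milliseconds.
epoch_ms = ts.values.astype(np.int64) // 10 ** 6

# Converting back with unit="ms" recovers the original timestamp.
round_trip = pd.to_datetime(epoch_ms, unit="ms")
print(epoch_ms[0], round_trip[0])
```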

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

### <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

A [feature group](https://docs.hopsworks.ai/3.0/concepts/fs/feature_group/fg_overview/) can be seen as a collection of conceptually related features. In this case, you will create a feature group for the transaction data and a feature group for the windowed aggregations on the transaction data. Both will have `cc_num` as primary key, which will allow you to join them when creating a dataset in the next tutorial.

Feature groups can also be used to define a namespace for features. For instance, in a real-life setting you would likely want to experiment with different window lengths. In that case, you can create feature groups with identical schema for each window length. 
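
As a small illustration of that idea, the sketch below builds one feature group name per window length, following the naming pattern used for the windowed feature group later in this notebook. The list of window lengths is a hypothetical choice.

```python
# Hypothetical set of window lengths to experiment with.
window_lens = ["1h", "4h", "12h"]

# One feature group name per window length; the schema would be identical.
fg_names = {w: f"transactions_{w}_aggs_fraud_batch_fg" for w in window_lens}

for window, name in fg_names.items():
    print(window, "->", name)
```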

Before you can create a feature group you need to connect to Hopsworks feature store.

In [11]:
import hopsworks
project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/148
Connected. Call `.close()` to terminate connection gracefully.


To create a feature group, you need to give it a name and specify a primary key. It is also good practice to provide a description of its contents and a version number; if no version is defined, it is incremented automatically, starting from `1`.

In [12]:
# Get or create the 'transactions_fraud_batch_fg' feature group
trans_fg = fs.get_or_create_feature_group(
    name="transactions_fraud_batch_fg",
    version=1,
    description="Transaction data",
    primary_key=["cc_num"],
    event_time="datetime",
)

A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

At this point, you have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent you need to populate it with its associated data using the `insert` function.
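
Since the schema is derived from the DataFrame you insert, it can be worth checking the key columns first. The sketch below is plain pandas and does not call any Hopsworks API; `check_keys` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def check_keys(df: pd.DataFrame, primary_key: list[str], event_time: str) -> None:
    """Raise ValueError if key columns are missing or contain nulls (hypothetical helper)."""
    required = [*primary_key, event_time]
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    null_counts = df[required].isna().sum()
    if null_counts.any():
        raise ValueError(f"Null values in key columns:\n{null_counts[null_counts > 0]}")

# Toy frame with the same key columns as the feature group definition above.
toy = pd.DataFrame({
    "cc_num": [1111, 2222],
    "datetime": [1640995224000, 1640995256000],
})
check_keys(toy, primary_key=["cc_num"], event_time="datetime")  # passes silently
```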

In [13]:
# Insert data into feature group
trans_fg.insert(trans_df, wait=True)


Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/148/fs/90/fg/1305931


Uploading Dataframe: 0.00% |          | Rows 0/105092 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: transactions_fraud_batch_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/148/jobs/named/transactions_fraud_batch_fg_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x17d09fa30>, None)

In [14]:
# # Update feature descriptions
# feature_descriptions = [
#     {"name": "tid", "description": "Transaction id"},
#     {"name": "datetime", "description": "Transaction time"},
#     {"name": "cc_num", "description": "Number of the credit card performing the transaction"},
#     {"name": "category", "description": "Expense category"},
#     {"name": "amount", "description": "Dollar amount of the transaction"},
#     {"name": "latitude", "description": "Transaction location latitude"},
#     {"name": "longitude", "description": "Transaction location longitude"},
#     {"name": "city", "description": "City in which the transaction was made"},
#     {"name": "country", "description": "Country in which the transaction was made"},
#     {"name": "fraud_label", "description": "Whether the transaction was fraudulent or not"},
#     {"name": "age_at_transaction", "description": "Age of the card holder when the transaction was made"},
#     {"name": "days_until_card_expires", "description": "Card validity days left when the transaction was made"},
#     {"name": "loc_delta", "description": "Haversine distance between this transaction location and the previous transaction location from the same card"},
# ]

# for desc in feature_descriptions: 
#     trans_fg.update_feature_description(desc["name"], desc["description"])

When the feature group is created, a URL linking directly to it is printed; there you can explore some aspects of your newly created feature group.


You can now do the same for the feature group containing the windowed aggregations.

In [15]:
# Get or create the feature group for the windowed aggregations (named after the window length)
window_aggs_fg = fs.get_or_create_feature_group(
    name=f"transactions_{window_len}_aggs_fraud_batch_fg",
    version=1,
    description=f"Aggregate transaction data over {window_len} windows.",
    primary_key=["cc_num"],
    event_time="datetime",
)

In [16]:
# Insert data into feature group
window_aggs_fg.insert(window_aggs_df, wait=True)


Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/148/fs/90/fg/1306954


Uploading Dataframe: 0.00% |          | Rows 0/105092 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: transactions_1h_aggs_fraud_batch_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/148/jobs/named/transactions_1h_aggs_fraud_batch_fg_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x17d391ff0>, None)

In [17]:
# Update feature descriptions
feature_descriptions = [
    {"name": "datetime", "description": "Transaction time"},
    {"name": "cc_num", "description": "Number of the credit card performing the transaction"},
    {"name": "loc_delta_mavg", "description": "Moving average of location difference between consecutive transactions from the same card"},
    {"name": "trans_freq", "description": "Moving average of transaction frequency from the same card"},
    {"name": "trans_volume_mavg", "description": "Moving average of transaction volume from the same card"},
    {"name": "trans_volume_mstd", "description": "Moving standard deviation of transaction volume from the same card"},
]

for desc in feature_descriptions: 
    window_aggs_fg.update_feature_description(desc["name"], desc["description"])