# Wallet Feature Engineering

The purpose of this notebook is to create features from OpenSea `Asset Events` time series
in order to:
- model and predict NFT fear of missing out (FOMO) behavior
- classify types of people participating in NFT exchanges

# Read Data

__Description of the dataset:__ Asset events ("events") were extracted by
 [莊惟翔](https://github.com/Fred-Zhuang)
via the https://api.opensea.io/api/v1/assets endpoint.
This dataset contains only __successful__ events that occurred on the NFTs
and were tracked by OpenSea.

1. A list of `token_seller_address` and `token_owner_address` values with event
    timestamps between 2022-05-03 and 2022-05-18 was used as the seed to
    extract all events involving these addresses
    (see `os_successful_events.feather`).
1. Only events involving one of the 21 selected NFT collections were considered
    for this study.
1. The final list of events was then used for feature engineering.
1. This list spans 406 days, between 2021-04-30 and 2022-06-11.

*The `event_type` indicates the types of events (transfer, successful auction, etc)
and the results are sorted by `event_timestamp`
(see [OpenSea API documentation](https://docs.opensea.io/reference/getting-assets)).

In [None]:
import os
import time
import numpy as np
import numpy_financial as npf
import pandas as pd
import seaborn as sns

data_dir = os.path.join(os.getcwd(), 'data')

start_time = time.time()
wallets = pd.read_feather(os.path.join(data_dir, 'NFT20_successful_events_new_有winneraddress.feather'))
total_time = time.time() - start_time
print("Total seconds to load:", total_time)
wallets.info(show_counts=True)

In [None]:
wallets.rename({'wallet_address_input': 'collection_contract_address'}, axis=1, inplace=True)

_\* renaming the column to match API parameter_

In [None]:
wallets.drop(['starting_price', 'ending_price',
              'approved_account', 'bid_amount', 'custom_event_name'],
             axis=1, inplace=True)
print("Memory usage:", wallets.memory_usage().sum() / 1024**2, "MB")

In [None]:
print("Most recent event:", max(wallets.event_timestamp))

In [None]:
print("Earliest event:", min(wallets.event_timestamp))

In [None]:
print("Length of this time series dataset:", max(wallets.event_timestamp) - min(wallets.event_timestamp))

## Reshape the dataframe

In [None]:
buy = wallets.set_index('winner_account_address').rename_axis('user_account_address')
buy['event_type'] = 'buy'
sell = wallets.set_index('token_seller_address').rename_axis('user_account_address')
sell['event_type'] = 'sell'

In [None]:
wallets = pd.concat([buy, sell]).reset_index().drop(['winner_account_address', 'token_seller_address'], axis=1)
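The reshaping can be illustrated on a toy frame. This is a minimal sketch: the addresses and token IDs are made up, and only the columns needed for the illustration are kept. Each sale ends up as two rows, a `buy` for the winner and a `sell` for the seller.

```python
import pandas as pd

# Toy data: two sales; addresses and token IDs are invented for illustration.
toy = pd.DataFrame({
    "winner_account_address": ["0xaaa", "0xbbb"],
    "token_seller_address": ["0xbbb", "0xccc"],
    "token_id": [1, 1],
})

# set_index moves one address column into the index, so each half keeps
# only the counterparty's column; both are dropped after concatenation.
buy = toy.set_index("winner_account_address").rename_axis("user_account_address")
buy["event_type"] = "buy"
sell = toy.set_index("token_seller_address").rename_axis("user_account_address")
sell["event_type"] = "sell"

long_df = (pd.concat([buy, sell])
             .reset_index()
             .drop(["winner_account_address", "token_seller_address"], axis=1))
print(long_df)
```

Because every sale is duplicated, row counts after this step are events per wallet, not distinct sales.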

In [None]:
print("Total number of wallet addresses used to retrieve data from OpenSea:", f'{wallets.user_account_address.nunique(): ,}')

## NFT collections in this dataset

In [None]:
wallets.groupby(by='collection_slug') \
    .agg({'event_timestamp': ['min', 'max', lambda x: (max(x) - min(x)).days],
          'event_type': 'size',
          'user_account_address': 'nunique'}) 
    #.sort_values(by='event_timestamp', ascending=False)

_\* `lambda_0` is the length of the collection history in days, the size of `event_type` is the number of successful sales, and nunique of `user_account_address` is the number of unique wallets._

Do we left-censor, right-censor, or both when treating the time series?

## Data distribution (WIP)

... this project selects only transactions involving the top-20 NFT collections

## Questions

- What is asset_event `created_date`, and how does it differ from `event_timestamp`?
  Would asset_contract created_date be more useful than asset_event created_date?
- Why are `starting_price` and `ending_price` always _null_?
- What can we do with `token_owner_address`?
- What can we do with `num_sales`? 
- `contract_address` appears to be the address of the smart contract used to execute the sales, i.e. exchange. How can we use this?

# Impute data

## `buy` vs. `sell` event_type

In [None]:
wallets.groupby(by=['collection_slug', 'event_type']) \
    .agg({'event_type': 'size', 'user_account_address': 'nunique'})

## `duration`
The time between the listing of the token and the completion of the sale.

_What to do when listing_time is `NaT`_? << We will fill duration NA with `pd.Timedelta(0)`, i.e. 0

In [None]:
wallets.query('duration.isna()').loc[:,["event_timestamp", "event_type", "listing_time", "payment_token_symbol", "deal_price"]].sort_values('listing_time')

In [None]:
# Assign with brackets so the column is (re)created reliably, and fill in one step
wallets["duration"] = (wallets.event_timestamp - wallets.listing_time).fillna(pd.Timedelta(0))
wallets.groupby(["user_account_address", "event_type"])["duration"].mean().tail(10)
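As a small illustration of the imputation above (toy timestamps, made up for this sketch), a missing `listing_time` yields a `NaT` duration that is then replaced by a zero-length `Timedelta`:

```python
import pandas as pd

# Toy data: the second event has no listing time.
toy = pd.DataFrame({
    "event_timestamp": pd.to_datetime(["2022-05-03 12:00", "2022-05-04 00:00"]),
    "listing_time": pd.to_datetime(["2022-05-01 12:00", None]),
})

# Subtracting NaT gives NaT; fillna replaces it with a zero duration.
toy["duration"] = (toy["event_timestamp"] - toy["listing_time"]).fillna(pd.Timedelta(0))
print(toy["duration"])
```

One consequence of this choice is that zero-filled rows pull a wallet's mean duration down, so it may be worth comparing against a version that leaves them as missing.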

## `deal_price`, `deal_price_usd` and payment token attributes

payment token attributes: {symbol, decimals, and usdprice}

In [None]:
print("Number of observations missing token symbol, decimals or usdprice:",
      sum(wallets.payment_token_symbol.isna()), 
      sum(wallets.payment_token_decimals.isna()),
      sum(wallets.payment_token_usdprice.isna()))

In [None]:
wallets[wallets.payment_token_symbol.isna() |
        wallets.payment_token_decimals.isna() |
        wallets.payment_token_usdprice.isna()].shape[0]

This is the total number of records missing the payment token symbol, the token decimals (i.e. the deal price scaling factor),
or the token-to-USD exchange rate. **We will ignore these records for now.**

In [None]:
wallets["deal_price"] = wallets.deal_price / 10 ** wallets.payment_token_decimals
wallets.drop("payment_token_decimals", axis=1, inplace=True)
wallets.deal_price.agg(["min", "mean", "max"])

In [None]:
wallets["deal_price_usd"] = wallets.deal_price * wallets.payment_token_usdprice
wallets.deal_price_usd.agg(["min", "mean", "max"])
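To make the scaling concrete, here is a hedged sketch with invented amounts and exchange rates. It assumes the raw `deal_price` is expressed in the token's smallest unit, so dividing by `10 ** decimals` recovers whole tokens (18 decimals is what ETH uses, 6 is what USDC uses).

```python
import pandas as pd

# Toy data: raw amounts in base units; the USD rates are made up.
toy = pd.DataFrame({
    "deal_price": [1.5e18, 2_500_000.0],
    "payment_token_decimals": [18, 6],
    "payment_token_usdprice": [2000.0, 1.0],
})

# Using a float base (10.0) keeps the power in floating point.
toy["deal_price"] = toy["deal_price"] / 10.0 ** toy["payment_token_decimals"]
toy["deal_price_usd"] = toy["deal_price"] * toy["payment_token_usdprice"]
print(toy[["deal_price", "deal_price_usd"]])
```

Rows with a missing `payment_token_decimals` or `payment_token_usdprice` propagate `NaN` through both steps, which is why those records drop out of later USD aggregates.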

## `is_private` sales

_Do we assume NaN is __not__ private, i.e. 0?_

In [None]:
wallets.is_private.value_counts(dropna=False)

## `deal_price` is 0 or NaN

Do we keep these rows?

In [None]:
wallets[wallets.deal_price == 0]

In [None]:
wallets[wallets.deal_price.isna()]

## Payment tokens metadata are missing

In [None]:
print("number of rows missing payment token data:",
      sum(wallets.payment_token_symbol.isna()))

## `quantity`

Let purchased quantity be negative and sold be positive.

In [None]:
wallets.loc[:,'quantity'] = np.where(wallets.event_type == 'buy', -wallets.quantity, wallets.quantity)

In [None]:
wallets.quantity.agg(["min", "median", "mean", "max"])

In [None]:
abs(wallets.quantity).quantile(q=[x / 10000 for x in range(0, 10000)]).tail(20)

99.9% of trades involve a single NFT.

## WIP quantity and deal_price

Both `quantity` and `deal_price`, however, are heavily right-skewed, i.e. mean > median.

In [None]:
collection_stats = wallets.loc[~wallets.quantity.isna()] \
    .groupby('collection_slug')[['quantity', 'deal_price_usd']].agg(['min', 'median', 'mean', "max", "sum"])
#collection_stats.loc['cool-cats-nft']
collection_stats

## `num_sales`

_How does this field differ from `quantity`?_ Is it the number of tokens minted, or the number of times this token has changed hands?

For now we assume it is the number of times a particular NFT has been sold (between two parties).

In [None]:
_ = wallets.num_sales.hist()

In [None]:
wallets.num_sales.agg(["min", "mean", "median", "max"]) 

In [None]:
sum(wallets.num_sales.isna())

In [None]:
wallets.num_sales.fillna(2, inplace=True)

## `cashflow` and `cashflow_usd` as a simple method to calculate profit

In [None]:
wallets["cashflow"] = np.where(wallets.event_type == "buy",
                                -wallets.deal_price,
                                 wallets.deal_price)
wallets["cashflow_usd"] = np.where(wallets.event_type == "buy",
                                -wallets.deal_price_usd,
                                 wallets.deal_price_usd)
wallets.loc[:, ["event_type", "cashflow", "cashflow_usd"]]

Example: January 2022 cash flow

In [None]:
wallets.set_index('event_timestamp') \
    .loc['2022-01'] \
    .groupby(['user_account_address', 'payment_token_symbol']) \
    [['cashflow', 'cashflow_usd']].sum() \
    .sort_values(by=['cashflow_usd', 'cashflow'], ascending=False)

To calculate cash flow by collection: `groupby(['user_account_address', 'payment_token_symbol', 'collection_slug'])`
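The sign convention can be checked on a toy frame (wallets, prices, and the single collection are illustrative only): purchases count as negative cash flow and sales as positive, so a grouped sum gives the net cash position.

```python
import numpy as np
import pandas as pd

# Toy data: 0xaaa buys for 100 and sells for 150; 0xbbb only buys.
toy = pd.DataFrame({
    "user_account_address": ["0xaaa", "0xaaa", "0xbbb"],
    "collection_slug": ["cool-cats-nft"] * 3,
    "event_type": ["buy", "sell", "buy"],
    "deal_price_usd": [100.0, 150.0, 150.0],
})

# Buys are outflows (negative), sells are inflows (positive).
toy["cashflow_usd"] = np.where(toy["event_type"] == "buy",
                               -toy["deal_price_usd"],
                               toy["deal_price_usd"])

net = toy.groupby(["user_account_address", "collection_slug"])["cashflow_usd"].sum()
print(net)
```

A wallet that has bought but not yet sold shows a negative net, which reflects money spent rather than a realized loss.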

Big Trader?

In [None]:
wallets.loc[wallets.user_account_address == "0x17082a8fbae3c10d73a361f218ae77bafb62bf4d"]

# Create features

Let `wallets_attr` be a collection of pandas Series, each indexed by `user_account_address` and holding one wallet attribute, and

`wallets_slug_attr` be a collection of pandas Series, each indexed by `user_account_address` and `collection_slug` and holding one attribute.

In [None]:
grp_wallets = wallets.groupby('user_account_address')
grp_wallets_slug = wallets.groupby(['user_account_address', 'collection_slug'])

## wallet `age` in days

This is not the actual age of the wallet. It is the length of the trading history in days, inferred by subtracting the timestamp of the oldest event from that of the most recent event.

In [None]:
X = grp_wallets['event_timestamp'].agg([lambda x: (max(x) - min(x)).days])
X.columns = ['age']
wallets_attr = [X]

In [None]:
X[X.age <= 0].sort_values('age')

_Do we ignore wallets whose age is 0 or over x months (inactive wallets)?_

## wallet `age` in days by `collection_slug`

In [None]:
X = grp_wallets_slug['event_timestamp'].agg([lambda x: (max(x) - min(x)).days])
X.columns = ['age']
wallets_slug_attr = [X]

In [None]:
X[X.age <= 0].sort_values('age')

_Do we ignore wallets whose age is 0 or over x months (inactive wallets)?_

## `mean_duration` average number of days to complete sales (or purchase)

This is defined as the average difference in days between the time a token is listed and the time the sale is completed.

In [None]:
X = grp_wallets['duration'].mean().dt.days
X.name = 'mean_duration'
wallets_attr.append(X)

## `mean_duration` by collection

This is defined as the average number of days to complete a sale (or purchase), by collection.

In [None]:
X = grp_wallets_slug['duration'].mean().dt.days
X.name = 'mean_duration'
wallets_slug_attr.append(X)

## `num_nft` the number of NFTs owned since the beginning

In [None]:
X = grp_wallets['quantity'].agg(lambda x : sum(abs(x)))
X.name = 'num_nft'
wallets_attr.append(X)

## `num_collection` the number of collections owned since the beginning

In [None]:
X = grp_wallets['collection_slug'].nunique()
X.name = 'num_collection'
wallets_attr.append(X)

## `num_nft_onhand` the number of NFTs currently on hand

per wallet


In [None]:
# Buys are stored as negative quantities, so negate the net to get holdings
X = -grp_wallets['quantity'].sum()
X.name = 'num_nft_onhand'
wallets_attr.append(X)

In [None]:
pd.concat(wallets_attr, axis=1).query('num_nft_onhand < 0')

per wallet by collection

In [None]:
# Buys are stored as negative quantities, so negate the net to get holdings
X = -grp_wallets_slug['quantity'].sum()
X.name = 'num_nft_onhand'
wallets_slug_attr.append(X)

In [None]:
pd.concat(wallets_slug_attr, axis=1).query('num_nft_onhand < 0')

_How do we adjust for the data anomaly where `num_nft_onhand < 0`? Should we drop these rows, or should we impute them to `0`?_

## `num_collect_onhand` the number of collections on-hand

A collection counts as on hand only when the number of its NFTs held by the wallet is greater than 0.

In [None]:
# Buys are stored as negative quantities, so negate the net to get holdings
X = -grp_wallets_slug['quantity'].sum()
X = X[X > 0].reset_index().groupby('user_account_address').size()
X.name = 'num_collect_onhand'
wallets_attr.append(X)
X.shape

In [None]:
pd.concat(wallets_attr, axis=1).query('num_collect_onhand.isna()')

_How do we handle wallets where `num_collect_onhand` is missing?_
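One possible answer, offered as an assumption rather than a decision made in this notebook, is to treat a missing count as zero collections on hand. The sketch below uses toy data and takes holdings as purchases minus sales; because buys are stored as negative quantities, the net is negated. Reindexing over all wallets then fills the gaps with `0`.

```python
import pandas as pd

# Toy data: buys are negative, sells are positive (the notebook's convention).
toy = pd.DataFrame({
    "user_account_address": ["0xaaa", "0xaaa", "0xbbb", "0xbbb"],
    "collection_slug": ["a", "b", "a", "a"],
    "quantity": [-1, -1, -1, 1],
})

# Net holdings per wallet and collection: purchases minus sales.
net_held = -toy.groupby(["user_account_address", "collection_slug"])["quantity"].sum()

# Count collections with positive holdings; wallets with none are filled with 0.
num_collect_onhand = (net_held[net_held > 0]
                      .groupby(level="user_account_address").size()
                      .reindex(toy["user_account_address"].unique(), fill_value=0))
print(num_collect_onhand)
```

Filling with `0` keeps every wallet in the feature table; whether that is appropriate depends on how the downstream model treats zeros.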

## `num_event_contracts`

How do we make use of this feature?

In [None]:
grp_wallets[['contract_address', 'collection_slug']].nunique().sort_values('contract_address')

## `cumnum_nft_month` pd.DataFrame.cumsum by Month (TODO)

How would this feature be useful?

## `duration_held`

This duration or hold period is calculated at the individual NFT level. By including both the `collection_slug` and `token_id` as index, it is not necessary to calculate a separate column to hold a compound-key like:
```
X['nft_id'] = X.collection_slug + '-' + X.token_id
```

`min` and `max` of `event_timestamp` are used to account for the same token being bought and sold more than once by the same wallet.

In [None]:
X = wallets.pivot_table(index=['user_account_address', 'collection_slug', 'token_id'],
                  columns='event_type', values='event_timestamp',
                  aggfunc=['min', 'max']) \
    .assign(duration_held=lambda x: (x[('max', 'sell')] - x[('min', 'buy')])) \
    .dropna()
duration_held = X.duration_held
duration_held
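The effect of using the earliest buy and the latest sell can be seen on a toy wallet that trades the same token twice. The data is made up, and the sketch uses `groupby` instead of `pivot_table` purely for brevity; for this case it produces the same span.

```python
import pandas as pd

# Toy data: one wallet buys and sells the same token twice.
toy = pd.DataFrame({
    "user_account_address": ["0xaaa"] * 4,
    "collection_slug": ["cool-cats-nft"] * 4,
    "token_id": [1, 1, 1, 1],
    "event_type": ["buy", "sell", "buy", "sell"],
    "event_timestamp": pd.to_datetime(
        ["2022-01-01", "2022-01-05", "2022-02-01", "2022-02-11"]),
})

keys = ["user_account_address", "collection_slug", "token_id"]
first_buy = toy[toy["event_type"] == "buy"].groupby(keys)["event_timestamp"].min()
last_sell = toy[toy["event_type"] == "sell"].groupby(keys)["event_timestamp"].max()

# Tokens lacking either a buy or a sell produce NaT and are dropped.
duration_held = (last_sell - first_buy).dropna()
print(duration_held)
```

Here the result is 41 days, although the token was actually held for 4 + 10 = 14 days; the span also covers the 27 days between the first sale and the second purchase. This is worth keeping in mind when interpreting hold periods for repeat traders.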

## `mean_duration_held` average time NFT held before selling

The average time between buy and sell.

In [None]:
X = duration_held.groupby('user_account_address').mean()
X.name = 'mean_duration_held'
wallets_attr.append(X)
X

## `endurance_rank`

The percentage rank of the average hold period at the wallet level measured using the entire population included in this analysis.

In [None]:
X = duration_held.groupby('user_account_address') \
    .agg(['size', 'mean'])
X.sort_values(by='mean')

Data anomaly - excluding these wallets

In [None]:
X[X['mean'] < pd.Timedelta(0)]

In [None]:
X = X[X['mean'] > pd.Timedelta(0)] \
    .assign(endurance_rank=lambda x: x['mean'].rank(pct=True)) \
    .sort_values('endurance_rank', ascending=False)
X

\* _How do we account for wallets that made few trades but have long hold times?_
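As a sketch of how the percentile rank behaves, and of one possible way to handle thinly traded wallets, the toy example below ranks mean hold periods expressed in days and then filters on a minimum trade count. The `min_trades` threshold is hypothetical and not part of this notebook.

```python
import pandas as pd

# Toy per-wallet statistics: number of round trips and mean hold period in days.
stats = pd.DataFrame(
    {"size": [1, 4, 6, 1], "mean_days": [1.0, 5.0, 5.0, 30.0]},
    index=["0xaaa", "0xbbb", "0xccc", "0xddd"],
)

# rank(pct=True) divides the (average) rank by the number of rows; ties share a rank.
stats["endurance_rank"] = stats["mean_days"].rank(pct=True)

# Hypothetical filter: only keep wallets with at least `min_trades` round trips.
min_trades = 2
filtered = stats[stats["size"] >= min_trades]
print(stats)
print(filtered)
```

Note that filtering after ranking keeps ranks relative to the full population; ranking after filtering would instead rank only among the retained wallets.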

In [None]:
wallets_attr.append(X.endurance_rank)

## _buy_ vs _sell_ to date

- The total number aka __count__ of transactions and the quantity aka __sum__ of NFT
- The median and the total amount of transactions in USD

In [None]:
X = wallets \
    .loc[:, ["user_account_address", "event_type", "quantity", "deal_price_usd"]] \
    .pivot_table(index="user_account_address",
                 columns="event_type",
                 values=["quantity", "deal_price_usd"],
                 aggfunc={"quantity": ["count", "sum"], "deal_price_usd": ["median", "sum"]},
                 fill_value=0)
X

In [None]:
X[ X[("quantity", "sum", "sell")] > X[("quantity", "count", "sell")] ].loc[:, "quantity"]

Examples of users who _bundled_ multiple NFTs in past transactions, i.e. the total quantity sold (sum, sell) is greater than the number of transactions (count, sell).

In [None]:
idx = X.sort_values(('quantity', 'count', 'sell')).iloc[-10].name

In [None]:
wallets.set_index("user_account_address") \
    .loc[idx,
         ["event_timestamp", "event_type", "quantity"]].sort_values("event_timestamp")

An example showing the transaction history of a single user.

In [None]:
X.columns = ['median_buy_usd', 'median_sell_usd', 'total_buy_usd', 'total_sell_usd', 'buy_xact', 'sell_xact', 'quantity_buy', 'quantity_sell']
X.quantity_buy = X.quantity_buy.abs()
wallets_attr.append(X)

## `irr` internal rate of return (WIP)

In [None]:
_ = wallets.groupby(['user_account_address', pd.Grouper(key='event_timestamp', freq='1M')])['cashflow_usd'].sum()
_

In [None]:
_.groupby('user_account_address').agg(npf.irr)

In [None]:
_['0x000000000004d7463d0f9c77383600bc82d612f5']

In [None]:
_['0x000000000ad266ec3db44bbe580e87f9baa358e6']

In [None]:
_['0x000000070f91b6c56fa08d4f3a26c7fc992b38f4']

In [None]:
_['0x0004ff7e7217dc672874fece2c7588581e97b1a7']

In [None]:
_['0x000cd27f10dffac73201258eaa3925c0452051a0']

In [None]:
_['0xffffc32855b2620c86f413065af8c58ec68d474d']
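For reference, the internal rate of return is the rate `r` at which the net present value of the periodic cash flows, `sum(cf_t / (1 + r) ** t)`, equals zero. The sketch below is a plain-Python bisection written only to illustrate that definition; it is not how `npf.irr` is implemented, and the helper names `npv` and `irr_bisect` as well as the search bounds are assumptions made for this example.

```python
def npv(rate, cashflows):
    """Net present value of cash flows occurring at periods 0, 1, 2, ..."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cashflows))


def irr_bisect(cashflows, lo=-0.99, hi=10.0, tol=1e-9):
    """Find a rate with NPV = 0 by bisection; return None if NPV does not
    change sign between lo and hi (illustrative bounds)."""
    f_lo, f_hi = npv(lo, cashflows), npv(hi, cashflows)
    if f_lo * f_hi > 0:
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        f_mid = npv(mid, cashflows)
        if f_lo * f_mid <= 0:
            hi = mid              # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid  # root lies in [mid, hi]
    return (lo + hi) / 2.0


# Toy monthly cash flows: spend 100, nothing the next month, then receive 121.
monthly = [-100.0, 0.0, 121.0]
rate = irr_bisect(monthly)
print(rate)

# Only outflows: no rate makes the NPV zero.
print(irr_bisect([-100.0, -50.0]))
```

Because the cash flows above are grouped by month, the resulting rate is a per-month rate. Wallets whose cash flows are all outflows or all inflows have no rate that zeroes the NPV, so their IRR is undefined.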

## `profit_usd` to date

Profit or loss is defined as the difference between the proceeds from selling
an NFT and the cost of the preceding purchase of the same token by the same
wallet. Proceeds from repeated purchases and sales of the same token are added together.

In [None]:
X = wallets.groupby(['user_account_address', 'collection_slug', 'token_id', 'event_type']) \
    ['cashflow_usd'].sum() \
    .reset_index(level=3) \
    .pivot(columns='event_type') \
    .dropna() \
    .reset_index(col_level=1).droplevel(0, axis=1) \
    .melt(id_vars=['user_account_address', 'collection_slug', 'token_id'],
          value_vars=['buy', 'sell'],
          value_name='profit_usd')
X

**_TODO:_ Redo the above code chunk. It's a bit convoluted**
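One possible simplification, sketched on toy data and not a drop-in replacement, is to `unstack` the event type and add the two signed cash-flow columns. Unlike the chunk above it yields a wide table with one row per token, so the downstream `cost_usd` step would need to read the `buy` column instead of filtering on `event_type`.

```python
import pandas as pd

# Toy data: cashflow_usd is already signed (buys negative, sells positive).
toy = pd.DataFrame({
    "user_account_address": ["0xaaa", "0xaaa", "0xaaa", "0xbbb"],
    "collection_slug": ["cool-cats-nft"] * 4,
    "token_id": [1, 1, 2, 3],
    "event_type": ["buy", "sell", "buy", "sell"],
    "cashflow_usd": [-100.0, 150.0, -80.0, 60.0],
})

keys = ["user_account_address", "collection_slug", "token_id"]
per_token = (toy.groupby(keys + ["event_type"])["cashflow_usd"].sum()
                .unstack("event_type")
                .dropna())  # keep only tokens with both a buy and a sell

# Signed flows add up to the realized profit per token.
per_token["profit_usd"] = per_token["buy"] + per_token["sell"]
profit_usd = per_token.groupby(level="user_account_address")["profit_usd"].sum()
print(per_token)
print(profit_usd)
```

Tokens that were only bought (still held) or only sold (acquired outside the tracked sales) are excluded by `dropna`, matching the intent of counting realized round trips only.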

In [None]:
wallets_attr.append(X.groupby('user_account_address')['profit_usd'].sum())

## `cost_usd` to date

Cost or cost basis is defined as the price paid for the NFT sold.

In [None]:
cost = X[X.event_type == 'buy'].groupby('user_account_address')['profit_usd'].sum().abs()
cost.name = 'cost_usd'
wallets_attr.append(cost)

## `win`, `lose`, and `draw` counts to date

In [None]:
# WIP: unsure if we need this at event_timestamp level
X = wallets.groupby(['user_account_address', 'collection_slug', 'token_id', 'event_type', 'event_timestamp']) \
    ['deal_price_usd'].sum() \
    .reset_index(level=['event_type', 'event_timestamp']) 
X

In [None]:
X = wallets.groupby(['user_account_address', 'collection_slug', 'token_id', 'event_type']) \
    ['cashflow_usd'].sum() \
    .reset_index(level='event_type') \
    .pivot(columns='event_type') \
    .reset_index(col_level=1).droplevel(0, axis=1).assign(profit_usd=lambda x: x['sell'] + x['buy']) \
    .dropna()
X['win_lose'] = X['profit_usd'].apply(lambda x: 'win' if x > 0 else ('lose' if x < 0 else 'draw'))
X[X.profit_usd == 0]

Examine below one of the wallets with `draw`. The pattern seems like normal trading behavior.

In [None]:
wallets.loc[wallets.user_account_address == '0x001588cab7a0b727c388174b1ef20b2e3d20d39b',
           ['event_timestamp', 'event_type', 'collection_slug', 'token_id', 'deal_price']] \
    .pivot(index=['collection_slug', 'token_id'], columns='event_type')

In [None]:
win_lose = X.groupby(['user_account_address', 'win_lose']).size() \
    .reset_index('win_lose').pivot(columns='win_lose').droplevel(0, axis=1)
wallets_attr.append(win_lose)

## `win_ratio` and `lose_ratio` Win-Lose ratios to date

In [None]:
X = win_lose.fillna(0) \
    .assign(total=lambda x: x.sum(axis=1)) \
    .assign(win_ratio=lambda x: x.win / x.total) \
    .assign(lose_ratio=lambda x: x.lose / x.total)
X.sort_values('lose')

In [None]:
wallets_attr.append(X.loc[:, ['win_ratio', 'lose_ratio']])
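The ratio computation can be checked on a few made-up per-token profits. This sketch also reindexes the outcome columns, which guards against the case where no wallet has a given outcome (for example no `lose` column at all); that guard is an addition for illustration.

```python
import pandas as pd

# Toy per-token realized profits; values are invented.
per_token = pd.DataFrame({
    "user_account_address": ["0xaaa", "0xaaa", "0xaaa", "0xbbb"],
    "profit_usd": [50.0, -20.0, 0.0, 10.0],
})


def label(profit):
    """Classify a realized profit as win, lose, or draw."""
    if profit > 0:
        return "win"
    return "lose" if profit < 0 else "draw"


per_token["win_lose"] = per_token["profit_usd"].apply(label)

# Count outcomes per wallet; make sure all three columns exist.
counts = (per_token.groupby(["user_account_address", "win_lose"]).size()
          .unstack("win_lose", fill_value=0)
          .reindex(columns=["win", "lose", "draw"], fill_value=0))

# Divide each row by its total to get per-wallet ratios.
ratios = counts.div(counts.sum(axis=1), axis=0)
print(ratios)
```

With only a handful of round trips per wallet, these ratios are coarse (a single trade gives 0 or 1), so pairing them with the trade count may help interpretation.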

## Convert to DataFrame

In [None]:
wallets_attr = pd.concat(wallets_attr, axis=1)
wallets_attr.shape

In [None]:
wallets_attr

# Explore Data

## `profit_ratio`

In [None]:
wallets_attr.assign(profit_ratio=lambda x: x.profit_usd / x.cost_usd).query('~profit_ratio.isna()').sort_values('profit_ratio')

## Daily Price and Volume (WIP)

We define
* Price as the minimum `deal_price`
* Volume as either the total `deal_price` or total number of asset events, i.e. exchange frequency 

Todo: define the ceilings and floors for each FOMO wave, i.e. period

In [None]:
grp_cool_cats_nft=wallets.query('collection_slug == "cool-cats-nft" & deal_price > 0') \
    .groupby(pd.Grouper(key='event_timestamp', freq='1D'))

In [None]:
grp=wallets.query('deal_price > 0') \
    .groupby(pd.Grouper(key='event_timestamp', freq='1D'))

In [None]:
df = grp_cool_cats_nft.agg({"deal_price": ["sum", "mean", "median", "min"], "event_type": "count"}) \
        .assign(event_type_cnt_median=lambda x: x[('event_type', 'count')].median(),
                sum_pct_chg=lambda x: x[("deal_price", "sum")].pct_change(),
                min_pct_chg=lambda x: x[("deal_price", "min")].pct_change(),
                cnt_pct_chg=lambda x: x[("event_type", "count")].pct_change()) \
        .assign(sum_pp_diff=lambda x: x['sum_pct_chg'].diff(),
                min_pp_diff=lambda x: x['min_pct_chg'].diff(),
                cnt_pp_diff=lambda x: x['cnt_pct_chg'].diff())
df
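To show what the derived columns mean, here is a reduced sketch on invented prices: the daily minimum price stands in for the floor, the daily count for volume, `pct_change` gives the day-over-day relative change, and `diff` of that gives the change in percentage points.

```python
import pandas as pd

# Toy data: six sales over three days; prices are made up.
toy = pd.DataFrame({
    "event_timestamp": pd.to_datetime([
        "2022-01-01 09:00", "2022-01-01 18:00",
        "2022-01-02 10:00",
        "2022-01-03 11:00", "2022-01-03 12:00", "2022-01-03 13:00",
    ]),
    "deal_price": [2.0, 1.0, 1.5, 3.0, 4.0, 5.0],
})

# Daily floor price (min) and number of events (count).
daily = (toy.groupby(pd.Grouper(key="event_timestamp", freq="1D"))["deal_price"]
            .agg(["min", "count"]))

daily["min_pct_chg"] = daily["min"].pct_change()     # relative change vs. previous day
daily["min_pp_diff"] = daily["min_pct_chg"].diff()   # change of that change
print(daily)
```

Days without any sale appear as empty groups in real data, which makes `min` missing for that day; how to fill such gaps is a separate modelling choice.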

In [None]:
df.sort_values(('event_type', 'count'))

In [None]:
_ = sns.lineplot(data=df, x='event_timestamp', y=('deal_price', 'sum'))

In [None]:
_ = sns.lineplot(data=df, x='event_timestamp', y=('event_type', 'count'))

In [None]:
_ = sns.lineplot(data=df, x='event_timestamp', y=('deal_price', 'min'))

Subsetting the events by date

In [None]:
wallets.set_index("event_timestamp").loc['2021-10-10']

In [None]:
wallets.groupby(["event_timestamp", "event_type"]).sum(numeric_only=True)

In [None]:
wallets.groupby(pd.Grouper(key="event_timestamp", freq="1D")).sum(numeric_only=True)

## Which users have both bought and sold NFTs during the specified period?

In [None]:
x = wallets.groupby("user_account_address")["event_type"].nunique().reset_index()
x = x[x.event_type > 1]

In [None]:
y = wallets.merge(x, on="user_account_address")
y.set_index("event_timestamp", inplace=True)

In [None]:
y.loc["2022-04"]["user_account_address"].nunique()
#query('user_account_address == "0xfffa6fc6acc3dbe04b175862376f1c5ff88cf9c1"')

In [None]:
hide_columns = ['token_owner_address', 'payment_token_decimals',
                'payment_token_usdprice',
                'transaction_hash', 'block_hash', 'block_number']
wallets.loc[:,~wallets.columns.isin(hide_columns)]

## Collections

Checking out the size, i.e. the number of **successful** asset events (buy or sell) by collection

In [None]:
wallets.groupby("collection_slug", as_index=False).size() \
    .sort_values("size", ascending=False).reset_index(drop=True)

_Are these popular collections on OpenSea or is it bias from data collection process?_

In [None]:
wallets.groupby(["collection_slug", "event_type"], as_index=False).size() \
    .pivot(index="collection_slug", columns="event_type", values="size") \
    .assign(diff=lambda x: x.buy - x.sell) \
    .sort_values(by=["buy", "sell"], ascending=False)

_Is it reasonable to expect more buy events than sell events for a given collection?_

_How do we explain more sell events than buy events? Could it be mint > transfer > sell?_

## NFT_ID

A composite key built from `collection_slug` and `token_id` serves as the unique NFT ID.

In [None]:
wallets["nft_id"] = wallets.collection_slug + '-' + wallets.token_id.astype('string')

In [None]:
wallets.groupby(["collection_slug", "nft_id", "event_type"], as_index=False).size() \
    .pivot(index=["collection_slug", "nft_id"], columns="event_type", values="size") \
    .sort_values(by=["buy", "sell"], ascending=False).head(20)

_*Tokens that have been exchanged multiple times._

# Note
Feature engineering in ML
1. Feature Creation
1. Transformations
1. Feature Extraction, and
1. Feature Selection

# Appendix

## Wallet Dataset Profile using Created Attributes

https://pandas-profiling.ydata.ai/docs/master/index.html

In [None]:
wallets_attr[wallets_attr.age > 1]

In [None]:
from pandas_profiling import ProfileReport

profile = ProfileReport(wallets_attr[wallets_attr.age > 1])
profile.to_widgets()

## Exporting Data for Other Experiments

In [None]:
# errors="ignore": token_seller_address was already dropped during reshaping
wallets.drop(["transaction_hash", "block_hash", "block_number", "token_seller_address"], axis=1, errors="ignore") \
    .to_csv(os.path.join(data_dir, "nft20_success_with_winner_address.csv"), index=False)

wallets_attr.to_csv(os.path.join(data_dir, "nft20_success_with_winner_address_wallets_attrib.csv"))