# 08/05 Update 

* **Changed** the target variable to the log difference between two consecutive sale prices, $\log \frac{P_n}{P_{n-1}}$

* **Added** a binary variable indicating whether the sale happens `>=30` days (1) or `<30` days (0) after the previous sale

* **Added** `log_price_n-1_sale` to the features and **deleted** `log_price_n-2_sale` from them

* **Deleted** rows that do not have `price_n-1_sale`
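
The update steps above can be sketched on a toy frame. This is a minimal illustration, not the notebook's pipeline: the column names mirror `df_table1`, but the values are made up, and `log_price_change` is a hypothetical column name chosen for this example. It assumes timestamps are Unix seconds and that a "missing" previous price may appear either as `NaN` or as `0`.

```python
import numpy as np
import pandas as pd

# Toy data: column names follow the notebook, values are invented.
toy = pd.DataFrame({
    "time_n_sale":    [1_625_000_000, 1_630_000_000, 1_640_000_000],
    "time_n-1_sale":  [1_624_000_000, 1_620_000_000, 1_636_000_000],
    "price_n_sale":   [2.0, 5.0, 3.0],
    "price_n-1_sale": [1.0, 0.0, 4.0],
})

# Drop rows without a usable previous price (assumption: NaN or 0 means missing).
has_prev = toy["price_n-1_sale"].notna() & (toy["price_n-1_sale"] != 0)
toy = toy[has_prev].copy()

# Whole days between the two sales, from Unix timestamps in seconds.
delta = (pd.to_datetime(toy["time_n_sale"], unit="s")
         - pd.to_datetime(toy["time_n-1_sale"], unit="s"))
toy["days_since_prev"] = delta.dt.days

# 1 if the resale happened 30 or more days after the previous sale, else 0.
toy["sold_after_30d"] = (toy["days_since_prev"] >= 30).astype(int)

# Target: log(P_n / P_{n-1}); "log_price_change" is a hypothetical name.
toy["log_price_change"] = np.log(toy["price_n_sale"] / toy["price_n-1_sale"])

print(toy[["days_since_prev", "sold_after_30d", "log_price_change"]])
```

Filtering before taking the ratio keeps the log defined for every remaining row.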

## Equation 2.1 Without Offer

$$
\log(P_{i,n=N})
= \beta_0
+ \beta_p \,\log\bigl(P_{i,n=N-2}\bigr)
+ \beta_1\,X_i^{\mathrm{NFTCharacteristics}}
+ \beta_2\,X_{i,n}^{\mathrm{buyer}}
+ \beta_3\,X_{i,n}^{\mathrm{seller}}
+ \beta_4\,X_{i,n-1}^{\mathrm{buyer}}
+ \beta_5\,X_{i,n-1}^{\mathrm{seller}}
+ \gamma_t
+ \varepsilon_{i,n}\,.
$$

* **Changed** the target variable to $\log(P_n)$, i.e. the price of sale $n$ itself
* **Deleted** $\log(P_{n-1})$ from the features
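
The log terms in Equation 2.1 are built in the cell below by winsorizing each column at the 5% and 95% limits and then applying `log(x + 1)`. The sketch here shows that two-step transform on a made-up price series. It uses `limits=[0.1, 0.1]` only so that exactly one value on each side is clipped in a ten-element example, and it assumes the input has no missing values.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Invented prices with one extreme value on each side.
prices = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0],
                   name="price_n_sale")

# With 10 values and 10% limits, the lowest and highest value are each
# replaced by their nearest remaining neighbour.
clipped = np.asarray(winsorize(prices.to_numpy(), limits=[0.1, 0.1]))

# log(x + 1) keeps zero prices finite.
log_price = np.log1p(clipped)

print(clipped)
print(log_price.round(3))
```

Winsorizing before the log transform limits how much a single extreme sale can move the fitted coefficients.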

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble     import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics      import r2_score, mean_squared_error, mean_absolute_error
from scipy.stats.mstats import winsorize
import statsmodels.api as sm
from sklearn.preprocessing     import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model      import ElasticNetCV


df_table1 = pd.read_csv("df_table1.csv",
                        usecols=['token_id','time_n_sale',
                                 'time_n-1_sale','price_n_sale',
                                 'price_n-1_sale','buyer_n_sale',
                                 'seller_n_sale','buyer_n-1_sale','seller_n-1_sale', 'price_n-2_sale']) # 08/05 update 'price_n-2_sale' to df_table1

# 1. parse to datetime
df_table1['time_n_sale_dt']   = pd.to_datetime(df_table1['time_n_sale'],   unit='s')
df_table1['time_n-1_sale_dt'] = pd.to_datetime(df_table1['time_n-1_sale'], unit='s')

# 2. compute difference in days
df_table1['days_since_prev'] = (
    df_table1['time_n_sale_dt'] - df_table1['time_n-1_sale_dt']
).dt.days

# 3. make binary flag. #08/05 update making binary variable accounting for whether the NFT was sold before or after 30 days
df_table1['sold_after_30d'] = (df_table1['days_since_prev'] >= 30).astype(int)

# 4. drop any zero‐price_n-1_sale rows 
df_table1 = df_table1[df_table1['price_n-1_sale'] != 0]

df_buyern = pd.read_csv("df_table4.csv")
df_buyern_1 = pd.read_csv("df_table6.csv")
df_sellern = pd.read_csv("df_table5.csv")
df_sellern_1 = pd.read_csv("df_table7.csv")
df_nft_feature = pd.read_csv("df_table3.csv")

# rename columns in df_buyern, df_buyern_1 and df_sellern
df_buyern.rename(columns={'transaction_count':'buyern_tscount',
                          'active_period':'buyern_act_period',
                          'total_value':'buyern_total_value',
                          'total_gasUsed':'buyern_total_gasUsed',
                          'avg_gasPrice':'buyern_avg_gasPrice',
                          'avg_gasLimit':'buyern_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_rolling_std_value_last10'}, inplace=True)

df_buyern_1.rename(columns={'transaction_count':'buyern_1_tscount',
                          'active_period':'buyern_1_act_period',
                          'total_value':'buyern_1_total_value',
                          'total_gasUsed':'buyern_1_total_gasUsed',
                          'avg_gasPrice':'buyern_1_avg_gasPrice',
                          'avg_gasLimit':'buyern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_1_rolling_std_value_last10'}, inplace=True)

df_sellern.rename(columns={'transaction_count':'sellern_tscount',
                          'active_period':'sellern_act_period',
                          'total_value':'sellern_total_value',
                          'total_gasUsed':'sellern_total_gasUsed',
                          'avg_gasPrice':'sellern_avg_gasPrice',
                          'avg_gasLimit':'sellern_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_rolling_std_value_last10'}, inplace=True)

df_sellern_1.rename(columns={'transaction_count':'sellern_1_tscount',
                          'active_period':'sellern_1_act_period',
                          'total_value':'sellern_1_total_value',
                          'total_gasUsed':'sellern_1_total_gasUsed',
                          'avg_gasPrice':'sellern_1_avg_gasPrice',
                          'avg_gasLimit':'sellern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_1_rolling_std_value_last10'}, inplace=True)


# merge df_table1 with df_buyern and df_sellern and df_buyern_1
df = pd.merge(df_table1, df_buyern, left_on='buyer_n_sale',right_on='buyer_n_address', how='left')
df = pd.merge(df, df_sellern, left_on='seller_n_sale',right_on='seller_n_address', how='left')
df = pd.merge(df, df_buyern_1, left_on='buyer_n-1_sale',right_on='buyer_n-1_address', how='left')
df = pd.merge(df, df_sellern_1, left_on='seller_n-1_sale',right_on='seller_n-1_address', how='left')

# drop unnecessary columns
df.drop(columns=['buyer_n_address','seller_n_address','buyer_n-1_address','seller_n-1_address'], inplace=True)
df.drop(columns=['buyer_n_sale','seller_n_sale','buyer_n-1_sale','seller_n-1_sale'], inplace=True)

# convert to year-month (optional, for dummies later)
df['time_n_sale']   = df['time_n_sale_dt'].dt.strftime('%Y-%m')
df['time_n-1_sale'] = df['time_n-1_sale_dt'].dt.strftime('%Y-%m')
df.drop(columns=['time_n_sale_dt', 'time_n-1_sale_dt'], inplace=True)

# merge df with df_nft_feature
df = pd.merge(df, df_nft_feature, left_on='token_id',right_on='token_id', how='left')

cat_cols = [
    "time_n_sale","time_n-1_sale",
    "Background","Clothes","Earring",
    "Eyes","Fur","Hat","Mouth"
]

# save a copy BEFORE encode
df_orig = df.copy()
        

# one-hot encode the categorical variables (cat_cols is defined above)

df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

#figure out which level got dropped for each categorical
encoded_cols = set(df.columns)

for col in cat_cols:
    # all levels present in the original
    levels = sorted(df_orig[col].dropna().unique())
    # the dummy‐columns you actually created
    created = [
        c.replace(f"{col}_","")
        for c in encoded_cols
        if c.startswith(f"{col}_")
    ]
    # the one missing is the dropped reference
    base = list(set(levels) - set(created))
    if len(base)==1:
        print(f"{col:15s} → base/reference level = {base[0]}")
    else:
        print(f"{col:15s} → unexpected drop (found {base})")

# winsorize 
col_to_winsorize = ['price_n_sale', 'price_n-1_sale', 'price_n-2_sale', # 08/05 update 'price_n-2_sale' to df_table1
                    'buyern_total_value','buyern_total_gasUsed','buyern_avg_gasPrice','buyern_avg_gasLimit','buyern_rolling_avg_value_last10','buyern_rolling_std_value_last10',
                    'sellern_total_value','sellern_total_gasUsed','sellern_avg_gasPrice','sellern_avg_gasLimit','sellern_rolling_avg_value_last10', 'sellern_rolling_std_value_last10',
                    'buyern_1_total_value','buyern_1_total_gasUsed','buyern_1_avg_gasPrice','buyern_1_avg_gasLimit','buyern_1_rolling_avg_value_last10','buyern_1_rolling_std_value_last10',
                    'sellern_1_total_value','sellern_1_total_gasUsed','sellern_1_avg_gasPrice','sellern_1_avg_gasLimit','sellern_1_rolling_avg_value_last10', 'sellern_1_rolling_std_value_last10',
                    ]

df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: winsorize(x, limits=[0.05, 0.05]))

# log transform the winsorized columns and rename them with a log_ prefix
df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: np.log(x + 1))
df.rename(columns={col: 'log_' + col for col in col_to_winsorize}, inplace=True)


# fillna 
df.fillna(0, inplace=True)

# train test split
X = df.drop(columns=['log_price_n_sale', 'token_id','log_price_n-1_sale'])
y = df['log_price_n_sale']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)

# standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Parameter tuning for ElasticNetCV
alphas_en = np.logspace(-4, 2, 50)  # This will be passed directly to ElasticNetCV
param_grid_en = {
    'l1_ratio': np.linspace(0.01, 1.0, 20)  # l1_ratio must be in [0, 1]
}

# ElasticNetCV
elastic_net = ElasticNetCV(alphas=alphas_en, random_state=87, n_jobs=-1)

# GridSearchCV for ElasticNetCV (only tune l1_ratio)
grid_search_en = GridSearchCV(elastic_net, param_grid_en, cv=10, n_jobs=-1, verbose=1, scoring='neg_mean_squared_error')

# fit gridsearchcv to find best parameters
print("Starting GridSearchCV...")
grid_search_en.fit(X_train_scaled, y_train)

# Get the best fitted estimator (refit on the full training set)
best_model_en = grid_search_en.best_estimator_

print(f"Best Parameters found: {grid_search_en.best_params_}")
print(f"Best Score found: {grid_search_en.best_score_:.4f}")

# Predicting on the test set 
y_pred_en = best_model_en.predict(X_test_scaled)


# Calculate metrics
r2_en = r2_score(y_test, y_pred_en)
mse_en = mean_squared_error(y_test, y_pred_en)
mae_en = mean_absolute_error(y_test, y_pred_en)
rmse_en = np.sqrt(mse_en)

# mape 
mask = y_test != 0
mape_en = np.mean(np.abs((y_test[mask] - y_pred_en[mask]) / y_test[mask])) * 100 if np.any(mask) else np.inf

# Print metrics

print(f"ElasticNetCV R^2: {r2_en:.4f}")
print(f"ElasticNetCV MSE: {mse_en:.4f}")
print(f"ElasticNetCV MAE: {mae_en:.4f}")
print(f"ElasticNetCV RMSE: {rmse_en:.4f}")
print(f"ElasticNetCV MAPE: {mape_en:.4f}")

# take the log of Pn and Pn-1 and use the difference as the target (including the price)

# 2nd model: try including a control variable for whether the NFT was previously sold within the last 30 days (yes) or earlier (no) (subtracting the prices may not be a good approach; the length of ownership could have an effect)

# 3rd model: NFT fixed effects, with only buyer and seller variables (no NFT characteristics), since it is over time.




time_n_sale     → base/reference level = 2021-05
time_n-1_sale   → base/reference level = 2021-05
Background      → base/reference level = Aquamarine
Clothes         → base/reference level = Admirals Coat
Earring         → base/reference level = Cross
Eyes            → base/reference level = 3d
Fur             → base/reference level = Black
Hat             → base/reference level = Army Hat
Mouth           → base/reference level = Bored
Starting GridSearchCV...
Fitting 10 folds for each of 20 candidates, totalling 200 fits


  model = cd_fast.enet_coordinate_descent_gram(

Best Parameters found: {'l1_ratio': 0.9478947368421053}
Best Score found: -0.0842
ElasticNetCV R^2: 0.9362
ElasticNetCV MSE: 0.1136
ElasticNetCV MAE: 0.1813
ElasticNetCV RMSE: 0.3371
ElasticNetCV MAPE: 12.7704
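
The cell above tunes `l1_ratio` by wrapping `ElasticNetCV` in `GridSearchCV`, which nests one cross-validation inside another. As an alternative sketch (not what the notebook does), `ElasticNetCV` accepts a list of `l1_ratio` values and searches them together with `alphas` in a single cross-validation. The data below are synthetic, and `X_demo`, `y_demo`, and `enet_demo` are hypothetical names used only for this example.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: only the first two features carry signal.
rng = np.random.default_rng(87)
X_demo = rng.normal(size=(200, 10))
y_demo = 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Scaling lives inside the pipeline; ElasticNetCV searches alpha and
# l1_ratio jointly with one 5-fold cross-validation.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(
        l1_ratio=[0.1, 0.5, 0.9, 1.0],
        alphas=np.logspace(-4, 2, 20),
        cv=5,
        random_state=87,
    ),
)
model.fit(X_demo, y_demo)

enet_demo = model.named_steps["elasticnetcv"]
print("chosen l1_ratio:", enet_demo.l1_ratio_)
print("chosen alpha:", enet_demo.alpha_)
print("non-zero coefficients:", int(np.sum(enet_demo.coef_ != 0)))
```

The selected values are exposed as `l1_ratio_` and `alpha_` on the fitted estimator, so no outer grid search is needed to read them back.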


In [2]:
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error

# Get the fitted ElasticNetCV from your GridSearchCV
enet = grid_search_en.best_estimator_

# Identify which original features had non-zero coef
selected = X.columns[enet.coef_ != 0].tolist()
print(f"{len(selected)} features selected by ElasticNet:\n{selected}\n")

# Subset your original (un-scaled) X_train and X_test
X_sel_train = X_train[selected]
X_sel_test  = X_test[selected]

# Add constant for intercept
X_sel_train_const = sm.add_constant(X_sel_train)
X_sel_test_const  = sm.add_constant(X_sel_test, has_constant='add')

# Convert all columns to float to avoid dtype=object issues
X_sel_train_const = X_sel_train_const.astype(float)
X_sel_test_const = X_sel_test_const.astype(float)

# Fit OLS on the training data
ols = sm.OLS(y_train, X_sel_train_const).fit()

# Print full regression table
print(ols.summary())

# Evaluate OLS on the test set
y_pred_ols = ols.predict(X_sel_test_const)
rmse_ols = np.sqrt(mean_squared_error(y_test, y_pred_ols))
print(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}")

# save summary to a text file
with open('ols_summary.txt', 'w') as f:
    f.write(ols.summary().as_text())
    f.write(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}\n")
    

276 features selected by ElasticNet:
['log_price_n-2_sale', 'days_since_prev', 'sold_after_30d', 'buyern_tscount', 'buyern_act_period', 'log_buyern_total_value', 'log_buyern_total_gasUsed', 'log_buyern_avg_gasPrice', 'log_buyern_avg_gasLimit', 'log_buyern_rolling_avg_value_last10', 'log_buyern_rolling_std_value_last10', 'sellern_tscount', 'sellern_act_period', 'log_sellern_total_value', 'log_sellern_total_gasUsed', 'log_sellern_avg_gasPrice', 'log_sellern_avg_gasLimit', 'log_sellern_rolling_std_value_last10', 'buyern_1_tscount', 'buyern_1_act_period', 'log_buyern_1_total_value', 'log_buyern_1_rolling_avg_value_last10', 'log_buyern_1_rolling_std_value_last10', 'sellern_1_tscount', 'sellern_1_act_period', 'log_sellern_1_total_value', 'log_sellern_1_total_gasUsed', 'log_sellern_1_avg_gasPrice', 'rarity.rank', 'time_n_sale_2021-06', 'time_n_sale_2021-07', 'time_n_sale_2021-08', 'time_n_sale_2021-09', 'time_n_sale_2021-10', 'time_n_sale_2021-11', 'time_n_sale_2021-12', 'time_n_sale_2022-01'

## Equation 2.1 With Offer

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble     import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics      import r2_score, mean_squared_error, mean_absolute_error
from scipy.stats.mstats import winsorize
import statsmodels.api as sm
from sklearn.preprocessing     import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model      import ElasticNetCV


df_table1 = pd.read_csv("df_table1.csv",
                        usecols=['token_id','time_n_sale',
                                 'time_n-1_sale','price_n_sale',
                                 'price_n-1_sale','buyer_n_sale',
                                 'seller_n_sale','buyer_n-1_sale','seller_n-1_sale', 'price_n-2_sale']) # 08/05 update 'price_n-2_sale' to df_table1

# 1. parse to datetime
df_table1['time_n_sale_dt']   = pd.to_datetime(df_table1['time_n_sale'],   unit='s')
df_table1['time_n-1_sale_dt'] = pd.to_datetime(df_table1['time_n-1_sale'], unit='s')

# 2. compute difference in days
df_table1['days_since_prev'] = (
    df_table1['time_n_sale_dt'] - df_table1['time_n-1_sale_dt']
).dt.days

# 3. make binary flag. #08/05 update making binary variable accounting for whether the NFT was sold before or after 30 days
df_table1['sold_after_30d'] = (df_table1['days_since_prev'] >= 30).astype(int)

# 4. drop any zero‐price_n-1_sale rows 
df_table1 = df_table1[df_table1['price_n-1_sale'] != 0]

df_buyern = pd.read_csv("df_table4.csv")
df_buyern_1 = pd.read_csv("df_table6.csv")
df_sellern = pd.read_csv("df_table5.csv")
df_sellern_1 = pd.read_csv("df_table7.csv")
df_nft_feature = pd.read_csv("df_table3.csv")
df_offer = pd.read_csv("Panel_for_Model2.csv", usecols = ['token_id','total_offers','unique_makers_count'])

# rename columns in df_buyern, df_buyern_1 and df_sellern
df_buyern.rename(columns={'transaction_count':'buyern_tscount',
                          'active_period':'buyern_act_period',
                          'total_value':'buyern_total_value',
                          'total_gasUsed':'buyern_total_gasUsed',
                          'avg_gasPrice':'buyern_avg_gasPrice',
                          'avg_gasLimit':'buyern_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_rolling_std_value_last10'}, inplace=True)

df_buyern_1.rename(columns={'transaction_count':'buyern_1_tscount',
                          'active_period':'buyern_1_act_period',
                          'total_value':'buyern_1_total_value',
                          'total_gasUsed':'buyern_1_total_gasUsed',
                          'avg_gasPrice':'buyern_1_avg_gasPrice',
                          'avg_gasLimit':'buyern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_1_rolling_std_value_last10'}, inplace=True)

df_sellern.rename(columns={'transaction_count':'sellern_tscount',
                          'active_period':'sellern_act_period',
                          'total_value':'sellern_total_value',
                          'total_gasUsed':'sellern_total_gasUsed',
                          'avg_gasPrice':'sellern_avg_gasPrice',
                          'avg_gasLimit':'sellern_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_rolling_std_value_last10'}, inplace=True)

df_sellern_1.rename(columns={'transaction_count':'sellern_1_tscount',
                          'active_period':'sellern_1_act_period',
                          'total_value':'sellern_1_total_value',
                          'total_gasUsed':'sellern_1_total_gasUsed',
                          'avg_gasPrice':'sellern_1_avg_gasPrice',
                          'avg_gasLimit':'sellern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_1_rolling_std_value_last10'}, inplace=True)


# merge df_table1 with df_buyern and df_sellern and df_buyern_1
df = pd.merge(df_table1, df_buyern, left_on='buyer_n_sale',right_on='buyer_n_address', how='left')
df = pd.merge(df, df_sellern, left_on='seller_n_sale',right_on='seller_n_address', how='left')
df = pd.merge(df, df_buyern_1, left_on='buyer_n-1_sale',right_on='buyer_n-1_address', how='left')
df = pd.merge(df, df_sellern_1, left_on='seller_n-1_sale',right_on='seller_n-1_address', how='left')
df = pd.merge(df, df_offer, left_on='token_id', right_on='token_id', how='left')

# drop unnecessary columns
df.drop(columns=['buyer_n_address','seller_n_address','buyer_n-1_address','seller_n-1_address'], inplace=True)
df.drop(columns=['buyer_n_sale','seller_n_sale','buyer_n-1_sale','seller_n-1_sale'], inplace=True)

# convert to year-month (optional, for dummies later)
df['time_n_sale']   = df['time_n_sale_dt'].dt.strftime('%Y-%m')
df['time_n-1_sale'] = df['time_n-1_sale_dt'].dt.strftime('%Y-%m')
df.drop(columns=['time_n_sale_dt', 'time_n-1_sale_dt'], inplace=True)

# merge df with df_nft_feature
df = pd.merge(df, df_nft_feature, left_on='token_id',right_on='token_id', how='left')

cat_cols = [
    "time_n_sale","time_n-1_sale",
    "Background","Clothes","Earring",
    "Eyes","Fur","Hat","Mouth"
]

# save a copy BEFORE encode
df_orig = df.copy()
        

# one-hot encode the categorical variables (cat_cols is defined above)

df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

#figure out which level got dropped for each categorical
encoded_cols = set(df.columns)

for col in cat_cols:
    # all levels present in the original
    levels = sorted(df_orig[col].dropna().unique())
    # the dummy‐columns you actually created
    created = [
        c.replace(f"{col}_","")
        for c in encoded_cols
        if c.startswith(f"{col}_")
    ]
    # the one missing is the dropped reference
    base = list(set(levels) - set(created))
    if len(base)==1:
        print(f"{col:15s} → base/reference level = {base[0]}")
    else:
        print(f"{col:15s} → unexpected drop (found {base})")

# winsorize 
col_to_winsorize = ['price_n_sale', 'price_n-1_sale', 'price_n-2_sale', # 08/05 update 'price_n-2_sale' to df_table1
                    'buyern_total_value','buyern_total_gasUsed','buyern_avg_gasPrice','buyern_avg_gasLimit','buyern_rolling_avg_value_last10','buyern_rolling_std_value_last10',
                    'sellern_total_value','sellern_total_gasUsed','sellern_avg_gasPrice','sellern_avg_gasLimit','sellern_rolling_avg_value_last10', 'sellern_rolling_std_value_last10',
                    'buyern_1_total_value','buyern_1_total_gasUsed','buyern_1_avg_gasPrice','buyern_1_avg_gasLimit','buyern_1_rolling_avg_value_last10','buyern_1_rolling_std_value_last10',
                    'sellern_1_total_value','sellern_1_total_gasUsed','sellern_1_avg_gasPrice','sellern_1_avg_gasLimit','sellern_1_rolling_avg_value_last10', 'sellern_1_rolling_std_value_last10',
                    'total_offers', 'unique_makers_count'
                    ]

df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: winsorize(x, limits=[0.05, 0.05]))


# log transform the winsorized columns and rename them with a log_ prefix
df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: np.log(x + 1))
df.rename(columns={col: 'log_' + col for col in col_to_winsorize}, inplace=True)

# fillna 
df.fillna(0, inplace=True)

# train test split
X = df.drop(columns=['log_price_n_sale', 'token_id', 'log_price_n-1_sale'])
y = df['log_price_n_sale']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)

# standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Parameter tuning for ElasticNetCV
alphas_en = np.logspace(-4, 2, 50)  # This will be passed directly to ElasticNetCV
param_grid_en = {
    'l1_ratio': np.linspace(0.01, 1.0, 20)  # l1_ratio must be in [0, 1]
}

# ElasticNetCV
elastic_net = ElasticNetCV(alphas=alphas_en, random_state=87, n_jobs=-1)

# GridSearchCV for ElasticNetCV (only tune l1_ratio)
grid_search_en = GridSearchCV(elastic_net, param_grid_en, cv=10, n_jobs=-1, verbose=1, scoring='neg_mean_squared_error')

# fit gridsearchcv to find best parameters
print("Starting GridSearchCV...")
grid_search_en.fit(X_train_scaled, y_train)

# Get the best fitted estimator (refit on the full training set)
best_model_en = grid_search_en.best_estimator_

print(f"Best Parameters found: {grid_search_en.best_params_}")
print(f"Best Score found: {grid_search_en.best_score_:.4f}")

# Predicting on the test set 
y_pred_en = best_model_en.predict(X_test_scaled)


# Calculate metrics
r2_en = r2_score(y_test, y_pred_en)
mse_en = mean_squared_error(y_test, y_pred_en)
mae_en = mean_absolute_error(y_test, y_pred_en)
rmse_en = np.sqrt(mse_en)

# mape 
mask = y_test != 0
mape_en = np.mean(np.abs((y_test[mask] - y_pred_en[mask]) / y_test[mask])) * 100 if np.any(mask) else np.inf

# Print metrics

print(f"ElasticNetCV R^2: {r2_en:.4f}")
print(f"ElasticNetCV MSE: {mse_en:.4f}")
print(f"ElasticNetCV MAE: {mae_en:.4f}")
print(f"ElasticNetCV RMSE: {rmse_en:.4f}")
print(f"ElasticNetCV MAPE: {mape_en:.4f}")

# take the log of Pn and Pn-1 and use the difference as the target (including the price)

# 2nd model: try including a control variable for whether the NFT was previously sold within the last 30 days (yes) or earlier (no) (subtracting the prices may not be a good approach; the length of ownership could have an effect)

# 3rd model: NFT fixed effects, with only buyer and seller variables (no NFT characteristics), since it is over time.


time_n_sale     → base/reference level = 2021-05
time_n-1_sale   → base/reference level = 2021-05
Background      → base/reference level = Aquamarine
Clothes         → base/reference level = Admirals Coat
Earring         → base/reference level = Cross
Eyes            → base/reference level = 3d
Fur             → base/reference level = Black
Hat             → base/reference level = Army Hat
Mouth           → base/reference level = Bored
Starting GridSearchCV...
Fitting 10 folds for each of 20 candidates, totalling 200 fits


  model = cd_fast.enet_coordinate_descent_gram(

Best Parameters found: {'l1_ratio': 0.5831578947368421}
Best Score found: -0.0873
ElasticNetCV R^2: 0.9528
ElasticNetCV MSE: 0.0864
ElasticNetCV MAE: 0.1742
ElasticNetCV RMSE: 0.2940
ElasticNetCV MAPE: 11.3021


In [4]:
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error

# Get the fitted ElasticNetCV from your GridSearchCV
enet = grid_search_en.best_estimator_

# Identify which original features had non-zero coef
selected = X.columns[enet.coef_ != 0].tolist()
print(f"{len(selected)} features selected by ElasticNet:\n{selected}\n")

# Subset your original (un-scaled) X_train and X_test
X_sel_train = X_train[selected]
X_sel_test  = X_test[selected]

# Add constant for intercept
X_sel_train_const = sm.add_constant(X_sel_train)
X_sel_test_const  = sm.add_constant(X_sel_test, has_constant='add')

# Convert all columns to float to avoid dtype=object issues
X_sel_train_const = X_sel_train_const.astype(float)
X_sel_test_const = X_sel_test_const.astype(float)

# Fit OLS on the training data
ols = sm.OLS(y_train, X_sel_train_const).fit()

# Print full regression table
print(ols.summary())

# Evaluate OLS on the test set
y_pred_ols = ols.predict(X_sel_test_const)
rmse_ols = np.sqrt(mean_squared_error(y_test, y_pred_ols))
print(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}")

# save summary to a text file
with open('ols_summary.txt', 'w') as f:
    f.write(ols.summary().as_text())
    f.write(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}\n")
    

288 features selected by ElasticNet:
['log_price_n-2_sale', 'days_since_prev', 'sold_after_30d', 'buyern_tscount', 'buyern_act_period', 'log_buyern_total_value', 'log_buyern_total_gasUsed', 'log_buyern_avg_gasPrice', 'log_buyern_avg_gasLimit', 'log_buyern_rolling_avg_value_last10', 'log_buyern_rolling_std_value_last10', 'sellern_tscount', 'sellern_act_period', 'log_sellern_total_value', 'log_sellern_total_gasUsed', 'log_sellern_avg_gasPrice', 'log_sellern_avg_gasLimit', 'log_sellern_rolling_avg_value_last10', 'log_sellern_rolling_std_value_last10', 'buyern_1_tscount', 'buyern_1_act_period', 'log_buyern_1_total_value', 'log_buyern_1_total_gasUsed', 'log_buyern_1_avg_gasLimit', 'log_buyern_1_rolling_avg_value_last10', 'log_buyern_1_rolling_std_value_last10', 'sellern_1_tscount', 'sellern_1_act_period', 'log_sellern_1_total_value', 'log_sellern_1_total_gasUsed', 'log_sellern_1_avg_gasPrice', 'log_sellern_1_avg_gasLimit', 'log_sellern_1_rolling_std_value_last10', 'log_total_offers', 'log_u

## Equation 2.2 Without Offer

$$
\begin{aligned}
\log(P_{i,n=N}) - \log(P_{i,n=N-1})
&= \beta_0
+ \beta_p \,\log\bigl(P_{i,n=N-2}\bigr)
+ \beta_1\,X_i^{\mathrm{NFTCharacteristics}} \\[1em]
&\quad + \beta_2\,X_{i,n}^{\mathrm{buyer}}
+ \beta_3\,X_{i,n}^{\mathrm{seller}}
+ \beta_4\,X_{i,n-1}^{\mathrm{buyer}}
+ \beta_5\,X_{i,n-1}^{\mathrm{seller}}
+ \gamma_t
+ \varepsilon_{i,n}\,.
\end{aligned}
$$

* **The target variable is now the change in log price**
* $\log(P_{n-2})$ is included as a feature
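
One detail matters when building this target. The earlier cells transform prices with `log(x + 1)` rather than `log(x)`, and a difference of `log(x + 1)` values is not exactly $\log \frac{P_n}{P_{n-1}}$. The sketch below compares the two on made-up prices. Whether the Equation 2.2 cell uses the shifted or the plain log is an assumption to check against the actual code.

```python
import numpy as np

# Invented consecutive sale prices for three tokens.
p_n = np.array([2.0, 0.5, 10.0])
p_nm1 = np.array([1.0, 0.4, 8.0])

# Exact log price change: log(P_n) - log(P_{n-1}) = log(P_n / P_{n-1}).
exact = np.log(p_n / p_nm1)

# Difference of log(x + 1) values, as produced by the earlier transform.
shifted = np.log1p(p_n) - np.log1p(p_nm1)

print("exact:  ", exact.round(4))
print("shifted:", shifted.round(4))
```

For positive prices the shifted difference is always closer to zero than the exact log ratio, and the gap is largest when prices are small relative to 1.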

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble     import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics      import r2_score, mean_squared_error, mean_absolute_error
from scipy.stats.mstats import winsorize
import statsmodels.api as sm
from sklearn.preprocessing     import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model      import ElasticNetCV


df_table1 = pd.read_csv("df_table1.csv",
                        usecols=['token_id','time_n_sale',
                                 'time_n-1_sale','price_n_sale',
                                 'price_n-1_sale','buyer_n_sale',
                                 'seller_n_sale','buyer_n-1_sale','seller_n-1_sale', 'price_n-2_sale']) # 08/05 update 'price_n-2_sale' to df_table1

# 1. parse to datetime
df_table1['time_n_sale_dt']   = pd.to_datetime(df_table1['time_n_sale'],   unit='s')
df_table1['time_n-1_sale_dt'] = pd.to_datetime(df_table1['time_n-1_sale'], unit='s')

# 2. compute difference in days
df_table1['days_since_prev'] = (
    df_table1['time_n_sale_dt'] - df_table1['time_n-1_sale_dt']
).dt.days

# 3. make binary flag. #08/05 update making binary variable accounting for whether the NFT was sold before or after 30 days
df_table1['sold_after_30d'] = (df_table1['days_since_prev'] >= 30).astype(int)

# 4. drop any zero‐price_n-1_sale rows 
df_table1 = df_table1[df_table1['price_n-1_sale'] != 0]

df_buyern = pd.read_csv("df_table4.csv")
df_buyern_1 = pd.read_csv("df_table6.csv")
df_sellern = pd.read_csv("df_table5.csv")
df_sellern_1 = pd.read_csv("df_table7.csv")
df_nft_feature = pd.read_csv("df_table3.csv")

# rename columns in df_buyern, df_buyern_1 and df_sellern
df_buyern.rename(columns={'transaction_count':'buyern_tscount',
                          'active_period':'buyern_act_period',
                          'total_value':'buyern_total_value',
                          'total_gasUsed':'buyern_total_gasUsed',
                          'avg_gasPrice':'buyern_avg_gasPrice',
                          'avg_gasLimit':'buyern_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_rolling_std_value_last10'}, inplace=True)

df_buyern_1.rename(columns={'transaction_count':'buyern_1_tscount',
                          'active_period':'buyern_1_act_period',
                          'total_value':'buyern_1_total_value',
                          'total_gasUsed':'buyern_1_total_gasUsed',
                          'avg_gasPrice':'buyern_1_avg_gasPrice',
                          'avg_gasLimit':'buyern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_1_rolling_std_value_last10'}, inplace=True)

df_sellern.rename(columns={'transaction_count':'sellern_tscount',
                          'active_period':'sellern_act_period',
                          'total_value':'sellern_total_value',
                          'total_gasUsed':'sellern_total_gasUsed',
                          'avg_gasPrice':'sellern_avg_gasPrice',
                          'avg_gasLimit':'sellern_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_rolling_std_value_last10'}, inplace=True)

df_sellern_1.rename(columns={'transaction_count':'sellern_1_tscount',
                          'active_period':'sellern_1_act_period',
                          'total_value':'sellern_1_total_value',
                          'total_gasUsed':'sellern_1_total_gasUsed',
                          'avg_gasPrice':'sellern_1_avg_gasPrice',
                          'avg_gasLimit':'sellern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_1_rolling_std_value_last10'}, inplace=True)


# merge df_table1 with the four wallet tables (left joins keep every sale row)
df = pd.merge(df_table1, df_buyern, left_on='buyer_n_sale',right_on='buyer_n_address', how='left')
df = pd.merge(df, df_sellern, left_on='seller_n_sale',right_on='seller_n_address', how='left')
df = pd.merge(df, df_buyern_1, left_on='buyer_n-1_sale',right_on='buyer_n-1_address', how='left')
df = pd.merge(df, df_sellern_1, left_on='seller_n-1_sale',right_on='seller_n-1_address', how='left')

# drop unnecessary columns (merge keys and raw wallet addresses)
df.drop(columns=['buyer_n_address','seller_n_address','buyer_n-1_address','seller_n-1_address'], inplace=True)
df.drop(columns=['buyer_n_sale','seller_n_sale','buyer_n-1_sale','seller_n-1_sale'], inplace=True)

# convert to year-month (optional, for dummies later)
df['time_n_sale']   = df['time_n_sale_dt'].dt.strftime('%Y-%m')
df['time_n-1_sale'] = df['time_n-1_sale_dt'].dt.strftime('%Y-%m')
df.drop(columns=['time_n_sale_dt', 'time_n-1_sale_dt'], inplace=True)

# merge df with df_nft_feature
df = pd.merge(df, df_nft_feature, left_on='token_id',right_on='token_id', how='left')

# categorical columns to one-hot encode
cat_cols = [
    "time_n_sale","time_n-1_sale",
    "Background","Clothes","Earring",
    "Eyes","Fur","Hat","Mouth"
]

# save a copy BEFORE encoding (used below to recover the dropped reference levels)
df_orig = df.copy()

# one-hot encoding for categorical variables

df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

# figure out which level got dropped (the reference level) for each categorical
encoded_cols = set(df.columns)

for col in cat_cols:
    # all levels present in the original
    levels = sorted(df_orig[col].dropna().unique())
    # the dummy‐columns you actually created
    created = [
        c.replace(f"{col}_","")
        for c in encoded_cols
        if c.startswith(f"{col}_")
    ]
    # the one missing is the dropped reference
    base = list(set(levels) - set(created))
    if len(base)==1:
        print(f"{col:15s} → base/reference level = {base[0]}")
    else:
        print(f"{col:15s} → unexpected drop (found {base})")

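# winsorize at the 5th/95th percentiles (this runs before fillna, so NaNs from the left merges are still present here)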
col_to_winsorize = ['price_n_sale', 'price_n-1_sale', 'price_n-2_sale', # 08/05 update 'price_n-2_sale' to df_table1
                    'buyern_total_value','buyern_total_gasUsed','buyern_avg_gasPrice','buyern_avg_gasLimit','buyern_rolling_avg_value_last10','buyern_rolling_std_value_last10',
                    'sellern_total_value','sellern_total_gasUsed','sellern_avg_gasPrice','sellern_avg_gasLimit','sellern_rolling_avg_value_last10', 'sellern_rolling_std_value_last10',
                    'buyern_1_total_value','buyern_1_total_gasUsed','buyern_1_avg_gasPrice','buyern_1_avg_gasLimit','buyern_1_rolling_avg_value_last10','buyern_1_rolling_std_value_last10',
                    'sellern_1_total_value','sellern_1_total_gasUsed','sellern_1_avg_gasPrice','sellern_1_avg_gasLimit','sellern_1_rolling_avg_value_last10', 'sellern_1_rolling_std_value_last10',
                    ]

df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: winsorize(x, limits=[0.05, 0.05]))

# log(1 + x) transform the winsorized columns and prefix their names with log_
df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: np.log(x + 1))
df.rename(columns={col: 'log_' + col for col in col_to_winsorize}, inplace=True)

# target: difference of the log(1 + price) columns, approximately log(price_n_sale / price_n-1_sale)
df['log_price_change'] = df['log_price_n_sale'] - df['log_price_n-1_sale']

# fill remaining missing values (e.g. wallets with no match in the left merges) with 0
df.fillna(0, inplace=True)

# features and target (log_price_n-2_sale stays in X), then train/test split
X = df.drop(columns=['log_price_n_sale', 'token_id', 'log_price_change','log_price_n-1_sale'])
y = df['log_price_change']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)

# standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Parameter tuning for ElasticNetCV
alphas_en = np.logspace(-4, 2, 50)  # This will be passed directly to ElasticNetCV
param_grid_en = {
    'l1_ratio': np.linspace(0.01, 1.0, 20)  # l1_ratio must be in [0, 1]
}

# ElasticNetCV picks alpha with its own internal cross-validation (default 5-fold) for each l1_ratio
elastic_net = ElasticNetCV(alphas=alphas_en, random_state=87, n_jobs=-1)

# GridSearchCV around ElasticNetCV (only tunes l1_ratio)
grid_search_en = GridSearchCV(elastic_net, param_grid_en, cv=10, n_jobs=-1, verbose=1, scoring='neg_mean_squared_error')

# fit GridSearchCV to find the best l1_ratio
print("Starting GridSearchCV...")
grid_search_en.fit(X_train_scaled, y_train)

# best_estimator_ is the refitted ElasticNetCV model (not a parameter dict)
best_model_en = grid_search_en.best_estimator_

print(f"Best Parameters found: {grid_search_en.best_params_}")
print(f"Best Score found: {grid_search_en.best_score_:.4f}")  # negative MSE, so closer to 0 is better

# Predicting on the test set
y_pred_en = best_model_en.predict(X_test_scaled)


# Calculate metrics
r2_en = r2_score(y_test, y_pred_en)
mse_en = mean_squared_error(y_test, y_pred_en)
mae_en = mean_absolute_error(y_test, y_pred_en)
rmse_en = np.sqrt(mse_en)

# MAPE (in %), computed only where y_test != 0; it can blow up when the log price change is close to zero
mask = y_test != 0
mape_en = np.mean(np.abs((y_test[mask] - y_pred_en[mask]) / y_test[mask])) * 100 if np.any(mask) else np.inf

# Print metrics

print(f"ElasticNetCV R^2: {r2_en:.4f}")
print(f"ElasticNetCV MSE: {mse_en:.4f}")
print(f"ElasticNetCV MAE: {mae_en:.4f}")
print(f"ElasticNetCV RMSE: {rmse_en:.4f}")
print(f"ElasticNetCV MAPE: {mape_en:.4f}")

# Take the logs of P_n and P_n-1 and use their difference (including the price).

# 2nd model: try including a control for whether the NFT was previously sold within the last 30 days (Yes) or earlier (No). Subtracting the prices may not be a good approach, since the length of ownership could have an effect.

# 3rd model: NFT fixed effects, with no NFT characteristics, only buyer and seller variables, which vary over time.




time_n_sale     → base/reference level = 2021-05
time_n-1_sale   → base/reference level = 2021-05
Background      → base/reference level = Aquamarine
Clothes         → base/reference level = Admirals Coat
Earring         → base/reference level = Cross
Eyes            → base/reference level = 3d
Fur             → base/reference level = Black
Hat             → base/reference level = Army Hat
Mouth           → base/reference level = Bored
Starting GridSearchCV...
Fitting 10 folds for each of 20 candidates, totalling 200 fits


  model = cd_fast.enet_coordinate_descent_gram(

Best Parameters found: {'l1_ratio': 0.8436842105263158}
Best Score found: -0.2459
ElasticNetCV R^2: 0.7772
ElasticNetCV MSE: 0.2834
ElasticNetCV MAE: 0.3309
ElasticNetCV RMSE: 0.5323
ElasticNetCV MAPE: 980.9360
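
The MAPE above is far larger than the R^2, MAE and RMSE would suggest. A plausible reason (an assumption, not checked against this data) is that the target is a log price change, which can lie very close to zero, and MAPE divides each error by the true value. The sketch below uses made-up numbers only to illustrate that effect; it reuses the notebook's masking rule of skipping exact zeros.

```python
import numpy as np

# Made-up log price changes; two of them are very close to zero.
y_true = np.array([0.50, -0.40, 0.001, -0.002])
# Made-up predictions, each off by 0.05 in absolute terms.
y_pred = y_true + 0.05

abs_err = np.abs(y_true - y_pred)
mae = abs_err.mean()

# Same masking rule as the notebook: only exact zeros are skipped.
mask = y_true != 0
mape = np.mean(abs_err[mask] / np.abs(y_true[mask])) * 100

print(f"MAE:  {mae:.4f}")
print(f"MAPE: {mape:.1f}%")
```

With a constant absolute error, the two near-zero targets dominate the MAPE, so for a log-difference target MAE or RMSE are probably the more informative numbers.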


In [6]:
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error

# Get the fitted ElasticNetCV from your GridSearchCV
enet = grid_search_en.best_estimator_

# Identify which original features had non-zero coef
selected = X.columns[enet.coef_ != 0].tolist()
print(f"{len(selected)} features selected by ElasticNet:\n{selected}\n")

# Subset your original (un-scaled) X_train and X_test
X_sel_train = X_train[selected]
X_sel_test  = X_test[selected]

# Add constant for intercept
X_sel_train_const = sm.add_constant(X_sel_train)
X_sel_test_const  = sm.add_constant(X_sel_test, has_constant='add')

# Convert all columns to float to avoid dtype=object issues
X_sel_train_const = X_sel_train_const.astype(float)
X_sel_test_const = X_sel_test_const.astype(float)

# Fit OLS on the training data
ols = sm.OLS(y_train, X_sel_train_const).fit()

# Print full regression table
print(ols.summary())

# Evaluate OLS on the test set
y_pred_ols = ols.predict(X_sel_test_const)
rmse_ols = np.sqrt(mean_squared_error(y_test, y_pred_ols))
print(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}")

# save summary to a text file
with open('ols_summary.txt', 'w') as f:
    f.write(ols.summary().as_text())
    f.write(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}\n")
    

256 features selected by ElasticNet:
['log_price_n-2_sale', 'days_since_prev', 'sold_after_30d', 'buyern_tscount', 'buyern_act_period', 'log_buyern_total_value', 'log_buyern_total_gasUsed', 'log_buyern_avg_gasPrice', 'log_buyern_avg_gasLimit', 'log_buyern_rolling_avg_value_last10', 'log_buyern_rolling_std_value_last10', 'sellern_tscount', 'sellern_act_period', 'log_sellern_total_value', 'log_sellern_total_gasUsed', 'log_sellern_avg_gasPrice', 'log_sellern_avg_gasLimit', 'log_sellern_rolling_std_value_last10', 'buyern_1_tscount', 'log_buyern_1_total_value', 'log_buyern_1_total_gasUsed', 'log_buyern_1_avg_gasPrice', 'log_buyern_1_rolling_std_value_last10', 'sellern_1_tscount', 'sellern_1_act_period', 'log_sellern_1_total_value', 'log_sellern_1_total_gasUsed', 'log_sellern_1_avg_gasLimit', 'log_sellern_1_rolling_avg_value_last10', 'rarity.rank', 'time_n_sale_2021-06', 'time_n_sale_2021-07', 'time_n_sale_2021-08', 'time_n_sale_2021-09', 'time_n_sale_2021-10', 'time_n_sale_2021-11', 'time_n

## Equation 2.2 With Offer

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble     import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics      import r2_score, mean_squared_error, mean_absolute_error
from scipy.stats.mstats import winsorize
import statsmodels.api as sm
from sklearn.preprocessing     import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model      import ElasticNetCV


df_table1 = pd.read_csv("df_table1.csv",
                        usecols=['token_id','time_n_sale',
                                 'time_n-1_sale','price_n_sale',
                                 'price_n-1_sale','buyer_n_sale',
                                 'seller_n_sale','buyer_n-1_sale','seller_n-1_sale', 'price_n-2_sale']) # 08/05 update 'price_n-2_sale' to df_table1

# 1. parse to datetime
df_table1['time_n_sale_dt']   = pd.to_datetime(df_table1['time_n_sale'],   unit='s')
df_table1['time_n-1_sale_dt'] = pd.to_datetime(df_table1['time_n-1_sale'], unit='s')

# 2. compute difference in days
df_table1['days_since_prev'] = (
    df_table1['time_n_sale_dt'] - df_table1['time_n-1_sale_dt']
).dt.days

# 3. binary flag (08/05 update): 1 if the NFT was resold >= 30 days after the previous sale, else 0
df_table1['sold_after_30d'] = (df_table1['days_since_prev'] >= 30).astype(int)

# 4. drop rows whose price_n-1_sale is zero (rows where it is NaN pass this filter)
df_table1 = df_table1[df_table1['price_n-1_sale'] != 0]

df_buyern = pd.read_csv("df_table4.csv")
df_buyern_1 = pd.read_csv("df_table6.csv")
df_sellern = pd.read_csv("df_table5.csv")
df_sellern_1 = pd.read_csv("df_table7.csv")
df_nft_feature = pd.read_csv("df_table3.csv")
df_offer = pd.read_csv("Panel_for_Model2.csv", usecols = ['token_id','total_offers','unique_makers_count'])

# prefix the columns of the four wallet tables (buyer n, buyer n-1, seller n, seller n-1) so they stay distinct after merging
df_buyern.rename(columns={'transaction_count':'buyern_tscount',
                          'active_period':'buyern_act_period',
                          'total_value':'buyern_total_value',
                          'total_gasUsed':'buyern_total_gasUsed',
                          'avg_gasPrice':'buyern_avg_gasPrice',
                          'avg_gasLimit':'buyern_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_rolling_std_value_last10'}, inplace=True)

df_buyern_1.rename(columns={'transaction_count':'buyern_1_tscount',
                          'active_period':'buyern_1_act_period',
                          'total_value':'buyern_1_total_value',
                          'total_gasUsed':'buyern_1_total_gasUsed',
                          'avg_gasPrice':'buyern_1_avg_gasPrice',
                          'avg_gasLimit':'buyern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_1_rolling_std_value_last10'}, inplace=True)

df_sellern.rename(columns={'transaction_count':'sellern_tscount',
                          'active_period':'sellern_act_period',
                          'total_value':'sellern_total_value',
                          'total_gasUsed':'sellern_total_gasUsed',
                          'avg_gasPrice':'sellern_avg_gasPrice',
                          'avg_gasLimit':'sellern_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_rolling_std_value_last10'}, inplace=True)

df_sellern_1.rename(columns={'transaction_count':'sellern_1_tscount',
                          'active_period':'sellern_1_act_period',
                          'total_value':'sellern_1_total_value',
                          'total_gasUsed':'sellern_1_total_gasUsed',
                          'avg_gasPrice':'sellern_1_avg_gasPrice',
                          'avg_gasLimit':'sellern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_1_rolling_std_value_last10'}, inplace=True)


# merge df_table1 with the four wallet tables and the offer table (all left joins)
df = pd.merge(df_table1, df_buyern, left_on='buyer_n_sale',right_on='buyer_n_address', how='left')
df = pd.merge(df, df_sellern, left_on='seller_n_sale',right_on='seller_n_address', how='left')
df = pd.merge(df, df_buyern_1, left_on='buyer_n-1_sale',right_on='buyer_n-1_address', how='left')
df = pd.merge(df, df_sellern_1, left_on='seller_n-1_sale',right_on='seller_n-1_address', how='left')
df = pd.merge(df, df_offer, left_on='token_id', right_on='token_id', how='left')

# drop unnecessary columns (merge keys and raw wallet addresses)
df.drop(columns=['buyer_n_address','seller_n_address','buyer_n-1_address','seller_n-1_address'], inplace=True)
df.drop(columns=['buyer_n_sale','seller_n_sale','buyer_n-1_sale','seller_n-1_sale'], inplace=True)

# convert to year-month (optional, for dummies later)
df['time_n_sale']   = df['time_n_sale_dt'].dt.strftime('%Y-%m')
df['time_n-1_sale'] = df['time_n-1_sale_dt'].dt.strftime('%Y-%m')
df.drop(columns=['time_n_sale_dt', 'time_n-1_sale_dt'], inplace=True)

# merge df with df_nft_feature
df = pd.merge(df, df_nft_feature, left_on='token_id',right_on='token_id', how='left')

# categorical columns to one-hot encode
cat_cols = [
    "time_n_sale","time_n-1_sale",
    "Background","Clothes","Earring",
    "Eyes","Fur","Hat","Mouth"
]

# save a copy BEFORE encoding (used below to recover the dropped reference levels)
df_orig = df.copy()

# one-hot encoding for categorical variables

df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

# figure out which level got dropped (the reference level) for each categorical
encoded_cols = set(df.columns)

for col in cat_cols:
    # all levels present in the original
    levels = sorted(df_orig[col].dropna().unique())
    # the dummy‐columns you actually created
    created = [
        c.replace(f"{col}_","")
        for c in encoded_cols
        if c.startswith(f"{col}_")
    ]
    # the one missing is the dropped reference
    base = list(set(levels) - set(created))
    if len(base)==1:
        print(f"{col:15s} → base/reference level = {base[0]}")
    else:
        print(f"{col:15s} → unexpected drop (found {base})")

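# winsorize at the 5th/95th percentiles (this runs before fillna, so NaNs from the left merges are still present here)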
col_to_winsorize = ['price_n_sale', 'price_n-1_sale', 'price_n-2_sale', # 08/05 update 'price_n-2_sale' to df_table1
                    'buyern_total_value','buyern_total_gasUsed','buyern_avg_gasPrice','buyern_avg_gasLimit','buyern_rolling_avg_value_last10','buyern_rolling_std_value_last10',
                    'sellern_total_value','sellern_total_gasUsed','sellern_avg_gasPrice','sellern_avg_gasLimit','sellern_rolling_avg_value_last10', 'sellern_rolling_std_value_last10',
                    'buyern_1_total_value','buyern_1_total_gasUsed','buyern_1_avg_gasPrice','buyern_1_avg_gasLimit','buyern_1_rolling_avg_value_last10','buyern_1_rolling_std_value_last10',
                    'sellern_1_total_value','sellern_1_total_gasUsed','sellern_1_avg_gasPrice','sellern_1_avg_gasLimit','sellern_1_rolling_avg_value_last10', 'sellern_1_rolling_std_value_last10',
                    'total_offers', 'unique_makers_count'
                    ]

df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: winsorize(x, limits=[0.05, 0.05]))


# log(1 + x) transform the winsorized columns and prefix their names with log_
df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: np.log(x + 1))
df.rename(columns={col: 'log_' + col for col in col_to_winsorize}, inplace=True)

# target: difference of the log(1 + price) columns, approximately log(price_n_sale / price_n-1_sale)
df['log_price_change'] = df['log_price_n_sale'] - df['log_price_n-1_sale']

# fill remaining missing values (e.g. wallets or tokens with no match in the left merges) with 0
df.fillna(0, inplace=True)

# features and target (log_price_n-2_sale stays in X), then train/test split
X = df.drop(columns=['log_price_n_sale', 'token_id', 'log_price_change','log_price_n-1_sale'])
y = df['log_price_change']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)

# standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Parameter tuning for ElasticNetCV
alphas_en = np.logspace(-4, 2, 50)  # This will be passed directly to ElasticNetCV
param_grid_en = {
    'l1_ratio': np.linspace(0.01, 1.0, 20)  # l1_ratio must be in [0, 1]
}

# ElasticNetCV picks alpha with its own internal cross-validation (default 5-fold) for each l1_ratio
elastic_net = ElasticNetCV(alphas=alphas_en, random_state=87, n_jobs=-1)

# GridSearchCV around ElasticNetCV (only tunes l1_ratio)
grid_search_en = GridSearchCV(elastic_net, param_grid_en, cv=10, n_jobs=-1, verbose=1, scoring='neg_mean_squared_error')

# fit GridSearchCV to find the best l1_ratio
print("Starting GridSearchCV...")
grid_search_en.fit(X_train_scaled, y_train)

# best_estimator_ is the refitted ElasticNetCV model (not a parameter dict)
best_model_en = grid_search_en.best_estimator_

print(f"Best Parameters found: {grid_search_en.best_params_}")
print(f"Best Score found: {grid_search_en.best_score_:.4f}")  # negative MSE, so closer to 0 is better

# Predicting on the test set
y_pred_en = best_model_en.predict(X_test_scaled)


# Calculate metrics
r2_en = r2_score(y_test, y_pred_en)
mse_en = mean_squared_error(y_test, y_pred_en)
mae_en = mean_absolute_error(y_test, y_pred_en)
rmse_en = np.sqrt(mse_en)

# MAPE (in %), computed only where y_test != 0; it can blow up when the log price change is close to zero
mask = y_test != 0
mape_en = np.mean(np.abs((y_test[mask] - y_pred_en[mask]) / y_test[mask])) * 100 if np.any(mask) else np.inf

# Print metrics

print(f"ElasticNetCV R^2: {r2_en:.4f}")
print(f"ElasticNetCV MSE: {mse_en:.4f}")
print(f"ElasticNetCV MAE: {mae_en:.4f}")
print(f"ElasticNetCV RMSE: {rmse_en:.4f}")
print(f"ElasticNetCV MAPE: {mape_en:.4f}")

# Take the logs of P_n and P_n-1 and use their difference (including the price).

# 2nd model: try including a control for whether the NFT was previously sold within the last 30 days (Yes) or earlier (No). Subtracting the prices may not be a good approach, since the length of ownership could have an effect.

# 3rd model: NFT fixed effects, with no NFT characteristics, only buyer and seller variables, which vary over time.




time_n_sale     → base/reference level = 2021-05
time_n-1_sale   → base/reference level = 2021-05
Background      → base/reference level = Aquamarine
Clothes         → base/reference level = Admirals Coat
Earring         → base/reference level = Cross
Eyes            → base/reference level = 3d
Fur             → base/reference level = Black
Hat             → base/reference level = Army Hat
Mouth           → base/reference level = Bored
Starting GridSearchCV...
Fitting 10 folds for each of 20 candidates, totalling 200 fits


  model = cd_fast.enet_coordinate_descent_gram(

Best Parameters found: {'l1_ratio': 0.791578947368421}
Best Score found: -0.2419
ElasticNetCV R^2: 0.7950
ElasticNetCV MSE: 0.2582
ElasticNetCV MAE: 0.3276
ElasticNetCV RMSE: 0.5082
ElasticNetCV MAPE: 568.2485
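
In the preprocessing cells, `winsorize` is applied before `fillna(0)`, so columns that come from the left merges may still contain NaN when they are clipped. It may be worth checking how those NaN values affect the result. As a sketch of one alternative (a swapped-in technique, not what the notebook does), the hypothetical helper `clip_to_quantiles` below clips each column to its own interpolated 5th/95th percentiles with pandas, which skips NaN when computing quantiles. Its cut-offs can differ slightly from those of `scipy.stats.mstats.winsorize`, which uses order statistics.

```python
import numpy as np
import pandas as pd


def clip_to_quantiles(s: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Clip a Series to its own quantiles; NaN entries are ignored and stay NaN."""
    lo = s.quantile(lower)  # Series.quantile skips NaN
    hi = s.quantile(upper)
    return s.clip(lower=lo, upper=hi)


# Made-up column with one extreme value and one missing value.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])
clipped = clip_to_quantiles(toy)
print(clipped)
```

In the notebook this could be applied as `df[col_to_winsorize].apply(clip_to_quantiles)`, leaving the later `fillna(0)` step to decide what a missing value should mean.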


In [8]:
import numpy as np
import pandas as pd

df_offer = pd.read_csv("Panel_for_Model2.csv", usecols = ['token_id','total_offers','unique_makers_count'])

pd.options.display.max_columns = None  
numeric_cols = df_offer.select_dtypes(include=[np.number]).columns  
df_desc = pd.DataFrame(df_offer[numeric_cols].describe())
df_desc.loc['skewness'] = df_offer[numeric_cols].skew()
df_desc.loc['kurtosis'] = df_offer[numeric_cols].kurt()
df_desc

| statistic | token_id | total_offers | unique_makers_count |
|---|---|---|---|
| count | 18508.0 | 18508.0 | 18508.0 |
| mean | 5023.064405 | 97.93003 | 0.54182 |
| std | 2856.759759 | 1147.437176 | 1.490313 |
| min | 5.0 | 0.0 | 0.0 |
| 25% | 2571.0 | 0.0 | 0.0 |
| 50% | 5012.5 | 0.0 | 0.0 |
| 75% | 7459.0 | 0.0 | 0.0 |
| max | 9997.0 | 87903.0 | 18.0 |
| skewness | -0.001131 | 37.870782 | 3.900112 |
| kurtosis | -1.17484 | 2199.970302 | 19.545605 |
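
The summary reports 18,508 rows with `token_id` values between 5 and 9,997, so at least some `token_id` values must repeat in `Panel_for_Model2.csv`. A left merge on `token_id` alone then repeats the matching sale rows. The sketch below uses made-up frames to show the effect and one way to detect it with the `validate` argument of `pd.merge`. Collapsing the panel with `sum` and `max` is an assumption about what the two columns mean, not something the notebook states.

```python
import pandas as pd

# Made-up sales and offer-panel frames; token 5 appears twice in the panel.
sales = pd.DataFrame({"token_id": [5, 7], "price_n_sale": [10.0, 20.0]})
offers = pd.DataFrame({
    "token_id": [5, 5, 7],
    "total_offers": [3, 4, 0],
    "unique_makers_count": [1, 2, 0],
})

# A plain left merge repeats the token 5 sale once per matching panel row.
merged_raw = pd.merge(sales, offers, on="token_id", how="left")
print(len(sales), "->", len(merged_raw))

# validate="many_to_one" raises MergeError when the right-hand keys are not unique.
try:
    pd.merge(sales, offers, on="token_id", how="left", validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

# Hypothetical fix: collapse the panel to one row per token before merging.
offers_per_token = offers.groupby("token_id", as_index=False).agg(
    total_offers=("total_offers", "sum"),
    unique_makers_count=("unique_makers_count", "max"),
)
merged = pd.merge(sales, offers_per_token, on="token_id", how="left",
                  validate="many_to_one")
print(merged)
```

If the panel rows are instead meant to line up with individual sales, the merge would need a second key (for example a sale timestamp) rather than an aggregation.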


In [9]:
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error

# Get the fitted ElasticNetCV from your GridSearchCV
enet = grid_search_en.best_estimator_

# Identify which original features had non-zero coef
selected = X.columns[enet.coef_ != 0].tolist()
print(f"{len(selected)} features selected by ElasticNet:\n{selected}\n")

# Subset your original (un-scaled) X_train and X_test
X_sel_train = X_train[selected]
X_sel_test  = X_test[selected]

# Add constant for intercept
X_sel_train_const = sm.add_constant(X_sel_train)
X_sel_test_const  = sm.add_constant(X_sel_test, has_constant='add')

# Convert all columns to float to avoid dtype=object issues
X_sel_train_const = X_sel_train_const.astype(float)
X_sel_test_const = X_sel_test_const.astype(float)

# Fit OLS on the training data
ols = sm.OLS(y_train, X_sel_train_const).fit()

# Print full regression table
print(ols.summary())

# Evaluate OLS on the test set
y_pred_ols = ols.predict(X_sel_test_const)
rmse_ols = np.sqrt(mean_squared_error(y_test, y_pred_ols))
print(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}")

# save summary to a text file
with open('ols_summary.txt', 'w') as f:
    f.write(ols.summary().as_text())
    f.write(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}\n")
    

270 features selected by ElasticNet:
['log_price_n-2_sale', 'days_since_prev', 'sold_after_30d', 'buyern_tscount', 'buyern_act_period', 'log_buyern_total_value', 'log_buyern_total_gasUsed', 'log_buyern_avg_gasPrice', 'log_buyern_avg_gasLimit', 'log_buyern_rolling_avg_value_last10', 'sellern_tscount', 'sellern_act_period', 'log_sellern_total_value', 'log_sellern_total_gasUsed', 'log_sellern_avg_gasPrice', 'log_sellern_avg_gasLimit', 'log_sellern_rolling_std_value_last10', 'buyern_1_tscount', 'buyern_1_act_period', 'log_buyern_1_total_value', 'log_buyern_1_total_gasUsed', 'log_buyern_1_avg_gasPrice', 'sellern_1_tscount', 'log_sellern_1_total_value', 'log_sellern_1_total_gasUsed', 'log_sellern_1_avg_gasLimit', 'log_sellern_1_rolling_avg_value_last10', 'log_total_offers', 'rarity.rank', 'time_n_sale_2021-06', 'time_n_sale_2021-07', 'time_n_sale_2021-08', 'time_n_sale_2021-09', 'time_n_sale_2021-10', 'time_n_sale_2021-11', 'time_n_sale_2021-12', 'time_n_sale_2022-01', 'time_n_sale_2022-02',

## Equation 2.3 Without Offer

$$
\begin{aligned}
\log(P_{i,n=N}) - \log(P_{i,n=N-1})
&= \beta_0
+ \beta_p \,\bigl(\log\bigl(P_{i,n=N-1}\bigr)\;-\;\log\bigl(P_{i,n=N-2}\bigr)\bigr)
+ \beta_1\,X_i^{\mathrm{NFTCharacteristics}} \\[1em]
&\quad + \beta_2\,X_{i,n}^{\mathrm{buyer}}
+ \beta_3\,X_{i,n}^{\mathrm{seller}}
+ \beta_4\,X_{i,n-1}^{\mathrm{buyer}}
+ \beta_5\,X_{i,n-1}^{\mathrm{seller}}
+ \gamma_t
+ \varepsilon_{i,n}\,.
\end{aligned}
$$


* $\bigl(\log\bigl(P_{i,n=N-1}\bigr)\;-\;\log\bigl(P_{i,n=N-2}\bigr)\bigr)$ is computed as `df['log_price_change_n-1'] = df['log_price_n-1_sale'] - df['log_price_n-2_sale']` and replaces `log_price_n-2_sale` in the features.
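
As a small worked example of the two differences in Equation 2.3, the sketch below applies the notebook's `log(x + 1)` transform to made-up prices (winsorizing is skipped, and `df_toy` is a hypothetical frame used only for illustration).

```python
import numpy as np
import pandas as pd

# Made-up prices for two NFTs; the second one has no sale at n-2.
df_toy = pd.DataFrame({
    "price_n_sale":   [12.0, 30.0],
    "price_n-1_sale": [10.0, 40.0],
    "price_n-2_sale": [8.0, np.nan],
})

logs = np.log1p(df_toy)  # log(1 + price), column by column

# Target: log(1 + P_n) - log(1 + P_{n-1})
df_toy["log_price_change"] = logs["price_n_sale"] - logs["price_n-1_sale"]
# Lagged regressor: log(1 + P_{n-1}) - log(1 + P_{n-2})
df_toy["log_price_change_n-1"] = logs["price_n-1_sale"] - logs["price_n-2_sale"]

# The notebook's fillna(0) turns a missing lagged change into 0.
df_toy_filled = df_toy.fillna(0)
print(df_toy_filled[["log_price_change", "log_price_change_n-1"]])
```

After `fillna(0)`, a row with no sale at n-2 looks the same as a row whose price did not move between n-2 and n-1, which may matter when interpreting $\beta_p$.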

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble     import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics      import r2_score, mean_squared_error, mean_absolute_error
from scipy.stats.mstats import winsorize
import statsmodels.api as sm
from sklearn.preprocessing     import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model      import ElasticNetCV


df_table1 = pd.read_csv("df_table1.csv",
                        usecols=['token_id','time_n_sale',
                                 'time_n-1_sale','price_n_sale',
                                 'price_n-1_sale','buyer_n_sale',
                                 'seller_n_sale','buyer_n-1_sale','seller_n-1_sale', 'price_n-2_sale']) # 08/05 update 'price_n-2_sale' to df_table1

# 1. parse to datetime
df_table1['time_n_sale_dt']   = pd.to_datetime(df_table1['time_n_sale'],   unit='s')
df_table1['time_n-1_sale_dt'] = pd.to_datetime(df_table1['time_n-1_sale'], unit='s')

# 2. compute difference in days
df_table1['days_since_prev'] = (
    df_table1['time_n_sale_dt'] - df_table1['time_n-1_sale_dt']
).dt.days

# 3. binary flag (08/05 update): 1 if the NFT was resold >= 30 days after the previous sale, else 0
df_table1['sold_after_30d'] = (df_table1['days_since_prev'] >= 30).astype(int)

# 4. drop rows whose price_n-1_sale is zero (rows where it is NaN pass this filter)
df_table1 = df_table1[df_table1['price_n-1_sale'] != 0]

df_buyern = pd.read_csv("df_table4.csv")
df_buyern_1 = pd.read_csv("df_table6.csv")
df_sellern = pd.read_csv("df_table5.csv")
df_sellern_1 = pd.read_csv("df_table7.csv")
df_nft_feature = pd.read_csv("df_table3.csv")

# prefix the columns of the four wallet tables (buyer n, buyer n-1, seller n, seller n-1) so they stay distinct after merging
df_buyern.rename(columns={'transaction_count':'buyern_tscount',
                          'active_period':'buyern_act_period',
                          'total_value':'buyern_total_value',
                          'total_gasUsed':'buyern_total_gasUsed',
                          'avg_gasPrice':'buyern_avg_gasPrice',
                          'avg_gasLimit':'buyern_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_rolling_std_value_last10'}, inplace=True)

df_buyern_1.rename(columns={'transaction_count':'buyern_1_tscount',
                          'active_period':'buyern_1_act_period',
                          'total_value':'buyern_1_total_value',
                          'total_gasUsed':'buyern_1_total_gasUsed',
                          'avg_gasPrice':'buyern_1_avg_gasPrice',
                          'avg_gasLimit':'buyern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_1_rolling_std_value_last10'}, inplace=True)

df_sellern.rename(columns={'transaction_count':'sellern_tscount',
                          'active_period':'sellern_act_period',
                          'total_value':'sellern_total_value',
                          'total_gasUsed':'sellern_total_gasUsed',
                          'avg_gasPrice':'sellern_avg_gasPrice',
                          'avg_gasLimit':'sellern_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_rolling_std_value_last10'}, inplace=True)

df_sellern_1.rename(columns={'transaction_count':'sellern_1_tscount',
                          'active_period':'sellern_1_act_period',
                          'total_value':'sellern_1_total_value',
                          'total_gasUsed':'sellern_1_total_gasUsed',
                          'avg_gasPrice':'sellern_1_avg_gasPrice',
                          'avg_gasLimit':'sellern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_1_rolling_std_value_last10'}, inplace=True)


# merge df_table1 with the four wallet tables (left joins keep every sale row)
df = pd.merge(df_table1, df_buyern, left_on='buyer_n_sale',right_on='buyer_n_address', how='left')
df = pd.merge(df, df_sellern, left_on='seller_n_sale',right_on='seller_n_address', how='left')
df = pd.merge(df, df_buyern_1, left_on='buyer_n-1_sale',right_on='buyer_n-1_address', how='left')
df = pd.merge(df, df_sellern_1, left_on='seller_n-1_sale',right_on='seller_n-1_address', how='left')

# drop unnecessary columns (merge keys and raw wallet addresses)
df.drop(columns=['buyer_n_address','seller_n_address','buyer_n-1_address','seller_n-1_address'], inplace=True)
df.drop(columns=['buyer_n_sale','seller_n_sale','buyer_n-1_sale','seller_n-1_sale'], inplace=True)

# convert to year-month (optional, for dummies later)
df['time_n_sale']   = df['time_n_sale_dt'].dt.strftime('%Y-%m')
df['time_n-1_sale'] = df['time_n-1_sale_dt'].dt.strftime('%Y-%m')
df.drop(columns=['time_n_sale_dt', 'time_n-1_sale_dt'], inplace=True)

# merge df with df_nft_feature
df = pd.merge(df, df_nft_feature, left_on='token_id',right_on='token_id', how='left')

# categorical columns to one-hot encode
cat_cols = [
    "time_n_sale","time_n-1_sale",
    "Background","Clothes","Earring",
    "Eyes","Fur","Hat","Mouth"
]

# save a copy BEFORE encoding (used below to recover the dropped reference levels)
df_orig = df.copy()

# one-hot encoding for categorical variables

df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

# figure out which level got dropped (the reference level) for each categorical
encoded_cols = set(df.columns)

for col in cat_cols:
    # all levels present in the original
    levels = sorted(df_orig[col].dropna().unique())
    # the dummy‐columns you actually created
    created = [
        c.replace(f"{col}_","")
        for c in encoded_cols
        if c.startswith(f"{col}_")
    ]
    # the one missing is the dropped reference
    base = list(set(levels) - set(created))
    if len(base)==1:
        print(f"{col:15s} → base/reference level = {base[0]}")
    else:
        print(f"{col:15s} → unexpected drop (found {base})")

# winsorize 
col_to_winsorize = ['price_n_sale', 'price_n-1_sale', 'price_n-2_sale', # 08/05 update 'price_n-2_sale' to df_table1
                    'buyern_total_value','buyern_total_gasUsed','buyern_avg_gasPrice','buyern_avg_gasLimit','buyern_rolling_avg_value_last10','buyern_rolling_std_value_last10',
                    'sellern_total_value','sellern_total_gasUsed','sellern_avg_gasPrice','sellern_avg_gasLimit','sellern_rolling_avg_value_last10', 'sellern_rolling_std_value_last10',
                    'buyern_1_total_value','buyern_1_total_gasUsed','buyern_1_avg_gasPrice','buyern_1_avg_gasLimit','buyern_1_rolling_avg_value_last10','buyern_1_rolling_std_value_last10',
                    'sellern_1_total_value','sellern_1_total_gasUsed','sellern_1_avg_gasPrice','sellern_1_avg_gasLimit','sellern_1_rolling_avg_value_last10', 'sellern_1_rolling_std_value_last10',
                    ]

df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: winsorize(x, limits=[0.05, 0.05]))

# log-transform the winsorized columns with log(1 + x) and prefix their names with log_
df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: np.log(x + 1))
df.rename(columns={col: 'log_' + col for col in col_to_winsorize}, inplace=True)

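# target variable: log price change, approximately log(price_n_sale / price_n-1_sale)
# (the prices were transformed with log(1 + x) above, so this is not the exact log ratio)
df['log_price_change'] = df['log_price_n_sale'] - df['log_price_n-1_sale']
# lagged log price change; it stays in X as a feature
df['log_price_change_n-1'] = df['log_price_n-1_sale'] - df['log_price_n-2_sale']

# fill remaining missing values with 0
df.fillna(0, inplace=True)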

# train test split
X = df.drop(columns=['log_price_n_sale', 'token_id', 'log_price_change','log_price_n-1_sale', 'log_price_n-2_sale'])
y = df['log_price_change']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)

# standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Parameter tuning for ElasticNetCV
alphas_en = np.logspace(-4, 2, 50)  # This will be passed directly to ElasticNetCV
param_grid_en = {
    'l1_ratio': np.linspace(0.01, 1.0, 20)  # l1_ratio must be in [0, 1]
}

# ElasticNetCV (selects alpha through its own internal cross-validation)
elastic_net = ElasticNetCV(alphas=alphas_en, random_state=87, n_jobs=-1)

# GridSearchCV around ElasticNetCV (only tunes l1_ratio)
grid_search_en = GridSearchCV(elastic_net, param_grid_en, cv=10, n_jobs=-1, verbose=1, scoring='neg_mean_squared_error')

# fit gridsearchcv to find best parameters
print("Starting GridSearchCV...")
grid_search_en.fit(X_train_scaled, y_train)

# Get the best estimator (refit on the full training set with the best l1_ratio)
best_params_en = grid_search_en.best_estimator_

print(f"Best Parameters found: {grid_search_en.best_params_}")
print(f"Best Score found: {grid_search_en.best_score_:.4f}")

# Predicting on the test set 
y_pred_en = best_params_en.predict(X_test_scaled)


# Calculate metrics
r2_en = r2_score(y_test, y_pred_en)
mse_en = mean_squared_error(y_test, y_pred_en)
mae_en = mean_absolute_error(y_test, y_pred_en)
rmse_en = np.sqrt(mse_en)

# MAPE (rows with a zero target are excluded; targets close to zero still inflate it)
mask = y_test != 0
mape_en = np.mean(np.abs((y_test[mask] - y_pred_en[mask]) / y_test[mask])) * 100 if np.any(mask) else np.inf

# Print metrics

print(f"ElasticNetCV R^2: {r2_en:.4f}")
print(f"ElasticNetCV MSE: {mse_en:.4f}")
print(f"ElasticNetCV MAE: {mae_en:.4f}")
print(f"ElasticNetCV RMSE: {rmse_en:.4f}")
print(f"ElasticNetCV MAPE: {mape_en:.4f}")

# take the log of P_n and P_{n-1} and use their difference as the target (the price is included)

# 2nd model: try including, among the control variables, whether the NFT was previously sold within the last 30 days (yes) or earlier (no); subtracting the prices may not be a good approach, because the length of ownership could have an effect

# 3rd model: NFT fixed effects, using only buyers and sellers (not the NFT characteristics), over time




time_n_sale     → base/reference level = 2021-05
time_n-1_sale   → base/reference level = 2021-05
Background      → base/reference level = Aquamarine
Clothes         → base/reference level = Admirals Coat
Earring         → base/reference level = Cross
Eyes            → base/reference level = 3d
Fur             → base/reference level = Black
Hat             → base/reference level = Army Hat
Mouth           → base/reference level = Bored
Starting GridSearchCV...
Fitting 10 folds for each of 20 candidates, totalling 200 fits


  model = cd_fast.enet_coordinate_descent_gram(

Best Parameters found: {'l1_ratio': 1.0}
Best Score found: -0.2414
ElasticNetCV R^2: 0.7828
ElasticNetCV MSE: 0.2763
ElasticNetCV MAE: 0.3300
ElasticNetCV RMSE: 0.5256
ElasticNetCV MAPE: 984.4709
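
`ElasticNetCV` already cross-validates `alpha` internally, and it also accepts a list of `l1_ratio` values, so the outer `GridSearchCV` may not be necessary. The sketch below shows that single-estimator variant on synthetic data. It is an illustration under assumptions, not the notebook's pipeline: the `_demo` names, the synthetic regression problem, and the smaller grids are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the design matrix (assumption: not the NFT data)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=87
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=87
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# One estimator searches over both l1_ratio and alpha with 5-fold CV
l1_grid = [0.1, 0.5, 0.9, 1.0]
enet_demo = ElasticNetCV(
    l1_ratio=l1_grid,
    alphas=np.logspace(-4, 2, 20),
    cv=5,
    max_iter=10000,
    random_state=87,
)
enet_demo.fit(X_tr_s, y_tr)

print("chosen l1_ratio:", enet_demo.l1_ratio_)
print("chosen alpha:", enet_demo.alpha_)
print("test R^2:", enet_demo.score(X_te_s, y_te))
```

Raising `max_iter` is one common way to reduce the coordinate-descent convergence warnings seen in the output above; whether it is enough here depends on the data.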


In [11]:
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error

# Get the fitted ElasticNetCV from your GridSearchCV
enet = grid_search_en.best_estimator_

# Identify which original features had non-zero coef
selected = X.columns[enet.coef_ != 0].tolist()
print(f"{len(selected)} features selected by ElasticNet:\n{selected}\n")

# Subset your original (un-scaled) X_train and X_test
X_sel_train = X_train[selected]
X_sel_test  = X_test[selected]

# Add constant for intercept
X_sel_train_const = sm.add_constant(X_sel_train)
X_sel_test_const  = sm.add_constant(X_sel_test, has_constant='add')

# Convert all columns to float to avoid dtype=object issues
X_sel_train_const = X_sel_train_const.astype(float)
X_sel_test_const = X_sel_test_const.astype(float)

# Fit OLS on the training data
ols = sm.OLS(y_train, X_sel_train_const).fit()

# Print full regression table
print(ols.summary())

# Evaluate OLS on the test set
y_pred_ols = ols.predict(X_sel_test_const)
rmse_ols = np.sqrt(mean_squared_error(y_test, y_pred_ols))
print(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}")

# save summary to a text file
with open('ols_summary.txt', 'w') as f:
    f.write(ols.summary().as_text())
    f.write(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}\n")
    

260 features selected by ElasticNet:
['days_since_prev', 'sold_after_30d', 'buyern_tscount', 'buyern_act_period', 'log_buyern_total_value', 'log_buyern_total_gasUsed', 'log_buyern_avg_gasPrice', 'log_buyern_avg_gasLimit', 'log_buyern_rolling_avg_value_last10', 'log_buyern_rolling_std_value_last10', 'sellern_tscount', 'sellern_act_period', 'log_sellern_total_value', 'log_sellern_total_gasUsed', 'log_sellern_avg_gasPrice', 'log_sellern_avg_gasLimit', 'log_sellern_rolling_std_value_last10', 'buyern_1_tscount', 'buyern_1_act_period', 'log_buyern_1_total_value', 'log_buyern_1_total_gasUsed', 'log_buyern_1_avg_gasPrice', 'log_buyern_1_rolling_std_value_last10', 'sellern_1_tscount', 'sellern_1_act_period', 'log_sellern_1_total_value', 'log_sellern_1_total_gasUsed', 'log_sellern_1_avg_gasLimit', 'log_sellern_1_rolling_avg_value_last10', 'rarity.rank', 'time_n_sale_2021-06', 'time_n_sale_2021-07', 'time_n_sale_2021-08', 'time_n_sale_2021-09', 'time_n_sale_2021-10', 'time_n_sale_2021-11', 'time_
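
A caveat on the preprocessing that feeds this selection: the winsorization in the cell above is computed on the full sample before `train_test_split`, so the test rows influence the clipping bounds. A minimal sketch of the alternative follows, estimating the 5%/95% bounds on the training rows only and applying them to both sets. The data and the names (`demo`, `train_w`, `test_w`) are hypothetical, and quantile clipping is a close substitute for `scipy.stats.mstats.winsorize`, not an identical procedure.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic heavy-tailed columns standing in for prices and gas usage (assumption)
rng = np.random.default_rng(87)
demo = pd.DataFrame({
    "price": rng.lognormal(mean=0.0, sigma=1.5, size=1000),
    "gas": rng.lognormal(mean=2.0, sigma=1.0, size=1000),
})

train_demo, test_demo = train_test_split(demo, test_size=0.2, random_state=87)

# Bounds are estimated on the training rows only
lower = train_demo.quantile(0.05)
upper = train_demo.quantile(0.95)

# Apply the same training-based bounds to both sets, column by column
train_w = train_demo.clip(lower=lower, upper=upper, axis=1)
test_w = test_demo.clip(lower=lower, upper=upper, axis=1)

# log(1 + x) transform, matching the transform used in the notebook
train_log = np.log1p(train_w)
test_log = np.log1p(test_w)

print(train_log.describe())
```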

## Equation 2.3 With Offer

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble     import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics      import r2_score, mean_squared_error, mean_absolute_error
from scipy.stats.mstats import winsorize
import statsmodels.api as sm
from sklearn.preprocessing     import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model      import ElasticNetCV


df_table1 = pd.read_csv("df_table1.csv",
                        usecols=['token_id','time_n_sale',
                                 'time_n-1_sale','price_n_sale',
                                 'price_n-1_sale','buyer_n_sale',
                                 'seller_n_sale','buyer_n-1_sale','seller_n-1_sale', 'price_n-2_sale']) # 08/05 update 'price_n-2_sale' to df_table1

# 1. parse to datetime
df_table1['time_n_sale_dt']   = pd.to_datetime(df_table1['time_n_sale'],   unit='s')
df_table1['time_n-1_sale_dt'] = pd.to_datetime(df_table1['time_n-1_sale'], unit='s')

# 2. compute difference in days
df_table1['days_since_prev'] = (
    df_table1['time_n_sale_dt'] - df_table1['time_n-1_sale_dt']
).dt.days

# 3. make binary flag. #08/05 update making binary variable accounting for whether the NFT was sold before or after 30 days
df_table1['sold_after_30d'] = (df_table1['days_since_prev'] >= 30).astype(int)

# 4. drop any zero‐price_n-1_sale rows 
df_table1 = df_table1[df_table1['price_n-1_sale'] != 0]

df_buyern = pd.read_csv("df_table4.csv")
df_buyern_1 = pd.read_csv("df_table6.csv")
df_sellern = pd.read_csv("df_table5.csv")
df_sellern_1 = pd.read_csv("df_table7.csv")
df_nft_feature = pd.read_csv("df_table3.csv")
df_offer = pd.read_csv("Panel_for_Model2.csv", usecols = ['token_id','total_offers','unique_makers_count'])

# rename columns in df_buyern, df_buyern_1 and df_sellern
df_buyern.rename(columns={'transaction_count':'buyern_tscount',
                          'active_period':'buyern_act_period',
                          'total_value':'buyern_total_value',
                          'total_gasUsed':'buyern_total_gasUsed',
                          'avg_gasPrice':'buyern_avg_gasPrice',
                          'avg_gasLimit':'buyern_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_rolling_std_value_last10'}, inplace=True)

df_buyern_1.rename(columns={'transaction_count':'buyern_1_tscount',
                          'active_period':'buyern_1_act_period',
                          'total_value':'buyern_1_total_value',
                          'total_gasUsed':'buyern_1_total_gasUsed',
                          'avg_gasPrice':'buyern_1_avg_gasPrice',
                          'avg_gasLimit':'buyern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'buyern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'buyern_1_rolling_std_value_last10'}, inplace=True)

df_sellern.rename(columns={'transaction_count':'sellern_tscount',
                          'active_period':'sellern_act_period',
                          'total_value':'sellern_total_value',
                          'total_gasUsed':'sellern_total_gasUsed',
                          'avg_gasPrice':'sellern_avg_gasPrice',
                          'avg_gasLimit':'sellern_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_rolling_std_value_last10'}, inplace=True)

df_sellern_1.rename(columns={'transaction_count':'sellern_1_tscount',
                          'active_period':'sellern_1_act_period',
                          'total_value':'sellern_1_total_value',
                          'total_gasUsed':'sellern_1_total_gasUsed',
                          'avg_gasPrice':'sellern_1_avg_gasPrice',
                          'avg_gasLimit':'sellern_1_avg_gasLimit',
                          'rolling_avg_value_last10':'sellern_1_rolling_avg_value_last10',
                          'rolling_std_value_last10':'sellern_1_rolling_std_value_last10'}, inplace=True)


# merge df_table1 with df_buyern and df_sellern and df_buyern_1
df = pd.merge(df_table1, df_buyern, left_on='buyer_n_sale',right_on='buyer_n_address', how='left')
df = pd.merge(df, df_sellern, left_on='seller_n_sale',right_on='seller_n_address', how='left')
df = pd.merge(df, df_buyern_1, left_on='buyer_n-1_sale',right_on='buyer_n-1_address', how='left')
df = pd.merge(df, df_sellern_1, left_on='seller_n-1_sale',right_on='seller_n-1_address', how='left')
df = pd.merge(df, df_offer, left_on='token_id', right_on='token_id', how='left')

# drop unnecessary columns
df.drop(columns=['buyer_n_address','seller_n_address','buyer_n-1_address','seller_n-1_address'], inplace=True)
df.drop(columns=['buyer_n_sale','seller_n_sale','buyer_n-1_sale','seller_n-1_sale'], inplace=True)

# convert to year-month (optional, for dummies later)
df['time_n_sale']   = df['time_n_sale_dt'].dt.strftime('%Y-%m')
df['time_n-1_sale'] = df['time_n-1_sale_dt'].dt.strftime('%Y-%m')
df.drop(columns=['time_n_sale_dt', 'time_n-1_sale_dt'], inplace=True)

# merge df with df_nft_feature
df = pd.merge(df, df_nft_feature, left_on='token_id',right_on='token_id', how='left')

cat_cols = [
    "time_n_sale","time_n-1_sale",
    "Background","Clothes","Earring",
    "Eyes","Fur","Hat","Mouth"
]

# save a copy BEFORE encoding (used below to recover the dropped reference levels)
df_orig = df.copy()

# one-hot encode the categorical variables in cat_cols (defined above)

df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

#figure out which level got dropped for each categorical
encoded_cols = set(df.columns)

for col in cat_cols:
    # all levels present in the original
    levels = sorted(df_orig[col].dropna().unique())
    # the dummy‐columns you actually created
    created = [
        c.replace(f"{col}_","")
        for c in encoded_cols
        if c.startswith(f"{col}_")
    ]
    # the one missing is the dropped reference
    base = list(set(levels) - set(created))
    if len(base)==1:
        print(f"{col:15s} → base/reference level = {base[0]}")
    else:
        print(f"{col:15s} → unexpected drop (found {base})")

# winsorize 
col_to_winsorize = ['price_n_sale', 'price_n-1_sale', 'price_n-2_sale', # 08/05 update 'price_n-2_sale' to df_table1
                    'buyern_total_value','buyern_total_gasUsed','buyern_avg_gasPrice','buyern_avg_gasLimit','buyern_rolling_avg_value_last10','buyern_rolling_std_value_last10',
                    'sellern_total_value','sellern_total_gasUsed','sellern_avg_gasPrice','sellern_avg_gasLimit','sellern_rolling_avg_value_last10', 'sellern_rolling_std_value_last10',
                    'buyern_1_total_value','buyern_1_total_gasUsed','buyern_1_avg_gasPrice','buyern_1_avg_gasLimit','buyern_1_rolling_avg_value_last10','buyern_1_rolling_std_value_last10',
                    'sellern_1_total_value','sellern_1_total_gasUsed','sellern_1_avg_gasPrice','sellern_1_avg_gasLimit','sellern_1_rolling_avg_value_last10', 'sellern_1_rolling_std_value_last10',
                    'total_offers', 'unique_makers_count'
                    ]

df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: winsorize(x, limits=[0.05, 0.05]))


# log-transform the winsorized columns with log(1 + x) and prefix their names with log_
df[col_to_winsorize] = df[col_to_winsorize].apply(lambda x: np.log(x + 1))
df.rename(columns={col: 'log_' + col for col in col_to_winsorize}, inplace=True)

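# target variable: log price change, approximately log(price_n_sale / price_n-1_sale)
# (the prices were transformed with log(1 + x) above, so this is not the exact log ratio)
df['log_price_change'] = df['log_price_n_sale'] - df['log_price_n-1_sale']
# lagged log price change; it stays in X as a feature
df['log_price_change_n-1'] = df['log_price_n-1_sale'] - df['log_price_n-2_sale']

# fill remaining missing values with 0
df.fillna(0, inplace=True)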

# train test split
X = df.drop(columns=['log_price_n_sale', 'token_id', 'log_price_change','log_price_n-1_sale', 'log_price_n-2_sale'])
y = df['log_price_change']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)

# standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Parameter tuning for ElasticNetCV
alphas_en = np.logspace(-4, 2, 50)  # This will be passed directly to ElasticNetCV
param_grid_en = {
    'l1_ratio': np.linspace(0.01, 1.0, 20)  # l1_ratio must be in [0, 1]
}

# ElasticNetCV (selects alpha through its own internal cross-validation)
elastic_net = ElasticNetCV(alphas=alphas_en, random_state=87, n_jobs=-1)

# GridSearchCV around ElasticNetCV (only tunes l1_ratio)
grid_search_en = GridSearchCV(elastic_net, param_grid_en, cv=10, n_jobs=-1, verbose=1, scoring='neg_mean_squared_error')

# fit gridsearchcv to find best parameters
print("Starting GridSearchCV...")
grid_search_en.fit(X_train_scaled, y_train)

# Get the best estimator (refit on the full training set with the best l1_ratio)
best_params_en = grid_search_en.best_estimator_

print(f"Best Parameters found: {grid_search_en.best_params_}")
print(f"Best Score found: {grid_search_en.best_score_:.4f}")

# Predicting on the test set 
y_pred_en = best_params_en.predict(X_test_scaled)


# Calculate metrics
r2_en = r2_score(y_test, y_pred_en)
mse_en = mean_squared_error(y_test, y_pred_en)
mae_en = mean_absolute_error(y_test, y_pred_en)
rmse_en = np.sqrt(mse_en)

# MAPE (rows with a zero target are excluded; targets close to zero still inflate it)
mask = y_test != 0
mape_en = np.mean(np.abs((y_test[mask] - y_pred_en[mask]) / y_test[mask])) * 100 if np.any(mask) else np.inf

# Print metrics

print(f"ElasticNetCV R^2: {r2_en:.4f}")
print(f"ElasticNetCV MSE: {mse_en:.4f}")
print(f"ElasticNetCV MAE: {mae_en:.4f}")
print(f"ElasticNetCV RMSE: {rmse_en:.4f}")
print(f"ElasticNetCV MAPE: {mape_en:.4f}")

# take the log of P_n and P_{n-1} and use their difference as the target (the price is included)

# 2nd model: try including, among the control variables, whether the NFT was previously sold within the last 30 days (yes) or earlier (no); subtracting the prices may not be a good approach, because the length of ownership could have an effect

# 3rd model: NFT fixed effects, using only buyers and sellers (not the NFT characteristics), over time




time_n_sale     → base/reference level = 2021-05
time_n-1_sale   → base/reference level = 2021-05
Background      → base/reference level = Aquamarine
Clothes         → base/reference level = Admirals Coat
Earring         → base/reference level = Cross
Eyes            → base/reference level = 3d
Fur             → base/reference level = Black
Hat             → base/reference level = Army Hat
Mouth           → base/reference level = Bored
Starting GridSearchCV...
Fitting 10 folds for each of 20 candidates, totalling 200 fits


  model = cd_fast.enet_coordinate_descent_gram(

Best Parameters found: {'l1_ratio': 1.0}
Best Score found: -0.2377
ElasticNetCV R^2: 0.8007
ElasticNetCV MSE: 0.2510
ElasticNetCV MAE: 0.3283
ElasticNetCV RMSE: 0.5010
ElasticNetCV MAPE: 571.3164
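
The MAPE values reported here (571% and, earlier, 984%) sit next to an R^2 of about 0.8, which looks contradictory. A likely explanation is that the target is a log price change, so many true values are close to zero and each one divides a modest absolute error by a tiny number. The toy example below illustrates the effect; the numbers are made up for illustration and are not taken from the NFT data.

```python
import numpy as np

# Hypothetical log price changes: two of the five true values are near zero
y_true_demo = np.array([0.50, -0.30, 0.002, 1.20, -0.001])
y_pred_demo = np.array([0.45, -0.25, 0.050, 1.10, 0.040])

abs_err = np.abs(y_true_demo - y_pred_demo)

# Same MAPE formula as in the notebook (absolute error over absolute truth)
per_row_ape = abs_err / np.abs(y_true_demo) * 100
mape_demo = per_row_ape.mean()

# MAE stays on the scale of the target and is not distorted by small targets
mae_demo = abs_err.mean()

print("per-row APE (%):", np.round(per_row_ape, 1))
print(f"MAPE: {mape_demo:.1f}%")
print(f"MAE:  {mae_demo:.4f}")
```

In this toy case every absolute error is at most 0.10, yet the two near-zero targets alone push the MAPE above 1000%. For a log-change target, MAE or RMSE is usually the more informative summary.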


In [13]:
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error

# Get the fitted ElasticNetCV from your GridSearchCV
enet = grid_search_en.best_estimator_

# Identify which original features had non-zero coef
selected = X.columns[enet.coef_ != 0].tolist()
print(f"{len(selected)} features selected by ElasticNet:\n{selected}\n")

# Subset your original (un-scaled) X_train and X_test
X_sel_train = X_train[selected]
X_sel_test  = X_test[selected]

# Add constant for intercept
X_sel_train_const = sm.add_constant(X_sel_train)
X_sel_test_const  = sm.add_constant(X_sel_test, has_constant='add')

# Convert all columns to float to avoid dtype=object issues
X_sel_train_const = X_sel_train_const.astype(float)
X_sel_test_const = X_sel_test_const.astype(float)

# Fit OLS on the training data
ols = sm.OLS(y_train, X_sel_train_const).fit()

# Print full regression table
print(ols.summary())

# Evaluate OLS on the test set
y_pred_ols = ols.predict(X_sel_test_const)
rmse_ols = np.sqrt(mean_squared_error(y_test, y_pred_ols))
print(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}")

# save summary to a text file
with open('ols_summary.txt', 'w') as f:
    f.write(ols.summary().as_text())
    f.write(f"\nTest RMSE (OLS on selected features): {rmse_ols:.4f}\n")
    

275 features selected by ElasticNet:
['days_since_prev', 'sold_after_30d', 'buyern_tscount', 'buyern_act_period', 'log_buyern_total_value', 'log_buyern_total_gasUsed', 'log_buyern_avg_gasPrice', 'log_buyern_avg_gasLimit', 'log_buyern_rolling_avg_value_last10', 'sellern_tscount', 'sellern_act_period', 'log_sellern_total_value', 'log_sellern_total_gasUsed', 'log_sellern_avg_gasPrice', 'log_sellern_avg_gasLimit', 'log_sellern_rolling_std_value_last10', 'buyern_1_tscount', 'buyern_1_act_period', 'log_buyern_1_total_value', 'log_buyern_1_total_gasUsed', 'log_buyern_1_avg_gasPrice', 'sellern_1_tscount', 'log_sellern_1_total_value', 'log_sellern_1_total_gasUsed', 'log_sellern_1_avg_gasLimit', 'log_total_offers', 'rarity.rank', 'time_n_sale_2021-06', 'time_n_sale_2021-07', 'time_n_sale_2021-08', 'time_n_sale_2021-09', 'time_n_sale_2021-10', 'time_n_sale_2021-11', 'time_n_sale_2021-12', 'time_n_sale_2022-01', 'time_n_sale_2022-02', 'time_n_sale_2022-03', 'time_n_sale_2022-04', 'time_n_sale_2022

# `summary_col` in statsmodels

* 

# Protocol account (focal address)
FOCAL = "0x29469395eaf6f95920e59f858042f0e28d98a20b".lower()

In [8]:
import pandas as pd
import numpy as np

# ----------------------------
# Config
# ----------------------------
FOCAL = "0x29469395eaf6f95920e59f858042f0e28d98a20b".lower()
PATH_DF2 = "df_table2.csv"  # transfers
PATH_DF1 = "df_table1.csv"  # sales panel-like data

# ----------------------------
# Load data
# ----------------------------
# df_table2: token_id, transfer_from, transfer_to, event_timestamp
df2 = pd.read_csv(
    PATH_DF2,
    dtype={"token_id": "int64", "transfer_from": "string", "transfer_to": "string", "event_timestamp": "Int64"},
)
# normalize address case
for c in ["transfer_from", "transfer_to"]:
    df2[c] = df2[c].astype("string").str.lower()

# df_table1 has many columns; force address columns to string so we can .str.lower() safely
addr_dtypes = {
    "buyer_n_sale": "string",
    "seller_n_sale": "string",
    "buyer_n-1_sale": "string",
    "seller_n-1_sale": "string",
}
df1 = pd.read_csv(PATH_DF1, dtype=addr_dtypes)
for c in ["buyer_n_sale", "seller_n_sale", "buyer_n-1_sale", "seller_n-1_sale"]:
    if c in df1.columns:
        df1[c] = df1[c].astype("string").str.lower()

# Ensure numeric—pandas may infer float; make sure we can test against 0
num_cols = ["time_n_sale", "time_n-1_sale", "time_n-2_sale",
            "price_n_sale", "price_n-1_sale", "price_n-2_sale"]
for c in num_cols:
    if c in df1.columns:
        df1[c] = pd.to_numeric(df1[c], errors="coerce")

# ----------------------------
# 1) Token IDs transferred TO focal address (df_table2)
# ----------------------------
tokens_to_focal = (
    df2.loc[df2["transfer_to"] == FOCAL, "token_id"]
    .dropna()
    .astype("int64")
    .unique()
)
tokens_to_focal_set = set(tokens_to_focal)

print(f"[Step 1] Unique token_ids transferred TO focal: {len(tokens_to_focal)}")
# Optional peek:
# print(sorted(tokens_to_focal)[:20])

# ----------------------------
# 2) Count prior-sale info among those tokens (df_table1)
# ----------------------------
df1_for_tokens = df1[df1["token_id"].isin(tokens_to_focal_set)].copy()

# time_n-1_sale considered valid if not NA and not 0
df1_for_tokens["has_prev_sale_info"] = (
    df1_for_tokens["time_n-1_sale"].notna() & (df1_for_tokens["time_n-1_sale"] != 0)
)

rows_with_prev = int(df1_for_tokens["has_prev_sale_info"].sum())
tokens_with_prev = int(
    df1_for_tokens.groupby("token_id")["has_prev_sale_info"].max().sum()
)  # number of unique token_ids with at least one row having prior-sale info

print(f"[Step 2] Rows (among those tokens) with valid time_n-1_sale: {rows_with_prev}")
print(f"[Step 2] Unique token_ids (among those tokens) with valid time_n-1_sale: {tokens_with_prev}")

# ----------------------------
# 3) Price change stats when focal address is the seller (df_table1)
# ----------------------------
# Filter: tokens in our set, valid prior sale info, and seller_n_sale == focal
mask_valid_prev = (
    df1["time_n-1_sale"].notna()
    & (df1["time_n-1_sale"] != 0)
    & df1["price_n-1_sale"].notna()
    & (df1["price_n-1_sale"] > 0)  # avoid division by zero in pct change
    & df1["price_n_sale"].notna()
)

sales_by_focal = df1[
    (df1["token_id"].isin(tokens_to_focal_set))
    & mask_valid_prev
    & (df1["seller_n_sale"] == FOCAL)
].copy()

# Compute price difference and percent change
sales_by_focal["price_diff"] = sales_by_focal["price_n_sale"] - sales_by_focal["price_n-1_sale"]
sales_by_focal["pct_change"] = (sales_by_focal["price_diff"] / sales_by_focal["price_n-1_sale"]) * 100.0

print(f"[Step 3] Rows with focal as seller (and valid prior sale) for those tokens: {len(sales_by_focal)}")

# Stats summary
def summarize(col):
    s = sales_by_focal[col]
    return {
        "count": int(s.count()),
        "mean": float(s.mean()) if s.count() else np.nan,
        "std": float(s.std(ddof=1)) if s.count() > 1 else np.nan,
        "min": float(s.min()) if s.count() else np.nan,
        "max": float(s.max()) if s.count() else np.nan,
    }

summary = {
    "price_diff": summarize("price_diff"),
    "pct_change_%": summarize("pct_change"),
}

print("\n[Summary] Price differences (focal is seller):")
for k, v in summary.items():
    print(f"  {k}: {v}")

# Optional: also compute log-return if you want a return metric more like your models
if ("price_n_sale" in sales_by_focal.columns) and ("price_n-1_sale" in sales_by_focal.columns):
    sales_by_focal["log_return"] = np.log(sales_by_focal["price_n_sale"] / sales_by_focal["price_n-1_sale"])
    logret_stats = summarize("log_return")
    print("\n[Summary] log_return (ln(P_n / P_{n-1})):")
    print(f"  {logret_stats}")

# Save the detailed rows to inspect / cite
out_csv = "focal_seller_price_changes_on_transferred_tokens.csv"
sales_by_focal_cols = [
    c for c in [
        "token_id","time_n_sale","time_n-1_sale","price_n_sale","price_n-1_sale",
        "buyer_n_sale","seller_n_sale","buyer_n-1_sale","seller_n-1_sale",
        "price_diff","pct_change","log_return"
    ] if c in sales_by_focal.columns
]
sales_by_focal[sales_by_focal_cols].to_csv(out_csv, index=False)
print(f"\n[Saved] Detailed rows → {out_csv}")

# ----------------------------
# (Optional) Per-token rollup for quick scanning
# ----------------------------
per_token = (sales_by_focal
             .groupby("token_id")[["price_diff","pct_change"]]
             .agg(["count","mean","std","min","max"]))
out_rollup = "focal_seller_token_rollup.csv"
per_token.to_csv(out_rollup)
print(f"[Saved] Per-token rollup → {out_rollup}")


import numpy as np
import pandas as pd

s = sales_by_focal.copy()

# --- Primary: log returns (no trim) ---
logret_mean = s['log_return'].mean()
logret_med  = s['log_return'].median()
logret_iqr  = s['log_return'].quantile(0.75) - s['log_return'].quantile(0.25)

# --- Robustness A: winsorize % change at 1%/99% ---
lo, hi = s['pct_change'].quantile([0.01, 0.99])
s['pct_change_w'] = s['pct_change'].clip(lo, hi)
pct_w_stats = s['pct_change_w'].agg(['count','mean','std','min','max'])

# --- Robustness B: exclude micro-price priors ---
p1 = s['price_n-1_sale'].quantile(0.01)  # bottom 1% of prior prices
THRESH = max(p1, 0.05)                    # the larger of the 1st percentile and a fixed 0.05 ETH floor
s_micro = s[s['price_n-1_sale'] >= THRESH]

micro_log_mean = s_micro['log_return'].mean()
micro_log_med  = s_micro['log_return'].median()

# --- Value-weighted mean log return (weights = prior price) ---
w = s['price_n-1_sale'].values
vw_mean_log = np.average(s['log_return'].values, weights=w)

print(f"Log return: mean {logret_mean:.4f}, median {logret_med:.4f}, IQR {logret_iqr:.4f}")
print(f"Winsorized % change (1/99):\n{pct_w_stats}")
print(f"Micro-price excluded (>= {THRESH:.4f}): mean log {micro_log_mean:.4f}, median log {micro_log_med:.4f}")
print(f"Value-weighted mean log return: {vw_mean_log:.4f} (~{(np.exp(vw_mean_log)-1)*100:.2f}% geometric)")


[Step 1] Unique token_ids transferred TO focal: 1577
[Step 2] Rows (among those tokens) with valid time_n-1_sale: 1577
[Step 2] Unique token_ids (among those tokens) with valid time_n-1_sale: 1577
[Step 3] Rows with focal as seller (and valid prior sale) for those tokens: 811

[Summary] Price differences (focal is seller):
  price_diff: {'count': 811, 'mean': -1.2901995654747227, 'std': 6.465579815823146, 'min': -77.1191, 'max': 31.680000000000003}
  pct_change_%: {'count': 811, 'mean': 18.57258870482054, 'std': 459.5916844037079, 'min': -85.00451671183377, 'max': 12672.000000000002}

[Summary] log_return (ln(P_n / P_{n-1})):
  {'count': 811, 'mean': -0.03151874970322617, 'std': 0.31856045951398715, 'min': -1.8974211443520939, 'max': 4.849840367846581}

[Saved] Detailed rows → focal_seller_price_changes_on_transferred_tokens.csv
[Saved] Per-token rollup → focal_seller_token_rollup.csv
Log return: mean -0.0315, median -0.0070, IQR 0.0504
Winsorized % change (1/99):
count    811.000000
m