# Centralized Exchange (CEX) address identification
The goal is to identify CEX/bridge addresses from transaction information about native and token transfers.

Three files are available: native transfers, token transfers, and flags (seed labels).


## Design and implementation
1) Read the files

2) Check the data info

3) Perform EDA

4) Data engineering and preparation

5) Model creation

6) Model evaluation

7) Discussion


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Reading the files and checking the data info
The transaction files are in Parquet format, and the data is said to contain 3M+ rows. PySpark could be an option for reading them, but let us first check whether they can be read and loaded into memory with pandas.

In [3]:
df_seed_labels = pd.read_csv('./Data/seed_labels.csv')
df_seed_labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8619 entries, 0 to 8618
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   eoa         8619 non-null   object
 1   prediction  8619 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 134.8+ KB


In [4]:
df_seed_labels.head(10)

Unnamed: 0,eoa,prediction
0,0x08928dcb5bd0b58f8de51921eaade7da8146d259,0
1,0x156ddb799cc9ecb1a099f3fac2a3ce5affef266e,0
2,0x12924049e2d21664e35387c69429c98e9891a820,1
3,0x14dc78bf7e2021cf7d080b7c0eb8d644e1a8751d,0
4,0x1a795b4d5ebce06f04381e879cf1219f3b43fb19,0
5,0x156ac784bae328f6afb6403460145ba8b029d3b5,0
6,0x1778d07c9f85ef0663e18c5bf8db042f70c3d08e,0
7,0x156dd612e9dfb0b448b6a164e14b36634ba86dcb,0
8,0x1302a937bc0fe8a06f124e644236c876b639f54e,0
9,0x020b0d4c844e0dbca51c9ab779df0191978c0359,0


In [5]:
df_transcation_native = pd.read_parquet('./Data/transaction_native_seeder.parquet')
df_transcation_native.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 431957 entries, 0 to 431956
Data columns (total 11 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   TX_HASH                    431957 non-null  object        
 1   BLOCK_NUMBER               431957 non-null  float64       
 2   BLOCK_TIMESTAMP            431957 non-null  datetime64[us]
 3   FROM_ADDRESS               431957 non-null  object        
 4   TO_ADDRESS                 431957 non-null  object        
 5   ORIGIN_FROM_ADDRESS        431957 non-null  object        
 6   ORIGIN_TO_ADDRESS          431956 non-null  object        
 7   ORIGIN_FUNCTION_SIGNATURE  431957 non-null  object        
 8   AMOUNT_PRECISE_RAW         431957 non-null  object        
 9   AMOUNT                     431957 non-null  float64       
 10  AMOUNT_USD                 431957 non-null  float64       
dtypes: datetime64[us](1), float64(3), object(7)
memory u

## Variables signification for native

TX_HASH : Transaction hash is a unique 66-character identifier that is generated when a transaction is executed. This will not be unique in this table as a transaction could include multiple transfer events.

BLOCK_NUMBER : Also known as block height. The block number, which indicates the length of the blockchain, increases after the addition of each new block.

BLOCK_TIMESTAMP : The date and time at which the block was produced.

FROM_ADDRESS / TO_ADDRESS : The sending/receiving address of this transfer (might be a contract).

ORIGIN_FROM_ADDRESS / ORIGIN_TO_ADDRESS : The from/to address of this transaction.

ORIGIN_FUNCTION_SIGNATURE : The function signature of this transaction.

AMOUNT_PRECISE_RAW : The precise, unadjusted amount of the transaction. This is returned as a string to avoid precision loss.

AMOUNT : The decimal-adjusted amount of the transfer. In this dataset it is loaded as float64 (see the `info()` output above), so use AMOUNT_PRECISE_RAW when exact values are needed.

AMOUNT_USD : ETH value transferred, in USD.
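The remark about precision loss can be illustrated with a minimal sketch. It assumes the native asset uses 18 decimals (as ETH does); `raw_to_units` is a hypothetical helper name, not part of the dataset or of pandas. The raw value is taken from the sample rows of `df_transcation_native.head(10)`.

```python
from decimal import Decimal, localcontext

def raw_to_units(raw: str, decimals: int = 18) -> Decimal:
    """Convert a raw integer amount (given as a string) to adjusted units, exactly."""
    with localcontext() as ctx:
        ctx.prec = 80  # enough digits for a 256-bit integer (78 decimal digits)
        return Decimal(raw) / (Decimal(10) ** decimals)

raw = "205410784554031472"  # an AMOUNT_PRECISE_RAW value from the sample rows
exact = raw_to_units(raw)
print(exact)

# A float cannot hold this integer exactly, which is why the raw column is a string
print(int(float(raw)) == int(raw))
```

Going through `Decimal` (or plain Python `int`) keeps every digit, whereas casting the raw string to `float64` silently rounds values above 2**53.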

In [6]:
df_transcation_native.head(10)

Unnamed: 0,TX_HASH,BLOCK_NUMBER,BLOCK_TIMESTAMP,FROM_ADDRESS,TO_ADDRESS,ORIGIN_FROM_ADDRESS,ORIGIN_TO_ADDRESS,ORIGIN_FUNCTION_SIGNATURE,AMOUNT_PRECISE_RAW,AMOUNT,AMOUNT_USD
0,0x8bd7d68e5789b88b434318e6c87b4116cd5e028f27f9...,106235297.0,2023-06-29 20:09:31,0x50b0aabf36b21e72add83b8904cb52bfe0171f66,0xe668fefdb351d44c0e2e97ec1293624e92b2fd8c,0x50b0aabf36b21e72add83b8904cb52bfe0171f66,0xe668fefdb351d44c0e2e97ec1293624e92b2fd8c,0x3a1b1d57,690000000000000,0.00069,1.28
1,0x887e2081f2ddf58141ee4b91a05c89de523a90ae7cd0...,106241518.0,2023-06-29 23:36:53,0x4e29fa717fb61753e26885421b84ff7e06df585e,0x915b6fdb668bee599a6b3afc70873253e917b747,0x4e29fa717fb61753e26885421b84ff7e06df585e,0x915b6fdb668bee599a6b3afc70873253e917b747,0x,29960000000000,3e-05,0.06
2,0xcada150cc9065785e951b23721190fed546909262ef9...,106226009.0,2023-06-29 14:59:55,0x5507dbd48a5a5bace8a6030e878cc4e0af147c33,0x2d4b7ec9923b9cf22d87ced721e69e1f8ed96a0a,0x5507dbd48a5a5bace8a6030e878cc4e0af147c33,0x2d4b7ec9923b9cf22d87ced721e69e1f8ed96a0a,0x30b84454,690000000000000,0.00069,1.28
3,0x007060442229c6453024af72102193a9b33191d48074...,106241929.0,2023-06-29 23:50:35,0x4d73adb72bc3dd368966edd0f0b2148401a178e2,0xac0b8956b436bf33f7a1d441a6c53a4b26895e54,0xac0b8956b436bf33f7a1d441a6c53a4b26895e54,0xdd69db25f6d620a7bad3023c5d32761d353d3de9,0x51905636,107842888682958,0.000108,0.2
4,0xd3c6396c8f16a1a7e02349357d93c986b1bafbac7169...,106226127.0,2023-06-29 15:03:51,0xd9185e233575f4e0d0e83159fdc6dfe9107bbf4d,0xce16f69375520ab01377ce7b88f5ba8c48f8d666,0xd9185e233575f4e0d0e83159fdc6dfe9107bbf4d,0xce16f69375520ab01377ce7b88f5ba8c48f8d666,0x8ca3bf68,1199811652325841,0.0012,2.22
5,0x9dd0f45d87754e32805042bc6a7ede1f280e3c5bcba2...,106242147.0,2023-06-29 23:57:51,0xd2578c95c2daf87e7542d4c305c95cef01295877,0x5130f6ce257b8f9bf7fac0a0b519bd588120ed40,0xd2578c95c2daf87e7542d4c305c95cef01295877,0x5130f6ce257b8f9bf7fac0a0b519bd588120ed40,0x0ce9a63d,10000000000000,1e-05,0.02
6,0xc9802461389f60bf943725ecd0f3d74ef842ae9ae55c...,106227928.0,2023-06-29 16:03:53,0x81e877dd467f65b79aff559a8fafed6e95f01ad8,0x448e2c2988a077086e7035ea56ed383f45cd0cc0,0x81e877dd467f65b79aff559a8fafed6e95f01ad8,0x448e2c2988a077086e7035ea56ed383f45cd0cc0,0x3a1b1d57,690000000000000,0.00069,1.28
7,0x4a3bb2ca80e988721195c494bf3c5a550f7f89fadbb7...,106211928.0,2023-06-29 07:10:33,0x6648d39dd323adc816cadcada011d2c497c5257b,0xd85b5e176a30edd1915d6728faebd25669b60d8b,0x6648d39dd323adc816cadcada011d2c497c5257b,0xd85b5e176a30edd1915d6728faebd25669b60d8b,0xe5585666,346115804033552,0.000346,0.64
8,0x5b6c30edafa006a84145a0ca945a7a3cbdde6fad00c5...,106215193.0,2023-06-29 08:59:23,0xfd7bd29e1050932829c1fc080ea42d7394c42847,0xb49c4e680174e331cb0a7ff3ab58afc9738d5f8b,0xfd7bd29e1050932829c1fc080ea42d7394c42847,0xb49c4e680174e331cb0a7ff3ab58afc9738d5f8b,0x1114cd2a,205410784554031472,0.205411,379.68
9,0x4fca9f6d0d15596bc02daafd91245d0423ed4df45dd3...,106212320.0,2023-06-29 07:23:37,0xeafb1a7bb2da730911e246ba071a59551d581a40,0x921e8b903cbaf28747fad40b3846420b41bcf970,0xeafb1a7bb2da730911e246ba071a59551d581a40,0x921e8b903cbaf28747fad40b3846420b41bcf970,0x,1000000000000000,0.001,1.85


In [7]:
df_transcation_token = pd.read_parquet('./Data/transaction_token_seeder.parquet')
df_transcation_token.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 903458 entries, 0 to 903457
Data columns (total 11 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   TX_HASH                    903458 non-null  object 
 1   BLOCK_NUMBER               903458 non-null  float64
 2   FROM_ADDRESS               903458 non-null  object 
 3   TO_ADDRESS                 903458 non-null  object 
 4   ORIGIN_FROM_ADDRESS        903458 non-null  object 
 5   ORIGIN_TO_ADDRESS          903437 non-null  object 
 6   ORIGIN_FUNCTION_SIGNATURE  903458 non-null  object 
 7   AMOUNT                     901814 non-null  float64
 8   AMOUNT_USD                 259047 non-null  float64
 9   RAW_AMOUNT_PRECISE         903458 non-null  object 
 10  CONTRACT_ADDRESS           903458 non-null  object 
dtypes: float64(3), object(8)
memory usage: 75.8+ MB


## Variables signification for tokens

TX_HASH : Transaction hash is a unique 66-character identifier that is generated when a transaction is executed. This will not be unique in this table as a transaction could include multiple transfer events.

BLOCK_NUMBER : Also known as block height. The block number, which indicates the length of the blockchain, increases after the addition of each new block. 

FROM_ADDRESS / TO_ADDRESS : The sending/receiving address of this transfer (might be a contract).

ORIGIN_FROM_ADDRESS / ORIGIN_TO_ADDRESS : The from/to address of this transaction.

ORIGIN_FUNCTION_SIGNATURE : The function signature of this transaction.

RAW_AMOUNT_PRECISE : The precise, unadjusted amount of the transfer. This is returned as a string to avoid precision loss.

AMOUNT : The decimal-adjusted amount of the transfer. In this dataset it is loaded as float64 and has a few missing values (901814 non-null out of 903458).

AMOUNT_USD : Token value transferred, in USD. It is missing for most rows (259047 non-null out of 903458).

CONTRACT_ADDRESS : Contract address of the token being transferred.

In [8]:
df_transcation_token.head(10)

Unnamed: 0,TX_HASH,BLOCK_NUMBER,FROM_ADDRESS,TO_ADDRESS,ORIGIN_FROM_ADDRESS,ORIGIN_TO_ADDRESS,ORIGIN_FUNCTION_SIGNATURE,AMOUNT,AMOUNT_USD,RAW_AMOUNT_PRECISE,CONTRACT_ADDRESS
0,0x540b4d4b9b7a685b3e5d6a4931b6cc7c403803a43e7b...,5439957.0,0x79430e903e5c476a7bf8f0f450fbdc4d6bc91fab,0xe74fa58448aedd329ddf07404b9babf76a31b978,0x79430e903e5c476a7bf8f0f450fbdc4d6bc91fab,0x00c0184c0b5d42fba6b7ca914b31239b419ab80b,0x06e75722,187.873276,187.986854,187873275607474489678,0xda10009cbd5d07dd0cecc66161fc93d7c9000da1
1,0xad54732b5d614551d59f751b43d1d3d031fdc7f710c1...,5386474.0,0xa3128d9b7cca7d5af29780a56abeec12b05a6740,0x1b054f3b45ab48d58282448070861a8b67a0dfd8,0x1b054f3b45ab48d58282448070861a8b67a0dfd8,0xdef1abe32c034e558cdd535791643c58a13acc10,0x415565b0,11.67154,69.490815,11671539969587606306,0x8700daec35af8ff88c16bdf0418774cb3d7599b4
2,0x9d3d299864750270ea63fdc9734ef2a9d4cc67a9b700...,5394682.0,0xbd93951d2e9ec615f9940887559b4317032d98d0,0x8bde2372200e80d22212dc27b036080e929b779c,0x8bde2372200e80d22212dc27b036080e929b779c,0x68b3465833fb72a70ecdf485e0e4c7bd8665fc45,0x5ae401dc,96.306372,,96306372,0x7f5c764cbc14f9669b88837ca1490cca17c31607
3,0x23931aefbfef07ee47b1b223910a4e00e56820753360...,5394705.0,0x8bde2372200e80d22212dc27b036080e929b779c,0xdecc0c09c3b5f6e92ef4184125d5648a66e35298,0x8bde2372200e80d22212dc27b036080e929b779c,0xb0d502e938ed5f4df2e681fe6e419ff29631d62b,0x87b21efc,96.306372,,96306372,0x7f5c764cbc14f9669b88837ca1490cca17c31607
4,0x9524286d5ef0382770509b669fc691d6f691f72025c7...,5421479.0,0x0000000000000000000000000000000000000000,0x1b054f3b45ab48d58282448070861a8b67a0dfd8,0x1b054f3b45ab48d58282448070861a8b67a0dfd8,0x8700daec35af8ff88c16bdf0418774cb3d7599b4,0x30ead760,4.518495,,4518495228130065000,0x8c6f28f2f1a3c87f0f938b96d27520d9751ec8d9
5,0xb5628d7a87e882252d6ba65fb0761262baa9a47952eb...,5409779.0,0xd593be7d470ba396c180229308edab4e60b98479,0xd7f1dd5d49206349cae8b585fcb0ce3d96f1696f,0xd593be7d470ba396c180229308edab4e60b98479,0xd7f1dd5d49206349cae8b585fcb0ce3d96f1696f,0x83d13e01,50.0,,50000000,0x7f5c764cbc14f9669b88837ca1490cca17c31607
6,0x0e988b08b335ef28e411119b96ff390526e3bf4e2da4...,5410714.0,0x58488bb666d2da33f8e8938dbdd582d2481d4183,0x79430e903e5c476a7bf8f0f450fbdc4d6bc91fab,0x79430e903e5c476a7bf8f0f450fbdc4d6bc91fab,0x58488bb666d2da33f8e8938dbdd582d2481d4183,0xb88a802f,0.169358,,169358,0x7f5c764cbc14f9669b88837ca1490cca17c31607
7,0x59bff338b41c802566785e3389b6f0221baefe8eb7a6...,5396768.0,0x5ae7454827d83526261f3871c1029792644ef1b1,0x8bf0083ecea9bbe0b6ca47bdb3cd1c39f10bdf02,0x8bf0083ecea9bbe0b6ca47bdb3cd1c39f10bdf02,0x5ae7454827d83526261f3871c1029792644ef1b1,0x8875eb84,700.0,,700000000000000000000,0x518b31fa3e8a8cd58fe0ff0c925159ecc8eb872d
8,0x9386c1c90ef7243792b0ce8dd7aaa83077d0f570e681...,5419030.0,0x99c2ff391582b93af89aea9ff0348964d373b02f,0x85149247691df622eaf1a8bd0cafd40bc45154a9,0x99c2ff391582b93af89aea9ff0348964d373b02f,0x7314af7d05e054e96c44d7923e68d66475ffaab8,0xc8d205ac,0.859263,,859263,0x7f5c764cbc14f9669b88837ca1490cca17c31607
9,0xb750dd356cd86f204bf19eee3bc97830e3aff208e5c2...,5396574.0,0x8bf0083ecea9bbe0b6ca47bdb3cd1c39f10bdf02,0x0000000000000000000000000000000000000000,0x8bf0083ecea9bbe0b6ca47bdb3cd1c39f10bdf02,0xe169a8f96d11fcfa82766edacef71181e9d81eb3,0x85149258,412.0,,412000000000000000000,0xf4edd5cc013beea7658b59ebb21cf1dfb9e18d7f


### - All files were successfully loaded to memory.

### - The dataframe sizes roughly match the file sizes, so the files were loaded completely

### - There are some missing values in some columns that must be dealt with
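To see which columns need attention, the missing values can be counted per column. The following is a minimal sketch on a toy frame with made-up values; `missing_summary` is a hypothetical helper, and on the real data it would be called on `df_transcation_native` and `df_transcation_token`.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and percentages, only for columns with gaps."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(2),
    })
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame standing in for df_transcation_token (values are made up)
toy = pd.DataFrame({
    "ORIGIN_TO_ADDRESS": ["0xa", None, "0xc", "0xd"],
    "AMOUNT": [1.0, 2.0, None, 4.0],
    "AMOUNT_USD": [None, None, None, 4.0],
    "CONTRACT_ADDRESS": ["0x1", "0x2", "0x3", "0x4"],
})
summary = missing_summary(toy)
print(summary)
```

According to the `info()` output above, the token table has AMOUNT_USD for only 259047 of 903458 rows (roughly 71% missing), so dropping rows with a missing AMOUNT_USD would discard most token transfers.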

## Performing EDA

In [9]:
print(f"Unique Native Origin From Addresses\t: {len(df_transcation_native['ORIGIN_FROM_ADDRESS'].unique())}")
print(f"Unique Native Origin To Addresses\t: {len(df_transcation_native['ORIGIN_TO_ADDRESS'].unique())}")
print(f"Unique Token Origin From Addresses\t: {len(df_transcation_token['ORIGIN_FROM_ADDRESS'].unique())}")
print(f"Unique Token Origin To Addresses\t: {len(df_transcation_token['ORIGIN_TO_ADDRESS'].unique())}")

Unique Native Origin From Addresses	: 32586
Unique Native Origin To Addresses	: 42324
Unique Token Origin From Addresses	: 36927
Unique Token Origin To Addresses	: 4004


### TO EXPLAIN : Why are there more destinations than sources for native transfers, but the other way around for tokens?

In [10]:
print(f"Unique Native From Addresses\t: {len(df_transcation_native['FROM_ADDRESS'].unique())}")
print(f"Unique Native To Addresses\t: {len(df_transcation_native['TO_ADDRESS'].unique())}")
print(f"Unique Token From Addresses\t: {len(df_transcation_token['FROM_ADDRESS'].unique())}")
print(f"Unique Token To Addresses\t: {len(df_transcation_token['TO_ADDRESS'].unique())}")

Unique Native From Addresses	: 19040
Unique Native To Addresses	: 43059
Unique Token From Addresses	: 17885
Unique Token To Addresses	: 20412


### TO EXPLAIN : Why are there more receivers than senders for both native and token transfers?

### Possible idea to explore :
Since CEX/Bridges are involved in many different transactions, the standard deviation of their amounts may be distinctive: they handle transfers from many different people with varying amounts, while other wallets may follow more regular patterns.

In [11]:
df_group_amount_native = df_transcation_native.groupby('ORIGIN_FROM_ADDRESS').agg({
    'AMOUNT_USD': ['mean', 'std']
})  # The same can be done with AMOUNT_PRECISE_RAW (after casting) and AMOUNT

df_group_amount_native.columns = ['AMOUNT_USD_mean', 'AMOUNT_USD_std']
df_group_amount_native = df_group_amount_native.reset_index()
df_group_amount_native = pd.merge(df_seed_labels, df_group_amount_native, left_on='eoa', right_on='ORIGIN_FROM_ADDRESS')
df_group_amount_native.sort_values(by='AMOUNT_USD_std', ascending=False).head(10)

Unnamed: 0,eoa,prediction,ORIGIN_FROM_ADDRESS,AMOUNT_USD_mean,AMOUNT_USD_std
7958,0xf212ce21a97dbe30999a4c2b309d278bccbb686a,1,0xf212ce21a97dbe30999a4c2b309d278bccbb686a,76296.000645,61200.65712
5276,0xb085c58098753dfd3bf887618db33f2514aaf154,0,0xb085c58098753dfd3bf887618db33f2514aaf154,23149.736,41841.102169
4356,0x8bde2372200e80d22212dc27b036080e929b779c,0,0x8bde2372200e80d22212dc27b036080e929b779c,13120.144348,34409.439375
4548,0x7d0a2cfe9a4a729f912a9b24f3aec2a93436f451,0,0x7d0a2cfe9a4a729f912a9b24f3aec2a93436f451,2417.318981,31497.267225
6837,0xe9e5d197b2dd08463161de2f8848674482501e67,0,0xe9e5d197b2dd08463161de2f8848674482501e67,17920.478621,22148.301155
5574,0x992dac69827a200ba112a0303fe8f79f03c37d9d,0,0x992dac69827a200ba112a0303fe8f79f03c37d9d,4657.086154,17065.054596
5779,0xbcc224605383cb72dc603b1e3b4f4678b371c4dc,0,0xbcc224605383cb72dc603b1e3b4f4678b371c4dc,6925.951667,16750.100393
6704,0xddc976cb693fda9c7570ec68df397623e48815e9,0,0xddc976cb693fda9c7570ec68df397623e48815e9,11454.350517,15820.543682
4622,0x79c4213a328e3b4f1d87b4953c14759399db25e2,0,0x79c4213a328e3b4f1d87b4953c14759399db25e2,4411.868022,14132.211534
5785,0xbca4d68be543dcefb1a8bccb519503f9ba3f2026,0,0xbca4d68be543dcefb1a8bccb519503f9ba3f2026,5216.059556,13440.79979


In [12]:
df_group_amount_token = df_transcation_token.groupby('ORIGIN_FROM_ADDRESS').agg({
    'AMOUNT_USD': ['mean', 'std']
})  # The same can be done with RAW_AMOUNT_PRECISE (after casting) and AMOUNT

df_group_amount_token.columns = ['AMOUNT_USD_mean', 'AMOUNT_USD_std']
df_group_amount_token = df_group_amount_token.reset_index()
df_group_amount_token = pd.merge(df_seed_labels, df_group_amount_token, left_on='eoa', right_on='ORIGIN_FROM_ADDRESS')
df_group_amount_token.sort_values(by='AMOUNT_USD_std', ascending=False).head(10)

Unnamed: 0,eoa,prediction,ORIGIN_FROM_ADDRESS,AMOUNT_USD_mean,AMOUNT_USD_std
2719,0x74eb390c06a7cc1158a0895fb289e5037633e38b,0,0x74eb390c06a7cc1158a0895fb289e5037633e38b,31137.062567,82014.600883
2452,0x6d33ecd723155522d597682df1f0ac10e7d7d9ed,0,0x6d33ecd723155522d597682df1f0ac10e7d7d9ed,22156.117064,71135.687485
4180,0xb085c58098753dfd3bf887618db33f2514aaf154,0,0xb085c58098753dfd3bf887618db33f2514aaf154,31585.157955,47542.170058
6300,0xf212ce21a97dbe30999a4c2b309d278bccbb686a,1,0xf212ce21a97dbe30999a4c2b309d278bccbb686a,10010.493548,31888.880737
3035,0x8314125c8b68af2afd0d151eb4a551e88128a2ae,1,0x8314125c8b68af2afd0d151eb4a551e88128a2ae,4920.174314,28535.910983
6084,0xf35db530c0416106b749c301176155c04e3c684d,0,0xf35db530c0416106b749c301176155c04e3c684d,5849.496664,21092.435216
4426,0x992dac69827a200ba112a0303fe8f79f03c37d9d,0,0x992dac69827a200ba112a0303fe8f79f03c37d9d,8780.575909,20343.405813
4528,0xbb255186c84eaec8c9ab9fa4b5f4121beca89ee4,0,0xbb255186c84eaec8c9ab9fa4b5f4121beca89ee4,8749.986873,19848.208956
4582,0xbca4d68be543dcefb1a8bccb519503f9ba3f2026,0,0xbca4d68be543dcefb1a8bccb519503f9ba3f2026,9175.462214,17052.810329
4058,0x9ee457023bb3de16d51a003a247baead7fce313d,0,0x9ee457023bb3de16d51a003a247baead7fce313d,14027.849497,16545.362599


### Possible idea to explore :
Since CEX/Bridges are involved in many different transactions, the timing of the transactions can be helpful (other wallets may perform their transactions in specific time periods):

-Create new features about different times of the day and different periods of time (days, weeks, months, seasons, etc)
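The time-based features mentioned above could look like the following minimal sketch. It assumes a `BLOCK_TIMESTAMP` column, which the native table has (the token table in this extract does not); `add_time_features` and `hourly_profile` are hypothetical helper names, and the toy addresses are made up.

```python
import pandas as pd

def add_time_features(df: pd.DataFrame, ts_col: str = "BLOCK_TIMESTAMP") -> pd.DataFrame:
    """Return a copy of df with simple calendar features derived from ts_col."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_col])
    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = ts.dt.dayofweek >= 5
    out["month"] = ts.dt.month
    return out

def hourly_profile(df: pd.DataFrame, address_col: str = "ORIGIN_FROM_ADDRESS") -> pd.DataFrame:
    """Share of each address's transactions per hour of day (each row sums to 1)."""
    return pd.crosstab(df[address_col], df["hour"], normalize="index")

# Toy data: two timestamps from the sample rows plus one made-up weekend timestamp
toy = pd.DataFrame({
    "ORIGIN_FROM_ADDRESS": ["0xaaa", "0xaaa", "0xbbb"],
    "BLOCK_TIMESTAMP": ["2023-06-29 20:09:31", "2023-06-29 23:36:53", "2023-07-01 03:00:00"],
})
feats = add_time_features(toy)
profile = hourly_profile(feats)
print(feats[["hour", "day_of_week", "is_weekend", "month"]])
print(profile)
```

A per-address hourly profile is one way to turn transaction-level timestamps into address-level features that can be merged with `df_seed_labels` on `eoa`.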

### Idea explored :
CEX/Bridges perform more operations but are not detected until they surpass a threshold. Find out whether the structure of the transaction network is informative :

-Create graph in which nodes are addresses, edges are transactions

-Compute node degrees and see if there is a correlation between them and CEX/Bridge addresses

-Explore more advanced algorithms, such as PageRank. PageRank was originally introduced to rank web pages by importance: if important pages link to a page, then that page is important too.
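To make the PageRank intuition concrete, here is a simplified power-iteration sketch in plain Python on a toy directed graph. It is an illustration under stated assumptions (damping factor 0.85, fixed number of iterations, made-up node names), not the implementation used by any particular library.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Simplified power-iteration PageRank on a directed edge list [(src, dst), ...]."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n_nodes = len(nodes)
    rank = {n: 1.0 / n_nodes for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping) / n_nodes + damping * dangling / n_nodes for n in nodes}
        # Every node passes its rank on to the nodes it points to, in equal shares
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Toy graph: three addresses send to "hub", and "hub" sends to "d"
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "d"), ("e", "a")]
scores = pagerank(edges)
for node, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}\t{score:.4f}")
```

In this toy graph, "d" has a single incoming edge, but it comes from the well-connected "hub", so "d" ends up with the highest score. Degree alone would not capture this.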

In [20]:
import networkx as nx

# Keep only rows where both endpoints of the transaction are known
df_transcation_native_clean = df_transcation_native[df_transcation_native['ORIGIN_FROM_ADDRESS'].notna() & df_transcation_native['ORIGIN_TO_ADDRESS'].notna()]
df_transcation_token_clean = df_transcation_token[df_transcation_token['ORIGIN_FROM_ADDRESS'].notna() & df_transcation_token['ORIGIN_TO_ADDRESS'].notna()]

# Attach the seed labels to both endpoints; how='right' keeps every transaction row.
# After the second merge, the label columns carry _x (TO side) and _y (FROM side) suffixes.
df_transcation_native_clean = pd.merge(df_seed_labels, df_transcation_native_clean, left_on='eoa', right_on='ORIGIN_FROM_ADDRESS', how='right')
df_transcation_native_clean = pd.merge(df_seed_labels, df_transcation_native_clean, left_on='eoa', right_on='ORIGIN_TO_ADDRESS', how='right')
df_transcation_token_clean = pd.merge(df_seed_labels, df_transcation_token_clean, left_on='eoa', right_on='ORIGIN_FROM_ADDRESS', how='right')
df_transcation_token_clean = pd.merge(df_seed_labels, df_transcation_token_clean, left_on='eoa', right_on='ORIGIN_TO_ADDRESS', how='right')



G = nx.MultiDiGraph()

for _, row in df_transcation_native_clean.iterrows():
    G.add_edge(row['ORIGIN_FROM_ADDRESS'], row['ORIGIN_TO_ADDRESS'], edge_attr = row['AMOUNT_USD'])

for _, row in df_transcation_token_clean.iterrows():
    G.add_edge(row['ORIGIN_FROM_ADDRESS'], row['ORIGIN_TO_ADDRESS'], edge_attr = row['AMOUNT_USD'])

In [21]:
df_degree = pd.DataFrame(G.degree(), columns=["Node", "Degree"])
df_degree = pd.merge(df_seed_labels, df_degree, left_on='eoa', right_on='Node')
df_degree.sort_values(by='Degree', ascending=False).head(20)

Unnamed: 0,eoa,prediction,Node,Degree
619,0x1d19da85322c5f14201be546c326e0e6f521b6e6,0,0x1d19da85322c5f14201be546c326e0e6f521b6e6,5314
5509,0xb09c48582db808c8043d0eb982b9610d79d9c0e1,1,0xb09c48582db808c8043d0eb982b9610d79d9c0e1,4753
2,0x12924049e2d21664e35387c69429c98e9891a820,1,0x12924049e2d21664e35387c69429c98e9891a820,4548
6719,0xdbee2b501021ec7d8d6fe48f1829118178d2e4a3,0,0xdbee2b501021ec7d8d6fe48f1829118178d2e4a3,3233
2267,0x4e29fa717fb61753e26885421b84ff7e06df585e,1,0x4e29fa717fb61753e26885421b84ff7e06df585e,2950
6750,0xe48fe6012f97b6a13c0ce5cef314caf66e972deb,0,0xe48fe6012f97b6a13c0ce5cef314caf66e972deb,2729
2238,0x4f95ad114fbddf8df0756017e9bc856e730b2796,0,0x4f95ad114fbddf8df0756017e9bc856e730b2796,2527
5847,0xcc0d8f409b149b92c089b5a9177331338671501c,0,0xcc0d8f409b149b92c089b5a9177331338671501c,2427
7102,0xea4ddc1921a322dae69458920e9c6a61d0a4f7aa,0,0xea4ddc1921a322dae69458920e9c6a61d0a4f7aa,2399
3862,0x8314125c8b68af2afd0d151eb4a551e88128a2ae,1,0x8314125c8b68af2afd0d151eb4a551e88128a2ae,2384


In [22]:
pagerank = nx.pagerank(G)

In [23]:
df_pagerank = pd.DataFrame(list(pagerank.items()), columns=["Node", "PageRank"])
df_pagerank = pd.merge(df_seed_labels, df_pagerank, left_on='eoa', right_on='Node')
df_pagerank.sort_values(by=['PageRank'], ascending=False).head(20)

Unnamed: 0,eoa,prediction,Node,PageRank
3751,0x8f3c665cfd74fab1025e5693d80aaee3a4967398,0,0x8f3c665cfd74fab1025e5693d80aaee3a4967398,0.003176
6907,0xe4dac04b034144408c8dbdaa8a5c346e23dae5bc,0,0xe4dac04b034144408c8dbdaa8a5c346e23dae5bc,0.001516
183,0x19a4f9727e598476081698fb588a524f7828fe37,0,0x19a4f9727e598476081698fb588a524f7828fe37,0.00117
3808,0x787513072f5ed215e14f488325a27185ca0bbec9,0,0x787513072f5ed215e14f488325a27185ca0bbec9,0.000931
6867,0xe22373cd211161bb4432f5cdb1250aab337f4069,0,0xe22373cd211161bb4432f5cdb1250aab337f4069,0.000598
6757,0xd3fa2fdbd04f5c64d6d8df4af08dd671e6fce842,0,0xd3fa2fdbd04f5c64d6d8df4af08dd671e6fce842,0.000522
3248,0x631d375d7b49803750eda3d7b2c8078b0f05006d,0,0x631d375d7b49803750eda3d7b2c8078b0f05006d,0.000496
5538,0x9ff190a3e0f3e2fe19a3f1e848c9137a7838d553,0,0x9ff190a3e0f3e2fe19a3f1e848c9137a7838d553,0.00045
2782,0x45a318273749d6eb00f5f6ca3bc7cd3de26d642a,1,0x45a318273749d6eb00f5f6ca3bc7cd3de26d642a,0.000437
4587,0x936bd33f4f684adcd6da3b96fd1f72e034ea54d4,0,0x936bd33f4f684adcd6da3b96fd1f72e034ea54d4,0.000346


### Discussion
-The top-ranked rows above show no strong relationship between the CEX/Bridge label and the amount statistics or the centrality measures (degree, PageRank)

-More domain knowledge is necessary to continue this EDA and explore new ideas.

-Can GNNs help?
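The first point above rests on inspecting only the top-ranked rows. A more systematic check could compare each feature across the two classes, as in the minimal sketch below. `label_feature_summary` is a hypothetical helper and the toy values are made up; on the real data it could be applied to `df_degree` with `"Degree"` or to `df_pagerank` with `"PageRank"`.

```python
import pandas as pd

def label_feature_summary(df: pd.DataFrame, feature: str, label: str = "prediction"):
    """Per-class summary of a feature plus a rank-based correlation with the label."""
    per_class = df.groupby(label)[feature].agg(["count", "median", "mean"])
    # Pearson correlation between the feature's ranks and the binary label;
    # ranks make the measure robust to the heavy tails seen in degrees and amounts
    rank_corr = df[feature].rank().corr(df[label])
    return per_class, rank_corr

# Toy frame standing in for df_degree (values are made up)
toy = pd.DataFrame({
    "prediction": [0, 0, 0, 0, 1, 1],
    "Degree": [1, 2, 3, 4, 50, 60],
})
per_class, rank_corr = label_feature_summary(toy, "Degree")
print(per_class)
print(f"rank correlation with label: {rank_corr:.3f}")
```

A value close to zero on the real data would support the conclusion above, while a clearly positive or negative value would suggest that the feature deserves a place in the model.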