# On-Chain Clustering
This notebook clusters BTC addresses and entities based on their interaction with the Lightning Network (LN). The result is a mapping between BTC entities and "components" (stars, snakes, collectors, or proxies), which the linking heuristics need. The sections are:

1. Create On-Chain Clusters
    - 1.1 ...
    - 1.2 ...
2. Verify and Sort Mapping

# 1. Create On-Chain Clusters [Bernhard]
Here we obtain on-chain clusters of BTC entities based on the channels they open and close in the LN.

In [None]:
# inputs: funding_addresses_csv_file, settlement_addresses_csv_file
# outputs: star_file, snake_file, collector_file, proxy_file, funding_address_entity_file, settlement_address_entity_file

# 2. Verify and Sort Mapping

Here we check that no entity appears in more than one type of component, and then we assign a unique identifier to each component.

#### Inputs (made available):
- `patterns_files` (stars, snakes, collectors, proxies)

#### Outputs (made available):
- `patterns_sorted_files` (stars, snakes, collectors, proxies)

In [8]:
import sys
sys.path.append("..")

from utils import df_to_two_dicts, patterns_list, write_json

# input files
from utils import patterns_files # stars, snakes, collectors, proxies

# output files
from utils import patterns_sorted_files

import pandas as pd

In [3]:
# for each pattern, a pair of dicts: [0] is keyed by entity, [1] is keyed by component
pattern_double_mapping = dict()
for pattern in patterns_list:
    pattern_double_mapping[pattern] = df_to_two_dicts(pd.read_csv(patterns_files[pattern]))
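The helper `df_to_two_dicts` lives in `utils` and is not shown in this notebook. As a rough illustration only, a two-way mapping of this kind could be built as in the sketch below. The function name `two_way_mapping`, the `(entity, component)` row layout, and the list-valued reverse dict are all assumptions made for this example, not the actual implementation.

```python
def two_way_mapping(rows):
    """Build (entity -> component, component -> [entities]) from (entity, component) rows.

    Hypothetical stand-in for a helper like df_to_two_dicts; the real one may differ.
    With a two-column DataFrame, rows could be df.itertuples(index=False, name=None).
    """
    entity_to_component = {}
    component_to_entities = {}
    for entity, component in rows:
        entity_to_component[entity] = component
        component_to_entities.setdefault(component, []).append(entity)
    return entity_to_component, component_to_entities


# toy data: entities 10 and 11 belong to component 0, entity 12 to component 1
toy_rows = [(10, 0), (11, 0), (12, 1)]
entity_to_component, component_to_entities = two_way_mapping(toy_rows)
```

Keeping both directions makes the later cells cheap: the overlap check only needs the entity keys, and the identifier assignment only needs the component keys.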

In [4]:
# check that there is no entity overlap between stars, snakes, collectors and proxies
# (only four of the six possible pairs are checked; stars-proxies and stars-collectors are not)
print('overlap of entities snakes-stars:')
print(len(set(pattern_double_mapping['snakes'][0]).intersection(set(pattern_double_mapping['stars'][0]))))
print('overlap of entities snakes-proxies:')
print(len(set(pattern_double_mapping['snakes'][0]).intersection(set(pattern_double_mapping['proxies'][0]))))
print('overlap of entities snakes-collectors:')
print(len(set(pattern_double_mapping['snakes'][0]).intersection(set(pattern_double_mapping['collectors'][0]))))
print('overlap of entities proxies-collectors:')
print(len(set(pattern_double_mapping['proxies'][0]).intersection(set(pattern_double_mapping['collectors'][0]))))

overlap of entities snakes-stars:
0
overlap of entities snakes-proxies:
0
overlap of entities snakes-collectors:
0
overlap of entities proxies-collectors:
0
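The cell above checks four of the six possible pattern pairs. A minimal sketch of a check that covers every pair is given here, using `itertools.combinations`. The function name `pairwise_overlaps` and the toy entity sets are assumptions made for illustration; they are not taken from the notebook's data.

```python
from itertools import combinations


def pairwise_overlaps(entities_by_pattern):
    """Return {(pattern_a, pattern_b): shared entities} for every unordered pair."""
    overlaps = {}
    for a, b in combinations(sorted(entities_by_pattern), 2):
        overlaps[(a, b)] = set(entities_by_pattern[a]) & set(entities_by_pattern[b])
    return overlaps


# toy data (hypothetical): entity 5 is deliberately placed in two patterns
toy_entities = {
    "stars": {1, 2},
    "snakes": {3},
    "collectors": {4, 5},
    "proxies": {5, 6},
}
overlaps = pairwise_overlaps(toy_entities)
for pair, shared in overlaps.items():
    print(pair, len(shared))
```

With the notebook's data, the input would presumably be the entity-keyed dicts, for example `{p: pattern_double_mapping[p][0] for p in patterns_list}`.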


In [5]:
# create a unique identifier for each component
i = 1  # start at 1 so that no ID is 0 (a negated 0 would not be distinguishable from 0)
component_sorted_mapping_dict = dict()
for pattern in patterns_list:
    component_sorted_mapping_dict[pattern] = dict()
    for component in pattern_double_mapping[pattern][1]:
        component_sorted_mapping_dict[pattern][component] = i
        i += 1
    # i is the next unused ID, so the last ID assigned to this pattern is i - 1
    print(pattern, 'till', i)

stars till 53
snakes till 5691
collectors till 7167
proxies till 8156
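Because the counter runs across all patterns, each pattern receives its own contiguous block of IDs, and the printed value is the first ID of the next block. In the output above, stars therefore hold IDs 1 to 52 and snakes start at 53. The sketch below reproduces this on toy data; the function name `assign_component_ids` and the component labels are assumptions made for the example.

```python
def assign_component_ids(components_by_pattern, start=1):
    """Assign consecutive IDs across all patterns; return (ids, inclusive ID ranges)."""
    next_id = start
    ids = {}
    ranges = {}
    for pattern, components in components_by_pattern.items():
        first = next_id
        ids[pattern] = {}
        for component in components:
            ids[pattern][component] = next_id
            next_id += 1
        # inclusive range; an empty pattern would give (first, first - 1)
        ranges[pattern] = (first, next_id - 1)
    return ids, ranges


# toy data (hypothetical component labels)
toy_components = {"stars": ["s0", "s1"], "snakes": ["k0", "k1", "k2"]}
ids, ranges = assign_component_ids(toy_components)
print(ranges)
```

Since the blocks never overlap, an ID alone is enough to tell which pattern a component belongs to.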


In [9]:
# write each pattern's component-to-ID mapping to its output file
for pattern in patterns_list:
    write_json(component_sorted_mapping_dict[pattern], patterns_sorted_files[pattern], True)