### Levenshtein distance

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

With these distances computed, we can compare how close any pair of handles is, which gives an indication of whether an account is likely to be a sybil or not.
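To make the definition above concrete, here is a minimal pure-Python sketch of the standard dynamic-programming computation. It is for illustration only: the notebook itself relies on `jellyfish.levenshtein_distance`, and the function name `levenshtein` below is a hypothetical helper, not part of that library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))     # 3
print(levenshtein("maozx001", "maozx002"))  # 1
```

Two handles that differ only in a trailing digit are one edit apart, which is the kind of near-duplicate this analysis looks for.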

### Importing the necessary libraries

In [None]:
## To install the jellyfish library.

!pip install jellyfish

In [1]:
## For data manipulation
import pandas as pd
from difflib import SequenceMatcher

## For numerical computations
import numpy as np

## For Visualizations
import seaborn as sns 
import matplotlib.pyplot as plt

## To calculate the Levenshtein distance.
import jellyfish

### Importing the dataset needed for the analysis

In [1]:
### Import the dataset here

In [3]:
# Storing each reviewer dataset in a list, for easy iteration.


names = [Adebola, AM2021, Anuj, Elbeth, BFA, Iamgold, Bobjiang, LKH, FizzyMidas,Tricelex, OxProof,
         Emmanel, Doggfather, EmmanuelJacobson, Flobisnitz, Greg, GreyTrainer, Stelescuvlad,
         Jshua, Kylin, TheHound, MountManu, Nadalie, Richard, Ogunsojosam, RobotTeddy,
         AnnAnna, Socal, Steegecs, ViktorLiu, Wolfman, ZER8, Kishoraditya, AnnAnna2, Waka, CassCee ]

In [4]:
# Loop over every reviewer's human-evaluation sheet and rename the "Unnamed: 0" column
# to "handle", so that all the datasets used in the analysis share a common column.
for i in names:
    i.rename({"Unnamed: 0": "handle"}, axis=1, inplace=True)

Since the analysis compares all the available handles, we will concatenate the datasets of each reviewer into a single dataset.

This dataset will have all the handles reviewed.

In [5]:
concat_data = pd.concat(names)

In [6]:
concat_data.shape

(2050, 6)

After concatenating the datasets of each reviewer, we have 2050 handles.

Since our analysis is based on comparisons of handles, we can drop rows with duplicated handles to avoid repetition.

In [7]:
concat_data.drop_duplicates(subset=["handle"], inplace=True)

In [8]:
concat_data.shape

(1432, 6)

Now we have a dataset with 1432 rows of unique handles.
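The concatenate-then-deduplicate step can be seen on a toy example. This is a sketch with two made-up reviewer sheets (`reviewer_a` and `reviewer_b` are hypothetical, as are their contents); `ignore_index=True` is an optional addition that the notebook does not use.

```python
import pandas as pd

# Two hypothetical reviewer sheets that share one handle ("bob").
reviewer_a = pd.DataFrame({"handle": ["alice", "bob"],
                           "Is Sybil? (T or F)": ["F", "T"]})
reviewer_b = pd.DataFrame({"handle": ["bob", "carol"],
                           "Is Sybil? (T or F)": ["T", "F"]})

# Stack the sheets, renumbering the rows instead of repeating each sheet's index.
combined = pd.concat([reviewer_a, reviewer_b], ignore_index=True)

# Keep only the first row seen for each handle.
unique_handles = combined.drop_duplicates(subset=["handle"])

print(combined.shape)        # (4, 2)
print(unique_handles.shape)  # (3, 2)
```

Note that `drop_duplicates` keeps the first occurrence by default, so when several reviewers rated the same handle, only the first reviewer's row survives. That is fine here because only the handle strings are compared.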

In [139]:
Anuj = ## Import the datasets for each reviewer here.
Elbeth = ## Import the datasets for each reviewer here.
BFA = ## Import the datasets for each reviewer here.
Iamgold = ## Import the datasets for each reviewer here.
Bobjiang = ## Import the datasets for each reviewer here.
LKH = ## Import the datasets for each reviewer here.
FizzyMidas = ## Import the datasets for each reviewer here.
Tricelex = ## Import the datasets for each reviewer here.
OxProof = ## Import the datasets for each reviewer here.
Emmanel = ## Import the datasets for each reviewer here.
Doggfather = ## Import the datasets for each reviewer here.
EmmanuelJacobson = ## Import the datasets for each reviewer here.
Flobisnitz = ## Import the datasets for each reviewer here.
Greg = ## Import the datasets for each reviewer here.
GreyTrainer = ## Import the datasets for each reviewer here.
Stelescuvlad = ## Import the datasets for each reviewer here.
TheHoundNo2 = ## Import the datasets for each reviewer here.
Jshua = ## Import the datasets for each reviewer here.
Kish = ## Import the datasets for each reviewer here.
Kylin = ## Import the datasets for each reviewer here.
Z4yr0 = ## Import the datasets for each reviewer here.
TheHound = ## Import the datasets for each reviewer here.
MountManu = ## Import the datasets for each reviewer here.
Nadalie = ## Import the datasets for each reviewer here.
Richard = ## Import the datasets for each reviewer here.
Ogunsojosam = ## Import the datasets for each reviewer here.
AM2021 = ## Import the datasets for each reviewer here.
RobotTeddy = ## Import the datasets for each reviewer here.
AnnAnna = ## Import the datasets for each reviewer here.
Socal = ## Import the datasets for each reviewer here.
Steegecs = ## Import the datasets for each reviewer here.
ViktorLiu = ## Import the datasets for each reviewer here.
Wolfman = ## Import the datasets for each reviewer here.
ZER8 = ## Import the datasets for each reviewer here.
Adebola = ## Import the datasets for each reviewer here.

In [140]:
names = [Adebola, AM2021, Anuj, Elbeth, BFA, Iamgold, Bobjiang, LKH, FizzyMidas,Tricelex, OxProof,
         Emmanel, Doggfather, EmmanuelJacobson, Flobisnitz, Greg, GreyTrainer, Stelescuvlad,
        Kish, Jshua, Kylin, Z4yr0, TheHound, MountManu, Nadalie, Richard, Ogunsojosam, RobotTeddy,
         AnnAnna, Socal, Steegecs, ViktorLiu, Wolfman, ZER8, TheHoundNo2, ]

In [145]:
# Loop over every reviewer's human-evaluation sheet and rename the "Unnamed: 0" column
# to "handle", so that all the datasets used in the analysis share a common column.
for i in names:
    i.rename({"Unnamed: 0": "handle"}, axis=1, inplace=True)

In [146]:
concat_data1 = pd.concat(names)

In [147]:
concat_data1.head()

Unnamed: 0,handle,Is Sybil? (T or F),"Confidence (low, so-so, high)",Notes,gitcoin_url,github_url
0,danielsheldon-eth,T,low,"new accounts (github and gitcoin), just one gr...",https://gitcoin.co/danielsheldon-eth,https://github.com/danielsheldon-eth
1,coinkafasi,F,High,relatively old gitcoin account with a lot of a...,https://gitcoin.co/coinkafasi,https://github.com/coinkafasi
2,laiyazhou,F,High,Old accounts with alot of activities,https://gitcoin.co/laiyazhou,https://github.com/laiyazhou
3,tsaikoga,F,High,Old accounts with alot of activities,https://gitcoin.co/tsaikoga,https://github.com/tsaikoga
4,khaitick,F,so-so,old accounts with little fundings spread acros...,https://gitcoin.co/khaitick,https://github.com/khaitick


In [148]:
concat_data1.drop_duplicates(subset=["handle"], inplace=True)
concat_data1.shape

(1695, 6)

In [149]:
def clean(names):
    for i in names:
        i.rename({"Unnamed: 0": "handle"}, axis=1, inplace=True)
    concat_ = pd.concat(names)
    concat_.drop_duplicates(subset=["handle"], inplace=True)
    return concat_
    
    

In [150]:
ZER8 = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=74)
ViktorLiu = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=75)
Wolfman = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=76)
Steegecs = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=77)
Socal = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=78)
AnnAnna = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=79)
Lawrence = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=80)
AM2021 = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=81)
Ogunsojosam = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=82)
Richard = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=83)
Nadalie = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=84)
MountManu = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=85)
TheHound = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=86)
Z4yr0 = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=87)
Kylin = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=88)
Kish = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=89)
Jshua = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=90)
Adebola2 = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=91)
Stelescuvlad = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=92)
GreyTrainer = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=93)
Greg = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=94)
EmmanuelJacobson = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=95)
Flobisnitz = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=96)
FizzyMidas = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=97)
Doggfather = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=98)
Emmanuel = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=99)
OxProof = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=100)
Tricelex = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=101)
Anuj = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=102)
LKH = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=103)
Bobjiang = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=104)
Iamgold = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=105)
BFA = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=106)
Adebola = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=107)
Elbeth = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=108)

In [152]:

names = [Adebola, AM2021, Anuj, Elbeth, BFA, Iamgold, Bobjiang, LKH, FizzyMidas,Tricelex, OxProof,
         Emmanuel, Doggfather, EmmanuelJacobson, Flobisnitz, Greg, GreyTrainer, Stelescuvlad,
        Kish, Jshua, Kylin, Z4yr0, TheHound, MountManu, Nadalie, Richard, Ogunsojosam, 
         AnnAnna, Socal, Steegecs, ViktorLiu, Wolfman, ZER8, Lawrence  ]

In [153]:
concat_data2 = clean(names)

In [154]:
concat_data2.shape

(1820, 6)

## Round 4

In [155]:
ZER8 = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=109)
Wolfman = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=110)
ViktorLiu = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=111)
Steegecs = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=112)
Socal = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=113)
AnnAnna = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=114)
RobotTeddy = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=115)
AM2021 = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=116)
Ogunsojosam = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=117)
Richard = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=118)
Nadalie = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=119)
MountManu = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=120)
TheHound = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=121)
Z4yr0 = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=122)
Kylin = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=123)
Kish = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=124)
Jshua = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=125)
Lawrence = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=126)
Stelescuvlad = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=127)
GreyTrainer = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=128)
Greg = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=129)
EmmanuelJacobson = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=130)
Flobisnitz = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=131)
FizzyMidas = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=132)
Doggfather = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=133)
Emmanuel = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=134)
OxProof = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=135)
Tricelex = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=136)
Walter = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=137)
LKH = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=138)
Bobjiang = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=139)
Iamgold = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=140)
BFA = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=141)
Elbeth = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=142)
Anuj = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=143)
Adebola = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=144)

In [156]:

names = [Adebola, AM2021, Anuj, Elbeth, BFA, Iamgold, Bobjiang, LKH, FizzyMidas,Tricelex, OxProof,
         Emmanuel, Doggfather, EmmanuelJacobson, Flobisnitz, Greg, GreyTrainer, Stelescuvlad,
        Kish, Jshua, Kylin, Z4yr0, TheHound, MountManu, Nadalie, Richard, Ogunsojosam, 
         AnnAnna, Socal, Steegecs, ViktorLiu, Wolfman, ZER8, Lawrence ,Walter, RobotTeddy]

In [157]:
concat_data3 = clean(names)
concat_data3.shape

(2016, 6)

## Round 5

In [158]:
ZER8 = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=143)
Wolfman = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=144)
ViktorLiu = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=145)
Steegecs = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=146)
Socal = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=147)
AnnAnna = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=148)
RobotTeddy = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=149)
AM2021 = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=150)
Ogunsojosam = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=151)
Richard = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=152)
Nadalie = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=153)
MountManu = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=154)
TheHound = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=155)
Z4yr0 = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=156)
Kylin = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=157)
Kish = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=158)
Jshua = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=159)
Anuj = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=160)
Stelescuvlad = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=161)
GreyTrainer = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=162)
Greg = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=163)
sheet11 = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=164)
#EmmanuelJacobson = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=130)
Flobisnitz = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=165)
FizzyMidas = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=166)
Doggfather = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=167)
Emmanuel = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=168)
OxProof = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=169)
Tricelex = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=170)
Adebola = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=171)
LKH = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=172)
Bobjiang = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=173)
Iamgold = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=174)
BFA = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=175)
hes =  pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=176)
Elbeth = pd.read_excel("./GR13_Human_Evaluation_-_FINAL.xlsx", sheet_name=177)

In [159]:
names = [Adebola, AM2021, Anuj, Elbeth, BFA, Iamgold, Bobjiang, LKH, FizzyMidas,Tricelex, OxProof,
         Emmanuel, Doggfather, sheet11, Flobisnitz, Greg, GreyTrainer, Stelescuvlad,
        Kish, Jshua, Kylin, Z4yr0, TheHound, MountManu, Nadalie, Richard, Ogunsojosam, 
         AnnAnna, Socal, Steegecs, ViktorLiu, Wolfman, ZER8, RobotTeddy, hes ]

In [160]:
concat_data4 = clean(names)
concat_data4.shape

(1941, 6)

In [161]:
data = pd.concat([concat_data, concat_data1, concat_data2, concat_data3, concat_data4])
data.shape

(8904, 6)

In [162]:
data.drop_duplicates(subset=["handle"], inplace=True)
data.shape

(6858, 6)

In [10]:
# -*- coding: utf-8 -*-
"""
Created on Wed Jan 26 10:56:48 2022

@author: Ogunjo Samuel
"""

# -*- coding: utf-8 -*-
"""
Created on Sat Jan 22 09:47:59 2022

@author: Ogunjo Samuel
"""

## I limited the handles to the first 1000. You can remove the [:1000] to run it on all the 1432 handles.
x = concat_data["handle"][:1000]

y = x.unique()

## Create a zero matrix of shape len(x) x len(x), here 1000 x 1000.
alpha0 = np.zeros((len(x),len(x)))
## Fill the matrix with the Levenshtein distance between every pair of handles.
for ii,i in enumerate(y):
    
    for jj, j in enumerate(x):
        
        s = jellyfish.levenshtein_distance(i,j)
        
        #print(s)
        alpha0[ii][jj] = s
            
        #k = k + 1; g = g + 1

#fig, ax = plt.subplots(figsize=(25,10))

## This creates the dataframe showing the Levenshtein distance of any pair of handles.
Ld = pd.DataFrame(alpha0,columns=x,index=x)
#hz = hy[hy > 0.4].dropna(axis=1,how='all')
#sns.heatmap(hy, annot=True)
'''
c = ax.pcolor(alpha0)
fig.colorbar(c,ax=ax)
'''

'\nc = ax.pcolor(alpha0)\nfig.colorbar(c,ax=ax)\n'

In [12]:
## Checking the shape of the matrix.
Ld.shape

(1000, 1000)

In [13]:
## This is the result of the Levenshtein distance
Ld

handle,connorpaca,okeaguugochukwu,takgod1,0xfkr,amitabha3366,ashkan6868,rookie23,mywork0,fenghaoda168,filiptronicek,...,setyo14,wackerow,calculator,sumocats,tianyaowang5,davycodez,ernopp,harperjustine,lulambelzi,ferdi071291
connorpaca,0.0,14.0,9.0,9.0,11.0,10.0,8.0,8.0,10.0,11.0,...,9.0,9.0,9.0,8.0,10.0,10.0,7.0,12.0,10.0,11.0
okeaguugochukwu,14.0,0.0,12.0,14.0,14.0,14.0,14.0,13.0,12.0,14.0,...,13.0,12.0,13.0,12.0,13.0,13.0,13.0,14.0,14.0,14.0
takgod1,9.0,12.0,0.0,7.0,10.0,9.0,7.0,7.0,8.0,12.0,...,6.0,6.0,9.0,8.0,9.0,6.0,6.0,12.0,9.0,10.0
0xfkr,9.0,14.0,7.0,0.0,12.0,9.0,7.0,6.0,12.0,12.0,...,7.0,6.0,9.0,8.0,12.0,9.0,6.0,12.0,10.0,10.0
amitabha3366,11.0,14.0,10.0,12.0,0.0,9.0,11.0,11.0,10.0,12.0,...,11.0,11.0,11.0,11.0,11.0,11.0,12.0,12.0,11.0,12.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
davycodez,10.0,13.0,6.0,9.0,11.0,10.0,9.0,7.0,10.0,11.0,...,7.0,8.0,9.0,8.0,9.0,0.0,8.0,12.0,9.0,11.0
ernopp,7.0,13.0,6.0,6.0,12.0,9.0,7.0,6.0,10.0,11.0,...,5.0,7.0,10.0,7.0,10.0,8.0,0.0,11.0,10.0,9.0
harperjustine,12.0,14.0,12.0,12.0,12.0,12.0,12.0,12.0,13.0,11.0,...,12.0,10.0,11.0,12.0,13.0,12.0,11.0,0.0,12.0,12.0
lulambelzi,10.0,14.0,9.0,10.0,11.0,10.0,9.0,10.0,11.0,11.0,...,10.0,8.0,9.0,8.0,11.0,9.0,10.0,12.0,0.0,11.0


Because the resulting matrix is so large, it is difficult to inspect every value directly.

Instead, let's take the mean (average) of all the distances for each account.

In [14]:
ld_avg = pd.DataFrame()
ld_avg["handles"] = Ld.iloc[:, :].mean().index
ld_avg["Average"] = Ld.iloc[:, :].mean().values


In [32]:
ld_avg.head()

Unnamed: 0,handles,Average
0,connorpaca,9.704
1,okeaguugochukwu,13.574
2,takgod1,8.468
3,0xfkr,8.865
4,amitabha3366,11.009


In [24]:
ld_avg.to_csv("./Levenshtein_distance_average_1000.csv")

In [163]:
def Levenshtein_distance(data):
    distances = []
    userss = []
    for name in data["handle"]:
        x = data["handle"]

        y = [name]

        ## Distances from `name` to every handle in x (a vector of length len(x)).
        alpha0 = np.zeros(len(x))
        ## This loop calculates the Levenshtein distance from `name` to each handle.
        for ii,i in enumerate(y):


            for jj, j in enumerate(x):
                userss.append(j)

                s = jellyfish.levenshtein_distance(i,j)

                #print(s)
                alpha0[jj] = s

        distances.append(alpha0)
    return distances, userss

In [168]:
distancess, usersss = Levenshtein_distance(data)

In [169]:
len(distancess)

6858

In [170]:
distance_joined = []
for i in range(6858):
    distance_joined.extend(distancess[i])

In [171]:
handless = []
for i in data["handle"]:
    handless.append([i] * 6858)

In [172]:
list_ = []
for i in range(6858):
    list_.extend(handless[i])

In [174]:
df = pd.DataFrame()
df["Account 1"] = list_
df["account 2"] = usersss
df["Distance"] = distance_joined

In [175]:
df.shape

(47032164, 3)

In [176]:
df.sort_values(by="Distance", inplace=True)
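# Remove the diagonal: after sorting, the first 6858 rows are the zero-distance pairs
# of each handle with itself (this relies on the handles being unique).
dff = df[6858:]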

In [177]:
dff.reset_index(drop=True, inplace=True)

In [179]:
dff.head()

Unnamed: 0,Account 1,account 2,Distance
0,cryptozr27,cryptozr29,1.0
1,maozx001,maozx002,1.0
2,cryptozr38,cryptozr32,1.0
3,jansenliuhappy7,jansenliuhappy2,1.0
4,hghahari,mghahari,1.0
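
The long table holds every ordered pair, so each pair of distinct handles appears twice, once as (a, b) and once as (b, a). A minimal sketch of an alternative filter, assuming the same three column names as above and a made-up four-row table: keeping only rows where the first handle sorts before the second drops both the self-pairs and the mirrored duplicates, without depending on the row order after sorting.

```python
import pandas as pd

# Hypothetical miniature of the long table: all ordered pairs of two handles.
pairs = pd.DataFrame({
    "Account 1": ["maozx001", "maozx002", "maozx001", "maozx002"],
    "account 2": ["maozx001", "maozx001", "maozx002", "maozx002"],
    "Distance": [0.0, 1.0, 1.0, 0.0],
})

# Keep a row only when "Account 1" sorts strictly before "account 2".
# Self-pairs fail the test, and of each mirrored pair exactly one row passes.
unordered = pairs[pairs["Account 1"] < pairs["account 2"]].reset_index(drop=True)

print(unordered)
```

On the full dataset this would roughly halve the number of rows written to disk.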


In [180]:
dff.to_csv("./Levenshtein_distance_total_handles.csv", index=False)

In [182]:
dfs = dff.loc[dff["Distance"] <= 3]

In [183]:
dfs.to_csv("./Handles_with_distance(0-3).csv", index=False)

In [188]:
## This time the computation runs on all the handles in the combined dataset (no [:1000] limit).
x = data["handle"]

y = x.unique()

## Create a zero matrix of shape len(x) x len(x), here 6858 x 6858.
alpha0 = np.zeros((len(x),len(x)))
## Fill the matrix with the Levenshtein distance between every pair of handles.
for ii,i in enumerate(y):
    
    for jj, j in enumerate(x):
        
        s = jellyfish.levenshtein_distance(i,j)
        
        #print(s)
        alpha0[ii][jj] = s
            
        #k = k + 1; g = g + 1

#fig, ax = plt.subplots(figsize=(25,10))

## This creates the dataframe showing the Levenshtein distance of any pair of handles.
total_df = pd.DataFrame(alpha0,columns=x,index=x)
#hz = hy[hy > 0.4].dropna(axis=1,how='all')
#sns.heatmap(hy, annot=True)
'''
c = ax.pcolor(alpha0)
fig.colorbar(c,ax=ax)
'''

'\nc = ax.pcolor(alpha0)\nfig.colorbar(c,ax=ax)\n'

In [199]:
summation = []
for i in range(6858):
    summation.append(sum(pd.Series(total_df.iloc[i, :]).sort_values().values[:2]))

In [203]:
pd.Series(total_df.iloc[1, :]).sort_values().values[:2]

array([0., 9.])
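
In the output above, the 0 is the handle's distance to itself, so the sum of the two smallest values in a row is simply the distance to the nearest other handle (provided the handles are unique). A sketch of the same idea on a small made-up matrix, assuming NumPy only: masking the diagonal gives the nearest-neighbour distance directly and also tells us which handle it belongs to. The values below are illustrative, not computed from the dataset.

```python
import numpy as np

# Hypothetical symmetric distance matrix for three handles.
distances = np.array([[0.0, 1.0, 7.0],
                      [1.0, 0.0, 5.0],
                      [7.0, 5.0, 0.0]])

masked = distances.copy()
np.fill_diagonal(masked, np.inf)  # ignore each handle's distance to itself

nearest_distance = masked.min(axis=1)  # distance to the closest other handle
nearest_index = masked.argmin(axis=1)  # position of that closest handle

print(nearest_distance)  # [1. 1. 5.]
print(nearest_index)     # [1 0 1]
```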

In [211]:
total_df1 = pd.DataFrame()
total_df1["handles"] = total_df.iloc[:, :].min().index
total_df1["Distance"] = summation


In [212]:
total_df1 = total_df1.loc[total_df1["Distance"] <= 3]

In [217]:
dfs = dfs.sort_values(by= ["Account 1", "account 2"])

In [219]:
dfs.reset_index(drop=True, inplace=True)

In [220]:
dfs.to_csv("./Distances(0-3)_sorted.csv", index=False)