How about removing words that are within a Levenshtein distance of 1 (i.e. distance < 2) of a word we have already kept?
import pandas as pd
from Levenshtein import distance

# One word per line, no header row; column 0 holds the words.
words = pd.read_csv("wordnet-list", header=None)
words_list = words[0].astype(str).tolist()

dedup = []
for word in words_list:
    # Keep the word only if every already-kept word is at least 2 edits away.
    if all(distance(word, candidate) > 1 for candidate in dedup):
        dedup.append(word)

len(dedup)
24911
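For anyone who wants to try the idea without the word list or the Levenshtein package, here is a minimal self-contained sketch of the same greedy filter. The names `edit_distance`, `greedy_dedup` and the `sample` list are made up for illustration, and the pure-Python distance function is only a stand-in for `Levenshtein.distance`. It also skips candidates whose length differs too much, since the edit distance can never be smaller than the length difference.

```python
def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def greedy_dedup(words, min_distance=2):
    """Keep a word only if it is at least min_distance edits from every kept word."""
    kept = []
    for word in words:
        is_far = True
        for candidate in kept:
            # Distance >= length difference, so this pair cannot be too close.
            if abs(len(word) - len(candidate)) >= min_distance:
                continue
            if edit_distance(word, candidate) < min_distance:
                is_far = False
                break
        if is_far:
            kept.append(word)
    return kept


# Hypothetical sample data, not taken from the real word list.
sample = ["cat", "cats", "car", "dog", "dig", "horse"]
print(greedy_dedup(sample))
```

One thing to keep in mind: the filter is greedy, so the result depends on the order of the input. With `["cat", "cot", "cog"]` both "cat" and "cog" survive, while `["cot", "cat", "cog"]` keeps only "cot".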