## Deduplicating data

In this notebook, we deduplicate data using the [Dedupe library](https://dedupe.readthedocs.io/en/latest/), which uses active learning: you label a small number of record pairs as duplicates or distinct, and the library learns from those labels how to weight field-by-field comparisons.
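To make the idea of weighted field-by-field comparison concrete, here is a minimal sketch using only the standard library. It is not Dedupe's implementation: `field_similarity`, `pair_score`, the toy records, and the hand-picked weights are all assumptions for illustration, whereas Dedupe learns its weights from the pairs you label and uses its own string distance.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Return a 0..1 similarity between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_score(rec_a, rec_b, weights):
    """Weighted average of per-field similarities.

    The weights are hand-picked here; a learning-based tool would
    estimate them from labeled pairs instead.
    """
    total = sum(weights.values())
    weighted = sum(w * field_similarity(rec_a[f], rec_b[f])
                   for f, w in weights.items())
    return weighted / total


# Hypothetical toy records, not taken from the customer data set
a = {'name': 'Jane Smith', 'city': 'Berlin'}
b = {'name': 'Jane Smyth', 'city': 'Berlin'}
c = {'name': 'Carlos Diaz', 'city': 'Madrid'}
weights = {'name': 2.0, 'city': 1.0}

# The likely duplicate (a, b) should score higher than the unrelated pair (a, c)
print(pair_score(a, b, weights), pair_score(a, c, weights))
```

A pair whose score clears some threshold would be treated as a duplicate candidate; choosing that threshold is part of what the labeling step helps with.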

If you are interested in building your own parser, the same team has created [Parserator](https://github.com/datamade/parserator), which you can use to train a parser that extracts structured fields from text. A trained parser tends to be less brittle than a set of regular expressions.

In [None]:
import pandas as pd
import dedupe
import os

In [None]:
customers = pd.read_csv('../data/customer_data_duped.csv', encoding='utf-8')

## Checking Data Quality

In [None]:
customers.head()

In [None]:
customers.dtypes

In [None]:
for col in customers.columns:
    print(col, customers[col].isnull().sum())

## Setting up Dedupe

In [None]:
variables = [
    {'field': 'name', 'type': 'String'},
    {'field': 'job', 'type': 'String'},
    {'field': 'company', 'type': 'String'},
    {'field': 'street_address', 'type': 'String'},
    {'field': 'city', 'type': 'String'},
    {'field': 'state', 'type': 'String', 'has_missing': True},
    {'field': 'email', 'type': 'String', 'has_missing': True},
    {'field': 'user_name', 'type': 'String'},
]

deduper = dedupe.Dedupe(variables)

In [None]:
deduper

In [None]:
customers.shape

In [None]:
deduper.sample(customers.T.to_dict(), 500)
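`customers.T.to_dict()` produces a dictionary keyed by row index, where each value is a dictionary mapping field names to values. Dedupe's documentation describes missing values as `None`, while pandas reads empty cells as `NaN`, so a cleanup step may be needed before sampling. The sketch below shows one way to do that on plain dictionaries; `clean_value`, `clean_records`, and the toy rows are hypothetical names and data, not part of Dedupe or pandas.

```python
import math


def clean_value(value):
    """Map NaN and blank strings to None; return other values unchanged."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, str) and not value.strip():
        return None
    return value


def clean_records(records):
    """Apply clean_value to every field of every record.

    `records` has the shape {record_id: {field_name: value}}.
    """
    return {record_id: {field: clean_value(value)
                        for field, value in row.items()}
            for record_id, row in records.items()}


# Hypothetical rows in the same shape as customers.T.to_dict()
raw = {
    0: {'name': 'Jane Smith', 'state': float('nan'), 'email': ''},
    1: {'name': 'Jane Smyth', 'state': 'NY', 'email': 'jane@example.com'},
}
records = clean_records(raw)
print(records[0])
```

With the real data, the same helper could be applied as `clean_records(customers.T.to_dict())` before passing the result to the deduper.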

#### Either load an existing training file (uncomment the lines below) or label pairs interactively in the following cell

In [None]:
training_file = '../data/ignore-dedupe-training.json'
#if os.path.exists(training_file):
#    with open(training_file, 'rb') as f:
#        deduper.readTraining(f)

In [None]:
dedupe.consoleLabel(deduper)

In [None]:
deduper.train()

In [None]:
with open(training_file, 'w') as tf:
    deduper.writeTraining(tf)

In [None]:
dupes = deduper.match(customers.T.to_dict())

In [None]:
dupes[0]

In [None]:
# Look up the records of the first cluster by their index labels
customers.loc[list(dupes[0][0])]

### Exercise: Flag duplicates by adding 2 extra columns, one for confidence score and one for duplicate_ids

In [None]:
# %load ../solutions/dedupe.py
import numpy as np

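dupe_dict = {}

# Each item in dupes is (record_ids, scores). A cluster can hold more than
# two records, so store every member together with its own confidence score.
for record_ids, scores in dupes:
    for record_id, score in zip(record_ids, scores):
        dupe_dict[record_id] = {'pair': record_ids, 'confidence': score}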

customers['duplicate_pair'] = customers.index.map(lambda i: dupe_dict[i].get('pair')
                                                  if i in dupe_dict else np.nan)
customers['confidence'] = customers.index.map(lambda i: dupe_dict[i].get('confidence')
                                              if i in dupe_dict else np.nan)


In [None]:
customers[customers.confidence.notnull()].head()
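Once duplicates are flagged, a common next step is to keep one record per cluster and drop the rest. The sketch below assumes each cluster is a `(record_ids, scores)` pair, as in the cells above; `split_clusters`, the rule of keeping the lowest id, the threshold, and the toy clusters are all illustrative assumptions rather than anything Dedupe prescribes.

```python
def split_clusters(clusters, threshold=0.5):
    """Return (keep_ids, drop_ids) for a list of (record_ids, scores) clusters.

    Members scoring below `threshold` are ignored. Among the remaining
    members, the lowest id is kept as the canonical record.
    """
    keep, drop = set(), set()
    for record_ids, scores in clusters:
        members = [rid for rid, score in zip(record_ids, scores)
                   if score >= threshold]
        if len(members) < 2:
            continue  # nothing to merge in this cluster
        canonical = min(members)
        keep.add(canonical)
        drop.update(rid for rid in members if rid != canonical)
    return keep, drop


# Hypothetical clusters with made-up scores
toy_clusters = [
    ((268, 1269), (0.97, 0.97)),
    ((12, 840, 1502), (0.91, 0.88, 0.42)),
]
keep_ids, drop_ids = split_clusters(toy_clusters, threshold=0.5)
print(sorted(keep_ids), sorted(drop_ids))
```

With the real results, `customers.drop(index=sorted(drop_ids))` would remove the non-canonical rows. Which record to keep is a judgment call; you may prefer the most complete record over the lowest id.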