# -*- coding: utf-8 -*-
r"""
This script looks for potential duplicate records in a database.

It assumes the following set of standard columns:
Name, Address, StreetNumber, StreetName, City, PostalCode, Province, Latitude, Longitude

The script has two sections: an initial preprocessing step, followed by the record comparison.

With minor modification, this can also be used for record linkage between two files.

I) Preprocessing:
    0. Read in the JSON 'source' file, which contains:
        i. file name of database
        ii. mappings from database column names to standard names
        iii. a short dictionary of terms to replace in the Name field to improve
            potential matches (e.g., 'ch' for 'centre hospitalier')
    1. Read in the database (the columns to load are the values in the JSON mappings)
    2. Strip all accents from all text fields
    3. Process Address and Street Name fields to standardise street types
    4. Run recordlinkage's "clean" function to remove extra whitespace, any
        remaining non-ASCII characters, and anything in parentheses
    5. Standardise PostalCode
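
A minimal sketch of steps 2 and 5, plus the Name replacement from step 0.iii,
using only the standard library. The function names and the NAME_REPLACEMENTS
table are hypothetical, and the direction of the replacement (long form to
abbreviation) is an assumption; the script itself may handle these steps
differently.

```python
import re
import unicodedata

# Hypothetical replacement table; the real one would be read from the JSON source file.
NAME_REPLACEMENTS = {"centre hospitalier": "ch"}


def strip_accents(text):
    # Decompose each character (NFKD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def standardise_name(name, replacements=NAME_REPLACEMENTS):
    # Strip accents and lower-case, then swap each long form for its short form.
    cleaned = strip_accents(name).lower()
    for long_form, short_form in replacements.items():
        cleaned = re.sub(r"\b" + re.escape(long_form) + r"\b", short_form, cleaned)
    # Collapse runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", cleaned).strip()


def standardise_postal_code(code):
    # Keep only letters and digits, upper-cased.
    return re.sub(r"[^A-Za-z0-9]", "", strip_accents(code)).upper()
```

For instance, standardise_postal_code("k1a 0b1") gives "K1A0B1", so postal codes
that differ only in case or spacing can pass the exact comparison later on.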
        
II) Record Linkage:
    Use the Python Record Linkage Toolkit (recordlinkage) to create index pairs and perform comparisons:
        Criteria:
                Province - Block (consider only matches where equal)
                Name - Damerau-Levenshtein, qgram
                Address - Damerau-Levenshtein, Cosine
                StreetNumber - Exact
                StreetName - Damerau-Levenshtein, Cosine
                City - Damerau-Levenshtein
                PostalCode - Exact
                Latitude/Longitude - Distance
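
To illustrate the criteria above without depending on the toolkit, the following
sketch shows what blocking on Province and a Damerau-Levenshtein comparison
amount to. It is a plain-Python stand-in, not the toolkit's API: block_pairs,
osa_distance, and string_similarity are hypothetical names, the distance is the
optimal string alignment variant of Damerau-Levenshtein, and normalising by the
longer string's length is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key="Province"):
    # Group record ids by the blocking key; records with no value are skipped.
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        value = rec.get(key)
        if value:
            blocks[value].append(rec_id)
    # Only records that share a block become candidate pairs.
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


def osa_distance(s, t):
    # Edit distance allowing insertion, deletion, substitution,
    # and transposition of two adjacent characters.
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def string_similarity(s, t):
    # Scale the distance into [0, 1], where 1.0 means identical strings.
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - osa_distance(s, t) / longest
```

Blocking keeps the number of comparisons manageable: without it, every record
would be compared against every other record in the database.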

    The result is a pandas MultiIndex object, which we then use to create a file
    in which every line contains the two records being compared.
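
One way to turn such a MultiIndex into side-by-side rows is sketched below.
The function name pairs_to_frame and the "_1"/"_2" column suffixes are
hypothetical choices for illustration, and the sketch assumes the pairs index
the same DataFrame on both levels (the deduplication case).

```python
import pandas as pd


def pairs_to_frame(df, pairs):
    # Look up the first and second record of each pair, suffixing the
    # column names so the two sides can sit in one row.
    left = df.loc[pairs.get_level_values(0)].add_suffix("_1").reset_index(drop=True)
    right = df.loc[pairs.get_level_values(1)].add_suffix("_2").reset_index(drop=True)
    out = pd.concat([left, right], axis=1)
    # Restore the pair identifiers as the row index.
    out.index = pairs
    return out
```

The resulting frame can then be written out with its to_csv method.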
    
    This output file will be fed to a machine learning script for classification.
"""