# Preprocessing
---------

### Author Information
**Author:** PJ Gibson  
**Email:** Peter.Gibson@doh.wa.gov  
**Github:**   https://github.com/DOH-PJG1303

### Project Information
**Created Date:** 2023-08-09
**Last Updated:** 2023-08-09  
**Version:** 1  

### Description

In this notebook, we'll apply a set of functions to clean up our dirty data.
This step is essential to record linkage.
If you want your model to work properly, you also need to apply this **exact** same preprocessing to any data you later run the model on.
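
As a minimal sketch of why both datasets need identical cleaning, the hypothetical `clean_name` helper below (not part of this notebook) mirrors the lowercase-and-strip rule applied to names later on. The example values are made up. The raw strings differ, but they agree once both pass through the same function.

```python
import re

def clean_name(value):
    """Hypothetical helper: lowercase and drop anything that is not a-z."""
    return re.sub(r'[^a-z]', '', value.lower())

training_value = "O'Brien "   # made-up value as it might appear in training data
applied_value = "OBRIEN"      # made-up value as it might appear in applied data

# Compare the raw strings, then the cleaned strings
raw_match = training_value == applied_value
clean_match = clean_name(training_value) == clean_name(applied_value)

print(raw_match, clean_match)
```

If only one side were cleaned, the comparison would still fail, which is why the same function should be reused rather than reimplemented.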


### Notes

*\*If you are unfamiliar with the origins of this synthetic data, please see the [Synthetic-Gold](https://github.com/DOH-PJG1303/Synthetic-Gold) GitHub project. We ran the simulation for the state of Nebraska, so all data is relevant to that state.
To manage the size of the data we store publicly on GitHub, we only captured the relevant data in each table for the population living during the years 2019-2022.*

## 1. Import libraries

In [None]:
# Standard data analysis tools
import pandas as pd
import numpy as np

# Phonetic encoding library
import jellyfish

# Record linkage specific resources
import recordlinkage as rl
from recordlinkage.preprocessing import clean, phonetic
from recordlinkage.index import Block
from recordlinkage.base import BaseCompareFeature

## 2. Read Data

We'll be performing internal deduplication within the dirty data for synthetic Nebraska.
The input file is the output of the previous script.

In [None]:
df = pd.read_parquet('../../Data/Training/02b. Wrangled Dirty Data.parquet')

## 3. Preprocessing

### 3.1 Column standardization

In [None]:
def standardize_columns(df, mapping_columns, extra_columns=None):
    """
    Standardizes column names in the DataFrame based on user-provided mapping.

    Parameters:
    - df: DataFrame to be processed.
    - mapping_columns: Dictionary with keys being existing columns in df 
                       and values being the standardized names.
    - extra_columns: Optional list of column names that should retain their
                     original names. They are placed first in the output.

    Returns:
    - A new DataFrame with standardized column names; the input is not renamed.
    """
    # Avoid a mutable default argument
    extra_columns = list(extra_columns) if extra_columns is not None else []

    # Define the desired standardized column order
    standardized_order = ['fname','mname','lname','dob','sex','add',
                          'zip','city','county','state','phone','email']

    # Rename on a new object so the caller's DataFrame keeps its original names
    df = df.rename(columns=mapping_columns)

    # Put any extra_columns first, followed by the standardized columns
    # that are present, in the desired order
    all_columns = extra_columns + [col for col in standardized_order if col in df.columns] 
    
    return df[all_columns]

### 3.2 Data Cleaning

| Field       | Preprocessing                                                                                                                                                                      |
|-------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| fname       | - Convert to lowercase. <br>- Remove non-alphabetical characters. <br>- Generate metaphone representation as `meta_fname`. <br>- Generate soundex representation as `sdx_fname`.     |
| lname       | - Convert to lowercase. <br>- Remove non-alphabetical characters. <br>- Generate metaphone representation as `meta_lname`. <br>- Generate soundex representation as `sdx_lname`.     |
| mname       | - Convert to lowercase. <br>- Take the first character.                                                                                                                              |
| dob         | - Convert to the format 'YYYY-MM-DD'; unparseable dates become NaN.                                                                                                                 |
| sex         | - Convert to uppercase. <br>- Take the first character. <br>- Replace with NaN if not 'M', 'F', or 'O'.                                                                             |
| phone       | - Remove non-digit characters. <br>- Remove a leading 0 or 1. <br>- Blank the whole value if any digit appears 7 or more times consecutively.                                        |
| email       | - Convert to lowercase. <br>- Remove characters that aren't letters, digits, @, or dots.                                                                                            |
| add         | - Convert to lowercase. <br>- Remove characters that aren't letters, digits, or spaces. <br>- Replace common street types and directions with their standard abbreviations.          |
| zip         | - Remove non-digit characters.                                                                                                                                                      |
| county      | - Convert to lowercase. <br>- Remove characters that aren't letters or spaces. <br>- Remove the term ' county'.                                                                      |
| city        | - Convert to lowercase. <br>- Remove characters that aren't letters or spaces. <br>- Remove the term ' city'.                                                                        |
| Various Fields | - Replace empty fields or fields with placeholder values (like 'none', 'na', 'missing', 'unknown') with NaN.                                                                       |
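
The phone rule is the least obvious entry in the table, so here is a rough scalar sketch of it using the standard `re` module. The `clean_phone` helper and the sample numbers are hypothetical, and the sketch assumes that pandas' `.str.replace(..., regex=True)` behaves like `re.sub` applied to each value. Note that the last step blanks the entire value rather than deleting only the repeated digits.

```python
import re

def clean_phone(raw):
    """Hypothetical scalar version of the phone rule."""
    digits = re.sub(r'\D', '', raw)         # keep digits only
    digits = re.sub(r'^[01]', '', digits)   # drop a single leading 0 or 1
    # Blank the whole value if any digit appears 7 or more times in a row
    return re.sub(r'.*(\d)\1{6,}.*', '', digits)

# Made-up sample values
examples = ['1 (402) 555-0134', '402-555-0134', '000-000-0000']
cleaned = [clean_phone(p) for p in examples]

print(cleaned)
```

The first two inputs reduce to the same ten digits, while the all-zero placeholder becomes an empty string, which the final "Various Fields" rule then turns into NaN.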


In [None]:
def apply_preprocessing(df):
    """
    Function to apply a series of transformations to a DataFrame.

    Parameters
    ----------
    df: DataFrame
        The DataFrame to which the transformations should be applied.

    Returns
    -------
    DataFrame
        The DataFrame after applying the transformations.
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Lowercase and remove non-alphabetical characters in 'fname' and 'lname'
    for field in ['fname', 'lname']:
        df[field] = df[field].str.lower().str.replace('[^a-z]','',regex=True)

    # Lowercase and take the first character of 'mname'
    df['mname'] = df['mname'].str.lower().str.slice(0,1)

    # Convert 'dob' column to 'YYYY-MM-DD' format (unparseable dates become NaN)
    df['dob'] = pd.to_datetime(df['dob'], errors='coerce').dt.strftime('%Y-%m-%d')

    # Convert 'sex' to uppercase and take the first letter
    df['sex'] = df['sex'].str.upper().str.slice(0, 1)

    # Replace any value that isn't 'M', 'F', or 'O' with NaN
    df['sex'] = df['sex'].where(df['sex'].isin(['M', 'F', 'O']))

    # Remove non-digits from 'phone', remove a leading 0 or 1, and blank the whole
    # value if any digit appears 7 or more times in a row (e.g. '0000000')
    df['phone'] = (
        df['phone']
        .str.replace(r'\D', '', regex=True)
        .str.replace(r'^[01]', '', regex=True)
        .str.replace(r'.*(\d)\1{6,}.*', '', regex=True)
    )

    # Lowercase and remove everything except letters, digits, '@', and '.' in 'email'
    df['email'] = df['email'].str.lower().str.replace(r'[^a-z0-9.@]', '', regex=True)

    # Lowercase 'add', remove everything except letters, digits, and spaces,
    # then replace common street types and directions with their abbreviations
    df['add'] = df['add'].str.lower().str.replace('[^a-z0-9 ]','',regex=True)

    replacements = {
        'street': 'st',
        'avenue': 'ave',
        'road': 'rd',
        'place': 'pl',
        'drive': 'dr',
        'court': 'ct',
        'lane': 'ln',
        'boulevard': 'blvd',
        'highway': 'hwy',
        'circle': 'cir',
        'apartment': 'apt',
        'suite': 'ste',
        'north': 'n',
        'south': 's',
        'east': 'e',
        'west': 'w',
        'northeast':'ne',
        'northwest':'nw',
        'southeast':'se',
        'southwest':'sw',
        'terrace': 'ter',
        'parkway': 'pkwy',
        'alley': 'aly',
        'fort': 'ft',
        'junction': 'jct',
        'point': 'pt',
        'square': 'sq',
        'heights': 'hts',
        'hollow': 'holw',
        'mountain': 'mtn',
        'expressway': 'expy',
        'falls': 'fls',
        'grove': 'grv',
        'harbor': 'hbr',
        'hill': 'hl',
        'loop': 'loop',
        'ridge': 'rdg',
        'trail': 'trl',
        'tunnel': 'tunl',
        'valley': 'vly',
        'extension': 'ext',
    }

    for k, v in replacements.items():
        df['add'] = df['add'].str.replace(r'\b' + k + r'\b', v, regex=True)

    # Remove non-digits from 'zip'
    df['zip'] = df['zip'].str.replace(r'\D', '', regex=True)

    # Lowercase, remove non-alphabetical characters and ' county' in 'county'
    df['county'] = df['county'].str.lower().str.replace('[^a-z ]','', regex=True).str.replace(' county','', regex=True)

    # Lowercase, remove non-alphabetical characters and ' city' in 'city'
    df['city'] = df['city'].str.lower().str.replace('[^a-z ]','', regex=True).str.replace(' city','', regex=True)

    # Replace empty fields or fields with placeholder values with NaN
    placeholders = {'', 'none', 'na', 'missing', 'unknown'}
    fields = ['fname','mname','lname','dob','phone','email','add','zip','county','city']
    for field in fields:
        df[field] = df[field].apply(lambda x: np.nan if str(x).strip().lower() in placeholders else x)


    # Generate metaphone and soundex encodings of the name fields
    for col in ['fname', 'lname']:
        df['meta_'+col] = phonetic(df[col], method='metaphone')
        df['sdx_'+col] = phonetic(df[col], method='soundex')

    return df
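
To make the address step easier to reason about, here is a small scalar sketch using the standard `re` module. The `abbreviate` helper, the shortened mapping, and the sample addresses are hypothetical. The sketch again assumes that pandas' regex replacement matches `re.sub` on each value. It shows that the `\b` word boundaries stop `north` from matching inside `northeast` and `street` from matching inside `streetcar`.

```python
import re

# Small illustrative subset of the replacements mapping
replacements = {'street': 'st', 'north': 'n', 'northeast': 'ne', 'apartment': 'apt'}

def abbreviate(address):
    """Hypothetical scalar version of the address rule."""
    # Lowercase, then keep only letters, digits, and spaces
    address = re.sub(r'[^a-z0-9 ]', '', address.lower())
    # Replace whole words only, thanks to the \b word boundaries
    for word, short in replacements.items():
        address = re.sub(r'\b' + word + r'\b', short, address)
    return address

first = abbreviate('123 North Main Street, Apartment 4')
second = abbreviate('77 Northeast Streetcar Way')

print(first)
print(second)
```

Because every key is matched as a whole word, the order of entries in the mapping does not matter for overlapping pairs such as `north` and `northeast`.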

### 3.3 Application

In [None]:
# Define columns to map
mapping_cols = {
    'new__first_name': 'fname',
    'new__middle_name': 'mname',
    'new__last_name': 'lname',
    'new__dob': 'dob',
    'new__sex_at_birth': 'sex',
    'new__phone': 'phone',
    'new__email': 'email',
    'new__address': 'add',
    'new__zip': 'zip',
    'new__county_name': 'county',
    'new__city': 'city',
    'new__state': 'state'
}

# Standardize the columns using custom function
df_standardized = standardize_columns(df, mapping_columns=mapping_cols, extra_columns=['unique_id'])

# Apply preprocessing
df_preprocessed = apply_preprocessing(df_standardized)

## 4. Save

In [None]:
df_preprocessed.to_parquet('../../Data/Training/03. Preprocessed Data.parquet')