# Apply DQ Errors
---------

### Author Information
**Author:** PJ Gibson  
**Email:** Peter.Gibson@doh.wa.gov  
**GitHub:** https://github.com/DOH-PJG1303

### Project Information
**Created Date:** 2023-08-08  
**Last Updated:** 2023-08-08  
**Version:** 1  

### Description

In this notebook, we'll apply the infamous "Dirty" functions.
These functions let us intentionally insert data quality errors into the simulation so that we can train our machine learning model on data that resembles the messiness observed in real-life data.


*\*If you are unfamiliar with the origins of this synthetic data, please see the [Synthetic-Gold](https://github.com/DOH-PJG1303/Synthetic-Gold) GitHub project. We ran the simulation for the state of Nebraska, so all data is relevant to that state.
To keep the publicly stored data on GitHub small, we captured only the relevant data in each table for the population living in the years 2019-2022.*

### Notes

## 1. Imports

In [4]:
import numpy as np
import pandas as pd
import re

## 2. Define our function

This function contains many sub-functions.
Its inputs are:
* the raw (clean) data
* a dictionary mapping each field name to its transformations and their probabilities (`{column: {transformation: probability}}`)
* a random state to start off in (integer)
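
To make the shape of that dictionary concrete, here is a minimal sketch. The profile contents and the `describe_profile` helper are hypothetical and chosen only for illustration; the profile actually used is defined later in this notebook.

```python
# Hypothetical profile for illustration only.
# Outer keys are column names; inner keys are transformation names;
# inner values are the probability that a given row is transformed.
example_profile = {
    "first_name": {"transpose": 0.03, "delete_1char": 0.03},
    "dob": {"swap_month_day": 0.005},
}


def describe_profile(profile):
    """Return one (column, transformation, probability) tuple per entry."""
    rows = []
    for column, transformations in profile.items():
        for name, probability in transformations.items():
            rows.append((column, name, probability))
    return rows


for column, name, probability in describe_profile(example_profile):
    print(f"{column:<12} {name:<16} {probability:.3f}")
```

Dictionaries preserve insertion order, so the order printed here is the order in which a loop over the profile would visit each transformation.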

In [5]:
def apply_dirty(df, myDict, random_seed, flag_columns=False):
    """
    Apply transformations to a Pandas DataFrame based on the provided myDict dictionary.
    Transforms specified columns into new columns, and adds flag columns to track transformations.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be transformed.
        myDict (dict): A dictionary defining field names, and their respective transformations : probabilities.
        random_seed (int): A seed for random number generation to ensure reproducibility.
        flag_columns (bool): If True, saves boolean-type flag columns labeled f'flag___{column}___{transformation_name}' indicating what transformations a record undertook.
                             Number of flag columns = number of column transformations specified in dictionary.

    Returns:
        pd.DataFrame: The transformed DataFrame.
    """
    
    ################################################################################################################################################
    ##### BEGIN SUB FUNCTIONS 
    ################################################################################################################################################

    def null(s):
        """Return a NaN value."""
        return np.nan

    def only_first_char(s):
        """Return the first character of the input string."""
        if len(s) >= 1:
            return s[0]
        else:
            return s
        
    def fake_date(s):
        """Return a hardcoded date "2000-01-01"."""
        return '2000-01-01'

    def dob_jan_first(s):
        """Return a date in the format "YYYY-01-01" if the input has more than 4 characters."""
        if len(s) > 4:
            return f'{s[:4]}-01-01'
        else:
            return s
        
    def other_O(s):
        """Return the letter "O"."""
        return 'O'

    def unknown_U(s):
        """Return the letter "U"."""
        return 'U'
        
    def oppositely_identify(s):
        """Toggle between 'F' and 'M' values, if not either return the input unchanged."""
        if s == 'F':
            return 'M'
        elif s == 'M':
            return 'F'
        else:
            return s
        
    def other(s):
        """Return the word "Other"."""
        return 'Other' 
        
    def homeless(s):
        """Return the word "Homeless"."""
        return 'Homeless' 
        
    def no_county(s):
        """Remove the word "County" from the input string."""
        return re.sub(' County', '', s)
        
    def none(s):
        """Return the word "None"."""
        return 'None'

    def transpose(s):
        """Transpose two consecutive characters in the input string."""
        if len(s) > 1:
            t_index1 = np.random.randint(0, len(s) - 1)
            t_index2 = t_index1 + 1
            t_index3 = t_index2 + 1
            return s[:t_index1] + s[t_index2] + s[t_index1] + s[t_index3:]
        else:
            return s

    def delete_1char(s):
        """Delete a random character from the input string."""
        if len(s) > 1:
            del_index = np.random.randint(0, len(s) - 1)
            return s[:del_index] + s[del_index + 1:]
        else:
            return s

    def insert_1letter(s):
        """ Insert a random letter at a random position in the string. """
        if len(s) > 1:
            ascii_chars = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
            insert_index = np.random.randint(0, len(s) - 1)
            random_char = np.random.choice(ascii_chars)
            return s[:insert_index] + random_char + s[insert_index:]
        else:
            return s

    def insert_1num(s):
        """ Insert a random number at a random position in the string. """
        if len(s) > 1:
            ascii_chars = list('0123456789')
            insert_index = np.random.randint(0, len(s) - 1)
            random_char = np.random.choice(ascii_chars)
            return s[:insert_index] + random_char + s[insert_index:]
        else:
            return s

    def repeat_1char(s):
        """ Repeat a character at a random position in the string. """
        if len(s) > 1:
            insert_index = np.random.randint(0, len(s) - 1)
            return s[:insert_index] + s[insert_index] + s[insert_index:]
        else:
            return s

    def leading_whitespace(s):
        """ Add leading whitespace to the string. """
        return ' ' + s

    def trailing_whitespace(s):
        """ Add trailing whitespace to the string. """
        return s + ' '

    def year_equals_curyear(s):
        """ Replace the year in a date with the current year (2019 in this case). """
        return re.sub(r'(?P<y>\d{4})-0?(?P<m>\d{1,2})-0?(?P<d>\d{1,2})', r'2019-\g<m>-\g<d>', s)

    def swap_month_day(s):
        """ Swap the month and day in a date. """
        return re.sub(r'(?P<y>\d{4})-0?(?P<m>\d{1,2})-0?(?P<d>\d{1,2})', r'\g<y>-\g<d>-\g<m>', s)

    def format_year_2digits(s):
        """ Format the year as two digits in a date. """
        return re.sub(r'(?P<y1>\d{2})(?P<y2>\d{2})-0?(?P<m>\d{1,2})-0?(?P<d>\d{1,2})', r'\g<y2>-\g<m>-\g<d>', s)

    def format_remove_leading_zeros(s):
        """ Remove leading zeros from dates. """
        return re.sub(r'(?P<y>\d{2,4})-0?(?P<m>\d{1,2})-0?(?P<d>\d{1,2})', r'\g<y>-\g<m>-\g<d>', s)

    def format_slash_mdyyyy(s):
        """ Change date format to M/D/YYYY (no leading zeros). """
        return re.sub(r'(?P<y>\d{2,4})-0?(?P<m>\d{1,2})-0?(?P<d>\d{1,2})', r'\g<m>/\g<d>/\g<y>', s)

    def format_dash_only(s):
        """ Format phone numbers with dashes only. """
        return re.sub(r'(?P<ac>\d{3})(?P<f3>\d{3})(?P<l4>\d{4})', r'\g<ac>-\g<f3>-\g<l4>', s)

    def format_parenthesis_dash(s):
        """ Format phone numbers with parenthesis and dashes. """
        return re.sub(r'(?P<ac>\d{3})(?P<f3>\d{3})(?P<l4>\d{4})', r'(\g<ac>)-\g<f3>-\g<l4>', s)

    def prefix_plus1(s):
        """ Prefix the string with '+1 '. """
        return '+1 ' + s
    
    def fake_email(s):
        """ Replace the email address with a random fake email from a predefined list."""
        list_fake_emails = [
            'g@gmail.com', 'x@gmail.com', 'noemail@noemail.com', 'a@gmail.com', '1@gmail.com', 'a@a', '999az999@yahoo.com', 'n@n',
            '1@2', 'na@na.com', 'covid19@yahoo.com', 'no@no.com', 'no@email.com', 'm@n.com', 'ar@g', 'xxx@gmail.com', 'no@yahoo.org',
            'x@g', 'noname@gmail.com', 'm@m.com', 'c19@yahoo.com', 'none@email.com', 'm@m', 'none@none.com', 'noemail@gmail.com',
            'n@n.com', 'a@msn.com', 'g@g', 'noemail@email.com', '123@gmail.com', 'refused@email.com', 'k@gmail.com', 'email@email.com',
            '123@123.com', 'no@gmail.com', 'none@none.none', 'declined@gmail.com', 'n@a', 'none@gmail.com', 'none@none', 'm@n',
            'declined@catholichealth.net', 'na@gmail.com', 'refused@yahoo.org', 'test@test.com', 'unknown@gmail.com', 'vax@gmail.com',
            'd@d.com',
        ]
        return str(np.random.choice(list_fake_emails))

    def longhand_compass(s):
        """ Convert compass abbreviations to their full names.
        Transforms compass directions like N, S, E, and W into north, south, east, and west
        respectively, considering potential combinations such as NE, NW, etc.
        """
        s = re.sub(r'\bN(?=[EW]?\b)', 'north', s)
        s = re.sub(r'\bS(?=[EW]?\b)', 'south', s)
        s = re.sub(r'\b(north|south)?E\b', '\\1east', s)
        s = re.sub(r'\b(north|south)?W\b', '\\1west', s)
        s = s.title()
        return s

    def longhand_roadtype(s):
        """ Convert road type abbreviations to their full names.
        Transforms abbreviations like St, Ave, Rd, etc. into Street, Avenue, Road, etc.
        """
        s = re.sub(r'\bSt\b', 'Street', s)
        s = re.sub(r'\bAve\b', 'Avenue', s)
        s = re.sub(r'\bRd\b', 'Road', s)
        s = re.sub(r'\bPl\b', 'Place', s)
        s = re.sub(r'\bDr\b', 'Drive', s)
        s = re.sub(r'\bCt\b', 'Court', s)
        s = re.sub(r'\bLn\b', 'Lane', s)
        s = re.sub(r'\bBlvd\b', 'Boulevard', s)
        s = re.sub(r'\bHwy\b', 'Highway', s)
        s = re.sub(r'\bCir\b', 'Circle', s)
        s = re.sub(r'\bApt\b', 'Apartment', s)
        s = re.sub(r'\bSte\b', 'Suite', s)
        s = re.sub(r'\bTer\b', 'Terrace', s)
        s = re.sub(r'\bPkwy\b', 'Parkway', s)
        s = re.sub(r'\bAly\b', 'Alley', s)
        s = re.sub(r'\bFt\b', 'Fort', s)
        s = re.sub(r'\bJct\b', 'Junction', s)
        s = re.sub(r'\bPt\b', 'Point', s)
        s = re.sub(r'\bSq\b', 'Square', s)
        s = re.sub(r'\bHts\b', 'Heights', s)
        s = re.sub(r'\bHolw\b', 'Hollow', s)
        s = re.sub(r'\bMtn\b', 'Mountain', s)
        s = re.sub(r'\bExpy\b', 'Expressway', s)
        s = re.sub(r'\bFls\b', 'Falls', s)
        s = re.sub(r'\bGrv\b', 'Grove', s)
        s = re.sub(r'\bHbr\b', 'Harbor', s)
        s = re.sub(r'\bHl\b', 'Hill', s)
        s = re.sub(r'\bLoop\b', 'Loop', s)
        s = re.sub(r'\bRdg\b', 'Ridge', s)
        s = re.sub(r'\bTrl\b', 'Trail', s)
        s = re.sub(r'\bTunl\b', 'Tunnel', s)
        s = re.sub(r'\bVly\b', 'Valley', s)
        s = re.sub(r'\bExt\b', 'Extension', s)
        s = s.title()
        return s

    def address_no_info(s):
        """Replace the address with a random placeholder address taken from a predefined list."""
        list_address_no_info = ['123 fake street', 'general delivery', 'need address', 'bad address', 'need info', 'espanol', 'na', 'need', 'no address']
        return str(np.random.choice(list_address_no_info))

    
    def no_provider(s):
        """ Remove the domain part of an email address. """
        return re.sub(r'(?P<email_prefix>[^@]+)(?P<email_suffix>@.*)', r'\g<email_prefix>', s)

    ################################################################################################################################################
    ##### BEGIN MASTER FUNCTION 
    ################################################################################################################################################

    # Set random seed
    np.random.seed(random_seed)

    # Copy data so function doesn't alter original DataFrame
    data = df.copy()

    # Apply transformations
    for column, transformations in myDict.items():

        # For each column, create new__column to manipulate
        new_column = f'new__{column}'
        data[new_column] = data[column].copy().astype(str)

        # For any transformations listed to apply to that column...
        for transformation_name, probability in transformations.items():

            # See if the transformation exists as a function already...
            transformation_function = locals().get(transformation_name, None)

            # Define what records would be affected by transformation...
            ### 1. Within random probability value
            ### 2. Not currently null, empty string, or "None" 
            mask = (np.random.rand(len(data)) < probability) & (~data[new_column].isna()) & (data[new_column] != '') & (data[new_column] != "None")

            # If there is a function:
            if transformation_function:
            
                # Apply the transformation to the mask
                data.loc[mask, new_column] = data.loc[mask, new_column].apply(transformation_function)

            else:

                if (transformation_name == 'nickname'):
                    # Can only give nickname where it is not null
                    mask = mask & (~data['nickname'].isna())
                    data.loc[mask,new_column] = data.loc[mask,'nickname']

                elif (transformation_name == 'plus__middle_name'):
                    # Can only add middle name when non-zero
                    mask = mask & (data['middle_name'].str.len() > 0)
                    data.loc[mask,new_column] = data.loc[mask, new_column] + data.loc[mask,'middle_name'] 

                elif (transformation_name == 'swap__first_name'):
                    # Can only swap with first name when first name is non-zero/null
                    mask = mask & (data['new__first_name'].str.len() > 0)

                    # Define firstnames and current column values for mask
                    firstnames = data.loc[mask,'new__first_name']
                    curnames = data.loc[mask,new_column]

                    # Swap them
                    data.loc[mask,new_column] = firstnames
                    data.loc[mask,'new__first_name'] = curnames
            
                elif (transformation_name == 'jdoe'):
                    # If sex_at_birth is female, assign first name as JANE and last name as DOE
                    data.loc[mask & (data['sex_at_birth'] == 'F'),'new__first_name'] = 'JANE'
                    # If sex_at_birth is male, assign first name as JOHN and last name as DOE
                    data.loc[mask & (data['sex_at_birth'] == 'M'),'new__first_name'] = 'JOHN'
                    data.loc[mask, new_column] = 'DOE'

                elif (transformation_name == 'secondary_email'):
                    # Can only assign secondary_email where it is not null
                    mask = mask & (~data['secondary_email'].isna())
                    data.loc[mask,new_column] = data.loc[mask,'secondary_email']

                else:
                    # Raise an error if the transformation_name doesn't match any of the predefined transformations
                    raise ValueError(f'{transformation_name} not identified as valid transformation')

            # If user specified they want to keep track of the flag_columns & the transformation probability is non-zero...
            if flag_columns and (probability > 0):
                # Create a flag column indicating the column, the transformation, and give a value of 1 for all transformed or "masked" rows.  otherwise 0.
                flag_column = f'flag___{column}___{transformation_name}'
                data[flag_column] = 0
                data.loc[mask, flag_column] = 1

    return data
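
The core step of `apply_dirty` is to draw a random mask and apply one transformation only to the masked rows. The sketch below isolates that step for a single column. The `apply_one` helper and the toy dates are hypothetical, and it uses a local `np.random.RandomState` rather than the global seed set inside `apply_dirty`, so it will not reproduce the notebook's exact draws.

```python
import re

import numpy as np
import pandas as pd


def swap_month_day(s):
    """Swap month and day in a YYYY-MM-DD string (leading zeros are dropped)."""
    return re.sub(r'(?P<y>\d{4})-0?(?P<m>\d{1,2})-0?(?P<d>\d{1,2})', r'\g<y>-\g<d>-\g<m>', s)


def apply_one(series, func, probability, seed):
    """Apply func to a random subset of rows; return (new_series, mask)."""
    # Assumption: a local generator, so the sketch has no global side effects.
    rng = np.random.RandomState(seed)
    out = series.astype(str).copy()
    # Select rows with the given probability, skipping empty and "None" values.
    mask = (rng.rand(len(out)) < probability) & (out != '') & (out != 'None')
    out.loc[mask] = out.loc[mask].apply(func)
    return out, mask


toy = pd.Series(['1990-03-07', '1985-11-25', '2001-10-05'], name='dob')

# With probability 1.0 every eligible row is transformed.
dirty, mask = apply_one(toy, swap_month_day, probability=1.0, seed=42)
print(dirty.tolist())  # ['1990-7-3', '1985-25-11', '2001-5-10']
```

Note that the date patterns consume an optional leading zero, so the swapped dates come back without zero padding.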

## 2. Read in Data

In [6]:
df = pd.read_parquet('../../Data/Training/01. Wrangled Clean Data.parquet')

## 3. Define Data Profile

Please see the [Data Quality Error Dictionary Guide](https://github.com/DOH-PJG1303/Python-Record-Linkage/blob/main/Supporting/Data%20Quality%20Error%20Dictionary%20Guide.md) for an in-depth description of how to define your data profile. For all options, see the `universal_dataProfile` variable at the end.
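
Before running a profile, it can help to sanity-check it. The sketch below is one possible check, not part of the original notebook: `check_profile` and `active_entries` are hypothetical helpers, and the toy profile deliberately contains an invalid value. In `apply_dirty`, transformations run in dictionary order, and a flag column is only created when the probability is greater than zero.

```python
def check_profile(profile):
    """Return a list of problems found in a {column: {transformation: probability}} dict."""
    problems = []
    for column, transformations in profile.items():
        for name, probability in transformations.items():
            if isinstance(probability, bool) or not isinstance(probability, (int, float)):
                problems.append(f"{column}.{name}: probability is not a number")
            elif not 0 <= probability <= 1:
                problems.append(f"{column}.{name}: probability {probability} is outside [0, 1]")
    return problems


def active_entries(profile):
    """List (column, transformation) pairs with non-zero probability, in application order."""
    return [(c, n) for c, t in profile.items() for n, p in t.items() if p > 0]


# Toy profile with one deliberately invalid probability.
toy_profile = {'dob': {'null': 0.001, 'none': 0, 'swap_month_day': 1.5}}

print(check_profile(toy_profile))
print(active_entries(toy_profile))
```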

In [7]:
dataProfile = {'first_name':{
                         'null':0.00,
                         'nickname':0.3,
                         'transpose':0.03,
                         'delete_1char':0.03,
                         'repeat_1char':0.03,
                         'insert_1letter':0.03,
                         'plus__middle_name':0.03,
                         'leading_whitespace':0.03,
                         'trailing_whitespace':0.03
                        },
           
           'last_name':{
                        'null':0.00,
                        'transpose':0.03,
                        'delete_1char':0.03,
                        'repeat_1char':0.03,
                        'insert_1letter':0.03,
                        'leading_whitespace':0.03,
                        'trailing_whitespace':0.03,
                        'swap__first_name':0.005,
                        'jdoe':0.00001
                       },
           
           'dob':{
                        'null':0.001,
                        'none':0,
                        'swap_month_day':0.005,
                        'format_year_2digits':0,
                        'format_remove_leading_zeros':0,
                        'format_slash_mdyyyy':0,
                        'insert_1num':0,
                        'dob_jan_first':0.01
                 },

            'ssn':{
                           'null':0.8955
                        },

            'phone':{
                          'null':0.2513,
                          'none':0,
                          'transpose':0.01,
                          'format_dash_only':0.001,
                          'format_parenthesis_dash':0.99,
                          'prefix_plus1':0.001,
                          'delete_1char':0,
                          'repeat_1char':0,
                          'insert_1num':0,
                          'leading_whitespace':0,
                          'trailing_whitespace':0
                          },
           
           'address':{
                        'null':0.2579,
                        'none':0.18,
                        'transpose':0.01,
                        'delete_1char':0.01,
                        'repeat_1char':0.01,
                        'insert_1letter':0.01,
                        'insert_1num':0.01,
                        'leading_whitespace':0.01,
                        'trailing_whitespace':0.01,
                        'longhand_compass': 0.5,
                        'longhand_roadtype': 0.5
                     },

         
           'city':{
                        'null':0.2549,
                        'transpose':0.05,
                        'delete_1char':0.03,
                        'repeat_1char':0.03,
                        'insert_1letter':0.03,
                        'leading_whitespace':0,
                        'trailing_whitespace':0
                     },
          
          'state':{
                        'null':0.2536
                     },
           
           'zip':{
                        'null':0.2554,
                        'none':0,
                        'transpose':0.03,
                        'delete_1char':0,
                        'repeat_1char':0,
                        'insert_1num':0,
                        'leading_whitespace':0,
                        'trailing_whitespace':0
                     },
           
           'county_name':{
                        'null':0.2540,
                        'no_county':0,
                        'transpose':0,
                        'delete_1char':0,
                        'repeat_1char':0,
                        'insert_1letter':0,
                        'leading_whitespace':0,
                        'trailing_whitespace':0
                     }
           
         }

## 4. Apply dirty functions

In [8]:
output = apply_dirty(df,dataProfile,42,flag_columns=True)
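
A quick way to review the result is to compare the observed rate of each flag with the probability requested in the profile. The sketch below uses a small hypothetical DataFrame in place of `output`; its column names follow the `flag___{column}___{transformation}` pattern used by `apply_dirty`.

```python
import pandas as pd

# Toy stand-in for `output` (hypothetical values).
toy_output = pd.DataFrame({
    'unique_id': [1, 2, 3, 4],
    'new__first_name': ['ANN', 'BOB ', 'CAROL', 'DVAE'],
    'flag___first_name___transpose': [0, 0, 0, 1],
    'flag___first_name___trailing_whitespace': [0, 1, 0, 0],
})

# The mean of a 0/1 flag column is the share of rows that were transformed.
flag_rates = toy_output.filter(regex='^flag___').mean().sort_values(ascending=False)
print(flag_rates)
```

Observed rates can fall below the configured probabilities, because rows that are already null, empty, or "None" are excluded from the mask, and some transformations (such as `nickname`) narrow the mask further.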

## 5. Save

We'll have two saved outputs:
1. Dataframe lookup for flags.  This way we can look back at how different data quality issues impacted our model performance.
2. Dataframe for dirty data.  This is what we'll use to help generate our machine learning training datasets.
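
Because both outputs carry `unique_id`, they can be joined back together later. The following is a minimal sketch with hypothetical toy frames, assuming `unique_id` is unique in each table.

```python
import pandas as pd

# Toy stand-ins for the two saved outputs (hypothetical values).
df_flags_toy = pd.DataFrame({
    'unique_id': [1, 2, 3],
    'flag___dob___swap_month_day': [0, 1, 0],
})
df_dirty_toy = pd.DataFrame({
    'unique_id': [1, 2, 3],
    'person_id': [10, 11, 12],
    'new__dob': ['1990-03-07', '1985-25-11', '2001-10-05'],
})

# validate='one_to_one' raises if unique_id is duplicated on either side.
joined = df_dirty_toy.merge(df_flags_toy, on='unique_id', how='left', validate='one_to_one')

# Records whose date of birth had month and day swapped.
swapped = joined.loc[joined['flag___dob___swap_month_day'] == 1, ['unique_id', 'new__dob']]
print(swapped)
```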

### 5.1 Flag Lookup

In [11]:
# Identify all columns with flag in them.  Save a dataframe of those columns + unique_id
columns_flag = list(output.filter(regex='flag').columns)
df_flags = output[['unique_id']+columns_flag]

# Save to parquet
df_flags.to_parquet('../../Data/Training/02a. Flag Lookup Hep C.parquet',index=False)

### 5.2 Dirty Data

In [12]:
# Identify all columns with new__ in them.  Save a dataframe of those columns + unique_id
columns_new = list(output.filter(regex='new__').columns)
df_dirty = output[['unique_id','person_id']+columns_new]

# Save to parquet
df_dirty.to_parquet('../../Data/Training/02b. Wrangled Dirty Data Hep C.parquet',index=False)