# PAWS Data Pipeline
This script builds a master data table that links all the PAWS data sources together.
## Pipeline sections
0. Import libraries
1. Create & populate database 
2. Create ***metadata master table*** schema to link all source tables together & populate with one of the datasets (e.g. Salesforce)
3. For each dataset, merge each record with the ***metadata master table***. If a match is found, link the two sources. If not, create a new record. <br/>
    a. Petpoint<br/>
    b. Volgistics<br/>
    c. Other - TBD<br/>
4. Write the new table to the database

### 0. Import libraries

In [None]:
import sqlite3
import pandas as pd
import re

### 1. Create & populate database 

In [None]:
# connect to or create database

conn = sqlite3.connect("./sample_data/paws.db")

In [None]:
# function for loading a csv into a database table or "updating" the table by dropping it and recreating it with the csv

def load_to_sqlite(csv_name, table_name, connection):
    
    # load csv into a dataframe
    df = pd.read_csv(csv_name, encoding='cp1252')
    
    # lowercase headers, trim surrounding whitespace, and replace spaces and periods with underscores
    df.columns = df.columns.str.lower().str.strip()
    df.columns = df.columns.str.replace(' ', '_')
    df.columns = df.columns.map(lambda x: re.sub(r'\.+', '_', x))
    
    # create a cursor object, and use it to drop the table if it exists
    # (IF EXISTS keeps the first run from failing when the table is not there yet)
    cursor = connection.cursor()
    cursor.execute(f'DROP TABLE IF EXISTS "{table_name}"')
    connection.commit()
    cursor.close()
    
    # load dataframe into database table
    df.to_sql(table_name, connection, index=False)
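
The header clean-up inside `load_to_sqlite` determines the column names that later SQL has to use. As a rough illustration, the same three steps can be applied to a single header string. The helper name `normalize_header` and the sample headers are hypothetical and are not part of the pipeline.

```python
import re

def normalize_header(name):
    """Apply the same clean-up steps as load_to_sqlite to one column name."""
    name = name.lower().strip()        # lowercase, trim surrounding whitespace
    name = name.replace(' ', '_')      # spaces become underscores
    return re.sub(r'\.+', '_', name)   # each run of periods becomes one underscore

# Hypothetical sample headers, for illustration only
for raw in ['Unnamed: 0', 'Account ID', ' First name.Last name ']:
    print(repr(raw), '->', normalize_header(raw))
```

An unnamed index column that pandas reads as `Unnamed: 0` therefore ends up in the database as `unnamed:_0`.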

In [None]:
# load petpoint

load_to_sqlite('./sample_data/CfP_PDP_petpoint_deidentified.csv', 'petpoint', conn)

In [None]:
# load volgistics

load_to_sqlite('./sample_data/CfP_PDP_volgistics_deidentified.csv', 'volgistics', conn)

In [None]:
# load salesforce contacts

load_to_sqlite('./sample_data/CfP_PDP_salesforceContacts_deidentified.csv', 'salesforcecontacts', conn)

In [None]:
# load salesforce donations

load_to_sqlite('./sample_data/CfP_PDP_salesforceDonations_deidentified.csv', 'salesforcedonations', conn)

### 2. Create ***metadata master table*** schema to link all source tables together & populate with one of the datasets (e.g. Salesforce)

In [None]:
def create_user_master_df():
    """
    Creates a pandas dataframe placeholder with key meta-data to fuzzy-match
    the users from different datasets.
    
    Pseudo-code:
        Create a blank pandas dataframe (e.g. pd.DataFrame) with columns for
        Name (first, last), address, zip code, phone number, email, etc.
        
        Include "ID" fields for each of the datasets that will be merged.
        
        Populate/Initialize the dataframe with data from one of the datasets
        (e.g. Salesforce)
    """

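One possible shape for the master table is sketched below using plain dictionaries, so the schema is visible without depending on any real export. All column names (`first_name`, `zip_code`, `salesforce_id`, and so on) and the `seed_master` helper are assumptions for illustration. The real field names depend on the source data.

```python
# Hypothetical column names; adjust to the actual source exports.
MATCH_FIELDS = ['first_name', 'last_name', 'address', 'zip_code', 'phone', 'email']
SOURCE_ID_FIELDS = ['salesforce_id', 'petpoint_id', 'volgistics_id']

def seed_master(salesforce_rows):
    """Build initial master records from Salesforce-like rows (dicts)."""
    master = []
    for row in salesforce_rows:
        # copy the matching fields; missing ones become None
        record = {field: row.get(field) for field in MATCH_FIELDS}
        # one ID slot per source dataset, empty until that source is merged
        record.update({field: None for field in SOURCE_ID_FIELDS})
        record['salesforce_id'] = row.get('account_id')
        master.append(record)
    return master

# Made-up example row
rows = [{'account_id': 'A1', 'first_name': 'Ann', 'last_name': 'Lee', 'zip_code': '19104'}]
master = seed_master(rows)
print(master[0])
```

The resulting list can be turned into the dataframe described in the docstring with `pd.DataFrame(master, columns=MATCH_FIELDS + SOURCE_ID_FIELDS)`.
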
### 3. For each dataset, merge each record with the ***metadata master table***
If a match is found, link the two sources. If not, create a new record. <br/>

In [None]:
def fuzzy_merge(new_df, master_df):
    """
    This function merges each new dataset with the metadata master table by
    going line-by-line on the new dataset and looking for a match in the 
    existing metadata master dataset. If a match is found, the new dataset's
    id is added to the existing record; otherwise a new record is created.
    
    Pseudo-code:
        LOOP: For each line in the new_df, compare that line against all lines in 
        the master_df. 
        
        LOGIC: For each comparison, generate (a) a fuzzy-match score on name,
        (b) T/F on whether zip-code matches, (c) T/F on whether email matches,
        (d) T/F on whether phone number matches.
        
        OUTPUT: For each comparison if the fuzzy-match score is above a threshold (e.g. >=90%)
        and (b), (c) or (d) matches, consider it a match and add the new dataset 
        id to the existing record. If it doesn't match, create a new record in the
        master dataset.
        
    Note: there's probably a more efficient way to do this (vs. going line-by-line)
    """

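The matching rule in the docstring can be sketched on plain records as follows. This is a minimal sketch under several assumptions: the standard library's `difflib.SequenceMatcher` stands in for whichever fuzzy-matching library the project ends up using, and the field names (`name`, `zip_code`, `email`, `phone`, `id`) and the function names are hypothetical.

```python
from difflib import SequenceMatcher

CONTACT_FIELDS = ('zip_code', 'email', 'phone')  # hypothetical field names

def name_score(a, b):
    """Similarity of two names in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(new, existing, threshold=0.9):
    """Name score at or above the threshold AND at least one contact field equal."""
    same_name = name_score(new['name'], existing['name']) >= threshold
    same_contact = any(
        new.get(k) and new.get(k) == existing.get(k) for k in CONTACT_FIELDS
    )
    return same_name and same_contact

def fuzzy_merge_records(new_rows, master_rows, id_field):
    """Link each new row to the first matching master record, or append it."""
    for new in new_rows:
        for existing in master_rows:
            if is_match(new, existing):
                existing[id_field] = new['id']   # link the two sources
                break
        else:
            # no match found: create a new master record
            record = {k: new.get(k) for k in ('name',) + CONTACT_FIELDS}
            record[id_field] = new['id']
            master_rows.append(record)
    return master_rows

# Made-up records for illustration
master = [{'name': 'Ann Lee', 'zip_code': '19104', 'email': 'ann@example.org', 'phone': None}]
new = [
    {'id': 'P1', 'name': 'Anne Lee', 'zip_code': '19104', 'email': None, 'phone': None},
    {'id': 'P2', 'name': 'Bob Ray', 'zip_code': '19104', 'email': None, 'phone': None},
]
merged = fuzzy_merge_records(new, master, 'petpoint_id')
```

Here `Anne Lee` links to the existing `Ann Lee` record because the name score clears the threshold and the zip codes agree, while `Bob Ray` becomes a new record. Dataframe rows can be fed through the same logic via `df.to_dict('records')`. The nested loop is quadratic, which is the inefficiency the docstring note refers to.
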
#### 3.A Petpoint merge 
Apply the function above to the Petpoint dataset

#### 3.B Volgistics merge
Apply the function above to the Volgistics dataset

#### 3.C Other - TBD - Merge

### 4. Write the new table to the database

In [4]:
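# load_to_sqlite() expects a CSV path, so the dataframe would be written directly instead:
# master_df.to_sql('master_table', conn, index=False, if_exists='replace')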

## Other - placeholder - graveyard
Graveyard/placeholder code from previous sections

In [None]:
# simple join to check that it worked and the tables can be queried

df = pd.read_sql('''select * from petpoint as pp 
                    join volgistics as vol 
                    on pp."unnamed:_0" = vol."unnamed:_0"

                    join (SELECT * FROM salesforcecontacts AS sf_contacts
                            JOIN salesforcedonations AS sf_donations
                            ON sf_contacts."Account_ID" = sf_donations."Account_ID") as sf
                    on pp."unnamed:_0" = sf."unnamed:_0"
                    
                    ''', conn)

df.head()

In [None]:
# get all data matching on (first name + last name)

df2 = pd.read_sql('''SELECT * FROM petpoint AS pp
                     INNER JOIN volgistics AS vol ON pp."Intake_Record_Owner" = vol."First_name_Last_name"
                     INNER JOIN (SELECT * FROM salesforcecontacts AS sf_contacts
                            JOIN salesforcedonations AS sf_donations
                            ON sf_contacts."Account_ID" = sf_donations."Account_ID") AS sf
                     ON pp."Intake_Record_Owner" = (sf."First_Name" || ' ' || sf."Last_Name")
                  ''', conn)
df2.head()
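
The full-name comparison above relies on building a single string from first and last name. In SQLite, string concatenation is written with `||`, while `+` is numeric addition and coerces non-numeric text to 0. A quick check on a throwaway in-memory database (the names are made up) shows the difference:

```python
import sqlite3

# Throwaway in-memory database, separate from paws.db
demo = sqlite3.connect(':memory:')
plus_result = demo.execute("SELECT 'Ann' + ' ' + 'Lee'").fetchone()[0]
pipe_result = demo.execute("SELECT 'Ann' || ' ' || 'Lee'").fetchone()[0]
demo.close()

print(plus_result)   # numeric result: the text operands are treated as 0
print(pipe_result)   # concatenated string
```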

In [None]:
# close database connection

conn.close()