# Virtuous Interview Exam

> Monday August 14th, 2023

# Requirements

> Virtuous Data Schema, changes that require ETL have been highlighted <br>

<br>

![alternative text](images/virtuous_requirements.png)

## Summary
> I need to perform an ETL of 3 spreadsheets into SQL tables that match the Virtuous schema
- Contacts: A table of contacts, with 1 or 2 constituents per row
- Gifts: Donations from individual constituents
- Contact Methods: Contact information for individual constituents

In [None]:
#| default_exp utils

# Previewing the Datasets
> Loading Datasets using Pandas

In [None]:
#| export
import pandas as pd
import re

In [None]:
#| export
contact_methods = pd.read_csv('data/contact_methods.csv')
contacts = pd.read_csv('data/contacts.csv')
gifts = pd.read_csv('data/gifts.csv')

Filling na values with empty strings for consistency

In [None]:
#| export
contact_methods.fillna('', inplace=True)
contacts.fillna('', inplace=True)
gifts.fillna('', inplace=True)

In [None]:
contacts.head()

Unnamed: 0,Number,Company Name,First Name,Last Name,Street,City,State,Postal,Phone,E-mail,Remarks,Deceased?
0,653377813-7,,Karita & Kelvin,Lumbers,4 Bunting Parkway,Washington,DC,20535-871,kklumbers@ yahoo.co,,Is anonymous,
1,390551098-7,,Helga,Benech,48684 Jenifer Way,Las Vegas,NV,89130,,ebenech1@goodreads.com,,
2,093004505-X,,Masha,,353 Schmedeman Park,Indianapolis,IN,,,577-374-96523,,
3,729707142-0,A Company Co.,,,2055 Lakewood Parkway,Camden,NJ,8104,,,,No
4,488464926-5,,Hoyt,Castille,37 8th Trail,Grand Rapids,MI,49560,,fcastille4@timesonline.co.uk,,No


In [None]:
gifts.head()

Unnamed: 0,donor_number,gift_id,first_name,last_name,amount received,date,fund_id,credit card type,payment method,pledge_number,notes
0,848348568-0,95196378.0,Mannie,Turpin,$4.15,3/4/2019,,,PayPal,,
1,729707142-0,95196889.0,Cymbre,Cross,2.3648,3/5/2019,ChildSponsorship,,check,,
2,687119652-8,95197689.0,Ruggiero,Makepeace,$1.31,3/7/2019,,,cash,,
3,653377813-7,95198998.0,Karita,Lumbers,$2.04,3/10/2019,,American Ex,credit card,,In honor of Mannie Turpin
4,390551098-7,95198999.0,Helga,Benech,$5.80,2019/1/10,,,cash,89752384.0,


In [None]:
contact_methods.head()

Unnamed: 0,donor_number,Phone,E-mail,Fax
0,653377813-7,832-442-4988,,
1,390551098-7,,ebenech1@goodreads.com,
2,093004505-X,818-323-9865,,818-156-7985
3,729707142-0,,,
4,488464926-5,,fcastille4@timesonline.co.uk,


Before I begin performing the analysis in each method's respective notebook, I'm going to standardize the column names and types across the 3 DataFrames so I can pass them to each notebook and save myself some tedious typing**<br>
<br>
** Also doing this to show the usefulness of Jupyter notebooks in production environments :)

## Housekeeping 1
> Format Column Names
<br>

- Nothing worse than malformed column names, amiright?

### to_camel_case
> Apply camel casing to a string (strictly PascalCase: every word, including the first, is capitalized)

In [None]:
#| export
def to_camel_case(s):
    # Remove all non-alphanumeric characters and replace with a space
    s = re.sub(r'[^a-zA-Z0-9]', ' ', s)
    
    # Split by space and capitalize the first letter of each word
    words = s.split()
    return ''.join(word.capitalize() for word in words)
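
A quick sanity check of `to_camel_case` on a few of the raw column names from the previews above. This is only a sketch: the function is repeated so the cell runs on its own, and `raw_names` / `converted` are illustrative names that are not part of the exported module.

```python
import re

def to_camel_case(s):
    # Same logic as the exported function, repeated so this cell is self-contained
    s = re.sub(r'[^a-zA-Z0-9]', ' ', s)
    return ''.join(word.capitalize() for word in s.split())

raw_names = ['Company Name', 'E-mail', 'Deceased?', 'donor_number', 'amount received']
converted = [to_camel_case(name) for name in raw_names]
print(converted)
```

Note that `str.capitalize()` lowercases everything after the first letter of each word, so an all-caps token such as `ID` would come out as `Id`.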

### transform_cnames
> Rename DF cnames in place

In [None]:
#| export
def transform_cnames(df, func=to_camel_case):
    df.columns = df.columns.map(func)
    return None

In [None]:
#| export
for df in [contact_methods, contacts, gifts]:
    transform_cnames(df)

In [None]:
#| hide
contacts.columns

Index(['Number', 'CompanyName', 'FirstName', 'LastName', 'Street', 'City',
       'State', 'Postal', 'Phone', 'EMail', 'Remarks', 'Deceased'],
      dtype='object')

## Housekeeping 2
> Cleaning the column types

I identified 2 columns on the gift table that should be ints, replacing those values

In [None]:
#| export
int_cols = ['GiftId', 'PledgeNumber']

In [None]:
#| export
gifts[int_cols] = gifts[int_cols].replace({'':0}).astype(int)

Looks like AmountReceived should really be a float. I'm removing any special characters (besides dashes and periods) and converting to a float :)

In [None]:
#| export
gifts['AmountReceived'] = gifts.AmountReceived.apply(lambda x: float(re.sub(r'[^a-zA-Z0-9\.-]', '', x)))
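
To see what that regex does, here is a small sketch on values shaped like the ones in the gifts preview. `parse_amount` and `samples` are hypothetical names I'm using for illustration; the exported code applies the same expression inline through a lambda.

```python
import re

def parse_amount(raw):
    # Keep letters, digits, periods and dashes (as in the cell above); drop the rest
    return float(re.sub(r'[^a-zA-Z0-9\.-]', '', raw))

samples = ['$4.15', '2.3648', '$1.31']
parsed = [parse_amount(s) for s in samples]
print(parsed)

# Letters survive the regex, so a value with a currency code would still fail
try:
    parse_amount('USD 4.15')
    letters_rejected = False
except ValueError:
    letters_rejected = True
print(letters_rejected)
```

The pattern keeps letters, so it only works here because the amounts contain nothing but digits, periods and a leading `$`.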

## Housekeeping 3
> Identify and clean any pieces of data in the wrong column

In [None]:
#| hide
contacts[['Phone', 'EMail']].head(3)

Unnamed: 0,Phone,EMail
0,kklumbers@ yahoo.co,
1,,ebenech1@goodreads.com
2,,577-374-96523


In [None]:
#| export
def classify_phone_email(value):
    if "@" in value:
        return "email"
    if re.search(r'\d{3}-\d{3}-\d{4}', value):
        return "phone"
    return None
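
A short check of the classifier on the misplaced values from the preview (a sketch, with the function repeated so it runs standalone; `samples` and `labels` are illustrative names):

```python
import re

def classify_phone_email(value):
    if "@" in value:
        return "email"
    if re.search(r'\d{3}-\d{3}-\d{4}', value):
        return "phone"
    return None

samples = ['kklumbers@ yahoo.co', '577-374-96523', 'ebenech1@goodreads.com', '']
labels = [classify_phone_email(v) for v in samples]
print(labels)
```

The function only looks at the rough shape of the value. It does not validate it, so the email with a stray space and the phone number with an extra digit are still classified and will need cleaning later.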

In [None]:
#| export
for index, row in contacts.iterrows():
    phone_classification = classify_phone_email(row['Phone'])
    email_classification = classify_phone_email(row['EMail'])

    if phone_classification == "email":
        contacts.at[index, 'EMail'] = row['Phone']
        contacts.at[index, 'Phone'] = ''

    if email_classification == "phone":
        contacts.at[index, 'Phone'] = row['EMail']
        contacts.at[index, 'EMail'] = ''

In [None]:
#| hide
contacts[['Phone', 'EMail']].head(3)

Unnamed: 0,Phone,EMail
0,,kklumbers@ yahoo.co
1,,ebenech1@goodreads.com
2,577-374-96523,


## Housekeeping 4
> Consolidating contacts table <br>
<br>
- Searching for missing users <br>
- Splitting / joining households


There's a missing contact on the gifts table

In [None]:
#| hide
(~gifts.DonorNumber.isin(contacts.Number.unique())).any()

True

In [None]:
#| hide
(~contact_methods.DonorNumber.isin(contacts.Number.unique())).any()

False

In [None]:
#| export
donors_not_in_contacts = gifts.loc[~gifts.DonorNumber.isin(contacts.Number.unique()), :]

In [None]:
#| hide
donors_not_in_contacts

Unnamed: 0,DonorNumber,GiftId,FirstName,LastName,AmountReceived,Date,FundId,CreditCardType,PaymentMethod,PledgeNumber,Notes
9,809975531-Y,0,Adeline,Shakespeare,8.48,8/14/2019,,AMEX,credit card,0,
27,809975531-Y,0,Adeline,Shakespeare,7.58,8/14/2019,"Color run, ChildSponsorship",Mastercard,credit card,0,


Adding the missing donor to contacts

In [None]:
#| export
# One new contact row per missing donor; ignore_index avoids duplicate index labels
contacts = pd.concat([
    contacts,
    donors_not_in_contacts[['DonorNumber', 'FirstName', 'LastName']]
    .drop_duplicates()
    .rename(columns={'DonorNumber': 'Number'})
], ignore_index=True)
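
The find-and-append step can be tried on toy data. This is a minimal sketch: the donor numbers and names in `contacts_demo` and `gifts_demo` are invented and are not rows from the real files.

```python
import pandas as pd

contacts_demo = pd.DataFrame({'Number': ['A-1', 'B-2'],
                              'FirstName': ['Ann', 'Bob'],
                              'LastName': ['Lee', 'Ray']})
gifts_demo = pd.DataFrame({'DonorNumber': ['A-1', 'C-3', 'C-3'],
                           'FirstName': ['Ann', 'Cy', 'Cy'],
                           'LastName': ['Lee', 'Fox', 'Fox']})

# Gifts whose donor number has no matching contact
missing = gifts_demo.loc[~gifts_demo.DonorNumber.isin(contacts_demo.Number)]

# One new contact row per missing donor, renamed to the contacts schema
new_contacts = (missing[['DonorNumber', 'FirstName', 'LastName']]
                .drop_duplicates()
                .rename(columns={'DonorNumber': 'Number'}))

# ignore_index=True gives the combined frame a fresh 0..n-1 index
contacts_demo = pd.concat([contacts_demo, new_contacts], ignore_index=True)
print(contacts_demo)
```

Columns that exist only on the contacts side (address, phone and so on) would be `NaN` for the appended row, which is why the notebook calls `fillna('')` again further down.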

Splitting rows with 2 people in the first name column

In [None]:
#| hide
contacts.FirstName.head(1)

0    Karita & Kelvin
Name: FirstName, dtype: object

Split the names on ' & ' or ' and ', then expand the resulting lists into new columns

In [None]:
#| export
contacts[['FirstName', 'SecondaryFirstName']] = contacts['FirstName'].str.split(' & | and ', expand=True).fillna('')
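
Here is the same split on a tiny frame, as a sketch. The `demo` frame is made up (only the first name mirrors the preview), and it assumes, as the cell above does, that no row holds more than two people.

```python
import pandas as pd

demo = pd.DataFrame({'FirstName': ['Karita & Kelvin', 'Helga', 'Ann and Bob']})

# expand=True returns one column per piece; rows with a single name get None,
# which fillna('') turns into an empty string
parts = demo['FirstName'].str.split(' & | and ', expand=True).fillna('')
demo[['FirstName', 'SecondaryFirstName']] = parts
print(demo)
```

Because the separators include the surrounding spaces, a single name that merely contains the letters "and" (for example "Sandra") is left alone.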

In [None]:
#| export
records_to_join = contacts.loc[contacts.Number.duplicated(), :].to_dict(orient='records')

In [None]:
#| export
contacts = contacts.loc[~contacts.Number.duplicated(), :].copy()

In [None]:
#| export
for record in records_to_join:
    contacts.loc[contacts.Number.isin([record['Number']]), ['SecondaryFirstName', 'SecondaryLastName']] = [record['FirstName'], record['LastName']]

In [None]:
#| export
contacts[['LegacyIndividualId', 'SecondaryLegacyIndividualId']] = None

In [None]:
#| export
contacts.reset_index(inplace=True, drop=True)

Adding an ID for each individual, since none was provided

In [None]:
#| export
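next_id = 0  # running counter; avoids shadowing the built-in id()
for index, row in contacts.iterrows():
    contacts.loc[index, 'LegacyIndividualId'] = next_id
    next_id += 1
    # Households with a second person get a second, consecutive ID
    if row['SecondaryFirstName'] != '':
        contacts.loc[index, 'SecondaryLegacyIndividualId'] = next_id
        next_id += 1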
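
To make the numbering concrete, here is the same loop on a three-row toy frame (a sketch; `demo` and `next_id` are illustrative names). A household with two people uses up two consecutive IDs, so the rows after it are shifted by one.

```python
import pandas as pd

demo = pd.DataFrame({'FirstName': ['Karita', 'Helga', 'Masha'],
                     'SecondaryFirstName': ['Kelvin', '', '']})
demo['LegacyIndividualId'] = None
demo['SecondaryLegacyIndividualId'] = None

next_id = 0
for index, row in demo.iterrows():
    demo.loc[index, 'LegacyIndividualId'] = next_id
    next_id += 1
    # Only rows with a second person consume a second ID
    if row['SecondaryFirstName'] != '':
        demo.loc[index, 'SecondaryLegacyIndividualId'] = next_id
        next_id += 1

print(demo)
```

This lines up with the real table further down, where the contact at index 2 ends up with `LegacyIndividualId` 3.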


In [None]:
#| export
contacts.fillna('', inplace=True)

Adding Secondary Last Name for the appropriate users

In [None]:
contacts['SecondaryLastName'] = contacts.apply(lambda x: x['LastName'] if x['SecondaryLastName'] == '' and x['SecondaryFirstName'] != '' else x['SecondaryLastName'], axis=1)

## Housekeeping 5
> Cleaning Records with blank first or last names

Checking for records where FirstName and/or LastName are blank


In [None]:
#| export
blank_name_records = ((contacts.FirstName == '') | (contacts.LastName == ''))

Previewing the blank name records

In [None]:
contacts.loc[((contacts.FirstName == '') | (contacts.LastName == '')), :]

Unnamed: 0,Number,CompanyName,FirstName,LastName,Street,City,State,Postal,Phone,EMail,Remarks,Deceased,SecondaryFirstName,SecondaryLastName,LegacyIndividualId,SecondaryLegacyIndividualId
2,093004505-X,,Masha,,353 Schmedeman Park,Indianapolis,IN,,577-374-96523,,,,,,3,
3,729707142-0,A Company Co.,,,2055 Lakewood Parkway,Camden,NJ,8104.0,,,,No,,,4,
7,029456846-8,,,,608 Old Shore Alley,Marietta,GA,30066.0,,jdoley6@telegraph.co.uk,,,,,9,


Before I delete the records I'm going to check if the names are present on the gift table <br>
<br>
I'm going to start by getting the unique Numbers that the records belong to

In [None]:
#| export
blank_name_numbers = contacts.loc[blank_name_records, 'Number']

The names are present on the gifts table!

In [None]:
#| export
gift_name_records = gifts.loc[gifts.DonorNumber.isin(blank_name_numbers), ['DonorNumber', 'FirstName', 'LastName']].drop_duplicates()

I'm going to save to a variable, remove the invalid record, and then update the contacts table

In [None]:
#| export
gift_name_records = gift_name_records.loc[((gift_name_records.FirstName != '') & (gift_name_records.LastName != '')), :]
gift_name_records

Unnamed: 0,DonorNumber,FirstName,LastName
1,729707142-0,Cymbre,Cross
6,029456846-8,Romy,Doley
7,093004505-X,Masha,Butt Gow


Updating the records that previously had a blank first or last name

In [None]:
#| export
for _, row in gift_name_records.iterrows():
    contacts.loc[contacts['Number'] == row['DonorNumber'], ['FirstName', 'LastName']] = [row['FirstName'], row['LastName']]

All the records now have valid names

In [None]:
contacts.loc[blank_name_records, :]

Unnamed: 0,Number,CompanyName,FirstName,LastName,Street,City,State,Postal,Phone,EMail,Remarks,Deceased,SecondaryFirstName,SecondaryLastName,LegacyIndividualId,SecondaryLegacyIndividualId
2,093004505-X,,Masha,Butt Gow,353 Schmedeman Park,Indianapolis,IN,,577-374-96523,,,,,,3,
3,729707142-0,A Company Co.,Cymbre,Cross,2055 Lakewood Parkway,Camden,NJ,8104.0,,,,No,,,4,
7,029456846-8,,Romy,Doley,608 Old Shore Alley,Marietta,GA,30066.0,,jdoley6@telegraph.co.uk,,,,,9,
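
The whole backfill (find blank names, look them up on the gifts table, write them back) can be sketched end to end on toy data. The frames below are invented for illustration and only borrow a couple of names from the previews.

```python
import pandas as pd

contacts_demo = pd.DataFrame({'Number': ['A-1', 'B-2'],
                              'FirstName': ['', 'Helga'],
                              'LastName': ['', 'Benech']})
gifts_demo = pd.DataFrame({'DonorNumber': ['A-1', 'A-1', 'B-2'],
                           'FirstName': ['Romy', 'Romy', 'Helga'],
                           'LastName': ['Doley', 'Doley', 'Benech']})

# Contacts with a blank first or last name, and their donor numbers
blank = (contacts_demo.FirstName == '') | (contacts_demo.LastName == '')
numbers = contacts_demo.loc[blank, 'Number']

# Candidate names from the gifts table, one row per donor/name combination
names = gifts_demo.loc[gifts_demo.DonorNumber.isin(numbers),
                       ['DonorNumber', 'FirstName', 'LastName']].drop_duplicates()

# Write each recovered name back onto the matching contact
for _, row in names.iterrows():
    contacts_demo.loc[contacts_demo['Number'] == row['DonorNumber'],
                      ['FirstName', 'LastName']] = [row['FirstName'], row['LastName']]

print(contacts_demo)
```

If a donor number appeared on the gifts table under two different names, the last one in the loop would win, so that case would be worth checking before relying on this.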


## Housekeeping 6
> Splitting Project Codes

In [None]:
#| export
project_codes = gifts.FundId.str.split(', ', expand=True)

In [None]:
#| export
project_codes.head(3)

Unnamed: 0,0,1
0,,
1,ChildSponsorship,
2,,


In [None]:
#| export
gifts[['Project1Code', 'Project2Code']] = project_codes

In [None]:
#| export
gifts = gifts.loc[:, gifts.columns.drop('FundId')].copy()
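
A small sketch of the split on the three kinds of `FundId` values seen above (blank, one code, two codes). The `demo` frame is illustrative, and it assumes no gift lists more than two funds, otherwise the two target columns would no longer line up with the split result.

```python
import pandas as pd

demo = pd.DataFrame({'FundId': ['', 'ChildSponsorship', 'Color run, ChildSponsorship']})

# One column per comma-separated code
codes = demo.FundId.str.split(', ', expand=True)
demo[['Project1Code', 'Project2Code']] = codes
demo = demo.drop(columns='FundId')
print(demo)
```

Rows with fewer than two codes get `None` in `Project2Code`, so a follow-up `fillna('')` may be wanted if empty strings are the convention.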

## Housekeeping 7
> Formatting date

In [None]:
gifts['Date']

0       3/4/2019
1       3/5/2019
2       3/7/2019
3      3/10/2019
4      2019/1/10
5      3/20/2019
6      3/24/2019
7       4/9/2019
8      4/12/2019
9      8/14/2019
10     4/13/2019
11     4/13/2019
12     4/17/2019
13     4/19/2019
14     5/10/2019
15      6/4/2019
16      6/5/2019
17    2019/06/10
18     6/11/2019
19     6/20/2019
20     6/20/2019
21      7/1/2019
22     7/18/2019
23    2019/07/01
24      8/1/2019
25      8/3/2019
26     8/12/2019
27     8/14/2019
28     8/26/2019
29      9/1/2019
30      9/6/2019
Name: Date, dtype: object

Since the dates come in two different formats (month/day/year and year/month/day), I'm going to create a custom parser

In [None]:
#| export
from datetime import datetime

def custom_parser(date_str):
    # Most dates are month/day/year; fall back to year/month/day otherwise
    try:
        return datetime.strptime(date_str, '%m/%d/%Y')
    except ValueError:
        return datetime.strptime(date_str, '%Y/%m/%d')

gifts['GiftDate'] = gifts['Date'].apply(custom_parser)
gifts = gifts.loc[:, gifts.columns.drop('Date')].copy()
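
A quick check of the parser on one value of each shape from the `Date` column (a sketch, with the function repeated so it runs on its own; `samples` and `parsed` are illustrative names):

```python
from datetime import datetime

def custom_parser(date_str):
    try:
        return datetime.strptime(date_str, '%m/%d/%Y')
    except ValueError:
        return datetime.strptime(date_str, '%Y/%m/%d')

samples = ['3/4/2019', '2019/1/10', '2019/06/10']
parsed = [custom_parser(s) for s in samples]
for raw, value in zip(samples, parsed):
    print(raw, '->', value.date())
```

A string that matches neither format still raises `ValueError` from the second `strptime` call, which is probably what I want here: an unexpected format should stop the ETL rather than pass through silently.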

## Housekeeping 8
> Pledge ID <br>
<br>

Fix pledge IDs by replacing the duplicated placeholder value (0) with a unique ID

In [None]:
#| hide
gifts.PledgeNumber.unique()

array([       0, 89752384, 57398862, 65139856])

In [None]:
#| export
gifts.loc[ gifts.PledgeNumber == 0, 'PledgeNumber'] = gifts[gifts.PledgeNumber == 0].index
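
Here is what that assignment does on a toy column (a sketch; `demo` and `mask` are illustrative names). Each placeholder `0` is replaced by the row's own index label, which is unique by construction.

```python
import pandas as pd

demo = pd.DataFrame({'PledgeNumber': [0, 89752384, 0, 0]})

mask = demo.PledgeNumber == 0
# Use each row's index label as its stand-in pledge number
demo.loc[mask, 'PledgeNumber'] = demo[mask].index
print(demo.PledgeNumber.tolist())
```

Two caveats: the row at index 0 keeps the value `0` (it is simply no longer shared with other rows), and the approach assumes a real pledge number never equals a row index. That seems safe with 8-digit pledge numbers and a few dozen rows, but it is an assumption.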

In [None]:
#| export
gifts = gifts.rename(columns={'PledgeNumber': 'LegacyPledgeID'})

# Export

In [None]:
#| hide
import nbdev
nbdev.nbdev_export('00_Setup.ipynb')

<br>