## Logic <a anchor="href" id = "top"></a>
For each government provider, iterate over the original raw data and fill in missing contact, phone, and email values in the childcare database. Each fill is a targeted write using the pandas `DataFrame.at[index, col]` accessor. The target index is found by iterating over the rows of the raw data and looking up the matching row in the childcare database by, in order of preference:
   1. source id, if available
   2. name & address, if neither has missing values
   3. name, if all names are unique and address has NAs
   4. address, if all addresses are unique and name has NAs
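
A minimal sketch of one targeted fill, assuming a name match. The frames `raw` and `final` and all of their values are invented for illustration; the notebook's own implementation is the `fill_missing` function defined further down.

```python
import pandas as pd

# Hypothetical stand-ins for a raw provider file and the childcare database.
raw = pd.DataFrame({
    "name": ["Alpha Daycare", "Beta Preschool"],
    "phone": ["555-0100", "555-0101"],
})
final = pd.DataFrame(
    {
        "name": ["Beta Preschool", "Alpha Daycare"],
        "provider": ["Nunavut", "Nunavut"],
        "phone": [None, None],
    },
    index=pd.Index([10, 11], name="id"),
)

for i, name in enumerate(raw["name"]):
    # Look up the label of the matching row in the database.
    matches = final.loc[(final["name"] == name) & (final["provider"] == "Nunavut")]
    idx = matches.index[0]
    # Scalar write: exactly one row label and one column.
    final.at[idx, "phone"] = raw.at[i, "phone"]
```

Because `raw.at[i, ...]` uses the loop counter as a label, this pattern only works when the raw frame has a default 0..n-1 index.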

### [TO DO: ](#todo)
* Verify automatically

Prov/Terr | phone | email | contact | completed
----------|------|------|---------|-----------
NT | Y | N | Y | Yes
YT | Y | N | N | Yes 
NU | Y | Y | N | Yes
QC | Y | Y | N | Yes
NB | Y | N | N | Yes
AB | Y | N | N | Yes
BC | Y | Y | N | Yes
NL | Y | Y | Y | Yes
NS | Y | Y | N | Yes
PE | Y | Y | Y | Yes
ON | N | N | N | -
MB | Y | N | N | Yes
SK | Y | N | N | Yes

## Table of Contents

### Functions
* [Check Function](#check-func)
* [Fill Function](#fill-func)

### Operations
* [Set up and NB phone](#setup-nb)
* [NU](#nu)
* [NT](#nt)
* [YT](#yt)
* [BC](#bc)
* [NL](#nl)
* [PE](#pe)
* [NS](#ns)
* [MB](#mb)
* [SK](#sk)
* [AB](#ab)
* [QC](#qc)


### [Verify](#verify)

In [1]:
import pandas as pd
import os
import glob
import re
import numpy as np

In [2]:
os.chdir('/Users/kt/Documents/work/STATCAN/Projects/ODECF/Collection-ODECF/data/childcare')

In [3]:
# glob.glob('*.csv')

In [4]:
# RAW DATA
NT = pd.read_csv('NT-childcare.csv', index_col=[0])
QC = pd.read_csv('QC-CPE-GARD-MF.csv', index_col=[0], encoding = "utf-8-sig")
NB = pd.read_csv('NB-childcare.csv', index_col=[0])
AB = pd.read_csv('AB-childcare.csv', index_col=[0])
NL = pd.read_csv('NL-childcare.csv', index_col=[0])
MB = pd.read_csv('MB-childcare.csv', index_col=[0])
PE = pd.read_csv('PE-childcare.csv', index_col=[0])
NS = pd.read_csv('NS-Child_Care_Directory.csv', index_col=[0])
SK = pd.read_csv('SK-childcare.csv', index_col=[0])
ON = pd.read_csv('ON-childcare.csv', index_col=[0])
YT = pd.read_csv('YT-childcare.csv', index_col=[0])
NU = pd.read_csv('NU-childcare.csv', index_col=[0])
BC = pd.read_csv('BC-childcare_locations.csv', index_col=[0])

In [5]:
cc = pd.read_excel("../../..//FINAL/childcare-facilities-utf8.xlsx").set_index('id')

In [6]:
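# Assign the result back; an in-place replace on a selected column is not guaranteed to update cc
cc['provider'] = cc['provider'].replace('Province of Qubec', 'Province of Québec')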
cc.provider.unique()

array(['Northwest Territories', 'Province of New Brunswick',
       'Province of Alberta', 'Province of Manitoba',
       'Province of Québec', 'Province of Ontario', 'GoDayCare.com',
       'Province of Nova Scotia', 'Province of Newfoundland and Labrador',
       'Province of Prince Edward Island', 'Province of British Columbia',
       'Province of Saskatchewan', 'Yukon Territory', 'Nunavut'],
      dtype=object)

### Check function <a class="anchor" id="check-func"></a>

In [7]:
def precheck(df, provider, name = False, address = False, source_id = False):
    """
    
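    Checks that the key values (name, address, source_id) in the raw dataframe and in the
    final childcare database match for one provider.
    Returns the first non-empty set of unmatched values (usually a product of string cleaning);
    prints "All good." when every requested key matches.
    
    param df: pandas dataframe of raw provincial/territorial childcare data.
    param provider: expects string name of data provider.
    param name, address, source_id: if True, compare that key.
    
    N.B. input df must contain a column for each key being checked ("name", "address", "source_id")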
    """
    
    if name == True:
        # Check for names in df missing from cc
        cc_missing = set(df.name) - set(cc.loc[cc.provider == provider].name)
        
        if len(cc_missing) > 0:
            return cc_missing

        
        # Check for names in cc missing from df
        df_missing = set(cc.loc[cc.provider == provider].name) - set(df.name)
        
        if len(df_missing) > 0:
            return df_missing
        
    if address == True:
        # Check for address in df missing from cc
        cc_missing = set(df.address) - set(cc.loc[cc.provider == provider].source_full_address)
    
        if len(cc_missing) > 0:
            return cc_missing
                         

        # Check for addresses in cc missing from df
        df_missing = set(cc.loc[cc.provider == provider].source_full_address) - set(df.address)
        
        if len(df_missing) > 0:
            return df_missing
        
    if source_id == True:
        # Check for source ids in df missing from cc
        cc_missing = set(df.source_id) - set(cc.loc[cc.provider == provider].source_id)
    
        if len(cc_missing) > 0:
            return cc_missing
                         

        # Check for source ids in cc missing from df
        df_missing = set(cc.loc[cc.provider == provider].source_id) - set(df.source_id)
        
        if len(df_missing) > 0:
            return df_missing
    
    print("All good.")
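
`precheck` returns as soon as it finds one non-empty difference, so a mismatch in the other direction can stay hidden until the first is fixed. The sketch below reports both directions at once. `key_mismatches` is a hypothetical helper, not part of the notebook, and the sample names are invented.

```python
import pandas as pd

def key_mismatches(raw_keys, final_keys):
    """Return the values found on only one side (hypothetical helper)."""
    raw_set, final_set = set(raw_keys), set(final_keys)
    return {
        "only_in_raw": raw_set - final_set,
        "only_in_final": final_set - raw_set,
    }

# The raw name still contains a double space that the database has cleaned.
raw_names = pd.Series(["Alpha Daycare", "Beta  Preschool"])
final_names = pd.Series(["Alpha Daycare", "Beta Preschool"])

report = key_mismatches(raw_names, final_names)
```

An empty set on both sides corresponds to the "All good." case.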

### Fill Function <a class="anchor" id="fill-func"></a>

In [8]:
def fill_missing(df, provider, by, phone = False, contact = False, email = False, verbose = False):
    """
    
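    Fills phone, contact and/or email values in the childcare dataframe (cc) from a raw
    provider dataframe, matching rows on the key(s) named in `by`.
    
    param df: pandas dataframe of raw provincial/territorial childcare data.
    param provider: name of data provider, as string.
    param by: key(s) to match on; one of 'name', 'address', 'source_id', 'name & address',
        'name & source_id', 'name & source_id & address'.
        'address' is matched against cc.source_full_address.
    param phone: if True, fills in phone data.
    param contact: if True, fills in contact data.
    param email: if True, fills in email data.
    param verbose: if True, prints each row as it is matched (only used for 'name & address').
    
    Output: fills childcare dataframe in place with selected raw dataframe contact info.
    
    N.B. Input and childcare dataframes must have columns named "phone", "contact", "email"
    for each field being filled. df must have a default 0..n-1 index, because rows are read
    with df.at[i, col] where i is the row position.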
    
    """
    by_opts = ['name', 'address', 'source_id', 'name & address', 'name & source_id', 'name & source_id & address']
    if by not in by_opts:
        raise ValueError("Invalid by value. Expected one of: %s" % by_opts)
    
    tracker = []
    print("Filling data from {}".format(provider))
    
    
    # BY = NAME 
    if by == "name":
        for i,n in enumerate(df.name):
            # Get index
            idx = cc.loc[(cc.name == n) & (cc.provider == provider)].index.to_list()[0]
            tracker.append(idx)

            # Update information
            if phone == True:
                cc.at[idx, "phone"] = df.at[i, "phone"]
            if contact == True:
                cc.at[idx, "contact"] = df.at[i, "contact"]
            if email == True:
                cc.at[idx, "email"] = df.at[i, "email"]
    
    
    # BY = SOURCE_FULL_ADDRESS
    if by == "address":
        for i,a in enumerate(df.address):
            # Get index
            idx = cc.loc[(cc.source_full_address == a) & (cc.provider == provider)].index.to_list()[0]
            tracker.append(idx)

            # Update information
            if phone == True:
                cc.at[idx, "phone"] = df.at[i, "phone"]
            if contact == True:
                cc.at[idx, "contact"] = df.at[i, "contact"]
            if email == True:
                cc.at[idx, "email"] = df.at[i, "email"]
                
                
    # BY = SOURCE_ID
    if by == "source_id":
        for i,a in enumerate(df.source_id):
            # Get index
            idx = cc.loc[(cc.source_id == a) & (cc.provider == provider)].index.to_list()[0]
            tracker.append(idx)

            # Update information
            if phone == True:
                cc.at[idx, "phone"] = df.at[i, "phone"]
            if contact == True:
                cc.at[idx, "contact"] = df.at[i, "contact"]
            if email == True:
                cc.at[idx, "email"] = df.at[i, "email"]
                
                
    # BY = NAME & SOURCE_FULL_ADDRESS
    if by == "name & address":
        for i,a,n in zip(range(len(df)), df.address, df.name):
            if verbose == True:
                print(i, a, n)
            # Get index
            idx = cc.loc[(cc.source_full_address == a) & (cc.name == n) & (cc.provider == provider)].index.to_list()[0]
            tracker.append(idx)

            # Update information
            if phone == True:
                cc.at[idx, "phone"] = df.at[i, "phone"]
            if contact == True:
                cc.at[idx, "contact"] = df.at[i, "contact"]
            if email == True:
                cc.at[idx, "email"] = df.at[i, "email"]
    
    # BY = NAME & SOURCE_ID
    if by == "name & source_id":
        for i,src_id,n in zip(range(len(df)), df.source_id, df.name):
            # Get index
            idx = cc.loc[(cc.source_id == src_id) & (cc.name == n) & (cc.provider == provider)].index.to_list()[0]
            tracker.append(idx)

            # Update information
            if phone == True:
                cc.at[idx, "phone"] = df.at[i, "phone"]
            if contact == True:
                cc.at[idx, "contact"] = df.at[i, "contact"]
            if email == True:
                cc.at[idx, "email"] = df.at[i, "email"]
                
    # BY = NAME & SOURCE_ID & SOURCE_FULL_ADDRESS
    if by == "name & source_id & address":
        for i,src_id,n,a in zip(range(len(df)), df.source_id, df.name, df.address):
            # Get index
            idx = cc.loc[(cc.source_id == src_id) & (cc.name == n) & (cc.source_full_address == a) & (cc.provider == provider)].index.to_list()[0]
            tracker.append(idx)

            # Update information
            if phone == True:
                cc.at[idx, "phone"] = df.at[i, "phone"]
            if contact == True:
                cc.at[idx, "contact"] = df.at[i, "contact"]
            if email == True:
                cc.at[idx, "email"] = df.at[i, "email"]
                 
    # Check all rows were accounted for
    if len(tracker) == len(df):
        print("All rows found.")
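
For comparison, the same name-and-address match can be expressed as a left merge instead of a row-by-row loop. This is a swapped-in technique, not what `fill_missing` does, and the toy frames below are invented. `validate="one_to_one"` makes pandas raise if the key pair is duplicated on either side.

```python
import pandas as pd

# Hypothetical raw data: two facilities share a name but not an address.
raw = pd.DataFrame({
    "name": ["Grow with Joy", "Grow with Joy", "Little Stars"],
    "address": ["1 First Ave", "2 Second Ave", "3 Third Ave"],
    "phone": ["555-0100", "555-0101", "555-0102"],
})
# Hypothetical database rows, in a different order, indexed by id.
final = pd.DataFrame(
    {
        "name": ["Little Stars", "Grow with Joy", "Grow with Joy"],
        "source_full_address": ["3 Third Ave", "2 Second Ave", "1 First Ave"],
    },
    index=pd.Index([7, 8, 9], name="id"),
)

merged = (
    final.reset_index()
    .merge(
        raw.rename(columns={"address": "source_full_address"}),
        on=["name", "source_full_address"],
        how="left",
        validate="one_to_one",  # fail loudly on duplicate keys
    )
    .set_index("id")
)
```

Rows in `final` with no match in `raw` would keep a missing `phone`, which is easy to count afterwards.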

### Set up and NB phone  <a class="anchor" id="setup-nb"></a>

In [9]:
cc[['contact', 'email']] = None, None

# Get phone from source address where found (NB)
def get_phone(x):
    try:
        return re.search(r'\(\d{3}\) \d{3}\-\d{4}', x)[0]
        
    except TypeError:
        return None
    
cc['phone'] = cc.source_full_address.map(get_phone).to_list()
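
A standalone sketch of the same extraction, for illustration only. `extract_phone` and the sample strings are hypothetical; it checks for a non-string or a failed match explicitly rather than relying on `TypeError`.

```python
import re

# Matches phone numbers written as "(NNN) NNN-NNNN".
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

def extract_phone(text):
    """Return the first '(NNN) NNN-NNNN' substring, or None."""
    if not isinstance(text, str):
        return None  # missing values arrive as NaN floats or None
    match = PHONE_PATTERN.search(text)
    return match.group(0) if match else None

samples = [
    "12 Main St, Moncton (506) 555-0123",
    "45 Elm Rd, no phone listed",
    None,
]
phones = [extract_phone(s) for s in samples]
```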

### NU  <a class="anchor" id="nu"></a>

In [10]:
NU.rename(columns = {
    'Facility/Program':'name',
    'Address':'address',
    'Phone':'phone',
    'Email':'email',
    'Fax':'fax'
}, inplace = True)

NU.reset_index(inplace = True)

In [11]:
len(NU.name.unique()) == len(NU)

True

In [12]:
# Clean name
NU.name = NU.name.map(lambda x: x.strip())

precheck(NU, "Nunavut", name = True)

All good.


In [13]:
fill_missing(NU, "Nunavut", by = "name", phone = True, email = True)

Filling data from Nunavut
All rows found.


**Post op check**

In [14]:
# cc.loc[cc.provider == "Nunavut"].tail(10)

In [15]:
# NU

### NT phone and contact  <a class="anchor" id="nt"></a>

In [16]:
# Rename raw data
NT.rename(columns = {'facility_name' : 'name'}, inplace = True)

# Clean NT text
NT.name = NT.name.map(lambda x: x.strip().replace('  ', ' '))

# Reset index
NT = NT.reset_index().drop(columns = ['index'])

**Pre-op Check:**

In [17]:
NT.drop_duplicates(inplace = True)
NT.reset_index(inplace = True)

In [18]:
len(NT.name.unique()) == len(NT)

True

In [19]:
precheck(NT, "Northwest Territories", name = True)

All good.


**Operation:**

In [20]:
fill_missing(NT, "Northwest Territories", by = "name", phone = True, contact = True)

Filling data from Northwest Territories
All rows found.


**Post-op check:**

In [21]:
# NT.sort_values(by = "name")

In [22]:
# cc.loc[cc.provider == "Northwest Territories"].sort_values(by = "name")

### YT Phone  <a class="anchor" id="yt"></a>

In [23]:
YT['name'] = YT['name'].replace('Three H Preschool  Canada', 'Three H Preschool Canada')


# check 3 - all 50 in orig are accounted for in cc
# len(cc.loc[cc.name.isin(YT.name)])

precheck(YT, "Yukon Territory", name = True)

All good.


Two facilities named Grow with Joy (indices 9 and 21) have the same name but different addresses.

In [24]:
len(YT.name.unique()) == len(YT)

False

In [25]:
len(YT.address.unique()) == len(YT)

False
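
When neither name nor address is unique on its own, the pair can still identify each row, which is why the fill for this provider matches on both. A minimal sketch with invented rows (the frame `yt_toy` is hypothetical):

```python
import pandas as pd

yt_toy = pd.DataFrame({
    "name": ["Grow with Joy", "Grow with Joy", "Little Stars"],
    "address": ["1 First Ave", "2 Second Ave", "2 Second Ave"],
})

name_unique = yt_toy["name"].is_unique
address_unique = yt_toy["address"].is_unique
# A row is a duplicate only if both name and address repeat.
pair_unique = not yt_toy.duplicated(subset=["name", "address"]).any()
```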

In [26]:
fill_missing(YT, "Yukon Territory", by = "name & address", phone = True)

Filling data from Yukon Territory
All rows found.


**Manual post check:**

In [27]:
# cc.loc[cc.source_full_address.isin(YT.address)].sort_values(by = "name")

In [28]:
# YT.sort_values(by = "name")

Check on phone numbers for addresses that have duplicates:


In [29]:
# cc.loc[(cc.provider == "Yukon Territory") & ((cc.source_full_address == "95 Lewes Boulevard, Y1A 3J4") | (cc.source_full_address == "22 Falcon Drive, Y1A 6C8"))]

In [30]:
# YT.loc[(YT.address == "95 Lewes Boulevard, Y1A 3J4") | (YT.address == "22 Falcon Drive, Y1A 6C8")]

### BC <a class="anchor" id="bc"></a>

In [31]:
BC.reset_index(inplace = True)
BC.rename(columns = {
    'FAC_PARTY_ID':'source_id',
    'NAME':'name',
    'PHONE':'phone',
    'EMAIL':'email',
    'WEBSITE':'website'
}, inplace = True)

In [32]:
# Check if source id is usable
len(BC.source_id.unique()) == len(BC)

True

In [33]:
precheck(BC, "Province of British Columbia", source_id = True)

All good.


In [34]:
fill_missing(BC, "Province of British Columbia", by = "source_id", phone = True, email = True)

Filling data from Province of British Columbia
All rows found.


**Post-op check:**

In [35]:
# BC[['source_id', 'name', 'phone', 'email']].tail(10)

In [36]:
# cc.loc[cc.provider == "Province of British Columbia"][['source_id', 'name', 'phone', 'email']].tail(10)

### NL <a class = "anchor" id = "nl"></a>

In [37]:
NL.rename(columns = {
    'Name':'name',
    'Street Address':'address',
    'Contact Name': 'contact'
}, inplace = True)

In [38]:
# Extract email from Contact column
def get_email(x):
    try:
        temp = x.replace('Visit Centre Website', '')
        return re.sub(r'\(\d{3}\) \d{3}\-\d{4}', '', temp)
        
    # Missing values (NaN floats) have no .replace method
    except (AttributeError, TypeError):
        return None
    
NL['email'] = NL.Contact.map(get_email)
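
A standalone sketch of the same idea: remove the link text and the phone number from a contact string and keep what is left. `strip_to_email` and the sample string are hypothetical. Unlike `get_email`, this version also trims whitespace and returns `None` when nothing remains.

```python
import re

PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

def strip_to_email(contact):
    """Drop the website link text and phone number from a contact string."""
    if not isinstance(contact, str):
        return None  # NaN or None in the raw column
    text = contact.replace("Visit Centre Website", "")
    text = PHONE_PATTERN.sub("", text).strip()
    return text or None

cleaned = strip_to_email("(709) 555-0142 info@example.org Visit Centre Website")
missing = strip_to_email(float("nan"))
```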

In [39]:
len(NL.name.unique()) == len(NL)

True

In [40]:
len(NL.address.unique()) == len(NL)

False

In [41]:
precheck(NL, "Province of Newfoundland and Labrador", name = True, address = True)

All good.


In [42]:
fill_missing(NL, "Province of Newfoundland and Labrador", by = "name", phone = True, email = True, contact = True)

Filling data from Province of Newfoundland and Labrador
All rows found.


**Post op check:**

In [43]:
# cc.loc[cc.provider == "Province of Newfoundland and Labrador"]

In [44]:
# NL

In [45]:
# NL.loc[NL.address.duplicated(keep = False)]

In [46]:
# cc.loc[cc.source_full_address == "45 St. Marks Avenue"]

### PE  <a class="anchor" id="pe"></a>

In [47]:
PE.rename(columns = {
    'Name':'name',
    'Address':'address',
    'Contact Name':'contact',
    'E-mail':'email',
    'Phone':'phone'
}, inplace = True)

In [48]:
precheck(PE, "Province of Prince Edward Island", name = True, address = True)

All good.


In [49]:
len(PE.name.unique()) == len(PE)

True

In [50]:
len(PE.address.unique()) == len(PE)

False

In [51]:
fill_missing(PE, "Province of Prince Edward Island", by = "name & address", phone = True, contact = True, email = True)

Filling data from Province of Prince Edward Island
All rows found.


### NS <a class="anchor" id="ns"></a>

In [52]:
NS.reset_index(inplace=True)
NS.rename(columns = {
    'FACILITY_NAME':'name',
    'FACILITY_IDENTIFIER':'source_id',
    'ADDRESS':'address',
    'PHONE 1':'phone',
    'EMAIL 1':'email'
}, inplace = True)

# Check to use source id
len(NS.source_id.unique()) == len(NS)

True

In [53]:
precheck(NS, "Province of Nova Scotia", source_id = True)

All good.


In [54]:
len(NS.source_id.unique()) == len(NS)

True

In [55]:
fill_missing(NS, "Province of Nova Scotia", by = 'source_id', email = True, phone = True)

Filling data from Province of Nova Scotia
All rows found.


**Post-op check:**

In [56]:
# cc.loc[cc.provider == "Province of Nova Scotia"]

In [57]:
# NS[['name', 'email', 'phone']]


### MB  <a class="anchor" id="mb"></a>

In [58]:
MB.rename(columns = {
    'Legal Name' : 'name',
    'Phone' : 'phone',
    'Address' : 'address',
    'Facility Number' : 'source_id'
}, 
          inplace = True)

In [59]:
# Deduplicate MB
MB.drop_duplicates(inplace = True)
MB.reset_index(inplace = True)

len(MB) == len(cc.loc[cc.provider == "Province of Manitoba"])

True

In [60]:
len(MB.name.unique()) == len(MB)

False

In [61]:
len(MB.address.unique()) == len(MB)

False

In [62]:
len(MB.source_id.unique()) == len(MB)

False

In [63]:
# MB[MB.name.duplicated()]

In [64]:
# MB.loc[(MB.source_id == 100141) | (MB.source_id == 102146)]

In [65]:
# Clean name
MB.name = MB.name.map(lambda x: x.replace('  ', ' '))
MB.address = MB.address.map(lambda x: x.strip().replace('  ', ' ').replace('  ', ' '))
precheck(MB, "Province of Manitoba", source_id = True, name = True, address = True)

All good.


In [66]:
# cc.loc[(cc.provider == "Province of Manitoba") & (cc.source_full_address.str.contains('20 Island Shore Blvd'))]

In [67]:
fill_missing(MB, "Province of Manitoba", by = "name & source_id & address", phone = True)

Filling data from Province of Manitoba
All rows found.


**Post-op Check:**

In [68]:
# MB[['name', 'address', 'phone']]

In [69]:
# cc.loc[cc.provider == "Province of Manitoba"]#[['name', 'source_full_address', 'phone']]

### SK <a class = "anchor" id="sk"></a>

In [70]:
SK.rename(columns = {
    'facility_name': 'name',
    'full_address':'address'
}, inplace = True)

len(SK.address.unique()) == len(SK)

False

In [71]:
# Row 68 (Saskatoon Lutheran Early Learning Center) has an incomplete phone number; set it by hand
SK.loc[SK.address == "502 - 5th Street North, Martensville, SK, S0K 2T0", "phone"] = "306-931-4633"

In [72]:
len(SK.name.unique()) == len(SK)

False

In [73]:
# SK.address.sort_values().to_list()

In [74]:
# OK, same address with different phone extensions - CHECK IN FINAL - looks good
# SK[SK.address == '1940 McIntyre Street, Regina, SK, S4P 2R3 (Downtown)']

In [75]:
# Clean SK
SK.name = SK.name.map(lambda x: x.replace('  ', ' '))
SK.address = SK.address.map(lambda x: x.replace('  ', ' '))

precheck(SK, "Province of Saskatchewan", name = True, address = True)

All good.


In [76]:
fill_missing(SK, "Province of Saskatchewan", by = "name & address", phone = True)

Filling data from Province of Saskatchewan
All rows found.


**Post-op Check:**

In [77]:
# cc.loc[cc.name == "Saskatoon Lutheran Early Learning Center"]

In [78]:
# SK[['name', 'address', 'phone']]

In [79]:
# cc.loc[(cc.provider == "Province of Saskatchewan")][['name', 'source_full_address', 'phone']]#.sort_values(by = "name")

### AB <a class="anchor" id="ab"></a>

In [80]:
AB.rename(columns = {
    'Program Name' : 'name',
    'Program Address' : 'address',
    'Pseudo ProgramID': 'source_id',
    'Phone Number' : 'phone'
                    }, inplace = True)

In [81]:
AB.drop_duplicates(inplace = True)
AB.drop_duplicates(subset = ['source_id', 'name', 'address'], inplace = True)
AB.reset_index(inplace = True)

In [82]:
len(AB.name.unique()) == len(AB)

False

In [83]:
len(AB.address.unique()) == len(AB)

False

In [84]:
len(AB.source_id.unique()) == len(AB)

True

In [85]:
# Name cleaning
AB.name = AB.name.map(lambda x: x.replace('  ', ' '))
AB.address = AB.address.map(lambda x: x.replace('  ', ' '))


precheck(AB, "Province of Alberta", name = True, address = True, source_id = True)

All good.


In [86]:
fill_missing(AB, "Province of Alberta", by = "source_id", phone = True)

Filling data from Province of Alberta
All rows found.


**Post-op check:**

In [87]:
AB[['source_id', 'Type of program', 'name', 'address', 'phone']].tail(10)

| | source_id | Type of program | name | address | phone |
|---|---|---|---|---|---|
| 2908 | 180F60691DCF55197DAE21618FAA6F6F | DAY CARE PROGRAM | SUCKER CREEK FIRST NATION DAYCARE | UNIT 2009; BLK 2000 SE-18-74-14-5 | 7805232969 |
| 2909 | 8A5642A0DBA99868C88AD74B3842FE64 | DAY CARE PROGRAM | CLEVER CUBS PRESCHOOL ACADEMY INC. | UNIT 205, 35 CRANFORD WAY SE | 4038501939 |
| 2910 | BA5DA2371FDB54FA99E6DC23D332A315 | DAY CARE PROGRAM | GLOBAL KIDS II | UNIT 170, 7212 MACLEOD TRAIL SE | 4034785922 |
| 2911 | 0381BFAED077653EF485AE5AE4B3E444 | DAY CARE PROGRAM | MOSAIC MONTESSORI PARKDALE | 3512 - 5 AVENUE NW | 4039849090 |
| 2912 | D7E8BD6C99C01ABC34162F8DA374ABCB | OUT OF SCHOOL CARE PROGRAM | MINI TREASURES OUT OF SCHOOL CARE | 310-500 TIMBERLANDS DRIVE | 5872733499 |
| 2913 | F78B6BE2223082D198D73A5B2A6171F3 | DAY CARE PROGRAM | MINI TREASURES DAYCARE | 310-500 TIMBERLANDS DRIVE | 5872733499 |
| 2914 | D08B691412B0558E1799A67F1575B5F5 | DAY CARE PROGRAM | BUTTERFLY KISSES EARLY LEARNING CENTRE | 5019 53 STREET | 7807098682 |
| 2915 | 9E48CFB7FC5B3634E3441573CCFDCDD7 | DAY CARE PROGRAM | ORDER TO REMEDY POSTPONED, ORDER TO REMEDY POS... | 14804 78 STREET | 7807099434 |
| 2916 | 3B78F6A7105008DFF4F6E12A7BA68583 | OUT OF SCHOOL CARE PROGRAM | KEPLER ACADEMY EARLY LEARNING AND CHILDCARE OS... | #511, 11 WESTWIND DRIVE | 5879208936 |
| 2917 | F546433A786E8F9BB966BC679E4DC771 | DAY CARE PROGRAM | KEPLER ACADEMY EARLY LEARNING AND CHILDCARE WE... | #511, 11 WESTWIND DRIVE | 5879208936 |


In [88]:
cc.loc[cc.provider == "Province of Alberta"][['source_id', 'source_facility_type', 'name', 'source_full_address', 'phone']].tail(10)

| id | source_id | source_facility_type | name | source_full_address | phone |
|---|---|---|---|---|---|
| 3863 | 180F60691DCF55197DAE21618FAA6F6F | DAY CARE PROGRAM | SUCKER CREEK FIRST NATION DAYCARE | UNIT 2009; BLK 2000 SE-18-74-14-5 | 7805232969 |
| 3864 | 8A5642A0DBA99868C88AD74B3842FE64 | DAY CARE PROGRAM | CLEVER CUBS PRESCHOOL ACADEMY INC. | UNIT 205, 35 CRANFORD WAY SE | 4038501939 |
| 3865 | BA5DA2371FDB54FA99E6DC23D332A315 | DAY CARE PROGRAM | GLOBAL KIDS II | UNIT 170, 7212 MACLEOD TRAIL SE | 4034785922 |
| 3866 | 0381BFAED077653EF485AE5AE4B3E444 | DAY CARE PROGRAM | MOSAIC MONTESSORI PARKDALE | 3512 - 5 AVENUE NW | 4039849090 |
| 3867 | D7E8BD6C99C01ABC34162F8DA374ABCB | OUT OF SCHOOL CARE PROGRAM | MINI TREASURES OUT OF SCHOOL CARE | 310-500 TIMBERLANDS DRIVE | 5872733499 |
| 3868 | F78B6BE2223082D198D73A5B2A6171F3 | DAY CARE PROGRAM | MINI TREASURES DAYCARE | 310-500 TIMBERLANDS DRIVE | 5872733499 |
| 3869 | D08B691412B0558E1799A67F1575B5F5 | DAY CARE PROGRAM | BUTTERFLY KISSES EARLY LEARNING CENTRE | 5019 53 STREET | 7807098682 |
| 3870 | 9E48CFB7FC5B3634E3441573CCFDCDD7 | DAY CARE PROGRAM | ORDER TO REMEDY POSTPONED, ORDER TO REMEDY POS... | 14804 78 STREET | 7807099434 |
| 3871 | 3B78F6A7105008DFF4F6E12A7BA68583 | OUT OF SCHOOL CARE PROGRAM | KEPLER ACADEMY EARLY LEARNING AND CHILDCARE OS... | #511, 11 WESTWIND DRIVE | 5879208936 |
| 3872 | F546433A786E8F9BB966BC679E4DC771 | DAY CARE PROGRAM | KEPLER ACADEMY EARLY LEARNING AND CHILDCARE WE... | #511, 11 WESTWIND DRIVE | 5879208936 |


### QC <a class="anchor" id="qc"></a>

In [89]:
QC.rename(columns = {'Nom de l\'installation' : 'name',
                     'Adresse' : 'address',
                     'Téléphone': 'phone',
                     'Adresse courriel' : 'email',
                     'Télécopieur' : 'fax'
                    }, inplace = True)

In [90]:
QC.drop_duplicates(inplace = True)

In [91]:
len(cc.loc[cc.provider == "Province of Québec"])

3573

In [92]:
len(QC)

3576

In [93]:
len(QC.name.unique()) == len(QC)

False

In [94]:
len(QC.address.unique()) == len(QC)

False

In [95]:
# Name cleaning
QC.name = QC.name.map(lambda x: x.strip().replace('  ', ' '))
QC.address = QC.address.map(lambda x: x.strip().replace('  ', ' ').replace('  ', ' ').replace('  ', ' '))


# Check
precheck(QC, "Province of Québec", name = True, address = True)

All good.


In [96]:
# Operation
fill_missing(QC, "Province of Québec", by = "name & address", phone = True, email = True)

Filling data from Province of Québec
All rows found.


In [151]:
# QC[['name', 'address', 'phone', 'email']].head(15)

In [150]:
# cc.loc[cc.provider == "Province of Québec"][['name', 'source_full_address', 'phone', 'email']].head(15)

----
## Verify Counts <a anchor="href" id="verify"></a>

In [120]:
# Count nulls per raw dataframe; a provider with no such column counts every row as null
locations = [NT, NU, YT, PE, NS, NB, NL, QC, ON, MB, SK, AB, BC]
null_phone = []
null_contact = []
null_email = []

for l in locations:
    try:
        null_phone.append(len(l.loc[l.phone.isnull()]))
    except AttributeError:
        null_phone.append(len(l))
        
for l in locations:
    try:
        null_contact.append(len(l.loc[l.contact.isnull()]))
    except AttributeError:
        null_contact.append(len(l))
        
for l in locations:
    try:
        null_email.append(len(l.loc[l.email.isnull()]))
    except AttributeError:
        null_email.append(len(l))
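
The three loops above share one rule: count the nulls in a column, and treat an absent column as entirely null. A compact sketch of that rule, using a hypothetical helper `null_count` and invented frames:

```python
import pandas as pd

def null_count(df, column):
    """Count missing values; an absent column counts every row as null."""
    if column not in df.columns:
        return len(df)
    return int(df[column].isna().sum())

# Hypothetical raw frames: "A" has a phone column, "B" does not.
frames = {
    "A": pd.DataFrame({"phone": ["555-0100", None]}),
    "B": pd.DataFrame({"name": ["x", "y", "z"]}),
}
totals = {key: null_count(df, "phone") for key, df in frames.items()}
```

Summing `totals.values()` gives the figure that is compared against the null count in the final database.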

In [121]:
len(cc.loc[cc.contact.isnull() & (cc.provider != "GoDayCare.com")])

21505

In [122]:
sum(null_contact)
# 8 fewer nulls in cc

21513

In [135]:
len(cc.loc[cc.phone.isnull() & (cc.provider != "GoDayCare.com")])

7173

In [134]:
sum(null_phone)
# 2 fewer nulls in cc

7175

In [125]:
len(cc.loc[cc.email.isnull() & (cc.provider != "GoDayCare.com")])

12999

In [126]:
sum(null_email)
# 4 fewer nulls in cc

13003

**1:** Determine which providers don't match the number of missing values in the final df:

In [107]:
provs = cc.loc[(cc.phone.isnull()) & (cc.provider != "GoDayCare.com") & (cc.provider != "Province of Ontario")].provider.unique().tolist()

provider_dict = {
    'Northwest Territories' : NT,
    'Province of New Brunswick' : NB,
    'Province of Manitoba' : MB,
    'Province of Québec' : QC,
    'Province of Nova Scotia' : NS,
    'Province of British Columbia': BC,
    'Nunavut':NU,
    'Province of Newfoundland and Labrador': NL
                }


In [131]:
NB['phone'] = NB.address.map(get_phone).to_list()

for p in provs:
    print(p)
    df = provider_dict[p]
    print(len(cc.loc[(cc.phone.isnull()) & (cc.provider == p)]) == len(df.loc[df.phone.isnull()]))

Northwest Territories
True
Province of New Brunswick
True
Province of Manitoba
True
Province of Québec
False
Province of Nova Scotia
True
Province of British Columbia
True


In [132]:
len(QC.loc[QC.phone.isnull()])

104

In [133]:
len(cc.loc[(cc.phone.isnull()) & (cc.provider == "Province of Québec")])

105

Correction where address and name are the same in MB

In [111]:
# correction where address and name are the same
# .loc accepts a boolean mask; .at is meant for a single row label
cc.loc[(cc.phone.isnull()) & (cc.provider == "Province of Manitoba"), 'phone'] = "(204) 586-9625"
cc.at[4539, "phone"] = "(204) 228-5963"
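
pandas documents `.at` for a single row label and column, and `.loc` for selections such as boolean masks. A small sketch of both kinds of write on an invented frame (`toy` and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"provider": ["MB", "MB", "SK"], "phone": [None, "555-0101", None]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Mask-based write: every MB row with a missing phone.
mask = toy["phone"].isna() & (toy["provider"] == "MB")
toy.loc[mask, "phone"] = "555-0199"

# Scalar write: one known row label.
toy.at[3, "phone"] = "555-0102"
```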

Manual forward-fill of phones and emails missing in QC duplicates

In [112]:
idx = [5031, 5033, 5035, 5036, 5038, 5193, 5569, 5733, 5920, 6710, 7130, 7811]
# cc.loc[(cc.phone.isnull()) & (cc.provider == "Province of Québec")].name.isin(QC.loc[QC.phone.isnull()].name)

for cca in cc.iloc[idx].source_full_address:
    try:
        print(cca)
        # forward-fill phone by hand
        phone_temp=cc.loc[(cc.source_full_address == cca) & (cc.phone.notnull())].phone.tolist()[0]
        idxp = cc.loc[(cc.source_full_address == cca) & (cc.phone.isnull())].index.tolist()
        cc.at[idxp[0], 'phone'] = phone_temp

        # forward-fill email by hand
        email_temp=cc.loc[(cc.source_full_address == cca) & (cc.email.notnull())].email.tolist()[0]
        idxe = cc.loc[(cc.source_full_address == cca) & (cc.email.isnull())].index.tolist()
        cc.at[idxe[0], 'email'] = email_temp
    except IndexError:
        pass

400, 3e Rue Fraser
418, rue Rouer
55, rue Principale
20, rue des Coquillages
795, rue Commerciale Nord
860, boulevard Raymond
2600, rue du Collège
2890, rue Notre-Dame
1945, rue Mullins, bureau 180
8055, rue Collerette
1200, route des Rivières
220, rue Saint-Marc
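
The manual fill above copies a known phone or email to a row at the same address that lacks one. A vectorized sketch of the same idea fills within each address group. This is an alternative technique, not the notebook's method, and the frame `cc_toy` and its phone value are invented.

```python
import pandas as pd

cc_toy = pd.DataFrame({
    "source_full_address": [
        "220, rue Saint-Marc",
        "220, rue Saint-Marc",
        "55, rue Principale",
    ],
    "phone": ["418 555-0100", None, None],
})

# Within each address, copy a known phone to rows that lack one.
filled = cc_toy.groupby("source_full_address")["phone"].transform(
    lambda s: s.ffill().bfill()
)
```

An address with no known phone anywhere in its group stays missing, so nothing is invented for it.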


In [113]:
# cc.loc[(cc.source_full_address == cca) & (cc.email.notnull())].email.tolist()[0]

In [114]:
# cc.loc[(cc.source_full_address == cca) & (cc.phone.notnull())]

**2:** Providers that have email columns in raw data but contain missing values

In [136]:
#cc.loc[(cc.email.isnull()) & (cc.provider != "GoDayCare.com") & (cc.provider != "Province of Ontario")].provider.unique().tolist()

provsemail = [
 'Province of Québec',
 'Province of Nova Scotia',
 'Province of British Columbia',
 'Nunavut']

for p in provsemail:
    print(p)
    df = provider_dict[p]
    print(len(cc.loc[(cc.email.isnull()) & (cc.provider == p)]) == len(df.loc[df.email.isnull()]))

Province of Québec
False
Province of Nova Scotia
True
Province of British Columbia
True
Nunavut
True


In [116]:
cc.loc[(cc.provider == "Province of Québec") & (cc.email.isnull()) & (cc.name.duplicated()) & (cc.source_full_address.duplicated())]

| id | source_id | name | source_facility_type | facility_type | ages | capacity | infant | toddler | school_age | source_full_address | ... | postal_code | city | province | provider | licence_status | longitude | latitude | contact | email | phone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7061 | | SERVICE DE GARDE TASIURVIK INC. | CPE | Day Care | | 65 | Y | Y | | , C.P. 280 | ... | J0M 1M0 | Inukjuak | QC | Province of Québec | | | | | | |
| 8040 | | CENTRE ENFANCE ET FAMILLE DE STEP BY STEP / ST... | CPE | Day Care | | 20 | Y | Y | | , C.P. 771 | ... | J0L 1B0 | Kahnawake | QC | Province of Québec | | | | | | |


In [117]:
# QC.loc[QC.address == "220, rue Saint-Marc"]

In [118]:
# cc.loc[cc.source_full_address == "220, rue Saint-Marc"]

**3:** Providers that have contact columns in raw data but contain missing values


In [119]:
cc.loc[(cc.contact.isnull()) & (cc.provider != "GoDayCare.com") & (cc.provider != "Province of Ontario")].provider.unique().tolist()

print(len(cc.loc[(cc.contact.isnull()) & (cc.provider == 'Province of Newfoundland and Labrador')]) == len(NL.loc[NL.contact.isnull()]))

True


**4:** Export

In [130]:
cc['contact'] = cc['contact'].replace("none", "")
cc.to_csv("childcare-alpha.csv")

## [Top](#top)