<a href="https://colab.research.google.com/github/4dsolutions/clarusway_data_analysis/blob/main/DAwPy_S10_(Working%20with%20Text%20and%20Time%20Data)/DAwPy_S10_Joining_Tables.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a><br/>
[![nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/4dsolutions/clarusway_data_analysis/blob/main/DAwPy_S10_%28Working%20with%20Text%20and%20Time%20Data%29/DAwPy_S10_Joining_Tables.ipynb)

________


<a data-flickr-embed="true" href="https://www.flickr.com/photos/kirbyurner/52136642608/in/photolist-2n4sSUz-2nr8Vrb-2oADYNY" title="Clarusway Banner"><img src="https://live.staticflickr.com/65535/52136642608_bd45cb00a9_b.jpg" width="1024" height="334" alt="Clarusway Banner"/></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>

<img src=https://i.ibb.co/6gCsHd6/1200px-Pandas-logo-svg.png width="700" height="200">

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Data Analysis with Python</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:150%; text-align:center; border-radius:10px 10px;">Session - 11</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#4d77cf; font-size:200%; text-align:center; border-radius:10px 10px;">Combining Tables + SQL</p>
## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:120%; text-align:center; border-radius:10px 10px;">Way to Reinvent Yourself</p>

In this Notebook, we: 

* develop a small set of tables in pandas
* write some Python code for adding data to at least one of them
* combine DataFrames using pandas `merge` and `join`
* store our tables to an SQLite database.

Let's create a small database consisting of three related tables:

* a roster of patients seen by a practice
* patient visits with physicians
* a roster of physicians in the practice

In [None]:
import pandas as pd
import numpy as np

In [None]:
import sys
sys.version

Passing a list of tuples as `data`, with the column names supplied as a named Series, fills the DataFrame row-wise: each tuple becomes one row.

In [None]:
patients = [
    ("13298","Debbie", "Rose",
     "32 SE Beacon St.", 
     "Portland", "OR", "97214", 
     "503-311-9928"),
    ("12446","Jerry", "Turing",
     "491 NW Shanny St.", 
     "Portland", "OR", "97111", 
     "503-311-7865"),
    ("77650","Bruce", "Flemming",
     "32 SE Beacon St.", 
     "Portland", "OR", "97214", 
     "503-311-9928"),
    ("89765","Susan", "Constanza",
     "8976 NW Circle Court, Apt 2E", 
     "Gresham", "OR", "97211", 
     "503-321-8640"),
    ("56768","Raul", "Sosa",
     "786 NW Couch St.", 
     "Portland", "OR", "97212", 
     "503-311-1018")
]

patients_df = pd.DataFrame(
    data=patients,
    columns = pd.Series(["MR", "FIRSTNM","LASTNM", 
               "STREET", "CITY", "STATE", "ZIPCODE", 
               "PHONE"], name="IDENT")
)

patients_df.set_index("MR", inplace=True)

In [None]:
patients_df

In [None]:
patients_df.index
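To make the row-wise behavior concrete, here is a minimal sketch on a made-up two-row table (the names `rows`, `cols`, `by_row` and `by_col` are illustrative and not part of the notebook's data). It builds the same table twice, once from tuples and once from a dict of columns, and checks that the cell values agree.

```python
import pandas as pd

# Hypothetical sample rows, not the notebook's patient data
rows = [("A1", "Ada", "Lovelace"),
        ("B2", "Alan", "Turing")]

# Column labels as a named Series, mirroring the cell above
cols = pd.Series(["ID", "FIRST", "LAST"], name="FIELDS")

# Each tuple becomes one row
by_row = pd.DataFrame(data=rows, columns=cols)

# The same table built column by column from a dict
by_col = pd.DataFrame({"ID": ["A1", "B2"],
                       "FIRST": ["Ada", "Alan"],
                       "LAST": ["Lovelace", "Turing"]})

print(by_row.shape)                                      # (2, 3)
print(by_row.values.tolist() == by_col.values.tolist())  # True
```

Which form to use is mostly a matter of how the data arrives: records read one at a time fit the tuple form, while data already held per field fits the dict form.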

Let's automate the process of adding new patient records. Rather than make up a medical record number ourselves, we'll let Python generate one at random, making sure it's not already in use.

In [None]:
def get_mr(table=patients_df):
    not_ok = True
    while not_ok:
        mr = str(np.random.randint(10000,100000))
        if mr not in table.index:
            not_ok = False
    return mr

In [None]:
get_mr()
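One thing to watch: a default such as `table=patients_df` is evaluated once, when the function is defined, so it keeps pointing at that object even if the name `patients_df` is later rebound to a new DataFrame. The sketch below is one possible variation, not the notebook's own function: a hypothetical helper `draw_unused_mr` that takes the index explicitly and accepts an optional NumPy `Generator` so the draw can be made reproducible.

```python
import numpy as np
import pandas as pd

def draw_unused_mr(index, rng=None):
    """Return a 5-digit string not already in `index` (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    while True:
        mr = str(rng.integers(10000, 100000))  # upper bound is exclusive
        if mr not in index:
            return mr

# Made-up set of numbers already in use
taken = pd.Index(["13298", "12446", "77650"])

new_mr = draw_unused_mr(taken, rng=np.random.default_rng(0))
print(len(new_mr), new_mr in taken)   # 5 False
```

Passing the index in explicitly means the caller decides which table is checked, for example `draw_unused_mr(newtable.index)` after rows have been added.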

In [None]:
template = \
"""
{first} {last}
{street}, 
{city}, {state} {zipcode}
{phone}
"""

def add_patient(table=patients_df):
    not_ok = True
    while not_ok:
        
        #prompt for inputs
        first = input("First? >")
        last = input("Last? >")
        street = input("Street? >")
        city = input("City? >")
        state = input("State? >")
        zipcode = input("Zip code? >")
        phone = input("Phone? >")
        
        # substitute local vars into the template
        print(template.format(**locals()))
        ans = input("OK? (Y/N or Quit): >")
        
        if ans.upper() == "N":    # try again
            continue
        elif ans.upper() == "Y":  # add new info
            not_ok = False
            continue    
        else:                     # escape from loop
            break
        
    else: # not_ok == False
        print("Adding new patient record")
        new_mr = get_mr(table)  # check against the table we were given
        # create a Series using the local vars we've filled in
        new_rec = pd.Series({"MR": new_mr,
                             "FIRSTNM": first,
                             "LASTNM": last,
                             "STREET": street,
                             "CITY": city,
                             "STATE": state,
                             "ZIPCODE": zipcode,
                             "PHONE": phone})
        # turn the input Series into a DataFrame with the same cols and index
        bottom_row = pd.DataFrame(new_rec).T.set_index("MR")
        return pd.concat([table, bottom_row]) # append new row
    
    # break (above) takes us here         
    print("No action taken")
    return table # return table as received

In [None]:
newtable  = add_patient()

In [None]:
newtable
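Note that `add_patient` returns a new DataFrame rather than changing the one passed in, which is why the result is captured in `newtable`. Because `input()` makes the function awkward to rerun, here is a non-interactive sketch of just the append step. The helper `append_patient` and the sample values are assumptions for illustration only.

```python
import pandas as pd

def append_patient(table, mr, **fields):
    """Return a new table with one extra row; `table` itself is unchanged."""
    if mr in table.index:
        raise ValueError(f"MR {mr} already in use")
    # One-row DataFrame whose index carries the same name as the table's
    new_row = pd.DataFrame([fields],
                           index=pd.Index([mr], name=table.index.name))
    return pd.concat([table, new_row])

# Made-up one-row table indexed by MR
demo = pd.DataFrame({"FIRSTNM": ["Debbie"], "LASTNM": ["Rose"]},
                    index=pd.Index(["13298"], name="MR"))

bigger = append_patient(demo, "20001", FIRSTNM="Test", LASTNM="Patient")
print(len(demo), len(bigger))   # 1 2
```

Returning a fresh table keeps the original intact, at the cost of copying on every append. That is fine for a handful of rows, less so for bulk loading.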

In [None]:
physicians_df = pd.DataFrame(
    {"DR_ID": ["1001", "1002", "1003"],
     "DR_NAME": ["Sheela Morley, M.D.",
                 "Malcolm Head, D.O.",
                 "Patricia Lord, M.D."]}).set_index("DR_ID")

In [None]:
physicians_df

In [None]:
pd.Timedelta(1.5, unit='h')

In [None]:
import re

target = pd.Timedelta(1.25, unit='h')
target

In [None]:
target.isoformat()

In [None]:
re.sub(pattern = r"^.*T(?P<h>\d+)H(?P<m>\d+)M.*$",  # anchor on "T" so multi-digit hours are kept whole
       repl    = r"\g<h>h \g<m>m",
       string  = target.isoformat())
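As an alternative to parsing the ISO 8601 string, a `Timedelta` also exposes its parts through the `components` attribute. The sketch below wraps that in a hypothetical helper, `hours_minutes`, which folds whole days back into the hour count. It is one possible approach, not the one this notebook uses.

```python
import pandas as pd

def hours_minutes(hours):
    """Format decimal hours as 'Hh Mm' (hypothetical helper)."""
    parts = pd.Timedelta(hours, unit="h").components
    total_hours = parts.days * 24 + parts.hours   # fold days into hours
    return f"{total_hours}h {parts.minutes}m"

print(hours_minutes(1.25))   # 1h 15m
print(hours_minutes(0.75))   # 0h 45m
print(hours_minutes(10.5))   # 10h 30m
```

A function like this could be passed to `Styler.format` in place of a regex-based lambda.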

In [None]:
visits = [('77650', '1001', '2023-5-17T13:50', "1.25"),
          ('77650', '1001', '2023-5-31T14:00', "0.75"),
          ('12446', '1003', '2023-5-31T10:15', "0.10"),
          ('89765', '1002', '2023-6-04T10:00', "0.50"),
          ('12446', '1003', '2023-6-04T10:15', "0.10"),]

visits_df = pd.DataFrame(
    data=visits,
    columns = pd.Series(["MR", "DR_ID","CHECK_IN", "DURATION"], name="VISIT")
)

visits_df["CHECK_IN"] = visits_df["CHECK_IN"].astype('datetime64[ns]')
visits_df["DURATION"] = visits_df["DURATION"].astype('float')
visits_df

In [None]:
visits_df.info()

In [None]:
visits_df.style.format({"DURATION":lambda x: re.sub(pattern = r"^.*T(?P<h>\d+)H(?P<m>\d+)M.*$",
                                                    repl    = r"\g<h>h \g<m>m",
                                                    string  = pd.Timedelta(x, unit='h').isoformat()
                                             )})

In [None]:
visits_df

In [None]:
visits_df.info()

In [None]:
pd.merge(left=visits_df[visits_df.DR_ID == '1001'], 
         right=patients_df[["FIRSTNM", "LASTNM", "PHONE"]], 
         on="MR")

Might we use `join` to accomplish the same thing?

In [None]:
visits_df[visits_df.DR_ID == '1001'].join( 
         patients_df[["FIRSTNM", "LASTNM", "PHONE"]], 
         on="MR")
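The two calls agree here because every visit has a matching patient. In general they differ in their defaults: `pd.merge` uses `how="inner"`, while `DataFrame.join` uses `how="left"`. A small sketch with made-up tables, in which MR "99999" is deliberately given no patient record, shows the difference.

```python
import pandas as pd

# Hypothetical miniature tables; MR "99999" has no matching patient
visits = pd.DataFrame({"MR": ["77650", "99999"],
                       "DR_ID": ["1001", "1001"]})
patients = pd.DataFrame({"LASTNM": ["Flemming"]},
                        index=pd.Index(["77650"], name="MR"))

merged = pd.merge(left=visits, right=patients, on="MR")  # inner by default
joined = visits.join(patients, on="MR")                  # left by default

print(len(merged))   # 1: the unmatched visit is dropped
print(len(joined))   # 2: the unmatched visit is kept, LASTNM is NaN
```

Passing `how="left"` to `merge`, or `how="inner"` to `join`, brings the two back into agreement.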

In [None]:
visit_by_doc_df = pd.merge(
                    left=physicians_df, 
                    right=visits_df,
                    how="right",
                    on="DR_ID", 
                    sort=True)
visit_by_doc_df

In [None]:
visit_by_doc_df.set_index(['DR_ID',"MR"], inplace=True)
visit_by_doc_df

In [None]:
visit_by_doc_df.join(patients_df[["LASTNM", "FIRSTNM", "PHONE"]], 
                        on="MR")

Let's [create a SQLite database](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html?highlight=to_sql#pandas.DataFrame.to_sql) from the DataFrames we have so far.

In [None]:
import sqlite3 as sql

In [None]:
conn_db = sql.connect("practice") # a physicians' practice

In [None]:
patients_df.to_sql("patients", con=conn_db, if_exists='replace')
visits_df.to_sql("visits", con=conn_db, if_exists='replace')
physicians_df.to_sql("physicians", con=conn_db, if_exists='replace')

conn_db.close()
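A quick way to confirm what `to_sql` stored is to read a table straight back into a DataFrame with `pd.read_sql_query`. The sketch below uses an in-memory database (`":memory:"`) and a two-row copy of the physicians table so that it leaves the `practice` file alone. The names `docs` and `back` are illustrative.

```python
import sqlite3
import pandas as pd

docs = pd.DataFrame({"DR_ID": ["1001", "1002"],
                     "DR_NAME": ["Sheela Morley, M.D.",
                                 "Malcolm Head, D.O."]}).set_index("DR_ID")

conn = sqlite3.connect(":memory:")   # in-memory database, nothing on disk

# The DataFrame index is written as an ordinary DR_ID column
docs.to_sql("physicians", con=conn, if_exists="replace")

# Read it back, restoring DR_ID as the index
back = pd.read_sql_query("SELECT * FROM physicians", conn,
                         index_col="DR_ID")
conn.close()

print(back)
```

Since the index is written as a regular column, `index_col` is what turns it back into an index on the way in.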

In [None]:
conn_db = sql.connect("practice") # a physicians' practice

curs = conn_db.cursor()
curs.execute("SELECT * FROM patients;")
result = list(curs.fetchall())
conn_db.close()

result

In [None]:
conn_db = sql.connect("practice") # a physicians' practice

curs = conn_db.cursor()
curs.execute("""SELECT * FROM visits""")
result = list(curs.fetchall())
conn_db.close()

result

In [None]:
conn_db = sql.connect("practice") # a physicians' practice
curs = conn_db.cursor()

curs.execute("""SELECT visits.mr, lastnm, firstnm, dr_id, check_in, duration
                    FROM visits, patients
                    WHERE visits.mr = patients.mr""")
result = list(curs.fetchall())

conn_db.close()
result

The same query can be written with an explicit `INNER JOIN`, which has the general form:

SELECT a1, a2, b1, b2
FROM A
INNER JOIN B ON B.f = A.f;

In [None]:
conn_db = sql.connect("practice") # a physicians' practice
curs = conn_db.cursor()

curs.execute("""SELECT visits.mr, lastnm, firstnm, dr_id, check_in, duration
                    FROM visits
                    INNER JOIN patients on visits.mr = patients.mr""")
result = list(curs.fetchall())

conn_db.close()
result

In [None]:
conn_db = sql.connect("practice") # a physicians' practice
curs = conn_db.cursor()

curs.execute("""SELECT patients.mr, lastnm, firstnm, dr_id, check_in, duration
                    FROM patients
                    LEFT JOIN visits on visits.mr = patients.mr""")
result = list(curs.fetchall())

conn_db.close()
result
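The `LEFT JOIN` keeps every patient, including those with no visits, whose visit columns come back empty (`None` in the SQLite result). For comparison, here is a sketch of the pandas counterpart using `pd.merge` with `how="left"`, on made-up two-patient data in which only one patient has visits.

```python
import pandas as pd

# Hypothetical miniature tables: Rose has no visits, Turing has two
patients = pd.DataFrame({"LASTNM": ["Rose", "Turing"]},
                        index=pd.Index(["13298", "12446"], name="MR"))
visits = pd.DataFrame({"MR": ["12446", "12446"],
                       "DR_ID": ["1003", "1003"],
                       "DURATION": [0.10, 0.10]})

# Every patient row survives; unmatched visit columns are NaN
left = pd.merge(left=patients.reset_index(), right=visits,
                how="left", on="MR")

# Patients with no recorded visit
no_visits = left[left["DR_ID"].isna()]

print(len(left))                      # 3
print(no_visits["LASTNM"].tolist())   # ['Rose']
```

Filtering on the missing visit column is a simple way to list patients who have not yet been seen.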