In [1]:
%matplotlib inline
import os
import subprocess as sub
import matplotlib.pyplot as plt
import psycopg2
import psycopg2.extras
import seaborn as sns
import matplotlib
from tabulate import tabulate

# tell Seaborn that we're producing a document and not a slideshow or poster
sns.set()
sns.set_context('paper')

# expects to find connection credentials in local runtime environment
db_host=os.environ['SQL_LOCAL_SERVER']
db_host_port=int(os.environ['SQL_LOCAL_PORT'])
db_user=os.environ['SQL_USER']
db_name=os.environ['SQL_DB']
db_schema=os.environ['SQL_SCHEMA']
csvdir = "/home/jeffs/projects/def-dfuller/interact/permanent_archive/Saskatoon/Wave1/Ethica/saskatoon_01/raw"

def log(instr):
    print(f"→ {instr}")

# Ingest Log for Ethica, Saskatoon, Wave 1

After much investigation and discussion, we have decided to ingest the data in its raw form, with no filters applied, and then create a few obvious subtables from that ingested data. The reasoning is that, due to the nature of the data issues, there appears to be no single cleaning filter we can apply that will be appropriate for all downstream analysis. So we'll ingest the raw table, mark it as toxic so nobody uses it by accident, and then create a few sane starting points that people can actually use.


If we're going to have multiple tables like this, we'll want to standardize the table naming conventions. Currently, tables are broken into different schemas based on the temporal resolution, but since all of these starting filters will still have raw-level timing, we'll need to name them with that in mind.

## Possible Filters
There are two fundamental steps to the filters we've applied during the investigation phase: Prefiltration and uniquification. 

### Prefiltration
The incoming raw data is messy and conflicted. Some records are pure singletons, meaning that they are unique on userid, timestamp, and sensor values. Other records are true duplicates, meaning they occur multiple times but have identical userid, time, and sensor values. We call these "singleton-equivalent," since they can be reduced to singletons by simply deleting all but one occurrence. And then there are "conflicted duplicates," for which there are many records with identical userid and timestamp, but that report multiple conflicting sensor readings within the set.
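
The three categories can be illustrated with a minimal Python sketch. The function name `classify_records`, the field names, and the sample values are assumptions made for illustration only; the notebook itself performs this kind of grouping in SQL.

```python
from collections import defaultdict

def classify_records(records, time_field="record_time", value_fields=("lat", "lon")):
    """Label each (user, timestamp) group by the kind of duplication it shows."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["user_id"], rec[time_field])
        groups[key].append(tuple(rec[f] for f in value_fields))

    labels = {}
    for key, readings in groups.items():
        if len(readings) == 1:
            labels[key] = "singleton"             # occurs exactly once
        elif len(set(readings)) == 1:
            labels[key] = "singleton-equivalent"  # repeated, but identical readings
        else:
            labels[key] = "conflicted"            # repeated with differing readings
    return labels

# Hypothetical sample: one singleton, one true duplicate, one conflicted pair
sample = [
    {"user_id": 1, "record_time": "t0", "lat": 52.10, "lon": -106.60},
    {"user_id": 1, "record_time": "t1", "lat": 52.11, "lon": -106.61},
    {"user_id": 1, "record_time": "t1", "lat": 52.11, "lon": -106.61},
    {"user_id": 1, "record_time": "t2", "lat": 52.12, "lon": -106.62},
    {"user_id": 1, "record_time": "t2", "lat": 52.99, "lon": -106.62},
]
labels = classify_records(sample)
for key, label in sorted(labels.items()):
    print(key, label)
```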

The potential causes of the various kinds of duplicates are numerous: there could be hiccups with different versions of the host phone's OS, memory constraint problems on specific users' phones, different user preference settings, differing versions of the Ethica collection software, various types of power and service outages, etc. At this point, we have not yet found any way to attribute causes to the observed effects. And since we don't know the causes, we cannot use metadata as a way to screen the telemetry records. What we *can* do is use the attributes of the duplicates themselves as an indicator of the data's reliability and choose which ones to remove from the dataset.

#### Singletons vs Singleton-Equivalent vs Conflicted
It turns out that the vast majority of the duplicates occur in tight bursts lasting only short periods of time. Since we don't know why the duplicates are there, the safest course might be to eliminate all duplicated records. But this would throw out the singleton-equivalent records as well, so we explored a second scheme in which we keep the singleton and singleton-equivalent records but filter out the conflicted duplicates.

Unfortunately, there are even deeper wrinkles.

#### Record Time vs Satellite Time
The GPS data contains four different temporal fields. One contains only the year/month, and another is an undocumented internal counter used by the Ethica software. The remaining two (record_time and satellite_time) both encode complete date and time, down to the millisecond. The record_time field is captured from the host phone's clock, and notes the time at which the sensor was read, while the satellite_time field is taken from the GPS signal and represents the time the signal was broadcast from orbit. In theory, these two fields should only deviate by a small number of milliseconds, but in practice, they sometimes differ wildly.

Worse, the duplication of times mentioned above occurs in both fields, but not in any linear, predictable fashion. Sometimes the satellite time is repeated at multiple differing record times, sometimes it's the other way around, and in some cases, they actually agree. So the result of prefiltering the duplicates will change depending on which time field is used as the timestamp of record. For research that needs only GPS data, the satellite_time seems to be the most reliable. But the record_time stamp is the only temporal field available in the accelerometry (XL) table, so any analysis that needs to synchronize the two tables is forced to use record_time as the timestamp of choice.
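
To make the dependence on the chosen time field concrete, here is a small sketch. The helper `duplicated_keys` and the sample timestamps are hypothetical; the point is only that the same records flag different duplicates depending on which field is treated as the timestamp of record.

```python
from collections import Counter

def duplicated_keys(records, time_field):
    """Return the (user, timestamp) keys that occur more than once."""
    counts = Counter((r["user_id"], r[time_field]) for r in records)
    return {key for key, n in counts.items() if n > 1}

# Hypothetical sample: the first pair repeats satellite_time at differing
# record_times, and the second pair does the opposite.
sample = [
    {"user_id": 1, "record_time": "12:00:00.100", "satellite_time": "12:00:00.000"},
    {"user_id": 1, "record_time": "12:00:00.200", "satellite_time": "12:00:00.000"},
    {"user_id": 1, "record_time": "12:00:01.000", "satellite_time": "12:00:01.000"},
    {"user_id": 1, "record_time": "12:00:01.000", "satellite_time": "12:00:02.000"},
]

dup_rec = duplicated_keys(sample, "record_time")
dup_sat = duplicated_keys(sample, "satellite_time")
print("duplicated on record_time:   ", dup_rec)
print("duplicated on satellite_time:", dup_sat)
```

In this sample the two sets do not overlap, so a prefilter keyed on one field would remove a different pair of records than a prefilter keyed on the other.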

### Uniquification
Depending on which method(s) were used to identify and remove the "unreliable" records, there can still be duplication present in the dataset, so the next step is to eliminate the duplicates so that only well-ordered, unique observations remain to be fed to the analysis algorithms.

It's easy to confuse the prefiltration and uniquification steps, as they tend to use the same fields and similar criteria, but they need to be considered as distinct operations, because their intended functions differ. Prefiltration is a process of identifying and eliminating records that may be associated with dubious/unreliable behavior in the capture system, but it does not necessarily remove all duplicates from the table, which is what uniquification aims to accomplish.

To get a better sense of why we keep the two steps distinct, consider the following scenario: Suppose we decide to reject GPS records with duplicated satellite_times but differing lat/lon values, because we believe that *those* are clearly caused by some kind of hiccup in the phone and should be rejected. Doing so would indeed filter out the data we've marked as unreliable, but it would still leave singleton-equivalent duplicates intact, which would be squashed in the uniquification step. 

And after doing all that, we might then want to link the GPS and XL table by their record_time fields, which would trigger a *third* filtration step, squashing the record_time duplicates.
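
The scenario above can be sketched in plain Python as three separate passes. The function names and sample records are assumptions for illustration, and `uniquify` simply keeps the first record of each group, which is a simplification of the `min()` aggregation used in the SQL later in this notebook.

```python
from collections import defaultdict

def group_by(records, time_field):
    """Group records by (iid, chosen timestamp), preserving input order."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["iid"], rec[time_field])].append(rec)
    return groups

def prefilter_conflicts(records, time_field, value_fields=("lat", "lon")):
    """Prefiltration: drop every group whose members disagree on the readings."""
    kept = []
    for recs in group_by(records, time_field).values():
        readings = {tuple(r[f] for f in value_fields) for r in recs}
        if len(readings) == 1:
            kept.extend(recs)  # singleton-equivalent duplicates are still present
    return kept

def uniquify(records, time_field):
    """Uniquification: collapse each remaining group to a single record."""
    return [recs[0] for recs in group_by(records, time_field).values()]

# Hypothetical sample data
sample = [
    {"iid": 1, "satellite_time": "s0", "record_time": "r0", "lat": 1.0, "lon": 1.0},
    {"iid": 1, "satellite_time": "s1", "record_time": "r1", "lat": 2.0, "lon": 2.0},
    {"iid": 1, "satellite_time": "s1", "record_time": "r1", "lat": 2.0, "lon": 2.0},
    {"iid": 1, "satellite_time": "s2", "record_time": "r2", "lat": 3.0, "lon": 3.0},
    {"iid": 1, "satellite_time": "s2", "record_time": "r3", "lat": 4.0, "lon": 3.0},
    {"iid": 1, "satellite_time": "s3", "record_time": "r0", "lat": 1.0, "lon": 1.0},
]

prefiltered = prefilter_conflicts(sample, "satellite_time")  # drops the s2 conflict
unique_sat = uniquify(prefiltered, "satellite_time")         # squashes the s1 duplicate
unique_rec = uniquify(unique_sat, "record_time")             # squashes the r0 duplicate
print(len(sample), len(prefiltered), len(unique_sat), len(unique_rec))
```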

## Naming convention
With all the above in mind, we've devised a naming convention to help identify the contents of the various filtered tables. Each filtration step can be summarized by a fairly terse string: "delconflsat" would indicate that we **del**eted all **confl**icted records based on **sat**ellite_time, while "delduplsat" reports that we **del**eted all **dupl**icates on that field, instead of just the conflicted ones. For all cases other than the raw table, when duplicated records are not deleted, they are instead "uniquified" by collapsing them to single records.

The first tables created at ingest time will be the raw tables, loaded as-is from the incoming data, complete with duplicates and conflicts. Other tables can then take their names by adding on the tag(s) of the filtration steps applied.

So the table called "ssk-w1-eth-gps-delconflrec" indicates that the GPS table was filtered to remove all conflicted record_times, while "ssk-w1-eth-gps-delduplsat" means the GPS table was filtered to remove all duplicates on the satellite_time field. (The XLS table has no satellite_time field, so only the "rec" tags apply to it.)

(Note: As an added precaution, the raw tables have the tag "TOXIC" appended to remind researchers not to use their content directly for analysis.)
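
As a rough illustration of the convention, the names could be assembled by a small helper like the one below. Both `filter_tag` and `table_name` are hypothetical helpers, not part of the ingest code; they only encode the pattern described above.

```python
def filter_tag(kind, field):
    """Build a filtration tag such as 'delconflsat' or 'delduplrec'."""
    if kind not in ("confl", "dupl"):
        raise ValueError("kind must be 'confl' or 'dupl'")
    if field not in ("sat", "rec"):
        raise ValueError("field must be 'sat' or 'rec'")
    return f"del{kind}{field}"

def table_name(city, wave, source, sensor, *tags, sep="-"):
    """Join the name components; with no filter tags, mark the table as raw and TOXIC."""
    parts = [city, f"w{wave}", source, sensor]
    parts.extend(tags if tags else ["raw", "TOXIC"])
    return sep.join(parts)

print(table_name("ssk", 1, "eth", "gps", filter_tag("confl", "rec")))
print(table_name("ssk", 1, "eth", "xls"))
print(table_name("ssk", 1, "eth", "gps", sep="_"))  # underscore form used for SQL identifiers
```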

## Ingest Targets

The preliminary ingest for Sask Wave 1 will create the following tables in the level_0 schema:

  - ssk-w1-eth-gps-raw-TOXIC
  - ssk-w1-eth-xls-raw-TOXIC
  - ssk-w1-eth-xls-delduplrec (raw xls with all rec-time duplicates removed)
  - ssk-w1-eth-xls-delconflrec (raw xls with all rec-time conflicts removed)
  - ssk-w1-eth-gps-delduplsat (raw gps with all sat-time duplicates removed)
  - ssk-w1-eth-gps-delduplrec (raw gps with all rec-time duplicates removed)
  - ssk-w1-eth-gps-delconflsat (raw gps with all sat-time conflicts removed)
  - ssk-w1-eth-gps-delconflrec (raw gps with all rec-time conflicts removed)

### ssk_w1_eth_gps_raw_TOXIC

Ideally, we want to use the PostgreSQL COPY command, since it has been optimized for fast loading, but the target telemetry table and the incoming CSV file have different column names. We *could* just rename the columns in the CSV file, but that's a manual step that shouldn't be embedded into the process.

So instead, we'll load the data into a temporary table, and then transfer it from there into the target table and match the records to their interact_ids at the same time. This method will more or less double the ingest time, but having a reliable ingest process that can be fully documented, without manual interventions, seems like the more robust path.
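
The stage-then-transfer pattern can be tried out on a tiny scale with the sketch below. It deliberately swaps in Python's built-in `sqlite3` module for PostgreSQL so that it runs without a server, and the CSV rows, the `gps_raw` table, and the interact_id value are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; user 99 has no known interact_id
raw_csv = io.StringIO(
    "user_id,record_time,lat,lon\n"
    "11,2017-01-01 00:00:00,52.1,-106.6\n"
    "11,2017-01-01 00:00:01,52.2,-106.7\n"
    "99,2017-01-01 00:00:00,52.3,-106.8\n"
)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ethica_assignments (ethica_id INTEGER, interact_id INTEGER);
    INSERT INTO ethica_assignments VALUES (11, 301000001);
    CREATE TABLE tmpgps (user_id INTEGER, record_time TEXT, lat REAL, lon REAL);
    CREATE TABLE gps_raw (iid INTEGER, record_time TEXT, lat REAL, lon REAL);
""")

# Step 1: load the CSV as-is into the staging table, skipping the header row
reader = csv.reader(raw_csv)
next(reader)
conn.executemany("INSERT INTO tmpgps VALUES (?, ?, ?, ?)", reader)

# Step 2: transfer into the destination table, mapping user_id to interact_id
conn.execute("""
    INSERT INTO gps_raw (iid, record_time, lat, lon)
    SELECT asgn.interact_id, raw.record_time, raw.lat, raw.lon
    FROM tmpgps AS raw
         INNER JOIN ethica_assignments AS asgn ON raw.user_id = asgn.ethica_id
""")

# Step 3: account for every staged record
num_staged = conn.execute("SELECT COUNT(*) FROM tmpgps").fetchone()[0]
num_final = conn.execute("SELECT COUNT(*) FROM gps_raw").fetchone()[0]
num_unmatched = conn.execute("""
    SELECT COUNT(*)
    FROM tmpgps AS raw
         LEFT JOIN ethica_assignments AS asgn ON raw.user_id = asgn.ethica_id
    WHERE asgn.interact_id IS NULL
""").fetchone()[0]
print(num_staged, num_final, num_unmatched)
```

The final check mirrors the reconciliation idea used in the ingest function: the staged count should equal the transferred count plus the count of records whose user could not be matched.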

In [15]:
# Create the temporary incoming table and load the raw GPS data into it
def load_raw_CSV_data(csvfilepath, schema, tmptablename, tmpsqlvars, desttablename, destsqlvars, transfersql):
    log(f"Loading file '{csvfilepath}' into table '{tmptablename}' of schema '{schema}'")
    with psycopg2.connect(user=db_user, host=db_host, port=db_host_port, database='interact_db') as conn:
        cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)

        # Create the temporary onboarding table
        fulltmptablename = f"{schema}.{tmptablename}"
        sql = f"""
            DROP TABLE IF EXISTS {fulltmptablename}; 
            CREATE TABLE {fulltmptablename} (
                {tmpsqlvars}
                );
                """
        log(f"Creating temp ingest table {tmptablename}")
        cur.execute(sql)
        # now add a comment describing table's purpose
        cur.execute(f"COMMENT ON TABLE {fulltmptablename} IS 'Temporary table for raw data ingest.';")

        # And load the data from the CSV
        log(f"Ingesting raw CSV")
        with open(csvfilepath, 'r') as f:
            # Notice that we don't need the `csv` module.
            next(f) # Skip the header row because copy_from doesn't want it
            cur.copy_from(f, fulltmptablename, sep=',')

        # Now collect some basic stats and validate the loaded raw data
        log(f"Checking success of raw data load")
        rowcount = 0
        with open(csvfilepath,'r') as fh:
            for line in fh:
                rowcount += 1
        log(f"Expecting {rowcount-1:,} lines in table") # don't count the header      
        sql = f"""SELECT SUM(num_recs) as num_recs, COUNT(1) as num_users
                  FROM (
                         SELECT COUNT(1) as num_recs
                         FROM {fulltmptablename}
                         GROUP BY user_id
                       ) as records_per_user
                  """
        cur.execute(sql)
        row = cur.fetchone()
        num_raw_recs = row['num_recs']
        num_raw_users = row['num_users']
        log(f"Ingested {num_raw_recs:,} records across {num_raw_users:,} users.")
        if num_raw_recs == rowcount - 1:
            log("Raw data ingest appears successful.")
            conn.commit()
        else:
            log("ERROR: INGESTED ROW COUNT DOES NOT MATCH SOURCE FILE")
            return
              
        # Create the final destination table
        fulldesttablename = f"{schema}.{desttablename}"
        sql = f"""
            DROP TABLE IF EXISTS {fulldesttablename} CASCADE; 
            CREATE TABLE {fulldesttablename} (
                {destsqlvars}
                )
                """
        log(f"Creating destination table {fulldesttablename}")
        cur.execute(sql)

        # And transfer the data from the staging table
        log(f"Matching ethica_ids to interact_ids and transfering records into destination table")
        cur.execute(transfersql)  # use the function argument, not the notebook-level global

        # And finally collect some basic stats and validate the ingested data
        log(f"Checking success of ingest")
        sql = f"""SELECT SUM(num_recs) as num_recs, COUNT(1) as num_users
                  FROM (
                         SELECT COUNT(1) as num_recs
                         FROM {fulldesttablename}
                         GROUP BY iid
                       ) as records_per_user
                  """
        cur.execute(sql)
        row = cur.fetchone()
        num_final_recs = row['num_recs']
        num_final_users = row['num_users']
        
        # count the number of records for which there are no matching interact_ids
        missing_sql = f"""
            SELECT count(1) as num_missing_users, sum(num_recs) as total_missing_recs, min(interact_id)
            FROM (
                SELECT count(1) as num_recs, min(interact_id) as interact_id, min(user_id), min(ethica_email)
                FROM 
                   portal_dev.ethica_assignments asgn
                   RIGHT OUTER JOIN {fulltmptablename} raw
                   ON asgn.ethica_id = raw.user_id
                GROUP BY user_id
            ) recs_per_unmatched_user
            WHERE interact_id IS NULL;
            """
        cur.execute(missing_sql)
        row = cur.fetchone()
        num_missing_recs = int(row['total_missing_recs'] or 0) #assign 0 if no records match
        num_missing_users = int(row['num_missing_users'] or 0) #assign 0 if no records match
        log(f"Identified {num_missing_recs:,} records across {num_missing_users:,} users for whom interact_id is not known.")
        
        # Report the success, partial success, or failure of the ingest
        log(f"Transfered {num_final_recs:,} records for {num_final_users:,} users.")
        if num_raw_recs == num_final_recs:
            log("Data transfer and ingest operation complete. All records ingested.")
            conn.commit()
        elif num_raw_recs == num_final_recs + num_missing_recs:
            log(f"Data transfer and ingest operation complete. All records accounted for, but {num_missing_users:,} unknown users.")
            conn.commit()
        else:
            log("ERROR: TRANSFERED ROW COUNT DOES NOT MATCH STAGING TABLE")
            
        data = [ ['num_recs', num_final_recs], ['num_users', num_final_users], ]
        print(f"\nFigure: Counts for {fulldesttablename}")
        print(tabulate(data, floatfmt=',.0f', stralign='right' ))
        
        
        # If there were some unmatched user_ids, list them so operator can investigate
        if num_missing_recs:
            problems_sql = f"""
                SELECT count(1) as num_recs, 
                       min(user_id) as user_id
                FROM portal_dev.ethica_assignments asgn
                   RIGHT OUTER JOIN {fulltmptablename} raw
                   ON asgn.ethica_id = raw.user_id
                WHERE interact_id IS NULL
                GROUP BY user_id;
                """
            print(f"\nFigure: Unmatched user_ids from {tmptablename}")
            cur.execute(problems_sql)
            rows = cur.fetchall()
            print(tabulate(rows, headers='keys'))
            print(f"Total Unmatched Rows: {num_missing_recs:,}")
#        else:
#            log("Record count in CSV matches record count of ingested table")


In [16]:
# Set up the parameters for loading the GPS table data
csvfilename = "gps.csv"
fullcsvpath = os.path.join(csvdir, csvfilename)
tmptablename = "tmpgps"
desttablename = "ssk_w1_eth_gps_raw_TOXIC"

tmpsqlvars = """
    user_id BIGINT NOT NULL,
    date TEXT,
    device_id TEXT NOT NULL,
    record_time TIMESTAMP WITH TIME ZONE NOT NULL,
    timestamp TEXT,
    accu DOUBLE PRECISION,
    alt DOUBLE PRECISION,
    bearing DOUBLE PRECISION,
    lat DOUBLE PRECISION NOT NULL,
    lon DOUBLE PRECISION NOT NULL,
    provider TEXT,
    satellite_time TIMESTAMP WITH TIME ZONE,
    speed DOUBLE PRECISION
    """

destsqlvars = """
    iid BIGINT NOT NULL,  -- interact_id
    record_time TIMESTAMP WITH TIME ZONE NOT NULL, -- participant's UTC time, to millisec, from phone clock
    satellite_time TIMESTAMP WITH TIME ZONE NOT NULL, -- timestamp taken from satellite data
    lat DOUBLE PRECISION NOT NULL,
    lon DOUBLE PRECISION NOT NULL,
    speed DOUBLE PRECISION DEFAULT 'NaN',
    course DOUBLE PRECISION DEFAULT 'NaN',
    alt DOUBLE PRECISION DEFAULT 'NaN',
    accu DOUBLE PRECISION DEFAULT 'NaN',
    provider TEXT DEFAULT ''
    """

transfer_sql = f"""
   INSERT INTO {db_schema}.{desttablename} (iid,record_time,satellite_time,lat,lon,speed,course,alt,accu,provider)
        SELECT asgn.interact_id,
                raw.record_time,
                raw.satellite_time,
                raw.lat,
                raw.lon,
                raw.speed,
                raw.bearing,
                raw.alt,
                raw.accu,
                raw.provider
        FROM {db_schema}.{tmptablename} raw 
            INNER JOIN portal_dev.ethica_assignments AS asgn
            ON raw.user_id = asgn.ethica_id
        """
load_raw_CSV_data(fullcsvpath, db_schema, tmptablename, tmpsqlvars, desttablename, destsqlvars, transfer_sql)

→ Loading file '/home/jeffs/projects/def-dfuller/interact/permanent_archive/Saskatoon/Wave1/Ethica/saskatoon_01/raw/gps.csv' into table 'tmpgps' of schema 'level_0'
→ Creating temp ingest table tmpgps
→ Ingesting raw CSV
→ Checking success of raw data load
→ Expecting 28,897,052 lines in table
→ Ingested 28,897,052 records across 151 users.
→ Raw data ingest appears successful.
→ Creating destination table level_0.ssk_w1_eth_gps_raw_TOXIC
→ Matching ethica_ids to interact_ids and transfering records into destination table
→ Checking success of ingest
→ Identified 0 records across 0 users for whom interact_id is not known.
→ Transfered 28,897,052 records for 151 users.
→ Data transfer and ingest operation complete. All records ingested.

Figure: Counts for level_0.ssk_w1_eth_gps_raw_TOXIC
---------  ----------
 num_recs  28,897,052
num_users         151
---------  ----------


In [17]:
# Quick block that can be run at any time to validate the existing table
desttablename = "ssk_w1_eth_gps_raw_TOXIC"
fulldesttablename = f"{db_schema}.{desttablename}"
print(f"Figure: Validating table '{fulldesttablename}'\n")
with psycopg2.connect(user=db_user, host=db_host, port=db_host_port, database='interact_db') as conn:
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    

    # Collect the same basic stats as computed during ingest
    sql = f"""SELECT SUM(num_recs) as num_recs, COUNT(1) as num_users
              FROM (
                     SELECT COUNT(1) as num_recs
                     FROM {fulldesttablename}
                     GROUP BY iid
                   ) as records_per_user
              """
    cur.execute(sql)
    row = cur.fetchone()
    data = [['num_recs',row['num_recs']], ['num_users',row['num_users']],]


    print(tabulate(data, floatfmt=',.0f', stralign='right'))

Figure: Validating table 'level_0.ssk_w1_eth_gps_raw_TOXIC'

---------  ----------
 num_recs  28,897,052
num_users         151
---------  ----------


In [18]:
# This is the code that is needed to add a simple checksum to the ingest report 
# but I decided it was overkill.
import hashlib

hasher = f"""
    SELECT SUM(iid
              + extract('epoch' from record_time)                
              - extract('epoch' from satellite_time)
              + lat
              - lon
              + speed
              - course
              + alt
              - accu 
              )::bigint as simple_total
    FROM {fulldesttablename} ;
    """

cur.execute(hasher)
row = cur.fetchone()
fingerprint = hashlib.md5(row['simple_total'].to_bytes(8, 'big', signed=True)).hexdigest()
data.append(['simple_checksum',fingerprint])


### ssk_w1_eth_xls_raw_TOXIC
Proceeds as per the GPS table, but with accelerometry source file and column names.

In [19]:
# Set up the parameters for loading the XL table data
csvfilename = "accelerometer.csv"
fullcsvpath = os.path.join(csvdir, csvfilename)
tmptablename = "tmpxl"
desttablename = "ssk_w1_eth_xls_raw_TOXIC"

tmpsqlvars = """
    user_id BIGINT NOT NULL,
    date TEXT,
    device_id TEXT NOT NULL,
    record_time TIMESTAMP WITH TIME ZONE NOT NULL,
    timestamp TEXT,
    accu DOUBLE PRECISION,
    x_axis DOUBLE PRECISION NOT NULL,
    y_axis DOUBLE PRECISION NOT NULL,
    z_axis DOUBLE PRECISION NOT NULL
    """

destsqlvars = """
    iid BIGINT NOT NULL,   -- interact_id
    record_time TIMESTAMP WITH TIME ZONE NOT NULL, -- participant's UTC time, to millisec 
    x DOUBLE PRECISION NOT NULL,
    y DOUBLE PRECISION NOT NULL,
    z DOUBLE PRECISION NOT NULL
    """

transfer_sql = f"""
    INSERT INTO {db_schema}.{desttablename} (iid,record_time,x,y,z)
    SELECT asgn.interact_id,
        raw.record_time,
        raw.x_axis,
        raw.y_axis,
        raw.z_axis
    FROM {db_schema}.{tmptablename} raw 
        INNER JOIN portal_dev.ethica_assignments asgn
        ON raw.user_id = asgn.ethica_id 
        """
load_raw_CSV_data(fullcsvpath, db_schema, tmptablename, tmpsqlvars, desttablename, destsqlvars, transfer_sql)

→ Loading file '/home/jeffs/projects/def-dfuller/interact/permanent_archive/Saskatoon/Wave1/Ethica/saskatoon_01/raw/accelerometer.csv' into table 'tmpxl' of schema 'level_0'
→ Creating temp ingest table tmpxl
→ Ingesting raw CSV
→ Checking success of raw data load
→ Expecting 960,178,307 lines in table
→ Ingested 960,178,307 records across 152 users.
→ Raw data ingest appears successful.
→ Creating destination table level_0.ssk_w1_eth_xls_raw_TOXIC
→ Matching ethica_ids to interact_ids and transfering records into destination table
→ Checking success of ingest
→ Identified 0 records across 0 users for whom interact_id is not known.
→ Transfered 960,178,307 records for 152 users.
→ Data transfer and ingest operation complete. All records ingested.

Figure: Counts for level_0.ssk_w1_eth_xls_raw_TOXIC
---------  -----------
 num_recs  960,178,307
num_users          152
---------  -----------


In [20]:
# Quick block that can be run at any time to validate the existing XL table
desttablename = "ssk_w1_eth_xls_raw_TOXIC"
fulldesttablename = f"{db_schema}.{desttablename}"
with psycopg2.connect(user=db_user, host=db_host, port=db_host_port, database='interact_db') as conn:
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    
    # Collect the same basic stats as computed during ingest
    sql = f"""SELECT SUM(num_recs) as num_recs, COUNT(1) as num_users
              FROM (
                     SELECT COUNT(1) as num_recs
                     FROM {fulldesttablename}
                     GROUP BY iid
                   ) as records_per_user
              """
    print(f"Figure: Validating table '{fulldesttablename}'\n")
    cur.execute(sql)
    row = cur.fetchone()
    data = [['num_recs',row['num_recs']], ['num_users',row['num_users']],]
    print(tabulate(data, floatfmt=',.0f', stralign='right'))

Figure: Validating table 'level_0.ssk_w1_eth_xls_raw_TOXIC'

---------  -----------
 num_recs  960,178,307
num_users          152
---------  -----------


#### Comparing population of both tables
As a final validation, verify that the GPS and XLS raw tables both have data from the same users.
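
Conceptually, this check is a symmetric difference between the two sets of user ids. A minimal sketch, assuming the ids have already been fetched into Python collections (the function name and the sample ids are hypothetical):

```python
def population_mismatch(gps_iids, xls_iids):
    """Report the ids that appear in only one of the two tables."""
    gps_iids, xls_iids = set(gps_iids), set(xls_iids)
    return {
        "gps_only": sorted(gps_iids - xls_iids),
        "xls_only": sorted(xls_iids - gps_iids),
    }

# Hypothetical ids: user 103 has accelerometry data but no GPS data
mismatch = population_mismatch([101, 102], [101, 102, 103])
print(mismatch)
```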

In [21]:
with psycopg2.connect(user=db_user, host=db_host, port=db_host_port, database='interact_db') as conn:
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
   
    mismatch_sql = f"""
        SET SCHEMA '{db_schema}';
        DROP TABLE IF EXISTS tmpgpsiids;
        DROP TABLE IF EXISTS tmpxlsiids;
        CREATE TABLE tmpgpsiids AS SELECT DISTINCT iid FROM ssk_w1_eth_gps_raw_toxic ;
        CREATE TABLE tmpxlsiids AS SELECT DISTINCT iid FROM ssk_w1_eth_xls_raw_toxic ;
        SELECT tmpgpsiids.iid AS gps_iid, tmpxlsiids.iid AS xls_iid 
        FROM tmpgpsiids 
             FULL OUTER JOIN tmpxlsiids
             ON tmpgpsiids.iid = tmpxlsiids.iid 
        WHERE tmpgpsiids.iid IS NULL 
           OR tmpxlsiids.iid IS NULL;
        """
    
    print(f"\nFigure: Users with data present in only one table\n")
    cur.execute(mismatch_sql)
    rows = cur.fetchall()
    print(tabulate(rows, headers='keys', stralign='right'))

    # Interact_id 302562672 is known to have no GPS data


Figure: Users with data present in only one table

  gps_iid    xls_iid
---------  ---------
           302562672


### ssk-w1-eth-xls-delduplrec 
This materialized view will be extracted from the raw XLS data, with all record_time **duplicates** removed.

In [25]:
tablename = 'ssk_w1_eth_xls_delduplrec'
sourcetable = "ssk_w1_eth_xls_raw_TOXIC"
with psycopg2.connect(user=db_user, host=db_host, port=db_host_port, database='interact_db') as conn:
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    log(f"Creating {tablename} as a materialized view...")
    sql = f"""
        SET SCHEMA '{db_schema}';
        DROP MATERIALIZED VIEW IF EXISTS {tablename};
        CREATE MATERIALIZED VIEW {tablename}
        AS
            SELECT iid, record_time, x, y, z
            FROM (
                SELECT count(1) as numrows, 
                       iid, 
                       record_time,
                       min(x) as x,
                       min(y) as y,
                       min(z) as z
                FROM {sourcetable}
                GROUP BY iid, record_time
                ) as count_collisions
            WHERE numrows = 1; -- there are no duplicates at all
            CREATE UNIQUE INDEX {tablename}_idx ON {tablename} (iid, record_time);
            """
    cur.execute(sql)
    conn.commit()
    log(f"Done")
    
    # Report basic stats on the newly created view
    print(f"\nFigure: Basic counts for table {tablename}\n")
    sql = f"""
    SET SCHEMA '{db_schema}';
    SELECT SUM(num_recs) as num_recs, COUNT(1) as num_users
        FROM (
             SELECT COUNT(1) as num_recs
             FROM {tablename}
             GROUP BY iid
           ) as records_per_user
        """
    cur.execute(sql)
    row = cur.fetchone()
    data = [['num_recs',row['num_recs']], ['num_users',row['num_users']],]
    print(tabulate(data, floatfmt=',.0f', stralign='right'))
    conn.commit()


→ Creating ssk_w1_eth_xls_delduplrec as a materialized view...
→ Done

Figure: Basic counts for table ssk_w1_eth_xls_delduplrec

---------  -----------
 num_recs  832,152,864
num_users          152
---------  -----------


### ssk-w1-eth-gps-delduplrec 
This materialized view will be extracted from the raw GPS data, with all record_time **duplicates** removed.

In [24]:
tablename = 'ssk_w1_eth_gps_delduplrec'
sourcetable = "ssk_w1_eth_gps_raw_TOXIC"
with psycopg2.connect(user=db_user, host=db_host, port=db_host_port, database='interact_db') as conn:
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    log(f"Creating {tablename} as a materialized view...")
    sql = f"""
        SET SCHEMA '{db_schema}';
        DROP MATERIALIZED VIEW IF EXISTS {tablename};
        CREATE MATERIALIZED VIEW {tablename}
        AS
            SELECT iid, record_time, satellite_time, minlat as lat, minlon as lon
            FROM (
                SELECT count(1) as numrows, 
                       iid, 
                       record_time, 
                       min(satellite_time) as satellite_time,
                       min(lat) as minlat, 
                       max(lat) as maxlat,
                       min(lon) as minlon,
                       max(lon) as maxlon
                FROM {sourcetable}
                GROUP BY iid, record_time
                ) as count_collisions
            WHERE numrows = 1; -- there are no duplicates at all
            CREATE UNIQUE INDEX {tablename}_idx ON {tablename} (iid, record_time);
            """
    cur.execute(sql)
    conn.commit()
    log(f"Done")
    
    # Report basic stats on the newly created view
    print(f"\nFigure: Basic counts for table {tablename}\n")
    sql = f"""
    SET SCHEMA '{db_schema}';
    SELECT SUM(num_recs) as num_recs, COUNT(1) as num_users
        FROM (
             SELECT COUNT(1) as num_recs
             FROM {tablename}
             GROUP BY iid
           ) as records_per_user
        """
    cur.execute(sql)
    row = cur.fetchone()
    data = [['num_recs',row['num_recs']], ['num_users',row['num_users']],]
    print(tabulate(data, floatfmt=',.0f', stralign='right'))

→ Creating ssk_w1_eth_gps_delduplrec as a materialized view...
→ Done

Figure: Basic counts for table ssk_w1_eth_gps_delduplrec

---------  ---------
 num_recs  7,729,649
num_users        151
---------  ---------


### ssk-w1-eth-gps-delconflrec 
This materialized view will be extracted from the raw GPS data, with all record_time **conflicts** removed.
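
The difference between the "dupl" and "confl" filters comes down to one extra clause in the WHERE condition. The sketch below reproduces the same grouping query on a toy table, using Python's built-in `sqlite3` in place of PostgreSQL so that it runs anywhere; the `gps_raw` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gps_raw (iid INTEGER, record_time TEXT, lat REAL, lon REAL)")
conn.executemany("INSERT INTO gps_raw VALUES (?, ?, ?, ?)", [
    (1, "t0", 52.10, -106.60),  # singleton
    (1, "t1", 52.11, -106.61),  # singleton-equivalent pair
    (1, "t1", 52.11, -106.61),
    (1, "t2", 52.12, -106.62),  # conflicted pair
    (1, "t2", 52.99, -106.62),
])

base = """
    SELECT iid, record_time, minlat AS lat, minlon AS lon
    FROM (
        SELECT COUNT(1) AS numrows, iid, record_time,
               MIN(lat) AS minlat, MAX(lat) AS maxlat,
               MIN(lon) AS minlon, MAX(lon) AS maxlon
        FROM gps_raw
        GROUP BY iid, record_time
    ) AS count_collisions
    WHERE {condition}
    ORDER BY record_time
"""

# "dupl": keep only timestamps that occur exactly once
deldupl = conn.execute(base.format(condition="numrows = 1")).fetchall()

# "confl": also keep repeated timestamps whose readings all agree
delconfl = conn.execute(
    base.format(condition="numrows = 1 OR (minlat = maxlat AND minlon = maxlon)")
).fetchall()

print([row[1] for row in deldupl])
print([row[1] for row in delconfl])
```

Because the min and max of a column are equal only when every value in the group is the same, the extra clause admits singleton-equivalent groups while still rejecting conflicted ones.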

In [27]:
tablename = 'ssk_w1_eth_gps_delconflrec'
sourcetable = "ssk_w1_eth_gps_raw_TOXIC"
with psycopg2.connect(user=db_user, host=db_host, port=db_host_port, database='interact_db') as conn:
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    log(f"Creating {tablename} as a materialized view...")
    sql = f"""
        SET SCHEMA '{db_schema}';
        DROP MATERIALIZED VIEW IF EXISTS {tablename};
        CREATE MATERIALIZED VIEW {tablename}
        AS
            SELECT iid, record_time, satellite_time, minlat as lat, minlon as lon
            FROM (
                SELECT count(1) as numrows, 
                       iid, 
                       record_time, 
                       min(satellite_time) as satellite_time,
                       min(lat) as minlat, 
                       max(lat) as maxlat,
                       min(lon) as minlon,
                       max(lon) as maxlon
                FROM {sourcetable}
                GROUP BY iid, record_time
                ) as count_collisions
            WHERE numrows = 1 -- there are no duplicates at all
               OR (minlon = maxlon AND minlat = maxlat); -- the duplicates are identical
        CREATE UNIQUE INDEX {tablename}_idx ON {tablename} (iid, record_time);
        """
    cur.execute(sql)
    conn.commit()
    log(f"Done")
    
    # Report basic stats on the newly created view
    print(f"\nFigure: Basic counts for table {tablename}\n")
    sql = f"""
    SET SCHEMA '{db_schema}';
    SELECT SUM(num_recs) as num_recs, COUNT(1) as num_users
        FROM (
             SELECT COUNT(1) as num_recs
             FROM {tablename}
             GROUP BY iid
           ) as records_per_user
        """
    cur.execute(sql)
    row = cur.fetchone()
    data = [['num_recs',row['num_recs']], ['num_users',row['num_users']],]
    print(tabulate(data, floatfmt=',.0f', stralign='right'))

→ Creating ssk_w1_eth_gps_delconflrec as a materialized view...
→ Done

Figure: Basic counts for table ssk_w1_eth_gps_delconflrec

---------  ---------
 num_recs  7,791,943
num_users        151
---------  ---------


### ssk-w1-eth-xls-delconflrec 
This materialized view will be extracted from the raw XL data, with all record_time **conflicts** removed.

In [2]:
tablename = 'ssk_w1_eth_xls_delconflrec'
sourcetable = "ssk_w1_eth_xls_raw_TOXIC"
with psycopg2.connect(user=db_user, host=db_host, port=db_host_port, database='interact_db') as conn:
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    log(f"Creating {tablename} as a materialized view...")
    sql = f"""
        SET SCHEMA '{db_schema}';
        DROP MATERIALIZED VIEW IF EXISTS {tablename};
        CREATE MATERIALIZED VIEW {tablename}
        AS
            SELECT iid, record_time, minx as x, miny as y, minz as z
            FROM (
                SELECT count(1) as numrows, 
                       iid, 
                       record_time,
                       min(x) as minx,
                       max(x) as maxx,
                       min(y) as miny,
                       max(y) as maxy,
                       min(z) as minz,
                       max(z) as maxz
                FROM {sourcetable}
                GROUP BY iid, record_time
                ) as count_collisions
            WHERE numrows = 1 -- there are no duplicates at all
               OR (minx = maxx AND miny = maxy AND minz = maxz); -- the duplicates are identical
        CREATE UNIQUE INDEX {tablename}_idx ON {tablename} (iid, record_time);
        """
    cur.execute(sql)
    conn.commit()
    log(f"Done")
    
    # Report basic stats on the newly created view
    print(f"\nFigure: Basic counts for table {tablename}\n")
    sql = f"""
    SET SCHEMA '{db_schema}';
    SELECT SUM(num_recs) as num_recs, COUNT(1) as num_users
        FROM (
             SELECT COUNT(1) as num_recs
             FROM {tablename}
             GROUP BY iid
           ) as records_per_user
        """
    cur.execute(sql)
    row = cur.fetchone()
    data = [['num_recs',row['num_recs']], ['num_users',row['num_users']],]
    print(tabulate(data, floatfmt=',.0f', stralign='right'))
    conn.commit()


→ Creating ssk_w1_eth_xls_delconflrec as a materialized view...
→ Done

Figure: Basic counts for table ssk_w1_eth_xls_delconflrec

---------  -----------
 num_recs  836,034,501
num_users          152
---------  -----------
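
To make the conflict rule concrete, here is a minimal sketch that runs the same GROUP BY / min / max filter on a made-up five-row table. It assumes an in-memory SQLite database in place of PostgreSQL (purely so it runs without a server), and the table name `toy_xls_raw` and all values are hypothetical.

```python
import sqlite3

# Toy stand-in for the raw XL table. SQLite is an assumption made only so
# the sketch is runnable without a PostgreSQL server; all values are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_xls_raw (iid INTEGER, record_time TEXT, x REAL, y REAL, z REAL)"
)
rows = [
    (1, "00:00:00", 0.1, 0.2, 0.3),  # pure singleton
    (1, "00:00:01", 0.4, 0.5, 0.6),  # true duplicate, copy 1
    (1, "00:00:01", 0.4, 0.5, 0.6),  # true duplicate, copy 2
    (2, "00:00:00", 0.7, 0.8, 0.9),  # conflicted duplicate
    (2, "00:00:00", 0.7, 0.8, 1.5),  # same key, different z
]
conn.executemany("INSERT INTO toy_xls_raw VALUES (?, ?, ?, ?, ?)", rows)

# Same shape as the view definition: a group survives if it has one row,
# or if min and max agree on every sensor axis.
kept = conn.execute(
    """
    SELECT iid, record_time, minx AS x, miny AS y, minz AS z
    FROM (
        SELECT count(1) AS numrows, iid, record_time,
               min(x) AS minx, max(x) AS maxx,
               min(y) AS miny, max(y) AS maxy,
               min(z) AS minz, max(z) AS maxz
        FROM toy_xls_raw
        GROUP BY iid, record_time
    ) AS count_collisions
    WHERE numrows = 1
       OR (minx = maxx AND miny = maxy AND minz = maxz)
    ORDER BY iid, record_time
    """
).fetchall()
conn.close()

for row in kept:
    print(row)
```

The singleton and the collapsed true duplicate survive; both rows of the conflicted pair for user 2 are dropped, which is why a user can disappear from the view entirely if all of their records conflict.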


### ssk-w1-eth-gps-delduplsat
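This materialized view is extracted from the raw GPS data with all satellite_time **duplicates** removed: any (iid, satellite_time) pair that occurs more than once is dropped entirely, even when the repeated rows are identical.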

In [3]:
tablename = 'ssk_w1_eth_gps_delduplsat'
sourcetable = "ssk_w1_eth_gps_raw_TOXIC"
with psycopg2.connect(user=db_user, host=db_host, port=db_host_port, database='interact_db') as conn:
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    log(f"Creating {tablename} as a materialized view...")
    sql = f"""
        SET SCHEMA '{db_schema}';
        DROP MATERIALIZED VIEW IF EXISTS {tablename};
        CREATE MATERIALIZED VIEW {tablename}
        AS
            SELECT iid, record_time, satellite_time, minlat as lat, minlon as lon
            FROM (
                SELECT count(1) as numrows, 
                       iid, 
                       min(record_time) as record_time, 
                       satellite_time,
                       min(lat) as minlat, 
                       max(lat) as maxlat,
                       min(lon) as minlon,
                       max(lon) as maxlon
                FROM {sourcetable}
                GROUP BY iid, satellite_time
                ) as count_collisions
            WHERE numrows = 1; -- there are no duplicates at all
        CREATE UNIQUE INDEX {tablename}_idx ON {tablename} (iid, satellite_time);
        """
    cur.execute(sql)
    conn.commit()
    log(f"Done")
    
    # Report basic stats on the newly created view
    print(f"\nFigure: Basic counts for table {tablename}\n")
    sql = f"""
    SET SCHEMA '{db_schema}';
    SELECT SUM(num_recs) as num_recs, COUNT(1) as num_users
        FROM (
             SELECT COUNT(1) as num_recs
             FROM {tablename}
             GROUP BY iid
           ) as records_per_user
        """
    cur.execute(sql)
    row = cur.fetchone()
    data = [['num_recs',row['num_recs']], ['num_users',row['num_users']],]
    print(tabulate(data, floatfmt=',.0f', stralign='right'))

→ Creating ssk_w1_eth_gps_delduplsat as a materialized view...
→ Done

Figure: Basic counts for table ssk_w1_eth_gps_delduplsat

---------  ---------
 num_recs  6,262,936
num_users        151
---------  ---------


### ssk-w1-eth-gps-delconflsat
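This materialized view is extracted from the raw GPS data with all satellite_time **conflicts** removed: (iid, satellite_time) groups whose lat or lon readings disagree are dropped, and groups of identical duplicates are collapsed to a single row that keeps the earliest record_time.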

In [4]:
tablename = 'ssk_w1_eth_gps_delconflsat'
sourcetable = "ssk_w1_eth_gps_raw_TOXIC"
with psycopg2.connect(user=db_user, host=db_host, port=db_host_port, database='interact_db') as conn:
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    log(f"Creating {tablename} as a materialized view...")
    sql = f"""
        SET SCHEMA '{db_schema}';
        DROP MATERIALIZED VIEW IF EXISTS {tablename};
        CREATE MATERIALIZED VIEW {tablename}
        AS
            SELECT iid, record_time, satellite_time, minlat as lat, minlon as lon
            FROM (
                SELECT count(1) as numrows, 
                       iid, 
                       min(record_time) as record_time, 
                       satellite_time,
                       min(lat) as minlat, 
                       max(lat) as maxlat,
                       min(lon) as minlon,
                       max(lon) as maxlon
                FROM {sourcetable}
                GROUP BY iid, satellite_time
                ) as count_collisions
            WHERE numrows = 1 -- there are no duplicates at all
               OR (minlon = maxlon AND minlat = maxlat); -- the duplicates are identical
        CREATE UNIQUE INDEX {tablename}_idx ON {tablename} (iid, satellite_time);
        """
    cur.execute(sql)
    conn.commit()
    log(f"Done")
    
    # Report basic stats on the newly created view
    print(f"\nFigure: Basic counts for table {tablename}\n")
    sql = f"""
    SET SCHEMA '{db_schema}';
    SELECT SUM(num_recs) as num_recs, COUNT(1) as num_users
        FROM (
             SELECT COUNT(1) as num_recs
             FROM {tablename}
             GROUP BY iid
           ) as records_per_user
        """
    cur.execute(sql)
    row = cur.fetchone()
    data = [['num_recs',row['num_recs']], ['num_users',row['num_users']],]
    print(tabulate(data, floatfmt=',.0f', stralign='right'))

→ Creating ssk_w1_eth_gps_delconflsat as a materialized view...
→ Done

Figure: Basic counts for table ssk_w1_eth_gps_delconflsat

---------  ---------
 num_recs  7,275,243
num_users        151
---------  ---------
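
The difference between the `delduplsat` and `delconflsat` rules can be sketched in plain Python. This is an illustration only, not the code used for the ingest: the function names and the toy fixes are hypothetical, and `record_time` is left out for brevity. Every key kept by the duplicate rule is also kept by the conflict rule, which is consistent with the smaller count reported for `delduplsat` (6,262,936) than for `delconflsat` (7,275,243).

```python
from collections import defaultdict


def group_fixes(fixes):
    """Group (lat, lon) readings by their (iid, satellite_time) key."""
    groups = defaultdict(list)
    for iid, satellite_time, lat, lon in fixes:
        groups[(iid, satellite_time)].append((lat, lon))
    return groups


def del_dupl(fixes):
    """Keep only keys that occur exactly once (the delduplsat rule)."""
    return {k: v[0] for k, v in group_fixes(fixes).items() if len(v) == 1}


def del_confl(fixes):
    """Keep keys whose readings all agree, collapsed to one row (the delconflsat rule)."""
    return {k: v[0] for k, v in group_fixes(fixes).items() if len(set(v)) == 1}


# Made-up fixes: (iid, satellite_time, lat, lon)
toy_fixes = [
    (1, 100, 52.10, -106.60),  # singleton
    (1, 101, 52.11, -106.61),  # identical duplicate, copy 1
    (1, 101, 52.11, -106.61),  # identical duplicate, copy 2
    (2, 100, 52.20, -106.70),  # conflicted duplicate
    (2, 100, 52.25, -106.70),  # same key, different lat
]

dedup = del_dupl(toy_fixes)
unconflicted = del_confl(toy_fixes)

print(sorted(dedup))         # keys kept when all duplicates are deleted
print(sorted(unconflicted))  # keys kept when only conflicts are deleted
```

Here the duplicate rule keeps only key `(1, 100)`, while the conflict rule also keeps `(1, 101)` as a single collapsed row; user 2 vanishes under both.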


In [None]:
tablename = 'ssk_w1_eth_xls_delduplrec'
# Quick block that can be run at any time to validate counts on any of the above-created tables
with psycopg2.connect(user=db_user, host=db_host, port=db_host_port, database='interact_db') as conn:
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    
    # Collect the same basic stats as computed during ingest
    sql = f"""
    SET SCHEMA '{db_schema}';
    SELECT SUM(num_recs) as num_recs, COUNT(1) as num_users
        FROM (
             SELECT COUNT(1) as num_recs
             FROM {tablename}
             GROUP BY iid
           ) as records_per_user
        """
    cur.execute(sql)
    row = cur.fetchone()
    data = [['num_recs',row['num_recs']], ['num_users',row['num_users']],]
    print(f"Figure: Basic counts for table {tablename}")
    print(tabulate(data, floatfmt=',.0f', stralign='right'))
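
Because the table name is interpolated directly into the SQL string, one possible refinement is to wrap the count query in a helper that rejects anything that is not a plain identifier before the query is built. The sketch below is an assumption-laden illustration: `basic_counts` and `IDENT_RE` are hypothetical names, and it is exercised against an in-memory SQLite table (`toy_counts`) rather than the project database, so rows come back as plain tuples instead of `RealDictCursor` dictionaries.

```python
import re
import sqlite3

# Hypothetical guard: letters, digits, and underscores only, not starting with a digit.
IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def basic_counts(conn, tablename):
    """Return (num_recs, num_users) for a table that has an iid column."""
    if not IDENT_RE.match(tablename):
        raise ValueError(f"not a plain SQL identifier: {tablename!r}")
    sql = f"""
        SELECT SUM(num_recs) AS num_recs, COUNT(1) AS num_users
        FROM (
            SELECT COUNT(1) AS num_recs
            FROM {tablename}
            GROUP BY iid
        ) AS records_per_user
    """
    cur = conn.cursor()
    cur.execute(sql)
    return tuple(cur.fetchone())


# Exercise the helper on a made-up table: user 1 has two rows, user 2 has one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_counts (iid INTEGER, record_time INTEGER)")
conn.executemany("INSERT INTO toy_counts VALUES (?, ?)", [(1, 0), (1, 1), (2, 0)])

counts = basic_counts(conn, "toy_counts")
print(counts)

# A hyphenated name is rejected before any SQL is sent.
try:
    basic_counts(conn, "toy-counts")
    rejected = None
except ValueError as err:
    rejected = str(err)
print(rejected)
conn.close()
```

With psycopg2, the `psycopg2.sql` module offers identifier quoting that could replace the regular-expression guard; the regex is used here only to keep the sketch free of external dependencies.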

In [None]:
-- Create a table with no conflicted duplicates, and all singleton-equivalent records collapsed into singletons
CREATE TABLE IF NOT EXISTS level_0.tmpuniqgpsunconflictedobs AS
            SELECT 1 as numrecs, user_id, record_time, minlat as lat, minlon as lon
            FROM (
                SELECT count(1) as numrows, 
                       user_id, 
                       record_time, 
                       min(lat) as minlat, 
                       max(lat) as maxlat,
                       min(lon) as minlon,
                       max(lon) as maxlon
                FROM level_0.tmpsskgps
                GROUP BY user_id, record_time
                ) as count_collisions
            WHERE numrows = 1 -- there are no duplicates at all
                OR (minlat = maxlat AND minlon = maxlon); -- the duplicates are identical


Here endeth the ingest.