# Plate, Account, and Tag Frequencies

This notebook explores the number of trips per plate, account, and tag, and tries to explain these numbers.

In [1]:
suppressWarnings(suppressMessages(library(tidyverse)))
suppressMessages(library(lubridate))
library(RSQLCipher)

In [2]:
Sys.setenv("SQL_KEY"=Sys.getenv("HOT_KEY"))

In [3]:
db_path = "../../../data/hot.db"

# import tables
bos = load_table(db_path, "bos", c(tag_id="c", posted_account="c", 
                                   plate_state_pri="c", plate_state_sec="c")) # ~10M rows

In [4]:
# Replace the three hashed NULL placeholders with NA
replace_null = function(x) {
    na_if(na_if(na_if(x, "11183834060272721597"), "-8355759756528748941"), "8974271441158017554")
}
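
The masking that `replace_null` performs can also be sketched in base R, which may help when the tidyverse is not loaded. The three placeholder strings are copied from the helper above; the function name `replace_null_base` and the sample ids are hypothetical, for illustration only.

```r
# The three hashed NULL placeholders, copied from replace_null above.
null_hashes <- c("11183834060272721597", "-8355759756528748941",
                 "8974271441158017554")

# Hypothetical base-R equivalent: set any placeholder value to NA.
replace_null_base <- function(x) {
    x[x %in% null_hashes] <- NA
    x
}

# Made-up sample ids, one of which is a placeholder.
ids <- c("123", "8974271441158017554", "42")
replace_null_base(ids)
```

Values that are already `NA` pass through unchanged, since `NA %in% null_hashes` is `FALSE`.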

## Plates

Since there are 365 days in the year, we'd expect most accounts/plates to have fewer than 700 trips a year (about two trips a day).  Anything above that is cause for suspicion.  Looking at the most frequent plates, we see values logged over 50,000, 11,000, and 8,000 times; blank ids in the output are `NULL` placeholders masked by `replace_null`.  There are several plates in the 1,500-4,000 range, too.
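
To make the threshold concrete, the sketch below converts it and the three counts quoted above into average trips per day. It uses only the rounded figures from the text; the variable names are illustrative.

```r
threshold_per_year <- 700
days_per_year <- 365

# Lower bounds on the counts quoted in the text above.
counts <- c(50000, 11000, 8000)

# Average trips per day implied by the threshold and by each count.
threshold_per_day <- threshold_per_year / days_per_year
trips_per_day <- counts / days_per_year

round(threshold_per_day, 1)
round(trips_per_day, 1)

# All three counts are far above the threshold.
all(counts > threshold_per_year)
```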

In [54]:
bos %>%
    group_by(id=plate_state_sec) %>%
    summarize(count=n()) %>%
    arrange(desc(count)) %>%
    head(15) %>%
    execute(col_types="ci") %>%
    mutate(id=replace_null(id))

id,count
<chr>,<int>
,732140
,51168
-8.785122124138371e+18,11735
-3.3840254602703237e+18,8976
-6.167837656755352e+18,3863
-1.4295945908780777e+18,3269
-5.959073759183278e+18,2598
-4.30245480809633e+18,2193
-7.626160874510874e+18,1543
-4.947576073961224e+18,1394


In [58]:
bos %>%
    group_by(id=plate_state_pri) %>%
    summarize(count=n()) %>%
    arrange(desc(count)) %>%
    head(15) %>%
    execute(col_types="ci") %>%
    mutate(id=replace_null(id))

id,count
<chr>,<int>
,51168
-3.3840254602703237e+18,11073
-1.4295945908780777e+18,2462
-9.879625219582057e+17,1983
-5.959073759183278e+18,1911
-7.875715471405424e+18,1778
1.9787842221457812e+18,1519
-4.30245480809633e+18,1363
-5.82808225530464e+18,1057
-4.947576073961224e+18,1042


The BOS file is joined to the census/account file using `posted_account`, `plate_state_sec`, and `plate_state_pri`.  As long as the combination of these is unique, we need not be concerned about duplicate plates and accounts.  
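
One way to check the uniqueness condition is to look for repeated key combinations before joining. The sketch below does this in base R on a small made-up data frame. The column names match the join keys named above, but the table `accounts` and its values are hypothetical, not data from `hot.db`, and it assumes the condition refers to the table being joined to.

```r
# Hypothetical lookup table keyed by the three join columns.
accounts <- data.frame(
    posted_account  = c("a1", "a1", "a2", "a2"),
    plate_state_sec = c("p1", "p2", "p3", "p3"),
    plate_state_pri = c("p1", "p2", "p3", "p3"),
    stringsAsFactors = FALSE
)

keys <- accounts[, c("posted_account", "plate_state_sec", "plate_state_pri")]

# duplicated() on a data frame flags rows that repeat an earlier row.
dup_rows <- duplicated(keys)

# TRUE only if every key combination appears exactly once.
keys_unique <- !any(dup_rows)
keys_unique

# Inspect the offending combinations, if any.
unique(keys[dup_rows, ])
```

In this toy table the last two rows share a key, so `keys_unique` is `FALSE` and a join on these columns would duplicate matching rows.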

Doing this, the most-frequent plate was now used "only" around 10,000 times in 2018 (8,365 times when the plate was processed completely, and another 2,188 times when the plate was partially processed). 

Beyond this problematic plate, usage patterns make sense.

In [59]:
bos %>%
    group_by(acct=posted_account, plate=plate_state_sec, plate_pri=plate_state_pri) %>%
    summarize(count=n()) %>%
    arrange(desc(count)) %>%
    head(15) %>%
    execute(col_types="ccci") %>%
    mutate(plate=replace_null(plate), plate_pri=replace_null(plate_pri))

acct,plate,plate_pri,count
<chr>,<chr>,<chr>,<int>
,-3.3840254602703237e+18,-3.3840254602703237e+18,8365
-3.5074307519191567e+18,,,4824
-7.228328777495154e+18,,,2427
,,-3.3840254602703237e+18,2188
,,,1382
-7.893186328889782e+18,-3.7234900178383475e+17,-3.7234900178383475e+17,848
9.53887187898708e+17,-6.644171243177244e+18,-6.644171243177244e+18,755
6.567020529061285e+17,-6.493262359384201e+18,-6.493262359384201e+18,696
3.262768471685964e+18,5.630667284520024e+18,5.630667284520024e+18,652
1.6800085751703455e+18,3.738123700033018e+17,3.738123700033018e+17,638
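
The two rows for the most frequent plate in the output above (one with both plate fields populated, one with only `plate_state_pri`) can be added to recover the total quoted in the text. A quick check, using only those two counts:

```r
full_match <- 8365     # plate_state_sec and plate_state_pri both present
partial_match <- 2188  # only plate_state_pri present

total <- full_match + partial_match
total

# Average number of times per day this plate was logged in 2018.
round(total / 365, 1)
```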


We can also try to tease apart these uses by `tag_id`, which should in theory be unique to a vehicle. Doing so surfaces the same single problematic plate, "-3384025460270323568", which has no tag recorded. 

In [60]:
bos %>%
    group_by(acct=posted_account, plate=plate_state_sec, plate_pri=plate_state_pri, tag_id) %>%
    summarize(count=n()) %>%
    arrange(desc(count)) %>%
    head(15) %>%
    execute(col_types="cccci") %>%
    mutate(plate=replace_null(plate), plate_pri=replace_null(plate_pri), tag_id=replace_null(tag_id))

acct,plate,plate_pri,tag_id,count
<chr>,<chr>,<chr>,<chr>,<int>
,-3.3840254602703237e+18,-3384025460270323568,,8365
,,-3384025460270323568,,2188
-7.893186328889782e+18,-3.7234900178383475e+17,-372349001783834727,,848
9.53887187898708e+17,-6.644171243177244e+18,-6644171243177243892,-8.496308514258509e+18,733
1.6800085751703455e+18,3.738123700033018e+17,373812370003301869,-8.010875764786662e+18,637
-7.300871342883308e+17,-8.362674334327864e+18,-8362674334327864509,3.6140413882189e+18,628
-3.129791676948063e+18,-5.562192719766815e+18,-5562192719766814090,,618
8.753495716351627e+18,-6.370314988642211e+18,-6370314988642211106,,614
8.753495716351627e+18,2.0037417519694065e+18,2003741751969406404,,572
8.753495716351627e+18,6.838011917252148e+18,6838011917252147646,,571


We can grab some sample rows from the `bos` table for this problematic plate.

In [61]:
bos %>%
    mutate(plate_state_pri = as.character(plate_state_pri)) %>%
    filter(plate_state_pri == "-3384025460270323568") %>%
    head(2) %>%
    execute(col_types="iiiccccccii")

txn_id,trip_id,veh_class,tag_id,posted_account,plate_state_pri,plate_state_sec,plate_state,lic_plate_type_code,zip_code,plus4_code
<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>
246543669,104027829,2,,,-3384025460270323568,-3384025460270323568,WA,,,
246543674,104027830,2,,,-3384025460270323568,-3384025460270323568,WA,,,


## Accounts

We see many accounts with a large number of trips, but this is less of a concern, since we would expect multiple vehicles to be associated with single accounts such as transit agencies, large commercial customers, etc.

In [56]:
bos %>%
    group_by(id=posted_account) %>%
    summarize(count=n()) %>%
    arrange(desc(count)) %>%
    head(15) %>%
    execute(col_types="ci") %>%
    mutate(id=replace_null(id))

id,count
<chr>,<int>
,1554847
-9.697450036711304e+17,106600
8.753495716351627e+18,85112
-1.778903698080338e+18,68843
-1.3850906253090803e+18,66475
-3.5074307519191567e+18,31826
-7.228328777495154e+18,31474
-3.2719697867272914e+18,19982
-6.448167887457588e+18,14791
-4.4677064854474665e+18,12481


One of these accounts has over 25,000 distinct plates associated with it, and two others have over 10,000.  Many have over 1,000 plates.

In [69]:
bos %>%
    select(acct=posted_account, plate_state_sec) %>%
    distinct() %>%
    group_by(acct) %>%
    summarize(plates=n()) %>%
    arrange(desc(plates)) %>%
    head(15) %>%
    execute(col_types="ci") 

acct,plates
<chr>,<int>
,461129
-1.778903698080338e+18,25881
8.753495716351627e+18,17767
-7.228328777495154e+18,12085
-3.5074307519191567e+18,9377
-9.697450036711304e+17,8919
-6.448167887457588e+18,7792
-1.3850906253090803e+18,2746
-2.8707556137242926e+18,2515
4.718396806487642e+18,2411


We get a better picture by looking at tags.  The largest account has about 2,300 tags.  King County Metro's fleet has 1,500 buses, so this is on the right order of magnitude.

In [77]:
bos %>%
    select(acct=posted_account, tag_id) %>%
    distinct() %>%
    group_by(acct) %>%
    summarize(tags=n()) %>%
    arrange(desc(tags)) %>%
    head(15) %>%
    execute(col_types="ci") 

acct,tags
<chr>,<int>
,16453
-2.8707556137242926e+18,2302
-9.697450036711304e+17,912
-3.5074307519191567e+18,833
8.753495716351627e+18,553
4.718396806487642e+18,405
-3.2719697867272914e+18,360
-7.228328777495154e+18,335
-1.3850906253090803e+18,311
8.771727682284265e+18,221


Looking at the number of tags per account, we see that there are about 205,000 accounts with just one tag, and 2,277 accounts with more than five tags.

In [80]:
bos %>%
    select(acct=posted_account, tag_id) %>%
    distinct() %>%
    group_by(acct) %>%
    summarize(tags=n()) %>%
    arrange(desc(tags)) %>%
    summarize(n1=sum(tags==1), n2=sum(tags==2), n3=sum(tags==3), 
              n4=sum(tags==4), n5=sum(tags==5), ng5l10=sum(tags>5 & tags<=10),
              ng10l50=sum(tags>10 & tags<=50), ng50l100=sum(tags>50 & tags <=100),
              ng100=sum(tags>100)) %>%
    execute

n1,n2,n3,n4,n5,ng5l10,ng10l50,ng50l100,ng100
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
205426,68153,20003,5377,1602,1491,742,27,17
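
The bucket counts above can be combined to check the figures quoted in the text. The sketch below copies the output row into a named vector and derives the number and share of accounts with more than five tags; the vector name `buckets` is illustrative.

```r
# Bucket counts copied from the output above.
buckets <- c(n1 = 205426, n2 = 68153, n3 = 20003, n4 = 5377, n5 = 1602,
             ng5l10 = 1491, ng10l50 = 742, ng50l100 = 27, ng100 = 17)

total_accounts <- sum(buckets)

# Accounts with more than five tags: the last four buckets.
more_than_five <- sum(buckets[c("ng5l10", "ng10l50", "ng50l100", "ng100")])
more_than_five

# Share of all accounts with more than five tags, in percent.
round(100 * more_than_five / total_accounts, 2)

# Share of accounts with exactly one tag, in percent.
round(100 * buckets[["n1"]] / total_accounts, 1)
```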


In [10]:
trips = load_table(db_path, "trips_linked", c(tag_id="c", acct="c", plate="c", id="c"))

In [16]:
comm_acct = bos %>%
    select(acct=posted_account, tag_id) %>%
    distinct() %>%
    group_by(acct) %>%
    summarize(commercial=n() > 6) 

left_join(select(trips, acct),
         comm_acct,
         by="acct") %>%
group_by(commercial) %>%
summarize(trips=n()) %>%
execute

Only 57% of trips on commercial accounts have census linkages (a non-missing `fips`), compared to 76% of trips on noncommercial accounts.

In [15]:
left_join(select(trips, acct, fips),
         comm_acct,
         by="acct") %>%
group_by(commercial, is.na(fips)) %>%
summarize(pct=n()) %>%
execute

commercial,is.na(fips),pct
<dbl>,<dbl>,<dbl>
,0,652864
,1,7807574
0.0,0,5527071
0.0,1,1727472
1.0,0,717994
1.0,1,543159
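
The linkage rates quoted above follow from the counts in this output, which are trip counts despite the `pct` column name. The sketch below recomputes them in base R. The matrix layout and names are illustrative, and treating rows with a missing `commercial` flag as trips without an account match is an assumption.

```r
# Trip counts copied from the output above.
# Rows: commercial flag; columns: whether fips is present or missing.
counts <- matrix(
    c(5527071, 1727472,
       717994,  543159),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("noncommercial", "commercial"),
                    c("linked", "unlinked"))
)
unflagged_trips <- 652864 + 7807574  # rows where commercial is missing

# Share of trips with a census linkage, by account type (percent).
linked_pct <- 100 * counts[, "linked"] / rowSums(counts)
round(linked_pct)

# Commercial share of trips on flagged accounts and of all trips (percent).
commercial_trips <- sum(counts["commercial", ])
round(100 * commercial_trips / sum(counts), 1)
round(100 * commercial_trips / (sum(counts) + unflagged_trips), 1)
```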


## Summary

Frequency-of-use patterns look relatively good, overall.  There are many accounts which used the system thousands or even tens of thousands of times in 2018.  These are likely commercial or government users.  Excluding `NULL` plates, there is just one plate/account pair which was logged in the system more than 1,000 times.  Plates logged fewer than 1,000 times a year average at most 2-3 trips on the facility per day, which does not seem too unusual.  

The single plate which was logged 10,553 times (an average of around 29 times a day) does not have any account or ZIP code information.  It is a Washington state plate with no special plate type.  As this plate occurred in the `bos` table, we know it paid a fare for each trip.  Sample transaction IDs to look up for future investigation are `246543669` and `246543674`.

Several accounts have tens of thousands of associated plates.  This is potentially concerning, as it could indicate a poor OCR plate recognition rate.  The largest account has around 2,300 Good to Go tags associated with it, which is plausible. Overall, 0.7% of accounts have more than six tags.  These accounts made up around 15% of account trips, and 7% of overall trips.  These could perhaps be classified as commercial users.