# goal

demo the OEMC dataset ahead of analysis

- note that the way I format districts may not be the best approach. Currently, I:\
    (1) preserve the original field,\
    (2) add a numeric version that strips leading zeros and casts to int,\
    (3) add a non-numeric version that preserves values with a potentially meaningful non-numeric pattern, e.g. "CW3".\
  The idea was to use the numeric version because it matches the district numbering in the open data portal, but that might not be the right call.
- note also that `kwfields.yml` is not an exhaustive grouping of the `init_type` and `fin_type` values in the data and could be expanded
- I don't know what "EL CHECK" is
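
One way to see how far the keyword groupings are from exhaustive is to count the type values that no group covers. This is only a sketch: it assumes the YAML loads as a mapping of group name to a list of exact type strings, which may not match the real file, and `uncovered_types` plus the sample rules and values are made up for illustration.

```python
import pandas as pd


def uncovered_types(types: pd.Series, rules: dict) -> pd.Series:
    """Count type values that do not appear in any group of `rules`.

    ASSUMPTION: `rules` maps a group name to a list of exact type strings.
    """
    covered = {t for members in rules.values() for t in members}
    counts = types.value_counts()  # missing values are dropped by default
    return counts[~counts.index.isin(covered)]


# made-up rules and values, for illustration only
sample_rules = {"gun": ["SHOTS FIRED"], "domestic": ["DOMESTIC"]}
sample_types = pd.Series(["SHOTS FIRED", "SHOTS FIRED", "EL CHECK", "DOMESTIC", None])
print(uncovered_types(sample_types, sample_rules))
```

Running the same check on `oemc.init_type` and `oemc.fin_type` would give a ranked list of values to consider adding to the groupings.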

# setup

In [1]:
# dependencies
import yaml
import re
import numpy as np
import pandas as pd

In [2]:
# support methods
def readyaml(fname):
    with open(fname, 'r') as f:
        data = yaml.safe_load(f)
    return data


def group_timedelta(td):
    if pd.isna(td): return 'No dispatch reported'
    elif td < pd.Timedelta(0): return 'Dispatch before call'
    elif td < pd.Timedelta(minutes=5): return 'Dispatch under 5 minutes'
    elif td < pd.Timedelta(minutes=15): return 'Dispatch under 15 minutes'
    elif td < pd.Timedelta(minutes=30): return 'Dispatch under 30 minutes'
    elif td < pd.Timedelta(minutes=60): return 'Dispatch under 1 hour'
    elif td < pd.Timedelta(minutes=120): return 'Dispatch under 2 hours'
    elif td < pd.Timedelta(minutes=360): return 'Dispatch under 6 hours'
    elif td < pd.Timedelta(days=.5): return 'Dispatch under 12 hours'
    elif td < pd.Timedelta(days=1): return 'Dispatch under 24 hours'
    elif td < pd.Timedelta(days=2): return 'Dispatch under 48 hours'
    return 'Dispatch 48 hours or later'
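
If `group_timedelta` is applied row by row, it may be slow on 12M+ records. A vectorized alternative using `pd.cut` might look like the sketch below. `group_timedelta_vectorized` is a hypothetical name, and the bins are meant to mirror the thresholds above (left-closed, so a value exactly on a threshold falls into the next group, as with the `<` comparisons).

```python
import numpy as np
import pandas as pd


def group_timedelta_vectorized(td: pd.Series) -> pd.Series:
    """Bin a timedelta Series into the same labels as group_timedelta."""
    minutes = td.dt.total_seconds() / 60  # NaT becomes NaN
    edges = [-np.inf, 0, 5, 15, 30, 60, 120, 360, 720, 1440, 2880, np.inf]
    labels = [
        'Dispatch before call',
        'Dispatch under 5 minutes',
        'Dispatch under 15 minutes',
        'Dispatch under 30 minutes',
        'Dispatch under 1 hour',
        'Dispatch under 2 hours',
        'Dispatch under 6 hours',
        'Dispatch under 12 hours',
        'Dispatch under 24 hours',
        'Dispatch under 48 hours',
        'Dispatch 48 hours or later',
    ]
    # right=False makes each bin [low, high), matching the `<` checks
    grouped = pd.cut(minutes, bins=edges, labels=labels, right=False)
    # missing dispatch times get their own category
    missing = 'No dispatch reported'
    return grouped.cat.add_categories(missing).fillna(missing)


# small made-up example, in minutes
example = pd.Series(pd.to_timedelta([-1, 0, 4, 5, 59, 720, 4320, np.nan], unit='min'))
print(group_timedelta_vectorized(example).tolist())
```

The result is categorical, which also keeps `value_counts()` output in threshold order if `sort=False` is passed.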

In [4]:
# main
oemc = pd.read_parquet("../../data/OEMC_MP/export/output/oemc-prepped.parquet")
kwrules = readyaml("../../data/shared/hand/keywords.yml")
assert oemc.shape[0] > 12000000
assert not oemc.fin_type.isna().any()

# prep for analysis

- I thought this was already done in the version in `Chi-MP-data-story`, but it looks like I added these fields later; the version in the public repo is only lightly processed, not prepped for analysis.

**NOTE:** I'm not sure whether the numeric district approach is right or needs some tweaking. I'm open to feedback, and this should be reviewed before (or when) using the field.
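
For reviewers, here is a rough sketch of how the three district fields could be derived. It is not the prep code itself: `split_district` and the `non_numeric_district` column name are hypothetical, and it assumes that "numeric" means digits only and that every other non-missing value is worth preserving.

```python
import pandas as pd


def split_district(raw: pd.Series) -> pd.DataFrame:
    """Return the original district plus numeric and non-numeric versions."""
    cleaned = raw.str.strip()
    # ASSUMPTION: a district is numeric only if it is made up entirely of digits
    is_numeric = cleaned.str.fullmatch(r"\d+", na=False)
    # "003" -> 3; nullable Int64 keeps missing values without falling back to floats
    numeric = pd.to_numeric(cleaned.where(is_numeric), errors="coerce").astype("Int64")
    # ASSUMPTION: any remaining non-missing value (e.g. "CW3") is kept as-is
    non_numeric = cleaned.where(~is_numeric)
    return pd.DataFrame({
        "district": raw,
        "numeric_district": numeric,
        "non_numeric_district": non_numeric,
    })


# made-up values, for illustration only
demo = split_district(pd.Series(["003", "12", "CW3", None, " 025 "]))
print(demo)
```

A quick check on the real data would be to confirm that every value of `numeric_district` falls within the district numbers listed in the open data portal.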

In [None]:
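# label only missing init_type values; other missing fields stay visible as NaN
oemc.loc[oemc.event_type == 'gun', ['event_group', 'event_type', 'init_type', 'fin_type']
].fillna({'init_type': "NO INITIAL TYPE"}).value_counts(dropna=False).head(50)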

In [None]:
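# label only missing init_type values; other missing fields stay visible as NaN
oemc[['event_group', 'event_type', 'init_type', 'fin_type']
].fillna({'init_type': "NO INITIAL TYPE"}).value_counts(dropna=False).head(50)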

In [None]:
oemc.init_type.isna().sum()

In [None]:
oemc.loc[oemc.init_type.isna(), 'fin_type'].value_counts()

# preview data

In [None]:
qa = f"Q:\tHow many records are in the OEMC dispatch data?\nA:\t{oemc.shape[0]:,} records"
print(qa)

In [None]:
oemc.sample().T

# review of data

### all fields

In [None]:
oemc.info()

In [None]:
for col in oemc.columns:
    print(f"Column name:\t`{col}`")
    print(f"N non-missing:\t{oemc[col].notna().sum():,}")
    print(f"N unique:\t{oemc[col].nunique(dropna=False):,}")
    print()

### original

- not inclusive of every single original field
- might have light processing or formatting applied

In [None]:
oemc.call_date.describe()

In [None]:
oemc.district.value_counts().head(20)

In [None]:
oemc[['init_type', 'fin_type']].value_counts().head(10)

In [None]:
oemc[['init_priority', 'init_type']].value_counts().head(10)

### added for analysis

- might have light processing or formatting applied

In [None]:
oemc.event_group.value_counts()

In [None]:
oemc.loc[oemc.init_type.str.contains("GENERIC", na=False), ['init_type', 'fin_type']].value_counts()

In [None]:
oemc.event_type.value_counts()

In [None]:
oemc.year_called.value_counts().sort_index()

In [None]:
oemc.time_to_dispatch.describe()

In [None]:
oemc.ttd_group.value_counts()

In [None]:
oemc.numeric_district.value_counts()

# topics

