# New merges for master_pcts
* We are losing appeals cases because of an inner join.
* These appeals (child cases) aren't associated with an AIN, but we can grab the AIN from the parent case.
* But merging on PARENT_CASE rather than CASE_ID makes it an m:m merge (gross!).
* Figure out whether we can expand the merges, switch their order, and create a new master_pcts.
* Add print statements to check the length of the df at each step.
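
To make the problem concrete, here is a minimal sketch on toy data. The IDs and values are hypothetical; only the column names (`CASE_ID`, `PARNT_CASE_ID`, `PARENT_CASE`, `PROP_ID`) follow this notebook. It shows an inner join dropping an appeal, and the join on `PARENT_CASE` keeping it at the cost of repeated keys on both sides.

```python
import pandas as pd

# Toy data (hypothetical values): case 2 is an appeal whose parent is case 1.
cases = pd.DataFrame({
    "CASE_ID": [1, 2, 3],
    "PARNT_CASE_ID": [None, 1, None],
})
# Only parent cases carry parcel info; case 1 sits on two parcels.
geo = pd.DataFrame({
    "CASE_ID": [1, 1, 3],
    "PROP_ID": [10, 11, 12],
})

# An inner join on CASE_ID silently drops the appeal (case 2).
inner = pd.merge(cases, geo, on="CASE_ID", how="inner")

# A case without a parent acts as its own parent.
cases["PARENT_CASE"] = cases["PARNT_CASE_ID"].fillna(cases["CASE_ID"]).astype(int)

# Joining on PARENT_CASE keeps the appeal, but the key now repeats on both sides (m:m).
via_parent = pd.merge(
    cases,
    geo.rename(columns={"CASE_ID": "PARENT_CASE"}),
    on="PARENT_CASE",
    how="left",
)

print(sorted(inner.CASE_ID.unique()))
print(len(via_parent))
```

In this toy case the appeal inherits both of its parent's parcels, which is the row multiplication the rest of the notebook tries to keep under control.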

In [1]:
import pandas as pd
import geopandas as gpd
import intake

catalog = intake.open_catalog("../catalogs/*.yml")
bucket_name = 'city-planning-entitlements'

# Import data
cases = pd.read_parquet(f's3://{bucket_name}/data/raw/tCASE.parquet')
app = pd.read_parquet(f's3://{bucket_name}/data/raw/tAPLC.parquet')
geo_info = pd.read_parquet(f's3://{bucket_name}/data/raw/tPROP_GEO_INFO.parquet')
la_prop = pd.read_parquet(f's3://{bucket_name}/data/raw/tLA_PROP.parquet')

In [2]:
# Subset dataframes before merging
keep_col = ['CASE_ID', 'APLC_ID', 'CASE_NBR', 
                'CASE_SEQ_NBR', 'CASE_YR_NBR', 'CASE_ACTION_ID', 
                'CASE_FILE_RCV_DT', 'CASE_FILE_DATE', 'PARNT_CASE_ID']

cases1 = (cases.assign(
    # Grab the year-month from received date
    CASE_FILE_DATE = pd.to_datetime(cases['CASE_FILE_RCV_DT']).dt.to_period('M'),
)[keep_col])

app1 = app[['APLC_ID', 'PROJ_DESC_TXT']]
geo_info1 = geo_info[['CASE_ID', 'PROP_ID']].drop_duplicates()
la_prop1 = la_prop[la_prop.ASSR_PRCL_NBR.notna()][['PROP_ID', 'ASSR_PRCL_NBR']]

# Identify parent cases: a case with no PARNT_CASE_ID is its own parent
cases1['parent_is_null'] = cases1.PARNT_CASE_ID.isna()
cases1['PARENT_CASE'] = cases1.PARNT_CASE_ID.fillna(cases1.CASE_ID)

# Keep cases from 2010 onward
cases2 = cases1[cases1.CASE_FILE_DATE.dt.year >= 2010]

### First merge is between cases and geo_info
This adds the PROP_ID column to each case.
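
One way to see which cases fail to pick up a `PROP_ID` is the `indicator` argument of `pd.merge`. The sketch below uses hypothetical toy frames rather than the real tables, and it is an alternative check, not the one used in the cells that follow (those filter on `PROP_ID.isna()` instead).

```python
import pandas as pd

# Toy stand-ins for cases2 and geo_info1 (hypothetical values).
cases = pd.DataFrame({"CASE_ID": [1, 2, 3]})
geo = pd.DataFrame({"CASE_ID": [1, 1, 3], "PROP_ID": [10, 11, 12]})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join.
merged = pd.merge(cases, geo, on="CASE_ID", how="left",
                  validate="1:m", indicator=True)

# Rows that found no match in geo keep a NaN PROP_ID.
unmatched = merged[merged["_merge"] == "left_only"]

print(f"# obs after merge: {len(merged)}")
print(f"# obs without a PROP_ID: {len(unmatched)}")
```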

In [3]:
m1 = pd.merge(cases2, geo_info1, on = 'CASE_ID', how = 'left', validate = '1:m')

In [20]:
correct_joins = m1[m1.PROP_ID.notna()]
incorrect_joins = m1[m1.PROP_ID.isna()]

In [23]:
print(f"# obs when we join cases and geo_info: {len(m1)}")
print(f"# obs where PROP_ID was NaN: {len(incorrect_joins)}")
print(f"% where PROP_ID was NaN: {len(incorrect_joins) / len(m1):.2%}")

# obs when we join cases and geo_info: 557696
# obs where PROP_ID was NaN: 5781
% where PROP_ID was NaN: 1.04%


In [34]:
# Of these incorrect joins, do they share parent cases with ones that were joined?
shared_parents = incorrect_joins[incorrect_joins.PARENT_CASE.isin(correct_joins.PARENT_CASE)]
print("# unique parents that were correctly joined, but also appear in incorrect_joins")
print(f"{shared_parents.PARENT_CASE.nunique()}")
print(f"# unique parents in incorrect_joins: {incorrect_joins.PARENT_CASE.nunique()}")

# Only some of these parents also appear in correct_joins,
# so a lot of parent cases won't get joined to a PROP_ID and AIN

# unique parents that were correctly joined, but also appear in incorrect_joins
1709
# unique parents in incorrect_joins: 5471


### Second merge is between geo_info and la_prop
* To fix these incorrect joins, we would have had to allow an m:m merge.
* So, let's see if we can circumvent PROP_ID by mapping it to AIN directly.
* But first, we force it to be a 1:m merge by keeping only unique PROP_IDs.
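
As a small illustration of why the de-duplication matters, the sketch below shows `validate="1:m"` rejecting a left table with repeated keys and accepting it once duplicates are dropped. The frames and AIN strings are made up for the example.

```python
import pandas as pd

# Toy frames (hypothetical values): PROP_ID 10 repeats on the left side.
left = pd.DataFrame({"PROP_ID": [10, 10, 11]})
right = pd.DataFrame({"PROP_ID": [10, 11, 11], "AIN": ["A", "B", "C"]})

# validate="1:m" requires unique keys on the left, so this raises MergeError.
try:
    pd.merge(left, right, on="PROP_ID", how="left", validate="1:m")
    raised = False
except pd.errors.MergeError:
    raised = True

# After dropping duplicate keys, the same merge passes validation.
deduped = left.drop_duplicates()
ok = pd.merge(deduped, right, on="PROP_ID", how="left", validate="1:m")

print(raised)
print(len(ok))
```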

In [10]:
# Do a second merge for PROP_ID and AINs
m2 = (pd.merge(geo_info1[["PROP_ID"]].drop_duplicates(), 
               la_prop1, 
               on = "PROP_ID", how = "left", validate = "1:m")
      .rename(columns = {"ASSR_PRCL_NBR": "AIN"})
     )

In [11]:
m2.head(2)

|   | PROP_ID | AIN |
|---|---|---|
| 0 | 34237.0 | 5160003902 |
| 1 | 34306.0 | 5076007033 |


### Third merge is to create a CASE_ID to AIN crosswalk

* Combine the results of the previous 2 merges. Use PROP_ID to join cases with la_prop.
* But first, bring in our crosswalk_parcels_tracts to make sure we only keep parcels that appear in it.
* Then, combine case info with AIN and get rid of all the duplicates.
* Once AIN info is merged onto all our cases, we can avoid using PROP_ID and the m:m merge.
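
The steps above can be sketched end to end on toy data. All values below are hypothetical, and `parents`, `prop_to_ain`, and `known_parcels` are illustrative stand-ins for `fix_me`, `m2`, and `crosswalk_parcels_tracts`; the cells that follow do the same thing on the real tables.

```python
import pandas as pd

# Hypothetical toy inputs that mirror the notebook's column names.
parents = pd.DataFrame({"PARENT_CASE": [1, 3]})
geo = pd.DataFrame({"CASE_ID": [1, 1, 3], "PROP_ID": [10, 11, 12]})
prop_to_ain = pd.DataFrame({"PROP_ID": [10, 11, 12],
                            "AIN": ["A1", "A1", "A9"]})
known_parcels = pd.DataFrame({"AIN": ["A1", "A2"]})

# Keep only AINs that appear in the parcel crosswalk.
ain_side = prop_to_ain[prop_to_ain.AIN.isin(known_parcels.AIN)]

# Parent case -> PROP_ID (one parent can map to many properties).
parent_prop = (pd.merge(parents, geo,
                        left_on="PARENT_CASE", right_on="CASE_ID",
                        how="inner", validate="1:m")
               .drop(columns="CASE_ID"))

# PROP_ID -> AIN, then drop PROP_ID and collapse duplicate pairs.
parent_ain = (pd.merge(parent_prop, ain_side,
                       on="PROP_ID", how="inner", validate="m:1")
              [["PARENT_CASE", "AIN"]]
              .drop_duplicates()
              .reset_index(drop=True))

print(parent_ain)
```

In this toy example, parent case 1 has two properties on the same parcel, so it collapses to a single row, and parent case 3 drops out because its AIN is not among the known parcels.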

In [16]:
crosswalk_parcels_tracts = (pd.read_parquet(
    "s3://city-planning-entitlements/data/crosswalk_parcels_tracts_lacity.parquet")
    [["uuid", "AIN"]]
    )

In [17]:
# Keep only AINs that appear in the parcel crosswalk
ain_side = m2[m2.AIN.isin(crosswalk_parcels_tracts.AIN)]

In [18]:
ain_side.head()

|   | PROP_ID | AIN |
|---|---|---|
| 0 | 34237.0 | 5160003902 |
| 1 | 34306.0 | 5076007033 |
| 2 | 34323.0 | 5407005016 |
| 3 | 34169.0 | 5407002023 |
| 4 | 33937.0 | 5124001012 |


In [66]:
# Unique parent cases among the rows that did not get a PROP_ID
fix_me = incorrect_joins[["PARENT_CASE"]].drop_duplicates()

In [75]:
# Look up PROP_IDs through the parent case instead of the case itself
fix_me_propid = (pd.merge(fix_me, geo_info1, 
                          left_on = "PARENT_CASE", right_on = "CASE_ID", 
                          how = "inner", validate = "1:m")
                 # We joined on PARENT_CASE, so geo_info1's CASE_ID is redundant; drop it
                 .drop(columns = "CASE_ID")
                )

In [76]:
fix_me_propid.head()

|   | PARENT_CASE | PROP_ID |
|---|---|---|
| 0 | 194352.0 | 59141075.0 |
| 1 | 194814.0 | 59141527.0 |
| 2 | 194814.0 | 59141526.0 |
| 3 | 189513.0 | 59124694.0 |
| 4 | 206223.0 | 59327730.0 |


In [81]:
# Map each PROP_ID to its AIN, then keep unique PARENT_CASE-AIN pairs
fix_me_ain = (pd.merge(fix_me_propid, ain_side, 
                       on = "PROP_ID", how = "inner", validate = "m:1")
              [["PARENT_CASE", "AIN"]]
              .drop_duplicates()
              .reset_index(drop=True)
             )

In [82]:
fix_me_ain

|   | PARENT_CASE | AIN |
|---|---|---|
| 0 | 194352.0 | 2378025032 |
| 1 | 194814.0 | 2425005046 |
| 2 | 189513.0 | 4259024006 |
| 3 | 206223.0 | 7410014001 |
| 4 | 189127.0 | 2524022006 |
| ... | ... | ... |
| 3717 | 174147.0 | 5547004005 |
| 3718 | 178277.0 | 5503014020 |
| 3719 | 235112.0 | 2209031008 |
| 3720 | 230680.0 | 4261014013 |


### Fourth merge is to add on project description