# New merges for master_pcts
* Losing appeals cases because of an inner join
* These child appeal cases aren't associated with an AIN, but we can grab it from the parent case
* But, merging using PARENT_CASE rather than CASE_ID means it's a m:m merge (gross!)
* Figure out if we can expand the merges, switch the order, and create new master_pcts
* Add print statements to check the length of the df
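
To make the problem concrete, here is a minimal sketch on made-up data. The IDs and the `toy_` names are hypothetical, not real PCTS records. An inner join on CASE_ID drops the child appeal case, while a join on PARENT_CASE lets the child pick up its parent's property. In this toy each parent has one property, so the second merge is only m:1; in the real tables it would be m:m.

```python
import pandas as pd

nan = float("nan")

# Toy data (hypothetical): case 2 is an appeal (child) of case 1
# and has no geo_info row of its own
toy_cases = pd.DataFrame({
    "CASE_ID": [1, 2, 3],
    "PARNT_CASE_ID": [nan, 1, nan],
})
toy_geo = pd.DataFrame({"CASE_ID": [1, 3], "PROP_ID": [100, 300]})

# Inner join on CASE_ID silently drops the child case
inner = pd.merge(toy_cases, toy_geo, on="CASE_ID", how="inner")
print(f"# obs in cases: {len(toy_cases)}, after inner join: {len(inner)}")

# Use the parent's CASE_ID where one exists, else the case's own CASE_ID
toy_cases["PARENT_CASE"] = (
    toy_cases.PARNT_CASE_ID.fillna(toy_cases.CASE_ID).astype(int)
)

# Join on PARENT_CASE so the child inherits the parent's PROP_ID
via_parent = pd.merge(
    toy_cases,
    toy_geo.rename(columns={"CASE_ID": "PARENT_CASE"}),
    on="PARENT_CASE", how="left",
)
print(f"# obs after joining on PARENT_CASE: {len(via_parent)}")
```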

In [None]:
import pandas as pd
import geopandas as gpd
import intake

catalog = intake.open_catalog("../catalogs/*.yml")
bucket_name = 'city-planning-entitlements'

# Import data
cases = pd.read_parquet(f's3://{bucket_name}/data/raw/tCASE.parquet')
app = pd.read_parquet(f's3://{bucket_name}/data/raw/tAPLC.parquet')
geo_info = pd.read_parquet(f's3://{bucket_name}/data/raw/tPROP_GEO_INFO.parquet')
la_prop = pd.read_parquet(f's3://{bucket_name}/data/raw/tLA_PROP.parquet')

In [None]:
# Subset dataframes before merging
keep_col = ['CASE_ID', 'APLC_ID', 'CASE_NBR', 
                'CASE_SEQ_NBR', 'CASE_YR_NBR', 'CASE_ACTION_ID', 
                'CASE_FILE_RCV_DT', 'CASE_FILE_DATE', 'PARNT_CASE_ID']

cases1 = (cases.assign(
    # Grab the year-month from received date
    CASE_FILE_DATE = pd.to_datetime(cases['CASE_FILE_RCV_DT']).dt.to_period('M'),
)[keep_col])

app1 = app[['APLC_ID', 'PROJ_DESC_TXT']]
geo_info1 = geo_info[['CASE_ID', 'PROP_ID']].drop_duplicates()
la_prop1 = la_prop[la_prop.ASSR_PRCL_NBR.notna()][['PROP_ID', 'ASSR_PRCL_NBR']]

# Identify parent cases: a case with no PARNT_CASE_ID is its own parent
cases1['parent_is_null'] = cases1.PARNT_CASE_ID.isna()
cases1['PARENT_CASE'] = cases1.PARNT_CASE_ID.fillna(cases1.CASE_ID)

# Keep cases from 2010 onward
cases2 = cases1[cases1.CASE_FILE_DATE.dt.year >= 2010]

### First merge is between cases and geo_info
We add on PROP_ID column.
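
One way to see which rows found a match is the `indicator` argument of `pd.merge`, which adds a `_merge` column to the result. This is a minimal sketch on hypothetical toy frames, not the actual `cases2` and `geo_info1`.

```python
import pandas as pd

# Toy data (hypothetical): case 1 has two properties, case 2 has none
toy_cases = pd.DataFrame({"CASE_ID": [1, 2, 3]})
toy_geo = pd.DataFrame({"CASE_ID": [1, 1, 3], "PROP_ID": [100, 101, 300]})

toy_m1 = pd.merge(toy_cases, toy_geo, on="CASE_ID", how="left",
                  validate="1:m", indicator=True)

# "both" = found a PROP_ID, "left_only" = no geo_info row for this case
match_counts = toy_m1["_merge"].value_counts()
print(match_counts)
```

Counting `left_only` rows gives the same number as checking for a NaN PROP_ID, as long as the right-hand table has no missing PROP_ID values of its own.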

In [None]:
m1 = pd.merge(cases2, geo_info1, on = 'CASE_ID', how = 'left', validate = '1:m')

In [None]:
correct_joins = m1[m1.PROP_ID.notna()]
incorrect_joins = m1[m1.PROP_ID.isna()]

In [None]:
print(f"# obs when we join cases and geo_info: {len(m1)}")
print(f"# obs where PROP_ID was NaN: {len(incorrect_joins)}")
print(f"% where PROP_ID was NaN: {len(incorrect_joins) / len(m1):.1%}")

In [None]:
# Of these incorrect joins, do they share parent cases with ones that were joined?
print("# unique parents that were correctly joined, but also appear in incorrect_joins")
print(f"{incorrect_joins[incorrect_joins.PARENT_CASE.isin(correct_joins.PARENT_CASE)].PARENT_CASE.nunique()}")
print(f"# unique parents in incorrect_joins: {incorrect_joins.PARENT_CASE.nunique()}")

# This shows a lot of parent cases won't get joined to a PROP_ID and AIN

This is a big rabbit hole, and none of the paths through it gets rid of the m:m merge. The PARENT_CASEs in incorrect_joins fall into three groups:
(1) PARENT_CASEs where some CASE_IDs merged with a PROP_ID and others did not. These PARENT_CASEs have some obs in correct_joins and some in incorrect_joins. Fixing them involves a m:m merge.
(2) PARENT_CASEs that fall completely within incorrect_joins, but for which we can get PROP_IDs by joining on PARENT_CASE. This also involves a m:m merge.
(3) PARENT_CASEs that are only in incorrect_joins, and for which we cannot get a PROP_ID at the end of all this.
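
The three groups above can be separated with set operations. This is a minimal sketch on made-up data. The `toy_` frames and the IDs are hypothetical, and parent case 30 stands in for a parent that is missing from the cases table itself (for example, filed before 2010) but present in geo_info.

```python
import pandas as pd

nan = float("nan")

# Toy stand-in for m1 (hypothetical values)
toy_m1 = pd.DataFrame({
    "CASE_ID":     [1, 2, 3, 4, 5, 6],
    "PARENT_CASE": [1, 1, 30, 30, 5, 5],
    "PROP_ID":     [100, nan, nan, nan, nan, nan],
})
# Toy stand-in for geo_info1: parent 30 has a property, parent 5 does not
toy_geo = pd.DataFrame({"CASE_ID": [1, 30], "PROP_ID": [100, 300]})

has_prop = toy_m1.PROP_ID.notna()
parents_correct = set(toy_m1.loc[has_prop, "PARENT_CASE"])
parents_incorrect = set(toy_m1.loc[~has_prop, "PARENT_CASE"])
geo_cases = set(toy_geo.CASE_ID)

# (1) parents with obs on both sides
group1 = parents_incorrect & parents_correct
# Parents that only show up among the unmatched rows
only_incorrect = parents_incorrect - parents_correct
# (2) recoverable: the parent itself has a PROP_ID in geo_info
group2 = only_incorrect & geo_cases
# (3) not recoverable: no PROP_ID for the parent either
group3 = only_incorrect - geo_cases

print(f"(1) mixed: {group1}, (2) recoverable: {group2}, (3) lost: {group3}")
```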

In [None]:
appear_in_correct = (incorrect_joins[incorrect_joins.PARENT_CASE.isin(correct_joins.PARENT_CASE)]
                     [["PARENT_CASE"]]
                     .drop_duplicates()
                    )

In [None]:
fix_me = m1[m1.PARENT_CASE.isin(appear_in_correct.PARENT_CASE)]

In [None]:
fix_me_crosswalk = fix_me[["PARENT_CASE", "PROP_ID"]].drop_duplicates()

In [None]:
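# Attach every PROP_ID known for the parent to each of its cases (m:m merge).
# PROP_ID exists on both sides, so pandas suffixes the columns as PROP_ID_x / PROP_ID_y.
pd.merge(fix_me, fix_me_crosswalk, on = "PARENT_CASE", how = "left", validate = "m:m")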

### Second merge is between geo_info and la_prop
* To fix these incorrect joins, we would have to have allowed a m:m merge.
* So, let's see if we can bypass PROP_ID by mapping each case to its AIN directly.
* But first, we force it to be a 1:m merge, and only keep unique PROP_IDs.
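
As a reminder of why the m:m merge is worth avoiding, here is a minimal sketch on hypothetical toy frames: when the key repeats on both sides, every left row pairs with every right row, and a stricter `validate` raises `pd.errors.MergeError` instead of expanding silently.

```python
import pandas as pd

# Toy data (hypothetical): parent case 1 has 2 case rows and 2 properties
toy_left = pd.DataFrame({"PARENT_CASE": [1, 1], "CASE_ID": [1, 2]})
toy_right = pd.DataFrame({"PARENT_CASE": [1, 1], "PROP_ID": [100, 101]})

# m:m merge: each left row pairs with each right row sharing the key (2 x 2 = 4)
toy_mm = pd.merge(toy_left, toy_right, on="PARENT_CASE",
                  how="left", validate="m:m")
print(f"# obs left: {len(toy_left)}, after m:m merge: {len(toy_mm)}")

# A stricter validate flags the duplicated keys on the right side
try:
    pd.merge(toy_left, toy_right, on="PARENT_CASE",
             how="left", validate="m:1")
    raised = False
except pd.errors.MergeError:
    raised = True
print(f"m:1 validation raised MergeError: {raised}")
```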

In [None]:
# Do a second merge for PROP_ID and AINs
m2 = (pd.merge(geo_info1[["PROP_ID"]].drop_duplicates(), 
               la_prop1, 
               on = "PROP_ID", how = "left", validate = "1:m")
      .rename(columns = {"ASSR_PRCL_NBR": "AIN"})
     )

In [None]:
m2.head(2)

### Third merge is to create a CASE_ID to AIN crosswalk

* Combine the results of the previous 2 merges. Use PROP_ID to join cases with la_prop.
* But first, bring in our crosswalk_parcels_tracts to make sure we only keep parcels we have.
* Then, combine case info with AIN and get rid of all the duplicates.
* When we merge AIN info onto all our cases, we can then avoid using PROP_ID and the m:m merge.
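
The steps above can be sketched on toy data. All `toy_` names and values are hypothetical, and the sketch assumes each PROP_ID maps to a single AIN so that an m:1 merge is valid.

```python
import pandas as pd

# Toy data (hypothetical): two PROP_IDs share AIN "A"; AIN "B" is not in our parcels
toy_case_prop = pd.DataFrame({"CASE_ID": [1, 1, 2], "PROP_ID": [100, 101, 101]})
toy_prop_ain = pd.DataFrame({"PROP_ID": [100, 101, 102], "AIN": ["A", "A", "B"]})
toy_parcels = pd.DataFrame({"AIN": ["A"]})

# Keep only AINs present in the parcel crosswalk
toy_ain_side = toy_prop_ain[toy_prop_ain.AIN.isin(toy_parcels.AIN)]

# Attach AIN via PROP_ID, then drop PROP_ID and collapse duplicate CASE_ID-AIN pairs
toy_case_ain = (
    pd.merge(toy_case_prop, toy_ain_side, on="PROP_ID",
             how="inner", validate="m:1")
    .drop(columns="PROP_ID")
    .drop_duplicates()
    .reset_index(drop=True)
)
print(f"# obs in CASE_ID-AIN crosswalk: {len(toy_case_ain)}")
```

Once PROP_ID is dropped, case 1 collapses from two rows to one because both of its properties sit on the same parcel.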

In [None]:
crosswalk_parcels_tracts = (pd.read_parquet(
    f"s3://{bucket_name}/data/crosswalk_parcels_tracts_lacity.parquet")
    [["uuid", "AIN"]]
    )

In [None]:
ain_side = m2[m2.AIN.isin(crosswalk_parcels_tracts.AIN)]

In [None]:
ain_side.head()

In [None]:
incorrect_joins_with_propid = pd.merge(incorrect_joins.drop(columns = ["PROP_ID"]), 
                                       geo_info1.rename(columns = {"CASE_ID": "PARENT_CASE"}), 
                                       on = "PARENT_CASE", how = "inner", validate = "m:m")

In [None]:
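# validate = "m:1" raises a MergeError if any PROP_ID appears more than once
# in ain_side (m2 was built as a 1:m merge, so this is possible)
incorrect_joins_with_ain = pd.merge(incorrect_joins_with_propid, ain_side,
                                    on = "PROP_ID", how = "inner", validate = "m:1"
                                   )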

In [None]:
print(f"# unique parents in incorrect_joins: {incorrect_joins.PARENT_CASE.nunique()}")

In [None]:
print(f"# unique parents recovered with an AIN: {incorrect_joins_with_ain.PARENT_CASE.nunique()}")

### Fourth merge is to add on project description