# UrbanSim Buildings Processing

### TODO

Once AWS access is restored:

- Re-run Update_urbansim_buildings_county.ipynb
    - Use SELECT DISTINCT to drop duplicates in the urbansim_buildings tables
    
### Verify

- Each APN should map to exactly one joinid
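
A minimal sketch of how this check could be run once the data is available. It assumes the columns are named `apn` and `joinid`; the helper name and the toy data are made up for illustration and are not part of the original notebook.

```python
import pandas as pd


def apns_with_multiple_joinids(df, apn_col="apn", joinid_col="joinid"):
    """Return the distinct-joinid count for every APN that has more than one."""
    joinids_per_apn = df.groupby(apn_col)[joinid_col].nunique()
    return joinids_per_apn[joinids_per_apn > 1]


# Toy data (made up): APN "B" is matched to two different joinids.
toy = pd.DataFrame({
    "apn": ["A", "A", "B", "B", "C"],
    "joinid": [1, 1, 2, 3, 4],
})
violations = apns_with_multiple_joinids(toy)
print(violations)
```

An empty result would mean the joinid:apn relationship holds for the whole table.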

In [None]:
import os
import sys
import pandas as pd


user = os.environ['USER']
sys.path.insert(0, '/Users/{}/Box/DataViz Projects/Utility Code'.format(user))
# utils_io lives in the shared Utility Code folder added to sys.path above
from utils_io import pull_df_from_socrata

## 1. Get data from Socrata

In [None]:
# buildings_2018
socrata_data_id = 'ahwz-jtst'
df = pull_df_from_socrata(socrata_data_id)

## 2. Inspect duplicates

In [None]:
# lots of duplicates
print(df.shape)  # (3655207, 15)
df.drop_duplicates(inplace=True)
print(df.shape)  # (3120776, 15)
# Duplicate APNs (makes sense since we're using buildings)
print(df['apn'].nunique())  # 2643041

## 3. Subset to data with values for EDA

In [None]:
county_cols = ['assessed_building_value', 'assessed_date', 'building_id',
               'building_sqft', 'building_type', 
               'jurisdiction_cty', 'last_sale_date', 'non_residential_sqft',
               'residential_units', 'tenure', 'year_built']

In [None]:
# Flag rows where every county-sourced column is null, then keep the rest
df['missing_cty_input'] = df[county_cols].isnull().all(axis=1)
cty_data = df[~df['missing_cty_input']].copy()

## 4. EDA

### Check duplicates

Some buildings appear to have duplicate records, where the duplicate record has no building_id.

Solution: drop the records with a null building_id.

In [None]:
cty_data.head(10)

Drop null building ids

In [None]:
# dropna() does not work here because the missing value is the string 'nan', not NaN
# cty_data.dropna(subset=['building_id'], inplace=True)

# Filter on the string instead
cty_data = cty_data[cty_data['building_id'] != 'nan']
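
As an alternative to filtering on the string, the sentinel could be converted to a real missing value so that `dropna` behaves as expected. The sketch below uses made-up toy data and assumes the sentinel is exactly the string `'nan'`.

```python
import pandas as pd

# Toy frame (made up): the string 'nan' stands in for a missing building_id.
toy = pd.DataFrame({
    "building_id": ["101", "nan", "102", "nan"],
    "apn": ["A", "A", "B", "C"],
})

# dropna() does not treat the string 'nan' as missing, so no rows are removed.
unchanged = toy.dropna(subset=["building_id"])

# mask() replaces values with NaN where the condition is True; dropna() then works.
cleaned = (
    toy.assign(building_id=toy["building_id"].mask(toy["building_id"] == "nan"))
       .dropna(subset=["building_id"])
)
print(len(unchanged), len(cleaned))
```

Converting to real NaN also makes later `isnull()` checks on the column reliable.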

In [None]:
cty_data.head(10)

Reassess duplicates

In [None]:
print(cty_data.shape)  # (476300, 16)
# still duplicate building ids
print(cty_data['building_id'].nunique())  # 475110

In [None]:
# Target number of buildings: unique building ids in Santa Clara County (CA085)
target_num_buildings = cty_data[cty_data['fipco'] == 'CA085']['building_id'].nunique()
target_num_buildings

The duplicates appear to come from the merge (on APN) of the county data with the Parcels 2018 data.

Parcels 2018 appears to contain APNs for Marin (fipco CA041) and San Francisco (fipco CA075) that match Santa Clara APNs, so a Santa Clara building can pick up rows from those counties as well.
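
A small sketch of how APNs shared across counties could be detected. It assumes the `apn` and `fipco` columns used elsewhere in this notebook; the toy values are made up.

```python
import pandas as pd

# Toy data (made up): APN "100" appears in Santa Clara (CA085) and Marin (CA041).
toy = pd.DataFrame({
    "apn": ["100", "100", "200", "300"],
    "fipco": ["CA085", "CA041", "CA085", "CA075"],
})

# Count distinct counties per APN; more than one means the APN is ambiguous.
counties_per_apn = toy.groupby("apn")["fipco"].nunique()
shared_apns = counties_per_apn[counties_per_apn > 1].index.tolist()
print(shared_apns)

# Restricting to a single county removes the cross-county matches.
sc_only = toy[toy["fipco"] == "CA085"]
```

Merging on both APN and county, rather than APN alone, would be another way to avoid these matches, provided both tables carry a county identifier.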

In [None]:
vcs = cty_data.groupby('building_id').size().sort_values(ascending=False)
dup_data = cty_data[cty_data['building_id'].isin(vcs[vcs > 1].index)]
dup_data.head()

Within Santa Clara, the remaining duplicates come from the same merge, but here the cause is a many-to-one joinid:apn relationship: one APN matches more than one joinid.

This is only the case for one APN: 13241104.

Solution: select one of the joinids as the correct one (verify with Parcels 2018 which one ought to be chosen).
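
Once the correct joinid is known, the selection could be applied with a lookup, as in the sketch below. The `chosen_joinid` dictionary and the joinid values are hypothetical placeholders; the real choice still has to be verified against Parcels 2018.

```python
import pandas as pd

# Toy data (made up): one building on APN "13241104" matched two joinids.
toy = pd.DataFrame({
    "building_id": ["b1", "b1", "b2"],
    "apn": ["13241104", "13241104", "555"],
    "joinid": ["J1", "J2", "J9"],
})

# Hypothetical lookup: the joinid chosen for each ambiguous APN after review.
chosen_joinid = {"13241104": "J1"}

# Keep a row if its APN is not in the lookup, or if its joinid is the chosen one.
expected = toy["apn"].map(chosen_joinid)
resolved = toy[expected.isna() | (toy["joinid"] == expected)]
print(resolved)
```

Unlike taking the first duplicate row, this keeps the choice explicit and reproducible.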

In [None]:
print(dup_data['fipco'].value_counts())
sc_dups = dup_data[dup_data['fipco'] == 'CA085']

vcs1 = sc_dups.groupby('building_id').size().sort_values(ascending=False)
dup_sc = sc_dups[sc_dups['building_id'].isin(vcs1[vcs1 > 1].index)]
print(dup_sc['apn'].unique())
dup_sc

### Create cleaned version of data

In [None]:
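# Buildings whose building_id appears exactly once across all counties
non_dup = cty_data[cty_data['building_id'].isin(vcs[vcs == 1].index)]
# Santa Clara rows whose building_id was duplicated only by other counties
sc_nondup = sc_dups[sc_dups['building_id'].isin(vcs1[vcs1 == 1].index)]

# dup_sc.iloc[[0]]: first row of the building still duplicated within Santa Clara (joinid choice unverified)
buildings = pd.concat([non_dup, sc_nondup, dup_sc.iloc[[0]]])
buildings = buildings[buildings['fipco'] == 'CA085']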

In [None]:
# building_id should now be unique within Santa Clara
assert buildings['building_id'].is_unique
assert len(buildings) == target_num_buildings

### Check match with Parcels 2018 data

In [None]:
# without match in Parcels 2018: AL, MI, LI, RE, CO, SM, MO, WA, HO

d = {'SJ': 'San Jose',
     'SC': 'Santa Clara',
     'SU': 'Sunnyvale',
     'PA': 'Palo Alto',
     'MV': 'Mountain View',
     'ST': None,
     'LA': 'Los Altos',
     'AL': None,
     'LH': 'Los Altos Hills',
     'PV': 'Portola Valley',
     'MI': None,
     'CA': 'Campbell',
     'CU': 'Cupertino',
     'LG': 'Los Gatos',
     'SA': 'Saratoga',
     'MS': 'Monte Sereno',
     'LI': None,
     'MH': 'Morgan Hill',
     'RE': None,
     'GI': 'Gilroy',
     'CO': None,
     'SM': None,
     'MO': None,
     'WA': None,
     'HO': None}

In [None]:
# jurisdict values that do not appear among the names in d
missing_jurisdict = list(set(buildings['jurisdict']) - set(d.values()))
missing_jurisdict

In [None]:
for j in missing_jurisdict:
    print(j, '\n')
    print(buildings[buildings['jurisdict'] == j]['jurisdiction_cty'].value_counts())
    print('\n')

In [None]:
buildings['juris_compare'] = buildings['jurisdiction_cty'] + ' ' + buildings['jurisdict']
buildings['juris_compare'].value_counts()

In [None]:
for k, v in d.items():
    print('\n', v)
    print(buildings[buildings['jurisdiction_cty'] == k]['jurisdict'].value_counts(dropna=False))
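
The per-code printout above can be condensed into a single mismatch flag. This is a sketch on made-up toy data, using a small excerpt of the lookup; it assumes `jurisdiction_cty` holds the county's two-letter code and `jurisdict` holds the jurisdiction name it is being compared against.

```python
import pandas as pd

# Small excerpt of the code-to-name lookup, for illustration only.
code_to_name = {"SJ": "San Jose", "PA": "Palo Alto", "AL": None}

# Toy data (made up): row 2 disagrees; row 3 has a code with no known name.
toy = pd.DataFrame({
    "jurisdiction_cty": ["SJ", "PA", "SJ", "AL"],
    "jurisdict": ["San Jose", "Palo Alto", "Palo Alto", "Unincorporated"],
})

# Translate the county code, then compare only where a translation exists.
expected_name = toy["jurisdiction_cty"].map(code_to_name)
mismatch = expected_name.notna() & (expected_name != toy["jurisdict"])
print(toy[mismatch])
```

Codes that map to `None` are skipped rather than counted as mismatches, since there is nothing to compare them against.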

In [None]:
buildings.to_csv('urbansim_buildings_proc.csv', index=False)