# UrbanSim Parcels Processing

### TODO
AWS:

- Re-run Update_urbansim_parcels_county.ipynb
    - Use `SELECT DISTINCT` to drop duplicate rows in `urbansim_parcels`

- Add a county-of-origin column to the `basis.{}_county` tables to fix the `fipco` merge problem (discussed in the EDA section of EDA_urbansim_buildings.ipynb)

### Verify:
- Each APN should map to exactly one joinid

In [1]:
import os
import sys
import pandas as pd


user = os.environ['USER']
sys.path.insert(0, '/Users/{}/Box/DataViz Projects/Utility Code'.format(user))
from utils_io import *

## 1. Get data from Socrata

In [2]:
socrata_data_id = '6q7r-gybw'
df = pull_df_from_socrata(socrata_data_id)

pulling data in 37 chunks of 100000 rows each
pulling chunk 0
pulling chunk 1
pulling chunk 2
pulling chunk 3
pulling chunk 4
pulling chunk 5
pulling chunk 6
pulling chunk 7
pulling chunk 8
pulling chunk 9
pulling chunk 10
pulling chunk 11
pulling chunk 12
pulling chunk 13
pulling chunk 14
pulling chunk 15
pulling chunk 16
pulling chunk 17
pulling chunk 18
pulling chunk 19
pulling chunk 20
pulling chunk 21
pulling chunk 22
pulling chunk 23
pulling chunk 24
pulling chunk 25
pulling chunk 26
pulling chunk 27
pulling chunk 28
pulling chunk 29
pulling chunk 30
pulling chunk 31
pulling chunk 32
pulling chunk 33
pulling chunk 34
pulling chunk 35
pulling chunk 36
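
The log above is consistent with fixed-size paging: the raw table has 3,655,207 rows, and ceil(3,655,207 / 100,000) = 37 chunks. The internals of `pull_df_from_socrata` are not shown in this notebook, so the following is only a minimal sketch of that paging pattern. `pull_in_chunks`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fetch callable stands in for whatever HTTP client is actually used.

```python
import math

import pandas as pd


def pull_in_chunks(fetch_page, total_rows, chunk_size=100_000):
    """Fetch `total_rows` rows in fixed-size pages and return one DataFrame.

    `fetch_page(offset, limit)` is a hypothetical callable that returns a
    list of row dicts for the requested window.
    """
    n_chunks = math.ceil(total_rows / chunk_size)
    frames = []
    for i in range(n_chunks):
        rows = fetch_page(offset=i * chunk_size, limit=chunk_size)
        frames.append(pd.DataFrame(rows))
    return pd.concat(frames, ignore_index=True)


# Toy stand-in for a remote dataset, so the sketch runs offline.
_fake_rows = [{'apn': str(i)} for i in range(25)]


def fake_fetch(offset, limit):
    return _fake_rows[offset:offset + limit]


demo = pull_in_chunks(fake_fetch, total_rows=len(_fake_rows), chunk_size=10)
n_chunks_in_log = math.ceil(3_655_207 / 100_000)
```

Here `demo` is assembled from three pages (10, 10, and 5 rows), and `n_chunks_in_log` reproduces the chunk count reported in the output above.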


## 2. Inspect duplicates

In [3]:
# many fully duplicated rows
print(df.shape)  # (3655207, 7)
df.drop_duplicates(inplace=True)
print(df.shape)  # (2644476, 7)
# some APNs still appear on more than one row
print(df['apn'].nunique())  # 2643041

(3655207, 7)
(2644476, 7)
2643041
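
The gap between the deduplicated row count and the number of distinct APNs (2,644,476 - 2,643,041 = 1,435) means some APNs still sit on more than one row after exact duplicates are removed. A toy illustration of the difference between full-row duplicates and duplicate keys, using made-up values and an assumed `joinid` column name:

```python
import pandas as pd

toy = pd.DataFrame({
    'apn':    ['001', '001', '002', '002', '003'],
    'joinid': ['A',   'A',   'B',   'C',   'D'],
})

# drop_duplicates() removes only rows that match on every column,
# so the two ('001', 'A') rows collapse into one.
deduped = toy.drop_duplicates()

# APN '002' survives twice because its rows differ in `joinid`.
n_rows = len(deduped)
n_apns = deduped['apn'].nunique()
extra_rows = n_rows - n_apns
```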


In [15]:
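# get target # of APNs for Santa Clara County (fipco CA085)
# NOTE: cty_data is defined in section 3; this cell (In [15]) was executed after it
target_num_apns = cty_data[cty_data['fipco'] == 'CA085']['apn'].nunique()
target_num_apns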

381269

## 3. Subset to data with values for EDA

In [4]:
county_cols = ['acres', 'assessed_land_value', 'jurisdiction_cty']

In [6]:
# subset to rows with a value in at least one county input column
df['missing_cty_input'] = df[county_cols].isnull().all(axis=1)
cty_data = df[~df['missing_cty_input']].copy()
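
As a small illustration of the mask used above: `isnull().all(axis=1)` flags a row only when every one of the listed columns is missing, so rows with at least one county value are kept. The values below are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'apn': ['001', '002', '003'],
    'acres': [0.25, np.nan, np.nan],
    'assessed_land_value': [np.nan, np.nan, 120000.0],
    'jurisdiction_cty': ['SJ', None, None],
})
cols = ['acres', 'assessed_land_value', 'jurisdiction_cty']

# True only for APN '002', where all three county columns are null.
missing_all = toy[cols].isnull().all(axis=1)
kept = toy[~missing_all]
```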

In [None]:
cty_data.head()

## 4. EDA

### Check duplicates

Some APNs appear on multiple records. This looks like the same Parcels 2018 `fipco` merge problem discussed in the EDA section of EDA_urbansim_buildings.ipynb.

Solution: drop the non-Santa Clara records.

In [None]:
vcs = cty_data.groupby('apn').size().sort_values(ascending=False)
dup_data = cty_data[cty_data['apn'].isin(vcs[vcs > 1].index)]
dup_data.head()
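
The same rows can also be selected with `Series.duplicated(keep=False)`, which flags every occurrence of a repeated key. This is only an alternative sketch on made-up data, not a change to the cell above:

```python
import pandas as pd

toy = pd.DataFrame({
    'apn':   ['001', '001', '002', '003'],
    'fipco': ['CA085', 'CA001', 'CA085', 'CA085'],
})

# Approach used in the notebook: count rows per APN, keep APNs with count > 1.
sizes = toy.groupby('apn').size()
by_count = toy[toy['apn'].isin(sizes[sizes > 1].index)]

# Alternative: flag all rows whose APN occurs more than once.
by_flag = toy[toy['apn'].duplicated(keep=False)]
```

Both selections return the two rows for APN `001`.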

Within the Santa Clara records, the secondary duplicate merge problem (noted in the EDA section of EDA_urbansim_buildings.ipynb) appears: a single APN, 13241104, has duplicate joinids.

Solution: select one of the joinids as the correct one (verify against Parcels 2018 which one ought to be chosen).

In [None]:
sc_dups = dup_data[dup_data['fipco'] == 'CA085']
vcs1 = sc_dups.groupby('apn').size().sort_values(ascending=False)
dup_sc = sc_dups[sc_dups['apn'].isin(vcs1[vcs1 > 1].index)]
print(dup_sc['apn'].unique())
dup_sc
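
The "joinid should be unique to APN" item in the Verify list could be checked directly along these lines. This is a sketch that assumes the column is literally named `joinid`; `apns_with_multiple_joinids` is a hypothetical helper, and the data is made up.

```python
import pandas as pd


def apns_with_multiple_joinids(df, apn_col='apn', joinid_col='joinid'):
    """Return the APNs that are associated with more than one distinct joinid."""
    counts = df.groupby(apn_col)[joinid_col].nunique()
    return counts[counts > 1].index.tolist()


toy = pd.DataFrame({
    'apn':    ['001', '002', '002', '003'],
    'joinid': ['A',   'B',   'C',   'D'],
})
offenders = apns_with_multiple_joinids(toy)
```

An empty list would mean every APN maps to exactly one joinid; here `offenders` contains only `'002'`.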

### Create cleaned version of data

In [12]:
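# APNs that appear exactly once in the county-input data
non_dup = cty_data[cty_data['apn'].isin(vcs[vcs == 1].index)]
# APNs duplicated overall but appearing once within Santa Clara (CA085)
sc_nondup = sc_dups[sc_dups['apn'].isin(vcs1[vcs1 == 1].index)]

# keep the first row of the APN with duplicate joinids (still to be verified against Parcels 2018)
parcels = pd.concat([non_dup, sc_nondup, dup_sc.iloc[[0]]])
parcels = parcels[parcels['fipco'] == 'CA085']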

In [16]:
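# Santa Clara filter repeated from the previous cell (no effect if that cell already ran)
parcels = parcels[parcels['fipco'] == 'CA085']
# apn is now unique, and the row count matches the target number of APNs
assert parcels['apn'].is_unique
assert len(parcels) == target_num_apns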
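
For comparison, a more compact way to get one Santa Clara row per APN is to filter on `fipco` first and then drop repeated APNs. This is a sketch on made-up data with an assumed `joinid` column; like `dup_sc.iloc[[0]]` above, `keep='first'` picks a row arbitrarily rather than the verified joinid.

```python
import pandas as pd

toy = pd.DataFrame({
    'apn':    ['001', '001', '002', '002', '003'],
    'fipco':  ['CA085', 'CA001', 'CA085', 'CA085', 'CA085'],
    'joinid': ['A', 'X', 'B', 'C', 'D'],
})

cleaned = (
    toy[toy['fipco'] == 'CA085']                   # drop non-Santa Clara records
    .drop_duplicates(subset='apn', keep='first')   # one row per APN
    .reset_index(drop=True)
)
```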

### Check match with Parcels 2018 data

This table has the same jurisdiction join problems as urbansim_buildings (see the EDA section of EDA_urbansim_buildings.ipynb).

In [17]:
parcels['juris_compare'] = parcels['jurisdiction_cty'] + ' ' + parcels['jurisdict']
parcels['juris_compare'].value_counts()

SJ San Jose                       192453
SU Sunnyvale                       27869
SC Santa Clara                     22075
MV Mountain View                   15126
CU Cupertino                       13931
PA Palo Alto                       13780
GI Gilroy                          12744
MH Morgan Hill                     11432
CA Campbell                        10546
SJ Unincorporated Santa Clara      10449
SA Saratoga                         9896
LA Los Altos                        9494
LG Los Gatos                        9323
LG Unincorporated Santa Clara       3890
GI Unincorporated Santa Clara       3157
MH Unincorporated Santa Clara       2928
LH Los Altos Hills                  2811
SM Unincorporated Santa Clara       1688
LA Unincorporated Santa Clara       1248
nan San Jose                        1228
MS Monte Sereno                     1104
nan Unincorporated Santa Clara       820
ST Unincorporated Santa Clara        578
SA Unincorporated Santa Clara        334
CU Unincorporate
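
The same comparison can be laid out as a two-way table, which makes mismatches between the county code and the Parcels 2018 jurisdiction easier to scan than a concatenated string. A sketch on made-up rows; the `'(missing)'` label is an arbitrary placeholder for null county codes.

```python
import pandas as pd

toy = pd.DataFrame({
    'jurisdiction_cty': ['SJ', 'SJ', 'LG', None],
    'jurisdict': ['San Jose', 'Unincorporated Santa Clara',
                  'Los Gatos', 'San Jose'],
})

# Rows: county jurisdiction code; columns: Parcels 2018 jurisdiction name.
table = pd.crosstab(
    toy['jurisdiction_cty'].fillna('(missing)'),
    toy['jurisdict'],
)
```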

In [18]:
parcels.to_csv('urbansim_parcels_proc.csv', index=False)