# Development history of a few Oakland neighborhoods

Growing up in a small town on the East Coast, I knew the person who built our house.  Not personally, mind you, but I knew that he was the brother of the father of our next-door neighbors (who were themselves quite old when I was a child).  Then I began to move around: first for college, then for graduate school, then to the Bay Area.  When I recently moved to Oakland and started putting down roots, I was missing the same kind of connection to my new neighborhood.  So I got a copy of Beth Bagwell's book [*Oakland: The Story of a City*](http://blog.ouroakland.net/2012/05/oakland-story-of-city.html), which filled in some broad background: first the Muwekma Ohlone villages and shell mounds, then the Spanish land grants to the rancheros (Peralta and his sons), then the gold rush and the land grab by greedy settlers, then the development of downtown and gradual spreading of the city boundary enabled by mass transit.  Then my neighborhood was built.  This was followed by WWII and the influx of African-Americans to work in the war effort, and then the grimmer recent history of the city center being abandoned for the suburbs.

But somewhere in that narrative was my question, still unanswered: how was my neighborhood constructed?  Sure, one can find old maps that list the development parcels, but what did the neighborhood look like going up?  Was it built all at once in the roaring 20s, or piecemeal, one farmhouse here, another there, as the land was slowly subdivided and the orchards and oak trees cut down?  This is a narrow question, and more recent history is layered on top of it: how the highways divided many neighborhoods, how new apartment buildings went up, how structures burned down or were destroyed.  It also ignores how the land was used by the Peraltas for decades, and by the Ohlone for many centuries before them.  But the further back one looks, the fainter the traces.  Or, said another way: you have to start somewhere.

So I decided to answer the simplest question I could: when was each house built?  This required some information that shouldn't be too hard to find: a list of houses, where each one is, and when it was built.  I'll describe the process of analyzing this data as I go along, complete with snippets in case others want to replicate the analysis.  Standalone scripts can be found in this repository as well, as indicated in comments.

To skip right to the analysis, scroll down to the maps.

## The Dataset

The data sets made available as part of the [CodeForOakland](http://codeforoakland.org/data-sets/#oakland1) project have been crucial.  Initially, I found a GeoJSON file containing parcel info for the county of Alameda, which includes Oakland but also several other large cities.  This was a huge file with about 10x as much data as Oakland alone; it couldn't be processed in memory and lacked street addresses.  Several days later, I found a shapefile of all parcels in Oakland, from around 2011, that conveniently fit into memory - this was exactly what was needed.  Still, if I want to go back and look at the info for Piedmont (or another neighboring city), I can always dig up the county database.

This is the info for each entry:

In [15]:
import fiona
import pprint

baseFile = 'data/Oakland_parcels/parcels'
source = fiona.open(baseFile + '.shp')
print('Number of parcels: %d\n' % len(source))
print('Second parcel:')
next(source)     # first entry has a lot of coordinates, so skip
pprint.pprint(next(source))

source.close()


Number of parcels: 105351

Second parcel:
{'geometry': {'coordinates': [[(-122.244024425766, 37.86867162821933),
                               (-122.24398667467038, 37.86866162294282),
                               (-122.24399852865695, 37.868349112878676),
                               (-122.24401355395213, 37.86835277617628),
                               (-122.24404457097216, 37.868359870908144),
                               (-122.2440757393451, 37.86836653553029),
                               (-122.24408444864746, 37.86836826921976),
                               (-122.24407290434733, 37.86867278199862),
                               (-122.244024425766, 37.86867162821933)]],
              'type': 'Polygon'},
 'id': '1',
 'properties': OrderedDict([('OBJECTID', 23),
                            ('ADDR_HN', None),
                            ('ADDR_PD', None),
                            ('ADDR_SN', 'DWIGHT'),
                            ('ADDR_ST', 'WAY'),
                  

As you can see, this database has a field for the street address, though the specific example above is missing part of it ('None').  I'll catch this kind of error below.  If the street address weren't present at all, it could be found by reverse geocoding the location (with Google's API, for instance).  All told, there are about 105,000 entries.
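
Records like the one above can also be screened before any query is sent.  The following is a minimal sketch of one way to do that, not the approach used later in this notebook.  It assumes each record is a dict shaped like the fiona output above; the helper name `has_street_address` is made up for illustration, and only the `ADDR_HN` and `ADDR_SN` fields visible in that output are used.

```python
def has_street_address(record):
    """Return True if the parcel record has both a house number and a street name."""
    props = record.get('properties', {})
    # ADDR_HN and ADDR_SN are the field names visible in the shapefile output;
    # None or an empty string counts as missing.
    return bool(props.get('ADDR_HN')) and bool(props.get('ADDR_SN'))


# Illustrative records only; the second mirrors the missing value shown above.
complete = {'properties': {'ADDR_HN': '684', 'ADDR_SN': 'SPRUCE', 'ADDR_ST': 'ST'}}
incomplete = {'properties': {'ADDR_HN': None, 'ADDR_SN': 'DWIGHT', 'ADDR_ST': 'WAY'}}

print(has_street_address(complete))
print(has_street_address(incomplete))
```

Skipping incomplete records up front would save a few queries from the daily quota, at the cost of never asking Zillow about them.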

The construction history exists in Zillow's database, so the dataset was completed by calling Zillow's API for each property.  The one catch is that Zillow's free API key allows only 1,000 queries per day.  At this rate, it would take about 105 days (~3.5 months) to process all the data for Oakland.  Because I'm mostly interested in a few neighborhoods, I can work within this limit by processing parcels outward from a central point: radial processing, if you will.  I'll choose the center location to be a landmark in my neighborhood of Cleveland Heights: the Armenian Church on the top of the hill, at the corner of McKinley and Spruce.  All the surrounding neighborhoods should take a couple of weeks at most, which is a much more realistic proposition.
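
The time estimate is just the parcel count divided by the daily quota.  A small sketch of that arithmetic follows; the helper name `days_to_process` is made up for illustration.

```python
import math


def days_to_process(n_parcels, daily_quota=1000):
    """Number of whole days needed to query n_parcels at a fixed daily quota."""
    return math.ceil(n_parcels / daily_quota)


print(days_to_process(105351))   # every parcel in the Oakland shapefile
print(days_to_process(6000))     # a neighborhood-sized subset
```

Counting the final partial day, the full shapefile comes out at 106 days, while a subset of 6,000 parcels needs only 6.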

To process the parcels radially, the shapefile array was rewritten as a dictionary with the key being the distance between the landmark and the centroid of each parcel.  Notice the data stays as a dictionary to facilitate saving it back to a shapefile at the end, which is useful for mapping purposes.  If that weren't a concern, it would make a lot of sense to convert the format to something easier to manipulate, like a pandas DataFrame.

In [20]:
import fiona
import geopy.distance
distance = geopy.distance.vincenty    # removed in geopy 2.0; geopy.distance.geodesic is its replacement
import shapely.geometry as shp
import pickle 

baseFile = 'data/Oakland_parcels/parcels'
center = (37.8058428, -122.2399758)        # (lat, long), Armenian Church

data_raw = {}
data_duplicates = {}

with fiona.drivers():           # Register format drivers with a context manager

    with fiona.open(baseFile + '.shp') as source:
       
        for f in source :
            if 'geometry' not in f:
                print('No geometry key in entry {}'.format(f))
                continue        # nothing to measure, so skip this entry
            c = shp.shape(f['geometry']).centroid
            p = (c.y, c.x)

            d = round(distance(p, center).m * 10**6)/10**6        # round to micrometers
            f['centroid'] = p            

            if d in data_raw :
                if d in data_duplicates :   
                    data_duplicates[d].append(f)      # add to list in existing dictionary key
                else :
                    data_duplicates[d] = [data_raw[d], f]      # create list in existing dictionary key
            else :
                data_raw[d] = f
print('Number of parcels in dict: %d\n' % len(data_raw))

Number of parcels in dict: 97718



Technically, computing the centroid is unnecessary for radial processing - the first coordinate of the parcel shape would suffice - but this info may be useful later on.  Also, the amount of time it adds to processing is negligible for such a small dataset.

Notice that there are now about 97,700 entries, roughly 7,600 fewer than in the original file.  It turns out that the missing entries are duplicates, such as condominiums, that share the same street address and coordinates but have different assessor parcel numbers (APNs).  Doing the math confirms that duplicates account for the whole difference:

In [21]:
duplicateCount = 0
for (key, value) in data_duplicates.items() :
    duplicateCount += len(value)
print('Total number of duplicate parcels: %d' % (duplicateCount-len(data_duplicates)))

Total number of duplicate parcels: 7633


$7633+97718 = 105351$, which was the original number of entries in the shapefile.

Interestingly, the micrometer precision in the distance key is necessary to distinguish parcels: rounding to millimeters results in clashes between different street addresses.  Given the number of parcels at a certain radius once the radius gets large, this isn't terribly surprising, but it's still a nice example of the importance of precision and probability.
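
The effect of the rounding precision can be checked by counting how many values collapse onto the same key.  Here is a sketch using made-up distances rather than the parcel data; `count_key_collisions` is a hypothetical helper.

```python
from collections import Counter


def count_key_collisions(distances_m, decimals):
    """Count values that would be lost if rounded distances were used as dict keys."""
    keys = Counter(round(d, decimals) for d in distances_m)
    # A key shared by n values hides n - 1 of them.
    return sum(n - 1 for n in keys.values() if n > 1)


# Made-up distances in meters; the first two differ by only 0.2 mm.
distances = [512.3401, 512.3403, 498.1250, 733.9000]

print(count_key_collisions(distances, 3))   # millimeter keys
print(count_key_collisions(distances, 6))   # micrometer keys
```

Parcels with identical coordinates, such as the condominiums mentioned earlier, collide at any precision, which is why the duplicates are collected separately.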

Now the cleaned dataset is saved to file.

In [None]:
with open('data/OaklandParcels_inProcess.pkl', 'wb') as datafile :
    a = pickle.Pickler(datafile)
    compressed = {}
    compressed['data_raw'] = data_raw
    compressed['data_queried'] = {}
    compressed['data_errors'] = []
    a.dump(compressed)

## Completing the dataset

The dataset is processed starting with the smallest key and working outwards: each entry is first sent to the Zillow API, then popped from the input dictionary (data_raw) and placed in the output dictionary (data_queried) if the response is valid.  If the response is invalid, it is appended to an error list (data_errors) for later processing.  There's also a rate limit of 10 queries/second, so a timer around the loop limits the rate.  It runs at 5 queries/second just to be nice to Zillow's server.

In [None]:
# CreateParcelDatabase.py
import requests
import xmltodict
import time
import pickle

# Zillow API key and endpoint
with open('../private/API_keys.pkl', 'rb') as datafile:
    zid = pickle.load(datafile)             # API key
zurl = 'http://www.zillow.com/webservice/GetDeepSearchResults.htm?'

inProcessFile = 'data/OaklandParcels_inProcess.pkl'
radius = 1000            # only process parcels within this radius - used for debugging
numToProcess = 1000      # zillow API limits to 1000 queries per day

# load data structures
with open(inProcessFile, 'rb') as fid :
    compressed = pickle.load(fid)
# ...and unpack
data_raw = compressed['data_raw']
data_queried = compressed['data_queried']
data_errors = compressed['data_errors']
del compressed

# sort keys by distance from closest to furthest
sortedKeys = [k for k in sorted(data_raw) if k < radius]

for (i, key) in zip(range(numToProcess), sortedKeys) :
    startT = time.time()            # set up timer to keep requests under 10/s

    try :
        # read in address details from input dictionary
        zp = {'address' : '{} {} {}'.format(data_raw[key]['properties']['ADDR_HN'],
                      data_raw[key]['properties']['ADDR_SN'],
                      data_raw[key]['properties']['ADDR_ST']),
              'citystatezip' : 'Oakland, CA ' + str(data_raw[key]['properties']['ZIP']),
              'zws-id' : zid}
        r = requests.get(zurl, params=zp)
        r_dict = xmltodict.parse(r.text)['SearchResults:searchresults']

        if r_dict['message']['code'] == '0' :       # valid response?
            r_dict = r_dict['response']['results']['result']
            
            # in case the response is a list of multiple (similar) entries, take the first one
            if type(r_dict)==list :
                r_dict = r_dict[0]
            # prune extraneous fields
            r_dict.pop('links')
            r_dict.pop('zestimate')
            r_dict.pop('localRealEstate')
            data_raw[key]['zillow'] = r_dict
            # transfer  to output dictionary
            data_queried[key] = data_raw.pop(key)
            
        else :
            print("For request {}, zillow code is {}. Here's the record:".format(
                    zp['address'], r_dict['message']['code']))
            print(r_dict)
            print('-'*60)
            # transfer info to error dictionary for offline analysis
            data_errors.append({'key': key, 'value': data_raw.pop(key), 
                                  'zillow': r_dict, 'source': 'zillow'})
    except Exception as exc:
        print('Unspecified error: {}'.format(exc))
        data_errors.append({'key': key, 'value': data_raw.pop(key),
                              'source': 'exception'})
            
    # log status
    print(i, ' ', zp['address'])

    endT = time.time()
    if endT - startT < 0.2 :
        time.sleep(0.2 - (endT-startT))     # rate limit to 5 calls per second

# save dictionaries back to disk
compressed = {}
compressed['data_raw'] = data_raw
compressed['data_queried'] = data_queried
compressed['data_errors'] = data_errors
with open(inProcessFile, 'wb') as fid :
    a = pickle.Pickler(fid)
    a.dump(compressed)

The most common error was code 508, 'no exact match found for input address'.  This was primarily caused by the address not being in the Zillow database, as with commercial buildings, churches, schools, etc.  But it was also caused by invalid addresses, such as a parcel without a house number ('None MacArthur Blvd'), which was the case for parks, municipal land, and Lake Merritt.  The error rate was about 5%, or 1 in 20.
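
A breakdown like this can be produced by tallying the entries in `data_errors`.  The sketch below assumes the entry layout used in the loop above (a `source` field, plus a `zillow` response for API errors); the helper `tally_errors` and the sample entries are illustrative, not taken from the real run.

```python
from collections import Counter


def tally_errors(data_errors):
    """Count error entries by Zillow message code, or by source when no code exists."""
    counts = Counter()
    for entry in data_errors:
        if entry.get('source') == 'zillow':
            counts[entry['zillow']['message']['code']] += 1
        else:
            counts[entry.get('source', 'unknown')] += 1
    return counts


# Illustrative entries shaped like those appended in the query loop.
sample_errors = [
    {'key': 10.5, 'value': {}, 'zillow': {'message': {'code': '508'}}, 'source': 'zillow'},
    {'key': 11.2, 'value': {}, 'zillow': {'message': {'code': '508'}}, 'source': 'zillow'},
    {'key': 12.9, 'value': {}, 'source': 'exception'},
]

print(tally_errors(sample_errors).most_common())
```

Dividing the total by the number of parcels processed so far gives the overall error rate.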

Before mapping the data, it needs to be written back to a shapefile so it can be easily processed.  Even though shapefiles are an old format, they are pretty efficient for this kind of processing - much more so than the GeoJSON format.  When doing so, the schema from the original file needs to be modified to include the new data ('yearBuilt').  Extra fields from Zillow are jettisoned to speed up processing.  Finally, because this shapefile has the 'properties' dictionary in an OrderedDict class, the data fields need to be rearranged to match the schema's order.

In [None]:
# SaveParcelDictionaryAsShapefile.py
import fiona
import pickle
import os
from collections import OrderedDict
import numpy as np

inProcessFile = 'data/OaklandParcels_inProcess.pkl'         # data source
with open(inProcessFile, 'rb') as fid :
    compressed = pickle.load(fid)
# ...and unpack
data_raw = compressed['data_raw']
data_queried = compressed['data_queried']
del compressed

baseFile = 'data/Oakland_parcels/parcels'       # source of shape info in dictionary
outputFileName = 'Oakland_parcels_queried'
radius = 2000                                   # in m

# create output directory if it doesn't exist yet
outputDir = os.path.join('data', outputFileName)
os.makedirs(outputDir, exist_ok=True)
outputFile = os.path.join(outputDir, outputFileName + '.shp')

# Register format drivers with a context manager
with fiona.drivers():
    # get schema from original file
    with fiona.open(baseFile + '.shp') as source:
        meta = source.meta
        
    # add new fields to schema file
    meta['schema']['centroid'] = ('float:19:11', 'float:19:11')
    meta['schema']['id'] = 'float:19'
    meta['schema']['type'] = 'str:50'
    meta['schema']['yearBuilt'] = 'float:10'
    meta['schema']['properties']['YEARBUILT'] = 'int:6'
    meta['schema']['properties'] = OrderedDict(meta['schema']['properties'])
    schemaOrder = meta['schema']['properties']

    with fiona.open(outputFile, 'w', **meta) as sink:
        for (i,f) in enumerate(data_queried) :
            if f <= radius :   
                if 'yearBuilt' in data_queried[f]['zillow'] :
                    data_queried[f]['properties']['YEARBUILT'] = data_queried[f]['zillow']['yearBuilt']
                    data_queried[f].pop('zillow')
                else :
                    data_queried[f]['properties']['YEARBUILT'] = np.nan
                # reorder dictionary to match schema order
                data_queried[f]['properties'] = OrderedDict(
                        (k, data_queried[f]['properties'][k]) for k in schemaOrder)           
                sink.write(data_queried[f])
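
Zillow returns `yearBuilt` as a string (for example `'1910'` in the sample response near the end of this notebook), and some records lack it altogether.  One possible way to normalize the value before writing it to an integer field is sketched below.  The helper `parse_year_built` and the sentinel value 0 are assumptions for illustration; the script above writes `np.nan` for missing years instead.

```python
def parse_year_built(zillow_record, missing=0):
    """Return yearBuilt as an int, or `missing` if it is absent or not a whole number."""
    raw = zillow_record.get('yearBuilt')
    try:
        return int(raw)
    except (TypeError, ValueError):
        # TypeError: the field is absent (None); ValueError: not an integer string
        return missing


print(parse_year_built({'yearBuilt': '1910'}))
print(parse_year_built({'useCode': 'MultiFamily2To4'}))
```

With a sentinel like this, the mapping step would need to treat 0 as "no data" rather than as a real year.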

## The Maps

The kind of map used to show this data is a choropleth map, which maps a physical quantity (year built) onto a spatial extent by using some kind of shading.  This is a quick way to show what spatial patterns might require more analysis.  While Python has no straightforward way to do this, all the tools required are free and there's extensive support and code examples online.

The general overview is to first choose a projection grid to map the round world onto.  This converts the (longitude, latitude) pairs to (x, y) pairs in the coordinate system of the projection.  The parcel polygons are drawn and colored on top of this.  First, though, here are two functions to make colorbars, borrowed from [Sensitive Cities](http://sensitivecities.com/so-youd-like-to-make-a-map-using-python-EN.html#.V2hnJa4tVVz).

In [1]:
# Convenience functions for working with colour ramps and bars
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
def colorbar_index(ncolors, cmap, labels=None, **kwargs):
    """
    This is a convenience function to stop you making off-by-one errors
    Takes a standard colour ramp, and discretizes it,
    then draws a colour bar with correctly aligned labels
    """
    cmap = cmap_discretize(cmap, ncolors)
    mappable = cm.ScalarMappable(cmap=cmap)
    mappable.set_array([])
    mappable.set_clim(-0.5, ncolors+0.5)
    colorbar = plt.colorbar(mappable, **kwargs)
    colorbar.set_ticks(np.linspace(0, ncolors, ncolors))
    colorbar.set_ticklabels(range(ncolors))
    if labels:
        colorbar.set_ticklabels(labels)
    return colorbar

def cmap_discretize(cmap, N):
    """
    Return a discrete colormap from the continuous colormap cmap.

        cmap: colormap instance, eg. cm.jet. 
        N: number of colors.

    Example
        x = resize(arange(100), (5,100))
        djet = cmap_discretize(cm.jet, 5)
        imshow(x, cmap=djet)

    """
    if isinstance(cmap, str):
        cmap = plt.get_cmap(cmap)
    colors_i = np.concatenate((np.linspace(0, 1., N), (0., 0., 0., 0.)))
    colors_rgba = cmap(colors_i)
    indices = np.linspace(0, 1., N + 1)
    cdict = {}
    for ki, key in enumerate(('red', 'green', 'blue')):
        cdict[key] = [(indices[i], colors_rgba[i - 1, ki], colors_rgba[i, ki]) for i in range(N + 1)]
    return matplotlib.colors.LinearSegmentedColormap(cmap.name + "_%d" % N, cdict, 1024)

And now the mapping routine:

In [None]:
# DrawParcelChoropleth.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap
from shapely.geometry import Polygon, MultiPolygon
from shapely.prepared import prep
from descartes import PolygonPatch
from itertools import chain
import geopy.distance
distance = geopy.distance.vincenty    # removed in geopy 2.0; geopy.distance.geodesic is its replacement

# shapefile database
baseFile = 'data/Oakland_parcels_queried/Oakland_parcels_queried'

center = geopy.Point(37.8058428, -122.2399758)        # (lat, long), Armenian Church
radius = 0.6                           # in km
ur = distance(kilometers=radius*2**0.5).destination(center, +45)
ll = distance(kilometers=radius*2**0.5).destination(center, -135)
ur = (ur.longitude, ur.latitude)
ll = (ll.longitude, ll.latitude)
extra = 0.01           # padding for edges
coords = list(chain(ll, ur))
w, h = coords[2] - coords[0], coords[3] - coords[1]

m = Basemap(
    projection='tmerc',
    lon_0=-122.,
    lat_0=37.,
    ellps = 'WGS84',
    llcrnrlon=coords[0] - extra * w,
    llcrnrlat=coords[1] - extra * h,
    urcrnrlon=coords[2] + extra * w,
    urcrnrlat=coords[3] + extra * h,
    lat_ts=0,
    resolution='i',
    suppress_ticks=True)

m.readshapefile(
    baseFile,
    'oakland',
    color='blue',
    zorder=2)
  
# set up a map dataframe
df_map = pd.DataFrame({
    'poly': [Polygon(xy) for xy in m.oakland],
    'id': [obj['OBJECTID'] for obj in m.oakland_info],
    'zip': [obj['ZIP'] for obj in m.oakland_info],
    'yearBuilt': [obj['YEARBUILT'] for obj in m.oakland_info]})

# Create projection view as a polygon to filter shapes
window = [ll, (ll[0], ur[1]), ur, (ur[0], ll[1]), ll]
window = list(zip( *m(*list(zip(*window))) ))
window_map = pd.DataFrame({'poly': [Polygon(window)]})
window_polygon = prep(MultiPolygon(list(window_map['poly'].values)))

# Remove any shapes that are outside the map window
df_map = df_map[ [window_polygon.intersects(i) for i in df_map.poly] ]

# draw tract patches from polygons
df_map['patches'] = df_map['poly'].map(lambda x: PolygonPatch(
    x,
    ec='#787878', lw=.25, alpha=.9,
    zorder=4))

# create colormap based on year built
cmap_range = (1885.5, 1930.5)
ncolors = 8
yearBuilt_bins = np.linspace(min(cmap_range), max(cmap_range), ncolors+1)
cmap = matplotlib.cm.coolwarm
cmap.set_bad(color='white')       # if yearBuilt is nan
norm = matplotlib.colors.BoundaryNorm(yearBuilt_bins, ncolors)

plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, axisbg='w', frame_on=False)

# plot parcels by adding the PatchCollection to the axes instance
pc = PatchCollection(df_map['patches'].values, match_original=True)
pc.set_facecolor(cmap(norm(df_map.yearBuilt)/ncolors))
ax.add_collection(pc)

# create labels for colorbar
yearBuilt_labels = ['%.0f-%.0f' % (yearBuilt_bins[i], yearBuilt_bins[i+1])
                        for i in range(ncolors)]
yearBuilt_labels.append('>%.0f' % yearBuilt_bins[-1])

cb = colorbar_index(ncolors=ncolors+1, cmap=cmap, shrink=0.5, labels=yearBuilt_labels)
cb.ax.tick_params(labelsize=6)

# Draw a map scale
m.drawmapscale(
    coords[0] + w * 0.5, coords[1] + h * 0.1,
    coords[0], coords[1],
    radius/2*1000,   # length
    barstyle='fancy', labelstyle='simple',
    units = 'm',
#    format='%.2f',
    fillcolor1='w', fillcolor2='#555555',
    fontcolor='#555555',
    zorder=4)
plt.title("Oakland housing development, 1890-1930")
plt.tight_layout()
fig.set_size_inches(7.22, 5.25)  
plt.savefig('data/Oakland_temp.png', dpi=300, alpha=True)
plt.show()
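
To see which color bin a given year lands in, the binning can be reproduced in plain Python.  The sketch below is a stand-in for illustration, not a replacement for `BoundaryNorm`; the defaults mirror `cmap_range` and `ncolors` from the script above, and the helper name `year_bin` is made up.

```python
from bisect import bisect_right


def year_bin(year, lo=1885.5, hi=1930.5, nbins=8):
    """Return the 0-based bin index for `year`, or None if it falls outside [lo, hi)."""
    width = (hi - lo) / nbins
    edges = [lo + i * width for i in range(nbins + 1)]
    idx = bisect_right(edges, year) - 1
    return idx if 0 <= idx < nbins else None


print(year_bin(1910))   # inside the range
print(year_bin(1950))   # after the range, so no bin
```

With these settings each bin is 5.625 years wide, so the bin edges do not fall on whole years and the colorbar labels are rounded.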

## Results

The mapping code above produces the following map:

<img src="Oakland_500m_1890to1930.png">

It's easy to see a couple of trends in this small scale map:
- Although roads aren't labeled, they are clearly visible in the negative space.  The big diagonal swoosh is I-580, and the almost horizontal road spanning the bottom is Park Blvd.  Even some pedestrian pathways, a common occurrence in the hilly East Bay, are visible as thin lines between houses in the upper right.
- Some parcels aren't shown at all, not even with parcel boundaries.  These are parcels missing from Zillow's database, such as schools, churches, commercial property, and parks.  The big empty space on the right side is Oakland High School.
- Some parcels have boundaries but are colored white.  This indicates they were either built after this date span or have no valid date of construction - overflow or invalid data, in other words.

With that in mind, here are two larger scale maps containing the 6,000 parcels closest to the center point, for two slightly overlapping time spans.

<img src="Oakland_1300m_1880to1920.png">

<img src="Oakland_1300m_1920to1960.png">

In the later map, all bright red parcels were built after 1960.

Looking at these larger scale maps, a couple of trends can be seen.  In the late 19th century, or the Victorian Era, houses were built to the south and east of Lake Merritt, primarily in the Clinton and Bella Vista neighborhoods, with a few between Merritt and Cleveland Heights.  These houses tended to cluster in small numbers, separated by a block or two: perhaps this area was still used for agriculture?  In any case, this area used to be called Brooklyn and was separated from downtown Oakland by a toll bridge.

The first two decades of the 20th century saw the edges of development spreading outward.  The development was strongest in the neighborhoods of Lakeshore, Cleveland Heights, and the upper elevations of Trestle Glen (to the SE of where the label is).  These areas aligned closely with the building of the streetcar system, which by then ran through all these areas.

Just about the only area to see complete development was Bella Vista, likely because its proximity to the Arbor Villa estate of Francis "Borax" Smith made it a very desirable location.  Oddly, Haddon Hill saw only a few houses being built before 1920 - perhaps this hill that overlooks Lake Merritt smelled too strongly, since the lake (really a tidal inlet then) was quite polluted.  The valley of Trestle Glen was almost ignored during this period: the train trestle that lent the name was torn down around 1906, but very few houses were built until 10-20 years later.

Then in the 1920s most of these neighborhoods were almost completely developed, approaching 90% coverage.  The big exceptions are the Clinton area, which saw only 70-80% coverage, Arbor Villa, and the western slope of Haddon Hill.  Arbor Villa wasn't built until Francis Smith went bankrupt in the early 1930s and had to sell off his estate that spanned 5 city blocks.  The only thing remaining - sadly, his mansion was razed - is a row of palm trees on the southern edge that stretches 3 blocks long.  It wasn't until the Great Depression was truly over in 1940 that this area saw substantial development, as you can see in the plot below.

<img src="Oakland_ArborVilla.png">

By the 1950s, Arbor Villa had been almost entirely filled in.  Given the quality of the dataset - to be discussed shortly - it's hard to make statements about the individual parcels built after this time.  An interesting question is how many were rebuilt because of catastrophe (fire or the 1989 Loma Prieta earthquake), versus changing use (converting a single family home to a multi-unit building, or residential to commercial), versus homeowner whim.
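
Statements such as "approaching 90% coverage" or "almost entirely filled in" were read off the maps, but they could be quantified as the fraction of parcels built by a given year.  This is a rough sketch with made-up construction years; `coverage_by_year` is a hypothetical helper, and it only counts parcels that have a known date.

```python
def coverage_by_year(years_built, cutoff):
    """Fraction of parcels with a known construction year at or before `cutoff`."""
    known = [y for y in years_built if y is not None]
    if not known:
        return 0.0
    return sum(1 for y in known if y <= cutoff) / len(known)


# Made-up construction years for ten parcels; None marks a missing date.
sample_years = [1895, 1908, 1912, 1915, 1922, 1924, 1926, 1928, 1941, None]

print('built by 1920: {:.0%}'.format(coverage_by_year(sample_years, 1920)))
print('built by 1930: {:.0%}'.format(coverage_by_year(sample_years, 1930)))
```

Because razed and rebuilt structures carry only their latest date, a figure like this understates how built-up an area was at the time.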

So my take-away from this analysis is that houses in these districts were not built in large tracts at the same time, as suburbs were in the 1950s, but rather piecemeal over a span of 5-15 years.  A couple of questions arose: why was Clinton so sparsely developed early on?  It may be that it was fully developed, but that many buildings were replaced after the 1960s.  There are also a fair number of error parcels in this district (see below), which indicate both mixed commercial use and omis

### Data Quality

The first thing to note is that this dataset is from 2011 or earlier, and does not include any structures that have been razed in the past (or previous parcels that have been subsequently subdivided).  Arbor Villa is the most famous example, but there were surely others.  Refining the dataset to include this data would be a substantial research project in itself (and would probably involve close reading of the Sanborn maps created for insurance quotes in the 19th and early 20th centuries, though they only came out every 5-10 years).

How accurate is this data in the first place?  The Zillow information mostly comes from county sources - one can explore this on a property-by-property basis on their website.  Comparing a few cases with historical sources such as the [Oakland Wiki](https://localwiki.org/oakland "Oakland - LocalWiki") gives a sense of accuracy, or at least a closer approximation.  Here are some examples:

- 2901 Park Blvd (corner of Park and McKinley) was built around 1912 (as part of the [Mary Smith Home for Friendless Girls](https://localwiki.org/oakland/Mary_Smith_Home_for_Friendless_Girls)), while Zillow puts it at 1930.
- [1047 Bella Vista](https://oaklandwiki.org/Fenton_Home_Orphanage) was built in 1892, 18 years before Zillow claims it was.  (An interesting aside: Susan Fenton, the sister of the woman who founded Fenton's Creamery, which is still operating on Piedmont Ave and as beloved as ever, founded a Home for Destitute Children here in 1925.)
- [The Kaiser house](https://oaklandwiki.org/Kaiser_House) at 664 Haddon Road, where Henry Kaiser lived between 1925 and the mid 1940s, was built in 1924.  Zillow cites a date of 1925.

More historical citations can be found in ["An Architectural Guidebook to San Francisco and the Bay Area"](https://books.google.com/books?id=FkVQx6MWa8MC&lpg=RA2-PA120&ots=OANNWFMFSG&dq=%221047%20bella%20vista%22%20oakland&pg=RA2-PA120#v=onepage&q=%221047%20bella%20vista%22%20oakland&f=false), by Susan Dinkelspiel Cerny (2007).

Based on this comparison, it seems like houses older than around 1920 are less likely to be accurately labeled in Zillow's archives than more recent houses.
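
The three comparisons above can be put side by side with a few lines of code.  The years come from the list above: the 1912 date is approximate, and Zillow's year for 1047 Bella Vista is inferred from the "18 years before" remark.  Three houses are far too few to draw firm conclusions from.

```python
# (address, historical year, Zillow year), taken from the examples above.
comparisons = [
    ('2901 Park Blvd', 1912, 1930),      # historical date is "around 1912"
    ('1047 Bella Vista', 1892, 1910),    # Zillow year inferred as 1892 + 18
    ('664 Haddon Road', 1924, 1925),
]

for address, historical, zillow in comparisons:
    print('{}: Zillow is {} years later'.format(address, zillow - historical))

errors = [zillow - historical for _, historical, zillow in comparisons]
mean_error = sum(errors) / len(errors)
print('Mean discrepancy: {:.1f} years'.format(mean_error))
```

In all three cases Zillow's date is later than the historical one, and the two older buildings show the larger gaps.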

And finally, for reference, here is a plot of all parcels with an error, either because they're not in Zillow's database or because they're present but have an invalid date ('nan').

<img src="Oakland_2000m_errorParcels.png">

## Future Directions

There are several ways to extend this analysis:
- Look at a larger area geographically, such as all of Oakland as well as surrounding communities.  The most prominent omission right now is Piedmont, a town which never incorporated into Oakland and is just barely visible as the straight edge at the top of the larger maps.  In particular, it would be interesting to look at the flats of Berkeley, Oakland, and Emeryville that were developed by people displaced by the 1906 earthquake in San Francisco.
- Map other attributes.  The Zillow query returns several other interesting fields:

In [3]:
data_queried[32.873555]['zillow']

OrderedDict([('zpid', '24763478'),
             ('address',
              OrderedDict([('street', '684 Spruce St'),
                           ('zipcode', '94610'),
                           ('city', 'Oakland'),
                           ('state', 'CA'),
                           ('latitude', '37.806119'),
                           ('longitude', '-122.23984')])),
             ('FIPScounty', '6001'),
             ('useCode', 'MultiFamily2To4'),
             ('taxAssessmentYear', '2015'),
             ('taxAssessment', '204567.0'),
             ('yearBuilt', '1910'),
             ('lotSizeSqFt', '4400'),
             ('finishedSqFt', '700'),
             ('bathrooms', '1.0'),
             ('bedrooms', '2'),
             ('totalRooms', '8'),
             ('lastSoldDate', '02/22/1996'),
             ('lastSoldPrice',
              OrderedDict([('@currency', 'USD'), ('#text', '148000')]))])

I'd expect many of these fields to show interesting patterns on this map.  The useCode would show which areas have most rental units compared to single-family homes, and would likely give a good sense of (residential) building height in each area.
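
Before mapping `useCode`, it would help to know which values occur and how often.  The following sketch assumes records keyed by distance with a `zillow` sub-dictionary, like `data_queried` above.  The helper name, the sample records, and the `'SingleFamily'` value are illustrative assumptions; only `'MultiFamily2To4'` appears in the output shown.

```python
from collections import Counter


def count_use_codes(data_queried):
    """Tally the Zillow useCode of every queried parcel; missing codes count as 'unknown'."""
    return Counter(rec.get('zillow', {}).get('useCode', 'unknown')
                   for rec in data_queried.values())


# Illustrative records keyed by distance from the center point.
sample = {
    32.873555: {'zillow': {'useCode': 'MultiFamily2To4'}},
    41.002310: {'zillow': {'useCode': 'SingleFamily'}},
    57.118042: {'zillow': {'useCode': 'SingleFamily'}},
    60.500000: {'zillow': {}},
}

print(count_use_codes(sample).most_common())
```

The resulting categories could then be assigned one color each, instead of the binned color ramp used for the year built.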

Total rooms and finished square feet may also be good proxies for how desirable the house was when built: presumably, larger houses were built for wealthier families.  (Of course, these numbers will reflect additions and renovations in the time since, and may not be as reliable as a result).

Obviously, assessed value is a crucial variable that I have for the most part ignored, since there's already a large industry devoted to that question.

In order to follow up the observation about hilltops (in general) being developed first, the data set could also be expanded to incorporate elevation information.  This would then need to be filtered to find local maxima (hilltops), slopes (hillsides), and local minima (valleys).  Based on the results here, I can easily imagine a weak but discernible correlation ($r = 0.2$ to $0.5$) between year built and location type, though I'm not sure this trend would also hold for neighborhoods such as the Oakland Hills that likely had different time courses of development.
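
As a starting point for such a test, here is a sketch of a Pearson correlation between elevation and year built.  Both lists are made-up numbers, not data from this project, and raw elevation is only a numeric proxy: the hilltop, slope, and valley categories described above would need a different treatment.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: parcel elevation in meters and year built.
elevation_m = [12, 18, 25, 31, 40, 47]
year_built = [1925, 1921, 1923, 1912, 1915, 1908]

r = pearson_r(elevation_m, year_built)
print('r = {:.2f}'.format(r))
```

A negative value here would mean that higher parcels tend to have earlier construction dates.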

Lastly, please feel free to fork this code and play around with it in your own neighborhood.