<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#-Create-New-Dataset" data-toc-modified-id="-Create-New-Dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span> Create New Dataset</a></span><ul class="toc-item"><li><span><a href="#-Create-Dictionary-to-Hold-Data-Structure" data-toc-modified-id="-Create-Dictionary-to-Hold-Data-Structure-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span> Create Dictionary to Hold Data Structure</a></span></li><li><span><a href="#-Business-Registration" data-toc-modified-id="-Business-Registration-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span> Business Registration</a></span></li><li><span><a href="#-Evictions" data-toc-modified-id="-Evictions-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span> Evictions</a></span></li><li><span><a href="#-Meidan-Price-Per-Square-Foot" data-toc-modified-id="-Meidan-Price-Per-Square-Foot-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span> Meidan Price Per Square Foot</a></span></li><li><span><a href="#-Add-Features-to-Business-Dataset" data-toc-modified-id="-Add-Features-to-Business-Dataset-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span> Add Features to Business Dataset</a></span></li><li><span><a href="#Save-new-dataset-for-modeling" data-toc-modified-id="Save-new-dataset-for-modeling-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Save new dataset for modeling</a></span></li><li><span><a href="#Export-Dictionary" data-toc-modified-id="Export-Dictionary-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Export Dictionary</a></span></li></ul></li></ul></div>

In [1]:
import pandas as pd
import numpy as np
import json
import ast
import pickle


import os
import sys
import re

module_path = os.path.abspath(os.path.join('./lib/'))

if module_path not in sys.path:   
    sys.path.append(module_path)

from utilities import *

import datetime

import matplotlib.pyplot as plt
import seaborn as sns

import os
import sys
import re
import time
import datetime

import geopandas as gpd
from geopandas.tools import geocode
from shapely.geometry import Point

%matplotlib inline


<h2> Create New Dataset

Build a dataset where each row represents events that happened for a neighborhood for a time interval.
Start with a core dictionary of the structure:
```
{ 
    '<time period>' : {
        
            <neighborhood>  :  {
                
                   <feature name>  :  value 
                   
                   }
           }
}
           
e.g., 
            
            
{ {'2000-Q1': {'North Beach': {'bus_opened': 0, 'bus_closed': 0},  etc...
 


```

Each feature or set of features comes from a different dataset. Each dataset contained location information that needed to be cleaned and processed. Each dataset was cleaned in its own notebook.   

-  Create lat,lon from point or location data
-  Fill in missing neighborhoods based on lat,lon 
-  In some cases additional cleaning and filtering was needed.

When constructing this dataset rows for which neighborhoods could not be identified were dropped.

Create base dictionary and keys to hold the features of interest.

In [2]:
import datetime

def get_timestamp():
    #create timestamp for loggin and uniquely naming output files
    now = datetime.datetime.now()
    current_time_stamp = f'{now.year}'
    current_time_stamp += f'{now.year}'
    current_time_stamp += f'{now.month}'
    current_time_stamp += f'{now.day}'
    current_time_stamp += f'{now.second}'
    current_time_stamp += f'{now.microsecond}'
    return(current_time_stamp)

In [3]:
now = datetime.datetime.now()
this_year = current_time_stamp = f'{now.year}' 
this_year

'2019'

<h3> Create Dictionary to Hold Data Structure

In [4]:
#initialize new dictionary
sf_neighborhood_changes_dict = {}

#get current year
now = datetime.datetime.now()
this_year = now.year
#set start year
start_year = 2000
#set number of years
nyears = this_year - (start_year-2)
nyears

#create dict structure
for i in range(1,nyears):
    qdict = {}
    for j in range(1,5):
        q = f'{start_year}-Q{j}'
        #create a nested dictionary for every quarter

        ndict = {}
        for n in neighborhoods:
            #print(f'{q}, {n}')

            ndict[n] =  {'bus_opened': 0,
                         'bus_closed': 0,
                         'evictions': 0,
                         'med_price_sqft': 0
                }
            
        sf_neighborhood_changes_dict[q] = ndict
 
    start_year += 1

#sf_neighborhood_changes_dict

In [5]:
#for each row, increment the dictionary entry for that qstart by 1
def increment_metric(q,n,m):  
    #print(f'{q}, {n}, {m}')
    sf_neighborhood_changes_dict[q][n][m] += 1
    #return None

In [6]:
len(sf_neighborhood_changes_dict)

80

<h3> Business Registration

In [7]:
def reload():
    df = pd.read_csv('../tmp/reg_bus_sfonly_clean_stg8_asof20190514.csv', low_memory = False)
    #convert dates
    df['dba_start'] = pd.to_datetime(df['dba_start']) 
    df['dba_end'] = pd.to_datetime(df['dba_end'])
    df['loc_start'] = pd.to_datetime(df['loc_start'])
    df['loc_end'] = pd.to_datetime(df['loc_end'])
    return df

dfbus = reload()

In [8]:
dfbus.shape

(195200, 37)

drop missing neighborhoods. Note, for this dataset the fixed neigborhoods are in 
the 'new_neighborhoods' column. 

In [9]:
missing_neighborhood_mask = dfbus['new_neighborhood'] == '.'

In [10]:
dfbus_goodhoods = dfbus[~missing_neighborhood_mask]

In [11]:
dfbus_goodhoods.shape

(194826, 37)

In [12]:
dfbus_goodhoods.columns

Index(['zip', 'certificate_number', 'city', 'dba_name',
       'full_business_address', 'lic', 'lic_code_description', 'location',
       'mail_city', 'mail_state', 'mail_zipcode', 'mailing_address_1',
       'naic_code', 'naic_code_description', 'neighborhood', 'ownership_name',
       'parking_tax', 'state', 'supervisor_district',
       'transient_occupancy_tax', 'ttxid', 'dba_start', 'dba_end', 'loc_start',
       'loc_end', 'lat', 'lon', 'y_start', 'q_start', 'yq_start', 'y_end',
       'q_end', 'yq_end', 'dur', 'status', 'new_neighborhood',
       'neighborhood_size'],
      dtype='object')

Limit the data to the analysis years. Create a df to iterate over the start dates, then one to iterate over the end dates.

In [13]:
startedmask = dfbus_goodhoods['y_start'] > 1999
endedmask = (dfbus_goodhoods['status'] == 'closed') & (dfbus_goodhoods['y_end'] > 1999)

In [14]:
dfstarted = dfbus_goodhoods[startedmask]
dfended = dfbus_goodhoods[endedmask]

In [15]:
res = dfstarted.apply(lambda row: increment_metric(row.yq_start, row.new_neighborhood, 'bus_opened'), axis=1)
dfstarted.shape

(172794, 37)

In [16]:
#Test
sf_neighborhood_changes_dict['2018-Q1']['Mission']

{'bus_opened': 357, 'bus_closed': 0, 'evictions': 0, 'med_price_sqft': 0}

In [17]:
res = dfended.apply(lambda row: increment_metric(row.yq_end, row.new_neighborhood, 'bus_closed'), 
                    axis=1)

In [18]:
#Test
sf_neighborhood_changes_dict['2018-Q1']['Mission']

{'bus_opened': 357, 'bus_closed': 237, 'evictions': 0, 'med_price_sqft': 0}

<h3> Evictions

Read cleaned eviction data

In [19]:
dfevic = pd.read_csv('../tmp/evic_cleanstge2_201920195159439871.csv', low_memory = False)

Drop missing neighborhoods

In [20]:
missing_neighborhood_mask = dfevic['neighborhood'] == '.'
dfevic_goodhoods = dfevic[~missing_neighborhood_mask]

In [21]:
dfevic_goodhoods.shape

(39037, 35)

Limit the data to the analysis years. Create a df to iterate over the start dates, then one to iterate over the end dates.

In [22]:
dfevic_goodhoods.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39037 entries, 0 to 40448
Data columns (total 35 columns):
access_denial              39037 non-null bool
address                    39037 non-null object
breach                     39037 non-null bool
capital_improvement        39037 non-null bool
city                       39035 non-null object
client_location            39037 non-null object
condo_conversion           39037 non-null bool
constraints_date           3940 non-null object
demolition                 39037 non-null bool
development                39037 non-null bool
ellis_act_withdrawal       39037 non-null bool
eviction_id                39037 non-null object
failure_to_sign_renewal    39037 non-null bool
file_date                  39037 non-null object
good_samaritan_ends        39037 non-null bool
illegal_use                39037 non-null bool
late_payments              39037 non-null bool
lead_remediation           39037 non-null bool
neighborhood               39037 n

In [25]:
#for each row, increment the dictionary entry for that qstart by 1
def increment_metric(q,n,m):  
    #print(f'{q}, {n}, {m}')
    try:
        sf_neighborhood_changes_dict[q][n][m] += 1
    except:
        pass
        #print(f'error on: {q}, {n}, {m}')
    #return None

In [26]:
startedmask_e = dfevic_goodhoods['year'] > 1999


In [27]:
dfstarted_e = dfevic_goodhoods[startedmask_e]


In [28]:
dfstarted_e.shape

(31477, 35)

In [29]:
res = dfstarted_e.apply(lambda row: increment_metric(row.yq, row.neighborhood, 'evictions'), axis=1)

In [30]:
#Test
sf_neighborhood_changes_dict['2018-Q1']['Mission']

{'bus_opened': 357, 'bus_closed': 237, 'evictions': 55, 'med_price_sqft': 0}

<h3> Meidan Price Per Square Foot

This file was downloaded from Zillow. https://www.zillow.com/research/data/

In [31]:
dfmedpr = pd.read_csv('../data/Neighborhood_MedianValuePerSqft_allSF.csv', low_memory=False)

In [32]:
dfmedpr.head()

Unnamed: 0,RegionID,Neighborhood,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,...,2018-06,2018-07,2018-08,2018-09,2018-10,2018-11,2018-12,2019-01,2019-02,2019-03
0,275789,Castro/Upper Market,Upper Market,San Francisco,CA,San Francisco-Oakland-Hayward,San Francisco County,3364,234,234,...,1095,1097,1099,1102,1104,1103,1102,1100,1097,1089
1,272885,Bayview,Bayview,San Francisco,CA,San Francisco-Oakland-Hayward,San Francisco County,877,113,114,...,674,678,681,687,696,701,705,707,704,698
2,268020,Bernal Heights,Bernal Heights,San Francisco,CA,San Francisco-Oakland-Hayward,San Francisco County,727,174,175,...,1187,1197,1206,1210,1210,1205,1201,1200,1194,1183
3,417524,Buena Vista,Buena Vista,San Francisco,CA,San Francisco-Oakland-Hayward,San Francisco County,2016,236,237,...,1211,1223,1237,1245,1249,1247,1246,1242,1228,1213
4,417519,Corona Heights,Corona Heights,San Francisco,CA,San Francisco-Oakland-Hayward,San Francisco County,5314,254,254,...,1266,1274,1278,1284,1291,1295,1301,1307,1305,1298


This dataset has one row per neighborood.
Rename the region to neighborhood.

In [33]:
dfmedpr.rename(columns={"Neighborhood": "neighborhood"}, inplace=True)

First,  append the SizeRank directly to the business dataset.  Add it to the dictionary first, then it will be an easier transformation.

In [34]:
size_dict = {}
def add_size(n,r):
    size_dict[n] = r
    return None

res = dfmedpr.apply(lambda row: add_size(row.neighborhood, row.SizeRank), axis=1)

In [35]:
def add_size_rank_to_bus(n):
    try:
        size = size_dict[n]
    except:
        size = 0
    return size

In [36]:
import warnings;
warnings.simplefilter('ignore')

In [37]:
dfbus_goodhoods['size_rank'] = dfbus_goodhoods.apply(lambda row: 
                                    add_size_rank_to_bus(row.new_neighborhood), axis=1)   

In [38]:
#save again
dfbus_goodhoods.to_csv('../tmp/reg_bus_with_zillow_rank_asof20190515.csv', index=False)

In [39]:
dfbus_goodhoods = pd.read_csv('../tmp/reg_bus_with_zillow_rank_asof20190515.csv', low_memory = False)

In [40]:
dfbus_goodhoods['size_rank'].value_counts().head()

0       75429
235     14171
498     12861
355      6365
1124     5897
Name: size_rank, dtype: int64

The median value per quarter for this dataset is contained in columns. We will need a new function to process the  the metric.

In [41]:
start_year

2020

In [42]:
#get current year
now = datetime.datetime.now()
this_year = now.year
#set start year
start_year = 2000
#set number of years
nyears = this_year - (start_year-2)
nyears, start_year

(21, 2000)

In [43]:
dfmedpr.head()

Unnamed: 0,RegionID,neighborhood,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,...,2018-06,2018-07,2018-08,2018-09,2018-10,2018-11,2018-12,2019-01,2019-02,2019-03
0,275789,Castro/Upper Market,Upper Market,San Francisco,CA,San Francisco-Oakland-Hayward,San Francisco County,3364,234,234,...,1095,1097,1099,1102,1104,1103,1102,1100,1097,1089
1,272885,Bayview,Bayview,San Francisco,CA,San Francisco-Oakland-Hayward,San Francisco County,877,113,114,...,674,678,681,687,696,701,705,707,704,698
2,268020,Bernal Heights,Bernal Heights,San Francisco,CA,San Francisco-Oakland-Hayward,San Francisco County,727,174,175,...,1187,1197,1206,1210,1210,1205,1201,1200,1194,1183
3,417524,Buena Vista,Buena Vista,San Francisco,CA,San Francisco-Oakland-Hayward,San Francisco County,2016,236,237,...,1211,1223,1237,1245,1249,1247,1246,1242,1228,1213
4,417519,Corona Heights,Corona Heights,San Francisco,CA,San Francisco-Oakland-Hayward,San Francisco County,5314,254,254,...,1266,1274,1278,1284,1291,1295,1301,1307,1305,1298


In [44]:
def add_med_price(row): 
    
    n = row.neighborhood
    m = 'med_price_sqft'
    year = start_year
    
    for i in range(1,nyears-1):
        
        for j in range(1,5):
            q = f'{year}-Q{j}'            
            
            #if it quarter j is 1, get the value for January
            #if it is 2,3,4 get the value for April,July,October
            if j == 1:
                price_mo = f'{year}-01'
            elif j == 2:
                price_mo = f'{year}-04'
            elif j == 3:
                price_mo = f'{year}-07'
            else:
                price_mo = f'{year}-04'                

            price = row[price_mo]
            
            
            try:
                sf_neighborhood_changes_dict[q][n][m] = price
            except:
                pass
                #print(f'error on: {q}, {n}, {m}')
                
                
        year +=1 

In [45]:
res = dfmedpr.apply(lambda x: add_med_price(x), axis = 1)

In [46]:
#Test
sf_neighborhood_changes_dict['2018-Q1']['Mission']

{'bus_opened': 357, 'bus_closed': 237, 'evictions': 55, 'med_price_sqft': 1111}

<h3> Add Features to Business Dataset

In [47]:
dfbus_goodhoods.shape

(194826, 38)

In [48]:
started_2000_through_2018 = (dfbus_goodhoods['y_start'] > 1999) & (dfbus_goodhoods['y_start'] < 2019) 

In [49]:
df_recent = dfbus_goodhoods[started_2000_through_2018]

In [50]:
df_recent.shape

(168692, 38)

In [51]:
df_recent.head()

Unnamed: 0,zip,certificate_number,city,dba_name,full_business_address,lic,lic_code_description,location,mail_city,mail_state,...,q_start,yq_start,y_end,q_end,yq_end,dur,status,new_neighborhood,neighborhood_size,size_rank
5,94133.0,86,San Francisco,1601 Grant Parking,1601 Grant Ave,.,.,"{'latitude': '37.801724', 'longitude': '-122.4...",,,...,1,2013-Q1,2016,2.0,2016-Q2,13,closed,North Beach,Medium,5770
7,94118.0,189,San Francisco,Abbey Carpet,2900 Geary Blvd,.,.,"{'latitude': '37.782133', 'longitude': '-122.4...",San Francisco,CA,...,3,2012-Q3,9999,0.0,,27,open,Presidio Heights,Medium,3259
8,94118.0,189,San Francisco,Abbey Carpet Of San Francisco,2900 Geary Blvd,.,.,"{'latitude': '37.782133', 'longitude': '-122.4...",San Francisco,CA,...,4,2013-Q4,9999,0.0,,22,open,Presidio Heights,Medium,3259
10,94124.0,216,San Francisco,Abc Insurance,1727 Oakdale Ave,.,.,"{'latitude': '37.736139', 'longitude': '-122.3...",,,...,4,2002-Q4,2018,2.0,2018-Q2,62,closed,Bayview Hunters Point,Medium,0
14,94107.0,244,San Francisco,Able Services,868 Folsom St,.,.,"{'latitude': '37.780746', 'longitude': '-122.4...",San Francisco,CA,...,2,2005-Q2,9999,0.0,,56,open,South of Market,Dense,498


In [52]:
df_recent.tail()

Unnamed: 0,zip,certificate_number,city,dba_name,full_business_address,lic,lic_code_description,location,mail_city,mail_state,...,q_start,yq_start,y_end,q_end,yq_end,dur,status,new_neighborhood,neighborhood_size,size_rank
194808,94108.0,1087955,San Francisco,Wenjing.com,1010 Stockton St,.,.,,San Francisco,CA,...,2,2018-Q2,9999,0.0,,4,open,Chinatown,Medium,0
194813,94109.0,1101492,San Francisco,Likewhereyoulive,891 Beach St,.,.,,,,...,3,2014-Q3,2019,2.0,2019-Q2,21,closed,Aquatic Park / Ft. Mason,Sparse,0
194817,94112.0,1101534,San Francisco,Quad Coin Llc,1551 Ocean Ave,.,.,,San Diego,CA,...,3,2018-Q3,9999,0.0,,3,open,Ingleside,Sparse,5192
194824,94104.0,167146,San Francisco,Houlihan Lokey Financial Advisors Inc,1 Sansome St #1700,.,.,,Los Angeles,CA,...,2,2010-Q2,9999,0.0,,36,open,Financial District,Medium,4444
194825,94124.0,1101543,San Francisco,Global Team Corporation,2910 Griffith,.,.,,San Francisco,CA,...,2,2006-Q2,9999,0.0,,52,open,Bret Harte,Sparse,0


In [53]:
def add_features_from_dict(dict, n, yq):
    bo = dict[yq][n]['bus_opened'] 
    bc = dict[yq][n]['bus_closed'] 
    e = dict[yq][n]['evictions']
    mp = dict[yq][n]['med_price_sqft']
    #print(f'{bo}, {bc}, {e}, {mp}')
    return bo, bc, e, mp

In [54]:
df_recent[['bus_opened', 'bus_closed', 'evictions', 'med_pr_sqft']] = df_recent.apply(lambda row: 
                            add_features_from_dict(sf_neighborhood_changes_dict,
                                                  row.new_neighborhood,
                                                  row.yq_start),                                                          
                                                  axis=1, result_type='expand')




In [55]:
df_recent.head()

Unnamed: 0,zip,certificate_number,city,dba_name,full_business_address,lic,lic_code_description,location,mail_city,mail_state,...,yq_end,dur,status,new_neighborhood,neighborhood_size,size_rank,bus_opened,bus_closed,evictions,med_pr_sqft
5,94133.0,86,San Francisco,1601 Grant Parking,1601 Grant Ave,.,.,"{'latitude': '37.801724', 'longitude': '-122.4...",,,...,2016-Q2,13,closed,North Beach,Medium,5770,89,15,18,755
7,94118.0,189,San Francisco,Abbey Carpet,2900 Geary Blvd,.,.,"{'latitude': '37.782133', 'longitude': '-122.4...",San Francisco,CA,...,,27,open,Presidio Heights,Medium,3259,36,6,4,970
8,94118.0,189,San Francisco,Abbey Carpet Of San Francisco,2900 Geary Blvd,.,.,"{'latitude': '37.782133', 'longitude': '-122.4...",San Francisco,CA,...,,22,open,Presidio Heights,Medium,3259,35,39,4,1088
10,94124.0,216,San Francisco,Abc Insurance,1727 Oakdale Ave,.,.,"{'latitude': '37.736139', 'longitude': '-122.3...",,,...,2018-Q2,62,closed,Bayview Hunters Point,Medium,0,28,0,11,0
14,94107.0,244,San Francisco,Able Services,868 Folsom St,.,.,"{'latitude': '37.780746', 'longitude': '-122.4...",San Francisco,CA,...,,56,open,South of Market,Dense,498,45,2,24,668


<h3>Save new dataset for modeling

In [56]:
#save new dataset for modeling
df_recent.to_csv('../tmp/recent_bus_with_features_20190515.csv', index=False)

In [57]:
df_recent = pd.read_csv('../tmp/recent_bus_with_features_20190515.csv', low_memory = False)

<h3>Export Dictionary

In [58]:
#https://pythonspot.com/save-a-dictionary-to-a-file/
import pickle
#dict = {'Python' : '.py', 'C++' : '.cpp', 'Java' : '.java'}
f = open('../data/sf_neighborhood_changes_dict.pkl','wb')
pickle.dump(sf_neighborhood_changes_dict,f)
f.close()