<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Get-Data" data-toc-modified-id="Get-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Get Data</a></span></li><li><span><a href="#Profile-the-Dataset" data-toc-modified-id="Profile-the-Dataset-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Profile the Dataset</a></span></li><li><span><a href="#Clean-the-Data" data-toc-modified-id="Clean-the-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Clean the Data</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#-Fill-in-Missing-Neighborhood-Values" data-toc-modified-id="-Fill-in-Missing-Neighborhood-Values-4.0.1"><span class="toc-item-num">4.0.1&nbsp;&nbsp;</span> Fill in Missing Neighborhood Values</a></span></li><li><span><a href="#-Get-Year-and-Quarter" data-toc-modified-id="-Get-Year-and-Quarter-4.0.2"><span class="toc-item-num">4.0.2&nbsp;&nbsp;</span> Get Year and Quarter</a></span></li><li><span><a href="#Save" data-toc-modified-id="Save-4.0.3"><span class="toc-item-num">4.0.3&nbsp;&nbsp;</span>Save</a></span></li></ul></li></ul></li></ul></div>

<a id="section__top"></a>

<h2>Imports</h2>

In [51]:
import pandas as pd
import numpy as np
import pandas_profiling
import time
import datetime

import os
import sys
import re

module_path = os.path.abspath(os.path.join('./lib/'))

if module_path not in sys.path:   
    sys.path.append(module_path)
    
from utilities import *
from sodapy_dataset_reader import *

The sodapy_dataset_reader module contains classes that use the
__[sodapy library](https://pypi.org/project/sodapy/)__
to interface with the __[Socrata API](https://dev.socrata.com)__.
<br>
Many open data platforms are built on the Socrata Open Data API (SODA), including
__[SFData.gov](https://data.sfgov.org/)__, which is the data source for this project.

<h2>Get Data</h2>

In [2]:
#data source
source_domain = 'data.sfgov.org'
#dataset: eviction notices
dataset_id = '5cei-gny5'

#Instantiate Reader
spr = SodapyDatasetReader(source_domain, dataset_id)

#Get Meta Data
md = spr.get_metadata()

#get and print rowcount
rowcount = int(spr.get_row_count()[0]['count'])
print(f'Dataset contains {rowcount} rows')

#set limit to rowcount
df = spr.get_df(limit=rowcount)
df.shape

Dataset contains 40449 rows


(40449, 42)
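
SodapyDatasetReader is a project-local wrapper and its source is not shown here. As a rough sketch of what a row-count query and a paged download can look like at the HTTP level, the snippet below builds SODA request URLs by hand. It assumes the standard `/resource/<dataset_id>.json` endpoint and the `$select`, `$limit`, and `$offset` query parameters; the helper names `build_soda_url` and `page_urls` and the page size of 10,000 are illustrative and not part of the project.

```python
from urllib.parse import urlencode

def build_soda_url(domain, dataset_id, **soql):
    """Build a SODA resource URL; keyword names become $-prefixed parameters."""
    base = f'https://{domain}/resource/{dataset_id}.json'
    if not soql:
        return base
    params = {f'${name}': value for name, value in soql.items()}
    # Keep $, parentheses and * readable instead of percent-encoding them.
    query = urlencode(params, safe='$()*')
    return f'{base}?{query}'

def page_urls(domain, dataset_id, total_rows, page_size=10000):
    """One URL per page, stepping the offset until total_rows is covered."""
    return [build_soda_url(domain, dataset_id, limit=page_size, offset=offset)
            for offset in range(0, total_rows, page_size)]

# Hypothetical usage with the domain, dataset id and row count shown above.
count_url = build_soda_url('data.sfgov.org', '5cei-gny5', select='count(*)')
pages = page_urls('data.sfgov.org', '5cei-gny5', total_rows=40449)
print(count_url)
print(len(pages), pages[-1])
```

Requesting the full row count as the limit, as the cell above does, fetches everything in one call; paging is an alternative when a dataset is too large for a single request.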

In [3]:
#md

<h2>Profile the Dataset</h2>

pandas-profiling generates a report that expands on the initial data exploration
steps typically performed when a new dataset is first loaded. Code and documentation can be found here:
__[Pandas Profiling](https://github.com/pandas-profiling/pandas-profiling)__

In [4]:
profile = pandas_profiling.ProfileReport(df)

In [5]:
display(profile)

0,1
Number of variables,42
Number of observations,40449
Total Missing (%),9.8%
Total size in memory,7.8 MiB
Average record size in memory,203.0 B

0,1
Numeric,0
Categorical,21
Boolean,19
Date,0
Text (Unique),1
Rejected,0
Unsupported,1

0,1
Distinct count,12
Unique (%),0.0%
Missing (%),3.5%
Missing (n),1412

0,1
5,5181
11,4582
2,4518
Other values (8),24756

Value,Count,Frequency (%),Unnamed: 3
5,5181,12.8%,
11,4582,11.3%,
2,4518,11.2%,
10,4127,10.2%,
3,3688,9.1%,
4,3292,8.1%,
6,3279,8.1%,
8,3107,7.7%,
7,2541,6.3%,
1,2391,5.9%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),91.6%
Missing (n),37058

0,1
1,3391
(Missing),37058

Value,Count,Frequency (%),Unnamed: 3
1,3391,8.4%,
(Missing),37058,91.6%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),3.5%
Missing (n),1412

0,1
1,23463
2,15574
(Missing),1412

Value,Count,Frequency (%),Unnamed: 3
1,23463,58.0%,
2,15574,38.5%,
(Missing),1412,3.5%,

0,1
Distinct count,115
Unique (%),0.3%
Missing (%),3.5%
Missing (n),1412

0,1
53,3049
20,1936
39,1898
Other values (111),32154

Value,Count,Frequency (%),Unnamed: 3
53,3049,7.5%,
20,1936,4.8%,
39,1898,4.7%,
32,1586,3.9%,
42,1579,3.9%,
8,1393,3.4%,
5,1300,3.2%,
37,1014,2.5%,
102,1000,2.5%,
16,828,2.0%,

0,1
Distinct count,16
Unique (%),0.0%
Missing (%),91.1%
Missing (n),36867

0,1
6,1577
7,819
8,313
Other values (12),873
(Missing),36867

Value,Count,Frequency (%),Unnamed: 3
6,1577,3.9%,
7,819,2.0%,
8,313,0.8%,
10,242,0.6%,
9,157,0.4%,
5,134,0.3%,
15,80,0.2%,
4,52,0.1%,
12,41,0.1%,
14,39,0.1%,

0,1
Distinct count,42
Unique (%),0.1%
Missing (%),3.5%
Missing (n),1412

0,1
20,4510
36,2545
35,2538
Other values (38),29444

Value,Count,Frequency (%),Unnamed: 3
20,4510,11.1%,
36,2545,6.3%,
35,2538,6.3%,
29,2039,5.0%,
5,1814,4.5%,
16,1621,4.0%,
9,1472,3.6%,
34,1420,3.5%,
21,1378,3.4%,
3,1293,3.2%,

0,1
Distinct count,29
Unique (%),0.1%
Missing (%),3.5%
Missing (n),1400

0,1
28859,4881
28858,2853
29492,2838
Other values (25),28477

Value,Count,Frequency (%),Unnamed: 3
28859,4881,12.1%,
28858,2853,7.1%,
29492,2838,7.0%,
28852,2558,6.3%,
28861,2522,6.2%,
28862,2448,6.1%,
56,2392,5.9%,
28853,2217,5.5%,
64,1978,4.9%,
54,1883,4.7%,

0,1
Distinct count,42
Unique (%),0.1%
Missing (%),3.5%
Missing (n),1412

0,1
19,4510
36,2545
35,2538
Other values (38),29444

Value,Count,Frequency (%),Unnamed: 3
19,4510,11.1%,
36,2545,6.3%,
35,2538,6.3%,
26,2039,5.0%,
3,1814,4.5%,
14,1621,4.0%,
10,1472,3.6%,
34,1420,3.5%,
21,1378,3.4%,
9,1293,3.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),91.6%
Missing (n),37055

0,1
1,3394
(Missing),37055

Value,Count,Frequency (%),Unnamed: 3
1,3394,8.4%,
(Missing),37055,91.6%,

0,1
Distinct count,11
Unique (%),0.0%
Missing (%),3.5%
Missing (n),1417

0,1
4,6277
8,6115
7,4653
Other values (7),21987

Value,Count,Frequency (%),Unnamed: 3
4,6277,15.5%,
8,6115,15.1%,
7,4653,11.5%,
9,4361,10.8%,
6,4278,10.6%,
5,3956,9.8%,
1,3591,8.9%,
3,2054,5.1%,
2,2031,5.0%,
10,1716,4.2%,

0,1
Distinct count,11
Unique (%),0.0%
Missing (%),3.5%
Missing (n),1417

0,1
3,6274
10,6115
4,4690
Other values (7),21953

Value,Count,Frequency (%),Unnamed: 3
3,6274,15.5%,
10,6115,15.1%,
4,4690,11.6%,
9,4646,11.5%,
8,4026,10.0%,
6,3713,9.2%,
7,3639,9.0%,
5,2650,6.6%,
2,2047,5.1%,
1,1232,3.0%,

0,1
Distinct count,12
Unique (%),0.0%
Missing (%),3.5%
Missing (n),1412

0,1
5,5181
11,4582
7,4518
Other values (8),24756

Value,Count,Frequency (%),Unnamed: 3
5,5181,12.8%,
11,4582,11.3%,
7,4518,11.2%,
9,4127,10.2%,
10,3688,9.1%,
2,3292,8.1%,
1,3279,8.1%,
4,3107,7.7%,
3,2541,6.3%,
6,2391,5.9%,

0,1
Distinct count,16
Unique (%),0.0%
Missing (%),3.6%
Missing (n),1469

0,1
2,6265
1,5131
15,4413
Other values (12),23171

Value,Count,Frequency (%),Unnamed: 3
2,6265,15.5%,
1,5131,12.7%,
15,4413,10.9%,
9,3787,9.4%,
11,3688,9.1%,
13,3511,8.7%,
8,2584,6.4%,
7,2438,6.0%,
10,2267,5.6%,
5,1720,4.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0077134

0,1
False,40137
True,312

Value,Count,Frequency (%),Unnamed: 3
False,40137,99.2%,
True,312,0.8%,

0,1
Distinct count,7613
Unique (%),18.8%
Missing (%),0.0%
Missing (n),0

0,1
1100 Block Of Market Street,399
300 Block Of Arballo Drive,216
1300 Block Of Market Street,197
Other values (7610),39637

Value,Count,Frequency (%),Unnamed: 3
1100 Block Of Market Street,399,1.0%,
300 Block Of Arballo Drive,216,0.5%,
1300 Block Of Market Street,197,0.5%,
700 Block Of Gonzalez Drive,181,0.4%,
100 Block Of Font Boulevard,172,0.4%,
0 Block Of Chumasero Drive,120,0.3%,
1000 Block Of Market Street,118,0.3%,
300 Block Of Eddy Street,105,0.3%,
0 Block Of Turk Street,102,0.3%,
300 Block Of Serrano Drive,99,0.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.2248

0,1
False,31356
True,9093

Value,Count,Frequency (%),Unnamed: 3
False,31356,77.5%,
True,9093,22.5%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.038715

0,1
False,38883
True,1566

Value,Count,Frequency (%),Unnamed: 3
False,38883,96.1%,
True,1566,3.9%,

0,1
Distinct count,14
Unique (%),0.0%
Missing (%),0.0%
Missing (n),2

0,1
San Francisco,40430
San Franicsco,4
San Franisco,2
Other values (10),11

Value,Count,Frequency (%),Unnamed: 3
San Francisco,40430,100.0%,
San Franicsco,4,0.0%,
San Franisco,2,0.0%,
Sn Francisco,2,0.0%,
San Francisc,1,0.0%,
158an Francisco,1,0.0%,
San ‘Francisco,1,0.0%,
San Frncisco,1,0.0%,
San Francicso,1,0.0%,
San Francisoc,1,0.0%,

Unsupported value

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0028431

0,1
False,40334
True,115

Value,Count,Frequency (%),Unnamed: 3
False,40334,99.7%,
True,115,0.3%,

0,1
Distinct count,2513
Unique (%),6.2%
Missing (%),90.1%
Missing (n),36428

0,1
2005-08-01T00:00:00.000,9
2021-06-20T00:00:00.000,9
2007-11-12T00:00:00.000,7
Other values (2509),3996
(Missing),36428

Value,Count,Frequency (%),Unnamed: 3
2005-08-01T00:00:00.000,9,0.0%,
2021-06-20T00:00:00.000,9,0.0%,
2007-11-12T00:00:00.000,7,0.0%,
2022-04-29T00:00:00.000,7,0.0%,
2018-06-17T00:00:00.000,7,0.0%,
2018-05-01T00:00:00.000,7,0.0%,
2021-04-26T00:00:00.000,7,0.0%,
2005-11-15T00:00:00.000,7,0.0%,
2018-05-04T00:00:00.000,7,0.0%,
2007-06-01T00:00:00.000,7,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.028703

0,1
False,39288
True,1161

Value,Count,Frequency (%),Unnamed: 3
False,39288,97.1%,
True,1161,2.9%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0089743

0,1
False,40086
True,363

Value,Count,Frequency (%),Unnamed: 3
False,40086,99.1%,
True,363,0.9%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.096368

0,1
False,36551
True,3898

Value,Count,Frequency (%),Unnamed: 3
False,36551,90.4%,
True,3898,9.6%,

First 3 values

Last 3 values

Value,Count,Frequency (%),Unnamed: 3
AL2K0014,1,0.0%,
E2K2137,1,0.0%,
E980001,1,0.0%,
E980002,1,0.0%,
E980003,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
S001123,1,0.0%,
S001124,1,0.0%,
S001125,1,0.0%,
T2K2040,1,0.0%,
on at lease one occasion,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0023239

0,1
False,40355
True,94

Value,Count,Frequency (%),Unnamed: 3
False,40355,99.8%,
True,94,0.2%,

0,1
Distinct count,5502
Unique (%),13.6%
Missing (%),0.0%
Missing (n),0

0,1
2012-08-16T00:00:00.000,240
2016-02-05T00:00:00.000,232
2010-09-24T00:00:00.000,112
Other values (5499),39865

Value,Count,Frequency (%),Unnamed: 3
2012-08-16T00:00:00.000,240,0.6%,
2016-02-05T00:00:00.000,232,0.6%,
2010-09-24T00:00:00.000,112,0.3%,
1999-10-12T00:00:00.000,55,0.1%,
2010-04-19T00:00:00.000,53,0.1%,
1999-10-04T00:00:00.000,50,0.1%,
2009-09-14T00:00:00.000,49,0.1%,
2013-10-07T00:00:00.000,49,0.1%,
2011-09-22T00:00:00.000,48,0.1%,
1998-09-15T00:00:00.000,44,0.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.00019778

0,1
False,40441
True,8

Value,Count,Frequency (%),Unnamed: 3
False,40441,100.0%,
True,8,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.021879

0,1
False,39564
True,885

Value,Count,Frequency (%),Unnamed: 3
False,39564,97.8%,
True,885,2.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.040495

0,1
False,38811
True,1638

Value,Count,Frequency (%),Unnamed: 3
False,38811,96.0%,
True,1638,4.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.00019778

0,1
False,40441
True,8

Value,Count,Frequency (%),Unnamed: 3
False,40441,100.0%,
True,8,0.0%,

0,1
Distinct count,42
Unique (%),0.1%
Missing (%),3.5%
Missing (n),1412

0,1
Mission,4383
Tenderloin,2545
Sunset/Parkside,2538
Other values (38),29571

Value,Count,Frequency (%),Unnamed: 3
Mission,4383,10.8%,
Tenderloin,2545,6.3%,
Sunset/Parkside,2538,6.3%,
Outer Richmond,2039,5.0%,
Castro/Upper Market,1814,4.5%,
Lakeshore,1621,4.0%,
Hayes Valley,1472,3.6%,
South of Market,1420,3.5%,
Nob Hill,1378,3.4%,
Haight Ashbury,1293,3.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.059334

0,1
False,38049
True,2400

Value,Count,Frequency (%),Unnamed: 3
False,38049,94.1%,
True,2400,5.9%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.16562

0,1
False,33750
True,6699

Value,Count,Frequency (%),Unnamed: 3
False,33750,83.4%,
True,6699,16.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.033746

0,1
False,39084
True,1365

Value,Count,Frequency (%),Unnamed: 3
False,39084,96.6%,
True,1365,3.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.24915

0,1
False,30371
True,10078

Value,Count,Frequency (%),Unnamed: 3
False,30371,75.1%,
True,10078,24.9%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.03345

0,1
False,39096
True,1353

Value,Count,Frequency (%),Unnamed: 3
False,39096,96.7%,
True,1353,3.3%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),2

0,1
CA,40443
Ca,4
(Missing),2

Value,Count,Frequency (%),Unnamed: 3
CA,40443,100.0%,
Ca,4,0.0%,
(Missing),2,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0020272

0,1
False,40367
True,82

Value,Count,Frequency (%),Unnamed: 3
False,40367,99.8%,
True,82,0.2%,

0,1
Distinct count,12
Unique (%),0.0%
Missing (%),3.5%
Missing (n),1412

0,1
8,5181
5,4582
9,4518
Other values (8),24756

Value,Count,Frequency (%),Unnamed: 3
8,5181,12.8%,
5,4582,11.3%,
9,4518,11.2%,
6,4127,10.2%,
3,3688,9.1%,
1,3292,8.1%,
2,3279,8.1%,
7,3107,7.7%,
4,2541,6.3%,
11,2391,5.9%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.018319

0,1
False,39708
True,741

Value,Count,Frequency (%),Unnamed: 3
False,39708,98.2%,
True,741,1.8%,

0,1
Distinct count,45
Unique (%),0.1%
Missing (%),1.7%
Missing (n),682

0,1
94110,4899
94109,2824
94117,2772
Other values (41),29272

Value,Count,Frequency (%),Unnamed: 3
94110,4899,12.1%,
94109,2824,7.0%,
94117,2772,6.9%,
94112,2641,6.5%,
94102,2515,6.2%,
94122,2439,6.0%,
94103,2369,5.9%,
94114,2293,5.7%,
94132,2053,5.1%,
94118,1948,4.8%,

Unnamed: 0,:@computed_region_26cr_cadq,:@computed_region_6ezc_tdp2,:@computed_region_6pnf_4xz7,:@computed_region_6qbp_sg9q,:@computed_region_9jxd_iqea,:@computed_region_ajp5_b2md,:@computed_region_bh8s_q3mv,:@computed_region_fyvs_ahh9,:@computed_region_h4ep_8xdi,:@computed_region_p5aj_wyqh,:@computed_region_qgnn_b9vv,:@computed_region_rxqg_mtj9,:@computed_region_yftq_j783,access_denial,address,breach,capital_improvement,city,client_location,condo_conversion,constraints_date,demolition,development,ellis_act_withdrawal,eviction_id,failure_to_sign_renewal,file_date,good_samaritan_ends,illegal_use,late_payments,lead_remediation,neighborhood,non_payment,nuisance,other_cause,owner_move_in,roommate_same_unit,state,substantial_rehab,supervisor_district,unapproved_subtenant,zip
0,8,,1,71,,41,28861,40,,7,9,4,9,False,700 Block Of Faxon Avenue,False,False,San Francisco,"{'latitude': '37.726686880000024', 'longitude'...",False,2024-03-29T00:00:00.000,False,False,False,M190675,False,2019-03-29T00:00:00.000,False,False,False,False,West of Twin Peaks,False,False,False,True,False,CA,False,7,False,94112
1,10,1.0,2,20,6.0,36,28852,36,1.0,10,5,9,14,False,100 Block Of Turk Street,False,False,San Francisco,"{'latitude': '37.78316342723367', 'longitude':...",False,,False,False,False,M190653,False,2019-03-29T00:00:00.000,False,False,False,False,Tenderloin,False,True,False,False,False,CA,False,6,False,94102
2,3,,2,99,,23,308,23,,1,6,10,3,False,200 Block Of Bay Street,False,False,San Francisco,"{'latitude': '37.805980597572834', 'longitude'...",False,,False,False,False,M190654,True,2019-03-29T00:00:00.000,False,False,False,False,North Beach,False,False,False,False,False,CA,False,3,False,94133
3,3,,2,99,,23,308,23,,1,6,10,3,False,200 Block Of Bay Street,False,False,San Francisco,"{'latitude': '37.805980597572834', 'longitude'...",False,,False,False,False,M190656,True,2019-03-29T00:00:00.000,False,False,False,False,North Beach,False,False,False,False,False,CA,False,3,False,94133
4,3,,2,99,,23,308,23,,1,6,10,3,False,200 Block Of North Point Street,False,False,San Francisco,"{'latitude': '37.80680623758749', 'longitude':...",False,,False,False,False,M190655,True,2019-03-29T00:00:00.000,False,False,False,False,North Beach,False,False,False,False,False,CA,False,3,False,94133


In [6]:
#output to file...
fn = f'profile-{dataset_id}.html'
#use the generated file name; the original wrote to a file literally named 'fn'
profile.to_file(outputfile=f'../tmp/{fn}')

<h2>Clean the Data</h2>

[back to top](#section__top)

In [7]:
df.head()

Unnamed: 0,:@computed_region_26cr_cadq,:@computed_region_6ezc_tdp2,:@computed_region_6pnf_4xz7,:@computed_region_6qbp_sg9q,:@computed_region_9jxd_iqea,:@computed_region_ajp5_b2md,:@computed_region_bh8s_q3mv,:@computed_region_fyvs_ahh9,:@computed_region_h4ep_8xdi,:@computed_region_p5aj_wyqh,...,non_payment,nuisance,other_cause,owner_move_in,roommate_same_unit,state,substantial_rehab,supervisor_district,unapproved_subtenant,zip
0,8,,1,71,,41,28861,40,,7,...,False,False,False,True,False,CA,False,7,False,94112
1,10,1.0,2,20,6.0,36,28852,36,1.0,10,...,False,True,False,False,False,CA,False,6,False,94102
2,3,,2,99,,23,308,23,,1,...,False,False,False,False,False,CA,False,3,False,94133
3,3,,2,99,,23,308,23,,1,...,False,False,False,False,False,CA,False,3,False,94133
4,3,,2,99,,23,308,23,,1,...,False,False,False,False,False,CA,False,3,False,94133


In [8]:
df.shape

(40449, 42)

In [9]:
df.columns

Index([':@computed_region_26cr_cadq', ':@computed_region_6ezc_tdp2',
       ':@computed_region_6pnf_4xz7', ':@computed_region_6qbp_sg9q',
       ':@computed_region_9jxd_iqea', ':@computed_region_ajp5_b2md',
       ':@computed_region_bh8s_q3mv', ':@computed_region_fyvs_ahh9',
       ':@computed_region_h4ep_8xdi', ':@computed_region_p5aj_wyqh',
       ':@computed_region_qgnn_b9vv', ':@computed_region_rxqg_mtj9',
       ':@computed_region_yftq_j783', 'access_denial', 'address', 'breach',
       'capital_improvement', 'city', 'client_location', 'condo_conversion',
       'constraints_date', 'demolition', 'development', 'ellis_act_withdrawal',
       'eviction_id', 'failure_to_sign_renewal', 'file_date',
       'good_samaritan_ends', 'illegal_use', 'late_payments',
       'lead_remediation', 'neighborhood', 'non_payment', 'nuisance',
       'other_cause', 'owner_move_in', 'roommate_same_unit', 'state',
       'substantial_rehab', 'supervisor_district', 'unapproved_subtenant',
       'zip'],
      dtype='object')

Drop the computed columns, as described in the Socrata support article linked here:
https://support.socrata.com/hc/en-us/articles/360007155973-Handling-Computed-Columns-with-FME
Datasets that have location data that intersects a Spatial Lens Boundary will have computed columns. These columns are created and curated by the Socrata platform, not the user. They are used to perform the geographic join between the geocoded row and the underlying spatial lens polygon. The column names begin with the prefix :@computed_region_.

In [10]:
df = df.drop([':@computed_region_26cr_cadq', ':@computed_region_6ezc_tdp2',
       ':@computed_region_6pnf_4xz7', ':@computed_region_6qbp_sg9q',
       ':@computed_region_9jxd_iqea', ':@computed_region_ajp5_b2md',
       ':@computed_region_bh8s_q3mv', ':@computed_region_fyvs_ahh9',
       ':@computed_region_h4ep_8xdi', ':@computed_region_p5aj_wyqh',
       ':@computed_region_qgnn_b9vv', ':@computed_region_rxqg_mtj9',
       ':@computed_region_yftq_j783'], axis = 1)

df.shape

(40449, 29)
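
The explicit list above works, but it has to be kept in sync with whatever computed regions the platform attaches. Since all of these columns share the `:@computed_region_` prefix, the list could also be derived from the column names. A minimal sketch in plain Python follows; `split_computed_columns` is a hypothetical helper name and the sample column list is abbreviated.

```python
COMPUTED_PREFIX = ':@computed_region_'

def split_computed_columns(columns, prefix=COMPUTED_PREFIX):
    """Separate platform-generated column names from the dataset's own columns."""
    computed = [c for c in columns if c.startswith(prefix)]
    kept = [c for c in columns if not c.startswith(prefix)]
    return computed, kept

# Abbreviated sample of the column names listed above.
sample_columns = [':@computed_region_26cr_cadq', ':@computed_region_yftq_j783',
                  'address', 'file_date', 'zip']
computed, kept = split_computed_columns(sample_columns)
print(computed)
print(kept)
```

With the DataFrame in this notebook, the first list could be passed to `df.drop(columns=...)` in place of the hand-written list.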

In [11]:
df.head()

Unnamed: 0,access_denial,address,breach,capital_improvement,city,client_location,condo_conversion,constraints_date,demolition,development,...,non_payment,nuisance,other_cause,owner_move_in,roommate_same_unit,state,substantial_rehab,supervisor_district,unapproved_subtenant,zip
0,False,700 Block Of Faxon Avenue,False,False,San Francisco,"{'latitude': '37.726686880000024', 'longitude'...",False,2024-03-29T00:00:00.000,False,False,...,False,False,False,True,False,CA,False,7,False,94112
1,False,100 Block Of Turk Street,False,False,San Francisco,"{'latitude': '37.78316342723367', 'longitude':...",False,,False,False,...,False,True,False,False,False,CA,False,6,False,94102
2,False,200 Block Of Bay Street,False,False,San Francisco,"{'latitude': '37.805980597572834', 'longitude'...",False,,False,False,...,False,False,False,False,False,CA,False,3,False,94133
3,False,200 Block Of Bay Street,False,False,San Francisco,"{'latitude': '37.805980597572834', 'longitude'...",False,,False,False,...,False,False,False,False,False,CA,False,3,False,94133
4,False,200 Block Of North Point Street,False,False,San Francisco,"{'latitude': '37.80680623758749', 'longitude':...",False,,False,False,...,False,False,False,False,False,CA,False,3,False,94133


In [12]:
df.isnull().sum()

access_denial                  0
address                        0
breach                         0
capital_improvement            0
city                           2
client_location             1400
condo_conversion               0
constraints_date           36428
demolition                     0
development                    0
ellis_act_withdrawal           0
eviction_id                    0
failure_to_sign_renewal        0
file_date                      0
good_samaritan_ends            0
illegal_use                    0
late_payments                  0
lead_remediation               0
neighborhood                1412
non_payment                    0
nuisance                       0
other_cause                    0
owner_move_in                  0
roommate_same_unit             0
state                          2
substantial_rehab              0
supervisor_district         1412
unapproved_subtenant           0
zip                          682
dtype: int64

Examine the neighborhoods. They can be matched with those in the main dataset. 

In [13]:
neighborhoods = set(df.neighborhood)

In [14]:
neighborhoods

{'Bayview Hunters Point',
 'Bernal Heights',
 'Castro/Upper Market',
 'Chinatown',
 'Excelsior',
 'Financial District/South Beach',
 'Glen Park',
 'Golden Gate Park',
 'Haight Ashbury',
 'Hayes Valley',
 'Inner Richmond',
 'Inner Sunset',
 'Japantown',
 'Lakeshore',
 'Lincoln Park',
 'Lone Mountain/USF',
 'Marina',
 'McLaren Park',
 'Mission',
 'Mission Bay',
 'Nob Hill',
 'Noe Valley',
 'North Beach',
 'Oceanview/Merced/Ingleside',
 'Outer Mission',
 'Outer Richmond',
 'Pacific Heights',
 'Portola',
 'Potrero Hill',
 'Presidio',
 'Presidio Heights',
 'Russian Hill',
 'Seacliff',
 'South of Market',
 'Sunset/Parkside',
 'Tenderloin',
 'Treasure Island',
 'Twin Peaks',
 'Visitacion Valley',
 'West of Twin Peaks',
 'Western Addition',
 nan}

Fill missing values for columns we are likely to use in the analysis. Do not drop any for now.

In [23]:
fill_cols = {'neighborhood':'.', 
             'supervisor_district':'.',
             'zip':'.'
            }

In [25]:
#df.fillna(fill_cols).isnull().sum()
df.fillna(fill_cols, inplace=True)

In [26]:
df.isnull().sum()

access_denial                  0
address                        0
breach                         0
capital_improvement            0
city                           2
client_location             1400
condo_conversion               0
constraints_date           36428
demolition                     0
development                    0
ellis_act_withdrawal           0
eviction_id                    0
failure_to_sign_renewal        0
file_date                      0
good_samaritan_ends            0
illegal_use                    0
late_payments                  0
lead_remediation               0
neighborhood                   0
non_payment                    0
nuisance                       0
other_cause                    0
owner_move_in                  0
roommate_same_unit             0
state                          2
substantial_rehab              0
supervisor_district            0
unapproved_subtenant           0
zip                            0
dtype: int64

Examine the city names. The column contains misspellings and one stray date value, so it won't be used in the analysis. It serves only to confirm that the data applies solely to the city of San Francisco.

In [27]:
df.city.value_counts()

San Francisco      40430
San Franicsco          4
San Franisco           2
Sn Francisco           2
San Francisc           1
158an Francisco        1
San ‘Francisco         1
San Frncisco           1
San Francicso          1
San Francisoc          1
San Frnacisco          1
3/9/2017               1
San Franciso           1
Name: city, dtype: int64
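
To check programmatically that every value is either "San Francisco" or a near-misspelling of it, a similarity ratio can separate typos from unrelated values. The sketch below uses `difflib.SequenceMatcher` from the standard library. The 0.8 threshold and the helper name `is_probably_sf` are assumptions chosen for illustration, and the sample values are copied from the output above.

```python
from difflib import SequenceMatcher

TARGET = 'San Francisco'

def is_probably_sf(value, threshold=0.8):
    """True if value is close enough to TARGET to be treated as a typo of it."""
    ratio = SequenceMatcher(None, value.lower(), TARGET.lower()).ratio()
    return ratio >= threshold

# A few of the values from the value_counts() output above.
samples = ['San Francisco', 'San Franicsco', 'Sn Francisco',
           '158an Francisco', '3/9/2017']
suspect = [s for s in samples if not is_probably_sf(s)]
print(suspect)
```

Values that end up in `suspect` are the ones worth inspecting by hand; the misspellings pass the check, while the stray date does not.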

In [20]:
df.columns

Index(['access_denial', 'address', 'breach', 'capital_improvement', 'city',
       'client_location', 'condo_conversion', 'constraints_date', 'demolition',
       'development', 'ellis_act_withdrawal', 'eviction_id',
       'failure_to_sign_renewal', 'file_date', 'good_samaritan_ends',
       'illegal_use', 'late_payments', 'lead_remediation', 'neighborhood',
       'non_payment', 'nuisance', 'other_cause', 'owner_move_in',
       'roommate_same_unit', 'state', 'substantial_rehab',
       'supervisor_district', 'unapproved_subtenant', 'zip'],
      dtype='object')

Confirm that the client_location field holds the coordinate information. The latitude and longitude need to be extracted into separate lat and lon columns.

In [28]:
df.iloc[0].client_location

{'latitude': '37.726686880000024',
 'longitude': '-122.4598938584836',
 'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'}

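The following code extracts the latitude and longitude from the client_location dictionary, defaulting to (0, 0) when the coordinates are missing or cannot be converted.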

In [29]:
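from LatLong import *

def convert_location(id, nm, loc):
    """Return (lat, lon) as floats, or (0, 0) if loc has no usable coordinates."""
    try:
        lat = float(loc['latitude'])
        lon = float(loc['longitude'])
    except (TypeError, KeyError, ValueError):
        #loc is NaN for rows without a location, or a value is malformed
        #print(f'an error occurred extracting coordinates on {id}, {nm}')
        lat, lon = 0, 0
    #print(f'returning lat: {lat}, lon: {lon}')
    return lat, lon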

In [30]:
start_execution = time.time()

df[['lat', 'lon']] = df.apply(lambda row: convert_location( row.eviction_id, 
                                     row.eviction_id,
                                     row.client_location), 
                                       axis=1, result_type='expand')
end_execution = time.time()
print(f'Elapsed time: {end_execution - start_execution}')


Elapsed time: 10.2966148853302


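Use .describe() to verify the lat and lon columns. The minimum latitude and maximum longitude of 0 come from rows whose coordinates could not be extracted and were set to (0, 0).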

In [35]:
df.lon.describe(), df.lat.describe()

(count    40449.000000
 mean      -118.199439
 std         22.381036
 min       -122.511077
 25%       -122.452921
 50%       -122.428534
 75%       -122.413833
 max          0.000000
 Name: lon, dtype: float64, count    40449.000000
 mean        36.456004
 std          6.902975
 min          0.000000
 25%         37.743222
 50%         37.764445
 75%         37.781983
 max         37.830455
 Name: lat, dtype: float64)
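
The zeros in the summary above pull the means away from the bulk of the data (a mean longitude of about -118, against a median of about -122.4). One alternative, sketched here as an assumption rather than as this notebook's approach, is to return NaN for unusable locations so that summary statistics skip them. `parse_location` is a hypothetical name.

```python
import math

def parse_location(loc):
    """Return (lat, lon) as floats, or (nan, nan) when loc is unusable."""
    try:
        return float(loc['latitude']), float(loc['longitude'])
    except (TypeError, KeyError, ValueError):
        # TypeError: loc is not a dict (for example NaN for a missing value)
        # KeyError: a coordinate key is absent
        # ValueError: a coordinate string is not a number
        return math.nan, math.nan

good = {'latitude': '37.726686880000024', 'longitude': '-122.4598938584836'}
print(parse_location(good))
print(parse_location(math.nan))
```

pandas' `describe()` excludes NaN values, so the count would then reflect only rows with real coordinates. A later guard such as `row.lat > 30` would still exclude these rows, because comparisons with NaN evaluate to false.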

In [None]:
# ts = get_timestamp()
# df.to_csv(f'../tmp/evic_cleanstge1_{ts}.csv', index=False)

In [None]:
# df = pd.read_csv('../tmp/evic_cleanstge1_2019201951136955025.csv', low_memory = False)

<h4> Fill in Missing Neighborhood Values</h4>

All attempts to fill in the missing neighborhoods from the coordinates failed, which may be a sign of bad coordinates. These rows will not be included in the analysis.

In [37]:
from LatLong import *

def get_neighborhood(lat,lon):
    try:
        ll = LatLong(lat,lon)
        neighborhood = ll.get_neighborhood()
    except Exception:
        return '.'
    else:
        return neighborhood

In [38]:
#This is a sample of some of the attempts
#ret = get_neighborhood(37.806, -122.411)
#37.70796257157375 -122.4635364761915
#37.707921903319395 -122.4287166819376
ret = get_neighborhood(37.70792, -122.4287)
#ret = get_neighborhood(37.806, -122.411)
#ret = get_neighborhood(df.iloc[0]['lat'],df.iloc[0]['lon'])
ret

'.'
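
The LatLong module is project-local and its lookup method is not shown. For reference, one common way to map a coordinate to a neighborhood is a point-in-polygon test against boundary polygons. The sketch below uses ray casting on a made-up rectangular boundary. The function names, the `boundaries` dictionary, and the "Example Area" polygon are hypothetical; they do not reflect real San Francisco neighborhood boundaries or the behavior of LatLong.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray-casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge straddle the line of constant latitude through the point?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that line.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside

def lookup_neighborhood(lat, lon, boundaries, default='.'):
    """Return the first boundary containing the point, else the default marker."""
    for name, polygon in boundaries.items():
        if point_in_polygon(lat, lon, polygon):
            return name
    return default

# Hypothetical rectangular boundary, for illustration only.
boundaries = {
    'Example Area': [(37.70, -122.45), (37.70, -122.40),
                     (37.75, -122.40), (37.75, -122.45)],
}
print(lookup_neighborhood(37.72, -122.43, boundaries))
print(lookup_neighborhood(0, 0, boundaries))
```

A lookup of this kind returns the default marker whenever a point falls outside every polygon, which is one way that coordinates near a boundary edge or defaulted to (0, 0) can fail to match.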

In [40]:
start_execution = time.time()
df['new_neighborhood'] = df.apply(
    lambda row: get_neighborhood(row.lat, row.lon)
                if (row.neighborhood == '.') and (row.lat > 30)
                else row.neighborhood,
    axis=1)
end_execution = time.time()
print(f'Elapsed time: {end_execution - start_execution}')

Elapsed time: 7.864480018615723


In [41]:
df.neighborhood.value_counts()

Mission                           4383
Tenderloin                        2545
Sunset/Parkside                   2538
Outer Richmond                    2039
Castro/Upper Market               1814
Lakeshore                         1621
Hayes Valley                      1472
South of Market                   1420
.                                 1412
Nob Hill                          1378
Haight Ashbury                    1293
Marina                            1250
Excelsior                         1210
Noe Valley                        1194
Inner Sunset                      1174
Bernal Heights                    1172
Pacific Heights                   1134
Inner Richmond                    1105
Bayview Hunters Point             1100
Russian Hill                       967
North Beach                        949
Oceanview/Merced/Ingleside         837
Lone Mountain/USF                  825
West of Twin Peaks                 741
Outer Mission                      682
Western Addition         

Note that there are 1,412 eviction notice records for which a specific neighborhood cannot be found.

<h4> Get Year and Quarter</h4>

Extract the year and quarter from file_date into separate columns, to be used when combining with business data. Then display the data to confirm the extraction is correct.

In [42]:
# Convert a date into year, quarter, and a 'YYYY-Qn' string
def get_year_and_quarter(dt):
    dt = pd.Timestamp(dt)
    year = dt.year        
    quarter = dt.quarter
    #print(f'{dt}, {year}, {quarter}, {dt.year}-Q{dt.quarter}' )
    return year, quarter, f'{dt.year}-Q{dt.quarter}'
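
For reference, the quarter reported by `pd.Timestamp.quarter` follows directly from the month: months 1 to 3 map to Q1, 4 to 6 to Q2, and so on. A standard-library sketch of the same calculation is shown below, where `year_quarter` is a hypothetical name and the input is assumed to be an ISO string like those in `file_date`.

```python
from datetime import datetime

def year_quarter(iso_string):
    """Return (year, quarter, 'YYYY-Qn') for an ISO 8601 date-time string."""
    dt = datetime.fromisoformat(iso_string)
    # Integer division groups months into blocks of three: 1-3, 4-6, 7-9, 10-12.
    quarter = (dt.month - 1) // 3 + 1
    return dt.year, quarter, f'{dt.year}-Q{quarter}'

# Dates taken from the file_date values shown earlier in this notebook.
print(year_quarter('2019-03-29T00:00:00.000'))
print(year_quarter('2012-08-16T00:00:00.000'))
```

For whole columns, pandas offers the same values through the `.dt.year` and `.dt.quarter` accessors on a datetime Series, which avoids a row-wise apply.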

In [43]:
df[['year', 'quarter', 'yq']] = df.apply(lambda row: get_year_and_quarter(row.file_date),
                       axis=1,  result_type='expand')
                       

In [44]:
df.head()

Unnamed: 0,access_denial,address,breach,capital_improvement,city,client_location,condo_conversion,constraints_date,demolition,development,...,substantial_rehab,supervisor_district,unapproved_subtenant,zip,lat,lon,new_neighborhood,year,quarter,yq
0,False,700 Block Of Faxon Avenue,False,False,San Francisco,"{'latitude': '37.726686880000024', 'longitude'...",False,2024-03-29T00:00:00.000,False,False,...,False,7,False,94112,37.726687,-122.459894,West of Twin Peaks,2019,1,2019-Q1
1,False,100 Block Of Turk Street,False,False,San Francisco,"{'latitude': '37.78316342723367', 'longitude':...",False,,False,False,...,False,6,False,94102,37.783163,-122.411599,Tenderloin,2019,1,2019-Q1
2,False,200 Block Of Bay Street,False,False,San Francisco,"{'latitude': '37.805980597572834', 'longitude'...",False,,False,False,...,False,3,False,94133,37.805981,-122.411141,North Beach,2019,1,2019-Q1
3,False,200 Block Of Bay Street,False,False,San Francisco,"{'latitude': '37.805980597572834', 'longitude'...",False,,False,False,...,False,3,False,94133,37.805981,-122.411141,North Beach,2019,1,2019-Q1
4,False,200 Block Of North Point Street,False,False,San Francisco,"{'latitude': '37.80680623758749', 'longitude':...",False,,False,False,...,False,3,False,94133,37.806806,-122.411309,North Beach,2019,1,2019-Q1


In [45]:
df[['file_date', 'year', 'quarter', 'yq']].tail(15)

Unnamed: 0,file_date,year,quarter,yq
40434,1997-01-03T00:00:00.000,1997,1,1997-Q1
40435,1997-01-03T00:00:00.000,1997,1,1997-Q1
40436,1997-01-03T00:00:00.000,1997,1,1997-Q1
40437,1997-01-03T00:00:00.000,1997,1,1997-Q1
40438,1997-01-03T00:00:00.000,1997,1,1997-Q1
40439,1997-01-02T00:00:00.000,1997,1,1997-Q1
40440,1997-01-02T00:00:00.000,1997,1,1997-Q1
40441,1997-01-02T00:00:00.000,1997,1,1997-Q1
40442,1997-01-02T00:00:00.000,1997,1,1997-Q1
40443,1997-01-02T00:00:00.000,1997,1,1997-Q1


In [46]:
df.year.describe()

count    40449.000000
mean      2006.821256
std          6.860986
min       1997.000000
25%       2000.000000
50%       2006.000000
75%       2013.000000
max       2019.000000
Name: year, dtype: float64

In [47]:
df.quarter.describe()

count    40449.000000
mean         2.473188
std          1.095643
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          4.000000
Name: quarter, dtype: float64

<h4>Save</h4>

Save the data in the tmp directory. It will be combined with other data to create the final dataset for analysis.

In [49]:
ts = get_timestamp()
df.to_csv(f'../tmp/evic_cleanstge2_{ts}.csv', index=False)
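
`get_timestamp` comes from the project's utilities module, which is not shown. A stand-in with similar intent, assuming a sortable `YYYYMMDDHHMMSS` stamp is acceptable, might look like the sketch below. `timestamped_name` is a hypothetical helper, and the real function's format may differ.

```python
from datetime import datetime

def timestamped_name(prefix, when=None, ext='csv'):
    """Build '<prefix>_<YYYYMMDDHHMMSS>.<ext>'; defaults to the current time."""
    when = when or datetime.now()
    return f"{prefix}_{when.strftime('%Y%m%d%H%M%S')}.{ext}"

# A fixed datetime makes the result reproducible for this example.
example = timestamped_name('evic_cleanstge2', datetime(2019, 5, 11, 13, 6, 9))
print(example)
```

Putting the stamp in the file name keeps each run's output separate, so an earlier stage file is never overwritten by a rerun.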