# Detroit Blight Prediction with Open Data

## Overview

Property blight, its spread, and its effects on neighborhoods, people and the local economy are a major issue affecting many cities today. Detroit is a city where blight has been particularly widespread and severe. The literature describes attempts to predict the spread of blight from variables such as crime, abandoned buildings, the presence of amenities such as water and electricity, and taxes in arrears.

In this project we attempt to use openly available data on blight violations, crime incidents and 311 calls to train a supervised machine learning model to predict blight spread. We additionally have a demolition dataset which lists addresses that have been condemned. We use this dataset as the "IsBlighted" label for a building, both to train our model and to check its accuracy. We found 5174 buildings which were definitely blighted. We also took 5174 non-blighted buildings, and together these form our training/test dataset.
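
The labelling and balancing step described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual script: the function name `build_balanced_dataset`, the parcel identifiers and the use of uniform random sampling for the non-blighted half are all assumptions.

```python
import random


def build_balanced_dataset(all_parcel_ids, demolished_ids, seed=0):
    """Label demolished parcels 1 and an equal-sized random sample of the rest 0.

    Hypothetical helper: the sampling strategy is an assumption, not taken
    from the original scripts.
    """
    demolished = set(demolished_ids)
    unique_ids = sorted(set(all_parcel_ids))
    positives = [p for p in unique_ids if p in demolished]
    candidates = [p for p in unique_ids if p not in demolished]

    # Draw as many non-blighted parcels as there are blighted ones.
    rng = random.Random(seed)
    negatives = rng.sample(candidates, k=min(len(positives), len(candidates)))

    return [(p, 1) for p in positives] + [(p, 0) for p in negatives]


# Toy example with made-up parcel identifiers.
parcels = [f"P{i:03d}" for i in range(20)]
demolished = ["P001", "P005", "P009"]
dataset = build_balanced_dataset(parcels, demolished)
print(dataset)
```

Fixing the random seed keeps the sampled non-blighted set stable between runs, which makes the reported accuracy easier to reproduce.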

### Reproducibility

All the scripts and results produced for this project are available at https://github.com/IvoDonev/DSCapstone. The original datasets are available at https://github.com/uwescience/datasci_course_materials/tree/master/capstone/blight. The GitHub repository also contains an anacondaEnv.yml file to create the environment needed to execute these scripts.


## Building definitions

The data in these open data sets need to be linked to buildings in order to be able to cross reference items.

In order to define buildings from our Detroit data I looked at other available data sources on the https://data.detroitmi.gov/ site. In particular I considered the parcel map dataset (https://data.detroitmi.gov/Property-Parcels/Parcel-Map/fxkw-udwf) and the neighborhoods dataset (https://data.detroitmi.gov/Government/Detroit-Neighborhoods/5mn6-ihjv). I downloaded both these datasets as geojson files from the links provided.
The first file provides spatial information (a multipolygon) for each property in the city. With this dataset I can reverse search any point (latitude/longitude pair) to see which polygon (i.e. parcel) it falls in. With the second dataset I can similarly reverse search to see which neighborhood any point or building falls in. This makes it easier to visualize results at the neighborhood level and to group buildings within neighborhoods. In the case where a point does not fall within any parcel (e.g. on the street outside) I picked the nearest polygon.

Reverse searching such a large dataset (318000 parcels) is no easy feat, so I used the geospatial query capabilities of MongoDB (https://www.mongodb.com/). I ran a local instance of the database and loaded both geojson files as two separate collections (DetroitParcels and DetroitAreas). This required some fiddling with the geojson files so that each parcel is on a separate line. I loaded the parcels and areas through a small Python script, talking to MongoDB using the PyMongo library. I then generated a geospatial index on the collection to make lookups faster. The query function is available on my GitHub: https://github.com/IvoDonev/DSCapstone/blob/master/QueryBuildings.py
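
As a rough illustration of this kind of lookup (not the contents of QueryBuildings.py), the sketch below builds the two MongoDB filters involved: a `$geoIntersects` filter to find the parcel containing a point, and a `$near` fallback for points that fall outside every parcel. The field name `geometry`, the helper names and the 1000 m search limit are assumptions; `$near` with a GeoJSON point requires a `2dsphere` index on that field.

```python
def parcel_containing_query(lon, lat):
    """Filter for parcels whose geometry intersects (contains) the point."""
    point = {"type": "Point", "coordinates": [lon, lat]}  # GeoJSON order: lon, lat
    return {"geometry": {"$geoIntersects": {"$geometry": point}}}


def nearest_parcel_query(lon, lat, max_distance_m=1000):
    """Filter returning parcels sorted from nearest to farthest (needs a 2dsphere index)."""
    point = {"type": "Point", "coordinates": [lon, lat]}
    return {"geometry": {"$near": {"$geometry": point, "$maxDistance": max_distance_m}}}


def find_parcel(collection, lon, lat):
    """Return the containing parcel, or the nearest one if no parcel contains the point.

    `collection` is expected to behave like a pymongo Collection (only
    `find_one` is used here).
    """
    doc = collection.find_one(parcel_containing_query(lon, lat))
    if doc is None:
        doc = collection.find_one(nearest_parcel_query(lon, lat))
    return doc


# Hypothetical usage against a local MongoDB instance (database name assumed):
#   from pymongo import MongoClient, GEOSPHERE
#   parcels = MongoClient()["detroit"]["DetroitParcels"]
#   parcels.create_index([("geometry", GEOSPHERE)])
#   find_parcel(parcels, -83.066083, 42.379858)
print(parcel_containing_query(-83.066083, 42.379858))
```

Keeping the filters as plain dictionaries makes them easy to inspect and reuse for the neighborhoods collection.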

## Variables

The following variables were considered for our model:
1. Number of crime incidents associated with a building
1. Number of blight violation incidents associated with a building
1. Number of 311 calls
1. Whether the building is state owned. This was determined by checking the "owner1" property for each building in the parcel dataset and comparing to 'DETROIT LAND BANK AUTHORITY'
1. The zoning category for the building. This was determined from the "zoning" property for each parcel
1. Which neighborhood each building is in. This was done by getting the centroid of each parcel and performing a reverse lookup on the neighborhoods dataset.
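
To make the list above concrete, here is a small sketch of how one feature row per parcel might be assembled. It assumes that incidents have already been matched to parcel identifiers; the function name `build_features`, the `id` key and the output field names are hypothetical, while the `owner1` and `zoning` properties come from the parcel dataset as described. The neighborhood lookup is omitted because it needs the spatial query.

```python
from collections import Counter

LAND_BANK_OWNER = "DETROIT LAND BANK AUTHORITY"


def build_features(parcel, crime_counts, violation_counts, call_counts):
    """Assemble the primary variables for a single parcel (illustrative only)."""
    props = parcel["properties"]
    parcel_id = parcel["id"]  # hypothetical identifier key
    owner = (props.get("owner1") or "").strip().upper()
    return {
        "n_crimes": crime_counts[parcel_id],        # Counter returns 0 when missing
        "n_violations": violation_counts[parcel_id],
        "n_311_calls": call_counts[parcel_id],
        "state_owned": int(owner == LAND_BANK_OWNER),
        "zoning": props.get("zoning"),
    }


# Toy data: each list holds the parcel id that an incident was matched to.
crimes = Counter(["p1", "p1", "p2"])
violations = Counter(["p1"])
calls = Counter()

example_parcel = {
    "id": "p1",
    "properties": {"owner1": "DETROIT LAND BANK AUTHORITY", "zoning": "R1"},
}
example_row = build_features(example_parcel, crimes, violations, calls)
print(example_row)
```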

## Secondary Variables

Additionally, we consider the effects of the surrounding buildings' variables on each building. Such an approach was suggested in the paper by Morckel. To achieve this we consider the 100 nearest parcels within a 1km radius. We then compute a weighted average of the number of crime incidents, the number of blight violations and the number of 311 calls using the following weight function:

$e^{-\dfrac{distance}{500}}$
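
A minimal sketch of this secondary-variable calculation is given below. It assumes that distance is measured in metres between parcel centroids (so the weight falls to about 0.37 at 500 m), that a haversine distance is accurate enough at city scale, and that the weighted average is normalised by the sum of the weights. The function and field names are hypothetical, and the caller is expected to exclude the target parcel from `neighbours`.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def neighbour_weighted_average(target, neighbours, key,
                               radius_m=1000, max_neighbours=100, scale_m=500):
    """Average `key` over the nearest neighbours, weighted by exp(-distance / scale_m)."""
    in_range = []
    for n in neighbours:
        d = haversine_m(target["lat"], target["lon"], n["lat"], n["lon"])
        if d <= radius_m:
            in_range.append((d, n[key]))

    # Keep at most `max_neighbours` of the closest parcels.
    in_range.sort(key=lambda pair: pair[0])
    nearest = in_range[:max_neighbours]
    if not nearest:
        return 0.0

    weights = [math.exp(-d / scale_m) for d, _ in nearest]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted_sum / sum(weights)


target = {"lat": 42.379858, "lon": -83.066083}
neighbours = [
    {"lat": 42.379858, "lon": -83.066083, "n_crimes": 2},  # same location
    {"lat": 42.379858, "lon": -83.066083, "n_crimes": 4},  # same location
    {"lat": 43.379858, "lon": -83.066083, "n_crimes": 1000},  # about 111 km away, ignored
]
print(neighbour_weighted_average(target, neighbours, "n_crimes"))
```

Because the weight decays exponentially, parcels near the 1km limit contribute only about 14% as much as an immediate neighbour.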

## Visualising the variables

Below is a choropleth visualisation of the number of violation incidents and crime incidents grouped per neighborhood in Detroit. From this visualisation you can easily see any troubled areas. It can also be seen that there is a large number of incidents for Downtown Detroit. This area is non-residential, and perhaps the overwhelming number of incidents recorded there reflects its use as a default location rather than reality. Additionally, the areas of Warrendale and MidWest have a high number of blight violations.

Considering only blight violations, the western areas have many more incidents. Considering 311 and crime numbers, the areas are much more varied, although in all three categories Warrendale has very high numbers. For 311 and crime, the area of Claytown also has very high numbers.

In [1]:
import geopandas as gpd
import folium
import pandas as pd
from IPython.display import HTML  # used to embed the saved map in an iframe

In [12]:
def AddLayer(filePath, layerName, legendName, showInitially, foliumMap, valueField="count"):
    
    df = pd.read_csv(filePath)
    df['_id'] = df['_id'].astype(str)

    layer = folium.Choropleth(geo_data='data/Detroit Neighborhoods.geojson', 
                      data=df,
                      columns=['_id', valueField],
                      key_on='feature.properties.target_fid', 
                      nan_fill_color='White',
                      fill_color='YlOrRd', 
                      fill_opacity=0.9,
                      line_opacity=0.2,
                      highlight=False,
                      legend_name=legendName).add_to(foliumMap)
    layer.show = showInitially

    layer.layer_name=layerName
    
def AddAreasOverlay(foliumMap):
    # Transparent fill with thin outlines, so only the neighborhood borders are drawn.
    styleFunction = lambda x: {'fillColor': '#0000ff', 'fillOpacity':0, 'weight':1}

    # Hovering over an area shows its name (the "new_nhood" property).
    areas = folium.GeoJson(
        "data/Detroit Neighborhoods.geojson",
        name='Areas',
        control=False,
        style_function=styleFunction,
        tooltip=folium.features.GeoJsonTooltip(fields=["new_nhood"], labels=False)
    ).add_to(foliumMap)
    foliumMap.keep_in_front(areas)

variablesMap = folium.Map([42.379858, -83.066083], zoom_start = 11)
AddLayer("data/ViolationsPerArea.csv", "Blight Violations Per Area", "Number of blight violations", True, variablesMap)
AddLayer("data/CrimePerArea.csv", "Crime Per Area", "Number of crime incidents", False, variablesMap)
AddLayer("data/311PerArea.csv", "311 Per Area", "Number of 311 incidents", False, variablesMap)
AddAreasOverlay(variablesMap)

# This map times out the Jupyter notebook, so we save it to a file and then load it again.
folium.LayerControl(collapsed=False).add_to(variablesMap)
variablesMap.save("data/VariablesMap.html")

HTML('<iframe src="https://ivodonev.github.io/DSCapstone/data/VariablesMap.html" width=100% height=600></iframe>')


## Visualising the demolition

Below we show a heat map of each demolition order in Detroit. This visualisation can quickly show us problem areas and allows us to drill down into areas for a more detailed view. 

From this we can see that Downtown Detroit actually has very few demolition orders. The orders seem relatively spread out throughout the city, with concentrations in Warrendale (in the west) and in Mapleridge, Franklin and Gratiot-Findlay (in the north east).


In [3]:
from folium import plugins
from folium.plugins import HeatMap

demoHeatmap = folium.Map([42.379858, -83.066083], zoom_start = 11)

demoParcels = pd.read_csv("data/DemolishedParcels.csv")

# Build the [lat, lon] pairs directly from the columns instead of iterating row by row.
heat_data = demoParcels[['lat', 'lon']].values.tolist()
HeatMap(heat_data, radius=9, max_zoom=15).add_to(demoHeatmap)
AddAreasOverlay(demoHeatmap)


demoHeatmap


## Model

I built a classifier using the Random Forest classifier from Python's scikit-learn library. I used RandomizedSearchCV to tune the model's hyperparameters. I took all of the roughly 10000 records (about 5000 demolished and 5000 not) and performed a 5-fold cross-validation fit. The resulting model accuracy is shown below.


In [4]:
with open('data/ModelAccuracy.txt', 'r') as myfile:
    print(myfile.read())

Model
Mean validation score: 0.774 (std: 0.010)
Parameters: {'min_samples_split': 10, 'max_features': 5, 'max_depth': 6, 'bootstrap': False, 'criterion': 'gini'}
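
For readers who want to see the shape of the tuning step, the following is a minimal sketch of a randomized hyperparameter search with 5-fold cross-validation. It is not the project's training script: the data is a synthetic stand-in generated with `make_classification`, and the candidate parameter values, number of trees and number of search iterations are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real feature matrix and "IsBlighted" labels.
X, y = make_classification(n_samples=400, n_features=9, n_informative=5, random_state=0)

# Candidate values are illustrative; only the parameter names mirror the output above.
param_distributions = {
    "max_depth": [3, 6, None],
    "max_features": [3, 5, 7],
    "min_samples_split": [2, 5, 10],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of random parameter combinations to try
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

best = search.best_index_
mean_score = search.cv_results_["mean_test_score"][best]
std_score = search.cv_results_["std_test_score"][best]
print(f"Mean validation score: {mean_score:.3f} (std: {std_score:.3f})")
print("Parameters:", search.best_params_)
```

The fitted `search.best_estimator_` can then be used for prediction, and its `feature_importances_` attribute gives the per-variable importances.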



From the model we can determine which features/variables were most important. Below these are presented as a bar chart with the highest importance on top. It can be seen that the number of violations for a parcel was the most important variable, followed by whether it is state owned and then the secondary number of crimes (weighted average of the neighboring parcels).

In [5]:
from bokeh.plotting import figure 
from bokeh.io import output_notebook, show

df = pd.read_csv("data/FeatureImportances.csv").sort_values(by=["importance"], ascending=False).head(10)
output_notebook()

df = df.sort_values(by=["importance"], ascending=True)
p = figure(title="Importance of different variables for evaluating random forest classifier", y_range=list(df["variable"]))
p.hbar(height=0.8, y='variable', right="importance", left=0, source=df)
# p.xaxis.formatter = timeFormatter
p.xaxis.axis_label = "Importance of variable"
show(p)

## Results

With the fitted model we then classify all the parcels in the entire corpus (~380000 parcels). We then group by area and compute the proportion of parcels classified as blighted in each area. This is displayed below to show us danger areas where our model predicts blight is likely to spread or occur.
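
The grouping step can be sketched with pandas as follows. This is an illustration on toy data, assuming that `ProbT` holds the predicted label (1 for blighted, 0 otherwise) and that each row carries an `area_name`; the area names used here are placeholders.

```python
import pandas as pd

# Toy stand-in for the per-parcel predictions.
predictions = pd.DataFrame({
    "area_name": ["Area A", "Area A", "Area A", "Area B", "Area B"],
    "ProbT": [1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the proportion of parcels predicted as blighted.
area_prob = (
    predictions.groupby("area_name")["ProbT"]
    .mean()
    .rename("prob")
    .reset_index()
    .sort_values("prob", ascending=False)
)
print(area_prob)
```

Note that small areas with only a handful of parcels can reach very high proportions from just a few predictions, so it may be worth keeping the parcel count alongside the proportion.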


In [6]:
blightPredictionMap = folium.Map([42.379858, -83.066083], zoom_start = 11)
AddLayer("data/Nhood_BlightProb.csv", "Blight Prediction Probability Per Area", "Proportion of predicted blighted parcels", True, blightPredictionMap, "prob")
AddAreasOverlay(blightPredictionMap)
blightPredictionMap

From the map above we can see that Springwells and Mapleridge are high danger areas. There are also some very small areas with very high proportions of blighted parcels surrounded by areas with low proportions (such as Brewster Homes and Penrose), which may be a fault of the model rather than a reflection of reality.

Below are the top 10 danger areas.

In [7]:
df = pd.read_csv("data/Nhood_BlightProb.csv")
df = df.sort_values(by=['prob'], ascending=False)
print(df.head(10))

     Unnamed: 0  _id      prob           area_name
199         199   90  1.000000      Brewster Homes
14           14  110  0.941776             Penrose
147         147   42  0.925990         Springwells
135         135   31  0.815117     Oakwood Heights
105         105  193  0.781668            Franklin
71           71  162  0.763463          Mapleridge
192         192   84  0.675910          Henry Ford
123         123  209  0.663640           Fox Creek
193         193   85  0.635659  West Virginia Park
112         112    2  0.579321       Chandler Park


Although seeing area-wide predictions can help us see which areas are in danger of blight spread, for larger areas it may be useful to drill down and see the individual parcels which are predicted to be blighted. This may allow for intervention on a street or house level if a trend can be visually seen. 

In [8]:
nhoodID = 146

df = pd.read_csv("data/AllPredictions.csv")
df = df.loc[((df['nhood_id']==nhoodID) & (df["ProbT"]==1)), ['nhood_id','lat','lon','area_name']]
predictHM = folium.Map([42.379858, -83.066083], zoom_start = 11)

heat_data = [[row['lat'],row['lon']] for index, row in df.iterrows()]
HeatMap(heat_data, radius=9, max_zoom=15).add_to(predictHM)

subset = df[['lat', 'lon']]
tuples = [tuple(x) for x in subset.values]

folium.map.FitBounds(tuples).add_to(predictHM)
AddAreasOverlay(predictHM)
print("Area - " + df.iloc[0]["area_name"])
predictHM

Area - Elijah McCoy


## Conclusion

Using the open datasets for blight violations, crime incidents, 311 calls and demolition orders, as well as parcel and area information for the city of Detroit, we were able to create a blight prediction model with a reasonable accuracy of approximately 77% (mean validation score of 0.774). We also provide visualisations which allow us to see problem areas and drill down to individual streets where blight is likely to occur.

### Future work and improvements

The current work does not consider any time scales and treats all violations and other incidents as if they occurred at the same time. This is not realistic, as areas change over the years, and this should be taken into account. Doing so would also allow us to consider the spread of blight throughout the city over time.

Additionally, the prediction accuracy of our model (~77%) is not as high as it could be, which I think is due to the dirtiness of our data and in particular the definition of being blighted. Looking at individual parcels might also be too fine a granularity to get sufficient records for each address. The results may also be misleading if used to predict individual addresses. There might also be privacy considerations when working with individual addresses.

Furthermore, there may be some parcels marked as "Demolished/Dismantled" which are not necessarily blighted, as some may have been dismantled during normal development where houses are taken down for commercial development. This ambiguity in the category that we are trying to predict, and on which we train our model, may cause some inaccuracy. There is potential to improve this by using a secondary source for verification, such as satellite imagery, visiting the locations physically, or some additional data from the Detroit government.

With this framework we can extend the model to handle additional datasets which could help us predict better. The literature suggests that demographics and tax arrears should be considered.
