# Error In Spatial Analysis
### Branson Fox
The objective of this notebook is to test how much the results of fitting spatial models vary with the geocoder used. We will first analyze and fit models using the ground truth, and then fit the same models using each of the 14 tested geocoders.

## Dependencies
The following modules are used to produce this analysis. Note that we are using PySAL 2.2.

In [1]:
import os
import numpy as np
import pysal as ps

## Load Data
We'll define a list of geocoders and load the data for each. Then we'll set the model parameters, which are the same across all datasets.

In [2]:
# Load Data
geocoders = ['truth', 'BingSingle', 'EsriSingle', 'Google', 'HereSingle', 'OpenCage', 'TomTomSingle', 'CensusSingle', 'GeocodioSingle', 'TomTomBatch', 'BingBatch', 'CensusBatch', 'GeocodioBatch', 'EsriBatch', 'HereBatch']
datasets = {}
for geocoder in geocoders:
    datasets[geocoder] = {'data' : ps.lib.io.open('../data/shapes/' + geocoder + '.dbf')}
    
# Set Parameters
shape = ps.lib.io.open('../data/shapes/truth.shp')
weights = ps.lib.weights.Queen(shape)

# Variables
dependent = 'homicides'
independent = ['med_inc', 'homeown', 'non_wht', 'poverty']

We need the data to be in arrays to work with PySAL, so we'll pre-format those.

In [3]:
# Create Arrays of Data for Each Geocoder
for dataset in datasets:
    datasets[dataset]['y'] = np.array([datasets[dataset]['data'].by_col(dependent)]).T
    datasets[dataset]['x'] = np.array([datasets[dataset]['data'].by_col(var) for var in independent]).T
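
To make the resulting shapes concrete, here is a minimal sketch with a few made-up columns in place of the `.dbf` data. The `columns` dictionary and its values are hypothetical, and it assumes `by_col` returns a plain list per column. The point is only that `y` ends up as an (n, 1) array and `x` as an (n, k) array.

```python
import numpy as np

# Hypothetical stand-in for the .dbf table: one list per column,
# which is the form assumed here for what by_col returns.
columns = {
    'homicides': [3, 0, 7, 12],
    'med_inc': [41000.0, 58000.0, 36500.0, 29000.0],
    'homeown': [0.55, 0.71, 0.48, 0.33],
}

dependent = 'homicides'
independent = ['med_inc', 'homeown']

# Wrapping the single column in a list gives shape (1, n); .T makes it (n, 1).
y = np.array([columns[dependent]]).T

# Stacking k columns gives shape (k, n); .T makes it (n, k).
x = np.array([columns[var] for var in independent]).T

print(y.shape)  # (4, 1)
print(x.shape)  # (4, 2)
```

Each row of `x` then corresponds to one area and each column to one independent variable, in the order given by `independent`.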

## Explore The Ground Truth
We'll determine the appropriate spatial model to fit, using the ground truth as our basis.

In [4]:
truth_ols = ps.model.spreg.OLS(datasets['truth']['y'], datasets['truth']['x'],
                               w=weights, name_y=dependent, name_x=independent,
                               spat_diag=True, moran=True, name_w='Queens', name_ds='Ground Truth')
print(truth_ols.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :Ground Truth
Weights matrix      :      Queens
Dependent Variable  :   homicides                Number of Observations:         238
Mean dependent var  :      6.3697                Number of Variables   :           5
S.D. dependent var  :      7.6801                Degrees of Freedom    :         233
R-squared           :      0.5700
Adjusted R-squared  :      0.5627
Sum squared residual:    6010.386                F-statistic           :     77.2297
Sigma-square        :      25.796                Prob(F-statistic)     :   1.325e-41
S.E. of regression  :       5.079                Log likelihood        :    -721.955
Sigma-square ML     :      25.254                Akaike info criterion :    1453.910
S.E of regression ML:      5.0253                Schwarz criterion     :    1471.272

-----------------------------------------------------------------------------

Based on this OLS model, it is appropriate to fit a spatial lag model. We'll do that for the ground truth, and then for each of the additional geocoders.
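
The OLS summary above is cut off before its spatial diagnostics, so the values behind this choice are not shown here. One common rule of thumb compares the Lagrange multiplier (LM) tests: if only the LM-lag test is significant, fit a lag model; if only LM-error is, fit an error model; if both are, fall back on the robust versions. The sketch below encodes that rule. `suggest_model` is a hypothetical helper, not part of PySAL, and the p-values in the usage lines are made up. With `spat_diag=True`, the fitted OLS object should expose `lm_lag`, `lm_error`, `rlm_lag`, and `rlm_error` as (statistic, p-value) pairs, but verify that against your installed version.

```python
def suggest_model(lm_lag_p, lm_error_p, rlm_lag_p, rlm_error_p, alpha=0.05):
    """Suggest 'ols', 'lag', or 'error' from LM test p-values (rule of thumb)."""
    lag_sig = lm_lag_p < alpha
    err_sig = lm_error_p < alpha

    # Neither test rejects: no evidence of spatial dependence, keep OLS.
    if not lag_sig and not err_sig:
        return 'ols'
    # Exactly one test rejects: follow it.
    if lag_sig and not err_sig:
        return 'lag'
    if err_sig and not lag_sig:
        return 'error'

    # Both reject: compare the robust versions instead.
    rlag_sig = rlm_lag_p < alpha
    rerr_sig = rlm_error_p < alpha
    if rlag_sig and not rerr_sig:
        return 'lag'
    if rerr_sig and not rlag_sig:
        return 'error'
    # Robust tests agree with each other: pick the smaller p-value.
    return 'lag' if rlm_lag_p <= rlm_error_p else 'error'


# Made-up p-values for illustration only.
print(suggest_model(0.001, 0.002, 0.01, 0.40))  # lag
print(suggest_model(0.40, 0.50, 0.60, 0.70))    # ols
```

In the notebook's setting the call might look like `suggest_model(truth_ols.lm_lag[1], truth_ols.lm_error[1], truth_ols.rlm_lag[1], truth_ols.rlm_error[1])`, under the attribute assumption stated above.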

In [5]:
truth_lag = ps.model.spreg.GM_Lag(datasets['truth']['y'], datasets['truth']['x'],
                               w=weights, name_y=dependent, name_x=independent,
                               name_w='Queens', name_ds='Ground Truth')
print(truth_lag.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :Ground Truth
Weights matrix      :      Queens
Dependent Variable  :   homicides                Number of Observations:         238
Mean dependent var  :      6.3697                Number of Variables   :           6
S.D. dependent var  :      7.6801                Degrees of Freedom    :         232
Pseudo R-squared    :      0.8142
Spatial Pseudo R-squared:  0.6299

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      -4.0759764       1.4015459      -2.9082005       0.0036352
             med_inc       0.0000732       0.0000231       3.1646632       0.0015526
             homeown       1.7169311       1.6687826       
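
The summary above reports a spatial two-stage least squares fit of the lag model, which adds a spatially lagged dependent variable to the regression: y = ρWy + Xβ + ε. As a rough illustration of the Wy term, the sketch below computes a spatial lag for four hypothetical areas. The neighbor lists and values are made up, and the weights are assumed to be row-standardized, so each area's lag is the mean of its neighbors' values. The notebook itself does not show whether its weights are transformed this way.

```python
# Hypothetical neighbor lists for four areas. Queen contiguity would
# derive these from polygons that share an edge or a corner.
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}

# Made-up values of the dependent variable, indexed by area.
y = [3.0, 0.0, 7.0, 12.0]


def spatial_lag(neighbors, values):
    """Row-standardized spatial lag: the mean of each area's neighbors."""
    lag = []
    for i in sorted(neighbors):
        nbrs = neighbors[i]
        # An area with no neighbors (an "island") gets a lag of 0.0.
        lag.append(sum(values[j] for j in nbrs) / len(nbrs) if nbrs else 0.0)
    return lag


wy = spatial_lag(neighbors, y)
print(wy)
```

Area 0, for example, has neighbors 1 and 2, so its lag is (0.0 + 7.0) / 2 = 3.5. The estimated coefficient on this term is what the lag model adds to the OLS specification.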

## Fit 14 Additional Models

In [6]:
for geocoder in geocoders[1:]:
    lag = ps.model.spreg.GM_Lag(datasets[geocoder]['y'], datasets[geocoder]['x'],
                               w=weights, name_y=dependent, name_x=independent,
                               name_w='Queens', name_ds=geocoder)
    print(lag.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :  BingSingle
Weights matrix      :      Queens
Dependent Variable  :   homicides                Number of Observations:         238
Mean dependent var  :      6.2815                Number of Variables   :           6
S.D. dependent var  :      7.6097                Degrees of Freedom    :         232
Pseudo R-squared    :      0.8126
Spatial Pseudo R-squared:  0.6282

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      -4.1397177       1.4003068      -2.9562933       0.0031136
             med_inc       0.0000730       0.0000230       3.1770927       0.0014876
             homeown       1.8970813       1.6668684       

Note: Poverty becomes significant in the OpenCage model.
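
Scanning fifteen printed summaries by eye for changes like this is error-prone, so a small helper can flag them instead. The sketch below is one possible approach, not part of the original analysis: `significance_changes` is a hypothetical function, the 0.05 threshold is an assumption, and the p-values in the example are made up rather than taken from the fitted models. In practice the p-values could be collected from each fitted model, for instance from its `z_stat` attribute, assuming your spreg version provides it as a list of (z-statistic, p-value) pairs.

```python
def significance_changes(p_values, baseline='truth', alpha=0.05):
    """Return {geocoder: [variables whose significance differs from baseline]}."""
    base = p_values[baseline]
    changes = {}
    for name, pvals in p_values.items():
        if name == baseline:
            continue
        # A variable "flips" when it is significant in exactly one of the two fits.
        flipped = [var for var, p in pvals.items()
                   if (p < alpha) != (base[var] < alpha)]
        if flipped:
            changes[name] = flipped
    return changes


# Made-up p-values for illustration only; not results from this analysis.
example = {
    'truth': {'med_inc': 0.002, 'poverty': 0.08},
    'geocoder_a': {'med_inc': 0.001, 'poverty': 0.09},
    'geocoder_b': {'med_inc': 0.003, 'poverty': 0.04},
}

print(significance_changes(example))  # {'geocoder_b': ['poverty']}
```

A threshold comparison like this only catches changes in significance. Shifts in coefficient size that stay on the same side of the threshold would need a separate comparison.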