# Error In Spatial Analysis
### Branson Fox
The objective of this notebook is to test how much the results of fitting spatial models vary with the geocoder used. We will first analyze and fit models using the ground truth, and then fit the same models using each of the 14 tested geocoders.

## Dependencies
The following modules are used to produce this analysis. Note that we are using PySAL 2.2.

In [1]:
import os
import numpy as np
import pysal as ps

## Load Data
We'll define a list of geocoders and load the data for each one. Then we'll set the model parameters, which are the same across all datasets.

In [2]:
# Load Data
geocoders = ['truth', 'BingSingle', 'EsriSingle', 'Google', 'HereSingle', 'OpenCage', 'TomTomSingle', 'CensusSingle', 'GeocodioSingle', 'TomTomBatch', 'BingBatch', 'CensusBatch', 'GeocodioBatch', 'EsriBatch', 'HereBatch']
datasets = {}
for geocoder in geocoders:
    datasets[geocoder] = {'data' : ps.lib.io.open('../data/shapes/' + geocoder + '.dbf')}
    
# Set Parameters
shape = ps.lib.io.open('../data/shapes/truth.shp')
weights = ps.lib.weights.Queen(shape)

# Variables
dependent = 'homicides'
independent = ['med_inc', 'homeown', 'non_wht', 'poverty']
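
As a rough illustration of what Queen contiguity means, the sketch below builds neighbor lists for a regular grid, where two cells are neighbors if they share an edge or a corner. This is not how PySAL derives weights from the shapefile; the function name `queen_neighbors` and the grid layout are assumptions made for the example.

```python
def queen_neighbors(rows, cols):
    """Return {cell_id: [neighbor_ids]} for a rows x cols grid.

    Cells are numbered row by row; neighbors share an edge or a corner.
    """
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            cell = r * cols + c
            neighbors[cell] = [
                rr * cols + cc
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            ]
    return neighbors


grid = queen_neighbors(3, 3)
print(grid[4])  # center cell: all eight surrounding cells
print(grid[0])  # corner cell: [1, 3, 4]
```

Because the weights are built once from the ground-truth shapes, every model in this notebook shares the same neighbor structure; only the data change between geocoders.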

We need the data to be in arrays to work with PySAL, so we'll pre-format it for each geocoder.

In [3]:
# Create Arrays of Data for Each Geocoder
for dataset in datasets:
    datasets[dataset]['y'] = np.array([datasets[dataset]['data'].by_col(dependent)]).T
    datasets[dataset]['x'] = np.array([datasets[dataset]['data'].by_col(var) for var in independent]).T
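
To make the resulting shapes concrete, here is a small sketch with made-up values. It assumes that `by_col` returns one plain list per column, which is replaced here by a hypothetical `columns` dictionary.

```python
import numpy as np

# Hypothetical stand-in for by_col(): each column is a plain Python list.
columns = {
    'homicides': [3, 0, 7],
    'med_inc': [41000.0, 52000.0, 28000.0],
    'homeown': [0.61, 0.74, 0.35],
}
dependent = 'homicides'
independent = ['med_inc', 'homeown']

# Wrapping the single column in a list and transposing gives a column vector.
y = np.array([columns[dependent]]).T
# Stacking k columns and transposing gives one row per observation.
x = np.array([columns[var] for var in independent]).T

print(y.shape)  # (3, 1)
print(x.shape)  # (3, 2)
```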

## Explore The Ground Truth
We'll use the ground truth as the basis for determining the appropriate spatial model to fit.

In [4]:
truth_ols = ps.model.spreg.OLS(datasets['truth']['y'], datasets['truth']['x'],
                               w=weights, name_y=dependent, name_x=independent,
                               spat_diag=True, moran=True, name_w='Queens', name_ds='Ground Truth')
print(truth_ols.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :Ground Truth
Weights matrix      :      Queens
Dependent Variable  :   homicides                Number of Observations:         247
Mean dependent var  :      2.7933                Number of Variables   :           5
S.D. dependent var  :      4.3930                Degrees of Freedom    :         242
R-squared           :      0.4693
Adjusted R-squared  :      0.4605
Sum squared residual:    2519.385                F-statistic           :     53.5011
Sigma-square        :      10.411                Prob(F-statistic)     :   2.939e-32
S.E. of regression  :       3.227                Log likelihood        :    -637.292
Sigma-square ML     :      10.200                Akaike info criterion :    1284.584
S.E of regression ML:      3.1937                Schwarz criterion     :    1302.131

-----------------------------------------------------------------------------

Based on this OLS model, it is appropriate to fit a spatial lag model. We'll do that for the ground truth, and then for each of the additional geocoders.
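
The OLS call above requests spatial diagnostics, including Moran's I. As a minimal sketch of the general Moran's I statistic (not the exact residual-based test and inference that PySAL reports), the function below computes it with NumPy. The name `morans_i`, the four-area adjacency matrix, and the values are assumptions for illustration.

```python
import numpy as np


def morans_i(values, w):
    """Moran's I = (n / S0) * (z' W z) / (z' z), with z the mean-centered values."""
    z = values - values.mean()
    n = len(values)
    s0 = w.sum()
    return (n / s0) * (z @ w @ z) / (z @ z)


# Four areas in a row (0-1-2-3) with binary adjacency weights.
w = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

clustered = np.array([1.0, 1.0, 5.0, 5.0])    # similar values sit together
alternating = np.array([1.0, 5.0, 1.0, 5.0])  # neighbors always differ

print(morans_i(clustered, w))    # positive: spatial clustering
print(morans_i(alternating, w))  # negative: spatial dispersion
```

Positive values indicate that neighboring areas tend to resemble each other, which is the kind of pattern that motivates a spatial model over plain OLS.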

In [5]:
truth_lag = ps.model.spreg.GM_Lag(datasets['truth']['y'], datasets['truth']['x'],
                                  w=weights, name_y=dependent, name_x=independent,
                                  name_w='Queens', name_ds='Ground Truth')
print(truth_lag.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :Ground Truth
Weights matrix      :      Queens
Dependent Variable  :   homicides                Number of Observations:         247
Mean dependent var  :      2.7933                Number of Variables   :           6
S.D. dependent var  :      4.3930                Degrees of Freedom    :         241
Pseudo R-squared    :      0.7988
Spatial Pseudo R-squared:  0.5697

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      -1.8920716       0.8134011      -2.3261238       0.0200119
             med_inc       0.0000349       0.0000127       2.7394890       0.0061535
             homeown       0.9964329       0.9622978       

## Fit 14 Additional Models

In [6]:
for geocoder in geocoders[1:]:
    lag = ps.model.spreg.GM_Lag(datasets[geocoder]['y'], datasets[geocoder]['x'],
                                w=weights, name_y=dependent, name_x=independent,
                                name_w='Queens', name_ds=geocoder)
    print(lag.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :  BingSingle
Weights matrix      :      Queens
Dependent Variable  :   homicides                Number of Observations:         247
Mean dependent var  :      2.7508                Number of Variables   :           6
S.D. dependent var  :      4.3441                Degrees of Freedom    :         241
Pseudo R-squared    :      0.7950
Spatial Pseudo R-squared:  0.5707

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      -1.8878998       0.8117071      -2.3258387       0.0200272
             med_inc       0.0000343       0.0000127       2.7037463       0.0068563
             homeown       1.0396959       0.9605874       
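
One way to quantify the variability this notebook sets out to test is the relative difference between each geocoder's coefficient and its ground-truth counterpart. The sketch below uses only the coefficients visible in the ground-truth and BingSingle summaries; the helper name `relative_difference` is hypothetical.

```python
# Coefficients copied from the printed summaries (only the rows shown).
truth = {'CONSTANT': -1.8920716, 'med_inc': 0.0000349, 'homeown': 0.9964329}
bing_single = {'CONSTANT': -1.8878998, 'med_inc': 0.0000343, 'homeown': 1.0396959}


def relative_difference(estimate, reference):
    """Signed difference from the reference, as a fraction of its magnitude."""
    return (estimate - reference) / abs(reference)


for name in truth:
    diff = relative_difference(bing_single[name], truth[name])
    print(f'{name:>10}: {diff:+.2%}')
```

In the notebook itself, the values would presumably be read from each fitted model's `betas` attribute rather than typed in by hand.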

Note: This analysis needs to be rerun with homicides expressed as a rate rather than a raw count.
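
A minimal sketch of that conversion, assuming a per-area population count is available. The `population` values and the per-1,000 scaling are hypothetical; neither appears in the data loaded above.

```python
import numpy as np

# Column vectors shaped like the 'y' arrays used throughout the notebook.
homicides = np.array([[3.0], [0.0], [7.0]])
population = np.array([[2500.0], [4000.0], [1750.0]])  # hypothetical values

# Homicides per 1,000 residents (the scaling factor is an assumption).
rate_per_1000 = homicides / population * 1000

print(rate_per_1000.ravel())
```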