# Spatial Regression
- Question: do Airbnb hosts take the prices of nearby listings into account when pricing their own?

### First, I run a regular (non-spatial) regression; it doesn't have to be very good
- The R² score doesn't matter here
- We are interested in the independent variables and their coefficients and p-values

In [1]:
import os
os.environ['USE_PYGEOS'] = '0'
import warnings
warnings.filterwarnings('ignore', message='The weights matrix is not fully connected')

In [2]:
# !pip install spreg
# !pip install libpysal
import shapely
import pandas as pd
from statsmodels.formula.api import ols as sm_ols
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from libpysal.weights import KNN
from libpysal.cg import KDTree
import spreg
import scipy.stats as stats
from sklearn.metrics import mean_squared_error

In [3]:
pd.set_option('display.max_columns', 150)

In [4]:
# Boston airbnb listings data
listings = pd.read_csv('../inputs/listings.csv.gz', compression='gzip')

In [5]:
# getting number of bathrooms as a number for regression
num_pattern = r'(\d+\.*\d*)'
listings['bath_num'] = listings['bathrooms_text'].str.extract(num_pattern).astype(float)
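To see what this pattern pulls out of a bathroom description, here is a minimal sketch using only the standard `re` module. The sample strings are assumptions made up for illustration, not rows read from the listings file, and `bath_count` is a hypothetical helper name.

```python
import re

# Same pattern as in the cell above.
num_pattern = r'(\d+\.*\d*)'

def bath_count(text):
    """Return the first number found in `text` as a float, or None if there is none."""
    match = re.search(num_pattern, text)
    return float(match.group(1)) if match else None

# Hypothetical example strings (assumed, not taken from the data).
samples = ['1 bath', '1.5 shared baths', '2 private baths', 'Half-bath']
for s in samples:
    print(repr(s), '->', bath_count(s))
```

A description with no digits gives no match. With pandas `str.extract` that shows up as `NaN`, so such rows are removed by the later `dropna()`.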

In [6]:
# make the price a float
listings['price'] = listings['price'].apply(lambda x: float(x.replace('$', '').replace(',', '')))

In [7]:
# get some variables that I think could predict price + geography
listings_4reg = listings[['bath_num','minimum_nights','bathrooms_text','beds', 'bedrooms','accommodates','price','latitude', 'longitude']].dropna().reset_index(drop=True)

In [8]:
# create regression a la assignment 6
reg = sm_ols('price ~ accommodates + bath_num + bedrooms', data=listings_4reg).fit()

In [9]:
reg.summary()

Dep. Variable:      price               R-squared:            0.17
Model:              OLS                 Adj. R-squared:       0.17
Method:             Least Squares       F-statistic:          228.4
Date:               Sat, 29 Apr 2023    Prob (F-statistic):   8.94e-135
Time:               14:38:20            Log-Likelihood:       -23133.0
No. Observations:   3343                AIC:                  46270.0
Df Residuals:       3339                BIC:                  46300.0
Df Model:           3
Covariance Type:    nonrobust

                    coef     std err          t      P>|t|     [0.025     0.975]
Intercept       -23.8794      10.738     -2.224      0.026    -44.933     -2.826
accommodates     30.7784       2.529     12.171      0.000     25.820     35.737
bath_num         77.4061       7.780      9.950      0.000     62.153     92.660
bedrooms         12.1055       4.215      2.872      0.004      3.842     20.369

Omnibus:          7906.896    Durbin-Watson:       1.896
Prob(Omnibus):    0.0         Jarque-Bera (JB):    97752592.896
Skew:             23.243      Prob(JB):            0.0
Kurtosis:         839.434     Cond. No.            13.6


- The p-values of these 3 independent variables (all below 0.05) indicate that each is statistically significant
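As a quick check on how the summary table hangs together, each t-statistic is the coefficient divided by its standard error. The sketch below recomputes them from the rounded numbers in the OLS output. The `1.96` cutoff is the usual large-sample 5% threshold, used here as an approximation, and `t_statistic` is just an illustrative helper.

```python
# Coefficients and standard errors copied from the OLS summary above.
ols_table = {
    'Intercept':    (-23.8794, 10.738),
    'accommodates': (30.7784, 2.529),
    'bath_num':     (77.4061, 7.780),
    'bedrooms':     (12.1055, 4.215),
}

def t_statistic(coef, std_err):
    """t-statistic for the null hypothesis that the true coefficient is zero."""
    return coef / std_err

t_values = {name: t_statistic(c, se) for name, (c, se) in ols_table.items()}
for name, t in t_values.items():
    # |t| > 1.96 is the approximate two-sided 5% cutoff for large samples
    print(f'{name:>12}: t = {t:7.3f}  significant at 5%: {abs(t) > 1.96}')
```

Small differences from the printed t column come from rounding in the displayed standard errors.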

### Now I add in spatial weights and compare

In [10]:
# create spatial weights based on the 3 nearest neighbors
coordinates = np.column_stack((listings_4reg['longitude'], listings_4reg['latitude']))
kd = KDTree(np.array(coordinates))
w = KNN.from_array(kd, k=3)
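For intuition about what a 3-nearest-neighbor weights object encodes, here is a small NumPy-only sketch. It is not how libpysal builds `w` internally. The coordinates and prices are made up, `knn_weights` is a hypothetical helper, and the row-standardization step is shown as an option rather than something the notebook applies.

```python
import numpy as np

def knn_weights(coords, k=3):
    """Dense binary weights matrix: row i has a 1 for each of i's k nearest points."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    # pairwise Euclidean distances between all points
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.arange(n)[:, None], nearest] = 1.0
    return W

# Toy coordinates and prices (assumed values, for illustration only)
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
prices = np.array([100.0, 120.0, 110.0, 130.0, 300.0])

W = knn_weights(coords, k=3)
W_row = W / W.sum(axis=1, keepdims=True)  # row-standardize: each row sums to 1
neighbor_mean_price = W_row @ prices      # average price of each point's 3 neighbors
print(neighbor_mean_price)
```

Note that nearest-neighbor weights are not symmetric: the outlying point at (5, 5) counts three cluster points as its neighbors, but none of them count it as theirs.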

#### Spatial two stage least squares (S2SLS) with results and diagnostics
See the [spreg.GM_Lag documentation](https://spreg.readthedocs.io/en/latest/generated/spreg.GM_Lag.html) for details.

In [11]:
# add 'const' to act as intercept
listings_4reg['const'] = 1

In [12]:
explanatory_variables = ['const','accommodates', 'bath_num', 'bedrooms']
reg_wGEO = spreg.GM_Lag(listings_4reg['price'].values[:, None],
                        listings_4reg[explanatory_variables].values,
                        w=w,
                        name_x=explanatory_variables,
                        name_y='price')
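To make the estimator less of a black box, the sketch below implements the basic idea of spatial two-stage least squares in plain NumPy on synthetic data. It is a simplified illustration, not spreg's implementation: it assumes the instruments are `X` plus `W` times the non-constant columns of `X` (which, as far as I can tell from the linked docs, corresponds to a single order of spatial lags), it computes no standard errors, and `spatial_2sls` and all data here are hypothetical.

```python
import numpy as np

def spatial_2sls(y, X, W):
    """Estimate y = rho * W y + X beta + e by two-stage least squares.

    X must hold the constant in column 0. Returns [beta..., rho].
    """
    Wy = W @ y
    H = np.column_stack([X, W @ X[:, 1:]])  # instruments: X and spatially lagged X
    Z = np.column_stack([X, Wy])            # regressors; Wy is endogenous
    # Stage 1: project the regressors onto the instrument space
    Z_hat = H @ np.linalg.lstsq(H, Z, rcond=None)[0]
    # Stage 2: regress y on the projected regressors
    return np.linalg.lstsq(Z_hat, y, rcond=None)[0]

# Synthetic data with a known spatial lag effect (all values assumed)
rng = np.random.default_rng(0)
n = 200
coords = rng.uniform(size=(n, 2))
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
np.fill_diagonal(dist, np.inf)
W = np.zeros((n, n))
W[np.arange(n)[:, None], np.argsort(dist, axis=1)[:, :3]] = 1 / 3  # 3 neighbors, rows sum to 1

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true, rho_true = np.array([1.0, 2.0, -1.0]), 0.4
e = rng.normal(scale=0.1, size=n)
y = np.linalg.solve(np.eye(n) - rho_true * W, X @ beta_true + e)

estimate = spatial_2sls(y, X, W)
print(estimate)  # last entry is the estimated rho
```

The last element of `estimate` plays the role of the spatial lag coefficient in the `GM_Lag` summary: it measures how strongly a listing's price moves with its neighbors' prices after controlling for `X`.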

In [13]:
print(reg_wGEO.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :       price                Number of Observations:        3343
Mean dependent var  :    196.1891                Number of Variables   :           5
S.D. dependent var  :    268.9393                Degrees of Freedom    :        3338
Pseudo R-squared    :      0.1674
Spatial Pseudo R-squared:  0.1703

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT     -20.9641236      14.4156888      -1.4542575       0.1458749
        accommodates      30.9812154       2.6183262      11.8324505       0.0000000
            bath_num      77.6544394       7.8313498       

## Analysis
- Null hypothesis: There is no relationship between the price of an Airbnb listing and the prices of nearby Airbnb listings.
- Alternative hypothesis: There is a spatial relationship between the price of an Airbnb listing and the prices of nearby Airbnb listings.

1. P-value of the spatial lag coefficient (the coefficient on neighboring prices) is .76
    - This means that, if the null hypothesis were true, a coefficient at least this far from zero would appear about 76% of the time through sampling variation alone.
    - In other words: we fail to reject the null hypothesis. This model gives no evidence of a relationship between the price of a Boston Airbnb listing and the prices of its 3 nearest listings. Failing to reject does not prove that no relationship exists, only that this model does not detect one.

2. Below are the mean squared errors (MSE) for both the regular regression (reg) and the spatial regression (reg_wGEO), using the same explanatory variables.
    - After adding spatial weights, the MSE increases slightly, from about 59,996 to about 60,204 (roughly 0.35%).
        - This implies that adding the spatial lag term does not reduce the in-sample prediction error.
        - Instead, the error rises slightly, so the spatial model fits these prices marginally worse.
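The p-values discussed in point 1 come from z-statistics in the S2SLS summary. As a minimal sketch of that link, assuming a standard normal distribution under the null, the two-sided p-value can be computed with the standard library alone. The example uses the CONSTANT row printed above, and `two_sided_p_value` is an illustrative helper name.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z-statistic under a standard normal null."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2))

# z-statistic of CONSTANT from the S2SLS summary above
p_const = two_sided_p_value(-1.4542575)
print(round(p_const, 4))
```

The result should agree closely with the reported probability of 0.1458749 for that row. The same calculation applied to the spatial lag coefficient's z-statistic is what yields the .76 discussed above.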

In [14]:
# get the predicted values of the dependent variable from the regression model
y_pred = reg.predict()

# get the actual values of the dependent variable from the original data
y_true = listings_4reg['price']

# calculate the mean squared error
mse = mean_squared_error(y_true, y_pred)

In [15]:
mse

59995.609356723544

In [16]:
# get the predicted values of the dependent variable from the spatial regression model
y_pred = reg_wGEO.predy.flatten()

# get the actual values of the dependent variable from the original data
y_true = listings_4reg['price']

# calculate the mean squared error
mse_w = mean_squared_error(y_true, y_pred)

In [17]:
mse_w

60204.229004777815
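To put the two MSE values on a more readable scale, this short sketch computes the relative change and the root mean squared error (RMSE), which is in the same units as price. The numbers are copied from the two outputs above.

```python
import math

mse = 59995.609356723544    # regular regression (reg)
mse_w = 60204.229004777815  # spatial regression (reg_wGEO)

relative_increase = (mse_w - mse) / mse
rmse, rmse_w = math.sqrt(mse), math.sqrt(mse_w)

print(f'MSE increase: {relative_increase:.2%}')
print(f'RMSE: {rmse:.2f} -> {rmse_w:.2f} (price units)')
```

Both models miss by roughly the same amount per listing, which is large relative to the mean price of about 196 reported in the spatial summary.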