**Build a regression model.**

In [181]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm

I want to use the category data, which means I need it to be numerical. 

In [182]:
data = pd.read_csv('joined_for_regression.csv')

I'm going to try both putting the categories as straight numbers, and putting them as dummy columns to see what happens. 

In [183]:
data['categorized'] = np.where(data['categorized']=='Restaurant',1,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Bar',2,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Coffee Shop',3,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Museum',4,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Park',5,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Monument',6,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Stadium',7,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Mall',8,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Botanical Garden',9,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Concert Hall',10,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Hospital',11,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Library',12,data['categorized'])
data['categorized'] = np.where(data['categorized']=='Tourism',13,data['categorized'])
data['categorized'] = np.where(data['categorized']=='City Hall',14,data['categorized'])

In [184]:
dummies = pd.get_dummies(data['categorized'],dtype=int)
dummies = dummies.rename(columns={
    1: 'Bar',
    2: 'Restaurant',
    3: 'Coffee Shop',
    4: 'Museum',
    5: 'Park',
    6: 'Monument',
    7: 'Stadium',
    8: 'Mall',
    9: 'Botanical Garden',
    10: 'Concert Hall',
    11: 'Hospital',
    12: 'Library',
    13: 'Tourism',
    14: 'City Hall'})

In [185]:
regress = pd.concat([data, dummies],axis=1)

In [186]:
regress['categorized'] = regress['categorized'].astype(int)

The bike return I have is for just after 8 pm on a Saturday, so late in the day for bike usage but still potentially within the range of significant usage, especially in a city's center. <p>
First, looking at available bikes as the dependent variable.

Station number and lat/long are essentially the same information, so I probably don't want to use them at the same time, so I will be trying it once with each.<p>
The way I'm going to do this:<ul>
- throw everything in<br>
- remove whatever has the highest p-value (that's over the 5% threshold)
- repeat until everything is under the threshold

In [187]:
# free bikes by station number
X = regress[['Station_Number','distance','categorized','Bar','Restaurant','Coffee Shop',
'Museum','Park','Monument','Stadium','Mall','Botanical Garden','Concert Hall','Hospital',
'Library','Tourism','City Hall']]
y = regress['free_bikes']
X = sm.add_constant(X)

In [188]:
lin_reg = sm.OLS(y,X[['const','Station_Number','Restaurant','Coffee Shop','Museum',
'Hospital']])

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.061
Model:                            OLS   Adj. R-squared:                  0.053
Method:                 Least Squares   F-statistic:                     7.330
Date:                Wed, 23 Oct 2024   Prob (F-statistic):           1.14e-06
Time:                        23:45:10   Log-Likelihood:                -1738.4
No. Observations:                 570   AIC:                             3489.
Df Residuals:                     564   BIC:                             3515.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const              4.7000      0.462     10.

In [189]:
#free bikes by lat/long
X = regress[['latitude','longitude','distance','categorized','Bar','Restaurant',
'Coffee Shop','Museum','Park','Monument','Stadium','Mall','Botanical Garden',
'Concert Hall','Hospital','Library','Tourism','City Hall']]
y = regress['free_bikes']
X = sm.add_constant(X)

In [190]:
lin_reg = sm.OLS(y,X[['const','latitude','longitude','Bar','Restaurant','Coffee Shop',
'Museum','Park','Monument','Stadium','Mall','Botanical Garden','Concert Hall',
'Hospital','Library','Tourism','City Hall']])

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.224
Model:                            OLS   Adj. R-squared:                  0.203
Method:                 Least Squares   F-statistic:                     10.65
Date:                Wed, 23 Oct 2024   Prob (F-statistic):           9.16e-23
Time:                        23:45:10   Log-Likelihood:                -1684.1
No. Observations:                 570   AIC:                             3400.
Df Residuals:                     554   BIC:                             3470.
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const             5919.9035    987.103  

Even though the p-value did not get to the 5% threshold, I removed 'categorized' because it has occured to me that it's just dupilications of all the data from the classification columns, and if I'm using those I shouldn't include it. Removing it didn't change the r-squared, but did change the coefficients. (It got removed via p-value when using station number.) I'm going to just preemptively remove it for the other variable.


Now the number of empty slots.

In [191]:
# empty slots by station number
X = regress[['Station_Number','distance','Bar','Restaurant', 'Coffee Shop','Museum',
'Park','Monument','Stadium','Mall','Botanical Garden','Concert Hall','Hospital', 
'Library', 'Tourism', 'City Hall']]
y = regress['empty_slots']
X = sm.add_constant(X)

In [192]:
lin_reg = sm.OLS(y,X[['const','Station_Number','distance','Library']])

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            empty_slots   R-squared:                       0.066
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     13.25
Date:                Wed, 23 Oct 2024   Prob (F-statistic):           2.26e-08
Time:                        23:45:10   Log-Likelihood:                -1699.8
No. Observations:                 570   AIC:                             3408.
Df Residuals:                     566   BIC:                             3425.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const              9.1194      0.497     18.

That doesn't seem right. One of two things that has a meaningful impact on empty spots at bike stations being libraries is very funny though. 

In [193]:
# emplty slots by lat/long
X = regress[['latitude','longitude','distance','Bar','Restaurant',
'Coffee Shop','Museum','Park','Monument','Stadium','Mall','Botanical Garden',
'Concert Hall','Hospital','Library','Tourism','City Hall']]
y = regress['empty_slots']
X = sm.add_constant(X)

..... there's nothing to remove. Ok. 

In [194]:
lin_reg = sm.OLS(y,X)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            empty_slots   R-squared:                       0.247
Model:                            OLS   Adj. R-squared:                  0.225
Method:                 Least Squares   F-statistic:                     11.31
Date:                Wed, 23 Oct 2024   Prob (F-statistic):           1.61e-25
Time:                        23:45:10   Log-Likelihood:                -1638.4
No. Observations:                 570   AIC:                             3311.
Df Residuals:                     553   BIC:                             3385.
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const            -8742.9707    914.801  

**Provide model output and an interpretation of the results.**

Using lat/long rather than station number provided better r-squared values for both the free bikes variable (0.224 vs 0.061) and the empty slots variable (0.247 vs 0.066). Using lat/long, all of the location types show as statistically significant, but distance is not for bikes available at a station.

The data for the number of free bikes/empty slots was retrived at 8pm on a Saturday.

In [195]:
X = regress[['latitude','longitude','Bar','Restaurant','Coffee Shop','Museum','Park','Monument',
'Stadium','Mall','Botanical Garden','Concert Hall','Hospital','Library','Tourism','City Hall']]
y = regress['free_bikes']
X = sm.add_constant(X)

lin_reg = sm.OLS(y,X)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.224
Model:                            OLS   Adj. R-squared:                  0.203
Method:                 Least Squares   F-statistic:                     10.65
Date:                Wed, 23 Oct 2024   Prob (F-statistic):           9.16e-23
Time:                        23:45:10   Log-Likelihood:                -1684.1
No. Observations:                 570   AIC:                             3400.
Df Residuals:                     554   BIC:                             3470.
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const             5919.9035    987.103  

Everything included here has a positive effect on the number of available bikes, with all types of location having similar effects. Being further west has a positive impact, while being further north is the only thing that shows as having a negative effect.

In [196]:
X = regress[['latitude','longitude','distance','Bar','Restaurant',
'Coffee Shop','Museum','Park','Monument','Stadium','Mall','Botanical Garden',
'Concert Hall','Hospital','Library','Tourism','City Hall']]
y = regress['empty_slots']
X = sm.add_constant(X)

lin_reg = sm.OLS(y,X)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            empty_slots   R-squared:                       0.247
Model:                            OLS   Adj. R-squared:                  0.225
Method:                 Least Squares   F-statistic:                     11.31
Date:                Wed, 23 Oct 2024   Prob (F-statistic):           1.61e-25
Time:                        23:45:10   Log-Likelihood:                -1638.4
No. Observations:                 570   AIC:                             3311.
Df Residuals:                     553   BIC:                             3385.
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const            -8742.9707    914.801  

The relationships this time are the opposite of in the available bikes model, albiet with different coefficients. The distance coefficient, which was not present at all in the available bikes model, is tiny. 

In [197]:
lin_reg = sm.OLS(y,X[['const','latitude','longitude','Bar','Restaurant',
'Coffee Shop','Museum','Park','Monument','Stadium','Mall','Botanical Garden',
'Concert Hall','Hospital','Library','Tourism','City Hall']])

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            empty_slots   R-squared:                       0.237
Model:                            OLS   Adj. R-squared:                  0.216
Method:                 Least Squares   F-statistic:                     11.47
Date:                Wed, 23 Oct 2024   Prob (F-statistic):           1.20e-24
Time:                        23:45:10   Log-Likelihood:                -1642.1
No. Observations:                 570   AIC:                             3316.
Df Residuals:                     554   BIC:                             3386.
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const            -8551.4025    916.973  

However, removing it _does_ make the slightly model worse.


Regardless, this model does not fit the data very well; it is possible that a pull from a more active time of day, or using data from the Yelp API that included things like ratings would give a better model. However, with the timezones, getting a pull at noon, or even right after the end of the workday, would be difficult for me. 

The models' fits get drammatically better if you put the *CLEARLY* colinear free bikes/empty slots variables back in. Which probably makes some kind of sense, but I'm not entirely sure why.

In [198]:
X = regress[['empty_slots','latitude','longitude','Bar','Restaurant','Coffee Shop','Museum','Park','Monument',
'Stadium','Mall','Botanical Garden','Concert Hall','Hospital','Library','Tourism','City Hall']]
y = regress['free_bikes']
X = sm.add_constant(X)

lin_reg = sm.OLS(y,X)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.509
Model:                            OLS   Adj. R-squared:                  0.495
Method:                 Least Squares   F-statistic:                     35.84
Date:                Wed, 23 Oct 2024   Prob (F-statistic):           8.98e-75
Time:                        23:45:10   Log-Likelihood:                -1553.5
No. Observations:                 570   AIC:                             3141.
Df Residuals:                     553   BIC:                             3215.
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const              339.6224    845.197  

In [202]:
X = regress[['free_bikes','latitude','longitude','distance','Bar','Restaurant',
'Coffee Shop','Museum','Park','Monument','Stadium','Mall','Botanical Garden',
'Concert Hall','Hospital','Library','Tourism','City Hall']]
y = regress['empty_slots']
X = sm.add_constant(X)

lin_reg = sm.OLS(y,X)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            empty_slots   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.520
Method:                 Least Squares   F-statistic:                     37.30
Date:                Wed, 23 Oct 2024   Prob (F-statistic):           3.18e-80
Time:                        23:48:38   Log-Likelihood:                -1501.2
No. Observations:                 570   AIC:                             3038.
Df Residuals:                     552   BIC:                             3117.
Df Model:                          17                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const            -5422.8785    741.747  