Build a regression model.

In [1]:
import numpy as np
import pandas as pd

import statsmodels.api as sm

In [2]:
model_df = pd.read_csv('regression.csv')

In [6]:
model_df.head()

Unnamed: 0,Bike Station,restaurants,stores,parks,bars,Total Reviews,Average Rating,Empty Slots,Free Bikes
0,Bike Station 0,6,2,0,2,2952.0,4.0,3.0,16.0
1,Bike Station 1,3,1,2,0,1553.0,4.225,12.0,2.0
2,Bike Station 10,6,2,0,1,3952.0,3.875,12.0,4.0
3,Bike Station 100,5,1,0,0,3194.0,4.25,15.0,3.0
4,Bike Station 101,6,1,1,0,3727.0,4.075,13.0,3.0


#### Conducting regression with Empty Slots as the dependent variable

In [3]:
X = model_df[['restaurants', 'stores', 'parks', 'bars', 'Total Reviews', 'Average Rating']]

In [4]:
#The first attempt aims to analyze what variables affect the number of empty slots at a bike station on a Friday night in Toronto

Y = model_df['Empty Slots']

In [5]:
X = sm.add_constant(X)
regression1 = sm.OLS(Y, X).fit()
print(regression1.summary())

                            OLS Regression Results                            
Dep. Variable:            Empty Slots   R-squared:                       0.073
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     6.441
Date:                Mon, 18 Dec 2023   Prob (F-statistic):           1.50e-06
Time:                        10:19:16   Log-Likelihood:                -1666.6
No. Observations:                 500   AIC:                             3347.
Df Residuals:                     493   BIC:                             3377.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -5.4708      7.014     -0.

In [8]:
# Drop restaurants, parks, and the constant, as they are not statistically significant, and re-run the model

X1 = model_df[['stores', 'bars', 'Total Reviews', 'Average Rating']]

In [9]:
regression2 = sm.OLS(Y, X1).fit()
print(regression2.summary())

                                 OLS Regression Results                                
Dep. Variable:            Empty Slots   R-squared (uncentered):                   0.745
Model:                            OLS   Adj. R-squared (uncentered):              0.743
Method:                 Least Squares   F-statistic:                              363.2
Date:                Mon, 18 Dec 2023   Prob (F-statistic):                   7.59e-146
Time:                        10:22:11   Log-Likelihood:                         -1668.2
No. Observations:                 500   AIC:                                      3344.
Df Residuals:                     496   BIC:                                      3361.
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
                     coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------

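**Model #1 Analysis:** Based on the regression results, the number of stores, bars, Total Reviews, and Average Rating are all indicators of the number of empty slots at each station. The reported value of 0.745 is an uncentered R-squared, which statsmodels reports when the model has no constant. It is not a correlation, and it is not comparable to the centered R-squared of 0.073 from the first fit. The log-likelihood barely changed (-1666.6 with the constant, -1668.2 without), so dropping the constant did not materially improve the fit.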
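When reading the 0.073 and 0.745 figures side by side, it may help to compute both versions of R-squared for the same no-constant fit. The sketch below uses made-up toy numbers (not values from regression.csv) and a single hypothetical predictor. It only illustrates the two formulas: centered `1 - SSR / sum((y - mean(y))**2)` and uncentered `1 - SSR / sum(y**2)`.

```python
# Toy illustration with made-up numbers (not taken from regression.csv).
y = [8.0, 12.0, 9.0, 11.0, 10.0]   # hypothetical empty-slot counts
x = [4.0, 4.2, 3.9, 4.1, 4.0]      # hypothetical average ratings

# Least-squares slope for a model with no constant: y ~ b * x
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
residuals = [yi - b * xi for xi, yi in zip(x, y)]
ssr = sum(r * r for r in residuals)

y_mean = sum(y) / len(y)
tss_centered = sum((yi - y_mean) ** 2 for yi in y)   # variation around the mean
tss_uncentered = sum(yi ** 2 for yi in y)            # variation around zero

r2_centered = 1 - ssr / tss_centered
r2_uncentered = 1 - ssr / tss_uncentered

print(f"centered R^2:   {r2_centered:.3f}")
print(f"uncentered R^2: {r2_uncentered:.3f}")
```

In this toy case the uncentered value is high mainly because the y values sit far from zero, not because the predictor explains their variation.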

In [10]:
X2 = model_df[['restaurants', 'stores', 'parks', 'bars', 'Total Reviews', 'Average Rating']]

In [11]:
#The second attempt aims to analyze what variables affect the number of free bikes at a bike station on a Friday night in Toronto

Y1 = model_df['Free Bikes']

In [12]:
X2 = sm.add_constant(X2)
regression3 = sm.OLS(Y1, X2).fit()
print(regression3.summary())

                            OLS Regression Results                            
Dep. Variable:             Free Bikes   R-squared:                       0.139
Model:                            OLS   Adj. R-squared:                  0.129
Method:                 Least Squares   F-statistic:                     13.31
Date:                Mon, 18 Dec 2023   Prob (F-statistic):           5.33e-14
Time:                        10:24:08   Log-Likelihood:                -1652.9
No. Observations:                 500   AIC:                             3320.
Df Residuals:                     493   BIC:                             3349.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             34.1725      6.824      5.

In [13]:
# Drop restaurants, as it is not statistically significant, and re-run the model

X3 = model_df[['stores', 'parks', 'bars', 'Total Reviews', 'Average Rating']]

In [14]:
X3 = sm.add_constant(X3)
regression4 = sm.OLS(Y1, X3).fit()
print(regression4.summary())

                            OLS Regression Results                            
Dep. Variable:             Free Bikes   R-squared:                       0.139
Model:                            OLS   Adj. R-squared:                  0.130
Method:                 Least Squares   F-statistic:                     15.90
Date:                Mon, 18 Dec 2023   Prob (F-statistic):           1.55e-14
Time:                        10:24:44   Log-Likelihood:                -1653.1
No. Observations:                 500   AIC:                             3318.
Df Residuals:                     494   BIC:                             3344.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             34.9082      6.729      5.

**Model #2 Analysis:** The number of stores, parks, bars, total reviews, and average rating all affect the number of free bikes available at a station. The R-squared of 0.139 is low, implying a poor fit. It is higher than the 0.073 of the comparable Empty Slots model with a constant, but it cannot be compared with the uncentered 0.745 of the no-constant model. Stores has a negative coefficient, while parks and bars have positive coefficients, which does not follow intuition. One explanation could be the time of day (night) and year (mid-December) at which the data was pulled; however, the low R-squared makes the model unusable for prediction.
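The two Free Bikes summaries report the same R-squared (0.139) but slightly different adjusted values (0.129 and 0.130). As a quick arithmetic check, the sketch below applies the standard adjusted R-squared formula, `1 - (1 - R2) * (n - 1) / (n - k - 1)`, to the rounded figures printed in those summaries. `adjusted_r2` is a helper name introduced here for illustration only.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with a constant and n_predictors regressors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Inputs are the rounded values printed in the two Free Bikes summaries.
with_restaurants = adjusted_r2(0.139, n_obs=500, n_predictors=6)
without_restaurants = adjusted_r2(0.139, n_obs=500, n_predictors=5)

print(round(with_restaurants, 3), round(without_restaurants, 3))
```

Dropping a predictor that adds nothing leaves R-squared unchanged, while the adjusted value rises slightly because one fewer parameter is penalized.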

**Reflection & Next Steps:** Overall the dataset is limiting in a few key aspects:

1) Bike data was pulled at a single point in time, so external factors may be affecting the results. For example, this data was pulled on a Friday night in mid-December, which likely sees less usage than mid-July would. To create a more robust model I would want to pull data across different times of day and different times of year, so that seasonality or outliers do not dominate a single point-in-time pull.

2) Foursquare and Yelp both had endpoint limits of 10 and 20 results, respectively, which could also affect the data. To build a more robust model I would want to analyze all of the data within the 1000m radius of each bike station, not just the first 10 or 20 results.

3) Due to project limitations I was only able to explore the variables affecting empty slots/free bikes based on intuition. With more time, as a next step I would systematically evaluate different variable combinations to assess whether additional POI variables have a greater impact on predicting the number of free bikes.


Provide model output and an interpretation of the results. 

# Stretch

How can you turn the regression model into a classification model?
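One possible approach, sketched here under stated assumptions, is to replace the continuous target with a binary label (for example, whether a station has at least some threshold number of free bikes) and fit a logistic model instead of OLS. The data below is a made-up toy set, not values from regression.csv, and the threshold of 5, the single `bars` predictor, and the helper names are all hypothetical choices for illustration.

```python
import math

# Hypothetical toy data (not from regression.csv).
bars = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
free_bikes = [2, 3, 1, 6, 4, 8, 7, 9, 12, 15]

# Step 1: binarize the continuous target. The cut-off is an assumption.
THRESHOLD = 5
well_stocked = [1 if n >= THRESHOLD else 0 for n in free_bikes]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, xs, ys):
    """Mean negative log-likelihood of the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Step 2: fit P(y=1) = sigmoid(w*x + b) by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

w, b = fit_logistic(bars, well_stocked)

# Step 3: classify by thresholding the predicted probability at 0.5.
predicted = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in bars]
accuracy = sum(p == y for p, y in zip(predicted, well_stocked)) / len(bars)
print(f"w={w:.2f}, b={b:.2f}, accuracy={accuracy:.2f}")
```

In the notebook itself, `sm.Logit(label, sm.add_constant(X)).fit()` would be the closer counterpart to the `sm.OLS` calls above. The hand-rolled fit is used here only to keep the sketch dependency-free.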