Build a regression model.

In [None]:
import sqlite3
import pandas as pd
import statsmodels.api as sm

# Load each station together with its Yelp and Foursquare summary features.
conn = sqlite3.connect("data/bike_project.db")
query = """
SELECT bs.name AS station, bs.latitude, bs.longitude, bs.slots,
       y.total_yelp_venues, y.avg_yelp_rating, y.total_yelp_reviews, y.closest_yelp_distance,
       f.total_fsq_venues, distance AS closest_fsq_distance
FROM bike_stations bs
LEFT JOIN yelp_summary y ON bs.name = y.station
LEFT JOIN fsq_summary f ON bs.name = f.station
"""
merged_df = pd.read_sql(query, conn)
conn.close()

# The LEFT JOINs leave NULLs for stations without POI data; OLS needs complete rows.
merged_df = merged_df.dropna()

# Predictors: POI counts, ratings, and distances around each station.
X = merged_df[[
    "total_yelp_venues",
    "avg_yelp_rating",
    "closest_yelp_distance",
    "total_fsq_venues",
    "closest_fsq_distance"
]]

# Target: number of docking slots at the station.
y = merged_df["slots"]

# statsmodels does not add an intercept automatically.
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()



Provide model output and an interpretation of the results. 

In [20]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  slots   R-squared:                       0.029
Model:                            OLS   Adj. R-squared:                  0.025
Method:                 Least Squares   F-statistic:                     7.175
Date:                Thu, 15 May 2025   Prob (F-statistic):           1.08e-05
Time:                        04:30:30   Log-Likelihood:                -3291.2
No. Observations:                 961   AIC:                             6592.
Df Residuals:                     956   BIC:                             6617.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
total_yelp_venues        -0.20

The regression has an R-squared of 0.029, so the POI features explain only 2.9% of the variance in the number of slots per station. The F-test (F = 7.175, p = 1.08e-05) indicates that the predictors are jointly significant, but the effect is small.
This weak explanatory power is not surprising. The number of slots at a station is likely driven more by city planning, infrastructure, and commuting patterns than by restaurant density or venue proximity.
The model still reveals weak but statistically detectable associations between POI features and station size. For example, the total_yelp_venues coefficient is negative, so stations with more Yelp venues nearby tend to have slightly fewer slots. With richer data (e.g., population density, traffic volume), a more predictive model could be built.
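
As a quick sanity check on the summary, the adjusted R-squared can be recomputed from the reported R-squared, the number of observations, and the residual degrees of freedom. This is a minimal sketch using values copied from the output above; the helper name `adjusted_r_squared` is hypothetical and not part of the notebook.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, df_resid: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / df_resid."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / df_resid


# Values copied from the OLS summary (R-squared is rounded to three decimals there).
r_squared = 0.029
n_obs = 961
df_resid = 956

adj_r2 = adjusted_r_squared(r_squared, n_obs, df_resid)
print(f"adjusted R-squared: {adj_r2:.3f}")
```

The penalty for the number of fitted parameters is small here because 961 observations are available, so the adjusted value stays close to the raw R-squared.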

# Stretch

How can you turn the regression model into a classification model?

To turn the regression problem into a classification problem, we redefine the target variable. Instead of predicting the number of slots at a station, we classify whether the station is “high capacity” or “low capacity.” This can be done by applying a threshold (e.g., stations with more than 10 slots are labeled 1, the rest 0).

The same POI features (venue counts, ratings, and distances) can be used as inputs. 

In [24]:


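# Binary target: 1 if the station has more than 10 slots, else 0.
merged_df["high_capacity"] = (merged_df["slots"] > 10).astype(int)
y = merged_df["high_capacity"]

# Same POI predictors as in the linear model.
X = merged_df[[
    "total_yelp_venues",
    "avg_yelp_rating",
    "closest_yelp_distance",
    "total_fsq_venues",
    "closest_fsq_distance"
]]

# Add the intercept and fit a logistic regression by maximum likelihood.
X = sm.add_constant(X)
logit_model = sm.Logit(y, X).fit()
print(logit_model.summary())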



Optimization terminated successfully.
         Current function value: 0.046135
         Iterations 10
                           Logit Regression Results                           
Dep. Variable:          high_capacity   No. Observations:                  961
Model:                          Logit   Df Residuals:                      956
Method:                           MLE   Df Model:                            4
Date:                Thu, 15 May 2025   Pseudo R-squ.:                  0.4263
Time:                        04:34:52   Log-Likelihood:                -44.336
converged:                       True   LL-Null:                       -77.281
Covariance Type:            nonrobust   LLR p-value:                 1.670e-13
                            coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------
total_yelp_venues        -2.0279      0.512     -3.958      0.000      -3.032      -1

Our logistic regression model predicts whether a bike station is "high capacity" (more than 10 slots) based on surrounding POI data.

The model fits much better than an intercept-only baseline, with a McFadden pseudo R-squared of 0.426, suggesting POI characteristics can offer meaningful clues about station size. Pseudo R-squared measures the improvement in log-likelihood, not classification accuracy, so it should not be read as the share of stations classified correctly. Stations in areas with fewer Yelp venues are more likely to be high-capacity (coefficient -2.0279, p < 0.001). This may reflect that larger stations are placed in less dense, more open locations (e.g., parks, campuses). Higher average Yelp ratings around a station are strongly associated with a higher likelihood of it being high-capacity, possibly indicating upscale or more developed zones.
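
The pseudo R-squared can be reproduced from the two log-likelihoods in the summary, and the null log-likelihood also reveals how unbalanced the two classes are. The sketch below uses only values copied from the Logit output above; it assumes LL-Null is the log-likelihood of an intercept-only model, and the helper `null_loglike` is a hypothetical name introduced for illustration.

```python
import math

# Values copied from the Logit summary.
n_obs = 961
ll_model = -44.336
ll_null = -77.281

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
pseudo_r2 = 1.0 - ll_model / ll_null
print(f"pseudo R-squared: {pseudo_r2:.4f}")


def null_loglike(k: int, n: int) -> float:
    """Log-likelihood of an intercept-only logit when k of n labels fall in one class."""
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


# Search for the minority-class count whose null log-likelihood best matches LL-Null.
minority_count = min(
    range(1, n_obs // 2 + 1),
    key=lambda k: abs(null_loglike(k, n_obs) - ll_null),
)
print(f"implied minority-class size: {minority_count} of {n_obs}")
```

The output alone does not say which label is the minority, but a split this uneven means a model that always predicts the majority class would already be right for most stations, so the fit is worth checking against that baseline before drawing strong conclusions.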

Foursquare-based features were not statistically significant, suggesting Yelp may offer more relevant coverage or better signal for this use case.
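
Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. A minimal sketch for the total_yelp_venues coefficient, using the value copied from the summary above (the variable names are illustrative):

```python
import math

# Coefficient for total_yelp_venues, copied from the Logit summary (log-odds scale).
coef_yelp_venues = -2.0279

# exp(coef) is the multiplicative change in the odds of "high capacity"
# for a one-unit increase in the predictor, holding the others fixed.
odds_ratio = math.exp(coef_yelp_venues)
pct_change_in_odds = (odds_ratio - 1.0) * 100.0

print(f"odds ratio: {odds_ratio:.3f}")
print(f"change in odds per additional unit: {pct_change_in_odds:.1f}%")
```

An odds ratio well below 1 matches the interpretation above: each one-unit increase in total_yelp_venues is associated with much lower odds of a station being high-capacity.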

