Build a regression model.

In [5]:
# NOTE:
# This join is being re-run here to ensure that the regression model below uses 
# the most up-to-date and complete dataset from all sources (CityBikes, Yelp, Foursquare).
# Ideally, this merge would belong in joining_data.ipynb, 
# but it is repeated here for modeling integrity and convenience.

import pandas as pd

# Load the joined dataset
df = pd.read_csv("../data/processed/joined_bike_venue_data.csv")

# Optional: Check for missing values
print(df.isnull().sum())

# Drop rows with missing values in relevant columns
model_df = df.dropna(subset=[
    "venue_count_fsq", "avg_distance",
    "venue_count_yelp", "avg_rating", "avg_reviews",
    "free_bikes"
])

# Define X (independent variables) and y (target)
X = model_df[[
    "venue_count_fsq",
    "avg_distance",
    "venue_count_yelp",
    "avg_rating",
    "avg_reviews"
]]
y = model_df["free_bikes"]

import statsmodels.api as sm

# Add constant term for intercept
X = sm.add_constant(X)

# Fit OLS regression model
model = sm.OLS(y, X).fit()

# Print the results
print(model.summary())



station_id            0
name                  0
latitude              0
longitude             0
free_bikes            0
venue_count_fsq      39
avg_distance         39
venue_count_yelp    904
avg_rating          904
avg_reviews         904
dtype: int64
                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.145
Model:                            OLS   Adj. R-squared:                  0.098
Method:                 Least Squares   F-statistic:                     3.110
Date:                Tue, 29 Jul 2025   Prob (F-statistic):             0.0122
Time:                        08:19:34   Log-Likelihood:                -337.16
No. Observations:                  98   AIC:                             686.3
Df Residuals:                      92   BIC:                             701.8
Df Model:                           5                                         
Covariance Type:            nonrobus

Provide model output and an interpretation of the results. 

In [7]:
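# The model shows that the average number of reviews of nearby businesses is positively associated with the number of available bikes at a station, and this relationship is statistically significant. Other variables, such as venue counts and average ratings, showed weaker or non-significant relationships.

# The model explains about 15% of the variation in free bike availability (R-squared = 0.145, adjusted R-squared = 0.098), and the F-test indicates that the predictors are jointly significant (p = 0.0122). The fit is modest, which is plausible given the real-time, volatile nature of the data. Only 98 observations remain after dropping rows with missing values, largely because 904 rows have no Yelp data, so the estimates rest on a small sample. With more temporal context (e.g., time of day, day of week), the model's performance could improve.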
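
The statements above about which predictors are significant can be checked directly against the fitted result. The sketch below is one possible way to do that, not part of the original analysis. The helper name `coefficient_table` and the 0.05 cutoff are assumptions made for illustration. It also assumes the model was fit on a pandas DataFrame, so that `params`, `pvalues`, and `conf_int()` carry the column names.

```python
import pandas as pd


def coefficient_table(result, alpha=0.05):
    """Collect coefficients, p-values, and confidence intervals in one table.

    `result` is assumed to be a fitted statsmodels result (for example the
    `model` object above) that was fit on a pandas DataFrame.
    """
    # conf_int() returns a DataFrame with columns 0 (lower) and 1 (upper)
    ci = result.conf_int(alpha=alpha)
    table = pd.DataFrame({
        "coef": result.params,
        "p_value": result.pvalues,
        "ci_low": ci[0],
        "ci_high": ci[1],
    })
    # Flag predictors whose p-value falls below the chosen cutoff
    table["significant"] = table["p_value"] < alpha
    # Most significant predictors first
    return table.sort_values("p_value")
```

Calling `print(coefficient_table(model))` after the OLS cell would list each predictor with its coefficient and p-value, sorted so the strongest evidence comes first.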

# Stretch

How can you turn the regression model into a classification model?

In [1]:
# I would reframe this as a binary classification problem: label each station as low-availability (e.g., fewer than 3 free bikes) or not, then fit a classifier such as logistic regression on the same venue features to predict that label.