Build a regression model.

In [11]:
!pip install statsmodels
import statsmodels.api as sm
import pandas

df = pandas.read_json("joined_osaka_venue_station_data.json")

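# Coerce both columns to numeric first, so unparseable values become NaN...
df["free_bikes"] = pandas.to_numeric(df["free_bikes"], errors="coerce")
df["venue_rating"] = pandas.to_numeric(df["venue_rating"], errors="coerce")

# ...then drop rows missing either value (statsmodels does not drop NaN rows by default).
df = df.dropna(subset=["free_bikes", "venue_rating"])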

y = df["free_bikes"]
X = df[["venue_rating"]]
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()


Provide model output and an interpretation of the results.

In [12]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.012
Model:                            OLS   Adj. R-squared:                 -0.009
Method:                 Least Squares   F-statistic:                    0.5666
Date:                Tue, 20 May 2025   Prob (F-statistic):              0.455
Time:                        23:38:55   Log-Likelihood:                -34.976
No. Observations:                  50   AIC:                             73.95
Df Residuals:                      48   BIC:                             77.78
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            0.7722      0.239      3.226   

Based on the regression output above, a few interpretations can be drawn about the relationship between the number of free bikes at a station and the ratings of the restaurants in its vicinity:


*   Per the R-squared value of **0.012**: The model fits the data poorly, as venue rating explains only 1.2% of the variance in the number of free bikes. The adjusted R-squared of -0.009 is effectively zero, which points the same way.

  Perhaps this means that the number of bikes parked at these stations is not solely a response to the popularity of the restaurants around them. Most likely, the restaurants are not the main, or only, motivation for riders to park their bikes there.

*   Per the P-value of 0.455 and coefficient of -0.0422 for Venue Ratings: This particular characteristic of the restaurant is not statistically significant at the usual 0.05 level, nor does it have much impact on the Number of Bikes variable. This is aligned with the interpretation from the R-squared value.

  Intuitively, one would think a restaurant's rating/popularity draws customers to the area, but this does not seem to carry over to the bikes in the area. Perhaps customers are not biking to the restaurants and are taking another mode of transportation. Alternatively, the bikes in the area are there for other purposes rather than for the restaurants. In any case, the correlation between free bikes at those stations and the ratings of the restaurants in the vicinity (however high those ratings are) seems to be very low.
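As a quick sanity check on the fit statistics, the adjusted R-squared can be recomputed by hand from the values printed in the summary. This is only a small arithmetic sketch using the rounded figures from the output above (R-squared of 0.012, 50 observations, 1 predictor), not a re-run of the model.

```python
# Values copied (rounded) from the OLS summary above.
r_squared = 0.012
n_obs = 50
n_predictors = 1

# Adjusted R-squared penalizes R-squared for the number of predictors:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(round(adj_r_squared, 3))  # -0.009, matching the summary
```

A negative adjusted R-squared simply means the single predictor explains less variance than would be expected by chance for a model of this size.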

# Stretch

How can you turn the regression model into a classification model?

To turn the regression model into a classification model, the continuous number-of-bikes variable would need to be converted into **discrete** categories (for example, "bikes available" vs. "no bikes available"). These categories would then be used as the target to answer similar, or even the same, questions about the relationship with venue ratings.

The approach would be akin to the following:


1.   Define the classification type (binary logistic or multinomial).
2.   Use the venue data gathered previously as the predictors of those categories.
3.  Process the data to be used (data cleaning) if not done already.
4.  Prepare training and test sets to put the data through.
5.  Apply the classification model.
6.  Evaluate the results.

