Build a regression model.

In [7]:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
df = pd.read_pickle("/Users/snoopy/GitHub/LHL-Project-2-Statistical-Modelling-with-Python/data/joined_citybike_yelp.pkl")

# Reduce the number of unique categories
top_n = 10
category_counts = df['Category'].value_counts()
top_categories = category_counts.index[:top_n]
df['Category'] = df['Category'].apply(lambda x: x if x in top_categories else 'Other')

# One-hot encode the reduced categories
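# Note: `sparse_output` replaces the `sparse` argument in scikit-learn >= 1.2 (`sparse` was removed in 1.4)
encoder = OneHotEncoder(sparse_output=False, drop='first')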
category_encoded = encoder.fit_transform(df[['Category']])
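category_df = pd.DataFrame(
    category_encoded,
    columns=encoder.get_feature_names_out(['Category']),
    index=df.index,  # share df's index so the concat below aligns row by row
)

# Create dataset with reduced categories; drop incomplete rows only after joining,
# so the numeric columns and the dummy columns cannot fall out of alignment
df_model = pd.concat([df[['free_bikes', 'Rating', 'Review Count']], category_df], axis=1).dropna()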

# Define independent and dependent variables
X = df_model.drop(columns=['free_bikes'])
y = df_model['free_bikes']
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Display the model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     5.015
Date:                Tue, 25 Feb 2025   Prob (F-statistic):           2.20e-08
Time:                        20:10:50   Log-Likelihood:                -40881.
No. Observations:               12898   AIC:                         8.179e+04
Df Residuals:                   12885   BIC:                         8.188e+04
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const         

Provide model output and an interpretation of the results. 

1. Model Fit (R-Squared & Adjusted R-Squared)
R-squared: 0.005
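The model explains only about 0.5% of the variance in the number of available bikes.
This is extremely low: Rating, Review Count, and POI category together leave roughly 99.5% of the variation in bike availability unexplained. The overall F-test is still significant (p = 2.20e-08), which is expected with 12,898 observations, since even very small effects become detectable at that sample size.
Adjusted R-squared: 0.004
Nearly identical to R-squared, so the poor fit is not an artifact of the number of predictors; penalizing the 12 predictors changes almost nothing.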
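As a rough arithmetic check, the adjusted R-squared can be recomputed from the summary values with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below plugs in the figures printed in the output (n = 12898, p = 12, R² = 0.005). Because the printed R² is already rounded, this only reproduces the reported value to three decimals; the function name is illustrative and not part of the notebook.

```python
def adjusted_r_squared(r2: float, n_obs: int, n_predictors: int) -> float:
    """Standard adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read off the OLS summary (R-squared is rounded there, so this is approximate)
adj = adjusted_r_squared(r2=0.005, n_obs=12898, n_predictors=12)
print(round(adj, 3))  # 0.004
```

With almost 13,000 observations and only 12 predictors, the penalty factor (n - 1)/(n - p - 1) is very close to 1, which is why the two values barely differ.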
2. Statistical Significance (P-values)
A p-value < 0.05 is treated as statistically significant here.
Significant Predictors:
Rating (p < 0.001; printed as 0.000 in the summary) → Positive association (higher-rated locations tend to have more free bikes).
Pizza Places (p = 0.002) → Positive association relative to the reference category (the category dropped by drop='first').
Non-Significant Predictors:
Review Count (p = 0.301) → No statistically significant association.
Most Category variables (e.g., Indian, Italian, Japanese restaurants, parks, performing arts) are not significantly different from the reference category.
Performing Arts (p = 0.987) → Statistically indistinguishable from the reference category.
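The screening above can be sketched in a few lines. The dictionary holds only the p-values quoted in this section, keyed by display names rather than the actual column names. The second call applies a Bonferroni-adjusted threshold (0.05 divided by the 12 predictors), which is an extra check assumed here for illustration and not something the original analysis ran.

```python
# p-values as quoted in the interpretation ("0.000" means the summary rounded it to zero)
p_values = {
    "Rating": 0.000,
    "Pizza Places": 0.002,
    "Review Count": 0.301,
    "Performing Arts": 0.987,
}

ALPHA = 0.05
N_PREDICTORS = 12  # "Df Model" in the OLS summary


def significant(pvals: dict, threshold: float) -> list:
    """Return the predictor names whose p-value falls below the threshold."""
    return [name for name, p in pvals.items() if p < threshold]


print(significant(p_values, ALPHA))                 # ['Rating', 'Pizza Places']
print(significant(p_values, ALPHA / N_PREDICTORS))  # ['Rating', 'Pizza Places']
```

Both predictors stay below the stricter threshold of about 0.0042, so among the values quoted here the conclusion does not hinge on the choice of 0.05.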
3. Coefficients (Effect on free_bikes)
Intercept (const = 3.0467)
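This is the predicted number of free bikes when Rating and Review Count are both zero and the POI belongs to the reference category. A rating of zero falls outside Yelp's 1 to 5 scale, so the intercept is an extrapolation rather than a typical station's baseline.
Rating (+0.94 bikes per unit increase in rating)
Holding the other predictors fixed, each additional rating point is associated with about 0.94 more free bikes.
Pizza places (+1.35 bikes)
POIs categorized as pizza places are associated with about 1.35 more free bikes than the reference category, holding Rating and Review Count fixed.
The other category coefficients are not statistically distinguishable from the reference category.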
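To make the coefficients concrete, the sketch below plugs the three quoted values into the fitted equation. It is deliberately partial: the Review Count coefficient and the other category dummies are left out because their values are not quoted in this section, so the numbers are approximate illustrations rather than the model's actual fitted values. The function name and the example rating of 4.0 are hypothetical.

```python
# Coefficients quoted in the interpretation above
CONST = 3.0467
COEF_RATING = 0.94
COEF_PIZZA = 1.35


def predicted_free_bikes(rating: float, is_pizza_place: bool) -> float:
    """Approximate fitted value from the quoted coefficients only.

    Review Count and the remaining category dummies are omitted (assumption),
    because their coefficients are not given in the text.
    """
    return CONST + COEF_RATING * rating + COEF_PIZZA * (1.0 if is_pizza_place else 0.0)


reference = predicted_free_bikes(rating=4.0, is_pizza_place=False)
pizza = predicted_free_bikes(rating=4.0, is_pizza_place=True)
print(round(reference, 2), round(pizza, 2), round(pizza - reference, 2))
```

The difference between the two predictions is exactly the pizza coefficient, which is what "relative to the reference category" means for a dummy variable.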
4. Multicollinearity Warning
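The large condition number (2.08e+03) suggests potential multicollinearity (strong correlations between predictors) or other numerical problems, such as predictors measured on very different scales.
Possible issues:
Too many category variables → Some might be redundant.
Highly correlated predictors (e.g., Rating and Review Count might be correlated).
Scale differences → Review Count likely spans a much wider numeric range than Rating or the 0/1 category dummies, which can inflate the condition number even when predictors are uncorrelated.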
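A large condition number does not by itself prove that predictors are correlated. The sketch below uses synthetic, independent stand-ins for Rating and Review Count (not the project's data; the value ranges are assumptions) to show that a difference in scale alone can push the condition number of the design matrix into the thousands, and that standardizing the columns brings it back down.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic, independent stand-ins for the real columns (assumed ranges)
rating = rng.uniform(1.0, 5.0, size=n)
review_count = rng.uniform(0.0, 2000.0, size=n)


def design_matrix(*columns: np.ndarray) -> np.ndarray:
    """Stack a constant column and the given predictors, similar to sm.add_constant."""
    return np.column_stack([np.ones(len(columns[0])), *columns])


def standardize(x: np.ndarray) -> np.ndarray:
    """Center to mean 0 and scale to unit standard deviation."""
    return (x - x.mean()) / x.std()


cond_raw = np.linalg.cond(design_matrix(rating, review_count))
cond_scaled = np.linalg.cond(
    design_matrix(standardize(rating), standardize(review_count))
)

print(f"raw: {cond_raw:.1f}, standardized: {cond_scaled:.2f}")
```

If the real model's condition number drops sharply after standardizing the numeric columns, scale is the likelier explanation; if it stays high, checking pairwise correlations or variance inflation factors would be the next step.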
Insights & Next Steps
Key Findings
Bike Availability Is Not Well Explained by POI Characteristics

The very low R² suggests that POI characteristics alone are not good predictors.
Other factors (e.g., weather, time of day, commute patterns) might be more relevant.
Ratings Show a Small Positive Association

Higher-rated locations tend to have slightly more bikes, but Rating explains very little of the overall variation in availability.
Pizza Places Show Some Relationship

One possible reading is that bike riders frequently stop at pizza places, or that these areas have better bike station management, but this regression cannot distinguish between such explanations or rule out confounding.
