In [12]:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
df = pd.read_pickle("/Users/snoopy/GitHub/LHL-Project-2-Statistical-Modelling-with-Python/data/joined_citybike_yelp.pkl")

# Reduce the number of unique categories
top_n = 10
category_counts = df['Category'].value_counts()
top_categories = category_counts.index[:top_n]
df['Category'] = df['Category'].apply(lambda x: x if x in top_categories else 'Other')

# One-hot encode the reduced categories
# sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False, drop='first')
category_encoded = encoder.fit_transform(df[['Category']])
# Keep the original index so the encoded columns align row-by-row with df
category_df = pd.DataFrame(category_encoded, columns=encoder.get_feature_names_out(['Category']), index=df.index)

# Create dataset with reduced categories; drop incomplete rows only after joining
df_model = pd.concat([df[['free_bikes', 'Rating', 'Review Count']], category_df], axis=1).dropna()

# Define independent and dependent variables
X = df_model.drop(columns=['free_bikes'])
y = df_model['free_bikes']
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Display the model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     5.015
Date:                Tue, 25 Feb 2025   Prob (F-statistic):           2.20e-08
Time:                        20:48:38   Log-Likelihood:                -40881.
No. Observations:               12898   AIC:                         8.179e+04
Df Residuals:                   12885   BIC:                         8.188e+04
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const         





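1. R-squared (0.005): This means that the model explains about 0.5% of the variation in the number of available bikes at a location (adjusted R-squared: 0.004).
This is a very weak fit. The F-test (p = 2.20e-08) shows the predictors are jointly significant, but with 12,898 observations even very small effects can reach significance, so the model has almost no predictive value.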
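
As a sanity check on the fit statistics, R-squared can be recovered from the F-statistic and degrees of freedom printed in the summary above. The sketch below is plain arithmetic on those printed values, assuming a model with an intercept (which `sm.add_constant` provides); the variable names are illustrative and not part of the notebook.

```python
# Values copied from the OLS summary output above
f_stat = 5.015        # F-statistic
k_predictors = 12     # Df Model
resid_df = 12885      # Df Residuals
n_obs = 12898         # No. Observations

# F = (R^2 / k) / ((1 - R^2) / resid_df), solved for R^2
r2 = (f_stat * k_predictors) / (f_stat * k_predictors + resid_df)

# Adjusted R^2 penalizes the number of predictors
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / resid_df

print(f"R-squared:      {r2:.4f}")
print(f"Adj. R-squared: {adj_r2:.4f}")
```

Rounded to three decimals, both values agree with the R-squared and Adj. R-squared entries in the summary table.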


2. P-values (P>|t|): If p-value < 0.05, the variable is statistically significant, meaning its coefficient is unlikely to differ from zero by chance alone. Significance does not by itself imply a large or causal effect.
 - Rating is significant (p = 0.008) → A higher rating is associated with slightly lower bike availability.
 - Review Count is significant (p < 0.001) → Locations with more reviews tend to have more bikes.
 - Category_Cafe has a positive coefficient (p < 0.001) → More bikes are available near cafes.
 - Category_Gym has a negative coefficient (p = 0.005) → Fewer bikes near gyms.
 - Category_Other has a negative coefficient (p = 0.036) → Locations not in the top categories tend to have fewer bikes.

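3. Coefficients (coef column):
 - A positive coefficient means bike availability is expected to increase as the variable increases, holding the other variables constant.
 - A negative coefficient means bike availability is expected to decrease as the variable increases, holding the other variables constant.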
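
Because the encoder is created with `drop='first'`, one category gets no column of its own, and every `Category_*` coefficient is read relative to that dropped category rather than to an average location. The plain-Python sketch below mimics that behavior (categories sorted, first one dropped) to show which label becomes the reference. The helper name and the sample labels are hypothetical and are not taken from the dataset.

```python
def one_hot_drop_first(values):
    """Dummy-encode labels, dropping the first category in sorted order."""
    categories = sorted(set(values))
    baseline, kept = categories[0], categories[1:]
    # One row per input value, one column per kept category
    rows = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    columns = [f"Category_{c}" for c in kept]
    return baseline, columns, rows

# Hypothetical labels for illustration only
labels = ["Cafe", "Gym", "Other", "Bar", "Cafe"]
baseline, columns, rows = one_hot_drop_first(labels)

print("reference category:", baseline)
print(columns)
for label, row in zip(labels, rows):
    print(label, row)
```

In this sketch the reference category is encoded as all zeros, so its effect is absorbed into the constant term. In the notebook, the fitted encoder's `categories_` attribute lists the categories in order, and the first entry is the one that was dropped.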

4. Review Count Matters: Stations near popular POIs (high review count) tend to have more bikes available.
This could indicate high-traffic locations or deliberate station placement, although the model cannot distinguish between these explanations.

5. Cafes vs. Gyms:
 - More bikes near cafes → People might bike to cafes, leading to high drop-off rates.
 - Fewer bikes near gyms → People might start bike rides from the gym rather than ending there.
