Build a regression model.

In [5]:
!pip install statsmodels




In [10]:
import pandas as pd
import statsmodels.api as sm

# Load the combined data
combined_data = pd.read_csv('combined_bike_poi_data.csv')

# Check for missing values and handle them (for simplicity, we'll drop rows with NaN in free_bikes or venue_distance)
combined_data = combined_data.dropna(subset=['venue_distance', 'free_bikes'])

# Define the features (X) and target (y)
X = combined_data[['venue_distance']]  # Using `venue_distance` as the feature
y = combined_data['free_bikes']  # `free_bikes` as the target variable

# Add a constant to the feature matrix for the intercept in the regression model
X = sm.add_constant(X)

# Fit the OLS regression model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     8.430
Date:                Mon, 28 Apr 2025   Prob (F-statistic):            0.00370
Time:                        05:54:28   Log-Likelihood:                -29943.
No. Observations:                7485   AIC:                         5.989e+04
Df Residuals:                    7483   BIC:                         5.990e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             12.4065      0.196     63.

Provide model output and an interpretation of the results. 

Interpretation of Results:
1. R-squared (R²):
R-squared = 0.001: The model explains only about 0.1% of the variability in the number of free bikes. In practical terms, venue_distance alone has almost no predictive power for free_bikes.
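To make the 0.1% figure concrete, here is a minimal sketch that computes R-squared from its definition, 1 - SS_res / SS_tot, for a one-feature least-squares fit. The helper `r_squared` and both toy datasets are made up for illustration; they are not the bike data.

```python
def r_squared(x, y):
    """R-squared of a simple least-squares line fitted to (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: variation left over after the fit
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: variation around the mean of y
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Toy data (hypothetical): distances and two different targets
x = [100, 200, 300, 400, 500, 600]
y_linear = [2 * xi + 1 for xi in x]   # exactly on a line
y_noisy = [12, 3, 20, 7, 15, 9]       # almost unrelated to x

print("linear:", r_squared(x, y_linear))
print("noisy: ", r_squared(x, y_noisy))
```

A value close to 1 means the line accounts for nearly all of the variation; a value close to 0, as in the notebook's model, means it accounts for almost none.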

2. F-statistic and p-value:
F-statistic = 8.430, p-value = 0.0037: The F-statistic tests whether the model as a whole is statistically significant. With a p-value of 0.0037, we reject the null hypothesis that the venue_distance coefficient is zero. Statistical significance is not the same as practical importance, though: with 7,485 observations, even a very weak relationship can reach significance, and the R-squared of 0.001 shows that this one is very weak.

3. Coefficients:
Intercept (const) = 12.4065: This is the estimated number of free bikes when the venue_distance is zero (i.e., when the venue is located exactly at the bike station). The model predicts approximately 12.4 free bikes at that distance.

venue_distance Coefficient = -0.0012: For each additional unit of venue_distance, the predicted number of free bikes is lower by 0.0012 on average. The relationship is negative but very small: it takes roughly 800 units of distance to change the prediction by a single bike. This is an association in the data, not evidence that distance causes the difference.
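The fitted line can be written out directly from the two coefficients. The sketch below uses the rounded values quoted above; the function name `predict_free_bikes` and the example distances are chosen here for illustration only.

```python
# Rounded coefficients as quoted in the interpretation above
INTERCEPT = 12.4065
SLOPE = -0.0012

def predict_free_bikes(venue_distance):
    """Predicted free bikes: intercept + slope * venue_distance."""
    return INTERCEPT + SLOPE * venue_distance

# Hypothetical distances, in whatever unit venue_distance is stored in
for distance in (0, 500, 1000):
    print(distance, round(predict_free_bikes(distance), 2))
```

Even a change of 1000 distance units moves the prediction by only about one bike, which is consistent with the very low R-squared.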

4. t-value and p-value for venue_distance:
t-value = -2.903, p-value = 0.004: The t-value tells us how many standard errors the coefficient is away from zero. Since the coefficient is almost three standard errors below zero and the p-value is below 0.05, the association between venue_distance and free_bikes is statistically significant. Because the model has a single predictor, this is the same test as the F-test above.
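In a regression with one predictor, the F-statistic equals the square of that predictor's t-value. A quick arithmetic check with the two numbers reported above (the small gap comes from rounding in the printed summary):

```python
# Values copied from the regression output discussed above
t_value = -2.903
f_statistic = 8.430

t_squared = t_value ** 2
print("t squared:  ", round(t_squared, 3))
print("F-statistic:", f_statistic)
print("difference: ", round(abs(t_squared - f_statistic), 3))
```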

5. Confidence Interval:
The 95% confidence interval for the venue_distance coefficient is (-0.002, -0.000), meaning we are 95% confident that the true coefficient lies within this range. The upper bound only rounds to zero at three decimals; the interval itself excludes zero, which agrees with the p-value of 0.004. The interval is entirely negative, so greater distance to a venue is associated with slightly fewer free bikes.

6. Durbin-Watson Statistic:
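Durbin-Watson = 0.417: Values near 2 indicate no first-order autocorrelation. A value this far below 2 indicates strong positive autocorrelation between the residuals of neighbouring rows, so the independence assumption behind the nonrobust standard errors is doubtful and those standard errors are probably too small. The statistic depends on row order. If the rows form a time series, time-series methods are worth considering. If they do not (for example, if each station appears on several consecutive rows, one per nearby venue), the correlation more likely reflects that grouping, and cluster-robust standard errors or one row per station would be the more appropriate fix.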
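The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below implements that formula directly and applies it to two made-up residual sequences that contain the same values in a different order, to show how strongly the result depends on row order.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals taken in row order."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals: same values, different ordering
clustered = [1, 1, 1, -1, -1, -1]     # similar residuals sit next to each other
alternating = [1, -1, 1, -1, 1, -1]   # the sign flips on every row

print("clustered:  ", round(durbin_watson(clustered), 3))
print("alternating:", round(durbin_watson(alternating), 3))
```

Grouped residuals push the statistic well below 2, and alternating residuals push it above 2, even though the set of residual values is identical.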

7. Omnibus and Jarque-Bera Tests:
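Both the Omnibus and Jarque-Bera tests are significant, indicating that the residuals are not normally distributed. Normality is not required for the coefficient estimates themselves; it matters for exact t- and F-based inference. With 7,485 observations the tests will flag even modest departures, and large-sample inference is fairly robust to non-normality, so this is a smaller concern here than the autocorrelation and the near-zero R-squared.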
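The Jarque-Bera statistic combines sample skewness S and kurtosis K as n/6 * (S^2 + (K - 3)^2 / 4), so it grows when a distribution is lopsided or heavy-tailed. A minimal sketch of that formula follows; the two small samples are invented for illustration and are not residuals from the model.

```python
def jarque_bera(values):
    """Jarque-Bera statistic from sample skewness and kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Central moments (divided by n, as in the standard JB definition)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)

symmetric = [-2, -1, 0, 1, 2]   # hypothetical, no skew
skewed = [0, 0, 0, 0, 10]       # hypothetical, one large value

print("symmetric:", round(jarque_bera(symmetric), 3))
print("skewed:   ", round(jarque_bera(skewed), 3))
```

The statistic is compared against a chi-squared distribution with 2 degrees of freedom; because it scales with n, large samples produce significant results for small departures from normality.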



# Stretch

How can you turn the regression model into a classification model?

In [11]:
# Define categories for free bikes based on their value
bins = [0, 5, 20, float('inf')]  # Define bin edges
labels = ['Low', 'Medium', 'High']  # Define category labels
combined_data['bike_category'] = pd.cut(combined_data['free_bikes'], bins=bins, labels=labels, right=False)

# Check the new categories
print(combined_data[['free_bikes', 'bike_category']].head())


   free_bikes bike_category
0          21          High
1          21          High
2          21          High
3          21          High
4          21          High
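With `right=False`, each bin includes its left edge and excludes its right edge, so the categories are [0, 5) for Low, [5, 20) for Medium and [20, inf) for High. The stdlib sketch below mirrors that rule so the edge cases are easy to check by hand; `bike_category` is a hypothetical helper, not part of pandas.

```python
from bisect import bisect_right

# Same left edges and labels as the pd.cut call above
EDGES = [0, 5, 20]
LABELS = ['Low', 'Medium', 'High']

def bike_category(free_bikes):
    """Left-closed binning: [0, 5) Low, [5, 20) Medium, [20, inf) High."""
    if free_bikes < EDGES[0]:
        return None  # below the first edge there is no bin
    return LABELS[bisect_right(EDGES, free_bikes) - 1]

for value in (0, 4, 5, 19, 20, 21):
    print(value, bike_category(value))
```

A station with exactly 5 free bikes lands in Medium and one with exactly 20 lands in High, which matches the 21 -> High rows in the output above.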


In [16]:
import subprocess
import sys

# Install scikit-learn
subprocess.check_call([sys.executable, "-m", "pip", "install", "scikit-learn"])

0

In [17]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# `combined_data` (including the `bike_category` column) comes from the cells above

# Prepare the features (X) and target (y)
X = combined_data[['venue_distance']]  # Independent variable (distance from POI to station)
y = combined_data['bike_category']  # Dependent variable (bike category: Low, Medium, High)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

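# Create and train the logistic regression model
# Note: recent scikit-learn releases deprecate the `multi_class` argument;
# wrapping the estimator in sklearn.multiclass.OneVsRestClassifier is the documented replacement.
clf = LogisticRegression(multi_class='ovr', solver='liblinear')  # 'ovr' = One-vs-Rest strategy
clf.fit(X_train, y_train)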

# Predict the target for the test set
y_pred = clf.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

        High       0.00      0.00      0.00       513
         Low       0.40      1.00      0.57       905
      Medium       0.00      0.00      0.00       828

    accuracy                           0.40      2246
   macro avg       0.13      0.33      0.19      2246
weighted avg       0.16      0.40      0.23      2246
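A recall of 1.00 for Low and 0.00 for the other two classes is what a classifier produces when it predicts Low for every test row. That reading is an inference from the report, not something the notebook states, so the sketch below checks it by recomputing the headline numbers from the support counts alone under that assumption.

```python
# Support counts copied from the classification report above
support = {'High': 513, 'Low': 905, 'Medium': 828}
total = sum(support.values())

# Assumption: every test row is predicted as 'Low'
precision_low = support['Low'] / total   # correct Low predictions / all predictions
recall_low = 1.0                         # every true Low row is predicted Low
f1_low = 2 * precision_low * recall_low / (precision_low + recall_low)

accuracy = support['Low'] / total
macro_f1 = f1_low / len(support)               # the other two classes score 0
weighted_f1 = f1_low * support['Low'] / total  # weighted by class support

print("accuracy:   ", round(accuracy, 2))
print("F1 (Low):   ", round(f1_low, 2))
print("macro F1:   ", round(macro_f1, 2))
print("weighted F1:", round(weighted_f1, 2))
```

The recomputed values match the report, so the model is doing no better than always guessing the most common class; venue_distance alone does not separate the three categories.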



