Build a regression model.

In [10]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model, datasets

## Cleaned Data Modelling

From the previous notebook, we work with the cleaned dataframe saved as the 'EDA_bikestations_298' CSV file, identifying model statistics for bike stations with locations under 298 in distance.

In [11]:


EDA_bikestations = pd.read_csv('/Users/mitchellpalmer/Projects/Lighthouse Lab Projects/Data_Statistical_Modeling/Statistical_Modeling_Project/Statistical-Modelling-Project./data/EDA_bikestations_298.csv')

EDA_bikestations

Unnamed: 0,station_id,free bikes,empty slots,total bike slots,average rating,total ratings,average popularity,restaurant_count,bar_count,cafe_counts,bakery_counts,coffee_counts
0,0,0,32,32,7.750000,206.0,0.832688,1,0,0,0,1
1,1,0,23,23,7.533333,185.0,0.711452,1,0,0,0,1
2,2,7,26,33,7.700000,290.0,0.775019,5,0,0,2,0
3,3,8,6,14,7.433333,120.0,0.720962,2,0,2,0,1
4,4,6,17,23,7.240000,1477.0,0.749633,9,0,2,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...
171,188,1,19,20,6.900000,858.0,0.659510,12,1,0,2,1
172,189,1,16,17,7.706383,5085.0,0.872900,21,1,5,4,1
173,190,12,18,30,7.677778,513.0,0.801814,2,1,0,1,0
174,192,0,17,17,6.830000,314.0,0.712404,10,1,3,1,0


In [12]:

X = EDA_bikestations.drop(columns=['station_id','free bikes','empty slots','total bike slots'])

y_totalbikes = EDA_bikestations['total bike slots']
y_freebikes = EDA_bikestations['free bikes']

## Pre-Model statistical testing

Building upon the pairplot and scatterplots from **joining_data.ipynb**, we shall test the regression model assumptions prior to final analysis.

- Linear Relationships
    - Pearson's Correlation Coefficient
- Multicollinearity
    - Variance Inflation Factors (VIF)

## Linear Correlation Coefficients



In [13]:
# Linear Relationship | Pearson's Correlation

from scipy.stats import pearsonr

correlation_results = {
    'feature': [],
    'correlation': [],
    'p_value': []
}

print('Total Bike Slots Linear Correlations')
print('')
for variable in X.columns:
    corr, p = pearsonr(X[variable], y_totalbikes)
    correlation_results['feature'].append(variable)
    correlation_results['correlation'].append(round(corr,3))
    correlation_results['p_value'].append(round(p,3))

Total_Bikes_Correlation = pd.DataFrame(correlation_results)

Total_Bikes_Correlation['Statistically Significant'] = [
    True if p < 0.05 
    else False 
    for p in Total_Bikes_Correlation['p_value']
]

Total_Bikes_Correlation['Correlation Strength'] = [
    'Weak' if -0.5 < value < 0.5
    else 'Strong'
    for value in Total_Bikes_Correlation['correlation']
]

Total_Bikes_Correlation
    

Total Bike Slots Linear Correlations



Unnamed: 0,feature,correlation,p_value,Statistically Significant,Correlation Strength
0,average rating,-0.003,0.964,False,Weak
1,total ratings,0.002,0.981,False,Weak
2,average popularity,-0.117,0.121,False,Weak
3,restaurant_count,0.01,0.892,False,Weak
4,bar_count,0.103,0.173,False,Weak
5,cafe_counts,0.128,0.089,False,Weak
6,bakery_counts,-0.028,0.716,False,Weak
7,coffee_counts,0.182,0.016,True,Weak


In [14]:
correlation_results = {
    'feature': [],
    'correlation': [],
    'p_value': []
}

print('Free Bikes Linear Correlations')
print('')
for variable in X.columns:
    corr, p = pearsonr(X[variable], y_freebikes)
    correlation_results['feature'].append(variable)
    correlation_results['correlation'].append(round(corr,3))
    correlation_results['p_value'].append(round(p,3))

Free_Bikes_Correlation = pd.DataFrame(correlation_results)

Free_Bikes_Correlation['Statistically Significant'] = [
    True if p < 0.05 
    else False 
    for p in Free_Bikes_Correlation['p_value']
]

Free_Bikes_Correlation['Correlation Strength'] = [
    'Weak' if -0.5 < value < 0.5
    else 'Strong' 
    for value in Free_Bikes_Correlation['correlation']
]

Free_Bikes_Correlation


Free Bikes Linear Correlations



Unnamed: 0,feature,correlation,p_value,Statistically Significant,Correlation Strength
0,average rating,0.124,0.101,False,Weak
1,total ratings,0.079,0.297,False,Weak
2,average popularity,0.085,0.263,False,Weak
3,restaurant_count,0.004,0.957,False,Weak
4,bar_count,0.376,0.0,True,Weak
5,cafe_counts,-0.145,0.055,False,Weak
6,bakery_counts,-0.031,0.68,False,Weak
7,coffee_counts,0.066,0.381,False,Weak


## Linear Correlation Coefficients Results

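Pearson Correlation Coefficient Tests
- All X-variables were tested against the two y-variables (Total Bike Slots & Free Bikes)
    - All results showed weak linear correlations.

- Most p-values are insignificant, sitting above the predetermined significance level (alpha) of 5%. The exceptions are 'coffee_counts' against Total Bike Slots (p = 0.016) and 'bar_count' against Free Bikes (p = 0.0).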
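
The two correlation cells above repeat the same loop for each target. A minimal sketch of a reusable helper is shown below; `correlation_table`, its `alpha` and `strong_cutoff` parameters, and the synthetic demo data are illustrative assumptions, not part of the original notebook. Unlike the cells above, it tests significance on the unrounded p-value.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def correlation_table(features, target, alpha=0.05, strong_cutoff=0.5):
    """Pearson correlation of every feature column against one target (hypothetical helper)."""
    rows = []
    for name in features.columns:
        corr, p_value = pearsonr(features[name], target)
        rows.append({
            "feature": name,
            "correlation": round(corr, 3),
            "p_value": round(p_value, 3),
            # Compare the unrounded p-value against alpha
            "Statistically Significant": p_value < alpha,
            # Same rule as the cells above: |r| below the cutoff counts as weak
            "Correlation Strength": "Strong" if abs(corr) >= strong_cutoff else "Weak",
        })
    return pd.DataFrame(rows)


# Synthetic demo data (assumption): one informative column and one pure-noise column
rng = np.random.default_rng(42)
signal = rng.normal(size=200)
demo_X = pd.DataFrame({"signal": signal, "noise": rng.normal(size=200)})
demo_y = pd.Series(2 * signal + rng.normal(scale=0.5, size=200))

demo_table = correlation_table(demo_X, demo_y)
print(demo_table)
```

In this notebook the same helper could be called once per target, for example `correlation_table(X, y_totalbikes)` and `correlation_table(X, y_freebikes)`.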

## Multicollinearity 

Testing for Multicollinearity with statistical testing | **Variance Inflation Factors (VIF)**

In [15]:
# AI Assistance

# No Multicollinearity | Variance Inflation Factors (VIF)

# Utilise Assumption Threshold of 5

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif["feature"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

vif['Multicollinearity'] = [ True if value >= 5 
                            else False
                            for value in vif['VIF']
]
vif

Unnamed: 0,feature,VIF,Multicollinearity
0,average rating,86.186538,True
1,total ratings,2.782366,False
2,average popularity,88.679607,True
3,restaurant_count,5.260828,True
4,bar_count,2.159801,False
5,cafe_counts,2.630564,False
6,bakery_counts,2.246527,False
7,coffee_counts,1.685208,False


In [16]:
# Test Multicollinearity between highest variables: Ratings, Popularity & Restaurant Count

high_vif_features = X[['average rating','average popularity','restaurant_count']]

vif = pd.DataFrame()
vif["feature"] = high_vif_features.columns
vif["VIF"] = [variance_inflation_factor(high_vif_features.values, i) for i in range(high_vif_features.shape[1])]

vif['Multicollinearity'] = [ True if value >= 5 
                            else False
                            for value in vif['VIF']
]

vif

Unnamed: 0,feature,VIF,Multicollinearity
0,average rating,83.454672,True
1,average popularity,85.317597,True
2,restaurant_count,2.707623,False


In [17]:
# Drop ratings to view Multicollinearity among category counts

vif_categories = pd.DataFrame()
X_categories = X.drop(columns=['average rating','total ratings','average popularity'])

vif_categories["feature"] = X_categories.columns
vif_categories["VIF"] = [variance_inflation_factor(X_categories.values, i) for i in range(X_categories.shape[1])]

vif_categories['Multicollinearity'] = [ True if value >= 5 
                            else False
                            for value in vif_categories['VIF']
]
vif_categories

Unnamed: 0,feature,VIF,Multicollinearity
0,restaurant_count,4.18576,False
1,bar_count,1.583222,False
2,cafe_counts,2.441968,False
3,bakery_counts,2.202995,False
4,coffee_counts,1.527384,False


## Multicollinearity Results

Extremely strong multicollinearity is present between 'average rating' and 'average popularity' (VIF above 80 for both).

Secondly, 'restaurant_count' sits just above the threshold of 5 (VIF 5.26) when all features are included, and falls to 2.71 when tested only alongside the Rating & Popularity variables.

Separating the variables into 'Category Types' and 'Ratings'.
- Acknowledge the remaining higher multicollinearity value for 'restaurant_count' at 4.19 and its impact on Regression Analysis

# Regression Models
Provide model output and an interpretation of the results. 

## Multivariate Regression Models


In [18]:
# Location Categories analysis against Total Bike Slots

X_categories = sm.add_constant(X_categories)
lin_reg = sm.OLS(y_totalbikes,X_categories)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:       total bike slots   R-squared:                       0.074
Model:                            OLS   Adj. R-squared:                  0.047
Method:                 Least Squares   F-statistic:                     2.724
Date:                Mon, 28 Jul 2025   Prob (F-statistic):             0.0214
Time:                        10:40:47   Log-Likelihood:                -603.44
No. Observations:                 176   AIC:                             1219.
Df Residuals:                     170   BIC:                             1238.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               18.7589      0.980  

In [19]:
# Location Categories analysis against Free Bikes

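# X_categories already holds the constant from the previous cell;
# add_constant skips an existing constant by default, so this line is a no-op here
X_categories = sm.add_constant(X_categories)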
lin_reg = sm.OLS(y_freebikes,X_categories)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             free bikes   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.162
Method:                 Least Squares   F-statistic:                     7.764
Date:                Mon, 28 Jul 2025   Prob (F-statistic):           1.33e-06
Time:                        10:40:47   Log-Likelihood:                -536.48
No. Observations:                 176   AIC:                             1085.
Df Residuals:                     170   BIC:                             1104.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                5.0408      0.670  

### Multivariate Regression Models Analysis

- Location Categories Vs Total Bike Slots
    - Adjusted R-squared / Coefficient Of Determination = 0.047
        - Roughly 5% of the variation in Total Bike Slots available across Lisbon, Portugal can be explained by this model.
        - Location Categories (e.g. Restaurant, Bar, Cafe, Bakery, Coffee) are **poor** explanatory variables for Total Bike Slots at CityBike stations in Lisbon.


- Location Categories Vs Free Bikes
    - Adjusted R-squared / Coefficient Of Determination = 0.162
        - Roughly 16% of the variation in Free Bikes available across Lisbon, Portugal can be explained by this model.
        - Location Categories (e.g. Restaurant, Bar, Cafe, Bakery, Coffee) are **moderate** explanatory variables for Free Bikes at CityBike stations in Lisbon, at the relevant time period of **12:39 AM Sunday, July 27 2025.**

    - Variables
        - Bars
            - Coefficient
                - Holding all other features constant, for each additional 'bar' location surrounding a CityBike station, the number of free bikes available is estimated to **increase by 1.72**.
            - p-value
                - At 0.00, highly statistically significant.
        - Cafes
            - Coefficient
                - Holding all other features constant, for each additional 'cafe' location surrounding a CityBike station, the number of free bikes available is estimated to **decrease by 0.663**.
            - p-value
                - At 0.028, statistically significant at our alpha significance level of 5%.

    - Further Regression Model Analysis

In [20]:
# Categories Bars & Cafes analysis against Free Bikes

X_selected = sm.add_constant(X_categories[['bar_count','cafe_counts']])
lin_reg = sm.OLS(y_freebikes,X_selected)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             free bikes   R-squared:                       0.179
Model:                            OLS   Adj. R-squared:                  0.170
Method:                 Least Squares   F-statistic:                     18.89
Date:                Mon, 28 Jul 2025   Prob (F-statistic):           3.80e-08
Time:                        10:40:47   Log-Likelihood:                -537.19
No. Observations:                 176   AIC:                             1080.
Df Residuals:                     173   BIC:                             1090.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           4.9671      0.585      8.490      

- Inference
    - Based on our two strongest variables (Bars & Cafes), we can infer that 17% of the variation in Free Bikes across Lisbon can be explained by these two predictors.

    **TimeStamp** = **12:39 AM Sunday, July 27 2025.**

    - Bars
        - Each bar within a 1,000 metre radius of a CityBike station is associated with an estimated increase of 1.67 Free Bikes.
            - One inference is that customers ride bikes to stations near local bars during this TimeStamp
            OR
            - CityBike allocates about 1.67 more bikes per nearby bar to these stations at this TimeStamp in anticipation of customer use

    - Cafes
        - Each cafe within a 1,000 metre radius of a CityBike station is associated with an estimated decrease of 0.736 Free Bikes.
            - One inference is that CityBike stations near cafes have fewer customers riding bikes to them during this TimeStamp
            OR
            - CityBike allocates about 0.736 fewer bikes per nearby cafe to these stations during this TimeStamp


    

## Linear Regression Models

Compare 'Average Popularity' of local locations with the y-variables


In [21]:

# Single predictor for the simple linear regressions
X_pop = EDA_bikestations['average popularity']

# Total Bike Slots
X_pop = sm.add_constant(X_pop)
lin_reg = sm.OLS(y_totalbikes,X_pop)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:       total bike slots   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.425
Date:                Mon, 28 Jul 2025   Prob (F-statistic):              0.121
Time:                        10:40:47   Log-Likelihood:                -609.00
No. Observations:                 176   AIC:                             1222.
Df Residuals:                     174   BIC:                             1228.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 28.2475      5

In [22]:
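# Free Bikes
# X_pop already holds the constant from the previous cell;
# add_constant skips an existing constant by default, so this line is a no-op here
X_pop = sm.add_constant(X_pop)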
lin_reg = sm.OLS(y_freebikes,X_pop)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:             free bikes   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     1.261
Date:                Mon, 28 Jul 2025   Prob (F-statistic):              0.263
Time:                        10:40:47   Log-Likelihood:                -553.94
No. Observations:                 176   AIC:                             1112.
Df Residuals:                     174   BIC:                             1118.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                  1.0083      3

## Linear Regression Models Analysis

Comparing 'Average Popularity' of local locations with Free Bikes & Total Bike Slots

- As visualised in the 'joining_data.ipynb' pairplots and scatterplots and in the Pearson correlation tests above, 'Average Popularity' does not have a strong linear relationship with either y-variable.

- Total Bike Slots
    - 'Average Popularity' has a large negative regression coefficient of -10.361 against 'Total Bike Slots'. However, its p-value of 0.121 marks it **statistically insignificant**, and 'Average Popularity' explains only 1.4% of the variability in 'Total Bike Slots'.

- Free Bikes
    - Similarly, 'Average Popularity' of locations within a 1,000 metre radius of CityBike stations has a large regression coefficient of 5.464 against 'Free Bikes'. However, its p-value of 0.263 marks it **statistically insignificant**, and 'Average Popularity' explains only 0.7% of the variability in 'Free Bikes'.


'Average Popularity' is an **extremely poor predictor** of both Free Bikes and Total Bike Slots at CityBike stations, with low explanatory power and no statistically significant relationship observed in either regression model.

# Stretch

How can you turn the regression model into a classification model?
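
One possible answer, sketched under stated assumptions: convert the numeric target into classes (for example, "above the median number of free bikes" versus "at or below it") and fit a classifier such as logistic regression on the same predictors. The median cut-off, the `to_binary_target` helper and the synthetic data below are all illustrative choices, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def to_binary_target(values, threshold):
    """Label each station 1 if its value is above the threshold, else 0 (hypothetical helper)."""
    return (values > threshold).astype(int)


# Synthetic stand-in for bar/cafe counts and free bikes (assumption)
rng = np.random.default_rng(0)
n = 300
demo_X = pd.DataFrame({
    "bar_count": rng.poisson(1.0, n),
    "cafe_counts": rng.poisson(2.0, n),
})
raw = 5 + 1.5 * demo_X["bar_count"] - 0.7 * demo_X["cafe_counts"] + rng.normal(0, 2, n)
free_bikes = raw.round().clip(lower=0)

# Median split is only one possible cut-off; a business-driven threshold may suit better
y_class = to_binary_target(free_bikes, free_bikes.median())

X_train, X_test, y_train, y_test = train_test_split(
    demo_X, y_class, test_size=0.25, random_state=0, stratify=y_class
)

clf = LogisticRegression().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

print("Coefficients:", dict(zip(demo_X.columns, clf.coef_[0])))
print("Test accuracy:", round(accuracy, 3))
```

The trade-off is that binning discards information about how many bikes are available, so the classifier answers a coarser question (high versus low availability) than the regression does.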