Build a regression model using Python’s `statsmodels` module that demonstrates a relationship between the number of bikes in a particular location and the characteristics of the POIs in that location.

In [10]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
import scipy

sns.set_theme()

import warnings
warnings.filterwarnings('ignore')

At the end of Part 3 (joining tables), we performed EDA and checked the feature correlations. The numeric bar characteristics (our independent variables) are not strongly linearly related to our target variable ('cb_bike_num'). As a rule of thumb, we treat a correlation above 0.5 or below -0.5 as meaningful. Nevertheless, we chose to keep the variables 'review_count' and 'distance' as asked in the assignment.md file, and to remove the variables with near-zero correlation.

### Multivariate Linear Regression

Multiple linear regression consists of one continuous dependent variable and more than one independent variable. We use 'cb_bike_num' as our dependent variable ($y$) and 'review_count' and 'distance' as our independent variables ($x_1$ and $x_2$). This multiple linear regression model uses the relationship:

$$
y=b_0 + b_1x_1 + b_2x_2
$$
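As a minimal sketch of what fitting this relationship means, the coefficients $b_0$, $b_1$ and $b_2$ can be recovered by ordinary least squares on a design matrix with a column of ones. The data and the "true" coefficients below are synthetic and chosen only for illustration; they are not taken from the bike data.

```python
import numpy as np

# Synthetic predictors standing in for review_count (x1) and distance (x2)
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 100, size=200)
x2 = rng.uniform(0, 1000, size=200)

# Hypothetical coefficients [b0, b1, b2], used only to generate y
b_true = np.array([20.0, 0.05, -0.002])
y = b_true[0] + b_true[1] * x1 + b_true[2] * x2  # noiseless on purpose

# Design matrix: the column of ones plays the role of the intercept b0
design = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of design @ b = y
b_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print(b_hat)
```

Because the synthetic `y` contains no noise, the estimated coefficients match the ones used to generate it; with real data they would only be estimates.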

Provide model output and an interpretation of the results. 

In [11]:
# Load the clean data merged_all_df_clean csv file from part 3 to run the model
merged_all_df_clean = pd.read_csv('../data/merged_all_df_clean.csv')
merged_all_df_clean

Unnamed: 0,cb_station_id,cb_station_name,cb_latitude,cb_longitude,cb_bike_num,name,postcode,category,distance,review_count,rating,cb_name_count,price_level
0,72bfd647b3d2b650546f42319729757d,Cégep Marie-Victorin,45.617500,-73.606011,11,Resto-bar Capucine - Nord-Est de Montréal,Unavailable,Sports Bar,246.0,44.0,4.0,4,2
1,72bfd647b3d2b650546f42319729757d,Cégep Marie-Victorin,45.617500,-73.606011,11,Piano Bar la Belle Epoque,H1G 2V6,Cocktail Bar,661.0,44.0,4.0,4,2
2,72bfd647b3d2b650546f42319729757d,Cégep Marie-Victorin,45.617500,-73.606011,11,Cafe liana bar & grill,H1E 1M4,Bar,809.0,44.0,4.0,4,2
3,72bfd647b3d2b650546f42319729757d,Cégep Marie-Victorin,45.617500,-73.606011,11,La Veranda,H1G 2V5,Bar,960.0,44.0,4.0,4,2
4,36c6491aa1b52e5ef7005f984738de27,Gare d'autocars de Montréal (Berri / Ontario),45.516926,-73.564257,15,Le Saint Bock,H2X 3K4,Bar,132.0,44.0,4.0,10,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5814,6d0a3c1b3a79bb42f1125970e00cd5b1,Parc du Pélican (1ere Ave / Masson),45.545188,-73.576443,39,La Succursale,H1Y 1Y1,Lounge,607.0,25.0,4.0,10,2
5815,6d0a3c1b3a79bb42f1125970e00cd5b1,Parc du Pélican (1ere Ave / Masson),45.545188,-73.576443,39,Pub Rosemont,H2G 1V2,Lounge,797.0,44.0,4.0,10,2
5816,6d0a3c1b3a79bb42f1125970e00cd5b1,Parc du Pélican (1ere Ave / Masson),45.545188,-73.576443,39,Chez Baptiste Sur Masson,H1Y 1X6,Bar,433.0,44.0,4.0,10,2
5817,6d0a3c1b3a79bb42f1125970e00cd5b1,Parc du Pélican (1ere Ave / Masson),45.545188,-73.576443,39,Broue Bar Gaspé,H1Y 1W1,Karaoke Bar,25.0,44.0,4.0,10,2


In [12]:
y = merged_all_df_clean['cb_bike_num']
X = merged_all_df_clean[['review_count', 'distance']]
X = sm.add_constant(X) # Adds a column of 1's so the model will contain an intercept
X.head()

Unnamed: 0,const,review_count,distance
0,1.0,44.0,246.0
1,1.0,44.0,661.0
2,1.0,44.0,809.0
3,1.0,44.0,960.0
4,1.0,44.0,132.0


In [13]:
model = sm.OLS(y, X) # Instantiate
results = model.fit() # Fit the model 
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:            cb_bike_num   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.040
Method:                 Least Squares   F-statistic:                     123.0
Date:                Mon, 25 Sep 2023   Prob (F-statistic):           4.73e-53
Time:                        03:33:06   Log-Likelihood:                -19187.
No. Observations:                5819   AIC:                         3.838e+04
Df Residuals:                    5816   BIC:                         3.840e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           22.4546      0.247     90.972   

### Interpret the Model Output

* Adj. R-squared: In this particular case, R-squared and Adj. R-squared are nearly the same, and both are very low. According to the Adj. R-squared, our model explains only about 4% of the variation in the number of bikes. This means that the model does not fit the data well.

* P>|t| (the p-value): Each p-value tests the null hypothesis that the true value of the coefficient is equal to zero. All of the p-values are reported as zero, so we reject that null hypothesis: the relationships between the number of bikes at a station and the bars' review counts and distance are statistically significant, even though the low R-squared shows that they explain very little of the variation. Since no variable has a p-value > 0.05, no variable needs to be removed from the model.

* coef: A multiple regression output has a coefficient for each independent variable. What we can tell from this output is that the review count of bars within 1 km of the bike station has a very small positive association with the number of bikes at that station, whereas the distance between the bar and the station has a near-zero negative association with the number of bikes.
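The small gap between R-squared and Adj. R-squared can be checked with the standard adjustment formula, $1 - (1 - R^2)(n - 1)/(n - k - 1)$, where $n$ is the number of observations and $k$ the number of predictors. The helper name below is ours, and the inputs are the rounded figures printed in the summary, so the result only approximately reproduces the reported 0.040.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalise R-squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Rounded values from the summary: R-squared 0.041, 5819 observations, 2 predictors
adj = adjusted_r_squared(0.041, 5819, 2)
print(adj)
```

With thousands of observations and only two predictors the penalty is tiny, which is why the two statistics are almost identical here.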

In [15]:
# Test the normality assumption by running the Shapiro-Wilk test on the residuals
residuals = results.resid  # residuals of the fitted OLS model
scipy.stats.shapiro(residuals)

ShapiroResult(statistic=0.9249078631401062, pvalue=0.0)

Since p < 0.05, we reject the null hypothesis that the residuals are normally distributed.
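One caveat: SciPy's documentation notes that the Shapiro-Wilk p-value may not be accurate for samples larger than 5000, and this dataset has 5819 rows. As a cross-check, a sketch using the D'Agostino-Pearson test (`scipy.stats.normaltest`) could look like the following. The residuals here are synthetic, deliberately skewed values, not the residuals of our model.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for model residuals (illustration only)
rng = np.random.default_rng(0)
skewed_residuals = rng.exponential(scale=1.0, size=6000) - 1.0

# D'Agostino-Pearson test: the null hypothesis is that the sample is normal
statistic, p_value = stats.normaltest(skewed_residuals)
print(statistic, p_value)
```

A p-value below 0.05 again leads to rejecting normality; on the real data, `results.resid` would be passed in place of the synthetic array.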

In [16]:
# Test the homoscedasticity assumption on the residuals
stat, p, f_stat, f_p = sm.stats.diagnostic.het_breuschpagan(residuals,results.model.exog)
print(p,f_p)

1.5183151933479793e-08 1.4491571376089e-08


Since p < 0.05, we reject the null hypothesis of homoscedasticity and conclude that heteroscedasticity is present in the regression model. As a result, the standard errors, and therefore the t-statistics and p-values reported in the summary, become unreliable.

In [17]:
# Test the multicollinearity between two independent variables
stat, p = stats.pearsonr(merged_all_df_clean['review_count'], merged_all_df_clean['distance'])
print('%0.60f' % p)

0.000029674232081741327694780416268649503308552084490656852722


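Since p < 0.05, we reject the null hypothesis of no correlation and conclude that there is a statistically significant correlation between review_count and distance. With 5819 observations, however, even a weak correlation can be statistically significant, so it is the size of the correlation coefficient (`stat`), not the p-value, that indicates whether multicollinearity is a concern.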
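To judge the strength of collinearity rather than its significance, a variance inflation factor (VIF) is often used. With exactly two predictors, the VIF of each reduces to $1 / (1 - r^2)$, where $r$ is their Pearson correlation. The sketch below uses synthetic data and a hypothetical helper name; values above roughly 5 to 10 are a common rule of thumb for problematic collinearity.

```python
import numpy as np

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = np.corrcoef(x1, x2)[0, 1]
    return 1.0 / (1.0 - r ** 2)

rng = np.random.default_rng(42)
a = rng.normal(size=1000)
b = rng.normal(size=1000)            # independent of a
c = a + 0.1 * rng.normal(size=1000)  # almost a copy of a

vif_independent = vif_two_predictors(a, b)
vif_collinear = vif_two_predictors(a, c)
print(vif_independent, vif_collinear)
```

On the real data the same function could be called with the 'review_count' and 'distance' columns.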

Overall, this Multivariate Linear Regression model is not a strong one: it does not fit our data well. We also saw earlier in Part 3 that none of our variables are normally distributed.

# Stretch

How can you turn the regression model into a classification model?

* A regression model predicts continuous values. We treated the number of bikes at a station as a continuous dependent variable and predicted its value from the bars' review counts and distances from the bike stations.

* A classification model predicts discrete values: its output has a discrete probability distribution (as opposed to regression models, where the output variable is continuous).

We can use this difference as the main basis for transforming the target variable 'cb_bike_num' into a discrete variable.

In [25]:
# List all unique values of 'cb_bike_num'
sorted(merged_all_df_clean['cb_bike_num'].unique())

[11,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 37,
 38,
 39,
 41,
 43,
 46,
 47,
 49]

In [26]:
merged_all_df_clean['cb_bike_num'].value_counts().sort_values(ascending=False)

19    1511
23    1312
15    1048
27     568
31     335
26     121
35     111
39      98
22      69
21      64
43      60
18      55
17      52
11      51
33      50
29      50
47      40
34      34
25      34
37      30
32      20
13      19
14      17
30      15
38      11
49      10
24      10
46      10
16       8
41       6
Name: cb_bike_num, dtype: int64

Suppose we group the values of 'cb_bike_num' into equal-width bins of bikes per station to classify the output:

* 10-19 bikes per station
* 20-29 bikes per station
* 30-39 bikes per station
* 40-49 bikes per station
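The bins above can be sketched with `pd.cut`. The sample values below are a few of the unique 'cb_bike_num' values listed earlier; the variable names are assumptions made for illustration.

```python
import pandas as pd

# A handful of the observed bike counts
bike_counts = pd.Series([11, 19, 23, 27, 31, 39, 41, 49], name="cb_bike_num")

# Left-closed bins: [10, 20), [20, 30), [30, 40), [40, 50)
bin_edges = [10, 20, 30, 40, 50]
bin_labels = ["10-19", "20-29", "30-39", "40-49"]

bike_class = pd.cut(bike_counts, bins=bin_edges, labels=bin_labels, right=False)
print(bike_class.tolist())
```

Setting `right=False` makes each bin include its lower edge and exclude its upper edge, so a station with 20 bikes would fall into the 20-29 class.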

A multinomial regression model (multinomial classification) could be used in this case, since it models outputs that can take more than two values. Because the output variable takes one of several discrete values, it can be represented as a Multinoulli random vector.
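As a rough sketch of how such a model produces a class, multinomial logistic regression computes one linear score per class and converts the scores into probabilities with the softmax function. The coefficients below are made up for illustration and are not fitted to our data.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

class_labels = ["10-19", "20-29", "30-39", "40-49"]

# Hypothetical coefficients: one row per class, columns = [const, review_count, distance]
coefficients = np.array([
    [0.5, 0.00, 0.000],
    [0.8, 0.01, -0.001],
    [0.1, 0.02, -0.001],
    [-0.5, 0.02, -0.002],
])

# One observation: const = 1, review_count = 44, distance = 246
features = np.array([1.0, 44.0, 246.0])

probabilities = softmax(coefficients @ features)
predicted_class = class_labels[int(np.argmax(probabilities))]
print(probabilities, predicted_class)
```

The predicted class is simply the one with the highest probability; a fitted model would estimate the coefficient rows from the binned data.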

For example, if the output variable can belong to one of four classes (10-19, 20-29, 30-39 or 40-49 bikes per station), then [1, 0, 0, 0] will represent the category of 10-19 bikes, [0, 1, 0, 0] will represent the category of 20-29 bikes, [0, 0, 1, 0] will represent the category of 30-39 bikes, and [0, 0, 0, 1] will represent the category of 40-49 bikes.
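This encoding can be produced with `pd.get_dummies`, as in the following sketch. The example labels are arbitrary and only show the mapping from class to indicator vector.

```python
import pandas as pd

classes = ["10-19", "20-29", "30-39", "40-49"]

# Declaring the categories keeps the column order fixed even if a class is absent
labels = pd.Series(
    pd.Categorical(["10-19", "30-39", "40-49", "20-29"], categories=classes)
)

one_hot = pd.get_dummies(labels, dtype=int)
print(one_hot)
```

Each row contains a single 1 in the column of its class, matching the vectors described above.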