Build a regression model.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

In [20]:
# Build a multiple linear regression model (one response, several predictors)
X = pd.read_csv('joined_data.csv')
y = pd.Series(X['num_of_bikes'], name='total_bikes')

In [21]:
X = X.drop(['bike_station_location', 'name', 'main_category_x', 'address', 'price', 'status', 'main_category_y', 'num_of_bikes'], axis=1)

In [22]:
X

Unnamed: 0,rating,review_count,distance_away_x,latitude,longitude
0,4.0,1735,82.518277,45.511950,-122.614160
1,4.5,487,225.793673,45.511950,-122.614160
2,4.0,408,33.461068,45.511950,-122.614160
3,3.5,270,58.987619,45.511950,-122.614160
4,4.0,374,340.597357,45.548276,-122.611164
...,...,...,...,...,...
577,4.5,703,896.491052,45.596562,-122.747900
578,4.5,235,829.466752,45.596562,-122.747900
579,3.5,256,905.682509,45.596562,-122.747900
580,4.0,238,141.103916,45.517899,-122.660052


In [23]:
X = sm.add_constant(X) # adding a constant
lin_reg = sm.OLS(y,X)

Provide model output and an interpretation of the results. 

In [24]:
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            total_bikes   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.179
Method:                 Least Squares   F-statistic:                     26.31
Date:                Sat, 26 Aug 2023   Prob (F-statistic):           5.72e-24
Time:                        15:30:39   Log-Likelihood:                -1708.5
No. Observations:                 582   AIC:                             3429.
Df Residuals:                     576   BIC:                             3455.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const           -1968.6443    553.366     

In [25]:
# Some predictors have high p-values, so backward elimination was used: the predictor with the highest p-value
# is removed one at a time until either all p-values are below the conventional 0.05 or the adjusted R-squared stops increasing
X = X.drop(['rating'], axis=1)

In [26]:
X = sm.add_constant(X) # no-op here: X already has a 'const' column, so it is returned unchanged
lin_reg = sm.OLS(y,X)

In [27]:
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            total_bikes   R-squared:                       0.183
Model:                            OLS   Adj. R-squared:                  0.178
Method:                 Least Squares   F-statistic:                     32.39
Date:                Sat, 26 Aug 2023   Prob (F-statistic):           2.25e-24
Time:                        15:50:55   Log-Likelihood:                -1709.4
No. Observations:                 582   AIC:                             3429.
Df Residuals:                     577   BIC:                             3451.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const           -1785.3116    536.591     

In [28]:
X = X.drop(['review_count'], axis=1)

In [29]:
X = sm.add_constant(X) # no-op here: X already has a 'const' column, so it is returned unchanged
lin_reg = sm.OLS(y,X)

In [30]:
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            total_bikes   R-squared:                       0.180
Model:                            OLS   Adj. R-squared:                  0.176
Method:                 Least Squares   F-statistic:                     42.33
Date:                Sat, 26 Aug 2023   Prob (F-statistic):           9.69e-25
Time:                        15:52:34   Log-Likelihood:                -1710.6
No. Observations:                 582   AIC:                             3429.
Df Residuals:                     578   BIC:                             3447.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const           -1858.0473    535.012     

In [31]:
X = X.drop(['distance_away_x'], axis=1)

In [32]:
X = sm.add_constant(X) # no-op here: X already has a 'const' column, so it is returned unchanged
lin_reg = sm.OLS(y,X)

In [33]:
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            total_bikes   R-squared:                       0.176
Model:                            OLS   Adj. R-squared:                  0.174
Method:                 Least Squares   F-statistic:                     62.02
Date:                Sat, 26 Aug 2023   Prob (F-statistic):           3.94e-25
Time:                        15:57:43   Log-Likelihood:                -1711.9
No. Observations:                 582   AIC:                             3430.
Df Residuals:                     579   BIC:                             3443.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1749.0209    531.477     -3.291      0.0

After removing the predictors with the highest p-values one by one, only the location of the bike station (latitude and longitude) has a significant relationship with total_bikes, the number of bikes possible at each station. This makes intuitive sense: locations with a higher density of people should have more total bikes than locations with lower density, since more people means more potential bike users.

However, even though every predictor with a high p-value was removed and the p-values for latitude and longitude are 0.000, the adjusted R-squared is only 0.174, which is very low. It also decreased slightly each time a predictor was removed (0.179, 0.178, 0.176, 0.174). The coefficients for latitude and longitude are also of little practical use in this model. A change in either value represents a change in location rather than an increase or decrease in some quantity, so the negative coefficients only describe the direction in which station size tends to change across the city; they do not mean that "more" of a variable leads to fewer bikes.

It can therefore be concluded that, although the relationship between total_bikes and location is statistically significant, the model is not a good fit for the data. Either the variables in the dataset simply do not have a strong linear relationship with each other, or there was not enough data to reveal one. Not having enough data is the more likely explanation: Portland is a relatively small city, so the number of bike stations and businesses is relatively low, which results in less available data.

# Stretch

How can you turn the regression model into a classification model?