### Build a regression model.

In [2]:
# Importing libraries
import statsmodels.api as sm
import pandas as pd


In [150]:
# Loading data from previous sections
ReddingBikesDF = pd.read_csv('../data/ReddingBikesDF.csv')
combinedSummaryDF = pd.read_csv('../data/combinedSummaryDF.csv', index_col=0)


In [151]:
# Setting up numerical data for regression model
bikeColumns = ['empty_slots', 'free_bikes', 'total_slots']
combinedSummaryDF = pd.merge(combinedSummaryDF, ReddingBikesDF[bikeColumns], left_on=ReddingBikesDF.index, right_index=True)

In [152]:
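# Drop the helper 'key_0' column that merge adds when left_on is given an array instead of a column name
combinedSummaryDF.drop(columns='key_0', inplace=True)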
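
As a side note, the merge-then-drop step above could probably be written as a single index-aligned join. The sketch below uses small hypothetical stand-in frames (`summary` and `bikes` are illustrative names, not the notebook's data) and assumes both frames share the same station index, as the merge above does.

```python
import pandas as pd

# Hypothetical stand-ins for combinedSummaryDF and ReddingBikesDF (first three stations only)
summary = pd.DataFrame(
    {"venueCount": [43, 46, 46]},
    index=pd.Index([0, 1, 2], name="station"),
)
bikes = pd.DataFrame(
    {
        "empty_slots": [2, 3, 2],
        "free_bikes": [2, 2, 3],
        "total_slots": [4, 5, 5],
        "name": ["a", "b", "c"],  # extra column that should not be carried over
    }
)

bike_columns = ["empty_slots", "free_bikes", "total_slots"]

# DataFrame.join aligns on the index of both frames, so no helper key column is created
joined = summary.join(bikes[bike_columns])
print(joined)
```

With this form there is no `key_0` column to drop afterwards.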

In [153]:
combinedSummaryDF.head()

station,venueCount,distance,rating,totalRatings,price,empty_slots,free_bikes,total_slots
0,43,523.970492,4.618605,177.918605,1.802326,2,2,4
1,46,354.960819,4.741304,166.48913,1.771739,3,2,5
2,46,348.262772,4.741304,166.48913,1.771739,2,3,5
3,4,960.983837,4.7125,112.875,1.0,3,0,3
4,13,630.752696,4.569231,116.576923,1.615385,3,3,6


In [123]:
X = pd.DataFrame(combinedSummaryDF[['venueCount','distance','rating','totalRatings','price']])
Y = pd.Series(combinedSummaryDF['total_slots'])

In [124]:
X.head()

station,venueCount,distance,rating,totalRatings,price
0,43,523.970492,4.618605,177.918605,1.802326
1,46,354.960819,4.741304,166.48913,1.771739
2,46,348.262772,4.741304,166.48913,1.771739
3,4,960.983837,4.7125,112.875,1.0
4,13,630.752696,4.569231,116.576923,1.615385


In [125]:
Y.head()

station
0    4
1    5
2    5
3    3
4    6
Name: total_slots, dtype: int64


#### Provide model output and an interpretation of the results. 

## Backwards Selection Model

In [126]:
X = sm.add_constant(X) # adding a constant
lin_reg = sm.OLS(Y,X)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            total_slots   R-squared:                       0.545
Model:                            OLS   Adj. R-squared:                  0.403
Method:                 Least Squares   F-statistic:                     3.836
Date:                Mon, 27 May 2024   Prob (F-statistic):             0.0178
Time:                        20:42:34   Log-Likelihood:                -28.200
No. Observations:                  22   AIC:                             68.40
Df Residuals:                      16   BIC:                             74.95
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           13.9639     12.176      1.147   

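This model is weak at the coefficient level: all coefficient p-values are above 0.05, even though the overall F-test is significant (Prob (F-statistic) = 0.0178). With only 22 stations, the standard errors are large. A larger city with more bike station data would likely help lower the p-values and the error.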

Continuing anyway with backward selection, removing the variable with the highest p-value: `price`

In [127]:
X.drop('price',inplace=True, axis=1)

In [128]:
lin_reg = sm.OLS(Y,X)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            total_slots   R-squared:                       0.537
Model:                            OLS   Adj. R-squared:                  0.428
Method:                 Least Squares   F-statistic:                     4.929
Date:                Mon, 27 May 2024   Prob (F-statistic):            0.00800
Time:                        20:42:46   Log-Likelihood:                -28.398
No. Observations:                  22   AIC:                             66.80
Df Residuals:                      17   BIC:                             72.25
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           14.2272     11.910      1.195   

The model improves slightly: adjusted R-squared rises from 0.403 to 0.428 and AIC falls from 68.40 to 66.80. Not all coefficients are statistically significant yet, so `rating` is dropped next.
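
To make the comparison between the two summaries above concrete, here is a minimal sketch that recomputes adjusted R-squared and AIC from the printed R-squared and log-likelihood values, using the standard definitions. The helper names `adjusted_r2` and `aic` are illustrative and not part of statsmodels.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared; n_predictors excludes the constant."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def aic(log_likelihood, n_params):
    """Akaike information criterion; n_params includes the constant."""
    return 2 * n_params - 2 * log_likelihood


N_OBS = 22  # "No. Observations" in both summaries

# (label, R-squared, log-likelihood, predictors excluding the constant)
models = [
    ("all five predictors", 0.545, -28.200, 5),
    ("price dropped", 0.537, -28.398, 4),
]

for label, r2, llf, p in models:
    print(label, round(adjusted_r2(r2, N_OBS, p), 3), round(aic(llf, p + 1), 2))
```

Both criteria penalize extra parameters, which is why they can improve when a weak predictor is removed even though plain R-squared always drops slightly.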

In [129]:
X.drop('rating',inplace=True, axis=1)

In [130]:
lin_reg = sm.OLS(Y,X)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            total_slots   R-squared:                       0.530
Model:                            OLS   Adj. R-squared:                  0.451
Method:                 Least Squares   F-statistic:                     6.758
Date:                Mon, 27 May 2024   Prob (F-statistic):            0.00301
Time:                        20:43:03   Log-Likelihood:                -28.569
No. Observations:                  22   AIC:                             65.14
Df Residuals:                      18   BIC:                             69.50
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            8.2201      2.464      3.337   

Better again: adjusted R-squared rises to 0.451 and AIC falls to 65.14. Dropping the next variable: `totalRatings`

In [133]:
X.drop('totalRatings',inplace=True, axis=1)

In [134]:
lin_reg = sm.OLS(Y,X)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            total_slots   R-squared:                       0.368
Model:                            OLS   Adj. R-squared:                  0.302
Method:                 Least Squares   F-statistic:                     5.539
Date:                Mon, 27 May 2024   Prob (F-statistic):             0.0127
Time:                        20:44:05   Log-Likelihood:                -31.815
No. Observations:                  22   AIC:                             69.63
Df Residuals:                      19   BIC:                             72.90
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         11.9609      2.200      5.436      0.0

We reach a model whose remaining coefficients reject the null hypothesis at the 0.05 level, giving us a statistically significant model of:

 `BikeSlots = 11.9609 - 0.0897(venueCount) - 0.0073(distanceInMeters)`
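
As a quick sanity check, the fitted equation can be applied to a station from the `head()` output earlier in this section. This is only a sketch: `predict_total_slots` is a hypothetical helper, and the coefficients are copied from the equation above rather than read from the fitted model object.

```python
def predict_total_slots(venue_count, distance_m):
    """Evaluate the fitted equation with coefficients copied from the text."""
    return 11.9609 - 0.0897 * venue_count - 0.0073 * distance_m


# Station 0 from the head() output: 43 venues, 523.970492 m, 4 observed slots
predicted = predict_total_slots(43, 523.970492)
observed = 4

print(f"predicted={predicted:.2f}, observed={observed}, residual={observed - predicted:.2f}")
```

For this station the prediction is about 4.28 slots against 4 observed. In the notebook itself, `model.predict(X)` would give the same kind of fitted values for every row.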

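This model suggests that the number of bike slots at each station depends on the number of nearby venues and the average distance to them. Both coefficients are negative: each additional nearby venue is associated with about 0.09 fewer slots, and each additional meter of average distance with about 0.007 fewer slots, holding the other variable fixed. The fit is weaker than in the previous step (R-squared falls from 0.530 to 0.368, adjusted R-squared from 0.451 to 0.302), so the gain in significance comes with a loss of explanatory power.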
