Build a regression model.

In [12]:
# Imports
import statsmodels.api as sm  
import pandas as pd           
import numpy as np            

# Read the data from 'station_poi_df.csv' into a DataFrame
station_poi_df = pd.read_csv('station_poi_df.csv')

# Select the feature variables and the target variable
X = station_poi_df[['distance', 'review_count', 'popularity', 'rating', 'price']]
y = station_poi_df['free_bikes']

# Add a constant column for intercept in regression
X = sm.add_constant(X)

# Create a regression model
lin_reg = sm.OLS(y, X)

# Fit regression to the data
model = lin_reg.fit()


Provide model output and an interpretation of the results. 

In [6]:
# First run model
model.summary()

Dep. Variable:,free_bikes,R-squared:,0.013
Model:,OLS,Adj. R-squared:,0.011
Method:,Least Squares,F-statistic:,8.298
Date:,"Tue, 05 Sep 2023",Prob (F-statistic):,8.36e-08
Time:,16:24:18,Log-Likelihood:,-8851.3
No. Observations:,3257,AIC:,17710.0
Df Residuals:,3251,BIC:,17750.0
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,5.6064,0.772,7.265,0.000,4.093,7.119
distance,-0.0007,0.000,-2.394,0.017,-0.001,-0.000
review_count,0.0003,7.66e-05,4.435,0.000,0.000,0.000
popularity,0.8092,0.319,2.537,0.011,0.184,1.435
rating,0.0632,0.099,0.638,0.524,-0.131,0.258
price,0.3115,0.102,3.042,0.002,0.111,0.512

Omnibus:,625.157,Durbin-Watson:,0.147
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1389.512
Skew:,1.095,Prob(JB):,1.87e-302
Kurtosis:,5.334,Cond. No.,13900.0


The initial model has a low R-squared of 0.013 (adjusted R-squared 0.011), meaning it explains only about 1% of the variability in 'free_bikes'. The 'rating' variable has a p-value of 0.524, well above the 0.05 significance level, while the other predictors are all significant at that level. The next logical step is to refit the model without 'rating'. This simplifies the model, although removing a predictor cannot raise R-squared, so a noticeably better fit should not be expected.

In [11]:
# Second run model
X = station_poi_df[['popularity','price','review_count','distance']]
y = pd.Series(station_poi_df['free_bikes'])
X = sm.add_constant(X)
lin_reg = sm.OLS(y,X)
model = lin_reg.fit()
model.summary()

Dep. Variable:,free_bikes,R-squared:,0.012
Model:,OLS,Adj. R-squared:,0.011
Method:,Least Squares,F-statistic:,10.27
Date:,"Tue, 05 Sep 2023",Prob (F-statistic):,2.89e-08
Time:,16:27:49,Log-Likelihood:,-8851.5
No. Observations:,3257,AIC:,17710.0
Df Residuals:,3252,BIC:,17740.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,6.0615,0.295,20.582,0.000,5.484,6.639
popularity,0.8546,0.311,2.748,0.006,0.245,1.464
price,0.3057,0.102,2.998,0.003,0.106,0.506
review_count,0.0003,7.65e-05,4.488,0.000,0.000,0.000
distance,-0.0007,0.000,-2.351,0.019,-0.001,-0.000

Omnibus:,623.927,Durbin-Watson:,0.147
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1383.921
Skew:,1.093,Prob(JB):,3.06e-301
Kurtosis:,5.327,Cond. No.,7110.0


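The second model has an R-squared of 0.012 and the same adjusted R-squared of 0.011, and the p-values of all remaining variables are now below the commonly accepted threshold of 0.05. The model as a whole is statistically significant (Prob (F-statistic) = 2.89e-08), but it still explains only about 1% of the variability in 'free_bikes'. It is therefore simpler than the original model without being a meaningfully better fit, and it has little practical predictive value.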

# Stretch

How can you turn the regression model into a classification model?

Convert the numeric target into a categorical one (e.g., turn 'free_bikes' from a count of available bikes into a binary 'yes/no' category), bin any predictors where that is appropriate, and then fit a logistic regression to the resulting data.