Build a regression model.

In [1]:
# Is there a link between the number of bike stations close to a restaurant and the rating of the restaurant?
# import numpy
import numpy as np
import pandas as pd

# import statsmodels for the OLS regression
import statsmodels.api as sm

# load the pickled dataframes
yelp_df = pd.read_pickle('../data/yelp_dataframe.pkl')
fsq_df = pd.read_pickle('../data/fsq_dataframe.pkl')
bike_df = pd.read_pickle('../data/bike_dataframe.pkl')
fsq_categories = pd.read_pickle('../data/fsq_categories_df.pkl')
yelp_categories = pd.read_pickle('../data/yelp_categories.pkl')

# join the Yelp restaurants to the bike stations on the shared bs_id column
regression_df = pd.merge(yelp_df, bike_df, how='inner', on='bs_id')

Provide model output and an interpretation of the results. 

In [2]:
# Count how many times each yelp_id comes up; this tells us how many bike stations are near each restaurant
bikes_near_restaurant = regression_df[['yelp_id', 'rating', 'name']].groupby(by=['yelp_id', 'rating']).count()
avg_distance = regression_df[['yelp_id', 'rating', 'distance']].groupby(by=['yelp_id', 'rating']).mean()
bikes_near_restaurant = pd.merge(bikes_near_restaurant, avg_distance, how='right', on=['yelp_id', 'rating'])
bikes_near_restaurant.rename(columns={'name': 'num_bike_stations', 'distance': 'avg_distance'}, inplace=True)
bikes_near_restaurant = bikes_near_restaurant.reset_index() 
bikes_near_restaurant

| | yelp_id | rating | num_bike_stations | avg_distance |
|---|---|---|---|---|
| 0 | -3U1K-W87mtmRdNO4d-UUg | 4.5 | 3 | 271.809363 |
| 1 | -9_4a5avO5hdClyAdNRicw | 3.6 | 3 | 268.223683 |
| 2 | -CNdcUG-6q5c1yZ8cM_MUg | 4.1 | 9 | 647.757418 |
| 3 | -FhPFfOrHIc9t69Qgrxmsw | 4.8 | 1 | 180.839129 |
| 4 | -HGeu3vZUYNLwEoRb-KhMw | 4.1 | 1 | 783.745818 |
| ... | ... | ... | ... | ... |
| 1166 | zjEdfLF4v8WqdkpDiPqjWg | 4.8 | 11 | 10019.009627 |
| 1167 | zoNL3rnRBrUdByTGkiyfeQ | 3.7 | 1 | 7168.766116 |
| 1168 | zs02vRnaP8TB67r2OcCeIQ | 4.3 | 61 | 968.472203 |
| 1169 | zsJoDVfWfiA8pYMey2Ix2Q | 5.0 | 1 | 202.969837 |


In [3]:
# Let's create our X and y: we are looking at the number of bike stations vs. the rating received
X = bikes_near_restaurant.drop(['rating','yelp_id'], axis = 1)
y = bikes_near_restaurant['rating']

# add a constant (intercept) column to X
X = sm.add_constant(X)
# Build the OLS model. Only the num_bike_stations column is passed here, so the
# constant and avg_distance are left out and the line is fit through the origin
# (this is why the summary reports an uncentered R-squared)
lin_reg = sm.OLS(y,X['num_bike_stations'])

In [4]:
# fit the model and display the summary
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                                 OLS Regression Results                                
Dep. Variable:                 rating   R-squared (uncentered):                   0.329
Model:                            OLS   Adj. R-squared (uncentered):              0.328
Method:                 Least Squares   F-statistic:                              573.2
Date:                Sun, 15 Dec 2024   Prob (F-statistic):                   2.06e-103
Time:                        21:13:53   Log-Likelihood:                         -3087.1
No. Observations:                1171   AIC:                                      6176.
Df Residuals:                    1170   BIC:                                      6181.
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                        coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------

A few major and easily noticeable things here:
1. Our p-value is vanishingly small (2.06e-103 for the F-statistic), so YAY, the fit is very unlikely to be down to chance alone.
2. Our R-squared is 0.329, but the summary labels it "uncentered", which means the model was fit without an intercept. It measures variation in ratings around zero rather than around the mean rating, so it is not directly comparable to the usual R-squared and should not be read as 32.9% of the variance in restaurant ratings being explained by the model.
3. Our coefficient is sitting at 0.1859, so for every extra bike station you would expect the star rating to go up by 0.1859.
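To make the point about the uncentered R-squared concrete, here is a minimal sketch in plain Python. The six data points are made up for illustration and are not taken from the notebook. It fits one line through the origin and one line with an intercept, then compares the R-squared values. In statsmodels, the intercept version would correspond to passing the constant column to `sm.OLS` together with `num_bike_stations`.

```python
# Illustrative data only (made up for this sketch, not taken from the notebook)
stations = [1, 2, 3, 5, 8, 13]
ratings = [4.1, 3.9, 4.3, 4.0, 4.4, 4.2]

n = len(stations)
mean_x = sum(stations) / n
mean_y = sum(ratings) / n

# Fit 1: line through the origin, rating = b_origin * stations
b_origin = (sum(x * y for x, y in zip(stations, ratings))
            / sum(x * x for x in stations))
ssr_origin = sum((y - b_origin * x) ** 2 for x, y in zip(stations, ratings))
# Uncentered R-squared compares the residuals with variation around zero
r2_uncentered = 1 - ssr_origin / sum(y * y for y in ratings)

# Fit 2: line with an intercept, rating = a + b * stations
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(stations, ratings))
     / sum((x - mean_x) ** 2 for x in stations))
a = mean_y - b * mean_x
ssr = sum((y - (a + b * x)) ** 2 for x, y in zip(stations, ratings))
# Centered R-squared compares the residuals with variation around the mean
sst = sum((y - mean_y) ** 2 for y in ratings)
r2_centered = 1 - ssr / sst

# Score the through-origin fit against the same centered baseline
r2_origin_centered = 1 - ssr_origin / sst

print("through origin, uncentered R^2:", round(r2_uncentered, 3))
print("through origin, centered R^2:  ", round(r2_origin_centered, 3))
print("with intercept, centered R^2:  ", round(r2_centered, 3))
```

On data like this, where every rating sits near 4, the through-origin line can look respectable on the uncentered scale while doing worse than simply predicting the mean rating. That is why the two kinds of R-squared should not be compared directly.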

Wonderful! We might say we have solved restaurants! Simply put up 27 bike stations within 500 m of your restaurant and you are GUARANTEED a 5* rating. Sadly, we all know this isn't true. A more accurate way of reading this model is that busier areas, which have more bike stations, are also where the more popular, better-rated restaurants tend to be.
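The arithmetic behind the "27 bike stations" figure can be checked with a short sketch. It assumes the fitted model is simply rating = 0.1859 × num_bike_stations with no intercept, which is what the uncentered R-squared in the summary suggests. The helper name `predicted_rating` is made up for this example.

```python
import math

COEF = 0.1859  # coefficient quoted in the interpretation above


def predicted_rating(num_bike_stations):
    """Prediction of a through-origin line: rating = COEF * stations."""
    return COEF * num_bike_stations


# Smallest whole number of stations whose predicted rating reaches 5 stars
stations_for_five_stars = math.ceil(5 / COEF)
print("stations needed for 5 stars:", stations_for_five_stars)

# Predictions for 1, 27 and 61 stations (61 appears in the table above)
for n in (1, 27, 61):
    print(n, "stations ->", round(predicted_rating(n), 2))
```

The same sketch shows why the model should not be taken literally: under this assumption a restaurant with one nearby station is predicted to have a rating below 1, and the restaurant with 61 stations in the table is predicted well above the 5-star maximum, even though its actual rating is 4.3.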

# Stretch

How can you turn the regression model into a classification model?

 - Since ratings run from 0 to 5, we could turn this into a classification problem by treating the ratings as classes (for example, by binning them into rating bands) and predicting the class from the same factors. I ran out of time to do the full calculation this time around, but I will look into it more in the future.
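As a rough idea of what that could look like, here is a minimal sketch in plain Python, not a worked solution for this dataset. It assumes a two-class version of the problem: a hypothetical cut-off of 4.5 stars splits restaurants into "highly rated" and "not highly rated", and a one-feature logistic regression is fit by gradient descent. The rows, the cut-off and the helper names are all made up for illustration. In practice a library model such as statsmodels' `Logit` or scikit-learn's `LogisticRegression` would be the usual choice.

```python
import math

# Illustrative rows only (made up for this sketch): (num_bike_stations, rating)
rows = [(1, 3.6), (2, 4.8), (3, 4.1), (5, 4.5),
        (8, 3.9), (13, 4.7), (21, 4.6), (34, 4.9)]

THRESHOLD = 4.5  # hypothetical cut-off for a "highly rated" restaurant

# Step 1: replace the continuous rating with a class label (1 = high, 0 = not)
xs = [stations for stations, _ in rows]
labels = [1 if rating >= THRESHOLD else 0 for _, rating in rows]

# Scale the feature to [0, 1] so gradient descent behaves well
x_max = max(xs)
xs_scaled = [x / x_max for x in xs]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Step 2: fit P(high) = sigmoid(w0 + w1 * x) by batch gradient descent
w0, w1 = 0.0, 0.0
learning_rate = 0.5
for _ in range(2000):
    grad0 = grad1 = 0.0
    for x, label in zip(xs_scaled, labels):
        error = sigmoid(w0 + w1 * x) - label
        grad0 += error
        grad1 += error * x
    w0 -= learning_rate * grad0 / len(xs)
    w1 -= learning_rate * grad1 / len(xs)


def predict_proba(num_bike_stations):
    """Estimated probability that a restaurant is highly rated."""
    return sigmoid(w0 + w1 * num_bike_stations / x_max)


def predict_class(num_bike_stations):
    """Class label: 1 if the estimated probability is at least 0.5."""
    return 1 if predict_proba(num_bike_stations) >= 0.5 else 0


for n in (1, 10, 30):
    print(n, "stations -> P(high) =", round(predict_proba(n), 2),
          "class =", predict_class(n))
```

The output of the model is now a probability and a class label rather than a star rating, so it would be judged with classification metrics such as accuracy instead of R-squared. Using more than two rating bands would turn this into a multi-class problem.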