Build a regression model.

In [1]:
## Get the data back from sql and import the needed packages
import sqlite3
import pandas as pd
import statsmodels.api as sm
conn = sqlite3.connect('mydatabase')
fulldf = pd.read_sql(
    '''
    select *
    from bike_stations b
    join yelp y on b.Station_Number = y.Nearest_Station
    join foursquare f on f.Nearest_Station = b.Station_Number
    ''',
    conn,
)
fulldf = fulldf.drop(columns=['Nearest_Station']) 
## fulldf is now a full dataframe with latitude, longitude, free_bikes, total yelp results, and total foursquare results
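
Because `select *` across three joined tables returns `Nearest_Station` twice, one alternative is to name the wanted columns in the query so nothing has to be dropped afterwards. The sketch below is only an illustration under assumptions: it uses an in-memory database with made-up rows, and the column names `Free_Bikes`, `Yelp_Results` and `Foursquare_Results` are hypothetical stand-ins for whatever the real tables contain.

```python
import sqlite3
import pandas as pd

# In-memory database with made-up rows (not the project's data).
demo_conn = sqlite3.connect(":memory:")

pd.DataFrame({
    "Station_Number": [1, 2],
    "Latitude": [10.001, 10.002],      # made-up coordinates
    "Longitude": [20.001, 20.002],
    "Free_Bikes": [5, 0],              # hypothetical column name
}).to_sql("bike_stations", demo_conn, index=False)

pd.DataFrame({
    "Nearest_Station": [1, 2],
    "Yelp_Results": [120, 80],         # hypothetical column name
}).to_sql("yelp", demo_conn, index=False)

pd.DataFrame({
    "Nearest_Station": [1, 2],
    "Foursquare_Results": [20, 15],    # hypothetical column name
}).to_sql("foursquare", demo_conn, index=False)

# Listing columns explicitly avoids duplicate Nearest_Station columns.
query = """
SELECT b.Station_Number, b.Latitude, b.Longitude, b.Free_Bikes,
       y.Yelp_Results, f.Foursquare_Results
FROM bike_stations AS b
JOIN yelp AS y ON y.Nearest_Station = b.Station_Number
JOIN foursquare AS f ON f.Nearest_Station = b.Station_Number
"""
demo_df = pd.read_sql(query, demo_conn)
demo_conn.close()

print(demo_df.columns.tolist())
```

With unique column names, later selections such as `demo_df[['Latitude', 'Longitude']]` cannot accidentally pick up a duplicated column.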








Model Building

In [12]:
x = fulldf['Free Bikes']
y = fulldf['Yelp_Results']

x = sm.add_constant(x)

model = sm.OLS(y, x).fit()
predictions = model.predict(x) 

print_model = model.summary()
print(print_model)


## Free Bikes has a p-value of 0.284, so it is not a statistically significant predictor of the number of nearby businesses, which matches what we saw in the correlation matrix

                            OLS Regression Results                            
Dep. Variable:                   yelp   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.151
Date:                Sat, 21 Oct 2023   Prob (F-statistic):              0.284
Time:                        12:02:08   Log-Likelihood:                -2525.9
No. Observations:                 400   AIC:                             5056.
Df Residuals:                     398   BIC:                             5064.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        110.7823     10.376     10.677      0.0

In [3]:
x = fulldf[['Free Bikes']]
y = fulldf['Foursquare_Results']

x = sm.add_constant(x)

model = sm.OLS(y, x).fit()
predictions = model.predict(x) 

print_model = model.summary()
print(print_model)
## same model but with foursquare as the dependent variable instead of yelp; again not significant (p = 0.404)

                            OLS Regression Results                            
Dep. Variable:             foursquare   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.6985
Date:                Fri, 27 Oct 2023   Prob (F-statistic):              0.404
Time:                        23:07:14   Log-Likelihood:                -1534.0
No. Observations:                 400   AIC:                             3072.
Df Residuals:                     398   BIC:                             3080.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         18.6930      0.869     21.505      0.0

In [2]:
x = fulldf[['Latitude','Longitude']]
y = fulldf['Yelp_Results']

x = sm.add_constant(x)

model = sm.OLS(y, x).fit()
predictions = model.predict(x) 

print_model = model.summary()
print(print_model)


## The results support our hypothesis:
## Latitude and Longitude both show p-values of 0.000 in the summary (the table rounds to three decimals), so they are statistically significant in predicting the number of nearby businesses
## This could probably be used for other latitude/longitude pairs in the same area, not just bike stations

                            OLS Regression Results                            
Dep. Variable:                   yelp   R-squared:                       0.089
Model:                            OLS   Adj. R-squared:                  0.085
Method:                 Least Squares   F-statistic:                     19.48
Date:                Wed, 25 Oct 2023   Prob (F-statistic):           8.52e-09
Time:                        16:15:48   Log-Likelihood:                -2507.7
No. Observations:                 400   AIC:                             5021.
Df Residuals:                     397   BIC:                             5033.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.112e+05   1.86e+04      5.973      0.0

In [2]:
x = fulldf[['Latitude','Longitude']]
y = fulldf['Foursquare_Results']

x = sm.add_constant(x)

model = sm.OLS(y, x).fit()
predictions = model.predict(x) 

print_model = model.summary()
print(print_model)

## same model but with foursquare as the dependent variable instead of yelp

                            OLS Regression Results                            
Dep. Variable:             foursquare   R-squared:                       0.065
Model:                            OLS   Adj. R-squared:                  0.060
Method:                 Least Squares   F-statistic:                     13.81
Date:                Fri, 27 Oct 2023   Prob (F-statistic):           1.60e-06
Time:                        22:49:55   Log-Likelihood:                -1520.9
No. Observations:                 400   AIC:                             3048.
Df Residuals:                     397   BIC:                             3060.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       7156.5086   1578.944      4.532      0.0

SUMMARY

Using the number of free bikes to predict the number of nearby businesses gives p-values of 0.284 (Yelp) and 0.404 (Foursquare), with R-squared of 0.003 and 0.002, so the relationship is not statistically significant.

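Using latitude and longitude to predict the number of nearby businesses gives overall F-test p-values of 8.52e-09 (Yelp) and 1.60e-06 (Foursquare), so the relationship is statistically significant even though R-squared is low (0.089 and 0.065): location explains less than 10% of the variation.
We could probably extend this to other latitude/longitude pairs in the same area, not just bike stations, to estimate how many businesses are nearby, though predictions far outside the range covered by the stations would not be reliable.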
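
To see why a low R-squared can still be significant with 400 observations, the overall F-statistic can be recomputed from the numbers in the summary tables using the standard formula F = (R² / k) / ((1 - R²) / (n - k - 1)), where k is the number of predictors. This is a rough check only: the inputs are the rounded R-squared values printed above, so the results will differ slightly from the reported F-statistics. The helper name `f_statistic` is made up for this sketch.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept."""
    df_resid = n_obs - n_predictors - 1
    return (r_squared / n_predictors) / ((1.0 - r_squared) / df_resid)

# Latitude + Longitude -> yelp: R-squared 0.089, 400 observations, 2 predictors
f_latlon = f_statistic(0.089, 400, 2)

# Free Bikes -> yelp: R-squared 0.003, 400 observations, 1 predictor
f_bikes = f_statistic(0.003, 400, 1)

print(round(f_latlon, 2), round(f_bikes, 2))
```

The two values come out at roughly 19.4 and 1.2, close to the reported 19.48 and 1.151. The same sample size that makes an R-squared of 0.089 clearly significant is not enough to rescue an R-squared of 0.003.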

# Stretch

How can you turn the regression model into a classification model?

- You could take the average number of Yelp/Foursquare results and classify each station as having an above-average or below-average number of nearby businesses. That binary label can then be the target of a classifier such as logistic regression.
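
The labelling step described above could be sketched as follows. The five rows are made-up values used only to show the mechanics, and the column name `above_average` is an assumption introduced for this example.

```python
import pandas as pd

# Made-up business counts for five stations (not the project's data).
demo = pd.DataFrame({
    "Station_Number": [1, 2, 3, 4, 5],
    "yelp": [40, 250, 90, 130, 60],
})

# Threshold at the mean number of results across stations.
threshold = demo["yelp"].mean()

# 1 = above-average number of nearby businesses, 0 = at or below average.
demo["above_average"] = (demo["yelp"] > threshold).astype(int)

print(threshold)
print(demo)
```

The `above_average` column could then replace the raw count as the dependent variable, for example with `sm.Logit` in place of `sm.OLS`. Splitting at the median instead of the mean would give two classes of roughly equal size, which may be preferable if a few stations have very high counts.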