In [1]:
import pandas as pd
import statsmodels.api as sm

Build a regression model.

In [2]:
fs_bikes = pd.read_pickle('../data/fs_bikes.pkl')
yelp_bikes = pd.read_pickle('../data/yelp_bikes.pkl')

+ Below, I'm trying out one-hot encoding (OHE) for the `price` column, since it's categorical.

In [3]:
yelp_price_encoding = pd.get_dummies(data=yelp_bikes, columns=['price'], dtype=int)
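As a rough illustration of what `pd.get_dummies` produces, here is a minimal sketch on a made-up toy frame (the values are invented, not taken from `yelp_bikes`). One thing it shows: with the default settings, a row whose `price` is missing gets 0 in every dummy column. If `yelp_bikes` has missing prices (an assumption here, not something checked above), those rows would act as the baseline category when all four price dummies are used alongside a constant.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, only for illustration)
toy = pd.DataFrame({
    "review_count": [10, 250, 40, 5],
    "price": ["$", "$$", np.nan, "$$$"],
})

# Same call pattern as in the cell above
encoded = pd.get_dummies(data=toy, columns=["price"], dtype=int)
print(encoded)

# The row with a missing price has 0 in every price_* column
price_cols = [c for c in encoded.columns if c.startswith("price_")]
missing_row_total = int(encoded.loc[2, price_cols].sum())
print(missing_row_total)
```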

In [4]:
y = yelp_price_encoding['number_of_bikes']
x = yelp_price_encoding[['review_count', 'rating', 'price_$', 'price_$$', 'price_$$$', 'price_$$$$']]
x = sm.add_constant(x)

In [5]:
lin_reg = sm.OLS(y, x)

model = lin_reg.fit()

Provide model output and an interpretation of the results. 

In [6]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:        number_of_bikes   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                  0.014
Method:                 Least Squares   F-statistic:                     21.23
Date:                Fri, 31 Jan 2025   Prob (F-statistic):           6.99e-25
Time:                        19:00:02   Log-Likelihood:                -29373.
No. Observations:                8640   AIC:                         5.876e+04
Df Residuals:                    8633   BIC:                         5.881e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           18.8900      0.761     24.823   

+ I'm going to remove the `price_$$` and `rating` variables since their p-values are high, and try again.

In [7]:
x = x.drop(columns=['price_$$', 'rating'])
lin_reg = sm.OLS(y, x)
model = lin_reg.fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:        number_of_bikes   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                  0.014
Method:                 Least Squares   F-statistic:                     31.67
Date:                Fri, 31 Jan 2025   Prob (F-statistic):           3.10e-26
Time:                        19:00:02   Log-Likelihood:                -29373.
No. Observations:                8640   AIC:                         5.876e+04
Df Residuals:                    8635   BIC:                         5.879e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           18.7472      0.101    185.386   

+ The p-value for `price_$` is still above the threshold, so I'm going to remove it as well.

In [8]:
x = x.drop(columns='price_$')
lin_reg = sm.OLS(y, x)
model = lin_reg.fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:        number_of_bikes   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                  0.013
Method:                 Least Squares   F-statistic:                     40.35
Date:                Fri, 31 Jan 2025   Prob (F-statistic):           6.94e-26
Time:                        19:00:02   Log-Likelihood:                -29376.
No. Observations:                8640   AIC:                         5.876e+04
Df Residuals:                    8636   BIC:                         5.879e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           18.6824      0.097    191.889   

+ The adjusted R<sup>2</sup> value is low: our independent variables account for only about 1.3% of the variance in the number of bikes.

+ The p-value of the F-statistic is very low, so we can reject the null hypothesis that all of the coefficients are zero: the model fits better with our independent variables included than an intercept-only model would.

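+ From the coefficients, we can see that the price seems to have a much bigger effect on the number of bikes than the review count, although the two are not directly comparable: a price dummy's coefficient is the shift for belonging to that price tier, while the `review_count` coefficient is the change per additional review.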
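For reference, the relationship between R<sup>2</sup> and adjusted R<sup>2</sup> can be sketched as below. The helper name `adjusted_r2` is just for illustration. Because the summary rounds R<sup>2</sup> to three decimals, feeding 0.014 back in will not reproduce the reported adjusted value exactly; the point is only that with 8640 observations and 3 predictors the penalty is tiny.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# n and p taken from the last summary (No. Observations, Df Model)
r2 = 0.014  # rounded value from the summary, so the result is approximate
adj = adjusted_r2(r2, n_obs=8640, n_predictors=3)
penalty = r2 - adj

print(f"adjusted R^2 ~ {adj:.4f}, penalty ~ {penalty:.5f}")
```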

##### Trial

Below, I wanted to try OHE for the categories of our points of interest. This isn't recommended for categorical variables with many values, and there are better tests to perform, so I'm not using the result for analysis. I just wanted to see what the output would look like.

In [9]:
fs_encoding = pd.get_dummies(data=fs_bikes, columns=['category0'], dtype=int)

In [10]:
# Collect the dummy columns created from `category0`
category_list = [col for col in fs_encoding.columns if col.startswith('category0')]

In [11]:
y = fs_encoding['number_of_bikes']
x = fs_encoding[category_list]
x = sm.add_constant(x)
lin_reg = sm.OLS(y, x)
model = lin_reg.fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:        number_of_bikes   R-squared:                       0.057
Model:                            OLS   Adj. R-squared:                  0.031
Method:                 Least Squares   F-statistic:                     2.162
Date:                Fri, 31 Jan 2025   Prob (F-statistic):           2.11e-21
Time:                        19:00:02   Log-Likelihood:                -29180.
No. Observations:                8640   AIC:                         5.884e+04
Df Residuals:                    8402   BIC:                         6.052e+04
Df Model:                         237                                         
Covariance Type:            nonrobust                                         
                                                                       coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------
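One common way to keep the number of dummy columns manageable is to collapse rare categories into a single bucket before encoding. The sketch below uses made-up category names, and both the `collapse_rare` helper and the `min_count` cutoff are hypothetical choices for illustration, not something applied to `fs_bikes` above.

```python
import pandas as pd

def collapse_rare(series, min_count, other_label="Other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)

# Toy categories (hypothetical values)
toy = pd.Series(
    ["Cafe", "Cafe", "Cafe", "Bar", "Bar", "Museum", "Zoo"],
    name="category0",
)

collapsed = collapse_rare(toy, min_count=2)
dummies = pd.get_dummies(collapsed, prefix="category0", dtype=int)
print(dummies.columns.tolist())
```

The cutoff trades detail for stability: a higher `min_count` gives fewer columns but lumps more distinct categories together.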

# Stretch

How can you turn the regression model into a classification model?

We can turn this into a classification model by converting our y variable, in this case `number_of_bikes`, into categories. We can split it into a binary variable (`low_number_of_bikes` vs. `high_number_of_bikes`) and run a logistic regression model. We can also split it into more categories, for example by including `medium_number_of_bikes`, and then use multinomial logistic regression.
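A minimal sketch of the binning step, on made-up bike counts. The median split and the three equal-sized quantile bins are assumptions chosen for illustration; other cutoffs may suit the data better. The resulting labels could then serve as the target for a classifier such as `sm.Logit` (binary) or `sm.MNLogit` (multi-class) in statsmodels.

```python
import pandas as pd

# Toy target values (hypothetical, not from the project data)
bikes = pd.Series([3, 8, 12, 15, 19, 22, 27, 35], name="number_of_bikes")

# Binary target: 1 if above the median, else 0
is_high = (bikes > bikes.median()).astype(int)

# Three-class target: roughly equal-sized quantile bins
three_level = pd.qcut(bikes, q=3, labels=["low", "medium", "high"])

print(is_high.tolist())
print(three_level.tolist())
```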