Build a regression model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px  # aliased as px so it does not overwrite matplotlib's plt
import statsmodels.api as sm


In [2]:
df = pd.read_csv('final_merge.csv')
df

Unnamed: 0,empty_slots,free_bikes,Latitude,Longitude,Bike_station,Total_bikes,Category,Rating,Restaurant_name,Location,Distance
0,9,7,49.27,-123.12,Yaletown-Roundhouse Station,16,Restaurant,-1.0,Ganache Patisserie,1262 Homer St,674.00
1,9,7,49.27,-123.12,Yaletown-Roundhouse Station,16,Breakfast & Brunch,4.5,OEB Breakfast Co. Yaletown,1137 Marinaside Crescent,1956.67
2,4,10,49.27,-123.12,Spyglass & Seawall,14,Restaurant,-1.0,Ganache Patisserie,1262 Homer St,674.00
3,4,10,49.27,-123.12,Spyglass & Seawall,14,Breakfast & Brunch,4.5,OEB Breakfast Co. Yaletown,1137 Marinaside Crescent,1956.67
4,6,8,49.27,-123.12,Stamps Landing,14,Restaurant,-1.0,Ganache Patisserie,1262 Homer St,674.00
...,...,...,...,...,...,...,...,...,...,...,...
527,12,3,49.25,-123.10,18th & Main,15,"Tapas/Small Plates, Wine Bars, Cocktail Bars",4.5,Published On Main,3593 Main Street,1681.55
528,16,0,49.25,-123.10,20th & Main,16,"Tapas/Small Plates, Wine Bars, Cocktail Bars",4.5,Published On Main,3593 Main Street,1681.55
529,7,5,49.25,-123.10,Ontario & 23rd,12,"Tapas/Small Plates, Wine Bars, Cocktail Bars",4.5,Published On Main,3593 Main Street,1681.55
530,17,5,49.25,-123.10,27th & Main,22,"Tapas/Small Plates, Wine Bars, Cocktail Bars",4.5,Published On Main,3593 Main Street,1681.55


In [3]:
df.keys()

Index(['empty_slots', 'free_bikes', 'Latitude', 'Longitude', 'Bike_station',
       'Total_bikes', 'Category', 'Rating', 'Restaurant_name', 'Location',
       'Distance'],
      dtype='object')

After trying different combinations of variables, the best fit came from the following model, which uses empty_slots as the dependent variable and Latitude, Longitude, Rating, Distance, and free_bikes as the independent variables.

In [7]:
# Convert object columns to numeric or categorical data types
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df['Category'] = df['Category'].astype('category')

# Prepare the data for regression
X = df[['Latitude', 'Longitude', 'Rating', 'Distance','free_bikes']]  # Independent variables
y = df['empty_slots']  # Dependent variable

# Add a constant term to the independent variables
X = sm.add_constant(X)

# Build the regression model
model = sm.OLS(y, X)

# Fit the regression model
results = model.fit()

Provide model output and an interpretation of the results. 

In [8]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:            empty_slots   R-squared:                       0.416
Model:                            OLS   Adj. R-squared:                  0.410
Method:                 Least Squares   F-statistic:                     74.95
Date:                Tue, 04 Jul 2023   Prob (F-statistic):           3.17e-59
Time:                        10:16:42   Log-Likelihood:                -1544.4
No. Observations:                 532   AIC:                             3101.
Df Residuals:                     526   BIC:                             3126.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -3874.8198   2994.216     -1.294      0.1

1. R-squared: The coefficient of determination (R-squared) is a measure of how well the regression model fits the data. In this case, the R-squared value is 0.416, which means that approximately 41.6% of the variation in the dependent variable (empty_slots) can be explained by the independent variables included in the model.
2. F-statistic: The F-statistic tests the overall significance of the regression model, that is, whether the independent variables jointly explain more variation in the dependent variable than an intercept-only model. Here, the F-statistic is 74.95, and the associated probability (Prob (F-statistic)) is very small (3.17e-59), indicating that the model as a whole is statistically significant.
3. free_bikes: The coefficient is -0.5865, indicating that a one-unit increase in free_bikes is associated with a decrease of 0.5865 in empty_slots, holding the other variables constant. The p-value (P>|t| = 0.000, i.e. below 0.0005) suggests that this coefficient is statistically significant.
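
As a quick sanity check on the interpretation above, the adjusted R-squared and the F-statistic can be recomputed from the R-squared, the number of observations, and the number of predictors reported in the summary. This is only a sketch using the rounded values printed in the summary, so the results should land close to the reported 0.410 and 74.95 rather than match them digit for digit.

```python
# Values copied from the OLS summary above (R-squared is rounded to 3 decimals)
r_squared = 0.416
n_obs = 532                 # No. Observations
k = 5                       # Df Model: predictors, excluding the constant
df_resid = n_obs - k - 1    # should equal Df Residuals (526)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# F-statistic: explained variance per predictor over unexplained variance per residual df
f_stat = (r_squared / k) / ((1 - r_squared) / df_resid)

print(f"Adjusted R-squared: {adj_r_squared:.3f}")
print(f"F-statistic: {f_stat:.2f}")
```

Any small gap between these numbers and the summary comes from using the rounded R-squared as input.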






# Stretch

How can you turn the regression model into a classification model?

1. Train the Regression Model
   - Fit the regression model to the data with a suitable algorithm (OLS in this notebook).
   - Fitting lets the model learn the relationship between the independent variables and the dependent variable.
2. Obtain Predicted Values or Probabilities
   - Once the model has been trained, retrieve the predicted value or probability for each data point.
   - These predictions are the model's estimate of the outcome variable given the supplied inputs.
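
One common way to go from predicted values to classes is to compare each prediction with a cut-off. The sketch below illustrates that idea only: the `predicted` array is made-up data standing in for the output of `results.predict(X)`, and the median cut-off and the "high"/"low" labels are assumptions chosen for illustration, not something this notebook defines.

```python
import numpy as np

# Hypothetical predicted empty_slots values, standing in for results.predict(X)
predicted = np.array([3.2, 9.8, 12.5, 6.0, 15.1])

# Assumed cut-off: the median prediction; any domain-specific threshold could be used instead
threshold = np.median(predicted)

# Label each station "high" if its prediction is at or above the cut-off, else "low"
labels = np.where(predicted >= threshold, "high", "low")
print(labels)
```

An alternative, if the target itself is first converted to two classes, is to fit a logistic regression (for example with `sm.Logit`) so that the model outputs probabilities directly.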