Build a regression model.

In [1]:
# libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm

In [11]:
# Load the dataset for combined stations and yelp point of interest (POI)
data = pd.read_csv('../data/combined_station_yelp.csv')
data

Unnamed: 0,station_name,station_latitude,station_longitude,free_bikes,empty_slots,number_of_bikes,poi_name,poi_categories,poi_address,poi_distance,poi_latitude,poi_longitude,poi_ratings,poi_review_count
0,Hess at king,43.259126,-79.877212,5,7,12,La Luna,Middle Eastern,"306 King Street W,Hamilton, ON L8P 1B1,Canada",108.424550,43.259422,-79.878488,4.0,63
1,Hess at king,43.259126,-79.877212,5,7,12,Hambrgr,Burgers,"49 King William Street,Hamilton, ON L8R 1A2,Ca...",858.672096,43.257210,-79.866900,4.5,202
2,Hess at king,43.259126,-79.877212,5,7,12,Earth To Table : Bread Bar,Pizza,"258 Locke Street S,Hamilton, ON L8P 4B9,Canada",1052.141521,43.252840,-79.887020,4.0,293
3,Hess at king,43.259126,-79.877212,5,7,12,The Ship,Seafood,"23 Augusta Street,Hamilton, ON L8N 1P6,Canada",970.855528,43.252150,-79.870000,4.0,208
4,Hess at king,43.259126,-79.877212,5,7,12,Berkeley North,Bars,"31 King William Street,Hamilton, ON L8R 1A1,Ca...",792.544458,43.257405,-79.867715,4.5,43
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2801,Cannon at Ottawa,43.247565,-79.818050,4,3,7,Mr Beast Burger,Burgers,"224 Ottawa Street N,Hamilton, ON L8H 3Z6,Canada",149.392720,43.248686,-79.817039,4.0,1
2802,Cannon at Ottawa,43.247565,-79.818050,4,3,7,Bernie’s Tavern,Modern European,"1101-1103 Cannon St E,Hamilton, ON L8L 2J5,Canada",293.018582,43.248570,-79.821395,3.5,3
2803,Cannon at Ottawa,43.247565,-79.818050,4,3,7,The Hearty Hooligan,Cafes,"292 Ottawa Street N,Hamilton, ON L8H 3Z9,Canada",324.046798,43.250241,-79.816466,4.5,7
2804,Cannon at Ottawa,43.247565,-79.818050,4,3,7,Simply Italian Bakery,Bakeries,"212 Ottawa Street N,Hamilton, ON L8H 3Z6,Canada",116.158933,43.248448,-79.817282,4.0,1


In [5]:
# Step 1: Data Preparation
# Define the columns to summarize per station and the statistic to compute for each
columns_to_agg = {
    'poi_ratings': 'mean',
    'poi_review_count': 'mean',
    'poi_distance': 'mean'
}

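# Group by 'station_name' and calculate summary statistics
# (poi_stats is not used further below; the model is fit on the POI-level rows)
poi_stats = data.groupby('station_name').agg(columns_to_agg).reset_index()

# Select relevant columns for the model
# Note: number_of_bikes is a station-level value repeated on every POI row of that station
selected_columns = ['number_of_bikes', 'poi_ratings', 'poi_review_count', 'poi_distance']
data = data[selected_columns].dropna()  # Remove rows with missing values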

# Step 2: Model Building
# Define the dependent variable (number_of_bikes) and independent variables (POI characteristics)
y = data['number_of_bikes']
X = data[['poi_ratings', 'poi_review_count', 'poi_distance']]

# Add a constant (intercept) to the independent variables
X = sm.add_constant(X)

# Fit a linear regression model
model = sm.OLS(y, X).fit()

# Step 3: Model Evaluation
# Print a summary of the regression results
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:        number_of_bikes   R-squared:                       0.020
Model:                            OLS   Adj. R-squared:                  0.019
Method:                 Least Squares   F-statistic:                     19.53
Date:                Wed, 06 Sep 2023   Prob (F-statistic):           1.57e-12
Time:                        10:12:55   Log-Likelihood:                -8828.9
No. Observations:                2806   AIC:                         1.767e+04
Df Residuals:                    2802   BIC:                         1.769e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               12.4142      0.611  


Provide model output and an interpretation of the results. 

##### Model Summary
This linear regression model aims to predict the 'number_of_bikes' based on the predictor variables 'poi_ratings', 'poi_review_count', and 'poi_distance'. Here's an interpretation of the results:

* R-squared: The R-squared value is 0.020, which means that the model explains only 2.0% of the variance in the 'number_of_bikes' variable. In other words, the model doesn't fit the data very well, and most of the variance in 'number_of_bikes' is not explained by the predictor variables.
* Adjusted R-squared: Adjusted R-squared penalizes R-squared for the number of predictor variables in the model. In this case, it's 0.019, which is very close to the R-squared value. With 2806 observations and only 3 predictors the penalty is negligible, so the low fit is not an artifact of including too many variables.
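
The relationship between the two values can be checked by hand. This is a minimal sketch, assuming the standard adjusted R-squared formula and taking the number of observations and predictors from the summary above. The variable names are illustrative and not part of the original notebook.

```python
# Recompute adjusted R-squared from the values printed in the OLS summary.
r_squared = 0.020    # R-squared (rounded, as printed)
n_obs = 2806         # No. Observations
n_predictors = 3     # Df Model

# Standard formula: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(round(adj_r_squared, 3))  # 0.019
```

Because n is so much larger than p, the correction factor (n - 1) / (n - p - 1) is almost exactly 1, which is why the two values nearly coincide.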

##### Coefficients:

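* const: This represents the intercept of the regression equation. In this case, it's approximately 12.4142. It means that when all predictor variables are zero, the estimated number_of_bikes is around 12.4142. A POI with a zero rating, zero reviews, and zero distance is unlikely to occur in practice, so the intercept is best read as the baseline of the fitted line rather than as a realistic prediction.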
* poi_ratings: The coefficient for 'poi_ratings' is -0.3612. This suggests that for every unit increase in 'poi_ratings', the estimated 'number_of_bikes' decreases by 0.3612 units, assuming all other variables are held constant.
* poi_review_count: The coefficient for 'poi_review_count' is -0.0091. This implies that for every additional review in 'poi_review_count', the estimated 'number_of_bikes' decreases by 0.0091 units, holding other variables constant.
* poi_distance: The coefficient for 'poi_distance' is 0.0021. It means that for every unit increase in 'poi_distance', the estimated 'number_of_bikes' increases by 0.0021 units, assuming all other variables remain constant.
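
Taken together, the coefficients define a prediction equation. The sketch below applies it to the first row of the dataset shown earlier (La Luna, near the "Hess at king" station). The function name and dictionary are illustrative helpers, not part of the original notebook, and the coefficients are the rounded values quoted above.

```python
# Coefficients as quoted in the interpretation above (rounded).
COEFFICIENTS = {
    "const": 12.4142,
    "poi_ratings": -0.3612,
    "poi_review_count": -0.0091,
    "poi_distance": 0.0021,
}

def predict_number_of_bikes(poi_ratings, poi_review_count, poi_distance):
    """Evaluate the fitted linear equation for one POI row."""
    return (
        COEFFICIENTS["const"]
        + COEFFICIENTS["poi_ratings"] * poi_ratings
        + COEFFICIENTS["poi_review_count"] * poi_review_count
        + COEFFICIENTS["poi_distance"] * poi_distance
    )

# First row of the dataset: rating 4.0, 63 reviews, distance 108.424550
predicted = predict_number_of_bikes(4.0, 63, 108.424550)
print(round(predicted, 2))  # 10.62, while the station actually has 12 bikes
```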


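Overall, the model explains very little of the variance in 'number_of_bikes'. The F-statistic (19.53, p = 1.57e-12) indicates that the predictors are jointly statistically significant, but with an R-squared of 0.020 their practical explanatory power is small. The low R-squared value and the non-normality in the residuals suggest that a linear regression on these POI features is a poor fit for this data.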

# Stretch

How can you turn the regression model into a classification model?

In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset into a DataFrame
data = pd.read_csv('../data/combined_station_yelp.csv')

# Define the threshold for classification (e.g., 10 bikes)
threshold = 10

# Create a binary target variable 'bike_above_threshold'
data['bike_above_threshold'] = (data['number_of_bikes'] > threshold).astype(int)

# Feature selection
X = data[['station_latitude', 'station_longitude', 'poi_distance', 'poi_ratings', 'poi_review_count']]
y = data['bike_above_threshold']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train a logistic regression classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = classifier.predict(X_test)

# Evaluate the classification model
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
classification_report_str = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", confusion)
print("Classification Report:\n", classification_report_str)


Accuracy: 0.5516014234875445
Confusion Matrix:
 [[137 150]
 [102 173]]
Classification Report:
               precision    recall  f1-score   support

           0       0.57      0.48      0.52       287
           1       0.54      0.63      0.58       275

    accuracy                           0.55       562
   macro avg       0.55      0.55      0.55       562
weighted avg       0.55      0.55      0.55       562



The classification model has an accuracy of approximately 55.16%, only slightly better than the 51.07% obtained by always predicting the majority class (287 of the 562 test rows belong to class 0). Precision is similar for the two classes (0.57 and 0.54), but recall is not: the model recovers 63% of class 1 and only 48% of class 0, so it leans toward predicting that a station has more than 10 bikes. There is room for improvement, for example by tuning the model or trying different algorithms.
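
The figures in the report can be reproduced directly from the confusion matrix. This is a minimal sketch in plain Python, assuming scikit-learn's layout in which rows are true classes and columns are predicted classes. The majority-class baseline (always predicting the most common true label) is added here for comparison and is not part of the original output.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class.
confusion = [[137, 150],
             [102, 173]]
(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of the rows predicted as class 1, how many are class 1
recall_1 = tp / (tp + fn)      # of the true class 1 rows, how many were found
recall_0 = tn / (tn + fp)      # of the true class 0 rows, how many were found

# Accuracy obtained by always predicting the most common true class.
majority_baseline = max(tn + fp, fn + tp) / total

print(f"accuracy={accuracy:.4f} baseline={majority_baseline:.4f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")
print(f"class 0: recall={recall_0:.2f}")
```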