Build a regression model.

In [28]:
import pandas as pd
import statsmodels.api as sm
import requests

# Load dataset into a Pandas DataFrame
merged_df = pd.read_csv('merged_data.csv')
merged_df

Unnamed: 0,Station Name,Latitude,Longitude,Name,Categories,Distance,Address,City,Rating,Empty slots
0,Hess at king,43.259126,-79.877212,,,,,,,9.0
1,Bayfront Park,43.269288,-79.871327,,,,,,,25.0
2,Bay at Strachan,43.267859,-79.867923,,,,,,,18.0
3,Bay at Mulberry,43.263198,-79.871803,,,,,,,10.0
4,City Hall,43.256132,-79.874499,,,,,,,15.0
...,...,...,...,...,...,...,...,...,...,...
240,,43.240320,-79.810500,Original Pizza,Pizza,1018.830790,"1388 Main Street E, Hamilton, ON L8K 1C1, Canada",Hamilton,4.0,7.0
241,,43.239770,-79.808572,Domino's Pizza,"Pizza, Chicken Wings, Sandwiches",1159.452639,"1440 Main Street E, Unit 2, Hamilton, ON L8K 6...",Hamilton,3.0,7.0
242,,43.249817,-79.808151,BarBurrito - Hamilton,"Fast Food, Mexican",905.623779,"1275 Barton St E, Hamilton, ON L8H 2V4, Canada",Hamilton,3.5,7.0
243,,43.247520,-79.829750,The Taco Station,"Mexican, Food Trucks",934.408074,"Hamilton, ON L8M 3C3, Canada",Hamilton,3.0,7.0


In [29]:
# Remove rows with missing values in 'Categories' and 'Address'
merged_df.dropna(subset=['Categories', 'Address'], inplace=True)
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 145 to 244
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Station Name  0 non-null      object 
 1   Latitude      100 non-null    float64
 2   Longitude     100 non-null    float64
 3   Name          50 non-null     object 
 4   Categories    100 non-null    object 
 5   Distance      100 non-null    float64
 6   Address       100 non-null    object 
 7   City          50 non-null     object 
 8   Rating        50 non-null     float64
 9   Empty slots   100 non-null    float64
dtypes: float64(5), object(5)
memory usage: 8.6+ KB
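
Before dropping rows, it can help to count the missing values per column, since the output above shows that only 50 of the 100 remaining rows have a Name, City and Rating. The following is a minimal sketch on a made-up four-row frame (the `toy` data and its values are assumptions for illustration, not the real `merged_data.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) mimicking the layout above: station rows
# carry no POI fields, and POI rows carry no station name.
toy = pd.DataFrame({
    "Station Name": ["A", "B", None, None],
    "Categories": [None, None, "Pizza", "Cafes"],
    "Address": [None, None, "1 Main St", "2 Main St"],
    "Rating": [np.nan, np.nan, 4.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Keep only rows where both 'Categories' and 'Address' are present.
# Without inplace=True, dropna returns a new frame and leaves `toy` unchanged.
poi_rows = toy.dropna(subset=["Categories", "Address"])

print(missing_per_column)
print(len(toy), "->", len(poi_rows))
```

Comparing the row counts before and after makes it easy to see how much data each cleaning step discards.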


In [30]:
# delete column 'Station Name'
merged_df.drop(['Station Name'], axis=1, inplace=True)
# Display information about the DataFrame
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 145 to 244
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Latitude     100 non-null    float64
 1   Longitude    100 non-null    float64
 2   Name         50 non-null     object 
 3   Categories   100 non-null    object 
 4   Distance     100 non-null    float64
 5   Address      100 non-null    object 
 6   City         50 non-null     object 
 7   Rating       50 non-null     float64
 8   Empty slots  100 non-null    float64
dtypes: float64(5), object(4)
memory usage: 7.8+ KB


In [31]:
# Remove rows with missing values in 'City' and 'Rating'
merged_df.dropna(subset=['City', 'Rating'], inplace=True)
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, 195 to 244
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Latitude     50 non-null     float64
 1   Longitude    50 non-null     float64
 2   Name         50 non-null     object 
 3   Categories   50 non-null     object 
 4   Distance     50 non-null     float64
 5   Address      50 non-null     object 
 6   City         50 non-null     object 
 7   Rating       50 non-null     float64
 8   Empty slots  50 non-null     float64
dtypes: float64(5), object(4)
memory usage: 3.9+ KB


In [25]:
# Handle Duplicate Rows 
merged_df.drop_duplicates(inplace=True)
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, 50 to 99
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Latitude    50 non-null     float64
 1   Longitude   50 non-null     float64
 2   Name        50 non-null     object 
 3   Categories  50 non-null     object 
 4   Distance    50 non-null     float64
 5   Address     50 non-null     object 
 6   City        50 non-null     object 
 7   Rating      50 non-null     float64
dtypes: float64(4), object(4)
memory usage: 3.5+ KB


In [None]:
# Fill any remaining missing values in 'City' with 'Hamilton'
# (a no-op here, since rows with a missing 'City' were already dropped above)
merged_df['City'] = merged_df['City'].fillna('Hamilton')
merged_df.info()

In [32]:
# Reset the index after data transformations
merged_df.reset_index(drop=True, inplace=True)

# Check the DataFrame after transformations
print(merged_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Latitude     50 non-null     float64
 1   Longitude    50 non-null     float64
 2   Name         50 non-null     object 
 3   Categories   50 non-null     object 
 4   Distance     50 non-null     float64
 5   Address      50 non-null     object 
 6   City         50 non-null     object 
 7   Rating       50 non-null     float64
 8   Empty slots  50 non-null     float64
dtypes: float64(5), object(4)
memory usage: 3.6+ KB
None
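
The cleaning steps above could also be collected into a single function that returns a new frame instead of modifying `merged_df` in place. This is only a sketch: the function name `clean_poi_rows` and the toy data are hypothetical, and the column names are taken from the output above.

```python
import numpy as np
import pandas as pd

def clean_poi_rows(df):
    """Keep complete POI rows only and return a new, re-indexed frame."""
    return (
        df.dropna(subset=["Categories", "Address", "City", "Rating"])
          .drop(columns=["Station Name"])
          .drop_duplicates()
          .reset_index(drop=True)
    )

# Made-up rows: one station row, one POI listed twice, one POI without a rating.
toy = pd.DataFrame({
    "Station Name": ["A", None, None, None],
    "Categories": [None, "Pizza", "Pizza", "Cafes"],
    "Address": [None, "1 Main St", "1 Main St", "2 Main St"],
    "City": [None, "Hamilton", "Hamilton", None],
    "Rating": [np.nan, 4.0, 4.0, np.nan],
    "Empty slots": [9.0, 7.0, 7.0, 7.0],
})

cleaned = clean_poi_rows(toy)
print(cleaned)
```

Because every step returns a new frame, the original data stays available and re-running the cell always starts from the same input.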


In [33]:
# View the cleaned data
merged_df

Unnamed: 0,Latitude,Longitude,Name,Categories,Distance,Address,City,Rating,Empty slots
0,43.24838,-79.81775,Hambrgr Ottawa Street,"Burgers, American",93.403783,"207 Ottawa Street N, Hamilton, ON L8H 3Z4, Canada",Hamilton,4.0,7.0
1,43.247406,-79.817549,Cannon Coffee,"Coffee & Tea, Breakfast & Brunch",44.275042,"180 Ottawa Street N, Hamilton, ON L8H 3Z3, Canada",Hamilton,4.5,7.0
2,43.24788,-79.81788,MERK,Tapas Bars,40.067246,"189 Ottawa St N, Hamilton, ON L8H 3Z4, Canada",Hamilton,4.5,7.0
3,43.242741,-79.819429,Caro Restaurant and Bar,"Italian, Bars",547.938391,"4 Ottawa Street N, Hamilton, ON L8H 3Y7, Canada",Hamilton,4.5,7.0
4,43.2475,-79.81656,Mancala Monk Board Game Cafe,"Cafes, Coffee & Tea, Hobby Shops",113.295982,"1229 Cannon Street E, Hamilton, ON L8H 1T8, Ca...",Hamilton,5.0,7.0
5,43.248568,-79.821587,Shorty's Pizza,Pizza,310.043536,"1099 Cannon Street E, Hamilton, ON L8L 2J6, Ca...",Hamilton,4.0,7.0
6,43.24584,-79.81808,Mike's Subs,Sandwiches,188.716178,"122 Ottawa Street N, Hamilton, ON L8H 3Z1, Canada",Hamilton,4.0,7.0
7,43.253072,-79.82329,Purple Pear,"Seafood, Steakhouses",745.010667,"946 Barton Street E, Hamilton, ON L8E 5H3, Canada",Hamilton,4.5,7.0
8,43.24474,-79.81852,Hammerhead's On Ottawa,Fish & Chips,316.041371,"80 Ottawa Street N, Hamilton, ON L8H 3Z1, Canada",Hamilton,4.5,7.0
9,43.246309,-79.818385,Boardwalk Cheesesteaks,"Cheesesteaks, Sandwiches",143.422021,"131 Ottawa Street North, Hamilton, ON L8L 1Y5,...",Hamilton,4.5,7.0


In [45]:
import pandas as pd
import statsmodels.api as sm

# Dependent variable: empty slots at the bike station.
# Independent variables: numeric characteristics of the nearby POIs.
y = merged_df['Empty slots']
X = merged_df[['Rating', 'Distance']]

# Add a constant term for the intercept in the regression model
X = sm.add_constant(X)

# Build the regression model
model = sm.OLS(y, X).fit()

# Interpret and print the results
print(model.summary())



  return 1 - self.ssr/self.centered_tss
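
The line `return 1 - self.ssr/self.centered_tss` appears to be the tail of a warning raised while R-squared was being computed, which is what one would expect if the dependent variable has no variation. A quick check before fitting can catch this. The sketch below uses a hypothetical helper `has_variation` and made-up series; in the notebook it would be applied to `merged_df['Empty slots']`.

```python
import pandas as pd

def has_variation(series):
    """Return True when the series takes more than one distinct non-missing value."""
    return series.nunique(dropna=True) > 1

# Made-up stand-ins for the dependent variable.
y_constant = pd.Series([7.0] * 5, name="Empty slots")
y_varied = pd.Series([7.0, 9.0, 25.0, 18.0, 10.0], name="Empty slots")

for label, series in [("constant", y_constant), ("varied", y_varied)]:
    if has_variation(series):
        print(f"{label}: the response varies, so a regression can be fitted")
    else:
        print(f"{label}: the response is constant, so there is nothing to explain")
```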


Provide model output and an interpretation of the results. 

In [None]:
 OLS Regression Results                            
==============================================================================
Dep. Variable:            Empty slots   R-squared:                        -inf
Model:                            OLS   Adj. R-squared:                   -inf
Method:                 Least Squares   F-statistic:                    -23.50
Date:                Thu, 16 Nov 2023   Prob (F-statistic):               1.00
Time:                        02:58:34   Log-Likelihood:                 1489.5
No. Observations:                  50   AIC:                            -2973.
Df Residuals:                      47   BIC:                            -2967.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.0000   1.85e-14   3.78e+14      0.000       7.000       7.000
Rating      5.884e-15   3.99e-15      1.475      0.147   -2.14e-15    1.39e-14
Distance   -5.898e-17   1.19e-17     -4.943      0.000    -8.3e-17    -3.5e-17
==============================================================================
Omnibus:                        9.972   Durbin-Watson:                   0.480
Prob(Omnibus):                  0.007   Jarque-Bera (JB):                3.251
Skew:                          -0.255   Prob(JB):                        0.197
Kurtosis:                       1.859   Cond. No.                     3.31e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.31e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [None]:
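R-squared and Adjusted R-squared: R-squared measures the proportion of the variance in the dependent variable that the independent variables explain. Here both values are -inf, which is not a valid result. The warning line in the output shows that R-squared is computed as 1 - ssr/centered_tss, and the centered total sum of squares is zero when the dependent variable never varies. Every row shown in the cleaned data has Empty slots = 7.0, and the fitted intercept of exactly 7.0000 with near-zero slopes is consistent with the column being constant across all 50 rows. With no variance to explain, R-squared is undefined, and the model cannot be assessed until the data include stations with different Empty slots values.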
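
To see where an R-squared of -inf can come from, the sketch below evaluates 1 - SSR / centered TSS (the expression in the warning line of the output) for a response that never varies. The constant response and the tiny residual sum of squares are assumptions chosen to mimic this fit, not values read from the notebook.

```python
import numpy as np

# Assumption: the response is constant at 7.0 across all 50 rows.
y = np.full(50, 7.0)

# Centered total sum of squares: squared deviations from the mean.
centered_tss = np.sum((y - y.mean()) ** 2)

# Hypothetical tiny residual sum of squares standing in for round-off error.
ssr = np.float64(1e-26)

# Silence the divide-by-zero warning so the result can be inspected.
with np.errstate(divide="ignore"):
    r_squared = 1 - ssr / centered_tss

print(centered_tss, r_squared)
```

Any positive SSR divided by a zero total sum of squares gives infinity, so the reported value says nothing about how well Rating and Distance predict Empty slots.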

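F-statistic and Prob (F-statistic): The F-statistic tests whether the model as a whole explains more variance than an intercept-only model. A p-value close to 0 would indicate that at least one independent variable significantly predicts the dependent variable. Here the F-statistic is -23.50, which a valid F-statistic cannot be because it is a ratio of non-negative quantities, and its p-value of 1.00 gives no evidence of any relationship. Both numbers are best read as artifacts of fitting a dependent variable that appears to be constant (Empty slots is 7.0 in every row shown).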

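Coefficients (const, Rating, Distance): These are the estimated coefficients for the intercept (const) and each independent variable (Rating, Distance). The intercept is 7.0000, and the coefficients on Rating (5.884e-15) and Distance (-5.898e-17) are zero to within floating-point round-off, so the model predicts 7 empty slots for every POI regardless of its rating or distance. The t-statistic is the coefficient divided by its standard error, and the p-value (P>|t|) tests whether the coefficient differs from zero; a low p-value normally suggests that the variable is significant. The p-value of 0.000 for Distance should not be read that way here: both the coefficient and its standard error are numerical noise, so their ratio is not meaningful.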
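
The t column of the summary can be reproduced from the other two, since each t-statistic is the coefficient divided by its standard error. The short check below copies the rounded values from the table; small differences from the printed t values (for example for Distance) come from that rounding.

```python
# Values copied from the coef and std err columns of the summary table.
coef = {"Rating": 5.884e-15, "Distance": -5.898e-17}
std_err = {"Rating": 3.99e-15, "Distance": 1.19e-17}

# t-statistic = coefficient / standard error.
t_stats = {name: coef[name] / std_err[name] for name in coef}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.3f}")
```

A ratio of two numbers near 1e-15 can easily look large, which is why a "significant" t-statistic here does not imply a real effect.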

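Standard Errors: Standard errors measure the variability of the coefficient estimates, and smaller values normally indicate more precise estimates. The values here (1.85e-14, 3.99e-15 and 1.19e-17) are at the level of floating-point precision. They reflect a residual sum of squares that is essentially zero rather than genuinely precise estimates.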

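Omnibus, Durbin-Watson, Jarque-Bera (JB), Skew, Kurtosis: These are diagnostics computed from the residuals. Omnibus and Jarque-Bera test whether the residuals are normally distributed, based on their skewness and kurtosis. Durbin-Watson tests for first-order autocorrelation, with values near 2 indicating none. Taken at face value, Omnibus (p = 0.007) rejects normality, Jarque-Bera (p = 0.197) does not, and Durbin-Watson (0.480) suggests strong positive autocorrelation. Because the residuals in this fit appear to be little more than round-off error, none of these diagnostics can be interpreted reliably.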
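
For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below implements that formula with NumPy on two made-up residual sequences; the function name `durbin_watson_stat` is hypothetical (statsmodels provides its own `durbin_watson` in `statsmodels.stats.stattools`).

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Made-up residuals: long runs of the same sign (positive autocorrelation)
# versus residuals that flip sign every step (negative autocorrelation).
dw_positive = durbin_watson_stat([1, 1, 1, -1, -1, -1])
dw_alternating = durbin_watson_stat([1, -1, 1, -1, 1, -1])

print(f"runs of equal sign: {dw_positive:.3f}")
print(f"alternating sign:   {dw_alternating:.3f}")
```

Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.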

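Condition Number: The condition number measures how sensitive the coefficient estimates are to small changes in the input data. The value of 3.31e+03 is large enough for statsmodels to warn about possible multicollinearity or other numerical problems. With only two predictors, a likely contributor is the difference in scale between Distance (roughly 40 to 1,160 in the rows shown) and Rating (3.0 to 5.0), so the warning alone does not show that the predictors are highly correlated.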
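
One way to check whether a large condition number comes from predictor scale rather than correlation is to compare it before and after standardizing the predictors. The sketch below uses only the ten Rating and Distance values displayed in the cleaned data above, so its numbers will not match the 3.31e+03 reported for all 50 rows. `np.linalg.cond` is used as a stand-in for the figure in the summary, and the helper names are hypothetical.

```python
import numpy as np

# Rating and Distance for the ten rows displayed in the cleaned data.
rating = np.array([4.0, 4.5, 4.5, 4.5, 5.0, 4.0, 4.0, 4.5, 4.5, 4.5])
distance = np.array([93.403783, 44.275042, 40.067246, 547.938391, 113.295982,
                     310.043536, 188.716178, 745.010667, 316.041371, 143.422021])

def design_matrix(*columns):
    """Stack an intercept column of ones with the given predictor columns."""
    return np.column_stack([np.ones(len(columns[0])), *columns])

def standardize(x):
    """Rescale to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

cond_raw = np.linalg.cond(design_matrix(rating, distance))
cond_scaled = np.linalg.cond(
    design_matrix(standardize(rating), standardize(distance))
)

print(f"raw predictors:          {cond_raw:.1f}")
print(f"standardized predictors: {cond_scaled:.2f}")
```

If the condition number drops sharply after standardizing, the warning was driven mainly by scale; if it stays large, the predictors are genuinely close to collinear.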