Build a regression model.

In [3]:
import pandas as pd

In [4]:
# Load data
project_data = pd.read_csv('project_data_all.csv')
project_data.head()


Unnamed: 0,uid,station,latitude,longitude,slots,free_bikes,ll,edu_facility,edu_distance,venue_name,venue_review_count,venue_rating,venue_distance,venue_price
0,7301,Primrose Ave / Davenport Rd,43.67142,-79.445947,15,12,"43.67142,-79.445947",St Josephat,588.0,The Cat's Cradle Sports and Spirits,5.0,4.5,634.199655,
1,7301,Primrose Ave / Davenport Rd,43.67142,-79.445947,15,12,"43.67142,-79.445947",St Josephat,588.0,The Greater Good Bar,53.0,4.0,604.248992,$$
2,7301,Primrose Ave / Davenport Rd,43.67142,-79.445947,15,12,"43.67142,-79.445947",St Josephat,588.0,CANO Restaurant,71.0,4.5,920.049873,$$
3,7301,Primrose Ave / Davenport Rd,43.67142,-79.445947,15,12,"43.67142,-79.445947",St Josephat,588.0,This Month Only Bar,5.0,1.0,819.082749,$
4,7301,Primrose Ave / Davenport Rd,43.67142,-79.445947,15,12,"43.67142,-79.445947",St Josephat,588.0,EL TREN LATÍNO,1.0,4.0,774.179855,


In [18]:
# Add a column with the percentage of bikes available for use at each station, i.e. free bikes as a % of total slots.
# This will be the dependent variable in my regression model.

project_data['percent_free'] = project_data['free_bikes'] / project_data['slots'] * 100



In [20]:
# Prepare a filtered dataframe with the data needed for this exercise.
# I am going to create a linear regression model with the % of free bikes as the dependent variable
# and the number of venues within a 1000 metre radius, the average number of venue reviews and the average venue rating
# as the independent variables.


# choose the columns that I want in the dataframe for the model
col_filter = ['uid', 'percent_free', 'venue_name', 'venue_review_count', 'venue_rating']
filtered_df = project_data[col_filter]


# Group the dataframe by bike station id and get the venue count, the average venue rating and the average number of venue reviews

model_data = filtered_df.groupby(['uid', 'percent_free']).agg({
    'venue_name': 'count',
    'venue_rating': 'mean',
    'venue_review_count': 'mean'
}).reset_index()

# Rename the columns for clarity
model_data.rename(columns={
    'venue_name': 'venue_count',
    'venue_rating': 'mean_rating',
    'venue_review_count': 'mean_review_count'
}, inplace=True)

model_data.head()


Unnamed: 0,uid,percent_free,venue_count,mean_rating,mean_review_count
0,7000,28.571429,1250,3.83,106.68
1,7002,31.578947,2500,3.68,98.0
2,7004,100.0,2500,3.78,186.8
3,7005,82.608696,2500,3.81,134.86
4,7006,84.210526,2500,3.61,153.86
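
##### Venue counts of 1250 and 2500 for a 1000 metre radius look high, which may mean that the merged data repeats the same station/venue pair many times. The sketch below is one way to check for that and to deduplicate before aggregating. The small `toy` dataframe and its venue names are hypothetical and are not taken from `project_data`.

```python
import pandas as pd

# Hypothetical miniature of the merged data: "Cafe A" appears twice for
# station 7000, as could happen after a many-to-many join (assumption).
toy = pd.DataFrame({
    "uid": [7000, 7000, 7000, 7002],
    "venue_name": ["Cafe A", "Cafe A", "Bar B", "Cafe A"],
    "venue_rating": [4.0, 4.0, 3.5, 4.0],
})

# Compare the number of rows with the number of distinct venues per station.
# A large gap between the two suggests duplicated rows.
check = toy.groupby("uid")["venue_name"].agg(rows="count", distinct="nunique")
print(check)

# Drop repeated station/venue pairs, then count venues per station.
deduped = toy.drop_duplicates(subset=["uid", "venue_name"])
venue_count = deduped.groupby("uid")["venue_name"].count()
print(venue_count)
```

##### If the two columns of `check` agree on the real data, no deduplication is needed and the counts can be used as they are.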


In [31]:
X = model_data[['venue_count', 'mean_rating', 'mean_review_count']]
y = model_data['percent_free']

In [32]:
import statsmodels.api as sm

In [33]:
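X = sm.add_constant(X)  # add an intercept column to the independent variables
lin_reg = sm.OLS(y, X)  # ordinary least squares: dependent variable first, then predictors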

Provide model output and an interpretation of the results. 

In [34]:
# model output

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:           percent_free   R-squared:                       0.167
Model:                            OLS   Adj. R-squared:                  0.161
Method:                 Least Squares   F-statistic:                     28.37
Date:                Sun, 03 Sep 2023   Prob (F-statistic):           9.76e-17
Time:                        12:20:30   Log-Likelihood:                -2047.8
No. Observations:                 429   AIC:                             4104.
Df Residuals:                     425   BIC:                             4120.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                66.4809     12.68

### Model Interpretation

#### R-squared 
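##### R-squared reflects the fit of the model. The R-squared value in this model is 0.167 and the adjusted R-squared is 0.161, which means that the model explains only about 16.7% of the variance in the percentage of free bikes (16.1% after adjusting for the number of predictors). Most of the variation between stations is left unexplained.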
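
##### To make the two numbers concrete, here is a minimal sketch of how R-squared and adjusted R-squared are computed from residuals. The fitted data is synthetic (an assumption for illustration); only the last line uses figures from the summary above (R-squared 0.167, 429 observations, 3 predictors).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 3))                      # three synthetic predictors
y = 50 + x @ np.array([2.0, -5.0, 1.0]) + rng.normal(scale=10, size=n)

X = np.column_stack([np.ones(n), x])             # add an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # ordinary least squares fit
resid = y - X @ beta

ss_res = np.sum(resid ** 2)                      # variation left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)             # total variation in y
r2 = 1 - ss_res / ss_tot

k = x.shape[1]                                   # number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # penalise extra predictors
print(round(r2, 3), round(adj_r2, 3))

# Same adjustment applied to the figures reported in the model summary.
doc_adj = 1 - (1 - 0.167) * (429 - 1) / (429 - 3 - 1)
print(round(doc_adj, 3))
```

##### The adjusted value is always at most the unadjusted one, and the gap is small here because there are only 3 predictors for 429 observations.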

#### P>|t| (p-value)
##### The output here shows a p-value of 0.012 or less for each of the independent variables. A p-value is the probability of seeing a coefficient at least this far from zero if the variable truly had no linear relationship with the percentage of free bikes; it does not measure the probability that the relationship itself is due to chance. At the conventional 5% significance level, the number of venues, the average number of venue reviews and the average venue rating near a station are each statistically significant. This of course only applies to the specific date and time that the city bikes data was retrieved and could change based on the day and time of day that the data is observed.

#### Coef
##### The coef for the constant and each of the variables is displayed. The most interesting one to me is the coef for mean venue rating, which is -11.4077. It indicates a negative association between mean venue rating and the percentage of free bikes at a bike station: holding the other variables constant, a one-point increase in mean rating is associated with a drop of about 11.4 percentage points in the share of bikes available. Bike stations with higher rated venues within 1000 metres appear to have a lower percentage of their bikes available for use. This could also be because areas with higher rated venues have more attractions, more tourist traffic and a larger population, which would explain higher bike rental use.

#### Confounding Variables
##### There are likely other variables that drive both the independent and dependent variables in this case, the most obvious being the resident population and the volume of tourists regularly visiting specific areas of the city. More residents and more tourists would both likely be positively correlated with the number of rental bikes being used and the number of venues open for business in each region of the city. These are likely confounding variables and should be taken into account when reviewing this model.

##### It should also be noted that the data on bike use was retrieved from the City Bikes API on a specific date and at a specific time. If the data were observed on a different date or at a different time, the model results could differ. It would make sense to average the data over several observations, or to compare the output for data from different days of the week and different months of the year.
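
##### A minimal sketch of that idea, assuming the API had been polled several times and the results stacked into one dataframe. The `snapshots` dataframe, its `retrieved_at` column and all of its values are hypothetical.

```python
import pandas as pd

# Hypothetical snapshots of two stations taken at three times of day.
snapshots = pd.DataFrame({
    "uid": [7000, 7000, 7000, 7002, 7002, 7002],
    "retrieved_at": pd.to_datetime([
        "2023-09-03 08:00", "2023-09-03 12:00", "2023-09-03 18:00",
        "2023-09-03 08:00", "2023-09-03 12:00", "2023-09-03 18:00",
    ]),
    "slots": [14, 14, 14, 19, 19, 19],
    "free_bikes": [2, 4, 9, 6, 6, 12],
})
snapshots["percent_free"] = snapshots["free_bikes"] / snapshots["slots"] * 100

# Average availability per station across all snapshots.
avg_percent_free = snapshots.groupby("uid")["percent_free"].mean()
print(avg_percent_free)

# Availability per station by hour of day, for comparing times.
snapshots["hour"] = snapshots["retrieved_at"].dt.hour
by_hour = snapshots.pivot_table(index="uid", columns="hour", values="percent_free")
print(by_hour)
```

##### The averaged series could replace the single-snapshot `percent_free` as the dependent variable, while the by-hour table shows how much a single snapshot can differ from the average.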

# Stretch

How can you turn the regression model into a classification model?

##### I am not sure I would turn the regression model into a classification model, but if I had to, I would need to create a categorical dependent variable with two possible values, like yes or no, or true or false. For example, I could build a classification model that predicts whether there is a bike rental station in a particular neighbourhood (yes there is, or no there isn't) based on the number of venues in that area, the quality of venues (measured by ratings), the cost of eating at a venue, etc. Given the data set that I have, I could also consider including the proximity to a post-secondary institution in a model of this type.

