Build a regression model.

In [2]:
# imports
import pandas as pd
import statsmodels.api as sm

# Load the station-level data; the first CSV column holds the row index
grouped_business_df = pd.read_csv('grouped_business.csv', index_col=0)

In [9]:
grouped_business_df

Unnamed: 0,Station Name,Average Rating,Average Reviews,Number of Bikes
0,Broadway Residence Hall,4.175000,572.050000,4
1,Casey Eye Institute,4.175000,80.850000,6
2,Cleveland High School,4.300000,561.050000,9
3,Cully Park,4.083333,47.111111,10
4,Doernbecher Children's Hospital,4.175000,48.750000,6
...,...,...,...,...
233,SW Yamhill at Director Park,4.175000,1497.900000,20
234,Shattuck Hall,4.175000,605.500000,3
235,Tilikum West at SW Moody,4.025000,285.600000,18
236,Urban Center Plaza at SW 6th,4.150000,741.550000,14


Data audit and cleaning

In [6]:
grouped_business_df.shape
#This table has 238 rows and 4 columns

(238, 4)

In [4]:
grouped_business_df.head()

Unnamed: 0,Station Name,Average Rating,Average Reviews,Number of Bikes
0,Broadway Residence Hall,4.175,572.05,4
1,Casey Eye Institute,4.175,80.85,6
2,Cleveland High School,4.3,561.05,9
3,Cully Park,4.083333,47.111111,10
4,Doernbecher Children's Hospital,4.175,48.75,6


In [7]:
# Check for missing values in each column
missing_values = grouped_business_df.isnull().sum()
print(missing_values)
# No missing values in the dataframe

Station Name       0
Average Rating     0
Average Reviews    0
Number of Bikes    0
dtype: int64


In [8]:
# Descriptive statistics for Average Rating and Average Reviews, grouped by Number of Bikes
descriptive_stats = grouped_business_df.groupby('Number of Bikes').describe()
descriptive_stats

Number of Bikes,Average Rating count,Average Rating mean,Average Rating std,Average Rating min,Average Rating 25%,Average Rating 50%,Average Rating 75%,Average Rating max,Average Reviews count,Average Reviews mean,Average Reviews std,Average Reviews min,Average Reviews 25%,Average Reviews 50%,Average Reviews 75%,Average Reviews max
1,1.0,4.275,,4.275,4.275,4.275,4.275,4.275,1.0,235.95,,235.95,235.95,235.95,235.95,235.95
2,4.0,4.30625,0.082601,4.25,4.25,4.275,4.33125,4.425,4.0,604.8625,322.491177,275.0,358.8125,606.625,852.675,931.2
3,19.0,4.196053,0.148666,3.875,4.1375,4.2,4.325,4.45,19.0,435.955263,281.167948,97.05,225.475,400.6,580.75,1028.65
4,11.0,4.172727,0.262311,3.45,4.1375,4.225,4.3,4.425,11.0,688.45,275.693302,70.4,620.6,696.7,803.85,1117.7
5,26.0,4.238462,0.122929,3.95,4.13125,4.275,4.31875,4.45,26.0,368.319231,251.454135,63.3,161.25,298.925,483.925,919.55
6,20.0,4.205,0.200755,3.675,4.15625,4.25,4.3375,4.475,20.0,492.105,467.24593,48.75,157.45,352.55,469.25,1534.05
7,16.0,4.204688,0.163864,3.85,4.15,4.225,4.35,4.375,16.0,311.134375,182.454284,52.55,176.9,282.825,398.7375,792.7
8,18.0,4.156944,0.265934,3.225,4.15625,4.2125,4.2625,4.475,18.0,338.947222,222.441728,64.45,126.0625,239.625,571.6625,645.95
9,24.0,4.273177,0.177921,3.925,4.19375,4.3125,4.375,4.625,24.0,409.478125,456.402911,54.25,112.875,150.875,620.725,1707.55
10,9.0,4.242593,0.091709,4.083333,4.175,4.3,4.3,4.325,9.0,818.512346,468.401795,47.111111,657.95,814.3,1049.5,1477.9


In [10]:
X = grouped_business_df['Number of Bikes']
Y = grouped_business_df['Average Rating']

In [11]:
X = sm.add_constant(X)
lin_model = sm.OLS(Y, X).fit()

In [12]:
model = lin_model.summary()

Provide model output and an interpretation of the results. 

In [13]:
print(model)

                            OLS Regression Results                            
Dep. Variable:         Average Rating   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.004
Method:                 Least Squares   F-statistic:                   0.02730
Date:                Tue, 12 Dec 2023   Prob (F-statistic):              0.869
Time:                        22:17:59   Log-Likelihood:                 88.758
No. Observations:                 238   AIC:                            -173.5
Df Residuals:                     236   BIC:                            -166.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const               4.2146      0.023    1

In [None]:
# Inference from the statsmodels output above:
# R-squared measures the proportion of the variance in the dependent variable (Average Rating) explained by the independent variable (Number of Bikes).
# Here R-squared is 0.000, so Number of Bikes explains essentially none of the variability in 'Average Rating'. The model does not fit the data well and is not statistically significant.
# The p-value is also very high at 0.869 (Prob (F-statistic), which in a one-predictor model equals the p-value for Number of Bikes); we are typically looking for a value below 0.05.

In [15]:
#using 'Average Reviews' for the stats model
X = grouped_business_df['Number of Bikes']
Y = grouped_business_df['Average Reviews']

X = sm.add_constant(X)
lin_model = sm.OLS(Y, X).fit()
model = lin_model.summary()
print(model)

                            OLS Regression Results                            
Dep. Variable:        Average Reviews   R-squared:                       0.137
Model:                            OLS   Adj. R-squared:                  0.133
Method:                 Least Squares   F-statistic:                     37.43
Date:                Tue, 12 Dec 2023   Prob (F-statistic):           3.90e-09
Time:                        22:33:09   Log-Likelihood:                -1760.3
No. Observations:                 238   AIC:                             3525.
Df Residuals:                     236   BIC:                             3532.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const             274.1422     55.022     

Interpretation of Results:

R-squared: From the model above, an R-squared value of 0.137 indicates that about 13.7% of the variability in the dependent variable (Average Reviews) is explained by the independent variable (Number of Bikes). Number of Bikes therefore explains only a modest share of the variation; roughly 86% remains unexplained by this model.

P-value: In this model output, the p-value for Number of Bikes is reported as 0.000, meaning it is below 0.0005 after rounding to three decimals. With a p-value this low, we can reject the null hypothesis that the coefficient is zero: 'Number of Bikes' is a statistically significant predictor of 'Average Reviews' in this model.

Coefficient: For this linear regression model, the coefficient for Number of Bikes is approximately 30.8923, meaning that each additional bike is associated with an estimated increase of about 30.9 in 'Average Reviews'.
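
As a worked illustration of that coefficient, the sketch below plugs two bike counts into the fitted line. It assumes the `const` value 274.1422 visible in the (truncated) summary is the intercept and uses the slope quoted above; the helper `predicted_average_reviews` and the chosen bike counts are made up for this example.

```python
# Values taken from the model output and interpretation above
intercept = 274.1422   # 'const' row of the summary (assumed to be the intercept)
slope = 30.8923        # coefficient for Number of Bikes


def predicted_average_reviews(number_of_bikes):
    """Fitted value from the simple linear model: intercept + slope * x."""
    return intercept + slope * number_of_bikes


at_10_bikes = predicted_average_reviews(10)
at_11_bikes = predicted_average_reviews(11)

print(round(at_10_bikes, 4))                # 583.0652
print(round(at_11_bikes, 4))                # 613.9575
print(round(at_11_bikes - at_10_bikes, 4))  # 30.8923, i.e. the slope
```

The difference between the two predictions equals the slope regardless of which pair of adjacent bike counts is chosen.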

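F-statistic: In this model output, the F-statistic is 37.43 with a very low p-value (3.90e-09), which indicates that the model as a whole is statistically significant. Because the model has a single predictor, this F-test is equivalent to the t-test on 'Number of Bikes', so it confirms that 'Number of Bikes' is a significant predictor of 'Average Reviews'.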

Conclusion: While we can establish a statistically significant relationship between the independent and dependent variables, it is worth noting that this relationship does not imply causation.


# Stretch

How can you turn the regression model into a classification model?

To convert the regression model into a classification model, I would first define which attribute I am trying to predict. Next, I would group that continuous variable into discrete labels to create the classes. Then I would select an appropriate classification algorithm and, finally, evaluate the model.