Build a regression model.

In [172]:
# Imports
import statsmodels.api as sm  
import pandas as pd           
import numpy as np            

# Read the data from 'station_poi_df.csv' into a DataFrame
station_poi_df = pd.read_csv('station_poi_df.csv')

# Function to fill missing values
def median_fillna(column):
    """
    Fill missing values in a DataFrame column with the median of that column.

    Parameters:
    ----------
    column : str
        The name of the column in the DataFrame

    Returns:
    -------
    None

    Notes:
    ------
    This function replaces missing (NaN) values in the column,
    with the median value of that column. The operation is performed in-place.

    """
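    # Assign the result back: calling fillna(inplace=True) on a column selection
    # is deprecated chained assignment and may not update the DataFrame
    station_poi_df[column] = station_poi_df[column].fillna(station_poi_df[column].median())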

# Fill missing values
station_poi_df['hours'] = station_poi_df['hours'].fillna('not available')
median_fillna('popularity(0-1)')
median_fillna('review_count')
median_fillna('distance')

# Select the feature variables and the target variable
X = station_poi_df[['distance', 'review_count', 'popularity(0-1)', 'rating']]
y = station_poi_df['free_bikes']

# Add a constant column for intercept in regression
X = sm.add_constant(X)

# Create a regression model
lin_reg = sm.OLS(y, X)

# Fit regression to the data
model = lin_reg.fit()


Provide model output and an interpretation of the results. 

In [173]:
# First run model
model.summary()

0,1,2,3
Dep. Variable:,free_bikes,R-squared:,0.012
Model:,OLS,Adj. R-squared:,0.01
Method:,Least Squares,F-statistic:,6.366
Date:,"Sun, 03 Sep 2023",Prob (F-statistic):,4.33e-05
Time:,12:49:05,Log-Likelihood:,-5812.6
No. Observations:,2124,AIC:,11640.0
Df Residuals:,2119,BIC:,11660.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,8.8492,0.990,8.934,0.000,6.907,10.792
distance,-0.0010,0.000,-2.170,0.030,-0.002,-9.67e-05
review_count,0.0003,6.99e-05,4.310,0.000,0.000,0.000
popularity(0-1),0.0044,0.665,0.007,0.995,-1.300,1.308
rating,-0.1469,0.114,-1.289,0.197,-0.370,0.077

0,1,2,3
Omnibus:,308.556,Durbin-Watson:,0.192
Prob(Omnibus):,0.0,Jarque-Bera (JB):,475.521
Skew:,1.011,Prob(JB):,5.52e-104
Kurtosis:,4.132,Cond. No.,18800.0


The initial model has a low R-squared of 0.012 (adjusted 0.010), meaning it explains only about 1% of the variability in 'free_bikes'. The p-values for 'popularity(0-1)' (0.995) and 'rating' (0.197) are also well above the 0.05 significance level. The next logical step is to refine the model by dropping these two variables and refitting.

In [171]:
# Second run model
X = station_poi_df[['distance','review_count']]
y = station_poi_df['free_bikes']
X = sm.add_constant(X)
lin_reg = sm.OLS(y,X)
model = lin_reg.fit()
model.summary()

0,1,2,3
Dep. Variable:,free_bikes,R-squared:,0.011
Model:,OLS,Adj. R-squared:,0.01
Method:,Least Squares,F-statistic:,11.86
Date:,"Sun, 03 Sep 2023",Prob (F-statistic):,7.52e-06
Time:,12:20:48,Log-Likelihood:,-5813.5
No. Observations:,2124,AIC:,11630.0
Df Residuals:,2121,BIC:,11650.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,7.7083,0.243,31.672,0.000,7.231,8.186
distance,-0.0011,0.000,-2.281,0.023,-0.002,-0.000
review_count,0.0003,6.98e-05,4.275,0.000,0.000,0.000

0,1,2,3
Omnibus:,306.564,Durbin-Watson:,0.188
Prob(Omnibus):,0.0,Jarque-Bera (JB):,471.215
Skew:,1.007,Prob(JB):,4.75e-103
Kurtosis:,4.125,Cond. No.,4350.0


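The second model has essentially the same fit: R-squared is 0.011 and adjusted R-squared is unchanged at 0.010. Both remaining predictors now have p-values below the common 0.05 threshold (0.023 for 'distance', 0.000 for 'review_count'), and the overall F-test is significant (Prob (F-statistic) = 7.52e-06). The model is simpler than the original but not a better fit: the relationships are statistically significant, yet it still explains only about 1% of the variance in 'free_bikes', so it has little practical predictive value.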

# Stretch

How can you turn the regression model into a classification model?

Convert the numeric variables, where applicable, into categorical ones (e.g., turn 'free_bikes' from a count of available bikes into a binary 'yes/no' label), then fit a logistic regression on the resulting binary target.