Build a regression model.

Provide model output and an interpretation of the results. 

In [18]:
# model_building.ipynb notebook 4 - Questions in steps.

## Question-1: Build a regression model.

import pandas as pd
import statsmodels.api as sm

## Select relevant columns for modeling
selected_columns = [
    "Shortest Yelp Distance from Station",
    "Empty_slots"
]

## Create a subset of the DataFrame with the selected columns
data = merged_df_citybik_fsq_yelp[selected_columns]

## Data Preprocessing
### Make a copy of the data to avoid modifying the original DataFrame
data = data.copy()

### Drop rows with missing values in either column
data.dropna(inplace=True)

## Split the data into independent variables (X) and the target variable (y)
X = data[["Shortest Yelp Distance from Station"]]
y = data["Empty_slots"]

## Add a constant term to the independent variables for the intercept
X = sm.add_constant(X)

## Create and train a linear regression model using statsmodels
model = sm.OLS(y, X).fit()

## Print a summary of the regression model
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:            Empty_slots   R-squared:                       0.309
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     14.28
Date:                Sun, 03 Sep 2023   Prob (F-statistic):           0.000649
Time:                        19:27:51   Log-Likelihood:                -116.99
No. Observations:                  34   AIC:                             238.0
Df Residuals:                      32   BIC:                             241.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
co

In [None]:
The model's R-squared is 0.309, so "Shortest Yelp Distance from Station" explains about 31% of the variability in Empty_slots. That is a moderate fit, and roughly 69% of the variability remains unexplained.
The adjusted R-squared (Adj. R²) is 0.287, slightly lower than R-squared. This gap is expected: adjusted R² penalizes the number of predictors relative to the sample size (here 1 predictor and 34 observations), so it does not by itself indicate overfitting. Other variables could still be explored to improve the fit.
The F-test p-value is 0.000649 (about 0.001), so the model is statistically significant, and there is strong evidence that changes in the "Shortest Yelp Distance from Station" variable are associated with changes in the number of empty slots.
In other words, this distance is likely a significant predictor of the number of empty slots, although the association does not establish causation, and the direction and size of the effect must be read from the coefficient table.
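
As a quick check on the interpretation above, the adjusted R-squared and the F-statistic can be recomputed from the values printed in the summary. This is a minimal sketch that uses only the rounded numbers shown in the output (R-squared, No. Observations, Df Model), so small differences from the reported figures are due to rounding.

```python
# Values read from the OLS summary above
r_squared = 0.309
n_obs = 34         # No. Observations
n_predictors = 1   # Df Model

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
df_resid = n_obs - n_predictors - 1
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid
print("Adjusted R-squared:", round(adj_r_squared, 3))  # 0.287

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat = (r_squared / n_predictors) / ((1 - r_squared) / df_resid)
print("F-statistic (from rounded R-squared):", round(f_stat, 2))
```

The recomputed adjusted R-squared matches the reported 0.287, which shows that the gap between the two R-squared values comes entirely from the degrees-of-freedom penalty. The recomputed F-statistic is close to the reported 14.28; the small difference comes from using the rounded R-squared.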






# Stretch

How can you turn the regression model into a classification model?

In [None]:
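## Question-3: How can you turn the regression model into a classification model?

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split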

### Work on a copy of the relevant columns so the original DataFrame is not modified
data = merged_df_citybik_fsq_yelp[["Shortest Yelp Distance from Station", "Empty_slots"]].copy()

### Data Preprocessing: drop rows with missing values before labeling,
### because a missing Empty_slots would otherwise compare as False and be labeled 0
data.dropna(inplace=True)

### Define the classes
### Class 1: Low Availability (Empty_slots <= 5), Class 0: High Availability (Empty_slots > 5)
data["Availability"] = (data["Empty_slots"] <= 5).astype(int)

### Split the data into training and testing sets
X = data[["Shortest Yelp Distance from Station"]]
y = data["Availability"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Create and train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

### Make predictions on the test set
y_pred = model.predict(X_test)

### Evaluate the classification model
accuracy = accuracy_score(y_test, y_pred)

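### zero_division=1 suppresses UndefinedMetricWarning by reporting 1.0 for undefined metrics;
### this can overstate precision or recall for a class that is never predicted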
classification_rep = classification_report(y_test, y_pred, zero_division=1)

print("Accuracy:", accuracy)
print("Classification Report:\n", classification_rep)

### Now use this trained classification model for prediction
### (1 = Low Availability, 0 = High Availability). For example:
new_data = pd.DataFrame({
    "Shortest Yelp Distance from Station": [25],
})
predicted_class = model.predict(new_data)
print("Predicted Class:", predicted_class)
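
With only 34 observations, a single 80/20 split leaves about 7 test rows, so the accuracy above can vary a lot with the split. One possible complement is stratified cross-validation plus class probabilities instead of hard labels. The sketch below is an illustration only: it runs on synthetic data generated to stand in for the notebook's DataFrame (the values are made up, not real results), and it reuses the same column names and the 1 = Low Availability encoding.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the notebook's data (hypothetical values, for illustration only)
rng = np.random.default_rng(42)
n = 34
distance = rng.uniform(0, 100, size=n)
# Assumed relationship: short distances are more often labeled 1 (Low Availability)
availability = (distance + rng.normal(0, 25, size=n) < 50).astype(int)

X = pd.DataFrame({"Shortest Yelp Distance from Station": distance})
y = pd.Series(availability, name="Availability")

# 5-fold stratified cross-validation gives a less split-dependent accuracy estimate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())

# Fit on all rows and report class probabilities for a new station
clf = LogisticRegression().fit(X, y)
new_data = pd.DataFrame({"Shortest Yelp Distance from Station": [25]})
proba = clf.predict_proba(new_data)
# Columns follow clf.classes_: [0 = High Availability, 1 = Low Availability]
print("Classes:", clf.classes_)
print("Predicted probabilities:", proba)
```

Probabilities make it visible how confident the model is near the threshold, which a bare 0/1 prediction hides.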