Build a regression model.

In [81]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [82]:
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance in kilometers between two points 
    on the earth (specified in decimal degrees).
    """
    # Convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    
    # Haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    r = 6371  # Radius of Earth in kilometers
    return c * r

# Load data
stations_df = pd.read_csv("/Users/jorgen/Documents/LHL/project/Statistical-Modeling-with-Python/data/stations_data.csv")
restaurants_df = pd.read_csv("/Users/jorgen/Documents/LHL/project/Statistical-Modeling-with-Python/data/merged_restaurants.csv")


In [83]:
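def closest_restaurant(station_lon, station_lat, restaurants):
    # Distance in kilometers from this station to every restaurant
    distances = restaurants.apply(lambda row: haversine(station_lon, station_lat, row['longitude'], row['latitude']), axis=1)
    # idxmin() returns an index label, so look the row up with .loc rather than .iloc
    closest_index = distances.idxmin()
    return restaurants.loc[closest_index, 'Name'], distances.min()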

# Apply the function to each station
stations_df[['closest_restaurant', 'closest_restaurant_distance']] = stations_df.apply(
    lambda row: closest_restaurant(row['longitude'], row['latitude'], restaurants_df), axis=1, result_type="expand")
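
The nested `apply` calls above evaluate the haversine formula one station-restaurant pair at a time. As a rough sketch of an alternative, the same nearest-restaurant search can be done with NumPy broadcasting. The function name `nearest_restaurants` and the coordinates below are hypothetical and only for illustration; they are not taken from the project data.

```python
import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers; inputs in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

def nearest_restaurants(station_lon, station_lat, rest_lon, rest_lat):
    """Return (position of nearest restaurant, distance in km) for each station."""
    # Stations become a column, restaurants a row, so broadcasting
    # produces a (n_stations, n_restaurants) distance matrix.
    s_lon = np.asarray(station_lon, dtype=float)[:, None]
    s_lat = np.asarray(station_lat, dtype=float)[:, None]
    r_lon = np.asarray(rest_lon, dtype=float)[None, :]
    r_lat = np.asarray(rest_lat, dtype=float)[None, :]
    dist = haversine(s_lon, s_lat, r_lon, r_lat)
    nearest = dist.argmin(axis=1)
    return nearest, dist[np.arange(dist.shape[0]), nearest]

# Hypothetical coordinates (two stations, three restaurants)
idx, km = nearest_restaurants(
    station_lon=[2.17, 2.19], station_lat=[41.38, 41.40],
    rest_lon=[2.171, 2.30, 2.189], rest_lat=[41.381, 41.50, 41.401],
)
print(idx, km)
```

The returned values are positions, not index labels, so restaurant names would be looked up with `.iloc` on the restaurants DataFrame.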


In [84]:
# Prepare the independent variables
X = stations_df[['latitude', 'longitude', 'closest_restaurant_distance']]
y = stations_df['free_bikes']

# Add a constant to the model for the intercept
import statsmodels.api as sm
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.595
Model:                            OLS   Adj. R-squared:                 -0.622
Method:                 Least Squares   F-statistic:                    0.4889
Date:                Mon, 29 Apr 2024   Prob (F-statistic):              0.752
Time:                        14:17:08   Log-Likelihood:                -10.877
No. Observations:                   5   AIC:                             29.75
Df Residuals:                       1   BIC:                             28.19
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

  warn("omni_normtest is not valid with less than 8 observations; %i "


Provide model output and an interpretation of the results. 

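The model regresses free_bikes on station latitude, longitude, and the distance to the closest restaurant. With only 5 observations and 1 residual degree of freedom, the fit is not statistically meaningful: R-squared is 0.595, but adjusted R-squared is -0.622 and the F-test p-value is 0.752, so there is no evidence that the predictors jointly explain bike availability. The data gathered are best read as a starting point for studying the relationship between bike station locations and nearby amenities such as restaurants, a question that matters for urban planning and for optimizing bike-sharing systems.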
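
The adjusted R-squared in the summary above can be reproduced by hand from the reported R-squared (0.595), the number of observations (5), and the number of predictors excluding the constant (3). The helper name `adjusted_r2` is an assumption made for this sketch.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    df_resid = n_obs - n_predictors - 1
    if df_resid <= 0:
        raise ValueError("adjusted R-squared is undefined without residual degrees of freedom")
    return 1 - (1 - r2) * (n_obs - 1) / df_resid

# Values read from the first OLS summary
adj = adjusted_r2(0.595, n_obs=5, n_predictors=3)
print(round(adj, 3))
```

The result is about -0.62, which agrees with the reported -0.622 up to the rounding of R-squared in the summary. A negative value means the model fits worse than an intercept-only model once the number of predictors is taken into account.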

Geographic Data Utilization:

Latitude and longitude enter the model as predictors so that geographic effects on bike availability can be estimated. Understanding the spatial distribution of stations matters for service coverage and user accessibility, although five stations are too few to estimate these effects reliably.

Introduction of Proximity Metrics:

The distance to the nearest restaurant is a practical proximity metric whose effect on bike usage patterns can be explored further with more data. Metrics like this are useful when deciding where to place stations relative to high-demand areas.

Model's Diagnostic Indicators:

The Durbin-Watson statistic suggests minimal autocorrelation in the residuals. With five cross-sectional observations whose row order is arbitrary, however, this is weak evidence for the independence of the observations.
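
To show why the Durbin-Watson statistic deserves caution here, the sketch below computes it from its definition (the sum of squared successive differences of the residuals divided by their sum of squares). The residual values are hypothetical, not the residuals of the model above; the point is that the same residuals in a different row order give a different statistic.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals: identical values, two different row orders
alternating = [1.0, -1.0, 1.0, -1.0, 0.0]
sorted_order = [1.0, 1.0, 0.0, -1.0, -1.0]

dw_alternating = durbin_watson(alternating)
dw_sorted = durbin_watson(sorted_order)
print(dw_alternating, dw_sorted)
```

Values near 2 indicate no first-order autocorrelation. Because stations have no natural ordering, the statistic reflects the order of the rows in the file as much as any property of the data.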

In [86]:
# One-hot encode the categorical data 'closest_restaurant'
restaurant_dummies = pd.get_dummies(stations_df['closest_restaurant'], prefix='Restaurant', dtype=float)

# Check the resulting DataFrame
print(restaurant_dummies.head())


   Restaurant_Kiltro  Restaurant_Madame George  Restaurant_Miseria e Nobilta
0                0.0                       1.0                           0.0
1                0.0                       0.0                           1.0
2                0.0                       0.0                           1.0
3                0.0                       1.0                           0.0
4                1.0                       0.0                           0.0


In [87]:
# Prepare X by including necessary columns and the encoded categorical data
X = pd.concat([stations_df[['latitude', 'longitude', 'closest_restaurant_distance']], restaurant_dummies], axis=1)

# Adding a constant to the DataFrame for the intercept
import statsmodels.api as sm
X = sm.add_constant(X, has_constant='add')

# Prepare y, the dependent variable
y = stations_df['free_bikes']

# Check for any remaining non-numeric types or NaN issues
print(X.dtypes)
print(X.isnull().sum())

const                           float64
latitude                        float64
longitude                       float64
closest_restaurant_distance     float64
Restaurant_Kiltro               float64
Restaurant_Madame George        float64
Restaurant_Miseria e Nobilta    float64
dtype: object
const                           0
latitude                        0
longitude                       0
closest_restaurant_distance     0
Restaurant_Kiltro               0
Restaurant_Madame George        0
Restaurant_Miseria e Nobilta    0
dtype: int64
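
One thing to check before fitting is whether the design matrix has full column rank. The sketch below uses the five dummy rows printed earlier and shows that the three restaurant indicators sum to the constant column, so one column is redundant. The variable names are assumptions for illustration.

```python
import numpy as np

# Dummy columns as printed above: Kiltro, Madame George, Miseria e Nobilta
dummies = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
], dtype=float)
const = np.ones((dummies.shape[0], 1))

# Constant plus all three indicators: 4 columns, one of them redundant
X_full = np.hstack([const, dummies])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping one indicator (the first is used as the baseline) restores full rank
X_dropped = np.hstack([const, dummies[:, 1:]])
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(X_full.shape[1], rank_full)
print(X_dropped.shape[1], rank_dropped)
```

In pandas, passing `drop_first=True` to `pd.get_dummies` drops the first category in the same way, and the remaining coefficients are then read relative to that baseline.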


In [88]:
# Fit the OLS regression model using statsmodels
model = sm.OLS(y, X).fit()

# Print the summary of the model to review the results
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Mon, 29 Apr 2024   Prob (F-statistic):                nan
Time:                        14:18:59   Log-Likelihood:                 131.44
No. Observations:                   5   AIC:                            -252.9
Df Residuals:                       0   BIC:                            -254.8
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const           

  warn("omni_normtest is not valid with less than 8 observations; %i "
  return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
  return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
  return np.dot(wresid, wresid) / self.df_resid


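The second model adds one-hot encoded indicators for the closest restaurant, but its results cannot be interpreted. With 5 observations and 0 residual degrees of freedom, the model is saturated: it reproduces free_bikes exactly (R-squared of 1.000), and the adjusted R-squared and F-statistic are undefined (nan). The perfect fit therefore says nothing about the relationship between bike stations and nearby restaurants in Barcelona. The restaurant indicators also sum to the constant column, so one of them should be dropped when the model is refit. What the analysis does provide is a workflow: by combining geographic coordinates with the distance to the closest restaurant, it could show which bike stations are well placed for travelers looking to combine cycling with dining, such as tourists planning trips around the city, once many more stations are included.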
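
The effect of having as many estimated parameters as observations can be illustrated with random data. In this sketch, which uses made-up numbers rather than the station data, a constant and four unrelated random predictors are fit to five random responses, and least squares still reproduces the responses exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Constant plus four random predictors: 5 columns for 5 observations
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, n - 1))])
# Response is pure noise, unrelated to X
y = rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

centered = y - y.mean()
r_squared = 1 - (residuals @ residuals) / (centered @ centered)
df_resid = n - np.linalg.matrix_rank(X)

print(df_resid, r_squared)
```

An R-squared of 1 with zero residual degrees of freedom is a property of the matrix dimensions, not of the data, which is why the summary above reports nan for the adjusted R-squared and the F-statistic.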