Build a regression model.

In [2]:
import pandas as pd

# Load the data
df = pd.read_excel("data_frame_yelp.xlsx")
# Display the first few rows of the dataframe
df.head()


Unnamed: 0,Name,Latitude,Longitude,Category,API Latitude,API Longitude
0,La Taqueria Pinche Taco Shop,49.263559,-123.112736,Mexican,49.262487,-123.114397
1,Saku,49.263101,-123.116675,Japanese,49.262487,-123.114397
2,Uma Sushi,49.263805,-123.113729,Japanese,49.262487,-123.114397
3,iDen & Quan Ju De Beijing Duck House,49.26021,-123.114845,Chinese,49.262487,-123.114397
4,Hokkaido Ramen Santouka,49.263127,-123.116892,Noodles,49.262487,-123.114397


In [3]:
# Create a new column "API Coordinates" pairing API Latitude with API Longitude as a tuple
df["API Coordinates"] = list(zip(df["API Latitude"], df["API Longitude"]))

# Count the number of POIs per bike station
df_poi_count = df["API Coordinates"].value_counts().reset_index()
df_poi_count.columns = ["API Coordinates", "POI Count"]

# Merge the POI count back to the original dataframe
df = pd.merge(df, df_poi_count, on="API Coordinates")

# Display the first few rows of the updated dataframe
df.head()


Unnamed: 0,Name,Latitude,Longitude,Category,API Latitude,API Longitude,API Coordinates,POI Count
0,La Taqueria Pinche Taco Shop,49.263559,-123.112736,Mexican,49.262487,-123.114397,"(49.262487, -123.114397)",50
1,Saku,49.263101,-123.116675,Japanese,49.262487,-123.114397,"(49.262487, -123.114397)",50
2,Uma Sushi,49.263805,-123.113729,Japanese,49.262487,-123.114397,"(49.262487, -123.114397)",50
3,iDen & Quan Ju De Beijing Duck House,49.26021,-123.114845,Chinese,49.262487,-123.114397,"(49.262487, -123.114397)",50
4,Hokkaido Ramen Santouka,49.263127,-123.116892,Noodles,49.262487,-123.114397,"(49.262487, -123.114397)",50
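
The per-station count above is built with `value_counts` followed by a merge. As a minimal sketch on made-up rows (the small `demo` frame and its values are illustrative assumptions, not data from this notebook), the same column can also be produced in one step with `groupby(...).transform("size")`, which needs neither the tuple column nor the merge:

```python
import pandas as pd

# Hypothetical miniature of the Yelp frame: three POIs at one station, one at another
demo = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "API Latitude": [49.26, 49.26, 49.26, 49.27],
    "API Longitude": [-123.11, -123.11, -123.11, -123.12],
})

# transform("size") broadcasts each group's row count back onto its own rows
demo["POI Count"] = (
    demo.groupby(["API Latitude", "API Longitude"])["Name"].transform("size")
)

print(demo["POI Count"].tolist())  # prints [3, 3, 3, 1]
```

Grouping on the two coordinate columns directly is a design choice for the sketch; grouping on the existing `API Coordinates` column would give the same counts.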


In [4]:
# Count the number of unique categories of POIs per bike station
df_unique_categories = df.groupby("API Coordinates")["Category"].nunique().reset_index()
df_unique_categories.columns = ["API Coordinates", "Unique Category Count"]

# Merge the unique category count back to the original dataframe
df = pd.merge(df, df_unique_categories, on="API Coordinates")

# Perform one-hot encoding on the Category column
df_encoded = pd.get_dummies(df, columns=["Category"])

# Display the first few rows of the dataframe
df_encoded.head()


Unnamed: 0,Name,Latitude,Longitude,API Latitude,API Longitude,API Coordinates,POI Count,Unique Category Count,Category_Afghan,Category_American (Traditional),...,Category_Sports Bars,Category_Steakhouses,Category_Sushi Bars,Category_Taiwanese,Category_Tapas/Small Plates,Category_Thai,Category_Vegan,Category_Vietnamese,Category_Waffles,Category_Wine Bars
0,La Taqueria Pinche Taco Shop,49.263559,-123.112736,49.262487,-123.114397,"(49.262487, -123.114397)",50,30,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Saku,49.263101,-123.116675,49.262487,-123.114397,"(49.262487, -123.114397)",50,30,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Uma Sushi,49.263805,-123.113729,49.262487,-123.114397,"(49.262487, -123.114397)",50,30,0,0,...,0,0,0,0,0,0,0,0,0,0
3,iDen & Quan Ju De Beijing Duck House,49.26021,-123.114845,49.262487,-123.114397,"(49.262487, -123.114397)",50,30,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Hokkaido Ramen Santouka,49.263127,-123.116892,49.262487,-123.114397,"(49.262487, -123.114397)",50,30,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
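# Aggregate the data at the bike station level.
# numeric_only=True skips the text column "Name" explicitly instead of relying on pandas to drop it.
df_station = df_encoded.groupby("API Coordinates").sum(numeric_only=True).reset_index()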

# Display the first few rows of the dataframe
df_station.head()




Unnamed: 0,API Coordinates,Latitude,Longitude,API Latitude,API Longitude,POI Count,Unique Category Count,Category_Afghan,Category_American (Traditional),Category_Bakeries,...,Category_Sports Bars,Category_Steakhouses,Category_Sushi Bars,Category_Taiwanese,Category_Tapas/Small Plates,Category_Thai,Category_Vegan,Category_Vietnamese,Category_Waffles,Category_Wine Bars
0,"(49.260599, -123.113504)",2463.070794,-6155.587502,2463.02995,-6155.6752,2500,1600,1,1,1,...,0,0,0,0,1,0,1,2,0,0
1,"(49.262487, -123.114397)",2463.113103,-6155.674195,2463.12435,-6155.71985,2500,1500,0,1,1,...,0,0,0,1,0,0,1,2,0,0
2,"(49.264215, -123.117772)",2463.152052,-6155.825177,2463.21075,-6155.8886,2500,1300,0,1,1,...,0,0,0,1,0,0,1,0,0,0
3,"(49.274566, -123.121817)",2463.824799,-6156.072415,2463.7283,-6156.09085,2500,1450,0,1,1,...,1,2,2,0,0,1,1,1,1,1
4,"(49.279764, -123.110154)",2464.043738,-6155.538586,2463.9882,-6155.5077,2500,1350,0,0,1,...,0,1,1,0,0,0,0,0,0,0
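
In the aggregated table above, `POI Count` reads 2500 at every station because `.sum()` adds the per-row value 50 across 50 rows, and `Unique Category Count` is inflated the same way. The following is a sketch, on a made-up frame, of an aggregation that keeps per-station constants with `"first"` and sums only the one-hot columns. The `demo` data is hypothetical, and grouping on the two coordinate columns rather than on `API Coordinates` is an assumption made to keep the example small:

```python
import pandas as pd

# Hypothetical miniature of df_encoded: per-station constants are repeated on every row
demo = pd.DataFrame({
    "API Latitude": [49.26, 49.26, 49.26, 49.27, 49.27],
    "API Longitude": [-123.11, -123.11, -123.11, -123.12, -123.12],
    "POI Count": [3, 3, 3, 2, 2],
    "Unique Category Count": [2, 2, 2, 1, 1],
    "Category_Japanese": [1, 1, 0, 0, 0],
    "Category_Mexican": [0, 0, 1, 1, 1],
})

dummy_cols = [c for c in demo.columns if c.startswith("Category_")]

# "first" keeps a per-station constant unchanged; "sum" counts POIs per category
agg_spec = {"POI Count": "first", "Unique Category Count": "first"}
agg_spec.update({c: "sum" for c in dummy_cols})

station = (
    demo.groupby(["API Latitude", "API Longitude"]).agg(agg_spec).reset_index()
)
print(station)
```

With this shape, each station row carries its actual POI count and the number of POIs in each category, rather than totals multiplied by the group size.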


In [6]:
import statsmodels.api as sm

# Define the dependent variable
y = df_station["POI Count"]

# Define the independent variables
X = df_station.drop(["API Coordinates", "Latitude", "Longitude", "API Latitude", "API Longitude", "POI Count"], axis=1)

# Add a constant to the independent variables matrix
X = sm.add_constant(X)

# Fit the ordinary least squares (OLS) model
model = sm.OLS(y, X)
results = model.fit()

# Print the summary of the model
results.summary()


  warn("omni_normtest is not valid with less than 8 observations; %i "
  return 1 - self.ssr/self.centered_tss
  return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
  return np.dot(wresid, wresid) / self.df_resid


0,1,2,3
Dep. Variable:,POI Count,R-squared:,-inf
Model:,OLS,Adj. R-squared:,-inf
Method:,Least Squares,F-statistic:,
Date:,"Tue, 01 Aug 2023",Prob (F-statistic):,
Time:,23:59:45,Log-Likelihood:,128.67
No. Observations:,5,AIC:,-247.3
Df Residuals:,0,BIC:,-249.3
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Unique Category Count,1.4899,inf,0,,,
Category_Afghan,-7.3513,inf,-0,,,
Category_American (Traditional),0.3914,inf,0,,,
Category_Bakeries,1.3483,inf,0,,,
Category_Barbeque,-0.6258,inf,-0,,,
Category_Beer Bar,0.9568,inf,0,,,
Category_Belgian,1.9741,inf,0,,,
Category_Breakfast & Brunch,11.3791,inf,0,,,
Category_Bubble Tea,-0.6258,inf,-0,,,

0,1,2,3
Omnibus:,,Durbin-Watson:,1.429
Prob(Omnibus):,,Jarque-Bera (JB):,0.662
Skew:,-0.408,Prob(JB):,0.718
Kurtosis:,1.415,Cond. No.,1080.0


Provide model output and an interpretation of the results. 

In [None]:
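#1. The number of observations (5) is far smaller than the number of independent variables (58), so the model has no residual degrees of freedom (Df Residuals: 0) and cannot estimate the parameters reliably: every standard error is inf. In addition, POI Count is 2500 at all five stations in the aggregated table, so the dependent variable has no variance to explain, which is consistent with the reported R-squared of -inf.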

#2. The condition number is large (1.08e+03), suggesting that there may be strong multicollinearity or other numerical problems. Multicollinearity refers to a situation where two or more independent variables in a regression model are highly correlated.

# Suggested fixes:
#   - Obtain more data: more observations would allow the model to better estimate the parameters.
#   - Reduce the number of variables: we could use dimensionality reduction techniques, or choose a subset of variables based on domain knowledge or feature importance methods.
#   - Check for multicollinearity: if some variables are highly correlated, we could keep only one of them to reduce multicollinearity.
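
Point 1 above can be checked numerically before fitting. The following is a minimal sketch, assuming a synthetic design matrix (random values, not this notebook's data) with more columns than rows. It shows that the matrix rank is capped by the number of observations, that no residual degrees of freedom remain, and that least squares then reproduces `y` exactly, leaving nothing from which to estimate standard errors:

```python
import numpy as np

# Hypothetical design matrix: 5 observations, a constant plus 8 predictors
rng = np.random.default_rng(0)
n_obs, n_features = 5, 8
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, n_features))])
y = rng.normal(size=n_obs)

rank = np.linalg.matrix_rank(X)
df_resid = n_obs - rank  # residual degrees of freedom left for the error variance

# Minimum-norm least-squares solution; with rank == n_obs it fits y exactly
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

print("rank:", rank, "df_resid:", df_resid)
print("max |residual|:", np.abs(residuals).max())
print("condition number:", np.linalg.cond(X))
```

A rank of 5 with a constant column corresponds to the `Df Model: 4` and `Df Residuals: 0` entries in the summary above.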



In [None]:
## Report

#We set out to model the relationship between the number of bikes at bike stations and the characteristics of nearby Points of Interest (POIs),
#using multiple regression (OLS) with Python's statsmodels. The model fitted above uses POI Count as the dependent variable, and it did not produce usable estimates:
#the number of observations (5) was much lower than the number of independent variables (58),
#leaving zero residual degrees of freedom, so the parameters could not be estimated reliably. The condition number (1.08e+03) also points to possible multicollinearity.
#To improve this model, we recommend collecting more data, reducing the number of variables, checking for multicollinearity, and considering a non-linear model.
#Additionally, including factors such as time of day and weather could improve the model's accuracy and interpretability.
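
The recommendations to reduce the number of variables and to check for multicollinearity can be sketched as a simple filter: drop columns that never vary, then drop one column from each highly correlated pair. The feature values below are invented for illustration, and the 0.95 correlation cutoff is an arbitrary assumption, not a value taken from this analysis:

```python
import numpy as np
import pandas as pd

# Hypothetical station-level features; names and values are illustrative only
features = pd.DataFrame({
    "Category_Japanese": [4, 2, 5, 1, 3, 6],
    "Category_Sushi Bars": [8, 4, 10, 2, 6, 12],  # exactly 2x Japanese
    "Category_Mexican": [1, 3, 0, 2, 5, 1],
    "Category_Afghan": [0, 0, 0, 0, 0, 0],        # constant column
})

# 1. Drop columns that never vary (they cannot explain anything)
varying = features.loc[:, features.nunique() > 1]

# 2. Look only at the upper triangle so each pair is considered once,
#    then drop the later column of every pair above the cutoff
corr = varying.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = varying.drop(columns=to_drop)

print(list(reduced.columns))
```

Pairwise correlation only catches redundancy between two columns at a time; it is a first pass, not a full multicollinearity diagnosis.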

# Stretch

How can you turn the regression model into a classification model?
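
One common approach is to bin the numeric target into classes and fit a classifier on the same features. The sketch below assumes synthetic station data: the column names `poi_count`, `unique_categories`, and `free_bikes`, the median split, and the choice of scikit-learn's `LogisticRegression` are all illustrative assumptions, not part of the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical station-level data: 40 stations, two POI features, a numeric target
rng = np.random.default_rng(42)
stations = pd.DataFrame({
    "poi_count": rng.integers(5, 50, size=40),
    "unique_categories": rng.integers(1, 20, size=40),
})
stations["free_bikes"] = (
    0.3 * stations["poi_count"] + rng.normal(0, 2, size=40)
).round()

# Turn the numeric target into a binary label: 1 if above the median, else 0
threshold = stations["free_bikes"].median()
stations["high_availability"] = (stations["free_bikes"] > threshold).astype(int)

X = stations[["poi_count", "unique_categories"]]
y = stations["high_availability"]

# Fit a logistic regression classifier on the binned target
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```

The threshold determines what the classes mean, so it should come from the question being asked (for example, "is this station usually well stocked?") rather than from convenience. With only five stations, as in this notebook, a classifier would face the same shortage of data as the regression.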