Build a regression model.

In [31]:
import pandas as pd
import statsmodels.api as sm

# Load your data from the CSV file using the provided file path
file_path = r'C:\Users\affuy\Documents\Data_Sets\combined_df_new.csv'
data = pd.read_csv(file_path)

data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412 entries, 0 to 411
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Categories   18 non-null     object 
 1   Distance     18 non-null     float64
 2   Address      18 non-null     object 
 3   City         8 non-null      object 
 4   Name         8 non-null      object 
 5   Rating       8 non-null      float64
 6   Total Bikes  394 non-null    float64
dtypes: float64(3), object(4)
memory usage: 22.7+ KB


In [35]:
# Remove rows with missing values in 'Categories' and 'Address'
data.dropna(subset=['Categories', 'Address'], inplace=True)
data.info()



<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 0 to 17
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Categories   18 non-null     object 
 1   Distance     18 non-null     float64
 2   Address      18 non-null     object 
 3   City         8 non-null      object 
 4   Name         8 non-null      object 
 5   Rating       8 non-null      float64
 6   Total Bikes  18 non-null     float64
dtypes: float64(3), object(4)
memory usage: 1.1+ KB


In [36]:
# Step 2: Handle Duplicate Rows (if needed)
data.drop_duplicates(inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 0 to 17
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Categories   18 non-null     object 
 1   Distance     18 non-null     float64
 2   Address      18 non-null     object 
 3   City         8 non-null      object 
 4   Name         8 non-null      object 
 5   Rating       8 non-null      float64
 6   Total Bikes  18 non-null     float64
dtypes: float64(3), object(4)
memory usage: 1.1+ KB


In [39]:
# Step 3: Round 'Distance' to 2 decimal places (the dtype stays float64)
data['Distance'] = data['Distance'].round(2)
data.tail()

Unnamed: 0,Categories,Distance,Address,City,Name,Rating,Total Bikes
13,"Burger Joint, Fast Food Restaurant",929.0,"Avenida Logroño, 303",,,,24.0
14,Bistro,832.0,"Calle Bahía de Almeria, 21",,,,24.0
15,"Bar, Restaurant",686.0,"Paseo de la Alameda, 83",,,,24.0
16,"Burger Joint, Fast Food Restaurant",917.0,"Terminal, 1",,,,24.0
17,BBQ Joint,991.0,"Terminal, 1",,,,24.0


In [41]:
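# Fill missing values in 'City' with 'Madrid' (assumes every POI in this dataset is in Madrid)
# Assign the result back instead of calling fillna(inplace=True) on a column selection
data['City'] = data['City'].fillna('Madrid')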
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 0 to 17
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Categories   18 non-null     object 
 1   Distance     18 non-null     float64
 2   Address      18 non-null     object 
 3   City         18 non-null     object 
 4   Name         8 non-null      object 
 5   Rating       8 non-null      float64
 6   Total Bikes  18 non-null     float64
dtypes: float64(3), object(4)
memory usage: 1.1+ KB


In [43]:
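# Fill missing values in 'Name' with the placeholder 'Other'
# Assign the result back instead of calling fillna(inplace=True) on a column selection
data['Name'] = data['Name'].fillna('Other')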
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 0 to 17
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Categories   18 non-null     object 
 1   Distance     18 non-null     float64
 2   Address      18 non-null     object 
 3   City         18 non-null     object 
 4   Name         18 non-null     object 
 5   Rating       8 non-null      float64
 6   Total Bikes  18 non-null     float64
dtypes: float64(3), object(4)
memory usage: 1.1+ KB


In [45]:
# Fill missing values in 'Rating' with the mean of the observed ratings
mean_rating = data['Rating'].mean()
data['Rating'] = data['Rating'].fillna(mean_rating)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 0 to 17
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Categories   18 non-null     object 
 1   Distance     18 non-null     float64
 2   Address      18 non-null     object 
 3   City         18 non-null     object 
 4   Name         18 non-null     object 
 5   Rating       18 non-null     float64
 6   Total Bikes  18 non-null     float64
dtypes: float64(3), object(4)
memory usage: 1.1+ KB
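
Mean imputation fills 10 of the 18 ratings with one value, which shrinks the spread of 'Rating'. The following is a minimal sketch, not the notebook's own code: it assumes the eight observed ratings are the ones listed in the notebook's table output, and the flag column `Rating_imputed` is a hypothetical name introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Eight observed ratings (taken from the table output) plus ten missing values.
# The row order is an assumption made for this sketch.
observed = [3.5, 4.0, 5.0, 3.5, 4.0, 3.0, 1.0, 2.0]
demo = pd.DataFrame({"Rating": observed + [np.nan] * 10})

# Hypothetical flag column: remember which rows were imputed.
demo["Rating_imputed"] = demo["Rating"].isna()

mean_rating = demo["Rating"].mean()   # NaN values are skipped by default
std_before = demo["Rating"].std()     # spread of the observed ratings only

# Assign the filled column back to the DataFrame.
demo["Rating"] = demo["Rating"].fillna(mean_rating)
std_after = demo["Rating"].std()      # spread after imputation

print("mean used for imputation:", mean_rating)
print("rows imputed:", int(demo["Rating_imputed"].sum()))
print("std before vs. after:", std_before, std_after)
```

Keeping the flag makes it possible to check later whether a result is driven by the imputed rows.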


In [46]:
data.head()

Unnamed: 0,Categories,Distance,Address,City,Name,Rating,Total Bikes
0,Cafeteria,616.45,"Plaza Ramón y Cajal, s/n, 28040 Madrid, Spain",Madrid,Cafetería Facultad Farmacia,3.5,24.0
1,Cafeteria,388.34,"Avenida Complutense, S/N, 28040 Madrid, Spain",Madrid,Cafetería de la Facultad de Ciencias de la Inf...,4.0,24.0
2,Coffee & Tea,799.14,"Avenida Complutense, 28040 Madrid, Spain",Madrid,Cafetería Facultad de Odontologia,5.0,24.0
3,"Cafeteria, Breakfast & Brunch",692.01,"Ciudad Universitaria, Carretera M-30, 28040 Ma...",Madrid,Cafetería de la Facultad de Filosofía UCM,3.5,24.0
4,"Coffee & Tea, Breakfast & Brunch",675.97,"Avenida Complutense, s/n, 28040 Madrid, Spain",Madrid,Metrocoffee,4.0,24.0


In [47]:
# Reset the index after data transformations
data.reset_index(drop=True, inplace=True)

# Check the DataFrame after transformations
# data.info() prints its report directly and returns None, so no print() is needed
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Categories   18 non-null     object 
 1   Distance     18 non-null     float64
 2   Address      18 non-null     object 
 3   City         18 non-null     object 
 4   Name         18 non-null     object 
 5   Rating       18 non-null     float64
 6   Total Bikes  18 non-null     float64
dtypes: float64(3), object(4)
memory usage: 1.1+ KB


In [49]:
# View the cleaned Data
data

Unnamed: 0,Categories,Distance,Address,City,Name,Rating,Total Bikes
0,Cafeteria,616.45,"Plaza Ramón y Cajal, s/n, 28040 Madrid, Spain",Madrid,Cafetería Facultad Farmacia,3.5,24.0
1,Cafeteria,388.34,"Avenida Complutense, S/N, 28040 Madrid, Spain",Madrid,Cafetería de la Facultad de Ciencias de la Inf...,4.0,24.0
2,Coffee & Tea,799.14,"Avenida Complutense, 28040 Madrid, Spain",Madrid,Cafetería Facultad de Odontologia,5.0,24.0
3,"Cafeteria, Breakfast & Brunch",692.01,"Ciudad Universitaria, Carretera M-30, 28040 Ma...",Madrid,Cafetería de la Facultad de Filosofía UCM,3.5,24.0
4,"Coffee & Tea, Breakfast & Brunch",675.97,"Avenida Complutense, s/n, 28040 Madrid, Spain",Madrid,Metrocoffee,4.0,24.0
5,Coffee & Tea,1054.36,"Ciudad Universitaria, Madrid, Spain",Madrid,Cafeteria Etsit,3.0,24.0
6,Latin American,1051.33,"Calle de la Sierra de Molina, 31, 28053 Madrid...",Madrid,Donde Siempre,1.0,24.0
7,Tapas Bars,1051.33,"Calle de la Sierra de Molina, 7, 28053 Madrid,...",Madrid,Ufarte,2.0,24.0
8,"Wine Bar, Tapas Restaurant",916.0,"Calle Bahía de Palma, 9",Madrid,Other,3.25,24.0
9,"Bakery, Coffee Shop, Restaurant",482.0,"Avenida de Cantabria, 39",Madrid,Other,3.25,24.0


In [50]:
import statsmodels.api as sm

# Define your dependent variable (number of bikes) and independent variables (characteristics of POIs)
y = data['Total Bikes']
X = data[['Rating', 'Distance']]  # Add all relevant POI characteristics

# Add a constant term for the intercept in the regression model
X = sm.add_constant(X)

# Build the regression model
model = sm.OLS(y, X).fit()

# Interpret and print the results
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:            Total Bikes   R-squared:                        -inf
Model:                            OLS   Adj. R-squared:                   -inf
Method:                 Least Squares   F-statistic:                    -7.500
Date:                Thu, 07 Sep 2023   Prob (F-statistic):               1.00
Time:                        09:01:00   Log-Likelihood:                 422.66
No. Observations:                  18   AIC:                            -839.3
Df Residuals:                      15   BIC:                            -836.6
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         24.0000   2.86e-11   8.38e+11      0.0

  return 1 - self.ssr/self.centered_tss


Provide model output and an interpretation of the results. 

In [53]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:            Total Bikes   R-squared:                        -inf
Model:                            OLS   Adj. R-squared:                   -inf
Method:                 Least Squares   F-statistic:                    -7.500
Date:                Thu, 07 Sep 2023   Prob (F-statistic):               1.00
Time:                        09:02:48   Log-Likelihood:                 422.66
No. Observations:                  18   AIC:                            -839.3
Df Residuals:                      15   BIC:                            -836.6
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         24.0000   2.86e-11   8.38e+11      0.0



**Presentation Summary of Regression Analysis**

I would like to present the results of my regression analysis, in which I aimed to predict "Total Bikes" from two characteristics of nearby points of interest, 'Rating' and 'Distance'. Let's go through the key findings:

- **Dependent Variable (Total Bikes)**: My main focus is on understanding the factors that influence the number of available bikes.

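- **Model Performance (R-squared)**: My model could not explain any variation in the number of total bikes, because the cleaned data appears to contain none: every row displayed above has Total Bikes = 24.0, and the fitted intercept is exactly 24.0000. R-squared is computed as 1 - SSR/TSS, and when the dependent variable is constant the centered TSS is zero, so the division breaks down (this is the line `return 1 - self.ssr/self.centered_tss` echoed in the warning). That is why both R-squared and Adjusted R-squared are reported as negative infinity. These values signal a degenerate regression, not a measurably poor fit.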

- **Model Significance (F-statistic)**: The F-statistic, which measures the overall significance of my model, was reported as -7.500 with a probability of 1.00. A valid F-statistic cannot be negative; the value is a numerical artifact that arises when the dependent variable has essentially no variance, as Total Bikes does here with 24.0 in every row shown. It should not be read as a meaningful test result.

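- **Variable Coefficients**: I examined the coefficients of the independent variables, 'Rating' and 'Distance'. Both have negligible estimated effects on the number of total bikes, as indicated by their small coefficients and high p-values. With Total Bikes at 24.0 in every row shown, the intercept alone (24.0000) reproduces the data, so slopes near zero are the expected outcome and say nothing about whether these variables matter for bike availability in general.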

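- **Residual Analysis (Omnibus, Skew, Kurtosis)**: My analysis of the model's residuals suggests that they may not follow a normal distribution, an assumption that matters for the validity of the t- and F-tests. Because the residuals here are at the level of floating-point noise (the intercept's standard error is 2.86e-11), these diagnostics carry little information for this model.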

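- **Model Stability (Condition Number)**: The condition number, a measure of numerical stability, was reported as quite large. This can indicate multicollinearity or numerical instability, but it can also simply reflect the difference in scale between the predictors: 'Distance' runs from a few hundred to over a thousand, while 'Rating' stays between 1 and 5.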
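
The negative-infinity R-squared can be traced to a zero denominator. The sketch below is an illustration under a stated assumption, namely that all 18 remaining rows carry Total Bikes = 24.0, as every row displayed in the notebook does. It computes the centered total sum of squares directly and shows a simple check that could be run before fitting.

```python
import pandas as pd

# Assumption: all 18 rows have Total Bikes = 24.0 (true for every row the notebook displays).
y = pd.Series([24.0] * 18, name="Total Bikes")

# Centered total sum of squares: the denominator in R-squared = 1 - SSR / TSS.
centered_tss = float(((y - y.mean()) ** 2).sum())

# A regression target needs at least two distinct values to be informative.
has_variation = y.nunique() > 1

print("centered TSS:", centered_tss)
print("target varies:", has_variation)
if not has_variation:
    print("Total Bikes is constant, so R-squared is undefined for this data.")
```

Running a check like `y.nunique() > 1` on the real `data['Total Bikes']` column before calling `sm.OLS` would flag the problem before the summary table is produced.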

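In summary, my regression analysis did not yield satisfactory results for predicting the number of total bikes from 'Rating' and 'Distance'. The main obstacle appears to be the data, not the choice of model: after cleaning, only 18 rows remain, and Total Bikes is 24.0 in every row shown, so there is no variation for any predictor to explain. Further investigation should start with how the bike and POI data were combined, so that the dependent variable differs across observations, before the model itself is refined.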

Thank you for your attention, and I welcome any questions or suggestions for my next steps in this analysis.

# Stretch

How can you turn the regression model into a classification model?