### Step 6: Multiple Linear Regression on Price

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the cleaned dataset
df_clean = pd.read_csv("../data/df_clean.csv")
# Strip "$" and thousands separators so that 'price' parses as a float
df_clean['price'] = df_clean['price'].replace(r'[\$,]', '', regex=True).astype(float)

In [2]:
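# Work on a copy so that df_clean is left unchanged
df_model = df_clean.copy()

# Ensure 'price' is numeric (already converted in the previous cell; repeated here as a safeguard)
df_model['price'] = df_model['price'].replace(r'[\$,]', '', regex=True).astype(float)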

# Keep only positive prices
df_model = df_model[df_model['price'] > 0]

# Keep selected features
features = [
    'accommodates', 'bedrooms', 'bathrooms', 'room_type',
    'number_of_reviews', 'review_scores_rating', 'neighbourhood_cleansed'
]
df_model = df_model[features + ['price']].dropna()
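
The regression cell below reads from `df_dummies`, but the cell that creates it is not shown. A minimal sketch of one way to dummy-encode the two categorical columns follows. The helper name `encode_categoricals`, the toy frame, and the choice of `drop_first=True` are assumptions for illustration; the original notebook may have used a different baseline category.

```python
import pandas as pd

def encode_categoricals(df, categorical_cols):
    """One-hot encode the given columns, dropping the first level of each as the baseline."""
    # drop_first=True avoids perfect collinearity with the intercept (assumed choice)
    return pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Tiny made-up frame for illustration only (not rows from the Geneva data)
toy = pd.DataFrame({
    "accommodates": [2, 4, 1],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "neighbourhood_cleansed": ["Cologny", "Bellevue", "Cologny"],
    "price": [150.0, 90.0, 40.0],
})

toy_dummies = encode_categoricals(toy, ["room_type", "neighbourhood_cleansed"])
print(toy_dummies.columns.tolist())
```

In the notebook, the equivalent call would presumably be `df_dummies = encode_categoricals(df_model, ['room_type', 'neighbourhood_cleansed'])`.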

In [5]:
import statsmodels.api as sm

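# Define features and target
# df_dummies is the dummy-encoded version of df_model (the encoding cell is not shown here)
X = df_dummies.drop(columns=['price'])
y = df_dummies['price']

# Add constant term (intercept)
X_sm = sm.add_constant(X)

# Cast everything to float: bool/object dummy columns can make OLS fail on an object-dtype array
X_sm = X_sm.astype(float)
y = y.astype(float)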

# Fit the model
model = sm.OLS(y, X_sm).fit()

# Show summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.030
Method:                 Least Squares   F-statistic:                     2.409
Date:                Mon, 11 Aug 2025   Prob (F-statistic):           8.27e-07
Time:                        02:09:59   Log-Likelihood:                -14091.
No. Observations:                2011   AIC:                         2.827e+04
Df Residuals:                    1966   BIC:                         2.852e+04
Df Model:                          44                                         
Covariance Type:            nonrobust                                         
                                                coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
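
The header statistics in the summary above can be cross-checked by hand. The sketch below recomputes adjusted R-squared and the F-statistic from the reported R-squared, number of observations, and degrees of freedom. The variable names are illustrative, and because R-squared is only printed to three decimals, the recomputed F-statistic matches the reported 2.409 only approximately.

```python
# Values copied from the OLS summary above (R-squared is rounded to three decimals there)
n_obs = 2011         # No. Observations
n_regressors = 44    # Df Model
df_resid = 1966      # Df Residuals
r_squared = 0.051

# Residual degrees of freedom: observations minus regressors minus the intercept
assert df_resid == n_obs - n_regressors - 1

# Adjusted R-squared penalises R-squared for the number of regressors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Overall F-statistic: explained variance per regressor over residual variance per df
f_statistic = (r_squared / n_regressors) / ((1 - r_squared) / df_resid)

print(f"Adjusted R-squared: {adj_r_squared:.3f}")  # 0.030
print(f"F-statistic: {f_statistic:.2f}")           # 2.40
```

The adjusted R-squared agrees with the reported 0.030, which confirms that the 44 regressors together explain only a few percent of the variance in price.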

### Observations
We built a multiple linear regression model to predict Airbnb listing prices in Geneva, measured in Swiss francs (CHF). The model included both numerical and categorical predictors: **accommodates, bedrooms, bathrooms, number of reviews, review scores rating, room type,** and **neighbourhood_cleansed**. The categorical features were dummy-encoded so that they could enter the regression as numeric columns.

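The model as a whole is statistically significant (F = 2.409, p = 8.27e-07), but its explanatory power is low: R-squared is 0.051 and adjusted R-squared is 0.030, so the predictors account for only about 5% of the variation in price across the 2,011 listings. The patterns below should therefore be read as associations rather than as reliable price predictions.

As expected, listings with more **bedrooms**, more **bathrooms**, and higher **guest capacity** were associated with higher prices. **Entire homes/apartments** showed a strong positive effect compared to private or shared rooms. Some neighborhoods (such as **Cologny**, **Bellevue**, and **Chêne-Bougeries**) were associated with noticeably higher predicted prices, consistent with Geneva’s local housing market structure and demand for premium areas. **Review scores** showed only a modest influence. Overall, physical attributes and location are associated with Airbnb pricing across the city, although most of the price variation remains unexplained by this specification.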
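
Because the coefficient rows are cut off in the printed summary, one way to back up the observations above is to tabulate the coefficients directly. The sketch below is a hypothetical helper (`significant_terms` is not part of the original notebook) that keeps terms with p-values below a chosen threshold. The toy numbers are made up for illustration and are not results from this model.

```python
import pandas as pd

def significant_terms(params, pvalues, alpha=0.05):
    """Return coefficients with p-value below alpha, sorted from largest to smallest."""
    table = pd.DataFrame({"coef": params, "p_value": pvalues})
    table = table[table["p_value"] < alpha]
    return table.sort_values("coef", ascending=False)

# Made-up values, for illustration only (not output from the Geneva model)
toy_params = pd.Series({"const": 50.0, "bedrooms": 30.0, "number_of_reviews": -0.1})
toy_pvalues = pd.Series({"const": 0.001, "bedrooms": 0.01, "number_of_reviews": 0.40})

result = significant_terms(toy_params, toy_pvalues)
print(result)
```

With the fitted model from the regression cell, the call would be `significant_terms(model.params, model.pvalues)`; both attributes are pandas Series when the model is fit on pandas inputs.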