In [2]:
import pandas as pd

url = 'https://raw.githubusercontent.com/DS3001/labs/refs/heads/main/04_hedonic_pricing/airbnb_hw.csv'
df = pd.read_csv(url)


In [4]:
import numpy as np
url = 'https://raw.githubusercontent.com/DS3001/labs/refs/heads/main/04_hedonic_pricing/airbnb_hw.csv'
data = pd.read_csv(url)

# Display basic info about the dataset
print("Data Info:")
data.info()

# Display the first few rows to understand the structure
print("\nFirst Few Rows:")
print(data.head())

# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())

# Handle missing values (Example: drop rows with missing values)
# Alternatively, you can fill missing values with mean, median, or another appropriate value based on the column
data.dropna(inplace=True)

# Remove duplicate rows if any
data.drop_duplicates(inplace=True)

# Summary statistics to understand the distribution of numerical features
print("\nSummary Statistics:")
print(data.describe())


Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30478 entries, 0 to 30477
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Host Id                     30478 non-null  int64  
 1   Host Since                  30475 non-null  object 
 2   Name                        30478 non-null  object 
 3   Neighbourhood               30478 non-null  object 
 4   Property Type               30475 non-null  object 
 5   Review Scores Rating (bin)  22155 non-null  float64
 6   Room Type                   30478 non-null  object 
 7   Zipcode                     30344 non-null  float64
 8   Beds                        30393 non-null  float64
 9   Number of Records           30478 non-null  int64  
 10  Number Of Reviews           30478 non-null  int64  
 11  Price                       30478 non-null  object 
 12  Review Scores Rating        22155 non-null  float64
dtypes: float64(4), int64

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Select relevant features and target variable
features = ['Property Type', 'Room Type', 'Beds', 'Number of Records', 'Number Of Reviews', 'Review Scores Rating']
X = data[features]
y = data['Price']




In [19]:
# Remove commas from 'Price' column and convert to numeric
data['Price'] = data['Price'].replace(',', '', regex=True).astype(float)

# Re-split data after cleaning
X = data[features]
y = data['Price']

# One-hot encode categorical variables
X = pd.get_dummies(X, columns=['Property Type', 'Room Type'], drop_first=True)

# Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Now re-run the simple linear regression model code
simple_model = LinearRegression()
simple_model.fit(X_train, y_train)

# Predict on training and test sets
y_train_pred_simple = simple_model.predict(X_train)
y_test_pred_simple = simple_model.predict(X_test)

# Calculate RMSE and R^2
# (take the square root of the MSE; the squared=False argument is deprecated in newer scikit-learn releases)
rmse_train_simple = np.sqrt(mean_squared_error(y_train, y_train_pred_simple))
rmse_test_simple = np.sqrt(mean_squared_error(y_test, y_test_pred_simple))
r2_train_simple = r2_score(y_train, y_train_pred_simple)
r2_test_simple = r2_score(y_test, y_test_pred_simple)

print("Simple Model Results:")
print(f"Training RMSE: {rmse_train_simple:.2f}, Training R^2: {r2_train_simple:.2f}")
print(f"Test RMSE: {rmse_test_simple:.2f}, Test R^2: {r2_test_simple:.2f}")


Simple Model Results:
Training RMSE: 130.69, Training R^2: 0.22
Test RMSE: 133.11, Test R^2: 0.25




In [21]:
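# Add a polynomial feature: Beds squared (no interaction terms are added here)
# assign() returns new frames, which avoids pandas' SettingWithCopyWarning on the split data
X_train = X_train.assign(Beds_Squared=X_train['Beds'] ** 2)
X_test = X_test.assign(Beds_Squared=X_test['Beds'] ** 2)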

# Complex Model
complex_model = LinearRegression()
complex_model.fit(X_train, y_train)

# Predict on training and test sets
y_train_pred_complex = complex_model.predict(X_train)
y_test_pred_complex = complex_model.predict(X_test)

# Calculate RMSE and R^2
# (take the square root of the MSE; the squared=False argument is deprecated in newer scikit-learn releases)
rmse_train_complex = np.sqrt(mean_squared_error(y_train, y_train_pred_complex))
rmse_test_complex = np.sqrt(mean_squared_error(y_test, y_test_pred_complex))
r2_train_complex = r2_score(y_train, y_train_pred_complex)
r2_test_complex = r2_score(y_test, y_test_pred_complex)

print("\nComplex Model Results:")
print(f"Training RMSE: {rmse_train_complex:.2f}, Training R^2: {r2_train_complex:.2f}")
print(f"Test RMSE: {rmse_test_complex:.2f}, Test R^2: {r2_test_complex:.2f}")



Complex Model Results:
Training RMSE: 130.68, Training R^2: 0.22
Test RMSE: 133.45, Test R^2: 0.25




With relatively low $R^2$ values (0.22 for training and 0.25 for test), both models are explaining only a small portion of the variance in the target variable (Price). This indicates that neither model fits the data particularly well, which could mean the models are too simple for the underlying relationships, or that there may be other important features not included in the model. Since both models perform similarly on the training and test sets, there's no clear sign of overfitting. Overfitting would generally result in a significantly higher $R^2$ and lower RMSE on the training set compared to the test set.

1. Data Exploration and Cleaning
We started by exploring and cleaning the dataset to make it suitable for regression analysis. This included dropping rows with missing values, removing duplicate rows, one-hot encoding the categorical variables (Property Type and Room Type), and converting Price from text containing commas to a numeric column.
Key takeaway: Cleaning and preprocessing are essential steps to ensure that the data is in a usable format for modeling, especially when working with numerical and categorical variables.
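The info output above shows Review Scores Rating with 22155 non-null values out of 30478, so dropping every incomplete row discards at least 8,323 listings. The sketch below illustrates the alternative mentioned in the code comments, filling gaps with the column median, alongside the comma-stripping step for Price. The rows in toy are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows that mimic the column formats in the Airbnb data (hypothetical values)
toy = pd.DataFrame({
    "Price": ["145", "1,990", "80", "2,500"],
    "Beds": [1.0, None, 2.0, 4.0],
    "Review Scores Rating": [96.0, None, 88.0, None],
})

# Strip thousands separators, then convert the strings to floats
toy["Price"] = toy["Price"].str.replace(",", "", regex=False).astype(float)

# Fill gaps with each column's median instead of dropping incomplete rows
for col in ["Beds", "Review Scores Rating"]:
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

Whether median imputation is appropriate here is a judgment call: it keeps more rows, but it also treats unreviewed listings as if they had a typical rating.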
2. Building a Simple Linear Model
We implemented a simple linear regression model with no transformations or interactions. The model yielded the following results:
Training RMSE: 130.69, Training $R^2$: 0.22
Test RMSE: 133.11, Test $R^2$: 0.25
Key takeaway: The simple model achieved low $R^2$ scores on both the training and test sets, indicating it could not capture much variance in the target variable (price), potentially suggesting underfitting.
3. Building a Complex Model with Transformations and Interactions
We then built a more complex model by adding a transformation, a squared Beds term, to better capture nonlinear relationships between the features and price. No interaction terms were added in the code above.
The results for the complex model were nearly identical to the simple model:
Training RMSE: 130.68, Training $R^2$: 0.22
Test RMSE: 133.45, Test $R^2$: 0.25
Key takeaway: The complex model’s metrics were nearly the same as those of the simple model, suggesting that the additional complexity did not help improve predictive performance. This implies that the added terms did not capture additional patterns, and the model might still be underfitting.
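One way to add genuine interaction terms is to multiply Beds by each Room Type dummy, which lets the price effect of an extra bed differ by room type. The sketch below shows the mechanics on made-up rows. The "Beds x ..." column names are an illustrative choice, not something the notebook defines, and nothing here has been fit to the real data.

```python
import pandas as pd

# Made-up listings; column names follow the notebook, values are hypothetical
toy = pd.DataFrame({
    "Beds": [1.0, 2.0, 1.0, 3.0],
    "Room Type": ["Entire home/apt", "Private room", "Shared room", "Entire home/apt"],
})

# One-hot encode Room Type, dropping the first level as the baseline
X = pd.get_dummies(toy, columns=["Room Type"], drop_first=True, dtype=float)

# Multiply Beds by each dummy so the Beds slope can vary by room type
dummy_cols = [c for c in X.columns if c.startswith("Room Type_")]
for col in dummy_cols:
    X[f"Beds x {col}"] = X["Beds"] * X[col]

print(X.columns.tolist())
```

In the notebook, the same loop would need to be applied identically to X_train and X_test before fitting.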
4. Model Comparison and Insights on Overfitting and Underfitting
The similarity in performance between the training and test sets for both models indicates a lack of overfitting; if the complex model were overfitting, we’d expect it to perform much better on the training set than on the test set.
The low $R^2$ values, however, suggest underfitting in both models, as neither was able to explain much of the variance in price. This could be due to missing key predictors or inadequate feature engineering.
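The train-versus-test comparison used in this section can be wrapped in a small helper so that every model is judged the same way. This is a minimal sketch: fit_report is a hypothetical name, and the data is synthetic (a weak linear signal in heavy noise), chosen only to show the shape of the output.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_report(model, X_train, X_test, y_train, y_test):
    """Return RMSE and R^2 on both splits for an already fitted model."""
    report = {}
    for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X)
        report[name] = {
            "rmse": float(np.sqrt(mean_squared_error(y, pred))),
            "r2": float(r2_score(y, pred)),
        }
    return report


# Synthetic data (hypothetical): only the first column carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=5.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
report = fit_report(LinearRegression().fit(X_tr, y_tr), X_tr, X_te, y_tr, y_te)
print(report)
```

A large gap between the train and test entries would point toward overfitting. Low scores that are similar on both splits, as seen in this notebook, are more consistent with underfitting.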
5. Regularization with Lasso for Model Selection
Implementing a Lasso model could help identify which features are most predictive of price by shrinking the coefficients of less important features to exactly zero. This can improve interpretability by focusing on the most relevant variables while filtering out noise.
Key takeaway: Regularization may aid in selecting important features and preventing overfitting, although it may not fully address underfitting if critical features are missing.
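The notebook imports Lasso but never fits it, so the sketch below shows on synthetic data how the selection step might look. The data, the alpha value of 0.3, and the choice to standardize features first are all assumptions made for illustration; on the Airbnb data, alpha would need tuning (for example by cross-validation).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): only the first two of five columns affect the target
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=400)

# Standardize first so the L1 penalty treats all features on the same scale;
# alpha=0.3 is an arbitrary illustrative penalty strength
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.3))
lasso.fit(X, y)

# Features whose coefficients were not shrunk to zero
coefs = lasso.named_steps["lasso"].coef_
kept = np.flatnonzero(coefs)
print("coefficients:", np.round(coefs, 2))
print("features kept:", kept.tolist())
```

Applied to the notebook's X_train, the nonzero coefficients would indicate which encoded features the penalty retains, though this cannot recover predictors that were never included.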
