# Airbnb Price Prediction: Regression Analysis

## Project Overview
This notebook implements an end-to-end regression pipeline to predict the price of Airbnb listings. The workflow includes extensive data preprocessing, feature engineering to extract insights from host and property attributes, and the training of an optimized XGBoost regressor.

## Workflow Summary
1.  **Data Preprocessing:**
    * **Target Variable:** Log-transformation (`log1p`) of the `price` variable to handle skewness.
    * **Cleaning:** Conversion of percentage and currency strings to numerical formats; extraction of bathroom counts from text fields.
    * **Imputation:** Median for numerical and datetime columns; mode for categorical and boolean columns.
2.  **Feature Engineering:**
    * Created derived features such as `host_year` (experience), `review_score_avg` (aggregated quality), and capacity ratios (e.g., `bedroom_to_guest_ratio`).
    * Mapped `host_verifications` to consolidated categories.
3.  **Modeling:**
    * **Algorithm:** XGBoost Regressor (`XGBRegressor`).
    * **Optimization:** Hyperparameters (learning rate, depth, estimators) were tuned via Grid Search (conducted separately) to minimize RMSE.
    * **Encoding:** One-Hot Encoding for categorical variables.

## Performance
* **Final Model MAE:** **96**
* **Baseline Model MAE:** **135**
* **Improvement:** Reduced prediction error by approximately **29%** compared to the baseline.
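
The improvement figure follows from the two MAE values listed above. A quick check of the arithmetic:

```python
baseline_mae = 135
final_mae = 96

# Relative reduction in error versus the baseline
improvement = (baseline_mae - final_mae) / baseline_mae
print(f"{improvement:.1%}")
```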

## 1) Libraries

This section imports the Python libraries and tools used in the notebook.

In [1]:
import os
os.environ["OMP_NUM_THREADS"] = "1"

import pandas as pd
import numpy as np
import re

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

## 2) Data

- This section contains the code that reads, cleans, and preprocesses the datasets.
- The training and test datasets go through the same sequence of operations.

In [2]:
train = pd.read_csv('regression_train.csv')
X_test = pd.read_csv('regression_test.csv')

In [3]:
X_test1 = X_test.copy()
X_test1 = X_test1.set_index('id')

In [4]:
y_train = train['price']
# Keep every column except price as a candidate feature; filter later
X_train = train.drop(columns='price')

In [5]:
# Check the shape and distribution of training data
print(X_train.shape)
X_train.describe()

(9410, 56)


Unnamed: 0,id,host_listings_count,host_total_listings_count,latitude,longitude,accommodates,bedrooms,beds,minimum_nights,maximum_nights,...,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,9410.0,9407.0,9407.0,9410.0,9410.0,9410.0,9387.0,9400.0,9410.0,9410.0,...,7600.0,7600.0,7600.0,7600.0,7600.0,9410.0,9410.0,9410.0,9410.0,7600.0
mean,545235.145377,346.833528,429.743808,34.07325,-111.458761,4.758448,1.840098,2.537872,10.299469,504.786079,...,4.8073,4.864647,4.863064,4.828883,4.706684,59.825505,56.543039,2.349628,0.079809,1.680961
std,259877.99299,1078.689536,1287.216654,8.951985,34.713673,2.955905,1.267157,1.898267,27.470024,433.333884,...,0.327033,0.286539,0.308169,0.284084,0.381064,127.447447,126.460857,7.706789,1.34009,2.125712
min,100067.0,1.0,1.0,21.86866,-159.71428,1.0,0.0,0.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.01
25%,318804.25,2.0,2.0,22.22014,-159.350707,2.0,1.0,1.0,1.0,90.0,...,4.75,4.85,4.85,4.8,4.64,1.0,1.0,0.0,0.0,0.39
50%,543517.0,8.0,11.0,35.642705,-87.677378,4.0,2.0,2.0,2.0,365.0,...,4.9,4.95,4.95,4.92,4.79,6.0,3.0,0.0,0.0,1.2
75%,766969.5,70.0,108.0,41.896009,-87.625694,6.0,2.0,3.0,4.0,1125.0,...,5.0,5.0,5.0,5.0,4.91,45.0,40.0,0.0,0.0,2.46
max,999877.0,5265.0,9059.0,42.02195,-82.46021,16.0,17.0,29.0,730.0,1125.0,...,5.0,5.0,5.0,5.0,5.0,597.0,597.0,85.0,26.0,80.45


In [6]:
X_train.head(3)

Unnamed: 0,id,listing_location,description,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_neighbourhood,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,917611,chicago,Solo Hyde Park visitors are invited stay in th...,2008-08-29,,My apartment is a 2nd floor walk-up in a vinta...,within an hour,100%,81%,Hyde Park,...,4.99,4.98,4.95,4.94,False,1,0,1,0,2.02
1,298170,chicago,Awesome 3 bedroom/2 bathroom in one of Chicago...,2011-06-09,"Chicago, IL",,within a few hours,100%,94%,Roscoe Village,...,4.76,4.62,4.88,4.71,True,23,23,0,0,0.67
2,386102,chicago,We offer the highest standards of cleanliness....,2011-07-31,"Chicago, IL","I have a small family (partner, Dave and daugh...",within an hour,100%,100%,Logan Square,...,4.97,4.95,4.91,4.88,False,1,1,0,0,4.19


In [7]:
# Check the format of the response variable
y_train

Unnamed: 0,price
0,$125.00
1,$230.00
2,$208.00
3,$167.00
4,$73.00
...,...
9405,$804.00
9406,$799.00
9407,$239.00
9408,$156.00


In [8]:
# Convert the response column into float
y_train = y_train.str.replace(r'[\$,]', '', regex=True).astype(float)
y_train_log = np.log1p(y_train)
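
The log transform applied here is undone with `np.expm1` at prediction time. A minimal sketch on made-up price strings (illustrative values, not rows from the dataset) showing the currency cleanup, the transform, and the round trip:

```python
import numpy as np
import pandas as pd

# Illustrative price strings in the same format as the `price` column
raw_prices = pd.Series(["$125.00", "$1,230.00", "$73.00"])

# Strip the dollar sign and thousands separator, then convert to float
prices = raw_prices.str.replace(r"[\$,]", "", regex=True).astype(float)

# log1p compresses the long right tail; expm1 is its inverse
prices_log = np.log1p(prices)
prices_back = np.expm1(prices_log)
```

Because the model is trained on `y_train_log`, its raw predictions are in log units and only become prices after the `expm1` step.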

### Feature Selection

In [9]:
# Drop the columns that will not be used for prediction
X_train = X_train.drop(columns=['id','description', 'host_location', 'host_neighbourhood',
                                'latitude', 'longitude','amenities','property_type'])

X_test = X_test.drop(columns=['id','description', 'host_location', 'host_neighbourhood',
                              'latitude', 'longitude','amenities','property_type'])

### Convert Variables' Data Type

In [10]:
### Check object columns first
X_train.select_dtypes(include='object').head()

Unnamed: 0,listing_location,host_since,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,room_type,bathrooms_text,has_availability,first_review,last_review
0,chicago,2008-08-29,My apartment is a 2nd floor walk-up in a vinta...,within an hour,100%,81%,"['email', 'phone']",True,True,Hyde Park,Private room,1 shared bath,True,2015-01-09,2024-10-14
1,chicago,2011-06-09,,within a few hours,100%,94%,"['email', 'phone', 'work_email']",True,True,North Center,Entire home/apt,2 baths,True,2015-10-15,2024-05-05
2,chicago,2011-07-31,"I have a small family (partner, Dave and daugh...",within an hour,100%,100%,"['email', 'phone']",True,False,Logan Square,Entire home/apt,1 bath,True,2011-09-16,2025-02-17
3,chicago,2011-08-25,Conceptual artist loves to explore... \n\nI've...,within an hour,100%,100%,"['email', 'phone']",True,True,Pullman,Entire home/apt,1 bath,True,2011-09-06,2025-02-01
4,chicago,2011-08-25,Conceptual artist loves to explore... \n\nI've...,within an hour,100%,100%,"['email', 'phone']",True,True,Pullman,Entire home/apt,1 bath,True,2011-09-18,2024-08-25


In [11]:
# Train
X_train['host_since'] = pd.to_datetime(X_train['host_since'], errors='coerce')
X_train['host_response_rate'] = X_train['host_response_rate'].str.replace('%', '', regex=False).astype(float)
X_train['host_acceptance_rate'] = X_train['host_acceptance_rate'].str.replace('%', '', regex=False).astype(float)
X_train['first_review'] = pd.to_datetime(X_train['first_review'], errors='coerce')
X_train['last_review'] = pd.to_datetime(X_train['last_review'], errors='coerce')

In [12]:
# Test
X_test['host_since'] = pd.to_datetime(X_test['host_since'], errors='coerce')
X_test['host_response_rate'] = X_test['host_response_rate'].str.replace('%', '', regex=False).astype(float)
X_test['host_acceptance_rate'] = X_test['host_acceptance_rate'].str.replace('%', '', regex=False).astype(float)
X_test['first_review'] = pd.to_datetime(X_test['first_review'], errors='coerce')
X_test['last_review'] = pd.to_datetime(X_test['last_review'], errors='coerce')
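
Since the train and test cells repeat the same statements, one option is to wrap the conversions in a function and call it on both frames. The helper name `convert_types` is hypothetical and not part of the original notebook; the sketch runs on a two-row toy frame.

```python
import pandas as pd

DATE_COLS = ["host_since", "first_review", "last_review"]
RATE_COLS = ["host_response_rate", "host_acceptance_rate"]

def convert_types(df):
    """Return a copy with date strings parsed and percentage strings as floats."""
    out = df.copy()
    for col in DATE_COLS:
        # Unparseable or missing dates become NaT
        out[col] = pd.to_datetime(out[col], errors="coerce")
    for col in RATE_COLS:
        # "81%" -> 81.0; missing values stay missing
        out[col] = out[col].str.replace("%", "", regex=False).astype(float)
    return out

toy = pd.DataFrame({
    "host_since": ["2008-08-29", None],
    "first_review": ["2015-01-09", "2015-10-15"],
    "last_review": ["2024-10-14", "not a date"],
    "host_response_rate": ["100%", None],
    "host_acceptance_rate": ["81%", "94%"],
})
converted = convert_types(toy)
```

Calling the same function on `X_train` and `X_test` would keep the two frames from drifting apart if a conversion is edited later.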

In [13]:
### Check numeric and datetime columns to make sure the conversion is done
X_train.select_dtypes(include='number').head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,accommodates,bedrooms,beds,minimum_nights,maximum_nights,minimum_minimum_nights,...,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,100.0,81.0,1.0,1.0,1,1.0,1.0,3,89,3,...,4.99,4.99,4.98,4.95,4.94,1,0,1,0,2.02
1,100.0,94.0,24.0,38.0,8,3.0,3.0,32,1125,32,...,4.91,4.76,4.62,4.88,4.71,23,23,0,0,0.67
2,100.0,100.0,1.0,2.0,5,2.0,3.0,3,28,3,...,4.93,4.97,4.95,4.91,4.88,1,1,0,0,4.19
3,100.0,100.0,3.0,3.0,2,1.0,1.0,32,125,32,...,4.84,4.97,4.97,4.71,4.86,3,3,0,0,1.94
4,100.0,100.0,3.0,3.0,1,1.0,1.0,32,120,32,...,4.72,4.98,5.0,4.76,4.85,3,3,0,0,0.29


In [14]:
X_train.select_dtypes(include='datetime64').head()

Unnamed: 0,host_since,first_review,last_review
0,2008-08-29,2015-01-09,2024-10-14
1,2011-06-09,2015-10-15,2024-05-05
2,2011-07-31,2011-09-16,2025-02-17
3,2011-08-25,2011-09-06,2025-02-01
4,2011-08-25,2011-09-18,2024-08-25


### Impute Missing Data

#### Training

In [15]:
# Identify columns with missing values
missing_cols = X_train.columns[X_train.isnull().any()]
print(missing_cols)

Index(['host_since', 'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'bedrooms', 'beds',
       'has_availability', 'first_review', 'last_review',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'reviews_per_month'],
      dtype='object')


In [16]:
# Impute missing values
# datetime
X_train['host_since'] = X_train['host_since'].fillna(X_train['host_since'].median())
X_train['first_review'] = X_train['first_review'].fillna(X_train['first_review'].median())
X_train['last_review'] = X_train['last_review'].fillna(X_train['last_review'].median())

# categorical
X_train['host_response_time'] = X_train['host_response_time'].fillna(X_train['host_response_time'].mode()[0])

# boolean
X_train['host_has_profile_pic'] = X_train['host_has_profile_pic'].astype('boolean')
X_train['host_has_profile_pic'] = X_train['host_has_profile_pic'].fillna(X_train['host_has_profile_pic'].mode()[0])
X_train['host_identity_verified'] = X_train['host_identity_verified'].astype('boolean')
X_train['host_identity_verified'] = X_train['host_identity_verified'].fillna(X_train['host_identity_verified'].mode()[0])
X_train['has_availability'] = X_train['has_availability'].astype('boolean')
X_train['has_availability'] = X_train['has_availability'].fillna(X_train['has_availability'].mode()[0])

# numeric
X_train['host_response_rate'] = X_train['host_response_rate'].fillna(X_train['host_response_rate'].median())
X_train['host_acceptance_rate'] = X_train['host_acceptance_rate'].fillna(X_train['host_acceptance_rate'].median())
X_train['host_listings_count'] = X_train['host_listings_count'].fillna(X_train['host_listings_count'].median())
X_train['host_total_listings_count'] = X_train['host_total_listings_count'].fillna(X_train['host_total_listings_count'].median())
X_train['bedrooms'] = X_train['bedrooms'].fillna(X_train['bedrooms'].median())
X_train['beds'] = X_train['beds'].fillna(X_train['beds'].median())
X_train['review_scores_rating'] = X_train['review_scores_rating'].fillna(X_train['review_scores_rating'].median())
X_train['review_scores_accuracy'] = X_train['review_scores_accuracy'].fillna(X_train['review_scores_accuracy'].median())
X_train['review_scores_cleanliness'] = X_train['review_scores_cleanliness'].fillna(X_train['review_scores_cleanliness'].median())
X_train['review_scores_checkin'] = X_train['review_scores_checkin'].fillna(X_train['review_scores_checkin'].median())
X_train['review_scores_communication'] = X_train['review_scores_communication'].fillna(X_train['review_scores_communication'].median())
X_train['review_scores_location'] = X_train['review_scores_location'].fillna(X_train['review_scores_location'].median())
X_train['review_scores_value'] = X_train['review_scores_value'].fillna(X_train['review_scores_value'].median())
X_train['reviews_per_month'] = X_train['reviews_per_month'].fillna(X_train['reviews_per_month'].median())

In [17]:
missing_cols = X_train.columns[X_train.isnull().any()]
print(missing_cols)

Index(['host_about', 'host_verifications'], dtype='object')


#### Test

In [18]:
# Identify columns with missing values
missing_cols_test = X_test.columns[X_test.isnull().any()]
print(missing_cols_test)

Index(['host_since', 'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'bathrooms_text',
       'bedrooms', 'beds', 'has_availability', 'first_review', 'last_review',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'reviews_per_month'],
      dtype='object')


In [19]:
# Impute missing values
# datetime
X_test['host_since'] = X_test['host_since'].fillna(X_test['host_since'].median())
X_test['first_review'] = X_test['first_review'].fillna(X_test['first_review'].median())
X_test['last_review'] = X_test['last_review'].fillna(X_test['last_review'].median())

# categorical
X_test['host_response_time'] = X_test['host_response_time'].fillna(X_test['host_response_time'].mode()[0])

# boolean
X_test['host_has_profile_pic'] = X_test['host_has_profile_pic'].astype('boolean')
X_test['host_has_profile_pic'] = X_test['host_has_profile_pic'].fillna(X_test['host_has_profile_pic'].mode()[0])
X_test['host_identity_verified'] = X_test['host_identity_verified'].astype('boolean')
X_test['host_identity_verified'] = X_test['host_identity_verified'].fillna(X_test['host_identity_verified'].mode()[0])
X_test['has_availability'] = X_test['has_availability'].astype('boolean')
X_test['has_availability'] = X_test['has_availability'].fillna(X_test['has_availability'].mode()[0])

# numeric
X_test['host_response_rate'] = X_test['host_response_rate'].fillna(X_test['host_response_rate'].median())
X_test['host_acceptance_rate'] = X_test['host_acceptance_rate'].fillna(X_test['host_acceptance_rate'].median())
X_test['host_listings_count'] = X_test['host_listings_count'].fillna(X_test['host_listings_count'].median())
X_test['host_total_listings_count'] = X_test['host_total_listings_count'].fillna(X_test['host_total_listings_count'].median())
X_test['bedrooms'] = X_test['bedrooms'].fillna(X_test['bedrooms'].median())
X_test['beds'] = X_test['beds'].fillna(X_test['beds'].median())
X_test['review_scores_rating'] = X_test['review_scores_rating'].fillna(X_test['review_scores_rating'].median())
X_test['review_scores_accuracy'] = X_test['review_scores_accuracy'].fillna(X_test['review_scores_accuracy'].median())
X_test['review_scores_cleanliness'] = X_test['review_scores_cleanliness'].fillna(X_test['review_scores_cleanliness'].median())
X_test['review_scores_checkin'] = X_test['review_scores_checkin'].fillna(X_test['review_scores_checkin'].median())
X_test['review_scores_communication'] = X_test['review_scores_communication'].fillna(X_test['review_scores_communication'].median())
X_test['review_scores_location'] = X_test['review_scores_location'].fillna(X_test['review_scores_location'].median())
X_test['review_scores_value'] = X_test['review_scores_value'].fillna(X_test['review_scores_value'].median())
X_test['reviews_per_month'] = X_test['reviews_per_month'].fillna(X_test['reviews_per_month'].median())
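
The cells above fill the test set with statistics computed from the test set itself. A common alternative is to compute the fill values once on the training data and reuse them for both frames, so that the two sets are transformed identically. The sketch below illustrates that alternative on toy data; `fit_fill_values` and `apply_fill_values` are hypothetical helper names, and this is not what the notebook does.

```python
import pandas as pd

def fit_fill_values(train_df, median_cols, mode_cols):
    """Compute one fill value per column from the training data only."""
    fill = {col: train_df[col].median() for col in median_cols}
    fill.update({col: train_df[col].mode()[0] for col in mode_cols})
    return fill

def apply_fill_values(df, fill):
    """Return a copy of df with missing values replaced by the stored fill values."""
    return df.fillna(value=fill)

toy_train = pd.DataFrame({
    "beds": [1.0, 2.0, 3.0, None],
    "host_response_time": ["within an hour", "within an hour", "within a day", None],
})
toy_test = pd.DataFrame({
    "beds": [None, 4.0],
    "host_response_time": [None, "within a day"],
})

fill = fit_fill_values(toy_train, median_cols=["beds"], mode_cols=["host_response_time"])
toy_train_filled = apply_fill_values(toy_train, fill)
toy_test_filled = apply_fill_values(toy_test, fill)
```

With this layout the test rows receive the training median and mode, regardless of how the test set happens to be distributed.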

In [20]:
missing_cols2 = X_test.columns[X_test.isnull().any()]
print(missing_cols2)

Index(['host_about', 'host_verifications', 'bathrooms_text'], dtype='object')


### Feature Engineering: Create New Variables

In [21]:
### Train
X_train['host_year'] = (pd.Timestamp('today') - X_train['host_since']).dt.days / 365
X_train['review_time_length'] = (X_train['last_review'] - X_train['first_review']).dt.days / 365
X_train['review_score_avg'] = X_train[['review_scores_accuracy', 'review_scores_rating', 'review_scores_cleanliness',
                                      'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
                                      'review_scores_value']].mean(axis=1)
X_train['bedroom_to_guest_ratio'] = X_train['bedrooms'] / (X_train['accommodates']+1)
X_train['host_entire_room_ratio'] = X_train['calculated_host_listings_count_entire_homes'] / X_train['calculated_host_listings_count']
X_train['host_private_room_ratio'] = X_train['calculated_host_listings_count_private_rooms'] / X_train['calculated_host_listings_count']
X_train['has_host_description'] = X_train['host_about'].notnull().astype(int)

# New variables based on text
def map_host_verification(val):
    if pd.isna(val) or val == "[]":
        return 'unverified'
    elif val == "['email']":
        return 'email_only'
    elif val == "['phone']":
        return 'phone_only'
    elif val == "['email', 'phone']":
        return 'email_and_phone'
    elif val == "['phone', 'work_email']":
        return 'phone_and_work_email'
    elif val == "['email', 'phone', 'work_email']":
        return 'full_verification'
    else:
        return 'other'

X_train['verification_category'] = X_train['host_verifications'].apply(map_host_verification)

def extract_bathroom_number(val):
    if pd.isna(val):
        return np.nan
    val = val.lower().strip()
    if 'half-bath' in val:
        return 0.5
    match = re.match(r'^([\d\.]+)', val)
    if match:
        return float(match.group(1))
    return np.nan

X_train['bathrooms'] = X_train['bathrooms_text'].apply(extract_bathroom_number)
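
To make the parsing rules concrete, the snippet below reruns `extract_bathroom_number` (copied from the cell above) on a few strings in the style of `bathrooms_text`. The sample inputs are illustrative.

```python
import re
import numpy as np
import pandas as pd

def extract_bathroom_number(val):
    if pd.isna(val):
        return np.nan
    val = val.lower().strip()
    # Any "half-bath" label counts as half a bathroom
    if 'half-bath' in val:
        return 0.5
    # Otherwise take the leading number, e.g. "2.5 baths" -> 2.5
    match = re.match(r'^([\d\.]+)', val)
    if match:
        return float(match.group(1))
    return np.nan

samples = ["1 shared bath", "2 baths", "2.5 baths", "Shared half-bath", None]
parsed = [extract_bathroom_number(s) for s in samples]
```

Strings that neither start with a number nor mention a half-bath come back as `NaN`, as do missing values.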

In [22]:
### Test
X_test['host_year'] = (pd.Timestamp('today') - X_test['host_since']).dt.days / 365
X_test['review_time_length'] = (X_test['last_review'] - X_test['first_review']).dt.days / 365
X_test['review_score_avg'] = X_test[['review_scores_accuracy', 'review_scores_rating', 'review_scores_cleanliness',
                                      'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
                                      'review_scores_value']].mean(axis=1)
X_test['bedroom_to_guest_ratio'] = X_test['bedrooms'] / (X_test['accommodates']+1)
X_test['host_entire_room_ratio'] = X_test['calculated_host_listings_count_entire_homes'] / X_test['calculated_host_listings_count']
X_test['host_private_room_ratio'] = X_test['calculated_host_listings_count_private_rooms'] / X_test['calculated_host_listings_count']
X_test['has_host_description'] = X_test['host_about'].notnull().astype(int)

# New variables based on text
X_test['verification_category'] = X_test['host_verifications'].apply(map_host_verification)
X_test['bathrooms'] = X_test['bathrooms_text'].apply(extract_bathroom_number)
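
`map_host_verification` compares the raw string exactly, so a list written in a different order (for example `['phone', 'email']`) falls into `'other'`. If that matters, one possible variant parses the string with `ast.literal_eval` and compares sets instead. This is a sketch of an alternative, not the notebook's method; the name `map_host_verification_by_set` is hypothetical, and it covers the same categories as the original function.

```python
import ast
import pandas as pd

CATEGORY_BY_SET = {
    frozenset(): "unverified",
    frozenset({"email"}): "email_only",
    frozenset({"phone"}): "phone_only",
    frozenset({"email", "phone"}): "email_and_phone",
    frozenset({"phone", "work_email"}): "phone_and_work_email",
    frozenset({"email", "phone", "work_email"}): "full_verification",
}

def map_host_verification_by_set(val):
    """Map a stringified list of verifications to a category, ignoring order."""
    if pd.isna(val):
        return "unverified"
    try:
        methods = frozenset(ast.literal_eval(val))
    except (ValueError, SyntaxError, TypeError):
        # Not a parseable list of strings
        return "other"
    return CATEGORY_BY_SET.get(methods, "other")

examples = ["['phone', 'email']", "[]", None, "['work_email']", "garbage["]
categories = [map_host_verification_by_set(v) for v in examples]
```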

In [23]:
# Drop the variables that will not be directly used after creating new variables
X_train = X_train.drop(columns=['host_about', 'first_review', 'last_review', 'host_since', 'bathrooms_text', 'host_verifications'])
X_test = X_test.drop(columns=['host_about', 'first_review', 'last_review', 'host_since', 'bathrooms_text', 'host_verifications'])
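
Note that `host_year` is measured from `pd.Timestamp('today')`, so its values shift each time the notebook is rerun. If reproducible features are needed, one option is to pin a reference date. The date used below is an arbitrary placeholder, not a value from the original analysis.

```python
import pandas as pd

# Hypothetical fixed reference date (placeholder; e.g. the data snapshot date)
REFERENCE_DATE = pd.Timestamp("2025-03-01")

host_since = pd.Series(pd.to_datetime(["2008-08-29", "2011-06-09"]))

# Same formula as in the notebook, but anchored to a constant date
host_year = (REFERENCE_DATE - host_since).dt.days / 365
```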

In [24]:
missing_cols = X_train.columns[X_train.isnull().any()]
print(missing_cols)

Index([], dtype='object')


In [25]:
missing_cols = X_test.columns[X_test.isnull().any()]
print(missing_cols)

Index(['bathrooms'], dtype='object')


In [26]:
# impute missing values for the bathrooms column in X_test
X_test['bathrooms'] = X_test['bathrooms'].fillna(X_test['bathrooms'].median())

In [27]:
X_train.select_dtypes(include='number').describe()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,accommodates,bedrooms,beds,minimum_nights,maximum_nights,minimum_minimum_nights,...,calculated_host_listings_count_shared_rooms,reviews_per_month,host_year,review_time_length,review_score_avg,bedroom_to_guest_ratio,host_entire_room_ratio,host_private_room_ratio,has_host_description,bathrooms
count,9410.0,9410.0,9410.0,9410.0,9410.0,9410.0,9410.0,9410.0,9410.0,9410.0,...,9410.0,9410.0,9410.0,9410.0,9410.0,9410.0,9410.0,9410.0,9410.0,9410.0
mean,97.963974,88.730712,346.725505,429.610308,4.758448,1.840489,2.537301,10.299469,504.786079,9.912859,...,0.079809,1.588448,7.784563,2.863957,4.830293,0.314072,0.843374,0.140687,0.680128,1.604516
std,9.353003,22.020507,1078.534521,1287.03314,2.955905,1.265632,1.897339,27.470024,433.333884,26.428753,...,1.34009,1.919725,3.473854,2.463125,0.24033,0.141934,0.337644,0.319869,0.466452,0.938312
min,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.01,0.736986,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,100.0,92.0,2.0,2.0,2.0,1.0,1.0,1.0,90.0,2.0,...,0.0,0.54,4.646575,1.030822,4.807143,0.25,1.0,0.0,0.0,1.0
50%,100.0,99.0,8.0,11.0,4.0,2.0,2.0,2.0,365.0,2.0,...,0.0,1.2,8.443836,2.465753,4.901429,0.333333,1.0,0.0,1.0,1.0
75%,100.0,100.0,70.0,108.0,6.0,2.0,3.0,4.0,1125.0,3.0,...,0.0,2.09,10.257534,3.657534,4.931429,0.4,1.0,0.0,1.0,2.0
max,100.0,100.0,5265.0,9059.0,16.0,17.0,29.0,730.0,1125.0,730.0,...,26.0,80.45,17.265753,13.775342,5.0,3.333333,1.0,1.0,1.0,17.0


#### One-Hot Encoding

In [28]:
# boolean columns: these hold real booleans (not the strings 'True'/'False'),
# so cast them to 0/1; mapping with string keys would turn every value into NaN
bool_cols = ['host_has_profile_pic', 'host_identity_verified', 'has_availability', 'instant_bookable']

for col in bool_cols:
    X_train[col] = X_train[col].astype(int)
    X_test[col] = X_test[col].astype(int)
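
Whichever way the flags are encoded, it is worth confirming that no values were lost. `Series.map` with a dict returns `NaN` for any value that is not an exact key, so string keys such as `'True'` do not match boolean values. A small illustration on toy data:

```python
import pandas as pd

flags = pd.Series([True, False, True])

# String keys never match real booleans, so every value becomes missing
mapped_with_strings = flags.map({"True": 1, "False": 0})

# Casting keeps the information: True -> 1, False -> 0
cast_to_int = flags.astype(int)
```

A check such as `X_train[col].isna().sum()` on each encoded column would surface this kind of silent loss.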

In [29]:
# categorical columns
categorical_cols = X_train.select_dtypes(include='object').columns
rest_cols = X_train.drop(columns=categorical_cols).columns

In [30]:
encoder = OneHotEncoder(drop='first', handle_unknown='ignore')

# One-hot encode categorical variables
encoded_cats = encoder.fit_transform(X_train[categorical_cols]).toarray()
encoded_cats_df = pd.DataFrame(encoded_cats, columns=encoder.get_feature_names_out(categorical_cols), index=X_train.index)

# Repeat for test data
encoded_cats_test = encoder.transform(X_test[categorical_cols]).toarray()
encoded_cats_test_df = pd.DataFrame(encoded_cats_test, columns=encoder.get_feature_names_out(categorical_cols), index=X_test.index)



In [31]:
X_train_processed = pd.concat([encoded_cats_df, X_train[rest_cols]], axis=1)
X_test_processed = pd.concat([encoded_cats_test_df, X_test[rest_cols]], axis=1)

## 3) Machine Learning Model

In this section, I train the final model with hyperparameters selected in a separate Grid Search run. The search code is omitted to keep the runtime short and to follow the instructions; the values below are the tuned configuration used to generate the final test predictions.

In [32]:
# Fit the tuned model
tuned_xgb_model = XGBRegressor(
    random_state=1,
    objective = 'reg:squarederror',
    colsample_bytree = 0.5,
    learning_rate = 0.01,
    max_depth = 8,
    n_estimators = 2600,
    reg_lambda = 0.01,
    subsample = 0.75
    )

tuned_xgb_model.fit(X_train_processed, y_train_log)

In [33]:
# Predict
y_test_pred_log = tuned_xgb_model.predict(X_test_processed)
y_test_pred = np.expm1(y_test_pred_log)
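
The MAE figures in the Performance section are not computed in this notebook. One way to estimate such a number is out-of-fold MAE on the original price scale, back-transforming predictions with `np.expm1` before scoring. The sketch below is an assumption about how this could be done, not the procedure behind the reported values; `cv_mae_original_scale` is a hypothetical helper, the data is synthetic, and `DummyRegressor` and `LinearRegression` stand in for a baseline and a fitted model.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def cv_mae_original_scale(make_model, X, y_log, n_splits=5, seed=1):
    """Average out-of-fold MAE, measured in price units rather than log units."""
    X = np.asarray(X)
    y_log = np.asarray(y_log)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_maes = []
    for fit_idx, val_idx in folds.split(X):
        model = make_model()
        model.fit(X[fit_idx], y_log[fit_idx])
        # Undo log1p on both predictions and targets before scoring
        pred_price = np.expm1(model.predict(X[val_idx]))
        true_price = np.expm1(y_log[val_idx])
        fold_maes.append(mean_absolute_error(true_price, pred_price))
    return float(np.mean(fold_maes))

# Synthetic prices whose logarithm depends linearly on the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
price_demo = np.exp(4.5 + 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200))
y_log_demo = np.log1p(price_demo)

baseline_mae = cv_mae_original_scale(lambda: DummyRegressor(strategy="median"), X_demo, y_log_demo)
model_mae = cv_mae_original_scale(lambda: LinearRegression(), X_demo, y_log_demo)
```

Passing `lambda: XGBRegressor(...)` as `make_model` would apply the same scoring to the tuned model.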

## 4) Exporting the Predictions

In [34]:
prediction = pd.DataFrame({'id': X_test1.index, 'predicted': y_test_pred})
prediction.to_csv('regression_prediction_xgb_all_features.csv', index=False)