## Instructions:

- Put the parts of your code under the corresponding sections. (0.25/2 points will be taken off for not doing this.)
- Do not include any redundant/irrelevant code, text or comments. (0.5/2 points will be taken off for not doing this.)
- **Your code must run without any errors or runtime issues.** (Failure to meet this condition will result in a 0.)
- **Your code must return your Public Leaderboard score.** (Failure to meet this condition will result in a 0.)
- **Submit both your ipynb and your html file for grading purposes.**

## 1) Libraries

Put all the Python libraries and tools you imported here.

In [1]:
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

## 2) Data

- This section must include the code that reads, cleans, and preprocesses the datasets.
- Note that the training and test datasets must both undergo the same sequence of operations.

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train2 = train.copy()
train2['price'] = train2['price'].replace(r'[\$,]', '', regex=True).astype(float)

drop_cols = ['id', 'description', 'host_about', 'host_name', 'first_review', 'last_review', 'host_verifications', 'listing_location']
train2 = train2.drop(columns=[col for col in drop_cols if col in train2.columns])

test = test.drop(columns=[col for col in drop_cols if col in test.columns and col != 'id'])

for col in ['host_response_rate', 'host_acceptance_rate']:
    train2[col] = train2[col].str.rstrip('%').astype(float) / 100
    test[col] = test[col].str.rstrip('%').astype(float) / 100

train2['bathrooms'] = train2['bathrooms_text'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
test['bathrooms'] = test['bathrooms_text'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
train2 = train2.drop(columns=['bathrooms_text'])
test = test.drop(columns=['bathrooms_text'])

train2['host_since'] = pd.to_datetime(train2['host_since'], errors='coerce')
test['host_since'] = pd.to_datetime(test['host_since'], errors='coerce')
train2['host_since_days'] = (pd.to_datetime('2023-01-01') - train2['host_since']).dt.days
test['host_since_days'] = (pd.to_datetime('2023-01-01') - test['host_since']).dt.days
train2 = train2.drop(columns=['host_since'])
test = test.drop(columns=['host_since'])
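
One way to guarantee that both datasets go through the same sequence of operations is to wrap the shared steps in a single function and call it on each frame. The sketch below is only an illustration: `preprocess` and `REFERENCE_DATE` are hypothetical names, and the toy rows are made up. It mirrors the percentage, bathroom, and host-tenure steps above, not the price cleaning, which applies to the training set only.

```python
import pandas as pd

REFERENCE_DATE = pd.Timestamp("2023-01-01")  # same fixed date as in the cell above


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the shared cleaning steps to one frame."""
    out = df.copy()
    # "95%" -> 0.95
    for col in ["host_response_rate", "host_acceptance_rate"]:
        out[col] = out[col].str.rstrip("%").astype(float) / 100
    # "1.5 shared baths" -> 1.5; text without digits becomes NaN
    out["bathrooms"] = (
        out["bathrooms_text"].str.extract(r"(\d+\.?\d*)", expand=False).astype(float)
    )
    # Host tenure in days relative to the fixed reference date
    host_since = pd.to_datetime(out["host_since"], errors="coerce")
    out["host_since_days"] = (REFERENCE_DATE - host_since).dt.days
    return out.drop(columns=["bathrooms_text", "host_since"])


# Toy frame with made-up values, only to show the call
toy = pd.DataFrame({
    "host_response_rate": ["95%", "100%"],
    "host_acceptance_rate": ["80%", "50%"],
    "bathrooms_text": ["1.5 shared baths", "Half-bath"],
    "host_since": ["2022-12-31", "not a date"],
})
clean = preprocess(toy)
```

With such a helper, the training and test frames would each be passed through `preprocess` once, so a change to any step applies to both.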



In [3]:
cardinality = train2.select_dtypes(include=['object', 'bool']).nunique()
high_card_cols = cardinality[cardinality > 100].index.tolist()

X = train2.drop(columns=['price'] + high_card_cols)
y = np.log1p(train2['price']) 

categorical_cols = X.select_dtypes(include=['object', 'bool']).columns.tolist()
for col in categorical_cols:
    freq = X[col].value_counts(normalize=True)
    X[col + '_freq'] = X[col].map(freq)
X = X.drop(columns=categorical_cols)

test_ids = test['id']
X_test = test.drop(columns=[col for col in high_card_cols if col in test.columns])
for col in categorical_cols:
    # Reuse the cleaned training-set frequencies; categories unseen in training map to NaN
    freq = train2[col].value_counts(normalize=True)
    X_test[col + '_freq'] = test[col].map(freq)
X_test = X_test.drop(columns=[col for col in categorical_cols if col in X_test.columns])
X_test = X_test.reindex(columns=X.columns, fill_value=0)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1)

# y_train is on the log1p scale, so convert the price cap before comparing
outlier_price = 20000
keep = y_train < np.log1p(outlier_price)
X_train = X_train[keep]
y_train = y_train[keep]
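
The frequency encoding used in this cell can be sketched as a fit/apply pair: the category shares are learned from the training data only and then mapped onto any other frame. The helper names and the `room_type` column below are hypothetical and the rows are made up. As in the cell above, a category that never appears in training is left as NaN.

```python
import pandas as pd


def fit_frequency_maps(df, columns):
    """Hypothetical helper: learn each category's share from the training frame."""
    return {col: df[col].value_counts(normalize=True) for col in columns}


def apply_frequency_maps(df, freq_maps):
    """Replace each categorical column with its training-set frequency."""
    out = df.copy()
    for col, freq in freq_maps.items():
        # Categories missing from the training data have no entry and become NaN
        out[col + "_freq"] = out[col].map(freq)
    return out.drop(columns=list(freq_maps))


# Toy data, only to show the call
train_toy = pd.DataFrame(
    {"room_type": ["Entire home", "Entire home", "Private room", "Entire home"]}
)
test_toy = pd.DataFrame({"room_type": ["Private room", "Hotel room"]})

maps = fit_frequency_maps(train_toy, ["room_type"])
train_enc = apply_frequency_maps(train_toy, maps)
test_enc = apply_frequency_maps(test_toy, maps)
```

Here `test_enc` gets 0.25 for "Private room" (its share in the toy training data) and NaN for "Hotel room", which the toy training data never contains.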

## 3) Machine Learning Model

- This section must train the **already tuned** model and use it to obtain the test predictions (or prediction probabilities).
- As written in the instructions, your code must not have any runtime issues, so **do NOT include your grid search here!** You will still need to tune your model to pass the thresholds. However, you need to keep that as your personal work and should NOT include the grid search here.

In [4]:
xgb = XGBRegressor(
    n_estimators=1600,
    learning_rate=0.016515151515151517,
    max_depth=8,
    min_child_weight=1,
    subsample=0.6838383838383838,
    colsample_bytree=0.6484848484848484,
    gamma=0.0030303030303030303,
    reg_lambda=0.20408163265306123,
    objective='reg:squarederror',
    random_state=1
)
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_valid)
mae = mean_absolute_error(np.expm1(y_valid), np.expm1(y_pred))
print(f"Validation MAE: {mae:.2f}")


Validation MAE: 87.11
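
Because the model is fit on `log1p(price)` and the MAE is reported after `expm1`, it may help to see how the two scales relate. The sketch below uses made-up prices and a pretend prediction error. It only illustrates the round trip, the fact that a price cap has to be moved to the log scale before it is compared with a log-scale target, and the difference between an error in log units and an error in dollars.

```python
import numpy as np

prices = np.array([50.0, 120.0, 20000.0])  # made-up prices in dollars
y_log = np.log1p(prices)                   # target scale used for training

# expm1 undoes log1p, so predictions can be mapped back to dollars
restored = np.expm1(y_log)

# A dollar cap only filters anything once it is converted to the log scale
cap = 20000
keep = y_log < np.log1p(cap)
all_below_raw_cap = bool((y_log < cap).all())  # raw cap never excludes a row

# Pretend the model is off by 0.1 in log space for every listing
pred_log = y_log + 0.1
mae_log = np.mean(np.abs(pred_log - y_log))
mae_dollars = np.mean(np.abs(np.expm1(pred_log) - prices))
```

A constant error in log space corresponds to a roughly proportional error in dollars, so expensive listings dominate `mae_dollars`.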


## 4) Exporting the Predictions

Include the code that (1) puts the predictions in the format that Kaggle understands and (2) exports it as a csv file.

In [5]:
y_test_pred = xgb.predict(X_test)

submission = pd.DataFrame({
    'id': test_ids,
    'price': np.expm1(y_test_pred) 
})
submission.to_csv("submission.csv", index=False)
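
Before uploading, a quick sanity check on the submission frame can catch formatting problems early. The following is a minimal sketch, assuming the competition expects exactly an `id` column and a `price` column as in the cell above. `build_submission` is a hypothetical helper and the ids and prices are made up.

```python
import io

import numpy as np
import pandas as pd


def build_submission(ids, log_predictions):
    """Hypothetical helper: map log1p-scale predictions back to prices and validate."""
    submission = pd.DataFrame({
        "id": list(ids),
        "price": np.expm1(np.asarray(log_predictions, dtype=float)),
    })
    if not submission["id"].is_unique:
        raise ValueError("duplicate ids in submission")
    if submission["price"].isna().any():
        raise ValueError("missing predictions in submission")
    return submission


# Toy example: two made-up listings, written to an in-memory buffer instead of a file
demo = build_submission([101, 102], np.log1p([80.0, 150.0]))
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
```

Writing with `index=False` keeps the header at `id,price`, with no extra index column.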