# CS541 Applied Machine Learning Spring 2025 - Class Challenge

In this class challenge assignment, you will be building a machine learning model to predict the price of an Airbnb rental, given the dataset we have provided. Total points: **100 pts**

To submit your solution, upload a Python (.py) file named challenge.py to Gradescope.

- Initial submission due: April 22, 2025
- Final submission due: May 1, 2025

The top-3 winners will present their methodology on the last day of class (May 1st). Instructions on the presentation to follow.

There will be a Leaderboard for the challenge that can be seen by all students. USE YOUR FULL NAME AND NO NICKNAMES.

To encourage you to get started early on the challenge, you are required to make an initial submission by **April 22**. For this submission, your model needs to achieve an MSE of 0.16 or lower. This threshold is denoted as Baseline1.csv on the Kaggle leaderboard. The final submission is due on **May 1**.


## Problem and dataset description
Pricing a rental property such as an apartment or house on Airbnb is a difficult challenge. A model that accurately predicts the price can potentially help renters and hosts on the platform make better decisions. In this assignment, your task is to train a model that takes features of a listing as input and predicts the price.

We have provided you with a dataset collected from the Airbnb website for New York, which has a total of 29,985 entries, each with 765 features. You may use the provided data as you wish in development. We will train your submitted code on the same provided dataset, and will evaluate it on 2 other test sets (one public, and one hidden during the challenge).

We have already done some minimal data cleaning for you, such as converting text fields into categorical values and removing NaN values. The conversion strategy depended on the field. For example, sentiment analysis was applied to convert user reviews into numerical values (the 'comments' column). We also added a separate column for each state name, where a value of '1' indicates the location of the property. Column names are included in the data files and are mostly descriptive.

Also in this data cleaning step, the price value that we are trying to predict was computed by taking the log of the original price. Hence, on the training set, the output price ranges from about 2.302 to about 9.21.
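
Because the target is a log price, predictions can be mapped back to dollar amounts with the exponential function. The sketch below assumes the natural logarithm was used, which is consistent with the quoted bounds (ln 10 ≈ 2.302 and ln 10000 ≈ 9.21) but is not stated explicitly. The helper name `log_price_to_dollars` is illustrative and not part of the provided code.

```python
import numpy as np

def log_price_to_dollars(log_price):
    """Invert the log transform, assuming a natural log (an assumption)."""
    return np.exp(np.asarray(log_price, dtype=float))

# The bounds quoted for the training set map to roughly $10 and $10,000.
low, high = log_price_to_dollars([2.302, 9.21])
print(f"{low:.2f} {high:.2f}")
```

Note that the challenge metric is computed on the log prices, so this conversion is only useful for interpreting predictions, not for scoring.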


## Datasets and Codebase

Please download the zip file from the link posted on Piazza/Resources.
In this notebook, we implemented a linear regression model with random weights (**attached at the end**). The dataset consists of 2 CSV files, one for features and one for labels:

    challenge.ipynb (This file: you need to add your code in here, convert it to .py to submit)
    data_cleaned_train_comments_X.csv
    data_cleaned_train_y.csv


## Instructions to build your model
1.  Implement your model in **challenge.ipynb**. You need to modify the *train()* and *predict()* methods of **Model** class (*attached at the end of this notebook*). You can also add other methods/attributes  to the class, or even add new classes in the same file if needed, but do NOT change the signatures of the *train()* and *predict()* as we will call these 2 methods for evaluating your model.

2. To submit, you need to convert your notebook (.ipynb) to a Python **(.py)** file. Make sure the Python file contains a class named **Model** with two methods: *train* and *predict*. Remove any other experimental code where necessary to avoid exceeding the time limit on Gradescope.

3.  You can submit your code on Gradescope to test your model. You can submit as many times as you like. The last submission will count as the final model.

An example linear regression model with random weights is provided to you in this notebook. Please take a look and replace the code with your own.
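
Before uploading, it can help to confirm that the converted file still exposes the expected interface. The following is a minimal sketch using only the standard library. The helper `check_model_interface` is hypothetical (it is not part of the provided codebase), and it assumes the grader passes two positional arguments to *train* and one to *predict*, as in the evaluation snippet.

```python
import inspect

def check_model_interface(model_cls):
    """Return a list of problems; an empty list means the interface looks right."""
    problems = []
    # Number of arguments passed positionally by the caller (besides self).
    expected_args = {"train": 2, "predict": 1}
    for name, n_args in expected_args.items():
        method = getattr(model_cls, name, None)
        if not callable(method):
            problems.append(f"missing method: {name}")
            continue
        params = list(inspect.signature(method).parameters)
        if len(params) != n_args + 1:  # +1 accounts for self
            problems.append(f"{name} should take {n_args} argument(s) besides self")
    return problems


class Model:  # stand-in with the expected shape, for demonstration only
    def train(self, X_train, y_train):
        pass

    def predict(self, X_test):
        return []


print(check_model_interface(Model))
```

The check is deliberately strict: it also flags methods that add extra optional parameters, which may be harmless in practice.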


## Evaluation

We will evaluate your model as follows:

    model = Model() # Model class imported from your submission
    X_train = pd.read_csv("data_cleaned_train_comments_X.csv")  # pandas Dataframe
    y_train = pd.read_csv("data_cleaned_train_y.csv")  # pandas Dataframe
    model.train(X_train, y_train) # train your model on the dataset provided to you
    y_pred = model.predict(X_test) # test your model on the hidden test set (pandas Dataframe)
    mse = mean_squared_error(y_test, y_pred) # compute mean squared error
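
The metric computed in the last line is the mean of the squared differences between labels and predictions. The small sketch below uses made-up numbers (the arrays are illustrative and not taken from the dataset) to show that a manual computation agrees with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([4.0, 5.0, 6.0])  # illustrative log prices
y_pred = np.array([4.5, 5.0, 5.0])  # illustrative predictions

# Squared errors are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3.
manual_mse = np.mean((y_true - y_pred) ** 2)
library_mse = mean_squared_error(y_true, y_pred)
print(manual_mse, library_mse)
```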


**There will be 2 test sets, one is public which means you can see MSE on this test set on the Leaderboard (denoted as *MSE (PUBLIC TESTSET)*), and the other one is hidden during the challenge (denoted as *MSE (HIDDEN TESTSET)*)**.
Your score on the hidden test set will be your performance measure. So, don’t try to overfit your model on the public test set. Your final grade will depend on the following criteria:


1.  	Is it original code (implemented by you)? Use of Generative AI to generate code will be flagged as academic misconduct and will be reported to the Academic Conduct Committee (ACC)
2.  	Does it take a reasonable time to complete?
    Your model needs to finish running in under 40 minutes on our machine. We run the code on a machine with 4 CPUs, 6.0GB RAM.
3.  	Does it achieve a reasonable MSE?
    - **Initial submission (10 pts)**: Your model has to match or beat the simplest baseline, which means an MSE of 0.16 or lower (denoted as Baseline1.csv on the leaderboard). Note that this is due on **April 22**.
    
    The grade will be linearly interpolated for the submissions that lie in between the checkpoints above. We will use MSE on the hidden test set to evaluate your model (lower is better).

    **Bonus**: **Top 3** with the best MSE on the hidden test set will get a 5 point bonus.
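
To get a rough sense of whether a model fits within the time limit, training and prediction can be timed locally. This is a minimal sketch using only the standard library. The helper `timed` is illustrative, and local timings are only a rough guide because the grading machine (4 CPUs, 6.0GB RAM) may be slower or faster than yours.

```python
import time

TIME_LIMIT_SECONDS = 40 * 60  # the 40-minute limit stated in the criteria above

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; replace with e.g. timed(model.train, X_train, y_train).
_, elapsed = timed(sum, range(1_000_000))
print(f"elapsed: {elapsed:.3f}s, within limit: {elapsed < TIME_LIMIT_SECONDS}")
```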

# Answer the below questions (in the final submission due on May 1st)

1. What are the top-5 features that contributed the most to the performance? How did you identify these features? Your answer should be between 300-350 words.

2. What are the top-5 features that contributed the least to the performance? Your answer should be between 300-350 words.

3. Share the training and validation loss plots.
The title of each plot should indicate the number of training / validation data points used.
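
One way to produce such a plot is to record both losses after every epoch and put the split sizes in the title. The sketch below is illustrative only: it uses synthetic data and plain gradient descent on a linear model as a stand-in for whatever model you train, and the output file name `loss_curves.png` is an arbitrary choice.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic stand-in features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_tr, X_va, y_tr, y_va = X[:140], X[140:], y[:140], y[140:]

w = np.zeros(3)
train_losses, val_losses = [], []
for epoch in range(100):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)  # gradient of the training MSE
    w -= 0.05 * grad
    train_losses.append(np.mean((X_tr @ w - y_tr) ** 2))
    val_losses.append(np.mean((X_va @ w - y_va) ** 2))

# The title carries the number of points in each split.
title = f"MSE loss (train n={len(y_tr)}, validation n={len(y_va)})"
fig, ax = plt.subplots()
ax.plot(train_losses, label="train")
ax.plot(val_losses, label="validation")
ax.set_xlabel("epoch")
ax.set_ylabel("MSE")
ax.set_title(title)
ax.legend()
fig.savefig("loss_curves.png")
```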


# Notes & Code

**Note 1: This is a regression problem** in which we want to predict the price of an Airbnb property. You should try different models and tune their hyperparameters. A little feature engineering can also help boost performance.
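
Comparing candidate models is usually more reliable with cross-validation than with a single train/validation split. The sketch below shows one possible approach on synthetic data. The candidate models and hyperparameter values are arbitrary examples, not recommendations, and you would substitute the provided training features and labels.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; substitute the provided training set.
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

candidates = {
    "ridge(alpha=1.0)": Ridge(alpha=1.0),
    "ridge(alpha=10.0)": Ridge(alpha=10.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_mse = {}
for name, estimator in candidates.items():
    # scikit-learn maximizes scores, so MSE is exposed as its negative.
    scores = cross_val_score(estimator, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

# Print candidates from best (lowest MSE) to worst.
for name, mse in sorted(cv_mse.items(), key=lambda item: item[1]):
    print(f"{name}: {mse:.2f}")
```

Which model wins here reflects the synthetic data only, since it is generated from a linear relationship. The ranking on the Airbnb data may differ.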

**Note 2**: You may NOT use additional datasets. This assignment is meant to challenge you to build a better model, not collect more training data, so please only use the data we provided. We tested the code on Python 3.10 and 3.9, so we highly recommend using one of these versions for the challenge.


In this challenge, you may use only built-in Python modules and the following packages:
- numpy
- pandas
- scikit-learn
- matplotlib
- scipy
- torchsummary
- xgboost
- torchmetrics
- lightgbm
- catboost
- torch



In [268]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.optim as optim

class MLPRegressor(nn.Module):
    """Small MLP: one hidden layer of 24 ReLU units followed by dropout."""

    def __init__(self, input_dim):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 24),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(24, 1),
        )

    def forward(self, x):
        return self.model(x)

class Model:
    def __init__(self):
        self.selector = None
        self.scaler = None
        self.model = None
        self.k_features = 100
        self.device = 'mps' if torch.backends.mps.is_available() else 'cpu'

    def train(self, X_train: pd.DataFrame, y_train: pd.DataFrame) -> None:
        if 'id' in X_train.columns:
            X_train = X_train.drop(columns=['id'])
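        # Keep a 1-D view for scikit-learn (avoids the column_or_1d warning)
        # and a column vector for the PyTorch loss.
        y_values = y_train['price'].values
        y_train = y_values.reshape(-1, 1)

        self.selector = SelectKBest(score_func=f_regression, k=self.k_features)
        X_selected = self.selector.fit_transform(X_train, y_values)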

        self.scaler = StandardScaler()
        X_scaled = self.scaler.fit_transform(X_selected)

        X_tensor = torch.tensor(X_scaled, dtype=torch.float32).to(self.device)
        y_tensor = torch.tensor(y_train, dtype=torch.float32).to(self.device)

        self.model = MLPRegressor(input_dim=X_scaled.shape[1]).to(self.device)
        optimizer = optim.Adam(self.model.parameters(), lr=2e-3)
        loss_fn = nn.MSELoss()

        self.model.train()
        for epoch in range(6500):
            optimizer.zero_grad()
            output = self.model(X_tensor)
            loss = loss_fn(output, y_tensor)
            loss.backward()
            optimizer.step()
            if epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

    def predict(self, X_test: pd.DataFrame) -> np.ndarray:
        if 'id' in X_test.columns:
            X_test = X_test.drop(columns=['id'])
        X_selected = self.selector.transform(X_test)
        X_scaled = self.scaler.transform(X_selected)

        X_tensor = torch.tensor(X_scaled, dtype=torch.float32).to(self.device)
        self.model.eval()
        with torch.no_grad():
            y_pred = self.model(X_tensor).cpu().numpy()
        return y_pred
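
The `SelectKBest` step above ranks columns by a univariate F-statistic, and those scores can be inspected to see which features were kept. The sketch below shows this on synthetic data with made-up column names. Note that the scores measure each feature's individual linear association with the target, which is not necessarily the same as its contribution to the trained network.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
# Synthetic stand-in: only "a" and "c" actually drive the target.
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
y = 3.0 * X["a"] - 2.0 * X["c"] + rng.normal(scale=0.1, size=500)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# scores_ holds one F-statistic per input column; higher means stronger association.
ranking = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
# get_support() is a boolean mask over the input columns that were kept.
selected = list(X.columns[selector.get_support()])
print(ranking.index.tolist(), selected)
```

Applied to a fitted `Model`, the same two attributes (`scores_` and `get_support()`) are available on `model.selector`, indexed by the training columns after `id` is dropped.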

In [269]:
# Local testing
X = pd.read_csv("./data/trainData.csv")
y = pd.read_csv("./data/trainLabel.csv")

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
model = Model()
model.train(X_train, y_train)




Epoch 0, Loss: 22.7976
Epoch 100, Loss: 2.3754
Epoch 200, Loss: 1.6926
Epoch 300, Loss: 1.4786
Epoch 400, Loss: 1.3383
Epoch 500, Loss: 1.1868
Epoch 600, Loss: 1.0799
Epoch 700, Loss: 0.9863
Epoch 800, Loss: 0.9212
Epoch 900, Loss: 0.8662
Epoch 1000, Loss: 0.8320
Epoch 1100, Loss: 0.7995
Epoch 1200, Loss: 0.7655
Epoch 1300, Loss: 0.7360
Epoch 1400, Loss: 0.7265
Epoch 1500, Loss: 0.7002
Epoch 1600, Loss: 0.6954
Epoch 1700, Loss: 0.6745
Epoch 1800, Loss: 0.6516
Epoch 1900, Loss: 0.6334
Epoch 2000, Loss: 0.6133
Epoch 2100, Loss: 0.5913
Epoch 2200, Loss: 0.5773
Epoch 2300, Loss: 0.5613
Epoch 2400, Loss: 0.5457
Epoch 2500, Loss: 0.5324
Epoch 2600, Loss: 0.5110
Epoch 2700, Loss: 0.5013
Epoch 2800, Loss: 0.4723
Epoch 2900, Loss: 0.4698
Epoch 3000, Loss: 0.4542
Epoch 3100, Loss: 0.4359
Epoch 3200, Loss: 0.4191
Epoch 3300, Loss: 0.4132
Epoch 3400, Loss: 0.3943
Epoch 3500, Loss: 0.3785
Epoch 3600, Loss: 0.3616
Epoch 3700, Loss: 0.3511
Epoch 3800, Loss: 0.3328
Epoch 3900, Loss: 0.3206
Epoch 4000,

In [270]:

y_pred_train = model.predict(X_train)
# MSE is the default output; the `squared` keyword is deprecated in recent scikit-learn releases.
train_mse = mean_squared_error(y_train['price'], y_pred_train)
print(f"Train MSE: {train_mse:.4f}")
y_pred = model.predict(X_val)
mse = mean_squared_error(y_val['price'], y_pred)
print(f"MSE: {mse:.4f}")

Train MSE: 0.1386
MSE: 0.1465


In [145]:
# Truncate the dataset
part_X_train = X.sample(frac=0.1, random_state=42)
# Select labels by the sampled index so features and labels stay aligned.
part_y_train = y.loc[part_X_train.index]
part_X_train.to_csv("./data/part_train.csv", index=False)
part_y_train.to_csv("./data/part_label.csv", index=False)

In [267]:
model = Model() # Model class imported from your submission
X_train = pd.read_csv("./data/trainData.csv")  # pandas Dataframe
y_train = pd.read_csv("./data/trainLabel.csv")  # pandas Dataframe
model.train(X_train, y_train) # train your model on the dataset provided to you
X_test = pd.read_csv("./data/testingData.csv") # pandas Dataframe
y_pred = model.predict(X_test) # test your model on the hidden test set (pandas Dataframe)
# mse = mean_squared_error(y_test, y_pred) # compute mean squared error

# Keep id and price columns to submission.csv
submission = pd.DataFrame({
    'id': X_test['id'],
    'price': y_pred.flatten()
})
submission.to_csv("./data/submission.csv", index=False)



Epoch 0, Loss: 21.4944
Epoch 100, Loss: 1.7211
Epoch 200, Loss: 1.4094
Epoch 300, Loss: 1.1967
Epoch 400, Loss: 1.0522
Epoch 500, Loss: 0.9292
Epoch 600, Loss: 0.8664
Epoch 700, Loss: 0.7824
Epoch 800, Loss: 0.7195
Epoch 900, Loss: 0.6883
Epoch 1000, Loss: 0.6627
Epoch 1100, Loss: 0.6292
Epoch 1200, Loss: 0.6130
Epoch 1300, Loss: 0.5836
Epoch 1400, Loss: 0.5616
Epoch 1500, Loss: 0.5650
Epoch 1600, Loss: 0.5469
Epoch 1700, Loss: 0.5339
Epoch 1800, Loss: 0.5263
Epoch 1900, Loss: 0.5071
Epoch 2000, Loss: 0.4981
Epoch 2100, Loss: 0.4899
Epoch 2200, Loss: 0.4768
Epoch 2300, Loss: 0.4637
Epoch 2400, Loss: 0.4548
Epoch 2500, Loss: 0.4436
Epoch 2600, Loss: 0.4355
Epoch 2700, Loss: 0.4224
Epoch 2800, Loss: 0.4107
Epoch 2900, Loss: 0.4028
Epoch 3000, Loss: 0.3953
Epoch 3100, Loss: 0.3880
Epoch 3200, Loss: 0.3667
Epoch 3300, Loss: 0.3624
Epoch 3400, Loss: 0.3471
Epoch 3500, Loss: 0.3371
Epoch 3600, Loss: 0.3252
Epoch 3700, Loss: 0.3207
Epoch 3800, Loss: 0.3100
Epoch 3900, Loss: 0.3012
Epoch 4000,

**GOOD LUCK!**
