### __Group Assignment - Predicting Airbnb Listing Prices in Melbourne, Australia__

--- 

**Kaggle Competition Ends:** Friday, 6 June 2025 @ 3:00pm (Week 13)  
**Assignment Due Date on iLearn:** Friday, 6 June 2025 @ 11:55pm (Week 13)   
**Total Marks:** 30

**Overview:**   

- In the group assignment you will form a team of 3 students and participate in a forecasting competition on Kaggle


**Instructions:** 

- Form a team of 3 students 
- Each team member needs to join [https://www.kaggle.com](https://www.kaggle.com/)  
- Choose a team leader and form a team on Kaggle [https://www.kaggle.com/t/fc5974a56165cea945ee1ec182b079af](https://www.kaggle.com/t/fc5974a56165cea945ee1ec182b079af)
    - Team leader to click on `team` and invite other 2 team members to join
    - Your **team's name must start** with our unit code
- All team members should work on all the tasks; however:
    - Each team member will be responsible for one of the 3 tasks listed below
- **Your predictions must be generated by a model you develop here** 
    - You will receive a mark of **zero** if your code is not able to produce the forecasts you submit to Kaggle 

**Competition Rankings**

The rankings for the competition are determined through two different leaderboards:

- **Public Leaderboard Ranking**: Available during the competition, these rankings are calculated based on 50% of the test dataset, which includes 1,500 observations. This allows participants to see how they are performing while the competition is still ongoing.
- **Final Leaderboard Ranking**: These rankings are recalculated from the other 50% of the test dataset, which consists of the remaining 1,500 observations, and are revealed 5 minutes after the competition concludes. This final evaluation determines the ultimate standings of the competition.



**Marks** 

- Assignment: 30 marks consisting of Solutions (27 marks) + Video Presentation (3 marks)
- **Each Student's Mark: 50% x overall assignment mark + 50% x mark for the task that you are responsible for**  



**Submissions:**  

1. On Kaggle: submit your team's forecast in order to be ranked by Kaggle
2. On iLearn: **only the team leader submits** the assignment Jupyter notebook, renamed to your team's name on Kaggle
    - The Jupyter notebook must contain the team members' names/ID numbers, and the group name on Kaggle
    - One 15-minute video recording of your work 
        - 5 marks will be deducted from each Task for which there is no video presentation   


---
---

### <span style="background-color: yellow;">**Fill out the following information**</span>

- Team Name on Kaggle: `(insert here)`
- Team Leader and Team Member 1: `(insert here)`
- Team Member 2: `(insert here)`
- Team Member 3: `(insert here)`

---

## Task 1: Problem Description and Initial Data Analysis

- You must clearly explain all your answers in both the Markdown file and the recorded video.

**Total Marks: 9**   

Based on the Competition Overview, datasets and additional information provided on Kaggle, along with insights gained from personal research of the topic, write **Problem Description** (about 500 words) focusing on the sections listed below: 
1. Forecasting Problem - explain what we are trying to do and how it could be used in the real world, e.g. who and how may benefit from it (2 marks)    
2. Evaluation Criteria - discuss the criterion that is used in this competition to assess forecasting performance, and its pros and cons. (2 marks)     
3. Categorise all variables provided in the dataset according to their type; Hint: similar to what we had in Programming Task 1 (2 marks)  
4. Missing Values - explain what you find for both the training and test datasets at this stage (2 marks)
5. Provide and discuss some interesting *univariate* data characteristics (e.g. summary statistics and plots) in the training dataset  (1 mark)       
- Hints:
    - You should **not** discuss any specific predictive algorithms at this stage


Student in charge of this task: `(insert name here)`
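For items 3 and 4, one possible starting point is a per-column summary of data type and missing-value share for both datasets. The sketch below uses small made-up frames in place of `train.csv` and `test.csv`; the helper name `summarise_columns` and all toy values are illustrative assumptions, not part of the competition data.

```python
import numpy as np
import pandas as pd


def summarise_columns(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Return dtype and missing-value percentages per column for train and test."""
    summary = pd.DataFrame({
        "dtype": train.dtypes.astype(str),
        "train_missing_pct": train.isna().mean() * 100,
        "test_missing_pct": test.isna().mean() * 100,
    })
    # Columns with the most missing training values come first
    return summary.sort_values("train_missing_pct", ascending=False)


# Toy stand-ins for the real datasets (hypothetical values)
toy_train = pd.DataFrame({
    "price": [100.0, 250.0, 80.0, 120.0],
    "bedrooms": [1.0, np.nan, 2.0, 1.0],
    "host_response_rate": ["90%", None, None, "100%"],
})
toy_test = pd.DataFrame({
    "bedrooms": [np.nan, 3.0],
    "host_response_rate": ["75%", "80%"],
})

summary = summarise_columns(toy_train, toy_test)
print(summary)
```

A column that exists only in the training data (here `price`) shows `NaN` in the test column of the summary, which makes the train/test difference easy to spot.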

---

## Task 2: Data Cleaning, Missing Observations and Feature Engineering

- You must clearly explain all your answers in both the Markdown file and the recorded video. 

**Total Marks: 9**

Student in charge of this task: `(insert name here)`

**Task 2, Question 1**: Clean **all** numerical features so that they can be used in training algorithms. For instance, `host_response_rate` feature is in object format containing both numerical values and text. Extract numerical values (or equivalently eliminate the text) so that the numerical values can be used as a regular feature.  
(2 marks)

In [91]:
import pandas as pd

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

# bathrooms
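def parse_bathroom(bath_str):
    """Convert text such as '1 bath' or 'Half-bath' to a float."""
    if pd.isnull(bath_str):
        return None
    bath_str = str(bath_str).lower()
    # 'Half-bath' entries carry no digit, so map them to 0.5 directly
    if 'half' in bath_str:
        return 0.5
    # Keep only digits and the decimal point, e.g. '1.5 baths' -> '1.5'
    digits = ''.join(ch for ch in bath_str if ch.isdigit() or ch == '.')
    try:
        return float(digits)
    except ValueError:
        # No parseable number in the text
        return None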

df_train['bathrooms'] = df_train['bathrooms'].apply(parse_bathroom)
df_test['bathrooms'] = df_test['bathrooms'].apply(parse_bathroom)

# host_response_rate
df_train['host_response_rate'] = df_train['host_response_rate'].str.replace('%', '', regex=False).astype(float)
df_test['host_response_rate'] = df_test['host_response_rate'].str.replace('%', '', regex=False).astype(float)

# host_acceptance_rate
df_train['host_acceptance_rate'] = df_train['host_acceptance_rate'].str.replace('%', '', regex=False).astype(float)
df_test['host_acceptance_rate'] = df_test['host_acceptance_rate'].str.replace('%', '', regex=False).astype(float)



For Task 2, Question 1, we cleaned all numerical features that were incorrectly stored as text. The `bathrooms` column included values like "1 bath" or "Half-bath", which we converted into numeric values such as 1.0 and 0.5 using a custom parsing function. We also cleaned the `host_response_rate` and `host_acceptance_rate` columns by removing the percentage signs and converting the remaining strings into floats. These steps ensure that all relevant numerical features are now in the correct format and ready to be used in training algorithms.
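The percentage cleaning described above assumes every non-missing entry looks like `"95%"`. As a more defensive sketch, the hypothetical helper `percent_to_float` below (not part of the notebook) turns anything it cannot parse, such as a stray `"N/A"` string, into `NaN` instead of raising an error. The sample values are made up.

```python
import numpy as np
import pandas as pd


def percent_to_float(value):
    """Convert a string such as '95%' to 95.0; missing or unparseable values become NaN."""
    if pd.isna(value):
        return np.nan
    try:
        return float(str(value).strip().rstrip('%'))
    except ValueError:
        return np.nan


# Toy column with a missing value, a text placeholder and stray whitespace
toy = pd.Series(["95%", "100%", None, "N/A", " 7% "])
rates = toy.map(percent_to_float)
print(rates.tolist())
```

The resulting `NaN` values can then be handled together with the other missing values in Question 3.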


**Task 2, Question 2**: Create at least 4 new features from existing features which contain multiple items of information.   
(2 marks)

In [89]:
import numpy as np

for df in (df_train, df_test):

    #  new feature that counts how many verification methods the host has provided
    df['host_verifications_count'] = df['host_verifications'].str.count(',') + 1

    # new feature that measures how long the listing has been receiving reviews (in years)
    # This is calculated as the difference between the last and first review dates
    df['first_review'] = pd.to_datetime(df['first_review'], errors='coerce', dayfirst=True)
    df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce', dayfirst=True)
    df['years_of_review_activity'] = (
        (df['last_review'] - df['first_review']).dt.days / 365
    ).round(2)

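    # Binary flag: 1 if any amenity mentions a gym (case-insensitive substring match)
    df['has_gym'] = df['amenities'].str.contains('gym', case=False, na=False).astype(int)

    # Binary flag: 1 if any amenity mentions a kitchen (this also matches e.g. 'Kitchenette')
    df['has_kitchen'] = df['amenities'].str.contains('kitchen', case=False, na=False).astype(int)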

# Display the new features in df_train to verify they were created correctly


df_train[[
    'host_verifications', 'host_verifications_count',
    'first_review', 'last_review', 'years_of_review_activity',
    'amenities', 'has_gym', 'has_kitchen'
]].head()




| | host_verifications | host_verifications_count | first_review | last_review | years_of_review_activity | amenities | has_gym | has_kitchen |
|---|---|---|---|---|---|---|---|---|
| 0 | ['email', 'phone'] | 2 | 2013-03-29 | 2023-02-18 | 9.9 | ["Sukin conditioner", "Extra pillows and blank... | 0 | 0 |
| 1 | ['email', 'phone'] | 2 | 2013-01-12 | 2023-03-08 | 10.16 | ["Extra pillows and blankets", "Laundromat nea... | 0 | 1 |
| 2 | ['email', 'phone'] | 2 | 2015-07-06 | 2022-06-13 | 6.94 | ["Microwave", "Hot tub", "Conditioner", "Smoke... | 0 | 1 |
| 3 | ['email', 'phone'] | 2 | 2011-10-16 | 2012-01-27 | 0.28 | ["Hot tub", "Gym", "Washer", "Dryer", "Kitchen... | 1 | 1 |
| 4 | ['email', 'phone', 'work_email'] | 3 | 2010-11-24 | 2023-03-03 | 12.28 | ["Laundromat nearby", "Private patio or balcon... | 0 | 1 |


`(Task 2, Question 2 Text Here - insert more cells as required)`
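Counting commas and adding one works for the rows shown, but it would report 1 for an empty list such as `"[]"`. A possible alternative, sketched here under the assumption that the column holds Python-style list strings, is to parse each string with `ast.literal_eval` and take its length. The helper name `count_items` and the sample values are hypothetical.

```python
import ast

import pandas as pd


def count_items(list_string) -> int:
    """Count entries in a string such as "['email', 'phone']"; 0 if empty or unparseable."""
    if pd.isna(list_string):
        return 0
    try:
        parsed = ast.literal_eval(list_string)
    except (ValueError, SyntaxError):
        return 0
    return len(parsed) if isinstance(parsed, (list, tuple)) else 0


toy = pd.Series(["['email', 'phone']", "[]", None, "['email', 'phone', 'work_email']"])
counts = toy.map(count_items)
print(counts.tolist())  # [2, 0, 0, 3]
```

The same helper could be applied to `amenities` to obtain an amenity count, if that column follows the same list-like format.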

**Task 2, Question 3**: Impute the missing values for all features in both the training and test datasets.   
(2 marks)

In [104]:

# Numeric columns: fill with the training median (the same value is reused for the test set)
numeric_cols = df_train.select_dtypes(include='number').columns
for col in numeric_cols:
    med = df_train[col].median()
    df_train[col] = df_train[col].fillna(med)
    if col in df_test.columns:  # e.g. the target is not expected in the test set
        df_test[col] = df_test[col].fillna(med)

# Free-text columns: a missing value becomes an empty string
text_cols = [
    'name', 'description', 'neighborhood_overview',
    'host_about', 'host_location', 'amenities', 'host_verifications'
]
for col in text_cols:
    df_train[col] = df_train[col].fillna('')
    df_test[col] = df_test[col].fillna('')

# Date columns: fill with the most frequent training date
date_cols = ['host_since', 'first_review', 'last_review']
for col in date_cols:
    mode_val = df_train[col].mode(dropna=True)[0]
    df_train[col] = df_train[col].fillna(mode_val)
    df_test[col] = df_test[col].fillna(mode_val)

# Remaining categorical (object) columns: fill with the training mode
cat_cols = [
    c for c in df_train.select_dtypes(include=['object']).columns
    if c not in text_cols
]
for col in cat_cols:
    mode_val = df_train[col].mode(dropna=True)[0]
    df_train[col] = df_train[col].fillna(mode_val)
    df_test[col] = df_test[col].fillna(mode_val)

# Check for any remaining missing values in either dataset
missing = pd.DataFrame({
    'train_missing': df_train.isnull().sum(),
    'test_missing':  df_test.isnull().sum()
})
print(missing[missing.sum(axis=1) > 0])



Empty DataFrame
Columns: [train_missing, test_missing]
Index: []


`(Task 2, Question 3 Text Here - insert more cells as required)`
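The imputation cell learns every fill value from `df_train` and reuses it on `df_test`, so no information from the test set influences the statistics. The toy sketch below illustrates that principle in isolation; the helper `learn_fill_values` and the sample frames are hypothetical, and the helper assumes each column has at least one non-missing training value.

```python
import numpy as np
import pandas as pd


def learn_fill_values(train: pd.DataFrame) -> dict:
    """Median for numeric columns, most frequent value otherwise, from training data only."""
    fills = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fills[col] = train[col].median()
        else:
            fills[col] = train[col].mode(dropna=True).iloc[0]
    return fills


# Toy stand-ins for the real datasets (hypothetical values)
toy_train = pd.DataFrame({
    "beds": [1.0, 2.0, np.nan, 5.0],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", None],
})
toy_test = pd.DataFrame({
    "beds": [np.nan, 3.0],
    "room_type": [None, "Private room"],
})

fills = learn_fill_values(toy_train)
toy_train_filled = toy_train.fillna(fills)
toy_test_filled = toy_test.fillna(fills)  # test rows receive the training statistics
print(fills)
```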

**Task 2, Question 4**: Encode all categorical variables appropriately as discussed in class. 

- Where multiple values are given for an observation encode the observation as 'other'. 
- Where a categorical feature contains more than 5 unique values, map the features into 5 most frequent values + 'other' and then encode appropriately.  
(2 marks)

In [117]:

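# cat_cols is the list of categorical columns built in Question 3
# Learn the 5 most frequent categories of each column from the training data only
top5 = {col: df_train[col].value_counts().nlargest(5).index for col in cat_cols}

# Map every category outside the training top 5 to 'other' in both datasets
for col in cat_cols:
    df_train[col] = df_train[col].where(df_train[col].isin(top5[col]), 'other')
    df_test[col] = df_test[col].where(df_test[col].isin(top5[col]), 'other')

# One-hot encode; the first level of each feature is dropped as the reference category
df_train_encoded = pd.get_dummies(df_train, columns=cat_cols, drop_first=True)

# Encode the test set without drop_first, then keep exactly the training feature columns
# so both datasets share the same dummy columns (the target 'price' is excluded)
df_test_encoded = pd.get_dummies(df_test, columns=cat_cols)
df_test_encoded = df_test_encoded.reindex(
    columns=df_train_encoded.columns.drop('price', errors='ignore'), fill_value=0
)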


`(Task 2, Question 4 Text Here - insert more cells as required)`
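The "5 most frequent values + 'other'" rule can be checked on a toy column. The sketch below is illustrative only: the column name `suburb`, the helper `collapse_rare` and the sample values are assumptions. It also shows one pitfall: calling `get_dummies(..., drop_first=True)` separately on the test set may drop a different reference level than in training, so here the test set is encoded in full and then reindexed to the training columns.

```python
import pandas as pd


def collapse_rare(train_col: pd.Series, test_col: pd.Series, top_n: int = 5):
    """Keep the top_n most frequent training categories; map everything else to 'other'."""
    keep = train_col.value_counts().nlargest(top_n).index
    return (
        train_col.where(train_col.isin(keep), "other"),
        test_col.where(test_col.isin(keep), "other"),
    )


# Training counts: A=3, B=3, C=2, D=2, E=2, F=1, G=1 -> F and G become 'other'
toy_train = pd.DataFrame({"suburb": list("AAABBBCCDDEEFG")})
toy_test = pd.DataFrame({"suburb": ["B", "Z", "G"]})  # 'Z' never appears in training

toy_train["suburb"], toy_test["suburb"] = collapse_rare(
    toy_train["suburb"], toy_test["suburb"]
)

# Training: drop the first level ('A') as the reference category
train_enc = pd.get_dummies(toy_train, columns=["suburb"], drop_first=True)

# Test: encode all levels, then keep exactly the training columns
test_enc = pd.get_dummies(toy_test, columns=["suburb"])
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc.astype(int))
```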

**Task 2, Question 5**: Perform any additional data preparation steps you consider necessary before building your predictive models, and clearly explain each action you take.  
(1 mark)

In [114]:
## Task 2, Question 5 Code Here

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1) Drop pure identifiers & high-cardinality text
#    • 'ID' and 'host_name' carry no predictive signal
#    • 'name' and 'description' are ultra-high-cardinality
#    • Keep a copy of the test IDs first: they are likely needed for the submission file
test_ids = df_test['ID'].copy()
drop_cols = ['ID', 'host_name', 'name', 'description']
df_train = df_train.drop(columns=drop_cols)
df_test = df_test.drop(columns=drop_cols)

# 2) Engineer price per person (training data only)
#    • 'price' is the target, so test.csv is assumed not to contain it
#    • This column is derived from the target and must not be used as a model input
df_train['price_per_person'] = df_train['price'] / df_train['accommodates']

# 3) Log-transform the skewed price-based columns
#    • Price and price_per_person are right-skewed → log1p smooths them
for col in ['price', 'price_per_person']:
    df_train[f'log_{col}'] = np.log1p(df_train[col])
target_cols = ['price', 'price_per_person', 'log_price', 'log_price_per_person']

# 4) One-hot encode key categorical variables
#    • Converts categories into model-friendly numeric columns
#    • The test set is encoded in full so the reference level is decided by the training data
to_dummies = ['room_type', 'property_type', 'instant_bookable', 'host_is_superhost']
df_train = pd.get_dummies(df_train, columns=to_dummies, drop_first=True)
df_test = pd.get_dummies(df_test, columns=to_dummies)

#  ➤ Align the test columns to the training feature columns (target columns excluded)
feature_cols = [c for c in df_train.columns if c not in target_cols]
df_test = df_test.reindex(columns=feature_cols, fill_value=0)

# 5) Scale numeric features
#    • Brings all numerics roughly onto the same scale for many algorithms
#    • The scaler is fitted on the training data only and then applied to the test data
scaler = StandardScaler()
num_feats = [
    'accommodates', 'bathrooms', 'bedrooms', 'beds',
    'days_since_first_review', 'years_of_review_activity'
]
#    • Keep only the columns that actually exist in both datasets
num_feats = [c for c in num_feats if c in df_train.columns and c in df_test.columns]
df_train[num_feats] = scaler.fit_transform(df_train[num_feats])
df_test[num_feats] = scaler.transform(df_test[num_feats])


**Why these steps?**

- **Drop IDs and high-cardinality text:** Columns such as `ID`, `host_name`, and the free-form `name` and `description` do not generalize to new listings and can inject noise.
- **Price per person:** Raw `price` favors large-capacity homes; dividing by `accommodates` captures the per-guest cost. Because this quantity is derived from the target, it can only be computed on the training data and is useful for exploring the target, not as a model input.
- **Log transform:** Reduces the right skew in the price-based columns, making their distributions closer to Gaussian, which helps many models.
- **One-hot encoding:** Converts categorical features into binary columns so they can be used in linear or tree-based models.
- **Scaling:** Centers and scales the numeric inputs, which matters for algorithms that are sensitive to feature magnitude (e.g. KNN, SVM, regularized regressions).
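One practical consequence of the log transform: if a model is trained on `log1p(price)`, its predictions are on the log scale and need to be mapped back with `expm1` before they are written to a submission file. A minimal sketch with made-up prices:

```python
import numpy as np

prices = np.array([50.0, 120.0, 300.0, 2500.0])  # hypothetical nightly prices
log_prices = np.log1p(prices)                     # scale a model would be trained on
recovered = np.expm1(log_prices)                  # back-transform to the price scale

print(log_prices.round(3))
print(np.allclose(recovered, prices))
```

`log1p` and `expm1` are exact inverses up to floating-point error, so the round trip returns the original prices.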

--- 
## Task 3: Fit and tune predictive models, submit predictions & win competition

- You must clearly explain all your answers in both the Markdown file and the recorded video.

**Total Marks: 9**

For this task, you should not create any new features and must rely on the variables constructed in Task 2.  
 

1. Perform some EDA to measure the relationship between the features and the target variable, and carefully explain your findings. (2 marks)

2. Choose and carefully explain 3 different machine learning (ML) regression models that you will apply in this competition. (2 marks)
   
3. Train the models from the above question and tune their hyperparameters via cross-validation. Discuss the fitted weights, optimised hyperparameter values, and their training dataset predictive performance. (2 marks)   

4. Select your best model, create predictions of the test dataset and submit your forecasts on Kaggle's competition page. Provide Kaggle ranking and score (screenshots) and comment on your performance in the competition. (2 marks)

5. Suggest ways to improve your ranking and implement them, providing further evidence from Kaggle (screenshots). (1 mark)   

- Hints:
    - Make sure your Python code works so that your results can be replicated by the marker
    - You will receive the mark of zero for this Task if your code does not produce the forecasts uploaded to Kaggle



Student in charge of this task: `(insert name here)`

In [None]:
#Task 3 code here

`(Task 3 - insert more cells as required)`
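As a starting point for item 3, hyperparameter tuning via cross-validation could look like the sketch below. Everything in it is an assumption made for illustration: the data are synthetic, Ridge regression is just one candidate model, the `alpha` grid is arbitrary, and RMSE is used as the score only as a placeholder for whatever criterion the competition specifies.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Task 2 feature matrix and the (log) price target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Scaling sits inside the pipeline so it is re-fitted on each training fold
pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    pipeline, param_grid, cv=cv, scoring="neg_root_mean_squared_error"
)
search.fit(X, y)

print("best alpha:", search.best_params_["ridge__alpha"])
print("cross-validated RMSE:", -search.best_score_)
print("fitted weights:", search.best_estimator_.named_steps["ridge"].coef_)
```

Keeping the scaler inside the pipeline means each validation fold is scaled with statistics from its own training folds, which avoids leaking validation data into the fit.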

---
---
## Marking Criteria

To receive full marks your solutions must satisfy the following criteria:

- Problem Description: 9 marks
- Data Cleaning: 9 marks
- Building Forecasting models: 9 marks
- Video Presentation: 3 marks
    - Duration less than 15 min, presentation skill and content 
    - Each team member delivers a 5-minute presentation on their assigned task
    - All assignment questions must be discussed on video  
    - Your code must be readable on the video
    - Discuss both the actions you took and, more importantly, the reasoning behind these actions, explaining the significance of key steps
    - During the video recording, make sure that both your face and Jupyter Notebook are clearly visible
- Forecasts correctly uploaded to Kaggle
- Python code is clean and concise
- Written explanations are provided in clear and easy to understand sentences
- The assignment notebook is well-organised and easy to follow
- Failure to meet the above marking criteria will result in a deduction of marks

---
---