# Hackathon: Predicting Airbnb rental prices using ML

### What is a hackathon?

A hackathon is an event in which participants use technology, primarily coding, to accomplish an objective. In this case, the objective is to develop a simple machine learning regression model.

<div class="alert alert-success">
<b>About Dataset</b>

<u>Context</u>

This dataset provides an in-depth look into Airbnb accommodations across diverse locations in the USA. From cozy apartments in vibrant urban centers to tranquil retreats in scenic rural areas, it offers valuable insights into the wide range of lodging options available on the platform. With detailed information on property features, pricing, reviews, and host profiles, this dataset enables AI enthusiasts to explore the ecosystem of shared lodging and the trends, patterns, and preferences that shape it. Whether examining market trends, evaluating the impact of tourism on local economies, or simply exploring travel patterns, this dataset serves as a practical resource for understanding Airbnb accommodations.

<u>Content</u>

The dataset includes the following columns:

<ol>
<li>log_price</li>
<li>property_type</li>
<li>amenities</li>
<li>accommodates</li>
<li>bathrooms</li>
<li>bed_type</li>
<li>cancellation_policy</li>
<li>city</li>
<li>first_review</li>
<li>host_has_profile_pic</li>
<li>host_identity_verified</li>
<li>host_response_rate</li>
<li>host_since</li>
<li>latitude</li>
<li>longitude</li>
<li>name</li>
<li>neighbourhood</li>
<li>number_of_reviews</li>
<li>review_scores_rating</li>
<li>zipcode</li>
<li>bedrooms</li>
<li>beds</li>
</ol>

<u>Inspiration</u>

The goal is to apply machine learning regression algorithms to predict the price variable based on property features, reviews, and host profiles.
    
</div>
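
The `log_price` column listed above is stored on a logarithmic scale. As a minimal sketch, assuming `log_price` is the natural logarithm of the nightly price (the dataset description does not state the base), a value can be converted back to a price as shown below. The helper names are illustrative and not part of the assignment.

```python
import math

def log_price_to_price(log_price: float) -> float:
    """Invert a natural-log transform (assumption: log_price = ln(price))."""
    return math.exp(log_price)

def price_to_log_price(price: float) -> float:
    """Apply the natural-log transform to a strictly positive price."""
    return math.log(price)

# First log_price value shown in the data preview of this notebook.
example_log_price = 5.010635
price = log_price_to_price(example_log_price)
print(f"log_price={example_log_price} -> price={price:.2f}")

# The two helpers are inverses of each other (up to floating-point error).
round_trip = price_to_log_price(price)
```

Whichever scale is used for training, errors should be reported on a clearly stated scale, because an error of 0.1 in log units is not the same as an error of 0.1 in currency units.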

<div class="alert alert-info"><b>Task</b>

Pricing is often a challenging task for Airbnb hosts. The goal of this assignment is to demonstrate how machine learning can be applied to address this challenge. To achieve this, follow these steps: load the dataset, identify the target variable, convert it if necessary, explore and preprocess the data, build a regression model, and evaluate its performance using an appropriate metric.

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed, but do not remove the answer cell.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
 
</div>
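
For the evaluation step, two common regression metrics are the root mean squared error (RMSE) and the coefficient of determination (R²). The sketch below computes both from scratch with made-up numbers, purely to show what the metrics measure; in the notebook itself, `mean_squared_error` and `r2_score` from `sklearn.metrics` would normally be used instead.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up log-scale targets and predictions, for illustration only.
y_true = [5.0, 4.5, 6.0, 4.0]
y_pred = [4.8, 4.7, 5.5, 4.3]

rmse_value = rmse(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print(f"RMSE = {rmse_value:.4f}, R^2 = {r2_value:.4f}")
```

RMSE is expressed in the units of the target, so it is easy to interpret, while R² is unitless and makes models trained on the same target easier to compare.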

<div class="alert alert-warning"><b>Hints</b>

- Clearly define your target variable and choose the appropriate performance metric for evaluation.  
- The `amenities` column contains valuable information; be sure to use it wisely.  
- If you encounter columns with date or time information, consider converting them to datetime format if they are relevant to the task.

</div>
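
The hints on amenities and dates can be sketched as follows. The example assumes that every `amenities` value has the brace-delimited form visible in the data preview (for example `{TV,"Cable TV",Internet}`) and that dates use the `YYYY-MM-DD` format; the function name, the small amenity vocabulary, and the derived feature names are hypothetical. Only the standard library is used here; on a DataFrame column, `pd.to_datetime` would play the role of `datetime.strptime`.

```python
import csv
from datetime import datetime

def parse_amenities(raw: str) -> list[str]:
    """Split a string like '{TV,"Cable TV",Internet}' into amenity names."""
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # csv.reader handles the double-quoted entries that contain spaces.
    return next(csv.reader([inner]))

raw = '{TV,"Cable TV",Internet,"Wireless Internet"}'
amenities = parse_amenities(raw)
# -> ['TV', 'Cable TV', 'Internet', 'Wireless Internet']

# Turn a chosen (hypothetical) vocabulary into 0/1 indicator features.
vocabulary = ["TV", "Kitchen", "Wireless Internet"]
features = {
    f"has_{name.lower().replace(' ', '_')}": int(name in amenities)
    for name in vocabulary
}

# Dates become usable once parsed, e.g. as a number of elapsed days.
earlier = datetime.strptime("2016-07-18", "%Y-%m-%d")
later = datetime.strptime("2017-09-23", "%Y-%m-%d")
days_between = (later - earlier).days

print(amenities, features, days_between)
```

Simple derived features such as the number of amenities (`len(amenities)`) or the days since the last review are often easier for a regression model to use than the raw strings.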

<div class="alert alert-danger"><b>Submission deadline:</b> Monday, November 25th, 12:00

Do not over-complicate your code. Start with a simple working solution and refine it if you have time.
</div>


In [2]:
pip install numpy scipy

Collecting scipy
  Using cached scipy-1.14.1.tar.gz (58.6 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [53 lines of output]
      + meson setup /private/var/folders/tl/rtlxkzzj66z03_lkg34h6wq80000gn/T/pip-install-xo4rdmwl/scipy_e952693941144f3d8de700b66f57e6b5 /private/var/folders/tl/rtlxkzzj66z03_lkg34h6wq80000gn/T/pip-install-xo4rdmwl/scipy_e952693941144f3d8de700b66f57e6b5/.mesonpy-73mdtaxz -Dbuildtype=release -Db_ndebug=if-release -Db_vscrt=md --native-file=/private/var/folders/tl/rtlxkzzj66z03_lkg34h6wq80000gn/T/pip-install-xo4rdmwl/scipy_e952693941144f3d8

In [10]:
conda install scikit-learn

Channels:
 - defaults
 - conda-forge
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda3/envs/myenv

  added / updated specs:
    - scikit-learn


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    scikit-learn-1.5.1         |  py312hd77ebd4_0         9.7 MB
    scipy-1.14.1               |  py312ha409365_0        22.2 MB
    threadpoolctl-3.5.0        |  py312h989b03a_0          49 KB
    ------------------------------------------------------------
                                           Total:        31.9 MB

The following NEW packages will be INSTALLED:

  blas               pkgs/main/osx-arm64::blas-1.0-openblas 
  joblib             pkgs/main/osx-arm64::joblib-1.4.2-py312hca03da5_0 
  libgfortran        pkgs/main/osx-arm64::libgfortran-5.0.0-11_3_0_hca03da5_28 
  libgfortran5     

In [11]:
import numpy as np
import pandas as pd
from sklearn import set_config

set_config(transform_output="pandas")
data = pd.read_csv('https://github.com/jnin/information-systems/raw/refs/heads/main/data/Airbnb_Hackathon_IA1_2024.zip', compression='zip').convert_dtypes()
data.head()

Unnamed: 0,log_price,property_type,room_type,amenities,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,city,...,last_review,latitude,longitude,name,neighbourhood,number_of_reviews,review_scores_rating,zipcode,bedrooms,beds
0,5.010635,Apartment,Entire home/apt,"{""Wireless Internet"",""Air conditioning"",Kitche...",3,1.0,Real Bed,strict,True,NYC,...,2016-07-18,40.696524,-73.991617,Beautiful brownstone 1-bedroom,Brooklyn Heights,2,100.0,11201.0,1,1
1,5.129899,Apartment,Entire home/apt,"{""Wireless Internet"",""Air conditioning"",Kitche...",7,1.0,Real Bed,strict,True,NYC,...,2017-09-23,40.766115,-73.98904,Superb 3BR Apt Located Near Times Square,Hell's Kitchen,6,93.0,10019.0,3,3
2,6.620073,House,Entire home/apt,"{TV,""Cable TV"",Internet,""Wireless Internet"",Ki...",4,1.0,Real Bed,flexible,True,SF,...,,37.772004,-122.431619,Beautiful Flat in the Heart of SF!,Lower Haight,0,,94117.0,2,2
3,4.442651,Apartment,Private room,"{TV,""Wireless Internet"",Heating,""Smoke detecto...",2,1.0,Real Bed,strict,True,SF,...,2017-09-05,37.753164,-122.429526,Comfort Suite San Francisco,Noe Valley,3,100.0,94131.0,1,1
4,4.418841,Apartment,Entire home/apt,"{TV,Internet,""Wireless Internet"",""Air conditio...",3,1.0,Real Bed,moderate,True,LA,...,2017-04-21,33.980454,-118.462821,Beach Town Studio and Parking!!!11h,,15,97.0,90292.0,1,1


In [13]:
pip install matplotlib seaborn

Collecting matplotlib
  Downloading matplotlib-3.9.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (11 kB)
Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.3.1-cp312-cp312-macosx_11_0_arm64.whl.metadata (5.4 kB)
Collecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.55.0-cp312-cp312-macosx_10_13_universal2.whl.metadata (164 kB)
Collecting kiwisolver>=1.3.1 (from matplotlib)
  Downloading kiwisolver-1.4.7-cp312-cp312-macosx_11_0_arm64.whl.metadata (6.3 kB)
Collecting pillow>=8 (from matplotlib)
  Downloading pillow-11.0.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (9.1 kB)
Collecting pyparsing>=2.3.1 (from matplotlib)
  Downloading pyparsing-3.2.0-py3-none-any.whl.metadata (5.0 kB)
Downloading matplotlib-3.9.2-cp312-cp312-macosx_11_0_arm64.whl (7.8 MB)

In [15]:
pip install --upgrade numpy

Collecting numpy
  Using cached numpy-2.1.3-cp312-cp312-macosx_11_0_arm64.whl.metadata (62 kB)
Using cached numpy-2.1.3-cp312-cp312-macosx_11_0_arm64.whl (13.5 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
Successfully installed numpy-2.1.3
Note: you may need to restart the kernel to use updated packages.


In [17]:
pip install numpy pandas matplotlib seaborn scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [18]:
pip install numpy==1.23.5 pandas==1.5.3 matplotlib==3.6.2 seaborn==0.12.2 scikit-learn==1.2.2

Collecting numpy==1.23.5
  Downloading numpy-1.23.5.tar.gz (10.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.7/10.7 MB 2.6 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [33 lines of output]
      Traceback (most recent call last):
        File "/opt/anaconda3/envs/myenv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/opt/anaconda3/envs/myenv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwarg

In [16]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

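# Load the dataset from the same zipped CSV used in the earlier cell,
# so this cell does not depend on a local copy of the file
data = pd.read_csv('https://github.com/jnin/information-systems/raw/refs/heads/main/data/Airbnb_Hackathon_IA1_2024.zip', compression='zip')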

# Display the first few rows
print("Dataset Overview:")
print(data.head())

# Summary statistics
print("\nDataset Summary:")
print(data.describe())

# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
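
The `ValueError` above usually means that a compiled package (here most likely pandas or scikit-learn) was built against a different NumPy version than the one currently installed, which would be consistent with the earlier cells mixing `conda install` with `pip install --upgrade numpy`. As a first diagnostic step, the sketch below lists the installed versions by reading package metadata, so it works even when importing the packages fails. The function name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Packages this notebook tries to install or import.
packages = ["numpy", "pandas", "scipy", "scikit-learn", "matplotlib", "seaborn"]
report = installed_versions(packages)

for name, ver in report.items():
    print(f"{name:>14}: {ver or 'not installed'}")
```

If the report shows a NumPy 2.x release next to packages that were built for NumPy 1.x, one likely remedy is to reinstall the whole stack with a single installer (either conda or pip, not both) in a fresh environment and then restart the kernel.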