<a href="https://colab.research.google.com/github/RochX/comp486-assignments/blob/main/project/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Project
## New York City AirBNB prices

In [1]:
from pathlib import Path
import pandas as pd
import urllib.request

# download the data if it is not downloaded
if not Path("new_york_listings_2024.csv").is_file():
  # here I download the data from my personal git repo for this class instead of using Google Drive
  url = "https://raw.githubusercontent.com/RochX/comp486-assignments/main/project/new_york_listings_2024.csv"
  urllib.request.urlretrieve(url, "new_york_listings_2024.csv")

new_york_listings_2024 = pd.read_csv("new_york_listings_2024.csv")

Get some information on the data.

In [2]:
new_york_listings_2024.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20758 non-null  int64  
 1   name                            20758 non-null  object 
 2   host_id                         20758 non-null  int64  
 3   host_name                       20758 non-null  object 
 4   neighbourhood_group             20758 non-null  object 
 5   neighbourhood                   20758 non-null  object 
 6   latitude                        20758 non-null  float64
 7   longitude                       20758 non-null  float64
 8   room_type                       20758 non-null  object 
 9   price                           20758 non-null  float64
 10  minimum_nights                  20758 non-null  int64  
 11  number_of_reviews               20758 non-null  int64  
 12  last_review                     

Take a look at the categorical features listed (besides `name` and `host_name`).

In [3]:
for col in list(new_york_listings_2024.drop(["name", "host_name"], axis=1).select_dtypes(include=["object"])):
  if col != "baths":
    print(new_york_listings_2024.value_counts(col), end="\n\n\n")
  else:
    print(new_york_listings_2024.value_counts(col))

neighbourhood_group
Manhattan        8038
Brooklyn         7719
Queens           3761
Bronx             949
Staten Island     291
dtype: int64


neighbourhood
Bedford-Stuyvesant            1586
Harlem                        1063
Williamsburg                   969
Midtown                        942
Hell's Kitchen                 867
                              ... 
Bay Terrace, Staten Island       1
Navy Yard                        1
Lighthouse Hill                  1
Chelsea, Staten Island           1
Neponsit                         1
Length: 221, dtype: int64


room_type
Entire home/apt    11549
Private room        8804
Shared room          293
Hotel room           112
dtype: int64


last_review
2023-09-04    326
2023-12-03    255
2023-12-17    244
2023-09-05    223
2023-11-30    212
             ... 
2020-04-17      1
2020-04-16      1
2020-04-15      1
2020-04-14      1
2021-03-21      1
Length: 1878, dtype: int64


license
No License            17569
Exempt                 2135


Some of these "categorical" features are actually numeric. They are stored as strings only because missing or special values are encoded as text.
These features are `rating`, `bedrooms`, and `baths`.
We can make the following corrections for each feature:
- `rating`: Convert `No rating` and `New ` into `NaN`.
- `bedrooms`: Convert `Studio` into `1`. A studio is essentially a one-bedroom apartment in which the bed is not in its own room.
- `baths`: Convert `Not specified` into `NaN`.

Let's clean the data accordingly.

In [5]:
# work on a copy so the raw DataFrame is left untouched
cleaned_data = new_york_listings_2024.copy()
cleaned_data["rating"] = new_york_listings_2024["rating"].replace({"No rating": None, "New ": None})
cleaned_data["bedrooms"] = new_york_listings_2024["bedrooms"].replace({"Studio": "1"})
cleaned_data["baths"] = new_york_listings_2024["baths"].replace({"Not specified": None})

obj_to_num_cols = ["rating", "bedrooms", "baths"]
cleaned_data[obj_to_num_cols] = cleaned_data[obj_to_num_cols].apply(pd.to_numeric)

cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20758 non-null  int64  
 1   name                            20758 non-null  object 
 2   host_id                         20758 non-null  int64  
 3   host_name                       20758 non-null  object 
 4   neighbourhood_group             20758 non-null  object 
 5   neighbourhood                   20758 non-null  object 
 6   latitude                        20758 non-null  float64
 7   longitude                       20758 non-null  float64
 8   room_type                       20758 non-null  object 
 9   price                           20758 non-null  float64
 10  minimum_nights                  20758 non-null  int64  
 11  number_of_reviews               20758 non-null  int64  
 12  last_review                     
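
The same clean-up can be tried on a small hand-made frame to see what the conversion does. The sketch below is an alternative to the explicit replacements above, not the method used in this notebook: it relies on `pd.to_numeric(..., errors="coerce")`, which turns any string that cannot be parsed into `NaN`. The frame `toy` and all of its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame mimicking the three mixed-type columns (values are made up).
toy = pd.DataFrame({
    "rating": ["4.8", "No rating", "New ", "5.0"],
    "bedrooms": ["2", "Studio", "1", "3"],
    "baths": ["1", "1.5", "Not specified", "2"],
})

# "Studio" carries real information, so map it to "1" before converting.
toy["bedrooms"] = toy["bedrooms"].replace({"Studio": "1"})

# Every remaining unparseable string ("No rating", "New ", "Not specified") becomes NaN.
toy = toy.apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
print(toy.isna().sum())
```

One trade-off of `errors="coerce"` is that it also silently converts any unexpected string to `NaN`, so the explicit `replace` calls are the safer choice when the set of placeholder values is known.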

Now that `rating`, `bedrooms`, and `baths` are numeric, let's look at how each numeric feature correlates with `price`.

In [6]:
# numeric_only=True skips the string columns, which newer pandas versions no longer drop silently
cleaned_data.corr(numeric_only=True)["price"]

id                                0.002372
host_id                          -0.005987
latitude                         -0.001143
longitude                        -0.033460
price                             1.000000
minimum_nights                   -0.006527
number_of_reviews                -0.012588
reviews_per_month                -0.009917
calculated_host_listings_count   -0.007333
availability_365                  0.020151
number_of_reviews_ltm            -0.011263
rating                           -0.004692
bedrooms                          0.074036
beds                              0.066882
baths                             0.066048
Name: price, dtype: float64
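
All of these coefficients are close to zero. `corr` computes the Pearson coefficient by default, which is sensitive to extreme values, so a few very large prices may be part of the reason. The sketch below uses made-up numbers (not this dataset) to show how one extreme price can distort a Pearson coefficient, while the rank-based Spearman coefficient, computed here as the Pearson correlation of the ranks, is affected less.

```python
import pandas as pd

# Made-up listings: price rises steadily with the number of bedrooms.
clean = pd.DataFrame({
    "bedrooms": [1, 1, 2, 2, 3, 3, 4, 4],
    "price": [100.0, 110.0, 150.0, 160.0, 200.0, 210.0, 250.0, 260.0],
})

# Same data plus one hypothetical 1-bedroom listing with an extreme price.
outlier = pd.DataFrame({"bedrooms": [1], "price": [100000.0]})
with_outlier = pd.concat([clean, outlier], ignore_index=True)

# Pearson correlation without and with the extreme listing.
pearson_clean = clean["bedrooms"].corr(clean["price"])
pearson_outlier = with_outlier["bedrooms"].corr(with_outlier["price"])

# Spearman correlation is the Pearson correlation of the ranks.
ranks = with_outlier.rank()
spearman_outlier = ranks["bedrooms"].corr(ranks["price"])

print(pearson_clean, pearson_outlier, spearman_outlier)
```

Whether this effect explains the weak correlations in the real data is an assumption that would need to be checked against the actual price distribution.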

In [28]:
cleaned_data.drop(["id", "host_id"], axis=1).describe()

Unnamed: 0,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,rating,bedrooms,beds,baths
count,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0,17006.0,20758.0,20758.0,20745.0
mean,40.7268,-73.93916,187.77662,28.55844,42.6426,1.25791,18.84411,205.99032,10.85211,4.73425,1.39416,1.72372,1.17799
std,0.06029,0.0614,1022.79721,33.53652,73.56165,1.90466,70.91083,135.08777,21.35707,0.29439,0.78812,1.21227,0.48046
min,40.50031,-74.24984,10.0,1.0,1.0,0.01,1.0,0.0,0.0,1.75,1.0,1.0,0.0
25%,40.68415,-73.98071,80.0,30.0,4.0,0.21,1.0,87.0,1.0,4.64,1.0,1.0,1.0
50%,40.72282,-73.94959,125.0,30.0,14.0,0.65,2.0,215.0,3.0,4.81,1.0,1.0,1.0
75%,40.7631,-73.91746,199.0,30.0,49.0,1.8,5.0,353.0,15.0,4.93,2.0,2.0,1.0
max,40.91115,-73.71365,100000.0,1250.0,1865.0,75.49,713.0,365.0,1075.0,5.0,15.0,42.0,15.5


In [27]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
import numpy as np

numeric_feature_names = list(cleaned_data.select_dtypes(include=["int64", "float64"]).drop(["id", "host_id"], axis=1))

scaler = make_column_transformer(
    (StandardScaler(), numeric_feature_names),
    remainder='passthrough'
)
scaler.set_output(transform="pandas")
pd.set_option('display.float_format', lambda x: '%.5f' % x)

scaler.fit_transform(cleaned_data)

Unnamed: 0,standardscaler__latitude,standardscaler__longitude,standardscaler__price,standardscaler__minimum_nights,standardscaler__number_of_reviews,standardscaler__reviews_per_month,standardscaler__calculated_host_listings_count,standardscaler__availability_365,standardscaler__number_of_reviews_ltm,standardscaler__rating,...,standardscaler__baths,remainder__id,remainder__name,remainder__host_id,remainder__host_name,remainder__neighbourhood_group,remainder__neighbourhood,remainder__room_type,remainder__last_review,remainder__license
0,-0.71466,-0.41446,-0.12982,0.04299,-0.53892,-0.64470,-0.25165,-1.52490,-0.50814,0.90274,...,,1312228,Rental unit in Brooklyn · ★5.0 · 1 bedroom,7130382,Walter,Brooklyn,Clinton Hill,Private room,2015-12-20,No License
1,0.66031,-0.79703,-0.04280,0.04299,-0.45735,-0.53444,1.69451,1.16971,-0.41449,-0.21827,...,-0.37048,45277537,Rental unit in New York · ★4.67 · 2 bedrooms ·...,51501835,Jeniffer,Manhattan,Hell's Kitchen,Entire home/apt,2023-05-01,No License
2,0.39749,-0.90297,-0.00076,-0.79194,-0.49813,0.21636,-0.25165,1.01425,-0.22720,-1.91676,...,-0.37048,971353993633883038,Rental unit in New York · ★4.17 · 1 bedroom · ...,528871354,Joshua,Manhattan,Chelsea,Entire home/apt,2023-12-18,Exempt
3,1.80458,-0.05437,-0.06627,0.04299,1.54102,0.06410,-0.23754,1.16231,0.05375,-0.32018,...,-0.37048,3857863,Rental unit in New York · ★4.64 · 1 bedroom · ...,19902271,John And Catherine,Manhattan,Washington Heights,Private room,2023-09-17,No License
4,0.40340,-0.64231,-0.10049,0.04299,-0.43016,-0.53444,1.60989,0.95503,-0.36767,0.59701,...,-0.37048,40896611,Condo in New York · ★4.91 · Studio · 1 bed · 1...,61391963,Stay With Vibe,Manhattan,Murray Hill,Entire home/apt,2023-12-03,No License
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20753,-0.25573,-0.85338,-0.13960,0.04299,1.10600,0.28987,-0.25165,-0.36266,0.05375,0.05349,...,-0.37048,24736896,Rental unit in New York · ★4.75 · 1 bedroom · ...,186680487,Henry D,Manhattan,Lower East Side,Private room,2023-09-29,No License
20754,0.06272,-1.00223,-0.08093,0.04299,0.18159,-0.40843,-0.25165,-1.52490,-0.46132,-0.93164,...,-0.37048,2835711,Rental unit in New York · ★4.46 · 1 bedroom · ...,3237504,Aspen,Manhattan,Greenwich Village,Entire home/apt,2023-07-01,No License
20755,0.50673,-0.88383,0.10875,0.04299,0.23596,0.43688,-0.25165,-1.52490,0.75611,0.66495,...,-0.37048,51825274,Rental unit in New York · ★4.93 · 1 bedroom · ...,304317395,Jeff,Manhattan,Hell's Kitchen,Entire home/apt,2023-12-08,No License
20756,-0.21642,-0.85191,-0.07116,0.04299,-0.48454,-0.18267,-0.25165,1.16231,-0.18037,0.90274,...,-0.37048,782661008019550832,Rental unit in New York · ★5.0 · 1 bedroom · 1...,163083101,Marissa,Manhattan,Chinatown,Entire home/apt,2023-09-17,No License
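
`StandardScaler` rescales each numeric column to zero mean and unit variance, that is z = (x - mean) / std, using the population standard deviation (`ddof=0`). The sketch below reproduces that formula by hand with pandas on four made-up prices. It is meant to illustrate why the scaled prices of ordinary listings in the output above all sit in a narrow band just below zero: a single very large value inflates both the mean and the standard deviation.

```python
import pandas as pd

# Made-up prices: three ordinary listings and one extreme one.
prices = pd.Series([80.0, 125.0, 199.0, 100000.0])

# Manual z-score with the population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
z = (prices - prices.mean()) / prices.std(ddof=0)

print(z.round(5))
```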


In [44]:
pd.set_option('display.max_columns', None)
# 187 is roughly the mean price from describe(); show listings priced above 100x the mean
cleaned_data[cleaned_data['price'] > 187*100]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license,rating,bedrooms,beds,baths
3990,17160286,Rental unit in Brooklyn · ★4.48 · 1 bedroom · ...,110361431,Bobbi,Brooklyn,Bedford-Stuyvesant,40.69085,-73.93806,Private room,100000.0,30,29,2023-10-20,0.96,2,346,10,No License,4.48,1,1,1.0
5492,605115521796576121,Rental unit in Brooklyn · ★4.33 · 1 bedroom · ...,110361431,Bobbi,Brooklyn,Bedford-Stuyvesant,40.69254,-73.93636,Private room,100000.0,30,9,2023-10-31,0.45,2,365,5,No License,4.33,1,1,1.0
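
The mean price (about 187.78 in the `describe()` output) is itself pulled upward by listings like these two, while the median (125) is not. A cutoff could therefore also be derived from the median instead of being hard-coded. The sketch below shows that variant on made-up prices; the factor of 100 is carried over from the cell above, and the choice of the median is an assumption for illustration, not what this notebook does.

```python
import pandas as pd

# Made-up prices with one extreme value.
toy = pd.DataFrame({"price": [80.0, 125.0, 199.0, 450.0, 100000.0]})

# The median barely moves when an extreme value is added, so use it as the reference.
cutoff = 100 * toy["price"].median()

# Rows whose price exceeds the cutoff are treated as suspicious.
outliers = toy[toy["price"] > cutoff]

print(cutoff)
print(outliers)
```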


In [51]:
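# listed price divided by the minimum stay length
cleaned_data["price_per_night"] = cleaned_data["price"] / cleaned_data["minimum_nights"]
# exclude extreme listings priced above 5000 (this also drops the two 100000 listings shown above)
filtered_data = cleaned_data[cleaned_data['price'] <= 5000]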
# numeric_only=True skips the string columns, which newer pandas versions no longer drop silently
filtered_data.corr(numeric_only=True)["price_per_night"]

id                                0.102772
host_id                           0.120907
latitude                          0.046508
longitude                        -0.109088
price                             0.441455
minimum_nights                   -0.186892
number_of_reviews                 0.018191
reviews_per_month                 0.116938
calculated_host_listings_count   -0.006409
availability_365                  0.018842
number_of_reviews_ltm             0.060275
rating                           -0.022452
bedrooms                          0.124818
beds                              0.148870
baths                             0.121724
price_per_night                   1.000000
Name: price_per_night, dtype: float64