<a href="https://colab.research.google.com/github/RochX/comp486-assignments/blob/main/project/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Project
## New York City AirBNB prices

In [1]:
from pathlib import Path
import pandas as pd
import urllib.request

# download the data if it is not downloaded
if not Path("new_york_listings_2024.csv").is_file():
  # here I download the data from my personal git repo for this class instead of using Google Drive
  url = "https://raw.githubusercontent.com/RochX/comp486-assignments/main/project/new_york_listings_2024.csv"
  urllib.request.urlretrieve(url, "new_york_listings_2024.csv")

new_york_listings_2024 = pd.read_csv("new_york_listings_2024.csv")

Get a summary of the columns, their non-null counts, and their dtypes.

In [2]:
new_york_listings_2024.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20758 non-null  int64  
 1   name                            20758 non-null  object 
 2   host_id                         20758 non-null  int64  
 3   host_name                       20758 non-null  object 
 4   neighbourhood_group             20758 non-null  object 
 5   neighbourhood                   20758 non-null  object 
 6   latitude                        20758 non-null  float64
 7   longitude                       20758 non-null  float64
 8   room_type                       20758 non-null  object 
 9   price                           20758 non-null  float64
 10  minimum_nights                  20758 non-null  int64  
 11  number_of_reviews               20758 non-null  int64  
 12  last_review                     

Take a look at the categorical features listed (besides `name` and `host_name`).

In [3]:
for col in list(new_york_listings_2024.drop(["name", "host_name"], axis=1).select_dtypes(include=["object"])):
  if col != "baths":
    print(new_york_listings_2024.value_counts(col), end="\n\n\n")
  else:
    print(new_york_listings_2024.value_counts(col))

neighbourhood_group
Manhattan        8038
Brooklyn         7719
Queens           3761
Bronx             949
Staten Island     291
dtype: int64


neighbourhood
Bedford-Stuyvesant            1586
Harlem                        1063
Williamsburg                   969
Midtown                        942
Hell's Kitchen                 867
                              ... 
Bay Terrace, Staten Island       1
Navy Yard                        1
Lighthouse Hill                  1
Chelsea, Staten Island           1
Neponsit                         1
Length: 221, dtype: int64


room_type
Entire home/apt    11549
Private room        8804
Shared room          293
Hotel room           112
dtype: int64


last_review
2023-09-04    326
2023-12-03    255
2023-12-17    244
2023-09-05    223
2023-11-30    212
             ... 
2020-04-17      1
2020-04-16      1
2020-04-15      1
2020-04-14      1
2021-03-21      1
Length: 1878, dtype: int64


license
No License            17569
Exempt                 2135


Some of these "categorical" features are really numeric columns that use placeholder strings in place of missing or special values, which is why pandas reads them as `object`.
These features are `rating`, `bedrooms`, and `baths`.
We can make the following corrections for each feature:
- `rating`: Convert `No rating` and `New ` into `NaN`.
- `bedrooms`: Convert `Studio` into `1`. A studio is effectively a one-bedroom apartment in which the bed is not in a separate room.
- `baths`: Convert `Not specified` into `NaN`.
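
One way to spot such columns without reading every `value_counts` output is to measure how many entries in each column parse as numbers. The following is a minimal sketch on a small made-up frame, not the listings data; the helper name `numeric_fraction` and the `0.5` threshold are assumptions chosen for illustration.

```python
import pandas as pd


def numeric_fraction(series: pd.Series) -> float:
    """Return the fraction of non-null entries that parse as numbers."""
    non_null = series.dropna()
    if non_null.empty:
        return 0.0
    # errors="coerce" turns anything unparseable into NaN instead of raising
    parsed = pd.to_numeric(non_null, errors="coerce")
    return float(parsed.notna().mean())


# toy data imitating the placeholder strings seen in the listings
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"],
    "rating": ["4.8", "No rating", "4.5", "5.0"],
    "bedrooms": ["Studio", "2", "1", "3"],
})

fractions = {col: numeric_fraction(toy[col]) for col in toy.columns}
# hypothetical threshold: flag columns where at least half the values are numeric
mostly_numeric = [col for col, frac in fractions.items() if frac >= 0.5]
print(fractions)
print(mostly_numeric)
```

Columns flagged this way still need a manual look, since the right replacement for each placeholder (a number or a missing value) depends on what the placeholder means.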

Let's clean the data accordingly.

In [5]:
# work on a copy so the raw DataFrame is left untouched
cleaned_data = new_york_listings_2024.copy()
# placeholder strings become missing values
cleaned_data["rating"] = cleaned_data["rating"].replace({"No rating": None, "New ": None})
# treat a studio as a one-bedroom listing
cleaned_data["bedrooms"] = cleaned_data["bedrooms"].replace({"Studio": "1"})
cleaned_data["baths"] = cleaned_data["baths"].replace({"Not specified": None})

# convert the three columns from object to numeric dtypes
obj_to_num_cols = ["rating", "bedrooms", "baths"]
cleaned_data[obj_to_num_cols] = cleaned_data[obj_to_num_cols].apply(pd.to_numeric)

cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20758 non-null  int64  
 1   name                            20758 non-null  object 
 2   host_id                         20758 non-null  int64  
 3   host_name                       20758 non-null  object 
 4   neighbourhood_group             20758 non-null  object 
 5   neighbourhood                   20758 non-null  object 
 6   latitude                        20758 non-null  float64
 7   longitude                       20758 non-null  float64
 8   room_type                       20758 non-null  object 
 9   price                           20758 non-null  float64
 10  minimum_nights                  20758 non-null  int64  
 11  number_of_reviews               20758 non-null  int64  
 12  last_review                     
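
An alternative way to write the same cleaning step is to let `pd.to_numeric` do the placeholder handling through `errors="coerce"`, which turns every unparseable string into `NaN`. This is a sketch on made-up values, not the approach used in the cell above; note that coercion also hides unexpected strings, so it is safest after the placeholders have been inspected.

```python
import pandas as pd

# toy data imitating the three problem columns
toy = pd.DataFrame({
    "rating": ["4.8", "No rating", "New ", "5.0"],
    "bedrooms": ["Studio", "2", "1", "3"],
    "baths": ["1", "Not specified", "1.5", "2"],
})

cleaned = toy.copy()  # keep the raw frame unchanged
# "Studio" carries information (one bedroom), so map it before coercing
cleaned["bedrooms"] = cleaned["bedrooms"].replace({"Studio": "1"})
for col in ["rating", "bedrooms", "baths"]:
    # any remaining non-numeric string becomes NaN
    cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")

print(cleaned.dtypes)
print(cleaned.isna().sum())
```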

Now that `rating`, `bedrooms`, and `baths` are numeric, let's compute the correlation of each numeric feature with `price`.

In [6]:
# restrict to numeric columns; pandas 2.0+ raises an error on string columns otherwise
cleaned_data.corr(numeric_only=True)["price"]

id                                0.002372
host_id                          -0.005987
latitude                         -0.001143
longitude                        -0.033460
price                             1.000000
minimum_nights                   -0.006527
number_of_reviews                -0.012588
reviews_per_month                -0.009917
calculated_host_listings_count   -0.007333
availability_365                  0.020151
number_of_reviews_ltm            -0.011263
rating                           -0.004692
bedrooms                          0.074036
beds                              0.066882
baths                             0.066048
Name: price, dtype: float64
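
The values above are Pearson coefficients (the pandas default), which capture only linear association and can be dominated by a few extreme values. Whether that explains the small numbers here has not been checked. The sketch below uses synthetic data, not the Airbnb listings, to show how a single extreme price can flatten a Pearson correlation while the rank-based Spearman correlation stays high; all values and names in it are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic listings: price grows roughly linearly with bedrooms
bedrooms = rng.integers(1, 6, size=500)
price = 80.0 * bedrooms + rng.normal(0, 20, size=500)

# inject one absurdly priced one-bedroom listing
bedrooms[0] = 1
price[0] = 1_000_000.0

toy = pd.DataFrame({"bedrooms": bedrooms, "price": price})

pearson = toy.corr(method="pearson").loc["bedrooms", "price"]
spearman = toy.corr(method="spearman").loc["bedrooms", "price"]
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the real data, comparing `cleaned_data.corr(method="spearman", numeric_only=True)["price"]` with the Pearson values would be one way to test whether outliers are involved.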