## NYC Airbnb Statistics

### <u>Source Description (by author):</u>
> Airbnb, Inc is an American San Francisco-based company operating an online marketplace for short- and long-term homestays and experiences.
The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk, and Joe Gebbia.
Since it was founded in 2008, Airbnb has become one of the most successful and valuable start-ups in the world and has significantly impacted the HORECA (hotel, restaurant, and catering) industry.
Airbnb is a platform that allows house and apartment owners to rent their properties to guests for short-term stays.
Since 2011, hosts have been using Airbnb. This dataset describes the listing activity and metrics in NYC for 2023.

### <u>Column Content: </u>
- id: ID of the hosted home
- name: name introducing the hosted home
- host_id: ID of the host
- host_name: name of the host
- neighbourhood_group: the area (borough) where the hosted home is located
- neighbourhood: name of the neighbourhood around the hosted home
- latitude: latitude
- longitude: longitude
- room_type: type of hosted home
- price: daily accommodation price (target variable)
- minimum_nights: minimum number of nights
- number_of_reviews: total number of reviews
- last_review: date of the last review
- reviews_per_month: number of reviews per month
- calculated_host_listings_count: number of accommodations hosted by the host
- availability_365: number of days the listing is available over the next 365 days
- number_of_reviews_ltm: number of reviews in the last 12 months
- license: accommodation license
  - 'Special': only one person has a license


In [1]:
import pandas as pd
import numpy as np
from shapely.geometry import Point, Polygon

In [2]:
airbnb = pd.read_csv('./Data/NYC-Airbnb-2023.csv')


In [3]:
airbnb.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,150,30,49,2022-06-21,0.3,3,314,1,
1,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,60,30,50,2019-12-02,0.3,2,365,0,
2,5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen,Manhattan,Upper West Side,40.8038,-73.96751,Private room,75,2,118,2017-07-21,0.72,1,0,0,
3,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Midtown,40.76457,-73.98317,Private room,68,2,575,2023-02-19,3.41,1,106,52,
4,5136,"Large Sunny Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,275,60,3,2022-08-10,0.03,1,181,1,


In [4]:
airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42931 entries, 0 to 42930
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              42931 non-null  int64  
 1   name                            42919 non-null  object 
 2   host_id                         42931 non-null  int64  
 3   host_name                       42926 non-null  object 
 4   neighbourhood_group             42931 non-null  object 
 5   neighbourhood                   42931 non-null  object 
 6   latitude                        42931 non-null  float64
 7   longitude                       42931 non-null  float64
 8   room_type                       42931 non-null  object 
 9   price                           42931 non-null  int64  
 10  minimum_nights                  42931 non-null  int64  
 11  number_of_reviews               42931 non-null  int64  
 12  last_review                     

Below I drop columns that will not be relevant to my analysis.

In [5]:
columns_to_drop = ["license", "name", "host_name", "host_id", "last_review", "neighbourhood", "number_of_reviews_ltm"]
airbnb.drop(columns_to_drop, axis=1, inplace=True)

In [6]:
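# Listings without any reviews have no monthly rate; treat that as 0.
# Assigning the result back avoids calling inplace=True on a column selection.
airbnb['reviews_per_month'] = airbnb['reviews_per_month'].fillna(0)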

In [7]:
airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42931 entries, 0 to 42930
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              42931 non-null  int64  
 1   neighbourhood_group             42931 non-null  object 
 2   latitude                        42931 non-null  float64
 3   longitude                       42931 non-null  float64
 4   room_type                       42931 non-null  object 
 5   price                           42931 non-null  int64  
 6   minimum_nights                  42931 non-null  int64  
 7   number_of_reviews               42931 non-null  int64  
 8   reviews_per_month               42931 non-null  float64
 9   calculated_host_listings_count  42931 non-null  int64  
 10  availability_365                42931 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 3.6+ MB


In [8]:
airbnb.room_type.value_counts()

Entire home/apt    24279
Private room       17879
Shared room          576
Hotel room           197
Name: room_type, dtype: int64

### Add neighbourhood_cd column:
- New York City has 300+ neighbourhood districts. We assign each Airbnb listing a neighbourhood code, which is the index of the district that contains it (or -1 if no district does). Finally, we will merge the Airbnb dataset with the crime dataset on the shared key column "neighbourhood_cd".
- The district boundaries come from the geometry in fullDownload.geojson; each listing is matched to a district using its latitude and longitude.
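
The assignment described above is a point-in-polygon lookup. The notebook uses shapely for this; the block below is only a minimal pure-Python sketch of the same idea, using a ray-casting test as a stand-in for shapely's `contains`. The function names and the two square "districts" are hypothetical and chosen for illustration. Coordinates are given in (longitude, latitude) order, which is the order GeoJSON uses.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the ring of (lon, lat) pairs."""
    inside = False
    n = len(ring)
    for k in range(n):
        x1, y1 = ring[k]
        x2, y2 = ring[(k + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def district_code(lon, lat, rings):
    """Return the index of the first ring containing the point, or -1 if none does."""
    for code, ring in enumerate(rings):
        if point_in_polygon(lon, lat, ring):
            return code
    return -1


# Two hypothetical square "districts" in (longitude, latitude) order.
rings = [
    [(-74.00, 40.70), (-73.98, 40.70), (-73.98, 40.72), (-74.00, 40.72)],
    [(-73.98, 40.70), (-73.96, 40.70), (-73.96, 40.72), (-73.98, 40.72)],
]

inside_second = district_code(-73.97, 40.71, rings)  # falls in the second square
outside_all = district_code(-73.90, 40.71, rings)    # falls in neither square
```

Returning -1 for unmatched points mirrors the convention used later in the notebook, where such rows are dropped.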

In [9]:
all_geo = pd.read_json('./Data/fullDownload.geojson')
all_geo = all_geo['features']

In [10]:
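NYC_BOROUGHS = {'Manhattan', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island'}

def filter_ny(features):
    """Keep only the features located in New York State and one of the five NYC boroughs."""
    # Build a new list rather than deleting entries while looping over them.
    return [
        f for f in features
        if f['properties']['state'] == 'NY' and f['properties']['city'] in NYC_BOROUGHS
    ]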

all_geo = filter_ny(all_geo)
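
The filtering rule in this cell can be checked on a few hand-made features. This is a minimal sketch, assuming each feature is a dict with `properties.state` and `properties.city` keys as in the cell above; the helper name `keep_nyc` and the sample features are hypothetical. It builds a new list instead of deleting items during the loop, which avoids skipped elements when the input is a plain Python list.

```python
NYC_BOROUGHS = {'Manhattan', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island'}


def keep_nyc(features):
    """Return a new list with only features in NY state and one of the five boroughs."""
    return [
        f for f in features
        if f['properties']['state'] == 'NY'
        and f['properties']['city'] in NYC_BOROUGHS
    ]


# Hypothetical, hand-made features for illustration only.
sample = [
    {'properties': {'state': 'NY', 'city': 'Brooklyn'}},
    {'properties': {'state': 'NY', 'city': 'Albany'}},
    {'properties': {'state': 'NJ', 'city': 'Newark'}},
    {'properties': {'state': 'NY', 'city': 'Queens'}},
]

kept = keep_nyc(sample)
kept_cities = [f['properties']['city'] for f in kept]
```

The input list is left unchanged, so the original features remain available for other checks.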

In [11]:
class NYCDistrict:
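    def __init__(self, feature):
        # `feature` is one GeoJSON feature; the name avoids shadowing the built-in `dict`.
        # coordinates[0][0] is the first ring of the first polygon in the geometry.
        self.Coordinates = feature['geometry']['coordinates'][0][0]
        self.HolcGrade = feature['properties']['holc_grade']
        self.city = feature['properties']['city']
        self.name = feature['properties']['name']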
        self.RandomLat = None
        self.RandomLong = None
        self.Median_Income = None
        self.CensusTract = None

    @property
    def HolcColor(self):
        if self.HolcGrade == 'A':
            return 'darkgreen'
        elif self.HolcGrade == 'B':
            return 'cornflowerblue'
        elif self.HolcGrade == 'C':
            return 'gold'
        elif self.HolcGrade == 'D':
            return 'maroon'
        else:
            return 'black'

    

Districts = []
for district in all_geo:
    Districts.append(NYCDistrict(district))

In [12]:
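# Build each district polygon once instead of rebuilding it on every lookup.
DistrictPolygons = [Polygon(d.Coordinates) for d in Districts]

def get_dist_name(lon, lat):
    """Return the index of the district containing the point, or -1 if none does."""
    # GeoJSON coordinates are ordered (longitude, latitude), so the point must be too.
    point = Point(lon, lat)
    for i, polygon in enumerate(DistrictPolygons):
        if polygon.contains(point):
            return i
    return -1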

In [13]:
airbnb['neighbourhood_cd'] = airbnb.apply(lambda row: get_dist_name(row['longitude'], row['latitude']), axis=1)

In [14]:
# Drop the rows where neighbourhood_cd is -1 (listing not inside any district).
airbnb['neighbourhood_cd'] = airbnb['neighbourhood_cd'].replace(-1, np.nan)
airbnb.dropna(subset=['neighbourhood_cd'], inplace=True)
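
The effect of this step can be seen on a small example. This is a sketch on a hypothetical toy frame, not the real dataset: the sentinel -1 is converted to NaN, and only rows missing a district code are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: -1 marks a listing that fell outside every district.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'neighbourhood_cd': [12, -1, 7, -1],
})

# Turn the sentinel into NaN, then drop only rows missing a district code.
toy['neighbourhood_cd'] = toy['neighbourhood_cd'].replace(-1, np.nan)
cleaned = toy.dropna(subset=['neighbourhood_cd'])

remaining_ids = cleaned['id'].tolist()
missing_codes = int(cleaned['neighbourhood_cd'].isna().sum())
```

Because NaN is a floating-point value, the code column is typically stored as floats after this step, which is worth keeping in mind when the column is later used as a join key.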

### Output the cleaned file to CSV: airbnb

In [15]:
airbnb.to_csv('./Data/airbnb.csv', index=False)
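
The section on `neighbourhood_cd` states that the cleaned data will be joined with a crime dataset on that key. A minimal sketch of such a join follows. The crime dataset is not shown in this notebook, so the `crime` frame, its `crime_count` column, and all values here are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the cleaned Airbnb data.
airbnb_small = pd.DataFrame({
    'id': [1, 2, 3],
    'price': [150, 60, 75],
    'neighbourhood_cd': [12.0, 7.0, 12.0],
})

# Hypothetical crime table keyed by the same district code.
crime = pd.DataFrame({
    'neighbourhood_cd': [7.0, 12.0],
    'crime_count': [30, 45],
})

# A left join keeps every listing and attaches the matching district's figures.
merged = airbnb_small.merge(crime, on='neighbourhood_cd', how='left')
crime_per_listing = merged['crime_count'].tolist()
```

A left join is used here so that listings are never silently dropped; whether that is the right choice depends on how the crime data is keyed.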