Compare the distribution of Airbnbs and other traditional accommodation types such as hotels.

data source: https://data.cityofnewyork.us/City-Government/Hotels-Properties-Citywide/tjus-cn27

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sys
%matplotlib inline

In [None]:
df_hotel = pd.read_csv('../data/Hotels_Properties_Citywide.csv')

In [None]:
df_hotel.columns

Description for columns:  

'PARID': No description provided. We keep it only as a row identifier.  

'BOROCODE': Don't need it because of 'Borough'  

'BLOCK': Borough, Block, and Lot (BBL) is the parcel number system used to identify each unit of real estate in New York City for numerous city purposes. It consists of three numbers, separated by slashes; the borough, which is 1 digit; the block number, which is up to 5 digits; and the lot number, which is up to 4 digits.  

'LOT'  

'TAXYEAR': An annual accounting period for keeping records and reporting income and expenses. We're not investigating tax, so don't need it

'STREET NUMBER'
'STREET NAME'
'Postcode'

'BLDG_CLASS': Building class (Use and Occupancy classification: https://igpny.com/wp-content/uploads/2019/05/NYC-DOB-Building-Code-Chapter-3-Use-and-Occupancy-Classification.pdf). I don't think we need it.

'TAXCLASS': We're not interested in tax here

'OWNER_NAME': Do we want to check whether any Airbnb hosts and hotel owners are the same?

'Borough': We need it

'Latitude': We need it

'Longitude': We need it

'Community Board': Community boards are local representative bodies; there are 59 throughout the city. Each board consists of up to 50 unsalaried members appointed by the Borough President, with half nominated by the City Council members who represent the community district.
Are we interested in how Airbnbs are distributed across community districts?

'Council District': One of the 51 districts from which New York City Council members are elected.
Are we interested in how Airbnbs are distributed across council districts?

'Census Tract'

'BIN': Building Identification Number. Don't think we need this.

'BBL': Borough, Block, Lot

'NTA': Neighborhood Tabulation Areas, created by the NYC Department of City Planning by aggregating census tracts into 195 neighborhood-like areas.
Possibly interesting, since these areas approximate neighborhoods.


Questions: Do we need BBL, street number/name, and postcode?  
For geographic use alone, borough, latitude, and longitude are probably enough.
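
For reference, the BBL described above can be assembled from the separate borough, block, and lot fields. The sketch below assumes the common 10-digit form (1-digit borough, 5-digit zero-padded block, 4-digit zero-padded lot). `make_bbl` is a hypothetical helper written for illustration, not part of the dataset or of pandas, and the example parcel numbers are made up.

```python
def make_bbl(borough_code: int, block: int, lot: int) -> str:
    """Build a 10-digit BBL string: 1-digit borough + 5-digit block + 4-digit lot."""
    if not 1 <= borough_code <= 5:
        raise ValueError("borough code must be between 1 and 5")
    if not 0 <= block <= 99999 or not 0 <= lot <= 9999:
        raise ValueError("block must fit in 5 digits and lot in 4 digits")
    # Zero-pad block and lot so every BBL has the same length.
    return f"{borough_code}{block:05d}{lot:04d}"


example_bbl = make_bbl(1, 1009, 37)  # made-up Manhattan parcel
print(example_bbl)
```

If the 'BBL' column in the file follows the same layout, it carries no information beyond 'BOROCODE', 'BLOCK', and 'LOT', which supports dropping it for a purely geographic comparison.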

In [None]:
df_hotel.head()

In [None]:
# Check the data size, column types, and missing values
df_hotel.info()  # info() prints its own summary and returns None
print(df_hotel.describe())
print('Data shape: ', df_hotel.shape)
print('\nType of features \n', df_hotel.dtypes.value_counts())
# isnull() and isna() are aliases, so a single count is enough
isnull_series = df_hotel.isnull().sum()
print('\nColumns with missing values:\n ', isnull_series[isnull_series > 0].sort_values(ascending=False))

In [None]:
df_hotel.value_counts(['Borough']).sort_index()

In [None]:
df_hotel.value_counts(['NTA']).sort_index()

In the end we only use borough, latitude, and longitude.
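
Since only these three columns are needed, one option is to drop a row only when one of them is missing, rather than when any column is missing. The sketch below uses made-up data: the toy frame and the name `needed_cols` are illustrative assumptions, not the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_hotel; all values are invented for illustration.
toy = pd.DataFrame({
    "PARID": [1, 2, 3, 4],
    "Borough": ["MANHATTAN", "QUEENS", None, "BRONX"],
    "Latitude": [40.75, 40.74, 40.70, np.nan],
    "Longitude": [-73.99, -73.92, -73.95, -73.90],
    "OWNER_NAME": [None, "B LLC", "C LLC", "D LLC"],  # missing, but not needed
})

needed_cols = ["Borough", "Latitude", "Longitude"]

# Drop on any missing value vs. only on the columns we actually use.
strict = toy.dropna()
lenient = toy.dropna(subset=needed_cols)

print(len(strict), len(lenient))
```

Whether this matters for the real file depends on which columns actually contain missing values, which the null counts printed earlier should show.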



In [None]:
# drop all rows with any NaN and NaT values
df_hotel_clean = df_hotel.dropna(inplace=False)

In [None]:
df_hotel_clean.head()

In [None]:
# Confirm that no missing values remain (isna() is an alias of isnull())
isnull_series = df_hotel_clean.isnull().sum()
print(isnull_series)

In [None]:
# .copy() avoids SettingWithCopyWarning when 'Borough' is reassigned below
df_hotel_clean_2 = df_hotel_clean[["PARID", "Borough", "Latitude", "Longitude"]].copy()
df_hotel_clean_2.head()

In [None]:
df_hotel_clean_2.info()  # info() prints its own summary and returns None
print(df_hotel_clean_2.describe())
print('Data shape: ', df_hotel_clean_2.shape)
print('\nType of features \n', df_hotel_clean_2.dtypes.value_counts())

In [None]:
df_hotel_clean_2['Borough'] = df_hotel_clean_2['Borough'].replace(['1', '2', '3', '4', '5'], 
['Manhattan', 'Bronx', 'Brooklyn', 'Queens', 'Staten Island'])
df_hotel_clean_2.head()

In [None]:
df_hotel_clean_2['Borough'] = df_hotel_clean_2['Borough'].replace(['MANHATTAN', 'BRONX', 'BROOKLYN', 'QUEENS', 'STATEN IS'], 
['Manhattan', 'Bronx', 'Brooklyn', 'Queens', 'Staten Island'])
df_hotel_clean_2.head()
# df_hotel_clean_2.to_csv('../data/Hotels_clean.csv')

In [None]:
df_hotel_clean_2.value_counts(['Borough']).sort_index()

In [None]:
df_hotel_sorted = df_hotel_clean_2.sort_values(by=['Borough'], inplace=False)
df_hotel_sorted1 = df_hotel_sorted.rename({'PARID': 'parid', 'Borough': 'neighbourhood group', 'Latitude': 'latitude', 'Longitude':'longitude'}, axis=1)


In [None]:
# Is df_hotel_sorted1's parid unique?
df_hotel_sorted1.parid.nunique() == df_hotel_sorted1.shape[0]

In [None]:
df_hotel_sorted1['hotel_counts_per_neighbourhood_group'] = df_hotel_sorted1.groupby('neighbourhood group')['neighbourhood group'].transform('count')

In [None]:
df_hotel_sorted1.head()

In [None]:
df_hotel_sorted1.value_counts(['neighbourhood group']).sort_index()

In [None]:
df_hotel_sorted1.value_counts(['parid']).sort_index()

In [None]:
df_hotel_sorted1.reset_index(drop=True, inplace=True)
df_hotel_sorted1.head()

In [None]:
df_hotel_sorted1 = df_hotel_sorted1.rename({'neighbourhood group': 'neighbourhood_group'}, axis=1)

In [None]:
df_hotel_sorted1['latitude'] = df_hotel_sorted1['latitude'].round(5)
df_hotel_sorted1['longitude'] = df_hotel_sorted1['longitude'].round(5)
df_hotel_sorted1.head()

In [None]:
df_hotel_sorted1.to_csv('../data/Hotels_clean_sorted.csv', index = False)

Add a column that counts the number of Airbnbs per neighbourhood_group

In [None]:
df_airbnb = pd.read_csv('../data/airbnb_open_data_full_clean.csv')
df_airbnb.value_counts(['neighbourhood_group']).sort_index()

In [None]:
df_airbnb['airbnb_counts_per_neighbourhood_group'] = df_airbnb.groupby('neighbourhood_group')['neighbourhood_group'].transform('count')
df_airbnb.head()

Merging Airbnb and Hotel

    Merging on latitude and longitude does not work (I only tried it to see whether the merge could be done another way), so the two datasets are merged on neighbourhood_group instead.
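
One likely reason the coordinate merge fails (an assumption, not verified against the real files) is that an exact-match join needs identical latitude and longitude values in both tables, which is rare for distinct properties even after rounding to 5 decimals. The toy example below, with invented coordinates, shows the exact merge coming back empty while a per-borough count merged on `neighbourhood_group` gives one row per borough. The column names `hotel_count` and `airbnb_count` are illustrative.

```python
import pandas as pd

# Invented coordinates: nearby points, but never exactly equal.
hotels = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Queens"],
    "latitude": [40.75001, 40.76012, 40.74230],
    "longitude": [-73.99012, -73.98003, -73.92011],
})
airbnbs = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Queens", "Queens", "Brooklyn"],
    "latitude": [40.75003, 40.74228, 40.74500, 40.68000],
    "longitude": [-73.99010, -73.92015, -73.92500, -73.95000],
})

# Exact join on coordinates: no pair of rows matches.
exact = pd.merge(hotels, airbnbs, on=["latitude", "longitude"])
print(len(exact))

# Count rows per borough in each table, then join the two small tables.
hotel_counts = hotels.groupby("neighbourhood_group").size().reset_index(name="hotel_count")
airbnb_counts = airbnbs.groupby("neighbourhood_group").size().reset_index(name="airbnb_count")
counts = pd.merge(hotel_counts, airbnb_counts, on="neighbourhood_group", how="outer")
print(counts)
```

With `how="outer"`, a borough that appears in only one table is kept and gets NaN in the other count column.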

In [None]:
df_hotel_temp = df_hotel_sorted1[['neighbourhood_group', 'hotel_counts_per_neighbourhood_group']].copy()

df_hotel_temp.drop_duplicates(subset='neighbourhood_group', keep="last", inplace=True)

df_hotel_temp.reset_index(drop=True, inplace=True)
df_hotel_temp.head()

In [None]:
df_airbnb_temp = df_airbnb[['neighbourhood_group', 'airbnb_counts_per_neighbourhood_group']].copy()

df_airbnb_temp.drop_duplicates(subset='neighbourhood_group', keep="last", inplace=True)

df_airbnb_temp.reset_index(drop=True, inplace=True)
df_airbnb_temp.head()

In [None]:
df_hotel_airbnb = pd.merge(df_hotel_temp, df_airbnb_temp, on='neighbourhood_group', how='outer')
df_hotel_airbnb.head()

In [None]:
df_hotel_airbnb.to_csv('../data/Hotels_Airbnbs_merged.csv', index = False)