Compare the distribution of Airbnbs with that of traditional accommodation types such as hotels.

Data source: https://data.cityofnewyork.us/City-Government/Hotels-Properties-Citywide/tjus-cn27

In [62]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sys
%matplotlib inline

In [86]:
df_hotel = pd.read_csv('../data/Hotels_Properties_Citywide.csv')

In [None]:
df_hotel.columns

Description for columns:  

'PARID': No description available; doesn't seem important  

'BOROCODE': Don't need it because of 'Borough'  

'BLOCK': Borough, Block, and Lot (BBL) is the parcel number system used to identify each unit of real estate in New York City for numerous city purposes. It consists of three numbers, separated by slashes; the borough, which is 1 digit; the block number, which is up to 5 digits; and the lot number, which is up to 4 digits.  

'LOT'  

'TAXYEAR': An annual accounting period for keeping records and reporting income and expenses. We're not investigating tax, so we don't need it.

'STREET NUMBER'
'STREET NAME'
'Postcode'

'BLDG_CLASS': Building class (Use and Occupancy classification: https://igpny.com/wp-content/uploads/2019/05/NYC-DOB-Building-Code-Chapter-3-Use-and-Occupancy-Classification.pdf). I don't think we need it.

'TAXCLASS': We're not interested in tax here

'OWNER_NAME': Do we want to check whether Airbnb listings and hotels share the same owner?

'Borough': We need it

'Latitude': We need it

'Longitude': We need it

'Community Board': Community Boards are local representative bodies; there are 59 throughout the city. Each Board consists of up to 50 unsalaried members appointed by the Borough President, with half nominated by the City Council Members who represent the community district.
Are we interested in whether Airbnbs are located near the community board?

'Council District': One of the 51 districts into which New York City is divided, each electing one City Council member.
Are we interested in whether Airbnbs are concentrated in particular council districts?

'Census Tract'
'BIN': Building Identification Number. Don't think we need this.

'BBL': Borough, Block, Lot

'NTA': Neighborhood Tabulation Areas; created by the NYC Dept of Planning by aggregating census tracts into 195 neighborhood-like areas.
Possibly interesting, because these are neighborhood-like areas.


Questions: Do we need BBL, street number/name, and Postcode?  
For purely geographic use, borough, latitude, and longitude are probably enough.
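The BBL layout described above (1 borough digit, up to 5 block digits, up to 4 lot digits) can be illustrated with a small helper. This is only a sketch: `split_bbl` is a hypothetical name, and it assumes the identifier is given in the zero-padded 10-digit form. The PARID values in this dataset look like they follow that layout, but that is an assumption, not something the data dictionary confirms.

```python
def split_bbl(bbl):
    """Split a zero-padded 10-digit BBL into (borough, block, lot) integers."""
    text = str(bbl)
    if len(text) != 10 or not text.isdigit():
        raise ValueError(f"expected a 10-digit BBL, got {bbl!r}")
    # 1 digit for the borough, 5 for the block, 4 for the lot
    return int(text[0]), int(text[1:6]), int(text[6:])


print(split_bbl(1000080039))  # → (1, 8, 39)
```

If the assumption holds, the separate 'BOROCODE', 'BLOCK', and 'LOT' columns carry the same information as the combined identifier.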

In [None]:
df_hotel.head()

In [64]:
# check the data size
print(df_hotel.info())
print(df_hotel.describe())
print('Data`s Shape: ', df_hotel.shape)
print('\nType of features \n', df_hotel.dtypes.value_counts())
isnull_series = df_hotel.isnull().sum()
isna_series = df_hotel.isna().sum()
print('\nNull columns and numbers:\n ', isnull_series[isnull_series > 0].sort_values(ascending=False))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5519 entries, 0 to 5518
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   PARID             5519 non-null   int64  
 1   BOROCODE          5519 non-null   int64  
 2   BLOCK             5519 non-null   int64  
 3   LOT               5519 non-null   int64  
 4   TAXYEAR           5519 non-null   int64  
 5   STREET NUMBER     5514 non-null   object 
 6   STREET NAME       5519 non-null   object 
 7   Postcode          5519 non-null   int64  
 8   BLDG_CLASS        5519 non-null   object 
 9   TAXCLASS          5519 non-null   int64  
 10  OWNER_NAME        5519 non-null   object 
 11  Borough           5514 non-null   object 
 12  Latitude          5502 non-null   float64
 13  Longitude         5502 non-null   float64
 14  Community Board   5502 non-null   float64
 15  Council District  5502 non-null   float64
 16  Census Tract      5502 non-null   float64


In [None]:
df_hotel.value_counts(['Borough']).sort_index()

In [None]:
df_hotel.value_counts(['NTA']).sort_index()

In the end we only need location columns. Note that the selection further down keeps PARID, Borough, Postcode, and NTA rather than latitude and longitude.
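When only a handful of columns are kept, one option is to drop rows that are missing values in those columns only, instead of in any column. The sketch below uses a made-up toy frame (not the real hotel data) to show the difference; the column list `needed` is purely illustrative.

```python
import pandas as pd

# Toy frame standing in for df_hotel; the values are made up for illustration
toy = pd.DataFrame({
    "PARID": [1, 2, 3],
    "Borough": ["MANHATTAN", None, "QUEENS"],
    "Latitude": [40.70, 40.71, 40.75],
    "STREET NUMBER": ["10", "12", None],
})

needed = ["PARID", "Borough", "Latitude"]

# Row with PARID 2 lacks a Borough and is dropped;
# row with PARID 3 lacks only STREET NUMBER and is kept
toy_clean = toy.dropna(subset=needed)[needed]
print(toy_clean)

# For comparison: dropping on all columns keeps fewer rows
print(len(toy.dropna()), "row(s) survive a plain dropna()")
```

Whether this is preferable depends on whether the discarded columns are needed later; a plain `dropna()` is the stricter choice.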



In [87]:
# drop all rows with any NaN and NaT values
df_hotel_clean = df_hotel.dropna(inplace=False)

In [None]:
df_hotel_clean.head()

In [None]:
isnull_series = df_hotel_clean.isnull().sum()
isna_series = df_hotel_clean.isna().sum()
print(isnull_series, isna_series)

In [None]:
# remove duplicates of the same ID 
# what if there are different rooms in the same hotel building? For now I assume we keep them
# df_hotel_clean.drop_duplicates(subset='PARID', keep="last")
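Before deciding whether to drop repeated PARIDs, it can help to look at the repeated rows first. A minimal sketch on a toy frame (the frame itself is illustrative and not loaded from the CSV):

```python
import pandas as pd

# Toy rows shaped like the hotel table; one PARID appears twice
toy = pd.DataFrame({
    "PARID": [3004200052, 3004340016, 3004200052],
    "Postcode": [11217, 11215, 11217],
})

# keep=False marks every member of a duplicated group, not just the repeats
dupes = toy[toy.duplicated(subset="PARID", keep=False)]
print(dupes)
print(toy["PARID"].nunique(), "unique ids in", len(toy), "rows")
```

If the repeated rows are identical in every column that is kept, dropping them changes only the counts; if they differ, they may describe distinct units in the same building.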

In [88]:
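# take an explicit copy so the column assignments below do not act on a slice of df_hotel_clean
df_hotel_clean_2 = df_hotel_clean[["PARID", "Borough", "Postcode", "NTA"]].copy()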
df_hotel_clean_2.head()

Unnamed: 0,PARID,Borough,Postcode,NTA
0,1000080039,MANHATTAN,10004,Battery Park City-Lower Manhattan
1,1000080051,MANHATTAN,10004,Battery Park City-Lower Manhattan
2,1000100033,MANHATTAN,10004,Battery Park City-Lower Manhattan
3,1000110029,MANHATTAN,10004,Battery Park City-Lower Manhattan
4,1000161301,MANHATTAN,10282,Battery Park City-Lower Manhattan


In [67]:
print(df_hotel_clean_2.info())
print(df_hotel_clean_2.describe())
print('Data`s Shape: ', df_hotel_clean_2.shape)
print('\nType of features \n', df_hotel_clean_2.dtypes.value_counts())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5517
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   PARID     5473 non-null   int64 
 1   Borough   5473 non-null   object
 2   Postcode  5473 non-null   int64 
 3   NTA       5473 non-null   object
dtypes: int64(2), object(2)
memory usage: 213.8+ KB
None
              PARID      Postcode
count  5.473000e+03   5473.000000
mean   1.421075e+09  10211.873744
std    9.567826e+08    436.099099
min    1.000080e+09  10001.000000
25%    1.008370e+09  10016.000000
50%    1.011381e+09  10019.000000
75%    1.013370e+09  10036.000000
max    5.073650e+09  11694.000000
Data`s Shape:  (5473, 4)

Type of features 
 int64     2
object    2
dtype: int64


In [89]:
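# map borough codes to names; the output shows no change because the column already holds names such as 'MANHATTAN'
df_hotel_clean_2['Borough'] = df_hotel_clean_2['Borough'].replace(['1', '2', '3', '4', '5'], 
['Manhattan', 'Bronx', 'Brooklyn', 'Queens', 'Staten Island'])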
df_hotel_clean_2.head()

Unnamed: 0,PARID,Borough,Postcode,NTA
0,1000080039,MANHATTAN,10004,Battery Park City-Lower Manhattan
1,1000080051,MANHATTAN,10004,Battery Park City-Lower Manhattan
2,1000100033,MANHATTAN,10004,Battery Park City-Lower Manhattan
3,1000110029,MANHATTAN,10004,Battery Park City-Lower Manhattan
4,1000161301,MANHATTAN,10282,Battery Park City-Lower Manhattan


In [90]:
df_hotel_clean_2['Borough'] = df_hotel_clean_2['Borough'].replace(['MANHATTAN', 'BRONX', 'BROOKLYN', 'QUEENS', 'STATEN IS'], 
['Manhattan', 'Bronx', 'Brooklyn', 'Queens', 'Staten Island'])
df_hotel_clean_2.head()
# df_hotel_clean_2.to_csv('../data/Hotels_clean.csv')

Unnamed: 0,PARID,Borough,Postcode,NTA
0,1000080039,Manhattan,10004,Battery Park City-Lower Manhattan
1,1000080051,Manhattan,10004,Battery Park City-Lower Manhattan
2,1000100033,Manhattan,10004,Battery Park City-Lower Manhattan
3,1000110029,Manhattan,10004,Battery Park City-Lower Manhattan
4,1000161301,Manhattan,10282,Battery Park City-Lower Manhattan
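The borough renaming above could also be written with a single dictionary, which keeps each raw spelling next to its display name. A sketch on a toy Series; `BOROUGH_NAMES` is a name introduced here for illustration.

```python
import pandas as pd

# Raw spellings as they appear in the Borough column, mapped to display names
BOROUGH_NAMES = {
    "MANHATTAN": "Manhattan",
    "BRONX": "Bronx",
    "BROOKLYN": "Brooklyn",
    "QUEENS": "Queens",
    "STATEN IS": "Staten Island",
}

toy = pd.Series(["MANHATTAN", "STATEN IS", "QUEENS"])

# Values that are not keys of the dictionary would be left unchanged
normalised = toy.replace(BOROUGH_NAMES)
print(normalised.tolist())  # → ['Manhattan', 'Staten Island', 'Queens']
```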


In [91]:
df_hotel_clean_2.rename(columns={'Postcode': 'zipcode', 'Borough': 'neighbourhood_group', 'NTA': 'neighbourhood'}, inplace=True)
df_hotel_clean_2.head()

Unnamed: 0,PARID,neighbourhood_group,zipcode,neighbourhood
0,1000080039,Manhattan,10004,Battery Park City-Lower Manhattan
1,1000080051,Manhattan,10004,Battery Park City-Lower Manhattan
2,1000100033,Manhattan,10004,Battery Park City-Lower Manhattan
3,1000110029,Manhattan,10004,Battery Park City-Lower Manhattan
4,1000161301,Manhattan,10282,Battery Park City-Lower Manhattan


In [92]:
df_hotel_clean_2['neighbourhood'] = df_hotel_clean_2['neighbourhood'].str.replace('-', ' ')
df_hotel_clean_2.head()

Unnamed: 0,PARID,neighbourhood_group,zipcode,neighbourhood
0,1000080039,Manhattan,10004,Battery Park City Lower Manhattan
1,1000080051,Manhattan,10004,Battery Park City Lower Manhattan
2,1000100033,Manhattan,10004,Battery Park City Lower Manhattan
3,1000110029,Manhattan,10004,Battery Park City Lower Manhattan
4,1000161301,Manhattan,10282,Battery Park City Lower Manhattan


In [107]:
df_zipcode = pd.read_csv('../data/neighbourhoods.csv')

In [108]:
df_zipcode.head()

Unnamed: 0,neighbourhood_group,neighbourhood,zipcode
0,Bronx,Allerton,10467
1,Bronx,Baychester,10469
2,Bronx,Belmont,10457
3,Bronx,Belmont,10458
4,Bronx,Bronx Park,10460


In [None]:
df_hotel_clean_2.shape

In [101]:
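# Note: this binds a second name to the same DataFrame rather than copying it,
# so changes made through df_hotel_copy3 also show up in df_hotel_clean_2.
df_hotel_copy3 = df_hotel_clean_2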
df_hotel_copy3.head()

Unnamed: 0,PARID,neighbourhood_group,zipcode,neighbourhood
0,1000080039,Manhattan,10004,Battery Park City Lower Manhattan
1,1000080051,Manhattan,10004,Battery Park City Lower Manhattan
2,1000100033,Manhattan,10004,Battery Park City Lower Manhattan
3,1000110029,Manhattan,10004,Battery Park City Lower Manhattan
4,1000161301,Manhattan,10282,Battery Park City Lower Manhattan


In [115]:

df_zipcode['neighbourhood'].value_counts()


Bedford-Stuyvesant    6
Upper East Side       5
Financial District    5
Harlem                5
Crown Heights         5
                     ..
Unionport             1
Throgs Neck           1
Two Bridges           1
Arverne               1
Allerton              1
Name: neighbourhood, Length: 243, dtype: int64


In [110]:
def match_neighbourhood(name):
    # first df_zipcode neighbourhood containing `name` as a literal substring; otherwise keep `name`
    hits = df_zipcode.loc[df_zipcode['neighbourhood'].str.contains(name, na=False, regex=False), 'neighbourhood']
    return hits.iloc[0] if len(hits) > 0 else name

df_hotel_copy3['neighbourhood'] = df_hotel_copy3['neighbourhood'].apply(match_neighbourhood)

df_hotel_copy3[df_hotel_copy3['neighbourhood'].str.contains('Gowanus')]

Unnamed: 0,PARID,neighbourhood_group,zipcode,neighbourhood
2352,3004200052,Brooklyn,11217,Park Slope Gowanus
2353,3004340016,Brooklyn,11215,Park Slope Gowanus
2354,3004340049,Brooklyn,11215,Park Slope Gowanus
2355,3004410001,Brooklyn,11215,Park Slope Gowanus
2356,3004410042,Brooklyn,11215,Park Slope Gowanus
2382,3009800075,Brooklyn,11215,Park Slope Gowanus
2383,3009800107,Brooklyn,11215,Park Slope Gowanus
2384,3010330005,Brooklyn,11215,Park Slope Gowanus
2385,3010330006,Brooklyn,11215,Park Slope Gowanus
5119,3004200052,Brooklyn,11217,Park Slope Gowanus


In [98]:
df_hotel_copy3[df_hotel_copy3['neighbourhood'].str.contains('Battery Park City Lower Manhattan')]

Unnamed: 0,PARID,neighbourhood_group,zipcode,neighbourhood
0,1000080039,Manhattan,10004,Battery Park City Lower Manhattan
1,1000080051,Manhattan,10004,Battery Park City Lower Manhattan
2,1000100033,Manhattan,10004,Battery Park City Lower Manhattan
3,1000110029,Manhattan,10004,Battery Park City Lower Manhattan
4,1000161301,Manhattan,10282,Battery Park City Lower Manhattan
...,...,...,...,...
2774,1000921001,Manhattan,10038,Battery Park City Lower Manhattan
2775,1000921002,Manhattan,10038,Battery Park City Lower Manhattan
2776,1000921003,Manhattan,10038,Battery Park City Lower Manhattan
2777,1001060017,Manhattan,10038,Battery Park City Lower Manhattan


In [None]:
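# Exploratory attempt: each match of `value` is replaced by `value` itself, so the column is effectively unchanged.
# regex=True is required for a callable replacement; the next cell does the actual normalisation.
for value in df_zipcode['neighbourhood']:
    df_hotel_clean_2['neighbourhood'] = df_hotel_clean_2['neighbourhood'].str.replace(value, lambda x: x.group(0).replace(x.group(0), value), regex=True)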


df_hotel_clean_2

In [None]:
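import re

# collapse any hotel neighbourhood that starts with a known neighbourhood name to that name
for value in df_zipcode['neighbourhood'].dropna().unique():
    # escape the name so characters such as '.' are matched literally
    df_hotel_clean_2['neighbourhood'] = df_hotel_clean_2['neighbourhood'].replace(to_replace=r'^{}.*'.format(re.escape(value)), value=value, regex=True)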

df_hotel_clean_2.head()
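To see what prefix-based normalisation does, and why names used as patterns are safer when escaped, here is a small sketch on made-up names. The list `canonical` and the Series `names` are illustrative and not taken from the data files.

```python
import re

import pandas as pd

# Made-up canonical names; "St. Albans" contains the regex metacharacter "."
canonical = ["Park Slope", "St. Albans"]
names = pd.Series(["Park Slope Gowanus", "St. Albans", "Stx Albans Annex"])

for value in canonical:
    # re.escape makes "." match only a literal dot
    pattern = r"^{}.*".format(re.escape(value))
    names = names.replace(to_replace=pattern, value=value, regex=True)

print(names.tolist())  # → ['Park Slope', 'St. Albans', 'Stx Albans Annex']
```

Without `re.escape`, the unescaped "." would match any character, and "Stx Albans Annex" would also be rewritten to "St. Albans".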

In [None]:

df_hotel_clean_2[df_hotel_clean_2['neighbourhood'].str.contains('Gowanus')]
    
    

In [None]:
df_hotel_clean_2['hotel_counts_per_neighbourhood'] = df_hotel_clean_2.groupby('neighbourhood')['neighbourhood'].transform('count')

In [None]:
df_hotel_clean_2.head()

In [None]:

df_hotel_clean_2.value_counts(['neighbourhood', 'hotel_counts_per_neighbourhood']).sort_index()
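As an alternative to adding a count column with `transform` and de-duplicating afterwards, a one-row-per-neighbourhood table can be built directly with `groupby(...).size()`. A sketch on a toy frame, using the same column names as above:

```python
import pandas as pd

# Toy rows: one row per hotel property
toy = pd.DataFrame({
    "neighbourhood": ["Park Slope", "Park Slope", "Harlem"],
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Manhattan"],
})

# One row per (borough, neighbourhood) pair with the number of properties in it
counts = (
    toy.groupby(["neighbourhood_group", "neighbourhood"])
       .size()
       .reset_index(name="hotel_counts_per_neighbourhood")
)
print(counts)
```

Grouping on both keys also keeps neighbourhoods with the same name in different boroughs apart, which a count over `neighbourhood` alone would not.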

Airbnb

In [None]:
df_airbnb = pd.read_csv('../data/airbnb_open_data_full_clean.csv')
df_airbnb['airbnb_counts_per_neighbourhood'] = df_airbnb.groupby('neighbourhood')['neighbourhood'].transform('count')
df_airbnb.head()

In [None]:
df_hotel_temp = df_hotel_clean_2[['neighbourhood', 'neighbourhood_group', 'zipcode', 'hotel_counts_per_neighbourhood']].copy()
df_hotel_temp.drop_duplicates(subset='neighbourhood', keep="last", inplace=True)
df_hotel_temp.reset_index(drop=True, inplace=True)
df_hotel_temp.head()

In [None]:
# save the per-neighbourhood hotel counts to CSV
df_hotel_temp.to_csv('../data/hotel_counts.csv')

In [None]:
#df_airbnb_zip = df_airbnb.merge(df_zipcode[['neighbourhood', 'neighbourhood_group', 'zipcode']], on=['neighbourhood', 'neighbourhood_group'])
#df_airbnb_zip.head()

In [None]:
df_airbnb_temp = df_airbnb[['neighbourhood', 'neighbourhood_group', 'airbnb_counts_per_neighbourhood']].copy()
df_airbnb_temp.drop_duplicates(subset='neighbourhood', keep="last", inplace=True)
df_airbnb_temp.reset_index(drop=True, inplace=True)
df_airbnb_temp.head()

In [None]:
df_airbnb_temp.value_counts(['neighbourhood', 'airbnb_counts_per_neighbourhood']).sort_index()

In [None]:
merged_df = pd.merge(df_airbnb_temp, df_hotel_temp, on=['neighbourhood', 'neighbourhood_group'], how='outer').fillna(0)
merged_df.to_csv('../data/33.csv')

In [None]:
df_airbnb_hotel = df_airbnb_temp.merge(df_hotel_temp, on=['neighbourhood', 'neighbourhood_group'])
df_airbnb_hotel.head()
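The inner merge silently drops neighbourhoods whose names differ between the two tables. One way to see which rows fail to match is an outer merge with `indicator=True`. The sketch below uses toy frames with made-up counts; only the column names follow the notebook.

```python
import pandas as pd

# Toy per-neighbourhood tables; the counts are made up
airbnb = pd.DataFrame({
    "neighbourhood": ["Harlem", "Astoria"],
    "neighbourhood_group": ["Manhattan", "Queens"],
    "airbnb_counts_per_neighbourhood": [5, 3],
})
hotel = pd.DataFrame({
    "neighbourhood": ["Harlem", "Park Slope Gowanus"],
    "neighbourhood_group": ["Manhattan", "Brooklyn"],
    "hotel_counts_per_neighbourhood": [2, 9],
})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
merged = airbnb.merge(
    hotel, on=["neighbourhood", "neighbourhood_group"], how="outer", indicator=True
)

# Fill only the count columns with 0, leaving other columns untouched
count_cols = ["airbnb_counts_per_neighbourhood", "hotel_counts_per_neighbourhood"]
merged[count_cols] = merged[count_cols].fillna(0).astype(int)

# Rows present in only one of the two tables
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["neighbourhood", "_merge"]])
```

Restricting `fillna(0)` to the count columns avoids writing 0 into columns such as a zip code, where 0 is not a meaningful value.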

In [None]:
df_airbnb_hotel.to_csv('../data/Hotels_Airbnbs_Neighbourhood_counts.csv')