<center><b>New York City Airbnb Open Data </b></center>

<center> <div> <img src="New_York_City_.png" alt="New York Map" width = "400"/></div> </center>

## Background and Objective

According to Investopedia, Airbnb is an online marketplace that connects people who want to rent out their homes with people who are looking for accommodations in specific locales.

As someone who loves to travel, I use Airbnb as my go-to platform to search for lodging, vacation rentals, and tourism activities.
Aside from being a much cheaper alternative to hotels, Airbnb is very convenient thanks to its website and mobile application.

With that in mind, I want to create a **Data Visualization** on the **New York City Airbnb Dataset** to understand the following:
- Distribution of listings — (a) location, (b) room type
- Differences in the (a) price, (b) minimum nights, (c) number of reviews, and (d) availability per neighbourhood group and room type
- Relationship between the dataset features

**Hypothesis Testing** will also be conducted to determine whether the observations that arise in the Exploratory Data Analysis are statistically significant.

Ultimately, I will try to **answer the problem statement** and give a recommendation on what type of property to invest in and where.


## Problem Statement

Where is the optimal location, and what type of property should we invest in for an Airbnb business?

## About the Dataset

- **Dataset**: Airbnb listings and metrics in NYC, NY, USA (2019)
- **Context**: This data file includes the information needed to find out more about hosts and geographical availability, along with the metrics needed to make predictions and draw conclusions.
- **Source**: https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data

## Data Wrangling

#### IMPORT LIBRARIES, DEFINE FUNCTIONS, AND READ DATASET

In [None]:
# import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import folium
from folium import plugins
import missingno as msno

In [None]:
# define functions

def showoutliers(df, column_name = ""):
        iqr = df[column_name].quantile(.75) - df[column_name].quantile(.25)
        lowerbound = (df[column_name].quantile(.25)) - iqr * 1.5
        upperbound = (df[column_name].quantile(.75)) + iqr * 1.5
        lowerbound_outliers = df[df[column_name] < lowerbound]
        higherbound_outliers = df[df[column_name] > upperbound]
        outliers = pd.concat([lowerbound_outliers,higherbound_outliers])
        return outliers

def countoutliers(df, column_name = ""):
        # reuse showoutliers() so the IQR rule is defined in one place
        return len(showoutliers(df, column_name))
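
To make the IQR rule used by the helper functions above concrete, here is a minimal sketch on a small, made-up series. The `prices` values are hypothetical and are not taken from the dataset; the bounds follow the same `1.5 * IQR` fences as the functions.

```python
import pandas as pd

# Hypothetical toy prices; 100 sits far away from the other values
prices = pd.Series([1, 2, 3, 4, 5, 100])

q1 = prices.quantile(0.25)  # first quartile
q3 = prices.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

# Fences placed 1.5 * IQR below Q1 and above Q3
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values that fall outside the two fences
outliers = prices[(prices < lower_bound) | (prices > upper_bound)]

print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(f"Outliers: {outliers.tolist()}")
```

Anything outside the two fences is flagged, which is why long right tails such as price produce many "outliers" under this rule.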

In [None]:
# load the dataset

data = pd.read_csv('AB_NYC_2019.csv')
data.head()

#### DATA TYPES

In [None]:
# inspect datatypes
# info() prints its report directly and returns None, so display() is not needed
data.info()

In [None]:
# change data types

data.last_review = pd.to_datetime(data.last_review)
print(data.last_review.dtype)

#### MISSING VALUES

In [None]:
# check count of missing values
data.isnull().sum().sort_values(ascending = False).apply(lambda x: x if x > 0 else None).dropna()

In [None]:
# check ratio of missing values
data.isnull().mean().sort_values(ascending = False).apply(lambda x: x if x > 0 else None).dropna()
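
As a side note, the same "only columns with missing values" view can be produced with a boolean filter. The sketch below uses a hypothetical `toy` frame rather than the real dataset, and is just one possible way to write it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two of its three columns
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "price": [100, 120, 90, 80],
    "reviews_per_month": [np.nan, np.nan, 0.5, 1.2],
})

# The mean of a boolean mask is the fraction of missing values per column
missing_ratio = toy.isnull().mean()

# Keep only the columns that actually have missing values, largest first
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
print(missing_ratio)
```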

In [None]:
#visualize missing values

msno.bar(data)

plt.show()

I decided to drop the rows with missing values in the `name` and `host_name` columns, as these missing values are immaterial.

In [None]:
# drop missing values
data = data.dropna(subset = ['name', 'host_name'])

The missing values for `reviews_per_month` and `last_review` have the same count, so we can presume that they belong to listings that received no reviews.

With that, I will replace `NaN` values with `0` in the `reviews_per_month` column and leave `last_review` as is.

In [None]:
# fill missing values
data = data.fillna({'reviews_per_month':0})

Since there is a `number_of_reviews` column, we can validate our `reviews_per_month` column by checking that both columns contain the same number of rows with `0` reviews.

In [None]:
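# validate: count the rows equal to 0 in each column
# (len() of a boolean mask is just the row count, so use sum() instead)
(data['number_of_reviews'] == 0).sum() == (data['reviews_per_month'] == 0).sum()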
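
When counting rows that satisfy a condition, keep in mind that a comparison returns a boolean Series with one entry per row. The sketch below, on a hypothetical `toy` frame, shows the difference between taking its length and summing it.

```python
import pandas as pd

# Hypothetical toy frame: two of the four listings have no reviews
toy = pd.DataFrame({"number_of_reviews": [0, 3, 0, 7]})

mask = toy["number_of_reviews"] == 0

rows_in_mask = len(mask)           # counts every row, matching or not
zero_review_rows = int(mask.sum()) # counts only the True entries

print(rows_in_mask, zero_review_rows)
```

Comparing lengths of two masks built from the same frame is therefore always true, whereas comparing sums checks the actual counts.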

I will also create a new column named `with_review` to indicate which listings have received at least one review from guests.

In [None]:
# create a list with the corresponding values
with_review = [1 if i > 0 else 0 for i in data.reviews_per_month ]

# get the index location of the column - useful for inserting a column in a df
loc = data.columns.get_loc('number_of_reviews')

# insert with_review column
data.insert(loc, 'with_review', with_review)
data.head()

#### ADDITIONAL VALIDATION

In [None]:
#validate numerical columns
data.describe(exclude = object, datetime_is_numeric=True)

Based on the result, there are listings with zero prices. That doesn't seem right.

In [None]:
#filter zero price listings
zero_price = data['price'] == 0

print(f"No. of listing with no price: {data[zero_price].shape[0]}")
display(data[zero_price].head())

Since there are only 11 observations with no price, I'll drop them, as free listings do not seem reasonable.

In [None]:
#drop observations with 0 price

data = data.query("price > 0")

## Exploratory Data Analysis

After cleaning the dataset, the next step is to perform Exploratory Data Analysis to answer the questions defined in the objective.

#### STATISTICS OF THE DATA

Create another dataframe to remove non-relevant variables for our statistical computation

In [None]:
#remove non-relevant variable for statistical computation
data_stat = data.drop(['id', 'host_id', 'latitude', 'longitude', 'with_review'], axis = 1)


#separate relevant numeric columns
numeric = ['price', 'minimum_nights', 'number_of_reviews', 'calculated_host_listings_count', 'reviews_per_month', 'availability_365']

Compute summary statistics for numerical variables

In [None]:
#Numerical Statistics

#melt the dataframe
data_melt = pd.melt(data, id_vars= 'id', value_vars = numeric)

#compute statistics on the melted dataframe
data_group = data_melt.groupby('variable').agg({'value':['min', 'max', 'mean' , 'median', 'std']})\
            .sort_values(('value', 'mean'), ascending = False)
#display
data_group

The price for each observation has a wide range.
Also, it seems like there are properties that are occupied continuously throughout the year.

In [None]:
#Categorical Statistics

data_stat.describe(include = object)

Based on the descriptive statistics, the following can be noted:

- Out of the 48,847 listings, there are only 11,448 unique host names, which means that the `listing-to-host ratio is approximately 4:1`.
- There are only 3 room types and the most common is `Entire Home/Apartment`.<br>
- There are `18 units in Hillside Hotel` that are listed on Airbnb. <br>
- `21,642 or 44%` of the Airbnb listings in New York `are located in Manhattan`. <br>

#### DISTRIBUTION OF PRICE, MINIMUM NIGHTS, NUMBER OF REVIEWS, LISTING COUNT, AND AVAILABILITY

In [None]:
#Histogram of Numerical Columns

# check distribution

data[numeric].hist(figsize = (18,7), layout = (2,3), grid = False, bins = 20)

# suptitle() labels the whole figure; title() would overwrite the last subplot's title
plt.suptitle('Distribution of Numerical Variables')
plt.tight_layout()
plt.show()

The histograms above show that there are a lot of outliers in the dataset, which makes it harder for us to see the actual data distribution.
To resolve this, let's create two separate plots, one for the outliers and one for the remaining observations, using the `countoutliers()` and `showoutliers()` functions I defined earlier.

In [None]:
# Distribution of Outliers
fig, axes = plt.subplots(2,3, figsize = (18,7))
plt.suptitle('Distribution of Outlier Observations', fontsize = 15, y = 1)

for col, ax in zip(data[numeric], axes.flat):
        outliers = showoutliers(data, col)

        #plot the histogram in the specified axes
        outliers[col].hist(bins = 20, grid = False, ax = ax)

        #chart formatting
        ax.set_title(f"{col.replace('_', ' ').title()}\nOutlier count: {countoutliers(data, col)}")

plt.tight_layout()
plt.show()

In [None]:
# Distribution of Observations Excluding Outliers

fig, axes = plt.subplots(2,3, figsize = (18,7))
plt.suptitle('Distribution of Observations Excluding Outliers', fontsize = 15, y = 1)

for col, ax in zip(data[numeric], axes.flat):
        outliers = showoutliers(data, col)
        normal = data[~ data.index.isin(outliers.index)]

        #plot the histogram in the specified axes
        normal[col].hist(bins = 20, grid = False, ax = ax)

        #chart formatting
        ax.axvline(x = normal[col].mean(), color = 'red', label = 'Mean', linestyle = '--')
        ax.axvline(x = normal[col].median(), color = 'green', label = 'Median', linewidth = 2)
        ax.legend()
        ax.set_title(f"{col.replace('_', ' ').title()}\nNormal count: {len(data) - countoutliers(data, col)}")

plt.tight_layout(h_pad=3)
plt.show()

After excluding the outliers, we can now see the distribution of each variable clearly. I also plotted the mean and median to add more information.
From this plot, we can presume that our numeric columns are mostly skewed to the right; in other words, most observations have small values. The following are also noted:

- Price ranges from around 20 to 350 USD per night, with a mean of approximately 120 USD.
- A few listings require a minimum stay of more than 5 nights.
- A listing receives only about 10 reviews on average, although a few received more than 20 reviews.
- The majority of hosts in NYC have only one listing.
- More than 20,000 properties are occupied almost all year round; however, there are also properties with almost no bookings.

#### DISTRIBUTION OF LISTINGS

In [None]:
#Geospatial Distribution Plot

#define variables
latitude = 40.7128
longitude = -74.0060
ny_geo = "Borough Boundaries.geojson"
group = data.groupby('neighbourhood_group', as_index = False)['price'].mean()

#create map instance
ny_map = folium.Map(location = [latitude, longitude], zoom_start=10, tiles = 'Stamen Toner')

#add choropleth
choropleth = folium.Choropleth(geo_data=ny_geo,
                               data=group,
                               columns=['neighbourhood_group', 'price'],
                               key_on='feature.properties.boro_name',
                               fill_color='YlOrRd', 
                               fill_opacity=0.7, 
                               line_opacity=1,
                               legend_name='AirBnb Price',
                               highlight = True,
                               smooth_factor = 0).add_to(ny_map)


#add markers indicating the number of listings in the area
listings = plugins.MarkerCluster().add_to(ny_map)
for lat, lng, label in zip(data['latitude'], data['longitude'], data['room_type']):
        folium.Marker(location=[lat, lng],
                      icon=None,
                      popup=label).add_to(listings)
    

#add labels indicating the name of the community
style_function = "font-size: 15px; font-weight: bold"
choropleth.geojson.add_child(folium.features.GeoJsonTooltip(['boro_name'], style=style_function, labels=False))

#display map
ny_map


The markers in the geospatial map confirm the previous observation that close to half of the Airbnb listings in New York are in Manhattan. <br>

In addition, I added a choropleth to visualize the average prices across the neighborhood groups. <br>
The closer the color is to dark red, the higher the average price in that location. <br>

We can observe that aside from having the largest number of Airbnb listings, Manhattan also has the highest average price. <br>
Meanwhile, Bronx and Queens have the lowest average prices.

In [None]:
# Property Listings per Neighbourhood and Room Type

#configure parameters
sns.set(rc = {'figure.figsize':(17,8)},
        style = 'ticks')

#specify order
order = data['neighbourhood_group'].value_counts().sort_values().index

#plot the data
sns.countplot(x = 'neighbourhood_group', data = data, order = order, hue = 'room_type', palette = 'rocket')

#display
plt.title('Count of Listings per Neighborhood and Room Type', fontsize = 15)
plt.show()

The neighborhood group with the smallest number of Airbnb listings is Staten Island. There are two beaches in Staten Island, South Beach and Midland Beach.
Theoretically, locations like these should have more listings because they are tourist spots.

#### DIFFERENCES BETWEEN GROUPS

In [None]:
#create subplots
fig, axes = plt.subplots(2, 3, figsize=(16, 8))

#specify the order
order = data['neighbourhood_group'].value_counts(ascending = True).index

#plot the data
for col, ax in zip(data[numeric], axes.flat):
    sns.boxplot(x='neighbourhood_group',
                y=col,
                data=data,
                showfliers=False,
                ax=ax,
                palette='Blues',
                order=order)
                
    ax.set_xlabel('')

#customize chart
plt.suptitle('Differences Between Neighborhood Groups', fontsize=15, y=1)
plt.tight_layout(h_pad=3)

#display
plt.show()

- There is a noticeable difference in prices and minimum nights required between Manhattan, Brooklyn, and the rest of the neighborhood groups in New York.<br><br>

- The number of reviews given by guests in Staten Island is slightly higher than in the other neighborhood groups.<br><br>

- Despite having higher prices, the listings in Manhattan are available for only about 50 days a year on average.<br> This means that out of 365 days, about 315 days have been booked for a stay, so the Manhattan units are occupied by guests roughly 86% of the time.<br><br>

- Meanwhile, listings in Staten Island are available almost 50% of the time. A seasonality analysis might be appropriate in this case. <br>Unfortunately, the data needed is not available.

In [None]:
#create subplots
fig, axes = plt.subplots(2,3, figsize = (16,8))

#define the order
order = ['Private room', 'Shared room', 'Entire home/apt']

#plot the data
for col, ax in zip(data[numeric], axes.flat):
        sns.boxplot(x = 'room_type', y = col, data = data, showfliers = False, ax = ax, palette = 'rocket', order = order)
        ax.set_xlabel('')

#customize chart
plt.suptitle('Differences Between Room Types', fontsize = 15, y = 1)
plt.tight_layout(h_pad = 3)

#display
plt.show()

- Entire Home/Apt is the most expensive room type. Its average price is 300% higher than that of a Shared Room and 230% higher than that of a Private Room. This makes sense, as an Entire Home/Apt offers more space than a Private Room or a Shared Room. <br> <br> 
- Shared Room has the lowest minimum number of nights, followed by Private Room and Entire Home/Apt.<br><br>

- The three room types received an almost equal number of reviews. <br><br>

- As for availability, Shared Room has the most available days and Entire Home/Apt has the fewest.

In [None]:
#create subplots
fig, axes = plt.subplots(2,3, figsize = (16,8))

#define the order
order = data['neighbourhood_group'].value_counts().index

#plot the data
for col, ax in zip(data[numeric], axes.flat):
        sns.pointplot(hue = 'room_type', y = col, data = data, x = 'neighbourhood_group', join = True, ci = 68, palette = 'rocket', order = order, ax = ax)
        ax.set_xlabel('')
        ax.legend(loc = 'upper right')

#customize chart
plt.suptitle('Differences Between Neighbourhood Group and Room Type', fontsize = 15, y = 1)
plt.tight_layout(h_pad = 3)

#display
plt.show()

- Entire Home/Apt listings in Manhattan have the highest average price, followed by Brooklyn, Staten Island, Queens, and Bronx.<br>
Also, the large variability in Entire Home/Apt prices appears only in Staten Island and not in the other neighborhood groups. <br> <br>

- Average prices for Private Room and Shared Room are almost the same across Brooklyn, Queens, Bronx, and Staten Island. Listings in Manhattan, on the other hand, are priced slightly higher. <br> <br>

- For Private Room and Shared Room, average prices in Queens, Bronx, and Staten Island follow the order Queens > Bronx > Staten Island. That order does not hold for Entire Home/Apt, where it becomes Staten Island > Queens > Bronx.

#### RELATIONSHIP BETWEEN NUMERICAL FEATURES

In [None]:
#plot scatterplots
sns.pairplot(data[numeric], y_vars = 'price', height = 2.5)

#chart labels
plt.suptitle('Correlation Plot of Numeric Variables', y = 1.2)

#display
plt.show()

It seems like there is no linear relationship between the numerical columns.

In [None]:
#plot correlation heatmap
sns.heatmap(data[numeric].corr(), annot = True)

#display
plt.title('Correlation Plot', fontsize = 15)
plt.show()

The computed correlation coefficients support the observation from the scatter plots that there is no linear relationship between the numerical columns.
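
A low Pearson coefficient only rules out a linear relationship. As a hedged aside, a rank-based coefficient such as Spearman's can be checked as well, since it also picks up monotonic but non-linear patterns. The sketch below uses a hypothetical `toy` frame, not the Airbnb data, to show how the two measures can differ.

```python
import pandas as pd

# Hypothetical toy data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association, Spearman measures monotonic association
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson: {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the real data, `data[numeric].corr(method='spearman')` would be the analogous call. Whether it changes the conclusion here is not something this sketch can show.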

## Hypothesis Testing

After performing Exploratory Data Analysis, the next step is to test whether our observations are statistically significant

### IS THERE AN ASSOCIATION BETWEEN NEIGHBORHOOD AND ROOM TYPE?

In [None]:
# Property Listings per Neighbourhood and Room Type

#configure parameters
sns.set(rc = {'figure.figsize':(17,8)},
        style = 'ticks')

#specify order
order = data['neighbourhood_group'].value_counts().sort_values().index

#plot the data
sns.countplot(x = 'neighbourhood_group', data = data, order = order, hue = 'room_type', palette = 'rocket')

#display
plt.title('Number of Room Type per Neighborhood', fontsize = 15, y = 1.02)
plt.show()

<div class="alert alert-warning">
<strong>Ho</strong>: There is no association between neighborhood group and room type in New York City. <br>
<strong>Ha</strong>: There is an association between neighborhood group and room type in New York City.
</div>

In [None]:
# Crosstab

table = pd.crosstab(data['neighbourhood_group'], data['room_type'], margins = True, normalize = True)
table

In [None]:
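#Chi-Square Test

#the test expects raw counts without margins, not the normalized table above
counts = pd.crosstab(data['neighbourhood_group'], data['room_type'])
p_val = scipy.stats.chi2_contingency(counts, correction = True)[1]

print(f'P-value is: {p_val}')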

<div class="alert alert-success">
<strong>Conclusion:</strong> <br>Since the p-value is less than .05, we reject the null hypothesis and therefore conclude that there is an association between Neighborhood Group and Room Type.
                                       </div>
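
The chi-square test of independence works on a contingency table of observed counts and compares it with the counts expected if the two variables were unrelated. Here is a minimal sketch on a hypothetical `toy` frame; the 90/10 splits are made up for illustration and do not come from the dataset.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy listings: room type clearly depends on the neighbourhood group
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 100 + ["Queens"] * 100,
    "room_type": (["Entire home/apt"] * 90 + ["Private room"] * 10
                  + ["Entire home/apt"] * 10 + ["Private room"] * 90),
})

# Contingency table of raw counts (no margins, no normalization)
counts = pd.crosstab(toy["neighbourhood_group"], toy["room_type"])

# The function returns the statistic, the p-value, the degrees of freedom,
# and the counts expected under independence
chi2, p_val, dof, expected = stats.chi2_contingency(counts)

print(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_val:.3g}")
```

Because the statistic scales with the sample size, proportions or tables that include margin totals would not give a meaningful result.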

### DOES THE AVERAGE PRICE DIFFER PER NEIGHBORHOOD GROUP?

In [None]:
#Box Plot

#plot parameters
fig, axes = plt.subplots(figsize = (12,6))

#specify the order
order = data['neighbourhood_group'].value_counts(ascending = True).index

#plot the data
sns.boxplot(x='neighbourhood_group',
            y='price',
            data=data,
            showfliers=False,
            palette='Blues',
            order=order)
                
    

#customize chart
plt.suptitle('Average Price per Neighbourhood Group', fontsize=15, y=.98)
plt.xlabel('')
plt.tight_layout(h_pad=3)

#display
plt.show()

<div class="alert alert-warning">

<strong>Ho:</strong> The average prices do not differ between neighborhood groups <br>
<strong>Ha:</strong> The average prices differ between neighborhood groups <br>
<strong>α:</strong> 5%
</div>

In [None]:
#ANOVA

f_statistic, p_val = scipy.stats.f_oneway(data[data['neighbourhood_group'] == 'Staten Island']['price'],
                                           data[data['neighbourhood_group'] == 'Queens']['price'],
                                           data[data['neighbourhood_group'] == 'Bronx']['price'],
                                           data[data['neighbourhood_group'] == 'Brooklyn']['price'],
                                           data[data['neighbourhood_group'] == 'Manhattan']['price'])

print(f'P-value is: {p_val}')

<div class="alert alert-success">
    <strong>Conclusion:</strong> <br>
Since the p-value is less than the alpha, we reject the null hypothesis and conclude that
there are differences in average prices between neighborhood groups
</div>
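
The one-way ANOVA call takes one sample per group. As a sketch of a more compact way to build those samples, the groups can be collected with `groupby` and unpacked into `f_oneway`. The `toy` frame and its group labels `A`, `B`, `C` are hypothetical.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy prices for three groups with clearly different means
toy = pd.DataFrame({
    "neighbourhood_group": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "price": [10, 11, 12, 9, 10, 20, 21, 19, 22, 20, 30, 29, 31, 30, 32],
})

# One array of prices per group, in the order produced by groupby
samples = [group["price"].to_numpy() for _, group in toy.groupby("neighbourhood_group")]

# f_oneway accepts any number of samples as positional arguments
f_statistic, p_val = stats.f_oneway(*samples)

print(f"F = {f_statistic:.2f}, p-value = {p_val:.3g}")
```

This avoids repeating the filter expression once per neighbourhood group, and it keeps working if the set of groups changes.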

### IS THE AVERAGE PRICE IN MANHATTAN GREATER THAN THE AVERAGE PRICE IN STATEN ISLAND, BRONX, QUEENS, AND BROOKLYN?

<div class="alert alert-warning">

<strong>Ho:</strong> The average price in Manhattan is <= average prices in other NY Neighborhoods <br>
<strong>Ha:</strong> The average price in Manhattan is > average prices in other NY Neighborhoods <br>
<strong>α:</strong> 5%
    
</div>

In [None]:
#T-Test

manhattan = data[data['neighbourhood_group'] == 'Manhattan']['price']
others = data[data['neighbourhood_group'] != 'Manhattan']['price']

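#one-sided test to match Ha (the alternative argument requires SciPy 1.6 or later)
scipy.stats.ttest_ind(manhattan, others, alternative = 'greater')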

<div class="alert alert-success">
    <strong>Conclusion:</strong> <br><br>
Since the p-value is less than the alpha, we reject the null hypothesis and conclude that the average price in Manhattan is greater than in the other neighborhood groups in New York.
</div>
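
Because the alternative hypothesis is directional, a one-sided test is the natural fit. The sketch below runs one on simulated data and compares it with the two-sided default. The sample names, means, and spreads are assumptions chosen for illustration, and `equal_var=False` (Welch's variant) is an optional choice that the notebook itself does not make.

```python
import numpy as np
from scipy import stats

# Simulated, hypothetical nightly prices (not the real listings)
rng = np.random.default_rng(0)
manhattan_like = rng.normal(loc=200, scale=50, size=500)
others_like = rng.normal(loc=120, scale=50, size=500)

# One-sided test: is the first mean greater than the second?
t_stat, p_one_sided = stats.ttest_ind(
    manhattan_like, others_like, equal_var=False, alternative="greater"
)

# Two-sided default, shown only for comparison
_, p_two_sided = stats.ttest_ind(manhattan_like, others_like, equal_var=False)

print(f"t = {t_stat:.2f}")
print(f"one-sided p = {p_one_sided:.3g}, two-sided p = {p_two_sided:.3g}")
```

When the statistic points in the hypothesized direction, the one-sided p-value is half of the two-sided one, so the choice matters mostly for borderline results.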

### DOES THE AVERAGE AVAILABILITY DAYS DIFFER PER NEIGHBORHOOD GROUP?

In [None]:
#Box Plot

#plot parameters
fig, axes = plt.subplots(figsize = (12,6))

#specify the order
order = data['neighbourhood_group'].value_counts(ascending = True).index

#plot the data
sns.boxplot(x='neighbourhood_group',
            y='availability_365',
            data=data,
            showfliers=False,
            palette='Blues',
            order=order)
                
    

#customize chart
plt.suptitle('Average Availability Days in a  Year per Neighbourhood Group', fontsize=15, y=.98)
plt.xlabel('')
plt.tight_layout(h_pad=3)

#display
plt.show()

<div class="alert alert-warning">

<strong>Ho:</strong> The average availability does not differ between neighborhood groups <br>
<strong>Ha:</strong> The average availability differs between neighborhood groups <br>
<strong>α:</strong> 5%
    
</div>

In [None]:
#ANOVA

f_statistic, p_val = scipy.stats.f_oneway(data[data['neighbourhood_group'] == 'Staten Island']['availability_365'],
                                           data[data['neighbourhood_group'] == 'Queens']['availability_365'],
                                           data[data['neighbourhood_group'] == 'Bronx']['availability_365'],
                                           data[data['neighbourhood_group'] == 'Brooklyn']['availability_365'],
                                           data[data['neighbourhood_group'] == 'Manhattan']['availability_365'])

print(f'P-value is: {p_val}')

<div class="alert alert-success">
    <strong>Conclusion:</strong> <br><br>
Since the p-value is less than the alpha, we reject the null hypothesis and conclude that the average number of available days per year differs between neighborhood groups.
</div>

### DOES THE AVERAGE PRICE DIFFER BY ROOM TYPE?

In [None]:
#configure parameters
fig, axes = plt.subplots(figsize = (12,5))

#define the order
order = data.groupby('room_type')['price'].mean().index

#plot the data
sns.boxplot(y = 'room_type', x = 'price', data = data, showfliers = False, palette = 'rocket', order = order)

#customize chart
plt.suptitle('Average Price per Room Type', fontsize = 15, y = 1)
plt.tight_layout(h_pad = 3)
#plt.subplots() without a grid returns a single Axes, stored here as axes
axes.set_xlabel('')

#display
plt.show()

<div class="alert alert-warning">

<strong>Ho:</strong> The average price does not differ per room type <br>
<strong>Ha:</strong> The average price differs per room type <br>
<strong>α:</strong> 5%
    
</div>

In [None]:
#ANOVA

f_statistic, p_val = scipy.stats.f_oneway(data[data['room_type'] == 'Entire home/apt']['price'],
                                          data[data['room_type'] == 'Private room']['price'],
                                          data[data['room_type'] == 'Shared room']['price'])
                                          
                                           
print(f'P-value is: {p_val}')

<div class="alert alert-success">
    <strong>Conclusion:</strong> <br><br>
Since the p-value is less than the alpha, we reject the null hypothesis and conclude that the average price differs by room type</div>

### DOES THE AVERAGE AVAILABILITY DAYS DIFFER BY ROOM TYPE?

In [None]:
#configure parameters
fig, axes = plt.subplots(figsize = (12,5))

#define the order
order = data.groupby('room_type')['availability_365'].mean().index

#plot the data
sns.boxplot(y = 'room_type', x = 'availability_365', data = data, showfliers = False, palette = 'rocket', order = order)

#customize chart
plt.suptitle('Average Availability per Room Type', fontsize = 15, y = 1)
plt.tight_layout(h_pad = 3)
#plt.subplots() without a grid returns a single Axes, stored here as axes
axes.set_xlabel('')

#display
plt.show()

<div class="alert alert-warning">

<strong>Ho:</strong> The average availability does not differ per room type <br>
<strong>Ha:</strong> The average availability differs per room type <br>
<strong>α:</strong> 5%
    
</div>

In [None]:
#ANOVA

f_statistic, p_val = scipy.stats.f_oneway(data[data['room_type'] == 'Entire home/apt']['availability_365'],
                                          data[data['room_type'] == 'Private room']['availability_365'],
                                          data[data['room_type'] == 'Shared room']['availability_365'])
                                          
                                           
print(f'P-value is: {p_val}')

<div class="alert alert-success">
    <strong>Conclusion:</strong> <br><br>
Since the p-value is less than the alpha, we reject the null hypothesis and conclude that the average availability differs by room type</div>

The results of the hypothesis tests show that the observations from the Exploratory Data Analysis are statistically significant.

## Answer the problem statement

After exploring the dataset and testing the significance of the observations, we can now answer the question,<br> **"What type of property should we invest in and where"**

To answer this, first I need to create a column named `income` that contains the yearly income of an Airbnb unit derived from the `price` column multiplied by `365 - availability_365`.

In [None]:
#create income column
data['income'] = (365 - data['availability_365']) * data['price']

display(data.groupby('neighbourhood_group')['income'].mean().sort_values().reset_index())
display(data.groupby('room_type')['income'].mean().sort_values().reset_index())
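
To make the income estimate concrete, here is a small worked example on a hypothetical `toy` frame. It assumes, as the notebook does, that every day a listing is not available counts as a booked night. That is a simplification, since unavailable days could also be days the host blocked.

```python
import pandas as pd

# Hypothetical toy listings: one mostly booked, one available all year
toy = pd.DataFrame({
    "price": [100, 80],
    "availability_365": [65, 365],
})

# Days not available are treated as booked nights (an assumption)
toy["income"] = (365 - toy["availability_365"]) * toy["price"]

print(toy)
```

The first listing earns 300 nights at 100 USD, while the second earns nothing because it was available on all 365 days.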

In [None]:
#visualize income

#configure parameters
fig = plt.figure(figsize = (16,6))
ax0 = fig.add_subplot(1,2,1)
ax1 = fig.add_subplot(1,2,2)



#neighbourhood group
order = data.groupby('neighbourhood_group')['income'].mean().index
sns.boxplot(x = 'neighbourhood_group', y = 'income', data = data, showfliers = False, palette = 'Blues', order = order, ax = ax0)


#room type
sns.boxplot(x = 'room_type', y = 'income', data = data, showfliers = False, palette = 'rocket', ax = ax1)


#customize chart
plt.suptitle('Average Income per Neighborhood and Room Type', fontsize = 15, y = 1)
plt.tight_layout(h_pad = 3)
ax0.set_xlabel('')
ax1.set_xlabel('')

#display
plt.show()


Check how many listings have not been booked at least once during the year.

In [None]:
no_guest = data['availability_365'] == 365
data_no_guest = data[no_guest]
data_no_guest.groupby('neighbourhood_group')['name'].count().sort_values().reset_index()

In [None]:
#visualize data

#configure parameters
fig = plt.figure(figsize = (16,6))


#plot
order = data_no_guest.groupby('neighbourhood_group')['id'].count().sort_values().index
sns.countplot(x = 'neighbourhood_group',data = data_no_guest,palette = 'Blues', order = order)


#customize chart
plt.suptitle('Number of Airbnb Listings with Zero Bookings', fontsize = 15, y = .95)
#ax0 and ax1 belong to the previous figure, so clear the label of the current axes instead
plt.xlabel('')

#display
plt.show()


There are 572 units in Manhattan that have not been booked even once, while there are only 12 units that have not been booked in Staten Island. 

For our analysis, it would be more helpful to get the booking ratio per neighborhood and room type. <br>
The booking ratio is the percentage of Airbnb listings that are booked at least once out of the total units per room type and neighborhood.

To do that, I will create a `booked` column containing values as follows:
- `0`: No bookings for the year
- `1`: With at least 1 booking for the year

In [None]:
#create booked column
#a listing that is available all 365 days had no bookings, matching the no_guest filter above
data['booked'] = [1 if i != 365 else 0 for i in data.availability_365]

Next, I will create a dataframe containing the `booking ratio`.

In [None]:
booking = data.groupby(['neighbourhood_group','room_type'], as_index = False)['booked'].mean()\
              .sort_values('booked', ascending = False).reset_index(drop = True)
booking
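
The booking ratio relies on a handy property of 0/1 flags: their mean is the share of rows equal to 1. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy listings with a 0/1 booked flag
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Bronx", "Bronx", "Queens", "Queens"],
    "booked": [1, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of rows flagged as booked
ratio = toy.groupby("neighbourhood_group")["booked"].mean()

print(ratio)
```

Three of the four Bronx rows are booked, so its ratio is 0.75, and one of the two Queens rows gives 0.5.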

From the dataframe above, we can see that **90% of the Private Rooms in Staten Island have been occupied by guests**, while **Manhattan's Private Rooms have a booking rate of only 60%.**

Next, I want to estimate how much of the year each listing is occupied, per neighborhood group and room type: the `occupancy rate per listing`. <br>
I will express it as the share of days occupied, computed as follows: `(365 - availability_365) / 365`

In [None]:
data['occupancy_rate'] = (365 - data['availability_365'])/365

occupancy = data.groupby(['neighbourhood_group', 'room_type'], as_index = False)['occupancy_rate'].mean()\
                .sort_values('occupancy_rate', ascending = False).reset_index(drop = True)
occupancy
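
A quick arithmetic check of the occupancy rate for a single hypothetical listing shows why the subtraction has to be grouped before dividing. The value of `availability` is made up.

```python
# Hypothetical listing that is available 73 days a year
availability = 73

# Occupied days divided by the days in a year: 292 / 365
occupancy_rate = (365 - availability) / 365

# Without the parentheses, division binds first and the result is not a rate
without_parentheses = 365 - availability / 365

print(f"Occupancy rate: {occupancy_rate:.2f}")
print(f"Without parentheses: {without_parentheses:.2f}")
```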

- Staten Island's Shared Rooms are occupied by guests 82% of the time. The hosts of these listings must be very busy throughout the year. <br><br>
- They are followed by Brooklyn's Entire home/apt listings, with a 73% occupancy rate. <br><br>
- On the other hand, Staten Island's Private Rooms have the lowest occupancy rate.

To choose the best investment, I will consider three things:
- Weighted Annual Income,
- Booking Rate - percentage of units that have guests, and
- Occupancy Rate - percentage of days (out of 365 days) that the unit has guests

Since I am a busy person, basically, I want a unit with the highest weighted annual income with the lowest occupancy rate.<br>`Highest income, Lowest effort`

In [None]:
#create income dataframe with the booking rate

income = data.groupby(['neighbourhood_group','room_type']).agg({'booked':'mean', 'price':'mean'})
income

In [None]:
#create model dataframe by joining the income dataframe and occupancy dataframe
model = pd.merge(income, occupancy, how = 'left', on=['neighbourhood_group', 'room_type'])

#create weighted annual income column
model['weighted_annual_income'] = model['price'] * 365 * model['occupancy_rate'] * model['booked']
model = model.sort_values(['occupancy_rate'], ascending = True).reset_index(drop = True)

model
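
As a worked example of the weighted annual income, the figures below are hypothetical and only trace the arithmetic: the nightly price times the expected occupied nights, scaled by the share of listings that get booked at all.

```python
# Hypothetical segment averages (not taken from the dataset)
price = 150.0          # average nightly price in USD
occupancy_rate = 0.6   # share of the year a listing is occupied
booked = 0.9           # share of listings with at least one booking

# Expected occupied nights in a year
expected_nights = 365 * occupancy_rate

# Income weighted by the chance that a listing is booked at all
weighted_annual_income = price * expected_nights * booked

print(f"Expected nights: {expected_nights:.0f}")
print(f"Weighted annual income: {weighted_annual_income:.2f}")
```

With these inputs the estimate is 150 x 219 x 0.9 = 29,565 USD per year.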

In [None]:
#set the income threshold
thresh =  model['weighted_annual_income'].mean()

#choose the highest income that is greater than the threshold with the lowest occupancy rate
#model is sorted by occupancy rate, so stop at the first row that qualifies
for i in range(len(model) - 1):
    current = model.loc[i, 'weighted_annual_income']
    #the income must beat both the threshold and the next row's income
    if current > thresh and current > model.loc[i + 1, 'weighted_annual_income']:
        print(f"Recommended location is: {model.loc[i, 'neighbourhood_group']}")
        print(f"Recommended room type is: {model.loc[i, 'room_type']}")
        print(f"Estimated Annual Income is: {current:.2f}")
        print(f"This is {(current / thresh - 1) * 100:.2f}% higher than the threshold")
        break
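
The "highest income, lowest effort" idea can also be sketched with plain filtering and sorting. This is a simplified alternative rather than the notebook's exact rule: it keeps the segments above the mean income and picks the one with the lowest occupancy rate, without comparing against the next row. The `toy_model` frame and all its values are hypothetical.

```python
import pandas as pd

# Hypothetical segment summary (values made up for illustration)
toy_model = pd.DataFrame({
    "neighbourhood_group": ["Queens", "Brooklyn", "Manhattan", "Bronx"],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt", "Shared room"],
    "occupancy_rate": [0.55, 0.60, 0.70, 0.80],
    "weighted_annual_income": [10000.0, 30000.0, 40000.0, 8000.0],
})

# Income threshold: the mean across all segments
thresh = toy_model["weighted_annual_income"].mean()

# Keep the segments above the threshold, then take the lowest occupancy rate
candidates = toy_model[toy_model["weighted_annual_income"] > thresh]
best = candidates.sort_values("occupancy_rate").iloc[0]

# Percentage by which the chosen income exceeds the threshold
pct_above = (best["weighted_annual_income"] / thresh - 1) * 100

print(f"Recommended: {best['neighbourhood_group']}, {best['room_type']}")
print(f"{pct_above:.2f}% above the threshold of {thresh:.0f}")
```

The percentage above a threshold is `(value / threshold - 1) * 100`; dividing the threshold by the value would instead give the threshold as a share of the income.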

### That's all. Thank you!

#### I would really appreciate it if you could give some feedback on this project so I know what to improve on. Cheers!