<a href="https://colab.research.google.com/github/OmBirari/Dinosuar/blob/main/Copy_of_Project_Airbnb_Bookings_Analysis_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognized around the world. Analyzing the data from millions of Airbnb listings is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers' and hosts' behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more. </b>

## <b>This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. </b>

## <b> Explore and analyze the data to discover key understandings (not limited to these) such as : 
* What can we learn about different hosts and areas?
* What can we learn from predictions? (ex: locations, prices, reviews, etc)
* Which hosts are the busiest and why?
* Is there any noticeable difference of traffic among different areas and what could be the reason for it? </b>

#Airbnb- A service connecting people who want to rent out a place and those who are looking for accommodation in a certain area.
###**So what's the model on which Airbnb works...**
<b> 1) Airbnb offers people an easy, relatively stress-free way to earn some income from their property.</b>

<b> 2) Guests often find Airbnb is cheaper, has more character, and is homier than hotels.</b>

<b> 3) Airbnb makes the bulk of its revenue by charging a service fee for each booking.</b>

####Airbnb's valuation reached over 100 billion USD in Q4 of 2020.
That is striking when you look at its revenue model, which is:

1) Commission from hosts: Every time a guest books a host's property and pays, Airbnb takes 10% of the payment amount as commission. This is one component of Airbnb's fee structure.

2) Transaction fee from travelers: When travelers pay for a stay, they are charged a 3% transaction fee, which adds to Airbnb's revenue.
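
To make the two fees concrete, here is a minimal sketch that applies the 10% host commission and 3% traveler fee quoted above to a hypothetical 200 USD booking. The function name `airbnb_revenue` and the booking amount are illustrative assumptions, and the rates are simply the ones stated in this notebook, not official figures.

```python
def airbnb_revenue(booking_amount, host_rate=0.10, traveler_rate=0.03):
    """Return (host_commission, traveler_fee, total) for one booking.

    The default rates are the 10% and 3% quoted in the text above.
    """
    host_commission = booking_amount * host_rate
    traveler_fee = booking_amount * traveler_rate
    return host_commission, traveler_fee, host_commission + traveler_fee


# Hypothetical 200 USD booking, chosen only for illustration.
commission, fee, total = airbnb_revenue(200)
print(f"Host commission: {commission:.2f} USD")
print(f"Traveler fee:    {fee:.2f} USD")
print(f"Airbnb revenue:  {total:.2f} USD")
```

Under these assumed rates, Airbnb keeps 13% of the booking amount in total.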

###So now let's see what the dataset has to offer:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings

In [None]:
# Install pandas-profiling before importing it below.
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

In [None]:
from pandas_profiling import ProfileReport

In [None]:
import os

In [None]:
df = pd.read_csv('/content/Airbnb NYC 2019.csv')

In [None]:
print(df)

In [None]:
#Now we generate a report
profile = ProfileReport(df)

In [None]:
profile

In [None]:
# Show the report inline, then save it as an HTML file.
profile.to_notebook_iframe()
profile.to_file("Pandas_profiling_Report.html")

In [None]:
df.info()

###So as we can see, there are 48895 rows and 16 columns.


In [None]:
df.isnull().sum()

#####Latitude and longitude together represent a coordinate, while neighbourhood, neighbourhood_group and room_type are categorical. last_review holds a date, which we will convert as required.

Four columns contain null values: name, host_name, last_review and reviews_per_month (the listing name and host_name don't really matter to us for now). So we will simply apply ```fillna(0)``` to those null values.
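
As a small illustration of what `fillna(0)` does, the sketch below uses a few made-up rows (only the column names match the dataset). It also shows a per-column fill as one possible alternative. That alternative is an assumption added for comparison, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration; only the column names match the dataset.
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room"],
    "last_review": ["2019-05-21", None, None],
    "reviews_per_month": [0.21, np.nan, np.nan],
})

print(toy.isnull().sum())  # null count per column

# What the notebook does: every null becomes 0, whatever the column holds.
filled_all = toy.fillna(0)

# One possible alternative: choose a fill value per column and
# leave the remaining columns (here last_review) untouched.
filled_per_col = toy.fillna({"name": "unknown", "reviews_per_month": 0})

print(filled_all)
print(filled_per_col)
```

Filling everything with 0 is the simplest option, but it also writes 0 into text and date columns, which is worth keeping in mind when those columns are used later.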

In [None]:
df.fillna(0, inplace=True)

In [None]:
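# Keep every column except the identifiers and coordinates.
# A list is used because recent pandas versions reject a set as an indexer.
exclude_col = [c for c in df.columns if c not in {'latitude', 'longitude', 'host_id', 'id'}]
df[exclude_col].describe()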

In [None]:
col_list = df[exclude_col].describe().columns.tolist() 

####Now we play a bit with histograms, heat maps and graphs to discover key insights.

In [None]:
plt.figure(figsize=(12, 8))
heatmap = sns.heatmap(df[col_list].corr(), linewidths=0.5, linecolor= "black", vmin=-1, annot=True, cmap="seismic")
plt.show()

######`number_of_reviews` and `reviews_per_month` are the most closely related pair, while `price` shows little relation to the other columns.
###What can we learn from predictions?
- Most people prefer staying at cheaper listings.
- Costlier properties have significantly fewer reviews, while cheaper properties have a large number of reviews.
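
For readers who want to see how single pairs are read off a correlation matrix like the one in the heatmap, here is a minimal sketch on invented numbers. The values are assumptions chosen for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented numbers for illustration only.
toy = pd.DataFrame({
    "number_of_reviews": [0, 5, 20, 60, 120],
    "reviews_per_month": [0.0, 0.3, 1.1, 2.9, 5.2],
    "price": [150, 80, 200, 95, 120],
})

corr = toy.corr()  # Pearson correlation is the pandas default
print(corr.round(2))

# Pull single pairs out of the matrix by row and column label.
reviews_pair = corr.loc["number_of_reviews", "reviews_per_month"]
price_pair = corr.loc["number_of_reviews", "price"]
print(f"reviews vs reviews_per_month: {reviews_pair:.2f}")
print(f"reviews vs price:             {price_pair:.2f}")
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.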


In [None]:
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(24, 10))
axes = axes.flatten()
for col, ax in zip(col_list, axes):
    sns.histplot(x=col, data=df, ax=ax, kde=True, element='poly')
    ax.set_title(f'Column {col} skewness: {df[col].skew():.2f}')

plt.tight_layout(h_pad=2, w_pad=1.2)

From these plots of the numerical columns, we can conclude that they have positively skewed distributions, including 'price'. Availability, however, is spread more evenly across the days of the year, which suggests that all sorts of listings are available throughout the year.
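
As a rough illustration of what positive skew means, the sketch below uses an invented price sample in which most values are modest and a few are very high. `Series.skew()` is the same pandas call used in the plot titles above. The numbers themselves are assumptions.

```python
import pandas as pd

# Invented prices: most are modest, one is very high (a long right tail).
prices = pd.Series([40, 55, 60, 70, 75, 80, 90, 100, 150, 900])

skew = prices.skew()
print(f"mean={prices.mean():.1f}, median={prices.median():.1f}, skew={skew:.2f}")
```

With a long right tail, the mean is pulled above the median and the skewness is positive.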

In [None]:
hosts_areas = df.groupby(['host_name','neighbourhood_group'])['calculated_host_listings_count'].max().reset_index()
hosts_areas.sort_values(by='calculated_host_listings_count', ascending=False).head(5)

##What can we learn about different hosts and areas?
The host with the most listings is Sonder, in Manhattan, followed by Blueground, with listings in Brooklyn and Manhattan. Manhattan and Brooklyn appear to be the busiest Airbnb locales.

#Which hosts are the busiest and why?

In [None]:
busiest_hosts = df.groupby(['host_name','host_id','room_type'])['number_of_reviews'].max().reset_index()
busiest_hosts = busiest_hosts.sort_values(by='number_of_reviews', ascending=False).head(10)
busiest_hosts

In [None]:
# Plot the table above as a bar chart.
name = busiest_hosts['host_name']
reviews = busiest_hosts['number_of_reviews']

fig = plt.figure(figsize = (12, 6))
 
# creating the bar plot
plt.bar(name, reviews, color ='Green',
        width = 0.6)
 
plt.xlabel("Name of the Host")
plt.ylabel("Number of Reviews")
plt.title("Busiest Hosts")
plt.show()

##So our Busiest Hosts are:

*Dona*,
*Jj*,
*Maya*,
*Carol*

####This is because these hosts list Private Rooms or Entire Homes, the room types preferred by most people in the city.

## Is there any noticeable difference of traffic among different areas and what could be the reason for it?

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(24, 12))
ax = axes.flatten()

sns.set_theme(style="dark")
sns.scatterplot(data=df, x='longitude', y='latitude', hue='neighbourhood_group', ax=ax[0]);
ax[0].set_title('Location of neighbourhood groups')
sns.scatterplot(data=df[df['price'] < 300], x='longitude', y='latitude', hue='price', size="price", sizes=(20, 60), palette='Dark2_r', ax=ax[1])
ax[1].set_title('Variation of Price based on Location ($0 - 300)')
sns.scatterplot(data=df, x='longitude', y='latitude', hue='number_of_reviews', size="number_of_reviews", sizes=(20, 150), palette='GnBu_d', hue_norm=(0, 2), ax=ax[2])
ax[2].set_title('Variation of number of reviews given based on location')
sns.scatterplot(data=df, x='longitude', y='latitude', hue='availability_365', style="room_type", palette='GnBu_d', ax=ax[3])
ax[3].set_title('Availability of Rooms')
plt.show()

Let's interpret the above plots.

1) The first plot shows how the Airbnb listings in the dataset are spread across New York City.

2) In the second plot, we limit the view to a maximum of *300 USD*, because about 75% of the listings are priced at or below *175 USD*.

3) In the third plot, we can see a higher number of reviews towards the outskirts of the city.

4) In the fourth plot, we can see a trend where the heart of the city is the busiest.
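
The 75% figure mentioned above corresponds to the 75th percentile of `price`. Here is a minimal sketch of how such a percentile and the share of rows kept by a price filter can be computed. The prices are invented for illustration, and in the notebook the same calls would be applied to `df['price']`.

```python
import pandas as pd

# Invented prices for illustration only.
toy = pd.DataFrame({"price": [50, 80, 100, 120, 150, 175, 200, 250, 400, 1000]})

q75 = toy["price"].quantile(0.75)   # 75th percentile
kept = toy[toy["price"] < 300]      # same kind of filter as in the second plot
share_kept = len(kept) / len(toy)

print(f"75th percentile: {q75}")
print(f"Share of rows under 300 USD: {share_kept:.0%}")
```

Checking the share of rows that a cutoff keeps is a quick way to confirm that the filter removes only the extreme tail.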

In [None]:
traffic_areas = df.groupby(['neighbourhood_group','room_type'])['minimum_nights'].count().reset_index()
traffic_areas = traffic_areas.sort_values(by='minimum_nights', ascending=False)
traffic_areas

In [None]:
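# 'minimum_nights' here holds the count of listings per (area, room type) pair,
# not a number of nights, so the y-axis is labelled accordingly.
fig = plt.figure(figsize=(12, 6))

# Grouped bars: one bar per neighbourhood group within each room type.
# A plain plt.bar call would draw the repeated room types on top of each other.
sns.barplot(data=traffic_areas, x='room_type', y='minimum_nights', hue='neighbourhood_group')

plt.xlabel("Room Type")
plt.ylabel("Number of listings")
plt.title("Traffic Areas")
plt.show()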

As we can observe, people prefer cheaper listings of entire homes or private rooms in Manhattan, Brooklyn and Queens.

###**Now let's see the duration and time of year when the house listings are mostly occupied.**

In [None]:
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')

df['Days'], df['Month'], df['Years'] = (df['last_review'].dt.day, df['last_review'].dt.month, df['last_review'].dt.year)
df['last_review'] = pd.to_datetime(df['last_review']).dt.date

In [None]:
filtered_df = df[df['Years'] != 2022]

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(21, 6))
ax = axes.flatten()

sns.histplot(data=filtered_df, x='Days', hue='room_type', multiple="stack", ax=ax[0])
ax[0].set_title('Number of guests by day of the month')
sns.histplot(data=filtered_df, x='Month', hue='room_type', multiple="stack", ax=ax[1])
ax[1].set_title('Number of guests by month of the year')
sns.despine(fig, left=True)

So, we've plotted day and month based on the last_review date. This is not fully accurate, since many guests prefer not to leave a review (myself included). Still, we can see a trend where the first and last days of the month have the most guests. Also, in the middle of the year, in June, there is a surge in guest count.
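
The idea of deriving a month from a review date and counting rows per month can be sketched on a few invented dates as follows. The values, including the deliberately unparseable entry, are assumptions for illustration only.

```python
import pandas as pd

# Invented review dates; "not a date" stands in for a missing or invalid value.
toy = pd.DataFrame(
    {"last_review": ["2019-06-01", "2019-06-30", "2019-01-15", "not a date"]}
)

# errors='coerce' turns anything unparseable into NaT instead of raising.
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")
toy["Month"] = toy["last_review"].dt.month

# Count reviews per month; rows without a valid date are dropped by default.
per_month = toy["Month"].value_counts().sort_index()
print(per_month)
```

Because only the last review of each listing is recorded, counts like these are a proxy for guest activity rather than a direct measure of it.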

#Conclusions
- Most people want private rooms/homes.
- Visitors tend to stay for a shorter time in private rooms compared to Entire Houses/Apartments.
- Visitors look for cheap places.
- It would be better if we had the average guest rating of each property; that would help in understanding the property and could also be a factor in deciding its price (a low-rated property tends to lower its price).
- Manhattan and Brooklyn are the two most distinguished, expensive and posh areas of New York.
- Although a property's location has a strong bearing on its price, a property in a popular location will stay occupied most of the time.