<a href="https://colab.research.google.com/github/Nilaydhage/Capstone---I/blob/main/eda_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Airbnb Booking Analysis

##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name -** Shrikant N. Patole
##### Nilay H. Dhage

### **Problem Statement:**
Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalised way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognised around the world. Analysing the data from the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analysed and used for security, business decisions, understanding customers' and hosts' behaviour and performance on the platform, guiding marketing initiatives, implementing innovative additional services and much more. This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. Let's explore and analyse the data to discover key insights.

### **Github Link**


## **1. Importing the libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

## **2. Load Dataset**

In [None]:
 #load dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df=pd.read_csv("/content/Airbnb NYC 2019 (2).csv")

In [None]:
#lets see first 5 rows of dataset
df.head()

In [None]:
#Dataset Rows and Column count
df.shape

In [None]:
#Get the non-null count and datatypes of values in each column
df.info()

In [None]:

df.isnull().sum()

In [None]:
 # Get Descriptive stats of data 
df.describe()

**Variable Description:**

1. id: unique ID of the listing
2. name: name of the listing
3. host_id: unique ID of the host
4. host_name: name of the host
5. neighbourhood_group: neighbourhood group (borough) in which the listing is located
6. neighbourhood: neighbourhood within the neighbourhood group
7. latitude: latitude coordinate of the listing
8. longitude: longitude coordinate of the listing
9. room_type: type of listing (entire home/apartment, private room or shared room)
10. price: price of the listing
11. minimum_nights: minimum number of nights a guest must book
12. number_of_reviews: total number of reviews for the listing
13. last_review: date of the last review
14. reviews_per_month: average number of reviews per month
15. calculated_host_listings_count: number of listings the host has
16. availability_365: number of days in the year when the listing is available for booking

Checking unique values of categorical columns

In [None]:
df['neighbourhood_group'].unique()

In [None]:
df['room_type'].unique()

In [None]:
len(df['neighbourhood'].unique())

In [None]:
len(df['host_id'].unique())

## **Data Wrangling**

#### **3.Replacing Null values**
The minimum value in the price column is 0, which is not a valid price. To replace these zeros with the interpolate method, we first convert them to NaN and then sort the data by neighbourhood group, neighbourhood, room type and price, in that order. We assume that the first three columns largely determine the price.

In [None]:
# First, remove duplicate rows from the original dataset
df.drop_duplicates(inplace=True)

In [None]:
# Make a working copy of the de-duplicated dataset so df and df1 share the same index
df1=df.copy()

In [None]:
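# Replace 0 in the price column with np.nan
# (assign the result back; calling replace(..., inplace=True) on a selected column is unreliable)
df1['price']=df1['price'].replace(0,np.nan)
# Let's see how many null values we have now
df1.isnull().sum()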

In [None]:
# Checking the NaN values replaced on zero by comparing to original dataset
df1.loc[df['price']==0]

In [None]:
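# Sort the working copy (df1, not df, so the NaN prices are kept) to use the interpolate method
df1=df1.sort_values(by=['neighbourhood_group','neighbourhood','room_type','price'],ascending=True)
# method='nearest' needs SciPy and uses the numerical index values, so interpolate on a
# positional index to pick the nearest row in the sorted order, then assign back by position
df1['price']=df1['price'].reset_index(drop=True).interpolate(method='nearest').to_numpy()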

In [None]:
# Lets see if the null values has been replaced by some values
df1.loc[df['price']==0]

In [None]:
# To crosscheck the Interpolate method,lets calculate average price of room depending on Neighbourhood group and room types
Average_price=df1.groupby(['neighbourhood_group','room_type'])['price'].mean()
Average_price

Comparing the 11 replaced values with the group averages above, they are reasonably close, though they can still vary by neighbourhood. The zero values in the price column have now been replaced.
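
As a further cross-check, a group-wise median fill is one alternative to interpolation. The sketch below is not the method used above: it works on a small made-up frame (`toy`, a hypothetical name) that reuses the dataset's column names, and it replaces zero prices with the median of the known prices in the same neighbourhood group and room type.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn', 'Brooklyn', 'Brooklyn',
                            'Manhattan', 'Manhattan', 'Manhattan'],
    'room_type': ['Private room'] * 3 + ['Entire home/apt'] * 3,
    'price': [60, 0, 80, 200, 150, 0],
})

# Treat a price of 0 as missing
toy['price'] = toy['price'].replace(0, np.nan)

# Median of the known prices within each (neighbourhood_group, room_type) group,
# broadcast back to the original rows
group_median = toy.groupby(['neighbourhood_group', 'room_type'])['price'].transform('median')

# Fill only the missing prices with their group median
toy['price'] = toy['price'].fillna(group_median)
print(toy)
```

Unlike nearest-neighbour interpolation, this fill does not depend on row order, which makes it easy to compare against the interpolated values.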

In [None]:
# We can replace the null values in reviews_per_month column with 0

df1['reviews_per_month'] = df1['reviews_per_month'].replace(np.nan, 0)

The 'last_review' column shows that some listings were last reviewed before 2017, while the data is from late 2019. We assume that listings with no review since mid-2017 are no longer active on Airbnb, so we remove them.

In [None]:
# Convert the last_review column to datetime values.
def str2date(x):
    try:
        return datetime.strptime(x,'%Y-%m-%d')
    except (TypeError,ValueError):
        # Missing (NaN) or malformed dates become NaT
        return pd.NaT

df1['last_review']=df1['last_review'].apply(str2date)


In [None]:
#we can see that there are reviews from before 2017.
df1['last_review'].value_counts()

In [None]:
# We will split the date into year and month columns and then drop all listings last reviewed before July 2017.

def convert_to_year(datevalue):
  return datevalue.year
def convert_to_month(datevalue):
  return datevalue.month

df1['last_review_year'] = df1['last_review'].apply(convert_to_year)
df1['last_review_month']=df1['last_review'].apply(convert_to_month)

index_def=df1[df1['last_review_year']<2017].index
df1=df1.drop(index_def)
index_def2=df1[(df1['last_review_year']==2017) & (df1['last_review_month']<7) ].index
df1=df1.drop(index_def2)

In [None]:
# Lets drop the year and month column to get as original data.
df1.drop(columns=['last_review_year','last_review_month'],inplace=True)

In [None]:
# Lets see the total null values present now
df1.isnull().sum()

The 'name' and 'host_name' columns are identified more reliably by 'id' and 'host_id', so the missing values in those columns are not a concern. The date filtering is now done, so we can continue with the null values that remain in the 'last_review' column.
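
The date filter above can also be written in vectorised form. The following is a minimal sketch on made-up data (`toy`, `cutoff`, `keep` and `recent` are hypothetical names), assuming the same rule: drop listings last reviewed before July 2017 and keep listings that have no review date.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'last_review': ['2016-05-01', '2017-03-15', '2017-07-01', '2019-06-30', None],
})

# Missing or malformed dates become NaT instead of raising an error
toy['last_review'] = pd.to_datetime(toy['last_review'], format='%Y-%m-%d', errors='coerce')

# Keep listings reviewed on or after 1 July 2017, plus listings with no review date
cutoff = pd.Timestamp('2017-07-01')
keep = toy['last_review'].isna() | (toy['last_review'] >= cutoff)
recent = toy[keep]
print(recent)
```

A single boolean mask avoids the temporary year and month columns, at the cost of making the cutoff date explicit.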

## **Data Visualizations**

In [None]:
# Create some font dictionary for title and x,y axis labelling purpose.
font1 = {'family':'serif','size':18}
font2 = {'family':'serif','color':'darkred','size':14}

### **Q1) Highest number of listings owned by a host**

In [None]:
highest_hotel_owner=df1['host_id'].value_counts().head(10)
highest_hotel_owner

In [None]:
plt.figure(figsize=(10,5))
highest_hotel_owner.plot(kind='bar',color='#7eb54e',edgecolor='green')
plt.xlabel('Host ID',fontdict=font2)
plt.ylabel('No of Properties',fontdict=font2)
plt.title('Highest No of Properties',fontdict=font1)
plt.show()

### Reason for selecting bar chart:
A bar chart makes it easy to compare values across hosts.
### Insights found from the chart
The hosts with IDs '219517861', '107434423' and '30283594' own the highest number of properties, in that order.
### Gained useful insight
The top 3 hosts each have roughly 100 or more properties listed, so the company could offer them extra facilities or rewards to help maintain good relations.

### **Q2) Number of listings depending on location**

In [None]:
Locations = df['neighbourhood_group'].value_counts()
Locations

In [None]:
plt.figure(figsize=(10,6))
plt.title("Neighbourhood Group",fontdict=font1)
plt.pie(Locations, labels=Locations.index, autopct='%1.1f%%')
plt.show()

### Reason for selecting Pie chart
A pie chart displays the percentage share of each category in a simple, understandable way.
### Insights found from the chart
* The large majority of listings (85.4%) are in Brooklyn and Manhattan, which suggests that customers prefer these as staying locations.
* Queens accounts for most of the remaining listings, while the Bronx and Staten Island have the fewest.

### Gained useful insights
* As more customers may prefer to stay in Brooklyn and Manhattan, the number of properties in these areas could be increased.
* Staten Island and the Bronx are the least preferred locations, so properties in these areas may contribute less to revenue.
* The company should try to attract more customers to Staten Island and the Bronx.
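
The percentages behind the pie chart can also be computed directly. This is a small sketch on a made-up column (`groups` is a hypothetical name and the values are illustrative, not the dataset's); it shows how a combined share such as the 85.4% figure could be derived.

```python
import pandas as pd

# Hypothetical toy column; values are made up for illustration
groups = pd.Series(
    ['Manhattan', 'Brooklyn', 'Manhattan', 'Queens', 'Brooklyn',
     'Manhattan', 'Bronx', 'Brooklyn', 'Manhattan', 'Brooklyn'],
    name='neighbourhood_group',
)

# Share of listings per neighbourhood group, in percent
share = groups.value_counts(normalize=True).mul(100).round(1)

# Combined share of the two largest groups
top_two = share.nlargest(2).sum()
print(share)
print(top_two)
```

Note that these shares describe where listings are located, which is only a proxy for where customers choose to stay.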

### **Q3) Most popular neighbourhood preferred by customers** 

In [None]:

popular_neighbourhood =df1.neighbourhood.value_counts().sort_values(ascending=False)[:10]
popular_neighbourhood

In [None]:
x = list(popular_neighbourhood.index)
y = list(popular_neighbourhood.values)
x.reverse()
y.reverse()
colors = ['#0d2c54', '#143d8d', '#1f4e9f', '#2455b6', '#2c70b7', '#3183b8', '#4393c9', '#5aa9d1', '#81c2e7', '#aed6f1']
plt.figure(figsize=(8, 6))
plt.title("Most Popular Neighbourhood",fontdict=font1)
plt.ylabel("Neighbourhood Area",fontdict=font2)
plt.xlabel("Number of listings",fontdict=font2)
plt.barh(x,y,color=colors)

### Reason for selecting Horizontal Bar Plot
A horizontal bar plot makes it easy to compare a numerical value across categories with long names.
### Insights found from Bar plot
* The counts plotted are listings per neighbourhood, which we use as a proxy for popularity with guests.
* The top 4 neighbourhoods each have more than 2000 listings, and the remaining 6 each have more than 1000.
* Bedford-Stuyvesant and Williamsburg are the busiest neighbourhoods, with more than 3000 listings.

### Useful Insights
The company should provide more facilities to customers in these neighbourhoods, as this may have a high impact on the number of customers.







### **Q4) Host with highest reviews**

In [None]:
preferred_host=df1.groupby('host_id')['number_of_reviews'].sum().sort_values(ascending=False).head(10).reset_index()
preferred_host['host_id'] = preferred_host['host_id'].apply(str)
preferred_host


In [None]:
plt.figure(figsize=(6,6))
colour=['#1f77b4', '#2ca02c', '#d62728', '#ff7f0e', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
plt.hlines(y='host_id', xmin=0, xmax='number_of_reviews', color=colour,data=preferred_host)
plt.title('Host with Highest Reviews',fontdict=font1)
plt.ylabel('Host ID',fontdict=font2)
plt.xlabel('Number of reviews',fontdict=font2)
plt.scatter('number_of_reviews','host_id',data=preferred_host,c=colour)


### Reason to use horizontal line plot
For comparing a categorical variable against a numerical one, a horizontal line plot combined with scatter points gives a clear and easily understood visualization.
### Insights from the plot
The top 3 hosts have more than 2000 reviews each, and the remaining 7 have more than 1000 reviews each.
### Useful insights
The company can collect feedback from these highly reviewed hosts and share their practices with other hosts to improve their service.

### **Q5) Room types preferred by customers in each neighbourhood group**

In [None]:
plt.figure(figsize=(10,5))
plt.title("Room types preferred by customer",fontdict=font1)
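# Pass the DataFrame and column names as keywords; recent seaborn versions no longer accept positional x
sns.countplot(data=df,x='neighbourhood_group',hue='room_type',palette="tab10")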


### Reason for choosing countplot
A countplot compares the count (a numerical value) across two categorical variables at once. Bar plots are also easy to understand and explain.
### Insights found from countplot
* In Brooklyn and Staten Island, the numbers of private rooms and entire homes/apartments are almost the same.
* In Manhattan, entire homes/apartments are clearly the most common room type.
* In Queens and the Bronx, private rooms are more common than the other room types.
* Irrespective of location, shared rooms are the least common room type.

### Useful Insights
* If the company wants to expand its business in Manhattan, it should focus on adding more entire homes/apartments.
* The company can investigate why so few people stay in shared rooms. If they have a negative impact on revenue, it should reduce the number of shared rooms or drop them.
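
The counts behind the countplot can be read off a cross-tabulation. The sketch below uses made-up rows (`toy`, `counts` and `row_share` are hypothetical names) with the dataset's column names, and shows both raw counts and the share of each room type within a neighbourhood group.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Manhattan',
                            'Brooklyn', 'Brooklyn', 'Queens'],
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room',
                  'Private room', 'Entire home/apt', 'Private room'],
})

# Number of listings for each (neighbourhood_group, room_type) pair
counts = pd.crosstab(toy['neighbourhood_group'], toy['room_type'])

# Share of each room type within its neighbourhood group (each row sums to 1)
row_share = pd.crosstab(toy['neighbourhood_group'], toy['room_type'], normalize='index')
print(counts)
print(row_share)
```

Row shares make groups of very different size comparable, which the raw bar heights do not.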

### **Q6) average price of neighbourhood group depending on room type**

In [None]:
avg_price= df1.groupby(['neighbourhood_group','room_type'])['price'].mean().reset_index()
avg_price

In [None]:
sns.catplot(x='neighbourhood_group',y='price',hue='room_type',data=avg_price,kind='bar',height=5,aspect=1.5,palette='hls').set(title='Average Price at Neighbourhood Group')

### Reason for using catplot
A categorical plot of kind 'bar' compares one numerical value across two categorical variables. Bar plots are also easy to understand and explain.
### Insights found
* The average price of entire homes/apartments is the highest of the three room types in every neighbourhood group.
* The average prices of private rooms and shared rooms are almost the same.

### Useful insights
* The previous graph shows that private rooms are far more common than shared rooms. A customer is likely to choose a private room when it costs only a little more than a shared room.
* The company should reduce the price of shared rooms. Because several customers stay in one shared room, shared rooms can still earn good revenue.
* Although Queens, the Bronx and Staten Island have fewer visitors, their average prices are comparatively high, so the company should reduce prices in these neighbourhood groups to increase traffic.
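
Averages can be pulled up by a few very expensive listings, so it may help to look at the median alongside the mean before comparing room types. This is a sketch on made-up prices (`toy` and `summary` are hypothetical names); whether outliers actually explain the similar private and shared room averages in this dataset is an assumption that has not been checked here.

```python
import pandas as pd

# Hypothetical toy data; the 900 price is a deliberate outlier
toy = pd.DataFrame({
    'room_type': ['Shared room', 'Shared room', 'Shared room',
                  'Private room', 'Private room', 'Private room'],
    'price': [40, 50, 900, 70, 80, 90],
})

# Mean, median and number of listings per room type
summary = toy.groupby('room_type')['price'].agg(['mean', 'median', 'count'])
print(summary)
```

In this toy example the shared room mean is far above its median, while the private room mean and median agree.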

### **Q7) Availability of rooms in neighbourhood groups depending on their type**

In [None]:
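# Use the string 'mean' (passing np.mean is deprecated in recent pandas) and flatten the index for plotting
availability=df1.groupby(['neighbourhood_group','room_type']).agg(mean_avail=('availability_365','mean')).reset_index()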
availability

In [None]:
sns.set(rc = {'figure.figsize':(7,7)})

sns.lineplot(x='mean_avail',y='neighbourhood_group',hue='room_type',data=availability)
sns.scatterplot(x='mean_avail',y='neighbourhood_group',hue='room_type',data=availability,legend=False)
plt.title('Average Rooms Available in Neighbourhood Groups',fontdict=font1)
plt.ylabel('Neighbourhood Groups',fontdict=font2)
plt.xlabel('Availabilty of Rooms',fontdict=font2)
plt.legend(loc = 'upper left')


### Reason for using lineplot
A line plot can compare one numerical value across two categorical variables. Overlaying scatter points marks the exact values more clearly.
### Insights found from plot
* Room availability in busier areas such as Brooklyn and Manhattan is lower than in less busy areas.
* Brooklyn has low availability of entire homes/apartments and private rooms, even though these are the room types customers prefer.
* Manhattan could offer more availability of private rooms to increase its traffic.


In [None]:
sns.set(rc = {'figure.figsize':(7,7)})
sns.boxplot(data=df, x='neighbourhood_group',y='availability_365',palette='plasma')
plt.title('Average Rooms Available in Neighbourhood Groups',fontdict=font1)
plt.ylabel('Availabilty of Rooms',fontdict=font2)
plt.xlabel('Neighbourhood Groups',fontdict=font2)

### Reason for using Boxplot
We use box plot to show the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers”.
### Insights from the plot
* The quartile ranges for Brooklyn and Manhattan show that most rooms are available for only a small number of days while some are available for a large number of days, which gives a skewed distribution.
* The distribution of availability is fairly symmetric in Staten Island and the Bronx. Both neighbourhood groups also have higher availability than the others.
* Queens has higher availability than Manhattan and Brooklyn, but its distribution is less symmetric than those of Staten Island and the Bronx.

### Useful Insights
* Even though properties in the Bronx, Queens and Staten Island are more available, people still prefer to visit Manhattan and Brooklyn. This may be because Staten Island and the Bronx have fewer properties and comparatively higher prices.
* Properties in Manhattan and Brooklyn are busier and more preferred, so the company should try to increase the availability of rooms in these areas.
* Availability is a positive point for Staten Island, the Bronx and Queens, so steps can be taken to increase the number of visits to these areas.
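
The quartiles that the boxes represent can be computed per neighbourhood group to back up statements about symmetry and spread. The sketch below uses made-up availability values (`toy` and `quartiles` are hypothetical names) and adds the interquartile range.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    'neighbourhood_group': ['Bronx'] * 5 + ['Manhattan'] * 5,
    'availability_365': [0, 90, 180, 270, 360, 0, 0, 10, 20, 365],
})

# 25th, 50th and 75th percentiles per group, one column per quantile
quartiles = (toy.groupby('neighbourhood_group')['availability_365']
             .quantile([0.25, 0.5, 0.75])
             .unstack())

# Interquartile range: the height of the box in a box plot
quartiles['iqr'] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

A median close to the midpoint of the 25th and 75th percentiles indicates a roughly symmetric box; a median near one edge indicates skew.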

### **Q8) Trend of number of reviews with respect to price of the listings**

In [None]:
price_review_relate = df.groupby(['price'])['number_of_reviews'].max().reset_index()
fig = plt.figure(figsize = (10, 7))
plt.scatter('price','number_of_reviews',data=price_review_relate,s=8)
plt.xlabel("Price",fontdict=font2)
plt.ylabel("Number of Review",fontdict=font2)
plt.title("Price vs Number of Reviews",fontdict=font1)
plt.show()

### Reason to choose scatterplot
A scatter plot is effective for comparing two numerical variables with many data points. The density of the points helps reveal a trend.
### Insights found from data
As price increases, the number of reviews tends to decrease.
### Useful Insights
This suggests that low-priced rooms are more likely to be booked, and that price affects the number of reviews, i.e. the popularity of a property.
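
One way to check the trend numerically is to group prices into bands and compare the typical number of reviews in each band. This is an alternative sketch, not the aggregation used above (which takes the maximum number of reviews per price); the data, the band edges and the names `toy`, `bins`, `labels` and `median_reviews` are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    'price': [40, 60, 90, 120, 180, 250, 400, 800],
    'number_of_reviews': [120, 95, 80, 60, 30, 25, 10, 2],
})

# Assumed price bands; each band includes its upper edge
bins = [0, 100, 200, 1000]
labels = ['0-100', '101-200', '201-1000']
toy['price_band'] = pd.cut(toy['price'], bins=bins, labels=labels)

# Median number of reviews per price band
median_reviews = toy.groupby('price_band', observed=True)['number_of_reviews'].median()
print(median_reviews)
```

If the medians fall steadily from the lowest band to the highest, that supports the downward trend seen in the scatter plot.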

### **Q9) Minimum nights spent depending on the neighbourhood and room type**

In [None]:
min_nights=df1.groupby(['neighbourhood_group','room_type','minimum_nights'])['price'].mean().reset_index()
min_nights=min_nights.loc[min_nights['minimum_nights'] < 365]
min_nights

In [None]:
sns.scatterplot(x='minimum_nights',y='price',data=min_nights,hue='room_type');
plt.title('Price vs Minimum Nights Spent ',fontdict=font1)
plt.ylabel('Price',fontdict=font2)
plt.xlabel('Minimum Nights',fontdict=font2)

### Reason to choose scatterplot
A scatter plot can show the relation between two numerical variables, with colour adding one categorical variable.
### Insight found from scatterplot
* Average prices do not vary much with the minimum number of nights required.
* Higher minimum-night values appear mostly for private rooms and entire homes/apartments.

### Useful Insights
On average, customers are charged a similar or lower price regardless of the minimum number of nights.

###  **Q10)Latitude and longitude relation with neighbourhood group** 

In [None]:
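# Pass the DataFrame and column names as keywords; recent seaborn versions no longer accept positional x and y
sns.scatterplot(data=df1,x='longitude',y='latitude',s=10,hue='neighbourhood_group')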

plt.xlabel('Longitude',fontdict=font2)
plt.ylabel('Latitude',fontdict=font2)
plt.title('Locations',fontdict=font1)

### Reason to use scatterplot
Scatter plots are useful in multivariate analysis and are widely used to plot the distribution of latitudes and longitudes.
### Insights from the scatterplot
Properties are concentrated in particular places, such as the middle of the plot, and become sparser as we move left or right from the middle.
### Useful insights
The densely populated Manhattan and Brooklyn areas of the plot may be close to the seashore or to tourist attractions.

### **Q11) Correlation heatmap**

In [None]:
plt.figure(figsize=(10,8))
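# numeric_only=True (pandas 1.5+) restricts the correlation to numeric columns; text columns would otherwise raise an error in pandas 2.x
sns.heatmap(df1.corr(numeric_only=True),cmap='Blues', annot=True)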
plt.title("Correlation Heatmap",fontdict=font1)

### Reason to use Heatmap
A correlation heatmap shows the pairwise correlation between all numerical columns in one view. Colour intensity makes strong and weak relationships easy to spot.
### Insights from the heatmap
Number of reviews and reviews per month have a strong positive correlation, which is expected. Availability_365 and minimum nights have a small positive correlation (0.2).
### useful Insights
The heatmap shows that most numerical columns are only weakly correlated with each other. This suggests that they depend more on categorical values, although the heatmap itself does not measure that.
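
The matrix that the heatmap draws can be inspected directly. The sketch below uses a small made-up frame (`toy` and `corr` are hypothetical names) and assumes pandas 1.5 or later for the `numeric_only` argument.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    'room_type': ['Private room', 'Shared room', 'Entire home/apt', 'Private room'],
    'number_of_reviews': [10, 20, 30, 40],
    'reviews_per_month': [1.0, 2.0, 3.0, 4.0],
    'minimum_nights': [4, 3, 2, 1],
})

# Pearson correlation between numeric columns only; 'room_type' is skipped
corr = toy.corr(numeric_only=True)
print(corr)
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one; the text column is left out because correlation is defined only for numeric data.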

##  **Solution to Business Objective**

### From the above data visualizations we can suggest
* The top 3 hosts each have roughly 100 or more properties listed, so the company could offer them extra facilities or rewards to help maintain good relations.
* As more customers may prefer to stay in Brooklyn and Manhattan, the company can increase the number of properties in these areas.
* The company should try to attract more customers to Staten Island and the Bronx.
* The company should provide more facilities to customers in the top 10 busiest neighbourhoods, as this may have a high impact on the number of customers.
* The company can collect feedback from the highly reviewed hosts and share their practices with other hosts to improve their service.
* Even though properties in the Bronx, Queens and Staten Island are more available, people still prefer to visit Manhattan and Brooklyn. This may be because Staten Island and the Bronx have fewer properties and comparatively higher prices.
* The company should try to reduce the prices of shared rooms across all neighbourhoods. They seem to be overvalued in comparison to private rooms.
* Availability is a positive point for Staten Island, the Bronx and Queens, so steps can be taken to increase the number of visits to these areas.
* Properties in Manhattan and Brooklyn are busier and more preferred, so the company should try to increase the availability of rooms in these areas.
* Low-priced rooms appear more likely to be booked, and price affects the number of reviews, i.e. the popularity of a property.
* Price, number of reviews, availability and minimum nights are only weakly correlated with each other. These factors appear to depend more on neighbourhood, neighbourhood group and room type.

### ***Hurrah! We have successfully completed our EDA Capstone Project !!!***