<a href="https://colab.research.google.com/github/AkshayAI007/AirBnB-Bookings-Analysis-/blob/main/AirBnB_EDA_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - 



##### **Project Type**    - EDA- AirBnB booking analysis
##### **Contribution**    - Individual


# **Project Summary -**

Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to experience the world in a more unique, personalized way. Today, Airbnb is a one-of-a-kind service that is used and recognized around the world. Analysis of the millions of listings provided through Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding the behavior and performance of customers and providers (hosts) on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.



# **GitHub Link -**

https://github.com/AkshayAI007/AirBnB-Bookings-Analysis-.git

# **Problem Statement**


Explore and analyze the data to discover key understandings (not limited to these) such as :


*   What can we learn about different hosts and areas?

*   What can we learn from predictions? (ex: locations, prices, reviews, etc)
*   Which hosts are the busiest and why?


*   Is there any noticeable difference of traffic among different areas and what could be the reason for it?








#### **Define Your Business Objective?**

**To find hidden patterns and explore the dataset to find meaningful statistical inferences**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Dataset Loading

In [None]:

# Load Dataset
path= '/content/Airbnb NYC 2019.csv'
Airbnb_data=pd.read_csv(path)


### Dataset First View

In [None]:
# Dataset First Look
Airbnb_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
Airbnb_data.shape

### Dataset Information

In [None]:
# Dataset Info
Airbnb_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
Airbnb_data.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
Airbnb_data.isna().sum()

In [None]:
# Visualizing the missing values using Seaborn
# Seaborn heatmap: one row per column, shaded where a value is missing
plt.figure(figsize=(10,6))
sns.heatmap(Airbnb_data.isna().transpose(),
            cmap="YlGnBu",
            cbar_kws={'label': 'Missing Data'})
plt.show()

# Seaborn displot: share of missing values per column
# displot is figure-level and creates its own figure, so no plt.figure() call is needed
sns.displot(
    data=Airbnb_data.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25
)
plt.show()

### What did you know about your dataset?

The data is fairly consistent, with non-null values everywhere except the 'last_review' and 'reviews_per_month' columns, each of which has 10052 null values out of 48895 observations.
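
The null counts above can also be expressed as percentages, which makes it easier to judge how much of each column is affected. The following is a minimal sketch on a made-up toy frame; the helper name `missing_summary` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for Airbnb_data (values are made up)
toy = pd.DataFrame({
    "price": [100, 150, 80, 200],
    "last_review": ["2019-01-01", None, None, "2019-05-20"],
    "reviews_per_month": [0.5, None, None, 1.2],
})
report = missing_summary(toy)
print(report)
```

On the real data the same call would be `missing_summary(Airbnb_data)`.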

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
Airbnb_data.columns

In [None]:
# Dataset Describe
Airbnb_data.describe()

### Variables Description 

From the dataset description we can observe the following:


*   The average price of a room is about $152.72 per night.
*   The minimum stay is about 7 nights on average.
*   Listings in this NYC dataset are available for about 112 days per year on average (`availability_365`).
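
The figures above are column means taken from `describe()`. Means can be pulled upward by a few extreme values, so it can help to compare them with the median. Here is a small sketch on made-up numbers, assuming the columns `price`, `minimum_nights` and `availability_365`.

```python
import pandas as pd

# Toy values (made up) with one extreme price, mimicking a right-skewed column
toy = pd.DataFrame({
    "price": [60, 80, 100, 120, 5000],
    "minimum_nights": [1, 2, 2, 3, 30],
    "availability_365": [0, 30, 90, 180, 365],
})

# Mean and median side by side; a large gap hints at skew or outliers
center = toy.agg(["mean", "median"]).T
center["gap"] = center["mean"] - center["median"]
print(center)
```

A large positive gap for a column suggests that its mean should be quoted with some caution.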
  




### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
Airbnb_data.nunique()


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# To keep the original data unedited, work on a copy of the dataframe for further processing
df = Airbnb_data.copy()

# Remove duplicate rows, if any
df.drop_duplicates(inplace=True)

# Drop the columns that are not needed for our analysis
df.drop(['id', 'name', 'last_review'], axis=1, inplace=True)

### What all manipulations have you done and insights you found?

Duplicate rows must be excluded from the analysis so that no listing is counted twice and skews our findings.
Dropping the unnecessary features makes the analysis easier and more efficient. It is important to stay on point while performing EDA.
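
The wrangling steps described here can also be wrapped in a small function so that they are repeatable and leave the raw frame untouched. This is only a sketch: the function name `prepare` and its default column list are illustrative assumptions.

```python
import pandas as pd

def prepare(raw: pd.DataFrame, drop_cols=("id", "name", "last_review")) -> pd.DataFrame:
    """Return a cleaned copy: duplicates removed, unused columns dropped."""
    cleaned = raw.drop_duplicates()  # returns a new frame, raw is untouched
    # errors="ignore" skips columns that are absent instead of raising KeyError
    return cleaned.drop(columns=list(drop_cols), errors="ignore").reset_index(drop=True)

# Toy frame (made up): the first two rows are exact duplicates
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "name": ["a", "a", "b"],
    "price": [100, 100, 80],
})
clean = prepare(toy)
print(clean)
```

Duplicates are removed before `id` is dropped, so two distinct listings that happen to share the remaining values are not merged.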

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10,10))
# Pass the column by keyword; recent seaborn releases no longer accept it positionally
sns.countplot(x='neighbourhood_group', data=df)
plt.title('Most Booked Neighbourhood Group',fontsize=20)
plt.xlabel('Neighbourhood Group')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

**To find out which is the most booked Neighbourhood Group**

##### 2. What is/are the insight(s) found from the chart?


*   Manhattan is the most booked neighbourhood group, followed by Brooklyn and Queens.




##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


*   Based on the most booked neighbourhood groups, the company can expand in these areas so that it can accommodate the demand.

*   The company should find out why there are fewer bookings in Staten Island. It can introduce special offers to increase demand in such areas.




#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(12,10))
plt.pie(df['neighbourhood_group'].value_counts(), labels=df['neighbourhood_group'].value_counts().keys(),autopct='%0.2f%%', startangle=180)
plt.title('Percentage of Most Booked Neighbourhood Group',fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

The pie chart helps us visualize the percentage distribution of unique values of a feature in a concise way

##### 2. What is/are the insight(s) found from the chart?

*   Manhattan was the most booked neighbourhood group with 43.84%, followed by Brooklyn with 42.91% of bookings.
*   Manhattan and Brooklyn combined account for 86.75% of total bookings, which is a very significant insight.
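
Percentages like the ones read off the pie chart can be computed directly, which avoids transcription errors. Below is a minimal sketch on a made-up column that stands in for `df['neighbourhood_group']`.

```python
import pandas as pd

# Toy column standing in for df['neighbourhood_group'] (counts are made up)
groups = pd.Series(
    ["Manhattan"] * 5 + ["Brooklyn"] * 4 + ["Queens"] * 1,
    name="neighbourhood_group",
)

# normalize=True returns fractions that sum to 1; multiply for percentages
share = (groups.value_counts(normalize=True) * 100).round(2)
top_two = share.iloc[:2].sum()
print(share)
print(f"Top two groups combined: {top_two:.2f}%")
```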



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



*   The combined bookings in Manhattan and Brooklyn are significant for the smooth running of the business. This makes it extremely important for the company to focus on providing quality service as well as maintaining a minimum availability of rooms to serve the demand.
*   Based on demand, dynamic pricing can be introduced in such areas to maximize profits.



#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10,10))
# Pass the column by keyword; recent seaborn releases no longer accept it positionally
sns.countplot(x='room_type', data=df)
plt.title('Type of Room',fontsize=20)
plt.xlabel('Room Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

Countplot helps us identify significant value distribution in a feature. 

##### 2. What is/are the insight(s) found from the chart?

The Entire Home/Apartment has the highest share, followed by the Private Room and the least preferred is Shared Room.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



*   The target customers prefer renting entire homes/apartments and least prefer sharing a room.
*   Based on the plot, the company should focus on getting more listings of entire homes/apartments.
*   One significant difference between a shared room and an entire home is the privacy and safety of users. The company should focus on addressing users' privacy concerns.
*   To improve the number of bookings of shared rooms, the company must work towards framing a customer privacy and safety policy that addresses the concerns of its users.








#### Chart - 4

In [None]:
# Chart - 4 visualization code
#Finding the Busiest Host
#On the basis of Host Name
top_10_host = df['host_name'].value_counts().head(10)
print(top_10_host)

# Convert the counts to a dataframe with explicit column names
# (rename_axis + reset_index(name=...) give the same columns across pandas versions)
top_10_host_df = top_10_host.rename_axis('host_name').reset_index(name='count')
print(top_10_host_df)

plt.figure(figsize=(10,10))
sns.barplot(x='host_name',y='count',data=top_10_host_df)
plt.title('Busiest Host',fontsize=20)
plt.xlabel('Host Name')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

To find the busiest host on the basis of host name.

##### 2. What is/are the insight(s) found from the chart?

Michael(417) is the busiest host among all, followed by David(403) and Sonder(327).

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


*   This plot helps us identify the key listers, which receive most of the bookings and are therefore important to the company.
*   The company must ensure that these listers are given the necessary incentives and attention so that they remain loyal to the company.
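
One caveat with counting by `host_name`: names such as Michael or David are not necessarily unique to one host, so several hosts may be merged into one bar. Since the dataset also has a `host_id` column, a hedged alternative is to count per `host_id` and carry the name along for labelling. The toy data below is made up.

```python
import pandas as pd

# Toy listings (made up): two different hosts share the name "Michael"
toy = pd.DataFrame({
    "host_id":   [1, 1, 2, 2, 2, 3],
    "host_name": ["Michael", "Michael", "Michael", "Michael", "Michael", "Sonder"],
})

# Counting by name merges hosts 1 and 2 into a single "Michael"
by_name = toy["host_name"].value_counts()

# Counting by host_id keeps them apart; the name is carried along for labelling
by_id = (
    toy.groupby(["host_id", "host_name"])
       .size()
       .sort_values(ascending=False)
       .reset_index(name="listings")
)
print(by_name)
print(by_id)
```

In this toy example the name-based count reports 5 listings for "Michael", while the id-based count shows two separate hosts with 3 and 2 listings.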



#### Chart - 5

In [None]:
# Chart - 5 visualization code
busy_host = df.groupby(['host_name','host_id','room_type'])['number_of_reviews'].max()
busy_host = busy_host.reset_index()
busy_host = busy_host.sort_values(by='number_of_reviews',ascending=False).head(10)
print(busy_host)

plt.figure(figsize=(10,10))
sns.barplot(x='host_name',y='number_of_reviews',data=busy_host)
plt.title('Busiest Host According to Number of Review',fontsize=20)
plt.xlabel('Host Name')
plt.ylabel('Number of Reviews')
plt.show()

     

##### 1. Why did you pick the specific chart?

To find the Busiest host according to number of reviews

##### 2. What is/are the insight(s) found from the chart?



*   Dona has received the most reviews (629), followed by JJ (607), which suggests that guests enjoy staying there.
*   A higher number of reviews signals that the listing is genuine and builds more trust with the customer, which is a critical element for better business performance.




##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



*   There is a disparity between the hosts that are busiest and those that are most reviewed. To build more trust in and credibility for listers, Airbnb must find ways to increase the number of reviews.
*   Reviews are a way for the company to assess deliverables such as service quality and customer retention.




#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10,10))
# Pass columns by keyword; recent seaborn releases no longer accept them positionally
sns.countplot(x='neighbourhood_group', hue='room_type', data=df)
plt.title('Room Type on Neighbourhood Group',fontsize=20)
plt.xlabel('Neighbourhood Group')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

To know the types of Rooms Occupied by Neighbourhood

##### 2. What is/are the insight(s) found from the chart?

The graph shows that Entire home/apt listings are most common in Manhattan, while in Brooklyn the numbers of Private room and Entire home/apt listings are nearly equal.
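
The counts behind a grouped count plot can be tabulated with `pd.crosstab`, which makes statements such as "nearly equal" easy to check. This is a sketch on made-up rows that uses the same column names as the notebook.

```python
import pandas as pd

# Toy listings (made up)
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Queens"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Shared room"],
})

# Raw counts per (group, room type) pair, the same numbers a grouped count plot draws
counts = pd.crosstab(toy["neighbourhood_group"], toy["room_type"])

# normalize="index" turns each row into shares that sum to 1
shares = pd.crosstab(toy["neighbourhood_group"], toy["room_type"], normalize="index").round(2)
print(counts)
print(shares)
```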

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Based on the count plot, the company should ensure a minimum number of listings of entire apartments. Onboarding far more listings than demand can absorb would lead to huge losses.


#### Chart - 7

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(15,10))
sns.boxplot(y='price', x='neighbourhood_group', data=df[df.price<500])
plt.title('Neighbourhood Group Price Distribution<500',fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?



*   A box and whisker plot, also called a box plot, displays the five-number summary of a set of data: the minimum, first quartile, median, third quartile, and maximum. In a box plot, we draw a box from the first quartile to the third quartile, and a line goes through the box at the median.
*   In descriptive statistics, a box plot is a method for graphically demonstrating the locality, spread and skewness of groups of numerical data through their quartiles.
*   The neighbourhood group price distribution for prices below 500 helps us understand the relation between booking price and the neighbourhood group in which a listing is located.





##### 2. What is/are the insight(s) found from the chart?



*   From the above boxplot we can say that Manhattan has the highest price range for listings, with a typical price of about $140 per night, followed by Brooklyn at about $90 per night.
*   Queens and Staten Island seem to have a very similar distribution.
*   The Bronx is the cheapest.
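
The quartiles that a box plot draws can be read off numerically as well. The sketch below applies the same `price < 500` cap to a made-up frame and summarises each group. The numbers are illustrative only.

```python
import pandas as pd

# Toy listings (made up); the 900 price is removed by the cap below
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 3 + ["Bronx"] * 3,
    "price": [100, 150, 900, 40, 60, 80],
})

# Apply the same price cap as the plot, then summarise each group
capped = toy[toy["price"] < 500]
stats = (
    capped.groupby("neighbourhood_group")["price"]
          .describe()[["25%", "50%", "75%"]]
          .assign(iqr=lambda t: t["75%"] - t["25%"])  # box height = interquartile range
)
print(stats.sort_values("50%", ascending=False))
```

The `50%` column is the median, which is the line drawn inside each box.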






##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The price range of the bookings is similar in all the neighbourhood groups except Manhattan. Yet the demand for rooms is high in Brooklyn and low in Staten Island.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
#Visualizing Neighbourhood Group Location
plt.figure(figsize=(15,10))
# Pass columns by keyword; recent seaborn releases no longer accept them positionally
sns.scatterplot(x='longitude', y='latitude', hue='neighbourhood_group', data=df)
plt.title("Map of Neighbourhood Group",fontsize=20)
plt.show()
     

##### 1. Why did you pick the specific chart?

A scatter plot of the geographical locations helps us identify the number of listings in each part of the map.

##### 2. What is/are the insight(s) found from the chart?

From the geospatial map of the neighbourhood groups, it is important to observe the spatial discontinuity between Staten Island and the other groups, which can be one of the reasons why bookings are fewer in Staten Island (connectivity issues).

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The spatial map helps us understand the spatial distribution and reachability of the listings. Some listings in Staten Island are very far from the main cluster. For a customer, isolated and far-off listings are a red flag from a safety point of view.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#Visualizing Room Type Location Per Neighbourhood Group
plt.figure(figsize=(15,10))
# Pass columns by keyword; recent seaborn releases no longer accept them positionally
sns.scatterplot(x='longitude', y='latitude', hue='room_type', data=df)
plt.title("Room Type Location Per Neighbourhood Group",fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

Plotting room types by location helps us identify the availability density of a specific type of room in a neighbourhood.
For example, if there is high demand for entire apartments near a beach but the available listings are scattered, i.e. far from the beach, it will negatively impact our business.

##### 2. What is/are the insight(s) found from the chart?

From the above mapping, we can infer that there are very few shared rooms throughout NYC compared to private rooms and entire homes/apartments.

95% of the listings on Airbnb are either Private room or Entire home/apt. Very few guests opted for shared rooms on Airbnb.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The cluster density of listings in Staten Island is low.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(15,10))
sns.boxplot(data=df, x='neighbourhood_group', y='availability_365')
plt.title("Neighbourhood Group and Availability of Room",fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

Plotting availability by neighbourhood group helps us compare how many days per year rooms are available in each neighbourhood group.


##### 2. What is/are the insight(s) found from the chart?



*   We have a very interesting observation here: Manhattan and Brooklyn give us the most business, but availability there is almost 50% throughout the year.
*   From the above boxplot, we can say that Staten Island has about 225 days of availability, which is more than the rest.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

We must figure out a way to increase the availability of rooms in high-demand areas.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
area_review = df.groupby(['neighbourhood_group'])['number_of_reviews'].max()
area_review = area_review.reset_index()
print(area_review)

#Visualizing the highest review count of a single listing in each neighbourhood group
#(area_review is built with max(), so each slice is a per-group maximum, not a total)
plt.figure(figsize=(15,10))
reviews = area_review['number_of_reviews']
plt.pie(reviews, labels=area_review['neighbourhood_group'],autopct='%0.2f%%',startangle=90,explode=[0.1]*len(area_review),shadow=True)
plt.title('Maximum Number of Reviews per Listing in Each Neighbourhood Group',fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart helps us visualize the data in a concise way when the number of categories is small.

##### 2. What is/are the insight(s) found from the chart?

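The pie chart shows that Queens has the largest share of reviews at 26.45%, followed by Manhattan at 25.53%. Because the chart is built from the maximum review count of a single listing in each group, these shares compare the most-reviewed listings rather than total reviews.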
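
The Chart 11 code aggregates with `max()`, which picks the most-reviewed single listing per group. Aggregating with `sum()` would answer a different question, namely how many reviews a group received in total. The made-up example below shows that the two can rank groups differently.

```python
import pandas as pd

# Toy listings (made up): Queens has one very popular listing, Manhattan has more reviews overall
toy = pd.DataFrame({
    "neighbourhood_group": ["Queens", "Queens", "Manhattan", "Manhattan", "Manhattan"],
    "number_of_reviews": [600, 10, 300, 250, 200],
})

# max -> most-reviewed single listing; sum -> all reviews in the group
per_group = toy.groupby("neighbourhood_group")["number_of_reviews"].agg(["max", "sum"])
top_by_max = per_group["max"].idxmax()
top_by_sum = per_group["sum"].idxmax()
print(per_group)
print("Largest single listing:", top_by_max)
print("Most reviews overall:  ", top_by_sum)
```

Which aggregate is appropriate depends on the question being asked.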

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The number of reviews signifies the transparency of the company and the credibility of the listings. Efforts must be made to maximize the reviews in Brooklyn and Staten Island.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(10,10))
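# distplot is deprecated in recent seaborn releases; histplot with a KDE curve is the replacement
sns.histplot(df['price'], kde=True, stat='density')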
plt.title('Price of Room',fontsize=20)
plt.xlabel('Price')
plt.ylabel('Density')
plt.show()

##### 1. Why did you pick the specific chart?

A distribution plot (a histogram with a density curve) represents the univariate distribution of the data, i.e. how the values of a single variable are distributed.

##### 2. What is/are the insight(s) found from the chart?


*   Most of the prices range between 0 and 1000.
*   Some properties have a price listed as zero, which is quite unusual.
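
Zero-price listings like the ones noted above can be counted and, if desired, set aside before price analysis. The sketch below uses made-up prices. Excluding the rows is one possible treatment and not something the original notebook does.

```python
import pandas as pd

# Toy prices (made up), including two zero entries
toy = pd.DataFrame({"price": [0, 0, 75, 120, 300]})

# Boolean mask of listings whose price is recorded as zero
zero_mask = toy["price"] == 0
print("Zero-price listings:", int(zero_mask.sum()))

# One option (an assumption, not the notebook's method): exclude them before price analysis
priced = toy[~zero_mask]
print(priced["price"].describe())
```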




##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Some properties have a listed price of zero, which can be detrimental to business profits.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
area_price = df.groupby(['price'])['number_of_reviews'].max().reset_index()
print(area_price.head(5))


Prices = area_price['price']
Review = area_price['number_of_reviews']
fig = plt.figure(figsize=(10,5))
plt.scatter(Prices,Review)
plt.title('Relation Between Prices and Number of Reviews',fontsize=20)
plt.xlabel('Prices')
plt.ylabel('Review')
plt.show()
     

##### 1. Why did you pick the specific chart?

A scatter plot of prices against number of reviews can help us identify the type of customers who commit to writing reviews.


##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows that people who opt for listings priced between 0 and $2000 take the time to write reviews.



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

To increase the reviews from the group which cannot spare time, the review process must be made simple and time efficient.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize=(15,10))
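# numeric_only=True restricts the correlation to numeric columns (text columns raise an error in pandas 2.x)
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='Blues')
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title('Correlation Heatmap',fontsize=20)
plt.show()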

##### 1. Why did you pick the specific chart?

Correlation heatmaps are a type of plot that visualize the strength of relationships between numerical variables.

##### 2. What is/are the insight(s) found from the chart?

The variables are weakly correlated, except for reviews_per_month and number_of_reviews.

Longitude has a negative correlation with price and minimum_nights.
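
Reading the strongest pairs off a heatmap by eye is error-prone, so they can also be listed programmatically. The sketch below ranks column pairs by absolute correlation on a made-up frame; the values are invented and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy numeric columns (made up)
toy = pd.DataFrame({
    "number_of_reviews": [1, 5, 10, 20, 40],
    "reviews_per_month": [0.1, 0.4, 1.0, 2.1, 3.9],
    "price": [200, 90, 150, 60, 120],
})

corr = toy.corr()
# Keep the upper triangle only so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flatten to (row, column) pairs and sort by absolute correlation, strongest first
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```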

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code
g=sns.pairplot(df)

##### 1. Why did you pick the specific chart?

A pair plot visualizes the pairwise relationships between the variables in a dataset, where the variables can be continuous or categorical.

`pairplot` is a function of the seaborn library, which provides a high-level interface for drawing attractive and informative statistical graphics.

##### 2. What is/are the insight(s) found from the chart?

There are no clear strong correlations among the features.

Some features are weakly correlated.


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

The client must focus on the following pointers to achieve the business objectives:


1.   The client must focus on serving the high demand in neighbourhood groups like Manhattan and Brooklyn, as they have a combined 86.75% of total bookings.
2.   The client must work on increasing the number of reviews and reviews per month, as customer reviews are a key way to measure the quality of service and customer satisfaction.
3.   To increase the reviews, the client must offer a technology-inclusive method that takes very little time and effort, e.g. scanning a QR code that opens a voice-guided review.
4.   The client must focus on neighbourhood groups like Staten Island, as they have very few bookings and are available most days of the year. The client can work on special offers that catch the user's eye to increase occupancy.

     Although there are listings available in Staten Island, they are very scattered and do not have enough reviews to instill trust in the consumer. Therefore, to build trust, the client must focus on getting properties closer together and getting more reviews.
5.   The client must ensure proper incentives and privileges for the listers that bring the most business to the company, so that they remain loyal and satisfied, as competitors will try to poach them as soon as they become aware of them.






# **Conclusion**



*   People are more attracted to lower prices. Most of the prices range between 0 and $2000.

*   Price and length of stay are inversely related: one pays more when staying for fewer days, and the price decreases if the stay is longer.

*   Listings with a short minimum stay receive the highest number of reviews.

*   Some properties have a price listed as zero, which is quite unusual.

*   The Entire Home/Apartment has the highest share, followed by the Private Room.

*   Entire home/apt listings are most common in Manhattan, while in Brooklyn the numbers of Private room and Entire home/apt listings are nearly equal.

*   Manhattan has the highest number of listings, followed by Brooklyn.

*   Manhattan is the most expensive neighbourhood group and the Bronx is the least expensive.

*   On the basis of host name, Michael is the busiest host of all. On the basis of reviews received, Dona is the busiest, as she has received the highest number of reviews.

*   Staten Island has more availability throughout the year.









### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***