<a href="https://colab.research.google.com/github/Aniruddha5164/AirBnb-Bookings-Analysis/blob/main/Air_Bnb_Bookings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Aniruddha Banerjee**

# **Project Summary -**

In this project we analyze Airbnb's New York City (NYC) listings data for 2019. NYC is not only one of the most famous cities in the world but also a top global destination for visitors drawn to its museums, entertainment, restaurants, UN offices and commerce.

The project began with building a comprehensive understanding of the Airbnb dataset: its size and the information it holds on properties, such as availability, price, location, reviews and ratings. We then explored data related to the listings themselves, including the number of properties listed, host characteristics, the variety of amenities available and the occupancy rate of different properties. Further analysis looked at the significance of the reviews left by Airbnb users.

Exploratory data analysis projects on Airbnb typically involve investigating patterns and trends in various aspects of the platform, such as pricing, popularity and availability of listings. This data can be used to gain insights into consumer behavior and preferences, as well as to inform marketing and business strategies for hosts and Airbnb as a company. Techniques such as data visualization and objective solution may be used to analyze the data and draw meaningful conclusions.

In this type of analysis, data visualizations such as line plots, scatter plots, and bar charts are used to help identify trends, patterns, and relationships in the data. For instance, a bar chart can be used to show the distribution of properties across different neighbourhoods in a city.

Overall, the exploratory data analysis provides crucial insights for the Airbnb platform to improve customer satisfaction and enhance rental revenues. The insights also benefit renters, who can use the data to gain a deeper understanding of the rental landscape and make informed decisions.

# **GitHub Link -**

https://github.com/Aniruddha5164/AirBnb-Bookings-Analysis/blob/main/Air_Bnb_Bookings.ipynb

# **Problem Statement**


The purpose of this exploratory data analysis project is to analyze and examine the factors that influence customer bookings and preferences. The dataset used in this analysis includes information on hosts, room type and location, price, minimum stays, reviews and availability.

The aim is to identify insights and patterns in the data that can help the company understand the drivers of customer retention and inform future decision-making regarding host listing, location, price and customer service and marketing strategies.

#### **Define Your Business Objective?**

1.   Recommend marketing campaign strategies and predict the destination neighbourhoods that are in high demand.

2.   Using Exploratory Data Analysis, find the most demanded room type and neighbourhood_group.

3.   Find the average number of days guests prefer to stay in a single visit for each room type across neighbourhood_groups.

4.   Find the most sought-after price bracket, in which the maximum bookings happen and which gets the most reviews.

5.   Find the neighbourhood_group in which the top hosts have the maximum listings, and explain the reason behind it with your insight.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each and every chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt     #for visualization
import seaborn as sns

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

file_path = '/content/drive/MyDrive/Air-Bnb/Airbnb NYC 2019.csv'
airbnb_df=pd.read_csv(file_path)

In [None]:
airbnb_df

### Dataset First View

In [None]:
airbnb_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
airbnb_df.shape

### Dataset Information

In [None]:
# Dataset Info
airbnb_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(airbnb_df.duplicated().sum())   # only the last expression of a cell is displayed, so print the count
airbnb_df.drop_duplicates(inplace=True)
airbnb_df.shape  # an unchanged row count means the dataset had no duplicate rows

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = airbnb_df.isnull().sum()
missing_values  #only four columns have null values

The missing values table shows that 4 columns contain null values: name, host_name, last_review and reviews_per_month.
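
Raw null counts are easier to judge next to the share of rows they affect. The following is a minimal sketch of such a summary; `null_summary` is a hypothetical helper and `demo` is a tiny made-up frame standing in for `airbnb_df`, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of nulls for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one null, largest first
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)

# Hypothetical stand-in for airbnb_df (illustration only)
demo = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "price": [120, 80, 95, 150],
    "reviews_per_month": [0.5, None, None, 1.2],
})

print(null_summary(demo))
```

In the notebook itself the call would presumably be `null_summary(airbnb_df)`.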

In [None]:
# Visualizing the missing values
sns.barplot(y=missing_values.index, x=missing_values.values)
plt.title('Number of Null Values in Each Column')
plt.ylabel('Column')
plt.xlabel('Number of Null Values')
plt.show()


### What did you know about your dataset?

We can see that our dataset has 48895 rows/indexes and 16 columns/variables. Let's try to understand the variables we have here.

1. id : a unique id identifying an Airbnb listing

2. name : the name of the listed property on the platform

3. host_id : a unique id identifying an Airbnb host

4. host_name : name under which the host is registered

5. neighbourhood_group : a group of areas

6. neighbourhood : area that falls under a neighbourhood_group

7. latitude : latitude coordinate of the listing

8. longitude : longitude coordinate of the listing

9. room_type : category of the listed room

10. price : price of the listing

11. minimum_nights : minimum number of nights a guest must book in a single visit

12. number_of_reviews : total count of reviews given by visitors

13. last_review : date of the most recent review

14. reviews_per_month : number of reviews received per month

15. calculated_host_listings_count : total number of listings registered under the host

16. availability_365 : the number of days in a year for which the listing is available





## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
airbnb_df.columns

In [None]:
# Dataset Describe
# latitude & longitude are coordinates and id & host_id are identifiers, so summary statistics on them would not give useful insight.
# Build a list (pandas does not accept a set as a column indexer) and keep the original column order.
cols_to_describe = [col for col in airbnb_df.columns if col not in {'latitude', 'longitude', 'id', 'host_id'}]
airbnb_df[cols_to_describe].describe()

### Variables Description

So, we see that some columns are categorical and the remaining ones are numerical, except last_review, which falls under the Date_Time category.

Categorical variable : name, host_name, neighbourhood_group, neighbourhood, room_type.

Numerical variable : id, host_id, price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count, availability_365

Date_Time variable : last_review

Coordinates : latitude, longitude
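
The grouping above can also be derived from the column dtypes instead of by hand. This is a sketch under assumptions: `demo` is a small hypothetical frame with three of the columns, and `last_review` is assumed to hold date strings that `pd.to_datetime` can parse.

```python
import pandas as pd

# Hypothetical stand-in for airbnb_df (illustration only)
demo = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [80, 200, 45],
    "last_review": ["2019-05-21", None, "2018-11-03"],
})

# Parse the review date; missing or unparseable entries become NaT
demo["last_review"] = pd.to_datetime(demo["last_review"], errors="coerce")

numeric_cols = demo.select_dtypes(include="number").columns.tolist()
datetime_cols = demo.select_dtypes(include="datetime").columns.tolist()
# Whatever is neither numeric nor datetime is treated as categorical/text here
other_cols = [c for c in demo.columns if c not in numeric_cols + datetime_cols]

print("Numerical  :", numeric_cols)
print("Date_Time  :", datetime_cols)
print("Categorical:", other_cols)
```

Note that a dtype-based split would place id, host_id, latitude and longitude among the numerical columns, so identifiers and coordinates still need to be set aside manually.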

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
airbnb_df["id"].nunique()  #all property ids are different, so each listing is distinct


In [None]:
airbnb_df["name"].nunique() #shows some listing names are common

In [None]:
airbnb_df["host_id"].nunique() #fewer unique host_ids than rows means some host_ids are repeated

In [None]:
airbnb_df["host_name"].nunique() #about 11.5k host names for 49k listings, so a single host name can cover multiple listings

In [None]:
airbnb_df["neighbourhood_group"].nunique() #no. of neighbourhood groups

In [None]:
airbnb_df["neighbourhood_group"].unique() #areas of the city

In [None]:
airbnb_df["neighbourhood"].nunique()#no. of neighbourhood

In [None]:
airbnb_df["room_type"].value_counts() #room_type listing count

In [None]:
price_value_counts = airbnb_df["price"].value_counts()

price_value_counts.sort_index()   #number of listings at each distinct price, ordered by price; shows 673 different prices ranging from 0 to 10k

In [None]:
airbnb_df["calculated_host_listings_count"].unique()   #unique no of listings by hosts

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# dropping unnecessary columns
airbnb_df.drop(['id','last_review'], axis=1, inplace=True)

In [None]:
# Examining changes after dropping unnecessary columns

airbnb_df.head(5)

In [None]:
#We know that the reviews_per_month column has many null values, so we replace them with 0
airbnb_df["reviews_per_month"] = airbnb_df["reviews_per_month"].fillna(0)

In [None]:
#name and host_name also have some nulls; replace them with 'Unknown' and 'no_name' respectively
airbnb_df['host_name'] = airbnb_df['host_name'].fillna('no_name')
airbnb_df['name'] = airbnb_df['name'].fillna('Unknown')

In [None]:
# again examining changes
print(airbnb_df["reviews_per_month"].isnull().sum())
print(airbnb_df["host_name"].isnull().sum())
print(airbnb_df["name"].isnull().sum())

In [None]:
len(airbnb_df[airbnb_df["availability_365"]==0]) #checking round the year busy properties

In [None]:
len(airbnb_df[airbnb_df['price']==0])  #some properties have $0 listing price

In [None]:
len(airbnb_df[airbnb_df['price']<10]) #below $10 there are no listings other than the $0 ones

In [None]:
zero_price_rows = airbnb_df[(airbnb_df["price"]==0)] # a $0 listing price is not meaningful, so these rows are dropped
airbnb_df.drop(zero_price_rows.index,inplace=True)


In [None]:
len(airbnb_df[airbnb_df['price']==0])

In [None]:
len(airbnb_df[airbnb_df['price']>=500]) #listings priced at $500 or more make up about 2.5% of the data, so they can cautiously be treated as outliers
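
The count above only becomes a percentage once it is divided by the number of rows. A minimal sketch of that step follows; `share_at_or_above` is a hypothetical helper and `demo_prices` is made-up data, so the printed share is not the 2.5% figure from the real dataset.

```python
import pandas as pd

def share_at_or_above(prices: pd.Series, threshold: float) -> float:
    """Percentage of entries whose value is >= threshold."""
    # The mean of a boolean Series is the fraction of True values
    return float((prices >= threshold).mean() * 100)

# Hypothetical prices (illustration only)
demo_prices = pd.Series([60, 95, 120, 150, 180, 220, 300, 450, 500, 2500])

print(f"{share_at_or_above(demo_prices, 500):.1f}% of listings cost $500 or more")
```

In the notebook this would presumably be called as `share_at_or_above(airbnb_df['price'], 500)`.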

In [None]:
airbnb_df.info()  #updated data after dropping zero price rows and filling null values

In [None]:
airbnb_df["host_name"].value_counts()[:5]#top 5 hosts listing counts in entire dataset

In [None]:
#Maximum listings by hosts in entire dataset with unique listings within neighbourhood_group and this table gives partial answer for 5th objective
hosts_listings = airbnb_df.groupby(['host_name','host_id','neighbourhood_group'])['calculated_host_listings_count'].max().reset_index()
hosts_listings.sort_values(by='calculated_host_listings_count', ascending=False).head(10)

In [None]:
airbnb_df.loc[(airbnb_df['neighbourhood_group']=='Manhattan') & (airbnb_df['host_name']=='John')]
#The same host can have many listings in the same neighbourhood_group with different room types, or the same/different room_type in other neighbourhoods

### What all manipulations have you done and insights you found?

**From the above experiments we get some more insights:**
*   Entire home/apt has the highest number of listings.
*   The count of unique host_name values and the experiment above show that one host can have several room types and/or more than one listing in the same or in different neighbourhoods.
*   Listings are distributed across 5 neighbourhood_groups, which together contain over 200 neighbourhoods. Prices go up to 10k.
*   After inspection, I found that a particular property name belongs to one particular host_name, but a particular host_name can have multiple properties in a neighbourhood_group or neighbourhood.
*   The unique ids show that every property id is different, so each listing is distinct.
*   The "price" and "availability_365" columns contain zeros, meaning zero cost and no availability throughout the year respectively. A listing that is unavailable all year round is plausible, but a zero price is not.
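
The observation that one host can hold several listings across room types and neighbourhood groups can be checked with a per-host aggregation. The sketch below groups by `host_id` rather than `host_name`, on the assumption that different hosts may share a first name; `demo` is a hypothetical frame, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for airbnb_df (illustration only)
demo = pd.DataFrame({
    "host_id": [1, 1, 1, 2, 3, 3],
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Queens", "Bronx", "Bronx"],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Shared room", "Private room", "Private room"],
})

host_spread = (
    demo.groupby("host_id")
    .agg(
        listings=("room_type", "count"),                          # rows per host
        neighbourhood_groups=("neighbourhood_group", "nunique"),  # distinct groups
        room_types=("room_type", "nunique"),                      # distinct room types
    )
    .sort_values("listings", ascending=False)
)

print(host_spread)
```

Hosts with a value above 1 in `neighbourhood_groups` or `room_types` are the ones spread across areas or room types.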

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
top_20_neigbours=airbnb_df["neighbourhood"].value_counts().head(20)
top_20_neigbours.plot(kind='bar')
plt.xlabel('neighbourhood')
plt.ylabel('counts in entire NYC')
plt.title('Top neighbourhoods in entire NYC on the basis of count of listings')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
avg_price = airbnb_df.groupby(["neighbourhood_group"])["price"].mean()
a = avg_price.plot.bar(figsize = (5,5), fontsize = 10)
a.set_xlabel("neighbourhood_group", fontsize = 11)
a.set_ylabel("average price", fontsize = 11)
a.set_title("average price in different neighbourhood_groups", fontsize=12)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
listings_per_room_type = airbnb_df.groupby(['room_type'])["host_id"].count()   # number of listings per room type
listings_per_room_type.plot(kind="bar")
plt.xlabel("room type")
plt.ylabel("no of listings")
plt.title("most demanded room type")


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
avg_price_by_room = airbnb_df.groupby(['room_type'])["price"].mean()
avg_price_by_room.plot(kind="bar")
plt.xlabel("room type")
plt.ylabel("average price")
plt.title("average price in different room type")


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
avg_stay=airbnb_df.groupby(["room_type"])["minimum_nights"].mean()
avg_stay.plot( kind='bar', color='red')
plt.title('Average Minimum Nights in different room types', fontsize = 14)
plt.xlabel('Room types', fontsize = 12)
plt.ylabel('Average minimum nights', fontsize = 12 )

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
sns.scatterplot(x = airbnb_df["longitude"], y = airbnb_df["latitude"],hue= airbnb_df["neighbourhood_group"])
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
sns.lineplot(data=airbnb_df,x="neighbourhood_group",y="availability_365",hue="room_type")
plt.title("Room Availability throughout Neighbourhood/Room Type")

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))   # create the figure and axes used below
sns.scatterplot(data=airbnb_df, x='price', y='number_of_reviews', hue='room_type', ax=ax)
ax.set_title('Price vs Number of Reviews')
sns.despine(fig, left=True)

In [None]:
sns.scatterplot(data=airbnb_df,x="price",y="number_of_reviews",hue="room_type")
plt.title('Price vs Number of Reviews')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))

sns.countplot(data=airbnb_df[airbnb_df['availability_365']  == 365], x='neighbourhood_group', hue='room_type', palette='GnBu_d')
plt.title('No. of Properties Available 365 days', fontsize=12)
plt.show()

#### Chart - 9

In [None]:
# Chart - 9 visualization code
properties=airbnb_df[airbnb_df["availability_365"]==365]
sns.countplot(data=properties,x='neighbourhood_group',hue="room_type")
plt.title('No. of Properties Available 365 days',fontsize=12)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
avg_price_df=airbnb_df.groupby(["neighbourhood_group",'room_type'])["price"].mean().unstack()
avg_price_df

In [None]:
avg_price_df.plot(kind="bar")

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
busiest_hosts = airbnb_df.groupby(['host_name', 'host_id','room_type'])['number_of_reviews'].max().reset_index()
busiest_hosts = busiest_hosts.sort_values(by='number_of_reviews', ascending=False).head(10)
busiest_hosts

In [None]:
name = busiest_hosts['host_name']
reviews = busiest_hosts['number_of_reviews']

fig = plt.figure(figsize = (8, 5))
plt.bar(name, reviews, color ='chocolate', width = 0.4)
plt.xlabel("Name of the Host")
plt.ylabel("Number of Reviews")
plt.title("Busiest Hosts", fontsize=14)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
entire_home_apt = airbnb_df[airbnb_df['room_type'] == 'Entire home/apt']
private_room = airbnb_df[airbnb_df['room_type'] == 'Private room']
shared_room = airbnb_df[airbnb_df['room_type'] == 'Shared room']

# Create boxplots for price
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
axs[0].boxplot(entire_home_apt['price'])
axs[0].set_title('Entire home/apt')
axs[1].boxplot(private_room['price'])
axs[1].set_title('Private room')
axs[2].boxplot(shared_room['price'])
axs[2].set_title('Shared room')
plt.show()

In [None]:
# Remove outliers from price variable for each room_type
def remove_outliers(data):
    Q1 = np.percentile(data['price'], 25)
    Q3 = np.percentile(data['price'], 75)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR
    lower_bound = Q1 - 1.5 * IQR
    data1 = data[(data['price'] >= lower_bound) & (data['price'] <= upper_bound)]
    return data1

entire_home_apt1 = remove_outliers(entire_home_apt)
private_room1 = remove_outliers(private_room)
shared_room1 = remove_outliers(shared_room)

# Combine the datasets in combined_df
combined_df = pd.concat([entire_home_apt1, private_room1, shared_room1], axis=0)
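
The IQR rule used in `remove_outliers` can be followed on a handful of numbers. This is a small worked sketch with made-up prices (not values from the dataset), using the same `np.percentile` calls and the same 1.5 × IQR fences.

```python
import numpy as np

# Hypothetical prices (illustration only); 1000 is the obvious outlier
prices = np.array([40, 50, 60, 70, 80, 90, 100, 110, 1000])

q1, q3 = np.percentile(prices, [25, 75])   # q1 = 60.0, q3 = 100.0
iqr = q3 - q1                              # 40.0
lower_bound = q1 - 1.5 * iqr               # 0.0
upper_bound = q3 + 1.5 * iqr               # 160.0

# Keep only values inside the fences, as remove_outliers does for 'price'
kept = prices[(prices >= lower_bound) & (prices <= upper_bound)]
print(kept)   # every value except 1000
```

Because the fences are computed per room type in the notebook, a price that counts as an outlier for shared rooms can still be retained for entire homes.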

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 6))
ax = axes.flatten()

sns.violinplot(data=combined_df, x='neighbourhood_group', y='price', ax=ax[0])

sns.violinplot(data=combined_df, x='neighbourhood_group', y='price', hue='room_type', ax=ax[1])   # draw on the second subplot explicitly

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
sns.heatmap(data=airbnb_df.corr(numeric_only=True),annot=True,cmap="coolwarm")   # correlate numeric columns only; text columns would raise an error

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
sns.pairplot(combined_df, hue='room_type',
             x_vars=['price', 'number_of_reviews','reviews_per_month','availability_365'],
             y_vars=['price', 'number_of_reviews','reviews_per_month','availability_365'],
             kind='scatter', diag_kind= 'hist')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***