<a href="https://colab.research.google.com/github/Pratyush-Tripathi-Ds/AirBnb_EDA_Submission_by_Pratyush_Tripathi.ipynb/blob/main/AirBnb_EDA_Submission_by_Pratyush_Tripathi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Airbnb dataset EDA/Unsupervised
##### **Contribution**    - Individual
##### **Project by - Pratyush Tripathi**

# **Project Summary -**

This project analyses a dataset of AirBnb listings in New York City. The aim is to improve the consumer experience by identifying which factors of the AirBnb offering should be adjusted. The data was gathered from 2011 to 2019. The analysis examines price and availability, and uses the number of reviews left by customers as an indicator of engagement. The business objective is to gather better resources and allocate them properly, and suitable ideas for achieving that objective are discussed at the end.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**The quality of services in AirBnb booking needs to be analysed in terms of the different variables present in the data in order to improve the services with proper resource allocation to specific locations for better consumer experience and improved engagement with the customers.**

#### **Define Your Business Objective?**

The business of AirBnb has the following objectives to be delivered by this assessment:


*   To determine the locations of AirBnb where most of the bookings are done
*   To determine which type of room has been booked most from the gathered data
*   To assess the correlations between the different aspects with help of proportional values between variables
*   To determine the average price taken by the bookings based on types of the room and locations
*   To analyse the pattern of bookings based on different types of rooms and number of nights booked
*   To analyse the quality of services based on the number of feedbacks
*   To determine how the above analysis can be used to improve the quality of services and make the business better.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

### Dataset Loading

In [None]:
# Mounting drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
airbnb_ny_2019_df = pd.read_csv('/content/drive/MyDrive/Airbnb NYC 2019.csv')

### Dataset First View

In [None]:
# Dataset First Look
airbnb_ny_2019_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Number of rows in dataset: {airbnb_ny_2019_df.shape[0]}")
print(f"Number of columns in dataset: {airbnb_ny_2019_df.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
airbnb_ny_2019_df.info()

#### Duplicate Values

In [None]:
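# Dataset Duplicate Value Count
# duplicated() flags rows that repeat an earlier row; nunique() would only count distinct values per column
airbnb_ny_2019_df.duplicated().sum()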

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
airbnb_ny_2019_df.isna().sum()

In [None]:
# Visualizing the missing values
msno.matrix(airbnb_ny_2019_df)

### What did you know about your dataset?

After observing the given dataset, we know that:

1. The dataset has no duplicate values in the rows,

2. The dataset has null values which will cause problems during the exploratory data analysis.

Therefore we need to clean the data before performing the data analysis.
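
The two checks above can be sketched on a small stand-in table. This is only an illustration: `toy_df` and its values are made up and are not taken from the Airbnb data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the listings data
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Loft", None, "Studio", "Cozy room"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy_df.duplicated().sum())

# Share of missing values per column, in percent
missing_pct = (toy_df.isna().mean() * 100).round(1)

print(f"Duplicate rows: {duplicate_rows}")
print(missing_pct)
```

Reporting the missing share in percent makes it easier to judge whether a column should be dropped, imputed, or left alone.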

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
airbnb_ny_2019_df_columns = airbnb_ny_2019_df.columns
airbnb_ny_2019_df_columns

In [None]:
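# Dataset Describe
# Total number of elements (rows x columns); printed explicitly because only the last expression of a cell is displayed
print(f"Total number of elements in dataset: {airbnb_ny_2019_df.size}")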

# Dataset description of all the columns
airbnb_ny_2019_df.describe(include='all')

### Variables Description

The description of the variables is as follows:

*   id - Unique ID
*   name - Name of the listing
*   host_id - Unique ID of host
*   host_name - Name of host
*   neighbourhood_group - Borough in which the listing is located
*   neighbourhood - Neighbourhood within the borough
*   latitude - Latitude coordinate of the listing
*   longitude - Longitude coordinate of the listing
*   room_type - Type of listing
*   price - Price of the listing
*   minimum_nights - Minimum number of nights to be paid for
*   number_of_reviews - Number of reviews
*   last_review - Date of last review
*   reviews_per_month - Number of reviews per month
*   calculated_host_listings_count - Total number of listings held by the host
*   availability_365 - Number of days in the year the listing is available

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in airbnb_ny_2019_df.columns.tolist():
  print(f"Number of unique values in {i} is {airbnb_ny_2019_df[i].nunique()}.")

for i in airbnb_ny_2019_df.columns.tolist():
  if airbnb_ny_2019_df[i].nunique() < 50:
    print(f"Unique Values for {i} are: {airbnb_ny_2019_df[i].unique()}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
airbnb_ny_2019_df.info()

In [None]:
# Checking the NaN values
airbnb_ny_2019_df.isnull().sum()

In [None]:
# Rows with NaN in host_name or name cannot be filled in sensibly, so we remove them
airbnb_ny_2019_df = airbnb_ny_2019_df.dropna(subset = ['host_name', 'name'])

# As last_review does not provide any valuable information we can remove it
airbnb_ny_2019_df = airbnb_ny_2019_df.drop(['last_review'], axis = 1)

# The NaN values of reviews_per_month are replaced by the mean value of the column
# (the result is assigned back, since inplace fillna on a selected column triggers warnings in newer pandas versions)
airbnb_ny_2019_df['reviews_per_month'] = airbnb_ny_2019_df['reviews_per_month'].fillna(airbnb_ny_2019_df['reviews_per_month'].mean())

airbnb_ny_2019_df.info()

### What all manipulations have you done and insights you found?

The manipulations done to clean the data for accurate analysis are:

1. The rows of 'host_name' having NaN value were removed as replacement was not possible,

2. The rows of 'name' having NaN value were removed as replacement was not possible,

3. Since 'last_review' was not giving any valuable information and had lots of NaN values, it was removed from the dataframe,

4. The NaN values of 'reviews_per_month' were replaced by the mean of the column.

After the manipulations the number of rows has dropped, and the dataframe now has 48858 rows.

The dataframe also has 15 columns now that the 'last_review' column has been dropped, so its shape is (48858, 15).

The dataset is now ready for analysis.
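
Mean imputation is one of several options for 'reviews_per_month'. The sketch below compares it with filling zeros, under the assumption (not verified in this notebook) that a missing rate belongs to a listing with no reviews at all. `toy_df` is a hypothetical stand-in, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: listings without reviews have no monthly rate
toy_df = pd.DataFrame({
    "number_of_reviews": [10, 0, 4, 0],
    "reviews_per_month": [1.0, np.nan, 0.5, np.nan],
})

# Check the assumption: every missing rate belongs to an unreviewed listing
missing_rate = toy_df["reviews_per_month"].isna()
only_unreviewed = bool((toy_df.loc[missing_rate, "number_of_reviews"] == 0).all())

# Option A (used in this notebook): fill with the column mean
filled_mean = toy_df["reviews_per_month"].fillna(toy_df["reviews_per_month"].mean())

# Option B: fill with 0, which treats "no reviews" as a rate of zero
filled_zero = toy_df["reviews_per_month"].fillna(0)

print(only_unreviewed)
print(filled_mean.mean(), filled_zero.mean())
```

If the check holds on the real data, zero filling keeps unreviewed listings from inflating the averages shown in the later review charts. Mean filling leaves the column mean unchanged.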

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Plot of bookings based on neighbourhood group

ax = sns.countplot(airbnb_ny_2019_df, x = 'neighbourhood_group')
ax.bar_label(ax.containers[0])
plt.xlabel('Neighbourhood Groups')
plt.ylabel('Neighbourhood Group Count')
plt.title('Neighbourhood Group Bookings')

##### 1. Why did you pick the specific chart?

This chart shows the number of listings in Brooklyn, Manhattan, Queens, Staten Island and Bronx. Each row of the dataset is a listing, so these counts are used here as a proxy for bookings. The chart was selected to get the total number of bookings by location.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the number of bookings for respective locations.

1. Brooklyn - 20089

2. Manhattan - 21643

3. Queens - 5664

4. Staten Island - 373

5. Bronx - 1089

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The following can be concluded from the graph:

1. Manhattan and Brooklyn have the highest number of bookings,

2. Queens has a little over 1/4 of the bookings of Manhattan,

3. Bronx has under 1/5 of the bookings of Queens,

4. Staten Island has the least number of bookings.

The most booked locations show the highest demand, and the business must provide adequate supply to meet that demand.
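
The ratios quoted above can be checked directly from the counts reported for Chart 1. The sketch below uses only those five numbers; the variable names are illustrative.

```python
# Counts per neighbourhood group as reported for Chart 1
counts = {
    "Brooklyn": 20089,
    "Manhattan": 21643,
    "Queens": 5664,
    "Staten Island": 373,
    "Bronx": 1089,
}

total = sum(counts.values())

# Share of each neighbourhood group in percent
shares = {group: round(100 * n / total, 1) for group, n in counts.items()}

# Ratios used in the conclusions
queens_vs_manhattan = counts["Queens"] / counts["Manhattan"]
bronx_vs_queens = counts["Bronx"] / counts["Queens"]

print(total)
print(shares)
print(round(queens_vs_manhattan, 3), round(bronx_vs_queens, 3))
```

The total matches the 48858 rows left after data wrangling, which confirms that every row is counted in exactly one group.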

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Plot of bookings based on type of room
ax = sns.countplot(airbnb_ny_2019_df, x = 'room_type')
ax.bar_label(ax.containers[0])
plt.xlabel('Room Type')
plt.ylabel('Room Type Count')
plt.title('Room Type Bookings')

##### 1. Why did you pick the specific chart?

The chart was selected to determine the number of bookings by room type. It identifies the room type most desired by customers, which helps the business allocate its resources properly.

##### 2. What is/are the insight(s) found from the chart?

The three room types had the following bookings:

1. Private room - 22306

2. Entire home/apt - 25393

3. Shared room - 1159

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The entire home/apartment room type is the most booked. Room type clearly affects the business, so this insight can be used to develop it further.

No negative impact is visible here, except that spending more resources on shared rooms would be wasteful. Directing the resources to the other room types is therefore the most appropriate choice.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Plot of average price based on neighbourhood groups
price = airbnb_ny_2019_df.groupby('neighbourhood_group').agg(mean_price=('price', 'mean'))

ax = sns.barplot(data = price, x = price.index, y = 'mean_price')
ax.bar_label(ax.containers[0])
plt.xlabel('Neighbourhood Groups')
plt.ylabel('Average Price')
plt.title('Average price as per the Neighbourhood Group')

##### 1. Why did you pick the specific chart?

The chart shows the average price of bookings in the different neighbourhood groups. It was picked to understand the average prices of AirBnb bookings in different locations.

##### 2. What is/are the insight(s) found from the chart?

The key insight is the price by location: the chart shows which location has the highest price for AirBnb bookings. Manhattan has the highest average price. Brooklyn comes second, followed by Staten Island, Queens and Bronx.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The data helps customers book an AirBnb at a suitable price. Together with the previous charts, it shows that the places with the highest prices also have the most bookings, which suggests that AirBnb provides a high quality of service in these locations.

The negative aspect is that Queens and Bronx need special attention to secure a higher number of bookings.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Plot of average price based on the type of room
room_price = airbnb_ny_2019_df.groupby('room_type').agg(mean_price=('price', 'mean'))

ax = sns.barplot(data = room_price, x = room_price.index, y = 'mean_price')
ax.bar_label(ax.containers[0])
plt.xlabel('Room Type')
plt.ylabel('Average Price')
plt.title('Average price as per the type of room')

##### 1. Why did you pick the specific chart?

The reason for selecting this specific chart in the exploratory data analysis is to understand the average cost of the different types of rooms in the AirBnb bookings.


##### 2. What is/are the insight(s) found from the chart?

The following insights are provided:

1. Entire Home/Apartment - 211.807

2. Private room - 89.7944

3. Shared room - 70.0759

These are the average price of the AirBnb bookings based on different types of rooms.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It provides the insight that entire home/apartment bookings have the highest price and the other room types are far cheaper in comparison. On price alone, demand would be expected to be higher for private or shared rooms.

However, the earlier chart shows that most bookings were for entire homes and apartments, which indicates quality service.


#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Plot of average minimum nights based on type of room

nights_by_room = airbnb_ny_2019_df.groupby('room_type').agg(mean_nights=('minimum_nights', 'mean'))

ax = sns.barplot(data = nights_by_room, x = nights_by_room.index, y = 'mean_nights')
ax.bar_label(ax.containers[0])
plt.xlabel('Room Type')
plt.ylabel('Average Nights')
plt.title('Average nights as per the type of room')

##### 1. Why did you pick the specific chart?

The chart was selected to compare the average 'minimum_nights' across room types. This column holds the minimum number of nights a booking must cover, and it is used here as an indication of how long rooms are occupied.

##### 2. What is/are the insight(s) found from the chart?

This provides insight on the average minimum nights per room type, with the following results:

1. Entire Home/ Apartment - 8.47

2. Private Rooms - 5.38

3. Shared Rooms - 6.47

This shows that the average minimum stay is highest for entire homes/apartments, followed by shared rooms and then private rooms.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In terms of business impact, the results suggest that the longest occupancy will be in entire homes and apartments. This indicates that the number of homes and apartments should be higher to accommodate more bookings.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Plot of average minimum nights based on Neighbourhood Group
nights_by_neighbour = airbnb_ny_2019_df.groupby('neighbourhood_group').agg(mean_nights=('minimum_nights', 'mean'))

ax = sns.barplot(data = nights_by_neighbour, x = nights_by_neighbour.index, y = 'mean_nights')
ax.bar_label(ax.containers[0])
plt.xlabel('Neighbourhood Groups')
plt.ylabel('Average Nights')
plt.title('Average night based on Neighbourhood Group')

##### 1. Why did you pick the specific chart?

This chart was selected to compare the average minimum nights across locations. It helps us understand how long customers stay, at minimum, in the AirBnb listings of each location.

##### 2. What is/are the insight(s) found from the chart?

The graph provides insights on the average minimum nights in specific locations:

1. Brooklyn - 6.05

2. Manhattan - 8.53

3. Queens - 5.18

4. Staten Island - 4.83

5. Bronx - 4.56

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The graph suggests a positive business impact: it shows Manhattan and Brooklyn to be the locations with the highest average number of nights.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Plot of number of reviews based on type of room
review_of_room = airbnb_ny_2019_df.groupby('room_type').agg(mean_review=('number_of_reviews', 'mean'))

ax = sns.barplot(data = review_of_room, x = review_of_room.index, y = 'mean_review')
ax.bar_label(ax.containers[0])
plt.xlabel('Room Type')
plt.ylabel('Average number of reviews')
plt.title('Average number of reviews for type of room')

##### 1. Why did you pick the specific chart?

The specific chart delivers the idea of average number of reviews for the particular room type.

##### 2. What is/are the insight(s) found from the chart?

The insights gained from the graph can be added in terms of average reviews for room type:

1. Entire Home/Apartment - 22.835270

2. Private Room - 24.117502

3. Shared Room - 16.614523

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The data shows the average number of reviews for particular type of room.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Plot of number of reviews based on neighbourhood groups

reviews_per_neighbourhood = airbnb_ny_2019_df.groupby('neighbourhood_group').agg(mean_reviews = ('reviews_per_month', 'mean'))

ax = sns.barplot(data = reviews_per_neighbourhood, x = reviews_per_neighbourhood.index, y = 'mean_reviews')
ax.bar_label(ax.containers[0])
plt.xlabel('Neighbourhood Groups')
plt.ylabel('Average Review per month')
plt.title('Average review per month for Neighbourhood Groups')

##### 1. Why did you pick the specific chart?

The chart was selected to compare the average number of reviews per month with respect to the locations.

##### 2. What is/are the insight(s) found from the chart?

The following are the insights of the chart:

1. Bronx - 1.747108

2. Brooklyn - 1.299685

3. Manhattan - 1.295301

4. Queens - 1.832280

5. Staten Island - 1.793594

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart identifies the locations where the most reviews per month are gathered. This helps in identifying the prime locations for business, which the customers like more than others.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Plot of price distribution by neighbourhood group

plt.figure(figsize=(12,8))
sns.boxplot(x='neighbourhood_group', y='price', data=airbnb_ny_2019_df)
plt.xlabel('Neighbourhood Groups')
plt.ylabel('Price')
plt.title('Price Distribution with respect to Neighbourhood Group')
plt.show()

##### 1. Why did you pick the specific chart?

This particular chart was selected to highlight the distribution of price across neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

The graph shows that majority of the listings were below 2000 for all the neighbourhood groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart identifies the price range in which most bookings are made. This helps in identifying the most common price range, which the customers prefer over others.
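
A few very high prices stretch the axis of a box plot and make the common price range hard to read. The sketch below applies the 1.5 x IQR rule that box plots commonly use for their whiskers to separate typical prices from outliers. The prices in `toy_prices` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical prices: most are moderate, two are extreme
toy_prices = pd.Series([50, 80, 100, 120, 150, 200, 250, 300, 2000, 10000], name="price")

# Quartiles and interquartile range
q1 = toy_prices.quantile(0.25)
q3 = toy_prices.quantile(0.75)
iqr = q3 - q1

# Upper fence of the 1.5 x IQR rule
upper_fence = q3 + 1.5 * iqr

# Split the prices into the typical range and the outliers
typical = toy_prices[toy_prices <= upper_fence]
outliers = toy_prices[toy_prices > upper_fence]

print(f"Upper fence: {upper_fence}")
print(f"Typical prices: {len(typical)}, outliers: {len(outliers)}")
```

Plotting only the typical range, or using a logarithmic price axis, would be two ways to make the differences between neighbourhood groups easier to see.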

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Plot for availability as per the neighbourhood group

plt.figure(figsize=(12,6))
sns.boxplot(x='neighbourhood_group', y = 'availability_365', data=airbnb_ny_2019_df)
plt.ylabel('Availability')
plt.xlabel('Neighbourhood Group')
plt.title('Availability Distribution by Neighbourhood Group')
plt.show()

##### 1. Why did you pick the specific chart?

This particular chart was selected to highlight how the availability of listings differs across neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Staten Island and Bronx have the most availability, while Brooklyn has the least year-round availability.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this graph tell the business that Brooklyn and Manhattan
need more listings, as they have less availability.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Plot for number of listings for different type of rooms in each neighbourhood group

listing_group_df = pd.DataFrame(airbnb_ny_2019_df.groupby(['neighbourhood_group', 'room_type'])['room_type'].count().unstack())

# figsize is passed to plot() directly; changing rcParams after plotting would not resize this figure
listing_group_df.plot(kind='bar', figsize=(7, 7))
plt.xlabel('Neighbourhood Group')
plt.ylabel('Number of listings')

##### 1. Why did you pick the specific chart?

This chart represents number of listings of type of rooms in different neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart are:

1. The highest number of listings is in Manhattan and the lowest is in Staten Island.

2. Entire home/apt is the most listed property type and Shared room is the least listed property type.

3. Shared rooms may have the fewest listings because fewer people prefer to stay in this room type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

As per the conclusions drawn, entire home/apt is the most listed property type, so the business should focus its resources there. Shared room is the least listed type, so the resources spent on it should be controlled for profit.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Plot for comparing the price of different type of rooms in different neighbourhood groups

neighbourhood_group_price_df = pd.DataFrame(airbnb_ny_2019_df.groupby(['neighbourhood_group','room_type'])['price'].mean().unstack())

# figsize is passed to plot() directly; changing rcParams after plotting would not resize this figure
neighbourhood_group_price_df.plot(kind='bar', figsize=(5, 5))
plt.xlabel('Neighbourhood Group')
plt.ylabel('Average Price')
plt.title('Average price for different rooms in neighbourhood groups')

##### 1. Why did you pick the specific chart?

This chart represents average price of different type of rooms in different neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart are:

1. The most expensive listings are in Manhattan. The least expensive on average are in Bronx, in line with the ranking found in Chart 3.

2. Entire home/apt is the most expensive property type and Shared room is the least expensive property type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart identifies how the price range varies by room type and location. This helps in identifying the most common price range, which the customers prefer over others.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Plot for busiest hosts

busiest_host_df = airbnb_ny_2019_df.groupby(['host_name', 'neighbourhood_group', 'room_type'])['minimum_nights'].count().reset_index()
busiest_host_df = busiest_host_df.sort_values(by = 'minimum_nights', ascending = False).head()

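# The figure size is set before plotting; changing rcParams afterwards would not resize this figure
plt.figure(figsize=(10, 5))
# The counted 'minimum_nights' entries equal the number of listings per host, neighbourhood group and room type
plt.bar(busiest_host_df['host_name'], busiest_host_df['minimum_nights'], width=0.5)
plt.xlabel('Host')
plt.ylabel('Number of listings')
plt.title('Busiest Hosts')
plt.show()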

##### 1. Why did you pick the specific chart?

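This specific chart is used to find the busiest hosts on the basis of how many listings they hold, measured as the count of 'minimum_nights' entries per host, neighbourhood group and room type.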

##### 2. What is/are the insight(s) found from the chart?

The graph shows us that Sonder (NYC), Blueground, Michael Host, Kara and David are the busiest hosts, in that particular order.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The business owners other than the ones the graph shows can look into this information and find reasons to further develop their business.
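
Grouping by 'host_name' can merge different hosts who share a first name. The sketch below shows the effect on a made-up table and groups by 'host_id' as well. Whether names actually collide in the real data is an assumption that has not been checked here.

```python
import pandas as pd

# Hypothetical toy table: two different hosts are both called David
toy_df = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 3],
    "host_name": ["David", "David", "David", "Kara", "Kara", "Kara"],
})

# Listings per host name: the two Davids are merged into one group
by_name = toy_df.groupby("host_name").size()

# Listings per host id and name: each host is counted separately
listings_per_host = (
    toy_df.groupby(["host_id", "host_name"])
    .size()
    .sort_values(ascending=False)
)

print(by_name)
print(listings_per_host)
```

In this toy case the name-based count ties David with Kara, while the id-based count shows Kara as the single busiest host.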

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

data_part_airbnb_ny_2019_df = airbnb_ny_2019_df.drop(['id', 'name','host_name', 'host_id', 'neighbourhood', 'latitude', 'longitude'], axis=1)

sns.heatmap(data_part_airbnb_ny_2019_df.iloc[:,2:].corr(), cmap='YlGnBu', annot=True)

##### 1. Why did you pick the specific chart?

The plot was picked to understand the relationships between the different variables. First, the numerical variables were separated by dropping all the other columns. This new dataset is used to plot the heatmap of the variables with annotations.

##### 2. What is/are the insight(s) found from the chart?

This gives an idea of how the different variables are correlated with each other, whether positively or negatively.
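
As an alternative to selecting columns by position, the numerical columns can be picked by data type before computing the correlations. The sketch below does this on a made-up table; `toy_df` and its values are illustrative only.

```python
import pandas as pd

# Hypothetical toy table with one categorical and three numerical columns
toy_df = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"],
    "price": [60, 200, 40, 80],
    "minimum_nights": [1, 3, 1, 2],
    "availability_365": [300, 100, 350, 250],
})

# Keep only numerical columns, regardless of their position in the table
numeric_df = toy_df.select_dtypes(include="number")

# Pearson correlation matrix (the pandas default)
corr = numeric_df.corr()

print(corr.round(2))
```

A value close to 1 means two variables rise together, a value close to -1 means one falls as the other rises, and a value near 0 means no linear relationship. The resulting matrix can be passed to `sns.heatmap` in the same way as in the cell above.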

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(data_part_airbnb_ny_2019_df, hue='room_type')

In [None]:
sns.pairplot(data_part_airbnb_ny_2019_df, hue='neighbourhood_group')

##### 1. Why did you pick the specific chart?

The chart was picked to show the pairwise relationships between the variables, coloured by room type and by neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

It shows where the numerical variables of the dataset increase or decrease together.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

In order to achieve the business objectives, the client should take the following measures in terms of resource allocation.


*   The data analysed shows that entire home/apt and private rooms are the go-to choices for customers.

*   The data analysed also shows that Brooklyn and Manhattan have the highest prices and that the availability of bookings is low in these locations.
*   The business should be widely expanded in these two locations for better prospects.

*   Proper pricing of the rooms in different locations is also important to improve customer engagement in the less popular locations like Staten Island and Bronx.



# **Conclusion**

In conclusion, AirBnb can improve consumer engagement and the quality of its services through proper resource allocation. The data gathered has been effective in understanding the price and other aspects of AirBnb bookings, and it has helped in understanding what customers require for a better experience. Improving the services and the experience calls for proper allocation of financial and other resources. This means adjusting the prices of bookings in some locations and adding more listings to accommodate customers in prime locations like Brooklyn and Manhattan.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***