# **Project Name**    - AirBnb Bookings Analysis EDA Project





##### **Project Type**    - EDA
##### **Contribution**    - Team
##### **Team Member 1 -**   Shivam Bhardwaj
##### **Team Member 2 -**   Shivam Tiwari

# **Project Summary -**

Hello, all. We have worked on the AirBnb NYC bookings dataset for the year 2019. The dataset contains information about hosts, their listings, and the response of the users, for example the number of reviews and the date on which a listing was last reviewed.

We have tried to get three things out of the data:

a. We have tried to understand the users' behaviour 

b. We have tried to understand the hosts' behaviour 

c. And most importantly, we have used those two to advise the stakeholders of AirBnb on a few points which, when worked on, should in our humble opinion help drive the business.

We hope that you like the project, and we would like to hear your thoughts on it. You can reach me at shivambhardwaj0078@gmail.com

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


## <b>This dataset has around 49,000 observations in it with 16 columns and it is a mix between categorical and numeric values. </b>

AirBnb listings generate a lot of data - data that can be analyzed and used for security, business decisions, understanding of customers' and providers' (hosts) behavior and performance on the platform, guiding marketing initiatives, implementation of innovative additional services and much more. 

#### **Define Your Business Objective?**

*Understanding the user and host behaviour*



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know the Data***

### Importing Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
path="/content/drive/MyDrive/Almabetter/modules/Python for Data Science/Capstone Projects/Airbnb NYC 2019.csv"
dataset=pd.read_csv(path)


In [None]:
# Generating a copy
df = dataset.copy()

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Info

In [None]:
df.info()

## Observations

1. The dataset has 48895 rows and 16 columns.
2. Some columns have missing (null) values, so we are now going to clean this data.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

## Observations

1. The columns name, host_name, last_review and reviews_per_month have missing values. name and host_name only identify a listing and its host and carry no information for this analysis, so those columns can be dropped.
2. last_review and reviews_per_month have the same number of NaN values. This suggests that these listings were never reviewed, so last_review has no date and reviews_per_month has no value.
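
The second observation can be checked directly instead of being inferred from matching counts. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the AirBnb data) and tests whether both columns are null on exactly the rows where `number_of_reviews` is 0. The same comparison could be run on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are invented for illustration)
toy = pd.DataFrame({
    "number_of_reviews": [9, 0, 45, 0],
    "last_review": ["2018-10-19", None, "2019-05-21", None],
    "reviews_per_month": [0.21, np.nan, 0.38, np.nan],
})

no_reviews = toy["number_of_reviews"] == 0

# True only if the column is null on exactly the never-reviewed rows
same_rows_last = bool((toy["last_review"].isnull() == no_reviews).all())
same_rows_rpm = bool((toy["reviews_per_month"].isnull() == no_reviews).all())
print(same_rows_last, same_rows_rpm)
```

If both checks hold on the real data, filling `reviews_per_month` with 0 is a natural choice, since a listing without reviews has a review rate of zero.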

# Understanding the Data

In [None]:
#Dataset Columns
df.columns

## 3. ***Data Wrangling***

### Cleaning the Data

In [None]:
#Dropping the columns name, host_name and last_review
df.drop(['host_name','name','last_review'],inplace=True,axis=1)

In [None]:
#Checking changes
df.head()

In [None]:
df.isnull().sum()

In [None]:
#Now null values exist only in reviews_per_month
#Replacing NaN with 0
df.fillna({"reviews_per_month":0},inplace=True)
df.isnull().sum()
# All null values are eliminated

In [None]:
df.describe()

## Observations

1. The minimum price is zero, which is not plausible for a paid listing.
2. The maximum minimum_nights is 1250, which is very unlikely.



In [None]:
#Rows which have 'price'=0
df[df['price']==0]

In [None]:
#Removing the properties with price 0 which is not possible
df=df[df["price"]>0]

In [None]:
#Checking the data again
df[df['price']==0]

In [None]:

#Function to correct minimum nights
def minimum_nights_correction(minimum_nights):
  if minimum_nights>365:
    return 365
  else:
    return minimum_nights


In [None]:
# Using the minimum_nights_correction function to cap 'minimum_nights' values greater than 365 at 365
df["minimum_nights"]=df["minimum_nights"].apply(minimum_nights_correction)

In [None]:
df[df["minimum_nights"]>365]

In [None]:
df.describe()

### The data is now cleaned: listings with a price of 0 are removed and minimum_nights is capped at 365
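
As an aside, the capping of `minimum_nights` done above with `apply` can also be written with `Series.clip`, which works on the whole column at once. This is a minimal sketch on invented values (the series `nights` is hypothetical; 1250 mirrors the maximum reported by `df.describe()`), offered as an equivalent alternative rather than a required change.

```python
import pandas as pd

# Invented values; 1250 mirrors the maximum seen in df.describe()
nights = pd.Series([1, 30, 365, 500, 1250], name="minimum_nights")

# Cap everything above 365 at 365 and leave smaller values untouched
capped = nights.clip(upper=365)
print(capped.tolist())
```

On the real frame this would read `df["minimum_nights"] = df["minimum_nights"].clip(upper=365)`.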

**4. Understanding the relationships between variables using different kinds of visualisation techniques:**

Basic understanding of the data's linear relationships:

In [None]:
# Chart - 1 visualization code
# Getting the numerical variables
corr_data = df[['price','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']]

# Heatmap
corrmat = corr_data.corr()
plt.figure(figsize=(15,8))
sns.heatmap(corrmat,annot=True,fmt='.2f',annot_kws={'size':10},vmax=.8,square=True);

**Why heatmap?**

It shows the strength of the linear relationship between every pair of numerical variables of a dataset in a single view. And it is beautiful.



**Insight(s) found from the chart?**

At a glance, there seems to be no linear relationship between any two variables of the dataset except for the very obvious one (number of reviews and reviews per month).


**How does the insight help?**

Now we know that we should not expect a strong linear relationship between any two numerical variables of our problem. The heatmap does not rule out non-linear relationships.
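
To read a correlation matrix without relying on colours alone, the pairs can be ranked by absolute correlation. The sketch below does this on a small invented frame (`toy` and its numbers are hypothetical, and the two review columns are deliberately built to move together); on the notebook's data, `corrmat` could be used in place of `corr`.

```python
from itertools import combinations

import pandas as pd

# Invented numbers: the two review columns are built to move together
toy = pd.DataFrame({
    "price": [100, 60, 250, 80, 150, 90],
    "number_of_reviews": [10, 40, 5, 80, 20, 60],
    "reviews_per_month": [0.5, 2.1, 0.2, 4.0, 1.1, 3.2],
})

corr = toy.corr()

# One entry per unordered pair of columns, so the diagonal and duplicates are skipped
pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})

# Strongest relationships first, regardless of sign
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```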

**What is the distribution of the room types?**

In [None]:
# Value counts of the room type column
pie_data = df.room_type.value_counts(normalize=True)*100
# rename_axis + reset_index(name=...) yields the same two columns on old and new pandas versions
pie_data = pie_data.rename_axis('room_type').reset_index(name='% share')
pie_data

In [None]:
# Using colorblind palette for pie chart
sns.set_palette("colorblind")
plt.figure(figsize=(12,8))
plt.pie(pie_data['% share'],
        labels = pie_data['room_type'].values,
        autopct = "%.1f%%", # % values in 1 decimals
        textprops = {"size":"x-large",
                     "fontweight":"bold",
                     "color":"w"}) # styling the text
plt.legend()
# Giving our chart a suitable title
plt.title("% share of room type in dataset", fontweight = "bold", fontsize = 18)
plt.show();

Why pie chart?

The pie chart is a pretty straightforward way to show how the data is split among a small number of categories, especially when none of the individual % shares is too small to read.



Insights from the chart?

The most common room type is the entire home or apartment, rented out in full. These are more expensive than the other two categories and account for 52.0% of the listings.

The next most common type of listing is the private room. Here, a room in an apartment or a home is available for use by an individual. Private rooms account for 45.7% of the listings. They are cheaper options and also interest those who don't travel in large groups.

Shared rooms are available as well and are perhaps the cheapest option, where individuals book a room that others will also be using. These account for 2.4% of the data.


How does the insight help?

Now we know that there are very few shared room listings. The reason should be looked into: is it because the demand is low, or is it a segment that has so far been overlooked by the AirBnb team? If it turns out to be the latter, hosts can be incentivized and encouraged to add more such listings.



**What are the price ranges for the three room types?**

In [None]:
# Preparing the data
# Segregating the dataset for each room type
df_Shared = df[df['room_type']=='Shared room']
df_Private = df[df['room_type']=='Private room']
df_EHA = df[df['room_type']=='Entire home/apt']


temp = pd.concat([df_EHA['price'].describe(),df_Private['price'].describe(),df_Shared['price'].describe()], axis=1)
temp.reset_index(inplace=True)
temp.columns = ['stat','EHA','Private','Shared']
# Not including the distribution post 75% since two of the three room types have high volume of outliers
line_data = temp.iloc[3:7,:] 
line_data


In [None]:
# Setting the size of the figure
plt.figure(figsize = (15,8))
# Plotting the lines
plt.plot(line_data.stat, line_data.EHA, color = "midnightblue")
plt.plot(line_data.stat, line_data.Private, color = "crimson")
plt.plot(line_data.stat, line_data.Shared, color = "green")
# Assigning the labels
plt.ylabel("Price")
plt.xlabel("Distribution")
# Adding labels
labels = ['Entire home/apt','Private room','Shared room']
plt.legend(labels = labels, fontsize = "large")
plt.title("Price for Entire home/apt vs Private room vs Shared room", fontsize = 16, fontweight = "bold")
plt.show();

Why line graph?

A line graph is a convenient way to compare the price distribution (minimum and quartiles) of the three kinds of rooms. As noted in the code above, two of the three room types have a high volume of outliers in price, so a violin plot or a boxplot would be compressed by the extreme values and would not paint the full picture here.



Insights from the chart?

It was pretty easy to guess which of the three would be the most affordable and the most expensive options. However, it was difficult to tell how much exactly the prices differed for the three kinds.

All three have their cheapest options starting from $10. However, the rise in price for an entire home or apartment is considerable, whereas there is very little difference between the prices of the other two types, even as we move towards the higher end.

More than 75% of the private and shared rooms cost less than $100, whereas the price at the 75th percentile is double that for entire homes and apartments.

How does the insight help?

We already knew that the shared room types are the cheapest options, but now we can quantify how cheap the prices can be expected to be. A shared room is anywhere between 20-50% cheaper.

We also know how much more expensive the entire homes are. A customer will have to pay at least twice as much for an apartment or home as for a private or a shared room.

So, the visualisation gives an indication of how much guests choosing each room type should expect to spend.
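
The quartile table behind the line chart can also be produced in one step with `groupby(...).quantile(...)`, instead of calling `describe()` once per room type. The sketch below uses an invented frame (`toy` and its prices are hypothetical) and requests the minimum and the three quartiles, which are the same four statistics that `temp.iloc[3:7,:]` selects.

```python
import pandas as pd

# Invented prices for two room types
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 5 + ["Private room"] * 5,
    "price": [100, 150, 200, 250, 300, 40, 60, 80, 100, 120],
})

quartiles = (
    toy.groupby("room_type")["price"]
       .quantile([0, 0.25, 0.5, 0.75])   # 0 is the minimum
       .unstack(level=0)                 # one column per room type
)
print(quartiles)
```

Each column of `quartiles` can then be passed to `plt.plot` in the same way as the columns of `line_data`.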



**Which type of rooms get the most reviews on average?**

In [None]:
# Using group by on room type to get median of reviews per month
bar_data = df.groupby(['room_type']).agg({'reviews_per_month':'median'}).reset_index() #median because data has too many outliers
bar_data


In [None]:
plt.figure(figsize=(10,5))
plt.barh(y=bar_data['room_type'],
         width=bar_data['reviews_per_month'], color="midnightblue", height=0.5)
plt.title("Reviews Frequency for every room type", fontsize=16, color="midnightblue", fontweight="bold")
plt.xticks(fontsize=12, color="midnightblue")
plt.yticks(fontsize=10, color="midnightblue")
plt.xlabel("reviews per month", fontsize=14, color="midnightblue")
# Annotations on the bar
for i in range(len(bar_data['room_type'])):
    plt.annotate(bar_data['reviews_per_month'][i], 
                 (bar_data['reviews_per_month'][i] - .027, -0.05+i), 
                 color="white", fontweight="bold")
plt.show();

Why horizontal bar?

Horizontal bars help identify the small differences between values for the different categories, like we have in between Shared room and Private room.



Insights from the chart?

We can see that the shared rooms get reviews more frequently than the two more expensive types.

Actually, private rooms are reviewed almost as frequently as the shared rooms.



How does the insight help?

The cheaper rooms will tend to draw the biggest crowd if all the basic necessities are made available. This improves the user experience, and since users already review such rooms more frequently, it should lead to more positive feedback for the listings and thus for AirBnb.
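
The bar chart above is built on the median of `reviews_per_month` rather than the mean because of outliers. A toy example (the values in `rpm` are invented) shows how a single extreme listing pulls the mean far away from the typical value while the median barely moves.

```python
import pandas as pd

# Invented reviews_per_month values with one extreme listing
rpm = pd.Series([0.2, 0.4, 0.5, 0.7, 58.5])

print("mean:  ", rpm.mean())
print("median:", rpm.median())
```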



**Which room type is available to book most often on average?**

In [None]:
# Using group by on room type to get median of availability_365
bar_data = df.groupby(['room_type']).agg({'availability_365':'median'}).reset_index() #median because data has too many outliers
bar_data


In [None]:
# Setting the size
plt.figure(figsize=(15,8))
plt.bar(bar_data["room_type"], bar_data["availability_365"], color="midnightblue", width=0.5)
# Title for the graph
plt.title("Availability based on room types", fontsize=16, color="midnightblue", fontweight="bold")
# The markings on the axes
plt.xticks(fontsize=12, color="midnightblue")
plt.yticks(fontsize=10, color="midnightblue")
# Labels for the axes
plt.ylabel("No. of days available", fontsize=14, color="midnightblue")
plt.xlabel("Room types", fontsize=14, color="midnightblue")
# Adding annotations on the graph
for i in range(len(bar_data["room_type"])):
    plt.annotate(bar_data["availability_365"][i], 
                 (i-0.03, bar_data["availability_365"][i] - 5), 
                 color="white", fontweight="bold", fontsize=14)
plt.show();

Why bar chart?

A bar chart makes it easy to compare a single numeric value (here, the median availability) across categories. A pie chart would not fit here, because it shows shares of a whole and the median availabilities of the room types do not add up to a meaningful total.



Insights from the chart?

Shared rooms are more available (90 days) whereas entire homes or apartments and private rooms have almost similar availability (42-45 days).



How does the insight help?

Shared rooms are both easy to find and affordable, and we have already established that there is a lot of room for growth in this segment. Marketing campaigns can be created to make consumers aware of this.



**What is the distribution of the minimum booking days necessary for the different kinds of room types?**

In [None]:
# Preparing the data
temp = pd.concat([df_EHA['minimum_nights'].describe(),df_Private['minimum_nights'].describe(),df_Shared['minimum_nights'].describe()], axis=1)
temp.reset_index(inplace=True)
temp.columns = ['stat','EHA','Private','Shared']
# Not including the distribution post 75% since two of the three room types have high volume of outliers
line_data = temp.iloc[3:7,:] 
line_data

In [None]:
# Setting the size of the figure
plt.figure(figsize = (15,8))
# Plotting the lines
plt.plot(line_data.stat, line_data.EHA, color = "midnightblue")
plt.plot(line_data.stat, line_data.Private, color = "crimson")
plt.plot(line_data.stat, line_data.Shared, color = "green")
# Assigning the labels
plt.ylabel("Minimum Nights")
plt.xlabel("Distribution")
# Adding labels
labels = ['Entire home/apt','Private room','Shared room']
plt.legend(labels = labels, fontsize = "large")
plt.title("Minimum nights to book for Entire home/apt vs Private room vs Shared room", fontsize = 16, fontweight = "bold")
plt.show();

Why line graph?

Line graphs, as explained earlier, are best suited for us to show the distribution of the three kinds where there are a lot of outliers in the dataset.



Insights from the chart?

More than half of the shared room listings can be booked for just one night; for the other half, the minimum number of nights required grows at the same rate as for the other two room types.

Entire homes show consistent growth and usually have to be booked for more nights than the other two room types.

Private rooms show no growth up to the first quartile and then grow at almost the same rate as entire homes over the remaining 75% of listings.



How does the insight help?

This is another statistic that tells us a lot about the room types and the hosts' behaviour. Booking an entire home needs not only more money but a commitment of more nights as well. However, half of the shared rooms require almost no commitment.



**Which are the most popular neighbourhood groups in terms of listings?**

In [None]:
pie_data = df.neighbourhood_group.value_counts(normalize=True)*100
# rename_axis + reset_index(name=...) yields the same two columns on old and new pandas versions
pie_data = pie_data.rename_axis('neighbourhood_group').reset_index(name='% share')
pie_data

In [None]:
# Using colorblind palette for pie chart
plt.figure(figsize=(15,10))
plt.pie(pie_data['% share'],
        pctdistance = 1.125,
        autopct = "%.2f%%",
        textprops = {"size":"x-large",
                     "color":"black"})
plt.legend(labels = pie_data['neighbourhood_group'].values, loc='upper right', fontsize=12)
# Giving our chart a suitable title
plt.title("% share of neighbourhood groups", fontweight = "bold", fontsize = 18)
plt.show();

Why pie chart?

With five categories whose shares differ widely, a pie chart shows at a glance how dominant the two largest groups are, and the percentage labels keep the small slices readable.



Insights from the chart?

Manhattan and Brooklyn comprise almost 85% of the listings of New York on AirBnb, whereas Bronx and Staten Island lack presence and account for only about 3% combined.



How does the insight help?

It helps a company to know where its user base is and where it is not doing well. Staten Island is larger in area than Manhattan yet has the lowest number of listings, so AirBnb has room to acquire many more listings there.



**Distribution by room types**

In [None]:
# size() counts the rows in each (neighbourhood_group, room_type) pair
bar_data=df.groupby(['neighbourhood_group','room_type']).size().reset_index(name='no_of_listings')
bar_data


In [None]:
# Plotting the bar multivariate graph
sns.set_palette('colorblind')
plt.figure(figsize=(20,10))
sns.barplot(data=bar_data, x='neighbourhood_group', y ='no_of_listings', hue='room_type')
# Title for the graph
plt.title("No of listings of various room types in the neighbourhood groups", fontsize=16, color="black", fontweight="bold")
# The markings on the axes
plt.xticks(fontsize=12, color="black")
plt.yticks(fontsize=10, color="black")
# Labels for the axes
plt.ylabel("No. of listings", fontsize=16, color="black")
plt.xlabel("Neighbourhood", fontsize=16, color="black")
plt.legend(title="Room types")
plt.show();


Why bar chart?

The grouped bar chart makes the difference in the number of listings between room types and neighbourhood groups easy to see. That is what made us go for it.



Insights from the chart?

Apart from Manhattan, the rest of the neighbourhood groups have more private room listings than entire home/apt listings.



How does the insight help?

It gives us an idea of what the users want and what the hosts provide. The most interesting point for the stakeholders should be looking into what makes users in Manhattan choose entire apartments rather than private rooms even though the former are much more expensive.
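
The comparison between room types within each neighbourhood group can also be read from a contingency table. The sketch below uses `pd.crosstab` on an invented frame (`toy` is hypothetical and much smaller than the real data) to get the counts, the row-wise shares, and the groups where entire homes outnumber private rooms.

```python
import pandas as pd

# Invented listings: three in Manhattan, three in Brooklyn
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Private room", "Entire home/apt"],
})

# Number of listings per group and room type
counts = pd.crosstab(toy["neighbourhood_group"], toy["room_type"])

# Row-wise shares: fraction of each group's listings that each room type makes up
shares = pd.crosstab(toy["neighbourhood_group"], toy["room_type"], normalize="index")

# Groups where entire homes outnumber private rooms
entire_led = counts.index[counts["Entire home/apt"] > counts["Private room"]].tolist()
print(counts, shares.round(2), entire_led, sep="\n\n")
```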



**Which are the 10 most popular neighbourhoods in terms of listings?**

In [None]:
# Preparing the data by grouping by neighbourhood group and neighbourhood and then extracting the top 10 neighbourhoods by listing count
bar_data = df.groupby(['neighbourhood_group','neighbourhood']).size().reset_index(name='no_of_listings').sort_values('no_of_listings',ascending=False).head(10)
bar_data


In [None]:
# Colour for each neighbourhood group in the top 10, then the figure size
colors = {'Brooklyn':'midnightblue','Manhattan':'green'}
plt.figure(figsize=(20,10))
plt.bar(bar_data["neighbourhood"], bar_data["no_of_listings"], color=bar_data['neighbourhood_group'].map(colors), width=0.5)
# Title for the graph
plt.title("Top 10 neighbourhoods and their groups", fontsize=16, color="midnightblue", fontweight="bold")
# The markings on the axes
plt.xticks(fontsize=12, color="black")
plt.yticks(fontsize=10, color="black")
# Labels for the axes
plt.ylabel("No. of listings", fontsize=16, color="midnightblue")
plt.xlabel("Neighbourhood", fontsize=16, color="midnightblue")
# Adding legends using Patch
from matplotlib.patches import Patch
plt.legend([Patch(facecolor=colors['Brooklyn']),Patch(facecolor=colors['Manhattan'])],["Brooklyn","Manhattan"], fontsize = 14)
plt.show();

Why bar chart?

A bar graph is one of the clearest and easiest-to-understand ways to show a ranking.



Insights from the chart?

As expected, the top 10 neighbourhoods all belong to the two most popular neighbourhood groups, Brooklyn and Manhattan. And surprisingly, they account for 47.95% of the total listings of the dataset.
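
A share like the one quoted above can be computed from the value counts of the `neighbourhood` column. The helper below is a sketch under that assumption (`top_n_share` is a hypothetical name, and the labels in `toy` are invented); it is not necessarily how the figure in this notebook was obtained.

```python
import pandas as pd

# Invented neighbourhood labels; only the counting logic matters here
toy = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 1 + ["D"] * 1, name="neighbourhood")


def top_n_share(neighbourhoods: pd.Series, n: int) -> float:
    """Percentage of all rows that fall in the n most frequent values."""
    counts = neighbourhoods.value_counts()  # sorted in descending order
    return float(counts.head(n).sum() * 100 / len(neighbourhoods))


print(top_n_share(toy, 2))
```

On the real data the call would be `top_n_share(df["neighbourhood"], 10)`.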



How does the insight help?

The insights inform us about the ten most popular neighbourhoods of New York in terms of host listings. This may very well be because of the high demand for accommodation in these neighbourhoods.



**How does the price vary with the number of reviews?**

In [None]:
# Preparing the data
temp = df[['price', 'number_of_reviews']]
# Removing the outliers to have a closer look at the patterns
scatter_data = temp[(temp['price']<500)&(temp['number_of_reviews']<250)]
scatter_data

In [None]:
plt.figure(figsize = (18,12))
# Plotting the scatter plot
scatter = plt.scatter(scatter_data['price'],
            scatter_data['number_of_reviews'],
            alpha = 0.6)
plt.title("Relationship between No. of reviews and Price",
          fontsize = 14,
          weight = "bold")
plt.xlabel("Price (in $)", weight = "bold")
plt.ylabel("Number of reviews", weight = "bold")
plt.show();

Why scatter plot?

A scatter plot shows every listing as a point, which makes it one of the most direct ways to see whether there is any kind of underlying relationship between two numerical variables.



Insights from the plot?

Looking at the graph, there is no visible relationship between the price and the number of reviews of a listing.



How does the insight help?

The chart makes it clear to the decision makers that they should not assume a relationship between the price and the number of reviews a listing receives.

More luxurious rooms are not reviewed more often, and more affordable listings do not get more feedback either. The price does not appear to have a say in the user's motive to review a place.
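
The visual impression from a scatter plot can be backed by a number, in the same way the notebook later calls `corr()` for the host listings chart. The sketch below computes the Pearson correlation on an invented frame (`toy` is hypothetical); on the notebook's data the two columns of `scatter_data` would be used instead.

```python
import pandas as pd

# Invented listings whose reviews do not rise or fall with price overall
toy = pd.DataFrame({
    "price": [50, 100, 150, 200],
    "number_of_reviews": [30, 10, 10, 30],
})

# Series.corr uses the Pearson coefficient by default
r = toy["price"].corr(toy["number_of_reviews"])
print(round(r, 3))
```

A coefficient near zero only says that there is no linear trend. The toy data above are U-shaped, which is a reminder to keep looking at the plot as well as the number.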



**Do hosts with more listings have a higher review rate?**

In [None]:
# Preparing the data
temp = df[['calculated_host_listings_count', 'reviews_per_month']]
# Removing the outliers to have a closer look at the patterns
scatter_data = temp[(temp['reviews_per_month']<15)&(temp['calculated_host_listings_count']<60)]
scatter_data


In [None]:
plt.figure(figsize = (18,12))
# Plotting the scatter plot
scatter = plt.scatter(scatter_data['calculated_host_listings_count'],
            scatter_data['reviews_per_month'],
            alpha = 0.6)
plt.title("Relationship between Number of listings by hosts and Reviews received per month",
          fontsize = 14,
          weight = "bold")
plt.xlabel("No. of listings by a host", weight = "bold")
plt.ylabel("Reviews per month", weight = "bold")
plt.show();

In [None]:
scatter_data.corr()

So, the corr method confirms what the visualisation was suggesting, i.e. there is no strong relationship between the number of listings a host has and the number of reviews their listings get every month.



Why scatter plot?

A scatter plot helps us understand the relationship between two numerical variables.



Insights from the chart?

There is no visible relationship between the total number of listings by a host and the number of reviews their listings get every month on average.



How does the insight help?

It tells us that the hosts with more listings are not doing anything differently to drive the number of reviews their listings are getting. However, getting reviewed frequently and positively helps hosts build trust and acquire more users.

Hosts can certainly be advised to do more to get reviewed by users.



**How does the price vary between the different neighbourhood groups for the three types of rooms?**

In [None]:
# Preparing the dataset
bar_data = df.groupby(['neighbourhood_group', 'room_type'], as_index=False)['price'].median()
bar_data

In [None]:
# Plotting the bar multivariate graph
sns.set_palette('colorblind')
plt.figure(figsize=(20,10))
sns.barplot(data=bar_data, x='neighbourhood_group', y ='price', hue='room_type')
# Title for the graph
plt.title("Price for various room types in the neighbourhood groups", fontsize=16, color="black", fontweight="bold")
# The markings on the axes
plt.xticks(fontsize=12, color="black")
plt.yticks(fontsize=10, color="black")
# Labels for the axes
plt.ylabel("Price", fontsize=16, color="black")
plt.xlabel("Neighbourhood", fontsize=16, color="black")
plt.legend(title="Room types")
plt.show();

Why bar graph?

Bar graphs are one of the most easily understood chart types, especially for a numerical value split by two categorical variables as we have here.



Insight from the chart?

Manhattan is the most expensive neighbourhood group whereas Bronx and Staten Island have almost similar costs and are the cheapest.

Manhattan is so expensive that getting a private room there is as expensive as getting an entire home/apt in Staten Island and Bronx.

The price of getting a private room is almost the same for all the neighbourhood groups except Manhattan.



How does the insight help?

The insight will help understand which kind of user should be shown what type of room when looking for one in a certain neighbourhood group.

A user who usually gets a private room in Manhattan to save money could be swayed towards getting an entire home/apt in neighbourhood groups such as Staten Island and Bronx. This way, the user gets a better experience at the same cost for which they would get less in Manhattan. Insights like this can be used to have a positive impact on a user.
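
Comparisons across neighbourhood groups and room types are easier on a wide table than on the long `bar_data` format. The sketch below pivots an invented table of medians (`toy` and all of its prices are hypothetical, not results from this dataset) and computes how the price of a private room in Manhattan compares with an entire home/apt in each group.

```python
import pandas as pd

# Invented medians, shaped like the bar_data table built above
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Manhattan", "Manhattan"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Private room"],
    "price": [100.0, 55.0, 190.0, 90.0],
})

# One row per neighbourhood group, one column per room type
wide = toy.pivot(index="neighbourhood_group", columns="room_type", values="price")

# Price of a Manhattan private room relative to an entire home/apt in each group
ratio = wide.loc["Manhattan", "Private room"] / wide["Entire home/apt"]
print(wide, ratio.round(2), sep="\n\n")
```

A ratio close to 1 for a group means that an entire home/apt there costs about as much as a private room in Manhattan.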



**How does the availability of the rooms vary depending on the price?**

In [None]:
temp = df[['availability_365','price']]
scatter_data = temp[temp['price']<500] #Removing the outliers
scatter_data


In [None]:
plt.figure(figsize = (18,12))
# Plotting the scatter plot
scatter = plt.scatter(scatter_data['availability_365'],
            scatter_data['price'],
            alpha = 0.6)
plt.title("Relationship between Availability of rooms and Price",
          fontsize = 14,
          weight = "bold")
plt.xlabel("Availability over 365 days", weight = "bold")
plt.ylabel("Price (in $)", weight = "bold")
plt.show();


Why scatter plot?

Scatter plots help us see the relationship between two numerical variables.



Insight from the plot?

There is no visible relationship between the price and the availability of a listing.



How does the insight help?

If low availability is taken as a rough sign of popularity, we can tell that price does not decide whether a listing will be popular among the users or not. It may be one of the reasons but not the only one. And, of course, there are going to be exceptions given how hosts may offer promotional discounts.



**How does the availability of the rooms vary in the three room types for each neighbourhood group?**

In [None]:
# Preparing the dataset
bar_data = df.groupby(['neighbourhood_group', 'room_type'], as_index=False)['availability_365'].median()
bar_data


In [None]:
# Plotting the bar multivariate graph
sns.set_palette('colorblind')
plt.figure(figsize=(20,10))
sns.barplot(data=bar_data, x='neighbourhood_group', y ='availability_365', hue='room_type')
# Title for the graph
plt.title("Availability over 365 days for various room types in the neighbourhood groups", fontsize=16, color="black", fontweight="bold")
# The markings on the axes
plt.xticks(fontsize=12, color="black")
plt.yticks(fontsize=10, color="black")
# Labels for the axes
plt.ylabel("No. of days available in a year", fontsize=16, color="black")
plt.xlabel("Neighbourhood", fontsize=16, color="black")
plt.legend(title="Room types")
plt.show();


Why bar plot?

Bar plots help us understand two categorical variables and a numerical variable together in one of the simplest ways.



Insights from the plot?

Private rooms and entire homes/apts are available for fewer than 50 days (median) in Brooklyn and Manhattan. Hence, in the places with more listings, entire homes and private rooms appear to be preferred over shared rooms.

Shared rooms are the busiest of all the room types in Staten Island and have some of the lowest availability in New York.



How does the insight help?

Looking at the insights we know which room types are available more or less in the different neighbourhood groups.



**Density of Rooms**

In [None]:
# visualization code
plt.figure(figsize=(10,6))
# x and y are passed as keywords, which recent seaborn versions require
ax_5 = sns.scatterplot(data=df, x='longitude', y='latitude', hue='neighbourhood_group')
ax_5.set_title('Density of rooms')
ax_5.set_ylabel('latitude')
ax_5.set_xlabel('longitude')
plt.show()

Why did you pick the specific chart?

A scatter plot of longitude against latitude places every listing at its location, so it works as a simple map of the listings.


What is/are the insight(s) found from the chart?

Plotting latitude against longitude shows that Brooklyn and Manhattan have the densest concentration of listings, followed by Queens.


**Cheap, Affordable and Expensive Properties**

In [None]:
# visualization code
def price_catagory(price):
  """Label a price as 'Cheap' (<= 80), 'Affordable' (above 80, up to 500) or 'Expensive' (> 500)."""
  if price<=80:
    return 'Cheap'
  elif price<=500:
    return 'Affordable'
  else:
    return 'Expensive'


plt.figure(figsize=(10,5))
ax_7 = sns.countplot(x=df['price'].apply(price_catagory))
ax_7.set_title('Number of listings in each price category')
ax_7.set_xlabel('Price category')
ax_7.set_ylabel("Count")
plt.show()


Why did you pick the specific chart?

A count plot shows the number of observations in each category of a categorical variable.



What is/are the insight(s) found from the chart?

We divided the whole price range into three categories:

A. Cheap (price of $80 or less)

B. Affordable (price above $80 and up to $500)

C. Expensive (price above $500)

Most listings fall in the "Affordable" category, ahead of the cheap and the expensive ones.
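
The three price bands above can also be assigned with `pd.cut`, which bins a whole column in one call. This is a minimal sketch on invented prices (the series `prices` is hypothetical); it assumes every price is above 0, which holds after the cleaning step, and uses right-closed bins so that a price of exactly 80 counts as cheap and exactly 500 as affordable.

```python
import numpy as np
import pandas as pd

# Invented prices chosen to sit on and around the band edges
prices = pd.Series([10, 80, 81, 500, 501])

# Right-closed bins: (0, 80], (80, 500], (500, inf)
category = pd.cut(
    prices,
    bins=[0, 80, 500, np.inf],
    labels=["Cheap", "Affordable", "Expensive"],
)
print(category.value_counts(sort=False))
```

Because the result is an ordered categorical, a count plot of it keeps the bands in the order Cheap, Affordable, Expensive.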



**Minimum number of nights distribution**

In [None]:
# visualization code
df[['minimum_nights']].value_counts()

In [None]:
df[['minimum_nights']].describe()

In [None]:
# histplot with a density curve replaces the deprecated distplot
ax = sns.histplot(df.minimum_nights, kde=True, stat='density')
plt.title('Minimum no. of nights distribution')
plt.show()

Why did you pick the specific chart?

A distribution plot (a histogram with a density curve) shows how the values of a variable are distributed.



What is/are the insight(s) found from the chart?

1. Most listings have a minimum stay of fewer than 100 nights.
2. The long right tail suggests that there may be outliers. A box plot can be used to check this.
3. A log scale can show the shape of the skewed data.

Note: How do you handle skewed data in Python?
One way of handling right-skewed data such as this is to carry out a logarithmic transformation. For example, np.log(x) will log transform the variable x in Python. There are other options as well, such as the Box-Cox and square root transformations.
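
The effect of the log transformation mentioned in the note can be measured with the skewness before and after. The sketch below uses invented values (the series `nights` is hypothetical) and `np.log1p`, a close relative of `np.log` that is chosen here as an assumption because it also handles zeros.

```python
import numpy as np
import pandas as pd

# Invented minimum_nights values with a long right tail
nights = pd.Series([1, 1, 1, 2, 2, 3, 5, 30, 365])

raw_skew = nights.skew()
log_skew = np.log1p(nights).skew()  # log1p(x) = log(1 + x), also safe for zeros
print(round(raw_skew, 2), round(log_skew, 2))
```

A smaller skewness after the transformation means the long right tail has been pulled in, which makes the shape of the distribution easier to see in a plot.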

Observations from statistics:

1. The average minimum stay is around 7 nights.

2. The smallest minimum stay is 1 night.

3. The largest minimum stay is 365 nights, because larger values were capped at 365 during cleaning.


**Chart for each Neighbourhood group**

In [None]:
#  visualization code
price_df = pd.DataFrame(df['price'].apply(price_catagory))
price_df.head()

In [None]:
plt.style.use('fivethirtyeight')

# Keep listings priced under $700 so that extreme outliers do not flatten the boxes
price_under_700 = df[df.price <700]
plt.figure(figsize=(10,6))
plt.title("Price for each neighbourhood_group")
sns.boxplot(y= 'price',x= 'neighbourhood_group',data=price_under_700)
plt.show()


Why did you pick the specific chart?

Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups.

What is/are the insight(s) found from the chart?

1. We can see that Manhattan is the most expensive destination, immediately followed by Brooklyn.
2. Queens, Staten Island and Bronx have lower price ranges compared to the other two.

 **Solution to Business Objective**

From our EDA, we have the following suggestions for the stakeholders:

1) Increase foothold in Staten Island and Bronx.

2) Try to get data to understand what makes users of Manhattan prefer entire home/apt over private rooms, unlike other neighbourhood groups.

3) Incentivize users to review listings.

4) Investigate and if needed incentivize hosting shared rooms.

5) Run marketing campaigns to create awareness about the affordability and availability of shared rooms.

6) More data should be collected and analysed to understand what is driving the reviews and whether they are positive or not.

 **Conclusion**

1) Manhattan and the places nearby are densely populated with listings. It is the most popular neighbourhood group.

2) Manhattan and Brooklyn comprise almost 85% of the listings of New York on AirBnb, whereas Bronx and Staten Island lack presence and account for only about 3% combined.

3) The most common types of rooms are apartments or homes fully available whereas the least common type is the shared room.

4) More than 75% of the private and shared rooms cost less than $100, whereas the price at the 75th percentile is double that for entire homes and apartments.

5) There are more listings of entire home/apt in Manhattan than private rooms, a trend uncommon among the other neighbourhood groups.

6) Generally, shared and private rooms are reviewed more frequently than the entire home/apt room type.

7) Shared rooms are more available (90 days) whereas entire homes or apartments and private rooms have almost similar availability (42-45 days).

8) Booking an entire home/apartment needs not only more money but a commitment of more nights as well.

9) There is no visible relationship between the price and the number of reviews of a listing.

10) Manhattan is so expensive that getting a private room there is as expensive as getting an entire home/apt in Staten Island and Bronx.

11) The price of getting a private room is almost the same for all the neighbourhood groups except Manhattan.

12) Price does not decide whether a listing will be popular among the users or not.

13) Private rooms and entire homes/apts are available for fewer than 50 days (median) in Brooklyn and Manhattan.

14) Manhattan has users writing reviews for the shared and private rooms more than entire rooms, a trend not common with the other neighbourhood groups.