### **Project Name**    - Airbnb Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**            - Diwakar Kumar

# **Project Summary -**

We examined the variables, analysed 'price' together with the variables most correlated with it, dealt with missing data and outliers, tested some fundamental statistical assumptions, and transformed categorical variables into dummy variables. Python made all of this work considerably easier.

1. Importing libraries
2. Loading the dataset
3. Data cleaning:
   * Deleting redundant columns
   * Dropping duplicates
   * Cleaning individual columns
   * Removing NaN values from the dataset
   * Some transformations
4. Data visualization: using plots to find relations between the features
   * Correlation between different variables
   * Plot of all neighbourhood groups
   * Neighbourhood
   * Room type
   * Relation between neighbourhood group and availability of rooms
   * Map of neighbourhood groups
   * Map of neighbourhoods
   * Map of room availability
   * Map of price
   * Availability of rooms
   * Top 10 neighbourhoods by number of listings in NYC
   * Neighbourhood groups and reviews
   * Room types and their relation with availability in different neighbourhood groups
   * How monthly reviews vary with room type in each neighbourhood group

# **GitHub Link -**

https://github.com/Diwakar201kumar

# **Problem Statement**


Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and to find a more unique, personalized way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognized around the world. Data analysis on the millions of listings provided through Airbnb is a crucial factor for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers' and providers' (hosts') behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services and much more.

This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values.

#### **Define Your Business Objective?**

Airbnb’s goal for guests is to enrich travel by providing accommodations with character. Any ordinary property holder can list their home on Airbnb’s platform, creating a massive database of distinct home rentals. Instead of living in cookie-cutter hotel rooms, you can enjoy a space with personality and character. Stepping out the front door introduces you to an entire community, immersing you in a new environment.

This perspective informs every aspect of Airbnb’s marketing strategy:

**Ad creative:** Airbnb’s campaigns emphasize all the ways rentals are homes, not just accommodations. The “Don’t Go There, Live There” campaign is an excellent example highlighting how guests interact with their surroundings.

**Guidebooks:** Guidebooks are resources that hosts can attach to listings, detailing restaurants, parks, and other local attractions that may interest guests.

**Neighborhood guides:** Airbnb curates locations from guidebooks to create lists of attractions that any user can browse.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below and answers these questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# First Mount Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Now Load Dataset
df = pd.read_csv("/content/drive/MyDrive/Airbnb booking analysis/Airbnb NYC 2019.csv")
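
Since the guidelines above mention exception handling, here is a minimal sketch of a loader that fails with a clear message when the file is missing. The function name `load_listings` is hypothetical and not part of the original notebook; the path in the commented usage line is the one from the cell above.

```python
from pathlib import Path

import pandas as pd


def load_listings(csv_path):
    """Read the listings CSV, raising a clear error if the file is missing."""
    path = Path(csv_path)
    if not path.is_file():
        # Fail early with the full path so a wrong Drive mount is easy to spot.
        raise FileNotFoundError(f"Dataset not found at: {path}")
    return pd.read_csv(path)


# Hypothetical usage with the path from the cell above:
# df = load_listings("/content/drive/MyDrive/Airbnb booking analysis/Airbnb NYC 2019.csv")
```

Checking the path before reading keeps the error message short and points directly at the most likely cause, which is a Drive that is not mounted or a renamed folder.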

### Dataset First View

In [None]:
# Dataset First Look
df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape                          # The total number of rows and columns is 48895 and 16 respectively.

### Dataset Information

In [None]:
# Dataset Info
df.info()                       # basic information about the dataset

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_data = df.duplicated()
print(duplicate_data.sum())
df[duplicate_data]
#looks like there's no duplicate data present!

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()
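
The raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to build such a summary; `missing_summary` is a hypothetical helper, and the toy frame only stands in for the real dataset (its values are made up).

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for the real dataset (values are made up).
sample = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
    "price": [100, 150, 80, 60],
})
print(missing_summary(sample))
```

On the real data the same call would be `missing_summary(df)`.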

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), yticklabels=False)

In [None]:
# Handling missing values
# host_name and name are not important for this analysis, so we fill the gaps in both columns with placeholder values.
# Assigning the result back avoids the chained inplace fill, which newer pandas versions warn about.
df['name'] = df['name'].fillna('unknown')
df['host_name'] = df['host_name'].fillna('no_name')

df[['host_name','name']].isnull().values.any()         # False means no null values remain in these columns

In [None]:
# The 'last_review' column also has many null values. It matters less for this analysis than number_of_reviews and reviews_per_month, so we drop it.
df = df.drop(['last_review'], axis=1)

In [None]:
# Replacing the NaN values in reviews_per_month with zero
df.fillna({'reviews_per_month':0},inplace=True)
df.isnull().sum()
# All null values are now eliminated
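
The cleaning steps in the cells above can also be collected into a single function, which makes the notebook easier to re-run from the top. This is only a sketch: `clean_listings` is a hypothetical name, and the toy rows are made up to show the effect.

```python
import numpy as np
import pandas as pd


def clean_listings(frame):
    """Apply the three missing-value steps and return a new DataFrame."""
    cleaned = frame.copy()
    # Text columns: substitute placeholder labels.
    cleaned["name"] = cleaned["name"].fillna("unknown")
    cleaned["host_name"] = cleaned["host_name"].fillna("no_name")
    # Drop the date column; errors="ignore" makes the function safe to re-run.
    cleaned = cleaned.drop(columns=["last_review"], errors="ignore")
    # A listing with no reviews has a review rate of zero.
    cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)
    return cleaned


# Toy rows standing in for the real dataset (values are made up).
sample = pd.DataFrame({
    "name": ["Cozy loft", None],
    "host_name": [None, "Ana"],
    "last_review": ["2019-05-01", None],
    "reviews_per_month": [0.8, np.nan],
})
cleaned_sample = clean_listings(sample)
print(cleaned_sample)
```

Working on a copy leaves the original frame untouched, so the raw data stays available for comparison.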

In [None]:
# Heatmap of null values after dropping last_review and filling reviews_per_month
sns.heatmap(df.isnull(),yticklabels=False)

### What did you know about your dataset?

By basic inspection, I found that a given property name usually belongs to one host, but a single host_name can have multiple properties in an area.

So host_name is a categorical variable here. neighbourhood_group, neighbourhood and room_type also fall into this category.

id, latitude, longitude, price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count and availability_365 are numerical variables. last_review holds dates rather than numbers and was dropped during cleaning.
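
The split between numerical and categorical columns can also be read off the dtypes rather than by eye. A minimal sketch, using a made-up four-column frame in place of the real dataset:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
sample = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan"],
    "room_type": ["Private room", "Entire home/apt"],
    "price": [149, 225],
    "latitude": [40.64749, 40.75362],
})

# Numeric dtypes (int, float) versus everything else (text columns here).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("text/categorical:", text_cols)
```

Note that a numeric dtype does not always mean a quantity: identifier columns such as id are stored as integers but act as labels.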

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()          # getting the overall summary statistics for all numerical columns

### Variables Description 

Ok, so we can see our dataset has 48895 rows and 16 columns. Let's try to understand the columns we've got here.

id : a unique id identifying an Airbnb listing

name : name representing the accommodation

host_id : a unique id identifying an airbnb host

host_name : name under whom host is registered

neighbourhood_group : a group of area

neighbourhood : area falls under neighbourhood_group

latitude : coordinate of listing

longitude : coordinate of listing

room_type : type to categorize listing rooms

price : price of listing

minimum_nights : the minimum nights required to stay in a single visit

number_of_reviews : total count of reviews given by visitors

last_review : date of last review given

reviews_per_month : rate of reviews given per month

calculated_host_listings_count : total no of listing registered under the host

availability_365 : the number of days in a year for which a listing is available.

latitude and longitude together represent a coordinate; neighbourhood_group, neighbourhood and room_type are categorical columns.

last_review is a date column; it was dropped during cleaning because number_of_reviews and reviews_per_month carry the review information we need.

Four columns contain null values: name and host_name (the listing name and host name don't really matter to us for now), and last_review and reviews_per_month (if a listing has never received a review, these are legitimately empty). We fill name and host_name with placeholder text, drop last_review, and fill reviews_per_month with 0.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# All property ids look different, so each listing is distinct.
df['id'].unique()

In [None]:
df['host_id'].unique()        # unique host ids

In [None]:
df['host_name'].unique()      # unique host names

In [None]:
df['name'].unique()           # unique listing names

In [None]:
df['neighbourhood'].unique()        # looks like this can be a categorical variable too; let's check

In [None]:
df['neighbourhood_group'].unique()        # looks like this can be a categorical variable too; let's check

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Top 5 listing names on Airbnb in NYC

# Name both columns explicitly, because the default column names from
# value_counts().reset_index() differ between pandas versions.
listings_count_df = (
    df['name'].value_counts().head(5)
    .rename_axis('Listings on Airbnb')
    .reset_index(name='Total_listings')
)
listings_count_df                           # the most frequent listing names in NYC

# Hillside Hotel is the most frequent listing name in NYC, followed by Home away from Home.

In [None]:
no_of_hosts = df['host_id'].nunique()           # host_id is unique per host; host_name is not
print(f'The no of hosts in NYC: {no_of_hosts}')
no_of_listing_names = df['name'].nunique()
print(f'The no of distinct listing names in NYC: {no_of_listing_names}')

In [None]:
df['name'].value_counts()
# An interesting observation: it looks like a few listings have no particular host name.
# Also, a few listings with the same name have different hosts in different neighbourhoods of a neighbourhood_group.

In [None]:
busiest_hosts = df.groupby(['host_name','host_id','room_type'])['number_of_reviews'].max().reset_index()
busiest_hosts = busiest_hosts.sort_values(by='number_of_reviews', ascending=False).head(10)
busiest_hosts

### What all manipulations have you done and insights you found?

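Answer :- Hillside Hotel is the most frequent listing name in NYC. Missing name and host_name values were filled with placeholders, last_review was dropped, and missing reviews_per_month values were set to 0.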


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Correlation Heatmap

In [None]:
df.info()

In [None]:
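# Correlation Heatmap visualization code
plt.figure(figsize=(10, 6))
# Correlate numeric columns only; text columns such as name or room_type cannot be correlated
heatmap = sns.heatmap(df.select_dtypes(include='number').corr(), linewidths=0, vmin=-1, annot=True, cmap="YlGnBu")
plt.show()                           # Display the correlation between the numeric variables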

##### 1. Why did you pick the specific chart?

Answer :- A correlation heatmap compares all the numerical columns of our dataset with each other in a single view.


##### 2. What is/are the insight(s) found from the chart?

**Answer :-** The heatmap shows the correlation between the different features that can affect an Airbnb listing.

There is some correlation between host_id and both reviews_per_month and availability_365. There is also a noticeable correlation between minimum_nights and both calculated_host_listings_count and availability_365. Price shows some correlation with availability_365 and calculated_host_listings_count.

number_of_reviews and reviews_per_month give almost the same information, so we can carry out the analysis with either of the two variables. number_of_reviews is also correlated with availability_365.
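
Reading the strongest relationships off a heatmap by eye is error-prone, so it can help to rank the column pairs by absolute correlation. The sketch below is one way to do that; `top_correlations` is a hypothetical helper and the sample values are made up.

```python
from itertools import combinations

import pandas as pd


def top_correlations(frame, n=3):
    """List the n column pairs with the largest absolute correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Visit each unordered pair of columns exactly once.
    pairs = pd.Series(
        {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    )
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)


# Toy frame standing in for the real dataset (values are made up).
sample = pd.DataFrame({
    "number_of_reviews": [0, 5, 10, 20, 40],
    "reviews_per_month": [0.0, 0.5, 1.0, 2.0, 4.0],
    "price": [300, 80, 150, 60, 200],
})
print(top_correlations(sample, n=1))
```

On the real data, `top_correlations(df, n=5)` would list the five strongest pairs with their signs preserved.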

#### Chart - 2

In [None]:
# Chart - 2 visualization code
fig = plt.figure(figsize = (20, 5))
top_10_neighbourhoods = df['neighbourhood'].value_counts()[:10]   # top 10 neighbourhoods by number of listings in NYC
top_10_neighbourhoods.plot(kind='bar',color='pink')
# Naming X & Y axis
plt.xlabel('neighbourhood')
plt.ylabel('counts in entire NYC')
plt.title('Top neighbourhoods in entire NYC on the basis of count of listings')

##### 1. Why did you pick the specific chart?

Answer :- A bar chart makes it easy to compare listing counts across neighbourhoods.

##### 2. What is/are the insight(s) found from the chart?

Answer :- The chart shows the 10 neighbourhoods with the most listings in NYC.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer :- Yes. Neighbourhoods with more listings help create a positive business impact, as they increase the overall count of listings.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
top_10_hosts=df['host_name'].value_counts()[:10] #top 10 hosts on the basis of no of listings in entire NYC!
top_10_hosts

In [None]:
# top 10 hosts on the basis of no of listings in entire NYC!
top_10_hosts.plot(kind='bar',color='skyblue',figsize = (20, 5))
plt.xlabel('top10_hosts')
plt.ylabel('total_NYC_listings')
plt.title('top 10 hosts on the basis of no of listings in entire NYC!')

##### 1. Why did you pick the specific chart?

Answer :- A bar chart makes it easy to compare hosts by their number of listings.

##### 2. What is/are the insight(s) found from the chart?


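Answer :- Michael, David and Sonder (NYC) are the top three host names by number of listings in NYC. Note that host_name is not a unique identifier, so the counts for common first names such as Michael and David may combine several different hosts.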

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer :- Yes, the gained insight helps create a positive business impact. It does not lead to negative growth.
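
Because different hosts can share a first name, counting by host_name can merge several people into one bar. A small sketch of the difference, using made-up rows in which two hosts are both called "Michael":

```python
import pandas as pd

# Toy rows: two different hosts share the first name "Michael".
sample = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 3],
    "host_name": ["Michael", "Michael", "Michael", "Sonder", "Sonder", "Sonder"],
})

# Counting by name merges host 1 and host 2 into a single "Michael".
by_name = sample["host_name"].value_counts()

# Counting by (host_id, host_name) keeps each host separate.
by_host = (
    sample.groupby(["host_id", "host_name"])
    .size()
    .sort_values(ascending=False)
)
print(by_name)
print(by_host)
```

Grouping on host_id together with host_name keeps the readable label while counting each host separately.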

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# checking the relationship between numerical variables!
# price vs minimum_nights

var='minimum_nights'

data=pd.concat([df['price'],df[var]],axis=1)
data.plot.scatter(x=var,y='price',ylim=(0,12000),figsize = (15, 5))

##### 1. Why did you pick the specific chart?

Answer :- A scatter plot makes it easy to check how price varies with minimum nights and to spot listings priced at 0.

##### 2. What is/are the insight(s) found from the chart?

**Answer** :- Many data points are clustered near the 0 price range, and a few listings have a minimum-night requirement but a price of 0. This looks like an anomaly in price.



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer :- From the above analysis we can say that most people prefer to stay in places where the price is lower. This could lead to negative growth.
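
One way to quantify the price anomaly is to count zero-priced listings and flag extreme prices with the interquartile-range rule. The sketch below uses a made-up price series; `iqr_bounds` is a hypothetical helper, and the multiplier 1.5 is the conventional choice rather than something taken from this notebook.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the interquartile-range rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


# Made-up prices: one free listing and one extreme value.
prices = pd.Series([0, 60, 80, 100, 120, 150, 175, 10000], name="price")

zero_priced = int((prices == 0).sum())
low, high = iqr_bounds(prices)
outliers = prices[(prices < low) | (prices > high)]
print("zero-priced listings:", zero_priced)
print("fences:", low, high)
print("flagged:", outliers.tolist())
```

On the real data the same steps would run on `df['price']`. A price of 0 usually falls inside the fences, which is why it is counted separately here.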

#### Chart - 5

In [None]:
# Chart - 5 visualization code
areas_reviews = df.groupby(['neighbourhood_group'])['number_of_reviews'].max().reset_index()
areas_reviews

In [None]:
# Plot all Neighbourhood Group and Review
area = areas_reviews['neighbourhood_group']
review = areas_reviews['number_of_reviews']

fig = plt.figure(figsize = (10, 5))
 
# creating the bar plot
plt.bar(area, review, color ='violet',
        width = 0.4)
 
plt.xlabel("area")
plt.ylabel("review")
plt.title("Area vs Number of reviews")
plt.show()

##### 1. Why did you pick the specific chart?

Answer :- A bar chart makes it easy to check the relationship between neighbourhood group and number of reviews.

##### 2. What is/are the insight(s) found from the chart?

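Answer :- The bar chart shows that the most-reviewed single listings are in Queens and Manhattan. Because the chart plots the maximum number_of_reviews per neighbourhood group, it reflects the top listing in each group rather than the group's total.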

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer :- Neighbourhood groups with highly reviewed listings create a positive business impact.
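
Since the chart is built from the maximum per group, it may be worth comparing it with the sum and mean, which describe the whole group rather than one listing. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Made-up rows: Queens has one very popular listing, Manhattan has more reviews overall.
sample = pd.DataFrame({
    "neighbourhood_group": ["Queens", "Queens", "Manhattan", "Manhattan", "Manhattan"],
    "number_of_reviews": [600, 2, 300, 250, 200],
})

review_stats = (
    sample.groupby("neighbourhood_group")["number_of_reviews"]
    .agg(["max", "sum", "mean"])
)
print(review_stats)
```

In this toy example the group with the highest maximum is not the group with the highest total, which shows how the choice of aggregate can change the ranking.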

#### Chart - 6

In [None]:
# Chart - 6 - Total count of each room type as per listing.

explode = (0.1,0,0)
dt = df['room_type'].value_counts()
dt.plot(kind='pie',figsize=(6,6),title='Most frequent room type',fontsize=10,explode=explode,startangle=90,autopct='%1.1f%%',colors=['yellow','skyblue','red'],shadow=True)
plt.title("Pie Chart of Room Type",fontweight='bold',pad=10)
plt.ylabel("")
plt.show()

##### 1. Why did you pick the specific chart?

Answer :- A pie chart makes it easy to see the share of listings taken by each room type.

##### 2. What is/are the insight(s) found from the chart?

Answer :- Entire home/apt makes up more than 50% of listings in New York City and also has the highest average price. Shared rooms are the cheapest but account for only 2.4% of listings. No wonder New York is known for a high standard of living.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer :- People do not seem to prefer shared rooms, which could lead to negative growth for that room type.
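
The pie chart gives the shares but not the prices, so the average price per room type needs a separate check. A minimal sketch with made-up rows standing in for the real dataset:

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up).
sample = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Shared room"],
    "price": [200, 180, 90, 50],
})

# Share of listings per room type, in percent.
share = (sample["room_type"].value_counts(normalize=True) * 100).round(1)

# Average price per room type.
mean_price = sample.groupby("room_type")["price"].mean()
print(share)
print(mean_price)
```

On the real data the same two lines would run on `df` and back up both halves of the insight.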

#### Chart - 7

In [None]:
# Chart - 7 visualization code
fig = plt.figure(figsize=(24, 6))
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False)
sns.lineplot(data=df, x='neighbourhood', y='price', hue='room_type')
plt.title('Distribution of Prices in neighbourhoods')
sns.despine(fig)


##### 1. Why did you pick the specific chart?

Answer :- A line plot makes it easy to check the price range across neighbourhoods, since it combines three variables (neighbourhood, price and room type) in one chart.

##### 2. What is/are the insight(s) found from the chart?

Answer :- Clearly, the Entire home/apt room type maintains a higher price range in almost all neighbourhoods.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer :- It creates a positive business impact.

#### Chart - 8

In [None]:
# Which hosts are the busiest ?
busiest_hosts = df.groupby(['host_name','host_id','room_type'])['number_of_reviews'].max().reset_index()
busiest_hosts = busiest_hosts.sort_values(by='number_of_reviews', ascending=False).head(10)
busiest_hosts

In [None]:
# Which hosts are the busiest ?
name = busiest_hosts['host_name']
reviews = busiest_hosts['number_of_reviews']

fig = plt.figure(figsize = (10, 5))
 
# creating the bar plot
plt.bar(name, reviews, color ='purple',
        width = 0.4)
 
plt.xlabel("Name of the Host")
plt.ylabel("Number of Reviews")
plt.title("Busiest Hosts")
plt.show()

##### 1. Why did you pick the specific chart?

Answer :- A bar chart makes it easy to find the busiest hosts.

##### 2. What is/are the insight(s) found from the chart?

Answer :- The busiest hosts are:

* Dona
* Ji
* Maya
* Carol
* Danielle

These hosts list Entire home/apt and Private room types, which are preferred by most people.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer :- Hosts play an important role in collecting reviews, which creates a positive business impact.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Map of price
f, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(data=df[df['price'] < 300], x='longitude', y='latitude', hue='price', size="price", sizes=(20, 60), palette='GnBu_d')
ax.set_title('Variation of Price based on Location ($0 - 300)')

##### 1. Why did you pick the specific chart?

Answer :- A scatter plot of latitude against longitude shows where each listing is located.

##### 2. What is/are the insight(s) found from the chart?

*Answer* :- In this plot we consider only listings priced below USD 300, because the 75th percentile of price is about USD 175. The plot shows how prices vary across the city. The south of Manhattan and the north of Brooklyn are among the expensive areas of New York.
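
A cutoff like the one used for this map can be checked against the price percentiles and the share of listings it keeps. The sketch below uses a made-up price series, so the printed numbers are illustrative only.

```python
import pandas as pd

# Made-up prices standing in for df['price'].
prices = pd.Series([50, 70, 90, 110, 130, 150, 175, 250, 400, 9000], name="price")

# Median, 75th and 95th percentiles of price.
percentiles = prices.quantile([0.5, 0.75, 0.95])

# Percentage of listings that survive the "price < 300" filter.
share_under_300 = (prices < 300).mean() * 100
print(percentiles)
print(f"{share_under_300:.1f}% of listings are priced under 300")
```

If the filter keeps the large majority of listings, the map still represents the typical market while a few extreme prices no longer stretch the colour scale.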

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Map of Neighbourhood group
plt.figure(figsize=(10,6))
sns.scatterplot(x=df.longitude, y=df.latitude, hue=df.neighbourhood_group)   # pass x and y by keyword, as recent seaborn versions require
plt.show()       # render the figure

##### 1. Why did you pick the specific chart?

Answer :- A scatter plot of latitude against longitude shows where each listing is located.

##### 2. What is/are the insight(s) found from the chart?

Answer :- This plot shows the neighbourhood groups of New York City that the listings in our dataset belong to.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Let's look at how monthly reviews vary with room type in each neighbourhood group
f,ax = plt.subplots(figsize=(10,8))

# dodge=True separates the strips for each hue level (neighbourhood group) along the categorical axis.
# Using the categorical Color Brewer palette 'Set2'
ax= sns.stripplot(x='room_type',y='reviews_per_month',hue='neighbourhood_group',dodge=True,data=df,palette='Set2')
ax.set_title('Most Reviewed room_types in each Neighbourhood Groups')

##### 1. Why did you pick the specific chart?

Answer :- A strip plot draws a scatter plot in which one variable is categorical. It can be drawn on its own, and it is also a good complement to a box plot or violin plot when you want to show all observations along with some representation of the underlying distribution.

##### 2. What is/are the insight(s) found from the chart?

Answer :- Private rooms received the most reviews per month, and the highest value, more than 50 reviews per month, belongs to a private room in Manhattan.

Manhattan & Queens got the most reviews for the Entire home/apt room type.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*Answer* :- Shared rooms received fewer reviews than the other room types, with the fewest in Staten Island followed by the Bronx. This could lead to negative growth.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Let's observe the type of rooms as well as the Map of Room Type.
sns.set(rc={"figure.figsize": (10, 8)})
ax= sns.scatterplot(x=df.longitude, y=df.latitude,hue=df.room_type,palette='muted')
ax.set_title('Distribution of type of rooms across NYC')

##### 1. Why did you pick the specific chart?

Answer :- A scatter plot of latitude against longitude shows where each listing is located.

##### 2. What is/are the insight(s) found from the chart?


Answer :- From the scatter plot of latitude vs longitude we can infer that there are very few shared rooms throughout NYC compared with Private room and Entire home/apt listings.

More than 95% of the listings on Airbnb are either Private room or Entire home/apt. Very few listings are shared rooms.

#### Chart - 13

In [None]:
# Chart - 13- Let's look at the listings availability in a year throughout NYC
# Map of Availability of Room
plt.figure(figsize=(10,6))
sns.scatterplot(x=df.longitude, y=df.latitude, hue=df.availability_365)   # pass x and y by keyword, as recent seaborn versions require
plt.show()               # render the figure

##### 1. Why did you pick the specific chart?

Answer :- A scatter plot of latitude against longitude shows where each listing is located.

##### 2. What is/are the insight(s) found from the chart?

Answer :- The Bronx and Staten Island have listings that are mostly available throughout the year. This may be because they are less expensive than the other boroughs, such as Manhattan, Brooklyn and Queens.

#### Chart - 14

In [None]:
# visualization code
# Neighbourhood groups and their relation with room availability
f,ax = plt.subplots(figsize=(15,8))
ax=sns.boxplot(x='neighbourhood_group',y='availability_365',data=df,palette="bright")

# Naming the Chart
plt.title("Neighbourhood Group vs. Room Availabilty")

# Naming X & Y axis
plt.xlabel('Neighbourhood groups')
plt.ylabel('Availability(365)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer :- We picked a categorical box plot because it makes the distributions easy to compare across groups.

##### 2. What is/are the insight(s) found from the chart?

Answer :- Looking at the above categorical box plot, we can infer that listings in Staten Island tend to be more available throughout the year, with some available for more than 300 days. A typical Staten Island listing is available for around 210 days a year, followed by the Bronx, where a typical listing is available for around 150 days a year.
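
The centre line of a box plot is the median rather than the mean, so the figures read off the chart are best confirmed numerically. A minimal sketch with made-up availability values:

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
sample = pd.DataFrame({
    "neighbourhood_group": ["Staten Island"] * 3 + ["Bronx"] * 3,
    "availability_365": [100, 220, 365, 0, 150, 330],
})

availability_stats = (
    sample.groupby("neighbourhood_group")["availability_365"]
    .agg(["median", "mean"])
    .sort_values("median", ascending=False)
)
print(availability_stats)
```

Running the same aggregation on `df` would give the median and mean availability for every neighbourhood group side by side.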

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

Answer :- Airbnb’s goal for guests is to enrich travel by providing accommodations with character. Any ordinary property holder can list their home on Airbnb’s platform, creating a massive database of distinct home rentals. Instead of living in cookie-cutter hotel rooms, you can enjoy a space with personality and character. Stepping out the front door introduces you to an entire community, immersing you in a new environment.

This perspective informs every aspect of Airbnb’s marketing strategy:

Ad creative: Airbnb’s campaigns emphasize all the ways rentals are homes, not just accommodations. The “Don’t Go There, Live There” campaign is an excellent example highlighting how guests interact with their surroundings.

Guidebooks: Guidebooks are resources that hosts can attach to listings, detailing restaurants, parks, and other local attractions that may interest guests.

Neighborhood guides: Airbnb curates locations from guidebooks to create lists of attractions that any user can browse.

# **Conclusion**

So, this Airbnb dataset is rich in data but not in features. From the analysis above we can conclude that:

1. People who prefer to stay in an entire home or apartment tend to stay a bit longer in that particular neighbourhood.
2. People who prefer to stay in a private room tend not to stay as long as those in an entire home or apartment.
3. Most people prefer to pay a lower price.
4. A larger number of reviews for a particular neighbourhood group suggests that the place is a tourist destination.
5. People who stay no more than one night are likely to be travellers.
6. 'Entire home/apt' is the most listed room type at 52% of listings, and 'Shared room' is the least listed at only 2.4%.

Most visitors don't prefer shared rooms; they tend to choose a private room or an entire home.

Manhattan and Brooklyn are the two most distinguished, expensive and posh areas of NYC.

Although the location of a property has a strong influence on its price, a property in a popular location will not necessarily stay occupied most of the time.

Performing a regression on this dataset may result in a high error rate, because the features in this dataset are poor predictors of a property's value. We can see this in the correlation heatmap.

We would need more features, such as bedrooms, bathrooms, property age (likely a very important one), the tax rate applicable to the land, extra room amenities, and distance to the nearest hospital, stores or schools. These features might have a strong relationship with price.

We could use time series analysis to predict the occupancy rate at a particular time of the month or season.

It would be better if we had the average guest rating of each property. That would help in understanding the property and could also be a factor in deciding price (a low-rated property tends to lower its price).

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***