<a href="https://colab.research.google.com/github/HarshR09/AirBnb-data-Analysis-/blob/main/EDA_AirBnb_Booking_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - AirBnb_Bookings_Analysis



##### **Project Type**    - EDA_Capstone_Project
##### **Contribution**    - Individual


# **Project Summary -**

**AirBnb (Air Bed And Breakfast) is an online marketplace for booking homestays. Founded in 2008, it has since expanded travel possibilities by offering a more unique, personalized way of experiencing a stay during a holiday or business trip. Today, Airbnb services are available all over the world. Data analysis on the millions of listings provided through Airbnb is a crucial factor for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding of customers' and providers' (hosts) behavior and performance on the platform, guiding marketing initiatives, implementation of innovative additional services and much more**.

**This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values**.

**Explore and analyze the data to discover key understandings (not limited to these) such as** :



*   What can we learn about different AirBnb locations and Hosts?
*   What can we learn from data analysis? (ex: locations, prices, reviews, etc)
*   Which hosts are the busiest and why?
*   Is there any noticeable difference of traffic among different areas and what could be the reason for it?

Explore and analyze the data to discover important factors that govern the listings.






# **GitHub Link -**

https://github.com/HarshR09/AirBnb-data-Analysis-

### **Import Libraries**

In [None]:
# Importing the NumPy, Pandas, Matplotlib and Seaborn libraries (necessary for exploring and visualising the data)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### **Dataset Loading**

In [None]:
# Load Dataset

from google.colab import drive
drive.mount('/content/drive')

file_path = "/content/drive/MyDrive/Data Set/Capstone_Harshvardhan_Rajput_Airbnb NYC 2019.csv"
Abnb_df = pd.read_csv(file_path)

### **Know your data**

### Dataset First View

In [None]:
# Dataset First Look
Abnb_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
Abnb_df.shape

### Dataset Information

In [None]:
# Dataset Info
Abnb_df.info()

## ***Understanding Your Variables***

In [None]:
# Dataset Columns
Abnb_df.columns

In [None]:
# Dataset Describe
Abnb_df.describe()

**Inference**:
 

*   The average room price in New York is about 152 USD.
*   75% of rooms are priced at 175 USD or less.

*   The minimum price is 0, which is invalid data, so rows where price is 0 need to be handled.
*   The maximum value of minimum_nights is 1250, which is definitely an outlier.
*   The maximum price is 10000, which is an outlier.
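
The checks behind these inferences can be sketched as below. This is a minimal illustration on a small made-up table (`toy_df` and its values are assumptions, not the real dataset); the same calls could be applied to `Abnb_df`.

```python
import pandas as pd

# Toy stand-in for Abnb_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "price": [0, 60, 80, 100, 150, 175, 200, 10000],
    "minimum_nights": [1, 2, 3, 1, 30, 2, 1250, 5],
})

# A price of 0 cannot be a real nightly rate, so count such rows.
zero_price_rows = int((toy_df["price"] == 0).sum())

# Upper quantiles show how far the maximum sits from the bulk of the data.
price_quantiles = toy_df["price"].quantile([0.5, 0.75, 0.99])

# Keep only rows with a positive price for any price-based analysis.
priced_df = toy_df[toy_df["price"] > 0]

print("Rows with price 0:", zero_price_rows)
print(price_quantiles)
print("Rows kept:", len(priced_df))
```

Whether zero-price rows should be dropped or imputed is a judgment call; the sketch only shows how to find and exclude them.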








# **Variables Description** 

a) id: Unique serial number of each listing.

b) name: Name given to each accommodation.

c) host_id: Unique serial number given to each host.

d) host_name: Name of every host.

e) neighbourhood_group: Various districts within New York City.

f) neighbourhood: Various towns within each neighbourhood_group.

g) latitude and longitude: Geographic coordinates that specify the position of a particular location.

h) room_type: Type of space offered (for example entire home/apt, private room or shared room).

i) price: Cost of the room per night.

j) minimum_nights: Minimum number of nights guests have to stay in that accommodation.

k) number_of_reviews: Number of reviews given by guests.

l) last_review: Date of last review.

m) reviews_per_month: Average number of reviews received per month.

n) calculated_host_listings_count: Number of accommodations listed by the host.

o) availability_365: Number of days the room is available in a year.

### **Check Unique Values for each variable**.

In [None]:
# Check Unique Values for host_names.
Abnb_df['host_name'].nunique()


In [None]:
 #Check Unique Values for room type.
Abnb_df['room_type'].nunique()


In [None]:
Abnb_df['room_type'].unique()

In [None]:
 #Check Unique Values for neighbourhood group.
Abnb_df['neighbourhood_group'].nunique()

In [None]:
Abnb_df['neighbourhood_group'].unique()

In [None]:
#Check Unique Values for neighbourhood.
Abnb_df['neighbourhood'].nunique()

##  ***Data Wrangling***

## **Duplicate Values**

In [None]:
# Dataset Duplicate Value Count, then drop any exact duplicate rows
print("Duplicate rows:", Abnb_df.duplicated().sum())
Abnb_df.drop_duplicates(inplace = True)


## **Missing Values/Null Values**

In [None]:
# info() summarises each column: its non-null count and its dtype.
# verbose=True forces the full per-column summary even for wide DataFrames.

Abnb_df.info(verbose = True)

In [None]:
# Missing Values/Null Values Count
Abnb_df.isna().sum()

In [None]:
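# Missing data visualisation

# Percentage of missing values per column, with explicit column names for plotting
missing = (Abnb_df.isnull().sum() * 100 / Abnb_df.shape[0]).reset_index()
missing.columns = ['column', 'percent_missing']
plt.figure(figsize=(10,5))
# Recent seaborn versions require x and y to be passed as keyword arguments
ax = sns.pointplot(x='column', y='percent_missing', data=missing)
plt.xticks(rotation =90,fontsize =7)
plt.title("Percentage of Missing values")
plt.ylabel("PERCENTAGE")
plt.show()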

## **Insights**

last_review and reviews_per_month have the largest number of missing values.
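
The same percentages can also be read as a sorted table rather than from a plot. The sketch below uses a made-up four-row table (`toy_df` is an assumption, not the real data) and relies on `isna().mean()`, which gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Abnb_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "host_name": ["Ann", None, "Bob", "Cy"],
    "last_review": ["2019-05-01", None, None, "2019-06-10"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
    "price": [100, 80, 120, 60],
})

# The mean of a boolean mask is the fraction of True values,
# so isna().mean() * 100 is the percentage missing per column.
missing_pct = (toy_df.isna().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```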

In [None]:
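# Box plot of each remaining numeric variable (id, host_id, latitude and longitude are used as group keys, so they are left out)
# numeric_only=True is needed on recent pandas versions, which no longer silently skip text columns
Abnb_df.groupby(['id', 'host_id','latitude','longitude']).mean(numeric_only=True).plot(kind="box", figsize = [16,10])
plt.title('Distribution by variable')
plt.xlabel('Variable')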

## **Insights**

Data points lying more than 1.5 × IQR below the first quartile or above the third quartile (where IQR is the inter-quartile range, the 75th percentile minus the 25th percentile) are considered outliers.
Here, price has the most outliers.
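
The 1.5 × IQR rule described above can be sketched as a small helper. The function name `iqr_outlier_mask` and the sample prices are hypothetical, chosen only to illustrate the rule that box plots use for drawing outlier points.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (values < lower) | (values > upper)

# Invented prices: seven ordinary values and one extreme value.
toy_prices = pd.Series([50, 60, 70, 80, 90, 100, 110, 1000])
mask = iqr_outlier_mask(toy_prices)

print("Outliers:", toy_prices[mask].tolist())
print("Share of outliers: {:.1%}".format(mask.mean()))
```

On the real data, something like `iqr_outlier_mask(Abnb_df['price']).sum()` would give the number of price outliers under the same rule.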


**Missing Data – Initial Intuition**

Here, last_review and reviews_per_month have the most missing values.

In [None]:
# Dropping rows with a missing host name. Since host_name has only 21 missing values, this will not affect the overall data
Abnb_df.dropna(axis = 0 , subset = ['host_name'], inplace = True)



In [None]:
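# Replacing missing values in the reviews_per_month column with 0
# (assigning the result back avoids chained in-place modification, which recent pandas versions warn about)
Abnb_df['reviews_per_month'] = Abnb_df['reviews_per_month'].fillna(0)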

In [None]:
# Removing columns which are not important for this analysis (i.e. columns which do not contribute to it)

new_Abnb_df = Abnb_df.drop(['id', 'host_id', 'name', 'last_review'], axis = 1)

# The remaining columns are stored in a new table, new_Abnb_df

### **Data Vizualization**

**Relation between Neighbourhood groups and Neighbourhood and other variables**







In [None]:
# Bar Plot
# Number of listings distributed according to Neighbourhood groups.
neighbourhood_group_listings = new_Abnb_df.groupby('neighbourhood_group')['calculated_host_listings_count'].count()
plt.rcParams['figure.figsize'] = (15, 5)
neighbourhood_group_listings.plot(kind='bar')

plt.title('Number of listings distributed according to Neighbourhood groups')
plt.ylabel('Listing count')
plt.xlabel('Neighbourhood groups')

**Insights**

Above graph gives us insight into AirBnb listings across different neighbourhood groups.


*   Manhattan has the highest number of AirBnb listings, followed by Brooklyn
*   Staten Island has the lowest number of AirBnb listings



In [None]:
# Since Manhattan and Brooklyn have the maximum number of listings,
# we can also analyse the listing distribution among different neighbourhoods in Manhattan and Brooklyn.
max_listings_df = new_Abnb_df[['neighbourhood_group','neighbourhood']]
manhattan_df= max_listings_df.groupby(['neighbourhood_group'])['neighbourhood'].value_counts().loc['Manhattan'].nlargest(10)




In [None]:
#Bar graph for top 10 listings in different Neighbourhoods in Manhattan District.
ax = manhattan_df.plot.bar(figsize = (10,5),fontsize = 14,color = 'c')

#Set the title
ax.set_title("Top 10 neighbourhood listings in Manhattan", fontsize = 20)

# Set x and y-labels
ax.set_xlabel("Neighbourhood", fontsize = 18)
ax.set_ylabel("Listing count ", fontsize = 18)

**Insights**

The graph above shows that Harlem has the most listings in Manhattan, followed by the Upper West Side. This suggests Harlem is a major destination for tourists and business travellers, although listing counts alone measure supply rather than visits.

In [None]:
# Do the same for Brooklyn, the neighbourhood_group with the second highest number of listings.
brooklyn_df= max_listings_df.groupby(['neighbourhood_group'])['neighbourhood'].value_counts().loc['Brooklyn'].nlargest(10)

#Bar graph for top 10 listings in different Neighbourhood in Brooklyn District.
ax = brooklyn_df.plot.bar(figsize = (10,5),fontsize = 14,color = 'limegreen')

#Set the title
ax.set_title("Top 10 neighbourhood listings in Brooklyn", fontsize = 20)

# Set x and y-labels
ax.set_xlabel("Neighbourhood", fontsize = 18)
ax.set_ylabel("Listing count ", fontsize = 18)

**Insights**

The graph above shows that Williamsburg has the most AirBnb listings in Brooklyn, followed by Bedford-Stuyvesant.

In [None]:
#Average price of AirBnb in different neighbourhood_group
neighbourgod_group_by_price = new_Abnb_df.groupby('neighbourhood_group')['price'].mean().sort_values(ascending =False)

plt.figure(figsize=[16,8])
neighbourgod_group_by_price.plot(kind= 'bar',title = 'Average prices in each neighbourhood group', xlabel ='Neighbourhood group' ,ylabel= 'Average price', color = 'lightcoral')


## **Insights**

The location-wise price distribution above shows that Manhattan is expensive compared to the other neighbourhood groups in New York.

Bronx has the lowest-priced rooms (almost half the Manhattan average).

In [None]:
# Average price of AirBnb in each neighbourhood_group with respect to Room type.
neighbourgod_group_by_price_room_type = new_Abnb_df.groupby(['neighbourhood_group','room_type'])['price'].mean().sort_values(ascending =False).unstack()


neighbourgod_group_by_price_room_type.plot(kind= 'bar',title = 'Average prices in each neighbourhood group by room type', xlabel ='Neighbourhood group' ,ylabel= 'Average price',figsize=[16,8])


## **Insights**

According to the chart above, Private room and Entire home/apt appear to be the most sought-after room types in each neighbourhood group, although the chart shows average prices rather than demand.

The largest numbers of these preferred rooms are in Manhattan, followed by Brooklyn. Queens also has a sizeable number of rooms, followed by Bronx and Staten Island.


## **Data Vizualization**
**Relation between hosts and other variables**

In [None]:
# Which host names have the maximum number of listings in New York (note: host_name is not unique, different hosts can share a name)
df2 = new_Abnb_df[['host_name','calculated_host_listings_count']]
df_host_listing_count= df2.groupby(['host_name'])['calculated_host_listings_count'].count().sort_values(ascending= False).nlargest(10)

plt.figure(figsize=[15,10])
df_host_listing_count.plot(kind= 'bar', xlabel = 'Host name', ylabel= 'Number of listings', color = ['firebrick', 'pink', 'blue', 'yellow', 'red', 
                                                                              'purple', 'seagreen', 'skyblue', 'magenta', 'tomato'])


## **Insights**

*   The chart above shows the 10 host names with the maximum number of AirBnb listings in New York.

*   Michael, David, and Sonder have the most listed AirBnbs among all host names.
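
One caveat with the chart above is that it groups by `host_name`, and common first names such as Michael or David may belong to many different hosts. A possible cross-check, sketched here on made-up rows (`toy_df` is an assumption), is to group by `host_id` as well. On the real data this would have to use `Abnb_df`, since `host_id` was dropped from `new_Abnb_df`.

```python
import pandas as pd

# Toy stand-in: two different hosts are both called "Michael".
toy_df = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 3, 4],
    "host_name": ["Michael", "Michael", "Michael",
                  "Sonder (NYC)", "Sonder (NYC)", "Sonder (NYC)", "David"],
})

# Counting by name merges every host who shares that name.
by_name = toy_df.groupby("host_name").size().sort_values(ascending=False)

# Counting by (host_id, host_name) keeps individual hosts apart.
by_host = (
    toy_df.groupby(["host_id", "host_name"])
    .size()
    .sort_values(ascending=False)
)

print(by_name)
print(by_host)
```

In this toy table "Michael" ties for first place by name, yet no single Michael has more than two listings.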






In [None]:
# Top 10 hosts who have higher potential for income from all AirBnb listed under their names
df3 = new_Abnb_df[['host_name','price','reviews_per_month','number_of_reviews']]
host_name_by_price_df= df3.groupby(['host_name'])['price'].sum().nlargest(10)
plt.figure(figsize=[15,10])
host_name_by_price_df.plot(kind= 'bar', title = 'Top 10 hosts by potential income (sum of nightly prices)',ylabel= 'Potential income in USD')

## **Insights**

Above graph shows top AirBnB hosts who can generate high income from all listings under their names.

In [None]:
# Which neighbourhood group and room type are preferred by the top three hosts

#--------------------------------------------------------

# Neighbourhood_group preferred by host Sonder

Sonder_df = new_Abnb_df[new_Abnb_df['host_name']=='Sonder (NYC)']
host1_df = Sonder_df['neighbourhood_group'].value_counts()
ax = host1_df.plot.pie(legend = True ,autopct='%1.1f%%',figsize=(10,5))

#Room type preferred by host Sonder
host2_df = Sonder_df['room_type'].value_counts()
plt.figure(figsize=[10,5])
host2_df.plot(kind= 'bar',xlabel ='Preferred Room type' ,ylabel= 'Number of listings',color = ['cornflowerblue','orange','green'])

In [None]:
# Which neighbourhood group is preferred by top hosts
# Neighbourhood_group preferred by host Michael

Michael_df = new_Abnb_df[new_Abnb_df['host_name']=='Michael']
host3_df = Michael_df['neighbourhood_group'].value_counts()
ax = host3_df.plot.pie(legend = True ,autopct='%1.1f%%',figsize=(10,5))

#Room type preferred by host Michael
host4_df = Michael_df['room_type'].value_counts()
plt.figure(figsize=[10,5])
host4_df.plot(kind= 'bar',xlabel ='Preferred Room type' ,ylabel= 'Number of listings',color = ['cornflowerblue','orange','green'])

In [None]:
# Which neighbourhood group is preferred by top hosts
# Neighbourhood_group preferred by host David

David_df = new_Abnb_df[new_Abnb_df['host_name']=='David']
host5_df = David_df['neighbourhood_group'].value_counts()
ax = host5_df.plot.pie( legend = True ,autopct='%1.1f%%',figsize=(10,5))

#Room type preferred by host David
host6_df = David_df['room_type'].value_counts()
plt.figure(figsize=[10,5])
host6_df.plot(kind= 'bar',xlabel ='Preferred Room type' ,ylabel= 'Number of listings',color = ['cornflowerblue','orange','green'])

### **Insights**


*   The three pie charts above indicate that Manhattan and Brooklyn are the most preferred districts for doing business for AirBnb hosts in New York.

*   The bar charts indicate that Entire home/apt is the most preferred room type among top hosts.

*   Private room is the second most preferred room type








In [None]:
# Find out which 10 hosts have the highest reviews_per_month on any single listing.

review_df = df3.groupby('host_name')['reviews_per_month'].max().sort_values(ascending=False).head(10)

plt.figure(figsize=[16,8])
review_df.plot(kind= 'bar',title = 'Most reviewed hosts', xlabel ='Host name' ,ylabel= 'Reviews per month')

## **Insights**

The graph above shows the top 10 hosts who received the most reviews in a month
*   Row NYC received the maximum number of reviews per month, followed by Lauann
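
Note that the ranking above takes the maximum `reviews_per_month` over a host's listings, which is a peak monthly rate rather than a total review count. The sketch below, on invented data with hypothetical host names, shows how the two measures can rank hosts differently.

```python
import pandas as pd

# Toy stand-in for df3; host names and numbers are invented for illustration.
toy_df = pd.DataFrame({
    "host_name": ["Host A", "Host A", "Host B", "Host B", "Host B"],
    "reviews_per_month": [50.0, 1.0, 10.0, 12.0, 11.0],
    "number_of_reviews": [100, 5, 300, 250, 200],
})

# Named aggregation: one column per measure, one row per host.
per_host = toy_df.groupby("host_name").agg(
    peak_monthly_rate=("reviews_per_month", "max"),
    total_reviews=("number_of_reviews", "sum"),
)

print(per_host.sort_values("total_reviews", ascending=False))
```

Which measure is the better proxy for how busy a host is depends on the question being asked.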




### **Data Vizualization**

**Relation between price and other variables**

In [None]:
plt.figure(figsize=(12,8))
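# distplot is deprecated in recent seaborn versions; histplot with kde=True draws the histogram plus a density curve
sns.histplot(new_Abnb_df['price'], kde=True)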

# **Insights**

As we can see, the price distribution is right-skewed (it has a long right tail): most prices cluster at the low end, while a few outliers stretch far to the right of the mean.

The graph above also shows that the prices of most rooms are between 0 and $1000
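
The skew can also be put into a number. The sketch below uses invented prices (an assumption, not the real column) to compare the sample skewness before and after a log transform. `np.log1p` computes log(1 + x), so it stays defined even if a price of 0 is present.

```python
import numpy as np
import pandas as pd

# Invented prices with a long right tail.
toy_prices = pd.Series([40, 60, 80, 100, 120, 150, 200, 400, 2000, 10000])

# Series.skew() returns the sample skewness; positive means a right tail.
raw_skew = toy_prices.skew()

# A log transform pulls the extreme values closer to the rest.
log_skew = np.log1p(toy_prices).skew()

print("Skewness of raw prices:", round(raw_skew, 2))
print("Skewness of log prices:", round(log_skew, 2))
```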



In [None]:
# Distribution of log(price), excluding rows where price is 0 (log of 0 is undefined)
plt.figure(figsize=(12,7))
sns.histplot(np.log(new_Abnb_df[~(new_Abnb_df['price']==0)]['price']), kde=True)

In [None]:
# Find the price distribution by categorising prices as A, B or C
# A = below 100
# B = 100 up to (but not including) 200
# C = 200 and above

new_Abnb_df['price_distribution'] = new_Abnb_df['price'].apply(lambda x : 'A' if x<100 else ('B' if 100<= x < 200 else 'C'))
new_Abnb_df['price_distribution'].value_counts()

From the above information we can conclude that:

21877 rooms have a price below 100

17233 rooms have a price from 100 up to (but not including) 200

9785 rooms have a price of 200 or more

Almost 80% of rooms are priced below 200, and about 20% are priced at 200 or more.
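
As an alternative to the nested lambda, the same three bands can be produced with `pd.cut`. This is a sketch on invented prices (an assumption); `right=False` makes each bin include its lower edge, which matches the `x<100` and `100<= x < 200` conditions used above.

```python
import numpy as np
import pandas as pd

# Invented prices chosen to sit on and around the band edges.
toy_prices = pd.Series([0, 50, 99, 100, 150, 199, 200, 500])

# Bins are [-inf, 100), [100, 200), [200, inf) because right=False.
bands = pd.cut(
    toy_prices,
    bins=[-np.inf, 100, 200, np.inf],
    labels=["A", "B", "C"],
    right=False,
)

# normalize=True returns shares instead of counts.
share_pct = bands.value_counts(normalize=True).sort_index() * 100

print(bands.tolist())
print(share_pct)
```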

In [None]:
price_df = new_Abnb_df['price_distribution'].value_counts()

ax = price_df.plot.pie(legend = True ,autopct='%1.1f%%',figsize=(12,8))

## **Insights**

The chart above shows that the largest share of rooms is priced below 100 USD, and only about 20% of rooms are priced at 200 USD or more.

In [None]:
#Average price of AirBnb listings based on room type 
avgprice_room_type = new_Abnb_df.groupby('room_type')['price'].mean().reset_index().sort_values('price',ascending=False)

# Barplot to see different room type by price
plt.figure(figsize=[12,8])

ax = sns.barplot(x='price', y='room_type', data=avgprice_room_type)
plt.title('Average price of Airbnb based on room type',size=18)
ax.set_xlabel('price',size =18);

## **Insights**

*   The most expensive room type is entire home/apt, followed by private room and shared room.
*   Shared room is the cheapest room type.



In [None]:
# Find out the most expensive neighbourhoods in the top two neighbourhood groups, Manhattan and Brooklyn

# Top 10 neighbourhoods in Manhattan by average price

expensive_manhattan_neighbourhood_groups = pd.DataFrame(new_Abnb_df.groupby(['neighbourhood_group', 'neighbourhood'])['price'].mean().loc['Manhattan'].nlargest(10))

expensive_manhattan_neighbourhood_groups.plot(kind= 'bar',title = 'Top 10 neighbourhoods in Manhattan by average price', xlabel ='Neighbourhood' ,ylabel= 'Average price', color = 'mediumslateblue',figsize=[16,8])



In [None]:
# Top 10 neighbourhoods in Brooklyn by average price

expensive_brooklyn_neighbourhood_groups = pd.DataFrame(new_Abnb_df.groupby(['neighbourhood_group', 'neighbourhood'])['price'].mean().loc['Brooklyn'].nlargest(10))

expensive_brooklyn_neighbourhood_groups.plot(kind= 'bar',title = 'Top 10 neighbourhoods in Brooklyn by average price', xlabel ='Neighbourhood' ,ylabel= 'Average price', color = 'mediumslateblue',figsize=[16,8])


## **Insights**

Tribeca and Sea Gate are the most expensive AirBnb neighbourhoods in Manhattan and Brooklyn respectively.
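
A ranking by mean price can be dominated by neighbourhoods with only a handful of listings, where one expensive listing moves the average a lot. The sketch below, on invented data, reports the listing count next to the mean and median and filters on a minimum count. The threshold `MIN_LISTINGS` and the neighbourhood names are hypothetical.

```python
import pandas as pd

# Toy stand-in: "Tiny" has only two, very expensive, listings.
toy_df = pd.DataFrame({
    "neighbourhood": ["Tiny", "Tiny", "Big", "Big", "Big", "Big",
                      "Mid", "Mid", "Mid"],
    "price": [900, 1100, 200, 220, 180, 200, 150, 170, 160],
})

# Mean, median and number of listings per neighbourhood.
stats = toy_df.groupby("neighbourhood")["price"].agg(["mean", "median", "count"])

MIN_LISTINGS = 3  # hypothetical threshold, to be tuned on the real data

# Rank only neighbourhoods that have enough listings to trust the mean.
reliable = stats[stats["count"] >= MIN_LISTINGS].sort_values("mean", ascending=False)

print(stats)
print(reliable)
```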

## **Plotting AirBnb listings Map of New York City**

**Install geopandas using pip**






In [None]:
!pip install geopandas

In [None]:
import geopandas as gpd
from shapely.geometry import Point, Polygon

In [None]:
#read .shp file
new_york_map = gpd.read_file('/content/drive/MyDrive/New York coordinates/geo_export_d8028617-3a91-4a13-b8b5-1e20acf3d6fc.shp')

The .shp file for the NYC map can be found at https://data.cityofnewyork.us/City-

**Creating GeoPandas DataFrame**

In [None]:
#zip x and y coordinates into single feature
location = [Point(xy) for xy in zip(Abnb_df['longitude'], Abnb_df['latitude'])]
# create GeoPandas dataframe
geo_df = gpd.GeoDataFrame(Abnb_df,
geometry = location)

In [None]:
geo_df.head()

In [None]:
#create figure and axes, assign to subplot
fig, ax = plt.subplots(figsize=(15,15))
# add .shp mapfile to axes
new_york_map.plot(ax=ax, alpha=0.4,color='grey')
# add geodataframe to axes
# colour each point by its calculated_host_listings_count value

# make datapoints transparent using alpha
# assign size of points using markersize
geo_df.plot(column='calculated_host_listings_count',ax=ax,alpha=0.5, legend=True,markersize=10)
# add title to graph
plt.title('AirBnb Listings in NYC', fontsize=15,fontweight='bold')
# set latitiude and longitude boundaries for map display
plt.xlim(-74.2658,-73.6896)
plt.ylim( 40.4858,40.952)
# show map
plt.show()

# **Conclusion**



*  **Manhattan and Brooklyn** appear to be the most visited places in New York City, judging by listing counts. Manhattan is densely populated and a business center due to the presence of stock exchanges. It has many globally recognised tourist attractions, so people from around the globe flock to it. It is also the **most expensive** of the neighbourhood groups.


*  Brooklyn is both a residential and an industrial hotspot. Many people from around the country and the globe come here in search of employment. It is the **second most expensive** neighbourhood group.
 


*  The most listings are in the **Williamsburg** and **Harlem** neighbourhoods, in the Brooklyn and Manhattan neighbourhood groups respectively.



*  Manhattan and Brooklyn are more expensive, and urban facilities are more developed there than in other neighbourhood groups. Hence there are more Airbnb listings in these two neighbourhood groups.



*  **Sonder (NYC), Blueground, Michael and David** are the top 4 host names by number of listings. Most of their listings are in Manhattan, Brooklyn and Queens.


*   Almost **80%** of listings are priced below **200** USD, so most people are likely to stay in rooms in that range.



*   The average price of all listings is about **$150**.

*   Prices vary widely based on property and **room types**.

*   Sea Gate and Tribeca are the **most expensive** neighborhoods.



*   The majority of listings are rented in their **entirety**, although private room is a close second. Room type is likely an important factor when people choose where to stay.


*   Stays in less expensive rooms tend to be longer than stays in more expensive rooms.


*  AirBnb business competition is very high in the top two neighbourhood groups.

*   From a business perspective, we can study social, cultural, business and tourist locations in these neighbourhoods so that we can find other locations with similar conditions in which to set up new AirBnb listings.



