<a href="https://colab.research.google.com/github/Manass20/Airbnb-Booking-Analisys/blob/main/Airbnb_Bookings_Analysis_Capstone_Project_Manas_Ranjan_Behera.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognized worldwide. Analyzing the data from the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers' and hosts' behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more. </b>

## <b>This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. </b>

## <b> Explore and analyze the data to discover key insights (not limited to these) such as:
* What can we learn about different hosts and areas?
* What can we learn from predictions? (ex: locations, prices, reviews, etc)
* Which hosts are the busiest and why?
* Is there any noticeable difference of traffic among different areas and what could be the reason for it? </b>

# **Collecting and loading data**

For this project, we use Google Colab, a web-based IDE, and write our script in Python. An IDE (Integrated Development Environment) is a software application used for software development.
The Airbnb data we use is publicly shared on the internet under a Creative Commons License. Before we can load the data into our IDE, we first need to import the external libraries/modules needed for visualization and analysis.

**a. Load python libraries**

*   pandas and NumPy are used for data analysis

*   Matplotlib and Seaborn are used for data visualization

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
%matplotlib inline


In [2]:
#Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Read the Airbnb CSV file
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/AlmaBetter/PROJECTS/EDA/Airbnb Bookings Analysis/Airbnb NYC 2019.csv')

In [5]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
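
An optional refinement that is not part of the original notebook: `last_review` holds dates, and `pd.read_csv` loads such a column as plain strings unless told otherwise. The sketch below shows the `parse_dates` option on a small inline sample. The `sample_csv` text is an illustrative stand-in built from three rows of the output above, trimmed to a few columns, and is not the real file.

```python
import io
import pandas as pd

# Inline stand-in for the real CSV: three rows from the head() output above,
# trimmed to a handful of columns for brevity.
sample_csv = io.StringIO(
    "id,host_id,number_of_reviews,last_review,reviews_per_month\n"
    "2539,2787,9,2018-10-19,0.21\n"
    "2595,2845,45,2019-05-21,0.38\n"
    "3647,4632,0,,\n"
)

# parse_dates converts the listed columns to datetimes; empty cells become NaT.
sample_df = pd.read_csv(sample_csv, parse_dates=["last_review"])

print(sample_df.dtypes)
print(sample_df["last_review"].max())
```

With a datetime column, operations such as finding the most recent review or grouping by month work without further conversion.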


In [6]:
#Checking the data shape
df.shape

(48895, 16)

In [7]:
#Checking the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

In [8]:
# Summary statistics of the data
df.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,19017140.0,67620010.0,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,10983110.0,78610970.0,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,2539.0,2438.0,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,9471945.0,7822033.0,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,19677280.0,30793820.0,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,29152180.0,107434400.0,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
max,36487240.0,274321300.0,40.91306,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0


# **Experimenting on the Dataset**

In [9]:
#Checking For Duplicates
df[df.duplicated()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365


**In the following experiments, listings with 0 in number_of_reviews have no values (NaN) in last_review and reviews_per_month in the same row. We can conclude that these listings may be newly listed or have not had any occupancy so far.**

**In the second cell below, after removing the listings with 0 reviews, there are no null values left in last_review and reviews_per_month.**

After separating the data, I found some possible reasons why these 10052 listings have 0 reviews:

*   25% of these listings have an availability of 0 days, which suggests they were listed but the hosts are not interested in renting out their property. 50% of these listings have an availability of 6 days or fewer throughout the year.
*   About 25% of these listings are priced at $200 or above. Because of the high cost, these listings may have had no occupancy.
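
The two checks described above can be sketched as follows. This is a minimal illustration on a made-up five-row frame: `toy` and all of its values are hypothetical, and only the column names come from the dataset. On the real data, `df` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "number_of_reviews": [0, 0, 3, 0, 12],
    "last_review": [None, None, "2019-05-21", None, "2019-07-05"],
    "reviews_per_month": [np.nan, np.nan, 0.4, np.nan, 2.1],
    "price": [250, 80, 120, 300, 95],
    "availability_365": [0, 6, 200, 0, 365],
})

no_reviews = toy["number_of_reviews"] == 0
review_cols = ["last_review", "reviews_per_month"]

# Check 1: the review columns are missing exactly on the zero-review rows.
missing_any = toy[review_cols].isna().any(axis=1)
missing_all = toy[review_cols].isna().all(axis=1)
nan_matches_zero = bool(
    (missing_any == no_reviews).all() and (missing_all == no_reviews).all()
)
print(nan_matches_zero)

# Check 2: quartiles of availability and price for the zero-review rows.
quartiles = toy.loc[no_reviews, ["availability_365", "price"]].quantile([0.25, 0.5, 0.75])
print(quartiles)
```

Comparing the missing-value mask with the zero-review mask tests the claim in both directions, which a glance at `info()` counts alone does not.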




In [10]:
# Listings with zero reviews
zero_reviewded_listings=df[df['number_of_reviews']==0]
zero_reviewded_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10052 entries, 2 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              10052 non-null  int64  
 1   name                            10042 non-null  object 
 2   host_id                         10052 non-null  int64  
 3   host_name                       10047 non-null  object 
 4   neighbourhood_group             10052 non-null  object 
 5   neighbourhood                   10052 non-null  object 
 6   latitude                        10052 non-null  float64
 7   longitude                       10052 non-null  float64
 8   room_type                       10052 non-null  object 
 9   price                           10052 non-null  int64  
 10  minimum_nights                  10052 non-null  int64  
 11  number_of_reviews               10052 non-null  int64  
 12  last_review                     

In [11]:
# Listings with at least one review
non_zero_reviewded_listings = df[df['number_of_reviews'] != 0]
non_zero_reviewded_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38843 entries, 0 to 48852
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              38843 non-null  int64  
 1   name                            38837 non-null  object 
 2   host_id                         38843 non-null  int64  
 3   host_name                       38827 non-null  object 
 4   neighbourhood_group             38843 non-null  object 
 5   neighbourhood                   38843 non-null  object 
 6   latitude                        38843 non-null  float64
 7   longitude                       38843 non-null  float64
 8   room_type                       38843 non-null  object 
 9   price                           38843 non-null  int64  
 10  minimum_nights                  38843 non-null  int64  
 11  number_of_reviews               38843 non-null  int64  
 12  last_review                     

In [12]:
# Summary statistics for listings with zero reviews
zero_reviewded_listings.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
id,10052.0,,,,22574844.585555,11355626.19826,3647.0,12084042.0,23377571.0,34020916.5,36487245.0
name,10042.0,9884.0,Hillside Hotel,11.0,,,,,,,
host_id,10052.0,,,,80684372.294867,87125927.334585,4632.0,12075188.0,39795381.5,133000062.5,274321313.0
host_name,10047.0,3816.0,Blueground,204.0,,,,,,,
neighbourhood_group,10052.0,5.0,Manhattan,5029.0,,,,,,,
neighbourhood,10052.0,193.0,Williamsburg,757.0,,,,,,,
latitude,10052.0,,,,40.732099,0.052598,40.49979,40.69757,40.72887,40.763643,40.91169
longitude,10052.0,,,,-73.956117,0.043796,-74.24285,-73.984758,-73.960175,-73.939877,-73.7169
room_type,10052.0,3.0,Entire home/apt,5077.0,,,,,,,
price,10052.0,,,,192.919021,358.653017,0.0,70.0,120.0,200.0,10000.0


In [13]:
# Summary statistics for listings with at least one review
non_zero_reviewded_listings.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
id,38843.0,,,,18096462.187756,10693697.398115,2539.0,8720027.0,18871455.0,27554820.5,36455809.0
name,38837.0,38269.0,Home away from home,12.0,,,,,,,
host_id,38843.0,,,,64239145.026337,75888474.582013,2438.0,7033823.5,28371926.0,101846465.5,273841667.0
host_name,38827.0,9886.0,Michael,335.0,,,,,,,
neighbourhood_group,38843.0,5.0,Manhattan,16632.0,,,,,,,
neighbourhood,38843.0,218.0,Williamsburg,3163.0,,,,,,,
latitude,38843.0,,,,40.728134,0.05499,40.50641,40.68864,40.72171,40.76299,40.91306
longitude,38843.0,,,,-73.951148,0.046695,-74.24442,-73.98247,-73.9548,-73.93502,-73.71299
room_type,38843.0,3.0,Entire home/apt,20332.0,,,,,,,
price,38843.0,,,,142.317947,196.945624,0.0,69.0,101.0,170.0,10000.0


From the four experiments below, we can conclude that there are more unique **host id**s than **host names**. Several people can share the same name but cannot share the same id, so repeated names are not duplicates; they may belong to different people.

In these experiments I filtered for the name "John" only, which returned 294 listings spread over 188 different host ids.

From these, I took the first John in the data (host id 2787). There are 6 listings under that host id, and 'calculated_host_listings_count' also shows that this host has listed 6 properties.
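
One way to see which names are shared across accounts, and to confirm that `calculated_host_listings_count` agrees with the actual number of rows per `host_id`, is sketched below. The `toy` frame and its values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy data: two different accounts are both named "John".
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "host_id": [10, 10, 20, 30, 30],
    "host_name": ["John", "John", "John", "Laura", "Laura"],
    "calculated_host_listings_count": [2, 2, 1, 2, 2],
})

# Number of distinct accounts behind each host name.
ids_per_name = toy.groupby("host_name")["host_id"].nunique()
shared_names = ids_per_name[ids_per_name > 1]
print(shared_names)

# Rows per host_id, mapped back onto every row, compared with the stored count.
rows_per_host = toy["host_id"].map(toy["host_id"].value_counts())
count_is_consistent = bool(
    (rows_per_host == toy["calculated_host_listings_count"]).all()
)
print(count_is_consistent)
```

Because names repeat across accounts, any per-host aggregation is safer when keyed on `host_id` rather than `host_name`.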

In [14]:
#Checking number unique host_id
df['host_id'].nunique()

37457

In [15]:
#Checking number unique host_name
df['host_name'].nunique()

11452

In [16]:
# Number of listings and of unique host_ids for the host name John
jondf=df[df['host_name']=='John']
len(jondf),jondf['host_id'].nunique()

(294, 188)

In [17]:
# Listings of host_id 2787 and their calculated host listings count
a = jondf[jondf['host_id'] == 2787]
a[['id','name','host_id','host_name','calculated_host_listings_count']]

Unnamed: 0,id,name,host_id,host_name,calculated_host_listings_count
0,2539,Clean & quiet apt home by the park,2787,John,6
10372,7937553,Riomaggiore Room. Queen Bedroom in Bklyn Townh...,2787,John,6
13583,10160215,Torre del Lago Room.,2787,John,6
13688,10267242,Cinque Terre Room. Clean and Quiet Queen Bedroom,2787,John,6
13963,10593675,"La Spezia room. Clean, quiet and comfortable bed",2787,John,6
21556,17263207,Brooklyn home. Comfort and clean. Liguria room.,2787,John,6


**From the outcomes below, we can conclude that neighbourhood_group combines groups of areas; it contains only 5 main areas across the whole dataset.**

**neighbourhood, in contrast, appears to hold the sub-areas within each neighbourhood_group. In total, 221 neighbourhoods are listed, so we will mostly work with neighbourhood_group for the clearest results.**
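
Whether every neighbourhood really sits inside a single neighbourhood_group can be checked directly rather than assumed. The sketch below is a minimal illustration on a small `areas` frame built from the five rows shown in the `df.head()` output; on the full data, `df` would take its place and the result may differ.

```python
import pandas as pd

# Five (group, neighbourhood) pairs taken from the df.head() output.
areas = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Brooklyn", "Manhattan"],
    "neighbourhood": ["Kensington", "Midtown", "Harlem", "Clinton Hill", "East Harlem"],
})

# How many groups does each neighbourhood appear in? A strict hierarchy means exactly one.
groups_per_neighbourhood = areas.groupby("neighbourhood")["neighbourhood_group"].nunique()
is_hierarchy = bool((groups_per_neighbourhood == 1).all())
print(is_hierarchy)

# Number of distinct neighbourhoods inside each group.
neighbourhoods_per_group = areas.groupby("neighbourhood_group")["neighbourhood"].nunique()
print(neighbourhoods_per_group)
```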

In [19]:
#Checking number and name of unique neighbourhood_group
number = df["neighbourhood_group"].nunique()
name = df["neighbourhood_group"].unique()
print(name)
print(number)

['Brooklyn' 'Manhattan' 'Queens' 'Staten Island' 'Bronx']
5


In [20]:
#Checking number and name  of unique neighbourhood
number = df["neighbourhood"].nunique()
name = df["neighbourhood"].unique()
print(number)
print(name)

221
['Kensington' 'Midtown' 'Harlem' 'Clinton Hill' 'East Harlem'
 'Murray Hill' 'Bedford-Stuyvesant' "Hell's Kitchen" 'Upper West Side'
 'Chinatown' 'South Slope' 'West Village' 'Williamsburg' 'Fort Greene'
 'Chelsea' 'Crown Heights' 'Park Slope' 'Windsor Terrace' 'Inwood'
 'East Village' 'Greenpoint' 'Bushwick' 'Flatbush' 'Lower East Side'
 'Prospect-Lefferts Gardens' 'Long Island City' 'Kips Bay' 'SoHo'
 'Upper East Side' 'Prospect Heights' 'Washington Heights' 'Woodside'
 'Brooklyn Heights' 'Carroll Gardens' 'Gowanus' 'Flatlands' 'Cobble Hill'
 'Flushing' 'Boerum Hill' 'Sunnyside' 'DUMBO' 'St. George' 'Highbridge'
 'Financial District' 'Ridgewood' 'Morningside Heights' 'Jamaica'
 'Middle Village' 'NoHo' 'Ditmars Steinway' 'Flatiron District'
 'Roosevelt Island' 'Greenwich Village' 'Little Italy' 'East Flatbush'
 'Tompkinsville' 'Astoria' 'Clason Point' 'Eastchester' 'Kingsbridge'
 'Two Bridges' 'Queens Village' 'Rockaway Beach' 'Forest Hills' 'Nolita'
 'Woodlawn' 'University Height

**The following cell finds the listing with the most reviews in the whole dataset. Interestingly, it is located in Queens, not in Brooklyn or Manhattan.**


In [21]:
#Maximum reviews
max_reviews=df[df['number_of_reviews']==df['number_of_reviews'].max()]
max_reviews.number_of_reviews.values


array([629])
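
The cell above prints only the review count. To see where such a listing is located, the full row can be pulled with `idxmax`. This is a minimal sketch on hypothetical toy data: all values in `toy` are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are made up.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "neighbourhood_group": ["Brooklyn", "Queens", "Manhattan"],
    "number_of_reviews": [40, 75, 12],
})

# idxmax returns the index label of the first row holding the maximum value.
top_label = toy["number_of_reviews"].idxmax()
top_listing = toy.loc[top_label, ["id", "neighbourhood_group", "number_of_reviews"]]
print(top_listing)
```

Note that `idxmax` returns a single row even when several listings tie for the maximum, whereas the boolean filter used above keeps all of them.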

**The following cells find the listing with the most reviews per month in the whole dataset.**

In [22]:
# Maximum reviews per month
max_reviews_permonth=df[df['reviews_per_month']==df['reviews_per_month'].max()]
max_reviews_permonth.reviews_per_month.values

array([58.5])

In [23]:
# host_id of the listing with the maximum reviews per month
max_reviews_permonth['host_id'].values

array([244361589])

In [24]:
#Testing of host_id 244361589
testing=df[df['host_id']==244361589]
testing

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
42074,32678718,Luxury accommodation minutes from Central Park!,244361589,Row NYC,Manhattan,Theater District,40.75781,-73.98903,Private room,499,1,0,,,9,293
42075,32678719,Enjoy great views of the City in our Deluxe Room!,244361589,Row NYC,Manhattan,Theater District,40.75918,-73.98801,Private room,100,1,156,2019-07-07,58.5,9,299
42076,32678720,Great Room in the heart of Times Square!,244361589,Row NYC,Manhattan,Theater District,40.75828,-73.98876,Private room,199,1,82,2019-07-07,27.95,9,299
42077,32678721,Nice Room 1 block away from Times Square action!,244361589,Row NYC,Manhattan,Theater District,40.75783,-73.98908,Private room,100,1,38,2019-07-04,14.62,9,295
42078,32678723,Spacious room in the Heart of Midtown!,244361589,Row NYC,Manhattan,Theater District,40.75803,-73.98887,Private room,100,1,6,2019-06-15,2.61,9,289
42079,32678724,Steps from varied cuisines at Restaurant Row!,244361589,Row NYC,Manhattan,Theater District,40.75792,-73.989,Private room,249,1,0,,,9,278
42080,32678725,Enjoy the Times Square experience with the fam...,244361589,Row NYC,Manhattan,Theater District,40.75976,-73.98761,Private room,249,1,22,2019-06-23,7.59,9,283
42081,32678726,Steps away from the Heart of the Theater Distr...,244361589,Row NYC,Manhattan,Theater District,40.75925,-73.98767,Private room,100,1,1,2019-05-04,0.45,9,299
42082,32678727,In the center of all Broadway Theater ACTION!,244361589,Row NYC,Manhattan,Theater District,40.75821,-73.9882,Private room,249,1,0,,,9,298


**In the following experiment I plot the location data on the x and y axes, which produces the output below. The data comes from NYC and covers its 5 areas.**