<a href="https://colab.research.google.com/github/AdnanAli-10/Airbnb-Bookings-Analysis-Capstone-Project./blob/main/Airbnb_Bookings_Analysis_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalized way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognized around the world. Analyzing the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers' and hosts' behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more. </b>

## <b>This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. </b>

## <b> Explore and analyze the data to discover key insights (not limited to these) such as: 
* What can we learn about different hosts and areas?
* What can we learn from predictions? (e.g., locations, prices, reviews, etc.)
* Which hosts are the busiest and why?
* Is there any noticeable difference in traffic among different areas, and what could be the reason for it? </b>

## <b>Defining a hypothetical situation</b>
Let us assume that my dad is a wealthy individual looking to invest in real estate. He has heard from his colleagues that the New York City real estate market is thriving and offers a good ROI (return on investment). Being in India, my dad is skeptical about investing in New York and leaving the property unoccupied. He also wants the property to generate some regular cash flow. To resolve this dilemma, my dad calls his old friend, Mr. Robert Kiyosaki (author of the book Rich Dad Poor Dad), who advises him to buy the property and list it on a rental service like <b>"Airbnb"</b>. After listening to his friend, my dad comes and asks me, <b>"Son, what is Airbnb?"</b>

<b>Defining Airbnb - Airbnb is an online marketplace connecting travelers with local hosts. On one side, the platform enables people to list their available space and earn extra income in the form of rent. On the other, Airbnb enables travelers to book unique homestays from local hosts, saving them money and giving them a chance to interact with locals. Catering to the on-demand travel industry, Airbnb is present in over 190 countries across the world.</b>

After this professional explanation, my dad is quite happy about the opportunity his friend has opened up for him. Now he wants answers to his questions from a business point of view. He asks me to find out a few specific details about the New York Airbnb market, for example, which part of New York generates the most revenue. Being a data scientist, I decide to tackle this problem through the art of data science and make my Indian parent proud (probably the most hypothetical thing I've mentioned so far).

# <b>So let's begin!!!</b>

## <b> Questions to answer:</b>
1. For which locations do customers pay the highest and lowest rent?
2. What are the top 5 areas/locations by number of listings?
3. What is the average price preferred by customers?
4. What is the total number of nights spent per location?
5. Which hosts are the busiest, and why?
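
As a rough preview of how the first two questions might translate into pandas, here is a minimal sketch on a tiny made-up frame. The values and the names `toy_df`, `mean_price` and `top_locations` are assumptions for illustration only; the column names are those of the dataset, and this is not necessarily how the analysis is carried out later.

```python
import pandas as pd

# Toy frame with invented values; only the column names come from the dataset.
toy_df = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan', 'Queens', 'Brooklyn'],
    'price': [100, 250, 150, 80, 120],
})

# Question 1: mean price per location, sorted from highest to lowest.
mean_price = (
    toy_df.groupby('neighbourhood_group')['price']
    .mean()
    .sort_values(ascending=False)
)

# Question 2: locations ranked by number of listings (at most the top 5).
top_locations = toy_df['neighbourhood_group'].value_counts().head(5)

print(mean_price)
print(top_locations)
```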

Now that we have the questions, it is time to find the answers. But as the great Sherlock Holmes says, "It is a capital mistake to theorize before one has data." So let's try to find some data.
In this hypothetical world, the organisation I'm working for (AlmaBetter) is kind enough to research and offer me the data required for the analysis. It can also be fetched from Kaggle, under the name Airbnb-NYC-cleaned by Sandeep Majumdar (that data might be a bit different, so the results may vary). 

In [1]:
#Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
#Importing Dependencies
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 7)

In [3]:
#Reading the data in a dataframe using pandas library to perform exploratory analysis
airbnb_df = pd.read_csv('/content/drive/MyDrive/Airbnb NYC 2019.csv')

## <b>Initial EDA and basic operations</b>

In [4]:
#checking the shape of the dataframe
airbnb_df.shape

(48895, 16)

In [6]:
#Checking the first 5 values of the dataframe
airbnb_df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [7]:
#Checking the last 5 values of the dataframe
airbnb_df.tail()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2
48894,36487245,Trendy duplex in the very heart of Hell's Kitchen,68119814,Christophe,Manhattan,Hell's Kitchen,40.76404,-73.98933,Private room,90,7,0,,,1,23


In [8]:
#Checking the columns
col_list = airbnb_df.columns.tolist()
col_list

['id',
 'name',
 'host_id',
 'host_name',
 'neighbourhood_group',
 'neighbourhood',
 'latitude',
 'longitude',
 'room_type',
 'price',
 'minimum_nights',
 'number_of_reviews',
 'last_review',
 'reviews_per_month',
 'calculated_host_listings_count',
 'availability_365']

In [5]:
airbnb_df.room_type.value_counts()

Entire home/apt    25409
Private room       22326
Shared room         1160
Name: room_type, dtype: int64

<b> Defining the columns:</b>
We can see that we have 16 columns and 48,895 observations. To better understand the dataset let's see what each column means.
* id : A unique id given to each Airbnb listing.
* name : The Ad title for the listing on Airbnb website.
* host_id : A unique id given to an Airbnb host.
* host_name : The name with which the host is registered.
* neighbourhood_group : A group of areas/neighbourhoods.
* neighbourhood : Name of a particular area/ neighbourhood.
* latitude : Latitudinal coordinate of the listing.
* longitude : Longitudinal coordinate of the listing.
* room_type : Listing type, one of three: Entire home/apt, Private room, or Shared room.
* price : Price of the listing.
* minimum_nights : Minimum number of nights required to stay in a single visit.
* number_of_reviews : The total number of reviews given by visitors.
* last_review : Date of the last recorded review.
* reviews_per_month : The number of reviews given per month for a listing.
* calculated_host_listings_count : The total number of listings registered under a given host.
* availability_365 : The number of days for which a listing is available in a year.
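
To make one of these definitions concrete, the sketch below recomputes the number of listings per host on a tiny made-up frame and compares it with `calculated_host_listings_count`. The values and the names `toy_df`, `listings_per_host` and `counts_match` are assumptions for illustration; on the real data the two counts may not agree exactly, for instance after rows are removed.

```python
import pandas as pd

# Toy frame with invented values; only the column names come from the dataset.
toy_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'host_id': [10, 10, 20, 30],
    'calculated_host_listings_count': [2, 2, 1, 1],
})

# Count the listings of each host, aligned row by row with the frame.
listings_per_host = toy_df.groupby('host_id')['id'].transform('count')

# True when the stored count agrees with the recomputed one on every row.
counts_match = (listings_per_host == toy_df['calculated_host_listings_count']).all()
print(counts_match)
```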

In [9]:
#general info about the dataset
airbnb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

Now I will be removing a few columns (last_review, reviews_per_month, latitude and longitude), because I don't see them adding any value to the questions I have to answer. 
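
A small variation, sketched on a made-up frame: `drop(columns=...)` without `inplace=True` returns a new frame and leaves the original untouched, so the cell can be re-run without a `KeyError` for columns that are already gone. The names `toy_df`, `unused_cols` and `trimmed_df` and all values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with invented values; the column names mirror the dataset.
toy_df = pd.DataFrame({
    'id': [1, 2],
    'price': [100, 150],
    'latitude': [40.6, 40.7],
    'longitude': [-73.9, -74.0],
    'last_review': ['2019-01-01', None],
    'reviews_per_month': [0.5, None],
})

unused_cols = ['latitude', 'longitude', 'last_review', 'reviews_per_month']

# Returns a new frame; toy_df itself still holds all six columns.
trimmed_df = toy_df.drop(columns=unused_cols)
print(trimmed_df.columns.tolist())
```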

In [10]:
#removing columns
airbnb_df.drop(['latitude','longitude','last_review','reviews_per_month'],axis=1,inplace=True)

In [11]:
#checking for null values
airbnb_df.isnull().sum()

id                                 0
name                              16
host_id                            0
host_name                         21
neighbourhood_group                0
neighbourhood                      0
room_type                          0
price                              0
minimum_nights                     0
number_of_reviews                  0
calculated_host_listings_count     0
availability_365                   0
dtype: int64

<b> Dealing with null values: </b>
There are two common ways of dealing with null values: one is deleting the observations that contain them, and the other is imputing the null values with some meaningful values.

In this case I will just delete the rows with null values: at most 37 of the 48,895 rows are affected (16 missing name, 21 missing host_name), so deleting them will not affect the trends much.
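
The two options can be compared side by side on a tiny made-up frame. This is only a sketch: the values, the `'Unknown'` placeholder and the names `toy_df`, `dropped_df` and `filled_df` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with invented values and a couple of missing text fields.
toy_df = pd.DataFrame({
    'name': ['Cozy room', None, 'Sunny loft'],
    'host_name': ['Ana', 'Ben', None],
    'price': [80, 120, 95],
})

# Option 1: delete every row that has at least one null value.
dropped_df = toy_df.dropna()

# Option 2: keep all rows and fill the text columns with a placeholder.
filled_df = toy_df.fillna({'name': 'Unknown', 'host_name': 'Unknown'})

print(len(dropped_df), len(filled_df))
```

Deleting is simpler but loses rows; imputing keeps every row at the cost of introducing artificial values.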

In [12]:
#Deleting the observations with null values
airbnb_df.dropna(inplace=True)

In [13]:
airbnb_df.isnull().sum()

id                                0
name                              0
host_id                           0
host_name                         0
neighbourhood_group               0
neighbourhood                     0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

We can see that there are no more null values left in the dataset.

In [14]:
#Getting a statistical summary of the dataframe
airbnb_df.describe()

Unnamed: 0,id,host_id,price,minimum_nights,number_of_reviews,calculated_host_listings_count,availability_365
count,48858.0,48858.0,48858.0,48858.0,48858.0,48858.0,48858.0
mean,19023350.0,67631690.0,152.740309,7.012444,23.273098,7.148369,112.801425
std,10982890.0,78623890.0,240.232386,20.019757,44.549898,32.9646,131.610962
min,2539.0,2438.0,0.0,1.0,0.0,1.0,0.0
25%,9475980.0,7818669.0,69.0,1.0,1.0,1.0,0.0
50%,19691140.0,30791330.0,106.0,3.0,5.0,1.0,45.0
75%,29157650.0,107434400.0,175.0,5.0,24.0,2.0,227.0
max,36487240.0,274321300.0,10000.0,1250.0,629.0,327.0,365.0


Here we can see that some price values are 0. This doesn't make sense, because hosts are unlikely to put up listings for free. We will impute values for these invalid observations.

In [15]:
airbnb_df[airbnb_df['price']==0]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,room_type,price,minimum_nights,number_of_reviews,calculated_host_listings_count,availability_365
23161,18750597,"Huge Brooklyn Brownstone Living, Close to it all.",8993084,Kimberly,Brooklyn,Bedford-Stuyvesant,Private room,0,4,1,4,28
25433,20333471,★Hostel Style Room | Ideal Traveling Buddies★,131697576,Anisha,Bronx,East Morrisania,Private room,0,2,55,4,127
25634,20523843,"MARTIAL LOFT 3: REDEMPTION (upstairs, 2nd room)",15787004,Martial Loft,Brooklyn,Bushwick,Private room,0,2,16,5,0
25753,20608117,"Sunny, Quiet Room in Greenpoint",1641537,Lauren,Brooklyn,Greenpoint,Private room,0,2,12,2,0
25778,20624541,Modern apartment in the heart of Williamsburg,10132166,Aymeric,Brooklyn,Williamsburg,Entire home/apt,0,5,3,1,73
25794,20639628,Spacious comfortable master bedroom with nice ...,86327101,Adeyemi,Brooklyn,Bedford-Stuyvesant,Private room,0,1,93,6,176
25795,20639792,Contemporary bedroom in brownstone with nice view,86327101,Adeyemi,Brooklyn,Bedford-Stuyvesant,Private room,0,1,95,6,232
25796,20639914,Cozy yet spacious private brownstone bedroom,86327101,Adeyemi,Brooklyn,Bedford-Stuyvesant,Private room,0,1,95,6,222
26259,20933849,the best you can find,13709292,Qiuchi,Manhattan,Murray Hill,Entire home/apt,0,3,0,1,0
26841,21291569,Coliving in Brooklyn! Modern design / Shared room,101970559,Sergii,Brooklyn,Bushwick,Shared room,0,30,2,6,333


In [16]:
airbnb_df[airbnb_df['price']==0].shape

(11, 12)

So there are 11 observations with a price of 0 that need to be treated.
We will do this by replacing each zero price with the average price of the listings that share the same minimum_nights value.
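
The same idea can be sketched in vectorized form on a tiny made-up frame. Note one deliberate difference from the notebook's own function: here the zeros are treated as missing before the group mean is computed, so they do not pull the average down. The values and the names `toy_df`, `price_or_nan` and `group_mean` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with invented values; the column names mirror the dataset.
toy_df = pd.DataFrame({
    'minimum_nights': [1, 1, 1, 2, 2],
    'price': [100, 0, 200, 50, 0],
})

# Turn zero prices into NaN so they are ignored by the mean.
price_or_nan = toy_df['price'].mask(toy_df['price'] == 0)

# Mean of the non-zero prices in each minimum_nights group, aligned to rows.
group_mean = price_or_nan.groupby(toy_df['minimum_nights']).transform('mean')

# Keep valid prices and fill only the former zeros with their group mean.
toy_df['price'] = price_or_nan.fillna(group_mean)
print(toy_df['price'].tolist())
```

If every price in a group were 0, the group mean would be NaN and those rows would stay missing, so such groups would need separate handling.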

In [17]:
airbnb_df.groupby('minimum_nights')['price'].mean().reset_index()

Unnamed: 0,minimum_nights,price
0,1,142.062756
1,2,146.279374
2,3,160.285643
3,4,161.229603
4,5,157.263765
...,...,...
103,400,50.000000
104,480,199.000000
105,500,88.800000
106,999,96.000000


In [18]:
#function for imputing average value of price wherever price is 0
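def price_imputer(min_nights_list, airbnb_df):
  for i in min_nights_list:
    same_nights = airbnb_df['minimum_nights'] == i
    #mean price of all listings with this minimum_nights value (zero-priced rows are included in the mean)
    avg_val = airbnb_df.loc[same_nights, 'price'].mean()
    #overwrite only the zero-priced rows of this group; the price column becomes float
    airbnb_df['price'] = np.where((airbnb_df['price'] == 0) & same_nights, avg_val, airbnb_df['price'])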

In [19]:
min_nights_list = [1,2,3,4,5,30]
price_imputer(min_nights_list,airbnb_df)

In [20]:
airbnb_df[airbnb_df['price']==0]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,room_type,price,minimum_nights,number_of_reviews,calculated_host_listings_count,availability_365


We can see that there are no more observations that have the value for price as 0. 

## <b>A great hack:</b>
Using Pandas profiling for EDA

In [21]:
!pip install pandas-profiling==2.7.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [23]:
from pandas_profiling import ProfileReport
ProfileReport(airbnb_df)




<B>NOTE</B>

Using ProfileReport may raise the following error: <b> ImportError: cannot import name 'ABCIndexClass' from 'pandas.core.dtypes.generic'</b>

To solve this problem, open the boolean.py file named in the traceback and replace ABCIndexClass with ABCIndex.