# Analysis of Airbnb bookings across Seattle

In this project, I analyse the Airbnb Seattle dataset available on Kaggle and try to answer a few questions about it. The work follows the CRISP-DM process, i.e., the project is organized in the following steps:

1. Business understanding
2. Data understanding
3. Preparation of the data
4. Modelling (if necessary)
5. Results and conclusions

The questions addressed in this project are:
1. What is a suitable price based on the features of the room (property_type, room_type, number of bedrooms, number of bathrooms, etc.)?
2. What is the probable price for a room based on the neighbourhood?
3. What is the effect of becoming a Superhost on the frequency of bookings?
4. What is the effect of selecting the "Strict" cancellation policy on the frequency of bookings?
5. In which neighbourhood is the price highest despite providing minimum amenities?
 

In [1]:
import pandas as pd
import numpy as np

listings = pd.read_csv('./Data/listings.csv')
calendar = pd.read_csv('./Data/calendar.csv')

pd.set_option("display.max_columns", None)

In [2]:
listings.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
       'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', '

Among this plethora of columns, we first select the ones that are relevant to our questions.

In [3]:
relevant_columns = ['id', 'neighbourhood_group_cleansed', 'property_type',
                    'room_type', 'accommodates', 'bathrooms',
                    'square_feet', 'bedrooms', 'beds',
                    'bed_type', 'amenities', 'cancellation_policy',
                    'minimum_nights', 'instant_bookable', 'review_scores_rating',
                    'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin',
                    'review_scores_communication', 'review_scores_location', 'review_scores_value',
                    'price']

# Keep only the columns needed for the questions above
listings_new = listings[relevant_columns]

In [4]:
listings_new.head()

| | id | neighbourhood_group_cleansed | property_type | room_type | accommodates | bathrooms | square_feet | bedrooms | beds | bed_type | amenities | cancellation_policy | minimum_nights | instant_bookable | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 241032 | Queen Anne | Apartment | Entire home/apt | 4 | 1.0 | NaN | 1.0 | 1.0 | Real Bed | {TV,"Cable TV",Internet,"Wireless Internet","A... | moderate | 1 | f | 95.0 | 10.0 | 10.0 | 10.0 | 10.0 | 9.0 | 10.0 | $85.00 |
| 1 | 953595 | Queen Anne | Apartment | Entire home/apt | 4 | 1.0 | NaN | 1.0 | 1.0 | Real Bed | {TV,Internet,"Wireless Internet",Kitchen,"Free... | strict | 2 | f | 96.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | $150.00 |
| 2 | 3308979 | Queen Anne | House | Entire home/apt | 11 | 4.5 | NaN | 5.0 | 7.0 | Real Bed | {TV,"Cable TV",Internet,"Wireless Internet","A... | strict | 4 | f | 97.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | $975.00 |
| 3 | 7421966 | Queen Anne | Apartment | Entire home/apt | 3 | 1.0 | NaN | 0.0 | 2.0 | Real Bed | {Internet,"Wireless Internet",Kitchen,"Indoor ... | flexible | 1 | f | NaN | NaN | NaN | NaN | NaN | NaN | NaN | $100.00 |
| 4 | 278830 | Queen Anne | House | Entire home/apt | 6 | 2.0 | NaN | 3.0 | 3.0 | Real Bed | {TV,"Cable TV",Internet,"Wireless Internet",Ki... | strict | 1 | f | 92.0 | 9.0 | 9.0 | 10.0 | 10.0 | 9.0 | 9.0 | $450.00 |


The price is stored as a string with a currency symbol (for example `$85.00`), so the next step is to convert it to a float.

In [5]:
listings_new.loc[:,"price"] = listings_new.price.str.replace("[$, ]", "", regex=True).astype("float")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  listings_new.loc[:,"price"] = listings_new.price.str.replace("[$, ]", "", regex=True).astype("float")
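
The warning above appears because `listings_new` was created by selecting columns from `listings`, so pandas cannot tell whether the assignment is meant to affect the original frame as well. One common way to avoid it is to take an explicit `.copy()` before modifying the subset. The following is a minimal sketch on toy data, not the notebook's actual run; the frame `raw` and the names `subset` and `clean_price` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-in for `listings`; only a few columns are needed here.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "price": ["$85.00", "$1,250.00", "$975.00"],
    "name": ["a", "b", "c"],
})

# An explicit copy makes the subset an independent DataFrame,
# so later assignments do not raise SettingWithCopyWarning.
subset = raw[["id", "price"]].copy()

# Remove the currency symbol, thousands separators and spaces, then cast to float.
clean_price = subset["price"].str.replace("[$, ]", "", regex=True).astype("float")
subset["price"] = clean_price

print(subset.dtypes)
```

The original frame `raw` keeps its string prices; only the copy is converted.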


In [6]:
listings_new.head()

| | id | neighbourhood_group_cleansed | property_type | room_type | accommodates | bathrooms | square_feet | bedrooms | beds | bed_type | amenities | cancellation_policy | minimum_nights | instant_bookable | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 241032 | Queen Anne | Apartment | Entire home/apt | 4 | 1.0 | NaN | 1.0 | 1.0 | Real Bed | {TV,"Cable TV",Internet,"Wireless Internet","A... | moderate | 1 | f | 95.0 | 10.0 | 10.0 | 10.0 | 10.0 | 9.0 | 10.0 | 85.0 |
| 1 | 953595 | Queen Anne | Apartment | Entire home/apt | 4 | 1.0 | NaN | 1.0 | 1.0 | Real Bed | {TV,Internet,"Wireless Internet",Kitchen,"Free... | strict | 2 | f | 96.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 150.0 |
| 2 | 3308979 | Queen Anne | House | Entire home/apt | 11 | 4.5 | NaN | 5.0 | 7.0 | Real Bed | {TV,"Cable TV",Internet,"Wireless Internet","A... | strict | 4 | f | 97.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 975.0 |
| 3 | 7421966 | Queen Anne | Apartment | Entire home/apt | 3 | 1.0 | NaN | 0.0 | 2.0 | Real Bed | {Internet,"Wireless Internet",Kitchen,"Indoor ... | flexible | 1 | f | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 100.0 |
| 4 | 278830 | Queen Anne | House | Entire home/apt | 6 | 2.0 | NaN | 3.0 | 3.0 | Real Bed | {TV,"Cable TV",Internet,"Wireless Internet",Ki... | strict | 1 | f | 92.0 | 9.0 | 9.0 | 10.0 | 10.0 | 9.0 | 9.0 | 450.0 |


Now, let us check the percentage of null values in each column.

In [7]:
100*listings_new.isnull().sum().sort_values(ascending=False)/listings_new.shape[0]

square_feet                     97.459403
review_scores_checkin           17.234154
review_scores_accuracy          17.234154
review_scores_value             17.181771
review_scores_location          17.155579
review_scores_cleanliness       17.103195
review_scores_communication     17.050812
review_scores_rating            16.946045
bathrooms                        0.419068
bedrooms                         0.157150
property_type                    0.026192
beds                             0.026192
id                               0.000000
cancellation_policy              0.000000
instant_bookable                 0.000000
minimum_nights                   0.000000
neighbourhood_group_cleansed     0.000000
amenities                        0.000000
bed_type                         0.000000
accommodates                     0.000000
room_type                        0.000000
price                            0.000000
dtype: float64
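
The same percentages can also be obtained from the mean of the boolean null mask, since the mean of True/False values is the fraction of True values. This is a small sketch on a hypothetical toy frame (`toy` and `null_pct` are illustrative names, and the values are made up):

```python
import numpy as np
import pandas as pd

# Toy data with a mostly-empty column, a partly-empty column and a complete one.
toy = pd.DataFrame({
    "square_feet": [np.nan, np.nan, np.nan, 800.0],
    "bathrooms": [1.0, np.nan, 2.0, 1.0],
    "price": [85.0, 150.0, 975.0, 100.0],
})

# isnull() gives a boolean mask; its column-wise mean is the null fraction.
null_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(null_pct)
```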

The `square_feet` value is missing for about 97% of the listings, so the column carries too little information to be useful and we drop it. The review scores cannot be handled in the same way: values are present for most of the listings (roughly 83%), so we keep these columns and instead fill in the entries that are missing.

In [8]:
listings_new.drop('square_feet', axis=1, inplace=True)
listings_new.dtypes

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  listings_new.drop('square_feet', axis=1, inplace=True)


id                                int64
neighbourhood_group_cleansed     object
property_type                    object
room_type                        object
accommodates                      int64
bathrooms                       float64
bedrooms                        float64
beds                            float64
bed_type                         object
amenities                        object
cancellation_policy              object
minimum_nights                    int64
instant_bookable                 object
review_scores_rating            float64
review_scores_accuracy          float64
review_scores_cleanliness       float64
review_scores_checkin           float64
review_scores_communication     float64
review_scores_location          float64
review_scores_value             float64
price                           float64
dtype: object

In [9]:
review_cols = [x for x in listings_new.columns if 'review' in x]

listings_new[review_cols].describe()

| | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value |
|---|---|---|---|---|---|---|---|
| count | 3171.0 | 3160.0 | 3165.0 | 3160.0 | 3167.0 | 3163.0 | 3162.0 |
| mean | 94.539262 | 9.636392 | 9.556398 | 9.786709 | 9.809599 | 9.608916 | 9.452245 |
| std | 6.606083 | 0.698031 | 0.797274 | 0.595499 | 0.568211 | 0.629053 | 0.750259 |
| min | 20.0 | 2.0 | 3.0 | 2.0 | 2.0 | 4.0 | 2.0 |
| 25% | 93.0 | 9.0 | 9.0 | 10.0 | 10.0 | 9.0 | 9.0 |
| 50% | 96.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
| 75% | 99.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
| max | 100.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |


We assume that the 50th percentile (the median) of each review column is a reasonable value to put in place of its null values.

In [10]:
percentile50 = listings_new[review_cols].quantile(0.5)
listings_new.loc[:,review_cols]=listings_new[review_cols].fillna(percentile50)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  listings_new.loc[:,review_cols]=listings_new[review_cols].fillna(percentile50)
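
To illustrate how this fill works: `quantile(0.5)` returns a Series indexed by column name, and `fillna` with a Series matches on those column labels, so each column receives its own median. The sketch below uses a hypothetical toy frame (`scores`, `medians` and `filled` are illustrative names, and the numbers are made up):

```python
import numpy as np
import pandas as pd

# Toy review scores with a few missing entries.
scores = pd.DataFrame({
    "review_scores_rating": [95.0, np.nan, 97.0, 92.0, 100.0],
    "review_scores_value": [10.0, 9.0, np.nan, np.nan, 10.0],
})

# One median per column, computed over the non-null values only.
medians = scores.quantile(0.5)

# fillna aligns the Series on column labels: each column gets its own median.
filled = scores.fillna(medians)
print(filled)
```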


In [11]:
listings_new.isnull().sum().sort_values(ascending=False)

bathrooms                       16
bedrooms                         6
property_type                    1
beds                             1
id                               0
review_scores_rating             0
review_scores_value              0
review_scores_location           0
review_scores_communication      0
review_scores_checkin            0
review_scores_cleanliness        0
review_scores_accuracy           0
cancellation_policy              0
instant_bookable                 0
minimum_nights                   0
neighbourhood_group_cleansed     0
amenities                        0
bed_type                         0
accommodates                     0
room_type                        0
price                            0
dtype: int64

Next, we deal with the null values in the room features (`bathrooms`, `bedrooms` and `beds`), filling them with the 60th percentile of each column.

In [12]:
listings_new.loc[:,['bathrooms','bedrooms','beds']] = listings_new.loc[:,['bathrooms','bedrooms','beds']].fillna(listings_new.loc[:,['bathrooms','bedrooms','beds']].quantile(0.60))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  listings_new.loc[:,['bathrooms','bedrooms','beds']] = listings_new.loc[:,['bathrooms','bedrooms','beds']].fillna(listings_new.loc[:,['bathrooms','bedrooms','beds']].quantile(0.60))
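
One caveat with percentile-based fills for counts such as bedrooms or beds: by default `quantile` interpolates linearly between neighbouring values, so the fill value may be a number that no listing actually has. The sketch below, on a hypothetical toy frame (`rooms`, `linear_q`, `nearest_q` and `filled_rooms` are illustrative names), compares the default with `interpolation="nearest"`, which returns an observed value instead. Whether that matters for the real data is an assumption to check, not something shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy room features; one bathroom count is missing.
rooms = pd.DataFrame({
    "bathrooms": [1.0, 1.0, 2.0, np.nan],
    "bedrooms": [1.0, 2.0, 3.0, 4.0],
})

# Default: linear interpolation between the two closest observations.
linear_q = rooms.quantile(0.60)

# Alternative: pick the closest observed value instead of interpolating.
nearest_q = rooms.quantile(0.60, interpolation="nearest")

filled_rooms = rooms.fillna(nearest_q)
print(linear_q, nearest_q, filled_rooms, sep="\n\n")
```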


For the property type, let us check if there is a way we can infer the property_type from the room_type.

In [13]:
listings_new[listings_new.property_type.isnull()]

| | id | neighbourhood_group_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | amenities | cancellation_policy | minimum_nights | instant_bookable | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2184 | 3335 | Rainier Valley | NaN | Entire home/apt | 4 | 1.0 | 2.0 | 2.0 | Real Bed | {"Wireless Internet",Kitchen,"Free Parking on ... | strict | 2 | f | 96.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 120.0 |


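The room_type is `Entire home/apt`, which does not pin down the property type on its own, since both apartments and houses appear with this room type in the rows shown earlier. As a simplifying assumption, we label this single listing as an Apartment.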

In [14]:
listings_new.iloc[2184, 2] = "Apartment" 
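
Positional indexing with `iloc` depends on the row order and on the column position staying the same. A label-based alternative is to select the missing rows with a boolean mask. The sketch below shows that idea on a hypothetical toy frame (`toy` and `missing` are illustrative names), together with an optional `pd.crosstab` check of how property types are distributed within each room type:

```python
import pandas as pd

# Toy data with a non-default index to show that labels, not positions, are used.
toy = pd.DataFrame(
    {
        "property_type": ["Apartment", "House", None, "Apartment"],
        "room_type": ["Entire home/apt", "Entire home/apt",
                      "Entire home/apt", "Private room"],
    },
    index=[10, 11, 12, 13],
)

# Optional check: counts of each property type per room type.
print(pd.crosstab(toy["room_type"], toy["property_type"]))

# Select rows by condition and the column by name, then assign.
missing = toy["property_type"].isnull()
toy.loc[missing, "property_type"] = "Apartment"
print(toy)
```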

In [16]:
listings_new.isnull().sum().sort_values(ascending=False)

id                              0
minimum_nights                  0
review_scores_value             0
review_scores_location          0
review_scores_communication     0
review_scores_checkin           0
review_scores_cleanliness       0
review_scores_accuracy          0
review_scores_rating            0
instant_bookable                0
cancellation_policy             0
neighbourhood_group_cleansed    0
amenities                       0
bed_type                        0
beds                            0
bedrooms                        0
bathrooms                       0
accommodates                    0
room_type                       0
property_type                   0
price                           0
dtype: int64