# Airbnb data analysis
### Questions:
>* What is the monthly price range in each region of Boston and Seattle?
>
>* What is the busiest time of year in each region of Boston and Seattle?
>  
>* Can we predict the likely price from a listing's profile (e.g., region, ratings, month, and day)?

# Load data

In [1]:
# data location
%ls ../../Datasets

Boston Airbnb Open Data/        Dataset of USED CARS.zip
Boston Airbnb Open Data.zip     Netflix_movie_and_TV_shows.csv
Car Sales.xlsx - car_data.csv   Netflix_movie_and_TV_shows.zip
Car sales report.zip            Seattle_Airbnb/
Dataset of USED CARS.csv        Seattle_Airbnb.zip


In [2]:
# set data location
data_dir = '../../Datasets/'
boston_dir = data_dir+"Boston Airbnb Open Data/"
seattle_dir = data_dir+'Seattle_Airbnb/'

In [3]:
import os
# all boston datasets and seattle datasets
bs_all,sa_all = [],[]
for root,dirs,files in os.walk(boston_dir):
    for file in files:
        bs_all.append(os.path.join(root,file))
for root,dirs,files in os.walk(seattle_dir):
    for file in files:
        sa_all.append(os.path.join(root,file))

> ## Load all datasets

In [4]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',100)

In [5]:
# since both datasets contain 'reviews','listings', and 'calendar', create a dictionary key
dict_keys = ['reviews','listings','calendar']
# create dictionary of dataframes for both boston and seattle
dict_bs, dict_sa = {}, {}
for i,dict_key in enumerate(dict_keys):
    dict_bs[dict_key] = pd.read_csv(bs_all[i])
    dict_sa[dict_key] = pd.read_csv(sa_all[i])
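
The loop above pairs `dict_keys` with the files by position, and `os.walk` does not guarantee the order in which files are returned. A minimal sketch of a name-based lookup follows. It assumes each CSV file name contains its key (for example a file named `listings.csv`); `find_csv` and `example_paths` are hypothetical names used only for illustration.

```python
import os

def find_csv(paths, key):
    """Return the first path whose file name contains `key` and ends in .csv.

    Assumption: each dataset file name contains its key, e.g. 'listings.csv'.
    """
    for path in sorted(paths):
        name = os.path.basename(path).lower()
        if key in name and name.endswith('.csv'):
            return path
    raise FileNotFoundError(f"no CSV file matching {key!r}")

# Illustrative paths only; in the notebook the lists come from os.walk.
example_paths = ['data/reviews.csv', 'data/calendar.csv', 'data/listings.csv']
matched = {key: find_csv(example_paths, key)
           for key in ['reviews', 'listings', 'calendar']}
```

With a lookup like this, each dataframe could be read with `pd.read_csv(find_csv(bs_all, dict_key))` regardless of the order of the file list.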

> ## Wrangle data

> The datasets are large, so merging them directly would produce an unwieldy table. Drop the non-essential columns first and reduce the granularity of the data.
>
> For the three questions above, NLP on the review text adds little because numerical ratings are already available. NLP (e.g., keyword extraction and word vectorisation) is therefore left as an optional step for further analysis.
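
One way to reduce the granularity is to collapse the daily calendar rows into one row per listing and month. The sketch below uses a made-up `toy_calendar` frame with the same column names as the calendar files and assumes the prices are already numeric. It illustrates the idea and is not necessarily the aggregation used later in this notebook.

```python
import pandas as pd

# Toy calendar rows (values are made up for illustration).
toy_calendar = pd.DataFrame({
    'listing_id': [1, 1, 1, 2, 2],
    'date': ['2016-03-01', '2016-03-02', '2016-04-01', '2016-03-01', '2016-03-02'],
    'price': [100.0, 120.0, 90.0, 50.0, None],
})
toy_calendar['date'] = pd.to_datetime(toy_calendar['date'])

# Year-month period, so the same month in different years stays separate.
toy_calendar['month'] = toy_calendar['date'].dt.to_period('M')

# One row per listing and month instead of one row per listing and day.
# Missing prices are skipped by mean().
monthly = (toy_calendar
           .groupby(['listing_id', 'month'], as_index=False)['price']
           .mean())
```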

In [6]:
dict_sa['reviews'].sample(5)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
59072,3610724,21898648,2014-10-26,22093450,直人,we had an awesome stay! The location was good.
9218,7807087,56061324,2015-12-08,23039628,Stephanie,My friend Rebecca arrived the day before me an...
64358,1060467,43383361,2015-08-19,37510327,Anne-Sophie,We had a perfect stay at Hande's place. \nHand...
42988,2481869,23389012,2014-11-30,9236155,Robert,Andrea is very prompt when making the arrangem...
9793,387078,15974500,2014-07-19,14314819,Theresa,This was my first AirBnB experience and it was...


In [7]:
dict_sa['listings'].sample(1)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
1653,8441263,https://www.airbnb.com/rooms/8441263,20160104002432,2016-01-04,Capitol Hill Apartment,Centrally located 1-bedroom apartment in the h...,,Centrally located 1-bedroom apartment in the h...,none,,,,https://a1.muscache.com/ac/pictures/107346707/...,https://a1.muscache.com/im/pictures/107346707/...,https://a1.muscache.com/ac/pictures/107346707/...,https://a1.muscache.com/ac/pictures/107346707/...,21657753,https://www.airbnb.com/users/show/21657753,Josué,2014-09-22,"Seattle, Washington, United States",I am a gay male who loves to travel. I work an...,,,,f,https://a2.muscache.com/ac/users/21657753/prof...,https://a2.muscache.com/ac/users/21657753/prof...,,1.0,1.0,"['email', 'phone', 'google', 'reviews', 'jumio']",t,t,"Boylston Avenue, Seattle, WA 98101, United States",,First Hill,Downtown,Seattle,WA,98101,Seattle,"Seattle, WA",US,United States,47.612651,-122.323922,f,Apartment,Entire home/apt,4,1.0,1.0,2.0,Real Bed,"{TV,Internet,""Wireless Internet"",""Air Conditio...",,$180.00,,,,$30.00,2,$20.00,2,1125,2 months ago,t,0,0,0,0,2016-01-04,0,,,,,,,,,,f,,WASHINGTON,f,strict,f,f,1,


In [8]:
dict_sa['calendar'].sample(5)

Unnamed: 0,listing_id,date,available,price
1313333,3041619,2016-03-07,f,
881178,8556665,2016-03-12,f,
1222348,3250577,2016-11-27,t,$78.00
1323900,3504521,2016-02-18,t,$90.00
895839,10211928,2016-05-12,t,$125.00


In [9]:
# save the dataframe for wrangling
ls_bs, ls_sa, cd_bs, cd_sa = dict_bs['listings'], dict_sa['listings'], dict_bs['calendar'], dict_sa['calendar']

In [10]:
# drop columns and rows that are entirely NaN
ls_bs.dropna(how='all', axis=1, inplace=True)
ls_sa.dropna(how='all', axis=1, inplace=True)
ls_bs.dropna(how='all', axis=0, inplace=True)
ls_sa.dropna(how='all', axis=0, inplace=True)
# keep only the columns common to both cities for a like-for-like comparison
ls_com_col = [col for col in ls_bs.columns if col in ls_sa.columns]
ls_bs, ls_sa = ls_bs[ls_com_col], ls_sa[ls_com_col]

In [11]:
ls_bs.sample(2)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
2606,9410831,https://www.airbnb.com/rooms/9410831,20160906204935,2016-09-07,10 Minutes southwest of Boston,"Single family house in Brighton, MA with 1 per...",,"Single family house in Brighton, MA with 1 per...",none,,,,https://a2.muscache.com/im/pictures/75865224-1...,https://a2.muscache.com/im/pictures/75865224-1...,https://a2.muscache.com/im/pictures/75865224-1...,https://a2.muscache.com/im/pictures/75865224-1...,48796146,https://www.airbnb.com/users/show/48796146,Ryan,2015-11-11,US,,,,,f,https://a2.muscache.com/im/pictures/9b1e5a86-e...,https://a2.muscache.com/im/pictures/9b1e5a86-e...,Allston-Brighton,1,1,"['email', 'phone']",t,f,"Adair Road, Boston, MA 02135, United States",Allston-Brighton,Brighton,Boston,MA,2135,Boston,"Boston, MA",US,United States,42.353401,-71.166446,t,House,Private room,1,1.0,1.0,1.0,Real Bed,"{""Cable TV"",""Wireless Internet"",Kitchen,Heatin...",,$55.00,,,,,1,$0.00,1,1125,10 months ago,0,0,0,0,2016-09-06,0,,,,,,,,,,f,f,flexible,f,f,1,
2209,14608495,https://www.airbnb.com/rooms/14608495,20160906204935,2016-09-07,Nice and clean room in 2 bd apt,In the heart of boston,,In the heart of boston,none,,,,https://a2.muscache.com/im/pictures/9a107652-b...,https://a2.muscache.com/im/pictures/9a107652-b...,https://a2.muscache.com/im/pictures/9a107652-b...,https://a2.muscache.com/im/pictures/9a107652-b...,90512786,https://www.airbnb.com/users/show/90512786,Ainur,2016-08-17,"Gearhart, Oregon, United States",,within an hour,100%,79%,f,https://a2.muscache.com/im/pictures/f25b46ff-b...,https://a2.muscache.com/im/pictures/f25b46ff-b...,,3,3,"['email', 'phone', 'reviews']",t,f,"Park Drive, Boston, MA 02215, United States",,Fenway,Boston,MA,2215,Boston,"Boston, MA",US,United States,42.34685,-71.103373,f,Apartment,Private room,1,1.0,1.0,1.0,Real Bed,"{Internet,""Wireless Internet"",Kitchen,Washer,D...",,$85.00,,,,$5.00,2,$15.00,1,1125,2 days ago,7,23,49,324,2016-09-06,2,2016-08-25,2016-09-02,80.0,9.0,8.0,10.0,9.0,9.0,9.0,f,f,strict,f,f,3,2.0


> The questions in this investigation can be addressed without the 'reviews' dataframe, so it is left out along with NLP.

> Check the columns and select the essential ones:
> 
> * 'id', 'neighbourhood_cleansed', 'price', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'reviews_per_month'

In [12]:
select_ls_col = ['id', 'neighbourhood_cleansed', 'price', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'reviews_per_month']

In [13]:
# copy so that later column assignments do not act on a view of the original frames
ls_bs, ls_sa = ls_bs[select_ls_col].copy(), ls_sa[select_ls_col].copy()

In [14]:
ls_bs.sample(5)

Unnamed: 0,id,neighbourhood_cleansed,price,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month
407,5510597,Mission Hill,$62.00,,,,,,,,
2586,7921649,Brighton,$79.00,93.0,10.0,9.0,10.0,10.0,9.0,10.0,1.38
1892,8090623,Beacon Hill,$199.00,94.0,10.0,9.0,10.0,10.0,10.0,9.0,2.72
3051,12287687,South Boston Waterfront,$399.00,,,,,,,,
919,1183032,South End,$150.00,93.0,9.0,9.0,10.0,10.0,10.0,9.0,0.5


In [15]:
ls_sa.sample(5)

Unnamed: 0,id,neighbourhood_cleansed,price,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month
2591,6032726,Broadway,$550.00,94.0,9.0,10.0,10.0,10.0,10.0,9.0,2.37
2281,777159,Mount Baker,$150.00,93.0,9.0,9.0,10.0,10.0,9.0,9.0,0.65
1748,9545664,Alki,$99.00,,,,,,,,
1064,5615620,Pike-Market,$100.00,95.0,10.0,9.0,10.0,10.0,10.0,9.0,5.98
2801,941467,Broadway,$95.00,99.0,10.0,10.0,10.0,10.0,10.0,10.0,3.33


In [16]:
ls_bs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3585 entries, 0 to 3584
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           3585 non-null   int64  
 1   neighbourhood_cleansed       3585 non-null   object 
 2   price                        3585 non-null   object 
 3   review_scores_rating         2772 non-null   float64
 4   review_scores_accuracy       2762 non-null   float64
 5   review_scores_cleanliness    2767 non-null   float64
 6   review_scores_checkin        2765 non-null   float64
 7   review_scores_communication  2767 non-null   float64
 8   review_scores_location       2763 non-null   float64
 9   review_scores_value          2764 non-null   float64
 10  reviews_per_month            2829 non-null   float64
dtypes: float64(8), int64(1), object(2)
memory usage: 308.2+ KB


> The price column is stored as text (e.g. `$62.00`) and needs to be converted to a number.

In [29]:
# strip '$' and thousands separators, then convert to float
ls_bs.price = ls_bs.price.str.replace(r'[$,]', '', regex=True).astype(float)
ls_sa.price = ls_sa.price.str.replace(r'[$,]', '', regex=True).astype(float)
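
How a price with a thousands separator is parsed depends on the approach. The sketch below compares two options on made-up values: extracting the first `digits.digits` match, and stripping `$` and `,` before conversion. The variable names are illustrative only.

```python
import pandas as pd

# Made-up price strings in the same format as the listings files.
raw = pd.Series(['$85.00', '$1,250.00'])

# Extracting the first digits.digits match drops everything before the comma.
extracted = raw.str.extract(r'(\d+\.\d+)', expand=False).astype(float)

# Removing '$' and ',' first keeps the full amount.
stripped = raw.str.replace(r'[$,]', '', regex=True).astype(float)
```

Here `extracted` holds 85.0 and 250.0, while `stripped` holds 85.0 and 1250.0, so the second form is safer whenever a price can exceed 999.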

> Clean the calendar data: parse the date column as datetimes.

In [42]:
cd_bs.date = pd.to_datetime(cd_bs.date)
cd_sa.date = pd.to_datetime(cd_sa.date)
# preview the Boston calendar after parsing the dates
cd_bs.sample(5)

Unnamed: 0,listing_id,date,available,price
875479,8164146,2017-02-07,t,$319.00
404533,826555,2017-05-15,t,$84.00
581391,12233285,2017-07-04,f,
1195623,13863117,2016-12-05,t,$119.00
612931,1066767,2017-04-05,t,$275.00
