# Airbnb data analysis
### Questions:
>* What is the price range monthly in each region in Boston and Seattle?
>
>* What is the most vibe time in each region in Boston and Seattle?
>  
>* Can we predict the possible cost as per the corresponding holder's profiles (e.g., 'neighbourhood_group_cleansed','host_response_time','host_response_rate', 'host_acceptance_rate', 'name', 'note','transit', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification'), region, and month or day?

# Load data

In [1]:
# data location
%ls ../../Datasets

[34mBoston Airbnb Open Data[m[m/        Dataset of USED CARS.zip
Boston Airbnb Open Data.zip     Netflix_movie_and_TV_shows.csv
Car Sales.xlsx - car_data.csv   Netflix_movie_and_TV_shows.zip
Car sales report.zip            [34mSeattle_Airbnb[m[m/
Dataset of USED CARS.csv        Seattle_Airbnb.zip


In [2]:
# set data location
data_dir = '../../Datasets/'
boston_dir = data_dir+"Boston Airbnb Open Data/"
seattle_dir = data_dir+'Seattle_Airbnb/'

In [3]:
import os
# all boston datasets and seattle datasets
bs_all,sa_all = [],[]
for root,dirs,files in os.walk(boston_dir):
    for file in files:
        bs_all.append(os.path.join(root,file))
for root,dirs,files in os.walk(seattle_dir):
    for file in files:
        sa_all.append(os.path.join(root,file))

> ## Load all datasets

In [4]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)

In [5]:
# since both datasets contain 'reviews','listings', and 'calendar', create a dictionary key
dict_keys = ['reviews','listings','calendar']
# create dictionary of dataframes for both boston and seattle
dict_bs, dict_sa = {}, {}
for i,dict_key in enumerate(dict_keys):
    dict_bs[dict_key] = pd.read_csv(bs_all[i])
    dict_sa[dict_key] = pd.read_csv(sa_all[i])

> ## Wrangle data

> The data size is very large, directly merging will be too huge. Drop the non-essential columns and decrease the granuarity of the data.

In [6]:
dict_sa['reviews'].sample(5)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
29635,3551668,52975095,2015-11-03,5357454,Blake,"What an exceptional Airbnb! Great hosts, locat..."
79269,66611,2373695,2012-09-22,3503147,Tyler,I stayed in this room for about one week. The...
53777,1377812,13359845,2014-05-26,7172691,Mary,John's home is beautifully designed. It is sim...
71405,4047058,57984858,2015-12-30,23454266,Ellise,James and Roland have a great and very easy se...
35690,535300,22626938,2014-11-10,23256892,Lauren,This was my first time staying at an airbnb an...


In [7]:
dict_sa['listings'].sample(5)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
3053,9591718,https://www.airbnb.com/rooms/9591718,20160104002432,2016-01-04,Million Dollars View of Lake of WA,To share a huge upscale house with University ...,CLOSE TO BUS STOP ON SANDPOINT WAY N.E. MINS T...,To share a huge upscale house with University ...,none,LOCATED IN A QUIET AND PEACEFUL NEIGHBORHOOD ...,THERE ARE CHINESE PHD STUDENTS AND SCHOLARS IN...,2 MINS WALK TO BUS STOP #75 ON SANDPOINT WAY N...,https://a2.muscache.com/ac/pictures/730b2066-2...,https://a2.muscache.com/im/pictures/730b2066-2...,https://a2.muscache.com/ac/pictures/730b2066-2...,https://a2.muscache.com/ac/pictures/730b2066-2...,49644871,https://www.airbnb.com/users/show/49644871,Linda,2015-11-22,US,,,,,f,https://a2.muscache.com/ac/pictures/80735f2a-1...,https://a2.muscache.com/ac/pictures/80735f2a-1...,Mathews Beach,1.0,1.0,"['email', 'phone']",t,f,"Durland Avenue Northeast, Seattle, WA 98125, U...",Mathews Beach,Matthews Beach,Lake City,Seattle,WA,98125,Seattle,"Seattle, WA",US,United States,47.713169,-122.279579,t,House,Private room,1,1.0,1.0,1.0,Real Bed,"{""Wireless Internet"",""Air Conditioning"",""Wheel...",,$125.00,,,$400.00,$50.00,2,$10.00,1,1125,6 weeks ago,t,30,60,90,365,2016-01-04,0,,,,,,,,,,f,,WASHINGTON,f,flexible,f,f,1,
2330,2150760,https://www.airbnb.com/rooms/2150760,20160104002432,2016-01-04,Seward Park Mother In Law,Our cozy private detached MIL is walking dista...,Studio Detached Mother in law unit in a garden...,Our cozy private detached MIL is walking dista...,none,,,Short walk to multiple buses and light rail st...,https://a0.muscache.com/ac/pictures/51537111/f...,https://a0.muscache.com/im/pictures/51537111/f...,https://a0.muscache.com/ac/pictures/51537111/f...,https://a0.muscache.com/ac/pictures/51537111/f...,10977007,https://www.airbnb.com/users/show/10977007,Michelle,2014-01-02,"Seattle, Washington, United States",We are a small family that loves everything Se...,a few days or more,40%,100%,f,https://a2.muscache.com/ac/users/10977007/prof...,https://a2.muscache.com/ac/users/10977007/prof...,Columbia City,1.0,1.0,"['email', 'phone', 'reviews', 'kba']",t,t,"South Brandon Street, Seattle, WA 98118, Unite...",Columbia City,Seward Park,Seward Park,Seattle,WA,98118,Seattle,"Seattle, WA",US,United States,47.552437,-122.270692,t,House,Entire home/apt,2,1.0,0.0,1.0,Real Bed,"{""Wireless Internet"",Kitchen,""Pets live on thi...",,$100.00,,,$200.00,,0,$0.00,2,14,3 months ago,t,30,60,90,365,2016-01-04,40,2014-03-16,2015-09-28,94.0,10.0,10.0,10.0,9.0,9.0,9.0,f,,WASHINGTON,f,moderate,f,f,1,1.82
2332,1606171,https://www.airbnb.com/rooms/1606171,20160104002432,2016-01-04,Seattle's BEST RATE - Seward Park,A/C PRIVATE ROOM in VINTAGE building in a pret...,PLEASE READ ALL CAREFULLY. I've had TWO guests...,A/C PRIVATE ROOM in VINTAGE building in a pret...,none,-- WORK DESTINATIONS -- • SeaTac Airport: Ea...,I love doing Airbnb and strive for 5-star revi...,All the typical big city amenities are availab...,https://a0.muscache.com/ac/pictures/25493929/5...,https://a0.muscache.com/im/pictures/25493929/5...,https://a0.muscache.com/ac/pictures/25493929/5...,https://a0.muscache.com/ac/pictures/25493929/5...,8556853,https://www.airbnb.com/users/show/8556853,Daisy,2013-08-31,"Seattle, Washington, United States","Hello Traveler,\r\n\r\nI'm a Washington state ...",within an hour,100%,100%,f,https://a0.muscache.com/ac/users/8556853/profi...,https://a0.muscache.com/ac/users/8556853/profi...,Seward Park,1.0,1.0,"['email', 'phone', 'facebook', 'linkedin', 're...",t,t,"South Dawson Street, Seattle, WA 98118, United...",Seward Park,Seward Park,Seward Park,Seattle,WA,98118,Seattle,"Seattle, WA",US,United States,47.545899,-122.263749,t,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",,$38.00,,,$100.00,,1,$10.00,2,365,2 weeks ago,t,21,51,81,354,2016-01-04,166,2013-10-02,2015-12-31,93.0,9.0,9.0,10.0,10.0,9.0,10.0,f,,WASHINGTON,f,moderate,t,f,1,6.04
2015,8028801,https://www.airbnb.com/rooms/8028801,20160104002432,2016-01-04,"Greenwood Go-To, private Bed/Bath","Greenwood is a Northern neighborhood, and stil...",The room is on the first floor of a three floo...,"Greenwood is a Northern neighborhood, and stil...",none,"Right next to a Greenwood institution, Chuck's...",,A central location on the North side. Easy ro...,https://a2.muscache.com/ac/pictures/9540a9b0-3...,https://a2.muscache.com/im/pictures/9540a9b0-3...,https://a2.muscache.com/ac/pictures/9540a9b0-3...,https://a2.muscache.com/ac/pictures/9540a9b0-3...,8117374,https://www.airbnb.com/users/show/8117374,Cody,2013-08-12,"Seattle, Washington, United States",,within a few hours,100%,100%,f,https://a2.muscache.com/ac/pictures/8669799f-7...,https://a2.muscache.com/ac/pictures/8669799f-7...,Greenwood,1.0,1.0,"['email', 'phone', 'facebook', 'google', 'revi...",t,t,"Northwest 85th Street, Seattle, WA 98117, Unit...",Greenwood,Greenwood,Other neighborhoods,Seattle,WA,98117,Seattle,"Seattle, WA",US,United States,47.691774,-122.365675,t,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",,$50.00,$300.00,"$1,100.00",,,2,$15.00,1,1125,5 weeks ago,t,30,60,90,365,2016-01-04,14,2015-09-11,2015-11-29,94.0,10.0,9.0,10.0,10.0,10.0,10.0,f,,WASHINGTON,f,flexible,f,f,1,3.62
2029,4567243,https://www.airbnb.com/rooms/4567243,20160104002432,2016-01-04,Beautiful Affordable Holiday Rental,Our house is very nice and quaint. 750sq feet ...,Bus stops nearby and 15 minutes to downtown G...,Our house is very nice and quaint. 750sq feet ...,none,Greenwood,,Bus stops nearby and 15 minutes to downtown,https://a2.muscache.com/ac/pictures/57351550/7...,https://a2.muscache.com/im/pictures/57351550/7...,https://a2.muscache.com/ac/pictures/57351550/7...,https://a2.muscache.com/ac/pictures/57351550/7...,2855271,https://www.airbnb.com/users/show/2855271,Andrew,2012-07-07,"Seattle, Washington, United States",,within a day,50%,100%,f,https://a0.muscache.com/ac/users/2855271/profi...,https://a0.muscache.com/ac/users/2855271/profi...,,1.0,1.0,"['email', 'phone', 'facebook', 'reviews']",t,f,"Evanston Ave N, Seattle, WA 98103, United States",,Greenwood,Other neighborhoods,Seattle,WA,98103,Seattle,"Seattle, WA",US,United States,47.692504,-122.352163,f,House,Entire home/apt,4,1.0,1.0,2.0,Real Bed,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",,$75.00,$490.00,"$1,700.00",,$50.00,1,$0.00,7,31,2 months ago,t,2,2,2,123,2016-01-04,2,2015-01-03,2015-12-28,100.0,10.0,10.0,10.0,10.0,10.0,10.0,f,,WASHINGTON,f,moderate,f,f,1,0.16


> check the columns and select the essential columns
> 
> * 'id', 'neighbourhood_group_cleansed','host_response_time','host_response_rate', 'host_acceptance_rate', 'name', 'note','transit', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification', 'price', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',	'review_scores_checkin', 'review_scores_communication',	'review_scores_location', 'review_scores_value'

In [8]:
dict_sa['calendar'].sample(5)

Unnamed: 0,listing_id,date,available,price
1174429,6961346,2016-08-15,t,$85.00
466280,4243163,2016-06-27,f,
42776,6337492,2016-03-15,f,
1312673,9435483,2016-05-16,t,$152.00
261391,1549973,2016-02-24,t,$78.00


> clean the reviews dataframe

In [9]:
drop_colns = ['id','reviewer_id', 'reviewer_name', 'date']
dict_bs['reviews'].drop(columns=drop_colns, inplace=True)
dict_sa['reviews'].drop(columns=drop_colns, inplace=True)
dict_bs['reviews'].sample(5)

Unnamed: 0,listing_id,comments
25397,12870514,The place was bigger than it looked in picture...
47001,3601030,Shira's condo is lovely. It has all modern fix...
19420,2798787,Molly and her family were great hosts! Their h...
19274,8052617,We really enjoyed our stay. The home was very...
17718,8662258,It was really nice !! Great Experience !! Gre...


In [10]:
import warnings
warnings.filterwarnings('ignore')

In [16]:
# creat a funciton to extract essential words
def extract_essential_words(df):
    from nltk.tokenize import word_tokenize
    from nltk.tag import pos_tag
    from nltk.stem import WordNetLemmatizer
    import string
    import re
    
    # preloaded values    
    essential_words = []
    stem_fit = WordNetLemmatizer()
    # start to extract
    for i, comment in enumerate(df.comments):
        # check if isnan
        if comment != np.nan:
            try:
                # tokenise the sentence
                words = word_tokenize(comment)
                # extract only the addjectives
                tagged_tokens = pos_tag(words)
                words = [wd.lower() for wd, pos in tagged_tokens if (pos == 'JJ') and (wd.lower() not in ['airbnb'])]
                # stem the words
                stem_words = [stem_fit.lemmatize(wd) for wd in words]
                df.comments[i] = stem_words  
                essential_words = [essential_words.append(wd) for wd in stem_words if wd not in essential_words]  
            except:
                # print(comment)
                # so far the translation is not working properly
                df.comments[i] = np.nan
                # words = word_tokenize(translator.translate(comment, dest='en').text)        
    return essential_words

In [12]:
comments_bs = extract_essential_words(dict_bs['reviews'])
comments_sa = extract_essential_words(dict_sa['reviews'])

In [25]:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer
stem_fit = WordNetLemmatizer()
words = word_tokenize(dict_bs['reviews'].comments[2])
# extract only the addjectives
tagged_tokens = pos_tag(words, tagset='universal')
print(tagged_tokens)
words = [wd.lower() for wd, pos in tagged_tokens if (pos == 'ADJ') and (wd.lower() not in ['airbnb'])]
stem_words = [stem_fit.lemmatize(wd) for wd in words]
stem_words

[('We', 'PRON'), ('really', 'ADV'), ('enjoyed', 'VERB'), ('our', 'PRON'), ('stay', 'NOUN'), ('at', 'ADP'), ('Islams', 'NOUN'), ('house', 'NOUN'), ('.', '.'), ('From', 'ADP'), ('the', 'DET'), ('outside', 'ADJ'), ('the', 'DET'), ('house', 'NOUN'), ('did', 'VERB'), ("n't", 'ADV'), ('look', 'VERB'), ('so', 'ADV'), ('inviting', 'ADJ'), ('but', 'CONJ'), ('the', 'DET'), ('inside', 'NOUN'), ('was', 'VERB'), ('very', 'ADV'), ('nice', 'ADJ'), ('!', '.'), ('Even', 'ADV'), ('though', 'ADP'), ('Islam', 'NOUN'), ('himself', 'PRON'), ('was', 'VERB'), ('not', 'ADV'), ('there', 'ADV'), ('everything', 'NOUN'), ('was', 'VERB'), ('prepared', 'VERB'), ('for', 'ADP'), ('our', 'PRON'), ('arrival', 'NOUN'), ('.', '.'), ('The', 'DET'), ('airport', 'NOUN'), ('T', 'NOUN'), ('Station', 'NOUN'), ('is', 'VERB'), ('only', 'ADV'), ('a', 'DET'), ('5-10', 'ADJ'), ('min', 'NOUN'), ('walk', 'VERB'), ('away', 'ADV'), ('.', '.'), ('The', 'DET'), ('only', 'ADJ'), ('little', 'ADJ'), ('issue', 'NOUN'), ('was', 'VERB'), ('that

['outside', 'inviting', 'nice', '5-10', 'only', 'little', 'fine']

In [14]:
comments_sa

[None, None, None, None, None, None]

['My',
 'stay',
 'at',
 'islam',
 "'s",
 'place',
 'was',
 'really',
 'cool',
 '!',
 'Good',
 'location',
 ',',
 '5min',
 'away',
 'from',
 'subway',
 ',',
 'then',
 '10min',
 'from',
 'downtown',
 '.',
 'The',
 'room',
 'was',
 'nice',
 ',',
 'all',
 'place',
 'was',
 'clean',
 '.',
 'Islam',
 'managed',
 'pretty',
 'well',
 'our',
 'arrival',
 ',',
 'even',
 'if',
 'it',
 'was',
 'last',
 'minute',
 ';',
 ')',
 'i',
 'do',
 'recommand',
 'this',
 'place',
 'to',
 'any',
 'airbnb',
 'user',
 ':',
 ')']

[('My', 'PRP$'),
 ('stay', 'NN'),
 ('at', 'IN'),
 ('islam', 'NN'),
 ("'s", 'POS'),
 ('place', 'NN'),
 ('was', 'VBD'),
 ('really', 'RB'),
 ('cool', 'JJ'),
 ('!', '.'),
 ('Good', 'JJ'),
 ('location', 'NN'),
 (',', ','),
 ('5min', 'CD'),
 ('away', 'RB'),
 ('from', 'IN'),
 ('subway', 'NN'),
 (',', ','),
 ('then', 'RB'),
 ('10min', 'CD'),
 ('from', 'IN'),
 ('downtown', 'NN'),
 ('.', '.'),
 ('The', 'DT'),
 ('room', 'NN'),
 ('was', 'VBD'),
 ('nice', 'JJ'),
 (',', ','),
 ('all', 'DT'),
 ('place', 'NN'),
 ('was', 'VBD'),
 ('clean', 'JJ'),
 ('.', '.'),
 ('Islam', 'NNP'),
 ('managed', 'VBD'),
 ('pretty', 'RB'),
 ('well', 'RB'),
 ('our', 'PRP$'),
 ('arrival', 'NN'),
 (',', ','),
 ('even', 'RB'),
 ('if', 'IN'),
 ('it', 'PRP'),
 ('was', 'VBD'),
 ('last', 'JJ'),
 ('minute', 'NN'),
 (';', ':'),
 (')', ')'),
 ('i', 'NN'),
 ('do', 'VBP'),
 ('recommand', 'VB'),
 ('this', 'DT'),
 ('place', 'NN'),
 ('to', 'TO'),
 ('any', 'DT'),
 ('airbnb', 'JJ'),
 ('user', 'NN'),
 (':', ':'),
 (')', ')')]

['cool', 'last']

> clean the listings dataframe

In [None]:
select_colns = ['id', 'neighbourhood_group_cleansed','host_response_time','host_response_rate', 'host_acceptance_rate', 'name', 'note','transit', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification', 'price', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value']

> Merge the dataframes for boston

In [108]:
df_bs = dict_bs['reviews'].merge(dict_bs['listings'], how='inner', left_on='listing_id', right_on='id')
df_bs = df_bs.merge(dict_bs['calendar'], how='inner', on='listing_id')
df_bs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24920375 entries, 0 to 24920374
Columns: 104 entries, listing_id to price_y
dtypes: float64(18), int64(18), object(68)
memory usage: 19.3+ GB


> merge dataframe for seattle

In [109]:
df_sa = dict_sa['reviews'].merge(dict_sa['listings'], how='inner', left_on='listing_id', right_on='id')
df_sa = df_sa.merge(dict_sa['calendar'], how='inner', on='listing_id')
df_sa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30969885 entries, 0 to 30969884
Columns: 101 entries, listing_id to price_y
dtypes: float64(17), int64(16), object(68)
memory usage: 23.3+ GB
