# Airbnb data analysis
### Questions:
>* What is the price range monthly in each region in Boston and Seattle?
>
>* What is the most vibe time in each region in Boston and Seattle?
>  
>* Can we predict the possible cost as per the corresponding holder's profiles (e.g., 'neighbourhood_group_cleansed','host_response_time','host_response_rate', 'host_acceptance_rate', 'name', 'note','transit', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification'), region, and month or day?

# Load data

In [43]:
# data location
%ls ../../Datasets

[34mBoston Airbnb Open Data[m[m/        Dataset of USED CARS.zip
Boston Airbnb Open Data.zip     Netflix_movie_and_TV_shows.csv
Car Sales.xlsx - car_data.csv   Netflix_movie_and_TV_shows.zip
Car sales report.zip            [34mSeattle_Airbnb[m[m/
Dataset of USED CARS.csv        Seattle_Airbnb.zip


In [44]:
# set data location
data_dir = '../../Datasets/'
boston_dir = data_dir+"Boston Airbnb Open Data/"
seattle_dir = data_dir+'Seattle_Airbnb/'

In [45]:
import os
# all boston datasets and seattle datasets
bs_all,sa_all = [],[]
for root,dirs,files in os.walk(boston_dir):
    for file in files:
        bs_all.append(os.path.join(root,file))
for root,dirs,files in os.walk(seattle_dir):
    for file in files:
        sa_all.append(os.path.join(root,file))

> ## Load all datasets

In [46]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)

In [47]:
# since both datasets contain 'reviews','listings', and 'calendar', create a dictionary key
dict_keys = ['reviews','listings','calendar']
# create dictionary of dataframes for both boston and seattle
dict_bs, dict_sa = {}, {}
for i,dict_key in enumerate(dict_keys):
    dict_bs[dict_key] = pd.read_csv(bs_all[i])
    dict_sa[dict_key] = pd.read_csv(sa_all[i])

> ## Wrangle data

> The data size is very large, directly merging will be too huge. Drop the non-essential columns and decrease the granuarity of the data.

In [48]:
dict_sa['reviews'].sample(5)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
7035,227636,14167726,2014-06-13,15689119,June,We booked this apartment based on location (we...
45028,3768626,35914124,2015-06-23,21824003,S,"The studio is nice. Very clean, large and comf..."
32269,719233,47671696,2015-09-21,5050144,Amy,Great location! Nice clean apartment w/ in wal...
35904,1686930,55901731,2015-12-07,28657819,Vendalyn,A very friendly host who reassured us when we ...
20896,486344,49374834,2015-10-03,3085478,Alice,Lara was so friendly and welcoming! The room w...


In [49]:
dict_sa['listings'].sample(5)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
3091,8717068,https://www.airbnb.com/rooms/8717068,20160104002432,2016-01-04,Tiny House in Wedgwood,Welcome to the tiny house! Located in Northeas...,***We have moved our tiny house from Wallingfo...,Welcome to the tiny house! Located in Northeas...,none,,We just moved into our new home and are SO exc...,"Wedgwood is GREAT for walking and biking, even...",https://a1.muscache.com/ac/pictures/110531834/...,https://a1.muscache.com/im/pictures/110531834/...,https://a1.muscache.com/ac/pictures/110531834/...,https://a1.muscache.com/ac/pictures/110531834/...,5474268,https://www.airbnb.com/users/show/5474268,Maya,2013-03-15,"Seattle, Washington, United States",I currently work at a middle school as a coach...,within an hour,100%,100%,f,https://a1.muscache.com/ac/users/5474268/profi...,https://a1.muscache.com/ac/users/5474268/profi...,Meadowbrook,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'kba']",t,t,"Northeast 95 Street, Seattle, WA 98115, United...",Meadowbrook,Meadowbrook,Lake City,Seattle,WA,98115.0,Seattle,"Seattle, WA",US,United States,47.698547,-122.29345,t,Cabin,Entire home/apt,2,1.0,0.0,1.0,Futon,"{Kitchen,""Free Parking on Premises"",Heating,Wa...",,$68.00,$380.00,"$1,300.00",,,1,$5.00,1,1125,2 months ago,t,0,0,0,0,2016-01-04,1,2015-11-23,2015-11-23,100.0,10.0,10.0,10.0,10.0,10.0,10.0,f,,WASHINGTON,f,flexible,f,f,1,0.7
369,3193738,https://www.airbnb.com/rooms/3193738,20160104002432,2016-01-04,Floating Home~Huge Deck~Kayaks~Bike,Enjoy the tranquility of this floating home wi...,Best waterfront living in Seattle! Ideally loc...,Enjoy the tranquility of this floating home wi...,none,Join this unique friendly lake union live aboa...,Living on the water is a Eco Conscious experie...,Easy to get around on the Burke Gilman Trail. ...,,,https://a0.muscache.com/ac/pictures/40659022/4...,,3979528,https://www.airbnb.com/users/show/3979528,Amelia & Foxy,2012-10-26,"Seattle, Washington, United States",Amelia & Foxy~We are true Seattleites at heart...,within a few hours,100%,100%,f,https://a2.muscache.com/ac/users/3979528/profi...,https://a2.muscache.com/ac/users/3979528/profi...,Ballard,5.0,5.0,"['email', 'phone', 'facebook', 'linkedin', 're...",t,t,"North Northlake Way, Seattle, WA 98103, United...",Wallingford,Wallingford,Other neighborhoods,Seattle,WA,98103.0,Seattle,"Seattle, WA",US,United States,47.647345,-122.333486,t,House,Entire home/apt,4,1.0,1.0,2.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",Ki...",,$150.00,"$1,600.00","$3,700.00",$400.00,$120.00,1,$0.00,7,1125,2 months ago,t,0,0,2,277,2016-01-04,2,2014-10-05,2014-12-01,90.0,9.0,8.0,7.0,9.0,10.0,10.0,f,,WASHINGTON,f,strict,f,f,2,0.13
2671,9774404,https://www.airbnb.com/rooms/9774404,20160104002432,2016-01-04,Bohemian Studio in Capitol Hill,Beautifully designed studio with hardwood floo...,The studio is well separated into 4 chunks: Li...,Beautifully designed studio with hardwood floo...,none,+Located in the heart of Capitol Hill +97 Walk...,+Dog Friendly +Couch & inflatable twin air-mat...,,,,https://a2.muscache.com/ac/pictures/298c8a8f-3...,,18746557,https://www.airbnb.com/users/show/18746557,Amanda,2014-07-23,"Seattle, Washington, United States",,,,,f,https://a2.muscache.com/ac/pictures/d7ff4ccc-f...,https://a2.muscache.com/ac/pictures/d7ff4ccc-f...,,1.0,1.0,"['phone', 'facebook']",t,f,"Seattle, WA, United States",,Broadway,Capitol Hill,Seattle,WA,,Seattle,"Seattle, WA",US,United States,47.617062,-122.325872,f,Apartment,Entire home/apt,2,1.0,0.0,1.0,Real Bed,"{""Wireless Internet"",Kitchen,""Smoking Allowed""...",,$100.00,,,,,1,$0.00,1,1125,5 weeks ago,t,0,0,0,7,2016-01-04,0,,,,,,,,,,f,,WASHINGTON,f,flexible,f,f,1,
1647,4614955,https://www.airbnb.com/rooms/4614955,20160104002432,2016-01-04,P4 Metro living: Winter promo,"Located in the heart of downtown Seattle, next...","Winter promo (Dec. & Jan.): FREE parking, FREE...","Located in the heart of downtown Seattle, next...",none,Walk Score: 96. Transit Score: 100 You will fi...,BUILDING NOISES The building is equipped with ...,Walk Score: 96. Transit Score: 100 with major ...,https://a2.muscache.com/ac/pictures/59998582/4...,https://a2.muscache.com/im/pictures/59998582/4...,https://a2.muscache.com/ac/pictures/59998582/4...,https://a2.muscache.com/ac/pictures/59998582/4...,9058822,https://www.airbnb.com/users/show/9058822,Andre And Joel,2013-09-25,"Seattle, Washington, United States",Outgoing and friendly!,within an hour,100%,100%,f,https://a2.muscache.com/ac/users/9058822/profi...,https://a2.muscache.com/ac/users/9058822/profi...,First Hill,8.0,8.0,"['email', 'phone', 'facebook', 'linkedin', 're...",t,t,"Hubbell Place, Seattle, WA 98101, United States",First Hill,First Hill,Downtown,Seattle,WA,98101.0,Seattle,"Seattle, WA",US,United States,47.612597,-122.329876,t,Apartment,Entire home/apt,5,1.0,1.0,3.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",,$151.00,"$1,029.00","$3,500.00",$250.00,$65.00,2,$17.00,3,1125,today,t,11,11,26,301,2016-01-04,39,2014-12-16,2015-11-29,93.0,10.0,9.0,10.0,10.0,10.0,9.0,f,,WASHINGTON,t,strict,t,t,4,3.04
1985,1935439,https://www.airbnb.com/rooms/1935439,20160104002432,2016-01-04,Sunny Private Room by Carkeek Park,Large comfortable private room with lots of li...,West exposure means lots of sun.,Large comfortable private room with lots of li...,none,,,,,,https://a2.muscache.com/ac/pictures/27338575/4...,,10014839,https://www.airbnb.com/users/show/10014839,Timothy,2013-11-14,US,,within a day,100%,100%,f,https://a0.muscache.com/ac/users/10014839/prof...,https://a0.muscache.com/ac/users/10014839/prof...,Greenwood,1.0,1.0,"['email', 'phone', 'reviews']",t,f,"4th Avenue Northwest, Seattle, WA 98177, Unite...",Greenwood,Greenwood,Other neighborhoods,Seattle,WA,98177.0,Seattle,"Seattle, WA",US,United States,47.703299,-122.360732,t,House,Private room,2,2.0,1.0,1.0,Futon,"{Kitchen,""Pets live on this property"",Cat(s),""...",,$49.00,$250.00,$800.00,$400.00,$30.00,1,$10.00,2,1125,2 months ago,t,0,0,0,270,2016-01-04,3,2014-06-22,2015-10-04,80.0,10.0,8.0,8.0,7.0,10.0,9.0,f,,WASHINGTON,f,moderate,f,f,1,0.16


> check the columns and select the essential columns
> 
> * 'id', 'neighbourhood_group_cleansed','host_response_time','host_response_rate', 'host_acceptance_rate', 'name', 'note','transit', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification', 'price', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',	'review_scores_checkin', 'review_scores_communication',	'review_scores_location', 'review_scores_value'

In [50]:
dict_sa['calendar'].sample(5)

Unnamed: 0,listing_id,date,available,price
532155,9012948,2016-12-19,t,$130.00
735757,8028801,2016-10-12,t,$50.00
1312150,7584142,2016-12-09,t,$100.00
278418,7035077,2016-10-18,t,$30.00
843312,1750681,2016-06-14,f,


> clean the reviews dataframe

In [51]:
drop_colns = ['id','reviewer_id', 'reviewer_name', 'date']
dict_bs['reviews'].drop(columns=drop_colns, inplace=True)
dict_sa['reviews'].drop(columns=drop_colns, inplace=True)
dict_bs['reviews'].sample(5)

Unnamed: 0,listing_id,comments
27328,1887581,The apartment was exactly as described. Great ...
61261,311240,Fantastic location! Great views!
48222,2592416,We have had an incredible trip to Boston. The...
8264,7914359,Jeff was always available and answer every que...
35658,5025015,My family and I felt welcome immediately. The ...


In [52]:
import warnings
warnings.filterwarnings('ignore')

In [37]:
# creat a funciton to extract essential words
def extract_essential_words(df):
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    import string
    import re

    from nltk.tag import pos_tag
    
    # preloaded values    
    essential_words = []
    # translator = Translator()
    stem_fit = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    # start to extract
    for i, comment in enumerate(df.comments):
        # check if isnan
        if comment != np.nan:
            try:
                # tokenise the sentence
                words = word_tokenize(comment)
                # extract only the addjectives
                tagged_tokens = pos_tag(words)
                words = [wd for wd, pos in tagged_tokens if pos == 'JJ']
                # remove punctuations and others
                words = [wd for wd in words if (wd not in stop_words) and (wd not in string.punctuation) 
                         and (not re.match(r'^[A-Z]+', wd)) and (not re.match(r"^'[a-z]$", wd))]   
                # stem the words
                stem_words = [stem_fit.lemmatize(wd) for wd in words]
                df.comments[i] = stem_words  
                essential_words = [essential_words.append(wd) for wd in stem_words if wd not in essential_words]  
            except:
                # print(comment)
                # so far the translation is not working properly
                df.comments[i] = np.nan
                # words = word_tokenize(translator.translate(comment, dest='en').text)        
    return essential_words

In [38]:
comments_bs = extract_essential_words(dict_bs['reviews'])
comments_sa = extract_essential_words(dict_sa['reviews'])

In [39]:
comments_bs

[]

In [41]:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

In [53]:
words = word_tokenize(dict_bs['reviews'].comments[0])
words

['My',
 'stay',
 'at',
 'islam',
 "'s",
 'place',
 'was',
 'really',
 'cool',
 '!',
 'Good',
 'location',
 ',',
 '5min',
 'away',
 'from',
 'subway',
 ',',
 'then',
 '10min',
 'from',
 'downtown',
 '.',
 'The',
 'room',
 'was',
 'nice',
 ',',
 'all',
 'place',
 'was',
 'clean',
 '.',
 'Islam',
 'managed',
 'pretty',
 'well',
 'our',
 'arrival',
 ',',
 'even',
 'if',
 'it',
 'was',
 'last',
 'minute',
 ';',
 ')',
 'i',
 'do',
 'recommand',
 'this',
 'place',
 'to',
 'any',
 'airbnb',
 'user',
 ':',
 ')']

In [55]:
tag_words = pos_tag(words)
words = [wd for wd, pos in tag_words if pos == 'JJ']

In [56]:
words

['cool', 'Good', 'nice', 'clean', 'last', 'airbnb']

> clean the listings dataframe

In [None]:
select_colns = ['id', 'neighbourhood_group_cleansed','host_response_time','host_response_rate', 'host_acceptance_rate', 'name', 'note','transit', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification', 'price', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value']

> Merge the dataframes for boston

In [108]:
df_bs = dict_bs['reviews'].merge(dict_bs['listings'], how='inner', left_on='listing_id', right_on='id')
df_bs = df_bs.merge(dict_bs['calendar'], how='inner', on='listing_id')
df_bs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24920375 entries, 0 to 24920374
Columns: 104 entries, listing_id to price_y
dtypes: float64(18), int64(18), object(68)
memory usage: 19.3+ GB


> merge dataframe for seattle

In [109]:
df_sa = dict_sa['reviews'].merge(dict_sa['listings'], how='inner', left_on='listing_id', right_on='id')
df_sa = df_sa.merge(dict_sa['calendar'], how='inner', on='listing_id')
df_sa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30969885 entries, 0 to 30969884
Columns: 101 entries, listing_id to price_y
dtypes: float64(17), int64(16), object(68)
memory usage: 23.3+ GB
