# Airbnb data analysis
### Questions:
>* What is the monthly price range in each region of Boston and Seattle?
>
>* What is the busiest time of year in each region of Boston and Seattle?
>
>* Can we predict the likely price from the host's profile (e.g., 'neighbourhood_group_cleansed', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'name', 'notes', 'transit', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification'), the region, and the month or day?

# Load data

In [1]:
# data location
%ls ../../Datasets

Boston Airbnb Open Data/        Dataset of USED CARS.zip
Boston Airbnb Open Data.zip     Netflix_movie_and_TV_shows.csv
Car Sales.xlsx - car_data.csv   Netflix_movie_and_TV_shows.zip
Car sales report.zip            Seattle_Airbnb/
Dataset of USED CARS.csv        Seattle_Airbnb.zip


In [2]:
# set data location
data_dir = '../../Datasets/'
boston_dir = data_dir+"Boston Airbnb Open Data/"
seattle_dir = data_dir+'Seattle_Airbnb/'

In [3]:
import os
# all boston datasets and seattle datasets
bs_all,sa_all = [],[]
for root,dirs,files in os.walk(boston_dir):
    for file in files:
        bs_all.append(os.path.join(root,file))
for root,dirs,files in os.walk(seattle_dir):
    for file in files:
        sa_all.append(os.path.join(root,file))

> ## Load all datasets

In [4]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)

In [5]:
# since both datasets contain 'reviews','listings', and 'calendar', create a dictionary key
dict_keys = ['reviews','listings','calendar']
# create dictionary of dataframes for both boston and seattle
dict_bs, dict_sa = {}, {}
for i,dict_key in enumerate(dict_keys):
    dict_bs[dict_key] = pd.read_csv(bs_all[i])
    dict_sa[dict_key] = pd.read_csv(sa_all[i])
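
The loop above pairs `bs_all[i]` with `dict_keys[i]`, so it depends on `os.walk` listing the files in the same order as `dict_keys`, which is not guaranteed. A minimal sketch of a more defensive lookup, assuming each file name contains its dataset name (for example `reviews.csv`); the helper `map_paths_by_key` and the example paths are hypothetical and not part of the original notebook:

```python
import os

def map_paths_by_key(paths, keys):
    """Return {key: path}, matching each key against the file's base name."""
    mapping = {}
    for key in keys:
        matches = [p for p in paths
                   if key in os.path.basename(p).lower()
                   and p.lower().endswith('.csv')]
        # Fail loudly instead of silently loading the wrong file.
        if len(matches) != 1:
            raise ValueError(f"expected exactly one CSV for {key!r}, found {matches}")
        mapping[key] = matches[0]
    return mapping

# Illustrative paths only; in the notebook these would be bs_all or sa_all.
example_paths = ['data/calendar.csv', 'data/listings.csv', 'data/reviews.csv']
paths_by_key = map_paths_by_key(example_paths, ['reviews', 'listings', 'calendar'])
```

Each DataFrame could then be loaded with `pd.read_csv(paths_by_key[key])`, independent of the directory order.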

> ## Wrangle data

> The datasets are very large, so merging them directly would produce an unmanageably large table. Drop the non-essential columns and reduce the granularity of the data first.

In [6]:
dict_sa['reviews'].sample(5)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
63621,479653,4868412,2013-05-29,6423079,Vidhya,Patricia's place was just lovely and perfect f...
19101,2039149,42364601,2015-08-12,17534297,Nicholas,Apartment was clean and comfortable. We really...
7778,8616488,52180899,2015-10-26,37707766,Nicole,Roy was very nice & greeted me with keys to th...
67938,4082986,39474287,2015-07-24,21363734,Heidi,"Great location, close to Seatttle center (go s..."
54788,491958,8256737,2013-10-22,8855175,Dale,The Treehouse in Columbia City is a lovely pla...


In [7]:
dict_sa['listings'].sample(5)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
1393,4776823,https://www.airbnb.com/rooms/4776823,20160104002432,2016-01-04,BEST (historic) locat A/C FREE prkg,Prime SPACE NEEDLE VIEW!! Perfect location. PA...,TRUE URBAN LOFT Designed by Elmer Fisher in 18...,Prime SPACE NEEDLE VIEW!! Perfect location. PA...,none,Close To Everything Downtown & Belltown Has To...,Garage is SB in elevator - Parking #34.,"Zip car, UBER and car-to-go all flood this are...",https://a0.muscache.com/ac/pictures/59938008/6...,https://a0.muscache.com/im/pictures/59938008/6...,https://a0.muscache.com/ac/pictures/59938008/6...,https://a0.muscache.com/ac/pictures/59938008/6...,15942582,https://www.airbnb.com/users/show/15942582,Jen,2014-05-24,"Seattle, Washington, United States","I am a Puget Sound area native, living in Seat...",within an hour,100%,100%,f,https://a1.muscache.com/ac/users/15942582/prof...,https://a1.muscache.com/ac/users/15942582/prof...,Belltown,5.0,5.0,"['email', 'phone', 'linkedin', 'reviews', 'kba']",t,t,"1st Avenue, Seattle, WA 98121, United States",Belltown,Belltown,Downtown,Seattle,WA,98121,Seattle,"Seattle, WA",US,United States,47.615214,-122.347636,t,Loft,Entire home/apt,4,1.0,0.0,2.0,Real Bed,"{TV,Internet,""Wireless Internet"",""Air Conditio...",,$125.00,,,$150.00,$45.00,2,$10.00,1,15,today,t,24,52,81,81,2016-01-04,67,2014-12-24,2015-12-10,94.0,9.0,10.0,9.0,9.0,10.0,10.0,f,,WASHINGTON,f,strict,f,f,5,5.33
1525,6691324,https://www.airbnb.com/rooms/6691324,20160104002432,2016-01-04,Beautiful Apartment! 99 Walkscore,"Centrally located in Downtown Seattle, this ho...",,"Centrally located in Downtown Seattle, this ho...",none,,,,https://a2.muscache.com/ac/pictures/101992623/...,https://a2.muscache.com/im/pictures/101992623/...,https://a2.muscache.com/ac/pictures/101992623/...,https://a2.muscache.com/ac/pictures/101992623/...,23316664,https://www.airbnb.com/users/show/23316664,Shellie,2014-11-03,"Seattle, Washington, United States",,within an hour,100%,100%,t,https://a1.muscache.com/ac/users/23316664/prof...,https://a1.muscache.com/ac/users/23316664/prof...,Central Business District,2.0,2.0,"['email', 'reviews', 'jumio']",t,f,"Western Avenue, Seattle, WA 98101, United States",Pike Place Market,Central Business District,Downtown,Seattle,WA,98101,Seattle,"Seattle, WA",US,United States,47.606302,-122.340528,t,Apartment,Entire home/apt,2,1.0,0.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",Po...",,$150.00,,,,$60.00,2,$30.00,2,1125,today,t,16,46,76,351,2016-01-04,29,2015-06-10,2015-12-26,99.0,10.0,10.0,10.0,10.0,10.0,10.0,f,,WASHINGTON,t,strict,f,f,2,4.16
1752,7776701,https://www.airbnb.com/rooms/7776701,20160104002432,2016-01-04,Beautiful waterfront beach house,Totally remodeled unit right on Alki beach! Tw...,Enjoy a front row seat on Alki beach! Listen t...,Totally remodeled unit right on Alki beach! Tw...,none,,This is going to sound strange but the ocean c...,There is a water taxi at the other end of Alki...,,,https://a2.muscache.com/ac/pictures/109797685/...,,40912896,https://www.airbnb.com/users/show/40912896,Diane,2015-08-09,"San Francisco, California, United States",Born and raised in Seattle. Moved to San Franc...,within a day,100%,,f,https://a2.muscache.com/ac/users/40912896/prof...,https://a2.muscache.com/ac/users/40912896/prof...,,1.0,1.0,"['email', 'phone', 'facebook', 'reviews']",t,f,"Alki Avenue Southwest, Seattle, WA 98116, Unit...",,Alki,West Seattle,Seattle,WA,98116,Seattle,"Seattle, WA",US,United States,47.576853,-122.416362,t,House,Entire home/apt,4,2.0,2.0,2.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",Ki...",,$269.00,"$1,875.00","$7,500.00",,$120.00,4,$75.00,2,30,3 months ago,t,0,12,42,298,2016-01-04,2,2015-09-02,2015-09-16,100.0,10.0,10.0,10.0,10.0,10.0,10.0,f,,WASHINGTON,f,flexible,f,f,1,0.48
1211,7597244,https://www.airbnb.com/rooms/7597244,20160104002432,2016-01-04,BELLTOWN/DOWNTOWN Spacious&Luminous,A beautiful vintage apartment home is located ...,This is a 100 year old building right in the h...,A beautiful vintage apartment home is located ...,none,,,There is close bus lines and readily available...,,,https://a2.muscache.com/ac/pictures/f781a6ec-4...,,39839653,https://www.airbnb.com/users/show/39839653,Lisa Marie,2015-07-28,"Seattle, Washington, United States",It is a pleasure to meet you! I am a local Sea...,within an hour,100%,100%,f,https://a2.muscache.com/ac/pictures/c201afe3-9...,https://a2.muscache.com/ac/pictures/c201afe3-9...,Belltown,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"2nd Avenue, Seattle, WA 98121, United States",Belltown,Belltown,Downtown,Seattle,WA,98121,Seattle,"Seattle, WA",US,United States,47.612973,-122.344426,t,Apartment,Entire home/apt,2,1.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",Ki...",,$125.00,$900.00,"$3,300.00",,$40.00,1,$0.00,1,1125,3 weeks ago,t,3,31,61,322,2016-01-04,4,2015-11-25,2015-12-16,95.0,10.0,10.0,10.0,10.0,9.0,10.0,f,,WASHINGTON,t,strict,f,f,1,2.93
3063,170469,https://www.airbnb.com/rooms/170469,20160104002432,2016-01-04,Private Bed & Bath in Ballard,Your cozy room with full-sized bed includes a ...,Your private bedroom with attached bath is on ...,Your cozy room with full-sized bed includes a ...,none,"Ballard is very popular with trendy shops, bar...","* Check-in is 3pm, check-out is at Noon. * I ...","Street parking is free, but hard to find durin...",https://a2.muscache.com/ac/pictures/af2aa4a4-4...,https://a2.muscache.com/im/pictures/af2aa4a4-4...,https://a2.muscache.com/ac/pictures/af2aa4a4-4...,https://a2.muscache.com/ac/pictures/af2aa4a4-4...,756099,https://www.airbnb.com/users/show/756099,Carie,2011-06-28,"Seattle, Washington, United States","I've lived in Seattle for 25 years, am a Life ...",within an hour,100%,100%,f,https://a2.muscache.com/ac/pictures/7937d2aa-4...,https://a2.muscache.com/ac/pictures/7937d2aa-4...,Ballard,1.0,1.0,"['email', 'phone', 'facebook', 'jumio']",t,t,"Alonzo Ave NW, Seattle, WA 98117, United States",Ballard,Whittier Heights,Ballard,Seattle,WA,98117,Seattle,"Seattle, WA",US,United States,47.678605,-122.374046,t,Townhouse,Private room,2,1.0,1.0,1.0,Real Bed,"{Internet,""Wireless Internet"",""Pets live on th...",,$60.00,,,,,1,$0.00,1,730,today,t,26,56,86,359,2016-01-04,0,,,,,,,,,,f,,WASHINGTON,f,flexible,f,f,1,


> Check the columns and select the essential ones:
> 
> * 'id', 'neighbourhood_group_cleansed', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'name', 'notes', 'transit', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification', 'price', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value'

In [8]:
dict_sa['calendar'].sample(5)

Unnamed: 0,listing_id,date,available,price
190580,4025593,2016-02-23,t,$40.00
1123087,488268,2016-12-16,f,
1342603,9511777,2016-05-16,f,
1298794,1547337,2016-05-07,t,$175.00
444292,3534364,2016-03-31,f,
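
The `price` column is stored as text such as `$40.00` and is empty on days when the listing is unavailable. One way to reduce the calendar's granularity, sketched below on a few made-up rows, is to convert the price to a number and average it per listing and month. The function name `monthly_calendar` and the output column names are assumptions for illustration.

```python
import pandas as pd

def monthly_calendar(calendar):
    """Collapse a daily calendar into one row per listing and calendar month."""
    cal = calendar.copy()
    cal['date'] = pd.to_datetime(cal['date'])
    # "$1,875.00" -> 1875.0; missing prices stay NaN
    cal['price'] = pd.to_numeric(cal['price'].str.replace(r'[$,]', '', regex=True))
    cal['available'] = cal['available'].eq('t')
    # Month number only; the year is dropped, so Jan 2016 and Jan 2017 are pooled.
    cal['month'] = cal['date'].dt.month
    return (cal.groupby(['listing_id', 'month'], as_index=False)
               .agg(mean_price=('price', 'mean'),
                    availability_rate=('available', 'mean')))

# Made-up rows in the same shape as the sample above.
calendar = pd.DataFrame({
    'listing_id': [1, 1, 1, 2],
    'date': ['2016-02-01', '2016-02-02', '2016-03-01', '2016-02-01'],
    'available': ['t', 'f', 't', 't'],
    'price': ['$40.00', None, '$1,875.00', '$175.00'],
})
monthly = monthly_calendar(calendar)
```

Applied to a full year of daily rows, this would leave at most twelve rows per listing.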


> clean the reviews dataframe

In [9]:
drop_colns = ['id','reviewer_id', 'reviewer_name', 'date']
dict_bs['reviews'].drop(columns=drop_colns, inplace=True)
dict_sa['reviews'].drop(columns=drop_colns, inplace=True)
dict_bs['reviews'].sample(5)

Unnamed: 0,listing_id,comments
49526,2698996,We received a warm welcome! Ray is a nice pers...
57899,1550047,This place was an excellent stay. Very well or...
54294,7395978,Great flat and location. Easy communication. R...
19131,7379913,"Adam had a lovely, clean and spacious apartmen..."
42806,9711934,"\r\nI had a great stay with Olena, Olga's Mom...."


In [10]:
import warnings
warnings.filterwarnings('ignore')

In [11]:
# create a function to extract essential words
def extract_essential_words(df):
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords
    from nltk.probability import FreqDist
    import string
    import re

    ''' 
    The function extracts all essential words from the entire comments column,
    pick up the most frequent words to form the vocabulary for word vectorization
    
    '''    
    # preloaded    
    essential_words = []
    stem_fit = WordNetLemmatizer()
    # concat the entire column for the whole vocabulary    
    for i, comment in enumerate(df.comments):
        try:
            # tokenise the sentence
            comment = comment.lower().translate(str.maketrans('','',string.punctuation))
            words = word_tokenize(comment)
            # stem the words
            stem_words = [stem_fit.lemmatize(wd) for wd in words if wd not in stopwords.words('english')]
            if len(stem_words) != 0:
                df.comment[i] = stem_words
                [essential_words.append(wd) for wd in stem_words]
            else:
                df.comment[i] = np.nan
        except:
            df.comments[i] = np.nan
    # get the most frequent words
    print(essential_words)
    freq_dist = FreqDist(essential_words)
    essential_words = freq_dist.most_common(10)
    print(essential_words)
    essential_words,_ = list(zip(*essential_words))
    essential_words = list(essential_words)
    # start to extract
    for i, comment in enumerate(df.comments):
        if comment is not np.nan:
            df.comments[i] = [wd for wd in comment if wd in essential_words]
        else:
            pass
    
    return essential_words

In [12]:
comments_bs = extract_essential_words(dict_bs['reviews'])
comments_sa = extract_essential_words(dict_sa['reviews'])

[]
[]


ValueError: not enough values to unpack (expected 2, got 0)
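
The empty lists and the `ValueError` are most likely explained by two details in `extract_essential_words`. The assignments use `df.comment[i]`, but the column is named `comments`, so the attribute lookup raises an `AttributeError`. The bare `except:` then catches that error and overwrites the row with `NaN`. No word ever reaches `essential_words`, `most_common(10)` returns an empty list, and unpacking `zip(*[])` into two names fails. A minimal corrected sketch follows. To stay self-contained it swaps NLTK's tokenizer, lemmatizer and stop-word list for `str.split`, an optional `normalize` callable and a tiny illustrative stop-word set; all three, and the function name, are assumptions rather than the notebook's method. It also returns new lists instead of mutating the DataFrame in place.

```python
import string
from collections import Counter

# Tiny illustrative stop-word set (assumption); NLTK's English list is far longer.
STOP_WORDS = frozenset({'the', 'a', 'an', 'and', 'for', 'in', 'was', 'very', 'he', 'both'})
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def extract_essential_words_fixed(comments, top_n=10, normalize=lambda w: w):
    """Return (filtered tokens per comment, vocabulary of the top_n words)."""
    tokenised = []
    for comment in comments:
        if not isinstance(comment, str):   # NaN or other missing values
            tokenised.append(None)
            continue
        words = comment.lower().translate(PUNCT_TABLE).split()
        kept = [normalize(w) for w in words if w not in STOP_WORDS]
        tokenised.append(kept or None)
    # Count words over the whole column, then keep the most frequent ones.
    counts = Counter(w for toks in tokenised if toks for w in toks)
    vocab = [w for w, _ in counts.most_common(top_n)]
    vocab_set = set(vocab)
    filtered = [[w for w in toks if w in vocab_set] if toks else None
                for toks in tokenised]
    return filtered, vocab

filtered, vocab = extract_essential_words_fixed(
    ['Great location, great house!', float('nan'), 'The house was great.'],
    top_n=2)
```

With NLTK installed, `normalize=WordNetLemmatizer().lemmatize` could be passed in, and the stop-word set could be built once from `stopwords.words('english')` rather than once per token as in the original loop.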

In [96]:
comments_bs

[]

In [97]:
dict_bs['reviews'].comments.sample(15)

27822    NaN
60772    NaN
17060    NaN
14045    NaN
54991    NaN
25932    NaN
6802     NaN
65285    NaN
10832    NaN
35972    NaN
21490    NaN
40191    NaN
57873    NaN
28614    NaN
45546    NaN
Name: comments, dtype: object

In [11]:
comment = dict_bs['reviews'].comments[1]

In [12]:
comment

'Great location for both airport and city - great amenities in the house: Plus Islam was always very helpful even though he was away'

In [13]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.probability import FreqDist
import string
import re

In [14]:
comment = comment.lower().translate(str.maketrans('','',string.punctuation))

In [15]:
comment

'great location for both airport and city  great amenities in the house plus islam was always very helpful even though he was away'

In [16]:
words = word_tokenize(comment)
words

['great',
 'location',
 'for',
 'both',
 'airport',
 'and',
 'city',
 'great',
 'amenities',
 'in',
 'the',
 'house',
 'plus',
 'islam',
 'was',
 'always',
 'very',
 'helpful',
 'even',
 'though',
 'he',
 'was',
 'away']

In [17]:
stem_fit = WordNetLemmatizer()

In [18]:
stem_words = [stem_fit.lemmatize(wd) for wd in words if wd not in stopwords.words('english')]
stem_words

['great',
 'location',
 'airport',
 'city',
 'great',
 'amenity',
 'house',
 'plus',
 'islam',
 'always',
 'helpful',
 'even',
 'though',
 'away']

In [19]:
essential_words = []
[essential_words.append(wd) for wd in stem_words]
essential_words

['great',
 'location',
 'airport',
 'city',
 'great',
 'amenity',
 'house',
 'plus',
 'islam',
 'always',
 'helpful',
 'even',
 'though',
 'away']

In [20]:
freq_dist = FreqDist(essential_words)
freq_dist

FreqDist({'great': 2, 'location': 1, 'airport': 1, 'city': 1, 'amenity': 1, 'house': 1, 'plus': 1, 'islam': 1, 'always': 1, 'helpful': 1, ...})

In [33]:
essential_words = freq_dist.most_common(10)
essential_words,_ = list(zip(*essential_words))
essential_words = list(essential_words)

In [38]:
xxx = [wd for wd in stem_words if wd in essential_words]
xxx

['great',
 'location',
 'airport',
 'city',
 'great',
 'amenity',
 'house',
 'plus',
 'islam',
 'always',
 'helpful']

In [36]:
words

['great',
 'location',
 'for',
 'both',
 'airport',
 'and',
 'city',
 'great',
 'amenities',
 'in',
 'the',
 'house',
 'plus',
 'islam',
 'was',
 'always',
 'very',
 'helpful',
 'even',
 'though',
 'he',
 'was',
 'away']

> clean the listings dataframe

In [None]:
select_colns = ['id', 'neighbourhood_group_cleansed','host_response_time','host_response_rate', 'host_acceptance_rate', 'name', 'notes','transit', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification', 'price', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value']
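
A sketch of how the selected columns might then be cleaned, assuming the formats visible in the listings sample earlier (`100%` for rates, `t`/`f` for flags, `$125.00` for prices). The helper name `clean_listings` is hypothetical, and only a few columns are shown on made-up rows.

```python
import pandas as pd

def clean_listings(listings, columns):
    """Keep the requested columns and convert text-encoded numbers and flags."""
    present = [c for c in columns if c in listings.columns]
    df = listings[present].copy()
    # "100%" -> 1.0
    for col in ('host_response_rate', 'host_acceptance_rate'):
        if col in df:
            df[col] = pd.to_numeric(df[col].str.rstrip('%')) / 100
    # 't' / 'f' -> True / False
    for col in ('host_has_profile_pic', 'host_identity_verified', 'instant_bookable',
                'require_guest_profile_picture', 'require_guest_phone_verification'):
        if col in df:
            df[col] = df[col].map({'t': True, 'f': False})
    # "$1,200.00" -> 1200.0
    if 'price' in df:
        df['price'] = pd.to_numeric(df['price'].str.replace(r'[$,]', '', regex=True))
    return df

# Made-up rows; the real frame would be dict_bs['listings'] or dict_sa['listings'].
listings = pd.DataFrame({
    'id': [1, 2],
    'host_response_rate': ['100%', None],
    'instant_bookable': ['t', 'f'],
    'price': ['$125.00', '$1,200.00'],
    'listing_url': ['u1', 'u2'],
})
cleaned = clean_listings(listings, ['id', 'host_response_rate', 'instant_bookable', 'price'])
```

Skipping columns that are absent keeps the helper usable on both cities even if their listings files do not share exactly the same columns.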

> Merge the dataframes for Boston

In [108]:
df_bs = dict_bs['reviews'].merge(dict_bs['listings'], how='inner', left_on='listing_id', right_on='id')
df_bs = df_bs.merge(dict_bs['calendar'], how='inner', on='listing_id')
df_bs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24920375 entries, 0 to 24920374
Columns: 104 entries, listing_id to price_y
dtypes: float64(18), int64(18), object(68)
memory usage: 19.3+ GB


> Merge the dataframes for Seattle

In [109]:
df_sa = dict_sa['reviews'].merge(dict_sa['listings'], how='inner', left_on='listing_id', right_on='id')
df_sa = df_sa.merge(dict_sa['calendar'], how='inner', on='listing_id')
df_sa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30969885 entries, 0 to 30969884
Columns: 101 entries, listing_id to price_y
dtypes: float64(17), int64(16), object(68)
memory usage: 23.3+ GB
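
Both row counts are exact multiples of 365 (24,920,375 = 68,275 × 365 and 30,969,885 = 84,849 × 365), which is consistent with every review row being paired with every calendar day of its listing. A sketch of one way to avoid this many-to-many blow-up, shown on made-up frames: aggregate the reviews to one row per listing and the calendar to one row per listing and month before merging. The column names `n_reviews` and `mean_price`, and the pre-computed `month` column, are illustrative assumptions.

```python
import pandas as pd

reviews = pd.DataFrame({'listing_id': [1, 1, 1, 2],
                        'comments': ['a', 'b', 'c', 'd']})
listings = pd.DataFrame({'id': [1, 2],
                         'neighbourhood_group_cleansed': ['Downtown', 'Ballard']})
calendar = pd.DataFrame({'listing_id': [1, 1, 1, 2, 2],
                         'month': [1, 1, 2, 1, 1],
                         'price': [100.0, 120.0, 90.0, 60.0, None]})

# Naive approach: each review row is repeated for every calendar row of its listing.
naive = reviews.merge(calendar, on='listing_id')

# Aggregate first: one row per listing, and one row per listing and month.
review_counts = (reviews.groupby('listing_id', as_index=False)
                        .agg(n_reviews=('comments', 'count')))
monthly = (calendar.groupby(['listing_id', 'month'], as_index=False)
                   .agg(mean_price=('price', 'mean')))

# The merged size is now bounded by (number of listings) x (number of months).
compact = (listings
           .merge(review_counts, how='left', left_on='id', right_on='listing_id')
           .drop(columns='listing_id')
           .merge(monthly, how='inner', left_on='id', right_on='listing_id')
           .drop(columns='listing_id'))
```

On the toy data the naive merge yields 11 rows while the aggregated merge yields 3; on the real data the same idea would cap the result at roughly twelve rows per listing.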
