## Airbnb Dataset: 
https://www.kaggle.com/airbnb/seattle/data
### Context
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Inside Airbnb initiative, this dataset describes the listing activity of homestays in Seattle, WA.

### Content
The following Airbnb activity is included in this Seattle dataset:

- Listings, including full descriptions and average review score
- Reviews, including a unique id for each reviewer and detailed comments
- Calendar, including listing id and the price and availability for that day

### Inspiration
- Can you describe the vibe of each Seattle neighborhood using listing descriptions?
- What are the busiest times of the year to visit Seattle? By how much do prices spike?
- Is there a general upward trend of both new Airbnb listings and total Airbnb visitors to Seattle?

For more ideas, visualizations of all Seattle datasets can be found here.

### Acknowledgement
This dataset is part of Inside Airbnb, and the original source can be found at http://insideairbnb.com/get-the-data.html.


## Analysis Procedures (CRISP-DM)


### 1. Business understanding

    - Can you build a recommendation model, by grouping similar listings into one cluster?
    - What are the busiest times of the year to visit Seattle?
    - How does the price change over one year?
    - Can you predict/suggest a price given a listing's information? Which features impact the price most?
    - * Can you predict the occupancy rate?

    
### 2. Data understanding
    - Data exploration
    
### 3. Data preparation
    - Handling missing values, categorical values, and feature engineering.

### 4. Modelling
    - Use clustering algorithms for grouping/recommendation purpose.
    - Use regression to suggest a reasonable price range to the owner.

### 5. Results
    - Answer questions in step 1 'Business understanding' with data visualization.

## 2. Data understanding

In this part, the three datasets in csv format are imported into dataframes using `pandas` and explored with standard pandas functions in the cells below. The information obtained about the datasets is summarized in this cell.

**Calendar dataset** contains the price information for each listing over one calendar year. There are 3818 unique `listing_id` values in the dataset, and each `listing_id` has 365 rows, one per day between _2016-01-04 and 2017-01-02_.<br>
The `available` column has two unique values, _'t'_ and _'f'_, meaning _True_ and _False_. When a listing is not available on a given day, its `price` is _NaN_.
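
The two claims above (equal row counts per listing, and a missing price exactly when a listing is unavailable) can be checked with a few lines of pandas. The following is a minimal sketch that uses a tiny hand-made frame, `toy_calendar`, with made-up values in place of the real `calendar.csv`; the variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for calendar.csv: two listings, three days each.
toy_calendar = pd.DataFrame({
    'listing_id': [1, 1, 1, 2, 2, 2],
    'date': ['2016-01-04', '2016-01-05', '2016-01-06'] * 2,
    'available': ['t', 't', 'f', 'f', 't', 'f'],
    'price': ['$85.00', '$85.00', np.nan, np.nan, '$120.00', np.nan],
})

# Every listing should have the same number of rows (365 in the real data).
rows_per_listing = toy_calendar.groupby('listing_id').size()
same_row_count = rows_per_listing.nunique() == 1

# price should be missing exactly when available == 'f'.
price_missing = toy_calendar['price'].isna()
unavailable = toy_calendar['available'] == 'f'
price_matches_availability = bool((price_missing == unavailable).all())

print(same_row_count, price_matches_availability)
```

On the real data, the same two checks would be run against the `calender` dataframe loaded below.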

**Listing dataset** contains the full description of each listing as scraped on _2016-01-04_, with 3818 rows describing 3818 unique listings. <br>
There are 92 columns/features in this dataset, so the information needs to be used selectively in later sections. <br>
Judging from the column names, the features can be roughly divided into the following categories: listing info, host info, location, room/house info, price and booking, reviews, and policies.

**Reviews dataset** contains all the review entries for the above-mentioned 3818 listings from 2009 up to _2016-01-03_. Each row records a review's metadata and its detailed text comments for a listing on a given day, without a numerical score feature. <br>
The sentiment of each comment could be predicted with an NLP model, but that is outside the objectives of this analysis, and the _listing_ dataset already has review features as numerical scores. Thus, the _reviews_ dataset will not be used for further analysis.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

calender_path = '/Users/clairegong/Desktop/UdacityDataScienceNanoDegree/seattle airbnb dataset/calendar.csv'
listing_path = '/Users/clairegong/Desktop/UdacityDataScienceNanoDegree/seattle airbnb dataset/listings.csv'
reviews_path = '/Users/clairegong/Desktop/UdacityDataScienceNanoDegree/seattle airbnb dataset/reviews.csv'

calender=pd.read_csv(calender_path)
listing=pd.read_csv(listing_path)
reviews=pd.read_csv(reviews_path)


In [57]:
calender.head(5)

Unnamed: 0,listing_id,date,available,price
0,241032,2016-01-04,t,$85.00
1,241032,2016-01-05,t,$85.00
2,241032,2016-01-06,f,
3,241032,2016-01-07,f,
4,241032,2016-01-08,f,


In [56]:
calender.listing_id.nunique()

3818

In [41]:
calender.listing_id.value_counts().head()

6752031     365
7404370     365
1259305     365
4672934     365
10310373    365
Name: listing_id, dtype: int64

In [35]:
print('Date range of calender data is between {} and {}.'.format(calender.date.min(),calender.date.max()))

Date range of calender data is between 2016-01-04 and 2017-01-02.


In [3]:
calender.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1393570 entries, 0 to 1393569
Data columns (total 4 columns):
listing_id    1393570 non-null int64
date          1393570 non-null object
available     1393570 non-null object
price         934542 non-null object
dtypes: int64(1), object(3)
memory usage: 42.5+ MB


In [27]:
# listing.describe()

In [3]:
# listing.info() #3818 entries,92 columns

In [60]:
listing.shape

(3818, 92)

In [45]:
listing.columns # 92 columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
       'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', '

In [54]:
listing.sample()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
3468,6118198,https://www.airbnb.com/rooms/6118198,20160104002432,2016-01-04,Private Suite in Modern Home,"Modern, clean, and light-filled! Private bedro...",All the comforts of home! A comfortable queen-...,"Modern, clean, and light-filled! Private bedro...",none,We're lucky to live in the Alaska junction nei...,...,10.0,f,,WASHINGTON,t,moderate,f,f,1,2.8


In [53]:
# listing.iloc[:,:10].sample()

In [22]:
reviews.sample(5)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
15313,9460,1510659,2012-06-18,1866904,Katie,I wish I could give 10 stars for Siena's apart...
47300,1956553,32650545,2015-05-19,4482226,Andrew,Mia and Chris made it simple to check-in and c...
28916,5057466,30488346,2015-04-22,335510,Sun,Eli's friendly hospitality and beautiful home ...
15627,20927,2360406,2012-09-21,3125596,Sarah,Absolutely loved my stay at the cottage! Liz a...
40439,6451305,43050050,2015-08-17,10490385,Austin,My experience in Seattle was amazing! I couldn...


In [23]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84849 entries, 0 to 84848
Data columns (total 6 columns):
listing_id       84849 non-null int64
id               84849 non-null int64
date             84849 non-null object
reviewer_id      84849 non-null int64
reviewer_name    84849 non-null object
comments         84831 non-null object
dtypes: int64(3), object(3)
memory usage: 3.9+ MB


In [39]:
print('Date range of review data is between {} and {}.'.format(reviews.date.min(),reviews.date.max()))

Date range of review data is between 2009-06-07 and 2016-01-03.


## 3. Data preparation

The _listing_ and _calendar_ datasets are processed in this section so that they are ready for consumption by the prediction models. Based on the information gathered in the previous part, the following processing is in order: data cleaning, feature engineering, imputing missing values, handling categorical values, etc.
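
As a rough illustration of the imputation and categorical-handling steps named above, here is a minimal sketch on a made-up frame. The frame `toy_listing`, the median-fill strategy, and the choice to treat a missing flag as `False` are all assumptions for illustration; the notebook's actual choices may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical mini version of the listing table (values are made up).
toy_listing = pd.DataFrame({
    'bathrooms': [1.0, np.nan, 2.0, 1.0],
    'room_type': ['Entire home/apt', 'Private room', 'Private room', 'Shared room'],
    'host_is_superhost': ['t', 'f', None, 't'],
})

prepared = toy_listing.copy()

# Numeric feature: fill missing values with the column median (assumed strategy).
prepared['bathrooms'] = prepared['bathrooms'].fillna(prepared['bathrooms'].median())

# 't'/'f' flag: compare against 't'; a missing flag becomes False (assumption).
prepared['host_is_superhost'] = prepared['host_is_superhost'].eq('t')

# Categorical feature: one-hot encode into indicator columns.
prepared = pd.get_dummies(prepared, columns=['room_type'], prefix='room_type')

print(prepared)
```

One-hot encoding keeps every category as its own column, which suits both distance-based clustering and linear regression, at the cost of a wider table.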



**Calendar dataset**

In [13]:
# Data cleaning
# Remove dollar signs ($) and thousands separators (,) from the price feature
calender.price = calender.price.replace(r'[\$,]', '', regex=True)
calender.price = pd.to_numeric(calender.price)

# Convert the available feature from 't'/'f' to booleans so it is meaningful and convenient for further calculation.
# Assigning the mapped column back avoids relying on inplace=True on a column accessed as an attribute.
calender['available'] = calender['available'].map({'t': True, 'f': False})
calender.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1393570 entries, 0 to 1393569
Data columns (total 4 columns):
listing_id    1393570 non-null int64
date          1393570 non-null object
available     1393570 non-null bool
price         934542 non-null float64
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 33.2+ MB
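
With `available` stored as booleans, a rough per-listing occupancy proxy is the share of days marked unavailable, which relates to the occupancy question in the business-understanding step. The sketch below uses a made-up frame, `toy_calendar`. It assumes that an unavailable day means a booked day, which is not guaranteed since a host may simply block dates, so the result is best read as an upper bound.

```python
import pandas as pd

# Hypothetical calendar rows after cleaning: two listings, four days each.
toy_calendar = pd.DataFrame({
    'listing_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'available': [True, True, False, False, False, False, False, True],
})

# The mean of a boolean column is the proportion of True values,
# so 1 - mean(available) is the share of unavailable days per listing.
occupancy_proxy = 1 - toy_calendar.groupby('listing_id')['available'].mean()

print(occupancy_proxy)
```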


In [80]:
# Feature engineering
# One goal is to examine price trends by neighbourhood, so a suitable neighbourhood feature is needed
# from the listing dataset. After some checking, neighbourhood_group_cleansed is the best candidate;
# merge it into the calender dataset on listing_id.
neighbourhood_group = listing[['id','neighbourhood_group_cleansed']].\
    rename(columns={'id':'listing_id','neighbourhood_group_cleansed':'neighbourhood'})
calender = calender.merge(neighbourhood_group, how='left', on='listing_id')

calender.sample(5)

Unnamed: 0,listing_id,date,available,price,neighbourhood
1080641,7651702,2016-09-01,True,100.0,Capitol Hill
293451,10126050,2016-12-25,False,,University District
749665,23356,2016-11-19,True,240.0,Beacon Hill
500388,23430,2016-12-07,False,,Downtown
1224169,5123904,2016-11-23,False,,Other neighborhoods
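
Once the neighbourhood is attached to each calendar row, price trends over the year can be summarized by month and neighbourhood. The following is a minimal sketch on made-up rows; `toy_calendar`, the derived `month` column, and `monthly_price` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical calendar rows after cleaning and merging (values are made up).
toy_calendar = pd.DataFrame({
    'date': ['2016-01-04', '2016-01-05', '2016-07-01', '2016-07-02', '2016-07-03'],
    'price': [100.0, np.nan, 150.0, 170.0, 90.0],
    'neighbourhood': ['Downtown', 'Downtown', 'Downtown', 'Downtown', 'Capitol Hill'],
})

# Derive the month from the date string.
toy_calendar['month'] = pd.to_datetime(toy_calendar['date']).dt.month

monthly_price = (
    toy_calendar
    .groupby(['neighbourhood', 'month'])['price']
    .mean()            # NaN prices (unavailable days) are skipped by default
    .unstack('month')  # rows: neighbourhood, columns: month
)

print(monthly_price)
```

The resulting table has one row per neighbourhood and one column per month, which plots directly as a line chart or heatmap.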


**Listing dataset**

In [81]:
listing.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
       'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', '

In [124]:
listing.bed_type.value_counts()
# The features below are discarded as they do not provide substantial information.
# experiences_offered: 3055/3055 'none'
# host_acceptance_rate: 3044/3045 is 100%
# host_total_listings_count has the same info as host_listings_count
# host_verifications could be useful, by splitting the list into elements and then into categorical values such as 'has_email',
    # but it is not worth the effort in this case.
# host_has_profile_pic: 3809/3817 't'

Real Bed         3657
Futon              74
Pull-out Sofa      47
Airbed             27
Couch              13
Name: bed_type, dtype: int64

In [120]:
listing.iloc[:,0:10].sample()
listing.iloc[:,10:20].sample()
listing.iloc[:,20:30].sample()
listing.iloc[:,30:40].sample()
listing.iloc[:,40:50].sample()
listing.iloc[:,50:60].sample()

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price
1664,6,2.0,2.0,3.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",,$245.00,,


In [133]:
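# Split the amenities string on commas into one column per position
test=listing.amenities.str.split(',',expand=True)
# DataFrame.apply requires a function; here the values in each column are counted
count_amenities=test.apply(pd.Series.value_counts, axis=0)
test.head()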

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,{TV,"""Cable TV""",Internet,"""Wireless Internet""","""Air Conditioning""",Kitchen,Heating,"""Family/Kid Friendly""",Washer,Dryer},...,,,,,,,,,,
1,{TV,Internet,"""Wireless Internet""",Kitchen,"""Free Parking on Premises""","""Buzzer/Wireless Intercom""",Heating,"""Family/Kid Friendly""",Washer,Dryer,...,,,,,,,,,,
2,{TV,"""Cable TV""",Internet,"""Wireless Internet""","""Air Conditioning""",Kitchen,"""Free Parking on Premises""","""Pets Allowed""","""Pets live on this property""",Dog(s),...,Shampoo},,,,,,,,,
3,{Internet,"""Wireless Internet""",Kitchen,"""Indoor Fireplace""",Heating,"""Family/Kid Friendly""",Washer,Dryer,"""Smoke Detector""","""Carbon Monoxide Detector""",...,,,,,,,,,,
4,{TV,"""Cable TV""",Internet,"""Wireless Internet""",Kitchen,Heating,"""Family/Kid Friendly""","""Smoke Detector""","""Carbon Monoxide Detector""","""First Aid Kit""",...,,,,,,,,,,
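
The positional split above leaves braces and quotes in the values and scatters the same amenity across different columns. One possible alternative, shown here as a sketch on made-up strings rather than as the notebook's method, is to strip the braces and quotes and build one indicator column per amenity with `Series.str.get_dummies`. The names `toy_amenities`, `amenity_flags`, and `amenity_counts` are hypothetical.

```python
import pandas as pd

# Hypothetical amenities strings in the same format as the listing data.
toy_amenities = pd.Series([
    '{TV,"Cable TV",Internet,Kitchen}',
    '{Internet,"Wireless Internet",Kitchen}',
    '{}',
])

# Drop the surrounding braces and the double quotes around multi-word names.
cleaned = toy_amenities.str.strip('{}').str.replace('"', '', regex=False)

# One 0/1 column per distinct amenity; an empty string yields an all-zero row.
amenity_flags = cleaned.str.get_dummies(sep=',')

# How many listings offer each amenity, most common first.
amenity_counts = amenity_flags.sum().sort_values(ascending=False)

print(amenity_counts)
```

This assumes no amenity name itself contains a comma; if one did, a proper CSV-style parser would be needed instead of a plain separator split.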


In [None]:
# As there are so many features, I first need to select those that are possibly useful in predicting prices and
# answering the business questions. Features with descriptive text values are not useful in this context,
# so I will exclude them all, with the exception of categorical values.
# This process involves examining many features; to keep the code clean, that exploration is not included.
# The tip is to examine EVERY feature; an efficient way is to do this in BATCHES of 10 features at a time.
features = ['id','host_since','host_response_time','host_response_rate','host_is_superhost','host_listings_count',\
            'host_identity_verified','neighbourhood_group_cleansed','is_location_exact','property_type','room_type',\
           'accommodates','bathrooms','bedrooms','beds','bed_type',]