1) Pick a dataset.

Boston Airbnb Rental Prices

2) Pose at least three questions related to business or real-world applications of how the data could be used.

What features affect the rental price?
What are the predicted Boston Airbnb rental prices?
How are Boston Airbnb rental prices related to the seasons?

3) Create a Jupyter Notebook, using any associated packages you'd like, to:

Prepare data:
Gather necessary data to answer your questions
Handle categorical and missing data
Provide insight into the methods you chose and why you chose them
Analyze, Model, and Visualize
Provide a clear connection between your business questions and how the data answers them.

4) Communicate your business insights:

Create a Github repository to share your code and data wrangling/modeling techniques, with a technical audience in mind
Create a blog post to share your questions and insights with a non-technical audience
Your deliverables will be a Github repo and a blog post. Use the rubric here to assist in successfully completing this project!

In [None]:
# Import library 
import pandas as pd
import numpy as np
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import style

import seaborn as sns
sns.set_style('darkgrid')

In [None]:
import os 
# get the current working directory
cwd = os.getcwd()
print(cwd)

In [None]:
# Import data to dataframes
list = pd.read_csv('listings.csv')
print(list.head())
print(list.info()) # check basic information 

The listings data contains basic information about Airbnb listings in the Boston area.

In [None]:
reviews = pd.read_csv('reviews.csv')
print(reviews.head())
reviews.info()

In [None]:
calendar = pd.read_csv("calendar.csv")
print(calendar.head())
calendar.info()

CRISP-DM
1. STEP 1. Business Understanding: which question do we want to answer?
        Question: What features affect the price?
        
2. STEP 2. Data Understanding: what kind of data do we need to find the insights?

3. STEP 3. Data Preparation: data wrangling and cleaning
        (no need for this question)
        
4. STEP 4. Modeling:
5. STEP 5. Evaluation:
6. STEP 6. Deployment: (visualization and clear communication)

In [None]:
# Copy the dataframe
list_clean = list.copy()
reviews_clean = reviews.copy()
calendar_clean = calendar.copy()
# Merge the reviews and calendar datasets into one
rew_cal = pd.merge(reviews_clean,calendar_clean,on = 'listing_id',how = "inner")

In [None]:
rew_cal.info()

In [None]:
# STEP 2. Data Understanding: what kind of data do we need to find the insights?

In [None]:
# Inspect the column names
col_list = list_clean.columns
print(col_list)

In [None]:
list_clean["reviews_per_month"]

In [None]:
# First, drop columns that are not useful for the analysis (not related to the rental price, free-text fields, URLs)
cols = ['listing_url', 'scrape_id', 'last_scraped', 'name', 'summary','thumbnail_url','medium_url','picture_url',
        'description', 'experiences_offered', 'neighborhood_overview',
       'host_id', 'host_url', 'host_name', 'host_location',
       'host_about',  'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
        'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_group_cleansed', 'market',
       'smart_location', 'is_location_exact', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'requires_license',
       'license', 'jurisdiction_names', 'instant_bookable',
    'calculated_host_listings_count',
       'reviews_per_month','xl_picture_url']
list_clean.drop(cols, axis=1, inplace=True)

In [None]:
print(list_clean)
list_clean.info()

In [None]:
# Drop the columns where more than half of the values are missing
cols = list_clean.columns[list_clean.isnull().sum()/list_clean.shape[0] > 0.5]
list_clean.drop(cols, axis=1, inplace=True)
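
As a small illustration of the missing-value filter above, the sketch below applies the same 50% threshold to a made-up frame. The column names and values are hypothetical and not taken from listings.csv; `isnull().mean()` is used here as an equivalent way to get the per-column missing fraction.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'square_feet' is mostly missing, the others are not.
toy = pd.DataFrame({
    "price": [100, 150, 200, 250],
    "square_feet": [np.nan, np.nan, np.nan, 800],
    "bedrooms": [1, np.nan, 2, 3],
})

# Fraction of missing values per column (same as isnull().sum() / shape[0]).
missing_frac = toy.isnull().mean()

# Drop the columns where more than half of the values are missing.
sparse_cols = missing_frac[missing_frac > 0.5].index
toy_dense = toy.drop(columns=sparse_cols)

print(missing_frac)
print(toy_dense.columns.tolist())
```

Columns that survive the filter can still contain missing values, so rows with NaN may need to be handled again before modeling.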

In [None]:
print(list_clean)
list_clean.info()

In [None]:
# Next, fix some datatype errors: extract the numbers and convert them to numeric types
list_clean.info()
list_clean.head()

In [None]:
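# Strip the "$" and "," characters, then convert the price to a number.
# (Extracting only the first run of digits would truncate a value such as "$1,250.00" to 1.)
list_clean['price'] = list_clean['price'].str.replace(r'[$,]', '', regex=True).astype(float)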

In [None]:
print(list_clean['price'])
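
The sketch below shows why the way a currency string is parsed matters. It assumes the raw price strings look like "$1,250.00" (the notebook does not print sample raw values, so the inputs here are hypothetical) and compares taking the first run of digits with stripping the currency symbols before converting.

```python
import pandas as pd

# Hypothetical price strings, assumed to follow a "$1,250.00" style.
raw = pd.Series(["$85.00", "$1,250.00", None])

# Taking only the first run of digits stops at the thousands separator.
first_digits = raw.str.extract(r"(\d+)", expand=False)

# Stripping "$" and "," before converting keeps the full amount.
parsed = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True))

print(first_digits.tolist())
print(parsed.tolist())
```

Missing values stay missing in both versions, which is why a float result is easier to work with than an integer one.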

In [None]:
# Strip "$" and "," and convert to float (missing fees stay NaN)
list_clean['cleaning_fee'] = list_clean['cleaning_fee'].str.replace(r'[$,]', '', regex=True).astype(float)

In [None]:
print(list_clean['cleaning_fee'])

In [None]:
# Strip "$" and "," and convert to float
list_clean['extra_people'] = list_clean['extra_people'].str.replace(r'[$,]', '', regex=True).astype(float)
print(list_clean['extra_people'])

In [None]:
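# Keep the leading digits of the zip code (raw string avoids an invalid-escape warning)
list_clean['zipcode'] = list_clean['zipcode'].str.extract(r'(\d+)', expand=False).astype(float)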
print(list_clean['zipcode'])

In [None]:
# Change datatype for host_since
list_clean['host_since'] = pd.to_datetime(list_clean.host_since)
print(list_clean['host_since'])

In [None]:
# Select the subset of columns with "int" and "float" dtypes to explore Pearson's correlation
df_num = list_clean.select_dtypes(include=['int64','int32','float64'])
df_num.head()
df_num.info()

In [None]:
# Use a scatterplot to explore the location (longitude on x, latitude on y, so the plot is oriented like a map)
sns.scatterplot(data = list_clean, x = "longitude", y ="latitude",hue = "price", palette ="Blues")

We can see there is no clear relationship between price and location.

In [None]:
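# Visualize the price distribution
# (sns.distplot is deprecated; histplot with stat='density' draws the same kind of plot)
sns.histplot(df_num['price'], bins=20, kde=True, stat='density')
plt.ylabel('Density', fontsize=11)
plt.xlabel('Price (dollar)', fontsize=11)
plt.title('Listed Price Distribution', fontsize=12);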

In [None]:
# visualize the correlation matrix
corr = df_num.corr()
mask = np.zeros_like(corr) #Use a mask to plot only part of a matrix
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(18, 16))
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True, annot=True, fmt='.2f', cmap='coolwarm')

Based on the numerical data, price is most strongly correlated with accommodates, bedrooms, beds, cleaning_fee, and guests_included.
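
Reading the strongest correlations off a large heatmap can be error-prone. As a hedged sketch, the snippet below ranks columns by the absolute value of their Pearson correlation with price. The data is synthetic (price is constructed to depend on `accommodates`), so the numbers say nothing about the Boston listings.

```python
import numpy as np
import pandas as pd

# Synthetic data: price is driven mostly by 'accommodates' plus noise.
rng = np.random.default_rng(0)
n = 200
accommodates = rng.integers(1, 9, size=n)
toy = pd.DataFrame({
    "accommodates": accommodates,
    "unrelated": rng.normal(size=n),
    "price": 40 * accommodates + rng.normal(scale=20, size=n),
})

# Pearson correlation of every numeric column with price.
price_corr = toy.corr()["price"].drop("price")

# Order by absolute strength so strong negative correlations are not missed.
order = price_corr.abs().sort_values(ascending=False).index
ranked = price_corr.reindex(order)
print(ranked)
```

The same two lines applied to `df_num` would give a ranked list to compare against the heatmap.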

In [None]:
price_list = df_num.groupby(['bedrooms','bathrooms']).mean()['price']
sns.scatterplot(data = df_num, x = "bedrooms", y ="price",hue ='guests_included', palette ="Blues")

In [None]:
sns.scatterplot(data = df_num, x = "zipcode", y ="price", palette ="Blues")

In [None]:
sns.boxplot(x = "bedrooms", y ="price", data = df_num, palette ="Blues")

In [None]:
sns.regplot(x = "cleaning_fee", y ="price", data = df_num) 

In [None]:
# Now work with the categorical ("object") columns
df_cat = list_clean.select_dtypes(include=['object']) # subset dataframe 
df_cat.info()

In [None]:
df_cat.head()

In [None]:
# Take a quick look at specific columns and their relationship with price

In [None]:
sns.countplot(x='property_type', data=df_cat,palette="Set3")
plt.show()

sns.boxplot(x = "property_type", y ="price", data = list_clean, palette ="Blues")


In [None]:
sns.countplot(x='cancellation_policy', data=df_cat,palette="Set3")
plt.show()

sns.boxplot(x = "cancellation_policy", y ="price", data = list_clean, palette ="Blues")

In [None]:
sns.countplot(x='room_type', data=df_cat,palette="Set3")
plt.show()
sns.boxplot(x = "room_type", y ="price", data = list_clean, palette ="Blues")

In [None]:
sns.countplot(x='require_guest_phone_verification', data=df_cat, palette="Set3")   
plt.show()

sns.boxplot(x = "require_guest_phone_verification", y ="price", data = list_clean, palette ="Blues")

In [None]:
sns.countplot(x='require_guest_profile_picture', data=df_cat,palette="Set3")   
plt.show()
sns.boxplot(x = "require_guest_profile_picture", y ="price", data = list_clean, palette ="Blues")

In [None]:
sns.countplot(x='host_is_superhost', data=df_cat,palette="Set3")   
plt.show()
sns.boxplot(x = "host_is_superhost", y ="price", data = list_clean, palette ="Blues")

require_guest_phone_verification, room_type, and cancellation_policy appear to affect the price.
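
To back up what the boxplots suggest with numbers, one option is to compare the median price per category. The sketch below does this on a small invented frame; the category labels and prices are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical listings; labels and prices are invented for illustration.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room",
                  "Entire home/apt", "Private room", "Shared room"],
    "price": [220, 90, 45, 180, 70, 55],
})

# Median price and listing count per category, most expensive first.
summary = (toy.groupby("room_type")["price"]
              .agg(["median", "count"])
              .sort_values("median", ascending=False))
print(summary)
```

Including the count alongside the median helps flag categories with too few listings to support a conclusion.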

CRISP-DM
1. STEP 1. Business Understanding: which question do we want to answer?
        Question: Can we predict the prices?
        
2. STEP 2. Data Understanding: what kind of data do we need to find the insights?

3. STEP 3. Data Preparation: data wrangling and cleaning
        (no need for this question)
        
4. STEP 4. Modeling:
5. STEP 5. Evaluation:
6. STEP 6. Deployment: (visualization and clear communication)

In [None]:
# Multiple linear regression will be used to predict the price
y = list_clean['price']

In [None]:
# Based on the previous exploration,
# accommodates, bedrooms, beds, cleaning_fee, and guests_included,
# as well as require_guest_phone_verification, room_type, and cancellation_policy,
# appear to affect the price

In [None]:
# Dummy variables for Categorical Values
list_clean['require_guest_phone_verification'].hist() 

In [None]:
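# Code 'f' as 1 and 't' as 0. (get_dummies returns one column per category,
# which cannot be assigned to a single column.)
list_clean['require_guest_phone_verification'] = (list_clean['require_guest_phone_verification'] == 'f').astype(int)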

In [None]:
print(list_clean['require_guest_phone_verification'])

In [None]:
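# Keep a single indicator column: 1 for the first room type (alphabetically), 0 otherwise
list_clean['room_type'] = pd.get_dummies(list_clean['room_type']).iloc[:, 0].astype(int)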
list_clean['room_type'].hist()

In [None]:
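# Keep a single indicator column: 1 for the first cancellation policy (alphabetically), 0 otherwise
list_clean['cancellation_policy'] = pd.get_dummies(list_clean['cancellation_policy']).iloc[:, 0].astype(int)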
list_clean['cancellation_policy'].hist()
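
`pd.get_dummies` returns one indicator column per category, so a column with more than two levels needs several columns to be represented fully. The sketch below shows one common way to do that on an invented frame; the level names are hypothetical, and `drop_first=True` is a modeling choice assumed here, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical frame with one three-level categorical column.
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "moderate", "strict", "flexible"],
    "price": [80, 120, 200, 95],
})

# One indicator column per level; drop_first removes the baseline level
# to avoid perfectly collinear columns in a linear model.
encoded = pd.get_dummies(toy, columns=["cancellation_policy"],
                         drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With this encoding the dropped level ("flexible" in the toy data) becomes the baseline, and each remaining coefficient is read relative to it.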

In [None]:
# Then, make a prediction using multiple linear regression
# Parameters:
# Numerical data: accommodates, bedrooms, beds, cleaning_fee, and guests_included
# Categorical data: require_guest_phone_verification, room_type, cancellation_policy

Prepare the training data

Supervised ML process:
    1. Instantiate
    2. Fit the model using training data
    3. Predict the results based on the fitted model
    4. Score
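
The four steps can be sketched end to end on synthetic data before applying them to the listings. Everything in this block is illustrative: the features, coefficients, and the `_demo` names are made up, and the only point is the instantiate, fit, predict, score sequence.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on two features plus a little noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

model = LinearRegression()          # 1. instantiate
model.fit(X_tr, y_tr)               # 2. fit on the training data
y_pred = model.predict(X_te)        # 3. predict on held-out data
r2_demo = r2_score(y_te, y_pred)    # 4. score
print(round(r2_demo, 3))
```

Scoring on data the model did not see during fitting is what makes the R-squared a fair estimate of how well it generalizes.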



In [None]:
list_clean["guests_included"]

In [None]:
# Prepare the training data
# Select the predictors (numeric and encoded categorical) and the target
variables = list_clean[['accommodates', 'bedrooms', 'beds', 'cleaning_fee', 'guests_included',
                        'require_guest_phone_verification', 'room_type', 'cancellation_policy',
                        'price']]

# Remove all rows with NaN values
all_va = variables.dropna()

X = all_va[['accommodates', 'bedrooms', 'beds', 'cleaning_fee', 'guests_included',
            'require_guest_phone_verification', 'room_type', 'cancellation_policy']]
y = all_va['price']

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=32)

In [None]:
# Four steps of the supervised ML process
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Instantiate (the `normalize` argument was removed in scikit-learn 1.2;
# it does not change ordinary least-squares predictions)
lm_model = LinearRegression()
# Fit
lm_model.fit(X_train, y_train)
# Predict
y_test_preds = lm_model.predict(X_test)
# Score
r_test = r2_score(y_test, y_test_preds)  # R-squared on the test set
print("R-squared on the test dataset: " + str(r_test))


In [None]:
# Alternatively, use cross-validation
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Instantiate (the `normalize` argument was removed in scikit-learn 1.2)
lm_model = LinearRegression()
# Fit, predict, and score on 5 folds
scores = cross_val_score(lm_model, X, y, cv=5, scoring="r2")
print("Mean R-squared of %0.2f with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

CRISP-DM
1. STEP 1. Business Understanding: which question do we want to answer?
        Question: How is the rental price related to the seasons?
        
2. STEP 2. Data Understanding: what kind of data do we need to find the insights?

3. STEP 3. Data Preparation: data wrangling and cleaning
        (no need for this question)
        
4. STEP 4. Modeling:
5. STEP 5. Evaluation:
6. STEP 6. Deployment: (visualization and clear communication)

In [None]:
rew_cal.info()
rew_cal.head()

In [None]:
# To explore the relationship between date and price, calendar dataset is necessary
calendar = pd.read_csv("calendar.csv")
print(calendar.head())
calendar.info()

In [None]:
# Copy the subset so later column assignments do not trigger SettingWithCopyWarning
data_price = calendar[['date','price',"available"]].copy()
data_price.info()

In [None]:
data_price = data_price.dropna()
data_price.info()

In [None]:
data_price['date'] = pd.to_datetime(data_price['date'])

In [None]:
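# Strip the "$" and "," characters, then convert the price to a number.
# (Extracting only the first run of digits would truncate a value such as "$1,250.00" to 1.)
data_price['price'] = data_price['price'].str.replace(r'[$,]', '', regex=True).astype(float)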

In [None]:
data_price['available'].hist()

In [None]:
data_price["date"].head()

In [None]:
# Average price per calendar month
# (only the numeric price column is averaged; 'available' is a text column)
date_group = data_price.groupby(data_price['date'].dt.to_period('M'))['price'].mean()
print(date_group)

In [None]:
date_group.plot()

In [None]:
import seaborn as sns
sns.lineplot(x="date", y="price",
             data=data_price)

The rental price decreased from about 240 to about 180 over the period covered by the calendar data.
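
Since the question is about seasons rather than individual months, one further step could be to map each month to a season and average the price per season. The sketch below does this on invented data; the prices are hypothetical, and the month-to-season mapping assumes meteorological seasons for the Northern Hemisphere.

```python
import pandas as pd

# Hypothetical daily prices; the values are invented for illustration.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-15", "2017-04-15", "2017-07-15",
                            "2017-10-15", "2017-02-15", "2017-08-15"]),
    "price": [150, 180, 230, 210, 170, 250],
})

# Assumed mapping: meteorological seasons, Northern Hemisphere.
season_of_month = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "fall", 10: "fall", 11: "fall"}
toy["season"] = toy["date"].dt.month.map(season_of_month)

# Average price per season.
season_mean = toy.groupby("season")["price"].mean()
print(season_mean)
```

Applied to `data_price`, the same mapping would turn the monthly trend into a direct per-season comparison.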