# AirBnB Project for **Project: Write A Data Science Blog Post**

### 0.1 Key Steps for Project

Feel free to be creative with your solutions, but do follow the CRISP-DM process in finding your solutions.

1) Pick a dataset.

2) Pose at least three questions related to business or real-world applications of how the data could be used.

3) Create a Jupyter Notebook, using any associated packages you'd like, to:

    * Prepare data:
        * Gather necessary data to answer your questions
        * Handle categorical and missing data
        * Provide insight into the methods you chose and why you chose them
    * Analyze, Model, and Visualize:
        * Provide a clear connection between your business questions and how the data answers them.

4) Communicate your business insights:

    * Create a Github repository to share your code and data wrangling/modeling techniques, with a technical audience in mind
    * Create a blog post to share your questions and insights with a non-technical audience

Your deliverables will be a Github repo and a blog post. Use the rubric here to assist in successfully completing this project!

### 0.2 [Rubric](https://review.udacity.com/#!/rubrics/1507/view)

#### Code Functionality and Readability
* Code is readable (uses good coding practices - PEP8) 
* Code is functional.
* Write code that is well documented and uses functions and classes as necessary.

#### Data
* Project follows the CRISP-DM Process while analyzing their data.
* Proper handling of categorical and missing values in the dataset.
* Categorical variables are handled appropriately for machine learning models (if models are created). 

#### Analysis, Modeling, Visualization
* There are 3-5 business questions answered.
	
#### Github Repository
* Student must publish their code in a public Github repository.
	
#### Blog Post
* Communicate their findings with stakeholders.
* There should be an intriguing title and image related to the project.
* The body of the post has paragraphs that are broken up by appropriate white space and images.
* Each question has a clearly communicated solution.

### 0.3 CRISP-DM

## 1.1 Header

In [92]:
%matplotlib inline
import pandas as pd
import numpy as np
import os.path as op
import ast
import os


In [93]:
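# Build the path with os.path.join so the separator is not written as a
# backslash escape ("\A" is an invalid escape sequence in a normal string).
PATH = op.join(os.getcwd(), "All Data")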
PATH

'C:\\Users\\tcanty\\Documents\\Udacity\\DSND_Term2\\project_files\\AirBnB\\All Data'

In [94]:
os.listdir(PATH)

['b_calendar.csv',
 'b_listings.csv',
 'b_reviews.csv',
 's_calendar.csv',
 's_listings.csv',
 's_reviews.csv']

In [95]:
df_listing = pd.read_csv(op.join(PATH, 's_listings.csv'), index_col=0)

In [96]:
df_cal = pd.read_csv(op.join(PATH, 's_calendar.csv'), index_col=0)

In [97]:
df_rev = pd.read_csv(op.join(PATH, 's_reviews.csv'), index_col=0)
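
The three reads above follow the same pattern, so they could be wrapped in one function, in line with the rubric's request for functions. The following is a minimal sketch, not the notebook's own code: `load_city_tables` is a hypothetical helper name, and it assumes the `<prefix>_listings.csv`, `<prefix>_calendar.csv`, `<prefix>_reviews.csv` naming seen in the directory listing. It writes small throwaway files so it can run without the real data.

```python
import os.path as op
import tempfile

import pandas as pd


def load_city_tables(data_dir, prefix):
    """Read <prefix>_listings/calendar/reviews.csv from data_dir into a dict."""
    tables = {}
    for name in ("listings", "calendar", "reviews"):
        csv_path = op.join(data_dir, f"{prefix}_{name}.csv")
        # The first column is used as the index, as in the cells above.
        tables[name] = pd.read_csv(csv_path, index_col=0)
    return tables


# Demonstration with made-up files, so the sketch does not need the real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("listings", "calendar", "reviews"):
        pd.DataFrame({"id": [1, 2], "value": [10, 20]}).to_csv(
            op.join(tmp_dir, f"s_{name}.csv"), index=False
        )
    tables = load_city_tables(tmp_dir, "s")

print({name: df.shape for name, df in tables.items()})
```

With the real data, a call such as `load_city_tables(PATH, "s")` would return the Seattle tables, assuming the `s_` prefix denotes Seattle.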

In [98]:
df_listing['host_acceptance_rate'] = df_listing['host_acceptance_rate'].str.strip("%")

In [99]:
df_listing['host_acceptance_rate'] = df_listing['host_acceptance_rate'].astype('float')

In [100]:
df_listing['host_response_rate'] = df_listing['host_response_rate'].str.strip("%")

In [101]:
df_listing['host_response_rate'] = df_listing['host_response_rate'].astype('float')

In [102]:
df_listing['host_response_rate'] = df_listing['host_response_rate'].apply(lambda x: x/100)
df_listing['host_acceptance_rate'] = df_listing['host_acceptance_rate'].apply(lambda x: x/100)
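
The percent-string cleanup in the cells above (strip the sign, cast to float, divide by 100) could be collected into a single function. This is a sketch under assumptions: `percent_to_fraction` is a hypothetical name, the demo values are made up, and `pd.to_numeric` is swapped in for `astype('float')`.

```python
import numpy as np
import pandas as pd


def percent_to_fraction(series):
    """Convert strings such as '96%' to floats such as 0.96, keeping missing values."""
    # Remove the trailing percent sign, parse the digits, then rescale to [0, 1].
    return pd.to_numeric(series.str.rstrip("%")) / 100


# Toy data standing in for df_listing; the values are invented for illustration.
demo = pd.DataFrame({
    "host_response_rate": ["96%", "100%", np.nan],
    "host_acceptance_rate": ["100%", np.nan, "50%"],
})
for col in ["host_response_rate", "host_acceptance_rate"]:
    demo[col] = percent_to_fraction(demo[col])

print(demo)
```

Dividing the whole Series by 100 is vectorized, so the `apply(lambda x: x/100)` step is not needed in this form.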

In [103]:
df_listing['host_response_time'] = df_listing['host_response_time'].astype('category')

In [104]:
df_listing['host_location'] = df_listing['host_location'].astype('category')

In [105]:
df_listing['host_neighbourhood'] = df_listing['host_neighbourhood'].astype('category')

In [106]:
df_listing['host_is_superhost'] = df_listing['host_is_superhost'].replace({'t': True,'f':False})

In [107]:
df_listing['host_has_profile_pic'] = df_listing['host_has_profile_pic'].replace({'t': True,'f':False})

In [108]:
df_listing['host_has_profile_pic'] = df_listing['host_has_profile_pic'].astype(bool)
df_listing['host_is_superhost'] = df_listing['host_is_superhost'].astype(bool)

In [109]:
df_listing['host_identity_verified'] = df_listing['host_identity_verified'].replace({'t': True,'f':False})
df_listing['host_identity_verified'] = df_listing['host_identity_verified'].astype(bool)
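
One caveat with the `replace` plus `astype(bool)` pattern: if a flag column contains missing values, `astype(bool)` turns each NaN into `True`, because NaN is truthy in Python. Whether these particular columns contain missing values is not shown above, so the following is only a sketch of an alternative that keeps them visible. `flag_to_bool` is a hypothetical helper name and the demo Series is made up.

```python
import numpy as np
import pandas as pd


def flag_to_bool(series):
    """Map 't'/'f' strings to True/False and keep missing values as <NA>."""
    # Series.map leaves entries that are not in the dict (including NaN) as
    # missing, and the nullable "boolean" dtype can store them.
    return series.map({"t": True, "f": False}).astype("boolean")


demo = pd.Series(["t", "f", np.nan], name="host_is_superhost")
converted = flag_to_bool(demo)

print(converted)
print("missing flags:", int(converted.isna().sum()))
```

The missing entries can then be filled or dropped deliberately during the missing-data step, rather than being silently counted as `True`.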

In [110]:
df_listing['is_location_exact'] = df_listing['is_location_exact'].replace({'t': True,'f':False})

In [111]:
df_listing['last_scraped'] = pd.to_datetime(df_listing['last_scraped'])

In [112]:
df_listing['host_since'] = pd.to_datetime(df_listing['host_since'])

In [113]:
df_listing['neighbourhood'] = df_listing['neighbourhood'].astype('category')
df_listing['neighbourhood_cleansed'] = df_listing['neighbourhood_cleansed'].astype('category')
df_listing['neighbourhood_group_cleansed'] = df_listing['neighbourhood_group_cleansed'].astype('category')
df_listing['city'] = df_listing['city'].astype('category')
df_listing['state'] = df_listing['state'].astype('category')
df_listing['zipcode'] = df_listing['zipcode'].astype('category')

In [114]:
df_listing['market'] = df_listing['market'].astype('category')
df_listing['smart_location'] = df_listing['smart_location'].astype('category')
df_listing['country_code'] = df_listing['country_code'].astype('category')
df_listing['country'] = df_listing['country'].astype('category')
df_listing['property_type'] = df_listing['property_type'].astype('category')
df_listing['room_type'] = df_listing['room_type'].astype('category')
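
The repeated `astype('category')` lines could also be expressed as one call over a list of column names. A minimal sketch, assuming a hypothetical helper `cast_to_category` and a made-up demo frame; the column names are taken from the cells above.

```python
import pandas as pd

# Column names copied from the cells above.
CATEGORICAL_COLUMNS = [
    "market", "smart_location", "country_code",
    "country", "property_type", "room_type",
]


def cast_to_category(df, columns):
    """Return a copy of df with the listed columns (where present) cast to category."""
    present = [col for col in columns if col in df.columns]
    # DataFrame.astype accepts a {column: dtype} mapping.
    return df.astype({col: "category" for col in present})


# Toy frame standing in for df_listing; the values are invented.
demo = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "property_type": ["House", "Apartment", "House"],
    "accommodates": [4, 2, 2],
})
demo = cast_to_category(demo, CATEGORICAL_COLUMNS)

print(demo.dtypes)
```

Columns that are not in the list, such as `accommodates` here, keep their original dtype.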



In [115]:
df_listing.iloc[:,30:40].head()

| id | host_verifications | host_has_profile_pic | host_identity_verified | street | neighbourhood | neighbourhood_cleansed | neighbourhood_group_cleansed | city | state | zipcode |
|---|---|---|---|---|---|---|---|---|---|---|
| 241032 | ['email', 'phone', 'reviews', 'kba'] | True | True | Gilman Dr W, Seattle, WA 98119, United States | Queen Anne | West Queen Anne | Queen Anne | Seattle | WA | 98119 |
| 953595 | ['email', 'phone', 'facebook', 'linkedin', 're... | True | True | 7th Avenue West, Seattle, WA 98119, United States | Queen Anne | West Queen Anne | Queen Anne | Seattle | WA | 98119 |
| 3308979 | ['email', 'phone', 'google', 'reviews', 'jumio'] | True | True | West Lee Street, Seattle, WA 98119, United States | Queen Anne | West Queen Anne | Queen Anne | Seattle | WA | 98119 |
| 7421966 | ['email', 'phone', 'facebook', 'reviews', 'jum... | True | True | 8th Avenue West, Seattle, WA 98119, United States | Queen Anne | West Queen Anne | Queen Anne | Seattle | WA | 98119 |
| 278830 | ['email', 'phone', 'facebook', 'reviews', 'kba'] | True | True | 14th Ave W, Seattle, WA 98119, United States | Queen Anne | West Queen Anne | Queen Anne | Seattle | WA | 98119 |


In [116]:
df_listing.iloc[:,40:50].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3818 entries, 241032 to 10208623
Data columns (total 10 columns):
market               3818 non-null category
smart_location       3818 non-null category
country_code         3818 non-null category
country              3818 non-null category
latitude             3818 non-null float64
longitude            3818 non-null float64
is_location_exact    3818 non-null bool
property_type        3817 non-null category
room_type            3818 non-null category
accommodates         3818 non-null int64
dtypes: bool(1), category(6), float64(2), int64(1)
memory usage: 146.9 KB


In [188]:
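# Note: astype(list) only casts the column to object dtype; it does not parse
# the strings into Python lists. The parsing is done with ast.literal_eval.
df_listing['host_verifications'] = df_listing['host_verifications'].astype(list)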

In [189]:
df_host_verification = df_listing[['host_verifications']]
df_host_verification.head()

| id | host_verifications |
|---|---|
| 241032 | ['email', 'phone', 'reviews', 'kba'] |
| 953595 | ['email', 'phone', 'facebook', 'linkedin', 're... |
| 3308979 | ['email', 'phone', 'google', 'reviews', 'jumio'] |
| 7421966 | ['email', 'phone', 'facebook', 'reviews', 'jum... |
| 278830 | ['email', 'phone', 'facebook', 'reviews', 'kba'] |


In [191]:
# Parse each string such as "['email', 'phone']" into a real Python list.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when writing into a slice of df_listing.
df_host_verification = df_host_verification.assign(
    host_verifications=df_host_verification['host_verifications'].apply(ast.literal_eval)
)


In [192]:
# Select by row label and column name rather than by position within the row.
test = df_host_verification.loc[241032, 'host_verifications']

In [195]:
socmed = df_host_verification.host_verifications.apply(pd.Series)

In [244]:
df_hv = df_host_verification.merge(socmed, right_index=True, left_index=True)

In [245]:
df_hv = df_hv.reset_index().melt(id_vars=['id','host_verifications'],value_name = 'host_sm_ver')


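In [243]:
# This cell was executed before cells 244-245 were re-run. When the notebook is
# run top to bottom, dropping 'variable' here would break the pivot_table call
# in the next cell, which counts that column, so the drop is left commented out.
# df_hv = df_hv.drop(columns=['variable'])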

In [248]:
dv_hv = df_hv.pivot_table(values='variable', columns='host_sm_ver',index='id',aggfunc='count',fill_value=0)
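
The merge, melt and pivot steps above build one indicator column per verification method. As an alternative sketch, and not the notebook's own method, the same kind of table can be produced with `Series.explode` (available from pandas 0.25) and `pd.get_dummies`. The demo frame and its ids are made up for illustration.

```python
import ast

import pandas as pd

# Toy frame standing in for df_host_verification; ids and values are invented.
demo = pd.DataFrame(
    {"host_verifications": [
        "['email', 'phone', 'reviews']",
        "['email', 'facebook']",
        "[]",
    ]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Parse the strings into lists, then give each list element its own row.
parsed = demo["host_verifications"].apply(ast.literal_eval)
exploded = parsed.explode().dropna()  # empty lists explode to NaN and are dropped

# One indicator column per method, summed back to one row per listing id.
indicators = (
    pd.get_dummies(exploded)
    .groupby(level=0)
    .sum()
    .reindex(demo.index, fill_value=0)  # restore ids that had no verifications
)

print(indicators)
```

The `reindex` step matters for hosts with an empty verification list: without it they would disappear from the result instead of appearing as a row of zeros.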