# AirBnB Project for  **Project: Write A Data Science Blog Post**

### 0.1  Key Steps for Project

Feel free to be creative with your solutions, but do follow the CRISP-DM process in finding your solutions.

1) Pick a dataset.
   I chose the AirBnnB dataset.  Because why not..

2) Pose at least three questions related to business or real-world applications of how the data could be used.

3) Create a Jupyter Notebook, using any associated packages you'd like, to:

    Prepare data:
        Gather necessary data to answer your questions
        Handle categorical and missing data
        Provide insight into the methods you chose and why you chose them

    Analyze, Model, and Visualize
        Provide a clear connection between your business questions and how the data answers them.

4) Communicate your business insights:

    Create a Github repository to share your code and data wrangling/modeling techniques, with a technical audience in mind
    Create a blog post to share your questions and insights with a non-technical audience

Your deliverables will be a Github repo and a blog post. Use the rubric here to assist in successfully completing this project!

## 0.2 [Rubric](https://review.udacity.com/#!/rubrics/1507/view)

#### Code Functionality and Readability
* Code is readable (uses good coding practices - PEP8) 
* Code is functional.
* Write code that is well documented and uses functions and classes as necessary.

#### Data
* Project follows the CRISP-DM Process while analyzing their data.
* Proper handling of categorical and missing values in the dataset.
* Categorical variables are handled appropriately for machine learning models (if models are created). 

#### Analysis, Modeling, Visualization
* There are 3-5 business questions answered.
	
#### Github Repository
* Student must publish their code in a public Github repository.
	
#### Blog Post
* Communicate their findings with stakeholders.
* There should be an intriguing title and image related to the project.
* The body of the post has paragraphs that are broken up by appropriate white space and images.
* Each question has a clearly communicated solution.

##  0.3  CRISP-DM
### 0.3.1 Business Understanding/Data Understanding
          AirBnB is an online marketplace for vacation/temporary houseing rentals.  Thier members/hosts own the property and rent via the  AirBnB marketplace.
          
          The data provides was provided each from Seattle and Bostom
          * listing.csv
          * calendar.csv
          * reviews.csv
          
          
### 0.3.2 Data Preparation
#### 0.3.2.1 Cleaning Data
        * 
### 0.3.3 Modeling
### 0.3.4 Evaluation
### 0.3.5 Deployment

## 1.1 Header

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import os.path as op
import ast
import os


In [2]:
PATH = os.getcwd()+"\All Data"
PATH

'C:\\Users\\tcanty\\Documents\\Udacity\\DSND_Term2\\project_files\\AirBnB\\All Data'

In [3]:
os.listdir(PATH)

['.ipynb_checkpoints',
 'b_calendar.csv',
 'b_listings.csv',
 'b_reviews.csv',
 's_calendar.csv',
 's_listings.csv',
 's_reviews.csv']

In [4]:
df_listing = pd.read_csv(PATH+'\s_listings.csv',index_col=0)

In [5]:
df_cal = pd.read_csv(PATH+'\s_calendar.csv',index_col=0)

  mask |= (ar1 == a)


In [6]:
df_rev = pd.read_csv(PATH+'\s_reviews.csv',index_col=0)

In [7]:
def clean_listings(df):
    ''' Return a cleaned dataframe derived from listing.csv file

    1) Fixes Datetime cols -> datetime format
    2) Fixes percentage strings to float
    3) Fixes bool strings to bool cols
    4) Fixes datatype of oject to categories were appropraite
    5) 

    Parameters
    -------
    df:  Pandas DataFrame with an already imported lsiting.csv

    '''

    ## Clean percentage strings to float values
    pct_col = ['host_acceptance_rate','host_response_rate']
    for pc in pct_col:
        df[pc] = df[pc].str.strip("%")
        df[pc] = df[pc].astype('float')
        df[pc] = df[pc].apply(lambda x: x/100)
        df[pc] = df[pc].map('{:,.2%}'.format)
        
    ## Clean dollar strings to value

    dol_col = ['cleaning_fee','extra_people']
    for dol in dol_col:
        df[dol] = df[dol].str.strip('$').astype('float')
        df[dol] = df[dol].map('${:,.2f}'.format)
        
    ## Change type to category
    cat_col = ['host_response_time','host_location','host_neighbourhood','neighbourhood',
               'neighbourhood_cleansed','neighbourhood_group_cleansed','city','state','zipcode',
              'market','smart_location','country_code','country','property_type','room_type',
              'calendar_updated','jurisdiction_names','cancellation_policy']
    
    for cc in cat_col:
        df[cc] = df[cc].astype('category')
  

    ## Fix Boolean Columns
    bool_col = ['host_is_superhost','host_has_profile_pic','host_identity_verified',
                'is_location_exact','has_availability','requires_license','instant_bookable',
               'require_guest_profile_picture','require_guest_phone_verification']
    for bc in bool_col:
        df[bc] = df[bc].replace({'t': True,'f':False})
        df[bc] = df[bc].astype(bool)

    ## Fix Datetime columns
    dt_col = ['last_scraped','host_since','calendar_last_scraped','first_review','last_review']
    for dt in dt_col:
        df[dt] = pd.to_datetime(df[dt])

    ## Fix list column
    ## The following code transforms column 'host_verification' to a usable matrix of 
    ##     one hot encoding the contained communicaiton methods
      
    df2 = pd.DataFrame(df['host_verifications'].apply(lambda x:ast.literal_eval(x)))  # string to list #
    df3 = df2.host_verifications.apply(pd.Series)                                   # list -> series across columns #
    df2 = df2.merge(df3, right_index=True, left_index=True)
    df2 = df2.reset_index().melt(id_vars=['id','host_verifications'],value_name = 'host_sm_ver')
    df2 = df2.pivot_table(values='variable',columns='host_sm_ver',index='id',aggfunc='count',fill_value=0)
    df2 = df2.add_prefix('hv_')
    df = df.merge(df2,left_index=True, right_index=True)
    
    
    ## Drop Columns
    '''Reasons
    All N/A: licence
    No N/A: listing_url
    onehot: host_verifications
    
    
    '''
    drop_col = ['license','host_verifications']
    df.drop(columns=drop_col)
    
    return df


In [8]:
df_listing = clean_listings(df_listing)

In [14]:
df_lobj = df_listing.select_dtypes('object')

In [15]:
df_lobj.iloc[:,0:10].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3815 entries, 241032 to 10208623
Data columns (total 10 columns):
listing_url              3815 non-null object
name                     3815 non-null object
summary                  3638 non-null object
space                    3248 non-null object
description              3815 non-null object
experiences_offered      3815 non-null object
neighborhood_overview    2785 non-null object
notes                    2211 non-null object
transit                  2883 non-null object
thumbnail_url            3495 non-null object
dtypes: object(10)
memory usage: 327.9+ KB


In [16]:
df_lobj.iloc[:,0:10].head()

Unnamed: 0_level_0,listing_url,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,thumbnail_url
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
241032,https://www.airbnb.com/rooms/241032,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,,,
953595,https://www.airbnb.com/rooms/953595,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",What's up with the free pillows? Our home was...,"Convenient bus stops are just down the block, ...",https://a0.muscache.com/ac/pictures/14409893/f...
3308979,https://www.airbnb.com/rooms/3308979,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,Our house is located just 5 short blocks to To...,A bus stop is just 2 blocks away. Easy bus a...,
7421966,https://www.airbnb.com/rooms/7421966,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,none,,,,
278830,https://www.airbnb.com/rooms/278830,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,Belltown,The nearest public transit bus (D Line) is 2 b...,


In [19]:
df_lobj.listing_url.isna().any()

False

In [248]:
df_listing['host_verifications'] = df_listing['host_verifications'].astype(list)

df_host_verification = df_listing[['host_verifications']]

In [46]:
df2 = pd.DataFrame(df_listing['host_verifications'].apply(lambda x:ast.literal_eval(x)))
df2.head()

Unnamed: 0_level_0,host_verifications
id,Unnamed: 1_level_1
241032,"[email, phone, reviews, kba]"
953595,"[email, phone, facebook, linkedin, reviews, ju..."
3308979,"[email, phone, google, reviews, jumio]"
7421966,"[email, phone, facebook, reviews, jumio]"
278830,"[email, phone, facebook, reviews, kba]"


In [47]:
socmed = df2.host_verifications.apply(pd.Series)
socmed.head(4)

Unnamed: 0_level_0,0,1,2,3,4,5,6,7
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
241032,email,phone,reviews,kba,,,,
953595,email,phone,facebook,linkedin,reviews,jumio,,
3308979,email,phone,google,reviews,jumio,,,
7421966,email,phone,facebook,reviews,jumio,,,


In [56]:
df_hv = df2.merge(socmed, right_index=True, left_index=True)
df_hv.head(4)

Unnamed: 0_level_0,host_verifications,0,1,2,3,4,5,6,7
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
241032,"[email, phone, reviews, kba]",email,phone,reviews,kba,,,,
953595,"[email, phone, facebook, linkedin, reviews, ju...",email,phone,facebook,linkedin,reviews,jumio,,
3308979,"[email, phone, google, reviews, jumio]",email,phone,google,reviews,jumio,,,
7421966,"[email, phone, facebook, reviews, jumio]",email,phone,facebook,reviews,jumio,,,


In [57]:
df_hv = df_hv.reset_index().melt(id_vars=['id','host_verifications'],value_name = 'host_sm_ver')
df_hv.head(4).sort_values('id')

Unnamed: 0,id,host_verifications,variable,host_sm_ver
0,241032,"[email, phone, reviews, kba]",0,email
1,953595,"[email, phone, facebook, linkedin, reviews, ju...",0,email
2,3308979,"[email, phone, google, reviews, jumio]",0,email
3,7421966,"[email, phone, facebook, reviews, jumio]",0,email


In [58]:


dv_hv = df_hv.pivot_table(values='variable', columns='host_sm_ver',index='id',aggfunc='count',fill_value=0)

In [69]:
dv_hv = dv_hv.add_prefix('hv_')

df_listing.merge(dv_hv, left_index=True, right_index=True)