Team Member Names:

- Name 1: Jasmine Coleman
- Name 2: Yat Leung
- Name 3: Karen Somes

# Will an AirBnb host receive a perfect rating?
### Classifying "perfect" hosts using logistic regression and SVM ###

__Cleaning the data__  
First, we narrowed our set of predictors by removing variables that were highly correlated with others, variables with repetitive values, and attributes with a high number of missing values. We then applied data transformations to aid our analysis.  
    
Our target classification variable is a Boolean indicator for whether or not a host received a "perfect" 100 review score. Through our classification and analysis, we will determine which attributes are most powerfully associated with a 100 rating.
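
A minimal sketch of this kind of screening, run on a small made-up frame rather than the Airbnb listings. The helper name `screen_columns` and both cutoffs (0.5 for the share of missing values, 0.9 for absolute correlation) are assumptions chosen for illustration, not the values used in Lab 1.

```python
import numpy as np
import pandas as pd

def screen_columns(frame, max_missing=0.5, max_corr=0.9):
    """Hypothetical helper: list mostly-missing columns plus one column
    from each highly correlated numeric pair."""
    # columns whose share of missing values exceeds the cutoff
    too_sparse = [c for c in frame.columns if frame[c].isnull().mean() > max_missing]

    # absolute pairwise correlations among the remaining numeric columns
    numeric = frame.drop(columns=too_sparse).select_dtypes(include="number")
    corr = numeric.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    too_correlated = [c for c in upper.columns if (upper[c] > max_corr).any()]

    return too_sparse + too_correlated

# toy data (assumed values, for illustration only)
toy = pd.DataFrame({
    "price": [100.0, 150.0, 200.0, 250.0],
    "weekly_price": [700.0, 1050.0, 1400.0, 1750.0],  # exactly 7 * price
    "square_feet": [np.nan, np.nan, np.nan, 500.0],   # 75% missing
    "beds": [1.0, 3.0, 2.0, 1.0],
})
to_drop = screen_columns(toy)
print(to_drop)
```

In this toy case the helper flags `square_feet` for missing values and `weekly_price` for duplicating the information in `price`.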

In [1]:
import pandas as pd
import numpy as np
from decimal import Decimal
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
 
data = pd.read_csv("/Users/jazis/Downloads/listings.csv")

#data cleaning from LAB1
#drop redundant info and fields not useful for analysis
sub=data.drop(['id','listing_url','scrape_id','last_scraped','summary','space','description','experiences_offered'
              , 'neighborhood_overview', 'notes', 'transit', 'access', 'interaction', 'house_rules',
              'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url', 'host_url', 'host_thumbnail_url',
              'host_picture_url', 'country_code', 'country','amenities', 'minimum_minimum_nights',
              'maximum_minimum_nights','minimum_maximum_nights', 'maximum_maximum_nights','minimum_nights_avg_ntm',
              'maximum_nights_avg_ntm', 'availability_30', 'availability_365','availability_90','has_availability',
               'calculated_host_listings_count','calculated_host_listings_count_shared_rooms',
               'is_business_travel_ready','host_about', 'host_acceptance_rate', 'host_total_listings_count',
              'jurisdiction_names','license','monthly_price','square_feet','weekly_price', 'requires_license'], axis=1)
def money_to_decimal(x):
    x = x.replace("$", "").replace(",", "").replace(" ", "")
    return float(x)
def rem_percent(x):
    x=x.replace("%","")
    return float(x)/100
def truncate(n):
    return int(n * 1000) / 1000
#convert columns holding money strings into floats so they can be used as continuous attributes
sub.cleaning_fee = sub.cleaning_fee.astype(str)
sub.extra_people = sub.extra_people.astype(str)
sub.security_deposit = sub.security_deposit.astype(str)
sub.price = sub.price.astype(str)
sub.loc[:,'price'] = sub.loc[:,'price'].apply(money_to_decimal)
sub.loc[:,'cleaning_fee'] = sub.loc[:,'cleaning_fee'].apply(money_to_decimal)
sub.loc[:,'extra_people'] = sub.loc[:,'extra_people'].apply(money_to_decimal)
sub.loc[:,'security_deposit'] = sub.loc[:,'security_deposit'].apply(money_to_decimal)
sub.host_response_rate = sub.host_response_rate.astype(str)
sub.loc[:,'host_response_rate'] = sub.loc[:, 'host_response_rate'].apply(rem_percent)



__Creating dummy variables for SVM__

We created dummy variables for the categorical variables that we would include in the model. We iteratively determined which categorical variables to include by checking the accuracy of each interim model when removing terms.
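
As a small illustration of what `drop_first=True` does, the sketch below encodes a made-up `room_type` column. The four rows are assumed values, not records from the listings file.

```python
import pandas as pd

# toy column (assumed values, for illustration only)
toy = pd.DataFrame({"room_type": ["Entire home/apt", "Private room",
                                  "Shared room", "Private room"]})

# drop_first=True omits the first level in sorted order, which then acts
# as the baseline that the remaining indicator columns are compared against
dummies = pd.get_dummies(toy["room_type"], drop_first=True)
print(list(dummies.columns))
print(dummies)
```

Here `Entire home/apt` becomes the baseline, so a row with zeros in both remaining columns represents an entire home or apartment.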

In [118]:
df = sub[~sub['review_scores_rating'].isnull()].copy()  # copy so later assignments do not act on a slice of sub
df['perf_score'] = np.where(df['review_scores_rating']==100, 1, 0)

df_data=df
df_y=df['perf_score']

#create dummy vars
##host_loc = pd.get_dummies(df_data['host_location'],drop_first=True)
host_response = pd.get_dummies(df_data['host_response_time'],drop_first=True)
##host_neigh = pd.get_dummies(df_data['neighbourhood_group_cleansed'],drop_first=True)
##host_verif = pd.get_dummies(df_data['host_verifications'],drop_first=True)
df_data['host_identity_verified'] = pd.get_dummies(df_data['host_identity_verified'],drop_first=True)
##street = pd.get_dummies(df_data['street'],drop_first=True)
neighborhood = pd.get_dummies(df_data['neighbourhood_group_cleansed'],drop_first=True)
##city = pd.get_dummies(df_data['city'],drop_first=True)
# make into continuous zipcode = pd.get_dummies(x_train['zipcode'],drop_first=True)
##market = pd.get_dummies(df_data['market'],drop_first=True)
df_data['is_location_exact'] = pd.get_dummies(df_data['is_location_exact'],drop_first=True)
prop_type = pd.get_dummies(df_data['property_type'],drop_first=True)
room_type = pd.get_dummies(df_data['room_type'],drop_first=True)
bed_type = pd.get_dummies(df_data['bed_type'],drop_first=True)
df_data['instant_bookable'] = pd.get_dummies(df_data['instant_bookable'],drop_first=True)
cancel = pd.get_dummies(df_data['cancellation_policy'],drop_first=True)
df_data['host_is_superhost'] = pd.get_dummies(df_data['host_is_superhost'],drop_first=True)

df_data.drop(['host_location','host_response_time','host_neighbourhood','host_verifications',
             'street', 'neighbourhood', 'city', 'market', 
             'property_type', 'room_type', 'bed_type', 'instant_bookable',
             'cancellation_policy', 'name', 'host_name', 'host_has_profile_pic', 'neighbourhood_cleansed',
              'neighbourhood_group_cleansed', 'smart_location', 'calendar_updated',
             'calendar_last_scraped','require_guest_profile_picture', 'require_guest_phone_verification',
             'host_since', 'first_review', 'last_review', 'state', 'zipcode','review_scores_rating'],axis=1,inplace=True)

df_data = pd.concat([df_data, host_response, prop_type, room_type,
                    bed_type, cancel, neighborhood],axis=1)

##x_train, x_test, y_train, y_test = train_test_split(df_data, df_y, test_size=0.2)
#ent_r = entropy_value(ds.target[feature28>2.5])
#ent_l = entropy_value(ds.target[feature28<=2.5])
#ent_t = entropy_value(feature28)
    


In [None]:
df_data.head()

__Fitting the SVM__

In [119]:
##splitting the prediction target from the predictor variables

from sklearn.model_selection import ShuffleSplit

# we want to predict the X and y data as follows:

if 'perf_score' in df_data:
    y = df_data['perf_score'].values # get the labels we want
    del df_data['perf_score'] # get rid of the class label
    X = df_data.values # use everything else to predict!


    ## X and y are now numpy matrices, by calling 'values' on the pandas data frames we
    #    have converted them into simple matrices to use with scikit learn
    

# to use the cross validation object in scikit learn, we need to grab an instance
#    of the object and set it up. This object will be able to split our data into 
#    training and testing splits
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
                         test_size  = 0.2)
                         
print(cv_object)

ShuffleSplit(n_splits=3, random_state=None, test_size=0.2, train_size=None)


In [120]:
for train_indices, test_indices in cv_object.split(X,y): 
    # I will create new variables here so that it is more obvious what 
    # the code is doing (you can compact this syntax and avoid duplicating memory,
    # but it makes this code less readable)
    # note: each pass overwrites these variables, so only the last split is kept
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
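
One way to make use of every split, rather than only the last one, is to let scikit-learn run the model on each split and report a score per split. The sketch below does this on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `cv_demo`, and `model`, the fixed `random_state`, and the choice of `cross_val_score` are assumptions for illustration and not part of our analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for X and y (not the Airbnb data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# steps placed in the pipeline are refit on the training part of each split
model = make_pipeline(StandardScaler(), SVC(C=0.5, kernel="rbf", gamma="auto"))
cv_demo = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# one accuracy value per split
scores = cross_val_score(model, X_demo, y_demo, cv=cv_demo, scoring="accuracy")
print(scores, scores.mean())
```

Averaging over the splits gives a less noisy estimate than a single 80/20 split, at the cost of fitting the model once per split.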


In [None]:
X_train

In [121]:
#converting training numpy to pandas df to impute nan values
cols = list(df_data.columns.values)
impute = pd.DataFrame(X_train, columns=cols)
#impute

In [122]:
#converting test numpy to pandas df to impute nan values
cols = list(df_data.columns.values)
impute_test = pd.DataFrame(X_test, columns=cols)
#impute_test

In [123]:
#imputations - training
impute['price']=impute.price.mask(impute.price == 0,impute.price.median())
impute.cleaning_fee=impute.cleaning_fee.fillna(impute.cleaning_fee.median())
impute.host_response_rate=impute.host_response_rate.fillna(impute.host_response_rate.median())
impute.review_scores_accuracy=impute.review_scores_accuracy.fillna(truncate(impute.review_scores_accuracy.median()))
impute.review_scores_checkin=impute.review_scores_checkin.fillna(truncate(impute.review_scores_checkin.median()))
impute.review_scores_cleanliness=impute.review_scores_cleanliness.fillna(truncate(impute.review_scores_cleanliness.median()))
impute.review_scores_communication=impute.review_scores_communication.fillna(truncate(impute.review_scores_communication.median()))
impute.review_scores_location=impute.review_scores_location.fillna(truncate(impute.review_scores_location.median()))
#sub.review_scores_rating=sub.review_scores_rating.fillna(truncate(sub.review_scores_rating.median()))
impute.review_scores_value=impute.review_scores_value.fillna(truncate(impute.review_scores_value.median()))
impute.reviews_per_month=impute.reviews_per_month.fillna(impute.reviews_per_month.median())
impute.security_deposit=impute.security_deposit.fillna(impute.security_deposit.median())
impute.bathrooms=impute.bathrooms.fillna(impute.bathrooms.median())
impute.bedrooms=impute.bedrooms.fillna(impute.bedrooms.median())
impute.host_listings_count=impute.host_listings_count.fillna(impute.host_listings_count.median())
impute.beds=impute.beds.fillna(impute.beds.median())
#sub.host_response_time=sub.host_response_time.fillna('missing')

In [124]:
#imputations - test
impute_test['price']=impute_test.price.mask(impute_test.price == 0,impute.price.median())
impute_test.cleaning_fee=impute_test.cleaning_fee.fillna(impute.cleaning_fee.median())
impute_test.host_response_rate=impute_test.host_response_rate.fillna(impute.host_response_rate.median())
impute_test.review_scores_accuracy=impute_test.review_scores_accuracy.fillna(truncate(impute.review_scores_accuracy.median()))
impute_test.review_scores_checkin=impute_test.review_scores_checkin.fillna(truncate(impute.review_scores_checkin.median()))
impute_test.review_scores_cleanliness=impute_test.review_scores_cleanliness.fillna(truncate(impute.review_scores_cleanliness.median()))
impute_test.review_scores_communication=impute_test.review_scores_communication.fillna(truncate(impute.review_scores_communication.median()))
impute_test.review_scores_location=impute_test.review_scores_location.fillna(truncate(impute.review_scores_location.median()))
#sub.review_scores_rating=sub.review_scores_rating.fillna(truncate(sub.review_scores_rating.median()))
impute_test.review_scores_value=impute_test.review_scores_value.fillna(truncate(impute.review_scores_value.median()))
impute_test.reviews_per_month=impute_test.reviews_per_month.fillna(impute.reviews_per_month.median())
impute_test.security_deposit=impute_test.security_deposit.fillna(impute.security_deposit.median())
impute_test.bathrooms=impute_test.bathrooms.fillna(impute.bathrooms.median())
impute_test.bedrooms=impute_test.bedrooms.fillna(impute.bedrooms.median())
impute_test.host_listings_count=impute_test.host_listings_count.fillna(impute.host_listings_count.median())
impute_test.beds=impute_test.beds.fillna(impute.beds.median())
#sub.host_response_time=sub.host_response_time.fillna('missing')
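
The same idea, filling test-set gaps with medians learned from the training set, can be sketched with scikit-learn's `SimpleImputer`. This is an illustrative alternative on made-up arrays (`train_demo` and `test_demo` are assumed names and values); it does not reproduce the truncation of the review-score medians or the replacement of zero prices done above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy arrays (assumed values, for illustration only)
train_demo = np.array([[1.0, 10.0],
                       [3.0, np.nan],
                       [5.0, 30.0]])
test_demo = np.array([[np.nan, 20.0],
                      [7.0, np.nan]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train_demo)                     # medians come from training rows only
test_filled = imputer.transform(test_demo)  # test gaps receive the training medians

print(imputer.statistics_)  # per-column training medians
print(test_filled)
```

Fitting on the training rows only keeps information from the test set out of the preprocessing step.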

In [125]:
#converting pandas df back to numpy matrix - for train
X_train = impute.values

In [126]:
#converting pandas df back to numpy matrix - for test
X_test = impute_test.values

In [None]:
X_train

In [None]:
X_test

In [127]:
from sklearn.preprocessing import StandardScaler

# scale attributes by the training set
scl_obj = StandardScaler()
scl_obj.fit(X_train) # find scalings for each column that make this zero mean and unit std
# the line of code above only looks at training data to get mean and std and we can use it 
# to transform new feature data


X_train_scaled = scl_obj.transform(X_train) # apply to training
X_test_scaled = scl_obj.transform(X_test) 



In [None]:
X_test_scaled

In [None]:
X_train_scaled

In [None]:
# lets investigate SVMs on the data and play with the parameters and kernels
from sklearn.svm import SVC
from sklearn import metrics as mt

# train the model just as before
svm_clf = SVC(C=0.5, kernel='rbf', degree=3, gamma='auto') # get object
svm_clf.fit(X_train_scaled, y_train)  # train object

y_hat = svm_clf.predict(X_test_scaled) # get test set predictions

acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print('accuracy:', acc )
print(conf)
##took longer to run

__SVM Performance__  
The support vector machine outperformed the logistic model, reaching 84% accuracy. The logistic model was only able to predict non-perfect scores, the class into which the majority of reviews fall. However, we are more interested in accurately predicting when a customer will leave a perfect review, which occurs far less frequently. The SVM was able to identify both classes: it correctly classified 91% of the less-than-perfect scores (true negatives) and 68% of the perfect scores (true positives).
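
For reference, per-class rates like these can be read off a confusion matrix as follows. The matrix in this sketch is made up for illustration and is not the output of our model; it follows scikit-learn's layout, with rows for the actual class and columns for the predicted class.

```python
import numpy as np

# made-up confusion matrix (rows: actual class, columns: predicted class)
conf_demo = np.array([[90, 10],   # actual not perfect: 90 correct, 10 wrong
                      [30, 70]])  # actual perfect: 30 missed, 70 correct

tn, fp, fn, tp = conf_demo.ravel()
true_negative_rate = tn / (tn + fp)  # share of non-perfect scores identified
true_positive_rate = tp / (tp + fn)  # share of perfect scores identified
accuracy = (tn + tp) / conf_demo.sum()

print(true_negative_rate, true_positive_rate, accuracy)
```

Because the classes are imbalanced, the two per-class rates say more about the model than overall accuracy alone.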

__SVM Advantage__  
This model outperforms the logistic model in terms of accuracy. Although it takes longer to run than the logistic model, the dataset is small enough that the extra few seconds are not a significant disadvantage. This classification model is also advantageous because SVMs tend to perform well on high-dimensional data and are less sensitive to outliers. This is important because our predictive model includes several categorical variables with numerous unique levels (ranging from 2 to 222) that have been broken out into dummy variables.

In [None]:
# looking at the support vectors
print(svm_clf.support_vectors_.shape) # (number of support vectors, number of variables)
print(svm_clf.support_.shape) # (number of support vectors,): row positions of the support vectors in the training data
print(svm_clf.n_support_ ) # [number of support vectors for class 0, number for class 1]

In [None]:
# Now let's do some different analysis with the SVM and look at the instances that were chosen as support vectors

# now lets look at the support vectors and see if they are indicative of anything
# grab the rows that were selected as support vectors (these are usually instances that are hard to classify)

# make a dataframe of the training data
df_tested_on = df_data.iloc[train_indices] # saved from above, the indices chosen for training
# now get the support vectors from the trained model
df_support = df_tested_on.iloc[svm_clf.support_,:].copy()

df_support['perf_score'] = y_train[svm_clf.support_] # add back the 'perf_score' column; support_ indexes rows of the training set
df_data['perf_score'] = y # also add it back in for the original data
df_support.info()

In [None]:
# now lets see the statistics of these attributes
import matplotlib.pyplot as plt

# group the original data and the support vectors
df_grouped_support = df_support.groupby(['perf_score'])
df_grouped = df_data.groupby(['perf_score'])

# plot KDE of Different variables
vars_to_plot = ['host_id',
'host_response_rate',
'host_is_superhost',
'host_listings_count',
'host_identity_verified',
'latitude',
'longitude',
'is_location_exact',
'accommodates',
'bathrooms',
'bedrooms',
'beds',
'price',
'security_deposit',
'cleaning_fee',
'guests_included',
'extra_people',
'minimum_nights',
'maximum_nights',
'availability_60',
'number_of_reviews',
'number_of_reviews_ltm',
'review_scores_accuracy',
'review_scores_cleanliness',
'review_scores_checkin',
'review_scores_communication',
'review_scores_location',
'review_scores_value',
'calculated_host_listings_count_entire_homes',
'calculated_host_listings_count_private_rooms',
'reviews_per_month',
'within a day',
'within a few hours',
'within an hour',
'Apartment',
'Bed and breakfast',
'Boat',
'Boutique hotel',
'Bungalow',
'Cabin',
'Camper/RV',
'Casa particular (Cuba)',
'Castle',
'Cave',
'Condominium',
'Cottage',
'Dome house',
'Earth house',
'Guest suite',
'Guesthouse',
'Hostel',
'Hotel',
'House',
'Houseboat',
'Lighthouse',
'Loft',
'Nature lodge',
'Other',
'Resort',
'Serviced apartment',
'Tent',
'Tiny house',
'Townhouse',
'Treehouse',
'Villa',
'Private room',
'Shared room',
'Couch',
'Futon',
'Pull-out Sofa',
'Real Bed',
'moderate',
'strict',
'strict_14_with_grace_period',
'super_strict_30',
'super_strict_60',
'Brooklyn',
'Manhattan',
'Queens',
'Staten Island']

for v in vars_to_plot:
    plt.figure(figsize=(10,4))
    # plot support vector stats
    plt.subplot(1,2,1)
    ax = df_grouped_support[v].plot.kde() 
    plt.legend(['Not Perfect','Perfect'])
    plt.title(v+' (SV Instances)')
    
    # plot original distributions
    plt.subplot(1,2,2)
    ax = df_grouped[v].plot.kde() 
    plt.legend(['Not Perfect','Perfect'])
    plt.title(v+' (Original)')



__SVM Chosen Support Vectors__  
The model selected roughly 13,000 records in the dataset as support vectors, each described by 80 variables. These are the records that lie on or within the margin between the two classes, that is, on the border of being misclassified. The 80 variables cover the cancellation policy, location (neighborhood and coordinates), accommodations (property type, bed type, rooms, etc.), host characteristics (id, response time, verified, number of listings, etc.), prices and fees, and the other rating categories. The support vectors span many different types of variables, and no specific category stands out. However, many of the support vectors involve levels of a categorical variable that had very small volumes (like castle, cave, pull-out sofa) or variables with many unique values (like host id or price). It makes sense that the model would have a harder time predicting a class when there is very little training data available. 
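
To make the terminology concrete, the sketch below fits an SVM on synthetic data and inspects the fitted attributes. The data, the names `X_demo`, `y_demo`, and `clf`, and the fixed `random_state` are assumptions for illustration; the counts it prints are unrelated to our results.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the scaled training data (not the Airbnb data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

clf = SVC(C=0.5, kernel="rbf", gamma="auto").fit(X_demo, y_demo)

# each row of support_vectors_ is one training record kept as a support vector;
# the number of columns equals the number of input variables
n_sv, n_features = clf.support_vectors_.shape
print(n_sv, n_features)
print(clf.n_support_)  # support vectors per class; the counts sum to n_sv

# support_ holds row positions within the data passed to fit, so labels for
# the support vectors must be taken from that same training array
sv_labels = y_demo[clf.support_]
print(sv_labels[:5])
```

The number of support vectors is therefore a count of records, while the second dimension is simply the number of predictors.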