# 🏡 **Helping the Hosts:** Determining Airbnb Host Ratings 🏨

---

> **Phase 3 Project: Classification**
>
> **Author:** Ben McCarty

---

**In a post-COVID world, hospitality faces challenges as travel restrictions are imposed and lifted (and then re-imposed).** Travel and tourism came to a crashing halt in 2020 and still face challenges in returning to pre-2020 business levels.

As restless travelers look to escape the confines of their homes, they expect the same high-quality services and experiences as pre-COVID. Competition within the hospitality industry is stronger than ever, putting more pressure on businesses to keep and grow their customer base.

**The main performance metric for every company involved in hospitality is guest satisfaction.** If a guest isn't satisfied, they are not likely to return for another visit and may share their experience with others, pushing away potential business.

Airbnb hosts face the same challenges as traditional hotels in these aggressive and challenging market conditions. In order to maximize their profitability and to distinguish themselves from traditional hotels, **Airbnb needs to know which aspects of a host property are the strongest predictors of whether a guest will give a satisfaction score of 4.8 or higher (out of 5).**

With this question in mind, I obtained data about Airbnb host properties from the [Inside Airbnb project](http://insideairbnb.com/get-the-data.html#:~:text=Washington%2C%20D.C.%2C%20District%20of%20Columbia%2C%20United%20States) for the Washington, D.C. area. The dataset includes details about the hosts themselves; property details (bedrooms, bathrooms, property types); and reservation availability.

**Once I have the data readied, I will use machine learning modeling techniques to determine my most important features for the region.** Then I will provide my final recommendations on what Airbnb should do to maximize the likelihood of their hosts obtaining a score of 4.8 or greater.

---

# 📂 **Imports and Settings**

In [None]:
## Data Handling
import pandas as pd
import numpy as np

## Visualizations
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact_manual
import missingno

## SKLearn
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, \
                                AdaBoostClassifier,GradientBoostingClassifier 
from sklearn import set_config
set_config(display='diagram')

In [None]:
## Settings
%matplotlib inline
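## Newer matplotlib releases (3.6+) name this style 'seaborn-v0_8-talk'
talk_style = 'seaborn-v0_8-talk' if 'seaborn-v0_8-talk' in plt.style.available else 'seaborn-talk'
plt.style.use(talk_style)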
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 100)

In [None]:
## Personal functions
import clf_functions.functions as cf

## Tools to reload personal functions when called - prevents errors
%load_ext autoreload
%autoreload 1
%aimport clf_functions.functions

## ✅ Show Visualizations Setting

In [None]:
## Setting to control whether or not to show visualizations
show_visualizations = False

# 📖 **Read Data**

In [None]:
## Reading data and saving to a DataFrame

source = 'data/listings.csv.gz'

data = pd.read_csv(source)

In [None]:
## Inspecting imported dataset
data.head(5)

In [None]:
## Checking number of rows and columns
data.shape

---

> The initial read of the dataset shows there are 74 features and 8,033 entries. A quick glance at the `.head()` gives a sample of the entries, showing that some of the features are not relevant to my analysis.
>
> I need to get a better idea of the statistics for the dataset, especially any missing values and the datatypes for each column. I need to pre-process this data before I can perform any modeling.

---
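
> The interactive report below relies on my personal `cf.report_df` helper, which is not shown in this notebook. As a rough stand-in, here is a minimal sketch of that kind of summary. The function name `quick_report`, the toy data, and every column name except `null_sum` are assumptions made for illustration.

```python
import pandas as pd

def quick_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and missing percentage per column."""
    report = pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'null_sum': frame.isna().sum(),
        'null_pct': frame.isna().mean().mul(100).round(1),
    })
    # Columns with the most missing data come first
    return report.sort_values('null_pct', ascending=False)

# Toy data standing in for the listings file (hypothetical values)
toy = pd.DataFrame({'price': ['$100.00', '$85.00', None],
                    'beds': [1.0, None, None],
                    'room_type': ['Private room', 'Entire home/apt', 'Private room']})

summary = quick_report(toy)
print(summary)
```

> Sorting by the missing percentage puts the columns that need the most attention at the top of the summary.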

# 👨‍💻 **Interactive Investigation**

---

> To increase accessibility to the data, **I include a widget to allow the user to sort through the data interactively.** I use [**Jupyter Widgets**](https://ipywidgets.readthedocs.io/en/latest/index.html) to create this interactive report.
>
>**To use:** select the column to sort by from the dropdown menu, then click the "Run Interact" button.
>
>***Note about 'Drop_Cols' and Cols:*** these keyword arguments are used to allow the user to drop specific columns.
>
> **Only click the "Drop_Cols" option when specifying "Cols"!** Otherwise it will cause an error.
>
>The 'Cols' dropdown menu does not affect the resulting report; the data is filtered from the report prior to displaying the results. 
>
>I chose to include this option for flexibility and adaptability, but it does have the unintended consequence of creating another drop-down menu. Please ignore this menu, as it does not provide any additional functionality. For future work, I will disable the menu to prevent confusion.

---

In [None]:
## Running report on unfiltered dataset

interact_manual(cf.sort_report, Sort_by=list(cf.report_df(data).columns),
                Source=source);

In [None]:
## Reviewing percentages of datatypes
## Naming the column via .to_frame() so the label does not depend on the pandas version
dt_pct = data.dtypes.value_counts(normalize=True)\
             .map(lambda x: f'{x*100:.0f}').to_frame('% Overall')
dt_pct.style.set_caption('Data Types')

---

**Feature Review**

> After reviewing my data, I see there are several features that contain irrelevant entries (URLs, source data, meta-data) or text values that are too complicated for simple processing (such as host and listing descriptions).
>
> ***I will drop these columns for the second report to review the remaining data for further processing.***

**Data Type Breakdown**

> There is nearly a 50/50 split between numeric/non-numeric features. ***I will need to determine how to pre-process these non-numeric values prior to modeling***. My options include:
* Breaking down the values into distinct categories
* Converting values to numeric data types as appropriate

**Next Steps**

> I determined that there are many features to drop from the dataset as well as a large number of non-numeric features to review and convert to distinct categories for encoding.
>
> ***I will start by dropping the irrelevant columns; then I will review the remaining features and update as appropriate.***

---

In [None]:
## Specifying columns to drop

drop = ['id', 'name', 'description', 'neighborhood_overview', 'host_name',
        'host_about', 'host_location', 'neighbourhood', 'property_type',
        'listing_url', 'scrape_id', 'last_scraped', 'picture_url','host_url',
        'host_thumbnail_url','host_picture_url','calendar_last_scraped']

In [None]:
## Creating updated interactive report

interact_manual(cf.sort_report, Drop_Cols = True, Cols = drop,
                Sort_by=list(cf.report_df(data).columns), Source=source);

---

> **Interpretation:**
>
> The report shows that the dataset has a big problem with missing values:
>
> * **Empty:**
>   * `neighbourhood_group_cleansed`
>   * `bathrooms`
>   * `calendar_updated`
>
>
> * **Nearly empty:**
>   * `license`
>
>
> * **Missing 26-39% of data:**
>   * `host_about`
>   * `neighborhood_overview`
>   * `neighbourhood`
>   * `host_response_time`
>   * `host_response_rate`
>   * `review_scores_value`
>   * `review_scores_checkin`
>   * `review_scores_location`
>   * `review_scores_accuracy`
>   * `review_scores_communication`
>   * `review_scores_cleanliness`
>   * `host_acceptance_rate`
>   * `reviews_per_month`
>   * `first_review`
>   * `review_scores_rating`
>   * `last_review`
>
>---
>
> I will need to address these missing values before proceeding with the modeling. A few options include:
>
> * **Filling with the string "missing"** to indicate the value was missing.
>    * *I would be able to treat "missing" as a distinct category and use it for modeling as well.*
>
>
> * **Dropping the rows with missing values.**
>    * *This shrinks the dataset and may bias my results if the rows with missing values differ systematically from the rest.*
>
>
> Due to the large percentages of features and properties that are missing data, I feel it is best to drop those features and property entries that are missing values instead of attempting to fill in the gaps.

---
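
> To make the trade-off between the two options concrete, here is a minimal sketch on a toy frame. The values are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy rows with gaps in two of the affected columns (hypothetical values)
toy = pd.DataFrame({'host_response_time': ['within an hour', None, 'within a day', None],
                    'review_scores_rating': [4.9, 4.7, None, 4.8]})

# Option 1: keep every row and treat "missing" as its own category
filled = toy.assign(host_response_time=toy['host_response_time'].fillna('missing'))

# Option 2: drop any row that has at least one missing value
dropped = toy.dropna()

print(len(filled), len(dropped))
```

> Filling keeps all four toy rows, while dropping keeps only the single complete row. That loss of rows is the cost to weigh against the cleaner data.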

> To get a better idea of the missing values, I create a visual of the values via the 'Missingno' package. This visualization package includes several options for visualizing the missing data.

---

In [None]:
## Visually inspecting missing values
if show_visualizations:
    missingno.bar(data, labels=True);

---

> Based on this visualization, I see that **there is a consistent trend in missing values for review scores:** if a row is missing one review score, it seems to be missing all of them.
>
> Additionally, **there are many missing values for the response time, response rate, and acceptance rate.** I want to use these columns in my classification, so I will need to replace those missing values.
>
> After reviewing these details, **I feel more comfortable with the option of dropping those rows with missing review values.** I will drop the values as part of my overall classification process.

---

# 🧼 **Data Cleaning and EDA**

## 🔎 Fixing Missing Values

---

> This dataset is missing a significant number of values for different columns. **In order to perform any modeling, I will address these missing values first.**
>
> As described previously, I will drop those features and rows with high percentages of missing values. Then, I will fill in the missing values for the 'beds'/'bedrooms' features by cross-referencing the columns. If a row is missing a value for one feature but has a value for the other, I will fill in the missing value with the value from the other column.

---

In [None]:
# Dropping features with high percentages (25%+) of missing values

drop_na_cols = []
for col in data.columns:
    if ((data[col].isna().sum()) / len(data[col])) > .25 and col != 'review_scores_rating':
        drop_na_cols.append(col)

drop_na_cols

In [None]:
## Appending previous list of columns to drop (metadata, etc.)

for col in drop:
    if col not in drop_na_cols:
        drop_na_cols.append(col)

drop_na_cols

In [None]:
## Creating new dataframe that does not include the features to drop
df = data.drop(columns= drop_na_cols).copy()
df

In [None]:
## Inspecting values prior to dropping
cf.report_df(df)

## 🛌 Filling Beds

---

> After dropping the columns with large percentages of missing values, I proceed to address the missing values in the 'beds' and 'bedrooms' columns.
>
> As the values are similar between the two, I will compare the rows against each other. For each row, if there is a missing value in one column that is present in the other, I will fill the missing value with the value present in the other column.

---
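
> The same cross-fill can also be written without a loop. This is a sketch of a vectorized alternative on toy data, not the approach used in the cells that follow; the toy values are assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'beds':     [2.0, np.nan, 1.0, np.nan],
                    'bedrooms': [1.0, 3.0, np.nan, np.nan]})

# Only borrow a value when the other column holds a positive number
beds_filled = toy['beds'].fillna(toy['bedrooms'].where(toy['bedrooms'] > 0))
bedrooms_filled = toy['bedrooms'].fillna(toy['beds'].where(toy['beds'] > 0))

filled = toy.assign(beds=beds_filled, bedrooms=bedrooms_filled)
print(filled)
```

> A row that is missing both values, like the last toy row, stays missing and is left for a later imputation step.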

In [None]:
## Filling missing values for 'beds' with values for 'bedrooms'
## Using .loc to write to the DataFrame directly (avoids chained assignment)

for idx in list(df['beds'][df['beds'].isna()].index):
    if df.loc[idx, 'bedrooms'] > 0:
        df.loc[idx, 'beds'] = df.loc[idx, 'bedrooms']

In [None]:
## Filling missing values for 'bedrooms' with values for 'beds'
## Using .loc to write to the DataFrame directly (avoids chained assignment)

for idx in list(df['bedrooms'][df['bedrooms'].isna()].index):
    if df.loc[idx, 'beds'] > 0:
        df.loc[idx, 'bedrooms'] = df.loc[idx, 'beds']

In [None]:
## Resetting the index so labels match row positions

df.reset_index(drop=True, inplace=True)

In [None]:
## Confirming reduction in missing values for 'beds' and 'bedrooms'

rpt_clean  = cf.report_df(df)
rpt_clean[rpt_clean['null_sum'] >0]

In [None]:
## Identifying index for remaining missing row for "beds"

nan_bed = list(df['beds'][df['beds'].isna()].index)

## Inspecting row with missing value for "beds"
df.loc[nan_bed]

---

> After cleaning the 'beds'/'bedrooms' columns, I see that one row is still missing both values. Later on, I will fill those values with an imputer as part of my modeling pipeline.

---

## 🚮 Dropping Rows with 6+ Missing Values

In [None]:
## Counting rows with 6+ null values (the rows removed in the next cell)
sum(df.isna().sum(axis=1) >= 6)

In [None]:
## Removing rows with 6+ null values

df = df[df.isna().sum(axis=1) < 6]
df.head(5)

In [None]:
## Reviewing remaining missing values

df.isna().sum()[df.isna().sum() > 0]

In [None]:
cf.report_df(df)

## 🚮 Dropping Rows Without Target Value

In [None]:
## Checking for rows missing target values
nan_index = df['review_scores_rating'].isna()
nan_index

In [None]:
## Inspecting rows to be dropped for missing the target feature
df[nan_index][:5]

In [None]:
## Dropping rows from main dataframe
df.drop(df[nan_index].index, inplace=True)

## Resetting the index after dropping rows
df.reset_index(drop=True, inplace=True)

In [None]:
## Reviewing results
cf.report_df(df)

# 🔨 **Fixing Features**

---

**Making Changes**

> Now that I addressed most of my missing values, I process the remaining features and columns to allow for the modeling process.

**Conversions and Creations**
>
>* Convert features with "t" / "f" values to 1/0, respectively
>
>
>* Convert the `price` values from strings to floats
>
>
>* Create a new feature, `Years_Hosting`, based on the  `host_since` feature
>
>
>* Convert `bathrooms_text` into a new `num_bathrooms` numeric feature
>
>
>* Convert the `room_type` column values into simpler string values
>
>
> * Separate each neighborhood in `neighbourhood_cleansed` to a standalone binary feature

---

## 🎯 Inspecting Review Ratings

---

**Converting the Target**

> Currently, the target variable `review_scores_rating` is a range of values from zero to five. ***I need to convert these values into binary values to represent whether the rating meets or exceeds the threshold of 4.8 (indicated by a 0/1 value, negative/positive respectively).*** I will create a new column to represent this binary classification and will drop the original feature.

---
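
> As a compact illustration of the conversion, a single comparison is enough to produce the 0/1 flag. This sketch uses toy ratings, which are assumptions, and is an equivalent shortcut rather than the exact code used below.

```python
import numpy as np
import pandas as pd

# Toy ratings, including the boundary value and a missing score
ratings = pd.Series([4.95, 4.8, 4.79, 3.5, np.nan])

# True/False comparison cast to 1/0; a rating of exactly 4.8 counts as meeting the threshold
meets_threshold = (ratings >= 4.8).astype(int)
print(meets_threshold.tolist())
```

> A missing rating compares as False and would be labeled 0, which is why rows without a target value are removed before this step.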

In [None]:
## Using np.select to reassign target values based on conditional evaluations

cond = [df['review_scores_rating'] >= 4.8,
        df['review_scores_rating'] < 4.8]

choice = [1,0]

df['meets_threshold'] = np.select(cond, choice, 0)

In [None]:
## Dropping original target
try:
    df = df.drop(columns = 'review_scores_rating')
except KeyError:
    print('Feature was previously dropped.')

In [None]:
## Confirming removal
'review_scores_rating' not in df.columns

In [None]:
## Reviewing results to confirm only 0/1 values and inspecting balance
threshold_counts = pd.DataFrame(df['meets_threshold']\
                                .value_counts(dropna=False, sort=False))
threshold_counts = pd.concat([threshold_counts,
                              df['meets_threshold'].value_counts(dropna=0,
                                                                 normalize=1,
                                                                 sort=0)],
                             axis=1)
threshold_counts.columns = ['Count','Percent']
threshold_counts['Percent'] = threshold_counts['Percent']\
                                      .map(lambda x: int(round(x, 2)*100))
threshold_counts.style.set_caption('Target Breakdown')

In [None]:
## Visualizing the overall distribution of ratings
## Sorting by class label (0, 1) so the bars match the tick labels below

plot_counts = threshold_counts.sort_index()
ax = sns.barplot(x = plot_counts.index, y = plot_counts['Percent'])

ax.set(title = 'Breakdown of "Meets Threshold"',
       xlabel = 'Rating', ylabel = 'Percentage of Reviews',
       xticklabels = ["Below", "Meets/Exceeds"]);

---

**Rating Distributions**

> After processing the missing values and formatting the data, the values are properly converted into 0/1 values with **62% of the listings below the target threshold of 4.8.**
>
> This imbalance is very important for the later modeling process. ***I will need to address the imbalance to ensure the best model performance.***

---
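
> One common way to address an imbalance like this is to weight each class inversely to its frequency, which is what scikit-learn's `class_weight='balanced'` option does. Here is a minimal sketch of that formula, using a toy target that mirrors the 62/38 split reported above; the toy array is an assumption.

```python
import numpy as np

# Toy target: 62 listings below the threshold (0), 38 at or above it (1)
y = np.array([0] * 62 + [1] * 38)

# "Balanced" weights: n_samples / (n_classes * count_per_class)
counts = np.bincount(y)
weights = len(y) / (len(counts) * counts)

for label, weight in enumerate(weights):
    print(label, round(weight, 3))
```

> The minority class receives the larger weight, so each class contributes the same total weight during training.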

## Converting True/False Columns to Binary Values

In [None]:
## Creating list of true/false features to convert to 1/0, respectively

t_f_xf = []

for col in df.columns:
    if df[col].nunique() == 2 and df[col].dtype == 'O':
        print(col,":",df[col].unique())
        t_f_xf.append(col)
        
t_f_xf

In [None]:
tf_vc = pd.DataFrame()
for col in t_f_xf:
    tf_vc = pd.concat([tf_vc, df[col].value_counts(normalize = 1, dropna=0, sort=0)], axis=1)

tf_vc

In [None]:
## Converting t/f to 1/0, respectively
## Casting to float makes the columns numeric while keeping any NaN values

df[t_f_xf] = df[t_f_xf].replace({ 't' : 1, 'f' : 0}).astype(float)

In [None]:
## Verifying results
cf.report_df(df[t_f_xf])

## Converting Price to Float 

In [None]:
df['price'][:5]

In [None]:
## Converting each value into a float for processing

df['price'] = df['price'].map(lambda price: price[1:].replace(',','')).astype('float')
df['price'][0]

In [None]:
df['price'].describe()

## Creating "Years_Hosting"

---

> Since the 'host_since' feature is clearly a date, I will create a separate feature for the number of years each host has been active.

---

In [None]:
df['years_hosting'] = df["host_since"].map(lambda x: 2021- int(x.split("-")[0]))
df['years_hosting']

In [None]:
df['years_hosting'].value_counts()

In [None]:
df['years_hosting'].describe()

---

> I successfully created the new feature to represent how long each host has been active (up to 2021). I am curious to see the impact of the years of experience on the overall rating at the end of my modeling process.

---
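
> The same feature can also be derived by parsing the dates instead of splitting strings. This is a sketch of that alternative, assuming `host_since` holds ISO-style `YYYY-MM-DD` strings; the toy values are invented.

```python
import pandas as pd

# Toy dates in the assumed YYYY-MM-DD format, including one missing value
host_since = pd.Series(['2015-06-30', '2020-01-15', None])
reference_year = 2021  # same cutoff year used in this notebook

# Parsing to datetimes lets a missing date pass through as NaN instead of raising an error
years_hosting = reference_year - pd.to_datetime(host_since).dt.year
print(years_hosting)
```

> Parsing the dates also keeps the month and day available in case a finer measure, such as months of hosting, is useful later.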

## Bathrooms_Text to Num_Bathrooms

---

> In the raw data, the original "bathrooms" feature was empty and was dropped as part of processing missing data.
>
> **My goal is to convert the "bathrooms_text" feature into a new "num_bathrooms" feature to indicate the number of bathrooms at a host property.**
>
> I assume the number of bathrooms would have an impact on the rating. More bathrooms could mean more space/comfort for the guest, but could also cause an increase in price.


---

In [None]:
## Checking current values
df['bathrooms_text'].value_counts(dropna=False)

In [None]:
## Inspecting the rows in which there are null values
df[df['bathrooms_text'].isna()]

In [None]:
## Filling null values with unique string ('Baths' not present otherwise)
## Unique string can be used later to check for any other zero baths

df['bathrooms_text'] = df['bathrooms_text'].fillna('0 Baths')
df['bathrooms_text'].value_counts(dropna=False)

In [None]:
## Splitting each list into separate strings
df['num_bathrooms'] = df['bathrooms_text'].map(lambda x: x.split(' ')[0])
df['num_bathrooms'].value_counts()

---

> **I will replace these values with the numeric value .5 as they are half-baths.** This will allow me to convert the column datatype to a float and use the column more easily in my modeling.

---
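
> The conversion in this section can also be expressed as one small parsing function. This is a sketch under assumptions: the helper name `parse_bathrooms` is hypothetical, and the example strings only illustrate the kinds of values the column appears to contain.

```python
import re

def parse_bathrooms(text):
    """Return the bathroom count encoded in a `bathrooms_text`-style string."""
    if text is None:
        return 0.0                      # mirrors the "0 Baths" fill used above
    if 'half-bath' in text.lower():
        return 0.5                      # e.g. "Half-bath", "Shared half-bath"
    # Otherwise read the leading number, such as "1" or "1.5"
    match = re.match(r'\s*(\d+(?:\.\d+)?)', text)
    return float(match.group(1)) if match else float('nan')

# Illustrative inputs (assumed formats)
examples = ['1 bath', '1.5 shared baths', 'Shared half-bath', 'Half-bath', None]
parsed = [parse_bathrooms(t) for t in examples]
print(parsed)
```

> Matching on the phrase "half-bath" rather than on the first word means a differently prefixed half-bath label would be handled the same way.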

In [None]:
## Replacing string values with .5 to represent half-bathrooms

replace = {'Half-bath': .5, 'Shared': .5}

df['num_bathrooms'] = df['num_bathrooms'].replace(replace).astype(float)

## Inspecting resulting values
print(df['num_bathrooms'].dtype)
df['num_bathrooms'].value_counts(dropna=False)

In [None]:
## Inspecting listings with more than 10 bathrooms
df[df['num_bathrooms'] >10]

---

> After taking a look at the locations listed above on Google Maps (using their latitude/longitude), I feel like these three listings with more than 10 bathrooms are either duplicates or incorrect values (for 50 baths).
>
> Due to the questionable nature of these values, I will drop these rows to prevent these outliers from impacting my results.

---

In [None]:
## Inspecting rows where 'num_bathrooms' is zero to validate data

df[(df['num_bathrooms'] ==0)]

In [None]:
## Removing old column post-conversion
df = df.drop(columns = 'bathrooms_text')

In [None]:
## Confirming removal

'bathrooms_text' not in df.columns

---

> My review of the listings with zero bathrooms shows that they are associated with private rooms. This would make sense, as these listings may not include a bathroom option such as a shared bath.
>
> Additionally, I filled 9 missing values with "0 Baths," which contributes slightly to this count.
>
> Overall, I feel the data is valid and I will use it for my modeling.

---

## Cleaning Room_Type

---

>  In order to use “room_type” as a categorical variable, I convert the values to standardized strings. This allows me to perform one-hot encoding as part of my pre-modeling steps below.

---

In [None]:
## Reviewing pre-existing values

df['room_type'].value_counts()

In [None]:
## Replacing values with updated strings

replace_rooms = {'Entire home/apt': 'entire_home', 
                 'Private room': 'private_room',
                 'Shared room': 'shared_room',
                 'Hotel room': 'hotel_room'
                }

df['room_type'] = df['room_type'].replace(replace_rooms)
df['room_type'].value_counts(dropna=False)

## Binarizing Columns

---

> The current values for "neighbourhood_cleansed", 'host_verifications', and 'amenities' are  single string values. **For each feature, I will separate each string into distinct, unique values and convert them into a binary column to represent whether or not that value is included in the listing, then drop the old column.**

---
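
> For the list-like columns, an alternative to substring matching is to parse each string into a real list and then build the 0/1 columns from exact item matches. This is a sketch under the assumption that the values look like Python or JSON list literals, such as `"['email', 'phone']"`; the toy values are invented.

```python
import ast
import pandas as pd

# Toy values in the assumed list-literal format, including an empty list
raw = pd.Series(["['email', 'phone']", "['phone', 'work_email']", "[]"])

# Parse each string into a list, then create one row per (listing, item)
parsed = raw.map(ast.literal_eval)
exploded = parsed.explode().dropna()

# One 0/1 column per item; rows with empty lists are restored as all zeros
flags = (pd.get_dummies(exploded, dtype=int)
           .groupby(level=0).max()
           .reindex(raw.index, fill_value=0))
print(flags)
```

> Because the items are compared exactly, 'email' is not counted for a listing that only has 'work_email', which a plain substring search would do.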

### Neighbourhood_Cleansed

In [None]:
## Inspecting feature
df.loc[:,'neighbourhood_cleansed'][:5]

In [None]:
df.loc[:,'neighbourhood_cleansed'][0]

In [None]:
## Splitting string value between neighborhoods

unique_neighborhood = list(set(','.join(df['neighbourhood_cleansed']).split(',')))
unique_neighborhood

In [None]:
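## Cleaning names and creating T/F binary columns

for neighborhood in unique_neighborhood:

    ## Removing quotes and surrounding whitespace
    neighborhood = neighborhood.replace("'", "").strip()

    ## Skipping any empty names left over from the split
    if not neighborhood:
        continue

    ## regex=False treats the name as literal text, not a regex pattern
    df[neighborhood] = df['neighbourhood_cleansed'].str\
                           .contains(neighborhood, regex=False).astype(int)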

In [None]:
## Confirming results
df.columns[-20:]

In [None]:
## Confirming removal of leading spaces and any quotes

df.columns[-20:][0][:3]

### Host_Verifications

In [None]:
## Inspecting values
df['host_verifications'][:5]

In [None]:
## Inspecting the first five items of the second row

df.loc[:,'host_verifications'][1][:5]

In [None]:
## Splitting string value between verifications

unique_verif = list(set(','.join(df['host_verifications']).split(',')))
unique_verif

In [None]:
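## Cleaning names and creating T/F binary columns

for verification in unique_verif:

    ## Removing brackets, quotes, and surrounding whitespace
    verification = verification.replace('[', '').replace(']', '').\
    replace("'", '').replace('"', '').strip()

    ## Creating a column for every non-empty name, not only those
    ## that had a leading space. Note: this is a substring match, so a
    ## short name also matches longer names that contain it
    if verification:
        df[verification] = df['host_verifications'].str.\
                            contains(verification, regex=False).astype(int)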

In [None]:
df.columns

---

> At this point, I successfully processed the 'host_verifications' feature into distinct categories for modeling.

---

### Amenities


In [None]:
## Inspecting values
df['amenities'][:5]

In [None]:
## Inspecting the first five items of the second row

df.loc[:,'amenities'][1][:5]

In [None]:
## Splitting string value between items

unique_amenities = list(set(','.join(df['amenities']).split(',')))
unique_amenities

In [None]:
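## Cleaning names and creating T/F binary columns

for amenity in unique_amenities:

    ## Removing brackets, quotes, and surrounding whitespace
    amenity = amenity.replace('[', '').replace(']', '').\
    replace("'", '').replace('"', '').strip()

    ## Creating a column for every non-empty amenity; regex=False treats
    ## characters such as "(" or "+" as literal text
    if amenity:
        df[amenity] = df['amenities'].str.\
                            contains(amenity, regex=False).astype(int)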

In [None]:
df.columns

# 🔬 **Pre-Pipeline Review**

In [None]:
## Review remaining data
df.head(3)

In [None]:
## Removing source columns that were converted above or are not used in modeling

df.drop(columns = ['host_since', 'host_neighbourhood', 'amenities',
                   'neighbourhood_cleansed', 'host_verifications'], inplace=True)

In [None]:
## Final review

df.describe()

# 🪓 **Train/Test Split**

---

> Before I run any further pre-processing, I split my data into training and test sets to allow me to test my model's performance.
>
> **Since my target classes are imbalanced, I will use the "stratify" parameter in my train/test split to preserve the class balance in both the training and test sets.** This will be key for proper evaluation of my models.

---
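
> To show what stratification does, here is a minimal sketch on a toy target with the same 62/38 balance reported earlier. The toy arrays are assumptions; the split settings match the ones used below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with a 62/38 class balance and a placeholder feature column
y = np.array([0] * 62 + [1] * 38)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42, stratify=y)

# Share of the positive class in each split
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

> Both shares stay close to the overall 38%, so the test set reflects the same class balance the model was trained on.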

In [None]:
## Specifying features and target columns for dataset
target = 'meets_threshold'

X = df.drop(columns = target).copy()
y = df[target].copy()

In [None]:
## Confirming same number of rows
X.shape[0] == y.shape[0]

In [None]:
## Splitting to prevent data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, 
                                                    random_state=42, 
                                                    stratify=y)

# 🚿 **Preprocessing Pipeline**

---

>  Before I start my modeling processes, I convert my remaining categorical column via one-hot encoding and perform standardization on my numeric columns. Once my columns are properly converted, I will save them as new dataframes and use them in my modeling.

---

In [None]:
## Specifying numeric columns for preprocessing
num_cols = X_train.select_dtypes(include=[int, float]).columns.to_list()
# num_cols

In [None]:
## Specifying categorical columns for preprocessing
cat_cols = ['room_type']
cat_cols

In [None]:
## Checking missing X-values for imputation
X_train.isna().sum()[X_train.isna().sum() > 0]

## Running Preprocessor

In [None]:
## Creating ColumnTransformer and sub-transformers for imputation and encoding


### --- Creating column transformers --- ###

# Filling missing values in "Beds" and "Bedrooms"
miss_num_transformer = SimpleImputer(strategy='mean')

## Encoding categoricals - ignoring errors to prevent issues w/ test set
## Note: `sparse_output` requires scikit-learn 1.2+ (it replaced `sparse`)
categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)


### --- Creating column pipelines --- ###

cat_pipe = Pipeline(steps=[('ohe', categorical_transformer)])

num_pipe = Pipeline(steps=[('imputer', miss_num_transformer),
                           ('scaler', StandardScaler())])

## Instantiating the ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[('nums', num_pipe, num_cols),
                  ('cats', cat_pipe, cat_cols)])

preprocessor

In [None]:
## Fitting feature preprocessor
preprocessor.fit(X_train)

## Getting feature names from OHE
## (`get_feature_names_out` replaced `get_feature_names` in scikit-learn 1.0+)
ohe_cat_names = preprocessor.named_transformers_['cats'].named_steps['ohe'].get_feature_names_out(cat_cols)

## Generating list for column index
final_cols = [*num_cols, *ohe_cat_names]

In [None]:
## Transform the data via the ColumnTransformer preprocessor

X_train_tf = preprocessor.transform(X_train)
X_train_tf_df = pd.DataFrame(X_train_tf, columns=final_cols, index=X_train.index)

X_test_tf = preprocessor.transform(X_test)
X_test_tf_df = pd.DataFrame(X_test_tf, columns=final_cols, index=X_test.index)

display(X_train_tf_df.head(5),X_test_tf_df.head(5))

# 📊 **Baseline Model**

In [None]:
## Creating baseline classifier model

base = DummyClassifier(strategy='stratified', random_state = 42)

base.fit(X_train_tf_df, y_train)

cf.evaluate_classification(base,X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test, 
                           metric = 'accuracy')

In [None]:
## Saving the baseline scores for later comparisons

base_train_score, base_test_score, base_train_ll, base_test_ll = \
cf.model_scores(base, X_train_tf_df, y_train, X_test_tf_df, y_test)

base_train_score, base_test_score, base_train_ll, base_test_ll

---

**Interpretation**

> The baseline model is designed to be a poor performer: the results are intended to be close to .5 for most metrics, indicating the model is not performing better than simply guessing one result or the other.
>
> I use this model as a comparison point to judge the performance of my other models.

---
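
> A quick calculation shows why the stratified baseline lands near .5. With the `stratified` strategy, the dummy model guesses each class at random in proportion to its share of the training data, so its expected accuracy is the chance that a random guess matches a random label. This sketch uses the rounded 62% share reported earlier as an assumption about the class balance.

```python
# Class shares, using the rounded split reported earlier in the notebook
p_below = 0.62
p_meets = 1 - p_below

# A random guess is right when guess and label are both "below" or both "meets"
expected_accuracy = p_below ** 2 + p_meets ** 2

# For comparison: always predicting the larger class
majority_accuracy = max(p_below, p_meets)

print(round(expected_accuracy, 4), round(majority_accuracy, 2))
```

> The expected accuracy works out to roughly 0.53, slightly below the 0.62 that always guessing the majority class would give, so either number is a reasonable floor for the later models.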

#  📊 **Logistic Regression Model**

In [None]:
clf = LogisticRegression(tol = 1e-3, C = 10, penalty = "l1", solver = 'saga', 
                         max_iter=500, class_weight='balanced', n_jobs=-1,
                         random_state = 42)

clf.fit(X_train_tf_df, y_train)

cf.evaluate_classification(clf, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'accuracy')

---

**Interpretation**

> The simple LogReg model shows a slight performance increase: the log-loss decreased, the accuracy increased, and my macro recall score also increased.
>
> This model mis-predicts values about 64% of the time, most likely due to the class imbalances.

---
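
> Since log-loss is one of the metrics I compare, here is a minimal sketch of how it is computed for a binary target. The helper name `binary_log_loss` and the toy probabilities are assumptions for illustration; my evaluation function is not shown in this notebook.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of the true labels under the predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 to avoid taking log(0)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_true = [1, 0, 1, 0]
coin_flip = binary_log_loss(y_true, [0.5, 0.5, 0.5, 0.5])   # uninformed guesses
confident = binary_log_loss(y_true, [0.9, 0.1, 0.8, 0.3])   # mostly correct and confident

print(round(coin_flip, 4), round(confident, 4))
```

> Lower is better: predictions that put more probability on the correct class score below the coin-flip value of about 0.693.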

# 📊 **RandomForestClassifier**

In [None]:
rfc = RandomForestClassifier(bootstrap = False,max_features= 'sqrt', class_weight = 'balanced',
                            n_jobs=-1, max_depth = 15, min_samples_leaf = 3,
                            min_samples_split = 4, random_state=42)

In [None]:
rfc.fit(X_train_tf_df, y_train)

In [None]:
cf.evaluate_classification(rfc, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'accuracy')

## Results

---

 **Comparing with Logistic Regression Model**
>
> The Random Forest classification model shows a higher degree of over-fitting; this is to be expected for tree-style models.
>
> This model also shows slight performance gains: the log-loss decreased slightly, and the two main metrics, macro recall and accuracy, both increased slightly.
>
> I will use this model as my best-performing model and will use its feature importances for my recommendations.

---

## Visualizing Feature Importances

---

> Now that I have the feature importances from my model, I interpret the results via visualizing the most important features and the target feature.

---

In [None]:
cf.plot_importances(rfc, X_test_tf_df, count=5)

---

**Interpreting Results**

> My resulting feature importances show that **the strongest predictor of scores 4.8+ would be whether or not a host is a SuperHost.** This makes sense, as one of the requirements for a host to be a SuperHost is to maintain a 4.8+ score, in addition to other requirements.
>
> Following SuperHost status is the number of listings for a host. **If a host has a large number of properties, they would most likely be an established businessperson and would be committed to hospitality, versus someone just renting out a spare room.**

---
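
> The plot above comes from my personal `cf.plot_importances` helper, which is not shown here. As a rough equivalent, this sketch ranks a fitted forest's `feature_importances_` on synthetic data. The feature names and data are assumptions, and the helper itself may work differently.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: one feature determines the target, two are pure noise
rng = np.random.default_rng(42)
X = pd.DataFrame({'signal': rng.normal(size=300),
                  'noise_a': rng.normal(size=300),
                  'noise_b': rng.normal(size=300)})
y = (X['signal'] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Pair each importance with its column name and rank from most to least important
importances = (pd.Series(model.feature_importances_, index=X.columns)
                 .sort_values(ascending=False))
top_features = importances.head(2)
print(top_features)
```

> The importances sum to 1, so each value can be read as that feature's share of the model's total split quality. They show association with the target rather than cause.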

# 💡 **Final Recommendations**


---

> **Based on the results of my models, I would recommend for Airbnb to prioritize promoting hosts to SuperHost status.**  SuperHost status is the strongest predictor for the desired high scores, and it is realistic for Airbnb to invest in their development and support. The second- and third-strongest predictors are much more difficult (and unrealistic) for Airbnb and hosts to improve.
>
> For further development, I would do the following:
>* **Include details from text reviews:** while the traditional survey questions are respected and informative, text-based reviews take precedence. In my experience in hotel operations, I would often get much more information from the written reviews, including nuances and specifics that the yes/no or 1-5 ratings miss.
>* **Include other regions:** My current dataset focused only on the Washington, D.C. area. Due to different regional factors (social/economic demographics; legal restrictions; etc.), other markets may show other features to be more important than my results. Additionally, I would like to explore international data to compare with the domestic data.

--- 

# **Testing Models - Poorer Performances**

---

> The models below showed poorer performance versus my Logistic Regression and my Random Forest models. I include them for reference and example.

---

## AdaBoostClassifier

In [None]:
# abc = AdaBoostClassifier(n_estimators=100, random_state=42)

In [None]:
# abc.fit(X_train_tf_df, y_train)

# cf.evaluate_classification(abc, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'accuracy')

## Gradient Boosting

In [None]:
# gbc = GradientBoostingClassifier(learning_rate=1.0, max_depth=1, random_state=42)

In [None]:
# gbc.fit(X_train_tf_df, y_train)

# cf.evaluate_classification(gbc, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'accuracy')

## GridSearchCV: LogisticRegression

In [None]:
# import warnings
# warnings.filterwarnings('ignore')

In [None]:
# lr_params = {
#  'C': [.001, .01, .1, 1, 10, 100, 1000],
#     'penalty':['l1', 'l2', 'elasticnet', 'none'],
#     'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
#     'max_iter':[100, 200, 300, 400]}

In [None]:
# gscv = GridSearchCV(LogisticRegression(class_weight='balanced'), lr_params, scoring = 'balanced_accuracy', cv=3,
#                     n_jobs = -1)
# gscv

In [None]:
# gscv.fit(X_train_tf_df, y_train)

In [None]:
# logreg_params = gscv.best_params_

# logreg_params

In [None]:
# gscv.best_estimator_

In [None]:
# cf.evaluate_classification(gscv.best_estimator_, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'balanced accuracy')

## GSCV: RandomForestClassifier

In [None]:
# rfc_params = {
#     'n_estimators':[100, 125, 150,],
#     'max_depth': [10,20,30,40],
#     'min_samples_split': [2,3,4],
#     'min_samples_leaf': [1,2,3]
# }

In [None]:
# rfc = RandomForestClassifier(class_weight = 'balanced',
#                             n_jobs=-1, random_state=42)

In [None]:
# rfgs = GridSearchCV(rfc, rfc_params, scoring = 'balanced_accuracy', cv=3,
#                     n_jobs = -1)
# rfgs

In [None]:
# rfgs.fit(X_train_tf_df, y_train)

In [None]:
# rfc_params = rfgs.best_params_

# rfc_params

In [None]:
# rfgs.best_score_

In [None]:
# rfc_new = rfgs.best_estimator_

In [None]:
# cf.evaluate_classification(rfc_new, X_train_tf_df, y_train, X_test_tf_df, 
#                            y_test, 'recall (macro)')

## GSCV: AdaBoost

In [None]:
# abc_params = {'n_estimators': [10,20, 30],
# 'learning_rate': [0.0001, 0.001, 0.01, 0.1]}

# # cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# abc = AdaBoostClassifier(DecisionTreeClassifier(),random_state=42)

# abgs = GridSearchCV(estimator=abc, param_grid=abc_params, n_jobs=-1,
#                            cv=3, scoring='balanced_accuracy', verbose=2)


In [None]:
# abgs.fit(X_train_tf_df, y_train)

In [None]:
# abc_best_params = abgs.best_params_

# abc_best_params

In [None]:
# abgs.best_score_

In [None]:
# abc_new = abgs.best_estimator_

In [None]:
# cf.evaluate_classification(abc_new, X_train_tf_df, y_train, X_test_tf_df, 
#                            y_test, 'recall (macro)')

## XGBoost

In [None]:
# from xgboost import XGBClassifier

In [None]:
# xbgc = XGBClassifier()

In [None]:
# xbgc.fit(X = X_train_tf_df, y=y_train)

In [None]:
# xbgc.predict(X_test_tf_df)

In [None]:
# cf.evaluate_classification(xbgc,X_train_tf_df, y_train, X_test_tf_df, y_test, metric= 'accuracy')

In [None]:
# xgbc_names = xbgc.get_booster().feature_names
# # xgbc_names

In [None]:
# xgbc_importances = xbgc.feature_importances_
# # xgbc_importances

In [None]:
# xgbc_results = pd.Series(data = xgbc_importances, index = xgbc_names)
# xgbc_results

In [None]:
# xgbc_results.sort_values(ascending = False)