# 🏡 **Helping Your Hosts:** Predicting Airbnb Host Ratings 🏨

---

> **Phase 3 Project: Classification**
>
> **Author:** Ben McCarty

---

---

**In a post-COVID world, hospitality faces challenges as travel restrictions are imposed and lifted (and then re-imposed).** Travel and tourism came to a crashing halt in 2020 and still face challenges in returning to pre-2020 business levels.

As restless travelers look to escape the confines of their homes, they expect the same high-quality services and experiences as pre-COVID. Competition within the hospitality industry is stronger than ever, putting more pressure on businesses to keep and grow their customer base.

**The main performance metric for every company involved in hospitality is guest satisfaction.** If a guest isn't satisfied, they are not likely to return for another visit and may share their experience with others, pushing away potential business.

Airbnb hosts face the same challenges as traditional hotels in these aggressive and challenging market conditions. In order to maximize their profitability and to distinguish themselves from traditional hotels, **Airbnb needs to know which aspects of a host property are the strongest predictors of whether a guest will give a satisfaction score of 4.8 or higher (out of 5).**

With this question in mind, I obtained data about Airbnb host properties from the [Inside Airbnb project](http://insideairbnb.com/get-the-data.html#:~:text=Washington%2C%20D.C.%2C%20District%20of%20Columbia%2C%20United%20States) for the Washington, D.C. area. The dataset includes details about the hosts themselves; property details (bedrooms, bathrooms, property types); and reservation availability.

**Once I have the data readied, I will use machine learning modeling techniques to determine my most important features for the region.** Then I will provide my final recommendations on what Airbnb should do to maximize the likelihood of their hosts obtaining a score of 4.8 or greater.

---

# 📂 **Imports and Settings**

In [None]:
## Tools to reload functions
%load_ext autoreload
%autoreload 2

In [None]:
## Data Handling
import pandas as pd
import numpy as np
import datetime

## Visualizations
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact_manual
import missingno
import shap

## Personal functions
from bmc_functions import eda
from bmc_functions import classification as clf

## Settings
from IPython.display import display
%matplotlib inline
plt.style.use('seaborn-talk')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 100)

In [None]:
## Speeding up SKLearn via Intel(R) Extension for Scikit-learn*
from sklearnex import patch_sklearn
patch_sklearn()

In [None]:
## Scikit-Learn
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import set_config
set_config(display='diagram')

from imblearn.over_sampling import SMOTE

## ✅ Show Visualizations Setting

In [None]:
## Setting to control whether or not to show visualizations
show_visualizations = False

# 📖 **Read Data**

In [None]:
## Reading data and saving to a DataFrame

source = 'data/listings.csv.gz'
data = pd.read_csv(source)
data.head(5)

In [None]:
## Checking number of rows and columns
data.shape

---

> The initial read of the dataset shows there are 74 features and 8,033 entries. A quick glance at the `.head()` gives a sample of the entries, showing that some of the features are not relevant to my analysis.
>
> I need to get a better idea of the statistics for the dataset, especially any missing values and the datatypes for each column. I need to pre-process this data before I can perform any modeling.

---

# 👨‍💻 **Interactive Investigation**

---

> To increase accessibility to the data, **I include a widget to allow the user to sort through the data interactively.** I use [**Jupyter Widgets**](https://ipywidgets.readthedocs.io/en/latest/index.html) to create this interactive report.
>
>**To use:** select the column by which you would like to sort from the dropdown menu, then click the "Run Interact" button.
>
>***Note about 'Drop_Cols' and Cols:*** these keyword arguments are used to allow the user to drop specific columns.
>
> **Only click the "Drop_Cols" option when specifying "Cols"!** Otherwise it will cause an error.
>
>The 'Cols' dropdown menu does not affect the resulting report; the data is filtered from the report prior to displaying the results. 
>
>I chose to include this option for flexibility and adaptability, but it does have the unintended consequence of creating another drop-down menu. Please ignore this menu, as it does not provide any additional functionality. For future work, I will disable the menu to prevent confusion.

---

In [None]:
## Running report on unfiltered dataset

interact_manual(eda.sort_report, Sort_by=list(eda.report_df(data).columns),
                Source=source);

In [None]:
data.head(3)

---

> After reviewing my data, I see there are several features that contain irrelevant entries (URLs, source data, metadata) or values that are too complicated for simple processing (such as host and listing descriptions).
>
> I will drop these columns for the second report to review the remaining data for further processing.

---

In [None]:
## Specifying columns to drop

drop = ['id', 'host_id', 'name', 'description', 'neighborhood_overview', 'host_name',
        'host_about', 'host_location', 'neighbourhood', 'property_type',
        'listing_url', 'scrape_id', 'last_scraped', 'picture_url','host_url',
        'host_thumbnail_url','host_picture_url','calendar_last_scraped']

In [None]:
df = data.drop(columns=drop).copy()
df

In [None]:
## Creating updated interactive report

interact_manual(eda.sort_report, Drop_Cols = True, Cols = drop,
                Sort_by=list(eda.report_df(df).columns), Source=source);

---

**Interpretation:**

> The report shows that the dataset has a big problem with missing values:
>
>* **100% Missing:**
>   * `neighbourhood_group_cleansed`
>   * `bathrooms`
>   * `calendar_updated`
></br></br>
> * **~100% Missing:**
>   * `license`
></br></br>
> * **26-39% Missing:**
>   * `host_about`
>   * `neighborhood_overview`
>   * `neighbourhood`
>   * `host_response_time`
>   * `host_response_rate`
>   * `review_scores_value`
>   * `review_scores_checkin`
>   * `review_scores_location`
>   * `review_scores_accuracy`
>   * `review_scores_communication`
>   * `review_scores_cleanliness`
>   * `host_acceptance_rate`
>   * `reviews_per_month`
>   * `first_review`
>   * `review_scores_rating`
>   * `last_review`

---

**Handling the Missing Values**

> I will need to address these missing values before proceeding with the modeling. My options include:
>
>
>* **Filling missing values with an imputer as part of modeling pipeline**
>   * *Allows for the flexibility to test different imputation methods*
>   * *Can add a feature to indicate which features were missing values.*</br></br>
>
>* **Dropping the rows or features with missing values.**
>   * *Dropping features reduces dimensionality; dropping rows reduces the number of observations*
>   * *May decrease model performance due to less information*</br>

**I will use a mix of these two options:** I will drop the features missing nearly all of the values, then I will use an imputer during my pipeline process in combination with a GridSearch to identify the best method to use to fill the missing values.

---

---

**MissingNo**

> To get a better idea of the missing values, I create a visual of the values via the 'Missingno' package. This visualization package includes several options for visualizing the missing data.
</br></br>
> *Note: please set the "show_visualizations" variable to "True" to show these visualizations. By default they are disabled due to time required for processing.*

---

In [None]:
## Visually inspecting missing values
if show_visualizations:
    missingno.bar(data, labels=True);

In [None]:
## Visually inspecting missing values
if show_visualizations:
    missingno.matrix(data, labels=True);

---

> Based on this visualization, I see that **there is a consistent trend in missing values for review scores:** if a row is missing one review score, it seems to be missing all of them.
>
> Additionally, **there are many missing values for the response time, response rate, and acceptance rate.** I want to use these columns in my classification, so I will need to replace those missing values.
>
> After reviewing these details, **I feel more comfortable with the option of dropping those rows with missing review values.** I will drop the values as part of my overall classification process.
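
The "missing together" pattern can also be checked numerically rather than by eye. The following is a minimal sketch on a made-up toy frame (the values are not taken from the listings data); it compares the rows missing *any* review score against the rows missing *all* of them.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are made up)
toy = pd.DataFrame({
    'review_scores_rating':      [4.9, np.nan, 4.5, np.nan],
    'review_scores_cleanliness': [4.8, np.nan, 4.6, np.nan],
    'review_scores_value':       [4.7, np.nan, 4.4, np.nan],
})

review_cols = [col for col in toy.columns if col.startswith('review_scores_')]
missing = toy[review_cols].isna()

# Rows missing at least one score vs. rows missing every score
any_missing = missing.any(axis=1)
all_missing = missing.all(axis=1)

# If the two masks match, the scores are always missing together
scores_missing_together = any_missing.equals(all_missing)
print(scores_missing_together)
```

If the two masks differ on the real data, some rows are only partially missing and would need a closer look before dropping.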

---

# 🧼 **Data Cleaning and EDA**

---

**Cleaning Process**

> To begin the cleaning process, I will:
>   * Drop the rows missing values for the target feature
>   * Drop features missing 98-100% of values

**Processing Remaining Missing Values**

> Instead of performing any manual updates to the remaining values, I will test different imputation methods as part of my modeling pipeline. </br></br>
> Potential methods would include:
>   * Imputing the string "MISSING"
>   * Imputing the most frequent value for string values
>   * Using the mean, median, or mode for numeric datatypes
>
> The benefit of including this step in a pipeline is that I will be able to include these different methods in a GridSearchCV as part of my hyperparameter tuning steps.
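
As a rough illustration of treating the imputation method as a hyperparameter, here is a minimal sketch on synthetic data. The column names, the model, and the grid are assumptions for demonstration only and are not the final pipeline used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic numeric data; roughly 10% of the values are removed at random
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
y = (X['a'] + X['b'] > 0).astype(int)
X = X.mask(rng.random(X.shape) < 0.1)

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('model', LogisticRegression(max_iter=1000)),
])

# The imputation strategy is searched like any other hyperparameter
param_grid = {'imputer__strategy': ['mean', 'median', 'most_frequent']}

search = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)
```

Because the imputer sits inside the pipeline, it is re-fit on each training fold, which keeps information from the validation fold out of the imputed values.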

---

## Dropping Rows Without Target Value

---

To start my data cleaning and exploration, I will drop any rows missing values for my target feature.

---

In [None]:
## Calculating number of missing values for target feature
display(df['review_scores_rating'].isna().sum())
f"The target feature is missing {round(df['review_scores_rating'].isna().sum() / len(df)*100, 2)}% of its data."

In [None]:
## Identifying row indices
nan_index = df['review_scores_rating'].isna()
nan_index

In [None]:
## Dropping rows by index and resetting index
df = df.drop(df[nan_index].index)
df.reset_index(drop=True, inplace=True)
df

In [None]:
## Reviewing results
f"There are {df['review_scores_rating'].isna().sum()} missing values."

---

> Now that I dropped the missing values for my target feature, I will review the distribution of my ratings.

---


In [None]:
## Reviewing the percentage of ratings at or above the threshold of 4.8

f"{(df['review_scores_rating'] >= 4.8).mean():.0%} of ratings are at/above the threshold"

In [None]:
## Visualizing the overall distribution of ratings

ax = sns.histplot(data = df['review_scores_rating'], bins = 'auto')

ax.set(title = 'Distribution of All Review Ratings',
       xlabel = 'Rating Score', ylabel = 'Number of Ratings')

median = df['review_scores_rating'].median()
mean = round(df['review_scores_rating'].mean(), 2)
ax.axvline(median, label = f'Median Score: {median}', color='g')
ax.axvline(mean, label = f'Average Score: {mean}', color='y')
ax.axvline(4.8, label = 'Score Threshold: 4.80', color='k')
ax.legend(fontsize= 'large',title = 'Score Thresholds',
          title_fontsize = 'large');

In [None]:
## Zooming in on 4.5 - 5.0 range
ax = sns.histplot(data = df['review_scores_rating'][df['review_scores_rating']>4.5], bins = 'auto')
ax.set(title = 'Distribution of Review Ratings: 4.5+', xlabel = 'Rating Score', ylabel = 'Number of Ratings')
median = df['review_scores_rating'].median()
mean = round(df['review_scores_rating'].mean(), 2)
ax.axvline(median, label = f'Median Score: {median}', color='g')
ax.axvline(mean, label = f'Average Score: {mean}', color='y')
ax.axvline(4.8, label = 'Score Threshold: 4.80', color='k')
ax.legend(fontsize= 'large',title = 'Score Thresholds',title_fontsize = 'large');

---

**Observations and Next Steps**

> Based on the results above, I see that **62% of the reviews are at or above the target threshold of 4.8.**
>
> These scores show a moderate class imbalance: roughly 62% of listings meet the threshold and 38% do not. This imbalance may still impair the performance of my future model.
>
> To address this imbalance, I will later use the SMOTE technique to oversample the minority class.

---

# 🔨 Feature Selection and Engineering

---

**Selecting Features to Drop**

> * Dropping:
>    * all features beginning with "review" except target (these review scores constitute the target)
>    * features missing nearly all/all values
>    * `host_is_superhost` - status requires score of 4.8+
>    * `amenities` - string containing listing-specific amenities; removed due to non-standard/varied data adding little extra value to the model
>        * **Future work:** using NLP to identify the most common amenities to create more valuable features
>    * `host_neighbourhood` - `neighbourhood_cleansed` is more substantial and covers the same data
>    * `first_review` - does not add significant value
</br></br>

**Engineering New Features**

>    * `reviewed_within_year`: boolean column representing whether or not the listing's most recent review was within a year of the specified date

## Dropping Features

In [None]:
## Generating list of columns to drop starting with features beginning with "review_" (except the target feature)
drop_feats = df.loc[:, df.columns.str.startswith('review_')].drop(columns='review_scores_rating').columns.to_list()

## Adding additional features as described above
drop_feats.extend(['host_is_superhost',"amenities", 'host_neighbourhood', 'first_review'])
drop_feats

In [None]:
## Adding column names for features missing more than 90% of their data
drop_feats.extend(df.isna().mean()[df.isna().mean() > .9].index.to_list())
drop_feats

In [None]:
## Dropping selected features
df = df.drop(columns=drop_feats)
df

## Creating `reviewed_within_year` Feature

In [None]:
## Determining the date a year ago
selected_date = datetime.datetime(2021, 10, 14)
last_year = (selected_date-datetime.timedelta(days=365))
last_year

In [None]:
## Generating boolean feature representing if the last review was within a year of the specified date
df['reviewed_within_year'] = (pd.to_datetime(df['last_review']) >= last_year).astype(int)
df

In [None]:
## Dropping 'last_review' due to multicollinearity concerns
df = df.drop(columns='last_review')
df

## Creating `years_hosting`

---

> Using the `host_since` feature, I will create a new feature representing each host's years of experience.
>
> After creating the new feature, I will drop the `host_since` feature.
>
> ***Special Note:*** Due to the presence of a limited number of missing values, I will temporarily fill the missing values with a placeholder to create the features, then re-convert the placeholders to NaN values.
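
For comparison, the same feature can be derived without a placeholder, since pandas datetimes propagate missing values on their own. This is a minimal sketch on toy dates (made up for illustration), offered as an alternative to the placeholder approach rather than a replacement for it.

```python
import numpy as np
import pandas as pd

# Toy 'host_since' values, including a missing entry
host_since = pd.Series(['2015-06-01', np.nan, '2019-11-23'])

# .dt.year yields NaN for missing dates, so the subtraction propagates NaN
years_hosting = 2021 - pd.to_datetime(host_since).dt.year
print(years_hosting.tolist())
```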

---

In [None]:
## Calculating initial number of missing values
df["host_since"].isna().sum()

In [None]:
## Identifying indices for missing values
host_since_nan = df["host_since"][df["host_since"].isna()].index
host_since_nan

In [None]:
## Determining unique placeholder
f'Is the value "2000-01-01" in the column? {len(df["host_since"][df["host_since"].isin(["2000-01-01"])])>0}'

In [None]:
## Filling with placeholder value
df["host_since"] = df["host_since"].fillna('2000-01-01')
f'There are {df["host_since"].isna().sum()} missing values.'

In [None]:
## Creating feature
df['years_hosting'] = df["host_since"].map(lambda x: 2021- int(x.split("-")[0]))
df['years_hosting']

In [None]:
## Confirm post-conversion results for original missing rows
df.loc[host_since_nan, ["host_since", 'years_hosting']]

In [None]:
## Converting back to missing values
df.loc[host_since_nan, ["host_since", 'years_hosting']] = np.nan
df.loc[host_since_nan, ["host_since", 'years_hosting']]

In [None]:
## Confirming final values
print(df['years_hosting'].value_counts(dropna=False))
eda.report_df(df).loc['years_hosting':]

In [None]:
## Dropping `host_since` feature after update
df = df.drop(columns='host_since')
df

---

**Conclusion: Creating `Years_Hosting` Feature**

> I successfully created the new feature representing each host's years of experience (up to 2021).

---

# 🔧 **Fixing Features**

---

> After dropping certain features and creating new ones, I will process the remaining features and columns to allow for the modeling process.
>
> I perform the following changes:</br></br>

> **Data Conversions:**
>   * **Binarizing target:** currently, my target feature `review_scores_rating` is a range of values; I will convert it to a binary feature indicating whether the score meets the 4.8 threshold.
>   * **T/F values:** Any features with 't'/'f' values need to be converted to 1/0, respectively.
>   * **`price`:** The `price` feature consists of string values; to use it properly, I will convert the values to the float datatype.
>   * **`room_type`:** Converting to simpler string values.
>   * **`neighbourhood_cleansed`:** The 'neighbourhood_cleansed' feature values are a single string of neighborhoods. I will split these strings into boolean features for each neighborhood. </br></br>

> **Feature Engineering:**
>   * `years_hosting`: Using the year in which the host started in the `host_since` feature to calculate the number of years as a host.
>   * `bathrooms_text`: Converting to a new `num_bathrooms` numeric feature.

---

## Binarizing Target Feature

---

> In order to achieve the goal of identifying the most important features for review scores, I convert the target variable `review_scores_rating` into binary values to represent whether the score is below the threshold of 4.8 (represented as a '0') or at/above the threshold (represented as a '1').
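
The thresholding itself reduces to a single comparison. A minimal sketch on toy ratings (made-up values):

```python
import pandas as pd

# Toy ratings on either side of the 4.8 threshold
ratings = pd.Series([4.95, 4.8, 4.79, 3.5])

# The comparison yields True/False; casting to int gives 1/0
meets_score_threshold = (ratings >= 4.8).astype(int)
print(meets_score_threshold.tolist())
```

Note that a missing rating would compare as False and become 0 under this shortcut; in this notebook the rows without a target value are dropped beforehand, so that case does not arise.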

---

In [None]:
## Using np.select to reassign target values based on conditional evaluations

cond = [df['review_scores_rating'] >= 4.8,
        df['review_scores_rating'] < 4.8]

choice = [1,0]

df['meets_score_threshold'] = np.select(cond, choice, 0)

In [None]:
## Reviewing results to confirm only 0/1 values and inspecting balance
df['meets_score_threshold'].value_counts(dropna=False, normalize=True, sort=False)

In [None]:
## Dropping old feature
df = df.drop(columns= 'review_scores_rating')
df

---

> After processing the missing values and formatting the data, the values are properly converted into 0/1 values and the class balance is maintained.

---

## Converting True/False Columns to Binary Values

In [None]:
## Selecting columns containing string values
df.select_dtypes('O')

In [None]:
## Identifying columns with 't' values - also includes 'f' values
t_f_col = df.loc[:,(df == 't').any()].columns
t_f_col

In [None]:
## Converting t/f to 1/0, respectively
df.loc[:,t_f_col] = df.loc[:,t_f_col].replace({ 't' : 1, 'f' : 0})

In [None]:
df.loc[:,t_f_col]

In [None]:
## Confirming values
display(df['host_has_profile_pic'].value_counts(dropna=False))
display(df['host_identity_verified'].value_counts(dropna=False))

In [None]:
## Verifying results
eda.report_df(df[t_f_col])

---

**Conclusion: Binarizing True/False**

> The `t` and `f` values are now properly converted to binary values. The pipeline imputer will resolve the remaining missing values.

---

## Converting Price to Float 

In [None]:
## Inspecting original feature
df['price']

In [None]:
## Converting each string value into a float
df['price'] = df['price'].map(lambda price: price[1:].replace(',','')).astype('float')
df['price']

In [None]:
df['price'].head()

In [None]:
df['price'].describe()

---

**Conclusion: Prices to Floats**

> The `price` feature is now a float instead of a string value.
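
A vectorized variant of this conversion is sketched below on toy prices (made-up values). It strips the dollar sign and thousands separators with one regular expression and, unlike slicing off the first character, leaves missing entries as NaN instead of raising an error.

```python
import pandas as pd

# Toy price strings, including a missing entry
prices = pd.Series(['$85.00', '$1,250.00', None])

# Remove '$' and ',' in one pass, then cast to float
price_float = prices.str.replace(r'[$,]', '', regex=True).astype(float)
print(price_float.tolist())
```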

---

## Converting `Host_Response_Rate` and `Host_Acceptance_Rate`

---

**Response and Acceptance Rates**

> Currently the `host_response_rate` and `host_acceptance_rate` features consist of string values representing percentages.
>
> **I will convert these values to float values.**
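
As a point of comparison with the placeholder-based approach, pandas string methods skip missing values, so the percentages can be converted directly. A minimal sketch on toy rates (made-up values):

```python
import numpy as np
import pandas as pd

# Toy percentage strings, including a missing entry
rates = pd.Series(['95%', np.nan, '100%', '0%'])

# Strip the trailing '%' and rescale to the 0-1 range; NaN passes through untouched
rate_float = rates.str.rstrip('%').astype(float) / 100
print(rate_float.tolist())
```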

---

In [None]:
## Iterating through selected features and filling missing values

feat_list = ['host_acceptance_rate', 'host_response_rate']

placeholder = '999%'

for x in feat_list:

    print(f'***'*5, f'\nFeature: {x}:\n')
    ## Checking for Missing Values
    print(f"There are {df[x].isna().sum()} missing values.")

    ## Confirming the placeholder is not already present in the feature
    placeholder_present = df[x].isin([placeholder]).any()
    print(f"Is the value '{placeholder}' in '{x}'? {placeholder_present}")

    if not placeholder_present:
        ## Filling missing values with placeholder
        df[x] = df[x].fillna(placeholder)
        print(f'Filled missing values in {x} with {placeholder}.')
    else:
        print('Please select a different placeholder value.')

    print(f"There are {df[x].isna().sum()} missing values remaining.\n")

In [None]:
## Removing '%' and converting to floats
for feat in feat_list:
    df[feat] = df[feat].map(lambda x: float(x.replace('%',''))*.01)

In [None]:
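## Converting placeholder back to NaN
## Computing the float the same way as the conversion above so the values match exactly
placeholder_value = float(placeholder.replace('%', '')) * .01

for x in feat_list:
    df[x] = df[x].replace(placeholder_value, np.nan)
    display(df[x].isna().sum())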

In [None]:
## Reviewing final updates
df[feat_list].describe()

---

**Conclusion: Converting Percentages**

> The `host_acceptance_rate` and `host_response_rate` features are now stored as float values, more accurately representing the percentage values.

---

## ❌ Filling Beds

---

**`Beds` and `Bedrooms`**

> The `beds` and `bedrooms` features are both missing values, and instead of using an imputer or dropping the columns, I will take a different approach: filling the missing values with the values from the other feature. As these two features are very similar (you would expect a bed to be in a bedroom), I feel it is acceptable to take this approach.
>
> As the values are similar between the two, I will compare the rows against each other. For each row, if there is a missing value in one column that is present in the other, I will fill the missing value with the value present in the other column.


My first approach was to fill each feature's missing values with the value from the other. However, I realized that this approach may misrepresent listings for a shared room (like a hostel, for example), where a bed is not tied to a full bedroom.

**I will inspect the rows in which there is a value for `beds` but not `bedrooms` to determine if there are any listings that do not include a full bedroom.**

**For the remaining listings, I will simply use each feature's values to fill in any missing values for the other.**
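
The mutual fill described above can be sketched as follows on a toy frame (made-up values). Both fills are computed from the original columns so that neither one feeds into the other; rows missing both values stay missing.

```python
import numpy as np
import pandas as pd

# Toy listings: each column is missing values the other may hold
rooms = pd.DataFrame({'beds':     [1.0, np.nan, 2.0, np.nan],
                      'bedrooms': [np.nan, 3.0, 2.0, np.nan]})

# Fill each column from the other, using the original (unfilled) values
beds_filled = rooms['beds'].fillna(rooms['bedrooms'])
bedrooms_filled = rooms['bedrooms'].fillna(rooms['beds'])
rooms = rooms.assign(beds=beds_filled, bedrooms=bedrooms_filled)
print(rooms)
```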

---

In [None]:
eda.report_df(df)[eda.report_df(df)['null_sum'] >0].sort_values('null_sum', ascending=False)

In [None]:
eda.report_df(df).sort_values('null_sum', ascending=False).loc[['beds', 'bedrooms'],:]

In [None]:
## Inspecting room types and number of beds detailed for listings without values for 'bedrooms'
display(df['room_type'][df['bedrooms'].isna()].value_counts(dropna=False, normalize=True))
display(df['beds'][df['bedrooms'].isna()].value_counts(dropna=False, normalize=True))

---

These results show that listings that are missing the number of bedrooms are most often entire homes/apartments with either 1 or 2 beds.

---

In [None]:
df['bedrooms'][df['room_type'] == 'Entire home/apt'].value_counts(dropna=False, normalize=True)

### Old Code

In [None]:
# ## Filling missing values for 'beds' with values for 'bedrooms'
# df['beds'].fillna(df['bedrooms'], inplace=True)
# df['beds'].isna().sum()

# ## Filling missing values for 'bedrooms' with values for 'beds'
# df['bedrooms'].fillna(df['beds'], inplace=True)

# df['beds'].isna().sum(), df['bedrooms'].isna().sum()

In [None]:
# ## Confirming reduction in missing values for 'beds' and 'bedrooms'

# # rpt_clean  = eda.report_df(df)
# # rpt_clean[rpt_clean['null_sum'] >0].sort_values('null_sum', ascending=False)

# f"There are {df['beds'].isna().sum()} missing value(s) for `beds` and {df['bedrooms'].isna().sum()} missing value(s) for `bedrooms`."

In [None]:
# ## Inspecting row with missing value for "bed"
# df['beds'][df['beds'].isna() >0], df['bedrooms'][df['bedrooms'].isna() >0]

## Converting `Bathrooms_Text` to `Num_Bathrooms`

---

**Remaking `Bathrooms` Feature**

> As part of my initial pre-processing steps, I removed features missing 90%+ of their data, including the `bathrooms` feature. However, the `bathrooms_text` feature covers similar data as string values. To use this data, I will need to convert it to numeric data.

**I will convert the `bathrooms_text` feature to a new feature, `num_bathrooms`, to create a "new" feature based on the numeric values from the original text strings.**
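
One way to sketch this conversion is with a regular expression that pulls the leading number from each string. The toy strings below are made up to resemble the feature's format, and counting a text-only half-bath as 0.5 is my assumption.

```python
import numpy as np
import pandas as pd

# Toy 'bathrooms_text' values, including a missing entry
baths = pd.Series(['1 bath', '1.5 shared baths', 'Half-bath',
                   'Shared half-bath', '2 private baths', np.nan])

# Pull the leading number where one exists; other rows become NaN
num_bathrooms = baths.str.extract(r'^(\d+(?:\.\d+)?)')[0].astype(float)

# Text-only half-bath entries carry no number; assume they count as 0.5
is_half = baths.str.contains('half-bath', case=False, na=False)
num_bathrooms = num_bathrooms.mask(is_half, 0.5)
print(num_bathrooms.tolist())
```

This keeps missing entries as NaN throughout, so no placeholder value is needed.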


---

In [None]:
## Inspecting the values for "bathrooms_text"
df['bathrooms_text'].value_counts(dropna=False, normalize=True)

In [None]:
## Inspecting the rows in which there are null values
df[df['bathrooms_text'].isna()]

In [None]:
## Filling null values with unique string ('999' not present otherwise)
df['bathrooms_text'] = df['bathrooms_text'].fillna('999')

## Verifying all null values are filled
df['bathrooms_text'].isna().sum()

In [None]:
## Reviewing updated values
df['bathrooms_text'].value_counts(dropna=False)

In [None]:
## Splitting each string value and selecting the first element (representing the number of bathrooms)
df['num_bathrooms'] = df['bathrooms_text'].map(lambda x: x.split(' ')[0])
df['num_bathrooms'].value_counts()

In [None]:
## Converting remaining string values to floats (text-only half-bath entries count as 0.5)
df['num_bathrooms'] = df['num_bathrooms'].replace({'Shared': 0.5, "Half-bath": 0.5})
df['num_bathrooms'] = df['num_bathrooms'].astype(float)

In [None]:
df['num_bathrooms'].value_counts(dropna=False)

In [None]:
## Inspecting rows where 'num_bathrooms' is zero to validate data
df[df['num_bathrooms'] ==0].head()

In [None]:
## Removing old column post-conversion
df = df.drop(columns = 'bathrooms_text')

In [None]:
## Confirming removal
'bathrooms_text' in df.columns

In [None]:
df['num_bathrooms'] = df['num_bathrooms'].replace(999.00, np.nan)
df['num_bathrooms'].value_counts(dropna=False)

---

**Conclusion: Converting `Bathrooms_Text` to `Num_Bathrooms`**

> I successfully created the new feature, `num_bathrooms`, and retained the four missing values for later imputation.

---

## Cleaning Room_Type

---

**Cleaning `Room_Type` for Encoding**

>  In order to use `room_type` as a categorical variable, I will standardize the string values to be encoded during my pipeline.

---

In [None]:
## Reviewing pre-existing values
df['room_type'].value_counts()

In [None]:
## Replacing values with updated strings

replace_rooms = {'Entire home/apt': 'entire_home', 
                 'Private room': 'private_room',
                 'Shared room': 'shared_room',
                 'Hotel room': 'hotel_room'
                }

df['room_type'] = df['room_type'].replace(replace_rooms)
df['room_type'].value_counts(dropna=False)

---

**Conclusion: `Room_Type`**

> The feature now consists of standardized, single-word strings for encoding.

---

## Binarizing String Values

---

**Single Strings to Binary Features**

> The current values for `neighbourhood_cleansed` and `host_verifications` are single strings that each contain several values.
>
>**For each feature, I will:**
>   * Separate each string into distinct, unique values;
>   * Convert them into a binary column to represent whether or not that value is included in the listing;
>   * Drop the old column.

---

### Neighbourhood_Cleansed

---

**Unique Neighborhoods**

> Currently, the `neighbourhood_cleansed` feature values consist of a single string of neighborhood names. This data is not currently usable for modeling and needs to be converted to be useful.
>
> **To convert this feature, I will:**
>   * *Use a `.join()` method to create a single string of all of the values*
>   * *Use a `.split()` method to split this string on each comma to divide them into separate strings*
>   * *Convert the resulting list to a set to eliminate any duplicate names*
>   * *Iterate over this set to remove any apostrophes or spaces in the names*
>   * *Create new binary features for each neighborhood to indicate whether the processed name is in the given string value*

At the end of this process, the dataframe will have a new binary column for each neighborhood.
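
For reference, pandas offers a built-in that performs the split-and-binarize steps in one call. The sketch below uses a few illustrative neighborhood strings and assumes the names are separated by a comma and a space; it matches whole names rather than substrings.

```python
import pandas as pd

# Illustrative values in the same comma-separated format
hoods = pd.Series(['Brightwood Park, Crestwood, Petworth',
                   'Capitol Hill, Lincoln Park',
                   'Brightwood Park, Crestwood, Petworth'])

# One 0/1 column per distinct name, split on ', '
hood_dummies = hoods.str.get_dummies(sep=', ')
print(hood_dummies)
```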

---

In [None]:
## Inspecting original feature values
df.loc[:,'neighbourhood_cleansed'][:5]

In [None]:
## Create a set of unique neighborhood names
unique_nghbrhd = set(','.join(df['neighbourhood_cleansed']).split(','))
unique_nghbrhd

In [None]:
## Cleaning names and creating T/F binary columns

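for ngbrhd in unique_nghbrhd:

    ## Removing apostrophes and surrounding whitespace from the name
    ngbrhd = ngbrhd.replace("'", "").strip()

    ## regex=False matches the name literally instead of as a pattern
    df[ngbrhd] = df['neighbourhood_cleansed'].str.contains(ngbrhd, regex=False).astype(int)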

In [None]:
## Confirming new column names
df.columns[42:]

---

**Post-Conversion**

The resulting feature names and values are correct. **Now I will repeat the same process for the `host_verifications` feature.**

---

### Host_Verifications

In [None]:
## Inspecting values
df['host_verifications'][:5]

In [None]:
## Inspecting the first ten items of the second row

df.loc[:,'host_verifications'][1][:10]

In [None]:
## Splitting string value between verifications

unique_verif = set(','.join(df['host_verifications']).split(','))
unique_verif

In [None]:
len(unique_verif)

---

**Extra Cleaning**

> These results are slightly different than the previous neighborhood values as they include brackets within the verifications as well as strings consisting of brackets only.
>
> **I will iterate over this set to remove the extra brackets, then create the new binary features.**
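
An alternative to cleaning the brackets by hand is to parse each string into a real list first. The sketch below assumes every value is a valid Python list literal (toy values shown); it also matches whole verification names, so `email` is not counted when only `work_email` is present.

```python
import ast
import pandas as pd

# Toy 'host_verifications' strings, including an empty list
verifs = pd.Series(["['email', 'phone', 'work_email']",
                    "['phone']",
                    "[]"])

# Parse each string into a real Python list
parsed = verifs.map(ast.literal_eval)

# explode() gives one row per verification; grouping by the original
# index collapses the dummies back to one row per listing
verif_dummies = (pd.get_dummies(parsed.explode())
                   .groupby(level=0).max()
                   .astype(int))
print(verif_dummies)
```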

---

In [None]:
## Cleaning names and creating T/F binary columns

for verification in unique_verif:

    ## Removing brackets, quotes, and surrounding whitespace
    verification = verification.replace('[', '').replace(']', '').\
    replace("'", '').replace('"', '').strip()

    ## Skipping entries that contained only brackets
    if verification:
        df[verification] = df['host_verifications'].str.\
                            contains(verification, regex=False).astype(int)

In [None]:
## Inspecting new column names
df.columns[-20:]

In [None]:
## Dropping old features
df = df.drop(columns = ['host_verifications', 'neighbourhood_cleansed'])
df

---

**Conclusion: Binarizing Strings**

> The string values for the `neighbourhood_cleansed` and `host_verifications` features are now converted to new binary features to be used as part of my modeling.

---

# 🔬 **Pre-Pipeline Review**

---

**Final Review**

> At this stage, I completed my data cleaning and preparation. I will review the data summaries once more before proceeding with my modeling.

---

## String Features

In [None]:
## Reviewing string features
eda.report_df(df.loc[:, df.columns[:40]].select_dtypes('O'))

In [None]:
display(df['host_response_time'].value_counts())
display(df['room_type'].value_counts())

## Numeric Features

In [None]:
## Reviewing numeric features
eda.report_df(df.loc[:, df.columns[:40]].select_dtypes('number'))

---

**Conclusion: Cleaning and Prep**

> I am happy with the results of my cleaning process and I am ready to proceed with my modeling pipeline.

---

# ❌ Outlier Detection and Removal

---

Prior to my train/test split and further modeling, I will identify and remove listings with outlier values. These outliers may negatively affect my models' performance if retained.
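
The z-score approach to flagging outliers can be sketched as follows. The function below is a hypothetical stand-in written for illustration; it is not the `find_outliers_z` helper from my `bmc_functions` package, and the cutoff of 3 standard deviations is an assumed default.

```python
import numpy as np
import pandas as pd

def find_outliers_z_sketch(values, cutoff=3):
    """Hypothetical helper: flag values more than `cutoff` standard deviations from the mean."""
    z_scores = (values - values.mean()) / values.std()
    return z_scores.abs() > cutoff

# Toy prices: twenty typical listings and one extreme value
prices = pd.Series([100.0] * 20 + [5000.0])
outliers = find_outliers_z_sketch(prices)
print(outliers.sum())
```

The returned boolean mask can be used to inspect or drop the flagged rows, for example `prices[~outliers]`.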

---

In [None]:
## Testing z-score outlier detection - arbitrarily-selected feature for testing purposes
# df['price'][clf.find_outliers_z(df['price'])].index
# len(df['price'][clf.find_outliers_z(df['price'])])

In [None]:
# outlier_counts = pd.DataFrame()

# for col in df.select_dtypes('number').columns:
#     outlier_counts[col] = clf.find_outliers_z(df[col]).sum()

# outlier_counts

In [None]:
# outlier_counts = {}

# for col in df.select_dtypes('number').columns:
#     outlier_counts[col] = clf.find_outliers_z(df[col])

# outlier_df = pd.DataFrame(outlier_counts)
# outlier_df

In [None]:
# outlier_df.any(axis=1)  ## t/f if a row has an outlier
# outlier_df.mean() ## average % of outliers
# outlier_df.mean()[outlier_df.mean() >= .05].sort_values(ascending=False) ## features with 5+% outlier values


In [None]:
# large_num_feats = df.select_dtypes('number').nunique()[df.select_dtypes('number').nunique() > 50].index.to_list()

# outlier_counts = {}

# for col in large_num_feats:
#     outlier_counts[col] = clf.find_outliers_z(df[col])

# outlier_df_big = pd.DataFrame(outlier_counts)
# outlier_df_big

In [None]:
# outlier_df_big.any()
# outlier_df_big.sum()
# outlier_df_big.mean()

In [None]:
# ## Selecting features with more than X unique values
# large_num_feats = df.select_dtypes('number').nunique()[df.select_dtypes('number').nunique() > 50].index.to_list()
# len(large_num_feats)

# ## Create list of indices with outlier values
# outlier_feats = set()

# for col in large_num_feats:
#     outlier_feats.update(df[col][clf.find_outliers_z(df[col])].index.to_list())
    
# ## Percentage of observations with outliers
# len(outlier_feats)/len(df)

In [None]:
# df_no_outs = df.drop(outlier_feats).copy()
# df_no_outs

# 🪓 **Train/Test Split**

---

> Before I run any further pre-processing, I split my data into training and test sets to allow me to test my model's performance.
>
> **Since my target feature is converted into binary values, I will use the "stratify" parameter in my train/test split, preserving the class balance when I split my data.** This will be key for proper evaluation of my models.

---
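
To illustrate what `stratify` does, here is a small sketch on a made-up, imbalanced target (80 negatives, 20 positives). The names `X_toy` and `y_toy` are placeholders for this example only; the actual split on the Airbnb data happens in the cells below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

## Toy imbalanced target: 80 negatives and 20 positives (made-up data)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

## Stratifying on the target keeps the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=.3,
                                          random_state=42, stratify=y_toy)

## Share of positive cases overall, in the training set, and in the test set
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of positive cases in each subset would vary from split to split, which makes evaluation metrics harder to compare.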

In [None]:
## Specifying features and target columns for dataset
target = 'meets_score_threshold'

X = df.drop(columns = target).copy()
y = df[target].copy()

In [None]:
## Confirming same number of rows
X.shape[0] == y.shape[0]

In [None]:
## Performing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, 
                                                    random_state=42, 
                                                    stratify=y)

# 🚿 **Preprocessing Pipeline**

---

> Before I start my modeling processes, I convert my remaining categorical columns via one-hot encoding and standardize my numeric columns. Once my columns are properly converted, I will save them as new dataframes and use them in my modeling.

---

In [None]:
## Specifying numeric columns for preprocessing
num_cols = X_train.select_dtypes(include='number').columns.to_list()
num_cols

In [None]:
## Specifying categorical columns for preprocessing
cat_cols = X_train.select_dtypes(include='O').columns.to_list()
cat_cols

## Running Preprocessor

In [None]:
## Creating ColumnTransformer and sub-transformers for imputation and encoding


### --- Creating column transformers --- ###

## Imputing missing values
missing_transformer = SimpleImputer(strategy='most_frequent')

## Encoding categoricals - ignoring errors to prevent issues w/ test set
categorical_transformer = OneHotEncoder(handle_unknown='ignore',
                                        sparse=False)


### --- Creating column pipelines --- ###

cat_pipe = Pipeline(steps=[('imputer', missing_transformer),
                            ('ohe', categorical_transformer)])

num_pipe = Pipeline(steps=[('imputer', missing_transformer),
                            ('scaler', StandardScaler())])

## Instantiating the ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[('nums', num_pipe, num_cols),
                  ('cats', cat_pipe, cat_cols)])

preprocessor

In [None]:
## Fitting feature preprocessor
preprocessor.fit(X_train)

## Getting feature names from OHE
ohe_cat_names = preprocessor.named_transformers_['cats'].named_steps['ohe'].get_feature_names(cat_cols)

## Generating list for column index
final_cols = [*num_cols, *ohe_cat_names]

In [None]:
## Transform the data via the ColumnTransformer preprocessor

X_train_tf = preprocessor.transform(X_train)
X_train_tf_df = pd.DataFrame(X_train_tf, columns=final_cols, index=X_train.index)

X_test_tf = preprocessor.transform(X_test)
X_test_tf_df = pd.DataFrame(X_test_tf, columns=final_cols, index=X_test.index)

display(X_train_tf_df.head(5),X_test_tf_df.head(5))

# **Resampling via SMOTE**

In [None]:
smote = SMOTE(random_state = 42, n_jobs=-1)

## Using `fit_resample`, which replaces `fit_sample` in newer imbalanced-learn releases
X_train_tf_df, y_train = smote.fit_resample(X_train_tf_df, y_train)
pd.Series(y_train).value_counts()

# 📊 **Baseline Model**

In [None]:
## Creating baseline classifier model

base = DummyClassifier(strategy='stratified', random_state = 42)

base.fit(X_train_tf_df, y_train)

clf.evaluate_classification(base,X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test, 
                           metric = 'accuracy')

In [None]:
## Saving the baseline scores for later comparisons

base_train_score, base_test_score, base_train_ll, base_test_ll = \
clf.model_scores(base, X_train_tf_df, y_train, X_test_tf_df, y_test)

base_train_score, base_test_score, base_train_ll, base_test_ll

---

**Interpretation**

> The baseline model is designed to be a poor performer: the results are intended to be close to .5 for most metrics, indicating the model performs no better than random guessing.
>
> I use this model as a comparison point to judge the performance of my other models.

---
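
As a quick sanity check of the "close to .5" expectation, the sketch below fits a stratified `DummyClassifier` on made-up, perfectly balanced toy data (not the Airbnb dataset). The variable names are placeholders for this example, and the exact score will differ on real data, particularly on an imbalanced test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

## Toy balanced data: the features are pure noise (made-up data)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(2000, 3))
y_toy = np.array([0, 1] * 1000)

## The 'stratified' strategy guesses labels at random, following the class
## proportions seen during fitting and ignoring the features entirely
dummy = DummyClassifier(strategy='stratified', random_state=42)
dummy.fit(X_toy, y_toy)

toy_accuracy = dummy.score(X_toy, y_toy)
print(round(toy_accuracy, 3))
```

Because the guesses ignore the features, accuracy on balanced classes hovers around one half; any real model should clear this bar.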

#  📊 **Logistic Regression Model**

In [None]:
## LogReg Model
logreg = LogisticRegression(max_iter = 500, random_state = 42, n_jobs=-1)

logreg.fit(X_train_tf_df, y_train)

In [None]:
clf.evaluate_classification(logreg, X_train = X_train_tf_df,y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'average precision')

In [None]:
clf.plot_log_odds(logreg, X_test_tf_df)

## Logistic Regression GridSearchCV

In [None]:
lg_params = {
    'max_iter': [500, 750, 1000],
    'C': [.01, 1, 10],
    'solver': ['lbfgs','newton-cg']
}

In [None]:
## LogReg Model
lrgs = GridSearchCV(LogisticRegression(random_state = 42, n_jobs = -1), lg_params,
                    scoring = 'average_precision', verbose = 2)

lrgs.fit(X_train_tf_df, y_train)

In [None]:
clf.evaluate_classification(lrgs, X_train = X_train_tf_df,y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test)

In [None]:
lrgs.best_params_

In [None]:
best_logreg = lrgs.best_estimator_

clf.plot_log_odds(best_logreg, X_test_tf_df)

---

**Interpretation**

> The simple LogReg model shows a slight performance increase - the log-loss decreased, the accuracy increased, and my macro recall score also increased.
>
> This model mis-predicts values about 64% of the time, most likely due to the class imbalances.

---

# 📊 **RandomForestClassifier**

## Vanilla RFC

In [None]:
rfc = RandomForestClassifier(n_jobs=-1, random_state=42)

In [None]:
rfc.fit(X_train_tf_df, y_train)

In [None]:
clf.evaluate_classification(rfc, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'accuracy')

## RFC GSCV

In [None]:
rfc_params = {'criterion': ['gini', 'entropy'],
              'max_depth': [40,50, 60],
              'min_samples_split': [2,3]
}

rfgs = GridSearchCV(RandomForestClassifier(random_state = 42, n_jobs=-1),
                    rfc_params,scoring = 'average_precision',verbose = 2,
                   cv = 3)

rfgs.fit(X_train_tf_df, y_train)

In [None]:
rfgs.best_params_

In [None]:
best_rfc = rfgs.best_estimator_

In [None]:
clf.evaluate_classification(best_rfc, X_train = X_train_tf_df,y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'average precision')

In [None]:
## New Importances
clf.plot_importances(best_rfc, X_test_tf_df)

---

**Interpreting Results**

> My resulting feature importances show that **the strongest predictor of scores 4.8+ would be whether or not a host is a SuperHost.** This makes sense, as one of the requirements for a host to be a SuperHost is to maintain a 4.8+ score, in addition to other requirements.
>
> Following SuperHost status is the number of listings for a host. **If a host has a large number of properties, they would most likely be an established businessperson and would be committed to hospitality, versus someone just renting out a spare room.**

---

# **Interpreting Results with SHAP**

---

> **One of the downsides of tree-based models is the difficulty of interpreting the impact of a specific feature.** Feature importances from tree-based models indicate how much a feature contributed to the trees' splitting decisions, but they do not indicate whether that feature pushed predictions toward or away from the target class.
>
> To interpret these results, I will use a package called **SHAP** to produce "Shapley values" for each feature. These values indicate each feature's marginal contribution to the model's predictions, answering the question, "*How much does the prediction change with this feature versus without it?*"
>
> Using tools within the package, I will focus on the `summary_plot`, which visualizes each feature's Shapley values alongside the feature's own values from low to high (relative to each feature).
>
> More information about SHAP:
>* [SHAP Documentation](https://shap.readthedocs.io/en/latest/?badge=latest)
>* [SHAP Repository](https://github.com/slundberg/shap)

---
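
To make "marginal contribution" concrete, the sketch below computes exact Shapley values for a tiny, made-up scoring function by enumerating every coalition of features. The function `toy_value` and its numbers are hypothetical and unrelated to my fitted model; SHAP's `TreeExplainer` uses a much faster tree-specific algorithm rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values via brute-force enumeration of coalitions."""
    n = len(features)
    result = {}
    for feat in features:
        others = [f for f in features if f != feat]
        total = 0.0
        for size in range(n):
            ## Weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                with_feat = value_fn(set(coalition) | {feat})
                without_feat = value_fn(set(coalition))
                total += weight * (with_feat - without_feat)
        result[feat] = total
    return result


def toy_value(present):
    """Hypothetical score given which features are 'known' (made-up numbers)."""
    score = 0.50
    if 'superhost' in present:
        score += 0.30
    if 'listings' in present:
        score += 0.10
    if 'superhost' in present and 'listings' in present:
        score += 0.04  ## interaction between the two features
    return score


phi = shapley_values(toy_value, ['superhost', 'listings'])
print(phi)
```

The two values sum to the difference between the score with all features and the score with none, and the interaction term is shared equally between the two features.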

In [None]:
## Initializing JavaScript for SHAP plots
shap.initjs()

In [None]:
## Generating a sample of the overall data for review:
X_shap = shap.sample(X_test_tf_df, nsamples=50)

In [None]:
## Initializing an explainer with the RandomForestClassifier model
t_explainer = shap.TreeExplainer(rfgs.best_estimator_)

## Using TreeExplainer

---

> The SHAP package includes a few different "Explainer" objects to "explain" the results of different types of models. Since I used a RandomForestClassifier, I will use the "TreeExplainer" to calculate my SHAP values for plotting.

---

In [None]:
## Calculating SHAP values for sample test data
shap_values = t_explainer.shap_values(X_shap)
len(shap_values)

In [None]:
## Preparing column names for visualization labels
X_shap.rename(columns = lambda x: x.title().replace('_', ' '), inplace=True)

In [None]:
## Better plot
shap.summary_plot(shap_values[1],X_shap,max_display=15)

# 💡 **Final Recommendations**


---

> **Based on the results of my models, I recommend that Airbnb prioritize promoting hosts to SuperHost status.** SuperHost status is the strongest predictor of the desired high scores, and it is realistic for Airbnb to invest in hosts' development and support. The second- and third-strongest predictors are much more difficult (and unrealistic) for Airbnb and hosts to improve.
>
> For further development, I would do the following:
>* **Include details from text reviews:** While the traditional survey questions are respected and informative, text-based reviews take precedence. In my experience in hotel operations, I would often get much more information from the written reviews, including nuances and specifics that the yes/no or 1-5 ratings miss.
>* **Include other regions:** My current dataset focused only on the Washington, D.C. area. Due to different regional factors (social/economic demographics; legal restrictions; etc.), other markets may show other features to be more important than my results. Additionally, I would like to explore international data to compare with the domestic data.

--- 