# 🔻 [Return to workflow](#leftoff)

⚓ ANCHOR FOR RETURN TO WORKFLOW LINK <a name="leftoff"></a>

# 🏡 **“ODD” ASPECTS OF AIRBNB** 🏨

# ❌ Update target audience and guiding questions

---

**Who?**
>* 🏢 **Airbnb Corporate**, interested in maximizing customer satisfaction to increase repeat guests and encourage new guests to stay with Airbnb hosts
>
>
>* 🏡 **Airbnb hosts**, interested in maximizing their ratings

**Why?**
>* 💰 **Revenue Management:** 
>
>
>
>* 🤝 **Sales:**
>
>
>
>* 🛌 **Rooms Ops:**

>
>
>

**What?**
>* 🧾 Dataset comprised of... 
>  * different features
>  * reservation records
>  * Source cited in Readme

❌ **How?**
>* Which models/methods?
>* Data prep and feature engineering

---

# 🎯  **Goal:**

Determine whether a host's listing will receive a rating of at least 4.5 out of 5 (based on `'review_scores_rating'`).

# 📌 **To-Do**

---

- [ ] [TD1](#td1)
- [ ] [TD2](#td2)
- [ ] [TD3](#td3)
- [ ] [todo4](#td4)
- [ ] [todo5](#td5)
- [ ] [todo6](#td6)
- [ ] [todo7](#td7)

---

# 📂 **Imports and Settings**

In [None]:
## Data Handling
import pandas as pd
import numpy as np
from scipy import stats


## Visualizations
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact_manual
import missingno

## SKLearn
from sklearn.preprocessing import Binarizer, MultiLabelBinarizer, \
                                    OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV,\
                                    RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV 
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, \
                                AdaBoostClassifier,GradientBoostingClassifier 
from sklearn import set_config
set_config(display='diagram')

from imblearn.over_sampling import SMOTE,SMOTENC


## Settings
%matplotlib inline
plt.style.use('seaborn-talk')  # named 'seaborn-v0_8-talk' in matplotlib 3.6+
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 100)

In [None]:
## Personal functions
import clf_functions.functions as cf
%load_ext autoreload
%autoreload 1
%aimport clf_functions.functions

## ✅ Show Visualizations Setting

In [None]:
## Controlling whether or not to show visualizations
show_visualizations = False

# 📖 **Read Data**

In [None]:
## Reading data and saving to a DataFrame

source = 'data/listings.csv.gz'

data = pd.read_csv(source)

In [None]:
## Inspecting imported dataset
data.head(5)

In [None]:
## Checking number of rows and columns
data.shape

---

> The initial read of the dataset shows there are 74 features and 8,033 entries. A quick glance at the `.head()` gives a sample of the entries, showing that some of the features are not relevant to my analysis.
>
> I need to get a better idea of the statistics for the dataset, especially any missing values and the datatypes for each column. I need to pre-process this data before I can perform any modeling.

---

# 👨‍💻 **Interactive Investigation**

---

> To increase accessibility to the data, **I include a widget to allow the user to sort through the data interactively.** I use [**Jupyter Widgets**](https://ipywidgets.readthedocs.io/en/latest/index.html) to create this interactive report.
>
>**To use:** select the column to sort by from the dropdown menu, then click the "Run Interact" button.
>
>***Note about 'Drop_Cols' and Cols:*** these keyword arguments are used to allow the user to drop specific columns.
>
> **Only click the "Drop_Cols" option when specifying "Cols"!** Otherwise it will cause an error.
>
>The 'Cols' dropdown menu does not affect the resulting report; the data is filtered from the report prior to displaying the results. 
>
>I chose to include this option for flexibility and adaptability, but it does have the unintended consequence of creating another drop-down menu. Please ignore this menu, as it does not provide any additional functionality. For future work, I will disable the menu to prevent confusion.

---

In [None]:
## Running report on unfiltered dataset

interact_manual(cf.sort_report, Sort_by=list(cf.report_df(data).columns),
                Source=source);

In [None]:
data.head(3)

---

> After reviewing my data, I see there are several features that contain irrelevant entries (URLs, source data, metadata) or values that are too complicated for simple processing (such as host and listing descriptions).
>
> I will drop these columns for the second report to review the remaining data for further processing.

---

In [None]:
## Specifying columns to drop

drop = ['id', 'host_id', 'name', 'description', 'neighborhood_overview', 'host_name',
        'host_about', 'host_location', 'neighbourhood', 'property_type',
        'listing_url', 'scrape_id', 'last_scraped', 'picture_url','host_url',
        'host_thumbnail_url','host_picture_url','calendar_last_scraped']

In [None]:
## Creating updated interactive report

interact_manual(cf.sort_report, Drop_Cols = True, Cols = drop,
                Sort_by=list(cf.report_df(data).columns), Source=source);

---

> **Interpretation:**
>
> The report shows that the dataset has a big problem with missing values:
>
> * **Empty:**
>   * `neighbourhood_group_cleansed`
>   * `bathrooms`
>   * `calendar_updated`
>
>
> * **Nearly empty:**
>  * `license`
>
>
> * **Missing 26-39% of data:**
>  * `host_about`
>  * `neighborhood_overview`
>  * `neighbourhood`
>  * `host_response_time`
>  * `host_response_rate`
>  * `review_scores_value`
>  * `review_scores_checkin`
>  * `review_scores_location`
>  * `review_scores_accuracy`
>  * `review_scores_communication`
>  * `review_scores_cleanliness`
>  * `host_acceptance_rate`
>  * `reviews_per_month`
>  * `first_review`
>  * `review_scores_rating`
>  * `last_review`
>
>---
>
> I will need to address these missing values before proceeding with the modeling. A few options include:
>
> * **Filling with the string "missing"** to indicate the value was missing.
>    * *I would be able to treat "missing" as a distinct category and use it for modeling as well.*
>
>
> * **Dropping the rows with missing values.**
>    * *This shrinks the dataset and may bias my results if the values are not missing at random.*
>
>
> * I could **use the `SimpleImputer` tool from SKLearn to fill the missing values** with the mean, median, or mode values for each.
>    * *I could couple this with a `GridSearchCV` to identify the method that has the strongest positive impact on my classification metrics.*

---
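
The third option above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual pipeline: `X_demo`, `y_demo`, and the column values are made up, and the classifier is only a stand-in so the grid search has something to score.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data standing in for the listings (hypothetical values)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({'beds': rng.integers(1, 5, 60).astype(float),
                       'price': rng.normal(100, 20, 60)})
y_demo = (X_demo['price'] > 100).astype(int)
X_demo.loc[::7, 'beds'] = np.nan  # knocking out some values on purpose

pipe = Pipeline([('imputer', SimpleImputer()),
                 ('clf', LogisticRegression(max_iter=500))])

# Letting the grid search choose the imputation strategy
grid = GridSearchCV(pipe,
                    {'imputer__strategy': ['mean', 'median', 'most_frequent']},
                    scoring='balanced_accuracy', cv=3)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```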

---

> To get a better idea of the missing values, I create a visual of the values via the 'Missingno' package. This visualization package includes several options for visualizing the missing data.

---

In [None]:
## Visually inspecting missing values
if show_visualizations:
    missingno.bar(data, labels=True);

---

> Based on this visualization, I see that **there is a consistent trend in missing values for review scores:** if a row is missing one review score, it seems to be missing all of them.
>
> Additionally, **there are many missing values for the response time, response rate, and acceptance rate.** I want to use these columns in my classification, so I will need to replace those missing values.
>
> After reviewing these details, **I feel more comfortable with the option of dropping those rows with missing review values.** I will drop the values as part of my overall classification process.

---
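
The "missing one, missing all" pattern for the review scores can also be checked numerically rather than by eye. The sketch below uses a small made-up frame (`demo`) with three of the review columns; on the real data the same lines would run against `data`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values
demo = pd.DataFrame({
    'review_scores_rating':      [4.8, np.nan, 4.2, np.nan],
    'review_scores_cleanliness': [4.9, np.nan, 4.0, np.nan],
    'review_scores_value':       [4.7, np.nan, 4.1, np.nan],
})

review_cols = [c for c in demo.columns if c.startswith('review_scores_')]

# Counting how many review columns are null in each row
nulls_per_row = demo[review_cols].isna().sum(axis=1)
print(nulls_per_row.value_counts().sort_index())

# True when every row is missing either none or all of the review columns
all_or_nothing = nulls_per_row.isin([0, len(review_cols)]).all()
```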

# 🧼 **Data Cleaning and EDA**

In [None]:
## Plotting the rating distribution with the median and the 4.5 threshold
ax = sns.histplot(data['review_scores_rating'])
ax.axvline(data['review_scores_rating'].median(), label = 'median', color='k')
ax.axvline(4.5, label = '4.5', color='red')
ax.legend();

## 🔎 Fixing Missing Values

---

> This dataset is missing a significant number of values for different columns. **In order to perform any modeling, I will need to address these missing values first.**
>
> Depending on the feature and the number of missing values per row, I will take different approaches to keep as much data as possible and in its original state.

---

In [None]:
# Dropping features with high percentages (25%+) of missing values

drop_na_cols = []
for col in data.columns:
    if ((data[col].isna().sum()) / len(data[col])) > .25 and col != 'review_scores_rating':
        drop_na_cols.append(col)

drop_na_cols

In [None]:
## Appending previous list of columns to drop (metadata, etc.)

for col in drop:
    if col not in drop_na_cols:
        drop_na_cols.append(col)

drop_na_cols

In [None]:
## Creating new dataframe that does not include the features to drop
df = data.drop(columns= drop_na_cols).copy()
df

In [None]:
## Inspecting values prior to dropping
cf.report_df(df)

# Dropping Rows W/O Target

In [None]:
## Checking for rows missing target values

nan_index = df['review_scores_rating'].isna()
nan_index

In [None]:
## Inspecting rows to be dropped for missing the target feature
df[nan_index]

In [None]:
## Dropping rows from main dataframe
df.drop(df[nan_index].index, inplace=True)

In [None]:
cf.report_df(df)

## Filling Beds

In [None]:
## Filling missing values for 'beds' with values for 'bedrooms'
## Using .loc with a boolean mask avoids chained assignment

fill_beds = df['beds'].isna() & (df['bedrooms'] > 0)
df.loc[fill_beds, 'beds'] = df.loc[fill_beds, 'bedrooms']

In [None]:
## Filling missing values for 'bedrooms' with values for 'beds'
## Using .loc with a boolean mask avoids chained assignment

fill_bedrooms = df['bedrooms'].isna() & (df['beds'] > 0)
df.loc[fill_bedrooms, 'bedrooms'] = df.loc[fill_bedrooms, 'beds']

In [None]:
## Confirming reduction in missing values for 'beds' and 'bedrooms'

rpt_clean  = cf.report_df(df)
rpt_clean[rpt_clean['null_sum'] >0]

In [None]:
## Removing rows with 6+ null values

df = df[df.isna().sum(axis=1) < 6]
df.head(5)

In [None]:
df.isna().sum()

In [None]:
cf.report_df(df)

In [None]:
## Resetting the index after dropping rows

df.reset_index(drop=True, inplace=True)

In [None]:
print(len(df) == len(df.index),"\n")
print(len(df),len(df.index))

---

> At this point, **I cleaned up most of the null values by dropping columns with more than 25% missing values and dropping rows with 6+ missing values.**
>
>Additionally, **I filled missing values for 'beds' and 'bedrooms' from each other:** if a row had a value in one of the columns but not the other, I filled the missing value with the value from the other column.
>
> A few columns still have missing values; I will use a `SimpleImputer` combined with a `GridSearchCV` to determine the best method for filling them.
>
> Now I will review the remaining data and determine if there are any other issues with my data.

---
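
The two cleaning rules summarized above can be reproduced on a toy frame. This sketch uses `fillna` instead of the notebook's `> 0` check, so a zero in the other column would also be copied over; the values are made up.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'beds':     [2.0, np.nan, np.nan, 1.0],
                     'bedrooms': [1.0, 3.0, np.nan, np.nan]})

# Filling each column from the other; rows missing both stay missing
beds_filled = demo['beds'].fillna(demo['bedrooms'])
bedrooms_filled = demo['bedrooms'].fillna(demo['beds'])
demo = demo.assign(beds=beds_filled, bedrooms=bedrooms_filled)

# Keeping only rows with fewer than `max_nulls` missing values
max_nulls = 6
demo = demo[demo.isna().sum(axis=1) < max_nulls]
print(demo)
```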

In [None]:
len(df) == len(df.index)

# **COMMENT:** What else to clean?? 

* DONE: T/F columns to 1/0


* DONE: 'host_since' to DT


* DONE: 'price' -$, to float


* DONE: 'neighbourhood_cleansed' split on ", " and convert to binary columns, then drop host_neighbourhood


* DONE: 'bathrooms_text' split on space, keep 1st part, convert to int


* 'host_verifications' - single string, needs extensive work in order to MLB

## Converting True/False Columns to Binary Values

In [None]:
## Creating list of true/false features to convert to 1/0, respectively

t_f_xf = ['host_is_superhost','host_has_profile_pic','host_identity_verified',
          'has_availability','instant_bookable']
t_f_xf

In [None]:
## Converting datatype to "string" to replace values

df[t_f_xf] = df[t_f_xf].astype('str')
df[t_f_xf].dtypes

In [None]:
df[t_f_xf]

In [None]:
## Converting t/f to 1/0, respectively

df[t_f_xf] = df[t_f_xf].replace({ 't' : 1, 'f' : 0})

In [None]:
df[t_f_xf]

In [None]:
df[t_f_xf] = df[t_f_xf].astype(int)

In [None]:
## Verifying results

cf.report_df(df[t_f_xf])

## Converting Price to Float 

In [None]:
## Converting each value into a float for processing

df['price'] = df['price'].map(lambda price: price[1:].replace(',','')).astype('float')
df['price'][0]

In [None]:
df['price'].describe()

## Creating "Years_Hosting"

---

> Since the 'host_since' feature is a date, I will create a separate feature for the number of years each host has been active.

---
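
As an alternative to splitting the string by hand, the year can be taken from a parsed datetime. This is a sketch on made-up dates; `reference_year` is a name introduced here to hold the same 2021 cutoff.

```python
import pandas as pd

demo = pd.DataFrame({'host_since': ['2012-05-17', '2019-11-02', '2021-01-30']})

reference_year = 2021  # same cutoff year as the notebook

# Parsing to datetime; unparseable entries become NaT instead of raising
host_since = pd.to_datetime(demo['host_since'], errors='coerce')
demo['years_hosting'] = reference_year - host_since.dt.year
print(demo)
```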

In [None]:
df['years_hosting'] = df["host_since"].map(lambda x: 2021- int(x.split("-")[0]))
df['years_hosting']

In [None]:
df['years_hosting'].value_counts()

In [None]:
df['years_hosting'].describe()

---

> I successfully created the new feature to represent how long each host has been active (up to 2021). I will be curious to see the impact of the years of experience on the overall rating at the end of my modeling process.

---

## Bathrooms_Text to Num_Bathrooms

---

> In the raw data, the original "bathrooms" feature was empty and was dropped as part of processing missing data.
>
> **My goal is to convert the "bathrooms_text" feature into a new "num_bathrooms" feature to indicate the number of bathrooms at a host property.**
>
> I assume the number of bathrooms has an impact on the rating. More bathrooms could mean more space and comfort for the guest, but could also mean a higher price.


---
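
The conversion described above can be sketched in one pass with a regular expression. This is an illustration on made-up strings, not the cell-by-cell approach the notebook takes; it assumes the text either starts with a number or mentions "half-bath".

```python
import pandas as pd

demo = pd.Series(['1 bath', '1.5 shared baths', 'Half-bath',
                  'Shared half-bath', 'Private half-bath', None],
                 name='bathrooms_text')

# Pulling out the leading number where there is one
num = pd.to_numeric(demo.str.extract(r'^(\d+(?:\.\d+)?)', expand=False))

# Text-only entries that mention a half-bath become 0.5
is_half = demo.str.contains('half-bath', case=False, na=False)
num = num.mask(is_half & num.isna(), 0.5)

# Missing text is treated as zero bathrooms, as in the notebook
num_bathrooms = num.fillna(0.0)
print(num_bathrooms.tolist())
```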

In [None]:
## Checking current dataframe contents
df.head(3)

In [None]:
## Checking for null values overall
df.isna().sum()[df.isna().sum() > 0]

In [None]:
## Inspecting a selection of values from the column to understand the values
df.loc[:,'bathrooms_text'][:21]

In [None]:
## Inspecting the rows in which there are null values
df[df['bathrooms_text'].isna()]

In [None]:
## Filling null values with unique string ('Baths' not present otherwise)
## Unique string can be used later to check for any other zero baths

df['bathrooms_text'] = df['bathrooms_text'].fillna('0 Baths')

In [None]:
## Verifying all null values are filled
df.isna().sum()[df.isna().sum() > 0]

In [None]:
df.loc[:,'bathrooms_text'].isna().sum()

In [None]:
## Keeping the first token of each string (the number of bathrooms)
df['num_bathrooms'] = df['bathrooms_text'].map(lambda x: x.split(' ')[0])
df['num_bathrooms'].value_counts()

In [None]:
## Inspecting results that are phrases, not numbers

replace = ['Half-bath', 'Shared', 'Private']

for x in df['bathrooms_text']:
    for i in replace:
        if i in x:
            print(x)

---

> **I will replace these values with the numeric value .5 as they are half-baths.** This will allow me to convert the column datatype to a float and use the column more easily in my modeling.

---

In [None]:
## Replacing string values with .5 to represent half-bathrooms

replace = {'Half-bath': .5, 'Shared': .5, 'Private': .5}

df['num_bathrooms'] = df['num_bathrooms'].replace(replace).astype(float)

In [None]:
## Inspecting resulting values

df['num_bathrooms'].value_counts(dropna=False)

In [None]:
## Inspecting listings with more than 10 bathrooms

df[df['num_bathrooms'] >10]

---

> After taking a look at the locations listed above on Google Maps (using their latitude/longitude), I feel like these three listings with more than 10 bathrooms are either duplicates or incorrect values (for 50 baths).
>
> Due to the questionable nature of these values, I will drop these rows to prevent these outliers from impacting my results.

---
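
The drop described above could look like the sketch below. It runs on a toy frame with made-up values; `max_baths` is a name introduced here for the cutoff of 10.

```python
import pandas as pd

demo = pd.DataFrame({'num_bathrooms': [1.0, 2.5, 50.0, 11.0, 0.0]})

max_baths = 10  # cutoff taken from the inspection step

# Keeping rows at or below the cutoff (rows with NaN would also be dropped)
demo = demo[demo['num_bathrooms'] <= max_baths].reset_index(drop=True)
print(demo)
```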

In [None]:
## Inspecting rows where 'num_bathrooms' is zero to validate data

df[df['num_bathrooms'] ==0]

In [None]:
## Removing old column post-conversion

df = df.drop(columns = 'bathrooms_text')

In [None]:
## Confirming removal

'bathrooms_text' in df.columns

---

> My review of the original bathroom text for the rows with zero bathrooms shows that the listings are associated with a private room. This would make sense, as such listings may not include an option such as a shared bath.
>
> Additionally, I filled 9 missing values with "0 Baths," which contributes slightly to this count.
>
> Overall, I feel the data is valid and I will use it for my modeling.

---

## Cleaning Room_Type

In [None]:
df['room_type'].value_counts()

In [None]:
replace_rooms = {'Entire home/apt': 'entire_home', 
                 'Private room': 'private_room',
                 'Shared room': 'shared_room',
                 'Hotel room': 'hotel_room'
                }

df['room_type'] = df['room_type'].replace(replace_rooms)
df['room_type'].value_counts(dropna=False)

## Binarizing Neighbourhood_Cleansed

---

> The current values for "neighbourhood_cleansed" are a single string value. **I will separate each neighborhood and convert them into a binary column to represent whether or not that neighborhood is included in the listing, then drop the old column.**

---

In [None]:
## Inspecting feature
df.loc[:,'neighbourhood_cleansed']

In [None]:
## Identifying datatype
df.loc[:,'neighbourhood_cleansed'].dtype

In [None]:
## Testing the splitting between neighborhoods

df.loc[:,'neighbourhood_cleansed'][1].split(', ')

In [None]:
## Converting values into a list of strings of neighborhoods

df['neighbourhood_cleansed'] = df['neighbourhood_cleansed'] \
                                    .apply(lambda x: x.split(', '))

display(df.loc[:,'neighbourhood_cleansed'])

---

> The following code snippet is adapted from [here](https://stackoverflow.com/questions/45312377/how-to-one-hot-encode-from-a-pandas-column-containing-a-list#:~:text=Sparse%20solution%20(for%20Pandas%20v0.25.0%2B)) by the user [Maxu](https://stackoverflow.com/users/5741205/maxu).

---

In [None]:
## Converting each neighborhood into a binary column and dropping old column

mlb = MultiLabelBinarizer()

df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('neighbourhood_cleansed')),
                              columns=mlb.classes_,index=df.index))

In [None]:
## Inspecting results

df.head(3)

---

> After using the MultiLabelBinarizer, I successfully added a column for each neighborhood, indicating whether or not that neighborhood was included in the listing.
>
> This enables me to use the presence/absence of a neighborhood as a category in my modeling.

---
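
To make the effect of the `MultiLabelBinarizer` step concrete, here is the same pattern on a three-row toy frame. The neighborhood names and prices are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

demo = pd.DataFrame({'price': [100.0, 80.0, 120.0],
                     'neighbourhood_cleansed': ['Riverside, Old Town',
                                                'North End',
                                                'Riverside']})

# Splitting the single string into a list of neighborhoods
demo['neighbourhood_cleansed'] = demo['neighbourhood_cleansed'].str.split(', ')

# One binary column per neighborhood, aligned on the original index
mlb_demo = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb_demo.fit_transform(demo.pop('neighbourhood_cleansed')),
                       columns=mlb_demo.classes_, index=demo.index)
demo = demo.join(encoded)
print(demo)
```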

## Host_Verifications to Binary Columns

---

> For the "host_verifications" and "amenities" features, the values are a single string with several items within the string.
>
> It is somewhat similar to the "neighbourhood_cleansed" feature in the sense that I will need to filter out the individual items from the string. However, there is an added complication as I need to remove the brackets and quotations from the strings.
>
> Once I filter out the items, I will be able to use the MultiLabelBinarizer again to create binary columns for each item.

---
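
If the strings are valid Python list literals, which the bracket-and-quote format suggests but which is an assumption here, `ast.literal_eval` can parse them without stripping characters by hand. The sample values below are made up.

```python
import ast
import pandas as pd

demo = pd.Series(["['email', 'phone']", "['phone', 'reviews', 'kba']", "[]"],
                 name='host_verifications')

# Turning each string into a real Python list
# (missing values would need to be filled with '[]' first)
parsed = demo.map(ast.literal_eval)
print(parsed.tolist())
```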

In [None]:
## Inspecting contents
df['host_verifications'][:10]

In [None]:
## Testing the splitting between items

df.loc[:,'host_verifications'][1]

In [None]:
## Removing extra characters and splitting items
## regex=False treats '[' and ']' as literal characters

for char in ['[', ']', "'", '"']:
    df['host_verifications'] = df['host_verifications'].str.replace(char, '', regex=False)
df['host_verifications'] = df['host_verifications'].apply(lambda x: x.split(', '))

In [None]:
df['host_verifications']

In [None]:
## Converting each value into a binary column and dropping old column

mlb2 = MultiLabelBinarizer()
    
df = df.join(pd.DataFrame(mlb2.fit_transform(df.pop('host_verifications')),
                                  columns=mlb2.classes_,index=df.index))

df

---

> At this point, I successfully processed the 'host_verifications' feature into distinct categories for modeling.
>
> In the future, I may attempt to do the same for the 'amenities' feature, but I don't want to create too many columns before my initial modeling.

---

## ❌ ERROR ❌ Binarizing Room_Type

---

> **Can't get MLB/OHE to work for individual property types.**

---
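
One possible cause, judging from the commented-out attempt below: wrapping the column as `[df['room_type']]` hands the encoder a single row with one feature per listing, while `OneHotEncoder` expects a 2-D input with one row per listing. The sketch shows the 2-D form on made-up data; it uses the encoder's default sparse output and converts it afterwards so it does not depend on the `sparse` argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo = pd.DataFrame({'room_type': ['entire_home', 'private_room',
                                   'entire_home', 'shared_room']})

ohe_demo = OneHotEncoder(handle_unknown='ignore')

# Double brackets keep a 2-D (n_rows, 1) frame
encoded = ohe_demo.fit_transform(demo[['room_type']]).toarray()

# Building column names from the fitted categories
room_cols = [f'room_type_{c}' for c in ohe_demo.categories_[0]]
room_df = pd.DataFrame(encoded, columns=room_cols, index=demo.index)
print(room_df)
```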

In [None]:
# df['room_type'].describe()

In [None]:
# df['room_type'].value_counts(dropna=False)

In [None]:
# df['room_type'] = df['room_type'].replace('Entire home/apt', 'Home/Apt')

In [None]:
# df['room_type'] = df['room_type'].map(lambda x: x.split(' ')[0])

In [None]:
# df['room_type'].value_counts(dropna=False)

In [None]:
# ohe = OneHotEncoder(sparse=False)

# df_ohe = ohe.fit_transform([df['room_type']])
# df_ohe

In [None]:
# pd.DataFrame(df_ohe)

## ❌ ERROR ❌ Converting Amenities


---

> same issue as w/ room type

---
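
A possible way around this for 'amenities', assuming the values are JSON-style arrays in double quotes (not confirmed by the notebook), is to parse them with `json.loads` and build the binary columns with pandas alone. The sample values are made up.

```python
import json
import pandas as pd

demo = pd.Series(['["Wifi", "Kitchen"]', '["Wifi"]', '[]'], name='amenities')

# Parsing each string into a list
parsed = demo.map(json.loads)

# One row per (listing, amenity); empty lists become NaN and are dropped
exploded = parsed.explode().dropna()

# One binary column per amenity, collapsed back to one row per listing
dummies = pd.get_dummies(exploded).groupby(level=0).max()
amenity_df = dummies.astype(int).reindex(demo.index, fill_value=0)
print(amenity_df)
```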

In [None]:
# for x in ['host_verifications', 'amenities']:
#     print(df[x])

In [None]:
# df['amenities'][:10]

In [None]:
# for x in ['host_verifications', 'amenities']:
#     df[x] = df[x].str.replace('and', '')

In [None]:
# ## Converting each value into a binary column and dropping old column

# mlb = MultiLabelBinarizer()
    
# df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('amenities')),
#                                   columns=mlb.classes_,index=df.index))

In [None]:
# df.loc[:,'host_verifications'] = df.loc[:,'host_verifications'].str.replace('[', '')
# df.loc[:,'host_verifications'] = df.loc[:,'host_verifications'].str.replace(']', '')
# df.loc[:,'host_verifications'] = df.loc[:,'host_verifications'].str.replace("'", '')

In [None]:
# df.loc[:,'host_verifications']

In [None]:
# df['amenities'] = df['amenities'].str.replace('[', '')
# df['amenities'] = df['amenities'].str.replace(']', '')
# df['amenities'] = df['amenities'].str.replace('"', '')

In [None]:
# df['amenities']

In [None]:
# df['amenities'] = df['amenities'].apply(lambda x: x.split(', '))

In [None]:
# df['amenities'][0]

In [None]:
# df['host_verifications'] = df['host_verifications'].apply(lambda x: x.split(', '))

In [None]:
# df['host_verifications'][0][0]

In [None]:
# def convert_to_col(df, list_cols):
#     '''For a given list of column names, separates each string value in the
#     column by the comma/space pattern to return new strings of single values.
    
#     Then, instantiates a MultiLabelBinarizer to create new columns for each 
#     new string to indicate the presence or absence of that string in the 
#     original column.'''
    
# #     mlb = MultiLabelBinarizer()
    
#     for x in list_cols:
#         try:
#             df[x] = df[x].apply(lambda x: x.split(', '))
#             print(f'Successfully split values in column "{x}"')
            
#         except Exception:
#             print('\nValues are already processed and saved.')
#             print(f"\nSample value: {df.loc[:,x][3]}")
            
# #         try:
# #             df = df.join(pd.DataFrame(mlb.fit_transform(df.pop(x)),
# #                                       columns=mlb.classes_,index=df.index))
# #         except Exception:
# #                 print('\nValues are already processed and saved.')
                
#     return df

In [None]:
# binarize_cols = ['host_verifications', 'amenities'] 

# convert_to_col(df, binarize_cols)

In [None]:
# # mlb = MultiLabelBinarizer()
    
# df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('amenities')),
#                                   columns=mlb.classes_,index=df.index))

In [None]:
# ## Converting values into a list of strings for each neighborhood

# try:
#     df['host_verifications'] = df['host_verifications'] \
#                                                 .apply(lambda x: x.split(', '))
#     display(df.loc[:,'host_verifications'])
# except Exception:
#     print('\nValues are already processed and saved. No changes necessary.')
#     print(f"\nSample value: {df.loc[:,'host_verifications'][3]}")
    
    

In [None]:
# ## Inspecting results

# df.head(3)

In [None]:
# test3 = df['host_verifications'][0]
# test3[1:-1].replace('"', "'").split(",")

In [None]:
# # df['Tags'] = df.Tags.apply(lambda x: x[1:-1].split(','))

# df['host_verifications'].apply(lambda x: x.split(','))[0]

# Pre-Pipeline Review

In [None]:
## Review remaining data
df.head(3)

In [None]:
## Removing columns with no impact on modeling

df.drop(columns = ['host_since', 'host_neighbourhood', 'amenities'], inplace=True)

In [None]:
## Final review

df.describe()

## Converting Remaining Datatypes

In [None]:
df.dtypes[:40]

In [None]:
df.isna().sum()[df.isna().sum() > 0]

# 🪓 **Train/Test Split**

---

> Before I run any further pre-processing, I split my data into training and test sets to allow me to test my model's performance.
>
> **In order to split my classification target feature properly, I will convert the original values to binary values.** Since my goal is to determine whether or not a given host property will have a high score (4.5+), I assign all values greater than or equal to 4.5 to '1' and anything less than 4.5 to '0'.
>
> **This conversion also allows me to use the "stratify" parameter in my train/test split,** which will preserve the class balance when I split my data. This will be key for proper evaluation of my models.

---
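
The effect of `stratify` can be seen on synthetic data: after a stratified split, the share of the positive class is nearly identical in both sets. The ratings and prices below are randomly generated stand-ins, and `threshold` is a name introduced here for the cutoff.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
ratings = pd.Series(rng.uniform(3.0, 5.0, 200), name='review_scores_rating')

threshold = 4.5
y_demo = (ratings >= threshold).astype(int)   # 1 = high rating, 0 = otherwise
X_demo = pd.DataFrame({'price': rng.normal(100, 25, 200)})

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=42, stratify=y_demo)

# Share of high-rated listings in each split
print(y_tr.mean(), y_te.mean())
```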

In [None]:
## Using np.select to reassign target values based on conditional evaluations

cond = [df['review_scores_rating'] >= 4.5,
        df['review_scores_rating'] < 4.5
       ]

choice = [1,0]

df['review_scores_rating'] = np.select(cond, choice, 0)

In [None]:
## Reviewing results to confirm only 0/1 values
df['review_scores_rating'].value_counts(dropna=False)

In [None]:
## Creating features/target for dataset
target = 'review_scores_rating'

X = df.drop(columns = target).copy()
y = df[target].copy()

In [None]:
## Confirming same number of rows
X.shape[0] == y.shape[0]

In [None]:
## Splitting to prevent data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, 
                                                    random_state=42, 
                                                    stratify=y)

# 🚿 **Preprocessing Pipeline**

In [None]:
num_cols = X_train.select_dtypes(include=[int, float]).columns.to_list()
# num_cols

In [None]:
cat_cols = ['room_type']
cat_cols

In [None]:
## Checking missing X-values for imputation
X_train.isna().sum()[X_train.isna().sum() > 0]

## Preprocessor

In [None]:
## Creating ColumnTransformer and sub-transformers for imputation and encoding

### --- Creating column transformers --- ###

# Filling missing values in "Beds" and "Bedrooms"
miss_num_transformer = SimpleImputer(strategy='mean')

## Encoding categoricals - ignoring errors to prevent issues w/ test set
## (scikit-learn 1.2+ renames `sparse` to `sparse_output`)
categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse=False)


### --- Creating column pipelines --- ###

cat_pipe = Pipeline(steps=[('ohe', categorical_transformer)])

num_pipe = Pipeline(steps=[('imputer', miss_num_transformer),
                           ('scaler', StandardScaler())])

## Instantiating the ColumnTransformer and including all transformers
preprocessor = ColumnTransformer(
    transformers=[('nums', num_pipe, num_cols),
                  ('cats', cat_pipe, cat_cols)])

preprocessor

In [None]:
## Fitting feature preprocessor
preprocessor.fit(X_train)

## Getting feature names from OHE
## (get_feature_names_out requires scikit-learn 1.0+; older versions use get_feature_names)
ohe_cat_names = preprocessor.named_transformers_['cats'].named_steps['ohe'].get_feature_names_out(cat_cols)

## Generating list for column index
final_cols = [*num_cols, *ohe_cat_names]

In [None]:
## Transform the data via the ColumnTransformer preprocessor

X_train_tf = preprocessor.transform(X_train)
X_train_tf_df = pd.DataFrame(X_train_tf, columns=final_cols, index=X_train.index)

X_test_tf = preprocessor.transform(X_test)
X_test_tf_df = pd.DataFrame(X_test_tf, columns=final_cols, index=X_test.index)

display(X_train_tf_df.head(5),X_test_tf_df.head(5))

# Identifying columns with outliers to improve classification results

In [None]:
## Heatmap to visualize presence/absence of outliers
idx_train = (np.abs(X_train_tf_df) >= 3)
sns.heatmap(idx_train)

In [None]:
## Looking at the min/max values to ID extreme z-scores

X_train_tf_df.describe().loc[['min','50%', 'max']]

In [None]:
X_train_tf_df.describe().loc[['min','50%', 'max']].max()

In [None]:
## Visualizing max values for each feature

display(X_train_tf_df.describe().loc[['min','50%', 'max']].max())

sns.boxplot(x=X_train_tf_df.describe().loc[['min','50%', 'max']].max());

In [None]:
## Visualizing min values for each feature

display(X_train_tf_df.describe().loc[['min','50%', 'max']].min())

sns.boxplot(x=X_train_tf_df.describe().loc[['min','50%', 'max']].min());

## Next Step

Goal: ID features with extreme z-scores
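

One way to approach this goal is to count, per column, how many scaled values have an absolute z-score of 3 or more. The sketch uses a tiny made-up frame (`scaled`); on the real data the same two lines would run against `X_train_tf_df`.

```python
import pandas as pd

# Toy stand-in for the scaled training data (hypothetical values)
scaled = pd.DataFrame({'price':         [-0.5, 0.1, 4.2, 0.3],
                       'beds':          [0.0, -3.5, 0.2, 3.1],
                       'years_hosting': [0.4, -0.2, 1.1, -1.0]})

# Counting values with |z| >= 3 in each column
outlier_counts = (scaled.abs() >= 3).sum()

# Keeping only the columns that have at least one extreme value
extreme_cols = outlier_counts[outlier_counts > 0].sort_values(ascending=False)
print(extreme_cols)
```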

# Baseline Model

In [None]:
## Creating baseline classifier model

clf = DummyClassifier(strategy='stratified')

clf.fit(X_train_tf_df, y_train)

cf.evaluate_classification(clf,X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test, 
                           metric = 'accuracy',)

---

**Interpretation**

> High log loss, very poor AUC (worse than random chance).

---
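
For context on these baseline numbers: a `DummyClassifier` with `strategy='stratified'` ignores the features and predicts labels at random in proportion to the class frequencies. Its predicted probabilities are hard 0/1 values, which pushes log loss up, and its AUC tends to sit near 0.5. The sketch below reproduces this on random data; it calls scikit-learn's metrics directly instead of `cf.evaluate_classification`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
y_demo = (rng.random(300) < 0.7).astype(int)   # roughly 70% positive class

dummy = DummyClassifier(strategy='stratified', random_state=42)
dummy.fit(X_demo, y_demo)

# Probability of the positive class for each row
proba = dummy.predict_proba(X_demo)[:, 1]

baseline_log_loss = log_loss(y_demo, proba)
baseline_auc = roc_auc_score(y_demo, proba)
print(baseline_log_loss, baseline_auc)
```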

#  Logistic Regression Models

In [None]:
## Running logistic regression model to determine performance

clf = LogisticRegression(max_iter=350, n_jobs=-1, class_weight='balanced',
                         random_state = 42)

clf.fit(X_train_tf_df, y_train)

cf.evaluate_classification(clf, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'accuracy')

In [None]:
## Testing LogRegCV model to compare

clf = LogisticRegressionCV(max_iter=350, n_jobs=-1, class_weight='balanced',
                          random_state = 42)

clf.fit(X_train_tf_df, y_train)

cf.evaluate_classification(clf, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'accuracy')

---

**Interpretation**

> Log loss increased by .12/.08 and AUC decreased by .05/.04 for training/test sets, respectively.
>
> The LogisticRegressionCV model is performing slightly worse than the normal model.
>
> I will test a different model type to see if I can get better scores with different modeling methods.

---

# KNN Model

In [None]:
knn = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)

In [None]:
knn.fit(X_train_tf_df, y_train)

cf.evaluate_classification(knn, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'accuracy', normalize=None)

# Decision Tree Model

In [None]:
dtc = DecisionTreeClassifier(class_weight = 'balanced')

In [None]:
dtc.fit(X_train_tf_df, y_train)

cf.evaluate_classification(dtc, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'accuracy', normalize=None)

In [None]:
dtc.get_depth()

# RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(class_weight = 'balanced',
                            n_jobs=-1)

In [None]:
rfc.fit(X_train_tf_df, y_train)

cf.evaluate_classification(rfc, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'accuracy', normalize=None)

In [None]:
depths = []

for i in rfc.estimators_:
    depths.append(i.get_depth())

In [None]:
np.max(depths)

In [None]:
sns.histplot(depths)

In [None]:
cf.plot_importances(rfc, X_train_tf_df, count=5)

In [None]:
df['review_scores_rating']

## Visualizing Feature Importances

---

> Now that I have the feature importances from my model, I interpret the results by visualizing the most important features against the target feature.

---

In [None]:
## Comparing top feature against target

sns.countplot(data=df, hue='review_scores_rating', x='host_is_superhost')

---

> The countplot above shows that hosts who are not "Superhosts" are much more likely to miss the 4.5+ rating threshold than hosts who are "Superhosts."
>
> The **blue bars** represent the *number of host properties that **did not receive a 4.5+ rating**.*
>
> The **orange bars** show the *number of host properties that **did receive a 4.5+ rating**.*

---
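
The counts in the plot can be turned into rates with a normalized crosstab, which makes the comparison between the two groups easier to read. The frame below is a made-up example with eight listings.

```python
import pandas as pd

demo = pd.DataFrame({'host_is_superhost':    [1, 1, 1, 0, 0, 0, 0, 1],
                     'review_scores_rating': [1, 1, 1, 0, 1, 0, 0, 0]})

# Each row sums to 1: share of low (0) and high (1) ratings per group
rates = pd.crosstab(demo['host_is_superhost'], demo['review_scores_rating'],
                    normalize='index')
print(rates)
```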

# ExtraTreesClassifier

In [None]:
xtc = ExtraTreesClassifier(class_weight = 'balanced', random_state = 42,
                            n_jobs=-1)

In [None]:
xtc.fit(X_train_tf_df, y_train)

cf.evaluate_classification(xtc, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'accuracy')

# AdaBoostClassifier

In [None]:
abc = AdaBoostClassifier(n_estimators=100, random_state=42)

In [None]:
abc.fit(X_train_tf_df, y_train)

cf.evaluate_classification(abc, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'accuracy')

# Gradient Boosting

In [None]:
gbc = GradientBoostingClassifier(learning_rate=1.0, max_depth=1, random_state=42)

In [None]:
gbc.fit(X_train_tf_df, y_train)

cf.evaluate_classification(gbc, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'accuracy')

# GridSearchCV

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
lr_params = {
    'C': [.001, .01, .1, 1, 10, 100, 1000],
    'class_weight': ['balanced', None],
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'max_iter': [100, 200, 300, 400]}

In [None]:
gscv = GridSearchCV(LogisticRegression(), lr_params, scoring = 'balanced_accuracy', cv=3,
                    n_jobs = -1)
gscv

In [None]:
# gscv.fit(X_train_tf_df, y_train)

In [None]:
# logreg_params = gscv.best_params_

# logreg_params

## Best LogReg params

logreg_params = `{'C': 0.1,
  'class_weight': 'balanced',
  'max_iter': 100,
  'penalty': 'l1',
  'solver': 'saga'}`
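

Since the grid search cell is commented out, the recorded parameters can be used to rebuild the model directly. This is a sketch on synthetic data; `best_logreg`, `X_demo`, and `y_demo` are names introduced here, and on the real data the fit would use `X_train_tf_df` and `y_train`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Parameters recorded from the earlier grid search
logreg_params = {'C': 0.1, 'class_weight': 'balanced', 'max_iter': 100,
                 'penalty': 'l1', 'solver': 'saga'}

best_logreg = LogisticRegression(**logreg_params, random_state=42)

# Synthetic data with one informative feature (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

best_logreg.fit(X_demo, y_demo)
print(best_logreg.score(X_demo, y_demo))
```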

In [None]:
# gscv.best_estimator_

In [None]:
# cf.evaluate_classification(gscv.best_estimator_, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'balanced accuracy')

## GSCV: RFC

In [None]:
rfc_params = {
    'n_estimators':[100, 125, 150,],
    'max_depth': [10,20,30,40],
    'min_samples_split': [2,3,4],
    'min_samples_leaf': [1,2,3]
}

In [None]:
rfc = RandomForestClassifier(class_weight = 'balanced',
                            n_jobs=-1, random_state=42)

In [None]:
rfgs = GridSearchCV(rfc, rfc_params, scoring = 'balanced_accuracy', cv=3,
                    n_jobs = -1)
rfgs

In [None]:
rfgs.fit(X_train_tf_df, y_train)

In [None]:
rfc_params = rfgs.best_params_

rfc_params

In [None]:
rfgs.best_score_


In [None]:
rfc_new = rfgs.best_estimator_

In [None]:
cf.evaluate_classification(rfc_new, X_train_tf_df, y_train, X_test_tf_df, 
                           y_test, 'recall (macro)')