# 🏡 **AirBNB Dataset Review** 🏨

# ❌ Update target audience and guiding questions

**Who?**
>* 🏢 **AirBNB Corporate** interested in maximizing customer satisfaction to increase repeat guests and encourage new guests to stay with AirBNB hosts
>
>
>* **AirBNB hosts** interested in maximizing their ratings

**Why?**
>* 💰 **Revenue Management:** 
>
>
>
>* 🤝 **Sales:**
>
>
>
>* 🛌 **Rooms Ops:**
>
>
>
>* 🍰 ☕ **Food and Beverage:**
>
>
>

**What?**
>* 🧾 Dataset comprised of... 
>  * 74 different features
>  * 8,033 listing records
>  * Source cited in Readme

❌ **How?**
>* Which models/methods? 
>* Data prep and feature engineering

# ❌ **Goal:**

To predict a listing's review score rating (`review_scores_rating`) from the details of the listing and its host.

# 📌 **To-Do**

---

- [ ] [TD1](#td1)
- [ ] [TD2](#td2)
- [ ] [TD3](#td3)
- [ ] [todo4](#td4)
- [ ] [todo5](#td5)
- [ ] [todo6](#td6)
- [ ] [todo7](#td7)

---

# 📂 **Imports**

In [None]:
## Data Handling
import pandas as pd
import numpy as np
from scipy import stats

## Visualizations
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact_manual

## Modeling - SKLearn
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyRegressor
from sklearn import set_config
set_config(display='diagram')

# from sklearn.naive_bayes import MultinomialNB # for naive bayes model

## Settings
%matplotlib inline
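## 'seaborn-talk' was renamed 'seaborn-v0_8-talk' in Matplotlib 3.6; use whichever is available
talk_style = 'seaborn-v0_8-talk' if 'seaborn-v0_8-talk' in plt.style.available else 'seaborn-talk'
plt.style.use(talk_style)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
## Using the full option name; the short 'max_rows' is ambiguous in newer pandas versions
pd.set_option('display.max_rows', 100)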

In [None]:
## Personal functions
import clf_functions.functions as cf
%load_ext autoreload
%autoreload 1
%aimport clf_functions.functions

## FSDS

In [None]:
# import fsds as fs

In [None]:
# fs.ihelp_menu([fs.ihelp_menu, sort_report])

# 📖 **Read Data**

In [None]:
## Reading data and saving to a DataFrame

source = 'http://data.insideairbnb.com/united-states/dc/washington-dc/2021-07-10/data/listings.csv.gz'

data = pd.read_csv(source)

In [None]:
## Inspecting imported dataset
data.head(5)

In [None]:
## Checking number of rows and columns
data.shape

---

> The initial read of the dataset shows there are 74 features and 8,033 entries. A quick glance at the `.head()` gives a sample of the entries, showing that some of the features are not relevant to my analysis.

---

# 🧼 **Data Cleaning and EDA**

## 🔎 Interactive Investigation

---
> To make the data easier to explore, I included functionality that lets the user sort through the data interactively. I use [**Jupyter Widgets**](https://ipywidgets.readthedocs.io/en/latest/index.html) to create this interactive report.
>
>**To use:** select the column you would like to sort by from the dropdown menu, then click the "Run Interact" button.
>
>*Note about 'Drop_Cols' and 'Cols':* these keyword arguments let the user drop specific columns. The 'Cols' dropdown menu does not affect the resulting report; the listed columns are filtered out before the results are displayed.
>
>I chose to include these as options for flexibility and adaptability, but doing so has the unintended consequence of creating another drop-down menu.
>
---
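
As a rough illustration of what such a report might compute, here is a minimal pandas-only sketch. `summarize_columns` and `sorted_report` are hypothetical stand-ins, not the actual `cf.report_df` and `cf.sort_report` helpers (their implementations are not shown in this notebook), and the toy DataFrame is made up.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical column report: one row per column of `frame`."""
    report = pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'null_count': frame.isna().sum(),
        'null_pct': frame.isna().mean() * 100,
        'n_unique': frame.nunique(),
    })
    report.index.name = 'column'
    return report.reset_index()


def sorted_report(frame, sort_by='null_pct', drop_cols=False, cols=None):
    """Optionally drop columns first, then sort the report by one field."""
    if drop_cols and cols:
        frame = frame.drop(columns=cols)
    return summarize_columns(frame).sort_values(
        sort_by, ascending=False, ignore_index=True)


# Made-up data standing in for the listings DataFrame
toy = pd.DataFrame({'id': [1, 2, 3, 4],
                    'license': [None, None, None, 'A1'],
                    'price': [80.0, 120.0, None, 95.0]})

result = sorted_report(toy, sort_by='null_pct')
print(result)
```

A function shaped like `sorted_report` could then be passed to `interact_manual`, with the list of report columns supplied as the choices for `sort_by`.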

In [None]:
## Running report on unfiltered dataset

interact_manual(cf.sort_report, Sort_by=list(cf.report_df(data).columns),
                Source=source);

In [None]:
data.head(3)

---
>
> After reviewing my data, I see there are several features that contain irrelevant entries (URLs, source data, metadata) or values that are too complicated for simple processing (such as host and listing descriptions).
>
> I will drop these columns and run a second report to review the remaining data for further processing.
>
---

In [None]:
## Specifying columns to drop

drop = ['name', 'description', 'neighborhood_overview', 'host_name',
        'host_about', 'neighbourhood', 'property_type',
        'listing_url', 'scrape_id', 'last_scraped', 'picture_url','host_url',
        'host_thumbnail_url','host_picture_url','calendar_last_scraped']

In [None]:
## Creating interactive report

interact_manual(cf.sort_report, Drop_Cols = True, Cols = drop,
                Sort_by=list(cf.report_df(data).columns), Source=source);

---
>
> **Interpretation:**
>
> The report shows that a substantial share of the dataset is missing:
>
> * **Empty:**
>   * `neighbourhood_group_cleansed`
>   * `bathrooms`
>   * `calendar_updated`
>
>
> * **Nearly empty:**
>   * `license`
>
>
> * **Missing 26-39% of data:**
>   * `host_about`
>   * `neighborhood_overview`
>   * `neighbourhood`
>   * `host_response_time`
>   * `host_response_rate`
>   * `review_scores_value`
>   * `review_scores_checkin`
>   * `review_scores_location`
>   * `review_scores_accuracy`
>   * `review_scores_communication`
>   * `review_scores_cleanliness`
>   * `host_acceptance_rate`
>   * `reviews_per_month`
>   * `first_review`
>   * `review_scores_rating`
>   * `last_review`
>
>---
>
> I will need to address these missing values before proceeding with the modeling. A few options include:
>
> * Filling with the string "missing" to indicate the value was missing.
>    * *I would be able to treat "missing" as a distinct category and use it for modeling as well.*
>
>
> * Dropping the rows with missing values.
>    * *This would shrink the dataset and may bias my results if the values are not missing at random.*
>
>
> * For numeric features, I could use the `SimpleImputer` tool from SKLearn to fill the missing values with the mean, median, or mode values for each.
>    * *I could couple this with a `GridSearchCV` to identify the method that has the strongest positive impact on my classification metrics.*
>
---
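
The three options above can be sketched on a small made-up DataFrame. The column names are borrowed from the listings data, but the values are invented for illustration, and the choice of the median for imputation is an assumption rather than a decision made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented values; only the column names come from the listings data
toy = pd.DataFrame({
    'host_response_time': ['within an hour', np.nan, 'within a day', np.nan],
    'reviews_per_month': [1.0, np.nan, 3.0, 8.0],
})

# Option 1: treat missingness as its own category
filled = toy['host_response_time'].fillna('missing')

# Option 2: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 3: impute a numeric feature (median here; 'mean' and
# 'most_frequent' are the other strategies mentioned above)
imputer = SimpleImputer(strategy='median')
imputed = imputer.fit_transform(toy[['reviews_per_month']])

print(filled.value_counts())
print(len(toy), '->', len(dropped), 'rows after dropna')
print(imputed.ravel())
```

If the imputer is a named step in a `Pipeline`, its `strategy` can be included in a `GridSearchCV` parameter grid like any other hyperparameter.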

In [None]:
## Creating new dataframe that does not include the irrelevant columns
df = data.drop(columns= drop).copy()

In [None]:
## Visually inspecting missing values
import missingno as missno

missno.matrix(df, labels=True);

>---
>
> The visualization above shows the missing values in each column as a blank space.
>
>
> Based on this visualization, I see that there is a consistent trend in missing values for review scores: if a row is missing one review score, it seems to be missing all of them.
>
>
> After reviewing these details, I feel more comfortable with the option of dropping the rows with missing values. I will consider dropping those rows as part of my overall classification process.
>
>---
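
The "missing one, missing all" pattern seen in the matrix can also be checked numerically. This is a minimal sketch on made-up values; on the real data, `toy` would be replaced by `df`, and the prefix-based column selection is an assumption about which columns count as review scores.

```python
import numpy as np
import pandas as pd

# Made-up rows: each listing has either all review scores or none of them
toy = pd.DataFrame({
    'review_scores_rating': [4.8, np.nan, 4.5, np.nan],
    'review_scores_cleanliness': [4.9, np.nan, 4.0, np.nan],
    'review_scores_value': [4.7, np.nan, 4.6, np.nan],
})

review_cols = [col for col in toy.columns if col.startswith('review_scores_')]

# Number of missing review scores in each row
missing_per_row = toy[review_cols].isna().sum(axis=1)

# How many rows have 0, 1, 2, ... missing scores
pattern_counts = missing_per_row.value_counts().sort_index()

# True only if every row is missing either none or all of the scores
all_or_nothing = missing_per_row.isin([0, len(review_cols)]).all()

print(pattern_counts)
print(all_or_nothing)
```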

## 🎯 Inspecting the Target Variable

In [None]:
## Checking the distribution of the target variable
df['review_scores_rating'].describe()

In [None]:
## Checking value counts (binned to see the ranges of values)
df['review_scores_rating'].value_counts(dropna=False,sort=False, bins=10, normalize=True)

In [None]:
## Confirming all scores are zero or above
df['review_scores_rating'].min()

In [None]:
## Excluding ratings of 4 and above to inspect the marginal ratings
df['review_scores_rating'][df['review_scores_rating'] < 4].value_counts(dropna=False,sort=False, bins=4, normalize=True)

---

> The target feature, `'review_scores_rating'`, currently ranges from 0 to 5, with 69% of the scores being 4 or above. The `.value_counts()` results show a bin edge slightly below zero; this is an artifact of how the bins are calculated, and the lowest value is actually 0.00.
>
>
> Of the scores less than 4, a little under half are between 3 and 4 (rounded) and about a third are between 0 and 1 (rounded).
>
>
>
>To use these values for classification, I will need to bin them into categories based on ranges of values.

---
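
One possible way to do this binning is `pd.cut`. The sketch below uses made-up ratings and assumes a single cut point at 4 with hypothetical class labels; the actual number of classes and their boundaries are still to be decided.

```python
import pandas as pd

# Made-up ratings, including one missing value
ratings = pd.Series([0.0, 2.5, 3.9, 4.0, 4.6, 5.0, None],
                    name='review_scores_rating')

# Assumed cut point: [0, 4) -> 'below_4', [4, inf) -> '4_and_up'
# right=False makes each bin include its left edge, so 4.0 lands in '4_and_up'
bins = [0, 4, float('inf')]
labels = ['below_4', '4_and_up']

rating_class = pd.cut(ratings, bins=bins, labels=labels, right=False)

print(rating_class.value_counts(dropna=False))
```

Missing ratings stay missing after binning, so they still need to be handled before modeling.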

# 🪓 **Train/Test Split**

In [None]:
## Creating features/target for dataset
target = 'review_scores_rating'

X = df.drop(columns = target).copy()
y = df[target].copy()

In [None]:
## Confirming same number of rows
X.shape[0] == y.shape[0]

In [None]:
## Splitting to prevent data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 🚿 **Preprocessing Pipeline**

In [None]:
# cat_cols = ['hotel', 'meal','arrival_date_month', 'country', 'market_segment',
#             'distribution_channel','is_repeated_guest','reserved_room_type',
#             'assigned_room_type','deposit_type', 'agent',
#             'customer_type','reservation_status']

# cont_cols = [col for col in X_train.drop(['reservation_status_date','company'],axis=1).columns if col not in cat_cols]

# cont_cols

In [None]:
# X_train[cat_cols] = X_train[cat_cols].astype(str)

In [None]:
# X_test[cat_cols] = X_test[cat_cols].astype(str)

In [None]:
# ## Creating ColumnTransformer and sub-transformers for imputation and encoding

# # Filling missing "Children"
# zero_transformer = SimpleImputer(strategy='constant', fill_value=0)

# ##  
# missing_transformer = SimpleImputer(strategy='constant', fill_value='missing')

# ## Encoding categoricals - handling errors to prevent issues w/ test set
# categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse=False)

# cat_pipe = Pipeline(steps=[('imputer', missing_transformer),
#                       ('ohe', categorical_transformer)])

# cont_pipe = Pipeline(steps=[('imputer', zero_transformer),
#                            ('scaler', StandardScaler())])

# ## Instantiating the ColumnTransformer and including all transformers
# preprocessor = ColumnTransformer(
#     transformers=[('conts', cont_pipe, cont_cols),
#                   ('cats', cat_pipe, cat_cols)])

# preprocessor

In [None]:
# preprocessor.fit(X_train)

# ## Getting feature names from OHE
# ohe_cat_names = preprocessor.named_transformers_['cats'].named_steps['ohe'].get_feature_names(cat_cols)

In [None]:
# ## Generating list for column index
# final_cols = [*cont_cols, *ohe_cat_names]

# ## Fit and transform the data via the ColumnTransformer
# X_train_tf = preprocessor.transform(X_train)
# X_train_tf_df = pd.DataFrame(X_train_tf, columns=final_cols, index=X_train.index)

# ## Transforming the test set and saving
# X_test_tf = preprocessor.transform(X_test)
# X_test_tf_df = pd.DataFrame(X_test_tf, columns=final_cols, index=X_test.index)

# display(X_train_tf_df.head(5),X_test_tf_df.head(5))

# 📝 Next Steps

* Fit a classification model (e.g., LogReg, KNN, DecisionTrees)
* Evaluate results
* Determine if I need to redo pre-processing steps
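
The first two steps could be sketched as follows. The data here is synthetic, generated only so the example runs on its own, and comparing against a `DummyClassifier` baseline is an assumption about how results might be evaluated, not a step this notebook has committed to.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features and binned target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
# Label driven mostly by the first feature, plus a little noise
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

# Baseline: always predict the most frequent class
baseline_scores = cross_val_score(
    DummyClassifier(strategy='most_frequent'), X_demo, y_demo, cv=5)

# Candidate model: logistic regression with default regularization
logreg_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)

print(f'Baseline accuracy: {baseline_scores.mean():.2f}')
print(f'LogReg accuracy:   {logreg_scores.mean():.2f}')
```

A model that cannot beat the baseline would be a signal to revisit the preprocessing steps.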

# 🚿 Classification Pipeline