# ❌ **Cancel Culture** ❌ - **EDA Notebook**

---

**Author:** Ben McCarty

**Capstone Project** - Classification, Time Series Modeling

**Contact:** bmccarty505@gmail.com

---

---

**Who?**
>* 🏢 **Revenue Management (RM) teams** for hotel groups (corporate, franchise)
>
>
>* 🏨 On-site GMs, Sales, and Ops teams

---

**Why?**
>* 💰 **Revenue Management:** 
>  * Revenue optimization: Right price, right time, right customer
>    * Dynamic pricing
>    * Distribution channels
>    * Pricing per room type
>
>
>* 🤝 **Sales:**
>  * Group sales (pickup/wash)
>  * BT (performance/company for both GPP and LNR rates)
>
>
>* 🛌 **Rooms Ops:**
>  * Forecasting occupancy, arrivals, departures, stay-overs, same-day booking demand, and probability of guest relocation in the case of oversell.
>  * Determining staff schedules and periods of high demand
>
>
>* 🍰 ☕ **Food and Beverage:**
>  * Ordering food/supplies overall
>  * Scheduling staff
>  * Determining busy times (breakfast, lunch, dinner)
>    * Staffing, specific food/supplies

---

**What?**
>* 🧾 Dataset comprised of... 
>  * 32 different features
>    * Detailed explanation of features (and sub-categories, when appropriate) available in Readme
>  * Nearly 120,000 reservation records
>  * Source cited in Readme

---

**How?**
>* Which models/methods?
>  * 🔢 Classifiers 🌳
>    * XGBoost, RFC, ABC, etc.
>  * ⏳ Time Series Analysis 📈
>    * PMD auto-arima
>    * Statsmodels vector autoregression
>
>
>* Data prep and feature engineering

---

---

> **Goal:** To prepare data for classification modeling in next notebook.
>
>
> **Purpose:** To explore, clean, and organize.
>
>
> **Process:**
>
>    * Inspecting data integrity and statistics
>    * Splitting data by hotel type ("City" vs. "Resort")
>    * Filling any missing values
>    * Saving processed data for modeling notebook
>
>
> **Modeling Notebook:**
>
>    * Performing train/test split
>    * Training the model
>    * Evaluating performance metrics
>    * Providing final recommendations

---

# 📦 **Import Packages**

In [None]:
## Data Handling
import pandas as pd
import numpy as np
from scipy import stats

## Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## Custom-made Functions
from bmc_functions import eda

In [None]:
## Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 50)
%matplotlib inline

In [None]:
%load_ext autoreload
%autoreload 2

# 📥 **Read Data**

In [None]:
## Reading data
source = './data/hotel_bookings.csv'
data = pd.read_csv(source)
data

In [None]:
## Inspecting percentage of city vs. resort hotels
data['hotel'].value_counts(normalize=True)

# 🎯 **Identifying Target Feature** 🎯

---

> For my classification analysis, **I will use the `is_canceled` feature as my target feature.** This feature indicates whether a reservation was canceled (0 = check-out, 1 = canceled).
>
> There is another feature, `reservation_status`, that also looks valuable. I will compare that feature against `is_canceled` to investigate any differences between the two.

---
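
A cross-tabulation is one way to carry out the comparison between `reservation_status` and `is_canceled` described above. The following is a minimal sketch on a made-up toy dataframe (the rows are illustrative, not taken from the dataset), assuming the status labels are `Check-Out`, `Canceled`, and `No-Show`.

```python
import pandas as pd

# Toy stand-in for the reservation data; the values are illustrative only.
toy = pd.DataFrame({
    'reservation_status': ['Check-Out', 'Canceled', 'No-Show',
                           'Check-Out', 'Canceled'],
    'is_canceled': [0, 1, 1, 0, 1],
})

# Rows: reservation_status; columns: is_canceled; cells: reservation counts.
status_vs_target = pd.crosstab(toy['reservation_status'], toy['is_canceled'])
print(status_vs_target)
```

Any status that lands in both columns of the table would indicate a disagreement between the two features.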

# 🪓 **Splitting "City" and "Resort"**

In [None]:
## Creating subgroup for city hotels
## (.copy() avoids pandas' SettingWithCopyWarning when modifying the subset)
subgroup_city = data[data['hotel'] == 'City Hotel'].copy()
subgroup_city.drop(columns='hotel', inplace=True)
subgroup_city

In [None]:
## Creating subgroup for resort hotels
## (.copy() avoids pandas' SettingWithCopyWarning when modifying the subset)
subgroup_resort = data[data['hotel'] == 'Resort Hotel'].copy()
subgroup_resort.drop(columns='hotel', inplace=True)
subgroup_resort

In [None]:
## Deleting the original variable to free up memory
del data

# 📊 **Reviewing Statistics**

---

**`Report_df()`: City**

---

In [None]:
## Sorting report by number of missing values
eda.report_df(subgroup_city).sort_values('null_sum', ascending=False)

---

**`Report_df()`: Resort**

---

In [None]:
## Sorting report by number of missing values
eda.report_df(subgroup_resort).sort_values('null_sum', ascending=False)

---

**Reviewing Reports - Missing Values**

> Based on the post-split results, I see that both dataframes are missing values for `company`, `agent`, and `country`. Additionally, the `subgroup_city` dataframe is missing four values for `children`.
>
> **Special note:** As noted in the data's documentation (located in *"details.md"*), any missing values are intentional representations of features that were not applicable to a reservation.

---
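
The reports above come from the custom `eda.report_df()` helper, whose source is not shown in this notebook. For readers without that module, here is a rough, hypothetical stand-in built only on pandas. The function name `missing_report` and the `null_pct` column are assumptions for illustration; only the `null_sum` column name is taken from the cells above.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Hypothetical stand-in for a missing-value report (not the eda.report_df source)."""
    report = pd.DataFrame({
        'null_sum': df.isna().sum(),         # count of missing values per column
        'null_pct': df.isna().mean() * 100,  # percentage of rows that are missing
    })
    return report.sort_values('null_sum', ascending=False)

# Toy data; the values are illustrative only.
toy = pd.DataFrame({
    'company': [np.nan, np.nan, np.nan, 40.0],
    'agent': [9.0, np.nan, 240.0, 9.0],
    'adults': [2, 2, 1, 2],
})
toy_report = missing_report(toy)
print(toy_report)
```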

**`Company` and `Agent` Features**

> *Missing in `subgroup_city`:*
> * `company`: 95%
> * `agent`: 10%
>
> *Missing in `subgroup_resort`:*
> * `company`: 92%
> * `agent`: 20%
>
> Due to the large number of missing values for `company`, **I will drop `company` from both dataframes.**
>
> Since the missing values for `agent` are valid, **I will keep `agent` and fill the missing values with a placeholder that represents the absence of an agent.** I will fill the missing values in the next section.
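
The placeholder approach for `agent` can be sketched on toy data as follows. This is a minimal illustration, assuming the `999.0` sentinel tested later in this notebook; the agent IDs in the toy frame are made up.

```python
import numpy as np
import pandas as pd

# Toy agent IDs; the values are illustrative only.
toy = pd.DataFrame({'agent': [9.0, np.nan, 240.0, np.nan]})

placeholder = 999.0  # sentinel standing in for "no agent"

# Only use the placeholder if it does not collide with a real agent ID.
assert placeholder not in set(toy['agent'].dropna())

# Assign the result back rather than filling a slice in place.
toy['agent'] = toy['agent'].fillna(placeholder)
print(toy['agent'].tolist())
```

Checking for a collision first matters because a sentinel that matches a real ID would silently merge "no agent" bookings with that agent's bookings.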

**`Country` and `Children` Features**

> The remaining two features with missing values are `country` and `children`.
>
> **As there are a small number of missing values in both dataframes' features, I will keep both features and fill the missing values with the most frequent values.** As there are so few missing values, my method for filling these missing values has a negligible impact on the final results.
>
> 

---
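
Filling with the most frequent value, as planned for `country` and `children`, can be sketched like this. The toy values are made up for illustration and are not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data; the values are illustrative only.
toy = pd.DataFrame({
    'country': ['PRT', 'PRT', np.nan, 'GBR'],
    'children': [0.0, 0.0, 2.0, np.nan],
})

# Series.mode() returns a Series (ties are possible), so take the first entry.
for column in ['country', 'children']:
    most_frequent = toy[column].mode()[0]
    toy[column] = toy[column].fillna(most_frequent)

print(toy)
```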

## Dropping `Company` Column

In [None]:
# Dropping "company" column (95% missing values)
subgroup_city.drop(columns = ['company'], inplace=True)
subgroup_city

In [None]:
# Dropping "company" column (92% missing values)
subgroup_resort.drop(columns = ['company'], inplace=True)
subgroup_resort

In [None]:
## Confirming 'company' removal from both
'company' not in subgroup_city and 'company' not in subgroup_resort

## Filling missing values in `agent`

In [None]:
## Identifying unique values for both sub-groups

unique_values = set()
for value in subgroup_city['agent'].unique():
    unique_values.add(value)
    
for value in subgroup_resort['agent'].unique():
    unique_values.add(value)

In [None]:
## Confirming uniform datatype
unique_dtype = set()
for item in unique_values:
    unique_dtype.add(type(item))
    
unique_dtype

In [None]:
## Testing placeholder value to fill missing values
999.0 in unique_values

In [None]:
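## Filling missing values and confirming no remaining missing values

for df in [subgroup_city, subgroup_resort]:
    ## Assigning the result back; fillna(inplace=True) on a .loc selection
    ## may act on a temporary copy and leave the dataframe unchanged
    df['agent'] = df['agent'].fillna(999.0)
    print(df['agent'].isna().sum())
    del df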

## Filling Remaining Missing Values

In [None]:
## Inspecting remaining missing values
display(subgroup_city.isna().sum()[subgroup_city.isna().sum() >0])
display(subgroup_resort.isna().sum()[subgroup_resort.isna().sum() >0])

In [None]:
## Determining most frequent value for subgroup_city
city_child = subgroup_city['children'].mode()[0]
city_country = subgroup_city['country'].mode()[0]

print(f'Most frequent value (children): {city_child}.')
print(f'Most frequent value (country): {city_country}.')

In [None]:
## Replacing missing values for 'children'
subgroup_city['children'] = subgroup_city['children'].fillna(city_child)

In [None]:
## Replacing missing values for 'country'
subgroup_city['country'] = subgroup_city['country'].fillna(city_country)

In [None]:
## Confirming filled missing values
subgroup_city.isna().sum()

In [None]:
## Determining most frequent value for subgroup_resort
resort_country = subgroup_resort['country'].mode()[0]

In [None]:
## Filling missing value for resort - 'country'
subgroup_resort['country'] = subgroup_resort['country'].fillna(resort_country)

In [None]:
## Confirming no missing values
subgroup_resort.isna().sum()

# 🔬 **Inspecting Feature Data Types**

---

**City**

---

In [None]:
## Inspecting datatypes for "subgroup_city"
subgroup_city.dtypes.sort_values()

---

**Resort**

---

In [None]:
## Inspecting datatypes for "subgroup_resort"
subgroup_resort.dtypes.sort_values()

In [None]:
## Confirming all datatypes match between dataframes
## (comparing unsorted keeps both Series aligned on the same column labels)
subgroup_city.dtypes == subgroup_resort.dtypes

---

**Review - Datatypes**

> After reviewing the data types, I noticed that **`agent` needs to be changed to the string type and `reservation_status_date` needs to be converted to the datetime type**. The `agent` feature represents unique identifiers for booking agents and needs to be treated as categorical data.
>
> As both dataframes' data types are the same, I do not need to make any other adjustments specific to either dataframe.

---
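
Because `agent` is stored as a float, the order of the casts affects the resulting labels. The sketch below uses made-up agent IDs to show the difference between casting directly to string and casting through an integer first.

```python
import pandas as pd

# Toy agent IDs stored as floats; the values are illustrative only.
agent_ids = pd.Series([9.0, 240.0, 999.0], name='agent')

# Casting floats straight to strings keeps the trailing ".0".
direct = agent_ids.astype(str)

# Casting through int first yields clean identifier labels.
via_int = agent_ids.astype(int).astype(str)

print(direct.tolist())
print(via_int.tolist())
```

Note that the integer cast only works once the missing values have been filled, since `NaN` cannot be converted to an integer.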

## Converting to Strings

In [None]:
## Converting "agent" to string for both sub-groups
## (casting to int first drops the trailing ".0" from the float IDs)

for df in [subgroup_city, subgroup_resort]:
    df['agent'] = df['agent'].astype(int).astype(str)
    print(f'Datatype: {df["agent"].dtype}')
    del df

## Converting to DateTime

In [None]:
## Converting "reservation_status_date" to datetime for both sub-groups
for df in [subgroup_city, subgroup_resort]:
    df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'])
    print(f'Datatype: {df["reservation_status_date"].dtype}')
    del df

# 🔎 **EDA - Features**

---

**In-Depth EDA per Feature**

> Now that I have reviewed my missing values and confirmed my datatypes, I will inspect the details of each of my features.

---
**Note:**

> DataFrame styling code used in `explore_feature()` function adapted from this [source](https://stackoverflow.com/questions/59769161/python-color-pandas-dataframe-based-on-multiindex#:~:text=2-,You,-can%20use%20Styler).

---

## 📊 **Toggle Visualizations**

In [None]:
## Boolean setting to control whether to show the EDA visualizations
show_visualization = False

## `Reservation_Status`

---

**City**

---

In [None]:
## Reviewing details for city - reservation_status
eda.explore_feature(subgroup_city,'reservation_status',
                    target_feature='is_canceled',
                    plot_label ='Status',
                    plot_title= 'Reservation Status - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - reservation_status
eda.explore_feature(subgroup_resort,'reservation_status',
                    target_feature='is_canceled',
                    plot_label ='Status',
                    plot_title= 'Reservation Status - Resort',
                    show_visualization = show_visualization);

---

**Review** - `Reservation_Status`

---

---

**Feature Review**

> `Reservation_status` closely mirrors the values for my target feature, with some slight differences due to "no-show" values. **To prepare it for modeling, I will combine the `No-Show` status and `Canceled` values.**

**Actions**

> For the purposes of my analysis, **I will treat `No-Show` reservations as `Canceled` reservations** because there are too few of them to serve effectively as a third class.

**City vs. Resort**

> The most notable difference between the city and resort hotels is the number of cancellations: *The city hotel shows a much larger proportion of canceled reservations vs. the resort hotel.*
> * This may be due to a variety of factors, such as resort guests booking when they are more certain of their plans or the resort hotel charging a cancellation fee.
>
> No-Show reservations are low for both hotels, supporting my decision to merge no-shows with cancellations.

---
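
Merging `No-Show` into `Canceled` could look like the following. This is a minimal sketch on toy data, assuming the status labels are the strings `Check-Out`, `Canceled`, and `No-Show`; the names `merged_status` and `binarized` are illustrative.

```python
import pandas as pd

# Toy data; the values are illustrative only.
toy = pd.DataFrame({
    'reservation_status': ['Check-Out', 'Canceled', 'No-Show', 'Check-Out'],
    'is_canceled': [0, 1, 1, 0],
})

# Fold 'No-Show' into 'Canceled' so only two classes remain.
merged_status = toy['reservation_status'].replace({'No-Show': 'Canceled'})

# Binarize the merged status and compare it with the existing target.
binarized = (merged_status == 'Canceled').astype(int)
matches_target = bool((binarized == toy['is_canceled']).all())
print(matches_target)
```

If the comparison holds on the real data, the merged status carries the same information as `is_canceled`.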

## `Is_Canceled`

---

**City**

---

In [None]:
## Reviewing details for city - 'is_canceled'
eda.explore_feature(subgroup_city,'is_canceled', 
                    target_feature='is_canceled',
                    normalize=False,
                    plot_label ='Cancellation Status',
                    plot_title= 'Reservation Status - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'is_canceled'
eda.explore_feature(subgroup_resort,'is_canceled', 
                    target_feature='is_canceled',
                    normalize=False,
                    plot_label ='Cancellation Status',
                    plot_title= 'Reservation Status - Resort',
                    show_visualization = show_visualization);

---

**Review** - `Is_Canceled`

---

---

**Feature Review**

> After reviewing the results, I see that `is_canceled` is a binarization of the `reservation_status` feature once no-shows are merged with cancellations. Reservations are marked as cancellations if they were either canceled or recorded as a "no-show".

**Actions**

> This feature is a better target feature as the values are already binarized and match the `reservation_status` feature for all of the reservations.
>
> **I will use `is_canceled` in place of the `reservation_status` feature as my target feature.**

**City vs. Resort**

> The breakdown between hotels is the same as `reservation_status` and confirms that the resort hotel experiences fewer cancellations vs. the city hotel.

---

## `Lead_Time`

---

**City**

---

In [None]:
## Reviewing details for city - 'lead_time'
eda.explore_feature(subgroup_city,'lead_time',bins = 5, marginal = 'box',
                    target_feature='is_canceled',plot_label ='Number of Days',
                    plot_title= 'Lead Time (Days) - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'lead_time'
eda.explore_feature(subgroup_resort,'lead_time',bins = 5,marginal= 'box',
                    target_feature='is_canceled',plot_label ='Number of Days',
                    plot_title= 'Lead Time (Days) - Resort',
                    show_visualization = show_visualization);

---

**Review** - `Lead_Time`

---

---

**Feature Review**

> `Lead_Time` indicates how far in advance reservations are booked in days. *This information is particularly useful in hospitality for Revenue Management (RM) and Operations (Ops).*
>
>  * RM needs to know **when to expect bookings** and **when to monitor rates and availability** closely to make any necessary changes to optimize revenue.
>
>
>  * Ops uses this information to **forecast how many reservations will book in a short-term booking window** (I usually focused on 0-3 days prior to arrival).
>
>  * **This forecast is critical for determining staffing and supplies in particular.** When building our schedules, we consider the current number of booked reservations and the forecasted bookings to determine how many staff members to schedule and whether we have enough supplies.
>    * *Being the only staff member at the Front Desk during a rush of arrivals due to a snow storm is NOT fun!*

**Actions**

> I noticed there are a significant number of outliers for both properties. **I will remove the outliers based on their z-scores prior to modeling.**
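
Z-score filtering can be sketched as below using `scipy.stats`, which is already imported in this notebook. The notebook does not state a cutoff, so the threshold of 3 standard deviations is an assumption (a common convention), and the lead times are toy values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy lead times: 19 ordinary values plus one extreme booking.
toy = pd.DataFrame({'lead_time': list(range(0, 190, 10)) + [700]})

# Absolute z-score: distance from the mean in standard deviations.
z_scores = np.abs(stats.zscore(toy['lead_time']))

threshold = 3  # assumed cutoff; the notebook does not specify one
trimmed = toy[z_scores < threshold]
print(len(toy), len(trimmed))
```

Because the mean and standard deviation are themselves pulled by extreme values, this filter works best when the outliers are few relative to the dataset.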

**City vs. Resort**

> The histograms and box plots for both hotels match up closely, but it is clear that **the city hotel has a larger range of lead times for cancellations vs. the resort hotel.**

---

## `Arrival_Date_Year`

---

**City**

---

In [None]:
## Reviewing details for city - 'arrival_date_year'
eda.explore_feature(subgroup_city,'arrival_date_year',marginal = 'box',
                    target_feature='is_canceled',plot_label ='Year',
                    plot_title= 'Arrival Date (Year) - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'arrival_date_year'
eda.explore_feature(subgroup_resort,'arrival_date_year',marginal = 'box',
                    target_feature='is_canceled',plot_label ='Year',
                    plot_title= 'Arrival Date (Year) - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER 

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Stays_in_Weekend_Nights`

---

**City**

---

In [None]:
## Reviewing details for city - 'stays_in_weekend_nights'
eda.explore_feature(subgroup_city,'stays_in_weekend_nights',bins = 5,
                    marginal = 'box',target_feature='is_canceled',
                    plot_label ='Number of Weekend Nights',
                    plot_title= 'Stays in Weekend Nights - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'stays_in_weekend_nights'
eda.explore_feature(subgroup_resort,'stays_in_weekend_nights',
                    bins = 5,marginal = 'box',target_feature='is_canceled',
                    plot_label ='Number of Weekend Nights',
                    plot_title= 'Stays in Weekend Nights - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Stays_in_Week_Nights`

---

**City**

---

In [None]:
## Reviewing details for city - 'stays_in_week_nights'
eda.explore_feature(subgroup_city,'stays_in_week_nights',bins = 5,
                    marginal = 'box',target_feature='is_canceled',
                    plot_label ='Number of Week Nights',
                    plot_title= 'Stays in Week Nights - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'stays_in_week_nights'
eda.explore_feature(subgroup_resort,'stays_in_week_nights',bins = 5,
                    marginal = 'box',target_feature='is_canceled',
                    plot_label ='Number of Week Nights',
                    plot_title= 'Stays in Week Nights - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Adults`

---

**City**

---

In [None]:
## Reviewing details for city - 'adults'
eda.explore_feature(subgroup_city,'adults',bins = 3,
                    target_feature='is_canceled',plot_label ='Number of Adults',
                    plot_title= 'Adults - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'adults'
eda.explore_feature(subgroup_resort,'adults',bins = 3,
                    target_feature='is_canceled',
                    plot_label ='Number of Adults',
                    plot_title= 'Adults - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Children`

---

**City**

---

In [None]:
## Reviewing details for city - 'children'
eda.explore_feature(subgroup_city,'children',bins = 5,
                    target_feature='is_canceled',
                    plot_label ='Number of Children',
                    plot_title= 'Children - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'children'
eda.explore_feature(subgroup_resort,'children',bins = 3,
                    target_feature='is_canceled',
                    plot_label ='Number of Children',
                    plot_title= 'Children - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Babies`

---

**City**

---

In [None]:
## Reviewing details for city - 'babies'
eda.explore_feature(subgroup_city,'babies',bins = 5,
                    target_feature='is_canceled',
                    plot_label ='Number of Babies',
                    plot_title= 'Babies - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'babies'
eda.explore_feature(subgroup_resort,'babies',bins = 3,
                    target_feature='is_canceled',
                    plot_label ='Number of Babies',
                    plot_title= 'Babies - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Meal`

---

**City**

---

In [None]:
## Reviewing details for city - 'meal'
eda.explore_feature(subgroup_city,'meal',target_feature='is_canceled',
                    plot_label ='Types of Meals',plot_title= 'Meal - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'meal'
eda.explore_feature(subgroup_resort,'meal',
                    target_feature='is_canceled',plot_label ='Types of Meals',
                    plot_title= 'Meal - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Country`

---

**City**

---

In [None]:
## Reviewing details for city - 'country'
eda.explore_feature(subgroup_city,'country',marginal = 'box',normalize=False,
                    target_feature='is_canceled',plot_label ='Country',
                    plot_title= 'Country - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'country'
eda.explore_feature(subgroup_resort,'country',normalize=False,marginal ='box',
                    target_feature='is_canceled',plot_label ='Country',
                    plot_title= 'Country - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Market_Segment`

---

**City**

---

In [None]:
## Reviewing details for city - 'market_segment'
eda.explore_feature(subgroup_city,'market_segment',marginal = 'box',
                    target_feature='is_canceled',plot_label ='Market Segment',
                    plot_title= 'Market Segment - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'market_segment'
eda.explore_feature(subgroup_resort,'market_segment',normalize=False,
                    target_feature='is_canceled',plot_label ='Market Segment',
                    plot_title= 'Market Segment - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Distribution_Channel`

---

**City**

---

In [None]:
## Reviewing details for city - 'distribution_channel'
eda.explore_feature(subgroup_city,'distribution_channel',normalize=False,
                    target_feature='is_canceled',
                    plot_label ='Distribution Channel',
                    plot_title= 'Distribution Channel - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'distribution_channel'
eda.explore_feature(subgroup_resort,'distribution_channel',normalize=False,
                    target_feature='is_canceled',
                    plot_label ='Distribution Channel',
                    plot_title= 'Distribution Channel - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Is_Repeated_Guest`

---

**City**

---

In [None]:
## Reviewing details for city - 'is_repeated_guest'
eda.explore_feature(subgroup_city,'is_repeated_guest',
                    target_feature='is_canceled',plot_label ='Repeat Guest',
                    plot_title= 'Repeat Guest - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'is_repeated_guest'
eda.explore_feature(subgroup_resort,'is_repeated_guest',
                    target_feature='is_canceled',plot_label ='Repeat Guest',
                    plot_title= 'Repeat Guest - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Previous_Cancellations`

---

**City**

---

In [None]:
## Reviewing details for city - 'previous_cancellations'
eda.explore_feature(subgroup_city,'previous_cancellations',bins = 5,
                    normalize=False,target_feature='is_canceled',
                    plot_label ='Number of Cancellations',
                    plot_title= 'Previous Cancellations - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'previous_cancellations'
eda.explore_feature(subgroup_resort,'previous_cancellations',bins = 4,
                    target_feature='is_canceled',
                    plot_label ='Previous Cancellations',
                    plot_title= 'Previous Cancellations - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Previous_Bookings_Not_Canceled`

---

**City**

---

In [None]:
## Reviewing details for city - 'previous_bookings_not_canceled'
eda.explore_feature(subgroup_city,'previous_bookings_not_canceled',
                    bins = 5,marginal = 'box',target_feature='is_canceled',
                    plot_label ='Number of Bookings Not Canceled',
                    plot_title= 'Previous Bookings Not Canceled - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'previous_bookings_not_canceled'
eda.explore_feature(subgroup_resort,'previous_bookings_not_canceled',
                    bins = 5,target_feature='is_canceled',
                    plot_label ='Previous Bookings Not Canceled',
                    plot_title= 'Previous Bookings Not Canceled - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Reserved_Room_Type`

---

**City**

---

In [None]:
## Reviewing details for city - 'reserved_room_type'
eda.explore_feature(subgroup_city,'reserved_room_type',
                    target_feature='is_canceled',
                    plot_label ='Reserved Room Type',
                    plot_title= 'Reserved Room Type - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'reserved_room_type'
eda.explore_feature(subgroup_resort,'reserved_room_type',
                    target_feature='is_canceled',
                    plot_label ='Reserved Room Type',
                    plot_title= 'Reserved Room Type - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Assigned_Room_Type`

---

**City**

---

In [None]:
## Reviewing details for city - 'assigned_room_type'
eda.explore_feature(subgroup_city,'assigned_room_type',
                    target_feature='is_canceled',
                    plot_label ='Assigned Room Type',
                    plot_title= 'Assigned Room Type - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'assigned_room_type'
eda.explore_feature(subgroup_resort,'assigned_room_type',
                    target_feature='is_canceled',
                    plot_label ='Assigned Room Type',
                    plot_title= 'Assigned Room Type - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Booking_Changes`

---

**City**

---

In [None]:
## Reviewing details for city - 'booking_changes'
eda.explore_feature(subgroup_city,'booking_changes',bins = 5,
                    target_feature='is_canceled',
                    plot_label ='Booking Changes',
                    plot_title= 'Booking Changes - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'booking_changes'
eda.explore_feature(subgroup_resort,'booking_changes',bins = 5,
                    target_feature='is_canceled',
                    plot_label ='Booking Changes',
                    plot_title= 'Booking Changes - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Deposit_Type`

---

**City**

---

In [None]:
## Reviewing details for city - 'deposit_type'
eda.explore_feature(subgroup_city,'deposit_type',normalize=False,
                    target_feature='is_canceled',plot_label ='Deposit Type',
                    plot_title= 'Deposit Type - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'deposit_type'
eda.explore_feature(subgroup_resort,'deposit_type',
                    target_feature='is_canceled', plot_label ='Deposit Type',
                    plot_title= 'Deposit Type - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Agent`

---

**City**

---

In [None]:
## Reviewing details for city - 'agent'
eda.explore_feature(subgroup_city,'agent',target_feature='is_canceled',
                    plot_label ='Booking Agent',plot_title= 'Agent - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'agent'
eda.explore_feature(subgroup_resort,'agent',target_feature='is_canceled',
                    plot_label ='Booking Agent',plot_title= 'Agent - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Days_in_Waiting_List`

---

**City**

---

In [None]:
## Reviewing details for city - 'days_in_waiting_list'
eda.explore_feature(subgroup_city,'days_in_waiting_list',bins = 5,
                    normalize=False,target_feature='is_canceled',
                    plot_label ='Days in Waiting List',
                    plot_title= 'Days in Waiting List - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'days_in_waiting_list'
eda.explore_feature(subgroup_resort,'days_in_waiting_list',bins = 5,
                    plot_type='histogram',target_feature='is_canceled',
                    plot_label ='Days in Waiting List',
                    plot_title= 'Days in Waiting List - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Customer_Type`

---

**City**

---

In [None]:
## Reviewing details for city - 'customer_type'
eda.explore_feature(subgroup_city,'customer_type',marginal = 'box',
                    target_feature='is_canceled',plot_label ='Customer Type',
                    plot_title= 'Customer Type - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'customer_type'
eda.explore_feature(subgroup_resort,'customer_type',marginal = 'box',
                    target_feature='is_canceled',plot_label ='Customer Type',
                    plot_title= 'Customer Type - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `ADR`

---

**City**

---

In [None]:
## Reviewing details for city - 'adr'
eda.explore_feature(subgroup_city ,'adr',bins = 5,
                    target_feature='is_canceled',plot_label ='ADR (€)',
                    plot_title= 'ADR (€) - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'adr'
eda.explore_feature(subgroup_resort,'adr',bins = 5,
                    target_feature='is_canceled',plot_label ='ADR (€)',
                    plot_title= 'ADR (€) - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Required_Car_Parking_Spaces`

---

**City**

---

In [None]:
## Reviewing details for city - 'required_car_parking_spaces'
eda.explore_feature(subgroup_city,'required_car_parking_spaces',bins = 5,
                    normalize=False,target_feature='is_canceled',
                    plot_label ='Required Car Parking Spaces',
                    plot_title= 'Required Car Parking Spaces - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'required_car_parking_spaces'
eda.explore_feature(subgroup_resort,'required_car_parking_spaces',bins = 5,
                    normalize=False,target_feature='is_canceled',
                    plot_label ='Required Car Parking Spaces',
                    plot_title= 'Required Car Parking Spaces - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Total_of_Special_Requests`

---

**City**

---

In [None]:
## Reviewing details for city - 'total_of_special_requests'
eda.explore_feature(subgroup_city,'total_of_special_requests',bins = 5,
                    marginal = 'box',target_feature='is_canceled',
                    plot_label ='Total of Special Requests',
                    plot_title= 'Total of Special Requests - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'total_of_special_requests'
eda.explore_feature(subgroup_resort,'total_of_special_requests',bins = 5,
                    marginal = 'box',target_feature='is_canceled',
                    plot_label ='Total of Special Requests',
                    plot_title= 'Total of Special Requests - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## `Reservation_Status_Date`

---

**City**

---

In [None]:
## Reviewing details for city - 'reservation_status_date'
eda.explore_feature(subgroup_city,'reservation_status_date',marginal = 'box',
                    bins=3,target_feature='is_canceled',
                    plot_label ='Reservation Status Date',
                    plot_title= 'Reservation Status Date - City',
                    show_visualization = show_visualization);

---

**Resort**

---

In [None]:
## Reviewing details for resort - 'reservation_status_date'
eda.explore_feature(subgroup_resort,'reservation_status_date',
                    marginal = 'box',bins=3,target_feature='is_canceled',
                    plot_label ='Reservation Status Date',
                    plot_title= 'Reservation Status Date - Resort',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

# 🛑 **Post-EDA**

---

**FIX/UPDATE ALL OF THE NARRATIVE ELEMENTS OF THE REST OF THIS CODE.**

---

---

**Finishing Touches**

> Now that I have reviewed all of my features, confirmed there are no missing values, and confirmed all of the datatypes are correct, I will finish the remaining preprocessing.

**Outliers**

> During my EDA, I noticed that several features contain significant outliers. If I kept these outlying data points, they could have a negative impact on the performance of my future models.

**Process**
> I will first identify the features with outliers, then use the z-score of each data point to flag the outliers. Any data point whose z-score has an absolute value greater than 3 will be considered an outlier, and its row will be dropped.
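
> As a minimal sketch of the z-score rule described above, the cell below flags values more than 3 standard deviations from the mean. The helper name `zscore_outlier_mask` and the toy data are assumptions for illustration only; the notebook's own `eda.find_outliers_z` may be implemented differently.

```python
import numpy as np
import pandas as pd


def zscore_outlier_mask(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where |z-score| exceeds the threshold.

    Hypothetical helper for illustration; not the notebook's eda.find_outliers_z.
    """
    std = values.std(ddof=0)
    if std == 0 or np.isnan(std):
        # A constant (or empty) column has no outliers by this definition
        return pd.Series(False, index=values.index)
    z_scores = (values - values.mean()) / std
    return z_scores.abs() > threshold


# Toy data: 29 bookings with no wait and one with a 60-day wait
toy_wait = pd.Series([0] * 29 + [60], name='days_in_waiting_list')
toy_mask = zscore_outlier_mask(toy_wait)
print(toy_wait[toy_mask])
```

> Returning a mask rather than the filtered values keeps the original index intact, which is what makes it possible to collect row labels for dropping later.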

---

## `Reservation_Status`: Converting `No-Show` to `Canceled`

In [None]:
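## Changing no-show values to "canceled"
## (assigning the result back instead of relying on inplace=True on a selection)
subgroup_city.loc[:,'reservation_status'] = subgroup_city['reservation_status']\
                                            .replace('No-Show', 'Canceled')
subgroup_resort.loc[:,'reservation_status'] = subgroup_resort['reservation_status']\
                                            .replace('No-Show', 'Canceled')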

In [None]:
## Confirming the change
'No-Show' not in subgroup_city['reservation_status'].values and \
                        'No-Show' not in subgroup_resort['reservation_status'].values

In [None]:
## Inspecting the updated target classes
subgroup_city['reservation_status'].value_counts(1, dropna=False)

In [None]:
subgroup_resort['reservation_status'].value_counts(1, dropna=False)

### Review - `Reservation_Status`

---

> I successfully converted all `No-Show` values to `Canceled`, **resulting in a binary classification of whether a reservation will actualize (`Check-Out`) or not (`Canceled`).**
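
> A small sketch of how the conversion can be verified, using a toy Series in place of the real `reservation_status` column (the values are illustrative). It also shows why a membership test needs to look at the values: `in` on a pandas Series checks the index labels, not the data.

```python
import pandas as pd

# Toy stand-in for the 'reservation_status' column (values are illustrative)
toy_status = pd.Series(['Check-Out', 'No-Show', 'Canceled', 'Check-Out'])

# Assigning the result back avoids relying on an in-place change to a selection
toy_status = toy_status.replace('No-Show', 'Canceled')

# `in` on a Series tests the index labels, so this is True before AND after
label_check = 'No-Show' not in toy_status

# Testing the values directly is the check that actually confirms the change
value_check = not toy_status.eq('No-Show').any()
remaining_classes = sorted(toy_status.unique())
print(label_check, value_check, remaining_classes)
```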

---

## Engineering `Arrival_Date`

---

**City**

---

In [None]:
## Converting from month, day of month, and year to a single datetime column
subgroup_city['arrival_date'] = subgroup_city['arrival_date_month'] +' '+ \
                                subgroup_city['arrival_date_day_of_month']\
                                .astype(str) +', '+ \
                                subgroup_city['arrival_date_year'].astype(str)
subgroup_city['arrival_date'] = pd.to_datetime(subgroup_city['arrival_date'])
subgroup_city['arrival_date']

In [None]:
## Determining the day of the week of arrival 
subgroup_city.loc[:,'arrival_day'] = subgroup_city.loc[:,'arrival_date'].dt.day_name()
subgroup_city['arrival_day']

In [None]:
## Reviewing results
subgroup_city[['arrival_day', 'arrival_date']]

---

**Resort**

---

In [None]:
## Converting from month, day of month, and year to a single datetime column
subgroup_resort['arrival_date'] = subgroup_resort['arrival_date_month'] +' '+ \
                                subgroup_resort['arrival_date_day_of_month']\
                                .astype(str) +', '+ \
                                subgroup_resort['arrival_date_year'].astype(str)
subgroup_resort['arrival_date'] = pd.to_datetime(subgroup_resort['arrival_date'])
subgroup_resort['arrival_date']

In [None]:
## Determining the day of the week of arrival 
subgroup_resort.loc[:,'arrival_day'] = subgroup_resort.loc[:,'arrival_date'].dt.day_name()
subgroup_resort['arrival_day']

In [None]:
## Reviewing results
subgroup_resort[['arrival_day', 'arrival_date']]

In [None]:
subgroup_resort

---

**Feature Review**

> I created this new feature to merge the arrival year/month/day-of-month features into one usable feature. 
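
> The merge can be sketched on a few toy rows, as below. The column names match the dataset, but the values are made up, and passing an explicit `format` to `pd.to_datetime` is my suggestion here rather than what the cells above do.

```python
import pandas as pd

# Toy rows using the dataset's arrival column names (values are illustrative)
toy_arrivals = pd.DataFrame({
    'arrival_date_year': [2015, 2016, 2017],
    'arrival_date_month': ['July', 'February', 'August'],
    'arrival_date_day_of_month': [1, 29, 31],
})

# Build a string such as 'July 1, 2015' for each row
date_text = (toy_arrivals['arrival_date_month'] + ' '
             + toy_arrivals['arrival_date_day_of_month'].astype(str) + ', '
             + toy_arrivals['arrival_date_year'].astype(str))

# An explicit format skips format inference and raises on malformed dates
toy_arrivals['arrival_date'] = pd.to_datetime(date_text, format='%B %d, %Y')
toy_arrivals['arrival_day'] = toy_arrivals['arrival_date'].dt.day_name()
print(toy_arrivals[['arrival_date', 'arrival_day']])
```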

**Actions**

> PLACEHOLDER

**City vs. Resort**

> PLACEHOLDER

---

## Outlier Removal

---

**City**

---

In [None]:
## Creating dataframe visualizing percentage of outliers in subgroup_city
eda.outlier_percentage(subgroup_city);

In [None]:
## Visualizing feature with highest percentage of outliers
subgroup_city['children'].plot(kind='hist')
subgroup_city['children'].describe()

---

> Using the `Children` feature as an example, the statistics show that very few reservations include children.
>
> As there are so few reservations with children, these entries may negatively impact my models' performance. **I will use the z-scores for this feature to determine which rows have values beyond 3 standard deviations, then remove those rows.** This will be approx. 1% of the feature's data, eliminating the smallest number of outliers to preserve the original data as much as possible.

---

> This process of identifying outlying values for each numeric feature will most likely result in identifying rows in which only *one* value is an outlier. However, I will need to disregard the whole entry to be able to model the data; keeping the rows while disregarding that one value would result in creating more missing values. 
>
> I will iterate through each numeric feature in both the city and resort hotel dataframes and save the index value of each row with an outlier value to a set specific to each dataframe.
>
> Using these sets, I will drop the flagged rows from each respective dataframe. The end result will be smaller dataframes whose numeric features are less skewed by extreme values.
>
----
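
> The loop described above can be sketched on a toy dataframe. The inline z-score test is an assumption standing in for `eda.find_outliers_z`, and the column values are made up. Each numeric column contributes the index labels of its outlying rows to one shared set, so a row flagged by several columns is only counted once.

```python
import pandas as pd

# Toy dataframe: 20 rows, with one extreme value in each numeric column
toy_df = pd.DataFrame({
    'adr': [100.0] * 19 + [5000.0],
    'children': [3] + [0] * 19,
    'hotel': ['City'] * 20,
})

rows_to_drop = set()
for col in toy_df.select_dtypes('number').columns:
    # Inline z-score test (an assumption standing in for eda.find_outliers_z)
    z_scores = (toy_df[col] - toy_df[col].mean()) / toy_df[col].std(ddof=0)
    rows_to_drop.update(toy_df.index[z_scores.abs() > 3])

# Dropping by index label removes the whole row, so no new missing values appear
toy_filtered = toy_df.drop(index=list(rows_to_drop))
print(sorted(rows_to_drop), len(toy_filtered))
```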

In [None]:
## Inspecting the statistics minus the outlying values
subgroup_city['children'][~eda.find_outliers_z(subgroup_city['children'])].describe()

In [None]:
## Creating a set of indices for filtering
unique_idx_val = set()

for i in list(subgroup_city.select_dtypes('number').columns):
    unique_idx_val.update(list(subgroup_city[i]\
                               [eda.find_outliers_z(subgroup_city[i])].index))

In [None]:
## Calculating the number of values 
len(unique_idx_val)

In [None]:
## Calculating the percentage of rows to drop from the overall dataframe
len(unique_idx_val)/len(subgroup_city)

In [None]:
## Generating a new dataframe after filtering the outliers
sg_c_drop = subgroup_city.drop(unique_idx_val)
sg_c_drop

In [None]:
## Inspecting new statistics
pd.concat([subgroup_city.describe(),sg_c_drop.describe()], keys=('Original', 'New'))

---

**Resort**

---

---

> Now that I successfully created a new, filtered dataframe for the city hotel reservations, I will do the same for the resort hotel reservations.

----

In [None]:
## Creating dataframe visualizing percentage of outliers in subgroup_resort
eda.outlier_percentage(subgroup_resort);

In [None]:
## Reviewing statistics for the feature with the highest percentage of outliers
subgroup_resort['is_repeated_guest'].describe()

In [None]:
## Inspecting the statistics minus the outlying values
subgroup_resort['is_repeated_guest'][~eda.find_outliers_z(subgroup_resort['is_repeated_guest'])].describe()

In [None]:
## Creating a set of indices for filtering
unique_idx_resort = set()

for i in list(subgroup_resort.select_dtypes('number').columns):
    unique_idx_resort.update(list(subgroup_resort[i]\
                               [eda.find_outliers_z(subgroup_resort[i])].index))

In [None]:
## Calculating the number of values 
len(unique_idx_resort)

In [None]:
## Calculating the percentage of rows to drop from the overall dataframe
len(unique_idx_resort)/len(subgroup_resort)

In [None]:
## Generating a new dataframe after filtering the outliers
sg_r_drop = subgroup_resort.drop(unique_idx_resort)
sg_r_drop

In [None]:
## Inspecting new statistics
pd.concat([subgroup_resort.describe(),sg_r_drop.describe()], keys=('Original', 'New'))

# 🐼 **Of Pandas and Pickles** 🥒

---

> Now I am ready to save the cleaned and processed data for modeling in my next notebook.
>
> In order to preserve the datatypes and details of my data, I will use pandas' `to_pickle` method (built on Python's `pickle` module) to serialize the data and save four files: one for each dataframe (two hotels, each filtered and unfiltered).
>
>**First**, I will add unique names to each of my dataframe indices. **Then**, I will pickle the dataframes. **Finally**, I will reopen the pickled files in my next notebook.
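
> To illustrate why pickling preserves the datatypes, here is a minimal round trip on a toy dataframe written to a temporary folder. The file name and column values are hypothetical; only `to_pickle`, `read_pickle`, and the `gzip` compression setting mirror what this notebook uses.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe with dtypes that a plain CSV export would not retain
toy_df = pd.DataFrame({
    'arrival_date': pd.to_datetime(['2015-07-01', '2016-02-29']),
    'customer_type': pd.Categorical(['Transient', 'Group']),
    'is_canceled': [0, 1],
})
toy_df.index.rename('toy_index', inplace=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / 'toy_reservations.pickle'  # hypothetical name
    toy_df.to_pickle(toy_path, compression='gzip')
    # The same compression setting is needed when reading the file back
    restored = pd.read_pickle(toy_path, compression='gzip')

print(restored.dtypes)
print(restored.index.name)
```

> Because the `.pickle` extension does not signal gzip compression, the next notebook would need to pass `compression='gzip'` explicitly when loading these files.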

---

In [None]:
## Adding unique names to dataframes to easily ID
subgroup_city.index.rename('city_old', inplace=True)
subgroup_resort.index.rename('resort_old', inplace=True)

sg_c_drop.index.rename('city_filtered', inplace=True)
sg_r_drop.index.rename('resort_filtered', inplace=True)

In [None]:
## Confirming results
display(subgroup_city, subgroup_resort, sg_c_drop, sg_r_drop)

## Peter Panda Picked a Peck of Pickled DataFrames...

> The cells below are commented out to prevent overwriting files unintentionally.
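
> As an alternative to commenting cells out, a guard like the one sketched below would skip any file that already exists. `pickle_if_absent` is a hypothetical helper, not part of this notebook or of pandas, and the demo writes to a temporary folder rather than `./data/`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def pickle_if_absent(df: pd.DataFrame, path) -> bool:
    """Pickle df to path unless a file already exists there (hypothetical helper)."""
    path = Path(path)
    if path.exists():
        print(f'Skipped (file already exists): {path.name}')
        return False
    df.to_pickle(path, compression='gzip')
    print(f'Successfully pickled: {path.name}')
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / 'toy.pickle'
    first_write = pickle_if_absent(pd.DataFrame({'a': [1]}), toy_path)
    # The second call finds the existing file and leaves it untouched
    second_write = pickle_if_absent(pd.DataFrame({'a': [2]}), toy_path)
    kept_value = pd.read_pickle(toy_path, compression='gzip')['a'].iloc[0]
```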

In [None]:
# ## Creating a dictionary of dataframes and file names
# files = {'reservation_city_unfiltered': subgroup_city,
#          'reservation_city_filtered': sg_c_drop,
#          "reservation_resort_unfiltered": subgroup_resort,
#          'reservation_resort_filtered': sg_r_drop
#         }

In [None]:
# ## Pickling with Pandas
# for k, v in files.items():
#     v.to_pickle(path = f'./data/{k}.pickle', compression = 'gzip')
#     print(f'Successfully pickled: {k}')