# ❌ **Cancel Culture** ❌ - **EDA Notebook**

**Who?**
>* 🏢 **Revenue Management (RM) teams** for hotel groups (corporate, franchise)
>
>
>* 🏨 On-site GMs, Sales, and Ops teams

**Why?**
>* 💰 **Revenue Management:** 
>  * Revenue optimization: Right price, right time, right customer
>    * Dynamic pricing
>    * Distribution channels
>    * Pricing per room type
>
>
>* 🤝 **Sales:**
>  * Group sales (pickup/wash)
>  * BT (performance/company for both GPP and LNR rates)
>
>
>* 🛌 **Rooms Ops:**
>  * Forecasting occupancy, arrivals, departures, stay-overs, same-day booking demand, and probability of guest relocation in the case of oversell.
>  * Determining staff schedules and periods of high demand
>
>
>* 🍰 ☕ **Food and Beverage:**
>  * Ordering food/supplies overall
>  * Scheduling staff
>  * Determining busy times (breakfast, lunch, dinner)
>    * Staffing, specific food/supplies

**What?**
>* 🧾 Dataset comprised of... 
>  * 32 different features
>    * Detailed explanation of features (and sub-categories, when appropriate) available in Readme
>  * Nearly 120,000 reservation records
>  * Source cited in Readme

❌ **How?**
>* Which models/methods? 
>* Data prep and feature engineering

---

> **Goal:** To prepare the data for time series modeling and forecasting in the next notebook.
>
>
> **Purpose:** To explore, clean, and organize the data.
>
>
> **Process:**
>
>    * Inspecting data integrity and statistics
>    * Splitting data by hotel type ("City" vs. "Resort")
>    * Filling any missing values
>    * 
>    * Saving processed data for the modeling notebook
>
>
> **Modeling Notebook:**
>
>    * Performing train/test split
>    * 
>    * Training the model
>    * 
>    * Evaluating performance metrics
>    * Providing final recommendations

---

# ✅ **To-Do List**

---

**Copy:**
- [ ] Imports
- [ ] Personal module
- [ ] Data
- [ ] Starter code from P4P

**Links:**
- [ ] 

---

# 📦 **Import Packages**

In [1]:
## Data Handling
import pandas as pd
import numpy as np
from scipy import stats

## Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Modeling - SKLearn
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn import set_config
set_config(display='diagram')


## Custom-made Functions
from bmc_functions import eda
from bmc_functions import classification as clf

## Settings
## Note: matplotlib >= 3.6 renamed this style to 'seaborn-v0_8-talk'
plt.style.use('seaborn-talk')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 100)
%matplotlib inline

In [2]:
%load_ext autoreload
%autoreload 2

# 📥 **Read Data**

In [3]:
## Reading data
source = './data/hotel_bookings.csv'
data = pd.read_csv(source)
data

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.00,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.00,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,City Hotel,0,23,2017,August,35,30,2,5,2,0.00,0,BB,BEL,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,394.00,,0,Transient,96.14,0,0,Check-Out,2017-09-06
119386,City Hotel,0,102,2017,August,35,31,2,5,3,0.00,0,BB,FRA,Online TA,TA/TO,0,0,0,E,E,0,No Deposit,9.00,,0,Transient,225.43,0,2,Check-Out,2017-09-07
119387,City Hotel,0,34,2017,August,35,31,2,5,2,0.00,0,BB,DEU,Online TA,TA/TO,0,0,0,D,D,0,No Deposit,9.00,,0,Transient,157.71,0,4,Check-Out,2017-09-07
119388,City Hotel,0,109,2017,August,35,31,2,5,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,89.00,,0,Transient,104.40,0,0,Check-Out,2017-09-07


# 🔎 **EDA** 🔍

In [4]:
## Inspecting percentage of city vs. resort hotels
data['hotel'].value_counts(1)

City Hotel     0.66
Resort Hotel   0.34
Name: hotel, dtype: float64

## Splitting "City" and "Resort" 

In [5]:
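## Creating subgroup for city hotels
## (.copy() makes an independent dataframe, so later edits do not
## trigger SettingWithCopyWarning)
subgroup_city = data[data['hotel'] == 'City Hotel'].copy()
subgroup_city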

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
40060,City Hotel,0,6,2015,July,27,1,0,2,1,0.00,0,HB,PRT,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,6.00,,0,Transient,0.00,0,0,Check-Out,2015-07-03
40061,City Hotel,1,88,2015,July,27,1,0,4,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,76.50,0,1,Canceled,2015-07-01
40062,City Hotel,1,65,2015,July,27,1,0,4,1,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,68.00,0,1,Canceled,2015-04-30
40063,City Hotel,1,92,2015,July,27,1,2,4,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,76.50,0,2,Canceled,2015-06-23
40064,City Hotel,1,100,2015,July,27,2,0,2,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,76.50,0,1,Canceled,2015-04-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,City Hotel,0,23,2017,August,35,30,2,5,2,0.00,0,BB,BEL,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,394.00,,0,Transient,96.14,0,0,Check-Out,2017-09-06
119386,City Hotel,0,102,2017,August,35,31,2,5,3,0.00,0,BB,FRA,Online TA,TA/TO,0,0,0,E,E,0,No Deposit,9.00,,0,Transient,225.43,0,2,Check-Out,2017-09-07
119387,City Hotel,0,34,2017,August,35,31,2,5,2,0.00,0,BB,DEU,Online TA,TA/TO,0,0,0,D,D,0,No Deposit,9.00,,0,Transient,157.71,0,4,Check-Out,2017-09-07
119388,City Hotel,0,109,2017,August,35,31,2,5,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,89.00,,0,Transient,104.40,0,0,Check-Out,2017-09-07


In [6]:
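## Creating subgroup for resort hotels
## (.copy() makes an independent dataframe, so later edits do not
## trigger SettingWithCopyWarning)
subgroup_resort = data[data['hotel'] == 'Resort Hotel'].copy()
subgroup_resort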

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.00,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.00,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40055,Resort Hotel,0,212,2017,August,35,31,2,8,2,1.00,0,BB,GBR,Offline TA/TO,TA/TO,0,0,0,A,A,1,No Deposit,143.00,,0,Transient,89.75,0,0,Check-Out,2017-09-10
40056,Resort Hotel,0,169,2017,August,35,30,2,9,2,0.00,0,BB,IRL,Direct,Direct,0,0,0,E,E,0,No Deposit,250.00,,0,Transient-Party,202.27,0,1,Check-Out,2017-09-10
40057,Resort Hotel,0,204,2017,August,35,29,4,10,2,0.00,0,BB,IRL,Direct,Direct,0,0,0,E,E,0,No Deposit,250.00,,0,Transient,153.57,0,3,Check-Out,2017-09-12
40058,Resort Hotel,0,211,2017,August,35,31,4,10,2,0.00,0,HB,GBR,Offline TA/TO,TA/TO,0,0,0,D,D,0,No Deposit,40.00,,0,Contract,112.80,0,1,Check-Out,2017-09-14


# Testing Hierarchical Indexing

---

> Instead of splitting the data into two different dataframes, I may be able to create a new index for the same dataframe by splitting the "`hotel`" feature and using the two values as the first level of the row index, then the normal index values as the second level.
>
>
> This would add a layer of complexity to the data processing steps, but would reduce memory consumption and the number of dataframes.

---

In [35]:
## Working on a copy so the original dataframe is left untouched
data_mi = data.copy()
data_mi

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.00,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.00,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,City Hotel,0,23,2017,August,35,30,2,5,2,0.00,0,BB,BEL,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,394.00,,0,Transient,96.14,0,0,Check-Out,2017-09-06
119386,City Hotel,0,102,2017,August,35,31,2,5,3,0.00,0,BB,FRA,Online TA,TA/TO,0,0,0,E,E,0,No Deposit,9.00,,0,Transient,225.43,0,2,Check-Out,2017-09-07
119387,City Hotel,0,34,2017,August,35,31,2,5,2,0.00,0,BB,DEU,Online TA,TA/TO,0,0,0,D,D,0,No Deposit,9.00,,0,Transient,157.71,0,4,Check-Out,2017-09-07
119388,City Hotel,0,109,2017,August,35,31,2,5,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,89.00,,0,Transient,104.40,0,0,Check-Out,2017-09-07


In [25]:
## Saving original index values
org_idx = data.index.values
org_idx

array([     0,      1,      2, ..., 119387, 119388, 119389], dtype=int64)

In [26]:
## Saving "hotel type" values
hotel_types = data.loc[:,'hotel'].unique()
hotel_types

array(['Resort Hotel', 'City Hotel'], dtype=object)

In [31]:
## Creating new multi-index from hotel types and original index values
multi_idx = pd.MultiIndex.from_product([hotel_types, org_idx])
multi_idx

MultiIndex([('Resort Hotel',      0),
            ('Resort Hotel',      1),
            ('Resort Hotel',      2),
            ('Resort Hotel',      3),
            ('Resort Hotel',      4),
            ('Resort Hotel',      5),
            ('Resort Hotel',      6),
            ('Resort Hotel',      7),
            ('Resort Hotel',      8),
            ('Resort Hotel',      9),
            ...
            (  'City Hotel', 119380),
            (  'City Hotel', 119381),
            (  'City Hotel', 119382),
            (  'City Hotel', 119383),
            (  'City Hotel', 119384),
            (  'City Hotel', 119385),
            (  'City Hotel', 119386),
            (  'City Hotel', 119387),
            (  'City Hotel', 119388),
            (  'City Hotel', 119389)],
           length=238780)

In [33]:
## Confirming index length doubled as expected
len(multi_idx) == 2*len(org_idx)

True

In [40]:
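## Reindexing with the new multi-index
## Note: every row comes back as NaN because none of the (hotel, id) tuples
## exist in the original single-level integer index
data_mi.reindex(labels = multi_idx, axis='index')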

Unnamed: 0,Unnamed: 1,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
Resort Hotel,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Resort Hotel,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Resort Hotel,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Resort Hotel,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Resort Hotel,4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
City Hotel,119385,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
City Hotel,119386,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
City Hotel,119387,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
City Hotel,119388,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
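
The all-NaN result above follows from how `reindex` works: it looks up each new label in the existing index, and the original index holds plain integers, so no `(hotel, id)` tuple is found. `from_product` also builds every hotel/row combination instead of pairing each row with its own hotel, which is why the length doubled. A minimal sketch of one alternative, assuming a small stand-in frame (the `toy` data and the `row_id` level name are illustrative, not part of the real dataset): move `hotel` into the index with `set_index(..., append=True)` and swap the levels.

```python
import pandas as pd

# Toy stand-in for the bookings data (values are illustrative only)
toy = pd.DataFrame({
    'hotel': ['Resort Hotel', 'Resort Hotel', 'City Hotel', 'City Hotel'],
    'lead_time': [342, 737, 6, 88],
    'is_canceled': [0, 0, 0, 1],
})

# Pair each row with its own hotel type: append "hotel" to the existing
# integer index, then swap the levels so that hotel is the outer level
toy_mi = (toy.set_index('hotel', append=True)
             .swaplevel(0, 1)
             .sort_index())
toy_mi.index.names = ['hotel', 'row_id']

# The row count is unchanged, unlike the cartesian product above
print(len(toy_mi) == len(toy))

# Selecting one hotel type from the outer level
city_rows = toy_mi.loc['City Hotel']
print(city_rows)
```

With this layout, `toy_mi.loc['City Hotel']` plays the role of `subgroup_city` without creating a second named dataframe. Whether it actually saves memory on the full dataset would need to be measured.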


## Reviewing Statistics

### Report - City

In [11]:
eda.report_df(subgroup_city).sort_values('null_sum', ascending=False)

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
company,75641,0.95,float64,207,3689.0,145.27,119.77,8.0,40.0,91.0,219.0,497.0
agent,8131,0.1,float64,223,71199.0,28.14,56.43,1.0,9.0,9.0,17.0,509.0
country,24,0.0,object,166,,,,,,,,
children,4,0.0,float64,4,79326.0,0.09,0.37,0.0,0.0,0.0,0.0,3.0
adr,0,0.0,float64,5405,79330.0,105.3,43.6,0.0,79.2,99.9,126.0,5400.0
lead_time,0,0.0,int64,453,79330.0,109.74,110.95,0.0,23.0,74.0,163.0,629.0
market_segment,0,0.0,object,8,,,,,,,,
meal,0,0.0,object,4,,,,,,,,
previous_bookings_not_canceled,0,0.0,int64,73,79330.0,0.13,1.69,0.0,0.0,0.0,0.0,72.0
previous_cancellations,0,0.0,int64,10,79330.0,0.08,0.42,0.0,0.0,0.0,0.0,21.0
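
`eda.report_df` comes from the custom `bmc_functions` module, so its source is not shown here. For readers without that module, the sketch below builds a similar table with plain pandas. The function name `summary_report` and the toy data are hypothetical; the column names are copied from the report output above, and the real function may compute them differently.

```python
import pandas as pd

def summary_report(df):
    """Hypothetical stand-in for eda.report_df: nulls, dtypes, uniques, stats."""
    report = pd.DataFrame({
        'null_sum': df.isna().sum(),        # count of missing values
        'null_pct': df.isna().mean(),       # share of missing values (0 to 1)
        'datatypes': df.dtypes.astype(str),
        'num_unique': df.nunique(),         # distinct non-null values
    })
    # describe() covers numeric columns only; other columns get NaN in the join
    return report.join(df.describe().T)

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'company': [None, None, None, 40.0],
    'country': ['PRT', 'PRT', None, 'GBR'],
    'lead_time': [6, 88, 65, 92],
})

report = summary_report(toy).sort_values('null_sum', ascending=False)
print(report)
```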


### Report - City

In [9]:
eda.report_df(subgroup_city).sort_values('null_sum')

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
total_of_special_requests,0,0.0,int64,6,79330.0,0.55,0.78,0.0,0.0,0.0,1.0,5.0
adults,0,0.0,int64,5,79330.0,1.85,0.51,0.0,2.0,2.0,2.0,4.0
stays_in_week_nights,0,0.0,int64,29,79330.0,2.18,1.46,0.0,1.0,2.0,3.0,41.0
arrival_date_day_of_month,0,0.0,int64,31,79330.0,15.79,8.73,1.0,8.0,16.0,23.0,31.0
required_car_parking_spaces,0,0.0,int64,4,79330.0,0.02,0.15,0.0,0.0,0.0,0.0,3.0
arrival_date_week_number,0,0.0,int64,53,79330.0,27.18,13.4,1.0,17.0,27.0,38.0,53.0
arrival_date_year,0,0.0,int64,3,79330.0,2016.17,0.7,2015.0,2016.0,2016.0,2017.0,2017.0
previous_cancellations,0,0.0,int64,10,79330.0,0.08,0.42,0.0,0.0,0.0,0.0,21.0
babies,0,0.0,int64,5,79330.0,0.0,0.08,0.0,0.0,0.0,0.0,10.0
booking_changes,0,0.0,int64,21,79330.0,0.19,0.61,0.0,0.0,0.0,0.0,21.0


In [9]:
eda.report_df(subgroup_city).sort_values('datatypes')

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
total_of_special_requests,0,0.0,int64,6,79330.0,0.55,0.78,0.0,0.0,0.0,1.0,5.0
adults,0,0.0,int64,5,79330.0,1.85,0.51,0.0,2.0,2.0,2.0,4.0
stays_in_week_nights,0,0.0,int64,29,79330.0,2.18,1.46,0.0,1.0,2.0,3.0,41.0
arrival_date_day_of_month,0,0.0,int64,31,79330.0,15.79,8.73,1.0,8.0,16.0,23.0,31.0
required_car_parking_spaces,0,0.0,int64,4,79330.0,0.02,0.15,0.0,0.0,0.0,0.0,3.0
arrival_date_week_number,0,0.0,int64,53,79330.0,27.18,13.4,1.0,17.0,27.0,38.0,53.0
arrival_date_year,0,0.0,int64,3,79330.0,2016.17,0.7,2015.0,2016.0,2016.0,2017.0,2017.0
previous_cancellations,0,0.0,int64,10,79330.0,0.08,0.42,0.0,0.0,0.0,0.0,21.0
babies,0,0.0,int64,5,79330.0,0.0,0.08,0.0,0.0,0.0,0.0,10.0
booking_changes,0,0.0,int64,21,79330.0,0.19,0.61,0.0,0.0,0.0,0.0,21.0


---

**Reviewing Reports - Missing Values**

> Based on the post-split results for the city hotels, the "`company`" feature is missing 95% of its values and the "`agent`" feature is missing 10%.

**"`Company`" and "`Agent`" Features**

> **Because "`company`" is missing 95% of its values, I will drop that column.** As noted in the data's documentation, missing values are intentional: they mark features that did not apply to a reservation. **Since the missing "`agent`" values are therefore valid, I will keep that column and fill them with an "N/A" value to represent the lack of an agent.**

**"`Country`" and "`Children`" Features**

> The remaining two features with missing values are "`country`" and "`children`." Because each is missing only a small number of values, I will impute each feature's most frequent value.

---
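
The plan for missing values described above can be sketched as follows. This is a minimal illustration on made-up data, not the notebook's final cleaning code: the `toy` frame is hypothetical, and converting the agent IDs to strings is an assumption made so that the "N/A" label can share a column with the IDs.

```python
import numpy as np
import pandas as pd

# Toy stand-in with the same kinds of gaps (values are illustrative only)
toy = pd.DataFrame({
    'agent': [9.0, np.nan, 240.0, 9.0],
    'country': ['PRT', 'PRT', None, 'GBR'],
    'children': [0.0, 0.0, np.nan, 1.0],
})

cleaned = toy.copy()

# "agent" holds ID numbers, so a missing value means "no agent";
# store the IDs as labels and mark the gaps with "N/A"
cleaned['agent'] = cleaned['agent'].map(
    lambda x: 'N/A' if pd.isna(x) else str(int(x)))

# "country" and "children": fill the few gaps with the most frequent value
for col in ['country', 'children']:
    most_frequent = cleaned[col].mode().iloc[0]
    cleaned[col] = cleaned[col].fillna(most_frequent)

print(cleaned)
```

Inside a modeling pipeline, `SimpleImputer(strategy='most_frequent')` (already imported at the top of this notebook) does the same job for the last step and learns the fill value from the training split only.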

### Report - City - Data Types

In [13]:
eda.report_df(subgroup_city).sort_values('datatypes', ascending=False)

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
distribution_channel,0,0.0,object,5,,,,,,,,
country,24,0.0,object,166,,,,,,,,
reserved_room_type,0,0.0,object,8,,,,,,,,
reservation_status_date,0,0.0,object,864,,,,,,,,
arrival_date_month,0,0.0,object,12,,,,,,,,
reservation_status,0,0.0,object,3,,,,,,,,
meal,0,0.0,object,4,,,,,,,,
assigned_room_type,0,0.0,object,9,,,,,,,,
market_segment,0,0.0,object,8,,,,,,,,
hotel,0,0.0,object,1,,,,,,,,


### Report - Resort - Data Types

In [15]:
eda.report_df(subgroup_resort).sort_values('datatypes', ascending=False)

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
distribution_channel,0,0.0,object,4,,,,,,,,
country,464,0.01,object,125,,,,,,,,
reserved_room_type,0,0.0,object,10,,,,,,,,
reservation_status_date,0,0.0,object,913,,,,,,,,
arrival_date_month,0,0.0,object,12,,,,,,,,
reservation_status,0,0.0,object,3,,,,,,,,
meal,0,0.0,object,5,,,,,,,,
assigned_room_type,0,0.0,object,11,,,,,,,,
market_segment,0,0.0,object,6,,,,,,,,
hotel,0,0.0,object,1,,,,,,,,


## Dropping "Company" and "Hotel" Columns

In [None]:
# Dropping "company" column (95% missing values) and "hotel" column (only 1 value)
subgroup_city = subgroup_city.drop(columns=['company', 'hotel'])
subgroup_city

In [None]:
# Dropping "company" column (95% missing values) and "hotel" column (only 1 value)
subgroup_resort = subgroup_resort.drop(columns=['company', 'hotel'])
subgroup_resort

### City

In [None]:
eda.report_df(subgroup_city).sort_values('null_sum', ascending=False)

In [None]:
## Identifying columns for feature-by-feature EDA
subgroup_city.columns

### Resort

In [None]:
eda.report_df(subgroup_resort).sort_values('null_sum', ascending=False)

In [None]:
subgroup_resort.columns

# **EDA - Features**

## reservation_status

### City

In [None]:
subgroup_city['reservation_status'].value_counts(1, dropna=False)

### Resort

In [None]:
subgroup_resort['reservation_status'].value_counts(1, dropna=False)

### ❌ Binarizing - New Feature

#### City

In [None]:
## Changing no-show values to "canceled"
subgroup_city['reservation_status'] = subgroup_city['reservation_status']\
                                      .replace('No-Show', 'Canceled')

In [None]:
## Confirming the change ("in" on a Series checks the index, so test the values)
(subgroup_city['reservation_status'] == 'No-Show').any()

In [None]:
## Inspecting the updated target classes
subgroup_city['reservation_status'].value_counts(1, dropna=False)

In [None]:
cond = [subgroup_city['reservation_status'] == 'Check-Out',
       subgroup_city['reservation_status'] == 'Canceled']

choice = [0, 1]

subgroup_city['res_status_binary'] = np.select(cond, choice, 0)
subgroup_city['res_status_binary']

In [None]:
subgroup_city['res_status_binary'].value_counts(1)

#### Resort

In [None]:
## Changing no-show values to "canceled"
subgroup_resort['reservation_status'] = subgroup_resort['reservation_status']\
                                        .replace('No-Show', 'Canceled')

In [None]:
## Confirming the change ("in" on a Series checks the index, so test the values)
(subgroup_resort['reservation_status'] == 'No-Show').any()

In [None]:
## Inspecting the updated target classes
subgroup_resort['reservation_status'].value_counts(1, dropna=False)

In [None]:
cond = [subgroup_resort['reservation_status'] == 'Check-Out',
       subgroup_resort['reservation_status'] == 'Canceled']

choice = [0, 1]

subgroup_resort['res_status_binary'] = np.select(cond, choice, 0)
subgroup_resort['res_status_binary']

In [None]:
subgroup_resort['res_status_binary'].value_counts(1)
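
The binarizing steps above (merge "No-Show" into "Canceled", then map to 0/1) can be collapsed into a single comparison, and the result can be checked against the existing `is_canceled` flag. The sketch below uses made-up rows; whether the two columns agree on the real data is something to verify, not something this example shows.

```python
import pandas as pd

# Toy stand-in (illustrative values only)
toy = pd.DataFrame({
    'reservation_status': ['Check-Out', 'Canceled', 'No-Show', 'Check-Out'],
    'is_canceled': [0, 1, 1, 0],
})

# 0 for a completed stay ("Check-Out"), 1 for every other status
toy['res_status_binary'] = (toy['reservation_status'] != 'Check-Out').astype(int)

# Comparing the new feature with the existing flag, row by row
matches = (toy['res_status_binary'] == toy['is_canceled']).all()
print(matches)
```

If the same check returns `True` on the full dataset, `res_status_binary` duplicates `is_canceled` and only one of the two needs to be kept.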

## is_canceled

### City

In [None]:
subgroup_city['is_canceled'].value_counts(1, dropna=False)

In [None]:
## Visualizing results
fig, ax = plt.subplots()
## Sorting by class (0, 1) so the tick labels below line up with the bars
subgroup_city['is_canceled'].value_counts(1, dropna=False).sort_index()\
                            .plot(kind='barh', ax=ax)
ax.set_yticklabels(['Not Canceled', 'Canceled'])
ax.set_xlabel('Proportion')
ax.set_ylabel('Status')
ax.set_title('Reservation Statuses');

### Resort

In [None]:
subgroup_resort['is_canceled'].value_counts(1, dropna=False)

In [None]:
## Visualizing results
fig, ax = plt.subplots()
## Sorting by class (0, 1) so the tick labels below line up with the bars
subgroup_resort['is_canceled'].value_counts(1, dropna=False).sort_index()\
                              .plot(kind='barh', ax=ax)
ax.set_yticklabels(['Not Canceled', 'Canceled'])
ax.set_xlabel('Proportion')
ax.set_ylabel('Status')
ax.set_title('Reservation Statuses');

## lead_time - Fix legend labels!

### City

In [None]:
subgroup_city['lead_time'].describe()

In [None]:
# fig, ax = plt.subplots(figsize=(5,10))
# subgroup_city['lead_time'].plot(kind='box', ax=ax)

# ax.set_title('Lead Time');

In [None]:
## category_orders takes lists of category values, so only the
## categorical column is listed
fig = px.box(subgroup_city, y="lead_time", title='Overview of Lead Time',
             width=600, color='reservation_status',
             labels = {'lead_time': 'Lead Time (Days)'},
             category_orders = {'reservation_status': ['Check-Out', 'Canceled']})

fig.show()

### Resort

In [None]:
subgroup_resort['lead_time'].describe()

In [None]:
# fig, ax = plt.subplots(figsize=(5,10))
# subgroup_resort['lead_time'].plot(kind='box', ax=ax)

# ax.set_title('Lead Time');

In [None]:
## category_orders takes lists of category values, so only the
## categorical column is listed
fig = px.box(subgroup_resort, y="lead_time", title='Overview of Lead Time',
             width=600, color='reservation_status',
             labels = {'lead_time': 'Lead Time (Days)'},
             category_orders = {'reservation_status': ['Check-Out', 'Canceled']})

fig.show()

## Arrival Date as Full Datetime

### City

In [None]:
## Converting from month, day of month, and year to a single datetime column
subgroup_city['arrival_date'] = subgroup_city['arrival_date_month'] +' '+ \
                                subgroup_city['arrival_date_day_of_month']\
                                .astype(str) +', '+ \
                                subgroup_city['arrival_date_year'].astype(str)
## Explicit format (e.g. "July 1, 2015") avoids per-row format inference
subgroup_city['arrival_date'] = pd.to_datetime(subgroup_city['arrival_date'],
                                               format='%B %d, %Y')
subgroup_city['arrival_date']

In [None]:
## category_orders takes lists of category values, so only the
## categorical column is listed
fig = px.box(subgroup_city, y="lead_time", title='Lead Times',
             width=600, color='reservation_status',
             labels = {'lead_time': 'Lead Time (Days)'},
             category_orders = {'reservation_status': ['Check-Out', 'Canceled']})

fig.show()

In [None]:
fig = px.histogram(subgroup_city,'lead_time', marginal = 'box',
                   color='reservation_status',
                   labels={'lead_time': 'Lead Time (Days)'}, 
                   title="Lead Times", nbins=40)
fig.update_layout(bargap=0.2)
fig.show()

In [None]:
fig = px.scatter(subgroup_city, x="reservation_status", y="lead_time",
                 marginal_y="box",marginal_x="histogram",
                 color='reservation_status')
fig.show()

In [None]:
# ## Try again with select features as dimensions
# fig = px.scatter_matrix(subgroup_city,
# #     dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"],
#     color="reservation_status")
# fig.show()

### Resort

In [None]:
## Converting from month, day of month, and year to a single datetime column
subgroup_resort['arrival_date'] = subgroup_resort['arrival_date_month'] +' '+ \
                                subgroup_resort['arrival_date_day_of_month']\
                                .astype(str) +', '+ \
                                subgroup_resort['arrival_date_year'].astype(str)
## Explicit format (e.g. "July 1, 2015") avoids per-row format inference
subgroup_resort['arrival_date'] = pd.to_datetime(subgroup_resort['arrival_date'],
                                                 format='%B %d, %Y')
subgroup_resort['arrival_date']

In [None]:
## category_orders takes lists of category values, so only the
## categorical column is listed
fig = px.box(subgroup_resort, y="lead_time", title='Lead Times',
             width=600, color='reservation_status',
             labels = {'lead_time': 'Lead Time (Days)'},
             category_orders = {'reservation_status': ['Check-Out', 'Canceled']})

fig.show()

In [None]:
fig = px.histogram(subgroup_resort,'lead_time', marginal = 'box',
                   color='reservation_status',
                   labels={'lead_time': 'Lead Time (Days)'}, 
                   title="Lead Times", nbins=40)
fig.update_layout(bargap=0.2)
fig.show()

In [None]:
fig = px.scatter(subgroup_resort, x="reservation_status", y="lead_time",
                 marginal_y="box",marginal_x="histogram",
                 color='reservation_status')
fig.show()

In [None]:
# ## Try again with select features as dimensions
# fig = px.scatter_matrix(subgroup_resort,
# #     dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"],
#     color="reservation_status")
# fig.show()

## stays_in_weekend_nights

In [None]:
subgroup_city['stays_in_weekend_nights'].value_counts(1)

In [None]:
fig = px.histogram(subgroup_city,'stays_in_weekend_nights', marginal = 'box',
                   labels={'stays_in_weekend_nights': 'Number of Weekend Nights'}, 
                   title="Weekend Stays", color='reservation_status', nbins=15)
fig.update_layout(bargap=0.2)
fig.show()

## stays_in_week_nights

In [None]:
subgroup_city['stays_in_week_nights'].value_counts(1)

In [None]:
subgroup_city['stays_in_week_nights'].value_counts(1)[:6]

In [None]:
fig = px.histogram(subgroup_city,'stays_in_week_nights', marginal = 'box',
                   labels={'stays_in_week_nights': 'Number of Week Nights'}, 
                   title="Weekday Stays", color='reservation_status', nbins=40)
fig.update_layout(bargap=0.2)
fig.show()

## adults

In [None]:
subgroup_city['adults'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'adults',
                   labels={'adults': 'Number of Adults'},
                   title="Number of Adults", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## children

In [None]:
subgroup_city['children'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'children', marginal = 'box',
                   labels={'children': 'Number of Children'}, 
                   title="Number of Children", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## babies

In [None]:
subgroup_city['babies'].value_counts(dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'babies', marginal = 'box',
                   labels={'babies': 'Number of Babies'}, 
                   title="Number of Babies", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## meal

In [None]:
subgroup_city['meal'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'meal',labels={'meal': 'Types of Meals'}, 
                   title="Dining with Us?", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## country

In [None]:
subgroup_city['country'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'country',
                   labels={'country': 'Country of Origin'}, 
                   title="Where's Home?", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## market_segment

In [None]:
subgroup_city['market_segment'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'market_segment',
                   labels={'market_segment': 'Market Segment'}, 
                   title="Segmentation", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## distribution_channel

In [None]:
subgroup_city['distribution_channel'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'distribution_channel',
                   labels={'distribution_channel': 'Channel'}, 
                   title="Distribution Channels", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## is_repeated_guest

In [None]:
subgroup_city['is_repeated_guest'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'is_repeated_guest',
                   labels={'is_repeated_guest': 'Repeat Status'}, 
                   title="Welcome Back!", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## previous_cancellations

In [None]:
subgroup_city['previous_cancellations'].describe()

In [None]:
fig = px.histogram(subgroup_city,'previous_cancellations',
                   labels={'previous_cancellations': 'Number of Cancellations'}, 
                   title="Previous Cancellations", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## previous_bookings_not_canceled

In [None]:
subgroup_city['previous_bookings_not_canceled'].describe()

In [None]:
fig = px.histogram(subgroup_city,'previous_bookings_not_canceled', marginal = 'box',
                   labels={'previous_bookings_not_canceled': 'Number of Prior Stays'}, 
                   title="Prior Stays", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## reserved_room_type

In [None]:
subgroup_city['reserved_room_type'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'reserved_room_type',
                   labels={'reserved_room_type': 'Room Type'}, 
                   title="Reserved Room Types", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## assigned_room_type

In [None]:
subgroup_city['assigned_room_type'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'assigned_room_type',
                   labels={'assigned_room_type': 'Assigned Room Type'}, 
                   title="Assigned Room Types", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## booking_changes

In [None]:
subgroup_city['booking_changes'].value_counts(1)

In [None]:
fig = px.histogram(subgroup_city,'booking_changes', marginal = 'box',
                   labels={'booking_changes': 'Number of Changes'}, 
                   title="Booking Changes", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## deposit_type

In [None]:
subgroup_city['deposit_type'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'deposit_type',
                   labels={'deposit_type': 'Type'}, 
                   title="Deposit Types", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## agent

In [None]:
subgroup_city['agent'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'agent', marginal = 'box',
                   labels={'agent': 'Booking Agent ID Number'}, 
                   title="Bookings per Agent", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## days_in_waiting_list

In [None]:
subgroup_city['days_in_waiting_list'].describe()

In [None]:
fig = px.histogram(subgroup_city,'days_in_waiting_list', marginal = 'box',
                   labels={'days_in_waiting_list': 'Number of Days'}, 
                   title="Days on Waiting List", color='reservation_status', nbins=15)
fig.update_layout(bargap=0.2)
fig.show()

## customer_type

In [None]:
subgroup_city['customer_type'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'customer_type',
                   labels={'customer_type': 'Reservation Type'}, 
                   title="Reservation Types", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## adr

In [None]:
subgroup_city['adr'].describe()

In [None]:
fig = px.histogram(subgroup_city,'adr', marginal = 'box',
                   labels={'adr': 'Rate'}, title="Average Daily Rate (ADR)",
                   color='reservation_status', nbins=15)
fig.update_layout(bargap=0.2)
fig.show()

## required_car_parking_spaces

In [None]:
subgroup_city['required_car_parking_spaces'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'required_car_parking_spaces',
                   labels={'required_car_parking_spaces': 'Number of Cars'}, 
                   title="Number of Cars", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## total_of_special_requests

In [None]:
subgroup_city['total_of_special_requests'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'total_of_special_requests',
                   labels={'total_of_special_requests': 'Number of Requests'}, 
                   title="Number of Special Requests", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

## reservation_status_date

In [None]:
subgroup_city['reservation_status_date']

# 📅 **Setting Datetime Index**

In [None]:
city_ts = subgroup_city.set_index('arrival_date')
city_ts

In [None]:
resort_ts = subgroup_resort.set_index('arrival_date')
resort_ts
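
With `arrival_date` as the index, the reservation-level rows can be rolled up into a regular daily series, which is the shape most time series models expect. The sketch below is one possible next step under stated assumptions, not the notebook's chosen method: the toy data, the daily frequency, and the `cancellations`, `bookings`, and `cancel_rate` names are all illustrative.

```python
import pandas as pd

# Toy stand-in for city_ts / resort_ts (illustrative values only)
toy_ts = pd.DataFrame(
    {'is_canceled': [0, 1, 1, 0, 1]},
    index=pd.to_datetime(['2015-07-01', '2015-07-01', '2015-07-02',
                          '2015-07-04', '2015-07-04']),
)
toy_ts.index.name = 'arrival_date'

# Sorting the index so the rows are in chronological order
toy_ts = toy_ts.sort_index()

# One row per calendar day: number of cancellations and number of bookings
daily = toy_ts['is_canceled'].resample('D').agg(['sum', 'count'])
daily.columns = ['cancellations', 'bookings']

# Days with no arrivals have 0 bookings, so their rate is NaN
daily['cancel_rate'] = daily['cancellations'] / daily['bookings']
print(daily)
```

Days without arrivals appear as rows with zero bookings and a missing rate. How to treat those gaps (leave, fill, or drop) is a modeling decision for the next notebook.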