# ❌ **Cancel Culture** ❌ - **EDA Notebook**

**Who?**
>* 🏢 **Revenue Management (RM) teams** for hotel groups (corporate, franchise)
>
>
>* 🏨 On-site GMs, Sales, and Ops teams

**Why?**
>* 💰 **Revenue Management:** 
>  * Revenue optimization: Right price, right time, right customer
>    * Dynamic pricing
>    * Distribution channels
>    * Pricing per room type
>
>
>* 🤝 **Sales:**
>  * Group sales (pickup/wash)
>  * BT (performance/company for both GPP and LNR rates)
>
>
>* 🛌 **Rooms Ops:**
>  * Forecasting occupancy, arrivals, departures, stay-overs, same-day booking demand, and probability of guest relocation in the case of oversell.
>  * Determining staff schedules and periods of high demand
>
>
>* 🍰 ☕ **Food and Beverage:**
>  * Ordering food/supplies overall
>  * Scheduling staff
>  * Determining busy times (breakfast, lunch, dinner)
>    * Staffing, specific food/supplies

**What?**
>* 🧾 Dataset comprised of... 
>  * 32 different features
>    * Detailed explanation of features (and sub-categories, when appropriate) available in Readme
>  * Nearly 120,000 reservation records
>  * Source cited in Readme

❌ **How?**
>* Which models/methods? 
>* Data prep and feature engineering

---

> **Goal:** To prepare data for time series modeling and forecasting in the next notebook.
>
>
> **Purpose:** to explore, clean, and organize.
>
>
> **Process:**
>
>    * Inspecting data integrity and statistics
>    * Splitting data by hotel type ("City" vs. "Resort")
>    * Filling any missing values
>    * Saving processed data for the modeling notebook
>
>
> **Modeling Notebook:**
>
>    * Performing train/test split
>    * Training the model
>    * Evaluating performance metrics
>    * Providing final recommendations

---

# ✅ **To-Do List**

---

**Copy:**
- [ ] Imports
- [ ] Personal module
- [ ] Data
- [ ] Starter code from P4P

**Links:**
- [ ] 

---

# 📦 **Import Packages**

In [53]:
## Data Handling
import pandas as pd
import numpy as np
from scipy import stats

## Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Modeling - SKLearn
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn import set_config
set_config(display='diagram')


## Custom-made Functions
from bmc_functions import eda
from bmc_functions import classification as clf

## Settings
plt.style.use('seaborn-talk')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 100)
%matplotlib inline

In [54]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# 📥 **Read Data**

In [55]:
## Reading data
source = './data/hotel_bookings.csv'
data = pd.read_csv(source)
data

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.00,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.00,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,City Hotel,0,23,2017,August,35,30,2,5,2,0.00,0,BB,BEL,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,394.00,,0,Transient,96.14,0,0,Check-Out,2017-09-06
119386,City Hotel,0,102,2017,August,35,31,2,5,3,0.00,0,BB,FRA,Online TA,TA/TO,0,0,0,E,E,0,No Deposit,9.00,,0,Transient,225.43,0,2,Check-Out,2017-09-07
119387,City Hotel,0,34,2017,August,35,31,2,5,2,0.00,0,BB,DEU,Online TA,TA/TO,0,0,0,D,D,0,No Deposit,9.00,,0,Transient,157.71,0,4,Check-Out,2017-09-07
119388,City Hotel,0,109,2017,August,35,31,2,5,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,89.00,,0,Transient,104.40,0,0,Check-Out,2017-09-07


In [56]:
## Inspecting percentage of city vs. resort hotels
data['hotel'].value_counts(1)

City Hotel     0.66
Resort Hotel   0.34
Name: hotel, dtype: float64

# 🪓 **Splitting "City" and "Resort"**

In [57]:
## Creating subgroup for city hotels
## Dropping without inplace returns a new dataframe (avoids SettingWithCopyWarning)
subgroup_city = data[data['hotel'] == 'City Hotel'].drop(columns='hotel')
subgroup_city

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
40060,0,6,2015,July,27,1,0,2,1,0.00,0,HB,PRT,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,6.00,,0,Transient,0.00,0,0,Check-Out,2015-07-03
40061,1,88,2015,July,27,1,0,4,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,76.50,0,1,Canceled,2015-07-01
40062,1,65,2015,July,27,1,0,4,1,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,68.00,0,1,Canceled,2015-04-30
40063,1,92,2015,July,27,1,2,4,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,76.50,0,2,Canceled,2015-06-23
40064,1,100,2015,July,27,2,0,2,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,76.50,0,1,Canceled,2015-04-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,23,2017,August,35,30,2,5,2,0.00,0,BB,BEL,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,394.00,,0,Transient,96.14,0,0,Check-Out,2017-09-06
119386,0,102,2017,August,35,31,2,5,3,0.00,0,BB,FRA,Online TA,TA/TO,0,0,0,E,E,0,No Deposit,9.00,,0,Transient,225.43,0,2,Check-Out,2017-09-07
119387,0,34,2017,August,35,31,2,5,2,0.00,0,BB,DEU,Online TA,TA/TO,0,0,0,D,D,0,No Deposit,9.00,,0,Transient,157.71,0,4,Check-Out,2017-09-07
119388,0,109,2017,August,35,31,2,5,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,89.00,,0,Transient,104.40,0,0,Check-Out,2017-09-07


In [58]:
## Creating subgroup for resort hotels
## Dropping without inplace returns a new dataframe (avoids SettingWithCopyWarning)
subgroup_resort = data[data['hotel'] == 'Resort Hotel'].drop(columns='hotel')
subgroup_resort

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,0,342,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.00,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.00,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40055,0,212,2017,August,35,31,2,8,2,1.00,0,BB,GBR,Offline TA/TO,TA/TO,0,0,0,A,A,1,No Deposit,143.00,,0,Transient,89.75,0,0,Check-Out,2017-09-10
40056,0,169,2017,August,35,30,2,9,2,0.00,0,BB,IRL,Direct,Direct,0,0,0,E,E,0,No Deposit,250.00,,0,Transient-Party,202.27,0,1,Check-Out,2017-09-10
40057,0,204,2017,August,35,29,4,10,2,0.00,0,BB,IRL,Direct,Direct,0,0,0,E,E,0,No Deposit,250.00,,0,Transient,153.57,0,3,Check-Out,2017-09-12
40058,0,211,2017,August,35,31,4,10,2,0.00,0,HB,GBR,Offline TA/TO,TA/TO,0,0,0,D,D,0,No Deposit,40.00,,0,Contract,112.80,0,1,Check-Out,2017-09-14


### Testing Hierarchical Indexing

---

> Instead of splitting the data into two different dataframes, I may be able to create a new index for the same dataframe by splitting the "`hotel`" feature and using the two values as the first level of the row index, then the normal index values as the second level.
>
>
> This would add a layer of complexity to the data processing steps, but would reduce memory consumption and the number of dataframes.

---
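
A minimal sketch of the hierarchical-index idea described above, on a toy frame with made-up values. The names `sample`, `multi`, and `city` are illustrative only, and this is one possible way to build the index rather than the approach used in the commented-out cells below.

```python
import pandas as pd

# Toy stand-in for the bookings data (values are illustrative only)
sample = pd.DataFrame({
    'hotel': ['Resort Hotel', 'City Hotel', 'City Hotel', 'Resort Hotel'],
    'lead_time': [342, 6, 88, 7],
    'is_canceled': [0, 0, 1, 0],
})

# Level 0: hotel type; level 1: the original row index
multi = (sample.set_index('hotel', append=True)
               .swaplevel(0, 1)
               .sort_index())

# Selecting one hotel type returns rows keyed by the original index
city = multi.loc['City Hotel']
print(city)
```

Sorting the index after building it keeps label-based lookups on the first level efficient.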

In [59]:
# data_mi = data
# data_mi

In [60]:
# ## Creating new multi-index from hotel types and original index values
# data_mi.reset_index(inplace=True)
# multi = data_mi.set_index(['hotel'])
# multi

In [61]:
# ## Testing indexing  - City Hotel
# multi.loc['City Hotel']

In [62]:
# ## Testing indexing  - Resort Hotel
# multi.loc['Resort Hotel']

In [63]:
# eda.report_df(multi.loc['City Hotel']).sort_values('null_sum', ascending=False)

---

**Hierarchical Indexing Results**

> While the multi-indexed results can represent the dimensionality of the data, it is not best for this dataset. I will continue to use the sub-grouped dataframes for my analysis and modeling.
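
For comparison, the sub-grouped approach can be sketched on toy data as follows. The dictionary `subgroups` is a hypothetical convenience, not something used elsewhere in this notebook; the point is that each subgroup is an independent dataframe, so later edits do not touch the source data.

```python
import pandas as pd

# Toy stand-in for the bookings data (values are illustrative only)
sample = pd.DataFrame({
    'hotel': ['Resort Hotel', 'City Hotel', 'City Hotel', 'Resort Hotel'],
    'lead_time': [342, 6, 88, 7],
})

# One independent dataframe per hotel type; drop() returns a new object
subgroups = {
    name: group.drop(columns='hotel')
    for name, group in sample.groupby('hotel')
}

# Editing a subgroup leaves the original frame untouched
subgroups['City Hotel'].loc[:, 'lead_time'] = 0
print(sample['lead_time'].tolist())
```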

---

# 📊 **Reviewing Statistics**

---

**`report_df()`: City**

---

In [64]:
## Sorting report by number of missing values
eda.report_df(subgroup_city).sort_values('null_sum', ascending=False)

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
company,75641,0.95,float64,207,3689.0,145.27,119.77,8.0,40.0,91.0,219.0,497.0
agent,8131,0.1,float64,223,71199.0,28.14,56.43,1.0,9.0,9.0,17.0,509.0
country,24,0.0,object,166,,,,,,,,
children,4,0.0,float64,4,79326.0,0.09,0.37,0.0,0.0,0.0,0.0,3.0
adr,0,0.0,float64,5405,79330.0,105.3,43.6,0.0,79.2,99.9,126.0,5400.0
previous_cancellations,0,0.0,int64,10,79330.0,0.08,0.42,0.0,0.0,0.0,0.0,21.0
market_segment,0,0.0,object,8,,,,,,,,
meal,0,0.0,object,4,,,,,,,,
previous_bookings_not_canceled,0,0.0,int64,73,79330.0,0.13,1.69,0.0,0.0,0.0,0.0,72.0
required_car_parking_spaces,0,0.0,int64,4,79330.0,0.02,0.15,0.0,0.0,0.0,0.0,3.0


---

**`report_df()`: Resort**

---

In [65]:
## Sorting report by number of missing values
eda.report_df(subgroup_resort).sort_values('null_sum', ascending=False)

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
company,36952,0.92,float64,235,3108.0,241.49,125.93,6.0,154.0,223.0,330.0,543.0
agent,8209,0.2,float64,185,31851.0,217.57,88.26,1.0,240.0,240.0,242.0,535.0
country,464,0.01,object,125,,,,,,,,
adr,0,0.0,float64,5880,40060.0,94.95,61.44,-6.38,50.0,75.0,125.0,508.0
previous_cancellations,0,0.0,int64,11,40060.0,0.1,1.34,0.0,0.0,0.0,0.0,26.0
lead_time,0,0.0,int64,412,40060.0,92.68,97.29,0.0,10.0,57.0,155.0,737.0
market_segment,0,0.0,object,6,,,,,,,,
meal,0,0.0,object,5,,,,,,,,
previous_bookings_not_canceled,0,0.0,int64,31,40060.0,0.15,1.0,0.0,0.0,0.0,0.0,30.0
required_car_parking_spaces,0,0.0,int64,5,40060.0,0.14,0.35,0.0,0.0,0.0,0.0,8.0


---
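
`eda.report_df` comes from a personal module whose source is not shown in this notebook. As a rough, pandas-only approximation of its missing-value columns, the sketch below builds a similar table on toy data. The function name `missing_report` is hypothetical, and the real helper may compute its columns differently.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Summarize missing values, dtypes, and cardinality per column."""
    report = pd.DataFrame({
        'null_sum': df.isna().sum(),        # count of missing values
        'null_pct': df.isna().mean(),       # share of missing values (0 to 1)
        'datatypes': df.dtypes.astype(str),
        'num_unique': df.nunique(),         # distinct non-missing values
    })
    return report.sort_values('null_sum', ascending=False)

# Toy data (values are illustrative only)
sample = pd.DataFrame({
    'company': [np.nan, np.nan, np.nan, 40.0],
    'agent': [9.0, np.nan, 9.0, 240.0],
    'adr': [0.0, 76.5, 68.0, 76.5],
})
report = missing_report(sample)
print(report)
```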

**Reviewing Reports - Missing Values**

> Based on the post-split results, both dataframes are missing values for `company`, `agent`, and `country`. Additionally, the `subgroup_city` dataframe is missing four values for `children`.
>
> **Special note:** As noted in the data's documentation (located in "details.md"), any missing values are intentional representations of features that were not applicable to a reservation.

---

**`Company` and `Agent` Features**

> *Missing in `subgroup_city`:*
> * `company`: 95%
> * `agent`: 10%
>
> *Missing in `subgroup_resort`:*
> * `company`: 92%
> * `agent`: 20%

> **Due to the large number of missing values for `company`, I will drop that column from both dataframes.** Since the missing values for `agent` are valid, I will keep the column and fill the missing values with the value "N/A"  to represent the lack of a value. I will fill the values in the next section.

**`Country` and `Children` Features**

> The remaining two features with missing values are `country` and `children`. **As there are a small number of missing values in both dataframes' features, I will keep both features. I will use the** `SimpleImputer` **transformer during my preprocessing pipeline step to impute values and use a** `GridSearchCV` **to determine the best method.**

---
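
A minimal sketch of the imputation plan described above, assuming a numeric feature such as `children` and a simple logistic regression as the downstream model. The toy data, the step names (`imputer`, `scaler`, `model`), and the strategy grid are all assumptions for illustration; the actual pipeline belongs to the modeling notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with a few missing "children" values (illustrative only)
X = pd.DataFrame({
    'children': np.tile([0.0, 1.0, 2.0], 20),
    'lead_time': np.arange(60) * 5.0,
})
X.loc[[3, 17, 42], 'children'] = np.nan
y = (X['lead_time'] > 120).astype(int)

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# Let the grid search pick the imputation strategy
param_grid = {'imputer__strategy': ['mean', 'median', 'most_frequent']}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)

best_strategy = grid.best_params_['imputer__strategy']
print(best_strategy)
```

For a text feature such as `country`, only the `most_frequent` and `constant` strategies apply, so it would need its own imputer inside a `ColumnTransformer`.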

##### Dropping "Company" Column

In [66]:
# Dropping "company" column (95% missing values)
subgroup_city.drop(columns = ['company'], inplace=True)
subgroup_city

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
40060,0,6,2015,July,27,1,0,2,1,0.00,0,HB,PRT,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,6.00,0,Transient,0.00,0,0,Check-Out,2015-07-03
40061,1,88,2015,July,27,1,0,4,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,0,Transient,76.50,0,1,Canceled,2015-07-01
40062,1,65,2015,July,27,1,0,4,1,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,0,Transient,68.00,0,1,Canceled,2015-04-30
40063,1,92,2015,July,27,1,2,4,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,0,Transient,76.50,0,2,Canceled,2015-06-23
40064,1,100,2015,July,27,2,0,2,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,0,Transient,76.50,0,1,Canceled,2015-04-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,23,2017,August,35,30,2,5,2,0.00,0,BB,BEL,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,394.00,0,Transient,96.14,0,0,Check-Out,2017-09-06
119386,0,102,2017,August,35,31,2,5,3,0.00,0,BB,FRA,Online TA,TA/TO,0,0,0,E,E,0,No Deposit,9.00,0,Transient,225.43,0,2,Check-Out,2017-09-07
119387,0,34,2017,August,35,31,2,5,2,0.00,0,BB,DEU,Online TA,TA/TO,0,0,0,D,D,0,No Deposit,9.00,0,Transient,157.71,0,4,Check-Out,2017-09-07
119388,0,109,2017,August,35,31,2,5,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,89.00,0,Transient,104.40,0,0,Check-Out,2017-09-07


In [67]:
# Dropping "company" column (92% missing values)
subgroup_resort.drop(columns = ['company'], inplace=True)
subgroup_resort

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,0,342,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.00,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.00,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40055,0,212,2017,August,35,31,2,8,2,1.00,0,BB,GBR,Offline TA/TO,TA/TO,0,0,0,A,A,1,No Deposit,143.00,0,Transient,89.75,0,0,Check-Out,2017-09-10
40056,0,169,2017,August,35,30,2,9,2,0.00,0,BB,IRL,Direct,Direct,0,0,0,E,E,0,No Deposit,250.00,0,Transient-Party,202.27,0,1,Check-Out,2017-09-10
40057,0,204,2017,August,35,29,4,10,2,0.00,0,BB,IRL,Direct,Direct,0,0,0,E,E,0,No Deposit,250.00,0,Transient,153.57,0,3,Check-Out,2017-09-12
40058,0,211,2017,August,35,31,4,10,2,0.00,0,HB,GBR,Offline TA/TO,TA/TO,0,0,0,D,D,0,No Deposit,40.00,0,Contract,112.80,0,1,Check-Out,2017-09-14


In [68]:
## Confirming 'company' removal from both
'company' not in subgroup_city and 'company' not in subgroup_resort

True

# 🔬 **Inspecting Feature Data Types**

---

**City**

---

In [69]:
## Inspecting datatypes for "subgroup_city"
subgroup_city.dtypes.sort_values()

is_canceled                         int64
previous_bookings_not_canceled      int64
previous_cancellations              int64
is_repeated_guest                   int64
days_in_waiting_list                int64
required_car_parking_spaces         int64
adults                              int64
babies                              int64
stays_in_weekend_nights             int64
arrival_date_day_of_month           int64
arrival_date_week_number            int64
total_of_special_requests           int64
arrival_date_year                   int64
lead_time                           int64
stays_in_week_nights                int64
booking_changes                     int64
children                          float64
adr                               float64
agent                             float64
deposit_type                       object
customer_type                      object
distribution_channel               object
reserved_room_type                 object
reservation_status                

---

**Resort**

---

In [70]:
## Inspecting datatypes for "subgroup_resort"
subgroup_resort.dtypes.sort_values()

is_canceled                         int64
previous_bookings_not_canceled      int64
previous_cancellations              int64
is_repeated_guest                   int64
days_in_waiting_list                int64
required_car_parking_spaces         int64
adults                              int64
babies                              int64
stays_in_weekend_nights             int64
arrival_date_day_of_month           int64
arrival_date_week_number            int64
total_of_special_requests           int64
arrival_date_year                   int64
lead_time                           int64
stays_in_week_nights                int64
booking_changes                     int64
children                          float64
adr                               float64
agent                             float64
deposit_type                       object
customer_type                      object
distribution_channel               object
reserved_room_type                 object
reservation_status                

In [71]:
## Confirming all datatypes match between dataframes
subgroup_city.dtypes.sort_values() == subgroup_resort.dtypes.sort_values()

is_canceled                       True
previous_bookings_not_canceled    True
previous_cancellations            True
is_repeated_guest                 True
days_in_waiting_list              True
required_car_parking_spaces       True
adults                            True
babies                            True
stays_in_weekend_nights           True
arrival_date_day_of_month         True
arrival_date_week_number          True
total_of_special_requests         True
arrival_date_year                 True
lead_time                         True
stays_in_week_nights              True
booking_changes                   True
children                          True
adr                               True
agent                             True
deposit_type                      True
customer_type                     True
distribution_channel              True
reserved_room_type                True
reservation_status                True
market_segment                    True
country                  

---

**Review - Datatypes**

> After reviewing the datatypes, I noticed **one feature needs to be changed to the string datatype: `agent`**. This feature represents unique identifiers for booking agents and needs to be treated as categorical data.
>
> As both dataframes' datatypes are the same, I do not need to make any other adjustments specific to either dataframe.
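
A compact way to confirm that two dataframes share the same datatypes is to align the dtype listings by column name and compare them in one step. The sketch below uses toy frames named `city` and `resort` (hypothetical stand-ins with illustrative values) and casts the dtypes to text so the comparison is a plain string comparison.

```python
import pandas as pd

# Toy frames with the same columns in a different order
city = pd.DataFrame({'lead_time': [6, 88], 'adr': [0.0, 76.5], 'meal': ['HB', 'BB']})
resort = pd.DataFrame({'adr': [75.0, 98.0], 'meal': ['BB', 'BB'], 'lead_time': [342, 7]})

# Align by column name so ordering does not affect the comparison
city_types = city.dtypes.astype(str).sort_index()
resort_types = resort.dtypes.astype(str).sort_index()

same_dtypes = city_types.equals(resort_types)

# Columns whose datatypes differ (empty when everything matches)
mismatched = city_types[city_types != resort_types]
print(same_dtypes, list(mismatched.index))
```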

---

## Converting to Strings

In [75]:
## Converting subgroup_city "country" to string
subgroup_city.loc[:,'country'] = subgroup_city.loc[:,'country'].astype(str)
subgroup_city.loc[:,'country']

40060     PRT
40061     PRT
40062     PRT
40063     PRT
40064     PRT
         ... 
119385    BEL
119386    FRA
119387    DEU
119388    GBR
119389    DEU
Name: country, Length: 79330, dtype: object

In [76]:
## Converting subgroup_resort "country" to string
subgroup_resort.loc[:,'country'] = subgroup_resort.loc[:,'country']\
                                                                .astype(str)
subgroup_resort.loc[:,'country']

0        PRT
1        PRT
2        GBR
3        GBR
4        GBR
        ... 
40055    GBR
40056    IRL
40057    IRL
40058    GBR
40059    DEU
Name: country, Length: 40060, dtype: object
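
One caveat with `.astype(str)` is that missing values become the literal text `'nan'`, so they no longer register as missing. For `agent`, a sketch of a conversion that keeps the missing entries visible as "N/A" might look like the following. The toy values and the name `agent_labels` are illustrative, and this is not necessarily how the column is handled later.

```python
import numpy as np
import pandas as pd

# Toy agent IDs stored as floats, with one missing value
agent = pd.Series([9.0, 240.0, np.nan, 9.0], name='agent')

# Naive cast: NaN turns into the string 'nan' and IDs keep a trailing '.0'
as_text_naive = agent.astype(str)

# Explicit mapping: whole-number labels, with "N/A" for missing agents
agent_labels = agent.map(
    lambda value: 'N/A' if pd.isna(value) else str(int(value))
)
print(agent_labels.tolist())
```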

# 🔎 **EDA - Features**

---

> Now that I reviewed my missing values and confirmed my datatypes, I will inspect the details of each of my features.

---

## reservation_status

---

**City**

---

In [77]:
subgroup_city['reservation_status'].value_counts(1, dropna=False)

Check-Out   0.58
Canceled    0.41
No-Show     0.01
Name: reservation_status, dtype: float64

In [78]:
subgroup_city['reservation_status'].describe()

count         79330
unique            3
top       Check-Out
freq          46228
Name: reservation_status, dtype: object

---

**Resort**

---

In [80]:
subgroup_resort['reservation_status'].value_counts(1, dropna=False)

Check-Out   0.72
Canceled    0.27
No-Show     0.01
Name: reservation_status, dtype: float64

In [79]:
subgroup_resort['reservation_status'].describe()

count         40060
unique            3
top       Check-Out
freq          28938
Name: reservation_status, dtype: object

### Review - `Reservation_Status`

---

> `reservation_status` will be my target feature for my classification modeling. To prepare it for modeling, I will need to replace the `No-Show` status with `Canceled` values.
>
> For the purposes of my analysis, **I will treat `No-Show` reservations as `Canceled` reservations: at roughly 1% of the records in each dataframe, they are too few to use effectively as a third class.**

---

### Converting `No-Show` to `Canceled`

In [83]:
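## Changing no-show values to "canceled"
## Assigning the result back is more reliable than inplace=True on a .loc selection
subgroup_city['reservation_status'] = subgroup_city['reservation_status']\
                                        .replace('No-Show', 'Canceled')
subgroup_resort['reservation_status'] = subgroup_resort['reservation_status']\
                                        .replace('No-Show', 'Canceled')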

In [84]:
## Confirming the change
## Using .values because "in" on a Series checks the index, not the data
'No-Show' not in subgroup_city['reservation_status'].values and \
                'No-Show' not in subgroup_resort['reservation_status'].values

True
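
A note on this kind of confirmation: the `in` operator applied to a pandas Series tests the index labels rather than the stored values, so a value-based check is needed to verify a replacement. A small sketch on toy data (the variable names are illustrative):

```python
import pandas as pd

# Toy target column with a default integer index
status = pd.Series(['Check-Out', 'No-Show', 'Canceled'],
                   name='reservation_status')

# Tests the index labels (0, 1, 2), so this is False even though
# 'No-Show' is present in the data
label_check = 'No-Show' in status

# Tests the stored values
value_check = status.eq('No-Show').any()

# After the replacement, the value-based check comes back False
cleaned = status.replace('No-Show', 'Canceled')
still_present = cleaned.eq('No-Show').any()
print(label_check, value_check, still_present)
```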

In [86]:
## Inspecting the updated target classes
subgroup_city['reservation_status'].value_counts(1, dropna=False)

Check-Out   0.58
Canceled    0.42
Name: reservation_status, dtype: float64

In [85]:
subgroup_resort['reservation_status'].value_counts(1, dropna=False)

Check-Out   0.72
Canceled    0.28
Name: reservation_status, dtype: float64

### 📌 Review - `Reservation_Status`

---

> I successfully converted `No-Show` to `Canceled` in both dataframes, leaving two target classes: `Check-Out` and `Canceled`.

---

## is_canceled

---

**City**

---

In [None]:
subgroup_city['is_canceled'].value_counts(1, dropna=False)

In [None]:
## Visualizing results
fig, ax = plt.subplots()
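## sort_index() puts 0 (not canceled) first so the tick labels line up
subgroup_city['is_canceled'].value_counts(1, dropna=False).sort_index()\
                                            .plot(kind='barh', ax=ax)
ax.set_yticklabels(['Not Canceled', 'Canceled'])
ax.set_xlabel('Proportion of Reservations')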
ax.set_ylabel('Status')
ax.set_title('Reservation Statuses');

---

**Resort**

---

In [None]:
subgroup_resort['is_canceled'].value_counts(1, dropna=False)

In [None]:
## Visualizing results
fig, ax = plt.subplots()
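## sort_index() puts 0 (not canceled) first so the tick labels line up
subgroup_resort['is_canceled'].value_counts(1, dropna=False).sort_index()\
                                            .plot(kind='barh', ax=ax)
ax.set_yticklabels(['Not Canceled', 'Canceled'])
ax.set_xlabel('Proportion of Reservations')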
ax.set_ylabel('Status')
ax.set_title('Reservation Statuses');

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## lead_time - Fix legend labels!

---

**City**

---

In [None]:
subgroup_city['lead_time'].describe()

In [None]:
# fig, ax = plt.subplots(figsize=(5,10))
# subgroup_city['lead_time'].plot(kind='box', ax=ax)

# ax.set_title('Lead Time');

In [None]:
## "labels" also sets the legend title; "category_orders" expects lists of values
fig = px.box(subgroup_city, y="lead_time", title='Overview of Lead Time',
             width=600, color='reservation_status',
             labels = {'lead_time': 'Lead Time (Days)',
                       'reservation_status': 'Reservation Status'},
             category_orders = {'reservation_status': ['Check-Out', 'Canceled']})

fig.show()
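
The quartiles drawn by a box plot can also be read off numerically. A small sketch on toy data (values are made up, and `summary` and `iqr` are illustrative names) that groups lead times by reservation status:

```python
import pandas as pd

# Toy lead times by reservation status (values are illustrative only)
sample = pd.DataFrame({
    'reservation_status': ['Check-Out', 'Canceled', 'Check-Out',
                           'Canceled', 'Check-Out', 'Canceled'],
    'lead_time': [6, 88, 14, 100, 23, 65],
})

# Count, mean, std, min, quartiles, and max per status
summary = sample.groupby('reservation_status')['lead_time'].describe()

# Interquartile range per status (the height of each box)
iqr = summary['75%'] - summary['25%']
print(summary[['25%', '50%', '75%']])
```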

---

**Resort**

---

In [None]:
subgroup_resort['lead_time'].describe()

In [None]:
# fig, ax = plt.subplots(figsize=(5,10))
# subgroup_resort['lead_time'].plot(kind='box', ax=ax)

# ax.set_title('Lead Time');

In [None]:
## "labels" also sets the legend title; "category_orders" expects lists of values
fig = px.box(subgroup_resort, y="lead_time", title='Overview of Lead Time',
             width=600, color='reservation_status',
             labels = {'lead_time': 'Lead Time (Days)',
                       'reservation_status': 'Reservation Status'},
             category_orders = {'reservation_status': ['Check-Out', 'Canceled']})

fig.show()

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## Arrival Date as Full Datetime

---

**City**

---

In [None]:
## Converting from month, day of month, and year to a single datetime column
subgroup_city['arrival_date'] = subgroup_city['arrival_date_month'] +' '+ \
                                subgroup_city['arrival_date_day_of_month']\
                                .astype(str) +', '+ \
                                subgroup_city['arrival_date_year'].astype(str)
## Explicit format (e.g., "July 1, 2015") avoids relying on format inference
subgroup_city['arrival_date'] = pd.to_datetime(subgroup_city['arrival_date'],
                                               format='%B %d, %Y')
subgroup_city['arrival_date']
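
A self-contained sketch of this conversion on a toy frame (values are made up). It assumes full English month names, which is what the `%B` directive in the format string expects under an English locale.

```python
import pandas as pd

# Toy arrival-date components (illustrative values)
sample = pd.DataFrame({
    'arrival_date_year': [2015, 2017],
    'arrival_date_month': ['July', 'August'],
    'arrival_date_day_of_month': [1, 31],
})

# Build text such as "July 1, 2015", then parse it with an explicit format
date_text = (sample['arrival_date_month'] + ' '
             + sample['arrival_date_day_of_month'].astype(str) + ', '
             + sample['arrival_date_year'].astype(str))
sample['arrival_date'] = pd.to_datetime(date_text, format='%B %d, %Y')
print(sample['arrival_date'])
```

With a true datetime column, the records can later be sorted or resampled by date for the time series work.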

In [None]:
## "labels" also sets the legend title; "category_orders" expects lists of values
fig = px.box(subgroup_city, y="lead_time", title='Lead Times',
             width=600, color='reservation_status',
             labels = {'lead_time': 'Lead Time (Days)',
                       'reservation_status': 'Reservation Status'},
             category_orders = {'reservation_status': ['Check-Out', 'Canceled']})

fig.show()

In [None]:
fig = px.histogram(subgroup_city,'lead_time', marginal = 'box',
                   color='reservation_status',
                   labels={'lead_time': 'Lead Time (Days)'}, 
                   title="Lead Times", nbins=40)
fig.update_layout(bargap=0.2)
fig.show()

In [None]:
fig = px.scatter(subgroup_city, x="reservation_status", y="lead_time",
                 marginal_y="box",marginal_x="histogram",
                 color='reservation_status')
fig.show()

In [None]:
# ## Try again with select features as dimensions
# fig = px.scatter_matrix(subgroup_city,
# #     dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"],
#     color="reservation_status")
# fig.show()

---

**Resort**

---

In [None]:
## Converting from month, day of month, and year to a single datetime column
subgroup_resort['arrival_date'] = subgroup_resort['arrival_date_month'] +' '+ \
                                subgroup_resort['arrival_date_day_of_month']\
                                .astype(str) +', '+ \
                                subgroup_resort['arrival_date_year'].astype(str)
## Explicit format (e.g., "July 1, 2015") avoids relying on format inference
subgroup_resort['arrival_date'] = pd.to_datetime(subgroup_resort['arrival_date'],
                                                 format='%B %d, %Y')
subgroup_resort['arrival_date']

In [None]:
## "labels" also sets the legend title; "category_orders" expects lists of values
fig = px.box(subgroup_resort, y="lead_time", title='Lead Times',
             width=600, color='reservation_status',
             labels = {'lead_time': 'Lead Time (Days)',
                       'reservation_status': 'Reservation Status'},
             category_orders = {'reservation_status': ['Check-Out', 'Canceled']})

fig.show()

In [None]:
fig = px.histogram(subgroup_resort,'lead_time', marginal = 'box',
                   color='reservation_status',
                   labels={'lead_time': 'Lead Time (Days)'}, 
                   title="Lead Times", nbins=40)
fig.update_layout(bargap=0.2)
fig.show()

In [None]:
fig = px.scatter(subgroup_resort, x="reservation_status", y="lead_time",
                 marginal_y="box",marginal_x="histogram",
                 color='reservation_status')
fig.show()

In [None]:
# ## Try again with select features as dimensions
# fig = px.scatter_matrix(subgroup_resort,
# #     dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"],
#     color="reservation_status")
# fig.show()

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## stays_in_weekend_nights

---

**City**

---

In [None]:
subgroup_city['stays_in_weekend_nights'].value_counts(1)

In [None]:
fig = px.histogram(subgroup_city,'stays_in_weekend_nights', marginal = 'box',
                   labels={'stays_in_weekend_nights': 'Number of Weekend Nights'}, 
                   title="Weekend Stays", color='reservation_status', nbins=15)
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## stays_in_week_nights

---

**City**

---

In [None]:
subgroup_city['stays_in_week_nights'].value_counts(1)

In [None]:
subgroup_city['stays_in_week_nights'].value_counts(1)[:6]

In [None]:
fig = px.histogram(subgroup_city,'stays_in_week_nights', marginal = 'box',
                   labels={'stays_in_week_nights': 'Number of Week Nights'}, 
                   title="Weekday Stays", color='reservation_status', nbins=40)
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## Adults

---

**City**

---

In [None]:
subgroup_city['adults'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'adults',
                   labels={'adults': 'Number of Adults'},
                   title="Number of Adults", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## Children

---

**City**

---

In [None]:
subgroup_city['children'].value_counts(1, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'children', marginal = 'box',
                   labels={'children': 'Number of Children'}, 
                   title="Number of Children", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## babies

---

**City**

---

In [None]:
subgroup_city['babies'].value_counts(dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'babies', marginal = 'box',
                   labels={'babies': 'Number of Babies'}, 
                   title="Number of Babies", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()
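
The three guest-count columns (`adults`, `children`, `babies`) can also be checked together as a data-integrity step. The sketch below flags bookings whose total guest count is zero. The toy values are made up for illustration, and treating a missing count as 0 is an assumption made only for this check.

```python
import numpy as np
import pandas as pd

# Toy stand-in for subgroup_city (illustrative values only)
toy = pd.DataFrame({
    "adults":   [2, 0, 1, 2],
    "children": [0.0, 0.0, np.nan, 1.0],
    "babies":   [0, 0, 0, 1],
})

# Treat a missing count as 0 for the purpose of this check only
total_guests = toy[["adults", "children", "babies"]].fillna(0).sum(axis=1)

# Rows where nobody is recorded on the booking
no_guests = toy[total_guests == 0]
print(no_guests)
```

Whether such rows should be dropped or kept is a separate decision; the sketch only surfaces them.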

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## meal

---

**City**

---

In [None]:
subgroup_city['meal'].value_counts(normalize=True, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'meal',labels={'meal': 'Types of Meals'}, 
                   title="Dining with Us?", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## country

---

**City**

---

In [None]:
subgroup_city['country'].value_counts(normalize=True, dropna=False)

In [None]:
# country is categorical, so no numeric bin count is set
fig = px.histogram(subgroup_city,'country',
                   labels={'country': 'Country of Origin'}, 
                   title="'Where's Home?'", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()
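
A column with many distinct values, such as `country`, can be hard to read as a histogram. One option, sketched below under stated assumptions, is to keep the most frequent values and lump the rest into a single bucket before plotting. The helper name `lump_rare`, the `"Other"` label, and the toy country codes are all hypothetical.

```python
import pandas as pd

def lump_rare(series, top_n=10, other_label="Other"):
    """Keep the `top_n` most frequent values and replace the rest with `other_label`."""
    top_values = series.value_counts().nlargest(top_n).index
    # Anything outside the top values (including missing values) becomes `other_label`
    return series.where(series.isin(top_values), other_label)

# Toy stand-in for subgroup_city['country'] (illustrative values only)
toy_country = pd.Series(["PRT", "PRT", "PRT", "GBR", "GBR", "FRA", "ESP"])

lumped = lump_rare(toy_country, top_n=2)
print(lumped.value_counts())
```

The lumped series could then be assigned to a new column and passed to `px.histogram` in place of the raw one.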

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## market_segment

---

**City**

---

In [None]:
subgroup_city['market_segment'].value_counts(normalize=True, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'market_segment',
                   labels={'market_segment': 'Market Segment'}, 
                   title="Segmentation", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()
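
The stacked histogram shows counts, which makes it hard to compare cancellation shares between a large and a small segment. A row-normalized crosstab is one way to get those shares directly. This is a minimal sketch on toy data; the helper name `status_share_by` is hypothetical.

```python
import pandas as pd

def status_share_by(df, column, status_col="reservation_status"):
    """Share of each reservation status within each category of `column`."""
    return pd.crosstab(df[column], df[status_col], normalize="index")

# Toy stand-in for subgroup_city (illustrative values only)
toy = pd.DataFrame({
    "market_segment": ["Online TA", "Online TA", "Online TA", "Online TA",
                       "Direct", "Direct"],
    "reservation_status": ["Canceled", "Canceled", "Check-Out", "Check-Out",
                           "Check-Out", "Check-Out"],
})

shares = status_share_by(toy, "market_segment")
print(shares)
```

Each row sums to 1, so the `Canceled` column can be read as a per-segment cancellation rate. The same helper should work for the other categorical features in this notebook.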

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## distribution_channel

---

**City**

---

In [None]:
subgroup_city['distribution_channel'].value_counts(normalize=True, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'distribution_channel',
                   labels={'distribution_channel': 'Channel'}, 
                   title="Distribution Channels", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## is_repeated_guest

---

**City**

---

In [None]:
subgroup_city['is_repeated_guest'].value_counts(normalize=True, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'is_repeated_guest',
                   labels={'is_repeated_guest': 'Repeat Status'}, 
                   title="Welcome Back!", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## previous_cancellations

---

**City**

---

In [None]:
subgroup_city['previous_cancellations'].describe()

In [None]:
fig = px.histogram(subgroup_city,'previous_cancellations',
                   labels={'previous_cancellations': 'Number of Cancellations'}, 
                   title="Previous Cancellations", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## previous_bookings_not_canceled

---

**City**

---

In [None]:
subgroup_city['previous_bookings_not_canceled'].describe()

In [None]:
fig = px.histogram(subgroup_city,'previous_bookings_not_canceled', marginal = 'box',
                   labels={'previous_bookings_not_canceled': 'Number of Prior Stays'}, 
                   title="Prior Stays", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## reserved_room_type

---

**City**

---

In [None]:
subgroup_city['reserved_room_type'].value_counts(normalize=True, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'reserved_room_type',
                   labels={'reserved_room_type': 'Room Type'}, 
                   title="Reserved Room Types", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## assigned_room_type

---

**City**

---

In [None]:
subgroup_city['assigned_room_type'].value_counts(normalize=True, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'assigned_room_type',
                   labels={'assigned_room_type': 'Assigned Room Type'}, 
                   title="Assigned Room Types", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()
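
Since `reserved_room_type` and `assigned_room_type` describe the same booking, it may be more informative to compare them than to plot each alone. The sketch below, on toy data with made-up room codes, computes the share of bookings where the two differ and a crosstab of reserved versus assigned types.

```python
import pandas as pd

# Toy stand-in for subgroup_city (illustrative values only)
toy = pd.DataFrame({
    "reserved_room_type": ["A", "A", "D", "A", "E"],
    "assigned_room_type": ["A", "D", "D", "A", "F"],
})

# True where the assigned room type differs from the reserved one
room_changed = toy["reserved_room_type"] != toy["assigned_room_type"]
share_changed = room_changed.mean()

# Rows: reserved type, columns: assigned type, cells: number of bookings
room_matrix = pd.crosstab(toy["reserved_room_type"], toy["assigned_room_type"])
print(share_changed)
print(room_matrix)
```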

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## booking_changes

---

**City**

---

In [None]:
subgroup_city['booking_changes'].value_counts(normalize=True)

In [None]:
fig = px.histogram(subgroup_city,'booking_changes', marginal = 'box',
                   labels={'booking_changes': 'Number of Changes'}, 
                   title="Booking Changes", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## deposit_type

---

**City**

---

In [None]:
subgroup_city['deposit_type'].value_counts(normalize=True, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'deposit_type',
                   labels={'deposit_type': 'Type'}, 
                   title="Deposit Types", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## agent

---

**City**

---

In [None]:
subgroup_city['agent'].value_counts(normalize=True, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'agent', marginal = 'box',
                   labels={'agent': 'Booking Agent ID Number'}, 
                   title="Bookings per Agent", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## days_in_waiting_list

---

**City**

---

In [None]:
subgroup_city['days_in_waiting_list'].describe()

In [None]:
fig = px.histogram(subgroup_city,'days_in_waiting_list', marginal = 'box',
                   labels={'days_in_waiting_list': 'Number of Days'}, 
                   title="Days on Waiting List", color='reservation_status', nbins=15)
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## customer_type

---

**City**

---

In [None]:
subgroup_city['customer_type'].value_counts(normalize=True, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'customer_type',
                   labels={'customer_type': 'Reservation Type'}, 
                   title="Reservation Types", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## adr

---

**City**

---

In [None]:
subgroup_city['adr'].describe()

In [None]:
fig = px.histogram(subgroup_city,'adr', marginal = 'box',
                   labels={'adr': 'Rate'}, title="Average Daily Rate (ADR)",
                   color='reservation_status', nbins=15)
fig.update_layout(bargap=0.2)
fig.show()
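
The box marginal above already hints at extreme rates. To list them explicitly, one common rule of thumb is the 1.5 × IQR fence. The sketch below applies it to toy values; the helper name `iqr_bounds`, the multiplier, and the sample rates are assumptions, and other outlier rules may suit this data better.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the first and third quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy stand-in for subgroup_city['adr'] (illustrative values only)
toy_adr = pd.Series([80.0, 90.0, 100.0, 110.0, 120.0, 5400.0])

lower, upper = iqr_bounds(toy_adr)
outliers = toy_adr[(toy_adr < lower) | (toy_adr > upper)]
print(lower, upper)
print(outliers)
```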

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## required_car_parking_spaces

---

**City**

---

In [None]:
subgroup_city['required_car_parking_spaces'].value_counts(normalize=True, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'required_car_parking_spaces',
                   labels={'required_car_parking_spaces': 'Number of Cars'}, 
                   title="Number of Cars", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## total_of_special_requests

---

**City**

---

In [None]:
subgroup_city['total_of_special_requests'].value_counts(normalize=True, dropna=False)

In [None]:
fig = px.histogram(subgroup_city,'total_of_special_requests',
                   labels={'total_of_special_requests': 'Number of Requests'}, 
                   title="Number of Special Requests", color='reservation_status')
fig.update_layout(bargap=0.2)
fig.show()

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

## reservation_status_date

---

**City**

---

In [None]:
subgroup_city['reservation_status_date']

---

**Resort**

---

### Review - `PLACEHOLDER`

---

> TEXT 
>
> TEXT

---

# 📅 **Setting Datetime Index**

In [None]:
city_ts = subgroup_city.set_index('arrival_date')
city_ts

In [None]:
resort_ts = subgroup_resort.set_index('arrival_date')
resort_ts
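
For time series work the index generally needs to be a sorted `DatetimeIndex`, not just a column of date strings. The sketch below, on toy data, shows one way to check this and to derive a daily arrivals series. It assumes `arrival_date` may still be stored as strings, which might not be the case in this notebook; the dates are made up.

```python
import pandas as pd

# Toy stand-in for subgroup_city (illustrative values only)
toy = pd.DataFrame({
    "arrival_date": ["2016-07-02", "2016-07-01", "2016-07-01", "2016-07-04"],
    "reservation_status": ["Check-Out", "Canceled", "Check-Out", "Check-Out"],
})

# Parse to datetimes first; set_index alone would keep the strings as-is
toy["arrival_date"] = pd.to_datetime(toy["arrival_date"])
toy_ts = toy.set_index("arrival_date").sort_index()

# Number of bookings per arrival day; days without bookings appear as 0
daily_arrivals = toy_ts.resample("D").size()
print(daily_arrivals)
```

Resampling fills in calendar days that have no rows, which keeps the series evenly spaced for later modeling.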

# Binarizing Target - New Feature

❌ **MOVE THIS TO POST-EDA PROCESSING** ❌

In [None]:
def binarize_status(df):
    """Encode reservation_status as 1 (Check-Out), 0 (Canceled), or 2 (any other status)."""
    # Build the conditions from the frame being encoded so the masks match its length
    cond = [df['reservation_status'] == 'Check-Out',
            df['reservation_status'] == 'Canceled']
    choice = [1, 0]
    return np.select(cond, choice, default=2)

In [None]:
subgroup_city['res_status_binary'] = binarize_status(subgroup_city)
subgroup_city['res_status_binary']

In [None]:
subgroup_resort['res_status_binary'] = binarize_status(subgroup_resort)
subgroup_resort['res_status_binary']

In [None]:
subgroup_city['res_status_binary'].value_counts(normalize=True)

In [None]:
subgroup_resort['res_status_binary'].value_counts(normalize=True)