# **Hotel Cancel Culture** - **EDA Notebook**

---

**Author:** Ben McCarty

**Extension of Capstone Project** - Expanding Hotel Reservation dataset analysis and modeling

**Contact:** bmccarty505@gmail.com

---

## Revisiting the Reservations

---

Originally, I used this notebook to perform EDA with the intention of using the dataset only for classifying whether a reservation would cancel.

Now, as part of my efforts to revisit and revamp this overall repository and workflow, I am adapting it for broader uses, such as regression modeling and time series forecasting.

The end goal is to have a comprehensive overview of the data and to be flexible enough to handle different workflows.

**Warning: Work-in-Progress**

As this is a revamp of the original workbook, some of the code and comments may be outdated. I intend to update and clarify all steps in time, but there may be some parts that are out of place while I clean things up.

---

**Of Demand and Cancellations**

*This was the initial intro to the notebook with a focus on classification modeling.*

>**Every aspect of hospitality depends on accurately anticipating business demand**: how many rooms to clean; how many rooms are available to sell; what would be the best rate; and how to bring it all together to make every guest satisfied. 
>
> Proper forecasting is critical to every department and staff member, and to generate our forecasts, **hotel managers need to know how many guests will cancel prior to arrival**. Using data from two European hotels, I developed a model to predict whether a given reservation would cancel based on 30 different reservation details.

**In order to develop and train my models, I need to prepare the data in advance.**

>In this notebook, I explore the original dataset and its features; condense several features into smaller subsets; engineer new features; and remove unwanted features from the data.
>
**Once the data is prepared, I will reload the data in a new notebook to create and train my models to determine my predictions of who will stay and who will cancel.**

# **Import Packages**

In [1]:
## Used to re-import custom functions during development
%load_ext autoreload
%autoreload 2

In [38]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import os
import sys

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils, eda

## Data Handling
import numpy as np
import pandas as pd

## Visualizations
import matplotlib.pyplot as plt
from missingno import matrix
import plotly.express as px
import seaborn as sns

In [3]:
## Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 50)
%matplotlib inline

# Read Source Data (with UUIDs)

In [4]:
# # Path to the DuckDB database file
# db_path = '../data/hotel_reservations.duckdb'

# ## Select subset of data for review
# q = 'SELECT * FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     data = conn.execute(q).df()
    
# data.head()

In [5]:
backup_data_path = '../data/data_condensed_with_uuid.parquet'

data = pd.read_parquet(backup_data_path)

data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID
0,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,8ca998d6-fae7-4ee4-a706-3765721aaff5
1,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,e535835e-b19a-4e32-9e9f-6d70a0182d4b
2,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,9429383d-0efd-4c37-bb9b-0aaa63d5aade
3,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,dd6424ee-6838-4007-ad85-de9ff96be14b
4,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03,1,50ff56ee-6a72-40dc-8ff1-4246b831c779


## Add Pre-Engineered Date Features

In [6]:
filepath = '../data/engineered_data_dates.parquet'

df_dates = pd.read_parquet(filepath)
df_dates.head()

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDate,DepartureDate,BookingDate,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalDate_WeekNumber,ArrivalDate_DayOfWeek,DepartureDate_WeekNumber,DepartureDate_DayOfWeek,BookingDate_WeekNumber,BookingDate_DayOfWeek
0,8ca998d6-fae7-4ee4-a706-3765721aaff5,2015-07-01,2015-07-01,2015-07-01,2014-07-24,45,21,45,21,22,44,27,3,27,3,30,4
1,e535835e-b19a-4e32-9e9f-6d70a0182d4b,2015-07-01,2015-07-01,2015-07-01,2013-06-24,45,21,45,21,52,14,27,3,27,3,26,1
2,9429383d-0efd-4c37-bb9b-0aaa63d5aade,2015-07-02,2015-07-01,2015-07-02,2015-06-24,45,21,44,22,52,14,27,3,27,4,26,3
3,dd6424ee-6838-4007-ad85-de9ff96be14b,2015-07-02,2015-07-01,2015-07-02,2015-06-18,45,21,44,22,58,8,27,3,27,4,25,4
4,50ff56ee-6a72-40dc-8ff1-4246b831c779,2015-07-03,2015-07-01,2015-07-03,2015-06-17,45,21,43,23,59,7,27,3,27,5,25,3


In [7]:
df_dates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 17 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   UUID                             119390 non-null  object        
 1   ReservationStatusDate            119390 non-null  datetime64[ns]
 2   ArrivalDate                      119390 non-null  datetime64[ns]
 3   DepartureDate                    119390 non-null  datetime64[ns]
 4   BookingDate                      119390 non-null  datetime64[ns]
 5   ArrivalDate_DaysBeforeHoliday    119390 non-null  int64         
 6   ArrivalDate_DaysAfterHoliday     119390 non-null  int64         
 7   DepartureDate_DaysBeforeHoliday  119390 non-null  int64         
 8   DepartureDate_DaysAfterHoliday   119390 non-null  int64         
 9   BookingDate_DaysBeforeHoliday    119390 non-null  int64         
 10  BookingDate_DaysAfterHoliday     119390 non-

## Condense to Single DataFrame

In [8]:
data = data.merge(right = df_dates, how = 'left', on = 'UUID')
data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate_x,HotelNumber,UUID,ReservationStatusDate_y,ArrivalDate,DepartureDate,BookingDate,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalDate_WeekNumber,ArrivalDate_DayOfWeek,DepartureDate_WeekNumber,DepartureDate_DayOfWeek,BookingDate_WeekNumber,BookingDate_DayOfWeek
0,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,8ca998d6-fae7-4ee4-a706-3765721aaff5,2015-07-01,2015-07-01,2015-07-01,2014-07-24,45,21,45,21,22,44,27,3,27,3,30,4
1,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,e535835e-b19a-4e32-9e9f-6d70a0182d4b,2015-07-01,2015-07-01,2015-07-01,2013-06-24,45,21,45,21,52,14,27,3,27,3,26,1
2,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,9429383d-0efd-4c37-bb9b-0aaa63d5aade,2015-07-02,2015-07-01,2015-07-02,2015-06-24,45,21,44,22,52,14,27,3,27,4,26,3
3,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,dd6424ee-6838-4007-ad85-de9ff96be14b,2015-07-02,2015-07-01,2015-07-02,2015-06-18,45,21,44,22,58,8,27,3,27,4,25,4
4,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03,1,50ff56ee-6a72-40dc-8ff1-4246b831c779,2015-07-03,2015-07-01,2015-07-03,2015-06-17,45,21,43,23,59,7,27,3,27,5,25,3
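
The merged output above carries `ReservationStatusDate_x` and `ReservationStatusDate_y` because both frames contain a `ReservationStatusDate` column. Below is a minimal sketch, on toy frames rather than the real data, of one way to avoid the suffixed duplicates and to confirm that `UUID` is unique on both sides. The frame names `left` and `right` and their values are illustrative assumptions; `validate` is an existing parameter of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for `data` and `df_dates` (values are illustrative only)
left = pd.DataFrame({
    'UUID': ['a', 'b', 'c'],
    'IsCanceled': [0, 1, 0],
    'ReservationStatusDate': pd.to_datetime(['2015-07-01', '2015-07-02', '2015-07-03']),
})
right = pd.DataFrame({
    'UUID': ['a', 'b', 'c'],
    'ReservationStatusDate': pd.to_datetime(['2015-07-01', '2015-07-02', '2015-07-03']),
    'ArrivalDate_DayOfWeek': [3, 3, 3],
})

# Columns present on both sides (other than the key) would get _x/_y suffixes
overlap = [c for c in right.columns if c in left.columns and c != 'UUID']

# Drop the overlap from the right frame and check the key is one-to-one
merged = left.merge(right.drop(columns=overlap), how='left', on='UUID',
                    validate='one_to_one')  # raises MergeError if a UUID repeats

print(merged.columns.tolist())
```

This would also make the later drop step shorter, since only one copy of the status date would remain.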


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 49 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   IsCanceled                       119390 non-null  int64         
 1   LeadTime                         119390 non-null  int64         
 2   ArrivalDateYear                  119390 non-null  int64         
 3   ArrivalDateMonth                 119390 non-null  object        
 4   ArrivalDateWeekNumber            119390 non-null  int64         
 5   ArrivalDateDayOfMonth            119390 non-null  int64         
 6   StaysInWeekendNights             119390 non-null  int64         
 7   StaysInWeekNights                119390 non-null  int64         
 8   Adults                           119390 non-null  int64         
 9   Children                         119386 non-null  float64       
 10  Babies                           119390 non-

## Dropping Old Features

*Some features were used to engineer new features - particularly arrival details.*

In [10]:
drop_feats = list(data.columns)[1:8]
drop_feats.append(list(data.columns)[30])
drop_feats.extend(list(data.columns)[33:37])
drop_feats

['LeadTime',
 'ArrivalDateYear',
 'ArrivalDateMonth',
 'ArrivalDateWeekNumber',
 'ArrivalDateDayOfMonth',
 'StaysInWeekendNights',
 'StaysInWeekNights',
 'ReservationStatusDate_x',
 'ReservationStatusDate_y',
 'ArrivalDate',
 'DepartureDate',
 'BookingDate']
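
Selecting columns by position works here, but it will silently pick different columns if the column order ever changes. As a hedged alternative, this sketch drops columns by name on a toy frame; the frame `toy` and its values are assumptions for illustration, and only a subset of the names listed above is used.

```python
import pandas as pd

# Toy frame with a few of the notebook's column names (values are illustrative)
toy = pd.DataFrame({
    'IsCanceled': [0, 1],
    'LeadTime': [342, 7],
    'ArrivalDateYear': [2015, 2015],
    'Adults': [2, 1],
    'ReservationStatusDate_x': ['2015-07-01', '2015-07-02'],
    'ReservationStatusDate_y': ['2015-07-01', '2015-07-02'],
})

# Columns to drop, listed by name rather than by position
drop_by_name = ['LeadTime', 'ArrivalDateYear',
                'ReservationStatusDate_x', 'ReservationStatusDate_y']

# Fail loudly if a name is missing instead of dropping the wrong column
missing = set(drop_by_name) - set(toy.columns)
assert not missing, f'Columns not found: {missing}'

toy = toy.drop(columns=drop_by_name)
print(toy.columns.tolist())
```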

In [11]:
data = data.drop(columns = drop_feats)
data.head()

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,HotelNumber,UUID,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalDate_WeekNumber,ArrivalDate_DayOfWeek,DepartureDate_WeekNumber,DepartureDate_DayOfWeek,BookingDate_WeekNumber,BookingDate_DayOfWeek
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,1,8ca998d6-fae7-4ee4-a706-3765721aaff5,45,21,45,21,22,44,27,3,27,3,30,4
1,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,1,e535835e-b19a-4e32-9e9f-6d70a0182d4b,45,21,45,21,52,14,27,3,27,3,26,1
2,0,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,1,9429383d-0efd-4c37-bb9b-0aaa63d5aade,45,21,44,22,52,14,27,3,27,4,26,3
3,0,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,1,dd6424ee-6838-4007-ad85-de9ff96be14b,45,21,44,22,58,8,27,3,27,4,25,4
4,0,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,1,50ff56ee-6a72-40dc-8ff1-4246b831c779,45,21,43,23,59,7,27,3,27,5,25,3


# Identifying Target Feature - Classification

---

> For my classification analysis, I will use the `IsCanceled` feature as my target feature. This feature indicates whether a reservation was canceled (0 = check-out, 1 = canceled).
>
> There is another feature, `ReservationStatus`, that is closely tied to the `IsCanceled` feature. `ReservationStatus` records the final status of the reservation (for example, "Check-Out"), while the related `ReservationStatusDate` feature records the date on which that status was last set, which would match the cancellation date for a canceled reservation since changes usually do not occur after that point.
>
> Due to this close relationship, I will review both features and determine which to keep.
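
One quick way to review how the two features relate is a cross-tabulation. This is a minimal sketch on toy rows, not the real data; the status labels other than "Check-Out" are assumptions about what the column contains.

```python
import pandas as pd

# Toy rows (illustrative only); 'Canceled' and 'No-Show' labels are assumed
toy = pd.DataFrame({
    'IsCanceled': [0, 0, 1, 1, 1, 0],
    'ReservationStatus': ['Check-Out', 'Check-Out', 'Canceled',
                          'No-Show', 'Canceled', 'Check-Out'],
})

# Rows: target value; columns: reservation status; cells: row counts
overlap = pd.crosstab(toy['IsCanceled'], toy['ReservationStatus'])
print(overlap)
```

If every status maps to exactly one target value, as in this toy example, the status feature carries the same information as the target.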

---

In [62]:
## Inspecting target feature
data['IsCanceled'].describe()

count   119,390.00
mean          0.37
std           0.48
min           0.00
25%           0.00
50%           0.00
75%           1.00
max           1.00
Name: IsCanceled, dtype: float64

In [63]:
data['IsCanceled'].value_counts(normalize=True, ascending = False, dropna = False)

IsCanceled
0   0.63
1   0.37
Name: proportion, dtype: float64

In [59]:
# ## Viewing breakdown of target feature
# target_breakdown = temp_data['IsCanceled'].value_counts(normalize=True, dropna=False)
# target_breakdown

In [28]:
# ## Generating visualization of target feature for presentation

# fig, ax = plt.subplots(figsize=(8,5))

# ax = target_breakdown.plot(kind='bar',ax=ax)

# ax.set(xlabel = 'Status', ylabel= 'Percent')

# plt.xticks([0, 1], ['Checked-Out', 'Cancelled'], rotation=0)
# plt.yticks([.1, .2, .3, .4, .5, .6], [10, 20, 30, 40, 50, 60])
# plt.suptitle('Breakdown of Reservation Statuses')

# ax.set_facecolor('0.9')
# fig.set_facecolor('0.975')
# plt.savefig('../img/cxl_stat.png',transparent=False, bbox_inches='tight',
#            dpi=300)
# plt.show()
# plt.close()

In [58]:
# px.histogram(temp_data, x='IsCanceled')

---

**Initial Review** 

> Based on this initial review, there is a moderate class imbalance favoring non-canceled reservations, with a 63%/37% split between not-canceled and canceled reservations, respectively.
>
> **I will keep this imbalance in mind when I perform my modeling in my next notebook.** Imbalanced classes may have a significant negative impact on a model's performance; I will need to address the imbalance prior to modeling.
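
To put a number on the imbalance, here is a small sketch that computes per-class weights on a toy target with the same 63%/37% split. Class weighting is shown only as an illustration of the size of the imbalance and as one common alternative to resampling; it is not the approach chosen for the modeling notebook. The formula `n_samples / (n_classes * class_count)` is the usual "balanced" heuristic.

```python
import pandas as pd

# Toy target reproducing the 63%/37% split reported above
y = pd.Series([0] * 63 + [1] * 37, name='IsCanceled')

counts = y.value_counts()

# "Balanced" heuristic: n_samples / (n_classes * class_count)
weights = len(y) / (counts.size * counts)

print(weights.round(2).to_dict())
```

The minority (canceled) class receives the larger weight, which is the same effect resampling aims for by adding minority examples.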

---

In [None]:
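## Cleaning up memory (the breakdown cell above is commented out, so guard the delete)
if 'target_breakdown' in globals():
    del target_breakdown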

# Reviewing Statistics

## Summary Stats via Describe Method

In [71]:
## Numeric Stats
data.describe(include = 'number')

Unnamed: 0,IsCanceled,Adults,Children,Babies,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,BookingChanges,DaysInWaitingList,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,HotelNumber,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalDate_WeekNumber,ArrivalDate_DayOfWeek,DepartureDate_WeekNumber,DepartureDate_DayOfWeek,BookingDate_WeekNumber,BookingDate_DayOfWeek
count,119390.0,119390.0,119386.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0
mean,0.37,1.86,0.1,0.01,0.03,0.09,0.14,0.22,2.32,101.83,0.06,0.57,1.66,30.91,30.95,31.66,30.24,36.7,31.97,26.76,4.0,26.81,4.21,24.77,3.6
std,0.48,0.58,0.4,0.1,0.18,0.84,1.5,0.65,17.59,50.54,0.25,0.79,0.47,26.21,26.54,26.51,26.24,28.27,26.41,13.57,1.95,13.59,2.06,16.3,1.84
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-6.38,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,69.29,0.0,0.0,1.0,9.0,9.0,10.0,8.0,12.0,10.0,16.0,2.0,16.0,2.0,9.0,2.0
50%,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94.58,0.0,0.0,2.0,24.0,24.0,24.0,23.0,31.0,25.0,27.0,4.0,27.0,4.0,25.0,4.0
75%,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,126.0,0.0,1.0,2.0,46.0,48.0,47.0,47.0,55.0,49.0,38.0,6.0,38.0,6.0,40.0,5.0
max,1.0,55.0,10.0,10.0,1.0,26.0,72.0,21.0,391.0,5400.0,8.0,5.0,2.0,115.0,114.0,115.0,114.0,115.0,114.0,53.0,7.0,53.0,7.0,53.0,7.0


---

- Outliers present in many features
- Outlier detection/removal may be required in preprocessing pipeline for certain model types
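
As a rough illustration of outlier detection, this sketch flags values outside 1.5 times the interquartile range on a toy `ADR` series. The 1.5 x IQR rule is a common convention and an assumption here, not a decision already made for the pipeline; the values are loosely based on the summary table above.

```python
import pandas as pd

# Toy ADR values, loosely based on the quartiles and maximum shown above
adr = pd.Series([69.0, 75.0, 94.0, 98.0, 126.0, 5400.0], name='ADR')

# Interquartile range and the conventional 1.5 x IQR fences
q1, q3 = adr.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of values outside the fences
is_outlier = (adr < lower) | (adr > upper)
print(adr[is_outlier].tolist())
```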

---

In [65]:
## Non-Numeric Stats
data.describe(exclude = 'number')

Unnamed: 0,count,unique,top,freq
Meal,119390,5,BB,92310
Country,118902,177,PRT,48590
MarketSegment,119390,8,Online TA,56477
DistributionChannel,119390,5,TA/TO,97870
ReservedRoomType,119390,10,A,85994
AssignedRoomType,119390,12,A,74053
DepositType,119390,3,No Deposit,104641
Agent,119390,334,9,31961
Company,119390,353,,112593
CustomerType,119390,4,Transient,89613


## Missing Values

In [68]:
nan_sum = data.isna().sum()
nan_sum[nan_sum>0]

Children      4
Country     488
dtype: int64

In [70]:
nan_avg = data.isna().mean()
nan_avg[nan_avg>0]

Children   0.00
Country    0.00
dtype: float64

---

- Two features missing values
- Share of missing values is less than 1% for each feature
- No action taken; will address in model pipeline
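
Both shares display as 0.00 above because the float format set in the settings cell rounds to two decimals. Working from the counts shown (4 and 488 missing out of 119,390 rows), the shares are roughly 0.003% and 0.41%. A small sketch that expresses the shares as percentages with more precision:

```python
import pandas as pd

# Counts taken from the outputs above
n_rows = 119_390
nan_counts = pd.Series({'Children': 4, 'Country': 488})

# Share of missing values per feature, as a percentage
nan_pct = (nan_counts / n_rows * 100).round(4)
print(nan_pct.to_dict())
```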

---

# **EDA - Features**

---

**In-Depth Feature EDA**

> Now that I have reviewed my missing values and confirmed my datatypes, I will inspect each of my features in detail. For each feature, I will review the summary statistics, the value counts, and the datatype, and I will note my observations and planned actions at the end of each feature analysis. I will start with my target feature.

**Toggling Visualizations**

> There is a simple boolean variable, `show_visualization`, set at the start of this EDA process, controlling whether to show the visualizations of each feature. It is currently set to `False`, which hides the visualizations.
>
> **If you enable the visualizations and the notebook runs slowly, set the variable back to `False` and restart the kernel.**
>
> Additionally, you may set the `show_visualization` argument to `True` for any individual feature to show the visualization for that feature only. Please be aware that showing visualizations for multiple features may result in very poor notebook performance!

---
***Styling:***

> DataFrame styling code used in `explore_feature()` function adapted from this [source](https://stackoverflow.com/questions/59769161/python-color-pandas-dataframe-based-on-multiindex#:~:text=2-,You,-can%20use%20Styler).

---

## Toggle Visualizations

In [None]:
## Boolean setting to control whether to show the EDA visualizations
show_visualization = False

## `Is_Canceled`

---

**Starting with the Target**

> I will start my EDA process by reviewing the target feature, `is_canceled`.

---

In [None]:
## Reviewing details for 'IsCanceled'
eda.explore_feature(data, column_name = 'IsCanceled', target_feature = 'IsCanceled', normalize=False,
                    plot_label ='Cancellation Status',
                    plot_title= 'Reservation Status',
                    show_visualization = True);

---

**Feature Review**

> My target feature, `is_canceled`, is a binary indicator of whether a guest stayed and checked out or the reservation was canceled. Reservations count as cancellations if they either cancel or are marked as a "no-show" reservation.

**Actions**

> The imbalance between these two classes may create problems during the modeling process. Many models are sensitive to class imbalance; they tend to predict the majority class more often, which hurts performance on the minority (canceled) class. **In my modeling notebook, I will apply the SMOTE resampling technique to address this imbalance.**

---

## `Reservation_Status`

In [None]:
## Reviewing details for reservation_status
eda.explore_feature(data,'ReservationStatus', target_feature = 'IsCanceled', 
                    plot_label ='Status', plot_title= 'Reservation Status',
                    show_visualization = show_visualization);

---

**Feature Review**

> `Reservation_status` closely mirrors the values for my target feature, with some slight differences due to "no-show" values.

**Actions**

> As this feature is nearly the same as my target, **I will drop this feature.** Additionally, this feature is not known prior to a guest's arrival, lending further support to my decision to remove the feature.

---

## `Lead_Time`

In [None]:
## Reviewing details for 'lead_time'
eda.explore_feature(data,'LeadTime',target_feature = 'IsCanceled', bins = 5, marginal = 'box',
                    plot_label ='Number of Days',
                    plot_title= 'Lead Time (Days)',
                    show_visualization = show_visualization);

---

**Feature Review**

> `Lead_Time` indicates how far in advance reservations are booked, in days. This information is particularly useful in hospitality for Revenue Management and Operations teams.
>
>  * Revenue Management teams set rates for the different room types and manage room type availability. These teams need to know **when to expect bookings** and **which days to monitor rates and availability more closely** to make any necessary changes to optimize revenue.
>
>  * Hotel Operations teams use this information to **forecast how many reservations will book in a short-term booking window** (in my experience, I usually focused on 0-3 days prior to arrival).
>
>  * **This forecast is critical to determine staffing and supplies in particular** - when building our schedules, we consider the current number of booked reservations and the forecasted bookings to determine how many staff members to schedule and whether we have enough supplies.
>  * *Being the only staff member at the Front Desk during a rush of arrivals due to a snow storm is NOT fun!*

**Actions**

> I noticed half of reservation lead times fall between 9 and 214 days - quite a large range! Additionally, there are some noticeable outliers in the data which may affect my future models. For now, I will leave the data as-is; depending on the models I use, I may include regularization parameters during the modeling process.

---

## `Arrival_Date_Year`

In [None]:
## Reviewing details for 'arrival_date_year'
eda.explore_feature(data,'ArrivalDateYear',target_feature = 'IsCanceled',
                    marginal = 'box',plot_label ='Year',plot_title= 'Arrival Date (Year)',
                    show_visualization = show_visualization);

---

**Feature Review**

> While this feature may be useful for future forecasting models, it is not relevant for my classification modeling as these years will not be repeated for future reservations.

**Actions**

> This feature is not useful for my future modeling, so I will drop this feature at the end of my EDA process.

---


## `Stays_in_Weekend_Nights`

In [None]:
## Reviewing details for 'stays_in_weekend_nights'
eda.explore_feature(data,'StaysInWeekendNights',target_feature = 'IsCanceled', bins = 5,
                    marginal = 'box',
                    plot_label ='Number of Nights',
                    plot_title= 'Stays in Weekend Nights',
                    show_visualization = show_visualization);

---

**Feature Review**

> Similar to `lead_time`, this feature shows most values are within a small window of time, with a few rare outliers (less than 2%).

**Actions**

> I will leave this feature as-is for now; future regularization would address this issue if needed.

---

## `Stays_in_Week_Nights`

In [None]:
## Reviewing details for 'stays_in_week_nights'
eda.explore_feature(data,'StaysInWeekNights',target_feature = 'IsCanceled', bins = 5,
                    marginal = 'box',
                    plot_label ='Number of Nights',
                    plot_title= 'Stays in Week Nights',
                    show_visualization = show_visualization);

---

**Feature Review**

> Similar to `stays_in_weekend_nights`, this feature shows most values are within a small window of time, with a few rare outliers (less than 2%).

**Actions**

> I will leave this feature as-is for now; future regularization would address this issue if needed.

---

## `Adults`

In [None]:
## Reviewing details 'adults'
eda.explore_feature(data,'Adults',target_feature = 'IsCanceled', bins = 3,
                    plot_label ='Number of Adults',
                    plot_title= 'Adults',
                    show_visualization = show_visualization);

---

**Feature Review**

> `Adults` refers to the number of adults listed on the reservation. In my experience, this feature often does not match real life (I would see "1 Adult" as the default, but I may have three or four people sharing a room, for example). Still, I feel the values are reasonable enough to include for analysis.

**Actions**

> No action required.

---

## `Children`

In [None]:
## Reviewing details for 'children'
eda.explore_feature(data,'Children',target_feature = 'IsCanceled', bins = 5,
                    plot_label ='Number of Children',
                    plot_title= 'Children',
                    show_visualization = show_visualization);

---

**Feature Review**

> `Children` refers to the number of children listed on the reservation. As in the case of `adults`, in my experience this feature does not match with reality, but I will keep it in case it may show value.

**Actions**

> No action required.

---

## `Babies`

In [None]:
## Reviewing details - 'babies'
eda.explore_feature(data,'Babies',target_feature = 'IsCanceled', bins = 5,
                    plot_label ='Number of Babies',
                    plot_title= 'Babies',
                    show_visualization = show_visualization);

---

**Feature Review**

> `Babies` refers to the number of babies listed on the reservation. As in the case of `adults` and `children`, in my experience this feature does not match with reality, but I will keep it in case it may show value.

**Actions**

> No action required.

---

## `Meal`

In [None]:
## Reviewing details for - 'Meal'
eda.explore_feature(data,'Meal',target_feature = 'IsCanceled', 
                    plot_label ='Types of Meal',
                    plot_title= 'Meal',show_visualization = show_visualization);

---

**Feature Review**

> `Meal` represents the type of meal included with the reservation booking. Per the data dictionary, the categories are:
> * "Undefined/SC – no meal package"
> * "BB – Bed & Breakfast"
> * "HB – Half board (breakfast and one other meal – usually dinner)"
> * "FB – Full board (breakfast, lunch and dinner)"

**Actions**

> I will condense the "Undefined" and "SC" values into one category as they represent the same information.
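
A minimal sketch of that condensing step on a toy series. It assumes the column stores the literal strings "Undefined" and "SC"; the values below are illustrative only.

```python
import pandas as pd

# Toy meal codes (illustrative only); assumes 'Undefined' and 'SC' both appear
meal = pd.Series(['BB', 'SC', 'Undefined', 'HB', 'FB', 'Undefined'], name='Meal')

# Map 'Undefined' onto 'SC' so both share a single "no meal package" category
meal_condensed = meal.replace({'Undefined': 'SC'})

print(meal_condensed.value_counts().to_dict())
```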

---

## `Country`

In [None]:
## Reviewing details for 'Country'
eda.explore_feature(data,'Country',target_feature = 'IsCanceled', 
                    marginal = 'box',normalize=True,
                    plot_label ='Country',
                    plot_title= 'Country',
                    show_visualization = show_visualization);

In [None]:
## Inspecting top 10 countries
data['Country'].value_counts(1, ascending=False)[:10]

---

**Feature Review**

> `Country` represents the "Country of origin. Categories are represented in the ISO 3155–3:2013 format," per the data dictionary. I noticed that most values are associated with PRT (Portugal) and Western Europe, which leads me to believe these hotels may be based in Portugal.

**Actions**

> Due to the large diversity of countries represented on the reservations, I would like to condense the values similarly to the `Agent` feature to best represent the values. I would group the countries that each account for 5% or less of reservations into an "Other" category, which would hold about 27% of the data.
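
The thresholding idea can be sketched as a small helper. The function name `condense_rare`, its parameters, and the toy values are hypothetical and only illustrate the approach; missing values are left untouched so they can still be handled in the model pipeline.

```python
import pandas as pd


def condense_rare(series: pd.Series, threshold: float = 0.05,
                  other_label: str = 'Other') -> pd.Series:
    """Replace categories whose share of non-null rows is at or below `threshold`."""
    shares = series.value_counts(normalize=True)
    keep = shares[shares > threshold].index
    # Keep frequent categories and missing values; relabel everything else
    return series.where(series.isin(keep) | series.isna(), other_label)


# Toy country codes (illustrative only): 40 known values plus one missing
country = pd.Series(['PRT'] * 24 + ['GBR'] * 10 + ['FRA'] * 5 + ['BRA'] + [None],
                    name='Country')

condensed = condense_rare(country, threshold=0.05)
print(condensed.value_counts(dropna=False).to_dict())
```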

---

## `Market_Segment`

In [None]:
## Reviewing details for - 'market_segment'
eda.explore_feature(data,'MarketSegment',target_feature = 'IsCanceled', marginal = 'box',
                    plot_label ='Market Segment',
                    plot_title= 'Market Segment',
                    show_visualization = show_visualization);

---

**Feature Review**

> `Market_Segment` represents the "market segment designation," per the feature dictionary. In simple terms, these are distinct categories of reservations representing the different markets:
> * Guests who are part of a group booking (usually via on-site Sales teams) are considered "Groups"
> * "Online TA/TO" would refer to booking groups such as Expedia, Priceline, etc.
> * "Corporate" bookings would be rates negotiated with certain companies with different benefits, such as lower rates, complimentary meals, etc.

**Actions**

> I noticed the "Undefined" category shows less than 1% of values. I will drop rows with that value as they are such a small number.
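
A short sketch of that row filter on a toy frame, assuming the category is stored as the literal string "Undefined"; the rows are illustrative only.

```python
import pandas as pd

# Toy rows (illustrative only)
toy = pd.DataFrame({
    'MarketSegment': ['Online TA', 'Groups', 'Undefined', 'Corporate'],
    'IsCanceled': [1, 0, 1, 0],
})

# Keep every row except those in the 'Undefined' segment
filtered = toy[toy['MarketSegment'] != 'Undefined'].reset_index(drop=True)
print(len(toy), '->', len(filtered))
```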
 
---

## `Distribution_Channel`

In [None]:
## Reviewing details for 'distribution_channel'
eda.explore_feature(data,'DistributionChannel',target_feature = 'IsCanceled', 
                    marginal = 'box',
                    plot_label ='Distribution Channel',
                    plot_title= 'Distribution Channel',
                    show_visualization = show_visualization);

---

**Feature Review**

> The `Distribution_Channel` feature closely matches the values in the `market_segment` feature. Distribution channels are the means by which reservations are booked and are often the same as the market segment categories, in my experience.

**Actions**

> I will drop this feature as I feel it does not add more value versus the `market_segment` feature.
 
---

## `Is_Repeated_Guest`

In [None]:
## Reviewing details for 'is_repeated_guest'
eda.explore_feature(data,'IsRepeatedGuest',target_feature = 'IsCanceled', 
                    plot_label ='Repeat Guest',
                    plot_title= 'Repeat Guest',
                    show_visualization = show_visualization);

---

**Feature Review**

> It is clear that nearly all guests are new, non-repeat guests at these hotels. In my personal experience, I find this unusual; my hotels had regular guests who would come in for work, to see nearby family, or to visit the area for leisure and recreation.

**Actions**

> This feature is unique to every property, so I will keep the feature to maintain generalizability for other hotels.
 
---

## `Previous_Cancellations`

In [None]:
## Reviewing details for 'previous_cancellations'
eda.explore_feature(data,'PreviousCancellations',target_feature = 'IsCanceled', bins = 5,
                    plot_label ='Number of Cancellations',
                    plot_title= 'Previous Cancellations',
                    show_visualization = show_visualization);

---

**Feature Review**

> There are very few reservations with a history of past cancellations. This is unsurprising as most reservations are new guests who did not stay at the hotel previously.

**Actions**

> This feature is unique to every property, so I will keep the feature to maintain generalizability for other hotels.
 
---

## `Previous_Bookings_Not_Canceled`

In [None]:
## Reviewing details for 'previous_bookings_not_canceled'
eda.explore_feature(data,'PreviousBookingsNotCanceled',target_feature = 'IsCanceled', 
                    bins = 5,marginal = 'box',
                    plot_label ='Number of Bookings Not Canceled',
                    plot_title= 'Previous Bookings Not Canceled',
                    show_visualization = show_visualization);

---

**Feature Review**

> There are very few reservations with a history of past reservations not cancelled. This matches with the `previous_cancellations` and `is_repeated_guest` features as a representation of guest stay patterns/history.

**Actions**

> While this feature may be highly collinear with the `previous_cancellations` and `is_repeated_guest` features, I will keep it for now. If multicollinearity is an issue for future models, I will drop one or more of these features.
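
A pairwise correlation matrix is one simple first check for this kind of overlap. The sketch below uses toy values and an arbitrary 0.8 cutoff, both of which are assumptions for illustration; pairwise correlation will not catch every form of multicollinearity.

```python
import pandas as pd

# Toy guest-history features (illustrative only)
toy = pd.DataFrame({
    'IsRepeatedGuest': [0, 0, 0, 1, 1, 0, 1, 0],
    'PreviousCancellations': [0, 0, 1, 0, 2, 0, 1, 0],
    'PreviousBookingsNotCanceled': [0, 0, 0, 3, 5, 0, 2, 0],
})

# Pearson correlation between every pair of features
corr = toy.corr()

# List each pair once if its absolute correlation meets the (assumed) cutoff
threshold = 0.8
high_pairs = [(a, b, round(float(corr.loc[a, b]), 2))
              for i, a in enumerate(corr.columns)
              for b in corr.columns[i + 1:]
              if abs(corr.loc[a, b]) >= threshold]

print(high_pairs)
```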
 
---

## `Reserved_Room_Type`

In [None]:
## Reviewing details for - 'reserved_room_type'
eda.explore_feature(data,'ReservedRoomType',
                    target_feature = 'IsCanceled', 
                    plot_label ='Reserved Room Type',
                    plot_title= 'Reserved Room Type',
                    show_visualization = show_visualization);

---

**Feature Review**

> This feature is difficult to interpret because the room types are anonymized. We only know that these are distinct room types; we do not know if they are standard, upgraded, or suite-style rooms. Different room types/levels may be associated with different guests/rates/etc., which would give more insight into a specific hotel.

**Actions**

> Despite the anonymity and lack of depth in this feature, it is still very important in my opinion for generalizability. I will leave this feature as-is.
 
---

## `Assigned_Room_Type`

---

**City**

---

In [None]:
## Reviewing details for 'assigned_room_type'
eda.explore_feature(data,'AssignedRoomType',
                    target_feature = 'IsCanceled', 
                    plot_label ='Assigned Room Type',
                    plot_title= 'Assigned Room Type',
                    show_visualization = show_visualization);

---

**Feature Review**

> I have some mixed feelings about whether to keep this feature. 
* One argument in favor of dropping it would be that room types are "assigned" during registration at the hotel; this information would not be available prior to arrival.
* A counter-argument would be that the hotel may change room types in advance of the guest's arrival (such as upgrades/downgrades, guest requesting different room types, etc.), which would make this a valid feature.
>
The feature dictionary defines this feature as:
> "Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons."

**Actions**

> I will keep this feature with the assumption that this information is available prior to arrival and would reflect whether the room type assigned to the reservation would match the room type initially booked.
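
One way I could make the "assigned vs. reserved" comparison explicit is a simple mismatch flag. This is only a sketch on invented rows; `room_type_changed` is a hypothetical column name that does not exist in the dataset.

```python
import pandas as pd

# Toy rows; the room codes and outcomes are invented for illustration
toy = pd.DataFrame({
    'ReservedRoomType': ['A', 'A', 'D', 'E', 'A'],
    'AssignedRoomType': ['A', 'D', 'D', 'F', 'A'],
    'IsCanceled':       [1, 0, 0, 0, 1],
})

# Hypothetical flag: 1 when the assigned room differs from the reserved room
toy['room_type_changed'] = (
    toy['ReservedRoomType'] != toy['AssignedRoomType']
).astype(int)

# Cancellation rate for matched (0) vs. changed (1) room types
rate_by_change = toy.groupby('room_type_changed')['IsCanceled'].mean()
print(rate_by_change)
```

If a flag like this turned out to be a very strong predictor, that could hint that assignments mostly happen at check-in, which would weaken the assumption above.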
 
---

## `Booking_Changes`

In [None]:
## Reviewing details for 'booking_changes'
eda.explore_feature(data,'BookingChanges',bins = 5,
                    target_feature = 'IsCanceled', 
                    plot_label ='Booking Changes',
                    plot_title= 'Booking Changes',
                    show_visualization = show_visualization);

---

**Feature Review**

> This feature represents the number of reservation changes between the initial booking and the time of arrival/cancellation. **In my personal experience, this feature could misrepresent the changes initiated by guests, as my systems would record *any* changes.** I would often perform many updates to reservations during our pre-arrival processes, including assigning room types, adding comments or special requests, etc.
>
> **Such updates could produce counts that do not represent the number of guest-initiated changes.** One way to address such an issue would be to compare against the average number of changes, on the assumption that pre-arrival processes inflate the count for most reservations.

**Actions**

> I will keep this feature as-is with the caveat that other hotels may need to handle this feature differently based on their processes for handling reservation updates.
 
---

## `Deposit_Type`

In [None]:
## Reviewing details for 'deposit_type'
eda.explore_feature(data,'DepositType',target_feature = 'IsCanceled', 
                    normalize=False,
                    plot_label ='Deposit Type',
                    plot_title= 'Deposit Type',
                    show_visualization = show_visualization);

---

**Feature Review**

> This feature is particularly intriguing to me, as it represents whether a guest pre-paid the reservation prior to arrival. My initial assumption would be that guests who pre-pay are less likely to cancel. However, I see **there are a significant number of canceled reservations that are non-refundable, while most checked-out reservations did not require a deposit.**
>
> My personal understanding/rationalization would be that these non-refundable reservations would be booked through a travel agency that would require pre-payment at the time of booking. On the hotel-side, we consider such reservations to be non-refundable, but we were usually flexible with cancellations and would refund pre-payments. **If guests are able to cancel without being charged regardless of any pre-payments, it would make sense that there would be so many non-refundable cancellations.**

**Actions**

> I will keep this feature as-is with the assumption that the hotel would be flexible with refunds. This assumption could diminish the value/impact of this feature, however, as non-refundable reservations could be considered as "no-deposit." Still, it is a distinguishing characteristic to be considered.
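
To put numbers on the pattern described above, a row-normalized crosstab shows the share of canceled reservations within each deposit type. The rows below are invented purely to illustrate the call; the real proportions would come from running it on `data`.

```python
import pandas as pd

# Toy rows; the mix of deposit types and outcomes is invented for illustration
toy = pd.DataFrame({
    'DepositType': ['No Deposit', 'No Deposit', 'Non Refund', 'Non Refund',
                    'No Deposit', 'Refundable', 'Non Refund', 'No Deposit'],
    'IsCanceled':  [0, 0, 1, 1, 1, 0, 1, 0],
})

# Each row sums to 1: share of checked-out (0) vs. canceled (1) per deposit type
deposit_rates = pd.crosstab(toy['DepositType'], toy['IsCanceled'],
                            normalize='index')
print(deposit_rates.round(2))
```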
 
---

## `Days_in_Waiting_List`

In [None]:
## Reviewing details for 'days_in_waiting_list'
eda.explore_feature(data,'DaysInWaitingList',target_feature = 'IsCanceled', bins = 5,
                    plot_label ='Days in Waiting List',
                    plot_title= 'Days in Waiting List',
                    show_visualization = show_visualization);

---

**Feature Review**

> I am not surprised that most reservations do not spend any time on a wait-list. In my experience, my hotels did not have a wait-list for reservations. Availability was on a "first come, first served" basis and we would instruct guests to check our availability at a later time if we were sold out.

**Actions**

> I will keep this feature as-is, but I do not expect it to be very valuable for my modeling due to most reservations showing 0 days on a wait-list.
 
---

## `Customer_Type`

In [None]:
## Reviewing details for 'customer_type'
eda.explore_feature(data,'CustomerType',
                    target_feature = 'IsCanceled', 
                    marginal = 'box',
                    plot_label ='Customer Type',
                    plot_title= 'Customer Type',
                    show_visualization = show_visualization);

---

**Feature Review**

> The `customer_type` feature is closely related to the `market_segment` and `distribution_channel` features as all of their categories are similar. However, this feature includes a "transient-party" category, indicating that these guests would be associated with other transient reservations. I did not have such indicators with my prior reservation systems, so I am curious to see its impact on my models. 

**Actions**

> I will keep this feature as-is. It may not be strongly generalizable to other hotels, but my models will be able to handle any multi-collinearity issues between this feature and the other marketing/distribution features.
 
---

## `ADR`

In [None]:
## Reviewing details for 'adr'
eda.explore_feature(data,'ADR',target_feature = 'IsCanceled', 
                    bins = 5,
                    plot_label ='ADR (€)',
                    plot_title= 'ADR (€)',
                    show_visualization = show_visualization);

---

**Feature Review**

> ADR is an acronym for "average daily rate," a summary of how much a guest would pay for each day of their stay. ADR is one of three major metrics for hotel performance and (in my experience) is a valuable indicator of whether a reservation will stay or not when paired with other features such as marketing details or pre-payment requirements.
>
> I am intrigued by the value counts for the cancellations due to the significant outlier values in the thousands of euros. I would not expect such a high ADR for most properties; the extreme price may help explain why the reservation was canceled.

**Actions**

> I will keep this feature as-is.
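
If I wanted to flag the extreme ADR values rather than just eyeball them, the common 1.5 × IQR rule is one option. This is a sketch on invented values; the rule itself is an assumption on my part and not something the dataset prescribes.

```python
import pandas as pd

# Invented ADR values (in euros) with one extreme entry
adr = pd.Series([60.0, 75.0, 80.0, 95.0, 110.0, 120.0, 150.0, 5400.0],
                name='ADR')

# Interquartile range and the conventional upper fence
q1, q3 = adr.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are treated as outliers
outliers = adr[adr > upper_fence]
print(f'Upper fence: {upper_fence:.2f}')
print(outliers)
```

Flagging is not the same as dropping: a genuine high-rate booking may still be informative, so I would inspect any flagged rows before deciding.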
 
---

## `Required_Car_Parking_Spaces`

In [None]:
## Reviewing details for 'required_car_parking_spaces'
eda.explore_feature(data,'RequiredCarParkingSpaces',target_feature = 'IsCanceled', 
                    bins = 5,
                    plot_label ='Required Car Parking Spaces',
                    plot_title= 'Required Car Parking Spaces',
                    show_visualization = show_visualization);

---

**Feature Review**

> Similar to the `assigned_room_type` feature, **I have mixed feelings about whether to keep this feature.** In my experience, it is rare for a reservation to indicate how many cars they will park; usually it is confirmed at registration. However, there are times when I do have this information, and so **I believe it is possible for hotels to have this information prior to arrival.**

**Actions**

> **I will keep this feature with the assumption that the information is available prior to arrival.** If this feature is included in my top results post-modeling, I will reconsider whether or not to keep the feature.
 
---

## `Total_of_Special_Requests`

In [None]:
## Reviewing details for 'total_of_special_requests'
eda.explore_feature(data,'TotalOfSpecialRequests',target_feature = 'IsCanceled', bins = 5,
                    marginal = 'box',
                    plot_label ='Total of Special Requests',
                    plot_title= 'Total of Special Requests',
                    show_visualization = show_visualization);

---

**Feature Review**

> I can interpret this feature in two ways. My hotels had both "special requests" and "comments" fields, which would often overlap. The two fields were meant to hold distinct information, but I often had special requests that were noted in the "comments" field. I assume this feature represents all special requests (in my case, it would include all unique requests across both fields).
>
> I am not surprised that canceled reservations have fewer special requests. In my experience, guests who provide a larger number of requests are more committed to their plans.

**Actions**

> I will keep this feature with the assumption mentioned above.
 
---

## `Reservation_Status_Date`

In [None]:
## Checking the current data type before converting to datetime
data['ReservationStatusDate'].dtypes

In [None]:
data['ReservationStatusDate'] = pd.to_datetime(data['ReservationStatusDate'])
data['ReservationStatusDate']

In [None]:
## Reviewing details for 'reservation_status_date'
eda.explore_feature(data,'ReservationStatusDate',target_feature = 'IsCanceled', marginal = 'box',
                    bins=3, plot_label ='Reservation Status Date',
                    plot_title= 'Reservation Status Date',
                    show_visualization = show_visualization);

---

**Feature Review**

> Per the feature dictionary, `reservation_status_date` represents the "date at which the last status was set."

**Actions**

> I do not feel that this feature adds any substantial information for my modeling, and so I will drop it.
 
---

## `Agent`

In [None]:
## Reviewing details for 'Agent'
eda.explore_feature(data,'Agent',target_feature = 'IsCanceled', plot_label ='Booking Agent ID',
                    plot_title= 'Agent IDs', show_visualization = show_visualization);

In [None]:
## Calculating number of unique values in "Agent"
data['Agent'].nunique()

---

**Feature Review**

> The `Agent` feature represents the unique IDs for each booking Agent/group on most reservations. I noticed 14% of the reservations do not have a value associated with them, and there are 333 unique IDs otherwise. This is a similar situation to the "Country" feature, in that I will need to condense these values to ascertain the most meaningful information and to prevent adding 333 additional one-hot-encoded columns later on.

**Actions**

> I will need to condense these values down into a smaller subset of values representing a selection of top-producing Agents by percent; the number of reservations without an Agent ID; and then the remaining Agents as an "Other" aggregate category. Additionally, I will need to convert these values from a float to a string value for modeling.
 
---

#  **Post-EDA Updates**

---

**Feature Changes and Engineering**

> After reviewing all of the features and their statistics, I noticed a few things needed to change. 

**These Changes Include:**
>* Condensing specific features
* Engineering new features
* Dropping duplicated/unnecessary features

---

## Condensing Features

### Condensing `Meal`

---

> The feature reference dictionary states that the "undefined" values for `Meal` are part of the "SC" category. While there are a relatively small number of "undefined" entries, I feel that it would be most accurate to recategorize them as the "SC" Meal type.

---

In [None]:
## Inspecting normalized value counts
data['Meal'].value_counts(normalize=True)

In [None]:
## Testing normalized value counts after replacement
data['Meal'].replace('Undefined', 'SC').value_counts(1)

In [None]:
## Performing replacement and verifying results
data['Meal'] = data['Meal'].replace('Undefined', 'SC')
data['Meal'].value_counts(1)

In [None]:
## Reviewing new details for Meal
eda.explore_feature(data,'Meal',target_feature = 'IsCanceled', plot_label ='Meal Type',plot_title= 'Meals',
                    show_visualization = show_visualization);

---

> After these changes, I have four distinct categories with the "undefined" Meal types added to the "SC" Meal types. Now I will perform a larger condensation with the `Agent` feature.

---

### Condensing `Agent` into `Agent_Group`

---

**This feature requires a more complex approach than the `Meal` feature.**

> `Agent` includes a large number of unique values (each unique value being a unique identifier for an Agent). To use this feature in modeling, I would need to perform one-hot encoding, resulting in an additional 300+ features for all of my reservations. I will explore two main questions to determine how to handle this feature:
1. What are the top ten IDs by percentages?
2. Is there a way to condense this feature into a smaller set of categories?

**Depending on the breakdown of percentages, I would like to condense this feature to a smaller number of unique values.**

> Condensing the features would result in fewer additional columns post-encoding as well as potentially increasing the impact of this feature. **However, it would impair the interpretability of the results;** if my model's results show that the condensed value is significant, it would mean that the *combination of all of the values* is significant, not any particular one Agent.

**Despite the loss of interpretability, I feel it is best to condense these values.**
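
To illustrate the trade-off, here is a minimal sketch of how the number of one-hot columns shrinks once rare IDs are lumped together. The IDs and the `top_ids` list are invented for the example; the real top Agents are determined in the cells that follow.

```python
import pandas as pd

# Toy agent IDs stored as strings; IDs are invented for illustration
agents = pd.Series(['9', '9', '240', '1', '14', '7', '9', '28', '240', '3'],
                   name='Agent')

# One dummy column per unique ID before condensing
wide = pd.get_dummies(agents, prefix='Agent')

# Keep an assumed set of top IDs and lump everything else into '999'
top_ids = ['9', '240', '1']
condensed = agents.where(agents.isin(top_ids), '999')
narrow = pd.get_dummies(condensed, prefix='Agent')

print(wide.shape[1], '->', narrow.shape[1])
```

The narrower encoding is cheaper to model, but as noted above, a significant `Agent_999` column would only say that the lumped group matters as a whole.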

---

In [None]:
## Confirming number of unique values
data['Agent'].nunique()

In [None]:
## Reviewing top ten Agents by percentage of bookings
data['Agent'].value_counts(normalize=True, ascending=False,
                           dropna=False).iloc[:10]

In [None]:
## Visualizing top ten Agents by percentage of bookings
top_Agents = data['Agent'].value_counts(normalize=True, ascending=True,
                                        dropna=False).iloc[-10:]

fig, ax = plt.subplots(figsize= (8, 4))
top_Agents.plot(kind='barh', ax=ax)

fig.suptitle('Top Ten Agents by Volume')
ax.set_xlabel("Percentage of Total Reservations")
ax.set_xticks([0, .05, .10, .15, .20, .25])
ax.set_xticklabels(['0', '5','10', '15', '20', '25'])
ax.set_ylabel("Agent ID")

for i,vc in enumerate(top_Agents):
    plt.text(x=vc+.0025, y=i, s=f"{vc:.0%}")

plt.tight_layout()
plt.show()
plt.close();

---

**Analysis shows there are three top Agents, representing a combined total of nearly half of the total reservations.**

> Additionally, the 14% of missing values represents a significant proportion of the overall dataset. The remaining 41% of the data consists of the other Agents, each with 3% or less of the overall reservations.

**Based on these results, I feel comfortable in condensing the values into five distinct categories:** 
>* The top three Agent IDs (no changes)
* A placeholder ('0') to indicate the lack of an ID
* A placeholder ('999') representing the combination of all of the lower-production Agents

**I will replace the values and inspect the breakdown of the resulting data.**

---

In [None]:
## Reviewing number of missing values
data['Agent'].isna().sum()

In [None]:
## Filling with placeholder value and confirming results
data['Agent'] = data['Agent'].fillna(0.00)
data['Agent'].isna().sum()

In [None]:
## Converting non-top-3 ID values to placeholder "999"
cond = [data['Agent'] == 9.00,
        data['Agent'] == 240.00,
        data['Agent'] == 1.00,
        data['Agent'] == 0.00
       ]

choice = [data['Agent'], data['Agent'], data['Agent'], data['Agent']]

data['Agent_group'] = np.select(cond, choice, 999)
data[['Agent', 'Agent_group']]

In [None]:
## Converting column to string

## Casting to int first drops the trailing ".0" from the float IDs;
## assigning the whole column (not via .loc) lets the dtype actually change
data['Agent_group'] = data['Agent_group'].astype(int).astype(str)
print(f'Datatype: {data["Agent_group"].dtype}')

In [None]:
## Creating new Series for visualization
new_cats = data['Agent_group'].value_counts(1, ascending=True)
new_cats

In [None]:
## Visualizing new feature

fig, ax = plt.subplots(figsize= (8, 4))
new_cats.plot(kind='barh', ax=ax)

fig.suptitle('Agent - New Categories')

ax.set_xlabel("Percentage of Total Reservations")
ax.set_xticks([0, .1, .2, .3, .4, .5])
ax.set_xticklabels(['0', '10','20', '30', '40', '50'])
ax.set_ylabel("Agent ID")

for i,vc in enumerate(new_cats):
    plt.text(x=vc+.005, y=i, s=f"{vc:.0%}")

plt.tight_layout()
plt.show()
plt.close();

In [None]:
# Dropping "Agent" feature after conversion
data.drop(columns = ['Agent'], inplace=True)
data.columns

In [None]:
## Confirming 'Agent' removal from dataframe
'Agent' not in data

In [None]:
## Deleting variables to free up space
del cond, choice, new_cats

---

**After converting the values, there are now five unique values instead of 333.**

> The resulting values represent the top three Agents by production percentage; the percentage of reservations not associated with an Agent (missing values prior to conversion); and then the combined percentage of Agents producing 3% or less of the overall reservations.
>
> This condensation balances my desire to simplify this category; the need to address missing values; and my intention to maintain the value of all of the Agent IDs.
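
Since I repeat this pattern for other features, the same idea could be wrapped in a small reusable helper. The sketch below is an alternative to the `np.select` approach used above, not a replacement I have validated on the full data; `condense_categories` and its parameters are hypothetical names. Note one difference: here a filled-in missing label has to earn a top slot by frequency, whereas above I kept the "0" placeholder unconditionally.

```python
import pandas as pd

def condense_categories(series, top_n, other_label='Other', missing_label=None):
    """Keep the `top_n` most frequent values; relabel the rest as `other_label`.

    If `missing_label` is given, missing values are filled with it first so
    that they compete for a top slot like any other value. Otherwise missing
    values end up in `other_label`.
    """
    if missing_label is not None:
        series = series.fillna(missing_label)
    keep = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(keep), other_label)

# Toy values invented for illustration
toy = pd.Series(['PRT', 'PRT', 'GBR', 'FRA', 'PRT', 'GBR', 'BRA', 'ITA', None])
result = condense_categories(toy, top_n=2)
print(result.value_counts())
```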

**Now I will condense the `Country` category in a similar manner.**

---

### Condensing `Country` Categories

---

> The `Country` feature includes 177 unique country codes, but 5 countries compose about 73% of all of the values. To reduce the complexity of this feature and to increase the value of the remaining distinct countries, I will condense these values as I did for the `Agent` feature.

---

In [None]:
## Confirming number of unique values
data['Country'].nunique()

In [None]:
## Reviewing top ten countries by percentage of bookings
data['Country'].value_counts(normalize=True, ascending=False, dropna=False).iloc[:10]

In [None]:
## Visualizing top ten countries by percentage of bookings
top_countries = data['Country'].value_counts(normalize=True, ascending=True,
                                        dropna=False).iloc[-10:]

fig, ax = plt.subplots(figsize= (8, 4))
top_countries.plot(kind='barh', ax=ax)

fig.suptitle('Top Ten Countries by Volume')
ax.set_xlabel("Percentage of Total Reservations")
ax.set_xticks([0, .05, .10, .15, .20, .25, .30, .35, .40])
ax.set_xticklabels(['0', '5','10', '15', '20', '25', '30', '35', '40'])
ax.set_ylabel("Country ID")
ax.axhline(y=4.5, ls = ":", c='k', label = "5%")
ax.legend(labels = ['5% Threshold', 'Volume (%)'])

for i,vc in enumerate(top_countries):
    plt.text(x=vc+.0025, y=i, s=f"{vc:.0%}")

plt.tight_layout()
plt.show()
plt.close();

---

**Analysis shows there are five countries composing over 5% of the reservations, representing a combined total of nearly 73% of the total reservations.**

> Each of these countries accounts for more than 5% of the reservations; the remaining 27% is spread across countries that each fall below 5%.

**Based on these results, I feel comfortable in condensing the values into six distinct categories:** 
>* The top five Country IDs (no changes)
* A placeholder ('Other') representing the combination of all of the lower-production countries

**I will replace the values and inspect the breakdown of the resulting data.**

---

In [None]:
## Converting non-top-5 countries to placeholder "Other"
cond = [data['Country'] == 'PRT',
        data['Country'] == 'GBR',
        data['Country'] == 'FRA',
        data['Country'] == 'ESP',
        data['Country'] == 'DEU'
       ]

choice = [data['Country'], data['Country'], data['Country'], data['Country'],
          data['Country']]

data['Country'] = np.select(cond, choice, 'Other')
data['Country']

In [None]:
## Creating new Series for visualization
new_cats = data['Country'].value_counts(1, ascending=True)
new_cats

In [None]:
## Visualizing new feature

fig, ax = plt.subplots(figsize= (8, 4))
new_cats.plot(kind='barh', ax=ax)

fig.suptitle('Country - Updated Categories')

ax.set_xlabel("Percentage of Total Reservations")
ax.set_xticks([0, .1, .2, .3, .4, .5])
ax.set_xticklabels(['0', '10','20', '30', '40', '50'])
ax.set_ylabel("Country ID")

for i,vc in enumerate(new_cats):
    plt.text(x=vc+.005, y=i, s=f"{vc:.0%}")

plt.tight_layout()
plt.show()
plt.close();

In [None]:
## Deleting variables to free up space
del cond, choice, new_cats

## Feature Engineering

---

> Now that I condensed my existing features, I will create two new ones:
* `arrival_date`: representing the date the reservation is expected to check-in; useful for future forecasting
* `stay_length`: for how many nights a guest will stay
>
> This information is standard for all hotels and will be helpful for modeling (and later forecasting).

---

### Engineering `Arrival_Date`

In [None]:
## Converting from month, day of month, and year to a single datetime column
data['arrival_date'] = pd.to_datetime(
    data['ArrivalDateMonth'] + ' '
    + data['ArrivalDateDayOfMonth'].astype(str) + ', '
    + data['ArrivalDateYear'].astype(str)
)
data['arrival_date']

In [None]:
## Determining the day of the week of arrival 
data.loc[:,'arrival_day'] = data.loc[:,'arrival_date'].dt.day_name()
data['arrival_day']

In [None]:
## Reviewing results
data[['arrival_day', 'arrival_date']]

In [None]:
## Reviewing details for arrival_day
eda.explore_feature(data,'arrival_day',target_feature = 'IsCanceled', 
                    plot_label ='Day',
                    plot_title= 'Arrival Day',
                    show_visualization = show_visualization);

---

**Feature Review**

> This new feature gives a more precise representation of when a reservation is expected to arrive. Having an idea of anticipated occupancy by day is critical for daily operations such as cleaning and staffing.
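
As a sketch of how `arrival_date` could feed later forecasting work, the snippet below counts non-canceled arrivals per calendar day and fills in days with no arrivals. The rows are invented for illustration, and counting only `IsCanceled == 0` is my assumption about what "expected arrivals" should mean.

```python
import pandas as pd

# Toy reservations; dates and outcomes are invented for illustration
toy = pd.DataFrame({
    'arrival_date': pd.to_datetime(['2017-07-01', '2017-07-01', '2017-07-02',
                                    '2017-07-04', '2017-07-04', '2017-07-04']),
    'IsCanceled':   [0, 1, 0, 0, 0, 1],
})

# Non-canceled arrivals per day; asfreq inserts missing calendar days as 0
daily_arrivals = (toy.loc[toy['IsCanceled'] == 0]
                     .groupby('arrival_date')
                     .size()
                     .asfreq('D', fill_value=0))
print(daily_arrivals)
```

Filling the gaps matters for time series work, since most forecasting tools expect a regular daily index.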

---

### Engineering `Stay_Length`

---

> One major feature of a guest's reservation is missing from the original dataset: the overall stay length. I feel a reservation's stay length is relevant to my analysis, and so I will engineer this feature by adding the number of weekday and weekend nights together.

---

In [None]:
## Creating stay_length as a summation of weekday and weekend night counts
data['stay_length'] = data['StaysInWeekendNights'] + data['StaysInWeekNights']
data['stay_length'].value_counts(dropna=False)

In [None]:
## Reviewing details for stay_length
eda.explore_feature(data,'stay_length',target_feature = 'IsCanceled', plot_label ='Length',
                    plot_title= 'Length of Stay', bins=10,
                    show_visualization = show_visualization);

---

**Feature Review**

> Both checkouts and cancellations show that most guests are staying for less than a week, which is understandable for standard hotels and resorts (there are exceptions, such as brands oriented towards long-term stays of 7+ days).

---

## Dropping Features

---

> During the EDA process, I noted several features were less useful than others and would need to be dropped from the dataset. I will drop the following features/rows:
* `market_segment`, `"undefined"` value: represented less than 1% of reservations
* `distribution_channel`: too similar to `market_segment`
* `reservation_status`: too similar to target feature
* `arrival_date_year`: specific to reservations in dataset; not usable for future reservations
* `reservation_status_date`: does not add substantial information

---

### Dropping "Undefined" from `Market_Segment`

---

> Fewer than 1% of reservations have an "undefined" market segment. To reduce the future number of features post-one-hot-encoding, I will remove these rows.

---

In [None]:
## Confirming low number of "undefined" values in market_segment
data['MarketSegment'].value_counts()

In [None]:
## Inspecting reservations listed as "undefined"
data[data['MarketSegment'] == 'Undefined']

In [None]:
## Removing those reservations from the dataset
data = data[data['MarketSegment'] != 'Undefined'].copy()
data

### Comparing `Market_Segment` and `Distribution_Channel`

---

**What's the difference?**

>Based on my personal career experience, I know that these two features represent nearly the same information.
* ***Market segments*** represent customers with distinct needs and behaviors, who may also have additional distinguishing characteristics (groups, contracts, and businesses).
* ***Distribution channels*** are the means by which these reservations are booked and processed, often overlapping with the guest segmentation.

**Which to keep?**

> **As the two features demonstrate high multicollinearity, I will keep only one of them.** I will inspect the values for each feature, keep whichever provides the most diverse insights, and drop the other.

---

In [None]:
## Inspecting breakdowns
print('Distribution Channel:\n',data['DistributionChannel'].value_counts(1),
      '\n\n')
print('Market Segment:\n',data['MarketSegment'].value_counts(1))

---

**While both features describe similar information, `market_segment` includes more distinct information.** I will keep `market_segment` and drop `distribution_channel`.
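
To back up the "similar information" claim with a number, one option is Cramér's V, a measure of association between two categorical features. The helper below is a hypothetical sketch computed directly from a crosstab (without bias correction), and the toy rows are invented so that each segment maps to a single channel.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V for two categorical Series (0 = no association, 1 = perfect)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: (row total * column total) / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy rows invented for illustration
toy = pd.DataFrame({
    'MarketSegment':       ['Online TA', 'Online TA', 'Direct', 'Corporate',
                            'Offline TA/TO', 'Direct', 'Online TA', 'Corporate'],
    'DistributionChannel': ['TA/TO', 'TA/TO', 'Direct', 'Corporate',
                            'TA/TO', 'Direct', 'TA/TO', 'Corporate'],
})
v = cramers_v(toy['MarketSegment'], toy['DistributionChannel'])
print(round(v, 2))
```

In this toy frame the channel is fully determined by the segment, so the statistic reaches its maximum; on the real data I would expect a high but not perfect value.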

---

In [None]:
## Dropping "distribution_channel"
data = data.drop(columns = 'DistributionChannel')

In [None]:
## Confirming 'distribution_channel' removal from dataframe
'DistributionChannel' not in data

### Dropping `Reservation_Status`

---

> `Reservation_Status` is nearly identical to my target feature and would be too strong of a predictor in my models.
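
A quick way to confirm this kind of target leakage is to check whether each status value maps to exactly one target value. The rows below are invented for illustration; `is_leaky` is just a convenience name for the sketch.

```python
import pandas as pd

# Toy rows; statuses and outcomes are invented for illustration
toy = pd.DataFrame({
    'ReservationStatus': ['Check-Out', 'Canceled', 'No-Show',
                          'Check-Out', 'Canceled'],
    'IsCanceled':        [0, 1, 1, 0, 1],
})

# If every status maps to exactly one target value, the feature leaks the target
targets_per_status = toy.groupby('ReservationStatus')['IsCanceled'].nunique()
is_leaky = bool((targets_per_status == 1).all())

print(targets_per_status)
print(is_leaky)
```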

---

In [None]:
## Dropping "reservation_status"
data.drop(columns = 'ReservationStatus', inplace=True)

In [None]:
## Confirming 'reservation_status' removal from dataframe
'ReservationStatus' not in data

### Dropping `Arrival_Date_Year`

---

> `Arrival_Date_Year` would not be applicable to future models; it would impair model performance when the model is given data for new years.

---

In [None]:
## Dropping "arrival_date_year"
data.drop(columns = 'ArrivalDateYear', inplace=True)

In [None]:
## Confirming 'arrival_date_year' removal from dataframe
'ArrivalDateYear' not in data

### Dropping `Reservation_Status_Date`

In [None]:
## Dropping "reservation_status_date"
data.drop(columns = 'ReservationStatusDate', inplace=True)

In [None]:
## Confirming 'reservation_status_date' removal from dataframe
'ReservationStatusDate' not in data

## Final Data Review

---

> Now that I completed my condensation, engineering, and dropping of features, I will review the final data prior to saving the results for modeling.

---

In [None]:
## Inspecting final dataframe
data

# Preserving the Pandas (DataFrame)

---

Now I am ready to save the cleaned and processed data for modeling in my next notebook.

---

In [None]:
# ## Pickling with Pandas
# data.to_pickle(path = '../data/data_prepped.pickle',
#             compression = 'gzip')
# print(f'Successfully pickled!')

In [None]:
## Saving to Parquet with Pandas
data.to_parquet(path = '../data/data_prepped.parquet')
print('Successfully saved!')

# Future Work: EDA

---

In the future, I will revisit the visualization aspects of my EDA function to convert them from Plotly Express figures to Matplotlib figures. The goal with Plotly Express was to have additional interactivity; however, these figures crippled my notebook's operations. Matplotlib figures would be more appropriate in this case, and I will revisit this work when I have more time.

---


# Moving to Modeling!

---

> Now that I completed the pre-processing and EDA steps, I will move to my next notebook to perform my classification modeling.

---