# **Hotel Cancel Culture** - **EDA Notebook**

---

**Author:** Ben McCarty

**Capstone Project** - Classification, Time Series Modeling

**Contact:** bmccarty505@gmail.com

---

---

**Who?**
>* 🏢 **Revenue Management (RM) teams** for hotel groups (corporate, franchise)
>
>
>* 🏨 On-site GMs, Sales, and Ops teams

---

**Why?**
>* 💰 **Revenue Management:** 
>  * Revenue optimization: Right price, right time, right customer
>    * Dynamic pricing
>    * Distribution channels
>    * Pricing per room type
>
>
>* 🤝 **Sales:**
>  * Group sales (pickup/wash)
>  * BT (performance/company for both GPP and LNR rates)
>
>
>* 🛌 **Rooms Ops:**
>  * Forecasting occupancy, arrivals, departures, stay-overs, same-day booking demand, and probability of guest relocation in the case of oversell.
>  * Determining staff schedules and periods of high demand
>
>
>* 🍰 ☕ **Food and Beverage:**
>  * Ordering food/supplies overall
>  * Scheduling staff
>  * Determining busy times (breakfast, lunch, dinner)
>    * Staffing, specific food/supplies

---

**What?**
>* 🧾 Dataset comprised of... 
>  * 32 different features
>    * Detailed explanation of features (and sub-categories, when appropriate) available in Readme
>  * Nearly 120,000 reservation records
>  * Source cited in Readme

---

**How?**
>* Which models/methods?
>  * 🔢 Classifiers 🌳
>    * XGBoost, RFC, ABC, etc.
>  * ⏳ Time Series Analysis 📈
>    * PMD auto-arima
>    * Statsmodels vector autoregression
>
>
>* Data prep and feature engineering

---

---

> **Goal:** To prepare data for classification modeling in next notebook.
>
>
> **Purpose:** to explore, clean, and organize.
>
>
> **Process:**
>
>    * Inspecting data integrity and statistics
>    * Splitting data by hotel type ("City" vs. "Resort")
>    * Filling any missing values
>    * Saving processed data for modeling notebook
>
>
> **Modeling Notebook:**
>
>    * Performing train/test split
>    * Training the model
>    * Evaluating performance metrics
>    * Providing final recommendations

---

# **Import Packages**

---

> To start off, I will import a variety of packages to assist with handling my data; creating visualizations; and reviewing statistical data.

---

In [None]:
## Data Handling
import pandas as pd
import numpy as np

## Visualizations
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

## Custom-made Functions
from bmc_functions import eda

In [None]:
## Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 50)
%matplotlib inline

In [None]:
%load_ext autoreload
%autoreload 2

# **Read Data**

In [None]:
## Reading data
source = './data/hotel_bookings.pickle'
data = pd.read_pickle(source, compression = "gzip")
data

# **Identifying Target Feature**

---

> For my classification analysis, **I will use the `is_canceled` feature as my target feature.** This feature indicates whether a reservation was canceled (0 = check-out, 1 = canceled).
>
> There is another feature, `reservation_status`, that also looks valuable. I will compare that feature against `is_canceled` to investigate any differences between the two.

---

In [None]:
## Inspecting target feature
data['is_canceled'].describe()

In [None]:
data['is_canceled'].value_counts(normalize=True, dropna=False)

---

> Based on this initial review, I see a moderate class imbalance favoring non-canceled reservations: roughly 63% of reservations were not canceled vs. 37% canceled.
>
> I will keep this imbalance in mind when I perform my modeling in my next notebook. Imbalanced classes may have a significant negative impact on a model's performance; I will need to address the imbalance at that time.

---
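
> A minimal sketch of one common way to quantify and offset an imbalance like this: "balanced" class weights, computed as `n_samples / (n_classes * class_count)`. The toy `y` below is illustrative data built to mirror the 63%/37% split. It is not the real dataset, and the weighting scheme is an assumption about what the modeling notebook might use.

```python
import pandas as pd

# Toy target mirroring the reported 63%/37% split (illustrative data only).
y = pd.Series([0] * 63 + [1] * 37, name="is_canceled")

# Class proportions, equivalent to value_counts(normalize=True).
proportions = y.value_counts(normalize=True).sort_index()

# "Balanced" weights: n_samples / (n_classes * class_count).
counts = y.value_counts().sort_index()
class_weights = (len(y) / (len(counts) * counts)).to_dict()

print(proportions.to_dict())
print(class_weights)
```

> The minority class (canceled) receives the larger weight, so errors on canceled reservations count for more during training.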

# **Reviewing Statistics**

In [None]:
## Sorting report by number of missing values
eda.report_df(data).sort_values('null_sum', ascending=False).style\
                                       .background_gradient(subset='null_pct')

---

> **Notes**

---
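
> `eda.report_df` is a custom helper whose source is not shown in this notebook. The sketch below is a hypothetical stand-in (`null_report` is an invented name) that produces only the two columns used above, `null_sum` and `null_pct`, on a toy dataframe.

```python
import numpy as np
import pandas as pd


def null_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for eda.report_df: null counts and percentages."""
    report = pd.DataFrame({
        "null_sum": df.isna().sum(),          # number of missing values per column
        "null_pct": df.isna().mean() * 100,   # share of missing values, in percent
    })
    return report.sort_values("null_sum", ascending=False)


# Toy data (illustrative only, not the real reservations).
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 240.0, 1.0],
    "adr": [75.0, 98.5, 120.0, 60.0],
})

report = null_report(toy)
print(report)
```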

---

**Reviewing Reports - Missing Values**

> Based on the report above, I see that the data is missing values for `company`, `agent`, `country`, and `children`.
>
> ***Special note:*** As noted in the data's documentation (located in *"details.md"*), any missing values are intentional representations of features that were not applicable to a reservation.
---

**`Company` and `Agent` Features**

>* `company:` 94%
>* `agent:` 14%
>
> Due to the large number of missing values for `company`, **I will drop the `company` feature.**
>
> The `agent` feature shows a significant number of missing values while including a very large number of unique values. **I will need to review this feature in more depth to determine whether or not to keep it for modeling.**

**`Country` and `Children` Features**

> The remaining two features with missing values are `country` and `children`.
>
> **As there are a small number of missing values in both features, I will keep them and fill the missing values with each feature's most frequent value later in this notebook.** If the number of missing values were more substantial, I would instead use a `SimpleImputer` in my modeling pipeline to compare fill strategies.

---

## Dropping `Company` Column

In [None]:
# Dropping "company" column (94% missing values)
data.drop(columns = ['company'], inplace=True)
data

In [None]:
## Confirming 'company' removal from dataframe
'company' not in data

## Inspecting the `Agent` Feature

---

> The `agent` feature has a large number of unique values, all of which are categorical identifiers.
>
> This creates a problem when one-hot encoding (OHE) to prep for modeling: it would create 300+ new features to model.

---

In [None]:
## Confirming number of unique values
data['agent'].nunique()

In [None]:
## Reviewing top ten agents by percentage of bookings
data['agent'].value_counts(normalize=True, ascending=False, dropna=False).iloc[:10]

In [None]:
## Visualizing top ten agents by percentage of bookings
fig, ax = plt.subplots()
data['agent'].value_counts(normalize=True, ascending=True, dropna=False)\
                                           .iloc[-10:].plot(kind='barh', ax=ax)
fig.suptitle('Top Ten Agents by Volume')
ax.set_xlabel("Percentage of Total Reservations")
ax.set_ylabel("Agent ID");

---

> Compress to four total categories: the top three agents and the rest.
>
> The top three (non-NaN) agents represent 45% of all data.
>
> The result is a four-category feature, reducing dimensionality while keeping the data.

---

In [None]:
## Converting non-top-3 ID values to placeholder "999"
cond = [data['agent'] == 9.00,
        data['agent'] == 240.00,
        data['agent'] == 1.00]

choice = [data['agent'], data['agent'], data['agent']]

data['agent_group'] = np.select(cond, choice, 999)
data['agent_group']

In [None]:
data[['agent', 'agent_group']]

In [None]:
## Creating new Series for visualization
new_cats = data['agent_group'].value_counts(normalize=True, ascending=True)
new_cats

In [None]:
## Visualizing new feature

fig, ax = plt.subplots(figsize= (8, 4))
new_cats.plot(kind='barh', ax=ax)

fig.suptitle('Agent - New Categories')

ax.set_xlabel("Percentage of Total Reservations")
ax.set_xticks([0, .1, .2, .3, .4, .5])
ax.set_xticklabels(['0', '10','20', '30', '40', '50'])
ax.set_ylabel("Agent ID")

for i,vc in enumerate(new_cats):
    plt.text(x=vc, y=i, s=f"{vc:.0%}")

plt.tight_layout();

In [None]:
# Dropping "agent" feature after conversion - errors='ignore' prevents a KeyError on re-runs

data.drop(columns=['agent'], inplace=True, errors='ignore')

data.columns

In [None]:
## Confirming 'agent' removal from dataframe
'agent' not in data

In [None]:
## Deleting variables to free up space
try:
    del cond, choice, new_cats
except NameError:
    pass

---

> Created new feature to condense values into one of four categories: one of top three agents by volume and then the rest.

---
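
> The cell above hardcodes the three agent IDs. As a hedged alternative, the sketch below derives the top IDs from the data itself. `group_top_n` is a hypothetical helper name, and the toy series is illustrative only; the placeholder `999` matches the one used above.

```python
import numpy as np
import pandas as pd


def group_top_n(ids: pd.Series, n: int = 3, other: float = 999) -> pd.Series:
    """Keep the n most frequent non-null IDs; map everything else to `other`."""
    # value_counts() drops NaN by default and sorts by frequency, descending.
    top_ids = ids.value_counts().index[:n]
    # Values outside the top n (including NaN) are replaced by the placeholder.
    return ids.where(ids.isin(top_ids), other)


# Toy agent IDs (illustrative only).
toy_agents = pd.Series([9, 9, 9, 9, 240, 240, 240, 1, 1, 14, 7, np.nan])

grouped = group_top_n(toy_agents)
print(grouped.value_counts(normalize=True))
```

> Missing agent IDs fall into the placeholder group, which matches the behavior of the `np.select` cell above.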

## Comparing `Market_Segment` and `Distribution_Channel`

---

> What are the differences between the values in these features?
>
> Is one feature more descriptive than the other?

---

In [None]:
data['distribution_channel'].value_counts(normalize=True)

In [None]:
data['market_segment'].value_counts(normalize=True)

---

> Both features describe similar information.
>
> `Market_segment` is more descriptive, so I will keep `market_segment` and drop `distribution_channel`.

---
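
> One way to check how much two categorical features overlap is a row-normalized cross-tabulation. The sketch below uses a small toy dataframe; the category labels are illustrative and the proportions are not taken from the real data.

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({
    "market_segment": ["Online TA", "Online TA", "Offline TA/TO",
                       "Groups", "Direct", "Corporate"],
    "distribution_channel": ["TA/TO", "TA/TO", "TA/TO",
                             "TA/TO", "Direct", "Corporate"],
})

# Each row shows how one market segment is spread across distribution channels.
overlap = pd.crosstab(toy["market_segment"],
                      toy["distribution_channel"],
                      normalize="index")
print(overlap)
```

> If most rows concentrate in a single column, the finer-grained feature carries nearly all of the information in the coarser one.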

In [None]:
## Dropping "distribution_channel"
data.drop(columns = 'distribution_channel', inplace=True)

In [None]:
## Confirming 'distribution_channel' removal from dataframe
'distribution_channel' not in data

## Filling Missing Values for `Country` and `Children`

---

> As there are so few missing values for the country and children features, I will impute the most frequent values for each feature.
>
> **I do not expect the imputation method to affect my future modeling results.** If the number of missing values were more substantial, I would incorporate a `SimpleImputer` in my future modeling pipeline.

---

In [None]:
## Identify columns with missing data
nan_list = list(data.isna().sum()[data.isna().sum() > 0].index)
nan_list

In [None]:
## Impute the most frequent value for each column
for col in nan_list:
    data[col] = data[col].fillna(data[col].mode()[0])

In [None]:
## Confirming there are no remaining missing values
for col in nan_list:
    display(data[col].value_counts(normalize=True, dropna=False))
    print(f'Total missing values for {col.title()}: {data[col].isna().sum()}')

In [None]:
## Deleting variables to free up space
try:
    del nan_list, col
except NameError:
    pass

# **Inspecting Feature Data Types**

---

>

---

In [None]:
## Inspecting datatypes for "data"
data.dtypes.sort_values()

---

**Review - Datatypes**

> After reviewing the data types, I noticed `agent_group` should be changed to the `string` datatype. This feature represents unique identifiers for booking agents and needs to be treated as categorical data.
>
> **Convert to `string` type:**
> * `agent_group`
>
> **Convert to `datetime` type:**
> * `reservation_status_date`

---

## Converting to String

In [None]:
## Converting column to string

## Casting to int first so the labels read '9' rather than '9.0'
data['agent_group'] = data['agent_group'].astype(int).astype(str)
print(f'Datatype: {data["agent_group"].dtype}')

## Converting to DateTime

In [None]:
## Converting to datetime
data['reservation_status_date'] = pd.to_datetime(data['reservation_status_date'])
print(f'Datatype: {data["reservation_status_date"].dtype}')

# **EDA - Features**

---

**In-Depth EDA per Feature**

> Now that I reviewed my missing values and confirmed my datatypes, I will inspect the details of each of my features.

---
**Note:**

> DataFrame styling code used in `explore_feature()` function adapted from this [source](https://stackoverflow.com/questions/59769161/python-color-pandas-dataframe-based-on-multiindex#:~:text=2-,You,-can%20use%20Styler).

---

# -- > 🛑 **FIX**: Update narrative per feature

## **Toggle Visualizations**

In [None]:
## Boolean setting to control whether to show the EDA visualizations
show_visualization = False

## `Reservation_Status`

---

> Text

---

In [None]:
## Reviewing details for reservation_status
eda.explore_feature(data,'reservation_status',plot_label ='Status',
                    plot_title= 'Reservation Status',
                    show_visualization = show_visualization);

---

**Feature Review**

> `Reservation_status` closely mirrors the values for my target feature, with some slight differences due to "no-show" values. **To prepare it for modeling, I will combine the `No-Show` status and `Canceled` values.**

**Actions**

> For the purposes of my analysis, **I will treat `No-Show` reservations as `Canceled` reservations**, as there are too few of them to use effectively as a third class.

****

> The most notable difference between the city and resort hotels would be the number of cancellations: *The city hotel shows a much larger proportion of canceled reservations vs. the resort hotel.* 
>  * This may be due to a variety of factors: resort guests may book when they are more certain of their plans, or the resort hotel may charge a cancellation fee.
>
> No-Show reservations are low for both hotels, supporting my decision to merge no-shows with cancellations. 

---

## `Is_Canceled`

---

**test**

---

In [None]:
## Reviewing details for 'is_canceled'
eda.explore_feature(data,'is_canceled',
                    normalize=False,
                    plot_label ='Cancellation Status',
                    plot_title= 'Cancellation Status',
                    show_visualization = show_visualization);

---

**Feature Review**

> After reviewing the results post-"no-show" conversion, `Is_canceled` is a binarization of the `reservation_status`. Reservations are indicated as cancellations if they either cancel or are marked as a "no-show" reservation.

**Actions**

> This feature is a better target feature as the values are already binarized and match the `reservation_status` feature for all of the reservations.
>
> **I will use `is_canceled` in place of the `reservation_status` feature as my target feature.**

****

> The breakdown between hotels is the same as `reservation_status` and confirms that the resort hotel experiences fewer cancellations vs. the city hotel.

---
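
> The relationship between `reservation_status` and `is_canceled` can be checked directly with a cross-tabulation. This is a minimal sketch on toy data (illustrative only). It assumes, as described above, that both `Canceled` and `No-Show` correspond to `is_canceled == 1`.

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({
    "reservation_status": ["Check-Out", "Canceled", "No-Show",
                           "Check-Out", "Canceled"],
    "is_canceled": [0, 1, 1, 0, 1],
})

# Rows: reservation status; columns: target value.
comparison = pd.crosstab(toy["reservation_status"], toy["is_canceled"])
print(comparison)

# Binarize the status and count rows that disagree with the target.
binarized = (toy["reservation_status"] != "Check-Out").astype(int)
mismatches = int((binarized != toy["is_canceled"]).sum())
print(f"Mismatched rows: {mismatches}")
```

> A mismatch count of zero on the real data would confirm that `is_canceled` is a binarized copy of `reservation_status`.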

## `Lead_Time`

---

**test**

---

In [None]:
## Reviewing details for 'lead_time'
eda.explore_feature(data,'lead_time',bins = 5, marginal = 'box',
                    plot_label ='Number of Days',
                    plot_title= 'Lead Time (Days)',
                    show_visualization = show_visualization);

---

**Feature Review**

> `Lead_Time` indicates how far in advance reservations are booked in days. *This information is particularly useful in hospitality for Revenue Management (RM) and Operations (Ops).*
>
>  * RM needs to know **when to expect bookings** and **when to monitor rates and availability** closely to make any necessary changes to optimize revenue.
>
>
>  * Ops uses this information to **forecast how many reservations will book in a short-term booking window** (I usually focused on 0-3 days prior to arrival).
>
> * **This forecast is critical for determining staffing and supplies in particular.** When building our schedules, we consider the current number of booked reservations and the forecasted bookings to determine how many staff members to schedule, whether we have enough supplies, etc.
>  * *Being the only staff member at the Front Desk during a rush of arrivals due to a snow storm is NOT fun!*

**Actions**

> I noticed there are a significant number of outliers for both properties. **I will remove the outliers based on their z-scores prior to modeling.**

****

> The histograms and box plots for both hotels match up closely, but it is clear that **the city hotel has a larger range of lead times for cancellations vs. the resort hotel.**

---
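
> To illustrate the short-term booking window mentioned above, lead times can be bucketed with `pd.cut`. Only the 0-3 day window comes from the text; the other bin edges are assumptions chosen for illustration, and the lead times are toy values.

```python
import pandas as pd

# Toy lead times in days (illustrative only).
lead_time = pd.Series([0, 1, 3, 5, 14, 45, 120, 400], name="lead_time")

# Right-inclusive bins: (-1, 3], (3, 30], (30, 90], (90, inf).
bins = [-1, 3, 30, 90, float("inf")]
labels = ["0-3 days", "4-30 days", "31-90 days", "91+ days"]

windows = pd.cut(lead_time, bins=bins, labels=labels)

# Share of reservations booked in each window.
share = windows.value_counts(normalize=True).sort_index()
print(share)
```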

## `Arrival_Date_Year`

---

**test**

---

In [None]:
## Reviewing details for 'arrival_date_year'
eda.explore_feature(data,'arrival_date_year',marginal = 'box',
                    plot_label ='Year',
                    plot_title= 'Arrival Date (Year)',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER 

**Actions**

> This feature is not useful for my future modeling, so I will drop this feature at the end of my EDA process.

---


## `Stays_in_Weekend_Nights`

---

**City**

---

In [None]:
## Reviewing details for 'stays_in_weekend_nights'
eda.explore_feature(data,'stays_in_weekend_nights',bins = 5,
                    marginal = 'box',
                    plot_label ='Number of Nights',
                    plot_title= 'Stays in Weekend Nights',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Stays_in_Week_Nights`

---

**test**

---

In [None]:
## Reviewing details for 'stays_in_week_nights'
eda.explore_feature(data,'stays_in_week_nights',bins = 5,
                    marginal = 'box',
                    plot_label ='Number of Week Nights',
                    plot_title= 'Stays in Week Nights',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Adults`

---

**test**

---

In [None]:
## Reviewing details for 'adults'
eda.explore_feature(data,'adults',bins = 3,
                    plot_label ='Number of Adults',
                    plot_title= 'Adults',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Children`

---

**data**

---

In [None]:
## Reviewing details for 'children'
eda.explore_feature(data,'children',bins = 5,
                    
                    plot_label ='Number of Children',
                    plot_title= 'Children',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Babies`

---

**data**

---

In [None]:
## Reviewing details - 'babies'
eda.explore_feature(data,'babies',bins = 5,
                    
                    plot_label ='Number of Babies',
                    plot_title= 'Babies',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Meal`

---

**data**

---

In [None]:
## Reviewing details for - 'meal'
eda.explore_feature(data,'meal',plot_label ='Types of Meal',
                    plot_title= 'Meal',show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Country`

---

**City**

---

In [None]:
## Reviewing details for 'country'
eda.explore_feature(data,'country',marginal = 'box',normalize=False,
                    plot_label ='Country',
                    plot_title= 'Country',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Market_Segment`

---

**City**

---

In [None]:
## Reviewing details for - 'market_segment'
eda.explore_feature(data,'market_segment',marginal = 'box',
                    plot_label ='Market Segment',
                    plot_title= 'Market Segment',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Is_Repeated_Guest`

---

**City**

---

In [None]:
## Reviewing details for 'is_repeated_guest'
eda.explore_feature(data,'is_repeated_guest',
                    plot_label =' Repeat Guest',
                    plot_title= ' Repeat Guest',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Previous_Cancellations`

---

**City**

---

In [None]:
## Reviewing details for 'previous_cancellations'
eda.explore_feature(data,'previous_cancellations',bins = 5,
                    normalize=False,
                    plot_label ='Number of Cancellations',
                    plot_title= 'Previous Cancellations',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Previous_Bookings_Not_Canceled`

---

**City**

---

In [None]:
## Reviewing details for 'previous_bookings_not_canceled'
eda.explore_feature(data,'previous_bookings_not_canceled',
                    bins = 5,marginal = 'box',
                    plot_label ='Number of Bookings Not Canceled',
                    plot_title= 'Previous Bookings Not Canceled',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Reserved_Room_Type`

---

**City**

---

In [None]:
## Reviewing details for - 'reserved_room_type'
eda.explore_feature(data,'reserved_room_type',
                    
                    plot_label ='Reserved Room Type',
                    plot_title= 'Reserved Room Type',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Assigned_Room_Type`

---

**City**

---

In [None]:
## Reviewing details for 'assigned_room_type'
eda.explore_feature(data,'assigned_room_type',
                    
                    plot_label ='Assigned Room Type',
                    plot_title= 'Assigned Room Type',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Booking_Changes`

---

**City**

---

In [None]:
## Reviewing details for 'booking_changes'
eda.explore_feature(data,'booking_changes',bins = 5,
                    
                    plot_label ='Number of Booking Changes',
                    plot_title= 'Booking Changes',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Deposit_Type`

---

**City**

---

In [None]:
## Reviewing details for 'deposit_type'
eda.explore_feature(data,'deposit_type',normalize=False,
                    plot_label ='Deposit Type',
                    plot_title= 'Deposit Type',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Days_in_Waiting_List`

---

**City**

---

In [None]:
## Reviewing details for 'days_in_waiting_list'
eda.explore_feature(data,'days_in_waiting_list',bins = 5,
                    normalize=False,
                    plot_label ='Days in Waiting List',
                    plot_title= 'Days in Waiting List',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Customer_Type`

---

**City**

---

In [None]:
## Reviewing details for 'customer_type'
eda.explore_feature(data,'customer_type',marginal = 'box',
                    plot_label ='Customer Type',
                    plot_title= 'Customer Type',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `ADR`

---

**City**

---

In [None]:
## Reviewing details for 'adr'
eda.explore_feature(data ,'adr',bins = 5,
                    plot_label ='ADR (€)',
                    plot_title= 'ADR (€)',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Required_Car_Parking_Spaces`

---

**City**

---

In [None]:
## Reviewing details for 'required_car_parking_spaces'
eda.explore_feature(data,'required_car_parking_spaces',bins = 5,
                    normalize=False,
                    plot_label ='Required Car Parking Spaces',
                    plot_title= 'Required Car Parking Spaces',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Total_of_Special_Requests`

---

**City**

---

In [None]:
## Reviewing details for 'total_of_special_requests'
eda.explore_feature(data,'total_of_special_requests',bins = 5,
                    marginal = 'box',
                    plot_label ='Total of Special Requests',
                    plot_title= 'Total of Special Requests',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Reservation_Status_Date`

---

**City**

---

In [None]:
## Reviewing details for 'reservation_status_date'
eda.explore_feature(data,'reservation_status_date',marginal = 'box',
                    bins=3,
                    plot_label ='Reservation Status Date',
                    plot_title= 'Reservation Status Date',
                    show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## `Agent_Group`

---

>

---

In [None]:
## Reviewing details for 'agent_group'
eda.explore_feature(data,'agent_group', plot_label ='Booking Agent Group',
                    plot_title= 'Agent Group', show_visualization = show_visualization);

---

**Feature Review**

> PLACEHOLDER

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

# 🛑 **Post-EDA**

---

**FIX/UPDATE ALL OF THE NARRATIVE ELEMENTS OF THE REST OF THIS CODE.**

---

---

**Finishing Touches**

> Now that I have reviewed all of my features, confirmed there are no missing values, and confirmed all of the datatypes are correct, I will finish the remaining preprocessing.

**Outliers**

> Based on my EDA, I noticed several features show significant outliers. If I kept these outlying data points, they could have a negative impact on my future models' performances.

**Process**
> I will first identify the features with outliers, then use the z-score of each data point to flag the outliers. Any data point with an absolute z-score greater than 3 will be considered an outlier and removed.

---
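
> The z-score rule described above can be sketched as follows. `drop_zscore_outliers` is a hypothetical helper, not a function from this notebook or from `bmc_functions`, and the toy data is synthetic. The population standard deviation (`ddof=0`) is an assumption.

```python
import numpy as np
import pandas as pd


def drop_zscore_outliers(df: pd.DataFrame, columns: list,
                         threshold: float = 3.0) -> pd.DataFrame:
    """Drop rows whose absolute z-score exceeds `threshold` in any given column."""
    subset = df[columns]
    # z = (x - mean) / std, computed per column.
    z_scores = (subset - subset.mean()) / subset.std(ddof=0)
    # Keep a row only if every listed column is within the threshold.
    keep = (z_scores.abs() <= threshold).all(axis=1)
    return df[keep]


# Synthetic data: 200 typical lead times plus one extreme value.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "lead_time": np.append(rng.normal(loc=100, scale=10, size=200), 1000.0)
})

filtered = drop_zscore_outliers(toy, ["lead_time"])
print(f"Rows before: {len(toy)}, rows after: {len(filtered)}")
```

> A column with zero variance would produce undefined z-scores, so constant columns should be excluded from `columns`.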

## `Reservation_Status`: Converting `No-Show` to `Canceled`

In [None]:
## Changing no-show values to "canceled"
data['reservation_status'] = data['reservation_status'].replace('No-Show',
                                                                'Canceled')

In [None]:
## Confirming the change
'No-Show' not in data['reservation_status'].values

In [None]:
## Inspecting the updated target classes
data['reservation_status'].value_counts(normalize=True, dropna=False)

### Review - `Reservation_Status`

---

> I successfully converted all `No-Show` values to `Canceled`, **resulting in a binary classification of whether a reservation will actualize (`Check-Out`) or not (`Canceled`).**

---

## Engineering `Arrival_Date`

---

**City**

---

In [None]:
## Converting from month, day of month, and year to a single datetime column
data['arrival_date'] = data['arrival_date_month'] +' '+ \
                                data['arrival_date_day_of_month']\
                                .astype(str) +', '+ \
                                data['arrival_date_year'].astype(str)
data['arrival_date'] = pd.to_datetime(data['arrival_date'])
data['arrival_date']

In [None]:
## Determining the day of the week of arrival 
data.loc[:,'arrival_day'] = data.loc[:,'arrival_date'].dt.day_name()
data['arrival_day']

In [None]:
## Reviewing results
data[['arrival_day', 'arrival_date']]

---

**Feature Review**

> I created this new feature to merge the arrival year/month/day-of-month features into one usable feature. 

**Actions**

> PLACEHOLDER

****

> PLACEHOLDER

---

## Dropping `Reservation_Status`

---

> 

---

In [None]:
## Dropping "reservation_status"
data.drop(columns = 'reservation_status', inplace=True)

In [None]:
## Confirming 'reservation_status' removal from dataframe
'reservation_status' not in data

## Dropping `Arrival_Date_Year`

---

> 

---

In [None]:
## Dropping "arrival_date_year"
data.drop(columns = 'arrival_date_year', inplace=True)

In [None]:
## Confirming 'arrival_date_year' removal from dataframe
'arrival_date_year' not in data

## Final Data Review

---

> 

---

In [None]:
data

# **Of Pandas and Pickles**

---

> Now I am ready to save the cleaned and processed data for modeling in my next notebook.
>
> In order to preserve the datatypes and details of my data, I will use the pandas `to_pickle` method to serialize the prepped dataframe and save it as a single gzip-compressed file.
>
> **First**, I will pickle the dataframe. **Then**, I will reopen the pickled file in my next notebook.

---
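
> A minimal round-trip sketch showing that pickling preserves datatypes. The toy dataframe and the temporary file name are illustrative assumptions; the real file path is the one used in the cell below.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with string, datetime, and integer columns (illustrative only).
toy = pd.DataFrame({
    "agent_group": ["9", "240", "999"],
    "arrival_date": pd.to_datetime(["2015-07-01", "2016-01-15", "2017-08-31"]),
    "is_canceled": [0, 1, 0],
})

# Write to a temporary directory, then read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_prepped.pickle")
    toy.to_pickle(path, compression="gzip")
    restored = pd.read_pickle(path, compression="gzip")

# Datatypes and values should survive the round trip unchanged.
dtypes_match = toy.dtypes.equals(restored.dtypes)
print(f"Datatypes preserved: {dtypes_match}")
```

> A CSV round trip, by contrast, would return the datetime column as plain strings unless it is parsed again on load.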

## Peter Panda Picked a Peck of Pickled DataFrames...

> 

In [None]:
## Pickling with Pandas
data.to_pickle(path='./data/data_prepped.pickle',
               compression='gzip')
print('Successfully pickled!')

# Moving to Modeling!

---

> Now that I have completed the pre-processing and EDA steps, I will move to my next notebook to perform my classification modeling.

---