# Customer Classification - EDA Notebook

**Who?**
>* 🏢 **Revenue Management (RM) teams** for hotel groups (corporate, franchise)
>
>
>* 🏨 On-site GMs, Sales, and Ops teams

**Why?**
>* 💰 **Revenue Management:** 
>  * Revenue optimization: Right price, right time, right customer
>    * Dynamic pricing
>    * Distribution channels
>    * Pricing per room type
>
>
>* 🤝 **Sales:**
>  * Group sales (pickup/wash)
>  * BT (performance/company for both GPP and LNR rates)
>
>
>* 🛌 **Rooms Ops:**
>  * Forecasting occupancy, arrivals, departures, stay-overs, same-day booking demand, and probability of guest relocation in the case of oversell.
>  * Determining staff schedules and periods of high demand
>
>
>* 🍰 ☕ **Food and Beverage:**
>  * Ordering food/supplies overall
>  * Scheduling staff
>  * Determining busy times (breakfast, lunch, dinner)
>    * Staffing, specific food/supplies

**What?**
>* 🧾 The dataset comprises:
>  * 32 different features
>    * Detailed explanation of features (and sub-categories, when appropriate) available in Readme
>  * Nearly 120,000 reservation records
>  * Source cited in Readme

❌ **How?**
>* Which models/methods? 
>* Data prep and feature engineering

---

> **Goal:** To prepare data for time series modeling and forecasting in the next notebook.
>
>
> **Purpose:** To explore, clean, and organize the data.
>
>
> **Process:**
>
>    * Inspecting data integrity and statistics
>    * Splitting data by hotel type ("City" vs. "Resort")
>    * Filling any missing values
>    * 
>    * Saving processed data for the modeling notebook
>
>
> **Modeling Notebook:**
>
>    * Performing train/test split
>    * 
>    * Training the model
>    * 
>    * Evaluating performance metrics
>    * Providing final recommendations
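
The process steps above (split by hotel type, fill missing values, save) can be sketched roughly as follows. This is a minimal illustration on invented rows, not the notebook's actual pipeline: the `prepare` helper, the `'Unknown'` fill value, and the in-memory buffer are all assumptions made for the example.

```python
import io
import pandas as pd

# Tiny stand-in for hotel_bookings.csv (values invented for illustration only)
raw = pd.DataFrame({
    'hotel': ['City Hotel', 'Resort Hotel', 'City Hotel', 'Resort Hotel'],
    'country': ['PRT', None, 'FRA', 'GBR'],
    'adr': [76.5, 98.0, 104.4, 0.0],
})

def prepare(df, hotel_name):
    """Hypothetical helper: select one hotel type, fill gaps, drop the constant column."""
    subset = df[df['hotel'] == hotel_name].copy()   # .copy() gives an independent frame
    subset['country'] = subset['country'].fillna('Unknown')  # assumed fill value
    return subset.drop(columns='hotel')

city = prepare(raw, 'City Hotel')
resort = prepare(raw, 'Resort Hotel')

# Written to an in-memory buffer here; a real run would pass a file path instead
buffer = io.StringIO()
city.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Keeping the steps in one function means the same treatment is applied to both hotel types, which is easier to audit than two copies of the same cells.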

---

# To-Do List

---

**Copy:**
- [ ] Imports
- [ ] Personal module
- [ ] Data
- [ ] Starter code from P4P

**Links:**
- [ ] 

---

# Import Packages

In [1]:
## Data Handling
import pandas as pd
import numpy as np
from scipy import stats

## Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

## Modeling - SKLearn
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn import set_config
set_config(display='diagram')


## Custom-made Functions
from bmc_functions import eda
from bmc_functions import classification as clf

## Settings
plt.style.use('seaborn-talk')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 100)  # full option name; the bare 'max_rows' pattern is ambiguous in newer pandas
%matplotlib inline

In [2]:
%load_ext autoreload
%autoreload 2

# Read Data

In [3]:
## Reading data
source = './data/hotel_bookings.csv'
data = pd.read_csv(source)
data

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.00,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.00,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,City Hotel,0,23,2017,August,35,30,2,5,2,0.00,0,BB,BEL,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,394.00,,0,Transient,96.14,0,0,Check-Out,2017-09-06
119386,City Hotel,0,102,2017,August,35,31,2,5,3,0.00,0,BB,FRA,Online TA,TA/TO,0,0,0,E,E,0,No Deposit,9.00,,0,Transient,225.43,0,2,Check-Out,2017-09-07
119387,City Hotel,0,34,2017,August,35,31,2,5,2,0.00,0,BB,DEU,Online TA,TA/TO,0,0,0,D,D,0,No Deposit,9.00,,0,Transient,157.71,0,4,Check-Out,2017-09-07
119388,City Hotel,0,109,2017,August,35,31,2,5,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,89.00,,0,Transient,104.40,0,0,Check-Out,2017-09-07


# EDA

In [4]:
## Inspecting percentage of city vs. resort hotels
data['hotel'].value_counts(1)

City Hotel     0.66
Resort Hotel   0.34
Name: hotel, dtype: float64

## Splitting "City" and "Resort" 

In [5]:
## Creating subgroup for city hotels
subgroup_city = data[data['hotel'] == 'City Hotel']
subgroup_city

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
40060,City Hotel,0,6,2015,July,27,1,0,2,1,0.00,0,HB,PRT,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,6.00,,0,Transient,0.00,0,0,Check-Out,2015-07-03
40061,City Hotel,1,88,2015,July,27,1,0,4,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,76.50,0,1,Canceled,2015-07-01
40062,City Hotel,1,65,2015,July,27,1,0,4,1,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,68.00,0,1,Canceled,2015-04-30
40063,City Hotel,1,92,2015,July,27,1,2,4,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,76.50,0,2,Canceled,2015-06-23
40064,City Hotel,1,100,2015,July,27,2,0,2,2,0.00,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.00,,0,Transient,76.50,0,1,Canceled,2015-04-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,City Hotel,0,23,2017,August,35,30,2,5,2,0.00,0,BB,BEL,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,394.00,,0,Transient,96.14,0,0,Check-Out,2017-09-06
119386,City Hotel,0,102,2017,August,35,31,2,5,3,0.00,0,BB,FRA,Online TA,TA/TO,0,0,0,E,E,0,No Deposit,9.00,,0,Transient,225.43,0,2,Check-Out,2017-09-07
119387,City Hotel,0,34,2017,August,35,31,2,5,2,0.00,0,BB,DEU,Online TA,TA/TO,0,0,0,D,D,0,No Deposit,9.00,,0,Transient,157.71,0,4,Check-Out,2017-09-07
119388,City Hotel,0,109,2017,August,35,31,2,5,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,89.00,,0,Transient,104.40,0,0,Check-Out,2017-09-07


In [6]:
## Creating subgroup for resort hotels
subgroup_resort = data[data['hotel'] == 'Resort Hotel']
subgroup_resort

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.00,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.00,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.00,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.00,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40055,Resort Hotel,0,212,2017,August,35,31,2,8,2,1.00,0,BB,GBR,Offline TA/TO,TA/TO,0,0,0,A,A,1,No Deposit,143.00,,0,Transient,89.75,0,0,Check-Out,2017-09-10
40056,Resort Hotel,0,169,2017,August,35,30,2,9,2,0.00,0,BB,IRL,Direct,Direct,0,0,0,E,E,0,No Deposit,250.00,,0,Transient-Party,202.27,0,1,Check-Out,2017-09-10
40057,Resort Hotel,0,204,2017,August,35,29,4,10,2,0.00,0,BB,IRL,Direct,Direct,0,0,0,E,E,0,No Deposit,250.00,,0,Transient,153.57,0,3,Check-Out,2017-09-12
40058,Resort Hotel,0,211,2017,August,35,31,4,10,2,0.00,0,HB,GBR,Offline TA/TO,TA/TO,0,0,0,D,D,0,No Deposit,40.00,,0,Contract,112.80,0,1,Check-Out,2017-09-14


## Reviewing Statistics

In [7]:
eda.report_df(subgroup_city).sort_values('null_sum', ascending=False)

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
company,75641,0.95,float64,207,3689.0,145.27,119.77,8.0,40.0,91.0,219.0,497.0
agent,8131,0.1,float64,223,71199.0,28.14,56.43,1.0,9.0,9.0,17.0,509.0
country,24,0.0,object,166,,,,,,,,
children,4,0.0,float64,4,79326.0,0.09,0.37,0.0,0.0,0.0,0.0,3.0
adr,0,0.0,float64,5405,79330.0,105.3,43.6,0.0,79.2,99.9,126.0,5400.0
lead_time,0,0.0,int64,453,79330.0,109.74,110.95,0.0,23.0,74.0,163.0,629.0
market_segment,0,0.0,object,8,,,,,,,,
meal,0,0.0,object,4,,,,,,,,
previous_bookings_not_canceled,0,0.0,int64,73,79330.0,0.13,1.69,0.0,0.0,0.0,0.0,72.0
previous_cancellations,0,0.0,int64,10,79330.0,0.08,0.42,0.0,0.0,0.0,0.0,21.0
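
`eda.report_df` comes from the personal `bmc_functions` module, so its source is not shown here. Judging only from the columns in the output above, a roughly equivalent report could be assembled with plain pandas. The function name `report_df_sketch` and the demo values are hypothetical.

```python
import pandas as pd

def report_df_sketch(df):
    """Hypothetical stand-in for eda.report_df: one summary row per column."""
    summary = pd.DataFrame({
        'null_sum': df.isna().sum(),          # count of missing values
        'null_pct': df.isna().mean(),         # share of missing values
        'datatypes': df.dtypes.astype(str),
        'num_unique': df.nunique(),           # NaN is not counted as a value
    })
    # describe() covers numeric columns only; transpose so columns become rows
    stats = df.describe().T
    return summary.join(stats)

demo = pd.DataFrame({
    'adr': [0.0, 76.5, 68.0, 76.5],
    'company': [None, None, None, 40.0],
    'meal': ['HB', 'BB', 'BB', 'BB'],
})
report = report_df_sketch(demo).sort_values('null_sum', ascending=False)
print(report)
```

Non-numeric columns such as `meal` get empty statistics after the join, which matches the blank cells for `country`, `market_segment`, and `meal` in the output above.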


In [None]:
## Candidate feature groups, stored as lists of column names
cont = ['adr', 'lead_time', 'previous_bookings_not_canceled', 'previous_cancellations']
cat = ['market_segment', 'meal', 'is_canceled']

In [8]:
# Dropping "company" column (95% missing values) and "hotel" column (only one value).
# Reassigning instead of using inplace=True avoids pandas' SettingWithCopyWarning,
# since subgroup_city was created as a filtered slice of data.
subgroup_city = subgroup_city.drop(columns=['company', 'hotel'])
subgroup_city


Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
40060,0,6,2015,July,27,1,0,2,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,6.0,0,Transient,0.0,0,0,Check-Out,2015-07-03
40061,1,88,2015,July,27,1,0,4,2,0.0,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.0,0,Transient,76.5,0,1,Canceled,2015-07-01
40062,1,65,2015,July,27,1,0,4,1,0.0,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.0,0,Transient,68.0,0,1,Canceled,2015-04-30
40063,1,92,2015,July,27,1,2,4,2,0.0,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.0,0,Transient,76.5,0,2,Canceled,2015-06-23
40064,1,100,2015,July,27,2,0,2,2,0.0,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,9.0,0,Transient,76.5,0,1,Canceled,2015-04-02


In [14]:
subgroup_city.columns

Index(['is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month',
       'arrival_date_week_number', 'arrival_date_day_of_month',
       'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children',
       'babies', 'meal', 'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')

## is_canceled

In [79]:
subgroup_city['is_canceled'].value_counts(1, dropna=False)

0   0.58
1   0.42
Name: is_canceled, dtype: float64
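
The 58/42 split sets the bar any classifier has to beat: always predicting the majority class would already be right about 58% of the time. A small sketch of that calculation on an invented series with the same proportions (`normalize=True` is the keyword form of the positional `1` used above):

```python
import pandas as pd

# Illustrative target with the same 58/42 split as the output above
is_canceled = pd.Series([0] * 58 + [1] * 42, name='is_canceled')

class_share = is_canceled.value_counts(normalize=True)
majority_class = class_share.idxmax()     # label of the most frequent class
baseline_accuracy = class_share.max()     # accuracy of always predicting it

print(f'Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}')
```

scikit-learn's `DummyClassifier(strategy='most_frequent')`, already imported at the top of this notebook, produces the same baseline inside a pipeline.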

## lead_time

In [84]:
subgroup_city['lead_time'].describe()

count   79,330.00
mean       109.74
std        110.95
min          0.00
25%         23.00
50%         74.00
75%        163.00
max        629.00
Name: lead_time, dtype: float64

## Arrival as Datetime

In [19]:
## Converting from month, day of month, and year to a single datetime column
## (.assign returns a new DataFrame, which avoids SettingWithCopyWarning)
arrival_str = (subgroup_city['arrival_date_month'] + ' '
               + subgroup_city['arrival_date_day_of_month'].astype(str) + ', '
               + subgroup_city['arrival_date_year'].astype(str))
subgroup_city = subgroup_city.assign(
    arrival_date=pd.to_datetime(arrival_str, format='%B %d, %Y'))
subgroup_city['arrival_date']


40060    2015-07-01
40061    2015-07-01
40062    2015-07-01
40063    2015-07-01
40064    2015-07-02
            ...    
119385   2017-08-30
119386   2017-08-31
119387   2017-08-31
119388   2017-08-31
119389   2017-08-29
Name: arrival_date, Length: 79330, dtype: datetime64[ns]
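
With a proper datetime column, the reservation-level rows can be rolled up into a regular daily series, which is the shape a time series model in the next notebook would likely expect. The sketch below uses invented rows, and the output column names `bookings` and `cancellations` are assumptions.

```python
import pandas as pd

# Invented reservation rows for illustration
bookings = pd.DataFrame({
    'arrival_date': pd.to_datetime(['2015-07-01', '2015-07-01',
                                    '2015-07-02', '2015-07-04']),
    'is_canceled': [0, 1, 1, 0],
})

daily = (bookings.groupby('arrival_date')
                 .agg(bookings=('is_canceled', 'size'),       # rows per arrival date
                      cancellations=('is_canceled', 'sum'))   # canceled rows per date
                 .asfreq('D', fill_value=0))                  # insert missing days as zeros
print(daily)
```

`asfreq('D', fill_value=0)` matters here: dates with no arrivals would otherwise be absent from the index rather than recorded as zero demand.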

## stays_in_weekend_nights

In [82]:
subgroup_city['stays_in_weekend_nights'].describe()

count   79,330.00
mean         0.80
std          0.89
min          0.00
25%          0.00
50%          1.00
75%          2.00
max         16.00
Name: stays_in_weekend_nights, dtype: float64

## stays_in_week_nights

In [76]:
subgroup_city['stays_in_week_nights'].describe()

count   79,330.00
mean         2.18
std          1.46
min          0.00
25%          1.00
50%          2.00
75%          3.00
max         41.00
Name: stays_in_week_nights, dtype: float64

## adults

In [51]:
subgroup_city['adults'].value_counts(1, dropna=False)

2   0.73
1   0.20
3   0.06
0   0.00
4   0.00
Name: adults, dtype: float64

## children

In [52]:
subgroup_city['children'].value_counts(1, dropna=False)

0.00   0.94
1.00   0.04
2.00   0.03
3.00   0.00
nan    0.00
Name: children, dtype: float64
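
`children` is stored as float only because a handful of rows are missing. One possible treatment, assuming a missing value means no children were booked (an assumption that should be checked against the Readme), is to fill with zero and cast back to integers:

```python
import pandas as pd

# Invented values; None stands in for the few missing rows
children = pd.Series([0.0, 1.0, None, 2.0], name='children')

# Assumption: a missing count means zero children
children_clean = children.fillna(0).astype(int)
print(children_clean.value_counts(normalize=True, dropna=False))
```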

## babies

In [53]:
subgroup_city['babies'].value_counts(1, dropna=False)

0    1.00
1    0.00
2    0.00
10   0.00
9    0.00
Name: babies, dtype: float64

In [87]:
subgroup_city['babies'].describe()

count   79,330.00
mean         0.00
std          0.08
min          0.00
25%          0.00
50%          0.00
75%          0.00
max         10.00
Name: babies, dtype: float64

## meal

In [54]:
subgroup_city['meal'].value_counts(1, dropna=False)

BB   0.79
SC   0.13
HB   0.08
FB   0.00
Name: meal, dtype: float64

## country

In [55]:
subgroup_city['country'].value_counts(1, dropna=False)

PRT   0.39
FRA   0.11
DEU   0.08
GBR   0.07
ESP   0.06
      ... 
MLI   0.00
UMI   0.00
DMA   0.00
MMR   0.00
PYF   0.00
Name: country, Length: 167, dtype: float64
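
With 167 distinct values, most of them near 0%, one-hot encoding `country` directly would create many almost-empty columns. A common option is to collapse rare levels into a single bucket first. The 10% cut-off and the `'Other'` label below are placeholders chosen for the toy data, not recommendations.

```python
import pandas as pd

# Invented distribution: two frequent countries and two rare ones
country = pd.Series(['PRT'] * 6 + ['FRA'] * 3 + ['MLI', 'DMA'], name='country')

share = country.value_counts(normalize=True)
threshold = 0.10                                   # hypothetical cut-off
rare = share[share < threshold].index              # levels below the cut-off
country_grouped = country.where(~country.isin(rare), 'Other')

print(country_grouped.value_counts())
```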

## market_segment

In [56]:
subgroup_city['market_segment'].value_counts(1, dropna=False)

Online TA       0.49
Offline TA/TO   0.21
Groups          0.18
Direct          0.08
Corporate       0.04
Complementary   0.01
Aviation        0.00
Undefined       0.00
Name: market_segment, dtype: float64

## distribution_channel

In [57]:
subgroup_city['distribution_channel'].value_counts(1, dropna=False)

TA/TO       0.87
Direct      0.09
Corporate   0.04
GDS         0.00
Undefined   0.00
Name: distribution_channel, dtype: float64

## is_repeated_guest

In [58]:
subgroup_city['is_repeated_guest'].value_counts(1, dropna=False)

0   0.97
1   0.03
Name: is_repeated_guest, dtype: float64

## previous_cancellations

In [88]:
subgroup_city['previous_cancellations'].describe()

count   79,330.00
mean         0.08
std          0.42
min          0.00
25%          0.00
50%          0.00
75%          0.00
max         21.00
Name: previous_cancellations, dtype: float64

## previous_bookings_not_canceled

In [89]:
subgroup_city['previous_bookings_not_canceled'].describe()

count   79,330.00
mean         0.13
std          1.69
min          0.00
25%          0.00
50%          0.00
75%          0.00
max         72.00
Name: previous_bookings_not_canceled, dtype: float64

## reserved_room_type

In [64]:
subgroup_city['reserved_room_type'].value_counts(1, dropna=False)

A   0.79
D   0.15
F   0.02
E   0.02
B   0.01
G   0.01
C   0.00
P   0.00
Name: reserved_room_type, dtype: float64

## assigned_room_type

In [65]:
subgroup_city['assigned_room_type'].value_counts(1, dropna=False)

A   0.72
D   0.19
E   0.03
F   0.03
B   0.03
G   0.01
K   0.00
C   0.00
P   0.00
Name: assigned_room_type, dtype: float64

## booking_changes

In [90]:
subgroup_city['booking_changes'].describe()

count   79,330.00
mean         0.19
std          0.61
min          0.00
25%          0.00
50%          0.00
75%          0.00
max         21.00
Name: booking_changes, dtype: float64

## deposit_type

In [67]:
subgroup_city['deposit_type'].value_counts(1, dropna=False)

No Deposit   0.84
Non Refund   0.16
Refundable   0.00
Name: deposit_type, dtype: float64

## agent

In [68]:
subgroup_city['agent'].value_counts(1, dropna=False)

9.00     0.40
nan      0.10
1.00     0.09
14.00    0.05
7.00     0.04
         ... 
280.00   0.00
444.00   0.00
54.00    0.00
270.00   0.00
303.00   0.00
Name: agent, Length: 224, dtype: float64
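
`agent` holds ID codes, so the float dtype and its summary statistics are not meaningful, and the 10% of missing values may simply mean the booking did not involve an agent. A sketch of recoding it as a label, where `'no_agent'` is a made-up placeholder:

```python
import pandas as pd

# Invented IDs; None stands in for bookings without an agent
agent = pd.Series([9.0, None, 1.0, 9.0, 14.0], name='agent')

# Turn 9.0 into '9' and give missing values their own explicit level
agent_id = agent.map(lambda a: 'no_agent' if pd.isna(a) else str(int(a)))
print(agent_id.value_counts(normalize=True))
```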

## days_in_waiting_list

In [91]:
subgroup_city['days_in_waiting_list'].describe()

count   79,330.00
mean         3.23
std         20.87
min          0.00
25%          0.00
50%          0.00
75%          0.00
max        391.00
Name: days_in_waiting_list, dtype: float64

## customer_type

In [70]:
subgroup_city['customer_type'].value_counts(1, dropna=False)

Transient         0.75
Transient-Party   0.22
Contract          0.03
Group             0.00
Name: customer_type, dtype: float64

## adr

In [72]:
subgroup_city['adr'].describe()

count   79,330.00
mean       105.30
std         43.60
min          0.00
25%         79.20
50%         99.90
75%        126.00
max      5,400.00
Name: adr, dtype: float64
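
The maximum of 5,400 sits far above the 75th percentile of 126, which points to at least one extreme value. One conventional way to flag such rows is the 1.5 × IQR rule, sketched here on the quartile values from the output above used as toy data. Only the upper fence is checked, and whether flagged rows should be dropped, capped, or kept is a separate decision.

```python
import pandas as pd

# Toy series built from the summary values above
adr = pd.Series([0.0, 79.2, 99.9, 126.0, 5400.0], name='adr')

q1, q3 = adr.quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr          # upper fence of the 1.5 * IQR rule
outliers = adr[adr > upper]

print(f'Upper fence: {upper:.1f}')
print(outliers)
```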

## required_car_parking_spaces

In [73]:
subgroup_city['required_car_parking_spaces'].value_counts(1, dropna=False)

0   0.98
1   0.02
2   0.00
3   0.00
Name: required_car_parking_spaces, dtype: float64

## total_of_special_requests

In [74]:
subgroup_city['total_of_special_requests'].value_counts(1, dropna=False)

0   0.60
1   0.27
2   0.10
3   0.02
4   0.00
5   0.00
Name: total_of_special_requests, dtype: float64

## reservation_status

In [75]:
subgroup_city['reservation_status'].value_counts(1, dropna=False)

Check-Out   0.58
Canceled    0.41
No-Show     0.01
Name: reservation_status, dtype: float64
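
These shares line up closely with `is_canceled` (Check-Out 0.58 against 0.58 not canceled; Canceled plus No-Show 0.42 against 0.42 canceled), which suggests `reservation_status` may encode the target and would leak it into a model. That reading is inferred from the proportions only. A cross-tabulation such as the sketch below, shown on invented rows, would confirm or refute it on the real data.

```python
import pandas as pd

# Invented rows for illustration
demo = pd.DataFrame({
    'is_canceled': [0, 1, 1, 0, 1],
    'reservation_status': ['Check-Out', 'Canceled', 'No-Show',
                           'Check-Out', 'Canceled'],
})

table = pd.crosstab(demo['reservation_status'], demo['is_canceled'])
# If each status maps to a single target value, every row has exactly one non-zero cell
is_deterministic = bool(((table > 0).sum(axis=1) == 1).all())

print(table)
print('Status fully determines is_canceled:', is_deterministic)
```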

## reservation_status_date

In [44]:
subgroup_city['reservation_status_date']

40060     2015-07-03
40061     2015-07-01
40062     2015-04-30
40063     2015-06-23
40064     2015-04-02
             ...    
119385    2017-09-06
119386    2017-09-07
119387    2017-09-07
119388    2017-09-07
119389    2017-09-07
Name: reservation_status_date, Length: 79330, dtype: object
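
The `dtype: object` shows these dates are still strings. Since the values above are in ISO `YYYY-MM-DD` form, a conversion along these lines should work, shown here on a few of the displayed values:

```python
import pandas as pd

status_date = pd.Series(['2015-07-03', '2015-07-01', '2015-04-30'],
                        name='reservation_status_date')

# An explicit format makes parsing fast and fails loudly on unexpected strings
status_date_dt = pd.to_datetime(status_date, format='%Y-%m-%d')
print(status_date_dt.dtype)
```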