# Hotel Customer Segmentation - EDA
##### Lindsey Robertson

## Objective

Investigate outliers, analyze relationships between features, and visualize the correlations observed.

## Data

This real-world customer dataset uses 31 variables to describe
83,590 instances (customers) of a hotel in Lisbon, Portugal.
The variables cover personal, behavioral,
demographic, and geographical information collected over 3 full years.
The dataset can be found on Kaggle [here](https://www.kaggle.com/datasets/nantonio/a-hotels-customers-dataset).

Kaggle dataset origin, domain assumptions and data collection information: 

Nuno Antonio, Ana de Almeida, Luis Nunes. A hotel's customer's personal, behavioral, demographic, and geographic dataset from Lisbon, Portugal (2015-2018). Data in Brief 33(2020)106583, 24(November), 2020. URL: https://www.sciencedirect.com/journal/data-in-brief.


## Data Assumptions

* Some hotels have a policy of creating a profile for each guest companion (adult or child) only in particular cases and, at times, with authorization.
* Typically, a customer profile is created by one of three events:
    - customer's first check-out at the hotel
    - customer's first cancelation
    - customer's first no-show
* Sometimes there is more than one profile for the same customer.
* Only after the customer's first stay can hotels confirm the guest's personal details, such as nationality.

## Hypothesis

We can segment these customers based on recency, frequency, and monetary value (RFM), and use those segments to predict their future value.
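As a rough illustration of what RFM segmentation could look like, the sketch below scores a toy table on each dimension using rank-based terciles. The column names (`DaysSinceLastStay`, `BookingsCheckedIn`, `LodgingRevenue`, `OtherRevenue`) and the three-bin scoring are assumptions made for this example, not choices confirmed by this notebook.

```python
import pandas as pd

# Toy data for illustration only; column names are assumed, not taken from this notebook.
toy = pd.DataFrame({
    "DaysSinceLastStay": [10, 400, 35, 900, 120, 60],   # recency: fewer days is better
    "BookingsCheckedIn": [5, 1, 3, 1, 2, 4],            # frequency
    "LodgingRevenue": [1500.0, 120.0, 640.0, 80.0, 300.0, 980.0],
    "OtherRevenue": [200.0, 10.0, 90.0, 0.0, 40.0, 150.0],
})


def rfm_scores(df, bins=3):
    """Score each customer from 1 (worst) to `bins` (best) on R, F and M."""
    monetary = df["LodgingRevenue"] + df["OtherRevenue"]
    out = pd.DataFrame(index=df.index)
    # rank(method="first") breaks ties so that qcut always gets unique bin edges.
    # Recency is ranked in descending order so the most recent stay scores highest.
    recency_rank = df["DaysSinceLastStay"].rank(method="first", ascending=False)
    frequency_rank = df["BookingsCheckedIn"].rank(method="first")
    monetary_rank = monetary.rank(method="first")
    out["R"] = pd.qcut(recency_rank, bins, labels=False) + 1
    out["F"] = pd.qcut(frequency_rank, bins, labels=False) + 1
    out["M"] = pd.qcut(monetary_rank, bins, labels=False) + 1
    out["RFM"] = out[["R", "F", "M"]].sum(axis=1)
    return out


scores = rfm_scores(toy)
print(scores)
```

Ranking before binning is one way to cope with heavily tied values, such as many customers with a single stay. Other scoring schemes (fixed thresholds, clustering) would be equally valid starting points.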

A separate hypothesis is that we can predict a customer's cancelation or no-show from these customer records.

## Questions to guide analysis:

Revenue:
1) Does missing revenue need to be imputed? If so, how?
2) Is it bad for CLV prediction to have so much missing revenue data?
3) Can we keep the zero revenues as reported in our analysis, even though they cover a high percentage of customers?

Age:
1) Will the customers with a recorded age of zero (under 5%) affect our analysis?
2) 
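Before answering the revenue and age questions above, it helps to quantify how common the zero values are. The sketch below does this on toy data. The column names (`Age`, `LodgingRevenue`, `OtherRevenue`) are assumptions for illustration, and the toy percentages do not reflect the real dataset.

```python
import pandas as pd

# Toy data for illustration only; column names are assumed.
toy = pd.DataFrame({
    "Age": [34, 0, 51, 28, 0, 45, 62, 39],
    "LodgingRevenue": [0.0, 0.0, 320.5, 0.0, 150.0, 0.0, 89.9, 0.0],
    "OtherRevenue": [0.0, 12.0, 40.0, 0.0, 0.0, 0.0, 15.0, 0.0],
})


def zero_share(df, columns):
    """Return the percentage of rows equal to zero for each listed column."""
    return (df[columns].eq(0).mean() * 100).round(1)


shares = zero_share(toy, ["Age", "LodgingRevenue", "OtherRevenue"])

# Percentage of customers with no revenue of any kind.
revenue_cols = ["LodgingRevenue", "OtherRevenue"]
no_revenue_pct = float((toy[revenue_cols].sum(axis=1) == 0).mean() * 100)

print(shares)
print(f"Customers with zero total revenue: {no_revenue_pct:.1f}%")
```

On the real data, the same helper could be called on `Data` once the actual column names are confirmed.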

## Process:

1) [Categorical feature analysis - cardinality, usability](#categorical-analysis)
2) [Numerical feature analysis - distributions, correlations, usability](#numerical-feature-analysis)
3) 

## Import Libraries

In [12]:
pip install plotly

Collecting plotly
  Downloading plotly-5.9.0-py2.py3-none-any.whl (15.2 MB)
     --------------------------------------- 15.2/15.2 MB 24.2 MB/s eta 0:00:00
Collecting tenacity>=6.2.0
  Using cached tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.9.0 tenacity-8.0.1
Note: you may need to restart the kernel to use updated packages.




In [16]:
import plotly.express as px

## Import data

Import previous wrangling notebook progress

In [1]:
from IPython.utils import io
with io.capture_output() as captured:
    %run 1_wrangling.ipynb

## Categorical analysis

Create a separate dataframe containing only the categorical variables.


In [7]:
data_cat = Data.select_dtypes(include = 'object').copy()
data_cat.head(2).T

| | 0 | 1 |
|---|---|---|
| Nationality | PRT | PRT |
| NameHash | 0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375... | 0x21EDE41906B45079E75385B5AA33287CA09DE1AB86DE... |
| DocIDHash | 0x71568459B729F7A7ABBED6C781A84CA4274D571003AC... | 0x5FA1E0098A31497057C5A6B9FE9D49FD6DD47CCE7C26... |
| DistributionChannel | Corporate | Travel Agent/Operator |
| MarketSegment | Corporate | Travel Agent/Operator |


The hash columns (`NameHash`, `DocIDHash`) will be of little use in our analysis and should be removed. The other columns will be beneficial.
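A minimal sketch of removing the hash columns, shown on a toy frame with placeholder hash strings. Selecting columns by the `Hash` suffix is an assumption based on the two column names visible above.

```python
import pandas as pd

# Toy stand-in for data_cat; the hash values are placeholders, not real hashes.
toy_cat = pd.DataFrame({
    "Nationality": ["PRT", "PRT"],
    "NameHash": ["hash_a", "hash_b"],
    "DocIDHash": ["hash_c", "hash_d"],
    "DistributionChannel": ["Corporate", "Travel Agent/Operator"],
    "MarketSegment": ["Corporate", "Travel Agent/Operator"],
})

# Pick up every column whose name ends in "Hash" and drop it.
hash_cols = [col for col in toy_cat.columns if col.endswith("Hash")]
cleaned = toy_cat.drop(columns=hash_cols)

print(hash_cols)
print(cleaned.columns.tolist())
```

The same two lines would apply to `data_cat` directly. Note that the hashes may still be useful elsewhere, for example to detect duplicate profiles for the same customer, so dropping them only from the analysis copy keeps that option open.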

### Nationality

In [14]:
nationality = Data['Nationality'].nunique()
print(nationality)

188
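With 188 distinct values, `Nationality` is a high-cardinality feature. To compare it against the other categorical columns, a small summary of unique counts and the share of the most frequent value can help. The sketch below runs on toy data; the helper name `cardinality_summary` is hypothetical.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Nationality": ["FRA", "PRT", "DEU", "FRA", "GBR", "PRT"],
    "DistributionChannel": [
        "Corporate", "Travel Agent/Operator", "Travel Agent/Operator",
        "Travel Agent/Operator", "Corporate", "Travel Agent/Operator",
    ],
})


def cardinality_summary(df):
    """Return unique-value counts and the share of the most common value per column."""
    return pd.DataFrame({
        "n_unique": df.nunique(),
        # value_counts sorts descending, so the first entry is the most common value.
        "top_share": df.apply(lambda s: s.value_counts(normalize=True).iloc[0]).round(3),
    })


summary = cardinality_summary(toy)
print(summary)
```

On the real data this could be applied to `data_cat` after the hash columns are removed.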


In [20]:
top_nationality = Data['Nationality'].value_counts().head(20)
fig = px.pie(values=top_nationality.values, names=top_nationality.keys(), title='Distribution of the Top 20 Nationalities')
fig.show()

In [21]:
top_nationality

FRA    12422
PRT    11597
DEU    10232
GBR     8656
ESP     4902
USA     3429
ITA     3365
BEL     3119
BRA     2902
NLD     2725
CHE     2108
IRL     1996
CAN     1524
AUT     1489
SWE     1231
ISR      900
CHN      891
NOR      795
POL      760
AUS      723
Name: Nationality, dtype: int64
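Because the pie chart is built from the top 20 nationalities only, its percentages are relative to those 20 groups rather than to all customers. The sketch below uses the counts printed above to estimate how much of the customer base the top 20 cover. It assumes the frame still holds all 83,590 customers after wrangling, which this notebook does not confirm. Under that assumption the top 20 account for about 91% of customers.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
top_nationality = pd.Series({
    "FRA": 12422, "PRT": 11597, "DEU": 10232, "GBR": 8656, "ESP": 4902,
    "USA": 3429, "ITA": 3365, "BEL": 3119, "BRA": 2902, "NLD": 2725,
    "CHE": 2108, "IRL": 1996, "CAN": 1524, "AUT": 1489, "SWE": 1231,
    "ISR": 900, "CHN": 891, "NOR": 795, "POL": 760, "AUS": 723,
})

# Assumption: no rows were dropped during wrangling, so the total matches
# the 83,590 customers quoted in the Data section.
TOTAL_CUSTOMERS = 83_590

top20_total = int(top_nationality.sum())
top20_share = top20_total / TOTAL_CUSTOMERS

print(f"Top 20 nationalities cover {top20_total} customers ({top20_share:.1%})")
```

In the notebook itself, `len(Data)` would be a safer denominator than the hard-coded total.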

## Numerical feature analysis

### Age