# Business Case Description
We have chosen a data set that deals with hotel bookings in two different hotel chains. The data set contains booking information for a "city hotel" and a "resort hotel", and includes information such as when the booking was made, the length of stay, the number of adults, children, and/or babies, the number of required car parking spaces, and whether the booking was canceled. The dataset extends from 2015 through 2017.

We are interested in finding out whether it is possible to predict when people are likely to cancel their hotel bookings. Furthermore, we would like to find out at what time of the year cancellations occur and what percentage of bookings are canceled in each of the two hotel chains.

The dataset can be found on Kaggle:
https://www.kaggle.com/jessemostipak/hotel-booking-demand 

This dataset allows us to address our problem and can help the hotel industry predict the approximate number of guests who cancel.

# Library and cleanup in the dataset

Tidyverse is a collection of essential R packages for data science. The packages under the tidyverse umbrella help us load, transform, and visualise the data. (https://www.analyticsvidhya.com/blog/2019/05/beginner-guide-tidyverse-most-powerful-collection-r-packages-data-science/) Tidyverse contains, among other things, a package called "dplyr". 
"dplyr" is a grammar of data manipulation. Loading it gives us access to its verbs, such as select, filter, summarise and more. 
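As a minimal sketch of what these verbs do, the example below applies select, filter and summarise to a small toy data frame. The column names are borrowed from the hotel data set, but the values are hypothetical and only serve as an illustration.

```r
library(dplyr)

# Hypothetical toy bookings; column names follow the hotel data set,
# the values are made up for illustration only.
bookings <- data.frame(
  hotel       = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  is_canceled = c(1, 0, 0, 1),
  lead_time   = c(30, 5, 120, 60)
)

# select(): keep only the columns we need
selected <- select(bookings, hotel, is_canceled)

# filter(): keep only the rows that match a condition
canceled <- filter(bookings, is_canceled == 1)

# summarise(): collapse each group to one row; the mean of a 0/1 column
# is the share of canceled bookings in that group
cancel_rate <- bookings %>%
  group_by(hotel) %>%
  summarise(cancel_share = mean(is_canceled))

print(cancel_rate)
```

Because is_canceled is coded as 0 or 1, its mean per hotel can be read directly as a cancellation share, which is the kind of figure our business question asks for.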

In [3]:
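# Load the tidyverse collection (ggplot2, tibble, tidyr, readr, purrr, stringr, forcats, dplyr)
library(tidyverse)
# dplyr is already attached by tidyverse; loading it again is harmless
library("dplyr")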

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ readr   1.3.1
✔ tibble  3.0.4     ✔ purrr   0.3.3
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ ggplot2 3.2.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


These two lines of code store the name of our CSV file in HD and then read that file into the data frame hData.

In [3]:
HD <- "hotel_bookings.csv"
hData <- read.csv(HD)
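The str() output further down reports text columns such as hotel as Factor. Whether read.csv() converts text to factors depends on its stringsAsFactors argument, whose default changed from TRUE to FALSE in R 4.0.0. The sketch below, which uses a tiny hypothetical CSV written to a temporary file, shows how setting the argument explicitly makes the result independent of the R version.

```r
# Write a tiny hypothetical CSV to a temporary file so the example runs on its own
tmp <- tempfile(fileext = ".csv")
writeLines(c("hotel,is_canceled", "Resort Hotel,0", "City Hotel,1"), tmp)

# stringsAsFactors = TRUE turns text columns into factors
as_factor <- read.csv(tmp, stringsAsFactors = TRUE)

# stringsAsFactors = FALSE keeps text columns as character vectors
as_text <- read.csv(tmp, stringsAsFactors = FALSE)

print(class(as_factor$hotel))
print(class(as_text$hotel))
```

If the rest of the analysis relies on factor columns, passing stringsAsFactors = TRUE to read.csv(HD, ...) would be one way to keep that behaviour on newer R versions.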

str() is an R function that shows the internal structure of our data set. We use it to see the number of rows/observations and columns/variables, along with the name and class of each column and the first few observations of each column in our hotel bookings data set. (https://r-lang.com/str-function-in-r/)

In [4]:
str(hData)

'data.frame':	119390 obs. of  32 variables:
 $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
 $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
 $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
 $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ arrival_date_month            : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
 $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
 $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : Factor w/ 5 levels "BB","

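In these lines of code we count the NA (missing) values in our dataset and remove the rows that contain them, as they can be an obstacle when we work with the data. The first table() call shows 4 missing values among all cells; after na.omit() none remain, and the cell count drops by 128, which corresponds to 4 rows of 32 variables.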

In [5]:
table(is.na(hData))
hotelData <- na.omit(hData)
table(is.na(hotelData))


  FALSE    TRUE 
3820476       4 


  FALSE 
3820352 
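table(is.na(...)) only gives the total number of missing cells. To see which columns the missing values come from, one option is colSums(is.na(...)). The sketch below uses a hypothetical toy data frame (only the column names are taken from the hotel data set) and also checks that na.omit() drops exactly the rows that contain an NA.

```r
# Hypothetical toy data; only the column names are taken from the hotel data set
toy <- data.frame(
  adults   = c(2, 2, 1, 2),
  children = c(0, NA, 0, NA),
  babies   = c(0, 0, 0, 0)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that contain at least one missing value
rows_with_na <- sum(!complete.cases(toy))

# na.omit() drops exactly those rows
cleaned <- na.omit(toy)
print(nrow(toy) - nrow(cleaned))
```

The same two calls could be applied to hData before deciding whether dropping rows is acceptable or whether a single column should be handled separately.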

Here we display the first six rows of the cleaned dataset with head(). This also shows the column headings, which gives us an overview of what the columns are called, so we can use the correct names to retrieve the information we are interested in later in the business case.

In [14]:
head(hotelData)

hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0,0,0,Check-Out,2015-07-01
Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0,0,0,Check-Out,2015-07-01
Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75,0,0,Check-Out,2015-07-02
Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75,0,0,Check-Out,2015-07-02
Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98,0,1,Check-Out,2015-07-03
Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98,0,1,Check-Out,2015-07-03
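The head() output abbreviates the middle columns with "...". If only the column headings are needed, names() returns all of them as a character vector. A minimal sketch on a hypothetical toy data frame that reuses three column names from the hotel data set:

```r
# Hypothetical toy data frame reusing three column names from the hotel data set
toy <- data.frame(
  hotel       = c("Resort Hotel", "City Hotel"),
  is_canceled = c(0, 1),
  lead_time   = c(342, 7)
)

# names() returns every column heading as a character vector
column_names <- names(toy)
print(column_names)

# Check that a column exists before using it in later steps
has_lead_time <- "lead_time" %in% column_names
print(has_lead_time)
```

Calling names(hotelData) in the same way would list all 32 headings without truncation.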
