# Hotel Reservations Dataset

Can you predict whether a customer is going to cancel their reservation?

## About Dataset

### Context

Online hotel reservation channels have dramatically changed booking possibilities and customers’ behavior. A significant number of hotel reservations are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels.

#### Can you predict if the customer is going to honor the reservation or cancel it?

##### About the file 

The file contains the different attributes of customers' reservation details. The detailed data dictionary is given below.

###### Data Dictionary

* **Booking_ID** :  unique identifier of each booking
* **no_of_adults** :  Number of adults
* **no_of_children** :  Number of Children
* **no_of_weekend_nights** :  Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
* **no_of_week_nights** :  Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
* **type_of_meal_plan** :  Type of meal plan booked by the customer:
* **required_car_parking_space** :  Does the customer require a car parking space? (0 - No, 1- Yes)
* **room_type_reserved** :  Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
* **lead_time** :  Number of days between the date of booking and the arrival date
* **arrival_year** :  Year of arrival date
* **arrival_month** :  Month of arrival date
* **arrival_date** :  Date of the month
* **market_segment_type** :  Market segment designation.
* **repeated_guest** :  Is the customer a repeated guest? (0 - No, 1- Yes)
* **no_of_previous_cancellations** :  Number of previous bookings that were canceled by the customer prior to the current booking
* **no_of_previous_bookings_not_canceled** :  Number of previous bookings not canceled by the customer prior to the current booking
* **avg_price_per_room** :  Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
* **no_of_special_requests** :  Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
* **booking_status** :  Flag indicating if the booking was canceled or not.

# Let's import the required modules

In [1]:
import pandas as pd # Data Processing 
import numpy as np # Array Processing
from sklearn.preprocessing import OneHotEncoder # Encoding of Categorical Data
from sklearn.preprocessing import StandardScaler # Scaling of Data 
from imblearn.over_sampling import RandomOverSampler # Sampling of Data 
from sklearn.linear_model import LogisticRegression # Logistic Regression
from sklearn.neighbors import KNeighborsClassifier # K Nearest Neighbors Classifier
from sklearn.naive_bayes import GaussianNB # Gaussian Naive Bayes
from sklearn.metrics import classification_report # Classification Report
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/hotel-reservations-classification-dataset/Hotel Reservations.csv


Let's load our data

In [2]:
data = pd.read_csv("/kaggle/input/hotel-reservations-classification-dataset/Hotel Reservations.csv")

It is a good practice to take a look at our dataset before processing it

In [3]:
data

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.00,0,Not_Canceled
1,INN00002,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,INN00003,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.00,0,Canceled
3,INN00004,2,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.00,0,Canceled
4,INN00005,2,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.50,0,Canceled
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36270,INN36271,3,0,2,6,Meal Plan 1,0,Room_Type 4,85,2018,8,3,Online,0,0,0,167.80,1,Not_Canceled
36271,INN36272,2,0,1,3,Meal Plan 1,0,Room_Type 1,228,2018,10,17,Online,0,0,0,90.95,2,Canceled
36272,INN36273,2,0,2,6,Meal Plan 1,0,Room_Type 1,148,2018,7,1,Online,0,0,0,98.39,2,Not_Canceled
36273,INN36274,2,0,0,3,Not Selected,0,Room_Type 1,63,2018,4,21,Online,0,0,0,94.50,0,Canceled


So the dataset has `36,275 rows` and `19 columns`, amounting to `689,225 values`. A dataset of this size should help us build a good model.
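The counts above can be read directly from a DataFrame: `shape` gives (rows, columns) and `size` gives their product. A minimal sketch, assuming a made-up three-row `toy` frame that stands in for the real table:

```python
import pandas as pd

# Toy stand-in for the reservations table (values are illustrative only)
toy = pd.DataFrame({
    "no_of_adults": [2, 2, 1],
    "lead_time": [224, 5, 1],
    "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled"],
})

rows, cols = toy.shape      # (number of rows, number of columns)
total_values = toy.size     # rows * cols

print(rows, cols, total_values)

# The same arithmetic for the full dataset described above
full_total = 36275 * 19
print(full_total)
```

On the real data, `data.shape` and `data.size` would give the same kind of summary.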

Booking_ID is unique for every row, so it carries no information that generalizes to new bookings and may hurt the model. So let's delete this column.

In [4]:
data.drop("Booking_ID" , axis = 1 , inplace = True)

In [5]:
data

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.00,0,Not_Canceled
1,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.00,0,Canceled
3,2,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.00,0,Canceled
4,2,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.50,0,Canceled
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36270,3,0,2,6,Meal Plan 1,0,Room_Type 4,85,2018,8,3,Online,0,0,0,167.80,1,Not_Canceled
36271,2,0,1,3,Meal Plan 1,0,Room_Type 1,228,2018,10,17,Online,0,0,0,90.95,2,Canceled
36272,2,0,2,6,Meal Plan 1,0,Room_Type 1,148,2018,7,1,Online,0,0,0,98.39,2,Not_Canceled
36273,2,0,0,3,Not Selected,0,Room_Type 1,63,2018,4,21,Online,0,0,0,94.50,0,Canceled


So now we have `36,275 rows` and `18 columns` with `652,950 values`. Computers work more easily with numerical values than with strings, so let's check the datatypes of these columns.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   no_of_adults                          36275 non-null  int64  
 1   no_of_children                        36275 non-null  int64  
 2   no_of_weekend_nights                  36275 non-null  int64  
 3   no_of_week_nights                     36275 non-null  int64  
 4   type_of_meal_plan                     36275 non-null  object 
 5   required_car_parking_space            36275 non-null  int64  
 6   room_type_reserved                    36275 non-null  object 
 7   lead_time                             36275 non-null  int64  
 8   arrival_year                          36275 non-null  int64  
 9   arrival_month                         36275 non-null  int64  
 10  arrival_date                          36275 non-null  int64  
 11  market_segment_

As we can see, there are 4 columns with the object datatype, which are categorical. We need to encode these values as numbers. For this we will use `OneHotEncoder` from `sklearn.preprocessing`. You can get an overview of one-hot encoding [here in English](https://youtu.be/agvfUvUNI4A) and more information is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

* `drop = "first"` drops the first category of each feature, so a feature with n categories is encoded in (n-1) columns
* `sparse = False` returns a NumPy array instead of a sparse matrix (in scikit-learn 1.2 and later this parameter is named `sparse_output`)

In [7]:
ohe = OneHotEncoder(drop = "first" , sparse = False)
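To make the effect of `drop = "first"` concrete, here is a small sketch on a made-up column (the `toy` frame and its values are assumptions for illustration). It leaves the sparse setting at its default and calls `.toarray()` instead, so it does not depend on which name the scikit-learn version uses for that parameter:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with three distinct categories
toy = pd.DataFrame({
    "meal": ["Meal Plan 1", "Not Selected", "Meal Plan 2", "Meal Plan 1"]
})

encoder = OneHotEncoder(drop="first")

# Default output is a sparse matrix; convert it to a dense NumPy array
encoded = encoder.fit_transform(toy[["meal"]]).toarray()

# Categories are sorted, so "Meal Plan 1" is the dropped (all-zero) baseline
print(encoder.categories_)
print(encoded)
```

Three categories produce two columns; a row of all zeros stands for the dropped first category.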

Now let's collect the names of the non-numeric (categorical) columns in one list for easier use

In [8]:
# Collect every column that is not numeric (here: the object columns)
cat = data.select_dtypes(exclude = "number").columns.tolist()

Now let's encode our data and name it `new_data`

In [9]:
new_data = ohe.fit_transform(data[cat])

The one-hot encoder only encodes the data; it does not combine the encoded data with the original dataset. We have to do that manually. So first let's drop the columns that were encoded and store the remaining columns separately, naming the result `half_data`.

In [10]:
half_data = data.drop(cat , axis = 1)

Now let's combine these datasets into one

In [11]:
df = np.hstack((half_data.values , new_data))

The `df` here is not a DataFrame; it is still a NumPy array, so we need to transform it into a DataFrame

In [12]:
df = pd.DataFrame(df)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,27
0,2.0,0.0,1.0,2.0,0.0,224.0,2017.0,10.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,2.0,0.0,2.0,3.0,0.0,5.0,2018.0,11.0,6.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
2,1.0,0.0,2.0,1.0,0.0,1.0,2018.0,2.0,28.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,2.0,0.0,0.0,2.0,0.0,211.0,2018.0,5.0,20.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,2.0,0.0,1.0,1.0,0.0,48.0,2018.0,4.0,11.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36270,3.0,0.0,2.0,6.0,0.0,85.0,2018.0,8.0,3.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
36271,2.0,0.0,1.0,3.0,0.0,228.0,2018.0,10.0,17.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
36272,2.0,0.0,2.0,6.0,0.0,148.0,2018.0,7.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
36273,2.0,0.0,0.0,3.0,0.0,63.0,2018.0,4.0,21.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


Now if we check the datatypes of the data, they will all be numerical

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 28 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       36275 non-null  float64
 1   1       36275 non-null  float64
 2   2       36275 non-null  float64
 3   3       36275 non-null  float64
 4   4       36275 non-null  float64
 5   5       36275 non-null  float64
 6   6       36275 non-null  float64
 7   7       36275 non-null  float64
 8   8       36275 non-null  float64
 9   9       36275 non-null  float64
 10  10      36275 non-null  float64
 11  11      36275 non-null  float64
 12  12      36275 non-null  float64
 13  13      36275 non-null  float64
 14  14      36275 non-null  float64
 15  15      36275 non-null  float64
 16  16      36275 non-null  float64
 17  17      36275 non-null  float64
 18  18      36275 non-null  float64
 19  19      36275 non-null  float64
 20  20      36275 non-null  float64
 21  21      36275 non-null  float64
 22

There can still be null values here. Let's check whether there are any

In [14]:
df.isnull().values.any()

False

As we can see, there aren't any null values, so we can move on. If you do find null values at this step, you can use the code below to fill them in. There are different approaches to this problem; filling with the column mean is a commonly used one.

```
for i in df.columns:
    # Fill missing values in each affected column with that column's mean
    if df[i].isnull().values.any():
        df[i] = df[i].fillna(df[i].mean())
```


Let's see how many 1s and 0s there are in the target column (column 27)

In [15]:
df[27].value_counts()

1.0    24390
0.0    11885
Name: 27, dtype: int64

As we can see, there are `24390` rows with value 1 (`Not_Canceled`) and `11885` rows with value 0 (`Canceled`), so the minority class is only about half the size of the majority class. This imbalance can hurt the model's performance on the minority class. We will use `RandomOverSampler` from `imblearn.over_sampling`. You can get more info about it [here](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html).
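The idea behind random oversampling is to duplicate randomly chosen minority-class rows until the classes are the same size. The sketch below illustrates that idea with NumPy only, on made-up arrays; it is not the implementation used by `imblearn`, and the names `X_res` and `y_res` are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up imbalanced data: four rows of class 1, two rows of class 0
y = np.array([1, 1, 1, 1, 0, 0])
X = np.arange(12).reshape(6, 2)

classes, counts = np.unique(y, return_counts=True)
target = counts.max()  # every class is grown to the size of the largest one

idx_parts = []
for cls, count in zip(classes, counts):
    cls_idx = np.flatnonzero(y == cls)
    # Draw the missing rows from this class, with replacement
    extra = rng.choice(cls_idx, size=target - count, replace=True)
    idx_parts.append(np.concatenate([cls_idx, extra]))

idx = np.concatenate(idx_parts)
X_res, y_res = X[idx], y[idx]

print(np.bincount(y_res))
```

Every original row is kept; only copies of minority rows are added.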

We can see that some features take values in the hundreds while others are just 0/1. This can bias our model toward the features with larger magnitudes, so we need to scale the data accordingly.
We will be using `StandardScaler` from `sklearn.preprocessing`. You can get more info about it [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
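`StandardScaler` subtracts each column's mean and divides by its standard deviation, so every feature ends up with mean 0 and unit variance. A minimal sketch on a made-up array (the values are assumptions for illustration), compared against the same computation done by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

sc = StandardScaler()
X_scaled = sc.fit_transform(X)  # the scaled data is the return value

# The same transformation by hand: (x - mean) / standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
print(np.allclose(X_scaled, manual))
```

After scaling, both columns hold identical values even though the raw magnitudes differed by a factor of 100.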

But let's first split our dataset into train and test sets

In [16]:
train , test = np.split(df.sample(frac = 1) , [int(0.8 * len(df))])
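The split above shuffles the rows with `sample(frac = 1)` and cuts them at the 80% mark. The same idea can be written with positional slicing; the sketch below uses a hypothetical ten-row frame and a fixed `random_state`, which is an addition here so that the shuffle is reproducible:

```python
import pandas as pd

# Hypothetical ten-row frame standing in for df
toy = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# Shuffle all rows, then cut at 80%
shuffled = toy.sample(frac=1, random_state=42)
cut = int(0.8 * len(shuffled))
train_toy = shuffled.iloc[:cut]
test_toy = shuffled.iloc[cut:]

print(len(train_toy), len(test_toy))
```

For the full dataset, `int(0.8 * 36275)` is 29020, which leaves 7255 rows for testing.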

In [17]:
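def scaler(dataframe , oversampling = False):
    # Separate the features from the target (column 27)
    X = dataframe.drop(27 , axis = 1)
    Y = dataframe[27]

    # Note: fit_transform returns the scaled array; it is not assigned here,
    # so X is returned unscaled
    sc = StandardScaler()
    sc.fit_transform(X)

    # Note: fit_resample returns the resampled data; it is not assigned here,
    # so X and Y are returned without oversampling
    ros = RandomOverSampler()
    if oversampling:
        ros.fit_resample(X , Y)

    return X , Y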
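Both `fit_transform` and `fit_resample` return new objects instead of modifying their inputs, so their results have to be assigned to take effect. Below is a minimal sketch of a variant that keeps them. The function name `scale_and_resample` and the `seed` parameter are hypothetical, and fitting the scaler on the training set and reusing it for the test set is a design choice made here, not something taken from the notebook. Metrics obtained this way would not necessarily match the ones reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_and_resample(train, test, target=27, oversample=True, seed=0):
    """Scale features and optionally oversample the training set only."""
    X_train, y_train = train.drop(target, axis=1), train[target]
    X_test, y_test = test.drop(target, axis=1), test[target]

    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)  # keep the returned scaled array
    X_test = sc.transform(X_test)        # reuse the training mean and std

    if oversample:
        # Imported here so the function also runs without imbalanced-learn
        # when oversampling is switched off
        from imblearn.over_sampling import RandomOverSampler
        ros = RandomOverSampler(random_state=seed)
        X_train, y_train = ros.fit_resample(X_train, y_train)

    return X_train, y_train, X_test, y_test


# Possible usage with the frames from the split above:
# X_train, Y_train, X_test, Y_test = scale_and_resample(train, test)
```

The test set is scaled but never resampled, so it keeps the original class distribution.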

In [18]:
X_train , Y_train = scaler(train , oversampling = True)

Oversampling is not applied to the test dataset, because it should keep the original class distribution on which the predictions are evaluated

In [19]:
X_test , Y_test = scaler(test)

Our data is good to go for modeling

We will first use the `KNeighborsClassifier`. **You can understand it [here in English](https://youtu.be/wKmEULDRszo) and more information is available [here](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)**

In [20]:
model = KNeighborsClassifier()
model.fit(X_train , Y_train)
model.predict(X_test)

array([0., 0., 1., ..., 0., 1., 1.])

The second will be `Logistic Regression`. You can understand it [here in English](https://youtu.be/yIYKR4sgzI8) and more information is available [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [21]:
model_1 = LogisticRegression()
model_1.fit(X_train , Y_train)
model_1.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


array([0., 0., 1., ..., 1., 1., 1.])
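The warning printed above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch that applies both through a scikit-learn pipeline, using synthetic data from `make_classification` rather than the reservations data (the names `X_demo`, `y_demo` and `clf`, and the value `max_iter=1000`, are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, not the hotel dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_demo[:, 0] *= 1000  # put one feature on a much larger scale

# Scaling happens inside the pipeline, so it is applied on fit and predict
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_demo, y_demo)
preds = clf.predict(X_demo)

print(preds[:10])
```

Putting the scaler in the pipeline means the statistics learned during `fit` are reused automatically whenever `predict` is called.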

The third and last will be `Gaussian Naive Bayes`. You can understand it [here in English](https://youtu.be/H3EjCKtlVog) and more information is available [here](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

In [22]:
model_2 = GaussianNB()
model_2.fit(X_train , Y_train)
model_2.predict(X_test)

array([0., 0., 0., ..., 0., 0., 0.])

Now let's get the classification report for each of these models.

In [23]:
print((classification_report(Y_test , model.predict(X_test))))

              precision    recall  f1-score   support

         0.0       0.73      0.65      0.69      2382
         1.0       0.84      0.88      0.86      4873

    accuracy                           0.81      7255
   macro avg       0.78      0.77      0.77      7255
weighted avg       0.80      0.81      0.80      7255



In [24]:
print((classification_report(Y_test , model_1.predict(X_test))))

              precision    recall  f1-score   support

         0.0       0.72      0.59      0.65      2382
         1.0       0.82      0.89      0.85      4873

    accuracy                           0.79      7255
   macro avg       0.77      0.74      0.75      7255
weighted avg       0.78      0.79      0.78      7255



In [25]:
print((classification_report(Y_test , model_2.predict(X_test))))

              precision    recall  f1-score   support

         0.0       0.36      0.97      0.52      2382
         1.0       0.90      0.16      0.27      4873

    accuracy                           0.42      7255
   macro avg       0.63      0.56      0.40      7255
weighted avg       0.73      0.42      0.35      7255
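The columns of these reports are related by simple formulas: the f1-score is the harmonic mean of precision and recall, the macro average is the plain mean over classes, and the weighted average weights each class by its support. A small worked sketch using the rounded numbers from the Gaussian Naive Bayes report; because the inputs are already rounded, the results only approximately reproduce the printed values:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall from the Gaussian Naive Bayes report
f1_class0 = f1(0.36, 0.97)   # report shows 0.52
f1_class1 = f1(0.90, 0.16)   # report shows 0.27

# Averages computed from the reported f1-scores and supports
macro_f1 = (0.52 + 0.27) / 2                        # report shows 0.40
weighted_f1 = (0.52 * 2382 + 0.27 * 4873) / 7255    # report shows 0.35

print(f1_class0, f1_class1, macro_f1, weighted_f1)
```

The low f1-score for class 1 comes from its recall of 0.16: the model labels most bookings as class 0.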

