# **Project - Classification and Hypothesis Testing: Hotel Booking Cancellation Prediction**<a
href="#Project---Classification-and-Hypothesis-Testing:-Hotel-Booking-Cancellation-Prediction"
class="anchor-link">¶</a>

## **Marks: 40**<a href="#Marks:-40" class="anchor-link">¶</a>

------------------------------------------------------------------------

## **Problem Statement**<a href="#Problem-Statement" class="anchor-link">¶</a>

### **Context**<a href="#Context" class="anchor-link">¶</a>

**A significant number of hotel bookings are called off due to
cancellations or no-shows.** Typical reasons for cancellations include
change of plans, scheduling conflicts, etc. This is often made easier by
the option to do so free of charge or preferably at a low cost. This may
be beneficial to hotel guests, but it is a less desirable and possibly
revenue-diminishing factor for hotels to deal with. Such losses are
particularly high on last-minute cancellations.

The new technologies involving online booking channels have dramatically
changed customers’ booking possibilities and behavior. This adds a
further dimension to the challenge of how hotels handle cancellations,
which are no longer limited to traditional booking and guest
characteristics.

This pattern of cancellations of bookings impacts a hotel on various
fronts:

1.  **Loss of resources (revenue)** when the hotel cannot resell the
    room.
2.  **Additional costs of distribution channels** by increasing
    commissions or paying for publicity to help sell these rooms.
3.  **Lowering prices last minute**, so the hotel can resell a room,
    resulting in reducing the profit margin.
4.  **Human resources to make arrangements** for the guests.

### **Objective**<a href="#Objective" class="anchor-link">¶</a>

This increasing number of cancellations calls for a Machine Learning
based solution that can help in predicting which booking is likely to be
canceled. INN Hotels Group has a chain of hotels in Portugal - they are
facing problems with this high number of booking cancellations and have
reached out to your firm for data-driven solutions. You, as a Data
Scientist, have to analyze the data provided to find which factors have
a high influence on booking cancellations, build a predictive model that
can predict which booking is going to be canceled in advance, and help
in formulating profitable policies for cancellations and refunds.

### **Data Description**<a href="#Data-Description" class="anchor-link">¶</a>

The data contains the different attributes of customers' booking
details. The detailed data dictionary is given below:

**Data Dictionary**

-   **Booking_ID:** Unique identifier of each booking
-   **no_of_adults:** Number of adults
-   **no_of_children:** Number of children
-   **no_of_weekend_nights:** Number of weekend nights (Saturday or
    Sunday) the guest stayed or booked to stay at the hotel
-   **no_of_week_nights:** Number of weekday nights (Monday to Friday)
    the guest stayed or booked to stay at the hotel
-   **type_of_meal_plan:** Type of meal plan booked by the customer:
    -   Not Selected – No meal plan selected
    -   Meal Plan 1 – Breakfast
    -   Meal Plan 2 – Half board (breakfast and one other meal)
    -   Meal Plan 3 – Full board (breakfast, lunch, and dinner)
-   **required_car_parking_space:** Does the customer require a car
    parking space? (0 - No, 1- Yes)
-   **room_type_reserved:** Type of room reserved by the customer. The
    values are ciphered (encoded) by INN Hotels.
-   **lead_time:** Number of days between the date of booking and the
    arrival date
-   **arrival_year:** Year of arrival date
-   **arrival_month:** Month of arrival date
-   **arrival_date:** Date of the month
-   **market_segment_type:** Market segment designation.
-   **repeated_guest:** Is the customer a repeated guest? (0 - No, 1-
    Yes)
-   **no_of_previous_cancellations:** Number of previous bookings that
    were canceled by the customer prior to the current booking
-   **no_of_previous_bookings_not_canceled:** Number of previous
    bookings not canceled by the customer prior to the current booking
-   **avg_price_per_room:** Average price per day of the reservation;
    prices of the rooms are dynamic. (in euros)
-   **no_of_special_requests:** Total number of special requests made by
    the customer (e.g. high floor, view from the room, etc)
-   **booking_status:** Flag indicating if the booking was canceled or
    not.

## **Importing the libraries required**<a href="#Importing-the-libraries-required" class="anchor-link">¶</a>

In \[1\]:

    # Importing the basic libraries we will require for the project

    # Libraries to help with reading and manipulating data
    import pandas as pd
    import numpy as np

    # Libaries to help with data visualization
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()

    # Importing the Machine Learning models we require from Scikit-Learn
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import tree
    from sklearn.ensemble import RandomForestClassifier

    # Importing the other functions we may require from Scikit-Learn
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder

    # To get diferent metric scores
    from sklearn.metrics import confusion_matrix,classification_report,roc_auc_score,plot_confusion_matrix,precision_recall_curve,roc_curve,make_scorer

    # Code to ignore warnings from function usage
    import warnings;
    import numpy as np
    warnings.filterwarnings('ignore')

## **Loading the dataset**<a href="#Loading-the-dataset" class="anchor-link">¶</a>

In \[2\]:

    hotel = pd.read_csv("INNHotelsGroup.csv")

In \[3\]:

    # Copying data to another variable to avoid any changes to original data
    data = hotel.copy()

## **Overview of the dataset**<a href="#Overview-of-the-dataset" class="anchor-link">¶</a>

### **View the first and last 5 rows of the dataset**<a href="#View-the-first-and-last-5-rows-of-the-dataset"
class="anchor-link">¶</a>

Let's **view the first few rows and last few rows** of the dataset in
order to understand its structure a little better.

We will use the head() and tail() methods from Pandas to do this.

In \[4\]:

    data.head()

Out\[4\]:

|     | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|-----|------------|--------------|----------------|----------------------|-------------------|-------------------|----------------------------|--------------------|-----------|--------------|---------------|--------------|---------------------|----------------|------------------------------|--------------------------------------|--------------------|------------------------|----------------|
| 0   | INN00001   | 2            | 0              | 1                    | 2                 | Meal Plan 1       | 0                          | Room_Type 1        | 224       | 2017         | 10            | 2            | Offline             | 0              | 0                            | 0                                    | 65.00              | 0                      | Not_Canceled   |
| 1   | INN00002   | 2            | 0              | 2                    | 3                 | Not Selected      | 0                          | Room_Type 1        | 5         | 2018         | 11            | 6            | Online              | 0              | 0                            | 0                                    | 106.68             | 1                      | Not_Canceled   |
| 2   | INN00003   | 1            | 0              | 2                    | 1                 | Meal Plan 1       | 0                          | Room_Type 1        | 1         | 2018         | 2             | 28           | Online              | 0              | 0                            | 0                                    | 60.00              | 0                      | Canceled       |
| 3   | INN00004   | 2            | 0              | 0                    | 2                 | Meal Plan 1       | 0                          | Room_Type 1        | 211       | 2018         | 5             | 20           | Online              | 0              | 0                            | 0                                    | 100.00             | 0                      | Canceled       |
| 4   | INN00005   | 2            | 0              | 1                    | 1                 | Not Selected      | 0                          | Room_Type 1        | 48        | 2018         | 4             | 11           | Online              | 0              | 0                            | 0                                    | 94.50              | 0                      | Canceled       |

In \[5\]:

    data.tail()

Out\[5\]:

|       | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|-------|------------|--------------|----------------|----------------------|-------------------|-------------------|----------------------------|--------------------|-----------|--------------|---------------|--------------|---------------------|----------------|------------------------------|--------------------------------------|--------------------|------------------------|----------------|
| 36270 | INN36271   | 3            | 0              | 2                    | 6                 | Meal Plan 1       | 0                          | Room_Type 4        | 85        | 2018         | 8             | 3            | Online              | 0              | 0                            | 0                                    | 167.80             | 1                      | Not_Canceled   |
| 36271 | INN36272   | 2            | 0              | 1                    | 3                 | Meal Plan 1       | 0                          | Room_Type 1        | 228       | 2018         | 10            | 17           | Online              | 0              | 0                            | 0                                    | 90.95              | 2                      | Canceled       |
| 36272 | INN36273   | 2            | 0              | 2                    | 6                 | Meal Plan 1       | 0                          | Room_Type 1        | 148       | 2018         | 7             | 1            | Online              | 0              | 0                            | 0                                    | 98.39              | 2                      | Not_Canceled   |
| 36273 | INN36274   | 2            | 0              | 0                    | 3                 | Not Selected      | 0                          | Room_Type 1        | 63        | 2018         | 4             | 21           | Online              | 0              | 0                            | 0                                    | 94.50              | 0                      | Canceled       |
| 36274 | INN36275   | 2            | 0              | 1                    | 2                 | Meal Plan 1       | 0                          | Room_Type 1        | 207       | 2018         | 12            | 30           | Offline             | 0              | 0                            | 0                                    | 161.67             | 0                      | Not_Canceled   |

### **Understand the shape of the dataset**<a href="#Understand-the-shape-of-the-dataset" class="anchor-link">¶</a>

In \[6\]:

    data.shape

Out\[6\]:

    (36275, 19)

-   The dataset has 36275 rows and 19 columns.

### **Check the data types of the columns for the dataset**<a href="#Check-the-data-types-of-the-columns-for-the-dataset"
class="anchor-link">¶</a>

In \[7\]:

    data.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 36275 entries, 0 to 36274
    Data columns (total 19 columns):
     #   Column                                Non-Null Count  Dtype  
    ---  ------                                --------------  -----  
     0   Booking_ID                            36275 non-null  object 
     1   no_of_adults                          36275 non-null  int64  
     2   no_of_children                        36275 non-null  int64  
     3   no_of_weekend_nights                  36275 non-null  int64  
     4   no_of_week_nights                     36275 non-null  int64  
     5   type_of_meal_plan                     36275 non-null  object 
     6   required_car_parking_space            36275 non-null  int64  
     7   room_type_reserved                    36275 non-null  object 
     8   lead_time                             36275 non-null  int64  
     9   arrival_year                          36275 non-null  int64  
     10  arrival_month                         36275 non-null  int64  
     11  arrival_date                          36275 non-null  int64  
     12  market_segment_type                   36275 non-null  object 
     13  repeated_guest                        36275 non-null  int64  
     14  no_of_previous_cancellations          36275 non-null  int64  
     15  no_of_previous_bookings_not_canceled  36275 non-null  int64  
     16  avg_price_per_room                    36275 non-null  float64
     17  no_of_special_requests                36275 non-null  int64  
     18  booking_status                        36275 non-null  object 
    dtypes: float64(1), int64(13), object(5)
    memory usage: 5.3+ MB

-   `Booking_ID`, `type_of_meal_plan`, `room_type_reserved`,
    `market_segment_type`, and `booking_status` are of object type while
    rest columns are numeric in nature.

-   There are no null values in the dataset.

### **Dropping duplicate values**<a href="#Dropping-duplicate-values" class="anchor-link">¶</a>

In \[8\]:

    # checking for duplicate values
    data.duplicated().sum()

Out\[8\]:

    0

-   There are **no duplicate values** in the data.

### **Dropping the unique values column**<a href="#Dropping-the-unique-values-column" class="anchor-link">¶</a>

**Let's drop the Booking_ID column first before we proceed forward**, as
a column with unique values will have almost no predictive power for the
Machine Learning problem at hand.

In \[9\]:

    data = data.drop(["Booking_ID"], axis=1)

In \[10\]:

    data.head()

Out\[10\]:

|     | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|-----|--------------|----------------|----------------------|-------------------|-------------------|----------------------------|--------------------|-----------|--------------|---------------|--------------|---------------------|----------------|------------------------------|--------------------------------------|--------------------|------------------------|----------------|
| 0   | 2            | 0              | 1                    | 2                 | Meal Plan 1       | 0                          | Room_Type 1        | 224       | 2017         | 10            | 2            | Offline             | 0              | 0                            | 0                                    | 65.00              | 0                      | Not_Canceled   |
| 1   | 2            | 0              | 2                    | 3                 | Not Selected      | 0                          | Room_Type 1        | 5         | 2018         | 11            | 6            | Online              | 0              | 0                            | 0                                    | 106.68             | 1                      | Not_Canceled   |
| 2   | 1            | 0              | 2                    | 1                 | Meal Plan 1       | 0                          | Room_Type 1        | 1         | 2018         | 2             | 28           | Online              | 0              | 0                            | 0                                    | 60.00              | 0                      | Canceled       |
| 3   | 2            | 0              | 0                    | 2                 | Meal Plan 1       | 0                          | Room_Type 1        | 211       | 2018         | 5             | 20           | Online              | 0              | 0                            | 0                                    | 100.00             | 0                      | Canceled       |
| 4   | 2            | 0              | 1                    | 1                 | Not Selected      | 0                          | Room_Type 1        | 48        | 2018         | 4             | 11           | Online              | 0              | 0                            | 0                                    | 94.50              | 0                      | Canceled       |

### **Question 1: Check the summary statistics of the dataset and write your observations (2 Marks)**<a
href="#Question-1:-Check-the-summary-statistics-of-the-dataset-and-write-your-observations-(2-Marks)"
class="anchor-link">¶</a>

**Let's check the statistical summary of the data.**

In \[11\]:

    # Remove _________ and complete the code
    data.describe() 

Out\[11\]:

|       | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time    | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests |
|-------|--------------|----------------|----------------------|-------------------|----------------------------|--------------|--------------|---------------|--------------|----------------|------------------------------|--------------------------------------|--------------------|------------------------|
| count | 36275.000000 | 36275.000000   | 36275.000000         | 36275.000000      | 36275.000000               | 36275.000000 | 36275.000000 | 36275.000000  | 36275.000000 | 36275.000000   | 36275.000000                 | 36275.000000                         | 36275.000000       | 36275.000000           |
| mean  | 1.844962     | 0.105279       | 0.810724             | 2.204300          | 0.030986                   | 85.232557    | 2017.820427  | 7.423653      | 15.596995    | 0.025637       | 0.023349                     | 0.153411                             | 103.423539         | 0.619655               |
| std   | 0.518715     | 0.402648       | 0.870644             | 1.410905          | 0.173281                   | 85.930817    | 0.383836     | 3.069894      | 8.740447     | 0.158053       | 0.368331                     | 1.754171                             | 35.089424          | 0.786236               |
| min   | 0.000000     | 0.000000       | 0.000000             | 0.000000          | 0.000000                   | 0.000000     | 2017.000000  | 1.000000      | 1.000000     | 0.000000       | 0.000000                     | 0.000000                             | 0.000000           | 0.000000               |
| 25%   | 2.000000     | 0.000000       | 0.000000             | 1.000000          | 0.000000                   | 17.000000    | 2018.000000  | 5.000000      | 8.000000     | 0.000000       | 0.000000                     | 0.000000                             | 80.300000          | 0.000000               |
| 50%   | 2.000000     | 0.000000       | 1.000000             | 2.000000          | 0.000000                   | 57.000000    | 2018.000000  | 8.000000      | 16.000000    | 0.000000       | 0.000000                     | 0.000000                             | 99.450000          | 0.000000               |
| 75%   | 2.000000     | 0.000000       | 2.000000             | 3.000000          | 0.000000                   | 126.000000   | 2018.000000  | 10.000000     | 23.000000    | 0.000000       | 0.000000                     | 0.000000                             | 120.000000         | 1.000000               |
| max   | 4.000000     | 10.000000      | 7.000000             | 17.000000         | 1.000000                   | 443.000000   | 2018.000000  | 12.000000     | 31.000000    | 1.000000       | 13.000000                    | 58.000000                            | 540.000000         | 5.000000               |

**Write your answers here: data.describe()**

## **Exploratory Data Analysis**<a href="#Exploratory-Data-Analysis" class="anchor-link">¶</a>

### **Question 2: Univariate Analysis**<a href="#Question-2:-Univariate-Analysis" class="anchor-link">¶</a>

Let's explore these variables in some more depth by observing their
distributions.

We will first define a **hist_box() function** that provides both a
boxplot and a histogram in the same visual, with which we can perform
univariate analysis on the columns of this dataset.

In \[12\]:

    # Defining the hist_box() function
    def hist_box(data,col):
      f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)}, figsize=(12,6))
      # Adding a graph in each part
      sns.boxplot(data[col], ax=ax_box, showmeans=True)
      sns.distplot(data[col], ax=ax_hist)
      plt.show()

#### **Question 2.1: Plot the histogram and box plot for the variable `Lead Time` using the hist_box function provided and write your insights. (1 Mark)**<a
href="#Question-2.1:--Plot-the-histogram-and-box-plot-for-the-variable-Lead-Time-using-the-hist_box-function-provided-and-write-your-insights.-(1-Mark)"
class="anchor-link">¶</a>

In \[13\]:

    # Remove _________ and complete the code
    hist_box(data, 'lead_time')

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/4c75dff02e7a063fc53de94a14cfd7b065d04be2.png)

**Write your answers here: hist_box(data, 'lead_time')**

#### **Question 2.2: Plot the histogram and box plot for the variable `Average Price per Room` using the hist_box function provided and write your insights. (1 Mark)**<a
href="#Question-2.2:--Plot-the-histogram-and-box-plot-for-the-variable-Average-Price-per-Room-using-the-hist_box-function-provided-and-write-your-insights.-(1-Mark)"
class="anchor-link">¶</a>

In \[14\]:

    # Remove _________ and complete the code
    hist_box(data, 'avg_price_per_room')

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/44e4d4f563f72e9948550f0894a4f6252d10ba31.png)

**Write your answers here:hist_box(data, 'avg_price_per_room')**

**Interestingly some rooms have a price equal to 0. Let's check them.**

In \[15\]:

    data[data["avg_price_per_room"] == 0]

Out\[15\]:

|       | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|-------|--------------|----------------|----------------------|-------------------|-------------------|----------------------------|--------------------|-----------|--------------|---------------|--------------|---------------------|----------------|------------------------------|--------------------------------------|--------------------|------------------------|----------------|
| 63    | 1            | 0              | 0                    | 1                 | Meal Plan 1       | 0                          | Room_Type 1        | 2         | 2017         | 9             | 10           | Complementary       | 0              | 0                            | 0                                    | 0.0                | 1                      | Not_Canceled   |
| 145   | 1            | 0              | 0                    | 2                 | Meal Plan 1       | 0                          | Room_Type 1        | 13        | 2018         | 6             | 1            | Complementary       | 1              | 3                            | 5                                    | 0.0                | 1                      | Not_Canceled   |
| 209   | 1            | 0              | 0                    | 0                 | Meal Plan 1       | 0                          | Room_Type 1        | 4         | 2018         | 2             | 27           | Complementary       | 0              | 0                            | 0                                    | 0.0                | 1                      | Not_Canceled   |
| 266   | 1            | 0              | 0                    | 2                 | Meal Plan 1       | 0                          | Room_Type 1        | 1         | 2017         | 8             | 12           | Complementary       | 1              | 0                            | 1                                    | 0.0                | 1                      | Not_Canceled   |
| 267   | 1            | 0              | 2                    | 1                 | Meal Plan 1       | 0                          | Room_Type 1        | 4         | 2017         | 8             | 23           | Complementary       | 0              | 0                            | 0                                    | 0.0                | 1                      | Not_Canceled   |
| ...   | ...          | ...            | ...                  | ...               | ...               | ...                        | ...                | ...       | ...          | ...           | ...          | ...                 | ...            | ...                          | ...                                  | ...                | ...                    | ...            |
| 35983 | 1            | 0              | 0                    | 1                 | Meal Plan 1       | 0                          | Room_Type 7        | 0         | 2018         | 6             | 7            | Complementary       | 1              | 4                            | 17                                   | 0.0                | 1                      | Not_Canceled   |
| 36080 | 1            | 0              | 1                    | 1                 | Meal Plan 1       | 0                          | Room_Type 7        | 0         | 2018         | 3             | 21           | Complementary       | 1              | 3                            | 15                                   | 0.0                | 1                      | Not_Canceled   |
| 36114 | 1            | 0              | 0                    | 1                 | Meal Plan 1       | 0                          | Room_Type 1        | 1         | 2018         | 3             | 2            | Online              | 0              | 0                            | 0                                    | 0.0                | 0                      | Not_Canceled   |
| 36217 | 2            | 0              | 2                    | 1                 | Meal Plan 1       | 0                          | Room_Type 2        | 3         | 2017         | 8             | 9            | Online              | 0              | 0                            | 0                                    | 0.0                | 2                      | Not_Canceled   |
| 36250 | 1            | 0              | 0                    | 2                 | Meal Plan 2       | 0                          | Room_Type 1        | 6         | 2017         | 12            | 10           | Online              | 0              | 0                            | 0                                    | 0.0                | 0                      | Not_Canceled   |

545 rows × 18 columns

-   There are quite a few hotel rooms which have a price equal to 0.
-   In the market segment column, it looks like many values are
    complementary.

In \[16\]:

    data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()

Out\[16\]:

    Complementary    354
    Online           191
    Name: market_segment_type, dtype: int64

-   It makes sense that most values with room prices equal to 0 are the
    rooms given as complimentary service from the hotel.
-   The rooms booked online must be a part of some promotional campaign
    done by the hotel.

In \[17\]:

    # Calculating the 25th quantile
    Q1 = data["avg_price_per_room"].quantile(0.25)

    # Calculating the 75th quantile
    Q3 = data["avg_price_per_room"].quantile(0.75)

    # Calculating IQR
    IQR = Q3 - Q1

    # Calculating value of upper whisker
    Upper_Whisker = Q3 + 1.5 * IQR
    Upper_Whisker

Out\[17\]:

    179.55

In \[18\]:

    # assigning the outliers the value of upper whisker
    data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker

#### **Let's understand the distribution of the categorical variables**<a
href="#Let&#39;s-understand-the-distribution-of-the-categorical-variables"
class="anchor-link">¶</a>

**Number of Children**

In \[19\]:

    sns.countplot(data['no_of_children'])
    plt.show()

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/a39f4a81479b4df9c1f794c675f007804777a612.png)

In \[20\]:

    data['no_of_children'].value_counts(normalize=True)

Out\[20\]:

    0     0.925624
    1     0.044604
    2     0.029166
    3     0.000524
    9     0.000055
    10    0.000028
    Name: no_of_children, dtype: float64

-   Customers were not travelling with children in 93% of cases.
-   There are some values in the data where the number of children is 9
    or 10, which is highly unlikely.
-   We will replace these values with the maximum value of 3 children.

In \[21\]:

    # replacing 9, and 10 children with 3
    data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)

**Arrival Month**

In \[22\]:

    sns.countplot(data["arrival_month"])
    plt.show()

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/dd474a10f241db15f9bbd3e8cfc893524f144097.png)

In \[23\]:

    data['arrival_month'].value_counts(normalize=True)

Out\[23\]:

    10    0.146575
    9     0.127112
    8     0.105114
    6     0.088298
    12    0.083280
    11    0.082150
    7     0.080496
    4     0.075424
    5     0.071620
    3     0.065003
    2     0.046975
    1     0.027953
    Name: arrival_month, dtype: float64

-   October is the busiest month for hotel arrivals followed by
    September and August. **Over 35% of all bookings**, as we see in the
    above table, were for one of these three months.
-   Around 14.7% of the bookings were made for an October arrival.

**Booking Status**

In \[24\]:

    sns.countplot(data["booking_status"])
    plt.show()

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/be1635e70e1c31a71604d7dc193edc034a90812f.png)

In \[25\]:

    data['booking_status'].value_counts(normalize=True)

Out\[25\]:

    Not_Canceled    0.672364
    Canceled        0.327636
    Name: booking_status, dtype: float64

-   32.8% of the bookings were canceled by the customers.

**Let's encode Canceled bookings to 1 and Not_Canceled as 0 for further
analysis**

In \[26\]:

    data["booking_status"] = data["booking_status"].apply(
        lambda x: 1 if x == "Canceled" else 0
    )

### **Question 3: Bivariate Analysis**<a href="#Question-3:-Bivariate-Analysis" class="anchor-link">¶</a>

#### **Question 3.1: Find and visualize the correlation matrix using a heatmap and write your observations from the plot. (2 Marks)**<a
href="#Question-3.1:-Find-and-visualize-the-correlation-matrix-using-a-heatmap-and-write-your-observations-from-the-plot.-(2-Marks)"
class="anchor-link">¶</a>

In \[27\]:

    # Remove _________ and complete the code
    cols_list = data.select_dtypes(include=np.number).columns.tolist()
    plt.figure(figsize=(12, 7))
    sns.heatmap(data[cols_list].corr(), annot=True, cmap='coolwarm')
    plt.show()

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/841a0025b4fd4f005ddf72d02be1903a2d413873.png)

**Write your answers here: sns.heatmap(data\[cols_list\].corr(),
annot=True, cmap='coolwarm')**

**Hotel rates are dynamic and change according to demand and customer
demographics. Let's see how prices vary across different market
segments**

In \[28\]:

    plt.figure(figsize=(10, 6))
    sns.boxplot(
        data=data, x="market_segment_type", y="avg_price_per_room", palette="gist_rainbow"
    )
    plt.show()

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/5e7ea98f165e531f37acf80a0e2e6502f520d25f.png)

-   Rooms booked online have high variations in prices.
-   The offline and corporate room prices are almost similar.
-   Complementary market segment gets the rooms at very low prices,
    which makes sense.

We will define a **stacked barplot()** function to help analyse how the
target variable varies across predictor categories.

In \[29\]:

    # Defining the stacked_barplot() function
    def stacked_barplot(data,predictor,target,figsize=(10,6)):
      (pd.crosstab(data[predictor],data[target],normalize='index')*100).plot(kind='bar',figsize=figsize,stacked=True)
      plt.legend(loc="lower right")
      plt.ylabel('Percentage Cancellations %')

#### **Question 3.2: Plot the stacked barplot for the variable `Market Segment Type` against the target variable `Booking Status` using the stacked_barplot function provided and write your insights. (1 Mark)**<a
href="#Question-3.2:-Plot-the-stacked-barplot-for-the-variable-Market-Segment-Type-against-the-target-variable-Booking-Status-using-the-stacked_barplot--function-provided-and-write-your-insights.-(1-Mark)"
class="anchor-link">¶</a>

In \[30\]:

    # Remove _________ and complete the code
    stacked_barplot(data, 'market_segment_type', 'booking_status')

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/13bec0f3488669d7771fe60d8cacc9a5dda225fa.png)

**Write your answers here: stacked_barplot(data, 'market_segment_type',
'booking_status')**

#### **Question 3.3: Plot the stacked barplot for the variable `Repeated Guest` against the target variable `Booking Status` using the stacked_barplot function provided and write your insights. (1 Mark)**<a
href="#Question-3.3:-Plot-the-stacked-barplot-for-the-variable-Repeated-Guest-against-the-target-variable-Booking-Status-using-the-stacked_barplot--function-provided-and-write-your-insights.-(1-Mark)"
class="anchor-link">¶</a>

Repeating guests are the guests who stay in the hotel often and are
important to brand equity.

In \[31\]:

    # Remove _________ and complete the code
    stacked_barplot(data, 'repeated_guest', 'booking_status')

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/813e97f8b5177311c8a3d92c781c6f7690b28010.png)

**Write your answers here: stacked_barplot(data, 'repeated_guest',
'booking_status')**

**Let's analyze the customer who stayed for at least a day at the
hotel.**

In \[32\]:

    stay_data = data[(data["no_of_week_nights"] > 0) & (data["no_of_weekend_nights"] > 0)]
    stay_data["total_days"] = (stay_data["no_of_week_nights"] + stay_data["no_of_weekend_nights"])

    stacked_barplot(stay_data, "total_days", "booking_status",figsize=(15,6))

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/47b8ecd79020d5895165221638b7c5a21d1992aa.png)

-   The general trend is that the chances of cancellation increase as
    the number of days the customer planned to stay at the hotel
    increases.

**As hotel room prices are dynamic, Let's see how the prices vary across
different months**

In \[33\]:

    plt.figure(figsize=(10, 5))
    sns.lineplot(y=data["avg_price_per_room"], x=data["arrival_month"], ci=None)
    plt.show()

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/909e7dea637f665a23c791ffd3beaac3bae94f7f.png)

-   The price of rooms is highest in May to September - around 115 euros
    per room.

## **Data Preparation for Modeling**<a href="#Data-Preparation-for-Modeling" class="anchor-link">¶</a>

-   We want to predict which bookings will be canceled.
-   Before we proceed to build a model, we'll have to encode categorical
    features.
-   We'll split the data into train and test to be able to evaluate the
    model that we build on the train data.

**Separating the independent variables (X) and the dependent variable
(Y)**

In \[34\]:

    X = data.drop(["booking_status"], axis=1)
    Y = data["booking_status"]

    X = pd.get_dummies(X, drop_first=True) # Encoding the Categorical features

**Splitting the data into a 70% train and 30% test set**

Some classification problems can exhibit a large imbalance in the
distribution of the target classes: for instance there could be several
times more negative samples than positive samples. In such cases it is
recommended to use the **stratified sampling** technique to ensure that
relative class frequencies are approximately preserved in each train and
validation fold.

In \[35\]:

    # Splitting data in train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30,stratify=Y, random_state=1)

In \[36\]:

    print("Shape of Training set : ", X_train.shape)
    print("Shape of test set : ", X_test.shape)
    print("Percentage of classes in training set:")
    print(y_train.value_counts(normalize=True))
    print("Percentage of classes in test set:")
    print(y_test.value_counts(normalize=True))

    Shape of Training set :  (25392, 27)
    Shape of test set :  (10883, 27)
    Percentage of classes in training set:
    0    0.672377
    1    0.327623
    Name: booking_status, dtype: float64
    Percentage of classes in test set:
    0    0.672333
    1    0.327667
    Name: booking_status, dtype: float64

## **Model Evaluation Criterion**<a href="#Model-Evaluation-Criterion" class="anchor-link">¶</a>

#### **Model can make wrong predictions as:**<a href="#Model-can-make-wrong-predictions-as:"
class="anchor-link">¶</a>

1.  Predicting a customer will not cancel their booking but in reality,
    the customer will cancel their booking.
2.  Predicting a customer will cancel their booking but in reality, the
    customer will not cancel their booking.

#### **Which case is more important?**<a href="#Which-case-is-more-important?" class="anchor-link">¶</a>

Both the cases are important as:

-   If we predict that a booking will not be canceled and the booking
    gets canceled then the hotel will lose resources and will have to
    bear additional costs of distribution channels.

-   If we predict that a booking will get canceled and the booking
    doesn't get canceled the hotel might not be able to provide
    satisfactory services to the customer by assuming that this booking
    will be canceled. This might damage brand equity.

#### **How to reduce the losses?**<a href="#How-to-reduce-the-losses?" class="anchor-link">¶</a>

-   The hotel would want the `F1 Score` to be maximized, the greater the
    F1 score, the higher the chances of minimizing False Negatives and
    False Positives.

**Also, let's create a function to calculate and print the
classification report and confusion matrix so that we don't have to
rewrite the same code repeatedly for each model.**

In \[37\]:

    # Creating metric function 
    def metrics_score(actual, predicted):
        print(classification_report(actual, predicted))

        cm = confusion_matrix(actual, predicted)
        plt.figure(figsize=(8,5))
        
        sns.heatmap(cm, annot=True,  fmt='.2f', xticklabels=['Not Cancelled', 'Cancelled'], yticklabels=['Not Cancelled', 'Cancelled'])
        plt.ylabel('Actual')
        plt.xlabel('Predicted')
        plt.show()

## **Building the model**<a href="#Building-the-model" class="anchor-link">¶</a>

We will be building 4 different models:

-   **Logistic Regression**
-   **Support Vector Machine (SVM)**
-   **Decision Tree**
-   **Random Forest**

### **Question 4: Logistic Regression (6 Marks)**<a href="#Question-4:-Logistic-Regression-(6-Marks)"
class="anchor-link">¶</a>

#### **Question 4.1: Build a Logistic Regression model (Use the sklearn library) (1 Mark)**<a
href="#Question-4.1:-Build-a-Logistic-Regression-model-(Use-the-sklearn-library)-(1-Mark)"
class="anchor-link">¶</a>

In \[38\]:

    # Remove _________ and complete the code

    # Fitting logistic regression model
    lg = LogisticRegression()
    lg.fit(X, Y)

Out\[38\]:

    LogisticRegression()

#### **Question 4.2: Check the performance of the model on train and test data (2 Marks)**<a
href="#Question-4.2:-Check-the-performance-of-the-model-on-train-and-test-data-(2-Marks)"
class="anchor-link">¶</a>

In \[39\]:

    # Remove _________ and complete the code

    # Checking the performance on the training data
    y_pred = lg.predict(X)
    metrics_score(Y, y_pred)

                  precision    recall  f1-score   support

               0       0.82      0.89      0.86     24390
               1       0.73      0.60      0.66     11885

        accuracy                           0.80     36275
       macro avg       0.78      0.75      0.76     36275
    weighted avg       0.79      0.80      0.79     36275

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/c4289e7f320fd15fff7bde0909cbbbf9af1d747b.png)

**Write your Answer here:**

#### Fitting logistic regression model<a href="#Fitting-logistic-regression-model" class="anchor-link">¶</a>

lg = LogisticRegression() lg.fit(X, Y)

#### Checking the performance on the training data<a href="#Checking-the-performance-on-the-training-data"
class="anchor-link">¶</a>

y_pred = lg.predict(X) metrics_score(y, y_pred)

Let's check the performance on the test set

In \[40\]:

    # Remove _________ and complete the code

    # Checking the performance on the test dataset
    y_pred_test = lg.predict(X_test)
    metrics_score(y_test, y_pred_test)

                  precision    recall  f1-score   support

               0       0.82      0.89      0.85      7317
               1       0.73      0.60      0.65      3566

        accuracy                           0.79     10883
       macro avg       0.77      0.74      0.75     10883
    weighted avg       0.79      0.79      0.79     10883

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/1e242425fa700fcb49130211d5d0c134fcfd1e10.png)

**Write your Answer here:**

y_pred_test = lg.predict(X_test)

metrics_score(y_test, y_pred_test)

#### **Question 4.3: Find the optimal threshold for the model using the Precision-Recall Curve. (1 Mark)**<a
href="#Question-4.3:-Find-the-optimal-threshold-for-the-model-using-the-Precision-Recall-Curve.-(1-Mark)"
class="anchor-link">¶</a>

Precision-Recall curves summarize the trade-off between the true
positive rate and the positive predictive value for a predictive model
using different probability thresholds.

Let's use the Precision-Recall curve and see if we can find a **better
threshold.**

In \[41\]:

    # Remove _________ and complete the code

    # Predict_proba gives the probability of each observation belonging to each class
    y_scores_lg = lg.predict_proba(X_test)

    precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y_test, y_scores_lg[:, 1])

    # Plot values of precisions, recalls, and thresholds
    plt.figure(figsize=(10,7))
    plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label='precision')
    plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
    plt.xlabel('Threshold')
    plt.legend(loc='upper left')
    plt.ylim([0,1])
    plt.show()

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/6f4a15427d8f3b0b8db077fccb2dc6395272d3f2.png)

**Write your answers here:**\_****

y_scores_lg = lg.predict_proba(X_test)

precisions_lg, recalls_lg, thresholds_lg =
precision_recall_curve(y_test, y_scores_lg\[:, 1\])

In \[42\]:

    # Calculating F1 score for different probability thresholds
    f1_scores_lg = 2 * (precisions_lg * recalls_lg) / (precisions_lg + recalls_lg)

    # Setting the optimal threshold
    optimal_threshold = thresholds_lg[np.argmax(f1_scores_lg)]

#### **Question 4.4: Check the performance of the model on train and test data using the optimal threshold. (2 Marks)**<a
href="#Question-4.4:-Check-the-performance-of-the-model-on-train-and-test-data-using-the-optimal-threshold.-(2-Marks)"
class="anchor-link">¶</a>

In \[43\]:

    # Remove _________ and complete the code

    # Creating confusion matrix
    y_pred_train = (lg.predict_proba(X_train)[:,1] >= optimal_threshold).astype(bool)
    metrics_score(y_train, y_pred_train)

                  precision    recall  f1-score   support

               0       0.88      0.78      0.83     17073
               1       0.63      0.78      0.70      8319

        accuracy                           0.78     25392
       macro avg       0.76      0.78      0.76     25392
    weighted avg       0.80      0.78      0.78     25392

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/8772fece2b9958aa388934bf405dec524019cfdc.png)

**Write your answers here:**

y_pred_train = (lg.predict_proba(X_train)\[:,1\] \>=
optimal_threshold).astype(bool)

metrics_score(y_train, y_pred_train)

Let's check the performance on the test set

In \[44\]:

    # Remove _________ and complete the code
    y_pred_test = (lg.predict_proba(X_test)[:,1] >= optimal_threshold).astype(bool)

    metrics_score(y_test, y_pred_test)

                  precision    recall  f1-score   support

               0       0.87      0.78      0.82      7317
               1       0.62      0.77      0.69      3566

        accuracy                           0.77     10883
       macro avg       0.75      0.77      0.75     10883
    weighted avg       0.79      0.77      0.78     10883

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/5a426d460154395ad0de43eb7f6f742247e5a3be.png)

**Write your answers here:**\_****

y_pred_test = (lg.predict_proba(X_test)\[:,1\] \>=
optimal_threshold).astype(bool)

metrics_score(y_test, y_pred_test)

### **Question 5: Support Vector Machines (11 Marks)**<a href="#Question-5:-Support-Vector-Machines-(11-Marks)"
class="anchor-link">¶</a>

To accelerate SVM training, let's scale the data for support vector
machines.

Let's build the models using the two of the widely used kernel
functions:

1.  **Linear Kernel**
2.  **RBF Kernel**

In \[45\]:

    scaling = MinMaxScaler(feature_range=(-1,1)).fit(X_train)
    X_train_scaled = scaling.transform(X_train)
    X_test_scaled = scaling.transform(X_test)

#### **Question 5.1: Build a Support Vector Machine model using a linear kernel (1 Mark)**<a
href="#Question-5.1:-Build-a-Support-Vector-Machine-model-using-a-linear-kernel-(1-Mark)"
class="anchor-link">¶</a>

**Note: Please use the scaled data for modeling Support Vector Machine**

In \[46\]:

    # Remove _________ and complete the code

    svm = SVC(kernel='linear', probability=True) # Linear kernal or linear decision boundary
    model = svm.fit(X_train_scaled, y_train)

#### **Question 5.2: Check the performance of the model on train and test data (2 Marks)**<a
href="#Question-5.2:-Check-the-performance-of-the-model-on-train-and-test-data-(2-Marks)"
class="anchor-link">¶</a>

In \[47\]:

    # Remove _________ and complete the code

    y_pred_train_svm = model.predict(X_train_scaled)
    metrics_score(y_train, y_pred_train_svm)

                  precision    recall  f1-score   support

               0       0.83      0.90      0.86     17073
               1       0.74      0.61      0.67      8319

        accuracy                           0.80     25392
       macro avg       0.79      0.76      0.77     25392
    weighted avg       0.80      0.80      0.80     25392

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/3472955f0f130478888da5f4f9d30bb866e157e2.png)

**Write your answers here:**\_****

svm = SVC(kernel='linear', probability=True) \# Linear kernal or linear
decision boundary

model = svm.fit(X_train_scaled, y_train)

y_pred_train_svm = model.predict(X_train_scaled)

metrics_score(y_train, y_pred_train_svm)

Checking model performance on test set

In \[48\]:

    # Remove _________ and complete the code

    y_pred_test_svm = model.predict(X_test_scaled)
    metrics_score(y_test, y_pred_test_svm)

                  precision    recall  f1-score   support

               0       0.82      0.90      0.86      7317
               1       0.74      0.61      0.67      3566

        accuracy                           0.80     10883
       macro avg       0.78      0.75      0.76     10883
    weighted avg       0.80      0.80      0.80     10883

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/6bb24af7fad313f65b3975e80f38e2dffd6844b5.png)

**Write your answers here:**\_****

y_pred_test_svm = model.predict(X_test_scaled)

metrics_score(y_test, y_pred_test_svm)

#### **Question 5.3: Find the optimal threshold for the model using the Precision-Recall Curve. (1 Mark)**<a
href="#Question-5.3:-Find-the-optimal-threshold-for-the-model-using-the-Precision-Recall-Curve.-(1-Mark)"
class="anchor-link">¶</a>

In \[49\]:

    # Remove _________ and complete the code

    # Predict on train data
    y_scores_svm=model.predict_proba(X_train_scaled)

    precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:, 1])

    # Plot values of precisions, recalls, and thresholds
    plt.figure(figsize=(10,7))
    plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
    plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
    plt.xlabel('Threshold')
    plt.legend(loc='upper left')
    plt.ylim([0,1])
    plt.show()

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/2310276dc51cb938cc56ead0c877eee3d323605a.png)

**Write your answers here:**\_****

y_scores_svm=model.predict_proba(X_train_scaled)

precisions_svm, recalls_svm, thresholds_svm =
precision_recall_curve(y_train, y_scores_svm\[:, 1\])

In \[54\]:

    optimal_threshold_svm = thresholds_svm[np.argmax(f1_scores_lg)]

#### **Question 5.4: Check the performance of the model on train and test data using the optimal threshold. (2 Marks)**<a
href="#Question-5.4:-Check-the-performance-of-the-model-on-train-and-test-data-using-the-optimal-threshold.-(2-Marks)"
class="anchor-link">¶</a>

In \[56\]:

    # Remove _________ and complete the code

    y_pred_train_svm = (model.predict_proba(X_test_scaled)[:,1] >= optimal_threshold_svm).astype(int)

    metrics_score(y_test, y_pred_train_svm)

                  precision    recall  f1-score   support

               0       0.95      0.41      0.58      7317
               1       0.44      0.96      0.60      3566

        accuracy                           0.59     10883
       macro avg       0.70      0.68      0.59     10883
    weighted avg       0.78      0.59      0.59     10883

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/557e3fb0a0d30a5353c6503398239fa521b4d4d5.png)

**Write your answers here:**\_****

y_pred_train_svm = (model.predict_proba(X_test_scaled)\[:,1\] \>=
optimal_threshold_svm).astype(int)

metrics_score(y_train, y_pred_train_svm)

In \[57\]:

    # Remove _________ and complete the code

    y_pred_test_svm = (model.predict_proba(X_test_scaled)[:,1] >= optimal_threshold_svm).astype(int)

    metrics_score(y_test, y_pred_test_svm)

                  precision    recall  f1-score   support

               0       0.95      0.41      0.58      7317
               1       0.44      0.96      0.60      3566

        accuracy                           0.59     10883
       macro avg       0.70      0.68      0.59     10883
    weighted avg       0.78      0.59      0.59     10883

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/557e3fb0a0d30a5353c6503398239fa521b4d4d5.png)

**Write your answers here:**\_****

y_pred_test_svm = (model.predict_proba(X_test_scaled)\[:,1\] \>=
optimal_threshold_svm).astype(int)

metrics_score(y_test, y_pred_test_svm)

#### **Question 5.5: Build a Support Vector Machines model using an RBF kernel (1 Mark)**<a
href="#Question-5.5:-Build-a-Support-Vector-Machines-model-using-an-RBF-kernel-(1-Mark)"
class="anchor-link">¶</a>

In \[58\]:

    # Remove _________ and complete the code

    svm_rbf = SVC(kernel='rbf', probability=True)
    svm_rbf.fit(X_train_scaled, y_train)

Out\[58\]:

    SVC(probability=True)

#### **Question 5.6: Check the performance of the model on train and test data (2 Marks)**<a
href="#Question-5.6:-Check-the-performance-of-the-model-on-train-and-test-data-(2-Marks)"
class="anchor-link">¶</a>

In \[59\]:

    # Remove _________ and complete the code

    y_pred_train_svm = svm_rbf.predict(X_train_scaled)
    metrics_score(y_train, y_pred_train_svm)

                  precision    recall  f1-score   support

               0       0.84      0.91      0.88     17073
               1       0.79      0.65      0.71      8319

        accuracy                           0.83     25392
       macro avg       0.81      0.78      0.80     25392
    weighted avg       0.82      0.83      0.82     25392

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/67c2f4cacacc428b28a950ecebad38f654b4e360.png)

**Write your answers here:**\_****

svm_rbf = SVC(kernel='rbf', probability=True)
svm_rbf.fit(X_train_scaled, y_train)

y_pred_train_svm = svm_rbf.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)

#### Checking model performance on test set<a href="#Checking-model-performance-on-test-set"
class="anchor-link">¶</a>

In \[60\]:

    # Remove _________ and complete the code

    y_pred_test = svm_rbf.predict(X_test_scaled)

    metrics_score(y_test, y_pred_test_svm)

                  precision    recall  f1-score   support

               0       0.95      0.41      0.58      7317
               1       0.44      0.96      0.60      3566

        accuracy                           0.59     10883
       macro avg       0.70      0.68      0.59     10883
    weighted avg       0.78      0.59      0.59     10883

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/557e3fb0a0d30a5353c6503398239fa521b4d4d5.png)

**Write your answers here:**\_****

y_pred_test = svm_rbf.predict(X_test_scaled)

metrics_score(y_test, y_pred_test_svm)

In \[61\]:

    # Predict on train data
    y_scores_svm=svm_rbf.predict_proba(X_train_scaled)

    precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:,1])

    # Plot values of precisions, recalls, and thresholds
    plt.figure(figsize=(10,7))
    plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
    plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
    plt.xlabel('Threshold')
    plt.legend(loc='upper left')
    plt.ylim([0,1])
    plt.show()

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/db52a2ee306ec11013c291eec2d30739f5506c2c.png)

In \[63\]:

    optimal_threshold_svm=thresholds_svm[np.argmax(f1_scores_lg)]

#### **Question 5.7: Check the performance of the model on train and test data using the optimal threshold. (2 Marks)**<a
href="#Question-5.7:-Check-the-performance-of-the-model-on-train-and-test-data-using-the-optimal-threshold.-(2-Marks)"
class="anchor-link">¶</a>

In \[64\]:

    # Remove _________ and complete the code

    y_pred_train_svm = model.predict_proba(X_train_scaled)[:, 1] >= optimal_threshold_svm
    metrics_score(y_train, y_pred_train_svm)

                  precision    recall  f1-score   support

               0       0.96      0.32      0.48     17073
               1       0.41      0.98      0.58      8319

        accuracy                           0.54     25392
       macro avg       0.69      0.65      0.53     25392
    weighted avg       0.78      0.54      0.51     25392

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/a3f518c6d65c6e4f0daedb5ee54103e53876d599.png)

**Write your answers here:**\_****

y_pred_train_svm = model.predict_proba(X_train_scaled)\[:, 1\] \>=
optimal_threshold_svm

metrics_score(y_train, y_pred_train_svm)

In \[65\]:

    # Remove _________ and complete the code

    y_pred_test = svm_rbf.predict_proba(X_test_scaled)[:, 1] > optimal_threshold_svm
    metrics_score(y_test, y_pred_test)

                  precision    recall  f1-score   support

               0       0.97      0.38      0.54      7317
               1       0.43      0.98      0.60      3566

        accuracy                           0.57     10883
       macro avg       0.70      0.68      0.57     10883
    weighted avg       0.80      0.57      0.56     10883

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/299b8e5ec9ae7a92775c344db945f229f76baa47.png)

**Write your answers here:**\_****

y_pred_test = svm_rbf.predict_proba(X_test_scaled)\[:, 1\] \>
optimal_threshold_svm

metrics_score(y_test, y_pred_test)

### **Question 6: Decision Trees (7 Marks)**<a href="#Question-6:-Decision-Trees-(7-Marks)"
class="anchor-link">¶</a>

#### **Question 6.1: Build a Decision Tree Model (1 Mark)**<a href="#Question-6.1:-Build-a-Decision-Tree-Model-(1-Mark)"
class="anchor-link">¶</a>

In \[66\]:

    # Remove _________ and complete the code

    model_dt = DecisionTreeClassifier()
    model_dt.fit(X_train, y_train)

Out\[66\]:

    DecisionTreeClassifier()

#### **Question 6.2: Check the performance of the model on train and test data (2 Marks)**<a
href="#Question-6.2:-Check-the-performance-of-the-model-on-train-and-test-data-(2-Marks)"
class="anchor-link">¶</a>

In \[67\]:

    # Remove _________ and complete the code

    # Checking performance on the training dataset
    pred_train_dt = model_dt.predict(X_train)
    metrics_score(y_train, pred_train_dt)

                  precision    recall  f1-score   support

               0       0.99      1.00      1.00     17073
               1       1.00      0.99      0.99      8319

        accuracy                           0.99     25392
       macro avg       1.00      0.99      0.99     25392
    weighted avg       0.99      0.99      0.99     25392

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/5600f5276ec3c0ccbd364a6facffe3cfdb068393.png)

**Write your answers here:**\_****

model_dt = DecisionTreeClassifier()

model_dt.fit(X_train, y_train)

pred_train_dt = model_dt.predict(X_train)

metrics_score(y_train, pred_train_dt)

#### Checking model performance on test set<a href="#Checking-model-performance-on-test-set"
class="anchor-link">¶</a>

In \[68\]:

    pred_test_dt = model_dt.predict(X_test)
    metrics_score(y_test, pred_test_dt)

                  precision    recall  f1-score   support

               0       0.90      0.90      0.90      7317
               1       0.79      0.79      0.79      3566

        accuracy                           0.86     10883
       macro avg       0.85      0.85      0.85     10883
    weighted avg       0.86      0.86      0.86     10883

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/21d62fe404d302364185fd42046a4e0bd7d6c74a.png)

**Write your answers here:**\_****

pred_test_dt = model_dt.predict(X_test)

metrics_score(y_test, pred_test_dt)

#### **Question 6.3: Perform hyperparameter tuning for the decision tree model using GridSearch CV (1 Mark)**<a
href="#Question-6.3:-Perform-hyperparameter-tuning-for-the-decision-tree-model-using-GridSearch-CV-(1-Mark)"
class="anchor-link">¶</a>

**Note: Please use the following hyperparameters provided for tuning the
Decision Tree. In general, you can experiment with various
hyperparameters to tune the decision tree, but for this project, we
recommend sticking to the parameters provided.**

In \[69\]:

    # Remove _________ and complete the code

    # Choose the type of classifier.
    estimator = DecisionTreeClassifier(random_state=42)

    # Grid of parameters to choose from
    parameters = {
        "max_depth": np.arange(2, 7, 2),
        "max_leaf_nodes": [50, 75, 150, 250],
        "min_samples_split": [10, 30, 50, 70],
    }


    # Run the grid search
    grid_obj = GridSearchCV(estimator, parameters)
    grid_obj = grid_obj.fit(X_train, y_train)

    # Set the clf to the best combination of parameters
    estimator = grid_obj.best_estimator_

    # Fit the best algorithm to the data.
    estimator.fit(X_train, y_train)

Out\[69\]:

    DecisionTreeClassifier(max_depth=6, max_leaf_nodes=50, min_samples_split=10,
                           random_state=42)

#### **Question 6.4: Check the performance of the model on the train and test data using the tuned model (2 Mark)**<a
href="#Question-6.4:-Check-the-performance-of-the-model-on-the-train-and-test-data-using-the-tuned-model-(2-Mark)"
class="anchor-link">¶</a>

#### Checking performance on the training set<a href="#Checking-performance-on-the-training-set"
class="anchor-link">¶</a>

In \[70\]:

    # Remove _________ and complete the code

    # Checking performance on the training dataset
    dt_tuned = estimator.predict(X_train)
    metrics_score(y_train, dt_tuned)

                  precision    recall  f1-score   support

               0       0.86      0.93      0.89     17073
               1       0.82      0.68      0.75      8319

        accuracy                           0.85     25392
       macro avg       0.84      0.81      0.82     25392
    weighted avg       0.85      0.85      0.84     25392

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/ae2c52ee55a5d2cf15d54481b313d3365a31bdf7.png)

**Write your answers here:**\_****

dt_tuned = estimator.predict(X_train)

metrics_score(y_train, dt_tuned)

In \[71\]:

    # Remove _________ and complete the code

    # Checking performance on the training dataset
    y_pred_tuned = estimator.predict(X_test)
    metrics_score(y_test, y_pred_tuned)

                  precision    recall  f1-score   support

               0       0.85      0.93      0.89      7317
               1       0.82      0.67      0.74      3566

        accuracy                           0.84     10883
       macro avg       0.84      0.80      0.81     10883
    weighted avg       0.84      0.84      0.84     10883

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/e502013975947a868d325a09e2e1114b847a8c11.png)

**Write your answers here:**\_****

y_pred_tuned = estimator.predict(X_test)

metrics_score(y_test, y_pred_tuned

#### **Visualizing the Decision Tree**<a href="#Visualizing-the-Decision-Tree" class="anchor-link">¶</a>

In \[72\]:

    feature_names = list(X_train.columns)
    plt.figure(figsize=(20, 10))
    out = tree.plot_tree(
        estimator,max_depth=3,
        feature_names=feature_names,
        filled=True,
        fontsize=9,
        node_ids=False,
        class_names=None,
    )
    # below code will add arrows to the decision tree split if they are missing
    for o in out:
        arrow = o.arrow_patch
        if arrow is not None:
            arrow.set_edgecolor("black")
            arrow.set_linewidth(1)
    plt.show()

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/57833643f18fdc461e5f2235b84d74d925cf5b14.png)

#### **Question 6.5: What are some important features based on the tuned decision tree? (1 Mark)**<a
href="#Question-6.5:-What-are-some-important-features-based-on-the-tuned-decision-tree?-(1-Mark)"
class="anchor-link">¶</a>

In \[73\]:

    # Remove _________ and complete the code

    # Importance of features in the tree building

    importances = estimator.feature_importances_
    indices = np.argsort(importances)

    plt.figure(figsize=(8, 8))
    plt.title("Feature Importances")
    plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
    plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
    plt.xlabel("Relative Importance")
    plt.show()

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/ffc203877bc813f07403039a246249fbd721561b.png)

**Write your answers here:**\_****

importances = estimator.feature*importances*

indices = np.argsort(importances)

------------------------------------------------------------------------

### **Question 7: Random Forest (4 Marks)**<a href="#Question-7:-Random-Forest-(4-Marks)" class="anchor-link">¶</a>

#### **Question 7.1: Build a Random Forest Model (1 Mark)**<a href="#Question-7.1:-Build-a-Random-Forest-Model-(1-Mark)"
class="anchor-link">¶</a>

In \[74\]:

    # Remove _________ and complete the code

    rf_estimator = RandomForestClassifier(random_state=42)

    rf_estimator.fit(X_train_scaled, y_train)

Out\[74\]:

    RandomForestClassifier(random_state=42)

#### **Question 7.2: Check the performance of the model on the train and test data (2 Marks)**<a
href="#Question-7.2:-Check-the-performance-of-the-model-on-the-train-and-test-data-(2-Marks)"
class="anchor-link">¶</a>

In \[75\]:

    # Remove _________ and complete the code

    y_pred_train_rf = rf_estimator.predict(X_train_scaled)

    metrics_score(y_train, y_pred_train_rf)

                  precision    recall  f1-score   support

               0       0.99      1.00      1.00     17073
               1       1.00      0.99      0.99      8319

        accuracy                           0.99     25392
       macro avg       0.99      0.99      0.99     25392
    weighted avg       0.99      0.99      0.99     25392

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/732c288b504c4165dc29c26e7c8511810267d3e3.png)

**Write your answers here:**\_****

rf_estimator = RandomForestClassifier(random_state=42)

rf_estimator.fit(X_train_scaled, y_train)

y_pred_train_rf = rf_estimator.predict(X_train_scaled)

metrics_score(y_train, y_pred_train_rf)

In \[76\]:

    # Remove _________ and complete the code

    y_pred_test_rf = rf_estimator.predict(X_test)

    metrics_score(y_test, y_pred_test_rf)

                  precision    recall  f1-score   support

               0       0.78      0.67      0.72      7317
               1       0.47      0.61      0.53      3566

        accuracy                           0.65     10883
       macro avg       0.63      0.64      0.63     10883
    weighted avg       0.68      0.65      0.66     10883

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/b5288d2e3c071aba6bb9ce1c0fb2ecc9463aeb13.png)

**Write your answers here:**\_****

y_pred_test_rf = rf_estimator.predict(X_test)

metrics_score(y_test, y_pred_test_rf)

#### **Question 7.3: What are some important features based on the Random Forest? (1 Mark)**<a
href="#Question-7.3:-What-are-some-important-features-based-on-the-Random-Forest?-(1-Mark)"
class="anchor-link">¶</a>

Let's check the feature importance of the Random Forest

In \[77\]:

    # Remove _________ and complete the code

    importances = rf_estimator.feature_importances_

    columns = X.columns

    importance_df = pd.DataFrame({'Feature':columns, 'Importance':importances})

    plt.figure(figsize = (13, 13))

    sns.barplot(importance_df.Importance, importance_df.index)

Out\[77\]:

    <AxesSubplot:xlabel='Importance'>

![](attachment:vertopal_3cee1bb49e1949f5981e28078e7c0fd2/ed66368a82006a13ee98b4959c0a361bd843760b.png)

**Write your answers here:**\_****

importances = rf_estimator.feature*importances*

importance_df = pd.DataFrame({'Feature':columns,
'Importance':importances})

### **Question 8: Conclude ANY FOUR key takeaways for business recommendations (4 Marks)**<a
href="#Question-8:-Conclude-ANY-FOUR-key-takeaways-for-business-recommendations-(4-Marks)"
class="anchor-link">¶</a>

**Write your answers here:**\_****

1.  Reduce lead time: Lead time is the most important feature for
    predicting hotel cancellations. Hence, the hotel should focus on
    optimizing the booking experience to reduce lead times for
    customers.

<!-- -->

1.  Enhance online presence: The market segment type "Online" is the
    second most important feature for predicting cancellations. Hence,
    the hotel should focus on enhancing its online presence and
    marketing strategies to attract and retain customers.

<!-- -->

1.  Provide personalized services: The number of special requests made
    by customers is an important feature for predicting cancellations.
    This highlights the importance of providing personalized services
    and amenities to customers to improve their experience and loyalty.

<!-- -->

1.  Optimize pricing strategies: The average price per room is an
    important feature for predicting cancellations. Hence, the hotel
    should consider adjusting its pricing strategies to remain
    competitive and appeal to price-sensitive customers.

Overall, the hotel should focus on improving the booking experience,
enhancing its online presence and marketing strategies, providing
personalized services and amenities, and optimizing pricing strategies
to attract and retain customers and reduce cancellations.

## **Happy Learning!**<a href="#Happy-Learning!" class="anchor-link">¶</a>

In \[ \]: